Elfen Scheduling: Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT)
Xi Yang


1 Elfen Scheduling: Fine-Grain Principled Borrowing from Latency-Critical Workloads using Simultaneous Multithreading (SMT). Xi Yang (Australian National University), Stephen M. Blackburn (Australian National University), Kathryn S. McKinley (Microsoft Research)

2 Tail Latency Matters
Two-second slowdown reduced revenue/user by 4.3%. [Eric Schurman, Bing]
400 millisecond delay decreased searches/user by 0.59%. [Jake Brutlag, Google]
"Such WSCs tend to have relatively low average utilization, spending most of its time in the 10-50% CPU utilization range." Luiz André Barroso, Urs Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines

3 High Responsiveness, Low Utilization
[Figure: 50th/95th/99th percentile latency of Lucene running alone vs. utilization (normalized to one core, no SMT) on an Intel Xeon D-1540 Broadwell; the latency-critical (LC) service meets its service level objective (SLO) only at low utilization, and SMT raises the utilization ceiling relative to no SMT.]
"Such WSCs tend to have relatively low average utilization, spending most of its time in the 10-50% CPU utilization range." Luiz André Barroso, Urs Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines

4 Soak up Slack with Batch?
Keeping LC responsive requires idle cores, because OS descheduling of batch threads is slow.
[Diagram: LC and batch timelines for co-running on different cores with SMT turned off, leaving a core idle with SMT turned off, and co-running on the same core in SMT lanes, all sharing a cache.]

5 SMT Co-Runner
[Figure: 99%ile latency vs. utilization (normalized to no SMT) on 1 core with 2 SMT lanes, for Lucene alone and for Lucene with two SMT co-runners of different IPC; the co-runners deliver great utilization, but both push Lucene past its SLO.]
High-IPC co-runner: while(1);
Low-IPC co-runner: while(1) { movnti(); mfence(); }
Even the low-IPC co-runner violates the SLO at low load!
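For concreteness, a minimal sketch of a low-IPC antagonist loop in the spirit of the one on the slide, written with SSE2 intrinsics; the exact loop body and compiler flags used by the authors are an assumption.

    /* Low-IPC SMT antagonist: a non-temporal store followed by a full
     * memory fence keeps this lane's IPC very low while still occupying
     * shared pipeline resources.  Compile with: gcc -O2 -msse2 antagonist.c */
    #include <emmintrin.h>   /* _mm_stream_si32, _mm_mfence */

    static volatile int sink;    /* target of the streaming store */

    int main(void) {
        for (;;) {
            _mm_stream_si32((int *)&sink, 1);   /* movnti */
            _mm_mfence();                       /* mfence */
        }
        return 0;
    }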

6 Simultaneous Multithreading OFF
[Diagram: with SMT off, a single lane (lucene) owns the issue logic, load/store queue, and functional units over time.]

7 Simultaneous Multithreading ON
[Diagram: with SMT on, lane 1 (lucene) and lane 2 (a low-IPC co-runner) contend over time for the issue logic (round-robin shared), the load/store queue (statically partitioned), and the functional units (dynamically shared).]
Active SMT lanes share critical resources.

8 Principled Borrowing
[Timeline: LC alternates between busy and idle; batch runs only during the LC lane's idle periods.]
Batch borrows hardware when LC is idle.
Batch releases hardware when LC is busy.
Can we implement principled borrowing on current hardware?

9 Hardware is Ready, Software is Not
[Timeline: LC and batch lanes over time.]
The batch lane calls mwait; the thread sleeps, releasing hardware to the OS (~2K cycles), and the OS then schedules the batch lane with any ready job.
The OS supports thread sleeping, but not hardware sleeping that releases the SMT hardware to the other lane.

10 nanonap() Semantics
A thread invoking nanonap releases the SMT hardware without releasing its SMT context.
The OS cannot schedule other work onto the napping hardware context.
The OS can interrupt and wake up the thread.

11 nanonap()
per_cpu_variable: nap_flag;

void nanonap() {
    enter_kernel();
    disable_preemption();
    my_nap_flag = this_cpu_flag(nap_flag);
    monitor(my_nap_flag);
    mwait();
    enable_preemption();
    leave_kernel();
}

[Timeline: the batch lane calls nanonap and stays napping while LC is busy.]
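A minimal user-space sketch of how a batch runtime might invoke this primitive, assuming nanonap is exposed as a custom syscall; the syscall number NR_nanonap is hypothetical, and the paper's actual kernel interface may differ.

    /* Hypothetical user-space wrapper for the nanonap primitive. */
    #include <unistd.h>
    #include <sys/syscall.h>

    #define NR_nanonap 451          /* illustrative syscall number, not a real one */

    static inline void nanonap(void) {
        /* Blocks in mwait inside the kernel until the partner lane's idle
         * task (or an interrupt) touches this lane's nap_flag. */
        syscall(NR_nanonap);
    }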

12 Elfen Scheduler
No change to latency-critical threads.
Instrument batch workloads to detect LC threads and nap.
Bind latency-critical threads to the LC lane; bind batch threads to the batch lane (see the affinity sketch below).
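One way to realize the lane binding on Linux is CPU affinity. A sketch, assuming for illustration that logical CPUs 0 and 8 are the two SMT lanes of the same physical core; in practice the sibling lane IDs come from /sys/devices/system/cpu/cpuN/topology/thread_siblings_list.

    /* Pin one thread per SMT lane of a shared physical core. */
    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    static int pin_to_lane(pthread_t t, int cpu) {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(cpu, &set);
        return pthread_setaffinity_np(t, sizeof(set), &set);
    }

    /* usage: pin_to_lane(lc_thread, 0); pin_to_lane(batch_thread, 8); */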

13 Elfen Scheduler
(1) Batch thread borrows resources and continuously checks the LC lane status.
(2) LC starts; batch calls nanonap() to release SMT hardware resources.
(3) OS touches nap_flag to wake up the batch thread.

/* fast path check injected into method body */
check: if (!request_lane_idle) slow_path();

slow_path() { nanonap(); }

/* maps lane IDs to the running task */
exposed SHIM signal: cpu_task_map
task_switch(task T) { cpu_task_map[thiscpu] = T; }

idle_task() {
    // wake up any waiting batch thread
    update_nap_flag_of_partner_lane();
    ...
}

A sketch of the injected check follows.
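A simplified C sketch of the fast-path check the instrumentation injects, assuming the SHIM-exposed cpu_task_map is mapped read-only into the batch process as a per-lane array; cpu_task_map, partner_lane, and IDLE_TASK are illustrative names, not the paper's exact interface.

    /* Injected at method bodies and loop back-edges by the JIT/compiler. */
    extern const volatile long *cpu_task_map;   /* lane id -> running task, exposed by the kernel */
    extern int partner_lane;                    /* this batch thread's sibling SMT lane */
    #define IDLE_TASK 0

    void nanonap(void);                         /* kernel-provided nap primitive */

    static inline void elfen_check(void) {
        if (cpu_task_map[partner_lane] != IDLE_TASK)   /* fast path: one load + compare */
            nanonap();                                 /* slow path: release the lane */
    }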

14 Results: Borrow Idle
[Figure: 99%ile latency and utilization (normalized to no SMT) for Lucene alone and for Lucene co-running with each DaCapo batch workload (antlr, bloat, eclipse, fop, hsqldb, jython, luindex, lusearch, pmd, xalan), on 1 core with 2 SMT lanes and on 7 cores with 2x7 SMT lanes; the borrow-idle policy keeps the same tail latency while increasing utilization.]

15 Beyond Mutual Exclusion
Borrow idle policy: mutual exclusion; the batch lane calls nanonap whenever LC runs.
Concurrent co-running with a budget: fixed budget, refresh budget, and dynamic budget policies (sketched below).
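To make the budget idea concrete, a simplified sketch of what a budgeted check might look like, reusing the illustrative names from the earlier sketch; the cycle accounting and budget size here are assumptions, not the paper's exact policies.

    /* Budgeted co-running: instead of napping as soon as LC runs, charge
     * co-run cycles against a budget and nap only once it is spent. */
    #include <stdint.h>
    #include <x86intrin.h>                      /* __rdtsc */

    extern const volatile long *cpu_task_map;
    extern int partner_lane;
    #define IDLE_TASK 0

    void nanonap(void);

    static uint64_t budget_cycles = 200000;     /* illustrative per-interval budget */
    static uint64_t spent, last_tsc;

    static inline void elfen_budget_check(void) {
        uint64_t now = __rdtsc();
        if (cpu_task_map[partner_lane] == IDLE_TASK) {
            spent = 0;                          /* LC idle: borrow freely, refresh the budget */
        } else {
            spent += now - last_tsc;            /* LC busy: charge the co-run time */
            if (spent >= budget_cycles)
                nanonap();                      /* budget exhausted: release the lane */
        }
        last_tsc = now;
    }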

16 [Figure: 99%ile latency vs. utilization (normalized to no SMT) for three configurations: borrow idle on 1 core, borrow idle on 7 cores, and dynamic budget on 7 cores; the dynamic budget policy further increases utilization.]

17 Conclusion
Principled borrowing for SMT: nanonap() releases SMT hardware without giving it to the OS.
Elfen scheduler implementation and policies: 9% to 25% increase in utilization with the same tail latency!
Thank you!

18 BACKUP SLIDES

19 High Responsiveness, Low Utilization
[Figure: 5th/95th/99th percentile latency of Lucene alone vs. utilization (normalized to one core, no SMT) on an Intel Xeon D-1540 Broadwell; the LC service meets its SLO only at low utilization.]
"Such WSCs tend to have relatively low average utilization, spending most of its time in the 10-50% CPU utilization range." Luiz André Barroso, Urs Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines

20 Results: Borrow Idle
[Figure: 99%ile latency and utilization (normalized to no SMT) for Lucene alone and with each DaCapo workload (antlr, bloat, eclipse, fop, hsqldb, jython, luindex, lusearch, pmd, xalan), on 1 core with 2 SMT lanes and on 7 cores with 2x7 SMT lanes; borrow idle increases utilization by 9% to 37% in one configuration and by at least 8% in the other, with the same tail latency.]

21 [Figure: 99%ile latency vs. utilization (normalized to no SMT) for borrow idle on 1 core, borrow idle on 7 cores, and dynamic budget on 7 cores, across Lucene alone and the DaCapo workloads; the dynamic budget policy further increases utilization.]

22 Result

23 Result

24 BACKUP SLIDES

25 Tail Latency Matters
Two-second slowdown reduced revenue/user by 4.3%. [Eric Schurman, Bing]
400 millisecond delay decreased searches/user by 0.59%. [Jake Brutlag, Google]
"Such WSCs tend to have relatively low average utilization, spending most of its time in the 10-50% CPU utilization range." p. 75, Luiz André Barroso, Urs Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines

26 High Responsiveness, Low Utilization
[Figure: 5th/95th/99th percentile LC latency vs. utilization (normalized to no SMT), reaching 67% utilization without SMT; Lucene search benchmark on one core of an Intel Xeon D-1540 Broadwell.]
"Such WSCs tend to have relatively low average utilization, spending most of its time in the 10-50% CPU utilization range." p. 75, Luiz André Barroso, Urs Hölzle, The Datacenter as a Computer: An Introduction to the Design of Warehouse-Scale Machines

27 Soak up Slack with Batch?
[Diagram: LC and batch timelines for co-running on different cores with SMT turned off, leaving a core idle with SMT turned off, and co-running on the same core in SMT lanes, all sharing a cache.]

28 SMT Co-Runner
[Figure: 99%ile latency vs. utilization (normalized to no SMT) for Lucene alone and for Lucene with two SMT co-runners of different IPC.]
High-IPC co-runner: while(1);
Low-IPC co-runner: while(1) { movnti(); mfence(); }
Even the low-IPC co-runner violates the SLO at low load!

29 Even a Low-IPC Co-Runner is Destructive
[Diagram: the LC lane (lucene) and the batch lane (low-IPC co-runner) contend over time for the issue logic (round-robin shared), the load/store queue (statically partitioned), and the functional units (dynamically shared).]
Critical resources are shared by active SMT lanes.

30 Principled Borrowing
[Timeline: LC alternates between busy and idle; batch runs only during the LC lane's idle periods.]
The batch lane borrows hardware resources when the LC lane is idle.
The batch lane releases hardware resources when the LC lane is busy.
Can we implement principled borrowing on current hardware?

31 Hardware is Ready, Software is Not
[Timeline: LC and batch lanes over time.]
The batch lane calls mwait; the thread sleeps, releasing hardware to the OS (~2K cycles), and the OS schedules the batch lane with any ready job.
What we need instead: when the LC lane is busy, the batch lane must do nothing (nap).

32 nanonap() Thread Nap Semantics
Put the SMT hardware lane into an idle state.
The OS cannot schedule other work onto the hardware context.
The OS can interrupt and wake up the thread.

33 nanonap()
per_cpu_variable: nap_flag;

void nanonap() {
    enter_kernel();
    disable_preemption();
    my_nap_flag = this_cpu_flag(nap_flag);
    monitor(my_nap_flag);
    mwait();
    enable_preemption();
    leave_kernel();
}

[Timeline: the batch lane naps while LC is busy.]

34 Elfen Scheduler
No change to latency-critical threads.
Bind latency-critical threads to the LC lane; bind batch threads to the batch lane.
Instrument batch workloads to detect LC threads quickly and nap.

35 Elfen Scheduler
(1) Batch thread borrows resources and continuously checks the LC lane status.
(2) LC starts; batch calls nanonap() to release the SMT hardware context.
(3) OS touches nap_flag to wake up the batch thread.

/* fast path check injected into method body */
check: if (!request_lane_idle) slow_path();

slow_path() { nanonap(); }

/* maps lane IDs to the running task */
exposed SHIM signal: cpu_task_map
task_switch(task T) { cpu_task_map[thiscpu] = T; }

idle_task() {
    // wake up any waiting batch thread
    update_nap_flag_of_partner_lane();
    ...
}

36 Results: Borrow Idle
[Figure: 99%ile latency and utilization (normalized to no SMT) for Lucene alone and with each DaCapo workload (antlr, bloat, eclipse, fop, hsqldb, jython, luindex, lusearch, pmd, xalan), on 1 core with 2 SMT lanes and on 7 cores with 2x7 SMT lanes; utilization increases by 9% to 37% in one configuration and by at least 8% in the other, with the same tail latency.]

37 Beyond Mutual Exclusion
Mutual exclusion (borrow idle policy): the batch lane calls nanonap whenever LC runs.
Concurrent co-running with a budget: fixed budget, refresh budget, and dynamic budget policies.

38 [Figure: 99%ile latency vs. utilization (normalized to no SMT) for borrow idle on 1 core, borrow idle on 7 cores, and dynamic budget on 7 cores, across Lucene alone and the DaCapo workloads; the dynamic budget policy further increases utilization.]

39 Conclusion
Principled borrowing for SMT: nanonap() releases SMT hardware without giving it to the OS.
Elfen scheduler implementation and policies: 9% to 25% increase in utilization with the same tail latency!
https://github.com/elfenscheduler
Thank you!
