Design Principles for End-to-End Multicore Schedulers

Size: px

Start display at page:

Download "Design Principles for End-to-End Multicore Schedulers"

Abel Jordan
6 years ago
Views:

1 c Systems Group Department of Computer Science ETH Zürich HotPar 10 Design Principles for End-to-End Multicore Schedulers Simon Peter Adrian Schüpbach Paul Barham Andrew Baumann Rebecca Isaacs Tim Harris Timothy Roscoe Systems Group, ETH Zurich Microsoft Research HotPar 10

HotPar 10 Systems Group Department of Computer Science ETH Zürich 2 Context: Barrelfish Multikernel operating system Developed at ETHZ and Microsoft Research Scalable research OS on heterogeneous

2 HotPar 10 Systems Group Department of Computer Science ETH Zürich 2 Context: Barrelfish Multikernel operating system Developed at ETHZ and Microsoft Research Scalable research OS on heterogeneous multicore hardware Operating system principles and structure Programming models and language runtime systems Other scalable OS approaches are similar Tessellation, Corey, ROS, fos,... Ideas in this talk more widely applicable

3 HotPar 10 Systems Group Department of Computer Science ETH Zürich 3 Today s talk topic OS Scheduler architecture for today s (and tomorrow s) multicore machines General-purpose setting: Dynamic workload mix Multiple parallel apps Interactive parallel apps

4 HotPar 10 Systems Group Department of Computer Science ETH Zürich 4 Why this is a problem A simple example Run 2 OpenMP applications concurrently On 16-core AMD Shanghai system Intel OpenMP library Linux OS

5 HotPar 10 Systems Group Department of Computer Science ETH Zürich 5 Why this is a problem Example: 2x OpenMP on 16-core Linux One app is CPU-Bound: #pragma omp parallel for(;;) iterations[omp_get_thread_num()]++; Other is synchronization intensive (eg. BARRIER): #pragma omp parallel for(;;) { #pragma omp barrier iterations[omp_get_thread_num()]++; }

6 HotPar 10 Systems Group Department of Computer Science ETH Zürich 6 Why this is a problem Example: 2x OpenMP on 16-core Linux Run for x in [2..16]: OMP_NUM_THREADS=x./BARRIER & OMP_NUM_THREADS=8./cpu_bound & sleep 20 killall BARRIER cpu_bound Plot average iterations/thread/s over 20s

7 HotPar 10 Systems Group Department of Computer Science ETH Zürich 7 Why this is a problem Example: 2x OpenMP on 16-core Linux CPU-Bound BARRIER Relative Rate of Progress Number of BARRIER Threads

8 HotPar 10 Systems Group Department of Computer Science ETH Zürich 7 Why this is a problem Example: 2x OpenMP on 16-core Linux CPU-Bound BARRIER Relative Rate of Progress Number of BARRIER Threads

9 HotPar 10 Systems Group Department of Computer Science ETH Zürich 7 Why this is a problem Example: 2x OpenMP on 16-core Linux CPU-Bound Until 8 BARRIER threads BARRIER Relative Rate of Progress Number of BARRIER Threads

10 HotPar 10 Systems Group Department of Computer Science ETH Zürich 7 Why this is a problem Example: 2x OpenMP on 16-core Linux Relative Rate of Progress CPU-Bound Until 8 BARRIER threads BARRIER CPU-Bound stays at 1 (same thread allocation) Number of BARRIER Threads

11 HotPar 10 Systems Group Department of Computer Science ETH Zürich 7 Why this is a problem Example: 2x OpenMP on 16-core Linux Relative Rate of Progress CPU-Bound Until 8 BARRIER threads BARRIER CPU-Bound stays at 1 (same thread allocation) BARRIER degrades (due to increasing cost) Number of BARRIER Threads

12 HotPar 10 Systems Group Department of Computer Science ETH Zürich 7 Why this is a problem Example: 2x OpenMP on 16-core Linux Relative Rate of Progress CPU-Bound Until 8 BARRIER threads BARRIER CPU-Bound stays at 1 (same thread allocation) BARRIER degrades (due to increasing cost) Space-partitioning Number of BARRIER Threads

13 HotPar 10 Systems Group Department of Computer Science ETH Zürich 7 Why this is a problem Example: 2x OpenMP on 16-core Linux Relative Rate of Progress From 9 threads (threads > cores) Time-multiplexing CPU-Bound BARRIER Number of BARRIER Threads

14 HotPar 10 Systems Group Department of Computer Science ETH Zürich 7 Why this is a problem Example: 2x OpenMP on 16-core Linux Relative Rate of Progress From 9 threads (threads > cores) Time-multiplexing CPU-Bound degrades linearly CPU-Bound BARRIER Number of BARRIER Threads

15 HotPar 10 Systems Group Department of Computer Science ETH Zürich 7 Why this is a problem Example: 2x OpenMP on 16-core Linux Relative Rate of Progress From 9 threads (threads > cores) Time-multiplexing CPU-Bound degrades linearly BARRIER drops sharply (only makes progress when all threads run concurrently) Number of BARRIER Threads CPU-Bound BARRIER

16 HotPar 10 Systems Group Department of Computer Science ETH Zürich 8 Why this is a problem Example: 2x OpenMP on 16-core Linux Gang scheduling or smart core allocation would help Gang scheduling: OS unaware of apps requirements The run-time system could ve known Eg. via annotations or compiler Smart core allocation: OS knows general system state Run-time system chooses number of threads Information and mechanisms in the wrong place

17 HotPar 10 Systems Group Department of Computer Science ETH Zürich 9 Why this is a problem Example: 2x OpenMP on 16-core Linux CPU-Bound BARRIER Relative Rate of Progress Huge error bars (min/max over 20 runs) Random placement of threads to cores Number of BARRIER Threads

18 HotPar 10 Systems Group Department of Computer Science ETH Zürich 10 Why this is a problem 16-core AMD Shanghai system L3 HT L3 HT HT L3 HT L3 Same-die L3 access twice as fast as cross-die OpenMP run-time does not know about this machine

19 HotPar 10 Systems Group Department of Computer Science ETH Zürich 10 Why this is a problem 16-core AMD Shanghai system L3 HT L3 HT HT L3 HT L3 Same-die L3 access twice as fast as cross-die OpenMP run-time does not know about this machine

20 HotPar 10 Systems Group Department of Computer Science ETH Zürich 10 Why this is a problem 16-core AMD Shanghai system L3 HT L3 HT HT L3 HT L3 Same-die L3 access twice as fast as cross-die OpenMP run-time does not know about this machine

21 HotPar 10 Systems Group Department of Computer Science ETH Zürich 11 Why this is a problem Example: 2x OpenMP on 16-core Linux CPU-Bound BARRIER Relative Rate of Progress threads case: Performance difference of Number of BARRIER Threads

22 HotPar 10 Systems Group Department of Computer Science ETH Zürich 12 Why this is a problem System diversity FB DIMM FB DIMM FB DIMM FB DIMM L3 HT3 HT3 L3 MCU MCU MCU MCU L2$ L2$ L2$ L2$ L2$ L2$ L2$ L2$ AMD Opteron (Magny-Cours) On-chip interconnect Full Cross Bar C0 C1 C2 C3 C4 C5 C6 C7 FPU FPU FPU FPU FPU FPU FPU FPU SPU SPU SPU SPU SPU SPU SPU SPU NIU (Ethernet+) 2x 10 Gigabit Ethernet Sun Niagara T2 Flat, fast cache hierarchy Sys I/F Buffer Switch Core Power <95 W PCIe 2.0 GHz Intel Nehalem (Beckton) On-die ring network

HotPar 10 Systems Group Department of Computer Science ETH Zürich 12 Why this is a problem System diversity FB DIMM FB DIMM FB DIMM FB DIMM L3 HT3 HT3 L3 MCU L2$ L2$ L2$ L2$ L2$ L2$ L2$ L2$ C0 C1 C2

23 HotPar 10 Systems Group Department of Computer Science ETH Zürich 12 Why this is a problem System diversity FB DIMM FB DIMM FB DIMM FB DIMM L3 HT3 HT3 L3 MCU L2$ L2$ L2$ L2$ L2$ L2$ L2$ L2$ C0 C1 C2 C3 C4 C5 C6 C7 FPU FPU FPU FPU FPU FPU FPU FPU SPU SPU SPU SPU SPU SPU SPU SPU NIU (Ethernet+) 2x 10 Gigabit Ethernet AMD Opteron (Magny-Cours) Manual tuning increasingly On-chip difficultinterconnect Architectures change too quickly Offline auto-tuning (eg. ATLAS) limited MCU MCU MCU Full Cross Bar Sun Niagara T2 Flat, fast cache hierarchy Sys I/F Buffer Switch Core Power <95 W PCIe 2.0 GHz Intel Nehalem (Beckton) On-die ring network

24 HotPar 10 Systems Group Department of Computer Science ETH Zürich 13 Online adaptation Online adaptation remains viable Easier with contemporary runtime systems OpenMP, Grand Central Dispatch, ConcRT, MPI,... Synchronization patterns are more explicit But needs information at right places

25 HotPar 10 Systems Group Department of Computer Science ETH Zürich 14 The end-to-end approach The system stack: Component Related work Hardware Heterogeneous,... OS scheduler CAMP, HASS,... Runtime systems OpenMP, MPI, ConcRT, McRT,... Compilers Auto-parallel.,... Programming paradigms MapReduce, ICC,... Applications annotations,... Involve all components, top to bottom Need to cut through classical OS abstractions Here we focus on OS / runtime system integration

26 Design Principles HotPar 10 Systems Group Department of Computer Science ETH Zürich 15

27 HotPar 10 Systems Group Department of Computer Science ETH Zürich 16 Design principles 1. Time-multiplexing cores is still needed Resource abundance scheduler freedom Asymmetric multi-core architectures Contention for big cores Provide real-time QoS to interactive apps, not wasting cores Avoid power wasted through over-provisioning

28 HotPar 10 Systems Group Department of Computer Science ETH Zürich 17 Design principles 2. Schedule at multiple timescales Interactive workloads are now parallel Requirements might change abruptly Eg. parallel web browser Much shorter, interactive time scales Thus need small overhead when scheduling Synchronized scheduling on every time-slice won t scale

29 HotPar 10 Systems Group Department of Computer Science ETH Zürich 18 Implementation in Barrelfish Combination of techniques at different time granularities Long-term placement of apps on cores Medium-term resource allocation Short-term per-core scheduling

HotPar 10 Systems Group Department of Computer Science ETH Zürich 18 Implementation in Barrelfish Combination of techniques at different time granularities

30 HotPar 10 Systems Group Department of Computer Science ETH Zürich 18 Implementation in Barrelfish Combination of techniques at different time granularities Long-term placement of apps on cores Medium-term resource allocation Short-term per-core scheduling Phase-locked gang scheduling Gang scheduling over interactive timescales

31 HotPar 10 Systems Group Department of Computer Science ETH Zürich 19 Phase-locked gang scheduling Decouple schedule synchronization from dispatch Best-effort (actual trace): Phase-locked gang scheduling (actual trace):

32 HotPar 10 Systems Group Department of Computer Science ETH Zürich 19 Phase-locked gang scheduling Decouple schedule synchronization from dispatch Best-effort (actual trace): Progress only in small time windows Phase-locked gang scheduling (actual trace):

33 HotPar 10 Systems Group Department of Computer Science ETH Zürich 19 Phase-locked gang scheduling Decouple schedule synchronization from dispatch Best-effort (actual trace): Phase-locked gang scheduling (actual trace): Synchronize core-local clocks

34 HotPar 10 Systems Group Department of Computer Science ETH Zürich 19 Phase-locked gang scheduling Decouple schedule synchronization from dispatch Best-effort (actual trace): Phase-locked gang scheduling (actual trace): Agree on future gang start time

35 HotPar 10 Systems Group Department of Computer Science ETH Zürich 19 Phase-locked gang scheduling Decouple schedule synchronization from dispatch Best-effort (actual trace): Phase-locked gang scheduling (actual trace):...and gang period

36 HotPar 10 Systems Group Department of Computer Science ETH Zürich 19 Phase-locked gang scheduling Decouple schedule synchronization from dispatch Best-effort (actual trace): Phase-locked gang scheduling (actual trace): Resync in future when necessary

37 HotPar 10 Systems Group Department of Computer Science ETH Zürich 20 Design principles 3. Reason online about the hardware We employ a system knowledge base Contains rich representation of the hardware Queries in subset of first-order logic Logical unification aids dealing with diversity Both OS and apps use it

38 HotPar 10 Systems Group Department of Computer Science ETH Zürich 21 Design principles 4. Reason online about each application OS should exploit knowledge about apps for efficiency Eg. gang schedule threads in an OpenMP team But no sense in gang scheduling unrelated threads A single app might go through different phases Optimal allocation of resources changes over time Implementation: Apps submit scheduling manifests to planner Contain predicted long-term resource requirements Expressed as constrained cost-functions May make use of any information in the SKB

39 HotPar 10 Systems Group Department of Computer Science ETH Zürich 22 Design principles 5. Applications and OS must communicate Implementing the end-to-end principle Resource allocation may be renegotiated during runtime Implementation: Hardware threads run user-level dispatchers Cf. Psyche, inheritance scheduling Related dispatchers are grouped into dispatcher groups Derived from RTIDs of McRT Used as handles when renegotiating Scheduler activations [Anderson 1992] to inform app

40 HotPar 10 Systems Group Department of Computer Science ETH Zürich 23 Implementation in the Barrelfish OS Disp Disp Disp Disp Disp Disp D1 D2

41 HotPar 10 Systems Group Department of Computer Science ETH Zürich 24 Open questions What are appropriate mechanisms and timescales for inter-core phase synchronization? How can programmers provide useful concurrency information to the runtime? How efficiently can runtime specify requirements to OS? Hidden cost (if any) of decoupling scheduling timescales? Tradeoffs between centralized and distributed planners? Appropriate level of expressivity for the SKB?

The Multikernel A new OS architecture for scalable multicore systems

Systems Group Department of Computer Science ETH Zurich SOSP, 12th October 2009 The Multikernel A new OS architecture for scalable multicore systems Andrew Baumann 1 Paul Barham 2 Pierre-Evariste Dagand