Design Principles for End-to-End Multicore Schedulers

Transcription:

Slide 1: Design Principles for End-to-End Multicore Schedulers
Simon Peter, Adrian Schüpbach, Paul Barham, Andrew Baumann, Rebecca Isaacs, Tim Harris, Timothy Roscoe
Systems Group, ETH Zurich and Microsoft Research
HotPar '10

Slide 2: Context: Barrelfish
- Multikernel operating system, developed at ETH Zurich and Microsoft Research
- Scalable research OS for heterogeneous multicore hardware
- Research covers operating system principles and structure, programming models, and language runtime systems
- Other scalable OS approaches are similar: Tessellation, Corey, ROS, fos, ...
- The ideas in this talk are more widely applicable

Slide 3: Today's talk topic
- OS scheduler architecture for today's (and tomorrow's) multicore machines
- General-purpose setting: dynamic workload mix, multiple parallel apps, interactive parallel apps

Slide 4: Why this is a problem
A simple example: run 2 OpenMP applications concurrently
- 16-core AMD Shanghai system
- Intel OpenMP library
- Linux OS

Slide 5: Why this is a problem (2x OpenMP on 16-core Linux)
One app is CPU-bound:

    #pragma omp parallel
    for (;;)
        iterations[omp_get_thread_num()]++;

The other is synchronization-intensive (e.g. BARRIER):

    #pragma omp parallel
    for (;;) {
        #pragma omp barrier
        iterations[omp_get_thread_num()]++;
    }
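
For reference, a self-terminating variant of the BARRIER microbenchmark can be written as follows. This is a sketch, not the authors' code: the paper's setup instead lets the processes run freely and kills them externally after 20 seconds, and the 64-entry array bound is an assumption.

    /* barrier.c: sketch of the BARRIER microbenchmark (build: gcc -fopenmp barrier.c) */
    #include <omp.h>
    #include <stdio.h>

    #define MAX_THREADS 64
    #define SECONDS 20.0

    static long iterations[MAX_THREADS];
    static int stop = 0;
    static int nthreads = 1;

    int main(void)
    {
        double start = omp_get_wtime();
        #pragma omp parallel
        {
            int tid = omp_get_thread_num();
            #pragma omp single
            nthreads = omp_get_num_threads();
            while (!stop) {
                #pragma omp barrier
                /* one thread decides; the following barrier makes the verdict
                 * visible to everyone, so the whole team exits together */
                if (tid == 0 && omp_get_wtime() - start >= SECONDS)
                    stop = 1;
                #pragma omp barrier
                iterations[tid]++;
            }
        }
        double total = 0;
        for (int i = 0; i < nthreads; i++)
            total += iterations[i];
        printf("average: %.0f iterations/thread/s\n", total / nthreads / SECONDS);
        return 0;
    }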

Slide 6: Why this is a problem (2x OpenMP on 16-core Linux)
Run for x in 2..16:

    OMP_NUM_THREADS=$x ./BARRIER &
    OMP_NUM_THREADS=8 ./cpu_bound &
    sleep 20
    killall BARRIER cpu_bound

Plot the average iterations/thread/s over the 20-second run.

Slide 7: Why this is a problem (2x OpenMP on 16-core Linux)
[Figure: relative rate of progress (0 to 1.2) of the CPU-bound and BARRIER apps against the number of BARRIER threads (2 to 16)]
- Up to 8 BARRIER threads, the machine is space-partitioned: CPU-bound stays at 1 (same thread allocation), while BARRIER degrades due to the increasing cost of its barriers
- From 9 threads onwards (threads > cores), the OS time-multiplexes: CPU-bound degrades linearly, and BARRIER drops sharply because it only makes progress when all of its threads run concurrently

Slide 8: Why this is a problem (2x OpenMP on 16-core Linux)
Gang scheduling or smart core allocation would have helped:
- Gang scheduling: the OS is unaware of the apps' requirements; the run-time system could have known them (e.g. via annotations or the compiler)
- Smart core allocation: the OS knows the general system state; the run-time system chooses the number of threads
The information and the mechanisms sit in the wrong places.

Slide 9: Why this is a problem (2x OpenMP on 16-core Linux)
[Figure: the same plot with min/max error bars over 20 runs]
- Huge error bars (min/max over 20 runs), caused by random placement of threads on cores

Slide 10: Why this is a problem
[Figure: the 16-core AMD Shanghai system: four quad-core dies, each with its own L3 cache, connected by HyperTransport (HT) links]
- Same-die L3 access is twice as fast as cross-die access
- The OpenMP run-time does not know about this machine
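
A runtime that does know the machine can pin a team's threads onto one die so that they share an L3. A minimal Linux sketch (the assumption that cores 0 to 3 share a die is hypothetical and machine-specific):

    /* pin_to_die.c: pin the calling thread to cores 0-3, hypothetically
     * assumed to be the four cores of one die sharing one L3 cache. */
    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int pin_to_first_die(void)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        for (int c = 0; c < 4; c++)   /* cores 0-3: same-die assumption */
            CPU_SET(c, &set);
        /* pid 0 means "the calling thread" */
        return sched_setaffinity(0, sizeof(set), &set);
    }

    int main(void)
    {
        if (pin_to_first_die() != 0)
            perror("sched_setaffinity");
        return 0;
    }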

Slide 11: Why this is a problem (2x OpenMP on 16-core Linux)
[Figure: the same plot, highlighting the 2-thread case]
- Even with only 2 BARRIER threads, thread placement alone causes a performance difference of 0.4 in relative rate of progress

Slide 12: Why this is a problem: system diversity
[Figure: block diagrams of the AMD Opteron (Magny-Cours), Sun Niagara T2, and Intel Nehalem (Beckton)]
- AMD Opteron (Magny-Cours): on-chip interconnect
- Sun Niagara T2: flat, fast cache hierarchy behind a full crossbar
- Intel Nehalem (Beckton): on-die ring network
- Manual tuning is increasingly difficult: architectures change too quickly, and offline auto-tuning (e.g. ATLAS) is limited

Slide 13: Online adaptation
- Online adaptation remains viable
- It is easier with contemporary runtime systems (OpenMP, Grand Central Dispatch, ConcRT, MPI, ...), whose synchronization patterns are more explicit
- But it needs information at the right places

Slide 14: The end-to-end approach
The system stack:

    Component              Related work
    Hardware               heterogeneous, ...
    OS scheduler           CAMP, HASS, ...
    Runtime systems        OpenMP, MPI, ConcRT, McRT, ...
    Compilers              auto-parallelization, ...
    Programming paradigms  MapReduce, ICC, ...
    Applications           annotations, ...

- Involve all components, top to bottom
- This requires cutting through classical OS abstractions
- Here we focus on the OS / runtime-system integration

Slide 15: Design Principles

Slide 16: Design principles
1. Time-multiplexing cores is still needed
- Resource abundance does not imply scheduler freedom
- Asymmetric multicore architectures mean contention for the big cores
- Provide real-time QoS to interactive apps without wasting cores
- Avoid the power wasted by over-provisioning

Slide 17: Design principles
2. Schedule at multiple timescales
- Interactive workloads are now parallel, and their requirements might change abruptly (e.g. a parallel web browser)
- Much shorter, interactive timescales demand low scheduling overhead
- Synchronized scheduling on every time-slice won't scale

Slide 18: Implementation in Barrelfish
Combination of techniques at different time granularities:
- Long-term placement of apps on cores
- Medium-term resource allocation
- Short-term per-core scheduling
- Phase-locked gang scheduling: gang scheduling over interactive timescales

Slide 19: Phase-locked gang scheduling
Decouple schedule synchronization from dispatch.
- Best-effort scheduling (actual trace): the gang makes progress only in the small time windows when its threads happen to overlap
- Phase-locked gang scheduling (actual trace): synchronize core-local clocks, agree on a future gang start time and a gang period, then resynchronize only when necessary
[Figures: actual execution traces of both schemes]
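
Once start time and period are agreed, each core can decide locally on every timer tick whether the gang should run, with no further cross-core messages until the schedule changes. A minimal sketch of that decision, assuming synchronized per-core clocks (all names are hypothetical):

    /* phase_lock.c: sketch of phase-locked gang dispatch */
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    struct gang_schedule {
        uint64_t start;    /* agreed future start time, in clock ticks */
        uint64_t period;   /* length of one full scheduling round */
        uint64_t offset;   /* where the gang's slot begins within a round */
        uint64_t slice;    /* length of the gang's slot */
    };

    /* Evaluated locally by each core on its timer tick. */
    static bool gang_runs_now(const struct gang_schedule *g, uint64_t now)
    {
        if (now < g->start)
            return false;
        uint64_t phase = (now - g->start) % g->period;
        return phase >= g->offset && phase < g->offset + g->slice;
    }

    int main(void)
    {
        struct gang_schedule g = { .start = 1000, .period = 100,
                                   .offset = 0, .slice = 40 };
        for (uint64_t t = 1000; t < 1200; t += 50)
            printf("t=%lu: gang %s\n", (unsigned long)t,
                   gang_runs_now(&g, t) ? "runs" : "idle");
        return 0;
    }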

Slide 20: Design principles
3. Reason online about the hardware
- We employ a system knowledge base (SKB) containing a rich representation of the hardware
- Queries are posed in a subset of first-order logic; logical unification aids dealing with diversity
- Both the OS and applications use it
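
As an illustration of the query style only (the function name, its signature, and the fact schema below are assumptions, not Barrelfish's actual SKB interface), a client asking which cores share a cache with core 0 might look like this:

    /* Hypothetical SKB-style query; skb_query() is a stand-in stub here
     * so the example runs. Assumed fact schema: core_cache(Core, CacheId). */
    #include <stdio.h>
    #include <string.h>

    /* Stub: a real client would ship the Prolog-like query to the SKB
     * and receive the bindings produced by unification. */
    static int skb_query(const char *query, char *result, size_t maxlen)
    {
        (void)query;
        strncpy(result, "Core = 1; Core = 2; Core = 3", maxlen - 1);
        result[maxlen - 1] = '\0';
        return 0;
    }

    int main(void)
    {
        char result[128];
        if (skb_query("core_cache(0, C), core_cache(Core, C), Core =\\= 0.",
                      result, sizeof(result)) == 0)
            printf("cores sharing a cache with core 0: %s\n", result);
        return 0;
    }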

Slide 21: Design principles
4. Reason online about each application
- The OS should exploit knowledge about apps for efficiency: gang-schedule the threads of an OpenMP team, say, but there is no sense in gang-scheduling unrelated threads
- A single app might go through different phases, so the optimal resource allocation changes over time
Implementation:
- Apps submit scheduling manifests to a planner
- Manifests contain predicted long-term resource requirements, expressed as constrained cost-functions
- Manifests may draw on any information in the SKB
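
As a sketch of what "predicted long-term resource requirements expressed as constrained cost-functions" could look like as plain data (all field names are hypothetical illustrations, not the paper's manifest format):

    #include <stdbool.h>
    #include <stdio.h>

    enum phase_kind { PHASE_CPU_BOUND, PHASE_SYNC_INTENSIVE };

    struct phase_requirement {
        enum phase_kind kind;
        int min_cores;        /* constraint: below this, no useful progress */
        int max_cores;        /* constraint: beyond this, returns diminish */
        bool needs_gang;      /* e.g. an OpenMP team with frequent barriers */
        bool wants_same_die;  /* prefer cores that share an L3 cache */
        double base, slope;   /* toy cost-function: utility(n) = base + slope*n */
    };

    struct scheduling_manifest {
        int nphases;
        struct phase_requirement phases[8];
    };

    int main(void)
    {
        /* a manifest for the BARRIER app: one synchronization-heavy phase */
        struct scheduling_manifest m = {
            .nphases = 1,
            .phases[0] = { .kind = PHASE_SYNC_INTENSIVE, .min_cores = 2,
                           .max_cores = 8, .needs_gang = true,
                           .wants_same_die = true, .base = 0.0, .slope = 0.9 },
        };
        printf("phase 0 wants %d-%d gang-scheduled cores\n",
               m.phases[0].min_cores, m.phases[0].max_cores);
        return 0;
    }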

Slide 22: Design principles
5. Applications and OS must communicate
- This implements the end-to-end principle: resource allocation may be renegotiated at runtime
Implementation:
- Hardware threads run user-level dispatchers (cf. Psyche, inheritance scheduling)
- Related dispatchers are grouped into dispatcher groups (derived from the RTIDs of McRT), which serve as handles when renegotiating
- Scheduler activations [Anderson 1992] inform the app of allocation changes
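
A rough sketch of how renegotiation over a dispatcher-group handle could meet an activation-style upcall; every name here is a hypothetical stand-in, and the planner's reply is faked so the example runs:

    #include <stdio.h>

    typedef int dispatcher_group_t;  /* handle naming related dispatchers */

    /* Upcall invoked when the allocation for the group changes
     * (cf. scheduler activations [Anderson 1992]). */
    static void cores_changed(dispatcher_group_t g, int ncores)
    {
        /* the runtime reacts, e.g. by resizing its OpenMP team */
        printf("group %d now has %d cores\n", g, ncores);
    }

    /* Hypothetical request: ask the planner for between lo and hi cores. */
    static void renegotiate(dispatcher_group_t g, int lo, int hi)
    {
        (void)lo;
        int granted = hi;  /* faked: pretend the planner grants the maximum */
        cores_changed(g, granted);
    }

    int main(void) { renegotiate(1, 4, 8); return 0; }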

Slide 23: Implementation in the Barrelfish OS
[Figure: per-core user-level dispatchers ("Disp"), grouped into dispatcher groups D1 and D2]

Slide 24: Open questions
- What are appropriate mechanisms and timescales for inter-core phase synchronization?
- How can programmers provide useful concurrency information to the runtime?
- How efficiently can the runtime specify its requirements to the OS?
- Is there a hidden cost (if any) to decoupling scheduling timescales?
- What are the tradeoffs between centralized and distributed planners?
- What is the appropriate level of expressivity for the SKB?