Design Principles for End-to-End Multicore Schedulers

Size: px
Start display at page:

Download "Design Principles for End-to-End Multicore Schedulers"

Transcription

1 c Systems Group Department of Computer Science ETH Zürich HotPar 10 Design Principles for End-to-End Multicore Schedulers Simon Peter Adrian Schüpbach Paul Barham Andrew Baumann Rebecca Isaacs Tim Harris Timothy Roscoe Systems Group, ETH Zurich Microsoft Research HotPar 10

2 HotPar 10 Systems Group Department of Computer Science ETH Zürich 2 Context: Barrelfish Multikernel operating system Developed at ETHZ and Microsoft Research Scalable research OS on heterogeneous multicore hardware Operating system principles and structure Programming models and language runtime systems Other scalable OS approaches are similar Tessellation, Corey, ROS, fos,... Ideas in this talk more widely applicable

3 HotPar 10 Systems Group Department of Computer Science ETH Zürich 3 Today s talk topic OS Scheduler architecture for today s (and tomorrow s) multicore machines General-purpose setting: Dynamic workload mix Multiple parallel apps Interactive parallel apps

4 HotPar 10 Systems Group Department of Computer Science ETH Zürich 4 Why this is a problem A simple example Run 2 OpenMP applications concurrently On 16-core AMD Shanghai system Intel OpenMP library Linux OS

5 HotPar 10 Systems Group Department of Computer Science ETH Zürich 5 Why this is a problem Example: 2x OpenMP on 16-core Linux One app is CPU-Bound: #pragma omp parallel for(;;) iterations[omp_get_thread_num()]++; Other is synchronization intensive (eg. BARRIER): #pragma omp parallel for(;;) { #pragma omp barrier iterations[omp_get_thread_num()]++; }

6 HotPar 10 Systems Group Department of Computer Science ETH Zürich 6 Why this is a problem Example: 2x OpenMP on 16-core Linux Run for x in [2..16]: OMP_NUM_THREADS=x./BARRIER & OMP_NUM_THREADS=8./cpu_bound & sleep 20 killall BARRIER cpu_bound Plot average iterations/thread/s over 20s

7 HotPar 10 Systems Group Department of Computer Science ETH Zürich 7 Why this is a problem Example: 2x OpenMP on 16-core Linux CPU-Bound BARRIER Relative Rate of Progress Number of BARRIER Threads

8 HotPar 10 Systems Group Department of Computer Science ETH Zürich 7 Why this is a problem Example: 2x OpenMP on 16-core Linux CPU-Bound BARRIER Relative Rate of Progress Number of BARRIER Threads

9 HotPar 10 Systems Group Department of Computer Science ETH Zürich 7 Why this is a problem Example: 2x OpenMP on 16-core Linux CPU-Bound Until 8 BARRIER threads BARRIER Relative Rate of Progress Number of BARRIER Threads

10 HotPar 10 Systems Group Department of Computer Science ETH Zürich 7 Why this is a problem Example: 2x OpenMP on 16-core Linux Relative Rate of Progress CPU-Bound Until 8 BARRIER threads BARRIER CPU-Bound stays at 1 (same thread allocation) Number of BARRIER Threads

11 HotPar 10 Systems Group Department of Computer Science ETH Zürich 7 Why this is a problem Example: 2x OpenMP on 16-core Linux Relative Rate of Progress CPU-Bound Until 8 BARRIER threads BARRIER CPU-Bound stays at 1 (same thread allocation) BARRIER degrades (due to increasing cost) Number of BARRIER Threads

12 HotPar 10 Systems Group Department of Computer Science ETH Zürich 7 Why this is a problem Example: 2x OpenMP on 16-core Linux Relative Rate of Progress CPU-Bound Until 8 BARRIER threads BARRIER CPU-Bound stays at 1 (same thread allocation) BARRIER degrades (due to increasing cost) Space-partitioning Number of BARRIER Threads

13 HotPar 10 Systems Group Department of Computer Science ETH Zürich 7 Why this is a problem Example: 2x OpenMP on 16-core Linux Relative Rate of Progress From 9 threads (threads > cores) Time-multiplexing CPU-Bound BARRIER Number of BARRIER Threads

14 HotPar 10 Systems Group Department of Computer Science ETH Zürich 7 Why this is a problem Example: 2x OpenMP on 16-core Linux Relative Rate of Progress From 9 threads (threads > cores) Time-multiplexing CPU-Bound degrades linearly CPU-Bound BARRIER Number of BARRIER Threads

15 HotPar 10 Systems Group Department of Computer Science ETH Zürich 7 Why this is a problem Example: 2x OpenMP on 16-core Linux Relative Rate of Progress From 9 threads (threads > cores) Time-multiplexing CPU-Bound degrades linearly BARRIER drops sharply (only makes progress when all threads run concurrently) Number of BARRIER Threads CPU-Bound BARRIER

16 HotPar 10 Systems Group Department of Computer Science ETH Zürich 8 Why this is a problem Example: 2x OpenMP on 16-core Linux Gang scheduling or smart core allocation would help Gang scheduling: OS unaware of apps requirements The run-time system could ve known Eg. via annotations or compiler Smart core allocation: OS knows general system state Run-time system chooses number of threads Information and mechanisms in the wrong place

17 HotPar 10 Systems Group Department of Computer Science ETH Zürich 9 Why this is a problem Example: 2x OpenMP on 16-core Linux CPU-Bound BARRIER Relative Rate of Progress Huge error bars (min/max over 20 runs) Random placement of threads to cores Number of BARRIER Threads

18 HotPar 10 Systems Group Department of Computer Science ETH Zürich 10 Why this is a problem 16-core AMD Shanghai system L3 HT L3 HT HT L3 HT L3 Same-die L3 access twice as fast as cross-die OpenMP run-time does not know about this machine

19 HotPar 10 Systems Group Department of Computer Science ETH Zürich 10 Why this is a problem 16-core AMD Shanghai system L3 HT L3 HT HT L3 HT L3 Same-die L3 access twice as fast as cross-die OpenMP run-time does not know about this machine

20 HotPar 10 Systems Group Department of Computer Science ETH Zürich 10 Why this is a problem 16-core AMD Shanghai system L3 HT L3 HT HT L3 HT L3 Same-die L3 access twice as fast as cross-die OpenMP run-time does not know about this machine

21 HotPar 10 Systems Group Department of Computer Science ETH Zürich 11 Why this is a problem Example: 2x OpenMP on 16-core Linux CPU-Bound BARRIER Relative Rate of Progress threads case: Performance difference of Number of BARRIER Threads

22 HotPar 10 Systems Group Department of Computer Science ETH Zürich 12 Why this is a problem System diversity FB DIMM FB DIMM FB DIMM FB DIMM L3 HT3 HT3 L3 MCU MCU MCU MCU L2$ L2$ L2$ L2$ L2$ L2$ L2$ L2$ AMD Opteron (Magny-Cours) On-chip interconnect Full Cross Bar C0 C1 C2 C3 C4 C5 C6 C7 FPU FPU FPU FPU FPU FPU FPU FPU SPU SPU SPU SPU SPU SPU SPU SPU NIU (Ethernet+) 2x 10 Gigabit Ethernet Sun Niagara T2 Flat, fast cache hierarchy Sys I/F Buffer Switch Core Power <95 W PCIe 2.0 GHz Intel Nehalem (Beckton) On-die ring network

23 HotPar 10 Systems Group Department of Computer Science ETH Zürich 12 Why this is a problem System diversity FB DIMM FB DIMM FB DIMM FB DIMM L3 HT3 HT3 L3 MCU L2$ L2$ L2$ L2$ L2$ L2$ L2$ L2$ C0 C1 C2 C3 C4 C5 C6 C7 FPU FPU FPU FPU FPU FPU FPU FPU SPU SPU SPU SPU SPU SPU SPU SPU NIU (Ethernet+) 2x 10 Gigabit Ethernet AMD Opteron (Magny-Cours) Manual tuning increasingly On-chip difficultinterconnect Architectures change too quickly Offline auto-tuning (eg. ATLAS) limited MCU MCU MCU Full Cross Bar Sun Niagara T2 Flat, fast cache hierarchy Sys I/F Buffer Switch Core Power <95 W PCIe 2.0 GHz Intel Nehalem (Beckton) On-die ring network

24 HotPar 10 Systems Group Department of Computer Science ETH Zürich 13 Online adaptation Online adaptation remains viable Easier with contemporary runtime systems OpenMP, Grand Central Dispatch, ConcRT, MPI,... Synchronization patterns are more explicit But needs information at right places

25 HotPar 10 Systems Group Department of Computer Science ETH Zürich 14 The end-to-end approach The system stack: Component Related work Hardware Heterogeneous,... OS scheduler CAMP, HASS,... Runtime systems OpenMP, MPI, ConcRT, McRT,... Compilers Auto-parallel.,... Programming paradigms MapReduce, ICC,... Applications annotations,... Involve all components, top to bottom Need to cut through classical OS abstractions Here we focus on OS / runtime system integration

26 Design Principles HotPar 10 Systems Group Department of Computer Science ETH Zürich 15

27 HotPar 10 Systems Group Department of Computer Science ETH Zürich 16 Design principles 1. Time-multiplexing cores is still needed Resource abundance scheduler freedom Asymmetric multi-core architectures Contention for big cores Provide real-time QoS to interactive apps, not wasting cores Avoid power wasted through over-provisioning

28 HotPar 10 Systems Group Department of Computer Science ETH Zürich 17 Design principles 2. Schedule at multiple timescales Interactive workloads are now parallel Requirements might change abruptly Eg. parallel web browser Much shorter, interactive time scales Thus need small overhead when scheduling Synchronized scheduling on every time-slice won t scale

29 HotPar 10 Systems Group Department of Computer Science ETH Zürich 18 Implementation in Barrelfish Combination of techniques at different time granularities Long-term placement of apps on cores Medium-term resource allocation Short-term per-core scheduling

30 HotPar 10 Systems Group Department of Computer Science ETH Zürich 18 Implementation in Barrelfish Combination of techniques at different time granularities Long-term placement of apps on cores Medium-term resource allocation Short-term per-core scheduling Phase-locked gang scheduling Gang scheduling over interactive timescales

31 HotPar 10 Systems Group Department of Computer Science ETH Zürich 19 Phase-locked gang scheduling Decouple schedule synchronization from dispatch Best-effort (actual trace): Phase-locked gang scheduling (actual trace):

32 HotPar 10 Systems Group Department of Computer Science ETH Zürich 19 Phase-locked gang scheduling Decouple schedule synchronization from dispatch Best-effort (actual trace): Progress only in small time windows Phase-locked gang scheduling (actual trace):

33 HotPar 10 Systems Group Department of Computer Science ETH Zürich 19 Phase-locked gang scheduling Decouple schedule synchronization from dispatch Best-effort (actual trace): Phase-locked gang scheduling (actual trace): Synchronize core-local clocks

34 HotPar 10 Systems Group Department of Computer Science ETH Zürich 19 Phase-locked gang scheduling Decouple schedule synchronization from dispatch Best-effort (actual trace): Phase-locked gang scheduling (actual trace): Agree on future gang start time

35 HotPar 10 Systems Group Department of Computer Science ETH Zürich 19 Phase-locked gang scheduling Decouple schedule synchronization from dispatch Best-effort (actual trace): Phase-locked gang scheduling (actual trace):...and gang period

36 HotPar 10 Systems Group Department of Computer Science ETH Zürich 19 Phase-locked gang scheduling Decouple schedule synchronization from dispatch Best-effort (actual trace): Phase-locked gang scheduling (actual trace): Resync in future when necessary

37 HotPar 10 Systems Group Department of Computer Science ETH Zürich 20 Design principles 3. Reason online about the hardware We employ a system knowledge base Contains rich representation of the hardware Queries in subset of first-order logic Logical unification aids dealing with diversity Both OS and apps use it

38 HotPar 10 Systems Group Department of Computer Science ETH Zürich 21 Design principles 4. Reason online about each application OS should exploit knowledge about apps for efficiency Eg. gang schedule threads in an OpenMP team But no sense in gang scheduling unrelated threads A single app might go through different phases Optimal allocation of resources changes over time Implementation: Apps submit scheduling manifests to planner Contain predicted long-term resource requirements Expressed as constrained cost-functions May make use of any information in the SKB

39 HotPar 10 Systems Group Department of Computer Science ETH Zürich 22 Design principles 5. Applications and OS must communicate Implementing the end-to-end principle Resource allocation may be renegotiated during runtime Implementation: Hardware threads run user-level dispatchers Cf. Psyche, inheritance scheduling Related dispatchers are grouped into dispatcher groups Derived from RTIDs of McRT Used as handles when renegotiating Scheduler activations [Anderson 1992] to inform app

40 HotPar 10 Systems Group Department of Computer Science ETH Zürich 23 Implementation in the Barrelfish OS Disp Disp Disp Disp Disp Disp D1 D2

41 HotPar 10 Systems Group Department of Computer Science ETH Zürich 24 Open questions What are appropriate mechanisms and timescales for inter-core phase synchronization? How can programmers provide useful concurrency information to the runtime? How efficiently can runtime specify requirements to OS? Hidden cost (if any) of decoupling scheduling timescales? Tradeoffs between centralized and distributed planners? Appropriate level of expressivity for the SKB?

The Multikernel A new OS architecture for scalable multicore systems

The Multikernel A new OS architecture for scalable multicore systems Systems Group Department of Computer Science ETH Zurich SOSP, 12th October 2009 The Multikernel A new OS architecture for scalable multicore systems Andrew Baumann 1 Paul Barham 2 Pierre-Evariste Dagand

More information

Message Passing. Advanced Operating Systems Tutorial 5

Message Passing. Advanced Operating Systems Tutorial 5 Message Passing Advanced Operating Systems Tutorial 5 Tutorial Outline Review of Lectured Material Discussion: Barrelfish and multi-kernel systems Programming exercise!2 Review of Lectured Material Implications

More information

The Barrelfish operating system for heterogeneous multicore systems

The Barrelfish operating system for heterogeneous multicore systems c Systems Group Department of Computer Science ETH Zurich 8th April 2009 The Barrelfish operating system for heterogeneous multicore systems Andrew Baumann Systems Group, ETH Zurich 8th April 2009 Systems

More information

Multiprocessor OS. COMP9242 Advanced Operating Systems S2/2011 Week 7: Multiprocessors Part 2. Scalability of Multiprocessor OS.

Multiprocessor OS. COMP9242 Advanced Operating Systems S2/2011 Week 7: Multiprocessors Part 2. Scalability of Multiprocessor OS. Multiprocessor OS COMP9242 Advanced Operating Systems S2/2011 Week 7: Multiprocessors Part 2 2 Key design challenges: Correctness of (shared) data structures Scalability Scalability of Multiprocessor OS

More information

the Barrelfish multikernel: with Timothy Roscoe

the Barrelfish multikernel: with Timothy Roscoe RIK FARROW the Barrelfish multikernel: an interview with Timothy Roscoe Timothy Roscoe is part of the ETH Zürich Computer Science Department s Systems Group. His main research areas are operating systems,

More information

MULTIPROCESSOR OS. Overview. COMP9242 Advanced Operating Systems S2/2013 Week 11: Multiprocessor OS. Multiprocessor OS

MULTIPROCESSOR OS. Overview. COMP9242 Advanced Operating Systems S2/2013 Week 11: Multiprocessor OS. Multiprocessor OS Overview COMP9242 Advanced Operating Systems S2/2013 Week 11: Multiprocessor OS Multiprocessor OS Scalability Multiprocessor Hardware Contemporary systems Experimental and Future systems OS design for

More information

The Multikernel: A new OS architecture for scalable multicore systems Baumann et al. Presentation: Mark Smith

The Multikernel: A new OS architecture for scalable multicore systems Baumann et al. Presentation: Mark Smith The Multikernel: A new OS architecture for scalable multicore systems Baumann et al. Presentation: Mark Smith Review Introduction Optimizing the OS based on hardware Processor changes Shared Memory vs

More information

Implications of Multicore Systems. Advanced Operating Systems Lecture 10

Implications of Multicore Systems. Advanced Operating Systems Lecture 10 Implications of Multicore Systems Advanced Operating Systems Lecture 10 Lecture Outline Hardware trends Programming models Operating systems for NUMA hardware!2 Hardware Trends Power consumption limits

More information

INFLUENTIAL OS RESEARCH

INFLUENTIAL OS RESEARCH INFLUENTIAL OS RESEARCH Multiprocessors Jan Bierbaum Tobias Stumpf SS 2017 ROADMAP Roadmap Multiprocessor Architectures Usage in the Old Days (mid 90s) Disco Present Age Research The Multikernel Helios

More information

Modern systems: multicore issues

Modern systems: multicore issues Modern systems: multicore issues By Paul Grubbs Portions of this talk were taken from Deniz Altinbuken s talk on Disco in 2009: http://www.cs.cornell.edu/courses/cs6410/2009fa/lectures/09-multiprocessors.ppt

More information

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Today Non-Uniform

More information

Master s Thesis Nr. 10

Master s Thesis Nr. 10 Master s Thesis Nr. 10 Systems Group, Department of Computer Science, ETH Zurich Performance isolation on multicore hardware by Kaveh Razavi Supervised by Akhilesh Singhania and Timothy Roscoe November

More information

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( )

Today. SMP architecture. SMP architecture. Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming ( ) Lecture 26: Multiprocessing continued Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Systems Group Department of Computer Science ETH Zürich SMP architecture

More information

Shoal: smart allocation and replication of memory for parallel programs

Shoal: smart allocation and replication of memory for parallel programs Shoal: smart allocation and replication of memory for parallel programs Stefan Kaestle, Reto Achermann, Timothy Roscoe, Tim Harris $ ETH Zurich $ Oracle Labs Cambridge, UK Problem Congestion on interconnect

More information

Decoupling Cores, Kernels and Operating Systems

Decoupling Cores, Kernels and Operating Systems Decoupling Cores, Kernels and Operating Systems Gerd Zellweger, Simon Gerber, Kornilios Kourtis, Timothy Roscoe Systems Group, ETH Zürich 10/6/2014 1 Outline Motivation Trends in hardware and software

More information

Performance in the Multicore Era

Performance in the Multicore Era Performance in the Multicore Era Gustavo Alonso Systems Group -- ETH Zurich, Switzerland Systems Group Enterprise Computing Center Performance in the multicore era 2 BACKGROUND - SWISSBOX SwissBox: An

More information

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing

Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing NTNU, IMF February 16. 2018 1 Recap: Distributed memory programming model Parallelism with MPI. An MPI execution is started

More information

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors

More information

Arrakis: The Operating System is the Control Plane

Arrakis: The Operating System is the Control Plane Arrakis: The Operating System is the Control Plane Simon Peter, Jialin Li, Irene Zhang, Dan Ports, Doug Woos, Arvind Krishnamurthy, Tom Anderson University of Washington Timothy Roscoe ETH Zurich Building

More information

An Introduction to the SPEC High Performance Group and their Benchmark Suites

An Introduction to the SPEC High Performance Group and their Benchmark Suites An Introduction to the SPEC High Performance Group and their Benchmark Suites Robert Henschel Manager, Scientific Applications and Performance Tuning Secretary, SPEC High Performance Group Research Technologies

More information

New directions in OS design: the Barrelfish research operating system. Timothy Roscoe (Mothy) ETH Zurich Switzerland

New directions in OS design: the Barrelfish research operating system. Timothy Roscoe (Mothy) ETH Zurich Switzerland New directions in OS design: the Barrelfish research operating system Timothy Roscoe (Mothy) ETH Zurich Switzerland Acknowledgements ETH Zurich: Zachary Anderson, Pierre Evariste Dagand, Stefan Kästle,

More information

SCALABILITY AND HETEROGENEITY MICHAEL ROITZSCH

SCALABILITY AND HETEROGENEITY MICHAEL ROITZSCH Faculty of Computer Science Institute of Systems Architecture, Operating Systems Group SCALABILITY AND HETEROGENEITY MICHAEL ROITZSCH LAYER CAKE Application Runtime OS Kernel ISA Physical RAM 2 COMMODITY

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

Administration. Course material. Prerequisites. CS 395T: Topics in Multicore Programming. Instructors: TA: Course in computer architecture

Administration. Course material. Prerequisites. CS 395T: Topics in Multicore Programming. Instructors: TA: Course in computer architecture CS 395T: Topics in Multicore Programming Administration Instructors: Keshav Pingali (CS,ICES) 4.26A ACES Email: pingali@cs.utexas.edu TA: Xin Sui Email: xin@cs.utexas.edu University of Texas, Austin Fall

More information

Dynamic inter-core scheduling in Barrelfish

Dynamic inter-core scheduling in Barrelfish Dynamic inter-core scheduling in Barrelfish. avoiding contention with malleable domains Georgios Varisteas, Mats Brorsson, Karl-Filip Faxén November 25, 2011 Outline Introduction Scheduling & Programming

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number

More information

HPMMAP: Lightweight Memory Management for Commodity Operating Systems. University of Pittsburgh

HPMMAP: Lightweight Memory Management for Commodity Operating Systems. University of Pittsburgh HPMMAP: Lightweight Memory Management for Commodity Operating Systems Brian Kocoloski Jack Lange University of Pittsburgh Lightweight Experience in a Consolidated Environment HPC applications need lightweight

More information

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter

Lecture Topics. Announcements. Today: Advanced Scheduling (Stallings, chapter ) Next: Deadlock (Stallings, chapter Lecture Topics Today: Advanced Scheduling (Stallings, chapter 10.1-10.4) Next: Deadlock (Stallings, chapter 6.1-6.6) 1 Announcements Exam #2 returned today Self-Study Exercise #10 Project #8 (due 11/16)

More information

Performance comparison between a massive SMP machine and clusters

Performance comparison between a massive SMP machine and clusters Performance comparison between a massive SMP machine and clusters Martin Scarcia, Stefano Alberto Russo Sissa/eLab joint Democritos/Sissa Laboratory for e-science Via Beirut 2/4 34151 Trieste, Italy Stefano

More information

Capriccio : Scalable Threads for Internet Services

Capriccio : Scalable Threads for Internet Services Capriccio : Scalable Threads for Internet Services - Ron von Behren &et al - University of California, Berkeley. Presented By: Rajesh Subbiah Background Each incoming request is dispatched to a separate

More information

Performance Profiler. Klaus-Dieter Oertel Intel-SSG-DPD IT4I HPC Workshop, Ostrava,

Performance Profiler. Klaus-Dieter Oertel Intel-SSG-DPD IT4I HPC Workshop, Ostrava, Performance Profiler Klaus-Dieter Oertel Intel-SSG-DPD IT4I HPC Workshop, Ostrava, 08-09-2016 Faster, Scalable Code, Faster Intel VTune Amplifier Performance Profiler Get Faster Code Faster With Accurate

More information

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology

Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology Using Industry Standards to Exploit the Advantages and Resolve the Challenges of Multicore Technology September 19, 2007 Markus Levy, EEMBC and Multicore Association Enabling the Multicore Ecosystem Multicore

More information

Trends and Challenges in Multicore Programming

Trends and Challenges in Multicore Programming Trends and Challenges in Multicore Programming Eva Burrows Bergen Language Design Laboratory (BLDL) Department of Informatics, University of Bergen Bergen, March 17, 2010 Outline The Roadmap of Multicores

More information

Operating Systems 2 nd semester 2016/2017. Chapter 4: Threads

Operating Systems 2 nd semester 2016/2017. Chapter 4: Threads Operating Systems 2 nd semester 2016/2017 Chapter 4: Threads Mohamed B. Abubaker Palestine Technical College Deir El-Balah Note: Adapted from the resources of textbox Operating System Concepts, 9 th edition

More information

Barrelfish Project ETH Zurich. Routing in Barrelfish

Barrelfish Project ETH Zurich. Routing in Barrelfish Barrelfish Project ETH Zurich Routing in Barrelfish Barrelfish Technical Note 006 Akhilesh Singhania, Alexander Grest 01.12.2013 Systems Group Department of Computer Science ETH Zurich CAB F.79, Universitätstrasse

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

Introduction to parallel Computing

Introduction to parallel Computing Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts

More information

A unified multicore programming model

A unified multicore programming model A unified multicore programming model Simplifying multicore migration By Sven Brehmer Abstract There are a number of different multicore architectures and programming models available, making it challenging

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1

Introduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1 Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip

More information

AMD Opteron 4200 Series Processor

AMD Opteron 4200 Series Processor What s new in the AMD Opteron 4200 Series Processor (Codenamed Valencia ) and the new Bulldozer Microarchitecture? Platform Processor Socket Chipset Opteron 4000 Opteron 4200 C32 56x0 / 5100 (codenamed

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2017 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Дмитрий Рябцев, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture

More information

The Art of Parallel Processing

The Art of Parallel Processing The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a

More information

Achieving High Performance. Jim Cownie Principal Engineer SSG/DPD/TCAR Multicore Challenge 2013

Achieving High Performance. Jim Cownie Principal Engineer SSG/DPD/TCAR Multicore Challenge 2013 Achieving High Performance Jim Cownie Principal Engineer SSG/DPD/TCAR Multicore Challenge 2013 Does Instruction Set Matter? We find that ARM and x86 processors are simply engineering design points optimized

More information

How much energy can you save with a multicore computer for web applications?

How much energy can you save with a multicore computer for web applications? How much energy can you save with a multicore computer for web applications? Peter Strazdins Computer Systems Group, Department of Computer Science, The Australian National University seminar at Green

More information

Spring 2011 Parallel Computer Architecture Lecture 4: Multi-core. Prof. Onur Mutlu Carnegie Mellon University

Spring 2011 Parallel Computer Architecture Lecture 4: Multi-core. Prof. Onur Mutlu Carnegie Mellon University 18-742 Spring 2011 Parallel Computer Architecture Lecture 4: Multi-core Prof. Onur Mutlu Carnegie Mellon University Research Project Project proposal due: Jan 31 Project topics Does everyone have a topic?

More information

Hyperthreading Technology

Hyperthreading Technology Hyperthreading Technology Aleksandar Milenkovic Electrical and Computer Engineering Department University of Alabama in Huntsville milenka@ece.uah.edu www.ece.uah.edu/~milenka/ Outline What is hyperthreading?

More information

Bruce Shriver University of Tromsø, Norway. November 2010 Óbuda University, Hungary

Bruce Shriver University of Tromsø, Norway. November 2010 Óbuda University, Hungary Bruce Shriver University of Tromsø, Norway November 2010 Óbuda University, Hungary Reconfigurability Issues in Multi-core Systems The first lecture explored the thesis that reconfigurability is an integral

More information

Oversubscription on Multicore Processors

Oversubscription on Multicore Processors Oversubscription on Multicore Processors ostin Iancu, teven Hofmeyr, Filip lagojević, Yili Zheng Lawrence erkeley National Laboratory Parallel & Dtributed Processing (IPDP), / Motivation Increasingly parallel

More information

WHY PARALLEL PROCESSING? (CE-401)

WHY PARALLEL PROCESSING? (CE-401) PARALLEL PROCESSING (CE-401) COURSE INFORMATION 2 + 1 credits (60 marks theory, 40 marks lab) Labs introduced for second time in PP history of SSUET Theory marks breakup: Midterm Exam: 15 marks Assignment:

More information

FUSION1200 Scalable x86 SMP System

FUSION1200 Scalable x86 SMP System FUSION1200 Scalable x86 SMP System Introduction Life Sciences Departmental System Manufacturing (CAE) Departmental System Competitive Analysis: IBM x3950 Competitive Analysis: SUN x4600 / SUN x4600 M2

More information

Dealing with Heterogeneous Multicores

Dealing with Heterogeneous Multicores Dealing with Heterogeneous Multicores François Bodin INRIA-UIUC, June 12 th, 2009 Introduction Main stream applications will rely on new multicore / manycore architectures It is about performance not parallelism

More information

Performance Tools for Technical Computing

Performance Tools for Technical Computing Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University Intel Software Conference 2010 April 13th, Barcelona, Spain Agenda o Motivation and Methodology

More information

Carlo Cavazzoni, HPC department, CINECA

Carlo Cavazzoni, HPC department, CINECA Introduction to Shared memory architectures Carlo Cavazzoni, HPC department, CINECA Modern Parallel Architectures Two basic architectural scheme: Distributed Memory Shared Memory Now most computers have

More information

Overview of research activities Toward portability of performance

Overview of research activities Toward portability of performance Overview of research activities Toward portability of performance Do dynamically what can t be done statically Understand evolution of architectures Enable new programming models Put intelligence into

More information

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU

NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU NVIDIA Think about Computing as Heterogeneous One Leo Liao, 1/29/2106, NTU GPGPU opens the door for co-design HPC, moreover middleware-support embedded system designs to harness the power of GPUaccelerated

More information

Leveraging OpenSPARC. ESA Round Table 2006 on Next Generation Microprocessors for Space Applications EDD

Leveraging OpenSPARC. ESA Round Table 2006 on Next Generation Microprocessors for Space Applications EDD Leveraging OpenSPARC ESA Round Table 2006 on Next Generation Microprocessors for Space Applications G.Furano, L.Messina TEC- OpenSPARC T1 The T1 is a new-from-the-ground-up SPARC microprocessor implementation

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows

More information

Improving Packet Processing Performance of a Memory- Bounded Application

Improving Packet Processing Performance of a Memory- Bounded Application Improving Packet Processing Performance of a Memory- Bounded Application Jörn Schumacher CERN / University of Paderborn, Germany jorn.schumacher@cern.ch On behalf of the ATLAS FELIX Developer Team LHCb

More information

Designing High Performance Communication Middleware with Emerging Multi-core Architectures

Designing High Performance Communication Middleware with Emerging Multi-core Architectures Designing High Performance Communication Middleware with Emerging Multi-core Architectures Dhabaleswar K. (DK) Panda Department of Computer Science and Engg. The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

EE/CSCI 451 Introduction to Parallel and Distributed Computation. Discussion #4 2/3/2017 University of Southern California

EE/CSCI 451 Introduction to Parallel and Distributed Computation. Discussion #4 2/3/2017 University of Southern California EE/CSCI 451 Introduction to Parallel and Distributed Computation Discussion #4 2/3/2017 University of Southern California 1 USC HPCC Access Compile Submit job OpenMP Today s topic What is OpenMP OpenMP

More information

Parallel Programming on Larrabee. Tim Foley Intel Corp

Parallel Programming on Larrabee. Tim Foley Intel Corp Parallel Programming on Larrabee Tim Foley Intel Corp Motivation This morning we talked about abstractions A mental model for GPU architectures Parallel programming models Particular tools and APIs This

More information

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004

A Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004 A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into

More information

Administration. Prerequisites. CS 395T: Topics in Multicore Programming. Why study parallel programming? Instructors: TA:

Administration. Prerequisites. CS 395T: Topics in Multicore Programming. Why study parallel programming? Instructors: TA: CS 395T: Topics in Multicore Programming Administration Instructors: Keshav Pingali (CS,ICES) 4.126A ACES Email: pingali@cs.utexas.edu TA: Aditya Rawal Email: 83.aditya.rawal@gmail.com University of Texas,

More information

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP

Shared Memory Parallel Programming. Shared Memory Systems Introduction to OpenMP Shared Memory Parallel Programming Shared Memory Systems Introduction to OpenMP Parallel Architectures Distributed Memory Machine (DMP) Shared Memory Machine (SMP) DMP Multicomputer Architecture SMP Multiprocessor

More information

Intel VTune Amplifier XE

Intel VTune Amplifier XE Intel VTune Amplifier XE Vladimir Tsymbal Performance, Analysis and Threading Lab 1 Agenda Intel VTune Amplifier XE Overview Features Data collectors Analysis types Key Concepts Collecting performance

More information

Parallel Programming for Graphics

Parallel Programming for Graphics Beyond Programmable Shading Course ACM SIGGRAPH 2010 Parallel Programming for Graphics Aaron Lefohn Advanced Rendering Technology (ART) Intel What s In This Talk? Overview of parallel programming models

More information

Best Practices for Setting BIOS Parameters for Performance

Best Practices for Setting BIOS Parameters for Performance White Paper Best Practices for Setting BIOS Parameters for Performance Cisco UCS E5-based M3 Servers May 2013 2014 Cisco and/or its affiliates. All rights reserved. This document is Cisco Public. Page

More information

CellSs Making it easier to program the Cell Broadband Engine processor

CellSs Making it easier to program the Cell Broadband Engine processor Perez, Bellens, Badia, and Labarta CellSs Making it easier to program the Cell Broadband Engine processor Presented by: Mujahed Eleyat Outline Motivation Architecture of the cell processor Challenges of

More information

Arquitecturas y Modelos de. Multicore

Arquitecturas y Modelos de. Multicore Arquitecturas y Modelos de rogramacion para Multicore 17 Septiembre 2008 Castellón Eduard Ayguadé Alex Ramírez Opening statements * Some visionaries already predicted multicores 30 years ago And they have

More information

Chapter 4: Threads. Chapter 4: Threads

Chapter 4: Threads. Chapter 4: Threads Chapter 4: Threads Silberschatz, Galvin and Gagne 2013 Chapter 4: Threads Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading Issues Operating System Examples

More information

EXOCHI: Architecture and Programming Environment for A Heterogeneous Multicore Multithreaded System

EXOCHI: Architecture and Programming Environment for A Heterogeneous Multicore Multithreaded System EXOCHI: Architecture and Programming Environment for A Heterogeneous Multicore Multithreaded System By Perry H. Wang, Jamison D. Collins, Gautham N. Chinya, Hong Jiang, Xinmin Tian, Milind Girkar, Nick

More information

CS420: Operating Systems

CS420: Operating Systems Threads James Moscola Department of Physical Sciences York College of Pennsylvania Based on Operating System Concepts, 9th Edition by Silberschatz, Galvin, Gagne Threads A thread is a basic unit of processing

More information

Decelerating Suspend and Resume in Operating Systems. XSEL, Purdue ECE Shuang Zhai, Liwei Guo, Xiangyu Li, and Felix Xiaozhu Lin 02/21/2017

Decelerating Suspend and Resume in Operating Systems. XSEL, Purdue ECE Shuang Zhai, Liwei Guo, Xiangyu Li, and Felix Xiaozhu Lin 02/21/2017 Decelerating Suspend and Resume in Operating Systems XSEL, Purdue ECE Shuang Zhai, Liwei Guo, Xiangyu Li, and Felix Xiaozhu Lin 02/21/2017 1 Mobile/IoT devices see many short-lived tasks Sleeping for a

More information

Systems Programming and Computer Architecture ( ) Timothy Roscoe

Systems Programming and Computer Architecture ( ) Timothy Roscoe Systems Group Department of Computer Science ETH Zürich Systems Programming and Computer Architecture (252-0061-00) Timothy Roscoe Herbstsemester 2016 AS 2016 Caches 1 16: Caches Computer Architecture

More information

Programming for Fujitsu Supercomputers

Programming for Fujitsu Supercomputers Programming for Fujitsu Supercomputers Koh Hotta The Next Generation Technical Computing Fujitsu Limited To Programmers who are busy on their own research, Fujitsu provides environments for Parallel Programming

More information

SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS

SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CSAIL IAP MEETING MAY 21, 2013 Research Agenda Lack of technology progress Moore s Law still alive Power

More information

Revisiting the Past 25 Years: Lessons for the Future. Guri Sohi University of Wisconsin-Madison

Revisiting the Past 25 Years: Lessons for the Future. Guri Sohi University of Wisconsin-Madison Revisiting the Past 25 Years: Lessons for the Future Guri Sohi University of Wisconsin-Madison Outline VLIW OOO Superscalar Enhancing Superscalar And the future 2 Beyond pipelining to ILP Late 1980s to

More information

Evaluating the Portability of UPC to the Cell Broadband Engine

Evaluating the Portability of UPC to the Cell Broadband Engine Evaluating the Portability of UPC to the Cell Broadband Engine Dipl. Inform. Ruben Niederhagen JSC Cell Meeting CHAIR FOR OPERATING SYSTEMS Outline Introduction UPC Cell UPC on Cell Mapping Compiler and

More information

Chapter 4: Multithreaded Programming

Chapter 4: Multithreaded Programming Chapter 4: Multithreaded Programming Silberschatz, Galvin and Gagne 2013 Chapter 4: Multithreaded Programming Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading

More information

Intel VTune Amplifier XE. Dr. Michael Klemm Software and Services Group Developer Relations Division

Intel VTune Amplifier XE. Dr. Michael Klemm Software and Services Group Developer Relations Division Intel VTune Amplifier XE Dr. Michael Klemm Software and Services Group Developer Relations Division Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS

More information

Why you should care about hardware locality and how.

Why you should care about hardware locality and how. Why you should care about hardware locality and how. Brice Goglin TADaaM team Inria Bordeaux Sud-Ouest Agenda Quick example as an introduction Bind your processes What's the actual problem? Convenient

More information

Barrelfish Project ETH Zurich. Message Notifications

Barrelfish Project ETH Zurich. Message Notifications Barrelfish Project ETH Zurich Message Notifications Barrelfish Technical Note 9 Barrelfish project 16.06.2010 Systems Group Department of Computer Science ETH Zurich CAB F.79, Universitätstrasse 6, Zurich

More information

A brief introduction to OpenMP

A brief introduction to OpenMP A brief introduction to OpenMP Alejandro Duran Barcelona Supercomputing Center Outline 1 Introduction 2 Writing OpenMP programs 3 Data-sharing attributes 4 Synchronization 5 Worksharings 6 Task parallelism

More information

First Experiences with Intel Cluster OpenMP

First Experiences with Intel Cluster OpenMP First Experiences with Intel Christian Terboven, Dieter an Mey, Dirk Schmidl, Marcus Wagner surname@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University, Germany IWOMP 2008 May

More information

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed

More information

Technical Computing Suite supporting the hybrid system

Technical Computing Suite supporting the hybrid system Technical Computing Suite supporting the hybrid system Supercomputer PRIMEHPC FX10 PRIMERGY x86 cluster Hybrid System Configuration Supercomputer PRIMEHPC FX10 PRIMERGY x86 cluster 6D mesh/torus Interconnect

More information

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures

Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures University of Virginia Dept. of Computer Science Technical Report #CS-2011-09 Jeremy W. Sheaffer and Kevin

More information

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)

EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) Dept. of Computer Science & Engineering Chentao Wu wuct@cs.sjtu.edu.cn Download lectures ftp://public.sjtu.edu.cn User:

More information

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Cache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012) Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues

More information

Capriccio: Scalable Threads for Internet Services

Capriccio: Scalable Threads for Internet Services Capriccio: Scalable Threads for Internet Services Rob von Behren, Jeremy Condit, Feng Zhou, Geroge Necula and Eric Brewer University of California at Berkeley Presenter: Cong Lin Outline Part I Motivation

More information

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation

More information

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...

A common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads... OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.

More information

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design

Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

High Performance Computing Systems

High Performance Computing Systems High Performance Computing Systems Multikernels Doug Shook Multikernels Two predominant approaches to OS: Full weight kernel Lightweight kernel Why not both? How does implementation affect usage and performance?

More information

Parallel Algorithm Engineering

Parallel Algorithm Engineering Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples

More information

CSE 160 Lecture 8. NUMA OpenMP. Scott B. Baden

CSE 160 Lecture 8. NUMA OpenMP. Scott B. Baden CSE 160 Lecture 8 NUMA OpenMP Scott B. Baden OpenMP Today s lecture NUMA Architectures 2013 Scott B. Baden / CSE 160 / Fall 2013 2 OpenMP A higher level interface for threads programming Parallelization

More information