Design Principles for End-to-End Multicore Schedulers
1 Systems Group, Department of Computer Science, ETH Zürich. HotPar '10. Design Principles for End-to-End Multicore Schedulers. Simon Peter, Adrian Schüpbach, Paul Barham, Andrew Baumann, Rebecca Isaacs, Tim Harris, Timothy Roscoe. Systems Group, ETH Zurich, and Microsoft Research.
2 Context: Barrelfish, a multikernel operating system developed at ETH Zurich and Microsoft Research: a scalable research OS for heterogeneous multicore hardware, covering operating system principles and structure as well as programming models and language runtime systems. Other scalable OS approaches (Tessellation, Corey, ROS, fos, ...) are similar, so the ideas in this talk are more widely applicable.
3 Today's talk topic: an OS scheduler architecture for today's (and tomorrow's) multicore machines, in a general-purpose setting: a dynamic workload mix, multiple parallel apps, interactive parallel apps.
4 Why this is a problem: a simple example. Run 2 OpenMP applications concurrently on a 16-core AMD Shanghai system, using the Intel OpenMP library on Linux.
5 Why this is a problem. Example: 2x OpenMP on 16-core Linux. One app is CPU-bound:

    #pragma omp parallel
    for (;;)
        iterations[omp_get_thread_num()]++;

The other is synchronization-intensive (e.g. BARRIER):

    #pragma omp parallel
    for (;;) {
        #pragma omp barrier
        iterations[omp_get_thread_num()]++;
    }
6 Why this is a problem. Example: 2x OpenMP on 16-core Linux. Run for x in [2..16]:

    OMP_NUM_THREADS=x ./BARRIER &
    OMP_NUM_THREADS=8 ./cpu_bound &
    sleep 20
    killall BARRIER cpu_bound

Plot the average iterations/thread/s over the 20 s.
7 Why this is a problem. Example: 2x OpenMP on 16-core Linux. [Figure: relative rate of progress of CPU-Bound and BARRIER vs. number of BARRIER threads.] Up to 8 BARRIER threads (space-partitioning): CPU-Bound stays at 1, since its thread allocation is unchanged, while BARRIER degrades due to the increasing cost of the barrier. From 9 threads on (threads > cores, time-multiplexing): CPU-Bound degrades linearly, and BARRIER drops sharply, since it only makes progress when all its threads run concurrently.
8 Why this is a problem. Example: 2x OpenMP on 16-core Linux. Gang scheduling or smart core allocation would help. Gang scheduling: the OS is unaware of the apps' requirements, but the run-time system could have known them, e.g. via annotations or the compiler. Smart core allocation: the OS knows the general system state, but the run-time system chooses the number of threads. Information and mechanisms are in the wrong place.
9 Why this is a problem. Example: 2x OpenMP on 16-core Linux. [Figure: same experiment, with min/max error bars over 20 runs.] The error bars are huge, caused by the random placement of threads on cores.
10 Why this is a problem: the 16-core AMD Shanghai system. [Figure: four quad-core dies, each with a shared L3 cache, connected by HyperTransport links.] Same-die L3 access is twice as fast as cross-die access, and the OpenMP run-time does not know about this machine.
11 Why this is a problem. Example: 2x OpenMP on 16-core Linux. [Figure: relative rate of progress vs. number of BARRIER threads, highlighting the large performance difference between runs in one thread-count case.]
12 Why this is a problem: system diversity. [Figure: block diagrams of three machines — Sun Niagara T2 (flat, fast cache hierarchy, full crossbar, FB-DIMM memory, 2x 10 Gigabit Ethernet NIU), AMD Opteron "Magny-Cours" (on-chip interconnect, per-die L3 caches and HT3 links), and Intel Nehalem-EX "Beckton" (on-die ring network).] Manual tuning is increasingly difficult: architectures change too quickly, and offline auto-tuning (e.g. ATLAS) is limited.
13 Online adaptation remains viable, and is easier with contemporary runtime systems (OpenMP, Grand Central Dispatch, ConcRT, MPI, ...), where synchronization patterns are more explicit. But it needs information at the right places.
14 The end-to-end approach. The system stack, with related work at each layer:

    Hardware                heterogeneous, ...
    OS scheduler            CAMP, HASS, ...
    Runtime systems         OpenMP, MPI, ConcRT, McRT, ...
    Compilers               auto-parallelization, ...
    Programming paradigms   MapReduce, ICC, ...
    Applications            annotations, ...

Involve all components, top to bottom; this requires cutting through classical OS abstractions. Here we focus on OS / runtime-system integration.
15 Design Principles
16 Design principles. 1. Time-multiplexing cores is still needed; core abundance does not give the scheduler complete freedom. Asymmetric multicore architectures mean contention for the big cores; interactive apps need real-time QoS without wasting cores on them; and over-provisioning wastes power.
17 Design principles. 2. Schedule at multiple timescales. Interactive workloads are now parallel, and their requirements might change abruptly (e.g. a parallel web browser) on much shorter, interactive time scales. Scheduling overhead must therefore be small, and synchronized scheduling on every time-slice won't scale.
18 Implementation in Barrelfish: a combination of techniques at different time granularities. Long-term placement of apps on cores; medium-term resource allocation; short-term per-core scheduling; and phase-locked gang scheduling, i.e. gang scheduling over interactive timescales.
19 Phase-locked gang scheduling: decouple schedule synchronization from dispatch. [Traces: under best-effort scheduling, the gang makes progress only in the small time windows where all its threads happen to run; under phase-locked gang scheduling, the windows align.] The cores synchronize their core-local clocks, agree on a future gang start time and a gang period, and resync in the future only when necessary.
20 Design principles. 3. Reason online about the hardware. We employ a system knowledge base (SKB) containing a rich representation of the hardware, queried in a subset of first-order logic; logical unification helps deal with diversity. Both the OS and applications use it.
21 Design principles. 4. Reason online about each application. The OS should exploit knowledge about apps for efficiency: e.g. gang-schedule the threads in an OpenMP team, but there is no sense in gang-scheduling unrelated threads. A single app might go through different phases, so the optimal resource allocation changes over time. Implementation: apps submit scheduling manifests to a planner; these contain predicted long-term resource requirements, expressed as constrained cost functions, and may make use of any information in the SKB.
22 Design principles. 5. Applications and OS must communicate, implementing the end-to-end principle: resource allocation may be renegotiated at runtime. Implementation: hardware threads run user-level dispatchers (cf. Psyche, inheritance scheduling). Related dispatchers are grouped into dispatcher groups (derived from the RTIDs of McRT), which serve as handles when renegotiating. Scheduler activations [Anderson 1992] inform the app of changes.
23 Implementation in the Barrelfish OS. [Figure: one user-level dispatcher (Disp) per core; related dispatchers are grouped into dispatcher groups D1 and D2.]
24 Open questions. What are appropriate mechanisms and timescales for inter-core phase synchronization? How can programmers provide useful concurrency information to the runtime? How efficiently can the runtime specify its requirements to the OS? Is there a hidden cost (if any) to decoupling scheduling timescales? What are the tradeoffs between centralized and distributed planners? What is the appropriate level of expressivity for the SKB?
EXOCHI: Architecture and Programming Environment for A Heterogeneous Multicore Multithreaded System By Perry H. Wang, Jamison D. Collins, Gautham N. Chinya, Hong Jiang, Xinmin Tian, Milind Girkar, Nick
More informationCS420: Operating Systems
Threads James Moscola Department of Physical Sciences York College of Pennsylvania Based on Operating System Concepts, 9th Edition by Silberschatz, Galvin, Gagne Threads A thread is a basic unit of processing
More informationDecelerating Suspend and Resume in Operating Systems. XSEL, Purdue ECE Shuang Zhai, Liwei Guo, Xiangyu Li, and Felix Xiaozhu Lin 02/21/2017
Decelerating Suspend and Resume in Operating Systems XSEL, Purdue ECE Shuang Zhai, Liwei Guo, Xiangyu Li, and Felix Xiaozhu Lin 02/21/2017 1 Mobile/IoT devices see many short-lived tasks Sleeping for a
More informationSystems Programming and Computer Architecture ( ) Timothy Roscoe
Systems Group Department of Computer Science ETH Zürich Systems Programming and Computer Architecture (252-0061-00) Timothy Roscoe Herbstsemester 2016 AS 2016 Caches 1 16: Caches Computer Architecture
More informationProgramming for Fujitsu Supercomputers
Programming for Fujitsu Supercomputers Koh Hotta The Next Generation Technical Computing Fujitsu Limited To Programmers who are busy on their own research, Fujitsu provides environments for Parallel Programming
More informationSOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS
SOFTWARE-DEFINED MEMORY HIERARCHIES: SCALABILITY AND QOS IN THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CSAIL IAP MEETING MAY 21, 2013 Research Agenda Lack of technology progress Moore s Law still alive Power
More informationRevisiting the Past 25 Years: Lessons for the Future. Guri Sohi University of Wisconsin-Madison
Revisiting the Past 25 Years: Lessons for the Future Guri Sohi University of Wisconsin-Madison Outline VLIW OOO Superscalar Enhancing Superscalar And the future 2 Beyond pipelining to ILP Late 1980s to
More informationEvaluating the Portability of UPC to the Cell Broadband Engine
Evaluating the Portability of UPC to the Cell Broadband Engine Dipl. Inform. Ruben Niederhagen JSC Cell Meeting CHAIR FOR OPERATING SYSTEMS Outline Introduction UPC Cell UPC on Cell Mapping Compiler and
More informationChapter 4: Multithreaded Programming
Chapter 4: Multithreaded Programming Silberschatz, Galvin and Gagne 2013 Chapter 4: Multithreaded Programming Overview Multicore Programming Multithreading Models Thread Libraries Implicit Threading Threading
More informationIntel VTune Amplifier XE. Dr. Michael Klemm Software and Services Group Developer Relations Division
Intel VTune Amplifier XE Dr. Michael Klemm Software and Services Group Developer Relations Division Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS
More informationWhy you should care about hardware locality and how.
Why you should care about hardware locality and how. Brice Goglin TADaaM team Inria Bordeaux Sud-Ouest Agenda Quick example as an introduction Bind your processes What's the actual problem? Convenient
More informationBarrelfish Project ETH Zurich. Message Notifications
Barrelfish Project ETH Zurich Message Notifications Barrelfish Technical Note 9 Barrelfish project 16.06.2010 Systems Group Department of Computer Science ETH Zurich CAB F.79, Universitätstrasse 6, Zurich
More informationA brief introduction to OpenMP
A brief introduction to OpenMP Alejandro Duran Barcelona Supercomputing Center Outline 1 Introduction 2 Writing OpenMP programs 3 Data-sharing attributes 4 Synchronization 5 Worksharings 6 Task parallelism
More informationFirst Experiences with Intel Cluster OpenMP
First Experiences with Intel Christian Terboven, Dieter an Mey, Dirk Schmidl, Marcus Wagner surname@rz.rwth aachen.de Center for Computing and Communication RWTH Aachen University, Germany IWOMP 2008 May
More informationIntroduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano
Introduction to Multiprocessors (Part I) Prof. Cristina Silvano Politecnico di Milano Outline Key issues to design multiprocessors Interconnection network Centralized shared-memory architectures Distributed
More informationTechnical Computing Suite supporting the hybrid system
Technical Computing Suite supporting the hybrid system Supercomputer PRIMEHPC FX10 PRIMERGY x86 cluster Hybrid System Configuration Supercomputer PRIMEHPC FX10 PRIMERGY x86 cluster 6D mesh/torus Interconnect
More informationFractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures
Fractal: A Software Toolchain for Mapping Applications to Diverse, Heterogeneous Architecures University of Virginia Dept. of Computer Science Technical Report #CS-2011-09 Jeremy W. Sheaffer and Kevin
More informationEI 338: Computer Systems Engineering (Operating Systems & Computer Architecture)
EI 338: Computer Systems Engineering (Operating Systems & Computer Architecture) Dept. of Computer Science & Engineering Chentao Wu wuct@cs.sjtu.edu.cn Download lectures ftp://public.sjtu.edu.cn User:
More informationCache Coherence. CMU : Parallel Computer Architecture and Programming (Spring 2012)
Cache Coherence CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Shared memory multi-processor Processors read and write to shared variables - More precisely: processors issues
More informationCapriccio: Scalable Threads for Internet Services
Capriccio: Scalable Threads for Internet Services Rob von Behren, Jeremy Condit, Feng Zhou, Geroge Necula and Eric Brewer University of California at Berkeley Presenter: Cong Lin Outline Part I Motivation
More informationGenerating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory
Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation
More informationA common scenario... Most of us have probably been here. Where did my performance go? It disappeared into overheads...
OPENMP PERFORMANCE 2 A common scenario... So I wrote my OpenMP program, and I checked it gave the right answers, so I ran some timing tests, and the speedup was, well, a bit disappointing really. Now what?.
More informationModern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design
Modern Processor Architectures (A compiler writer s perspective) L25: Modern Compiler Design The 1960s - 1970s Instructions took multiple cycles Only one instruction in flight at once Optimisation meant
More informationCSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.
CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance
More informationHigh Performance Computing Systems
High Performance Computing Systems Multikernels Doug Shook Multikernels Two predominant approaches to OS: Full weight kernel Lightweight kernel Why not both? How does implementation affect usage and performance?
More informationParallel Algorithm Engineering
Parallel Algorithm Engineering Kenneth S. Bøgh PhD Fellow Based on slides by Darius Sidlauskas Outline Background Current multicore architectures UMA vs NUMA The openmp framework and numa control Examples
More informationCSE 160 Lecture 8. NUMA OpenMP. Scott B. Baden
CSE 160 Lecture 8 NUMA OpenMP Scott B. Baden OpenMP Today s lecture NUMA Architectures 2013 Scott B. Baden / CSE 160 / Fall 2013 2 OpenMP A higher level interface for threads programming Parallelization
More information