Multiprocessor OS
COMP9242 Advanced Operating Systems
Overview
COMP9242 Advanced Operating Systems, Ihor Kuz, S2/2015, Week 12
- Multiprocessor OS: how does it work? Scalability (review)
- Multiprocessor hardware: contemporary systems (Intel, AMD, ARM, Oracle/Sun); experimental and future systems (Intel, MS, Polaris)
- OS design for multiprocessors: guidelines; design approaches: divide and conquer (Disco, Tessellation), reduce sharing (K42, Corey, Linux, FlexSC, scalable commutativity), no sharing (Barrelfish, fos)

Uniprocessor OS
[Figure: applications (App1, App2) above a single OS instance; OS data structures in memory: run queue data, process control blocks, FS structs, application data]
Multiprocessor OS
[Figure: multiple applications (App1-App4) running across processors, sharing one OS instance and its data structures (run queue data, process control blocks, FS structs, application data) in memory]
Key design challenges:
- Correctness of (shared) data structures
- Scalability

Correctness of Shared Data
- Concurrency control: locks, semaphores, transactions, lock-free data structures
- We know how to do this: in the application, in the OS

Scalability
- Speedup as more processors are added
- Ideal speedup: S(N) = T_1 / T_N, where T_N is the execution time on N processors
[Figure: ideal speedup S(N) grows linearly with the number of processors N]
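The speedup metric S(N) = T_1 / T_N can be checked numerically against Amdahl's law. A minimal sketch in C; `speedup` is an illustrative helper name, not from the slides:

```c
#include <assert.h>

/* Speedup predicted by Amdahl's law: S(N) = 1 / ((1 - P) + P/N),
 * where P is the parallelisable fraction of the work and N the
 * number of processors. */
static double speedup(double P, int N)
{
    return 1.0 / ((1.0 - P) + P / (double)N);
}
```

Even with 95% of the work parallelisable, 48 cores yield only about 14x speedup, and no number of cores exceeds 1/(1-P) = 20x.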
Scalability
- Speedup as more processors are added: in reality, S(N) = T_1 / T_N falls away from the ideal line

Scalability and Serialisation
- Remember Amdahl's law: the serial (non-parallel) portion is when the application is not running on all cores; serialisation prevents scalability
- With parallelisable fraction P:
  T_1 = (1 - P) + P = 1
  T_N = (1 - P) + P/N
  S(N) = T_1 / T_N = 1 / ((1 - P) + P/N)
  S(∞) ≈ 1 / (1 - P)

OS Serialisation
- Where does serialisation show up? In the application (e.g. access to shared app data) and in the OS (e.g. performing a syscall for the app)
- How much time is spent in the OS?
- Sources of serialisation:
  - Locking (explicit serialisation): waiting for a lock stalls self; lock implementation: atomic operations lock the bus and stall everyone; cache coherence traffic loads the bus and slows down others
  - Memory access (implicit): relatively high latency to memory stalls self
  - Cache (implicit): processor stalled while a cache line is fetched or invalidated; affected by interconnect latency; performance depends on data size (cache lines) and contention (number of cores)

More Cache-related Serialisation
- False sharing: unrelated data structures share the same cache line; accessed from different processors, this causes cache coherence traffic and delay
- Cache line bouncing: shared R/W on many processors, e.g. bouncing due to locks: each processor spinning on a lock brings it into its own cache, causing cache coherence traffic and delay
- Cache misses: potentially a direct memory access, which stalls self. When does a cache miss occur? The application accesses data for the first time; the application runs on a new core; cached memory has been evicted (cache footprint too big, another app ran, the OS ran)
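A standard fix for the false sharing described above is to pad per-core data out to cache-line boundaries so two cores never write the same line. A minimal C11 sketch; the 64-byte line size and all names are assumptions (real line sizes are hardware-dependent):

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Assumed cache line size; query the hardware in real code. */
#define CACHE_LINE 64

/* One counter per cache line: updates from different cores touch
 * different lines, so no coherence traffic from false sharing. */
struct padded_counter {
    alignas(CACHE_LINE) unsigned long value;
};
```

Adjacent elements of an array of `struct padded_counter` land exactly one cache line apart, at the cost of wasted memory.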
Multi-What?
- Multiprocessor, SMP: >1 separate processors, connected by an off-chip bus
- Multicore: >1 processing cores in a single processor, connected by an on-chip bus
- Multithread, SMT: >1 hardware threads in a single core
- Multicore + multiprocessor: >1 multicore processors; >1 multicore dies in a package (multi-chip module)

Interesting Properties of Multiprocessors
- Scale and structure: how many cores and processors are there, what kinds are they, and how are they organised?
- Interconnect: how are the cores and processors connected?
- Memory locality and caches: where is the memory, and what is the cache architecture?
- Interprocessor communication: how do cores and processors send messages to each other?

Contemporary Multiprocessor Hardware
- Intel: Nehalem, Westmere: 10 cores, QPI; Sandy Bridge, Ivy Bridge: ring bus, integrated GPU, L3, IO; Haswell (Broadwell): 18 cores, ring bus, transactional memory, slices (EP)
- AMD: K10 (Opteron: Barcelona, Magny Cours): 12 cores, HyperTransport; Bulldozer, Piledriver, Steamroller (Opteron, FX): 16 cores, Clustered Multithread: module with 2 integer cores
- Oracle (Sun): UltraSPARC T1, T2, T3, T4, T5 (Niagara): 16 cores, 8 threads/core (2 simultaneous), crossbar, 8 sockets
- ARM: Cortex A9, A15 MPCore, big.LITTLE: 4-8 cores; big.LITTLE: A7 + A15
Scale and Structure
[Figure slides: ARM Cortex A9 MPCore; ARM big.LITTLE; Intel Nehalem]

Interconnect
[Figure slide: AMD Barcelona]
Memory Locality and Caches / Interprocessor Communication
[Figure slide: Oracle SPARC T2 block diagram: 8 cores (C0-C7, each with FPU and SPU), full crossbar to 8 L2 cache banks, 4 memory controllers (MCU) to FB-DIMMs, NIU (10 Gigabit Ethernet), system interface, buffer switch core, PCIe; 2.0 GHz, power <95 W. From Sun/Oracle]

Interprocessor Communication / Structure / Memory
[Further figure slides]
Experimental/Future Multiprocessor Hardware
- Microsoft Beehive: ring bus, no cache coherence
- Tilera Tile64, Tile-Gx: 100 cores, mesh network
- Intel Polaris: 80 cores, mesh network
- Intel SCC: 48 cores, mesh network, no cache coherency
- Intel MIC (Many Integrated Core; Knights Corner, Xeon Phi): 60+ cores, ring bus

Scale and Structure
[Figure slide: Tilera Tile64 (newest: EZchip TILE-Gx) and Intel Polaris: grid of tiles, each with a processor (register file, L1 I/D caches, I/D TLBs, DMA engine), L2 cache, and a switch onto the mesh networks (MDN, TDN, UDN, IDN, STN); DDR2 controllers and I/O (PCIe, GbE, UART, HPI, I2C, JTAG, SPI) at the edges]

Cache and Memory
[Figure slide: Intel SCC. From techresearch.intel.com/spaw2/uploads/files/scc_platform_overview.pdf]

Interprocessor Communication
[Figure slide: Beehive. From projects.csail.mit.edu/beehive/beehivev5.pdf]
Interprocessor Communication
[Figure slide: Intel MIC (Many Integrated Core; Knights Corner/Landing, Xeon Phi)]

Summary: Multiprocessor Hardware
- Scalability: 100+ cores, Amdahl's law really kicks in
- Heterogeneity: heterogeneous cores, memory, etc.; properties of similar systems may vary wildly (e.g. interconnect topology and latencies between different AMD platforms)
- NUMA: also variable latencies due to topology and cache coherence
- Cache coherence may not be possible: can't use it for locking; shared data structures require explicit work
- The computer is a distributed system: message passing, consistency and synchronisation, fault tolerance

OS Design for Multiprocessors

Optimisation for Scalability
- Reduce the amount of code in critical sections: increases concurrency
  - Fine-grained locking: lock data, not code; tradeoff: more concurrency but more locking (and locking causes serialisation)
  - Lock-free data structures
- Avoid expensive memory access: avoid uncached memory, access cheap (close) memory
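The "lock data, not code" idea can be sketched as a hash table with one lock per bucket, so operations on different buckets never serialise each other. All names and sizes are illustrative assumptions, using POSIX threads:

```c
#include <assert.h>
#include <pthread.h>

#define NBUCKETS 16

/* Fine-grained locking: each bucket carries its own mutex, instead of
 * one big lock around the whole table. */
struct bucket {
    pthread_mutex_t lock;
    long value;                       /* stand-in for a chain of entries */
};

struct table {
    struct bucket b[NBUCKETS];
};

static void table_init(struct table *t)
{
    for (int i = 0; i < NBUCKETS; i++) {
        pthread_mutex_init(&t->b[i].lock, NULL);
        t->b[i].value = 0;
    }
}

static void table_add(struct table *t, unsigned key, long delta)
{
    struct bucket *b = &t->b[key % NBUCKETS];
    pthread_mutex_lock(&b->lock);     /* only this bucket is serialised */
    b->value += delta;
    pthread_mutex_unlock(&b->lock);
}

static long table_get(struct table *t, unsigned key)
{
    struct bucket *b = &t->b[key % NBUCKETS];
    pthread_mutex_lock(&b->lock);
    long v = b->value;
    pthread_mutex_unlock(&b->lock);
    return v;
}
```

The tradeoff from the slide is visible here: more concurrency, but one lock/unlock pair per operation rather than per batch.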
Optimisation for Scalability
- Reduce false sharing: pad data structures to cache lines
- Reduce cache line bouncing: reduce sharing, e.g. MCS locks use local data
- Reduce cache misses: affinity scheduling (run a process on the core where it last ran), avoid cache pollution

Design Guidelines for Modern (and Future) Multiprocessors
- Avoid shared data: performance issues arise less from lock contention than from data locality
- Explicit communication: regain control over communication costs (and predictability); sometimes it's the only option
- Tradeoff: parallelism vs synchronisation: synchronisation introduces serialisation; make concurrent threads independent: reduce critical sections and cache misses
- Allocate for locality: e.g. provide memory local to a core
- Schedule for locality: with cached data, with local memory
- Tradeoff: uniprocessor performance vs scalability

Design Approaches
- Divide and conquer: divide the multiprocessor into smaller bits and use them as normal, using virtualisation or an exokernel
- Reduced sharing:
  - Brute force and heroic effort: find problems in the existing OS and fix them, e.g. Linux re-architecting: BKL -> fine-grained locking
  - By design: avoid shared data as much as possible
- No sharing: the computer is a distributed system; do extra work to share!

Divide and Conquer: Disco
- Scalability is too hard!
- Context: ca. 1995, large ccNUMA multiprocessors appearing; scaling OSes requires extensive modifications
- Idea: implement a scalable VMM and run multiple OS instances on it
- The VMM has most of the features of a scalable OS: NUMA-aware allocator, page replication, remapping, etc.
- The VMM is substantially simpler/cheaper to implement
- Modern incarnations: virtual servers (Amazon, etc.), research (Cerberus)
Running Commodity Operating Systems on Scalable Multiprocessors [Bugnion et al., 1997]
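The MCS locks cited earlier as an example of reducing cache-line bouncing make each waiter spin on a flag in its own queue node, so the spinning stays in the local cache instead of hammering a shared lock word. A minimal C11 sketch with `<stdatomic.h>` and POSIX threads; all names are illustrative, and this is not the slides' code:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
};

typedef _Atomic(struct mcs_node *) mcs_lock_t;   /* tail of waiter queue */

static void mcs_acquire(mcs_lock_t *lock, struct mcs_node *me)
{
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    struct mcs_node *pred = atomic_exchange(lock, me); /* join queue tail */
    if (pred != NULL) {
        atomic_store(&pred->next, me);      /* link behind predecessor */
        while (atomic_load(&me->locked))
            ;                               /* spin on OUR node only */
    }
}

static void mcs_release(mcs_lock_t *lock, struct mcs_node *me)
{
    struct mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        struct mcs_node *expect = me;
        if (atomic_compare_exchange_strong(lock, &expect, NULL))
            return;                         /* no waiter: lock now free */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                               /* a waiter is mid-enqueue */
    }
    atomic_store(&succ->locked, false);     /* hand over to successor */
}

/* Demo: several threads increment a shared counter under the lock. */
static mcs_lock_t demo_lock;
static long demo_counter;

static void *mcs_worker(void *arg)
{
    (void)arg;
    struct mcs_node node;                   /* queue node lives on stack */
    for (int i = 0; i < 10000; i++) {
        mcs_acquire(&demo_lock, &node);
        demo_counter++;
        mcs_release(&demo_lock, &node);
    }
    return NULL;
}

static long mcs_demo(int nthreads)
{
    pthread_t t[8];
    demo_counter = 0;
    for (int i = 0; i < nthreads && i < 8; i++)
        pthread_create(&t[i], NULL, mcs_worker, NULL);
    for (int i = 0; i < nthreads && i < 8; i++)
        pthread_join(t[i], NULL);
    return demo_counter;
}
```

The lock word itself is touched only on acquire and release; contended waiting generates no coherence traffic beyond the hand-off write.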
Disco Architecture / Disco Performance
[Figure slides]

Space-Time Partitioning: Tessellation
- Tessellation: space-time partitioning, 2-level scheduling
- Context: highly parallel multicore systems; Berkeley Par Lab
Tessellation: Space-Time Partitioning in a Manycore Client OS [Liu et al., 2010]
Reduce Sharing: K42
- Context: OS for ccNUMA systems; IBM, U Toronto (Tornado, Hurricane)
- Goals: high locality, scalability
- Object oriented: fine-grained objects
- Clustered (distributed) objects: data locality
- Deferred deletion (RCU): avoid locking
- NUMA-aware memory allocator: memory locality

K42: Fine-grained Objects
[Figure slide]
Clustered Objects, Ph.D. thesis [Appavoo, 2005]

K42: Clustered Objects
- A globally valid object reference resolves to a processor-local representative
- Sharing and locking strategy are local to each object
- Transparency eases complexity and allows controlled introduction of locality
- Shared counter: inc, dec: local access; val: communication
- Fast path: access mostly local structures

K42 Performance
[Figure slide]
Corey
- Context: 2008, high-end multicore servers, MIT
- Goals: application control of sharing; exokernel-like, higher-level services as libraries
- By default only single-core access to data structures; calls to control how data structures are shared:
  - Address ranges: control private per-core and shared address spaces
  - Kernel cores: dedicate cores to run specific kernel functions
  - Shares: lookup tables for kernel objects allow control over which object identifiers are visible to other cores
Corey: An Operating System for Many Cores [Boyd-Wickizer et al., 2008]

Linux Brute Force Scalability
- Context: 2010, high-end multicore servers, MIT
- Goal: scaling commodity OS (Linux) scalability; scale Linux to 48 cores
- Apply lessons from parallel computing and past OS research: sloppy counters, per-core data structures, fine-grained locks, lock-free structures, cache lines
- 3002 lines of code changed
- Conclusion: "no scalability reason to give up on traditional operating system organizations just yet"
An Analysis of Linux Scalability to Many Cores [Boyd-Wickizer et al., 2010]

Scalability of the OS API
- Context: 2013, previous multicore projects at MIT
- Goal: how do we know if a system is really scalable?
- Workload-based evaluation: run a workload, plot scalability, fix problems. Did we miss any non-scalable workload? Did we find all bottlenecks?
- Is there something fundamental that makes a system non-scalable? The interface might be a fundamental bottleneck
The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors [Clements et al., 2013]
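The sloppy counters mentioned in the Linux scaling work above can be sketched as a per-core counter: each core updates its own cache-line-padded slot, and the (rarer) read sums all slots. Updates scale because cores never write the same line; the read is only approximately current. `NCORES` and all names are illustrative assumptions:

```c
#include <assert.h>
#include <stdalign.h>

#define NCORES 8
#define LINE   64   /* assumed cache line size */

/* One padded slot per core; increments never contend across cores. */
struct sloppy_counter {
    struct { alignas(LINE) long v; } slot[NCORES];
};

static void sloppy_inc(struct sloppy_counter *c, int core)
{
    c->slot[core].v++;              /* local slot: no line bouncing */
}

static long sloppy_read(const struct sloppy_counter *c)
{
    long sum = 0;
    for (int i = 0; i < NCORES; i++)
        sum += c->slot[i].v;        /* occasional cross-line reads */
    return sum;
}
```

The real kernel versions also keep per-core deltas bounded so the global value never drifts too far; that refinement is omitted here.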
Scalable Commutativity Rule
- The rule: whenever interface operations commute, they can be implemented in a way that scales
- Commutative operations: the order of operations cannot be distinguished from the results
- Example: creat() requires that the lowest available FD be returned: not commutative, since callers can tell which call ran first
- Why are commutative operations scalable? Results are independent of order, so communication is unnecessary; without communication, no conflicts
- Informs the software design process:
  - Design: a design guideline for scalable interfaces
  - Implementation: a clear target
  - Test: workload-independent testing
- Commuter: an automated scalability testing tool (sv6)

FlexSC
- Context: 2010, commodity multicores, U Toronto
- Goal: reduce the context switch overhead of system calls
- Syscall context switch: the usual mode switch overhead, but also cache and TLB pollution!
- FlexSC: asynchronous system calls; batch system calls and run them on dedicated cores
- FlexSC-Threads: M on N, M >> N
FlexSC: Flexible System Call Scheduling with Exception-Less System Calls [Soares and Stumm, 2010]
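The creat() example above can be shown with a toy model: an allocator that must return the LOWEST free fd lets each caller observe which call ran first, so two concurrent creates do not commute (and, by the rule, cannot always be implemented without communication). Illustrative code, not from the paper:

```c
#include <assert.h>

/* Toy "lowest free fd" allocator over a 32-entry bitmap. */
static int alloc_lowest_fd(unsigned *used)
{
    int fd = 0;
    while (*used & (1u << fd))
        fd++;                       /* find the lowest clear bit */
    *used |= 1u << fd;
    return fd;
}
```

Running two allocations in the two possible orders gives each caller a different fd depending on who went first: the results reveal the order. A variant returning any free fd would commute.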
FlexSC Results
[Figure slide: Apache with FlexSC: batching, syscall core redirect]

No Sharing
- Multikernel: Barrelfish
- fos: factored operating system

Barrelfish
- Context: 2007, large multicore machines appearing, 100s of cores on the horizon, NUMA (cc and non-cc); ETH Zurich and Microsoft
- Goals: scale to many cores; support and manage heterogeneous hardware
- Approach: structure the OS as a distributed system
- Design principles: interprocessor communication is explicit; the OS structure is hardware-neutral; state is replicated
- Microkernel: similar to seL4: capabilities
The Multikernel: A New Architecture for Scalable Multicore Systems [Baumann et al., 2009]
Barrelfish: Replication
- Kernel + monitor: the only shared memory is for message channels
- Monitor: collectively coordinates system-wide state
- System-wide state: memory allocation tables, address space mappings, capability lists
- What state is replicated in Barrelfish? Capability lists
- Consistency and coordination:
  - Retype: two-phase commit to execute the operation globally in order
  - Page (re/un)mapping: one-phase commit to synchronise TLBs

Barrelfish: Communication
- Different mechanisms: intra-core: kernel endpoints; inter-core: URPC
- URPC: uses cache coherence + polling
  - Shared buffer: the sender writes a cache line, the receiver polls on the cache line (the last word, so no partial message is seen)
  - Polling? The cache line only changes when the sender writes, so polling is cheap; switch to block-and-IPI if the wait is too long

Barrelfish: Results
[Figure slides: message passing vs shared memory (latency in 1000s of cycles for SHM1-SHM8, MSG1, MSG8, server); broadcast vs unicast vs multicast vs NUMA-aware multicast (latency vs cores)]
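The URPC scheme described above can be sketched as a message in a shared cache line: the sender writes the payload first and the line's LAST word (a sequence number) last, and the receiver polls only that word, so it never observes a partial message. The 64-byte line, the names, and the single-message channel are simplifying assumptions; the real protocol handles ring-buffer wrap-around and falls back to block + IPI:

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <string.h>

#define LINE 64     /* assumed cache line size */

/* 7 payload words + 1 sequence word = exactly one 64-byte line. */
struct urpc_line {
    alignas(LINE) long payload[7];
    _Atomic long seq;               /* last word in the cache line */
};

static void urpc_send(struct urpc_line *ch, const long msg[7], long seq)
{
    memcpy(ch->payload, msg, sizeof ch->payload);
    /* Release store: payload is globally visible before seq flips. */
    atomic_store_explicit(&ch->seq, seq, memory_order_release);
}

static void urpc_recv(struct urpc_line *ch, long want, long out[7])
{
    /* Poll: cheap, because the line only changes when the sender writes. */
    while (atomic_load_explicit(&ch->seq, memory_order_acquire) != want)
        ;
    memcpy(out, ch->payload, sizeof ch->payload);
}
```

The release/acquire pair is what makes "write the last word last" meaningful on hardware with weaker memory ordering.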
Barrelfish: Results
[Figure slide: TLB shootdown latency (1000s of cycles) vs cores for Windows, Linux, and Barrelfish]

Summary
- Trends in multicore: scale (100+ cores), NUMA, no cache coherence, distributed system, heterogeneity
- OS design guidelines: avoid shared data, explicit communication, locality
- Approaches to multicore OS:
  - Partition the machine (Disco, Tessellation)
  - Reduce sharing (K42, Corey, Linux, FlexSC, scalable commutativity)
  - No sharing (Barrelfish, fos)
More informationParallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence
Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture
More informationffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu
ffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu int get_seqno() { } return ++seqno; // ~1 Billion ops/s // single-threaded int threadsafe_get_seqno()
More informationECSE 425 Lecture 25: Mul1- threading
ECSE 425 Lecture 25: Mul1- threading H&P Chapter 3 Last Time Theore1cal and prac1cal limits of ILP Instruc1on window Branch predic1on Register renaming 2 Today Mul1- threading Chapter 3.5 Summary of ILP:
More informationArchitecture-Conscious Database Systems
Architecture-Conscious Database Systems 2009 VLDB Summer School Shanghai Peter Boncz (CWI) Sources Thank You! l l l l Database Architectures for New Hardware VLDB 2004 tutorial, Anastassia Ailamaki Query
More informationCan Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?
Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda Network- Based Compu2ng Laboratory Department of Computer
More informationCluster Communica/on Latency:
Cluster Communica/on Latency: towards approaching its Minimum Hardware Limits, on Low-Power Pla=orms Manolis Katevenis FORTH, Heraklion, Crete, Greece (in collab. with Univ. of Crete) http://www.ics.forth.gr/carv/
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More informationInterconnection Networks
Lecture 17: Interconnection Networks Parallel Computer Architecture and Programming A comment on web site comments It is okay to make a comment on a slide/topic that has already been commented on. In fact
More informationINFLUENTIAL OS RESEARCH
INFLUENTIAL OS RESEARCH Multiprocessors Jan Bierbaum Tobias Stumpf SS 2017 ROADMAP Roadmap Multiprocessor Architectures Usage in the Old Days (mid 90s) Disco Present Age Research The Multikernel Helios
More informationOnline Course Evaluation. What we will do in the last week?
Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do
More informationECSE 425 Lecture 1: Course Introduc5on Bre9 H. Meyer
ECSE 425 Lecture 1: Course Introduc5on 2011 Bre9 H. Meyer Staff Instructor: Bre9 H. Meyer, Professor of ECE Email: bre9 dot meyer at mcgill.ca Phone: 514-398- 4210 Office: McConnell 525 OHs: M 14h00-15h00;
More informationMulticore Hardware and Parallelism
Multicore Hardware and Parallelism Minsoo Ryu Department of Computer Science and Engineering 2 1 Advent of Multicore Hardware 2 Multicore Processors 3 Amdahl s Law 4 Parallelism in Hardware 5 Q & A 2 3
More informationLinux multi-core scalability
Linux multi-core scalability Oct 2009 Andi Kleen Intel Corporation andi@firstfloor.org Overview Scalability theory Linux history Some common scalability trouble-spots Application workarounds Motivation
More informationMarkets Demanding More Performance
The Tile Processor Architecture: Embedded Multicore for Networking and Digital Multimedia Tilera Corporation August 20 th 2007 Hotchips 2007 Markets Demanding More Performance Networking market - Demand
More informationKaisen Lin and Michael Conley
Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationComputer Architecture 计算机体系结构. Lecture 9. CMP and Multicore System 第九讲 片上多处理器与多核系统. Chao Li, PhD. 李超博士
Computer Architecture 计算机体系结构 Lecture 9. CMP and Multicore System 第九讲 片上多处理器与多核系统 Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2017 Review Classification of parallel architectures Shared-memory system Cache coherency
More informationCOMP9242 Advanced OS. S2/2017 W03: Caches: What Every OS Designer Must
COMP9242 Advanced OS S2/2017 W03: Caches: What Every OS Designer Must Know @GernotHeiser Copyright Notice These slides are distributed under the Creative Commons Attribution 3.0 License You are free: to
More informationDISTRIBUTED SYSTEMS [COMP9243] Lecture 3b: Distributed Shared Memory DISTRIBUTED SHARED MEMORY (DSM) DSM consists of two components:
SHARED ADDRESS SPACE DSM consists of two components: DISTRIBUTED SYSTEMS [COMP9243] ➀ Shared address space ➁ Replication and consistency of memory objects Lecture 3b: Distributed Shared Memory Shared address
More informationInterconnection Networks
Lecture 18: Interconnection Networks Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Credit: many of these slides were created by Michael Papamichael This lecture is partially
More informationArquitecturas y Modelos de. Multicore
Arquitecturas y Modelos de rogramacion para Multicore 17 Septiembre 2008 Castellón Eduard Ayguadé Alex Ramírez Opening statements * Some visionaries already predicted multicores 30 years ago And they have
More informationComputer Architecture
Computer Architecture Chapter 7 Parallel Processing 1 Parallelism Instruction-level parallelism (Ch.6) pipeline superscalar latency issues hazards Processor-level parallelism (Ch.7) array/vector of processors
More informationNetSlices: Scalable Mul/- Core Packet Processing in User- Space
NetSlices: Scalable Mul/- Core Packet Processing in - Space Tudor Marian, Ki Suh Lee, Hakim Weatherspoon Cornell University Presented by Ki Suh Lee Packet Processors Essen/al for evolving networks Sophis/cated
More informationCOSC 6385 Computer Architecture - Multi Processor Systems
COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:
More informationEfficient inter-kernel communication in a heterogeneous multikernel operating system. Ajithchandra Saya
Efficient inter-kernel communication in a heterogeneous multikernel operating system Ajithchandra Saya Report submitted to the faculty of the Virginia Polytechnic Institute and State University in partial
More informationCOSC 6385 Computer Architecture. - Multi-Processors (IV) Simultaneous multi-threading and multi-core processors
OS 6385 omputer Architecture - Multi-Processors (IV) Simultaneous multi-threading and multi-core processors Spring 2012 Long-term trend on the number of transistor per integrated circuit Number of transistors
More informationHPC Architectures. Types of resource currently in use
HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More information10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems
1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems To enhance system performance and, in some cases, to increase
More informationMegaPipe: A New Programming Interface for Scalable Network I/O
MegaPipe: A New Programming Interface for Scalable Network I/O Sangjin Han in collabora=on with Sco? Marshall Byung- Gon Chun Sylvia Ratnasamy University of California, Berkeley Yahoo! Research tl;dr?
More informationNew directions in OS design: the Barrelfish research operating system. Timothy Roscoe (Mothy) ETH Zurich Switzerland
New directions in OS design: the Barrelfish research operating system Timothy Roscoe (Mothy) ETH Zurich Switzerland Acknowledgements ETH Zurich: Zachary Anderson, Pierre Evariste Dagand, Stefan Kästle,
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationMaster Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.
Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading
More information