Multiprocessor OS
COMP9242 Advanced Operating Systems
Overview
COMP9242 Advanced Operating Systems, Ihor Kuz, S2/2015, Week 12
- Multiprocessor OS: how does it work? Scalability (review)
- Multiprocessor hardware: contemporary systems (Intel, AMD, ARM, Oracle/Sun); experimental and future systems (Intel, MS, Polaris)
- OS design for multiprocessors: guidelines; design approaches: divide and conquer (Disco, Tessellation), reduce sharing (K42, Corey, Linux, FlexSC, scalable commutativity), no sharing (Barrelfish, fos)

Uniprocessor OS
[Figure: applications (App1, App2) above a single OS instance; OS data structures in memory: run queue data, process control blocks, FS structs, application data]
Multiprocessor OS
[Figure: multiple applications (App1-App4) running across processors, sharing one OS instance and its data structures (run queue data, process control blocks, FS structs, application data) in memory]
Key design challenges:
- Correctness of (shared) data structures
- Scalability

Correctness of Shared Data
- Concurrency control: locks, semaphores, transactions, lock-free data structures
- We know how to do this: in the application, in the OS

Scalability
- Speedup as more processors are added
- Ideal speedup: S(N) = T_1 / T_N, where T_N is the execution time on N processors
[Figure: ideal speedup S(N) grows linearly with the number of processors N]
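The speedup metric S(N) = T_1 / T_N can be checked numerically against Amdahl's law. A minimal sketch in C; `speedup` is an illustrative helper name, not from the slides:

```c
#include <assert.h>

/* Speedup predicted by Amdahl's law: S(N) = 1 / ((1 - P) + P/N),
 * where P is the parallelisable fraction of the work and N the
 * number of processors. */
static double speedup(double P, int N)
{
    return 1.0 / ((1.0 - P) + P / (double)N);
}
```

Even with 95% of the work parallelisable, 48 cores yield only about 14x speedup, and no number of cores exceeds 1/(1-P) = 20x.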
Scalability
- Speedup as more processors are added: in reality, S(N) = T_1 / T_N falls away from the ideal line

Scalability and Serialisation
- Remember Amdahl's law: the serial (non-parallel) portion is when the application is not running on all cores; serialisation prevents scalability
- With parallelisable fraction P:
  T_1 = (1 - P) + P = 1
  T_N = (1 - P) + P/N
  S(N) = T_1 / T_N = 1 / ((1 - P) + P/N)
  S(∞) ≈ 1 / (1 - P)

OS Serialisation
- Where does serialisation show up? In the application (e.g. access to shared app data) and in the OS (e.g. performing a syscall for the app)
- How much time is spent in the OS?
- Sources of serialisation:
  - Locking (explicit serialisation): waiting for a lock stalls self; lock implementation: atomic operations lock the bus and stall everyone; cache coherence traffic loads the bus and slows down others
  - Memory access (implicit): relatively high latency to memory stalls self
  - Cache (implicit): processor stalled while a cache line is fetched or invalidated; affected by interconnect latency; performance depends on data size (cache lines) and contention (number of cores)

More Cache-related Serialisation
- False sharing: unrelated data structures share the same cache line; accessed from different processors, this causes cache coherence traffic and delay
- Cache line bouncing: shared R/W on many processors, e.g. bouncing due to locks: each processor spinning on a lock brings it into its own cache, causing cache coherence traffic and delay
- Cache misses: potentially a direct memory access, which stalls self. When does a cache miss occur? The application accesses data for the first time; the application runs on a new core; cached memory has been evicted (cache footprint too big, another app ran, the OS ran)
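A standard fix for the false sharing described above is to pad per-core data out to cache-line boundaries so two cores never write the same line. A minimal C11 sketch; the 64-byte line size and all names are assumptions (real line sizes are hardware-dependent):

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>

/* Assumed cache line size; query the hardware in real code. */
#define CACHE_LINE 64

/* One counter per cache line: updates from different cores touch
 * different lines, so no coherence traffic from false sharing. */
struct padded_counter {
    alignas(CACHE_LINE) unsigned long value;
};
```

Adjacent elements of an array of `struct padded_counter` land exactly one cache line apart, at the cost of wasted memory.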
Multi-What?
- Multiprocessor, SMP: >1 separate processors, connected by an off-chip bus
- Multicore: >1 processing cores in a single processor, connected by an on-chip bus
- Multithread, SMT: >1 hardware threads in a single core
- Multicore + multiprocessor: >1 multicore processors; >1 multicore dies in a package (multi-chip module)

Interesting Properties of Multiprocessors
- Scale and structure: how many cores and processors are there, what kinds are they, and how are they organised?
- Interconnect: how are the cores and processors connected?
- Memory locality and caches: where is the memory, and what is the cache architecture?
- Interprocessor communication: how do cores and processors send messages to each other?

Contemporary Multiprocessor Hardware
- Intel: Nehalem, Westmere: 10 cores, QPI; Sandy Bridge, Ivy Bridge: ring bus, integrated GPU, L3, IO; Haswell (Broadwell): 18 cores, ring bus, transactional memory, slices (EP)
- AMD: K10 (Opteron: Barcelona, Magny Cours): 12 cores, HyperTransport; Bulldozer, Piledriver, Steamroller (Opteron, FX): 16 cores, Clustered Multithread: module with 2 integer cores
- Oracle (Sun): UltraSPARC T1, T2, T3, T4, T5 (Niagara): 16 cores, 8 threads/core (2 simultaneous), crossbar, 8 sockets
- ARM: Cortex A9, A15 MPCore, big.LITTLE: 4-8 cores; big.LITTLE: A7 + A15
Scale and Structure
[Figure slides: ARM Cortex A9 MPCore; ARM big.LITTLE; Intel Nehalem]

Interconnect
[Figure slide: AMD Barcelona]
Memory Locality and Caches / Interprocessor Communication
[Figure slide: Oracle SPARC T2 block diagram: 8 cores (C0-C7, each with FPU and SPU), full crossbar to 8 L2 cache banks, 4 memory controllers (MCU) to FB-DIMMs, NIU (10 Gigabit Ethernet), system interface, buffer switch core, PCIe; 2.0 GHz, power <95 W. From Sun/Oracle]

Interprocessor Communication / Structure / Memory
[Further figure slides]
Experimental/Future Multiprocessor Hardware
- Microsoft Beehive: ring bus, no cache coherence
- Tilera Tile64, Tile-Gx: 100 cores, mesh network
- Intel Polaris: 80 cores, mesh network
- Intel SCC: 48 cores, mesh network, no cache coherency
- Intel MIC (Many Integrated Core; Knights Corner, Xeon Phi): 60+ cores, ring bus

Scale and Structure
[Figure slide: Tilera Tile64 (newest: EZchip TILE-Gx) and Intel Polaris: grid of tiles, each with a processor (register file, L1 I/D caches, I/D TLBs, DMA engine), L2 cache, and a switch onto the mesh networks (MDN, TDN, UDN, IDN, STN); DDR2 controllers and I/O (PCIe, GbE, UART, HPI, I2C, JTAG, SPI) at the edges]

Cache and Memory
[Figure slide: Intel SCC. From techresearch.intel.com/spaw2/uploads/files/scc_platform_overview.pdf]

Interprocessor Communication
[Figure slide: Beehive. From projects.csail.mit.edu/beehive/beehivev5.pdf]
Interprocessor Communication
[Figure slide: Intel MIC (Many Integrated Core; Knights Corner/Landing, Xeon Phi)]

Summary: Multiprocessor Hardware
- Scalability: 100+ cores, Amdahl's law really kicks in
- Heterogeneity: heterogeneous cores, memory, etc.; properties of similar systems may vary wildly (e.g. interconnect topology and latencies between different AMD platforms)
- NUMA: also variable latencies due to topology and cache coherence
- Cache coherence may not be possible: can't use it for locking; shared data structures require explicit work
- The computer is a distributed system: message passing, consistency and synchronisation, fault tolerance

OS Design for Multiprocessors

Optimisation for Scalability
- Reduce the amount of code in critical sections: increases concurrency
  - Fine-grained locking: lock data, not code; tradeoff: more concurrency but more locking (and locking causes serialisation)
  - Lock-free data structures
- Avoid expensive memory access: avoid uncached memory, access cheap (close) memory
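The "lock data, not code" idea can be sketched as a hash table with one lock per bucket, so operations on different buckets never serialise each other. All names and sizes are illustrative assumptions, using POSIX threads:

```c
#include <assert.h>
#include <pthread.h>

#define NBUCKETS 16

/* Fine-grained locking: each bucket carries its own mutex, instead of
 * one big lock around the whole table. */
struct bucket {
    pthread_mutex_t lock;
    long value;                       /* stand-in for a chain of entries */
};

struct table {
    struct bucket b[NBUCKETS];
};

static void table_init(struct table *t)
{
    for (int i = 0; i < NBUCKETS; i++) {
        pthread_mutex_init(&t->b[i].lock, NULL);
        t->b[i].value = 0;
    }
}

static void table_add(struct table *t, unsigned key, long delta)
{
    struct bucket *b = &t->b[key % NBUCKETS];
    pthread_mutex_lock(&b->lock);     /* only this bucket is serialised */
    b->value += delta;
    pthread_mutex_unlock(&b->lock);
}

static long table_get(struct table *t, unsigned key)
{
    struct bucket *b = &t->b[key % NBUCKETS];
    pthread_mutex_lock(&b->lock);
    long v = b->value;
    pthread_mutex_unlock(&b->lock);
    return v;
}
```

The tradeoff from the slide is visible here: more concurrency, but one lock/unlock pair per operation rather than per batch.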
Optimisation for Scalability
- Reduce false sharing: pad data structures to cache lines
- Reduce cache line bouncing: reduce sharing, e.g. MCS locks use local data
- Reduce cache misses: affinity scheduling (run a process on the core where it last ran), avoid cache pollution

Design Guidelines for Modern (and Future) Multiprocessors
- Avoid shared data: performance issues arise less from lock contention than from data locality
- Explicit communication: regain control over communication costs (and predictability); sometimes it's the only option
- Tradeoff: parallelism vs synchronisation: synchronisation introduces serialisation; make concurrent threads independent: reduce critical sections and cache misses
- Allocate for locality: e.g. provide memory local to a core
- Schedule for locality: with cached data, with local memory
- Tradeoff: uniprocessor performance vs scalability

Design Approaches
- Divide and conquer: divide the multiprocessor into smaller bits and use them as normal, using virtualisation or an exokernel
- Reduced sharing:
  - Brute force and heroic effort: find problems in the existing OS and fix them, e.g. Linux re-architecting: BKL -> fine-grained locking
  - By design: avoid shared data as much as possible
- No sharing: the computer is a distributed system; do extra work to share!

Divide and Conquer: Disco
- Scalability is too hard!
- Context: ca. 1995, large ccNUMA multiprocessors appearing; scaling OSes requires extensive modifications
- Idea: implement a scalable VMM and run multiple OS instances on it
- The VMM has most of the features of a scalable OS: NUMA-aware allocator, page replication, remapping, etc.
- The VMM is substantially simpler/cheaper to implement
- Modern incarnations: virtual servers (Amazon, etc.), research (Cerberus)
Running Commodity Operating Systems on Scalable Multiprocessors [Bugnion et al., 1997]
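The MCS locks cited earlier as an example of reducing cache-line bouncing make each waiter spin on a flag in its own queue node, so the spinning stays in the local cache instead of hammering a shared lock word. A minimal C11 sketch with `<stdatomic.h>` and POSIX threads; all names are illustrative, and this is not the slides' code:

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct mcs_node {
    _Atomic(struct mcs_node *) next;
    atomic_bool locked;
};

typedef _Atomic(struct mcs_node *) mcs_lock_t;   /* tail of waiter queue */

static void mcs_acquire(mcs_lock_t *lock, struct mcs_node *me)
{
    atomic_store(&me->next, NULL);
    atomic_store(&me->locked, true);
    struct mcs_node *pred = atomic_exchange(lock, me); /* join queue tail */
    if (pred != NULL) {
        atomic_store(&pred->next, me);      /* link behind predecessor */
        while (atomic_load(&me->locked))
            ;                               /* spin on OUR node only */
    }
}

static void mcs_release(mcs_lock_t *lock, struct mcs_node *me)
{
    struct mcs_node *succ = atomic_load(&me->next);
    if (succ == NULL) {
        struct mcs_node *expect = me;
        if (atomic_compare_exchange_strong(lock, &expect, NULL))
            return;                         /* no waiter: lock now free */
        while ((succ = atomic_load(&me->next)) == NULL)
            ;                               /* a waiter is mid-enqueue */
    }
    atomic_store(&succ->locked, false);     /* hand over to successor */
}

/* Demo: several threads increment a shared counter under the lock. */
static mcs_lock_t demo_lock;
static long demo_counter;

static void *mcs_worker(void *arg)
{
    (void)arg;
    struct mcs_node node;                   /* queue node lives on stack */
    for (int i = 0; i < 10000; i++) {
        mcs_acquire(&demo_lock, &node);
        demo_counter++;
        mcs_release(&demo_lock, &node);
    }
    return NULL;
}

static long mcs_demo(int nthreads)
{
    pthread_t t[8];
    demo_counter = 0;
    for (int i = 0; i < nthreads && i < 8; i++)
        pthread_create(&t[i], NULL, mcs_worker, NULL);
    for (int i = 0; i < nthreads && i < 8; i++)
        pthread_join(t[i], NULL);
    return demo_counter;
}
```

The lock word itself is touched only on acquire and release; contended waiting generates no coherence traffic beyond the hand-off write.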
Disco Architecture / Disco Performance
[Figure slides]

Space-Time Partitioning: Tessellation
- Tessellation: space-time partitioning, 2-level scheduling
- Context: highly parallel multicore systems; Berkeley Par Lab
Tessellation: Space-Time Partitioning in a Manycore Client OS [Liu et al., 2010]
Reduce Sharing: K42
- Context: OS for ccNUMA systems; IBM, U Toronto (Tornado, Hurricane)
- Goals: high locality, scalability
- Object oriented: fine-grained objects
- Clustered (distributed) objects: data locality
- Deferred deletion (RCU): avoid locking
- NUMA-aware memory allocator: memory locality

K42: Fine-grained Objects
[Figure slide]
Clustered Objects, Ph.D. thesis [Appavoo, 2005]

K42: Clustered Objects
- A globally valid object reference resolves to a processor-local representative
- Sharing and locking strategy are local to each object
- Transparency eases complexity and allows controlled introduction of locality
- Shared counter: inc, dec: local access; val: communication
- Fast path: access mostly local structures

K42 Performance
[Figure slide]
Corey
- Context: 2008, high-end multicore servers, MIT
- Goals: application control of sharing; exokernel-like, higher-level services as libraries
- By default only single-core access to data structures; calls to control how data structures are shared:
  - Address ranges: control private per-core and shared address spaces
  - Kernel cores: dedicate cores to run specific kernel functions
  - Shares: lookup tables for kernel objects allow control over which object identifiers are visible to other cores
Corey: An Operating System for Many Cores [Boyd-Wickizer et al., 2008]

Linux Brute Force Scalability
- Context: 2010, high-end multicore servers, MIT
- Goal: scaling commodity OS (Linux) scalability; scale Linux to 48 cores
- Apply lessons from parallel computing and past OS research: sloppy counters, per-core data structures, fine-grained locks, lock-free structures, cache lines
- 3002 lines of code changed
- Conclusion: "no scalability reason to give up on traditional operating system organizations just yet"
An Analysis of Linux Scalability to Many Cores [Boyd-Wickizer et al., 2010]

Scalability of the OS API
- Context: 2013, previous multicore projects at MIT
- Goal: how do we know if a system is really scalable?
- Workload-based evaluation: run a workload, plot scalability, fix problems. Did we miss any non-scalable workload? Did we find all bottlenecks?
- Is there something fundamental that makes a system non-scalable? The interface might be a fundamental bottleneck
The Scalable Commutativity Rule: Designing Scalable Software for Multicore Processors [Clements et al., 2013]
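The sloppy counters mentioned in the Linux scaling work above can be sketched as a per-core counter: each core updates its own cache-line-padded slot, and the (rarer) read sums all slots. Updates scale because cores never write the same line; the read is only approximately current. `NCORES` and all names are illustrative assumptions:

```c
#include <assert.h>
#include <stdalign.h>

#define NCORES 8
#define LINE   64   /* assumed cache line size */

/* One padded slot per core; increments never contend across cores. */
struct sloppy_counter {
    struct { alignas(LINE) long v; } slot[NCORES];
};

static void sloppy_inc(struct sloppy_counter *c, int core)
{
    c->slot[core].v++;              /* local slot: no line bouncing */
}

static long sloppy_read(const struct sloppy_counter *c)
{
    long sum = 0;
    for (int i = 0; i < NCORES; i++)
        sum += c->slot[i].v;        /* occasional cross-line reads */
    return sum;
}
```

The real kernel versions also keep per-core deltas bounded so the global value never drifts too far; that refinement is omitted here.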
Scalable Commutativity Rule
- The rule: whenever interface operations commute, they can be implemented in a way that scales
- Commutative operations: the order of operations cannot be distinguished from the results
- Example: creat() requires that the lowest available FD be returned: not commutative, since callers can tell which call ran first
- Why are commutative operations scalable? Results are independent of order, so communication is unnecessary; without communication, no conflicts
- Informs the software design process:
  - Design: a design guideline for scalable interfaces
  - Implementation: a clear target
  - Test: workload-independent testing
- Commuter: an automated scalability testing tool (sv6)

FlexSC
- Context: 2010, commodity multicores, U Toronto
- Goal: reduce the context switch overhead of system calls
- Syscall context switch: the usual mode switch overhead, but also cache and TLB pollution!
- FlexSC: asynchronous system calls; batch system calls and run them on dedicated cores
- FlexSC-Threads: M on N, M >> N
FlexSC: Flexible System Call Scheduling with Exception-Less System Calls [Soares and Stumm, 2010]
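The creat() example above can be shown with a toy model: an allocator that must return the LOWEST free fd lets each caller observe which call ran first, so two concurrent creates do not commute (and, by the rule, cannot always be implemented without communication). Illustrative code, not from the paper:

```c
#include <assert.h>

/* Toy "lowest free fd" allocator over a 32-entry bitmap. */
static int alloc_lowest_fd(unsigned *used)
{
    int fd = 0;
    while (*used & (1u << fd))
        fd++;                       /* find the lowest clear bit */
    *used |= 1u << fd;
    return fd;
}
```

Running two allocations in the two possible orders gives each caller a different fd depending on who went first: the results reveal the order. A variant returning any free fd would commute.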
FlexSC Results
[Figure slide: Apache with FlexSC: batching, syscall core redirect]

No Sharing
- Multikernel: Barrelfish
- fos: factored operating system

Barrelfish
- Context: 2007, large multicore machines appearing, 100s of cores on the horizon, NUMA (cc and non-cc); ETH Zurich and Microsoft
- Goals: scale to many cores; support and manage heterogeneous hardware
- Approach: structure the OS as a distributed system
- Design principles: interprocessor communication is explicit; the OS structure is hardware-neutral; state is replicated
- Microkernel: similar to seL4: capabilities
The Multikernel: A New Architecture for Scalable Multicore Systems [Baumann et al., 2009]
Barrelfish: Replication
- Kernel + monitor: the only shared memory is for message channels
- Monitor: collectively coordinates system-wide state
- System-wide state: memory allocation tables, address space mappings, capability lists
- What state is replicated in Barrelfish? Capability lists
- Consistency and coordination:
  - Retype: two-phase commit to execute the operation globally in order
  - Page (re/un)mapping: one-phase commit to synchronise TLBs

Barrelfish: Communication
- Different mechanisms: intra-core: kernel endpoints; inter-core: URPC
- URPC: uses cache coherence + polling
  - Shared buffer: the sender writes a cache line, the receiver polls on the cache line (the last word, so no partial message is seen)
  - Polling? The cache line only changes when the sender writes, so polling is cheap; switch to block-and-IPI if the wait is too long

Barrelfish: Results
[Figure slides: message passing vs shared memory (latency in 1000s of cycles for SHM1-SHM8, MSG1, MSG8, server); broadcast vs unicast vs multicast vs NUMA-aware multicast (latency vs cores)]
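The URPC scheme described above can be sketched as a message in a shared cache line: the sender writes the payload first and the line's LAST word (a sequence number) last, and the receiver polls only that word, so it never observes a partial message. The 64-byte line, the names, and the single-message channel are simplifying assumptions; the real protocol handles ring-buffer wrap-around and falls back to block + IPI:

```c
#include <assert.h>
#include <stdalign.h>
#include <stdatomic.h>
#include <string.h>

#define LINE 64     /* assumed cache line size */

/* 7 payload words + 1 sequence word = exactly one 64-byte line. */
struct urpc_line {
    alignas(LINE) long payload[7];
    _Atomic long seq;               /* last word in the cache line */
};

static void urpc_send(struct urpc_line *ch, const long msg[7], long seq)
{
    memcpy(ch->payload, msg, sizeof ch->payload);
    /* Release store: payload is globally visible before seq flips. */
    atomic_store_explicit(&ch->seq, seq, memory_order_release);
}

static void urpc_recv(struct urpc_line *ch, long want, long out[7])
{
    /* Poll: cheap, because the line only changes when the sender writes. */
    while (atomic_load_explicit(&ch->seq, memory_order_acquire) != want)
        ;
    memcpy(out, ch->payload, sizeof ch->payload);
}
```

The release/acquire pair is what makes "write the last word last" meaningful on hardware with weaker memory ordering.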
Barrelfish: Results
[Figure slide: TLB shootdown latency (1000s of cycles) vs cores for Windows, Linux, and Barrelfish]

Summary
- Trends in multicore: scale (100+ cores), NUMA, no cache coherence, distributed system, heterogeneity
- OS design guidelines: avoid shared data, explicit communication, locality
- Approaches to multicore OS:
  - Partition the machine (Disco, Tessellation)
  - Reduce sharing (K42, Corey, Linux, FlexSC, scalable commutativity)
  - No sharing (Barrelfish, fos)
More informationParallel Computer Architecture Spring Shared Memory Multiprocessors Memory Coherence
Parallel Computer Architecture Spring 2018 Shared Memory Multiprocessors Memory Coherence Nikos Bellas Computer and Communications Engineering Department University of Thessaly Parallel Computer Architecture
More informationffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu
ffwd: delegation is (much) faster than you think Sepideh Roghanchi, Jakob Eriksson, Nilanjana Basu int get_seqno() { } return ++seqno; // ~1 Billion ops/s // single-threaded int threadsafe_get_seqno()
More informationECSE 425 Lecture 25: Mul1- threading
ECSE 425 Lecture 25: Mul1- threading H&P Chapter 3 Last Time Theore1cal and prac1cal limits of ILP Instruc1on window Branch predic1on Register renaming 2 Today Mul1- threading Chapter 3.5 Summary of ILP:
More informationArchitecture-Conscious Database Systems
Architecture-Conscious Database Systems 2009 VLDB Summer School Shanghai Peter Boncz (CWI) Sources Thank You! l l l l Database Architectures for New Hardware VLDB 2004 tutorial, Anastassia Ailamaki Query
More informationCan Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects?
Can Parallel Replication Benefit Hadoop Distributed File System for High Performance Interconnects? N. S. Islam, X. Lu, M. W. Rahman, and D. K. Panda Network- Based Compu2ng Laboratory Department of Computer
More informationCluster Communica/on Latency:
Cluster Communica/on Latency: towards approaching its Minimum Hardware Limits, on Low-Power Pla=orms Manolis Katevenis FORTH, Heraklion, Crete, Greece (in collab. with Univ. of Crete) http://www.ics.forth.gr/carv/
More informationAn Introduction to Parallel Programming
An Introduction to Parallel Programming Ing. Andrea Marongiu (a.marongiu@unibo.it) Includes slides from Multicore Programming Primer course at Massachusetts Institute of Technology (MIT) by Prof. SamanAmarasinghe
More informationInterconnection Networks
Lecture 17: Interconnection Networks Parallel Computer Architecture and Programming A comment on web site comments It is okay to make a comment on a slide/topic that has already been commented on. In fact
More informationINFLUENTIAL OS RESEARCH
INFLUENTIAL OS RESEARCH Multiprocessors Jan Bierbaum Tobias Stumpf SS 2017 ROADMAP Roadmap Multiprocessor Architectures Usage in the Old Days (mid 90s) Disco Present Age Research The Multikernel Helios
More informationOnline Course Evaluation. What we will do in the last week?
Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do
More informationECSE 425 Lecture 1: Course Introduc5on Bre9 H. Meyer
ECSE 425 Lecture 1: Course Introduc5on 2011 Bre9 H. Meyer Staff Instructor: Bre9 H. Meyer, Professor of ECE Email: bre9 dot meyer at mcgill.ca Phone: 514-398- 4210 Office: McConnell 525 OHs: M 14h00-15h00;
More informationMulticore Hardware and Parallelism
Multicore Hardware and Parallelism Minsoo Ryu Department of Computer Science and Engineering 2 1 Advent of Multicore Hardware 2 Multicore Processors 3 Amdahl s Law 4 Parallelism in Hardware 5 Q & A 2 3
More informationLinux multi-core scalability
Linux multi-core scalability Oct 2009 Andi Kleen Intel Corporation andi@firstfloor.org Overview Scalability theory Linux history Some common scalability trouble-spots Application workarounds Motivation
More informationMarkets Demanding More Performance
The Tile Processor Architecture: Embedded Multicore for Networking and Digital Multimedia Tilera Corporation August 20 th 2007 Hotchips 2007 Markets Demanding More Performance Networking market - Demand
More informationKaisen Lin and Michael Conley
Kaisen Lin and Michael Conley Simultaneous Multithreading Instructions from multiple threads run simultaneously on superscalar processor More instruction fetching and register state Commercialized! DEC
More informationIntroduction to parallel computers and parallel programming. Introduction to parallel computersand parallel programming p. 1
Introduction to parallel computers and parallel programming Introduction to parallel computersand parallel programming p. 1 Content A quick overview of morden parallel hardware Parallelism within a chip
More informationComputer Architecture 计算机体系结构. Lecture 9. CMP and Multicore System 第九讲 片上多处理器与多核系统. Chao Li, PhD. 李超博士
Computer Architecture 计算机体系结构 Lecture 9. CMP and Multicore System 第九讲 片上多处理器与多核系统 Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2017 Review Classification of parallel architectures Shared-memory system Cache coherency
More informationCOMP9242 Advanced OS. S2/2017 W03: Caches: What Every OS Designer Must
COMP9242 Advanced OS S2/2017 W03: Caches: What Every OS Designer Must Know @GernotHeiser Copyright Notice These slides are distributed under the Creative Commons Attribution 3.0 License You are free: to
More informationDISTRIBUTED SYSTEMS [COMP9243] Lecture 3b: Distributed Shared Memory DISTRIBUTED SHARED MEMORY (DSM) DSM consists of two components:
SHARED ADDRESS SPACE DSM consists of two components: DISTRIBUTED SYSTEMS [COMP9243] ➀ Shared address space ➁ Replication and consistency of memory objects Lecture 3b: Distributed Shared Memory Shared address
More informationInterconnection Networks
Lecture 18: Interconnection Networks Parallel Computer Architecture and Programming CMU 15-418/15-618, Spring 2015 Credit: many of these slides were created by Michael Papamichael This lecture is partially
More informationArquitecturas y Modelos de. Multicore
Arquitecturas y Modelos de rogramacion para Multicore 17 Septiembre 2008 Castellón Eduard Ayguadé Alex Ramírez Opening statements * Some visionaries already predicted multicores 30 years ago And they have
More informationComputer Architecture
Computer Architecture Chapter 7 Parallel Processing 1 Parallelism Instruction-level parallelism (Ch.6) pipeline superscalar latency issues hazards Processor-level parallelism (Ch.7) array/vector of processors
More informationNetSlices: Scalable Mul/- Core Packet Processing in User- Space
NetSlices: Scalable Mul/- Core Packet Processing in - Space Tudor Marian, Ki Suh Lee, Hakim Weatherspoon Cornell University Presented by Ki Suh Lee Packet Processors Essen/al for evolving networks Sophis/cated
More informationCOSC 6385 Computer Architecture - Multi Processor Systems
COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:
More informationEfficient inter-kernel communication in a heterogeneous multikernel operating system. Ajithchandra Saya
Efficient inter-kernel communication in a heterogeneous multikernel operating system Ajithchandra Saya Report submitted to the faculty of the Virginia Polytechnic Institute and State University in partial
More informationCOSC 6385 Computer Architecture. - Multi-Processors (IV) Simultaneous multi-threading and multi-core processors
OS 6385 omputer Architecture - Multi-Processors (IV) Simultaneous multi-threading and multi-core processors Spring 2012 Long-term trend on the number of transistor per integrated circuit Number of transistors
More informationHPC Architectures. Types of resource currently in use
HPC Architectures Types of resource currently in use Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More information10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems
1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems To enhance system performance and, in some cases, to increase
More informationMegaPipe: A New Programming Interface for Scalable Network I/O
MegaPipe: A New Programming Interface for Scalable Network I/O Sangjin Han in collabora=on with Sco? Marshall Byung- Gon Chun Sylvia Ratnasamy University of California, Berkeley Yahoo! Research tl;dr?
More informationNew directions in OS design: the Barrelfish research operating system. Timothy Roscoe (Mothy) ETH Zurich Switzerland
New directions in OS design: the Barrelfish research operating system Timothy Roscoe (Mothy) ETH Zurich Switzerland Acknowledgements ETH Zurich: Zachary Anderson, Pierre Evariste Dagand, Stefan Kästle,
More informationChapter 5. Multiprocessors and Thread-Level Parallelism
Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model
More informationMaster Program (Laurea Magistrale) in Computer Science and Networking. High Performance Computing Systems and Enabling Platforms.
Master Program (Laurea Magistrale) in Computer Science and Networking High Performance Computing Systems and Enabling Platforms Marco Vanneschi Multithreading Contents Main features of explicit multithreading
More information