EECS 470. Lecture 17 Multiprocessors I. Fall 2018 Jon Beaumont

Lecture 17 Multiprocessors I Fall 2018 Jon Beaumont www.eecs.umich.edu/courses/eecs470 Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth Shen, Smith, Sohi, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Pennsylvania, and University of Wisconsin. Slide 1

Announcements Milestone 3 due Friday (11/30) Should be passing many non-trivial benchmarks Send a 1 page status update as normal Indicate if you would like to meet Slide 2

Roadmap Speedup Programs Reduce Instruction Latency Parallelize Reduce number of instructions Reduce average memory latency Instruction Level Parallelism Thread Level Parallelism Caching First 2 months Power Efficiency MultiProcessors Programmability Precise State Virtual Memory Slide 3

Multiprocessors Slide 4

Spectrum of Parallelism Bit-level Pipelining ILP Multithreading Multiprocessing Distributed EECS 370 EECS 570 EECS 591 Why multiprocessing? Desire for performance Techniques from 370/470 difficult to scale further Slide 5

Thread-Level Parallelism struct acct_t { int bal; }; shared struct acct_t accts[max_acct]; int id,amt; if (accts[id].bal >= amt) { accts[id].bal -= amt; spew_cash(); } 0: addi r1,accts,r3 1: ld 0(r3),r4 2: blt r4,r2,6 3: sub r4,r2,r4 4: st r4,0(r3) 5: call spew_cash Thread-level parallelism (TLP) Collection of asynchronous tasks: not started and stopped together Data shared loosely, dynamically Example: database/web server (each query is a thread) accts is shared, can t register allocate even if it were scalar id and amt are private variables, register allocated to r1, r2 Slide 6

Thread-Level Parallelism Coarser than instruction level parallelism Usually exploited over thousands of instructions, not 10s-100s More independence between different threads than within a thread Less data dependencies in TLP than ILP Fewer guarantees about how instructions execute relative to one another outside a thread than within Instructions in a thread are guaranteed to execute (as if they were done) sequentially Only guaranteed across threads if programmer explicitly indicates so Slide 7

Shared-Memory Multiprocessors Multiple execution contexts sharing a single address space Multiple programs (MIMD) Or more frequently: multiple copies of one program (SPMD) Implicit (automatic) communication via loads and stores Theoretical foundation: PRAM model P 1 P 2 P 3 P 4 Memory System Slide 8

Pluses Why Shared Memory? For applications looks like multitasking uniprocessor For OS only evolutionary extensions required Easy to do communication without OS Software can worry about correctness first then performance Minuses Proper synchronization is complex Communication is implicit so harder to optimize Hardware designers must implement Result Traditionally bus-based Symmetric Multiprocessors (SMPs), and now CMPs are the most success parallel machines ever And the first with multi-billion-dollar markets Slide 9

Shared Memory CPU and cache Mem R Memory Router logic figure out where transactions get sent Slide 10

Paired vs. Separate Processor/Memory? Separate processor/memory Uniform memory access (UMA): equal latency to all memory + Simple software, doesn t matter where you put data Lower peak performance Bus-based UMAs common: symmetric multi-processors (SMP) Paired processor/memory Non-uniform memory access (NUMA): faster to local memory More complex software: where you put data matters + Higher peak performance: assuming proper data placement Mem R Mem R Mem R Mem R Mem Mem Mem Mem Slide 11

Shared vs. Point-to-Point Networks Shared network: e.g., bus (left) + Low latency Low bandwidth: doesn t scale beyond ~16 processors + Shared property simplifies cache coherence protocols (later) Point-to-point network: e.g., mesh or ring (right) Longer latency: may need multiple hops to communicate + Higher bandwidth: scales to 1000s of processors Cache coherence protocols are complex Mem R Mem R Mem R Mem R Mem R R Mem Mem R R Mem Slide 12

Organizing Point-To-Point Networks Network topology: organization of network Tradeoff performance (connectivity, latency, bandwidth) cost Router chips Networks that require separate router chips are indirect Networks that use processor/memory/router packages are direct + Fewer components, Glueless MP Point-to-point network examples Indirect tree (left) Direct mesh or ring (right) R R R Mem R R Mem Mem R Mem R Mem R Mem R Mem R R Mem Slide 13

Implementation #1: Snooping Bus MP Mem Mem Mem Mem Two basic implementations Bus-based systems Typically small: 2 8 (maybe 16) processors Typically processors split from memories (UMA) Sometimes multiple processors on single chip (CMP) Symmetric multiprocessors (SMPs) Common, I use one everyday Slide 14

Implementation #2: Scalable MP Mem R R Mem Mem R R Mem General point-to-point network-based systems Typically processor/memory/router blocks (NUMA) Glueless MP: no need for additional glue chips Can be arbitrarily large: 1000 s of processors Massively parallel processors (MPPs) In reality only government (DoD) has MPPs Companies have much smaller systems: 32 64 processors Scalable multi-processors Slide 15

Issues for Shared Memory Systems Two in particular Cache coherence Memory consistency model Closely related to each other Different solutions for SMPs and MPPs Slide 16

Cache Coherence: The Problem t1: Store A=1 P1 P2 t2: Load A? A: A: 0 1 Write-back A: 0 Bus A: 0 Main Memory Need to do something to keep P2 s cache coherent Slide 17

Slide 18

Approaches to Cache Coherence Software-based solutions Mechanisms: Mark cache blocks/memory pages as cacheable/non-cacheable Add Flush and Invalidate instructions When are each of these needed? Could be done by compiler or run-time system Difficult to get perfect (e.g., what about memory aliasing?) Hardware solutions are far more common Simple schemes rely on broadcast over a bus Slide 19

Simple Write-Through Scheme: Valid-Invalid Coherence t1: Store A=1 P1 P2 A [V]: 0 1 Write-through A [V]: I]: 0 No-write-allocate t3: Invalidate A t2: BusWr A=1 Bus Valid-Invalid Coherence A: A: 0 1 Main Memory Allows multiple readers, but must write through to bus Write-through, no-write-allocate cache All caches must monitor (aka snoop ) all bus traffic simple state machine for each cache frame Slide 20

Valid-Invalid Snooping Protocol Load / BusRd Load / -- Valid Store / BusWr BusWr Actions: From proc: Ld, St From bus: BusRd, BusWr Write-through, no-write-allocate cache Invalid 1 bit of storage overhead per cache frame Store / BusWr EECS 570 Slide 21

Supporting Write-Back Caches Write-back caches drastically reduce bus write bandwidth Key idea: add notion of ownership to Valid-Invalid Mutual exclusion when owner has only replica of a cache block, it may update it freely Sharing multiple readers are ok, but they may not write without gaining ownership Need to find which cache (if any) is an owner on read misses Need to eventually update memory so writes are not lost Slide 22

Modified-Shared-Invalid (MSI) Protocol Three states tracked per-block at each cache Invalid cache does not have a copy Shared cache has a read-only copy; clean Clean == memory is up to date Modified cache has the only copy; writable; dirty Dirty == memory is out of date Three processor actions Load, Store, Evict Five bus messages BusRd, BusRdX, BusInv, BusWB, BusReply Could combine some of these Slide 23

Modified-Shared Invalid (MSI) Protocol Load / BusRd Invalid Shared P1 1: Load A P2 A [I A S]: [I] 0 A [I] 2: BusRd A 3: BusReply A Bus A: 0 EECS 570 Slide 24

Modified-Shared Invalid (MSI) Protocol Invalid Load / BusRd Shared BusRd / [BusReply] Load / -- P1 1: Load A P2 1: Load A A [S]: 0 A [I A S]: [I] 0 3: BusReply A Bus A: 0 2: BusRd A EECS 570 Slide 25

Modified-Shared Invalid (MSI) Protocol Invalid Load / BusRd Evict / -- Shared BusRd / [BusReply] Load / -- P1 P2 Evict A A [S]: 0 A [S]: [I] I] 0 Bus A: 0 EECS 570 Slide 26

Store / BusRdX Modified-Shared Invalid (MSI) Protocol Load / BusRd BusRd / [BusReply] Invalid BusRdX / [BusReply] Shared Load / -- Evict / -- P1 A [S]: I]: 0 P2 1: Store A A [IA M]: [I] 0 1 Modified 3: BusReply A Bus 2: BusRdX A Load, Store / -- A: 0 EECS 570 Slide 27

Store / BusRdX Modified-Shared Invalid (MSI) Protocol Load / BusRd BusRd / [BusReply] Invalid BusRdX / [BusReply] Shared Load / -- Evict / -- P1 1: Load A P2 A [I A S]: [I] 1 A [M]: S]: 1 Modified 2: BusRd A Bus 3: BusReply A Load, Store / -- EECS 570 4: Snarf A A: A: 001 Slide 28

Store / BusRdX Modified-Shared Invalid (MSI) Protocol Load / BusRd BusRd / [BusReply] Invalid BusRdX, BusRdX BusInv / / [BusReply] Shared Load / -- Evict / -- Load, Store / -- EECS 570 Modified 2: BusInv A P1 1: Store A aka Upgrade A A [S[S]: M]: 12 A [S]: I] 1 Bus A: 1 P2 Slide 29

Store / BusRdX Modified-Shared Invalid (MSI) Protocol Load / BusRd BusRd / [BusReply] Invalid BusRdX, BusInv / [BusReply] Shared Load / -- BusRdX / BusReply Evict / -- P1 P2 A [M]: I]: 2 A [I A M]: [I] 3 1: Store A Modified 3: BusReply A Bus 2: BusRdX A Load, Store / -- A: 1 EECS 570 Slide 30

Store / BusRdX Modified-Shared Invalid (MSI) Protocol Load / BusRd BusRd / [BusReply] Invalid BusRdX, BusInv / [BusReply] Shared Load / -- Evict / BusWB BusRdX / BusReply Evict / -- P1 A [I] P2 A [M]: I]: 3 1: Evict A Load, Store / -- EECS 570 Modified Bus A: A: 1 3 2: BusWB A Slide 31

Store / BusRdX MSI Protocol Summary Load / BusRd BusRd / [BusReply] Invalid BusRdX, BusInv / [BusReply] Shared Load / -- Evict / BusWB Modified BusRdX / BusReply Evict / -- Cache Actions: Load, Store, Evict Bus Actions: BusRd, BusRdX BusInv, BusWB, BusReply Load, Store / -- EECS 570 Slide 32

Coherence vs. Consistency Intuition says loads should return latest value what is latest? Coherence concerns only one memory location Consistency concerns apparent ordering for all locations A Memory System is Coherent if can serialize all operations to that location such that, operations performed by any processor appear in program order program order = order defined program text or assembly code value returned by a read is value written by last store to that location Slide 33

Why Coherence!= Consistency /* initial A = B = flag = 0 */ P1 P2 A = 1; while (flag == 0); /* spin */ B = 1; print A; flag = 1; print B; Intuition says printed A = B = 1 Coherence doesn t say anything. Why? Your uniprocessor ordering mechanisms (ld/st queue) won t help. Why? Slide 34

Sequential Consistency (SC) processors issue memory ops in program order P1 P2 P3 switch randomly set after each memory op provides single sequential order among all operations Memory Slide 35

Sufficient Conditions for SC A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program -Lamport, 1979 Every proc. issues memory ops in program order Memory ops happen (start and end) atomically must wait for store to complete before issuing next memory op after load, issuing proc waits for load to complete, before issuing next op Easily implemented with a shared bus Slide 36

Dekker s Algorithm Mutually exclusive access to a critical region Works as advertised under sequential consistency /* initial A = B = 0 */ P1 P2 A = 1; B=1; if (B!= 0) goto retry; if (A!= 0) goto retry; /* enter critical section*/ /* enter critical section*/ Slide 37

Problems with SC Memory Model Difficult to implement efficiently in hardware Straight-forward implementations: No concurrency among memory access Strict ordering of memory accesses at each node Essentially precludes out-of-order CPUs Unnecessarily restrictive Most parallel programs won t notice out-of-order accesses Conflicts with latency hiding techniques Slide 38

E.g., Add a Store Buffer Allow reads to bypass incomplete writes Reads search store buffer for matching values Hides all latency of store misses in uniprocessors, but Slide 39

Dekker s Algorithm w/ Store Buffer P1 P2 Read B t1 Write A t3 Read A t2 Write B t4 Shared Bus P1 A = 1; if (B!= 0) goto retry; A: 0 B: 0 P2 B=1; if (A!= 0) goto retry; /* enter critical section*/ /* enter critical section*/ Slide 40

Fixing SC Performance Option 1: Change the memory model Weak/Relaxed Consistency Programmer specifies when order matters Other access happen concurrently/ out-of-order + Simple hardware can yield high performance Programmer must reason under counter-intuitive rules P P P Overlap Accesses Relaxed ordering Option 2: Speculatively ignore ordering rules In-window Speculation & InvisiFence Order matters only if re-orderings are observed Ignore the rules and hope no-one notices Works because data races are rare + Performance of relaxed consistency with simple programming model More sophisticated HW; speculation can lead to pathological behavior P St P Ld R Ld Q ckpt Mem One of the most esoteric (but important) topics in multiprocessors More in 570 Slide 41

Multiprocessors Are Here To Stay Moore s law is making the multiprocessor a commodity part 500M transistors on a chip, what to do with all of them? Not enough ILP to justify a huge uniprocessor Really big caches? t hit increases, diminishing % miss returns Chip multiprocessors (CMPs) Multiple full processors on a single chip Example: IBM POWER4: two 1GHz processors, 1MB L2, L3 tags Multiprocessors a huge part of computer architecture Multiprocessor equivalent of H+P is 50% bigger than H+P Another entire cores on multiprocessor architecture Hopefully every other year or so (not this year) Slide 53

Multiprocessing & Power Consumption Multiprocessing can be very power efficient Recall: dynamic voltage and frequency scaling Performance vs power is NOT linear Example: Intel s Xscale 1 GHz 200 MHz reduces energy used by 30x Impact of parallel execution What if we used 5 Xscales at 200Mhz? Similar performance as a 1Ghz Xscale, but 1/6th the energy 5 cores * 1/30th = 1/6th Assumes parallel speedup (a difficult task) Remember Ahmdal s law Slide 54

Shared Memory Summary Shared-memory multiprocessors + Simple software: easy data sharing, handles both DLP and TLP Complex hardware: must provide illusion of global address space Two basic implementations Symmetric (UMA) multi-processors (SMPs) Underlying communication network: bus (ordered) + Low-latency, simple protocols that rely on global order Low-bandwidth, poor scalability Scalable (NUMA) multi-processors (MPPs) Underlying communication network: point-to-point (unordered) + Scalable bandwidth Higher-latency, complex protocols Slide 55

Shared Memory Summary Two aspects to global memory space illusion Coherence: consistent view of individual cache lines Absolute coherence not needed, relative coherence OK VI and MSI protocols, cache-to-cache transfer optimization Implementation? SMP: snooping, MPP: directories Consistency: consistent global view of all memory locations Programmers intuitively expect sequential consistency (SC) Global interleaving of individual processor access streams Not naturally provided by coherence, needs extra stuff Relaxed ordering: consistency only for synchronization points Slide 56