EECS 470 Lecture 23: Multiprocessors. Prof. Thomas Wenisch. Portions: Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, Vijaykumar.


Multiprocessors Fall 2007 Prof. Thomas Wenisch www.eecs.umich.edu/courses/eecs470 Slides developed in part by Profs. Falsafi, Hill, Hoe, Lipasti, Martin, Roth, Shen, Smith, Sohi, and Vijaykumar of Carnegie Mellon University, Purdue University, University of Pennsylvania, and University of Wisconsin. Slide 1

Announcements Course evaluations at end of lecture today. HW 6 posted, due 12/7. Project due 12/10. In-class presentations (~8 minutes + questions). Slide 2

Thread-Level Parallelism

struct acct_t { int bal; };
shared struct acct_t accts[MAX_ACCT];
int id, amt;

if (accts[id].bal >= amt) {     0: addi r1,accts,r3
    accts[id].bal -= amt;       1: ld 0(r3),r4
    spew_cash();                2: blt r4,r2,6
}                               3: sub r4,r2,r4
                                4: st r4,0(r3)
                                5: call spew_cash

Thread-level parallelism (TLP): a collection of asynchronous tasks, not started and stopped together, with data shared loosely and dynamically. Example: database/web server (each query is a thread). accts is shared, so it can't be register allocated even if it were scalar. id and amt are private variables, register allocated to r1 and r2. Slide 3
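To make the race concrete, here is a minimal runnable C sketch of the same withdrawal. The pthread driver, the MAX_ACCT value, and the spew_cash body are illustrative assumptions, not from the slides:

```c
/* Sketch: two threads race on the slide's withdrawal code.
 * MAX_ACCT, the pthread driver, and spew_cash() are illustrative. */
#include <pthread.h>
#include <stdio.h>

#define MAX_ACCT 1024

struct acct_t { int bal; };
struct acct_t accts[MAX_ACCT];      /* shared: cannot live in a register */

void spew_cash(void) { puts("cash!"); }

void *withdraw(void *arg) {
    (void)arg;
    int id = 241, amt = 100;        /* private: register allocated (r1, r2) */
    if (accts[id].bal >= amt) {     /* ld + blt                             */
        accts[id].bal -= amt;       /* sub + st: not atomic with the load!  */
        spew_cash();                /* call                                 */
    }
    return NULL;
}

int main(void) {
    accts[241].bal = 500;
    pthread_t t0, t1;
    pthread_create(&t0, NULL, withdraw, NULL);
    pthread_create(&t1, NULL, withdraw, NULL);
    pthread_join(t0, NULL);
    pthread_join(t1, NULL);
    /* Correct outcome is 300; an unlucky interleaving leaves 400. */
    printf("final balance: %d\n", accts[241].bal);
    return 0;
}
```

Run it in a loop and the lost-update outcome eventually appears; this is exactly the interleaving the upcoming coherence slides track.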

Shared-Memory Multiprocessors Shared memory: multiple execution contexts sharing a single address space. Multiple programs (MIMD), or more frequently multiple copies of one program (SPMD). Implicit (automatic) communication via loads and stores. [Figure: P1-P4 connected to a shared memory system] Slide 4

Why Shared Memory? Pluses: for applications, looks like a multitasking uniprocessor; for the OS, only evolutionary extensions required; easy to do communication without the OS; software can worry about correctness first, then performance. Minuses: proper synchronization is complex; communication is implicit, so harder to optimize; hardware designers must implement it. Result: traditionally bus-based symmetric multiprocessors (SMPs), and now CMPs, are the most successful parallel machines ever, and the first with multi-billion-dollar markets. Slide 5

In More Detail Efficient naming: virtual-to-physical mapping using TLBs; ability to name relevant portions of objects. Ease and efficiency of caching: caching is natural and well understood; can be done in HW automatically. Communication overhead: low, since protection is built into the memory system; easy for HW to packetize requests/replies. Integration of latency tolerance: demand driven (consistency models, prefetching, multithreading); can extend to push data to processors and use bulk transfer. Slide 6

Shared-Memory Multiprocessors Provide a shared memory abstraction. Familiar and efficient for programmers. [Figure: P1-P4, each with a cache and memory M1-M4 behind an interface, joined by an interconnection network] Slide 7

Paired vs. Separate Processor/Memory? Separate processor/memory: uniform memory access (UMA): equal latency to all memory. + Simple software: doesn't matter where you put data. - Lower peak performance. Bus-based UMAs common: symmetric multiprocessors (SMP). Paired processor/memory: non-uniform memory access (NUMA): faster to local memory. - More complex software: where you put data matters. + Higher peak performance, assuming proper data placement. [Figure: CPUs with caches on a shared bus to memory (UMA) vs. CPU+memory pairs joined by routers (NUMA)] Slide 8

Shared vs. Point-to-Point Networks Shared network, e.g., bus (left): + Low latency. - Low bandwidth: doesn't scale beyond ~16 processors. + The shared property simplifies cache coherence protocols (later). Point-to-point network, e.g., mesh or ring (right): - Longer latency: may need multiple hops to communicate. + Higher bandwidth: scales to 1000s of processors. - Cache coherence protocols are complex. [Figure: bus-connected CPU/memory nodes vs. a ring of CPU+memory+router nodes] Slide 9

Organizing Point-To-Point Networks Network topology: organization of the network. Trade off performance (connectivity, latency, bandwidth) against cost. Router chips: networks that require separate router chips are indirect; networks that use processor/memory/router packages are direct. + Fewer components, "glueless MP". Point-to-point network examples: indirect tree (left), direct mesh or ring (right). [Figure: tree of routers above CPU/memory nodes vs. a ring of CPU+memory+router nodes] Slide 10

Implementation #1: Snooping Bus MP Two basic implementations. Bus-based systems: typically small, 2-8 (maybe 16) processors. Typically processors split from memories (UMA). Sometimes multiple processors on a single chip (CMP). Symmetric multiprocessors (SMPs): common, I use one every day. [Figure: four CPUs with caches on a shared bus to memories] Slide 11

Implementation #2: Scalable MP General point-to-point network-based systems. Typically processor/memory/router blocks (NUMA). Glueless MP: no need for additional glue chips. Can be arbitrarily large: 1000s of processors. Massively parallel processors (MPPs): in reality only government (DoD) has MPPs; companies have much smaller systems, 32-64 processors: scalable multiprocessors. [Figure: four CPU+memory+router nodes in a point-to-point network] Slide 12

Issues for Shared Memory Systems Two in particular: cache coherence and the memory consistency model. Closely related to each other. Different solutions for SMPs and MPPs. Slide 13

An Example Execution Processor 0 and Processor 1 each run: 0: addi r1,accts,r3 / 1: ld 0(r3),r4 / 2: blt r4,r2,6 / 3: sub r4,r2,r4 / 4: st r4,0(r3) / 5: call spew_cash. Two $100 withdrawals from account #241 at two ATMs. Each transaction maps to a thread on a different processor. Track accts[241].bal (address is in r3) in CPU0's cache, CPU1's cache, and memory. Slide 14

No-Cache, No-Problem Scenario I: processors have no caches. Memory value after each access: 500 (initial) -> 500 (P0 ld) -> 400 (P0 st) -> 400 (P1 ld) -> 300 (P1 st). No problem: the final balance is 300, as expected. Slide 15

Cache Incoherence Scenario II: processors have write-back caches. Potentially 3 copies of accts[241].bal: memory, P0$, P1$.

  Step      P0$      P1$      Mem
  initial   --       --       500
  P0 ld     V:500    --       500
  P0 st     D:400    --       500
  P1 ld     D:400    V:500    500
  P1 st     D:400    D:400    500

Can get incoherent (inconsistent): P1 loads the stale 500 from memory, and one withdrawal is lost. Slide 16

Hardware Cache Coherence Absolute coherence: all copies have the same data at all times. Hard to implement and slow. + Not strictly necessary. Relative coherence: temporary incoherence is OK (e.g., write-back), as long as all loads get the right values. Coherence controller (CC): examines bus traffic (addresses and data) and executes the coherence protocol: what to do with the local copy when you see different things happening on the bus. [Figure: CPU above D$ tags and D$ data, with a coherence controller between the cache and the bus] Slide 17

Snooping Cache-Coherence Protocols The bus provides a serialization point. Each cache controller snoops all bus transactions; for relevant transactions (a block it contains), it takes action to ensure coherence: invalidate, update, or supply the value, depending on the state of the block and the protocol. Simultaneous operation of independent controllers. Slide 18

Snooping Design Choices The controller updates the state of blocks in response to processor and snoop events and generates bus transactions. Often have duplicate cache tags. Snoopy protocol: set of states, state transition diagram, actions. Basic choices: write-through vs. write-back; invalidate vs. update. [Figure: cache of State/Tag/Data entries, driven by processor ld/st on one side and snooped bus transactions on the other] Slide 19

The Simple Invalidate Snooping Protocol Write-through, no-write-allocate cache. Actions: PrRd, PrWr, BusRd, BusWr. Valid: PrRd / --; PrWr / BusWr; an observed BusWr moves the block to Invalid. Invalid: PrRd / BusRd moves to Valid; PrWr / BusWr stays Invalid. Slide 20
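As a sanity check on the diagram, here is a small C sketch of the same two-state transition function; the enum and function names are illustrative, not from the slides:

```c
/* Illustrative transition function for the two-state (Valid/Invalid)
 * write-through, no-write-allocate invalidate protocol. */
typedef enum { INVALID, VALID } vi_state_t;
typedef enum { PR_RD, PR_WR, BUS_RD, BUS_WR } vi_event_t;

/* Returns the next state; *issue is set to the bus transaction to
 * generate (BUS_RD or BUS_WR), or -1 for none. */
vi_state_t vi_next(vi_state_t s, vi_event_t e, int *issue) {
    *issue = -1;
    switch (e) {
    case PR_RD:                       /* read hit stays Valid; miss fetches  */
        if (s == INVALID) *issue = BUS_RD;
        return VALID;
    case PR_WR:                       /* write-through: every write on bus   */
        *issue = BUS_WR;
        return s;                     /* no-write-allocate: Invalid stays    */
    case BUS_WR:                      /* another cache wrote: drop our copy  */
        return INVALID;
    case BUS_RD:                      /* other readers don't affect us       */
    default:
        return s;
    }
}
```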

More Generally: MOESI [Sweazey & Smith, ISCA '86] M = Modified (dirty). O = Owned (dirty but shared) WHY? E = Exclusive (clean, unshared): only copy, not dirty. S = Shared. I = Invalid. Variants: MSI, MESI, MOSI, MOESI. [Figure: Venn diagram over the states: validity = {M, O, E, S}; ownership = {M, O}; exclusiveness = {M, E}] Slide 21
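The three overlapping properties in the Venn diagram can be written down directly as predicates; a minimal C sketch (enum and helper names are illustrative):

```c
/* Sketch: the three MOESI dimensions from the slide, as predicates. */
typedef enum { I, S, E, O, M } moesi_t;

int is_valid(moesi_t s)     { return s != I; }            /* validity: holds usable data   */
int is_owner(moesi_t s)     { return s == M || s == O; }  /* ownership: supplies data,     */
                                                          /* responsible for write-back    */
int is_exclusive(moesi_t s) { return s == M || s == E; }  /* exclusiveness: only copy, may */
                                                          /* write without a bus request   */
```

This also answers the slide's "WHY?" for O: it lets a dirty line be shared (served cache-to-cache) without first writing it back to memory.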

Base Snoopy Organization [Figure: cache data RAM with two copies of the tags and state, one for the processor-side controller and one (snoop state) for the bus-side controller; comparators on each side, a write-back buffer, and addr/cmd/data paths to the system bus] Slide 22

Non-Atomic State Transitions Operations involve multiple actions: look up cache tags, arbitrate for the bus, check for writeback. Even if the bus is atomic, the overall set of actions is not. Race conditions among multiple operations: suppose P1 and P2 attempt to write cached block A, and each decides to issue BusUpgr to allow S -> M. Issues: handle requests for other blocks while waiting to acquire the bus; must also handle requests for this block A. Slide 23

Non-Atomicity -> Transient States Two types of states: stable (e.g., MESI) and transient or intermediate. Increases complexity. [Figure: MESI diagram augmented with transient states: a PrWr in S or I issues BusReq and waits in S->M or I->M until BusGrant triggers BusUpgr or BusRdX; a PrRd in I issues BusReq and waits in I->S,E until BusGrant triggers BusRd; observed BusRd/BusRdX still force Flush or invalidation while waiting] Slide 24

Scalability Problems of Snoopy Coherence Prohibitive bus bandwidth: required bandwidth grows with the number of CPUs, but available BW per bus is fixed; adding busses makes serialization/ordering hard. Prohibitive processor snooping bandwidth: all caches do a tag lookup when ANY processor accesses memory; inclusion limits this to the L2, but that is still lots of lookups. Upshot: bus-based coherence doesn't scale beyond 8-16 CPUs. Slide 25

Scalable Cache Coherence Scalable cache coherence: a two-part solution. Part I: bus bandwidth. Replace the non-scalable bandwidth substrate (bus) with a scalable-bandwidth one (point-to-point network, e.g., mesh). Part II: processor snooping bandwidth. Interesting: most snoops result in no action. Replace the non-scalable broadcast protocol (spam everyone) with a scalable directory protocol (only spam processors that care). Slide 26

Directory Coherence Protocols Observe: the physical address space is statically partitioned. + Can easily determine which memory module holds a given line; that memory module is sometimes called the home. - Can't easily determine which processors have the line in their caches. Bus-based protocol: broadcast events to all processors/caches. ± Simple and fast, but non-scalable. Directories: a non-broadcast coherence protocol. Extend memory to track caching information. For each physical cache line whose home this is, track: Owner: which processor has a dirty copy (i.e., M state). Sharers: which processors have clean copies (i.e., S state). A processor sends coherence events to the home directory; the home directory only sends events to processors that care. Slide 27
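A directory entry per home line only needs the state plus the owner/sharers bookkeeping the slide lists. A minimal C sketch, assuming at most 64 processors and illustrative names:

```c
/* Sketch of a per-line directory entry at the line's home node.
 * Assumes <= 64 processors: bit i of sharers set => processor i
 * holds a clean copy. Names are illustrative. */
#include <stdint.h>

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_MODIFIED } dir_state_t;

struct dir_entry {
    dir_state_t state;
    uint8_t     owner;    /* valid when state == DIR_MODIFIED (the M copy)   */
    uint64_t    sharers;  /* bitmask of processors with clean (S) copies     */
};

/* On a write request, the home sends invalidations only to set bits:
 * the "only spam processors that care" part of the protocol. */
static inline int needs_inval(const struct dir_entry *e, int p) {
    return (e->sharers >> p) & 1;
}
```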

Read Processing [Figure: Node #1 sends Read A (miss) to the directory; the directory responds with the data and records A: Shared, #1] Slide 28

Write Processing [Figure: Node #2 writes A; the directory, which had recorded A: Shared, #1 from Node #1's earlier read miss, invalidates #1's copy and records A: Mod., #2] Trade-offs: longer accesses (3-hop between processor, directory, and the other processor); lower bandwidth, since no snoops are necessary. Makes sense either for CMPs (lots of L1 miss traffic) or large-scale servers (shared-memory MP with > 32 nodes). Slide 29

Serialization and Ordering Let A and flag be 0.

P1:                  P2:
A += 5               while (flag == 0)
flag = 1             print A

Assume A and flag are in different cache blocks. What happens? How do you implement it correctly? Slide 30
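One portable answer to "how do you implement it correctly?" is to tag flag with release/acquire semantics so the store to A is ordered before the store to flag. A minimal C11 sketch; the pthread driver is an illustrative assumption, not part of the slide:

```c
/* Sketch: the slide's A/flag handoff, made correct on any memory
 * model by ordering the flag store/load explicitly. */
#include <stdatomic.h>
#include <stdio.h>
#include <pthread.h>

int A = 0;
atomic_int flag = 0;

void *p1(void *unused) {
    (void)unused;
    A += 5;
    atomic_store_explicit(&flag, 1, memory_order_release); /* A visible before flag */
    return NULL;
}

void *p2(void *unused) {
    (void)unused;
    while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
        ;                                                  /* spin */
    printf("A = %d\n", A);                                 /* always prints 5 */
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, p1, NULL);
    pthread_create(&t2, NULL, p2, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    return 0;
}
```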

Coherence vs. Consistency Intuition says loads should return the latest value. What is "latest"? Coherence concerns only one memory location; consistency concerns the apparent ordering of all locations. A memory system is coherent if all operations to a given location can be serialized such that: operations performed by any processor appear in program order (program order = order defined by the program text or assembly code); and the value returned by a read is the value written by the last store to that location. Slide 31

Why Coherence != Consistency

/* initial A = B = flag = 0 */
P1:            P2:
A = 1;         while (flag == 0); /* spin */
B = 1;         print A;
flag = 1;      print B;

Intuition says it prints A = B = 1. Coherence doesn't say anything. Why? Your uniprocessor ordering mechanisms (ld/st queue) won't help. Why? Slide 32

Sequential Consistency (SC) Processors issue memory ops in program order. A switch, randomly set after each memory op, selects which processor accesses memory next. Provides a single sequential order among all operations. [Figure: P1-P3 funneled through a switch into a single memory] Slide 33

Sufficient Conditions for SC Every processor issues memory ops in program order. Memory ops happen (start and end) atomically: must wait for a store to complete before issuing the next memory op; after a load, the issuing processor waits for the load to complete before issuing the next op. Easily implemented with a shared bus. Slide 34

Relaxed Memory Models Motivation, with directory protocols: misses have longer latency (want to perform cache hits under a miss to get to the next miss); collecting acknowledgements can take even longer. Recall SC has: each processor generates a total order of its reads and writes (R->R, R->W, W->W, & W->R) that are interleaved into a global total order. Example relaxed models: PC: relax ordering from writes to (other processors') reads. RC: relax all read/write orderings (but add "fences"). Slide 35

Relax Write-to-Read Order

/* initial A = B = 0 */
P1:            P2:
A = 1;         B = 1;
r1 = B;        r2 = A;

Processor Consistent (PC) models allow r1 == r2 == 0 (precluded by SC). Examples: IBM 370, Sun TSO, & Intel IA-32. Why do this? Allows FIFO store buffers. Does not astonish programmers (too much). Slide 36
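This litmus test can be written in C11 directly. A sketch (thread driver omitted): with relaxed atomics the r1 == r2 == 0 outcome the slide describes is permitted, mirroring what PC/TSO allow, while memory_order_seq_cst on all four accesses would forbid it:

```c
/* Sketch of the slide's store-buffering litmus test in C11.
 * With relaxed ordering, r1 == r2 == 0 is a legal outcome;
 * memory_order_seq_cst everywhere restores the SC result. */
#include <stdatomic.h>

atomic_int A = 0, B = 0;
int r1, r2;   /* each written by exactly one thread */

void p1(void) {
    atomic_store_explicit(&A, 1, memory_order_relaxed);
    r1 = atomic_load_explicit(&B, memory_order_relaxed);
}

void p2(void) {
    atomic_store_explicit(&B, 1, memory_order_relaxed);
    r2 = atomic_load_explicit(&A, memory_order_relaxed);
}
```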

Write Buffers w/ Read Bypass

P1:                        P2:
Flag1 = 1;                 Flag2 = 1;
if (Flag2 == 0)            if (Flag1 == 0)
    critical section           critical section

[Figure: with Flag1 = 0 and Flag2 = 0 in memory, each processor's write (t3, t4) sits in its FIFO store buffer while its read (t1, t2) bypasses to the shared bus first, so both reads return 0 and both processors enter the critical section] Slide 37

Why Not Relax All Order?

/* initially all 0 */
P1:            P2:
A = 1;         while (flag == 0); /* spin */
B = 1;         r1 = A;
flag = 1;      r2 = B;

Reordering of A = 1 / B = 1 or of r1 = A / r2 = B can happen via OoO processors, non-FIFO store buffers, delayed directory acks, etc. But we must order: A = 1 and B = 1 before flag = 1; and flag != 0 before r1 = A and r2 = B. Solution: the programmer indicates when ordering is needed. Slide 38
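"Programmer indicates when ordering is needed" is exactly what a fence is for. A minimal C11 sketch of this example using explicit fences (the driver is omitted; the fence placement is the point):

```c
/* Sketch: the slide's example under a relaxed model, with explicit
 * fences supplying exactly the two orderings that must hold. */
#include <stdatomic.h>

int A = 0, B = 0, r1, r2;
atomic_int flag = 0;

void producer(void) {
    A = 1;
    B = 1;
    atomic_thread_fence(memory_order_release);             /* A, B before flag */
    atomic_store_explicit(&flag, 1, memory_order_relaxed);
}

void consumer(void) {
    while (atomic_load_explicit(&flag, memory_order_relaxed) == 0)
        ;                                                  /* spin */
    atomic_thread_fence(memory_order_acquire);             /* flag before A, B */
    r1 = A;
    r2 = B;                                                /* both guaranteed 1 */
}
```

All other accesses remain free to reorder, which is the performance argument for relaxed models.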

The Store Problem in MP Uniprocessors: never wait to write memory; the CPU puts St X into the store buffer and moves on to Ld Y. Multiprocessors: must wait, order matters! Producer: (1) St data, x; (2) St valid, 1. Consumer: spin until valid, then Ld data. How to know which stores must wait? 2007 Thomas Wenisch Slide 39

The Consistency Debate The programmer must choose: Strict memory ordering, e.g., Sequential Consistency (SC): intuitive, like a multitasked uniprocessor, but slow: wait for all stores. Relaxed memory ordering, e.g., Sun RMO: complex, SW enforces order, but fast: parallel / OoO accesses. Can we get intuitive and fast? 2007 Thomas Wenisch Slide 40

Observations Order only matters when CPUs race. Races are rare, so speculate! [Gniady 99] [von Praun 06]. Even under RMO, fences are often unnecessary: 5-10% performance loss in commercial apps. If an access sequence appears atomic, it is OK to reorder within the sequence. Enforce order over coarse-grain atomic sequences: strict memory order with RMO performance. 2007 Thomas Wenisch Slide 41

Store-Wait-Free Multiprocessors [ISCA '07] Atomic Sequence Ordering: practical CPU rollback via checkpoints; order is enforced over checkpointed instruction sequences. Scalable Store Buffer (SSB): unlike prior designs, which hold speculative values in an associatively searched buffer beside the L1, speculative and consistent values live in the on-chip memory hierarchy, with commit/rollback and no associative search. [Figure: CPU with checkpointed atomic sequences; prior design (store buffer before L1/L2) vs. SSB design (speculative and consistent state in L1/L2)] 2007 Thomas Wenisch Slide 42

Multiprocessors Are Here To Stay Moore's law is making the multiprocessor a commodity part: 500M transistors on a chip, what to do with all of them? Not enough ILP to justify a huge uniprocessor. Really big caches? t_hit increases, with diminishing %miss returns. Chip multiprocessors (CMPs): multiple full processors on a single chip. Example: IBM POWER4: two 1 GHz processors, 1MB L2, L3 tags. Multiprocessors are a huge part of computer architecture: the multiprocessor equivalent of H+P is 50% bigger than H+P. There is another entire course on multiprocessor architecture, hopefully every other year or so (not this year). Slide 43

Multiprocessing & Power Consumption Multiprocessing can be very power efficient. Recall: dynamic voltage and frequency scaling. Performance vs. power is NOT linear. Example: Intel's Xscale: 1 GHz -> 200 MHz reduces energy used by 30x. Impact of parallel execution: what if we used 5 Xscales at 200 MHz? Similar performance as a 1 GHz Xscale, but 1/6th the energy: 5 cores * 1/30th = 1/6th. Assumes parallel speedup (a difficult task). Remember Amdahl's law. Slide 44
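Writing the slide's arithmetic out (assuming perfect parallel speedup, so five 200 MHz cores match one 1 GHz core's throughput):

$$E_{5\times 200\,\mathrm{MHz}} \;=\; 5 \times \frac{E_{1\,\mathrm{GHz}}}{30} \;=\; \frac{E_{1\,\mathrm{GHz}}}{6}$$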

Shared Memory Summary Shared-memory multiprocessors: + Simple software: easy data sharing, handles both DLP and TLP. - Complex hardware: must provide the illusion of a global address space. Two basic implementations. Symmetric (UMA) multiprocessors (SMPs): underlying communication network is a bus (ordered). + Low latency, simple protocols that rely on global order. - Low bandwidth, poor scalability. Scalable (NUMA) multiprocessors (MPPs): underlying communication network is point-to-point (unordered). + Scalable bandwidth. - Higher latency, complex protocols. Slide 45

Shared Memory Summary Two aspects to the global memory space illusion. Coherence: a consistent view of individual cache lines. Absolute coherence is not needed; relative coherence is OK. VI and MSI protocols, cache-to-cache transfer optimization. Implementation? SMP: snooping; MPP: directories. Consistency: a consistent global view of all memory locations. Programmers intuitively expect sequential consistency (SC): a global interleaving of the individual processors' access streams. Not naturally provided by coherence; needs extra machinery. Relaxed ordering: consistency only at synchronization points. Slide 46