ECE7660 Parallel Computer Architecture. Shared Memory Multiprocessors


ECE7660 Parallel Computer Architecture Shared Memory Multiprocessors 1

Layer Perspective

Figure: parallel applications (CAD, database, scientific modeling, multiprogramming) sit on top of programming models (shared address space, message passing, data parallel); compilation or a library maps them onto the communication abstraction, which forms the user/system boundary; below that are operating system support and communication hardware (the hardware/software boundary) over the physical communication medium. Conceptual picture: processors P1..Pn sharing a single memory (Mem). 2

Natural Extensions of Memory System

Figure: three organizations, in increasing order of scale. (1) Shared cache: P1..Pn connect through a switch to an interleaved first-level cache and interleaved main memory. (2) Centralized memory ("dance hall", UMA): each processor has a private cache; processors and memory modules meet across an interconnection network. (3) Distributed memory (NUMA): each node pairs a processor and cache with local memory, and nodes are joined by an interconnection network. 3

Bus-Based Symmetric Shared Memory

Figure: processors P1..Pn, each with a cache ($), share a bus with memory (Mem) and I/O devices.

These machines dominate the server market, are the building blocks for larger systems, and are arriving at the desktop. They are attractive as throughput servers and for parallel programs: fine-grain resource sharing, uniform access via loads/stores, and automatic data movement with coherent replication in caches. A cheap and powerful extension: normal uniprocessor mechanisms are used to access data. The key is extending the memory hierarchy to support multiple processors. 4

Caches are Critical for Performance

Caches reduce average latency (automatic replication closer to the processor) and reduce average bandwidth demand. Data is logically transferred from producer to consumer through memory:

    store reg --> mem        load reg <-- mem

Many processors can share data efficiently this way. But what happens when the store and the load are executed on different processors? 5

Example Cache Coherence Problem

Figure: u is initially 5 in memory. (1) P1 reads u (caches u:5). (2) P3 reads u (caches u:5). (3) P3 writes u = 7. (4) P1 reads u. (5) P2 reads u.

Processors see different values for u after event 3. With write-back caches, the value written back to memory depends on which cache flushes or writes back the value first, so processes accessing main memory may see a very stale value. Unacceptable to programs, and frequent! 6

Caches and Cache Coherence

Caches play a key role in all cases: they reduce average data access time and reduce bandwidth demands placed on the shared interconnect. But private processor caches create a problem: copies of a variable can be present in multiple caches, and a write by one processor may not become visible to others, which will keep accessing the stale value in their caches. This is the cache coherence problem. What do we do about it? Either organize the memory hierarchy to make it go away, or detect the problem and take actions to eliminate it. 7

Shared Cache: Examples

Alliant FX-8 (early 80's): eight 68020s with a crossbar to a 512 KB interleaved cache. Encore & Sequent: first 32-bit micros (NS32032), two to a board with a shared cache.

Figure: P1..Pn connect through a switch to an interleaved cache and interleaved main memory. 8

Advantages

Cache placement is identical to a single cache: only one copy of any cached block, and fine-grain sharing. Communication latency is determined by the level in the storage hierarchy where the access paths meet (2-10 cycles; the Cray X-MP had shared registers!). Potential for positive interference: one processor prefetches data for another. Smaller total storage: only one copy of code/data used by all processors. Data can be shared within a line without ping-pong: long lines without false sharing. 9

Disadvantages

Fundamental bandwidth limitation. Increased latency of all accesses: the crossbar is in the path, the cache is larger, and the L1 hit time determines the processor cycle time! Potential for negative interference: one processor flushes data needed by another. Even so, many L2 caches are shared today. 10

Intuitive Memory Model

Figure: a memory hierarchy holding different values for address 100 at each level (L1 100:67, L2 100:35, memory and disk 100:34).

Reading an address should return the last value written to that address. This is easy in uniprocessors, except for I/O. The cache coherence problem in multiprocessors is both more pervasive and more performance-critical. 11

Snoopy Cache-Coherence Protocols

Figure: each cache holds state, address, and data for its blocks; the caches of P1..Pn snoop the shared bus connecting processors, memory, and I/O devices.

The bus is a broadcast medium, and caches know what they contain. The cache controller snoops all transactions on the shared bus: a transaction is relevant if it is for a block the cache contains, and the controller takes action to ensure coherence (invalidate, update, or supply the value), depending on the state of the block and the protocol. 12

Example: Write-thru Invalidate

Figure: the same scenario as before. P1 and P3 read u:5; P3's write u = 7 goes through to memory and invalidates the other cached copy, so the subsequent reads by P1 and P2 obtain u = 7. 13

Architectural Building Blocks Bus Transactions fundamental system design abstraction single set of wires connect several devices bus protocol: arbitration, command/addr, data => Every device observes every transaction Cache block state transition diagram FSM specifying how disposition of block changes» invalid, valid, dirty 14

Design Choices

The controller updates the state of blocks in response to processor and snoop events and generates bus transactions. A snoopy protocol is a set of states, a state-transition diagram, and actions. Basic choices: write-through vs. write-back, and invalidate vs. update.

Figure: the processor issues Ld/St to the cache controller (state, tag, data), which also snoops the bus. 15

Write-through Invalidate Protocol

Two states per block in each cache, as in a uniprocessor; the state of a block across the machine is a p-vector of per-cache states. Hardware state bits are associated only with blocks that are in the cache; other blocks can be seen as being in the invalid (not-present) state in that cache. Writes invalidate all other caches: there can be multiple simultaneous readers of a block, but a write invalidates them.

State diagram (per cache block): V (valid) and I (invalid). PrRd/BusRd takes I to V; PrRd in V is a hit (no transaction); PrWr issues BusWr from either state without changing it (write-through, no write-allocate); an observed BusWr invalidates V to I. 16
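As a concrete illustration (my own sketch, not from the slides; the type and function names are hypothetical), the two-state controller's transition function can be written in C directly from the diagram above:

    /* Two-state write-through invalidate protocol, one entry per cache block. */
    typedef enum { WT_INVALID, WT_VALID } WtState;
    typedef enum { PR_RD, PR_WR, BUS_WR } WtEvent;  /* processor read/write, snooped bus write */

    /* Returns the next state; *bus_op names the transaction to issue, if any. */
    static WtState wt_next(WtState s, WtEvent e, const char **bus_op) {
        *bus_op = NULL;
        switch (e) {
        case PR_RD:                       /* miss fetches the block; hit stays put */
            if (s == WT_INVALID) { *bus_op = "BusRd"; return WT_VALID; }
            return s;
        case PR_WR:                       /* every write goes on the bus; no write-allocate */
            *bus_op = "BusWr";
            return s;
        case BUS_WR:                      /* another cache wrote: invalidate our copy */
            return WT_INVALID;
        }
        return s;
    }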

Write-through vs. Write-back

The write-through protocol is simple: every write is observable, and every write goes on the bus, so only one write can take place at a time in any processor. But it uses a lot of bandwidth! Example: 200 MHz, dual issue, CPI = 1, 15% of instructions are 8-byte stores => 30 M stores per second per processor => 240 MB/s per processor. A 1 GB/s bus can support only about 4 processors without saturating. 17

Invalidate vs. Update

Basic question of program behavior: is a block written by one processor later read by others before it is overwritten? Invalidate: if yes, readers will take a miss; if no, multiple writes occur without additional traffic, and invalidation also clears out copies that will never be used again. Update: if yes, it avoids misses on later references; if no, it generates multiple useless updates, even to "pack rats" that will never read the value. => Need to look at program reference patterns and hardware complexity. 18

Intuitive Memory Model???

Figure: the same hierarchy as before (L1 100:67, L2 100:35, memory and disk 100:34).

Reading an address should return the last value written to that address. But what does that mean in a multiprocessor? 19

Coherence?

Caches are supposed to be transparent. What would happen if there were no caches? Every memory operation would go to the memory location (there may be multiple memory banks), so all operations on a particular location would be serialized: all processors would see THE order. Interleaving among accesses from different processors: within an individual processor => program order; across processors => constrained only by explicit synchronization. The processor only observes the state of the memory system by issuing memory operations! 20

Definitions

Memory operation: load, store, or read-modify-write. Issued: the operation leaves the processor's internal environment and is presented to the memory subsystem (caches, buffers, busses, DRAM, etc.). Performed with respect to a processor: for a write, subsequent reads return the value; for a read, subsequent writes cannot affect the value returned. Coherent memory system: there exists a serial order of the memory operations on each location such that operations issued by a process appear in the order issued, and the value returned by each read is that written by the previous write in the serial order. => write propagation + write serialization. 21

Is the 2-state Protocol Coherent?

Assume bus transactions and memory operations are atomic, and a one-level cache: all phases of one bus transaction complete before the next one starts; the processor waits for a memory operation to complete before issuing the next; and with a one-level cache, assume invalidations are applied during the bus transaction. All writes go to the bus + atomicity => writes are serialized by the order in which they appear on the bus (bus order) => invalidations are applied to caches in bus order. How do we insert reads in this order? This is important since processors see writes through reads, so it determines whether write serialization is satisfied. But read hits may happen independently: they do not appear on the bus and do not enter directly into bus order. 22

Ordering Reads

Read misses appear on the bus and will see the last write in bus order. Read hits do not appear on the bus, but the value read was placed in the cache by either the most recent write by this processor or the most recent read miss by this processor, and both of those transactions appeared on the bus. So read hits also see values as produced in bus order. 23

Determining Orders More Generally

Memory operation M2 is subsequent to memory operation M1 (M2 >> M1) if the operations are issued by the same processor and M2 follows M1 in program order. Read R >> write W if the read generates a bus transaction that follows the one for W. Write W >> read or write M if M generates a bus transaction and the transaction for W follows the one for M. Write W >> read R if read R does not generate a bus transaction and is not already separated from write W by another bus transaction. 24

Ordering

    P0: R R R W  R R
    P1: R R  R R R W
    P2: R R R  R R R

Writes establish a partial order but don't constrain the ordering of reads, though the bus will order read misses too; any order among reads between writes is fine, as long as it respects program order. 25

Write-Through vs Write-Back

Write-through requires high bandwidth. Write-back caches absorb most writes as cache hits => write hits don't go on the bus. But then how do we ensure write propagation and serialization? We need more sophisticated protocols, and the design space is large. But first, let's understand other ordering issues. 26

Setup for Memory Consistency

Coherence => writes to a location become visible to all processors in the same order. But when does a write become visible? And how do we establish orders between a write and a read by different processors? We use event synchronization, which typically involves more than one location! 27

Example

    /* Assume the initial value of A and flag is 0 */
    P1:               P2:
    A = 1;            while (flag == 0); /* spin idly */
    flag = 1;         print A;

The intuition here is not guaranteed by coherence: we expect memory to respect the order between accesses to different locations issued by a given process, and to preserve orders among accesses to the same location by different processes. Coherence is not enough! It pertains only to a single location. (Conceptual picture: P1..Pn sharing one memory.) 28

Another Example of Ordering?

    /* Assume the initial values of A and B are 0 */
    P1:               P2:
    (1a) A = 1;       (2a) print B;
    (1b) B = 2;       (2b) print A;

What's the intuition? Whatever it is, we need an ordering model for clear semantics, across different locations as well, so programmers can reason about what results are possible. This is the memory consistency model. 29

Memory Consistency Model

Specifies constraints on the order in which memory operations (from any process) can appear to execute with respect to one another: what orders are preserved, and, given a load, what values it can possibly return. Without it, we can't tell much about an SAS program's execution. It has implications for both programmer and system designer: the programmer uses it to reason about correctness and possible results; the system designer uses it to decide how much accesses can be reordered by the compiler or hardware. It is a contract between programmer and system. 30

Sequential Consistency

Figure: processors P1..Pn issue memory references in program order; a switch, randomly set after each memory reference, selects which processor's reference reaches memory next.

A total order is achieved by interleaving accesses from different processes. It maintains program order, and memory operations from all processes appear to [issue, execute, complete] atomically with respect to one another, as if there were no caches and a single memory. "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [Lamport, 1979] 31

What Really is Program Order?

Intuitively, the order in which operations appear in the source code: a straightforward translation of source code to assembly, with at most one memory operation per instruction. But that is not the same as the order presented to the hardware by the compiler. So which is program order? It depends on the layer, and on who's doing the reasoning. We assume the order as seen by the programmer. 32

SC Example

    /* Assume the initial values of A and B are 0 */
    P1:               P2:
    (1a) A = 1;       (2a) print B;
    (1b) B = 2;       (2b) print A;

What matters is the order in which operations appear to execute, not the chronological order of events. Possible outcomes for (A,B): (0,0), (1,0), (1,2). What about (0,2)? Program order gives 1a->1b and 2a->2b; A = 0 implies 2b->1a, which implies 2a->1b; B = 2 implies 1b->2a, which leads to a contradiction. What if the actual execution is 1b->1a->2b->2a? As visible from the results it appears just like 1a->1b->2a->2b, so it is fine; the actual execution 1b->2a->2b->1a is not. 33

Implementing SC

Two kinds of requirements. Program order: memory operations issued by a process must appear to execute (become visible to others and to itself) in program order. Atomicity: in the overall hypothetical total order, one memory operation should appear to complete with respect to all processes before the next one is issued; this guarantees that the total order is consistent across processes (all processes see the same order). The tricky part is making writes atomic. (Figure: an example illustrating the importance of write atomicity for sequential consistency.) 34

Write-back Caches

2 processor operations: PrRd, PrWr. 3 states: invalid, valid (clean), and modified (dirty); ownership determines who supplies the block. 2 bus transactions: read (BusRd) and write-back (BusWB), with only cache-block transfers. => Treat Valid as shared and Modified as exclusive, and introduce one new bus transaction, read-exclusive: a read for the purpose of modifying (read-to-own). 35

MSI Invalidate Protocol

A read obtains the block in shared state, even if it is the only cached copy. A writer must obtain exclusive ownership before writing: BusRdX causes other caches to invalidate (demote); if another cache holds the block in M, the BusRdX makes it flush; BusRdX is issued even on a hit in S, to promote to M (upgrade). What about replacement? S->I and M->I, as before.

State diagram (per block): PrRd and PrWr hit silently in M; a snooped BusRd in M flushes and demotes to S; a snooped BusRdX flushes (from M) and invalidates; PrWr from S or I issues BusRdX and moves to M; PrRd from I issues BusRd and moves to S. 36
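A compact way to see these rules is as a transition table. The following C sketch is my own illustration (names are hypothetical, not the slides' notation), covering the processor-side and snoop-side events of the MSI diagram:

    typedef enum { MSI_I, MSI_S, MSI_M } MsiState;

    /* Processor-side events: returns next state; *bus_op names any transaction issued. */
    static MsiState msi_proc(MsiState s, int is_write, const char **bus_op) {
        *bus_op = NULL;
        if (!is_write) {                           /* PrRd */
            if (s == MSI_I) { *bus_op = "BusRd"; return MSI_S; }  /* miss: fetch shared */
            return s;                              /* read hit in S or M */
        }
        if (s != MSI_M) *bus_op = "BusRdX";        /* PrWr from I or S: read-to-own / upgrade */
        return MSI_M;
    }

    /* Snoop-side events for a block this cache holds; *flush set if dirty data is supplied. */
    static MsiState msi_snoop(MsiState s, int is_busrdx, int *flush) {
        *flush = (s == MSI_M);                     /* M is the owner: flush on BusRd or BusRdX */
        if (is_busrdx) return MSI_I;               /* BusRdX invalidates */
        return (s == MSI_M) ? MSI_S : s;           /* BusRd demotes M to S */
    }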

An Example of the MSI WB Invalidation Protocol 37

Example: Write-Back Protocol

Figure: P0 reads u (BusRd; memory supplies u:5; state S). P1 reads u (BusRd; state S). P4 writes u = 7 (BusRdX invalidates the other copies; state M). A later PrRd of u causes a BusRd that makes P4 flush u = 7, demoting it to S and updating memory. 38

Correctness When is write miss performed? How does writer observe write? How is it made visible to others? How do they observe the write? When is write hit made visible? 39

Write Serialization for Coherence

Writes that appear on the bus (BusRdX) are ordered by the bus: a write is performed in the writer's cache before other transactions, so it is ordered the same with respect to all processors (including the writer), and read misses are also ordered with respect to these. Writes that don't appear on the bus: after P issues BusRdX for block B, all further memory operations on B until the next bus transaction are from P (read and write hits, which are in program order); a read or write from another processor is separated by an intervening bus transaction. What about read hits? 40

Sequential Consistency

The bus imposes a total order on bus transactions for all locations. Between transactions, processors perform reads and writes (locally) in program order. So any execution defines a natural partial order: Mj is subsequent to Mi if (i) it follows Mi in program order on the same processor, or (ii) Mj generates a bus transaction that follows the memory operation for Mi. In the segment between two bus transactions, any interleaving of the local program orders leads to a consistent total order. Within a segment, the writes observed by processor P are serialized as: writes from other processors, by the previous bus transaction P issued; and writes from P, by program order. 41

Sufficient Conditions

Every process issues memory operations in program order. After a write issues, the issuing process waits for the write to complete before issuing its next memory operation. After a read is issued, the issuing process waits for the read to complete, and for the write whose value is being returned to complete (globally), before issuing its next operation. Write completion can be detected when the write appears on the bus. Write atomicity: if a read returns the value of a write, that write has already become visible to all others. 42

Lower-level Protocol Choices

BusRd observed in M state: what transition to make, M -> I or M -> S? It depends on expectations of access patterns. How does memory know whether or not to supply the data on a BusRd? Problem: a read followed by a write takes 2 bus transactions even if there is no sharing (BusRd taking I->S, followed by BusRdX or BusUpgr for S->M). What happens on sequential programs? 43

MESI (4-state) Invalidation Protocol

Add an exclusive state, distinguishing exclusive (writable) from owned (written): in the exclusive state, main memory is up to date, so the cache is not necessarily the owner, and the block can be written locally without a bus transaction. States: invalid; exclusive or exclusive-clean (only this cache has a copy, not modified); shared (two or more caches may have copies); modified (dirty). I -> E on PrRd if no other cache has a copy. => How can you tell? 44

Hardware Support for MESI

Figure: a shared signal implemented as a wired-OR line on the bus.

All cache controllers snoop on BusRd and assert the shared line if they hold the block (in S, E, or M). The issuer chooses between S and E depending on the line. Is it possible for a block to be in the S state even if no other copies exist? 45

MESI State Transition Diagram

BusRd(S) means the shared line was asserted on the BusRd transaction; BusRd(S') means it was not. Flush': with cache-to-cache transfers, only one cache flushes the data. The MOESI protocol adds an Owned state: exclusive, but memory is not valid.

Diagram summary: PrRd from I goes to E if the shared line is not asserted, to S if it is; PrWr from S or I issues BusRdX, while PrWr from E upgrades silently to M; a snooped BusRd demotes M (with Flush) and E to S; a snooped BusRdX invalidates from M (with Flush), E, and S. 46
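To illustrate (again a hypothetical sketch in C, not the slides' notation), the read-miss decision driven by the wired-OR shared line, and the payoff of the E state on a later write hit, look like this:

    typedef enum { MESI_I, MESI_S, MESI_E, MESI_M } MesiState;

    /* On a read miss the cache issues BusRd; every other cache holding the block
     * asserts the wired-OR shared line. The issuer picks E or S accordingly. */
    static MesiState mesi_read_miss(int shared_line_asserted) {
        return shared_line_asserted ? MESI_S : MESI_E;
    }

    /* Payoff of E: a later write hit upgrades silently, with no bus transaction. */
    static MesiState mesi_write_hit(MesiState s, const char **bus_op) {
        *bus_op = (s == MESI_S) ? "BusRdX" : NULL;  /* S still needs ownership; E and M do not */
        return MESI_M;
    }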

Lower-level Protocol Choices

Who supplies the data on a miss when the block is not in the M state: memory or a cache? The original Illinois MESI used the cache, since it was assumed faster than memory. That is not true in modern systems: intervening in another cache is more expensive than getting the data from memory. Cache-to-cache sharing also adds complexity: how does memory know it should supply the data (it must wait for the caches), and a selection algorithm is needed if multiple caches have valid data. It is valuable for cache-coherent machines with distributed memory: it may be cheaper to obtain data from a nearby cache than from distant memory, especially when the machine is constructed out of SMP nodes (Stanford DASH). 47

Update Protocols

If data is to be communicated between processors, invalidate protocols seem inefficient. Consider a shared flag: p0 waits for it to be zero, then does work and sets it to one; p1 waits for it to be one, then does work and sets it to zero. How many transactions does each protocol take? 48

Dragon Write-back Update Protocol

4 states. Exclusive-clean or exclusive (E): I and memory have it. Shared-clean (Sc): I, others, and maybe memory have it, but I'm not the owner. Shared-modified (Sm): I and others have it but memory does not, and I'm the owner (Sm and Sc can coexist in different caches, with only one Sm). Modified or dirty (M): I have it and no one else does. There is no invalid state: if a block is in the cache, it cannot be invalid; if it is not present, view it as being in a not-present (invalid) state. New processor events PrRdMiss and PrWrMiss are introduced to specify actions when the block is not present in the cache. New bus transaction BusUpd broadcasts the single word written on the bus and updates the other relevant caches. 49
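To make the update idea concrete, here is a hedged C sketch (my illustration, hypothetical names) of the write-hit cases; the key point is that the shared states broadcast a BusUpd instead of invalidating:

    typedef enum { DR_E, DR_SC, DR_SM, DR_M } DragonState;

    /* Write hit: E and M stay local; Sc/Sm broadcast the word with BusUpd.
     * If no other cache asserts the shared line on the update, the writer
     * learns it holds the only copy and can move to M. */
    static DragonState dragon_write_hit(DragonState s, int shared_line, const char **bus_op) {
        *bus_op = NULL;
        switch (s) {
        case DR_E:  return DR_M;                        /* silent upgrade, as in MESI */
        case DR_M:  return DR_M;
        case DR_SC:
        case DR_SM: *bus_op = "BusUpd";
                    return shared_line ? DR_SM : DR_M;  /* still shared: become owner (Sm) */
        }
        return s;
    }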

Dragon State Transition Diagram

Diagram summary: PrRdMiss issues BusRd and enters Sc if the shared line is asserted, E otherwise. PrWrMiss issues BusRd; if the shared line is asserted it also issues BusUpd and enters Sm, otherwise it enters M. PrWr in Sc or Sm issues BusUpd, staying in Sm if the line shows the block is still shared or moving to M if not; PrWr in E moves silently to M. A snooped BusUpd updates the local copy in Sc/Sm; a snooped BusRd makes Sm or M flush the block. 50

An Example of the Dragon Update Protocol 51

Lower-level Protocol Choices

Can the shared-modified state be eliminated? Yes, if memory is also updated on BusUpd transactions (DEC Firefly); Dragon doesn't do this, assuming DRAM memory is slow to update. Should the replacement of an Sc block be broadcast? It would allow the last copy to go to the E state and stop generating updates; the replacement bus transaction is not on the critical path, but a later update may be. Can the local copy be updated on a write hit before the controller gets the bus? No: that can mess up serialization. The coherence and consistency considerations are much like the write-through case. 52

Assessing Protocol Tradeoffs Tradeoffs affected by technology characteristics and design complexity Part art and part science Art: experience, intuition and aesthetics of designers Science: Workload-driven evaluation for cost-performance» want a balanced system: no expensive resource heavily underutilized 53

Workload-Driven Evaluation Evaluating real machines Evaluating an architectural idea or trade-offs => need good metrics of performance => need to pick good workloads => need to pay attention to scaling many factors involved 54

Evaluation in Uniprocessors Decisions made only after quantitative evaluation For existing systems: comparison and procurement evaluation For future systems: careful extrapolation from known quantities Wide base of programs leads to standard benchmarks Measured on wide range of machines and successive generations Measurements and technology assessment lead to proposed features Then simulation Simulator developed that can run with and without a feature Benchmarks run through the simulator to obtain results Together with cost and complexity, decisions made 55

More Difficult for Multiprocessors

What is a representative workload? The software model has not stabilized, and there are many architectural and application degrees of freedom: a huge design space (number of processors, other architectural parameters, application parameters), where the impact of these parameters and their interactions can be huge, and communication is costly. What are the appropriate metrics? Simulation is expensive: realistic configurations and sensitivity analysis are difficult, and the design space is larger yet harder to cover. Understanding parallel programs as workloads is critical, particularly the interaction of application and architectural parameters. 56

A Lot Depends on Sizes

Application parameters and the number of processors affect inherent properties: load balance, communication, extra work, temporal and spatial locality. Interactions with the organization parameters of the extended memory hierarchy affect artifactual communication and performance. Effects are often dramatic, sometimes small: application-dependent.

Figures: speedup vs. number of processors (1-31) for Ocean with N = 130, 258, 514, and 1,026, and for Barnes-Hut on Origin (16 K, 64 K, 512 K bodies) and Challenge (16 K, 512 K bodies).

Understanding size interactions and scaling relationships is key. 57

Scaling: Why Worry?

A fixed problem size is insufficient. Too small a problem may be appropriate for a small machine, but parallelism overheads begin to dominate the benefits on larger machines (load imbalance, communication-to-computation ratio), possibly even producing slowdowns; it doesn't reflect real usage and is inappropriate for large machines. Too large a problem makes improvement difficult to measure (next). 58

Too Large a Problem

Suppose the problem is realistically large for the big machine. It may not fit in a small machine: it can't run, it thrashes to disk, or the working set doesn't fit in cache. It fits at some p, leading to superlinear speedup: a real effect, but one that doesn't help evaluate effectiveness. Finally, users want to scale problems as machines grow, which can help avoid these problems. 59

Demonstrating Scaling Problems

Figures: speedup vs. number of processors (1-31) on the SGI Origin2000 for a small Ocean problem (258 x 258), which falls well below ideal, and for a big equation solver problem (grid solver, 12 K x 12 K), which shows superlinear speedup. 60

Example Application Set

Table 4.1: General statistics about application programs

    Application   Input data set                  Instr  FLOPS   Refs  Reads  Writes  Shared  Shared  Barriers    Locks
                                                    (M)    (M)    (M)    (M)     (M)   reads  writes
                                                                                         (M)     (M)
    LU            512x512 matrix, 16x16 blocks   489.52  92.20 151.07 103.09   47.99   92.79   44.74        66        0
    Ocean         258x258 grids, tolerance =     376.51 101.54  99.70  81.16   18.54   76.95   16.97       364    1,296
                  10^-7, 4 time-steps
    Barnes-Hut    16-K particles, theta = 1.0, 2,002.74 239.24 720.13 406.84  313.29  225.04   93.23         7   34,516
                  3 time-steps
    Radix         256-K points, radix = 1,024     84.62     -- 14.19    7.81    6.38    3.61    2.18        11       16
    Raytrace      Car scene                      833.35     -- 290.35 210.03   80.31  161.10   22.35         0   94,456
    Radiosity     Room scene                   2,297.19     -- 769.56 486.84  282.72  249.67   21.88        10  210,485
    Multiprog:    SGI IRIX 5.2, two pmakes +   1,296.43     -- 500.22 350.42  149.80      --      --        --       --
    User          two compress jobs
    Multiprog:                                   668.10     -- 212.58 178.14   34.44      --      --        --  621,505
    Kernel

For the parallel programs, shared reads and writes simply refer to all nonstack references issued by the application processes; all such references do not necessarily point to data that is truly shared by multiple processes. The Multiprog workload is not a parallel application, so it does not access shared data. A dash in a table entry means that this measurement is not applicable to or is not measured for that application (e.g., Radix has no floating-point operations). (M) denotes that the measurement in that column is in millions. 61

Types of Workloads

Kernels: matrix factorization, FFT, depth-first tree search. Complete applications: ocean simulation, crew scheduling, database. Multiprogrammed workloads.

Spectrum, from multiprogrammed workloads and applications down to kernels and microbenchmarks: the realistic, complex end captures the higher-level interactions that are what really matter; the controlled, repeatable end is easier to understand and exposes basic machine characteristics.

Each has its place: use kernels and microbenchmarks to gain understanding, but applications to evaluate effectiveness and performance. 62

Coverage: Stressing Features Easy to mislead with workloads Choose those with features for which machine is good, avoid others Some features of interest: Compute v. memory v. communication v. I/O bound Working set size and spatial locality Local memory and communication bandwidth needs Importance of communication latency Fine-grained or coarse-grained» Data access, communication, task size Synchronization patterns and granularity Contention Communication patterns Choose workloads that cover a range of properties 63

Concurrency Should have enough to utilize the processors Algorithmic speedup: useful measure of concurrency/imbalance Speedup (under scaling model) assuming all memory/communication operations take zero time Ignores memory system, measures imbalance and extra work Uses PRAM machine model (Parallel Random Access Machine)» Unrealistic, but widely used for theoretical algorithm development At least, should isolate performance limitations due to program characteristics that a machine cannot do much about (concurrency) from those that it can. 64

Workload/Benchmark Suites

Numerical Aerodynamic Simulation (NAS): originally pencil-and-paper benchmarks. SPLASH/SPLASH-2: shared address space parallel programs. ParkBench: message-passing parallel programs. ScaLapack: message-passing kernels. TPC: transaction processing. SPEC-HPC... 65

Multiprocessor Simulation Simulation runs on a uniprocessor (can be parallelized too) Simulated processes are interleaved on the processor Two parts to a simulator: Reference generator: plays role of simulated processors» And schedules simulated processes based on simulated time Simulator of extended memory hierarchy» Simulates operations (references, commands) issued by reference generator Coupling or information flow between the two parts varies Trace-driven simulation: from generator to simulator Execution-driven simulation: in both directions (more accurate) Simulator keeps track of simulated time and detailed statistics 66

Execution-driven Simulation

The memory hierarchy simulator returns simulated time information to the reference generator, which uses it to schedule the simulated processes.

Figure: simulated processors P1..Pp, each with a cache and memory on a network, form the reference generator, which feeds the memory and interconnect simulator. 67

Difficulties in Simulation-based Evaluation Cost of simulation (in time and memory) cannot simulate the problem/machine sizes we care about have to use scaled down problem and machine sizes» how to scale down and stay representative? Huge design space application parameters (as before) machine parameters (depending on generality of evaluation context)» number of processors» cache/replication size» associativity» granularities of allocation, transfer, coherence» communication parameters (latency, bandwidth, occupancies) cost of simulation makes it all the more critical to prune the space 68

Choosing Parameters

Problem size and number of processors: use the inherent-characteristics considerations discussed earlier; for example, a low communication-to-computation ratio will not allow block transfer to help much. Cache/replication size: choose based on knowledge of the working set curve; choosing cache sizes for a given problem and machine size is analogous to choosing problem sizes for a given cache and machine size, as discussed. Whether or not the working set fits affects block transfer benefits greatly: if the data is local, not fitting makes communication relatively less important; if nonlocal, it can increase artifactual communication, so block transfer has more opportunity. Sharp knees in the working set curve can help prune the space; they can be determined by analysis or by very simple simulation. 69

Focus on Protocol Tradeoffs

Methodology: choose cache parameters (default: 1 MB, 4-way cache, 64-byte block, 16 processors; a 64 KB cache for some experiments). Focus on frequencies, not end performance, for now: this transcends architectural details. Use an idealized memory performance model, a simple PRAM cost model in which all memory operations complete in one cycle; simulation is cheap since there is no need to model contention. Run the program on a parallel machine simulator, collect a trace of cache state transitions, and analyze the properties of the transitions. This evaluates the protocol class (invalidation or update), the protocol states and actions, and lower-level implementation tradeoffs. 70

Bandwidth per Transition in MESI

    Bus transaction   Address/Cmd (bytes)   Data (bytes)
    BusRd             6                     64
    BusRdX            6                     64
    BusWB             6                     64
    BusUpgd           6                     --

Ocean data cache frequency matrix (state transitions per 1,000 references):

    From \ To     NP       I       E        S        M
    NP             0       0      1.25     0.96     0.001
    I              0.64    0      0        1.87     0.001
    E              0.20    0     14.00     0.0      2.24
    S              0.42    2.50   0      134.72     2.24
    M              2.63    0.00   0        2.30   843.57

Example 5.8 (p. 310): how to compute the bus bandwidth requirement? State transitions -> bus transactions -> bus traffic per memory reference -> (376.5 M instructions vs. 99.7 M memory references) bus traffic per instruction -> (200 MIPS) required bandwidth for an application (see next slide). 71
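To show the arithmetic this slide describes, here is a small C sketch of Example 5.8's pipeline (my own illustration: the per-1,000-reference transaction counts below are placeholders, since deriving them means mapping each state transition of the matrix to its bus transaction; the 99.7/376.5 ratio and 200 MIPS come from the slides):

    #include <stdio.h>

    int main(void) {
        /* Placeholder bus-transaction counts per 1,000 memory references
         * (BusRd, BusRdX, BusWB, BusUpgd), to be derived from the matrix. */
        double per_k_refs[4]  = { 5.0, 2.0, 2.5, 1.0 };
        double bytes[4]       = { 70, 70, 70, 6 };   /* 6 addr/cmd + 64 data; BusUpgd: no data */
        double refs_per_instr = 99.7 / 376.5;        /* Ocean: memory refs per instruction */
        double ips            = 200e6;               /* 200 MIPS processor */

        double bytes_per_ref = 0;
        for (int i = 0; i < 4; i++)
            bytes_per_ref += per_k_refs[i] / 1000.0 * bytes[i];

        double bw = bytes_per_ref * refs_per_instr * ips;  /* bytes/second per processor */
        printf("traffic: %.1f MB/s per processor\n", bw / 1e6);
        return 0;
    }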

Bandwidth Trade-off

1 MB cache, 200 MIPS / 200 MFLOPS processor. E -> M transitions are infrequent, and BusUpgrade is cheap.

Figures: bus traffic (MB/s, split into address/cmd and data) for the parallel programs (barnes, lu, ocean, radiosity, radix, raytrace) and for the multiprogrammed workload (application and OS, code and data), each under three protocols: Ill (MESI: no BusUpgr needed from E to M), 3St (MSI: BusUpgr used from S to M), and 3St-RdEx (MSI with BusRdX used from S to M). How many processors can be supported with a 1.2 GB/s bus? 72

Smaller (64 KB) Caches

Figure: bus traffic (MB/s) for ocean, radix, and raytrace under the same three protocols. The bus bandwidth requirement increases dramatically. How many processors can be supported with a 1.2 GB/s bus? 73

Cache Block Size

Trade-offs in uniprocessors with increasing block size: reduced cold misses (due to spatial locality), increased transfer time, increased conflict misses (fewer sets). Additional concerns in multiprocessors: parallel programs have less spatial locality, parallel programs have sharing, false sharing, and bus contention. We need to classify misses to understand the impact: cold misses; capacity/conflict misses; true sharing misses (one processor writes words in a block, invalidating the block in another processor's cache, which is later read by that processor); and false sharing misses. 74

Breakdown of Miss Rates with Block Size

Cold, capacity, and true sharing misses tend to decrease with increasing block size; false sharing misses tend to increase.

Figures: miss rates for the parallel programs (barnes, lu, radiosity, ocean, radix, raytrace) at block sizes of 8 to 256 bytes, broken down into upgrade (UPGMR), false sharing (FSMR), true sharing (TSMR), capacity (CAPMR), and cold (COLDMR) misses. 75

Miss Rate Breakdown with 64 KB Caches

Figure: the same breakdown for ocean, radix, and raytrace with 64 KB caches. The miss rate, especially the capacity miss rate, is much higher; true sharing misses aren't significantly different. A larger block size is more effective here, but with risks (higher demand on the bus). 76

Impact of Block Size on Traffic

Figures: bus traffic (bytes per instruction, or bytes per FLOP) vs. block size (8 to 256 bytes) for the parallel programs, split into address/cmd and data.

Bus traffic indirectly impacts performance (via contention and increased miss penalty). Traffic increases with block size, but the overall traffic is still small. 77

Traffic with 64 KB Caches

Figure: bus traffic (bytes per instruction, or bytes per FLOP) vs. block size for ocean, radix, and raytrace with 64 KB caches. 78

Making Large Blocks More Effective Software Improve spatial locality by better data structuring Compiler techniques Hardware Retain granularity of transfer but reduce granularity of coherence» use subblocks: same tag but different state bits» one subblock may be valid but another invalid or dirty Reduce both granularities, but prefetch more blocks on a miss Use update instead of invalidate protocols to reduce false sharing effect 79

Update versus Invalidate

Much debate over the years: the tradeoff depends on sharing patterns. Intuition: if those that used a block continue to use it, and writes between uses are few, update should do better (e.g., the producer-consumer pattern). If those that used it are unlikely to use it again, or there are many writes between reads, updates are not good: the "pack rat" phenomenon, particularly bad under process migration in a multiprogramming environment, with useless updates where only the last one will be used. Scenarios can be constructed where either one is much better. They can be combined in hybrid schemes, e.g. competitive protocols that observe patterns at runtime and change the protocol. 80

Upgrade and Update Rates (Traffic)

Update traffic is substantial. The main cause is multiple writes by a processor before a read by another: many bus transactions, versus one in the invalidation case. Overall, the trend is away from update-based protocols as the default: bandwidth, complexity, the trend toward large blocks, and the pack-rat problem under process migration; updates have even greater problems in scalable systems. 81

Hardware-Software Trade-offs in Synchronization Role of Synchronization A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Types of Synchronization Mutual Exclusion Event synchronization» point-to-point» group» global (barriers) How much hardware support? high-level operations? atomic instructions? specialized interconnect?

Mini-Instruction Set Debate

Atomic read-modify-write instructions (the bottom line on hardware support): the IBM 370 included atomic compare&swap for multiprogramming; on x86, any instruction can be prefixed with a lock modifier. High-level language advocates want hardware locks/barriers, but that goes against the RISC flow and has other problems. SPARC: atomic register-memory ops (swap, compare&swap). MIPS, IBM Power: no atomic operations, but a pair of instructions, load-locked and store-conditional, later used by PowerPC and DEC Alpha too. A rich set of tradeoffs.

Components of a Synchronization Event

Acquire method: acquire the right to the synch (enter the critical section, go past the event). Waiting algorithm: wait for the synch to become available when it isn't; busy-waiting, blocking, or hybrid. Release method: enable other processors to acquire the right to the synch. The waiting algorithm is independent of the type of synchronization, so it makes no sense to put it in hardware.

Strawman Lock (Busy-Wait)

    lock:   ld   register, location   /* copy location to register */
            cmp  register, #0         /* compare with 0 */
            bnz  lock                 /* if not 0, try again */
            st   location, #1         /* store 1 to mark it locked */
            ret                       /* return control to caller */

    unlock: st   location, #0         /* write 0 to location */
            ret                       /* return control to caller */

Why doesn't the acquire method work?

Atomic Instructions Specifies a location, register, & atomic operation (read-modify-write) Value in location read into a register Another value (function of value read or not) stored into location Many variants Varying degrees of flexibility in second part Simple example: test&set Value in location read into a specified register Constant 1 stored into location Successful if value loaded into register is 0 Other constants could be used instead of 1 and 0

Simple Test&Set Lock

    lock:   t&s  register, location
            bnz  lock                 /* if not 0, try again */
            ret                       /* return control to caller */

    unlock: st   location, #0         /* write 0 to location */
            ret                       /* return control to caller */

Other read-modify-write primitives: swap; fetch&op (fetch&increment, fetch&decrement, fetch&add, ...); and compare&swap, which takes three operands, a location (m), a register to compare with (r1), and a register to swap with (r2) [if m == r1 then m = r2], and is not commonly supported by RISC instruction sets.
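For reference (my addition, not from the slides), the same test&set lock written portably with C11 atomics; atomic_flag_test_and_set is the ISO C analogue of the t&s instruction:

    #include <stdatomic.h>

    static atomic_flag lock_var = ATOMIC_FLAG_INIT;

    static void lock(void) {
        /* test_and_set returns the previous value: spin until we read 0 and set 1. */
        while (atomic_flag_test_and_set_explicit(&lock_var, memory_order_acquire))
            ;  /* spin */
    }

    static void unlock(void) {
        atomic_flag_clear_explicit(&lock_var, memory_order_release);
    }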

Performance Criteria for Synch. Ops Latency (time per op) especially when light contention Bandwidth (ops per sec) especially under high contention Traffic load on critical resources especially on failures under contention Fairness

T&S Lock Microbenchmark: SGI Challenge

Microbenchmark: a loop of lock(L); critical-section(c); unlock(L); with the same total number of lock calls and an increasing number of processors; the delay (c) is not counted.

Figure: time (microseconds) vs. number of processors (up to 15) for test&set with c = 0, test&set with exponential backoff (c = 3.64 and c = 0), and the ideal.

Why does performance degrade? Think about the bus transactions on T&S: every T&S includes a write, invalidating copies in the other processors' caches.

Enhancements to Simple Lock

Reduce the frequency of issuing test&sets while waiting. Test&set lock with backoff: don't back off too much, or you will still be backed off when the lock becomes free; exponential backoff works quite well empirically (delay for the ith attempt = k*c^i). Busy-wait with read operations rather than test&set: the test-and-test&set lock keeps testing with an ordinary load (the cached lock variable will be invalidated when a release occurs); when the value changes to 0, try to obtain the lock with test&set; only one attemptor will succeed, and the others will fail and start testing again.
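A hedged C11 sketch of the test-and-test&set idea combined with exponential backoff (the constants k = 16 and the 4096 cap are illustrative choices, not from the slides):

    #include <stdatomic.h>

    static atomic_int ttslock = 0;

    static void cpu_pause(int iters) {   /* crude backoff loop; a real port would use a pause hint */
        for (volatile int i = 0; i < iters; i++)
            ;
    }

    static void tts_lock(void) {
        int delay = 16;                  /* k: initial backoff, illustrative */
        for (;;) {
            /* Test with ordinary (cached) loads: no bus traffic while the lock is held. */
            while (atomic_load_explicit(&ttslock, memory_order_relaxed) != 0)
                ;
            /* Lock looks free: one test&set attempt. */
            if (atomic_exchange_explicit(&ttslock, 1, memory_order_acquire) == 0)
                return;                  /* got it */
            cpu_pause(delay);            /* lost the race: back off exponentially */
            if (delay < 4096) delay *= 2;
        }
    }

    static void tts_unlock(void) {
        atomic_store_explicit(&ttslock, 0, memory_order_release);
    }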

Improved Hardware Primitives: LL-SC

Goals: test with reads; failed read-modify-write attempts don't generate invalidations; and, ideally, a single primitive that can implement a range of read-modify-write operations. Load-Locked (or load-linked), Store-Conditional: LL reads the variable into a register; arbitrary instructions may follow to manipulate its value; SC tries to store back to the location and succeeds if and only if no other write to the variable has occurred since this processor's LL (indicated by condition codes). If SC succeeds, all three steps happened atomically; if it fails, it doesn't write or generate invalidations, and the acquire must be retried.

Simple Lock with LL-SC

    lock:   ll   reg1, location   /* LL location to reg1 */
            bnz  reg1, lock       /* if locked, try again */
            sc   location, reg2   /* store reg2 conditionally into location */
            beqz reg2, lock       /* if SC failed, start again */
            ret

    unlock: st   location, #0     /* write 0 to location */
            ret

Fancier atomic operations can be built by changing what's between the LL and the SC, but keep it small so the SC is likely to succeed, and don't include instructions that would need to be undone (e.g. stores). SC can fail (without putting a transaction on the bus) if it detects an intervening write even before trying to get the bus, or if it tries to get the bus but another processor's SC gets it first. LL and SC are not lock and unlock respectively: they only guarantee that no conflicting write to the lock variable occurred between them, but they can be used directly to implement simple operations on shared variables.

Ticket Lock

What happens when several processors are spinning on a lock and it is released? How much traffic per lock operation? The ticket lock needs only one read-modify-write per acquire. Two counters per lock: next_ticket and now_serving. Acquire: fetch&inc next_ticket, then wait for now_serving to equal your ticket; the atomic operation happens when you arrive at the lock, not when it's free (so there is less contention). Release: increment now_serving. Performance: low latency under low contention if fetch&inc is cacheable, but O(p) read misses at release, since all waiters spin on the same variable.
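A hedged C11 sketch of the ticket lock as described (type and function names are my own):

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;   /* fetch&inc'd once per acquire */
        atomic_uint now_serving;   /* all waiters spin on this */
    } TicketLock;                  /* zero-initialize both fields */

    static void ticket_acquire(TicketLock *l) {
        unsigned my = atomic_fetch_add_explicit(&l->next_ticket, 1, memory_order_relaxed);
        while (atomic_load_explicit(&l->now_serving, memory_order_acquire) != my)
            ;  /* spin with reads only; the O(p) misses occur at release */
    }

    static void ticket_release(TicketLock *l) {
        unsigned next = atomic_load_explicit(&l->now_serving, memory_order_relaxed) + 1;
        atomic_store_explicit(&l->now_serving, next, memory_order_release);
    }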

Array-based Queuing Locks

Waiting processes poll on different locations in an array of size p. Acquire: fetch&inc to obtain the address on which to spin (the next array element), ensuring that these addresses are in different cache lines or memories. Release: set the next location in the array, waking up the process spinning on it. O(1) traffic per acquire with coherent caches, and FIFO ordering as in the ticket lock, but O(p) space per lock.
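A corresponding C11 sketch (again illustrative: the processor count P, the padding size, and the assumption of at most P concurrent waiters are mine):

    #include <stdatomic.h>

    #define P 16                                    /* max waiters, an assumption */

    typedef struct {
        struct { atomic_int flag; char pad[60]; } slot[P];  /* one cache line per waiter */
        atomic_uint next;                           /* fetch&inc'd to hand out slots */
    } ArrayLock;                                    /* initialize slot[0].flag = 1, rest 0 */

    static unsigned array_acquire(ArrayLock *l) {
        unsigned me = atomic_fetch_add_explicit(&l->next, 1, memory_order_relaxed) % P;
        while (!atomic_load_explicit(&l->slot[me].flag, memory_order_acquire))
            ;                                       /* each waiter spins on its own line */
        atomic_store_explicit(&l->slot[me].flag, 0, memory_order_relaxed);  /* reset for reuse */
        return me;                                  /* pass to release */
    }

    static void array_release(ArrayLock *l, unsigned me) {
        atomic_store_explicit(&l->slot[(me + 1) % P].flag, 1, memory_order_release);
    }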

Point to Point Event Synchronization Software methods: Interrupts Busy-waiting: use ordinary variables as flags Blocking: use semaphores Full hardware support: full-empty bit with each word in memory Set when word is full with newly produced data (i.e. when written) Unset when word is empty due to being consumed (i.e. when read) Natural for word-level producer-consumer synchronization» producer: write if empty, set to full; consumer: read if full; set to empty Hardware preserves atomicity of bit manipulation with read or write Problem: flexibility» multiple consumers, or multiple writes before consumer reads?» needs language support to specify when to use» composite data structures?

Barriers Software algorithms implemented using locks, flags, counters Hardware barriers Wired-AND line separate from address/data bus» Set input high when arrive, wait for output to be high to leave In practice, multiple wires to allow reuse Useful when barriers are global and very frequent Difficult to support arbitrary subset of processors» even harder with multiple processes per processor Difficult to dynamically change number and identity of participants» e.g. latter due to process migration Not common today on bus-based machines

A Simple Centralized Barrier

A shared counter maintains the number of processes that have arrived: increment when you arrive (under a lock), then check until it reaches numprocs. Problem?

    struct bar_type {
        int counter;
        struct lock_type lock;
        int flag = 0;
    } bar_name;

    BARRIER (bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;             /* reset flag if first to reach */
        mycount = ++bar_name.counter;      /* mycount is private */
        UNLOCK(bar_name.lock);
        if (mycount == p) {                /* last to arrive */
            bar_name.counter = 0;          /* reset for next barrier */
            bar_name.flag = 1;             /* release waiters */
        } else
            while (bar_name.flag == 0) {}; /* busy wait for release */
    }

A Working Centralized Barrier

Consecutively entering the same barrier doesn't work: a process must be prevented from entering until all have left the previous instance. Another counter could be used, but that increases latency and contention. Sense reversal: wait for the flag to take a different value in consecutive barriers, and toggle this value only when all processes have reached the barrier.

    BARRIER (bar_name, p) {
        local_sense = !(local_sense);      /* toggle private sense variable */
        LOCK(bar_name.lock);
        mycount = bar_name.counter++;      /* mycount is private */
        if (bar_name.counter == p) {       /* last to arrive */
            UNLOCK(bar_name.lock);
            bar_name.counter = 0;          /* reset for next barrier */
            bar_name.flag = local_sense;   /* release waiters */
        } else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) {};
        }
    }
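For completeness (my addition, not from the slides), the same sense-reversing barrier in portable C11 atomics; here an atomic fetch-add replaces the LOCK/UNLOCK pair around the counter:

    #include <stdatomic.h>
    #include <stdbool.h>

    typedef struct {
        atomic_int  counter;   /* arrivals in the current barrier instance */
        atomic_bool flag;      /* current release sense */
    } Barrier;                 /* zero-initialize both fields */

    /* local_sense is per-thread state, e.g. a thread-local or stack variable. */
    static void barrier_wait(Barrier *b, int p, bool *local_sense) {
        *local_sense = !*local_sense;      /* toggle private sense variable */
        int mycount = atomic_fetch_add_explicit(&b->counter, 1, memory_order_acq_rel) + 1;
        if (mycount == p) {                /* last to arrive */
            atomic_store_explicit(&b->counter, 0, memory_order_relaxed);
            atomic_store_explicit(&b->flag, *local_sense, memory_order_release);
        } else {
            while (atomic_load_explicit(&b->flag, memory_order_acquire) != *local_sense)
                ;                          /* busy wait for release */
        }
    }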