ECE7660 Parallel Computer Architecture. Shared Memory Multiprocessors


1 ECE7660 Parallel Computer Architecture: Shared Memory Multiprocessors

2 Layer Perspective
[Figure: layered view of a parallel system. Parallel applications (CAD, database, scientific modeling, multiprogramming) sit on programming models (shared address, message passing, data parallel); below are compilation or library, the communication abstraction (user/system boundary), operating systems support, communication hardware (hardware/software boundary), and the physical communication medium. Conceptual picture: processors P1..Pn sharing Mem.]

3 Natural Extensions of Memory System
[Figure: three organizations. (a) Shared cache: P1..Pn connect through a switch to an interleaved first-level cache backed by interleaved main memory. (b) Centralized memory ("dance hall", UMA): P1..Pn, each with a private cache, reach shared memory modules over an interconnection network. (c) Distributed memory (NUMA): each node pairs a processor and cache with local memory, joined by an interconnection network.]

4 Bus-Based Symmetric Shared Memory
[Figure: P1..Pn with private caches on a shared bus with Mem and I/O devices.]
These dominate the server market, serve as building blocks for larger systems, and are arriving at the desktop. Attractive as throughput servers and for parallel programs: fine-grain resource sharing, uniform access via loads/stores, automatic data movement and coherent replication in caches, a cheap and powerful extension. Normal uniprocessor mechanisms access data. The key is extending the memory hierarchy to support multiple processors.

5 Caches are Critical for Performance
Reduce average latency: automatic replication closer to the processor. Reduce average bandwidth. Data is logically transferred from producer to consumer through memory: store reg --> mem; load reg <-- mem. Many processors can share data efficiently. What happens when the store and the load are executed on different processors?

6 Example Cache Coherence Problem
[Figure: P1, P2, P3 with caches on a bus; memory holds u:5.] Events: (1) P1 reads u, caching 5; (2) P3 reads u, caching 5; (3) P3 writes u = 7; (4) P1 reads u; (5) P2 reads u.
Processors see different values for u after event 3. With write-back caches, the value written back to memory depends on which cache flushes or writes back first; processes accessing main memory may see a very stale value. Unacceptable to programs, and frequent!
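To make the event sequence concrete, here is a toy single-threaded C model of the example. This is not real hardware and the names are invented: each per-processor cache is modeled as a plain local copy, and the write-back policy means P3's write stays in its "cache".

    #include <stdio.h>

    int memory_u = 5;                  /* main memory copy of u */

    int main(void) {
        int p1_cache = memory_u;       /* 1: P1 reads u, caches 5 */
        int p3_cache = memory_u;       /* 2: P3 reads u, caches 5 */
        p3_cache = 7;                  /* 3: P3 writes u = 7; write-back
                                          keeps it dirty in P3's cache */
        printf("P1 sees u = %d\n", p1_cache);   /* 4: read hit, stale 5 */
        int p2_cache = memory_u;       /* 5: P2 misses; memory still 5 */
        printf("P2 sees u = %d\n", p2_cache);
        printf("P3 sees u = %d\n", p3_cache);   /* 7 */
        return 0;
    }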

7 Caches and Cache Coherence
Caches play a key role in all cases: they reduce average data access time and reduce bandwidth demands placed on the shared interconnect. But private processor caches create a problem: copies of a variable can be present in multiple caches, and a write by one processor may not become visible to others, which keep accessing the stale value in their caches => the cache coherence problem. What do we do about it? Organize the memory hierarchy to make it go away, or detect and take actions to eliminate the problem.

8 Shared Cache: Examples
Alliant FX-8 (early 80s): eight 68020s with a crossbar to a 512 KB interleaved cache. Encore & Sequent: first 32-bit micros (NS32032), two to a board with a shared cache. [Figure: P1..Pn through a switch to an interleaved cache and interleaved main memory.]

9 Advantages
Cache placement identical to a single cache: only one copy of any cached block; fine-grain sharing. Communication latency is determined by the level in the storage hierarchy where the access paths meet: 2-10 cycles (the Cray X-MP has shared registers!). Potential for positive interference: one processor prefetches data for another. Smaller total storage: only one copy of code/data used by all processors. Can share data within a line without ping-pong: long lines without false sharing.

10 Disadvantages
Fundamental bandwidth limitation. Increases the latency of all accesses: the crossbar, the larger cache, and the fact that L1 hit time determines the processor cycle time! Potential for negative interference: one processor flushes data needed by another. Many L2 caches are shared today.

11 Intuitive Memory Model
[Figure: P -> L1 (100:67) -> L2 (100:35) -> Memory -> Disk (100:34); address 100 holds different values at different levels.] Reading an address should return the last value written to that address. Easy in uniprocessors, except for I/O. The cache coherence problem in multiprocessors is both more pervasive and more performance-critical.

12 Snoopy Cache-Coherence Protocols
[Figure: caches holding state/address/data on a shared bus with Mem and I/O devices; each cache controller snoops the bus.] The bus is a broadcast medium, and caches know what they have. The cache controller snoops all transactions on the shared bus: a transaction is relevant if it is for a block the cache contains; the controller then takes action to ensure coherence (invalidate, update, or supply the value), depending on the state of the block and the protocol.

13 Example: Write-thru Invalidate
[Figure: same example as before. P3's write u = 7 (event 3) goes on the bus (write-through) and invalidates the copies in the other caches; the subsequent reads by P1 (event 4) and P2 (event 5) miss and fetch u = 7 from memory.]

14 Architectural Building Blocks
Bus transactions: the fundamental system design abstraction; a single set of wires connects several devices; the bus protocol covers arbitration, command/address, and data => every device observes every transaction. Cache block state transition diagram: an FSM specifying how the disposition of a block changes (invalid, valid, dirty).

15 Design Choices
The controller updates the state of blocks in response to processor and snoop events and generates bus transactions. A snoopy protocol is a set of states, a state-transition diagram, and actions. Basic choices: write-through vs. write-back; invalidate vs. update. [Figure: cache controller between the processor (ld/st) and the bus (snoop), managing state/tag/data.]

16 Write-through Invalidate Protocol
Two states per block in each cache, as in a uniprocessor; the state of a block is a p-vector of states. Hardware state bits are associated with blocks that are in the cache; other blocks can be seen as being in the invalid (not-present) state in that cache. Writes invalidate all other caches: there can be multiple simultaneous readers of a block, but a write invalidates them. Transitions: V: PrRd/--, PrWr/BusWr (stay in V), observed BusWr/-- -> I; I: PrRd/BusRd -> V, PrWr/BusWr (stay in I; no allocation on write).

17 Write-through vs. Write-back
The write-through protocol is simple: every write is observable. Every write goes on the bus => only one write can take place at a time in any processor. Uses a lot of bandwidth! Example: 200 MHz dual-issue processor, CPI = 1, 15% stores of 8 bytes => 30 M stores per second per processor => 240 MB/s per processor. A 1 GB/s bus can support only about 4 processors without saturating.
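The arithmetic behind the example, written out as a small C program (the figures are the ones quoted on the slide):

    #include <stdio.h>

    int main(void) {
        double instr_per_sec = 200e6;  /* 200 MHz, CPI = 1 -> 200M instr/s */
        double store_frac    = 0.15;   /* 15% of instructions are stores */
        double bytes_per_st  = 8.0;
        double bus_bw        = 1e9;    /* 1 GB/s bus */

        double stores  = instr_per_sec * store_frac;  /* 30 M stores/s */
        double traffic = stores * bytes_per_st;       /* 240 MB/s */
        printf("per-processor write traffic: %.0f MB/s\n", traffic / 1e6);
        printf("processors before saturation: %.1f\n", bus_bw / traffic);
        return 0;
    }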

18 Invalidate vs. Update
Basic question of program behavior: is a block written by one processor later read by others before it is overwritten? Invalidate: yes => readers will take a miss; no => multiple writes without additional traffic, and it also clears out copies that will never be used again. Update: yes => avoids misses on later references; no => multiple useless updates, even to pack rats. => Need to look at program reference patterns and hardware complexity.

19 Intuitive Memory Model???
[Figure: P -> L1 (100:67) -> L2 (100:35) -> Memory -> Disk (100:34).] Reading an address should return the last value written to that address. What does that mean in a multiprocessor?

20 Coherence?
Caches are supposed to be transparent. What would happen if there were no caches? Every memory operation would go to the memory location (there may be multiple memory banks), so all operations on a particular location would be serialized: all would see THE order. Interleaving among accesses from different processors: within an individual processor => program order; across processors => constrained only by explicit synchronization. The processor observes the state of the memory system only by issuing memory operations!

21 Definitions
Memory operation: load, store, read-modify-write. Issue: the operation leaves the processor's internal environment and is presented to the memory subsystem (caches, buffers, buses, DRAM, etc.). Performed with respect to a processor: write: subsequent reads return the value; read: subsequent writes cannot affect the value. Coherent memory system: there exists a serial order of memory operations on each location such that operations issued by a process appear in the order issued, and the value returned by each read is that written by the previous write in the serial order => write propagation + write serialization.

22 Is the 2-state Protocol Coherent?
Assume bus transactions and memory operations are atomic, with a one-level cache: all phases of one bus transaction complete before the next one starts; the processor waits for a memory operation to complete before issuing the next; with a one-level cache, assume invalidations are applied during the bus transaction. All writes go to the bus + atomicity => writes are serialized by the order in which they appear on the bus (bus order) => invalidations are applied to caches in bus order. How do we insert reads into this order? Important since processors see writes through reads, so this determines whether write serialization is satisfied. But read hits may happen independently and do not appear on the bus or enter directly into bus order.

23 Ordering Reads
Read misses appear on the bus and will see the last write in bus order. Read hits do not appear on the bus, but the value read was placed in the cache by either the most recent write by this processor or the most recent read miss by this processor. Both of those transactions appeared on the bus. So read hits also see values as produced in bus order.

24 Determining Orders More Generally
Memory operation M2 is subsequent to memory operation M1 (M2 >> M1) if the operations are issued by the same processor and M2 follows M1 in program order. Read R >> write W if the read generates a bus transaction that follows that for W. Write W >> read or write M if M generates a bus transaction and the transaction for W follows that for M. Write W >> read R if read R does not generate a bus transaction and is not already separated from write W by another bus transaction.

25 Ordering
P0: R R R W R R
P1: R R R R R W
P2: R R R R R R
Writes establish a partial order. This doesn't constrain the ordering of reads, though the bus will order read misses too; any order among reads between writes is fine, as long as it respects program order.

26 Write-Through vs Write-Back
Write-through requires high bandwidth. Write-back caches absorb most writes as cache hits => write hits don't go on the bus. But now how do we ensure write propagation and serialization? We need more sophisticated protocols: a large design space. But first, let's understand other ordering issues.

27 Setup for Memory Consistency
Coherence => writes to a location become visible to all in the same order. But when does a write become visible? How do we establish orders between a write and a read by different processors? Use event synchronization, typically using more than one location!

28 Example

    P1:                    P2:
    /* assume initial value of A and flag is 0 */
    A = 1;                 while (flag == 0);  /* spin idly */
    flag = 1;              print A;

The intuition is not guaranteed by coherence: we expect memory to respect the order between accesses to different locations issued by a given process, and to preserve the order among accesses to the same location by different processes. Coherence is not enough! It pertains only to a single location.
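For comparison, here is how the same idiom must be written today in C11, where the programmer requests the cross-location ordering explicitly. This is a sketch, not the slide's code: the release/acquire pairing is one way to obtain the intuitive behavior.

    #include <stdatomic.h>
    #include <stdio.h>

    int A = 0;
    atomic_int flag = 0;

    void producer(void) {                        /* P1 */
        A = 1;
        atomic_store_explicit(&flag, 1, memory_order_release);
    }

    void consumer(void) {                        /* P2 */
        while (atomic_load_explicit(&flag, memory_order_acquire) == 0)
            ;                                    /* spin idly */
        printf("%d\n", A);                       /* guaranteed to print 1 */
    }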

29 Another Example of Ordering?

    P1:                    P2:
    /* assume initial values of A and B are 0 */
    (1a) A = 1;            (2a) print B;
    (1b) B = 2;            (2b) print A;

What's the intuition? Whatever it is, we need an ordering model for clear semantics, across different locations as well, so programmers can reason about what results are possible. This is the memory consistency model.

30 Memory Consistency Model
Specifies constraints on the order in which memory operations (from any process) can appear to execute with respect to one another: what orders are preserved? Given a load, it constrains the possible values returned by it. Without it, we can't tell much about an SAS program's execution. Implications for both programmer and system designer: the programmer uses it to reason about correctness and possible results; the system designer can use it to constrain how much accesses can be reordered by the compiler or hardware. A contract between programmer and system.

31 Sequential Consistency
[Figure: processors P1..Pn issue memory references in program order; a switch, randomly set after each memory reference, selects which processor accesses the single memory.] A total order is achieved by interleaving accesses from different processes. It maintains program order, and memory operations from all processes appear to [issue, execute, complete] atomically with respect to the others, as if there were no caches and a single memory. "A multiprocessor is sequentially consistent if the result of any execution is the same as if the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program." [Lamport, 1979]

32 What Really is Program Order?
Intuitively, the order in which operations appear in source code: a straightforward translation of source code to assembly, with at most one memory operation per instruction. But that is not the same as the order presented to the hardware by the compiler. So which is program order? It depends on which layer, and on who's doing the reasoning. We assume the order as seen by the programmer.

33 SC Example
What matters is the order in which operations appear to execute, not the chronological order of events.

    P1:                    P2:
    /* assume initial values of A and B are 0 */
    (1a) A = 1;            (2a) print B;
    (1b) B = 2;            (2b) print A;

Possible outcomes for (A,B): (0,0), (1,0), (1,2). What about (0,2), i.e., A = 0 and B = 2? Program order => 1a->1b and 2a->2b. A = 0 implies 2b->1a, which implies 2a->1b; B = 2 implies 1b->2a, which leads to a contradiction. What about the actual execution 1b->1a->2b->2a? It appears just like 1a->1b->2a->2b as visible from the results; the actual execution 1b->2a->2b->1a is not SC.

34 Implementing SC
Two kinds of requirements. Program order: memory operations issued by a process must appear to execute (become visible to others and itself) in program order. Atomicity: in the overall hypothetical total order, one memory operation should appear to complete with respect to all processes before the next one is issued; this guarantees that the total order is consistent across processes (all processes see the same order). The tricky part is making writes atomic. [Example illustrating the importance of write atomicity for sequential consistency not reproduced.]

35 Write-back Caches
2 processor operations: PrRd, PrWr. 3 states: invalid, valid (clean), modified (dirty); ownership determines who supplies the block. 2 bus transactions: read (BusRd), write-back (BusWB); only cache-block transfers => treat valid as shared and modified as exclusive => introduce one new bus transaction, read-exclusive (BusRdX): a read for the purpose of modifying (read-to-own).

36 MSI Invalidate Protocol
A read obtains the block in the shared state even if it is the only cached copy. Obtain exclusive ownership before writing: BusRdX causes others to invalidate (demote); if the block is in M, a BusRdX from another cache forces a flush; BusRdX is issued even on a hit in S, promoting to M (upgrade). What about replacement? S->I and M->I as before. Transitions: M: PrRd/--, PrWr/--; BusRd/Flush -> S; BusRdX/Flush -> I. S: PrRd/--; BusRd/--; PrWr/BusRdX -> M; BusRdX/-- -> I. I: PrRd/BusRd -> S; PrWr/BusRdX -> M.
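A minimal C sketch of the MSI transitions above, for a single block. This is illustrative only: a real controller tracks many blocks and must handle bus arbitration and races, and all names here are invented.

    #include <stdio.h>

    typedef enum { I_ST, S_ST, M_ST } MsiState;
    typedef enum { PR_RD, PR_WR } ProcOp;      /* processor-side events */
    typedef enum { BUS_RD, BUS_RDX } BusOp;    /* snooped bus events */

    /* Processor-side transition: returns the bus transaction to issue,
       or -1 for none (a hit that needs no bus traffic). */
    int msi_proc(MsiState *st, ProcOp op) {
        switch (*st) {
        case I_ST:
            *st = (op == PR_RD) ? S_ST : M_ST;
            return (op == PR_RD) ? BUS_RD : BUS_RDX;        /* miss */
        case S_ST:
            if (op == PR_WR) { *st = M_ST; return BUS_RDX; } /* upgrade */
            return -1;                                       /* read hit */
        default:                                 /* M: hit either way */
            return -1;
        }
    }

    /* Snoop-side transition: *flush is set if dirty data must be supplied. */
    void msi_snoop(MsiState *st, BusOp op, int *flush) {
        *flush = (*st == M_ST);        /* only M holds the latest data */
        if (op == BUS_RDX)
            *st = I_ST;                /* another writer: invalidate */
        else if (*st == M_ST)
            *st = S_ST;                /* another reader: demote M to S */
    }

    int main(void) {
        MsiState p1 = I_ST, p3 = I_ST;
        int flush;
        msi_proc(&p1, PR_RD);            /* P1 reads: I -> S via BusRd */
        msi_proc(&p3, PR_WR);            /* P3 writes: I -> M via BusRdX */
        msi_snoop(&p1, BUS_RDX, &flush); /* P1 snoops P3's BusRdX: S -> I */
        printf("P1=%d P3=%d\n", p1, p3); /* P1=I, P3=M */
        return 0;
    }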

37 An Example of the MSI WB Invalidation Protocol [worked example not reproduced]

38 Example: Write-Back Protocol
[Figure: processors P0, P1, P4 issue PrRd U and PrWr U; blocks move through S and M via BusRd and BusRdX; on a later BusRd the modified copy is flushed, updating memory's u from 5 to 7.]

39 Correctness
When is a write miss performed? How does the writer observe the write? How is it made visible to others? How do they observe the write? When is a write hit made visible?

40 Write Serialization for Coherence
Writes that appear on the bus (BusRdX) are ordered by the bus: each is performed in the writer's cache before other transactions, so they are ordered the same with respect to all processors (including the writer); read misses are also ordered with respect to them. Writes that don't appear on the bus: after P issues BusRdX for block B, further memory operations on B until the next bus transaction are from P (read and write hits, which are in program order); a read or write from another processor is separated by an intervening bus transaction. Read hits?

41 Sequential Consistency
The bus imposes a total order on bus transactions for all locations. Between transactions, processors perform reads/writes (locally) in program order. So any execution defines a natural partial order: Mj is subsequent to Mi if (i) it follows in program order on the same processor, or (ii) Mj generates a bus transaction that follows the memory operation for Mi. In a segment between two bus transactions, any interleaving of local program orders leads to a consistent total order. Within a segment, writes observed by processor P are serialized as: writes from other processors by the previous bus transaction P issued, then writes from P in program order.

42 Sufficient Conditions
Every process issues memory operations in program order; after a write issues, the issuing process waits for the write to complete before issuing the next memory operation; after a read is issued, the issuing process waits for the read to complete, and for the write whose value is being returned to complete (globally), before issuing its next operation. Write completion: can be detected when the write appears on the bus. Write atomicity: if a read returns the value of a write, that write has already become visible to all others.

43 Lower-level Protocol Choices
BusRd observed in the M state: what transition to make, M -> I or M -> S? Depends on expectations of access patterns. How does memory know whether or not to supply data on a BusRd? Problem: a read followed by a write is 2 bus transactions, even if there is no sharing: BusRd (I->S) followed by BusRdX or BusUpgr (S->M). What happens on sequential programs?

44 MESI (4-state) Invalidation Protocol
Add an exclusive state: distinguish exclusive (writable) from owned (written). Main memory is up to date, so the cache is not necessarily the owner; the block can be written locally. States: invalid; exclusive or exclusive-clean (only this cache has a copy, but it is not modified); shared (two or more caches may have copies); modified (dirty). I -> E on PrRd if no cache has a copy => how can you tell?

45 Hardware Support for MESI
[Figure: caches on the bus drive a wired-OR shared signal.] All cache controllers snoop on BusRd and assert the shared signal if the block is present (whether in S, E, or M). The issuer chooses between S and E based on the signal. Is it possible for a block to be in the S state even if no other copies exist?
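A tiny sketch of the decision the shared signal enables. The interface is hypothetical: in hardware the wired-OR is computed by the bus itself, not by a function call.

    typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } MesiState;

    /* Each snooping cache contributes to the wired-OR shared line. */
    int asserts_shared(MesiState other) {
        return other != INVALID;               /* S, E, or M all assert */
    }

    /* The issuer of the BusRd picks its fill state from the line. */
    MesiState fill_on_read_miss(int shared_line) {
        return shared_line ? SHARED : EXCLUSIVE;   /* I->S vs. I->E */
    }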

46 MESI State Transition Diagram
BusRd(S) means the shared line was asserted on the BusRd transaction; BusRd(!S) means it was not. Flush': if cache-to-cache transfers are used, only one cache flushes the data. The MOESI protocol adds an Owned state: exclusive, but memory is not valid. Transitions: I: PrRd/BusRd(!S) -> E; PrRd/BusRd(S) -> S; PrWr/BusRdX -> M. E: PrRd/--; PrWr/-- -> M; BusRd/Flush -> S; BusRdX/Flush -> I. S: PrRd/--; PrWr/BusRdX -> M; BusRd/Flush'; BusRdX/Flush' -> I. M: PrRd/--, PrWr/--; BusRd/Flush -> S; BusRdX/Flush -> I.

47 Lower-level Protocol Choices
Who supplies the data on a miss when the block is not in the M state: memory or a cache? The original Illinois MESI used the cache, assumed to be faster than memory. That is not true in modern systems: intervening in another cache is more expensive than getting the data from memory. Cache-to-cache sharing adds complexity: how does memory know it should supply the data (it must wait for the caches)? A selection algorithm is needed if multiple caches have valid data. Valuable for cache-coherent machines with distributed memory: it may be cheaper to obtain data from a nearby cache than from distant memory, especially when the machine is constructed out of SMP nodes (Stanford DASH).

48 Update Protocols
If data is to be communicated between processors, invalidate protocols seem inefficient. Consider a shared flag: p0 waits for it to be zero, then does work and sets it to one; p1 waits for it to be one, then does work and sets it to zero. How many transactions?

49 Dragon Write-back Update Protocol
4 states. Exclusive-clean or exclusive (E): I and memory have it. Shared-clean (Sc): I, others, and maybe memory, but I'm not the owner. Shared-modified (Sm): I and others but not memory, and I'm the owner; Sm and Sc can coexist in different caches, with only one Sm. Modified or dirty (M): I and no one else. No invalid state: if the block is in the cache, it cannot be invalid; if it is not present, view it as being in a not-present or invalid state. New processor events: PrRdMiss and PrWrMiss, introduced to specify actions when the block is not present in the cache. New bus transaction: BusUpd, which broadcasts the single word written on the bus and updates the other relevant caches.

50 Dragon State Transition Diagram
Transitions ((S) = shared line asserted, (!S) = not asserted). Not present: PrRdMiss/BusRd(!S) -> E; PrRdMiss/BusRd(S) -> Sc; PrWrMiss/BusRd(!S) -> M; PrWrMiss/(BusRd(S); BusUpd) -> Sm. E: PrRd/--; PrWr/-- -> M; BusRd/-- -> Sc. Sc: PrRd/--; BusUpd/Update; PrWr/BusUpd(!S) -> M; PrWr/BusUpd(S) -> Sm. Sm: PrRd/--; BusRd/Flush; BusUpd/Update -> Sc; PrWr/BusUpd(!S) -> M; PrWr/BusUpd(S) -> Sm. M: PrRd/--, PrWr/--; BusRd/Flush -> Sm.

51 An Example of the Dragon Update Protocol [worked example not reproduced]

52 Lower-level Protocol Choices
Can the shared-modified state be eliminated? Yes, if memory is updated as well on BusUpd transactions (DEC Firefly); the Dragon protocol doesn't do this (it assumes DRAM memory is slow to update). Should replacement of an Sc block be broadcast? It would allow the last copy to go to the E state and not generate updates; the replacement bus transaction is not in the critical path, but a later update may be. Can the local copy be updated on a write hit before the controller gets the bus? No: it can mess up serialization. Coherence and consistency considerations are much like the write-through case.

53 Assessing Protocol Tradeoffs
Tradeoffs are affected by technology characteristics and design complexity. Part art and part science. Art: the experience, intuition, and aesthetics of designers. Science: workload-driven evaluation for cost-performance; we want a balanced system, with no expensive resource heavily underutilized.

54 Workload-Driven Evaluation
Evaluating real machines, and evaluating an architectural idea or trade-offs => need good metrics of performance, need to pick good workloads, and need to pay attention to scaling; many factors are involved.

55 Evaluation in Uniprocessors
Decisions are made only after quantitative evaluation. For existing systems: comparison and procurement evaluation. For future systems: careful extrapolation from known quantities. A wide base of programs leads to standard benchmarks, measured on a wide range of machines and successive generations. Measurements and technology assessment lead to proposed features; then simulation: a simulator is developed that can run with and without a feature, benchmarks are run through the simulator to obtain results, and together with cost and complexity, decisions are made.

56 More Difficult for Multiprocessors
What is a representative workload? The software model has not stabilized. Many architectural and application degrees of freedom: a huge design space (number of processors, other architectural and application parameters), where the impact of these parameters and their interactions can be huge; high cost of communication. What are the appropriate metrics? Simulation is expensive: realistic configurations and sensitivity analysis are difficult; the design space is larger and harder to cover. Understanding parallel programs as workloads is critical, particularly the interaction of application and architectural parameters.

57 A Lot Depends on Sizes
Application parameters and the number of processors affect inherent properties: load balance, communication, extra work, temporal and spatial locality. Interactions with the organization parameters of the extended memory hierarchy affect artifactual communication and performance. Effects are often dramatic, sometimes small: application-dependent. [Plots: speedup vs. number of processors for Ocean (N = 130, 258, 514, 1,026) and for Barnes-Hut on Origin (16 K, 64 K, 512 K) and Challenge (16 K, 512 K).] Understanding size interactions and scaling relationships is key.

58 Scaling: Why Worry?
Fixed problem size is insufficient. Too small a problem: may be appropriate for a small machine, but parallelism overheads begin to dominate benefits for larger machines (load imbalance, communication-to-computation ratio); may even achieve slowdowns; doesn't reflect real usage and is inappropriate for large machines. Too large a problem: difficult to measure improvement (next).

59 Too Large a Problem
Suppose the problem is realistically large for the big machine. It may not fit in a small machine: it can't run, or thrashes to disk, or the working set doesn't fit in cache; it fits at some p, leading to superlinear speedup. That is a real effect, but it doesn't help evaluate effectiveness. Finally, users want to scale problems as machines grow; scaling can help avoid these problems.

60 Demonstrating Scaling Problems
[Plots: speedup vs. number of processors on the SGI Origin, against the ideal line, for a small Ocean problem (258 x 258) and a big grid solver (12 K x 12 K).]

61 Example Application Set
Table 4.1: General statistics about application programs. Applications and inputs: LU (dense matrix factored in blocks), Ocean (grids, fixed tolerance and time-steps), Barnes-Hut (16 K particles), Radix (256 K points, radix = 1,024), Raytrace (car scene), Radiosity (room scene), Multiprog user/kernel (SGI IRIX 5.2, two pmakes + two compress jobs). Columns report total instructions, total FLOPS, total references, reads, writes, shared reads, and shared writes (all in millions), plus the numbers of barriers and locks; the individual values are not reproduced here. For the parallel programs, shared reads and writes simply refer to all nonstack references issued by the application processes; all such references do not necessarily point to data that is truly shared by multiple processes. The Multiprog workload is not a parallel application, so it does not access shared data. A dash in a table entry means that the measurement is not applicable to, or was not measured for, that application (e.g., Radix has no floating-point operations).

62 Types of Workloads
Kernels: matrix factorization, FFT, depth-first tree search. Complete applications: ocean simulation, crew scheduling, database. Multiprogrammed workloads. The spectrum runs from realistic and complex (multiprogrammed workloads and applications, whose higher-level interactions are what really matter) to controlled and repeatable (kernels and microbenchmarks, which are easier to understand and expose basic machine characteristics). Each has its place: use kernels and microbenchmarks to gain understanding, but applications to evaluate effectiveness and performance.

63 Coverage: Stressing Features
It is easy to mislead with workloads: choose those with features for which the machine is good and avoid others. Some features of interest: compute vs. memory vs. communication vs. I/O bound; working set size and spatial locality; local memory and communication bandwidth needs; importance of communication latency; fine-grained or coarse-grained (data access, communication, task size); synchronization patterns and granularity; contention; communication patterns. Choose workloads that cover a range of properties.

64 Concurrency
Should have enough to utilize the processors. Algorithmic speedup: a useful measure of concurrency/imbalance; it is the speedup (under the scaling model) assuming all memory/communication operations take zero time; it ignores the memory system and measures imbalance and extra work; it uses the PRAM machine model (Parallel Random Access Machine), unrealistic but widely used for theoretical algorithm development. At the least, one should isolate performance limitations due to program characteristics that a machine cannot do much about (concurrency) from those that it can.

65 Workload/Benchmark Suites
Numerical Aerodynamic Simulation (NAS): originally pencil-and-paper benchmarks. SPLASH/SPLASH-2: shared address space parallel programs. ParkBench: message-passing parallel programs. ScaLapack: message-passing kernels. TPC: transaction processing. SPEC-HPC...

66 Multiprocessor Simulation
Simulation runs on a uniprocessor (it can be parallelized too); simulated processes are interleaved on the processor. Two parts to a simulator: a reference generator, which plays the role of the simulated processors and schedules simulated processes based on simulated time; and a simulator of the extended memory hierarchy, which simulates the operations (references, commands) issued by the reference generator. The coupling or information flow between the two parts varies: trace-driven simulation flows from the generator to the simulator; execution-driven simulation flows in both directions (more accurate). The simulator keeps track of simulated time and detailed statistics.

67 Execution-driven Simulation
The memory hierarchy simulator returns simulated time information to the reference generator, which uses it to schedule the simulated processes. [Figure: simulated processors P1..Pp with caches and memories on a network, split between the reference generator and the memory-and-interconnect simulator.]

68 Difficulties in Simulation-based Evaluation
Cost of simulation (in time and memory): we cannot simulate the problem/machine sizes we care about, so we have to use scaled-down problem and machine sizes; how to scale down and stay representative? Huge design space: application parameters (as before) and machine parameters (depending on the generality of the evaluation context): number of processors; cache/replication size; associativity; granularities of allocation, transfer, and coherence; communication parameters (latency, bandwidth, occupancies). The cost of simulation makes it all the more critical to prune the space.

69 Choosing Parameters
Problem size and number of processors: use inherent-characteristics considerations as discussed earlier; for example, a low communication-to-computation ratio will not allow block transfer to help much. Cache/replication size: choose based on knowledge of the working set curve; choosing cache sizes for a given problem and machine size is analogous to choosing problem sizes for a given cache and machine size, as discussed. Whether or not the working set fits affects block transfer benefits greatly: if the data is local, not fitting makes communication relatively less important; if nonlocal, it can increase artifactual communication, so block transfer has more opportunity. Sharp knees in the working set curve can help prune the space; knees can be determined by analysis or by very simple simulation.

70 Focus on Protocol Tradeoffs
Methodology: choose cache parameters (default 1 MB, 4-way cache, 64-byte block, 16 processors; a 64 KB cache for some experiments). Focus on frequencies, not end performance, for now; this transcends architectural details. Use an idealized memory performance model: a simple PRAM cost model in which all memory operations are assumed to complete in one cycle; cheap simulation, with no need to model contention. Run the program on a parallel machine simulator, collect a trace of cache state transitions, and analyze the properties of the transitions. This evaluates the protocol class (invalidation or update), the protocol states and actions, and lower-level implementation tradeoffs.

71 Bandwidth per Transition in MESI
Bus transaction costs (bytes): BusRd: 6 address/cmd + 64 data; BusRdX: 6 + 64; BusWB: 6 + 64; BusUpgd: 6, no data. [Table: Ocean data cache frequency matrix of state transitions (per 1000 references) among NP, I, E, S, M; values not reproduced.] Example 5.8 (p. 310): how do we compute the bus bandwidth requirement? State transitions -> bus transactions -> bus traffic per memory reference (376.5 M instructions vs. 99.7 M memory references) -> bus traffic per instruction (at 200 MIPS) -> required bandwidth for an application (see next slide).
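A sketch of the Example 5.8 pipeline from transition counts to bandwidth. The transition frequencies below are placeholders, not Ocean's actual numbers; only the cost model (6 address/cmd bytes, 64 data bytes) and the instruction/reference counts are from the slide.

    #include <stdio.h>

    int main(void) {
        /* hypothetical bus transactions per 1000 memory references */
        double busrd_per_k   = 20.0, busrdx_per_k  = 5.0;
        double buswb_per_k   = 5.0,  busupgd_per_k = 2.0;
        double addr_bytes = 6.0, data_bytes = 64.0;

        double bytes_per_k =
            (busrd_per_k + busrdx_per_k + buswb_per_k)
                * (addr_bytes + data_bytes)
            + busupgd_per_k * addr_bytes;       /* upgrade carries no data */
        double bytes_per_ref  = bytes_per_k / 1000.0;
        double refs_per_instr = 99.7 / 376.5;   /* Ocean's ratio (slide) */
        double mips = 200e6;                    /* 200 MIPS processor */
        printf("traffic: %.1f MB/s per processor\n",
               bytes_per_ref * refs_per_instr * mips / 1e6);
        return 0;
    }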

72 Bandwidth Trade-off
1 MB cache, 200 MIPS / 200 MFLOPS processor. E -> M transitions are infrequent; BusUpgrade is cheap. [Bar charts: traffic (MB/s), split into address/cmd and data, for the parallel programs (barnes, lu, ocean, radiosity, radix, raytrace) and for Multiprog (application and OS code and data), each under the three protocols.] How many processors can be supported with a 1.2 GB/s bus? Ill (MESI): no need for BusUpgr from E to M. 3St (MSI): BusUpgr is used from S to M. 3St-RdEx: MSI with BusRdX used from S to M.

73 Smaller (64 KB) Caches
[Bar charts: traffic (MB/s), split into address/cmd and data, for ocean, radix, and raytrace under Ill, 3St, and 3St-RdEx with 64 KB caches.] How many processors can be supported with a 1.2 GB/s bus? Dramatically increased bus bandwidth requirement.

74 Cache Block Size
Trade-offs in uniprocessors with increasing block size: reduced cold misses (due to spatial locality), increased transfer time, increased conflict misses (fewer sets). Additional concerns in multiprocessors: parallel programs have less spatial locality; parallel programs have sharing; false sharing; bus contention. Need to classify misses to understand the impact: cold misses; capacity/conflict misses; true sharing misses (one processor writes words in a block, invalidating the block in another processor's cache, which is later read by that processor); false sharing misses.

75 Breakdown of Miss Rates with Block Size
Cold, capacity, and true sharing misses tend to decrease with increasing block size; false sharing misses tend to increase. [Bar charts: miss rate decomposed into cold (COLDMR), capacity (CAPMR), true sharing (TSMR), false sharing (FSMR), and upgrade (UPGMR) components vs. block sizes of 8-256 bytes for barnes, lu, radiosity, ocean, radix, and raytrace.]

76 Miss Rate Breakdown with 64 KB Caches
[Bar charts: miss-rate breakdown vs. block size for ocean, radix, and raytrace with 64 KB caches.] The miss rate, especially the capacity miss rate, is much higher. True sharing misses aren't significantly different. A larger block size is more effective, but with risks (higher demand on the bus).

77 Impact of Block Size on Traffic
[Bar charts: bus traffic (bytes/instruction or bytes/FLOP), split into address/cmd and data, vs. block size for barnes, radiosity, raytrace, radix, lu, and ocean.] Bus traffic indirectly impacts performance (via contention and increased miss penalty). Bus traffic increases with the block size, but the overall traffic is still small.

78 Traffic with 64 KB Caches
[Bar charts: bus traffic (bytes/instruction and bytes/FLOP) vs. block size for radix, raytrace, and ocean with 64 KB caches.]

79 Making Large Blocks More Effective
Software: improve spatial locality by better data structuring; compiler techniques. Hardware: retain the granularity of transfer but reduce the granularity of coherence (use subblocks: same tag but different state bits; one subblock may be valid while another is invalid or dirty); or reduce both granularities but prefetch more blocks on a miss; or use update instead of invalidate protocols to reduce the false sharing effect.

80 Update versus Invalidate
Much debate over the years: the tradeoff depends on sharing patterns. Intuition: if those that used a block continue to use it, and writes between uses are few, update should do better (e.g., the producer-consumer pattern). If those that used it are unlikely to use it again, or there are many writes between reads, updates are not good: the pack-rat phenomenon, particularly bad under process migration in a multiprogramming environment; useless updates where only the last one will be used. One can construct scenarios where either is much better. They can be combined in hybrid schemes, e.g., competitive: observe patterns at runtime and change the protocol.

81 Upgrade and Update Rates (Traffic)
Update traffic is substantial. The main cause is multiple writes by a processor before a read by another: many bus transactions versus one in the invalidation case. Overall, the trend is away from update-based protocols as the default: bandwidth, complexity, the trend toward large blocks, and the pack-rat effect under process migration; updates have greater problems for scalable systems.

82 Hardware-Software Trade-offs in Synchronization
Role of synchronization: "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast." Types of synchronization: mutual exclusion; event synchronization (point-to-point, group, global (barriers)). How much hardware support? High-level operations? Atomic instructions? Specialized interconnect?

83 Mini-Instruction Set Debate
Atomic read-modify-write instructions [the bottom line on hardware support]. IBM 370: included atomic compare&swap for multiprogramming. x86: any instruction can be prefixed with a lock modifier. High-level language advocates want hardware locks/barriers, but that goes against the RISC flow and has other problems. SPARC: atomic register-memory ops (swap, compare&swap). MIPS, IBM Power: no atomic operations, but a pair of instructions: load-locked, store-conditional, later used by PowerPC and DEC Alpha too. A rich set of tradeoffs.

84 Components of a Synchronization Event
Acquire method: acquire the right to the synch (enter the critical section, go past the event). Waiting algorithm: wait for the synch to become available when it isn't: busy-waiting, blocking, or hybrid. Release method: enable other processors to acquire the right to the synch. The waiting algorithm is independent of the type of synchronization, so it makes no sense to put it in hardware.

85 Strawman Lock (Busy-Wait)

    lock:   ld   register, location  /* copy location to register */
            cmp  register, #0        /* compare with 0 */
            bnz  lock                /* if not 0, try again */
            st   location, #1        /* store 1 to mark it locked */
            ret                      /* return control to caller */

    unlock: st   location, #0        /* write 0 to location */
            ret                      /* return control to caller */

Why doesn't the acquire method work?

86 Atomic Instructions
Specifies a location, a register, and an atomic operation (read-modify-write): the value in the location is read into the register, and another value (possibly a function of the value read) is stored into the location. Many variants, with varying degrees of flexibility in the second part. Simple example: test&set. The value in the location is read into a specified register, and the constant 1 is stored into the location; the operation is successful if the value loaded into the register is 0. Other constants could be used instead of 1 and 0.
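The semantics of test&set, expressed with C11 atomics; this is an illustrative sketch, since a real ISA provides it as a single instruction.

    #include <stdatomic.h>

    /* Returns the old value: 0 means the caller acquired the lock. */
    int test_and_set(atomic_int *loc) {
        return atomic_exchange(loc, 1);   /* atomically read old, store 1 */
    }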

87 Simple Test&Set Lock

    lock:   t&s  register, location
            bnz  lock                /* if not 0, try again */
            ret                      /* return control to caller */

    unlock: st   location, #0        /* write 0 to location */
            ret                      /* return control to caller */

Other read-modify-write primitives: swap; fetch&op (fetch&increment, fetch&decrement, fetch&add, ...); compare&swap, which takes three operands: a location (m), a register to compare with (r1), and a register to swap with (r2) [if m == r1 then m = r2]; it is not commonly supported by RISC instruction sets.

88 Performance Criteria for Synch. Ops
Latency (time per op), especially under light contention. Bandwidth (ops per sec), especially under high contention. Traffic: load on critical resources, especially on failures under contention. Fairness.

89 T&S Lock Microbenchmark: SGI Challenge
lock(L); critical_section(c); unlock(L); the same lock call is repeated with an increasing number of processors, and the delay (c) is not counted. [Plot: time (microseconds) vs. number of processors for test&set with c = 0; test&set with exponential backoff, c = 3.64; test&set with exponential backoff, c = 0; and the ideal.] Why does performance degrade? Bus transactions on T&S? (Every T&S has a write, invalidating copies in other processors.)

90 Enhancements to Simple Lock
Reduce the frequency of issuing test&sets while waiting. Test&set lock with backoff: don't back off too much, or you will still be backed off when the lock becomes free; exponential backoff works quite well empirically: delay on the ith attempt = k * c^i. Busy-wait with read operations rather than test&set: the test-and-test&set lock keeps testing with an ordinary load (the cached lock variable will be invalidated when a release occurs); when the value changes (to 0), try to obtain the lock with test&set; only one attemptor will succeed, and the others will fail and start testing again. A C sketch of this idea follows.
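A sketch of test-and-test&set with exponential backoff in C11 atomics; the constants (k = 1 iteration, cap of 2^16) are arbitrary illustrative choices.

    #include <stdatomic.h>

    void ttas_lock(atomic_int *lock) {
        unsigned delay = 1;
        for (;;) {
            while (atomic_load(lock) != 0)
                ;                               /* test: spin on cached copy */
            if (atomic_exchange(lock, 1) == 0)  /* test&set: one bus r-m-w */
                return;
            for (volatile unsigned i = 0; i < delay; i++)
                ;                               /* back off before retrying */
            if (delay < (1u << 16))
                delay *= 2;                     /* exponential backoff */
        }
    }

    void ttas_unlock(atomic_int *lock) {
        atomic_store(lock, 0);      /* release invalidates the spinners */
    }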

91 Improved Hardware Primitives: LL-SC
Goals: test with reads; failed read-modify-write attempts don't generate invalidations; nice if a single primitive can implement a range of r-m-w operations. Load-Locked (or -linked), Store-Conditional: LL reads the variable into a register; follow with arbitrary instructions to manipulate its value; SC tries to store back to the location and succeeds if and only if no other write to the variable has occurred since this processor's LL (indicated by condition codes). If the SC succeeds, all three steps happened atomically; if it fails, it doesn't write or generate invalidations, and the acquire must be retried.

92 Simple Lock with LL-SC

    lock:   ll   reg1, location   /* LL location to reg1 */
            bnz  reg1, lock       /* if already locked, try again */
            sc   location, reg2   /* store reg2 conditionally into location */
            beqz reg2, lock       /* if SC failed, start again */
            ret

    unlock: st   location, #0     /* write 0 to location */
            ret

Can do fancier atomic ops by changing what's between the LL and SC, but keep it small so the SC is likely to succeed, and don't include instructions that would need to be undone (e.g., stores). SC can fail (without putting a transaction on the bus) if it detects an intervening write even before trying to get the bus, or if it tries to get the bus but another processor's SC gets the bus first. LL and SC are not lock and unlock respectively: they only guarantee no conflicting write to the lock variable between them, but they can be used directly to implement simple operations on shared variables, as sketched below.
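For example, a fetch&add built from an LL/SC-style retry loop. In C11 this is written with compare-and-swap, which compilers lower to LL/SC on ISAs such as MIPS, Alpha, PowerPC, ARM, and RISC-V; a sketch, not the slide's code.

    #include <stdatomic.h>

    int fetch_and_add(atomic_int *loc, int amount) {
        int old = atomic_load(loc);        /* like LL: read the value */
        /* like a failed SC: retry if another write intervened;
           on failure, old is reloaded with the current value */
        while (!atomic_compare_exchange_weak(loc, &old, old + amount))
            ;
        return old;
    }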

93 Ticket Lock
What happens when several processors are spinning on a lock and it is released? How much traffic per lock operation? Only one r-m-w per acquire. Two counters per lock (next_ticket, now_serving). Acquire: fetch&inc next_ticket, then wait until now_serving == my ticket; the atomic op happens when arriving at the lock, not when it becomes free, so there is less contention. Release: increment now_serving. Performance: low latency under low contention, if fetch&inc is cacheable; O(p) read misses at release, since all spin on the same variable. A sketch follows.
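A C11 sketch of the ticket lock; the names are invented for illustration.

    #include <stdatomic.h>

    typedef struct {
        atomic_uint next_ticket;   /* fetch&inc'd once per acquire */
        atomic_uint now_serving;   /* advanced by each release */
    } ticket_lock_t;               /* initialize both fields to 0 */

    void ticket_acquire(ticket_lock_t *l) {
        unsigned me = atomic_fetch_add(&l->next_ticket, 1); /* only r-m-w */
        while (atomic_load(&l->now_serving) != me)
            ;                      /* all waiters spin on one variable */
    }

    void ticket_release(ticket_lock_t *l) {
        atomic_fetch_add(&l->now_serving, 1);   /* wake the next ticket */
    }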

94 Array-based Queuing Locks
Waiting processes poll on different locations in an array of size p. Acquire: fetch&inc to obtain the address on which to spin (the next array element); ensure that these addresses are in different cache lines or memories. Release: set the next location in the array, thus waking up the process spinning on it. O(1) traffic per acquire with coherent caches; FIFO ordering, as in the ticket lock; but O(p) space per lock. A sketch follows.
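A C11 sketch of the array-based queueing lock. The padding keeps each slot in its own cache line so every waiter spins locally; sizes and names are illustrative.

    #include <stdatomic.h>

    #define NPROCS 16
    #define LINE_SIZE 64

    typedef struct {
        struct {
            atomic_int has_lock;               /* 1 = this slot may enter */
            char pad[LINE_SIZE - sizeof(atomic_int)];
        } slot[NPROCS];                        /* init: slot[0].has_lock = 1,
                                                  all others 0 */
        atomic_uint next_slot;                 /* init: 0 */
    } array_lock_t;

    unsigned array_acquire(array_lock_t *l) {
        unsigned me = atomic_fetch_add(&l->next_slot, 1) % NPROCS;
        while (atomic_load(&l->slot[me].has_lock) == 0)
            ;                                  /* spin on my own line */
        atomic_store(&l->slot[me].has_lock, 0); /* consume for reuse */
        return me;                             /* needed by release */
    }

    void array_release(array_lock_t *l, unsigned me) {
        atomic_store(&l->slot[(me + 1) % NPROCS].has_lock, 1); /* wake next */
    }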

95 Point-to-Point Event Synchronization
Software methods: interrupts; busy-waiting, using ordinary variables as flags; blocking, using semaphores. Full hardware support: a full-empty bit with each word in memory, set when the word is full of newly produced data (i.e., when written) and unset when the word is emptied by being consumed (i.e., when read). Natural for word-level producer-consumer synchronization: the producer writes if empty and sets to full; the consumer reads if full and sets to empty. Hardware preserves the atomicity of the bit manipulation with the read or write. Problem: flexibility: multiple consumers, or multiple writes before the consumer reads? It needs language support to specify when to use it; and what about composite data structures?

96 Barriers
Software algorithms are implemented using locks, flags, and counters. Hardware barriers: a wired-AND line separate from the address/data bus; set the input high on arrival, and wait for the output to go high before leaving. In practice, multiple wires allow reuse. Useful when barriers are global and very frequent. Difficult to support an arbitrary subset of processors, and even harder with multiple processes per processor. Difficult to dynamically change the number and identity of participants (the latter, e.g., due to process migration). Not common today on bus-based machines.

97 A Simple Centralized Barrier
Increment on arrival (under the lock), then check until the count reaches numprocs. A shared counter maintains the number of processes that have arrived. Problem?

    struct bar_type {
        int counter;
        struct lock_type lock;
        int flag = 0;
    } bar_name;

    BARRIER (bar_name, p) {
        LOCK(bar_name.lock);
        if (bar_name.counter == 0)
            bar_name.flag = 0;              /* reset flag if first to reach */
        mycount = ++bar_name.counter;       /* mycount is private */
        UNLOCK(bar_name.lock);
        if (mycount == p) {                 /* last to arrive */
            bar_name.counter = 0;           /* reset for next barrier */
            bar_name.flag = 1;              /* release waiters */
        } else
            while (bar_name.flag == 0) {};  /* busy wait for release */
    }

98 A Working Centralized Barrier
Consecutively entering the same barrier doesn't work: we must prevent a process from entering until all have left the previous instance. Could use another counter, but that increases latency and contention. Sense reversal: wait for the flag to take a different value in consecutive instances; toggle this value only when all processes have reached the barrier.

    BARRIER (bar_name, p) {
        local_sense = !(local_sense);       /* toggle private sense variable */
        LOCK(bar_name.lock);
        mycount = bar_name.counter++;       /* mycount is private */
        if (bar_name.counter == p) {        /* last to arrive */
            UNLOCK(bar_name.lock);
            bar_name.counter = 0;           /* reset for next barrier */
            bar_name.flag = local_sense;    /* release waiters */
        } else {
            UNLOCK(bar_name.lock);
            while (bar_name.flag != local_sense) {};
        }
    }


More information

Lecture 25: Thread Level Parallelism -- Synchronization and Memory Consistency

Lecture 25: Thread Level Parallelism -- Synchronization and Memory Consistency Lecture 25: Thread Level Parallelism -- Synchronization and Memory Consistency CSE 564 Computer Architecture Fall 2016 Department of Computer Science and Engineering Yonghong Yan yan@oakland.edu www.secs.oakland.edu/~yan

More information

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy

EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy EN2910A: Advanced Computer Architecture Topic 05: Coherency of Memory Hierarchy Prof. Sherief Reda School of Engineering Brown University Material from: Parallel Computer Organization and Design by Debois,

More information

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization

Introduction. Coherency vs Consistency. Lec-11. Multi-Threading Concepts: Coherency, Consistency, and Synchronization Lec-11 Multi-Threading Concepts: Coherency, Consistency, and Synchronization Coherency vs Consistency Memory coherency and consistency are major concerns in the design of shared-memory systems. Consistency

More information

12 Cache-Organization 1

12 Cache-Organization 1 12 Cache-Organization 1 Caches Memory, 64M, 500 cycles L1 cache 64K, 1 cycles 1-5% misses L2 cache 4M, 10 cycles 10-20% misses L3 cache 16M, 20 cycles Memory, 256MB, 500 cycles 2 Improving Miss Penalty

More information

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems

Multiprocessors II: CC-NUMA DSM. CC-NUMA for Large Systems Multiprocessors II: CC-NUMA DSM DSM cache coherence the hardware stuff Today s topics: what happens when we lose snooping new issues: global vs. local cache line state enter the directory issues of increasing

More information

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors? Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing

More information

CSE502: Computer Architecture CSE 502: Computer Architecture

CSE502: Computer Architecture CSE 502: Computer Architecture CSE 502: Computer Architecture Shared-Memory Multi-Processors Shared-Memory Multiprocessors Multiple threads use shared memory (address space) SysV Shared Memory or Threads in software Communication implicit

More information

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence 1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

Page 1. Outline. Coherence vs. Consistency. Why Consistency is Important

Page 1. Outline. Coherence vs. Consistency. Why Consistency is Important Outline ECE 259 / CPS 221 Advanced Computer Architecture II (Parallel Computer Architecture) Memory Consistency Models Copyright 2006 Daniel J. Sorin Duke University Slides are derived from work by Sarita

More information

Synchronization. Erik Hagersten Uppsala University Sweden. Components of a Synchronization Even. Need to introduce synchronization.

Synchronization. Erik Hagersten Uppsala University Sweden. Components of a Synchronization Even. Need to introduce synchronization. Synchronization sum := thread_create Execution on a sequentially consistent shared-memory machine: Erik Hagersten Uppsala University Sweden while (sum < threshold) sum := sum while + (sum < threshold)

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 24 Mahadevan Gomathisankaran April 29, 2010 04/29/2010 Lecture 24 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Multicore Workshop. Cache Coherency. Mark Bull David Henty. EPCC, University of Edinburgh

Multicore Workshop. Cache Coherency. Mark Bull David Henty. EPCC, University of Edinburgh Multicore Workshop Cache Coherency Mark Bull David Henty EPCC, University of Edinburgh Symmetric MultiProcessing 2 Each processor in an SMP has equal access to all parts of memory same latency and bandwidth

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Cache Coherence. (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri

Cache Coherence. (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri Cache Coherence (Architectural Supports for Efficient Shared Memory) Mainak Chaudhuri mainakc@cse.iitk.ac.in 1 Setting Agenda Software: shared address space Hardware: shared memory multiprocessors Cache

More information

Advanced OpenMP. Lecture 3: Cache Coherency

Advanced OpenMP. Lecture 3: Cache Coherency Advanced OpenMP Lecture 3: Cache Coherency Cache coherency Main difficulty in building multiprocessor systems is the cache coherency problem. The shared memory programming model assumes that a shared variable

More information

Convergence of Parallel Architecture

Convergence of Parallel Architecture Parallel Computing Convergence of Parallel Architecture Hwansoo Han History Parallel architectures tied closely to programming models Divergent architectures, with no predictable pattern of growth Uncertainty

More information

Chapter 5. Multiprocessors and Thread-Level Parallelism

Chapter 5. Multiprocessors and Thread-Level Parallelism Computer Architecture A Quantitative Approach, Fifth Edition Chapter 5 Multiprocessors and Thread-Level Parallelism 1 Introduction Thread-Level parallelism Have multiple program counters Uses MIMD model

More information

Multiprocessors & Thread Level Parallelism

Multiprocessors & Thread Level Parallelism Multiprocessors & Thread Level Parallelism COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline Introduction

More information

Lecture 9 Outline. Lower-Level Protocol Choices. MESI (4-state) Invalidation Protocol. MESI: Processor-Initiated Transactions

Lecture 9 Outline. Lower-Level Protocol Choices. MESI (4-state) Invalidation Protocol. MESI: Processor-Initiated Transactions Outline protocol Dragon updatebased protocol mpact of protocol optimizations LowerLevel Protocol Choices observed in state: what transition to make? Change to : assume ll read again soon good for mostly

More information

[ 5.4] What cache line size is performs best? Which protocol is best to use?

[ 5.4] What cache line size is performs best? Which protocol is best to use? Performance results [ 5.4] What cache line size is performs best? Which protocol is best to use? Questions like these can be answered by simulation. However, getting the answer write is part art and part

More information

A three-state update protocol

A three-state update protocol A three-state update protocol Whenever a bus update is generated, suppose that main memory as well as the caches updates its contents. Then which state don t we need? What s the advantage, then, of having

More information

Overview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware

Overview: Shared Memory Hardware. Shared Address Space Systems. Shared Address Space and Shared Memory Computers. Shared Memory Hardware Overview: Shared Memory Hardware Shared Address Space Systems overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and

More information

Overview: Shared Memory Hardware

Overview: Shared Memory Hardware Overview: Shared Memory Hardware overview of shared address space systems example: cache hierarchy of the Intel Core i7 cache coherency protocols: basic ideas, invalidate and update protocols false sharing

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1. Memory technology & Hierarchy Back to caching... Advances in Computer Architecture Andy D. Pimentel Caches in a multi-processor context Dealing with concurrent updates Multiprocessor architecture In

More information

Lecture 3: Snooping Protocols. Topics: snooping-based cache coherence implementations

Lecture 3: Snooping Protocols. Topics: snooping-based cache coherence implementations Lecture 3: Snooping Protocols Topics: snooping-based cache coherence implementations 1 Design Issues, Optimizations When does memory get updated? demotion from modified to shared? move from modified in

More information

The need for atomicity This code sequence illustrates the need for atomicity. Explain.

The need for atomicity This code sequence illustrates the need for atomicity. Explain. Lock Implementations [ 8.1] Recall the three kinds of synchronization from Lecture 6: Point-to-point Lock Performance metrics for lock implementations Uncontended latency Traffic o Time to acquire a lock

More information

CS252 Spring 2017 Graduate Computer Architecture. Lecture 12: Cache Coherence

CS252 Spring 2017 Graduate Computer Architecture. Lecture 12: Cache Coherence CS252 Spring 2017 Graduate Computer Architecture Lecture 12: Cache Coherence Lisa Wu, Krste Asanovic http://inst.eecs.berkeley.edu/~cs252/sp17 WU UCB CS252 SP17 Last Time in Lecture 11 Memory Systems DRAM

More information

Page 1. Cache Coherence

Page 1. Cache Coherence Page 1 Cache Coherence 1 Page 2 Memory Consistency in SMPs CPU-1 CPU-2 A 100 cache-1 A 100 cache-2 CPU-Memory bus A 100 memory Suppose CPU-1 updates A to 200. write-back: memory and cache-2 have stale

More information

Aleksandar Milenkovich 1

Aleksandar Milenkovich 1 Parallel Computers Lecture 8: Multiprocessors Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection

More information

EITF20: Computer Architecture Part 5.2.1: IO and MultiProcessor

EITF20: Computer Architecture Part 5.2.1: IO and MultiProcessor EITF20: Computer Architecture Part 5.2.1: IO and MultiProcessor Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration I/O MultiProcessor Summary 2 Virtual memory benifits Using physical memory efficiently

More information

Chapter 5 Thread-Level Parallelism. Abdullah Muzahid

Chapter 5 Thread-Level Parallelism. Abdullah Muzahid Chapter 5 Thread-Level Parallelism Abdullah Muzahid 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors is saturating + Modern multiple issue processors are becoming very complex

More information

CS315A Midterm Solutions

CS315A Midterm Solutions K. Olukotun Spring 05/06 Handout #14 CS315a CS315A Midterm Solutions Open Book, Open Notes, Calculator okay NO computer. (Total time = 120 minutes) Name (please print): Solutions I agree to abide by the

More information

Workload-Driven Architectural Evaluation

Workload-Driven Architectural Evaluation Workload-Driven Architectural Evaluation Evaluation in Uniprocessors Decisions made only after quantitative evaluation For existing systems: comparison and procurement evaluation For future systems: careful

More information

EC 513 Computer Architecture

EC 513 Computer Architecture EC 513 Computer Architecture Cache Coherence - Directory Cache Coherence Prof. Michel A. Kinsy Shared Memory Multiprocessor Processor Cores Local Memories Memory Bus P 1 Snoopy Cache Physical Memory P

More information

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013

Lecture 13: Memory Consistency. + a Course-So-Far Review. Parallel Computer Architecture and Programming CMU , Spring 2013 Lecture 13: Memory Consistency + a Course-So-Far Review Parallel Computer Architecture and Programming Today: what you should know Understand the motivation for relaxed consistency models Understand the

More information

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types

Multiprocessor Cache Coherence. Chapter 5. Memory System is Coherent If... From ILP to TLP. Enforcing Cache Coherence. Multiprocessor Types Chapter 5 Multiprocessor Cache Coherence Thread-Level Parallelism 1: read 2: read 3: write??? 1 4 From ILP to TLP Memory System is Coherent If... ILP became inefficient in terms of Power consumption Silicon

More information

Review: Creating a Parallel Program. Programming for Performance

Review: Creating a Parallel Program. Programming for Performance Review: Creating a Parallel Program Can be done by programmer, compiler, run-time system or OS Steps for creating parallel program Decomposition Assignment of tasks to processes Orchestration Mapping (C)

More information

Lecture 19: Synchronization. CMU : Parallel Computer Architecture and Programming (Spring 2012)

Lecture 19: Synchronization. CMU : Parallel Computer Architecture and Programming (Spring 2012) Lecture 19: Synchronization CMU 15-418: Parallel Computer Architecture and Programming (Spring 2012) Announcements Assignment 4 due tonight at 11:59 PM Synchronization primitives (that we have or will

More information

Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB

Memory Consistency and Multiprocessor Performance. Adapted from UCB CS252 S01, Copyright 2001 USB Memory Consistency and Multiprocessor Performance Adapted from UCB CS252 S01, Copyright 2001 USB 1 Memory Consistency Model Define memory correctness for parallel execution Execution appears to the that

More information

Other consistency models

Other consistency models Last time: Symmetric multiprocessing (SMP) Lecture 25: Synchronization primitives Computer Architecture and Systems Programming (252-0061-00) CPU 0 CPU 1 CPU 2 CPU 3 Timothy Roscoe Herbstsemester 2012

More information

Lecture-22 (Cache Coherence Protocols) CS422-Spring

Lecture-22 (Cache Coherence Protocols) CS422-Spring Lecture-22 (Cache Coherence Protocols) CS422-Spring 2018 Biswa@CSE-IITK Single Core Core 0 Private L1 Cache Bus (Packet Scheduling) Private L2 DRAM CS422: Spring 2018 Biswabandan Panda, CSE@IITK 2 Multicore

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Flynn s Classification

Flynn s Classification Flynn s Classification SISD (Single Instruction Single Data) Uniprocessors MISD (Multiple Instruction Single Data) No machine is built yet for this type SIMD (Single Instruction Multiple Data) Examples:

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 15: Memory Consistency and Synchronization. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 15: Memory Consistency and Synchronization Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 5 (multi-core) " Basic requirements: out later today

More information

Aleksandar Milenkovic, Electrical and Computer Engineering University of Alabama in Huntsville

Aleksandar Milenkovic, Electrical and Computer Engineering University of Alabama in Huntsville Lecture 18: Multiprocessors Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Parallel Computers Definition: A parallel computer is a collection

More information

NOW Handout Page 1. Memory Consistency Model. Background for Debate on Memory Consistency Models. Multiprogrammed Uniprocessor Mem.

NOW Handout Page 1. Memory Consistency Model. Background for Debate on Memory Consistency Models. Multiprogrammed Uniprocessor Mem. Memory Consistency Model Background for Debate on Memory Consistency Models CS 258, Spring 99 David E. Culler Computer Science Division U.C. Berkeley for a SAS specifies constraints on the order in which

More information

CS 252 Graduate Computer Architecture. Lecture 11: Multiprocessors-II

CS 252 Graduate Computer Architecture. Lecture 11: Multiprocessors-II CS 252 Graduate Computer Architecture Lecture 11: Multiprocessors-II Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste http://inst.eecs.berkeley.edu/~cs252

More information

Memory Consistency and Multiprocessor Performance

Memory Consistency and Multiprocessor Performance Memory Consistency Model Memory Consistency and Multiprocessor Performance Define memory correctness for parallel execution Execution appears to the that of some correct execution of some theoretical parallel

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

Chapter 18 Parallel Processing

Chapter 18 Parallel Processing Chapter 18 Parallel Processing Multiple Processor Organization Single instruction, single data stream - SISD Single instruction, multiple data stream - SIMD Multiple instruction, single data stream - MISD

More information

Conventional Computer Architecture. Abstraction

Conventional Computer Architecture. Abstraction Conventional Computer Architecture Conventional = Sequential or Single Processor Single Processor Abstraction Conventional computer architecture has two aspects: 1 The definition of critical abstraction

More information

Module 7: Synchronization Lecture 13: Introduction to Atomic Primitives. The Lecture Contains: Synchronization. Waiting Algorithms.

Module 7: Synchronization Lecture 13: Introduction to Atomic Primitives. The Lecture Contains: Synchronization. Waiting Algorithms. The Lecture Contains: Synchronization Waiting Algorithms Implementation Hardwired Locks Software Locks Hardware Support Atomic Exchange Test & Set Fetch & op Compare & Swap Traffic of Test & Set Backoff

More information

Chapter 5. Thread-Level Parallelism

Chapter 5. Thread-Level Parallelism Chapter 5 Thread-Level Parallelism Instructor: Josep Torrellas CS433 Copyright Josep Torrellas 1999, 2001, 2002, 2013 1 Progress Towards Multiprocessors + Rate of speed growth in uniprocessors saturated

More information

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST

Chapter 8. Multiprocessors. In-Cheol Park Dept. of EE, KAIST Chapter 8. Multiprocessors In-Cheol Park Dept. of EE, KAIST Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will

More information

Review: Multiprocessor. CPE 631 Session 21: Multiprocessors (Part 2) Potential HW Coherency Solutions. Bus Snooping Topology

Review: Multiprocessor. CPE 631 Session 21: Multiprocessors (Part 2) Potential HW Coherency Solutions. Bus Snooping Topology Review: Multiprocessor CPE 631 Session 21: Multiprocessors (Part 2) Department of Electrical and Computer Engineering University of Alabama in Huntsville Basic issues and terminology Communication: share

More information