Multiprocessors: Basics, Cache coherence, Synchronization, and Memory consistency
1 Multiprocessors: Basics, Cache coherence, Synchronization, and Memory consistency Z. Jerry Shi Assistant Professor of Computer Science and Engineering University of Connecticut * Slides adapted from Blumrich&Gschwind/ELE475 03, Peh/ELE475 *
2 Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Almasi and Gottlieb, Highly Parallel Computing, 1989. Questions about parallel computers: How large a collection? How powerful are the processing elements? How do they cooperate and communicate? How are data transmitted? What type of interconnection? What are the HW and SW primitives for the programmer? Does it translate into performance?
3 Parallel Processors Religion The dream of computer architects since the 1950s: replicate processors to add performance vs. design a faster processor. Led to innovative organizations tied to particular programming models, since uniprocessors can't keep going, e.g., uniprocessors must stop getting faster due to the limit of the speed of light: 1972, 1989, NOW. Borders on religious fervor: you must believe! Fervor damped somewhat when 1990s companies went out of business: Thinking Machines, Kendall Square, ... Arguments: the pull of the opportunity of scalable performance; the push of the uniprocessor performance plateau?
4 What Level Parallelism? Bit-level parallelism: ~1970 to ~1985: 4-bit, 8-bit, 16-bit, 32-bit microprocessors. Instruction-level parallelism (ILP): ~1985 through today: pipelining, superscalar, VLIW, out-of-order execution. Limits to the benefits of ILP? Process-level or thread-level parallelism; mainstream for general-purpose computing? Servers are parallel. Many PCs too!
5 Why Multiprocessors? Microprocessors are the fastest CPUs leverage them Collecting several much easier than redesigning one Complexity of current microprocessors Do we have enough ideas to sustain 1.5X/yr? Can we deliver such complexity on schedule? Slow (but steady) improvement in parallel software (scientific apps, databases, OS) Emergence of embedded and server markets driving microprocessors in addition to desktops Embedded functional parallelism, producer/consumer model Server figure of merit is tasks per hour vs. latency Long term goal: scale number of processors to size of budget and desired performance
6 Parallel Architecture Parallel architecture extends traditional computer architecture with a communication architecture: Abstractions (HW/SW interface); Organizational structure to realize the abstraction efficiently.
7 Popular Flynn Categories (1966) SISD (Single Instruction Stream, Single Data Stream): Uniprocessors. SIMD (Single Instruction Stream, Multiple Data Stream): Single instruction memory and control processor (e.g., Illiac-IV, CM-2, vector machines). Simple programming model, low overhead; requires lots of data parallelism; all custom integrated circuits. MISD (Multiple Instruction Stream, Single Data Stream): No commercial multiprocessor of this type has been built. MIMD (Multiple Instruction Stream, Multiple Data Stream): Examples: IBM SP, Cray T3D, SGI Origin, your PC? Flexible; uses off-the-shelf microprocessors. MIMD is the current winner: concentrate on the major design emphasis, <= 128-processor MIMD machines.
8 MIMD: Centralized Shared Memory UMA: Uniform Memory Access (latency) Symmetric (shared-memory) Multi-Processor (SMP)
9 MIMD: Distributed Memory Advantages: More memory bandwidth (low cost), lower memory latency (local) Drawback: Longer communication latency, More complex software model
10 Distributed Memory Versions Distributed Shared Memory (DSM) Single physical address space with load/store access Address space is shared among nodes Also called Non-Uniform Memory Access (NUMA) Access time depends on the location of the data word in memory Message-Passing Multicomputer Separate address space per processor The same address on different nodes refers to different words Could be completely separated computers connected on LAN Also called clusters Tightly-coupled cluster
11 Layers of Parallel Framework Programming Model: Multiprogramming: lots of jobs, no communication. Shared memory: communicate via memory. Message passing: send and receive messages. Data Parallel: several agents operate on several data sets simultaneously and then exchange information globally and simultaneously (shared or message passing). Communication Abstraction: Shared address space, e.g., load, store, atomic swap. Message passing, e.g., send, receive library calls. Debate over this topic (ease of programming, scaling) => many hardware designs map 1:1 to programming models.
12 Data Parallel Model Operations can be performed in parallel on each element of a large regular data structure, such as an array One Control Processor broadcast to many PEs When computers were large, could amortize the control portion of many replicated PEs Condition flag per PE so that PE can be skipped Data distributed in each memory Early 1980s VLSI => SIMD rebirth: 32 1-bit procs + memory on a chip was the PE Data parallel programming languages lay out data to processors
13 Communication Mechanisms Shared memory: Memory load and store Oldest, but most popular model Message passing: Send and receive messages Sending messages that request action or deliver data Remote procedure call (RPC) Request the data and/or operations performed on data Synchronous: Request, wait, and continue Three performance metrics Communication bandwidth Communication latency Communication latency hiding Overlap communication with computation or with other communication
14 Performance Metrics: Latency and Bandwidth Bandwidth Need high bandwidth in communication Match limits in network, memory, and processor Challenge is link speed of network interface vs. bisection bandwidth of network Latency Affects performance, since processor may have to wait Affects ease of programming, since it requires more thought to overlap communication and computation Overhead to communicate is a problem in many machines Latency Hiding How can a mechanism help hide latency? Increases programming system burden Examples: overlap message send with computation, prefetch data, switch to other tasks
15 Shared Memory Model Communicate via load and store. Based on timesharing: processes on multiple processors vs. sharing a single processor. Process: a virtual address space and one or more threads of control. Multiple processes can overlap (share), but ALL threads share a process address space. Threads may be used in a casual way to refer to multiple loci of execution, even when they do not share an address space. Writes to the shared address space by one thread are visible to reads of other threads. Usual model: share code, private stack, some shared heap, some private heap.
16 Message Passing Model Whole computers (CPU, memory, I/O devices) communicate as explicit I/O operations. Send specifies local buffer + receiving process on remote computer. Receive specifies sending process on remote computer + local buffer to place data. Usually send includes a process tag and receive has a rule on the tag (i.e., match 1, match any). Synchronize: When does send complete? When is the buffer free? When is the request accepted? Does receive wait for send? Send+receive => memory-memory copy, where each side supplies a local address, AND does pairwise synchronization!
17 Advantages of Shared-Memory Communication Compatibility Oldest, but the most popular model Ease of programming Communication patterns among processors are complex and dynamic Simplify compilers Use familiar shared-memory model to develop applications Attention only on performance critical accesses Lower overhead for communications and better use of bandwidth when communicating small items Can use hardware-controlled caching to reduce the frequency of remote communication
18 Advantages of Message-Passing Communication Hardware is simpler Communication is explicit It is easier to understand (when, cost, etc.) In shared memory it can be hard to know when communicating and when not, and how costly it is Force programmer to focus on communication, the costly aspects of parallel computing Synchronization is naturally associated with sending messages Reducing the possibility for errors introduced by incorrect synchronization Easier to use sender-initiated communication May have some advantages in performance
19 Communication Options Supporting message passing on top of shared memory: sending a message == copying data; data may be misaligned though. Supporting shared memory on top of hardware for message passing: the cost of communication is too expensive; loads/stores usually move small amounts of data; shared virtual memory may be a possible direction. What's the best choice for massively parallel processors (MPP)? Message passing is clearly simpler; shared memory has been widely supported since 1995.
20 Amdahl's Law and Parallel Computers Amdahl's Law: Speedup = 1 / (FracX / SpeedupX + (1 - FracX)). The sequential portion limits parallel speedup: Speedup <= 1 / (1 - FracX). The large latency of remote access is another major challenge. Example: What fraction sequential to get 80x speedup from 100 processors? Assume either 1 processor or all 100 fully used. 80 = 1 / (FracX/100 + (1 - FracX)) => 0.8 FracX + 80 (1 - FracX) = 1 => FracX = (80 - 1) / 79.2 = 0.9975. Only 0.25% sequential!
21 But In The Real World... Wood and Hill [1995] compared the cost-effectiveness of multiprocessor systems. Challenge DM (1 proc, up to 6 GB memory): cost_DM(1, MB) = $38,400 + $100 x MB; cost_DM(1, 1 GB) = $140,800. Challenge XL (up to 32 proc): cost_XL(p, MB) = $81,600 + $20,000 x p + $100 x MB; cost_XL(8, 1 GB) = $344,000 ≈ 2.5x cost_DM(1, 1 GB). So, an XL speedup of 2.5x or more makes the XL more cost-effective!
22 Memory System Coherence Cache coherence is a challenge for hardware. A memory system is coherent if any read of a data item returns the most recently written value of that data item. Coherence determines what value can be returned by a read: it defines the behavior of reads and writes with respect to the same memory location and ensures the memory system is correct. Consistency determines when written values will be returned by reads (all locations): it defines the behavior of accesses with respect to accesses to other memory locations.
23 Multiprocessor Coherence Rules A memory system is coherent if: (1) Processor P writes X ... no other writes to X ... processor P reads X: the value must be the same. (2) Processor Q writes X ... sufficient time ... no other writes to X ... processor P reads X: the value must be the same. (3) Processor P writes X, processor Q writes X: all processors see it as P, Q or Q, P, but the order must be the same for all processors. Writes to the same location are serialized.
24 Potential HW Coherency Solutions Snooping Solution (Snoopy Bus): Send all requests for data to all processors. Processors snoop to see if they have a copy and respond accordingly. Requires broadcast, since caching information is at the processors. Works well with a bus (natural broadcast medium). Dominates for small-scale machines (most of the market). Directory-Based Schemes (discussed later): Keep track of what is being shared in a centralized place (logically). Distributed memory => distributed directory for scalability (avoids bottlenecks). Send point-to-point requests to processors via the network. Scales better than snooping. Actually existed BEFORE snooping-based schemes.
25 Bus Snooping Topology Memory: centralized with uniform access time ( UMA ) and bus interconnect Symmetric Multiprocessor (SMP)
26 Basic Snooping Protocols Write Invalidate Protocol: Multiple readers, single writer Write to shared data: an invalidate is sent to all caches which snoop and invalidate any copies Handling read miss: Write-through: memory is always up-to-date Write-back: snoop in caches to find most recent copy Write Broadcast/Update Protocol (typically write through): Write to shared data: broadcast on bus, processors snoop, and update any copies Handling read miss: memory is always up-to-date Write serialization: bus serializes requests! Bus is single point of arbitration
27 Basic Snooping Protocols Write Invalidate versus Broadcast: Invalidate requires one transaction per write-run. Invalidate uses spatial locality: one invalidate per block; no need to invalidate again when writing other words in the same block. Broadcast has lower latency between write and read: writes update all copies.
28 An Example Snooping Protocol Invalidation protocol, write-back cache Each block of memory is in one state: Clean in all caches and up-to-date in memory (Shared) OR Dirty in exactly one cache (Exclusive) OR Not in any caches (Invalid) Each cache block is in one state (track these): Shared : block can be read OR Exclusive : cache has the only copy, it is writeable, and dirty OR Invalid : block contains no data Read misses: cause all caches to snoop bus Writes to a clean line are treated as misses
29 Snoopy-Cache State Machine-I State machine for CPU requests, for each cache block:
Invalid -> Shared: CPU read; place read miss on bus
Invalid -> Exclusive: CPU write; place write miss on bus
Shared -> Shared: CPU read hit; or CPU read miss: place read miss on bus
Shared -> Exclusive: CPU write; place write miss on bus
Exclusive -> Exclusive: CPU read hit or CPU write hit; or CPU write miss: write back cache block, place write miss on bus
Exclusive -> Shared: CPU read miss; write back block, place read miss on bus
30 Snoopy-Cache State Machine-II State machine for bus requests, for each cache block:
Shared -> Invalid: write miss for this block
Shared -> Shared: read miss for this block
Exclusive -> Invalid: write miss for this block; write back block (abort memory access)
Exclusive -> Shared: read miss for this block; write back block (abort memory access)
31 Example The table on the following slides tracks, for each step, P1's and P2's cache state (State/Addr/Value), the bus transaction (Action/Proc./Addr/Value), and memory contents. Steps: P1: Write 10 to A1; P1: Read A1; P2: Read A1; P2: Write 20 to A1; P2: Write 40 to A2. Assumes the initial cache state is invalid and A1 and A2 map to the same cache block, but A1 != A2.
32 Example: Step 1 P1: Write 10 to A1: P1 places a write miss on the bus (WrMs P1 A1) and becomes Excl. A1 10. (Active arrow: Invalid -> Exclusive on CPU write.)
33 Example: Step 2 P1: Read A1 is a read hit: P1 remains Excl. A1 10; no bus action.
34 Example: Step 3 P2: Read A1 misses: bus RdMs P2 A1; P1 snoops, writes back its dirty block (WrBk P1 A1 10, memory A1 = 10) and drops to Shar. A1 10; the data is returned (RdDa P2 A1 10) and P2 loads Shar. A1 10.
35 Example: Step 4 P2: Write 20 to A1: bus WrMs P2 A1; P1 invalidates its copy (Inv.); P2 becomes Excl. A1 20; memory still holds A1 = 10.
36 Example: Step 5 P2: Write 40 to A2: bus WrMs P2 A2; A2 maps to the same block, so P2 writes back its dirty block (WrBk P2 A1 20, memory A1 = 20) and becomes Excl. A2 40.
37 False Sharing
P1: for (i = 0; ; i++) { ... *a = i; ... }
P2: for (j = 0; ; j++) { ... *(a + 1) = j; ... }
*a and *(a + 1) fall in the same cache block, so the two writers ping-pong the block's state between their caches even though no word is actually shared.
38 False Sharing Quiz x1 and x2 share a cache block, which starts in the shared state. What happens next? Time P1 P2 True or False? 1 Write x1? 2 Read x2 3 Write x1 4 Write x2 5 Read x2
39 False Sharing Quiz Time P1 P2 True or False? 1 Write x1 True 2 Read x2? 3 Write x1 4 Write x2 5 Read x2
40 False Sharing Quiz Time P1 P2 True or False? 1 Write x1 True 2 Read x2 False 3 Write x1? 4 Write x2 5 Read x2
41 False Sharing Quiz Time P1 P2 True or False? 1 Write x1 True 2 Read x2 False 3 Write x1 False 4 Write x2? 5 Read x2
42 False Sharing Quiz Time P1 P2 True or False? 1 Write x1 True 2 Read x2 False 3 Write x1 False 4 Write x2 False 5 Read x2?
43 False Sharing Quiz x1 and x2 share a cache block, which starts in the shared state. What happens next? Time P1 P2 True or False? 1 Write x1 True 2 Read x2 False 3 Write x1 False 4 Write x2 False 5 Read x2 True
44 Larger MPs Separate memory per processor. Local or remote access via memory controller. One cache coherency solution: non-cached pages. Alternative: a directory per cache that tracks the state of every block in every cache: which caches have copies of the block, dirty vs. clean, ... Info per memory block vs. per cache block? PLUS: In memory => simpler protocol (centralized/one location). MINUS: In memory => directory is ƒ(memory size) vs. ƒ(cache size). Prevent the directory becoming a bottleneck? Distribute directory entries with memory, each keeping track of which processors have copies of their blocks.
45 Distributed Directory MPs
46 Directory Protocol States identical to the snooping protocol: Shared: >= 1 processors have data; memory is up-to-date. Invalid: no processor has it; not valid in any cache. Exclusive: 1 processor (owner) has data; memory is out-of-date. In addition to cache state, must track which processors have data when in the shared state (usually a bit vector; 1 if the processor has a copy).
47 Directory-based cache coherence States identical to snooping case; transactions very similar. Three processors typically involved: Local node - where a request originates Home node - where the memory location of an address resides Remote node - has a copy of a cache block, whether exclusive or shared Generates read miss & write miss message to home directory. Write misses that were broadcast on the bus for snooping => explicit invalidate & data fetch requests.
48 CPU-Cache State Machine State machine for CPU requests, for each memory block:
Invalid -> Shared: CPU read; send read miss message to home directory
Invalid -> Exclusive: CPU write; send write miss message to home directory
Shared -> Shared: CPU read hit; or CPU read miss: send read miss message
Shared -> Exclusive: CPU write; send write miss message to home directory
Shared -> Invalid: invalidate message from home directory
Exclusive -> Exclusive: CPU read hit or CPU write hit; or CPU write miss: send data write back message and write miss to home directory
Exclusive -> Shared: fetch: send data write back message to home directory; or CPU read miss: send data write back message and read miss to home directory
Exclusive -> Invalid: fetch/invalidate: send data write back message to home directory
49 Directory State Machine State machine for directory requests, for each memory block:
Invalid (Uncached) -> Shared: read miss; Sharers = {P}; send data value reply
Invalid (Uncached) -> Exclusive: write miss; Sharers = {P}; send data value reply
Shared -> Shared: read miss; Sharers += {P}; send data value reply
Shared -> Exclusive: write miss; send invalidate to Sharers; then Sharers = {P}; send data value reply
Exclusive -> Invalid (Uncached): data write back; Sharers = {} (write back block)
Exclusive -> Shared: read miss; Sharers += {P}; send fetch to remote cache; send data value reply (write back block)
Exclusive -> Exclusive: write miss; Sharers = {P}; send fetch/invalidate to remote cache; send data value reply
50 Example The table on the following slides tracks, for each step, P1's and P2's cache state (State/Addr/Value), the interconnect message (Action/Proc./Addr/Value), and the directory state (Addr/State/{Procs}/Value). Steps: P1: Write 10 to A1; P1: Read A1; P2: Read A1; P2: Write 20 to A1; P2: Write 40 to A2. A1 and A2 map to the same cache block.
51 Example Step 1 P1: Write 10 to A1: P1 sends WrMs P1 A1; directory: A1 Ex {P1}; home replies DaRp P1 A1 0; P1 becomes Excl. A1 10.
52 Example Step 2 P1: Read A1 is a read hit: P1 remains Excl. A1 10; no messages.
53 Example Step 3 P2: Read A1: P2 sends RdMs P2 A1; home sends Ftch P1 A1; P1 writes the data back (memory A1 = 10) and drops to Shar. A1 10; home replies DaRp P2 A1 10; P2 becomes Shar. A1 10; directory: A1 Shar. {P1,P2} 10.
54 Example Step 4 P2: Write 20 to A1: P2 sends WrMs P2 A1; home sends Inval. P1 A1; P1 invalidates (Inv.); P2 becomes Excl. A1 20; directory: A1 Excl. {P2}; memory still holds 10.
55 Example Step 5 P2: Write 40 to A2: P2 sends WrMs P2 A2; directory: A2 Excl. {P2}; P2 writes back its dirty block (WrBk P2 A1 20; directory: A1 Unca. {} 20); home replies DaRp P2 A2 0; P2 becomes Excl. A2 40.
56 Implementation issues with a snoopy protocol Bus request and bus acquisition occur in two steps. Four possible events can occur in between: a write miss on the bus by another processor; a read miss on the bus by another processor; a CPU read miss; a CPU write miss. Interventions: a cache overriding memory.
57 Implementation issues with a directory-based protocol Assumptions about networks: in-order delivery between a pair of nodes; messages can always be injected; network-level deadlock freedom. To ensure system-level deadlock freedom: separate networks for requests and replies; allocate space for replies upon sending out requests; requests can be NACKed, but not replies; retry NACKed requests.
58 Memory Consistency Adve and Gharachorloo, Shared Memory Consistency Models: A Tutorial, WRL Research Report 95/7. Gharachorloo, K., D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, "Memory consistency and event ordering in scalable shared-memory multiprocessors," in Proceedings of the 17th International Symposium on Computer Architecture, 1990.
59 Memory Consistency Models Cache coherence ensures multiple processors see a consistent view of memory, but how consistent? When must a processor see a value written by another processor? In what order must a processor observe the writes of another processor? Example:
P1: A = 0; ... A = 1; L1: if (B == 0) ...
P2: B = 0; ... B = 1; L2: if (A == 0) ...
Impossible for both if statements L1 & L2 to be true? What if write invalidate is delayed & the processor continues? Should this behavior be allowed? Under what conditions? Memory consistency models: what are the rules for such cases?
60 Strict Consistency In the strict model, any read to a memory location X returns the value stored by the most recent write operation to X. Precise serialization of all memory accesses No cache, talking to memory through bus
61 Examples: P1 W(x) 1 P2 R(x) 1 R(x) 1 P1 W(x) 1 P2 R(x) 0 R(x) 1 P1 W(x) 0 P2 R(x) 0 R(x) 1 Time
62 Sequential Consistency The result of any execution is the same as if: the operations of all the processors were executed in some sequential order, and the operations of each individual processor appear in this sequence in the order specified by its program. Intuitive: maps well to the uniprocessor model. Implementations: delay memory access completion until all invalidates are done; or, delay the next memory access until the previous one is done. Severe performance restrictions.
63 Examples: P1 W(x) 1 P2 R(x) 1 R(x) 2 P3 R(x) 1 R(x) 2 P4 W(x) 2 P1 W(x) 1 P2 R(x) 1 R(x) 2 P3 R(x) 2 R(x) 1 P4 W(x) 2
64 Coherency vs consistency P1 W(x) 1 W(y) 2 P2 R(x) 0 R(x) 2 R(x) 1 R(y) 0 R(y) 1 P3 R(y) 0 R(y) 1 R(x) 0 R(x) 1 P4 W(x) 2 W(y) 1
65 How uniprocessor hardware optimizations of the memory system break sequential consistency + solution {no caches} (RAW) Reads bypassing writes: write buffers. Code example: the Flag1/Flag2 example on the following slides. Solution: no buffering of writes; add delays.
66 Animations of reads bypassing writes P1 and P2 share a bus; memory: Flag1:0, Flag2:0. 1. P1: Flag1=1 2. P2: Flag2=1 3. P1: if (Flag2==0) 4. P2: if (Flag1==0)
67 Animations 1. P1: Flag1=1 2. P2: Flag2=1 (both writes sit in the processors' write buffers; memory still holds Flag1:0, Flag2:0) 3. P1: if (Flag2==0) 4. P2: if (Flag1==0)
68 Animations 1. P1: Flag1=1 2. P2: Flag2=1 3. P1: if (Flag2==0) true! (P1's read bypasses its buffered write and sees Flag2:0 in memory) 4. P2: if (Flag1==0)
69 Animations 1. P1: Flag1=1 2. P2: Flag2=1 3. P1: if (Flag2==0) true! 4. P2: if (Flag1==0) true! Both reads bypassed the buffered writes: an outcome no sequentially consistent interleaving allows.
70 How uniprocessor optimizations of the memory system break sequential consistency + solution {no caches} (WAW) Write ordering: overlapping writes in the network; write merging into a cache line. Code example: the Head/Data example on the following slides. Solution: wait until a write operation has reached its memory module before allowing the next write; acknowledgment for writes, more delay.
71 Animations of writes bypassing writes Memory modules: M3: Head: 0, M4: Data: 0. 1. P1: Data=2000 2. P1: Head=1 3. P2: while (Head==0); 4. P2: = Data
72 Animations of writes bypassing writes 1. P1: Data=2000 2. P1: Head=1 (the two writes travel through the network toward M4 and M3 independently) 3. P2: while (Head==0); 4. P2: = Data (M3: Head: 0, M4: Data: 0)
73 Animations of writes bypassing writes 1. P1: Data=2000 2. P1: Head=1 3. P2: while (Head==0); false 4. P2: = Data (M3: Head: 0, M4: Data: 0)
74 Animations of writes bypassing writes 1. P1: Data=2000 2. P1: Head=1 3. P2: while (Head==0); true 4. P2: = Data Data=0! The write Head=1 overtook Data=2000 in the network, so P2 exits the loop and reads the old Data.
75 How uniprocessor optimizations of the memory system break sequential consistency {no caches} (RAR, WAR) Non-blocking reads; speculative execution. Code example: the Head/Data example on the following slides. Solution: disable non-blocking reads; more delay.
76 Animations of reads bypassing reads Memory modules: M3: Head: 0, M4: Data: 0. 1. P1: Data=2000 2. P1: Head=1 3. P2: while (Head==0); 4. P2: = Data
77 Animations of reads bypassing reads 1. P1: Data=2000 2. P1: Head=1 3. P2: while (Head==0); 4. P2: = Data Data=0 (P2's non-blocking read of Data issues early and returns the old value)
78 Animations of reads bypassing reads 3. P2: while (Head==0); ld miss: Head=1 (the read of Head completes later and sees Head=1) 4. P2: = Data Data=0 (but the stale Data=0 was already consumed)
79 Now, let's add caches... More problems (WAW) P1's write to Data reaches memory, but has not yet been propagated to P2 (the cache coherence protocol's message (invalidate or update) has not yet arrived). Code example: the Head/Data example on the following slides. Solution: wait for acknowledgment of the update/invalidate before any write/read; even longer delay than just waiting for the ack from the memory module.
80 Animations of problems with a cache P2 caches Data=0. 1. P1: Data=2000 reaches memory (M3: Head: 0, M4: Data: 2000). 2. P1: Head=1 3. P2: while (Head==0); 4. P2: = Data
81 Animations of problems with a cache 2. P1: Head=1 reaches memory (M3: Head: 1, M4: Data: 2000), but the invalidate/update for P2's cached Data=0 is still in flight.
82 Animations of problems with a cache 3. P2: while (Head==0); false: P2 sees Head=1 in memory and exits the loop, while its cache still holds Data=0 (M3: Head: 1, M4: Data: 2000).
83 Animations of problems with a cache 4. P2: = Data hits in P2's cache and returns the stale value: Data=0!
84 Caches + network... headaches Non-atomic writes: all processors have to see writes to the same location in the same order. Code example: writes to A may appear at P3/P4 in different orders due to different network paths. Solution: wait until all acks to a location are complete; or the network preserves ordering between each node pair.
85 Animations of problems with networks Two writes, A=1 and A=2, are in flight; P3 and P4 observe A over different network paths (M1 holds A; M2 holds B, C).
86 Animations of problems with networks P3 observes A=1 first: Register1 = A = 1.
87 Animations of problems with networks P4 observes A=2 first: Register2 = A = 2. P3 and P4 see the two writes to A in different orders.
88 Summary of sequential consistency Slow, because: in each processor, the previous memory operation in program order must be completed before proceeding with the next memory operation; in cache-coherent systems, writes to the same location must be visible in the same order to all processors, and no subsequent reads until the write is visible to all processors. Mitigating solutions: prefetch cache ownership while waiting in the write buffer; speculate and roll back; compiler memory analysis: reorder at HW/SW only if safe.
89 Relaxed (Weak) Consistency Models Several relaxed models for memory consistency, characterized by their attitude towards RAR, WAR, RAW, and WAW to different addresses. Not really an issue for most programs -- they are synchronized. A program is synchronized if all accesses to shared data are ordered by synchronization operations: acquire (s) {lock} ... normal access ... release (s) {unlock}. Other processes do not need to know intermediate values. Only those programs willing to be nondeterministic are not synchronized, leading to data races.
90 Relaxed (Weak) Consistency Accesses to synchronization variables are sequentially consistent No access to a synchronization variable is allowed to be performed until all previous writes have completed everywhere No data access (read or write) is allowed to be performed until all previous accesses to synchronization variables have been performed P1 W(x) 1 W(x) 2 S P2 R(x) 0 R(x) 2 S R(x) 2 P3 R(x) 1 S R(x) 2
91 Processor consistency Also called PRAM (an acronym for Pipelined Random Access Memory) consistency. Writes done by a single processor are received by all other processors in the order in which they were issued. However, writes from different processors may be seen in a different order by different processors. The basic idea of processor consistency is to better reflect the reality of networks in which the latency between different nodes can be different. Is the following scenario valid under processor consistency? P1 W(x) 1 W(x) 2 P2 R(x) 2 R(x) 1
92 Release consistency
Release consistency considers locks on areas of memory, and propagates only the locked memory as needed
Before an ordinary access to a shared variable is performed, all previous acquires done by the process must have completed successfully
Before a release is allowed to be performed, all previous reads and writes done by the process must have completed
The acquire and release accesses must be sequentially consistent
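The acquire/release ordering rules above map directly onto C11 atomics. A minimal single-variable sketch (the names `data`, `ready`, `producer`, and `consumer` are illustrative, not from the slides):

```c
#include <stdatomic.h>

/* Illustrative publish/consume pattern: an ordinary write made visible
 * by a release store, observed through an acquire load, as release
 * consistency requires. */
int data = 0;
atomic_int ready = 0;

void producer(void) {
    data = 42;  /* ordinary access: must complete before the release */
    atomic_store_explicit(&ready, 1, memory_order_release);
}

int consumer(void) {
    /* The acquire must complete before the ordinary access below. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;  /* spin until the release is visible */
    return data;  /* guaranteed to observe 42 */
}
```

Across threads, the acquire/release pair is exactly what makes the final read of `data` well defined.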
93 Synchronization
Hardware needs to know how the programmer intends different processes to use shared data
Issues for synchronization:
Hardware must support uninterruptible instructions to fetch and update memory (atomic operations)
User-level synchronization operations are built from these primitives
For large-scale MPs, synchronization can be a bottleneck; techniques are needed to reduce the contention and latency of synchronization
94 Uninterruptible Instructions to Fetch and Update Memory
Atomic exchange: interchange a value in a register with a value in memory
0 => synchronization variable is free
1 => synchronization variable is locked and unavailable
Set the register to 1 and swap; the new value in the register determines success in getting the lock:
0 if you succeeded in setting the lock (you were first)
1 if another processor had already claimed access
The key is that the exchange operation is indivisible
Test-and-set: tests a value and sets it if the value passes the test
Fetch-and-increment: returns the value of a memory location and atomically increments it
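Atomic exchange as described above is enough to build a lock. A hedged C11 sketch (the function names `acquire_lock` and `release_lock` are made up for illustration):

```c
#include <stdatomic.h>

/* 0 => synchronization variable is free, 1 => locked and unavailable. */
void acquire_lock(atomic_int *l) {
    /* Set a value of 1 and swap: a returned 0 means we were first. */
    while (atomic_exchange(l, 1) == 1)
        ;  /* 1 => another processor had already claimed access */
}

void release_lock(atomic_int *l) {
    atomic_store(l, 0);  /* mark the synchronization variable free */
}
```

The indivisibility of `atomic_exchange` is what prevents two processors from both seeing 0 and both believing they hold the lock.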
95 Load Linked, Store Conditional
Hard to have a read and a write in one instruction; use two instead
Load linked (or load locked) + store conditional
Load linked returns the initial value
Store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise
Example: atomic swap with LL & SC:
try: mov R3,R4 ; move exchange value
ll R2,0(R1) ; load linked
sc R3,0(R1) ; store conditional
beqz R3,try ; branch if store fails (R3 = 0)
mov R4,R2 ; put loaded value in R4
Example: fetch-and-increment with LL & SC:
try: ll R2,0(R1) ; load linked
addi R2,R2,#1 ; increment (OK if reg-reg)
sc R2,0(R1) ; store conditional
beqz R2,try ; branch if store fails (R2 = 0)
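On hardware without LL/SC, the same retry loop is usually expressed with compare-and-swap. The C11 sketch below mirrors the fetch-and-increment example (a CAS-based analogue, not the actual LL/SC instructions):

```c
#include <stdatomic.h>

/* Fetch-and-increment via a CAS retry loop: a failed compare-exchange
 * plays the role of a failing store conditional. */
int fetch_and_increment(atomic_int *p) {
    int old = atomic_load(p);
    /* On failure, 'old' is refreshed with the current value and we
     * retry, just as beqz branches back to the ll. */
    while (!atomic_compare_exchange_weak(p, &old, old + 1))
        ;
    return old;  /* value before the increment */
}
```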
96 User-Level Synchronization
Spin locks: the processor continuously tries to acquire the lock, spinning around a loop:
li R2,#1
lockit: exch R2,0(R1) ; atomic exchange
bnez R2,lockit ; already locked?
What about an MP with cache coherency?
Want to spin on a cached copy to avoid full memory latency
Likely to get cache hits for such variables
Problem: the exchange includes a write, which invalidates all other copies; this generates considerable bus traffic
Solution: start by simply repeatedly reading the variable; when it is free, then try the exchange ("test and test&set"):
try: li R2,#1
lockit: lw R3,0(R1) ; load variable
bnez R3,lockit ; not free => spin
exch R2,0(R1) ; atomic exchange
bnez R2,try ; already locked?
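The test-and-test&set loop above can be sketched in C11 atomics (the name `lock_tts` is illustrative):

```c
#include <stdatomic.h>

/* Spin on plain loads, which hit in the local cache, and attempt the
 * invalidating exchange only once the lock looks free. */
void lock_tts(atomic_int *l) {
    for (;;) {
        while (atomic_load(l) != 0)
            ;                       /* not free => spin, no bus traffic */
        if (atomic_exchange(l, 1) == 0)
            return;                 /* exchange won: lock acquired */
        /* otherwise another processor got there first: spin again */
    }
}
```

Only the occasional `atomic_exchange` generates an invalidation; the inner read loop runs out of the local cache.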
97 Scalable Synchronization
Exponential backoff: if you fail to get the lock, don't try again for an exponentially increasing amount of time
Queuing lock: wait for the lock by getting on a queue; the lock is passed down the queue; can be implemented in hardware or software
Combining tree: implement barriers as a tree to avoid contention
Powerful atomic primitives such as fetch-and-increment
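Exponential backoff can be layered on the exchange-based lock from the earlier slide; a sketch (the delay cap of 1024 iterations is an arbitrary assumption):

```c
#include <stdatomic.h>

/* After each failed attempt, busy-wait roughly twice as long before
 * retrying, reducing contention on the lock's cache line. */
void lock_backoff(atomic_int *l) {
    unsigned delay = 1;
    while (atomic_exchange(l, 1) != 0) {
        for (volatile unsigned i = 0; i < delay; i++)
            ;                      /* crude delay loop; real code might yield */
        if (delay < 1024)
            delay *= 2;            /* exponentially increasing wait */
    }
}
```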
98 Summary
Caches contain all information on the state of cached memory blocks
Snooping and directory protocols are similar; a bus makes snooping easier because of broadcast (snooping => uniform memory access)
A directory adds an extra data structure to keep track of the state of all memory blocks
Distributing the directory => scalable shared-memory multiprocessor => cache-coherent, non-uniform memory access (cc-NUMA)
Practical implementations of shared-memory multiprocessors are not sequentially consistent, for performance reasons; programmers use synchronization to ensure correctness
99 Case Study: IBM Power4: Cache States
L1 cache: write-through
L2 cache: maintains coherence (invalidation-based), write-back, inclusive of L1
L3 cache: write-back, non-inclusive, maintains coherence (invalidation-based)
Snooping protocol: snoops on all buses
100 IBM Power4
101 Case Study: IBM Power4 Caches
102 Interconnection network of buses
103 IBM Power4's L2 coherence protocol
I (Invalid)
SL (Shared, in one core within the module)
S (Shared, in multiple cores within a module)
M (Modified, and exclusive)
Me (Data is not modified, but exclusive; no write-back necessary)
Mu [Load-linked lwarx, store-conditional stwcx]
T (Tagged: data modified, but not exclusively owned)
Optimization: do not write back modified exclusive data upon a read hit
104 IBM Power4's L2 coherence protocol
105 IBM Power4's L3 coherence protocol
States:
Invalid
Shared: can only give data to its own L2s that it's caching data for
Tagged: modified and shared (local memory)
Tagged remote: modified and shared (remote memory)
O (prefetch): data in L3 = memory, same node, not modified
106 IBM Blue Gene/L memory architecture
107 Node overview
108 Single chip
Multiprocessor Cache Coherency CS448 1 What is Cache Coherence? Two processors can have two different values for the same memory location 2 1 Terminology Coherence Defines what values can be returned by
More informationEN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction)
EN164: Design of Computing Systems Topic 08: Parallel Processor Design (introduction) Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering
More information