Multiprocessors: Basics, Cache Coherence, Synchronization, and Memory Consistency
Z. Jerry Shi, Assistant Professor of Computer Science and Engineering, University of Connecticut
Slides adapted from Blumrich & Gschwind (ELE475 '03) and Peh (ELE475)
Parallel Computers
Definition: "A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast." (Almasi and Gottlieb, Highly Parallel Computing, 1989)
Questions about parallel computers:
- How large a collection?
- How powerful are the processing elements?
- How do they cooperate and communicate?
- How are data transmitted? What type of interconnection?
- What are the HW and SW primitives for the programmer?
- Does it translate into performance?
Parallel Processors "Religion"
The dream of computer architects since the 1950s: replicate processors to add performance, rather than design a faster processor.
This led to innovative organizations tied to particular programming models, justified by claims that uniprocessors can't keep going - e.g., that uniprocessors must stop getting faster due to the speed-of-light limit (predicted in 1972, again in 1989, and now).
It borders on religious fervor: you must believe! The fervor was damped somewhat when companies went out of business in the 1990s: Thinking Machines, Kendall Square, ...
Arguments: the pull of the opportunity of scalable performance; the push of the uniprocessor performance plateau?
What Level of Parallelism?
- Bit-level parallelism: 1970 to ~1985 (4-bit, 8-bit, 16-bit, 32-bit microprocessors)
- Instruction-level parallelism (ILP): ~1985 through today (pipelining, superscalar, VLIW, out-of-order execution). Limits to the benefits of ILP?
- Process-level or thread-level parallelism: mainstream for general-purpose computing? Servers are parallel; many PCs too!
Why Multiprocessors?
- Microprocessors are the fastest CPUs, so leverage them: collecting several is much easier than redesigning one.
- Complexity of current microprocessors: do we have enough ideas to sustain 1.5x/yr? Can we deliver such complexity on schedule?
- Slow (but steady) improvement in parallel software (scientific apps, databases, OS).
- Emergence of embedded and server markets driving microprocessors in addition to desktops: embedded functional parallelism (producer/consumer model); the server figure of merit is tasks per hour rather than latency.
- Long-term goal: scale the number of processors to the size of the budget and the desired performance.
Parallel Architecture
Parallel architecture extends traditional computer architecture with a communication architecture:
- Abstractions (HW/SW interface)
- Organizational structure to realize the abstraction efficiently
Popular Flynn Categories (1966)
- SISD (Single Instruction stream, Single Data stream): uniprocessors.
- SIMD (Single Instruction stream, Multiple Data streams): a single instruction memory and control processor (e.g., Illiac-IV, CM-2, vector machines). Simple programming model, low overhead; requires lots of data parallelism; all custom integrated circuits.
- MISD (Multiple Instruction streams, Single Data stream): no commercial multiprocessor of this type has been built.
- MIMD (Multiple Instruction streams, Multiple Data streams): examples include IBM SP, Cray T3D, SGI Origin, your PC? Flexible; uses off-the-shelf microprocessors.
MIMD is the current winner: the major design emphasis is on <= 128-processor MIMD machines.
MIMD: Centralized Shared Memory
UMA: Uniform Memory Access (latency)
Symmetric (shared-memory) Multiprocessor (SMP)
MIMD: Distributed Memory
Advantages: more memory bandwidth (low cost), lower memory latency (local).
Drawbacks: longer communication latency, more complex software model.
Distributed Memory Versions Distributed Shared Memory (DSM) Single physical address space with load/store access Address space is shared among nodes Also called Non-Uniform Memory Access (NUMA) Access time depends on the location of the data word in memory Message-Passing Multicomputer Separate address space per processor The same address on different nodes refers to different words Could be completely separated computers connected on LAN Also called clusters Tightly-coupled cluster
Layers of Parallel Framework
Programming models:
- Multiprogramming: lots of jobs, no communication
- Shared memory: communicate via memory
- Message passing: send and receive messages
- Data parallel: several agents operate on several data sets simultaneously and then exchange information globally and simultaneously (built on shared memory or message passing)
Communication abstractions:
- Shared address space, e.g., load, store, atomic swap
- Message passing, e.g., send and receive library calls
The debate over this topic (ease of programming, scaling) led to many hardware designs tied 1:1 to a programming model.
Data Parallel Model Operations can be performed in parallel on each element of a large regular data structure, such as an array One Control Processor broadcast to many PEs When computers were large, could amortize the control portion of many replicated PEs Condition flag per PE so that PE can be skipped Data distributed in each memory Early 1980s VLSI => SIMD rebirth: 32 1-bit procs + memory on a chip was the PE Data parallel programming languages lay out data to processors
Communication Mechanisms Shared memory: Memory load and store Oldest, but most popular model Message passing: Send and receive messages Sending messages that request action or deliver data Remote procedure call (RPC) Request the data and/or operations performed on data Synchronous: Request, wait, and continue Three performance metrics Communication bandwidth Communication latency Communication latency hiding Overlap communication with computation or with other communication
Performance Metrics: Latency and Bandwidth Bandwidth Need high bandwidth in communication Match limits in network, memory, and processor Challenge is link speed of network interface vs. bisection bandwidth of network Latency Affects performance, since processor may have to wait Affects ease of programming, since it requires more thought to overlap communication and computation Overhead to communicate is a problem in many machines Latency Hiding How can a mechanism help hide latency? Increases programming system burden Examples: overlap message send with computation, prefetch data, switch to other tasks
Shared Memory Model
Communicate via load and store. The oldest model, based on timesharing: processes on multiple processors vs. sharing a single processor.
A process is a virtual address space with one or more threads of control. Multiple processes can overlap (share) pages, but ALL threads within a process share its address space.
"Threads" may also be used casually to refer to multiple loci of execution, even when they do not share an address space.
Writes to the shared address space by one thread are visible to reads by other threads.
Usual model: shared code, private stack, some shared heap, some private heap.
Message Passing Model
Whole computers (CPU, memory, I/O devices) communicate via explicit I/O operations.
- Send specifies a local buffer + the receiving process on a remote computer.
- Receive specifies the sending process on a remote computer + a local buffer to place the data.
- Usually send includes a process tag and receive has a rule on tags (e.g., match one, match any).
Synchronization questions: when does a send complete? When is the buffer free? When is the request accepted? Does receive wait for send?
Send + receive => a memory-to-memory copy, where each side supplies a local address, AND pairwise synchronization!
Advantages of Shared-Memory Communication Compatibility Oldest, but the most popular model Ease of programming Communication patterns among processors are complex and dynamic Simplify compilers Use familiar shared-memory model to develop applications Attention only on performance critical accesses Lower overhead for communications and better use of bandwidth when communicating small items Can use hardware-controlled caching to reduce the frequency of remote communication
Advantages of Message-Passing Communication Hardware is simpler Communication is explicit It is easier to understand (when, cost, etc.) In shared memory it can be hard to know when communicating and when not, and how costly it is Force programmer to focus on communication, the costly aspects of parallel computing Synchronization is naturally associated with sending messages Reducing the possibility for errors introduced by incorrect synchronization Easier to use sender-initiated communication May have some advantages in performance
Communication Options
Supporting message passing on top of shared memory: sending a message == copying data (though the data may be misaligned).
Supporting shared memory on top of hardware for message passing: the cost of communication is too high, since loads/stores usually move small amounts of data; shared virtual memory may be a possible direction.
What's the best choice for massively parallel processors (MPPs)? Message passing is clearly simpler, but shared memory has been widely supported since 1995.
Amdahl's Law and Parallel Computers
Amdahl's Law: Speedup = 1 / (FracX / SpeedupX + (1 - FracX))
The sequential portion limits parallel speedup: Speedup <= 1 / (1 - FracX)
The large latency of remote access is another major challenge.
Example: What fraction of the program must be parallel to get 80x speedup from 100 processors? Assume either 1 processor is used or all 100 are fully used.
80 = 1 / (FracX/100 + (1 - FracX))
0.8 FracX + 80 (1 - FracX) = 1
80 - 79.2 FracX = 1
FracX = (80 - 1) / 79.2 = 0.9975
Only 0.25% of the program may be sequential!
But In The Real World...
Wood and Hill [1995] compared the cost-effectiveness of multiprocessor systems.
Challenge DM (1 processor, up to 6 GB memory): cost_DM(1, MB) = $38,400 + $100 x MB; cost_DM(1, 1 GB) = $140,800.
Challenge XL (up to 32 processors): cost_XL(p, MB) = $81,600 + $20,000 x p + $100 x MB; cost_XL(8, 1 GB) = $344,000 = 2.5x cost_DM(1, 1 GB).
So an XL speedup of 2.5x or more makes the XL more cost-effective!
Memory System Coherence
Cache coherence is a challenge for hardware. A memory system is coherent if any read of a data item returns the most recently written value of that data item.
Coherence determines what value can be returned by a read: it defines the behavior of reads and writes with respect to the same memory location, and ensures the memory system is correct.
Consistency determines when written values will be returned by reads (across all locations): it defines the behavior of accesses with respect to accesses to other memory locations.
Multiprocessor Coherence Rules
A memory system is coherent if:
1. Processor P writes X, then (with no other writes to X in between) P reads X: the value read must be the value P wrote.
2. Processor Q writes X, then after sufficient time (and no other writes to X), processor P reads X: the value read must be the value Q wrote.
3. Processor P writes X and processor Q writes X: all processors see the writes as P, Q or as Q, P, but the order must be the same for all processors. Writes to the same location are serialized.
Potential HW Coherency Solutions
Snooping solution (snoopy bus): send all requests for data to all processors; processors snoop to see if they have a copy and respond accordingly. Requires broadcast, since caching information resides at the processors. Works well with a bus (a natural broadcast medium). Dominates for small-scale machines (most of the market).
Directory-based schemes (discussed later): keep track of what is being shared in one (logically) centralized place. Distributed memory => distribute the directory for scalability (avoids bottlenecks). Send point-to-point requests to processors via the network. Scales better than snooping. Actually existed BEFORE snooping-based schemes.
Bus Snooping Topology Memory: centralized with uniform access time ( UMA ) and bus interconnect Symmetric Multiprocessor (SMP)
Basic Snooping Protocols Write Invalidate Protocol: Multiple readers, single writer Write to shared data: an invalidate is sent to all caches which snoop and invalidate any copies Handling read miss: Write-through: memory is always up-to-date Write-back: snoop in caches to find most recent copy Write Broadcast/Update Protocol (typically write through): Write to shared data: broadcast on bus, processors snoop, and update any copies Handling read miss: memory is always up-to-date Write serialization: bus serializes requests! Bus is single point of arbitration
Basic Snooping Protocols
Write invalidate versus broadcast:
- Invalidate requires one transaction per write run.
- Invalidate exploits spatial locality: a block is invalidated once, so there is no need to invalidate again when writing other words in the same block.
- Broadcast has lower latency between write and read, since writes update all copies.
An Example Snooping Protocol Invalidation protocol, write-back cache Each block of memory is in one state: Clean in all caches and up-to-date in memory (Shared) OR Dirty in exactly one cache (Exclusive) OR Not in any caches (Invalid) Each cache block is in one state (track these): Shared : block can be read OR Exclusive : cache has the only copy, it is writeable, and dirty OR Invalid : block contains no data Read misses: cause all caches to snoop bus Writes to a clean line are treated as misses
Snoopy-Cache State Machine I
State machine for CPU requests, for each cache block:
- Invalid, CPU read: place read miss on bus; go to Shared.
- Invalid, CPU write: place write miss on bus; go to Exclusive.
- Shared, CPU read hit: stay in Shared.
- Shared, CPU read miss: place read miss on bus; stay in Shared.
- Shared, CPU write: place write miss on bus; go to Exclusive.
- Exclusive, CPU read hit or CPU write hit: stay in Exclusive.
- Exclusive, CPU read miss: write back the block, place read miss on bus; go to Shared.
- Exclusive, CPU write miss: write back the cache block, place write miss on bus; stay in Exclusive.
Snoopy-Cache State Machine II
State machine for bus requests, for each cache block:
- Shared, write miss for this block on the bus: go to Invalid.
- Exclusive, write miss for this block: write back the block (abort the memory access); go to Invalid.
- Exclusive, read miss for this block: write back the block (abort the memory access); go to Shared.
Example
P1: Write 10 to A1; P1: Read A1; P2: Read A1; P2: Write 20 to A1; P2: Write 40 to A2.
Assumes the initial cache state is Invalid, and that A1 and A2 map to the same cache block, but A1 != A2.
Each step below tracks the P1 and P2 cache states (State/Addr/Value), the bus transaction (Action/Proc./Addr/Value), and memory.
Example: Step 1
P1: Write 10 to A1 -> P1 misses and places a write miss on the bus (WrMs, P1, A1); P1's block becomes Excl. A1 = 10. Memory is unchanged.
Example: Step 2
P1: Read A1 -> read hit; P1 stays Excl. A1 = 10. No bus transaction.
Example: Step 3
P2: Read A1 -> P2 places a read miss on the bus (RdMs, P2, A1). P1 snoops it, writes the block back (WrBk, P1, A1, 10), and goes to Shar. A1 = 10; memory now holds A1 = 10. P2 receives the data (RdDa, P2, A1, 10) and becomes Shar. A1 = 10.
Example: Step 4
P2: Write 20 to A1 -> P2 places a write miss on the bus (WrMs, P2, A1). P1 invalidates its copy (Inv.); P2 becomes Excl. A1 = 20. Memory still holds A1 = 10.
Example: Step 5
P2: Write 40 to A2 -> P2 places a write miss on the bus (WrMs, P2, A2). Since A2 maps to the same block as A1, P2 first writes back A1 (WrBk, P2, A1, 20); memory now holds A1 = 20. P2 becomes Excl. A2 = 40.
False Sharing
P1:                        P2:
for (i = 0; ; i++) {       for (j = 0; ; j++) {
    ...                        ...
    *a = i;                    *(a + 1) = j;
    ...                        ...
}                          }
P1 and P2 write different words that fall in the same cache block, so the block ping-pongs between the two caches even though the processors never actually share a datum.
False Sharing Quiz
x1 and x2 share a cache block, which starts in the shared state in both caches. For each access below, is the resulting miss a true sharing miss or a false sharing miss?
Time 1: P1 writes x1
Time 2: P2 reads x2
Time 3: P1 writes x1
Time 4: P2 writes x2
Time 5: P1 reads x2
False Sharing Quiz (answers)
x1 and x2 share a cache block, which starts in the shared state.
Time 1, P1 writes x1: True - a true sharing miss, since x1 must be invalidated in P2's copy of the block.
Time 2, P2 reads x2: False - x2 was invalidated only because P1 wrote x1 in the same block; no value of x1 is used by P2.
Time 3, P1 writes x1: False - the block became shared only because P2 read x2; nothing P2 wrote is communicated.
Time 4, P2 writes x2: False - same reasoning.
Time 5, P1 reads x2: True - P1 reads the value of x2 that P2 actually wrote.
Larger MPs
Separate memory per processor; local or remote access via the memory controller.
One cache coherency solution: non-cached pages.
Alternative: a directory per cache that tracks the state of every block in every cache - which caches have copies of the block, dirty vs. clean, etc.
Info per memory block vs. per cache block?
PLUS: in memory => simpler protocol (centralized / one location).
MINUS: in memory => directory size is f(memory size) vs. f(cache size).
To prevent the directory from becoming a bottleneck: distribute directory entries with the memory, each keeping track of which processors have copies of its blocks.
Distributed Directory MPs
Directory Protocol
States identical to the snooping protocol:
- Shared: >= 1 processors have the data; memory is up-to-date.
- Invalid: no processor has it; not valid in any cache.
- Exclusive: exactly 1 processor (the owner) has the data; memory is out-of-date.
In addition to the cache state, the directory must track which processors have the data when it is in the shared state (usually a bit vector: bit i is 1 if processor i has a copy).
Directory-based cache coherence
States are identical to the snooping case, and the transactions are very similar. Three node roles are typically involved:
- Local node: where the request originates.
- Home node: where the memory location of the address resides.
- Remote node: holds a copy of the cache block, whether exclusive or shared.
The local node generates read-miss and write-miss messages to the home directory. Write misses that were broadcast on the bus under snooping become explicit invalidate and data-fetch requests.
CPU-Cache State Machine
State machine for CPU requests, for each memory block (directory protocol):
- Invalid, CPU read: send a read-miss message to the home directory; go to Shared (read only).
- Invalid, CPU write: send a write-miss message to the home directory; go to Exclusive (read/write).
- Shared, CPU read hit: stay in Shared.
- Shared, CPU read miss: send a read-miss message to the home directory.
- Shared, CPU write: send a write-miss message to the home directory; go to Exclusive.
- Shared, Invalidate from home: go to Invalid.
- Exclusive, CPU read hit or write hit: stay in Exclusive.
- Exclusive, Fetch from home: send a Data Write Back message to the home directory; go to Shared.
- Exclusive, Fetch/Invalidate from home: send a Data Write Back message to the home directory; go to Invalid.
- Exclusive, CPU read miss: send a Data Write Back message and a read miss to the home directory; go to Shared.
- Exclusive, CPU write miss: send a Data Write Back message and a write miss to the home directory.
Directory State Machine
State machine for directory requests, for each memory block:
- Uncached, read miss: Sharers = {P}; send Data Value Reply; go to Shared (read only).
- Uncached, write miss: Sharers = {P}; send Data Value Reply; go to Exclusive (read/write).
- Shared, read miss: Sharers += {P}; send Data Value Reply.
- Shared, write miss: send Invalidate to Sharers; then Sharers = {P}; send Data Value Reply; go to Exclusive.
- Exclusive, read miss: send Fetch to the owner (owner writes back the block); send Data Value Reply to the requester; Sharers += {P}; go to Shared.
- Exclusive, write miss: send Fetch/Invalidate to the owner (owner writes back the block); Sharers = {P}; send Data Value Reply; stay Exclusive.
- Exclusive, Data Write Back: Sharers = {}; go to Uncached.
Example (directory protocol)
P1: Write 10 to A1; P1: Read A1; P2: Read A1; P2: Write 20 to A1; P2: Write 40 to A2.
A1 and A2 map to the same cache block.
Each step below tracks the P1 and P2 cache states (State/Addr/Value), the interconnect messages (Action/Proc./Addr/Value), and the directory and memory state (Addr/State/{Procs}/Value).
Example: Step 1
P1: Write 10 to A1 -> P1 sends a write miss to the home directory (WrMs, P1, A1). The directory marks A1 Excl. {P1} and replies with the data (DaRp, P1, A1, 0). P1's block becomes Excl. A1 = 10.
Example: Step 2
P1: Read A1 -> read hit; P1 stays Excl. A1 = 10. No messages.
Example: Step 3
P2: Read A1 -> P2 sends a read miss (RdMs, P2, A1). The directory sends Fetch to P1 (Ftch, P1, A1); P1 writes the data back and goes to Shar. A1 = 10; memory now holds A1 = 10. The directory replies to P2 (DaRp, P2, A1, 10) and marks A1 Shar. {P1, P2}. P2 becomes Shar. A1 = 10.
Example: Step 4
P2: Write 20 to A1 -> P2 sends a write miss (WrMs, P2, A1). The directory sends Invalidate to P1 (Inval., P1, A1); P1 goes to Inv. The directory marks A1 Excl. {P2}; P2 becomes Excl. A1 = 20. Memory still holds A1 = 10.
Example: Step 5
P2: Write 40 to A2 -> P2 sends a write miss (WrMs, P2, A2); the directory marks A2 Excl. {P2} and replies (DaRp, P2, A2, 0). Since A2 maps to the same block as A1, P2 first writes back A1 (WrBk, P2, A1, 20); the directory marks A1 Unca. {} and memory now holds A1 = 20. P2 becomes Excl. A2 = 40.
Implementation issues with a snoopy protocol
Bus request and bus acquisition occur in two steps. Four events can occur in between:
- A write miss on the bus by another processor
- A read miss on the bus by another processor
- A CPU read miss
- A CPU write miss
Interventions: the cache overrides memory in supplying the data.
Implementation issues with a directory-based protocol
Assumptions about the network: in-order delivery between a pair of nodes; messages can always be injected (network-level deadlock freedom).
To ensure system-level deadlock freedom:
- Separate networks for requests and replies
- Allocate space for replies upon sending out requests
- Requests can be NACKed, but replies cannot
- Retry NACKed requests
Memory Consistency
Adve and Gharachorloo, "Shared Memory Consistency Models: A Tutorial," WRL Research Report 95/7.
Gharachorloo, K., D. Lenoski, J. Laudon, P. Gibbons, A. Gupta, and J. Hennessy, "Memory consistency and event ordering in scalable shared-memory multiprocessors," in Proceedings of the 17th International Symposium on Computer Architecture, pp. 15-26, 1990.
Memory Consistency Models
Cache coherence ensures that multiple processors see a consistent view of memory, but how consistent must that view be? When must a processor see a value written by another processor? In what order must a processor observe the writes of another processor?
Example:
P1: A = 0;              P2: B = 0;
    ...                     ...
    A = 1;                  B = 1;
L1: if (B == 0) ...     L2: if (A == 0) ...
Is it impossible for both if statements L1 and L2 to be true? What if a write invalidate is delayed and the processor continues? Should this behavior be allowed? Under what conditions?
Memory consistency models define the rules for such cases.
Strict Consistency In the strict model, any read to a memory location X returns the value stored by the most recent write operation to X. Precise serialization of all memory accesses No cache, talking to memory through bus
Examples (time flows to the right):
(a) P1: W(x)1        P2: R(x)1  R(x)1
(b) P1: W(x)1        P2: R(x)0  R(x)1
(c) P1: W(x)0        P2: R(x)0  R(x)1
Under strict consistency, any read that follows a write to x must return the newly written value, so a timeline like (b), where R(x)0 occurs after W(x)1, is not strictly consistent.
Sequential Consistency
The result of any execution is the same as if:
- The operations of all the processors were executed in some sequential order, and
- The operations of each individual processor appear in this sequence in the order specified by its program.
Intuitive: maps well to the uniprocessor model.
Implementations: delay memory access completion until all invalidates are done, or delay the next memory access until the previous one is done. Severe performance restrictions.
Examples:
(a) P1: W(x)1; P4: W(x)2; P2: R(x)1 R(x)2; P3: R(x)1 R(x)2
    All processors observe the two writes in the same order (1 then 2): sequentially consistent.
(b) P1: W(x)1; P4: W(x)2; P2: R(x)1 R(x)2; P3: R(x)2 R(x)1
    P2 and P3 disagree on the order of the two writes: not sequentially consistent.
Coherency vs. consistency
P1: W(x)1  W(y)2
P2: R(x)0  R(x)2  R(x)1  R(y)0  R(y)1
P3: R(y)0  R(y)1  R(x)0  R(x)1
P4: W(x)2  W(y)1
Taken per location, the reads are coherent: everyone who sees both writes to x sees W(x)2 before W(x)1, and similarly for y. But the per-location orders imply conflicting global orders (for x, P4's write precedes P1's; for y, P1's precedes P4's), so no single interleaving explains all the reads: coherent, but not sequentially consistent.
How uniprocessor hardware optimizations of the memory system break sequential consistency, and the fixes (no caches yet)
(RAW) Reads bypassing writes: write buffers.
Code example: the Flag1/Flag2 sequence animated on the next slides.
Solution: no buffering of writes, or added delays.
Animation: reads bypassing writes
Initially Flag1 = 0 and Flag2 = 0 in memory.
1. P1: Flag1 = 1 - the write sits in P1's write buffer.
2. P2: Flag2 = 1 - the write sits in P2's write buffer.
3. P1: if (Flag2 == 0) - P1's read bypasses its own buffered write and sees Flag2 = 0 in memory: true!
4. P2: if (Flag1 == 0) - P2's read likewise sees Flag1 = 0: true!
Both conditions are true, an outcome that no sequentially consistent interleaving allows.
How uniprocessor optimizations of the memory system break sequential consistency, continued (no caches)
(WAW) Write ordering: overlapping writes in the network; write merging into a cache line.
Code example: the Data/Head sequence animated on the next slides.
Solution: wait until a write has reached its memory module before allowing the next write - an acknowledgment for writes, i.e., more delay.
Animation: writes bypassing writes
Initially Head = 0 (at memory module M3) and Data = 0 (at M4).
1. P1: Data = 2000 - the write heads for M4 but is delayed in the network.
2. P1: Head = 1 - this write takes a shorter path and reaches M3 first.
3. P2: while (Head == 0); - P2 reads Head = 1 and exits the loop.
4. P2: ... = Data - P2 reads Data = 0 from M4; the new value has not arrived yet. Data = 0!
How uniprocessor optimizations of the memory system break sequential consistency, continued (no caches)
(RAR, WAR) Non-blocking reads; speculative execution.
Code example: the Data/Head sequence animated on the next slides.
Solution: disable non-blocking reads - more delay.
Animation: reads bypassing reads
Initially Head = 0 (at M3) and Data = 0 (at M4). Program order on P2 is "read Head, then read Data," but non-blocking or speculative reads let the read of Data issue early.
1. P1: Data = 2000 - in flight toward M4.
2. P2's read of Data issues early and returns Data = 0.
3. P2: while (Head == 0); - the load of Head later returns 1 and the loop exits.
4. P2: ... = Data - uses the already-fetched stale value. Data = 0!
Now, let's add caches: more problems.
(WAW) P1's write to Data reaches memory, but has not yet been propagated to P2 (the cache coherence protocol's message - invalidate or update - has not yet arrived).
Code example: the Data/Head sequence animated on the next slides.
Solution: wait for the acknowledgment of the update/invalidate before any subsequent write or read - an even longer delay than just waiting for the ack from the memory module.
Animation: problems with a cache
P2 has Data = 0 cached. P1's write Data = 2000 has already reached memory (M4), but the invalidate/update for P2's cached copy is still in flight.
1. P1: Data = 2000 - in memory, but P2's cached copy is stale.
2. P1: Head = 1 - reaches memory (M3).
3. P2: while (Head == 0); - reads Head = 1 and exits the loop.
4. P2: ... = Data - hits on the stale cached copy. Data = 0!
Caches + network: more headaches
Non-atomic writes: all processors have to see writes to the same location in the same order.
Code example: writes to A may appear at P3/P4 in different orders due to different network paths.
Solution: wait until all acks for a location are complete, or have the network preserve ordering between each node pair.
Animation: problems with networks
P1 writes A = 1 and P2 writes A = 2; A is homed at memory module M1, and B, C at M2. The two writes take different network paths to P3 and P4.
P3 ends up with Register1 = A = 1, while P4 ends up with Register2 = A = 2: the two processors observed the writes to A in different orders, violating write atomicity.
Summary of sequential consistency
Slow, because:
- In each processor, the previous memory operation in program order must complete before the next memory operation proceeds.
- In cache-coherent systems, writes to the same location must be visible in the same order to all processors, and no subsequent reads may proceed until the write is visible to all processors.
Mitigating solutions: prefetch cache ownership while waiting in the write buffer; speculate and roll back; compiler memory analysis - reorder in hardware/software only when provably safe.
Relaxed (Weak) Consistency Models
Several relaxed models for memory consistency exist, characterized by which orderings among accesses to different addresses (RAR, WAR, RAW, WAW) they relax.
Relaxation is not really an issue for most programs, because they are synchronized. A program is synchronized if all accesses to shared data are ordered by synchronization operations:
acquire(s) {lock}
... normal accesses ...
release(s) {unlock}
Other processes do not need to see intermediate values. Only programs willing to be nondeterministic are unsynchronized, leading to data races.
Relaxed (Weak) Consistency
- Accesses to synchronization variables are sequentially consistent.
- No access to a synchronization variable is allowed to be performed until all previous writes have completed everywhere.
- No data access (read or write) is allowed to be performed until all previous accesses to synchronization variables have been performed.
Example timeline:
P1: W(x)1  W(x)2  S
P2: R(x)0  R(x)2  S  R(x)2
P3: R(x)1  S  R(x)2
Before synchronizing, P2 and P3 may see stale or differing values of x; after the synchronization (S), every processor reads the final value x = 2.
Processor consistency
Also called PRAM (Pipelined Random Access Memory) consistency. Writes done by a single processor are received by all other processors in the order in which they were issued; however, writes from different processors may be seen in different orders by different processors.
The basic idea of processor consistency is to better reflect the reality of networks, in which the latency between different nodes can differ.
Is the following scenario valid under processor consistency?
P1: W(x)1  W(x)2
P2: R(x)2  R(x)1
No: both writes were issued by P1, so P2 must observe them in issue order (1 then 2).
Release consistency
Release consistency considers locks on areas of memory and propagates only the locked memory as needed
Before an ordinary access to a shared variable is performed, all previous acquires done by the process must have completed successfully
Before a release is allowed to be performed, all previous reads and writes done by the process must have completed
The acquire and release accesses must be sequentially consistent
Synchronization
Hardware needs to know how the programmer intends different processes to use shared data
Issues for synchronization:
- Hardware must support uninterruptible instructions that fetch and update memory (atomic operations)
- User-level synchronization operations are built from these primitives
- For large-scale MPs, synchronization can be a bottleneck; techniques are needed to reduce the contention and latency of synchronization
Uninterruptible Instructions to Fetch and Update Memory
Atomic exchange: interchange a value in a register with a value in memory
- 0 => synchronization variable is free
- 1 => synchronization variable is locked and unavailable
Set the register to 1 and swap; the new value in the register determines success in getting the lock:
- 0 if you succeeded in setting the lock (you were first)
- 1 if another processor had already claimed access
Key: the exchange operation is indivisible
Test-and-set: tests a value and sets it if the value passes the test
Fetch-and-increment: returns the value of a memory location and atomically increments it
Load Linked, Store Conditional
Hard to have a read and a write in one instruction; use two instead
Load linked (or load locked) + store conditional
- Load linked returns the initial value
- Store conditional returns 1 if it succeeds (no other store to the same memory location since the preceding load) and 0 otherwise
Example: atomic swap with LL & SC:
  try:  mov  R3,R4     ; move exchange value
        ll   R2,0(R1)  ; load linked
        sc   R3,0(R1)  ; store conditional
        beqz R3,try    ; branch if store fails (R3 = 0)
        mov  R4,R2     ; put loaded value in R4
Example: fetch & increment with LL & SC:
  try:  ll   R2,0(R1)  ; load linked
        addi R2,R2,#1  ; increment (OK if reg-reg)
        sc   R2,0(R1)  ; store conditional
        beqz R2,try    ; branch if store fails (R2 = 0)
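On machines that expose compare-and-swap instead of LL/SC, the same retry loop can be written with C11 `atomic_compare_exchange_weak`. This sketch (the name `fetch_inc` is made up) mirrors the fetch-and-increment loop above: read the value, try to publish the incremented value, and retry if another store intervened, just like a failed `sc`:

```c
#include <stdatomic.h>

/* Fetch-and-increment built from compare-and-swap, mirroring
   the LL/SC retry loop: load, attempt to store, retry on failure. */
static int fetch_inc(atomic_int *p) {
    int old = atomic_load(p);                  /* like ll: read current value */
    /* on failure, compare_exchange_weak reloads *p into old for us */
    while (!atomic_compare_exchange_weak(p, &old, old + 1))
        ;  /* another store intervened: retry, like a failed sc */
    return old;                                /* value before the increment */
}
```

`_weak` is used because, like `sc`, it may fail spuriously; the surrounding loop absorbs such failures.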
User-Level Synchronization
Spin locks: the processor continuously tries to acquire the lock, spinning around a loop:
  li      R2,#1
  lockit: exch R2,0(R1)  ; atomic exchange
          bnez R2,lockit ; already locked?
What about an MP with cache coherency?
- Want to spin on a cached copy to avoid full memory latency
- Likely to get cache hits for such variables
Problem: the exchange includes a write, which invalidates all other copies; this generates considerable bus traffic
Solution: start by simply repeatedly reading the variable; when it is free, then try the exchange ("test and test&set"):
  try:    li   R2,#1
  lockit: lw   R3,0(R1)  ; load var
          bnez R3,lockit ; not free => spin
          exch R2,0(R1)  ; atomic exchange
          bnez R2,try    ; already locked?
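The test-and-test&set idea translates into C11 atomics as follows (a sketch; `ttas_lock`/`ttas_unlock` are illustrative names). The inner loop spins on plain reads, which hit in the local cache while the lock is held; the invalidating exchange is attempted only once the lock has been observed free:

```c
#include <stdatomic.h>

static void ttas_lock(atomic_int *lock) {
    for (;;) {
        /* 1) test: spin on plain reads -- cache hits, no bus traffic */
        while (atomic_load_explicit(lock, memory_order_relaxed) != 0)
            ;
        /* 2) test&set: lock looked free, now try the invalidating exchange */
        if (atomic_exchange_explicit(lock, 1, memory_order_acquire) == 0)
            return;   /* old value was 0: we own the lock */
        /* old value was 1: someone beat us to it, go back to spinning */
    }
}

static void ttas_unlock(atomic_int *lock) {
    atomic_store_explicit(lock, 0, memory_order_release);
}
```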
Scalable Synchronization
Exponential backoff: if you fail to get the lock, don't try again for an exponentially increasing amount of time
Queuing lock: wait for the lock by getting on a queue; the lock is passed down the queue; can be implemented in hardware or software
Combining tree: implement barriers as a tree to avoid contention
Powerful atomic primitives such as fetch-and-increment
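Exponential backoff is easy to add to a spin lock. A minimal sketch (names `backoff_lock`/`backoff_unlock` and the cap of 1024 iterations are arbitrary choices for illustration): after each failed attempt, the waiting time doubles, so under heavy contention most processors are waiting rather than hammering the bus:

```c
#include <stdatomic.h>

static void backoff_lock(atomic_int *lock) {
    unsigned delay = 1;
    while (atomic_exchange_explicit(lock, 1, memory_order_acquire) != 0) {
        /* failed: busy-wait `delay` iterations before retrying
           (volatile keeps the compiler from deleting the loop) */
        for (volatile unsigned i = 0; i < delay; i++)
            ;
        if (delay < 1024)
            delay <<= 1;   /* double the wait, capped to bound the delay */
    }
}

static void backoff_unlock(atomic_int *lock) {
    atomic_store_explicit(lock, 0, memory_order_release);
}
```

The cap matters: without it, an unlucky processor could back off so far that it starves even after the lock becomes free.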
Summary
Caches contain all information on the state of cached memory blocks
Snooping and directory protocols are similar; a bus makes snooping easier because of broadcast (snooping => uniform memory access)
A directory adds an extra data structure to keep track of the state of all memory blocks
Distributing the directory => scalable shared-memory multiprocessor => cache-coherent, non-uniform memory access (CC-NUMA)
Practical implementations of shared-memory multiprocessors are not sequentially consistent, for performance reasons; programmers use synchronization to ensure correctness
Case Study: IBM Power4: Cache States
L1 cache: write-through
L2 cache: maintains coherence (invalidation-based), write-back, inclusive of L1
L3 cache: write-back, non-inclusive, maintains coherence (invalidation-based)
Snooping protocol: snoops on all buses
IBM Power4
Case Study: IBM Power4 Caches
Interconnection network of buses
IBM Power4's L2 coherence protocol
- I (Invalid)
- SL (Shared, in one core within the module)
- S (Shared, in multiple cores within a module)
- M (Modified, and exclusive)
- Me (Data not modified, but exclusive; no write-back necessary)
- Mu (reserved for load-linked lwarx / store-conditional stwcx. sequences)
- T (Tagged: data modified, but not exclusively owned)
Optimization: do not write back modified exclusive data upon a read hit
IBM Power4's L2 coherence protocol
IBM Power4's L3 coherence protocol
States:
- Invalid
- Shared: can only give data to the L2s it is caching data for
- Tagged: modified and shared (local memory)
- Tagged remote: modified and shared (remote memory)
- O (prefetch): data in L3 = memory, same node, not modified
IBM Blue Gene/L memory architecture
Node overview
Single chip