Chapter 8. Multiprocessors
In-Cheol Park, Dept. of EE, KAIST
Can the rapid rate of uniprocessor performance growth be sustained indefinitely? If the pace does slow down, multiprocessor architectures will be attractive:
- Small-scale multiprocessors rather than large-scale parallel computers
- Challenges of parallel processing: the compiler problem, low ILP in applications, Amdahl's law
- Single-chip multiprocessors: integrate a small number of processors and memory on a chip
- SISD (single instruction stream, single data stream)
- SIMD (single instruction stream, multiple data streams): special purpose, for example media processors
- MISD (multiple instruction streams, single data stream): no machine to date
- MIMD (multiple instruction streams, multiple data streams): offers flexibility and builds on the cost/performance advantages of microprocessors
Centralized shared-memory
- Small number of processors, single centralized main memory, connected by a bus
- UMA (Uniform Memory Access); a popular organization

Distributed memory
- Large number of processors; memory is distributed among the processors
- Large memory bandwidth, easily scalable; nodes connected by an interconnection network
Shared address space
- DSM (distributed shared memory)
- NUMA (Non-Uniform Memory Access)

Private address space
- Multicomputers: communication is done by passing messages, so they are called message-passing machines
- Remote procedure call (RPC)
Shared-memory communication
- Well-understood mechanism, easy programming
- Lower overhead and better use of bandwidth when communicating small items
- Hardware-controlled caching of remote data
- Supporting message passing on top of shared memory is easier

Message-passing communication
- Simple hardware
- Explicit communication, which enables optimization at the user level
- Supporting shared memory on top of message-passing hardware is more difficult
- Insufficient parallelism: low ILP, Amdahl's law
- Long latency of remote access: remote accesses can be reduced with the assistance of hardware and software mechanisms
Cache coherence
Coherence defines what value can be returned by a read:
1. A read by a processor P to a location X that follows a write by P to X, with no writes of X by another processor occurring between the write and the read by P, always returns the value written by P.
2. A read by a processor to location X that follows a write by another processor to X returns the written value if the read and write are sufficiently separated and no other writes to X occur between the two accesses.
3. Writes to the same location are serialized.
Consistency determines when a written value is returned by a read.
Directory based
- Sharing status is kept in one location, called the directory
- Used in DSM machines

Snooping
- On a shared-memory bus, cache controllers monitor (snoop on) the bus to determine whether they have a copy of a block that is requested on the bus
- The popular choice
Write-invalidate protocol
- Invalidates all other copies on a write; works on cache blocks
- One invalidation suffices for multiple writes to the same word
- Preferred choice in a bus-based multiprocessor

Write-update protocol
- Updates all other copies on a write; works on words
- Multiple updates for writes to the same cache block
- Less delay between a write by one processor and a read by another processor
Write-through cache
- The most recent value is always in memory

Write-back cache
- The most recent value may be in a cache; preferable due to reduced memory bandwidth
- The cache that holds the dirty block provides that block in response to a read request
- Owner: memory or a cache
- State per block: valid bit, dirty bit, shared bit; the write is not placed on the bus if the block is not shared
- Snooping overhead: reduced by duplicating the tags, or by a multi-level cache with inclusion
Excluding coherence
- Only private data is kept in the caches; shared data is marked as uncacheable

Including coherence
- An accepted requirement

Directory-based cache coherence protocol
- Provides scalability: associate an entry in the directory with each memory block; directory entries can be distributed along with the memory
- Block states: shared, uncached, exclusive
- Each entry keeps a bit vector indicating the processors that have a copy of the block
- Nodes: local node, home node, remote node
Hardware primitives
- Atomically read and modify a memory location
- Single-instruction primitives (atomic read-and-update):
  - Exchange: exchanges a value in a register for a value in memory
  - Test-and-set: tests a value and sets it if the value passes the test
  - Fetch-and-increment: returns the value in a memory location and atomically increments it
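As a sketch, the three single-instruction primitives can be expressed with C11 atomics; the function names here are illustrative, and the compiler maps each call onto the hardware's atomic read-modify-write instruction.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Exchange: atomically swap a register value with a memory location,
   returning the old memory value. */
int exchange_example(atomic_int *loc, int newval) {
    return atomic_exchange(loc, newval);
}

/* Test-and-set: atomically set the flag and report whether it was
   already set (true = the test "failed", someone else holds it). */
bool test_and_set_example(atomic_flag *f) {
    return atomic_flag_test_and_set(f);
}

/* Fetch-and-increment: atomically return the old value and add one. */
int fetch_and_increment_example(atomic_int *ctr) {
    return atomic_fetch_add(ctr, 1);
}
```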
Primitives using a pair of instructions: load linked (also called load locked) and store conditional
- The store conditional fails if the memory location specified by the load linked is changed before the store conditional executes; failure cases: a context switch, a write by another processor
- The store conditional returns 1 if it succeeds
- Load linked is implemented with a link register that keeps track of the address specified by the load linked instruction; the link register is cleared if an interrupt occurs or if the corresponding cache block is invalidated
Spin locks: locks that a processor continuously tries to acquire.

```asm
        li      R2, #1
lockit: exch    R2, 0(R1)       ; atomic exchange
        bnez    R2, lockit      ; spin if the lock was already held
```

If cache coherence is supported, we can cache the lock, but each exchange requires a write operation. Better: spin on the cached copy first, and attempt the exchange only when the lock looks free.

```asm
lockit: lw      R2, 0(R1)       ; load the lock
        bnez    R2, lockit      ; spin while it is held
        li      R2, #1
        exch    R2, 0(R1)       ; try to acquire with an atomic exchange
        bnez    R2, lockit      ; retry if another processor won the race
```
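The second loop above (test and test-and-set) can be sketched in C11 atomics; the type and function names are illustrative. The relaxed load spins on the locally cached copy, so the bus-invalidating exchange is issued only when the lock appears free.

```c
#include <stdatomic.h>

typedef struct { atomic_int held; } spinlock_t;   /* 0 = free, 1 = held */

void spin_lock(spinlock_t *l) {
    for (;;) {
        /* Spin on an ordinary read, served from the local cache. */
        while (atomic_load_explicit(&l->held, memory_order_relaxed))
            ;
        /* The lock looks free: try the atomic exchange. */
        if (atomic_exchange(&l->held, 1) == 0)
            return;                    /* old value 0: we acquired it */
    }
}

void spin_unlock(spinlock_t *l) {
    atomic_store(&l->held, 0);
}
```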
The load-linked/store-conditional version:

```asm
lockit: ll      R2, 0(R1)       ; load linked
        bnez    R2, lockit      ; spin while the lock is held
        li      R2, #1
        sc      R2, 0(R1)       ; store conditional
        beqz    R2, lockit      ; retry if the store conditional failed
```

Simple, but it leads to a lot of contention as well as traffic, and the fairness of the bus makes things worse.
A barrier forces all processes to wait until all of them have reached the barrier, and then releases them all. It can be implemented with two spin locks:

```c
lock(counterlock);
if (count == 0) release = 0;   /* first arrival resets the release flag */
count++;
unlock(counterlock);
if (count == total) {          /* last arrival */
    count = 0;
    release = 1;               /* release everyone */
} else {
    spin(release == 1);        /* wait until released */
}
```

Problem: a fast process that re-enters the barrier can trap slow processes still spinning in the previous instance, because it resets the release flag.
Sense-reversing barrier: each process keeps a private local_sense and waits for the release flag to match it, so the barrier can be reused safely.

```c
local_sense = !local_sense;    /* flip the private sense */
lock(counterlock);
count++;
unlock(counterlock);
if (count == total) {
    count = 0;
    release = local_sense;     /* release this barrier instance only */
} else {
    spin(release == local_sense);
}
```
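A sketch of the sense-reversing barrier in C11 atomics; the names are illustrative, and a fetch-and-increment stands in for the counterlock-protected count++ of the pseudocode above.

```c
#include <stdatomic.h>

typedef struct {
    atomic_int count;     /* arrivals in the current instance */
    atomic_int release;   /* current "sense" of the barrier */
    int total;            /* number of participating processes */
} barrier_t;

/* Each caller passes its own private sense variable, initially 0. */
void barrier_wait(barrier_t *b, int *local_sense) {
    *local_sense = !*local_sense;                  /* flip private sense */
    if (atomic_fetch_add(&b->count, 1) + 1 == b->total) {
        atomic_store(&b->count, 0);                /* last arrival resets */
        atomic_store(&b->release, *local_sense);   /* and releases all */
    } else {
        while (atomic_load(&b->release) != *local_sense)
            ;                                      /* spin until released */
    }
}
```

Because the release flag alternates between 0 and 1 on successive instances, a fast process re-entering the barrier cannot trap slow processes from the previous instance.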
Software implementations
- Exponential back-off: wait between retries, increasing the delay after each failure
- Combining tree: an n-ary tree structure in which multiple requests are combined locally in tree fashion; when k processes have arrived at a node, it signals the next level of the tree

Hardware primitives
- A plain spin lock causes unneeded contention after each release
- Queuing lock: keep a list of waiting processes and hand the lock to one explicitly when its turn comes
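Exponential back-off can be sketched on top of the exchange-based lock; the busy-wait delay loop and the cap are illustrative choices. Doubling the delay after each failed attempt spreads out retries and reduces contention on the lock's cache block.

```c
#include <stdatomic.h>

void backoff_lock(atomic_int *lock) {
    unsigned delay = 1;
    const unsigned max_delay = 1024;           /* illustrative cap */
    while (atomic_exchange(lock, 1) != 0) {    /* attempt failed */
        for (volatile unsigned i = 0; i < delay; i++)
            ;                                  /* busy-wait back-off */
        if (delay < max_delay)
            delay *= 2;                        /* exponential growth */
    }
}

void backoff_unlock(atomic_int *lock) {
    atomic_store(lock, 0);
}
```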
Sequential consistency
- The most straightforward model: it requires that the result of any execution be the same as if the accesses executed by each processor were kept in order and the accesses among different processors were interleaved
- Implementation: delay the next access until the previous one is completed
- We cannot use a write buffer with read bypassing

Programmer's view: release and acquire (release = unlock, acquire = lock)

```
P1: write(x) ... release(s)
P2:                         acquire(s) ... read(x)
```

Memory fence
- Fixed points in a computation that ensure that no read or write is moved across the fence
- Read fence / write fence
- In sequential consistency, all reads are read fences and all writes are write fences
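The release/acquire pattern above maps directly onto C11 memory orderings; this sketch (variable names illustrative) guarantees that the writer's store to x cannot move below the release store to s, and the reader's load of x cannot move above the acquire load of s.

```c
#include <stdatomic.h>

int x;          /* ordinary shared data */
atomic_int s;   /* synchronization variable, initially 0 */

void writer(void) {
    x = 42;                                             /* write(x)   */
    atomic_store_explicit(&s, 1, memory_order_release); /* release(s) */
}

int reader(void) {
    while (atomic_load_explicit(&s, memory_order_acquire) == 0)
        ;                                               /* acquire(s) */
    return x;                                           /* read(x)    */
}
```

After the acquire load observes the release store, the reader is guaranteed to see x == 42, without ordering every access as sequential consistency would.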
TSO (processor consistency or total store ordering)
- Eliminates the W → R order
- Allows the buffering of writes with bypassing by reads
- Must check whether a pending write is to the same location as a read miss

PSO (partial store ordering)
- Also relaxes W → W
- Allows pipelining or overlapping of write operations
Weak ordering
- Also relaxes R → R and R → W
- A read or write is completed before any synchronization operation executed in program order by the processor after the read or write
- A synchronization operation is always completed before any reads or writes that occur in program order after the operation
- Takes advantage of nonblocking reads

Release consistency
- Distinguishes acquire synchronization (S_A) from release synchronization (S_R)
- Additionally removes the W → S_A, R → S_A, S_R → R, and S_R → W orderings