The Big Picture: Where are We Now? Intro to Multiprocessors

[Diagram: two processor nodes, each with a datapath, memory, input, and output, connected together. Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

- Multiprocessor - multiple processors with a single shared address space
- Cluster - multiple computers (each with their own address space) connected over a local area network (LAN) functioning as a single system

Applications Needing Supercomputing
- Energy (plasma physics (simulating fusion reactions), geophysical (petroleum) exploration)
- DoE stockpile stewardship (to ensure the safety and reliability of the nation's stockpile of nuclear weapons)
- Earth and climate (climate and weather prediction; earthquake and tsunami prediction and mitigation of risks)
- Transportation (improving vehicles' airflow dynamics, fuel consumption, crashworthiness, noise reduction)
- Bioinformatics and computational biology (genomics, protein folding, designer drugs)
- Societal health and safety (pollution reduction, disaster planning, terrorist action detection)
Encountering Amdahl's Law

Speedup due to enhancement E is:

    Speedup w/ E = (Exec time w/o E) / (Exec time w/ E)

Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1), and the remainder of the task is unaffected:

    ExTime w/ E = ExTime w/o E x ((1 - F) + F/S)
    Speedup w/ E = 1 / ((1 - F) + F/S)

Examples: Amdahl's Law
- Consider an enhancement which runs 20 times faster but which is usable only 25% of the time:
    Speedup w/ E = 1 / (0.75 + 0.25/20) = 1.31
- What if it's usable only 15% of the time?
    Speedup w/ E = 1 / (0.85 + 0.15/20) = 1.17
- Amdahl's Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar! To get a speedup of 99 from 100 processors, the percentage of the original program that could be scalar would have to be 0.01% or less.

Supercomputer Style Migration (Top500)
http://www.top500.org/lists/2005/11/

[Figure: stacked chart of the 500 systems in each November Top500 list, 1993-2005, by style: Clusters, Constellations, SIMDs, MPPs, SMPs, Uniprocessors]

- Cluster - whole computers interconnected using their I/O bus
- Constellation - a cluster that uses an SMP multiprocessor as the building block
- In the last 8 years uniprocessors and SIMDs disappeared while Clusters and Constellations grew from 3% to 80%
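The speedup formula and both worked examples above can be checked with a few lines of Python (a sketch; the function name `amdahl_speedup` is chosen here for illustration, not from the slides):

```python
def amdahl_speedup(f, s):
    """Amdahl's Law: a fraction f of the task is sped up by factor s;
    the remaining (1 - f) is unaffected."""
    return 1.0 / ((1.0 - f) + f / s)

# The two worked examples from the slide:
usable_25 = round(amdahl_speedup(0.25, 20), 2)  # enhancement usable 25% of the time
usable_15 = round(amdahl_speedup(0.15, 20), 2)  # usable only 15% of the time

# The 100-processor observation: even 1% scalar code caps speedup far below 100.
one_pct_scalar = amdahl_speedup(0.99, 100)
```

This makes the slide's closing point concrete: with 99% of the work parallelized across 100 processors, the speedup is only about 50, not 99.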
Multiprocessor/Clusters Key Questions
- Q1 - How do they share data?
- Q2 - How do they coordinate?
- Q3 - How scalable is the architecture? How many processors can be supported?

Flynn's Classification Scheme
- SISD - single instruction, single data stream
  - aka uniprocessor - what we have been talking about all semester
- SIMD - single instruction, multiple data streams
  - single control unit broadcasting operations to multiple datapaths
- MISD - multiple instruction, single data
  - no such machine (although some people put vector machines in this category)
- MIMD - multiple instructions, multiple data streams
  - aka multiprocessors (SMPs, MPPs, clusters, NOWs)
- Now obsolete except for ...

SIMD Processors
- Single control unit
- Multiple datapaths (processing elements - PEs) running in parallel
  - Q1 - PEs are interconnected (usually via a mesh or torus) and exchange/share data as directed by the control unit
  - Q2 - Each PE performs the same operation on its own local data

Example SIMD Machines

| Machine  | Maker             | Year | # PEs  | # bits/PE | Max memory (MB) | PE clock (MHz) | System BW (MB/s) |
|----------|-------------------|------|--------|-----------|-----------------|----------------|------------------|
| Illiac IV| UIUC              | 1972 | 64     | 64        | 1               | 13             | 2,560            |
| DAP      | ICL               | 1980 | 4,096  | 1         | 2               | 5              | 2,560            |
| MPP      | Goodyear          | 1982 | 16,384 | 1         | 2               | 10             | 20,480           |
| CM-2     | Thinking Machines | 1987 | 65,536 | 1         | 512             | 7              | 16,384           |
| MP-1216  | MasPar            | 1989 | 16,384 | 4         | 1024            | 25             | 23,000           |
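The SIMD execution model above, one control unit broadcasting an operation that every PE applies to its own local data in lockstep, can be modeled in a few lines of Python (an illustrative sketch with hypothetical names, not a description of any machine in the table):

```python
# Each PE holds its own local data; the control unit broadcasts one operation.
pe_local_data = [[1, 2], [3, 4], [5, 6], [7, 8]]  # 4 PEs, 2 words each

def broadcast(op, pes):
    """Control unit: every PE executes the same op on its local data,
    modeling SIMD lockstep execution (Q2 in the slide)."""
    return [[op(word) for word in local] for local in pes]

# Single instruction ("double the word"), multiple data streams:
result = broadcast(lambda w: w * 2, pe_local_data)
```

The design point the sketch captures is that there is one instruction stream but as many data streams as PEs; data exchange between PEs (Q1) would be an additional, control-unit-directed step not shown here.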
Multiprocessor Basic Organizations
- Processors connected by a single bus
- Processors connected by a network

| Communication model |         | # of Procs |
|---------------------|---------|------------|
| Message passing     |         | 8 to 2048  |
| Shared address      | NUMA    | 8 to 256   |
|                     | UMA     | 2 to 64    |
| Physical connection | Network | 8 to 256   |
|                     | Bus     | 2 to 36    |

Shared Address (Shared Memory) Multis
- Q1 - Single address space shared by all the processors
- Q2 - Processors coordinate/communicate through shared variables in memory (via loads and stores)
  - Use of shared data must be coordinated via synchronization primitives (locks)
- UMAs (uniform memory access) - aka SMPs (symmetric multiprocessors)
  - all accesses to main memory take the same amount of time no matter which processor makes the request or which location is requested
- NUMAs (nonuniform memory access)
  - some main memory accesses are faster than others depending on the processor making the request and which location is requested
  - can scale to larger sizes than UMAs, so are potentially higher performance

NUMA/UMA Remote Memory Access Times (RMAT)

| Machine               | Year | Type | Max Procs | Interconnection Network    | RMAT (ns) |
|-----------------------|------|------|-----------|----------------------------|-----------|
| Sun Starfire          | 1996 | SMP  | 64        | Address buses, data switch | 500       |
| Cray T3E              | 1996 | NUMA | 2048      | 2-way 3D torus             | 300       |
| HP V                  | 1998 | SMP  | 32        | 8 x 8 crossbar             | 1000      |
| SGI Origin 3000       | 1999 | NUMA | 512       | Fat tree                   | 500       |
| Compaq AlphaServer GS | 1999 | SMP  | 32        | Switched bus               | 400       |
| Sun V880              | 2002 | SMP  | 8         | Switched bus               | 240       |
| HP Superdome 9000     | 2003 | SMP  | 64        | Switched bus               | 275       |
| NASA Columbia         | 2004 | NUMA | 10240     | Fat tree                   | ???       |

Single Bus (Shared Address UMA) Multis

[Diagram: processors, each with a cache, sharing a single bus with memory and I/O]

- Caches are used to reduce latency and to lower bus traffic
- Must provide hardware to ensure that caches and memory are consistent (cache coherency)
- Must provide a hardware mechanism to support process synchronization
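The NUMA point above, that some memory accesses are slower than others, can be made concrete with a small average-access-time calculation (illustrative numbers only: the 300 ns remote figure echoes the Cray T3E row of the table, while the 100 ns local latency is an assumption for the sketch):

```python
def numa_avg_access_ns(local_frac, local_ns, remote_ns):
    """Average memory access time when local_frac of references
    hit the processor's local memory and the rest go remote."""
    return local_frac * local_ns + (1.0 - local_frac) * remote_ns

# Assumed 100 ns local latency, 300 ns remote (T3E-like):
mostly_local = numa_avg_access_ns(0.9, 100, 300)   # good data placement
mostly_remote = numa_avg_access_ns(0.5, 100, 300)  # poor data placement
```

The gap between the two results is why data placement matters on NUMAs, and why they can scale further than UMAs only when most references stay local.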
Message Passing Multiprocessors
- Each processor has its own private address space
- Q1 - Processors share data by explicitly sending and receiving information (messages)
- Q2 - Coordination is built into message passing primitives (send and receive)

Summary
- Flynn's classification of processors - SISD, SIMD, MIMD
  - Q1 - How do processors share data?
  - Q2 - How do processors coordinate their activity?
  - Q3 - How scalable is the architecture (what is the maximum number of processors)?
- Shared address multis - UMAs and NUMAs
- Bus connected (shared address UMA) multis
  - Cache coherency hardware to ensure data consistency
  - Synchronization primitives for synchronization
  - Bus traffic limits scalability of architecture (< ~36 processors)
- Message passing multis

Review: Where are We Now? Multiprocessor Basics

[Diagram: repeat of the earlier Big Picture figure - two processor nodes, each with a datapath, memory, input, and output]

- Q1 - How do they share data?
- Q2 - How do they coordinate?
- Q3 - How scalable is the architecture? How many processors?
- Multiprocessor - multiple processors with a single shared address space
- Cluster - multiple computers (each with their own address space) connected over a local area network (LAN) functioning as a single system

[Table: repeat of the communication model table - Message passing 8 to 2048; Shared address NUMA 8 to 256, UMA 2 to 64; Network 8 to 256; Bus 2 to 36]
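The send/receive model above can be sketched as two workers that exchange explicit messages through private mailboxes. This is a toy model: Python threads and queues stand in for separate processors, purely to show the Q1/Q2 structure (real message-passing multis have genuinely private address spaces, e.g. under MPI):

```python
import threading
import queue

# Each "processor" has private data and a private mailbox; no shared variables
# are used for the computation itself.
mailbox = {0: queue.Queue(), 1: queue.Queue()}

def worker(rank, local_data, result):
    partial = sum(local_data)        # compute on private data
    mailbox[1 - rank].put(partial)   # Q1: share data via an explicit send
    other = mailbox[rank].get()      # Q2: the blocking receive coordinates
    result[rank] = partial + other   # both ranks learn the global sum

result = {}
t0 = threading.Thread(target=worker, args=(0, [1, 2, 3], result))
t1 = threading.Thread(target=worker, args=(1, [4, 5, 6], result))
t0.start(); t1.start(); t0.join(); t1.join()
```

Note that the receive doubles as synchronization: neither worker can finish until its partner's message arrives, which is exactly the "coordination is built into the primitives" point.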
Single Bus (Shared Address UMA) Multis

[Diagram: Proc1 through Proc4, each with caches, sharing a single bus with memory and I/O]

- Caches are used to reduce latency and to lower bus traffic
- Write-back caches are used to keep bus traffic at a minimum
- Must provide hardware to ensure that caches and memory are consistent (cache coherency)
- Must provide a hardware mechanism to support process synchronization

Multiprocessor Cache Coherency
- Cache coherency protocols
  - Bus snooping - cache controllers monitor shared bus traffic with duplicate address tag hardware (so they don't interfere with the processor's access to the cache)

[Diagram: Proc1 through ProcN, each with a snoop unit and data cache, sharing a single bus with I/O]

Bus Snooping Protocols
- Multiple copies are not a problem when reading
- A processor must have exclusive access to write a word
  - What happens if two processors try to write to the same shared data word in the same clock cycle? The bus arbiter decides which processor gets the bus first (and this will be the processor with the first exclusive access). Then the second processor will get exclusive access. Thus, bus arbitration forces sequential behavior.
  - This sequential consistency is the most conservative of the memory consistency models. With it, the result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved.
- All other processors sharing that data must be informed of writes

Handling Writes
Ensuring that all other processors sharing data are informed of writes can be handled two ways:
1. Write-update (write-broadcast) - the writing processor broadcasts the new data over the bus; all copies are updated
   - All writes go to the bus, so higher bus traffic
   - Since new values appear in caches sooner, it can reduce latency
2. 
Write-invalidate - the writing processor issues an invalidation signal on the bus; cache snoops check to see if they have a copy of the data and, if so, invalidate their cache block containing the word (this allows multiple readers but only one writer)
   - Uses the bus only on the first write, so lower bus traffic and better use of bus bandwidth
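The bus-traffic tradeoff between the two policies can be illustrated by counting bus transactions for a run of repeated writes to one cached word (a toy model with assumed behavior: the block stays cached and no other processor touches it between writes):

```python
def bus_transactions(policy, num_writes):
    """Toy model of coherence bus traffic for repeated writes to one word.
    write-update broadcasts every write; write-invalidate uses the bus
    only for the first write (later writes hit a block already held
    exclusively by the writer)."""
    if policy == "write-update":
        return num_writes               # every write is broadcast on the bus
    if policy == "write-invalidate":
        return 1 if num_writes else 0   # only the first write sends an invalidate
    raise ValueError(policy)

update_traffic = bus_transactions("write-update", 100)
invalidate_traffic = bus_transactions("write-invalidate", 100)
```

Under these assumptions, 100 writes cost 100 bus transactions with write-update but only 1 with write-invalidate, which is the "better use of bus bandwidth" claim above.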
Write-Invalidate CC Protocol

[State diagram, write-back caching protocol shown in black, over three states: Invalid, Shared (clean), Modified (dirty):
- Invalid -> Shared (clean) on read (miss)
- Invalid -> Modified (dirty) on write (miss)
- Shared (clean): stays on read (hit or miss); -> Modified (dirty) on write (hit or miss)
- Modified (dirty): stays on read (hit) or write (hit or miss)]

Write-Invalidate CC Protocol (with coherence additions)

[Same state diagram; write-back caching protocol in black, coherence additions from the processor in red, coherence additions from the bus in blue:
- Invalid -> Shared on read (miss)
- Invalid -> Modified on write (miss), send invalidate
- Shared -> Modified on write (hit or miss), send invalidate
- Shared -> Invalid on receiving an invalidate (write by another processor to this block)
- Modified -> Shared on a write-back due to a read (miss) by another processor to this block
- Modified: stays on read (hit) or write (hit)]

Write-Invalidate CC Examples
- I = invalid (many), S = shared (many), M = modified (only one)

[Worked examples on a two-processor bus, reading and writing a block A:
- Read miss when another cache holds A clean: 1. read miss for A; 2. read request for A goes on the bus; 3. the other cache's snoop sees the read request for A and lets MM supply A; 4. the requester gets A from MM and changes its state to S.
- Write to a block another cache holds: 1. write miss for A; 2. the writer writes A and changes its state to M; 3. it sends an invalidate for A; 4. the other cache changes its A state to I.
- Read miss when another cache holds A modified: 1. read miss for A; 2. read request for A; 3. the owner's snoop sees the read request and writes A back to MM, downgrading its copy; 4. the requester gets A from MM and changes its state to S.
- Write miss when another cache holds A modified: the writer obtains the block, writes A, changes its state to M, and its invalidate forces the other copy to I.]
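The transitions in the diagram above can be exercised with a small snooping simulation over the three states (a simplified MSI-style sketch for illustration only; it ignores the actual data movement of the write-backs):

```python
# States: 'I' = invalid, 'S' = shared (clean), 'M' = modified (dirty)
caches = {"P1": "I", "P2": "I"}  # one block, two snooping caches

def access(proc, op):
    """Apply one processor read/write; the other caches snoop the bus."""
    for other in caches:
        if other == proc:
            continue
        if op == "write":
            caches[other] = "I"      # snoop receives an invalidate
        elif op == "read" and caches[other] == "M":
            caches[other] = "S"      # owner writes the block back, downgrades
    if op == "write":
        caches[proc] = "M"           # writer always ends up modified
    elif caches[proc] == "I":
        caches[proc] = "S"           # read miss: block supplied, now shared

access("P1", "read")    # P1: I -> S (read miss, MM supplies the block)
access("P2", "write")   # P2: I -> M; the invalidate forces P1 -> I
access("P1", "read")    # P2 writes back and goes M -> S; P1: I -> S
```

After the sequence, both caches hold the block in S, matching the single-writer/multiple-reader invariant: at most one M copy ever exists, and a read by another processor downgrades it.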
Data Miss Rates
- Shared data has lower spatial and temporal locality
- Shared data misses often dominate cache behavior even though they may be only 10% to 40% of the data accesses

[Figure: capacity vs. coherence miss rates for a 64KB 2-way set associative data cache with 32B blocks, for 1, 2, 4, 8, and 16 processors, on FFT (miss rates up to ~8%) and Ocean (up to ~18%). From Hennessy & Patterson, Computer Architecture: A Quantitative Approach]

Block Size Effects
- Writes to one word in a multi-word block mean either the full block is invalidated (write-invalidate) or the full block is exchanged between processors (write-update)
  - Alternatively, could broadcast only the written word
- Multi-word blocks can also result in false sharing: when two processors are writing to two different variables in the same cache block
  - With write-invalidate, false sharing increases cache miss rates

[Diagram: Proc1 writes variable A and Proc2 writes variable B, where A and B fall in the same 4-word cache block]

- Compilers can help reduce false sharing by allocating highly correlated data to the same cache block

Other Coherence Protocols
- There are many variations on cache coherence protocols

MESI Cache Coherency Protocol
- Another write-invalidate protocol, used in the Pentium 4 (and many other microprocessors), is MESI, with four states:
  - Modified - same as before
  - Exclusive - only one copy of the shared data is allowed to be cached; memory has an up-to-date copy
    - Since there is only one copy of the block, write hits don't need to send an invalidate signal
  - Shared - multiple copies of the shared data may be cached (i.e., data permitted to be cached with more than one processor); memory has an up-to-date copy
  - Invalid - same as before

[State diagram over Invalid (not valid block), Shared (clean), Exclusive (clean), and Modified (dirty):
- Invalid -> Exclusive on an exclusive read miss; Invalid -> Shared on a shared read miss
- Shared -> Modified on a write miss, after sending an invalidate signal
- Exclusive -> Modified on a write (no invalidate needed); Exclusive stays on exclusive reads
- Modified stays on write or read hit
- An invalidate for this block forces Invalid; another processor's read/write miss for this block forces a write-back of the block and a downgrade]
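False sharing, described above, can be demonstrated with a toy block-granularity invalidation model: two processors write disjoint words, yet each write invalidates the other's copy whenever both words map to the same block (hypothetical names; the block size is a parameter of the sketch):

```python
def coherence_misses(writes, block_size):
    """Count coherence misses when invalidation works at block granularity.
    writes: list of (processor, word_address) pairs in program order."""
    holder = {}   # block number -> processor that last wrote it
    misses = 0
    for proc, addr in writes:
        block = addr // block_size
        if holder.get(block) not in (None, proc):
            misses += 1          # block was invalidated by the other writer
        holder[block] = proc
    return misses

# P1 writes word 0, P2 writes word 1, alternating: logically independent data.
trace = [("P1", 0), ("P2", 1)] * 4
same_block = coherence_misses(trace, block_size=4)  # both words in one 4-word block
separate = coherence_misses(trace, block_size=1)    # one word per block
```

With a 4-word block every write after the first misses (7 of 8), while with one word per block there are no coherence misses at all, which is why padding or careful data layout eliminates false sharing.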
Commercial Single Backplane Multiprocessors

| Machine      | Processor   | # procs | MHz | BW/system |
|--------------|-------------|---------|-----|-----------|
| Compaq PL    | Pentium Pro | 4       | 200 | 540       |
| IBM R40      | PowerPC     | 8       | 112 | 1800      |
| AlphaServer  | Alpha 21164 | 12      | 440 | 2150      |
| SGI Pow Chal | MIPS R10000 | 36      | 195 | 1200      |
| Sun 6000     | UltraSPARC  | 30      | 167 | 2600      |

Summary
- Key questions
  - Q1 - How do processors share data?
  - Q2 - How do processors coordinate their activity?
  - Q3 - How scalable is the architecture (what is the maximum number of processors)?
- Bus connected (shared address UMA) multis (SMPs)
  - Cache coherency hardware to ensure data consistency
  - Synchronization primitives for synchronization
  - Scalability of bus connected UMAs is limited (< ~36 processors) because the three desirable bus characteristics
    - high bandwidth
    - low latency
    - long length
    are incompatible
- Network connected NUMAs are more scalable