Intro to Multiprocessors


The Big Picture: Where Are We Now?

[Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

Multiprocessor: multiple processors with a single shared address space.
Cluster: multiple computers (each with their own address space) connected over a local area network (LAN), functioning as a single system.

Applications Needing Supercomputing
- Energy: plasma physics (simulating fusion reactions), geophysical (petroleum) exploration
- DoE stockpile stewardship: ensuring the safety and reliability of the nation's stockpile of nuclear weapons
- Earth and climate: climate and weather prediction; earthquake and tsunami prediction and mitigation of risks
- Transportation: improving vehicle airflow dynamics, fuel consumption, crashworthiness, and noise reduction
- Bioinformatics and computational biology: genomics, protein folding, designer drugs
- Societal health and safety: pollution reduction, disaster planning, terrorist action detection

Encountering Amdahl's Law

Speedup due to an enhancement E:

  Speedup w/ E = Exec time w/o E / Exec time w/ E

Suppose that enhancement E accelerates a fraction F (F < 1) of the task by a factor S (S > 1) and the remainder of the task is unaffected:

  ExTime w/ E  = ExTime w/o E x ((1 - F) + F/S)
  Speedup w/ E = 1 / ((1 - F) + F/S)

Examples: Amdahl's Law
- Consider an enhancement that runs 20 times faster but is usable only 25% of the time:
  Speedup w/ E = 1 / (0.75 + 0.25/20) = 1.31
- What if it is usable only 15% of the time?
  Speedup w/ E = 1 / (0.85 + 0.15/20) = 1.17
- Amdahl's Law tells us that to achieve linear speedup with 100 processors, none of the original computation can be scalar! To get a speedup of 99 from 100 processors, the percentage of the original program that could be scalar would have to be 0.01% or less.

Supercomputer Style Migration (Top500)
http://www.top500.org/lists/2005/11/
[Chart: number of Top500 systems by style, November data, 1993-2005: Clusters, Constellations, SIMDs, MPPs, SMPs, Uniprocessors]
- Cluster: whole computers interconnected using their I/O bus
- Constellation: a cluster that uses an SMP multiprocessor as the building block
In the last 8 years uniprocessors and SIMDs disappeared while Clusters and Constellations grew from 3% to 80%.
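The worked examples above can be checked directly. A minimal sketch of Amdahl's Law (the function name `speedup` is ours, not from the slides):

```python
def speedup(f, s):
    """Amdahl's Law: overall speedup when a fraction f of the
    original execution time is accelerated by a factor s."""
    return 1.0 / ((1.0 - f) + f / s)

print(round(speedup(0.25, 20), 2))   # usable 25% of the time -> 1.31
print(round(speedup(0.15, 20), 2))   # usable 15% of the time -> 1.17

# 100 processors with only a 0.01% scalar fraction: speedup ~ 99
print(round(speedup(0.9999, 100), 1))  # 99.0
```

Note how quickly the unaccelerated fraction dominates: even a 20x enhancement gives barely 1.3x overall when it covers only a quarter of the work.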

Multiprocessor/Cluster Key Questions
- Q1: How do they share data?
- Q2: How do they coordinate?
- Q3: How scalable is the architecture? How many processors can be supported?

Flynn's Classification Scheme
- SISD (single instruction, single data stream): aka uniprocessor, what we have been talking about all semester
- SIMD (single instruction, multiple data streams): a single control unit broadcasting operations to multiple datapaths; now obsolete except for ...
- MISD (multiple instruction, single data): no such machine (although some people put vector machines in this category)
- MIMD (multiple instruction, multiple data streams): aka multiprocessors (SMPs, MPPs, clusters, NOWs)

SIMDs
- A single control unit; multiple datapaths (processing elements, PEs) running in parallel
- Q1: PEs are interconnected (usually via a mesh or torus) and exchange/share data as directed by the control unit
- Q2: Each PE performs the same operation on its own local data

Example SIMD Machines

  Machine    Maker              Year   # PEs    # b/PE   Max memory (MB)   PE clock (MHz)   System BW (MB/s)
  Illiac IV  UIUC               1972   64       64       1                 13               2,560
  DAP        ICL                1980   4,096    1        2                 5                2,560
  MPP        Goodyear           1982   16,384   1        2                 10               20,480
  CM-2       Thinking Machines  1987   65,536   1        512               7                16,384
  MP-1216    MasPar             1989   16,384   4        1024              25               23,000
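The SIMD idea, one control unit broadcasting an operation that every PE applies to its own local data, can be sketched in a few lines (a toy analogy, not a model of any machine in the table):

```python
def simd_step(op, local_data):
    """One SIMD step: the control unit broadcasts a single operation
    `op`, and every processing element applies it to its own local
    data element in parallel (modeled here as a comprehension)."""
    return [op(x) for x in local_data]

# Four "PEs", each holding one local value; one broadcast instruction:
print(simd_step(lambda x: x + 1, [10, 20, 30, 40]))  # [11, 21, 31, 41]
```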

Multiprocessor Basic Organizations
- Processors connected by a single bus
- Processors connected by a network

  Communication model    Message passing        8 to 2048 processors
                         Shared address, NUMA   8 to 256
                         Shared address, UMA    2 to 64
  Physical connection    Network                8 to 256
                         Bus                    2 to 36

Shared Address Multis
- Q1: A single address space shared by all the processors
- Q2: Processors coordinate/communicate through shared variables in memory (via loads and stores); use of shared data must be coordinated via synchronization primitives (locks)
- UMAs (uniform memory access), aka SMPs (symmetric multiprocessors): all accesses to main memory take the same amount of time, no matter which processor makes the request or which location is requested
- NUMAs (nonuniform memory access): some main memory accesses are faster than others, depending on the processor making the request and which location is requested; NUMAs can scale to larger sizes than UMAs, so they are potentially higher performance

NUMA/UMA Remote Access Times (RAT)

  Machine                 Year   Type   Max Proc   Interconnection Network      RAT (ns)
  Sun Starfire            1996   SMP    64         Address buses, data switch   500
  Cray T3E                1996   NUMA   2048       2-way 3D torus               300
  HP V                    1998   SMP    32         8 x 8 crossbar               1000
  SGI Origin 3000         1999   NUMA   512        Fat tree                     500
  Compaq AlphaServer GS   1999   SMP    32         Switched bus                 400
  Sun V880                2002   SMP    8          Switched bus                 240
  HP Superdome 9000       2003   SMP    64         Switched bus                 275
  NASA Columbia           2004   NUMA   10240      Fat tree                     ???

Single Bus (Shared Address UMA) Multis
[Diagram: processors, each with its own cache, on a single bus with memory and I/O]
- Caches are used to reduce latency and to lower bus traffic
- Must provide hardware to ensure that caches and memory are consistent (cache coherency)
- Must provide a hardware mechanism to support process synchronization
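The practical consequence of the NUMA access-time gap in the table can be illustrated with a simple weighted average. The numbers below are illustrative assumptions, not measurements of any listed machine:

```python
def avg_access_ns(local_ns, remote_ns, remote_frac):
    """Average memory access time on a NUMA machine when a fraction
    `remote_frac` of accesses go to remote memory and the rest hit
    local memory."""
    return (1.0 - remote_frac) * local_ns + remote_frac * remote_ns

# Assumed 100 ns local vs. 500 ns remote: even 20% remote accesses
# nearly double the average access time, which is why data placement
# matters so much on NUMAs.
print(avg_access_ns(100, 500, 0.2))  # 180.0
print(avg_access_ns(100, 500, 0.0))  # 100.0
```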

Message Passing Multiprocessors
- Each processor has its own private address space
- Q1: Processors share data by explicitly sending and receiving information (messages)
- Q2: Coordination is built into the message passing primitives (send and receive)

Summary
- Flynn's classification of processors: SISD, SIMD, MIMD
  - Q1: How do processors share data?
  - Q2: How do processors coordinate their activity?
  - Q3: How scalable is the architecture (what is the maximum number of processors)?
- Shared address multis: UMAs and NUMAs
- Bus connected (shared address UMA) multis
  - Cache coherency hardware to ensure data consistency
  - Synchronization primitives for synchronization
  - Bus traffic limits the scalability of the architecture (< ~36 processors)
- Message passing multis

Review: Where Are We Now? Multiprocessor Basics
- Q1: How do they share data? Q2: How do they coordinate? Q3: How scalable is the architecture? How many processors?
- Multiprocessor: multiple processors with a single shared address space
- Cluster: multiple computers (each with their own address space) connected over a local area network (LAN), functioning as a single system

  Communication model    Message passing        8 to 2048 processors
                         Shared address, NUMA   8 to 256
                         Shared address, UMA    2 to 64
  Physical connection    Network                8 to 256
                         Bus                    2 to 36
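Message passing can be sketched with OS processes: each worker has its own private address space, and the only way to share data is an explicit send/receive, which also serves as the coordination point. A minimal sketch (the function names `worker` and `parallel_sum` are ours):

```python
from multiprocessing import Process, Pipe

def worker(conn, chunk):
    # Private address space: the only way to share the result is to
    # send it back as an explicit message over the pipe.
    conn.send(sum(chunk))
    conn.close()

def parallel_sum(data, nprocs=2):
    """Sum `data` across `nprocs` processes, message-passing style."""
    step = (len(data) + nprocs - 1) // nprocs
    pipes, procs = [], []
    for i in range(nprocs):
        parent, child = Pipe()
        p = Process(target=worker, args=(child, data[i * step:(i + 1) * step]))
        p.start()
        pipes.append(parent)
        procs.append(p)
    # recv() doubles as the coordination primitive: the parent blocks
    # until each worker has sent its partial sum.
    total = sum(parent.recv() for parent in pipes)
    for p in procs:
        p.join()
    return total

if __name__ == "__main__":
    print(parallel_sum(list(range(100))))  # 4950
```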

Single Bus (Shared Address UMA) Multis
[Diagram: Proc1-Proc4, each with its own caches, on a single bus with memory and I/O]
- Caches are used to reduce latency and to lower bus traffic
- Write-back caches are used to keep bus traffic at a minimum
- Must provide hardware to ensure that caches and memory are consistent (cache coherency)
- Must provide a hardware mechanism to support process synchronization

Multiprocessor Cache Coherency
- Cache coherency protocols: bus snooping. Cache controllers monitor shared bus traffic with duplicate address tag hardware (so they don't interfere with the processor's access to the cache).
[Diagram: Proc1-ProcN, each with a snoop unit and data cache, on a single bus with memory and I/O]

Bus Snooping Protocols
- Multiple copies are not a problem when reading
- A processor must have exclusive access to write a word
  - What happens if two processors try to write to the same shared data word in the same clock cycle? The bus arbiter decides which processor gets the bus first (and this will be the processor with the first exclusive access); then the second processor will get exclusive access. Thus, bus arbitration forces sequential behavior.
  - This sequential consistency is the most conservative of the memory consistency models. With it, the result of any execution is the same as if the accesses of each processor were kept in order and the accesses among different processors were interleaved.
- All other processors sharing that data must be informed of writes

Handling Writes
Ensuring that all other processors sharing data are informed of writes can be handled two ways:
1. Write-update (write-broadcast): the writing processor broadcasts the new data over the bus, and all copies are updated
   - All writes go to the bus: higher bus traffic
   - Since new values appear in caches sooner, it can reduce latency
2. Write-invalidate: the writing processor issues an invalidation signal on the bus; cache snoops check whether they have a copy of the data and, if so, invalidate their cache block containing the word (this allows multiple readers but only one writer)
   - Uses the bus only on the first write: lower bus traffic, so better use of bus bandwidth
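The bus-traffic difference between the two policies comes down to how many bus transactions a run of writes generates. A toy count for one processor repeatedly writing the same shared block (illustrative only, ignoring other traffic):

```python
def bus_transactions(n_writes, policy):
    """Bus transactions generated by one processor writing the same
    shared block n_writes times, under each write-handling policy."""
    if policy == "update":        # write-update: every write broadcasts
        return n_writes
    elif policy == "invalidate":  # write-invalidate: only the first
        return 1 if n_writes > 0 else 0  # write needs the bus
    raise ValueError(policy)

print(bus_transactions(10, "update"))      # 10
print(bus_transactions(10, "invalidate"))  # 1
```

This is the sense in which write-invalidate makes better use of bus bandwidth, at the cost of other readers taking a miss on their next access.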

Write-Invalidate CC Protocol
[State diagram with three states: Invalid, Shared (clean), and Modified (dirty). The write-back caching protocol is shown in black, processor-initiated coherence additions in red, and bus-initiated coherence additions in blue.]
- Invalid -> Shared (clean): read (miss)
- Invalid -> Modified (dirty): write (miss), send invalidate
- Shared -> Modified: write (hit or miss), send invalidate
- Shared -> Invalid: receives invalidate (a write by another processor to this block)
- Modified -> Shared: write-back due to a read (miss) by another processor to this block
- Shared: read (hit or miss) stays Shared; Modified: read (hit) or write (hit) stays Modified

Write-Invalidate CC Examples
I = invalid (many copies possible), S = shared (many), M = modified (only one)
- Read miss, remote copy in S: 1. Proc1 has a read miss for X; 2. it puts a read request for X on the bus; 3. the snoop holding the shared copy sees the request and lets main memory supply X; 4. Proc1 gets X from main memory and changes its state to S.
- Write miss, remote copy in S: 1. Proc1 has a write miss for X; 2. Proc1 writes X and changes its state to M; 3. it sends an invalidate for X; 4. the remote copy changes its state from S to I.
- Read miss, remote copy in M: 1. Proc1 has a read miss for X; 2. it puts a read request for X on the bus; 3. the snoop holding the modified copy sees the request and writes X back to main memory; 4. Proc1 gets X from main memory and changes its state to S; 5. the remote copy changes its state from M to S.
- Write miss, remote copy in M: 1. Proc1 has a write miss for X; 2. the snoop holding the modified copy writes X back to main memory; 3. Proc1 sends an invalidate for X; 4. the remote copy changes its state from M to I; 5. Proc1 writes X and changes its state to M.
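The state diagram above can be rendered as a small simulation. This is a sketch of the three-state write-invalidate protocol for a single block (the class and function names are ours):

```python
class Cache:
    """One processor's cache entry for a single block:
    'I' = invalid, 'S' = shared (clean), 'M' = modified (dirty)."""
    def __init__(self):
        self.state = 'I'

def read(caches, i):
    """Processor i reads the block."""
    if caches[i].state == 'I':              # read miss
        for j, c in enumerate(caches):
            if j != i and c.state == 'M':   # a remote cache holds dirty data:
                c.state = 'S'               # it writes back and becomes shared
        caches[i].state = 'S'
    # read hits in S or M need no bus traffic

def write(caches, i):
    """Processor i writes the block."""
    if caches[i].state != 'M':              # write hit in S, or write miss
        for j, c in enumerate(caches):
            if j != i:
                c.state = 'I'               # invalidate signal on the bus
        caches[i].state = 'M'
    # write hits in M need no bus traffic

caches = [Cache(), Cache()]
read(caches, 0)    # P0: I -> S
read(caches, 1)    # P1: I -> S (both shared: multiple readers are fine)
write(caches, 0)   # P0: S -> M; P1's copy is invalidated (one writer)
print(caches[0].state, caches[1].state)  # M I
```

The invariant the protocol maintains is visible in the code: at most one cache is ever in M, and when one is, every other copy is I.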

Data Miss Rates
- Shared data has lower spatial and temporal locality
- Shared data misses often dominate cache behavior, even though they may be only 10% to 40% of the data accesses
[Charts: capacity vs. coherence miss rates for FFT and Ocean as the processor count grows from 1 to 16, for a 64KB 2-way set associative data cache with 32B blocks. From Hennessy & Patterson, Computer Architecture: A Quantitative Approach]

Block Size Effects
- Writes to one word in a multi-word block mean either that the full block is invalidated (write-invalidate) or that the full block is exchanged between processors (write-update)
  - Alternatively, one could broadcast only the written word
- Multi-word blocks can also result in false sharing: two processors write to two different variables that happen to lie in the same cache block (e.g., Proc1 writes A and Proc2 writes B, with A and B in the same 4-word cache block)
  - With write-invalidate, false sharing increases cache miss rates
  - Compilers can help reduce false sharing by allocating highly correlated data to the same cache block

Other Coherence Protocols
- There are many variations on cache coherence protocols
- Another write-invalidate protocol, used in the Pentium 4 (and many other microprocessors), is MESI, with four states:
  - Modified (dirty): same as before
  - Exclusive (clean): only one copy of the shared data is allowed to be cached; memory has an up-to-date copy. Since there is only one copy of the block, write hits don't need to send an invalidate signal.
  - Shared (clean): multiple copies of the shared data may be cached (i.e., the data is permitted to be cached by more than one processor); memory has an up-to-date copy
  - Invalid (not a valid block): same as before

MESI Cache Coherency Protocol
[State diagram with states Invalid, Shared (clean), Exclusive (clean), and Modified (dirty):]
- Invalid -> Shared: shared read miss
- Invalid -> Exclusive: exclusive read miss
- Shared -> Modified: write [send invalidate signal]
- Exclusive -> Modified: write (no invalidate signal needed)
- Exclusive -> Shared: shared read by another processor
- Modified -> Shared: another processor has a read miss for this block [write back block]
- Modified -> Invalid: another processor has a write miss for this block [write back block]
- Shared/Exclusive -> Invalid: invalidate for this block
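Whether two variables are falsely shared depends only on which cache block their addresses map to. A tiny sketch, assuming 16-byte blocks (four 4-byte words, as in the example above):

```python
BLOCK_SIZE = 16  # bytes: a 4-word block of 4-byte words (assumed)

def block_of(addr):
    """Index of the cache block that byte address `addr` falls in."""
    return addr // BLOCK_SIZE

def falsely_shared(addr_a, addr_b):
    """Two distinct variables are falsely shared when they occupy the
    same cache block: a write to one invalidates the other's block
    even though the processors never touch each other's data."""
    return addr_a != addr_b and block_of(addr_a) == block_of(addr_b)

# Proc1 writes A at byte 0 while Proc2 writes B at byte 4: same block,
# so under write-invalidate each write costs the other processor a miss.
print(falsely_shared(0, 4))    # True

# Padding B out to the next block removes the false sharing:
print(falsely_shared(0, 16))   # False
```

Padding is the hand-done version of what the slide attributes to compilers: placing unrelated, independently-written data in different blocks.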

Commercial Single Backplane Multiprocessors

  Machine        Processor     # proc.   MHz   BW/system
  Compaq PL      Pentium Pro   4         200   540
  IBM R40        PowerPC       8         112   1800
  AlphaServer    Alpha 21164   12        440   2150
  SGI Pow Chal   MIPS R10000   36        195   1200
  Sun 6000       UltraSPARC    30        167   2600

Summary
- Key questions:
  - Q1: How do processors share data?
  - Q2: How do processors coordinate their activity?
  - Q3: How scalable is the architecture (what is the maximum number of processors)?
- Bus connected (shared address UMA) multis (SMPs)
  - Cache coherency hardware to ensure data consistency
  - Synchronization primitives for synchronization
  - Scalability of bus connected UMAs is limited (< ~36 processors) because the three desirable bus characteristics (high bandwidth, low latency, long length) are incompatible
- Network connected NUMAs are more scalable