ECE200 Computer Organization Chapter 9 Multiprocessors David H. Albonesi, University of Rochester; Henk Corporaal, TU Eindhoven, Netherlands; Jari Nurmi, Tampere University of Technology, Finland. University of Rochester Spring 2003, TU Eindhoven 2004, TUT Spring 2004. What we'll cover today: Multiprocessor motivation Classification of parallel computation Multiprocessor organizations Shared memory multiprocessors Cache coherence Synchronization Interconnection network basics
Multiprocessor motivation, part 1 Many scientific applications take too long to run on a single-processor machine Modeling of weather patterns, astrophysics, chemical reactions, ocean currents, etc. Many of these are parallel applications which largely consist of loops that operate on independent data Such applications can make efficient use of a multiprocessor machine, with each loop iteration running on a different processor and operating on independent data Multiprocessor motivation, part 2 Many multi-user environments require more compute power than is available from a single-processor machine Airline reservation system, department store chain inventory system, file server for a large department, web server for a major corporation, etc. These consist largely of parallel transactions which operate on independent data Such applications can make efficient use of a multiprocessor machine, with each transaction running on a different processor and operating on independent data
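To make the loop-level parallelism of part 1 concrete, here is a minimal sketch (my own illustration, not from the original slides) using OpenMP in C; the array size and the scaling operation are arbitrary assumptions. Each iteration touches only its own element, so iterations can run on different processors with no communication:

#include <omp.h>
#include <stdlib.h>
#define N 1000000            /* assumed problem size */

int main(void)
{
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    for (int i = 0; i < N; i++) b[i] = i;

    /* Independent iterations: OpenMP hands different chunks of the
       iteration space to different processors (compile with -fopenmp). */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2.0 * b[i];

    free(a); free(b);
    return 0;
}

The same structure applies to the transaction workloads of part 2: each transaction, like each iteration here, runs on its own processor over independent data.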
Classification: Flynn Categories SISD (Single Instruction Single Data) Uniprocessors MISD (Multiple Instruction Single Data) Systolic arrays / stream based processing SIMD (Single Instruction Multiple Data) Examples: Illiac-IV, CM-2 (Thinking Machines Corp) Simple programming model Low overhead Now applied as sub-word parallelism!! MIMD (Multiple Instruction Multiple Data) Examples: Sun Enterprise 5000, Cray T3D, SGI Origin Flexible Use off-the-shelf micros Communication Models Shared Memory Processors communicate with shared address space Easy on small-scale machines Model of choice for uniprocessors, small-scale MPs Lower latency Message passing Processors have private memories, communicate via messages Focuses attention on costly non-local operations Can support either SW model on either HW base
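As a hedged illustration of the remark above that SIMD now survives mostly as sub-word parallelism, the sketch below uses x86 SSE2 intrinsics in C (an assumed example; the values are arbitrary): a single instruction adds eight 16-bit values at once.

#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void)
{
    short a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    short b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    short c[8];

    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* 8 x 16-bit lanes */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vc = _mm_add_epi16(va, vb);   /* single instruction, multiple data */
    _mm_storeu_si128((__m128i *)c, vc);

    for (int i = 0; i < 8; i++) printf("%d ", c[i]);
    printf("\n");
    return 0;
}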
Shared Address Model Summary Each processor can name every physical location in the machine Each process can name all data it shares with other processes Data transfer via load and store Data size: byte, word, ... or blocks May use the virtual memory manager to map virtual addresses to local or remote physical addresses Memory hierarchy model applies: communication now moves data to the local processor Multiprocessor organizations Shared memory multiprocessors All processors share the same memory address space Single copy of the OS (although some parts may run in parallel) Relatively easy to program and to port sequential code to Difficult to scale to large numbers of processors Uniform memory access (UMA) machine block diagram
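To show what "data transfer via load and store" looks like in practice, here is a small assumed sketch with POSIX threads in C (the variable names are mine): both threads simply name the same locations, and the flag uses C11 atomics so the handoff is well defined.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int shared_value;     /* data passed through ordinary store and load */
static atomic_int ready;     /* flag telling the consumer the data is valid */

static void *producer(void *arg)
{
    (void)arg;
    shared_value = 42;                                   /* plain store */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

int main(void)
{
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    /* Consumer spins until the producer's store is visible, then loads. */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;
    printf("received %d\n", shared_value);

    pthread_join(t, NULL);
    return 0;
}

Contrast this with the message passing model later: no explicit send or receive is needed, only shared addresses.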
Multiprocessor organizations Distributed memory multiprocessors Processors have their own memory address space Message passing used to access another processor's memory Multiple copies of the OS Usually commodity hardware and network (e.g., Ethernet) More difficult to program Easier to scale hardware and more inherently fault resilient Multiprocessor variants Non-uniform memory access (NUMA) shared memory multiprocessors All memory can be addressed by all processors, but access to a processor's own local memory is faster than access to another processor's remote memory Looks like a distributed machine, but the interconnection network is usually custom-designed switches and/or buses
Multiprocessor variants Distributed shared memory (DSM) multiprocessors Commodity hardware of a distributed memory multiprocessor, but all processors have the illusion of shared memory Operating system handles accesses to remote memory transparently on behalf of the application Relieves the application developer of the burden of memory management across the network Multiprocessor variants Shared memory machines connected together over a network (operating as a distributed memory or DSM machine) (block diagram: shared-memory nodes, each with a network controller, connected by a network)
Message Passing Model Explicit message send and receive operations Send specifies a local buffer + the receiving process on the remote computer Receive specifies the sending process on the remote computer + a local buffer to place the data Typically blocking, but may use DMA Message structure: Header, Data, Trailer Communication Models Comparison Shared memory: Compatibility with well-understood mechanisms Ease of programming for complex or dynamic communication patterns Shared-memory applications Efficient for small items Supports hardware caching Message passing: Simpler hardware Explicit communication Improved synchronization Easier for sender-initiated communication
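To make the send/receive pairing above concrete, here is a two-process sketch using MPI in C; MPI is not named on the slides, it is simply one common message-passing API, and the tag and value below are arbitrary assumptions.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Send names the local buffer and the receiving process (rank 1). */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Receive names the sending process (rank 0) and a local buffer. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

Run with something like mpirun -np 2 ./a.out; both calls here block, matching the "typically blocking" note above.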
Shared memory multiprocessors Major design issues Cache coherence: ensuring that stores to cached data are seen by other processors Synchronization: the coordination among processors accessing shared data Memory consistency: definition of when a processor must observe a write from another processor Cache coherence problem Two writeback caches becoming incoherent: (1) processor A reads the block (2) processor B reads the block (3) processor A writes the block in its own cache, leaving old, out-of-date copies of the block in B's cache and in memory
Potential HW Coherency Solutions Snooping Solution (Snoopy Bus): Send all requests for data to all processors Processors snoop to see if they have a copy and respond accordingly Requires broadcast, since caching information is at the processors Works well with a bus (natural broadcast medium) Dominates for small-scale machines (most of the market) Directory-Based Schemes Keep track of what is being shared in one centralized place Distributed memory => distributed directory for scalability (avoids bottlenecks) Send point-to-point requests to processors via the network Scales better than snooping Actually existed BEFORE snooping-based schemes Cache coherence protocols Ensure that cached blocks that are written to are observable by all processors Assign a state field to all cached blocks Define actions for performing reads and writes to blocks in each state that ensure coherence Actions are much more complicated than described here in a real machine with a split-transaction bus
MESI coherence protocol Commonly used (or a variant thereof) in shared memory multiprocessors Idea is to ensure that when a cache wants to write to a block, the remote caches invalidate their copies first Each cached block is in one of four states (2 bits stored with each block) Invalid: contents are not valid Shared: other processors' caches may have the same copy; memory has the same copy Exclusive: no other processor's cache has a copy; memory has the same copy Modified: no other processor's cache has a copy; memory has an old copy MESI coherence protocol Actions on a load that results in a hit Local actions: Read the block Remote actions: None Actions on a load that results in a miss Local actions: Request the block from the bus If not in a remote cache, set state to Exclusive If also in a remote cache, set state to Shared Remote actions: Look up tags to see if the block is present If so, signal the local cache that we have a copy, provide it if it is in state Modified, and change the state of our copy to Shared
MESI coherence protocol Actions on a store that results in a hit Local actions: Check the state of the block If Shared, send an Invalidation bus command to all remote caches Write the block and change the state to Modified Remote actions: Upon receipt of an Invalidation command on the bus, look up tags to see if the block is present If so, change the state of the block to Invalid Actions on a store that results in a miss Local actions: Simultaneously request the block from the bus and send an Invalidation command After the block is received, write the block and set the state to Modified Remote actions: Look up tags to see if the block is present If so, signal the local cache that we have a copy, provide it if it is in state Modified, and change the state of our copy to Invalid
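The actions above can be summarized as a per-block state machine. The C sketch below is my own condensed illustration (not the controller of any real machine, and it omits the bus signalling itself): it shows how one cache's MESI state changes on local accesses and on snooped bus events.

/* MESI state of one cached block, following the actions described above. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* Local load; shared_elsewhere = 1 if some remote cache answered the bus
   request with a copy of the block. */
mesi_t local_load(mesi_t s, int shared_elsewhere)
{
    if (s == INVALID)                       /* miss: fetch the block from the bus */
        return shared_elsewhere ? SHARED : EXCLUSIVE;
    return s;                               /* hit: state unchanged */
}

/* Local store; from Shared (or on a miss) an Invalidation command is sent
   on the bus first, and the writer's copy always ends up Modified. */
mesi_t local_store(mesi_t s)
{
    (void)s;
    return MODIFIED;
}

/* This cache snoops a bus request from a remote processor for a block it holds. */
mesi_t snoop(mesi_t s, int remote_is_store)
{
    if (remote_is_store)                    /* remote store / Invalidation command */
        return INVALID;
    if (s == MODIFIED || s == EXCLUSIVE)    /* remote load: supply the data, demote */
        return SHARED;
    return s;                               /* already Shared or Invalid */
}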
Cache coherence problem revisited (with MESI) (1) processor A reads the block: A's copy is in state Exclusive (2) processor B reads the block: both copies are now Shared (3) processor A sends an Invalidate command on the bus: B's copy becomes Invalid (4) processor A writes the block: A's copy becomes Modified, so no stale copy can be read Synchronization For parallel programs to share data, we must make sure that accesses to a given memory location are ordered Example: a database of available inventory at a department store is simultaneously accessed from different store computers; only one computer must win the race to reserve a particular item Solution: Architecture defines a special atomic swap instruction in which a memory location is tested for 0 and, if so, is set to 1 Software associates a lock variable with each piece of data whose accesses need to be ordered (e.g., a particular class of merchandise) and uses the atomic swap instruction to try to set it Software acquires the lock before modifying the associated data (e.g., reserving the merchandise) Software releases the lock by setting it to 0 when done
Uninterruptible Instruction to Fetch and Update Memory Atomic exchange: interchange a value in a register for a value in memory 0 => synchronization variable is free 1 => synchronization variable is locked and unavailable Set register to 1 & swap The new value in the register determines success in getting the lock: 0 if you succeeded in setting the lock (you were first) 1 if another processor had already claimed access Key is that the exchange operation is indivisible Test-and-set: tests a value and sets it if the value passes the test (also compare-and-swap, as on the previous slide) Fetch-and-increment: returns the value of a memory location and atomically increments it 0 => synchronization variable is free Synchronization flowchart (spin on the lock until it is acquired)
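A minimal spin lock built on the atomic exchange just described, sketched with C11 atomics as a stand-in for the architecture's swap instruction (the names acquire/release and the spinlock_t type are my own):

#include <stdatomic.h>

typedef atomic_int spinlock_t;   /* 0 => free, 1 => locked */

void acquire(spinlock_t *lock)
{
    /* Keep swapping 1 into the lock; the old value we get back is 0 only
       for the processor that actually grabbed the free lock. Others spin. */
    while (atomic_exchange_explicit(lock, 1, memory_order_acquire) == 1)
        ;
}

void release(spinlock_t *lock)
{
    atomic_store_explicit(lock, 0, memory_order_release);  /* set lock back to 0 */
}

In the department-store example, software would call acquire() on the lock for a merchandise class before reserving an item and release() afterwards.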
Synchronization and coherence example Bus (shared) or Network (switched) Network: claimed to be more scalable (no bus arbitration, point-to-point connections) but router overhead
Shared vs. Switched Media Shared media: nodes share a single interconnection medium (e.g., Ethernet, bus) Only one message sent at a time Inexpensive: one medium used by all processors Limited bandwidth: the medium becomes the bottleneck Needs arbitration Switched media: allow direct communication between source and destination nodes (e.g., ATM) Multiple users at a time More expensive: the medium must be replicated Higher total bandwidth No arbitration Added latency to go through the switch Network design parameters Large network design space: topology, degree; routing algorithm (path, path control, collision resolution, network support, deadlock handling, livelock handling); virtual layer support; flow control, buffering; QoS guarantees; error handling; etc.
Switch Topology Networks have a topology that indicates how nodes are connected Topology determines Degree: number of links from a node Diameter: max number of links crossed between nodes Average distance: number of links to a random destination Bisection: minimum number of links that separate the network into two halves Bisection bandwidth: link bandwidth x bisection Common Topologies (N = number of nodes, n = dimension)
Type        Degree     Diameter          Ave Dist        Bisection
1D mesh     2          N-1               ~N/3            1
2D mesh     4          2(N^(1/2)-1)      2N^(1/2)/3      N^(1/2)
3D mesh     6          3(N^(1/3)-1)      3N^(1/3)/3      N^(2/3)
nD mesh     2n         n(N^(1/n)-1)      nN^(1/n)/3      N^((n-1)/n)
Ring        2          N/2               N/4             2
2D torus    4          N^(1/2)           N^(1/2)/2       2N^(1/2)
Hypercube   log2 N     n = log2 N        n/2             N/2
2D tree     3          2 log2 N          ~2 log2 N       1
Crossbar    N-1        1                 1               N^2/2
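The table entries can be sanity-checked with a small helper; the sketch below (my own, with an arbitrary N) prints the metrics for a 2D mesh and a hypercube using the same formulas as the table.

#include <math.h>
#include <stdio.h>

int main(void)
{
    double N = 1024.0;            /* assumed number of nodes */
    double side = sqrt(N);        /* 2D mesh is side x side  */
    double n = log2(N);           /* hypercube dimension     */

    printf("2D mesh:   degree 4,   diameter %.0f, avg dist %.1f, bisection %.0f\n",
           2.0 * (side - 1.0), 2.0 * side / 3.0, side);
    printf("hypercube: degree %.0f, diameter %.0f, avg dist %.1f, bisection %.0f\n",
           n, n, n / 2.0, N / 2.0);
    return 0;
}

For N = 1024 this gives a diameter of 62 for the mesh but only 10 for the hypercube, which is why low-diameter topologies pay for their shorter paths with higher node degree.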
Butterfly or Omega Network (figure: an 8 x 8 butterfly switch built recursively from two N/2 butterflies) All paths are of equal length There is a unique path from any input to any output Try to avoid conflicts Multistage Fat Tree A multistage fat tree (CM-5) avoids congestion at the root node Randomly assign packets to different paths on the way up to spread the load Increase degree toward the root, decrease congestion
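The "unique path from any input to any output" property of the butterfly corresponds to destination-tag routing: at stage i the packet simply follows bit i of the destination address, so the path does not depend on the source. The sketch below is an assumed illustration of that rule, not code from any real switch.

#include <stdio.h>

/* Print the switch output taken at each stage of a butterfly/omega network
   with 2^stages inputs, routing purely by destination bits (MSB first). */
void route(int dest, int stages)
{
    for (int s = 0; s < stages; s++) {
        int bit = (dest >> (stages - 1 - s)) & 1;
        printf("stage %d: take output %d\n", s, bit);
    }
}

int main(void)
{
    route(5, 3);   /* destination 5 (binary 101) in an 8 x 8 (3-stage) butterfly */
    return 0;
}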
Example Networks
Name        Number     Topology   Bits   Clock     Link BW   Bisection BW   Year
nCUBE/ten   1-1024     10-cube    1      10 MHz    1.2       640            1987
iPSC/2      16-128     7-cube     1      16 MHz    2         345            1988
MP-1216     32-512     2D grid    1      25 MHz    3         1,300          1989
Delta       540        2D grid    16     40 MHz    40        640            1991
CM-5        32-2048    fat tree   4      40 MHz    20        10,240         1991
CS-2        32-1024    fat tree   8      70 MHz    50        50,000         1992
Paragon     4-1024     2D grid    16     100 MHz   200       6,400          1992
T3D         16-1024    3D torus   16     150 MHz   300       19,200         1993
(Link and bisection bandwidths in MBytes/second)
No standard topology! However, for on-chip networks, mesh and torus are in favor!