Multiprocessors and Multithreading

Multiprocessors
  Why would you want a multiprocessor? Is more always better?

Classifying Multiprocessors
  - Flynn Taxonomy
  - Interconnection Network Topology
  - Programming Model

Flynn Taxonomy
  SISD (Single Instruction, Single Data)
    - uniprocessors
  SIMD (Single Instruction, Multiple Data)
    - examples: Illiac-IV, CM-2, Nvidia GPUs, etc.
    - simple programming model
    - low overhead
  MIMD (Multiple Instruction, Multiple Data)
    - examples: many; nearly all modern multiprocessors or multicores
    - flexible
    - can use off-the-shelf microprocessors or microprocessor cores
  MISD (Multiple Instruction, Single Data)???
Interconnection Network Topology
  Bus: all processors (P0..P7), each with its own cache, share one bus to memory
    - UMA (Uniform Memory Access)
    - pros/cons?
  Network: each cpu/cache pairs with a local memory (M), connected by a network
    - NUMA (Non-Uniform Memory Access)
    - pros/cons?

Programming Model
  Shared memory -- every processor can name every address location
  Message passing -- each processor can name only its local memory;
    communication is through explicit messages
  pros/cons?

  Shared-memory example (a race condition):
    processor A:  index = i++;     (i starts at 47)
    processor B:  index = i++;
  Message-passing example:
    find the max of 100,000 integers on 10 processors

  Shared-memory programming requires synchronization to provide mutual
  exclusion and prevent race conditions
    - locks (semaphores)
    - barriers
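The max-of-100,000-integers workload above can be sketched in the shared-memory style. This is a minimal illustration, not from the slides, assuming Python threads stand in for the 10 processors; the lock provides the mutual exclusion the slide calls for, since updating the shared maximum is a read-modify-write just like `index = i++`.

```python
import random
import threading

# Sketch (assumption: threads model the slide's 10 processors sharing one
# address space). Each worker reduces its own chunk privately, then merges
# its local max into a shared variable under a lock.

data = [random.randrange(1_000_000) for _ in range(100_000)]
global_max = 0
lock = threading.Lock()

def worker(chunk):
    local_max = max(chunk)          # private data: no synchronization needed
    global global_max
    with lock:                      # critical section: read-modify-write
        if local_max > global_max:
            global_max = local_max

threads = [threading.Thread(target=worker, args=(data[i::10],))
           for i in range(10)]
for t in threads:
    t.start()
for t in threads:                   # join acts as a barrier before the answer
    t.join()

print(global_max)
```

Without the lock, two workers could both read the old `global_max`, and the larger update could be lost, which is exactly the race the `index = i++` example shows.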
Multiprocessor Caches (Shared Memory)
  The problem -- cache coherency: each processor's cache can hold its own
  copy of a shared location. The solution?

What Does Coherence Mean?
  Informally: any read must return the most recent write
    - too strict and very difficult to implement
  Better:
    - a processor sees its own writes to a location in the correct order
    - any write must eventually be seen by a read
    - all writes are seen in order ("serialization"): writes to the same
      location are seen in the same order by all processors
  Without these guarantees, synchronization doesn't work

Potential Solutions
  Snooping Solution (Snoopy Bus)
    - send all requests for unknown data to all processors
    - caches snoop to see if they have a copy and respond accordingly
    - requires broadcast, since caching information is at the processors
    - works well with a bus (a natural broadcast medium)
    - dominates for small-scale machines (most of the market)
  Directory-Based Schemes
    - keep track of what is being shared in one centralized place
    - distributed memory => distributed directory (avoids bottlenecks)
    - send point-to-point requests to processors
    - scales better than snooping
    - actually existed BEFORE snooping-based schemes

Basic Snoopy Protocols
  write-update
    - on each write, each cache holding that location updates its value
    - potentially requires a LOT of messages
  write-invalidate  <= most common
    - on each write, each cache holding that location invalidates the cache line
  both schemes are MUCH easier on a bus-based multiprocessor
Basic Snoopy Protocols
  Write-invalidate versus broadcast (write-update):
    - invalidate requires one transaction per write-run (a sequence of writes
      by one processor with no intervening reads by others)
    - invalidate exploits spatial locality: one transaction per block
    - broadcast has lower latency between write and read
    - broadcast trades increased bandwidth for decreased latency

An Example Snoopy Protocol
  Invalidation protocol, write-back cache
  Each block of memory is in one state:
    - clean in all caches and up-to-date in memory
    - dirty in exactly one cache
    - not in any caches
  Each cache block is in one state:
    - (S)hared: block can be read
    - (E)xclusive: cache has the only copy; it is writeable and dirty
    - (I)nvalid: block contains no data
  Read misses cause all caches to snoop
  Writes to a clean line are treated as misses

Snoopy-Cache State Machine
  Transitions on requests from the CPU:
    - Invalid -> Exclusive: CPU write; place write miss on bus
    - Invalid -> Shared: CPU read; place read miss on bus
    - Shared -> Exclusive: CPU write; place write miss on bus
    - Shared: CPU read hit; a CPU read miss places a read miss on the bus
    - Exclusive: CPU read/write hit; a CPU read miss writes back the block and
      places a read miss on the bus; a CPU write miss writes back the block
      and places a write miss on the bus
  Transitions on requests from the bus:
    - Exclusive -> Invalid: write miss for this block; write back the block,
      aborting the memory access
    - Exclusive -> Shared: read miss for this block; write back the block,
      aborting the memory access
    - Shared -> Invalid: write miss for this block

Other Protocols (more common)
  ESI = Exclusive, Shared, Invalid
  MESI = Modified, Exclusive, Shared, Invalid
    - Exclusive = private, clean
    - Modified = private, dirty
  MOESI = Modified, Owned, Exclusive, Shared, Invalid
    - Owned = responsible for writing back a shared, dirty line
  How does MESI avoid sending messages across the bus (compared to ESI)?
  What traffic does MOESI avoid?
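The MESI transitions above can be exercised with a small simulation. This is a simplified sketch of my own, not the slides' exact diagram: it tracks one block across three caches, with writes invalidating all other copies and reads downgrading any owner to Shared. The point it demonstrates is MESI's answer to the question above: a read that finds no other sharers installs the line Exclusive (private, clean), so a later write upgrades E -> M silently, with no bus message.

```python
# Simplified MESI model for a single memory block (assumption: an atomic
# bus, and writes always gain ownership in one step).

M, E, S, I = "Modified", "Exclusive", "Shared", "Invalid"

class Cache:
    def __init__(self):
        self.state = I

def access(caches, who, op):
    """Apply a read ('r') or write ('w') by cache `who`; the others snoop."""
    me = caches[who]
    if op == "w":
        # writing needs exclusive ownership: invalidate all other copies
        for i, c in enumerate(caches):
            if i != who:
                c.state = I
        me.state = M
    else:  # read
        if me.state in (M, E, S):
            return                      # read hit: no bus traffic
        others = any(c.state != I for c in caches)
        for c in caches:
            if c.state in (M, E):       # an owner supplies data, downgrades
                c.state = S
        me.state = S if others else E   # alone => Exclusive (clean), so a
                                        # later write is a silent E -> M

caches = [Cache(), Cache(), Cache()]
access(caches, 0, "r")                 # P0 reads alone -> Exclusive
access(caches, 0, "w")                 # E -> M with no bus message
access(caches, 1, "r")                 # P1 read miss: P0 writes back, shares
access(caches, 2, "w")                 # P2's write miss invalidates P0, P1
print([c.state for c in caches])       # final states after the sequence
```

Under plain ESI, P0's first read would have installed the line dirty/writeable immediately or forced the later write onto the bus; MESI's clean Exclusive state removes that bus transaction for the common read-then-write pattern on private data.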
Cache Coherency
  How do you know when an external (not this processor core) read/write occurs?
    - snooping protocols: every cache observes all bus traffic
    - directory protocols: a directory at each memory node tracks sharers and
      forwards requests over the network

Hardware Multithreading
  A multithreaded CPU replicates the architectural register state (one set of
  regs per thread context) on top of an otherwise conventional core

Superscalar Execution
  [figure: issue slots over time (proc cycles); a conventional superscalar
  fills its issue slots from a single instruction stream]
Superscalar Execution with Fine-Grain Multithreading
  [figure: issue slots over time (proc cycles) for threads 1-3]
    - vertical waste: cycles in which no instruction issues
    - horizontal waste: unused issue slots within a cycle
  Fine-grain multithreading, issuing from a different thread each cycle,
  removes vertical waste but not horizontal waste

Simultaneous Multithreading
  [figure: issue slots over time for threads 1-5]
  SMT fills issue slots in the same cycle from multiple threads, attacking
  both vertical and horizontal waste

SMT Performance
  [graph: throughput (instructions per cycle, 0-7) versus number of threads
  (1-8) for simultaneous multithreading, fine-grain multithreading, and a
  conventional superscalar]
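The issue-slot pictures can be approximated with a toy throughput model. This is my own sketch, not the slides' data: it assumes an 8-wide machine and threads that can each supply only 0-3 issuable instructions per cycle (limited ILP), then compares a single-threaded superscalar against an SMT that pools all threads' instructions into the same cycle's slots.

```python
import random

# Toy model (assumptions: 8 issue slots, per-thread supply uniform in 0..3,
# no structural or memory constraints beyond slot count).

WIDTH, CYCLES = 8, 10_000
random.seed(0)

def supply():
    return random.randint(0, 3)     # instructions one thread can issue

def superscalar_ipc():
    # one thread: a cycle with supply 0 is vertical waste, and even a good
    # cycle leaves most of the 8 slots empty (horizontal waste)
    return sum(min(supply(), WIDTH) for _ in range(CYCLES)) / CYCLES

def smt_ipc(nthreads):
    # all threads share the slots each cycle, so one thread's empty slots
    # are filled by another thread's instructions
    total = 0
    for _ in range(CYCLES):
        total += min(sum(supply() for _ in range(nthreads)), WIDTH)
    return total / CYCLES

ipc_ss = superscalar_ipc()
ipc_smt = smt_ipc(8)
print(f"superscalar {ipc_ss:.2f} IPC, 8-thread SMT {ipc_smt:.2f} IPC")
```

With these assumptions the superscalar averages about 1.5 IPC while 8-way SMT approaches the full machine width, which is the shape of the SMT performance graph above.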
Goals
  We had three primary goals for this architecture:
    1. Minimize the architectural impact on conventional superscalar design
    2. Minimize the performance impact on a single thread
    3. Achieve significant throughput gains with many threads

A Conventional Superscalar Architecture
  [figure: fetch unit + instruction cache -> 8-wide decode -> register
  renaming -> floating-point and integer instruction queues -> fp and integer
  register files -> fp and int/ld-store units -> data cache]
    - fetch up to 8 instructions per cycle
    - out-of-order, speculative execution
    - issue 3 floating-point, 6 integer instructions per cycle

An SMT Architecture
  [figure: the same pipeline (fetch, decode, renaming, queues, register
  files, functional units), now shared by all threads]

Performance of the Naïve Design
  [graph: throughput (instructions per cycle, 1-5) versus number of threads
  (2-8) for the unmodified superscalar design running multiple threads]
Bottlenecks of the Baseline Architecture
  Instruction queue full conditions (12-21% of cycles)
    - lack of parallelism in the queue
  Fetch throughput (4.2 instructions per cycle when the queue is not full)

Improving Fetch Throughput
  The fetch unit in an SMT architecture has two distinct advantages over a
  conventional architecture:
    - it can fetch from multiple threads at once
    - it can choose which threads to fetch

Improved Fetch Performance
  Fetching from 2 threads/cycle achieved most of the benefit of
  multiple-thread fetch
  Fetching from the thread(s) which have the fewest unissued instructions
  in-flight significantly increases parallelism and throughput
  (the ICOUNT fetch policy)
  [graph: instructions per cycle (1-5) versus number of threads (2-8) for
  the improved design versus the unmodified superscalar]
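The thread-selection step of ICOUNT is simple enough to sketch. This is a simplified version of the idea, not the paper's exact mechanism: each cycle, fetch from the threads with the fewest unissued instructions already in flight, so fetch bandwidth goes to threads that are draining the queues rather than to a stalled thread that is clogging them.

```python
# Sketch (assumption: `in_flight[t]` counts thread t's unissued queued
# instructions, and we fetch from the n least-loaded threads each cycle).

def icount_pick(in_flight, n=2):
    """Return the indices of the n threads with the fewest unissued
    instructions in flight, in priority order."""
    order = sorted(range(len(in_flight)), key=lambda t: in_flight[t])
    return order[:n]

# Thread 2 is stalled with many queued instructions; the policy favors
# threads 3 and 0, whose queue occupancy is lowest:
print(icount_pick([5, 12, 32, 3]))   # [3, 0]
```

This directly targets both baseline bottlenecks: it keeps a stalled thread from filling the instruction queue, and it fills the fetch slots from threads that can actually use them.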
This SMT Architecture, then:
  - borrows heavily from conventional superscalar design
  - minimizes the impact on single-thread performance
  - achieves significant throughput gains over the superscalar
    (2.5X, up to 5.4 IPC)

Multithreaded Processors
  Coarse-grain multithreading (Alewife, MIT)
    - context switch at long-latency operations (e.g., cache misses)
  Fine-grain multithreading (Tera supercomputer, Sun Niagara T1)
    - context switch every cycle
  Simultaneous multithreading (SMT)
    - execute instructions from multiple threads in the same cycle
    - only different from fine-grain multithreading in the context of
      superscalar execution
    - requires surprisingly few changes to a conventional out-of-order
      superscalar processor
    - was announced to be featured in the next Compaq Alpha processor
      (21464), but that processor was never completed
    - introduced in the Intel Pentium 4, announced as Hyper-Threading
      technology (HT Technology)
    - IBM Power 5 and 6 have 2 cores, each 2-way SMT
    - Intel Nehalem multicore has 2 threads/core

Multi-Core Processors (aka Chip Multiprocessors)
  Multiple cores on the same die; may or may not share an L2 cache
  Intel and AMD both have quad-core processors
  Sun Niagara T2 is 8 cores x 8 threads (64 contexts!)
  Everyone's roadmap seems to be increasingly multi-core

Why Multicore, Multithreaded?
  [graph: performance versus number of threads]
Intel Nehalem, Summary
  First instantiation: Intel Core i7
  Up to 8 cores (i7: 4 cores)
  2 SMT threads per core
  20+ stage pipeline
  x86 instructions translated to RISC-like uops
  Superscalar: 4 instructions (uops) per cycle (more with fusing)
  Caches (i7):
    - 32KB 4-way set-associative I-cache per core
    - 32KB 8-way set-associative D-cache per core
    - 256KB unified 8-way set-associative L2 cache per core
    - 8MB shared 16-way set-associative L3 cache

Multiprocessors -- Key Points
  Network vs. bus
  Message passing vs. shared memory
    - shared memory is more intuitive, but creates problems for both the
      programmer (memory consistency, requiring synchronization) and the
      architect (cache coherency)
  Multithreading gives the illusion of multiprocessing (including, in many
  cases, the performance) with very little additional hardware
  When multiprocessing happens within a single die/processor, we call that a
  chip multiprocessor, or a multi-core architecture