Computer Science 246. Computer Architecture


1 Computer Architecture Spring 2010 Harvard University Instructor: Prof. Multiprocessing

2 Multiprocessors: Outline
Multiprocessors: computation taxonomy, interconnection overview, shared memory design criteria, cache coherence (very broad overview)
Reading: H&P Chapter 6 (mostly), 8.5
Next time: More Multiprocessors + Multithreading

3 Parallel Processing
Multiple processors working cooperatively on problems: not the same as multiprogramming
Goals/Motivation
Performance: uniprocessors are hitting the limits of ILP (branch prediction, RAW dependencies, memory)
Cost efficiency: build big systems with commodity parts (uniprocessor cores)
Scalability: just add more cores to get more performance
Fault tolerance: if one processor fails, the others can continue processing

4 Parallel Processing
Sounds great, but as usual software is the bottleneck
Difficult to parallelize applications: compiler parallelization is hard (huge research efforts in the 80s/90s); by-hand parallelization is harder, expensive, and error-prone
Difficult to make parallel applications run fast: communication is expensive, which makes the first task even harder
Inherent problems: insufficient parallelism in some apps, long-latency remote communication

5 Recall Amdahl's Law
Speedup_overall = 1 / [(1 - Frac_enhanced) + Frac_enhanced / Speedup_enhanced]
Example: achieve a speedup of 80x using 100 processors
80 = 1 / [Frac_parallel/100 + (1 - Frac_parallel)]  =>  Frac_parallel ~= 0.9975
=> only 0.25% of the work can be serial!
Assumes linear speedup; superlinear speedup is possible in some cases (increased memory + cache with increased processor count)
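A quick way to sanity-check numbers like the 80x example is to plug them into the formula directly. The snippet below is a minimal C sketch; the function and variable names are mine, not from the slides.

#include <stdio.h>

/* Overall speedup from Amdahl's Law for a given parallel fraction and
   processor count (assumes the parallel part speeds up linearly). */
static double amdahl_speedup(double frac_parallel, double n_procs) {
    return 1.0 / ((1.0 - frac_parallel) + frac_parallel / n_procs);
}

int main(void) {
    /* The slide's example: 100 processors, 99.75% of the work parallel. */
    printf("speedup = %.1f\n", amdahl_speedup(0.9975, 100.0));  /* ~80x */
    return 0;
}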

6 Parallel Application Domains Traditional Parallel Programs True parallelism in one job Regular loop structures Data usually tightly shared Automatic parallelization Called data-level parallelism Can often exploit vectors as well Workloads Scientific simulation codes (FFT, weather, fluid dynamics) Dominant market segment in the 1980s

7 Parallel Program Example: Matrix Multiply
Parameters: matrix size N*N, P processors; each processor computes an (N/P^1/2) * (N/P^1/2) block
Cij needs Aik, Bkj for k = 0 to N-1
Growth functions:
Computation: f(N^3); computation per processor: f(N^3/P)
Data size: f(N^2); data size per processor: f(N^2/P)
Communication: f(N^2/P^1/2)
Computation/Communication: f(N/P^1/2)
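To make the per-processor numbers concrete, here is a rough C sketch of what one of the P processors computes when C is split into sqrt(P) x sqrt(P) blocks. The function name, arguments, and the simple triple loop are my illustration, not code from the course.

#include <math.h>

/* Each processor owns one (N/sqrt(P)) x (N/sqrt(P)) block of C.
   It reads a full block-row of A and block-column of B, which is where
   the f(N^2 / P^1/2) communication term on the slide comes from. */
void matmul_block(int N, int P, int my_row, int my_col,
                  const double *A, const double *B, double *C) {
    int b  = N / (int)sqrt((double)P);     /* block edge length        */
    int r0 = my_row * b, c0 = my_col * b;  /* top-left of my C block   */
    for (int i = r0; i < r0 + b; i++)
        for (int j = c0; j < c0 + b; j++) {
            double sum = 0.0;
            for (int k = 0; k < N; k++)    /* row of A, column of B    */
                sum += A[i * N + k] * B[k * N + j];
            C[i * N + j] = sum;
        }
}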

8 Application Domains Parallel independent but similar tasks Irregular control structures Loosely shared data locked at different granularities Programmer defines and fine tunes parallelism Cannot exploit vectors Thread-level Parallelism (throughput is important) Workloads Transaction Processing, databases, web-servers Dominant MP market segment today

9 Database Application Example: Bank Database
Parameters: D = number of accounts, P = number of processors in the server, N = number of ATMs (parallel transactions)
Growth functions:
Computation: f(N); computation per processor: f(N/P)
Communication? Lock records while changing them
Communication: f(N); computation/communication: f(1)

10 Computation Taxonomy Proposed by Flynn (1966) Dimensions: Instruction streams: single (SI) or multiple (MI) Data streams: single (SD) or multiple (MD) Cross-product: SISD: uniprocessors (chapters 3-4) SIMD: multimedia/vector processors (MMX, HP-MAX) MISD: no real examples MIMD: multiprocessors + multicomputers (ch. 6)

11 SIMD vs. MIMD (MP vs. SIMD)
Programming model flexibility: could simulate vectors with MIMD but not the other way around; the dominant market segment cannot use vectors
Cost effectiveness: commodity parts are high-volume (cheap) components; MPs can be made up of many uniprocessors; allows easy scaling from small to large systems

12 Taxonomy of Parallel (MIMD) Processors
Centers on the organization of main memory: shared vs. distributed
Appearance of memory to hardware. Q1: is memory access latency uniform?
Shared (UMA): yes, doesn't matter where data goes
Distributed (NUMA): no, makes a big difference
Appearance of memory to software. Q2: can processors communicate directly via memory?
Shared (shared memory): yes, communicate via load/store
Distributed (message passing): no, communicate via messages
The two dimensions are orthogonal, e.g. DSM: (physically) distributed, (logically) shared memory

13 UMA vs. NUMA: Why it matters
[Figure: processors p0-p3 sharing a single low-latency memory]
Ideal model: perfect (single-cycle) memory latency, perfect (infinite) memory bandwidth
Real systems: latencies are long and grow with system size; bandwidth is limited
Add memory banks and an interconnect to hook them up (latency goes up)

14 UMA vs. NUMA: Why it matters
[Figure: in UMA, p0-p3 reach m0-m3 through one interconnect (long latency to every memory); in NUMA, each processor has a local memory (short latency) and uses the interconnect only for remote memories (long latency)]
UMA: uniform memory access. From p0, same latency to m0-m3; data placement doesn't matter; latency gets worse as the system scales; interconnect contention restricts bandwidth; small multiprocessors only
NUMA: non-uniform memory access. From p0, faster to m0 than to m1-m3; low latency to local memory helps performance; data placement is important (software!); less contention => more scalable; large multiprocessor systems

15 Interconnection Networks Need something to connect those processors/memories Direct: endpoints connect directly Indirect: endpoints connect via switches/routers Interconnect issues Latency: average latency more important than max Bandwidth: per processor Cost: #wires, #switches, #ports per switch Scalability: how latency, bandwidth, cost grow with processors Interconnect topology primarily concerns architects

16 Interconnect 1: Bus
Direct interconnect style (all processors p0-p3 and memories m0-m3 attach to one shared bus)
+ Cost: f(1) wires
+ Latency: f(1)
- Bandwidth: not scalable, f(1/P); only works in small systems (<=4)
+ Only interconnect that performs ordered broadcast (which snooping coherence protocols rely on)

17 Interconnect 2: Crossbar Switch
Indirect interconnect (p0-p3 on one side, m0-m3 on the other); switch implemented as big MUXes
+ Latency: f(1)
+ Bandwidth: f(1)
- Cost: f(2P) wires, f(P^2) switches, 4 wires per switch

18 Interconnect 3: Multistage Network
Indirect interconnect (p0-p3 reach m0-m3 through d stages of switches); routing done by address decoding
k: switch arity (#inputs/#outputs per switch); d: number of network stages = log_k P
+ Cost: f(d*P/k) switches, f(P*d) wires, f(k) wires per switch
+ Latency: f(d)
+ Bandwidth: f(1)
Commonly used in large UMA systems
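Routing by address decoding simply peels one base-k digit of the destination off at each stage. A hedged C sketch of that idea for a standard delta/butterfly-style network (the helper is my own, not from the slides):

/* In a k-ary multistage network with d = log_k(P) stages, the switch at
   stage s (0 = first stage) picks its output port from digit s of the
   destination address written in base k, most significant digit first. */
int output_port(int dest, int k, int d, int stage) {
    int divisor = 1;
    for (int i = 0; i < d - 1 - stage; i++)
        divisor *= k;
    return (dest / divisor) % k;
}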

19 Interconnect 4: 2D Torus
Direct interconnect (a 4x4 grid of processor/memory nodes with wraparound links)
+ Cost: f(2P) wires, 4 wires per switch
+ Latency: f(P^1/2)
+ Bandwidth: f(1)
Good scalability (widely used); variants: 1D (ring), 3D, mesh (no wraparound)

20 Interconnect 5: Hypercube
Direct interconnect (figure: 16 processor/memory nodes connected as a hypercube)
k: arity (#nodes per dimension); d: dimension = log_k P
+ Latency: f(d)
+ Bandwidth: f(k*d)
- Cost: f(k*d*P) wires, f(k*d) wires per switch
Good scalability, but expensive switches; a new design is needed to add more nodes

21 Interconnect Routing
Store-and-forward routing: switch buffers the entire message before passing it on
Latency = [(message length / bandwidth) + fixed switch overhead] * #hops
Wormhole routing: message is pipelined through the interconnect; a switch passes the message on before it has completely arrived
Latency = (message length / bandwidth) + (fixed switch overhead * #hops)
No buffering needed at the switch; latency (relatively) independent of the number of intermediate hops
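The two latency formulas are easy to compare numerically; a small C sketch (the parameter names are my own):

/* msg / bw gives the serialization time of the message; 'ovh' is the
   fixed per-switch overhead; 'hops' is the number of switches crossed. */
double store_and_forward_latency(double msg, double bw, double ovh, int hops) {
    return (msg / bw + ovh) * hops;
}

double wormhole_latency(double msg, double bw, double ovh, int hops) {
    return msg / bw + ovh * hops;   /* serialization cost paid once, not per hop */
}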

22 Shared Memory vs. Message Passing (MIMD: appearance of memory to software)
Message passing (multicomputers)
Each processor has its own address space; processors send and receive messages to and from each other
Communication patterns are explicit and precise; explicit messaging forces the programmer to optimize communication
Used for scientific codes (explicit communication)
Message passing libraries: PVM, MPI (OpenMP, by contrast, is a shared-memory programming model)
Simple hardware, difficult programming model

23 Shared Memory vs. Message Passing Shared Memory (multiprocessors) One shared address space Processors use conventional load/stores to access shared data Communication can be complex/dynamic Simpler programming model (compatible with uniprocessors) Hardware controlled caching is useful to reduce latency + contention Has drawbacks Synchronization (discussed later) More complex hardware needed

24 Parallel Systems (80s and 90s)
Machine        Communication   Interconnect   #cpus   Remote latency (us)
SPARCcenter    Shared memory   Bus            <=20    1
SGI Challenge  Shared memory   Bus            <=32    1
CRAY T3D       Shared memory   3D Torus
Convex SPP     Shared memory   X-bar/ring
KSR-1          Shared memory   Bus/ring
TMC CM-5       Messages        Fat tree
Intel Paragon  Messages        2D mesh
IBM SP-2       Messages        Multistage
Trend towards shared memory systems

25 Multiprocessor Trends
Shared memory: easier, more dynamic programming model; can do more to optimize the hardware
Small-to-medium size UMA systems (2-8 processors): processor + memory + switch on a single board (4x Pentium); single-chip multiprocessors (POWER4); commodity parts => soon glueless MP systems
Larger NUMAs built from smaller UMAs: use commodity small UMAs with commodity interconnects (Ethernet, Myrinet) => NUMA clusters

26 Computation Taxonomy (roadmap)
SISD => ILP
SIMD => vectors, MM-ISAs
MISD
MIMD => shared memory vs. message passing; UMA vs. NUMA; cache coherence (write invalidate/update? snooping? directory?); synchronization; memory consistency

27 Major Issues for Shared Memory
Cache coherence: common-sense requirements
P1 Write[X] => P1 Read[X] returns the value P1 wrote (if there are no intervening writes)
P1 Write[X] => P2 Read[X] (sufficiently later) returns the value written by P1
P1 Write[X] => P2 Write[X]: writes are serialized (all processors see the writes in the same order)
Synchronization: atomic read/write operations
Memory consistency model: when will a written value be seen? P1 Write[X], (10 ps later) P2 Read[X]: what happens?
These are not issues for message passing systems. Why?

28 Cache Coherence
Benefits of coherent caches in parallel systems? Migration and replication of shared data
Migration: data moved locally lowers latency + main memory bandwidth
Replication: data being simultaneously read can be replicated to reduce latency + contention
Problem: sharing of writeable data
Processor 0   Processor 1   Correct value of A in:
                            Memory
Read A                      Memory, p0 cache
              Read A        Memory, p0 cache, p1 cache
Write A                     p0 cache (and memory, if write-through)
              Read A        p1 gets a stale value on a cache hit!

29 Solutions to Cache Coherence No caches Not likely :) Make shared data non-cacheable Simple software solution low performance if lots of shared data Software flush at strategic times Relatively simple, but could be costly (frequent syncs) Hardware cache coherence Make memory and caches coherent/consistent with each other

30 HW Coherence Protocols Absolute coherence All copies of each block have same data at all times A little bit overkill Need appearance of absolute coherence Temporary incoherence is ok Similar to write-back cache As long as all loads get their correct value Coherence protocol: FSM that runs at every cache Invalidate protocols: invalidate copies in other caches Update protocols: update copies in other caches Snooping vs. Directory based protocols (HW implementation) Memory is always updated

31 Write Invalidate
Much more common for most systems
Mechanics: broadcast the address of the cache line to invalidate; all processors snoop until they see it, then invalidate the line if it is in their local cache
The same policy can be used to service cache misses in write-back caches
Processor activity    Bus activity           CPU A's cache   CPU B's cache   Memory location X
                                                                             0
CPU A reads X         Cache miss for X       0                               0
CPU B reads X         Cache miss for X       0               0               0
CPU A writes 1 to X   Invalidation for X     1                               0
CPU B reads X         Cache miss for X       1               1               1

32 Write Update (Broadcast)
Processor activity    Bus activity             CPU A's cache   CPU B's cache   Memory location X
                                                                               0
CPU A reads X         Cache miss for X         0                               0
CPU B reads X         Cache miss for X         0               0               0
CPU A writes 1 to X   Write broadcast for X    1               1               1
CPU B reads X                                  1               1               1
Bandwidth requirements are excessive; they can be reduced by checking whether a word is shared (which also reduces write-invalidate traffic)
Comparison with write invalidate:
Multiple writes with no intervening reads require multiple broadcasts
Multiple broadcasts are needed when writing multiple words of a cache line (only one invalidate)
Advantage: other processors can get the data faster

33 Bus-based protocols (Snooping) Snooping All caches see and react to all bus events Protocol relies on global visibility of events (ordered broadcast) Events: Processor (events from own processor) Read (R), Write (W), Writeback (WB) Bus Events (events from other processors) Bus Read (BR), Bus Write (BW)

34 Three-State Invalidate Protocol
Three states:
Invalid: don't have the block
Exclusive: have the block and have written it
Shared: have the block but have only read it
[Figure: state diagram; edges are labeled with CPU activity (read/write hits and misses, "put read/write miss on bus") and bus activity (read/write miss for the block, forcing an invalidate and/or write-back of the block)]
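The three-state protocol on this slide can be written as a small state machine. Below is a rough C sketch of the transitions described above; the state and event names are mine, bus requests and write-backs are only noted in comments, and real controllers handle many more corner cases (as slide 36 points out).

typedef enum { INVALID, SHARED, EXCLUSIVE } LineState;
typedef enum { CPU_READ, CPU_WRITE, BUS_READ_MISS, BUS_WRITE_MISS } Event;

/* Returns the next state of one cache line given a processor or bus event. */
LineState next_state(LineState s, Event e) {
    switch (s) {
    case INVALID:
        if (e == CPU_READ)  return SHARED;      /* put read miss on bus   */
        if (e == CPU_WRITE) return EXCLUSIVE;   /* put write miss on bus  */
        return INVALID;
    case SHARED:
        if (e == CPU_WRITE)      return EXCLUSIVE; /* write miss on bus   */
        if (e == BUS_WRITE_MISS) return INVALID;   /* another CPU writes  */
        return SHARED;                             /* read hits, bus reads */
    case EXCLUSIVE:
        if (e == BUS_READ_MISS)  return SHARED;    /* write back block    */
        if (e == BUS_WRITE_MISS) return INVALID;   /* write back block    */
        return EXCLUSIVE;                          /* own read/write hits */
    }
    return s;
}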

35 Three-State Invalidate Protocol: Example
[Figure: the state diagram for Processor 1 and Processor 2, with each operation below marked on the transition it causes]
A: P1 Read   A1: P2 Read   B: P1 Write   C: P2 Read   D: P1 Write   E: P2 Write   F-Z: P2 Write

36 Three-State Invalidate: Problems
Real implementations are much more complicated
We assumed all operations are atomic, i.e. a multiphase operation (e.g. write miss detected, acquire bus, receive response) completes with no intervening operations
Even read misses are non-atomic with split-transaction buses; deadlock is a possibility (see Appendix I)
Other simplifications: write hits and misses treated the same; no distinction between true sharing and a clean block present in only one cache; other caches could supply data on misses to a shared block

37 Scalable Coherence Protocols: Directory-Based Systems
Bus-based systems are not scalable: not enough bus bandwidth for everyone's coherence traffic, and not enough processor snooping bandwidth to handle everyone's traffic
Directories: scalable cache coherence for large MPs
Each memory entry (cache line) has a bit vector (1 bit per processor) tracking which processors have cached copies of the line
Send all requests to the memory directory: if there are no other cached copies, memory returns the data; otherwise memory forwards the request to the correct processor
+ Low bandwidth consumption (communicate only with the relevant processors)
+ Works with a general interconnect (bus not needed)
- Longer latency (3-hop transaction: p0 => directory => p1 => p0)
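A per-line directory entry is little more than a sharer bit vector plus a state. A minimal C sketch for up to 64 processors (the field and state names are assumptions of mine, not from any particular machine):

#include <stdint.h>

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } DirState;

/* One entry per memory line: which processors hold a copy, and in what
   state.  On a request, memory either replies directly or forwards to
   the owner / sends invalidations to the current sharers. */
typedef struct {
    DirState state;
    uint64_t sharers;   /* bit i set => processor i has a cached copy */
} DirEntry;

static inline void add_sharer(DirEntry *e, int cpu)      { e->sharers |= (1ULL << cpu); }
static inline int  is_sharer(const DirEntry *e, int cpu) { return (e->sharers >> cpu) & 1; }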

38 Multiprocessor Performance
OLTP: TPC-C (Oracle), CPI ~ 7.0
DSS: TPC-D, CPI ~ 1.61
AV: AltaVista, CPI ~ 1.3
A huge fraction of time is spent in memory and I/O for TPC-C

39 Coherence Protocols: Performance (OLTP workload)
[Figure: misses per 1000 instructions vs. cache block size]
As block size is increased, coherence misses increase (false sharing)
False sharing: sharing of different data in the same cache line; the block is shared, but the data word is not
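False sharing is easy to provoke: two threads that only ever touch their own counter still invalidate each other's cache line if the counters sit in the same line. A hedged POSIX-threads sketch (the 64-byte line size and the padding trick are assumptions about the target machine):

#include <pthread.h>

/* Both counters land in one cache line: every increment by one thread
   invalidates the other thread's copy (coherence misses), even though no
   data is logically shared.  Uncommenting the pad separates the lines. */
struct counters {
    long a;
    /* char pad[64]; */   /* padding would push b into its own cache line */
    long b;
} shared;

void *worker_a(void *arg) { for (long i = 0; i < 100000000; i++) shared.a++; return arg; }
void *worker_b(void *arg) { for (long i = 0; i < 100000000; i++) shared.b++; return arg; }

int main(void) {
    pthread_t ta, tb;
    pthread_create(&ta, 0, worker_a, 0);
    pthread_create(&tb, 0, worker_b, 0);
    pthread_join(ta, 0);
    pthread_join(tb, 0);
    return 0;
}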

40 Coherence Protocols: Performance
3C miss model => 4C miss model: capacity, compulsory, conflict, plus coherence (additional misses due to the coherence protocol); complicates uniprocessor cache analysis
As cache size is increased, capacity misses decrease, but coherence misses may increase (more shared data is cached)

41 Coherence Protocols: Performance As processors are added Coherence misses increase (more communication)

42 Today's multiprocessor trend: single-chip multiprocessors, e.g. IBM POWER4
One chip contains two 1.3 GHz processor cores, the L2, and the L3 tags
Can connect 4 chips on 1 MCM to create an 8-way system
Targets threaded server workloads and high-performance computing
[Figure: chip floorplan with Core0, Core1, L2 tags/cache, L3 tags, bus/I/O; the L3 cache itself is off-chip]

43 Single-Chip MP: The Future, but how many cores?
Some companies are betting big on it
Sun's Niagara (8 cores per chip)
Intel's rumored Tukwila or Tanglewood processor (8 or 16?)
Basis in the Piranha research project (ISCA 2000)

44 Multithreading
Another trend: multithreaded processors
Processor utilization = IPC / processor width; decreases as width increases (~50% on a 4-wide machine)
Why? Cache misses, branch mispredicts, RAW dependencies
Idea: two (or more) processes (threads) share one pipeline
Replicate per-thread state: PC, register file, bpred, page table pointer, etc. (~5% area overhead)
One copy of stateless (naturally tagged) structures: caches, functional units, buses, etc.
Switching between hardware thread contexts must be fast: multiple on-chip contexts => no need to load state from memory

45 Multithreading Paradigms
[Figure: issue-slot diagrams for superscalar, coarse-grained MT, fine-grained MT, and simultaneous MT]
Superscalar: Pentium 3, Alpha EV6/7
Coarse-grained MT: IBM Pulsar
Simultaneous MT: Intel P4-HT, Alpha EV8, POWER5

46 Coarse- vs. Fine-Grained MT
Coarse-grained: makes sense for in-order/shorter pipelines; switch threads on long stalls (L2 cache misses); threads don't interfere with each other much; can't improve utilization on L1 misses/branch mispredicts
Fine-grained: out-of-order, deep pipelines; instructions from multiple threads are in a stage at the same time, miss or not; improves utilization in all scenarios; individual thread performance suffers due to interference

47 Pentium 4 Front-End
Front-end resources arbitrate between threads every cycle
ITLB, RAS, BHB (global history), and decode queue are duplicated

48 Pentium 4 Backend
Some queues and buffers are partitioned (only half the entries per thread)
The scheduler is oblivious to instruction thread IDs (but there is a limit on the number of entries per thread per scheduler)

49 Multithreading Performance
Configuration: 9-stage pipeline, 128KB I/D cache, 6 integer ALUs (4 load/store), 4 FP ALUs, 8 SMT threads, 8-instruction fetch (from 2 threads)
Intel claims more like a 20-30% increase on the P4
Debate still exists on the performance benefits

50 Synchronization
Synchronization: an important issue for shared memory; regulates access to shared data
E.g. semaphore, monitor, critical section (s/w constructs)
Synchronization primitive: lock. Well-written parallel programs synchronize on shared data:
acquire(lock);    // while (lock != 0); lock = 1;
critical section;
release(lock);    // lock = 0;
In assembly:
0: ldw r1, lock   // wait for lock to be free
1: bnez r1, #0
2: stw #1, lock   // acquire lock
   // critical section
9: stw #0, lock   // release lock

51 Implementing Locks
This is called a spin lock, and it doesn't work quite right:
Processor 0                           Processor 1
0: ldw r1, lock
1: bnez r1, #0   // p0 sees lock free
                                      0: ldw r1, lock
                                      1: bnez r1, #0   // p1 sees lock free
2: stw #1, lock  // p0 acquires lock
9: stw #0, lock
                                      2: stw #1, lock  // p1 acquires lock
                                      // p0 and p1 in critical section together

52 Implementing Locks
Problem: the acquire sequence (load-test-store) is not atomic
Solution: the ISA provides an atomic lock-acquire operation: load + check + store in one instruction (uninterruptible by definition)
E.g. test&set instruction (also fetch-and-increment, exchange)
t&s r1, lock      // ldw r1, lock; stw #1, lock
0: t&s r1, lock
1: bnez r1, 0
2: ...            // critical section
3: stw #0, lock   // release
Lock-release is already atomic
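On current hardware the same idea is exposed through atomic operations; below is a minimal C11 spin lock built on an atomic test-and-set. This is my sketch of the concept using the standard <stdatomic.h> primitives, not code from the slides.

#include <stdatomic.h>

static atomic_flag lock_flag = ATOMIC_FLAG_INIT;

/* test&set in a loop: keep trying until the old value was 0 (free). */
void acquire(void) {
    while (atomic_flag_test_and_set_explicit(&lock_flag, memory_order_acquire))
        ;  /* spin */
}

/* The release really is a single ordinary-looking store. */
void release(void) {
    atomic_flag_clear_explicit(&lock_flag, memory_order_release);
}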

53 More Atomic Instructions
All atomic instructions can implement semaphores: Fetch-and-Add (a generalization of T&S), Compare-and-Swap (CAS), Load-Linked and Store-Conditional (LL/SC)
Only CAS and LL/SC can implement lock-free and wait-free algorithms (hard)
Semaphores can deadlock if a thread holding the lock is swapped out
Lock-free: a thread does not require the lock and always makes progress
Wait-free: every operation completes in a finite number of steps
See proof: M. P. Herlihy. Wait-free synchronization. ACM Transactions on Programming Languages and Systems, 13(1), January 1991.
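As one example of what CAS buys you, a lock-free increment simply retries until its compare-and-swap succeeds; no thread ever holds a lock, so a descheduled thread cannot block the others. A C11 sketch of mine using the standard compare-exchange primitive:

#include <stdatomic.h>

atomic_int counter;

/* Lock-free increment: read the old value, try to swap in old+1, and
   retry if another thread changed the counter in between. */
void increment(void) {
    int old = atomic_load(&counter);
    while (!atomic_compare_exchange_weak(&counter, &old, old + 1))
        ;  /* on failure, 'old' was refreshed with the current value; try again */
}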

54 Load-Linked/Store-Conditional
Example from Prof. Bruning, Universität Mannheim

55 HW Implementation Issues

56 Memory Ordering: Consistency
Memory updates may become reordered by the memory system
Processor 0            Processor 1
A = 0;                 B = 0;
A = 1;                 B = 1;
L1: if (B == 0)        L2: if (A == 0)
    critical section       critical section
Intuitively it is impossible for both to be in the critical section, BUT if memory ops are reordered it could happen
Coherence only says that A's and B's values must become the same everywhere eventually; it doesn't specify the relative timing
See: Sarita V. Adve, Kourosh Gharachorloo, Shared Memory Consistency Models: A Tutorial, IEEE Computer, 1995.
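The same example written with C11 atomics makes the ordering requirement explicit: with the default sequentially consistent operations the "impossible" outcome cannot occur, whereas plain variables (or relaxed ordering) would allow it. This is a sketch of mine, not course code.

#include <stdatomic.h>

atomic_int A, B;   /* both start at 0 */

void processor0(void) {
    atomic_store(&A, 1);              /* sequentially consistent store */
    if (atomic_load(&B) == 0) {
        /* critical section */
    }
}

void processor1(void) {
    atomic_store(&B, 1);
    if (atomic_load(&A) == 0) {
        /* critical section: under SC, at most one of the two branches is taken */
    }
}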

57 Memory Ordering: Sequential Consistency System is sequentially consistent if the result of any execution is the same as if the operations of all processors were executed in some sequential order and the operations of each individual processor appear in this sequence in program order [Lamport] Sequential Consistency All loads and stores in order Delay completion of memory access until all invalidations caused by that access complete Delay next memory access until previous one completes Delay read of A/B until previous write of A/B completes Cannot place writes in a write buffer and continue with read! Simple for programmer Not much room for HW/SW performance optimizations

58 Sequential Consistency: Performance
Suppose a write miss takes 40 cycles to establish ownership, 10 cycles to issue each invalidate after ownership, and 50 cycles to complete/acknowledge the invalidates
Suppose 4 processors share the cache block
Sequential consistency: must wait 40 + 4*10 + 50 = 130 cycles
Ownership time is only 40 cycles; with write buffers the processor could continue after that instead

59 Weaker Consistency Models Assume programs are synchronized Observation: SC only needed for lock variables Other variables? Either in critical section (no parallel access) Or not shared Weaker consistency: can delay/reorder loads and stores More room for hardware optimizations Somewhat trickier programming model (memory barriers)
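Under a weaker model the programmer (or the locking/runtime library) inserts barriers only where ordering matters, typically around the synchronization variable. A hedged C11 sketch of the usual producer/consumer flag idiom (variable names are mine):

#include <stdatomic.h>

int data;            /* ordinary, unsynchronized variable                  */
atomic_int ready;    /* the one "lock-like" variable that needs ordering   */

void producer(void) {
    data = 42;                                                /* plain store */
    atomic_store_explicit(&ready, 1, memory_order_release);   /* barrier     */
}

void consumer(void) {
    if (atomic_load_explicit(&ready, memory_order_acquire))   /* barrier     */
        (void)data;   /* guaranteed to see 42, even on a weakly ordered machine */
}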

60 One other possibility Many processors have extensive support for speculation recovery Can this be re-used to support strict consistency? Use delayed commit feature for memory references If we receive an invalidate message, back-out the computation and restart Rarely triggered anyway (only on unsynchronized accesses that cause a race)
