Parallel Architecture. Hwansoo Han
1 Parallel Architecture Hwansoo Han
2 Performance Curve (figure)
3 Unicore Limitations
Performance scaling stopped due to:
- Power
- Wire delay
- DRAM latency
- Limitation in ILP
4 Power Consumption (figure, in watts)
5 Wire Delay
- Range of a wire in one clock cycle (figure)
6 DRAM Latency
- Microprocessor speed: 60% / year (2x / 18 months)
- DRAM latency: 9% / year (2x / 10 years)
7 Instruction Level Parallelism
- 1980s: more transistors, superscalar, pipelining (10 CPI -> 1 CPI)
- 1990s: exploit the last of the implicit parallelism; multi-way issue, out-of-order issue, branch prediction (1 CPI -> 0.5 CPI)
- 2000s: multicore; explicit parallelism is needed
8 Multicore Processors (timeline figure, circa 2001-2006)
- Intel Tejas & Jayhawk (unicore 4 GHz P4): cancelled
- IBM Cell: scalable multicore
- IBM Power 4 & 5: dual cores since 2001
- Intel Montecito: dual-core IA-64
- Intel Pentium D (Smithfield) / Pentium Extreme: 3.2 GHz dual core
- AMD Opteron: dual core
- Intel Yonah: dual-core mobile
- Intel Tanglewood: dual-core IA-64
- Intel Dempsey: dual-core Xeon
- Sun Olympus & Niagara: 8 processor cores
- IBM Power 6: dual core
9 Chip Multiprocessors (Multicores)
Processor / company / target market: cores (threads); PE interconnect; programming model
- Power7, IBM, servers: 4~8x Power7 (16~32 threads); full crossbar to L2$; shared-memory multithreading
- Niagara2, Sun, servers: 8x UltraSPARC (64 threads); full crossbar to L2$; shared-memory multithreading
- Bloomfield (i7), Intel, servers/desktop: 4x Nehalem (8 threads); point-to-point network; traditional SMP
- Barcelona, AMD, servers/desktop: 4x NG-Opteron (4 threads); full crossbar on chip; traditional SMP
- Xenon, IBM/Microsoft, XBox360: 3x PowerPC w/ vmx128 (6 threads); traditional SMP
- Cell, Sony/Toshiba/IBM, game consoles/DTV/HPC: PowerPC + 8x SPE (SIMD) (2+8 threads); 4 rings; shared DRAM, private SRAM
- Tesla, NVIDIA, GPGPU: 240 streaming processors; CUDA
10 Why Multiprocessors?
1. Microprocessors are the fastest CPUs: collecting several CPUs is much easier than redesigning one CPU
2. Complexity of current microprocessors: do we have enough ideas to sustain 1.5x/year? Can we deliver such complexity on schedule?
3. Slow (but steady) improvement in parallel software: scientific apps, databases, OS
4. Emergence of embedded and server markets drives microprocessors in addition to desktops
   - Embedded systems: functional parallelism
   - Server performance: producer/consumer model; transactions/sec vs. latency of one transaction
11 Many Parallel Workloads Exist
- Multiprogramming: OS & multiple programs
- Commercial workloads: OLTP, data mining
- Scientific computing: weather prediction, chemical simulation, ...
- Multimedia: HDTV playback, speech recognition, ...
All interesting workloads are parallel; demand for higher performance drives parallel computers.
12 Challenges of Multiprocessors
- Difficult to write parallel programs
  - Most programmers think sequentially
  - Performance vs. correctness tradeoffs
  - Missing good parallel abstractions
- Automatic parallelization by compilers
  - Works for some applications (loop parallelism, reduction), as in the sketch below
  - Unclear how to apply it to other, more complex applications
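To make the loop-parallelism and reduction case concrete, here is a minimal C sketch using OpenMP (an assumption; the slides later name OpenMP as a shared-memory programming model). Each iteration is independent, and the reduction clause gives every thread a private partial sum that is combined at the end.

```c
#include <stdio.h>
#include <omp.h>          /* assumes an OpenMP-capable compiler, e.g. gcc -fopenmp */

#define N 1000000

static double a[N];

int main(void) {
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = 1.0;

    /* Loop parallelism plus a reduction: each thread accumulates a private
       partial sum, and the partial sums are combined at the end of the loop. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %.0f (up to %d threads)\n", sum, omp_get_max_threads());
    return 0;
}
```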
13 Limitations of Multiprocessors
- Serial portion of applications: Amdahl's law
  - If fraction f is parallelizable on n CPUs: speedup = 1 / ((1 - f) + f/n)
  - If 80% is parallelizable, the maximum speedup is 5 (see the sketch below)
- Latency of communication
  - Often takes 10~1000 cycles for CPUs to communicate
  - CPUs often stall waiting for communication
- Solutions
  - Exploit locality (caches)
  - Overlap communication with independent computation
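A small worked example of the formula above; the helper name is illustrative. With f = 0.80 the speedup approaches 1/0.2 = 5 no matter how many CPUs are added.

```c
#include <stdio.h>

/* Amdahl's law: with parallelizable fraction f and n CPUs,
   speedup = 1 / ((1 - f) + f / n). */
static double speedup(double f, int n) {
    return 1.0 / ((1.0 - f) + f / n);
}

int main(void) {
    for (int n = 2; n <= 1024; n *= 2)
        printf("f = 0.80, n = %4d  ->  speedup = %.2f\n", n, speedup(0.80, n));
    return 0;
}
```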
14 Popular Flynn Categories
- SISD (single instruction, single data): uniprocessors
- SIMD (single instruction, multiple data): vector processors (e.g. CM-2, Cray XP/YP, ...), multimedia extensions (Intel MMX/SSE, ...)
- MISD (multiple instruction, single data): systolic arrays
- MIMD (multiple instruction, multiple data): MPP (massively parallel processors, special interconnect), SMP (symmetric multiprocessors), clusters (commodity CPUs connected with, basically, Ethernet)
  - Most successful model; virtually all multiprocessors today (Sun Enterprise 10000, SGI Origin, Cray T3D, ...)
15 Parallel Architectures (MIMD)
- Shared memory: access all data within a single address space
  - SMP, UMA, cc-NUMA
  - Popular programming models: thread APIs (pthreads, ...), OpenMP
- Distributed memory: access only local data; other data are accessed via communication
  - NUMA, clusters
  - Popular programming models: PVM (obsolete), MPI (de facto standard)
(figure: shared-memory CPUs with caches attached to one memory vs. distributed CPU/cache/memory nodes)
16 Machine Abstraction for Programs
Shared-memory:
- Single address space for all CPUs
- Communication through regular load/store (implicit)
- Synchronization using locks and barriers
- Ease of programming
- Complex HW for cache coherence
Message-passing:
- Private address space per CPU
- Communication through message send/receive over the network interface (explicit); see the MPI sketch below
- Synchronization using blocking messages
- Need to program explicit communication
- Simple HW (no cache-coherence hardware)
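A minimal sketch of the message-passing side, assuming an MPI installation (build with mpicc, run with mpirun -np 2): rank 0 explicitly sends one integer that rank 1 explicitly receives, and the blocking receive doubles as synchronization.

```c
#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Explicit communication: data moves only because we send it. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* The blocking receive also synchronizes the two processes. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}
```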
17 Cache Coherence in SMP
Assume the following sequence (two CPUs, each with its own cache, sharing one memory):
- P0 loads A (A is now in P0's cache)
- P1 loads A (A is now in P1's cache)
- P0 writes a new value to A
- P1 loads A (can P1 get the new value?)
Memory system behavior:
- Cache coherence: what value can be returned by a load
- Memory consistency: when a written value can be read (become visible) by a load
A solution for cache coherence: allow multiple read-only copies or one exclusive modified copy (invalidate the other copies when a CPU needs to update a cache line).
18 Snooping Protocol
- All cache controllers monitor (snoop) the bus
  - All requests for data are sent to all processors
  - Processors snoop to see if they have a copy of the shared block
  - Requires broadcast, since the caching information resides at the processors
  - Works well with a bus (natural broadcast); dominates small-scale machines
- Cache coherence unit
  - The cache block (line) is the unit of management
  - False sharing is possible: two processors share the same cache line but not the actual word (see the sketch below)
- Coherence miss
  - An invalidate can cause a miss for data that was read before
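A sketch of false sharing in C with pthreads, assuming a 64-byte cache line: the two counters never share a word, yet every increment by one thread invalidates the other core's copy of the line, producing coherence misses. The padded variant is one common fix.

```c
#include <pthread.h>

/* Two counters on the same (assumed 64-byte) cache line: false sharing. */
struct counters {
    long a;                        /* written only by thread 0 */
    long b;                        /* written only by thread 1, same line as a */
} shared;

/* One common fix: pad so each counter lives on its own cache line. */
struct padded_counters {
    long a;
    char pad[64 - sizeof(long)];
    long b;
};

static void *bump_a(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; i++) shared.a++;
    return 0;
}

static void *bump_b(void *arg) {
    (void)arg;
    for (long i = 0; i < 100000000L; i++) shared.b++;
    return 0;
}

int main(void) {
    pthread_t t0, t1;
    pthread_create(&t0, 0, bump_a, 0);
    pthread_create(&t1, 0, bump_b, 0);
    pthread_join(t0, 0);
    pthread_join(t1, 0);
    return 0;
}
```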
19 Write Invalidate vs. Write Update
Write-invalidate protocol in snooping:
- On a write to shared data, an invalidate is sent on the bus; all caches snoop and invalidate their copies
- On a read miss
  - Write-through: memory is always up-to-date
  - Write-back: snoop to force the write-back of the most recent copy
Write-update protocol in snooping:
- On a write to shared data, the new value is broadcast on the bus; processors snoop and update their copies
- On a read miss
  - Write-through: memory is always up-to-date
  - Write-back: one of the sharers (the owner) updates memory
20 An Example Snoopy Protocol
- Invalidation protocol with write-back caches
- Each cache block is in one of three states (MSI protocol):
  - Modified: this cache has the only copy (writable and dirty)
  - Shared: the block can be read
  - Invalid: the block contains no data
- State changes are driven by actions from both the CPU and the bus
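A toy C sketch of the per-block MSI bookkeeping a snooping controller keeps; the state and event names follow the slide, and the transition function is a simplification of the full diagram on the next slide, not the slides' own code.

```c
typedef enum { MSI_INVALID, MSI_SHARED, MSI_MODIFIED } msi_state_t;
typedef enum { CPU_READ, CPU_WRITE, BUS_READ, BUS_READX } msi_event_t;

/* Next state of one cache block, from the point of view of one cache. */
static msi_state_t msi_next(msi_state_t s, msi_event_t e) {
    switch (e) {
    case CPU_READ:  return (s == MSI_INVALID) ? MSI_SHARED : s;   /* miss issues Bus Read */
    case CPU_WRITE: return MSI_MODIFIED;    /* from I or S this issues Bus ReadX; from M it is a hit */
    case BUS_READ:  return (s == MSI_MODIFIED) ? MSI_SHARED : s;  /* flush dirty data, demote to Shared */
    case BUS_READX: return MSI_INVALID;     /* another CPU is writing: invalidate our copy */
    }
    return s;
}
```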
21 Snoopy-Cache State Machine (state diagram per cache block)
- States: Invalid, Shared (read-only), Modified (read/write)
- Invalid, CPU read miss: issue Bus Read, go to Shared
- Shared, CPU read hit: no bus traffic
- Shared, observed Bus ReadX: go to Invalid (invalidated due to another CPU's write)
- Modified, CPU read/write hit: no bus traffic
- Modified, CPU write miss: Bus WriteBack (flush), then Bus ReadX
22 MESI Protocol
- Adds a 4th state: distinguish Shared from Exclusive
  - MSI protocol: Shared (read-only)
  - MESI protocol: Shared (read-only) and Exclusive (read-only, no other copies)
- Common-case optimization
  - In MSI, [shared -> modified] causes invalidate traffic; writes to non-shared data cause unnecessary invalidates
  - Even for shared data, often only one processor reads and writes it
  - In MESI, [exclusive -> modified] happens without invalidate traffic
23 MESI Protocol State Machine (state diagram)
- Needs a "shared" (S) signal in the physical interconnect
- Invalid, CPU read: issue Bus Read and sample the S-signal; go to Shared if asserted, Exclusive otherwise
- Shared or Exclusive, CPU read hit: no bus traffic
- Shared, CPU write: issue Bus ReadX, go to Modified
- Exclusive, CPU write: go to Modified with no bus traffic (invalidate is not needed)
- Modified, observed Bus Read / Bus ReadX: write back (flush) the modified block
- Shared/Exclusive, observed Bus ReadX: go to Invalid
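A small C sketch of the MESI-specific decisions described above; the function names and the representation of the S-signal are illustrative, not from the slides.

```c
typedef enum { MESI_INVALID, MESI_SHARED, MESI_EXCLUSIVE, MESI_MODIFIED } mesi_state_t;

/* On a read miss the requester samples the bus "shared" (S) signal:
   if no other cache asserts it, install the block Exclusive instead of Shared. */
static mesi_state_t mesi_fill_on_read_miss(int s_signal_asserted) {
    return s_signal_asserted ? MESI_SHARED : MESI_EXCLUSIVE;
}

/* On a CPU write: Exclusive (or Modified) upgrades silently; Shared or
   Invalid must first issue Bus ReadX, which invalidates other copies. */
static mesi_state_t mesi_on_cpu_write(mesi_state_t cur, int *needs_bus_readx) {
    *needs_bus_readx = (cur == MESI_SHARED || cur == MESI_INVALID);
    return MESI_MODIFIED;
}
```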
24 Distributed Shared-Memory Architectures
- Non-uniform memory access time (NUMA, e.g. Cray T3D/E)
- Cannot use a snooping protocol for cache coherence
  - Snooping requires all cache-miss traffic to be broadcast on a bus, but NUMA machines have no such central structure
  - Snooping is efficient only for small-scale multiprocessors
- Use a directory per cached memory block (directory protocol)
  - Keeps track of the states of memory blocks cached from local memory
  - Records which processors have the data when in the shared/exclusive state
- Three nodes may be involved
  - Local node: where a request originates
  - Home node: where the original memory block resides
  - Remote node: where a copy of the memory block exists
25 Directory-based Cache Coherence
- A directory is added to each node for cache coherence (figure: processor/cache/memory+I/O/directory nodes connected by an interconnection network)
26 Directory Protocol
- Three block states in the directory
  - Exclusive: exactly 1 processor (the owner) has the data; memory is out-of-date
  - Shared: one or more processors have the data; memory is up-to-date
  - Uncached: no processor has it; not valid in any cache
- In addition to the block state, the directory must track which processors have the data in the shared/exclusive state (Sharers)
  - Usually a bit vector: bit i is 1 if processor i has a copy (see the sketch below)
- Directories at home nodes gather information about all of their memory blocks
  - Instead of bus snooping, home nodes hold all the information required
  - To broadcast a message, send it to the home directory
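A sketch of one directory entry as described above, assuming at most 64 nodes so the sharer set fits in a single bit-vector word; the type and helper names are illustrative.

```c
#include <stdint.h>

typedef enum { DIR_UNCACHED, DIR_SHARED, DIR_EXCLUSIVE } dir_state_t;

typedef struct {
    dir_state_t state;     /* Uncached / Shared / Exclusive, as above */
    uint64_t    sharers;   /* bit i set => processor i has a copy */
} dir_entry_t;

static void dir_add_sharer(dir_entry_t *e, int p)      { e->sharers |= (uint64_t)1 << p; }
static int  dir_is_sharer(const dir_entry_t *e, int p) { return (int)((e->sharers >> p) & 1); }
static void dir_clear_sharers(dir_entry_t *e)          { e->sharers = 0; }
```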
27 State Transitions in the Cache (per cache block)
Messages: the local cache sends requests to the home directory; the home directory sends messages (Fetch, Invalidate, Fetch/Invalidate) to remote caches holding copies.
- Invalid, CPU read miss: send Read Miss; go to Shared
- Invalid, CPU write miss: send Write Miss; go to Modified
- Shared, CPU read hit: no message
- Shared, CPU write (hit or miss): send Write Miss; go to Modified
- Shared, Invalidate from home: go to Invalid
- Modified, CPU read/write hit: no message
- Modified, Fetch from home: send Data Write Back; go to Shared
- Modified, Fetch/Invalidate from home: send Data Write Back; go to Invalid
- Modified, CPU read miss (address conflict): send Data Write Back and Read Miss
- Modified, CPU write miss (address conflict): send Data Write Back and Write Miss
28 State Transitions in the Directory (per memory block)
Requests arriving at the home directory from caches:
- Uncached, Read Miss: Sharers = {P}; send Data Value Reply; go to Shared
- Uncached, Write Miss: Sharers = {P}; send Data Value Reply; go to Exclusive
- Shared, Read Miss: Sharers += {P}; send Data Value Reply
- Shared, Write Miss: send Invalidate to Sharers; Sharers = {P}; send Data Value Reply; go to Exclusive
- Exclusive, Read Miss: send Fetch to owner (write back); Sharers += {P}; send Data Value Reply; go to Shared
- Exclusive, Write Miss: send Fetch/Invalidate to owner; Sharers = {P}; send Data Value Reply
- Exclusive, Data Write Back: Sharers = {}; write back to memory; go to Uncached
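A hand-written sketch of the home directory's write-miss handling that follows the transitions above, continuing the dir_entry_t sketch from slide 26's example; the send_* and owner_of helpers are hypothetical message-layer functions, not a real API.

```c
/* Hypothetical message helpers, assumed to exist elsewhere in the protocol engine. */
void send_invalidate(int node);
void send_fetch_invalidate(int node);
void send_data_value_reply(int node);
int  owner_of(const dir_entry_t *e);

static void dir_handle_write_miss(dir_entry_t *e, int requester) {
    switch (e->state) {
    case DIR_UNCACHED:
        break;                                  /* no copies exist anywhere */
    case DIR_SHARED:
        for (int node = 0; node < 64; node++)   /* invalidate every current sharer */
            if (dir_is_sharer(e, node) && node != requester)
                send_invalidate(node);
        break;
    case DIR_EXCLUSIVE:
        send_fetch_invalidate(owner_of(e));     /* owner flushes and invalidates its copy */
        break;
    }
    e->sharers = (uint64_t)1 << requester;      /* Sharers = {P} */
    e->state   = DIR_EXCLUSIVE;
    send_data_value_reply(requester);           /* requester becomes the new owner */
}
```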
29 Synchronization
- Why synchronize?
  - Mutual exclusion: need to know when it is safe for other processes to use shared data
  - Event synchronization: keep pace with other processes; wait until other processes have computed the needed results
- Implementation
  - Atomic (uninterruptible) instructions: fetch-and-update, test-and-swap, ...
  - User-level synchronization operations are built from the atomic instructions
- For large-scale MPs, synchronization can be a bottleneck; optimization techniques reduce contention and latency
30 Atomic Instructions
- Atomic exchange: interchange a value in a register with a value in memory
  - 0 => synchronization variable is free
  - 1 => synchronization variable is locked and unavailable
- Test-and-set: tests whether the value in memory is zero and sets it to 1 if the test passes; returns the old value
- Fetch-and-increment: returns the value of a memory location and atomically increments it
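These primitives map onto C11 atomics; a sketch of software-visible equivalents (the exact hardware instructions vary by ISA, and the names here are illustrative).

```c
#include <stdatomic.h>

atomic_int sync_var = 0;                 /* 0 = free, 1 = locked and unavailable */

/* Atomic exchange: interchange a register value with the memory value. */
int atomic_exchange_example(int newval) {
    return atomic_exchange(&sync_var, newval);
}

/* Test-and-set as described above: set to 1 only if the value is 0; return the old value. */
int test_and_set(atomic_int *p) {
    int expected = 0;
    if (atomic_compare_exchange_strong(p, &expected, 1))
        return 0;                        /* was 0, now 1: acquired */
    return expected;                     /* already non-zero: unchanged */
}

/* Fetch-and-increment: return the old value and atomically add 1. */
int fetch_and_increment(atomic_int *p) {
    return atomic_fetch_add(p, 1);
}
```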
31 Implementation of Spin Locks (1)
- Spin lock: keep testing until the lock variable is found to be 0, then proceed
- First version:
        li    R2, #1         ; 0(R1) is the lock variable
  lockit:
        exch  R2, 0(R1)      ; atomic exchange
        bnez  R2, lockit     ; already locked?
- On an MP with a cache coherence protocol
  - Every exch writes the cache block containing 0(R1), so the coherence protocol invalidates all other copies, including those in other processors that may be spinning on the same lock
  - This creates heavy invalidate traffic on the bus; we do not want to disrupt the caches of the other processors
32 Implementation of Spin Locks (2)
- Second version ("test and test-and-set"): repeatedly read the variable; only when it changes, try the exchange
        li    R2, #1
  lockit:
        lw    R3, 0(R1)      ; 0(R1) is the lock variable
        bnez  R3, lockit     ; not free, keep spinning
        exch  R2, 0(R1)      ; atomic exchange
        bnez  R2, lockit     ; already locked?
- Most of the time the processor spins reading the lock variable from its own cache
- Only when the variable changes does it attempt exch (which invalidates the other copies); see the C sketch below
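The same test-and-test-and-set idea rendered as a C11 spin lock, as a sketch rather than the slides' MIPS-style code.

```c
#include <stdatomic.h>

typedef struct { atomic_int locked; } spinlock_t;   /* 0 = free, 1 = held */

static void spin_lock(spinlock_t *l) {
    for (;;) {
        while (atomic_load(&l->locked) != 0)
            ;                                        /* "test": spin on the cached copy */
        if (atomic_exchange(&l->locked, 1) == 0)     /* "test-and-set": try the atomic exchange */
            return;
    }
}

static void spin_unlock(spinlock_t *l) {
    atomic_store(&l->locked, 0);
}
```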
33 Barrier Synchronization
- Keep pace with other processes (or threads)
  - Wait until all threads have reached a certain point (the barrier)
  - Makes all updates to shared data visible
  - Then proceed with the next phase until the next barrier
- Example (three threads computing partial sums of A):
  P0: do i = 1,10  : S0 += A[i];  barrier(0);  S = S0+S1+S2;  barrier(1)
  P1: do i = 11,20 : S1 += A[i];  barrier(0);                 barrier(1)
  P2: do i = 21,30 : S2 += A[i];  barrier(0);                 barrier(1)
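A sketch of the partial-sum pattern above using POSIX barriers (assuming pthread_barrier_t is available; compile with -pthread). Array bounds and thread count mirror the slide's three-thread example.

```c
#include <stdio.h>
#include <pthread.h>

#define N        30
#define NTHREADS 3

static double A[N];
static double partial[NTHREADS];
static double total;
static pthread_barrier_t bar0, bar1;

static void *worker(void *arg) {
    long id = (long)arg;
    double s = 0.0;

    /* Each thread sums its own 10-element slice (P0: A[0..9], P1: A[10..19], ...). */
    for (int i = (int)id * (N / NTHREADS); i < (int)(id + 1) * (N / NTHREADS); i++)
        s += A[i];
    partial[id] = s;

    pthread_barrier_wait(&bar0);      /* barrier(0): all partial sums are now visible */
    if (id == 0)
        total = partial[0] + partial[1] + partial[2];
    pthread_barrier_wait(&bar1);      /* barrier(1): the total is visible to everyone */
    return 0;
}

int main(void) {
    pthread_t t[NTHREADS];

    for (int i = 0; i < N; i++) A[i] = 1.0;
    pthread_barrier_init(&bar0, 0, NTHREADS);
    pthread_barrier_init(&bar1, 0, NTHREADS);
    for (long i = 0; i < NTHREADS; i++) pthread_create(&t[i], 0, worker, (void *)i);
    for (int i = 0; i < NTHREADS; i++) pthread_join(t[i], 0);
    printf("total = %.0f\n", total);
    return 0;
}
```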
34 Multithreading
- Superscalar vs. multithreading vs. simultaneous multithreading (SMT)
(figure: issue slots over time, in processor cycles, for the three approaches across five threads; SMT fills issue slots in the same cycle with instructions from different threads)
35 Summary
- Parallel architecture: shared memory vs. distributed memory
- Cache coherence: keep multiple read-only copies or one exclusive modified copy; snooping protocol vs. directory protocol
- Synchronization: implemented with atomic instructions; used for mutual exclusion and event synchronization
- Multithreading architectures