COMP 322: Principles of Parallel Programming. Lecture 18: Understanding Parallel Computers (Chapter 2, contd) Fall 2009


1 COMP 322: Principles of Parallel Programming
Lecture 18: Understanding Parallel Computers (Chapter 2, contd)
Fall 2009
http://www.cs.rice.edu/~vsarkar/comp322
Vivek Sarkar, Department of Computer Science, Rice University
COMP 322, Lecture 18, October 2009

2 Acknowledgments for today's lecture
- Course text: Principles of Parallel Programming, Calvin Lin & Lawrence Snyder, including resources available at the publisher's companion website
- Parallel Architectures, Calvin Lin, Lectures 5 & 6, CS380P, Spring 2009, UT Austin
- A Gentler, Kinder Guide to the Multi-core Galaxy, ECE 4100/6100 guest lecture by Prof. Hsien-Hsin S. Lee, School of Electrical and Computer Engineering, Georgia Tech
- Parallel Systems: Introduction, Fall 2008, Jan Lemeire, Vrije Universiteit Brussel

3 A Look at Six Parallel Computers (Chapter 2)
- Chip Multiprocessors: Intel Core Duo (previous lecture), AMD Dual Core Opteron (previous lecture)
- Symmetric Multiprocessors: Sun Fire E25K (this lecture)
- Heterogeneous Processors: Cell processor (this lecture), GPUs (this lecture; only brief mention in Chapter 2)
- Clusters
- Supercomputers: Blue Gene/L (this lecture)

4 Sun Fire E25K
- Up to 72 processors, each of which can execute 2 hardware threads and directly access 16GB memory
- Total 1.15TB memory (72 x 16GB) accessed via a directory-based cache coherence protocol
- Each board contains four processors w/ a snooping bus
- Boards are connected by three 18x18 crossbar switches
- Crossbars have high bisection bandwidth, but the switch cost grows as n^2 for n boards

5 Demystifying Crossbar Switches
Figure 2.5: Crossbar switch connecting four nodes. Notice the output and input channels; crossing wires do not connect unless a connection is shown. Each pair of nodes is directly connected by setting one of the open circles.

6 Flynn's Taxonomy

                  Single Instruction   Multiple Instructions
  Single Data           SISD                  MISD
  Multiple Data         SIMD                  MIMD

7 Architecture of the Cell Processor (Figure 2.6)

8 Cell Features
- Heterogeneous multicore system architecture: Power Processor Element (PPE) for control tasks, Synergistic Processor Elements (SPEs) for data-intensive processing
- A Synergistic Processor Element (SPE) consists of: a Synergistic Processor Unit (SPU), Synergistic Memory Flow Control (MFC) for data movement and synchronization, and an interface to the high-performance Element Interconnect Bus (EIB)
[Block diagram: 8 SPEs (each an SPU/SXU with Local Store and MFC) attached at 16B/cycle to the EIB (up to 96B/cycle); PPE (PXU, L1, L2) with 32B/cycle L2 and 16B/cycle L1 links; 64-bit Power Architecture with VMX; MIC to dual XDR at 16B/cycle; BIC to FlexIO at 16B/cycle (x2)]

9 Synergistic Processor Element (SPE)
- ISA influenced by VMX and the PS2's Emotion Engine
- User-mode architecture: no translation/protection within the SPE; DMA uses full PowerPC protection/translation
- Direct programmer control: DMA/DMA-list, branch hint
- No dynamic branch prediction; in-order execution
- VMX-like SIMD dataflow: graphics SP-float (no saturating arithmetic, some byte operations), IEEE DP-float (BlueGene-like)
- Unified register file: 128 entries x 128 bits
- 256KB Local Store: combined instructions & data, 16B/cycle load/store bandwidth, 128B/cycle DMA bandwidth
- DMA unit driven by Memory Flow Control (MFC) commands; the MFC's MMU allows a consistent interface to the system storage map for all processors despite the heterogeneous structure

SPU units (pipelined):
- Simple (FXU even): add/compare, rotate, logical, count leading zeros
- Permute (FXU odd): permute, table-lookup
- FPU (single / double precision)
- Control (SCN): dual issue, load/store, ECC handling
- Channel (SSC): interface to MFC
- Register file (GPR/FWD)

SPU latencies:
- Simple fixed point: 2 cycles*
- Complex fixed point: 4 cycles*
- Load: 6 cycles*
- Single-precision (ER) float: 6 cycles*
- Integer multiply: 7 cycles*
- Branch miss (no penalty for a correct hint): 20 cycles
- DP (IEEE) float (partially pipelined): 13 cycles*
- Enqueue DMA command: 20 cycles*
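The 128-bit SIMD dataflow above can be made concrete with a short sketch. This assumes the Cell SDK's spu_intrinsics.h; the saxpy4 helper and its arguments are illustrative, not taken from the slides.

  #include <spu_intrinsics.h>

  /* Minimal sketch (assuming the Cell SDK's spu_intrinsics.h) of the
   * SIMD dataflow described above: each 128-bit SPU register holds
   * four 32-bit floats, so one spu_madd performs four multiply-adds. */
  void saxpy4(vec_float4 *y, const vec_float4 *x, float a, int nvec)
  {
      /* Replicate the scalar a into all four lanes of a SIMD register. */
      vec_float4 va = spu_splats(a);
      for (int i = 0; i < nvec; i++)
          y[i] = spu_madd(va, x[i], y[i]);  /* y = a*x + y, 4 lanes at once */
  }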

10 Memory Flow Controller Commands

DMA Commands:
- Put: transfer from Local Store to EA space
- Puts: transfer and start SPU execution
- Putr: Put Result (Arch.: scarf into L2)
- Putl: Put using DMA List in Local Store
- Putrl: Put Result using DMA List in LS (Arch.)
- Get: transfer from EA space to Local Store
- Gets: transfer and start SPU execution
- Getl: Get using DMA List in Local Store
- Sndsig: send signal to SPU

Command modifiers <f,b>:
- f: embedded tag-specific fence; the command will not start until all previous commands in the same tag group have completed
- b: embedded tag-specific barrier; the command and all subsequent commands in the same tag group will not start until previous commands in the same tag group have completed

SL1 Cache Management Commands:
- sdcrt: data cache region touch (DMA Get hint)
- sdcrtst: data cache region touch for store (DMA Put hint)
- sdcrz: data cache region zero
- sdcrs: data cache region store
- sdcrf: data cache region flush

Command parameters:
- LSA: Local Store Address (32 bit)
- EA: Effective Address (32 or 64 bit)
- TS: Transfer Size (16 bytes to 16KB)
- LS: DMA List Size (8 bytes to 16KB)
- TG: Tag Group (5 bit)
- CL: Cache Management / Bandwidth Class

Synchronization Commands:
- Lockline (atomic update) commands: getllar (DMA 128 bytes from EA to LS and set reservation), putllc (conditionally DMA 128 bytes from LS to EA), putlluc (unconditionally DMA 128 bytes from LS to EA)
- barrier: all previous commands complete before subsequent commands are started
- mfcsync: results of all previous commands in the tag group are remotely visible
- mfceieio: results of all preceding Put commands in the same group are visible with respect to succeeding Get commands
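From SPU code, these MFC commands are issued through intrinsics. A minimal sketch, assuming the Cell SDK's spu_mfcio.h; the buffer, effective address, and TAG value are illustrative, not from the slides.

  #include <spu_mfcio.h>

  /* Minimal sketch of enqueueing a Get and waiting on its tag group.
   * TAG is a 5-bit tag group number (0..31), as described above. */
  #define TAG 3

  volatile char buf[16384] __attribute__((aligned(128)));  /* LS buffer */

  void fetch(unsigned long long ea, unsigned int size)
  {
      mfc_get(buf, ea, size, TAG, 0, 0);   /* enqueue DMA: EA -> LS */
      mfc_write_tag_mask(1 << TAG);        /* select tag group TAG */
      mfc_read_tag_status_all();           /* block until the group completes */
  }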

11 Power Processor Element (PPE)
- General-purpose, 64-bit RISC processor (Power/PowerPC binary compatible)
- In-order, dual issue, dual threaded
- L1: 32KB I; 32KB D
- L2: 512KB
- Coherent load/store
- VMX-32
- Realtime controls: locking L2 cache & TLB, software/hardware managed TLB, bandwidth/resource reservation, mediated interrupts
[Figure: PPE structure]

12 Element Interconnect Bus
- EIB data ring for internal communication
- Four unidirectional 16-byte data rings, supporting multiple simultaneous transfers: 2 clockwise, 2 anti-clockwise; worst-case latency is half the ring length
- 96B/cycle peak bandwidth
- Over 100 outstanding requests

13 Example of Eight Concurrent Transactions
[Figure: the EIB's data arbiter and per-ramp controllers connect twelve ramps (PPE, SPE0-SPE7, MIC, BIF/IOIF0, IOIF1) to the four data rings (Ring0-Ring3), allowing eight transactions to proceed concurrently]

14 Theoretical Peak Operations
[Chart: theoretical peak operations per second (up to ~250 billion ops/sec) for FP (SP), FP (DP), Int (16 bit), and Int (32 bit), comparing the Freescale MPC8641D (1.5 GHz), AMD Athlon 64 X2 (2.4 GHz), Intel Pentium D (3.2 GHz), PowerPC 970MP (2.5 GHz), and Cell Broadband Engine (3.2 GHz)]

15 CELL Software Design Considerations
- Four levels of parallelism: Blade level (two Cell processors per blade), Chip level (9 cores run independent tasks), Instruction level (dual-issue pipelines on each SPE), Register level (native SIMD on SPE and PPE VMX)
- 256KB local store per SPE holds data + code + stack
- Communication: DMA and bus bandwidth, DMA granularity of 128 bytes, DMA bandwidth between LS and system memory, traffic control
- Exploit computational complexity and data locality to lower the data traffic requirement
- Shared memory / message passing abstraction overhead
- Synchronization
- DMA latency handling

16 Typical CELL Software Development Flow
1. Algorithm complexity study
2. Data layout/locality and data flow analysis
3. Experimental partitioning and mapping of the algorithm and program structure to the architecture
4. Develop PPE control and PPE scalar code
5. Develop PPE control and partitioned SPE scalar code: communication, synchronization, latency handling
6. Transform SPE scalar code to SPE SIMD code
7. Re-balance the computation / data movement
8. Other optimization considerations: PPE SIMD, system bottlenecks, load balance

17 Programming the Cell is Challenging
Issues:
- Dividing the program among the different cores
- Creating instructions in a different language for the 8 SPEs than for the PowerPC core
- Need to think in terms of the SIMD nature of the dataflow to get maximum performance from the SPUs
- The SPU local store needs coherent DMA access to reach system memory

18 Shared Memory Processor
- The CBE can be explicitly programmed as a shared-memory multiprocessor using two different instruction sets
- The SPEs and the PPE can be programmed to fully inter-operate in a cache-coherent Shared-Memory Multiprocessor Model
- Cache-coherent DMA operations for SPEs: DMA operations use effective addresses common to the PPE and all SPEs
- An SPE shared-memory store instruction is replaced by: a store from the register file to the LS, then a DMA operation from LS to shared memory
- An SPE shared-memory load instruction is replaced by: a DMA operation from shared memory to LS, then a load from LS to the register file (see the sketch below)
- Of course, a compiler could provide much of this functionality
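The load replacement described above looks roughly like the following. A minimal sketch, assuming the Cell SDK's spu_mfcio.h; shared_load, ls_slot, and the tag value are illustrative, and the effective address is assumed to be suitably aligned for a small DMA.

  #include <spu_mfcio.h>

  /* An SPE "shared-memory load" becomes a DMA from the effective
   * address into the LS, followed by an ordinary LS load. */
  static volatile double ls_slot __attribute__((aligned(16)));

  double shared_load(unsigned long long ea)  /* ea: effective address */
  {
      mfc_get(&ls_slot, ea, sizeof(double), 31, 0, 0); /* shared mem -> LS */
      mfc_write_tag_mask(1u << 31);
      mfc_read_tag_status_all();              /* wait for the DMA */
      return ls_slot;                         /* LS -> register file */
  }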

19 Compiling a single source file for the Cell (w/o buffers)

Single source:
  foo1();
  #pragma omp parallel for
  for (i=0; i < N; i++)
    A[i] = x * B[i];
  foo2();

Outlined PPE code:
  foo3(LB,UB):
    for (i=LB; i < UB; i++)
      A[i] = x * B[i];
    runtime barrier

Outlined SPE code:
  foo3_spu(LB,UB):
    for (i=LB; i < UB; i++)
      A[i] = x * B[i];
    runtime barrier

Main program:
  foo1();
  runtime distribution of work: invoke foo3 and foo3_spu for i=[0,N)
  runtime barrier
  foo2();

In SPE code: A, B, and x are shared

20 Compiling a single source file for the Cell (w/ buffers)

Single source:
  foo1();
  #pragma omp parallel for
  for (i=0; i < N; i++)
    A[i] = x * B[i];
  foo2();

Outlined PPE code:
  foo3(LB,UB):
    for (i=LB; i < UB; i++)
      A[i] = x * B[i];
    runtime barrier

Outlined SPE code (with local-store buffers A'[M], B'[M]):
  foo3_spu(LB,UB):
    for (k=LB; k < UB; k+=M) {
      DMA M elements of B into B'
      for (j=0; j<M; j++)
        A'[j] = cache_lookup(x) * B'[j];
      DMA M elements out of A' into A
    }
    runtime barrier

Main program:
  foo1();
  runtime distribution of work: invoke foo3 and foo3_spu for i=[0,N)
  runtime barrier
  foo2();

21 Data Partitioning
- Single Source assumption: all data lives in system memory
- In a naïve implementation, every load and store requires a DMA operation
- Too costly (~700 cycles per load or store)
- MP will require locking on every reference
- What can be done to make this acceptable?

22 Example: Prefetching

Original code:
  for(i=0;i<100000;i++)
    a[i]=b[i]+c[i];

Blocked, with prefetch:
  for(i=0;i<100000;i+=100) {
    dma_get(b',b[i],400);
    dma_get(c',c[i],400);
    for(ii=0;ii<100;ii++)
      a'[ii]=b'[ii]+c'[ii];
    dma_put(a[i],a',400);
  }

Software-pipelined prefetch:
  dma_get(b',b[0],400);
  dma_get(c',c[0],400);
  for(i=0;i<99900;i+=100) {
    dma_get(b'',b[i+100],400);
    dma_get(c'',c[i+100],400);
    for(ii=0;ii<100;ii++)
      a'[ii]=b'[ii]+c'[ii];
    dma_put(a[i],a',400);
    swap(a',a''); swap(b',b''); swap(c',c'');
  }
  for(ii=0;ii<100;ii++)
    a'[ii]=b'[ii]+c'[ii];
  dma_put(a[i],a',400);
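The dma_get/dma_put calls above are slide pseudocode. The following is a self-contained plain-C sketch of the same double-buffered pattern, with memcpy standing in for the asynchronous DMA transfers; on a real SPE the copies would be MFC transfers overlapped with the compute loop. N, BLK, and blocked_add are illustrative.

  #include <string.h>

  #define N   100000
  #define BLK 100       /* elements per block; sizes illustrative */

  /* Double-buffered blocked loop: prefetch the next block while
   * computing on the current one, then flip the buffer index. */
  void blocked_add(float *a, const float *b, const float *c)
  {
      float bbuf[2][BLK], cbuf[2][BLK], abuf[BLK];
      int cur = 0;

      memcpy(bbuf[cur], &b[0], sizeof bbuf[cur]);  /* prime the pipeline */
      memcpy(cbuf[cur], &c[0], sizeof cbuf[cur]);
      for (int i = 0; i < N; i += BLK) {
          int nxt = cur ^ 1;
          if (i + BLK < N) {                        /* prefetch next block */
              memcpy(bbuf[nxt], &b[i + BLK], sizeof bbuf[nxt]);
              memcpy(cbuf[nxt], &c[i + BLK], sizeof cbuf[nxt]);
          }
          for (int ii = 0; ii < BLK; ii++)          /* compute current block */
              abuf[ii] = bbuf[cur][ii] + cbuf[cur][ii];
          memcpy(&a[i], abuf, sizeof abuf);         /* write block back */
          cur = nxt;
      }
  }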

23 Irregular Accesses
What do we do about this?
  for(i=0;i<100000;i++)
    a[i]=b[i]+c[i]*d[f(i)];
- b and c can be prefetched, but d has an irregular access pattern, so we cannot predict which elements of d are required
- We seem to be thrown back on the naïve implementation: d[f(i)] must be fetched on each iteration, with a consequent large slowdown of the loop
- Observation: it's as if every access to d incurred a cache miss

24 Software Caching

Original code:
  for(i=0;i<100000;i++)
    ... = d[f(i)];

Code with explicit cache lookup:
  for(i=0;i<100000;i++) {
    t = cache_lookup(d[f(i)]);
    ... = t;
  }

  inline vector cache_lookup(addr) {
    if (cache_directory[addr&key_mask] != (addr&tag_mask))
      miss_handler(addr);
    return cache_data[addr&key_mask][addr&offset_mask];
  }

- The miss handler will DMA the required data, plus some suitable quantity of surrounding data
- Higher degrees of associativity can be supported, possibly for little extra cost on a SIMD processor
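Here is a self-contained, direct-mapped variant of the lookup above in plain C. The line size, table sizes, and fetch_line (which stands in for the miss handler's DMA) are illustrative assumptions, not from the slides.

  #include <stdint.h>

  #define LINE   128                    /* bytes per cache line (illustrative) */
  #define NLINES 64                     /* direct-mapped: 64 lines = 8KB of LS */

  static uintptr_t dir[NLINES];         /* tag of the line held in each slot */
  static uint8_t   data[NLINES][LINE];  /* cached copies of remote lines */

  extern void fetch_line(void *dst, uintptr_t line_addr, int nbytes);

  /* Direct-mapped software cache lookup, in the spirit of the slide's
   * cache_lookup(): index and tag are derived from the address by masking. */
  uint8_t *cache_lookup(uintptr_t addr)
  {
      uintptr_t line = addr & ~(uintptr_t)(LINE - 1);  /* line base address */
      unsigned  idx  = (line / LINE) % NLINES;         /* direct-mapped index */
      if (dir[idx] != line) {                          /* miss: refill slot */
          fetch_line(data[idx], line, LINE);
          dir[idx] = line;
      }
      return &data[idx][addr & (LINE - 1)];            /* pointer into line */
  }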

25 Combining Prefetch with Software Cache

Original code:
  for(i=0;i<100000;i++)
    a[i]=b[i]+c[i]*b[f(i)];

Prefetching and caching:
  for(i=0;i<100000;i+=100) {
    dma_get(b',b[i],400);
    dma_get(c',c[i],400);
    for(ii=0;ii<100;ii++) {
      t = cache_lookup(b[f(i)]);
      a'[ii] = b'[ii] + c'[ii]*t;
    }
    dma_put(a[i],a',400);
  }

- Prefetching must also update the cache directory, and
- Miss handling must not evict prefetched data

26 Coherence Problem
- An SPE accesses data in global memory through two mechanisms: the software-controlled cache and static buffers
- An incorrect value may be used or generated if coherence is not maintained. Examples:
  - Two copies of the data, one in the software-controlled cache and one in a static buffer: one changes the value and the other may read a stale value
  - Multiple copies of the data in different static buffers
- Approaches: compiler (no runtime overhead); runtime (more powerful but complicated)

27 Solution Overview
- Combine the two approaches for an optimal solution: apply the compiler solution as much as possible, and resort to the runtime solution if necessary
- Components: local coherence simplification, global coherence avoidance analysis, dynamic coherence maintenance

28 Local Coherence Simplification
- References are put into a static buffer in a loop only when there is no data dependence between the reference and any other reference accessed through the software-controlled cache or another static buffer in the loop
- Runtime coherence maintenance is then needed only:
  - At loop entry: DMA read, and check whether the software-controlled cache has updated data
  - At loop exit: write-through (update the hit cache line and DMA write) or write-back (put the static buffer content into the cache)
- Pros: requires only local data dependence info, which is more likely to be available; the structure of the software-controlled cache remains unchanged; the coherence maintenance can be overlapped with DMA operations
- Cons: candidates for static buffers may be lost if the data dependence information is too conservative

29 Global Coherence Avoidance Analysis
- Runtime coherence maintenance can be avoided by compiler analysis:
  - At entry: if there is no updated cache line for this static buffer
  - At exit: if there is no cache line for this static buffer already in the cache that will be referenced later
- How the compiler predicts cache contents:
  - No lines are in the cache after a flush
  - If data is carefully aligned or padded, the compiler can assume different variables will never share a cache line
  - Replacement cannot be predicted: a line is assumed to stay in the cache until the next flush

30 Optimization with Flushes
- When runtime coherence maintenance is required by the previous analysis, it may be profitable to insert extra cache flushes to avoid the coherence maintenance
- A flush can target one variable, or several can be combined into a flush-all
- The previous analysis provides information about the possible insertion points for flushes
- Move flushes in the control flow graph to reduce the overhead, similar to the algorithm for partial redundancy elimination; branch profiling may help

31 Why GPUs?
Two major trends:
1. Increasing performance gap relative to mainstream CPUs: calculation 367 GFLOPS vs. 32 GFLOPS; memory bandwidth 86.4 GB/s vs. 8.4 GB/s
2. Availability of more general (non-graphics) programming interfaces
A GPU is in every PC and workstation: massive volume and potential impact

32 What is GPGPU?
- General Purpose computation using a GPU in applications other than 3D graphics; the GPU accelerates the critical path of the application
- Data-parallel algorithms leverage GPU attributes: large data arrays, streaming throughput; fine-grain SIMD parallelism; low-latency floating point (FP) computation
- Applications (see GPGPU.org): game effects (FX) physics, image processing; physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting

33 Traditional vs. General-Purpose GPUs
- Traditional graphics pipeline (Figure 10.3)
- General-purpose GPU (Figure 10.4(b))

34 Nvidia GeForce 8800 GTX (a.k.a. G80)
- The device is a set of 16 multiprocessors
- Each multiprocessor is a set of 32-bit processors with a Single Instruction Multiple Data architecture (shared instruction unit)
- Each multiprocessor has: a set of 32-bit registers per processor, 16KB on-chip shared memory, a read-only constant cache, and a read-only texture cache
[Figure: device containing Multiprocessor 1..N; each multiprocessor contains Processor 1..M with registers, an instruction unit, shared memory, constant cache, and texture cache, above device memory]

35 Thread Batching: Grids and Blocks
- A kernel is executed as a grid of thread blocks; all threads in a grid share the same data memory space
- A thread block is a batch of threads that can cooperate with each other by: synchronizing their execution (for hazard-free shared memory accesses) and efficiently sharing data through a low-latency shared memory
- Two threads from two different blocks cannot cooperate
[Figure, courtesy NVIDIA: the host launches Kernel 1 as Grid 1 and Kernel 2 as Grid 2 on the device; each grid is a 2D array of blocks, e.g. Block (1,1), and each block is an array of threads, Thread (0,0) through Thread (4,2)]
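The grid/block hierarchy maps directly to CUDA C. A minimal sketch; vecAdd, launch, and the block size of 256 are illustrative, not from the slides.

  #include <cuda_runtime.h>

  /* Each thread computes one element; its global index is derived
   * from its block and thread coordinates. */
  __global__ void vecAdd(const float *a, const float *b, float *c, int n)
  {
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)                  /* guard: the last block may be partial */
          c[i] = a[i] + b[i];
  }

  /* Launch: a 1D grid of 1D blocks, 256 threads per block. */
  void launch(const float *a, const float *b, float *c, int n)
  {
      int threads = 256;
      int blocks  = (n + threads - 1) / threads;
      vecAdd<<<blocks, threads>>>(a, b, c, n);
  }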

36 Device Memory Space Overview
- Each thread can: read/write per-thread registers; read/write per-thread local memory; read/write per-block shared memory; read/write per-grid global memory; read per-grid constant memory; read per-grid texture memory
- The host can read/write global, constant, and texture memory
- These memory spaces are persistent across kernels called by the same application
[Figure: device grid with Block (0,0) and Block (1,0), each with shared memory and per-thread registers and local memory; global, constant, and texture memory are shared across the grid and accessible from the host]

37 CUDA Host-Device Data Transfer
  cudaError_t cudaMemcpy(void* dst, const void* src, size_t count, enum cudaMemcpyKind kind)
- Copies count bytes from the memory area pointed to by src to the memory area pointed to by dst, where kind is one of: cudaMemcpyHostToHost, cudaMemcpyHostToDevice, cudaMemcpyDeviceToHost, cudaMemcpyDeviceToDevice
- The memory areas may not overlap
- Calling cudaMemcpy() with dst and src pointers that do not match the direction of the copy results in undefined behavior
- Synchronous in CUDA; will it become asynchronous in a future CUDA version?
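A minimal sketch of a host-to-device round trip using the calls above; the buffer size is illustrative and error handling is reduced to a single check for brevity.

  #include <cuda_runtime.h>
  #include <stdio.h>

  int main(void)
  {
      const size_t count = 1024 * sizeof(float);
      float h_buf[1024] = {0};
      float *d_buf;

      cudaMalloc((void **)&d_buf, count);                       /* device alloc */
      cudaMemcpy(d_buf, h_buf, count, cudaMemcpyHostToDevice);  /* host -> device */
      /* ... launch kernels that read/write d_buf here ... */
      cudaMemcpy(h_buf, d_buf, count, cudaMemcpyDeviceToHost);  /* device -> host */
      cudaFree(d_buf);

      if (cudaGetLastError() != cudaSuccess)
          fprintf(stderr, "CUDA error\n");
      return 0;
  }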

38 Access Times
- Register: dedicated HW, single cycle
- Shared memory: dedicated HW, single cycle
- Local memory: DRAM, no cache, *slow*
- Global memory: DRAM, no cache, *slow*
- Constant memory: DRAM, cached; 1s-10s-100s of cycles, depending on cache locality
- Texture memory: DRAM, cached; 1s-10s-100s of cycles, depending on cache locality
- Instruction memory (invisible): DRAM, cached
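Because shared memory is single-cycle while global memory is slow, kernels typically stage reused data into __shared__ storage. A hedged sketch of this staging pattern; the stencil kernel, TILE size, and halo handling are illustrative, and the kernel assumes blockDim.x == TILE.

  #include <cuda_runtime.h>

  #define TILE 256

  /* Values reused by several threads are loaded once from slow global
   * memory into single-cycle shared memory. Computes c[i] = a[i] + a[i+1]. */
  __global__ void stencil(const float *a, float *c, int n)
  {
      __shared__ float s[TILE + 1];            /* tile plus one halo element */
      int i = blockIdx.x * TILE + threadIdx.x;

      if (i < n)
          s[threadIdx.x] = a[i];               /* one global read per thread */
      if (threadIdx.x == 0 && i + TILE < n)
          s[TILE] = a[i + TILE];               /* halo element at the tile edge */
      __syncthreads();                         /* make loads visible block-wide */

      if (i + 1 < n)
          c[i] = s[threadIdx.x] + s[threadIdx.x + 1];  /* shared-memory reads */
  }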

39 Blue Gene/L
The full machine has 65,536 dual-core nodes; each node has:
- 32KB L1 instruction & data caches
- 4MB on-chip L3 cache
- 2.8 GFLOPS computation capacity with a 700MHz clock
- 6 bidirectional ports to a 3-D torus interconnect
- 3 bidirectional ports to a collective network
- 4 ports to a barrier/interrupt network
Figure 2.7: Logical organization of a BlueGene/L node.

40 Figure 2.8: BlueGene/L communication networks; (a) 3D 64x32x32 torus for standard interprocessor data transfer; (b) collective network for fast evaluation of reductions. Worst-case latency in the torus = 32 + 16 + 16 = 64 hops (half of each dimension).

41 Candidate Type Architecture (CTA)
- Differentiate between local and remote memory with respect to a processor
- Define λ = latency to access non-local memory, relative to a local memory access
- Locality rule: fast programs tend to maximize the number of local memory references and minimize the number of non-local memory references
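A quick worked example makes the locality rule concrete (the numbers here are illustrative, not taken from Table 2.1). If a fraction r of references are non-local, then, in units of local access time,

  average cost per reference = (1 - r) * 1 + r * λ

With λ = 100 and r = 0.02, this gives 0.98 + 2.0 = 2.98: just 2% non-local references already make the average memory access nearly 3x slower than all-local execution.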

42 Table 2.1: Estimates for λ for common architectures; speeds generally do not include congestion or other traffic delays.

43 The Memory Wall

44 Conclusions
- Great diversity in parallel architectures: a multi-dimensional space of memory model, communication latency, communication bandwidth, processing power, and number of processors
- Issues of scale add to the diversity
- Many ways of balancing these characteristics
- Impact on programmers: this diversity complicates the task of programmers who care about portability and scalability

45 Announcements
- No class on Tuesday, Nov 3rd; we will meet next on Thursday, Nov 5th
- Homework #2, due on Thursday, Nov 5th: Problem 8, Chapter 2, page 60
  A single processor is a 0-cube; two connected processors are a 1-cube; given two n-cubes, connecting corresponding elements produces an (n+1)-cube. In an n-cube, what is the maximum length of the path required to connect two arbitrary nodes? Explain why.
