Computer Architecture
Lecture 10: Data-Level Parallelism and GPGPU
Chao Li, PhD.
SJTU-SE346, Spring 2017
Review
- Threads, multithreading, and SMT
- CMP and multicore
- Benefits of multicore
- Multicore system architecture
- Heterogeneous multicore systems
- Heterogeneous-ISA CMP
- Multicore and manycore
- Design challenges
Exercises
- Can we keep adding cores to a CPU? Why?
- What are the opportunities for multicore design optimization?
Outline
- DLP Overview
- Vector Processor
- GPGPU
Throughput Workloads
- Interested in how many jobs can be completed per unit time
- Examples: finance, social media, gaming, etc.
- CPUs with multiple cores are considered suitable
- In recent years we have seen rapid adoption of GPUs
Data-Level Parallelism
- DLP: parallelism that comes from simultaneous operations across large data sets, rather than from multiple threads of control
- Particularly useful for large scientific/engineering tasks
(Figure: SISD vs. SIMD — one instruction stream operating on a single data item vs. one instruction stream operating on many data items at once.)
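The idea is easiest to see in a loop whose iterations are independent (a minimal example of our own): every iteration below touches a different element, so all N additions can in principle happen at once.

```
// Element-wise vector add: no iteration depends on any other,
// so the loop body can execute for all i simultaneously (DLP).
for (int i = 0; i < N; i++)
    C[i] = A[i] + B[i];
```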
Outline
- DLP Overview
- Vector Processor
- GPGPU
Vector Operation
(Figure: a vector, e.g., [1 3 6 8 4 6 2 9 2 6 7 0 3 6 5 2], vs. individual scalars, e.g., 2, 1, 6, 8.)
- A vector is just a linear array of numbers, or scalars
- Can we extend processors with a vector data type?
- The idea of vector operations:
  - Read a set of data elements
  - Operate on the whole element array
Vector Instructions
- A vector instruction operates on a sequence of data items
- A few key advantages:
  - Reduced instruction fetch bandwidth: a single vector instruction specifies a great deal of work
  - Less data-hazard checking hardware: elements within the same vector operate independently
  - Amortized memory access latency: vector instructions have a known memory access pattern
Vector Architecture
Two primary styles of vector architecture:
- Memory-memory vector processors
  - All vector operations are memory to memory
  - Example: CDC STAR-100 (several times faster than the CDC 7600)
- Vector-register processors
  - The vector equivalent of load-store architectures
  - Cray, Convex, Fujitsu, Hitachi, and NEC machines of the 1980s
Key Components
- A vector processor typically consists of vector units plus the ordinary pipelined scalar units
- Vector register: a fixed-size bank holding a single vector
  - Typically holds 64–128 FP elements
  - Determines the maximum vector length (MVL)
- Vector register file: 8–32 vector registers
Early Design: Cray-1
- Released in 1975
- 115 kW, 5.5 tons, 160 MFLOPS
- Eight 64-entry vector registers
  - 64-bit elements: 4096 bits in each register, 4 KB for the vector RF
- Special vector instructions:
  - Vector + Vector, Vector + Scalar
  - Each instruction specifies 64 operations
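The register-file arithmetic checks out: 64 entries × 64 bits = 4096 bits per register, and 8 registers × 4096 bits = 32,768 bits = 4 KB in total.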
Case Study: VMIPS
- Vector registers
  - Each register holds a 64-element vector
  - The register file has 16 read ports and 8 write ports
- Vector functional units
  - 5 fully pipelined FUs
  - Hazard detection hardware
- Vector load-store unit
  - Loads or stores a vector
  - Bandwidth of 1 word per clock cycle
Case Study: VMIPS
- Add two vectors
  ADDVV.D V1, V2, V3 // add elements of V2 and V3, then put each result in V1
- Add vector to scalar
  ADDVS.D V1, V2, F0 // add F0 to each element of V2, then put each result in V1
- Load vector
  LV V1, R1 // load vector register V1 from memory starting at address R1
- Store vector
  SV V1, R1 // store vector register V1 to memory starting at address R1
Vectorized Code
- Consider a DAXPY loop: double-precision a·X plus Y (Y = a×X + Y)
- Assume the starting addresses of X and Y are in Rx and Ry, respectively
- MIPS code for DAXPY: ~600 instructions executed
- VMIPS code for DAXPY: 6 instructions, reproduced below (each vector operation works on 64 elements)
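For reference, the six-instruction sequence follows the classic VMIPS version in Hennessy and Patterson (Section 4.2), which these slides cite:

```
L.D     F0, a      // load scalar a
LV      V1, Rx     // load vector X
MULVS.D V2, V1, F0 // vector-scalar multiply: a*X
LV      V3, Ry     // load vector Y
ADDVV.D V4, V2, V3 // vector add: a*X + Y
SV      V4, Ry     // store the result back to Y
```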
Multiple Lanes
- ADDV C, A, B: element n of vector register A is hardwired to element n of vector register B
- Multiple parallel pipelines (or lanes) produce more than one result per cycle
(Figure: a multi-lane vector unit.)
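With 4 lanes, for example, the elements of each vector register are interleaved across the lanes: lane 0 holds elements 0, 4, 8, ..., lane 1 holds elements 1, 5, 9, ..., and so on, so each lane works through its local elements independently of the others.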
Vector Instruction Execution
- Vector execution time mainly depends on:
  - The length of the operand vectors
  - Structural hazards
  - Data dependences
- Start-up time: pipeline latency (depth of the FU pipeline)
- Each VMIPS FU consumes one element per cycle, so execution time is roughly the vector length
- Pipelined multi-lane vector execution (a worked example follows):
  T_vector = T_scalar + (vector length / number of lanes) - 1
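As a worked example (our numbers, for illustration): with a 6-cycle-deep FU pipeline (T_scalar = 6), a 64-element vector, and 4 lanes, T_vector = 6 + 64/4 - 1 = 21 cycles.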
Vector Chaining
- Chaining is the vector version of register bypassing
  - Allows a vector operation to start as soon as the individual elements of its vector source operand become available
- Example: LV V1 / MULV V3, V1, V2 / ADDV V5, V3, V4 — the multiply can start consuming elements of V1 as they arrive from the load unit, and the add can chain off V3 in turn
- Convoy: a set of vector instructions that may execute together
  - No structural hazards within a convoy
  - No data hazards (or they are managed via chaining) within a convoy
Vector Length Register
- What to do when the vector length is not exactly 64?
- The vector size may depend on n, e.g.:
  for (i = 0; i < n; i++) Y[i] = a * X[i] + Y[i];
- Vector length register (VLR)
  - Controls the length of any vector operation
  - Its value is no greater than the maximum vector length (MVL)
- Use strip mining for vectors longer than the maximum, as sketched below:
  - The first loop segment handles (n mod MVL) elements
  - All subsequent segments use VL = MVL
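A minimal strip-mining sketch in C-style pseudocode (variable names are ours; the inner loop stands in for a single vector operation executed with the current VL):

```
low = 0;
VL = n % MVL;                          // first, possibly short, segment
for (j = 0; j <= n / MVL; j++) {
    for (i = low; i < low + VL; i++)   // one vector operation of length VL
        Y[i] = a * X[i] + Y[i];
    low += VL;                         // advance to the next segment
    VL = MVL;                          // remaining segments run at full length
}
```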
Vector Mask Register
- An IF statement in a loop cannot normally be vectorized, e.g.:
  for (i = 0; i < 64; i++) if (X[i] != 0) X[i] = X[i] + Y[i];
- Vector mask control (illustrated by the sketch below)
  - Provides conditional execution of each element
  - Based on a Boolean vector (the vector mask register)
  - An instruction operates only on the vector elements whose corresponding entries in the mask register are 1
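Conceptually (a sketch, not actual VMIPS code), masked execution behaves as follows: a compare sets one mask bit per element, and the subsequent vector add commits results only where the mask is 1.

```
// set the mask: one bit per element
for (i = 0; i < 64; i++)
    mask[i] = (X[i] != 0);

// vector add under mask: elements with mask[i] == 0 are left unchanged
for (i = 0; i < 64; i++)
    if (mask[i]) X[i] = X[i] + Y[i];
```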
Outline
- DLP Overview
- Vector Processor
- GPGPU
General Purpose GPU
- The graphics processing unit (GPU) emerged in 1999
- A GPU is basically a massively parallel architecture
- GPGPU: general-purpose computation on the GPU
  - Leverages the available GPU execution units for acceleration
  - Became practical and popular after about 2001
  - CPU code runs on the host; GPU code runs on the device
- GPGPU exposes the massively multithreaded hardware via a high-level programming interface
  - NVIDIA's CUDA and AMD's CTM
Graphics Pipeline
- Modern GPUs structure their graphics computation in a similar organization called the graphics pipeline:
  Vertices → Primitives → Fragments → Pixels
(Figure credit: Kayvon Fatahalian, CMU, 2011)
Execution Model
- The programmer provides a serial task (a "shader program") that defines the pipeline logic for certain stages
- Vertex processors
  - Run vertex shader programs
  - Operate on point, line, and triangle primitives
- Fragment processors
  - Run pixel shader programs
  - Operate on rasterizer output to fill in the primitives
- Program GPGPUs with CUDA or OpenCL
  - A minimal extension of C/C++
  - Executes shader programs across many data elements
CUDA Parallel Computing Model
- CUDA stands for Compute Unified Device Architecture
- CUDA is the HW/SW architecture that enables NVIDIA GPUs to execute programs written in C, C++, etc.
- The GPU instantiates a kernel program, which is executed in parallel by a large number of CUDA threads
- CUDA threads are different from POSIX threads
CUDA Parallel Computing Model
- A thread block is an array of cooperating threads
  - 1–512 concurrent threads
  - Each thread has a unique ID
- Grid: an array of thread blocks that execute the same kernel (see the kernel sketch below)
  - Threads read from and write to global memory
  - Threads sync between dependent calls
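A minimal CUDA sketch of this hierarchy (function and variable names are our own): each thread computes one element, with its global ID derived from its block and thread indices.

```
__global__ void daxpy(int n, double a, const double *x, double *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread ID
    if (i < n)                                      // guard the ragged tail
        y[i] = a * x[i] + y[i];
}

// Host side: launch a grid of ceil(n/256) blocks, 256 threads per block
// daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, d_x, d_y);
```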
Hardware Execution
- The GPU hardware contains a collection of multithreaded SIMD processors that execute a grid of thread blocks
- Thread block scheduler
  - Assigns thread blocks to multithreaded SIMD processors
- SIMD thread scheduler
  - Schedules when threads of SIMD instructions should run
- The SIMD processor must have parallel functional units
  - SIMD lanes (the number of lanes per SIMD processor varies across GPUs)
Case Study: Tesla GPU Architecture
- The primary design objective for Tesla was to execute vertex and pixel-fragment shader programs on the same unified processor architecture
Case Study: Tesla GPU Architecture
- 8 texture/processor clusters (TPCs)
  - 2 streaming multiprocessors (SMs) per TPC
  - 8 streaming processor (SP) cores per SM
- The SM is a unified graphics and computing engine
  - Executes vertex and pixel-fragment shaders, as well as parallel computing tasks
  - Execution units include streaming processors (SPs) and special-function units (SFUs)
(Figure: one TPC — geometry controller, SMC, two SMs each with I-cache, MT issue, C-cache, 8 SPs, 2 SFUs, and shared memory, plus a shared texture unit.)
Case Study: Fermi GF100 Architectural Overview
- Four graphics processor clusters (GPCs)
- Each GPC contains four streaming multiprocessors (SMs)
(Figure: four GPCs of four SMs each, surrounding a shared L2 cache, with memory controllers (MCs) on both sides of the die.)
Case Study: Fermi GF100 Architectural Overview
- A single SM contains 32 CUDA cores
- Dual warp scheduler: two warp schedulers, each with its own dispatch unit
- Each CUDA core contains a dispatch port, an operand collector, an FP unit, an INT unit, and a results queue
(Figure: one SM — I-cache, dual warp schedulers and dispatch units, register file, 32 cores, load/store units, SFUs, interconnect network, shared memory / L1 cache, and texture cache.)
Warps
- GPUs rely on massive hardware multithreading to keep the arithmetic units occupied
- The SM executes a pool of threads organized into groups
  - Called a warp by NVIDIA
  - Called a wavefront by AMD
- Warp: 32 parallel CUDA threads of the same type (see the snippet below)
  - Start together at the same program address
  - All threads in a warp share a common PC
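In CUDA code, a thread's warp and its lane within that warp can be derived from its thread index (a sketch; warpSize is the built-in CUDA variable, 32 on NVIDIA GPUs):

```
int warp = threadIdx.x / warpSize;   // which warp within the block
int lane = threadIdx.x % warpSize;   // this thread's position in its warp
```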
Fermi's Dual Warp Scheduler
- Simultaneously schedules and dispatches instructions from two independent warps
Warp Scheduling
- Thread scheduling happens at the granularity of warps
- The scheduler selects a warp from a pool for execution
  - Considers instruction type, fairness, etc.
  - It is preferable that all SIMD lanes be occupied
  - All 32 threads of a warp take the same execution path
- The latency-hiding capability of GPGPUs comes from:
  - Fast context switching
  - A large number of warps
  - Warps issuing out of order
Warp Scheduling (Cont'd)
- Potential factors that can delay a warp's execution:
  - Scheduling policies
  - Instruction/data cache misses
  - Structural/control/data hazards
  - Synchronization primitives
(Figure: an SM's front end — fetch unit, L1 instruction cache, decoder, a warp pool holding warps 0 to n-1, warp scheduler with scoreboard, instruction and register buffers, feeding the SIMD lanes and the L1 data cache.)
Branch Divergence
- Normally, all threads of a warp execute in the same way
- What if the threads of a warp diverge due to control flow?
- Warp divergence: a branch instruction may cause some threads in a warp to jump while others fall through
  - Diverged warps have inactive thread lanes
  - Diverging paths execute serially
Stack-Based Re-Convergence
- Divergent branch example (threadIdx.x = 0..7):
  if (threadIdx.x < 5) { my_data[threadIdx.x] += a; }  // active mask: 11111000
  else { my_data[threadIdx.x] *= b; }                  // active mask: 00000111
- A re-convergence stack is used to merge the threads back
  - Each entry saves a re-convergence PC, an active mask, and an execution PC
  - After re-convergence, the active mask is 11111111 again
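One common way to sidestep divergence (a sketch, our own example, not from the slides) is to branch on the warp index rather than the thread index, so all 32 lanes of any given warp take the same path:

```
int warp = threadIdx.x / warpSize;
if (warp % 2 == 0) {
    // all 32 lanes of even-numbered warps take this path together
} else {
    // all 32 lanes of odd-numbered warps take this one
}
```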
Dynamic Warp Formation
- Form new warps from a pool of ready threads by combining threads whose PC values are the same
CPU-GPU Systems
- Integrate general-purpose programmable GPUs together with traditional CPUs on the same die
  - AMD (Fusion APUs), Intel (Sandy Bridge), ARM (Mali)
- Workloads need to be smartly partitioned across the system
(Figure: a chip-integrated CPU-GPU architecture.)
Summary
- Throughput computing and data-level parallelism
- Vector processors and vector instructions
  - VMIPS, vector registers, DAXPY, execution latency
  - SIMD lanes, chaining, vector length register
- GPGPU and the CUDA programming model
  - TPC, SM, SP, warps and warp scheduling
  - Branch divergence
  - CPU-GPU systems
References
Textbook: J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach, Fifth Edition. Chapters 4.1, 4.2, 4.4.
Further reading:
[1] E. Lindholm et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 2008.
[2] C. Wittenbrink et al. Fermi GF100 GPU Architecture. IEEE Micro, 2011.
[3] W. Fung et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. MICRO 2007.
Exercises
- Think about vector vs. superscalar vs. VLIW architectures
- What is the difference between SIMD vector architecture and SIMT processor architecture?
- What are the advantages of dynamic warp formation?