Computer Architecture (计算机体系结构)
Lecture 10: Data-Level Parallelism and GPGPU (第十讲: 数据级并行化与 GPGPU)
Chao Li, PhD (李超博士)
SJTU-SE346, Spring 2017

Review
- Thread, multithreading, SMT
- CMP and multicore; benefits of multicore
- Multicore system architecture
- Heterogeneous multicore systems; heterogeneous-ISA CMP
- Multicore and manycore
- Design challenges

Exercises
- Can we keep adding cores to a CPU? Why or why not?
- What are the opportunities for multicore design optimization?

Outline
- DLP Overview
- Vector Processor
- GPGPU

Throughput Workloads
- Interested in how many jobs can be completed: finance, social media, gaming, etc.
- CPUs with multiple cores are considered suitable
- In recent years we have observed rapid adoption of GPUs

Data-Level Parallelism
- DLP: parallelism comes from simultaneous operations across large data sets, rather than from multiple threads
- Particularly useful for large scientific/engineering tasks
(Figure: SISD issues one instruction over one data item at a time; SIMD issues one instruction over many data items, producing many results at once.)

Outline
- DLP Overview
- Vector Processor
- GPGPU

Vector Operation
- A vector is just a linear array of numbers (scalars), e.g., (1 3 6 8) or (4 6 2 9); a scalar is a single number, e.g., 2
- Can we extend processors with a vector data type?
- The idea of vector operations: read a set of data elements, then operate on the whole element array (see the example below)
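
For instance, the scalar loop below performs 64 independent additions (assuming A, B, and C are 64-element arrays); on a vector machine the entire loop collapses into a single vector add instruction:

  for (int i = 0; i < 64; i++)
      C[i] = A[i] + B[i];   // one vector instruction on a 64-element machine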

Vector Instructions
- A vector instruction operates on a sequence of data items
- A few key advantages:
  - Reduced instruction-fetch bandwidth: a single vector instruction specifies a great deal of work
  - Less data-hazard checking hardware: elements within the same vector operate independently
  - Amortized memory-access latency: vector instructions have a known memory-access pattern

Vector Architecture
- Two primary styles of vector architecture:
  - Memory-memory vector processors: all vector operations are memory-to-memory, e.g., the CDC STAR-100 (several times faster than the CDC 7600)
  - Vector-register processors: the vector equivalent of load-store architectures, e.g., the Cray, Convex, Fujitsu, Hitachi, and NEC machines of the 1980s

Key Components
- A vector processor typically consists of vector units plus the ordinary pipelined scalar units
- Vector register: a fixed-size bank holding a single vector, typically 64-128 FP elements; its size determines the maximum vector length (MVL)
- Vector register file: 8-32 vector registers

Early Design: Cray-1
- Released in 1975; 115 kW, 5.5 tons, 160 MFLOPS
- Eight 64-entry vector registers of 64-bit elements: 4096 bits per register, 4 KB for the whole vector register file
- Special vector instructions: vector + vector, vector + scalar
- Each instruction specifies 64 operations

Case Study: VMIPS
- Vector registers: 64-element registers; the register file has 16 read ports and 8 write ports
- Vector functional units: 5 fully pipelined FUs with hazard detection
- Vector load-store unit: loads or stores a vector at 1 word per clock cycle

Case Study: VMIPS (cont'd)
- Add two vectors:
  ADDVV.D V1, V2, V3  // add elements of V2 and V3, then put each result in V1
- Add vector to scalar:
  ADDVS.D V1, V2, F0  // add F0 to each element of V2, then put each result in V1
- Load vector:
  LV V1, R1           // load vector register V1 from memory starting at address R1
- Store vector:
  SV V1, R1           // store vector register V1 to memory starting at address R1

Vectorized Code
- Consider a DAXPY loop: double-precision a times X plus Y (Y = a*X + Y)
- Assume the starting addresses of X and Y are in Rx and Ry, respectively
- MIPS code for DAXPY takes ~600 instructions; VMIPS code takes 6 instructions, since each vector operation works on 64 elements (see the sequence below)
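
The slide does not list the six instructions; the corresponding VMIPS sequence from the cited textbook (Hennessy & Patterson, Ch. 4) is:

  L.D     F0, a        ; load scalar a
  LV      V1, Rx       ; load vector X
  MULVS.D V2, V1, F0   ; vector-scalar multiply: a*X
  LV      V3, Ry       ; load vector Y
  ADDVV.D V4, V2, V3   ; vector-vector add: a*X + Y
  SV      V4, Ry       ; store the result back into Y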

Multiple Lanes
- Example: ADDV C, A, B, where element n of vector register A is hardwired to element n of vector register B
- Multiple parallel pipelines (lanes) produce more than one result per cycle

Vector Instruction Execution
- Vector execution time mainly depends on: the length of the operand vectors, structural hazards, and data dependences
- Start-up time: pipeline latency (the depth of the FU pipeline)
- A VMIPS FU consumes one element per cycle, so execution time is roughly the vector length
- With pipelined multi-lane vector execution:
  T_vector = T_scalar + (vector length / number of lanes) - 1
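
As a quick worked example of the formula: a 64-element vector on a 4-lane unit needs about T_scalar + 64/4 - 1 = T_scalar + 15 cycles, versus T_scalar + 63 cycles on a single lane.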

Vector Chaining
- Chaining is the vector version of register bypassing: it allows a vector operation to start as soon as the individual elements of its vector source operand become available
- Example chained sequence (the load unit feeds MUL, which feeds ADD):
  LV   V1
  MULV V3, V1, V2
  ADDV V5, V3, V4
- Convoy: a set of vector instructions that may execute together
  - No structural hazards within a convoy
  - No data hazards (or hazards managed via chaining) within a convoy

Vector Length Register
- What to do when the vector length is not exactly 64? The vector size may depend on n, which is unknown at compile time:
  for (i = 0; i < n; i++)
      Y[i] = a * X[i] + Y[i];
- Vector length register (VLR): controls the length of any vector operation; its value is never greater than the maximum vector length (MVL)
- Use strip mining for vectors longer than the maximum: the first loop handles a short segment of (n mod MVL) elements, and all subsequent segments use VL = MVL (see the sketch below)
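
A minimal C sketch of strip mining, following the textbook's loop structure (X, Y, a, n, and MVL are assumed to be defined):

  // Process n elements in pieces of at most MVL elements; the first piece
  // picks up the odd-size remainder, the rest run at full vector length.
  int low = 0;
  int VL = n % MVL;                          // odd-size first piece
  for (int j = 0; j <= n / MVL; j++) {
      for (int i = low; i < low + VL; i++)   // runs as one vector operation
          Y[i] = a * X[i] + Y[i];
      low += VL;                             // start of the next piece
      VL = MVL;                              // subsequent pieces use VL = MVL
  }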

Vector Mask Register
- An IF statement inside a loop cannot normally be vectorized:
  for (i = 0; i < 64; i++)
      if (X[i] != 0)
          X[i] = X[i] + Y[i];
- Vector mask control provides conditional execution of each element
- It is based on a Boolean vector held in the vector mask register: an instruction operates only on the vector elements whose corresponding mask bits are 1 (see the sketch below)
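
A scalar C sketch of what masked execution computes; on a real vector machine the two loops become two vector instructions, a compare that sets the mask and an add executed under that mask:

  int mask[64];
  for (int i = 0; i < 64; i++)      // compare step: set the mask bits
      mask[i] = (X[i] != 0.0);
  for (int i = 0; i < 64; i++)      // add under mask: masked-off elements are untouched
      if (mask[i])
          X[i] = X[i] + Y[i];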

Outline
- DLP Overview
- Vector Processor
- GPGPU

General-Purpose GPU
- The graphics processing unit (GPU) emerged in 1999; a GPU is basically a massively parallel architecture
- GPGPU: general-purpose computation on the GPU, which leverages the available GPU execution units for acceleration; it became practical and popular after about 2001
- CPU code runs on the host; GPU code runs on the device
- GPGPU exposes the massively multithreaded hardware via a high-level programming interface: NVIDIA's CUDA and AMD's CTM

Graphics Pipeline
- Modern GPUs all structure their graphics computation in a similar organization called the graphics pipeline:
  Vertices -> Primitives -> Fragments -> Pixels
(Figure credit: Kayvon Fatahalian, CMU, 2011)

Execution Model
- The programmer provides a serial task (a shader program) that defines the pipeline logic for certain stages
- Vertex processors: run vertex shader programs; operate on point, line, and triangle primitives
- Fragment processors: run pixel shader programs; operate on rasterizer output to fill in the primitives
- GPGPUs are programmed with CUDA or OpenCL: a minimal extension of C/C++ that executes shader programs across many data elements

CUDA Parallel Computing Model
- CUDA stands for Compute Unified Device Architecture
- CUDA is the HW/SW architecture that enables NVIDIA GPUs to execute programs written in C, C++, and other languages
- The GPU instantiates a kernel program, which is executed in parallel by a large number of CUDA threads
- CUDA threads are different from POSIX threads

CUDA Parallel Computing Model (cont'd)
- A thread block is an array of cooperating threads: 1-512 concurrent threads, each with a unique ID
- Grid: an array of thread blocks that execute the same kernel, read/write global memory, and synchronize between dependent kernel calls
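
To make the model concrete, here is a minimal CUDA sketch (names are illustrative) of the DAXPY loop from the vector-processor part of the lecture; each CUDA thread handles one element, and the block/thread indices give every thread a unique ID:

  __global__ void daxpy(int n, double a, const double *x, double *y) {
      int i = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread ID
      if (i < n)                  // guard: n need not be a multiple of the block size
          y[i] = a * x[i] + y[i];
  }

  // Host-side launch: enough 256-thread blocks to cover all n elements, e.g.
  // daxpy<<<(n + 255) / 256, 256>>>(n, 2.0, d_x, d_y);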

Hardware Execution
- The GPU hardware contains a collection of multithreaded SIMD processors that executes a grid of thread blocks
- Thread block scheduler: assigns thread blocks to the multithreaded SIMD processors
- SIMD thread scheduler: schedules when threads of SIMD instructions should run
- The SIMD processor must have parallel functional units: SIMD lanes (the number of lanes per SIMD processor varies across GPUs)

Case Study: Tesla GPU Architecture
- The primary design objective for Tesla was to execute vertex and pixel-fragment shader programs on the same unified processor architecture

Case Study: Tesla GPU Architecture (cont'd)
- 8 texture/processor clusters (TPCs); 2 streaming multiprocessors (SMs) per TPC; 8 streaming processor (SP) cores per SM
- The SM is a unified graphics and computing engine: it executes vertex and pixel-fragment shaders as well as parallel computing tasks
- Three execution units: streaming processors (SPs), special-function units (SFUs), and the per-TPC texture unit
(Figure: each TPC holds a geometry controller, an SMC, a shared texture unit, and two SMs, each with an I-cache, MT issue unit, C-cache, 8 SPs, 2 SFUs, and shared memory.)

Case Study: Fermi GF100 Architectural Overview
- Four graphics processor clusters (GPCs); each GPC contains four streaming multiprocessors (SMs)
(Figure: the 16 SMs are arranged in four GPCs around a central L2 cache, flanked by the memory controllers (MCs).)

Case Study: Fermi GF100 Architectural Overview (cont'd)
- A single SM contains 32 CUDA cores and a dual warp scheduler
- Each CUDA core has a dispatch port, an operand collector, an FP unit and an INT unit, and a results queue
(Figure: the SM comprises an I-cache, two warp schedulers each with a dispatch unit, a register file, the 32 cores plus LD/ST units and SFUs, an interconnect network, shared memory / L1 cache, and a texture cache.)

Warps
- GPUs rely on massive hardware multithreading to keep the arithmetic units occupied
- An SM executes a pool of threads organized into groups, called a warp by NVIDIA and a wavefront by AMD
- Warp: 32 parallel CUDA threads of the same type that start together at the same program address; all threads in a warp share a common PC
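
As a small illustration, a thread's warp and its lane within that warp follow directly from its index (warpSize is a CUDA built-in, 32 on NVIDIA hardware):

  int lane = threadIdx.x % warpSize;   // position within the warp (0..31)
  int warp = threadIdx.x / warpSize;   // which warp of the block the thread belongs to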

Fermi's Dual Warp Scheduler
- Simultaneously schedules and dispatches instructions from two independent warps

Warp Scheduling
- Thread scheduling happens at the granularity of warps: the scheduler selects a warp from a pool for execution, considering instruction type, fairness, etc.
- It is preferable that all SIMD lanes stay occupied, i.e., all 32 threads of a warp take the same execution path
- The latency-hiding capability of a GPGPU comes from fast context switching among a large number of warps; warps can execute out of order

Warp Scheduling (Cont'd)
- Potential factors that can delay a warp's execution:
  - Scheduling policies
  - Instruction/data cache misses
  - Structural/control/data hazards
  - Synchronization primitives
(Figure: a streaming multiprocessor with an L1 instruction cache, fetch unit, decoder, warp pool holding warps 0..n-1, warp scheduler, scoreboard, instruction and register buffers, L1 data cache, and SIMD lanes for adds and multiplies.)

Branch Divergence
- Normally, all threads of a warp execute in the same way; what happens if the threads of a warp diverge due to control flow?
- Warp divergence: a branch instruction may cause some threads in a warp to jump while others fall through
- Diverged warps contain inactive thread lanes, and the diverging paths execute serially

Stack-Based Re-Convergence
- Example: an 8-thread warp (threadIdx.x = 0..7) reaches a divergent branch:
  if (threadIdx.x < 5) {
      my_data[threadIdx.x] += a;   // taken by threads 0-4, active mask 11111000
  } else {
      my_data[threadIdx.x] *= b;   // taken by threads 5-7, active mask 00000111
  }
  // after both paths complete, the full mask 11111111 is restored
- A re-convergence stack is used to merge the threads back: each entry saves a re-convergence PC, an active mask, and an execution PC

Dynamic Warp Formation
- Form new warps from a pool of ready threads by combining threads whose PC values are the same

CPU-GPU Systems
- Integrate general-purpose programmable GPUs with traditional CPUs on the same die: AMD (Fusion APUs), Intel (Sandy Bridge), ARM (Mali)
- Benchmarks must be partitioned smartly across such chip-integrated CPU-GPU architectures

Summary
- Throughput computing and data-level parallelism
- Vector processors and vector instructions: VMIPS, vector registers, DAXPY, execution latency, SIMD lanes, chaining, the vector length register
- GPGPU and the CUDA programming model: TPCs, SMs, SPs, warps and warp scheduling, branch divergence
- CPU-GPU systems

References
Textbook: J. Hennessy and D. Patterson. Computer Architecture: A Quantitative Approach, Fifth Edition. Chapters 4.1, 4.2, 4.4.
Further reading:
[1] E. Lindholm et al. NVIDIA Tesla: A Unified Graphics and Computing Architecture. IEEE Micro, 2008.
[2] C. Wittenbrink et al. Fermi GF100 GPU Architecture. IEEE Micro, 2011.
[3] W. Fung et al. Dynamic Warp Formation and Scheduling for Efficient GPU Control Flow. MICRO 2007.

Exercises
- Compare vector, superscalar, and VLIW architectures
- What is the difference between SIMD vector architecture and SIMT processor architecture?
- What are the advantages of dynamic warp formation?