Vector Processors and Graphics Processing Units (GPUs)


Vector Processors and Graphics Processing Units (GPUs)
Many slides from: Krste Asanovic, Electrical Engineering and Computer Sciences, University of California, Berkeley

TA Evaluations

Please fill out your TA evaluations; you should have received a link to do so. I think there might be CAPEs as well. If you've received mail about them, please fill them out.

Flynn's Taxonomy of Parallelism

Named for Michael Flynn (professor at Stanford), 1966:
- Single-Instruction, Single-Data (SISD): no parallelism; normal, single-threaded execution
- Multiple-Instruction, Multiple-Data (MIMD): multiple independent threads of execution; CMPs, MPs, etc.
- Single-Instruction, Multiple-Data (SIMD): a single instruction is applied across many pieces of data; vector processors
- Multiple-Instruction, Single-Data (MISD): ??? maybe for heterogeneity

The Rise of SIMD

SIMD is good for applying identical computations across many data elements, e.g.:

    for (i = 0; i < 100; i++) { A[i] = B[i] + C[i]; }

This is also called data-level parallelism.

SIMD is energy-efficient:
- Less control logic per functional unit
- Less instruction fetch and decode energy

SIMD computations tend to be bandwidth-efficient and latency-tolerant:
- Memory systems (caches, prefetchers, etc.) are good at sequential scans through arrays

Easy examples:
- Dense linear algebra
- Computer graphics (which includes a lot of dense linear algebra)
- Machine vision
- Digital signal processing

Harder examples can be made to work to gain the benefits above:
- Database queries
- Sorting
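Written carefully, the loop above is exactly the kind of code a vectorizing compiler can map straight onto SIMD hardware. A minimal C sketch (the restrict qualifiers, which promise the arrays don't alias, are our addition):

    // Vector add: no loop-carried dependence and no aliasing, so one
    // SIMD instruction can safely operate on several elements at once.
    void vadd(int n, const int *restrict b, const int *restrict c,
              int *restrict a) {
        for (int i = 0; i < n; i++)
            a[i] = b[i] + c[i];
    }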

Vector Programming Model

[Figure: vector programming model]
- Scalar registers r0-r15 alongside vector registers v0-v15, each holding elements [0] .. [VLRMAX-1]
- A Vector Length Register (VLR) controls how many elements an instruction processes
- Vector arithmetic instructions, e.g. ADDV v3, v1, v2: element-wise add of v1 and v2 into v3 over elements [0] .. [VLR-1]
- Vector load and store instructions, e.g. LV v1, r1, r2: load vector register v1 from memory, starting at base address r1 with stride r2
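Putting the pieces together, DAXPY (Y = a*X + Y) in this style of vector ISA might look like the sketch below. It follows the slide's ADDV/LV notation, but MULVS (vector-scalar multiply) and SV (vector store) are assumed by analogy; we also assume VLR has already been set, scalar a is in f0, and r3 holds a unit stride:

    LV    v1, r1, r3     # load VLR elements of X from base r1
    MULVS v2, v1, f0     # v2 = a * X
    LV    v3, r2, r3     # load VLR elements of Y from base r2
    ADDV  v4, v2, v3     # v4 = a*X + Y
    SV    v4, r2, r3     # store result back to Y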

Vector Instruction Execution

[Figure: execution of ADDV C, A, B]
- Using one pipelined functional unit: one element pair (A[i], B[i]) enters the pipeline per cycle, producing C[0], C[1], C[2], ... in sequence
- Using four pipelined functional units: four element pairs enter per cycle, with elements interleaved across the units (unit 0 handles elements 0, 4, 8, ...; unit 1 handles 1, 5, 9, ...; and so on)

Vector Unit Structure

[Figure: a vector unit divided into four lanes, connected to the memory subsystem]
Each lane contains a slice of the vector register file plus its own pipelined functional units: lane 0 holds elements 0, 4, 8, ...; lane 1 holds elements 1, 5, 9, ...; lane 2 holds elements 2, 6, 10, ...; lane 3 holds elements 3, 7, 11, ...

Multimedia Extensions (aka SIMD Extensions)

Very short vectors added to existing ISAs for microprocessors:
- Use existing 64-bit registers, split into 2x32b, 4x16b, or 8x8b
- Lincoln Labs TX-2 from 1957 had a 36b datapath split into 2x18b or 4x9b
- Newer designs have wider registers: 128b for PowerPC AltiVec and Intel SSE2/3/4, 256b for Intel AVX

A single instruction operates on all elements within the register; e.g., a 4x16b add performs four independent 16b additions in parallel.
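To make this concrete, here is a minimal sketch using Intel SSE2 intrinsics: an 8x16b add in a 128b register (the function name add16 and the alignment and multiple-of-8 assumptions are ours):

    #include <emmintrin.h>   /* SSE2 intrinsics */

    /* Element-wise add of 16-bit integers, eight lanes per instruction.
       Assumes n is a multiple of 8 and pointers are 16-byte aligned. */
    void add16(const short *a, const short *b, short *c, int n) {
        for (int i = 0; i < n; i += 8) {
            __m128i va = _mm_load_si128((const __m128i *)&a[i]);
            __m128i vb = _mm_load_si128((const __m128i *)&b[i]);
            _mm_store_si128((__m128i *)&c[i], _mm_add_epi16(va, vb));
        }
    }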

Multimedia Extensions versus Vectors

Limited instruction set:
- No vector length control
- No strided load/store or scatter/gather
- Unit-stride loads must be aligned to a 64/128-bit boundary

Limited vector register length:
- Requires superscalar dispatch to keep multiply/add/load units busy
- Loop unrolling to hide latencies increases register pressure

Trend towards fuller vector support in microprocessors:
- Better support for misaligned memory accesses
- Support for double-precision (64-bit floating-point)
- Intel AVX spec: 256b vector registers (expandable up to 1024b)
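The "vector length control" the extensions lack is what lets a true vector machine strip-mine a loop of any length. A C-level sketch of the idea (MVL, and the inner loop standing in for a single vector instruction, are assumptions):

    #define MVL 64   /* assumed maximum vector length */

    /* Each pass sets the vector length to min(remaining, MVL), so one
       loop handles any n with no separate scalar cleanup loop. */
    void daxpy_stripmined(int n, double a, double *x, double *y) {
        for (int i = 0; i < n; ) {
            int vl = (n - i < MVL) ? n - i : MVL;   /* set VLR */
            for (int k = 0; k < vl; k++)            /* one vector op */
                y[i + k] = a*x[i + k] + y[i + k];
            i += vl;
        }
    }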

DLP Important for Conventional CPUs

Prediction for x86 processors, from Hennessy & Patterson, 5th edition (note: an educated guess, not Intel product plans!):
- TLP (MIMD: multiple-instruction, multiple-data): 2+ cores every 2 years
- DLP (SIMD: single-instruction, multiple-data): 2x width every 4 years
- DLP will account for more mainstream parallelism growth than TLP in the next decade

Graphics Processing Units (GPUs)

Original GPUs were dedicated fixed-function devices for generating 3D graphics (mid-to-late 1990s), including high-performance floating-point units:
- Provided workstation-like graphics for PCs
- The user could configure the graphics pipeline, but not really program it

Over time, more programmability was added (2001-2005):
- E.g., the new language Cg for writing small programs to run on each vertex or each pixel; also Windows DirectX variants
- Massively parallel (millions of vertices or pixels per frame) but a very constrained programming model

Some users noticed they could do general-purpose computation by mapping input and output data to images, and computation to vertex- and pixel-shading computations:
- An incredibly difficult programming model, as general computation had to be expressed through the graphics pipeline

General-Purpose GPUs (GP-GPUs)

In 2006, Nvidia introduced the GeForce 8800 GPU supporting a new programming language: CUDA (Compute Unified Device Architecture). Subsequently, the broader industry pushed for OpenCL, a vendor-neutral version of the same ideas.

Idea: take advantage of GPU computational performance and memory bandwidth to accelerate some kernels for general-purpose computing.

Attached processor model: the host CPU issues data-parallel kernels to the GP-GPU for execution.

This lecture presents a simplified version of the Nvidia CUDA-style model and only considers GPU execution of computational kernels, not graphics (describing graphics processing would probably need another course).

Simplified CUDA Programming Model

Computation is performed by a very large number of independent small scalar threads (CUDA threads or µthreads) grouped into thread blocks.

    // C version of DAXPY (double-precision a*X + Y)
    void daxpy(int n, double a, double *x, double *y) {
        for (int i = 0; i < n; i++)
            y[i] = a*x[i] + y[i];
    }

Or, for large n, restructured into blocks of 256 iterations:

    // C version of DAXPY, blocked the way CUDA will block it
    void daxpy(int n, double a, double *x, double *y) {
        int nblocks = (n + 255) / 256;
        for (int j = 0; j < nblocks; j++)
            for (int k = 0; k < 256; k++) {
                int i = j*256 + k;
                if (i < n) y[i] = a*x[i] + y[i];
            }
    }

Simplified CUDA Programming Model

The same computation in CUDA: the host launches one CUDA thread per element, in blocks of 256.

    // CUDA host-side code
    int nblocks = (n + 255) / 256;          // 256 CUDA threads/block
    daxpy<<<nblocks, 256>>>(n, 2.0, x, y);

    // CUDA GPU-side code
    __global__ void daxpy(int n, double a, double *x, double *y) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }
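The launch above assumes x and y already point to memory the GPU can access. In a full CUDA program, the attached-processor model shows up explicitly as device allocation and copies across the CPU-GPU boundary; a minimal sketch (the d_x/d_y device-pointer names are ours):

    // Host-side flow: allocate device memory, copy inputs to the GPU,
    // launch the kernel, then copy results back to host memory.
    double *d_x, *d_y;
    cudaMalloc((void **)&d_x, n * sizeof(double));
    cudaMalloc((void **)&d_y, n * sizeof(double));
    cudaMemcpy(d_x, x, n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_y, y, n * sizeof(double), cudaMemcpyHostToDevice);
    int nblocks = (n + 255) / 256;
    daxpy<<<nblocks, 256>>>(n, 2.0, d_x, d_y);
    cudaMemcpy(y, d_y, n * sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_x);
    cudaFree(d_y);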

Programmer's View of Execution

[Figure: a grid of thread blocks]
- blockDim = 256 (programmer can choose)
- Block blockIdx 0 contains threadIdx 0, 1, ..., 255; so does block blockIdx 1, and so on up to the last block, blockIdx (n+255)/256 - 1
- Create enough blocks to cover the input vector (Nvidia calls this ensemble of blocks a Grid; it can be 2-dimensional)
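Since the grid can be 2-dimensional, the indexing generalizes naturally to matrices; a small sketch (the kernel name scale2d and the 16x16 block shape are our choices):

    // One µthread per matrix element, organized as a 2-D grid.
    __global__ void scale2d(int rows, int cols, float s, float *m) {
        int r = blockIdx.y*blockDim.y + threadIdx.y;  // row index
        int c = blockIdx.x*blockDim.x + threadIdx.x;  // column index
        if (r < rows && c < cols)
            m[r*cols + c] *= s;
    }

    // Launch with enough 16x16 blocks to cover the matrix:
    // dim3 threads(16, 16);
    // dim3 blocks((cols + 15)/16, (rows + 15)/16);
    // scale2d<<<blocks, threads>>>(rows, cols, 2.0f, d_m);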

Hardware Execution Model

[Figure: CPU with CPU memory; GPU with GPU memory; the GPU contains cores 0..15, each with lanes 0..15]

- The GPU is built from multiple parallel cores; each core contains a multithreaded SIMD processor with multiple lanes, but no scalar processor
- The CPU sends a whole grid over to the GPU, which distributes thread blocks among cores (each thread block executes on one core)
- The programmer is unaware of the number of cores

Single Instruction, Multiple Thread (SIMT)

GPUs use a SIMT model, where the individual scalar instruction streams for each CUDA thread are grouped together for SIMD execution on hardware (Nvidia groups 32 CUDA threads into a warp).

All vector loads and stores are scatter-gather, as individual µthreads perform scalar loads and stores.

[Figure: the scalar instruction stream ld x; mul a; ld y; add; st y executing across µt0-µt7 of a warp in SIMD fashion]

Conditionals in the SIMT Model

Simple if-then-else constructs are compiled into predicated execution, equivalent to vector masking. More complex control flow is compiled into branches. How does hardware execute a vector of branches?

[Figure: the scalar instruction stream tid = threadid; if (tid >= n) goto skip; call func1; add; st y; skip: executing across µt0-µt7 of a warp, with some µthreads masked off]
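To make divergence concrete, here is a minimal kernel with a data-dependent branch (the kernel and the clamping operation are our invention, not from the slides). When some µthreads of a warp take the inner branch and others do not, the hardware executes the taken path with the non-taken lanes masked off:

    // Hypothetical kernel: clamp negative elements to zero.
    __global__ void clamp_nonneg(int n, float *a) {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) {                 // predication / vector masking
            if (a[i] < 0.0f)         // µthreads in a warp may diverge here
                a[i] = 0.0f;         // executed with some lanes masked off
        }
    }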

Warps Are Multithreaded on the Core [Nvidia, 2010]

- One warp of 32 µthreads is a single thread in the hardware
- Multiple warp threads are interleaved in execution on a single core to hide latencies (memory and functional unit)
- A single thread block can contain multiple warps (up to 512 µthreads max in CUDA), all mapped to a single core
- Multiple blocks can be executing on one core

GPU Memory Hierarchy

[Figure: GPU memory hierarchy, from Nvidia, 2010]

Nvidia Kepler GK110 GPU

[Figure: Kepler GK110 block diagram, from Nvidia, 2012]

Fermi Streaming Multiprocessor Core

Scheduling:
- Up to 2 instructions each from 4 warps
- Broadcasts instructions to 32 thread processors
- Up to 256 operations per cycle(!)

This is an in-order, SMT, vector machine:
- It fetches from multiple threads (warps) each cycle
- The instructions in each thread are effectively vector instructions

Kepler Die Photo

[Figure: Kepler die photo]

Apple A5X Processor for iPad v3 (2012)

12.90 mm x 12.79 mm, 45 nm technology. [Source: Chipworks, 2012]

Apple A7

Runs in the iPhone 5s: 64-bit, >1 billion transistors, lots of GPU.

Sony PS4

[Figure: Sony PS4 processor]

Microsoft Xbox One

5 billion transistors, 47 MB of on-chip memory.

GPU Future

High-end desktops have a separate GPU chip, but the trend is towards integrating the GPU on the same die as the CPU (already done in laptops, tablets, and smartphones):
- Advantage: shared memory with the CPU; no need to transfer data
- Disadvantage: reduced memory bandwidth compared to a dedicated, smaller-capacity, specialized memory system (graphics DRAM (GDDR) versus regular DRAM (DDR3))

Will GP-GPU survive? Or will improvements in CPU DLP make GP-GPU redundant?
- On the same die, CPU and GPU have the same memory bandwidth
- The GPU might still have more FLOPS, as these are needed for graphics anyway

Acknowledgements

These slides contain material developed and copyrighted by:
- Krste Asanovic (MIT/UCB)
- Steven Swanson (UCSD)
- Arvind (MIT)
- Joel Emer (Intel/MIT)
- James Hoe (CMU)
- John Kubiatowicz (UCB)
- David Patterson (UCB)

MIT material derived from course 6.823; UCB material derived from course CS252; UCSD material derived from course 240a.