ECE 8823: GPU Architectures. Objectives


ECE 8823: GPU Architectures
Introduction

Objectives
- Distinguishing features of GPUs vs. CPUs
- Major drivers in the evolution of general-purpose GPUs (GPGPUs)

Reading
- Chapter 1
- Chapter 2: 2.2, 2.3

CPU and GPU have very different design philosophies
- GPU, throughput-oriented cores: chip, compute unit, cache/local memory, registers, SIMD unit, threading
- CPU, latency-oriented cores: chip, core, local cache, registers, SIMD unit, control
(Source: David Kirk/NVIDIA and Wen-mei W. Hwu, ECE408/CS483, University of Illinois, Urbana-Champaign)

CPUs: Latency-Oriented Design
- Large caches: convert long-latency memory accesses to short-latency cache accesses
- Sophisticated control: branch prediction for reduced branch latency; data forwarding for reduced data latency
- Powerful ALUs: reduced operation latency
- Small number of hardware threads
[Figure: CPU block diagram showing control, cache, ALUs, and DRAM]

GPUs: Throughput-Oriented Design
- Small caches: to boost memory throughput
- Simple control: no branch prediction, no data forwarding
- Energy-efficient ALUs: many, long-latency but heavily pipelined for high throughput
- Require a massive number of threads to tolerate latencies
[Figure: GPU block diagram showing many cores and DRAM]


Winning Applications Use Both CPU and GPU
- CPUs for sequential parts where latency matters: CPUs can be 10+X faster than GPUs for sequential code
- GPUs for parallel parts where throughput wins: GPUs can be 10+X faster than CPUs for parallel code

Evolution from Graphics Pipelines
- A fixed-function NVIDIA GeForce graphics pipeline
- Unified programmable processor array of the GeForce 8800 GT graphics pipeline
(Figures: Elsevier, Inc. All rights reserved.)
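The "use both" argument above can be made concrete with a little arithmetic. The following is a minimal sketch, not a measurement: it assumes an application whose CPU-only run time is normalized to 1.0, and it takes the slide's "10+X" figure as exactly 10x in both directions. The function name `run_times` and its parameters are illustrative, not from the lecture.

```python
def run_times(seq_fraction, speed_gap=10.0):
    """Return (cpu_only, gpu_only, hybrid) run times.

    Times are normalized so that the CPU alone needs 1.0 time unit.
    Assumption: the GPU is `speed_gap` times slower on sequential code
    and `speed_gap` times faster on parallel code.
    """
    par_fraction = 1.0 - seq_fraction
    cpu_only = seq_fraction + par_fraction
    gpu_only = seq_fraction * speed_gap + par_fraction / speed_gap
    # Hybrid: sequential part on the CPU, parallel part on the GPU.
    hybrid = seq_fraction + par_fraction / speed_gap
    return cpu_only, gpu_only, hybrid


for s in (0.1, 0.5, 0.9):
    cpu, gpu, hyb = run_times(s)
    print(f"sequential fraction {s:.1f}: CPU {cpu:.2f}, GPU {gpu:.2f}, hybrid {hyb:.2f}")
```

Under these assumptions the hybrid split is never slower than either processor alone, and the sequential fraction bounds how much the GPU can help.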


GPUs and High Performance Computing
- 299,088 Opteron cores
- 18,688 K20 GPUs (2496 cores/GPU)
- 710 Tbytes of memory

GPUs and the Enterprise
- Amazon Elastic Compute Cloud (EC2) GPU instances for high-throughput, data-intensive processing
- Intel CPU + NVIDIA GPU co-processors
(Image source: slideshare.net)

Green Heterogeneous Processors
- Examples: Qualcomm Snapdragon, IBM Power 8
- General-purpose cores combined with accelerators
- CPUs are transitioning to System on Chip (SoC) designs
- Multiple Instruction Set Architectures (ISAs)


Multiple Programming Models
- Example: AMD Trinity
- Multithreaded cores and vector units
- General Purpose Graphics Processing Unit (GPGPU): Bulk Synchronous Parallel model

Parallel Programming Work Flow
- Identify compute-intensive parts of an application
- Adopt scalable algorithms
- Optimize data arrangements to maximize locality
- Performance tuning
- Pay attention to code portability and maintainability

Software Dominates System Cost
- SW lines per chip increase at 2x/10 months
- HW gates per chip increase at 2x/18 months
- Future systems must minimize software redevelopment
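To see how quickly the two doubling rates above diverge, here is a small worked calculation. It is only a sketch: it assumes both trends are clean exponentials starting from the same baseline, and the function names are made up for illustration.

```python
def growth_factor(months, doubling_period_months):
    """Growth multiple after `months`, given a fixed doubling period."""
    return 2.0 ** (months / doubling_period_months)


def software_hardware_gap(months, sw_doubling=10.0, hw_doubling=18.0):
    """Ratio of software growth to hardware growth after `months`.

    Defaults are the slide's figures: SW lines double every 10 months,
    HW gates double every 18 months.
    """
    return growth_factor(months, sw_doubling) / growth_factor(months, hw_doubling)


for months in (18, 45, 90):
    print(f"after {months} months, software has outgrown hardware by "
          f"{software_hardware_gap(months):.1f}x")
```

After 90 months, software has doubled 9 times (512x) while hardware has doubled 5 times (32x), a 16x gap, which is why avoiding software redevelopment matters so much.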

Keys to Software Cost Control: Scalability
- The same application runs efficiently on new generations of cores
- The same application runs efficiently on more of the same cores


Scalability and Portability
Performance growth with HW generations:
- Increasing number of compute units
- Increasing number of threads
- Increasing vector length
- Increasing pipeline depth
- Increasing DRAM burst size
- Increasing number of DRAM channels
- Increasing data movement latency
Portability across many different HW types:
- Multi-core CPUs vs. many-core GPUs
- VLIW vs. SIMD vs. threading
- Shared memory vs. distributed memory

Key to Portability
- Languages and programming models (C/C++, CUDA, Haskell, C++AMP, Datalog, OpenCL): designed for productivity
- Tools: compiler, run time
- Execution models (EM): dynamic translation of EMs to bridge this gap
- Hardware architectures: designed under speed, cost, and energy constraints

Keys to Software Cost Control: Scalability and Portability
- The same application runs efficiently on different types of cores
- The same application runs efficiently on systems with different organizations and interfaces

Parallelism Scalability

Algorithm Complexity and Data Scalability

Why is data scalability important?
- Any algorithm complexity higher than linear is not data scalable
  - Execution time explodes as data size grows, even for an n*log(n) algorithm
- Processing large data sets is a major motivation for parallel computing
- A sequential algorithm with linear data scalability can outperform a parallel algorithm with n*log(n) complexity
  - log(n) grows to be greater than the degree of HW parallelism and makes the parallel algorithm run slower than the sequential algorithm
- Parallelism cannot overcome complexity for large data sets
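The crossover described above can be checked with a toy cost model. This sketch assumes unit constant factors, base-2 logarithms, and perfect parallel efficiency on `p` processors; none of these assumptions come from the lecture, and real constants would move the crossover point.

```python
import math


def sequential_linear_time(n):
    """Cost of a linear-time sequential algorithm (unit constant assumed)."""
    return float(n)


def parallel_nlogn_time(n, p):
    """Cost of an n*log2(n) algorithm spread perfectly over p processors."""
    return n * math.log2(n) / p


def crossover_size(p):
    """Data size at which log2(n) equals p, so both costs are equal."""
    return 2 ** p


p = 16  # hypothetical degree of hardware parallelism
for n in (2 ** 10, crossover_size(p), 2 ** 20):
    seq = sequential_linear_time(n)
    par = parallel_nlogn_time(n, p)
    winner = "parallel" if par < seq else "sequential (or tie)"
    print(f"n = {n}: sequential {seq:.0f}, parallel {par:.0f}, winner: {winner}")
```

In this model the parallel n*log(n) algorithm wins only while log2(n) is below p; past that point, adding data makes the sequential linear algorithm the faster one.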

Massive Parallelism
- How do you orchestrate correct computation?
- Bulk synchronous parallel (BSP) execution model

Massive Parallelism - Regularity
(Wen-mei Hwu, CTHPC, 5/24/2012)
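As a rough illustration of the BSP execution model named above, the sketch below runs two supersteps separated by a barrier, using Python threads in place of GPU threads. The function `bsp_chunk_offsets` and its data layout are hypothetical; the only point is the compute, synchronize, then communicate pattern.

```python
import threading


def bsp_chunk_offsets(data, num_workers=4):
    """Two-superstep BSP sketch: local sums, barrier, then exclusive offsets."""
    chunk = (len(data) + num_workers - 1) // num_workers
    partial = [0] * num_workers
    offsets = [0] * num_workers
    barrier = threading.Barrier(num_workers)

    def worker(rank):
        # Superstep 1: purely local computation on this worker's chunk.
        lo = rank * chunk
        hi = min((rank + 1) * chunk, len(data))
        partial[rank] = sum(data[lo:hi])
        # Barrier: no worker reads partial[] until every worker has written it.
        barrier.wait()
        # Superstep 2: consume the results published by the other workers.
        offsets[rank] = sum(partial[:rank])

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return partial, offsets


partial, offsets = bsp_chunk_offsets(list(range(1, 9)), num_workers=4)
print("partial sums:", partial)
print("exclusive offsets:", offsets)
```

Because every read of shared data happens after the barrier, the result does not depend on thread scheduling, which is the correctness property BSP is meant to provide.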

Load Balance
- The total amount of time to complete a parallel job is limited by the thread that takes the longest to finish
[Figure: good vs. bad load balance]

Global Memory Bandwidth
[Figure: ideal vs. reality]
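The load-balance rule above amounts to taking a maximum over threads. Here is a minimal sketch, assuming each thread's work is a single number in arbitrary time units and ignoring scheduling and memory effects; the function names are illustrative.

```python
def completion_time(per_thread_work):
    """A parallel job finishes only when its slowest thread finishes."""
    return max(per_thread_work)


def efficiency(per_thread_work):
    """Fraction of the available thread-time that did useful work."""
    total = sum(per_thread_work)
    return total / (len(per_thread_work) * completion_time(per_thread_work))


balanced = [25, 25, 25, 25]    # same total work, evenly spread
imbalanced = [10, 10, 10, 70]  # same total work, one straggler

for name, work in (("balanced", balanced), ("imbalanced", imbalanced)):
    print(f"{name}: finishes at t = {completion_time(work)}, "
          f"efficiency {efficiency(work):.2f}")
```

Both distributions contain 100 units of work, yet the imbalanced one takes 70 time units instead of 25 because three threads sit idle waiting for the straggler.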

Conflicting Data Accesses Cause Serialization and Delays
- Massively parallel execution cannot afford serialization
- Contention in accessing critical data causes serialization

What is at stake?
- Scalable and portable software lasts through many hardware generations
- Scalable algorithms and libraries can be the best legacy we can leave behind from this era

What About Architecture?
- Changing ISAs has a disruptive effect on the software stack
- Hardware evolution requires microarchitectural advances to sustain performance advances
[Figure: hardware evolution vs. scalability and portability]

NVIDIA GPU Roadmap
- Need a sustaining HW/SW interface
- Virtual ISAs and JIT compilation
(Image source: vr-zone.com)

Virtual ISAs
- Front-End: applications target the Parallel Thread Execution (PTX) ISA
- Back-End: native ISAs (Native ISA-1, Native ISA-1+, Native ISA-1++) across hardware generations (Gen1, Gen2, Gen3)

Major Themes
- Heterogeneous architectures: CPU + GPU organizations
- Programming models: CUDA, OpenCL, OpenACC
- Massive parallelism and the BSP execution model
- Base microarchitecture
- Optimizations
- Memory hierarchy

Questions?
