GPU Programming. Lecture 1: Introduction. Miaoqing Huang, University of Arkansas


Outline
- Course Introduction
- GPUs as Parallel Computers
  - Trend and Design Philosophies
  - Programming and Execution Model
- Programming GPUs using Nvidia CUDA


Course Objective
- GPU programming: learn how to program massively parallel processors and achieve
  - High performance
  - Scalability across future generations
- Acquire technical knowledge required to achieve the above goals
  - Principles and patterns of parallel programming
  - Processor architecture features and constraints
  - Programming APIs, tools and techniques

Lab assignments, Projects, and Course grading
- Constituent components of course grading
  - 10 lab assignments: 60%
  - Quizzes: 10% (frequently given, approximately once a week)
  - Final exam: 30%
- All lab assignments are to be carried out individually

Lab Equipment
- GPU computing GPU: GeForce GTX, CUDA cores
  - We will have access to computers with GTX 480
- GPU: Tesla C, CUDA cores
  - Low power consumption, high double-precision floating-point performance
- Kepler GPU: Tesla K, CUDA cores
  - High double-precision floating-point performance, advanced features


Why Massively Parallel Processor
- A quiet revolution and potential build-up
  - Performance advantages over multicore CPUs
    - GFLOPS: 1,000 vs. 100 (in year 2009)
    - Memory bandwidth (GB/s): 200 vs. 20
  - GPU in every PC and workstation: massive volume and potential impact

Different Design Philosophies
[Figure: chip layouts of a CPU and a GPU, showing control logic, ALUs, cache, and DRAM]
- CPU: sequential execution
  - Multiple complicated ALU design
  - Complicated control logic, e.g., branch prediction
  - Big cache
- GPU: parallel computing
  - Many simple processing cores
  - Simple control and scheduling logic
  - No or small cache

Architecture of GPU
[Figure: GPU block diagram (host interface, GigaThread scheduler, L2 cache, DRAM interfaces) and one streaming multiprocessor (instruction cache, two warp schedulers with dispatch units, register file, CUDA cores each with dispatch port, operand collector, FP unit, INT unit, and result queue, four special function units, interconnect network, 64 KB shared memory / L1 cache)]
- 512 streaming processors in 16 streaming multiprocessors

Architecture of Kepler GPU
- 2,880 streaming processors in 15 streaming multiprocessors

Architecture of Kepler Streaming Multiprocessor
- Each streaming multiprocessor contains 192 single-precision cores and 64 double-precision cores
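
As a quick consistency check on the core counts quoted above, the number of cores per streaming multiprocessor is the total divided by the number of multiprocessors. A minimal Python sketch; the function and variable names are illustrative and not part of any CUDA API, and it assumes the cores are split evenly:

```python
def cores_per_sm(total_cores: int, num_sms: int) -> int:
    """Return cores per streaming multiprocessor, assuming an even split."""
    if num_sms <= 0:
        raise ValueError("num_sms must be positive")
    if total_cores % num_sms != 0:
        raise ValueError("cores do not divide evenly across multiprocessors")
    return total_cores // num_sms


# Figures quoted in the slides.
earlier_gpu = cores_per_sm(512, 16)    # 512 cores in 16 multiprocessors
kepler_gpu = cores_per_sm(2880, 15)    # 2,880 cores in 15 multiprocessors
print(earlier_gpu, kepler_gpu)
```

The Kepler result agrees with the 192 single-precision cores stated for each Kepler streaming multiprocessor.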

Basic Programming Model on GPU
[Figure 2-1. Grid of Thread Blocks: a grid of 3 x 2 blocks, Block (0, 0) through Block (2, 1); Block (1, 1) is expanded into 4 x 3 threads, Thread (0, 0) through Thread (3, 2)]
- Issue hundreds of thousands of threads targeting thousands of processors
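
The grid-of-blocks layout in Figure 2-1 can be modeled in a few lines. The sketch below is plain Python, not CUDA code; it assumes the usual CUDA convention that a thread's global coordinate is block index times block size plus thread index, and the names `GRID_DIM`, `BLOCK_DIM`, `global_index`, and `all_threads` are made up for illustration:

```python
from typing import Iterator, Tuple

GRID_DIM = (3, 2)    # blocks per grid in x and y, as drawn in Figure 2-1
BLOCK_DIM = (4, 3)   # threads per block in x and y, as drawn in Figure 2-1


def global_index(block_idx: Tuple[int, int],
                 thread_idx: Tuple[int, int],
                 block_dim: Tuple[int, int] = BLOCK_DIM) -> Tuple[int, int]:
    """Map (block index, thread index) to a global (x, y) coordinate."""
    bx, by = block_idx
    tx, ty = thread_idx
    return (bx * block_dim[0] + tx, by * block_dim[1] + ty)


def all_threads(grid_dim: Tuple[int, int] = GRID_DIM,
                block_dim: Tuple[int, int] = BLOCK_DIM) -> Iterator[Tuple[int, int]]:
    """Yield the global coordinate of every thread in the grid."""
    for by in range(grid_dim[1]):
        for bx in range(grid_dim[0]):
            for ty in range(block_dim[1]):
                for tx in range(block_dim[0]):
                    yield global_index((bx, by), (tx, ty), block_dim)


coords = list(all_threads())
print(len(coords))                     # 72 threads: 6 blocks of 12 threads
print(global_index((1, 1), (3, 2)))    # (7, 5): last thread of Block (1, 1)
```

Every thread gets a distinct coordinate, which is what lets each one work on its own piece of the data.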

Execution Model on GPU: Hiding shader stalls
[Figure sequence, from "Beyond Programmable Shading: Fundamentals" (slides 32-37): one core with a shared fetch/decode unit, 8 ALUs, 8 per-fragment contexts (Ctx), and shared context data. The core holds four groups of fragments (Frag 1-8, Frag 9-16, Frag 17-24, Frag 25-32). Along the time axis (clocks), each group moves between running, Stall, and Runnable, and the core switches to another group whenever the current one stalls, until every group is Done.]

Execution Model on GPU: Throughput!
- Increase run time of one group
- To maximize throughput of many groups
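
The trade-off between the run time of one group and the throughput of many groups can be illustrated with a toy scheduler. This is a simplified model under stated assumptions, not a description of real hardware: every group alternates a fixed compute burst with a fixed stall, a switch between groups costs nothing, and the core always picks the group that has been runnable the longest. The function name and parameters are hypothetical.

```python
from typing import List, Tuple


def simulate(num_groups: int, phases: int,
             compute: int, stall: int) -> Tuple[List[int], int]:
    """Interleave groups on one core; return (finish clock per group, busy clocks).

    Each group runs `phases` compute bursts of `compute` clocks. After every
    burst it stalls for `stall` clocks, during which another group may run.
    A group's finish clock is the end of its last compute burst.
    """
    remaining = [phases] * num_groups   # bursts left per group
    ready_at = [0] * num_groups         # clock at which each group is runnable
    finish = [0] * num_groups
    clock = busy = 0
    while any(remaining):
        runnable = [g for g in range(num_groups)
                    if remaining[g] and ready_at[g] <= clock]
        if not runnable:
            # Every unfinished group is stalled: the core idles until one wakes.
            clock = min(ready_at[g] for g in range(num_groups) if remaining[g])
            continue
        # Pick the group that became runnable first (ties go to the lowest index).
        g = min(runnable, key=lambda i: (ready_at[i], i))
        clock += compute
        busy += compute
        remaining[g] -= 1
        ready_at[g] = clock + stall
        finish[g] = clock
    return finish, busy


alone = simulate(num_groups=1, phases=4, compute=10, stall=20)
shared = simulate(num_groups=4, phases=4, compute=10, stall=20)
print(alone)    # ([100], 40): the core is busy 40 of 100 clocks
print(shared)   # ([130, 140, 150, 160], 160): the core is never idle
```

In this model the first group takes 130 clocks instead of 100 when it shares the core, but four groups complete in 160 clocks rather than the 400 they would need one after another.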

But what about branches? How to deal with branches?
[Figure sequence, from "Beyond Programmable Shading: Fundamentals" (slides 25-28): 8 ALUs (ALU 1 ... ALU 8) over time (clocks) executing the shader code below. The branch outcomes across the 8 ALUs are T T F T F F F F.]

    <unconditional shader code>
    if (x > 0) {
        y = pow(x, exp);
        y *= Ks;
        refl = y + Ka;
    } else {
        x = 0;
        refl = Ka;
    }
    <resume unconditional shader code>

- Not all ALUs do useful work!
- Worst case: 1/8 performance
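
To make the cost of the T T F T F F F F pattern concrete, the sketch below runs the same if/else on 8 lanes in lockstep: every instruction of a path occupies all lanes, but only the lanes whose condition matches do useful work. It is a simplified Python model, not shader or CUDA code; the constants `KS`, `KA`, and `EXP` are hypothetical values, and each statement is assumed to cost one instruction.

```python
import math
from typing import Callable, List, Tuple

KS, KA, EXP = 0.5, 0.1, 2.0   # hypothetical shader constants


def run_branch(xs: List[float]) -> Tuple[List[float], float]:
    """Run the if/else on all lanes in lockstep; return (refl, ALU utilization)."""
    lanes = len(xs)
    x = [float(v) for v in xs]
    y = [0.0] * lanes
    refl = [0.0] * lanes
    useful = 0   # lane-slots that did real work
    issued = 0   # lane-slots occupied (all lanes, for every instruction)

    def step(active: List[bool], target: List[float],
             fn: Callable[[int], float]) -> None:
        """Issue one instruction to all lanes; only active lanes write results."""
        nonlocal useful, issued
        for i in range(lanes):
            if active[i]:
                target[i] = fn(i)
        useful += sum(active)
        issued += lanes

    taken = [v > 0 for v in x]             # per-lane branch outcome (T/F)
    not_taken = [not t for t in taken]

    if any(taken):                         # "if" side: three instructions
        step(taken, y, lambda i: math.pow(x[i], EXP))
        step(taken, y, lambda i: y[i] * KS)
        step(taken, refl, lambda i: y[i] + KA)
    if any(not_taken):                     # "else" side: two instructions
        step(not_taken, x, lambda i: 0.0)
        step(not_taken, refl, lambda i: KA)
    return refl, useful / issued


refl, utilization = run_branch([1, 2, -1, 3, -2, -3, -4, -5])  # T T F T F F F F
print(utilization)   # 0.475: 19 useful lane-slots out of 40
```

When all lanes agree, only one side is issued and utilization is 1.0. In this model utilization tends toward 1/8 when a single lane takes a path that is much longer than the other.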


Partition of an application
- Sequential portions: traditional CPU coverage
- Parallel portions: GPU coverage

Obstacles
- Increase the data parallel portion of an application
  - Analyze an existing application
  - Expand the data volume of the parallel part
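
One common way to see why the data parallel portion matters is Amdahl's law, which the slide does not name but which fits the sequential/parallel split it describes. The sketch below assumes the parallel portion is sped up by a fixed factor while the sequential portion runs unchanged; the 10x factor is only an illustrative value, echoing the 1,000 vs. 100 GFLOPS comparison earlier in the lecture.

```python
def amdahl_speedup(parallel_fraction: float, parallel_speedup: float) -> float:
    """Overall speedup when only `parallel_fraction` of the run time is accelerated."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if parallel_speedup <= 0.0:
        raise ValueError("parallel_speedup must be positive")
    sequential = 1.0 - parallel_fraction
    return 1.0 / (sequential + parallel_fraction / parallel_speedup)


# Illustrative assumption: the GPU runs the parallel portion 10x faster.
for fraction in (0.5, 0.9, 0.99):
    print(fraction, round(amdahl_speedup(fraction, 10.0), 2))
```

With half of the run time parallel, the overall gain stays below 2x no matter how fast the GPU is, which is why enlarging the parallel portion comes first.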


Nvidia CUDA
- CUDA driver
  - Handles the communication with Nvidia GPUs
- CUDA toolkit
  - Contains the tools needed to compile and build a CUDA application
- CUDA SDK
  - Includes sample projects that provide source code and other resources for constructing CUDA programs

Program GPUs in JBHT
1. CUDA 6.5, including drivers and toolkits, has been installed on all computers with GTX GPUs
2. The environment variables have been properly set to compile your code
3. Log into the machine using your uark id: username
4. Remote log into the machine, say, hostname is csce-t7500-xx (e.g., 01-14):
   ssh username@hostname.ddns.uark.edu
   On Windows platform, SSH Secure Shell Client is free to download and use
