Hardware/Software Co-Design

1 Hardware/Software Co-Design
Miaoqing Huang, University of Arkansas, Fall 2011

2 Outline

4 CSCE Special Topic in Hardware/Software Co-Design
Instructor: Miaoqing Huang
  - Office: JBHT 526; Tel:
  - Office Hours: Mon 1:30-2:30PM; Wed 3:30-4:30PM
Meeting: Mon, Wed, Fri 2:30-3:20PM, JBHT 237
Class Website: mqhuang/courses/5013/f2011/csce5013_fall2011.htm (or access through my home page)
Textbooks (optional)
  1 Programming Massively Parallel Processors: A Hands-on Approach, by David B. Kirk and Wen-mei W. Hwu, Morgan Kaufmann, 2010, ISBN:
  2 NVIDIA CUDA C Programming Guide
  3 Reconfigurable Computing, by Scott Hauck and André DeHon, Morgan Kaufmann, 2008, ISBN:

5 Course Objective
GPU programming
  - Learn how to program massively parallel processors and achieve
      - High performance
      - Functionality and maintainability
      - Scalability across future generations
  - Acquire the technical knowledge required to achieve the above goals
      - Principles and patterns of parallel programming
      - Processor architecture features and constraints
      - Programming APIs, tools and techniques
FPGA programming
  - Learn how to implement applications on FPGA devices to achieve
      - High performance
      - High productivity
  - Acquire the technical knowledge required to achieve the above goals
      - Deep pipelining and parallelism
      - FPGA architecture and system platform architecture
      - Two programming entries
          - High-level languages, e.g., C or Fortran
          - Hardware description languages, e.g., Verilog or VHDL

6 Academic Honesty
  - You are allowed and encouraged to discuss assignments with other students in the class. Getting verbal advice/help from people outside of the class is also fine.
  - Any copying of non-trivial code is unacceptable
      - Non-trivial = more than a line or so
      - Includes reading someone else's code and then going off to write your own
  - Penalties for academic dishonesty
      - Zero on the assignment for the first occasion
      - Automatic failure of the course for repeat offenses
  - Academic Integrity at the University of Arkansas
  - Academic Integrity Sanction Rubric

7 Lab Assignments, Projects, and Course Grading
  - No homework, no exam
  - Constituent components of course grading
      - 9 lab assignments: 6 for GPU programming and 3 for FPGA programming
      - 2 projects: 1 on GPU and 1 on FPGA
      - Classroom presentation: lab discussion and project
  - All lab assignments and projects are supposed to be carried out individually

8 Lab Equipment
GPU computing
  - GPU: GeForce GTX CUDA cores
  - All the workstations in JBHT 237 are equipped with two GTX 480 GPUs
FPGA computing
  - SRC-7 reconfigurable computer
      - One dual-core Xeon CPU and two FPGA co-processors
      - FPGA: Altera Stratix II EP2S180
      - Located in JBHT 444, remotely accessible

10 Why Massively Parallel Processor
  - A quiet revolution and potential build-up
      - Performance advantages against multicore CPUs (GPU vs. CPU)
          - GFLOPS: 1,000 vs. 100 (in year 2009)
          - Memory bandwidth (GB/s): 200 vs. 20
      - A GPU in every PC and workstation: massive volume and potential impact

11 Different Design Philosophies
[Figure: CPU and GPU block diagrams showing control logic, ALUs, cache, and DRAM]
CPU: sequential execution
  - A few complex ALUs
  - Complicated control logic, e.g., branch prediction
  - Big cache
GPU: parallel computing
  - Many simple processing cores
  - Simple control and scheduling logic
  - No or small cache

12 Architecture of GPU
[Figure: GPU block diagram with the host interface, GigaThread scheduler, L2 cache, and DRAM interfaces surrounding the streaming multiprocessors. Each streaming multiprocessor contains an instruction cache, two warp schedulers with dispatch units, a register file, CUDA cores (dispatch port, operand collector, FP unit, INT unit, result queue), four special function units, an interconnect network, and 64 KB of shared memory / L1 cache.]
  - 512 streaming processors in 16 streaming multiprocessors

13 Basic Programming Model on GPU
[Figure: a grid of 3 x 2 blocks, Block (0, 0) to Block (2, 1); Block (1, 1) is expanded into 4 x 3 threads, Thread (0, 0) to Thread (3, 2)]
  - Issue thousands of threads targeting hundreds of processors
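
The grid/block/thread hierarchy in the figure can be sketched in plain Python. This is an illustration only, not CUDA code: the function names `global_thread_index` and `total_threads` are hypothetical, and the dimensions (3 x 2 blocks, 4 x 3 threads per block) are read off the figure rather than taken from any real kernel launch.

```python
# Illustrative sketch (plain Python, not CUDA): combine a block index and a
# thread index into a global 2D coordinate, the way a kernel typically
# combines its block index, block size, and thread index.

GRID_DIM = (3, 2)   # blocks per grid in x and y, as drawn in the figure
BLOCK_DIM = (4, 3)  # threads per block in x and y, as drawn in the figure


def global_thread_index(block_idx, thread_idx, block_dim=BLOCK_DIM):
    """Return the global (x, y) coordinate of one thread."""
    bx, by = block_idx
    tx, ty = thread_idx
    return (bx * block_dim[0] + tx, by * block_dim[1] + ty)


def total_threads(grid_dim=GRID_DIM, block_dim=BLOCK_DIM):
    """Return the number of threads issued for the whole grid."""
    return grid_dim[0] * grid_dim[1] * block_dim[0] * block_dim[1]


if __name__ == "__main__":
    # Thread (2, 0) inside Block (1, 1)
    print(global_thread_index((1, 1), (2, 0)))  # (6, 3)
    print(total_threads())                      # 72
```

The figure's grid is tiny; a real launch follows the same arithmetic with far larger dimensions, which is how thousands of threads end up targeting hundreds of processors.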

14 Execution Model on GPU: hiding shader stalls
[Figure: timeline in clocks for one core with a shared fetch/decode unit, 8 ALUs, 8 per-fragment contexts (Ctx), and shared context data, running fragments 1-8. Source: Beyond Programmable Shading: Fundamentals]

15 Execution Model on GPU: hiding shader stalls
[Figure: the same core (fetch/decode unit and 8 ALUs) now holds four groups of fragments, starting with Frag 1-8 and Frag 9-16. Source: Beyond Programmable Shading: Fundamentals]

16 Execution Model on GPU: hiding shader stalls
[Figure: timeline in clocks for the four fragment groups; the running group stalls and a runnable group takes over the ALUs. Source: Beyond Programmable Shading: Fundamentals]

18 Execution Model on GPU: hiding shader stalls
[Figure: timeline in clocks for the four fragment groups; each group stalls in turn while another runnable group is switched onto the ALUs. Source: Beyond Programmable Shading: Fundamentals]

19 Execution Model on GPU: throughput!
[Figure: timeline in clocks for the four fragment groups; each group starts, stalls, becomes runnable again, and finishes (Done!) in staggered order. Source: Beyond Programmable Shading: Fundamentals]
  - Increase the run time of one group to maximize the throughput of many groups
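
The trade-off on this slide can be reproduced with a small scheduling simulation. This is a toy model under stated assumptions, not a description of real GPU hardware: the function `simulate` and its parameters are hypothetical, each group performs two ALU bursts of 4 clocks separated by a 10-clock stall, and the core runs one runnable group at a time without preemption.

```python
# Toy model of latency hiding: one core switches between groups of fragments
# whenever the current group stalls. All numbers are illustrative assumptions.

def simulate(num_groups, bursts=2, compute=4, stall=10):
    """Return (total_clocks, alu_utilization, finish_clock_per_group)."""
    remaining = [bursts] * num_groups   # ALU bursts left for each group
    ready_at = [0] * num_groups         # clock at which each group is runnable
    finish = [0] * num_groups
    clock = 0
    busy = 0                            # clocks in which the ALUs did work
    while any(remaining):
        runnable = [g for g in range(num_groups)
                    if remaining[g] and ready_at[g] <= clock]
        if not runnable:
            # Every unfinished group is stalled: the ALUs sit idle.
            clock = min(ready_at[g] for g in range(num_groups) if remaining[g])
            continue
        g = runnable[0]
        clock += compute
        busy += compute
        remaining[g] -= 1
        if remaining[g]:
            ready_at[g] = clock + stall  # group stalls before its next burst
        else:
            finish[g] = clock
    return clock, busy / clock, finish


if __name__ == "__main__":
    total, util, finish = simulate(1)
    print(total, finish)   # 18 [18]
    total, util, finish = simulate(4)
    print(total, finish)   # 32 [20, 24, 28, 32]
```

In this model a single group finishes in 18 clocks with the ALUs idle more than half the time. With four groups interleaved, the first group finishes later (clock 20 instead of 18), but all four complete in 32 clocks with the ALUs always busy, instead of the 72 clocks that four back-to-back runs would need.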

20 How to deal with branches?
[Figure: timeline in clocks for ALU 1 to ALU 8, all executing the shader code below. Source: Beyond Programmable Shading: Fundamentals]

    <unconditional shader code>
    if (x > 0) {
        y = pow(x, exp);
        y *= Ks;
        refl = y + Ka;
    } else {
        x = 0;
        refl = Ka;
    }
    <resume unconditional shader code>

21 How to deal with branches?
[Figure: timeline in clocks for ALU 1 to ALU 8 running the shader code of slide 20; the branch outcomes per ALU are T T F T F F F F. Source: Beyond Programmable Shading: Fundamentals]

22 How to deal with branches?
[Figure: timeline in clocks for ALU 1 to ALU 8 running the shader code of slide 20; the branch outcomes per ALU are T T F T F F F F. Source: Beyond Programmable Shading: Fundamentals]
  - Not all ALUs do useful work!
  - Worst case: 1/8 performance
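
A rough way to see where the lost performance goes is to model 8 ALUs that share one instruction stream and must step through both sides of the branch with inactive lanes masked off. The sketch below is plain Python under stated assumptions: the function `run_diverged` is hypothetical, each statement of the shader is counted as one instruction (3 for the `if` side, 2 for the `else` side), and the values of `exp`, `Ks`, and `Ka` are arbitrary.

```python
# Toy model of branch divergence on 8 ALUs sharing one instruction stream.
# Instruction counts per branch are illustrative assumptions.

def run_diverged(xs, exp=2.0, Ks=0.5, Ka=0.1, if_cost=3, else_cost=2):
    """Return (refl_per_lane, mask, alu_utilization) for one group of lanes."""
    mask = [x > 0 for x in xs]          # per-lane branch outcome (T/F)
    refl = [None] * len(xs)
    issued = 0                          # instructions issued to the whole group
    useful = 0                          # lane-instructions doing real work

    if any(mask):                       # 'if' side runs if any lane needs it
        issued += if_cost
        for lane, x in enumerate(xs):
            if mask[lane]:
                y = pow(x, exp)
                y *= Ks
                refl[lane] = y + Ka
                useful += if_cost

    if not all(mask):                   # 'else' side runs if any lane needs it
        issued += else_cost
        for lane in range(len(xs)):
            if not mask[lane]:
                refl[lane] = Ka
                useful += else_cost

    utilization = useful / (issued * len(xs))
    return refl, mask, utilization


if __name__ == "__main__":
    xs = [1, 2, -1, 3, -1, -2, 0, -4]   # yields the mask T T F T F F F F
    refl, mask, util = run_diverged(xs)
    print(mask)
    print(util)
```

With the mask from the slide, only 19 of the 40 available lane-instruction slots do useful work in this model. If a single lane takes the expensive side and the other side costs nothing (`if_cost=1, else_cost=0` with one positive input), utilization drops to 1/8, which matches the worst case quoted on the slide.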

24 Partition of an Application
  - Sequential portions: traditional CPU coverage
  - Parallel portions: GPU coverage
  - Obstacles
  - Increase the data parallel portion of an application
      - Analyze an existing application
      - Expand the data volume of the parallel part
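
The slide does not name it, but Amdahl's law is one common way to see why the data parallel portion matters. The sketch below is an assumption-laden illustration: the function `overall_speedup` is hypothetical, and the 10x speedup of the parallel part simply borrows the 1,000 vs. 100 GFLOPS ratio quoted earlier in the lecture, which real applications may not reach.

```python
# Amdahl's law: overall speedup when a fraction p of the run time is
# accelerated by a factor s and the rest stays sequential on the CPU.

def overall_speedup(parallel_fraction, parallel_speedup):
    """Return 1 / ((1 - p) + p / s)."""
    sequential = 1.0 - parallel_fraction
    return 1.0 / (sequential + parallel_fraction / parallel_speedup)


if __name__ == "__main__":
    # Assumed 10x speedup for the GPU-covered portion.
    for p in (0.5, 0.9, 0.99):
        print(p, round(overall_speedup(p, 10.0), 2))
```

Under these assumptions, moving from a 50% to a 99% parallel portion raises the overall speedup from roughly 1.8x to roughly 9.2x, which is why the slide stresses increasing the data parallel portion.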

26 Nvidia CUDA
  - CUDA driver: handles the communication with Nvidia GPUs
  - CUDA toolkit: contains the tools needed to compile and build a CUDA application
  - CUDA SDK: includes sample projects that provide source code and other resources for constructing CUDA programs

27 Install CUDA SDK on Local Machine
  1 Download the CUDA SDK from the Nvidia website
  2 Install the SDK by running it
  3 Add the following lines at the end of your .bashrc file

        PATH=/usr/local/cuda/bin:$PATH
        LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/lib:$LD_LIBRARY_PATH
        export PATH
        export LD_LIBRARY_PATH

  4 Source the .bashrc file: source .bashrc
  5 Go to the directory /NVIDIA_GPU_Computing_SDK/C and type make
  6 Test one executable under /NVIDIA_GPU_Computing_SDK/bin
