Hardware/Software Co-Design
Miaoqing Huang, University of Arkansas, Fall 2011 (slides 1 / 27)
Outline
1. Course Information
2. Why Massively Parallel Processors
3. Nvidia CUDA
CSCE 5013-002 Special Topic in Hardware/Software Co-Design

Instructor: Miaoqing Huang
  Email: mqhuang@uark.edu
  Office: JBHT 526; Tel: 479-575-7578
  Office Hours: Mon 1:30-2:30PM; Wed 3:30-4:30PM
Meeting: Mon, Wed, Fri 2:30-3:20PM, JBHT 237
Class Website (or access through my home page):
  http://www.csce.uark.edu/~mqhuang/courses/5013/f2011/csce5013_fall2011.htm
Textbooks (optional):
  1. Programming Massively Parallel Processors: A Hands-on Approach, by David B. Kirk and Wen-mei W. Hwu, Morgan Kaufmann, 2010, ISBN: 9780123814722
  2. NVidia CUDA C Programming Guide
  3. Reconfigurable Computing, by Scott Hauck and Andre DeHon, Morgan Kaufmann, 2008, ISBN: 9780123705228
Course Objective

GPU programming
  Learn how to program massively parallel processors and achieve
    - High performance
    - Functionality and maintainability
    - Scalability across future generations
  Acquire the technical knowledge required to achieve the above goals
    - Principles and patterns of parallel programming
    - Processor architecture features and constraints
    - Programming APIs, tools, and techniques
FPGA programming
  Learn how to implement applications on FPGA devices to achieve
    - High performance
    - High productivity
  Acquire the technical knowledge required to achieve the above goals
    - Deep pipelining and parallelism
    - FPGA architecture and system platform architecture
  Two programming entries
    - High-level languages, e.g., C or Fortran
    - Hardware description languages, e.g., Verilog or VHDL
Academic Honesty

You are allowed and encouraged to discuss assignments with other students in the class. Getting verbal advice/help from people outside of the class is also fine.
Any copying of non-trivial code is unacceptable
  - Non-trivial = more than a line or so
  - Includes reading someone else's code and then going off to write your own
Penalties for academic dishonesty
  - Zero on the assignment for the first occasion
  - Automatic failure of the course for repeat offenses
Academic Integrity at the University of Arkansas: http://provost.uark.edu/245.php
Academic Integrity Sanction Rubric: http://provost.uark.edu/246.php
Lab Assignments, Projects, and Course Grading

No homework, no exam
Constituent components of course grading
  - 9 lab assignments: 6 for GPU programming and 3 for FPGA programming
  - 2 projects: 1 on GPU and 1 on FPGA
  - Classroom presentation: lab discussion and project
All lab assignments and projects are to be carried out individually
Lab Equipment

GPU computing
  - GPU: GeForce GTX 480 (480 CUDA cores)
  - All the workstations in JBHT 237 are equipped with two GTX 480 GPUs
FPGA computing
  - SRC-7 reconfigurable computer: one dual-core Xeon CPU and two FPGA co-processors
  - FPGA: Altera Stratix II EP2S180
  - Located in JBHT 444, remotely accessible
Why Massively Parallel Processors

A quiet revolution and potential build-up
  - Performance advantage against multicore CPUs
      GFLOPS: 1,000 vs. 100 (in year 2009)
      Memory bandwidth (GB/s): 200 vs. 20
  - A GPU is in every PC and workstation: massive volume and potential impact
Different Design Philosophies

[Figure: CPU die dominated by control logic, cache, and a few large ALUs vs. GPU die packed with many small ALUs; both attached to DRAM]

CPU: sequential execution
  - A few complex ALUs
  - Complicated control logic, e.g., branch prediction
  - Big caches
GPU: parallel computing
  - Many simple processing cores
  - Simple control and scheduling logic
  - No or small cache
Architecture of a GPU

[Figure: GPU die with multiple DRAM interfaces, a host interface, the GigaThread scheduler, a shared L2 cache, and 16 streaming multiprocessors. Each streaming multiprocessor (SM) contains an instruction cache, two warp schedulers with two dispatch units, a register file (32,768 x 32-bit), CUDA cores (each with a dispatch port, operand collector, FP unit, INT unit, and result queue), four special function units, an interconnect network, and 64 KB of shared memory / L1 cache.]

512 streaming processors in 16 streaming multiprocessors
Chapter 2: Programming Model

Basic programming model on a GPU

[Figure: a grid of 3 x 2 thread blocks, Block (0,0) through Block (2,1); each block, e.g., Block (1,1), contains a 4 x 3 array of threads, Thread (0,0) through Thread (3,2)]

Issue thousands of threads targeting hundreds of processors
Execution Model on a GPU: Hiding Shader Stalls

[Figure sequence (Beyond Programmable Shading: Fundamentals, slides 32-37): a core with one fetch/decode unit and eight ALUs keeps four groups of fragments (Frag 1-8, 9-16, 17-24, 25-32) in four execution contexts with shared context data. When the running group stalls on a memory access, the scheduler switches to another runnable group; each group's stall is covered by running the others, until all four groups are done.]

Throughput! Interleaving increases the run time of any one group, but maximizes the throughput of many groups.
But what about branches? How to deal with branches?

[Figure sequence (Beyond Programmable Shading: Fundamentals, slides 25-28): eight ALUs (ALU 1 ... ALU 8) execute the same instruction stream for eight fragments; at the branch the condition evaluates per lane to T T F T F F F F]

<unconditional shader code>
if (x > 0) {
    y = pow(x, exp);
    y *= Ks;
    refl = y + Ka;
} else {
    x = 0;
    refl = Ka;
}
<resume unconditional shader code>

When lanes diverge, both sides of the branch are executed with the untaken lanes masked off. Not all ALUs do useful work! Worst case: 1/8 performance on an 8-wide unit.
Partition of an Application

Sequential portions: traditional CPU coverage
Parallel portions: GPU coverage
Key challenge: increase the data-parallel portion of an application
  - Analyze an existing application
  - Expand the data volume of the parallel part
Nvidia CUDA

CUDA driver: handles the communication with Nvidia GPUs
CUDA toolkit: contains the tools needed to compile and build a CUDA application
CUDA SDK: includes sample projects that provide source code and other resources for constructing CUDA programs
Install CUDA SDK on a Local Machine

1. Download the CUDA SDK from the Nvidia website
2. Install the SDK by running it
3. Add the following lines at the end of your .bashrc file:
     PATH=/usr/local/cuda/bin:$PATH
     LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/lib:$LD_LIBRARY_PATH
     export PATH
     export LD_LIBRARY_PATH
4. Source the .bashrc file: source .bashrc
5. Go to directory /NVIDIA_GPU_Computing_SDK/C and type make
6. Test one executable under /NVIDIA_GPU_Computing_SDK/bin