Hardware/Software Co-Design


1 / 27 Hardware/Software Co-Design. Miaoqing Huang, University of Arkansas, Fall 2011

2 / 27 Outline

3 / 27 Outline

CSCE 5013-002 Special Topic in Hardware/Software Co-Design
Instructor: Miaoqing Huang. Email: mqhuang@uark.edu. Office: JBHT 526; Tel: 479-575-7578
Office Hours: Mon 1:30-2:30 PM; Wed 3:30-4:30 PM
Meeting: Mon, Wed, Fri 2:30-3:20 PM, JBHT 237
Class Website (or access through my home page): http://www.csce.uark.edu/~mqhuang/courses/5013/f2011/csce5013_fall2011.htm
Textbooks (optional):
1 Programming Massively Parallel Processors: A Hands-on Approach, by David B. Kirk and Wen-mei W. Hwu, Morgan Kaufmann, 2010, ISBN: 9780123814722
2 NVidia CUDA C Programming Guide
3 Reconfigurable Computing, by Scott Hauck and Andre DeHon, Morgan Kaufmann, 2008, ISBN: 9780123705228
4 / 27

Course Objective
GPU programming: learn how to program massively parallel processors and achieve high performance; functionality and maintainability; and scalability across future generations. Acquire the technical knowledge required to achieve these goals: principles and patterns of parallel programming; processor architecture features and constraints; programming APIs, tools, and techniques.
FPGA programming: learn how to implement applications on FPGA devices to achieve high performance and high productivity. Acquire the technical knowledge required to achieve these goals: deep pipelining and parallelism; FPGA architecture and system platform architecture. Two programming entry points: high-level languages, e.g., C or Fortran, and hardware description languages, e.g., Verilog or VHDL.
5 / 27

Academic Honesty
You are allowed and encouraged to discuss assignments with other students in the class. Getting verbal advice/help from people outside of the class is also fine. Any copying of non-trivial code is unacceptable (non-trivial = more than a line or so); this includes reading someone else's code and then going off to write your own.
Penalties for academic dishonesty: zero on the assignment for the first occurrence; automatic failure of the course for repeat offenses.
Academic Integrity at the University of Arkansas: http://provost.uark.edu/245.php
Academic Integrity Sanction Rubric: http://provost.uark.edu/246.php
6 / 27

Lab Assignments, Projects, and Course Grading
No homework, no exam. Constituent components of course grading:
9 lab assignments: 6 for GPU programming and 3 for FPGA programming
2 projects: 1 on GPU and 1 on FPGA
Classroom presentation: lab discussion and project
All lab assignments and projects are to be carried out individually.
7 / 27

Lab Equipment
GPU computing: GeForce GTX 480 (480 CUDA cores). All the workstations in JBHT 237 are equipped with two GTX 480 GPUs.
FPGA computing: SRC-7 reconfigurable computer with one dual-core Xeon CPU and two FPGA co-processors (Altera Stratix II EP2S180). Located in JBHT 444, remotely accessible.
8 / 27

9 / 27 Outline

Why Massively Parallel Processors? A quiet revolution and potential build-up. Performance advantages against multicore CPUs (in year 2009): GFLOPS 1,000 vs. 100; memory bandwidth 200 GB/s vs. 20 GB/s. A GPU is in every PC and workstation: massive volume and potential impact.
10 / 27

Different Design Philosophies [figure: CPU die devotes its area to control logic and cache around a few ALUs; GPU die is mostly ALUs, both backed by DRAM]
CPU: sequential execution. A few complicated ALUs; complicated control logic, e.g., branch prediction; big caches.
GPU: parallel computing. Many simple processing cores; simple control and scheduling logic; no or small caches.
11 / 27

Architecture of GPU [figure: GPU with host interface, GigaThread scheduler, L2 cache, six DRAM partitions, and 16 streaming multiprocessors; each streaming multiprocessor (SM) contains an instruction cache, two warp schedulers, two dispatch units, a register file (32,768 x 32-bit), CUDA cores (each with a dispatch port, operand collector, FP unit, INT unit, and result queue), four special function units, an interconnect network, and 64 KB of shared memory / L1 cache]
512 streaming processors in 16 streaming multiprocessors.
12 / 27

Chapter 2: Programming Model. Basic Programming Model on GPU [figure: a grid of thread blocks, Block (0,0) through Block (2,1); each block, e.g. Block (1,1), contains a 4x3 array of threads, Thread (0,0) through Thread (3,2)]
Issue thousands of threads targeting hundreds of processors.
13 / 27

Execution Model on GPU: Hiding Shader Stalls (slides 14-19 / 27) [figure sequence: a core with one fetch/decode unit and eight ALUs holds on-chip contexts for four groups of fragments (Frag 1-8, Frag 9-16, Frag 17-24, Frag 25-32); when the running group stalls on memory, the core switches to another runnable group, so the groups' stalls overlap with each other's useful work until all are done]
Increasing the run time of any one group is the price paid to maximize the throughput of many groups. (Source: Beyond Programmable Shading: Fundamentals, slides 32-37)

But what about branches? How to deal with branches? (slides 20-23 / 27) [figure sequence: eight ALUs (ALU 1 through ALU 8) execute the same instruction stream over time; per-lane branch outcomes T T F T F F F F mask the lanes on the path they did not take]
Not all ALUs do useful work! Worst case: 1/8 performance. The example shader:

    <unconditional shader code>
    if (x > 0) {
        y = pow(x, exp);
        y *= Ks;
        refl = y + Ka;
    } else {
        x = 0;
        refl = Ka;
    }
    <resume unconditional shader code>

(Source: Beyond Programmable Shading: Fundamentals, slides 25-28)

Partition of an Application
Sequential portions: traditional CPU coverage.
Parallel portions: GPU coverage.
The main obstacle is increasing the data-parallel portion of an application: analyze the existing application, then expand the data volume of the parallel part.
24 / 27

25 / 27 Outline

26 / 27 Nvidia CUDA
CUDA driver: handles the communication with Nvidia GPUs
CUDA toolkit: contains the tools needed to compile and build a CUDA application
CUDA SDK: includes sample projects that provide source code and other resources for constructing CUDA programs

27 / 27 Install CUDA SDK on Local Machine
1 Download the CUDA SDK from the Nvidia website
2 Install the SDK by running it
3 Add the following lines at the end of your .bashrc file:
    PATH=/usr/local/cuda/bin:$PATH
    LD_LIBRARY_PATH=/usr/local/cuda/lib64:/usr/local/cuda/lib:$LD_LIBRARY_PATH
    export PATH
    export LD_LIBRARY_PATH
4 Source the .bashrc file: source .bashrc
5 Go to the directory /NVIDIA_GPU_Computing_SDK/C and type make
6 Test one executable under /NVIDIA_GPU_Computing_SDK/bin