CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

Similar documents
Spring 2009 Prof. Hyesoon Kim

Spring 2011 Prof. Hyesoon Kim

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

Threading Hardware in G80

Introduction to CUDA Algoritmi e Calcolo Parallelo. Daniele Loiacono

General-Purpose Computation on Graphics Hardware

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Introduction to CUDA (1 of n*)

CS427 Multicore Architecture and Parallel Computing

Portland State University ECE 588/688. Graphics Processors

Mattan Erez. The University of Texas at Austin

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

Programming in CUDA. Malik M Khan

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

An Example of Physical Reality Behind CUDA. GeForce Speedup of Applications Why Massively Parallel Processor.

Massively Parallel Architectures

From Brook to CUDA. GPU Technology Conference

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

GPGPU introduction and network applications. PacketShaders, SSLShader

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

CIS 665: GPU Programming. Lecture 2: The CUDA Programming Model

Introduction to Modern GPU Hardware

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

Accelerating CFD with Graphics Hardware

Graphics Processor Acceleration and YOU

CS GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1. Markus Hadwiger, KAUST

Current Trends in Computer Graphics Hardware

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum

Lecture 1: Gentle Introduction to GPUs

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Lecture 1: Introduction and Computational Thinking

Georgia Institute of Technology, August 17, Justin W. L. Wan. Canada Research Chair in Scientific Computing

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

What s New with GPGPU?

Parallel Computing: Parallel Architectures Jin, Hai

EE382N (20): Computer Architecture - Parallelism and Locality Spring 2015 Lecture 09 GPUs (II) Mattan Erez. The University of Texas at Austin

Nvidia G80 Architecture and CUDA Programming

GPU Programming. Lecture 1: Introduction. Miaoqing Huang University of Arkansas 1 / 27

Graphics Processing Unit Architecture (GPU Arch)

GPUs and GPGPUs. Greg Blanton John T. Lubia

General Purpose GPU programming (GP-GPU) with Nvidia CUDA. Libby Shoop

GPGPU. Peter Laurens 1st-year PhD Student, NSC

Introduction to CUDA

This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC.

Hardware/Software Co-Design

CSCI-GA Graphics Processing Units (GPUs): Architecture and Programming Lecture 2: Hardware Perspective of GPUs

Accelerating image registration on GPUs

Shaders. Slide credit to Prof. Zwicker

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

GPU Memory Model. Adapted from:

ECE 571 Advanced Microprocessor-Based Design Lecture 20

This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC.

Ray Tracing. Computer Graphics CMU /15-662, Fall 2016

GRAPHICS PROCESSING UNITS

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

Graphics Hardware. Instructor Stephen J. Guy

ECE 8823: GPU Architectures. Objectives

Convolution, Constant Memory and Constant Caching

Bifrost - The GPU architecture for next five billion

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

Administrivia. Administrivia. Administrivia. CIS 565: GPU Programming and Architecture. Meeting

GPU Architecture and Function. Michael Foster and Ian Frasch

Intro to GPU s for Parallel Computing

GPU Basics. Introduction to GPU. S. Sundar and M. Panchatcharam. GPU Basics. S. Sundar & M. Panchatcharam. Super Computing GPU.

Using The CUDA Programming Model

Introduction to CUDA (1 of n*)

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

A Data-Parallel Genealogy: The GPU Family Tree. John Owens University of California, Davis

Multi-Processors and GPU

CS 677: Parallel Programming for Many-core Processors Lecture 1

Antonio R. Miele Marco D. Santambrogio

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

Accelerating Molecular Modeling Applications with Graphics Processors

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

Today s Content. Lecture 7. Trends. Factors contributed to the growth of Beowulf class computers. Introduction. CUDA Programming CUDA (I)

PowerVR Hardware. Architecture Overview for Developers

Accelerating NAMD with Graphics Processors

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

Master Informatics Eng.

GPU Programming Using NVIDIA CUDA

CS 668 Parallel Computing Spring 2011

Windowing System on a 3D Pipeline. February 2005

! Readings! ! Room-level, on-chip! vs.!

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34

Mattan Erez. The University of Texas at Austin

Efficient and Scalable Shading for Many Lights

Automatic Tuning Matrix Multiplication Performance on Graphics Hardware

Performance Insights on Executing Non-Graphics Applications on CUDA on the NVIDIA GeForce 8800 GTX

GPU Memory Model Overview

high performance medical reconstruction using stream programming paradigms

Bifurcation Between CPU and GPU CPUs General purpose, serial GPUs Special purpose, parallel CPUs are becoming more parallel Dual and quad cores, roadm

ECE 571 Advanced Microprocessor-Based Design Lecture 18

Many-core Processors Lecture 1. Instructor: Philippos Mordohai Webpage:

Introduction to GPU hardware and to CUDA

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

Graphics Processing Unit (GPU) Acceleration of Machine Vision Software for Space Flight Applications

Core/Many-Core Architectures and Programming. Prof. Huiyang Zhou

Jeremy W. Sheaffer 1 David P. Luebke 2 Kevin Skadron 1. University of Virginia Computer Science 2. NVIDIA Research

The Bifrost GPU architecture and the ARM Mali-G71 GPU

Transcription:

CS8803SC Software and Hardware Cooperative Computing GPGPU Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology Why GPU? A quiet revolution and potential build-up Calculation: 367 GFLOPS vs. 32 GFLOPS Memory Bandwidth: 86.4 GB/s vs. 8.4 GB/s Until recently, programmed through graphics API GFLOPS G80 = GeForce 8800 GTX G71 = GeForce 7900 GTX G70 = GeForce 7800 GTX NV40 = GeForce 6800 Ultra NV35 = GeForce FX 5950 Ultra NV30 = GeForce FX 5800 GPU in every PC and workstation massive volume and potential impact David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC 1

Computational Power Why are GPUs getting faster so fast? Arithmetic intensity: the specialized nature of GPUs makes it easier to use additional transistors for computation not cache Economics: multi-billion dollar video game market is a pressure cooker that drives innovation Architecture design decisions: General CPU : cache, branch handling units, OOO support etc. Graphics processor: most transistors are ALUs www.gpgpu.org/s2004/slides/luebke.introduction.ppt GPGPU? http://www.gpgpu.org GPGPU stands for General-Purpose computation on GPUs General Purpose computation using GPU in applications other than 3D graphics GPU accelerates critical paths of applications Data parallel algorithms leverage GPU attributes Large data arrays, streaming throughput Fine-grain SIMD parallelism Low-latency floating point (FP) computation Applications Game effects physics, image processing Physical modeling, computational engineering, matrix algebra, convolution, correlation, sorting David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC 2

Background on Graphics Describing an Object David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC 3

GPU Fundamentals: The Graphics Pipeline Graphics State Application Vertices (3D) Transform Xformed, Lit Vertices (2D) Rasterizer Fragments (pre-pixels) pixels) Shade Final pixels (Color, Depth) Video Memory (Textures) CPU GPU A simplified graphics pipeline Note that pipe widths vary Many caches, FIFOs, and so on not shown Render-to to-texturetexture GPU Fundamentals: The Modern Graphics Pipeline Graphics State Application Vertices (3D) Vertex Processor Xformed, Lit Vertices (2D) Rasterizer Fragments (pre-pixels) pixels) Fragment Pixel Processor Final pixels (Color, Depth) Video Memory (Textures) CPU Programmable vertex processor! GPU Render-to to-texturetexture Programmable pixel processor! 4

GPU Pipeline: Transform Vertex Processor (multiple operate in parallel) Transform from world space to image space Compute per-vertex lighting Rotate, translate, and scale the entire scene to correctly place it relative to the camera s position, view direction, and field of view. GPU Pipeline: Rasterizer Rasterizer Convert geometric rep. (vertex) to image rep. (fragment) Fragment = image fragment Pixel + associated data: color, depth, stencil, etc. Interpolate per-vertex quantities across pixels 5

GPU Pipeline: Shade Fragment Processors (multiple in parallel) Compute a color for each pixel Optionally read colors from textures (images) A fragment is a computer graphics term for all of the data necessary needed to generate a pixel in the frame buffer. This may include, but is not limited to: raster position depth interpolated attributes (color, texture coordinates, etc.) NVIDIA GeForce 7800 Pipeline 6

GeForce 8800 16 highly threaded SM s, >128 FPU s, 367 GFLOPS, 768 MB DRAM, 86.4 GB/S Mem BW, 4GB/S BW to CPU Host Input Assembler Thread Execution Manager Texture Texture Texture Texture Texture Texture Texture Texture Load/store Load/store Load/store Load/store Load/store Load/store Global Memory David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC GPGPU Programming Traditional GPGPU Use a pixel processor, vortex processor, texture cache.. Copies from the frame buffer to a texture Uses a texture as the frame buffer With CUDA Highly parallel threads SIMD programming with MPI style 7

What Kinds of Computation Map Well to GPUs? Computing graphics Two key attributes: Data parallelism Independence Arithmetic Intensity Arithmetic intensity = operations/works transferred Data Streams & Kernels Streams Collection of records requiring similar computation Vertex positions, Voxels, FEM cells, etc. Provide data parallelism Kernels Functions applied to each element in stream transforms, PDE, No dependencies between stream elements Encourage high Arithmetic Intensity 8

CPU-GPU Analogies CPU GPU Inner loops = Kernels Stream / Data Array = Texture Memory Read = Texture Sample Importance of Data Parallelism GPUs are designed for graphics Highly parallel tasks GPUs process independent vertices & fragments Temporary registers are zeroed No shared or static data No read-modify-write buffers Data-parallel processing GPUs architecture is ALU-heavy Multiple vertex & pixel pipelines, multiple ALUs per pipe Hide memory latency (with more computation) 9

Example: Simulation Grid Common GPGPU computation style Textures represent computational grids = streams Many computations map to grids Matrix algebra Image & Volume processing Physical simulation Global Illumination ray tracing, photon mapping, radiosity Non-grid streams can be mapped to grids Programming a GPU for Graphics Application specifies geometry rasterized Each fragment is shaded w/ SIMD program Shading can use values from texture memory Image can be used as texture on future passes Owens & Luebke 10

Programming a GPU for GP Programs Owens & Luebke Draw a screen-sized quad stream Run a SIMD kernel over each fragment Gather is permitted from texture memory Resulting buffer can be treated as texture on next pass Kernels CPU GPU Kernel / loop body / algorithm step = Fragment Program Owens & Luebke 11

CUDA Programming Model: A Highly Multithreaded Coprocessor The GPU is viewed as a compute device that: Is a coprocessor to the CPU or host Has its own DRAM (device memory) Runs many threads in parallel Data-parallel portions of an application are executed on the device as kernels which run in parallel on many threads Differences between GPU and CPU threads GPU threads are extremely lightweight Very little creation overhead GPU needs 1000s of threads for full efficiency Multi- CPU needs only a few David Kirk/NVIDIA and Wen-mei W. Hwu, 2007 ECE 498AL, UIUC CUDA: Matrix Multiplication global void MatrixMulKernel(Matrix M, Matrix N, Matrix P) { // 2D Thread ID int tx = threadidx.x; int ty = threadidx.y; // Pvalue is used to store the element of the matrix // that is computed by the thread float Pvalue = 0; for (int k = 0; k < M.width; ++k) { float Melement = M.elements[ty * M.pitch + k]; float Nelement = Nd.elements[k * N.pitch + tx]; Pvalue += Melement * Nelement; } // Write the matrix to device memory; // each thread writes one element P.elements[ty * P.pitch + tx] = Pvalue; } } 12

Limitations in GPGPU High latency between CPU-GPU Handling control flow graphs I/O access Bit operations Limited data structure (e.g., no link list) But then why are we looking at this? Relatively short development time Relatively cheap devices Is GPU always Good? 50 45 40 35 30 OpenMP 25 CUDA 20 15 10 5 0 0 200 400 600 800 1000 1200 Not enough data parallelism GPU overhead is higher than the benefit 13

Parallel programming is difficult. GPGPU could be one solution to utilize parallel processors. The Future of GPGPU? Architecture is a moving target. Programming environment is evolving. e.g.) Intel s Larrabee ( 09 expected) MIMD style Can it provide enough performance to beat Nvidia?? 14