GPGPU. Peter Laurens 1st-year PhD Student, NSC



Presentation Overview
1. What is it?
2. What can it do for me?
3. How can I get it to do that?
4. What's the catch?
5. What's the future?

What is it? Introducing the GPU

The GPU: A Specialised Hardware Component for Graphical Rendering
- Core: 250-500MHz
- Memory: 128MB - 512MB
- Bandwidth: 10-40GB/s
- Price: ~150 GBP

CryENGINE Crytek A.G.

Fixed Function Pipeline; Pixel and Vertex Shaders
Timeline, 1980 to 2005: CPU, 2D, 3D, T&L, Programmable Shaders
- CPU: Almost all processing to screen done by CPU.
- 2D: Drawing of vector shapes is accelerated. Some sprite compositing may be accelerated (BitBLT).
- 3D: Geometry can be passed to the GPU, rasterised, coloured, and textured.
- T&L (Transform and Lighting): FFP extended to support more features. Effects such as mip-mapping, multi-texturing, bump-mapping, etc. are possible, but you only get what you are given.
- Programmable Shaders: The GPU is now programmable; (almost) arbitrary code may be run on the GPU which shades (affects the colour of) every pixel on a surface. These shaders are applied in parallel to the pixels.

[Diagram: CPU and GPU. Geometry passes from the CPU through OpenGL/D3D to the GPU's Vertex Shaders, Rasteriser, and Fragment/Pixel Shaders, with Texture Memory alongside.]
Bandwidth is expensive, but computation is very, very cheap.

Shaders: The Core of the Programmable Pipeline
Vertex Processor
- Fully programmable
- Runs shader programs on world geometry
- Writes to vertices only (no read)
Fragment Processor
- Fully programmable
- Takes rasterised pixel fragment data from the vertex stage
- Affects the output color
- Random read from across the texture data, no random-access writes
- Direct output to texture
- Much more applicable to GPGPU

GPGPU What can the GPU do for me?

General Purpose GPU
Using the GPU for General Purpose Computing, taking advantage of the massive parallelism available for scientific processing
- Instead of rendering geometry, render a flat quad with your data on it, and use the pixel shaders to process it.
- The pixel shader is applied in parallel to every pixel, independently.
- Several pixel shaders can be run in sequence.
- Maximum matrix size is 4096x4096.

Performance: CPU vs. GPU

                     Intel Core 2 Duo @ 3.0GHz    nvidia GeForce 8800GTX
Computation          48 GFLOPS (peak)             330 GFLOPS (observed)
Memory Bandwidth     21GB/s (peak)                55.2GB/s (observed)
Price                $851 (Apr 16th)              $549 (Apr 16th)

Source: SC06 (GPGPU.org)
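As a rough worked example, the figures on this slide can be turned into ratios. This is only a sketch: the dictionary keys and the function name are made up for illustration, and the comparison sets a peak CPU figure against an observed GPU figure, so the ratios are indicative at best.

```python
# Figures copied from the slide; key names are illustrative assumptions.
CPU = {"gflops": 48.0, "bandwidth_gbs": 21.0, "price_usd": 851.0}   # peak values
GPU = {"gflops": 330.0, "bandwidth_gbs": 55.2, "price_usd": 549.0}  # observed values


def compare(cpu, gpu):
    """Return simple GPU-to-CPU ratios for compute, bandwidth, and value."""
    cpu_value = cpu["gflops"] / cpu["price_usd"]  # GFLOPS per dollar
    gpu_value = gpu["gflops"] / gpu["price_usd"]
    return {
        "compute_ratio": gpu["gflops"] / cpu["gflops"],
        "bandwidth_ratio": gpu["bandwidth_gbs"] / cpu["bandwidth_gbs"],
        "cpu_gflops_per_usd": cpu_value,
        "gpu_gflops_per_usd": gpu_value,
        "value_ratio": gpu_value / cpu_value,
    }


if __name__ == "__main__":
    for name, value in compare(CPU, GPU).items():
        print(f"{name}: {value:.3f}")
```

On these numbers the GPU offers roughly seven times the compute but well under three times the memory bandwidth, which fits the earlier point that bandwidth is expensive while computation is cheap.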

Folding@Home
[Chart: total TFLOPS and GFLOPS per machine, CPU clients vs. GPU clients.]
Source: Folding@Home, April 2007

Raw Multiply Performance vs. CPU
[Chart: GFLOPS (0 to 350) from 1H2003 to 2H2006 for the GPUs NV30, NV35, NV40, G70, G71, and G80, compared with an Intel Core 2 Duo @ 3.0GHz.]
Source: nvidia Corporation

Suitability: Convolution (neighbourhoods), Arithmetical Intensity, Data Independence
- Cellular Automata
- Neural Networks
- Genetic Algorithms
- Genetic Programming (Harding & Banzhaf)
- Chemical Simulation
- Cryptography and Cryptanalysis
- Digital Signal Processing (DSP)
- Fast Fourier Transform
- Speech Processing
- Computer Vision
- Linear Algebra
- Video Processing (Motion Compensation, De-interlacing etc.)
- Image Effects and Processing (of course)

How: How can I get it to do that?

Pre-requisites: Setting Up An Environment for GPGPU
Layers: Hardware, Driver, Shader, API, App
- Hardware: PS 3.0, control structures, shader length
- Driver: ATi/nVidia, OS support, stability
- Shader: Assembly, GLSL, HLSL, Cg
- API: Direct3D, OpenGL

Programming Paradigms: These Can be Quite Different from the CPU
- Data (CPU Arrays -> GPU Textures): Textures are the fundamental data structure that fragment shaders operate on. Therefore, arrays on the CPU become 2D textures on the GPU.
- Code/Kernels (CPU Inner Loops -> GPU Fragment Programs): On a CPU we would use a loop to iterate over the many elements of an array sequentially. On a GPU, the inner loop of the code becomes the instructions to be applied by the fragment program to the stream of texture data.
- Output (Feedback from the GPU): On a CPU, the unified memory model means we can read and write wherever we choose, which makes getting output easy. On the GPU, results must be rendered to a texture, which can then be read back from the GPU or used in future processing.
- Computation (CPU Execution -> GPU Drawing): Invocation of computation is straightforward on the CPU. To run a GPU program, the GPU must draw the result.
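The shift from inner loops to fragment programs can be sketched on the CPU. The Python below is an emulation only, under the assumption that a texture is a list of rows; draw_quad, times_two, and cpu_scale are hypothetical names and not part of any graphics API.

```python
# Illustrative emulation only: no real GPU or graphics API is involved.

def cpu_scale(array, factor):
    """CPU style: an explicit inner loop visits every element in turn."""
    out = [row[:] for row in array]
    for y in range(len(array)):
        for x in range(len(array[y])):
            out[y][x] = factor * array[y][x]
    return out


def draw_quad(kernel, texture):
    """GPU style: 'drawing' runs the kernel once per output pixel.

    The kernel may read anywhere in the input texture but returns only the
    value of its own pixel, so every invocation is independent.
    """
    height, width = len(texture), len(texture[0])
    return [[kernel(texture, x, y) for x in range(width)] for y in range(height)]


def times_two(texture, x, y):
    """Fragment-program stand-in: the body of the CPU inner loop."""
    return 2.0 * texture[y][x]


data = [[1.0, 2.0], [3.0, 4.0]]
first_pass = draw_quad(times_two, data)         # result rendered to a new texture
second_pass = draw_quad(times_two, first_pass)  # output reused as input
```

The second call mirrors the point about feedback: the rendered texture is either read back or fed into the next pass, and the input texture is never written in place.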

More Concepts: Ways in Which a Kernel May Operate on a Stream of Data
- Mapping (One-to-One Function): Mapping, also known as Transformation, describes a one-to-one input/output application of a function to the texture array data, e.g. multiplication by two.
- Reduction (Many-to-Fewer): Data may be reduced to a smaller stream, for example by averaging. The GPU architecture is best suited to cascading reductions in several stages, using the output as new input, before arriving at a single element.
- Filtering (Non-Uniform Reduction): Removing some items based on certain criteria.
- Scatter (Vertex Operation to Spread Data): Spreads the data in a certain manner by enlarging the geometry of the underlying plane.
- Gather (Random Access Read): Refers to the capability of the fragment shader to read from a texture in a random fashion.
- Search (Parallel Search): The GPU does not speed up the search for an element, but allows a parallel search technique.
Source: GPGPU.org, Wikipedia.org/GPGPU
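The cascading reduction described here can be sketched as repeated passes that each halve the texture. This is a CPU emulation in Python, assuming a square texture whose side is a power of two; reduce_pass and cascade_average are hypothetical names chosen for illustration.

```python
def reduce_pass(texture):
    """One reduction pass: each output pixel averages a 2x2 block of the input."""
    half = len(texture) // 2
    return [
        [
            (texture[2 * y][2 * x] + texture[2 * y][2 * x + 1]
             + texture[2 * y + 1][2 * x] + texture[2 * y + 1][2 * x + 1]) / 4.0
            for x in range(half)
        ]
        for y in range(half)
    ]


def cascade_average(texture):
    """Repeat reduce_pass, feeding output back in as input, until 1x1 remains.

    Returns the final average and the number of passes that were needed.
    """
    size = len(texture)
    if size == 0 or size & (size - 1) or any(len(row) != size for row in texture):
        raise ValueError("expected a square texture with a power-of-two side")
    passes = 0
    while len(texture) > 1:
        texture = reduce_pass(texture)
        passes += 1
    return texture[0][0], passes


# A 4x4 texture holding the values 0..15 needs two passes.
example = [[float(4 * y + x) for x in range(4)] for y in range(4)]
average, passes = cascade_average(example)
```

A texture of side n needs log2(n) passes, so even a 4096x4096 input would be reduced in 12 passes, each of which is itself a gather over four neighbouring texels.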

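Fragment Shader Code: What Does it Look Like?

    // Cg fragment program computing result = y + alpha * x for one texel
    float4 saxpy (
        float2 coords : TEXCOORD0,      // position of this fragment
        uniform samplerRECT textureY,   // input vector y, stored as a texture
        uniform samplerRECT textureX,   // input vector x, stored as a texture
        uniform float alpha ) : COLOR
    {
        float4 result;
        float4 y = texRECT(textureY, coords);   // gather y at this position
        float4 x = texRECT(textureX, coords);   // gather x at this position
        result = y + alpha*x;
        return result;
    }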
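For comparison, here is a CPU reference for what the saxpy fragment program computes, written as a Python sketch. It assumes each texel is a 4-tuple standing in for float4 and each texture is a list of rows; saxpy_texel and saxpy_pass are hypothetical names.

```python
def saxpy_texel(y, x, alpha):
    """What the fragment program does for one texel: y + alpha * x per channel."""
    return tuple(yc + alpha * xc for yc, xc in zip(y, x))


def saxpy_pass(texture_y, texture_x, alpha):
    """Emulate drawing the quad: apply saxpy_texel at every texel position."""
    return [
        [saxpy_texel(ty, tx, alpha) for ty, tx in zip(row_y, row_x)]
        for row_y, row_x in zip(texture_y, texture_x)
    ]


# A 1x2 texture: two texels, four channels each.
texture_y = [[(1.0, 2.0, 3.0, 4.0), (0.0, 0.0, 0.0, 0.0)]]
texture_x = [[(10.0, 20.0, 30.0, 40.0), (2.0, 4.0, 6.0, 8.0)]]
result = saxpy_pass(texture_y, texture_x, 0.5)
```

On the GPU there is no outer loop in the program itself: the rasteriser generates one fragment per texel and the hardware runs the program on each of them in parallel.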

What's the Catch? Well...

Problems: Why We Aren't Using it Already
- Problem must be in the right format for processing on the GPU
- Limited memory access (maximum ~2GB)
- Slow transfer across PCIe
- Texture sizes limited to 4096x4096
- Vendor-specific adaptations (think HTML3.0) limit adoption
- Lack of standardisation of features
- Drivers can be flaky, unstable, imprecise
- Driver availability for Linux
- Only single-precision floating-point accuracy
- Limited control structures, no recursion
- Very poor multi-GPU support (OS, drivers, API)
- Immature high-level shader languages
- Forced to use graphics-oriented API and terminology
- Loose and fast attitude toward driver development
- No integers
- Software tools are immature, debugging can be a pain (e.g. running shader on hw)
- Unusual programming model
- Serial code with lots of complex data interdependencies is not suitable
- Limits on the length of shaders
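To illustrate why the single-precision limit matters, the sketch below rounds Python's double-precision floats to single precision using the standard struct module. The helper names to_float32 and sum_float32 are made up for this example, and this is a CPU emulation of IEEE single precision, not a measurement of any particular GPU.

```python
import struct


def to_float32(value):
    """Round a double-precision Python float to the nearest single-precision value."""
    return struct.unpack("f", struct.pack("f", value))[0]


def sum_float32(values):
    """Accumulate values while rounding every intermediate result to single precision."""
    total = 0.0
    for v in values:
        total = to_float32(total + to_float32(v))
    return total


if __name__ == "__main__":
    # Single precision has a 24-bit significand, so 2**24 + 1 is not representable.
    print(to_float32(16777217.0))
    # Adding 1 to 2**24 in single precision is therefore lost entirely.
    print(sum_float32([16777216.0, 1.0]))
    # Long accumulations drift further from the double-precision result.
    print(sum_float32([0.1] * 100000), sum([0.1] * 100000))
```

Algorithms that accumulate many small contributions, such as long reductions, are the ones most likely to need care under this constraint.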

Future Prospects: They're Actually Pretty Bright...

The Future of GPGPU: Why We Might be Using it Soon
- New intermediary APIs: Brook and Sh
- nvidia's CUDA: a step further than Brook and Sh (write in C, special drivers, plus guarantees of future compatibility)
- ATi's CTM: similar to CUDA (better access to memory etc.)
- Maturity will inevitably bring stability and homogeneity to drivers/features
- AMD + ATi: Stream Processing and CPU Integration (alleviate bus latency and memory constraint)
- Direct3D, OpenGL, and the cards themselves continue to support more features; double-precision floating point is coming.

Places to Go: Learning More About GPGPU
- www.gpgpu.org
- www.nvidia.com
- www.ati.com
- Stream Processing (USA)