Why GPU? Chapter 1
Graphics Hardware
- A Graphics Processing Unit (GPU) is subsidiary hardware
  - A massively multi-threaded, many-core processor
  - Dedicated to 2D and 3D graphics
  - Special-purpose: low functionality, high performance
  - Hardware-accelerated graphics operations

Graphics Hardware: CPUs vs. GPUs
- CPUs
  - Optimized for high performance on sequential code
  - Threading model: coarse-grained, heavyweight
- GPUs
  - Optimized for the highly data-parallel nature of graphics computation
  - Threading model: fine-grained, extremely lightweight

Computational Power of GPU
- GPUs are getting faster
- CPUs
  - Annual growth: 1.5x
  - Decade growth: 60x
- GPUs
  - Annual growth: >2.0x
  - Decade growth: >1000x

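
The decade figures follow from compounding the annual factors over ten years. A minimal check, assuming constant annual growth (the function name `decade_growth` is only illustrative):

```python
def decade_growth(annual_factor, years=10):
    """Compound a constant annual growth factor over a number of years."""
    return annual_factor ** years

cpu_decade = decade_growth(1.5)  # roughly 57.7, which the slide rounds to 60x
gpu_decade = decade_growth(2.0)  # 2**10 = 1024, hence "> 1000x" for growth above 2.0x

print(f"CPU: {cpu_decade:.1f}x per decade")
print(f"GPU: {gpu_decade:.0f}x per decade")
```
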
Why GPU is Trendy
- A massively parallel architecture
- Modern GPUs are deeply programmable
  - Programmable pixel, vertex, and geometry engines
- Modern GPUs support real precision
  - 32-bit floating point throughout the pipeline

Why GPU is Trendy
- Dedicated instructions for graphics tasks
  - Useful operations for graphics: vectors, matrices, textures
- Extremely fast filtering
  - Linear and some anisotropic interpolation is implemented in hardwired logic

Limitations: H/W Restrictions
- Restricted on-board memory size
  - Up to 4GB, usually <1GB
- Insufficient support for flexible memory manipulation
- Programmability still restricted in a number of ways
  - Limited support for divergent branching, such as loops or conditional clauses

Limitations: Difficult to Use
- GPUs are designed for and driven by video games
- Underlying architectures are:
  - Inherently data parallel
  - Rapidly evolving (even in basic feature set!)
  - Largely secret
- Can't simply port CPU code
- Good news: it's getting better (GPGPU)

Limitations: Matter of Choice
- H/W side: vendor wars
  - Semantically identical functionality is implemented by different methods in each internal architecture
  - Vendors are reluctant to open their internal architectures
- API side: too obsolete, too fluctuating, too complex
  - SGI OpenGL: standardization process is too slow
  - Microsoft Direct3D: fast adoption of new technologies
  - GPGPU languages

Summary
- GPU is a massively parallel architecture
  - Many problems map well to GPU-style computing
  - GPUs have a large amount of arithmetic capability
  - Increasing amount of programmability in the pipeline
- Challenge: how do we make the best use of GPU hardware?
  - Think in parallel

Traditional Graphics Pipeline
Lighting and Rasterization
- The most time-consuming part in the early CG era
- Hardware to accelerate pixel processing was introduced
  - 3Dfx Voodoo (1996): no VGA
  - 3DLabs Permedia (1996): H/W support for the OpenGL API
  - NVIDIA Riva (1997)
- Vertices processed on the CPU

Transformation & Lighting
- Hardware-accelerated vertex processing
- Vertex data stored in graphics memory
- Microsoft Direct3D 7 (1999)
- NVIDIA GeForce256 (1999)

Programmable Shader
- Fixed-pipeline acceleration H/W is inflexible
  - Fixed vertex transformation with the WVP (world-view-projection) matrix
  - Fixed shading algorithms, filtering methods, ...
- Demand for high-quality rendering exploded!
  - Complicated texture mapping and filtering methods
  - Various light sources and finer shading methods

Programmable Shader
- Shaders: greater flexibility
  - Vertex shaders allow the manipulation of vertex data
  - Pixel shaders allow the manipulation of pixel data

Programmable Graphics Pipeline (early age)
- Pipeline stages
  - Application
    - Scene management: vertices, ...
  - Vertex operations
    - Transform and lighting
    - Culling, clipping
  - Pixel operations
    - Triangle setup and rasterization
    - Shading, multi-texturing
    - Alpha test, depth buffering, ...
  - Display
- Programmable stages: vertex shader, pixel shader
- A vertex shader operates on one vertex at a time
  - A vertex shader cannot add vertices
- A pixel shader operates on one pixel at a time
  - A pixel shader cannot add pixels

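
The one-element-in, one-element-out rule can be pictured as mapping a pure function over a list. The sketch below is a CPU-side illustration in Python, not real shader code; `vertex_shader`, `pixel_shader`, and `run_stage` are hypothetical names chosen for this example.

```python
def vertex_shader(vertex, offset):
    """Transform a single (x, y, z) vertex; it never sees any other vertex."""
    x, y, z = vertex
    dx, dy, dz = offset
    return (x + dx, y + dy, z + dz)

def pixel_shader(pixel, gain):
    """Shade a single (r, g, b) pixel and clamp each channel to [0, 1]."""
    return tuple(min(1.0, max(0.0, c * gain)) for c in pixel)

def run_stage(shader, elements, *uniforms):
    """Apply the shader once per element; the output count equals the input count."""
    return [shader(e, *uniforms) for e in elements]

triangle = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
moved = run_stage(vertex_shader, triangle, (0.0, 0.0, -5.0))

pixels = [(0.2, 0.4, 0.6), (0.9, 0.9, 0.9)]
shaded = run_stage(pixel_shader, pixels, 1.5)
```

Because each call receives exactly one element and returns exactly one, the shader has no way to add vertices or pixels; only the surrounding pipeline decides how many elements exist.
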
Shader Assembly
- The first programmable shader model on PC
  - Microsoft Direct3D 8 (2000)
  - NVIDIA GeForce 3 and ATI Radeon 8500 (2001)
- Mnemonic instructions correspond to machine instructions of the programmable-shader H/W
  - Too difficult to develop!
  - H/W-dependent

NVIDIA Cg
- C language for graphics
  - Syntax similar to C, with many restrictions and exceptions
  - Integrated with the NVIDIA Cg SDK
- Supports various targets
  - GeForce series or DirectX versions
  - OpenGL extensions

NVIDIA Cg
- Example code: Phong shading (evaluated per fragment)

    void main( position : TEXCOORD0,
               normal   : TEXCOORD1,
               ocolor   : COLOR,
               ambientcol, lightcol, lightpos, eyepos,
               Ka, Kd, Ks, shiny)
    {
        P = position.xyz;

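
Since the Cg listing breaks off after its first statement, the following Python sketch shows what a per-fragment Phong computation typically does with those parameters. It assumes the classic Phong reflection model (ambient + diffuse + specular using the reflected light vector), which may differ from the slide's actual code; the helper functions and the name `phong` are illustrative, and only the parameter names are taken from the slide.

```python
import math

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def scale(a, s):
    return tuple(x * s for x in a)

def normalize(a):
    return scale(a, 1.0 / math.sqrt(dot(a, a)))

def phong(position, normal, ambientcol, lightcol, lightpos, eyepos,
          Ka, Kd, Ks, shiny):
    """Return the RGB color of one fragment (assumed classic Phong model)."""
    P = position
    N = normalize(normal)
    L = normalize(sub(lightpos, P))   # direction from fragment to light
    V = normalize(sub(eyepos, P))     # direction from fragment to eye
    n_dot_l = dot(N, L)
    diffuse = max(n_dot_l, 0.0)
    R = sub(scale(N, 2.0 * n_dot_l), L)  # L reflected about N
    # No highlight when the light is behind the surface
    specular = max(dot(R, V), 0.0) ** shiny if diffuse > 0.0 else 0.0
    return tuple(
        Ka[i] * ambientcol[i]
        + Kd[i] * lightcol[i] * diffuse
        + Ks[i] * lightcol[i] * specular
        for i in range(3)
    )

white = (1.0, 1.0, 1.0)
lit = phong((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), white, white,
            (0.0, 0.0, 5.0), (0.0, 0.0, 5.0),
            (0.1, 0.1, 0.1), (0.5, 0.5, 0.5), (0.25, 0.25, 0.25), 32.0)
unlit = phong((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), white, white,
              (0.0, 0.0, -5.0), (0.0, 0.0, 5.0),
              (0.1, 0.1, 0.1), (0.5, 0.5, 0.5), (0.25, 0.25, 0.25), 32.0)
```

With the light and eye directly above the surface, all three terms contribute; with the light behind the surface, only the ambient term remains.
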
Microsoft HLSL
- Microsoft adopts NVIDIA Cg into the Direct3D API
  - Microsoft Direct3D 9 (2002)
  - HLSL (High Level Shading Language) 2.0 and Shader Model 2.0
- Became the de facto standard shader language on PC graphics hardware
  - ATI Radeon 9500 (2002)
  - NVIDIA GeForceFX (2003)

Restriction of Shader Model 2.0
- Severe limitations on resources
  - 256 instructions per program
  - 16 temporary 4-vector registers
  - 256 uniform parameter registers
  - 2 address registers (4-vector)
  - 6 clip-distance outputs
  - 16 per-vertex attributes (only)
- Texture sampling in pixel shader only
- No dynamic flow control
  - Loops are unrolled
  - Skipping a conditional branch does not save time

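
Why a skipped conditional saves no time can be illustrated by counting work. The sketch below is a Python analogy, assuming that hardware without dynamic flow control evaluates both sides of a conditional and then selects one result; the function names and the `counter` dictionary are hypothetical.

```python
def shade_with_branch(x, counter):
    """CPU-style dynamic branch: only the taken side does any work."""
    if x > 0.5:
        counter["expensive"] += 1
        return x * x
    counter["cheap"] += 1
    return 0.0

def shade_predicated(x, counter):
    """No dynamic flow control: compute both sides, then select by a 0/1 mask."""
    counter["expensive"] += 1
    a = x * x
    counter["cheap"] += 1
    b = 0.0
    mask = float(x > 0.5)  # comparison result used as a weight
    return mask * a + (1.0 - mask) * b

inputs = [0.1, 0.9, 0.3, 0.7]
branch_count = {"expensive": 0, "cheap": 0}
pred_count = {"expensive": 0, "cheap": 0}
branch_out = [shade_with_branch(x, branch_count) for x in inputs]
pred_out = [shade_predicated(x, pred_count) for x in inputs]
```

Both versions return the same values, but the predicated version runs the expensive side for every input, including those that discard it.
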
Shader Model 3.0
- Microsoft Direct3D 9.0c (2004)
  - NVIDIA GeForce 6 series (2004)
  - ATI X1x00 series (2005)
- HLSL 3.0 introduced
  - Grammar is almost identical to HLSL 2.0

Shader Model 3.0
- Many restrictions are relaxed
- Several limitations still exist

Shader Model 3.0
- Dynamic branching
  - Computationally expensive functions can be applied to some areas only

Shader Model 3.0
- Primitive instancing
  - Render a large number of objects with one vertex set and per-instance information

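
Instancing can be sketched as one shared vertex list combined with a small record per object. The Python below is only an illustration of the idea; `draw_instanced` and the `offset`/`scale` fields are assumptions made for this example, not part of any graphics API.

```python
def draw_instanced(mesh, instances):
    """Return the transformed 2D vertices for every instance of one shared mesh."""
    drawn = []
    for inst in instances:          # per-instance information
        ox, oy = inst["offset"]
        s = inst["scale"]
        # The same vertex set is reused for each instance
        drawn.append([(x * s + ox, y * s + oy) for (x, y) in mesh])
    return drawn

quad = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
instances = [
    {"offset": (0.0, 0.0), "scale": 1.0},
    {"offset": (10.0, 5.0), "scale": 2.0},
]
drawn = draw_instanced(quad, instances)
```

The vertex data is stored once; adding another object costs only one more per-instance record.
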
Shader Load Balancing
- Asymmetry in shader computation ability
  - Pixel processing has far outweighed vertex processing in traditional graphics applications
- # of vertices rapidly increased
  - Detailed modeling of objects
- # of objects exploded (e.g. MMORPG)
- Vertex processing is not lightweight anymore

Load Balancing
- Load balancing problem

Unified Shader
- Why use separate cores for VS and PS?
- Since SM 3.0, the distinction between VS and PS has become meaningless
  - Vertex shaders sample textures
  - Dynamic branches

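
The load-balancing argument for unified shaders can be made concrete with an idealized model. The sketch assumes that work divides perfectly across identical units and that a fixed split finishes only when its slower stage finishes; the unit counts and workloads are made-up numbers, and the function names are illustrative.

```python
def frame_time_split(vertex_work, pixel_work, vertex_units, pixel_units):
    """Fixed VS/PS cores: the frame waits for the slower of the two stages."""
    return max(vertex_work / vertex_units, pixel_work / pixel_units)

def frame_time_unified(vertex_work, pixel_work, total_units):
    """Unified cores: any unit can take either kind of work."""
    return (vertex_work + pixel_work) / total_units

# Hypothetical chip: 8 vertex units + 24 pixel units, or 32 unified units
vertex_heavy_split = frame_time_split(480, 240, 8, 24)
vertex_heavy_unified = frame_time_unified(480, 240, 32)
pixel_heavy_split = frame_time_split(80, 1200, 8, 24)
pixel_heavy_unified = frame_time_unified(80, 1200, 32)
```

In this model the unified design is never slower than the fixed split, and it gains the most when the workload mix differs from the ratio the hardware was built for.
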
Unified Shader
Unified Shader
- ATI Xenos in Microsoft Xbox 360 (2005)

Shader Model 4.0
- Microsoft Direct3D 10 (2006)
  - NVIDIA GeForce 8 series (2006)
  - ATI Radeon 2900
- Unified shader
- High flexibility
  - Many limitations are removed or relaxed
    - # of instructions, constants, variables, ...
  - Flexible branching
    - Loop, conditional branching, ...

Shader Model 4.0
- New pipeline

Shader Model 4.0
Direct3D 11
- Shader Model 5.0 with HLSL 5.0
- Tessellation stages
  - Two programmable shaders: hull shader and domain shader
- Compute shader (DirectCompute)
  - New programmable stage for GPGPU support

General-Purpose GPU
- GPU is a processor
  - Extreme performance with high parallelism
  - For special purposes
- But modern GPUs are no longer designed only for special purposes
  - Flexible programming
  - Sufficient bidirectional memory bandwidth

General-Purpose GPU
- GPU is cheaper than other parallel units
  - Huge market: computer games, cinema industry
- Why not use GPU for general-purpose heavy computations?

GPGPU
- Ideal application
  - High arithmetic intensity
  - Large data sets
    - Lots of work to do w/o CPU intervention
  - High parallelism
    - Minimal dependencies between elements

GPGPU Early Days
- Programming by exploiting some GPU functionalities
  - Streams/arrays -> textures
  - Parallel loops -> drawing quads
  - Memory read -> texture fetch
  - Memory write -> frame buffer output
  - Classification of data -> depth test
  - Value accumulation -> alpha blending

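
The mapping above can be emulated on the CPU to show how a general computation was disguised as rendering. This Python sketch is an analogy only, with nested lists standing in for textures and the frame buffer; `texture_fetch` and `draw_quad` are hypothetical helpers, not calls from OpenGL or Direct3D.

```python
def texture_fetch(texture, x, y):
    """Memory read: sample one texel of an input array."""
    return texture[y][x]

def draw_quad(width, height, fragment_program, framebuffer, blend=False):
    """Parallel loop: run the fragment program once per covered pixel."""
    for y in range(height):
        for x in range(width):
            value = fragment_program(x, y)
            if blend:
                framebuffer[y][x] += value  # additive blending accumulates
            else:
                framebuffer[y][x] = value   # memory write: frame buffer output

a = [[1.0, 2.0], [3.0, 4.0]]        # input arrays stored as "textures"
b = [[10.0, 20.0], [30.0, 40.0]]
fb = [[0.0, 0.0], [0.0, 0.0]]

# Pass 1: element-wise a + b
draw_quad(2, 2, lambda x, y: texture_fetch(a, x, y) + texture_fetch(b, x, y), fb)
summed = [row[:] for row in fb]

# Pass 2: accumulate a constant on top of the previous result via blending
draw_quad(2, 2, lambda x, y: 1.0, fb, blend=True)
```

On a real GPU the two nested loops disappear: each pixel of the quad is processed by an independent fragment, which is where the parallelism comes from.
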
GPGPU Early Days
- GPU wrapper for general-purpose programming
  - Packages such exploitations as an API set
- BrookGPU
  - Developed at Stanford University
  - C-like language with streaming extensions
  - Compiles GPGPU-coded kernels to D3D/OpenGL shading models

GPGPU
- H/W vendors noticed
  - ATI CTM (2006)
  - NVIDIA CUDA (2007): dominant in the market
- Standardization
  - OpenCL (2008)

Q/A