Motivation Hardware Overview Programming model. GPU computing. Part 1: General introduction. Ch. Hoelbling. Wuppertal University

Similar documents
arxiv:hep-lat/ v1 21 Nov 2006

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

Graphics Hardware. Instructor Stephen J. Guy

GPGPU. Peter Laurens 1st-year PhD Student, NSC

Spring 2009 Prof. Hyesoon Kim

Spring 2011 Prof. Hyesoon Kim

Shaders. Slide credit to Prof. Zwicker

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

Programmable GPUs. Real Time Graphics 11/13/2013. Nalu 2004 (NVIDIA Corporation) GeForce 6. Virtua Fighter 1995 (SEGA Corporation) NV1

2.11 Particle Systems

CS 428: Fall Introduction to. OpenGL primer. Andrew Nealen, Rutgers, /13/2010 1

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

Lecture 07: Buffers and Textures

GPUs and GPGPUs. Greg Blanton John T. Lubia

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

ECE 571 Advanced Microprocessor-Based Design Lecture 20

Programming shaders & GPUs Christian Miller CS Fall 2011

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Lecture 13: OpenGL Shading Language (GLSL)

From Brook to CUDA. GPU Technology Conference

ECE 571 Advanced Microprocessor-Based Design Lecture 18

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

ECE 574 Cluster Computing Lecture 16

Graphics Processing Unit Architecture (GPU Arch)

Lattice QCD on Graphics Processing Units?

Lighting and Texturing

CS179: GPU Programming

Selecting the right Tesla/GTX GPU from a Drunken Baker's Dozen

Real-Time Rendering Architectures

Antonio R. Miele Marco D. Santambrogio

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

Trends in HPC (hardware complexity and software challenges)

Technology for a better society. hetcomp.com

GRAPHICS HARDWARE. Niels Joubert, 4th August 2010, CS147

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

GPU for HPC. October 2010

Cornell University CS 569: Interactive Computer Graphics. Introduction. Lecture 1. [John C. Stone, UIUC] NASA. University of Calgary

Scanline Rendering 2 1/42

Today. Rendering - III. Outline. Texturing: The 10,000m View. Texture Coordinates. Specifying Texture Coordinates in GL

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

Windowing System on a 3D Pipeline. February 2005

Rationale for Non-Programmable Additions to OpenGL 2.0

Introduction to Modern GPU Hardware

Blink: 3D Display Multiplexing for Virtualized Applications

The GPGPU Programming Model

Introduction to Shaders.

Current Trends in Computer Graphics Hardware

GPGPU, 4th Meeting Mordechai Butrashvily, CEO GASS Company for Advanced Supercomputing Solutions

Numerical Algorithms on Multi-GPU Architectures

Programmable GPUS. Last Time? Reading for Today. Homework 4. Planar Shadows Projective Texture Shadows Shadow Maps Shadow Volumes

CS427 Multicore Architecture and Parallel Computing

GPU Architecture and Function. Michael Foster and Ian Frasch

GPU Memory Model. Adapted from:

CSC573: TSHA Introduction to Accelerators

CS GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1. Markus Hadwiger, KAUST

Texture Mapping. Mike Bailey.

Threading Hardware in G80

SHADER PROGRAMMING. Based on Jian Huang s lecture on Shader Programming

Lecture 2. Shaders, GLSL and GPGPU

arxiv: v1 [physics.comp-ph] 4 Nov 2013

Introduction to CUDA

CS 179: GPU Programming

CONSOLE ARCHITECTURE

The Graphics Pipeline

Computing on GPUs. Prof. Dr. Uli Göhner. DYNAmore GmbH. Stuttgart, Germany

Tutorial on GPU Programming #2. Joong-Youn Lee Supercomputing Center, KISTI

From Shader Code to a Teraflop: How Shader Cores Work

From Shader Code to a Teraflop: How GPU Shader Cores Work. Jonathan Ragan- Kelley (Slides by Kayvon Fatahalian)

Musemage. The Revolution of Image Processing

The Rasterization Pipeline

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

CS770/870 Spring 2017 Open GL Shader Language GLSL

CS770/870 Spring 2017 Open GL Shader Language GLSL

Real-time Graphics 9. GPGPU

Rendering Objects. Need to transform all geometry then

Multi-Processors and GPU

! Readings! ! Room-level, on-chip! vs.!

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

Comparison of High-Speed Ray Casting on GPU

Bringing AAA graphics to mobile platforms. Niklas Smedberg Senior Engine Programmer, Epic Games

Evolution of GPUs Chris Seitz

Consumer graphics cards for fast image processing based on the Pixel Shader 3.0 standard

GpuPy: Accelerating NumPy With a GPU

Graphics Processing Unit (GPU)

CS GPU and GPGPU Programming Lecture 7: Shading and Compute APIs 1. Markus Hadwiger, KAUST

Efficient and Scalable Shading for Many Lights

Programming Graphics Hardware

Mattan Erez. The University of Texas at Austin

CS179 GPU Programming Introduction to CUDA. Lecture originally by Luke Durant and Tamas Szalay

Accelerating image registration on GPUs

General Purpose computation on GPUs. Liangjun Zhang 2/23/2005

NVIDIA s Compute Unified Device Architecture (CUDA)

NVIDIA s Compute Unified Device Architecture (CUDA)

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

HiPANQ Overview of NVIDIA GPU Architecture and Introduction to CUDA/OpenCL Programming, and Parallelization of LDPC codes.

A Bandwidth Effective Rendering Scheme for 3D Texture-based Volume Visualization on GPU

Transcription:

Part 1: General introduction Ch. Hoelbling Wuppertal University Lattice Practices 2011

Outline 1 Motivation 2 Hardware Overview History Present Capabilities 3 Programming model Past: OpenGL Present: CUDA C

WHY GPUs? Far better peak performance than CPUs

WHY GPUs? Even more important for QCD: memory bandwidth

QCD CODE nominal single chip bandwidth [GB/s] 100 10 Wilson matrix GFLOPS 100 10 2005 2006 2007 2008 2009 2005 2006 2007 2008 2009 GPUs give a factor 3 boost compared to CPUs 2-4 years ahead

RAM LANDSCAPE memory 1TB 1GB 1MB network bandwidth 1Gb/s 10Gb/s Nehalem Cell 8i GPU 128 4 64 4 32 4 Neh. cache 16 4 GPU cache Cell cache 8 4 100 1000 bandwidth [GB/s] max. lattice

History How did GPUs become so capable? Until mid 1990s: Framebuffer + BitBlock shifter 2D operations only CPU does all floating point work Fixed pipline Hardware renders 3D polygons Simple texture to surface mapping Programmable shaders Polygon and texture rendering programmable Floating point support Fully general streaming processors Fully generic streaming processor All graphics rendering in software

Present Capabilities TODAYS GPUs Examples: GPU: NVidia GeForce GTX 480, 350e CPU: Intel Xeon W5590 (Nehalem), 1400e Modern GPUs are streaming processors Highly parallel: 512 cores 4 cores Floating point optimized: 1345 GFLOPS 51.2 GFLOPS Memory bandwidth optimized: 177 GB/s 32 GB/s Cache and fast memory (configurable): 2MB registers, 1MB L1+shared, L2, texture, constant 8MB L3 More die space for number crunching Less die space for other stuff No x86 translator, branch prediction, MMU,... Number of program instructions limited: 2 10 6 Fixed amount of memory: 1.5GB up to 192GB

GPUs in Wuppertal GPU clusters in Wuppertal 2004/5: 1 6800 Ultra First tests OpenGL 1.3, AGP Bus 2006: J/Ψ: 110 7900 GTX 1.6TFLOPS sustained (single) 256MB, OpenGL 2.0, PCIe 1.0 2007: uddn: 110 8800 GTX 3TFLOPS sustained (single) 512MB, CUDA 1.0, PCIe 1.0 2008: uddn upgrade: 500 260 GTX 40TFLOPS sustained (single) 896MB, CUDA 1.3, PCIe 2.0

Motivation Hardware Overview Programming model GPUs in Wuppertal uddn: 350 TFLOPS peak, 40 TFLOPS sustained Okt. 2008

Past: OpenGL GPU ARCHITECTURE Textures Vertex processor Rasterizer Fragment processor Screen Elements of the pipeline: Vertex processor: basic transformations (e.g. rotations) Rasterizer: generate texture coordinates Fragment processor: shade a region using texture inputs

Past: OpenGL OLD PROGRAMING MODEL OpenGL, supported by GeForce6+ cards input textures output (screen/textures) x, y coordinates Fragment program y Each pixel computation: incoming textures outgoing textures: same operation on each pixel massively parallel Like a stream: pipeline should be full performance is better for large textures (large system size) x

Past: OpenGL OPENGL BASICS Native data format: 2D arrays (textures) (surfaces of polygons usually displayed on screen) Each pixel (or fragment): 4 floating point numbers: RGBA colour channels (RGB, alpha) Data types: half-float (16 bit), float (32 bit), more recently double (64 bit) Everything done with floats (even indexing)

Past: OpenGL GENERAL SETUP Initialize graphics library (OpenGL) Set up rectangles (textures), that will contain data Write the shader program Done in pseudo-assembler or Cg (C-like) Compiled by driver during runtime Upload to GPU Run the shader (GL call) Download result textures All this is done blindly (no printf in shader)

Past: OpenGL GL EXAMPLE Real life example: x i = y i + z i i = 1... 4nm Shader in Cg: struct FragmentOut { float4 color0:color0; }; FragmentOut example( in float2 mytexcoord:wpos, uniform samplerrect y, uniform samplerrect z ) { FragmentOut c; c.color0 = texrect( y, mytexcoord ) + texrect( z, mytexcoord ); return c; }

Past: OpenGL EXAMPLE (ctd.) Create texture in OpenGL (like malloc) GLuint X; glgentextures( 1, &X); glbindtexture(..., X); gltexparameteri(...); glteximage2d(..., n, m,..., 0 ); // also for y,z Transfer data CPU GPU in OpenGL glbindtexture(..., Y); glframebuffertexture2dext(..., Y,... ); gltexsubimage2d(..., n, m,..., y ); // also for z

Past: OpenGL EXAMPLE (ctd.) Do the actual computation, run the Cg shader cgglsettextureparameter(..., Y); cgglenabletextureparameter(..., "y" ); // also for z glframebuffertexture2dext(..., X, 0 ); gldrawbuffer(... ); cgglbindprogram(... ); glbegin(... ); { glvertex2f( -n, -m); glvertex2f( n, -m); glvertex2f( n, m); glvertex2f( -n, m); } glend( );

Past: OpenGL EXAMPLE (ctd.) Transfer data GPU CPU in OpenGL glframebuffertexture2dext(..., X,... ); glreadbuffer(... ); glreadpixels(..., n, m,..., x); Same in C: for( i = 0; i < 4 * n * m; i++ ) x[i] = y[i] + z[i];

Past: OpenGL GL PECULIARITIES Restriction on input textures: maximum 32 read only can be addressed indirectly (careful with performance!) can have different sizes Restriction on output textures: maximum 4 (only half Wilson vectors!) write only same sizes location fixed per pixel Other features: no branching native vectors allow arbitrary swizzle x=y.rbag+z.agrb much better than SSE!

Present: CUDA C NEW PROGRAMMING MODEL CUDA-C, supported by GeForce8+ cards General-purpose streaming language Exact features supported depend on hardware! Minimal extension to C Other language interfaces exist (C++, Fortran, OpenCL) Lifts OpenGL restrictions Lots of possibilities to screw up performance

Present: CUDA C CUDA SOFTWARE STACK 3 components: Hardware driver API library and its runtime Some higher level math libraries CUFFT, CUBLAS