CS427 Multicore Architecture and Parallel Computing


CS427 Multicore Architecture and Parallel Computing
Lecture 6: GPU Architecture
Li Jiang, 2014/10/9

GPU Scaling

A quiet revolution and potential build-up:
- Computation: 936 GFLOPS vs. 102 GFLOPS
- Memory bandwidth: 86.4 GB/s vs. 8.4 GB/s
- Every PC, phone, and tablet now has a GPU

GPU Speedup

GeForce 8800 GTX vs. 2.2 GHz Opteron 248:
- 10× speedup in a kernel is typical, as long as the kernel can occupy enough parallel threads
- 25× to 400× speedup if the function's data requirements and control flow suit the GPU and the application is optimized

GPU Speedup

Early Graphics Hardware

Early Electronic Machine

Early Graphics Chip

Graphics Pipeline

A sequence of operations that generates an image using object-order processing:
- Primitives are processed one at a time
- Software pipelines (e.g., RenderMan): high quality and efficiency for large scenes
- Hardware pipelines (e.g., graphics accelerators)
- We will cover the algorithms of the modern hardware pipeline, which evolves drastically every few years
- We will only look at triangles

Graphics Pipeline

Handles only simple primitives by design:
- Points, lines, triangles, quads (as two triangles)
- Efficient algorithms
- Complex primitives are handled by tessellation: complex curves are tessellated into line strips; curved surfaces are tessellated into triangle meshes

The pipeline name derives from the architecture's design: a sequence of stages with defined inputs/outputs, giving an easy-to-optimize, modular design.

Graphics Pipeline

Pipeline Stages

Vertex processing
- Input: vertex data (position, normal, color, etc.)
- Output: transformed vertices in the homogeneous canonical view volume, colors, etc.
- Applies the transformation from object space to clip space
- Passes along material and shading data

Clipping and rasterization
- Turns sets of vertices into primitives and fills them in
- Output: set of fragments with interpolated data

Pipeline Stages

Fragment processing
- Output: final color and depth
- Traditionally mostly texture lookups; lighting was computed per vertex
- Today, lighting is computed per pixel

Frame buffer processing
- Output: final picture
- Hidden surface elimination
- Compositing via alpha blending
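The stages above can be sketched as a tiny software pipeline. This is a minimal illustration of object-order processing only, not how GPU hardware is organized; the screen-space input, barycentric coverage test, and flat "fragment shading" are simplifying assumptions:

```python
# Minimal object-order pipeline sketch: rasterize -> fragment -> frame buffer.
W, H = 8, 8
frame = [['.'] * W for _ in range(H)]            # color (frame) buffer
depth = [[float('inf')] * W for _ in range(H)]   # depth (z) buffer

def edge(a, b, p):
    # Twice the signed area of triangle (a, b, p); positive when p lies
    # to the left of edge a->b in this coordinate system.
    return (b[0] - a[0]) * (p[1] - a[1]) - (b[1] - a[1]) * (p[0] - a[0])

def draw(tri, color):
    a, b, c = tri                        # screen-space vertices (x, y, z)
    area = edge(a, b, c)
    if area <= 0:
        return                           # degenerate or back-facing triangle
    for y in range(H):
        for x in range(W):
            p = (x + 0.5, y + 0.5)       # sample at the pixel center
            w0, w1, w2 = edge(b, c, p), edge(c, a, p), edge(a, b, p)
            if w0 >= 0 and w1 >= 0 and w2 >= 0:
                # Rasterization: barycentric interpolation of depth.
                z = (w0 * a[2] + w1 * b[2] + w2 * c[2]) / area
                if z < depth[y][x]:      # hidden-surface elimination
                    depth[y][x] = z
                    frame[y][x] = color  # flat "fragment shading"

draw([(1, 1, 0.5), (7, 1, 0.5), (1, 7, 0.5)], '#')
print('\n'.join(''.join(row) for row in frame))
```

The depth test at the end is the frame-buffer-processing step from the slide: later fragments only land in the picture if they are closer than what is already stored.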

Vertex Processing

Clipping

Rasterization

Anti-Aliasing

Texture

Gouraud Shading

Phong Shading

Alpha Blending

Wireframe

SGI Reality Engine (1997)

Graphics Pipeline Characteristics

- Simple algorithms can be mapped to hardware
- High performance using on-chip parallel execution: highly parallel algorithms; memory accesses tend to be coherent

Graphics Pipeline Characteristics

- Multiple arithmetic units: the NVIDIA GeForce 7800 has 8 vertex units and 24 pixel units
- Very small caches: not needed, since memory accesses are very coherent
- Fast memory architecture: needed for color/z-buffer traffic
- Restricted memory access patterns: read-modify-write
- Easy to make fast: this is what Intel would love!

Programmable Shader

Unified Shader

GeForce 8

GT200

GPU Evolution

Moore's Law

- Computers no longer get faster, just wider
- You must re-think your algorithms to be parallel!
- Data-parallel computing is the most scalable solution

GPGPU 1.0

GPU Computing 1.0: compute pretending to be graphics
- Disguise data as textures or geometry
- Disguise the algorithm as render passes
- Trick the graphics pipeline into doing your computation!
- The term "GPGPU" was coined by Mark Harris

GPU Grows Fast

- GPUs get progressively more capable: fixed function → register combiners → shaders; fp32 pixel hardware greatly extends their reach
- Algorithms get more sophisticated: cellular automata → PDE solvers → ray tracing; clever graphics tricks
- High-level shading languages emerge: HLSL, developed by Microsoft with the Direct3D API; GLSL, with OpenGL; NVIDIA Cg

GPGPU 2.0

GPU Computing 2.0: direct compute
- Program the GPU directly, with no graphics-based restrictions
- GPU computing supplants graphics-based GPGPU
- November 2006: NVIDIA introduces CUDA

GPGPU 3.0

GPU Computing 3.0: an emerging ecosystem
- Hardware & product lines
- Algorithmic sophistication
- Cross-platform standards
- Education & research
- Consumer applications
- High-level languages

GPGPU Platforms

Fermi

Fermi Architecture

SM Architecture

SM Architecture

- Each Thread Block is divided into 32-thread Warps
  - This is an implementation decision, not part of the CUDA programming model
- Warps are the scheduling units in an SM
- If 3 blocks are assigned to an SM and each block has 256 threads, how many Warps are there in the SM?
  - Each block is divided into 256/32 = 8 Warps
  - There are 8 × 3 = 24 Warps
- At any point in time, only one of the 24 Warps is selected for instruction fetch and execution
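The warp-count arithmetic above can be checked with a few lines of Python; the warp size and block configuration are the figures from this slide, and `warps_per_sm` is just an illustrative helper, not a CUDA API:

```python
WARP_SIZE = 32  # threads per warp on this hardware (implementation constant)

def warps_per_sm(blocks_per_sm, threads_per_block):
    # Each block is split into ceil(threads / 32) warps;
    # the SM schedules the warps of all resident blocks.
    warps_per_block = (threads_per_block + WARP_SIZE - 1) // WARP_SIZE
    return blocks_per_sm * warps_per_block

print(warps_per_sm(3, 256))   # 3 blocks x (256/32) warps = 24
```

The rounding up matters: a block of, say, 100 threads still occupies 4 warps, with the last warp partially filled.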

SM Architecture

- SM hardware implements zero-overhead Warp scheduling
  - Warps whose next instruction has its operands ready for consumption are eligible for execution
  - Eligible Warps are selected for execution by a prioritized scheduling policy
  - All threads in a Warp execute the same instruction when selected
- 4 clock cycles are needed to dispatch the same instruction for all threads in a Warp
  - If one global memory access is needed for every 4 instructions, a minimum of 13 Warps is needed to fully tolerate a 200-cycle memory latency
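The 13-Warp figure follows from simple arithmetic: between two of its memory accesses, each warp keeps the SM busy for 4 instructions × 4 cycles = 16 cycles, so ⌈200 / 16⌉ = 13 warps are needed to cover a 200-cycle latency. A sketch of that calculation:

```python
import math

def warps_to_hide_latency(latency_cycles, instrs_per_mem_access, cycles_per_instr):
    # Cycles of useful work one warp supplies between two of its memory accesses.
    busy_cycles_per_warp = instrs_per_mem_access * cycles_per_instr
    # Enough warps must be resident that their combined work covers the latency.
    return math.ceil(latency_cycles / busy_cycles_per_warp)

print(warps_to_hide_latency(200, 4, 4))   # ceil(200 / 16) = 13
```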

SM Architecture

- All register operands of all instructions in the Instruction Buffer are scoreboarded
  - An instruction becomes ready after the needed values are deposited
  - Prevents hazards
  - Cleared instructions are eligible for issue
- Decoupled memory/processor pipelines
  - Any thread can continue to issue instructions until scoreboarding prevents issue
  - Allows memory/processor ops to proceed in the shadow of other waiting memory/processor ops

SM Architecture

Register File (RF)
- 32 KB (8K entries) for each SM
- 16 physical lanes × 2K registers/lane
- Single read/write port, heavily banked
- The TEX pipe can also read/write the RF
- The Load/Store pipe can also read/write the RF

SM Architecture

- Registers are dynamically partitioned across all blocks/warps assigned to the SM
  - This is an implementation decision, not part of CUDA
- Once assigned to a block, a register is NOT accessible by threads in other warps
- Each thread in the same block only accesses the registers assigned to itself
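One consequence of this partitioning is that the register file caps how many blocks can be co-resident on an SM. The sketch below uses the 8K-register figure from the register-file slide; `max_blocks_by_registers` is a hypothetical helper for illustration, not a CUDA API, and it ignores the other occupancy limits (shared memory, thread count):

```python
RF_REGISTERS = 8 * 1024   # 8K 32-bit registers per SM (from the RF slide)

def max_blocks_by_registers(threads_per_block, regs_per_thread):
    # Registers are reserved for every resident thread, so the register
    # file bounds the number of co-resident blocks on one SM.
    regs_per_block = threads_per_block * regs_per_thread
    return RF_REGISTERS // regs_per_block

print(max_blocks_by_registers(256, 10))   # 8192 // 2560 = 3 blocks
print(max_blocks_by_registers(256, 16))   # 8192 // 4096 = 2 blocks
```

This is why a kernel that uses a few more registers per thread can abruptly lose a whole resident block, and with it the extra warps used to hide memory latency.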

SM Architecture

- Each SM has 16 KB of Shared Memory
  - 16 banks of 32-bit words
- CUDA uses Shared Memory as shared storage visible to all threads in a thread block
  - Read and write access
- Not used explicitly in pixel shader programs: we dislike pixels talking to each other

SM Architecture

- Immediate address constants/cache
- Indexed address constants/cache
- Constants are stored in DRAM and cached on chip
  - 1 L1 cache per SM
- A constant value can be broadcast to all threads in a Warp
  - An extremely efficient way of accessing a value that is common to all threads in a block!

Bank Conflict

Shared memory is as fast as registers if there are no bank conflicts.
- The fast cases:
  - If all threads access different banks, there is no bank conflict
  - If all threads access the identical address, there is no bank conflict (broadcast)
- The slow case:
  - Bank conflict: multiple threads access the same bank
  - The accesses must be serialized
  - Cost = max # of simultaneous accesses to a single bank
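The cost rule can be modeled directly: map each byte address to bank (word address mod 16) and count the most-loaded bank, treating the all-identical-address case as a broadcast. A minimal sketch, assuming 4-byte words, the 16 banks from the slide, and a 16-thread half-warp issuing one access per thread:

```python
NUM_BANKS = 16
WORD_BYTES = 4

def shared_mem_cost(addresses):
    # Cost, in serialized access rounds, of one shared-memory request.
    if len(set(addresses)) == 1:
        return 1                      # identical address: hardware broadcasts
    banks = [(a // WORD_BYTES) % NUM_BANKS for a in addresses]
    # Serialization cost = max simultaneous accesses to a single bank.
    return max(banks.count(b) for b in set(banks))

half_warp = range(16)
print(shared_mem_cost([4 * t for t in half_warp]))   # stride of 1 word: cost 1
print(shared_mem_cost([8 * t for t in half_warp]))   # stride of 2 words: 2-way conflict, cost 2
print(shared_mem_cost([0] * 16))                     # broadcast: cost 1
```

The stride-2 case shows the classic pitfall: threads 0 and 8 both land in bank 0, threads 1 and 9 in bank 2, and so on, doubling the access time.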

Bank Conflict

Final Thought