The Need for Programmability

Similar documents
Windowing System on a 3D Pipeline. February 2005

Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

Graphics Processing Unit Architecture (GPU Arch)

ECE 571 Advanced Microprocessor-Based Design Lecture 20

3D Computer Games Technology and History. Markus Hadwiger VRVis Research Center

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

Programming Graphics Hardware

ECE 571 Advanced Microprocessor-Based Design Lecture 18

Real-Time Reyes: Programmable Pipelines and Research Challenges. Anjul Patney University of California, Davis

PowerVR Series5. Architecture Guide for Developers

Graphics Performance Optimisation. John Spitzer Director of European Developer Technology

The Rasterization Pipeline

Spring 2009 Prof. Hyesoon Kim

1. Introduction 2. Methods for I/O Operations 3. Buses 4. Liquid Crystal Displays 5. Other Types of Displays 6. Graphics Adapters 7.

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

Monday Morning. Graphics Hardware

CS427 Multicore Architecture and Parallel Computing

PowerVR Hardware. Architecture Overview for Developers

Graphics Hardware. Instructor Stephen J. Guy

GPUs and GPGPUs. Greg Blanton John T. Lubia

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

Anatomy of AMD s TeraScale Graphics Engine

POWERVR MBX. Technology Overview

The NVIDIA GeForce 8800 GPU

GeForce4. John Montrym Henry Moreton

Spring 2011 Prof. Hyesoon Kim

Scanline Rendering 2 1/42

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

Real-Time Reyes Programmable Pipelines and Research Challenges

CS GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1. Markus Hadwiger, KAUST

User Guide. NVIDIA Quadro FX 4700 X2 BY PNY Technologies Part No. VCQFX4700X2-PCIE-PB

NVIDIA workstation 2D and 3D graphics adapter upgrade options let you experience productivity improvements and superior image quality

Hot Chips Bringing Workstation Graphics Performance to a Desktop Near You. S3 Incorporated August 18-20, 1996

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

VISUALIZE Workstation Graphics for Windows NT. By Ken Severson HP Workstation System Lab

0;L$+LJK3HUIRUPDQFH ;3URFHVVRU:LWK,QWHJUDWHG'*UDSKLFV

Direct Rendering of Trimmed NURBS Surfaces

NVIDIA nfinitefx Engine: Programmable Pixel Shaders

Rendering. Converting a 3D scene to a 2D image. Camera. Light. Rendering. View Plane

RADEON X1000 Memory Controller

Dave Shreiner, ARM March 2009

Hardware Accelerated Volume Visualization. Leonid I. Dimitrov & Milos Sramek GMI Austrian Academy of Sciences

RSX Best Practices. Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog

Tutorial on GPU Programming #2. Joong-Youn Lee Supercomputing Center, KISTI

CHAPTER 1 Graphics Systems and Models 3

Craig Peeper Software Architect Windows Graphics & Gaming Technologies Microsoft Corporation

Real-Time Graphics Architecture

Lecture 13: Reyes Architecture and Implementation. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Real-Time Buffer Compression. Michael Doggett Department of Computer Science Lund university

GoForce 3D: Coming to a Pixel Near You

Evolution of GPUs Chris Seitz

Cornell University CS 569: Interactive Computer Graphics. Introduction. Lecture 1. [John C. Stone, UIUC] NASA. University of Calgary

Mattan Erez. The University of Texas at Austin

Whiz-Bang Graphics and Media Performance for Java Platform, Micro Edition (JavaME)

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Real-World Applications of Computer Arithmetic

Lecture 9: Deferred Shading. Visual Computing Systems CMU , Fall 2013

Volume Graphics Introduction

GPU Memory Model Overview

Using Virtual Texturing to Handle Massive Texture Data

Grafica Computazionale: Lezione 30. Grafica Computazionale. Hiding complexity... ;) Introduction to OpenGL. lezione30 Introduction to OpenGL

Readings on graphics architecture for Advanced Computer Architecture class

GeForce3 OpenGL Performance. John Spitzer

2 x Maximum Display Monitor(s) support MHz Core Clock 28 nm Chip 384 x Stream Processors. 145(L)X95(W)X26(H) mm Size. 1.

LPGPU Workshop on Power-Efficient GPU and Many-core Computing (PEGPUM 2014)

Motivation MGB Agenda. Compression. Scalability. Scalability. Motivation. Tessellation Basics. DX11 Tessellation Pipeline

GPU-Based Volume Rendering of. Unstructured Grids. João L. D. Comba. Fábio F. Bernardon UFRGS

Hardware Displacement Mapping

Graphics Hardware. Ulf Assarsson. Graphics hardware why? Recall the following. Perspective-correct texturing

Module Introduction. Content 15 pages 2 questions. Learning Time 25 minutes

Building scalable 3D applications. Ville Miettinen Hybrid Graphics

Could you make the XNA functions yourself?

E.Order of Operations

2.11 Particle Systems

Performance Analysis and Culling Algorithms

Real - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský

Antonio R. Miele Marco D. Santambrogio

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

Beginning Direct3D Game Programming: 1. The History of Direct3D Graphics

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

Module 13C: Using The 3D Graphics APIs OpenGL ES

The VISUALIZE fx family of graphics subsystems consists of three

Introduction to Shaders.

Feeding the Beast: How to Satiate Your GoForce While Differentiating Your Game

Hardware-driven visibility culling

GPU Architecture and Function. Michael Foster and Ian Frasch

SAPPHIRE R7 260X 2GB GDDR5 OC BATLELFIELD 4 EDITION

CONSOLE ARCHITECTURE

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

Mattan Erez. The University of Texas at Austin

Squeezing Performance out of your Game with ATI Developer Performance Tools and Optimization Techniques

Texture mapping. Computer Graphics CSE 167 Lecture 9

Chapter 1 Introduction

GPU Target Applications

Transcription:

Visual Processing The next graphics revolution GPUs Graphics Processors have been engineered for extreme speed - Highly parallel pipelines exploits natural parallelism in pixel and vertex processing - BUT hardware-centric design limits algorithm complexity and adaptability CPUs Central Processors have been engineered for flexibility - High-level language programmability using state-of the art compiler technology - Virtual memory to cost-effectively break the barriers of physical memory - Multi-threading to enable responsive high-performance in a complex environment - BUT limited parallelism gives poor graphics performance GPUs Highly Parallel optimized pipelines Speed VISUAL PROCESSORS (VPUs) A new breed of graphics processors Versatility CPUs Flexible and programmable Copyright 3Dlabs, 2003 - Page 1 The Need for Programmability Visual realism needs more algorithmic flexibility Today s fast graphics chips are used for modeling and design - Limited, hardware-based vertex and fragment operations But highly realistic images are still rendered on CPUs - Large, sophisticated software renderers Today s graphics hardware cannot accelerate large software renderers - Traditional hardware has fixed functionality, limited flexibility and poor programmability Visual Processing Goal #1 Accelerate shading algorithms expressed in high-level, hardware independent shading languages Copyright 3Dlabs, 2003 - Page 2 8-2

The Need for Flexibility Learning from decades of CPU experience Too many graphics chips are optimized to run a single game application - Single task, single full screen rendering surface There are now many competing demands for graphics resources - Multiple threads and applications all demanding seamless interactivity - 3D GUIs arriving using an impossibly large amount of graphics memory Multi-tasking, Lightweight threads, Real-time response Virtualize memory resources, page unused resources to host virtual memory Hyper-threaded Host CPU D3D based Windows UI Multiple rich media applications Multiple Command Streams Multi-threaded Visual Processor D3D based Windows UI Multiple rich media applications Graphics Memory Multiple overlapping, composited and obscured 3D surfaces up to and beyond full-screen size Visual Processing Goal #2 Virtualize processor & memory resources for optimal system performance and flexibility Copyright 3Dlabs, 2003 - Page 3 The Visual Processing Vision Combining state-of-the-art hardware and software To create the next major step in graphics acceleration capability Programmable hardware architecture with compiler-centric approach Programmable, RISC-like, orthogonal compiler friendly architecture Visual Processing Advanced silicon processors - Visual Processor Units or VPUs Flexible CPU-like virtualized resource management For highly interactive real-world system performance, not just straight-line speed High-level, hardware independent shading languages Such as D3Dx and OpenGL 2.0 with compilers to drive any hardware target Massive parallelism for interactive performance Exploiting parallelism in geometry and pixel processing for outstanding performance Copyright 3Dlabs, 2003 - Page 4 8-3

3Dlabs Visual Processing Architecture The ideal mix of flexibility and price/performance VPUs SIMD vertex, texture and pixel arrays Inserted into traditional pipeline Ideal mix of speed and flexibility 100% Hardwired No programmable elements No flexibility Pure hardware speed Oxygen GVX1 GeForce, GeForce2 Radeon 128 Programmable Geometry Microcoded SIMD geometry processing Fixed function texture and pixel processing Wildcat I and II GPUs SIMD Geometry Combiner stage texture processing No programmable pixel stage GeForce3, GeForce4 Radeon 8500 Pixel processing Texture processing Geometry and lighting processing 100% SIMD No hardware pipeline Uncompetitive price performance Flexible but hard to program PixelFlow PixelFusion Programmable Flexible (register combiner) Fixed Function CPU Flexible and Programmable Minimal parallelism gives low performance Pentium, Athlon Copyright 3Dlabs, 2003 - Page 5 P10 Block Diagram The industry s first VPU At a high-level P10 looks deceptively normal graphics processor Command Processor Transform & Lighting Rasterizer Bus Interface Memory Controller VGA Dual Analog/ Digital Outputs High-speed VIP 2 Digital Input Copyright 3Dlabs, 2003 - Page 6 8-4

P10 Pipeline Architecture Traditional pipeline with five programmable stages load commands and data transform and light vertices 16 32-bit floating-point geometry processors rasterize triangle into tiles Command Vertex Processor Cull Clip Setup Tile Setup Load Visibility Store test depth, GID, and stencil Setup Coordinate Load Filter Shade 64 32-bit floating-point texture coordinate generators over four texture pipes 64 32-bit integer texture shading processors over four texture pipes Dedicated hardware units offloads mom-varying housekeeping functionality from SIMD arrays Setup Coordinate Load Filter Shade Setup Coordinate Load Filter Shade Setup Coordinate Load Filter Shade load and apply texture Array of 64 32-bit pixel processors fed by an a programmable address processor alpha blend, dither, anti-aliasing custom formatting, logic ops Address Load Pixel Store Copyright 3Dlabs, 2003 - Page 7 P10 Command Processor Creating the first real-time, multi-threaded VPU P10 can efficiently process multiple visual processing threads - Use a virtual VPU context for each thread Hardware scans command ring buffers from multiple host threads - No CPU software intervention required VPU context switches to track CPU workload Efficient context switch to rapidly jump between VPU threads - General-purpose state change in 15us very low-overhead Isochronous interrupt invokes hard real-time state change in 3us - High priority processes e.g. no-drop video processing Constant scanning of multiple buffers by hardware to track busiest threads Multiple circular command buffers from many host threads P10 Command Processor Command stream to P10 pipeline Context switch commands to P10 High priority command stream triggered by real time event (e.g. video blank) Copyright 3Dlabs, 2003 - Page 8 8-5

P10 Vertex Processing Flexible, powerful and efficient SIMD array of 16 floating point scalar processors - Powerful, RISC-like instruction set - Move, Add, Mul, MAdd, Min, Max, IntFloat, Fract, Trunc, Dot, Div, RSqrt, Log, Clipping Flow control includes conditional jumps, subroutines, loops - A superset of DX9 flow control Parallelism is completely automatic - Simple programming model - single processor From command processor Automatic distribution and collation of processing across SIMD array VP Manager VP VP VP VP VP VP VP VP VP VP VP VP VP VP VP VP VP Mux To texture processor Copyright 3Dlabs, 2003 - Page 9 Using P10 Vertex Processing Substantial flexibility Loops add significant flexibility - E.g. can programmatically handle up to 200 lights Can write intermediate results into memory - Enables multi-pass algorithms Can process surfaces and higher-order geometry - NURBS, N-Patches, tessellation, surface subdivision, vertex blending Static and dynamic displacement mapping - Offset position of vertex by contents of texture map Wavelet-based geometry compression - Useful for large geometry models No set limits to what ANY of the SIMD arrays can used for - General purpose processors reading and writing memory - Floating point image processing - Generate texture coordinates - Payroll Copyright 3Dlabs, 2003 - Page 10 8-6

P10 Texture Processing 128 32-bit SIMD Processors Programmable filter kernels - Flexibility to go beyond hardwired filters - anisotropic, bi-cubic, mip-mapped 3D Eight simultaneous textures in a single pass (Nvidia has four, ATI six) - Each map independently 1D, 2D, 3D or cube with independent size, wrapping, filtering Up to 8Kx8K or 1Kx1Kx1K texels per texture - Arbitrary size no power of 2 restrictions (unless doing mip-mapping) Programmable texture formats - Compressed DXT1-5, YUV422 supported natively, others can be programmed Programmability for advanced functionality - Wavelets, ray casting into volumetric textures, advanced video operations Fixed function unit calculates plane equation parameters Load texture data from cache Programmable 64-way integer SIMD array calculates final color of pixel Setup Coordinate Load Filter Shader Programmable 64-way floating point SIMD array calculates coordinates Feedback from filter unit supports for multi-pass operations Hardwired common filters: Bilinear, trilinear mip-mapping Copyright 3Dlabs, 2003 - Page 11 Programmable Pixel Processing Unprecedented back-end processing flexibility 100% programmable back-end flexibility - For anti-aliasing, image processing, compositing etc Highly flexible pixel formats - Use byte planar memory formats for unlimited pixel depth including 16 bit components - R10:G10:B10:A2 is also a natively supported format Rich anti-aliasing choices - OpenGL edge anti-aliasing 1-16 samples - 64-bit hardware accumulation buffer unlimited samples - Super sampling effectively unlimited samples - Multi sampling 1-8 samples - anywhere on a 16x16 sub pixel grid, stochastic positioning - Preferred - same memory footprint as super sample - higher performance, slightly less quality Processed pixels from texture processor Address Load Pixel Store Programmable address processor Scalar SIMD array of 64 pixel processors for alpha blending, dithering, anti-aliasing custom formatting, logic ops Copyright 3Dlabs, 2003 - Page 12 8-7

P10 Virtual Memory Enabling virtual 16Gbyte graphics boards Virtual addresses converted to physical addresses just like a CPU - Logical address space of 16Gbytes with a 4Kbyte page size - For ALL memory accesses not just textures On-board memory becomes a large L2-cache - With hardware optimized demand paging to and from host virtual memory/disk Solves multiple intractable texture management problems - Massive textures can be transparently handled even if they don t fit in physical memory - Eliminates memory fragmentation - texture pages do not need to be physically contiguous - Improves performance pages only loaded as required - Obscured graphics memory automatically paged out to host Host Virtual Memory System DMA missing pages Virtual Memory Unit Physical Addresses Logical Addresses P10 Core On-board memory acts as L2 cache for all buffers Copyright 3Dlabs, 2003 - Page 13 The P10 Processor Leading performance, functionality and quality 170GFlops and 1 TeraOp Processing Power - Over 200 floating point and integer SIMD processors Over 20 GB/sec Memory Bandwidth - Up to 256MB of 256-bit DDR memory Fast graphics performance - 60M drawn polygons/sec, Over 1Gpixels/sec fill-rate: drawn, trilinear mip-mapped 10 bit DACs - Industry-leading quality Dual head - Dual analog and digital displays High-speed VIP-2 video input port - For TV and video applications For embedded applications - 14W worst case power scales linearly with frequency - Can interface directly to PCI-66 (with clamping) Can drive 128-bit memory for reduced power and space 820-ball thermally-enhanced HSBGA package Copyright 3Dlabs, 2003 - Page 14 8-8

Visual Processing Architecture Flexible programmability throughout the pipeline Compilers need RISC-like orthogonality - NOT CISC-like specialized gadgets (and acronyms) Surfaces Vertices Textures Pixels TRUFORM SMARTSHADER Pixel Tapestry Charisma Engine ATI Radeon 8500 N-Patches??? 4 Processors (Ganged vector) Register Combiner 6 Textures Fixed Function AA nfinitefx NVIDIA GeForce4 Ti Bezier, B-Splines 2x4 Processors (Ganged vector) Register Combiner 4 Textures Fixed Function AA 6 multi-samples 4 multi-samples Visual Processing Architecture 3Dlabs P10 N-Patches, Bezier, B-splines, NURBS 16 Processors (SIMD scalar) 128 Processors (SIMD scalar) 8 Textures 64 Processors (SIMD scalar) 8 multi-sample AA Image Processing 3Dlabs Advantages RISC-like scalar programmability Genuine programmability - not register combiners Unique pixel programmability General Purpose Programmability Flexibility (Texture Combiner) Fixed Function Copyright 3Dlabs, 2003 - Page 15 P10 Display Unit The ultimate RAMDAC Two units in P10 for independent dual displays 16 entry GID LUT fed from 4 bit planes of any channel for per-pixel control of double buffering, LUT select etc. Control field inserted back into alpha channel GID LUT Underlay Main Image Background Color Color Key & Blend Color Key & Blend LUT can be optionally positioned before blend Channels combined with color key test and/or blend (8 bits per component) Overlay Color Key & Blend Cursor Independent channels read from memory Cursor stored in memory and can be large (DX9 ready) 10 bit DAC for high quality display Color Key & Blend Dual 256-entry LUTs DAC LUT LUT Controller Mux Control External TV Encoder Or TMDS Uses 24 bit digital output pins, >250 MHz Copyright 3Dlabs, 2003 - Page 16 8-9