A Reconfigurable Architecture for Load-Balanced Rendering

Similar documents
A Stream Compiler for Communication-Exposed Architectures

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Beyond Programmable Shading Keeping Many Cores Busy: Scheduling the Graphics Pipeline

CS427 Multicore Architecture and Parallel Computing

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Case 1:17-cv SLR Document 1-3 Filed 01/23/17 Page 1 of 33 PageID #: 60 EXHIBIT C

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

PowerVR Hardware. Architecture Overview for Developers

CS8803SC Software and Hardware Cooperative Computing GPGPU. Prof. Hyesoon Kim School of Computer Science Georgia Institute of Technology

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

CS230 : Computer Graphics Lecture 4. Tamar Shinar Computer Science & Engineering UC Riverside

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

StreamIt: A Language for Streaming Applications

Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

The F-Buffer: A Rasterization-Order FIFO Buffer for Multi-Pass Rendering. Bill Mark and Kekoa Proudfoot. Stanford University

StreamIt on Fleet. Amir Kamil Computer Science Division, University of California, Berkeley UCB-AK06.

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

GPU Architecture and Function. Michael Foster and Ian Frasch

Scheduling the Graphics Pipeline on a GPU

A Data-Parallel Genealogy: The GPU Family Tree. John Owens University of California, Davis

Portland State University ECE 588/688. Graphics Processors

The Graphics Pipeline

Performance Analysis and Culling Algorithms

Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP

The Bifrost GPU architecture and the ARM Mali-G71 GPU

Jeremy W. Sheaffer 1 David P. Luebke 2 Kevin Skadron 1. University of Virginia Computer Science 2. NVIDIA Research

Lecture 25: Board Notes: Threads and GPUs

GeForce4. John Montrym Henry Moreton

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

Cache Aware Optimization of Stream Programs

Parallel Programming

Vulkan: Architecture positive How Vulkan maps to PowerVR GPUs Kevin sun Lead Developer Support Engineer, APAC PowerVR Graphics.

PowerVR Series5. Architecture Guide for Developers

Mali-400 MP: A Scalable GPU for Mobile Devices Tom Olson

Threading Hardware in G80

Bringing AAA graphics to mobile platforms. Niklas Smedberg Senior Engine Programmer, Epic Games

Rendering Objects. Need to transform all geometry then

Shaders. Slide credit to Prof. Zwicker

Windowing System on a 3D Pipeline. February 2005

Spring 2009 Prof. Hyesoon Kim

Fast Tessellated Rendering on Fermi GF100. HPG 2010 Tim Purcell

Streaming Massive Environments From Zero to 200MPH

Computer Graphics 1. Chapter 7 (June 17th, 2010, 2-4pm): Shading and rendering. LMU München Medieninformatik Andreas Butz Computergraphik 1 SS2010

Spring 2011 Prof. Hyesoon Kim

Monday Morning. Graphics Hardware

Lecture 15: Introduction to GPU programming. Lecture 15: Introduction to GPU programming p. 1

Challenges for GPU Architecture. Michael Doggett Graphics Architecture Group April 2, 2008

A SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization

Graphics Performance Optimisation. John Spitzer Director of European Developer Technology

Real-Time Reyes: Programmable Pipelines and Research Challenges. Anjul Patney University of California, Davis

GPU Computation Strategies & Tricks. Ian Buck NVIDIA

Hardware-driven visibility culling

Enabling immersive gaming experiences Intro to Ray Tracing

Getting fancy with texture mapping (Part 2) CS559 Spring Apr 2017

Lecture 2. Shaders, GLSL and GPGPU

From Brook to CUDA. GPU Technology Conference

Programming Graphics Hardware

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Rendering. Converting a 3D scene to a 2D image. Camera. Light. Rendering. View Plane

GPU Memory Model Overview

Bifrost - The GPU architecture for next five billion

RSX Best Practices. Mark Cerny, Cerny Games David Simpson, Naughty Dog Jon Olick, Naughty Dog

Could you make the XNA functions yourself?

High-Quality Surface Splatting on Today s GPUs

Real-Time Graphics Architecture

Mattan Erez. The University of Texas at Austin

Comparing Reyes and OpenGL on a Stream Architecture

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

Graphics Hardware. Instructor Stephen J. Guy

The NVIDIA GeForce 8800 GPU

General-Purpose Computation on Graphics Hardware

Hardware-driven Visibility Culling Jeong Hyun Kim

Rasterization Overview

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

The Rasterization Pipeline

Course Recap + 3D Graphics on Mobile GPUs

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

Beyond Programmable Shading. Scheduling the Graphics Pipeline

Computer Graphics (CS 543) Lecture 10: Soft Shadows (Maps and Volumes), Normal and Bump Mapping

Hardware Accelerated Volume Visualization. Leonid I. Dimitrov & Milos Sramek GMI Austrian Academy of Sciences

Creating a Scalable Microprocessor:

Scanline Rendering 2 1/42

Blue-Steel Ray Tracer

Real - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský

Coding OpenGL ES 3.0 for Better Graphics Quality

The Pixel-Planes Family of Graphics Architectures

Introduction to Shaders.

COMP 4801 Final Year Project. Ray Tracing for Computer Graphics. Final Project Report FYP Runjing Liu. Advised by. Dr. L.Y.

PROFESSIONAL. WebGL Programming DEVELOPING 3D GRAPHICS FOR THE WEB. Andreas Anyuru WILEY. John Wiley & Sons, Ltd.

Real-Time Graphics Architecture. Kurt Akeley Pat Hanrahan. Ray Tracing.

GPU Memory Model. Adapted from:

Multicore SoC is coming. Scalable and Reconfigurable Stream Processor for Mobile Multimedia Systems. Source: 2007 ISSCC and IDF.

Mio: Fast Multipass Partitioning via Priority-Based Instruction Scheduling

CS GPU and GPGPU Programming Lecture 2: Introduction; GPU Architecture 1. Markus Hadwiger, KAUST

Programmable GPUS. Last Time? Reading for Today. Homework 4. Planar Shadows Projective Texture Shadows Shadow Maps Shadow Volumes

EECS 487: Interactive Computer Graphics

GPGPU introduction and network applications. PacketShaders, SSLShader

Transcription:

A Reconfigurable Architecture for Load-Balanced Rendering Jiawen Chen Michael I. Gordon William Thies Matthias Zwicker Kari Pulli Frédo Durand Graphics Hardware July 31, 2005, Los Angeles, CA

The Load Balancing Problem GPUs: fixed resource allocation Fixed number of functional units per task Horizontal load balancing achieved via data parallelism Vertical load balancing impossible for many applications Our goal: flexible allocation Both vertical and horizontal On a per-rendering pass basis V R T F D data parallel V R T F D V R T F D V R T F D Parallelism in multiple graphics pipelines task parallel

Application-specific load balancing Input Vertex V Vertex Sync Triangle Setup Pixel P Pixel Screenshot from Counterstrike Simplified graphics pipeline

Application-specific load balancing Input Vertex V Vertex Sync Triangle Setup Rasterizer R Rasterizer Screenshot from Doom 3 Rest of Pixel Pipeline Rest of Pixel Pipeline Simplified graphics pipeline

Our Approach: Hardware Use a general-purpose multi-core processor With a programmable communications network Map pipeline stages to one or more cores MIT Raw Processor 16 general purpose cores Low-latency programmable network Diagram of a 4x4 Raw processor Die Photo of 16-tile Raw chip

Our Approach: Software Specify graphics pipeline in software as a stream program Easily reconfigurable Static load balancing Stream graph specifies resource allocation Tailor stream graph to rendering pass StreamIt programming language Input Vertex split V Vertex join Triangle Setup split Pixel P Pixel Sort-middle graphics pipeline stream graph

Benefits of Programmable Approach Compile stream program to multi-core processor Flexible resource allocation Fully programmable pipeline Pipeline specialization Nontraditional configurations Image processing GPGPU Stream graph for graphics pipeline StreamIt Layout on 8x8 Raw

Related Work Scalable Architectures Pomegranate [Eldridge et al., 2000] Streaming Architectures Imagine [Owens et al., 2000] Unified Shader Architectures ATI Xenos

Outline Background Raw Architecture StreamIt programming language Programmer Workflow Examples and Results Future Work

The Raw Processor A scalable computation fabric Mesh of identical tiles No global signals Programmable interconnect Integrated into bypass paths Register mapped Fast neighbor communications Essential for flexible resource allocation Raw tiles Compute processor Programmable Switch Processor A 4x4 Raw chip Computation Resources Switch Processor Diagram

The Raw Processor Current hardware 180nm process 16 tiles at 425 MHz 6.8 GFLOPS peak 47.6 GB/s memory bandwidth Simulation results based on 8x8 configuration 64 tiles at 425 MHz 27.2 GFLOPS peak 108.8 GB/s memory bandwidth (32 ports) Die photo of 16-tile Raw chip 180nm process, 331 mm 2

StreamIt High-level stream programming language Architecture independent Structured Stream Model Computation organized as filters in a stream graph FIFO data channels No global notion of time No global state Example stream graph

StreamIt Graph Constructs filter pipeline feedback loop joiner may be any StreamIt language construct splitter splitjoin parallel computation splitter joiner Graphics pipeline stream graph

Automatic Layout and Scheduling StreamIt compiler performs layout, scheduling on Raw Simulated annealing layout algorithm Generates code for compute processors Generates routing schedule for switch processors Input Vertex Processor Sync Triangle Setup StreamIt Compiler Rasterizer Pixel Processor Frame Buffer Stream graph Layout on 8x8 Raw

Outline Background Raw Architecture StreamIt programming language Programmer Workflow Examples and Results Future Work

Programmer Workflow For each rendering pass Estimate resource requirements Implement pipeline in StreamIt Adjust splitjoin widths Compile with StreamIt compiler Profile application Input Vertex split V Vertex join Triangle Setup split Pixel P Pixel Sort-middle Stream Graph

Switching Between Multiple Configurations Multi-pass rendering algorithms Switch configurations between passes Pipeline flush required anyway (e.g. shadow volumes) Configuration 1 Configuration 2

Experimental Setup Compare reconfigurable pipeline against fixed resource allocation Use same inputs on Raw simulator Compare throughput and utilization Input Vertex Processor Sync Triangle Setup Rasterizer Pixel Processor Frame Buffer Fixed Resource Allocation: 6 vertex units, 15 pixel pipelines Manual layout on Raw

Example: Phong Shading Per-pixel phong-shaded polyhedron 162 vertices, 1 light Covers large area of screen Allocate only 1 vertex unit Exploit task parallelism Devote 2 tiles to pixel shader 1 for computing the lighting direction and normal 1 for shading Pipeline specialization Eliminate texture coordinate interpolation, etc Output, rendered using the Raw simulator

Phong Shading Stream Graph Phong Shading Stream Graph Automatic Layout on Raw Input Vertex Processor Triangle Setup Rasterizer Pixel Processor A Pixel Processor B Frame Buffer

Utilization Plot: Phong Shading Fixed pipeline Reconfigurable pipeline

Example: Shadow Volumes 4 textured triangles, 1 point light Very large shadow volumes cover most of the screen Rendered in 3 passes Initialize depth buffer Draw extruded shadow volume geometry with Z-fail algorithm Draw textured triangles with stencil testing Different configuration for each pass Adjust ratio of vertex to pixel units Eliminate unused operations Output, rendered using the Raw simulator

Shadow Volumes Stream Graph: Passes 1 and 2 Input Vertex Processor Triangle Setup Rasterizer Frame Buffer

Shadow Volumes Stream Graph: Pass 3 Shadow Volumes Pass 3 Stream Graph Automatic Layout on Raw Input Vertex Processor Triangle Setup Rasterizer Texture Lookup Texture Filtering Frame Buffer

Utilization Plot: Shadow Volumes Fixed pipeline Pass 1 Pass 2 Pass 3 Reconfigurable pipeline Pass 1 Pass 2 Pass 3

Limitations Software rasterization is extremely slow 55 cycles per fragment Memory system Technique does not optimize for texture access

Future Work Augment Raw with special purpose hardware Explore memory hierarchy Texture prefetching Cache performance Single-pass rendering algorithms Load imbalances may occur within a pass Decompose scene into multiple passses Tradeoff between throughput gained from better load balance and cost of flush Dynamic Load Balancing

Summary Reconfigurable Architecture Application-specific static load balancing Increased throughput and utilization Ideas: General-purpose multi-core processor Programmable communications network Streaming characterization

Acknowledgements Mike Doggett, Eric Chan David Wentzlaff, Patrick Griffin, Rodric Rabbah, and Jasper Lin John Owens Saman Amarasinghe Raw group at MIT DARPA, NSF, MIT Oxygen Alliance