Course Recap + 3D Graphics on Mobile GPUs

Similar documents
Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

PowerVR Hardware. Architecture Overview for Developers

Scheduling the Graphics Pipeline on a GPU

Addressing the Memory Wall

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

Lecture 4: Geometry Processing. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

CS427 Multicore Architecture and Parallel Computing

The Rasterization Pipeline

Lecture 9: Deferred Shading. Visual Computing Systems CMU , Fall 2013

Enhancing Traditional Rasterization Graphics with Ray Tracing. October 2015

Graphics Processing Unit Architecture (GPU Arch)

The Bifrost GPU architecture and the ARM Mali-G71 GPU

Parallel Triangle Rendering on a Modern GPU

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

Rendering. Converting a 3D scene to a 2D image. Camera. Light. Rendering. View Plane

Many rendering scenarios, such as battle scenes or urban environments, require rendering of large numbers of autonomous characters.

Ray Tracing. Computer Graphics CMU /15-662, Fall 2016

Getting fancy with texture mapping (Part 2) CS559 Spring Apr 2017

Enhancing Traditional Rasterization Graphics with Ray Tracing. March 2015

Vulkan: Architecture positive How Vulkan maps to PowerVR GPUs Kevin sun Lead Developer Support Engineer, APAC PowerVR Graphics.

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

PowerVR Series5. Architecture Guide for Developers

Optimizing and Profiling Unity Games for Mobile Platforms. Angelo Theodorou Senior Software Engineer, MPG Gamelab 2014, 25 th -27 th June

Pipeline Operations. CS 4620 Lecture Steve Marschner. Cornell CS4620 Spring 2018 Lecture 11

CS451Real-time Rendering Pipeline

Bifrost - The GPU architecture for next five billion

Lecture 25: Board Notes: Threads and GPUs

Lecture 13: Reyes Architecture and Implementation. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

CS452/552; EE465/505. Clipping & Scan Conversion

Spring 2011 Prof. Hyesoon Kim

Pipeline Operations. CS 4620 Lecture 14

Deferred Rendering Due: Wednesday November 15 at 10pm

Drawing Fast The Graphics Pipeline

Case 1:17-cv SLR Document 1-3 Filed 01/23/17 Page 1 of 33 PageID #: 60 EXHIBIT C

CS4620/5620: Lecture 14 Pipeline

Pipeline Operations. CS 4620 Lecture 10

Spring 2009 Prof. Hyesoon Kim

Drawing Fast The Graphics Pipeline

Performance Analysis and Culling Algorithms

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

Performance OpenGL Programming (for whatever reason)

Drawing Fast The Graphics Pipeline

GeForce4. John Montrym Henry Moreton

Hardware-driven visibility culling

PowerVR Performance Recommendations. The Golden Rules

Rendering Grass with Instancing in DirectX* 10

Enabling immersive gaming experiences Intro to Ray Tracing

Wed, October 12, 2011

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

CSE 167: Introduction to Computer Graphics Lecture #5: Rasterization. Jürgen P. Schulze, Ph.D. University of California, San Diego Fall Quarter 2015

Graphics and Imaging Architectures

Mobile HW and Bandwidth

A Real-time Micropolygon Rendering Pipeline. Kayvon Fatahalian Stanford University

Bringing AAA graphics to mobile platforms. Niklas Smedberg Senior Engine Programmer, Epic Games

Shadows. COMP 575/770 Spring 2013

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Comparing Reyes and OpenGL on a Stream Architecture

Advanced Shading I: Shadow Rasterization Techniques

graphics pipeline computer graphics graphics pipeline 2009 fabio pellacini 1

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

graphics pipeline computer graphics graphics pipeline 2009 fabio pellacini 1

Achieving Console Quality Games on Mobile

POWERVR MBX. Technology Overview

Case 1:17-cv SLR Document 1-4 Filed 01/23/17 Page 1 of 30 PageID #: 75 EXHIBIT D

Tiled shading: light culling reaching the speed of light. Dmitry Zhdan Developer Technology Engineer, NVIDIA

Sign up for crits! Announcments

Working with Metal Overview

CS 498 VR. Lecture 18-4/4/18. go.illinois.edu/vrlect18

ARM Multimedia IP: working together to drive down system power and bandwidth

The NVIDIA GeForce 8800 GPU

A SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization

Lecture 4: Visibility. (coverage and occlusion using rasterization and the Z-buffer) Visual Computing Systems CMU , Fall 2014

Graphics Hardware. Graphics Processing Unit (GPU) is a Subsidiary hardware. With massively multi-threaded many-core. Dedicated to 2D and 3D graphics

TSBK03 Screen-Space Ambient Occlusion

Beyond Programmable Shading Course ACM SIGGRAPH 2010 Panel: Fixed-Function Hardware

Mattan Erez. The University of Texas at Austin

Inside VR on Mobile. Sam Martin Graphics Architect GDC 2016

CS130 : Computer Graphics Lecture 2: Graphics Pipeline. Tamar Shinar Computer Science & Engineering UC Riverside

The Rasterization Pipeline

COMP 4801 Final Year Project. Ray Tracing for Computer Graphics. Final Project Report FYP Runjing Liu. Advised by. Dr. L.Y.

Software Engineering. ò Look up Design Patterns. ò Code Refactoring. ò Code Documentation? ò One of the lessons to learn for good design:

Hardware-driven Visibility Culling Jeong Hyun Kim

Optimisation. CS7GV3 Real-time Rendering

S U N G - E U I YO O N, K A I S T R E N D E R I N G F R E E LY A VA I L A B L E O N T H E I N T E R N E T

Real - Time Rendering. Pipeline optimization. Michal Červeňanský Juraj Starinský

Lecturer Athanasios Nikolaidis

A SXGA 3D Display Processor with Reduced Rendering Data and Enhanced Precision. Seok-Hoon Kim MVLSI Lab., KAIST

Fast Stereoscopic Rendering on Mobile Ray Tracing GPU for Virtual Reality Applications

EECS 487: Interactive Computer Graphics

Mali Developer Resources. Kevin Ho ARM Taiwan FAE

LOD and Occlusion Christian Miller CS Fall 2011

The F-Buffer: A Rasterization-Order FIFO Buffer for Multi-Pass Rendering. Bill Mark and Kekoa Proudfoot. Stanford University

CS 354R: Computer Game Technology

Software Occlusion Culling

Rasterization and Graphics Hardware. Not just about fancy 3D! Rendering/Rasterization. The simplest case: Points. When do we care?

Advanced Deferred Rendering Techniques. NCCA, Thesis Portfolio Peter Smith

Mali-400 MP: A Scalable GPU for Mobile Devices Tom Olson

CS230 : Computer Graphics Lecture 4. Tamar Shinar Computer Science & Engineering UC Riverside

CS GAME PROGRAMMING Question bank

Transcription:

Lecture 18: Course Recap + 3D Graphics on Mobile GPUs Interactive Computer Graphics

Q. What is a big concern in mobile computing?

A. Power

Two reasons to save power Run at higher performance for a fixed amount of time. Power = heat If a chip gets too hot, it must be clocked down to cool off Run at sufficient performance for a longer amount of time. Power = battery Long battery life is a desirable feature in mobile devices

Mobile phone examples Samsung Galaxy s9 Apple iphone 8 11.5 Watt hours 7 Watt hours

Graphics processors (GPUs) in these mobile phones Samsung Galaxy s9 (non US version) Apple iphone 8 ARM Mali G72MP18 Custom Apple GPU in A11 Processor

One way to conserve power Compute less - Reduce the amount of work required to render a picture - Less computation = less power

Another way to conserve power Read data from memory less often - Redesign algorithms so that they make good use of on-chip memory or processor caches A fact you might not have heard: -It is far more costly (in energy) to load/store data from memory, than it is to perform an arithmetic operation Ballpark numbers - Integer op: ~ 1 pj * - Floating point op: ~20 pj * - Reading 64 bits from small local SRAM (1mm away on chip): ~ 26 pj - Reading 64 bits from low power mobile DRAM (LPDDR): ~1200 pj Implications - Reading 10 GB/sec from memory: ~1.6 watts [Sources: Bill Dally (NVIDIA), Tom Olson (ARM)] * Cost to just perform the logical operation, not counting overhead of instruction decode, load data from registers, etc.

Today: a simple mobile GPU A set of programmable cores (run vertex and fragment shader programs) Hardware for rasterization, texture mapping, and frame-buffer access Rasterizer Rasterizer Rasterizer Rasterizer Depth Test Depth Test Depth Test Depth Test Shader Processor Core Data Cache Texture Shader Processor Core Data Cache Texture Shader Processor Core Data Cache Texture Shader Processor Core Data Cache Texture Render Target Blend Render Target Blend Render Target Blend Render Target Blend

Block diagrams from vendors ARM Mali G72MP18 Imagination PowerVR (in earlier iphones)

Let s consider different workloads Average triangle size Image credit: https://www.theverge.com/2013/11/29/5155726/next-gen-supplementary-piece http://www.mobygames.com/game/android/ghostbusters-slime-city/screenshots/gameshotid,852293/

Let s consider different workloads Scene depth complexity Average number of overlapping triangles per pixel [Imagination Technologies] In this visualization: bright colors = more overlap

One very simple solution Let s assume four GPU cores Divide screen into four quadrants, each processor processes all triangles, but only renders triangles that overlap quadrant Problems?

Unequal work partitioning (partition the primitives to parallel units based on screen overlap) 1 2 3 4

Step 1: parallel geometry processing Distribute triangles to render to the processors (e.g., round robin) Processors performs vertex processing Work queue of triangles in scene Core 1 Core 2 Core 3 Core 4

Step 1 produces list of triangle bins One bin per tile of screen Core runs vertex processing, computes 2D triangle/screen-tile overlap, inserts triangle into appropriate bin(s) List of scene triangles Core 1 Core 2 Core 3 Core 4 After processing first five triangles: Bin 1 list: 1,2,3,4 Bin 2 list: 4,5 3 Bin 1 Bin 2 Bin 3 Bin 4 1 2 4 5 Bin 5 Bin 6 Bin 7 Bin 8 Bin 9 Bin 10 Bin 11 Bin 12

Step 2: per-tile processing Cores process bins in parallel performing rasterization fragment shading and frame buffer update While more bins to process: List of triangles in bin: - Assign bin to available core - For all triangles: - Rasterize - Fragment shade Rasterizer Depth Test Shader Processor Core Data Cache Texture - Depth test - Render target blend Render Target Blend final pixels for NxN tile of render target

What should the size of the bins be?

What should the size of the bins be? Fine granularity interleaving Coarse granularity interleaving [Image source: NVIDIA]

What size should the bins be? Small enough for a tile of the color buffer and depth buffer (potentially supersampled) to fit in a shader processor core s on-chip storage (i.e., cache) Tile sizes in range 16x16 to 64x64 pixels are common ARM Mali GPU: commonly uses 16x16 pixel tiles

Tiled rendering sorts the scene in 2D space to enable efficient color/depth buffer access Consider rendering without a sort: (process triangles in order given) This sample updated three times, but may have fallen out of cache in between accesses 6 8 2 Now consider step 2 of a tiled 1 renderer: 3 7 4 5 Initialize Z and color buffer for tile for all triangles in tile: for all each fragment: shade fragment update depth/color write color tile to final image buffer Q. Why doesn t the renderer need to read color or depth buffer from memory? Q. Why doesn t the renderer need to write depth buffer in memory? * * Assuming application does not need depth buffer for other purposes.

Mobile GPU architects go to many steps to reduce bandwidth to save power Compress frame buffer Compress texture data Eliminate unnecessary memory writes! - Frame 1: - Render frame as normal - Compute hash of pixels in each tile on screen - Frame 2: - Render frame tile at a time - Before storing pixel values for tile to memory, compute hash and see if tile s contents are the same as in the last frame - If yes, skip memory write Slow camera motion: 96% of writes avoided Fast camera motion: ~50% of writes avoided (red tile = required a memory write) [Source: Tom Olson http://community.arm.com/groups/arm-mali-graphics/blog/2012/08/17/how-low-can-you-go-building-low-powerlow-bandwidth-arm-mali-gpus]

Minimizing computation performed

Depth testing as we ve described it Rasterization Fragment Processing Frame-Buffer Ops Graphics pipeline abstraction specifies that depth test is performed here! Pipeline generates, shades, and depth tests orange triangle fragments in this region although they do not contribute to final image. (they are occluded by the blue triangle)

Early Z culling Implemented by all modern GPUs, not just mobile GPUs Application needs to sort geometry to make early Z most effective. Why? Rasterization Rasterization Fragment Processing Frame-Buffer Ops Graphics pipeline specifies that depth test is performed here! Fragment Processing Frame-Buffer Ops Optimization: reorder pipeline operations: perform depth test immediately following rasterization and before fragment shading Key assumption: occlusion results do not depend on fragment shading - Example operations that prevent use of this early Z optimization: enabling alpha test, fragment shader modifies fragment s Z value

Tile-based deferred rendering (TBDR) Mobile GPUs implement deferred shading in the hardware! Divide step 2 of tiled pipeline into two phases: Phase 1: compute what triangle/quad fragment is visible at every sample Phase 2: perform shading of only the visible quad fragments T3 1 2 3 4 7 5 6 8 T1 T2 T4 <not shown T3 T4 for space> T1 1 2 3 T2 none

Summary Mobile 3D graphics implementations are highly optimized for power efficiency - Tiled rendering for bandwidth efficiency* - Deferred rendering to reduce shading costs - Many additional optimizations such as buffer compression, eliminating unnecessary memory ops, etc. * Not all mobile GPUs use tiled rendering as described in this lecture.

Graphics today Computer graphics expands beyond the nuts and bolts ideas of 2D / 3D drawing and mesh representation we ve discussed in class (and is useful well beyond games and movies!)

Computational photography Using computation (and increasingly machine learning) to make more aesthetic photographs, simulate behavior of more complex lenses, etc. Google Pixel 2 Portrait mode Image credit: Google / Matt Jones (https://ai.googleblog.com/2017/10/portrait-mode-on-pixel-2-and-pixel-2-xl.html)

Computational photography Using computation (and increasingly machine learning) to make more aesthetic photographs, simulate behavior of more complex lenses, etc. High Dynamic Range Imaging (HDR)

Creating physically plausible models Via 3D printing, fabrication Creatures that locomotes, furniture that stands, etc. Fabricate models that are balanced to stand Fabricate robots that can balance and move

Advanced geometry processing Fundamental questions about alignment, similarly, symmetry, etc

Advanced displays/rendering for VR/AR Near eye light field display

Content creation and capture Manipulating actors by performance capture Audio input to mesh animation

Synthesizing/simulating sound Simulating sound of scenes, not just their appearance

More graphics classes at Stanford Image Synthesis Techniques, (CS348B, Spring), theory and practice of advanced ray tracing (Hanrahan) Visual Computing Systems, (CS348K, Fall), parallel systems design for computational photographic, deep learning, 3D graphics (Fatahalian) Animation and Simulation (CS348C, Winter), deep dive into animation and simulation techniques (James) Computer Game Design (CS146, Fall), make your own games in Unity (James) Virtual Reality (EE267, Spring), focuses on display, tracking hardware for VR (Wetzstein) Computational Imaging and Display (EE367/CS448i, Winter), advanced course on display design (Wetzstein)

Thanks for being a great class! Good luck on projects! Make sure you have fun, that s the point! See you on the 9th!