Saving the Planet: Designing Low-Power, Low-Bandwidth GPUs


Alan Tsai, Business Development Manager, ARM

Saving the Planet? Really?
(Photo courtesy of NASA.)

Mobile GPU design is all about power
- It's not about the battery
- It's about heat
- The power you have now is all you're ever going to get

Outline
- Power and memory bandwidth
- Three ways to reduce memory bandwidth
  - Tile-based rendering
  - Transaction elimination
  - Advanced texture compression
- Conclusion

Where does the power go?
- System power: radio, GPS, Wi-Fi, etc.
- Display
- Computing: static power and dynamic power
- Memory bandwidth

Does bandwidth matter?
Rule of thumb: the energy cost of transferring one byte is about 0.15 nJ/B*
* Please take with several grains of salt. Assumes a 2x32-bit LPDDR2 memory system and includes the memory controller, DDR PHY, memory, I/O, and other assumptions. Your mileage may vary. Void where prohibited.

Does bandwidth matter?
- The assumed memory system can deliver 4 to 8 GB/s
- 4 to 8 GB/s × 0.15 nJ/B = 600 to 1200 mW
Yes, bandwidth matters
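The arithmetic on this slide is just bandwidth multiplied by energy per byte. A minimal Python sketch, assuming the talk's rough 0.15 nJ/B figure and decimal units (1 GB = 10^9 bytes); the function name `bandwidth_power_mw` is hypothetical:

```python
# Rule of thumb from the talk: about 0.15 nJ to move one byte through the
# memory system (a rough LPDDR2 estimate, not a measured constant).
ENERGY_PER_BYTE_J = 0.15e-9


def bandwidth_power_mw(bytes_per_second: float,
                       joules_per_byte: float = ENERGY_PER_BYTE_J) -> float:
    """Return the power in mW spent sustaining `bytes_per_second` of traffic."""
    watts = bytes_per_second * joules_per_byte  # (bytes/s) * (J/byte) = W
    return watts * 1e3


if __name__ == "__main__":
    for gb_per_s in (4, 8):
        mw = bandwidth_power_mw(gb_per_s * 1e9)  # assumes 1 GB = 1e9 bytes
        print(f"{gb_per_s} GB/s -> {mw:.0f} mW")
```

Running it prints 600 mW for 4 GB/s and 1200 mW for 8 GB/s, matching the slide.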

A Trip Down the Graphics Pipeline
[Figure: GPU pipeline stages (vertex shader, TA, RAST, fragment shader, Z-test, blend, resolve), with vertex buffers, textures, the multisampled depth and color buffers (MS Z, MS C), and the resolved color buffer (C) all in memory]
Observations:
- Data explodes from per-vertex to per-fragment to per-sample
- Frame buffer bandwidth dominates, especially when multisampled
What to do? Buffers are too big to cache
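To put a number on the data explosion, the sketch below sizes the multisampled color and depth buffers for one frame. The 1920x1080 resolution, the 4 bytes of color and 4 bytes of depth per sample, and the helper name `framebuffer_bytes` are illustrative assumptions, not figures from the talk:

```python
def framebuffer_bytes(width: int, height: int, samples: int = 1,
                      color_bytes: int = 4, depth_bytes: int = 4) -> int:
    """Bytes held in color + depth buffers storing `samples` samples per pixel."""
    return width * height * samples * (color_bytes + depth_bytes)


if __name__ == "__main__":
    width, height = 1920, 1080  # illustrative resolution
    for samples in (1, 4):
        size = framebuffer_bytes(width, height, samples)
        # Lower bound: each sample written once per frame, no reads counted.
        print(f"{samples}x MSAA: {size / 1e6:.1f} MB per frame, "
              f"{size * 60 / 1e9:.2f} GB/s if written once per frame at 60 fps")
```

With 4x MSAA at 1080p this comes to about 66 MB per frame. Writing every sample once per frame at 60 fps is already about 4 GB/s before counting any depth-test or blending reads, which is why keeping these buffers in DRAM is so expensive.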

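Tile-based rendering pipeline
[Figure: pipeline stages (vertex shader, TA, RAST, fragment shader, Z-test, blend, resolve) with the screen divided into tiles numbered 1 to 9; vertex buffers, polygon lists, textures, and the resolved color buffer (C) are in memory, while the multisampled depth and color buffers (MS Z, MS C) are on-chip]
Tradeoff: vertices must flow through memory, but sample buffers are now on-chip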
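The extra memory traffic in a tile-based design comes from binning: after vertex shading, each triangle is recorded in the polygon list of every tile it may touch, and the fragment work then runs tile by tile out of on-chip buffers. A minimal sketch of the binning step in Python, assuming a hypothetical 16x16-pixel tile and a conservative bounding-box test; real hardware tilers use their own, more precise data structures and tests:

```python
import math


def bin_triangles(triangles, width, height, tile=16):
    """Build per-tile polygon lists from screen-space triangles.

    `triangles` is a list of ((x0, y0), (x1, y1), (x2, y2)) tuples in pixels.
    Returns {(tile_x, tile_y): [triangle indices]}. The 16-pixel default tile
    size and the bounding-box test are assumptions for illustration; the
    test is conservative, so a triangle can land in a tile it does not cover.
    """
    tiles_x = math.ceil(width / tile)
    tiles_y = math.ceil(height / tile)
    lists = {}
    for index, tri in enumerate(triangles):
        xs = [x for x, _ in tri]
        ys = [y for _, y in tri]
        # Bounding box in tile coordinates, clamped to the screen.
        tx0 = max(0, math.floor(min(xs)) // tile)
        tx1 = min(tiles_x - 1, math.floor(max(xs)) // tile)
        ty0 = max(0, math.floor(min(ys)) // tile)
        ty1 = min(tiles_y - 1, math.floor(max(ys)) // tile)
        for ty in range(ty0, ty1 + 1):
            for tx in range(tx0, tx1 + 1):
                lists.setdefault((tx, ty), []).append(index)
    return lists


if __name__ == "__main__":
    tris = [((0, 0), (20, 0), (0, 20)), ((40, 40), (47, 40), (40, 47))]
    for key, indices in sorted(bin_triangles(tris, 64, 64).items()):
        print(key, indices)
```

On a 64x64 screen the first triangle lands in tiles (0, 0), (1, 0), (0, 1), and (1, 1) under the bounding-box test, even though it does not actually cover tile (1, 1); the small second triangle lands only in tile (2, 2). The fragment stage can then shade each tile's list against an on-chip tile buffer and write the finished tile to memory once.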

Tile-based rendering is a win!
Used in the majority of mobile GPUs
Variations:
- ARM Mali: tile-based direct rendering, small tiles
- Imagination SGX: tile-based deferred rendering, small tiles
- Qualcomm Adreno: chunk-based direct rendering, large tiles (chunks)

What about tile writeback?
[Figure: the tile-based pipeline, with vertex buffers, polygon lists, textures, and the resolved color buffer (C) in GPU memory]
Is it really a problem?

Screens keep getting bigger
Apply our rule of thumb to writing the frame buffer out once per frame:
Screen        Resolution    Bytes/pixel   FPS   Power (mW)
VGA           640 x 480     2             30    3
WXGA          1280 x 768    4             30    18
FHD           1920 x 1080   4             60    75
MacBook Pro   2880 x 1800   4             60    187
4K            3840 x 2160   4             60    299
Yes, this is a problem we have to deal with
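Each row of the table is width × height × bytes per pixel × frames per second × 0.15 nJ/B, that is, the cost of writing the whole frame buffer to memory once per frame. A short Python sketch that reproduces the Power column under that assumption (the helper name `writeback_power_mw` is hypothetical):

```python
ENERGY_PER_BYTE_J = 0.15e-9  # the talk's rule of thumb (rough estimate)

SCREENS = [
    # (name, width, height, bytes per pixel, frames per second)
    ("VGA", 640, 480, 2, 30),
    ("WXGA", 1280, 768, 4, 30),
    ("FHD", 1920, 1080, 4, 60),
    ("MacBook Pro", 2880, 1800, 4, 60),
    ("4K", 3840, 2160, 4, 60),
]


def writeback_power_mw(width, height, bytes_per_pixel, fps,
                       joules_per_byte=ENERGY_PER_BYTE_J):
    """Power (mW) to write the full frame buffer to memory once per frame."""
    bytes_per_second = width * height * bytes_per_pixel * fps
    return bytes_per_second * joules_per_byte * 1e3


if __name__ == "__main__":
    for name, w, h, bpp, fps in SCREENS:
        print(f"{name:12s} {writeback_power_mw(w, h, bpp, fps):6.0f} mW")
```

Writeback power scales linearly with pixel count and frame rate, so moving from FHD to 4K at the same frame rate multiplies it by four.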

What can we do?
Observation: not everything in the scene changes every frame
- This is true even in high-end FPS games: skybox, HUD, static scene elements
If a tile hasn't changed, we don't have to write it.

Transaction elimination
[Figure: grids of per-tile signatures (sig) for frame N and frame N+1]
- Maintain a signature for each tile of frame N
- Compare them with the signatures calculated for frame N+1
- Where the signatures match, don't write the tile
Surprisingly effective, even on FPS games and video
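The signature check can be sketched in a few lines. This Python version assumes CRC32 (via `zlib.crc32`) as a stand-in signature, a single frame buffer whose contents persist between frames (double buffering is ignored), and a caller-supplied `write_tile` callback; the class and method names are hypothetical, and the talk does not describe the actual hardware signature function or bookkeeping. Any signature can collide, so a changed tile could in principle be skipped; a wider signature lowers that risk.

```python
import zlib


def tile_signature(tile_bytes: bytes) -> int:
    # Assumption: CRC32 stands in for whatever signature the hardware computes.
    return zlib.crc32(tile_bytes)


class TransactionEliminator:
    """Skip tile writes whose contents match what is already in memory."""

    def __init__(self):
        self.signatures = {}  # tile index -> signature of the tile now in memory

    def finish_tile(self, index, tile_bytes, write_tile) -> bool:
        """Write the tile unless its signature is unchanged; return True if written."""
        sig = tile_signature(tile_bytes)
        if self.signatures.get(index) == sig:
            return False  # signatures match: the copy in memory is still valid
        write_tile(index, tile_bytes)
        self.signatures[index] = sig
        return True


if __name__ == "__main__":
    memory = {}
    te = TransactionEliminator()
    frames = [[b"sky", b"bird"], [b"sky", b"bird moved"]]
    for n, tiles in enumerate(frames):
        written = sum(te.finish_tile(i, t, memory.__setitem__)
                      for i, t in enumerate(tiles))
        print(f"frame {n}: wrote {written} of {len(tiles)} tiles")
```

In the second frame only the tile containing the moving bird is written; the unchanged sky tile costs a signature computation but no memory transaction.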

But how well does it work really?
[Video]
- Saves ~75% of write bandwidth
- Saves ~50% of total bandwidth
Angry Birds characters and images are copyright 2009-2012, Rovio Entertainment Ltd. Used by permission.

Saving the Planet?
(Photo courtesy of NASA.)

Saving the Planet?
Angry Birds is played ~200 million minutes every day
http://thenextweb.com/apps/2011/02/16/angry-birds-gamers-spend-200-million-minutes-playing-each-day/
If all of these minutes were played with transaction elimination, 3.1 MB of bandwidth would be saved per frame (on average):
3.1 MB/frame × 60 frames/s × 60 s/min × 200M min/day = 2.2 exabytes/day
Apply the rule of thumb:
2.2 exabytes/day × 0.15 nJ/B ≈ 335 MJ/day, or about 34 megawatt-hours per year
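The slide's estimate chains several unit conversions. A Python sketch that recomputes it from the quoted inputs, assuming decimal units (1 MB = 10^6 bytes, 1 EB = 10^18 bytes) and 1 MWh = 3.6 × 10^9 J:

```python
SAVED_BYTES_PER_FRAME = 3.1e6    # average saving per frame quoted on the slide
FRAMES_PER_SECOND = 60
SECONDS_PER_MINUTE = 60
MINUTES_PLAYED_PER_DAY = 200e6   # ~200 million minutes per day
ENERGY_PER_BYTE_J = 0.15e-9      # the talk's rule of thumb
JOULES_PER_MWH = 3.6e9           # 1 MWh = 1e6 W * 3600 s

bytes_per_day = (SAVED_BYTES_PER_FRAME * FRAMES_PER_SECOND
                 * SECONDS_PER_MINUTE * MINUTES_PLAYED_PER_DAY)
joules_per_day = bytes_per_day * ENERGY_PER_BYTE_J
mwh_per_year = joules_per_day * 365 / JOULES_PER_MWH

if __name__ == "__main__":
    print(f"{bytes_per_day / 1e18:.2f} EB/day")   # 2.23 EB/day
    print(f"{joules_per_day / 1e6:.0f} MJ/day")   # 335 MJ/day
    print(f"{mwh_per_year:.0f} MWh/year")         # 34 MWh/year
```

The 335 MJ/day figure comes from the unrounded 2.23 EB/day; multiplying the rounded 2.2 EB/day by 0.15 nJ/B gives 330 MJ/day.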

What's Next?
[Figure: the tile-based pipeline, with vertex buffers, polygon lists, and textures in memory]
Texture fetch bandwidth is the biggest remaining problem
The answer is obvious: compression

Texture Compression: Problems
- No universally supported formats
  - Desktop: S3TC, BPTC, RGTC
  - Mobile: ETC1, ATITC, PVRTC, S3TC
- Limited choice of bit rates and formats: RGB{A}, 4 bpp / 8 bpp
- Quality not that great
News flash!
- ETC2 / EAC are standard in OpenGL ES 3.0 and OpenGL 4.3
  - 4 bpp R/RGB, 8 bpp RG/RGBA
  - Better quality
- And here comes ASTC, available as a Khronos extension to OpenGL ES 3.0

Adaptive Scalable Texture Compression
Goals: maximum flexibility, quality, functionality
- Scalable bit rate: from 8 bpp down to <1 bpp in fine steps
- Orthogonality: any number of components at any bit rate
- Adaptive: the number of components and the pixel format are specified locally (per block)
- Both 2D and 3D textures
- Both LDR and HDR pixel formats
- Significant quality improvement

Color Formats: Codecs Today
[Chart: color formats (L, LA, RGB, RGBA, RGB+A, X+Y, XY+Z) against bit rate (1 to 8 bits/pixel), showing which of the major codecs (ETC, PVRTC, BC1 to BC7, including BC6 for HDR) cover each combination]

Color Formats: Codecs Today, Low Dynamic Range
[Chart: LDR coverage of color formats (L, LA, RGB, RGBA, RGB+A, X+Y, XY+Z) against bit rate (1 to 8 bits/pixel) by today's codecs]

Color Formats: ASTC, Low Dynamic Range
[Chart: ASTC LDR coverage of color formats (L, LA, RGB, RGBA, RGB+A, X+Y, XY+Z) against bit rate (1 to 8 bits/pixel)]

ASTC Bit Rates
- Standard block-based paradigm
- Generalized to 3D
- Unusually large number of block sizes
2D bit rates            3D bit rates
4x4     8.00 bpp        3x3x3   4.74 bpp
5x4     6.40 bpp        4x3x3   3.56 bpp
5x5     5.12 bpp        4x4x3   2.67 bpp
6x5     4.27 bpp        4x4x4   2.00 bpp
6x6     3.56 bpp        5x4x4   1.60 bpp
8x5     3.20 bpp        5x5x4   1.28 bpp
8x6     2.67 bpp        5x5x5   1.02 bpp
10x5    2.56 bpp        6x5x5   0.85 bpp
10x6    2.13 bpp        6x6x5   0.71 bpp
8x8     2.00 bpp        6x6x6   0.59 bpp
10x8    1.60 bpp
10x10   1.28 bpp
12x10   1.07 bpp
12x12   0.89 bpp
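Every ASTC block occupies 128 bits regardless of its footprint, so each entry in the table is simply 128 divided by the number of texels in the block. A Python sketch that reproduces the table and estimates the size of a compressed 2D image, assuming partial edge blocks are padded to full blocks (the function names are hypothetical):

```python
from math import ceil, prod

ASTC_BLOCK_BITS = 128  # every ASTC block, 2D or 3D, is 128 bits


def astc_bpp(*block_dims: int) -> float:
    """Bits per texel for an ASTC block footprint such as (6, 6) or (4, 4, 4)."""
    return ASTC_BLOCK_BITS / prod(block_dims)


def astc_2d_size_bytes(width: int, height: int, block_w: int, block_h: int) -> int:
    """Compressed size of one 2D image; partial edge blocks still cost a full block."""
    blocks = ceil(width / block_w) * ceil(height / block_h)
    return blocks * ASTC_BLOCK_BITS // 8


if __name__ == "__main__":
    for dims in [(4, 4), (6, 6), (8, 8), (12, 12), (3, 3, 3), (6, 6, 6)]:
        label = "x".join(map(str, dims))
        print(f"{label:6s} {astc_bpp(*dims):.2f} bpp")
    print(astc_2d_size_bytes(1024, 1024, 6, 6), "bytes for 1024x1024 at 6x6")
```

For example, a 1024x1024 texture takes 1,048,576 bytes at 4x4 (8 bpp) and 467,856 bytes at 6x6 (about 3.56 bpp), compared with 4,194,304 bytes for uncompressed 8-bit RGBA.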

PSNR Quality Comparison: RGB LDR at 2 bpp
Kodak test set: 24 natural RGB images
[Chart: PSNR in dB (axis 24 to 40) for each of the 24 images, ASTC 8x8 vs. PVRTC 2bpp]
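The comparisons here are scored by PSNR, which for 8-bit data is 10 * log10(255^2 / MSE) in dB. A minimal Python sketch, assuming the error is averaged over all channel values of the image; the talk does not say exactly how its PSNR figures were computed, and `psnr_db` is a hypothetical name:

```python
import math


def psnr_db(original, decoded, max_value=255):
    """Peak signal-to-noise ratio in dB between two equal-length sequences of
    channel values (e.g. flattened RGB bytes of an image and its decoded copy)."""
    if not original or len(original) != len(decoded):
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(max_value ** 2 / mse)


if __name__ == "__main__":
    original = [52, 55, 61, 66, 70, 61, 64, 73]
    decoded = [54, 55, 60, 66, 71, 60, 64, 75]
    print(f"{psnr_db(original, decoded):.1f} dB")
```

Because PSNR is logarithmic, a gain of about 6 dB corresponds to halving the root-mean-square error.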

Image Comparison
[Images: original; ASTC 6x6, PSNR 33.5 dB; S3TC 4 bpp, PSNR 30.7 dB]
ASTC at 3.56 bpp vs. S3TC at 4 bpp:
- 2.8 dB PSNR advantage
- 11% lower bit rate

Summary
- Mobile GPU design is all about power
- Memory bandwidth is a key contributor to power
- Standard and not-so-standard tricks for controlling bandwidth:
  - Tile-based rendering
  - Transaction elimination
  - Advances in texture compression
- Use Mali GPU to save the planet

Thanks!