Mali-400 MP: A Scalable GPU for Mobile Devices Tom Olson

Similar documents
Case 1:17-cv SLR Document 1-3 Filed 01/23/17 Page 1 of 33 PageID #: 60 EXHIBIT C

Falanx Microsystems. Company Overview

ARM Mali -400 MP. The Scalable Multicore Graphics Processing Unit. Under embargo until June 2 nd, 2008

PowerVR Hardware. Architecture Overview for Developers

PowerVR Series5. Architecture Guide for Developers

The Bifrost GPU architecture and the ARM Mali-G71 GPU

Overview. Technology Details. D/AVE NX Preliminary Product Brief

Developing the Bifrost GPU architecture for mainstream graphics

Graphics, Mobile Computing, APIs and Life

Architectures. Michael Doggett Department of Computer Science Lund University 2009 Tomas Akenine-Möller and Michael Doggett 1

ARM. Mali GPU. OpenGL ES Application Optimization Guide. Version: 3.0. Copyright 2011, 2013 ARM. All rights reserved. ARM DUI 0555C (ID102813)

Profiling and Debugging Games on Mobile Platforms

! Readings! ! Room-level, on-chip! vs.!

POWERVR MBX & SGX OpenVG Support and Resources

TEAPOT: A Toolset for Evaluating Performance, Power and Image Quality on Mobile Graphics Systems

Rasterization Overview

CS427 Multicore Architecture and Parallel Computing

ARM. Mali GPU. OpenGL ES Application Optimization Guide. Version: 2.0. Copyright 2011, 2013 ARM. All rights reserved. ARM DUI 0555B (ID051413)

Mobile Graphics Ecosystem. Tom Olson OpenGL ES working group chair

Case 1:17-cv SLR Document 1-4 Filed 01/23/17 Page 1 of 30 PageID #: 75 EXHIBIT D

Working with Metal Overview

2D/3D Graphics Accelerator for Mobile Multimedia Applications. Ramchan Woo, Sohn, Seong-Jun Song, Young-Don

Bringing AAA graphics to mobile platforms. Niklas Smedberg Senior Engine Programmer, Epic Games

GeForce4. John Montrym Henry Moreton

Graphics Processing Unit Architecture (GPU Arch)

ARM Multimedia IP: working together to drive down system power and bandwidth

Copyright Khronos Group, Page Graphic Remedy. All Rights Reserved

Multimedia in Mobile Phones. Architectures and Trends Lund

Rendering Objects. Need to transform all geometry then

Saving the Planet Designing Low-Power, Low-Bandwidth GPUs

Vertex Shader Design I

Lecture 25: Board Notes: Threads and GPUs

A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications

Bifrost - The GPU architecture for next five billion

3D Graphics in Future Mobile Devices. Steve Steele, ARM

Hardware- Software Co-design at Arm GPUs

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

PowerVR Performance Recommendations The Golden Rules. October 2015

Optimizing Games for ATI s IMAGEON Aaftab Munshi. 3D Architect ATI Research

EECS 487: Interactive Computer Graphics

Course Recap + 3D Graphics on Mobile GPUs

Real-Time Rendering (Echtzeitgraphik) Michael Wimmer

LPGPU Workshop on Power-Efficient GPU and Many-core Computing (PEGPUM 2014)

Spring 2010 Prof. Hyesoon Kim. AMD presentations from Richard Huddy and Michael Doggett

Beyond Programmable Shading Keeping Many Cores Busy: Scheduling the Graphics Pipeline

Parallelizing Graphics Pipeline Execution (+ Basics of Characterizing a Rendering Workload)

Dave Shreiner, ARM March 2009

Windowing System on a 3D Pipeline. February 2005

Overview. Think Silicon is a privately held company founded in 2007 by the core team of Atmel MMC IC group

Mobile Performance Tools and GPU Performance Tuning. Lars M. Bishop, NVIDIA Handheld DevTech Jason Allen, NVIDIA Handheld DevTools

Optimizing and Profiling Unity Games for Mobile Platforms. Angelo Theodorou Senior Software Engineer, MPG Gamelab 2014, 25 th -27 th June

Graphics Architectures and OpenCL. Michael Doggett Department of Computer Science Lund university

Enabling immersive gaming experiences Intro to Ray Tracing

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

Portland State University ECE 588/688. Graphics Processors

A Reconfigurable Architecture for Load-Balanced Rendering

Ciril Bohak. - INTRODUCTION TO WEBGL

Spring 2009 Prof. Hyesoon Kim

Profiling and Debugging OpenCL Applications with ARM Development Tools. October 2014

Optimizing DirectX Graphics. Richard Huddy European Developer Relations Manager

Hardware-driven visibility culling

Mali Developer Resources. Kevin Ho ARM Taiwan FAE

A Low Cost Tile-based 3D Graphics Full Pipeline with Real-time Performance Monitoring Support for OpenGL ES in Consumer Electronics

Coming to a Pixel Near You: Mobile 3D Graphics on the GoForce WMP. Chris Wynn NVIDIA Corporation

Graphics Hardware. Instructor Stephen J. Guy

Real-Time Reyes: Programmable Pipelines and Research Challenges. Anjul Patney University of California, Davis

A Bandwidth Effective Rendering Scheme for 3D Texture-based Volume Visualization on GPU

Scheduling the Graphics Pipeline on a GPU

CS230 : Computer Graphics Lecture 4. Tamar Shinar Computer Science & Engineering UC Riverside

General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

CS452/552; EE465/505. Clipping & Scan Conversion

Baback Elmieh, Software Lead James Ritts, Profiler Lead Qualcomm Incorporated Advanced Content Group

Threading Hardware in G80

Introduction to Modern GPU Hardware

A SIMD-efficient 14 Instruction Shader Program for High-Throughput Microtriangle Rasterization

PowerVR Performance Recommendations. The Golden Rules

PowerVR Performance Recommendations. The Golden Rules

3-D Accelerator on Chip

Parallel Computing: Parallel Architectures Jin, Hai

Vulkan Multipass mobile deferred done right

Scanline Rendering 2 1/42

Mali-G72 Enabling tomorrow s technology today

E.Order of Operations

Antonio R. Miele Marco D. Santambrogio

Squeezing Performance out of your Game with ATI Developer Performance Tools and Optimization Techniques

Graphics Hardware, Graphics APIs, and Computation on GPUs. Mark Segal

Whiz-Bang Graphics and Media Performance for Java Platform, Micro Edition (JavaME)

Effective System Design with ARM System IP

Graphics Performance Optimisation. John Spitzer Director of European Developer Technology

Introduction to Multicore architecture. Tao Zhang Oct. 21, 2010

Optimizing for DirectX Graphics. Richard Huddy European Developer Relations Manager

CS450/550. Pipeline Architecture. Adapted From: Angel and Shreiner: Interactive Computer Graphics6E Addison-Wesley 2012

PowerVR Graphics - Latest Developments and Future Plans

The NVIDIA GeForce 8800 GPU

Exploring System Coherency and Maximizing Performance of Mobile Memory Systems

Real - Time Rendering. Graphics pipeline. Michal Červeňanský Juraj Starinský

Lecture 2. Shaders, GLSL and GPGPU

Jeremy W. Sheaffer 1 David P. Luebke 2 Kevin Skadron 1. University of Virginia Computer Science 2. NVIDIA Research

GPU Architecture. Michael Doggett Department of Computer Science Lund university

Transcription:

Mali-400 MP: A Scalable GPU for Mobile Devices Tom Olson Director, Graphics Research, ARM

Outline ARM and Mobile Graphics Design Constraints for Mobile GPUs Mali Architecture Overview Multicore Scaling in Mali-400 MP Results 2

About ARM World s leading supplier of semiconductor IP Processor Architectures and Implementations Related IP: buses, caches, debug & trace, physical IP Software tools and infrastructure Business Model License fees Per-chip royalties Graphics at ARM Acquired Falanx in 2006 ARM Mali is now the world s most widely licensed GPU family 3

Challenges for Mobile GPUs Size Power Memory Bandwidth 4

More Challenges Graphics is going into anything that has a screen Mobile Navigation Set Top Box/DTV Automotive Video telephony Cameras Printers Huge range of form factors, screen sizes, power budgets, and performance requirements In some applications, a huge difference between peak and average performance requirements 5

Solution: Scalability Address a wide variety of performance points and applications with a single IP and a single software stack. Need static scalability to adapt to different peak requirements in different platforms / markets Need dynamic scalability to reduce power when peak performance isn t needed 6

Options for Scalability Fine-grained: Multiple pipes, wide SIMD, etc Proven approach, efficient and effective But, adding pipes / lanes is invasive Hard for IP licensees to do on their own And, hard to partition to provide dynamic scalability Coarse-grained: Multicore Easy for licensees to select desired performance Putting cores on separate power islands allows dynamic scaling 7

Mali 400-MP Top Level Architecture Asynch APB Mali-400 MP Top-Level Geometry Processor Pixel Processor #1 Pixel Processor #2 Pixel Processor #3 Pixel Processor #4 CLKs RESETs MaliMMUs IRQs IDLEs MaliL2 AXI Scalable pixel performance 1-4 rasterizer cores 32K-128K L2 cache 8

Mali Tile-Based Rendering 1 1 0 0,1 0,1 0 0 0,1 0,1 0 0 0,1 0,1 0 Reduces off-chip framebuffer bandwidth Rasterize into on-chip 16x16 tile buffer Z, stencil, MSAA samples never go off-chip Tradeoff against increased geometry bandwidth Details: see Real-Time Rendering, 3 rd ed. 9

Mali-400 MP Geometry Processor System bus interface Vertex Loader Vertex Shader Vertex Storer Polygon List Builder Unit Control Registers / Performance Counters Vertex Shader Single-threaded, deeply pipelined 10

Mali-400 MP Pixel Processor Fragment Shader 128-thread barrel processor Fully general control flow VLIW ISA, tuned for graphics One texture sample per clock Color / Depth / Stencil Buffers Blending System bus interface Fragment Shader Key Features 16x16 on-chip tile buffer Renders one pixel per clock: 275M pix/sec @ 275 MHz No penalty for 4x MSAA No penalty for blending 4x or 16x MSAA resolve on output Control Registers / Performance Counters Polygon List Reader OpenGL ES 2.0 states handled without state dependent shaders Triangle setup Rasterization Thread control 11

Going Multi-Core All cores work in parallel on separate tasks Each core processes one tile at a time until completion no communication between cores Tiles assigned statically to cores in a swizzled order Asynch APB Mali-400 MP Top-Level Vertex Processor Fragment Processor #1 Fragment Processor #2 Fragment Processor #3 Fragment Processor #4 Tile processing order maximizes L2 hit rate for polygon descriptors, textures CLKs RESETs IRQs IDLEs MaliMMU MaliL2 AXI 12

Does it work? Test content Industry standard benchmarks for GL ES 1.1, 2.0 Thanks Rightware! 13

Normalized Frame Rate* Normalized Frame Rate* Relative Performance 4.00 3.50 3.00 2.50 2.00 1.50 1.00 0.50 0.00 OpenGL ES 1.1 OpenGL ES 2.0 4.00 3.50 3.00 2.50 2.00 1.50 1.00 0.50 0.00 MC1 MC2 MC3 MC4 MC1 MC2 MC3 MC4 4.50 Single frames in RTL simulation Artificial memory model (fixed static latency) 1-4 cores, 32KB L2 cache 1920x1080, 32bpp, 4xMSAA *Single core frame rate = 1.0 14

Normalized Bandwidth* Impact on Bandwidth OpenGL ES 1.1 1.40 1.20 1.00 0.80 0.60 0.40 0.20 0.00 MC1 MC2 MC3 MC4 Memory bandwidth per frame is stable L2 cache is working to prevent redundant memory accesses *Single core bandwidth per frame = 1.0 15

Summary Scalability is an essential feature for mobile GPUs Static and Dynamic Multi-core architectures are a good fit to the need Easy to configure for different performance requirements Easy to power-gate for dynamic scaling Mali-400 MP shows that the concept works Pixel rate scalable from 275 Mpix/s to 1.1 Gpix/s Linear speedup demonstrated on common mobile benchmarks Single driver for all configurations transparent to applications Power and bandwidth efficient 16

Thanks to The HPG 2010 Hot3D organizers Remi Pedersen, Mashhuda Glencross, and the Mali team Our Silicon Partners 17