General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)


ME 290-R: General Purpose Computation (CAD/CAM/CAE) on the GPU (a.k.a. Topics in Manufacturing)
Sara McMains
Spring 2009

Performance: Bottlenecks

Sources of bottlenecks

- CPU
- Transfer
- Processing (vertex transform)
- Rasterizer
- Fragment processing
- Framebuffer Ops
- Texture Ops

Sources of bottlenecks: CPU

- Many applications are CPU-bound
  - preparing data
  - file I/O
- Inefficient rendering of geometry
  - triangle strips: GOOD
  - many small batches: BAD
- Too many state changes cause the pipeline to flush

Sources of bottlenecks: Transfer

- vertices
  - a size that is a multiple of 32 bytes is best
  - lower precision, compress, pad
- static geometry
  - display lists
  - write-only vertex buffer (D3D)
  - ARB_vertex_buffer_object (OGL)
- dynamic geometry
  - dynamic vertex buffer (D3D)
  - ARB_vertex_buffer_object (OGL)
- access vertices sequentially (cache friendly)
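The "multiple of 32 bytes" and "pad" advice can be illustrated with a minimal sketch. The vertex layout used here (position, normal, one texture coordinate pair, all 32-bit floats) and the helper names `padded_size` and `pack_vertex` are assumptions made for illustration, not something taken from the slides.

```python
import struct

# Hypothetical vertex layout (assumption): 3 floats position,
# 3 floats normal, 2 floats texture coordinates, little-endian.
VERTEX_FORMAT = "<3f3f2f"


def padded_size(size_bytes, multiple=32):
    """Round size_bytes up to the next multiple of `multiple`."""
    return -(-size_bytes // multiple) * multiple


def pack_vertex(position, normal, uv):
    """Pack one vertex and zero-pad it to a multiple of 32 bytes."""
    raw = struct.pack(VERTEX_FORMAT, *position, *normal, *uv)
    return raw.ljust(padded_size(len(raw)), b"\x00")


vertex = pack_vertex((0.0, 1.0, 2.0), (0.0, 0.0, 1.0), (0.5, 0.5))
print(len(vertex))       # this layout is already 32 bytes, so no padding is added
print(padded_size(36))   # a 36-byte vertex would be padded up to 64 bytes
```

Padding a 36-byte vertex to 64 bytes nearly doubles the data transferred, so under these assumptions lowering precision to fit 32 bytes may be the better trade.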

Sources of bottlenecks: Transform

- Increase speed
  - Use triangle strips/fans: reduces vertex count
  - Use built-in operators rather than explicitly computing in the vertex program (VP)
  - Use swizzles and masks: cheaper than MOVs
- Move bottleneck
  - shift computation from VP to the CPU
  - increase resolution for free!
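The vertex-count saving from triangle strips can be made concrete with a small sketch. The function names are hypothetical. The winding flip on odd triangles follows the usual OpenGL strip convention.

```python
def vertices_as_list(num_triangles):
    """Independent triangles: three vertices submitted per triangle."""
    return 3 * num_triangles


def vertices_as_strip(num_triangles):
    """One strip: the first triangle costs three vertices, each later one costs one."""
    return num_triangles + 2 if num_triangles > 0 else 0


def strip_to_triangles(strip):
    """Expand strip indices into triangles, flipping winding on odd triangles."""
    triangles = []
    for i in range(len(strip) - 2):
        a, b, c = strip[i], strip[i + 1], strip[i + 2]
        triangles.append((a, b, c) if i % 2 == 0 else (b, a, c))
    return triangles


print(vertices_as_list(100), vertices_as_strip(100))
print(strip_to_triangles([0, 1, 2, 3, 4]))
```

For 100 triangles a single strip submits 102 vertices instead of 300, which is why strips help when the vertex stage is the bottleneck.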

Sources of bottlenecks: Rasterize

- rarely the bottleneck
  - unless there are lots of z-culled triangles (no-ops downstream)
- speed affected by
  - large triangles
  - lots of triangles

Sources of bottlenecks: Fragments

- Increase speed
  - built-in functions and swizzle operators
  - branching is expensive
  - mix math and texture access
    - 8 math ops in parallel per texture access
- Move bottleneck
  - shift computation to VP
    - GPGPU is often fill-rate limited, not vertex-pipeline limited
  - shift linear interpolation to the rasterizer

Sources of bottlenecks: Fragments

- Decrease fragments
  - draw in rough front-to-back order
  - z-only 1st pass may help
  - masking color ops
    - fast
    - masking only some color channels is slow
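Why rough front-to-back order reduces fragment work can be sketched with a one-pixel depth-buffer model. This is a simplification that assumes the depth test rejects fragments before they are shaded (early z) and uses a "less than" comparison. The function name and the depth values are made up for illustration.

```python
def shaded_fragments(depths_in_draw_order):
    """Count fragments at one pixel that pass a 'less than' depth test."""
    nearest = float("inf")   # cleared depth buffer
    shaded = 0
    for z in depths_in_draw_order:
        if z < nearest:      # fragment is visible so far: shade it and store its depth
            nearest = z
            shaded += 1
    return shaded


layers = [0.1, 0.4, 0.7, 0.9]              # four surfaces covering the same pixel
print(shaded_fragments(layers))            # front-to-back
print(shaded_fragments(layers[::-1]))      # back-to-front
```

In this model front-to-back order shades one fragment, while back-to-front shades all four. A z-only first pass gets the same benefit without sorting, since the second pass only shades fragments matching the stored nearest depth.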

Sources of bottlenecks: Texture

- Expensive
  - cache-unfriendly access
  - dependent texture reads
  - high precision
  - large
  - non-power-of-two textures
  - ...

Sources of bottlenecks: Framebuffer

- 16 bits is cheaper than 32
- floating point is expensive
- blending ops can become the bottleneck
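A rough back-of-the-envelope model shows why color depth and blending matter at this stage. It is only a sketch: it counts color-buffer traffic alone, ignores depth traffic, caches and any hardware compression, and assumes blending costs one read plus one write per pixel. The function name and the 1024x768 resolution are assumptions for illustration.

```python
def color_traffic_bytes(width, height, bits_per_pixel, blending=False, overdraw=1):
    """Estimate color-buffer bytes moved per frame under the stated assumptions."""
    bytes_per_pixel = bits_per_pixel // 8
    accesses = 2 if blending else 1   # blending reads the destination, then writes it
    return width * height * bytes_per_pixel * accesses * overdraw


print(color_traffic_bytes(1024, 768, 16))
print(color_traffic_bytes(1024, 768, 32))
print(color_traffic_bytes(1024, 768, 32, blending=True))
```

Under this model a 32-bit target moves twice the bytes of a 16-bit one, and enabling blending doubles the traffic again.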

Find the bottleneck!

- Increase framebuffer resolution: tells you if the rasterizer onward is the bottleneck
- Change fragment instructions: checks fragment processor load
- Reduce texture precision: checks texture load
- Change vertex instructions: checks vertex processor load
- Change vertex sizes: checks transfer
- Balanced pipeline is the goal
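The procedure above amounts to changing one stage's workload at a time and seeing whether the frame time moves. A minimal sketch of that bookkeeping follows. The function name, the 5% tolerance and all timings are invented for illustration and are not measurements.

```python
def sensitive_stages(baseline_ms, experiments, tolerance=0.05):
    """Return stages whose experiment changed frame time by more than `tolerance`."""
    found = []
    for stage, frame_ms in experiments.items():
        relative_change = abs(frame_ms - baseline_ms) / baseline_ms
        if relative_change > tolerance:
            found.append(stage)
    return found


# Hypothetical frame times (ms) after changing one thing per run.
experiments = {
    "rasterizer onward (framebuffer resolution)": 16.1,
    "fragment processor (fragment instructions)": 16.0,
    "texture (texture precision)": 15.9,
    "vertex processor (vertex instructions)": 22.0,
    "transfer (vertex size)": 16.2,
}
print(sensitive_stages(16.0, experiments))
```

With these made-up numbers only the vertex-instruction experiment moves the frame time, which would point at the vertex processor. If no stage stands out, the pipeline may already be balanced or the application may be CPU-bound.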


Tools: GPUBench

- toolkit of programs that probe the GPU
- Measure download speeds: transfer from CPU to texture memory
- Measure readback speeds for different kinds of RenderTextures
- Evaluate costs of various instructions
- Multiple parameters that show variation with size of quad, form of rendering, cache effects, etc.

Evaluating Texture Access

- To check cache effects: repeatedly access the same texture location
- To check sequential behaviour: access texture elements in a row

Tools: NV ShaderPerf (2.0)

- Takes a shader file in the appropriate format
- Produces scheduling information (#clock cycles, #instructions) for running the shader on the GPU
- Can take different architectures as target, yielding different bandwidths

    float2 pos1 = float2(passno, texcoords0.y);
    float2 pos2 = float2(texcoords0.x, passno);
    return (texRECT(sum, texcoords0) + texRECT(tex1, pos1) * texRECT(tex2, pos2)).r;

- NV31: 66 MP/s
- NV40: 1.6 GP/s

Tools: NV PerfHUD (6.0)

- Only works with DirectX
- Provides much more detailed shader profiling
- Enable it by choosing a specific target in Cg code

Tools: Graphic Remedy gDEBugger

- Breakpoints, stepping
- Watch windows
- Performance analysis
- OpenGL

gDEBugger [figure]

Acknowledgements

- John Spitzer
- Koji Ashida
- Suresh Venkatasubramanian
- Tim Purcell