H.264 Decoding. University of Central Florida


Transcription:


Optimization Example: H.264

Decode pipeline: inverse transform, interprediction, intraprediction, in-loop deblocking, render.
Interprediction: filters data from previously decoded frames.
Deblocking: filters out block edges.
Today: loosely based on a DX9 hardware implementation.

Problem Domain

Max resolution 1080p = 1920x1088 RGBA pixels (3 Bpp)
Luma (Y) = 1920x1088 pels (1 Bpp)
Chroma (U and V) = 960x544 pels each
[Figure: grid of luma (Y) samples]
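The plane sizes above follow directly from 4:2:0 chroma subsampling, where each chroma plane is half the luma resolution in both dimensions; a quick check:

```python
# Luma/chroma plane sizes for a 1080p (1920x1088 coded) frame.
W, H = 1920, 1088

luma_pels = W * H                   # Y plane at full resolution, 1 Bpp
chroma_pels = (W // 2) * (H // 2)   # each of U and V at 960x544

print(luma_pels)    # 2088960
print(chroma_pels)  # 522240
```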

Interprediction

Decode each pel by interpolating a subregion of previously decoded frames.
Example case: 6x6 fetches per pel, kernel not aligned to a x4 boundary.
[Figure: target block and its source fetch region]


Do more work per pixel

Problem: the shader has a lot of texture fetches.
Share already-fetched data by processing four pels in one pixel (thread):
Consolidate texture fetches and consolidate ALU instructions
Consolidate writes (four bytes per pixel)
Increase number of threads
Saved 96 fetches!
[Figure: target block and its source fetch region]
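A little arithmetic behind the "saved 96 fetches" figure (a sketch; the exact shared footprint depends on how the four pels are arranged and on fetch alignment, which the slides don't spell out):

```python
# Naively, every pel issues its own 6x6 neighborhood of texture fetches.
fetches_per_pel = 6 * 6    # 36, from the example case above
pels_per_pixel = 4

naive = fetches_per_pel * pels_per_pixel   # 144 fetches per pixel
saved = 96                                 # figure quoted on the slide
issued = naive - saved                     # 48 fetches actually issued
print(naive, issued)  # 144 48
```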

Optimizing the algorithm

Theoretical model for 4 pels per pixel:
115 ALU, 56 TEX, 66 bytes in/out
480x1088 pixels processed
From the previous equations:
ALU time = 2 ms
TEX time = 3 ms
Mem time = 0.67 ms
Theoretical time = 3 ms
Using the theoretical model we can find the problems and fix them.

Do even more work per pixel!

Use multiple render targets: 16 pels per pixel.
Saved 522 fetches!
[Figure: target block and its source fetch region]

What happens to the model?

Theoretical model for 16 pels per pixel:
304 ALU, 85 TEX, 107 bytes in/out
480x272 pixels processed
From the previous equations:
ALU time = 1.32 ms
TEX time = 1.12 ms
Mem time = 0.29 ms
Theoretical time = 1.32 ms
The shader goes from TEX bound to ALU bound.
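The two models above share a simple structure: each hardware unit's time is (work per pixel x pixels) / throughput, and the slowest unit sets the theoretical time. The throughput rates below are assumptions chosen to roughly reproduce the slide's figures; the actual "previous equations" are not given in this excerpt:

```python
# Sketch of the theoretical performance model.
# The three rates are illustrative assumptions, not real hardware specs.
ALU_RATE = 30e9    # ALU instructions per second (assumed)
TEX_RATE = 9.7e9   # texture fetches per second (assumed)
MEM_RATE = 51e9    # bytes per second (assumed)

def theoretical_ms(alu, tex, bytes_io, pixels):
    t = {
        "alu": alu * pixels / ALU_RATE * 1e3,
        "tex": tex * pixels / TEX_RATE * 1e3,
        "mem": bytes_io * pixels / MEM_RATE * 1e3,
    }
    # The slowest unit bounds the shader.
    t["total"] = max(t["alu"], t["tex"], t["mem"])
    return t

four = theoretical_ms(115, 56, 66, 480 * 1088)     # TEX bound, ~3 ms
sixteen = theoretical_ms(304, 85, 107, 480 * 272)  # ALU bound, ~1.3 ms
print(round(four["total"], 2), round(sixteen["total"], 2))
```

Under these assumed rates the 4-pel shader is limited by texture fetches while the 16-pel shader is limited by ALU work, matching the bound shift described on the slide.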

Oops...

Render targets split the image.
[Figure: output pels distributed across targets 0-3 in a repeating 0 1 2 3 pattern]
Could do a copy shader at the end, or...

Interleave Render Targets

Hardware supports interleaved render targets: four surfaces appear as one.
[Figure: targets 0-3 interleaved in a repeating 0 1 2 3 pattern]
Scatters are slow, MRTs are fast.

Back to the real world

Compare actual performance with the theoretical model; it is an easy way to determine the next step.
If you're close to the theoretical time but have missed your performance target, you probably have to redesign.
If you're far off, it's time to look at the hardware.

Interprediction Filtering

Each block of 4x4 pels is processed differently:
Up to 6 filter cases
Up to 2 prediction directions
Up to 2 block types (frame/field)
Up to 16 input textures (max 4 for 1080p)
With up to 384 possible cases, that's a pretty big shader!
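The 384 figure is just the product of the per-block variations listed above:

```python
filter_cases = 6
prediction_directions = 2
block_types = 2          # frame / field
input_textures = 16

total_cases = filter_cases * prediction_directions * block_types * input_textures
print(total_cases)  # 384
```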

Branching isn't perfect

Shader length
Branch overhead
Divergent paths between pixel blocks: one pixel's branch can stall the others
Branch granularity

Our old friend the Z-Buffer

Use a fast Z-pass to initialize the buffer, then render each case separately.
Fast because of top-of-the-pipe pixel kills:
Very little branching overhead
Potentially shorter shaders
More threads

Deblocking

Filters block edges. Performance can suffer due to render dependencies: the input to the next pixel is the output from the previous pixel.
Frame divided into macroblocks (MBs) of 16x16 pels
4 vertical and 4 horizontal passes per MB
For a 1080p frame, up to 748 passes
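For scale, the macroblock count behind those numbers (a quick check; how the per-MB passes are batched into the quoted 748 frame-level passes is not spelled out in this excerpt):

```python
WIDTH, HEIGHT, MB = 1920, 1088, 16   # 1080p coded frame, 16x16 macroblocks

mb_cols = WIDTH // MB     # 120
mb_rows = HEIGHT // MB    # 68
print(mb_cols * mb_rows)  # 8160 macroblocks per frame
```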

Vertical Deblocking

[Figure: the previous macroblock and the current macroblock's four vertical edges (Edge 0 through Edge 3); the filter reads 8 pels across each edge]

Do more work in a pixel revisited

Save on the number of passes: if I have the data, why not use it?
Filter two edges (Edge 0 and Edge 1) in one pixel: 16 pels.
Render as a 1x16 rectangle to 3 MRTs.

Hmmm.

Performance improves, but is still off.
Hardware units are arranged in a 2x2 pattern, so a 1x16 strip only uses half the pipes.
Rearrange the rendering.

Read/Write to one texture

Because of the dependencies, the output of one pass is the input to the next.
Hardware allows reading and writing from the same texture: no need to reset states.

Getting more info

Look at the performance counters:
HW utilization
Texture cache misses
Z-buffer statistics

Cache and Memory Tips

Rearrange the data: play with the memory layout. The CAL API allows more fine-grained control.
Memory performance is hard to predict: it depends on the access pattern, and the memory subsystem is unknown.
Tip: try replacing input textures with a very small (1x1) texture, which always hits the cache. This doesn't help with dependent texture fetches.