On improving the performance per area of ASTC with a multi-output decoder

Similar documents
Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms. Visual Computing Systems CMU , Fall 2014

Direct Rendering. Direct Rendering Goals

Adaptive Scalable Texture Compression

Texture Compression in Memory- and Performance- Constrained Embedded Systems. Jens Ogniewski Information Coding Group Linköpings University

ASTC Does It. Eason Tang Staff Applications Engineer, ARM

Overview. Videos are everywhere. But can take up large amounts of resources. Exploit redundancy to reduce file size

PVRTC & Texture Compression. User Guide

Texture Compression. Jacob Ström, Ericsson Research

2D rendering takes a photo of the 2D scene with a virtual camera that selects an axis aligned rectangle from the scene. The photograph is placed into

JPEG decoding using end of block markers to concurrently partition channels on a GPU. Patrick Chieppe (u ) Supervisor: Dr.

PVRTC Specification and User Guide

CS 130 Final. Fall 2015

FAST: A Framework to Accelerate Super- Resolution Processing on Compressed Videos

CS GPU and GPGPU Programming Lecture 16+17: GPU Texturing 1+2. Markus Hadwiger, KAUST

CS770/870 Fall 2015 Advanced GLSL

Hardware Displacement Mapping

Using Virtual Texturing to Handle Massive Texture Data

Module 6 STILL IMAGE COMPRESSION STANDARDS

Image Processing Tricks in OpenGL. Simon Green NVIDIA Corporation

DigiPoints Volume 1. Student Workbook. Module 8 Digital Compression

Digital Video Processing

CONTENT ADAPTIVE SCREEN IMAGE SCALING

2014 Summer School on MPEG/VCEG Video. Video Coding Concept

Morphological: Sub-pixel Morhpological Anti-Aliasing [Jimenez 11] Fast AproXimatte Anti Aliasing [Lottes 09]

COMP371 COMPUTER GRAPHICS

Motion Estimation for Video Coding Standards

Support Triangle rendering with texturing: used for bitmap rotation, transformation or scaling

Texture Mapping. Texture (images) lecture 16. Texture mapping Aliasing (and anti-aliasing) Adding texture improves realism.

3D Graphics Texture Compression And Its Recent Trends.

lecture 16 Texture mapping Aliasing (and anti-aliasing)

15 Data Compression 2014/9/21. Objectives After studying this chapter, the student should be able to: 15-1 LOSSLESS COMPRESSION

PowerVR Series5. Architecture Guide for Developers

CS GPU and GPGPU Programming Lecture 12: GPU Texturing 1. Markus Hadwiger, KAUST

Next-Generation Graphics on Larrabee. Tim Foley Intel Corp

Compression; Error detection & correction

Computergrafik. Matthias Zwicker Universität Bern Herbst 2016

ARM Multimedia IP: working together to drive down system power and bandwidth

3D Graphics in Future Mobile Devices. Steve Steele, ARM

Texturing Theory. Overview. All it takes is for the rendered image to look right. -Jim Blinn 11/10/2018

Parallel Triangle Rendering on a Modern GPU

Core Facts. Documentation. Encrypted VHDL Fallerovo setaliste Zagreb, Croatia. Verification. Reference Designs &

Magnification and Minification

Anatomy of a Video Codec

Edge Detection. Whitepaper

Review and Implementation of DWT based Scalable Video Coding with Scalable Motion Coding.

Graphics, Mobile Computing, APIs and Life

In the name of Allah. the compassionate, the merciful

A Image Comparative Study using DCT, Fast Fourier, Wavelet Transforms and Huffman Algorithm

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

A Reconfigurable Architecture for Load-Balanced Rendering

Perceptual coding. A psychoacoustic model is used to identify those signals that are influenced by both these effects.

Encoding/Decoding. Russell Impagliazzo and Miles Jones Thanks to Janine Tiefenbruck. May 9, 2016

3/1/2010. Acceleration Techniques V1.2. Goals. Overview. Based on slides from Celine Loscos (v1.0)

GeForce4. John Montrym Henry Moreton

Professor Laurence S. Dooley. School of Computing and Communications Milton Keynes, UK

Georgios Tziritas Computer Science Department

INF3320 Computer Graphics and Discrete Geometry

Bifurcation Between CPU and GPU CPUs General purpose, serial GPUs Special purpose, parallel CPUs are becoming more parallel Dual and quad cores, roadm

Compression of 3-Dimensional Medical Image Data Using Part 2 of JPEG 2000

SD 575 Image Processing

AMCS / CS 247 Scientific Visualization Lecture 10: (GPU) Texture Mapping. Markus Hadwiger, KAUST

Both LPC and CELP are used primarily for telephony applications and hence the compression of a speech signal.

Subdivision Of Triangular Terrain Mesh Breckon, Chenney, Hobbs, Hoppe, Watts

Filtering Cubemaps Angular Extent Filtering and Edge Seam Fixup Methods

Texture. Real-Time Graphics Architecture. Kurt Akeley Pat Hanrahan.

Introduction to Video Compression

POWERVR MBX. Technology Overview

Today. Texture mapping in OpenGL. Texture mapping. Basic shaders for texturing. Today. Computergrafik

Scalable Texture Compression using the Wavelet Transform

Performance Analysis and Culling Algorithms

PowerVR Performance Recommendations. The Golden Rules

CS GPU and GPGPU Programming Lecture 11: GPU Texturing 1. Markus Hadwiger, KAUST

Parallax Bumpmapping. Whitepaper

Comparison of High-Speed Ray Casting on GPU

Fast Texture Based Form Factor Calculations for Radiosity using Graphics Hardware

Circular Arcs as Primitives for Vector Textures

DTC-350. VISUALmpeg PRO MPEG Analyser Software.

I ll be speaking today about BC6H compression, how it works, how to compress it in realtime and what s the motivation behind doing that.

Compressed-Domain Video Processing and Transcoding

Real-Time Graphics Architecture. Kurt Akeley Pat Hanrahan. Texture

Course Recap + 3D Graphics on Mobile GPUs

Week 14. Video Compression. Ref: Fundamentals of Multimedia

Delay Tolerant Networks

All the Polygons You Can Eat. Doug Rogers Developer Relations

The Power and Bandwidth Advantage of an H.264 IP Core with 8-16:1 Compressed Reference Frame Store

Addressing the Memory Wall

5LSH0 Advanced Topics Video & Analysis

Lecture 5: Error Resilience & Scalability

Lossless and Lossy Minimal Redundancy Pyramidal Decomposition for Scalable Image Compression Technique

Video Coding in H.26L

Early Performance-Cost Estimation of Application-Specific Data Path Pipelining

Topic 5 Image Compression

Fully Integrated Communication Terminal and Equipment. FlexWave II :Executive Summary

Technical Report. Non-Power-of-Two Mipmap Creation

A New Compression Method Strictly for English Textual Data

Module Introduction. Content 15 pages 2 questions. Learning Time 25 minutes

High Efficiency Video Coding. Li Li 2016/10/18

Textures. Texture coordinates. Introduce one more component to geometry

+ = To Do. Adding Visual Detail. Texture Mapping. Parameterization. Option: Varieties of projections. Foundations of Computer Graphics (Fall 2012)

Compression; Error detection & correction

Transcription:

On improving the performance per area of ASTC with a multi-output decoder Kenneth Rovers 25th July, 2017 www.imgtec.com

Textures Background In general, an image to be mapped to an object surface But also lighting, normals, height, etc. Texels are texture pixels Pixels are picture elements (0,1) (1,1) (0,0) (1,0) Imagination Technologies 2

Textures Background Texels do not exactly map to (screen) pixels Each screen pixel (fragment) may require up to 128 texels Imagination Technologies 3

Texture compression Motivation GPUs are in general memory bandwidth limited Textures consume a large part of this bandwidth Larger resolutions means larger textures means more bandwidth Compression of textures is essential Imagination Technologies 4

Texture compression Requirements Good compression typically means it may be lossy at the cost of quality Good quality few artefacts Flexible Number of colour channels Quality Simple decoding Smaller implementation Imagination Technologies 5

ASTC Adaptive Scalable Texture Compression One texture compression format to rule them all Flexible, per-texture, space versus quality trade-off Imagination Technologies 6

ASTC Adaptive Scalable Texture Compression One texture compression format to rule them all Flexible, per-texture, space versus quality trade-off Allows from 4 4 up to 12 12 texels to be compressed in a fixed 128 bits Allow 1 to 4 colour channels Modes with different quality per channel possible Improve quality by grouping texels into up to four different partitions Imagination Technologies 7

ASTC Adaptive Scalable Texture Compression One texture compression format to rule them all Flexible, per-texture, space versus quality trade-off Allows from 4 4 up to 12 12 texels to be compressed in a fixed 128 bits Allow 1 to 4 colour channels Modes with different quality per channel possible Improve quality by grouping texels into up to four different partitions High quality, very flexible, many use cases no longer simple to decode! Imagination Technologies 8

Approach ASTC decoder is large in area! (Simple) idea Nearby texels are likely to be requested at the same time multi-output decoder Share common decoding hardware save area Imagination Technologies 9

ASTC Basics Up to four partitions Each partition has a set of colour endpoints Each texel has a partition index Each texel has a weight used to interpolate between the colour endpoints Decode Configuration data Weight data Colour data Interpolate Imagination Technologies 10

ASTC decoder Structure Configuration sizes, modes, partitions Weights Variable size Fractional number of bits per weight Normalise Fewer weights can be stored than number of texels bilinear interpolation Results in 1 or 2 weights astc_block weight decoding select bits int. seq. decode weight value normalise bilinear interp footprint coords configuration decode colour endpoint block mode mode colour decoding select bits int. seq. decode colour value normalise endpoint decode 2 weights 2 colour endpoints interpolate texel Imagination Technologies 11

ASTC decoder Structure Colours Variable size Fractional number of bits per colour value Normalise Colour endpoint mode determines how colour values combine to actual colours Results in two colours Interpolate Up to four colour channels, each using one of the two weights astc_block weight decoding select bits int. seq. decode weight value normalise bilinear interp footprint configuration decode colour endpoint block mode mode interpolate coords colour decoding select bits int. seq. decode colour value normalise endpoint decode 2 weights 2 colour endpoints texel Imagination Technologies 12

ASTC multi-output decoder 2x2 block Assume a four-output decoder In general any number of outputs possible Assume a 2 2 block of adjacent texels Very common for texture filtering Imagination Technologies 13

Sharing For a 2 2 request Configuration data is mostly shared for all texels Partition index is per texel Colour endpoint mode is per partition (potentially all needed) Imagination Technologies 14

Sharing For a 2 2 request Configuration data is mostly shared for all texels Partition index is per texel Colour endpoint mode is per partition (potentially all needed) Weight data Bit selection shared Adjacent texels share weight values for bilinear interpolation Instead of 4 texels 4 interpolant 2 weight =32 weight values, only 3 3 2=18 are needed Imagination Technologies 15

Sharing For a 2 2 request Configuration data is mostly shared for all texels Partition index is per texel Colour endpoint mode is per partition (potentially all needed) Weight data Bit selection shared Adjacent texels share weight values for bilinear interpolation Instead of 4 texels 4 interpolant 2 weight =32 weight values, only 3 3 2=18 are needed Imagination Technologies 16

Sharing For a 2 2 request Configuration data is mostly shared for all texels Partition index is per texel Colour endpoint mode is per partition (potentially all needed) Weight data Bit selection shared Adjacent texels share weight values for bilinear interpolation Instead of 4 texels 4 interpolant 2 weight =32 weight values, only 3 3 2=18 are needed Imagination Technologies 17

Sharing For a 2 2 request Configuration data is mostly shared for all texels Partition index is per texel Colour endpoint mode is per partition (potentially all needed) Weight data Bit selection shared Adjacent texels share weight values for bilinear interpolation Instead of 4 texels 4 interpolant 2 weight =32 weight values, only 3 3 2=18 are needed Imagination Technologies 18

Sharing For a 2 2 request Configuration data is mostly shared for all texels Partition index is per texel Colour endpoint mode is per partition (potentially all needed) Weight data Bit selection shared Adjacent texels share weight values for bilinear interpolation Instead of 4 texels 4 interpolant 2 weight =32 weight values, only 3 3 2=18 are needed Imagination Technologies 19

Sharing For a 2 2 request Configuration data is mostly shared for all texels Partition index is per texel Colour endpoint mode is per partition (potentially all needed) Weight data Bit selection shared Adjacent texels share weight values for bilinear interpolation Instead of 4 texels 4 interpolant 2 weight =32 weight values, only 3 3 2=18 are needed Colour data Bit selection is shared Limit of 18 colour values instead of 4 values 4 partitions 2 endpoints = 32 Some modes are not legal for 3 or 4 partitions Imagination Technologies 20

Sharing For a 2 2 request Configuration data is mostly shared for all texels Partition index is per texel Colour endpoint mode is per partition (potentially all needed) Weight data Bit selection shared Adjacent texels share weight values for bilinear interpolation Instead of 4 texels 4 interpolant 2 weight =32 weight values, only 3 3 2=18 are needed Colour data Bit selection is shared Limit of 18 colour values instead of 4 values 4 partitions 2 endpoints = 32 Some modes are not legal for 3 or 4 partitions Interpolation Is independent (at least up to 4 texels) Imagination Technologies 21

Area For a 2 2 request Synthesis of four single-output decoders About 100 10 3 µm 2 in 40nm at 400MHz Synthesis of one four-output decoders About 60 10 3 µm 2 in 40nm at 400MHz Area savings is 40% Roughly corresponds to expected savings of: 75% in configuration decoding 1-18 / 32 44% in weight decoding 1-18 / 32 44% in colour decoding No savings in interpolating Imagination Technologies 22

Performance For a 2 2 request If not all texel request fall inside the encoded block there is a performance hit A 2 2 request at an edge requires two decoding cycles A 2 2 request at a corner requires four decoding cycles Imagination Technologies 23

Performance For a 2 2 request If not all texel request fall inside the encoded block there is a performance hit A 2 2 request at an edge requires two decoding cycles A 2 2 request at a corner requires four decoding cycles Assuming each location is equally likely With footprint n m ( n 1) ( m 1) n m fall inside the block ( n 1) ( m 1) n m 1 n m fall on an edge fall on a corner Imagination Technologies 24

Performance per area For a 2 2 request Area savings: 40% Average performance assuming each footprint is equally likely: 75.8% Performance/area improvement: 25% (75/60=1.25) Worst case: 7% (64.0/60), best-case: 42% (85.2/60) Imagination Technologies 25

Conclusion A multi-output decoder can decode a 2 2 block of texels for 40% less hardware cost than four individual decoders This is achieved by sharing common hardware, exploiting overlap in adjacent requests and exploiting the invalid space in the specification This comes at an performance cost of 25% on average (although smarter scheduling not discussed here can reduce this) The performance per area improves significantly by 25% Imagination Technologies 26

Questions? www.imgtec.com