MAPPING VIDEO CODECS TO HETEROGENEOUS ARCHITECTURES. Mauricio Alvarez-Mesa, Technische Universität Berlin - Spin Digital. MULTIPROG 2015

Video Codecs
- 70% of internet traffic will be video in 2018 [CISCO]
- Video compression: significant improvements over the last years
- Growing demand/offer for higher quality: HD to UHD
- Video codecs require more and more computing power
Slide 2

Heterogeneous Architectures
- Like them or not: they are available
- CPUs + GPUs + DSPs + HW accelerators + ...
Slide 3

Mapping: the Complete Application
- Video Player: Decoding + Rendering
Slide 4

Mapping: Where to Execute What?
Figure: Qualcomm Snapdragon 810 SoC
Slide 5

Outline
- Introduction
- Hardware Acceleration of Video Decoding
- OpenCL Acceleration of Video Decoding
- CPU Decoding + GPU Rendering
- Conclusions
Slide 6

Hardware Accelerators
Mapping
- Decoder in HW: ASIC for video decoding
- Rendering in GPU
Pros
- Low power: 4K H.265 < 100 mW
Cons
- Flexibility: more than one codec
- Programmability issues
Figure: Tikekar et al., JSSC, Jan. 2014
Slide 7

HW Accelerators: the Software Issue
- Multiple APIs for accessing HW modules from a SW player
- Supported codecs depend on the chosen API, which is selected at runtime depending on what is available on the system
Table (source: GStreamer Manual). Codec columns: MPEG2, MPEG4, H.264, VC1, H.265. Marks per API as they appear on the slide:
- VAAPI (Intel): X X X X ?
- VDPAU (NVIDIA): X X X X ?
- XVBA (AMD): X X ?
- DXVA (Microsoft): X ?
- VDA (Apple): X ?
Slide 8

OpenCL Video Decoding
Mapping
- Decoder: accelerate decoding on the GPU using OpenCL
- Rendering: in GPU
Pros
- Programmability
- Compatibility: write once, and compile/run everywhere (isn't it?)
Cons
- Programming model mismatch
- Overheads
Slide 10

H.264 Video Decoding with GPUs
- H.264/AVC: one of the most widely used video codecs
Figure: H.264 decoder block diagram (H.264 stream, entropy decoding, inverse quantization, inverse DCT, intra prediction, motion compensation, frame buffer, deblocking filter, YUV video)
Kernels      Entropy Dec.  Inverse DCT  Intra-Pred.  Motion Comp.  Deblocking
Parallelism  Low           High         Low          High          Low
Divergence   High          Low          High         High          High
Slide 11
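The kernel table above can be read as a first filter for what to try on a GPU. The sketch below encodes the table as data and applies a simple rule (high parallelism, optionally also low divergence). The rule and the function name `offload_candidates` are illustrative assumptions, not something the slides define.

```python
# Kernel characteristics copied from the slide's H.264 table.
H264_KERNELS = {
    "Entropy Dec.": {"parallelism": "Low", "divergence": "High"},
    "Inverse DCT": {"parallelism": "High", "divergence": "Low"},
    "Intra-Pred.": {"parallelism": "Low", "divergence": "High"},
    "Motion Comp.": {"parallelism": "High", "divergence": "High"},
    "Deblocking": {"parallelism": "Low", "divergence": "High"},
}


def offload_candidates(kernels, allow_divergent=True):
    """Return kernels with high parallelism (hypothetical selection rule).

    With allow_divergent=False, kernels with high branch divergence are
    also excluded, leaving only the most GPU-friendly ones.
    """
    selected = []
    for name, traits in kernels.items():
        if traits["parallelism"] != "High":
            continue
        if not allow_divergent and traits["divergence"] == "High":
            continue
        selected.append(name)
    return selected


if __name__ == "__main__":
    print(offload_candidates(H264_KERNELS))
    print(offload_candidates(H264_KERNELS, allow_divergent=False))
```

Under this rule the candidates are the inverse DCT and motion compensation, which are the two kernels the following slides evaluate.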

H.264 Inverse Transform on GPUs
Kernel only
Figure: speedups over scalar execution in CPU (y-axis 0 to 30) for SIMD, GT430, GTS450 and GTX560Ti; inputs: 4x4 and 8x8 transforms, each at CRF37, CRF22 and Intra
- Significant speedups: over 25x
Slide 12

H.264 Inverse Transform on GPUs
Kernel + memory transfers + OpenCL runtime
Figure: speedups (y-axis 0.0 to 4.0) at 1080p and 2160p for CRF37, CRF22 and Intra; series: Non-Over, Overlap, SIMD
- Entire application slower than CPU + SIMD with 1 core
B. Wang et al. An Optimized Parallel IDCT on Graphics Processing Units. LNCS 2012.
Slide 13
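The gap between the kernel-only speedup and the end-to-end result can be illustrated with a simple timing model. This is a minimal sketch, assuming the GPU path costs kernel time plus transfer time plus a fixed runtime overhead, and that overlapping hides the shorter of kernel and transfer. The function name and all numbers in the usage line are hypothetical, not measurements from the slides.

```python
def end_to_end_speedup(cpu_ms, kernel_speedup, transfer_ms, runtime_ms,
                       overlap=False):
    """Speedup of the offloaded path over the CPU path (toy model).

    cpu_ms         -- time of the kernel on the CPU baseline
    kernel_speedup -- GPU kernel-only speedup over that baseline
    transfer_ms    -- CPU<->GPU memory transfer time
    runtime_ms     -- fixed OpenCL runtime overhead (enqueue, sync, ...)
    overlap        -- if True, transfers run concurrently with the kernel
    """
    gpu_kernel_ms = cpu_ms / kernel_speedup
    if overlap:
        # The longer of the two activities determines the elapsed time.
        busy_ms = max(gpu_kernel_ms, transfer_ms)
    else:
        busy_ms = gpu_kernel_ms + transfer_ms
    return cpu_ms / (busy_ms + runtime_ms)


if __name__ == "__main__":
    # Hypothetical numbers: a 25x kernel whose data takes 6 ms to move.
    for overlap in (False, True):
        print(overlap, end_to_end_speedup(10.0, 25.0, 6.0, 1.0, overlap))
```

In this model a large kernel speedup barely matters once the transfer time exceeds the accelerated kernel time, and overlapping only helps up to the transfer bound.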

H.264 Motion Compensation on GPUs
- Discrete GPU attains the highest performance: 1.7x speedup
- Significant performance penalty when including the overhead
Slide 14

H.264 Motion Compensation on GPUs
Figure: MC breakdown (y-axis 0 to 1) into kernel, Mem Copy and OCL runtime, at 1080p and 2160p, on AMD, Nvidia and Intel GPUs
B. Wang et al. Parallel H.264/AVC Motion Compensation for GPUs using OpenCL. IEEE TCSVT.
Slide 15

H.264: Complete Application
Figure: frame-level pipeline with two stages connected by signals. Entropy stage: frame N, frame N+1, ... (offloads IDCT). Reconstruction stage: frame N, frame N+1, ... (offloads MC)
Figure: time breakdown (x-axis 0 to 25) into Others, Entropy, IDCT, MC, MBREC and Total, for the configurations Pipelined, Seq-GPU-IDCT-MC, Seq-GPU-MC, Seq-GPU-IDCT, Seq-CPU-IDCT-MC, Seq-CPU-MC, Seq-CPU-IDCT, Seq-CPU
- GPU + frame-level pipelining: faster than single-threaded CPU
- Required complete application redesign
Slide 16
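The frame-level pipelining idea can be sketched as two stages joined by a bounded queue: while one stage reconstructs frame N, the other already entropy-decodes frame N+1. The stage bodies below (`entropy_decode`, `reconstruct`) are hypothetical stand-ins, and Python threads are used only to show the structure, not to suggest how the original decoder was implemented.

```python
import queue
import threading


def entropy_decode(frame_id):
    # Stand-in for the sequential CPU stage (hypothetical payload).
    return {"frame": frame_id, "coeffs": [frame_id] * 4}


def reconstruct(item):
    # Stand-in for the offloaded IDCT + motion compensation work.
    return (item["frame"], sum(item["coeffs"]))


def decode_pipelined(num_frames, depth=2):
    """Run entropy decoding and reconstruction as two overlapping stages."""
    handoff = queue.Queue(maxsize=depth)  # bounds frames in flight
    output = []

    def entropy_stage():
        for n in range(num_frames):
            handoff.put(entropy_decode(n))  # blocks while the queue is full
        handoff.put(None)  # end-of-stream marker

    def reconstruction_stage():
        while True:
            item = handoff.get()
            if item is None:
                break
            output.append(reconstruct(item))

    stages = [threading.Thread(target=entropy_stage),
              threading.Thread(target=reconstruction_stage)]
    for t in stages:
        t.start()
    for t in stages:
        t.join()
    return output


if __name__ == "__main__":
    print(decode_pipelined(3))
```

The bounded queue plays the role of the "signal" between stages and limits how far entropy decoding can run ahead of reconstruction.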

HEVC/H.265 Video Coding Standard
- HEVC/H.265: 50% better compression compared to H.264/AVC
- Adds high-level parallelization tools: WPP, Tiles (good for CPUs)
- Removes dependencies in deblocking filter
- Adds another filter: SAO
Figure: HEVC decoder block diagram (HEVC stream, entropy decoding, inverse quantization, inverse DCT, intra prediction, motion compensation, reference pictures, deblocking filter, SAO filter, YUV video)
Kernels      Entropy Dec.  Inverse T  Intra-Pred.  Motion Comp.  De-blocking  SAO
Parallelism  Low           High       Low          High          High         High
Divergence   High          Low        High         High          High         Low
Slide 17

H.265 In-loop Filters on Discrete GPUs Slide 18

H.265 In-loop Filters on Integrated GPUs Slide 19

H.265 In-loop Filters on Mobile GPUs Slide 20

OpenCL Decoding: Lessons Learned
- Programming model mismatch
  - Not all kernels can be offloaded
  - For kernels that can be offloaded: data locality loss
  - Acceleration of kernels but not of the complete application
- Overheads
  - CPU-GPU memory transfers
  - OpenCL runtime
- Effects on rendering of using the GPU for decoding not analyzed (yet)
Slide 21

CPU Decoding
CPU optimizations
- Multithreading: C++11 threads
- SIMD: compiler intrinsics
Pros
- Programmability
- High performance: in high performance CPUs
Cons
- Low power efficiency: see Chi et al. TACO Jan 2015.
- Low performance: in low power CPUs
Slide 23

4K CPU Decoding Performance
             1 thread       2 threads      4 threads
             fps    sp.     fps    sp.     fps     sp.
Haswell      37.8   1.00    73.4   1.94    134.9   3.57
Bay Trail     6.2   1.00    12.2   1.95     23.6   3.79
Cortex-A9     2.8   1.00     5.3   1.90      8.6   3.05
Cortex-A15    6.0   1.00    11.4   1.91     19.9   3.33
- High performance CPUs fast enough for 4Kp50
- Low power CPUs not fast enough for 4Kp25
References
- Chi et al. SIMD Acceleration for HEVC Decoding. IEEE TCSVT. 2015
- Chi et al. Parallel Scalability and Efficiency of HEVC Parallelization Approaches. IEEE TCSVT. Dec 2012
- Chi et al. Parallel HEVC Decoding on Multi- and Many-core Architectures. JSPS 2013
Slide 24
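The two conclusions on this slide follow directly from the table. The sketch below recomputes parallel efficiency and checks each CPU against a frame-rate target. The fps values are taken from the slide; the helper names and the grouping of Bay Trail and the Cortex-A cores as "low power" are assumptions made for illustration. Speedups recomputed from the rounded fps values can differ slightly from the slide's "sp." column.

```python
# Frames per second from the slide, keyed by CPU and thread count.
FPS = {
    "Haswell": {1: 37.8, 2: 73.4, 4: 134.9},
    "Bay Trail": {1: 6.2, 2: 12.2, 4: 23.6},
    "Cortex-A9": {1: 2.8, 2: 5.3, 4: 8.6},
    "Cortex-A15": {1: 6.0, 2: 11.4, 4: 19.9},
}

# Assumed grouping, not stated explicitly in the table.
LOW_POWER = ("Bay Trail", "Cortex-A9", "Cortex-A15")


def parallel_efficiency(cpu, threads):
    """Speedup over one thread divided by the number of threads."""
    return FPS[cpu][threads] / FPS[cpu][1] / threads


def meets_target(cpu, threads, target_fps):
    """True if the measured rate reaches the playback frame rate."""
    return FPS[cpu][threads] >= target_fps


if __name__ == "__main__":
    for cpu in FPS:
        print(cpu,
              round(parallel_efficiency(cpu, 4), 2),
              meets_target(cpu, 4, 50),
              meets_target(cpu, 4, 25))
```

With four threads, Haswell exceeds the 50 fps needed for 4Kp50, while none of the three low-power CPUs reaches the 25 fps needed for 4Kp25.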

GPU Rendering using OpenGL
Pros
- GPU should be efficient for simple 2D rendering
- Color conversion is just matrix-vector multiplication
- GPUs have HW units for color conversion
- OpenGL is a mature API, well supported on many platforms
Cons
- 2D rendering is complex
- Color conversion is not the bottleneck
- OpenGL has extensions, and has no performance guarantees
Slide 25

Video Rendering Stages
Figure: Decoding -> (frame in YUV) -> Texture Uploading -> (YUV) -> Color Model Conversion -> (RGBA) -> Displaying
- Texture Uploading: upload/transform image from CPU to GPU
- Color Model Conversion: YUV into RGBA conversion
- Displaying: from framebuffer to screen
Slide 26
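The color model conversion stage is a 3x3 matrix applied to each pixel's (Y, U, V) vector. The slides do not say which matrix is used, so this sketch assumes BT.709 luma coefficients, full-range samples normalized to [0, 1], and chroma centered at 0.5. In a player this would run in a GPU shader; plain Python is used here only to show the arithmetic.

```python
def ycbcr_to_rgb_matrix(kr=0.2126, kb=0.0722):
    """Build the YCbCr -> RGB matrix from luma coefficients.

    Defaults are the BT.709 values (an assumption for this sketch).
    """
    kg = 1.0 - kr - kb
    return [
        [1.0, 0.0, 2.0 * (1.0 - kr)],
        [1.0, -2.0 * kb * (1.0 - kb) / kg, -2.0 * kr * (1.0 - kr) / kg],
        [1.0, 2.0 * (1.0 - kb), 0.0],
    ]


def yuv_to_rgba(y, u, v, matrix=None):
    """Convert one normalized full-range YUV pixel to an RGBA tuple."""
    m = matrix if matrix is not None else ycbcr_to_rgb_matrix()
    vec = (y, u - 0.5, v - 0.5)  # center the chroma components
    rgb = [sum(m[row][col] * vec[col] for col in range(3))
           for row in range(3)]
    r, g, b = (min(1.0, max(0.0, c)) for c in rgb)  # clamp to [0, 1]
    return (r, g, b, 1.0)  # opaque alpha


if __name__ == "__main__":
    print(yuv_to_rgba(0.5, 0.5, 0.5))   # mid gray
    print(yuv_to_rgba(0.2126, 0.5 - 0.2126 / 1.8556, 1.0))  # pure red
```

The per-pixel work is a handful of multiply-adds, which is consistent with the slide's point that color conversion is not the bottleneck.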

4K Video Play: Discrete GPU
Figure: 4K playback on GTX750-Ti, Linux; performance in millisec/frame (y-axis 0 to 20) per video format: YUV420 8-bit (11.9 MB), YUV420 10-bit (23.7 MB), YUV422 10-bit (31.6 MB), YUV444 10-bit (47.5 MB); series: Decoding + Rendering, Decoding, Rendering, Texture Uploading, CC + Displaying, Color Conversion
Slide 27
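The per-frame sizes attached to each video format can be reproduced with a short calculation. This sketch assumes a 3840x2160 frame, 10-bit samples stored in 16-bit words, and sizes expressed in units of 2^20 bytes. None of these assumptions is stated on the slide, but together they match its four figures.

```python
# Chroma samples (U + V) per luma sample for each subsampling scheme.
CHROMA_PER_LUMA = {"420": 0.5, "422": 1.0, "444": 2.0}


def frame_size_mib(width, height, chroma, bit_depth):
    """Size of one uncompressed frame in MiB (2**20 bytes)."""
    # Assumption: anything above 8 bits is stored in 2 bytes per sample.
    bytes_per_sample = 1 if bit_depth <= 8 else 2
    samples = width * height * (1 + CHROMA_PER_LUMA[chroma])
    return samples * bytes_per_sample / 2**20


if __name__ == "__main__":
    for chroma, depth in (("420", 8), ("420", 10), ("422", 10), ("444", 10)):
        size = frame_size_mib(3840, 2160, chroma, depth)
        print(f"YUV{chroma} {depth}-bit: {size:.1f} MB")
```

Since this is the amount of data uploaded to the GPU for every frame, the YUV444 10-bit format moves about four times as much data per frame as YUV420 8-bit.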

4K Video Play: Integrated GPUs
Figure: 4K playback on HD4600, Linux; performance in millisec/frame (y-axis 0 to 60) per video format: YUV420 8-bit (11.9 MB), YUV420 10-bit (23.7 MB), YUV422 10-bit (31.6 MB), YUV444 10-bit (47.5 MB); series: Decoding + Rendering, Decoding, Rendering, Texture Uploading, CC + Displaying, Color Conversion
Slide 28

Integrated GPUs + Linear Texture
Figure: 4K texture uploading on HD4600, Win8.1; performance in millisec/frame (y-axis 0 to 30) per video format; series: glTexSubImage2D, PBO + memcpy, PBOMAP, INTEL_map_texture
Figure: 4K playback on HD4600, Win8.1; performance in millisec/frame (y-axis 0 to 30) per video format; series: Decoding + Rendering, Decoding, Rendering, Texture Uploading, CC + Displaying, Color Conversion
Video formats in both figures: YUV420 8-bit (11.9 MB), YUV420 10-bit (23.7 MB), YUV422 10-bit (31.6 MB), YUV444 10-bit (47.5 MB)
Slide 29

Decoding + Rendering: Lessons Learned
Discrete GPUs
- Performance drop: sharing memory for PCIe copy
- Can be solved by using integrated GPUs: remove copies
Integrated GPUs
- Texture uploading takes more time than decoding
- Can be solved with an OpenGL extension
OpenGL extensions
- INTEL_map_texture: removes memory layout conversion
- Performance loss due to memory contention
Slide 30

Conclusions
Video codecs and parallel architectures: conflicting objectives
- Video codecs: remove redundancy -> introduce dependencies
- Parallelism: remove dependencies -> reduce compression
Solve the system problem
- Complete video decoding: not just kernels
- Rendering: is not just matrix multiplication
- Video player is Decoding + Rendering
Slide 32

Questions?
Acknowledgements: Biao Wang, Haifeng Gao, Chi Ching Chi
More info:
- http://www.aes.tu-berlin.de/
- http://www.spin-digital.com/
Slide 33