Mapping the AVS Video Decoder on a Heterogeneous Dual-Core SIMD Processor. NikolaosBellas, IoannisKatsavounidis, Maria Koziri, Dimitris Zacharis

Similar documents
Scalable Multi-DM642-based MPEG-2 to H.264 Transcoder. Arvind Raman, Sriram Sethuraman Ittiam Systems (Pvt.) Ltd. Bangalore, India

AVS VIDEO DECODING ACCELERATION ON ARM CORTEX-A WITH NEON

IMPLEMENTATION OF H.264 DECODER ON SANDBLASTER DSP Vaidyanathan Ramadurai, Sanjay Jinturkar, Mayan Moudgill, John Glossner

How to Add Low-Power, Multi-Codec, Digital Video and Audio to Your Next ASIC or SOC Design

Architecture Considerations for Multi-Format Programmable Video Processors

MPEG-2. ISO/IEC (or ITU-T H.262)

Functional modeling style for efficient SW code generation of video codec applications

Porting an MPEG-2 Decoder to the Cell Architecture

Video Compression An Introduction

Introduction to Video Compression

Performance Tuning on the Blackfin Processor

Mali GPU acceleration of HEVC and VP9 Decoder

Multimedia Decoder Using the Nios II Processor

ECE 417 Guest Lecture Video Compression in MPEG-1/2/4. Min-Hsuan Tsai Apr 02, 2013

Lab-1: Profiling/Optimizing Video Decoder Using ADS. National Chiao Tung University Chun-Jen Tsai 3/3/2011

Video Encoding with. Multicore Processors. March 29, 2007 REAL TIME HD

Interframe coding A video scene captured as a sequence of frames can be efficiently coded by estimating and compensating for motion between frames pri

Venezia: a Scalable Multicore Subsystem for Multimedia Applications

VIDEO COMPRESSION STANDARDS

Chapter 11.3 MPEG-2. MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps Defined seven profiles aimed at different applications:

Advanced Video Coding: The new H.264 video compression standard

Week 14. Video Compression. Ref: Fundamentals of Multimedia

Introduction to Video Encoding

The Basics of Video Compression

PREFACE...XIII ACKNOWLEDGEMENTS...XV

5LSE0 - Mod 10 Part 1. MPEG Motion Compensation and Video Coding. MPEG Video / Temporal Prediction (1)

High Efficiency Video Coding. Li Li 2016/10/18

Ch. 4: Video Compression Multimedia Systems

CEC 450 Real-Time Systems

Chapter 2 Joint MPEG-2 and H.264/AVC Decoder

OVERVIEW OF IEEE 1857 VIDEO CODING STANDARD

4G WIRELESS VIDEO COMMUNICATIONS

MAPPING VIDEO CODECS TO HETEROGENEOUS ARCHITECTURES. Mauricio Alvarez-Mesa Techische Universität Berlin - Spin Digital MULTIPROG 2015

Parallelized Progressive Network Coding with Hardware Acceleration

Digital video coding systems MPEG-1/2 Video

TKT-2431 SoC design. Introduction to exercises

H.264/AVC und MPEG-4 SVC - die nächsten Generationen der Videokompression

Module 7 VIDEO CODING AND MOTION ESTIMATION

The Use Of Virtual Platforms In MP-SoC Design. Eshel Haritan, VP Engineering CoWare Inc. MPSoC 2006

Putting MPSOC to Work in Multimedia

EE 5359 H.264 to VC 1 Transcoding

Project Title: Design and Implementation of IPTV System Project Supervisor: Dr. Khaled Fouad Elsayed

An Efficient Motion Estimation Method for H.264-Based Video Transcoding with Arbitrary Spatial Resolution Conversion

HEVC The Next Generation Video Coding. 1 ELEG5502 Video Coding Technology

MPEG-2 Video Decoding on the TMS320C6X DSP Architecture

ISSCC 2001 / SESSION 9 / INTEGRATED MULTIMEDIA PROCESSORS / 9.2

The VC-1 and H.264 Video Compression Standards for Broadband Video Services

MPEG-4: Simple Profile (SP)

MPEG Decoding Workload Characterization

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Welcome Back to Fundamentals of Multimedia (MR412) Fall, 2012 Chapter 10 ZHU Yongxin, Winson

Development of Low Power ISDB-T One-Segment Decoder by Mobile Multi-Media Engine SoC (S1G)

CAMED: Complexity Adaptive Motion Estimation & Mode Decision for H.264 Video

CISC 879 Software Support for Multicore Architectures Spring Student Presentation 6: April 8. Presenter: Pujan Kafle, Deephan Mohan

AnySP: Anytime Anywhere Anyway Signal Processing

Vertex Shader Design I

Digital Video Processing

VIDEO AND IMAGE PROCESSING USING DSP AND PFGA. Chapter 3: Video Processing

Parallel Scalability of Video Decoders

Laboratoire d'informatique, de Robotique et de Microélectronique de Montpellier Montpellier Cedex 5 France

Audio and video compression

DIGITAL TELEVISION 1. DIGITAL VIDEO FUNDAMENTALS

ISSCC 2006 / SESSION 22 / LOW POWER MULTIMEDIA / 22.1

The Scope of Picture and Video Coding Standardization

Modeling and Simulation of System-on. Platorms. Politecnico di Milano. Donatella Sciuto. Piazza Leonardo da Vinci 32, 20131, Milano

Understanding Sources of Inefficiency in General-Purpose Chips

Real-Time Course. Video Streaming Over network. June Peter van der TU/e Computer Science, System Architecture and Networking

Efficient MPEG-2 to H.264/AVC Intra Transcoding in Transform-domain

Lecture 13 Video Coding H.264 / MPEG4 AVC

A high-level simulator for the H.264/AVC decoding process in multi-core systems

TKT-2431 SoC design. Introduction to exercises. SoC design / September 10

VIDEO streaming applications over the Internet are gaining. Brief Papers

Objective: Introduction: To: Dr. K. R. Rao. From: Kaustubh V. Dhonsale (UTA id: ) Date: 04/24/2012

Module Introduction. Content 15 pages 2 questions. Learning Time 25 minutes

LIST OF TABLES. Table 5.1 Specification of mapping of idx to cij for zig-zag scan 46. Table 5.2 Macroblock types 46

Comparative Study of Partial Closed-loop Versus Open-loop Motion Estimation for Coding of HDTV

Overview, implementation and comparison of Audio Video Standard (AVS) China and H.264/MPEG -4 part 10 or Advanced Video Coding Standard

Updates in MPEG-4 AVC (H.264) Standard to Improve Picture Quality and Usability

Multimedia Standards

COMPARATIVE ANALYSIS OF DIRAC PRO-VC-2, H.264 AVC AND AVS CHINA-P7

JPEG decoding using end of block markers to concurrently partition channels on a GPU. Patrick Chieppe (u ) Supervisor: Dr.

Model: LT-122-PCIE For PCI Express

A macroblock-level analysis on the dynamic behaviour of an H.264 decoder

Understanding Sources of Inefficiency in General-Purpose Chips. Hameed, Rehan, et al. PRESENTED BY: XIAOMING GUO SIJIA HE

Outline Introduction MPEG-2 MPEG-4. Video Compression. Introduction to MPEG. Prof. Pratikgiri Goswami

Editorial Manager(tm) for Journal of Real-Time Image Processing Manuscript Draft

EE382N (20): Computer Architecture - Parallelism and Locality Fall 2011 Lecture 11 Parallelism in Software II

A Parallel Transaction-Level Model of H.264 Video Decoder

SA-1500: A 300 MHz RISC CPU with Attached Media Processor*

Parallel Scalability of Video Decoders

Research on Transcoding of MPEG-2/H.264 Video Compression

Video Transcoding Architectures and Techniques: An Overview. IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang

Video Coding Standards. Yao Wang Polytechnic University, Brooklyn, NY11201 http: //eeweb.poly.edu/~yao

White paper: Video Coding A Timeline

Overview. Videos are everywhere. But can take up large amounts of resources. Exploit redundancy to reduce file size

10.2 Video Compression with Motion Compensation 10.4 H H.263

Design and Optimization of Geometry Acceleration for Portable 3D Graphics

Review and Implementation of DWT based Scalable Video Coding with Scalable Motion Coding.

In the name of Allah. the compassionate, the merciful

0;L$+LJK3HUIRUPDQFH ;3URFHVVRU:LWK,QWHJUDWHG'*UDSKLFV

Transcription:

Mapping the AVS Video Decoder on a Heterogeneous Dual-Core SIMD Processor NikolaosBellas, IoannisKatsavounidis, Maria Koziri, Dimitris Zacharis University of Thessaly Greece 1 Outline Introduction to AVS video standard Introduction to Tensilica 388VDO Diamond heterogeneous dual core architecture Porting the AVS decoder to the Tensilica architecture Pure software optimizations SIMD optimizations Task level partitioning Experimental evaluation Conclusion 2

Proliferation of video standards Variety of standards for different applications From Wireless, Low-rate to Broadcast High Definition quality MPEG-1 (CD) MPEG-2 (DVD) MPEG-4 (wide range of applications) H.264 (wider range of applications) Video compression standards historically have a major influence on computer architecture innovation SIMD instruction sets (SSE, Altivec, Cell SPE) 3 Audio Video Standard (AVS) Development initiated by the Government of China in 2002 China s national standard in 2005 Aim to reduce China s royalty payments to foreign companies IPTV drives AVS adoption AVS similar in structure to other video standards Simpler than H.264, with very similar rate-distortion performance. 4

Audio Video Standard (AVS) * OpenAVS is an open source AVS decoder Profiling of OpenAVS* in the baseline Tensilica Xtensa RISC processor. Motion Compensation 64% of exec. time Complexity mainly due to half and quarter pixel interpolation Deblocking filter ~13% 5 AVS Implementation Objective Port the AVS video decoder to a dual core SIMD processor Provide a reference port, as a documented example for people wanting to port codecs to this core Performance target is 25 frames/sec for D1 resolution, NTSC format (720x576) OpenAVS is the starting code Main challenge Enable real-time video decoding (25fps) starting from sequential OpenAVS code running at 1.73 GHz in the base Xtensa RISC processor Exploit all forms of parallelism available in the AVS decoder 6

Tensilica 388VDO Diamond core Two heterogeneous processors Based on the Xtensa RISC processor Enhanced with video specific instructions Around 400 such instructions Stream processor Optimized for serial processing for bitstream parsing and Variable Length Decoding (VLD) Pixel processor Optimized for SIMD processingfor Motion Compensation, Deblocking, inverse DCT, and inverse Quantization 7 Tensilica 388VDO Diamond core Two heterogeneous processors Tightly coupled Instruction and Data SRAMs in each processor to reduce memory latency Multichannel DMA controller asynchronously transfers data between external DRAM and SRAMs GCC-based crosscompiler, assembler, debugger, simulator, profiler 8

Support for SIMD operations A, B, res: vector registers res = xvd_add_16x12(a, B) 16 vector elements, 12 bits each Vector register file extended with guard bits to enable extended precision during computation 8 vector elements, 24 bits each res = xvd_add_8x24(a, B) 9 Code restructuring Main loop for macroblock processing for (MbIndex = 0; MbIndex < no_mbs; < MbIndex++) { ParseOneMacroblock; // Parsing, VLD McIdctRecOneMacroblock; // Inverse transform, Motion Comp } for (MbIndex = 0; MbIndex < no_mbs; < MbIndex++) { DeblockOneMacroblock; // Deblocking We apply loop fusion toreplace the two loops with a single one This avoids spilling the frame pixels to external memory in the first loop and reading it back in the second Memory spills degrade performance Additional data structures are needed to store intermediate data for deblocking filter Three pixel rows and three pixel columns after Motion Comp This extra information is stored in the local SRAMs 10

SIMD optimizations Deblock filter Low-pass filter across block boundaries to smooth block edges. Horizontal and vertical filter Deblock filter flow diagram for luma components 11 SIMD optimizations Deblock filter Example // Computing for Orange Boxes // Speculatively, compute all possible results // pr0, pr1 are 16x12 predicate vectors // pr0[i] == 1, when (X==2, C==TRUE) for bit i // pr1[i] == 1, when (X==2, C==FALSE) for bit i // Compute V = P0 + Q0 + 2 only once and reuse V = xvd_add_16x12(p0, Q0); V = xvd_add_16x12_i(v, 2); // (P1 + P0 + V) >> 2 T0 = xvd_add_16x12(v, P0); T0 = xvd_add_16x12(t0, P1); T0 = xvd_sra_16x12_i(t0, 2); // (2*P1 + V) >> 2 T1 = xvd_add_16x12(v, P1); T1 = xvd_add_16x12(t1, P1); T1 = xvd_sra_16x12_i(t1, 2); // Use predication to compute new P0, P1 // xvd_mvtis a conditional move instruction xvd_mvt_16x12(p0, T0, pr0); xvd_mvt_16x12(p0, T1, pr1); xvd_mvt_16x12(p1, T1, pr0); 12

SIMD performance Global speed up due to SIMD optimizations 2.26x Motion Compensation kernel speed up 4.8x Deblock Filter kernel speed up 3.35x 13 Task level parallelism Exploit the concurrent functionality of three cores: Stream Processor Pixel Processor DMA controller Two separate instruction threads and data-set spaces User partitions both code and data among the two processors and orchestrates data communications 45% -55% load balance between processors 14

Decoupled functionality Allow the two processors to work independently in successive macroblocks Avoids excessive lockstep synchronization A processor can run ahead in execution Multi-buffering scheme necessary to keep intermediate data Buffers are kept in local Data SRAMs Dual buffering scheme used for AVS decoder 15 Decoupled functionality Decoupling facilitates performance variability Different bitstreamscan create corner-cases where the bottleneck shifts between the two cores Multi-buffering alleviates such corner cases Performance depends now on average load balance 16

Stream processor can proceed with the next MB(i+1) when: it finishes VLD of MB(i) there is enough space in the double buffer Overlap multiple macroblocks Concurrent functionality 17 Performance Evaluation Tensilica toolset used to compile, simulate and profile the code Xtensa optimizing compiler (xt-xcc) Cycle accurate multi-core simulator used for profiling Main Memory latency 32 cycles for first read, 1 cycle after that 64 cycles for first write, 1 cycle after that Benchmark 4Mbps input bitstream, D1 720 x 576 output resolution 18

Performance Evaluation 2000 1800 1600 1400 1200 1000 800 600 1730 (1x) 1461 (1.18x) 1354 (1.28x) 748 (2.31x) Required 388VDO MHz frequency for 720 x 576 x 25 pix/sec (Speed Up in parenthesis) 649 (2.67x) 400 200 359 (4.8x) 0 Baseline SW optimizations SIMD (VLD+Parsing) SIMD (+MC+Intra+iT) SIMD (+Deblock) +Task Level Parallelism Optimizations 19 The 388 VDO Diamond processor technology TSMC 65 GP process technology Typical Fmax: 448 MHz Typical Power (H.264 decode) : 36 mw Courtesy of Tensilica, Inc. 20

Conclusion We outlined the steps to extract parallelism from the AVS video decoding application Data level parallelism (SIMD) Task level parallelism and map it to Tensilica s Diamond 388 VDO heterogeneous dual-core Total speed up almost 5x starting from a sequential software version 1.18x from software optimizations 2.26x from SIMD 1.8x from task level parallelism 21