Ittiam Systems (Pvt.) Ltd.,

Similar documents
A Multimedia Streaming Server/Client Framework for DM64x

Scalable Multi-DM642-based MPEG-2 to H.264 Transcoder. Arvind Raman, Sriram Sethuraman Ittiam Systems (Pvt.) Ltd. Bangalore, India

IP Video Phone on DM64x

Chapter 11.3 MPEG-2. MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps Defined seven profiles aimed at different applications:

Advanced Video Coding: The new H.264 video compression standard

Video Encoding with. Multicore Processors. March 29, 2007 REAL TIME HD

Video Coding Standards. Yao Wang Polytechnic University, Brooklyn, NY11201 http: //eeweb.poly.edu/~yao

Lecture 13 Video Coding H.264 / MPEG4 AVC

Week 14. Video Compression. Ref: Fundamentals of Multimedia

Multimedia Decoder Using the Nios II Processor

SAD implementation and optimization for H.264/AVC encoder on TMS320C64 DSP

The Scope of Picture and Video Coding Standardization

Video Coding Standards

Video Codecs. National Chiao Tung University Chun-Jen Tsai 1/5/2015

VIDEO COMPRESSION STANDARDS

High Efficiency Video Coding. Li Li 2016/10/18

4G WIRELESS VIDEO COMMUNICATIONS

Emerging H.26L Standard:

MPEG-4: Simple Profile (SP)

Performance Analysis of H.264 Encoder on TMS320C64x+ and ARM 9E. Nikshep Patil

OVERVIEW OF IEEE 1857 VIDEO CODING STANDARD

Digital Video Processing

THE H.264 ADVANCED VIDEO COMPRESSION STANDARD

Introduction to Video Encoding

Upcoming Video Standards. Madhukar Budagavi, Ph.D. DSPS R&D Center, Dallas Texas Instruments Inc.

H.264/AVC und MPEG-4 SVC - die nächsten Generationen der Videokompression

DSP Solutions For High Quality Video Systems. Todd Hiers Texas Instruments

High Efficiency Video Coding: The Next Gen Codec. Matthew Goldman Senior Vice President TV Compression Technology Ericsson

Emerging Architectures for HD Video Transcoding. Jeremiah Golston CTO, Digital Entertainment Products Texas Instruments

Emerging Architectures for HD Video Transcoding. Leon Adams Worldwide Manager Catalog DSP Marketing Texas Instruments

International Journal of Emerging Technology and Advanced Engineering Website: (ISSN , Volume 2, Issue 4, April 2012)

Lecture 5: Error Resilience & Scalability

Real-Time Course. Video Streaming Over network. June Peter van der TU/e Computer Science, System Architecture and Networking

Networking Applications

IMPLEMENTATION OF H.264 DECODER ON SANDBLASTER DSP Vaidyanathan Ramadurai, Sanjay Jinturkar, Mayan Moudgill, John Glossner

Video coding. Concepts and notations.

Video Compression An Introduction

Laboratoire d'informatique, de Robotique et de Microélectronique de Montpellier Montpellier Cedex 5 France

Performance Tuning on the Blackfin Processor

Video Compression Standards (II) A/Prof. Jian Zhang

Selected coding methods in H.265/HEVC

Introduction to LAN/WAN. Application Layer 4

Video Coding Standards: H.261, H.263 and H.26L

LIST OF TABLES. Table 5.1 Specification of mapping of idx to cij for zig-zag scan 46. Table 5.2 Macroblock types 46

Fast Decision of Block size, Prediction Mode and Intra Block for H.264 Intra Prediction EE Gaurav Hansda

Mali GPU acceleration of HEVC and VP9 Decoder

MPEG-2. ISO/IEC (or ITU-T H.262)

Ch. 4: Video Compression Multimedia Systems

Interframe coding A video scene captured as a sequence of frames can be efficiently coded by estimating and compensating for motion between frames pri

MPEG-4 Part 10 AVC (H.264) Video Encoding

Cross Layer Protocol Design

10.2 Video Compression with Motion Compensation 10.4 H H.263

H.264 / AVC (Advanced Video Coding)

A macroblock-level analysis on the dynamic behaviour of an H.264 decoder

TI TMS320C6000 DSP Online Seminar

Mapping the AVS Video Decoder on a Heterogeneous Dual-Core SIMD Processor. NikolaosBellas, IoannisKatsavounidis, Maria Koziri, Dimitris Zacharis

Georgios Tziritas Computer Science Department

2014 Summer School on MPEG/VCEG Video. Video Coding Concept

ECE 417 Guest Lecture Video Compression in MPEG-1/2/4. Min-Hsuan Tsai Apr 02, 2013

Advanced Encoding Features of the Sencore TXS Transcoder

Streaming Video over the Internet. Dr. Dapeng Wu University of Florida Department of Electrical and Computer Engineering

Overview, implementation and comparison of Audio Video Standard (AVS) China and H.264/MPEG -4 part 10 or Advanced Video Coding Standard

Streaming Technologies Glossary

An H.264/AVC Main Profile Video Decoder Accelerator in a Multimedia SOC Platform

MAPPING VIDEO CODECS TO HETEROGENEOUS ARCHITECTURES. Mauricio Alvarez-Mesa Techische Universität Berlin - Spin Digital MULTIPROG 2015

CMPT 365 Multimedia Systems. Media Compression - Video Coding Standards

Objective: Introduction: To: Dr. K. R. Rao. From: Kaustubh V. Dhonsale (UTA id: ) Date: 04/24/2012

Compressed-Domain Video Processing and Transcoding

In the name of Allah. the compassionate, the merciful

Complexity Estimation of the H.264 Coded Video Bitstreams

5LSE0 - Mod 10 Part 1. MPEG Motion Compensation and Video Coding. MPEG Video / Temporal Prediction (1)

Blackfin Optimizations for Performance and Power Consumption

Module 6 STILL IMAGE COMPRESSION STANDARDS

MPEG-4 departs from its predecessors in adopting a new object-based coding:

PREFACE...XIII ACKNOWLEDGEMENTS...XV

Full HD HEVC(H.265)/H.264 Hardware IPTV Encoder Model: MagicBox HD4 series MagicBox HD401: Single channel HDMI/AV, HDMI/VGA/YPbPr/AV, HDSDI input

A Video CoDec Based on the TMS320C6X DSP José Brito, Leonel Sousa EST IPCB / INESC Av. Do Empresário Castelo Branco Portugal

IO [io] MAYAH. IO [io] Audio Video Codec Systems

HEVC The Next Generation Video Coding. 1 ELEG5502 Video Coding Technology

IOCAST video transmission solutions

Efficient Implementation of Transform Based Audio Coders using SIMD Paradigm and Multifunction Computations

Optimal Porting of Embedded Software on DSPs

Mark Kogan CTO Video Delivery Technologies Bluebird TV

WHITE PAPER ON2 TECHNOLOGIES, INC. TrueMotion VP7 Video Codec. January 10, 2005 Document Version: 1.0

Introduction to Video Compression

Introduction of Video Codec

Introduction to Video Coding

Parallelism In Video Streaming

RECOMMENDATION ITU-R BT.1720 *

Encoding Video for the Highest Quality and Performance

Video Compression MPEG-4. Market s requirements for Video compression standard

Module Introduction. Content 15 pages 2 questions. Learning Time 25 minutes

Google Workloads for Consumer Devices: Mitigating Data Movement Bottlenecks Amirali Boroumand

EE 5359 H.264 to VC 1 Transcoding

Smoooth Streaming over wireless Networks Sreya Chakraborty Final Report EE-5359 under the guidance of Dr. K.R.Rao

MPEG-4 BSAC Technology

H.264 Decoding. University of Central Florida

The S6000 Family of Processors

Architecture Considerations for Multi-Format Programmable Video Processors

Video Transcoding Architectures and Techniques: An Overview. IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang

Transcription:

Arvind Raman, Kismat Singh, Manisha Agrawal Mohan, Neelakanth Shigihalli, Sriram Sethuraman, B.S. Supreeth Ittiam Systems (Pvt.) Ltd., Bangalore, India Presented by: Sriram Sethuraman TI DSP Fest, 3 December 2003, Bangalore 1

Outline of Presentation Overview of H.264 Application Scenario TI DM642 Capabilities/Features End-to-End Setup Task Scheduling Design Methodology Low-level optimization Strategies High-level multithreading between DSP and DMA Conclusions 2

H.264 - Introduction Started in 1998 as H.26L Goal to achieve 50% reduction in bitrate Joint ITU-T and ISO/IEC standard ISO/IEC 14496-10 (aka MPEG-4 Part 10) Status Now officially balloted as an international standard Via licensing and MPEG LA have formed patent pools Addresses diverse applications Video telephony Streaming Storage (contender for the new DVD format) Digital Cinema 3

H.264 Decoder Bits Entropy Decode Inverse Quantization Inverse Transform Intra Pred Motion Compensation + Reconstructed Segment Buffer Frame Store In-loop Deblocking Filter Decoded frame 4

H.264 - Features 4x4 block sizes Improved intra prediction Compression efficiency on par with JPEG-2000 Enhanced temporal prediction Improved quarter pixel motion compensation Variable block size motion compensation Multiple reference frames Weighted prediction Efficient entropy coding Context based VLC (CAVLC) Context based binary arithmetic coding (CABAC) Interlaced support Field pictures, MB level adaptive field/frame Integer transform No mismatch between encoder and decoder Low complexity, wider range quantization Superior in-loop deblocking filter 5

Profiles Baseline Profile Targets low-end devices Main Profile Targets entertainment quality Supports I, P, B slices; interlaced; CABAC Extended Profile For streaming support; extra error resilience Here we target the Main Profile 6

Y-PSNR (db) H.264 Advantage H.264 vs. MPEG-4 SP 40 35 30 MPEG-4 SP H264 25 20 0 200 400 600 800 1000 Bitrate (kbps) H.264 with CABAC, no B-frames, 1 reference frame 50% bit-rate reduction compared to MPEG-4 67% bit-rate reduction compared to MPEG-2 7

Application Scenario TV #1 TV #2 H264 Streaming Wi-Fi Client H264 Streaming Wi-Fi Client Central TV Central Wi-Fi Home Server 8

Application Scenario (Contd.) Wireless enabled Home Media Server Error-free transmission bandwidth Decreases with increasing distance between connection points MPEG-2 today requires ~4Mbps, which results in poor video quality due to lost packets QoS is being worked on from the WLAN side as well H.264 requires only ~1.4 Mbps Enables entertainment quality Issue is A low-cost H.264 Streaming Client 9

Challenges Computational complexity is significantly higher While transform and quantization are simpler, the following add a lot of complexity Deblocking filter in the loop Adaptive; lots of conditional code» Computing boundary strength; signal-based thresholds Multiple filters (3 to 5 taps; variable coefficients) Half-sample interpolation with a 6-tap filter rather than bilinear interpolation Intra prediction Planar mode (takes ~10 operations per pixel) Others (~5 operations per pixel) Context-adaptive entropy coding Multiple VLC tables or arithmetic coder contexts Renormalization complexity in arithmetic decoding 10

Challenges Control flow limitations Intra (4x4) prediction requires reconstructed samples Deblocking cannot be performed immediately following the decoding of a macroblock From motion vector decode to motion compensation, up to 96 distinct blocks of data have to be fetched for a single MB Keeping the DSP non-idle Slice can contain an arbitrary number of MBs Memory bandwidth problems MVs for every 4x4 subblock from up to 15 ref. frames; with 6-tap filter, requires 9x9 samples, and with bi-directional prediction, works out to ~10 frames for each frame Code size (cache misses) 11

TI TMS320DM642 12 Reproduced from SPRS200A, Texas Instruments

TI C64x DSP Core VLIW architecture (VelociTI) 256-bit instructions (8 32-bit) Dual data paths 4 functional units (L, M, S, D) each L, S arithmetic/logical; M multiply/shift; D Data load/store 32 32-bit registers each Cross path access to registers Can load and store up to 64-bits Non-aligned load/stores are supported All instructions can be conditionally executed Packed 8-bit and 16-bit operations E.g. AVGU4, DOTPU4, ADD2, MIN2, MAX2, etc. Instructions to pack and unpack Transfer crossbar with 4 queues of varying priority 64-channel enhanced DMA (TCInt, Chaining, linking) 2-level cache (16 kb I and D L1; 256 kb L2) 13

TI DM642 Features 2 External Memory Interfaces 1 64-bit (EMIFA); 1 16-bit (EMIFB) 3 video ports Can be configured as video input, video output, or transport stream interface Multi-channel Audio Serial Port (Data/Control) 2 Multi-channel Buffered Serial Ports Ethernet MAC PCI interface Host Post Interface I2C Interface 14

TI DM642 EVM Features Composite/S-video input/output Line In/Out, Mic In 10/100 Mbps Ethernet Controller 32 MB SDRAM + 4 MB Flash PCI interface (boot option + HPI) EMIF Daughter Card Interface Video port Daughter Card Interface All interfaces to run a streaming A/V client 15

Streaming Set-up Render Audio Sync Render Video Audio Dec Video Dec A V RTP UDP IP RTSP TCP IP RTSP TCP IP RTP UDP IP Ethernet/WLAN Server (on PC) Client (on DM64x) 16

Task Scheduling TI provides on the software framework side, DSP BIOS Networking Development Kit (NDK) with TCP/IP stack Network Scheduler (Event/Timer based processing) Drivers for video display and audio playback Run as separate tasks managing the buffers and EDMA Scheduling concerns UDP packets are not lost due to scheduling conflicts Audio/Video synchronization is maintained Buffers do not overflow Achieved by Suitably setting relative priorities of network scheduler, receive threads, and processing threads Using semaphores to trigger a consumer task 17

Design Methodology Develop optimized assembly code for the leaf modules Provides estimates on code size and MCPS Ensure that leaf modules have a clean interface To avoid having to re-write ASM modules later Optimization guidelines Analyze whether module is I/O limited or process limited Maximize the utilization by packing as many slots as possible Reduce slots by using appropriate SIMD instructions Evaluate alternate ways of reducing the # of execution packets Balance the load between data-paths A and B in a clean way to reduce cross-path accesses Explore unrolling possibilities balance code-size, register pressure and speedup Minimize stalls due to branches and delay slots Use conditional execution to avoid shallow branches Avoid memory bank stalls 18

Exploiting Parallelism Algorithmic Parallelism Multiple pixels can be worked on without any read-afterwrite dependencies Packed data operations Using the packed data operations, multiple operations can be performed in one slot E.g. AVGU4 - If pixel data is in bytes, compute four (p i,j + p i,j+1 +1)>>1 operations in one instruction E.g. DOTPU4 w i* p i (4 8x8 multiplications and 3 adds) Instructions that come in handy AVGU4, DOTP4, DOTP2, SHRMB, SHLMB, UNPCK, PACK, SPACK, MIN2, MAX2, MPY4, SUBABS4, Instruction level Parallelism All operations at an instruction level that are independent and hence can be scheduled in parallel Used to fill all the 8 slots in an execution packet Only catch is the register pressure 19

Design Methodology (Contd.) Partition the decoding tasks such that I-cache thrashing is minimized Suitable pipelining can be established to minimize the cycles spent by the DSP waiting for data; (Keep pipeline granularity flexible) Goal is to maximally utilize DSP and DMA Required buffering due to above pipelining can be easily accommodated in the ISRAM Re-use of scratch buffers minimizes L1 D-cache misses Pick suitable groupings from syntax decoding, motion compensation, texture decoding, deblocking, and 4:2:2 conversion Partitioned groups can be pipelined at MB, groups of MBs, or row of MBs Larger groups can leverage statistical properties Verify the pipeline design Use memcpys as placeholders for DMAs with suitable sync points in software (to verify completion of DMA) Select the type of triggering for the various DMA channels Implement the DMA and replace the memcpys 20 Adjust code placements and buffer placements in ISRAM

EDMA considerations Use QDMA whenever possible for short bursts of transfer Use chaining whenever possible to automate triggering Optimize the triggering under an ISR to minimize context saving overheads Balance the load across the four priority queues and according to the criticality of the data Plan the buffer structures according to the restriction that 2D to 2D transfer is possible only if both source and destination strides are equal Ensure that DMAed area is not L2 cached Do not hard-code channels; use CSL functions 21

Performance In real-time on a DM642@600MHz H.264 Main Profile Decoder D-1 resolution at ~2 Mbps MPEG-4 AAC-LC Decoder Stereo at 44.1 KHz at 64kbps/ch RTSP/RTP streaming client Video/Audio rendering with synchronization Speed-ups on leaf modules up to 40x compared to O3 compiled C code Typical speed-ups in the range of 10-16x 22

Conclusions Real-time streaming client enabled on DM642 Exploiting the VLIW architecture and packed-data instructions Counters the high computational complexity Suitable task partition structure and multithreaded pipeline between DSP and DMA mitigates Control flow limitations Memory bandwidth issues I-cache misses due to increased code size Further Challenges Moving to a lower cost processor such as DM640/641 23

24

Segmented Motion Compensation 16x16 8x16 16x8 8x8 8x8 4x8 8x4 4x4 Total of 1-16 motion vectors per MB 25

Ittiam H.264 Decoder on DM642 Main Profile Decoder Supports CAVLC CABAC Segmented motion compensation Multiple reference frames Intra prediction In-loop deblocking filter Targets Entertainment quality video Applications such as PVRs and STBs A/V Synchronized Player Demo Platform: TI DM642 EVM File Format: AVI Audio Decoder: MPEG-4 AAC-LC (64kbps/ch, stereo) Video Bit-rate: 1.4 Mbps (VBR) Streaming Client Demo 26 RTSP/RTP based streaming