Scalable Multi-DM642-based MPEG-2 to H.264 Transcoder. Arvind Raman, Sriram Sethuraman, Ittiam Systems (Pvt.) Ltd., Bangalore, India

Outline of Presentation
- MPEG-2 to H.264 transcoding
- Need for a multiprocessor implementation
- TI C64x architecture
- Multiprocessor board details
- Design options and metrics
  - Spatial split
  - Functional split
  - Load balancing
- Ability to scale from CIF to HD
- Demonstration
- Conclusions

Transcoding Approaches
- Transform domain
  - With drift
  - Minimal drift
  - Drift free
- Complete decode and re-encode
- Transcode: convert from one standard to another
- Trans-rate: convert the bit-rate
- Trans-scale: downsample the input stream's resolution and then transcode

MPEG-2 to H.264 Transcoding
- MPEG-2 dominates the digital video broadcast space
- H.264 offers nearly 50% more compression over MPEG-2
- Transcoding from MPEG-2 to H.264 will help
  - Network edge servers to stream over lower-bandwidth last-mile links at good quality for VoD or multicasting applications
  - Set Top Boxes (STBs) with Personal Video Recorder (PVR) functionality to record more hours of content
- Complexity of transcoding is higher as
  - Compressed domain transcoding is not possible
  - Different block sizes are employed in MPEG-2 and H.264
  - Tools such as intra prediction, deblocking and CABAC are of higher complexity
- Re-use of information can be used to reduce complexity

Lower Complexity Transcoder
- Re-uses information obtained from the MPEG-2 stream: motion, modes, residual energy, bit allocation, etc.
- Drift-free as re-encoding is not done in the decoding loop
[Block diagram: MPEG-2 MP@ML video decoder -> frames -> optional processing -> H.264 MP@L3 video transcoder; coding information is passed from the decoder to the transcoder]

Compression Efficiency Gain
- Transcoded quality can never be the same as the MPEG-2 stream quality unless losslessly coded
  - Details in video cannot be distinguished from artifacts in decoded video
  - Optional processing can reduce artifacts or change resolution
- Similar visual quality is possible with coding tools such as
  - Better intra prediction: in broadcast, I-frames occur every half a second and consume ~35% of the total bits
  - In-loop filter
  - CABAC: better entropy coding
  - Quarter-sample and segmented MC: reduces residuals to code
- 25-30% reduction is possible at similar quality with the above tools
- Further reduction with multiple reference frames, weighted prediction, etc.

Need for Multi-processor Transcoding
- Brute force transcoding
  - MPEG-2 decoder cascaded with an H.264 encoder
  - Only CIF MP@ML transcoding is possible on a single 600 MHz DM642
- Low complexity transcoding
  - Several approaches: transform domain, re-use of information
  - 3/8ths-D1 MP@ML transcoding is possible on a single 600 MHz DM642
- Single processor solutions
  - Cater to the low-end market (sub-SDTV resolutions)
  - Do not scale or adapt to market / application needs
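The single-processor figures above suggest a rough way to size a cluster. The sketch below is not from the presentation: it assumes that "3/8ths-D1" means 3/8 of the macroblock count of a 720x480 frame, that throughput scales linearly with the number of PUs at the same frame rate, and that inter-PU exchange overhead is negligible. The function names are hypothetical.

```python
from fractions import Fraction
from math import ceil

MB_SIZE = 16  # macroblock edge in luma samples


def macroblocks(width, height):
    """Macroblocks per frame, rounding each dimension up to a multiple of 16."""
    return ((width + MB_SIZE - 1) // MB_SIZE) * ((height + MB_SIZE - 1) // MB_SIZE)


# Assumption: one DM642 sustains 3/8 of the macroblocks of a 720x480 (D1) frame.
D1_MBS = macroblocks(720, 480)
PER_PU_CAPACITY = Fraction(3, 8) * D1_MBS


def pus_for_mbs(mbs_per_frame):
    """PUs needed for a given macroblock count, assuming linear scaling."""
    return ceil(Fraction(mbs_per_frame) / PER_PU_CAPACITY)


def pus_for_frame(width, height):
    """PUs needed for a given frame size, assuming linear scaling."""
    return pus_for_mbs(macroblocks(width, height))


if __name__ == "__main__":
    print("3/4 D1:", pus_for_mbs(Fraction(3, 4) * D1_MBS))
    print("D1:", pus_for_frame(720, 480))
    print("1280x720:", pus_for_frame(1280, 720))
```

Under these assumptions, 3/4 D1 needs two PUs, which agrees with the demonstration setup later in the deck. The estimates for larger frames are only a lower bound, since real scaling is unlikely to be perfectly linear.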

Multi-Processing (MP)
- Several Processing Units (PUs) work together towards a common objective
- Many approaches to design such a system
  - Functional split
  - Spatial split
- Common objectives for MP design
  - Load balancing
  - Flexible to changes in the complexity of the algorithm / input
  - Adapts to the number of PUs in the cluster: an increase in workload is handled by increasing the number of PUs
  - Efficient in using the resources (memory, inter-PU bandwidth)

TI C64x DSP Core
- High speed VLIW architecture (VelociTI)
- Orthogonal functional units: 2 datapaths with 4 functional units each
- Packed SIMD operations on 8- or 16-bit data
- 2-level cache (16 kB I and D L1; 256 kB L2)
- 64-channel enhanced DMA (TCInt, chaining, linking)
- Ideally suited for high performance 8-bit video processing

Multiprocessor DM642 Boards
- Boards from vendors such as Vitec Multimedia, Mango DSP
- Typical designs are based on
  - Multiple DM642s with their dedicated SDRAMs
  - An FPGA-based DMA capability from the SDRAM of one DM642 to the SDRAM of any other DM642
  - Ability to concurrently handle independent transfers
  - PCI connectivity with host
  - Carrier or PMC form factors
- Examples: Vitec VP3-PMC board

Video Codec Multiprocessing Aspects
- Spatial prediction/availability constraints
  - Cannot proceed without spatial neighbor information availability
- Temporal prediction constraints
  - Motion vector can point anywhere in the reference frame
- Bandwidth of inter-processor communications
  - Intermediate precisions in video will vary
- SDRAM access
  - All code may not fit in internal memory
  - All scratch or persistent data may not fit in internal memory
  - Cache misses need to be balanced with inter-SDRAM transfers
- Scalability to resolutions and features

Functional split Part I
- Output of one PU is the input to the other
[Timing diagram, in order of execution]
- Task 0 - Thread 0; wait on thread 0
- Task 1 - Thread 0; Task 0 - Thread 1; wait on thread 1
- Task 2 - Thread 0; Task 1 - Thread 1; Task 0 - Thread 2
- Task 2 - Thread 1; Task 1 - Thread 2
- Task 2 - Thread 2
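The staggered Task/Thread pattern on this slide is a software pipeline: each PU runs one task, and a unit of work (a "thread" in the slide's wording) moves from task to task. A minimal sketch of that schedule follows, assuming every task takes one time step and each task runs on its own PU. The function name and the equal-step assumption are illustrative only.

```python
def pipeline_schedule(num_tasks, num_units):
    """Return, per time step, the (task, unit) pairs that run concurrently.

    Assumes task k runs on PU k and that unit u can enter task k only after
    it has left task k - 1, with every task taking exactly one time step.
    """
    steps = []
    for t in range(num_tasks + num_units - 1):
        active = [
            (task, t - task)
            for task in range(num_tasks)
            if 0 <= t - task < num_units
        ]
        steps.append(active)
    return steps


if __name__ == "__main__":
    for step, active in enumerate(pipeline_schedule(3, 3)):
        labels = ", ".join(f"Task {k} - Thread {u}" for k, u in active)
        print(f"step {step}: {labels}")
```

With three tasks and three work units the pipeline is full for only one of the five steps, which illustrates why an unbalanced or short pipeline leaves PUs waiting.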

Functional split Part II
- Issues with functional split
  - Careful partitioning is needed to balance loads across PUs
  - Does not scale well with the complexity of the algorithm
  - Does not adapt to the number of PUs in the cluster: adding extra PUs may not help process extra workload
  - Inter-processor data exchange tends to be higher
- Not everything about it is bad
  - Code size on each PU is lower

Spatial split Part I
- All PUs are identical
- Divide the input instead of tasks
[Diagram: input divided among PU-0, PU-1 and PU-2]

Spatial split Part II
- Advantages
  - Load balancing is inherent since PUs are alike
  - Scales well with the complexity of the algorithm
  - Adapts well to the number of PUs in the cluster: extra workload can be handled by increasing the number of PUs
  - Efficient in handling resources: inter-processor data exchange can be kept low
- Disadvantages
  - Code size hit as each CPU runs the same code
  - Large internal memory, cache and intelligent code sectioning mitigate the impact of the increased code size

DM642 Code Management
- L1P is 16 kB
- L2 can be configured as 224 kB ISRAM and 32 kB cache, as application code for a transcoder is minimal
- Load time vs. run time address specification allows dynamic downloading of code sections
  - Facilitates overlay of multiple code sections in ISRAM
- Structure code so as to perform the same processing operation on multiple data to reduce code thrash in L1P

Spatial split Architecture
- PU-0 is the master PU; PU-1 and PU-2 are slave PUs
- Each PU encodes an equal number of MB rows as a slice
  - Easily done since every MPEG-2 row starts on a new slice
- PUs may have only a right or a left neighbor, or both
- Application sends control / config parameters to the master PU
- PUs exchange portions of their reconstructed picture buffers (MPEG-2 and H.264) with their neighbors
- PUs synchronize with their neighbors and the master PU
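The row-based partition described on this slide can be sketched as follows. This is an illustration, not the authors' implementation: the function name and the dictionary layout are hypothetical, and handing leftover rows to the first PUs when the row count does not divide evenly is an assumption the slide does not cover.

```python
def split_mb_rows(total_mb_rows, num_pus):
    """Assign contiguous macroblock rows to PUs, one slice region per PU.

    Each entry records the PU's first row, its row count and its left and
    right neighbors (None at the ends of the chain). Leftover rows go to
    the first PUs, which is an assumption made for this sketch.
    """
    base, extra = divmod(total_mb_rows, num_pus)
    parts = []
    first_row = 0
    for pu in range(num_pus):
        num_rows = base + (1 if pu < extra else 0)
        parts.append({
            "pu": pu,
            "first_row": first_row,
            "num_rows": num_rows,
            "left": pu - 1 if pu > 0 else None,
            "right": pu + 1 if pu < num_pus - 1 else None,
        })
        first_row += num_rows
    return parts


if __name__ == "__main__":
    # A 720x480 frame has 480 / 16 = 30 macroblock rows.
    for part in split_mb_rows(30, 3):
        print(part)
```

In this layout PU-0 and the last PU have a single neighbor and every other PU has two, which matches the "right or left neighbor, or both" case on the slide.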

Spatial split Concurrency
- Little dependency across PUs
- PUs are equally loaded => synchronization time is minimal
- Some bottlenecks exist however
  - H.264 MP does not allow Arbitrary Slice Ordering (ASO)
    - Bitstream needs to be concatenated before sending
    - Similar constraints do not exist in H.264 BP
  - De-blocking should be done in a strict raster-scan order
    - Disabling de-blocking across slices overcomes the constraint
  - Rate control modifications
    - Slice level bit allocation model
    - Possibility of quality deterioration
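Because H.264 Main Profile does not permit arbitrary slice ordering, the coded slices from the PUs have to be emitted in raster order even if the PUs finish in a different order. A minimal sketch of that concatenation step is shown below. It assumes that PU index order equals the top-to-bottom order of the slices and that the whole picture is buffered before sending; the function name and the byte strings are placeholders, not real slice data.

```python
def concatenate_slices(slices_by_pu):
    """Join per-PU coded slices into one picture's bitstream in PU order.

    slices_by_pu maps a PU index to the bytes of the slice that PU produced.
    Entries may have been inserted in any arrival order; the output is always
    ordered by PU index, which this sketch assumes equals raster order.
    """
    return b"".join(slices_by_pu[pu] for pu in sorted(slices_by_pu))


if __name__ == "__main__":
    # Placeholder payloads; PU-2 finished first, then PU-0, then PU-1.
    arrived = {}
    arrived[2] = b"<slice-2>"
    arrived[0] = b"<slice-0>"
    arrived[1] = b"<slice-1>"
    print(concatenate_slices(arrived))
```

The cost of this step is the wait for the slowest PU, which is why equal loading of the PUs keeps the synchronization time small.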

Spatial split Load sharing
- Simple rectangular partitions may not always suffice
  - e.g. partitioning an NTSC sequence across 2 PUs
  - Requires the MPEG-2 decoder to decode two extra MB rows

Resource requirements
- Spatial split configuration
  - 2 PUs
  - Inter-PU data bandwidth ~134 Mbps
- Functional split configuration
  - 1st PU: MPEG-2 decode + H.264 motion estimation
  - 2nd PU: the remainder of the encoding operation
  - Inter-PU data bandwidth approx. 330 Mbps
- Resource requirements in functional split depend on the split, but changing the split may disturb the load balance
- In general, resource (spatial) <= resource (functional)

Functional vs. Spatial split Comparison
[Table with columns "Functional split" and "Spatial split"; one X per row]
- Load sharing: X
- Flexibility: X
- Inter-PU data bandwidth: X

Demonstration Setup
[Block diagram: host PC hard disk, PCI, MPEG-2 bitstream into and H.264 bitstream out of the MPEG-2 to H.264 transcoder on a Vitec VP3-PMC (5x DM642) card; the H.264 bitstream is streamed over Ethernet to an H.264 streaming client on a DM642 EVM driving a TV]
- Only 2 DM642 processors are used for 3/4th-D1 transcoding
- Bit-rate reduction of 33% compared to the input
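To put the 33% bit-rate reduction in terms of the PVR use case mentioned earlier, the arithmetic can be sketched as below. The 4 Mbps input rate and the 80 GB disk are hypothetical values chosen only for illustration; neither figure appears in the presentation.

```python
def output_bitrate(input_bps, reduction):
    """Bit-rate after transcoding, given a fractional reduction (0.33 = 33%)."""
    return input_bps * (1.0 - reduction)


def recording_hours(capacity_bytes, bitrate_bps):
    """Hours of video that fit in a given capacity at a constant bit-rate."""
    return capacity_bytes * 8 / bitrate_bps / 3600


if __name__ == "__main__":
    input_bps = 4_000_000            # hypothetical MPEG-2 input rate
    capacity = 80 * 10**9            # hypothetical 80 GB disk
    h264_bps = output_bitrate(input_bps, 0.33)
    before = recording_hours(capacity, input_bps)
    after = recording_hours(capacity, h264_bps)
    print(f"MPEG-2: {before:.1f} h, H.264: {after:.1f} h, gain x{after / before:.2f}")
```

Whatever the capacity and input rate, a 33% reduction in bit-rate corresponds to roughly 1.5 times the recording time, since the gain is 1 / (1 - 0.33).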

Conclusions
- Spatial split is superior to functional split in terms of
  - Scaling with resolution
  - Memory bandwidth
  - Ease of design and maintenance
- Multi-DM642 transcoder
  - Scalable from sub-D1 to HD resolutions without effort
  - Algorithms can be improved without affecting the design

Scalable Multi-DM642-based MPEG-2 to H.264 Transcoder. Arvind Raman, Sriram Sethuraman, Ittiam Systems (Pvt.) Ltd., Bangalore, India
sriram.sethuraman@ittiam.com, arvind.raman@ittiam.com

Application Scenario 1
[Diagram: entertainment-quality A/V MPEG-2 broadcast (satellite/cable/terrestrial/IP) -> edge server (transcode / trans-scale to H.264) -> lower-bandwidth last-mile link -> streaming client]