A Parallel Transaction-Level Model of H.264 Video Decoder


Center for Embedded Computer Systems
University of California, Irvine

A Parallel Transaction-Level Model of H.264 Video Decoder

Xu Han, Weiwei Chen and Rainer Doemer

Technical Report CECS-11-03
June 2, 2011

Center for Embedded Computer Systems
University of California, Irvine
Irvine, CA 92697-3425, USA
(949) 824-8059
{hanx, weiweic, doemer}@uci.edu
http://www.cecs.uci.edu/

Abstract

The H.264 video decoder is a computationally demanding application. In resource-limited embedded environments, it is desirable to exploit parallelism when implementing an H.264 decoder. After reviewing the relevant technical details of the H.264 standard, we discuss several possibilities for parallelization and develop a transaction-level model (TLM) with parallel slice decoders. Extensive experiments demonstrate the benefit of our model.

Contents

1 Introduction
2 H.264/AVC Standard Features
  2.1 YUV Color Space and 4:2:0 Sampling
  2.2 Macroblocks
  2.3 Slices
  2.4 H.264 Decoding Algorithm
3 Parallelism Exploitation
4 A SpecC Model with Parallel Slice Decoders
5 Experiment and Results
6 Conclusion
References

List of Figures

1 A typical multimedia data unit
2 4:2:0 Sampling
3 Into a Macroblock
4 Nine Intra 4x4 prediction modes
5 Motion Compensation
6 An H.264 video frame divided into four fixed-size slices
7 Possible Slice Patterns with FMO enabled
8 Block Diagram of H.264 Decoder in JM 13.0
9 Construction of a macroblock in JM 13.0 H.264 decoder
10 Recoding from JM Reference to SpecC model

List of Tables

1 Simulation results of H.264 decoder (Harbour, 299 frames, 4 slices each, 30 fps)
2 Simulation speedup with different H.264 streams (spec model)

1 Introduction

The H.264/AVC video coding standard [6] is widely used in video applications such as internet streaming, disc storage, and television services. H.264/AVC provides high-quality video at less than half the bit rate of its predecessors H.263 and H.262. In multimedia services, combining H.264 video with MP3 audio, as in Figure 1, can be considered the best compromise between quality and size. However, H.264 encoding and decoding also require more computing resources than earlier standards. To implement the standard on resource-limited embedded systems, it is highly desirable to exploit the parallelism available in its algorithm.

The rest of the report is organized as follows: several features of the H.264 standard are introduced in the next section. Section 3 discusses the parallelism we can exploit to build our model. Section 4 describes the parallel H.264 decoder model, followed by experiments and results showing the benefit of the parallelism. Section 6 concludes the report.

2 H.264/AVC Standard Features

Figure 1: A typical multimedia data unit (tracks: H.264 video, MP3 audio, timed text)

This section introduces the features of the H.264 standard that are most relevant to our decoder model.

2.1 YUV Color Space and 4:2:0 Sampling

The H.264 standard represents the color in a picture in YUV format. The Y component (called luma) represents brightness; the U and V components (called chroma) represent color. Because the human visual system is more sensitive to brightness than to color, H.264 keeps more luma samples than chroma samples to balance picture quality and compression rate. In particular, the most popular sampling structure, in which each chroma component has one quarter as many samples as the luma component, is called 4:2:0 sampling. Each sample is represented by 8 bits of data.

Figure 2: 4:2:0 Sampling

2.2 Macroblocks

A macroblock is the basic encoding and decoding unit in the H.264 standard. One macroblock covers a fixed-size rectangular picture area of 16x16 pixels, which, with 4:2:0 sampling, contains 16x16 luma samples and 8x8 samples of each chroma component, as illustrated in Figure 3. To decode a macroblock, either intra prediction or inter prediction is required.

Figure 3: Into a Macroblock

Intra prediction constructs a macroblock using neighboring samples from previously decoded macroblocks in the current picture. There are three intra prediction modes. Intra 4x4 mode predicts a macroblock in smaller blocks of size 4x4. Intra 16x16 mode predicts the whole 16x16 block at once and is suitable for smooth areas of a picture. I_PCM mode simply bypasses the prediction and transform coding of the macroblock to preserve the exact video samples. Figure 4 shows how Intra 4x4 prediction refers to video samples to the left of and above the block to be predicted, i.e., in the previously decoded area.

Figure 4: Nine Intra 4x4 prediction modes

Inter prediction, also known as motion compensation, predicts a macroblock by referring to previously decoded pictures. A macroblock can be partitioned into sub-blocks for inter prediction. The available sub-block sizes are 16x16, 16x8, 8x16, and 8x8; an 8x8 block can be further partitioned into 8x4, 4x8, or 4x4 blocks. Figure 5 illustrates how motion compensation is done with two reference pictures. A picture reference index and a motion vector are required to perform motion compensation.

Figure 5: Motion Compensation

2.3 Slices

A slice is a sequence of macroblocks; conversely, an H.264 video frame can be split into one or more slices. Figure 6 shows an example of a frame divided into four slices (Slice 0 through Slice 3). Note that slices can take very flexible shapes and sizes when the Flexible Macroblock Ordering (FMO) feature is enabled, as shown in Figure 7, where each color represents a slice group, which can contain one or more slices.

Figure 6: An H.264 video frame divided into four fixed-size slices.

Figure 7: Possible Slice Patterns with FMO enabled

Slices are classified into three types according to their coding style. A slice with all macroblocks coded using intra prediction is called an I slice. In addition to intra prediction, a P slice may contain inter-predicted macroblocks with one reference frame per prediction block. In addition to the coding types allowed in a P slice, a B slice may contain inter-predicted macroblocks with two reference frames per prediction block. Notably, slices are independent of each other, in the sense that decoding one slice does not require data from the other slices in the current video frame. However, to correctly decode a frame, reference frames are needed if the frame contains P slices, and information from other slices is still needed when the deblocking filter works across slice boundaries.

2.4 H.264 Decoding Algorithm

Our study is based on the H.264/AVC JM reference software [4]. This section introduces the H.264 decoding algorithm as implemented in JM version 13.0. JM 13.0 relies heavily on global storage (depicted in parallelograms in Figure 8) in the decoding process. In general, it accesses input image parameters in the structure img, temporary output data in the structure dec_picture, and decoded frames in the decoded picture buffer dpb.

As shown in Figure 8, decoding starts by parsing incoming NAL units. A NAL unit is a logical data packet containing H.264 syntax structures; it carries both coded video data and the information needed to decode that data. After parsing, the coded residual data is entropy decoded, inverse quantized, reordered, and inverse transformed, resulting in decoded residual data stored in the array m7, which later forms part of a decoded macroblock. In parallel, the parsed macroblock types and prediction mode information are used to construct the current macroblock by predicting its content. Depending on how the macroblock was encoded in the first place, either intra prediction or motion compensation is applied to acquire the predicted data, which is stored in the array mpr in JM 13.0. At this point, the macroblock has all necessary data decoded and can be constructed by adding the transformed residual data and the predicted data together. On finishing one macroblock, the decoder proceeds to the next macroblock in the current slice or to the next slice. After an entire frame is decoded, the deblocking filter is applied to remove blocking artifacts. The output of the deblocking filter is stored in the decoded picture buffer, whose buffered frames serve as references for decoding future frames.

Figure 8: Block Diagram of H.264 Decoder in JM 13.0

3 Parallelism Exploitation

Various kinds of parallelism exist in the H.264 standard. On the frame level, decoding a frame may depend on reference frames at very flexible positions and in flexible numbers when inter prediction is applied. Although some frames may be intra predicted only, it is still very difficult to exploit parallelism at the frame level.

On the macroblock level, we first review how a macroblock is constructed in JM 13.0. As illustrated by Figure 9, the decoder first acquires the predicted data (mpr) by referring to the previously decoded area; since this prediction depends on neighboring macroblocks, macroblock-level parallelism is tightly constrained by these data dependencies.

As for the deblocking filter, although it operates across the boundaries of macroblocks and slices, deblocking a macroblock only depends on its neighboring macroblocks, since its goal is to remove block edges. Therefore, it is possible to filter a frame in parallel, though such parallelism is not easy to exploit.

Due to the independence of slices, it is most promising to exploit parallelism at the slice level.
We developed a SpecC model with four parallel slice decoders, which is expected to be most efficient when the incoming stream has four balanced slices per frame.

Figure 9: Construction of a macroblock in JM 13.0 H.264 decoder (mpr: predicted block; cof: decoded coefficients (residual); m7: residual block)

4 A SpecC Model with Parallel Slice Decoders

Figure 10: Recoding from JM Reference to SpecC model (Slice Reader, four Slice Decoders, and a Synchronizer inside the DUT)

An initial system-level model with a Simulator, DUT, and Monitor structure was designed in [5]. After identifying parallelism at the slice level, we have extended this existing system-level model by further recoding the JM reference code. The decoding procedure in JM is performed by the function decode_one_frame, which calls the subfunctions read_new_slice, decode_slice, and exit_picture, as shown in Figure 10. During recoding, read_new_slice becomes the Slice Reader, decode_slice becomes the four Slice Decoders, and exit_picture becomes the Synchronizer. For an input video stream encoded with four slices per frame, the Slice Reader reads out four slices in every cycle and dispatches them to the Slice Decoders via channel interfaces. Each Slice Decoder internally consists of the regular H.264 decoder functions, such as entropy decoding, inverse quantization and transformation, motion compensation, and intra prediction. When all Slice Decoders have finished their work in parallel, the Synchronizer completes the decoding by applying the deblocking filter to the decoded frame. The model can be manually adjusted to decode video streams with different numbers of slices.

5 Experiment and Results

We use the System-on-Chip Environment (SCE) [2] for synthesis and validation of our H.264 decoder model. SCE is a refinement-based framework for heterogeneous MPSoC design. It starts with a system specification model described in the SpecC language [3] and implements a top-down ESL design flow based on the specify-explore-refine methodology.

We partition our parallel H.264 decoder model as follows: the four slice decoders are mapped onto four custom hardware units; the synchronizer is mapped onto an ARM7TDMI processor at 100 MHz, which also implements the overall control tasks and the cooperation with the surrounding testbench. We choose round-robin scheduling for the tasks on the processor and allocate an AMBA AHB bus for communication between the processor and the hardware units.

Using SCE, we generate transaction-level models (TLMs) of our parallel H.264 decoder design at different abstraction levels.
They are: spec, the specification model; arch, the architecture-mapped model with different kinds of processing elements (PEs); sched, the model with scheduling decisions made for the operating systems on the mapped processors; net, the model with network connectivity among the PEs; tlm, the transaction-level model with communication protocols; and comm, the pin- and cycle-accurate model with full communication details.

For our first experiment, we use the stream Harbour of 299 video frames, each with 4 slices of equal size. As shown in [1], 68.4% of the total computation time is spent in slice decoding, which we have parallelized in our decoder model. As a reference point, we calculate the maximum possible performance gain as

    MaxSpeedup = 1 / (ParallelPart / NumOfCores + SerialPart)

For 4 parallel cores, the maximum speedup is

    MaxSpeedup_4 = 1 / (0.684 / 4 + (1 - 0.684)) = 2.05

Accordingly, the maximum speedup for 2 cores is MaxSpeedup_2 = 1.52.

Table 1 lists the simulation results for several TLMs generated with SCE when using our multi-core simulator on a Fedora Core 12 host PC with a 4-core CPU (Intel(R) Core(TM)2 Quad) at

3.0 GHz, compiled with optimization (-O2) enabled. We compare the elapsed simulation time against the single-core reference simulator (the table also includes the CPU load reported by the Linux OS). Although simulation performance decreases when issuing only one parallel thread, due to the additional mutexes for safe synchronization in each channel and in the scheduler, our multi-core parallel simulation is very effective in reducing the simulation time for all models when multiple cores of the simulation host are used.

Table 1: Simulation results of H.264 Decoder (Harbour, 299 frames, 4 slices each, 30 fps).
Each parallel column lists the simulation time, the CPU load, and the speedup over the reference simulator.

  Simulator:      Reference     Multi-Core Parallel
  issued threads: n/a           1                   2                   4                   #delta cycles  #threads
  spec            20.80s (99%)  21.12s (99%)  0.98  14.57s (146%) 1.43  11.96s (193%) 1.74   76280         15
  arch            21.27s (97%)  21.50s (97%)  0.99  14.90s (142%) 1.43  12.05s (188%) 1.76   76280         15
  sched           21.43s (97%)  21.72s (97%)  0.99  15.26s (141%) 1.40  12.98s (182%) 1.65   82431         16
  net             21.37s (97%)  21.49s (99%)  0.99  15.58s (138%) 1.37  13.04s (181%) 1.64   82713         16
  tlm             21.64s (98%)  22.12s (98%)  0.98  16.06s (137%) 1.35  13.99s (175%) 1.55  115564         63
  comm            26.32s (96%)  26.25s (97%)  1.00  19.50s (133%) 1.35  25.57s (138%) 1.03  205010         75
  max. speedup    1.00                        1.00                1.52                2.05     n/a         n/a

Table 1 also lists the measured speedup and the maximum theoretical speedup for the models that we have created following the SCE design flow. The more threads are issued in each scheduling step, the more speedup we gain. The #delta cycles column shows the total number of delta cycles executed when simulating each model. This number increases as the design is refined, which is why we gain less speedup at the lower abstraction levels: more communication overhead is introduced, and the increased need for scheduling reduces the available parallelism.
However, the measured speedups are somewhat lower than the maximum, which is reasonable given the overhead introduced by parallelizing and synchronizing the slice decoders. The comparatively lower performance gain of the comm model in the simulation with 4 threads can be explained by inefficient cache utilization on our Intel(R) Core(TM)2 Quad machine.¹

Using a video stream with 4 slices in each frame is ideal for our model with 4 hardware decoders. However, we achieve simulation speedup even in less ideal cases. Table 2 shows the results when the test stream contains different numbers of slices. We also created a test stream with 4 slices per frame in which the slice sizes are imbalanced (the percentage of MBs in each slice is 31%, 31%, 31%, 7%). Here, the speedup of our multi-core simulator versus the reference one is 0.98 when issuing 1 thread, 1.28 for 2 threads, and 1.58 for 4 threads. As expected, the speedup decreases when the available parallel work load is imbalanced.

¹ The Intel(R) Core(TM)2 Quad implements a two-pairs-of-two-cores architecture with Intel Advanced Smart Cache technology for each core pair (http://www.intel.com/products/processor/core2quad/prod_brief.pdf).

Table 2: Simulation speedup with different h264 streams (spec model).

  issued threads:   n/a (Reference)  1     2     4
  slices/frame  1   1.00             0.98  0.98  0.95
                2   1.00             0.98  1.40  1.35
                3   1.00             0.99  1.26  1.72
                4   1.00             0.98  1.43  1.74
                5   1.00             0.99  1.27  1.53
                6   1.00             0.99  1.41  1.68
                7   1.00             0.98  1.30  1.55
                8   1.00             0.98  1.39  1.59

6 Conclusion

In this work, we have discussed the options for parallelism in H.264 decoding and developed a SpecC model with four parallel slice decoders. We have refined the model using the SCE design flow and performed extensive experiments with our multi-core simulator. The results show satisfactory speedup for our parallel decoder model.

Acknowledgment

This work has been supported in part by funding from the National Science Foundation (NSF) under research grant NSF Award #0747523. The authors thank the NSF for the valuable support. Any opinions, findings, and conclusions or recommendations expressed in this material are those of the authors and do not necessarily reflect the views of the National Science Foundation.

References

[1] Weiwei Chen, Xu Han, and Rainer Dömer. Multicore Simulation of Transaction-Level Models Using the SoC Environment. IEEE Design & Test of Computers, 28(3):20-31, May-June 2011.

[2] Rainer Dömer, Andreas Gerstlauer, Junyu Peng, Dongwan Shin, Lukai Cai, Haobo Yu, Samar Abdi, and Daniel Gajski. System-on-Chip Environment: A SpecC-based Framework for Heterogeneous MPSoC Design. EURASIP Journal on Embedded Systems, 2008(647953):13 pages, 2008.

[3] Andreas Gerstlauer, Rainer Dömer, Junyu Peng, and Daniel D. Gajski. System Design: A Practical Guide with SpecC. Kluwer, 2001.

[4] H.264/AVC JM Reference Software. http://iphome.hhi.de/suehring/tml/.

[5] Weiwei Chen, Siwen Sun, Bin Zhang, and Rainer Dömer. System Level Modeling of a H.264 Decoder. Technical Report CECS-TR-08-10, Center for Embedded Computer Systems, University of California, Irvine, 2008.

[6] Thomas Wiegand, Gary J. Sullivan, Gisle Bjøntegaard, and Ajay Luthra. Overview of the H.264/AVC Video Coding Standard. IEEE Transactions on Circuits and Systems for Video Technology, 13(7):560-576, July 2003.