GOP Level Parallelism on H.264 Video Encoder for Multicore Architecture

2011 International Conference on Circuits, System and Simulation
IPCSIT vol. 7 (2011), IACSIT Press, Singapore

S. Sankaraiah 1 2, H. S. Lam, C. Eswaran 1+ and Junaidi Abdullah 1++
1, 1+ & 1++ Faculty of Information Technology, MultiMedia University, Cyberjaya, Selangor, Malaysia.
2 Faculty of Engineering, MultiMedia University, Cyberjaya, Selangor, Malaysia.
{sreemula.sankaraia10, hslam, eswaran, junaidi.abdullah}@mmu.edu.my

Abstract: H.264 is a popular codec used for encoding videos that are hosted on video servers and delivered over the internet. Achieving real-time encoding remains a challenging problem. A possible way to minimize the encoding time is to develop applications with a high level of Thread-Level Parallelism (TLP) that exploit the power of multi-core processors. Parallelization strategies at various levels, such as macro-block level, slice level and frame level, have been proposed by various authors. Most of these techniques suffer from limited scalability and data dependency. In this paper we propose a high-level parallelization method based on the Group-Of-Pictures (GOP). In this method, each GOP is encoded independently and the frames being referenced are included within the GOP. The OpenMP programming model is used to restructure the H.264 encoder for GOP-level parallelism, so that the available hardware resources can be used for concurrent processing. The results show that the implemented strategy provides a high level of parallelism and efficiently exploits the capabilities of the multi-core system. The speedup achieved with the proposed method is 5.6 to 10 times over a well-optimized sequential implementation.

Keywords: Video encoding, H.264, Parallel Programming, TLP, GOP, ME, TP, OpenMP, Multi-core, Dual Processor (DP) and Quad Processor (QP).

1. Introduction

H.264 is currently the most popular high-quality video coding standard [1]. The standard is designed to serve a broad range of applications, from low to high bitrates, from low to high resolutions, and across a variety of networks and systems, i.e., internet streams, mobile streams, disc storage and broadcast. H.264 was developed with many advanced features, which make its encoding process require more computation power than other existing standards [2]. Hence, there is a need to speed up the encoder. One possible way of improving the speed is to process the data in parallel [3]. This paper describes how to efficiently restructure the H.264 encoder using GOP parallelization.

The remainder of this paper is organized as follows. Section 2 provides an overview of the parallelization of H.264. Section 3 presents the simulation environment and the experimental methodology used to evaluate the dynamic scheduling with the group of pictures (GOP) access pattern. Section 4 discusses in detail the implementation of H.264 parallelism with the GOP pattern on multicore systems. Section 5 presents the simulation results, an analysis of the scalability and performance of the GOP-level parallelism, and the impact of parallelization overhead. Section 6 gives the conclusion and possible future work.

2. Previous works on Parallelization of H.264

The high-quality output of advanced video codecs such as H.264 comes at the price of increased computational complexity. As a result, the current high-performance Uni-Processor (UP) architecture is not capable of providing the required performance [4]. Thus, it is necessary to exploit parallelism. The H.264 codec can be parallelized using either Task-Level or Data-Level Decomposition. In the Task-level Decomposition (TLD) method, the functional partitions of the algorithm are assigned to different processors. The main drawbacks of the TLD method are load balancing and scalability constraints. In the Data-level Decomposition (DLD) method, the data is divided into smaller parts and each part is assigned to a different processor. Each processor therefore runs the same program on a different set of data elements. In the H.264 encoding process, the DLD method can be implemented at various levels of the data structure, such as GOP level, frame level, slice level, macro-block level and block level.

The implementation of parallelism at various levels of the H.264 codec has been described in several papers. Rodriguez et al. implemented the H.264 encoder using frame-level parallelism combined with a group of frames on clustered workstations using the Message Passing Interface (MPI) [5]. Although real-time operation can be achieved with this approach, the latency is very high. Chen et al. presented a parallel implementation that encodes and decodes several B frames in parallel [6]. This limits the scalability to a few threads. Our proposed approach solves this problem by dynamically detecting the dependencies and automatically exploiting the parallelism. Van der Tol et al. presented the exploitation of intra-frame MB-level parallelism and suggested combining it with frame-level parallelism [7]. In their frame-level method the parallelism is determined statically by the length of the motion vectors, while in our approach the parallelism is determined dynamically.

In terms of scalability, independence, load balancing and utilization of the processing cores, GOP-level parallelism has many advantages over the other methods. Scalability can be easily achieved by increasing the number of processing cores and by applying homogeneous software optimization techniques to each core.
The same concept can be applied to full-HD (1920×1080) video encoding. Experiments show that, as the number of processing cores increases, the performance improves almost linearly. As per Moore's law, it is expected that the number of cores on a CMP will double every three years, resulting in approximately 150 high-performance cores on a single die in the year 2017 [8]. This increases the challenge of building highly scalable applications that exploit the capability of multi-core systems by balancing the load among the processing cores. Stenstrom et al. [8] suggest various techniques for analyzing scalability in terms of parallelism. This paper focuses on a new parallelization strategy that provides sufficient scalability to fully utilize the processing cores of future systems.

3. Methodology and Simulation environment

This section describes the tools and methodology used to implement and evaluate the dynamic scheduling technique based on GOP-level parallelism. The computations on the processing cores are modeled on the basis of accurately implemented cycle counts. The memory system is modeled using average transfer times with channel and bank contention. It is assumed that each core has its own L1 data cache and that data can be copied from other L1 caches through 4 channels. The processing cores share a distributed L2 cache with 8 banks and an average access time of 40 cycles. The average access time takes into account the L2 hits, the misses and the interconnect delays. Because L2 bank contention is modeled, two cores cannot access the same bank simultaneously. The multi-core programming model follows the task pool model. In this approach, one main thread and several slave threads are created. The task execution overhead is very low, and the time to request a task is less than 2% of the entire GOP encoding time.
The experiments focus on the modified main profile of the H.264 standard, as this profile supports I, P and B frames. The simulation was conducted using the JM 17.2 reference software compiled with Visual Studio 2008 on two platforms: (1) a Dell laptop with an Intel Core2 Duo CPU T5750 running Windows XP at 2.0GHz, with 32KB L1 D-cache, 32KB L1 I-cache, 2MB 8-way set-associative L2 cache and 2GB RAM; (2) a Dell desktop with an Intel Core2 Quad 9400 running Windows 7 Ultimate 64-bit at 3.0GHz, with 64KB L1 D-cache, 64KB I-cache, 4MB 8-way set-associative L2 cache and 4GB RAM. The encoding and elapsed times of each thread are measured with Intel Parallel Studio 2011 and AMD CodeAnalyst. All video sequences used in the simulation have QCIF and CIF resolutions.

4. Implementation of Parallel H.264

To achieve good data parallelism, the sets of data that can be treated independently and fed to a processing element must be determined. In GOP-level parallelism, each GOP is handled by a separate thread. GOPs are assigned to different processor threads, and each thread processes multiple sequences of frames. This method uses the temporal division of frames to implement parallelism. In the GOP data access pattern, dependencies exist among the frames within a GOP but not between two GOPs, so each thread can process its GOP independently without referencing any frame outside the GOP. Figure 1 shows the independent GOP access pattern of the frames.

For this data access pattern, the memory hierarchy needs to store large amounts of data but requires considerably less synchronization, because the system exhibits a coarser granularity of parallelism. This coarse granularity characterizes the data access pattern, and the system memory becomes a bottleneck because the smaller L1 and L2 memory levels are insufficient to hold multiple frames of data [10]. In the proposed approach, all the frames of a GOP are stored in a temporary buffer and sequentially transferred to the corresponding cores for processing. Odd-numbered GOPs are processed by core 1 and even-numbered GOPs by core 2. In a dual-core system, the two cores share the L2 cache, which is connected to the main memory by a separate bus. The proposed GOP-level parallelism uses closed GOPs, so there is no reference between the two GOPs processed by the two cores. In this implementation no additional core is used for task scheduling; one of the available cores is assigned this task. Figure 2 shows the implementation of GOP-level parallelism with threads. Two GOP buffers are used for moving the raw images; they store the incoming frames first whenever they have space.
The frames are then scheduled into 4 temporary buffers according to frame type (I, P and B), as shown in Figure 2. One master thread handles the input/output processes, such as checking data dependencies, and it runs on whichever core is free. Four working threads are created to encode the frames waiting in the temporary buffers. The number of threads created should match the number of processing cores available in the system. All operations are synchronized sequentially through the GOP buffers by the master thread. Figure 3 shows the steps involved in the encoding process.

Fig 1: The GOP frame access pattern

Fig 2: Implementation of the GOP-level parallelism with threads
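The GOP-level scheme described in this section can be sketched with OpenMP roughly as follows. This is a minimal illustration under stated assumptions, not the restructured JM code: `GopRange`, `gop_count`, `gop_range`, `reference_allowed`, `encode_gop` and `encode_sequence` are hypothetical names, frames are assumed to be numbered from 0 in a single sequence, and `encode_gop` is a placeholder that only counts frames where a real encoder would encode them.

```c
#include <assert.h>

/* Hypothetical description of one closed GOP: a contiguous run of frames. */
typedef struct {
    int first_frame;  /* index of the first frame of the GOP (the I frame) */
    int num_frames;   /* frames in this GOP; the last GOP may be shorter */
} GopRange;

/* Number of GOPs needed to cover total_frames with a fixed GOP size. */
int gop_count(int total_frames, int gop_size)
{
    assert(total_frames >= 0 && gop_size > 0);
    return (total_frames + gop_size - 1) / gop_size;
}

/* Frame range covered by GOP number gop_index (0-based). */
GopRange gop_range(int gop_index, int total_frames, int gop_size)
{
    GopRange r;
    assert(gop_index >= 0 && gop_index < gop_count(total_frames, gop_size));
    r.first_frame = gop_index * gop_size;
    r.num_frames = total_frames - r.first_frame;
    if (r.num_frames > gop_size)
        r.num_frames = gop_size;
    return r;
}

/* With closed GOPs, a frame may only reference frames of its own GOP. */
int reference_allowed(int frame, int ref_frame, int gop_size)
{
    assert(frame >= 0 && ref_frame >= 0 && gop_size > 0);
    return frame / gop_size == ref_frame / gop_size;
}

/* Placeholder for the per-GOP encoder: reports the frames it consumed. */
int encode_gop(GopRange r)
{
    return r.num_frames;
}

/* Encode every GOP of the sequence; returns the number of frames encoded.
 * Each loop iteration is one independent GOP, so iterations can be handed
 * to any free thread (task pool behaviour via dynamic scheduling). */
int encode_sequence(int total_frames, int gop_size)
{
    int n = gop_count(total_frames, gop_size);
    int encoded = 0;
    int g;

    #pragma omp parallel for schedule(dynamic, 1) reduction(+:encoded)
    for (g = 0; g < n; g++) {
        encoded += encode_gop(gop_range(g, total_frames, gop_size));
    }
    return encoded;
}
```

With 300 frames and a GOP size of 15, `encode_sequence(300, 15)` walks over 20 independent GOPs. Replacing `schedule(dynamic, 1)` with `schedule(static, 1)` makes OpenMP hand out GOPs round-robin, which on two threads resembles the fixed odd/even assignment described above. Compiled without OpenMP support, the pragma is ignored and the loop simply runs sequentially.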

Fig 3: The flow of the encoding process

5. Experimental results and Discussions

This section presents the experimental results, which include the PSNR, the total encoding time, the ME time and the bit-rate of the video. Two different types of video sequences are considered for testing. Table 1 presents the results for the slow-motion Grandma video sequence, and Table 2 presents the results for the high-motion Foreman video sequence. The results were obtained by running tests with 300 frames on both dual-core and quad-core processors, using GOP-level parallelism with an I frame as the starting frame. In Tables 1 and 2, the results obtained with GOP parallelism are compared with those obtained using the original JM. The GOP size is fixed at 15. The results show that the proposed method reduces the encoding time and the ME time, with a small reduction in the bit rate. It is also observed that the proposed method does not affect the PSNR value.

To achieve optimum performance, with higher speedup and lower bit-rate (without reducing the video quality), the GOP size should be chosen carefully. Figures 4, 5 and 6 show the effect of GOP size on PSNR, encoding time and bit-rate respectively on a quad processor. From these figures, we note that a GOP size of 15 yields the optimum results with regard to these quality parameters. The effect of the number of threads on PSNR on a quad processor is shown in Figure 7.
Parameters            Original JM with DP and QP   15GOP with DP   15GOP with QP
Average PSNR (dB)     39.19                        39.19           39.19
Total encoding time   116.18                       22.40           11.72
Total ME time         99.20                        18.43           9.36
Bit rate (Kbit/s)     91.19                        85.26           85.26

Table 1: The results of parallel encoding of less motion video sequence, Grandma_cif

Parameters            Original JM with DP and QP   15GOP with DP   15GOP with QP
Average PSNR (dB)     39.50                        39.50           36.23
Total encoding time   122.24                       24.44           12.23
Total ME time         101.39                       20.05           10.89
Bit rate (Kbit/s)     93.56                        88.28           88.28

Table 2: The results of parallel encoding of high motion video sequence, Foreman_cif
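The encoding times in Tables 1 and 2 can be turned into speedup figures with a small helper. The sketch below assumes the usual definition of speedup as the sequential time divided by the parallel time, with both times in the same unit; the function name `speedup` is illustrative and is not taken from the JM software.

```c
#include <assert.h>

/* Speedup of a parallel run relative to a sequential baseline.
 * Both arguments must be positive and expressed in the same time unit. */
double speedup(double t_sequential, double t_parallel)
{
    assert(t_sequential > 0.0 && t_parallel > 0.0);
    return t_sequential / t_parallel;
}
```

For the Grandma sequence, `speedup(116.18, 22.40)` and `speedup(116.18, 11.72)` give roughly 5.2 and 9.9 for the dual-core and quad-core runs; for Foreman, `speedup(122.24, 24.44)` and `speedup(122.24, 12.23)` give roughly 5.0 and 10.0.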

Fig 4: GOP size Vs PSNR (CIF and QCIF)

Fig 5: Encoding time (min) Vs GOP size (CIF and QCIF)

Fig 6: Bit-rate (Kbps) Vs GOP size (CIF and QCIF)

Fig 7: The PSNR Vs the number of threads

Figure 7 shows that the PSNR remains constant for both QCIF and CIF resolutions even as the number of threads increases. The results show that there is no loss of video quality after exploiting the GOP-level parallelism. Table 3 compares the performance parameters obtained for the different processors during the encoding process [9,10]. The quad-core processor shows good utilization of the front-side bus. It is observed that the bus activity does not increase significantly as the number of threads increases. The execution time is therefore reduced through better utilization of the processor resources by exploiting the optimum thread-level parallelism.

Parameters                             UP      DP      QP
Instructions per cycle                 0.689   1.71    3.02
Micro-operations per cycle             1.22    2.69    5.05
Trace cache deliver mode %             78.23   89.98   93.58
Trace cache build mode %               21.49   9.84    5.23
1st level cache load misses rate %     5.23    5.35    4.89
2nd level cache load misses rate %     0.47    0.53    0.21
Front-side-bus utilization rate %      0.59    4.59    12.35

Table 3: Micro Architecture metrics

Fig 8: Speedup Vs the number of Threads

The standard measure of speedup, defined as the ratio of the sequential execution time to the parallel execution time, is used to evaluate the performance of the proposed method. Figure 8 plots the speedup against the number of threads. It can be seen from this figure that the peak performance is achieved when the number of threads equals the number of cores. It is also observed that the speedup is almost constant (or slightly lower) when the number of threads exceeds the number of cores, because additional overhead is required to schedule the extra threads and to hold and process their information. We also observe from Figure 8 that it is possible to achieve significantly higher speedup values using GOP parallelism.

6. Conclusion and Future Work

In this paper, we have presented a method based on GOP parallelism and analyzed the parallel scalability of the H.264 video encoding process using dual-core and quad-core processors. Our proposed parallelization strategy can overcome many of the shortfalls of the other known methods, such as scalability issues and data dependency constraints. In general, the experimental results show that the GOP-level parallelism strategy efficiently exploits the capabilities of multicore processors. The speedups obtained on the dual-core and quad-core systems are 5.6 and 10 respectively, relative to the original reference software for H.264 (JM 17.2). Although the focus of this paper is on the H.264 codec, it is expected that other video codecs and multimedia applications exhibit similar characteristics. Hence, the proposed method can be extended to any computationally intensive video processing application.

7. References

[1] International Standard of Joint Video Specification (ITU-T Rec. H.264 ISO/IEC) (2009).
[2] J. Ostermann et al., Video Coding with H.264/AVC: Tools, Performance, and Complexity, IEEE Circuits and Systems Magazine 4(1) (2004) pp. 7-28.
[3] J. Hoogerbrugge et al., A Multithreaded Multicore System for Embedded Media Processing, Trans. on High-Performance Embedded Architectures and Compilers (2009).
[4] M. Drose, C. Clemen, T. Sikora, Extending Single-View Scalable Video Coding to Multi-View Based on H.264/AVC, Image Processing, 2006 IEEE Inter. Conf. on (2006) pp. 2977-2980.
[5] A. Rodriguez et al., Hierarchical Parallelization of an H.264/AVC Video Encoder, Proc. Int'l Symp. on Parallel Computing in Electrical Engineering (2006) pp. 363-368.
[6] Y. Chen, E. Li, X. Zhou, S. Ge, Implementation of H.264 Encoder and Decoder on Personal Computers, Journal of Visual Communications and Image Representation 17 (2006).
[7] E. Van der Tol, E. Jaspers, R. Gelderblom, Mapping of H.264 Decoding on a Multiprocessor Architecture, Proc. SPIE Conf. on Image and Video Communications and Processing (2003).
[8] P. Stenstrom et al., Chip-multiprocessing and Beyond, Proc. Twelfth Int'l Symp. on High-Performance Computer Architecture (2006) pp. 109-109.
[9] Y.K. Chen et al., Towards Efficient Multi-Level Threading of H.264 Encoder on Intel Hyper-Threading Architectures, Proc. of the 18th Int'l Parallel and Distributed Processing Symposium, Apr. 2004.
[10] S. Ge, X. Tian and Y.K. Chen, Efficient Multithreading Implementation of H.264 Encoder on Intel Hyper-Threading Architectures, IEEE Pacific-Rim Conf. on Multimedia, Dec. 2003.