GOP Level Parallelism on H.264 Video Encoder for Multicore Architecture


2011 International Conference on Circuits, System and Simulation
IPCSIT vol. 7 (2011), IACSIT Press, Singapore

S. Sankaraiah 1, H. S. Lam 2, C. Eswaran 1+ and Junaidi Abdullah 1++
1, 1+, 1++ Faculty of Information Technology, MultiMedia University, Cyberjaya, Selangor, Malaysia.
2 Faculty of Engineering, MultiMedia University, Cyberjaya, Selangor, Malaysia.
{sreemula.sankaraia10, hslam, eswaran, junaidi.abdullah}@mmu.edu.my

Abstract: H.264 is a popular codec for encoding videos that are hosted on video servers and delivered over the internet. Achieving real-time encoding remains a challenging problem. A possible way to reduce the encoding time is to develop applications with a high degree of Thread-Level Parallelism (TLP) that exploit the power of multi-core processors. Parallelization strategies at various levels, such as macro-block level, slice level and frame level, have been proposed by various authors. Most of these techniques suffer from limited scalability and data dependency. In this paper we propose a high-level parallelization method based on the Group of Pictures (GOP). In this method, each GOP is encoded independently, and the frames being referenced are included within the GOP. The OpenMP programming model is used to restructure the H.264 encoder for GOP-level parallelism, so that the available hardware resources can support concurrent processing. The results show that the implemented strategy provides a high level of parallelism and efficiently exploits the capabilities of the multi-core system. The proposed method runs 5.6 to 10 times faster than a well-optimized sequential implementation.

Keywords: Video encoding, H.264, Parallel Programming, TLP, GOP, ME, TP, OpenMP, Multi-core, Dual Processor (DP) and Quad Processor (QP).

1. Introduction

H.264 is currently the most popular high-quality video coding standard [1]. The standard is designed to serve a broad range of applications, from low to high bitrates, from low to high resolutions, and across a variety of networks and systems, i.e., internet streams, mobile streams, disc storage and broadcast. The H.264 codec has many advanced features, which make its encoding process require more computation power than other existing standards [2]. Hence, there is a need to speed up the encoder. One possible way of improving the speed is to process the data in parallel [3]. This paper describes how to efficiently restructure the H.264 encoder using GOP parallelization.

The remainder of this paper is organized as follows. Section 2 gives an overview of the parallelization of H.264. Section 3 presents the simulation environment and the experimental methodology used to evaluate dynamic scheduling with the group of pictures (GOP) as the access pattern. Section 4 discusses in detail the implementation of H.264 parallelism with the GOP pattern on multicore. Section 5 presents the simulation results, an analysis of the scalability and performance of GOP-level parallelism, and the impact of parallelization overhead. Section 6 gives the conclusion and possible future work.

2. Previous works on Parallelization of H.264

The high-quality output of advanced video codecs such as H.264 comes at the price of increased computational complexity. As a result, the current high-performance Uni-Processor (UP) architecture cannot provide the required performance [4]. Thus, it is necessary to exploit parallelism. The H.264 codec can be parallelized using either Task-level or Data-level Decomposition. In the Task-level Decomposition (TLD) method, the functional partitions of the algorithm are assigned to different processors. The main drawbacks of the TLD method are load balancing and scalability constraints. In the Data-level Decomposition (DLD) method, the data is divided into smaller parts and each part is assigned to a different processor, so each processor runs the same program on a different set of data elements. In the H.264 encoding process, the DLD method can be implemented at various levels of the data structure, such as GOP level, frame level, slice level, macro-block level and block level.

The implementation of parallelism at various levels of the H.264 codec has been described in several papers. Rodriguez et al. implemented the H.264 encoder using frame-level parallelism combined with a group of frames on clustered workstations using the Message Passing Interface (MPI) [5]. Although real-time operation can be achieved with this approach, the latency is very high. Chen et al. presented a parallel implementation that encodes and decodes several B frames in parallel [6]. This limits the scalability to a few threads. Our proposed approach solves this problem by dynamically detecting the dependencies and automatically exploiting the parallelism. Van der Tol et al. presented the exploitation of intra-frame MB-level parallelism and suggested combining it with frame-level parallelism [7]. In their method the frame-level parallelism is determined statically by the length of the motion vectors, while in our approach the parallelism is determined dynamically.

In terms of scalability, independence, load balancing and the utilization of processing cores, GOP-level parallelism has many advantages over other methods. Scalability can be achieved easily by increasing the number of processing cores and by applying homogeneous software optimization techniques to each core.
The same concept can be applied to full-HD (1920x1080) video encoding. Experiments show that performance improves almost linearly as the number of processing cores increases. According to Moore's law, the number of cores on a CMP is expected to double every three years, resulting in approximately 150 high-performance cores on a single die in the year 2017 [8]. This raises the challenge of building highly scalable applications that exploit multi-core capability by balancing the load among processing cores. Stenstrom et al. [9] suggest various techniques for analyzing scalability in terms of parallelism. This paper focuses on a new parallelization strategy that provides sufficient scalability to fully utilize the processing cores of the future.

3. Methodology and Simulation environment

This section describes the tools and methodology used to implement and evaluate dynamic scheduling based on the GOP-level parallelism technique. Computation on the processing cores is modeled in terms of cycle counts, which are implemented accurately. The memory system is modeled using average transfer times with channel and bank contention. It is assumed that each core has its own L1 data cache and that data can be copied from other L1 caches through 4 channels. The processing cores share a distributed L2 cache with 8 banks and an average access time of 40 cycles. The average access time takes into account L2 hits, misses and interconnect delays. Because L2 bank contention is modeled, two cores cannot access the same bank simultaneously.

The multi-core programming model follows the task pool model. In this approach, one main thread and a number of slave threads are created. The task execution overhead is very low: the time to request a task is less than 2% of the entire GOP encoding time.
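The task pool model described above can be sketched as follows. This is a minimal illustration, not the authors' OpenMP implementation: `encode_gop` and `run_task_pool` are hypothetical names, the per-GOP work is a placeholder, and CPython threads are used only to show the structure (one main thread filling a pool, slave threads requesting tasks), not to demonstrate real speedup.

```python
import queue
import threading


def encode_gop(gop_id):
    # Placeholder for the real per-GOP encoding work (hypothetical).
    return f"bitstream-for-GOP-{gop_id}"


def run_task_pool(num_gops, num_workers):
    """Main thread fills a task pool; worker threads pull GOPs until it is empty."""
    tasks = queue.Queue()
    results = {}
    results_lock = threading.Lock()

    def worker():
        while True:
            gop_id = tasks.get()      # request the next task from the pool
            if gop_id is None:        # sentinel: no more work for this worker
                return
            encoded = encode_gop(gop_id)
            with results_lock:
                results[gop_id] = encoded

    threads = [threading.Thread(target=worker) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for gop_id in range(num_gops):    # main thread enqueues every GOP
        tasks.put(gop_id)
    for _ in threads:                 # one sentinel per worker
        tasks.put(None)
    for t in threads:
        t.join()
    # Reassemble the output in display order, independent of completion order.
    return [results[g] for g in range(num_gops)]
```

Because each worker requests a new GOP only when it finishes the previous one, faster workers naturally take more tasks, which is the load-balancing property the task pool model is chosen for.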
The experiments focus on the modified Main profile of the H.264 standard, as this profile supports I, P and B frames. The simulation was conducted using the JM 17.2 reference software compiled with Visual Studio 2008 on two platforms:

(1) A Dell laptop with an Intel Core2 Duo T5750 CPU running at 2.0 GHz under Windows XP, with 32 KB L1 D-cache, 32 KB L1 I-cache, 2 MB 8-way set-associative L2 cache and 2 GB RAM.
(2) A Dell desktop with an Intel Core2 Quad 9400 running at 3.0 GHz under Windows 7 Ultimate 64-bit, with 64 KB L1 D-cache, 64 KB I-cache, 4 MB 8-way set-associative L2 cache and 4 GB RAM.

The encoding time and elapsed time of each thread are measured with Intel Parallel Studio 2011 and AMD CodeAnalyst. All video sequences used in the simulation have QCIF or CIF resolution.

4. Implementation of Parallel H

To achieve good data parallelism, the sets of data that can be treated independently and fed to a processing element must be determined. In GOP-level parallelism, each GOP is handled by a separate thread. GOPs are assigned to different processor threads, and each thread processes multiple sequences of frames. This method uses a temporal division of frames to implement parallelism. With a GOP data access pattern, dependencies exist among the frames within a GOP but not between two GOPs, so each thread can process its GOP independently without referencing any frame outside it. Figure 1 shows the independent GOP access pattern of frames.

With this data access pattern, the memory hierarchy needs to store large amounts of data but requires considerably less synchronization, because the system exhibits a coarser granularity of parallelism. This coarser granularity characterizes the data access pattern, and system memory becomes a bottleneck because the smaller L1 and L2 memory levels cannot hold multiple frames of data [10]. In the proposed approach, all the frames of a GOP are stored in a temporary buffer and sequentially transferred to the corresponding core for processing. Odd-numbered GOPs are processed by core 1 and even-numbered GOPs by core 2. In a dual-core system, the two cores share the L2 cache, which is connected to main memory by a separate bus. The proposed GOP-level parallelism uses closed GOPs, so there is no reference between the two GOPs processed by the two cores. No additional core is used for task scheduling; one of the available cores is assigned this task.

Figure 2 shows the implementation of GOP-level parallelism with threads. Two GOP buffers are used for moving the raw images; frames are first stored in these buffers whenever they have space. The frames are then scheduled into 4 temporary buffers according to frame type, namely I, P and B frames, as shown in Figure 2. One master thread handles the input and output processes, such as checking data dependencies, and runs on whichever core is free. Four working threads are created to encode the frames waiting in the temporary buffers. The number of threads created matches the number of processing cores available in the system. All operations are synchronized sequentially through the GOP buffers by the master thread. Figure 3 shows the steps involved in the encoding process.

Fig 1: The GOP frame access pattern

Fig 2: Implementation of the GOP-level parallelism with threads
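The temporal division into closed GOPs and the odd/even assignment to cores can be illustrated with a short sketch. The function names `split_into_gops` and `assign_round_robin` are hypothetical, and the values 300 frames and GOP size 15 are taken from the experiments reported in this paper; the actual encoder operates on frame buffers rather than index lists.

```python
def split_into_gops(num_frames, gop_size):
    """Return a list of closed GOPs, each a list of consecutive frame indices."""
    return [
        list(range(start, min(start + gop_size, num_frames)))
        for start in range(0, num_frames, gop_size)
    ]


def assign_round_robin(gops, num_cores):
    """Map core number (1-based) to the 1-based GOP numbers it encodes."""
    assignment = {core: [] for core in range(1, num_cores + 1)}
    for gop_number in range(1, len(gops) + 1):
        core = (gop_number - 1) % num_cores + 1
        assignment[core].append(gop_number)
    return assignment


# 300 frames with a GOP size of 15 give 20 independent GOPs.
gops = split_into_gops(300, 15)
# With two cores, odd-numbered GOPs go to core 1 and even-numbered to core 2.
plan = assign_round_robin(gops, 2)
```

Since no frame index appears in more than one GOP and closed GOPs never reference frames outside themselves, the two lists in `plan` can be encoded without any inter-core synchronization on reference frames.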

Fig 3: The flow of the encoding process

5. Experimental results and Discussions

This section presents the experimental results, which include the PSNR, total encoding time, ME time and bit rate of the video. Two types of video sequence are tested. Table 1 presents the results for the low-motion Grandma sequence, and Table 2 presents the results for the high-motion Foreman sequence. The results were obtained from tests with 300 frames on both dual-core and quad-core processors, using GOP-level parallelism with an I frame as the starting frame. In Tables 1 and 2, the results obtained with GOP parallelism are compared with those obtained using the original JM. The GOP size is fixed at 15. The results show that the proposed method reduces the encoding time and ME time with a small reduction in bit rate, and that it does not affect the PSNR.

To achieve optimum performance, with higher speedup and lower bit rate without reducing video quality, the GOP size should be chosen carefully. Figures 4, 5 and 6 show the effect of GOP size on PSNR, encoding time and bit rate respectively on a quad processor. From these figures, we note that a GOP size of 15 yields the best results with regard to these parameters. The effect of the number of threads on PSNR on a quad processor is shown in Figure 7.

Table 1: The results of parallel encoding of the low-motion video sequence, Grandma_cif
Columns: Parameters | Original JM with DP and QP | 15GOP with DP | 15GOP with QP
Rows: Average PSNR (dB) | Total Encoding | Total ME | Bit rate (Kbit/s)

Table 2: The results of parallel encoding of the high-motion video sequence, Foreman_cif
Columns: Parameters | Original JM with DP and QP | 15GOP with DP | 15GOP with QP
Rows: Average PSNR (dB) | Total Encoding | Total ME | Bit rate (Kbit/s)
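For readers unfamiliar with the quality metric reported in the tables, PSNR for 8-bit samples is conventionally computed from the mean squared error between the original and reconstructed samples. The sketch below uses that standard textbook definition on flat lists of sample values; the function name `psnr` and the flat-list input are assumptions for illustration, and the averaging that JM applies across frames and colour planes is not reproduced here.

```python
import math


def psnr(original, reconstructed, peak=255):
    """Peak signal-to-noise ratio in dB between two equal-length sample lists."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all samples.
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical signals: no distortion
    return 10 * math.log10(peak ** 2 / mse)
```

A higher value means the reconstructed frame is closer to the original, which is why an unchanged PSNR is read as no loss of video quality.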

Fig 4: GOP size vs PSNR (CIF and QCIF)

Fig 5: Encoding time (min) vs GOP size (CIF and QCIF)

Fig 6: Bit rate (Kbps) vs GOP size (CIF and QCIF)

Fig 7: The PSNR vs the number of threads

Figure 7 shows that the PSNR remains constant as the number of threads increases, for both QCIF and CIF resolutions. The results show that there is no loss of video quality from exploiting GOP-level parallelism. Table 3 compares the performance parameters obtained for different processors during the encoding process [9,10]. The quad-core processor shows good utilization of the front-side bus, and bus activity does not increase significantly as the number of threads increases. The execution time is therefore reduced through better utilization of processor resources by exploiting the optimum thread-level parallelism.

Table 3: Micro Architecture metrics
Columns: Parameters | UP | DP | QP
Rows: Instructions per cycle | Micro-operations per cycle | Trace cache deliver mode % | Trace cache build mode % | 1st level cache load miss rate % | 2nd level cache load miss rate % | Front-side-bus utilization rate %

Fig 8: Speedup vs the number of threads
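The quantities behind a plot such as Fig 8 can be computed from measured wall-clock times. The following is a minimal sketch using the conventional definitions of speedup and per-core efficiency; the function names are hypothetical and the timing values in the usage lines are made-up illustrations, not measurements from this paper.

```python
def speedup(t_sequential, t_parallel):
    """Sequential run time divided by parallel run time."""
    if t_sequential <= 0 or t_parallel <= 0:
        raise ValueError("run times must be positive")
    return t_sequential / t_parallel


def efficiency(t_sequential, t_parallel, num_cores):
    """Speedup normalised by the number of cores (1.0 means ideal scaling)."""
    if num_cores <= 0:
        raise ValueError("num_cores must be positive")
    return speedup(t_sequential, t_parallel) / num_cores


# Illustrative values only: 100 s sequential, 25 s on four cores.
example_speedup = speedup(100.0, 25.0)
example_efficiency = efficiency(100.0, 25.0, 4)
```

Tracking efficiency alongside speedup makes it easier to see when adding threads beyond the core count stops paying off.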

The standard measure, speedup, conventionally defined as the ratio of the sequential execution time to the parallel execution time, is used to evaluate the performance of the proposed method. Figure 8 plots speedup against the number of threads. The figure shows that peak performance is achieved when the number of threads equals the number of cores. It is also observed that the speedup is almost constant (or slightly lower) when the number of threads exceeds the number of cores, because additional overhead is required to schedule the extra threads and hold their state. Figure 8 also shows that significantly higher speedup values can be achieved using GOP parallelism.

6. Conclusion and Future Work

In this paper, we have presented a method based on GOP parallelism and analyzed the parallel scalability of the H.264 video encoding process using dual-core and quad-core processors. Our proposed parallelization strategy overcomes many of the shortfalls of other known methods, such as scalability issues and data dependency constraints. In general, the experimental results show that the GOP-level parallelism strategy efficiently exploits the capabilities of multicore processors. The speedups obtained on the dual-core and quad-core systems are 5.6 and 10 respectively, relative to the original reference software for H.264 (JM 17.2). Although the focus of this paper is the H.264 codec, other video codecs and multimedia applications are expected to exhibit similar characteristics. Hence, the proposed method can be extended to other computationally intensive video processing applications.

7. References

[1] International Standard of Joint Video Specification (ITU-T Rec. H.264, ISO/IEC) (2009).
[2] Ostermann, J., et al., Video Coding with H.264/AVC: Tools, Performance, and Complexity, IEEE Circuits and Systems Magazine 4(1) (2004).
[3] Hoogerbrugge, J., et al., A Multithreaded Multicore System for Embedded Media Processing, Trans. on High-Performance Embedded Architectures and Compilers (2009).
[4] Drose, M., Clemen, C., Sikora, T., Extending Single-View Scalable Video Coding to Multi-View Based on H.264/AVC, Image Processing, 2006 IEEE Inter. Conf. on (2006).
[5] Rodriguez, A., et al., Hierarchical Parallelization of an H.264/AVC Video Encoder, Proc. Int'l Symp. on Parallel Computing in Electrical Engineering (2006).
[6] Chen, Y., Li, E., Zhou, X., Ge, S., Implementation of H.264 Encoder and Decoder on Personal Computers, Journal of Visual Communications and Image Representation 17 (2006).
[7] Van der Tol, E., Jaspers, E., Gelderblom, R., Mapping of H.264 Decoding on a Multiprocessor Architecture, Proc. SPIE Conf. on Image and Video Communications and Processing (2003).
[8] Stenstrom, P., et al., Chip-multiprocessing and Beyond, Proc. Twelfth Int'l Symp. on High-Performance Computer Architecture (2006).
[9] Chen, Y. K., et al., Towards Efficient Multi-Level Threading of H.264 Encoder on Intel Hyper-Threading Architectures, Proc. of the 18th Int'l Parallel and Distributed Processing Symposium, Apr.
[10] Ge, S., Tian, X., Chen, Y. K., Efficient Multithreading Implementation of H.264 Encoder on Intel Hyper-Threading Architectures, IEEE Pacific-Rim Conf. on Multimedia, Dec.


More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

Chap. 4 Multiprocessors and Thread-Level Parallelism

Chap. 4 Multiprocessors and Thread-Level Parallelism Chap. 4 Multiprocessors and Thread-Level Parallelism Uniprocessor performance Performance (vs. VAX-11/780) 10000 1000 100 10 From Hennessy and Patterson, Computer Architecture: A Quantitative Approach,

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Multi-Grain Parallel Accelerate System for H.264 Encoder on ULTRASPARC T2

Multi-Grain Parallel Accelerate System for H.264 Encoder on ULTRASPARC T2 JOURNAL OF COMPUTERS, VOL 8, NO 12, DECEMBER 2013 3293 Multi-Grain Parallel Accelerate System for H264 Encoder on ULTRASPARC T2 Yu Wang, Linda Wu, and Jing Guo Key Lab of the Academy of Equipment, Beijing,

More information

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 19, NO. 9, SEPTEMBER

IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 19, NO. 9, SEPTEMBER IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 19, NO. 9, SEPTEER 2009 1389 Transactions Letters Robust Video Region-of-Interest Coding Based on Leaky Prediction Qian Chen, Xiaokang

More information

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor

Parallel Processors. The dream of computer architects since 1950s: replicate processors to add performance vs. design a faster processor Multiprocessing Parallel Computers Definition: A parallel computer is a collection of processing elements that cooperate and communicate to solve large problems fast. Almasi and Gottlieb, Highly Parallel

More information

Online Course Evaluation. What we will do in the last week?

Online Course Evaluation. What we will do in the last week? Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do

More information

Computer Systems Architecture

Computer Systems Architecture Computer Systems Architecture Lecture 23 Mahadevan Gomathisankaran April 27, 2010 04/27/2010 Lecture 23 CSCE 4610/5610 1 Reminder ABET Feedback: http://www.cse.unt.edu/exitsurvey.cgi?csce+4610+001 Student

More information

High Efficiency Video Decoding on Multicore Processor

High Efficiency Video Decoding on Multicore Processor High Efficiency Video Decoding on Multicore Processor Hyeonggeon Lee 1, Jong Kang Park 2, and Jong Tae Kim 1,2 Department of IT Convergence 1 Sungkyunkwan University Suwon, Korea Department of Electrical

More information

COSC 6385 Computer Architecture - Thread Level Parallelism (I)

COSC 6385 Computer Architecture - Thread Level Parallelism (I) COSC 6385 Computer Architecture - Thread Level Parallelism (I) Edgar Gabriel Spring 2014 Long-term trend on the number of transistor per integrated circuit Number of transistors double every ~18 month

More information

MINIMUM HARDWARE AND OS SPECIFICATIONS File Stream Document Management Software - System Requirements for V4.2

MINIMUM HARDWARE AND OS SPECIFICATIONS File Stream Document Management Software - System Requirements for V4.2 MINIMUM HARDWARE AND OS SPECIFICATIONS File Stream Document Management Software - System Requirements for V4.2 NB: please read this page carefully, as it contains 4 separate specifications for a Workstation

More information

Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding

Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding Jung-Ah Choi and Yo-Sung Ho Gwangju Institute of Science and Technology (GIST) 261 Cheomdan-gwagiro, Buk-gu, Gwangju, 500-712, Korea

More information

Introduction to Parallel Computing

Introduction to Parallel Computing Portland State University ECE 588/688 Introduction to Parallel Computing Reference: Lawrence Livermore National Lab Tutorial https://computing.llnl.gov/tutorials/parallel_comp/ Copyright by Alaa Alameldeen

More information

Improved Context-Based Adaptive Binary Arithmetic Coding in MPEG-4 AVC/H.264 Video Codec

Improved Context-Based Adaptive Binary Arithmetic Coding in MPEG-4 AVC/H.264 Video Codec Improved Context-Based Adaptive Binary Arithmetic Coding in MPEG-4 AVC/H.264 Video Codec Abstract. An improved Context-based Adaptive Binary Arithmetic Coding (CABAC) is presented for application in compression

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

Multiprocessors - Flynn s Taxonomy (1966)

Multiprocessors - Flynn s Taxonomy (1966) Multiprocessors - Flynn s Taxonomy (1966) Single Instruction stream, Single Data stream (SISD) Conventional uniprocessor Although ILP is exploited Single Program Counter -> Single Instruction stream The

More information

ADAPTIVE JOINT H.263-CHANNEL CODING FOR MEMORYLESS BINARY CHANNELS

ADAPTIVE JOINT H.263-CHANNEL CODING FOR MEMORYLESS BINARY CHANNELS ADAPTIVE JOINT H.263-CHANNEL ING FOR MEMORYLESS BINARY CHANNELS A. Navarro, J. Tavares Aveiro University - Telecommunications Institute, 38 Aveiro, Portugal, navarro@av.it.pt Abstract - The main purpose

More information

THREAD LEVEL PARALLELISM

THREAD LEVEL PARALLELISM THREAD LEVEL PARALLELISM Mahdi Nazm Bojnordi Assistant Professor School of Computing University of Utah CS/ECE 6810: Computer Architecture Overview Announcement Homework 4 is due on Dec. 11 th This lecture

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Lecture 19: Multiprocessing Shuai Wang Department of Computer Science and Technology Nanjing University [Slides adapted from CSE 502 Stony Brook University] Getting More

More information

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE Abu Asaduzzaman, Deepthi Gummadi, and Chok M. Yip Department of Electrical Engineering and Computer Science Wichita State University Wichita, Kansas, USA

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 1. Copyright 2012, Elsevier Inc. All rights reserved. Computer Technology Computer Architecture A Quantitative Approach, Fifth Edition Chapter 1 Fundamentals of Quantitative Design and Analysis 1 Computer Technology Performance improvements: Improvements in semiconductor technology

More information

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms. Whitepaper Introduction A Library Based Approach to Threading for Performance David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

More information

Out-of-Order Parallel Simulation of SystemC Models. G. Liu, T. Schmidt, R. Dömer (CECS) A. Dingankar, D. Kirkpatrick (Intel Corp.)

Out-of-Order Parallel Simulation of SystemC Models. G. Liu, T. Schmidt, R. Dömer (CECS) A. Dingankar, D. Kirkpatrick (Intel Corp.) Out-of-Order Simulation of s using Intel MIC Architecture G. Liu, T. Schmidt, R. Dömer (CECS) A. Dingankar, D. Kirkpatrick (Intel Corp.) Speaker: Rainer Dömer doemer@uci.edu Center for Embedded Computer

More information

One-pass bitrate control for MPEG-4 Scalable Video Coding using ρ-domain

One-pass bitrate control for MPEG-4 Scalable Video Coding using ρ-domain Author manuscript, published in "International Symposium on Broadband Multimedia Systems and Broadcasting, Bilbao : Spain (2009)" One-pass bitrate control for MPEG-4 Scalable Video Coding using ρ-domain

More information

High Efficient Intra Coding Algorithm for H.265/HVC

High Efficient Intra Coding Algorithm for H.265/HVC H.265/HVC における高性能符号化アルゴリズムに関する研究 宋天 1,2* 三木拓也 2 島本隆 1,2 High Efficient Intra Coding Algorithm for H.265/HVC by Tian Song 1,2*, Takuya Miki 2 and Takashi Shimamoto 1,2 Abstract This work proposes a novel

More information

Multi-core Architectures. Dr. Yingwu Zhu

Multi-core Architectures. Dr. Yingwu Zhu Multi-core Architectures Dr. Yingwu Zhu What is parallel computing? Using multiple processors in parallel to solve problems more quickly than with a single processor Examples of parallel computing A cluster

More information

Advanced Parallel Programming I

Advanced Parallel Programming I Advanced Parallel Programming I Alexander Leutgeb, RISC Software GmbH RISC Software GmbH Johannes Kepler University Linz 2016 22.09.2016 1 Levels of Parallelism RISC Software GmbH Johannes Kepler University

More information

Comparative and performance analysis of HEVC and H.264 Intra frame coding and JPEG2000

Comparative and performance analysis of HEVC and H.264 Intra frame coding and JPEG2000 Comparative and performance analysis of HEVC and H.264 Intra frame coding and JPEG2000 EE5359 Multimedia Processing Interim Report Spring 2013 The University of Texas at Arlington Department of Electrical

More information

Minimum Hardware and OS Specifications

Minimum Hardware and OS Specifications Hardware and OS Specifications File Stream Document Management Software System Requirements for v4.5 NB: please read through carefully, as it contains 4 separate specifications for a Workstation PC, a

More information

Introduction to Microprocessor

Introduction to Microprocessor Introduction to Microprocessor Slide 1 Microprocessor A microprocessor is a multipurpose, programmable, clock-driven, register-based electronic device That reads binary instructions from a storage device

More information

Parallel Scalability of Video Decoders

Parallel Scalability of Video Decoders J Sign Process Syst (2009) 57:173 194 DOI 10.1007/s11265-008-0256-9 Parallel Scalability of Video Decoders Cor Meenderinck Arnaldo Azevedo Ben Juurlink Mauricio Alvarez Mesa Alex Ramirez Received: 14 September

More information

COSC 6385 Computer Architecture - Multi Processor Systems

COSC 6385 Computer Architecture - Multi Processor Systems COSC 6385 Computer Architecture - Multi Processor Systems Fall 2006 Classification of Parallel Architectures Flynn s Taxonomy SISD: Single instruction single data Classical von Neumann architecture SIMD:

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore

CSCI-GA Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore CSCI-GA.3033-012 Multicore Processors: Architecture & Programming Lecture 10: Heterogeneous Multicore Mohamed Zahran (aka Z) mzahran@cs.nyu.edu http://www.mzahran.com Status Quo Previously, CPU vendors

More information

Scalable Video Coding

Scalable Video Coding 1 Scalable Video Coding Z. Shahid, M. Chaumont and W. Puech LIRMM / UMR 5506 CNRS / Universite Montpellier II France 1. Introduction With the evolution of Internet to heterogeneous networks both in terms

More information

Complexity/Performance Analysis of a H.264/AVC Video Encoder

Complexity/Performance Analysis of a H.264/AVC Video Encoder Complexity/Performance Analysis of a H.264/AVC Video Encoder Hajer Krichene Zrida 1, Ahmed Chiheb Ammari 2, Mohamed Abid 1 and Abderrazek Jemai 3 1 Sfax University, ENIS Institute, Computer and Embedded

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Video compression with 1-D directional transforms in H.264/AVC

Video compression with 1-D directional transforms in H.264/AVC Video compression with 1-D directional transforms in H.264/AVC The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation Kamisli, Fatih,

More information

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question:

Database Workload. from additional misses in this already memory-intensive databases? interference could be a problem) Key question: Database Workload + Low throughput (0.8 IPC on an 8-wide superscalar. 1/4 of SPEC) + Naturally threaded (and widely used) application - Already high cache miss rates on a single-threaded machine (destructive

More information

Smoooth Streaming over wireless Networks Sreya Chakraborty Final Report EE-5359 under the guidance of Dr. K.R.Rao

Smoooth Streaming over wireless Networks Sreya Chakraborty Final Report EE-5359 under the guidance of Dr. K.R.Rao Smoooth Streaming over wireless Networks Sreya Chakraborty Final Report EE-5359 under the guidance of Dr. K.R.Rao 28th April 2011 LIST OF ACRONYMS AND ABBREVIATIONS AVC: Advanced Video Coding DVD: Digital

More information

SMD149 - Operating Systems - Multiprocessing

SMD149 - Operating Systems - Multiprocessing SMD149 - Operating Systems - Multiprocessing Roland Parviainen December 1, 2005 1 / 55 Overview Introduction Multiprocessor systems Multiprocessor, operating system and memory organizations 2 / 55 Introduction

More information

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy

Overview. SMD149 - Operating Systems - Multiprocessing. Multiprocessing architecture. Introduction SISD. Flynn s taxonomy Overview SMD149 - Operating Systems - Multiprocessing Roland Parviainen Multiprocessor systems Multiprocessor, operating system and memory organizations December 1, 2005 1/55 2/55 Multiprocessor system

More information

Parallel Processing of Multimedia Data in a Heterogeneous Computing Environment

Parallel Processing of Multimedia Data in a Heterogeneous Computing Environment Parallel Processing of Multimedia Data in a Heterogeneous Computing Environment Heegon Kim, Sungju Lee, Yongwha Chung, Daihee Park, and Taewoong Jeon Dept. of Computer and Information Science, Korea University,

More information

GV-System V8.7 Supports H.265 GPU Decoding

GV-System V8.7 Supports H.265 GPU Decoding GV-System V8.7 Supports H.265 GPU Decoding Article ID: V1-16-07-15-a Applied to GV-System V8.7 Release Date: 07/15/2016 Summary It takes both Intel Skylake platform and GV-System V8.7 to enable the highly

More information

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823

More information

MATE-EC2: A Middleware for Processing Data with Amazon Web Services

MATE-EC2: A Middleware for Processing Data with Amazon Web Services MATE-EC2: A Middleware for Processing Data with Amazon Web Services Tekin Bicer David Chiu* and Gagan Agrawal Department of Compute Science and Engineering Ohio State University * School of Engineering

More information

EE Low Complexity H.264 encoder for mobile applications

EE Low Complexity H.264 encoder for mobile applications EE 5359 Low Complexity H.264 encoder for mobile applications Thejaswini Purushotham Student I.D.: 1000-616 811 Date: February 18,2010 Objective The objective of the project is to implement a low-complexity

More information

CSE 392/CS 378: High-performance Computing - Principles and Practice

CSE 392/CS 378: High-performance Computing - Principles and Practice CSE 392/CS 378: High-performance Computing - Principles and Practice Parallel Computer Architectures A Conceptual Introduction for Software Developers Jim Browne browne@cs.utexas.edu Parallel Computer

More information

Comparative and performance analysis of HEVC and H.264 Intra frame coding and JPEG2000

Comparative and performance analysis of HEVC and H.264 Intra frame coding and JPEG2000 Comparative and performance analysis of HEVC and H.264 Intra frame coding and JPEG2000 EE5359 Multimedia Processing Project Proposal Spring 2013 The University of Texas at Arlington Department of Electrical

More information

Many-Core VS. Many Thread Machines

Many-Core VS. Many Thread Machines Many-Core VS. Many Thread Machines Stay away from the valley Harshdeep Singh Chawla WHAT DOES IT MEANS????? The author of the paper shows, how can the performance of a hybrid system (comprised of multicores

More information

A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation

A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation 2009 Third International Conference on Multimedia and Ubiquitous Engineering A Novel Deblocking Filter Algorithm In H.264 for Real Time Implementation Yuan Li, Ning Han, Chen Chen Department of Automation,

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Unit-level Optimization for SVC Extractor

Unit-level Optimization for SVC Extractor Unit-level Optimization for SVC Extractor Chang-Ming Lee, Chia-Ying Lee, Bo-Yao Huang, and Kang-Chih Chang Department of Communications Engineering National Chung Cheng University Chiayi, Taiwan changminglee@ee.ccu.edu.tw,

More information

Video Encoding with. Multicore Processors. March 29, 2007 REAL TIME HD

Video Encoding with. Multicore Processors. March 29, 2007 REAL TIME HD Video Encoding with Multicore Processors March 29, 2007 Video is Ubiquitous... Demand for Any Content Any Time Any Where Resolution ranges from 128x96 pixels for mobile to 1920x1080 pixels for full HD

More information

Partitioning Strategies for Concurrent Programming

Partitioning Strategies for Concurrent Programming Partitioning Strategies for Concurrent Programming Henry Hoffmann, Anant Agarwal, and Srini Devadas Massachusetts Institute of Technology Computer Science and Artificial Intelligence Laboratory {hank,agarwal,devadas}@csail.mit.edu

More information

STACK ROBUST FINE GRANULARITY SCALABLE VIDEO CODING

STACK ROBUST FINE GRANULARITY SCALABLE VIDEO CODING Journal of the Chinese Institute of Engineers, Vol. 29, No. 7, pp. 1203-1214 (2006) 1203 STACK ROBUST FINE GRANULARITY SCALABLE VIDEO CODING Hsiang-Chun Huang and Tihao Chiang* ABSTRACT A novel scalable

More information

Co-synthesis and Accelerator based Embedded System Design

Co-synthesis and Accelerator based Embedded System Design Co-synthesis and Accelerator based Embedded System Design COE838: Embedded Computer System http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer

More information