Fast Motion Estimation on Graphics Hardware for H.264 Video Encoding

Martin Schwalb, Ralph Ewerth and Bernd Freisleben
Department of Mathematics and Computer Science, University of Marburg
Hans-Meerwein-Str. 3, D Marburg, Germany
{schwalbm, ewerth,

Abstract: The video coding standard H.264 supports video compression with a higher coding efficiency than previous standards. However, this comes at the expense of an increased encoding complexity, in particular for motion estimation, which becomes a very time-consuming task even for today's central processing units (CPUs). On the other hand, modern graphics hardware includes a powerful graphics processing unit (GPU) whose computing power remains idle most of the time. In this paper, we present a GPU-based approach to motion estimation for the purpose of H.264 video encoding. A small diamond search is adapted to the programming model of modern GPUs to exploit their available parallel computing power and memory bandwidth. Experimental results demonstrate a significant reduction of computation time and a competitive encoding quality compared to a CPU-based UMHexagonS implementation, while enabling the CPU to process other encoding tasks in parallel.

Index Terms: Parallel motion estimation, H.264, GPGPU (general purpose computation on GPU), programmable graphics hardware, MPEG-4 part 10/AVC.

I. INTRODUCTION

The H.264 standard (also known as MPEG-4 part 10 Advanced Video Coding (AVC) [12]) allows a bitrate reduction of coded video by enabling the encoder to carry out more complex motion estimation than any of its predecessor standards [1]. The maximum number of possible reference frames for the block matching search has been increased. Furthermore, each macroblock can be split up irregularly into smaller rectangular blocks down to the size of four by four pels (picture elements), allowing the calculation of a motion vector for each of these subblocks of different sizes.
Completely exploiting these new possibilities requires a computational effort that often cannot be met by CPU-based software encoders running on standard PC hardware. Therefore, some heuristic algorithms have been proposed to reduce the number of motion vectors that are considered worth estimating (e.g., Huang et al. [6] and Li et al. [5]). Other approaches estimate good motion vector predictors by considering the spatial or temporal neighborhood (e.g., Li et al. [5], Chen et al. [8] and Zhou et al. [7]) to save some search iterations. Although such techniques successfully reduce the computation time compared to a full search, the remaining motion estimation tasks still consume a large share of the total encoding time. On the other hand, the CPU is no longer the only programmable source of computing power in mainstream PCs. Programmability has increased from each series of Graphics Processing Units (GPUs) to its successors, allowing the implementation of more general tasks. Thereby, the computational capabilities of GPUs can not only compete with those of CPUs; in many areas, they greatly outperform them. For instance, using a synthetic benchmark, the raw floating point computing power of an NVidia GeForce 6800 Ultra was observed [10] to be about 40 GFlops, and a peak memory bandwidth of 35.2 GB/s was measured. This vast computing power originates from the parallel architecture of GPUs. Significant benefits are achievable if computationally expensive problems are successfully mapped to a GPU. Thus, it is reasonable to examine the possibilities of exploiting this normally idle processing power for video coding tasks. A few approaches to motion estimation utilizing GPU processing power have already been published. However, the exhaustive full search [2] is computationally too expensive for real-world video coding applications, while gradient-based methods ([3] and [4]) do not offer a freely definable cost function, which is recommended for video encoding.
When trying to map typical fast (non-exhaustive) motion estimation algorithms based on local block matching to the GPU, one problem is that the main concepts which make these algorithms fast cannot be executed in parallel. For example, obtaining a good predictor for a motion vector typically requires several already estimated vectors of its temporal and spatial neighborhood as source data. This does not allow a parallel estimation of the new motion vector and the vectors that serve to build its predictor. Thus, when designing a GPU-based fast motion estimation algorithm, one has to find a tradeoff between predictor quality and possible parallelism. In this paper, we propose a novel GPU-based implementation of a fast (non-exhaustive) local block matching search. The contribution of the proposed approach is threefold. First, a parallel GPU implementation of a small diamond search is introduced. Starting from predicted motion vectors, it locally finds the best matching blocks for an arbitrary set of blocks of a particular size in a single reference frame in parallel. This parallel small diamond search relies on the availability of good vector predictors and on the decision of which subset of all possible vectors is worth estimating. Thus, the second contribution is the presented composition of available CPU algorithms that creates vector predictors for blocks of the same size for the same reference frame. This composition directly supports parallel local block searches such as the GPU

implementation of the small diamond search. Together, these two parts form an efficient motion estimation framework for H.264, with the compute-intensive parts running on the GPU and non-critical parts running on the CPU. Third, since the CPU is relieved of a significant amount of computational load, the suggested GPU-based motion estimation approach potentially enables the CPU to process other encoding tasks in parallel. Experimental results demonstrate the good encoding quality of our approach: it is comparable to a UMHexagonS implementation, while enabling the CPU to process other encoding tasks in parallel. Although a complete motion estimation framework for H.264 is presented, the focus of this paper is on the GPU-specific aspects and not on the invention of an optimal generic motion estimation algorithm for H.264. The remainder of this paper is organized as follows. Section II outlines approaches to motion vector predictor generation in H.264 motion estimation, as well as the GPU-based motion estimation approaches published to date. Section III gives an introduction to the current generation of programmable commodity graphics hardware and discusses its limitations. In Section IV, the mapping of motion estimation to the GPU and implementation details are explained. The experimental results of our motion estimation (extending the JM9.0 reference implementation of the H.264 encoder) are presented in Section V. Finally, Section VI concludes the paper and outlines areas for future work.

II. RELATED WORK

One way to face the increased complexity of H.264 motion estimation is to reduce the total number of motion vectors to be estimated. For example, Huang et al. [6] develop various heuristic methods based on the information available after motion estimation with respect to the directly preceding reference frame. The main idea is to identify macroblocks which cover boundaries of moving objects.
Due to possible occlusion/uncovering, searching more reference frames is suggested only for these macroblocks, but not for the other ones. Li et al. [5] suggest another heuristic method to identify macroblocks with fast and continuous movement, using estimated motion vectors referring to the nearest three reference frames. The authors consider searching more reference frames worthwhile only for macroblocks with vibrant or smooth movement; therefore, the macroblocks with fast and continuous movement are neglected during the further search. Another popular method to reduce the computing effort needed for H.264 motion estimation is to estimate good motion vector predictors and to refine them only locally, instead of starting the estimation from an assumed zero motion vector. This often saves iterations of the local search. Li et al. [5] use motion vectors of surrounding blocks with the same mode (top, top-right and left), vectors of the uplayer block modes at the same position (Li et al. call a block mode an uplayer of another block mode if it is of larger or equal size in both dimensions; for example, 16*16 is an uplayer block mode of 16*8 and 8*16), vectors of the recent reference frame and scaled vectors of previous reference frames as vector predictors. Chen et al. [8] predict motion vectors based on reference frames with a temporal distance greater than one by reusing motion vectors of several previous frames which themselves refer to their directly preceding reference frame. Thereby, motion vectors that span n frames are predicted by adding n already estimated motion vectors. The Split and Merge operations proposed by Zhou et al. [7] take advantage of the correlation between motion vectors of various block sizes to form predictors for vectors of differently sized blocks at the same position.
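The vector composition of Chen et al. can be illustrated by a minimal Python sketch (not taken from the paper; the block size, the dictionary representation of a vector field, and the nearest-block lookup that stands in for the dominant vector selection are our assumptions):

```python
BLOCK = 8  # assumed subblock size in pels

def predict_long_span_vector(bx, by, per_frame_fields):
    """Predict a motion vector spanning len(per_frame_fields) frames by
    chaining per-frame vectors (FDVS-style, simplified).

    per_frame_fields[i] maps a block's top-left corner (in pels) to its
    motion vector towards the directly preceding reference frame; fields
    are ordered from the current frame backwards in time."""
    x, y = float(bx), float(by)
    vx_sum, vy_sum = 0.0, 0.0
    for field in per_frame_fields:
        # Snap the displaced position to the nearest block of the grid.
        key = (BLOCK * round(x / BLOCK), BLOCK * round(y / BLOCK))
        vx, vy = field.get(key, (0.0, 0.0))
        vx_sum += vx
        vy_sum += vy
        x += vx  # follow the motion into the previous frame
        y += vy
    return vx_sum, vy_sum
```

Chaining two one-frame vectors, e.g. (8, 0) followed by (2, 2) at the displaced block, yields the two-frame predictor (10, 2).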
Known vectors of larger subblocks are split up to form predictors of smaller subblocks, and vectors of smaller subblocks are merged together to form predictors of larger subblocks that cover the same area. Several approaches exploit the processing power of GPUs for video encoding/decoding tasks. Shen et al. [9] utilize the programmable GPU to accelerate video decoding by implementing the complete motion compensation feedback loop of a proprietary MPEG-like decoder, as well as the color space conversion, on the GPU as shader programs. Kelly and Kokaram [2] exploit the bilinear image interpolation capabilities of GPU Samplers (see Section III) to estimate subpel-accurate motion vectors. This interpolation is used within a full search of motion vectors based on block matching, where the GPU interpolates the subpel positions and iteratively computes SAD (sum of absolute differences) values using a number of rendering steps. The choice of the optimal motion vector is made on the CPU. Since this approach always operates on a complete frame, calculating the absolute difference of the current frame to the displaced reference frame, the same motion vector has to be evaluated for all blocks at once. Therefore, it is not trivial to directly incorporate any fast local search strategy into this approach. Apart from the block-based full search approach, the same authors also present a hierarchical Wiener-based pel-recursive motion estimation approach [3]. The solution for the update vector to a current vector estimate is split up into elementary addition, subtraction and multiplication operations that are performed in parallel on complete textures by the GPU. The automatic mipmap generation of the GPU is used to generate various low-pass filtered versions of the frames for the different hierarchy levels. On-the-fly interpolation by the Sampler is used as in the block-based full search approach. Strzodka et al.
[4] present a real-time motion estimator for visualization purposes that is implemented on graphics cards. They estimate motion through an eigenvector analysis of the spatio-temporal structure tensor at every pixel location. Apart from motion estimation, various techniques are used to visualize the estimated motion in real time. Their complete framework is implemented on the GPU and is able to perform real-time motion estimation and visualization of a 320*240 pixel image sequence at 25 Hz.

III. PROGRAMMABLE GRAPHICS HARDWARE

Today's commodity graphics cards mainly consist of a Graphics Processing Unit (GPU) and a dedicated texture random access memory (RAM) that holds the frame buffer

and textures. While GPU accesses to the texture RAM are fast, the bandwidth of the bus connecting the CPU to the texture RAM is limited. On Accelerated Graphics Port (AGP) PCs, especially the transfer back from the texture RAM to the CPU is limited to the data rate of the PCI bus (Peripheral Component Interconnect bus) of only 266 MB/s. Complex 3D objects in real-time computer graphics are typically approximated by large sets of triangles. The main purpose of the GPU is to render a stream of textured triangles to the frame buffer or another texture. Each triangle is defined by its three vertices, which are data structures that hold 3D world position coordinates and, optionally, texture coordinates and a lighting color. The rendering task can be split up into several subtasks. Therefore, the GPU is constructed as a pipelined architecture (Fig. 1), and currently two of the pipeline stages are more or less freely programmable: the Vertex Processor and the Fragment Processor.

Fig. 1. Simplified overview of the graphics pipeline from the programmer's view: the CPU feeds the Vertex Processor, whose output passes through the Rasterizer to the Fragment Processor, which reads the source texture(s) through the Sampler and writes to the destination texture(s) in the texture RAM. Technically, the Vertex and Fragment Processors themselves are split up into a number of parallel subpipelines. The arrows denote the data flow.

The Vertex Processor processes a small number of triangle vertices of a vertex stream in parallel and is used to compute arbitrary 3D transformations and projections on them. The resulting vertices are taken as input by the next stage of the pipeline: the Rasterizer. The purpose of this stage is to generate so-called fragments for all output pixels that the currently rendered triangle covers. A fragment is a data structure analogous to that of a vertex and is calculated by the Rasterizer by interpolating position coordinates, texture coordinates and lighting color values between the three input vertices of the triangle.
The Fragment Processor then takes these interpolated parameters and the source textures as input and computes a small number of pixel color values at a time in parallel by running the same fragment shader program on these fragments. The term shader is often used to refer to the part of a rendering system which encapsulates the illumination model and the shading technique (or shading model). Such an illumination model or shading technique describes the rules to compute a surface's color at a given point. Source textures can be accessed through the Sampler, which is able to linearly interpolate between four neighboring texels on the fly. Finally, the resulting color values are stored in the destination texture(s), exactly at the position(s) determined by the Rasterizer. In current GPUs, such as the NVidia GeForce 7800 GTX, the Vertex Processor consists of up to 8 parallel vertex pipelines, and the Fragment Processor is built up of up to 24 pixel pipelines. Apart from this parallelism, the basic data types of the Vertex and Fragment Processor pipelines are vectors of four floats, and the arithmetic operations are SIMD (single-instruction-multiple-data) instructions which can operate in parallel on all four floats, which normally represent an XYZW position or an RGBA pixel. Although programmability increases from each generation of GPUs to the next, there are still some restrictions, even for GPUs implementing the current Shader Model 3.0. First of all, both the number of coded instructions and the number of instructions dynamically processed at runtime are still limited. As a result, there are still no true WHILE or GOTO instructions available in the shader assembly language. A second constraint is the prohibition of random texture reads inside conditionally processed code. Only positions addressed by a set of texture coordinates that were directly interpolated by the Rasterizer can be conditionally accessed.
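The execution model of the Fragment Processor, i.e., the same shader program evaluated independently for every destination texel, can be mimicked sequentially in a few lines of Python (an illustrative sketch only; the function names are ours, and a real GPU runs the per-texel calls in parallel):

```python
def render(dest_w, dest_h, shader, *source_textures):
    """Evaluate the same 'fragment shader' function once per destination
    texel, with only the texel's coordinates and read-only source textures
    as input -- the execution model a GPU distributes over its pixel
    pipelines (emulated sequentially here)."""
    return [[shader(x, y, *source_textures) for x in range(dest_w)]
            for y in range(dest_h)]

# Example shader: copy a texel from one source texture, i.e. the 1-to-1
# mapping produced by rendering a full-screen quad (see Section IV-C).
src = [[1, 2], [3, 4]]
copied = render(2, 2, lambda x, y, tex: tex[y][x], src)
```

The key restriction mirrored here is that each shader invocation writes only its own output texel and cannot communicate with its neighbors.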
A limitation of today's series of GPUs implementing the Shader Model 3.0 concerns the efficiency of conditionally skipped fragment shader code. Because of the synchronized execution between the pixel pipelines, time can only be saved if all the pixel pipelines active at a time meet the same condition. Since these pipelines always seem to process blocks of pixels that are spatially arranged in a fixed array, this limits the use of conditional execution for the purpose of time saving to scenarios where larger blocks of neighboring pixels meet exactly the same conditions throughout the execution path of the shader program.

IV. FAST MOTION ESTIMATION USING GRAPHICS HARDWARE

For the reasons explained above, none of the mentioned GPU-based motion estimation approaches is directly applicable for the purpose of H.264 video encoding. Several requirements need to be met to successfully map a problem to a modern GPU architecture. First of all, the GPU's vast computing power originates from the existence of parallel pipelines; therefore, one can only expect performance gains from the GPU if the problem to be solved is parallelizable. Second, problems can be mapped efficiently to GPUs only if the amount of data transferred between the texture RAM and the CPU is relatively low compared to the number of operations processed by the GPU. This must not be confused with GPU texture RAM accesses, which are comparatively fast. Third, although conditional code execution is generally possible in fragment programs, its implementation in today's GPUs is not very efficient yet. Thus, it is recommended to map to the GPU first those parts of the problem which require a minimum of code in IF-THEN-ELSE blocks. Keeping these restrictions in mind, the H.264 motion estimation is mapped to the GPU as follows. The input to our motion estimation consists of the luminance components of the frame that is going to be encoded (from now on

called the "current frame") and of all reference frames. The output of the proposed algorithm consists of quarter-pel accurate vectors (and SAD values) for each subblock of any possible size and all given reference frames. The problem is split into a CPU part and two GPU parts. First, if a frame has not been used as a reference frame before, two supersampled versions of it are computed using the GPU. Then, the CPU generates motion vector predictors for the current block size by applying FDVS [8] and Split and Merge operations [7] to already available vectors. Finally, starting from the motion vector predictors, a GPU shader program performs a small diamond search in order to refine them. This small diamond search is an iterative local block matching search within a reference frame where each iteration consists of the following steps: First, some block positions according to a specific search pattern around the current best match position are evaluated in terms of a block matching metric such as SAD. Second, the position with the lowest metric value becomes the new best match position for the next iteration, except for the case that the metric value at the old best match position was even lower; in the latter case, a small diamond search normally terminates. The search pattern for the small diamond search consists of the four block positions of the old best match position shifted by one small step (e.g., one pel or quarter-pel) to the top, left, bottom and right, which form the pattern of a diamond. As a result, we obtain motion vectors and the respective SAD value for each block, which can finally be used for macroblock coding. Since typical block-based motion estimation algorithms spend most of the computing power and memory bandwidth on interpolating subpixels and calculating SAD values, these tasks are mapped to the GPU.
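The small diamond search iteration described above can be sketched in Python as follows (a plain CPU sketch with full-pel steps and SAD as the metric, not the shader implementation; the function names and the iteration limit are our choices):

```python
def sad(cur, ref, bx, by, vx, vy, bs):
    """Sum of absolute differences between the bs x bs block of the current
    frame at (bx, by) and the reference frame block displaced by (vx, vy)."""
    total = 0
    for dy in range(bs):
        for dx in range(bs):
            total += abs(cur[by + dy][bx + dx] - ref[by + vy + dy][bx + vx + dx])
    return total

def small_diamond_search(cur, ref, bx, by, vx, vy, bs=8, max_iter=16):
    """Refine a predicted full-pel motion vector (vx, vy) using the small
    diamond pattern; stops at a local SAD minimum or after max_iter steps."""
    best = sad(cur, ref, bx, by, vx, vy, bs)
    for _ in range(max_iter):
        # The four diamond positions: top, left, right, bottom of the center.
        candidates = [(vx + ox, vy + oy)
                      for ox, oy in ((0, -1), (-1, 0), (1, 0), (0, 1))]
        scored = [(sad(cur, ref, bx, by, cx, cy, bs), cx, cy)
                  for cx, cy in candidates]
        s, cx, cy = min(scored)
        if s >= best:   # the center is a local minimum: terminate
            break
        best, vx, vy = s, cx, cy
    return vx, vy, best
```

For a frame pair that is a pure translation by one pel, a search started from the zero vector converges to (1, 0) with a residual SAD of zero after a single diamond step.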
Less demanding tasks are the computation of good motion vector predictors, which will be refined subsequently, and the application of heuristic methods to reduce the number of vectors that will be estimated at all. These motion estimation parts still reside on the CPU. For ease of implementation, only unidirectional motion estimation is considered, and SAD is directly used as the function to be locally minimized within the small diamond search iterations, without any incorporation of motion vector bit costs or the Hadamard transformation. In H.264, the Hadamard transformation is used to encode the residual block difference after motion compensation. Therefore, it is advantageous to estimate the block difference bit costs using the sum of absolute (Hadamard-)transformed differences (SATD) instead of the simpler SAD, which we use to ease the implementation. The choice of the optimal encoding mode for a macroblock is beyond the scope of this paper.

A. Workflow of the Entire Approach and the CPU Based Part

The CPU part of our motion estimation is basically a combination of the Split and Merge operations as proposed by Zhou et al. [7], the Forward Dominant Vector Selection (FDVS) algorithm of Chen et al. [8] and the Flexible Multi-Reference Frame Search Criterion of Li et al. [5]. The Split and Merge operations are used to compute motion vector predictors for other subblock sizes from given motion vectors of one subblock size. The FDVS algorithm computes motion vector predictors for a reference frame with distance n by adding n motion vectors of previous frames, each spanning a distance of one frame. The Flexible Multi-Reference Frame Search Criterion is a simple heuristic method that stops the search for fast and continuously moving macroblocks after searching the nearest three reference frames.
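The Split and Merge predictor generation used by the CPU part can be sketched as follows (our own simplification for the 16x16/8x8 case; the paper does not specify how merged vectors are combined, so the componentwise mean is an assumption):

```python
def split_predictors(vec):
    """Split: the vector of a 16x16 block becomes the predictor of all
    four 8x8 subblocks covering the same area (keys are the subblock
    offsets within the macroblock, in pels)."""
    return {(dx, dy): vec for dx in (0, 8) for dy in (0, 8)}

def merge_predictor(subvectors):
    """Merge: combine the four 8x8 subblock vectors into one 16x16
    predictor. The componentwise mean used here is an assumption; any
    combination rule (e.g., a median) fits the same interface."""
    n = len(subvectors)
    return (sum(v[0] for v in subvectors) / n,
            sum(v[1] for v in subvectors) / n)
```

Both operations are cheap and depend only on already finished vectors of one block size, which is why they remain on the CPU while the refinement runs on the GPU.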
In our implementation, this heuristic method uses the vectors of the 16x16 blocks to decide whether any subblock of the corresponding macroblock has to be searched within the farther reference frames. These methods are combined as follows. For each frame, do the following:

1. If a frame serves as a reference frame for the first time: create two supersampled and pre-shifted versions of it, as needed by the GPU-based small diamond search.
2. Calculate displacement vectors from the current frame to its directly preceding reference frame:
   2.1 Use the corresponding resulting vectors of the 8x8 subblocks of the preceding frame as predictors (or zero motion vectors for the very first and every 8th frame transition, to prevent error accumulation);
   2.2 Refine the vector predictors for all 8x8 subblocks in parallel using the GPU-based small diamond search;
   2.3 For all other subblock sizes: compute vector predictors starting from the 8x8 subblock vectors using Split and Merge [7] 2; refine the vector predictors for the current subblock size in parallel using the GPU-based small diamond search.
3. For all other reference frames (from nearest to farthest):
   3.1 If three reference frames have been processed, apply the Flexible Multi-Reference Frame Search Criterion heuristic [5] to stop the search for certain macroblocks;
   3.2 Compute vector predictors for the 8x8 subblocks using the FDVS algorithm [8];
   3.3 Refine the vector predictors for all 8x8 subblocks in parallel using the GPU-based small diamond search;
   3.4 Analogously to step 2.3, create vector predictors starting from the finished 8x8 subblocks for the other subblock sizes using Merge and Split operations, and refine them using the GPU-based small diamond search.

B.
Interpolation of the Values at Half-Pel and Quarter-Pel Positions during Reference Frame Texture Supersampling

The supersampling of the reference frame textures is mapped to the GPU and is calculated only once per newly emerging reference frame, in three steps, prior to any small diamond search runs referring to it. Each of these three steps is implemented in its own fragment shader program:

1) Horizontal supersampling by a factor of 2: every second destination texel in a scanline is interpolated using the horizontal 6-tap filter described in the H.264 standard. The other texels are filled with the unfiltered source texels.
2) Vertical supersampling by a factor of 2: the already horizontally supersampled texture resulting from step 1 is vertically supersampled analogously to step 1. The result is a temporary reference frame texture that is already horizontally and vertically extended by the half-pel positions.
3) Supersampling to a factor of 4 with bilinear interpolation and simultaneous pre-shifting: the extended reference frame texture is completed by simultaneously inserting values at the quarter-pel positions, horizontally and vertically. All new values are simply calculated by bilinear interpolation, whereas the values at the other positions are just copied. The same fragment shader program that implements this bilinear interpolation is used to realize a pre-shifting according to the small diamond search pattern offsets on the fly, as illustrated in Fig. 2 and explained in detail below. The purpose of this pre-shifting is explained in Section IV-D.

The third step is calculated twice with differently sized search patterns to finally create two supersampled textures for each reference frame: one for the quarter-pel search and one for the full-pel search (this allows reusing the implementation of the quarter-pel search for the full-pel search later on).

Fig. 2. Creation of a pre-shifted reference frame texture for quarter-pel [full-pel] search: a supersampled texture is pre-shifted in 4 different ways by one texel [four texels]. The 4 resulting pre-shifted versions are then incorporated into one 4-component RGBA texture.

Fig. 3. The rendering of a full-screen quad. Note: it is not necessary (as indicated by the crosses) that a total mapping exists between the subblocks of the current video frame and the texels of the source texture or destination texture. In this way, the number of vectors to be estimated is reduced.

2 If the difference of all source vectors of a Split or Merge operation to the resulting vector of the operation is below a threshold, the resulting vector is directly taken as the final vector, without refining it through a small diamond search.

C.
Application of the Small Diamond Search Fragment Shader Program

The GPU-based refinement of predicted motion vectors is implemented as a fragment shader program (as described in Section IV-D). This shader program is applied by rendering a source texture (consisting of a field of predicted motion vectors) to a destination texture which represents the final motion vector field and the related SAD values. The reference frame and the current frame are also represented as textures and serve as additional input data to the shader program, which executes an adapted small diamond search on the current frame's motion vector predictors of subblocks of a fixed size, in parallel, in one rendering step. The block size is coded via a global variable readable by the shader program, ensuring that the appropriate number of current and reference frame pels is used to render a motion vector and its related SAD value. The source texture represents the predicted motion vector field for a given block size. The motion vector predictors are coded as floating point RGBA texels. Next to the vector predictors themselves, the subblock positions are also coded into the source texture's texels. The additional coding of the block positions allows an arbitrary mapping between the subblocks of the current frame and the source texture in order to compute a smaller subset of the motion vector field to save computation time. Further inputs to the shader program are the luminance components of the current frame and the current reference frame. They are coded as three additional textures which are dynamically addressed by the fragment shader program. Thereby, the luminance component of the current frame is stored in one 8-bit current frame texture. It is neither supersampled nor pre-shifted, and it contains only one color channel that holds the luminance data.
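As an illustration, packing a subblock position together with its vector predictor into one four-component floating point texel could look as follows (the channel layout is purely an assumption; the paper fixes no ordering):

```python
def pack_texel(block_x, block_y, vx, vy):
    """One RGBA float texel of the source texture: block position in the
    RG channels, predicted motion vector in the BA channels (hypothetical
    layout; any fixed assignment of the four channels would work)."""
    return (float(block_x), float(block_y), float(vx), float(vy))

def unpack_texel(texel):
    """Recover the block position and the vector predictor from a texel."""
    r, g, b, a = texel
    return (int(r), int(g)), (b, a)
```

Because the block position travels inside the texel, the source texture does not have to cover every subblock of the frame, which is what permits estimating only a subset of the vector field.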
The luminance component of the reference frame is redundantly stored in two 32-bit reference frame textures: one for the full-pel search and one for the quarter-pel search. Both textures are supersampled and pre-shifted, as described in Section IV-B. The output of the shader program is a destination texture whose pixel data represent the final motion vector results. Each motion vector is coded by two of the four RGBA components of the floating point destination texture; a third component holds the corresponding SAD value. A so-called full-screen quad is utilized to accomplish a 1-to-1 texel mapping between the source texture and the destination texture. The full-screen quad consists of two triangles that are rendered to the destination texture and occlude it completely. By specifying one set of texture coordinates for the quad (referring to the source texture) and the quad's set of position coordinates (referring to the destination texture), the 1-to-1 texel mapping is accomplished (see Fig. 3). Note that the term full-screen quad is commonly used in the GPGPU (general purpose computation on GPU) community and is not related to video frames. The complete motion estimation for a given block size with respect to a reference frame is started by simply rendering the full-screen quad to the destination texture with the small diamond search fragment shader program activated. This small diamond search application already executes the complete full-pel search and quarter-pel search within one rendering pass. Finally, only the computed motion vectors and SAD values need to be transferred back (in the form of the destination texture) to the system RAM across the AGP bus.
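The half-pel interpolation performed in step 1 of the supersampling (Section IV-B) uses the standard H.264 six-tap filter with taps (1, -5, 20, 20, -5, 1). The following Python sketch computes one horizontal half-pel sample (edge handling by pel replication; rounding and clipping to the 8-bit range as in the standard; the function name is ours):

```python
def halfpel_horizontal(row, x):
    """H.264 six-tap half-pel luminance value between row[x] and row[x + 1].
    The six taps weight the three full-pel samples on either side; the sum
    is rounded (+16) and divided by 32 (>> 5), then clipped to [0, 255]."""
    taps = (1, -5, 20, 20, -5, 1)
    n = len(row)
    acc = sum(t * row[min(max(x - 2 + i, 0), n - 1)]  # replicate edge pels
              for i, t in enumerate(taps))
    return min(255, max(0, (acc + 16) >> 5))
```

On a constant scanline the filter reproduces the input value exactly, since the taps sum to 32.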

D. Implementation of the Small Diamond Search Fragment Shader Program

The realization of the small diamond search as a shader program allows the parallel computation or refinement of motion vectors for several blocks at a time, according to the number of available pixel pipelines. Adapting the small diamond search to a shader program requires some considerations in order to exploit the GPU's properties. First of all, the SIMD functionality of each single pipeline is used to calculate the SAD values of all four positions of the small diamond pattern at once, in parallel. Second, the number of texture accesses during the four SAD calculations is minimized. This is accomplished by pre-shifting the reference frame textures in such a way that each texel directly contains four different samples of the reference frame's luminance component in its RGBA components. Thereby, the four samples originate from positions that are defined by the small diamond search pattern with a fixed width (see Fig. 2). Since two different pattern widths are required (namely quarter-pel and full-pel pattern widths), two differently pre-shifted versions of each reference frame texture are generated, as described in Section IV-B. Using such a pre-shifted reference frame texture, a pixel's absolute differences at all four small diamond pattern positions can be evaluated in parallel using only two texture accesses: one 8-bit access to the current frame texture and one 32-bit access to the reference frame texture. This task is demonstrated by the SAD() subroutine of the pseudo code (see Fig. 4), which outlines the whole shader program that is run in parallel on many pixel pipelines. The block size is controlled by a global variable which is set outside of the shader program. The shader program is optimized at the assembly language level in terms of the number of used registers and manually unrolled loops.
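The effect of the pre-shifted reference textures can be reproduced in a small Python sketch: after packing the four diamond-shifted samples of every position into one "texel", the four SAD accumulations need only one current-frame read and one packed reference read per pel (a CPU illustration of the data layout, not the shader code; names and edge replication are our choices):

```python
DIAMOND = ((0, -1), (-1, 0), (1, 0), (0, 1))  # small diamond offsets

def preshift(ref):
    """Pack the four diamond-shifted reference samples of every position
    into one 4-component 'texel' (edge pels replicated), emulating the
    pre-shifted RGBA reference texture."""
    h, w = len(ref), len(ref[0])
    return [[tuple(ref[min(max(y + oy, 0), h - 1)][min(max(x + ox, 0), w - 1)]
                   for ox, oy in DIAMOND)
             for x in range(w)] for y in range(h)]

def sad4(cur, packed_ref, bx, by, vx, vy, bs):
    """SAD at all four diamond positions around vector (vx, vy), using two
    reads per pel: one current-frame sample and one packed reference texel
    (four absolute differences are formed from the second read)."""
    sads = [0, 0, 0, 0]
    for dy in range(bs):
        for dx in range(bs):
            c = cur[by + dy][bx + dx]                    # read 1
            r4 = packed_ref[by + vy + dy][bx + vx + dx]  # read 2: 4 samples
            for i in range(4):
                sads[i] += abs(c - r4[i])
    return sads
```

In the shader, the four accumulations in the inner loop map onto one four-wide SIMD instruction, which is the point of the packed layout.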
Both searches, the full-pel and the quarter-pel search, are implemented in the same single shader program, which always executes both: the complete full-pel search prior to the quarter-pel search. As stated in Section III, it is not possible to conditionally read textures using dynamically calculated addresses, or to implement a real WHILE loop in a fragment shader program on current Shader Model 3.0 hardware. Therefore, the termination criterion of the standard small diamond search has to be altered for an implementation as a fragment shader program. The standard small diamond search terminates when a local minimum is found or when a given maximum number of iterations has been spent. In contrast, the GPU small diamond search cannot terminate on the condition that a local minimum is found. Due to the GPU's limitations, it always has to spend the same predefined number of iterations. As shown in the pseudo code (see Fig. 4), the proposed GPU implementation will not change the vector in subsequent iterations after a local minimum has been found. Thus, the extra iterations cost some execution time but do not change the result.

V. EXPERIMENTAL RESULTS

The reference implementation JM9.0 of the H.264 encoder was extended by the proposed approach, using the DirectX 9 API to address the GPU. The output vectors of our algorithm were incorporated into the encoding process. Although the resulting SAD values could have been taken as block costs, the original code of JM9.0 was used to re-evaluate the block costs once, based on our final vectors. This guaranteed the compatibility of measures within the whole encoder and minimized the changes made to the existing code of the reference implementation. The small diamond search fragment shader program was written in Shader Model 3.0 assembly language. The High Level Shading Language (HLSL) was used to implement the 6-tap filter interpolation and the pre-shifting shader programs.

This implementation was evaluated on a standard Windows XP PC consisting of the following key hardware components: an AMD Athlon XP CPU, 1 GB RAM and a GeForce 6600GT AGP graphics card equipped with 128 MB RAM, which is connected to the GPU by a 128 bit wide bus. The GPU core was normally clocked at 500 MHz, and the graphics RAM was clocked at 900 MHz, thus theoretically allowing a total memory bandwidth of about MB/s between the GPU and the graphics RAM. With 8 pixel pipelines, the GeForce 6600GT represents the current middle class of GPUs.

To test the encoding quality and speed of the implemented approach, the following standard CIF (Common Intermediate Format) video sequences were chosen as test material: coastguard, flower, foreman, mobile and tempete. The speed was also exemplarily tested with the following QCIF (Quarter Common Intermediate Format) video sequences: carphone, container and foreman. The experimental results are compared to the results of the UMHexagonS algorithm, which is the only fast motion estimation algorithm in the JM9.0 reference encoder, as well as to an entirely CPU-based implementation of the GPU approach.

Fig. 5. PSNR/bitrate results for the coastguard sequence (300 CIF frames): PSNR of the Y-component [dB] over bitrate [kbit/s] for UMHexagonS (with SATD and vector cost), UMHexagonS (with SAD and no vector cost) and the GPU small diamond search (with SAD and no vector cost).

For all tests, the encoding profile was set to Baseline, so neither B-frame encoding nor CABAC (context adaptive

7 7 Main shader program: 1. Read predicted motion vector from source texture; 2. Call SAD() to calculate SAD value of search center; 3. for number of full-pel iterations do: 3.1 Call SAD() to calculate 4 SAD values in parallel using full-pel pattern; 3.2 Get the new minimum SAD of the 4 new SAD values and the old minimum SAD; 3.3 if new minimum SAD!= old minimum SAD then move vector one full-pel step in direction of new minimum; 4. for number of quarter-pel iterations do: 4.1 Call SAD() to calculate 4 SAD values in parallel using quarter-pel pattern; 4.2 Get the new minimum SAD of the 4 new SAD values and the old minimum SAD; 4.3 if new minimum SAD!= old minimum SAD then move vector one quarter-pel step in direction of new minimum; 4.4 else set minimum_found_flag; 5. return minimum SAD, minimum_found_flag and vector; SAD() subroutine: 1. Set the four SAD values to 0; 2. for each pel of a block do: 2.1 Get the current frame luminance value from current texture position; 2.2 Get the 4 reference frame values at once (they are different since the textures are pre-shifted according to the small diamond search pattern); 2.3 Compute absolute differences of current frame luminance value and each of the 4 reference frame values (in one step again); 2.4 Add each of the 4 absolute difference values to one of the four SAD values (again in one step); 3. return the 4 SAD values; Fig. 4. Pseudo code of the main shader program and the SAD subroutine. time [sec] coastguard CIF flower CIF foreman CIF ME execution times mobile CIF tempete CIF carphone QCIF ME time (GPU) ME time (CPU) container QCIF foreman QCIF Fig. 6. The execution times (in seconds) needed for the CPU part and GPU part of the motion estimation for three algorithms (from left to right): 1. UMHexagonS, 2. UMHexagonS without SATD and without vectorcost, and 3. our approach on the GPU. No parallel processing between the CPU and the GPU was utilized yet. binary arithmetic coding) were used. 
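The control flow of Fig. 4 can be transcribed into plain Python to illustrate the fixed-iteration behaviour (full-pel stage only; the quarter-pel stage repeats the same pattern with a finer step). The cost function is an illustrative stand-in for the four-way packed SAD fetch, and the `early_exit` flag models the early-terminating CPU reference variant:

```python
DIAMOND = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # small diamond pattern

def diamond_search(sad, start, iterations=7, early_exit=False):
    """With early_exit=False this mirrors the GPU shader: it always spends
    all iterations, but the vector stops moving once a local minimum is
    reached, so the extra iterations cost time without changing the result.
    early_exit=True models the CPU implementation, which breaks instead."""
    best, best_cost = start, sad(start)
    for _ in range(iterations):
        cands = [(best[0] + dx, best[1] + dy) for dx, dy in DIAMOND]
        costs = [sad(c) for c in cands]   # evaluated in parallel on the GPU
        i = min(range(4), key=costs.__getitem__)
        if costs[i] < best_cost:
            best, best_cost = cands[i], costs[i]   # one diamond step
        elif early_exit:
            break                                  # CPU-only termination
    return best, best_cost
```

Both termination policies return the same vector and cost; the GPU version merely burns the leftover iterations, which is exactly the behaviour argued above.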
The rate distortion optimization of the encoder was generally disabled, the number of reference frames was set to 5, and no bitrate control was used. The UMHexagonS algorithm serves as a reference in two variants. First, the original, full-featured UMHexagonS implementation was used, which minimizes a Lagrangian cost function during the quarter-pel search and incorporates the SATD (sum of absolutely transformed differences) and the estimated vector costs. Second, the same UMHexagonS implementation was used, but vector costs were ignored during minimization and the SAD was used as the block matching metric even for the quarter-pel search. The latter variant is thereby reduced to the same SAD cost function that our proposed approach uses. For both UMHexagonS reference tests, the SearchRange was set to 16 pels. For the GPU based motion estimation, the following parameters were used: the number of small diamond search iterations was normally set to 7 for the full-pel search and to 4 for the quarter-pel search. In cases where zero motion vectors are refined instead of predicted vectors, the number of full-pel search iterations was set to 10. To allow a direct comparison between the speed of the GPU and the CPU, the proposed algorithm was also implemented in C++ to run entirely on the CPU. Apart from the missing parallelization within the SAD calculation, the only algorithmic difference between the GPU version and the CPU version is that the CPU implementation of the small diamond search terminates (as usual) when reaching a local minimum within the 7 (or 4) iterations. First, we investigated whether the proposed approach results in an acceptable video quality. Using the encoding parameters described above, the CIF video sequences were encoded with several quantization parameter settings (QP=23, 27, 32, 37, 44, 48). Fig. 5 (see also Fig.
9 to 12) shows graphs of the resulting PSNR and bitrate for both UMHexagonS variants as well as for the proposed GPU based approach; Table I shows numerical results for QP=32. The graphs of the UMHexagonS variant with SAD as the cost function and the graphs of our proposed approach are comparable in most of the measured cases and indicate that a sufficiently good video quality can be achieved using the proposed GPU based approach. The question of whether a fast block matching based motion estimation can be implemented on modern GPUs can thus be answered with a clear yes. Nevertheless, the original idea was not only to explore whether block matching based motion estimation can be implemented on GPUs, but to directly benefit from their potentially high computing power. Therefore, we measured the execution times required for the GPU and the CPU motion estimation parts. Fig. 6 shows the encoding times for the CIF and QCIF test sequences using the two UMHexagonS variants and our GPU based motion estimation algorithm. It can be observed that the GPU based approach outperforms the UMHexagonS CPU implementation in all measured cases.
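The quality curves plot the PSNR of the Y component against bitrate; for reference, the metric follows directly from the mean squared error of the 8-bit luminance samples (a minimal sketch, not code from the paper):

```python
import math

def psnr_y(orig, recon):
    """Luminance PSNR in dB for 8-bit samples (two equal-length sequences):
    10 * log10(255^2 / MSE), infinite for identical inputs."""
    mse = sum((a - b) ** 2 for a, b in zip(orig, recon)) / len(orig)
    return float('inf') if mse == 0 else 10.0 * math.log10(255.0 ** 2 / mse)
```

A worse motion estimate leaves a larger residual after motion compensation, which at a fixed quantizer shows up as either a lower PSNR or a higher bitrate, which is why the comparison is made along these curves.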

TABLE I. Numerical results for PSNR [dB] / bitrate [kbit/s] with QP=32 for the sequences coastguard, flower, foreman, mobile and tempete, comparing UMHexagonS (with SATD and vector cost), UMHexagonS (with SAD and no vector cost) and the GPU small diamond search.

For CIF sequences, the speed gain is clearly larger than for QCIF sequences. This indicates that using the GPU introduces some overhead which can only be compensated by parallel processing when the amount of data to be processed is large enough. A direct comparison between the GPU based implementation and the same approach running entirely on the CPU is displayed in Fig. 7. It shows that shifting the small diamond search and the interpolation to the GPU also improves the speed, while opening up the possibility of working on different parts of the encoding process on different processors in parallel, which would further reduce the encoding time. In Fig. 7, the execution time of the GPU approach is split into the GPU part (the interpolation and the parallel small diamond search) and the CPU part (the motion vector predictor generation). Although these two parts mutually rely on each other's results and therefore cannot directly be executed in parallel, the CPU is relieved of computational load while the GPU runs the interpolation or the small diamond search. Thus, the CPU is free to process additional tasks. For example, the CPU could perform motion estimation for a smaller frame region completely on its own in order to achieve an optimal load distribution. In this way, the time denoted as the GPU part in Fig. 7 could be further reduced, but the current implementation does not exploit this possibility yet. Nevertheless, the proposed approach enables this further parallelization between the GPU and the CPU.
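The load-distribution idea sketched above, letting the CPU handle a frame region of its own, amounts to splitting the macroblock rows so that both processors finish at the same time. A back-of-the-envelope helper (hypothetical throughput figures, not measurements from the paper):

```python
def balanced_split(total_rows, gpu_rows_per_s, cpu_rows_per_s):
    """Number of macroblock rows the CPU should take so that both
    processors finish together: cpu_rows / cpu_rate == gpu_rows / gpu_rate,
    i.e. the CPU's share is proportional to its relative throughput."""
    share = total_rows * cpu_rows_per_s / (gpu_rows_per_s + cpu_rows_per_s)
    return round(share)
```

For example, with the 18 macroblock rows of a CIF frame and a GPU twice as fast as the CPU, the CPU would take 6 rows and the GPU 12, so neither processor idles while the other finishes.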
Although the GPU based motion estimation approach reduces the processing time noticeably, the results did not meet our expectations regarding the theoretically available GPU processing power and bandwidth, even keeping in mind that the missing early termination criterion of the diamond search adaptation potentially wastes a lot of computing power. Therefore, experiments were conducted to find the bottleneck of the approach by independently reducing the clock speeds of the GPU and the graphics RAM to approximately 50% of their factory defaults. Fig. 8 shows the encoding times of the coastguard sequence when underclocking the GPU, the graphics RAM, or both, in relation to the execution time at the normal clock rates. Underclocking the graphics RAM had a much stronger effect on the encoding time than underclocking the GPU: reducing the RAM clock rate to 50% significantly increased the execution time of the GPU specific motion estimation part, which indicates that memory bandwidth is the bottleneck. Since the SAD calculation is very memory intensive while its arithmetical complexity is rather low, this coincides with the expected result. Although the memory bandwidth was clearly identified as the bottleneck of the approach, the actually used bandwidth is only about 1.6% of the theoretical maximum. The used bandwidth was approximated by dividing the known number and size of memory accesses by the total time the GPU spent on motion estimation.

Fig. 7. Direct comparison between the GPU based approach and the CPU implementation of the same approach using the coastguard CIF sequence.

Fig. 8. The execution times (in seconds) needed for the various parts of motion estimation when underclocking the GPU (500 MHz vs. 249 MHz), the graphics RAM (900 MHz vs. 449 MHz), or both.
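The used-bandwidth approximation described above is a simple division, and the access volume of the SAD stage follows from the two-fetch scheme (one 8-bit plus one 32-bit access per pel). The sketch below uses illustrative placeholder values, not the paper's measurements:

```python
# One 8-bit current-frame fetch plus one 32-bit packed reference fetch per pel.
BYTES_PER_PEL = 1 + 4

def sad_traffic_bytes(num_blocks, pels_per_block, sad_evals_per_block):
    """Bytes moved by the SAD fetches alone (caching effects ignored)."""
    return num_blocks * sad_evals_per_block * pels_per_block * BYTES_PER_PEL

def effective_bandwidth_mb_s(bytes_accessed, seconds):
    """Used bandwidth: total access volume divided by the GPU time
    spent on motion estimation, in MB/s."""
    return bytes_accessed / seconds / 1e6
```

For instance, the 396 macroblocks of one CIF frame, each running 12 four-way SAD evaluations on hypothetical 4x4 blocks, would move 396 * 12 * 16 * 5 = 380160 bytes under this accounting; dividing such totals by the measured GPU time yields the utilisation figure quoted in the text.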
We conclude that this is caused by the random memory accesses to the current frame texture and the reference frame textures, which are quite untypical for a shader program and therefore interfere with the texture pre-fetching and caching strategies implemented in the GPU. Other researchers [11] observed similar memory transfer reductions when making heavy use of random texture accesses. In addition, we conducted experiments with two further CPU motion estimation algorithms that were added to the JM reference software in version JM10.0: Simplified UMHexagonS and EPZS patterns. Unfortunately, many parameter names have changed since version JM9.0 (there are about two hundred parameters), and hence a direct comparison with our GPU based proposal is difficult. Nevertheless, the following relative time measurements were obtained using JM12.0 for the sequence coastguard: the Simplified UMHexagonS version is about 22% faster than UMHexagonS, the EPZS pattern is 3% faster than UMHexagonS, and all three achieve nearly identical PSNR values (PSNR: 37.0 dB at a bitrate of 1450 kbit/s). Our proposed GPU motion estimation approach is more than 50% faster than the CPU based UMHexagonS approach (see Fig. 6). Although the performance of our approach is limited by the random memory accesses, it is comparable to state-of-the-art motion estimation algorithms for H.264 implemented on general purpose CPUs, both in

terms of speed and quality. For example, consider the reported results of Li et al. [5], who presented an approach to H.264 motion estimation (for the motion estimation part with integer accuracy only). We exemplarily compare the results for the sequence coastguard: Li et al. report a number of search points of and an average matching time of ms, measured on an Intel Pentium IV 3.0 GHz processor (average results for their whole test set: search points and a matching time of ms). Li et al. implemented their approach using SSE2 assembly instructions for Intel processing units. For the coastguard sequence, our approach searches an average number of points per macroblock for the integer motion estimation part, and an average block matching time of 1.06 ms was measured. The performance of our implementation of the proposed GPU based approach is comparable, although the average block matching time is three times faster for Li et al.'s approach. If one considers that only a middle-class GPU was used in our experiments, and that a recent graphics card like the NVidia GeForce 7800 GTX has 24 pixel pipelines (instead of the 8 pipelines of the GeForce 6600GT we used, theoretically a speed-up factor of 3), includes RAM clocked at 1350 MHz (instead of 900 MHz, theoretically a speed-up factor of 1.5) and has 256-bit wide RAM access (instead of 128-bit, theoretically a speed-up factor of 2), a further performance speed-up factor between 1.5 and 3 can be assumed for our approach. Thus, using an upper-class graphics card would probably result in a comparable block matching time for the proposed approach. The achieved quality of the encoded sequences is comparable to a UMHexagonS approach that minimizes the same cost function (see Fig. 5 and Figs. 9 to 12). Hence, in terms of the number of search points, average block matching time and quality, the proposed GPU approach is competitive with the state-of-the-art approach mentioned above, which is implemented on a high-performance general purpose CPU.

Fig. 9. PSNR/bitrate results of flower, 250 frames, CIF.
Fig. 10. PSNR/bitrate results of foreman, 300 frames, CIF.
Fig. 11. PSNR/bitrate results of mobile, 300 frames, CIF.
Fig. 12. PSNR/bitrate results of tempete, 260 frames, CIF, 15 Hz. (Each figure compares UMHexagonS with SATD and vector cost, UMHexagonS with SAD and no vector cost, and the GPU small diamond search with SAD and no vector cost.)

VI. CONCLUSIONS

In this paper, we proposed a GPU based approach to fast motion estimation on commodity graphics cards for the purpose of H.264 video encoding. The problem is split into finding appropriate motion vector predictors for all possible subblocks using a CPU based implementation part, which

incorporates state-of-the-art techniques, and a GPU part that refines the calculated predictors using a GPU adaptation of a parallel small diamond search. The approach has been implemented and tested on several video sequences. The achieved encoding quality turned out to be competitive with the JM9.0 reference implementation of UMHexagonS when using the SAD as its cost function, while the resulting implementation clearly outperformed the UMHexagonS CPU implementation of the H.264 reference encoder in terms of speed. An important additional advantage of the proposed approach is that it frees the CPU to process other encoding tasks in parallel while the GPU takes care of the motion estimation. However, the theoretically expected performance gain could not be fully achieved. The main performance issue of the proposed approach was identified to be the random texture accesses needed by the GPU part, which appear to collide with the texture pre-fetching and caching strategies implemented in current GPUs. Apart from the unsuitable caching strategies, the unavailability of real WHILE loops and the prohibition of arbitrary conditional texture accesses in current GPUs prevent an early termination criterion as used in the standard small diamond search. Instead of spending only as many iterations as necessary, a constant number of iterations has to be chosen that is high enough for most cases. This leads to a number of needlessly evaluated loop iterations and memory accesses, and therefore to wasted computing time and memory bandwidth. The bottleneck of the proposed approach is thus the high number of random memory accesses, not the arithmetical complexity of the SAD block matching metric.
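Because arithmetic is cheap relative to memory traffic here, a costlier metric such as the SATD, a 4x4 Hadamard transform of the difference block followed by a sum of coefficient magnitudes, could be evaluated on the pels that the SAD fetches already bring in. A minimal sketch of the metric itself (pure Python, not from the paper; implementations such as JM typically apply an additional normalization factor):

```python
def hadamard4(rows):
    """4-point Hadamard butterfly applied to each of four length-4 rows."""
    out = []
    for a, b, c, d in rows:
        s0, s1, d0, d1 = a + b, c + d, a - b, c - d
        out.append([s0 + s1, d0 + d1, s0 - s1, d0 - d1])
    return out

def satd4x4(cur, ref):
    """SATD of a 4x4 block: 2-D Hadamard transform of the difference block
    (rows, then columns), then the sum of absolute transform coefficients."""
    diff = [[c - r for c, r in zip(cr, rr)] for cr, rr in zip(cur, ref)]
    t = hadamard4(diff)                       # transform the rows
    t = hadamard4(list(map(list, zip(*t))))   # transpose, transform the columns
    return sum(abs(v) for row in t for v in row)
```

The inputs are exactly the current and reference pels already read for the SAD, so in a memory-bound shader the extra butterflies would add computation without adding texture traffic, which is the basis of the future-work argument.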
Therefore, extending the proposed approach with the SATD promises a further computational benefit compared to a CPU based approach, since the SATD could be computed in parallel in the shader program for each search position without any additional cost for memory accesses. Even the incorporation of a vector cost estimation into the cost function seems feasible to some extent. Thus, replacing the SAD by a more complex Lagrangian cost function within the shader program, as well as making use of parallel processing on the GPU and the CPU, are subjects of future work.

REFERENCES

[1] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer, and T. Wedi, "Video Coding with H.264/AVC: Tools, Performance, and Complexity," IEEE Circuits and Systems Magazine, pp. 7-28, 1st Quarter 2004.
[2] F. Kelly and A. Kokaram, "Fast Image Interpolation for Motion Estimation using Graphics Hardware," Proc. SPIE Vol. 5297, Real-Time Imaging VIII, 2004.
[3] F. Kelly and A. Kokaram, "Graphics Hardware for Gradient Based Motion Estimation," Proc. SPIE Vol. 5309, Embedded Processors for Multimedia and Communications, 2004.
[4] R. Strzodka and C. S. Garbe, "Real-Time Motion Estimation and Visualization on Graphics Cards," Proc. IEEE Visualization 2004, 2004.
[5] X. Li, E. Q. Li, and Y.-K. Chen, "Fast Multi-Frame Motion Estimation Algorithm with Adaptive Search Strategies in H.264," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 3, 2004.
[6] Y.-W. Huang, B.-Y. Hsieh, T.-C. Wang, S.-Y. Chien, S.-Y. Ma, C.-F. Shen, and L.-G. Chen, "Analysis and Reduction of Reference Frames for Motion Estimation in MPEG-4 AVC/JVT/H.264," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hong Kong, Vol. 3, 2003.
[7] Z. Zhou, M.-T. Sun, and Y.-F. Hsu, "Fast Variable Block-Size Motion Estimation Algorithms Based on Merge and Split Procedures for H.264/MPEG-4 AVC," Proc. IEEE International Symposium on Circuits and Systems (ISCAS), Vancouver, Vol. 3, 2004.
[8] M.-J. Chen, Y.-Y. Chiang, H.-J. Li, and M.-C. Chi, "Efficient Multi-Frame Motion Estimation Algorithms for MPEG-4 AVC/JVT/H.264," Proc. IEEE International Symposium on Circuits and Systems (ISCAS), Vancouver, Vol. 3, 2004.
[9] G. Shen, G.-P. Gao, S. Li, H.-Y. Shum, and Y.-Q. Zhang, "Accelerate Video Decoding with Generic GPU," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, Issue 5, 2005.
[10] M. Harris, "GPGPU: Beyond Graphics," tutorial held at EUROGRAPHICS 2004, 25th Annual Conference of the European Association for Computer Graphics, 2004.
[11] I. Buck, "GPU Computation Strategies & Tricks," presentation of a course held at SIGGRAPH 2004, 31st International Conference on Computer Graphics and Interactive Techniques, 2004.
[12] MPEG-4 Part 10/AVC, "Coding of Audiovisual Objects - Part 10: Advanced Video Coding," ISO/IEC 14496-10:2003, 2003.

Martin Schwalb received his diploma in computer science from the University of Marburg, Germany. He is currently with ipharro Media GmbH, Darmstadt, Germany, a recent spin-off of the Fraunhofer Institute for Computer Graphics, where he is working on the core of a leading-edge video fingerprinting technology. His current research interests include content based video retrieval, video copy detection, image features and GPGPU.

Ralph Ewerth is a research assistant in the Department of Mathematics and Computer Science at the University of Marburg, Germany. He received his diploma in computer science in 2002 and the Ph.D. degree in computer science in 2008, both from the University of Marburg, Germany. His research interests include video coding, machine learning, and multimedia content analysis and retrieval.

Bernd Freisleben is a full professor of computer science in the Department of Mathematics and Computer Science at the University of Marburg, Germany. He received his Master's degree in computer science from the Pennsylvania State University, USA, in 1981, and the Ph.D. degree in computer science from the Darmstadt University of Technology, Germany. His research interests include computational intelligence, scientific computing, and multimedia computing.


More information

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011) Lecture 6: Texture Kayvon Fatahalian CMU 15-869: Graphics and Imaging Architectures (Fall 2011) Today: texturing! Texture filtering - Texture access is not just a 2D array lookup ;-) Memory-system implications

More information

Implementation and analysis of Directional DCT in H.264

Implementation and analysis of Directional DCT in H.264 Implementation and analysis of Directional DCT in H.264 EE 5359 Multimedia Processing Guidance: Dr K R Rao Priyadarshini Anjanappa UTA ID: 1000730236 priyadarshini.anjanappa@mavs.uta.edu Introduction A

More information

High Performance Hardware Architectures for A Hexagon-Based Motion Estimation Algorithm

High Performance Hardware Architectures for A Hexagon-Based Motion Estimation Algorithm High Performance Hardware Architectures for A Hexagon-Based Motion Estimation Algorithm Ozgur Tasdizen 1,2,a, Abdulkadir Akin 1,2,b, Halil Kukner 1,2,c, Ilker Hamzaoglu 1,d, H. Fatih Ugurdag 3,e 1 Electronics

More information

Compression of Stereo Images using a Huffman-Zip Scheme

Compression of Stereo Images using a Huffman-Zip Scheme Compression of Stereo Images using a Huffman-Zip Scheme John Hamann, Vickey Yeh Department of Electrical Engineering, Stanford University Stanford, CA 94304 jhamann@stanford.edu, vickey@stanford.edu Abstract

More information

Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding

Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding Jung-Ah Choi and Yo-Sung Ho Gwangju Institute of Science and Technology (GIST) 261 Cheomdan-gwagiro, Buk-gu, Gwangju, 500-712, Korea

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

High Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC

High Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC Journal of Computational Information Systems 7: 8 (2011) 2843-2850 Available at http://www.jofcis.com High Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC Meihua GU 1,2, Ningmei

More information

Comparative Study of Partial Closed-loop Versus Open-loop Motion Estimation for Coding of HDTV

Comparative Study of Partial Closed-loop Versus Open-loop Motion Estimation for Coding of HDTV Comparative Study of Partial Closed-loop Versus Open-loop Motion Estimation for Coding of HDTV Jeffrey S. McVeigh 1 and Siu-Wai Wu 2 1 Carnegie Mellon University Department of Electrical and Computer Engineering

More information

Overview: motion estimation. Differential motion estimation

Overview: motion estimation. Differential motion estimation Overview: motion estimation Differential methods Fast algorithms for Sub-pel accuracy Rate-constrained motion estimation Bernd Girod: EE368b Image Video Compression Motion Estimation no. 1 Differential

More information

EE 5359 MULTIMEDIA PROCESSING SPRING Final Report IMPLEMENTATION AND ANALYSIS OF DIRECTIONAL DISCRETE COSINE TRANSFORM IN H.

EE 5359 MULTIMEDIA PROCESSING SPRING Final Report IMPLEMENTATION AND ANALYSIS OF DIRECTIONAL DISCRETE COSINE TRANSFORM IN H. EE 5359 MULTIMEDIA PROCESSING SPRING 2011 Final Report IMPLEMENTATION AND ANALYSIS OF DIRECTIONAL DISCRETE COSINE TRANSFORM IN H.264 Under guidance of DR K R RAO DEPARTMENT OF ELECTRICAL ENGINEERING UNIVERSITY

More information

The Scope of Picture and Video Coding Standardization

The Scope of Picture and Video Coding Standardization H.120 H.261 Video Coding Standards MPEG-1 and MPEG-2/H.262 H.263 MPEG-4 H.264 / MPEG-4 AVC Thomas Wiegand: Digital Image Communication Video Coding Standards 1 The Scope of Picture and Video Coding Standardization

More information

Motion Estimation for H.264/AVC on Multiple GPUs Using NVIDIA CUDA

Motion Estimation for H.264/AVC on Multiple GPUs Using NVIDIA CUDA Motion Estimation for H.264/AVC on Multiple GPUs Using NVIDIA CUDA Bart Pieters a, Charles F. Hollemeersch, Peter Lambert, and Rik Van de Walle Department of Electronics and Information Systems Multimedia

More information

Fast Motion Estimation for Shape Coding in MPEG-4

Fast Motion Estimation for Shape Coding in MPEG-4 358 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 13, NO. 4, APRIL 2003 Fast Motion Estimation for Shape Coding in MPEG-4 Donghoon Yu, Sung Kyu Jang, and Jong Beom Ra Abstract Effective

More information

Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology

Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology Course Presentation Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology Video Coding Correlation in Video Sequence Spatial correlation Similar pixels seem

More information

An Improved H.26L Coder Using Lagrangian Coder Control. Summary

An Improved H.26L Coder Using Lagrangian Coder Control. Summary UIT - Secteur de la normalisation des télécommunications ITU - Telecommunication Standardization Sector UIT - Sector de Normalización de las Telecomunicaciones Study Period 2001-2004 Commission d' études

More information

View Synthesis for Multiview Video Compression

View Synthesis for Multiview Video Compression View Synthesis for Multiview Video Compression Emin Martinian, Alexander Behrens, Jun Xin, and Anthony Vetro email:{martinian,jxin,avetro}@merl.com, behrens@tnt.uni-hannover.de Mitsubishi Electric Research

More information

Introduction to Video Compression

Introduction to Video Compression Insight, Analysis, and Advice on Signal Processing Technology Introduction to Video Compression Jeff Bier Berkeley Design Technology, Inc. info@bdti.com http://www.bdti.com Outline Motivation and scope

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 20

ECE 571 Advanced Microprocessor-Based Design Lecture 20 ECE 571 Advanced Microprocessor-Based Design Lecture 20 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 12 April 2016 Project/HW Reminder Homework #9 was posted 1 Raspberry Pi

More information

Interframe coding A video scene captured as a sequence of frames can be efficiently coded by estimating and compensating for motion between frames pri

Interframe coding A video scene captured as a sequence of frames can be efficiently coded by estimating and compensating for motion between frames pri MPEG MPEG video is broken up into a hierarchy of layer From the top level, the first layer is known as the video sequence layer, and is any self contained bitstream, for example a coded movie. The second

More information

10.2 Video Compression with Motion Compensation 10.4 H H.263

10.2 Video Compression with Motion Compensation 10.4 H H.263 Chapter 10 Basic Video Compression Techniques 10.11 Introduction to Video Compression 10.2 Video Compression with Motion Compensation 10.3 Search for Motion Vectors 10.4 H.261 10.5 H.263 10.6 Further Exploration

More information

Semi-Hierarchical Based Motion Estimation Algorithm for the Dirac Video Encoder

Semi-Hierarchical Based Motion Estimation Algorithm for the Dirac Video Encoder Semi-Hierarchical Based Motion Estimation Algorithm for the Dirac Video Encoder M. TUN, K. K. LOO, J. COSMAS School of Engineering and Design Brunel University Kingston Lane, Uxbridge, UB8 3PH UNITED KINGDOM

More information

FAST SPATIAL LAYER MODE DECISION BASED ON TEMPORAL LEVELS IN H.264/AVC SCALABLE EXTENSION

FAST SPATIAL LAYER MODE DECISION BASED ON TEMPORAL LEVELS IN H.264/AVC SCALABLE EXTENSION FAST SPATIAL LAYER MODE DECISION BASED ON TEMPORAL LEVELS IN H.264/AVC SCALABLE EXTENSION Yen-Chieh Wang( 王彥傑 ), Zong-Yi Chen( 陳宗毅 ), Pao-Chi Chang( 張寶基 ) Dept. of Communication Engineering, National Central

More information

Chapter 10. Basic Video Compression Techniques Introduction to Video Compression 10.2 Video Compression with Motion Compensation

Chapter 10. Basic Video Compression Techniques Introduction to Video Compression 10.2 Video Compression with Motion Compensation Chapter 10 Basic Video Compression Techniques 10.1 Introduction to Video Compression 10.2 Video Compression with Motion Compensation 10.3 Search for Motion Vectors 10.4 H.261 10.5 H.263 10.6 Further Exploration

More information

Efficient MPEG-2 to H.264/AVC Intra Transcoding in Transform-domain

Efficient MPEG-2 to H.264/AVC Intra Transcoding in Transform-domain MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Efficient MPEG- to H.64/AVC Transcoding in Transform-domain Yeping Su, Jun Xin, Anthony Vetro, Huifang Sun TR005-039 May 005 Abstract In this

More information

Chapter 11.3 MPEG-2. MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps Defined seven profiles aimed at different applications:

Chapter 11.3 MPEG-2. MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps Defined seven profiles aimed at different applications: Chapter 11.3 MPEG-2 MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps Defined seven profiles aimed at different applications: Simple, Main, SNR scalable, Spatially scalable, High, 4:2:2,

More information

Motion Estimation for Video Coding Standards

Motion Estimation for Video Coding Standards Motion Estimation for Video Coding Standards Prof. Ja-Ling Wu Department of Computer Science and Information Engineering National Taiwan University Introduction of Motion Estimation The goal of video compression

More information

GPGPU. Peter Laurens 1st-year PhD Student, NSC

GPGPU. Peter Laurens 1st-year PhD Student, NSC GPGPU Peter Laurens 1st-year PhD Student, NSC Presentation Overview 1. What is it? 2. What can it do for me? 3. How can I get it to do that? 4. What s the catch? 5. What s the future? What is it? Introducing

More information

International Journal of Emerging Technology and Advanced Engineering Website: (ISSN , Volume 2, Issue 4, April 2012)

International Journal of Emerging Technology and Advanced Engineering Website:   (ISSN , Volume 2, Issue 4, April 2012) A Technical Analysis Towards Digital Video Compression Rutika Joshi 1, Rajesh Rai 2, Rajesh Nema 3 1 Student, Electronics and Communication Department, NIIST College, Bhopal, 2,3 Prof., Electronics and

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:

More information

Title Adaptive Lagrange Multiplier for Low Bit Rates in H.264.

Title Adaptive Lagrange Multiplier for Low Bit Rates in H.264. Provided by the author(s) and University College Dublin Library in accordance with publisher policies. Please cite the published version when available. Title Adaptive Lagrange Multiplier for Low Bit Rates

More information

Graphics Hardware. Instructor Stephen J. Guy

Graphics Hardware. Instructor Stephen J. Guy Instructor Stephen J. Guy Overview What is a GPU Evolution of GPU GPU Design Modern Features Programmability! Programming Examples Overview What is a GPU Evolution of GPU GPU Design Modern Features Programmability!

More information

IMPROVED CONTEXT-ADAPTIVE ARITHMETIC CODING IN H.264/AVC

IMPROVED CONTEXT-ADAPTIVE ARITHMETIC CODING IN H.264/AVC 17th European Signal Processing Conference (EUSIPCO 2009) Glasgow, Scotland, August 24-28, 2009 IMPROVED CONTEXT-ADAPTIVE ARITHMETIC CODING IN H.264/AVC Damian Karwowski, Marek Domański Poznań University

More information

EFFICIENT PU MODE DECISION AND MOTION ESTIMATION FOR H.264/AVC TO HEVC TRANSCODER

EFFICIENT PU MODE DECISION AND MOTION ESTIMATION FOR H.264/AVC TO HEVC TRANSCODER EFFICIENT PU MODE DECISION AND MOTION ESTIMATION FOR H.264/AVC TO HEVC TRANSCODER Zong-Yi Chen, Jiunn-Tsair Fang 2, Tsai-Ling Liao, and Pao-Chi Chang Department of Communication Engineering, National Central

More information

Aiyar, Mani Laxman. Keywords: MPEG4, H.264, HEVC, HDTV, DVB, FIR.

Aiyar, Mani Laxman. Keywords: MPEG4, H.264, HEVC, HDTV, DVB, FIR. 2015; 2(2): 201-209 IJMRD 2015; 2(2): 201-209 www.allsubjectjournal.com Received: 07-01-2015 Accepted: 10-02-2015 E-ISSN: 2349-4182 P-ISSN: 2349-5979 Impact factor: 3.762 Aiyar, Mani Laxman Dept. Of ECE,

More information

ABSTRACT. KEYWORD: Low complexity H.264, Machine learning, Data mining, Inter prediction. 1 INTRODUCTION

ABSTRACT. KEYWORD: Low complexity H.264, Machine learning, Data mining, Inter prediction. 1 INTRODUCTION Low Complexity H.264 Video Encoding Paula Carrillo, Hari Kalva, and Tao Pin. Dept. of Computer Science and Technology,Tsinghua University, Beijing, China Dept. of Computer Science and Engineering, Florida

More information

Outline Introduction MPEG-2 MPEG-4. Video Compression. Introduction to MPEG. Prof. Pratikgiri Goswami

Outline Introduction MPEG-2 MPEG-4. Video Compression. Introduction to MPEG. Prof. Pratikgiri Goswami to MPEG Prof. Pratikgiri Goswami Electronics & Communication Department, Shree Swami Atmanand Saraswati Institute of Technology, Surat. Outline of Topics 1 2 Coding 3 Video Object Representation Outline

More information

IN the early 1980 s, video compression made the leap from

IN the early 1980 s, video compression made the leap from 70 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 9, NO. 1, FEBRUARY 1999 Long-Term Memory Motion-Compensated Prediction Thomas Wiegand, Xiaozheng Zhang, and Bernd Girod, Fellow,

More information

Could you make the XNA functions yourself?

Could you make the XNA functions yourself? 1 Could you make the XNA functions yourself? For the second and especially the third assignment, you need to globally understand what s going on inside the graphics hardware. You will write shaders, which

More information

Efficient and optimal block matching for motion estimation

Efficient and optimal block matching for motion estimation Efficient and optimal block matching for motion estimation Stefano Mattoccia Federico Tombari Luigi Di Stefano Marco Pignoloni Department of Electronics Computer Science and Systems (DEIS) Viale Risorgimento

More information

ARCHITECTURES OF INCORPORATING MPEG-4 AVC INTO THREE-DIMENSIONAL WAVELET VIDEO CODING

ARCHITECTURES OF INCORPORATING MPEG-4 AVC INTO THREE-DIMENSIONAL WAVELET VIDEO CODING ARCHITECTURES OF INCORPORATING MPEG-4 AVC INTO THREE-DIMENSIONAL WAVELET VIDEO CODING ABSTRACT Xiangyang Ji *1, Jizheng Xu 2, Debin Zhao 1, Feng Wu 2 1 Institute of Computing Technology, Chinese Academy

More information

Motion Vector Coding Algorithm Based on Adaptive Template Matching

Motion Vector Coding Algorithm Based on Adaptive Template Matching Motion Vector Coding Algorithm Based on Adaptive Template Matching Wen Yang #1, Oscar C. Au #2, Jingjing Dai #3, Feng Zou #4, Chao Pang #5,Yu Liu 6 # Electronic and Computer Engineering, The Hong Kong

More information

THE MPEG-2 video coding standard is widely used in

THE MPEG-2 video coding standard is widely used in 172 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 18, NO. 2, FEBRUARY 2008 A Fast MB Mode Decision Algorithm for MPEG-2 to H.264 P-Frame Transcoding Gerardo Fernández-Escribano,

More information

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1 X. GPU Programming 320491: Advanced Graphics - Chapter X 1 X.1 GPU Architecture 320491: Advanced Graphics - Chapter X 2 GPU Graphics Processing Unit Parallelized SIMD Architecture 112 processing cores

More information

H.264 to MPEG-4 Transcoding Using Block Type Information

H.264 to MPEG-4 Transcoding Using Block Type Information 1568963561 1 H.264 to MPEG-4 Transcoding Using Block Type Information Jae-Ho Hur and Yung-Lyul Lee Abstract In this paper, we propose a heterogeneous transcoding method of converting an H.264 video bitstream

More information

Understanding Sources of Inefficiency in General-Purpose Chips

Understanding Sources of Inefficiency in General-Purpose Chips Understanding Sources of Inefficiency in General-Purpose Chips Rehan Hameed Wajahat Qadeer Megan Wachs Omid Azizi Alex Solomatnikov Benjamin Lee Stephen Richardson Christos Kozyrakis Mark Horowitz GP Processors

More information

Rendering Subdivision Surfaces Efficiently on the GPU

Rendering Subdivision Surfaces Efficiently on the GPU Rendering Subdivision Surfaces Efficiently on the GPU Gy. Antal, L. Szirmay-Kalos and L. A. Jeni Department of Algorithms and their Applications, Faculty of Informatics, Eötvös Loránd Science University,

More information

OVERVIEW OF IEEE 1857 VIDEO CODING STANDARD

OVERVIEW OF IEEE 1857 VIDEO CODING STANDARD OVERVIEW OF IEEE 1857 VIDEO CODING STANDARD Siwei Ma, Shiqi Wang, Wen Gao {swma,sqwang, wgao}@pku.edu.cn Institute of Digital Media, Peking University ABSTRACT IEEE 1857 is a multi-part standard for multimedia

More information

Advanced Video Coding: The new H.264 video compression standard

Advanced Video Coding: The new H.264 video compression standard Advanced Video Coding: The new H.264 video compression standard August 2003 1. Introduction Video compression ( video coding ), the process of compressing moving images to save storage space and transmission

More information

Video Transcoding Architectures and Techniques: An Overview. IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang

Video Transcoding Architectures and Techniques: An Overview. IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang Video Transcoding Architectures and Techniques: An Overview IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang Outline Background & Introduction Bit-rate Reduction Spatial Resolution

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms. Visual Computing Systems CMU , Fall 2014

Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms. Visual Computing Systems CMU , Fall 2014 Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms Visual Computing Systems Review: mechanisms to reduce aliasing in the graphics pipeline When sampling visibility?! -

More information

Context based optimal shape coding

Context based optimal shape coding IEEE Signal Processing Society 1999 Workshop on Multimedia Signal Processing September 13-15, 1999, Copenhagen, Denmark Electronic Proceedings 1999 IEEE Context based optimal shape coding Gerry Melnikov,

More information

VIDEO AND IMAGE PROCESSING USING DSP AND PFGA. Chapter 3: Video Processing

VIDEO AND IMAGE PROCESSING USING DSP AND PFGA. Chapter 3: Video Processing ĐẠI HỌC QUỐC GIA TP.HỒ CHÍ MINH TRƯỜNG ĐẠI HỌC BÁCH KHOA KHOA ĐIỆN-ĐIỆN TỬ BỘ MÔN KỸ THUẬT ĐIỆN TỬ VIDEO AND IMAGE PROCESSING USING DSP AND PFGA Chapter 3: Video Processing 3.1 Video Formats 3.2 Video

More information

An Efficient Approach for Emphasizing Regions of Interest in Ray-Casting based Volume Rendering

An Efficient Approach for Emphasizing Regions of Interest in Ray-Casting based Volume Rendering An Efficient Approach for Emphasizing Regions of Interest in Ray-Casting based Volume Rendering T. Ropinski, F. Steinicke, K. Hinrichs Institut für Informatik, Westfälische Wilhelms-Universität Münster

More information

Video compression with 1-D directional transforms in H.264/AVC

Video compression with 1-D directional transforms in H.264/AVC Video compression with 1-D directional transforms in H.264/AVC The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation Kamisli, Fatih,

More information

Optimizing the Deblocking Algorithm for. H.264 Decoder Implementation

Optimizing the Deblocking Algorithm for. H.264 Decoder Implementation Optimizing the Deblocking Algorithm for H.264 Decoder Implementation Ken Kin-Hung Lam Abstract In the emerging H.264 video coding standard, a deblocking/loop filter is required for improving the visual

More information

The GPU as a co-processor in FEM-based simulations. Preliminary results. Dipl.-Inform. Dominik Göddeke.

The GPU as a co-processor in FEM-based simulations. Preliminary results. Dipl.-Inform. Dominik Göddeke. The GPU as a co-processor in FEM-based simulations Preliminary results Dipl.-Inform. Dominik Göddeke dominik.goeddeke@mathematik.uni-dortmund.de Institute of Applied Mathematics University of Dortmund

More information

STUDY AND IMPLEMENTATION OF VIDEO COMPRESSION STANDARDS (H.264/AVC, DIRAC)

STUDY AND IMPLEMENTATION OF VIDEO COMPRESSION STANDARDS (H.264/AVC, DIRAC) STUDY AND IMPLEMENTATION OF VIDEO COMPRESSION STANDARDS (H.264/AVC, DIRAC) EE 5359-Multimedia Processing Spring 2012 Dr. K.R Rao By: Sumedha Phatak(1000731131) OBJECTIVE A study, implementation and comparison

More information

Improving Energy Efficiency of Block-Matching Motion Estimation Using Dynamic Partial Reconfiguration

Improving Energy Efficiency of Block-Matching Motion Estimation Using Dynamic Partial Reconfiguration , pp.517-521 http://dx.doi.org/10.14257/astl.2015.1 Improving Energy Efficiency of Block-Matching Motion Estimation Using Dynamic Partial Reconfiguration Jooheung Lee 1 and Jungwon Cho 2, * 1 Dept. of

More information

Image Processing Tricks in OpenGL. Simon Green NVIDIA Corporation

Image Processing Tricks in OpenGL. Simon Green NVIDIA Corporation Image Processing Tricks in OpenGL Simon Green NVIDIA Corporation Overview Image Processing in Games Histograms Recursive filters JPEG Discrete Cosine Transform Image Processing in Games Image processing

More information

Week 14. Video Compression. Ref: Fundamentals of Multimedia

Week 14. Video Compression. Ref: Fundamentals of Multimedia Week 14 Video Compression Ref: Fundamentals of Multimedia Last lecture review Prediction from the previous frame is called forward prediction Prediction from the next frame is called forward prediction

More information

Multimedia Decoder Using the Nios II Processor

Multimedia Decoder Using the Nios II Processor Multimedia Decoder Using the Nios II Processor Third Prize Multimedia Decoder Using the Nios II Processor Institution: Participants: Instructor: Indian Institute of Science Mythri Alle, Naresh K. V., Svatantra

More information

A Fast Intra/Inter Mode Decision Algorithm of H.264/AVC for Real-time Applications

A Fast Intra/Inter Mode Decision Algorithm of H.264/AVC for Real-time Applications Fast Intra/Inter Mode Decision lgorithm of H.64/VC for Real-time pplications Bin Zhan, Baochun Hou, and Reza Sotudeh School of Electronic, Communication and Electrical Engineering University of Hertfordshire

More information

Final report on coding algorithms for mobile 3DTV. Gerhard Tech Karsten Müller Philipp Merkle Heribert Brust Lina Jin

Final report on coding algorithms for mobile 3DTV. Gerhard Tech Karsten Müller Philipp Merkle Heribert Brust Lina Jin Final report on coding algorithms for mobile 3DTV Gerhard Tech Karsten Müller Philipp Merkle Heribert Brust Lina Jin MOBILE3DTV Project No. 216503 Final report on coding algorithms for mobile 3DTV Gerhard

More information