Fast Motion Estimation on Graphics Hardware for H.264 Video Encoding

Martin Schwalb, Ralph Ewerth and Bernd Freisleben
Department of Mathematics and Computer Science, University of Marburg
Hans-Meerwein-Str. 3, D Marburg, Germany
{schwalbm, ewerth,

Abstract: The video coding standard H.264 supports video compression with a higher coding efficiency than previous standards. However, this comes at the expense of an increased encoding complexity, in particular for motion estimation, which becomes a very time-consuming task even for today's central processing units (CPUs). On the other hand, modern graphics hardware includes a powerful graphics processing unit (GPU) whose computing power remains idle most of the time. In this paper, we present a GPU-based approach to motion estimation for the purpose of H.264 video encoding. A small diamond search is adapted to the programming model of modern GPUs to exploit their available parallel computing power and memory bandwidth. Experimental results demonstrate a significant reduction of computation time and a competitive encoding quality compared to a CPU-based UMHexagonS implementation, while enabling the CPU to process other encoding tasks in parallel.

Index Terms: Parallel motion estimation, H.264, GPGPU (general purpose computation on GPU), programmable graphics hardware, MPEG-4 part 10/AVC.

I. INTRODUCTION

The H.264 standard (also known as MPEG-4 part 10 Advanced Video Coding (AVC) [12]) allows a bitrate reduction of coded video by enabling the encoder to carry out more complex motion estimation than any of its predecessor standards [1]. The maximum number of possible reference frames for the block matching search has been increased. Furthermore, each macroblock can be split up irregularly into smaller rectangular blocks down to the size of four by four pels (picture elements), allowing the calculation of a motion vector for each of these subblocks of different sizes.
Completely exploiting these new possibilities requires a computational effort that often cannot be met by CPU-based software encoders running on standard PC hardware. Therefore, some heuristic algorithms have been proposed to reduce the number of motion vectors that are considered worth estimating (e.g., Huang et al. [6] and Li et al. [5]). Other approaches estimate good motion vector predictors by considering the spatial or temporal neighborhood (e.g., Li et al. [5], Chen et al. [8] and Zhou et al. [7]) to save some search iterations. Although such techniques successfully reduce the computation time compared to a full search, the remaining motion estimation tasks still consume a large share of the total encoding time. On the other hand, the CPU is no longer the only programmable source of computing power in mainstream PCs. Programmability has increased from each series of Graphics Processing Units (GPUs) to its successors, allowing the implementation of more general tasks. Thereby, the computational capabilities of GPUs can not only compete with those of CPUs; in many areas, they greatly outperform them. For instance, using a synthetic benchmark, the raw floating point computing power of an NVidia GeForce 6800 Ultra was observed [10] to be about 40 GFlops, and a peak memory bandwidth of 35.2 GB/s was measured. This vast computing power originates from the parallel architecture of GPUs. Significant benefits are achievable if computationally expensive problems are successfully mapped to a GPU. Thus, it is reasonable to examine the possibilities of exploiting this normally idle processing power for video coding tasks. A few approaches to motion estimation utilizing GPU processing power have already been published. However, the exhaustive full search [2] is computationally too expensive for real-world video coding applications, while gradient-based methods ([3] and [4]) do not offer a freely definable cost function, which is recommended for video encoding.
When trying to map typical fast (non-exhaustive) motion estimation algorithms based on local block matching to the GPU, one problem is that the main concepts which make these algorithms fast cannot be executed in parallel. For example, obtaining a good predictor for a motion vector typically requires several already estimated vectors of its temporal and spatial neighborhood as source data. This does not allow a parallel estimation of the new motion vector and the vectors that serve to build its predictor. Thus, when designing a GPU-based fast motion estimation algorithm, one has to find a tradeoff between predictor quality and possible parallelism. In this paper, we propose a novel GPU-based implementation of a fast (non-exhaustive) local block matching search. The contribution of the proposed approach is threefold. First, a parallel GPU implementation of a small diamond search is introduced. Starting from predicted motion vectors, it locally finds the best matching blocks for an arbitrary set of blocks of a particular size in a single reference frame in parallel. This parallel small diamond search relies on the availability of good vector predictors and on the decision of which subset of all possible vectors is worth estimating. Thus, the second contribution is the presented composition of available CPU algorithms that creates vector predictors for blocks of the same size for the same reference frame. This composition directly supports parallel local block searches such as the GPU

implementation of the small diamond search. Together, these two parts form an efficient motion estimation framework for H.264, with the compute-intensive parts running on the GPU and non-critical parts running on the CPU. Third, since the CPU is relieved of a significant amount of computational load, the suggested GPU-based motion estimation approach potentially enables the CPU to process other encoding tasks in parallel. Experimental results demonstrate the good encoding quality of our approach: it is comparable to a UMHexagonS implementation, while enabling the CPU to process other encoding tasks in parallel. Although a complete motion estimation framework for H.264 is presented, the focus of this paper is on the GPU-specific aspects and not on the invention of an optimal generic motion estimation algorithm for H.264. The remainder of this paper is organized as follows. Section II outlines approaches to motion vector predictor generation in H.264 motion estimation, as well as the GPU-based motion estimation approaches published to date. Section III gives an introduction to the current generation of programmable commodity graphics hardware and discusses its limitations. In Section IV, the mapping of motion estimation to the GPU and implementation details are explained. The experimental results of our motion estimation (extending the JM9.0 reference implementation of the H.264 encoder) are presented in Section V. Finally, Section VI concludes the paper and outlines areas for future work.

II. RELATED WORK

One way to face the increased complexity of H.264 motion estimation is to reduce the total number of motion vectors to be estimated. For example, Huang et al. [6] develop various heuristic methods based on the information available after motion estimation with respect to the directly preceding reference frame. The main idea is to identify macroblocks which cover boundaries of moving objects.
Due to possible occlusion/uncovering, searching more reference frames is suggested only for these macroblocks, but not for the other ones. Li et al. [5] suggest another heuristic method to identify macroblocks with fast and continuous movement, using estimated motion vectors referring to the nearest three reference frames. The authors consider searching more reference frames worthwhile only for macroblocks with vibrant or smooth movement; therefore, the macroblocks with fast and continuous movement are neglected during the further search. Another popular method to reduce the computing effort needed for H.264 motion estimation is to estimate good motion vector predictors and to refine them only locally, instead of starting the estimation from an assumed zero motion vector. This often saves iterations of the local search. Li et al. [5] use motion vectors of surrounding blocks with the same mode (top, top-right and left), vectors of the uplayer block modes at the same position (Li et al. call a block mode an uplayer of another block mode if it is of larger or equal size in both dimensions; for example, 16*16 is an uplayer block mode of 16*8 and 8*16), vectors of the recent reference frame and scaled vectors of previous reference frames as vector predictors. Chen et al. [8] predict motion vectors based on reference frames with a temporal distance greater than one by reusing motion vectors of several previous frames which themselves refer to their directly preceding reference frame. Thereby, motion vectors that span n frames are predicted by adding n already estimated motion vectors. The Split and Merge operations proposed by Zhou et al. [7] take advantage of the correlation between motion vectors of various block sizes to form predictors for vectors of differently sized blocks at the same position.
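The vector composition of Chen et al. can be illustrated by a minimal Python sketch (not taken from the paper; the block size, the dictionary representation of a vector field, and the nearest-block lookup that stands in for the dominant vector selection are our assumptions):

```python
BLOCK = 8  # assumed subblock size in pels

def predict_long_span_vector(bx, by, per_frame_fields):
    """Predict a motion vector spanning len(per_frame_fields) frames by
    chaining per-frame vectors (FDVS-style, simplified).

    per_frame_fields[i] maps a block's top-left corner (in pels) to its
    motion vector towards the directly preceding reference frame; fields
    are ordered from the current frame backwards in time."""
    x, y = float(bx), float(by)
    vx_sum, vy_sum = 0.0, 0.0
    for field in per_frame_fields:
        # Snap the displaced position to the nearest block of the grid.
        key = (BLOCK * round(x / BLOCK), BLOCK * round(y / BLOCK))
        vx, vy = field.get(key, (0.0, 0.0))
        vx_sum += vx
        vy_sum += vy
        x += vx  # follow the motion into the previous frame
        y += vy
    return vx_sum, vy_sum
```

Chaining two one-frame vectors, e.g. (8, 0) followed by (2, 2) at the displaced block, yields the two-frame predictor (10, 2).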
Known vectors of larger subblocks are split up to form predictors of smaller subblocks, and vectors of smaller subblocks are merged together to form predictors of larger subblocks that cover the same area. Several approaches exploit the processing power of GPUs for video encoding/decoding tasks. Shen et al. [9] utilize the programmable GPU to accelerate video decoding by implementing the complete motion compensation feedback loop of a proprietary MPEG-like decoder, as well as the color space conversion, on the GPU as shader programs. Kelly and Kokaram [2] exploit the bilinear image interpolation capabilities of GPU Samplers (see Section III) to estimate subpel-accurate motion vectors. This interpolation is used within a full search of motion vectors based on block matching, where the GPU interpolates the subpel positions and iteratively computes SAD (sum of absolute differences) values using a number of rendering steps. The choice of the optimal motion vector is made on the CPU. Since this approach always operates on a complete frame, calculating the absolute difference of the current frame to the displaced reference frame, the same motion vector has to be evaluated for all blocks at once. Therefore, it is not trivial to directly incorporate any fast local search strategy into this approach. Apart from the block-based full search approach, the same authors also present a hierarchical Wiener-based pel-recursive motion estimation approach [3]. The solution for the update vector to a current vector estimate is split up into elementary addition, subtraction and multiplication operations that are performed in parallel on complete textures by the GPU. The automatic mipmap generation of the GPU is used to generate various low-pass filtered versions of the frames for the different hierarchy levels. On-the-fly interpolation by the Sampler is used as in the block-based full search approach. Strzodka et al.
[4] present a real-time motion estimator for visualization purposes that is implemented on graphics cards. They estimate motion through an eigenvector analysis of the spatio-temporal structure tensor at every pixel location. Apart from motion estimation, various techniques are used to visualize the estimated motion in real time. Their complete framework is implemented on the GPU and is able to perform real-time motion estimation and visualization of a 320*240 pixel image sequence at 25 Hz.

III. PROGRAMMABLE GRAPHICS HARDWARE

Today's commodity graphics cards mainly consist of a Graphics Processing Unit (GPU) and a dedicated texture random access memory (RAM) that holds the frame buffer

and textures. While GPU accesses to the texture RAM are fast, the bandwidth of the bus connecting the CPU to the texture RAM is limited. On Accelerated Graphics Port (AGP) PCs, especially the transfer back from the texture RAM to the CPU is limited to the data rate of the PCI bus (Peripheral Component Interconnect bus) of only 266 MB/s. Complex 3D objects in real-time computer graphics are typically approximated by large sets of triangles. The main purpose of the GPU is to render a stream of textured triangles to the frame buffer or another texture. Each triangle is defined by its three vertices, which are data structures that hold 3D world position coordinates and, optionally, texture coordinates and a lighting color. The rendering task can be split up into several subtasks. Therefore, the GPU is constructed as a pipelined architecture (Fig. 1), and currently two of the pipeline stages are more or less freely programmable: the Vertex Processor and the Fragment Processor.

Fig. 1. Simplified overview of the graphics pipeline from the programmer's view: the CPU feeds the Vertex Processor, whose output passes through the Rasterizer to the Fragment Processor, which reads the source texture(s) through the Sampler and writes to the destination texture(s) in the texture RAM. Technically, the Vertex and Fragment Processors themselves are split up into a number of parallel subpipelines. The arrows denote the data flow.

The Vertex Processor processes a small number of triangle vertices of a vertex stream in parallel and is used to compute arbitrary 3D transformations and projections on them. The resulting vertices are taken as input by the next stage of the pipeline: the Rasterizer. The purpose of this stage is to generate so-called fragments for all output pixels that the currently rendered triangle covers. A fragment is a data structure analogous to that of a vertex and is calculated by the Rasterizer by interpolating position coordinates, texture coordinates and lighting color values between the three input vertices of the triangle.
The Fragment Processor then takes these interpolated parameters and the source textures as input and computes a small number of pixel color values at a time in parallel by running the same fragment shader program on these fragments. The term shader is often used to refer to the part of a rendering system which encapsulates the illumination model and the shading technique (or shading model). Such an illumination model or shading technique describes the rules to compute a surface's color at a given point. Source textures can be accessed through the Sampler, which is able to linearly interpolate between four neighboring texels on the fly. Finally, the resulting color values are stored in the destination texture(s), exactly at the position(s) determined by the Rasterizer. In current GPUs, such as the NVidia GeForce 7800 GTX, the Vertex Processor consists of up to 8 parallel vertex pipelines, and the Fragment Processor is built up of up to 24 pixel pipelines. Apart from this parallelism, the basic data types of the Vertex and Fragment Processor pipelines are vectors of four floats, and the arithmetic operations are SIMD (single-instruction-multiple-data) instructions which can operate in parallel on all four floats, which normally represent an XYZW position or an RGBA pixel. Although programmability increases from each generation of GPUs to the next, there are still some restrictions, even for GPUs implementing the current Shader Model 3.0. First of all, both the number of coded instructions and the number of instructions dynamically processed at runtime are still limited. As a result, there are still no true WHILE or GOTO instructions available in the shader assembly language. A second constraint is the prohibition of random texture reads inside conditionally processed code. Only positions addressed by a set of texture coordinates that were directly interpolated by the Rasterizer can be conditionally accessed.
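The execution model of the Fragment Processor, i.e., the same shader program evaluated independently for every destination texel, can be mimicked sequentially in a few lines of Python (an illustrative sketch only; the function names are ours, and a real GPU runs the per-texel calls in parallel):

```python
def render(dest_w, dest_h, shader, *source_textures):
    """Evaluate the same 'fragment shader' function once per destination
    texel, with only the texel's coordinates and read-only source textures
    as input -- the execution model a GPU distributes over its pixel
    pipelines (emulated sequentially here)."""
    return [[shader(x, y, *source_textures) for x in range(dest_w)]
            for y in range(dest_h)]

# Example shader: copy a texel from one source texture, i.e. the 1-to-1
# mapping produced by rendering a full-screen quad (see Section IV-C).
src = [[1, 2], [3, 4]]
copied = render(2, 2, lambda x, y, tex: tex[y][x], src)
```

The key restriction mirrored here is that each shader invocation writes only its own output texel and cannot communicate with its neighbors.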
A limitation of today's series of GPUs implementing the Shader Model 3.0 concerns the efficiency of conditionally skipped fragment shader code. Because of the synchronized execution between the pixel pipelines, time can only be saved if all the pixel pipelines active at a time meet the same condition. Since these pipelines always seem to process blocks of pixels that are spatially arranged in a fixed array, this limits the use of conditional execution for the purpose of time saving to scenarios where larger blocks of neighboring pixels meet exactly the same conditions throughout the execution path of the shader program.

IV. FAST MOTION ESTIMATION USING GRAPHICS HARDWARE

For the reasons explained above, none of the mentioned GPU-based motion estimation approaches is directly applicable for the purpose of H.264 video encoding. Several requirements need to be met to successfully map a problem to a modern GPU architecture. First of all, the GPU's vast computing power originates from the existence of parallel pipelines; therefore, one can only expect performance gains from the GPU if the problem to be solved is parallelizable. Second, problems can be mapped efficiently to GPUs only if the amount of data transferred between the texture RAM and the CPU is relatively low compared to the number of operations processed by the GPU. This must not be confused with GPU texture RAM accesses, which are comparatively fast. Third, although conditional code execution is generally possible in fragment programs, its implementation in today's GPUs is not very efficient yet. Thus, it is recommended to map to the GPU first those parts of the problem which require a minimum of code in IF-THEN-ELSE blocks. Keeping these restrictions in mind, the H.264 motion estimation is mapped to the GPU as follows. The input to our motion estimation consists of the luminance components of the frame that is going to be encoded (from now on

called the "current frame") and of all reference frames. The output of the proposed algorithm consists of quarter-pel accurate vectors (and SAD values) for each subblock of any possible size and all given reference frames. The problem is split into a CPU part and two GPU parts. First, if a frame has not been used as a reference frame before, two supersampled versions of it are computed using the GPU. Then, the CPU generates motion vector predictors for the current block size by applying FDVS [8] and Split and Merge operations [7] to already available vectors. Finally, starting from the motion vector predictors, a GPU shader program performs a small diamond search in order to refine them. This small diamond search is an iterative local block matching search within a reference frame where each iteration consists of the following steps: First, some block positions according to a specific search pattern around the current best match position are evaluated in terms of a block matching metric such as SAD. Second, the position with the lowest metric value becomes the new best match position for the next iteration, except for the case that the metric value at the old best match position was even lower; in the latter case, a small diamond search normally terminates. The search pattern for the small diamond search consists of the four block positions of the old best match position shifted by one small step (e.g., one pel or quarter-pel) to the top, left, bottom and right, which form the pattern of a diamond. As a result, we obtain motion vectors and the respective SAD value for each block, which can finally be used for macroblock coding. Since typical block-based motion estimation algorithms spend most of the computing power and memory bandwidth on interpolating subpixels and calculating SAD values, these tasks are mapped to the GPU.
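The small diamond search iteration described above can be sketched in Python as follows (a plain CPU sketch with full-pel steps and SAD as the metric, not the shader implementation; the function names and the iteration limit are our choices):

```python
def sad(cur, ref, bx, by, vx, vy, bs):
    """Sum of absolute differences between the bs x bs block of the current
    frame at (bx, by) and the reference frame block displaced by (vx, vy)."""
    total = 0
    for dy in range(bs):
        for dx in range(bs):
            total += abs(cur[by + dy][bx + dx] - ref[by + vy + dy][bx + vx + dx])
    return total

def small_diamond_search(cur, ref, bx, by, vx, vy, bs=8, max_iter=16):
    """Refine a predicted full-pel motion vector (vx, vy) using the small
    diamond pattern; stops at a local SAD minimum or after max_iter steps."""
    best = sad(cur, ref, bx, by, vx, vy, bs)
    for _ in range(max_iter):
        # The four diamond positions: top, left, right, bottom of the center.
        candidates = [(vx + ox, vy + oy)
                      for ox, oy in ((0, -1), (-1, 0), (1, 0), (0, 1))]
        scored = [(sad(cur, ref, bx, by, cx, cy, bs), cx, cy)
                  for cx, cy in candidates]
        s, cx, cy = min(scored)
        if s >= best:   # the center is a local minimum: terminate
            break
        best, vx, vy = s, cx, cy
    return vx, vy, best
```

For a frame pair that is a pure translation by one pel, a search started from the zero vector converges to (1, 0) with a residual SAD of zero after a single diamond step.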
Less demanding tasks are the computation of good motion vector predictors, which will be refined subsequently, and the application of heuristic methods to reduce the number of vectors that will be estimated at all. These motion estimation parts still reside on the CPU. For ease of implementation, only unidirectional motion estimation is considered, and SAD is directly used as the function to be locally minimized within the small diamond search iterations, without any incorporation of motion vector bit costs or the Hadamard transformation. In H.264, the Hadamard transformation is used to encode the residual block difference after motion compensation. Therefore, it is advantageous to estimate the block difference bit costs using the sum of absolute (Hadamard-)transformed differences (SATD) instead of the simpler SAD, which we use to ease the implementation. The choice of the optimal encoding mode for a macroblock is beyond the scope of this paper.

A. Workflow of the Entire Approach and the CPU Based Part

The CPU part of our motion estimation is basically a combination of the Split and Merge operations as proposed by Zhou et al. [7], the Forward Dominant Vector Selection (FDVS) algorithm of Chen et al. [8] and the Flexible Multi-Reference Frame Search Criterion of Li et al. [5]. The Split and Merge operations are used to compute motion vector predictors for other subblock sizes from given motion vectors of one subblock size. The FDVS algorithm computes motion vector predictors for a reference frame with distance n by adding n motion vectors of previous frames, each spanning a distance of one frame. The Flexible Multi-Reference Frame Search Criterion is a simple heuristic method that stops the search for fast and continuously moving macroblocks after searching the nearest three reference frames.
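The Split and Merge predictor generation used by the CPU part can be sketched as follows (our own simplification for the 16x16/8x8 case; the paper does not specify how merged vectors are combined, so the componentwise mean is an assumption):

```python
def split_predictors(vec):
    """Split: the vector of a 16x16 block becomes the predictor of all
    four 8x8 subblocks covering the same area (keys are the subblock
    offsets within the macroblock, in pels)."""
    return {(dx, dy): vec for dx in (0, 8) for dy in (0, 8)}

def merge_predictor(subvectors):
    """Merge: combine the four 8x8 subblock vectors into one 16x16
    predictor. The componentwise mean used here is an assumption; any
    combination rule (e.g., a median) fits the same interface."""
    n = len(subvectors)
    return (sum(v[0] for v in subvectors) / n,
            sum(v[1] for v in subvectors) / n)
```

Both operations are cheap and depend only on already finished vectors of one block size, which is why they remain on the CPU while the refinement runs on the GPU.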
In our implementation, this heuristic method uses the vectors of the 16x16 blocks to decide whether any subblock of the corresponding macroblock has to be searched within the farther reference frames. These methods are combined as follows. For each frame, do the following:

1. If a frame serves as a reference frame for the first time: create two supersampled and pre-shifted versions of it, as needed by the GPU-based small diamond search.
2. Calculate displacement vectors from the current frame to its directly preceding reference frame:
   2.1 Use the corresponding resulting vectors of the 8x8 subblocks of the preceding frame as predictors (or zero motion vectors for the very first and every 8th frame transition, to prevent error accumulation);
   2.2 Refine the vector predictors for all 8x8 subblocks in parallel using the GPU-based small diamond search;
   2.3 For all other subblock sizes: compute vector predictors starting from the 8x8 subblock vectors using Split and Merge [7] 2; refine the vector predictors for the current subblock size in parallel using the GPU-based small diamond search.
3. For all other reference frames (from nearest to farthest):
   3.1 If three reference frames have been processed, apply the Flexible Multi-Reference Frame Search Criterion heuristic [5] to stop the search for certain macroblocks;
   3.2 Compute vector predictors for the 8x8 subblocks using the FDVS algorithm [8];
   3.3 Refine the vector predictors for all 8x8 subblocks in parallel using the GPU-based small diamond search;
   3.4 Analogously to step 2.3, create vector predictors starting from the finished 8x8 subblocks for the other subblock sizes using Merge and Split operations, and refine them using the GPU-based small diamond search.

B.
Interpolation of the Values at Half-Pel and Quarter-Pel Positions during Reference Frame Texture Supersampling

The supersampling of the reference frame textures is mapped to the GPU and is calculated only once per newly emerging reference frame, in three steps, prior to any small diamond search runs referring to it. Each of these three steps is implemented in its own fragment shader program:

1) Horizontal supersampling by a factor of 2: every second destination texel in a scanline is interpolated using the horizontal 6-tap filter described in the H.264 standard. The other texels are filled with the unfiltered source texels.
2) Vertical supersampling by a factor of 2: the already horizontally supersampled texture resulting from step 1 is vertically supersampled analogously to step 1. The result is a temporary reference frame texture that is already horizontally and vertically extended by the half-pel positions.
3) Supersampling to a factor of 4 with bilinear interpolation and simultaneous pre-shifting: the extended reference frame texture is completed by simultaneously inserting values at the quarter-pel positions, horizontally and vertically. All new values are simply calculated by bilinear interpolation, whereas the values at the other positions are just copied. The same fragment shader program that implements this bilinear interpolation is used to realize a pre-shifting according to the small diamond search pattern offsets on the fly, as illustrated in Fig. 2 and explained in detail below. The purpose of this pre-shifting is explained in Section IV-D.

The third step is calculated twice with differently sized search patterns to finally create two supersampled textures for each reference frame: one for the quarter-pel search and one for the full-pel search (this allows reusing the implementation of the quarter-pel search for the full-pel search later on).

Fig. 2. Creation of a pre-shifted reference frame texture for quarter-pel [full-pel] search: a supersampled texture is pre-shifted in 4 different ways by one texel [four texels]. The 4 resulting pre-shifted versions are then incorporated into one 4-component RGBA texture.

Fig. 3. The rendering of a full-screen quad. Note: it is not necessary (as indicated by the crosses) that a total mapping exists between the subblocks of the current video frame and the texels of the source texture or destination texture. In this way, the number of vectors to be estimated is reduced.

2 If the difference of all source vectors of a Split or Merge operation to the resulting vector of the operation is below a threshold, the resulting vector is directly taken as the final vector, without refining it through a small diamond search.

C.
Application of the Small Diamond Search Fragment Shader Program

The GPU-based refinement of predicted motion vectors is implemented as a fragment shader program (as described in Section IV-D). This shader program is applied by rendering a source texture (consisting of a field of predicted motion vectors) to a destination texture which represents the final motion vector field and the related SAD values. The reference frame and the current frame are also represented as textures and serve as additional input data to the shader program, which executes an adapted small diamond search on the current frame's motion vector predictors of subblocks of a fixed size, in parallel, in one rendering step. The block size is coded via a global variable readable by the shader program, ensuring that the appropriate number of current and reference frame pels is used to render a motion vector and its related SAD value. The source texture represents the predicted motion vector field for a given block size. The motion vector predictors are coded as floating point RGBA texels. Next to the vector predictors themselves, the subblock positions are also coded into the source texture's texels. The additional coding of the block positions allows an arbitrary mapping between the subblocks of the current frame and the source texture in order to compute a smaller subset of the motion vector field to save computation time. Further inputs to the shader program are the luminance components of the current frame and the current reference frame. They are coded as three additional textures which are dynamically addressed by the fragment shader program. Thereby, the luminance component of the current frame is stored in one 8-bit current frame texture. It is neither supersampled nor pre-shifted, and it contains only one color channel that holds the luminance data.
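As an illustration, packing a subblock position together with its vector predictor into one four-component floating point texel could look as follows (the channel layout is purely an assumption; the paper fixes no ordering):

```python
def pack_texel(block_x, block_y, vx, vy):
    """One RGBA float texel of the source texture: block position in the
    RG channels, predicted motion vector in the BA channels (hypothetical
    layout; any fixed assignment of the four channels would work)."""
    return (float(block_x), float(block_y), float(vx), float(vy))

def unpack_texel(texel):
    """Recover the block position and the vector predictor from a texel."""
    r, g, b, a = texel
    return (int(r), int(g)), (b, a)
```

Because the block position travels inside the texel, the source texture does not have to cover every subblock of the frame, which is what permits estimating only a subset of the vector field.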
The luminance component of the reference frame is redundantly stored in two 32-bit reference frame textures: one for the full-pel search and one for the quarter-pel search. Both textures are supersampled and pre-shifted, as described in Section IV-B. The output of the shader program is a destination texture whose pixel data represent the final motion vector results. Each motion vector is coded by two of the four RGBA components of the floating point destination texture; a third component holds the corresponding SAD value. A so-called full-screen quad is utilized to accomplish a 1-to-1 texel mapping between the source texture and the destination texture. The full-screen quad consists of two triangles that are rendered to the destination texture and occlude it completely. By specifying one set of texture coordinates for the quad (referring to the source texture) and the quad's set of position coordinates (referring to the destination texture), the 1-to-1 texel mapping is accomplished (see Fig. 3). Note that the term full-screen quad is commonly used in the GPGPU (general purpose computation on GPU) community and is not related to video frames. The complete motion estimation for a given block size with respect to a reference frame is started by simply rendering the full-screen quad to the destination texture with the small diamond search fragment shader program activated. This small diamond search application already executes the complete full-pel search and quarter-pel search within one rendering pass. Finally, only the computed motion vectors and SAD values need to be transferred back (in the form of the destination texture) to the system RAM across the AGP bus.
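The half-pel interpolation performed in step 1 of the supersampling (Section IV-B) uses the standard H.264 six-tap filter with taps (1, -5, 20, 20, -5, 1). The following Python sketch computes one horizontal half-pel sample (edge handling by pel replication; rounding and clipping to the 8-bit range as in the standard; the function name is ours):

```python
def halfpel_horizontal(row, x):
    """H.264 six-tap half-pel luminance value between row[x] and row[x + 1].
    The six taps weight the three full-pel samples on either side; the sum
    is rounded (+16) and divided by 32 (>> 5), then clipped to [0, 255]."""
    taps = (1, -5, 20, 20, -5, 1)
    n = len(row)
    acc = sum(t * row[min(max(x - 2 + i, 0), n - 1)]  # replicate edge pels
              for i, t in enumerate(taps))
    return min(255, max(0, (acc + 16) >> 5))
```

On a constant scanline the filter reproduces the input value exactly, since the taps sum to 32.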

D. Implementation of the Small Diamond Search Fragment Shader Program

The realization of the small diamond search as a shader program allows the parallel computation or refinement of motion vectors for several blocks at a time, according to the number of available pixel pipelines. Adapting the small diamond search to a shader program requires some considerations in order to exploit the GPU's properties. First of all, the SIMD functionality of each single pipeline is used to calculate the SAD values of all four positions of the small diamond pattern at once, in parallel. Second, the number of texture accesses during the four SAD calculations is minimized. This is accomplished by pre-shifting the reference frame textures in such a way that each texel directly contains four different samples of the reference frame's luminance component in its RGBA components. Thereby, the four samples originate from positions that are defined by the small diamond search pattern with a fixed width (see Fig. 2). Since two different pattern widths are required (namely quarter-pel and full-pel pattern widths), two differently pre-shifted versions of each reference frame texture are generated, as described in Section IV-B. Using such a pre-shifted reference frame texture, a pixel's absolute differences at all four small diamond pattern positions can be evaluated in parallel using only two texture accesses: one 8-bit access to the current frame texture and one 32-bit access to the reference frame texture. This task is demonstrated by the SAD() subroutine of the pseudo code (see Fig. 4), which outlines the whole shader program that is run in parallel on many pixel pipelines. The block size is controlled by a global variable which is set outside of the shader program. The shader program is optimized at the assembly language level in terms of the number of used registers and manually unrolled loops.
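The effect of the pre-shifted reference textures can be reproduced in a small Python sketch: after packing the four diamond-shifted samples of every position into one "texel", the four SAD accumulations need only one current-frame read and one packed reference read per pel (a CPU illustration of the data layout, not the shader code; names and edge replication are our choices):

```python
DIAMOND = ((0, -1), (-1, 0), (1, 0), (0, 1))  # small diamond offsets

def preshift(ref):
    """Pack the four diamond-shifted reference samples of every position
    into one 4-component 'texel' (edge pels replicated), emulating the
    pre-shifted RGBA reference texture."""
    h, w = len(ref), len(ref[0])
    return [[tuple(ref[min(max(y + oy, 0), h - 1)][min(max(x + ox, 0), w - 1)]
                   for ox, oy in DIAMOND)
             for x in range(w)] for y in range(h)]

def sad4(cur, packed_ref, bx, by, vx, vy, bs):
    """SAD at all four diamond positions around vector (vx, vy), using two
    reads per pel: one current-frame sample and one packed reference texel
    (four absolute differences are formed from the second read)."""
    sads = [0, 0, 0, 0]
    for dy in range(bs):
        for dx in range(bs):
            c = cur[by + dy][bx + dx]                    # read 1
            r4 = packed_ref[by + vy + dy][bx + vx + dx]  # read 2: 4 samples
            for i in range(4):
                sads[i] += abs(c - r4[i])
    return sads
```

In the shader, the four accumulations in the inner loop map onto one four-wide SIMD instruction, which is the point of the packed layout.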
Both searches, the full-pel and the quarter-pel search, are implemented in the same single shader program, which always executes both: the complete full-pel search prior to the quarter-pel search. As stated in Section III, it is not possible to conditionally read textures using dynamically calculated addresses, or to implement a real WHILE loop in a fragment shader program on current Shader Model 3.0 hardware. Therefore, the termination criterion of the standard small diamond search has to be altered for an implementation as a fragment shader program. The standard small diamond search terminates when a local minimum is found or when a given maximum number of iterations has been spent. In contrast, the GPU small diamond search cannot terminate on the condition that a local minimum is found. Due to the GPU's limitations, it always has to spend the same predefined number of iterations. As shown in the pseudo code (see Fig. 4), the proposed GPU implementation will not change the vector in subsequent iterations after a local minimum has been found. Thus, the extra iterations cost some execution time but do not change the result.

V. EXPERIMENTAL RESULTS

The reference implementation JM9.0 of the H.264 encoder was extended by the proposed approach, using the DirectX 9 API to address the GPU. The output vectors of our algorithm were incorporated into the encoding process. Although the resulting SAD values could have been taken as block costs, the original code of JM9.0 was used to re-evaluate the block costs once, based on our final vectors. This guaranteed the compatibility of measures within the whole encoder and minimized the changes made to the existing code of the reference implementation. The small diamond search fragment shader program was written in Shader Model 3.0 assembly language. The High Level Shading Language (HLSL) was used to implement the 6-tap filter interpolation and the pre-shifting shader programs.

This implementation was evaluated on a standard Windows XP PC consisting of the following key hardware components: an AMD Athlon XP CPU, 1 GB RAM and a GeForce 6600GT AGP graphics card equipped with 128 MB RAM, which is connected to the GPU by a 128 bit wide bus. The GPU core was normally clocked at 500 MHz, and the graphics RAM was clocked at 900 MHz, thus theoretically allowing a total memory bandwidth of about MB/s between the GPU and the graphics RAM. With 8 pixel pipelines, the GeForce 6600GT represents the current middle class of GPUs.

To test the encoding quality and speed of the implemented approach, the following standard CIF (Common Intermediate Format) video sequences were chosen as test material: coastguard, flower, foreman, mobile and tempete. The speed was also exemplarily tested with the following QCIF (Quarter Common Intermediate Format) video sequences: carphone, container and foreman. The experimental results are compared to the results of the UMHexagonS algorithm, which is the only fast motion estimation algorithm in the JM9.0 reference encoder, as well as to an entirely CPU-based implementation of the GPU approach.

Fig. 5. PSNR/bitrate results for the coastguard sequence (300 CIF frames): PSNR of the Y-component [dB] over bitrate [kbit/s] for UMHexagonS (with SATD and vector cost), UMHexagonS (with SAD and no vector cost) and the GPU small diamond search (with SAD and no vector cost).

For all tests, the encoding profile was set to Baseline, so neither B-frame encoding nor CABAC (context adaptive

7 7 Main shader program: 1. Read predicted motion vector from source texture; 2. Call SAD() to calculate SAD value of search center; 3. for number of full-pel iterations do: 3.1 Call SAD() to calculate 4 SAD values in parallel using full-pel pattern; 3.2 Get the new minimum SAD of the 4 new SAD values and the old minimum SAD; 3.3 if new minimum SAD!= old minimum SAD then move vector one full-pel step in direction of new minimum; 4. for number of quarter-pel iterations do: 4.1 Call SAD() to calculate 4 SAD values in parallel using quarter-pel pattern; 4.2 Get the new minimum SAD of the 4 new SAD values and the old minimum SAD; 4.3 if new minimum SAD!= old minimum SAD then move vector one quarter-pel step in direction of new minimum; 4.4 else set minimum_found_flag; 5. return minimum SAD, minimum_found_flag and vector; SAD() subroutine: 1. Set the four SAD values to 0; 2. for each pel of a block do: 2.1 Get the current frame luminance value from current texture position; 2.2 Get the 4 reference frame values at once (they are different since the textures are pre-shifted according to the small diamond search pattern); 2.3 Compute absolute differences of current frame luminance value and each of the 4 reference frame values (in one step again); 2.4 Add each of the 4 absolute difference values to one of the four SAD values (again in one step); 3. return the 4 SAD values; Fig. 4. Pseudo code of the main shader program and the SAD subroutine. time [sec] coastguard CIF flower CIF foreman CIF ME execution times mobile CIF tempete CIF carphone QCIF ME time (GPU) ME time (CPU) container QCIF foreman QCIF Fig. 6. The execution times (in seconds) needed for the CPU part and GPU part of the motion estimation for three algorithms (from left to right): 1. UMHexagonS, 2. UMHexagonS without SATD and without vectorcost, and 3. our approach on the GPU. No parallel processing between the CPU and the GPU was utilized yet. binary arithmetic coding) were used. 
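The control flow of Fig. 4 can be transcribed into plain Python to illustrate the fixed-iteration behaviour (full-pel stage only; the quarter-pel stage repeats the same pattern with a finer step). The cost function is an illustrative stand-in for the four-way packed SAD fetch, and the `early_exit` flag models the early-terminating CPU reference variant:

```python
DIAMOND = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # small diamond pattern

def diamond_search(sad, start, iterations=7, early_exit=False):
    """With early_exit=False this mirrors the GPU shader: it always spends
    all iterations, but the vector stops moving once a local minimum is
    reached, so the extra iterations cost time without changing the result.
    early_exit=True models the CPU implementation, which breaks instead."""
    best, best_cost = start, sad(start)
    for _ in range(iterations):
        cands = [(best[0] + dx, best[1] + dy) for dx, dy in DIAMOND]
        costs = [sad(c) for c in cands]   # evaluated in parallel on the GPU
        i = min(range(4), key=costs.__getitem__)
        if costs[i] < best_cost:
            best, best_cost = cands[i], costs[i]   # one diamond step
        elif early_exit:
            break                                  # CPU-only termination
    return best, best_cost
```

Both termination policies return the same vector and cost; the GPU version merely burns the leftover iterations, which is exactly the behaviour argued above.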
The rate distortion optimization of the encoder was generally disabled, the number of reference frames was set to 5, and no bitrate control was used. The UMHexagonS algorithm serves as a reference in two variants. First, the original, full-featured UMHexagonS implementation was used, which minimizes a Lagrangian cost function during the quarter-pel search and incorporates the SATD (sum of absolutely transformed differences) and the estimated vector costs. Second, the same UMHexagonS implementation was used, but vector costs were ignored during minimization and the SAD was used as the block matching metric even for the quarter-pel search. The latter variant is thereby reduced to the same SAD cost function that our proposed approach uses. For both UMHexagonS reference tests, the SearchRange was set to 16 pels. For the GPU based motion estimation, the following parameters were used: the number of small diamond search iterations was normally set to 7 for the full-pel search and to 4 for the quarter-pel search. In cases where zero motion vectors are refined instead of predicted vectors, the number of full-pel search iterations was set to 10. To allow a direct comparison between the speed of the GPU and the CPU, the proposed algorithm was also implemented in C++ to run entirely on the CPU. Apart from the missing parallelization within the SAD calculation, the only algorithmic difference between the GPU version and the CPU version is that the CPU implementation of the small diamond search terminates (as usual) when reaching a local minimum within the 7 (or 4) iterations. First, we investigated whether the proposed approach results in an acceptable video quality. Using the encoding parameters described above, the CIF video sequences were encoded with several quantization parameter settings (QP=23, 27, 32, 37, 44, 48). Fig. 5 (see also Fig.
9 to 12) shows graphs of the resulting PSNR and bitrate for both UMHexagonS variants as well as for the proposed GPU based approach; Table I shows numerical results for QP=32. The graphs of the UMHexagonS variant with SAD as the cost function and the graphs of our proposed approach are comparable in most of the measured cases and indicate that a sufficiently good video quality can be achieved using the proposed GPU based approach. The question of whether a fast block matching based motion estimation can be implemented on modern GPUs can thus be answered with a clear yes. Nevertheless, the original idea was not only to explore whether block matching based motion estimation can be implemented on GPUs, but to directly benefit from their potentially high computing power. Therefore, we measured the execution times required for the GPU and the CPU motion estimation parts. Fig. 6 shows the encoding times for the CIF and QCIF test sequences using the two UMHexagonS variants and our GPU based motion estimation algorithm. It can be observed that the GPU based approach outperforms the UMHexagonS CPU implementation in all measured cases.
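The quality curves plot the PSNR of the Y component against bitrate; for reference, the metric follows directly from the mean squared error of the 8-bit luminance samples (a minimal sketch, not code from the paper):

```python
import math

def psnr_y(orig, recon):
    """Luminance PSNR in dB for 8-bit samples (two equal-length sequences):
    10 * log10(255^2 / MSE), infinite for identical inputs."""
    mse = sum((a - b) ** 2 for a, b in zip(orig, recon)) / len(orig)
    return float('inf') if mse == 0 else 10.0 * math.log10(255.0 ** 2 / mse)
```

A worse motion estimate leaves a larger residual after motion compensation, which at a fixed quantizer shows up as either a lower PSNR or a higher bitrate, which is why the comparison is made along these curves.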

TABLE I. Numerical results for PSNR [dB] / bitrate [kbit/s] with QP=32 for the sequences coastguard, flower, foreman, mobile and tempete, comparing UMHexagonS (with SATD and vector cost), UMHexagonS (with SAD and no vector cost) and the GPU small diamond search.

For CIF sequences, the speed gain is clearly larger than for QCIF sequences. This indicates that using the GPU introduces some overhead which can only be compensated by parallel processing when the amount of data to be processed is large enough. A direct comparison between the GPU based implementation and the same approach running entirely on the CPU is displayed in Fig. 7. It shows that shifting the small diamond search and the interpolation to the GPU also improves the speed, while opening up the possibility of working on different parts of the encoding process on different processors in parallel, which would further reduce the encoding time. In Fig. 7, the execution time of the GPU approach is split into the GPU part (the interpolation and the parallel small diamond search) and the CPU part (the motion vector predictor generation). Although these two parts mutually rely on each other's results and therefore cannot directly be executed in parallel, the CPU is relieved of computational load while the GPU runs the interpolation or the small diamond search. Thus, the CPU is free to process additional tasks. For example, the CPU could perform motion estimation for a smaller frame region completely on its own in order to achieve an optimal load distribution. In this way, the time denoted as the GPU part in Fig. 7 could be further reduced, but the current implementation does not exploit this possibility yet. Nevertheless, the proposed approach enables this further parallelization between the GPU and the CPU.
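The load-distribution idea sketched above, letting the CPU handle a frame region of its own, amounts to splitting the macroblock rows so that both processors finish at the same time. A back-of-the-envelope helper (hypothetical throughput figures, not measurements from the paper):

```python
def balanced_split(total_rows, gpu_rows_per_s, cpu_rows_per_s):
    """Number of macroblock rows the CPU should take so that both
    processors finish together: cpu_rows / cpu_rate == gpu_rows / gpu_rate,
    i.e. the CPU's share is proportional to its relative throughput."""
    share = total_rows * cpu_rows_per_s / (gpu_rows_per_s + cpu_rows_per_s)
    return round(share)
```

For example, with the 18 macroblock rows of a CIF frame and a GPU twice as fast as the CPU, the CPU would take 6 rows and the GPU 12, so neither processor idles while the other finishes.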
Although the GPU based motion estimation approach reduces the processing time noticeably, the results did not meet our expectations regarding the theoretically available GPU processing power and bandwidth, even keeping in mind that the missing early termination criterion of the diamond search adaptation potentially wastes a lot of computing power. Therefore, experiments were conducted to find the bottleneck of the approach by independently reducing the clock speeds of the GPU and the graphics RAM to approximately 50% of their factory defaults. Fig. 8 shows the encoding times of the coastguard sequence when underclocking the GPU, the graphics RAM, or both, in relation to the execution time at the normal clock rates. Underclocking the graphics RAM had a much stronger effect on the encoding time than underclocking the GPU: reducing the RAM clock rate to 50% significantly increased the execution time of the GPU specific motion estimation part, which indicates that memory bandwidth is the bottleneck. Since the SAD calculation is very memory intensive while its arithmetical complexity is rather low, this coincides with the expected result. Although the memory bandwidth was clearly identified as the bottleneck of the approach, the actually used bandwidth is only about 1.6% of the theoretical maximum. The used bandwidth was approximated by dividing the known number and size of memory accesses by the total time the GPU spent on motion estimation.

Fig. 7. Direct comparison between the GPU based approach and the CPU implementation of the same approach using the coastguard CIF sequence.

Fig. 8. The execution times (in seconds) needed for the various parts of motion estimation when underclocking the GPU (500 MHz vs. 249 MHz), the graphics RAM (900 MHz vs. 449 MHz), or both.
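The used-bandwidth approximation described above is a simple division, and the access volume of the SAD stage follows from the two-fetch scheme (one 8-bit plus one 32-bit access per pel). The sketch below uses illustrative placeholder values, not the paper's measurements:

```python
# One 8-bit current-frame fetch plus one 32-bit packed reference fetch per pel.
BYTES_PER_PEL = 1 + 4

def sad_traffic_bytes(num_blocks, pels_per_block, sad_evals_per_block):
    """Bytes moved by the SAD fetches alone (caching effects ignored)."""
    return num_blocks * sad_evals_per_block * pels_per_block * BYTES_PER_PEL

def effective_bandwidth_mb_s(bytes_accessed, seconds):
    """Used bandwidth: total access volume divided by the GPU time
    spent on motion estimation, in MB/s."""
    return bytes_accessed / seconds / 1e6
```

For instance, the 396 macroblocks of one CIF frame, each running 12 four-way SAD evaluations on hypothetical 4x4 blocks, would move 396 * 12 * 16 * 5 = 380160 bytes under this accounting; dividing such totals by the measured GPU time yields the utilisation figure quoted in the text.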
We conclude that this is caused by the random memory accesses to the current frame texture and the reference frame textures, which are quite untypical for a shader program and therefore interfere with the texture pre-fetching and caching strategies implemented in the GPU. Other researchers [11] observed similar memory transfer reductions when making heavy use of random texture accesses. In addition, we conducted experiments with two further CPU motion estimation algorithms that were added to the JM reference software in version JM10.0: Simplified UMHexagonS and EPZS patterns. Unfortunately, many parameter names have changed since version JM9.0 (there are about two hundred parameters), and hence a direct comparison with our GPU based proposal is difficult. Nevertheless, the following relative time measurements were obtained using JM12.0 for the sequence coastguard: the Simplified UMHexagonS version is about 22% faster than UMHexagonS, the EPZS pattern is 3% faster than UMHexagonS, and all three achieve nearly identical PSNR values (PSNR: 37.0 dB at a bitrate of 1450 kbit/s). Our proposed GPU motion estimation approach is more than 50% faster than the CPU based UMHexagonS approach (see Fig. 6). Although the performance of our approach is limited by the random memory accesses, it is comparable to state-of-the-art motion estimation algorithms for H.264 implemented on general purpose CPUs, both in

terms of speed and quality. For example, consider the reported results of Li et al. [5], who presented an approach to H.264 motion estimation (for the motion estimation part with integer accuracy only). We exemplarily compare the results for the sequence coastguard: Li et al. report a number of search points of and an average matching time of ms, measured on an Intel Pentium IV 3.0 GHz processor (average results for their whole test set: search points and a matching time of ms). Li et al. implemented their approach using SSE2 assembly instructions for Intel processing units. For the coastguard sequence, our approach searches an average number of points per macroblock for the integer motion estimation part, and an average block matching time of 1.06 ms was measured. The performance of our implementation of the proposed GPU based approach is comparable, although the average block matching time is three times faster for Li et al.'s approach. If one considers that only a middle-class GPU was used in our experiments, and that a recent graphics card like the NVidia GeForce 7800 GTX has 24 pixel pipelines (instead of the 8 pipelines of the GeForce 6600GT we used, theoretically a speed-up factor of 3), includes RAM clocked at 1350 MHz (instead of 900 MHz, theoretically a speed-up factor of 1.5) and has 256-bit wide RAM access (instead of 128-bit, theoretically a speed-up factor of 2), a further performance speed-up factor between 1.5 and 3 can be assumed for our approach. Thus, using an upper-class graphics card would probably result in a comparable block matching time for the proposed approach. The achieved quality of the encoded sequences is comparable to a UMHexagonS approach that minimizes the same cost function (see Fig. 5 and Figs. 9 to 12). Hence, in terms of the number of search points, average block matching time and quality, the proposed GPU approach is competitive with the state-of-the-art approach mentioned above, which is implemented on a high-performance general purpose CPU.

Fig. 9. PSNR/bitrate results of flower, 250 frames, CIF.
Fig. 10. PSNR/bitrate results of foreman, 300 frames, CIF.
Fig. 11. PSNR/bitrate results of mobile, 300 frames, CIF.
Fig. 12. PSNR/bitrate results of tempete, 260 frames, CIF, 15 Hz. (Each figure compares UMHexagonS with SATD and vector cost, UMHexagonS with SAD and no vector cost, and the GPU small diamond search with SAD and no vector cost.)

VI. CONCLUSIONS

In this paper, we proposed a GPU based approach to fast motion estimation on commodity graphics cards for the purpose of H.264 video encoding. The problem is split into finding appropriate motion vector predictors for all possible subblocks using a CPU based implementation part, which

incorporates state-of-the-art techniques, and a GPU part that refines the calculated predictors using a GPU adaptation of a parallel small diamond search. The approach has been implemented and tested on several video sequences. The achieved encoding quality turned out to be competitive with the JM9.0 reference implementation of UMHexagonS when using the SAD as its cost function, while the resulting implementation clearly outperformed the UMHexagonS CPU implementation of the H.264 reference encoder in terms of speed. An important additional advantage of the proposed approach is that it frees the CPU to process other encoding tasks in parallel while the GPU takes care of the motion estimation. However, the theoretically expected performance gain could not be fully achieved. The main performance issue of the proposed approach was identified to be the random texture accesses needed by the GPU part, which appear to collide with the texture pre-fetching and caching strategies implemented in current GPUs. Apart from the unsuitable caching strategies, the unavailability of real WHILE loops and the prohibition of arbitrary conditional texture accesses in current GPUs prevent an early termination criterion as used in the standard small diamond search. Instead of spending only as many iterations as necessary, a constant number of iterations has to be chosen that is high enough for most cases. This leads to a number of needlessly evaluated loop iterations and memory accesses, and therefore to wasted computing time and memory bandwidth. The bottleneck of the proposed approach is thus the high number of random memory accesses, not the arithmetical complexity of the SAD block matching metric.
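Because arithmetic is cheap relative to memory traffic here, a costlier metric such as the SATD, a 4x4 Hadamard transform of the difference block followed by a sum of coefficient magnitudes, could be evaluated on the pels that the SAD fetches already bring in. A minimal sketch of the metric itself (pure Python, not from the paper; implementations such as JM typically apply an additional normalization factor):

```python
def hadamard4(rows):
    """4-point Hadamard butterfly applied to each of four length-4 rows."""
    out = []
    for a, b, c, d in rows:
        s0, s1, d0, d1 = a + b, c + d, a - b, c - d
        out.append([s0 + s1, d0 + d1, s0 - s1, d0 - d1])
    return out

def satd4x4(cur, ref):
    """SATD of a 4x4 block: 2-D Hadamard transform of the difference block
    (rows, then columns), then the sum of absolute transform coefficients."""
    diff = [[c - r for c, r in zip(cr, rr)] for cr, rr in zip(cur, ref)]
    t = hadamard4(diff)                       # transform the rows
    t = hadamard4(list(map(list, zip(*t))))   # transpose, transform the columns
    return sum(abs(v) for row in t for v in row)
```

The inputs are exactly the current and reference pels already read for the SAD, so in a memory-bound shader the extra butterflies would add computation without adding texture traffic, which is the basis of the future-work argument.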
Therefore, extending the proposed approach with the SATD promises a further computational benefit compared to a CPU based approach, since the SATD could be computed in parallel in the shader program for each search position without any additional cost for memory accesses. Even the incorporation of a vector cost estimation into the cost function seems feasible to some extent. Thus, replacing the SAD by a more complex Lagrangian cost function within the shader program, as well as making use of parallel processing on the GPU and the CPU, are subjects of future work.

REFERENCES

[1] J. Ostermann, J. Bormans, P. List, D. Marpe, M. Narroschke, F. Pereira, T. Stockhammer, and T. Wedi, "Video Coding with H.264/AVC: Tools, Performance, and Complexity," IEEE Circuits and Systems Magazine, pp. 7-28, 1st Quarter 2004.
[2] F. Kelly and A. Kokaram, "Fast Image Interpolation for Motion Estimation using Graphics Hardware," Proc. SPIE Vol. 5297, Real-Time Imaging VIII, 2004.
[3] F. Kelly and A. Kokaram, "Graphics Hardware for Gradient Based Motion Estimation," Proc. SPIE Vol. 5309, Embedded Processors for Multimedia and Communications, 2004.
[4] R. Strzodka and C. S. Garbe, "Real-Time Motion Estimation and Visualization on Graphics Cards," Proc. IEEE Visualization 2004, 2004.
[5] X. Li, E. Q. Li, and Y.-K. Chen, "Fast Multi-Frame Motion Estimation Algorithm with Adaptive Search Strategies in H.264," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Vol. 3, 2004.
[6] Y.-W. Huang, B.-Y. Hsieh, T.-C. Wang, S.-Y. Chien, S.-Y. Ma, C.-F. Shen, and L.-G. Chen, "Analysis and Reduction of Reference Frames for Motion Estimation in MPEG-4 AVC/JVT/H.264," Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), Hong Kong, Vol. 3, 2003.
[7] Z. Zhou, M.-T. Sun, and Y.-F. Hsu, "Fast Variable Block-Size Motion Estimation Algorithms Based on Merge and Split Procedures for H.264/MPEG-4 AVC," Proc. IEEE International Symposium on Circuits and Systems (ISCAS), Vancouver, Vol. 3, 2004.
[8] M.-J. Chen, Y.-Y. Chiang, H.-J. Li, and M.-C. Chi, "Efficient Multi-Frame Motion Estimation Algorithms for MPEG-4 AVC/JVT/H.264," Proc. IEEE International Symposium on Circuits and Systems (ISCAS), Vancouver, Vol. 3, 2004.
[9] G. Shen, G.-P. Gao, S. Li, H.-Y. Shum, and Y.-Q. Zhang, "Accelerate Video Decoding with Generic GPU," IEEE Transactions on Circuits and Systems for Video Technology, Vol. 15, Issue 5, 2005.
[10] M. Harris, "GPGPU: Beyond Graphics," tutorial held at EUROGRAPHICS 2004, 25th Annual Conference of the European Association for Computer Graphics, 2004.
[11] I. Buck, "GPU Computation Strategies & Tricks," presentation of a course held at SIGGRAPH 2004, 31st International Conference on Computer Graphics and Interactive Techniques, 2004.
[12] MPEG-4 Part 10/AVC, "Coding of Audiovisual Objects - Part 10: Advanced Video Coding," ISO/IEC 14496-10:2003, 2003.

Martin Schwalb received his diploma in computer science from the University of Marburg, Germany. He is currently with ipharro Media GmbH, Darmstadt, Germany, a recent spin-off of the Fraunhofer Institute for Computer Graphics, where he is working on the core of a leading-edge video fingerprinting technology. His current research interests include content based video retrieval, video copy detection, image features and GPGPU.

Ralph Ewerth is a research assistant in the Department of Mathematics and Computer Science at the University of Marburg, Germany. He received his diploma in computer science in 2002 and the Ph.D. degree in computer science in 2008, both from the University of Marburg, Germany. His research interests include video coding, machine learning, and multimedia content analysis and retrieval.

Bernd Freisleben is a full professor of computer science in the Department of Mathematics and Computer Science at the University of Marburg, Germany. He received his Master's degree in computer science from the Pennsylvania State University, USA, in 1981, and the Ph.D. degree in computer science from the Darmstadt University of Technology, Germany. His research interests include computational intelligence, scientific computing, and multimedia computing.


More information

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011)

Lecture 6: Texture. Kayvon Fatahalian CMU : Graphics and Imaging Architectures (Fall 2011) Lecture 6: Texture Kayvon Fatahalian CMU 15-869: Graphics and Imaging Architectures (Fall 2011) Today: texturing! Texture filtering - Texture access is not just a 2D array lookup ;-) Memory-system implications

More information

Implementation and analysis of Directional DCT in H.264

Implementation and analysis of Directional DCT in H.264 Implementation and analysis of Directional DCT in H.264 EE 5359 Multimedia Processing Guidance: Dr K R Rao Priyadarshini Anjanappa UTA ID: 1000730236 priyadarshini.anjanappa@mavs.uta.edu Introduction A

More information

High Performance Hardware Architectures for A Hexagon-Based Motion Estimation Algorithm

High Performance Hardware Architectures for A Hexagon-Based Motion Estimation Algorithm High Performance Hardware Architectures for A Hexagon-Based Motion Estimation Algorithm Ozgur Tasdizen 1,2,a, Abdulkadir Akin 1,2,b, Halil Kukner 1,2,c, Ilker Hamzaoglu 1,d, H. Fatih Ugurdag 3,e 1 Electronics

More information

Compression of Stereo Images using a Huffman-Zip Scheme

Compression of Stereo Images using a Huffman-Zip Scheme Compression of Stereo Images using a Huffman-Zip Scheme John Hamann, Vickey Yeh Department of Electrical Engineering, Stanford University Stanford, CA 94304 jhamann@stanford.edu, vickey@stanford.edu Abstract

More information

Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding

Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding Deblocking Filter Algorithm with Low Complexity for H.264 Video Coding Jung-Ah Choi and Yo-Sung Ho Gwangju Institute of Science and Technology (GIST) 261 Cheomdan-gwagiro, Buk-gu, Gwangju, 500-712, Korea

More information

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller Entertainment Graphics: Virtual Realism for the Masses CSE 591: GPU Programming Introduction Computer games need to have: realistic appearance of characters and objects believable and creative shading,

More information

High Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC

High Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC Journal of Computational Information Systems 7: 8 (2011) 2843-2850 Available at http://www.jofcis.com High Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC Meihua GU 1,2, Ningmei

More information

Comparative Study of Partial Closed-loop Versus Open-loop Motion Estimation for Coding of HDTV

Comparative Study of Partial Closed-loop Versus Open-loop Motion Estimation for Coding of HDTV Comparative Study of Partial Closed-loop Versus Open-loop Motion Estimation for Coding of HDTV Jeffrey S. McVeigh 1 and Siu-Wai Wu 2 1 Carnegie Mellon University Department of Electrical and Computer Engineering

More information

Overview: motion estimation. Differential motion estimation

Overview: motion estimation. Differential motion estimation Overview: motion estimation Differential methods Fast algorithms for Sub-pel accuracy Rate-constrained motion estimation Bernd Girod: EE368b Image Video Compression Motion Estimation no. 1 Differential

More information

EE 5359 MULTIMEDIA PROCESSING SPRING Final Report IMPLEMENTATION AND ANALYSIS OF DIRECTIONAL DISCRETE COSINE TRANSFORM IN H.

EE 5359 MULTIMEDIA PROCESSING SPRING Final Report IMPLEMENTATION AND ANALYSIS OF DIRECTIONAL DISCRETE COSINE TRANSFORM IN H. EE 5359 MULTIMEDIA PROCESSING SPRING 2011 Final Report IMPLEMENTATION AND ANALYSIS OF DIRECTIONAL DISCRETE COSINE TRANSFORM IN H.264 Under guidance of DR K R RAO DEPARTMENT OF ELECTRICAL ENGINEERING UNIVERSITY

More information

The Scope of Picture and Video Coding Standardization

The Scope of Picture and Video Coding Standardization H.120 H.261 Video Coding Standards MPEG-1 and MPEG-2/H.262 H.263 MPEG-4 H.264 / MPEG-4 AVC Thomas Wiegand: Digital Image Communication Video Coding Standards 1 The Scope of Picture and Video Coding Standardization

More information

Motion Estimation for H.264/AVC on Multiple GPUs Using NVIDIA CUDA

Motion Estimation for H.264/AVC on Multiple GPUs Using NVIDIA CUDA Motion Estimation for H.264/AVC on Multiple GPUs Using NVIDIA CUDA Bart Pieters a, Charles F. Hollemeersch, Peter Lambert, and Rik Van de Walle Department of Electronics and Information Systems Multimedia

More information

Fast Motion Estimation for Shape Coding in MPEG-4

Fast Motion Estimation for Shape Coding in MPEG-4 358 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 13, NO. 4, APRIL 2003 Fast Motion Estimation for Shape Coding in MPEG-4 Donghoon Yu, Sung Kyu Jang, and Jong Beom Ra Abstract Effective

More information

Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology

Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology Course Presentation Multimedia Systems Video II (Video Coding) Mahdi Amiri April 2012 Sharif University of Technology Video Coding Correlation in Video Sequence Spatial correlation Similar pixels seem

More information

An Improved H.26L Coder Using Lagrangian Coder Control. Summary

An Improved H.26L Coder Using Lagrangian Coder Control. Summary UIT - Secteur de la normalisation des télécommunications ITU - Telecommunication Standardization Sector UIT - Sector de Normalización de las Telecomunicaciones Study Period 2001-2004 Commission d' études

More information

View Synthesis for Multiview Video Compression

View Synthesis for Multiview Video Compression View Synthesis for Multiview Video Compression Emin Martinian, Alexander Behrens, Jun Xin, and Anthony Vetro email:{martinian,jxin,avetro}@merl.com, behrens@tnt.uni-hannover.de Mitsubishi Electric Research

More information

Introduction to Video Compression

Introduction to Video Compression Insight, Analysis, and Advice on Signal Processing Technology Introduction to Video Compression Jeff Bier Berkeley Design Technology, Inc. info@bdti.com http://www.bdti.com Outline Motivation and scope

More information

high performance medical reconstruction using stream programming paradigms

high performance medical reconstruction using stream programming paradigms high performance medical reconstruction using stream programming paradigms This Paper describes the implementation and results of CT reconstruction using Filtered Back Projection on various stream programming

More information

ECE 571 Advanced Microprocessor-Based Design Lecture 20

ECE 571 Advanced Microprocessor-Based Design Lecture 20 ECE 571 Advanced Microprocessor-Based Design Lecture 20 Vince Weaver http://www.eece.maine.edu/~vweaver vincent.weaver@maine.edu 12 April 2016 Project/HW Reminder Homework #9 was posted 1 Raspberry Pi

More information

Interframe coding A video scene captured as a sequence of frames can be efficiently coded by estimating and compensating for motion between frames pri

Interframe coding A video scene captured as a sequence of frames can be efficiently coded by estimating and compensating for motion between frames pri MPEG MPEG video is broken up into a hierarchy of layer From the top level, the first layer is known as the video sequence layer, and is any self contained bitstream, for example a coded movie. The second

More information

10.2 Video Compression with Motion Compensation 10.4 H H.263

10.2 Video Compression with Motion Compensation 10.4 H H.263 Chapter 10 Basic Video Compression Techniques 10.11 Introduction to Video Compression 10.2 Video Compression with Motion Compensation 10.3 Search for Motion Vectors 10.4 H.261 10.5 H.263 10.6 Further Exploration

More information

Semi-Hierarchical Based Motion Estimation Algorithm for the Dirac Video Encoder

Semi-Hierarchical Based Motion Estimation Algorithm for the Dirac Video Encoder Semi-Hierarchical Based Motion Estimation Algorithm for the Dirac Video Encoder M. TUN, K. K. LOO, J. COSMAS School of Engineering and Design Brunel University Kingston Lane, Uxbridge, UB8 3PH UNITED KINGDOM

More information

FAST SPATIAL LAYER MODE DECISION BASED ON TEMPORAL LEVELS IN H.264/AVC SCALABLE EXTENSION

FAST SPATIAL LAYER MODE DECISION BASED ON TEMPORAL LEVELS IN H.264/AVC SCALABLE EXTENSION FAST SPATIAL LAYER MODE DECISION BASED ON TEMPORAL LEVELS IN H.264/AVC SCALABLE EXTENSION Yen-Chieh Wang( 王彥傑 ), Zong-Yi Chen( 陳宗毅 ), Pao-Chi Chang( 張寶基 ) Dept. of Communication Engineering, National Central

More information

Chapter 10. Basic Video Compression Techniques Introduction to Video Compression 10.2 Video Compression with Motion Compensation

Chapter 10. Basic Video Compression Techniques Introduction to Video Compression 10.2 Video Compression with Motion Compensation Chapter 10 Basic Video Compression Techniques 10.1 Introduction to Video Compression 10.2 Video Compression with Motion Compensation 10.3 Search for Motion Vectors 10.4 H.261 10.5 H.263 10.6 Further Exploration

More information

Efficient MPEG-2 to H.264/AVC Intra Transcoding in Transform-domain

Efficient MPEG-2 to H.264/AVC Intra Transcoding in Transform-domain MITSUBISHI ELECTRIC RESEARCH LABORATORIES http://www.merl.com Efficient MPEG- to H.64/AVC Transcoding in Transform-domain Yeping Su, Jun Xin, Anthony Vetro, Huifang Sun TR005-039 May 005 Abstract In this

More information

Chapter 11.3 MPEG-2. MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps Defined seven profiles aimed at different applications:

Chapter 11.3 MPEG-2. MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps Defined seven profiles aimed at different applications: Chapter 11.3 MPEG-2 MPEG-2: For higher quality video at a bit-rate of more than 4 Mbps Defined seven profiles aimed at different applications: Simple, Main, SNR scalable, Spatially scalable, High, 4:2:2,

More information

Motion Estimation for Video Coding Standards

Motion Estimation for Video Coding Standards Motion Estimation for Video Coding Standards Prof. Ja-Ling Wu Department of Computer Science and Information Engineering National Taiwan University Introduction of Motion Estimation The goal of video compression

More information

GPGPU. Peter Laurens 1st-year PhD Student, NSC

GPGPU. Peter Laurens 1st-year PhD Student, NSC GPGPU Peter Laurens 1st-year PhD Student, NSC Presentation Overview 1. What is it? 2. What can it do for me? 3. How can I get it to do that? 4. What s the catch? 5. What s the future? What is it? Introducing

More information

International Journal of Emerging Technology and Advanced Engineering Website: (ISSN , Volume 2, Issue 4, April 2012)

International Journal of Emerging Technology and Advanced Engineering Website:   (ISSN , Volume 2, Issue 4, April 2012) A Technical Analysis Towards Digital Video Compression Rutika Joshi 1, Rajesh Rai 2, Rajesh Nema 3 1 Student, Electronics and Communication Department, NIIST College, Bhopal, 2,3 Prof., Electronics and

More information

CS427 Multicore Architecture and Parallel Computing

CS427 Multicore Architecture and Parallel Computing CS427 Multicore Architecture and Parallel Computing Lecture 6 GPU Architecture Li Jiang 2014/10/9 1 GPU Scaling A quiet revolution and potential build-up Calculation: 936 GFLOPS vs. 102 GFLOPS Memory Bandwidth:

More information

Title Adaptive Lagrange Multiplier for Low Bit Rates in H.264.

Title Adaptive Lagrange Multiplier for Low Bit Rates in H.264. Provided by the author(s) and University College Dublin Library in accordance with publisher policies. Please cite the published version when available. Title Adaptive Lagrange Multiplier for Low Bit Rates

More information

Graphics Hardware. Instructor Stephen J. Guy

Graphics Hardware. Instructor Stephen J. Guy Instructor Stephen J. Guy Overview What is a GPU Evolution of GPU GPU Design Modern Features Programmability! Programming Examples Overview What is a GPU Evolution of GPU GPU Design Modern Features Programmability!

More information

IMPROVED CONTEXT-ADAPTIVE ARITHMETIC CODING IN H.264/AVC

IMPROVED CONTEXT-ADAPTIVE ARITHMETIC CODING IN H.264/AVC 17th European Signal Processing Conference (EUSIPCO 2009) Glasgow, Scotland, August 24-28, 2009 IMPROVED CONTEXT-ADAPTIVE ARITHMETIC CODING IN H.264/AVC Damian Karwowski, Marek Domański Poznań University

More information

EFFICIENT PU MODE DECISION AND MOTION ESTIMATION FOR H.264/AVC TO HEVC TRANSCODER

EFFICIENT PU MODE DECISION AND MOTION ESTIMATION FOR H.264/AVC TO HEVC TRANSCODER EFFICIENT PU MODE DECISION AND MOTION ESTIMATION FOR H.264/AVC TO HEVC TRANSCODER Zong-Yi Chen, Jiunn-Tsair Fang 2, Tsai-Ling Liao, and Pao-Chi Chang Department of Communication Engineering, National Central

More information

Aiyar, Mani Laxman. Keywords: MPEG4, H.264, HEVC, HDTV, DVB, FIR.

Aiyar, Mani Laxman. Keywords: MPEG4, H.264, HEVC, HDTV, DVB, FIR. 2015; 2(2): 201-209 IJMRD 2015; 2(2): 201-209 www.allsubjectjournal.com Received: 07-01-2015 Accepted: 10-02-2015 E-ISSN: 2349-4182 P-ISSN: 2349-5979 Impact factor: 3.762 Aiyar, Mani Laxman Dept. Of ECE,

More information

ABSTRACT. KEYWORD: Low complexity H.264, Machine learning, Data mining, Inter prediction. 1 INTRODUCTION

ABSTRACT. KEYWORD: Low complexity H.264, Machine learning, Data mining, Inter prediction. 1 INTRODUCTION Low Complexity H.264 Video Encoding Paula Carrillo, Hari Kalva, and Tao Pin. Dept. of Computer Science and Technology,Tsinghua University, Beijing, China Dept. of Computer Science and Engineering, Florida

More information

Outline Introduction MPEG-2 MPEG-4. Video Compression. Introduction to MPEG. Prof. Pratikgiri Goswami

Outline Introduction MPEG-2 MPEG-4. Video Compression. Introduction to MPEG. Prof. Pratikgiri Goswami to MPEG Prof. Pratikgiri Goswami Electronics & Communication Department, Shree Swami Atmanand Saraswati Institute of Technology, Surat. Outline of Topics 1 2 Coding 3 Video Object Representation Outline

More information

IN the early 1980 s, video compression made the leap from

IN the early 1980 s, video compression made the leap from 70 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 9, NO. 1, FEBRUARY 1999 Long-Term Memory Motion-Compensated Prediction Thomas Wiegand, Xiaozheng Zhang, and Bernd Girod, Fellow,

More information

Could you make the XNA functions yourself?

Could you make the XNA functions yourself? 1 Could you make the XNA functions yourself? For the second and especially the third assignment, you need to globally understand what s going on inside the graphics hardware. You will write shaders, which

More information

Efficient and optimal block matching for motion estimation

Efficient and optimal block matching for motion estimation Efficient and optimal block matching for motion estimation Stefano Mattoccia Federico Tombari Luigi Di Stefano Marco Pignoloni Department of Electronics Computer Science and Systems (DEIS) Viale Risorgimento

More information

ARCHITECTURES OF INCORPORATING MPEG-4 AVC INTO THREE-DIMENSIONAL WAVELET VIDEO CODING

ARCHITECTURES OF INCORPORATING MPEG-4 AVC INTO THREE-DIMENSIONAL WAVELET VIDEO CODING ARCHITECTURES OF INCORPORATING MPEG-4 AVC INTO THREE-DIMENSIONAL WAVELET VIDEO CODING ABSTRACT Xiangyang Ji *1, Jizheng Xu 2, Debin Zhao 1, Feng Wu 2 1 Institute of Computing Technology, Chinese Academy

More information

Motion Vector Coding Algorithm Based on Adaptive Template Matching

Motion Vector Coding Algorithm Based on Adaptive Template Matching Motion Vector Coding Algorithm Based on Adaptive Template Matching Wen Yang #1, Oscar C. Au #2, Jingjing Dai #3, Feng Zou #4, Chao Pang #5,Yu Liu 6 # Electronic and Computer Engineering, The Hong Kong

More information

THE MPEG-2 video coding standard is widely used in

THE MPEG-2 video coding standard is widely used in 172 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 18, NO. 2, FEBRUARY 2008 A Fast MB Mode Decision Algorithm for MPEG-2 to H.264 P-Frame Transcoding Gerardo Fernández-Escribano,

More information

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1

X. GPU Programming. Jacobs University Visualization and Computer Graphics Lab : Advanced Graphics - Chapter X 1 X. GPU Programming 320491: Advanced Graphics - Chapter X 1 X.1 GPU Architecture 320491: Advanced Graphics - Chapter X 2 GPU Graphics Processing Unit Parallelized SIMD Architecture 112 processing cores

More information

H.264 to MPEG-4 Transcoding Using Block Type Information

H.264 to MPEG-4 Transcoding Using Block Type Information 1568963561 1 H.264 to MPEG-4 Transcoding Using Block Type Information Jae-Ho Hur and Yung-Lyul Lee Abstract In this paper, we propose a heterogeneous transcoding method of converting an H.264 video bitstream

More information

Understanding Sources of Inefficiency in General-Purpose Chips

Understanding Sources of Inefficiency in General-Purpose Chips Understanding Sources of Inefficiency in General-Purpose Chips Rehan Hameed Wajahat Qadeer Megan Wachs Omid Azizi Alex Solomatnikov Benjamin Lee Stephen Richardson Christos Kozyrakis Mark Horowitz GP Processors

More information

Rendering Subdivision Surfaces Efficiently on the GPU

Rendering Subdivision Surfaces Efficiently on the GPU Rendering Subdivision Surfaces Efficiently on the GPU Gy. Antal, L. Szirmay-Kalos and L. A. Jeni Department of Algorithms and their Applications, Faculty of Informatics, Eötvös Loránd Science University,

More information

OVERVIEW OF IEEE 1857 VIDEO CODING STANDARD

OVERVIEW OF IEEE 1857 VIDEO CODING STANDARD OVERVIEW OF IEEE 1857 VIDEO CODING STANDARD Siwei Ma, Shiqi Wang, Wen Gao {swma,sqwang, wgao}@pku.edu.cn Institute of Digital Media, Peking University ABSTRACT IEEE 1857 is a multi-part standard for multimedia

More information

Advanced Video Coding: The new H.264 video compression standard

Advanced Video Coding: The new H.264 video compression standard Advanced Video Coding: The new H.264 video compression standard August 2003 1. Introduction Video compression ( video coding ), the process of compressing moving images to save storage space and transmission

More information

Video Transcoding Architectures and Techniques: An Overview. IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang

Video Transcoding Architectures and Techniques: An Overview. IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang Video Transcoding Architectures and Techniques: An Overview IEEE Signal Processing Magazine March 2003 Present by Chen-hsiu Huang Outline Background & Introduction Bit-rate Reduction Spatial Resolution

More information

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI.

CSCI 402: Computer Architectures. Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI. CSCI 402: Computer Architectures Parallel Processors (2) Fengguang Song Department of Computer & Information Science IUPUI 6.6 - End Today s Contents GPU Cluster and its network topology The Roofline performance

More information

Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms. Visual Computing Systems CMU , Fall 2014

Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms. Visual Computing Systems CMU , Fall 2014 Lecture 6: Texturing Part II: Texture Compression and GPU Latency Hiding Mechanisms Visual Computing Systems Review: mechanisms to reduce aliasing in the graphics pipeline When sampling visibility?! -

More information

Context based optimal shape coding

Context based optimal shape coding IEEE Signal Processing Society 1999 Workshop on Multimedia Signal Processing September 13-15, 1999, Copenhagen, Denmark Electronic Proceedings 1999 IEEE Context based optimal shape coding Gerry Melnikov,

More information

VIDEO AND IMAGE PROCESSING USING DSP AND PFGA. Chapter 3: Video Processing

VIDEO AND IMAGE PROCESSING USING DSP AND PFGA. Chapter 3: Video Processing ĐẠI HỌC QUỐC GIA TP.HỒ CHÍ MINH TRƯỜNG ĐẠI HỌC BÁCH KHOA KHOA ĐIỆN-ĐIỆN TỬ BỘ MÔN KỸ THUẬT ĐIỆN TỬ VIDEO AND IMAGE PROCESSING USING DSP AND PFGA Chapter 3: Video Processing 3.1 Video Formats 3.2 Video

More information

An Efficient Approach for Emphasizing Regions of Interest in Ray-Casting based Volume Rendering

An Efficient Approach for Emphasizing Regions of Interest in Ray-Casting based Volume Rendering An Efficient Approach for Emphasizing Regions of Interest in Ray-Casting based Volume Rendering T. Ropinski, F. Steinicke, K. Hinrichs Institut für Informatik, Westfälische Wilhelms-Universität Münster

More information

Video compression with 1-D directional transforms in H.264/AVC

Video compression with 1-D directional transforms in H.264/AVC Video compression with 1-D directional transforms in H.264/AVC The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation Kamisli, Fatih,

More information

Optimizing the Deblocking Algorithm for. H.264 Decoder Implementation

Optimizing the Deblocking Algorithm for. H.264 Decoder Implementation Optimizing the Deblocking Algorithm for H.264 Decoder Implementation Ken Kin-Hung Lam Abstract In the emerging H.264 video coding standard, a deblocking/loop filter is required for improving the visual

More information

The GPU as a co-processor in FEM-based simulations. Preliminary results. Dipl.-Inform. Dominik Göddeke.

The GPU as a co-processor in FEM-based simulations. Preliminary results. Dipl.-Inform. Dominik Göddeke. The GPU as a co-processor in FEM-based simulations Preliminary results Dipl.-Inform. Dominik Göddeke dominik.goeddeke@mathematik.uni-dortmund.de Institute of Applied Mathematics University of Dortmund

More information

STUDY AND IMPLEMENTATION OF VIDEO COMPRESSION STANDARDS (H.264/AVC, DIRAC)

STUDY AND IMPLEMENTATION OF VIDEO COMPRESSION STANDARDS (H.264/AVC, DIRAC) STUDY AND IMPLEMENTATION OF VIDEO COMPRESSION STANDARDS (H.264/AVC, DIRAC) EE 5359-Multimedia Processing Spring 2012 Dr. K.R Rao By: Sumedha Phatak(1000731131) OBJECTIVE A study, implementation and comparison

More information

Improving Energy Efficiency of Block-Matching Motion Estimation Using Dynamic Partial Reconfiguration

Improving Energy Efficiency of Block-Matching Motion Estimation Using Dynamic Partial Reconfiguration , pp.517-521 http://dx.doi.org/10.14257/astl.2015.1 Improving Energy Efficiency of Block-Matching Motion Estimation Using Dynamic Partial Reconfiguration Jooheung Lee 1 and Jungwon Cho 2, * 1 Dept. of

More information

Image Processing Tricks in OpenGL. Simon Green NVIDIA Corporation

Image Processing Tricks in OpenGL. Simon Green NVIDIA Corporation Image Processing Tricks in OpenGL Simon Green NVIDIA Corporation Overview Image Processing in Games Histograms Recursive filters JPEG Discrete Cosine Transform Image Processing in Games Image processing

More information

Week 14. Video Compression. Ref: Fundamentals of Multimedia

Week 14. Video Compression. Ref: Fundamentals of Multimedia Week 14 Video Compression Ref: Fundamentals of Multimedia Last lecture review Prediction from the previous frame is called forward prediction Prediction from the next frame is called forward prediction

More information

Multimedia Decoder Using the Nios II Processor

Multimedia Decoder Using the Nios II Processor Multimedia Decoder Using the Nios II Processor Third Prize Multimedia Decoder Using the Nios II Processor Institution: Participants: Instructor: Indian Institute of Science Mythri Alle, Naresh K. V., Svatantra

More information

A Fast Intra/Inter Mode Decision Algorithm of H.264/AVC for Real-time Applications

A Fast Intra/Inter Mode Decision Algorithm of H.264/AVC for Real-time Applications Fast Intra/Inter Mode Decision lgorithm of H.64/VC for Real-time pplications Bin Zhan, Baochun Hou, and Reza Sotudeh School of Electronic, Communication and Electrical Engineering University of Hertfordshire

More information

Final report on coding algorithms for mobile 3DTV. Gerhard Tech Karsten Müller Philipp Merkle Heribert Brust Lina Jin

Final report on coding algorithms for mobile 3DTV. Gerhard Tech Karsten Müller Philipp Merkle Heribert Brust Lina Jin Final report on coding algorithms for mobile 3DTV Gerhard Tech Karsten Müller Philipp Merkle Heribert Brust Lina Jin MOBILE3DTV Project No. 216503 Final report on coding algorithms for mobile 3DTV Gerhard

More information