A 27 mW 1.1 mm² Motion Estimator for Picture-Rate Up-converter

Aleksandar Berić, Ramanathan Sethuraman, Harm Peters, Jef van Meerbergen, Gerard de Haan, Carlos Alba Pinto
Eindhoven University of Technology, Dept. of Electrical Engg., Eindhoven, The Netherlands
Philips Research Laboratories, Eindhoven, The Netherlands
a.b.beric@tue.nl

Abstract

The gap between application-specific integrated circuits (ASICs) and general purpose programmable processors in terms of performance, power, cost and flexibility is well known. Application specific instruction set processors (ASIPs) bridge this wide gap. This work presents the design of a very long instruction word (VLIW) based ASIP for motion estimation, which is used in the picture-rate up-conversion application. The ASIP meets low-power and low-cost requirements while providing flexibility for the application domain. It consumes 27 mW and occupies an area of 1.1 mm² in 0.13 µm technology when delivering motion estimation functionality for standard definition (SD) sequences at 140 fps. The motion estimator performs a single scan in which each block of 8*8 pixels is evaluated against a set of five motion vector candidates. The evaluation criterion is the sum-of-absolute-differences (SAD) criterion with a SAD window size of 32 pixels. In order to prove the concept in silicon, an FPGA prototyping system has been used.

1 Introduction

In today's television market, picture-rate up-conversion, as part of the video format conversion chain [1], plays an important role. In the past, simple and cheaply implementable algorithms such as picture repetition were used for picture-rate up-conversion [2]. However, these algorithms produced visual artifacts such as motion judder and blur.
To enhance the quality of the interpolated pictures, recent algorithms use motion estimation, which makes picture-rate (temporal) up-conversion a challenging application that references more than 10 Mb of image data and requires a bandwidth on the order of Gpel/s. This work presents a low-cost, low-power implementation of a motion estimation algorithm used in temporal up-conversion that addresses the high memory capacity and bandwidth requirements.

As the starting point of the design, a behavioural description of the temporal up-conversion application in the C language is used. In the next step, the hardware/software partitioning is performed. Extensive simulations of the partitioned application were performed at three different abstraction levels: partitioned C code, RTL (generated through a high-level-synthesis (HLS) tool [3]) and netlist (generated through a gate-level-synthesis tool [4]). All three levels of the system simulation were carried out using a bit- and cycle-true protocol [5] for the communication between hardware and software tasks.

The control-intensive tasks of an application can be mapped onto a general purpose programmable processor (ARM, MIPS) [5]. For mapping the compute-intensive tasks of an application, on the other hand, two well known approaches exist, namely ASICs and general purpose programmable processors. ASICs optimally meet the performance and power requirements but lack flexibility. They have VHDL as design entry, which causes relatively long design times and makes late specification changes and version upgrades difficult to handle. General purpose programmable processors are highly flexible but give significant overhead in performance, power and cost. Unlike these well known approaches, application specific instruction set processors (ASIPs) offer a flexible, low-cost and low-power solution while still meeting the performance requirements.
ASIPs, tuned to an application domain, can be based on any processor architecture template, such as a VLIW architecture [6, 7] or a vector processor architecture [8]. In this work, the VLIW architecture template is used. The choice of the ASIP template architecture depends greatly on the characteristics of the application domain; motion estimation, for instance, is efficiently implementable on the VLIW architecture template. Starting from the C description, the VHDL of the VLIW processor and of the VLIW application specific functional units is derived through the A|RT Expert and A|RT Builder tools [3], respectively. The designed motion estimator takes 1.1 mm² in 0.13 µm technology and consumes 27 mW to process 140 standard definition (720*576) frames per second. The concept is proven in silicon by demonstrating the end design in an FPGA-based prototyping environment [9].

Figure 1. (a) Interpolation at time instance n+α. (b) The motion compensation algorithm used for interpolation.

The remainder of this paper is organised as follows. Section 2 briefly explains the temporal up-conversion algorithm used in this paper. The architecture and hardware/software partitioning of the application are presented in Section 3. The design of a VLIW based ASIP for motion estimation is explained in detail in Section 4. The end design is demonstrated using an FPGA-based prototyping methodology, and the area, power and performance numbers of the design are presented in Section 5. Conclusions are drawn and directions for future work are presented in Section 6.

2 Temporal Up-conversion: Algorithm

Common video cameras record images at 50 or 60 Hz, while film registers a scene at 24, 25 or 30 pictures per second. Modern televisions display the video stream at a variety of image rates that range from 50 to 100 Hz. Thus, high quality temporal up-conversion of streaming video from one format to another is of great importance, and it is realised using motion estimation and compensation.

Motion compensation is based on the motion vector field generated by the motion estimator. After motion estimation is performed, a best matching motion vector candidate is assigned to every pixel, identified by its spatial position $\vec{x}$ and temporal position $n$. The best matching motion vector candidate, or displacement vector, $\vec{D}(\vec{x}, n)$ is the vector from the input set that offers the lowest match error. Based on the displacement vector field calculated at the temporal position $n + \alpha$, $0 \le \alpha \le 1$, as well as the pixels available at the time instances $n$ and $n+1$, new pixels can be interpolated at the time instance $n + \alpha$. Fig. 1a illustrates the creation of the pixel e in the interpolated frame using the motion compensated pixels a and b and the pixels at the same spatial but different temporal positions (pixels c and d). As illustrated in fig. 1b, the luminance value of the interpolated pixels, $F_{dyn}(\vec{x}, n+\alpha)$, is determined using the median of the input pixels [10], where the luminance value of the pixel located at $(\vec{x}, n)$ is given by $F(\vec{x}, n)$:

\[ F_{dyn}(\vec{x}, n+\alpha) = \mathrm{med}\left\{ F(\vec{x} - \alpha\vec{D}, n),\; F(\vec{x} + (1-\alpha)\vec{D}, n+1),\; F_a(\vec{x}, n+\alpha) \right\}, \quad 0 \le \alpha \le 1 \]

where $F_a$ is the non-motion compensated picture average defined by the following equation:

\[ F_a(\vec{x}, n+\alpha) = (1-\alpha)\, F(\vec{x}, n) + \alpha\, F(\vec{x}, n+1), \quad 0 \le \alpha \le 1 \]

Figure 2. The two-level caching strategy with data compression applied to the frame memories (M1, M2) and the L1 cache. The data decompression block (DEC/IDCT) performs decoding (DEC) and the inverse discrete cosine transform (IDCT) of the data stream. Data bus 2 connects the frame memories to the L1 caches, data bus 1 connects the L1 caches to DEC/IDCT, and data bus 0 feeds the L0 caches used by motion estimation and compensation (ME + MC).

For motion estimation, the three-dimensional recursive search (3DRS) block matching algorithm [10] is used, since it offers a smooth motion vector field at relatively low computational cost. The 3DRS motion estimator is based on the full-search block matcher (FSBM), which divides the image into blocks of pixels $B(\vec{X})$ with centre $\vec{X}$ and assigns a displacement vector, $\vec{D}(\vec{X}, n)$, to all pixels of every (processing) block at image number $n$. The processing block size is usually set to 8*8 pixels. The displacement vector is selected from a candidate set, $CS^{max}$, that limits the possible output vectors to a search area, SA. In this work, the candidate set consists of five motion vectors (two spatial, one temporal, one pseudo-random and the null-vector candidate). However, the flexibility of this design allows the number and the selection of the candidates in the set to be chosen arbitrarily [11].
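The median interpolation defined earlier in this section can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the nearest-pixel rounding of fractional positions and the clamping at the frame border are assumptions made only to keep the example self-contained.

```python
import math

def median3(a, b, c):
    """Return the middle value of three numbers."""
    return sorted((a, b, c))[1]

def _sample(frame, x, y):
    """Read frame[y][x]; rounding to the nearest pixel and clamping are assumptions."""
    h, w = len(frame), len(frame[0])
    xi = min(max(int(math.floor(x + 0.5)), 0), w - 1)
    yi = min(max(int(math.floor(y + 0.5)), 0), h - 1)
    return frame[yi][xi]

def interpolate_pixel(prev, curr, x, y, d, alpha):
    """F_dyn at position (x, y) and time n + alpha, for displacement d = (dx, dy).

    prev is the frame at time n, curr the frame at time n + 1 (lists of rows).
    """
    dx, dy = d
    # Motion compensated pixel from the previous frame: F(x - alpha*D, n)
    mc_prev = _sample(prev, x - alpha * dx, y - alpha * dy)
    # Motion compensated pixel from the current frame: F(x + (1 - alpha)*D, n + 1)
    mc_curr = _sample(curr, x + (1 - alpha) * dx, y + (1 - alpha) * dy)
    # Non-motion compensated average F_a at the same spatial position
    average = (1 - alpha) * prev[y][x] + alpha * curr[y][x]
    return median3(mc_prev, mc_curr, average)

# Usage: a bright pixel moves two positions to the right between the frames.
prev = [[0] * 8 for _ in range(5)]
curr = [[0] * 8 for _ in range(5)]
prev[2][2] = 200
curr[2][4] = 200
halfway = interpolate_pixel(prev, curr, x=3, y=2, d=(2, 0), alpha=0.5)
```

With the correct displacement, both motion compensated samples agree and the median selects them; with a null vector all three inputs are background, so the median falls back to the background value.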
The criterion used in this work for evaluating the motion vector candidates is the sum-of-absolute-differences (SAD) criterion, which offers a good compromise between computational complexity and quality.

3 Temporal Up-conversion: Architecture

The architecture of the up-converter is depicted in fig. 2. The input frames are written into frame memory 1 (M1), while frame memory 2 (M2) contains the previous image. The previous frame (time instance n in fig. 1) and the current frame (time instance n+1 in fig. 1) are used by the motion estimation and compensation in order to generate the new interpolated frame (time instance n+α in fig. 1).

3.1 Data Compression and Locality of Reference

The minimal memory capacity that enables multiple motion estimation scans is two frame memories (12.65 Mbit for the PAL standard stored in 4:2:2 format), which even in state-of-the-art silicon technology is very difficult and costly to realise on a single chip. However, the use of DCT-based data compression [12] for compressing the frame memory data (with a compression ratio of up to a factor of four) hardly degrades the quality of temporal up-conversion.
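The 12.65 Mbit figure quoted above can be reproduced with a short calculation. The sketch below assumes 8 bits per sample, so that 4:2:2 sampling averages 16 bits per pixel, and takes 1 Mbit as 2^20 bits; neither assumption is stated explicitly in the text, and the function name is illustrative.

```python
def frame_memory_mbit(width=720, height=576, bits_per_pixel=16, frames=2):
    """Capacity of the frame memories in Mbit (1 Mbit = 2**20 bits, an assumption)."""
    bits = width * height * bits_per_pixel * frames
    return bits / 2**20

uncompressed = frame_memory_mbit()   # two PAL frames in 4:2:2 format
compressed = uncompressed / 4        # with the maximum compression ratio of four
```

Under these assumptions the uncompressed capacity evaluates to 12.65625 Mbit, which matches the quoted value when truncated to two decimals, and a compression factor of four brings it down to roughly 3.2 Mbit.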

Table 1. Total memory capacity, area and power, and frame memory bandwidth (data bus 2 in fig. 2), as a function of the number of caching levels and of data compression (compression factor set to 4), in 0.13 µm technology. Columns: without data compression (1 level, 2 levels) and with data compression (1 level, 2 levels). Rows: capacity [Mb], area [mm²], power [mW], data bus 2 bandwidth [Mpel/s].

The use of data compression also offers a significant reduction of the frame memory bandwidth. Further reduction can be achieved by exploiting the locality of reference [13, 14]. In order to have a predictable system design, the complete search areas from the previous and current frames are cached (2*72*40 pixels), which leads to an L0 cache size of 45 Kb. Since the motion vector candidates are restricted to the search area, this approach does not result in cache misses.

In order to further reduce the frame memory bandwidth requirements, and hence the power dissipation, a level 1 (L1) cache is introduced (see fig. 2). The L1 cache holds $h_{SA}$ block lines of the frame ($h_{SA}$ is the height of SA), thus reducing the total number of pixel retrievals from frame memory to the minimum of one memory access per pixel. In order to achieve this minimum, both the luminance and the chrominance components have to be stored in compressed form in the L1 cache (225 Kb).

Table 1 summarises the advantage of using two levels of caching together with data compression, in terms of required resources and power dissipation. The total memory capacity and the frame memory bandwidth (data bus 2 in fig. 2) requirements are reduced by factors of 3.7 and 13.3, respectively. Further, the total memory power dissipation is reduced by a factor of [...].

3.2 Hardware/Software Partitioning

The most computationally intensive and bandwidth-demanding parts of the up-conversion application are the motion estimation and compensation, which should therefore be mapped onto hardware (see fig. 3).
Two main functions of the motion estimation are identified: the block that calculates the SADs of the motion vector candidates and the block that performs the bi-linear interpolation needed for sub-pixel accuracy of the motion vectors. The motion vector candidates evaluated for the current block are generated from the motion vector field, which is maintained in software. The best matching motion vector candidate is selected based on the calculated SADs and on the penalties applied to the respective candidates. This selection function can be realised either in hardware or in software.

Figure 3. Hardware/software partition of the temporal up-conversion application. Software tasks: MV generation, MV field and best MV choice. Hardware tasks: SAD, BI, ME and MC, together with the image repository holding the previous and current frames.

Figure 4. The VLIW-based ASIP for temporal up-conversion. Blocks shown: VLIW controller, SAD, BI, two L0 caches, MC, ACU, ALU, RAM and ROM, connected through a communication bus/network with distributed register files.

A number of motion estimation parameters influence the image quality as well as the computational complexity and data bus bandwidth requirements of the motion estimation [15]. The following parameters of the motion estimator are identified: the number of motion estimation scans per input image pair; the direction of each individual scan; the order of scanning the image block-by-block; the number, selection and precision of the motion vector candidates [11]; the dimension of the processing block and the dimension of the SAD window; and the size of the search area. In order to enable a flexible design, it is essential that most parameters are programmable. Section 4 addresses the issue of achieving this kind of flexibility without sacrificing other application requirements.

4 VLIW Based ASIP

VLIW architectures are suitable for exploiting the instruction level parallelism in programs, i.e. for executing more than one basic (primitive) instruction at a time.
These processors contain multiple application specific functional units (ASUs) as well as standard units such as an arithmetic logic unit and an address calculation unit. A very long instruction word is fetched from the instruction memory and dispatched to the functional units for parallel execution. The dispatched instruction has enough control bits to directly and independently control the action of every functional unit in every clock cycle. In contrast to contemporary superscalar processors, VLIW processors have relatively simple control logic, because they perform neither dynamic scheduling nor reordering of operations.

Fig. 4 depicts an ASIP based on the VLIW processor architecture template that implements the motion estimation and compensation algorithm. Apart from several general purpose functional units such as an arithmetic-logic unit (ALU) and an address computation unit (ACU), this ASIP also contains a number of application specific units,


tailored for accelerating the inner kernels of the temporal up-conversion algorithm.

Figure 5. Pseudo-code for motion estimation for temporal up-conversion using the application specific instruction set.

The VLIW-based ASIP for temporal up-conversion contains the following ASUs: the sum-of-absolute-differences (SAD) unit, the bi-linear interpolation (BI) unit, two instances of the L0 cache and the motion compensation (MC) unit. The design of the ASIP starts from the C description of the temporal up-conversion algorithm. As the next step, an instruction set suitable for fulfilling the parameterised requirements of the application is developed. An example of pseudo-code that uses the application specific instruction set is given in fig. 5. From this high level description, the VHDL of the VLIW processor is automatically derived through the HLS tool A|RT Expert. The ASUs also have a C specification, from which A|RT Builder automatically generates the VHDL.

4.1 The Sum-of-Absolute-Differences ASU

The sum-of-absolute-differences ASU is used to obtain the SAD of every motion vector candidate. It compares the block within the current frame with the corresponding block within the previous frame, shifted over the appropriate motion vector candidate. The dimensions of the SAD window are programmable. If the width and the height of the SAD window are denoted by $w_{SAD}$ and $h_{SAD}$, respectively, then $w_{SAD}, h_{SAD} \in \{4, 8, 12, 16\}$. The SAD ASU takes into account all pixels within the SAD window, i.e. it does not perform pixel sub-sampling. If calculated for a block at time instance $n + \alpha$, its function can be formally described by:

\[ SAD = \sum_{i=1}^{h_{SAD}} \sum_{j=1}^{w_{SAD}} \left| F(\vec{x} - \alpha\vec{D}, n) - F(\vec{x} + (1-\alpha)\vec{D}, n+1) \right|, \quad 0 \le \alpha \le 1 \]

where the luminance value of the pixel located at $(\vec{x}, n)$ is given by $F(\vec{x}, n)$.
The vector $\vec{x}$ identifies the spatial position $(i, j)$ of the pixel within the SAD window, calculated relative to the top-left corner of the SAD window. The position of the SAD window within the previous and current frames is identified by the motion vector candidate currently being evaluated, $\vec{D}$. The SAD ASU is organised such that it calculates partial sums-of-absolute-differences, registers these partial sums and adds them together in two steps in order to generate the output value. In the first step, the partial SADs are calculated line-by-line within the SAD window. In the second step, the accumulated partial SAD values are added to generate the final SAD of the SAD window. Fig. 6 illustrates the SAD operation in case the size of the SAD window is set to (16, 16).

Figure 6. The sum-of-absolute-differences ASU. The picture illustrates the specific case in which all four chunks are activated, i.e. the size of the SAD window is set to (16, 16).

The partial SAD calculation is designed as four separate SAD sub-blocks, each capable of calculating the SAD of a chunk of four pixels in a single clock cycle. Thus, a maximal SAD window width of 16 pixels is supported, and the number of clock cycles required to calculate the SAD of a given motion vector candidate is equal to $h_{SAD} + 1$. In case the requested width of the SAD window is less than the maximally supported width, the unused SAD sub-blocks are not triggered, which reduces the power dissipation.

Figure 7. The bi-linear interpolation ASU. The picture shows the interpolation of pixel a based on its four nearest neighbouring pixels (c: current pixel line, i: interpolated pixel line, p: previous (registered) pixel line).

4.2 The Bi-linear Interpolation ASU

When sub-pixel accuracy of the motion vectors is required, bi-linear interpolation is used to generate the corresponding pixels for the SAD calculation.
Each interpolated pixel is generated by this ASU as the weighted average of its four nearest neighbouring pixels. The weights are determined by the fractional parts of the x and y components of the motion vector candidate currently being evaluated, $D_x$ and $D_y$, respectively. The position of the SAD window's top-left pixel is determined by the truncated values of $D_x$ and $D_y$. In order to properly interpolate the pixels located in the right-most column and the lowest row of the SAD window, one additional column and one additional line are needed.
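The bi-linear interpolation just described, followed by the two-step SAD accumulation of Section 4.1, can be sketched in Python as follows. This is a behavioural illustration only: the function names are hypothetical, frames are plain lists of rows, and keeping one accumulator per four-pixel chunk is one plausible reading of the two-step scheme rather than a confirmed detail of the hardware.

```python
def bilinear_window(frame, x0, y0, fx, fy, w, h):
    """Interpolate a w-by-h window whose top-left sample lies at (x0 + fx, y0 + fy).

    (x0, y0) is the truncated displacement, (fx, fy) the fractional parts.
    Reads (w + 1) columns and (h + 1) lines, as noted in the text.
    """
    out = []
    for j in range(h):
        row = []
        for i in range(w):
            a = frame[y0 + j][x0 + i]          # top-left neighbour
            b = frame[y0 + j][x0 + i + 1]      # top-right neighbour
            c = frame[y0 + j + 1][x0 + i]      # bottom-left neighbour
            d = frame[y0 + j + 1][x0 + i + 1]  # bottom-right neighbour
            top = (1 - fx) * a + fx * b
            bottom = (1 - fx) * c + fx * d
            row.append((1 - fy) * top + fy * bottom)
        out.append(row)
    return out

def sad_two_step(block_a, block_b, chunk=4):
    """SAD of two equally sized windows whose width is a multiple of `chunk`."""
    n_chunks = len(block_a[0]) // chunk
    partial = [0] * n_chunks  # one accumulator per chunk (an assumption)
    # First step: line by line, every active chunk adds its four absolute differences.
    for row_a, row_b in zip(block_a, block_b):
        for k in range(n_chunks):
            s = slice(k * chunk, (k + 1) * chunk)
            partial[k] += sum(abs(p - q) for p, q in zip(row_a[s], row_b[s]))
    # Second step: the accumulated partial sums are added into the final SAD.
    return sum(partial)

# Usage: a luminance ramp, matched against a window displaced by half a pixel.
frame = [[x + 10 * y for x in range(6)] for y in range(6)]
reference = [row[:4] for row in frame[:4]]
half_pel = bilinear_window(frame, 0, 0, 0.5, 0.5, 4, 4)
sad = sad_two_step(reference, half_pel)
```

In this sketch the first step corresponds to the $h_{SAD}$ line cycles and the second step to the single extra cycle, which is consistent with the $h_{SAD} + 1$ cycle count given in Section 4.1.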

Table 2. Synthesis and pre-layout netlist-level power dissipation simulation results.
IC technology: 0.13 µm CMOS
Worst case conditions: 85 °C, 1.08 V
Area: 1.1 mm²
Frequency: 100 MHz
Power (typical): 27 mW
Performance: 140 SD fps (32 pels/SAD, 5 candidates)

The bi-linear interpolation ASU is pixel-line organised, and its functionality is illustrated in fig. 7. Based on two successive pixel-lines of width $w_{SAD} + 1$ pixels, it generates $w_{SAD}$ interpolated pixels in one clock cycle. Note that the ASU contains storage for one pixel-line ($w_{SAD} + 1$ pixels wide). The BI ASU takes $h_{SAD} + 1$ clock cycles to process the complete SAD window.

4.3 The L0 Cache ASU

The L0 cache stores the entire search area required by the algorithm. Thus, a pixel-line at an arbitrary position within the search area can be retrieved efficiently. When the motion vectors have full-pixel accuracy, the L0 cache outputs a pixel-line containing $w_{SAD}$ pixels. If sub-pixel accuracy is requested, $w_{SAD} + 1$ pixels are output from the cache, where the additional pixel is required by the bi-linear interpolation ASU. The L0 cache size is limited to 6*4 blocks of 8*8 pixels, and the motion vectors are clipped such that no pixel outside the search area is requested. At the start of each new block-line, the complete L0 cache has to be refilled, while for every subsequent block within the same block-line, one column of four blocks of 8*8 pixels has to be refilled. Since the motion estimator and compensator require pixels from both the current and the previous frame, the L0 cache ASU is instantiated twice. The architecture of the L0 cache ASU is depicted in fig. 8.

Figure 8. The L0 cache ASU. The width of the search area stored in the cache is 48 pixels, while the height is 32 pixels. The search area cache is composed of 12 physical units (RAM banks 0 to 11, each with ports web, A[4:0], D[31:0] and Q[31:0]), each containing 4 pixels per word, together with a bank selector, an FSM and an output filter.
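To make the banked organisation of fig. 8 concrete, the following Python sketch models the 12 banks of 32 words with four pixels per word and shows which banks a pixel-line read activates. The class and method names are hypothetical, and the mapping (bank b holds columns 4b to 4b+3, the word address equals the line number) is an assumption consistent with the 48*32 pixel search area rather than a detail confirmed by the text.

```python
BANKS, LINES, PIXELS_PER_WORD = 12, 32, 4  # 12 banks * 4 pixels = 48 columns

class L0CacheModel:
    """Behavioural model of a banked search-area cache (illustrative only)."""

    def __init__(self):
        self.banks = [[[0] * PIXELS_PER_WORD for _ in range(LINES)]
                      for _ in range(BANKS)]
        self.active_banks = []  # banks touched by the most recent read

    def write_word(self, bank, line, pixels):
        """Write one four-pixel word; every location is individually addressable."""
        self.banks[bank][line] = list(pixels)

    def read_line(self, x, y, width):
        """Return `width` pixels of line y starting at column x."""
        first = x // PIXELS_PER_WORD
        last = (x + width - 1) // PIXELS_PER_WORD
        self.active_banks = list(range(first, last + 1))  # only these are enabled
        words = []
        for b in self.active_banks:
            words.extend(self.banks[b][y])  # concatenate the selected words
        offset = x - first * PIXELS_PER_WORD
        return words[offset:offset + width]  # the filter keeps the requested pixels

# Usage: fill the cache with a recognisable pattern, then read an 8-pixel line.
cache = L0CacheModel()
for bank in range(BANKS):
    for line in range(LINES):
        base = bank * PIXELS_PER_WORD + 100 * line
        cache.write_word(bank, line, [base + i for i in range(PIXELS_PER_WORD)])
line = cache.read_line(x=6, y=3, width=8)
banks_used = list(cache.active_banks)
```

Under this assumed mapping, an unaligned 8-pixel read touches three banks and a 17-pixel read touches five, so most of the 12 banks stay idle on any given access.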
The cache memory is organised as 12 individual banks, each containing 32 pixel-lines of four pixels. Via the bank selection module, every memory location can be accessed individually during writing. During reading, the appropriate group of memory banks is selected according to the motion vector candidate being processed and the width of the SAD window, and the pixels read from those banks are concatenated. The pixels are then filtered, and one pixel-line (containing up to 17 pixels) is output per clock cycle. Since only the selected banks are activated, the power dissipation of the ASU is further reduced. The SAD, the BI and the two L0 cache ASUs are pipelined, thereby enabling parallel execution.

4.4 The Motion Compensation ASU

The motion compensation ASU is based on the motion compensation algorithm described in Section 2. It can output 16 motion compensated pixels located in the same pixel-line within a single clock cycle. Each of the output pixels is produced by a three-input median filter. Two inputs of this filter are the motion compensated pixels from the current and the previous frame, while the third input is the average of the non-motion compensated pixels from the current and the previous frame (see fig. 1 for details). The motion compensation should be performed immediately after motion estimation has finished with the block currently being processed, since the required luminance pixels are then available in the L0 cache. Motion compensation of the chrominance component is performed by fetching the appropriate pixels from the frame memory.

5 Results

A stand-alone netlist simulation of the ASIP design was performed using an SD input sequence. The motion estimator performed a single motion estimation scan, and the scanning order was from top to bottom and from left to right. Five full-pel accurate motion vector candidates were evaluated for each processed block using the 8*8 pixel SAD window.
Within the SAD window, alternate pixel-lines are used for the total SAD. The BI and MC ASUs are still being designed and hence are not included in this ASIP simulation. For this design, motion compensation was performed as a software task. Since the current version of the A|RT toolset produces VHDL that always clocks all registers and memories regardless of the activity of the functional units, clock gating was applied manually to the distributed register files and the RAM, using scripts that operate on the VHDL. This reduced the original power consumption of the ASIP by 19%. Table 2 summarises the synthesis results of the clock-gated design. The ASIP design presented in this paper has a marginal overhead in terms of area and power compared to an ASIC realisation (approximately [...] %). In order to prove the concept in silicon, an FPGA-based prototyping methodology [9] is used, in which two tasks exist and execute in parallel: the complete VLIW-based ASIP design is implemented on a PCI-based Nallatech FPGA board (xcv800-BG432-6) [16] (see fig. 9), while the rest of the

Figure 9. The hardware part of the prototyping environment. The prototyping FPGA board is the second PCB counting from right to left.

temporal up-conversion application is realised as a software task. Since the embedded ARM was not available within the prototyping setup, the software task was mapped onto the PC platform. The logic of the ASIP is implemented using 96% of the total slices available, while the caches take 85% of the RAM resources available on the FPGA device. Every slice of the FPGA contains two 4-input LUTs and two flip-flops.

6 Conclusions and future work

In this work, a low-power and low-cost design of a VLIW-based ASIP that performs motion estimation for the picture-rate up-conversion application was presented. The design of the VLIW processor and the associated application specific functional units was done using the A|RT HLS tool set, starting from the C description of the algorithm. The designed motion estimator takes 1.1 mm² and consumes 27 mW in 0.13 µm technology when processing 140 SD fps. The end design is proven in silicon by demonstrating it in an FPGA-based prototyping environment. As part of future work, the functionality of the ASIP will be extended to cover bi-linear interpolation and motion compensation, and the search area will be increased to 9*5 blocks of 8*8 pixels. Further, the power consumption of the L0 cache can be reduced by exploiting the motion vector dynamics in the L0 cache design.

Acknowledgements

We would like to acknowledge the valuable contributions of our colleagues Ghiath Al-Kadi (prototyping) and Srinivasan Balakrishnan (discussions on ASUs).

References

[1] G. de Haan, Video format conversion, Journal of the SID, Vol. 8, No. 1, 2000.
[2] T. Söhne et al., A video backend for multimedia TV-sets, IEEE Transactions on Consumer Electronics, Vol. 44, No. 3, August 1998.
[3] A|RT tool set, Adelante Technologies, available online.
[4] Ambit tool set, Ambit Technologies, available online.
[5] A. Nieuwland et al., C-HEAP: A heterogeneous multi-processor architecture template and scalable and flexible protocol for the design of embedded signal processing systems, Design Automation for Embedded Systems.
[6] J.A. Fisher, Very long instruction word architectures and the ELI-512, Proceedings 10th Symposium Computer Architecture, IEEE, June 1983.
[7] J.R. Ellis, Bulldog: A compiler for VLIW architectures, Cambridge, MA, MIT Press.
[8] V. Aue et al., A design methodology for high performance ICs: wireless broadband radio baseband case study, Proceedings of Euromicro Symposium on Digital Systems Design, September 2001.
[9] N.G. Busá et al., RAPIDO: a modular, multi-board, heterogeneous multi-processor, PCI-bus based prototyping framework for the validation of SoC VLSI designs, IEEE Workshop on Rapid System Prototyping, July 2002.
[10] G. de Haan, Video processing for multimedia systems, University Press Eindhoven.
[11] A. Berić et al., A technique for reducing complexity of recursive motion estimation algorithms, Proceedings of the IEEE Workshop on Signal Processing Systems, August 2003.
[12] R.P. Kleihorst et al., DCT-domain embedded memory compression for hybrid video coders, Journal of VLSI Signal Processing Systems 24, February 2000.
[13] J.L. Hennessy et al., Computer architecture: a quantitative approach, Morgan Kaufmann Publishers, Inc., 1996.
[14] A. Berić et al., Algorithm/Architecture co-design of a picture-rate up-conversion module, Proceedings of ProRISC/IEEE conference, November 2002.
[15] A. Berić et al., Towards an efficient high quality picture-rate up-converter, Proceedings of the IEEE International Conference on Image Processing, September 2003, on CD.
[16] Nallatech Ltd., available online.
