Hardware Optimized DCT/IDCT Implementation on Verilog HDL


ECE 734

Rahul Srikumar

In this report, I explore four implementations of a hardware-based, pipelined DCT/IDCT in Verilog HDL. Conventional DCT/IDCT implementations require a large amount of hardware for storage and computation. This project attempts to reduce these requirements and compares the four implementations to identify the best design point for a hardware-based DCT/IDCT. The Serial In implementation is observed to consume about 6% less area than the 8 Parallel In implementation, at a performance degradation of only about 4%.

Table of Contents

Motivation
Prior Work
The Discrete Cosine Transform
Introduction
Four Implementations
    Serial In
    2 Parallel In
    4 Parallel In
    8 Parallel In
Optimizations
Synthesis and Results
Conclusion
References

Motivation

The Discrete Cosine Transform (DCT) is one of the most important transforms used for image compression in image processing applications. It involves many multiplications and additions and has a large memory requirement. Over the last couple of decades, several algorithms have been proposed to reduce the number of computations and the memory required to compute the DCT. Any algorithm that reduces the total number of additions or multiplications, or the memory requirement, would be of great significance to the image processing domain.

Prior Work

There has been a lot of research, both in industry and academia, on how to efficiently implement a fast DCT/IDCT algorithm in hardware. Dae Won Kim et al. [1] proposed and implemented a hardware Distributed Arithmetic (DA) method with radix-2 multibit coding that minimizes resource requirements by using transpose memory. Atitallah et al. [2] compared the Loeffler and DA algorithms for implementing compression in H.264 and MPEG. Martuza et al. [3] presented a hybrid architecture for IDCT computation based on the symmetric structure of the matrices and the similarity in matrix operations. The proposed architecture draws its inspiration from all of the above well-established examples.

The Discrete Cosine Transform

A discrete cosine transform (DCT) expresses a sequence of finitely many data points as a sum of cosine functions oscillating at different frequencies, i.e. it transforms a signal from a spatial representation into a frequency representation. In an image, most of the energy is concentrated in the lower frequencies, so if I transform an image into its frequency components and discard the higher-frequency coefficients, I can reduce the amount of data needed to describe the image without sacrificing too much image quality. This is why the DCT is widely used in image compression algorithms.

The DCT function used in image processing consists of a sum of weighted cosine functions at different frequencies. The DCT of a function is expressed as follows:
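For reference, a commonly used form of the N-point 1-D DCT (DCT-II) and its inverse is sketched below. This is the standard orthonormal textbook definition and is stated here as an assumption; the symbols x(n), X(k) and α(k), as well as the normalization, may differ from the report's own equations.

```latex
\[
\begin{aligned}
X(k) &= \alpha(k) \sum_{n=0}^{N-1} x(n)\,
        \cos\!\left[\frac{(2n+1)\,k\,\pi}{2N}\right],
        \qquad k = 0, \dots, N-1 \\
x(n) &= \sum_{k=0}^{N-1} \alpha(k)\, X(k)\,
        \cos\!\left[\frac{(2n+1)\,k\,\pi}{2N}\right],
        \qquad n = 0, \dots, N-1 \\
\alpha(k) &=
  \begin{cases}
    \sqrt{1/N}, & k = 0 \\
    \sqrt{2/N}, & k \neq 0
  \end{cases}
\end{aligned}
\]
```

For the 8-point transform used in this design, N = 8, so the cosine argument becomes (2n+1)kπ/16.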

[Equations (1)-(3)]

Since images are 2-D objects, a 2-D DCT is required to transform all pixels into the frequency domain. This computation involves two major steps:
(i) Computing the 1-D DCT of the rows of the pixel matrix.
(ii) Computing the 1-D DCT of the columns of the pixel matrix by computing the DCT of the transpose of the matrix obtained in (i).

The 2-D DCT of an image is expressed as follows:

[Equations (4)-(6)]

Introduction

In my implementation, I explore four design points of my hardware implementation using Verilog HDL and evaluate the area-performance trade-off. The design comprises four modules per design point: one module for DCT computation, one module for IDCT computation, one top module that instantiates both the DCT and IDCT modules, and a test bench to test the entire design. The core idea is to implement a fully pipelined architecture that takes in 8 inputs and provides a single DCT output, which in turn is used to compute the IDCT.

A 1D-DCT is first implemented on the input pixels. The output of this stage, called the intermediate value, is stored in a RAM. The second 1D-DCT operation is performed on this stored value to give the final 2D-DCT output dct_2d. The inputs are 8 bits wide and the 2D-DCT outputs are 9 bits wide.

A 1D-IDCT is implemented on the input DCT values. This intermediate value is stored in a RAM. The second 1D-IDCT operation is performed on this stored value to give the final 2D-IDCT output idct_2d. The inputs are 9 bits wide and the 2D-IDCT outputs are 8 bits wide. The details of the four design points are given in the sections that follow.

Four Implementations

Serial In

1st 1D section

The input signals are taken one pixel at a time in the order x00 to x07, x10 to x17, and so on up to x77. These inputs are fed into an 8-bit shift register. The outputs of the 8-bit shift registers are registered by the divide-by-8 clock, which is the CLOCK signal divided by 8. This allows 8 pixels (one row) to be registered at a time. The pixels are paired up in an adder/subtractor in the order xk0,xk7 : xk1,xk6 : xk2,xk5 : xk3,xk4. The adder/subtractor is tied to CLOCK. On every clock, the adder/subtractor module alternates between addition and subtraction; this selection is made by the toggle flop. The output of the adder/subtractor is fed into a multiplier whose other input is connected to stored values in registers acting as memory. The outputs of the 4 multipliers are added on every clock in the final adder. The output of the adder, z_out, gives the 1D-DCT values in the order in which the inputs were read in.

It takes 8 clocks to read in the first set of inputs, 1 clock to register the inputs, 1 clock for the add/sub, 1 clock to get the absolute value, 1 clock for multiplication, and 2 clocks for the final adder, for a total of 14 clocks to get the first z_out value. Every subsequent clock gives the next z_out value, so all 64 values take 14 + 63 = 77 clocks.

Storage/RAM section

The outputs z_out of the adder are stored in RAMs. Two RAMs are used so that data writes can be continuous. The first valid input for RAM1 is available at the 15th clock, so the RAM1 enable is active after 15 clocks. The write operation then continues for 64 clocks. At the 65th clock, since z_out is continuous, we get the next valid z_out_00. This second set of valid 1D-DCT coefficients is written into RAM2, which is enabled at [...] clocks. So at the 65th clock, RAM1 goes into read mode for the next 64 clocks and RAM2 is in write mode. The two RAMs alternate between read and write every 64 clock cycles.
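The alternating two-RAM scheme described above can be modeled in a few lines of Python. This is only a behavioral sketch, not the report's Verilog: the function name `pingpong_transpose`, the flat 64-entry list per RAM, and the address arithmetic are assumptions made for illustration. It shows how writing in row order and reading in column order hands the second 1D stage a transposed block while the next block is being written.

```python
N = 8              # 8x8 block
BLOCK = N * N      # 64 entries per RAM


def pingpong_transpose(stream):
    """Model two RAMs that swap write/read roles every 64 clocks.

    `stream` holds first-stage outputs in row order (z00..z07, z10..z17, ...).
    Returns the values in the order the second 1D stage would read them:
    one column at a time (z00..z70, z01..z71, ...), block by block.
    """
    if len(stream) % BLOCK != 0:
        raise ValueError("stream length must be a multiple of 64")
    rams = [[0] * BLOCK, [0] * BLOCK]
    num_blocks = len(stream) // BLOCK
    out = []
    # One extra 64-clock phase drains the last block that was written.
    for clock in range((num_blocks + 1) * BLOCK):
        phase, addr = divmod(clock, BLOCK)
        write_ram = rams[phase % 2]          # RAM being filled in this phase
        read_ram = rams[(phase + 1) % 2]     # RAM filled in the previous phase
        if phase >= 1:
            col, row = divmod(addr, N)       # column-major read address
            out.append(read_ram[row * N + col])
        if phase < num_blocks:
            write_ram[addr] = stream[clock]  # row-major write address
    return out


if __name__ == "__main__":
    two_blocks = list(range(2 * BLOCK))
    reordered = pingpong_transpose(two_blocks)
    print(reordered[:8])   # first column of the first block
```

In every phase the read and the write go to different RAMs, which is what lets the first stage keep producing one value per clock without stalling.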

2nd 1D-DCT section

After the first 77 clocks, when RAM1 is full, the second set of 1D calculations can start. The second 1D implementation is the same as the first 1D implementation, with the inputs now coming from either RAM1 or RAM2. The inputs are read in one column at a time in the order z00 to z70, z01 to z71, and so on up to z77. The outputs from the adder in the second section are the 2D-DCT coefficients.

1st 1D-IDCT section

The input signals are taken one pixel at a time in the order x00 to x07, x10 to x17, and so on up to x77. These inputs are fed into an 8-bit shift register. The outputs of the 8-bit shift registers are registered at every 8th clock. This allows 8 pixels (one row) to be registered at a time. The pixels are fed into a multiplier whose other input is connected to stored values in registers that act as memory. The outputs of the 8 multipliers are added on every CLOCK in the final adder. The output of the adder, z_out, gives the 1D-IDCT values in the order in which the inputs were read in.

It takes 8 clocks to read in the first set of inputs, 1 clock to get the absolute value of the input, 1 clock for multiplication, and 2 clocks for the final addition, for a total of 12 clocks to get the first z_out value. Every subsequent clock gives the next z_out value. So to get all the 64 values we need 12+64=76 clocks.

Storage/RAM section

The outputs z_out of the adder are stored in RAMs. Two RAMs are used so that data writes can be continuous. The first valid input for RAM1 is available at the 12th clock.

So the RAM1 enable is active after 11 clocks. The write operation then continues for 64 clocks. At the 65th clock, since z_out is continuous, we get the next valid z_out_00. This second set of valid 1D-IDCT values is written into RAM2, which is enabled at [...] clocks. So at the 65th clock, RAM1 goes into read mode for the next 64 clocks and RAM2 is in write mode. After this, read and write switch between the two RAMs every 64 clocks.

2nd 1D-IDCT section

After the first 76 clocks, when RAM1 is full, the second set of 1D calculations can start. The second 1D implementation is the same as the first 1D implementation, with the inputs now coming from either RAM1 or RAM2. The inputs are read in one column at a time in the order z00 to z70, z01 to z71, and so on up to z77. The outputs from the adder in the second section are the 2D-IDCT coefficients.

2 Parallel In

1st 1D section

The input signals are taken 2 pixels at a time in the order x00:x01, x02:x03, and so on up to x06:x07. A divide-by-4 clock is used to clock in 4 sets of 2 pixels to get 8 pixels. The pixels are paired up in an adder/subtractor in the order xk0,xk7 : xk1,xk6 : xk2,xk5 : xk3,xk4. The adder/subtractor is tied to CLOCK. On every clock, the adder/subtractor module performs 4 additions and 4 subtractions. The output of the add/sub is fed into a multiplier whose other input is connected to stored values in registers that act as memory. The outputs of the 8 multipliers are added on every CLOCK in the final adder. The output of the adder, z_out, gives the 1D-DCT values in the order in which the inputs were read in.

The difference from the Serial In implementation is that it takes 4 clocks to register the inputs with sign extension, 1 clock for the add/sub, 1 clock to separate the sign and absolute value, 1 clock for multiplication, and 2 clocks for the final adder, for a total of 9 clocks to get the first z_out value. Every subsequent clock gives the next z_out value, so all 64 values take 9 + 63 = 72 clocks.

The remaining portions of the DCT/IDCT computation process are similar to the Serial In implementation.

4 Parallel In

The input signals are taken 4 pixels at a time in the order x00:x03 and x04:x07. A divide-by-2 clock is used to clock in 2 sets of 4 pixels to get 8 pixels. The pixels are paired up in an adder/subtractor in the order xk0,xk7 : xk1,xk6 : xk2,xk5 : xk3,xk4. The adder/subtractor is tied to CLOCK. On every clock, the adder/subtractor module performs 4 additions and 4 subtractions. The output of the add/sub is fed into a multiplier whose other input is connected to stored values in registers that act as memory. The outputs of the 8 multipliers are added on every CLOCK in the final adder. The output of the adder, z_out, gives the 1D-DCT values in the order in which the inputs were read in.
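The pairing xk0,xk7 : xk1,xk6 : xk2,xk5 : xk3,xk4 works because the 8-point DCT coefficients are symmetric for even output indices and antisymmetric for odd ones. The Python sketch below illustrates that property only; it is not the report's datapath. It assumes floating-point arithmetic and the orthonormal DCT-II scaling, neither of which is specified in the report, and the function names are hypothetical.

```python
import math

N = 8


def dct_coeff(k, n):
    """Orthonormal 8-point DCT-II coefficient (assumed normalization)."""
    scale = math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)
    return scale * math.cos((2 * n + 1) * k * math.pi / (2 * N))


def dct_1d_direct(x):
    """Reference form: 8 multiplications per output value."""
    return [sum(dct_coeff(k, n) * x[n] for n in range(N)) for k in range(N)]


def dct_1d_paired(x):
    """Pair x[n] with x[7-n] first, then use 4 multiplications per output."""
    sums = [x[n] + x[N - 1 - n] for n in range(N // 2)]    # feeds even k
    diffs = [x[n] - x[N - 1 - n] for n in range(N // 2)]   # feeds odd k
    out = []
    for k in range(N):
        # Even outputs use the sums, odd outputs use the differences,
        # which mirrors alternating between add and subtract per output.
        operands = sums if k % 2 == 0 else diffs
        out.append(sum(dct_coeff(k, n) * operands[n] for n in range(N // 2)))
    return out


if __name__ == "__main__":
    row = [52, 55, 61, 66, 70, 61, 64, 73]   # arbitrary sample pixel row
    direct = dct_1d_direct(row)
    paired = dct_1d_paired(row)
    print(max(abs(a - b) for a, b in zip(direct, paired)))
```

Both functions give the same result up to rounding, since the coefficient for x[7-n] equals the coefficient for x[n] multiplied by (-1)^k.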

In this implementation, it takes 2 clocks to register the inputs with sign extension, 1 clock for the add/sub, 1 clock to separate the sign and absolute value, 1 clock for multiplication, and 2 clocks for the final adder, for a total of 7 clocks to get the first z_out value. Every subsequent clock gives the next z_out value, so all 64 values take 7 + 63 = 70 clocks.

The remaining portions of the DCT/IDCT computation process are similar to the Serial In implementation.

8 Parallel In

The input signals are taken 8 pixels at a time in the order x00:x07. The pixels are paired up in an adder/subtractor in the order xk0,xk7 : xk1,xk6 : xk2,xk5 : xk3,xk4. The adder/subtractor is tied to CLOCK. On every clock, the adder/subtractor module performs 4 additions and 4 subtractions. The output of the add/sub is fed into a multiplier whose other input is connected to stored values in registers that act as memory. The outputs of the 8 multipliers are added on every CLOCK in the final adder. The output of the adder, z_out, gives the 1D-DCT values in the order in which the inputs were read in.

In this implementation, it takes 1 clock to register the inputs with sign extension, 1 clock for the add/sub, 1 clock to separate the sign and absolute value, 1 clock for multiplication, and 2 clocks for the final adder, for a total of 6 clocks to get the first z_out value. Every subsequent clock gives the next z_out value, so all 64 values take 6 + 63 = 69 clocks.
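The first-stage clock counts quoted for the four design points follow one pattern: input capture clocks, plus five clocks of shared pipeline stages, plus 63 further clocks for the remaining outputs. The short Python tally below reproduces that arithmetic from the stage latencies given in the text; the dictionary and function names are illustrative only.

```python
# Clocks spent capturing one row of 8 pixels, as quoted in the text.
# Serial In: 8 clocks to shift in plus 1 clock to register the inputs.
INPUT_CLOCKS = {
    "Serial In": 8 + 1,
    "2 Parallel In": 4,
    "4 Parallel In": 2,
    "8 Parallel In": 1,
}

# Stages shared by all four 1D-DCT designs, in clocks.
COMMON_STAGES = {"add_sub": 1, "sign_abs": 1, "multiply": 1, "final_add": 2}


def first_output_latency(design):
    """Clocks until the first z_out value appears."""
    return INPUT_CLOCKS[design] + sum(COMMON_STAGES.values())


def clocks_for_block(design, outputs=64):
    """Clocks until all outputs of one 8x8 block have been produced."""
    return first_output_latency(design) + (outputs - 1)


if __name__ == "__main__":
    for name in INPUT_CLOCKS:
        print(name, first_output_latency(name), clocks_for_block(name))
```

Only the input capture term differs between designs, which is why widening the input from 1 to 8 pixels saves just 8 clocks per block in this stage.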

The remaining portions of the DCT/IDCT computation process are similar to the Serial In implementation.

Optimizations

One of the optimizations I included is the use of 2 RAMs for storage. Each RAM can store 64 values. When the first 1D-DCT value is available, the first RAM goes into write mode and remains in write mode for the next 63 clocks. Afterwards, it switches to read mode and the second RAM goes into write mode. The next set of 1D-DCT coefficients is stored in the second RAM while the first RAM's DCT values are used for the 2D-DCT computation. As a result, the 2 RAMs alternate between read and write every 64 clocks. This helps achieve a fully pipelined design.

An 8-point DCT computation needs 64 cosine coefficients to be stored. Another main optimization in my design is to use only 8 registers, which are loaded with 8 coefficients every clock cycle. These values change every clock cycle, providing the multipliers with the appropriate DCT cosine coefficients. This effectively reduces the coefficient register requirement to one eighth of that in conventional designs (8 registers instead of 64).

Synthesis and Results

Figure 1 shows the Modelsim simulation results of the Serial In implementation of the DCT computation process.


Figure 1: Modelsim simulation of Serial In DCT computation

All four implementations were synthesized in Quartus targeting an Altera Cyclone IV FPGA. Some of the results obtained from Quartus are shown in Figure 2.

Figure 2: Synthesis summary of Serial In DCT implementation

Figure 3: Combinational blocks in 4 implementations (8 Parallel In, 4 Parallel In, 2 Parallel In, Serial In; axis: combinational blocks, 5600 to 6400)

Figure 4: Number of registers for 4 implementations (8 Parallel In, 4 Parallel In, 2 Parallel In, Serial In)

Figure 5: Total computation time for 4 implementations (8 Parallel In, 4 Parallel In, 2 Parallel In, Serial In; axis: cycles to 2D IDCT of an 8*8 block, 230 to 246)

Table 1: Number of cycles to compute various results at the 4 design points. Columns: S No., Design Type, Registers, Combinational blocks, Pins, Cycles to 1D DCT, Cycles to 2D DCT, Cycles to 1D IDCT, Cycles to 2D IDCT. Rows: 1 (8 Parallel In), 2 (4 Parallel In), 3 (2 Parallel In), 4 (Serial In).

It can be noted from Figures 3, 4 and 5 that the total computation time of Serial In is 246 cycles and that of 8 Parallel In is about 236 cycles, while the hardware requirement is considerably lower for the Serial In implementation.
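As a quick check of the performance figure, the relative slowdown of Serial In against 8 Parallel In can be worked out from the two cycle counts quoted above, taking the approximate 236-cycle figure at face value:

```latex
\[
\frac{246 - 236}{236} = \frac{10}{236} \approx 0.042 = 4.2\%
\]
```

This is consistent with a performance degradation of about 4%.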

Conclusion

It can be concluded that the Serial In implementation consumes 6% less area than the 8 Parallel In implementation at a performance degradation of only about 4%. Hence, for applications that are not performance critical and that require low power and low area, the Serial In implementation should be preferred over the other implementations.

References

[1]. Dae Won Kim, Taek Won Kwon, Jung Min Seo, Jae Kun Yu, Suk Kyu Lee, Jung Hee Suk, Jun Rim Choi, "A compatible DCT/IDCT architecture using hardwired distributed arithmetic."
[2]. A. Ben Atitallah, P. Kadionik, F. Ghozzi, P. Nouel, N. Masmoudi, Ph. Marchegay, "Optimization and implementation on FPGA of the DCT/IDCT algorithm."
[3]. Muhammad Martuza, Carl McCrosky, and Khan Wahid, "A fast hybrid DCT architecture supporting H.264, VC-1, MPEG-2, AVS and JPEG codecs."
[4]. Taizo Suzuki and Masaaki Ikehara, "Integer DCT Based on Direct-Lifting of DCT-IDCT for Lossless-to-Lossy Image Coding."
[5]. Hui-Cheng Hsu, Kun-Bin Lee, Nelson Yen-Chung Chang, and Tian-Sheuan Chang, "Architecture Design of Shape-Adaptive Discrete Cosine Transform and Its Inverse for MPEG-4 Video Coding."
[6]. Kibum Suh, Kyung Yuk Min, Kyeounsoo Kim, Jong-Seog Koh, Jong-Wha Chong, "A design of DPCM hybrid coding loop using single 1-D DCT in MPEG-2 video encoder."
