Design and Implementation of 3-D DWT for Video Processing Applications
P. Mohaniah 1, P. Sathyanarayana 2, A. S. Ram Kumar Reddy 3 & A. Vijayalakshmi 4
1 E.C.E, N.B.K.R.IST, Vidyanagar, 2 E.C.E, S.V University Tirupati, India, 3 Vyas Labz, Hyderabad, India, 4 K.L.University, Vaddeswaram
E-mail: mohanaiah_pipuru@yahoo.co.in 1, satyamp1@yahoo.com 2, ramkumar.a@vyasinfo.com 3, allamsettivijaya@gmail.com 4

Abstract - For many natural signals, the wavelet transform is a more effective tool than the Fourier transform. The wavelet transform provides a multiresolution representation using a set of analyzing functions that are dilations and translations of a few functions (wavelets). This paper proposes an efficient VLSI architecture for the implementation of the 3-D lifting-based discrete wavelet transform (DWT). The whole architecture is optimized with pipelined and parallel design to increase speed and achieve higher hardware utilization. Time Division Multiplexing (TDM) is used to realize the prediction step and the update step on the same hardware, which reduces the size of the circuit. To further optimize the 1-D DWT architecture, a mirror symmetric boundary extension technique is implemented. The proposed architecture is successfully realized on an FPGA device of the Cyclone family from Altera Corp.

Keywords - Discrete wavelet transform (DWT), embedded mirror symmetric boundary extension, lifting scheme, VLSI architecture

I. INTRODUCTION
To use the wavelet transform for volume and video processing, we must implement a 3D version of the analysis and synthesis filter banks. In the 3D case, the 1D analysis filter bank is applied in turn to each of the three dimensions. If the data is of size N1 by N2 by N3, then after applying the 1D analysis filter bank to the first dimension we have two sub-band data sets, each of size N1/2 by N2 by N3.
After applying the 1D analysis filter bank to the second dimension we have four sub-band data sets, each of size N1/2 by N2/2 by N3. Applying the 1D analysis filter bank to the third dimension gives eight sub-band data sets, each of size N1/2 by N2/2 by N3/2. This is illustrated in Fig. 1.

Fig. 1 : The resolution of a 3-D signal is reduced in each dimension

In this paper, an efficient line-based VLSI architecture for the 3-D DWT using the lifting scheme is proposed. It is mainly composed of one row DWT module and one column DWT module, working in parallel and in pipelined fashion with 100% hardware utilization.

This paper is structured as follows. Section II briefly reviews the background of the lifting scheme. The proposed architecture for the 3-level 3-D lifting-based DWT is presented in Section III. Sections IV and V describe the floating point multiplication algorithm and the LUT implementation for memory based computations. The simulation results of the proposed architecture and the performance analysis of its real-time platform implementation are described in Section VI. Finally, the conclusions of the paper are given in Section VII.

II. LIFTING DWT
The lifting scheme has been developed as a flexible tool suitable for constructing second generation wavelets. It is composed of three basic operation stages: splitting, predicting, and updating.
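As a rough software illustration of the separable 3-D decomposition from the introduction combined with the three lifting stages just named, the sketch below applies one lifting level along each axis of a small volume and lists the eight resulting sub-bands. It assumes even dimensions and, for brevity, uses the simplest integer lifting pair (predict factor -1, update factor 1/2, a Haar-like filter) rather than the 5/3 filter of the proposed architecture. The names `lift_1d`, `dwt_axis`, `dwt3d_one_level` and `shape` are illustrative only and do not come from the paper.

```python
from itertools import product

def lift_1d(x):
    """One lifting level on an even-length list: split, predict, update."""
    even, odd = x[0::2], x[1::2]                       # split
    high = [o - e for e, o in zip(even, odd)]          # predict: odd + (-1) * even
    low = [e + (h >> 1) for e, h in zip(even, high)]   # update: even + floor(high / 2)
    return low, high

def dwt_axis(vol, axis):
    """Apply lift_1d along one axis of a dense 3-D nested list."""
    n = [len(vol), len(vol[0]), len(vol[0][0])]
    half = n.copy()
    half[axis] //= 2                                   # this axis is halved
    low = [[[0] * half[2] for _ in range(half[1])] for _ in range(half[0])]
    high = [[[0] * half[2] for _ in range(half[1])] for _ in range(half[0])]
    others = [range(n[a]) for a in range(3) if a != axis]
    for p, q in product(*others):
        def index(k):
            idx = [p, q]
            idx.insert(axis, k)                        # put k at the transformed axis
            return idx
        line = []
        for k in range(n[axis]):
            i, j, l = index(k)
            line.append(vol[i][j][l])
        lo, hi = lift_1d(line)
        for k in range(half[axis]):
            i, j, l = index(k)
            low[i][j][l] = lo[k]
            high[i][j][l] = hi[k]
    return low, high

def dwt3d_one_level(vol):
    """Return a dict mapping labels such as 'LLH' to sub-band volumes."""
    bands = {"": vol}
    for axis in range(3):                              # first, second, third dimension
        nxt = {}
        for label, v in bands.items():
            low, high = dwt_axis(v, axis)
            nxt[label + "L"] = low
            nxt[label + "H"] = high
        bands = nxt
    return bands

def shape(v):
    return (len(v), len(v[0]), len(v[0][0]))

# Hypothetical 8 x 6 x 4 test volume (N1 = 8, N2 = 6, N3 = 4).
volume = [[[(i * 7 + j * 3 + k) % 11 for k in range(4)]
           for j in range(6)] for i in range(8)]
subbands = dwt3d_one_level(volume)
for label in sorted(subbands):
    print(label, shape(subbands[label]))
```

Each of the eight printed sub-bands has size N1/2 by N2/2 by N3/2, in line with the sizes stated in the introduction. Swapping `lift_1d` for a 5/3 lifting step would change the coefficient values but not this bookkeeping.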
Fig. 2 shows the lifting scheme of the wavelet filter for a one-dimensional signal:
Split step: The signal is split into even and odd samples, so that the strong correlation between adjacent pixels can be exploited in the following predict step.
Predict step: The even samples are multiplied by the predict factor, and the results are added to the odd samples to generate the detail coefficients.
Update step: The detail coefficients computed in the predict step are multiplied by the update factors, and the results are added to the even samples to obtain the coarse coefficients.

Fig. 2 : Block diagram of the lifting scheme

III. VLSI ARCHITECTURE OF 3-D DWT
The proposed VLSI architecture shown in Fig. 3 performs the 3-D DWT with the line-based method [2][6]. It consists of five key modules: the data choose module, the row DWT module, the column DWT module, the DWT control unit and the external RAM. An $N^2/4$ external RAM is used to store the LL band output coefficients for the multi-level decomposition, where N represents the width and the height of the input image. The DWT control unit controls the time sequence of the whole system.

Fig. 3 : Proposed 3-D DWT Architecture

First, one line of image data or LL sub-band data is routed into the Data Selector. Then the data enter the row processor, which performs the 1-D row DWT, and the output data are stored in the line buffer. The number of line buffers is decided by the number of taps of the low-pass filter. When [(M+1)/2] rows of data (M is the number of taps of the low-pass filter) have finished the row DWT, the column DWT module immediately starts the column transform and stores the intermediate results in the column buffer. The final transform data are stored in the external SRAM.

Fig. 4 : FSM of DWT control unit

In this paper, we only discuss the wavelet transform modules; the DWT control unit and the external RAM are outside the scope of this article. Because of the importance of the DWT control unit, its FSM diagram is shown in Fig. 4.

A. Improved Embedded Mirror Symmetric Extension at the Boundaries
Processing a finite-length signal with a wavelet filter leads to edge effects. The JPEG2000 standard employs symmetric extension at the boundaries to eliminate them. The traditional extension algorithm needs additional memory units and operations, which consume considerable power and area [2]. Based on the characteristics of the lifting-based DWT, this paper puts forward the embedded mirror symmetric extension algorithm, as shown in Fig. 5. The extension is embedded into the data operation process by changing the operations at the beginning and the end of the lifting operation.

Fig. 5 : Mirror symmetric extension
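To make the idea of embedding the extension concrete, the following sketch compares two software models of one 5/3 lifting level: one that first builds an explicitly mirrored copy of the signal, and one that leaves the signal untouched and only changes the first update and the last predict operation. It assumes the standard reversible 5/3 lifting steps of JPEG2000 and an even-length signal of at least four samples. The function names are hypothetical, and the code models the arithmetic only, not the FSM or multiplexers of the proposed hardware.

```python
def dwt53_explicit(x):
    """5/3 lifting on an explicitly mirror-extended copy of x."""
    n = len(x)
    # Mirrored samples x[-2], x[-1] on the left and x[n] on the right.
    ext = [x[2], x[1]] + list(x) + [x[n - 2]]
    # Detail coefficients d[-1] .. d[n/2 - 1], stored with an offset of 1.
    d = [ext[2 * k + 1] - ((ext[2 * k] + ext[2 * k + 2]) >> 1)
         for k in range(n // 2 + 1)]
    low = [x[2 * i] + ((d[i] + d[i + 1] + 2) >> 2) for i in range(n // 2)]
    return low, d[1:]

def dwt53_embedded(x):
    """Same result without any extended buffer: only the boundary steps change."""
    n = len(x)
    high = []
    for i in range(n // 2):
        if 2 * i + 2 < n:
            pred = (x[2 * i] + x[2 * i + 2]) >> 1
        else:
            pred = x[2 * i]            # last extension: right neighbour mirrors x[n-2]
        high.append(x[2 * i + 1] - pred)
    low = []
    for i in range(n // 2):
        left = high[i - 1] if i > 0 else high[0]   # forward extension: d[-1] equals d[0]
        low.append(x[2 * i] + ((left + high[i] + 2) >> 2))
    return low, high

samples = [3, 7, 1, 8, 2, 9, 4, 6]     # hypothetical row of 8 pixels
print(dwt53_explicit(samples))
print(dwt53_embedded(samples))
```

Both calls return the same low-pass and high-pass lists, which is the property the embedded scheme relies on: the mirrored samples never need to be stored, because their effect reduces to a modified operation at each end of the line.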
The equations given below describe the new operation process of the 5/3 wavelet transform. The embedded scheme, which is built into both the row DWT module and the column DWT module, is implemented by a finite state machine (FSM) and multiplexers. The FSM has four states: Forward extension, Normal even, Normal odd, and Last extension. The data extension is embedded only in the Forward extension and Last extension states.

B. Proposed Row DWT Module
The proposed architecture is optimized in terms of processing speed, as illustrated in Fig. 6. The multiplication is implemented with shift and add operations. In this way, the row processor consists of six registers, five multiplexers, one adder and one shifting adder. All the hardware resources of the row processor can be time-multiplexed. A single line is processed at a time. When a lifting step is performed, two consecutive even-numbered samples are added and multiplied by the corresponding lifting coefficient before being added to the middle odd-numbered sample. This means that one pixel is encoded per clock cycle. This reduces the number of storage cells compared with the row DWT module in [4], where the input lines are partitioned into even and odd samples (which needs two parallel row DWT units). The embedded mirror symmetric boundary data extension algorithm is implemented using two multiplexers controlled by the signals sel0 and sel1, which significantly reduces the amount of internal storage and the number of accesses to the external memory. The control signals (sel0, sel1) of the multiplexers and the corresponding model of the lifting scheme are shown in Table I. The time-multiplexed row processor conducts the predict step in even clock cycles and the update step in odd clock cycles. The control signals sel1 and sel0 of the multiplexers are generated by a counter. The row processor is pipelined, so the samples are encoded continuously as they are input. Hardware utilization reaches approximately 100%, and the control logic is simple.
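The time-multiplexing described above can be sketched as a small behavioural model in which a single arithmetic unit alternates between the predict step on even clocks and the update step on odd clocks. This is only an illustration under stated assumptions: the clock numbering (sample `x[t]` arrives at clock `t`), the function names, and the direct indexing of the row in place of the six registers are all hypothetical, and the arithmetic is the standard reversible 5/3 lifting with the multiplications by 1/2 and 1/4 done as shifts.

```python
def shared_unit(a, b, c, predict):
    """One adder and one shift-add stage, reused for both lifting steps."""
    if predict:
        return c - ((a + b) >> 1)      # odd sample minus half the sum of its even neighbours
    return c + ((a + b + 2) >> 2)      # even sample plus a rounded quarter of two details

def row_dwt53_stream(row):
    """Emit one coefficient per clock; row length is assumed even."""
    n = len(row)
    outputs = []                       # (clock, label, value) in emission order
    prev_high = None
    cur_high = None
    for clock in range(2, n + 2):      # the first result is possible once x[2] has arrived
        i = (clock - 2) // 2
        if clock % 2 == 0:             # even clock: predict step
            # Last extension: beyond the end of the row, reuse x[2i] as the right neighbour.
            right = row[2 * i + 2] if 2 * i + 2 < n else row[2 * i]
            prev_high = cur_high
            cur_high = shared_unit(row[2 * i], right, row[2 * i + 1], True)
            outputs.append((clock, f"H{i}", cur_high))
        else:                          # odd clock: update step
            # Forward extension: for the first output, the missing left detail equals H0.
            left = prev_high if prev_high is not None else cur_high
            low = shared_unit(left, cur_high, row[2 * i], False)
            outputs.append((clock, f"L{i}", low))
    return outputs

for clock, label, value in row_dwt53_stream([3, 7, 1, 8, 2, 9, 4, 6]):
    print(clock, label, value)
```

For a row of 8 samples the model produces eight coefficients on eight consecutive clocks, alternating high-pass and low-pass outputs, so the shared unit is busy on every cycle. The actual timing of the proposed module is the one given in Table II, which this sketch does not attempt to reproduce.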
Table II shows the data flow of the proposed 5/3 DWT module for a row with 8 samples, where Hi (Li) represents the i-th high-pass (low-pass) output.

Table I : CONTROL SIGNALS AND THE CORRESPONDING MODEL OF THE LIFTING SCHEME

Table II : DATA FLOW FOR 5/3 DWT

Fig. 6 : Proposed Row DWT architecture

C. Proposed Column DWT Module
The proposed column DWT architecture is shown in Fig. 7.
Fig. 7 : Proposed Column DWT Architecture

In order to reduce the system latency, the column DWT has to execute in row-wise order. The input data are stored in the even line buffer and the odd line buffer, so they are naturally separated into even samples and odd samples along the column. The embedded mirror symmetric boundary data extension algorithm is also implemented in the column DWT module. Four multiplexers select the step (1 represents the prediction step and 0 represents the update step). The multiplexers ensure the reuse of the hardware resources and that samples join the associated computation according to the timing plan. The column DWT module begins to process samples after the first two lines have finished the row DWT. First, the multiplexers are set to 1 and the column processor conducts the prediction step. The result of the prediction step is output and, at the same time, stored in the column buffer. Then the multiplexers are set to 0 and the column processor conducts the update step, whose results are output directly. The column processor is also pipelined to increase the speed of the wavelet transform. The data flow of the column DWT is similar to that of the row DWT.

IV. FLOATING POINT MULTIPLICATION ALGORITHM
Here we propose a floating point multiplier, instead of a normal multiplier, to multiply the filtered input values by constant coefficient values. The main advantage of this floating point multiplier is increased speed of operation and accuracy.

Fig. 8 : Floating point multiplication with a constant value a

The normalized floating point numbers have the form $Z = (-1)^{S} \times 2^{(E - \mathrm{Bias})} \times (1.M)$. The following algorithm is used to multiply two floating point numbers.
1. Multiplying the significands; i.e. $(1.M_1 \times 1.M_2)$
2. Placing the decimal point in the result
3. Adding the exponents; i.e. $(E_1 + E_2 - \mathrm{Bias})$
4. Getting the sign; i.e. $S_1$ XOR $S_2$
5. Normalizing the result; i.e.
obtaining a 1 at the MSB of the result's significand
6. Rounding implementation.
7. Verifying for underflow/overflow occurrence.

Consider the following IEEE 754 single precision floating point numbers to perform the multiplication, with the number of mantissa bits reduced for simplification: only 5 bits are considered here, while the hidden 1 bit for normalized numbers is still retained.

Here we present a floating point multiplier in which rounding support is not implemented. This gives more precision in the MAC unit, which is accessed by the multiplier or by a floating point adder unit. Fig. 9 shows the block diagram of the multiplier structure: exponent addition, significand multiplication, and result sign calculation are independent of each other and are performed in parallel. The significand multiplication is performed on two 24-bit numbers and yields a 48-bit product, which we call the intermediate product (IP). The IP is represented as (47 down to 0), and the decimal point is located between bits 46 and 45 of the IP. The following figure shows each block of the floating point multiplier.

Fig. 9 : Floating point multiplier block diagram

V. LUT IMPLEMENTATION FOR MEMORY BASED COMPUTATIONS
Here the proposed LUT implementation for memory based computations is used to store the filtered values as well as the constant multiplier product values. Instead of registers to store the values, we have
used LUTs, which reduce the memory size and optimize the area and delay. We have proposed the Anti-symmetric Product Coding (APC) and Odd-Multiple-Storage (OMS) techniques for Look Up Table (LUT) design for memory-based multipliers to be used in digital signal processing applications. Each of these techniques reduces the LUT size by a factor of two. In this work, we present a different form of APC and a modified OMS scheme, in order to combine them for efficient memory-based multiplication. The proposed combined approach reduces the LUT size to one-fourth of that of the conventional LUT.

Fig. 10 : The (5,3) Discrete Wavelet Transform

Fig. 11 : Proposed APC OMS Combined LUT design for the multiplication of W-bit fixed coefficient A with 6-bit input X

VI. EXPERIMENTAL RESULTS AND PERFORMANCE ANALYSIS
A. Experimental results
The verification system is a real-time platform mainly composed of three parts: a CMOS image sensor, an FPGA, and a PC. The functions of the image sensor are fully programmable via the I2C serial control bus. The processed image data are shown on the PC. In order to verify that the proposed architecture works correctly, we describe the architecture in Verilog HDL compatible with Quartus version 6.0 and then implement it in the FPGA on the real-time platform. The FPGA we chose is from the Cyclone family from Altera Corp. The real-time image is first captured by the image sensor and then output to the FPGA over the I2C bus. The transform circuit in the FPGA processes the captured image by performing a 3-level DWT. The transformed image data of each level are stored in the SRAM of the FPGA and then shown on the PC. The experimental result of the verification system is shown in Fig. 12, which is the image obtained by applying the 3-level lifting DWT to the original sample image.

Fig. 12 : Transformed image

B. Performance analysis
The table shows the results with respect to the hardware conditions of the proposed architecture, compared with those reported in [4]-[7]. The proposed architecture is successfully synthesized using the Cyclone device family from Altera Corp. The performance for this real-time platform, including the memory interface circuit, is shown in the table.

Fig. 13 : 3D DWT Simulation Results
Fig. 14 : 3D DWT SYNTHESIS REPORT

VII. CONCLUSION
In this paper, an efficient design and implementation of the 3-D DWT for video processing applications has been proposed. The main advantages of the proposed architecture are the use of time division multiplexing and pipelining to reduce hardware consumption and increase utilization, together with the mirror symmetric boundary extension technique, which reduces the amount of computation and the power. With the addition of the floating point algorithm and the LUT to this architecture, we can obtain precise coefficients. This work ensures that the image pixel values given to the DWT process yield the high-pass and low-pass coefficients of the input image. The simulation results of the DWT were verified with the appropriate test cases. Once the functional verification is done, the DWT is synthesized using the Xilinx tool.

REFERENCES
[1] ISO/IEC JTC 1/SC 29/WG 1 N1646R, JPEG2000 part 1 final committee draft version 1.0, 2000.
[2] C.-T. Huang, P.-C. Tseng, and L.-G. Chen, Generic RAM-based architectures for two-dimensional discrete wavelet transform with line-based method, IEEE Trans. on Circuits and Systems, vol. 1, pp. 363-366, 2002.
[3] Tan K. C. B. and Arslan T., Low power embedded extension algorithm for lifting-based discrete wavelet transform in JPEG2000, IEEE Electronic Letters, vol. 37, pp. 1328-1330, 2002.
[4] Xuguang Lan, Nanning Zheng and Yuehu Liu, Low-power and high-speed VLSI architecture for lifting-based forward and inverse wavelet transform, IEEE Trans. on Consumer Electronics, vol. 51, pp. 379-385, 2005.
[5] Chin-Fa Hsieh, Tsung-Han Tsai and Chih-Huang Lai, Implementation of an efficient DWT using a FPGA on a Real-time Platform, IEEE, ICICIC, Second International Conference on, pp. 235-235, 2007.
[6] P. Y. Chen, VLSI implementation for one-dimensional multilevel lifting-based wavelet transform, IEEE Trans. on Computers, vol. 53, pp. 386-398, 2004.
[7] Peng Cao, Xin Guo, and Chao Wang, Efficient architecture for two-dimensional discrete wavelet transform based on lifting scheme, IEEE 7th International Conference on, pp. 225-228, Oct. 2007.
[8] B.-F. Wu and C. F. Lin, A high-performance and memory-efficient pipeline architecture for the 5/3 and 9/7 discrete wavelet transform of JPEG2000 codec, IEEE Trans. on Circuit and Systems for Video Technology, vol. 15, pp. 1615-1628, Dec. 2010.
[9] Benderli O., Tekmen Y. C. and Ismailoglu N., A real time, low latency, FPGA implementation of the 2D discrete wavelet transformation for streaming image applications, IEEE workshop on Signal processing systems, pp. 384-389, Sept. 2012.
[10] H. Liao, M. K. Mandal and B. F. Cockburn, Efficient architectures for 1-D and 3-D lifting-based wavelet transforms, IEEE Trans. on Signal Processing, vol. 52, pp. 1315-1326, 2012.