ASIC Implementation of one level 2D DWT and 2D DWT in Hybrid Wave-Pipelining & Pipelining

Journal of Scientific & Industrial Research
Vol. 74, November 2015, pp. 609-613

V Adhinarayanan 1*, S Gopalakrishnan 2 and H A Shabeer 3
*1 Sathiyabama University, Chennai, Tamil Nadu, India
2 Department of ECE, Oxford Engineering College, Anna University, Chennai, Tamil Nadu, India
3 Department of ECE, AVS Engineering College, Salem, Tamil Nadu, India

Received 22 April 2014; revised 18 January 2015; accepted 14 September 2015

A pipelined system incurs clock routing complexity and clock skew between different parts of the system. Wave-pipelining permits higher operating frequencies in digital systems. It requires proper selection of the clock period and clock skew so that the output of the combinational logic circuit is latched during its stable period. The hybrid scheme aims to combine the advantages of pipelining and wave-pipelining. Hence, we propose the design and implementation of a hybrid wave-pipelined and pipelined 2D DWT using the lifting scheme, and we implement one-level 2D DWT using three techniques: pipelining, non-pipelining and wave-pipelining. From the results, it is concluded that the hybrid scheme is faster than the non-pipelined circuit and requires less area, less clock routing complexity and lower power than the pipelined circuit. It is also observed that the wave-pipelined circuit is faster than the non-pipelined circuit.

Keywords: FPGA, SOC, ASIC, DWT, Lifting, Constant co-efficient multiplier.

Introduction

Field-programmable gate arrays (FPGAs) have grown enormously in complexity and can integrate all the major functional elements of a complete end product into a single chip 1. An FPGA-based system on chip can contain one or more processors, memories, dedicated components for accelerating critical tasks and interfaces to various peripherals.
Development tools for FPGAs, such as the system-on-programmable-chip (SOPC) builder from Altera, San Jose, CA, USA, enable the integration of intellectual property (IP) cores for common DSP functions and user-designed custom blocks with the Nios II softcore processor. The availability of on-chip dedicated multipliers, softcore/hardcore processors and IP cores makes FPGAs an ideal platform for implementing area-intensive and speed-intensive image processing applications such as the discrete cosine transform (DCT) and the discrete wavelet transform (DWT) 2.

*Author for correspondence. E-mail: ma231@rediffmail.com

Joint Photographic Experts Group 2000 (JPEG2000) is a recently standardized image compression algorithm that provides significant enhancements over the existing JPEG standard. JPEG2000 differs from widely used compression standards in that it relies on DWT and uses embedded bit plane coding of the wavelet coefficients. DWT has traditionally been implemented using convolution or FIR filter bank structures. These structures require both a large number of arithmetic computations and a large memory for storage, which are not desirable for high-speed/low-power image processing applications.

A new multiplier algorithm, denoted the Baugh-Wooley pipelined constant coefficient multiplier (BW-PKCM), has been proposed and used for the study and comparison of the distributed arithmetic algorithm (DAA) and lifting schemes on FPGAs 3. The computation of 2D DWT requires 2's complement multiplications. In the literature, the BW method has been studied with carry save, carry ripple and serial parallel algorithms 4. These schemes are inefficient in speed, area, or both when one of the operands is fixed. For an N-bit number, the conventional 2's complement multiplier (C2CM) requires [(N-1)/4] arrays of 4-input LUTs, whereas the sign extension and BW methods require [N/4] arrays of 4-input LUTs. The size of each array is equal to the number of product bits.
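As an illustration of the Baugh-Wooley idea mentioned above, the following is a minimal bit-level sketch of the textbook modified Baugh-Wooley formulation for N-bit 2's complement operands. It is not the BW-PKCM of this paper: there is no ROM, no pipelining and no fixed operand, and the function name `baugh_wooley_multiply` is a hypothetical name chosen for this example.

```python
def baugh_wooley_multiply(a: int, b: int, n: int) -> int:
    """Multiply two n-bit 2's complement integers using Baugh-Wooley partial products.

    Illustrative software model only (assumption: textbook formulation,
    not the paper's hardware design).
    """
    lo, hi = -(1 << (n - 1)), (1 << (n - 1)) - 1
    if not (lo <= a <= hi and lo <= b <= hi):
        raise ValueError("operands must fit in n-bit 2's complement")

    # Bit i of the n-bit 2's complement pattern of each operand.
    a_bits = [(a >> i) & 1 for i in range(n)]
    b_bits = [(b >> i) & 1 for i in range(n)]
    msb = n - 1

    total = 0
    # Magnitude bits: ordinary AND partial products with positive weight.
    for i in range(msb):
        for j in range(msb):
            total += (a_bits[i] & b_bits[j]) << (i + j)
    # Sign bit times sign bit also has positive weight.
    total += (a_bits[msb] & b_bits[msb]) << (2 * msb)
    # Cross terms involving exactly one sign bit carry negative weight;
    # Baugh-Wooley replaces each by its complement so only additions remain.
    for i in range(msb):
        total += (1 - (a_bits[i] & b_bits[msb])) << (i + msb)
        total += (1 - (a_bits[msb] & b_bits[i])) << (i + msb)
    # Constant correction terms that compensate for the complementing.
    total += (1 << n) + (1 << (2 * n - 1))

    # Keep 2n product bits and reinterpret them as a signed value.
    total &= (1 << (2 * n)) - 1
    if total >= 1 << (2 * n - 1):
        total -= 1 << (2 * n)
    return total
```

For example, `baugh_wooley_multiply(-2, 1, 2)` evaluates the partial products of two 2-bit operands and returns -2. Because every partial product is added rather than subtracted, the array is regular, which is the property that makes the method attractive for LUT-based implementations.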
The 2's complement block and control logic increase the number of LUT arrays, the area and the multiplication time for the C2CM. However, for the sign extension and BW methods, the number of LUT arrays may be the same as that required for the first scheme. The lifting scheme with BW-PKCM requires 4% less area but has the same speed as the scheme using the distributed arithmetic algorithm with sign extension. In 2D DWT, the filter coefficients are constant, and the implementation details are available 3. Hence, BW-PKCM, which combines the pipelined KCM with the Baugh-Wooley multiplication algorithm, is used in this paper.

The operating frequency of the 2D DWT may be increased if it is implemented using either pipelining or wave-pipelining (WP). Pipelining results in the highest operating frequency but has a number of disadvantages such as increased area, power dissipation and clock routing complexity. WP has been proposed as one of the techniques for overcoming these limitations. A number of systems have been implemented using wave-pipelining on ASICs and FPGAs 5, 6. The concept of wave-pipelining has been described in a number of previous works 7, 8, 9. WP results in an increase in speed and a reduction in clock routing complexity. The proposed hybrid scheme aims to combine the advantages of both pipelining and wave-pipelining.

The rest of the paper is organized as follows. Section II reviews previous work on 2D DWT. Section III presents the design of wave-pipelined lifting blocks. Section IV presents automation schemes for wave-pipelined circuits. Section V discusses the implementation and study of lifting blocks and presents the results. Section VI summarizes the conclusions.

Review of previous work on 2D DWT

The 2D wavelet transform may be computed using filter banks. In one-level 2D DWT, x[n] denotes the input image and LL1 denotes the subset of the transform coefficients that represents the coarse form of the input image. The input samples x(n) are passed through two stages of analysis filters. They are first processed by the low pass h[n] and high pass g[n] horizontal filters and are subsampled by two.
Subsequently, the outputs (L1, H1) are processed by low pass and high pass vertical filters. The horizontal and vertical filters contain 5 lifting blocks (α, β, γ, δ, ξ). The lifting scheme uses a poly-phase structure for the analysis filter. For two-level 2D DWT, the input is the LL1 component, and the same procedure is followed for further decomposition. At every level, the image is reduced by a factor of four.

In the lifting scheme, the odd and even input samples are processed by the lifting blocks (α, β, γ, δ and ξ (ξ1 and ξ2)) in cascade. ξ1 and ξ2 are scaling blocks. Details of the α and β blocks are shown in Fig. 1a and Fig. 1b. The γ and δ blocks are obtained by replacing the constants α, β with γ, δ. In Fig. 1, since the output from one block is fed as the input to the next block, the maximum rate at which the input can be fed to the system depends on the sum of the delays in all the four stages. The speed is increased by introducing pipelining at the points indicated by dotted lines in Fig. 1b. In this case, the input rate is determined by the largest delay among all the four blocks.

The delay in the individual stages is reduced further by using a Constant Coefficient Multiplier (KCM). A KCM uses a ROM to find the product of a constant and a variable. The variable is fed as the address to the ROM, which contains the products corresponding to all possible values of the variable operand. When the ROM is implemented using 4-input Look Up Tables (LUTs), a number of stages of LUTs and adders are required to find the product. For example, a 12x12 bit KCM requires one ROM stage consisting of three 16x16 ROMs and two stages of 16 bit adders. The speed of the KCM can be increased by introducing pipelining registers at the outputs of the ROMs and adders.

Fig. 1 (a) α block; (b) β block

The Pipelined Constant Coefficient Multiplier (PKCM) using the BW method is referred to as BW-PKCM and is shown to be superior to the other approaches 10. Hence, only this multiplier is considered for wave-pipelining in this paper. The α block is implemented using BW-PKCM, and the same scheme can be adopted for the β, γ, δ, ξ1 and ξ2 blocks.

Design of wave-pipelined lifting blocks on FPGAs

An RTL model of a circuit consists of a combinational logic circuit placed between input and output registers. The combinational logic circuit may be considered a wave-pipelined circuit if a number of waves are made to propagate through it simultaneously 11. In other words, at any point of time, a sequence of data is processed in the combinational logic block. In the case of pipelining, only one data item is processed in the combinational logic block at a time. Further, the maximum data rate in the pipelined circuit depends only on Dmax, the maximum propagation delay in the combinational logic block. A temporal/spatial diagram of the combinational logic circuit is given in 8. If Dmin denotes the minimum propagation delay of the signal through the combinational logic block, the maximum data rate of the wave-pipelined circuit depends on (Dmax - Dmin). Traditionally, in a wave-pipelined circuit, higher speeds are achieved by equalizing Dmax and Dmin 9.

The output of the wave-pipelined circuit alternates between unstable and stable states. The stable period decreases as the logic depth increases. By adjusting the latching instant at the output register to lie in the stable period, the wave-pipelined circuit can be made to work properly.
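The dependence of the data rate on Dmax for a conventional register-to-register stage and on (Dmax - Dmin) for a wave-pipelined stage can be made concrete with a small sketch. This is a simplified model under stated assumptions: register setup and hold times and clock uncertainty are lumped into a single hypothetical `overhead` parameter, and the function names are chosen for this example only.

```python
import math


def min_clock_period_conventional(d_max: float, overhead: float = 0.0) -> float:
    """Lower bound on the clock period when only one data item is in the logic block."""
    return d_max + overhead


def min_clock_period_wave(d_max: float, d_min: float, overhead: float = 0.0) -> float:
    """Lower bound on the clock period when several waves share the logic block.

    Simplified model (assumption): the bound is the delay spread plus a lumped
    overhead term for register timing and clock uncertainty.
    """
    if d_min > d_max:
        raise ValueError("d_min must not exceed d_max")
    return (d_max - d_min) + overhead


def waves_in_flight(d_max: float, clock_period: float) -> int:
    """Number of data waves simultaneously present in the logic block."""
    return math.ceil(d_max / clock_period)


# Hypothetical delays in nanoseconds, chosen only to illustrate the formulas.
conventional = min_clock_period_conventional(10.0, overhead=1.0)
wave = min_clock_period_wave(10.0, 7.0, overhead=1.0)
print(conventional, wave, waves_in_flight(10.0, wave))
```

With these hypothetical delays the conventional bound is 11 ns and the wave-pipelined bound is 4 ns, so three waves are in flight at once. The sketch also shows why equalizing Dmax and Dmin raises the achievable speed: the bound shrinks as the two delays approach each other.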
For large logic depths, however, there may not be any stable period. Hence, adjusting the latching instant by itself may not be adequate for storing the correct result at the output register. In such cases, the clock period has to be increased to lengthen the stable period. Equalization of path delays, adjustment of the clock period and adjustment of the clock skew are the three tasks carried out to maximize the operating speed of the wave-pipelined circuit. All three tasks require the delays to be measured and, if required, altered. Layout editors, such as the EPIC editor from Xilinx, may be used for this purpose, and these tasks are carried out manually 12, 13.

A wave-pipelined circuit designed using the layout editor may be tested using simulation. However, simulation is inadequate for testing because the actual delays differ from the delays calculated by the layout editor. The layout editor considers only the worst case delays, and the actual delays may be significantly different due to fabrication variations. This difference becomes more important as the logic depth of the circuit increases. Hence, the design is downloaded to the actual FPGA and its operation is checked using a PC based test system 14. If correct results are not obtained, the delays are altered and the design is downloaded for testing again. A number of iterations of place and route, simulation, downloading and testing in the actual device may be required until correct results are obtained. Designing a wave-pipelined circuit in this fashion requires human intervention and is time consuming. Automation of the above three tasks is considered in this paper.

Fig. 2 Overall block diagram of one level 2D DWT
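The cascade of lifting blocks described in the review section (α, β, γ, δ followed by the scaling blocks ξ1 and ξ2) can be sketched in software for one row or column of samples. The following is a minimal floating-point sketch, not the fixed-point BW-PKCM hardware of this paper. It assumes the widely published lifting coefficients of the 9/7 biorthogonal filter, symmetric extension at the borders, and one particular scaling convention (ξ1 = 1/K, ξ2 = K); conventions for the scaling step differ between references, and all function names are hypothetical.

```python
# Widely published 9/7 lifting constants (assumption: the paper's blocks
# use the same values, which the text does not list).
ALPHA = -1.586134342
BETA = -0.05298011854
GAMMA = 0.8829110762
DELTA = 0.4435068522
K = 1.230174105  # scaling constant; the convention is an assumption


def _predict(s, d, c):
    """Return d[i] + c * (s[i] + s[i+1]), mirroring s at the right border."""
    n = len(d)
    return [d[i] + c * (s[i] + s[min(i + 1, n - 1)]) for i in range(n)]


def _update(s, d, c):
    """Return s[i] + c * (d[i-1] + d[i]), mirroring d at the left border."""
    return [s[i] + c * (d[max(i - 1, 0)] + d[i]) for i in range(len(s))]


def lifting_forward(x, xi1=1.0 / K, xi2=K):
    """One-level 1D transform: returns (low pass, high pass) coefficients."""
    if len(x) < 2 or len(x) % 2:
        raise ValueError("input length must be even and at least 2")
    s, d = list(x[0::2]), list(x[1::2])  # even and odd samples
    d = _predict(s, d, ALPHA)            # alpha block
    s = _update(s, d, BETA)              # beta block
    d = _predict(s, d, GAMMA)            # gamma block
    s = _update(s, d, DELTA)             # delta block
    return [v * xi1 for v in s], [v * xi2 for v in d]  # scaling blocks


def lifting_inverse(low, high, xi1=1.0 / K, xi2=K):
    """Undo lifting_forward by running the blocks in reverse order."""
    s = [v / xi1 for v in low]
    d = [v / xi2 for v in high]
    s = _update(s, d, -DELTA)
    d = _predict(s, d, -GAMMA)
    s = _update(s, d, -BETA)
    d = _predict(s, d, -ALPHA)
    x = [0.0] * (2 * len(s))
    x[0::2], x[1::2] = s, d
    return x
```

A one-level 2D transform would apply `lifting_forward` to every row (the horizontal filter) and then to every column of the two resulting halves (the vertical filter), giving four sub-bands of which LL1 is the coarse image. Because each block only adds a function of the other channel, the inverse is obtained by subtracting the same terms in reverse order.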

Implementation of 2D DWT using lifting scheme

The overall block diagram of one-level 2D DWT is shown in Fig. 2. The inputs and outputs of both the horizontal filter and the vertical filter are assumed to be stored in block RAMs. For the horizontal filter, the even and odd inputs are applied from two block RAMs of size 512x11. For testing with an image, the result is written into four block RAMs of size 256x1.

The self-tuned wave-pipelined circuit is shown in Fig. 3. It consists of the following functional blocks: PRSG block, PRBS sequence generator, signature analyzer, counter, programmable clock generator circuit, programmable skew generator circuit and FSM. It has two modes of operation, namely test mode and normal mode. The test mode signal selects the mode of operation: test mode is selected by setting the test mode signal to 1. In normal mode, user input can be applied.

Results and Discussion

The 2D DWT scheme is implemented on ASIC using the lifting blocks with 9/7 biorthogonal filters and BW-PKCM multipliers. The 2D DWT is implemented in 180 nm ASIC technology. Verilog HDL is used to describe the functionality of the circuit, and the functionality is then verified using the ModelSim simulation tool. Leo Spectrum is used for synthesizing the circuit. From the implementation results shown in Table 1, it is observed that the wave-pipelined circuit is 1.07 times faster than the non-pipelined circuit. The pipelined circuit is 1.08 times faster than the wave-pipelined circuit, and this is achieved at the cost of an area increase of about 1.86 times (6896 versus 3712 in Table 1).

Implementation results on Spartan III XC3S200

One-level 2D DWT is implemented on the Xilinx Spartan-III XC3S200 using all three approaches, and the results are given in Fig. 4. The programmable clock and clock skew blocks are implemented as macro blocks using the Xilinx ISE 8.1i project navigator.
For tuning the wave-pipelined circuit, the MicroBlaze soft-core processor is used. The Xilinx Embedded Development Kit (EDK) software is used to integrate the custom block with the MicroBlaze processor 16. The rest of the steps are similar to those used for the Altera SOC kit 17. For all three schemes, the number of logic elements (LEs), the number of registers, the maximum operating frequency and the power dissipated are computed, and the results are given in Fig. 4. From Fig. 4, it may be concluded that, for the lifting scheme, the method using hybrid WP-P BW-KCM is faster than the non-pipelined BW-KCM by a factor of 1.07. The scheme with BW-PKCM is in turn faster than the hybrid WP-P BW-KCM by a factor of 1.56, and this is achieved with an increase in the number of registers by a factor of 3.157 and an increase in the number of LEs by a factor of 1.54 compared to the hybrid WP-P unit.

Fig. 3 Self tuned wave-pipelined circuit

Table 1 Implementation Results of 2D DWT

Scheme             Area    Frequency (MHz)
Pipelining         6896    346.8
Non-Pipelining     3712    299.48
Wave Pipelining    3712    321.93

Fig. 4 Implementation results on Spartan-III XC3S200
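The speed and area ratios quoted for the ASIC implementation can be recomputed directly from the entries of Table 1. The short sketch below does only that arithmetic; the dictionary layout and the function name `table1_ratios` are choices made for this example.

```python
# (area, frequency in MHz) for each scheme, copied from Table 1.
TABLE_1 = {
    "pipelining": (6896, 346.8),
    "non_pipelining": (3712, 299.48),
    "wave_pipelining": (3712, 321.93),
}


def table1_ratios(table=TABLE_1):
    """Return the speed-up and area ratios between the three schemes."""
    area_p, freq_p = table["pipelining"]
    area_n, freq_n = table["non_pipelining"]
    area_w, freq_w = table["wave_pipelining"]
    return {
        # Wave-pipelined versus non-pipelined operating frequency.
        "wave_vs_non_speed": freq_w / freq_n,
        # Pipelined versus wave-pipelined operating frequency.
        "pipe_vs_wave_speed": freq_p / freq_w,
        # Pipelined versus wave-pipelined area.
        "pipe_vs_wave_area": area_p / area_w,
        # Wave-pipelined versus non-pipelined area.
        "wave_vs_non_area": area_w / area_n,
    }


for name, value in table1_ratios().items():
    print(f"{name}: {value:.3f}")
```

The wave-pipelined and non-pipelined circuits occupy the same area in Table 1, so the wave-pipelined speed-up comes without an area penalty, whereas the further speed-up from full pipelining is paid for with nearly twice the area.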

Implementation of 1 level 2D DWT using ASIC

The 2D DWT scheme is implemented on ASIC using the lifting blocks with 9/7 biorthogonal filters and BW-PKCM multipliers. The 2D DWT is implemented in 180 nm ASIC technology. Verilog HDL is used to describe the functionality of the circuit, and the functionality is then verified using the ModelSim simulation tool. Leo Spectrum is used for synthesizing the circuit. Here, the 2D DWT is implemented on ASIC for the first time. In future, the work can be extended to compare it with hybrid WP schemes.

Conclusion

In this paper, the implementation of one-level 2D DWT on ASIC shows that the wave-pipelined circuit is faster than the non-pipelined circuit. The pipelined circuit is faster than the wave-pipelined circuit, but this is achieved at the cost of increased area. The hybrid WP-P BW-KCM scheme is also verified and is found to be register efficient.

References

1. Draper B A, Beveridge J R, Bohm A P W, Ross C & Chawathe M, Accelerated image processing on FPGAs, IEEE Trans. Image Process, 12 (12) (2003) 1543-1551.
2. Ritter J & Molitor P, A pipelined architecture for partitioned DWT based lossy image compression using FPGAs, In: Proc ACM Conf FPGA, (2001) 201-206.
3. Lakshminarayanan G, Venkataramani B, Kumar J S, Yousuf A K & Sriram G, Design and FPGA implementation of image block encoders with 2D-DWT, In: Proc IEEE Conf on Con Tech for Asia-Pacific Region (TENCON 03), 3 (2003) 1015-1019.
4. Daubechies I & Sweldens W, Factoring Wavelet Transforms into Lifting Steps, J Fourier Anal & Appl, 4 (1998) 247-269.
5. Parhi K K, VLSI Signal Processing Systems, John Wiley & Sons, New York, NY, USA, (1999).
6. Nyathi J & Delgado-Frias J D, A Hybrid wave-pipelined network router, IEEE Trans. Circuits Syst. I, Fundam. Theory & Appl, 49 (12) (2002) 1764-1772.
7. Hauck O, Katoch A & Huss S A, VLSI system design using asynchronous wave pipelines: a 0.35 μm CMOS 1.5 GHz elliptic curve public key cryptosystem chip, Proc. of Sixth Intl. Symp on Adva Res in Async Cir & Sys, (2000) 188-197.
8. Burleson W B, Ciesielski M, Klass F & Liu W, Wave-pipelining: a tutorial and research survey, IEEE Trans. Very Large Scale Integr. (VLSI) Syst, 6 (3) (1998) 464-474.
9. Gray C, Liu W & Cavin R, Wave-pipelining: Theory and Implementation, Kluwer Academic Publishers, (1993).
10. Tuttlebee W, Software defined radio, John Wiley & Sons Ltd., USA, (2004).
11. Lakshminarayanan G & Venkataramani B, Optimization techniques for FPGA based wave-pipelined DSP blocks, IEEE Trans. Very Large Scale Integr. (VLSI) Syst, 13 (7) (2003) 783-793.