ASIC Implementation of one level 2D DWT and 2D DWT in Hybrid Wave-Pipelining & Pipelining

Journal of Scientific & Industrial Research
Vol. 74, November 2015, pp. 609-613

V Adhinarayanan 1*, S Gopalakrishnan 2 and H A Shabeer 3
*1 Sathiyabama University, Chennai, Tamil Nadu, India
2 Department of ECE, Oxford Engineering College, Anna University, Chennai, Tamil Nadu, India
3 Department of ECE, AVS Engineering College, Salem, Tamil Nadu, India

Received 22 April 2014; revised 18 January 2015; accepted 14 September 2015

A pipelined system suffers from clock routing complexity and clock skew between different parts of the system. Higher operating frequencies may be obtained in digital systems using wave-pipelining. This requires proper selection of the clock period and clock skew so that the output of the combinational logic circuit is latched during its stable period. The hybrid scheme aims to combine the advantages of pipelining and wave-pipelining. Hence, we propose the design and implementation of a hybrid wave-pipelined 2D DWT using the lifting scheme, and the computation of one-level 2D DWT is implemented using three techniques: pipelining, non-pipelining and wave-pipelining. From the results, it is concluded that hybrid pipelining is faster than the non-pipelined circuit and requires less area, less clock routing complexity and lower power compared to pipelining. It is also observed that the wave-pipelined circuit is faster than the non-pipelined circuit.

Keywords: FPGA, SOC, ASIC, DWT, Lifting, Constant co-efficient multiplier.

Introduction

Field-programmable gate arrays (FPGAs) have grown enormously in complexity and can encompass all the major functional elements of a complete end product in a single chip [1]. An FPGA-based system on chip can contain one or more processors, memories, dedicated components for accelerating critical tasks and interfaces to various peripherals.
Development tools for FPGAs, such as the system-on-programmable-chip (SOPC) Builder from Altera (San Jose, CA, USA), enable the integration of intellectual property (IP) cores for common DSP functions and user-designed custom blocks with the soft-core processor Nios II. The availability of on-chip dedicated multipliers, soft-core/hard-core processors and IP cores makes FPGAs an ideal platform for the implementation of area-intensive as well as speed-intensive image processing applications such as the discrete cosine transform (DCT) and the discrete wavelet transform (DWT) [2].

*Author for correspondence. E-mail: ma231@rediffmail.com

Joint Photographic Experts Group 2000 (JPEG2000) is a recently standardized image compression algorithm that provides significant enhancements over the existing JPEG standard. JPEG2000 differs from widely used compression standards in that it relies on DWT and uses embedded bit-plane coding of the wavelet coefficients. DWT has traditionally been implemented using convolution or FIR filter bank structures. These structures require both a large number of arithmetic computations and a large memory for storage, which are not desirable for high-speed/low-power image processing applications. A new multiplier algorithm, denoted the Baugh-Wooley pipelined constant coefficient multiplier (BW-PKCM), has been proposed and used for the study and comparison of the distributed arithmetic algorithm (DAA) and lifting schemes on FPGAs [3].

For the computation of 2D DWT, 2's complement multiplications are required. In the literature, the BW method has been studied with carry-save, carry-ripple and serial-parallel algorithms [4]. These schemes are inefficient in speed, area, or both when one of the operands is fixed. For an N-bit number, the conventional 2's complement multiplier (C2CM) requires [(N-1)/4] arrays of 4-input LUTs, whereas the sign extension and BW methods require [N/4] arrays of 4-input LUTs. The size of each array is equal to the number of product bits.
The 2's complement block and control logic increase the number of LUT arrays, the area and the multiplication time for the C2CM. However, for the sign extension and BW methods, the number of LUT arrays may be the same as that required for the first scheme. The lifting scheme with BW-PKCM requires 4% less area but has the same speed compared to that using the distributed arithmetic algorithm with the sign extension scheme. In 2D DWT, the implementation details are available and the filter coefficients are constant [3]. Hence, BW-PKCM, which combines the pipelined KCM with the Baugh-Wooley multiplication algorithm, is used in this paper.

The operating frequency of the 2D DWT may be increased if it is implemented using either pipelining or wave-pipelining (WP). Pipelining results in the highest operating frequency but has a number of disadvantages such as increased area, power dissipation and clock routing complexity. WP has been proposed as one of the techniques for overcoming these limitations. A number of systems have been implemented using wave-pipelining on ASICs and FPGAs [5, 6]. The concept of wave-pipelining has been described in a number of previous works [7, 8, 9]. WP results in an increase in speed and a reduction in clock routing complexity. The proposed hybrid scheme is aimed at combining the advantages of both pipelining and wave-pipelining.

The rest of the paper is organized as follows. In section II, previous work on 2D DWT is reviewed. In section III, the design of wave-pipelined lifting blocks is presented. In section IV, automation schemes for wave-pipelined circuits are presented. In section V, the implementation and study of lifting blocks are discussed and results are presented. Section VI summarizes the conclusions.

Review of previous work on 2D DWT

The 2D wavelet transform may be computed using filter banks. In one-level 2D DWT, x[n] denotes the input image and LL1 denotes the subset of the transform coefficients that represents the coarse form of the input image. The input samples x[n] are passed through two stages of analysis filters. They are first processed by the low-pass h[n] and high-pass g[n] horizontal filters and are subsampled by two.
Subsequently, the outputs (L1, H1) are processed by low-pass and high-pass vertical filters. The horizontal and vertical filters contain 5 lifting blocks (α, β, γ, δ, ξ). The lifting scheme uses a poly-phase structure for the analysis filter. For two-level 2D DWT, the input is the LL1 component, and the same procedure is followed for further decomposition. At every level, the image size is reduced by a factor of four.

In the lifting scheme, the odd and even input samples are processed by the lifting blocks (α, β, γ, δ and ξ, where ξ comprises ξ1 and ξ2) in cascade. ξ1 and ξ2 are scaling blocks. Details of the α and β blocks are shown in Fig. 1a and Fig. 1b. The γ and δ blocks are obtained by replacing the constants α, β with γ, δ. In Fig. 1, since the output from one block is fed as the input to the next block, the maximum rate at which the input can be fed to the system depends on the sum of the delays in all the four stages. The speed is increased by introducing pipelining at the points indicated by dotted lines in Fig. 1b. In this case, the input rate is determined by the largest delay among all the four blocks.

Fig. 1 (a) α block; (b) β block

The delay in the individual stages is reduced further by using a Constant Coefficient Multiplier (KCM). The KCM uses a ROM for finding the product of a constant and a variable. The variable is fed as the address to the ROM, which contains the products corresponding to all possible values of the variable operand. When the ROM is implemented using 4-input Look Up Tables (LUTs), a number of stages of LUTs and adders are required to find the product. For example, a 12x12 bit KCM requires one ROM stage consisting of three 16x16 ROMs and two stages of 16-bit adders. The speed of the KCM can be increased by introducing pipelining registers at the outputs of the ROMs and adders. The Pipelined Constant Coefficient Multiplier (PKCM) using the BW method is referred to as BW-PKCM and is shown to be superior to the other approaches [10]. Hence, only this multiplier is considered for wave-pipelining in this paper. The α block is implemented using BW-PKCM, and the same scheme can be adopted for the β, γ, δ, ξ1 and ξ2 blocks.

Design of wave-pipelined lifting blocks on FPGAs

An RTL model of a circuit consists of a combinational logic circuit separated by input and output registers. The combinational logic circuit may be considered to be a wave-pipelined circuit if a number of waves are made to propagate through it simultaneously [11]. In other words, at any point of time, a sequence of data is processed in the combinational logic block. In the case of pipelining, only one data item is processed in the combinational logic block at a time. Further, the maximum data rate in the pipelined circuit depends only on Dmax, the maximum propagation delay in the combinational logic block. A temporal/spatial diagram for combinational logic circuits is given in [8]. If Dmin denotes the minimum propagation delay of the signal through the combinational logic block, the maximum data rate of the wave-pipelined circuit depends on (Dmax - Dmin). Traditionally, in a wave-pipelined circuit, higher speeds are achieved by equalizing Dmax and Dmin [9].

The output of the wave-pipelined circuit alternates between unstable and stable states. The stable period decreases as the logic depth increases. By adjusting the latching instant at the output register to lie in the stable period, the wave-pipelined circuit can be made to work properly.
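The timing argument above can be made concrete with a small numerical sketch. The model below is an idealised illustration and not the authors' design flow: it assumes a single combinational block whose output for one wave settles at Dmax and is first disturbed by the next wave at T + Dmin, so the stable period has length T - (Dmax - Dmin). The class name, the delay values, and the setup and hold parameters are all hypothetical.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class WaveTiming:
    """Idealised timing model of one wave-pipelined combinational block.

    All values are hypothetical and expressed in the same time unit (e.g. ns).
    """
    d_min: float          # shortest propagation delay through the block
    d_max: float          # longest propagation delay through the block
    t_setup: float = 0.0  # assumed setup time of the output register
    t_hold: float = 0.0   # assumed hold time of the output register

    def pipelined_period(self) -> float:
        # With one data item in the block at a time, the clock must cover d_max.
        return self.d_max + self.t_setup

    def min_wave_period(self) -> float:
        # Lower bound for wave-pipelining: the period must exceed the delay spread.
        return (self.d_max - self.d_min) + self.t_setup + self.t_hold

    def stable_window(self, period: float) -> Optional[Tuple[float, float]]:
        # Permissible latching instants, measured from the launch of a wave:
        # after the slowest path has settled and before the fastest path of
        # the next wave (launched one period later) changes the output.
        start = self.d_max + self.t_setup
        end = period + self.d_min - self.t_hold
        return (start, end) if end > start else None

    def latch_skew(self, period: float) -> Optional[float]:
        # Place the latching instant in the middle of the stable window and
        # express it as a skew within one clock period.
        window = self.stable_window(period)
        if window is None:
            return None  # no stable period: the clock period must be increased
        return (0.5 * (window[0] + window[1])) % period


timing = WaveTiming(d_min=6.0, d_max=10.0)
print(timing.min_wave_period(), timing.stable_window(5.0), timing.latch_skew(5.0))
```

With these assumed delays, a 5-unit clock leaves a stable window one unit wide, whereas a 3-unit clock leaves none, which corresponds to the case where the clock period has to be increased.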
However, for large logic depths, there may not be any stable period. Hence, adjusting the latching instant by itself may not be adequate for storing the correct result at the output register. For such cases, the clock period has to be increased to increase the stable period. Equalization of path delays, adjustment of the clock period and adjustment of the clock skew are the three tasks carried out for maximizing the operating speed of the wave-pipelined circuit. All three tasks require the delays to be measured and altered if required. Layout editors, such as the EPIC editor from Xilinx, may be used for this purpose, and these tasks are carried out manually [12, 13].

The wave-pipelined circuit designed using the layout editor may be tested using simulation. However, simulation is inadequate for testing because of the difference between the actual delays and the delays calculated by the layout editor. The layout editor considers only the worst-case delays, and the actual delays may be significantly different due to fabrication variations. This difference becomes important as the logic depth of the circuit increases. Hence, the design is downloaded to the actual FPGA and its operation is checked using a PC-based test system [14]. If correct results are not obtained, the delays are altered and the design is downloaded for testing again. A number of iterations of place and route, simulation, downloading and testing in the actual device may be required until correct results are obtained. Designing a wave-pipelined circuit in this fashion requires human intervention and is time consuming. Automation of the above three tasks is considered in this paper.

Fig. 2 Overall block diagram of one level 2D DWT
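As a software reference for the one-level 2D DWT of Fig. 2, the lifting cascade described in the review section (blocks α, β, γ, δ followed by the scaling blocks ξ1 and ξ2) can be sketched as below. This is a floating-point illustration under stated assumptions, not the fixed-point BW-PKCM hardware: the coefficient values are the commonly quoted 9/7 lifting constants, the scaling constants and the symmetric boundary extension are assumed because the text does not specify them, and all function names are hypothetical.

```python
# Commonly quoted 9/7 lifting constants (assumed; not listed in the paper).
ALPHA, BETA = -1.586134342, -0.05298011854
GAMMA, DELTA = 0.8829110762, 0.4435068522
K = 1.230174105            # assumed normalisation giving unit low-pass DC gain
XI1, XI2 = 1.0 / K, K      # assumed scaling for the low-pass / high-pass outputs


def predict(d, s, c):
    # d[i] += c * (s[i] + s[i+1]); the right edge is mirrored.
    n = len(s)
    return [d[i] + c * (s[i] + s[min(i + 1, n - 1)]) for i in range(n)]


def update(s, d, c):
    # s[i] += c * (d[i-1] + d[i]); the left edge is mirrored.
    return [s[i] + c * (d[max(i - 1, 0)] + d[i]) for i in range(len(s))]


def dwt97_1d(x):
    """One lifting pass over an even-length sequence: returns (low, high)."""
    assert len(x) % 2 == 0, "even length assumed for this sketch"
    s, d = list(x[0::2]), list(x[1::2])   # even and odd samples (poly-phase)
    d = predict(d, s, ALPHA)              # alpha block
    s = update(s, d, BETA)                # beta block
    d = predict(d, s, GAMMA)              # gamma block
    s = update(s, d, DELTA)               # delta block
    return [XI1 * v for v in s], [XI2 * v for v in d]   # scaling blocks


def filter_columns(band):
    # Apply the 1D transform to every column and return (low, high) bands.
    cols = [dwt97_1d(list(col)) for col in zip(*band)]
    low = [list(row) for row in zip(*(c[0] for c in cols))]
    high = [list(row) for row in zip(*(c[1] for c in cols))]
    return low, high


def dwt97_2d(image):
    """One-level 2D DWT: horizontal filtering of rows, then vertical filtering."""
    rows = [dwt97_1d(row) for row in image]
    l1 = [lo for lo, _ in rows]           # horizontally low-pass half (L1)
    h1 = [hi for _, hi in rows]           # horizontally high-pass half (H1)
    ll, lh = filter_columns(l1)           # sub-band naming conventions vary
    hl, hh = filter_columns(h1)
    return ll, lh, hl, hh


ll, lh, hl, hh = dwt97_2d([[10.0] * 8 for _ in range(8)])
print(len(ll), len(ll[0]))                # each sub-band is a quarter of the image
```

For a constant test image the three detail sub-bands come out numerically close to zero and the LL sub-band stays close to the input value, which is a quick sanity check on the coefficients and the boundary handling.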

Implementation of 2D DWT using lifting scheme

The overall block diagram of one-level 2D DWT is shown in Fig. 2. The inputs and outputs of the horizontal filter as well as the vertical filter are assumed to be stored in block RAMs. For the horizontal filter, the even and odd inputs are applied from two block RAMs of size 512x11. For testing with an image, the result is written into four block RAMs of size 256x1.

The self-tuned wave-pipelined circuit is shown in Fig. 3. It consists of different functional blocks, namely the PRSG block, PRBS sequence generator, signature analyzer, counter, programmable clock generator circuit, programmable skew generator circuit and FSM. It has two modes of operation, namely test mode and normal mode. The test mode signal selects the mode of operation; test mode is selected by setting the test mode signal to 1. In normal mode, user input can be applied.

Results and Discussion

The 2D DWT scheme is implemented on ASIC using the lifting blocks with 9/7 biorthogonal filters and BW-PKCM multipliers. The 2D DWT is implemented in ASIC using 180 nm technology. Verilog HDL is used to describe the functionality of the circuit, and the functionality is then verified using the ModelSim simulation tool. Leo Spectrum is used for synthesizing the circuit. From the implementation results shown in Table 1, it is observed that the wave-pipelined circuit is 1.07 times faster than the non-pipelined circuit. The pipelined circuit is 1.08 times faster than the wave-pipelined circuit, and this is achieved with an increase in area by a factor of 1.85.

Implementation results on Spartan III XC3S200

One-level 2D DWT is implemented on a Xilinx Spartan-III XC3S200 using all three approaches, and the results are given in Fig. 4. The programmable clock and clock skew blocks are implemented as macro blocks using the Xilinx ISE 8.1i Project Navigator.
For tuning the wave-pipelined circuit, the MicroBlaze soft-core processor is used. Xilinx Embedded Development Kit (EDK) software is used to integrate the custom block with the MicroBlaze processor [16]. The rest of the steps are similar to those used for the Altera SOC kit [17]. For all three schemes, the number of logic elements (LEs), the number of registers, the maximum operating frequency and the power dissipated are computed, and the results are given in Fig. 4. From Fig. 4, it may be concluded that for the lifting scheme, the method using hybrid WP-P BW-KCM is faster than the non-pipelined BW-KCM by a factor of 1.07. The scheme with BW-PKCM is in turn faster than the hybrid WP-P BW-KCM by a factor of 1.56, and this is achieved with an increase in the number of registers by a factor of 3.157 and an increase in the number of LEs by a factor of 1.54 compared to the hybrid WP-P unit.

Fig. 3 Self tuned wave-pipelined circuit

Table 1 Implementation Results of 2D DWT

Scheme             Area    Frequency (MHz)
Pipelining         6896    346.8
Non-Pipelining     3712    299.48
Wave Pipelining    3712    321.93

Fig. 4 Implementation results on Spartan-III XC3S200
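The speed and area factors quoted for the ASIC implementation follow directly from the entries of Table 1. The short sketch below recomputes them; the dictionary keys and function names are hypothetical, and the area unit is left unspecified because the table does not state it.

```python
# (area, maximum frequency in MHz) taken from Table 1.
TABLE_1 = {
    "pipelining": (6896, 346.8),
    "non_pipelining": (3712, 299.48),
    "wave_pipelining": (3712, 321.93),
}


def speedup(scheme, baseline):
    # Ratio of maximum operating frequencies.
    return TABLE_1[scheme][1] / TABLE_1[baseline][1]


def area_ratio(scheme, baseline):
    # Ratio of reported areas.
    return TABLE_1[scheme][0] / TABLE_1[baseline][0]


print(round(speedup("wave_pipelining", "non_pipelining"), 2))
print(round(speedup("pipelining", "wave_pipelining"), 2))
print(round(area_ratio("pipelining", "wave_pipelining"), 3))
```

The two frequency ratios round to the 1.07 and 1.08 quoted in the text. The area ratio evaluates to slightly under 1.86, which matches the quoted 1.85 only when the value is truncated rather than rounded.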

Implementation of 1 level 2D DWT using ASIC

The 2D DWT scheme is implemented on ASIC using the lifting blocks with 9/7 biorthogonal filters and BW-PKCM multipliers. The 2D DWT is implemented in ASIC using 180 nm technology. Verilog HDL is used to describe the functionality of the circuit, and the functionality is then verified using the ModelSim simulation tool. Leo Spectrum is used for synthesizing the circuit. The 2D DWT is implemented on ASIC for the first time. In future, the work can be extended to a comparison with hybrid WP schemes.

Conclusion

In this paper, the implementation of one-level 2D DWT on ASIC shows that the wave-pipelined circuit is faster than the non-pipelined circuit. The pipelined circuit is faster than the wave-pipelined circuit, but this is achieved at the cost of increased area. The hybrid WP-P BW-KCM scheme is also verified and is found to be register efficient.

References

1. Draper B A, Beveridge J R, Bohm A P W, Ross C & Chawathe M, Accelerated image processing on FPGAs, IEEE Trans. Image Process, 12 (12) (2003) 1543-1551.
2. Ritter J & Molitor P, A pipelined architecture for partitioned DWT based lossy image compression using FPGA's, In: Proc ACM Conf FPGA, (2001) 201-206.
3. Lakshminarayanan G, Venkataramani B, Kumar J S, Yousuf A K & Sriram G, Design and FPGA implementation of image block encoders with 2D-DWT, In: Proc IEEE Conf on Con Tech for Asia-Pacific Region (TENCON 03), 3 (2003) 1015-1019.
4. Daubechies I & Sweldens W, Factoring Wavelet Transforms into Lifting Steps, J Fourier Anal & Appl, 4 (1998) 247-269.
5. Parhi K K, VLSI Signal Processing Systems, John Wiley & Sons, New York, NY, USA, (1999).
6. Nyathi J & Delgado-Frias J D, A Hybrid wave-pipelined network router, IEEE Trans. Circuits Syst. I, Fundam. Theory & Appl, 49 (12) (2002) 1764-1772.
7. Hauck O, Katoch A & Huss S A, VLSI system design using asynchronous wave pipelines: a 0.35 μm CMOS 1.5 GHz elliptic curve public key cryptosystem chip, Proc. of Sixth Intl. Symp on Adva Res in Async Cir & Sys, (2000) 188-197.
8. Burleson W B, Ciesielski M, Klass F & Liu W, Wave-pipelining: a tutorial and research survey, IEEE Trans. Very Large Scale Integr. (VLSI) Syst, 6 (3) (1998) 464-474.
9. Gray C, Liu W & Cavin R, Wave-pipelining: Theory and Implementation, Kluwer Academic Publishers, (1993).
10. Tuttlebee W, Software defined radio, John Wiley & Sons Ltd, USA, (2004).
11. Lakshminarayanan G & Venkataramani B, Optimization techniques for FPGA based wave-pipelined DSP blocks, IEEE Trans. Very Large Scale Integr. (VLSI) Syst, 13 (7) (2005) 783-793.