VLSI Implementation of Parallel CRC Using Pipelining, Unfolding and Retiming

Similar documents
On the Design of High Speed Parallel CRC Circuits using DSP Algorithams

DESIGN AND IMPLEMENTATION OF HIGH SPEED 64-BIT PARALLEL CRC GENERATION

A Novel Approach for Parallel CRC generation for high speed application

2 Asst Prof, Kottam College of Engineering, Chinnatekur, Kurnool, AP-INDIA,

AN EFFICIENT DESIGN OF VLSI ARCHITECTURE FOR FAULT DETECTION USING ORTHOGONAL LATIN SQUARES (OLS) CODES

64-bit parallel CRC Generation for High Speed Applications

Design of Convolutional Codes for varying Constraint Lengths

Performance Evaluation & Design Methodologies for Automated CRC Checking for 32 bit address Using HDLC Block

HDL IMPLEMENTATION OF SRAM BASED ERROR CORRECTION AND DETECTION USING ORTHOGONAL LATIN SQUARE CODES

Fault Tolerant Parallel Filters Based on ECC Codes

VLSI Design and Implementation of High Speed and High Throughput DADDA Multiplier

High Speed Radix 8 CORDIC Processor

HIGH-PERFORMANCE RECONFIGURABLE FIR FILTER USING PIPELINE TECHNIQUE

Low Complexity Architecture for Max* Operator of Log-MAP Turbo Decoder

VLSI Implementation of Low Power Area Efficient FIR Digital Filter Structures Shaila Khan 1 Uma Sharma 2

Critical-Path Realization and Implementation of the LMS Adaptive Algorithm Using Verilog-HDL and Cadence-Tool

Error Correction and Detection using Cyclic Redundancy Check

Pipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications

A Ripple Carry Adder based Low Power Architecture of LMS Adaptive Filter

Design and Implementation of VLSI 8 Bit Systolic Array Multiplier

Enhanced Detection of Double Adjacent Errors in Hamming Codes through Selective Bit Placement

Effective Implementation of LDPC for Memory Applications

ISSN Vol.05,Issue.09, September-2017, Pages:

Efficient Majority Logic Fault Detector/Corrector Using Euclidean Geometry Low Density Parity Check (EG-LDPC) Codes

COPY RIGHT. To Secure Your Paper As Per UGC Guidelines We Are Providing A Electronic Bar Code

Error Detecting and Correcting Code Using Orthogonal Latin Square Using Verilog HDL

Implementing CRCCs. Introduction. in Altera Devices

Design and Implementation of Hamming Code on FPGA using Verilog

A High Performance CRC Checker for Ethernet Application

An Efficient Carry Select Adder with Less Delay and Reduced Area Application

K.V.GANESH*,D.SRI HARI**,M.HEMA*** *(Department of ECE,JNTUK,KAKINADA) **(Department of ECE,JNTUA,Anantapur) * **(Department of ECE,JNTUA,Anantapur)

Design of Flash Controller for Single Level Cell NAND Flash Memory

Power Optimized Programmable Truncated Multiplier and Accumulator Using Reversible Adder

Implementation of low power wireless sensor node with fault tolerance mechanism

ARITHMETIC operations based on residue number systems

AN EFFICIENT REVERSE CONVERTER DESIGN VIA PARALLEL PREFIX ADDER

A High-Speed FPGA Implementation of an RSD-Based ECC Processor

Reduced Latency Majority Logic Decoding for Error Detection and Correction

FPGA Implementation of Double Error Correction Orthogonal Latin Squares Codes

A Modified CORDIC Processor for Specific Angle Rotation based Applications

Implementation of Fault Tolerant Parallel Filters Using ECC Technique

An Efficient VLSI Execution of Data Transmission Error Detection and Correction Based Bloom Filter

A New Architecture of High Performance WG Stream Cipher

FPGA Implementation of Multiplierless 2D DWT Architecture for Image Compression

FPGA Based Low Area Motion Estimation with BISCD Architecture

An Efficient Constant Multiplier Architecture Based On Vertical- Horizontal Binary Common Sub-Expression Elimination Algorithm

Fixed Point LMS Adaptive Filter with Low Adaptation Delay

IMPLEMENTATION OF AN ADAPTIVE FIR FILTER USING HIGH SPEED DISTRIBUTED ARITHMETIC

Design of a Multiplier Architecture Based on LUT and VHBCSE Algorithm For FIR Filter

CRC Generation for Protocol Processing

Implementation of Asynchronous Topology using SAPTL

Design of Convolution Encoder and Reconfigurable Viterbi Decoder

DESIGN AND IMPLEMENTATION OF SDR SDRAM CONTROLLER IN VHDL. Shruti Hathwalia* 1, Meenakshi Yadav 2

Efficient VLSI Huffman encoder implementation and its application in high rate serial data encoding

FIR Filter Architecture for Fixed and Reconfigurable Applications

Three-D DWT of Efficient Architecture

Majority Logic Decoding Of Euclidean Geometry Low Density Parity Check (EG-LDPC) Codes

the main limitations of the work is that wiring increases with 1. INTRODUCTION

DESIGN OF FAULT SECURE ENCODER FOR MEMORY APPLICATIONS IN SOC TECHNOLOGY

Volume 5, Issue 5 OCT 2016

DETECTION AND CORRECTION OF CELL UPSETS USING MODIFIED DECIMAL MATRIX

Design and Implementation of CVNS Based Low Power 64-Bit Adder

Performance Analysis of CORDIC Architectures Targeted by FPGA Devices

A Normal I/O Order Radix-2 FFT Architecture to Process Twin Data Streams for MIMO

Low-Power Adaptive Viterbi Decoder for TCM Using T-Algorithm

FOLDED ARCHITECTURE FOR NON CANONICAL LEAST MEAN SQUARE ADAPTIVE DIGITAL FILTER USED IN ECHO CANCELLATION

Reliability of Memory Storage System Using Decimal Matrix Code and Meta-Cure

Design of a New Cryptography Algorithm using Reseeding-Mixing Pseudo Random Number Generator

DESIGN AND IMPLEMENTATION OF VLSI SYSTOLIC ARRAY MULTIPLIER FOR DSP APPLICATIONS

FPGA Implementation of ALU Based Address Generation for Memory

Resource Efficient Multi Ported Sram Based Ternary Content Addressable Memory

International Journal for Research in Applied Science & Engineering Technology (IJRASET) IIR filter design using CSA for DSP applications

Memory, Area and Power Optimization of Digital Circuits

Implementation of A Optimized Systolic Array Architecture for FSBMA using FPGA for Real-time Applications

Adaptive FIR Filter Using Distributed Airthmetic for Area Efficient Design

16 BIT IMPLEMENTATION OF ASYNCHRONOUS TWOS COMPLEMENT ARRAY MULTIPLIER USING MODIFIED BAUGH-WOOLEY ALGORITHM AND ARCHITECTURE.

Area And Power Optimized One-Dimensional Median Filter

OPTIMIZATION OF FIR FILTER USING MULTIPLE CONSTANT MULTIPLICATION

Area Efficient Z-TCAM for Network Applications

Modified Welch Power Spectral Density Computation with Fast Fourier Transform

FPGA Implementation of High Speed AES Algorithm for Improving The System Computing Speed

Hardware Implementation of Single Bit Error Correction and Double Bit Error Detection through Selective Bit Placement for Memory

IMPLEMENTATION OF DISTRIBUTED CANNY EDGE DETECTOR ON FPGA

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator

DUE to the high computational complexity and real-time

Implementation of Convolution Encoder and Viterbi Decoder Using Verilog

ISSN Vol.08,Issue.12, September-2016, Pages:

Power and Area Efficient Implementation for Parallel FIR Filters Using FFAs and DA

Novel Implementation of Low Power Test Patterns for In Situ Test

Design of Majority Logic Decoder for Error Detection and Correction in Memories

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VI /Issue 3 / JUNE 2016

DESIGN OF HYBRID PARALLEL PREFIX ADDERS

Reconfigurable PLL for Digital System

FPGA Implementation of a High Speed Multistage Pipelined Adder Based CORDIC Structure for Large Operand Word Lengths

Fault Tolerant Parallel Filters Based On Bch Codes

of Soft Core Processor Clock Synchronization DDR Controller and SDRAM by Using RISC Architecture

A High Speed Design of 32 Bit Multiplier Using Modified CSLA

POWER REDUCTION IN CONTENT ADDRESSABLE MEMORY

Verilog for High Performance

Area Delay Power Efficient Carry Select Adder

Transcription:

IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 2, Issue 5 (May. Jun. 203), PP 66-72 e-issn: 239 4200, p-issn No. : 239 497 VLSI Implementation of Parallel CRC Using Pipelining, Unfolding and Retiming Sangeeta Singh, S. Sujana 2, I. Babu 3, K. Latha 4 2 (Associate Professor, epartment of ECE, Vardhaman College of Engineering, Hyderabad, A.P, India) 3 4 (Assistant Professor, epartment of ECE, Vardhaman College of Engineering, Hyderabad, A.P, India) Abstract : Error detection is important whenever there is a non-zero chance of data getting corrupted. A Cyclic Redundancy Check (CRC) is the remainder, or residue, of binary division of a potentially long message, by a CRC polynomial. This technique is ubiquitously employed in communication and storage applications due to its effectiveness at detecting errors and malicious tampering. The hardware implementation of a bit-wise CRC is a simple linear feedback shift register. Such a circuit is very simple and can run at very high clock speeds, but it requires the stream to be bit-serial. This means that n clock cycles will be required to calculate the CRC values for an n-bit data stream. This latency is intolerable in many high speed data networking applications where data frames need to be processed at high speed and hence implementation of CRC generation and checking on a parallel stream of data becomes desirable. This paper presents implementation of parallel Cyclic Redundancy Check (CRC) based upon SP algorithms of pipelining, retiming and unfolding. The architectures are first pipelined to reduce the iteration bound by using novel look-ahead techniques and then unfolded and retimed to design high speed parallel circuits. This paper presents the comparison between the parallel implementation of CRC-9 and its serial implementation. It also shows that parallel implementation uses less number of clock cycles than the serial implementation of CRC-9 thereby increasing the speed of the architecture. This paper is implemented using Verilog hardware description language, simulated using Xilinx ISE tools and synthesized using Cadence tools. Keywords - Cyclic Redundancy Check (CRC), Pipelining, Retiming, Unfolding I. Introduction Error correction codes provides a mean to detect and correct errors introduced by the transmission channel. Two main categories of codes exist: block codes and convolution codes. They both introduce redundancy by adding parity symbols to the message data. Cyclic redundancy check (CRC) codes are the subset of the cyclic codes that are also a subset of linear block codes. Cyclic Redundancy Check (CRC) is widely used to detect errors in data communication and storage devices. CRC is a very powerful and easily implemented technique to obtain data reliability. The CRC technique is used to verify the integrity of blocks of data called Frames. In this technique, the transmitter appends an extra n bit sequence to every frame called Frame Check Sequence (FCS). FCS holds redundant information about the frame that helps the receiver detect errors in the frame. When the transmission is received or the stored data is retrieved, the CRC residue is regenerated and confirmed against the appended residue. For high-speed data transmission, the general serial implementation cannot meet the speed requirement. Parallel processing is a very efficient way to increase the throughput rate. Although parallel processing increases the number of message bits that can be processed in one clock cycle, it can also lead to a long critical path (CP). Thus, the increase of throughput rate that is achieved by parallel processing will be reduced by the decrease of circuit speed. Another issue is the increase of hardware cost caused by parallel processing, which needs to be controlled. The parallel CRC algorithm in [] processes an m -bit message in (mk)/l clock cycles, where k is the order of the generator polynomial and L is the level of parallelism. However, in [2], m message bits can be processed in m/l clock cycles. High speed architectures for parallel long encoders are based on the multiplication and division computations on generator polynomial are efficient in terms of speeding up the parallel linear feedback shift register (LFSR) structures. The proposed design achieves shorter critical path for parallel CRC circuits leading to high processing speed than commonly used generator polynomial. The proposed design starts from LFSR, which is generally used for serial CRC. An unfolding algorithm [3] is used to realize parallel processing. However, direct application of unfolding may lead to a parallel CRC circuit with long iteration bound, which is the lowest achievable CP[]. Two novel look-ahead pipelining methods are developed to reduce the iteration bound of the original serial LFSR CRC structures; then, the unfolding algorithm is applied to obtain a parallel CRC structure with low iteration bound. The retiming algorithm is then applied to obtain the achievable lowest CP. 66 Page

II. Pipelining, Unfolding And Retiming The implementation of CRC check generation circuit can be done with the use of linear feedback circuit. The CRC architecture for generator polynomial G(y)=yy 8 y 9 is shown in Fig.. x x2 x 3 x 4 x5 x6 x7 x 8 x 9 Figure : Serial CRC 2. Pipelining: It reduces the effective critical path by introducing pipelining latches along the critical data path either to increase the clock frequency or sample speed or to reduce power consumption at the same speed. It is done using a look-ahead pipelining algorithm to reduce the iteration bound. Iteration bound is defined as the maximum of all the loop bounds. Loop bound is defined as t/w, where t is the computation time of the loop and w is the no. of delay elements in the loop. The iteration bound for the circuit shown in Fig. is 2T XOR. The largest iteration bound of a general serial CRC architecture is also 2T XOR. The critical loop is described by a ( n ) = a( n) y( n) b( n ) ----------- () b(n2)... b(n) a(n) a(n)... Figure 2: Pipelining to reduce iteration bound The largest iteration bound of a general serial architecture is also 2T XOR. For example, the serial architectures of commonly used generator polynomials CRC-6 and CRC-2 have the iteration bound of 2T XOR because they have terms y 5 y 6 and y y 2 in their generator polynomials respectively. In proposed look-ahead pipelining, 2-level pipelining is given by a ( n 2) = a( n ) y( n ) b( n 2) a ( n 2) = a( n) y( n) b( n ) b( n 2) y( n ) -----(2) b(n2) a(n2) a(n) b(n)... Figure 3: First level pipelining y(n) x... x6 x7 x8 x0 data Figure 4: Two level pipelining Fig.4 shows that the loop bound for the circuit in Fig.2 has been reduced from 2Txor to T XOR at the cost of two XOR gates and two flip flops. Also the loop bounds of loop and loop2 are T XOR and (5/8) T XOR respectively. So, the iteration bound of the two level pipelined CRC architecture is T XOR. For improved look ahead pipelining consider the polynomial G(y)= yy 7 y 9 x 0 x x 6 x 7 x 8 Figure 5: LFSR for G(y) The loop equation can be given as: a ( n 2) = a( n) y( n) b( n 2) ------- (3) 67 Page

Four level pipelined structure can be obtained with application of improved look ahead pipelining technique. a( n 4) = a( n 2) y( n 2) b( n 4) a(n4) ------- (4) a ( n 4) = a( n) y( n) b( n 2) y( n 2) b( n 4) ---- (5)... b(n4)... a(n4) a(n) 4 b(n2) y(n2) Figure 6: Loop bound of T XOR /2 x0 x x4 x5 x6 y(n2) x7 x8 Figure 7: Four level pipelined 2.2 Retiming: Retiming is used to change the locations of delay elements in a circuit without affecting the input/output characteristics of the circuit. It reduces the critical path of the system by not altering the latency of the system. Retiming has many applications in synchronous circuit design. These applications include reducing the clock period of the circuit, reducing the number of registers in the circuit, decreasing the power consumption of the circuit and logic synthesis. It can be used to increase the clock rate of a circuit by reducing the computation time of the critical path. Critical path is the path with the longest computation time among all paths that contain zero delays, and its computation time is the lower bound on the clock period of the circuit. The two factors affecting the frequency of operation is critical path and iteration bound. Retiming is done by applying the algorithm presented in [3] and using Floyd Warshall algorithm. 2.3 Unfolding: It s a transformation technique that can be applied to SP program to create a new program describing more than one iteration of the original program. Unfolding a SP program by an unfolding factor J creates a new program that describes J consecutive iterations of the original program. It increases the sampling rate by replicating hardware so that several inputs can be processed in parallel and several outputs can be produced at the same time. The lower bound on the iteration period of a recursive SP program is termed as iteration bound. An implementation of the SP program can never achieve an iteration period less than the iteration bound, even when infinite processors are made available. In some cases, the SP program cannot be implemented with the iteration bound equal to the iteration bound without the use of unfolding. In general, for a given FG, when unfolding algorithm is applied with unfolding factor J, the iteration bound of the resultant FG is J times that of the original FG. III. IMPLEMENTATION CRC-9 architectures are first pipelined to reduce the iteration bound by using novel look-ahead pipelining methods and then retimed and unfolding to design high speed parallel circuits. The proposed design starts from LFSR, which is generally used for serial CRC-9. G(y)=yy 8 y 9 x x2 x 3 x 4 x5 x6 x7 x 8 x 9 Figure 8: Serial CRC-9 The serial CRC-9 is input is having the bit data and the correct output is obtained after 9 clock cycles. CRC-9 architectures are first pipelined to reduce the iteration bound. x x2 x3 x4 x5 x6 x7 x8 x9 Loop 2 3 Figure 9: LFSR of G(y)=yy 8 y 9 68 Page

Loop bound for this st = t/w = T XOR (t=t XOR ; w=) 2nd =(2/7)T XOR. 3rd =2 T XOR. Retiming algorithm is then applied to the pipelined architecture and based on the algorithm data flow graph for the pipelined circuit of Fig.0 is drawn. Here the critical path obtained is of magnitude 2. 6 data 2 3 4 7 5 data Figure 0: FG for pipelined block 5 3 2 3 4 5 4 Figure : Retimed Graph x x2 x3 x4 x5 Y(n) x9 x8 x7 x6 Figure 2: LFSR for retimed CRC The following are the steps to be applied for unfolding after application of retiming algorithm. For each node U in the original FG, draw J node U 0,U,.U J-. For each edge U V with W delays in the original FG, draw the J edges U i V (iw)%j with (floor(iw)/j) delays for i=0, J-. Where i is index of the node, w is weight of the edges, J is unfolding factor. Fig.3 and Fig.4 show the FG and LFSR representation for 2-level unfolding. The correct output is obtained after 5 clock cycles. Application of the unfolding algorithm increases the speed of the architecture. i=0 2 3 4 5 6 7 8 9 0 i= Y(k) Y(2k) 3 Figure 3: ata flow graph for 2-level unfolding x0 x2 Y(k) x7 x5 x x3 Y(2k) x8 x6 Figure 4: LFSR representation for 2-level unfolding x4 69 Page

IV. RESULTS This paper shows the comparison between the serial and parallel implementation of CRC circuits using pipelining, retiming and unfolding techniques. The generator polynomial used is given by G(Y)=yy 8 y 9. The input message sequence polynomial is: 000000. The design is implemented in verilog hardware description language and simulated using Xilinx on Spartan-3 and synthesized using cadence tools. Fig.5 shows the results obtained for the serial implementation of CRC. The correct output is obtained after 9 clock cycles. Table shows the timing report for serial CRC. After application of the pipelining, retiming and then unfolding algorithm to the same design, the output is generated after 5 clock cycles which is shown in figure 9. Hence it reduces the clock cycles and thus increases the speed of the circuit. Fig.6 to Fig.8 depict the respective RTL schematics. Figure 5: Serial CRC Figure 6: Serial CRC- RTL schematic Table : Timing report for serial CRC Timing Summary Speed Grade: -5 Minimum period:.962ns (Maximum Frequency: 509.697MHz) Minimum input arrival 3.567ns time before clock: Maximum output 4.82ns required time after clock: Figure 7: RTL schematic of pipelined CRC Figure 8: RTL schematic of retimed CRC Figure 9: CRC output based on pipelining, retiming and unfolding 70 Page

The following are the results obtained for CRC implementation using generator polynomial G(Y)=yy 8 y 9 and input message sequence 000000. The required number of clock cycles, iteration bound and critical path is noted down for different architectures. These results are depicted in Table 2 to 4. Table 2: No. of clock cycles No. of clock Architecture cycles Original Architecture 9 2-level pipelined 0 4-level pipelined 2 Retiming after 2level 0 pipelining Unfolding the 2-level 5 pipelined Retiming the unfolded architecture Table 3: Iteration Bound Architecture Iteration Bound 6 Original architecture 2-level pipelined 4-level pipelined & retiming 2 T XOR T XOR (7/8)T XOR Table 4: Critical Path Architecture Critical Path 2-level pipelined Retiming the 2-level pipelined 2 T XOR T XOR V. CONCLUSION This paper implements pipelining method for high-speed parallel CRC circuits. Pipelining has decreased the iteration bound of the architecture effectively. Applying unfolding technique to pipelined architecture increased the throughput of the circuit and thereby applying retiming to the architecture reduced the critical path delay. So applying pipelining, unfolding and retiming to the CRC has increased the throughput to achieve high speed design. This brief has proposed two pipelining methods for high speed parallel CRC hardware implementation. Proposed look ahead pipelining has a simpler structure, parallel CRC design can efficiently reduce the critical path. Although the proposed design is not efficiently applicable for the LFSR architecture of any generator polynomial, it is very efficient for the generator polynomials with many zero coefficients between the second and third highest order nonzero coefficients, as shown in the commonly used generator polynomials. The serial implementation of CRC-9 uses 9 clock cycles, iteration bound is 2T XOR and critical path is 2T XOR. Whereas the parallel implementation of CRC-9 using pipelining, retiming and unfolding uses only 5 clock cycles, its iteration bound is T XOR and critical path is T XOR. Thus increasing the speed of the architecture. This can be extended to achieve the high speed CRC circuit without increasing the hardware resources such that area occupied by the design is minimized. Parallel CRC architecture implemented here is having high speed compared to previous algorithms and also hardware cost is controlled. REFERENCES [] G. Campobello, G. Patane, and M. Russo, Parallel CRC realization, IEEE Trans. Comput., vol. 52, no. 0, pp. 32 39, Oct. 2003. [2] T.B. Pei and C. Zukowski, High-speed parallel CRC circuits in VLSI, IEEE Trans. Commun., vol. 40, no. 4, pp. 653 657, Apr. 992. [3] K. K. Parhi, VLSI igital Signal Processing Systems: esign and Implementation. Hoboken, NJ: Wiley, 999. [4] T. V. Ramabadran and S.S. Gaitonde, A tutorial on CRC computations, IEEE Micro, Vol.8 no.4, pp. 62-75, Aug.988. [5] X. Zhang and K. K. Parhi, High-speed architectures for parallel long BCH encoders, in Proc. ACM Great Lakes Symp. VLSI, Boston, MA, Apr. 2004, pp. 6. [6] K. K. Parhi, Eliminating the fanout bottleneck in parallel long BCH encoders, IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 5, no. 3, pp. 52-56, Mar. 2004. 7 Page

[7] W. Peterson and. Brown, Cyclic codes for error detection, Proc. IRE, vol. 49, no., pp. 228 235, 96. [8] G. Albertengo and R. Sisto, Parallel CRC generation, IEEE Micro, vol. 0, no. 5, pp. 63 7, Oct. 990. [9] M.. Shieh, M.-H. Sheu, C.-H. Chen, and H.- F. Lo, A systematic approach for parallel CRC computations, J. Inf. Sci. Eng., vol. 7, no. 3, pp. 445 46, May 200. [0] A. Tanenbaum, Computer Networks, 4 th ed. Englewood Cliffs, NJ-Prentice Hall, 2003. [] M. Braun, J. Friedrich, T. Grün, and J. Lembert, Parallel CRC computation in FPGAs, in Proc. 6th Int. Workshop Field- Program. Logic, Smart Appl., New Paradigms Compilers (FPL), London, U.K., 996, pp. 56 65, Springer-Verlag. [2] M. Sprachmann, Automatic generation of parallel CRC circuits, IEEE es. Test Comput., vol. 8, no. 3, pp. 08 4, May 200. [3] Martin Grymel and Steve B. Furber, A novel programmable parallel CRC circuit, IEEE Trans. VLSI Systems,. 9, vol. no. 0, October 20. [4] S. Lin and. J. Costello, Error Control Coding. Englewood Cliffs, NJ: Prentice-Hall, 983. [5] R.Lee, Cyclic Codes Redundancy, igital esign, July 977. [6]. Feldmeier, Fast Software Implementation of Error etection Codes, IEEE Trans. Networking, ec. 995. [7] M.C. Nielson, Method for High Speed CRC Computation, IBM Technical isclosure Bull., vol. 27,no. 6, pp. 3572-3576, Nov. 984. [8] T. C. enk & K. K. Parhi, Exhaustive Scheduling and Retiming of igital Signal Processing Systems, IEEE Trans. On Circuits and Systems-II: Analog and SP, vol. 45, no.7, pp. 82-838, July 998. 72 Page