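Binary Adders
Ripple-Carry Adder
[Figure: n-bit ripple-carry adder. A chain of full adders (FA) runs from the LSB position (inputs $x_0$, $y_0$, $c_0$; output $s_0$) to the MSB position (inputs $x_{n-1}$, $y_{n-1}$; outputs $s_{n-1}$, $c_n$); the carries $c_1, c_2, \dots, c_{n-1}$ link neighboring stages.]
Longest delay (critical-path delay):
$d_{c(n)} = n \cdot d_{carry} = 2n$ gate delays
$d_{s(n-1)} = (n-1) \cdot d_{carry} + d_{sum} = 2n$ gate delays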
Full adder
(a) Truth table: inputs $c_i$, $x_i$, $y_i$; outputs $c_{i+1}$, $s_i$
(b) Karnaugh maps:
$s_i = x_i \oplus y_i \oplus c_i$
$c_{i+1} = x_i y_i + x_i c_i + y_i c_i$
(c) Circuit
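The full-adder equations and the ripple-carry chain can be sketched as a small bit-level model. This is an illustrative Python sketch, not a hardware description; the function names `full_adder` and `ripple_carry_add` are hypothetical and do not come from the slides.

```python
def full_adder(x, y, c):
    """One-bit full adder: returns (sum bit, carry-out)."""
    s = x ^ y ^ c                          # s_i = x_i XOR y_i XOR c_i
    c_out = (x & y) | (x & c) | (y & c)    # c_{i+1} = x_i y_i + x_i c_i + y_i c_i
    return s, c_out


def ripple_carry_add(x, y, n, c0=0):
    """Add two n-bit unsigned integers bit by bit, starting at the LSB."""
    carry = c0
    total = 0
    for i in range(n):
        # The carry-out of stage i is the carry-in of stage i + 1.
        s, carry = full_adder((x >> i) & 1, (y >> i) & 1, carry)
        total |= s << i
    return total, carry  # n-bit sum and carry-out c_n
```

For example, `ripple_carry_add(14, 11, 4)` yields the 4-bit sum 9 with carry-out 1, i.e. 25. The loop mirrors the critical path: each stage has to wait for the carry of the stage before it.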
Implement an adder with small look-up tables (LUTs)
Fast-carry logic in FPGAs
The fast-carry logic is about an order of magnitude faster than a regular logic LUT.
[Figure: a chain of fast-carry logic (FCL) blocks with inputs $x_i$, $y_i$ propagates the carries from $c_0$ to $c_n$; XOR gates form the sum bits $s_0, s_1, \dots, s_{n-1}$.]
Instruction Pipelining
Instruction pipeline for a RISC:
- IF: Instruction fetch
- ID: Instruction decode and register fetch
- EX: Execution and address calculation
- MEM: Memory access
- WB: Result write back
Total latency: total delay time from instruction fetch to result write back to a register.
Throughput (maximum frequency, registered performance): number of results (instructions) per second.
Instruction Pipelining (continued)

Instruction number   Clock number
                     1    2    3    4    5    6    7    8    9    10
Instruction i        IF   ID   EX   MEM  WB
Instruction i+1           IF   ID   EX   MEM  WB
Instruction i+2                IF   ID   EX   MEM  WB
Instruction i+3                     IF   ID   EX   MEM  WB
Instruction i+4                          IF   ID   EX   MEM  WB
Instruction i+5                               IF   ID   EX   MEM  WB
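The staggered schedule in the table can be generated with a few lines of Python. This sketch assumes an ideal pipeline with no stalls or hazards, which the slides do not discuss; `pipeline_schedule` and `total_clocks` are hypothetical helper names.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]


def pipeline_schedule(num_instructions, stages=STAGES):
    """Return {instruction index: {clock number: stage}} for an ideal pipeline."""
    schedule = {}
    for i in range(num_instructions):
        # Instruction i enters the first stage at clock i + 1 (clocks start at 1).
        schedule[i] = {i + 1 + s: name for s, name in enumerate(stages)}
    return schedule


def total_clocks(num_instructions, stages=STAGES):
    """Clocks needed until the last instruction leaves the last stage."""
    return len(stages) + num_instructions - 1
```

With six instructions, `total_clocks(6)` gives 10 clocks. The latency of each instruction stays at five clocks, while one result completes per clock once the pipeline is full.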
Arithmetic Pipelining
Pipelined adders
- The pipelining principle can be applied to FPGA designs at little or no additional cost, since each logic element contains a flip-flop that is otherwise unused.
- An arithmetic operation is broken into small primitive operations.
- The result of each primitive operation is saved in registers after each pipeline stage.
- If part of the data is not processed at a pipeline stage, that part must still be saved in registers after the stage.
Pipelined adder
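A two-stage pipelined adder can be modeled clock by clock to show why unprocessed data must also be registered. The sketch below is an assumption about one possible partitioning (lower half in stage 1, upper half in stage 2); the slide's figure may split the adder differently, and `pipelined_add` is a hypothetical name.

```python
def pipelined_add(pairs, half_bits=8):
    """Two-stage pipelined adder for operands of 2 * half_bits bits.

    Stage 1 adds the lower halves and registers the untouched upper halves.
    Stage 2 adds the registered upper halves plus the registered carry.
    Returns one output per clock; the first output is None (pipeline filling).
    """
    mask = (1 << half_bits) - 1
    reg = None                            # pipeline register between the stages
    outputs = []
    for item in list(pairs) + [None]:     # one extra clock flushes the pipeline
        # Stage 2 works on what stage 1 registered in the previous clock.
        if reg is None:
            outputs.append(None)
        else:
            low_sum, carry, x_hi, y_hi = reg
            high_sum = x_hi + y_hi + carry
            outputs.append((high_sum << half_bits) | low_sum)
        # Stage 1 processes the new operands, if any.
        if item is None:
            reg = None
        else:
            x, y = item
            low = (x & mask) + (y & mask)
            # The upper halves are not processed yet but must still be saved.
            reg = (low & mask, low >> half_bits, x >> half_bits, y >> half_bits)
    return outputs
```

Feeding `[(300, 500), (65535, 1)]` produces `[None, 800, 65536]`: each sum appears one clock after its operands, and a new pair can enter on every clock.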
Modulo Adders
- Modulo adders are a building block of RNS-DSP (residue number system) designs.
- The modulo operation is performed by (a) an extra adder or (b) a ROM look-up table.
- Compute $(x + y) \bmod M$, where b is the number of bits of the inputs and the output.
- The inputs x and y are modulo-M numbers.
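Both variants can be sketched in Python. The way the ROM is addressed here (by the (b+1)-bit sum x + y) is an assumption, since the slide only names the two approaches; all function names are hypothetical.

```python
def mod_add_adder(x, y, m):
    """Variant (a): an extra adder computes x + y - M and a multiplexer selects."""
    s = x + y            # first adder
    t = s - m            # extra adder
    return t if t >= 0 else s


def build_mod_rom(m, b):
    """Variant (b): ROM holding (address mod M) for every (b+1)-bit address."""
    return [addr % m for addr in range(1 << (b + 1))]


def mod_add_rom(x, y, rom):
    """Look up the reduced result, using the sum x + y as the ROM address."""
    return rom[x + y]
```

Variant (a) relies on x and y already being reduced modulo M, so that a single subtraction of M is enough.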
Multiplication and Division by $2^k$
- Multiplication by $2^k$ can be implemented by shifting the bits of the operand to the left by k positions.
- Division by $2^k$ can be implemented by shifting the bits of the operand to the right by k positions.
- For signed numbers, the sign must be preserved: shift the bits to the right and fill from the left with the value of the sign bit.
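The sign-preserving right shift can be illustrated on raw bit patterns. This is a minimal sketch that assumes a fixed word width and 0 <= k <= width; `shift_left` and `arithmetic_shift_right` are hypothetical names.

```python
def shift_left(value, k):
    """Multiply by 2**k."""
    return value << k


def arithmetic_shift_right(bits, k, width):
    """Shift a width-bit two's complement pattern right by k, replicating the sign bit."""
    sign = (bits >> (width - 1)) & 1
    shifted = bits >> k
    if sign:
        # Fill the k vacated positions on the left with ones.
        fill = ((1 << k) - 1) << (width - k)
        shifted |= fill
    return shifted
```

For the 8-bit pattern `0b11101100` (-20), shifting right by 2 gives `0b11111011` (-5). Note that for odd negative values the result rounds toward minus infinity: -7 shifted by one position gives -4.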
Binary Multipliers
Multiplication of unsigned numbers by hand:

Multiplicand M   (14)        1110
Multiplier Q     (11)      x 1011
                             1110
                            1110
                           0000
                          1110
Product P       (154)    10011010
Multiplication for implementation in hardware:

Multiplicand M   (14)        1110
Multiplier Q     (11)      x 1011
Partial product 0            1110
                       +    1110
Partial product 1          101010
                       +   0000
Partial product 2         0101010
                       +  1110
Product P       (154)    10011010
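The accumulation of partial products can be written as a shift-and-add loop. This is a behavioral sketch in Python, with the hypothetical name `shift_add_multiply`; it models the arithmetic only, not a particular circuit.

```python
def shift_add_multiply(m, q, n):
    """Multiply two n-bit unsigned integers by accumulating partial products."""
    product = 0
    for i in range(n):
        if (q >> i) & 1:          # multiplier bit q_i selects M or 0
            product += m << i     # add the multiplicand shifted left by i
    return product
```

With M = 14 and Q = 11, the running value of `product` is 14, 42, 42, 154, matching the partial products of the worked example.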
Multiplier designs
- Serial/parallel multiplier: 2N-bit adder, or N-bit adder + shift registers
- Serial/serial multiplier: one carry-save adder + shift registers
- Serial/parallel multiplier using carry-save adders
- Parallel/parallel multiplier (array multiplier): array multiplier with $N^2$ full adders; fast array multiplier for FPGAs
- Multiplier blocks
Serial/parallel multiplier with a 2N-bit adder
[Figure: shift register for the multiplicand, shift register for the multiplier, AND gates, 2N-bit adder, register for the partial product.]
Serial/parallel multiplier with an N-bit adder
[Figure: register for the multiplicand, shift register for the multiplier, AND gates, N-bit adder, shift register for the partial product, product output.]
Carry-save adder
[Figure: full adder with inputs a and b, outputs sum bit and carry-out, and D flip-flops with Clock and Reset.]
Serial/parallel multiplier using carry-save adders
[Figure: multiplier bits $b_{n-1}, b_{n-2}, \dots, b_0$ and multiplicand bits $a_{n-1}, a_{n-2}, \dots, a_0$ feed a row of AND gates (&), followed by a chain of full adders (FA) with D flip-flops.]
Array multiplier with $N^2$ full adders
Fast array multiplier for FPGAs
Multiplier blocks
A 2N x 2N multiplier can be defined in terms of an N x N multiplier block:
$P = Y \cdot X = (Y_2 2^N + Y_1)(X_2 2^N + X_1) = Y_2 X_2 2^{2N} + (Y_2 X_1 + Y_1 X_2) 2^N + Y_1 X_1$
Indices 2 and 1 indicate the most significant and least significant N-bit halves, respectively.
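The decomposition can be checked with a short Python sketch. The parameter `mult_nxn` stands in for the N x N multiplier block and is a hypothetical name; by default it simply uses integer multiplication.

```python
def block_multiply(y, x, n, mult_nxn=lambda a, b: a * b):
    """2N x 2N multiplication built from four N x N multiplier blocks."""
    mask = (1 << n) - 1
    y2, y1 = y >> n, y & mask     # most / least significant halves of Y
    x2, x1 = x >> n, x & mask     # most / least significant halves of X
    return ((mult_nxn(y2, x2) << (2 * n))
            + ((mult_nxn(y2, x1) + mult_nxn(y1, x2)) << n)
            + mult_nxn(y1, x1))
```

For instance, `block_multiply(0xAB, 0xCD, 4)` uses four 4 x 4 products to form the 8 x 8 result. The multiplications by powers of two are plain shifts.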
Binary Dividers
Division
Division is the most complex of the four basic arithmetic operations.
Let N denote the numerator and D the denominator. Two results are produced, the quotient Q and the remainder R:
$N / D = Q + R / D$
In division, each quotient bit is determined in a sequential trial-and-error procedure. (In multiplication, all partial products can be produced in parallel.)
The result should be constrained so that Q does not exceed N and R does not exceed D.
For signed numbers, R and N are assumed to have the same sign.
Linear Convergence Division Algorithms
Restoring divider: a trial-and-error method translated directly from the pencil-and-paper method.
The main disadvantage of the restoring divider is that two steps, subtract and add (i.e., restore), are needed to determine one quotient bit.
Restoring divider
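The subtract-then-restore procedure can be sketched as follows. This is a behavioral Python model under the assumption of unsigned operands with d > 0 and n < 2**bits; the alignment of the denominator by `d << k` and the name `restoring_divide` are choices made for this sketch, not taken from the slides.

```python
def restoring_divide(n, d, bits):
    """Divide unsigned n by d (d > 0, n < 2**bits); returns (quotient, remainder)."""
    r = n
    q = 0
    for k in range(bits - 1, -1, -1):
        r = r - (d << k)          # step 1: trial subtraction of the aligned denominator
        if r < 0:
            r = r + (d << k)      # step 2: restore the previous remainder
            q = q * 2             # quotient bit 0
        else:
            q = q * 2 + 1         # quotient bit 1
    return q, r
```

For example, `restoring_divide(234, 50, 8)` returns `(4, 34)`. Every zero quotient bit costs both a subtraction and an addition, which is the disadvantage noted above.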
Nonperforming divider
If the denominator is larger than the remainder, we do not perform the subtraction. A temporary remainder value is tested before the remainder register is updated.
Note that the following VHDL code describes a combinational circuit.

    t := r - d;          -- Temporary remainder value
    IF t >= 0 THEN       -- Nonperforming test
      r := t;            -- Update remainder
      q := q * 2 + 1;    -- Shift left and add one
    ELSE
      q := q * 2;        -- Shift left
    END IF;
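The same update rule can be wrapped in a loop to obtain a complete behavioral model. The Python sketch below assumes unsigned operands with d > 0 and n < 2**bits, and it aligns the denominator with `d << k`, which the VHDL fragment does not show; `nonperforming_divide` is a hypothetical name.

```python
def nonperforming_divide(n, d, bits):
    """Divide unsigned n by d (d > 0, n < 2**bits); returns (quotient, remainder)."""
    r = n
    q = 0
    for k in range(bits - 1, -1, -1):
        t = r - (d << k)     # temporary remainder value
        if t >= 0:           # nonperforming test
            r = t            # update remainder
            q = q * 2 + 1    # shift left and add one
        else:
            q = q * 2        # shift left; r is left untouched
    return q, r
```

Compared with the restoring scheme, the remainder register is written only when the trial subtraction succeeds, so no restoring addition is needed.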
Comparison of nonperforming divider with restoring divider
Nonrestoring divider
- It does not increase the critical path.
- Always perform the subtraction. If the remainder is negative, perform an addition of $d_k/2$ in the next step, instead of the restoring addition of $d_k$ in the present step and the subtraction of $d_k/2$ in the next step.
- The quotient bit can be positive or negative, i.e., $q_k = \pm 1$, but not zero. This is a signed-digit representation.
- The negative ones can be saved in the quotient register as zeros.
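The signed-digit representation should be converted to two's complement representation.
For example, the digits of $q_{SD}$ are saved in the quotient register with each $-1$ stored as 0.
To convert: (word of positive ones) - (word of negative ones); alternatively, 2 * (saved word) + 1, discarding the bit shifted out of the register.
Correct the remainder if $r < 0$: $r := r + D$ and $q := q - 1$.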
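The nonrestoring scheme, including the signed-digit conversion and the final correction, can be sketched in Python. The model assumes unsigned operands with d > 0 and n < 2**bits; the names `nonrestoring_divide`, `stored`, and `negative_ones` are hypothetical.

```python
def nonrestoring_divide(n, d, bits):
    """Divide unsigned n by d (d > 0, n < 2**bits); returns (quotient, remainder)."""
    r = n
    stored = 0                       # quotient register: 1 = digit +1, 0 = digit -1
    for k in range(bits - 1, -1, -1):
        if r >= 0:
            r -= d << k              # digit q_k = +1
            stored = stored * 2 + 1
        else:
            r += d << k              # digit q_k = -1, saved as 0
            stored = stored * 2
    # Signed-digit to binary: word of positive ones minus word of negative ones.
    negative_ones = ~stored & ((1 << bits) - 1)
    q = stored - negative_ones
    # Equivalent form: shift left, add one, drop the bit shifted out.
    assert q == (2 * stored + 1) % (1 << bits)
    if r < 0:                        # final correction step
        r += d
        q -= 1
    return q, r
```

For 234 / 50 with 8 bits, the register ends up holding `0b10000010`, which converts to 5 with remainder -16; the correction step then gives quotient 4 and remainder 34.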
Fast Divider Design
Division through multiplication with the reciprocal of the denominator
- The reciprocal can be computed via a look-up table for small bit widths.
- One can use the Newton algorithm to compute the reciprocal.
$f(x) = 1/x - D$ has its zero at $x = 1/D$.
$x_{k+1} = x_k - f(x_k) / f'(x_k)$
With $f(x) = 1/x - D$, we have $f'(x) = -1/x^2$, so
$x_{k+1} = x_k (2 - D x_k)$
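The Newton iteration for the reciprocal needs only multiplications and subtractions, as this floating-point sketch shows. The starting value `x0` and the fixed iteration count are assumptions of this example; the iteration converges when 0 < x0 < 2/D.

```python
def newton_reciprocal(d, x0, iterations=5):
    """Approximate 1/d with the iteration x_{k+1} = x_k * (2 - d * x_k)."""
    x = x0
    for _ in range(iterations):
        x = x * (2 - d * x)
    return x
```

With `d = 0.8` and `x0 = 1.0`, the relative error 1 - d*x squares in every step (0.2, 0.04, 0.0016, ...), so a handful of iterations suffices. The quotient N/D then follows as N times the reciprocal.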
Division by Convergence
- Both numerator N and denominator D are multiplied by approximation factors $f_k$.
- The two multiplications can be computed in parallel.
- After a sufficient number of iterations k (quadratic convergence), $D \prod f_k \to 1$ and $N \prod f_k \to Q$.
Algorithm:
1) Normalize N and D such that D is close to 1.
2) Initialize $x_0 = N$ and $t_0 = D$.
3) Repeat the following loop until $x_k$ shows the desired precision:
$f_k = 2 - t_k$
$x_{k+1} = x_k f_k$
$t_{k+1} = t_k f_k$
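The three steps above can be sketched in floating-point Python. The stopping test on $|1 - t_k|$ is an assumption made for this sketch (the slide stops when $x_k$ shows the desired precision), and normalization is left to the caller; `convergence_divide`, `tol`, and `max_iter` are hypothetical names.

```python
def convergence_divide(n, d, tol=1e-12, max_iter=20):
    """Approximate n / d for d already normalized close to 1 (0 < d < 2)."""
    x, t = n, d                  # x_0 = N, t_0 = D
    for _ in range(max_iter):
        if abs(1 - t) < tol:     # t_k close to 1 means x_k is close to Q
            break
        f = 2 - t                # f_k = 2 - t_k
        x, t = x * f, t * f      # the two products are independent of each other
    return x
```

For N = 0.6 and D = 0.8, the value of t runs through 0.8, 0.96, 0.9984, ... while x approaches 0.75. Since 1 - t squares in every step, the number of correct digits roughly doubles per iteration.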