Introduction to Structured VLSI Design
Integer Arithmetic and Pipelining
Joachim Rodrigues

Outline
- Multiplication in the digital domain
- HW mapping
- Pipelining optimization

Signed and Unsigned Integers
- Unsigned integer: $\sum_{i=0}^{n-1} \mathrm{bit}_i \cdot 2^i$
- Two's complement signed integer: $\mathrm{bit}_{n-1} \cdot (-2^{n-1}) + \sum_{i=0}^{n-2} \mathrm{bit}_i \cdot 2^i$
[Figure: bit positions n-1 ... 5 4 3 2 1 0]

8 bit Signed/Unsigned Integers
The MSB defines the sign of a signed integer.

  Signed   Bit pattern   Unsigned
           (signed overflow below -128)
  -128     1000 0000     128
  -127     1000 0001     129
  ...      ...           ...
  -2       1111 1110     254
  -1       1111 1111     255
           (unsigned overflow above 255)
  0        0000 0000     0
  1        0000 0001     1
  2        0000 0010     2
  3        0000 0011     3
  ...      ...           ...
  126      0111 1110     126
  127      0111 1111     127
           (signed overflow above 127)
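The two weighting formulas above can be checked with a few lines of Python. This is only an illustrative sketch: the helper names `as_unsigned` and `as_signed` are hypothetical and not part of the slides.

```python
def as_unsigned(bits: str) -> int:
    """Unsigned value of a bit string (MSB first): sum of bit_i * 2**i."""
    bits = bits.replace(" ", "")
    n = len(bits)
    return sum(int(b) << (n - 1 - k) for k, b in enumerate(bits))


def as_signed(bits: str) -> int:
    """Two's complement value: the MSB has weight -2**(n-1), the rest as unsigned."""
    bits = bits.replace(" ", "")
    n = len(bits)
    return int(bits[0]) * -(1 << (n - 1)) + as_unsigned(bits[1:])


if __name__ == "__main__":
    # Same 8-bit patterns, two interpretations.
    for pattern in ("1000 0000", "1111 1111", "0000 0000", "0111 1111"):
        print(pattern, as_unsigned(pattern), as_signed(pattern))
```

The same bit pattern yields two different values; only the weight of the MSB changes between the interpretations.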
Add/Subtract
[Figure: n-bit ripple-carry adder with inputs A_{n-1} ... A_1 A_0 and B_{n-1} ... B_1 B_0, carry-in C_0 (0 or 1), internal carries C_1, C_2, ... and outputs S_{n-1} ... S_1 S_0]
- The HW for sum/difference (S) doesn't care about signed/unsigned
- Unsigned overflow = (carry-out & add) OR (no carry-out & subtract)
- Signed overflow = $C_n \oplus C_{n-1}$
- True sign = $S_{n-1} \oplus \text{signed overflow} = (A_{n-1} \oplus B_{n-1} \oplus C_{n-1}) \oplus (C_n \oplus C_{n-1}) = A_{n-1} \oplus B_{n-1} \oplus C_n$

Unsigned Overflow Examples
- 10 + 6 = 16, outside [0..15]
  1010 + 0110 = 0000 with C_4 = 1
  C_4 = 1 & add => unsigned overflow (carry-out & add)
- 7 - 10 = -3, outside [0..15]
  0111 - 1010 is the same as 0111 + 0101 + 1 = 1101 with C_4 = 0
  C_4 = 0 & subtract => unsigned overflow (no carry-out & subtract)

Signed Overflow Example
- 6 + 7 = 13, outside [-8..7]
  0110 + 0111 = 1101 (-3 as a signed value) with C_4 = 0 and C_3 = 1
  $C_4 \oplus C_3 = 0 \oplus 1 = 1$: the carry-outs differ => signed overflow
- True sign = $S_{n-1} \oplus \text{signed overflow} = A_{n-1} \oplus B_{n-1} \oplus C_n = A_3 \oplus B_3 \oplus C_4 = 0 \oplus 0 \oplus 0 = 0$, i.e. positive/zero

Multiplication
- Product = Multiplicand * Multiplier
- log(product) = log(multiplicand) + log(multiplier)
  - Width of product is (worst case) the sum of the widths of the factors
  - May overflow if a single length product register is used
- Paper and pencil method
  - Conditional add (controlled by bits of multiplier) and shift
  - Partial product progressively develops into product
  - 1 product bit/cycle
- Unsigned and signed multiplication
  - Signs require extra attention
- Sequential, combinational or pipelined implementation
  - Tradeoff between hardware resources, throughput, latency, power
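The add/subtract overflow rules stated earlier can be reproduced with a small bit-level model. The sketch below assumes that subtraction is done by inverting B and setting the carry-in to 1, as in the 7 - 10 example; the function name `add_sub` and its return layout are hypothetical.

```python
def add_sub(a: int, b: int, n: int = 4, subtract: bool = False):
    """Model an n-bit ripple-carry adder/subtractor.

    a and b are n-bit patterns (0 .. 2**n - 1). Returns
    (sum bits, carry-out C_n, unsigned overflow, signed overflow, true sign).
    """
    mask = (1 << n) - 1
    if subtract:
        b = ~b & mask               # subtract = add inverted B ...
    carry = 1 if subtract else 0    # ... with carry-in C_0 = 1
    carries = [carry]               # carries[i] holds C_i
    s = 0
    for i in range(n):
        ai = (a >> i) & 1
        bi = (b >> i) & 1
        s |= (ai ^ bi ^ carry) << i
        carry = (ai & bi) | (ai & carry) | (bi & carry)
        carries.append(carry)
    c_n, c_n1 = carries[n], carries[n - 1]
    # Carry-out & add, or no carry-out & subtract.
    unsigned_ovf = (c_n == 0) if subtract else (c_n == 1)
    # The two most significant carries differ.
    signed_ovf = bool(c_n ^ c_n1)
    # A_{n-1} xor B_{n-1} xor C_n, using the B bit as seen by the adder.
    true_sign = ((a >> (n - 1)) & 1) ^ ((b >> (n - 1)) & 1) ^ c_n
    return s, c_n, unsigned_ovf, signed_ovf, true_sign


if __name__ == "__main__":
    print(add_sub(0b1010, 0b0110))                  # 10 + 6
    print(add_sub(0b0111, 0b1010, subtract=True))   # 7 - 10
    print(add_sub(0b0110, 0b0111))                  # 6 + 7
```

Note that the same sum bits are produced regardless of interpretation; only the flags distinguish the signed and unsigned views.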
Multiplying Using Paper and Pencil
We will concentrate on unsigned integers for the next few slides!

Example:
      1011 * 1110
         0000      (*0 = zero)
        1011.      (*1 = copy)
       1011..      (*1 = copy)
      1011...      (*1 = copy)
     --------
     10011010

In decimal: 11 * 14 = 154

... more Paper and Pencil
The LSB of the partial multiplier controls whether to add 0 or the multiplicand to the partial product (0: add zero, 1: add multiplicand).

Multiplicand * Multiplier = 1011 * 1110

  Multiplier bit   Shifted multiplicand   Partial product | partial multiplier
  (initial)                               0000 | 1110
  0                0000                   00000 | 111
  1                1011.                  010110 | 11
  1                1011..                 1000010 | 1
  1                1011...                10011010 |

- Adding shifted copies of the multiplicand (left columns): disadvantage, needs a 2n bit ALU
- Shifting the partial product right instead (right column): advantage, needs only an n bit ALU
- Shifting in the carry out prevents overflow

Seq. Multiplication, Initialize
[Figure: the n bit multiplicand register is loaded; the 2n bit register is loaded with 0 in the partial product half and the multiplier in the other half; bit 0 of the 2n bit register is the LSB of the multiplier]

Seq. Multiplication, Step
Repeat step n times.
[Figure: bit 0 of the 2n bit register (partial product, partial multiplier) is the control signal for the conditional add of the n bit multiplicand register to the partial product; the 2n bit register is then shifted right]
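The initialize/step procedure above can be sketched as a behavioural model. This is a minimal Python sketch, not the hardware description from the course; the name `seq_multiply` is hypothetical, and a single Python integer stands in for the 2n bit register.

```python
def seq_multiply(multiplicand: int, multiplier: int, n: int = 4) -> int:
    """Unsigned sequential shift-and-add multiplication of two n-bit values."""
    mask = (1 << n) - 1
    # Initialize: upper half (partial product) = 0, lower half = multiplier.
    reg = multiplier & mask
    for _ in range(n):
        upper = reg >> n
        if reg & 1:
            # Conditional n-bit add; the result may be n+1 bits (carry-out).
            upper += multiplicand & mask
        # Shift right by one; the carry-out becomes the new MSB.
        reg = (upper << (n - 1)) | ((reg & mask) >> 1)
    return reg


if __name__ == "__main__":
    print(bin(seq_multiply(0b1011, 0b1110)))  # the 11 * 14 example
```

Only an n bit addition is performed per step, which is the advantage noted for the shifting variant.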
Seq. Multiplication, Result
[Figure: n bit multiplicand register and adder; the 2n bit register now holds the product, bit 0 being its LSB]
One partial product per clock cycle => very slow

Don't forget... Signed Multiplication
Either transform to a multiplication of non-negative integers:
1. Record signs and negate any negative factors.
2. Perform unsigned multiplication.
3. Negate product if signs above differ.
Or directly perform signed multiplication:
1. Take into account the sign bit of the multiplicand by shifting in true sign bits rather than carry outs, i.e. $A_{n-1} \oplus B_{n-1} \oplus C_n$ rather than $C_n$.
2. Take into account the sign bit of the multiplier by doing a conditional subtract rather than a conditional add during the last iteration.

Seq. signed multiplication, step
Repeat step n times.
[Figure: n bit multiplicand register feeding an add/sub unit; conditional add for iterations 1..n-1, conditional subtract for iteration n; the true sign, $A_{n-1} \oplus B_{n-1} \oplus C_n$, is shifted into the 2n bit register (partial product, partial multiplier), which is shifted right; bit 0 controls the add/sub]

Multiplication by a Constant
As a designer you need to ensure that multiplication by a small constant is accomplished by a number of shifts and adds.
Some numerical examples:
- *2 (*10_2): multiplicand << 1
- *3 (*11_2): (multiplicand << 1) + multiplicand
- *4 (*100_2): multiplicand << 2
- *5 (*101_2): (multiplicand << 2) + multiplicand
- *255 (*11111111_2): (multiplicand << 8) - multiplicand
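The direct signed multiplication described earlier (shift in the true sign, subtract during the last iteration) can be modelled as follows. This is a sketch under one stated assumption: Python integers never overflow, so the sign of `upper` is always the true sign, which in hardware would be computed as the XOR of the two adder sign inputs and the carry-out. The name `seq_multiply_signed` is hypothetical.

```python
def seq_multiply_signed(a: int, b: int, n: int = 4) -> int:
    """Two's complement sequential multiplication.

    a (multiplicand) and b (multiplier) are signed values in
    [-2**(n-1), 2**(n-1) - 1]. Returns the signed 2n-bit product.
    """
    mask = (1 << n) - 1
    upper = 0          # partial product half, kept as a signed Python int
    lower = b & mask   # multiplier bit pattern
    for i in range(n):
        if lower & 1:
            # Conditional add, except a conditional subtract in the last
            # iteration because the multiplier MSB has negative weight.
            upper = upper - a if i == n - 1 else upper + a
        # Arithmetic shift right of the (upper, lower) pair: the LSB of
        # upper moves into lower, and Python's >> shifts in the true sign.
        lower = ((upper & 1) << (n - 1)) | (lower >> 1)
        upper >>= 1
    return upper * (1 << n) + lower


if __name__ == "__main__":
    print(seq_multiply_signed(-3, -2))
    print(seq_multiply_signed(-8, 7))
```

Replacing the sign shift-in with the plain carry-out would give wrong results for negative multiplicands, which is why the slides single out the true sign.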
String of n bit Adders
- Unrolling the loop lowers latency compared to sequential add and shift, at the expense of much more hardware
- n x n multiplication requires n-1 n bit adders
- $t_{\text{saved latency}} = n \cdot (t_{\text{clk-out}} + t_{\text{set-up}})$
[Figure: chain of n bit adders summing the rows Mp_0 * Mc, Mp_1 * Mc, Mp_2 * Mc, ..., Mp_{n-1} * Mc; product outputs P_{2n-1}, P_{2n-2..n}, P_{n-1}, ..., P_2, P_1, P_0]

Carry save Adders in Multipliers
- Significantly reduced delays for multi input adders
- Full adders with clever interconnect
- Sum and carries fed separately to the adder at the next level
- Carries drawn diagonally, sums drawn vertically
- Typically, a final (carry propagate) adder assimilates the carries
[Figure: carry save adder rows CSA 0 and CSA 1; CSA 0 takes inputs A_{0,i}, B_{0,i}, C_{0,i} and produces sums S_{1,i} and carries C_{1,i}, which feed CSA 1 together with A_{1,i}; CSA 1 produces S_{2,i} and C_{2,i}]

6 x 6 Parallel Array Multiplier
MP_{i,j} = Multiplier_i AND Multiplicand_j
[Figure: array of partial product bits MP_{0,0} ... MP_{3,3}, with zeros fed into the first row, and product outputs P_7 ... P_0]

... Pipelined Version
[Figure: the same array with pipeline registers inserted between the rows and a carry propagate adder producing P_7 ... P_0]
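The carry save principle described earlier (sums and carries kept separate until a final carry propagate adder) can be illustrated with word-level bit operations. This is only a sketch: the names `carry_save_add` and `csa_multiply` are hypothetical, and the row-by-row reduction order is an assumption rather than the exact array shown in the figures.

```python
def carry_save_add(a: int, b: int, c: int):
    """One row of full adders: three operands in, sum word and carry word out.

    No carry ripples along the row; each bit position works independently.
    """
    s = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1  # carries move one column left
    return s, carry


def csa_multiply(multiplicand: int, multiplier: int, n: int = 4) -> int:
    """Unsigned n x n multiplication using carry save rows."""
    # Row i is Multiplier_i AND Multiplicand, shifted to its weight.
    rows = [(multiplicand << i) if (multiplier >> i) & 1 else 0
            for i in range(n)]
    s, c = 0, 0
    for row in rows:
        s, c = carry_save_add(s, c, row)
    # A final carry propagate adder assimilates the saved carries.
    return s + c


if __name__ == "__main__":
    print(csa_multiply(0b1011, 0b1110))
```

Each carry save row has the delay of a single full adder regardless of word width, which is where the reduced delay for multi input addition comes from.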
Sequential, Combinational, and Pipelined
- The sequential shift and add algorithm corresponds to a for loop that may be implemented by:
  - a state machine, or
  - instructions (low end microcontrollers)
- The sequential algorithm may be unrolled and implemented as a deep combinational circuit:
  - String of n bit adders and AND gates, or
  - Carry save adders, AND gates, and a final (n-1) bit adder
  - Advantage: low latency
  - Disadvantage: more hardware

Pipelining
- The deep combinational circuit may be pipelined
  - Advantage: very high throughput
  - Disadvantages: pipeline latency, more hardware, and higher power

Laundry process Comparison
- Non pipelined:
  - Delay: 60 min
  - Throughput: 1/60 load per min
- Pipelined:
  - Delay: 60 min
  - Throughput: k/(40 + 20k) load per min for k loads, about 1/20 when k is large
  - Throughput 3 times better than non pipelined

Joachim Rodrigues, Informatik og Matematisk Modellering, jnr@imm.dtu.dk
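The laundry numbers above can be reproduced with a short calculation. The sketch assumes three stages of 20 minutes each, which is consistent with the 60 minute delay and the factor of 3 but is not spelled out in the slide text; the function name `total_time` is hypothetical.

```python
def total_time(k: int, stages: int = 3, stage_min: int = 20,
               pipelined: bool = True) -> int:
    """Minutes needed to finish k loads (assumed: equal-length stages)."""
    if pipelined:
        # The first load needs the full delay; every further load
        # finishes one stage time after the previous one.
        return stages * stage_min + (k - 1) * stage_min
    # Non pipelined: each load occupies the whole process.
    return k * stages * stage_min


if __name__ == "__main__":
    for k in (1, 10, 1000):
        t_pipe = total_time(k)
        t_seq = total_time(k, pipelined=False)
        print(k, t_pipe, t_seq, round(t_seq / t_pipe, 3))
```

For a single load the pipeline gives no benefit; the speedup approaches the number of stages only when there are many loads to feed it.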
Pipelined combinational circuit

Adding pipeline to a comb circuit
Candidate circuit for pipeline:
- there is enough input data to feed the pipelined circuit
- throughput is a main performance criterion
- the comb circuit can be divided into stages with similar propagation delays
- the propagation delay of a stage is much larger than the setup time and the clock-to-q delay of the register

Recipe
- Derive the block diagram of the original combinational circuit and arrange the circuit as a cascading chain
- Identify the major components and estimate the relative propagation delays of these components
- Divide the chain into stages of similar propagation delays
- Identify the signals that cross the stage boundaries of the chain
- Insert registers for these signals at the boundaries

Exercise (15 min)
Pipeline two 4 bit adders which are connected in series. The FFs are ideal ($t_{\text{setup}} = t_{\text{clk} \to Q} = 0$), $t_{pa}$ = 400 ps. The carry out of the 2nd adder can be ignored.
- How many pipeline stages?
- Where do you put the FFs?
- What is the gain in throughput?
- How many FFs are required?
[Figure: two 4 bit adders in series; the first adds a_0..a_3 and b_0..b_3 and produces the intermediate sums s_0p..s_3p; the second adds c_0..c_3 to these and produces s_0..s_3]
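The "divide the chain into stages of similar propagation delays" step of the recipe can be sketched as a small search. Everything numeric below is a made-up illustration (it is not the exercise circuit), and the helpers `best_partition` and `clock_period` are hypothetical names.

```python
from itertools import combinations


def best_partition(delays, stages):
    """Split a chain of component delays into contiguous stages so that
    the slowest stage is as fast as possible (exhaustive search).

    Returns (cut positions, list of stage delays).
    """
    n = len(delays)
    best = None
    for cuts in combinations(range(1, n), stages - 1):
        bounds = (0, *cuts, n)
        stage_delays = [sum(delays[lo:hi])
                        for lo, hi in zip(bounds, bounds[1:])]
        if best is None or max(stage_delays) < max(best[1]):
            best = (cuts, stage_delays)
    return best


def clock_period(stage_delays, t_setup=0, t_clk_to_q=0):
    """Minimum clock period: slowest stage plus register overhead."""
    return max(stage_delays) + t_setup + t_clk_to_q


if __name__ == "__main__":
    chain = [300, 200, 250, 250]   # hypothetical component delays in ps
    cuts, stage_delays = best_partition(chain, stages=2)
    print(cuts, stage_delays, clock_period(stage_delays, 50, 50))
```

The register overhead is added to every stage, which is why the recipe asks for stage delays that are much larger than the setup and clock-to-q times.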
Datapath
An RTL description is characterized by the registers in a design and the combinational logic in between. This can be illustrated by a "register and cloud" diagram. Registers and the combinational logic are described separately in two different processes.

Datapath Sequential part

    architecture SPLIT of DATAPATH is
      signal X1, Y1, X2, Y2 :...
    begin
      seq : process (CLK)
      begin
        if (CLK'event and CLK = '1') then
          X1 <= Y0;
          X2 <= Y1;
          X3 <= Y2;
        end if;
      end process;

Datapath Combinatorial part

      LOGIC : process (X1, X2)
      begin
        -- F(X1) and G(X2) can be replaced with the code
        -- implementing the desired combinational logic
        -- or appropriate functions must be defined.
        Y1 <= F(X1);
        Y2 <= G(X2);
      end process;
    end SPLIT;

Pipelining
- The instructions on the preceding slides introduced pipelining of the DP.
- The critical path is reduced from F(X1) + G(X2) to either F(X1) or G(X2), whichever is longer.
- Do not constrain the synthesis tool by splitting operations, e.g., $y_1 = x_1 + x_1^2$.
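The register and cloud structure of the SPLIT architecture can be mimicked cycle by cycle in Python to see the pipeline latency. This is a behavioural sketch only: the bodies of `F` and `G` are arbitrary placeholders (the slides leave them open), and `None` stands for a register that has not yet received valid data.

```python
def F(x):
    """Placeholder for the first combinational cloud (assumption)."""
    return x + 1


def G(x):
    """Placeholder for the second combinational cloud (assumption)."""
    return x * 2


def simulate(inputs):
    """Apply one value of Y0 per clock cycle and record X3 after each edge."""
    x1 = x2 = x3 = None
    outputs = []
    for y0 in inputs:
        # Combinational part: evaluated from the current register values.
        y1 = F(x1) if x1 is not None else None
        y2 = G(x2) if x2 is not None else None
        # Sequential part: all registers update together on the clock edge.
        x1, x2, x3 = y0, y1, y2
        outputs.append(x3)
    return outputs


if __name__ == "__main__":
    print(simulate([1, 2, 3, 0, 0]))
```

The first valid result appears after the third clock edge, and from then on one result is produced per cycle, each equal to G(F(input)) for the input applied three edges earlier.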