Jan Rabaey Homework # 7 Solutions EECS141

UNIVERSITY OF CALIFORNIA
College of Engineering
Department of Electrical Engineering and Computer Sciences

Last modified on March 30, 2004 by Gang Zhou (zgang@eecs.berkeley.edu)

Jan Rabaey Homework # 7 Solutions EECS141

Problem 1: Variable-Block Carry-Skip Adder

The carry-skip adder is a pretty good circuit. However, upon closer inspection, you notice that if all the skip blocks are of the same size, the later blocks will finish switching quickly and then sit idle for a while, waiting for the carry signal to pass through all the bypass multiplexers. For example, in the diagram of a 32-bit carry-skip adder below, the carry-out for bits 4-7 will be ready at the same time as the carry-out for bits 0-3. This second block will sit around doing nothing while MUX1 does its job.

[Figure: 32-bit carry-skip adder with uniform 4-bit skip blocks covering bits 0-31, carry-in C_in.]

To speed up the circuit, we could vary the size of the skip blocks. Intuitively, we should then be able to reduce the size of the first skip block and make each subsequent block increasingly larger. Because the critical path includes the last skip block, we must also start to taper down the size of each block as we approach the end. To obtain the optimal sizes of all the skip blocks, you realize that some really smart guy has already done all the mathematical derivations, which means that you don't have to do it yourself. After talking to this really smart guy, you know the optimal configuration for a 32-bit adder (under the assumption that t_MUX = 2*t_prop):

[Figure: 32-bit carry-skip adder with variable-size skip blocks covering bits 0-31, carry-in C_in.]

Estimate the worst-case delay for the simple 32-bit carry-skip adder in the first diagram and then estimate the amount of delay improvement with this new variable-block scheme. Assume that the setup (creation of the propagate and generate signals) takes t_setup, each bit of carry propagation takes t_prop (i.e., a skip block of m bits has a delay of m*t_prop), a MUX has a delay of t_MUX, and the sum generation has a delay of t_sum. Leave your answers in terms of t_setup, t_prop, t_MUX, and t_sum.
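For reference, the skip structure described above can be modeled behaviorally. The following Python sketch (function and parameter names are mine, not part of the assignment) shows how each block either ripples its carry internally or lets the bypass MUX forward its carry-in when every bit of the block propagates:

```python
def carry_skip_add(a, b, block_sizes, cin=0):
    """Behavioral model of a carry-skip adder: ripple inside each block,
    but bypass the block (via its MUX) whenever all of its bits propagate."""
    total, pos, carry = 0, 0, cin
    for size in block_sizes:
        block_cin, block_propagates = carry, True
        for i in range(pos, pos + size):
            ai, bi = (a >> i) & 1, (b >> i) & 1
            p, g = ai ^ bi, ai & bi
            total |= (p ^ carry) << i          # sum bit i
            carry = g | (p & carry)            # ripple carry inside the block
            block_propagates &= bool(p)
        if block_propagates:
            carry = block_cin                  # bypass MUX selects the block carry-in
        pos += size
    return total | (carry << pos)

# Uniform 4-bit skip blocks for a 32-bit adder, as in the first diagram.
assert carry_skip_add(0x1234_5678, 0x0ACD_EF01, [4] * 8) == 0x1234_5678 + 0x0ACD_EF01
```

The block sizes are a parameter, so the same model covers the variable-block configuration; only the timing (not the result) differs between the two.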

Solution:

For the standard carry-skip adder, the delay is just as derived in lecture. In the worst case, we must pass the carry-out signal from the first skip block (bits 0-3) all the way to the end. In that situation, we have to pass through all 7 MUXes. We then have to run through the last skip block (bits 28-31) in order to calculate the sum for bit 31. At this point, we only have to propagate the carry through 3 bits (28, 29, and 30) to get the carry-in to bit 31, which is needed to calculate the sum. Thus, the total delay of the simple carry-skip adder is:

Delay = t_setup + 4*t_prop + 7*t_MUX + 3*t_prop + t_sum = t_setup + 7*t_prop + 7*t_MUX + t_sum

For the variable-block carry-skip adder, the worst-case path is the same, but the block sizes are different. We only have 1 bit to run through in the first block and 1 bit in the last block. The delay is thus:

Delay = t_setup + t_prop + 7*t_MUX + t_sum

We get a delay improvement of 6*t_prop. Since you were told to assume that t_MUX = 2*t_prop, it would be fine to express the answer with these terms interchanged.

Problem 2: Short Adders

In this question we are going to compare the speed of different types of adders.

a) Calculate the worst-case delay of an 8-bit ripple-carry adder consisting of the full-adder blocks shown below. You can use t_p for the AND and OR functions and 2*t_p for the XOR gates. Express your answer in terms of t_p.

Solution: In the first stage, we need one XOR delay to obtain the first propagate signal, then two gate delays (AND, OR) to reach c_out. For each of the next 6 stages we need 2 gate delays from c_in to c_out. In the final stage we have the c_in-to-sum XOR delay. The total delay is

t_total = 4*t_p + 6*2*t_p + 2*t_p = 18*t_p

b) For the second part of this question you are to implement an 8-bit carry-lookahead adder. For n-input AND/OR gates use t_p,n = 0.25*n^2*t_p; similarly, for n-input XOR gates use t_XOR,n = 0.5*n^2*t_p. Find the worst-case delay of this adder, again in terms of t_p.
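Before turning to part b, the delay bookkeeping above (Problem 1 and part a) can be double-checked numerically. A minimal Python sketch follows; the term-counting helper and its names are my own addition, not part of the original solution:

```python
from collections import Counter

def show(terms):
    """Format a delay given as a multiset of unit delays, e.g. {'t_prop': 7, ...}."""
    return " + ".join(f"{n}*{name}" if n > 1 else name for name, n in terms.items())

# Problem 1, uniform 4-bit blocks: setup + 4-bit ripple in block 0 + 7 bypass
# MUXes + 3-bit ripple in the last block + sum generation.
uniform = Counter({"t_setup": 1, "t_prop": 4 + 3, "t_MUX": 7, "t_sum": 1})

# Variable-block version, per the solution above: a 1-bit first block (one
# t_prop to produce its carry-out) and a 1-bit last block (no ripple before
# the MSB sum), so a single t_prop remains on the critical path.
variable = Counter({"t_setup": 1, "t_prop": 1, "t_MUX": 7, "t_sum": 1})

print(show(uniform))              # t_setup + 7*t_prop + 7*t_MUX + t_sum
print(show(variable))             # t_setup + t_prop + 7*t_MUX + t_sum
print(show(uniform - variable))   # 6*t_prop of improvement

# Problem 2, part a: 8-bit ripple-carry adder with XOR = 2*t_p and AND/OR = t_p.
print((2 + 2) + 6 * 2 + 2, "* t_p")   # 18 * t_p
```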

Solution: In part b) we assume that the CLA logic implements the functions

c_o3 = g3 + p3*g2 + p3*p2*g1 + p3*p2*p1*g0 + p3*p2*p1*p0*c_in
c_o2 = g2 + p2*g1 + p2*p1*g0 + p2*p1*p0*c_in
c_o1 = g1 + p1*g0 + p1*p0*c_in
c_o0 = g0 + p0*c_in

Ripple the carry between the two blocks:

We spend an XOR delay to obtain the g's and p's. Then, from the inputs to c_o3, we go through one 5-input AND and one 5-input OR. c_o3 is passed to the next block, where the c_o2-style equation produces the carry into bit 7 (a 4-input AND and a 4-input OR), and the result finally goes through an XOR gate to produce the sum.

2*t_p + 0.25*5^2*t_p + 0.25*5^2*t_p + 0.25*4^2*t_p + 0.25*4^2*t_p + 2*t_p = (2 + 6.25 + 6.25 + 4 + 4 + 2)*t_p = 24.5*t_p

Select the sum between two alternatives:

As an alternative (faster) solution, the second block performs a carry-select operation. In that case, both possible sums are generated in the second block in parallel, and we only need to choose between them with a MUX once c_o3 is known. A MUX implements the function F = a*s' + b*s, so it has the delay of a 2-input AND followed by a 2-input OR.

t_total = 2*t_p + 0.25*5^2*t_p + 0.25*5^2*t_p + t_p + t_p = (2 + 6.25 + 6.25 + 1 + 1)*t_p = 16.5*t_p

c) Repeat the same calculations for 32-bit adders. Hint: implement the carry-lookahead adder as a block CLA with 4-bit blocks.

Solution: In the CLA adder it takes an XOR delay (2*t_p) to generate the individual p_i, g_i, then a 4-input AND and a 4-input OR to generate the block P's and G's (4*t_p + 4*t_p). From the outputs of the top level (in the diagram) it takes an additional 4-input AND and 4-input OR delay (4*t_p + 4*t_p) to generate the mid-level block P, G. Then a 2-input AND and a 2-input OR take us through the bottom level (t_p + t_p). After another 2-input AND and 2-input OR we go back up through the middle level (t_p + t_p) and reach the top level of the diagram again. In this top level a 2-input AND and a 2-input OR (t_p + t_p) are needed to generate the final carry, and a final XOR (2*t_p) is needed to obtain the sum.

(2 + 4 + 4 + 4 + 4 + 2 + 2 + 2 + 2)*t_p = 26*t_p

In the RCA (ripple-carry adder) case we again have 4*t_p + 30*2*t_p + 2*t_p = 66*t_p. As we can clearly see, as the number of bits increases the carry-lookahead adder has a distinct advantage. But for adders with fewer than about 10 bits, it is usually wiser to use a simple ripple-carry implementation.
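As a quick numeric check of these gate-delay sums, here is a minimal Python sketch using the delay model from the problem statement (the helper names are mine):

```python
and_or = lambda n: 0.25 * n * n   # n-input AND or OR delay, in units of t_p
xor    = lambda n: 0.50 * n * n   # n-input XOR delay, in units of t_p

# Part b, option 1: ripple the carry between the two 4-bit CLA blocks
# (p/g XOR, c_o3 via 5-in AND/OR, carry into bit 7 via 4-in AND/OR, sum XOR).
opt1 = xor(2) + 2 * and_or(5) + 2 * and_or(4) + xor(2)
print(opt1)                       # 24.5

# Part b, option 2: carry-select second block; the MUX costs a 2-in AND + 2-in OR.
opt2 = xor(2) + 2 * and_or(5) + 2 * and_or(2)
print(opt2)                       # 16.5

# Part c: 32-bit block CLA (4-bit blocks) versus a 32-bit ripple-carry adder.
cla32 = xor(2) + 2 * and_or(4) + 2 * and_or(4) \
      + 2 * and_or(2) + 2 * and_or(2) + 2 * and_or(2) + xor(2)
rca32 = xor(2) + 2 * and_or(2) + 30 * 2 * and_or(2) + xor(2)
print(cla32, rca32)               # 26.0 66.0
```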

More details: The inputs of the top level are the individual p_i, g_i. As mentioned in part b), the equations implemented are

p_{i+3:i} = p_{i+3}*p_{i+2}*p_{i+1}*p_i
g_{i+3:i} = g_{i+3} + p_{i+3}*g_{i+2} + p_{i+3}*p_{i+2}*g_{i+1} + p_{i+3}*p_{i+2}*p_{i+1}*g_i

so we can see that the worst-case delay is a 4-input AND plus a 4-input OR. Here g_{i+3:i} means a carry is generated within the block encompassing bit positions i+3 down to i, and p_{i+3:i} means the carry-in of the block is passed to the carry-out of the block.

The mid-level blocks implement

p_{i+15:i} = p_{i+15:i+12}*p_{i+11:i+8}*p_{i+7:i+4}*p_{i+3:i}
g_{i+15:i} = g_{i+15:i+12} + p_{i+15:i+12}*g_{i+11:i+8} + p_{i+15:i+12}*p_{i+11:i+8}*g_{i+7:i+4} + p_{i+15:i+12}*p_{i+11:i+8}*p_{i+7:i+4}*g_{i+3:i}

Once we have p_{i:k}, g_{i:k}, and c_o(k-1) (i.e., the carry-out at stage k-1), we can obtain the carry-out of stage i using the relation

c_oi = g_{i:k} + p_{i:k}*c_o(k-1)

which has the delay of a 2-input AND and a 2-input OR. In other words, to get a carry-out at bit position i, the block encompassing bits i down to k must either generate a carry or pass the carry coming in as c_o(k-1).

Problem 4 appears later; first, Problem 3.

Problem 3: Pipelined Multipliers

An array multiplier consists of rows of adders, each producing partial sums that are subsequently fed to the next adder row. In this problem, we consider the effects of pipelining such a multiplier by inserting registers between the adder rows.

a) Redraw Figure 11.30 (textbook, pg. 590), inserting word-level pipeline registers as required to achieve maximal throughput for the 4x4 multiplier. Hint #1: you must use additional registers to keep the input bits synchronized with the appropriate partial sums. Hint #2: just use little filled black rectangles to indicate registers and assume all registers are clocked with the same clock.

b) Repeat part (a) for a carry-save, as opposed to ripple-carry, architecture.

c) For each of the two multiplier architectures, compare the critical path, throughput, and latency of the pipelined and non-pipelined versions.

d) Which architecture is better suited to pipelining, and how does the choice of a vector-merging adder affect this decision?

Solution:

Problem 4: Modified Booth Recoding

Start with an NxN array multiplier and notice that the number of partial products required is N. This implies N-1 additions, and thus N-1 rows in the array. Modified Booth Recoding (MBR) is a technique for halving the number of partial products produced during a multiplication. This is nice because fewer partial products means fewer additions, ultimately resulting in a faster multiplication.

a) Two important number-system principles are required to understand how MBR works. First, the base of the number system is called the radix. Decimal is radix-10, binary is radix-2, hexadecimal is radix-16, and so on. MBR uses a radix-4 number system. Since two binary bits can represent four numbers, we can take an ordinary binary number and split it into two-bit groups to form a radix-4 number:

Ordinary radix-2 (binary) number: [0 0 1 1 1 0 1 0]_2 = 0*2^7 + 0*2^6 + 1*2^5 + 1*2^4 + 1*2^3 + 0*2^2 + 1*2^1 + 0*2^0 = 58
Radix-4 number: [00 11 10 10]_2 = [0 3 2 2]_4 = 0*4^3 + 3*4^2 + 2*4^1 + 2*4^0 = 58

Note that in binary, the 8 bits mean that we will have 8 partial products. In radix-4, we have only four digits, hence half the partial products. When multiplying X*Y, two steps are taken before the multiplication is performed. First, we recode Y using radix-4. Second, we calculate the four possible unshifted partial products: 0*X, 1*X, 2*X, and 3*X. Each radix-4 digit tells us which of these partial products to select and how far to shift it (i.e., how many zeros to append to the end). Demonstrate how this works by multiplying 94*121 using this technique.

b) Note that the biggest problem with this radix-4 multiplication is the partial-product generation. 0*X, 1*X, and 2*X are easily generated using AND gates and a shifter. However, 3*X must be generated by adding 1*X + 2*X. This addition is in the critical path of the multiplier, so we would like to remove it. We do this by getting rid of all the 3*X partial products in the radix-4 calculation. Essentially, we need to remove radix-4 digits that have the value [3]_4 or, equivalently, [11]_2. Consider a number system in which each bit position can hold three values: {-1, 0, 1}. This is called a redundant number system because there is more than one way to represent the same number. Numbers in this format can be treated in the same way as ordinary binary numbers, e.g.:

Ordinary binary number: [0 0 1 1 1 0 1 0]_2 = 0*2^7 + 0*2^6 + 1*2^5 + 1*2^4 + 1*2^3 + 0*2^2 + 1*2^1 + 0*2^0 = 32 + 16 + 8 + 2 = 58
Redundant number system: [0 1 0 -1 1 0 1 0]_2 = 0*2^7 + 1*2^6 + 0*2^5 + -1*2^4 + 1*2^3 + 0*2^2 + 1*2^1 + 0*2^0 = 64 - 16 + 8 + 2 = 58

Note that all ordinary binary numbers are also included in this redundant number system, as well as a whole bunch more numbers that contain -1 bits. Convert the following redundant numbers into standard binary numbers and then into radix-4 numbers: [0 0 1 0 0 1 0 0]_2, [0 1 0 1 0 1 0 1]_2, [0 1 0 1 1 -1 0 1]_2. Note that standard binary sequences of the form {0, some ones} can be converted to redundant sequences of the form {1, some zeros, -1}. By replacing a string of 1's with 0's in this way, we can eliminate the possibility of two 1's in a group, thus eliminating the 3*X partial product!

c) MBR basically searches for strings of 1's in the binary number, converts them into an equivalent redundant-number representation, treats the result in radix-4, and then does the multiplication. This can easily be accomplished using the look-up table in Table 1. Since we are using radix-4, i = {0, 2, 4, 6, 8, ...}. Also, Y_{-1} = 0.
Now, for X*Y, the partial products 0*X, 1*X, 2*X, -2*X, -1*X, and -0*X (i.e., the digit set {-2, -1, 0, 1, 2} applied to X) are generated, Y is recoded according to the table, and then the multiplication is performed. Recode Y = 121 into radix-4 digits drawn from {-2, -1, 0, 1, 2} according to the table. Now perform the multiplication 94*121.
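Table 1 itself is not reproduced in this transcription, but parts (a) and (c) can be sketched behaviorally in Python. The helper names below are mine, and the recoding rule used (digit_i = Y_{i-1} + Y_i - 2*Y_{i+1} for even i, with Y_{-1} = 0) is the standard modified-Booth rule that such a table tabulates:

```python
def radix4_digits(y, n_bits=8):
    """Part (a): split y into plain radix-4 digits (values 0..3), least significant first."""
    return [(y >> i) & 0b11 for i in range(0, n_bits, 2)]

def booth_digits(y, n_bits=8):
    """Part (c): modified-Booth recode y into digits from {-2,-1,0,1,2}, least significant first."""
    bit = lambda k: (y >> k) & 1 if k >= 0 else 0
    return [bit(i - 1) + bit(i) - 2 * bit(i + 1) for i in range(0, n_bits, 2)]

def multiply(x, digits):
    """Select and shift the partial products, then add: sum_i digit_i * x * 4**i."""
    return sum(d * x * 4 ** i for i, d in enumerate(digits))

X, Y = 94, 121
print(radix4_digits(Y), multiply(X, radix4_digits(Y)))   # [1, 2, 3, 1] 11374  (needs a 3*X term)
print(booth_digits(Y), multiply(X, booth_digits(Y)))     # [1, -2, 0, 2] 11374 (only 0, +/-1*X, +/-2*X)
```

Both recodings reproduce 94*121 = 11374; the Booth version never needs the 3*X partial product, which is the whole point of the recoding.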

d) Now let's generate those partial products. A straightforward generation can be made using three signals: negate (1: negate X, 0: no change), shift (1: shift left by one, 0: no change), and zero (1: force to zero, 0: no change). Design a circuit that generates these three signals using standard gates (AND, OR, INVERTER, XOR, etc.).

e) So what does all this gain us? We've traded the 3*X partial product for -1*X and -2*X. Recall that negation in two's complement requires us to invert all the bits and then add 1. How can we add these 1's in without putting an entirely new adder in the critical path? Hint: try to find holes in the multiplication (i.e., low-order bits that are known to be zero and can be replaced with our negate signal).

f) Design a circuit that uses the three signals from part d to generate -2*X, -1*X, 0*X, 1*X, and 2*X. Bear in mind that the negation does not need to add one, because that will be taken care of using the method in part e.

g) Congratulations! You've created all the primary building blocks of a Booth-recoded multiplier. Lastly, what additional improvement can be made to make this one of the fastest multipliers available?

Solution:
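The original solution consists of circuit schematics that are not part of this transcription. Purely as a behavioral sketch of what parts (d)-(f) describe (the signal encoding and helper names are my own assumptions, not the original circuits):

```python
def control_signals(digit):
    """Part (d), behaviorally: negate/shift/zero derived from a Booth digit in {-2,-1,0,1,2}."""
    negate = 1 if digit < 0 else 0
    shift  = 1 if abs(digit) == 2 else 0
    zero   = 1 if digit == 0 else 0
    return negate, shift, zero

def partial_product(x, digit, width=16):
    """Part (f): produce 0*X, +/-1*X or +/-2*X as a width-bit pattern.
    Negation is one's complement only; the '+1' is returned separately so it
    can be injected into a low-order hole, as part (e) suggests."""
    negate, shift, zero = control_signals(digit)
    mask = (1 << width) - 1
    pp = 0 if zero else (x << shift) & mask
    if negate:
        pp = ~pp & mask
    return pp, negate

# Check against 94 * 121 using the Booth digits for Y = 121 (1, -2, 0, 2, LSB first).
X, digits, width = 94, [1, -2, 0, 2], 16
acc = 0
for i, d in enumerate(digits):
    pp, plus_one = partial_product(X, d, width)
    # Adding plus_one here is arithmetically the same as dropping it into an
    # unused low-order bit of a later row, which is the trick part (e) asks for.
    acc += (pp + plus_one) << (2 * i)
print(acc & ((1 << width) - 1), 94 * 121)   # 11374 11374
```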