UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666

Similar documents
EE878 Special Topics in VLSI. Computer Arithmetic for Digital Signal Processing

Digital Computer Arithmetic

Array Multipliers. Figure 6.9 The partial products generated in a 5 x 5 multiplication. Sec. 6.5

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666

EE878 Special Topics in VLSI. Computer Arithmetic for Digital Signal Processing

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666

Partial product generation. Multiplication. TSTE18 Digital Arithmetic. Seminar 4. Multiplication. yj2 j = xi2 i M

Multi-Operand Addition Ivor Page 1

Implementation of Efficient Modified Booth Recoder for Fused Sum-Product Operator

Area Efficient, Low Power Array Multiplier for Signed and Unsigned Number. Chapter 3

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666

EE878 Special Topics in VLSI. Computer Arithmetic for Digital Signal Processing

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666

International Journal of Engineering and Techniques - Volume 4 Issue 2, April-2018

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666

At the ith stage: Input: ci is the carry-in Output: si is the sum ci+1 carry-out to (i+1)st state

VARUN AGGARWAL

MULTIPLE OPERAND ADDITION. Multioperand Addition

*Instruction Matters: Purdue Academic Course Transformation. Introduction to Digital System Design. Module 4 Arithmetic and Computer Logic Circuits

CS 5803 Introduction to High Performance Computer Architecture: Arithmetic Logic Unit. A.R. Hurson 323 CS Building, Missouri S&T

ECE 341. Lecture # 6

Timing for Ripple Carry Adder

CPE300: Digital System Architecture and Design

Week 7: Assignment Solutions

Arithmetic Logic Unit

Analysis of Different Multiplication Algorithms & FPGA Implementation

VTU NOTES QUESTION PAPERS NEWS RESULTS FORUMS Arithmetic (a) The four possible cases Carry (b) Truth table x y

Chapter 3: part 3 Binary Subtraction

ECE331: Hardware Organization and Design

University of Illinois at Chicago. Lecture Notes # 10

Sum to Modified Booth Recoding Techniques For Efficient Design of the Fused Add-Multiply Operator

Computer Arithmetic Multiplication & Shift Chapter 3.4 EEC170 FQ 2005

Integer Multipliers 1

EE 109 Unit 6 Binary Arithmetic

Hybrid Signed Digit Representation for Low Power Arithmetic Circuits

An Efficient Fused Add Multiplier With MWT Multiplier And Spanning Tree Adder

Jan Rabaey Homework # 7 Solutions EECS141

DLD VIDYA SAGAR P. potharajuvidyasagar.wordpress.com. Vignana Bharathi Institute of Technology UNIT 3 DLD P VIDYA SAGAR

Design of Arithmetic Units ECE152B AU 1

Arithmetic Circuits. Nurul Hazlina Adder 2. Multiplier 3. Arithmetic Logic Unit (ALU) 4. HDL for Arithmetic Circuit

COMPUTER ARCHITECTURE AND ORGANIZATION. Operation Add Magnitudes Subtract Magnitudes (+A) + ( B) + (A B) (B A) + (A B)

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 3. Arithmetic for Computers Implementation

Learning Outcomes. Spiral 2-2. Digital System Design DATAPATH COMPONENTS

JOURNAL OF INTERNATIONAL ACADEMIC RESEARCH FOR MULTIDISCIPLINARY Impact Factor 1.393, ISSN: , Volume 2, Issue 7, August 2014

Binary Multiplication

Chapter 4. Combinational Logic

Computer Architecture and Organization

Effective Improvement of Carry save Adder

EECS150 - Digital Design Lecture 13 - Combinational Logic & Arithmetic Circuits Part 3

IMPLEMENTATION OF TWIN PRECISION TECHNIQUE FOR MULTIPLICATION

RADIX-4 AND RADIX-8 MULTIPLIER USING VERILOG HDL

Adders, Subtracters and Accumulators in XC3000

Chapter 10 - Computer Arithmetic

ECE 30 Introduction to Computer Engineering

Binary Adders. Ripple-Carry Adder

Topics. 6.1 Number Systems and Radix Conversion 6.2 Fixed-Point Arithmetic 6.3 Seminumeric Aspects of ALU Design 6.4 Floating-Point Arithmetic

Reducing Computational Time using Radix-4 in 2 s Complement Rectangular Multipliers

9/6/2011. Multiplication. Binary Multipliers The key trick of multiplication is memorizing a digit-to-digit table Everything else was just adding

Chapter 4 Arithmetic Functions

ECE 341. Lecture # 7

Integer Encoding and Manipulation

CS/COE0447: Computer Organization

By, Ajinkya Karande Adarsh Yoga

CS/COE0447: Computer Organization

ECE 341 Midterm Exam

Learning Outcomes. Spiral 2 2. Digital System Design DATAPATH COMPONENTS

Outline. Introduction to Structured VLSI Design. Signed and Unsigned Integers. 8 bit Signed/Unsigned Integers

A novel technique for fast multiplication

MULTIPLICATION TECHNIQUES

Binary Arithmetic. Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T.

ECE 341 Midterm Exam

Overview. EECS Components and Design Techniques for Digital Systems. Lec 16 Arithmetic II (Multiplication) Computer Number Systems.

COMPUTER ORGANIZATION AND ARCHITECTURE

More complicated than addition. Let's look at 3 versions based on grade school algorithm (multiplicand) More time and more area


II. MOTIVATION AND IMPLEMENTATION

EC2303-COMPUTER ARCHITECTURE AND ORGANIZATION

Principles of Computer Architecture. Chapter 3: Arithmetic

Embedded Systems Ch 15 ARM Organization and Implementation

Learning Outcomes. Spiral 2 2. Digital System Design DATAPATH COMPONENTS

32-bit Signed and Unsigned Advanced Modified Booth Multiplication using Radix-4 Encoding Algorithm

COMPUTER ARITHMETIC (Part 1)

OPTIMIZING THE POWER USING FUSED ADD MULTIPLIER

EE878 Special Topics in VLSI. Computer Arithmetic for Digital Signal Processing

Lecture Topics. Announcements. Today: Integer Arithmetic (P&H ) Next: continued. Consulting hours. Introduction to Sim. Milestone #1 (due 1/26)

ECE 645: Lecture 1. Basic Adders and Counters. Implementation of Adders in FPGAs

Boolean Unit (The obvious way)

Computer Architecture and Organization: L04: Micro-operations

CO Computer Architecture and Programming Languages CAPL. Lecture 9

EXPERIMENT NUMBER 11 REGISTERED ALU DESIGN

Experiment 7 Arithmetic Circuits Design and Implementation

UNIT-III REGISTER TRANSFER LANGUAGE AND DESIGN OF CONTROL UNIT

Arithmetic Circuits. Design of Digital Circuits 2014 Srdjan Capkun Frank K. Gürkaynak.

Chapter 3 Part 2 Combinational Logic Design

Outline. Combinational Circuit Design: Practice. Sharing. 2. Operator sharing. An example 0.55 um standard-cell CMOS implementation

Combinational Circuit Design: Practice

EXPERIMENT #8: BINARY ARITHMETIC OPERATIONS

Tailoring the 32-Bit ALU to MIPS

Data Representation Type of Data Representation Integers Bits Unsigned 2 s Comp Excess 7 Excess 8

A High-Speed FPGA Implementation of an RSD- Based ECC Processor

Transcription:

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Digital Computer Arithmetic ECE 666 Part 6c High-Speed Multiplication - III Israel Koren Fall 2010 ECE666/Koren Part.6c.1 Array Multipliers The two basic operations - generation and summation of partial products - can be merged, avoiding overhead and speeding up multiplication Iterative array multipliers (or array multipliers) consist of identical cells, each forming a new partial product and adding it to previously accumulated partial product Gain in speed obtained at expense of extra hardware Can be implemented so as to support a high rate of pipelining ECE666/Koren Part.6c.2 Page 1

Illustration - 5 x 5 Multiplication Straightforward implementation - Add first 2 partial products (a4x0, a3x0,,a0 x0 and a4x1, a3x1,,a0x1) in row 1 after proper alignment The results of row 1 are then added to a4x2, a3x2,,a0x2 in row 2, and so on ECE666/Koren Part.6c.3 5 x 5 Array Multiplier for Unsigned Numbers Basic cell - FA accepting one bit of new partial product aixj + one bit of previously accumulated partial product + carry-in bit No horizontal carry propagation in first 4 rows - carry-save type addition - accumulated partial product consists of intermediate sum and carry bits Last row is a ripple-carry adder - can be replaced by a fast 2-operand adder (e.g., carry-look-ahead adder) ECE666/Koren Part.6c.4 Page 2

Array Multiplier for Two s Complement Numbers Product bits like a4x0 and a0x4 have negative weight Should be subtracted instead of added ECE666/Koren Part.6c.5 Type I and II Cells Type I cells: 3 positive inputs - ordinary FAs Type II cells: 1 negative and 2 positive inputs Sum of 3 inputs of type II cell can vary from -1 to 2 c output has weight +2 s output has weight -1 Arithmetic operation of type II cell - II s and c outputs given by ECE666/Koren Part.6c.6 Page 3

Type I and II Cells Type II' cells: 2 negative inputs and 1 positive Sum of inputs varies from -2 to 1 c output has weight -2 s output has weight +1 II Type I' cell: all negative inputs - has negatively weighted c and s outputs Counts number of -1's at its inputs - represents this number through c and s outputs Same logic operation as type I cell - same gate implementation Similarly - types II and II' have the same gate implementation ECE666/Koren Part.6c.7 Booth s Algorithm Array Multiplier For two's complement operands n rows of basic cells - each row capable of adding or subtracting a properly aligned multiplicand to previously accumulated partial product Cells in row i perform an add, subtract or transfer-only operation, depending on xi and reference bit 4-bit operands Controlled add/ subtract/shift (CASS) ECE666/Koren Part.6c.8 Page 4

Controlled add/subtract/shift - CASS H and D: control signals indicating type of operation H=0: no arithmetic operation done H=1: arithmetic operation performed - new Pout Type of arithmetic operation indicated by D signal D=0: multiplicand bit, a, added to Pin with cin as incoming carry - generating Pout and cout as outgoing carry D=1: multiplicand bit, a, subtracted from Pin with incoming borrow and outgoing borrow Pout=Pin (a H) (cin H) cout=(pin D)(a+cin) + a cin Alternative: combination of multiplexer (0, +a and -a) and FA H and D generated by CTRL - based on xi and reference bit x{i-1} ECE666/Koren Part.6c.9 Booth s Algorithm Array Multiplier - details First row - most significant bit of multiplier Resulting partial product need be shifted left before adding/subtracting next multiple of multiplicand A new cell with input Pin=0 is added Number of bits in partial product increases by one each row - expand multiplicand before adding/subtracting it Accomplished by replicating sign bit of multiplicand ECE666/Koren Part.6c.10 Page 5

Properties and Delay Cannot take advantage of strings of 0's or 1's - cannot eliminate or skip rows Only advantage: ability to multiply negative numbers in two's complement with no need for correction Operation in row i need not be delayed until all upper (i-1) rows have completed their operation P0, generated after one CASS delay (plus delay of CTRL), P1 generated after two CASS delays, and P{2n-2}, generated after (2n-1) CASS delays Similarly can implement higher-radix multiplication requiring less rows Building block: multiplexer-adder circuit that selects correct multiple of multiplicand A and adds it to previously accumulated partial product ECE666/Koren Part.6c.11 Pipelining Important characteristic of array multipliers - allow pipelining Execution of separate multiplications overlaps The long delay of carry-propagating addition must be minimized Achieved by replacing CPA with several additional rows - allow carry propagation of only one position between consecutive rows To support pipelining, all cells must include latches - each row handles a separate multiplier-multiplicand pair Registers needed to propagate multiplier bits to their destination, and propagate completed product bits ECE666/Koren Part.6c.12 Page 6

Pipelined Array Multiplier ECE666/Koren Part.6c.13 Optimality of Multiplier Implementations Bounds on performance of algorithms for multiplication Theoretical bounds for multiplication similar to those for addition Adopting the idealized model using (f,r) gates: Execution time of a multiply circuit for two operands with n bits satisfies Tmult log f 2n If residue number system is employed: Tmult log f 2m m - number of digits needed to represent largest modulus in residue number system ECE666/Koren Part.6c.14 Page 7

Optimal Implementations Need to compare performance (execution time) and implementation costs (e.g., regularity of design, total area, etc.) Objective function like A T can be used A - area and T - execution time α A more general objective function: A T α can be either smaller or larger than 1 ECE666/Koren Part.6c.15 Basic Array Multiplier Very regular structure - can be implemented as a rectangular-shaped array - no waste of chip area n least significant bits of final product are produced on right side of rectangle; n most significant bits are outputs of bottom row of rectangle Highly regular and simple layout but has two drawbacks: Requires a very large area, proportional to n², since it contains about n² FAs and AND gates Long execution time T of about 2 n FA ( FA - delay of FA) More precisely, T consists of (n-1) FA for first (n-1) rows and (n-1) FA for CPA (ripple-carry adder) AT is proportional to n³ ECE666/Koren Part.6c.16 Page 8

Pipelined & Booth Array Multipliers Required area increases even further (CPA replaced) Latency of a single multiply operation increases However, pipeline period ( pipeline rate) shorter Booth based array multiplier offers no advantage A - order of n² and T - linear in n Radix-4 Booth can potentially be better - only n/2 rows - could reduce T and A by factor of two However, actual delay & area higher - recoding logic and, more importantly, partial product selectors, add complexity & interconnections - longer delay per row Also, since relative shift between adjacent rows is two bits, must allow carry to propagate horizontally Can be achieved locally or in last row - then carry propagation through 2n-1 bits (instead of n-1) Exact reduction depends on design and technology ECE666/Koren Part.6c.17 Radix-8 Booth & CSA Tree Similar problems with radix-8 Booth's array multiplier In addition, 3A should be precalculated Reduction in delay and area may be less than expected 1/3 Still, may be cost-effective in certain technologies and design styles Partial products can be accumulated using a cascade or a tree structure with shorter execution time But CSA tree structures have irregular interconnects - no area-efficient layout with a rectangular shape Moreover - overall width 2n usually required - multiplier area of order 2n log k AT may increase as 2n log² k ECE666/Koren Part.6c.18 Page 9

Delay of Balanced Delay Tree Balanced delay tree - more regular structure Increments in number of operands - 3,3,5,7,9... 2 Sum of series - order of k=j (j - number of elements in series, k - number of operands) Number of levels - determines overall delay - linear in j= k Compare to log k - number of levels in complete binary tree Proof: exercise Above expressions - theoretical, limited practical significance Detailed analysis of alternative designs is necessary for specific technology ECE666/Koren Part.6c.19 Page 10