AT Arithmetic. Integer addition

Transcription:

AT Arithmetic

Most of the effort in arithmetic design has gone into creating fast implementations, especially of floating-point (FP) arithmetic. Under the AT (area-time) rule, area is (almost) as important as time. So it is important to know the latency, bandwidth, and area that any particular algorithm requires.

Michael Flynn, EE 382 Processor Design, Winter 98/99

Integer addition

Adders are the fundamental building block of the processor, defining t. Adder types include carry chain, carry select (conditional sum), carry lookahead (Brent-Kung), canonic (prefix), carry skip, and Ling. Most high-speed 32b adders take about the same area (f normalized): 1 A to 1.5 A.

Integer addition

Both area and time scale with n, the adder precision. The delay, t, scales slowly (as log n). Area scales about linearly with n, so a 64b adder takes 2-3 A but still fits into t, perhaps by definition of a cycle.

[Figure: carry skip adder]
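
The carry skip idea can be illustrated with a small behavioral sketch in Python. It is not taken from the slides: the function name carry_skip_add, the 4-bit block size, and the skip counter are assumptions made for this example, and the width is assumed to be a multiple of the block size. Each block either produces its own carry-out or, when every bit position propagates, passes the incoming carry straight through on the skip path.

```python
def carry_skip_add(a: int, b: int, width: int = 32, block: int = 4, cin: int = 0):
    """Add two unsigned width-bit integers block by block.

    Returns (sum, carry_out, skipped_blocks). width is assumed to be a
    multiple of block (an assumption of this sketch).
    """
    mask = (1 << block) - 1
    result = 0
    carry = cin
    skipped = 0
    for lo in range(0, width, block):
        x = (a >> lo) & mask
        y = (b >> lo) & mask
        total = x + y + carry
        result |= (total & mask) << lo
        # Block propagate: every bit position has exactly one input bit set,
        # so the block's carry-out equals its carry-in.
        if (x ^ y) == mask:
            skipped += 1          # carry bypasses the block on the skip path
        else:
            carry = total >> block  # carry generated (or killed) inside the block
    return result, carry, skipped


if __name__ == "__main__":
    print(carry_skip_add(0xFFFFFFFF, 1))  # (0, 1, 7)
```

In the example call, only the lowest block generates a carry; the other seven blocks are all-propagate and simply forward it, which is the case the skip logic is meant to speed up.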

[Figure: Manchester carry chain]

[Figure: carry skip logic]

[Figure: carry select addition]

FP addition

A basic FP adder has 5 steps: exponent difference, pre-align, significand add, post-align, and round. Assuming that a full shifter has about the same complexity (delay and area) as an add, 64b FP addition takes 7-10 A and has about 5 t execution time.
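
The five steps can be walked through in a short Python sketch. Everything specific here is an assumption for illustration and not from the slides: the toy format (an 8-bit significand stored as a (sign, exponent, significand) triple), the use of three extra guard/round/sticky bits, round-to-nearest-even, and the name fp_add. Inputs are assumed nonzero and normalized; special values are ignored.

```python
P = 8    # significand width of the toy format (assumption for this sketch)
GRS = 3  # extra low-order bits kept during the add: guard, round, sticky


def fp_add(a, b):
    """Add two toy FP numbers given as (sign, exponent, significand).

    A triple has the value (-1)**sign * significand * 2**exponent, and each
    input significand is assumed to be normalized to exactly P bits.
    """
    (sa, ea, ma), (sb, eb, mb) = a, b
    # Step 1: exponent difference; keep the larger-magnitude operand in a.
    if ea < eb or (ea == eb and ma < mb):
        (sa, ea, ma), (sb, eb, mb) = (sb, eb, mb), (sa, ea, ma)
    d = ea - eb
    # Step 2: pre-align the smaller operand; OR the dropped bits into sticky.
    ma <<= GRS
    mb <<= GRS
    sticky = 1 if mb & ((1 << d) - 1) else 0
    mb = (mb >> d) | sticky
    # Step 3: significand add (or subtract when the signs differ).
    m = ma + mb if sa == sb else ma - mb
    if m == 0:
        return (0, 0, 0)
    e = ea - GRS
    # Step 4: post-align so that m has exactly P + GRS bits.
    while m >= 1 << (P + GRS):
        m = (m >> 1) | (m & 1)   # right shift, keeping the sticky information
        e += 1
    while m < 1 << (P + GRS - 1):
        m <<= 1                  # left shift after cancellation
        e -= 1
    # Step 5: round to nearest even using the three extra bits.
    low = m & 7
    m >>= GRS
    e += GRS
    if low > 4 or (low == 4 and m & 1):
        m += 1
    if m == 1 << P:              # rounding overflowed the significand
        m >>= 1
        e += 1
    return (sa, e, m)


if __name__ == "__main__":
    print(fp_add((0, 0, 128), (0, -8, 129)))   # (0, 0, 129)
    print(fp_add((0, -7, 128), (1, -8, 255)))  # (0, -15, 128)
```

The second call shows why the post-align step can need a long shift: subtracting two nearly equal operands cancels most of the leading bits, and the result has to be shifted left eight places to renormalize.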

FP addition

Advanced FP adders are faster and use more area:

1) A two-path FADD creates separate paths for operands:
   - a path for operands whose exponents are close in value (subtract); this is the only case that needs a full shift to renormalize the result
   - a path for the other cases, where the exponent difference is > 2; this is the only case that uses a full shift to pre-align the significands
2) A FADD with integrated rounding. Here the rounding step is eliminated by computing both the sum/difference and the result plus 1. This is done by using 2 adders (or a compound adder) and then MUXing out the final result.

FP adders

The two-path FP adder uses an additional significand adder and exponent adder, about 3-4 A. It reduces FADD delay by one t. Integrated rounding adds another rounding adder plus a MUX, another 3-4 A, while reducing delay by another t.
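
The integrated rounding idea (form the sum and the sum plus 1 side by side, then select one) can be sketched as follows. This is a simplified model and an assumption of this example, not the circuit from the slides: it handles only the addition of non-negative integers, uses round-to-nearest-even, and ignores the one-bit normalization shift a real FADD must also fold into the selection. The name rounded_aligned_sum is hypothetical.

```python
def rounded_aligned_sum(a: int, b: int, d: int) -> int:
    """Round a + b / 2**d to the nearest integer, ties to even (a, b >= 0).

    b is the operand that was shifted right by d during pre-alignment, so
    only b contributes bits below the result's least significant bit.
    """
    hi = b >> d                                   # aligned part of b
    guard = (b >> (d - 1)) & 1 if d >= 1 else 0   # first dropped bit
    sticky = (b & ((1 << (d - 1)) - 1)) != 0 if d >= 2 else False
    sum0 = a + hi        # first adder
    sum1 = a + hi + 1    # second adder (or the +1 output of a compound adder)
    # The rounding decision only drives the final MUX; no extra add follows.
    round_up = guard == 1 and (sticky or (sum0 & 1) == 1)
    return sum1 if round_up else sum0


if __name__ == "__main__":
    print(rounded_aligned_sum(128, 129, 8))  # 129
    print(rounded_aligned_sum(128, 128, 8))  # 128 (tie, already even)
    print(rounded_aligned_sum(129, 128, 8))  # 130 (tie, rounds to even)
```

Both candidate results exist before the rounding decision is known, which is how the separate rounding step drops out of the critical path.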

FP adders

Net area-time tradeoff:

- Basic: area 10 A and delay 4-5 t
- Two path: area 13.5 A and delay 3-4 t
- Integrated round (with two paths): area 17 A and delay 2-3 t

For pipelining, add 1 A per pipe stage and use the upper range on t.

Multipliers

After add, multiply is the most important arithmetic op. Approaches:

- encode the multiplier bits (Booth 2, Booth 3, ...)
- assimilate the partial products in one, two, or n passes (iterated arrays or trees)
   - arrays (simple, double, higher level)
   - trees (Wallace, binary [4:2], ZD, ...)
- CPA to produce the product

Multipliers

Integer and FP multipliers usually have about the same execution time (for the same precision, n). Booth encoding reduces the number of partial products (pp's) but adds MUXs to generate them. Most of the area, and probably the delay too, is in the pp reduction tree.

[Figure: 16 bit Booth 2 multiply]

[Figure: 16 bit Booth 2 example]

[Figure: 16 bit Booth 2 pp selector logic]
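
Since the Booth 2 figures are not reproduced here, the following Python sketch shows the recoding they refer to. It assumes the standard radix-4 Booth encoding of a signed two's complement multiplier (an unsigned 16-bit multiplier would need one extra digit); the function names are hypothetical. Each overlapping group of three bits selects a partial product of 0, ±1, or ±2 times the multiplicand, so 16 multiplier bits give 8 partial products instead of 16.

```python
def booth2_digits(multiplier: int, bits: int = 16):
    """Radix-4 Booth digits (least significant first), each in {-2,...,2}.

    multiplier is interpreted as a signed two's complement value of the
    given (even) width.
    """
    m = multiplier & ((1 << bits) - 1)
    digits = []
    prev = 0  # implicit 0 to the right of bit 0
    for i in range(0, bits, 2):
        b0 = (m >> i) & 1
        b1 = (m >> (i + 1)) & 1
        digits.append(b0 + prev - 2 * b1)  # value of the 3-bit group
        prev = b1
    return digits


def booth2_multiply(multiplicand: int, multiplier: int, bits: int = 16) -> int:
    """Sum the Booth-selected partial products, each shifted by 2 bits."""
    return sum((d * multiplicand) << (2 * i)
               for i, d in enumerate(booth2_digits(multiplier, bits)))


if __name__ == "__main__":
    print(booth2_digits(7, bits=4))     # [-1, 2], i.e. 7 = -1 + 2*4
    print(booth2_multiply(1234, -567))  # -699678
```

In hardware, the multiplications by ±1 and ±2 are just a shift and an optional complement, which is what the pp selector MUXs implement.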

[Figure: 16 bit Booth 3 multiply]

[Figure: 5 x 5 unsigned multiplication]

[Figure: 1-bit adder]

[Figure: Wallace tree]
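
As a stand-in for the missing figures, here is a minimal Python sketch of carry-save reduction. It assumes unsigned partial products and a simple grouping of three operands per 3:2 carry-save adder (CSA) at each level; the function names are hypothetical, and a real Wallace tree counts bits per column rather than whole words.

```python
def csa(x: int, y: int, z: int):
    """3:2 carry-save adder: a row of 1-bit full adders, no carry propagation."""
    s = x ^ y ^ z
    c = ((x & y) | (x & z) | (y & z)) << 1
    return s, c


def wallace_reduce(terms):
    """Reduce a list of non-negative operands to at most two using CSA levels.

    Returns (remaining_terms, number_of_levels).
    """
    terms = list(terms)
    levels = 0
    while len(terms) > 2:
        nxt = []
        for i in range(0, len(terms) - 2, 3):
            nxt.extend(csa(*terms[i:i + 3]))
        # Operands that did not fit into a group of three pass through.
        nxt.extend(terms[len(terms) - len(terms) % 3:])
        terms = nxt
        levels += 1
    return terms, levels


def multiply_5x5(a: int, b: int) -> int:
    """5 x 5 unsigned multiply: pp generation, CSA tree, then a final CPA."""
    pps = [(a << i) if (b >> i) & 1 else 0 for i in range(5)]
    terms, _ = wallace_reduce(pps)
    return sum(terms)  # the carry-propagate add (CPA)


if __name__ == "__main__":
    print(wallace_reduce([1, 2, 3, 4, 5])[1])  # 3 levels for 5 operands
    print(multiply_5x5(27, 19))                # 513
```

Only the last step propagates carries across the word; every CSA level has the delay of a single 1-bit adder regardless of operand width.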

[Figure: Wallace tree reduction]

Multipliers

A full tree implementation of a 54b (FP type) multiplier with Booth 2 has tree height 28 and uses about 2500 CSAs (or about 50 A in the tree). Maybe a total of 10 A in MUXs plus 50 A in the tree and 3 A in the CPA, 62 A total. The fastest multiplier is, maybe, 2 t.

Using a 2 pass tree reduces the hardware considerably: height is 14, using about 700 CSAs or 14 A. Total area is 5 + 14 + 3 = 22 A; delay is 3-4 t.
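
The area budget above is simple enough to reproduce in a few lines of Python. The conversion factor of 50 CSAs per A is inferred from the figures quoted (2500 CSAs for about 50 A, 700 CSAs for 14 A); the function name and parameters are hypothetical.

```python
CSAS_PER_A = 2500 / 50  # about 2500 CSAs are equated with about 50 A


def multiplier_area(mux_area: float, tree_csas: int, cpa_area: float = 3.0) -> float:
    """Total area in units of A: Booth MUXs + pp reduction tree + final CPA."""
    return mux_area + tree_csas / CSAS_PER_A + cpa_area


if __name__ == "__main__":
    print(multiplier_area(mux_area=10, tree_csas=2500))  # 63.0 (full tree)
    print(multiplier_area(mux_area=5, tree_csas=700))    # 22.0 (2 pass tree)
```

The listed full-tree components add up to 63 A, which is in line with the rough total of 62 A quoted above; in both designs the reduction tree dominates the area.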

Multipliers

To pipeline the multiplier we need a full tree implementation; probably 3-4 t. Perhaps Booth 3, followed by a full tree (h = 17) and a CPA stage. Probably area = 50-60 A.

Divide

An infrequent op, but its long latency can affect the IPC achieved. Algorithms:

- SRT, 2 or 3 bit (32-36 t), maybe 6-10 A
- NR or binomial expansion (10-14 t); needs at least 6 A for table and control, plus use of the MPY
- Bipartite tables for small n (less than 24b)

Divide

SRT

- creates the quotient 2 or 3 bits per iteration
- uses a divisor / partial remainder lookup table for the trial quotient, then subtracts
- the result (partial remainder) is in redundant form, so no restoration is needed; the result is also left as a sum and carry pair (no CPA needed)
- fast iteration is possible, sometimes 2x per t

Divide

Multiply based

- use either Newton-Raphson (NR) or binomial series
- if f(x) = b - 1/x, the root is at x = 1/b
- then the NR iteration is x_{i+1} = x_i (2 - b x_i)
- convergence is quadratic: each iteration doubles the precision of the result
- so start with a table lookup of 1/b to 8b; then 3 iterations give a 64b result
- then a x (1/b) is the quotient
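
The quadratic convergence can be checked with exact rational arithmetic. The sketch below is an illustration under stated assumptions: the divisor b is taken in [1, 2), the 8b table lookup is emulated by rounding 1/b to a multiple of 2^-8, and the function names are hypothetical. Writing e_i = 1 - b x_i, the iteration gives e_{i+1} = 1 - b x_i (2 - b x_i) = e_i^2, so the error is squared at every step.

```python
from fractions import Fraction


def reciprocal_seed(b: Fraction, table_bits: int = 8) -> Fraction:
    """Emulate a table lookup: 1/b rounded to table_bits fractional bits."""
    scale = 1 << table_bits
    return Fraction(round(scale / b), scale)


def nr_reciprocal(b: Fraction, iterations: int = 3, table_bits: int = 8):
    """Newton-Raphson reciprocal x_{i+1} = x_i * (2 - b * x_i).

    Returns the final approximation and the error |1 - b*x_i| after the
    seed and after each iteration.
    """
    x = reciprocal_seed(b, table_bits)
    errors = [abs(1 - b * x)]
    for _ in range(iterations):
        x = x * (2 - b * x)
        errors.append(abs(1 - b * x))
    return x, errors


if __name__ == "__main__":
    b = Fraction(3, 2)
    x, errors = nr_reciprocal(b)
    print([float(e) for e in errors])  # each entry is the square of the previous one
    print(float(7 * x))                # quotient 7 / 1.5 formed as a * (1/b)
```

For b = 3/2 the seed error is 2^-9, and three iterations bring it to 2^-72, comfortably below 2^-64. Each iteration costs two multiplies, which is why this approach reuses the existing multiplier.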

Divide

Divide is not usually pipelined, except for small n implementations. It is frequently combined with square root in the same implementation.

Sub word concurrency

Provides 8, 16, and 32b concurrent ops within existing integer or FP hardware. A 64b integer unit can do 8x8, 4x16, or 2x32 ops concurrently. Since FP units are designed to be faster, one may use them instead: 8x4, or 2x16, or 2x24.
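
A partitioned add can be modelled in software with a well-known masking trick, shown below as a hedged sketch. The function name packed_add and its parameters are assumptions for this example; hardware would instead break the carry chain at the lane boundaries. The top bit of each lane is handled separately so that no carry crosses into the neighboring lane.

```python
def packed_add(x: int, y: int, lane_bits: int, word_bits: int = 64) -> int:
    """Lane-wise wrapping add of two words split into lane_bits-wide lanes."""
    lanes = word_bits // lane_bits
    # Mask with the most significant bit of every lane set.
    msb = sum(1 << (i * lane_bits + lane_bits - 1) for i in range(lanes))
    low = ((1 << word_bits) - 1) ^ msb
    # Add everything except the lane MSBs; a carry can reach a lane's MSB
    # position but cannot leave the lane.
    partial = (x & low) + (y & low)
    # Fold the MSBs back in with XOR (a carry-less add of the top bits).
    return partial ^ ((x ^ y) & msb)


if __name__ == "__main__":
    # Two 8b lanes in a 16b word: 0xFF + 0x01 wraps to 0x00, 0x01 + 0x01 = 0x02.
    print(hex(packed_add(0xFF01, 0x0101, lane_bits=8, word_bits=16)))  # 0x2
```

The same function covers the 8x8, 4x16, and 2x32 cases of a 64b word by changing lane_bits.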

Sub word concurrency

- Usually only for add and multiply.
- Implementations are straightforward for add; more complicated for multiply.
- Multiply requires reorganizing partitions of the pp tree, which affects multiply area and delay marginally (maybe 10% delay and 20% area).
- The ISA must define saturating arithmetic.
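
Saturating arithmetic clamps a lane's result at the largest representable value instead of letting it wrap around. The sketch below shows one possible definition for unsigned lanes; the function name, the default of eight 8b lanes, and the choice of unsigned saturation are assumptions for this example, since the slides do not specify a particular ISA.

```python
def saturating_add_lanes(x: int, y: int, lane_bits: int = 8, lanes: int = 8) -> int:
    """Lane-wise unsigned add that clamps each lane at its maximum value."""
    lane_max = (1 << lane_bits) - 1
    out = 0
    for i in range(lanes):
        shift = i * lane_bits
        a = (x >> shift) & lane_max
        b = (y >> shift) & lane_max
        out |= min(a + b, lane_max) << shift  # clamp instead of wrapping
    return out


if __name__ == "__main__":
    # Upper lane: 0xF0 + 0x20 saturates to 0xFF; lower lane: 0x10 + 0x10 = 0x20.
    print(hex(saturating_add_lanes(0xF010, 0x2010, lanes=2)))  # 0xff20
```

Clamping matters for media data such as pixel values, where a wrapped result (0xF0 + 0x20 becoming 0x10) would turn a bright value into a dark one.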