Floating-Point Division Algorithms for an x86 Microprocessor with a Rectangular Multiplier


Michael J. Schulte, University of Wisconsin, Schulte@engr.wisc.edu
Dimitri Tan, Advanced Micro Devices, Dimitri.Tan@amd.com
Carl E. Lemonds, Advanced Micro Devices, Carl.Lemonds@amd.com

Abstract

Floating-point division is an important operation in scientific computing and multimedia applications. This paper presents and compares two division algorithms for an x86 microprocessor, which utilizes a rectangular multiplier that is optimized for multimedia applications. The proposed division algorithms are based on Goldschmidt's division algorithm and provide correctly rounded results for IEEE 754 single, double, and extended precision floating-point numbers. Compared to a previous Goldschmidt division algorithm, the fastest proposed algorithm requires 25% to 37% fewer cycles, while utilizing a multiplier that is roughly 2.5 times smaller.

1. Introduction

In an x86 microprocessor, the floating-point unit (FPU) has undergone considerable change in recent years. Much of this change is due to the advent of Streaming SIMD Extensions (SSE) [1]. These extensions, mainly driven by multimedia applications (3D graphics, video, etc.), have added complexity to recent FPU designs. Prior to the addition of SSE, the FPU in x86 microprocessors only had to support x87 scientific floating-point instructions. In x87 mode, the FPU performs arithmetic operations on 80-bit extended-precision floating-point numbers, and then rounds the results to 32-bit single, 64-bit double, or 80-bit extended precision floating-point numbers [2]. Floating-point arithmetic in x86 microprocessors complies with the specifications given in the IEEE-754 Standard for Binary Floating-Point Arithmetic [3].

With the growing importance of multimedia applications, the FPU is now required to support both x87 instructions and SSE instructions. In 1999, Intel introduced SSE instructions that perform multiple floating-point arithmetic operations on single-precision floating-point data types [1]. For example, a single SSE instruction, DIVPS, performs four single-precision floating-point divide operations.
A few years later, SSE2 introduced new instructions for parallel double-precision operations. Recently, SSE3 added horizontal arithmetic and asymmetric arithmetic operations, but no new data formats. Multimedia applications are placing a greater emphasis on SSE performance over x87. Hence, the FPU workload is shifting from engineering and scientific computing to multimedia applications.

We are designing an FPU that utilizes a 27-bit by 76-bit rectangular multiplier, in which the length of the multiplier operand is less than the length of the multiplicand operand. This reduces the area of the multiplier, but requires multiple passes through the multiplier to produce a full-precision result. Our multiplier is optimized for single-precision SSE instructions, which are widely used in multimedia applications [1, 4]. The multiplier can perform two parallel single-precision multiplies each cycle with a latency of two cycles. It can perform one double-precision multiply every other cycle with a latency of three cycles or one extended-precision multiply every three cycles with a latency of four cycles. Compared to a fully-pipelined multiplier, the rectangular multiplier improves the latency of single precision multiplies and reduces the area of the FPU. It also has the potential to reduce power dissipation for multimedia applications. In addition to performing multiplication, the rectangular multiplier is used to perform division, square root, and elementary function computations.

Due to its importance in scientific computing and multimedia applications, several algorithms for floating-point division have been developed [5]. These algorithms can be divided into three main categories: digit recurrence, very high-radix, and functional iteration. Digit recurrence algorithms, such as restoring division, non-restoring division, and SRT division, compute a fixed number of quotient bits each iteration [6]. Very high-radix division algorithms, including accurate quotient approximations [7], the short reciprocal algorithm [8, 9, 10], and prescaling and selection by rounding algorithms [11, 12], are digit recurrence algorithms that compute a large number of quotient bits (e.g., 8 or more) each iteration. Functional iteration algorithms, such as Goldschmidt's algorithm [13] and Newton-Raphson iteration [14], typically obtain an estimate of the divisor's reciprocal, and then use multiplication and subtraction to double the number of accurate quotient bits each iteration.

1-4244-1258-7/07/$25.00 ©2007 IEEE

In this paper, we present and compare two division algorithms for an x86 microprocessor with a rectangular multiplier. These algorithms are based on Goldschmidt's division algorithm and provide support for single, double, and extended precision floating-point numbers. The algorithms are also compared to the algorithm and implementation used on the AMD-K7 FPU [15], which employs Goldschmidt's algorithm to perform division, but uses a fully pipelined multiplier. Some of our goals in developing these algorithms include (1) the algorithms should have a small impact on the architecture and performance of the multiplier, (2) they should be able to efficiently utilize the rectangular multiplier and high-speed reciprocal approximations, (3) they should have low latencies and not require unnecessary passes through the rectangular multiplier, (4) they should be optimized for single-precision numbers, but also be able to efficiently support double and extended-precision numbers, and (5) they should produce correctly rounded results, as specified in the IEEE 754 Standard for Binary Floating-Point Arithmetic.

The main contribution of this paper is the presentation of two new division algorithms that are designed to be implemented with a rectangular multiplier and provide support for x87 and SSE data types. The algorithms presented in this paper are based on Goldschmidt's division algorithm and are able to utilize the rectangular multiplier and high-speed reciprocal approximations. Our algorithms have low latencies, especially for single-precision numbers. Compared to very high-radix algorithms, our algorithms require fewer modifications to the multiplier architecture.
They have lower latencies than equivalent Newton-Raphson-based division algorithms, since there are fewer dependences between multiplications.

The remainder of this paper is organized as follows: Section 2 gives an overview of Goldschmidt's division algorithm. Section 3 presents the design of a 27-bit by 76-bit rectangular multiplier that provides high-performance single-precision multiplications and is extended to implement the proposed division algorithms. Section 4 discusses a previous implementation of Goldschmidt's division algorithm on the AMD-K7 FPU, and describes our proposed division algorithms. Section 5 compares the division algorithms, and Section 6 gives our conclusions.

In the following sections, upper-case variables denote operands and lower-case variables denote bits within those operands. Individual bits are indexed by their bit position, with the more significant bits having lower indices. For example, $X = x_0.x_1 x_2 \ldots x_{n-1}$ has the value:

$$V = \sum_{i=0}^{n-1} x_i \cdot 2^{-i}$$

When bits $i$ through $j$ of $X$ are accessed, we use the notation $X[i:j]$, where $X[i:j] = x_i x_{i+1} \ldots x_{j-1} x_j$ for $i < j$.

2. Goldschmidt's division algorithm

Goldschmidt's division algorithm is also known as division by multiplicative normalization, division by convergence, and division by series expansion. It has been implemented in the IBM 360/91 [16], the TMS390C602A [17], the IBM S/390 G4 [18], and the AMD-K7 microprocessor [15]. Various publications describe Goldschmidt's division algorithm [19, 20, 21], its error analysis [22], and its implementation using pipelined multipliers [23, 24].

Goldschmidt's division algorithm computes the quotient $Q = A/B$ by starting with an initial approximation to the divisor's reciprocal, $X_0 \approx 1/B$. It then multiplies $X_0$ by the dividend, $A$, and divisor, $B$, to obtain:

$$N_0 = X_0 \cdot A \qquad (1)$$
$$D_0 = X_0 \cdot B \qquad (2)$$
$$R_0 = 2 - D_0 \qquad (3)$$

After this, $m$ iterations are performed, where:

$$N_{i+1} = R_i \cdot N_i \qquad (4)$$
$$D_{i+1} = R_i \cdot D_i \qquad (5)$$
$$R_{i+1} = 2 - D_{i+1} \qquad (6)$$

Finally, $N_m$ is multiplied by $R_m$ to obtain $Q$. Each iteration requires two multiplications and one subtraction (or complement operation) and approximately doubles the number of accurate bits.
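As a minimal sketch of Equations (1) to (6), the iteration can be written in Python with exact rational arithmetic standing in for the fixed-width hardware datapath. The function name `goldschmidt`, the operand values, and the starting estimate `x0` are illustrative assumptions, not taken from the paper; in particular, no rounding or truncation is modeled.

```python
from fractions import Fraction


def goldschmidt(a, b, x0, m):
    """Run m iterations of Equations (1)-(6) in exact arithmetic.

    Returns the quotient estimate N_m * R_m and the list of errors
    |N_i - A/B| for i = 0..m.
    """
    n = x0 * a            # Equation (1)
    d = x0 * b            # Equation (2)
    r = 2 - d             # Equation (3)
    errors = [abs(n - a / b)]
    for _ in range(m):
        n = r * n         # Equation (4)
        d = r * d         # Equation (5)
        r = 2 - d         # Equation (6)
        errors.append(abs(n - a / b))
    return n * r, errors  # final multiplication N_m * R_m


if __name__ == "__main__":
    # Hypothetical operands: A = 1.5, B = 1.25, and a coarse estimate of 1/B = 0.8.
    a, b = Fraction(3, 2), Fraction(5, 4)
    q, errors = goldschmidt(a, b, Fraction(13, 16), 2)
    for i, err in enumerate(errors):
        print(f"|N_{i} - A/B| = {float(err):.3e}")
    print(f"|Q - A/B|   = {float(abs(q - a / b)):.3e}")
```

With these inputs the error in $N_i$ shrinks roughly quadratically from one iteration to the next, which is the "doubling of accurate bits" behavior described above.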
If $X_0$ has an absolute error of $\varepsilon_0$ and computations are performed without rounding error, then:

$$N_0 = X_0 \cdot A = \left(\frac{1}{B} + \varepsilon_0\right) A = \frac{A}{B} + A\varepsilon_0 \qquad (7)$$
$$D_0 = X_0 \cdot B = \left(\frac{1}{B} + \varepsilon_0\right) B = 1 + B\varepsilon_0 \qquad (8)$$
$$R_0 = 2 - D_0 = 2 - (1 + B\varepsilon_0) = 1 - B\varepsilon_0 \qquad (9)$$

In the next iteration:

$$N_1 = R_0 \cdot N_0 = (1 - B\varepsilon_0)\left(\frac{A}{B} + A\varepsilon_0\right) = \frac{A}{B} - AB\varepsilon_0^2 \qquad (10)$$
$$D_1 = R_0 \cdot D_0 = (1 - B\varepsilon_0)(1 + B\varepsilon_0) = 1 - B^2\varepsilon_0^2 \qquad (11)$$
$$R_1 = 2 - D_1 = 2 - (1 - B^2\varepsilon_0^2) = 1 + B^2\varepsilon_0^2 \qquad (12)$$

In general, when $N_i$ is close to $A/B$, $D_{i+1}$ and $R_{i+1}$ converge towards 1.0 and $N_{i+1}$ converges towards $A/B$. Each iteration roughly doubles the number of accurate bits in the quotient approximation, $N_i$.

Since $R_i$ is close to 1.0, not all of the bits of $R_i$ are needed to compute $N_{i+1}$ and $D_{i+1}$. Let $\varepsilon_{R_i} = R_i - 1$. If $0 \le \varepsilon_{R_i} < 2^{-k}$, $R_i$ has the form $1.00\ldots0\, r_{k+1} r_{k+2} \ldots r_{n-1}$. If $-2^{-k} \le \varepsilon_{R_i} < 0$, $R_i$ has the form $0.11\ldots1\, r_{k+1} r_{k+2} \ldots r_{n-1}$. Consequently, the $k$ most significant bits of $R_i$ are not needed when computing $N_{i+1}$ and $D_{i+1}$. Using the substitution $R'_i = R_i - 1$, Equations (4) to (6) can be rewritten as:

$$N_{i+1} = N_i + R'_i \cdot N_i \qquad (13)$$
$$D_{i+1} = D_i + R'_i \cdot D_i \qquad (14)$$
$$R'_{i+1} = 1 - D_{i+1} \qquad (15)$$

Although this approach requires extra additions to implement Equations (13) and (14), it has the advantage that $R'_i$ is close to zero, which lets $R'_i \cdot N_i$ and $R'_i \cdot D_i$ be computed with less precision. Instead of computing $R'_{i+1} = 1 - D_{i+1}$ directly, hardware computes $R_{i+1}$ as the one's complement of $D_{i+1}$ and then computes:

$$N_{i+1} = N_i + N_i \cdot R_i[k:n-1] \cdot 2^{-k} = N_i + \{0^k, N_i\} \cdot R_i[k:n-1] \qquad (16)$$
$$D_{i+1} = D_i + D_i \cdot R_i[k:n-1] \cdot 2^{-k} = D_i + \{0^k, D_i\} \cdot R_i[k:n-1] \qquad (17)$$

where $\{0^k, N_i\}$ denotes $N_i$ right shifted by $k$ bits. These computations multiply the appropriate bits from $R_i$ by $N_i$ or $D_i$ right shifted by $k$ bits and then add this product to $N_i$ or $D_i$, respectively.

3. Rectangular multiplier

The rectangular floating-point multiplier used to implement our proposed division algorithms has two pipeline stages, as shown in Figure 1. The first stage, EX1, consists of a 27-bit by 76-bit tree multiplier that accepts the two numbers to be multiplied, along with a 76-bit feedback term in carry-save format, and produces a 103-bit product in carry-save format. The second stage, EX2, consists of combined addition and rounding, result multiplexing, and forwarding to the register file and bypass networks.

The multiplier supports a range of precisions, with wider precision multiplies achieved by multiple passes through the first stage, EX1. It supports operations on single precision numbers with 24-bit significands, double precision numbers with 53-bit significands, and extended precision numbers with 64-bit significands. Similar to the AMD-K7 multiplier design [15], our multiplier also provides a variety of other multiplication sizes to facilitate accurate division, square root, and elementary function computations. The multiplication sizes supported include 24x24, 25x24, 27x76, 53x53, 54x53, 54x76, 64x64, 68x68 and 76x76. The multiplier also performs two single precision (dual 24x24) multiplications in parallel, which is frequently used in multimedia applications.

Figure 1. 27-bit by 76-bit multiplier

For each pass through the multiplier, the appropriate 27 bits of the multiplier operand are selected by the Unpack/Align Multiplexers. Two sets of radix-4 Booth encoders are required to support the dual 24x24 multiply. The Booth multiplexers produce fourteen 8-bit partial products, which are reduced, along with the two 76-bit feedback terms, using a partial product reduction tree implemented using three levels of 4-2 compressors. For the first pass, the feedback terms are all zeros. For subsequent passes, the feedback terms are obtained from the upper 76 bits of the carry-save product from the previous pass.

The rounding scheme implemented in the second stage, EX2, involves adding rounding constants to the carry-save product using 3-2 carry-save adders (CSAs) prior to the final addition [15]. The rounding is performed prior to normalization using two additions, with one addition assuming rounding overflow occurs and one addition assuming rounding overflow does not occur. A third addition computes the un-rounded significand [15]. An appropriate rounding constant is provided for each of the first two additions and is omitted for the un-rounded significand. Since the product generation for wider precision multiplies is split over multiple cycles, the lower 27 bits are processed after each pass to compute the sticky bit and the carry-in for the next pass. Table 1 shows the multiplier passes, latencies, and throughputs for supported multiplication sizes.

Table 1. Multiplier passes, latencies, and throughputs for supported multiplication sizes

    Multiplication sizes     Multiplier passes   Latency (cycles)   Throughput (mults/cycle)
    Dual 24x24               1                   2                  2
    24x24, 25x24, 27x76      1                   2                  1
    53x53, 54x53, 54x76      2                   3                  1/2
    64x64, 68x68, 76x76      3                   4                  1/3

4. Floating-point division algorithms

The division algorithms presented in this paper are derived from the AMD-K7 Goldschmidt division algorithm [15], which was designed for a fully-pipelined 76-bit by 76-bit multiplier. This section gives an overview of the AMD-K7 division algorithm [15]. It then presents our variations of Goldschmidt's division algorithm that are designed for an x86 microprocessor with the 27-bit by 76-bit rectangular multiplier presented in Section 3. The algorithms can be modified for other multiplier sizes.

Figure 2 shows the version of Goldschmidt's division algorithm implemented on the AMD-K7 and presented in [15]. This division algorithm only supports extended precision input operands with results rounded to single, double, extended, or internal precision. In Figure 2, A and B are the input operands.
PC is the significand precision control, where PC is 24 for single precision, 53 for double precision, 64 for extended precision, and 68 for internal precision. Division with an internal precision of 68 bits is used to compute certain elementary functions. RC is the rounding control, which indicates if the final result is rounded to nearest even, toward zero, toward minus infinity, or toward plus infinity. Q is the initial quotient approximation and Q_f is the final correctly rounded quotient. REM is a 2-bit variable that indicates the sign of the remainder and if the remainder is zero. The cycles shown on the right assume that the initial reciprocal estimate takes three cycles and each multiplication takes four cycles [15]. The division algorithm takes 16 cycles for single precision (PC = 24), 20 cycles for double precision (PC = 53), and 24 cycles for extended and internal precision (PC = 64 and 68, respectively).

    Program: Goldschmidt's Division Algorithm in the AMD-K7 with a 76 by 76 Multiplier [15]
    Input = (A, B, PC, RC), Output = (Q_f)

    Operations                                                      Cycles
    X_0 = recip_estimate(B)                                         1-3
    D_0 = itermul_76x76(X_0, B), R_0 = comp(D_0)                    4-7
    N_0 = itermul_76x76(X_0, A)                                     5-8
    if (PC == 24) {N_f = N_0, R_f = R_0, goto END DIVISION}
    D_1 = itermul_76x76(D_0, R_0), R_1 = comp(D_1)                  8-11
    N_1 = itermul_76x76(N_0, R_0)                                   9-12
    if (PC == 53) {N_f = N_1, R_f = R_1, goto END DIVISION}
    D_2 = itermul_76x76(D_1, R_1), R_2 = comp(D_2)                  12-15
    N_2 = itermul_76x76(N_1, R_1)                                   13-16
    N_f = N_2, R_f = R_2
    END DIVISION:
    Q = lastmul_76x76(N_f, R_f, PC+1)                               See +
    REM = backmul_76x76(Q, B, A), Q_f = round(Q, REM, PC, RC)       See *

    +  9-12 (PC = 24), 13-16 (PC = 53), 17-20 (PC = 64/68)
    * 13-16 (PC = 24), 17-20 (PC = 53), 21-24 (PC = 64/68)

Figure 2: Goldschmidt's algorithm in the AMD-K7

The algorithm shown in Figure 2 includes several operations, which are discussed in detail by Oberman [15]. The recip_estimate operation uses 2^10-entry by 16-bit and 2^10-entry by 7-bit bipartite tables to provide a reciprocal estimate that is accurate to at least 14.94 bits [15, 25]. The itermul_76x76 operation corresponds to a 76-bit by 76-bit multiply in which the result is rounded to 76 bits using round-to-nearest-even. The comp operation produces the one's complement of D_i, which is a 76-bit value. The lastmul_76x76 operation is a 76-bit by 76-bit multiply, which rounds its result to PC+1 bits of precision using round-to-nearest-even. PC+1 bits of precision are required in order to implement the AMD-K7 rounding technique [15]. The backmul_76x76 operation performs a 76-bit by 76-bit multiplication of Q and B and subtracts A to determine the sign of the remainder and if the remainder is equal to zero. The round operation produces the correctly rounded quotient using the AMD-K7 rounding technique [15].

To more efficiently implement Goldschmidt's division algorithm with a rectangular multiplier, our first version of Goldschmidt's algorithm (GS-1) uses a truncated version of R_i, in which the required precision of R_i is determined from a detailed error analysis. This analysis indicates correctly rounded results are still produced when R_0 is truncated to 30 bits and R_1 is truncated to 60 bits. Since R_0 must be longer than 27 bits, it needs two passes through the 27-bit by 76-bit multiplier, so R_0 is instead truncated to 54 bits. Similarly, since R_1 is longer than 54 bits, it needs three passes through the multiplier, so all 76 bits are used.

    Program: Goldschmidt's Division Algorithm with Truncated R_i on a 27 x 76 Multiplier (GS-1)
    Input = (A, B, OT, PC, RC), Output = (Q_f)

    Operations                                                      Cycles
    X_0 = recip_estimate(B)                                         1-3
    D_0 = itermul_27x76(X_0, B), R_0 = comp(D_0)                    4-5
    N_0 = itermul_27x76(X_0, A)                                     5-6
    if (OT == SINGLE) {
        Q = lastmul_54x76(R_0[0:53], N_0, 25)                       7-9
        REM = backmul_25x24(Q, B, A), Q_f = round(Q, REM, 24, RC)   10-11
        goto END DIVISION }
    if (OT == X87 and PC == 24) goto X87 DIV
    D_1 = itermul_54x76(R_0[0:53], D_0), R_1 = comp(D_1)            6-8
    N_1 = itermul_54x76(R_0[0:53], N_0)                             8-10
    if (OT == DOUBLE) {
        Q = lastmul_76x76(R_1, N_1, 54)                             11-14
        REM = backmul_54x53(Q, B, A), Q_f = round(Q, REM, 53, RC)   15-17
        goto END DIVISION }
    X87 DIV:
    if (PC == 24)
        Q = lastmul_54x76(R_0[0:53], N_0, 25)                       7-9
    else if (PC == 53)
        Q = lastmul_76x76(R_1, N_1, 54)                             11-14
    else {
        D_2 = itermul_76x76(R_1, D_1), R_2 = comp(D_2)              11-14
        N_2 = itermul_76x76(R_1, N_1)                               14-17
        Q = lastmul_76x76(R_2, N_2, PC+1) }                         18-21
    REM = backmul_76x76(Q, B, A), Q_f = round(Q, REM, PC, RC)       See *
    END DIVISION:

    * 10-13 (PC = 24), 15-18 (PC = 53), 22-25 (PC = 64/68)

Figure 3: Goldschmidt's algorithm with truncated R_i on a 27 x 76 multiplier (GS-1)

Utilizing a truncated version of R_i allows some of the multiplications to be performed with fewer passes through the rectangular multiplier.
The GS-1 algorithm also examines the operand type, OT, since SSE requires support for single and double precision input operands, and operations on these types of operands require fewer passes through the rectangular multiplier than extended precision operands. Figure 3 shows the GS-1 algorithm. In this figure, the size of each multiplication is specified by the numbers after the underscore. All of the itermul_ operations truncate their results to 76 bits, and the lastmul_ operations round their results to the precision specified in the last argument using round-to-nearest. The rest of the operations have the same functionality as the corresponding operations in Figure 2, except for the size of the input operands. For example, Q = lastmul_54x76(R_0[0:53], N_0, 25) indicates that the 54 most significant bits of R_0 are multiplied by all 76 bits of N_0. The result is rounded to 25 bits using round-to-nearest. Since R_0[0:53] is 54 bits, this multiplication is performed with two passes through the rectangular multiplier.

For single precision operands (OT = SINGLE), all of the multiplications, except for lastmul_54x76, require only a single pass through the multiplier tree and the division has a latency of 11 cycles. For double precision operands, the multiplications require one to three passes through the multiplier tree and the division has a latency of 17 cycles. For x87 operands, the latency depends on the required precision of the final result and is 13 cycles for single precision, 18 cycles for double precision, and 25 cycles for extended or internal precision.

Our second version of Goldschmidt's algorithm (GS-2), shown in Figure 4, uses a truncated version of R_i and takes advantage of the fact that R_i is close to 1.0 to reduce the number of bits in R_i used for the iterative multiplications and reduce the number of passes through the multiplier. For example, since $|R'_0| < 2^{-13}$, the thirteen most significant bits of R_0 are not needed.
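The following Python sketch illustrates why the leading bits of R_0 can be skipped. It assumes exact rational arithmetic, the paper's bit indexing (bit 0 is the integer bit), and the case where R_0 is slightly above 1.0, so that bits 0 through 12 of R_0 are 1.000000000000. The helper names `field` and `muladd_sketch` and all operand values are hypothetical, and the truncation of the shifted operand performed by the real hardware is not modeled.

```python
from fractions import Fraction
from math import floor


def field(value, i, j):
    """Bits i..j of a non-negative value x_0.x_1 x_2 ..., as an integer.

    Bit p has weight 2**-p, so bit 0 is the integer bit.
    """
    return floor(value * 2**j) % 2**(j - i + 1)


def muladd_sketch(r, operand, k, last):
    """Return operand + operand * R[k:last], with R[k:last] weighted by 2**-last.

    Hypothetical helper: valid only when R >= 1, so bits 0..k-1 of R are 1.00...0.
    """
    return operand + operand * field(r, k, last) * Fraction(1, 2**last)


# Illustrative values: D_0 slightly below 1, so R_0 = 2 - D_0 is slightly above 1.
d0 = 1 - (Fraction(3, 2**15) + Fraction(1, 2**45))
r0 = 2 - d0
n0 = Fraction(1201, 1000)

exact = r0 * n0                          # full-width product R_0 * N_0
approx = muladd_sketch(r0, n0, 13, 39)   # uses only the 27-bit slice R_0[13:39]

if __name__ == "__main__":
    print("bits 0..12 of R_0:", bin(field(r0, 0, 12)))
    print("error from dropping bits below 39:", float(exact - approx))
```

In this example the 27-bit slice reproduces the full product up to the contribution of the bits of R_0 below position 39, which is bounded by N_0 times 2^-39.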
Based on Equation (17), this allows the computation

    D_1 = itermul_54x76(R_0[0:53], D_0)                 (18)

which requires two passes through the multiplier tree in GS-1, to be replaced by the computation

    D_1 = itermuladd_27x76(R_0[13:39], D_0, 13)         (19)

which corresponds to

$$D_1 = D_0 + D_0 \cdot R_0[13:39] \cdot 2^{-13} = D_0 + \{0^{13}, D_0\} \cdot R_0[13:39] \qquad (20)$$

This operation requires only a single pass through the multiplier, with D_0 right shifted by 13 bits, the lower 13 bits of D_0 truncated, and the un-shifted value of D_0 added to the product. This operation compensates for the fact that R_0 - 1 is used instead of R_0, as described in Section 2. Similar optimizations are used throughout the algorithm to reduce the number of passes through the multiplier and the latency of the division algorithm. The operations that use these types of optimizations are itermuladd_ and lastmuladd_. They implement operand shifting, multiplication, and addition by using a modified version of the multiplier described in Section 3. The lastmuladd operation is similar to the itermuladd operation, except that the result is rounded to the number of bits specified by its last argument using round-to-nearest.

    Program: Goldschmidt's Division Algorithm with Reduced R_i on a 27 x 76 Multiplier (GS-2)
    Input = (A, B, OT, PC, RC), Output = (Q_f)

    Operations                                                          Cycles
    X_0 = recip_estimate(B)                                             1-3
    D_0 = itermul_27x76(X_0, B), R_0 = comp(D_0)                        4-5
    N_0 = itermul_27x76(X_0, A)                                         5-6
    if (OT == SINGLE) {
        Q = lastmuladd_27x76(R_0[13:39], N_0, 13, 25)                   7-8
        REM = backmul_25x24(Q, B, A), Q_f = round(Q, REM, 24, RC)       9-10
        goto END DIVISION }
    if (OT == X87 and PC == 24) goto X87 DIV
    D_1 = itermuladd_27x76(R_0[13:39], D_0, 13), R_1 = comp(D_1)        6-7
    N_1 = itermuladd_27x76(R_0[13:39], N_0, 13)                         7-8
    if (OT == DOUBLE) {
        Q = lastmuladd_54x76(R_1[26:75], N_1, 26, 54)                   9-11
        REM = backmul_54x53(Q, B, A), Q_f = round(Q, REM, 53, RC)       12-14
        goto END DIVISION }
    X87 DIV:
    if (PC == 24)
        Q = lastmuladd_27x76(R_0[13:39], N_0, 13, 25)                   7-8
    else if (PC == 53)
        Q = lastmuladd_54x76(R_1[26:75], N_1, 26, 54)                   9-11
    else {
        D_2 = itermuladd_27x76(R_1[26:52], D_1, 26), R_2 = comp(D_2)    8-9
        N_2 = itermuladd_27x76(R_1[26:52], N_1, 26)                     9-10
        Q = lastmuladd_27x76(R_2[52:75], N_2, 52, PC+1) }               11-12
    REM = backmul_65x64(Q, B, A), Q_f = round(Q, REM, PC, RC)           See *
    END DIVISION:

    * 9-12 (PC = 24), 12-15 (PC = 53), 13-16 (PC = 64/68)

Figure 4: Goldschmidt's algorithm with reduced R_i on a 27 x 76 multiplier (GS-2)

5. Algorithm comparison

Table 2 compares the latency in cycles for each division algorithm, based on the multiplication latencies given in Table 1. In Table 2, (S), (D), and (E) indicate results are rounded to single, double, or extended precision, respectively. For completeness, the latency of the original division algorithm [15] on the AMD-K7 microprocessor with a 76x76 multiplier is also given, and denoted as K7 (76x76). The 76x76 multiplier is roughly 2.5 times larger than our 27x76 multiplier.
Table 2 also shows the latency for the K7 division algorithm [15] when it has minor modifications to work with our rectangular multiplier. This modified algorithm is denoted as K7 (27x76). As shown in Table 2, the two proposed algorithms have better latency than the AMD-K7 (27x76) algorithm for all operand types and precisions. The GS-2 (27x76) algorithm has the lowest overall latency for all operand types and precisions. Compared to the GS-1 (27x76) algorithm, the GS-2 (27x76) algorithm reduces the latency by one cycle for single precision, three cycles for double precision, and nine cycles for extended precision, when the input and output operands have the same precision.

Table 3 shows the number of passes through the multiplier for each division algorithm, based on the number of multiplier passes for the various multiplication sizes given in Table 1. For example, a 27x76 multiplication only requires a single pass through the multiplier and a 76x76 multiplication requires 3 passes through the multiplier. As shown in Table 3, the GS-2 algorithm has the fewest passes through the multiplier for all operand types and precisions. The number of passes through the multiplier is important since it impacts the power dissipated by the division algorithm and also indicates how available the multiplier is for implementing other operations.

Table 2: Latency of division algorithms (cycles)

    Algorithm       Single   Double   x87 (S)   x87 (D)   x87 (E)
    K7 (76x76)      16       20       16        20        24
    K7 (27x76)      14       20       14        20        26
    GS-1 (27x76)    11       17       13        18        25
    GS-2 (27x76)    10       14       12        15        16

Table 3: Multiplier passes of division algorithms

    Algorithm       Single   Double   x87 (S)   x87 (D)   x87 (E)
    K7 (76x76)      12       18       12        18        24
    K7 (27x76)      8        14       8         14        20
    GS-1 (27x76)    5        11       7         12        18
    GS-2 (27x76)    4        8        6         9         10
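The pass counts for GS-1 and GS-2 can be recomputed from the multiplication sizes listed in Figures 3 and 4. The sketch below assumes that the number of passes is the multiplier-operand width (the first number in each size, such as 54 in 54x76) divided by 27 and rounded up, which agrees with Table 1; the dictionary layout and function names are illustrative only.

```python
MULTIPLIER_BITS = 27  # multiplier-operand bits consumed per pass through the array


def passes(multiplier_operand_bits):
    """Passes through the 27-bit by 76-bit array for one multiplication."""
    return -(-multiplier_operand_bits // MULTIPLIER_BITS)  # ceiling division


# Multiplier-operand widths of each multiplication, read from Figures 3 and 4.
GS1 = {
    "single":  [27, 27, 54, 25],
    "double":  [27, 27, 54, 54, 76, 54],
    "x87 (S)": [27, 27, 54, 76],
    "x87 (D)": [27, 27, 54, 54, 76, 76],
    "x87 (E)": [27, 27, 54, 54, 76, 76, 76, 76],
}
GS2 = {
    "single":  [27, 27, 27, 25],
    "double":  [27, 27, 27, 27, 54, 54],
    "x87 (S)": [27, 27, 27, 65],
    "x87 (D)": [27, 27, 27, 27, 54, 65],
    "x87 (E)": [27, 27, 27, 27, 27, 27, 27, 65],
}


def total_passes(algorithm):
    """Sum the passes of every multiplication for each operand type."""
    return {case: sum(passes(w) for w in widths) for case, widths in algorithm.items()}


if __name__ == "__main__":
    print("GS-1:", total_passes(GS1))
    print("GS-2:", total_passes(GS2))
```

The resulting totals can be compared against the GS-1 and GS-2 rows of Table 3.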

Compared to the K7 (27x76) algorithm, the GS-1 (27x76) algorithm has roughly the same hardware complexity, but more complex control logic to handle the different multiplication sizes. The GS-2 algorithm has the most complexity, since it has additional multiplexers to shift R_i, N_i, and D_i and it has modifications to the multiplier tree to perform multiply-add operations. For our implementation, the relatively small increase in hardware complexity of the GS-2 algorithm is less important than the reduced latency and passes through the rectangular multiplier.

6. Conclusions

This paper presents and compares variations of Goldschmidt's division algorithm for an x86 microprocessor that utilizes a rectangular multiplier. Of the algorithms presented in this paper, the GS-2 algorithm has the lowest latency and requires the fewest passes through the rectangular multiplier. All of the algorithms presented in this paper have been verified through extensive error analysis. The GS-2 algorithm has been modeled in Verilog and simulated using over a million test vectors for the supported operand types and result precisions.

References

[1] S. K. Raman, V. Pentkovski, and J. Keshava, "Implementing Streaming SIMD Extensions on the Pentium III Processor," IEEE Micro, vol. 20, no. 4, pp. 47-57, July 2000.
[2] Advanced Micro Devices, AMD64 Architecture Programmer's Manual Volume 5: 64-Bit Media and x87 Floating-Point Instructions, Revision 3.7, September 2006.
[3] ANSI and IEEE, IEEE Standard for Binary Floating-Point Arithmetic, 1985.
[4] W.-C. Ma and C.-L. Yang, "Using Intel Streaming SIMD Extensions for 3D Geometry Processing," Proceedings of the 3rd IEEE Pacific-Rim Conference on Multimedia, pp. 1080-1087, December 2002.
[5] S. F. Oberman and M. J. Flynn, "Division Algorithms and Implementations," IEEE Transactions on Computers, vol. 46, no. 8, pp. 833-854, August 1997.
[6] M. D. Ercegovac and T. Lang, Division and Square Root: Digit-Recurrence Algorithms and Implementations, Kluwer Academic Publishers, 1994.
[7] D. Wong and M. Flynn, "Fast Division Using Accurate Quotient Approximations to Reduce the Number of Iterations," IEEE Transactions on Computers, vol. 41, no. 8, pp. 981-995, August 1992.
[8] W. S. Briggs and D. W. Matula, "A 17 x 69 Bit Multiply and Add Unit with Redundant Binary Feedback and Single Cycle Latency," Proceedings of the 11th IEEE Symposium on Computer Arithmetic, pp. 163-170, July 1993.
[9] W. S. Briggs and D. W. Matula, "Method and Apparatus for Performing Division Using a Rectangular Aspect Ratio Multiplier," U.S. Patent No. 5,046,038, 1989.
[10] W. S. Briggs and D. W. Matula, "Method and Apparatus for Performing Prescaled Division," U.S. Patent No. 5,475,630, 1995.
[11] M. D. Ercegovac, T. Lang, and P. Montuschi, "Very High Radix Division with Prescaling and Selection by Rounding," IEEE Transactions on Computers, vol. 43, no. 8, pp. 909-918, August 1994.
[12] T. Lang and P. Montuschi, "Boosting Very High Radix Division with Prescaling and Selection by Rounding," IEEE Transactions on Computers, vol. 50, no. 1, pp. 13-27, January 2001.
[13] R. E. Goldschmidt, "Applications of Division by Convergence," M.S. thesis, Dept. of Electrical Engineering, MIT, Cambridge, MA, June 1964.
[14] M. Flynn, "On Division by Functional Iteration," IEEE Transactions on Computers, vol. 19, no. 8, pp. 702-706, August 1970.
[15] S. F. Oberman, "Floating-Point Division and Square Root Algorithms and Implementation in the AMD-K7 Microprocessor," Proceedings of the 14th IEEE Symposium on Computer Arithmetic, pp. 106-115, 1999.
[16] S. F. Anderson, J. G. Earle, R. E. Goldschmidt, and D. M. Powers, "The IBM System/360 Model 91: Floating-Point Execution Unit," IBM Journal of Research and Development, vol. 11, pp. 34-53, Jan. 1967.
[17] H. Darley, M. Gill, D. Earl, D. Ngo, P. Wang, M. Hipona, and J. Dodrill, "Floating Point/Integer Processor with Divide and Square Root Functions," U.S. Patent No. 4,878,190, 1989.
[18] E. M. Schwarz, L. Sigal, and T. J. McPherson, "CMOS Floating-Point Unit for the S/390 Parallel Enterprise Server G4," IBM Journal of Research and Development, vol. 41, no. 4/5, pp. 475-488, July/September 1997.
[19] M. D. Ercegovac and T. Lang, Digital Arithmetic, Morgan Kaufmann Publishers, 2004.
[20] B. Parhami, Computer Arithmetic: Algorithms and Hardware Designs, Oxford University Press, 2000.
[21] I. Koren, Computer Arithmetic Algorithms, A. K. Peters, 2002.
[22] G. Even, P.-M. Seidel, and W. E. Ferguson, "A Parametric Error Analysis of Goldschmidt's Division Algorithm," 16th IEEE Symposium on Computer Arithmetic, pp. 165-171, June 2003.
[23] G. Even and P.-M. Seidel, "Pipelined Multiplicative Division with IEEE Rounding," IEEE International Conference on Computer Design, pp. 240-245, 2003.
[24] G. Even and P.-M. Seidel, "Pipelined Multiplicative Division with IEEE Rounding," U.S. Patent No. 2004/0128338, July 2004.
[25] S. F. Oberman, "Bipartite Look-up Table with Output Values Having Minimized Absolute Error," U.S. Patent No. 6,223,192, April 2001.