IEEE Standard for Floating Point Numbers


V Rajaraman

Floating point numbers are an important data type in computation which is used extensively. Yet, many users do not know the standard which is used in almost all computer hardware to store and process these. In this article, we explain the standards evolved by the Institute of Electrical and Electronics Engineers in 1985 and augmented in 2008 to represent floating point numbers and process them. This standard is now used by all computer manufacturers while designing floating point arithmetic units so that programs are portable among computers.

V Rajaraman is at the Indian Institute of Science, Bengaluru. Several generations of scientists and engineers in India have learnt computer science using his lucidly written textbooks on programming and computer fundamentals. His current research interests are parallel computing and history of computing.

Keywords: Floating point numbers, rounding, decimal floating point numbers, IEEE 754-2008 Standard.

Introduction

There are two types of arithmetic which are performed in computers: integer arithmetic and real arithmetic. Integer arithmetic is simple. A decimal number is converted to its binary equivalent and arithmetic is performed using binary arithmetic operations. The largest positive integer that may be stored in an 8-bit byte is +127, if 1 bit is used for the sign. If 16 bits are used, the largest positive integer is +32767, and with 32 bits it is +2147483647, quite large! Integers are used mostly for counting.

Most scientific computations are, however, performed using real numbers, that is, numbers with a fractional part. In order to represent real numbers in computers, we have to ask two questions. The first is to decide how many bits are needed to encode real numbers, and the second is to decide how to represent real numbers using these bits. Normally, in numerical computation in science and engineering, one would need at least 7 to 8 significant digits of precision. Thus, the number of bits needed to encode 8 decimal digits is approximately 26, as log2(10) = 3.32 bits are needed on the average to encode a digit. In computers, numbers are stored as a sequence of 8-bit bytes. Thus 32 bits (4 bytes), which is bigger than 26 bits, is a logical size to use for real numbers.

Given 32 bits to encode real numbers, the next question is how to break them up into an integer part and a fractional part. One method is to divide the 32 bits into two parts, one part to represent the integer part of the number and the other the fractional part, as shown in Figure 1. In this figure, we have arbitrarily fixed the (virtual) binary point between bits 7 and 8. With this representation, called fixed point representation, the largest and the smallest positive binary numbers that may be represented using 32 bits are determined by where the binary point is fixed.

Figure 1. Fixed point representation of real numbers in binary using 32 bits.

Binary Floating Point Numbers

This range of real numbers, when fixed point representation is used, is not sufficient in many practical problems. Therefore, another representation called normalized floating point representation is used for real numbers in computers. In this representation, 32 bits are divided into two parts: a part called the mantissa with its sign, and the other called the exponent with its sign. The mantissa represents a fraction with a non-zero leading bit, and the exponent the power of 2 by which the mantissa is multiplied. This method increases the range of numbers that may be represented using 32 bits. In this method, a binary floating point number is represented by

(sign) mantissa × 2^(±exponent)

where the sign is one bit, the mantissa is a binary fraction with a non-zero leading bit, and the exponent is a binary integer.

If 32 bits are available to store floating point numbers, we have to decide the following:

1. How many bits are to be used to represent the mantissa (1 bit is reserved for the sign of the number).
2. How many bits are to be used for the exponent.
3. How to represent the sign of the exponent.

The number of bits to be used for the mantissa is determined by the number of significant decimal digits required in computation, which is at least seven. Based on experience in numerical computation, it is found that at least seven significant decimal digits are needed if results are expected without too much error. The number of bits required to represent seven decimal digits is approximately 23. The remaining 8 bits are allocated for the exponent.

The exponent may be represented in sign magnitude form. In this case, 1 bit is used for the sign and 7 bits for the magnitude, and the exponent ranges from −127 to +127. The only disadvantage of this method is that there are two representations for a zero exponent: +0 and −0. To avoid this, one may use an excess representation, or biased format, for the exponent. The exponent has no sign in this format. The range 0 to 255 of an 8-bit number is divided into two parts, 0 to 127 and 128 to 255. Bit strings 0 to 126 represent negative exponents, the bit string 127 represents a zero exponent, and values greater than 127 represent positive exponents. Thus the range of exponents is −127 to +128. Given an exponent string exp, the value of the exponent is taken as (exp − 127). The main advantage of this format is a unique representation for exponent zero.
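To make the excess-127 idea concrete, here is a small illustrative sketch in Python (my addition, not part of the original article; the helper names are mine). It converts between an actual exponent and the 8-bit stored field using a bias of 127.

```python
BIAS = 127  # excess-127 (biased) representation for an 8-bit exponent field

def encode_exponent(e: int) -> int:
    """Map an actual exponent e in -127..+128 to the stored 8-bit field."""
    assert -127 <= e <= 128
    return e + BIAS

def decode_exponent(field: int) -> int:
    """Map a stored 8-bit field 0..255 back to the actual exponent."""
    assert 0 <= field <= 255
    return field - BIAS

print(encode_exponent(0))     # 127: exponent zero has a single, unique encoding
print(decode_exponent(133))   # 6, matching the worked example that follows
```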

With the representation of binary floating point numbers explained above, the largest floating point number which can be represented is

0.1111...1 (23 ones) × 2^(11111111)₂ = (1 − 2^-23) × 2^(255 − 127) ≈ 3.4 × 10^38.

The smallest positive floating point number is

0.1000...0 (23 bits) × 2^-127 = 2^-128 ≈ 0.294 × 10^-38.

Example. Represent 52.21875 in 32-bit binary floating point format.

52.21875 = 110100.00111 = 0.11010000111 × 2^6.
Normalized 23-bit mantissa = 0.11010000111000000000000.
As excess representation is being used for the exponent, the stored exponent is 127 + 6 = 133 = (10000101)₂.

The 32-bit string used to store 52.21875 in a computer will thus be

0 10000101 11010000111000000000000

Here, the most significant bit is the sign, the next 8 bits the exponent, and the last 23 bits the mantissa.

IEEE Floating Point Standard 754-1985

Floating point binary numbers were beginning to be used in the mid 50s. There was no uniformity in the formats used to represent floating point numbers, and programs were not portable from one manufacturer's computer to another. By the mid 1980s, with the advent of personal computers, the number of bits used to store floating point numbers was standardized as 32 bits. A Standards Committee was formed by the Institute of Electrical and Electronics Engineers to standardize how floating point binary numbers would be represented in computers. In addition, the standard specified uniformity in rounding numbers, treating exception conditions such as an attempt to divide by 0, and representation of 0 and infinity (∞).

This standard, called IEEE Standard 754 for floating point numbers, was adopted in 1985 by all computer manufacturers. It allowed porting of programs from one computer to another without the answers being different. This standard defined floating point formats for 32-bit and 64-bit numbers. With improvement in computer technology it became feasible to use a larger number of bits for floating point numbers. After the standard had been used for a number of years, many improvements were suggested. The standard was updated in 2008. The current standard is the IEEE 754-2008 version. This version retained all the features of the 1985 standard and introduced new standards for 16-bit and 128-bit numbers. It also introduced standards for representing decimal floating point numbers. We will describe these standards later in this article. In this section, we describe the IEEE 754-1985 Standard.

IEEE floating point representation for binary real numbers consists of three parts. For a 32-bit (called single precision) number, they are:

1. Sign, for which 1 bit is allocated.
2. Mantissa (called significand in the standard), which is allocated 23 bits.
3. Exponent, which is allocated 8 bits.

As both positive and negative exponents are required, instead of using a separate sign bit for the exponent, the standard uses a biased representation. The value of the bias is 127. Thus an exponent of 0 means that 127 is stored in the exponent field. A stored value of 198 means that the exponent value is (198 − 127) = 71. The exponents −127 (all 0s) and +128 (all 1s) are reserved for representing special numbers which we discuss later.

To increase the precision of the significand, the IEEE 754 Standard uses a normalized significand, which implies that its most significant bit is always 1. As this is implied, it is assumed to be on the left of the (virtual) binary point of the significand. Thus in the IEEE Standard the significand is 24 bits long: 23 bits are stored in memory and an implied 1 is the most significant 24th bit. The extra bit increases the number of significant digits in the significand.

Thus, a floating point number in the IEEE Standard is

(−1)^s × (1.f)₂ × 2^(exponent − 127)

where s is the sign bit; s = 0 is used for positive numbers and s = 1 for representing negative numbers. f represents the bits in the significand. Observe that the implied 1 as the most significant bit of the significand is explicitly shown for clarity. For a 32-bit word machine, the allocation of bits for a floating point number is given in Figure 2. Observe that the exponent is placed before the significand (see Box 1).

Figure 2. IEEE 754 representation of a 32-bit floating point number.

Example. Represent 52.21875 in IEEE 754 32-bit floating point format.

52.21875 = 110100.00111 = 1.1010000111 × 2^5.
Normalized significand = .1010000111.
Exponent: (e − 127) = 5, or e = 132.
The bit representation in IEEE format is

0 10000100 10100001110000000000000

Box 1. Why does the IEEE 754 Standard use a biased exponent representation and place it before the significand?

Computers normally have separate arithmetic units to compute with integer operands and floating point operands. They are called the FPU (Floating Point Unit) and the IU (Integer Unit). An IU works faster and is cheaper to build. In most recent computers there are several IUs and a smaller number of FPUs, so it is preferable to use IUs whenever possible. By placing the sign bit and biased exponent bits as shown in Figure 2, operations such as x <r.o.> y, where r.o. is a relational operator (>, <, ≥, ≤) and x, y are reals, can be carried out by integer arithmetic units, as the exponents are integers. Sorting real numbers may also be carried out by IUs, as the biased exponent bits are the most significant bits. This ensures lexicographical ordering of real numbers; the significand need not be considered as a fraction.
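The bit pattern derived in the example above can be checked mechanically. The following sketch (my addition, using Python's standard struct module; the helper name fields32 is mine) packs 52.21875 as an IEEE 754 single-precision value and splits the 32 bits into the sign, exponent and significand fields.

```python
import struct

def fields32(x: float):
    """Return (sign, exponent, significand) bit strings of x in IEEE 754 binary32."""
    bits = int.from_bytes(struct.pack('>f', x), 'big')   # big-endian 32-bit pattern
    b = f'{bits:032b}'
    return b[0], b[1:9], b[9:]

sign, exponent, significand = fields32(52.21875)
print(sign, exponent, significand)
# Expected: 0 10000100 10100001110000000000000
print(int(exponent, 2) - 127)   # 5, the actual (unbiased) exponent
```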

Special Values in IEEE 754-1985 Standard (32 Bits)

Representation of Zero: As the significand is assumed to have a hidden 1 as its most significant bit, all 0s in the significand part of the number would be taken as 1.00...0. Thus zero is represented in the IEEE Standard by all 0s for the exponent and all 0s for the significand. All 0s for the exponent is not allowed to be used for any other (normalized) number. If the sign bit is 0 and all the other bits are 0, the number is +0. If the sign bit is 1 and all the other bits are 0, it is −0. Even though +0 and −0 have distinct representations, they are assumed equal. Thus the representations of +0 and −0 are:

+0 : 0 00000000 00000000000000000000000
−0 : 1 00000000 00000000000000000000000

Representation of Infinity: All 1s in the exponent field (with an all-zero significand) is assumed to represent infinity (∞). A sign bit 0 represents +∞ and a sign bit 1 represents −∞. Thus the representations of +∞ and −∞ are:

+∞ : 0 11111111 00000000000000000000000
−∞ : 1 11111111 00000000000000000000000
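A quick way to see these special encodings is to decode the bit patterns directly; the sketch below (my addition, reusing the struct approach and an illustrative helper bits32) shows +0, −0 and ±∞ in binary32.

```python
import math
import struct

def bits32(x: float) -> str:
    """32-bit IEEE 754 pattern of x as 'sign exponent significand'."""
    b = f"{int.from_bytes(struct.pack('>f', x), 'big'):032b}"
    return f"{b[0]} {b[1:9]} {b[9:]}"

print(bits32(0.0))          # 0 00000000 000...0  -> +0
print(bits32(-0.0))         # 1 00000000 000...0  -> -0
print(bits32(math.inf))     # 0 11111111 000...0  -> +infinity
print(bits32(-math.inf))    # 1 11111111 000...0  -> -infinity
print(0.0 == -0.0)          # True: +0 and -0 compare equal
```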

Representation of Non-Numbers: When an operation is performed by a computer on a pair of operands, the result may not be mathematically defined. For example, if zero is divided by zero, the result is indeterminate. Such a result is called Not a Number (NaN) in the IEEE Standard. In fact, the IEEE Standard defines two types of NaN. When the result of an operation is not defined (i.e., indeterminate), it is called a Quiet NaN (QNaN). Examples are 0/0 and (∞ − ∞). Quiet NaNs are normally carried over in the computation. The other type of NaN is called a Signalling NaN (SNaN). This is used to give an error message. When an operation leads to a floating point underflow, i.e., the result of a computation is smaller than the smallest number that can be stored as a floating point number, or to an overflow, i.e., the result is larger than the largest number that can be stored, an SNaN may be used. When no valid value is stored in a variable (i.e., it is undefined) and an attempt is made to use it in an arithmetic operation, an SNaN would result.

A QNaN is represented by 0 or 1 as the sign bit, all 1s as the exponent, a 1 as the left-most bit of the significand, and any combination of bits for the remaining 22 bits. An SNaN is represented by 0 or 1 as the sign bit, all 1s as the exponent, a 0 as the left-most bit of the significand, and at least one 1 in the rest of the significand. We give below sample representations of a QNaN and an SNaN.

QNaN : 0 or 1 11111111 10000000000001000000000
(The most significant bit of the significand is 1. Any combination of bits is allowed for the other bits of the significand.)

SNaN : 0 or 1 11111111 00010000000000000000000
(The most significant bit of the significand is 0. There is at least one 1 in the rest of the significand.)
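The following sketch (my addition) produces a quiet NaN and shows why NaNs need care in comparisons. The exact significand payload is implementation-dependent, so treat the printed pattern as illustrative only.

```python
import math
import struct

def bits32(x: float) -> str:
    b = f"{int.from_bytes(struct.pack('>f', x), 'big'):032b}"
    return f"{b[0]} {b[1:9]} {b[9:]}"

qnan = float('inf') - float('inf')   # an indeterminate operation yields a quiet NaN
print(math.isnan(qnan))              # True
print(bits32(qnan))                  # exponent all 1s, non-zero significand
print(qnan == qnan)                  # False: a NaN is not equal even to itself
```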

Largest and Smallest Positive Floating Point Numbers

Largest Positive Number:

0 11111110 11111111111111111111111

Significand: 1.111...1 = 1 + (1 − 2^-23) = 2 − 2^-23.
Exponent: (254 − 127) = 127.
Largest number = (2 − 2^-23) × 2^127 ≈ 3.403 × 10^38.

If the result of a computation exceeds the largest number that can be stored in the computer, it is called an overflow.

Smallest Positive Number:

0 00000001 00000000000000000000000

Significand = 1.0.
Exponent = 1 − 127 = −126.
The smallest normalized number is 2^-126 ≈ 1.1755 × 10^-38.

Subnormal Numbers: When all the exponent bits are 0 and the leading hidden bit of the significand is 0, the floating point number is called a subnormal number. Thus, one logical representation of a subnormal number is

(−1)^s × 0.f × 2^-127 (all 0s for the exponent),

where f has at least one 1 (otherwise the number will be taken as 0). However, the standard uses −126, i.e., (1 − bias), for the exponent of subnormals rather than −127, which is the negative of the bias, for a not so obvious reason: by using −126 instead of −127, the gap between the largest subnormal number and the smallest normalized number is smaller.

The largest subnormal number is (1 − 2^-23) × 2^-126 ≈ 0.99999988 × 2^-126. It is close to the smallest normalized number 2^-126.
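These extreme values can be reproduced by unpacking the corresponding bit patterns; the sketch below (my addition, with an illustrative helper from_bits32) decodes the largest normalized, the smallest normalized and the largest subnormal binary32 values.

```python
import struct

def from_bits32(pattern: int) -> float:
    """Interpret a 32-bit integer as an IEEE 754 binary32 value."""
    return struct.unpack('>f', pattern.to_bytes(4, 'big'))[0]

largest  = from_bits32(0b0_11111110_11111111111111111111111)  # (2 - 2**-23) * 2**127
smallest = from_bits32(0b0_00000001_00000000000000000000000)  # 2**-126
sub_max  = from_bits32(0b0_00000000_11111111111111111111111)  # largest subnormal
print(largest)    # ~3.4028234663852886e+38
print(smallest)   # ~1.1754943508222875e-38
print(sub_max)    # ~1.1754942106924411e-38, just below the smallest normalized value
```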

The smallest positive subnormal number is

0 00000000 00000000000000000000001

the value of which is 2^-23 × 2^-126 = 2^-149. A result that is smaller than the smallest number that can be stored in a computer is called an underflow. You may wonder why subnormal numbers are allowed in the IEEE Standard. By using subnormal numbers, underflow which may occur in some calculations becomes gradual. Also, the smallest number that can be represented in a machine is closer to zero. (See Box 2.)

Box 2. Machine Epsilon and Dwarf

In any of the formats for representing floating point numbers, machine epsilon is defined as the difference between 1 and the next larger number that can be stored in that format. For example, in the 32-bit IEEE format with a 23-bit significand, the machine epsilon is 2^-23 = 1.19 × 10^-7. This essentially tells us that the precision of decimal numbers stored in this format is 7 digits. The terms precision and accuracy are not the same: accuracy implies correctness whereas precision does not. For the 64-bit representation of IEEE floating point numbers, the significand length is 52 bits. Thus, machine epsilon is 2^-52 = 2.22 × 10^-16, and decimal calculations with 64 bits give 16 digit precision. This is a conservative definition used by industry; in MATLAB, machine epsilon is as defined above. However, academics define machine epsilon as the upper bound of the relative error when numbers are rounded. Thus, for 32-bit floating point numbers, the machine epsilon is 2^-23/2 ≈ 5.96 × 10^-8. The machine epsilon is useful in iterative computation: when two successive iterates differ by less than epsilon, one may assume that the iteration has converged. The IEEE Standard does not define machine epsilon.

Tiny, or dwarf, is the minimum subnormal number that can be represented using the specified floating point format. Thus, for the IEEE 32-bit format, it is

0 00000000 00000000000000000000001

whose value is 2^(−126 − 23) = 2^-149 ≈ 1.4 × 10^-45. Any number less than this will signal underflow. The advantage of using subnormal numbers is that underflow is gradual, as was stated earlier.
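Machine epsilon in the "next number after 1" sense can be found experimentally; a minimal sketch (my addition) for Python's built-in double-precision floats:

```python
import sys

# Halve eps until adding half of it to 1.0 no longer changes the result.
eps = 1.0
while 1.0 + eps / 2 != 1.0:
    eps /= 2

print(eps)                     # 2.220446049250313e-16, i.e. 2**-52 for binary64
print(sys.float_info.epsilon)  # the standard library reports the same value
```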

When a mathematical operation such as add, subtract, multiply, or divide is performed with two floating point numbers, the significand of the result may exceed 23 bits after the adjustment of the exponent. In such a case, there are two alternatives. One is to truncate the result, i.e., ignore all bits beyond the 23rd bit of the significand. The other is to round the result to the nearest significand. For example, if a significand is 0.110...01 and the first bit being dropped is 1, the rounded value would be 0.110...10. In other words, if the first bit which is truncated is 1, add 1 to the least significant bit; else ignore the extra bits. This is called rounding upwards. If the extra bits are simply ignored, it is called rounding downwards. The IEEE Standard asks hardware designers to produce the best possible result while performing arithmetic, in a way that is reproducible across different manufacturers' computers using the standard. The Standard specifies that in the equation

c = a <op> b,

where a and b are operands and <op> is an arithmetic operation, the result c should be as if it were computed exactly and then rounded. This is called correct rounding.

In Tables 1 and 2, we summarise the discussions in this section.

Table 1. IEEE 754-85 floating point standard. We use f to represent the significand, e to represent the exponent, b the bias of the exponent, and ± for the sign.

Value | Sign | Exponent (8 bits) | Significand (23 bits)
+0 | 0 | 00000000 | 000...0 (all 23 bits 0)
−0 | 1 | 00000000 | 000...0 (all 23 bits 0)
+1.f × 2^(e−b) | 0 | 00000001 to 11111110 | aa...a (each a = 0 or 1)
−1.f × 2^(e−b) | 1 | 00000001 to 11111110 | aa...a (each a = 0 or 1)
+∞ | 0 | 11111111 | 000...0 (all 23 bits 0)
−∞ | 1 | 11111111 | 000...0 (all 23 bits 0)
SNaN | 0 or 1 | 11111111 | 0aa...a with at least one 1 (leading bit 0)
QNaN | 0 or 1 | 11111111 | 1aa...a (leading bit 1)
Positive subnormal, value 0.f × 2^(1−b) | 0 | 00000000 | 000...01 to 111...11 (at least one 1)

Table 2. Operations on special numbers. All NaNs are Quiet NaNs. (n denotes a finite non-zero number.)

Operation | Result
n/±∞ | 0
±∞/±n | ±∞
±n/0 | ±∞
±∞ + ±∞ | ±∞
±0/±0 | NaN
∞ − ∞ | NaN
±∞/±∞ | NaN
∞ × 0 | NaN
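Both correct rounding and the special-value rules of Table 2 can be observed directly with Python floats, which follow IEEE 754 binary64; a short sketch (my addition):

```python
import math

# Correct rounding: each literal and each operation is rounded once,
# as if computed exactly, so the result is reproducible everywhere.
print(0.1 + 0.2)            # 0.30000000000000004
print(0.1 + 0.2 == 0.3)     # False: the correctly rounded sum differs from rounded 0.3

# Operations on special numbers (Table 2). Note that Python raises
# ZeroDivisionError for n/0.0 instead of returning +/-infinity.
inf = math.inf
print(5.0 / inf)            # 0.0
print(inf + inf)            # inf
print(inf - inf)            # nan
print(inf / inf)            # nan
print(inf * 0.0)            # nan
```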

IEEE 754 Floating Point 64-Bit Standard (1985)

The IEEE 754 floating point standard for 64-bit (called double precision) numbers is very similar to the 32-bit standard. The main difference is the allocation of bits for the exponent and the significand. With 64 bits available to store floating point numbers, there is more freedom to increase the significant digits in the significand and to increase the range by allocating more bits to the exponent. The standard allocates 1 bit for the sign, 11 bits for the exponent, and 52 bits for the significand. The exponent uses a biased representation with a bias of 1023. The representation is shown below:

sign: 1 bit | exponent: 11 bits | significand: 52 bits

A number in this standard is thus

(−1)^s × (1.f)₂ × 2^(exponent − 1023),

where s is the sign bit and f represents the bits of the significand.

The largest positive number which can be represented in this standard is

0 11111111110 1111...1111 (1 bit | 11 bits | 52 bits)

which is (2 − 2^-52) × 2^(2046 − 1023) = (2 − 2^-52) × 2^1023 ≈ 1.8 × 10^308.
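These 64-bit limits can be confirmed from within a program; the sketch below (my addition, with an illustrative helper from_bits64) decodes the largest and the smallest normalized bit patterns (the latter is worked out just below) and compares them with the values Python reports for its double-precision floats.

```python
import struct
import sys

def from_bits64(pattern: int) -> float:
    """Interpret a 64-bit integer as an IEEE 754 binary64 value."""
    return struct.unpack('>d', pattern.to_bytes(8, 'big'))[0]

largest  = from_bits64(0x7FEFFFFFFFFFFFFF)  # 0 11111111110 111...1
smallest = from_bits64(0x0010000000000000)  # 0 00000000001 000...0
print(largest, largest == sys.float_info.max)     # 1.7976931348623157e+308 True
print(smallest, smallest == sys.float_info.min)   # 2.2250738585072014e-308 True
```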

The smallest positive normalized number that can be represented using 64 bits in this standard is

0 00000000001 0000...0000 (1 bit | 11 bits | 52 bits)

whose value is 2^(1 − 1023) = 2^-1022.

The definitions of ±0, ±∞, QNaN and SNaN remain the same, except that the number of bits in the exponent and significand are now 11 and 52 respectively. Subnormal numbers have a similar definition as in the 32-bit standard.

IEEE 754 Floating Point Standard 2008

This standard was introduced to enhance the scope of the IEEE 754 floating point standard of 1985. The enhancements were introduced to meet the demands of three sets of professionals:

1. Those working in the graphics area.
2. Those who use high performance computers for numeric-intensive computations.
3. Professionals carrying out financial transactions who require exact decimal computation.

This standard has not changed the format of 32- and 64-bit numbers. It has introduced floating point numbers which are 16 and 128 bits long, respectively called half and quadruple precision. Besides these, it has also introduced standards for floating point decimal numbers.

The 16-Bit Standard

The 16-bit format for real numbers was introduced for pixel storage in graphics and not for computation. (Some graphics processing units use the 16-bit format for computation.) The 16-bit format is shown in Figure 3.

Figure 3. Representation of 16-bit floating point numbers.
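The half-precision layout, and the limits derived in the next few paragraphs, can be inspected with Python's struct module, which supports IEEE 754 binary16 through the format code 'e' (Python 3.6 and later). A small sketch, my addition, with an illustrative helper fields16:

```python
import struct

def fields16(x: float):
    """Return (sign, exponent, significand) bit strings of x in IEEE 754 binary16."""
    bits = int.from_bytes(struct.pack('>e', x), 'big')
    b = f'{bits:016b}'
    return b[0], b[1:6], b[6:]

print(fields16(1.0))       # ('0', '01111', '0000000000'): bias 15, so the stored field is 15
print(fields16(65504.0))   # ('0', '11110', '1111111111'): the largest finite binary16 value
```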

With the 16-bit representation, 5 bits are used for the exponent and 10 bits for the significand. A biased exponent is used with a bias of 15, so the exponent range is −14 to +15. Remember that all 0s and all 1s for the exponent are reserved to represent 0 and ∞ respectively. The maximum positive number that can be stored is

+1.111...1 × 2^15 = (1 + (1 − 2^-10)) × 2^15 = (2 − 2^-10) × 2^15 = 65504.

The minimum positive normalized number is

+1.000...0 × 2^-14 = 2^-14 ≈ 0.61 × 10^-4.

The minimum subnormal 16-bit floating point number is 2^-24 ≈ 5.96 × 10^-8. The definitions of ±0, ±∞, NaN and SNaN follow the same ideas as in the 32-bit format.

The 128-Bit Standard

With improvements in computer technology, using 128 bits to represent floating point numbers has become feasible. This is sometimes used in numeric-intensive computation, where large rounding errors may occur. In this standard, besides a sign bit, 15 bits are used for the exponent and 112 bits for the significand (113 significant bits counting the implied leading 1). The representation is shown in Figure 4. A number in this standard is

(−1)^s × (1.f)₂ × 2^(exponent − 16383).

Figure 4. Representation of a 128-bit floating point number.

The largest positive number which can be represented in this format is

0 111111111111110 111...111 (1 bit | 15 bits | 112 bits)

which equals (2 − 2^-112) × 2^16383 ≈ 1.2 × 10^4932. The smallest normalized positive number which can be represented is

2^(1 − 16383) = 2^-16382 ≈ 3.4 × 10^-4932.

A subnormal number is represented by (−1)^s × 0.f × 2^-16382, and the smallest positive subnormal number is 2^(−16382 − 112) = 2^-16494. The definitions of ±0, ±∞, QNaN and SNaN are the same as in the 1985 standard except for the increase of bits in the exponent and significand. There are other minor changes which we will not discuss.

The Decimal Standard

The main motivation for introducing a decimal standard is the fact that a terminating decimal fraction need not give a terminating binary fraction. For example, (0.1)₁₀ = (0.00011 0011 0011 ...)₂, with the group 0011 recurring. If one computes 100000000 × 0.1 using binary arithmetic, rounding the non-terminating binary fraction up to the next larger number, the answer is 10000001.490116 instead of 10000000. This is unacceptable in many situations, particularly in financial transactions. There was a demand from professionals carrying out financial transactions to introduce a standard for representing decimal floating point numbers. Decimal floating point numbers are not new; they were available in COBOL. There was, however, no standard, and this resulted in non-portable programs. IEEE 754-2008 has thus introduced a standard for decimal floating point numbers to facilitate portability of programs.
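The motivating problem is easy to reproduce; the sketch below (my addition) contrasts binary floats with Python's decimal module, which provides decimal floating point arithmetic in the spirit of IEEE 754-2008.

```python
from decimal import Decimal

# Binary floating point: 0.1 is stored as a rounded binary fraction.
print(f'{0.1:.20f}')                        # 0.10000000000000000555...
print(sum(0.1 for _ in range(10)) == 1.0)   # False: the rounding errors accumulate

# Decimal floating point: 0.1 is represented exactly.
print(Decimal('0.1') * 100000000)                    # 10000000.0
print(sum(Decimal('0.1') for _ in range(10)) == 1)   # True
```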

Decimal Representation

The idea of encoding decimal numbers using bits is simple. For example, to represent 0.1 encoded (not converted) to binary, it is written as 0.1 × 10^0. The exponent, instead of being interpreted as a power of 2, is now interpreted as a power of 10. Both the significand and the exponent are now integers, and their binary representations have a finite number of bits. One method of representing 0.1 × 10^0 using this idea in a 32-bit word is shown in Figure 5.

We have assumed an 8-bit binary exponent in biased form. In this representation, there is no hidden 1 in the significand. The significand is an integer. (Fractions are represented by using an appropriate power of 10 in the exponent field.) For example, .000753658 will be written as 0.753658 × 10^-3, and 753658, an integer, will be converted to binary and stored as the significand. The power of 10, namely −3, will be an integer in the exponent part.

The IEEE Standard does not use this simple representation. This is due to the fact that the largest decimal significand that may be represented using this method is 0.8388607. It is preferable to obtain 7 significant digits, i.e., a significand up to 0.9999999. The IEEE Standard achieves this by borrowing some bits from the exponent field and reducing the range of the exponent. In other words, instead of a maximum exponent of 128, it is reduced to 96, and the bits saved are used to increase the significant digits in the significand. The coding of bits is as shown in Figure 6.

Figure 5. Representation of decimal floating point numbers.
Figure 6. Representation of 32-bit floating point numbers in the IEEE 2008 standard.
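Python's decimal module stores numbers in exactly this spirit: a sign, an integer coefficient (significand) and a power-of-ten exponent. A small sketch (my addition); note that the module always uses an integer coefficient, so the exponent of 0.000753658 comes out as −9 rather than the −3 of the fractional form used in the text.

```python
from decimal import Decimal

print(Decimal('0.1').as_tuple())
# DecimalTuple(sign=0, digits=(1,), exponent=-1): coefficient 1, exponent -1

print(Decimal('-0.000753658').as_tuple())
# DecimalTuple(sign=1, digits=(7, 5, 3, 6, 5, 8), exponent=-9)
```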

Observe the difference in terminology for the exponent and significand fields. In this coding, one part of the five combination bits is interpreted as exponent and the other part as an extension of the significand bits, using an ingenious method. This will be explained later in this section.

The IEEE 2008 Standard defines two formats for representing decimal floating point numbers. One of them interprets the binary significand field as a decimal integer between 0 and 10^p − 1, where p is the number of significant decimal digits in the significand. The other, called the densely packed decimal form, encodes the bits in the significand field directly to decimal. Both methods represent decimal floating point numbers as shown below:

Decimal floating point number = (−1)^s × f × 10^(exp − bias),

where s is the sign bit, f is a decimal fraction (the significand), and exp and bias are decimal numbers. The fraction need not be normalized but is usually normalized with a non-zero most significant digit. There is no hidden bit as in the binary standard. The standard defines decimal floating point formats for 32-bit, 64-bit and 128-bit numbers.

Densely Packed Decimal Significand Representation

Normally, when decimal numbers are coded in binary, 4 bits are required to represent each digit, as 3 bits give only 8 combinations, which are not sufficient to represent 10 digits. Four bits give 16 combinations, out of which we need only 10 to represent the decimal digits 0 to 9. Thus, we waste 6 combinations out of 16. For a 32-bit number, if 1 bit is used for the sign, 8 bits for the exponent and 23 bits for the significand, the maximum positive value of the significand will be +0.799999. The exponent range will be 00 to 99. With a bias of 50, the maximum positive decimal number will be 0.799999 × 10^49.

The number of significand digits in the above coding is not sufficient. A clever coding scheme for decimal encoding of binary numbers was discovered by Cowlishaw. This method encodes 10 bits as 3 decimal digits (remember that 2^10 = 1024, and one can encode 000 to 999 using the available 1024 combinations of 10 bits).

The IEEE 754 Standard uses this coding scheme. We will now describe the standard using this coding method for 32-bit numbers. The standard defines the following:

1. A sign bit: 0 for + and 1 for −.
2. A combination field of 5 bits. It is called a combination field as one part of it is used for the exponent and another part for the significand.
3. An exponent continuation field that is used for the exponent.
4. A coefficient field that encodes strings of 10 bits as 3 digits. Thus, 20 bits are encoded as 6 digits. This is part of the significand field.

For a 32-bit IEEE word, the distribution of bits was given in Figure 6, which is reproduced below for ready reference:

Sign: 1 bit | Combination field: 5 bits | Exponent continuation field: 6 bits | Coefficient field: 20 bits

The 5 bits of the combination field are used to increase the coefficient field by 1 digit, using the first ten of the sixteen possible 4-bit combinations to represent this extra digit. As there are 32 combinations of 5 bits, out of which only 10 are needed to extend the coefficient digits, the remaining 22 combinations are available to increase the exponent range and also to encode ∞ and NaN. This is done by an ingenious method explained in Table 3. (A sketch of the underlying counting argument follows the table.)

Table 3. Use of combination bits in the IEEE Decimal Standard.

Combination bits b1 b2 b3 b4 b5 | Type | Most significant bits of exponent | Most significant digit of coefficient
x y a b c | Finite | xy | 0abc
1 1 a b c | Finite | ab | 100c
1 1 1 1 0 | ±∞ | — | —
1 1 1 1 1 | NaN | — | —
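The counting argument behind the coefficient field, that 2^10 = 1024 combinations comfortably hold the 1000 values 000 to 999, can be illustrated with a naive packing. Note that this is only the counting idea, not Cowlishaw's actual densely packed decimal tables, which are chosen so that BCD digits can be converted with very simple logic. A sketch (my addition, with illustrative helper names):

```python
def pack3(d: int) -> int:
    """Naively store a 3-digit group 0..999 in 10 bits (not the real DPD encoding)."""
    assert 0 <= d <= 999
    return d            # 0..999 fits, since 2**10 = 1024

def pack_coefficient(digits: str) -> int:
    """Pack a 6-digit decimal coefficient into the 20-bit coefficient field."""
    assert len(digits) == 6 and digits.isdigit()
    hi, lo = int(digits[:3]), int(digits[3:])
    return (pack3(hi) << 10) | pack3(lo)

print(f'{pack_coefficient("753658"):020b}')   # two 10-bit groups holding 753 and 658
print(f'{753:010b}', f'{658:010b}')           # 1011110001 1010010010
```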

Thus, the most significant 2 bits of the exponent are constrained to take on the values 00, 01 and 10. Using these 2 bits together with the 6 bits of the exponent continuation field, we get an exponent range of 3 × 2^6 = 192 values. The exponent bias is 96. (Remember that the exponent is an integer that can be exactly represented in binary.) With the addition of 1 significant digit to the 6 digits of the coefficient bits, the decimal representation of 32-bit numbers has 7 significant digits. The largest number that can be stored is 0.9999999 × 10^96. A similar method is used to represent 64- and 128-bit floating point numbers. This is shown in Table 4.

Table 4. Decimal representation for 64-bit and 128-bit numbers.

Field | 64-bit | 128-bit
Sign bit | 1 | 1
Combination field (bits) | 5 | 5
Exponent continuation field (bits) | 8 | 12
Coefficient field (bits) | 50 | 110
Number of bits in number | 64 | 128
Significant digits | 16 | 34
Exponent range (decimal) | 768 | 12288
Exponent bias (decimal) | 384 | 6144

In the IEEE 2008 Decimal Standard, the combination bits along with the exponent bits are used to represent NaN, signalling NaN and ±∞, as shown in Table 5.

Table 5. Representation of NaNs, ±∞, and ±0. (x is 0 or 1.)

Quantity | Sign bit | Combination bits b0 b1 b2 b3 b4 b5
SNaN | 0 or 1 | 111111
QNaN | 0 or 1 | 111110
+∞ | 0 | 11110x
−∞ | 1 | 11110x
+0 | 0 | 00000x
−0 | 1 | 00000x

Binary Integer Representation of Significand

In this method, as we stated earlier, the p bits of the extended coefficient field constituting the significand are interpreted as a binary integer and used as the decimal significand. This representation also uses the bits of the combination field to increase the number of significant digits; the idea is similar and we will not repeat it. The representation of NaNs and ±∞ is the same as in the densely packed format. The major advantage of this representation is the simplicity of performing arithmetic using hardware arithmetic units designed to operate on binary numbers. The disadvantage is the time taken for conversion from integer to decimal. The densely packed decimal format allows simple encoding of binary to decimal but needs a specially designed decimal arithmetic unit.

Acknowledgements

I thank Prof. P C P Bhatt, Prof. C R Muthukrishnan, and Prof. S K Nandy for reading the first draft of this article and suggesting improvements.

Suggested Reading

[1] David Goldberg, What every computer scientist should know about floating-point arithmetic, ACM Computing Surveys, Vol. 23, No. 1, pp. 5–48, March 1991.
[2] V Rajaraman, Computer Oriented Numerical Methods, 3rd Edition, PHI Learning, New Delhi, pp. 15–32, 2013.
[3] M F Cowlishaw, Densely packed decimal encoding, IEE Proceedings – Computers and Digital Techniques, Vol. 149, No. 3, pp. 102–104, May 2002.
[4] Steve Hollasch, IEEE Standard 754 for floating point numbers, steve.hollasch.net/cgindex/coding/ieeefloat.html
[5] Decimal floating point, Wikipedia, en.wikipedia.org/wiki/Decimal_floating_point

Address for Correspondence: V Rajaraman, Supercomputer Education and Research Centre, Indian Institute of Science, Bengaluru 560 012, India. Email: rajaram@serc.iisc.in

GENERAL ARTICLE Binary Integer Representation of Significand In this method, as we stated earlier, if there are p bits in the extendedcoefficient fieldconstituting thesignificand, they are converted as a decimal integer. This representation also uses the bitsofthecombinationfieldtoincreasethenumberofsignificant digits.theideaissimilarandwewillnotrepeatit.therepresentation of NaNs and ± is the same as in the densely packed format. The major advantage of this representation is the simplicity of performing arithmetic using hardware arithmetic units designed to perform arithmetic using binary numbers. The disadvantage is the time taken for conversion from integer to decimal. The densely packed decimal format allows simple encoding of binary to decimal but needs a specially designed decimal arithmeticunit. Acknowledgements IthankProf.PCPBhatt,Prof.CRMuthukrishnan,andProf.S KNandyforreadingthefirstdraftofthisarticleandsuggesting improvements. Suggested Reading Address for Correspondence V Rajaraman Supercomputer Education and Research Centre Indian Institute of Science Bengaluru 560 012, India. Email: rajaram@serc.iisc.in [1] David Goldberg,Whatevery computer scientist should know about floatingpoint arithmetic, ACM Computing Surveys,Vol.28,No.1, pp5 48,March1991. [2] V Rajaraman, ComputerOrientedNumericalMethods, 3rd Edition, PHI Learning,NewDelhi,pp.15 32,2013. [3] Marc Cowlinshaw, Denselypacked decimal encoding, IEEE Proceedings Computer and Digital Techniques, Vol.149,No.3,pp.102-104, May 2002. [4] Steve Hollasch, IEEE Standard 754 for floatingpoint numbers, steve.hollasch.net/cgindex/coding/ieeefloat.html [5] Decimal Floating Point, Wikipedia, en.wikipedia.org/wiki/ Decimal_floating_point 30 RESONANCE January 2016