CS101 Introduction to computing Floating Point Numbers

Similar documents
CS222: Processor Design

COMP2611: Computer Organization. Data Representation

& Control Flow of Program

Floating Point Arithmetic

Data Representation Floating Point

Floating-Point Data Representation and Manipulation 198:231 Introduction to Computer Organization Lecture 3

ECE232: Hardware Organization and Design

Data Representation Floating Point

Floating Point Puzzles. Lecture 3B Floating Point. IEEE Floating Point. Fractional Binary Numbers. Topics. IEEE Standard 754

Chapter 2 Float Point Arithmetic. Real Numbers in Decimal Notation. Real Numbers in Decimal Notation

CSCI 402: Computer Architectures. Arithmetic for Computers (3) Fengguang Song Department of Computer & Information Science IUPUI.

Floating point. Today! IEEE Floating Point Standard! Rounding! Floating Point Operations! Mathematical properties. Next time. !

Floating Point Numbers

CS 33. Data Representation (Part 3) CS33 Intro to Computer Systems VIII 1 Copyright 2018 Thomas W. Doeppner. All rights reserved.

Finite arithmetic and error analysis

Computer Organization: A Programmer's Perspective

Floating Point Numbers

Floating Point Numbers

Systems I. Floating Point. Topics IEEE Floating Point Standard Rounding Floating Point Operations Mathematical properties

Data Representation Floating Point

Floating Point Puzzles The course that gives CMU its Zip! Floating Point Jan 22, IEEE Floating Point. Fractional Binary Numbers.

Floating Point : Introduction to Computer Systems 4 th Lecture, May 25, Instructor: Brian Railing. Carnegie Mellon

Floating Point January 24, 2008

Foundations of Computer Systems

Floating Point (with contributions from Dr. Bin Ren, William & Mary Computer Science)

Floating Point Puzzles. Lecture 3B Floating Point. IEEE Floating Point. Fractional Binary Numbers. Topics. IEEE Standard 754

Representing and Manipulating Floating Points

Bryant and O Hallaron, Computer Systems: A Programmer s Perspective, Third Edition. Carnegie Mellon

System Programming CISC 360. Floating Point September 16, 2008

Floating Point. CSE 238/2038/2138: Systems Programming. Instructor: Fatma CORUT ERGİN. Slides adapted from Bryant & O Hallaron s slides

Floating Point Numbers

FLOATING POINT NUMBERS

Representing and Manipulating Floating Points. Jo, Heeseung

3.5 Floating Point: Overview

CS429: Computer Organization and Architecture

Today: Floating Point. Floating Point. Fractional Binary Numbers. Fractional binary numbers. bi bi 1 b2 b1 b0 b 1 b 2 b 3 b j

15213 Recitation 2: Floating Point

Giving credit where credit is due

Computer System and programming in C

Computer (Literacy) Skills. Number representations and memory. Lubomír Bulej KDSS MFF UK

Giving credit where credit is due

Inf2C - Computer Systems Lecture 2 Data Representation

Representing and Manipulating Floating Points

Representing and Manipulating Floating Points. Computer Systems Laboratory Sungkyunkwan University

Floating point. Today. IEEE Floating Point Standard Rounding Floating Point Operations Mathematical properties Next time.

Floating Point Arithmetic

Representing and Manipulating Floating Points

Floating Point. The World is Not Just Integers. Programming languages support numbers with fraction

IEEE 754 Floating-Point Format

Chapter Three. Arithmetic

Integers and Floating Point

Floating Point. CSE 351 Autumn Instructor: Justin Hsia

These are reserved words of the C language. For example int, float, if, else, for, while etc.

CS 261 Fall Floating-Point Numbers. Mike Lam, Professor.

Floating-point Arithmetic. where you sum up the integer to the left of the decimal point and the fraction to the right.

CS 265. Computer Architecture. Wei Lu, Ph.D., P.Eng.

Floating Point Numbers. Lecture 9 CAP

Lecture 13: (Integer Multiplication and Division) FLOATING POINT NUMBERS

CS 261 Fall Floating-Point Numbers. Mike Lam, Professor.

Computer Architecture and IC Design Lab. Chapter 3 Part 2 Arithmetic for Computers Floating Point

Number Systems. Both numbers are positive

Computer Arithmetic Ch 8

Floating Point Arithmetic. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Computer Arithmetic Ch 8

Computer Architecture Chapter 3. Fall 2005 Department of Computer Science Kent State University

Module 2: Computer Arithmetic

Data Representation in Computer Memory

CS 61C: Great Ideas in Computer Architecture Performance and Floating Point Arithmetic


Floating Point. CSE 351 Autumn Instructor: Justin Hsia

Written Homework 3. Floating-Point Example (1/2)

Outline. What is Performance? Restating Performance Equation Time = Seconds. CPU Performance Factors

CS 61C: Great Ideas in Computer Architecture Performance and Floating-Point Arithmetic

EE 109 Unit 19. IEEE 754 Floating Point Representation Floating Point Arithmetic

C NUMERIC FORMATS. Overview. IEEE Single-Precision Floating-point Data Format. Figure C-0. Table C-0. Listing C-0.

COMP2121: Microprocessors and Interfacing. Number Systems

CS61c Midterm Review (fa06) Number representation and Floating points From your friendly reader

CS 6210 Fall 2016 Bei Wang. Lecture 4 Floating Point Systems Continued

Signed Multiplication Multiply the positives Negate result if signs of operand are different

Today CISC124. Building Modular Code. Designing Methods. Designing Methods, Cont. Designing Methods, Cont. Assignment 1 due this Friday, 7pm.

The ALU consists of combinational logic. Processes all data in the CPU. ALL von Neuman machines have an ALU loop.

Computer Systems C S Cynthia Lee

Chapter 3: Arithmetic for Computers

Course Schedule. CS 221 Computer Architecture. Week 3: Plan. I. Hexadecimals and Character Representations. Hexadecimal Representation

1.2 Round-off Errors and Computer Arithmetic

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666

The course that gives CMU its Zip! Floating Point Arithmetic Feb 17, 2000

Up next. Midterm. Today s lecture. To follow

CSCI 402: Computer Architectures. Arithmetic for Computers (4) Fengguang Song Department of Computer & Information Science IUPUI.

COSC 243. Data Representation 3. Lecture 3 - Data Representation 3 1. COSC 243 (Computer Architecture)

IEEE Standard for Floating-Point Arithmetic: 754

Floating-point representations

Floating-point representations

ECE 331 Hardware Organization and Design. Professor Jay Taneja UMass ECE - Discussion 5 2/22/2018

Floating Point. CSE 351 Autumn Instructor: Justin Hsia

Declaration. Fundamental Data Types. Modifying the Basic Types. Basic Data Types. All variables must be declared before being used.

IT 1204 Section 2.0. Data Representation and Arithmetic. 2009, University of Colombo School of Computing 1

Systems Programming and Computer Architecture ( )

Description Hex M E V smallest value > largest denormalized negative infinity number with hex representation 3BB0 ---

CO212 Lecture 10: Arithmetic & Logical Unit

Transcription:

CS101 Introduction to computing Floating Point Numbers A. Sahu and S. V.Rao Dept of Comp. Sc. & Engg. Indian Institute of Technology Guwahati 1

Outline Need to floating point number Number representation : IEEE 754 Floating point range Floating point density Accuracy Arithmetic and Logical Operation on FP Conversions and type casting in C 2

Need to go beyond integers integer 7 rational 5/8 real 3 complex 2 3 i complex real rational integer Extremely large and small values: distance pluto sun = 5.9 10 12 m mass of electron = 9.1 91x 10 28 gm

Representing fractions Integer pairs (for rational numbers) 5 8 = 5/8 Strings with explicit it decimal point 2 4 7. 0 9 Implicit point at a fixed position 0 1 0 0 1 1 0 1 0 1 1 0 0 0 1 0 1 1 Floating point fraction x base power implicit point

Numbers with binary point 101.1111 = 1x2 2 + 0x2 1 + 1x2 0 +. +1x2 1 + 1x2 2 = 4 + 1 +.+ 0.5 + 0.25 = 5.75 10 06 0.6 = 0.10011001100110011001... 00 00 00 00 00.6 x 2 = 1 +.2.2 x 2 = 0 +.4.4 x 2 = 0 +.8.8 x 2 = 1 +.6

Numeric Data Type char, short, int, long int char : 8 bit number (1 byte=1b) short: 16 bit number (2 byte) int : 32 bit number (4B) long int : 64 bit number (8B) float, double, long double float : 32 bit number (4B) double : 64 bit number (8B) long double : 128 bit number (16B) 6

Numeric Data Type unsigned char char unsigned short short Unsigned int int 7

Numeric Data Type char, short, int, long int We have : Signed and unsigned version char (8 bit) char : 128 to 127, we have +0 and 00 Fun unsigned char: 0 to 255 int : 2 31 to 2 31 1 unsigned int : 0 to 2 32 1 float, double, long double For fractional, real number data All these numberedare signed and get stored in different format 8

Sign bit Numeric Data Type Exponent Mantissa float Exponent Mantiss 1 Mantissa 2 double 9

FP numbers with base = 10 ( 1) S x F x 10 E S = Sign F = Fraction (fixed point number) usually called Mantissa or Significand E = Exponent (positive or negative integer) Example 5.9x10 12, 2.6x10 3 9.1 x 10 28 Only one non zero digit left to the point

FP numbers with base = 2 ( 1) S x F x 2 E S = Sign F = Fraction (fixed point number) usually called Mantissa or Significand E = Exponent (positive or negative integer) How to divide a word into S, F and E? How to represent S, F and E? Example 1.0101x2 12, 1.11012x10 3 1.101 x 2 18 Only one non zero digit left to the point: default it will be 1 incase of binary So no need to store this

IEEE 754 standard Single precision numbers 1 8 23 0 1011 0101 1101 0110 1011 0001 0110 110 S E F Double precision numbers 1 11 20+32 0 1011 0101 111 1101 0110 1011 0001 0110 S E F 1011 0001 0110 1100 1011 0101 1101 0110

Representing F in IEEE 754 Single precision numbers 23 1. 110101101011000101101101 F Double precision numbers 20+32 1. 101101011000101101101 F 101100010110110010110101110101101 Only one non zero digit left to the point: default it will be 1 incase of binary. So no need to store this bit

Value Range for F Single precision numbers 1 F 2 2 23 or 1 F <2 Double precision numbers 1 F 2 2 52 or 1 F <2 These are normalized.

Representing E in IEEE 754 Single precision numbers 8 10110101 E bias 127 Double precision numbers 10110101110 11 10110101110 E bias 1023

Floating point values E=E 127, V =( 1) s x 1.M x 2 E 127 V= 1.1101 1101 x 2 (40 127) =1.1101.. 1101 x 2 87 Single precision numbers 1 8 23 0 0010 1000 1101 0110 1011 0001 0110 110 S E F 16

Floating point values E=E 127, V =( 1) s x 1.M x 2 E 127 V= 1.1 x 2 (126 127) = 1.1 x 2 1 = 0.11x2 0 = 0.11 = 11/2 2 10= 3/4 10 = 0.75 10 Single precision numbers 1 8 23 1 0111 1110 1000 0000 0000 0000 0000 000 S E F 17

Value Range for E Single precision numbers 126 E 127 (all 0 s and all 1 s have special meanings) Double precision numbers 1022 E 1023 (ll0 (all 0 s and all 1 s have special meanings)

Floating point demo applet on the web https://www.h schmidt.net/floatconverter/ieee754.html Google Float applet to get the above link 19

Overflow and underflow largest positive/negative number (SP) = ±(2 2 23 ) x 2 127 ±2 x 10 38 smallest positive/negative number (SP) = ± 1 x 2 126 ±2 x 10 38 Largest positive/negative number (DP) = ±(2 ( 2 52 ) x 2 1023 ± 2 x 10 308 Smallest positive/negative number (DP) = ± 1 x 2 1022 ± 2 x 10 308

Density of int vs float Int : 32 bit Exponent Mantissa Float : 32 bit Number of number can be represented Both the cases (float, int) : 2 32 Range int ( 2 31 to 2 31 1) float Large ±(2 ( 2 23 ) x 2 127 Small± 1 x 2 126 50% of float numbers are Small (less then ±1 ) 21

Density of Floating Points 256 Persons in Room of Capacity 256 (Range) 8 bit integer : 256/256 = 1 256 person in Room of Capacity 200000 (Range) 1 st Row should be filled with 128 person 50% number with negative power are 1 < N > +1 Density of Floating point number is Dense towards 0 Sparse towards 2 2 1 1 0 +1 +2 + 22

Expressible Numbers(int and float) Expressible integers overflow + overflow 2 31 0 2 31 1 Expressible Float underflow + underflow overflow (1 2 24 )x2 128 + overflow 0 0.5x2 127 0.5x2 127 (1 2 24 )x2 128

Distribution of Values 6 bit IEEE like format e = 3 exponent bits f = 2 fraction bits Bias is 3 Notice how the distribution gets denser toward zero. -15-10 -5 0 5 10 15 Denormalized Normalized Infinity

Distribution of Values 6 bit IEEE like format e = 3 exponent bits f = 2 fraction bits Bias is 3 (l (close up view) -1-0.5 0 0.5 1 Denormalized Normalized Infinity

Density of 32 bit float SP Fraction/mantissa is 23 bit Number of different number can be stored for particular value of exponent Assume for exp=1, 2 23 =8x1024x1024 8x10 6 Between 1 2 we can store 8x10 6 numbers Similarly for exp=2, between 2 4, 8x10 6 number of number can be stored for exp=3, between 4 8, 8x10 6 number of number can be stored for exp=4, between 8 16, 8x10 6 number of number can be stored 26

Similarly Density of 32 bit float SP for exp=23, between 2 22 22 23, 810 8x10 6 number of number can be stored for exp=24, bt between 2 23 22 24, 810 8x10 6 number of number can be stored OK for exp=25, between 2 24 22 25, 8x10 6 number of number can be stored 2 24 2 25 >8 x10 6 BAD for exp=127, between 2 126 22 127, 8x10 6 number of number can be stored WROST 27

Density of 32 bit float SP 2 23 =8x1024x1024 8x10 6 0 1 2 4 8 16 28

Numbers in float format largest positive/negative number (SP) = ±(2 2 23 ) x 2 127 ± 2 x 10 38 Second largest number : ±(2 2 22 ) x 2 127 Difference Largest FP 2 nd largest FP = (2 23 2 22 )x2 127 =2x2 105 =2x10 32 Smallest positive/negative number (SP) = ± 1 x 2 126 ±2 x 10 38 29

Addition/Sub of Floating Point Step I: 3.2 x 10 8 ± 2.8 x 10 6 Align Exponents 320 x 10 6 ± 2.8 x 10 6 Step 2: Add Mantissas 322.8 x 10 6 Step 3: Normalize 3.228 x 10 8 30

Floating point operations: ADD Add/subtract A = A1 ± A2 [( 1) S1 x F1 x 2 E1 ] ± [( 1) S2 x F2 x 2 E2 ] suppose E1 > E2, then we can write it as pp, [( 1) S1 x F1 x 2 E1 ] ± [( 1) S2 x F2 x 2 E1 ] where F2 = F2 / 2 E1 E2, 32x10 8 ± 28x10 6 3.2 x 10 8 ± 2.8 x 10 6 320 x 10 6 ± 2.8 x 10 6 The result is 322.8 x 10 6 ( 1) S1 x (F1 ± F2 ) x 2 E1 3.228 x 10 8 It may need to be normalized

Testing Associatively with FP X= 1.5x10 38, Y=1.5x10 38, z=1000.00 X+(Y+Z) = 1.5x10 38 + (1.5x10 38 + 1000.0) = 1.5x100 38 + 1.5x10 38 =0 (X+Y)+Z = ( 1.5x10 38 + 1.5x10 38 ) + 1000.0 = 00+ 0.0 1000.0 0 =1000 32

Multiply Floating Point Step I: 32 3.2 x 10 8 X 58 5.8 x 10 6 Multiply Mantissas 3.2 X 5.8 X 10 8 x 10 6 Step 2: Add Exponents 18.56 x 10 14 Step 3: Normalize 1.856 x 10 15 For 32 bit SP Float : one 23 bit multiplication and 8 bit addition For 32 bit int: one 32 bit multiplication Above example: 3.2x5.8 is simpler, also 6+8 is also simpler as compared to 32 bit multiplication 33

Floating point operations Multiply [( 1) S1 x F1 x 2 E1 ] x [( 1) S2 x F2 x 2 E2 ] = ( 1) S1 S2 x (F1xF2) x 2 E1+E2 Since 1 (F1xF2) < 4, the result may need to be normalized 32x 3.2 10 8 X 58x 5.8 10 6 3.2 X 5.8 X 10 8 x 10 6 18.56 x 10 14 1.856 x 10 15

Floating point operations Divide [( 1) S1 x F1 x 2 E1 ] [( 1) S2 x F2 x 2 E2 ] = ( 1) S1 S2 x (F1 F2) x 2 E1 E2 Since.5 < (F1 F2) < 2, the result may need to be normalized (assume F2 0)

Float and double Float : single precision floating point Double : Double precision floating point Floating points operation are slower But not in newer PC Double operation are even slower Precision/Accuracy in Calculation Integer Float Double Speed of Calculation 36

Floating point Comparison Three phases Phase I: Compare sign (give result) Phase II: If (sign of both numbers are same) Compare exponents and give result 90% of case it fall in this categories Faster as compare to integer comparison : Require only 8 bit comparison for float and 11 bit for double (Example : sorting of float numbers) ( p g ) Phase III: If (both sign and exponents are same) compare fraction/mantissa

Storing and Printing Floating Point float x=145.0,y; y=sqrt(sqrt((x))); x=(y*y)*(y*y); printf("\nx=%f",x); float x=1.0/3.0; if ( x==1.0/3.0) 0) printf( YES ); else printf( NO ); Many Round off cause loss of accuracy x=145.000015 Value stored in x is not exactly same as 1.0/3.0 One is before round of and other (stored x) is after round of NO 38

Storing and Printing Floating Point float a=34359243.5366233; 359 3.5366 33; float b=3.5366233; float c=0.00000212363; 00000212363; printf("\na=%8.6f, b=%8.6f c=%8.12f\n", a, b, c ); a=34359243.000000 b=3.5366233 c=0.000002123630 Big number with small fraction can not combined 39

Storing and Printing Floating Point //15 S digits to store float a=34359243.5366233; //8 S digits to store float b=3.5366233; //6 S digits to store float c=0.00000212363; Thumb rule: 8 to 9 significant digits of a number can be stored in a 32 bit number 40

Thanks 41