CSCI 402: Computer Architectures

Arithmetic for Computers (5)
Fengguang Song, Department of Computer & Information Science, IUPUI

What happens when the exact result is not a floating-point number, or is too small or too large to represent accurately? You get an exception.

The following 7 slides are from Jim Demmel's CS267 course at UCB.

Exception Handling (1/7)

5 floating point exceptions:
- Underflow: exact result is nonzero and smaller in magnitude than UN (too small to represent)
- Overflow: exact result is larger in magnitude than OV (too large to represent)
- Divide-by-zero: nonzero / 0
- Invalid: e.g., 0/0, sqrt(-1)
- Inexact: the result had to be rounded (very common)

Possible responses:
- Stop with an error message (unfriendly, not the default)
- Keep computing (the default!)

IEEE Floating Point Arithmetic Standard: Underflow (2/7)

Underflow exception: occurs when the exact nonzero result is smaller in magnitude than the underflow threshold UN
- Ex: UN/3
- Returns a denorm, or zero
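As a small illustration of the UN/3 example, the following C sketch divides the smallest normalized double by 3 and classifies the result. It assumes that UN corresponds to DBL_MIN from <float.h> and that gradual underflow is enabled (no flush-to-zero mode); the function names are made up for this example.

```c
#include <float.h>
#include <math.h>

/* Hypothetical helper: computes UN/3, taking UN to be DBL_MIN,
   the smallest positive normalized double (an assumption). */
double underflow_example(void)
{
    volatile double un = DBL_MIN;  /* volatile: do the division at run time */
    return un / 3.0;
}

/* Returns 1 if x is a denormalized (subnormal) number, else 0. */
int is_denorm(double x)
{
    return fpclassify(x) == FP_SUBNORMAL;
}
```

With the default settings, is_denorm(underflow_example()) is expected to be 1: the result is a nonzero denorm rather than zero.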

IEEE Floating Point Arithmetic Standard: Infinity (3/7)

Overflow exception: occurs when the exact finite result is too large to represent accurately
- Ex: 2*OV
- Returns +-infinity

Divide-by-zero exception: returns +-infinity = 1/+-0
- The sign of zero is important!

Also returns +-infinity for: 3+infinity, 2*infinity, infinity*infinity
- Note: such a result is exact, not an exception!

IEEE Floating Point Arithmetic Standard: NAN (Not A Number) (4/7)

Invalid --> NAN: occurs when the exact result is not a well-defined number
- 0/0
- sqrt(-1)
- infinity-infinity, infinity/infinity, 0*infinity
- NAN + 3
- NAN > 3?
- The arithmetic operations return a NAN in all these cases; the comparison NAN > 3 is unordered and evaluates to false

There are two kinds of NANs:
- Quiet: propagates without raising an exception
- Signaling: generates an exception when touched
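The default results for division by zero and invalid operations can be observed with a short C sketch. It assumes IEEE 754 arithmetic in the default non-trapping mode (and no "fast math" compiler options); fp_div and classify are names invented for this example.

```c
#include <math.h>

/* Divides at run time (volatile blocks compile-time folding). */
double fp_div(double num, double den)
{
    volatile double n = num, d = den;
    return n / d;
}

/* Classifies x: 2 for NAN, +1 for +infinity, -1 for -infinity, 0 otherwise. */
int classify(double x)
{
    if (isnan(x))
        return 2;
    if (isinf(x))
        return x > 0.0 ? 1 : -1;
    return 0;
}
```

For example, classify(fp_div(1.0, 0.0)) gives 1 and classify(fp_div(1.0, -0.0)) gives -1, which shows why the sign of zero matters, while classify(fp_div(0.0, 0.0)) gives 2.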

Exception Handling User Interface (5/7)

Each of the previous 5 exceptions has the following features:
- A sticky flag, which is set as soon as an exception occurs
- The sticky flag can be reset and read by the user:
    1. Reset overflow_flag and invalid_flag
    2. Perform a computation
    3. Test overflow_flag and invalid_flag to see if any exception occurred
- An exception flag, which indicates whether a trap should occur or not
    - No trapping is the default: continue computing, returning a NAN, infinity or denorm
    - On a trap, there should be a user-writable exception handler with access to the parameters of the exceptional operation
    - Trapping or precise interrupts like this are rarely implemented, for performance reasons

Exception Handling Summary (6/7)

The IEEE standard defines five exceptions, each of which returns a default value and has a corresponding status flag that is raised when the exception occurs. The 5 possible exceptions are:
- Invalid operation (e.g., square root of a negative number, 0/0): returns NaN by default.
- Division by zero (an operation on finite operands gives an exact infinite result, e.g., 1/0 or log(0)): returns ±infinity by default.
- Overflow (a result is too large to be represented correctly): returns ±infinity by default (in round-to-nearest mode).
- Underflow (a result is very small (outside the normal range) and is inexact): returns a denormalized value by default.
- Inexact (the result had to be rounded): returns a rounded result by default.

Hazards of Parallel and Heterogeneous Computing (7/7)

What new bugs may arise in parallel floating point programs?
- Ex 1: Nonrepeatability
    - Makes debugging hard!
- Ex 2: Different exception handling
    - Can cause programs to hang
- Ex 3: Different rounding (even on IEEE FP machines)
    - Can cause hanging, or wrong results with no warning

x86 FP Architecture

- Originally based on the 8087 FP coprocessor
    - 8 x 80-bit extended-precision registers
    - Used as a push-down register stack
    - Registers indexed from TOS: ST(0), ST(1), ...
- FP values are 32-bit or 64-bit in memory
    - Converted on load/store of a memory operand
    - Integer operands can also be converted on load/store
- But it is difficult to generate and optimize code for
    - Result: poor FP performance

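Revisiting C1 on MIPS

[Figure: block diagram of the MIPS CPU, the FPU ("coprocessor 1"), and memory]
- CPU (central processing unit): registers $0, ..., $31; integer arithmetic, multiplication, division, logical ops
- FPU (floating point unit), "coprocessor 1": registers $f0, ..., $f31; floating point arithmetic, multiplication, division, int <-> float conversion
- mfc1 and mtc1 move values between the CPU and FPU registers
- lw and sw connect the CPU to memory (2^32 bytes); lwc1 and swc1 connect the FPU to memory

Instruction encodings:
- Floating-point addition, double: add.d fd, fs, ft    fields: 0x11 | 0x11 | ft | fs | fd
- Floating-point addition, single: add.s fd, fs, ft    fields: 0x11 | 0x10 | ft | fs | fd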
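To make the add.s / add.d field layout concrete, here is a minimal C sketch that packs the fields into a 32-bit instruction word. The field widths (6, 5, 5, 5, 5, 6 bits) and the trailing function code 0 for "add" are taken from the standard MIPS32 encoding rather than from the slide, and encode_fp_add and the enum names are invented for this example.

```c
#include <assert.h>
#include <stdint.h>

/* Values from the slide: coprocessor-1 opcode, and the single/double format codes. */
enum { COP1_OPCODE = 0x11, FMT_SINGLE = 0x10, FMT_DOUBLE = 0x11 };

/* Packs opcode | fmt | ft | fs | fd | funct into one instruction word.
   Assumed widths: 6, 5, 5, 5, 5 and 6 bits; funct = 0 selects add. */
uint32_t encode_fp_add(unsigned fmt, unsigned fd, unsigned fs, unsigned ft)
{
    assert(fmt < 32 && fd < 32 && fs < 32 && ft < 32);  /* 5-bit fields */
    return ((uint32_t)COP1_OPCODE << 26) | ((uint32_t)fmt << 21) |
           ((uint32_t)ft << 16) | ((uint32_t)fs << 11) |
           ((uint32_t)fd << 6) | 0u;
}
```

For instance, encode_fp_add(FMT_DOUBLE, 2, 4, 6) builds the word for add.d $f2, $f4, $f6; the single-precision variant differs only in the format field.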

An Example of Coprocessor 1

E.g., MIPS R3000 CPU + MIPS R3010 FPU
- The R3010 FPU can perform conversion, comparison, load, store, move and arithmetic operations on single- and double-precision numbers
- The R3010 FPU resides on the same system bus as the main CPU, and communicates with the CPU through the R3000 coprocessor interface
- During operation, the R3010 coprocessor monitors the system bus; when it encounters a floating-point instruction, it loads it into its 6-stage pipeline, where the instruction is decoded and executed

Paper: MIPS R300 FP Coprocessor,

AMD Bulldozer Processors (in 2010)

Subword Parallelism (last topic in chapter 3)

Graphics and audio (multimedia) applications can perform concurrent operations on vectors
- E.g., given a 128-bit adder, you may do any of the following:
    - 16 8-bit adds, or
    - 8 16-bit adds, or
    - 4 32-bit adds
- This is also called data-level parallelism, vector parallelism, or Single Instruction Multiple Data (SIMD)

Streaming SIMD Extension 2 (SSE2)

- SSE2 was introduced by Intel (in the Pentium 4) in 2001
- SSE3 was introduced later, in 2004
- AVX (i.e., 256-bit wide) followed in 2011
- Can be used for multiple FP operands; a 256-bit register holds
    - 4 (64-bit) double-precision values, or
    - 8 (32-bit) single-precision values
- A single instruction operates on them simultaneously
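The idea of splitting one wide adder into independent lanes can be imitated in portable C. The sketch below is not an SSE or AVX instruction; it is a software emulation on a 64-bit word ("SIMD within a register") that performs eight 8-bit adds at once by keeping carries from crossing lane boundaries. The function name add_8x8 is made up for this example.

```c
#include <stdint.h>

/* Adds eight 8-bit lanes packed into 64-bit words. Carries never cross
   a lane boundary, so each lane wraps around modulo 256 on its own. */
uint64_t add_8x8(uint64_t a, uint64_t b)
{
    const uint64_t low7 = 0x7F7F7F7F7F7F7F7FULL;  /* low 7 bits of every lane */
    const uint64_t top1 = 0x8080808080808080ULL;  /* top bit of every lane    */

    uint64_t sum_low = (a & low7) + (b & low7);   /* cannot carry out of a lane */
    return sum_low ^ ((a ^ b) & top1);            /* add the top bits, dropping the carry */
}
```

A hardware SIMD adder achieves the same effect by cutting the carry chain at the lane boundaries, which is why one 128-bit adder can serve as 16, 8, or 4 narrower adders.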

AVX (256-bit wide) was introduced with Intel Sandy Bridge.

Matrix Multiply: unoptimized code

    void dgemm (int n, double* A, double* B, double* C)
    {
        for (int i = 0; i < n; ++i)
            for (int j = 0; j < n; ++j)
            {
                double cij = C[i+j*n]; /* cij = C[i][j] */
                for (int k = 0; k < n; k++)
                    cij += A[i+k*n] * B[k+j*n]; /* cij += A[i][k]*B[k][j] */
                C[i+j*n] = cij; /* C[i][j] = cij */
            }
    }

[Figure: the AVX version computes 4 elements of a column of C at a time, from 4 rows of A (indexed by i, k) and one broadcast element of B (indexed by k, j)]

Performance of the optimized version: 3.85 times as fast as the unoptimized code on Intel Core i7.

Matrix Multiply: optimized C code

    #include <x86intrin.h>
    /* Assumes n is a multiple of 4 and that A, B and C are 32-byte aligned,
       as required by _mm256_load_pd and _mm256_store_pd. */
    void dgemm (int n, double* A, double* B, double* C)
    {
        for (int i = 0; i < n; i += 4)
            for (int j = 0; j < n; j++) {
                __m256d c0 = _mm256_load_pd(C+i+j*n); /* c0 = C[i][j] */
                for (int k = 0; k < n; k++)
                    c0 = _mm256_add_pd(c0, /* c0 += A[i][k]*B[k][j] */
                             _mm256_mul_pd(_mm256_load_pd(A+i+k*n),
                                           _mm256_broadcast_sd(B+k+j*n)));
                _mm256_store_pd(C+i+j*n, c0); /* C[i][j] = c0 */
            }
    }

1st Fallacy: Right Shift for Division

- Left-shifting an integer by i bits is the same as multiplying it by 2^i -> Correct.
- Right shift by i is the same as integer division by 2^i? No!
    - Only correct for unsigned integers
    - E.g., given the signed integer -5 and an arithmetic right shift (which replicates the sign bit):
      -5 / 4 has quotient -1, but in 8-bit two's complement 11111011 >> 2 = 11111110 = -2

Pitfall: Associativity

Floating-point addition is NOT associative.
- Parallel programs can interleave operations in an unexpected order
- Assuming associativity may fail

With x = -1.50E+38, y = 1.50E+38, z = 1.00E+00:

                      (x+y)+z               x+(y+z)
    first sum         x+y = 0.00E+00        y+z = 1.50E+38
    final result      1.00E+00              0.00E+00

Why? y+z must be rounded to the available precision, and 1.0 is far too small to change 1.50E+38, so z is lost.

So, you must validate parallel programs under varying degrees of parallelism.
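Both points above can be checked with a few lines of C. This is a sketch under two assumptions: the compiler implements >> on negative signed integers as an arithmetic shift (the C standard leaves this implementation-defined, but mainstream compilers do so), and float is IEEE single precision. All function names are invented for the example; the x, y, z values are the ones from the associativity pitfall.

```c
/* Integer division by 4; C truncates the quotient toward zero. */
int div_by_4(int x)
{
    return x / 4;
}

/* Right shift by 2; assumed to be an arithmetic shift for negative x. */
int shift_by_2(int x)
{
    return x >> 2;
}

/* (x + y) + z in single precision. */
float sum_left(float x, float y, float z)
{
    volatile float t = x + y;  /* volatile: force rounding to float */
    return t + z;
}

/* x + (y + z) in single precision. */
float sum_right(float x, float y, float z)
{
    volatile float t = y + z;  /* volatile: force rounding to float */
    return x + t;
}
```

With x = -1.5e38f, y = 1.5e38f and z = 1.0f, sum_left returns 1.0f while sum_right returns 0.0f; similarly, div_by_4(-5) is -1 whereas shift_by_2(-5) is -2 on compilers that shift arithmetically.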

Parallel Execution

2nd Fallacy: Parallel execution strategies that work for integers also work for floating-point data types?
- Sequential version -> write a parallel version
- Not true for FP numbers
- Try to use standard libraries: LAPACK, ScaLAPACK

Pitfall: Despite its name, the MIPS instruction addiu is used to add constants to signed numbers!
- Purpose: no overflow detection

3rd Fallacy: Only mathematicians care about floating point accuracy
- Indeed, accuracy is very important for scientific code
- But how about general/average users?
- Example: the Intel Pentium division bug
    - Some decimal digits of a result, up to the 15th, may have errors
    - Intel claimed: the average spreadsheet user would only see it every 27,000 years (how did they get this number?)
    - IBM Research counterclaimed: an error every 24 days for the average spreadsheet user
    - Eventually Intel recalled the chips (the recall cost Intel 300 million dollars)
- Conclusion: the market does expect accuracy

[Figure: a sampling of newspaper and magazine articles from November 1994, including the New York Times, San Jose Mercury News, San Francisco Chronicle, and Infoworld.] The Pentium floating-point divide bug even made the Top 10 List of the David Letterman Late Show on television. Intel eventually took a $300 million writeoff to replace the buggy chips.

More Cases: Disasters Caused by Numerical Errors

Remarks

ISAs support arithmetic:
- Signed and unsigned integers
- Floating-point approximation to real numbers
- Both have limited range and precision! Operations can overflow and underflow

MIPS ISA:
- Core instructions: the 54 most frequently used instructions (Table 3.26 in the textbook)
    - They cover 100% of SPECINT and 97% of SPECFP!
- Other instructions: less frequent

BTW, What Does 64-bit Machine Mean?

Typically means:
- Registers have 64 bits
- Pointers are 64 bits (for a larger memory space)

However, the size of a data type is decided by the compiler
- E.g., int is usually still 32 bits (the compiler defines it as 32 bits)
- Need to use 64-bit integers? -> use long or size_t (or int64_t from <stdint.h>, since long is only 32 bits on 64-bit Windows)
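Since the compiler and ABI decide the type sizes, the simplest check is to ask the compiler. The following C sketch reports the widths of a few types in bits; the function names are made up for this example, and the actual values depend on the platform (for instance, long is typically 64 bits on 64-bit Linux but 32 bits on 64-bit Windows).

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

/* Widths in bits, as chosen by the compiler and ABI in use. */
unsigned bits_of_int(void)     { return (unsigned)(CHAR_BIT * sizeof(int)); }
unsigned bits_of_long(void)    { return (unsigned)(CHAR_BIT * sizeof(long)); }
unsigned bits_of_size_t(void)  { return (unsigned)(CHAR_BIT * sizeof(size_t)); }
unsigned bits_of_pointer(void) { return (unsigned)(CHAR_BIT * sizeof(void *)); }

/* int64_t is exactly 64 bits wherever it is defined. */
unsigned bits_of_int64(void)   { return (unsigned)(CHAR_BIT * sizeof(int64_t)); }
```

Printing these values on a given machine shows which of int, long and size_t can actually hold a 64-bit quantity there; int64_t is the choice that does not depend on the platform.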
