Highly Optimized Mathematical Functions for the Itanium Processor



1 Highly Optimized Mathematical Functions for the Itanium Processor
- Speaker: Shane Story, Software Engineer, CSL Numerics Group, Corporation
Copyright Corporation.

2 Agenda
- Itanium Processor Math Functionality Overview
- Sample Algorithms
- Performance
- Accuracy and Testing
- Open Source Website

3 Overview: Itanium Processor Math Functionality
- Target key math library functions in the C++ and Fortran Compiler
  - exp: $e^x$; log: $\ln(x)$; log10: $\log_{10}(x)$; pow(x,y): $x^y$
  - cos: cosine(x); sin: sine(x); tan: tangent(x); atan2(y,x): arctangent(y/x)
- Open Source Website
- Support instructions for IA-32 binaries: f2xm1, fyl2x, fsin, fcos, fptan, fpatan (Paper: 1999 Computer Arithmetic Conference)

4 Overview: Software Math Functionality (Optimized Assembler / C Coded)
- Exponential: exp, exp10, exp2, log, log10, log2, cbrt, pow
- Trigonometric: sin, cos, tan, asin, acos, atan, atan2
- Hyperbolic: sinh, cosh, tanh, asinh, acosh, atanh
- Others: floor, ceil, modf, scalb, rem, j0, erf, ...

5 Overview: Supported Precisions
- Float / Real*4 (32-bit floating-point): expf, logf, sinf, powf, ...
- Double / Real*8 (64-bit floating-point): exp, log, sin, pow, ...
- Extended / Real*10 (80-bit floating-point): expl, logl, sinl, powl, ...

6 Overview: Itanium Processor Floating-Point Formats
- IEEE 754 floating-point formats: value = $(\pm 1) \times 2^{\text{exponent}} \times \text{significand}$, with the exponent stored in biased form
- Field layout: sign | exponent | significand

  Format    Total bits  Sign  Exponent  Significand
  Float     32          1     8         23
  Double    64          1     11        52
  Extended  80          1     15        64
  Internal  82          1     17        64
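The field widths in the Double row can be checked directly on a host machine. The following is a minimal sketch in Python, not part of the original slides; the helper names `decode_double` and `rebuild_normal` are illustrative assumptions, and only the 64-bit format is covered because Python floats are IEEE 754 doubles.

```python
import struct

def decode_double(x: float):
    """Split an IEEE 754 double into (sign, biased_exponent, fraction) fields."""
    (bits,) = struct.unpack(">Q", struct.pack(">d", x))
    sign = bits >> 63                      # 1 bit
    biased_exp = (bits >> 52) & 0x7FF      # 11 bits
    fraction = bits & ((1 << 52) - 1)      # 52 bits
    return sign, biased_exp, fraction

def rebuild_normal(sign: int, biased_exp: int, fraction: int) -> float:
    """Recompute the value of a normal double from its fields (bias 1023)."""
    significand = 1 + fraction / 2**52     # implicit leading 1 for doubles
    return (-1) ** sign * significand * 2.0 ** (biased_exp - 1023)

sign, biased_exp, fraction = decode_double(-6.5)
print(sign, biased_exp - 1023, hex(fraction))
```

Here -6.5 is -1.625 times 2^2, so the unbiased exponent is 2 and the fraction field holds the bits of 0.625.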

7 Overview: Supporting Standards and Error Support
- C99 (C Numerics Standard)
  - Special values: log(-1), exp(50,000)
  - Unique classes: QNaN, ±Inf
- Linux or Windows*
  - Errno: EDOM or ERANGE
  - Matherr: user takes control
- IEEE floating-point exception flags
* Other brands and names are the property of their respective owners
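The two special-value cases on this slide, log(-1) and exp(50,000), can be reproduced in any language that follows the C conventions. As an illustration only (this is Python's `math` module, not the library described in the talk), a domain error surfaces as `ValueError` and an overflow range error as `OverflowError`; the helper `classify` is a hypothetical name.

```python
import math

def classify(func, x):
    """Return the result, or the name of the C-style error class raised."""
    try:
        return func(x)
    except ValueError:       # domain error, the analogue of errno = EDOM
        return "EDOM"
    except OverflowError:    # range error, the analogue of errno = ERANGE
        return "ERANGE"

print(classify(math.log, -1.0))       # argument outside the domain of log
print(classify(math.exp, 50000.0))    # result too large for a double
print(classify(math.log, math.inf))   # special value handled without error
```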

8 Overview: Key Messages
- Provide a full math library solution
  - Targeting all important functions
  - Supporting important precisions
  - Flexible error handling mechanism
- What algorithms were used?

9 Sample Algorithms: Algorithmic Characteristics
- Table-based
  - Argument reduction
  - Table lookup
  - Polynomial approximation
  - Reconstruction

10 Sample Algorithms: Simple Table-Based Example (1)
- $\exp(x) = e^x$
- Reduction: $N = \operatorname{rint}(4x / \ln 2)$ and $r = x - N \ln(2)/4$, so $x = N \ln(2)/4 + r$
- $\exp(x) = e^{N \ln(2)/4 + r} = 2^{N/4} e^r$ (table part $2^{N/4}$, reduced argument $r$)
- Table lookup: load an approximation to $2^{N/4}$, based on $(2^{1/4})^N$, and here 4092 N (10-byte values)

11 Sample Algorithms: Simple Table-Based Example (2)
- $\exp(x) = e^x = e^{N \ln(2)/4 + r} = 2^{N/4} e^r$
- Polynomial approximation: $e^r \approx 1 + r + r^2/2 + r^3/6 + \cdots + r^{10}/10!$
  - Coefficients in tables or created on-the-fly
- Reconstruction: $\exp(x) \approx 2^{N/4} e^r$
  - Table part $2^{N/4}$: rounding error
  - $e^r \approx 1 + r + r^2/2 + r^3/6 + \cdots + r^{10}/10!$: rounding error and truncation error
  - Product $2^{N/4} e^r$: rounding error
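The reduction, table lookup, polynomial, and reconstruction steps for exp can be sketched end to end. This is a minimal Python sketch under stated assumptions, not the library's implementation: it works in double precision only, uses a four-entry table of $2^{j/4}$ and applies the integer part of $N/4$ with `ldexp` (a choice made here for brevity), and takes plain Taylor coefficients. The names `exp_table`, `TABLE`, and `LN2_OVER_4` are illustrative.

```python
import math

LN2_OVER_4 = math.log(2.0) / 4.0
# Four table entries 2^(j/4); the integer part of N/4 is applied with ldexp.
TABLE = [2.0 ** (j / 4.0) for j in range(4)]

def exp_table(x: float) -> float:
    """exp(x) via N = rint(4x/ln 2), r = x - N ln(2)/4, exp(x) = 2^(N/4) e^r."""
    n = round(x / LN2_OVER_4)          # reduction: nearest integer N
    r = x - n * LN2_OVER_4             # reduced argument, |r| is about ln(2)/8 at most
    # Degree-10 Taylor polynomial for e^r, evaluated with Horner's rule.
    poly = 0.0
    for k in range(10, -1, -1):
        poly = poly * r + 1.0 / math.factorial(k)
    q, j = divmod(n, 4)                # N/4 = q + j/4 with j in 0..3
    return math.ldexp(TABLE[j] * poly, q)   # reconstruction: 2^q * 2^(j/4) * e^r

for x in (-1.5, 0.3, 10.0):
    print(x, exp_table(x), math.exp(x))
```

Because the reduced argument is computed in plain double precision, this sketch loses a few bits for large |x|; the slides rely on the wider internal register format to keep that error small.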

12 Sample Algorithms: Itanium Processor Attributes (1)
- Parallelism: 2 FP units, 2 integer units, 2 memory units
- Internal register file format
  - 64-bit significand
  - Widest-range exponent (17 bits)
- Four floating-point status registers
  - User: s0
  - Internal: s1 and possibly s2 or s3

13 Sample Algorithms: Itanium Processor Attributes (2)
- FMA instruction: Result = $A \times B + C$
- How many cycles for $P(x) = A_0 + A_1 x + A_2 x^2 + A_3 x^3$?
- Horner's method: $A_0 + x (A_1 + x (A_2 + x A_3))$, 15 cycles
- Simple method:
    Cycle 0:  $Z_2 = A_2 + A_3 x$
    Cycle 0:  $Z_1 = A_0 + A_1 x$
    Cycle 1:  $Z_3 = x^2$
    Cycle 6:  $P(x) = Z_1 + Z_2 Z_3$
    Cycle 11: Result
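The two evaluation orders compute the same polynomial; only the dependency chain differs. The slide's cycle counts imply a 5-cycle FMA latency (cycle 1 to 6 to 11), so three dependent FMAs in Horner's method cost 15 cycles, while the split form needs two dependent levels plus one issue slot. A small Python sketch of both orders follows; the function names are illustrative, and the split form is the degree-3 case of what is commonly called Estrin's scheme.

```python
def horner(a, x):
    """A0 + x(A1 + x(A2 + x*A3)): three multiply-adds, each waiting on the last."""
    return a[0] + x * (a[1] + x * (a[2] + x * a[3]))

def split_eval(a, x):
    """The slide's schedule: Z1, Z2, Z3 are independent, then one final multiply-add."""
    z2 = a[2] + a[3] * x
    z1 = a[0] + a[1] * x
    z3 = x * x
    return z1 + z2 * z3

coeffs = [1.0, 2.0, 3.0, 4.0]
print(horner(coeffs, 0.5), split_eval(coeffs, 0.5))
```

In floating point the two orders can differ in the last bit for general inputs, since they round intermediate results differently.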

14 Sample Algorithms: Itanium Processor Double-Precision Log Example (1)
- Consider $\log(x) = \ln(x)$ and the original $x = 2^N \times 1.f_1 f_2 \ldots f_{52}$ (double precision)
- $\ln(2^N \times 1.f_1 f_2 \ldots f_{52}) = N \ln(2) + \ln(1.f_1 f_2 \ldots f_{52})$
- Taylor series: $\ln(1 + z) = z - z^2/2 + z^3/3 - \cdots$ for $|z| \le 1$ and $z \ne -1$
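The Taylor series converges for the whole stated range, but how many terms are needed depends heavily on the size of z, which is why a further reduction to a small z pays off. A brief Python sketch (illustrative only; `ln1p_taylor` is a hypothetical helper) compares the error after six terms for a small and a large argument.

```python
import math

def ln1p_taylor(z: float, terms: int) -> float:
    """Partial sum z - z^2/2 + z^3/3 - ... with the given number of terms."""
    total, power = 0.0, 1.0
    for k in range(1, terms + 1):
        power *= z
        total += (-1) ** (k + 1) * power / k
    return total

for z in (2.0 ** -8, 0.9):
    err = abs(ln1p_taylor(z, 6) - math.log1p(z))
    print(f"z = {z}: error after 6 terms = {err:.3e}")
```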

15 Sample Algorithms: Double-Precision Log Example (2)
- $\ln(x) = \ln(\mathrm{frcpa}(x) \cdot x / \mathrm{frcpa}(x)) = \ln(\mathrm{frcpa}(x) \cdot x) - \ln(\mathrm{frcpa}(x))$
- $\mathrm{frcpa}(x) = 2^{-N}\, \mathrm{frcpa}(1.f_1 f_2 \ldots f_8)$, where $1.f_1 f_2 \ldots f_{52}$ is the significand of the original $x$
- $\ln(\mathrm{frcpa}(x)) = -N \ln(2) + \ln(\mathrm{frcpa}(1.f_1 f_2 f_3 f_4 f_5 f_6 f_7 f_8))$

16 Sample Algorithms: Core log Approximation
- Piece 1: $\ln(\mathrm{frcpa}(x) \cdot x)$ (argument reduction and polynomial approximation)
- Piece 2: $N \ln(2)$ (miscellaneous)
- Piece 3: $-\ln(\mathrm{frcpa}(1.f_1 f_2 f_3 f_4 f_5 f_6 f_7 f_8))$ (table lookup)
- Reconstruction: $\ln(x) \approx$ Piece 1 + Piece 2 + Piece 3

17 Sample Algorithms: Log, Piece 1 (Reduction & Poly)
$\ln(x) = \ln(\mathrm{frcpa}(x) \cdot x) + N \ln(2) - \ln(\mathrm{frcpa}(1.f_1 f_2 \ldots f_8))$

    // Cycle 1: Compute reciprocal approximation to x: (1/x)
    frcpa.s0 inv_approx, p6 = f1, x
    // Cycle 8: frcpa(x) * x - 1 yields z with |z| <= 2^-8
    fms.s1 z = inv_approx, x, f1
    // Cycle 13: Compute z * z
    fma.s1 z_sqr = z, z, f0
    // Cycle 14: 6th degree polynomial evaluation in parallel
    fma.s1 P_lo1 = A3, z, A2
    fma.s1 P_lo2 = A5, z, A4
    // Cycle 18: Calculate z^3 and P_hi(z) = z + A1 * z^2
    fma.s1 z_cubed = z_sqr, z, f0
    fma.s1 P_hi = z_sqr, A1, z
    // Cycle 19: Construct P_lo(z)
    fma.s1 P_lo = P_lo2, z_sqr, P_lo1
    // Cycle 24: ln(frcpa(x) * x) ~ P_hi(z) + z^3 * P_lo(z)
    fma.s1 ln_approx = P_lo, z_cubed, P_hi
    // Cycle 29: ln_approx available

18 Sample Algorithms: Log, Piece 2 (Miscellaneous)
$\ln(x) = \ln(\mathrm{frcpa}(x) \cdot x) + N \ln(2) - \ln(\mathrm{frcpa}(1.f_1 f_2 \ldots f_8))$

    // Cycle 3: Get exponent of x
    getf.exp N_Sbias = x
    // Cycle 5: Remove sign bit
    and N_Bias = N_Sbias, 0x1FFFF
    // Cycle 6: Remove exponent bias
    sub N_float = N_Bias, 0xFFFF
    // Cycle 8: Move exponent from the general purpose register into a floating-point register
    setf.sig N = N_float
    // Cycle 17: Make N a floating-point value
    fcvt.xf N = N
    // Cycle 22: N available

19 Sample Algorithms: Log, Piece 3 (Table-Lookup)
$\ln(x) = \ln(\mathrm{frcpa}(x) \cdot x) + N \ln(2) - \ln(\mathrm{frcpa}(1.f_1 f_2 \ldots f_8))$

    // Cycle 1: Normalize x
    fnorm x_norm = x
    // Cycle 9: Extract significand of x_norm
    getf.sig signif_x = x_norm
    // Cycle 11: Remove the explicit bit
    shl signif_x = signif_x, 1
    // Cycle 12: Isolate fraction bits f1, f2, ..., f8 (unsigned shift, so no sign extension)
    shr.u index = signif_x, 56
    // Cycle 13: Calculate address of table entry (16 bytes per entry)
    shladd address = index, 4, Table_Base
    // Cycle 14: Load ln(frcpa(1.f1 f2 ... f8))
    ldfe Table_part = [address]
    // Cycle 23: Table_part available
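The shift-left-by-1, shift-right-by-56 sequence can be mimicked on a 64-bit integer to see which bits form the table index. The Python sketch below is an illustration under assumptions: it rebuilds a 64-bit significand with an explicit leading bit from a double via `math.frexp`, and `table_index` and `MASK64` are hypothetical names.

```python
import math

MASK64 = (1 << 64) - 1

def table_index(x: float) -> int:
    """Return the integer formed by fraction bits f1..f8 of a positive, finite x."""
    m, _ = math.frexp(x)               # x = m * 2^e with 0.5 <= m < 1
    signif = int(m * 2.0 ** 64)        # 64-bit significand, explicit bit on top
    signif = (signif << 1) & MASK64    # shl 1: drop the explicit bit
    return signif >> 56                # unsigned shr 56: keep the top 8 fraction bits

for x in (1.0, 1.5, 3.0, 1.99609375):
    index = table_index(x)
    print(x, index, "table offset in bytes:", index * 16)
```

The index depends only on the significand, so 1.5 and 3.0 select the same entry; the factor of 16 matches the `shladd ... 4` scaling in the listing.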

20 Sample Algorithms: Reconstruction for Log
$\ln(x) = \ln(\mathrm{frcpa}(x) \cdot x) + N \ln(2) - \ln(\mathrm{frcpa}(1.f_1 f_2 \ldots f_8))$
- Cycle 23 (miscellaneous and table lookup): Table = $N \ln(2) - \ln(\mathrm{frcpa}(1.f_1 f_2 \ldots f_8))$
- Cycle 29 (argument reduction and polynomial approximation): Result = $\ln(\mathrm{frcpa}(x) \cdot x)$ + Table
- log(x) complete in Cycle 34
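Putting the three pieces together, the log algorithm can be sketched in a high-level language. This Python sketch rests on several assumptions that are not in the slides: `frcpa_sig` is a hypothetical stand-in for the hardware reciprocal approximation (it only mimics the property that the result depends on the leading 8 fraction bits), the polynomial uses plain Taylor coefficients with more terms than the slide's degree-6 polynomial, and everything runs in double precision rather than the 82-bit register format.

```python
import math

def frcpa_sig(index: int) -> float:
    """Hypothetical stand-in for frcpa on a significand in [1, 2), keyed by f1..f8."""
    midpoint = 1.0 + (index + 0.5) / 256.0
    return round(256.0 / midpoint) / 256.0     # roughly 8 bits of accuracy

# Piece 3: table of -ln(frcpa(1.f1...f8)), one entry per index.
LOG_TABLE = [-math.log(frcpa_sig(i)) for i in range(256)]

def log_sketch(x: float) -> float:
    """ln(x) = ln(frcpa(x) * x) + N ln(2) - ln(frcpa(1.f1...f8)) for normal x > 0."""
    m, e = math.frexp(x)
    sig, n = 2.0 * m, e - 1                    # x = 2^N * sig with 1 <= sig < 2
    index = int((sig - 1.0) * 256.0)           # fraction bits f1..f8
    z = frcpa_sig(index) * sig - 1.0           # Piece 1 reduction, |z| is small
    # Piece 1 polynomial: z - z^2/2 + z^3/3 - ... (Taylor coefficients, 10 terms).
    poly = 0.0
    for k in range(10, 0, -1):
        poly = poly * z + (-1) ** (k + 1) / k
    poly *= z
    # Reconstruction: Piece 1 + (Piece 2 + Piece 3).
    return poly + (n * math.log(2.0) + LOG_TABLE[index])

for x in (0.1, 2.5, 123.456):
    print(x, log_sketch(x), math.log(x))
```

As in the slide's schedule, the table part and N ln(2) are combined first, and the polynomial result is added last.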

21 Sample Algorithms: Key Message
- Novel algorithms yield high-performance functions
- How good is the overall performance?

22 Performance: Single-Precision Functions
[Bar chart: cycles per function for floorf, sqrtf, logf, expf, asinf, acosf, sinf, cosf, atanf, tanf, powf]
Copyright 2001 Corporation.

24 Performance: Extended-Precision Functions
[Bar chart: cycles per function for floorl, sqrtl, asinl, acosl, expl, logl, sinl, cosl, tanl, atanl, powl]

25 Performance: Key Message
- Optimal minimum latency performance obtained
- How good is the accuracy?

26 Testing and Accuracy: Units in the Last Place (Ulps)
- Goal: ulp error (computed function) < 0.55 ulps
- Distance between $f_{\text{computed}}(x)$ and $f_{\text{exact}}(x)$, where $f_{\text{computed}}(x) = (\pm 1) \times 2^k \times 1.f_1 f_2 \ldots f_{51} f_{52}$
- Ulp error $= |f_{\text{exact}} - f_{\text{computed}}| / 2^{k-52}$, where $1 \le 2^{-k} |f_{\text{exact}}| < 2$ and $k$ is the base-2 exponent of $f_{\text{exact}}$
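The ulp error definition for double precision can be evaluated exactly with rational arithmetic. The Python sketch below is illustrative (the name `ulp_error` is an assumption, and the exact value must be supplied as a `Fraction`); it measures the error of rounding 1/10 to the nearest double, which is the best any computed result could do for that value.

```python
from fractions import Fraction

def ulp_error(f_exact: Fraction, f_computed: float) -> Fraction:
    """|f_exact - f_computed| / 2^(k-52), k = base-2 exponent of a nonzero f_exact."""
    magnitude = abs(f_exact)
    k = 0
    while Fraction(2) ** k > magnitude:          # move down until 2^k <= |f_exact|
        k -= 1
    while Fraction(2) ** (k + 1) <= magnitude:   # move up until |f_exact| < 2^(k+1)
        k += 1
    return abs(f_exact - Fraction(f_computed)) / Fraction(2) ** (k - 52)

print(ulp_error(Fraction(1, 10), 0.1))   # error of the correctly rounded double
```

A correctly rounded result never exceeds 0.5 ulps, so the 0.55 ulp goal on the slide leaves only a small margin above the best achievable error.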

27 Testing and Accuracy: Accuracy Example
[Worked example comparing f(x) computed with f(x) exact, with the differences given in ulps]

28 Testing and Accuracy: Math Library Test Suite GUI

29 Testing and Accuracy: Ulp Chart for log(x)

30 Testing and Accuracy: Key Message
- High accuracy is essential
  - Balance accuracy and performance
- Where can we find these functions?

31 Open Source Website
- http://developer.intel. ... opensource/numerics

32 Summary
- Provide a full math library solution
- Novel algorithms yield high-performance functions
- Optimal minimum latency performance obtained
- High accuracy is essential
- Sources are available

33 Implementers & Designers
- Marius Cornea
- John Harrison
- Cristina Iordache
- Ted Kubaska
- Bob Norin
- Ping Tak Peter Tang
- Eugeny Gvozdev
- Vladimir Lunev

34 Call to Action
- Use the provided math functions from the C++ and Fortran Compiler
- Download the source code from intel.com/software/opensource/numerics
