Outline. Introduction Intel Vector Math Library (VML) o Features and performance VML in Finance Useful links

Outline
- Introduction
- Intel Vector Math Library (VML)
  o Features and performance
- VML in Finance
- Useful links

Introduction
- VML is one component of Intel MKL
- Supports HPC applications:
  o Scientific & engineering simulations
  o Finance
  o Graphics
- Works with vectors rather than scalars
- Good performance, scalability and accuracy
- C & Fortran APIs

VML Functionality and features
Vector Math Library
- Collection of vector math functions
- Real/Complex
- Double/single precision
- 3 accuracy modes
  o High Accuracy (HA): correct rounding in >99% of cases; behaves according to C99; slowest; default mode
  o Low Accuracy (LA): up to 2 least significant bits incorrect; behaves according to C99; 30-50% faster than HA
  o Enhanced Performance (EP): roughly half of the bits may be incorrect; behavior is not guaranteed on the entire domain; 30-50% faster than LA

Real functions
  Trigonometric: Sin, Cos, SinCos, Tan, Asin, Acos, Atan, Atan2
  Hyperbolic: Sinh, Cosh, Tanh, Asinh, Acosh, Atanh
  Power and Root: Pow2o3, Pow3o2, Pow, Powx, Sqrt, Cbrt, InvSqrt, InvCbrt
  Exponential & Logarithmic: Exp, Expm1, Ln, Log10, Log1p
  Special: Erf, Erfc, ErfInv, ErfcInv, CdfNorm, CdfNormInv, TGamma
  Arithmetic: Add, Sub, Sqr, Mul, Abs
  Rounding: Floor, Ceil, Trunc, Round, NearbyInt, Rint, Modf

Complex functions
  Trigonometric: Sin, Cos, CIS, Tan, Asin, Acos, Atan
  Hyperbolic: Sinh, Cosh, Tanh, Asinh, Acosh, Atanh
  Power and Root: Pow, Powx, Sqrt
  Arithmetic: Add, Sub, Div, Mul, MulByConj, Conj, Abs, Arg
  Exponential & Logarithmic: Exp, Ln, Log10

Precision & Accuracy Modes Positioning
[Diagram: precision/accuracy modes placed on a spectrum running from "Method/Parameter Inaccuracies Dominate" to "Bitwise Identical Results":
  SP EP: media apps
  SP HA, DP EP, SP LA: apps not very demanding for accuracy
  DP LA: meets accuracy requirements of most apps
  DP HA: rigorous accuracy specs
  QP, DP CR: bitwise identical results]
- DP LA is sufficient for the majority of apps
- DP HA is sometimes used to meet (at times artificial) accuracy specs for customers' benchmarks/acceptance tests
- DP CR (correct rounding) and/or higher precision (quad precision) may address the bitwise-compatible-results issue in certain customer apps
- SP HA, SP LA, DP EP are targeted to apps where math function inaccuracies don't dominate method/parameter inaccuracies, e.g. Monte Carlo simulations
- SP EP is targeted to a class of media/graphics apps
(Slide dated 4/17/2016)

Usage Model
- Common call format

    v[s,d]Exp( n, input_array, output_array );

- Example (a VML example can be found in Intel MKL packages)
- Note: VML functions allow in-place calls, namely the input array and output array can be the same array
- Change of accuracy mode

    vmlSetMode( VML_EP );

- Note: starting with a later MKL release (the version number is missing in the source), accuracy mode can also be controlled at function level

    #include "mkl_vml.h"

    #define VEC_LEN 1000 /* Vector size */

    int main()
    {
        double da[VEC_LEN], dr[VEC_LEN];

        /* Generate vector of arguments da */

        /* Compute double precision vector exponential */
        vdExp( VEC_LEN, da, dr );

        return 0;
    }

Performance
- Performance metric: cycles per element (CPE)
- Lower is better
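As a rough illustration of the metric, CPE can be derived from a wall-clock timing when the core frequency is known and fixed. The sketch below assumes exactly that (no frequency scaling during the measurement); the function name `cycles_per_element` is hypothetical.

```c
#include <assert.h>

/* Cycles per element: elapsed core cycles divided by elements processed.
   Assumes a constant core frequency during the measurement.
   Hypothetical helper for illustration only. */
double cycles_per_element(double seconds, double core_hz, long n)
{
    assert(n > 0 && seconds >= 0.0 && core_hz > 0.0);
    return seconds * core_hz / (double)n;
}
```

For example, a call that processes 1000 elements in 1 microsecond on a 2.3 GHz core works out to about 2.3 cycles per element.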

CPE Performance: EP vs. LA vs. HA
- Always start with EP functions
- Go to higher accuracy only if lower accuracy doesn't meet requirements
[Chart: performance of VML HA, LA, EP functions on Haswell, Intel 64. Functions: vsExp, vdExp, vsLn, vdLn, vsLog10, vdLog10, vsPow, vdPow, vsSinCos, vdSinCos, vsCos, vdCos. Series: VML_HA, VML_LA, VML_EP. Metric: CPE (cycles per element); lower is faster.]

VML Functions and Threading
- Default threading behavior depends on vector size, performance of the serial function, and the number of CPUs/cores/hyper-threads
- But a single VML function may be too fast to be threaded effectively; users should try to thread at a higher level
- When MKL_DYNAMIC_FALSE is set, MKL normally does not deviate from the number of threads the user requested; otherwise it may aggressively adjust the number of threads used
- If vector size < 100, serial versions of the functions will run

    mkl_set_dynamic( MKL_DYNAMIC_FALSE );
    mkl_domain_set_num_threads( 4, MKL_VML );

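VML in Applications: Black-Scholes
European option pricing
- Embarrassingly parallel
  o Millions of options can be priced simultaneously
  o Thousands of workstations (Monte Carlo Farm)
- Vector math functions
  o Erf, Exp, Ln, Sqrt
  o Double precision

$$C = S_0\,\Phi(d_1) - K \exp(-rT)\,\Phi(d_2)$$

$$d_1 = \frac{\ln(S_0/K) + T\,(r + 0.5\,\sigma^2)}{\sigma\sqrt{T}}, \qquad d_2 = \frac{\ln(S_0/K) + T\,(r - 0.5\,\sigma^2)}{\sigma\sqrt{T}}$$

where $\Phi$ is the standard normal cumulative distribution function.

[Diagram: the inputs S0[i], K[i], T[i], R[i], Sig[i] for i = 0..n are processed in SIMD fashion to produce the call prices C[0], C[1], ..., C[n].]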

VML in Applications: Black-Scholes
Black-Scholes formula

Baseline (icc)

    void BlackScholesFormula( int nopt, tfloat r, tfloat sig,
                              const tfloat s0[], const tfloat x[], const tfloat t[],
                              tfloat vcall[], tfloat vput[] )
    {
        tfloat d1, d2;
        int i;

        for ( i = 0; i < nopt; i++ )
        {
            d1 = ( LOG(s0[i]/x[i]) + (r + HALF*sig*sig)*t[i] ) / ( sig*SQRT(t[i]) );
            d2 = ( LOG(s0[i]/x[i]) + (r - HALF*sig*sig)*t[i] ) / ( sig*SQRT(t[i]) );
            vcall[i] = s0[i]*CNDF(d1) - EXP(-r*t[i])*x[i]*CNDF(d2);
            vput[i]  = EXP(-r*t[i])*x[i]*CNDF(-d2) - s0[i]*CNDF(-d1);
        }
    }

CNDF approximation: Abramowitz and Stegun, Handbook of Mathematical Functions, Formula [number missing in source].

MKL VML+icc

    void BlackScholesFormula( int nopt, tfloat r, tfloat sig,
                              tfloat s0[], tfloat x[], tfloat t[],
                              tfloat vcall[], tfloat vput[] )
    {
        /* Declarations of j, sig_2 and the work arrays are not shown on the slide */
        vmlSetMode( VML_EP );

        DIV(s0, x, Div);
        LOG(Div, Log);

        for ( j = 0; j < nopt; j++ )
        {
            tr[j]    = t[j] * r;
            tss[j]   = t[j] * sig_2;
            tss05[j] = tss[j] * HALF;
            mtr[j]   = -tr[j];
        }

        EXP(mtr, Exp);
        INVSQRT(tss, InvSqrt);

        for ( j = 0; j < nopt; j++ )
        {
            w1[j] = (Log[j] + tr[j] + tss05[j]) * InvSqrt[j] * INV_SQRT2;
            w2[j] = (Log[j] + tr[j] - tss05[j]) * InvSqrt[j] * INV_SQRT2;
        }

        ERF(w1, w1);
        ERF(w2, w2);

        for ( j = 0; j < nopt; j++ )
        {
            w1[j] = HALF + HALF * w1[j];
            w2[j] = HALF + HALF * w2[j];
            vcall[j] = s0[j] * w1[j] - x[j] * Exp[j] * w2[j];
            vput[j]  = vcall[j] - s0[j] + x[j] * Exp[j];
        }
    }

VML in Applications: Black-Scholes
MKL VML+OMP+icc

    void BlackScholesFormula( int nopt, tfloat r, tfloat sig,
                              tfloat s0[], tfloat x[], tfloat t[],
                              tfloat vcall[], tfloat vput[] )
    {
        int threads = omp_get_max_threads();
        Buffer = malloc( threads*NBUF*8*sizeof(tfloat) );
        sig_2 = sig*sig;
        nb = nopt/NBUF;

        /* The shared/private lists are left empty on the slide */
        #pragma omp parallel for shared( ) private( )
        for ( i = 0; i < nb; i++ )
        {
            k = omp_get_thread_num();
            vmlSetMode( VML_EP );

            /* memory initialization for thread needs */

            DIV(_s0, _x, Div);
            LOG(Div, Log);

            for ( j = 0; j < NBUF; j++ )
            {
                tr[j]    = _t[j] * r;
                tss[j]   = _t[j] * sig_2;
                tss05[j] = tss[j] * HALF;
                mtr[j]   = -tr[j];
            }

            EXP(mtr, Exp);
            INVSQRT(tss, InvSqrt);

            for ( j = 0; j < NBUF; j++ )
            {
                w1[j] = (Log[j] + tr[j] + tss05[j]) * InvSqrt[j] * INV_SQRT2;
                w2[j] = (Log[j] + tr[j] - tss05[j]) * InvSqrt[j] * INV_SQRT2;
            }

            ERF(w1, w1);
            ERF(w2, w2);

            for ( j = 0; j < NBUF; j++ )
            {
                w1[j] = HALF + HALF * w1[j];
                w2[j] = HALF + HALF * w2[j];
                _vcall[j] = _s0[j] * w1[j] - _x[j] * Exp[j] * w2[j];
                _vput[j]  = _vcall[j] - _s0[j] + _x[j] * Exp[j];
            }
        }

        free(Buffer);
    }

Performance (2x18 cores, 2.3 GHz HSW)
Columns: SP, Mopts/s and DP, Mopts/s (the numeric values are not preserved in the source). Rows:
- ICC + LIBM 1-ulp
- ICC + SVML EP, 1 thr
- ICC + MKL/VML EP, 1 thr
- ICC + MKL/VML EP + OMP, 36 thr

Configuration Info
- Versions: Intel Math Kernel Library (Intel MKL) 11.3 Update 2, GLIBC 2.18
- Hardware: Intel(R) Xeon(R) CPU E v3, eighteen-core CPUs (45MB LLC, 2.3GHz), 64GB of RAM
- Operating System: RHEL 6.5 GA x86_64
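The blocked loop above computes the block count as nopt/NBUF, which covers every option only when nopt is a multiple of the block size. A minimal sketch of one way to walk the arrays in blocks while also handling a final short block follows; `process_block` is a hypothetical stand-in for the per-block VML pricing code, and the tiny `NBUF` value is chosen only for illustration.

```c
#include <assert.h>

#define NBUF 4  /* hypothetical block size; kept tiny for illustration */

/* Hypothetical per-block kernel: stands in for pricing one block with VML. */
static void process_block(int n, const double in[], double out[])
{
    for (int j = 0; j < n; j++)
        out[j] = 2.0 * in[j];
}

/* Process nopt elements in blocks of at most NBUF, including a final
   short block. Returns the number of blocks used. */
int process_all(int nopt, const double in[], double out[])
{
    assert(nopt >= 0);
    int nb = (nopt + NBUF - 1) / NBUF;  /* round up so the tail is not dropped */

#ifdef _OPENMP
    #pragma omp parallel for
#endif
    for (int i = 0; i < nb; i++) {
        int first = i * NBUF;                                 /* block start */
        int len = (nopt - first < NBUF) ? nopt - first : NBUF; /* block length */
        process_block(len, in + first, out + first);
    }
    return nb;
}
```

Each iteration touches a disjoint slice of the arrays, so the iterations are independent and can be distributed across threads without synchronization.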

Summary
Vector Math Library (VML)
  o Performance enhancement through vectorization and threading
  o Flexible accuracy modes allow additional performance with acceptable accuracy
  o Compatible with various compilers
  o Native C and Fortran APIs

Useful links
- Intel Math Kernel Library (Intel MKL)
- Intel MKL Reference Manual, Chapter 9
- VML performance & accuracy data

Legal Disclaimer

INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.

Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #
