Intel Math Kernel Library (Intel MKL) Overview. Hans Pabst Software and Services Group Intel Corporation



1 Intel Math Kernel Library (Intel MKL) Overview Hans Pabst Software and Services Group Intel Corporation

2 Agenda
- Motivation
- Functionality
- Compilation
- Performance
- Summary

3 Motivation
How and where to optimize?
1. Appropriate algorithm*
2. Library
3. Multicore
4. SIMD
HPC audience cares about:
- Performance
- Functionality
- Support
Naive matrix multiplication (row-major; c is M x N, a is M x K, b is K x N):
    for (int i = 0; i < M; ++i) {
      for (int j = 0; j < N; ++j) {
        c[i*N+j] = 0;
        for (int k = 0; k < K; ++k) {
          c[i*N+j] += a[i*K+k] * b[k*N+j];
        }
      }
    }
Intel MKL!
* Note: The best parallel algorithm might be unrelated to the best serial algorithm. Therefore, exploiting parallelism is not necessarily an incremental optimization.
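As a baseline for checking library results, the loop nest above can be wrapped into a self-contained function. This is only a sketch: the name `naive_gemm` and the use of `std::vector` are illustrative assumptions and do not come from the slides.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference (unoptimized) row-major matrix multiplication: C = A * B,
// where A is M x K, B is K x N, and C is M x N.
// Hypothetical helper for illustration; not part of Intel MKL.
std::vector<double> naive_gemm(const std::vector<double>& a,
                               const std::vector<double>& b,
                               std::size_t M, std::size_t N, std::size_t K)
{
  assert(a.size() == M * K && b.size() == K * N);
  std::vector<double> c(M * N, 0.0);
  for (std::size_t i = 0; i < M; ++i) {
    for (std::size_t j = 0; j < N; ++j) {
      double sum = 0.0;
      for (std::size_t k = 0; k < K; ++k) {
        sum += a[i * K + k] * b[k * N + j];
      }
      c[i * N + j] = sum;
    }
  }
  return c;
}
```

A tuned library call is expected to agree with such a reference only up to rounding, since it may sum the products in a different order.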

5 Intel MKL Functionality
Linear Algebra
- BLAS, Sparse BLAS
- LAPACK solvers
- Sparse Solvers (DSS, PARDISO)
- Iterative solver (RCI)
- ScaLAPACK, PBLAS
Fast Fourier Transforms
- Multidimensional
- FFTW interfaces
- Cluster FFT
- Trig. Transforms
- Poisson solver
- Convolution via VSL
Vector Math
- Trigonometric
- Hyperbolic
- Exponential, Logarithmic
- Power / Root
Random Number Generators
- Congruential
- Wichmann-Hill
- Mersenne Twister
- Sobol
- Niederreiter
- Non-deterministic
Summary Statistics
- Kurtosis
- Variation coefficient
- Quantiles
- Order statistics
- Min/max
- Variance-covariance
Data Fitting
- Spline-based
- Interpolation
- Cell search

6 Features
- Single-threaded and multi-threaded libraries
- Cluster support for important domains
- Support for large problem sizes (ILP64)
- Conditional Numerical Reproducibility (CNR)
- Support for Intel Xeon Phi coprocessors
  - Automatic offload and compiler-assisted offload
  - Manycore-hosted execution, cluster support, etc.

7 Use Cases
- Iterative Solver (RCI): customize solver steps
- PBLAS: distribute easily
- VML: balance accuracy and performance
- RNG: safety and reliability
- VSL*: Did you know that Intel MKL comes with some statistics?
* For example, to detect outliers or to predict values.

8 New Features (Intel MKL 11.0)
Intel Xeon Phi Coprocessor Support*
- Automatic offload supports multiple coprocessors
- LAPACK: LU, QR, and Cholesky (Intel MKL [1], )
- ?GEMM, ?TRMM, ?TRSM (Intel MKL 11.0 [1], 11.01)
Performance improvements in Intel MKL
Conditional Numerical Reproducibility (CNR)
Enabling for future hardware (Haswell)
- Support for AVX2 and FMA3 instruction set
Other changes
- Deprecation of some service functions
- Support for PGI compiler 12.5
- Support for LAPACK
- Cluster FFTs with SOI
* See Intel MKL Link Line Advisor:

9 Conditional Numerical Reproducibility
Motivation: engineered to address issues that previously seemed to be unrelated or diffuse.
Ingredients and requirements:
- Memory alignment
  - Align memory: try the Intel MKL memory allocation functions
  - 64-byte alignment for processors in the next few years
- Number of threads
  - Set the number of threads to a constant number
  - Use sequential libraries
- Deterministic task scheduling
  - Ensures that FP operations occur in the same order, so that results are reproducible
- Code path control
  - Maintains consistent code paths across processors
  - Will often mean lower performance on the latest processors
* Conditional (if possible, relaxed in future versions): across OS / bits / versions, varying # of threads,
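Why the number of threads and the order of operations matter can be seen without any library: floating-point addition is not associative, so splitting a reduction into a different number of partial sums can change the result. The sketch below assumes IEEE 754 double arithmetic without fast-math style compiler options; `chunked_sum` is a hypothetical helper that imitates a threaded reduction sequentially and is not an Intel MKL function.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sums `data` by first reducing `chunks` contiguous blocks (as threads would)
// and then adding the partial sums in block order.
// Assumes data.size() is divisible by chunks (kept simple on purpose).
double chunked_sum(const std::vector<double>& data, std::size_t chunks)
{
  assert(chunks > 0 && data.size() % chunks == 0);
  const std::size_t len = data.size() / chunks;
  double total = 0.0;
  for (std::size_t c = 0; c < chunks; ++c) {
    double partial = 0.0;
    for (std::size_t i = c * len; i < (c + 1) * len; ++i) {
      partial += data[i];
    }
    total += partial;
  }
  return total;
}
```

For the input {1e16, 1.0, -1e16, 1.0}, one chunk yields 1.0 and two chunks yield 0.0 under the stated assumptions, while the exact sum is 2. Fixing the number of threads and the scheduling removes this source of run-to-run variation.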

10 Conditional Numerical Reproducibility
- Offload report at run time
- Service functions and environment control*: mkl_cbwr_set( ) and MKL_CBWR=
* CBWR? During the Intel MKL 11.0 Beta, CNR was called Conditional Bit-Wise Reproducibility. The term now conforms with the regular IEEE FP terminology.

11 Performance Impact of CNR for the Intel Optimized LINPACK Benchmark

13 Intel Compiler and Intel MKL
Intel MKL is available via:
- Development Composer, Parallel Studio, and Cluster Studio
- Stand-alone package
- Redistributable package (no runtime royalties)
Environment variables (development):
- MKLROOT, IPPROOT, TBBROOT, set by sourcing /opt/intel/composerxe/bin/compilervars.sh intel64

14 Intel Xeon Phi: Execution Models
Intel MKL Automatic Offload (AO)
- No code changes required
- Automatically uses both host and target
- Transparent data transfer and execution management
Compiler Assisted Offload (CAO)
- Explicit control of data transfer and remote execution using compiler offload pragmas/directives
- Can be used together with Automatic Offload
Native Execution*
- Uses the coprocessors as independent nodes (a.k.a. manycore-hosted execution)
- Input data is copied to targets in advance
* In fact, an offloaded code section (CAO) that calls Intel MKL is calling into the native library.

15 Compiling and Linking
- Intel MKL supports Linux*, Mac OS* X, and Windows* (the platform's default compiler as well as non-Intel compilers and their OpenMP* runtimes)
- Intel MKL Link Line Advisor: -us/articles/intel-mkl-linkline-advisor/

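16 Example: DGEMM
Row-major storage via the CBLAS interface:
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, arows, bcols, acols, alpha, a, acols, b, bcols, beta, c, bcols);
Column-major BLAS interface, with the row-major inputs passed as transposed operands (c is written in column-major order, hence ldc = arows):
    char atrans = 'T', btrans = 'T';
    dgemm(&atrans, &btrans, &arows, &bcols, &acols, &alpha, a, &acols, b, &bcols, &beta, c, &arows);
Column-major BLAS interface, with the operands swapped (computes C^T = B^T * A^T, so c is written in row-major order):
    char atrans = 'N', btrans = 'N';
    dgemm(&atrans, &btrans, &bcols, &arows, &acols, &alpha, b, &bcols, a, &acols, &beta, c, &bcols);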
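The operand swap used with a column-major interface can be checked without Intel MKL. The sketch below uses `ref_gemm_colmajor`, a hypothetical stand-in written for this note (column-major, no transposition, BLAS-like argument order); it is not the library's dgemm. It relies on the fact that a row-major matrix read as column-major is its transpose.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a column-major, non-transposed GEMM:
// C = alpha * A * B + beta * C, where A is m x k (lda), B is k x n (ldb),
// and C is m x n (ldc), all stored in column-major order.
void ref_gemm_colmajor(int m, int n, int k, double alpha,
                       const double* a, int lda, const double* b, int ldb,
                       double beta, double* c, int ldc)
{
  for (int j = 0; j < n; ++j) {
    for (int i = 0; i < m; ++i) {
      double sum = 0.0;
      for (int p = 0; p < k; ++p) {
        sum += a[i + p * lda] * b[p + j * ldb];
      }
      c[i + j * ldc] = alpha * sum + beta * c[i + j * ldc];
    }
  }
}

// Row-major C (arows x bcols) = A (arows x acols) * B (acols x bcols),
// computed by the column-major routine as C^T = B^T * A^T:
// b is passed first with ld = bcols, a second with ld = acols.
std::vector<double> rowmajor_product(const std::vector<double>& a,
                                     const std::vector<double>& b,
                                     int arows, int acols, int bcols)
{
  std::vector<double> c(static_cast<std::size_t>(arows) * bcols, 0.0);
  ref_gemm_colmajor(bcols, arows, acols, 1.0,
                    b.data(), bcols, a.data(), acols, 0.0, c.data(), bcols);
  return c;
}
```

No data is copied or transposed; only the argument order and the dimensions change.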

18 Performance Hints
Automatic Offload
- Only kicks in with sufficiently large problem sizes
Compiler-assisted offload
- Memory alignment is inherited from the host!
- Align with page granularity (4 KB) for fast DMA transfers
General memory alignment (SIMD vectorization)
- Pad leading dimensions to a multiple of the vector width
- Align buffers to a multiple of the vector width, e.g., 512 bit / 64 byte
- Use* mkl_malloc, _mm_malloc (_aligned_malloc), or Intel TBB's scalable_aligned_malloc
* Remember to call the corresponding free function.
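The two alignment hints (padded leading dimension, aligned buffer) can be sketched as follows. The helper names `padded_ld` and `alloc_matrix` are hypothetical, and C++17 `std::aligned_alloc` is used here only as a portable stand-in for the allocators named on the slide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>

// Rounds a row length (in elements) up so that every row starts on an
// `alignment`-byte boundary. Assumes alignment is a multiple of elem_size.
std::size_t padded_ld(std::size_t cols, std::size_t elem_size,
                      std::size_t alignment = 64)
{
  assert(elem_size > 0 && alignment % elem_size == 0);
  const std::size_t per_line = alignment / elem_size;
  return (cols + per_line - 1) / per_line * per_line;
}

// Allocates rows x ld doubles on a 64-byte boundary.
// Release the buffer with std::free (the matching free function).
double* alloc_matrix(std::size_t rows, std::size_t ld)
{
  const std::size_t alignment = 64;
  std::size_t bytes = rows * ld * sizeof(double);
  // std::aligned_alloc requires the size to be a multiple of the alignment.
  bytes = (bytes + alignment - 1) / alignment * alignment;
  return static_cast<double*>(std::aligned_alloc(alignment, bytes));
}
```

For example, 100 doubles per row are padded to a leading dimension of 104, so that each row of the matrix starts on a 64-byte boundary.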

19 Performance Hints (cont.)
Huge pages
- Use libhugetlbfs.so or mmap() to allocate buffers
- CAO: MIC_USE_2MB_BUFFERS=60M (threshold)
FFT transforms
- Memory alignment for 2D FFTs (and higher dimensionality)
  - Single-precision (SP): strides divisible by 8 but not divisible by 16
  - Double-precision (DP): strides divisible by 4 but not divisible by 8
- When parallelizing a series of individual 1D FFTs, consider a single call instead
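The stride rule for multidimensional FFTs amounts to a small rounding step. The sketch below assumes that strides are counted in elements of the respective precision, which the slide does not state explicitly; `fft_stride` is a hypothetical helper, not an Intel MKL function.

```cpp
#include <cassert>
#include <cstddef>

// Returns the smallest stride >= n that is divisible by `unit` but not by
// 2 * unit. Per the hint: unit = 8 for single precision, 4 for double.
std::size_t fft_stride(std::size_t n, std::size_t unit)
{
  assert(unit > 0);
  std::size_t s = (n + unit - 1) / unit * unit;  // round up to a multiple of unit
  if (s % (2 * unit) == 0) {
    s += unit;                                   // step off multiples of 2 * unit
  }
  return s;
}
```

A row of 1024 single-precision elements would thus be stored with a stride of 1032, and a row of 1024 double-precision elements with a stride of 1028.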

20 Performance Hints (cont.)
Intel MKL threading runtime is OpenMP*
- Environment variables: OMP_* (MKL_* takes precedence)
- Coprocessor (CAO): MIC_ENV_PREFIX=MIC, MIC_OMP_NUM_THREADS=
Intel OpenMP thread affinity
- KMP_AFFINITY=
  - Host: compact,granularity=fine,1,0
  - Coprocessor (native): balanced
- MIC_ENV_PREFIX=MIC, MIC_KMP_AFFINITY=
  - Coprocessor (CAO): balanced
- kmp_* functions take precedence
Intel MPI process affinity
- I_MPI_* variables

22 Summary
- Tuned across a wide range of problem sizes
- Performance scales forward: early enabling
- Industry standards, e.g., BLAS and LAPACK
- Documentation:
- Performance: [BENCHMARKS] tab
- Webinars (December 5th): EMEA/ASMO: APAC:

23 Thank You

24 Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #

26 Moore's Law*
[Figure: process technology timeline. Recoverable labels: 90 nm, 45 nm, 2007, Intel Core i Processors, 32 nm, Hi-K metal-gate, 22 nm, 3-D Tri-gate, 14 nm, 2015, Shrink; Intel Xeon Phi Coprocessor (22 nm); Intel Haswell microarchitecture]
* Moore's Law: the number of transistors doubles every ~2 years

27 Intel Xeon E5
[Diagram: Frontend (x86), Decoder (uop), Pipeline Reordering / Scheduler, Backend (Execution Units): EU #0, EU #1, EU #2, with SIMD FP-MUL and SIMD FP-ADD]
Some properties:
- Out-of-order execution (up to 168 μops in flight)
- Superscalar (up to 5 μops per cycle)
- SIMD: 256-bit registers, AVX instruction set
- 8 cores per die, 2-way hyper-threaded (SMT)
Note: this diagram is rather incomplete / simplified, e.g., no branch unit, no caches, etc.
* Sandy Bridge microarchitecture: 1 SIMD FP-multiply-add per clock cycle (8 SP or 4 DP elements)

28 New Haswell Microarchitecture
[Diagram: Frontend (x86), Decoder (uop), Pipeline Reordering / Scheduler, Backend (Execution Units): EU #0, EU #1, EU #2, with two SIMD FMA units]
Some properties:
- Intel AVX2 instruction set
- Intel TSX
Note: this diagram is rather incomplete / simplified, e.g., no branch unit, no caches, etc.
* Haswell microarchitecture: 2 SIMD FP-multiply-add per clock cycle (8 SP or 4 DP elements)

29 Hardware Potential (Peak)
For example, Intel Xeon E x GHz (3.8 GHz), AVX (256 bit)
Peak FP-performance (AVX): 2 sockets x 8 cores x n floats x 2 ops x clock
- SP (8 floats)*: ~ 742 GFLOP/s
- DP (4 doubles): ~ 371 GFLOP/s
* Let's have another view: ~ 742 FP elements (SP) relative to a 1 GHz
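The peak formula is a plain product and can be written out as a small function. The sketch below assumes a clock of 2.9 GHz; that value is inferred from the slide's own figures (742 / 256 is roughly 2.9), because the clock itself is garbled in this transcript. `peak_gflops` is a hypothetical helper name.

```cpp
#include <cassert>

// Theoretical peak in GFLOP/s:
// sockets x cores x SIMD lanes x FP operations per lane and cycle x clock (GHz).
double peak_gflops(int sockets, int cores_per_socket, int simd_lanes,
                   int ops_per_cycle, double clock_ghz)
{
  assert(sockets > 0 && cores_per_socket > 0 && simd_lanes > 0 && ops_per_cycle > 0);
  return sockets * cores_per_socket * simd_lanes * ops_per_cycle * clock_ghz;
}
```

With 2 sockets, 8 cores, 2 operations per cycle (one multiply and one add) and the assumed 2.9 GHz, 8 single-precision lanes give about 742 GFLOP/s and 4 double-precision lanes about 371 GFLOP/s, matching the slide.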

30 Example: CBLAS SGEMM
    using namespace std;
    vector<float> a(arows * acols);
    vector<float> b(acols * bcols);
    vector<float> c(arows * bcols);
    const float alpha = 1, beta = 0;
    // Fill the matrices with random values.
    transform(a.begin(), a.end(), a.begin(), [](float /*dummy*/) { return static_cast<float>(rand()); });
    transform(b.begin(), b.end(), b.begin(), [](float /*dummy*/) { return static_cast<float>(rand()); });
    transform(c.begin(), c.end(), c.begin(), [](float /*dummy*/) { return static_cast<float>(rand()); });
    cblas_sgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans, arows, bcols, acols, alpha, &a[0], acols, &b[0], bcols, beta, &c[0], bcols);
* No overloaded functions (C interface). Note: choosing CBLAS over BLAS is about row- vs. column-major storage.

31 Example: Typical C++ Wrapper Code
    template<typename T, typename U>
    void gemm(T* result, const T* a, const T* b, U arows, U acols, U bcols, T alpha = 1, T beta = 0)
    {
      // Local overload set that dispatches on the element type.
      struct local {
        static void gemm(float* result, const float* a, const float* b,
                         MKL_INT arows, MKL_INT acols, MKL_INT bcols, float alpha, float beta)
        {
          // Row-major a and b are read as transposed column-major operands;
          // result is written in column-major order (ldc = arows).
          const char atrans = 'T', btrans = 'T';
          sgemm(&atrans, &btrans, &arows, &bcols, &acols, &alpha, a, &acols, b, &bcols, &beta, result, &arows);
        }
        static void gemm(double* result, const double* a, const double* b,
                         MKL_INT arows, MKL_INT acols, MKL_INT bcols, double alpha, double beta)
        {
          const char atrans = 'T', btrans = 'T';
          dgemm(&atrans, &btrans, &arows, &bcols, &acols, &alpha, a, &acols, b, &bcols, &beta, result, &arows);
        }
      };
      local::gemm(result, a, b, static_cast<MKL_INT>(arows), static_cast<MKL_INT>(acols), static_cast<MKL_INT>(bcols), alpha, beta);
    }
* Note: the Intel MKL C/BLAS interfaces are const-correct. Further, MKL_INT depends on LP64 vs. ILP64.

Intel Performance Libraries
Powerful Mathematical Library: Intel Math Kernel Library (Intel MKL)
- Energy
- Science & Research
- Engineering Design
- Financial Analytics
- Signal Processing
- Digital Content Creation