Faster Code. Faster. Intel Parallel Studio XE Unleash the Beast


Dennis Dennis
6 years ago

1 Faster Code. Faster Intel Parallel Studio XE 2017 Unleash the Beast


2 Create Faster Code Faster: Intel Parallel Studio XE
Design, build, verify and tune C++, C, Fortran, Python* and Java*
Standards-driven parallel models: OpenMP, MPI & TBB

Highlights from the 2017 edition:
- Faster Python* application performance using Intel Distribution for Python and Intel VTune Amplifier XE.
- Faster deep learning on IA using Intel Math Kernel Library and Intel Data Analytics Acceleration Library.
- Quickly assess application performance using the snapshot features of VTune Amplifier XE and Intel Trace Analyzer and Collector.
- Scale to next-generation platforms, including the latest Intel Xeon Phi processor.
- Optimizations for AVX-512, high-bandwidth memory and explicit vectorization for compiler and analysis tools.

3 Intel Parallel Studio XE
Categories: Performance Libraries; Profiling, Analysis & Architecture; Cluster Tools

- Intel Inspector: Memory & Threading Checking
- Intel VTune Amplifier: Performance Profiler
- Intel Data Analytics Acceleration Library: Optimized for Data Analytics & Machine Learning
- Intel Math Kernel Library: Optimized Routines for Science, Engineering & Financial
- Intel Advisor: Vectorization Optimization & Thread Prototyping
- Intel Cluster Checker: Cluster Diagnostic Expert System
- Intel Trace Analyzer & Collector: MPI Profiler
- Intel MPI Library
- Intel Integrated Performance Primitives: Image, Signal & Data Processing
- Intel Threading Building Blocks: Task Based Parallel C++ Template Library
- Intel C/C++ & Fortran Compilers
- Intel Distribution for Python: Performance Scripting

4 Boost application performance on Windows* and Linux*: Intel C++ and Fortran Compilers

[Chart: Boost C++ application performance on Windows* & Linux* using Intel C++ Compiler (higher is better). Relative geomean performance, SPEC* benchmark: estimated SPECfp_rate_base2006 (floating point) and estimated SPECint_rate_base2006 (integer), each on Windows and Linux. Compilers compared: Intel C++ 17.0, Visual C++* 2015, PGI*, Clang* 3.8, GCC*.]

Configuration: Windows hardware: Intel(R) Xeon(R) CPU E GHz, HT enabled, TB enabled, 32 GB RAM; Linux hardware: Intel(R) Xeon(R) CPU E GHz, 256 GB RAM, HyperThreading is on. Software: Intel compilers 17.0, Microsoft (R) C/C++ Optimizing Compiler Version for x86/x64, GCC PGI 15.10, Clang/LLVM 3.8. Linux OS: Red Hat Enterprise Linux Server release 7.1 (Maipo), kernel el7.x86_64. Windows OS: Windows 10 Pro ( N/A Build 10240). SPEC* Benchmark. SmartHeap libs 11.3 for Visual C++ and Intel Compiler were used for SPECint benchmarks.

[Chart: Boost Fortran application performance on Windows* & Linux* using Intel Fortran Compiler (higher is better). Relative geomean performance, Polyhedron* benchmark, on Windows and Linux. Compilers compared: Intel Fortran 17.0, PGI Fortran*, Absoft*, Open64*, gfortran*.]

Configuration: Hardware: Intel(R) Xeon(R) CPU E GHz, Hyperthreading enabled, TB enabled, 32 GB RAM. Software: Intel Fortran compiler 17.0, Absoft* 15.0.1, PGI Fortran* (Windows)/16.4 (Linux), Open64* 4.5.2, gfortran*. Linux OS: Red Hat Enterprise Linux Server release 7.2, Kernel el7.x86_64. Windows OS: Windows 10 Pro ( N/A Build 10240). Polyhedron Fortran Benchmark.

Windows compiler switches:
- Absoft: -m64 -O5 -speed_math=10 -fast_math -march=core -xinteger -stack:0x
- Intel Fortran compiler: /fast /Qparallel /QxCORE-AVX2 /nostandard-realloc-lhs /link /stack:
- PGI Fortran: -fastsse -Munroll=n:4 -Mipa=fast,inline -Mconcur=numa

Linux compiler switches:
- Absoft: -m64 -mavx -O5 -speed_math=10 -march=core -xinteger
- Gfortran: -Ofast -mfpmath=sse -flto -march=native -funroll-loops -ftree-parallelize-loops=4
- Intel Fortran compiler: -fast -parallel -xCORE-AVX2 -nostandard-realloc-lhs
- PGI Fortran: -fast -Mipa=fast,inline -Msmartalloc -Mfprelaxed -Mstack_arrays -Mconcur=bind
- Open64: -march=auto -Ofast -mso -apo

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. * Other brands and names are the property of their respective owners. Benchmark Source: Intel Corporation.

Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #

5 What's Inside Intel Parallel Studio XE 2017

Editions: Composer Edition (from $699), Professional Edition (from $1,699), Cluster Edition (from $2,949)

Build:
- Intel C++ Compiler
- Intel Fortran Compiler
- Intel Distribution for Python*
- Intel Math Kernel Library: fast math library
- Intel Integrated Performance Primitives: image, signal & data processing
- Intel Threading Building Blocks: threading library
- Intel Data Analytics Acceleration Library: machine learning & analytics

Analyze:
- Intel VTune Amplifier XE: performance profiler
- Intel Advisor: vectorization optimization and thread prototyping
- Intel Inspector: memory and thread debugging

Scale:
- Intel MPI Library: message passing interface library
- Intel Trace Analyzer and Collector: MPI tuning and analysis
- Intel Cluster Checker: cluster diagnostic expert system

Rogue Wave IMSL* Library (Fortran numerical analysis): Bundle or Add-on (Composer), Add-on (Professional), Add-on (Cluster)

Additional configurations, including floating and academic, are available at:

6 Staying current with Support for the Latest Standards, Operating Systems & Processors

Enhanced C11 and C++14 standards support:
- Sized deallocation
- Relaxed constexpr restrictions
- Variable templates
- Single quotation mark as a digit separator

Enhanced Fortran 2008 and draft 2015 standards support:
- Implied-shape PARAMETER arrays
- 2008 bind C internal procedures
- Extended EXIT for all named blocks
- Pointer initialization

Operating systems: Windows* 7 thru 10, Windows Server; Debian* 7.0, 8.0; Fedora* 23, 24; Red Hat Enterprise Linux* 6, 7; SuSE LINUX Enterprise Server* 11, 12; Ubuntu* LTS LTS; macOS*

Latest processors: support and tuning added for the latest Intel Xeon Phi, codenamed Knights Landing, and AVX-512
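The C++14 items on this slide are language features rather than Intel-specific extensions, so a short sketch may help show what they look like in code. The names below (kPi, factorial, kMaxIter, kNibbleMask) are invented for illustration and do not come from the deck. Sized deallocation is left out because it concerns replacing the global operator delete and is awkward to show in a few lines.

```cpp
#include <cassert>

// Variable template (C++14): one definition, instantiated per type.
// kPi is a hypothetical name used only for this illustration.
template <typename T>
constexpr T kPi = T(3.1415926535897932385L);

// Relaxed constexpr (C++14): local variables and loops are now allowed
// inside a constexpr function, so this can be evaluated at compile time.
constexpr unsigned long long factorial(unsigned n) {
    unsigned long long result = 1;
    for (unsigned i = 2; i <= n; ++i) {
        result *= i;
    }
    return result;
}

// Single quotation mark as a digit separator (C++14), shown together
// with a binary literal, which C++14 also introduced.
constexpr long kMaxIter = 3'000;
constexpr unsigned kNibbleMask = 0b1111'0000;

// Evaluated by the compiler, not at run time.
static_assert(factorial(5) == 120, "factorial is usable in constant expressions");
```

Any compiler in C++14 mode or later should accept this; nothing here depends on the Intel compiler.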

8 Intel Compilers for Parallel Studio XE 2017: What's new in Intel C++ and Intel Fortran 17.0
Productive language-level vectorization & parallelism models for advanced developers driving application performance

Common updates:
- Enhanced support for the newest AVX2 and AVX-512 instruction sets for the latest Intel processors (including Intel Xeon Phi)
- Enhanced optimization/vectorization reports, register allocation
- Tight integration with Intel Advisor
- Initial support for OpenMP* 4.5, offering improved vectorization control, new SIMD instructions, and much more

Intel C++ Compiler:
- SIMD Data Layout Template to facilitate vectorization for your C++ code
- Virtual function vectorization capability
- Improved compiler loop and function alignment
- Full support for the latest C11 and C++14 standards

Intel Fortran Compiler:
- Substantial coarray performance improvement: up to twice as fast as previous versions on non-trivial coarray Fortran programs
- Almost complete Fortran 2008 support
- Further interoperability with C (part of draft Fortran 2015)
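The slide does not show the SIMD Data Layout Template interface, so the sketch below does not use it. It only illustrates, in plain standard C++, the layout idea such a template is presumably meant to automate: converting an array of structures (AoS) into a structure of arrays (SoA) so that a loop over one field reads contiguous memory. All type and function names here are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures element: x, y, z are interleaved in memory.
struct PointAoS {
    float x, y, z;
};

// Structure-of-arrays container: each field is stored contiguously,
// which gives unit-stride loads in a vectorized loop.
struct PointsSoA {
    std::vector<float> x, y, z;
    explicit PointsSoA(std::size_t n) : x(n), y(n), z(n) {}
    std::size_t size() const { return x.size(); }
};

// Copy an AoS vector into the SoA layout.
PointsSoA to_soa(const std::vector<PointAoS>& pts) {
    PointsSoA out(pts.size());
    for (std::size_t i = 0; i < pts.size(); ++i) {
        out.x[i] = pts[i].x;
        out.y[i] = pts[i].y;
        out.z[i] = pts[i].z;
    }
    return out;
}

// Sum one field. The pragma is an OpenMP SIMD hint; compilers without
// OpenMP support ignore it and the loop stays correct.
float sum_x(const PointsSoA& p) {
    const float* xs = p.x.data();
    const std::size_t n = p.size();
    float total = 0.0f;
    #pragma omp simd reduction(+:total)
    for (std::size_t i = 0; i < n; ++i) {
        total += xs[i];
    }
    return total;
}
```

The trade-off is the usual one: SoA favors loops that touch one field across many elements, while AoS favors code that touches every field of a single element.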


10 Impressive Performance Improvement: Intel Compiler OpenMP* Explicit Vectorization

- Two added lines (the pragmas) take full advantage of either SSE or AVX
- Pragmas are ignored by other compilers, so the code is portable

    #include <complex.h>   /* float complex, cabsf */
    #include <stdint.h>

    typedef float complex fcomplex;
    const uint32_t max_iter = 3000;

    /* Request a SIMD variant of mandel(); max_iter is the same in every lane. */
    #pragma omp declare simd uniform(max_iter), simdlen(16)
    uint32_t mandel(fcomplex c, uint32_t max_iter)
    {
        uint32_t count = 1;
        fcomplex z = c;
        while ((cabsf(z) < 2.0f) && (count < max_iter)) {
            z = z * z + c;
            count++;
        }
        return count;
    }

    /* Indexed as count[y][x], so rows are image lines. */
    uint32_t count[ImageHeight][ImageWidth];
    ..
    for (int32_t y = 0; y < ImageHeight; ++y) {
        float c_im = max_imag - y * imag_factor;
        /* Vectorize across the pixels of one row. */
        #pragma omp simd safelen(16)
        for (int32_t x = 0; x < ImageWidth; ++x) {
            fcomplex in_vals_tmp = (min_real + x * real_factor) + (c_im * 1.0iF);
            count[y][x] = mandel(in_vals_tmp, max_iter);
        }
    }

Mandelbrot calculation speedup (normalized performance data, higher is better): Serial 1.00, SSE 4.2 2.48, Core-AVX2 4.27

Configuration: Intel Xeon CPU 3.50 GHz Haswell system (4 cores with Hyper-Threading On), running at 3.50GHz, with 32.0GB RAM, L1 Cache 256KB, L2 Cache 1.0MB, L3 Cache 8.0MB, 64-bit Windows* Server 2012 R2 Datacenter. Compiler options: SSE4.2: -O3 -Qopenmp -simd -QxSSE4.2 or AVX2: -O3 -Qopenmp -simd -QxCORE-AVX2. For more information go to
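The same two-pragma pattern can be tried without C99 complex types. The following is a minimal sketch, not the code behind the measurements above: it tracks the real and imaginary parts as separate floats, and the function names escape_count and mandel_row are made up for this example. Built without OpenMP support, the pragmas are ignored and the code runs serially with the same results.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Escape-time count for one point c = (re, im). The pragma asks the
// compiler to also generate a SIMD variant; max_iter is uniform
// (identical in every SIMD lane).
#pragma omp declare simd uniform(max_iter)
std::uint32_t escape_count(float re, float im, std::uint32_t max_iter) {
    float zr = re;
    float zi = im;
    std::uint32_t count = 1;
    // |z| < 2 is tested as |z|^2 < 4 to avoid a square root.
    while (zr * zr + zi * zi < 4.0f && count < max_iter) {
        const float next_zr = zr * zr - zi * zi + re;
        zi = 2.0f * zr * zi + im;
        zr = next_zr;
        ++count;
    }
    return count;
}

// Compute one image row; the loop over x is the one marked for SIMD.
std::vector<std::uint32_t> mandel_row(float c_im, float min_real,
                                      float real_factor, int width,
                                      std::uint32_t max_iter) {
    std::vector<std::uint32_t> row(static_cast<std::size_t>(width));
    std::uint32_t* out = row.data();
    #pragma omp simd
    for (int x = 0; x < width; ++x) {
        out[x] = escape_count(min_real + x * real_factor, c_im, max_iter);
    }
    return row;
}
```

Whether the loop actually vectorizes, and by how much it speeds up, depends on the compiler, its flags and the target instruction set; checking the compiler's vectorization report is the reliable way to find out.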

11 Impressive performance improvement: Intel C++ Explicit Vectorization using OpenMP* SIMD

[Chart: SIMD speedup on Intel Xeon processor. Normalized performance data, higher is better. Benchmarks: AoBench, Collision Detection, Grassshader, Mandelbrot, Libor, RTM-stencil, plus their geomean. Series: Serial (1.00 for every benchmark), SSE4.2, Core-AVX2.]

Configuration: Intel Xeon CPU 3.50 GHz Haswell system (4 cores with Hyper-Threading On), running at 3.50GHz, with 32.0GB RAM, L1 Cache 256KB, L2 Cache 1.0MB, L3 Cache 8.0MB, 64-bit Windows* Server 2012 R2 Datacenter. Compiler options: SSE4.2: -O3 -Qopenmp -simd -QxSSE4.2 or AVX2: -O3 -Qopenmp -simd -QxCORE-AVX2. For more information go to

13 Boost NumPy/SciPy performance with Intel MKL: Intel Distribution for Python*

Easy access to high-performance Python:
- NumPy/SciPy/Scikit-Learn/pandas accelerated with Intel MKL
- Close to 100X performance speedups on select functions
- Includes Python optimized modules for Intel TBB, Intel DAAL
- Includes numba, Cython, pydaal

Integrated distribution, out-of-the-box access to performance:
- Python 2.7 & 3.5
- Windows, Linux, macOS
- Latest optimizations for Intel Xeon and Intel Xeon Phi processors
- Available as free standalone, via conda* and Intel Parallel Studio XE

14 Close to 100X faster for select functions

15 New! Profile Python & Go using Intel VTune Amplifier, and Mixed Python / C++ / Fortran

- Low overhead sampling: accurate performance data without high-overhead instrumentation; launch the application or attach to a running process
- Precise line-level details: no guessing, see source line level detail
- Mixed Python / native C, C++, Fortran: optimize native code driven by Python

16 Performance libraries
- Intel Math Kernel Library
- Intel Data Analytics Acceleration Library
- Intel Integrated Performance Primitives
- Intel Threading Building Blocks


18 Intel Math Kernel Library

Speeds math processing for machine learning, scientific, engineering, financial and design applications.

Application areas: Energy, Science & Research, Engineering Design, Financial Analytics, Signal Processing, Digital Content Creation

- Includes functions for dense and sparse linear algebra (BLAS, LAPACK, PARDISO), FFTs, vector math, summary statistics and more
- De facto standard APIs for easy switching from other math libraries
- Highly optimized, threaded and vectorized to maximize processor performance

19 Components of Intel MKL 2017

- Linear Algebra: BLAS, LAPACK, ScaLAPACK, Sparse BLAS, Sparse Solvers (Iterative, PARDISO*, Cluster Sparse Solver)
- Fast Fourier Transforms: Multidimensional, FFTW interfaces, Cluster FFT
- Vector Math: Trigonometric, Hyperbolic, Exponential, Log, Power, Root, Vector RNGs
- Summary Statistics: Kurtosis, Variation coefficient, Order statistics, Min/max, Variance-covariance
- And More: Splines, Interpolation, Trust Region, Fast Poisson Solver
- Deep Neural Networks (New): Convolution, Pooling, Normalization, ReLU, Softmax

20 Performance Benefit to Applications: Intel MKL
Significant LAPACK performance boost using Intel Math Kernel Library versus ATLAS*

[Chart: DGETRF on Intel Xeon E Processor. Performance (GFlops) versus matrix size. Series: Intel MKL, 16 threads; Intel MKL, 8 threads; ATLAS, 16 threads; ATLAS, 8 threads.]

Intel MKL provides a significant performance boost over ATLAS*. The latest version of Intel MKL unleashes the performance benefits of Intel architectures.

Configuration: Hardware: CPU: Dual Intel Xeon E5-2697v2@2.70Ghz; 64 GB RAM. Interconnect: Mellanox Technologies* MT27500 Family [ConnectX*-3] FDR. Software: RedHat* RHEL 6.2; OFED 3.5-2; Intel MPI Library 5.0; Intel MPI Benchmarks (default parameters; built with Intel C++ Compiler XE for Linux*).

21 What's New: Intel MKL 2017
- Optimized math functions to enable neural networks (CNN and DNN) for deep learning
- Improved ScaLAPACK performance for symmetric eigensolvers on HPC clusters
- New data fitting functions based on B-splines and monotonic splines
- Improved optimizations for newer Intel processors, especially Knights Landing Xeon Phi
- Extended TBB threading layer support for all BLAS level-1 functions


23 Intel DAAL Overview

Industry-leading performance: a C++/Java/Python library for machine learning and deep learning, optimized for Intel Architectures.

Domains: Scientific/Engineering, Web/Social, Business

Data analytics stages: Pre-processing, Transformation, Analysis, Modeling, Validation, Decision Making

Algorithms: (De-)Compression, PCA, Statistical moments, Variance matrix, QR, SVD, Cholesky, Apriori, Linear regression, Naïve Bayes, SVM, Classifier boosting, Kmeans, EM GMM, Collaborative filtering, Neural Networks

24 Example Performance: Intel DAAL vs. Spark* MLLib

[Chart: PCA (correlation method) on an 8-node Hadoop* cluster based on Intel Xeon Processors E v3. Vertical axis: speedup. Horizontal axis: table size (1M x 200, 1M x 400, 1M x 600, 1M x 800, 1M x 1000). Bar labels: X, 6X, 6X, 7X, 7X.]

Configuration Info - Versions: Intel Data Analytics Acceleration Library 2016, CDH v5.3.1, Apache Spark* v1.2.0; Hardware: Intel Xeon Processor E v3, 2 Eighteen-core CPUs (45MB LLC, 2.3GHz), 128GB of RAM per node; Operating System: CentOS 6.6 x86_64.

25 What's New: Intel DAAL 2017
- Neural Networks
- Python API (a.k.a. PyDAAL): easy installation through Anaconda or pip
- New data source connector for KDB+
- Open source project on GitHub. Fork me on GitHub:


27 Rich Feature Set for Parallelism: Intel Threading Building Blocks (Intel TBB)

Parallel algorithms and data structures:
- Generic Parallel Algorithms: efficient, scalable way to exploit the power of multicore without having to start from scratch
- Flow Graph: a set of classes to express parallelism as a graph of compute dependencies and/or data flow
- Concurrent Containers: concurrent access, and a scalable alternative to containers that are externally locked for thread-safety

Threads and synchronization:
- Synchronization Primitives: atomic operations, a variety of mutexes with different properties, condition variables
- Timers and Exceptions: thread-safe timers and exception classes
- Threads: OS API wrappers
- Thread Local Storage: efficient implementation for an unlimited number of thread-local variables

Memory allocation and task scheduling:
- Task Scheduler: sophisticated work scheduling engine that empowers parallel algorithms and the flow graph
- Memory Allocation: scalable memory manager and false-sharing free allocators

28 What's new: Intel Threading Building Blocks 2017
- static_partitioner class: helps minimize the overhead of parallel loops
- streaming_node class: enables heterogeneous streaming computations within the flow graph
- Added a method to isolate execution of a group of tasks or an algorithm from other tasks submitted to the scheduler
- A preview feature for a Python* module is added to replace Python's thread pool class
- Graph/stereo example is added
- Improvements to the graph/fgbzip example (added async_msg usage example)
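The slide does not say how static_partitioner reduces loop overhead. A plausible reading, stated here as an assumption, is that the iteration range is divided among workers once, up front, so no further splitting or rebalancing decisions are made while the loop runs. The sketch below illustrates only that up-front division in plain standard C++; it does not call TBB, and static_chunks is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Divide the half-open range [0, n) into `workers` contiguous chunks
// whose sizes differ by at most one. Each pair is [begin, end).
// This shows the idea of static partitioning only; it is not TBB code.
std::vector<std::pair<std::size_t, std::size_t>>
static_chunks(std::size_t n, std::size_t workers) {
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    if (workers == 0) {
        return chunks;  // nothing to assign work to
    }
    const std::size_t base = n / workers;   // minimum chunk size
    const std::size_t extra = n % workers;  // first `extra` chunks get one more
    std::size_t begin = 0;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t len = base + (w < extra ? 1 : 0);
        chunks.emplace_back(begin, begin + len);
        begin += len;
    }
    return chunks;
}
```

A fixed division like this suits loops whose iterations cost about the same; when iteration costs vary widely, a scheme that can rebalance at run time is usually the safer choice.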


30 Intel IPP Domain Applications

- Image Processing: Medical Imaging, Computer Vision, Digital Surveillance, Biometric Identification, Automated Sorting, ADAS, Visual Search
- Signal Processing: Games (sophisticated audio content or effects), Echo cancellation, Telecommunications, Energy
- Data Compression & Cryptography: Data Centers, Enterprise data management, ID verification, Smart cards/wallets, Electronic signature, Information security / cybersecurity

31 What's new: Intel Integrated Performance Primitives 2017
- Extended optimization for Intel AVX-512 on KNL and Intel Xeon processors
- Intel IPP Platform-Aware APIs in the image and signal processing domains are added to support external threading and 64-bit data length
- Significantly improved performance of zlib compression functions
- Extension of IPP optimized functionality in OpenCV
- Limited pre-silicon optimizations for KNH and CNL EP/XE server

32 Analysis tools
- Intel VTune Amplifier XE: Performance Profiler
- Intel Inspector XE: Memory & Thread Debugger
- Intel Advisor XE: Vectorization Optimization and Thread Prototyping


34 Intel VTune Amplifier: Faster, Scalable Code, Faster

Get the Data You Need:
- Hotspot (statistical call tree), call counts (statistical)
- Thread profiling: concurrency and lock & waits analysis
- Cache miss, bandwidth analysis (1)
- GPU offload and OpenCL kernel tracing

Find Answers Fast:
- View results on the source / assembly
- OpenMP scalability analysis, graphical frame analysis
- Filter out extraneous data; organize data with viewpoints
- Visualize thread & task activity on the timeline

Easy to Use:
- No special compiles: C, C++, C#, Fortran, Java, ASM
- Visual Studio* integration or stand-alone
- Graphical interface & command line
- Local & remote data collection
- Analyze Windows* & Linux* data on OS X* (2)

(1) Events vary by processor.
(2) No data collection on OS X*.

Screenshots: Quickly Find Tuning Opportunities; See Results On The Source Code; Tune OpenMP Scalability; Visualize & Filter Data

35 New for 2017! Python, FLOPS, Storage & More: Intel VTune Amplifier Performance Profiler

- New! Profile Python and mixed Python / C++ / Fortran
- Tune Intel Xeon Phi Knights Landing processors
- Quickly see 3 keys to HPC performance
- Optimize memory access
- Storage analysis: I/O bound or CPU bound?
- Enhanced OpenCL & GPU profiling
- Easier remote and command line usage
- Add custom counters to the timeline
- Preview: Application & Storage Performance Snapshots
- Intel Advisor: optimize vectorization for AVX-512 (with or without hardware)


36 New! Intel VTune Amplifier Tunes Knights Landing Processors
4 Critical Optimizations for Intel Xeon Phi Processors

1) High Bandwidth Memory
   - Decide which data structures to place in MCDRAM
   - See performance problems by memory hierarchy
   - Measure DRAM and MCDRAM bandwidth
2) Scalability of MPI and OpenMP
   - Serial vs. parallel time
   - Imbalance, overhead cost, parallel loop parameters
3) Micro Architecture Efficiency
   - See the efficiency of your code in the core pipeline
   - Zero in on details with custom PMU events
4) Vectorization Efficiency
   - Use Intel Advisor
   - Optimize for AVX-512 with or without AVX-512 hardware

37 Optimize Memory Access: Memory Access Analysis, Intel VTune Amplifier 2017 (Improved!)
Tune data structures for performance: attribute cache misses to data structures (not just the code causing the miss); support for custom memory allocators.
Optimize NUMA latency & scalability: true & false sharing optimization; auto-detect max system bandwidth; easier tuning of inter-socket bandwidth.
Easier install, latest processors: no special drivers required on Linux*; Intel Xeon Phi processor MCDRAM (high bandwidth memory) analysis.
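The false sharing item on this slide can be made concrete with a small layout sketch. It is not taken from the slides: the struct names, the helper same_cache_line, and the 64-byte cache-line size are assumptions chosen for illustration. In the packed layout, two counters written by different threads sit on one cache line, so each write invalidates the other thread's copy of the line; in the padded layout each counter gets a line of its own.

```cpp
#include <cstddef>
#include <cstdint>

// Assumed cache-line size: 64 bytes is common on x86 but is not guaranteed.
constexpr std::size_t kCacheLine = 64;

// Packed layout (hypothetical): `a` and `b` are adjacent, so a thread
// writing `a` and a thread writing `b` contend for the same cache line.
struct alignas(kCacheLine) PackedCounters {
    long a = 0;  // written by one thread
    long b = 0;  // written by another thread
};

// Padded layout (hypothetical): each counter starts on its own cache line,
// which removes the false sharing at the cost of extra memory.
struct PaddedCounters {
    alignas(kCacheLine) long a = 0;
    alignas(kCacheLine) long b = 0;
};

// Returns true when both addresses fall inside the same cache-line-sized,
// cache-line-aligned block of memory.
inline bool same_cache_line(const void* p, const void* q) {
    const auto line = [](const void* x) {
        return reinterpret_cast<std::uintptr_t>(x) / kCacheLine;
    };
    return line(p) == line(q);
}
```

A memory access profile that attributes misses to data structures is what would typically point at a struct like PackedCounters; the padding is then a one-line layout change.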

38 Storage Device Analysis (HDD, SATA or NVMe SSD): Intel VTune Amplifier (New!)
Are You I/O Bound or CPU Bound? Explore imbalance between I/O operations (async & sync) and compute; storage accesses mapped to the source code; see when the CPU is waiting for I/O; measure bus bandwidth to storage.
Latency analysis: tune storage accesses with the latency histogram; distribution of I/O over multiple devices.
Screenshot callouts: sliders set thresholds for I/O Queue Depth; slow task with I/O wait.

39 Intel Performance Snapshots: Three Fast Ways to Discover Untapped Performance
Is your application making good use of modern computer hardware? Run a test case during your coffee break. A high-level summary shows which apps can benefit most from code modernization and faster storage.
Pick a Performance Snapshot: Application, for non-MPI apps; MPI, for MPI apps; Storage, for systems (servers and workstations) with directly attached storage. Two of the snapshots are marked New!
Free download: Also included with Intel Parallel Studio and Intel VTune Amplifier products.

41 Find & Debug Memory & Threading Errors: Intel Inspector, Memory & Thread Debugger
Correctness Tools Increase ROI by 12%-21% [1]: errors found earlier are less expensive to fix. Several studies; ROI% varies, but earlier is cheaper.
Diagnosing Some Errors Can Take Months: races & deadlocks are not easily reproduced; memory errors are hard to find without a tool.
Debugger Integration Speeds Diagnosis: breakpoint set just before the problem; examine variables & threads with the debugger; diagnose in hours instead of months.
Screenshot caption: Debugger Breakpoints.
Part of Intel Parallel Studio Professional for Windows* and Linux*. From $1,599.
"Intel Inspector dramatically sped up our ability to track down difficult to isolate threading errors before our packages are released to the field." Peter von Kaenel, Director, Software Development, Harmonic Inc.
[1] Cost Factors: Square Project Analysis. CERT: U.S. Computer Emergency Readiness Team, and Carnegie Mellon CyLab. NIST: National Institute of Standards & Technology: Square Project Results.

42 New for 2017! New Processors, New C++ Language Features: Intel Inspector 2017, Memory and Thread Debugger
New C++ Language Features: full C++11 support, including std::mutex and std::atomic.
Easier Identification of Threading Bugs: the name of the variable causing the error is shown (global, static & stack), in addition to the code lines.
Run Native on Intel Xeon Phi Processors (New!): this simplifies the workflow for Intel Xeon Phi processor development. Tip: reduce the thread count to 30 for best KNL performance while running Intel Inspector.

44 Get Faster Code Faster! Intel Advisor Thread Prototyping
Have you: threaded an app, but seen little benefit? Hit a scalability barrier? Delayed a release due to synchronization errors?
Data Driven Threading Design: quickly prototype multiple options; project scaling on larger systems; find synchronization errors before implementing threading; design without disrupting development.
Add Parallelism with Less Effort, Less Risk and More Impact.
"Intel Advisor has allowed us to quickly prototype ideas for parallelism, saving developer time and effort." Simon Hammond, Senior Technical Staff, Sandia National Laboratories

45 Faster Code Faster with Data Driven Design: Intel Advisor, Vectorization Optimization and Thread Prototyping
Faster Vectorization Optimization: vectorize where it will pay off most; quickly identify what is blocking vectorization; tips for effective vectorization; safely force compiler vectorization; optimize memory stride.
Breakthrough for Threading Design: quickly prototype multiple options; project scaling on larger systems; find synchronization errors before implementing threading; design without disrupting development.
Less Effort, Less Risk and More Impact. Part of Intel Parallel Studio for Windows* and Linux*.
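The "optimize memory stride" item can be illustrated with a minimal sketch. The function names and the row-major matrix layout are assumptions for illustration, not something the slides define. Both functions compute the same sum; the first reads adjacent elements in its inner loop (unit stride), while the second jumps a whole row between accesses, which is the kind of pattern a stride analysis would flag as hard to vectorize efficiently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Row-major matrix: element (r, c) is stored at index r * cols + c.

// Unit stride: the inner loop walks adjacent elements, which is
// friendly to both the cache and the vectorizer.
double sum_unit_stride(const std::vector<double>& m,
                       std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    double total = 0.0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            total += m[r * cols + c];
        }
    }
    return total;
}

// Stride of `cols`: the loops are interchanged, so consecutive inner
// iterations are `cols` elements apart in memory.
double sum_strided(const std::vector<double>& m,
                   std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    double total = 0.0;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            total += m[r * cols + c];
        }
    }
    return total;
}
```

When the strided form shows up in a hot loop, interchanging the loops (as in sum_unit_stride) or changing the data layout is usually the first thing to try.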

46 New for 2017! AVX-512, FLOPS, & More: Intel Advisor Vectorization Optimization
Next Gen Intel Xeon Phi Support: tune for AVX-512 with or without AVX-512 hardware.
Precise FLOPS calculation; enhanced memory access analysis; easier selection of high-impact loops; batch mode workflow saves time; fast answers with Loop Analytics.

47 Intel MPI Library Intel Trace Analyzer and Collector

48 Intel MPI Library Overview
Optimized MPI application performance: application-specific tuning; automatic tuning; New! support for Intel Xeon Phi processor (code named Knights Landing); New! support for Intel Omni-Path Architecture fabric.
Lower latency and multi-vendor interoperability: industry-leading latency; performance-optimized support for the fabric capabilities through OpenFabrics* (OFI).
Faster MPI communication: optimized collectives.
Sustainable scalability up to 340K cores: native InfiniBand* interface support allows for lower latencies, higher bandwidth, and reduced memory requirements.
More robust MPI applications: seamless interoperability with Intel Trace Analyzer and Collector.
Diagram: applications (CFD, Crash, Climate, OCD, BIO, Other...) are developed for one fabric against the Intel MPI Library; the interconnect fabric is selected at runtime (shared memory, TCP/IP, Omni-Path, InfiniBand, iWARP, other networks) to achieve optimized MPI performance on the cluster. One MPI library to develop, maintain & test for multiple fabrics.

49 What's New: Intel MPI Library 2017
Ready for Intel Xeon Phi processors (code named Knights Landing (KNL)); ready for Intel Omni-Path Architecture fabric; use of a specially optimized memcpy for KNL; tuning of shared-memory collectives on single KNL nodes; general optimization of RMA; general optimization and faster startup time and MPI tune utility.

50 Intel Trace Analyzer and Collector Overview
Intel Trace Analyzer and Collector helps the developer: visualize and understand parallel application behavior; evaluate profiling statistics and load balancing; identify communication hotspots.
Features: event-based approach; low overhead; excellent scalability; powerful aggregation and filtering functions; Idealizer; automatically detect performance issues and their impact on runtime.

51 MPI Performance Snapshot: Scalable profiling for MPI and Hybrid
Lightweight: low-overhead profiling for 100K+ ranks.
Scalability: performance variation at scale can be detected sooner.
Identifying Key Metrics: shows MPI/OpenMP imbalances.

52 What's New: Intel Trace Analyzer and Collector
Intel Trace Analyzer and Collector will be ready for KNL; scalability of the imbalance profiler improved by up to 10x; improved MPI Snapshot feature; HTML output.

53 Additional Material
Product page: overview, features, FAQs, support. Training materials: movies, tech briefs, documentation. Evaluation guides: step-by-step walk-through. Reviews.
Additional Development Products: Intel Software Development Products.
For more detail on each component of Parallel Studio XE, visit Inside Blue.

55 Enhanced application performance with AVX-512 support
Enhanced performance due to AVX-512 instructions taking advantage of FMA units, memcpy, new prefetch instructions, new transcendental instructions, MCDRAM, and the increased number of cores.

56 Enhanced application performance with AVX-512 support
Table columns: key functionality / library domain; KNL features used to deliver enhanced performance (instructions, other).
Intel Math Kernel Library domains: *GEMMs/BLAS; MP Linpack; LU/Cholesky/QR/LAPACK/SMP Linpack; 2D and 3D FFTs; DNN; Sparse; Vector Statistics; Vector Math.
Intel Integrated Performance Primitives domains: all from Signal Processing (1D) up to Image (2D) and Volume (3D) processing.
Intel Data Analytics Acceleration Library.
KNL features, in slide order: two FMA units + 2 instruction decoders are key, AVX512 FMA (vfmadd231ps or vfmadd231pd); same as in BLAS (as the main LAPACK kernel is ?*gemm) + greater core count, prefetcht0 instruction, MCDRAM; two FMA units + 2 instruction decoders, MCDRAM, tile-to-tile mesh; two FMA units + 2 instruction decoders, MCDRAM, tile-to-tile mesh, AVX512 FMA; two FMA units + 2 instruction decoders, MCDRAM, AVX512 FMA; similar to BLAS/LAPACK, greater number of cores, AVX512 FMA; two FMA units + 2 instruction decoders, large number of cores for MT performance, AVX512 FMA; prefetcht1 instruction; prefetcht0, prefetcht1 instruction; masking support; large core count; prefetcht1 instruction; depend on seq. BLAS level 3 Knights Landing improvement.
New Transcendental Support Instructions: VGETEXP, VGETMANT, VRNDSCALE, VSCALEF, VFIXUPIMM, VRCP28, VRSQRT28, VEXP2. The main advantage inherited from LRB/KNC is support of mask registers and therefore support of predicates for all new instructions. Then, full 512-bit register palign support (no lane restrictions as for old AVX palign): _mm512_alignr_epi32, _mm512_alignr_epi64. Then, on-the-fly integer conversions: vpmovq{w b d}, vpmovq{w b}. And the last one, integer any-direction comparison: vpcmp{d q} and vpcmpu{d q}. Similar to BLAS/LAPACK, greater number of cores.
Intel MPI Library: used the compiler's AVX-512 version of memcpy (but w/ fix, failed CQ on ICC). Build IMPI w/ -fvisibility=hidden (make all symbols hidden by default and only needed as external). Addressed KNL micro-arch features, such as short BTB, by reducing access to PLT/GOT. Reduced/simplified the critical path where possible. Addressed KNL front-end specifics.

57 Easy access to Parallel Studio XE Runtimes: For Amazon Web Services users only
Intel Parallel Studio XE Runtime: required to run applications built with the Intel Performance Libraries or Intel Compilers; includes the latest optimizations for Intel Architecture for faster application performance; Linux only.
Easy access for Amazon Web Services users at no cost: latest runtimes through Linux native repos; YUM repo available now! (

58 Educating with Webinar series about 2017 tools
Expert talks about the new features. Series of live webinars, Sept 13 to Nov 8, 2016. Attend live, or watch after the fact.

59 Educating with High Performance Programming Book
Intel Xeon Phi Processor High Performance Programming: Knights Landing specific details, programming advice and real-world examples. Techniques to generally increase program performance on any system and prepare you better for Intel Xeon Phi processors. Available as of June 2016.
"I believe you will find this book is an invaluable reference to help develop your own Unfair Advantage." James A. Manager Sandia National Laboratories

60 More education with software.intel.com/moderncode
Online community: growing collection of tools, trainings, support; features Black Belts in parallelism from Intel & industry.
Intel HPC Developer Conferences: developers share proven techniques and best practices. hpcdevcon.intel.com
Hands-on training for developers and partners, with remote access to Intel Xeon processor and Xeon Phi coprocessor-based clusters. software.intel.com/icmp
Developer Access Program: provides early access to Intel Xeon Phi processor (codenamed Knights Landing) + 1 year license for Intel Parallel Studio XE Cluster Edition.

61 Choices to Fit Needs: Intel Tools
All products with support: worldwide, for purchase. Intel Premier Support: private direct support from Intel; support for past versions. software.intel.com/products
Most products without Premier support: via special programs for those who qualify (students, educators, classroom use, open source developers, and academic researchers); community support only (qualification required). software.intel.com/qualify-for-free-software
Intel Performance Libraries without Premier support: community licensing for Intel performance libraries, with no royalties and no restrictions based on company or project size; community support only (no qualification required). software.intel.com/nest

62 What's New: Intel C++ Compiler
SIMD Data Layout Templates to facilitate vectorization for your C++ code.
Enhanced C11 and C++14 language standards support: sized deallocation; relaxed constexpr restrictions; variable templates; single quotation mark as a digit separator.
Enhanced GNU* and Microsoft* compatibility: SSE cast support; diagnostic improvements on template arguments.

63 What's New: Intel Fortran Compiler
Substantial Coarray Fortran performance improvement on non-trivial programs.
Almost complete Fortran 2008 support.
Enhanced Fortran 2008 and draft Fortran 2015 language standards support: implied-shape PARAMETER arrays; 2008 BIND(C) internal procedures; extended EXIT for all named blocks; pointer initialization.
VS2013 Shell replaces VS2010 Shell on Windows.

64 Impressive Performance Improvement: Intel Compiler OpenMP* Explicit Vectorization
Three added lines take full advantage of both SSE and AVX. The pragmas are ignored by other compilers, so the code is portable.

    #pragma omp declare simd linear(z:40) uniform(L, N, Nmat) linear(k)
    float path_calc(float *z, float L[][VLEN], int k, int N, int Nmat)

    #pragma omp declare simd uniform(L, N, Nopt, Nmat) linear(k)
    float portfolio(float L[][VLEN], int k, int N, int Nopt, int Nmat)

    for (path=0; path<npath; path+=VLEN) {
        /* Initialise forward rates */
        z = z0 + path * Nmat;
        #pragma omp simd linear(z:Nmat)
        for(int k=0; k < VLEN; k++) {
            for(i=0;i<N;i++) {
                L[i][k] = L0[i];
            }
            /* LIBOR path calculation */
            float temp = path_calc(z, L, k, N, Nmat);
            v[k+path] = portfolio(L, k, N, Nopt, Nmat);
            /* move pointer to start of next block */
            z += Nmat;
        }
    }

Libor calculation speedup (normalized performance data, higher is better): Serial 1; SSE 4.2 3.51; Core-AVX2 6.61.
Configuration: Intel Xeon CPU 3.50 GHz Haswell system (4 cores with Hyper-Threading On), running at 3.50GHz, with 32.0GB RAM, L1 Cache 256KB, L2 Cache 1.0MB, L3 Cache 8.0MB, 64-bit Windows* Server 2012 R2 Datacenter. Compiler options: SSE4.2: O3 Qopenmp -simd QxSSE4.2 or AVX2: -O3 Qopenmp simd -QxCORE-AVX2. For more information go to
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. * Other brands and names are the property of their respective owners.
Benchmark Source: Intel Corporation. Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #
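The slide's code is an excerpt and cannot be compiled on its own. As a smaller, self-contained sketch of the same two OpenMP constructs, the example below uses hypothetical names (scaled_square, scaled_square_all) that do not come from the slides. It assumes a compiler with OpenMP SIMD support enabled (for example -fopenmp-simd with GCC or Clang, or -qopenmp-simd / /Qopenmp-simd with the Intel compiler); without such a flag the pragmas are ignored and the code still runs as ordinary scalar C++.

```cpp
#include <cassert>

// Request a vector variant of this function. `scale` has the same value
// in every SIMD lane, so it is declared uniform.
#pragma omp declare simd uniform(scale)
float scaled_square(float x, float scale) {
    return scale * x * x;
}

// Apply scaled_square to every element. The simd pragma tells the compiler
// the loop iterations are independent and may be executed in vector lanes,
// calling the vector variant generated above.
void scaled_square_all(const float* in, float* out, int n, float scale) {
    assert(n >= 0);
    #pragma omp simd
    for (int i = 0; i < n; ++i) {
        out[i] = scaled_square(in[i], scale);
    }
}
```

Called with in = {1, 2, 3, 4} and scale = 2, the loop fills out with {2, 8, 18, 32}, whether or not the pragmas are honored.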

65 Impressive Performance Improvement: Intel C++ Explicit Vectorization, SIMD Performance
One added line takes full advantage of both SSE and AVX. The pragma is ignored by other compilers, so the code is portable.

    #pragma simd vectorlength(8)
    for (int x = x0; x < x1; ++x) {
        float div = coef[0] * A_cur[x]
            + coef[1] * ((A_cur[x + 1] + A_cur[x - 1]) + (A_cur[x + Nx] + A_cur[x - Nx]) + (A_cur[x + Nxy] + A_cur[x - Nxy]))
            + coef[2] * ((A_cur[x + 2] + A_cur[x - 2]) + (A_cur[x + sx2] + A_cur[x - sx2]) + (A_cur[x + sxy2] + A_cur[x - sxy2]))
            + coef[3] * ((A_cur[x + 3] + A_cur[x - 3]) + (A_cur[x + sx3] + A_cur[x - sx3]) + (A_cur[x + sxy3] + A_cur[x - sxy3]))
            + coef[4] * ((A_cur[x + 4] + A_cur[x - 4]) + (A_cur[x + sx4] + A_cur[x - sx4]) + (A_cur[x + sxy4] + A_cur[x - sxy4]));
        A_next[x] = 2 * A_cur[x] - A_next[x] + vsq[s+x] * div;
    }

RTM-stencil calculation speedup (normalized performance data, higher is better): Serial 1; SSE 4.2 3.91; Core-AVX2 6.06.
Configuration: Intel Xeon CPU 3.50 GHz Haswell system (4 cores with Hyper-Threading On), running at 3.50GHz, with 32.0GB RAM, L1 Cache 256KB, L2 Cache 1.0MB, L3 Cache 8.0MB, 64-bit Windows* Server 2012 R2 Datacenter. Compiler options: SSE4.2: O3 Qopenmp -simd QxSSE4.2 or AVX2: -O3 Qopenmp simd -QxCORE-AVX2. Benchmark Source: Intel Corporation.

66 Substantial Coarray Fortran performance improvement on non-trivial programs
Chart: runtime performance relative to Intel Fortran 16.0 (higher is better), comparing Intel 16.0 and Intel 17.0 on University of Edinburgh EPCC microbenchmarks and on University of Houston NAS Parallel benchmarks, coarray kernels, and coarray microbenchmarks. Values shown: 3.70, 1.00, 1.40, 1.00, 1.00, 1.00, 1.23, 1.01.
Configuration: Windows hardware: HP DL320e Gen8 v2 (single-socket server) with Intel(R) Xeon(R) CPU E GHz, 32 GB RAM, HyperThreading is off; Linux hardware: HP BL460c Gen9 with Intel(R) Xeon(R) CPU E GHz, 256 GB RAM, HyperThreading is on. Software: Intel C++ compiler 16.0, Microsoft (R) C/C++ Optimizing Compiler Version for x86/x64, GCC Linux OS: Red Hat Enterprise Linux Server release 7.1 (Maipo), kernel el7.x86_64. Windows OS: Windows 8.1. SPEC* Benchmark ( Benchmark Source: Intel Corporation.
Intel Confidential: Must be viewed under CNDA. All products, systems, dates and figures are preliminary based on current expectations, and are subject to change without notice.

67 SIMD Data Layout Template (SDLT): Improve productivity and boost C++ performance
Quickly convert an Array of Structures representation to a Structure of Arrays representation.
Increase productivity: use predefined templates with minimal effort, and let SDLT do the vectorization for you.
Improve performance: SDLT vectorizes your code by making memory access contiguous, which can lead to more efficient code and better performance.
Seamless integration: SDLT follows the familiar Intel vector programming model.
"We used SDLT to vectorize the deformer code in Premo, the in-house animation tool for DreamWorks Animation. The performance improvements we were able to achieve were dramatic, and these improvements will translate directly into higher quality characters that will be seen on-screen in future movies. Also the library itself was easy to use and integrate into our existing codebase." Martin Watt, Principal Engineer, DreamWorks Animation
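The Array of Structures to Structure of Arrays conversion that SDLT automates can be sketched by hand in plain C++. The sketch below deliberately does not use the SDLT API; Point, PointsSoA, to_soa and translate_x are hypothetical names chosen for illustration. The point is the layout: after conversion, all x values are contiguous, so a loop over x reads memory with unit stride.

```cpp
#include <cstddef>
#include <vector>

// Array of Structures element: x, y, z of one point are adjacent,
// so consecutive x values are three floats apart in memory.
struct Point {
    float x;
    float y;
    float z;
};

// Structure of Arrays: each field gets its own contiguous array.
struct PointsSoA {
    std::vector<float> x;
    std::vector<float> y;
    std::vector<float> z;
};

// Copies an AoS container into the SoA layout.
PointsSoA to_soa(const std::vector<Point>& aos) {
    PointsSoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    for (const Point& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
    }
    return soa;
}

// Touches only the x array: contiguous, unit-stride accesses that a
// compiler can vectorize without gather instructions.
void translate_x(PointsSoA& pts, float dx) {
    for (std::size_t i = 0; i < pts.x.size(); ++i) {
        pts.x[i] += dx;
    }
}
```

Writing this by hand for every structure is the repetitive work that a layout template library is meant to remove; the trade-off is that code touching all fields of one point now reads from three arrays.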

69 Python Landscape
Adoption of Python continues to grow among domain specialists and developers for its productivity benefits.
Challenge #1: Domain specialists are not professional software programmers.
Challenge #2: Python performance limits migration to production systems.
Intel's solution is to: accelerate Python performance; enable easy access; empower the community.

70 Access multiple options for faster Python: Included in Intel Distribution for Python*
Accelerate with native libraries: NumPy, SciPy, Scikit-Learn, Theano, Pandas, pydaal; Intel MKL, Intel DAAL.
Exploit vectorization and threading: Cython + Intel C++ compiler; Numba + Intel LLVM.
Better/composable threading: Cython, Numba, Pyston; threading composability for MKL, CPython, Blaze/Dask, Numba.
Multi-node parallelism: Mpi4Py, Distarray; Intel native libraries: Intel MPI.
Integration with Big Data, ML platforms and frameworks: Spark, Hadoop, Trusted Analytics Platform.
Better performance profiling: extensions for profiling mixed Python & native/JIT codes.
"I expected Intel's numpy to be fast but it is significant that plain old python code is much faster with the Intel version too." Dr. Donald Kinghorn, Puget Systems Review

71 Intel Distribution for Python* Reviews
"Intel's Python distribution provides a major math boost": The still-in-beta Python distribution uses Math Kernel Library to speed up processing on Intel hardware. The distribution's main touted advantage is speed -- but not a PyPy-style general speedup via a JIT. Instead, the MKL speeds up certain math operations so that they run faster on one thread and multiple threads.
"I expected Intel's numpy to be fast but it is significant that plain old python code is much faster with the Intel version too." Dr. Donald Kinghorn, Puget Systems Review
"HPC Podcast Looks at Intel's Pending Distribution of Python": Yes, Intel is doing their own Python build! It is still in beta but I think it's a great idea... yeah, it's important!

72 Automatic Performance Scaling Back Up from the Core, to Multicore, to Many Core and Beyond: Intel MKL
Extracting performance from the computing resources: core (vectorization, prefetching, cache utilization); multi-/many-core (processor/socket) level parallelization; multi-socket (node) level parallelization; cluster scaling.
Diagram labels: Sequential Intel MKL; MKL + OpenMP; Many Core Intel Xeon Phi(TM) Coprocessor; MKL + Intel MPI.

73 Big Data & Machine Learning Challenge
Volume, Value, Velocity, Variety.
Problem: Big data needs high performance computing. Many big data applications leave performance on the table because they are not optimized for the underlying hardware.
Solution: A performance library provides building blocks that can be easily integrated into big data analytics workflows.

74 Intel Data Analytics Acceleration Library (Intel DAAL)
An Intel-optimized library that provides building blocks for all data analytics stages, from data preparation to data mining & machine learning.
Python, Java & C++ APIs. Can be used with many platforms (Hadoop*, Spark*, R*, Matlab*, ...) but is not tied to any of them. Flexible interface to connect to different data sources (CSV, SQL, HDFS, ...). Windows*, Linux*, and OS X*. Developed by the same team as the industry-leading Intel Math Kernel Library. Open source, with free community-supported and commercial premium-supported options. Also included in Parallel Studio XE suites.

75 Intel Threading Building Blocks: Good Tuning Data Gets Good Results
"Using Intel TBB's new flow graph feature, we accomplished what was previously not possible, parallelize a very sizable task graph with thousands of interrelationships, all in about a week." Robert Link, GCAM Project Scientist, Pacific Northwest National Lab
"Intel's TBB was an invaluable help in multithreading our in-house renderer CGIStudio and is now also used in animation and simulation software. Beside the ease of use, it takes care of the two most important aspects of running an application on multiple cores -- load balancing and scalability." Maurice van Swaaji, Blue Sky Studios
"Intel TBB provided us with optimized code that we did not have to develop or maintain for critical system services. I could assign my developers to code what we bring to the software table." Michaël Rouillé, CTO, Golaem
More Case Studies

76 Intel Threading Building Blocks (Intel TBB)
C++ template library to simplify the task of adding parallelism on a single device or across multiple devices.
Specify tasks instead of manipulating threads: Intel TBB maps your logical tasks onto threads, with full support for nested parallelism.
Targets threading for scalable performance: uses proven, efficient parallel patterns; uses work stealing to balance the load when task execution times are unknown; has the advantage of low-overhead polymorphism.
Flow graph feature allows developers to easily express dependency and data flow graphs.
Has high-level parallel algorithms and concurrent containers, and low-level building blocks such as a scalable memory allocator, locks and atomic operations.
Commercial support for Intel Atom, Core, Xeon processors, and for Intel Xeon Phi processors and coprocessors.
"Using Intel TBB's new flow graph feature, we accomplished what was previously not possible, parallelize a very sizable task graph with thousands of interrelationships, all in about a week." Robert Link, GCAM Project Scientist, Pacific Northwest National Lab
More Case Studies

77 Resources and Availability: Intel Threading Building Blocks (Intel TBB)
Resources: commercial product page: software.intel.com/intel-tbb; Flow Graph Designer: software.intel.com/articles/flow-graph-designer; User Forum: software.intel.com/forums/intel-threading-building-blocks
Available on Linux, Windows, macOS and Android.
Commercially available with Intel Parallel Studio XE 2017: software.intel.com/en-us/intel-parallel-studio-xe
Community licensing for Intel Performance Libraries, without Premier support: software.intel.com/nest
The Open-Source Community Site:

78 Challenges faced by developers
Performance optimization is a never-ending task. Completing key processing tasks within designated time constraints is a critical issue. Hand-optimizing code for one platform makes its performance worse on another platform. With manual optimization, code becomes more complex and difficult to maintain. Code should run as fast as possible without extra effort.

79 Different Domains in Intel IPP
Image Domain: Image Processing, Computer Vision, Color Conversion.
Signal Domain: Signal Processing, Vector Math.
Data Domain: Data Compression, Cryptography, String Processing.

80 Intel Integrated Performance Primitives: Building Blocks for Image, Signal & Data Processing
Provides developers with ready-to-use functions to accelerate image, signal, data processing & cryptography computation tasks.
Optimized for Intel Atom, Core, Xeon processors, and for Intel Xeon Phi processors and coprocessors.
Licensed versions available on Linux, Windows, macOS, Android.
Available as a part of Intel Parallel Studio XE: software.intel.com/en-us/intel-parallel-studio-xe
Community Licensing for Intel Performance Libraries, without Premier support: software.intel.com/nest

81 Correctness Tools Increase ROI by 12%-21%
The size and complexity of applications is growing. Correctness tools find defects during development, prior to shipment. Reworking defects is 40%-50% of total project effort. Reduce the time, effort, and cost to repair: find errors earlier, when they are less expensive to fix.
Cost Factors: Square Project Analysis. CERT: U.S. Computer Emergency Readiness Team, and Carnegie Mellon CyLab. NIST: National Institute of Standards & Technology: Square Project Results.

82 Race Conditions Are Difficult to Diagnose
They only occur occasionally and are difficult to reproduce.

Correct
| Thread 1    | Thread 2    | Shared Counter |
|             |             | 0              |
| Read count  |             | 0              |
| Increment   |             | 0              |
| Write count |             | 1              |
|             | Read count  | 1              |
|             | Increment   | 1              |
|             | Write count | 2              |

Incorrect
| Thread 1    | Thread 2    | Shared Counter |
|             |             | 0              |
| Read count  |             | 0              |
|             | Read count  | 0              |
| Increment   |             | 0              |
|             | Increment   | 0              |
| Write count |             | 1              |
|             | Write count | 1              |
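The two interleavings on this slide can be sketched in standard C++. This is an illustration only, not Intel Inspector output, and the function names `replay_lost_update` and `count_with_atomic` are hypothetical. The first function replays the incorrect interleaving deterministically on a single thread, with two local variables standing in for each thread's private copy of the counter. The second shows one common fix, assuming `std::atomic` is acceptable for the counter: the read, increment, and write become one indivisible step.

```cpp
#include <atomic>
#include <thread>
#include <vector>

// Replays the slide's "incorrect" interleaving on one thread so the
// lost update is reproducible. copy1 and copy2 stand in for the private
// values that Thread 1 and Thread 2 hold between their read and write.
int replay_lost_update() {
    int shared = 0;
    int copy1 = shared;  // Thread 1: read count (sees 0)
    int copy2 = shared;  // Thread 2: read count (also sees 0)
    copy1 += 1;          // Thread 1: increment
    copy2 += 1;          // Thread 2: increment
    shared = copy1;      // Thread 1: write count (counter becomes 1)
    shared = copy2;      // Thread 2: write count (overwrites with 1)
    return shared;       // one increment has been lost
}

// One possible fix: an atomic counter, so that each increment is a
// single indivisible read-modify-write operation.
long count_with_atomic(int num_threads, int increments_per_thread) {
    std::atomic<long> counter{0};
    std::vector<std::thread> workers;
    for (int t = 0; t < num_threads; ++t) {
        workers.emplace_back([&counter, increments_per_thread] {
            for (int i = 0; i < increments_per_thread; ++i) {
                counter.fetch_add(1, std::memory_order_relaxed);
            }
        });
    }
    for (auto& worker : workers) {
        worker.join();
    }
    return counter.load();
}
```

Calling `replay_lost_update()` returns 1 rather than 2, matching the final row of the incorrect table, while `count_with_atomic(4, 10000)` returns 40000 on every run. With a plain `long` in place of the atomic, the second function would contain a data race, which is the kind of defect a threading checker is meant to report.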

83 Debug Memory & Threading Errors: Intel Inspector
Find and eliminate errors: memory leaks, invalid access, races, and deadlocks, in C, C++, and Fortran (or a mix).
Simple, reliable, accurate: no special recompiles; use any build, any compiler (1). Analyzes dynamically generated or linked code. Inspects 3rd-party libraries without source.
Productive user interface plus debugger integration, and a command line for automated regression analysis. Clicking an error instantly displays source code snippets and the call stack. Fits your existing process.
(1) That follows common OS standards.

84 Profile Python & Go! And Mixed Python / C++ / Fortran: Intel VTune Amplifier (New!)
Low-overhead sampling: accurate performance data without high-overhead instrumentation. Launch an application or attach to a running process.
Precise line-level details: no guessing, see source-line-level detail.
Mixed Python / native C, C++, Fortran: optimize native code driven by Python.

85 Three Keys to HPC Performance: Threading, Memory Access, Vectorization. Intel VTune Amplifier (New!)
Threading (CPU utilization): serial vs. parallel time; top OpenMP regions by potential gain. Tip: use hotspot OpenMP region analysis for more detail.
Memory access efficiency: stalls by memory hierarchy; bandwidth utilization. Tip: use Memory Access analysis.
Vectorization (FPU utilization): FLOPS estimates from sampling. Tip: use Intel Advisor for precise metrics and vectorization optimization.
For 3rd, 5th, and 6th Generation Intel Core processors and the second-generation Intel Xeon Phi processor code-named Knights Landing.

86 Application Performance Snapshot (Preview!)
Discover opportunities for better performance with vectorization and threading.
Objectives: simple enough to run during a coffee break; highlight where code modernization can help.
Users: performance teams, for fast prioritization of which apps will benefit most; all developers, to size the potential performance gain from code modernization.
Non-objectives: actionable tuning data; that is another tool. Snapshot is just a fast health check.
Free download: Also included with Intel Parallel Studio and Intel VTune Amplifier products.

87 Free download: Also included with Intel Parallel Studio Cluster Edition. 87

88 Storage Performance Snapshot (Preview!)
Discover if faster storage can improve server/workstation performance.
Learn it on one coffee break: easy setup; quickly see meaningful data; system view of workload; any architecture.
Targeted systems: servers and workstations with directly attached storage, not scale-out storage clusters.
Linux: kernel 2.6 or newer, dstat 0.7 or newer. Windows: Windows Server 2012, Windows 8 or newer Windows OS.
Free download: Also included with Intel Parallel Studio and Intel VTune Amplifier products.

89 Get Faster Code Faster! Intel Advisor: Vectorization Optimization
Have you: Recompiled for AVX2 with little gain? Wondered where to vectorize? Recoded intrinsics for a new architecture? Struggled with compiler reports?
New! Data-driven vectorization: What vectorization will pay off most? What's blocking vectorization, and why? Are my loops vector friendly? Will reorganizing data increase performance? Is it safe to just use pragma simd?
"Intel Advisor's Vectorization Advisor permitted me to focus my work where it really mattered. When you have only a limited amount of time to spend on optimization, it is invaluable." Gilles Civario, Senior Software Architect, Irish Centre for High-End Computing
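The question "Is it safe to just use pragma simd?" comes down to whether loop iterations depend on each other. The sketch below contrasts the two cases. It uses the OpenMP 4.0 `#pragma omp simd` form rather than the Intel-specific `#pragma simd` the slide mentions, which is an assumption about the reader's compiler; compilers without OpenMP SIMD support enabled simply ignore the pragma. The function names are hypothetical.

```cpp
#include <cstddef>
#include <vector>

// Safe to force: every iteration touches only element i, so iterations
// are independent and can run in vector lanes in any order.
// Precondition: a and b hold at least out.size() elements.
void add_arrays(const std::vector<float>& a,
                const std::vector<float>& b,
                std::vector<float>& out) {
    const std::size_t n = out.size();
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}

// Not safe to force: iteration i reads the value written by iteration
// i - 1 (a loop-carried dependency). Asserting independence with a simd
// pragma here could produce wrong results, so the loop is left to the
// compiler's own dependency analysis.
void running_sum(std::vector<float>& a) {
    for (std::size_t i = 1; i < a.size(); ++i) {
        a[i] += a[i - 1];
    }
}
```

A simd pragma is a promise from the programmer, not a request the compiler verifies, which is why a dependency check is worth running before adding one to a loop like the second.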

90 Next Gen Intel Xeon Phi Support (New!)
Vectorization Advisor runs on and optimizes for Intel Xeon Phi.
Screenshot callouts: AVX-512 ERI, specific to Intel Xeon Phi; Efficiency (72%), Speed-up (11.5x), Vector Length (16); performance optimization problem and advice on how to fix it.

91 Precise, Repeatable FLOPS Metrics. Intel Advisor: Vectorization Optimization (New!)
FLOPS by loop and function. All recent Intel processors (not coprocessors). Instrumentation (count FLOP) plus sampling (time with low overhead). Adjusted for masking with AVX-512 processors.
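The arithmetic behind a FLOPS metric is a counted number of floating-point operations divided by a measured time. The sketch below shows that arithmetic and one plausible reading of "adjusted for masking": counting only the lanes a mask enables instead of the full vector length. It is not a description of how Intel Advisor works internally, and all function names are hypothetical.

```cpp
#include <cstdint>

// A dot product of length n performs one multiply and one add per element.
std::uint64_t dot_product_flop(std::uint64_t n) {
    return 2 * n;
}

// Number of lanes enabled by a 16-bit mask (one bit per 32-bit float lane
// of a 512-bit vector).
int active_lanes(std::uint16_t mask) {
    unsigned bits = mask;
    int lanes = 0;
    while (bits != 0) {
        bits &= bits - 1;  // clear the lowest set bit
        ++lanes;
    }
    return lanes;
}

// FLOP for a number of masked vector operations: only active lanes count.
std::uint64_t masked_flop(std::uint64_t vector_ops, std::uint16_t mask) {
    return vector_ops * static_cast<std::uint64_t>(active_lanes(mask));
}

// Rate in GFLOPS: counted operations divided by measured seconds.
double gflops(std::uint64_t flop, double seconds) {
    return static_cast<double>(flop) / seconds / 1e9;
}
```

For example, 100 vector additions under the mask `0x00FF` count as 800 FLOP rather than 1600, because only 8 of the 16 lanes do useful work.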

92 Enhanced Memory Access Analysis: Intel Advisor (New!)
Are you bandwidth or compute limited?
Measure footprint: compare to cache size. Does it fit in cache?
Variable references: map data to variable names for easier analysis.
Gather/scatter: detect unneeded gathers/scatters that reduce performance.
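Two of these checks are easy to picture in code. The sketch below, with hypothetical function names, first compares a loop's data footprint to a cache size, then shows the kind of indexed access that a compiler typically vectorizes with gather instructions. When the index array turns out to be the identity (0, 1, 2, ...), the gather is unneeded and a unit-stride loop computes the same result. The 32 KiB figure in the usage note is an assumed cache size for illustration, not a value from the slide.

```cpp
#include <cstddef>
#include <vector>

// Bytes touched by a loop that reads `elements` items of `element_size`.
std::size_t footprint_bytes(std::size_t elements, std::size_t element_size) {
    return elements * element_size;
}

// True when the footprint does not exceed the given cache capacity.
bool fits_in_cache(std::size_t footprint, std::size_t cache_bytes) {
    return footprint <= cache_bytes;
}

// Indexed (gather-style) access: the address of each load depends on idx[i].
// Precondition: out holds at least idx.size() elements.
void scale_indexed(const std::vector<float>& data,
                   const std::vector<int>& idx,
                   std::vector<float>& out) {
    for (std::size_t i = 0; i < idx.size(); ++i) {
        out[i] = 2.0f * data[idx[i]];
    }
}

// Unit-stride access: equivalent to scale_indexed when idx[i] == i, but
// the loads are contiguous and need no gather.
// Precondition: out holds at least data.size() elements.
void scale_contiguous(const std::vector<float>& data,
                      std::vector<float>& out) {
    for (std::size_t i = 0; i < data.size(); ++i) {
        out[i] = 2.0f * data[i];
    }
}
```

With an assumed 32 KiB cache, `fits_in_cache(footprint_bytes(1024, sizeof(float)), 32 * 1024)` is true, while an array of about one million floats does not fit.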

93 Start Tuning for AVX-512 without AVX-512 Hardware. Intel Advisor: Vectorization Advisor (New!)
Use the -axCOMMON-AVX512 -xAVX compiler flags to generate both code paths: an AVX(2) code path (executed on Haswell and earlier processors) and an AVX-512 code path for newer hardware.
Compare AVX and AVX-512 code with Intel Advisor: inserts (AVX2) vs. gathers (AVX-512); speed-up estimate of 13.5x (AVX2) vs. 30.6x (AVX-512).

94 Faster Code Faster Using Intel Advisor
Vectorization
"Intel Advisor's Vectorization Advisor permitted me to focus my work where it really mattered. When you have only a limited amount of time to spend on optimization, it is invaluable." Gilles Civario, Senior Software Architect, Irish Centre for High-End Computing
"Intel Advisor's Vectorization Advisor fills a gap in code performance analysis. It can guide the informed user to better exploit the vector capabilities of modern processors and coprocessors." Dr. Luigi Iapichino, Scientific Computing Expert, Leibniz Supercomputing Centre
Threading
"Intel Advisor has been extremely helpful in identifying the best pieces of code for parallelization. We can save several days of manual work by targeting the right loops and we can use Advisor to find potential thread safety issues to help avoid problems later on." Carlos Boneti, HPC software engineer, Schlumberger
"Intel Advisor has allowed us to quickly prototype ideas for parallelism, saving developer time and effort, and has already been used to highlight subtle parallel correctness issues in complex multi-file, multi-function algorithms." Simon Hammond, Senior Technical Staff, Sandia National Laboratories

95 Speaker the speaker notes are important for this presentation. Be sure to read them.

96 Optimizing Performance on Parallel Hardware
It's an iterative process.
[Flowchart: Cluster scalable? If not, tune MPI (ignore if you are not targeting clusters). Effective threading? If not, thread. Vectorize. Memory bandwidth sensitive? If so, optimize bandwidth.]

97 Performance Analysis Tools for Diagnosis: Intel Parallel Studio XE
[Flowchart as on the previous slide, annotated with tools.]
Tune MPI: Intel Trace Analyzer & Collector (ITAC), Intel MPI Snapshot, Intel MPI Tuner.
Tools shown for the thread, vectorize, and optimize-bandwidth steps: Intel VTune Amplifier and Intel Advisor.

98 Tools for High Performance Implementation: Intel Parallel Studio XE
[Flowchart as on the previous slides, annotated with tools.]
Tune MPI: Intel MPI Library, Intel MPI Benchmarks.
Tools shown for the remaining steps: Intel Compiler, Intel Math Kernel Library, Intel IPP Media & Data Library, Intel Data Analytics Library, Intel Cilk Plus, Intel OpenMP*, Intel TBB Threading Library.

99 Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries.
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #
