Faster Code. Faster. Intel Parallel Studio XE Unleash the Beast


1 Faster Code. Faster Intel Parallel Studio XE 2017 Unleash the Beast

2 Create Faster Code Faster: Intel Parallel Studio XE
- Design, build, verify and tune
- C++, C, Fortran, Python* and Java*
- Standards-driven parallel models: OpenMP, MPI and TBB

Highlights from the 2017 edition:
- Faster Python* application performance using Intel Distribution for Python and Intel VTune Amplifier XE
- Faster deep learning on IA using Intel Math Kernel Library and Intel Data Analytics Acceleration Library
- Quickly assess application performance using the snapshot features of VTune Amplifier XE and Intel Trace Analyzer and Collector
- Scale to next-generation platforms, including the latest Intel Xeon Phi processor
- Optimizations for AVX-512, high bandwidth memory and explicit vectorization for compiler and analysis tools

3 Intel Parallel Studio XE
- Intel C/C++ & Fortran Compilers
- Intel Distribution for Python: performance scripting

Performance Libraries:
- Intel Math Kernel Library: optimized routines for science, engineering & financial
- Intel Data Analytics Acceleration Library: optimized for data analytics & machine learning
- Intel Integrated Performance Primitives: image, signal & data processing
- Intel Threading Building Blocks: task-based parallel C++ template library

Profiling, Analysis & Architecture:
- Intel VTune Amplifier: performance profiler
- Intel Inspector: memory & threading checking
- Intel Advisor: vectorization optimization & thread prototyping

Cluster Tools:
- Intel MPI Library
- Intel Trace Analyzer & Collector: MPI profiler
- Intel Cluster Checker: cluster diagnostic expert system

4 Boost application performance on Windows* and Linux*: Intel C++ and Fortran Compilers

[Chart: Boost C++ application performance on Windows* & Linux* using Intel C++ Compiler (higher is better). Panels: Floating Point (estimated SPECfp_rate_base2006) and Integer (estimated SPECint_rate_base2006), each for Windows and Linux. Compilers compared: Intel C++ 17.0, Visual C++* 2015, PGI*, Clang* 3.8, GCC*. Relative geomean performance, SPEC* benchmark.]

Configuration: Windows hardware: Intel(R) Xeon(R) CPU E GHz, HT enabled, TB enabled, 32 GB RAM; Linux hardware: Intel(R) Xeon(R) CPU E GHz, 256 GB RAM, HyperThreading is on. Software: Intel compilers 17.0, Microsoft (R) C/C++ Optimizing Compiler Version for x86/x64, GCC, PGI 15.10, Clang/LLVM 3.8. Linux OS: Red Hat Enterprise Linux Server release 7.1 (Maipo), kernel el7.x86_64. Windows OS: Windows 10 Pro ( N/A Build 10240). SPEC* Benchmark. SmartHeap libs 11.3 for Visual C++ and Intel Compiler were used for SPECint benchmarks.

[Chart: Boost Fortran application performance on Windows* & Linux* using Intel Fortran Compiler (higher is better). Panels: Windows and Linux. Compilers compared: Intel Fortran 17.0, PGI Fortran*, Absoft*, Open64*, gfortran*. Relative geomean performance, Polyhedron* benchmark.]

Configuration: Hardware: Intel(R) Xeon(R) CPU E GHz, Hyperthreading enabled, TB enabled, 32 GB RAM. Software: Intel Fortran compiler 17.0, Absoft* 15.0.1, PGI Fortran* (Windows)/16.4 (Linux), Open64* 4.5.2, gfortran*. Linux OS: Red Hat Enterprise Linux Server release 7.2, kernel el7.x86_64. Windows OS: Windows 10 Pro ( N/A Build 10240). Polyhedron Fortran Benchmark.

Windows compiler switches:
- Absoft: -m64 -O5 -speed_math=10 -fast_math -march=core -xinteger -stack:0x
- Intel Fortran compiler: /fast /Qparallel /QxCORE-AVX2 /nostandard-realloc-lhs /link /stack:
- PGI Fortran: -fastsse -Munroll=n:4 -Mipa=fast,inline -Mconcur=numa

Linux compiler switches:
- Absoft: -m64 -mavx -O5 -speed_math=10 -march=core -xinteger
- Gfortran: -Ofast -mfpmath=sse -flto -march=native -funroll-loops -ftree-parallelize-loops=4
- Intel Fortran compiler: -fast -parallel -xCORE-AVX2 -nostandard-realloc-lhs
- PGI Fortran: -fast -Mipa=fast,inline -Msmartalloc -Mfprelaxed -Mstack_arrays -Mconcur=bind
- Open64: -march=auto -Ofast -mso -apo

Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. * Other brands and names are the property of their respective owners. Benchmark Source: Intel Corporation

Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #

5 What's Inside Intel Parallel Studio XE 2017

Editions:
- Composer Edition: from $699
- Professional Edition: from $1,699
- Cluster Edition: from $2,949

Components (grouped on the slide as Build, Analyze and Scale):
- Intel C++ Compiler
- Intel Fortran Compiler
- Intel Distribution for Python*
- Intel Math Kernel Library: fast math library
- Intel Integrated Performance Primitives: image, signal & data processing
- Intel Threading Building Blocks: threading library
- Intel Data Analytics Acceleration Library: machine learning & analytics
- Intel VTune Amplifier XE: performance profiler
- Intel Advisor: vectorization optimization and thread prototyping
- Intel Inspector: memory and thread debugging
- Intel MPI Library: message passing interface library
- Intel Trace Analyzer and Collector: MPI tuning and analysis
- Intel Cluster Checker: cluster diagnostic expert system
- Rogue Wave IMSL* Library (Fortran numerical analysis): bundle or add-on for Composer Edition; add-on for Professional and Cluster Editions

Additional configurations, including floating and academic, are available at:

6 Staying Current with Support for the Latest Standards, Operating Systems & Processors

Enhanced C11 and C++14 standards support:
- Sized deallocation
- Relaxed constexpr restrictions
- Variable templates
- Single quotation mark as a digit separator

Enhanced Fortran 2008 and draft 2015 standards support:
- Implied-shape PARAMETER arrays
- Fortran 2008 BIND(C) internal procedures
- Extended EXIT for all named blocks
- Pointer initialization

Operating systems:
- Windows* 7 thru 10, Windows Server
- Debian* 7.0, 8.0; Fedora* 23, 24; Red Hat Enterprise Linux* 6, 7; SuSE LINUX Enterprise Server* 11, 12; Ubuntu* LTS releases
- macOS*

Latest processors:
- Support and tuning added for the latest Intel Xeon Phi processor (codenamed Knights Landing) and AVX-512
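Three of the C++14 features listed on this slide can be shown in a few lines. The following is a minimal sketch, not taken from the slides: the names `pi`, `factorial`, `one_million` and `circle_area` are illustrative only. Sized deallocation is left out because demonstrating it means replacing the global `operator delete`.

```cpp
#include <cstdint>

// Variable template (C++14): one definition of pi for any arithmetic type.
template <typename T>
constexpr T pi = T(3.141592653589793238462643383279502884L);

// Relaxed constexpr (C++14): local variables and loops are allowed
// inside a constexpr function.
constexpr std::uint64_t factorial(unsigned n) {
    std::uint64_t result = 1;
    for (unsigned i = 2; i <= n; ++i) {
        result *= i;
    }
    return result;
}

// Digit separator (C++14): a single quotation mark groups digits.
constexpr std::uint64_t one_million = 1'000'000;

// The loop above is evaluated by the compiler, not at run time.
static_assert(factorial(10) == 3'628'800, "10! computed at compile time");

// Illustrative use of the variable template.
template <typename T>
constexpr T circle_area(T radius) {
    return pi<T> * radius * radius;
}
```

Any compiler in C++14 mode or later should accept this; nothing here is specific to the Intel compiler.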


8 Intel Compilers for Parallel Studio XE 2017: What's New in Intel C++ and Intel Fortran 17.0
Productive language-level vectorization & parallelism models for advanced developers driving application performance.

Common updates:
- Enhanced support for the newest AVX2 and AVX-512 instruction sets for the latest Intel processors (including Intel Xeon Phi)
- Enhanced optimization/vectorization reports, register allocation
- Tight integration with Intel Advisor
- Initial support for OpenMP* 4.5, offering improved vectorization control, new SIMD instructions, and much more

Intel C++ Compiler:
- SIMD Data Layout Template to facilitate vectorization for your C++ code
- Virtual function vectorization capability
- Improved compiler loop and function alignment
- Full support for the latest C11 and C++14 standards

Intel Fortran Compiler:
- Substantial coarray performance improvement: up to twice as fast as previous versions on non-trivial coarray Fortran programs
- Almost complete Fortran 2008 support
- Further interoperability with C (part of draft Fortran 2015)
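The slide does not show why data layout matters for vectorization, so here is a hand-written sketch of the underlying idea: converting an array of structures (AoS) into a structure of arrays (SoA) so that a SIMD loop reads each field with unit stride. This is not the SIMD Data Layout Template API; `PointAoS`, `PointsSoA`, `to_soa` and `squared_lengths` are hypothetical names chosen for illustration.

```cpp
#include <cstddef>
#include <vector>

// Array of structures: x, y, z of one point are adjacent, so a loop that
// touches only x walks memory with a stride of three floats.
struct PointAoS {
    float x, y, z;
};

// Structure of arrays: each coordinate is stored contiguously.
struct PointsSoA {
    std::vector<float> x, y, z;
};

// Hypothetical helper: copy AoS data into the SoA layout.
PointsSoA to_soa(const std::vector<PointAoS>& pts) {
    PointsSoA out;
    out.x.reserve(pts.size());
    out.y.reserve(pts.size());
    out.z.reserve(pts.size());
    for (const PointAoS& p : pts) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
    }
    return out;
}

// Squared length of every point. All loads and stores are unit-stride,
// which is the access pattern vector units handle best.
void squared_lengths(const PointsSoA& p, std::vector<float>& out) {
    const std::size_t n = p.x.size();
    out.resize(n);
    const float* x = p.x.data();
    const float* y = p.y.data();
    const float* z = p.z.data();
    float* r = out.data();
    // Ignored by compilers without OpenMP SIMD support enabled.
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        r[i] = x[i] * x[i] + y[i] * y[i] + z[i] * z[i];
    }
}
```

The design choice here is to pay for one conversion up front and keep the hot loop unit-stride. A layout library aims to give the same memory layout while letting the code keep an object-like interface.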


10 Impressive Performance Improvement: Intel Compiler OpenMP* Explicit Vectorization
- Two lines added that take full advantage of either SSE or AVX
- Pragmas are ignored by other compilers, so the code is portable

    #include <complex.h>
    #include <stdint.h>

    typedef float complex fcomplex;
    const uint32_t max_iter = 3000;

    /* Added line 1: request a SIMD version of mandel() */
    #pragma omp declare simd uniform(max_iter), simdlen(16)
    uint32_t mandel(fcomplex c, uint32_t max_iter) {
        uint32_t count = 1;
        fcomplex z = c;
        while ((cabsf(z) < 2.0f) && (count < max_iter)) {
            z = z * z + c;
            count++;
        }
        return count;
    }

    /* Indexed as count[y][x]: rows are ImageHeight, columns are ImageWidth */
    uint32_t count[ImageHeight][ImageWidth];
    ..
    for (int32_t y = 0; y < ImageHeight; ++y) {
        float c_im = max_imag - y * imag_factor;
        /* Added line 2: vectorize the inner loop across x */
        #pragma omp simd safelen(16)
        for (int32_t x = 0; x < ImageWidth; ++x) {
            fcomplex in_vals_tmp = (min_real + x * real_factor) + (c_im * 1.0iF);
            count[y][x] = mandel(in_vals_tmp, max_iter);
        }
    }

Mandelbrot calculation speedup (normalized performance data, higher is better):
- Serial: 1
- SSE 4.2: 2.48
- Core-AVX2: 4.27

Configuration: Intel Xeon CPU 3.50 GHz Haswell system (4 cores with Hyper-Threading On), running at 3.50GHz, with 32.0GB RAM, L1 Cache 256KB, L2 Cache 1.0MB, L3 Cache 8.0MB, 64-bit Windows* Server 2012 R2 Datacenter. Compiler options: SSE4.2: -O3 -Qopenmp -simd -QxSSE4.2; AVX2: -O3 -Qopenmp -simd -QxCORE-AVX2. For more information go to

11 Impressive Performance Improvement: Intel C++ Explicit Vectorization using OpenMP* SIMD

[Chart: SIMD speedup on Intel Xeon processor, normalized performance data (higher is better). Benchmarks: AoBench, Collision Detection, Grassshader, Mandelbrot, Libor, RTM-stencil, Geomean. Series: Serial (1.00 for every benchmark), SSE4.2, Core-AVX2.]

Configuration: Intel Xeon CPU 3.50 GHz Haswell system (4 cores with Hyper-Threading On), running at 3.50GHz, with 32.0GB RAM, L1 Cache 256KB, L2 Cache 1.0MB, L3 Cache 8.0MB, 64-bit Windows* Server 2012 R2 Datacenter. Compiler options: SSE4.2: -O3 -Qopenmp -simd -QxSSE4.2; AVX2: -O3 -Qopenmp -simd -QxCORE-AVX2. For more information go to
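The benchmark charts summarize several speedups as a single "geomean" figure. The sketch below shows one common way to compute a geometric mean of speedup ratios; it is not taken from the slides, and `geometric_mean` is a hypothetical helper name. Averaging in log space avoids overflow when many ratios are multiplied.

```cpp
#include <cmath>
#include <stdexcept>
#include <vector>

// Geometric mean of positive ratios (for example, per-benchmark speedups):
// exp of the arithmetic mean of the logarithms.
double geometric_mean(const std::vector<double>& ratios) {
    if (ratios.empty()) {
        throw std::invalid_argument("at least one ratio is required");
    }
    double log_sum = 0.0;
    for (double r : ratios) {
        if (r <= 0.0) {
            throw std::invalid_argument("ratios must be positive");
        }
        log_sum += std::log(r);
    }
    return std::exp(log_sum / static_cast<double>(ratios.size()));
}
```

For two made-up speedups of 2.0 and 8.0 the geometric mean is 4.0, whereas the arithmetic mean would be 5.0. The geometric mean is the usual choice for normalized results because it is not dominated by a single large ratio.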


13 Boost NumPy/SciPy Performance with Intel MKL: Intel Distribution for Python*

Easy access to high-performance Python:
- NumPy/SciPy/scikit-learn/pandas accelerated with Intel MKL
- Close to 100X performance speedups on select functions
- Includes Python-optimized modules for Intel TBB and Intel DAAL
- Includes numba, Cython, pydaal

Integrated distribution, out-of-the-box access to performance:
- Python 2.7 & 3.5
- Windows, Linux, macOS
- Latest optimizations for Intel Xeon and Intel Xeon Phi processors
- Available as a free standalone download, via conda*, and in Intel Parallel Studio XE

14 Close to 100X faster for select functions

15 Profile Python & Go using Intel VTune Amplifier, and Mixed Python / C++ / Fortran (New!)
- Low-overhead sampling: accurate performance data without high-overhead instrumentation; launch an application or attach to a running process
- Precise line-level details: no guessing, see source-line-level detail
- Mixed Python / native C, C++, Fortran: optimize native code driven by Python

16 Intel Math Kernel Library Intel Data Analytics Acceleration Library Intel Integrated Performance Primitives Intel Threading Building Blocks


18 Intel Math Kernel Library
Speeds math processing for machine learning, scientific, engineering, financial and design applications.
Energy; Science & Research; Engineering Design; Financial Analytics; Signal Processing; Digital Content Creation
- Includes functions for dense and sparse linear algebra (BLAS, LAPACK, PARDISO), FFTs, vector math, summary statistics and more
- De facto standard APIs for easy switching from other math libraries
- Highly optimized, threaded and vectorized to maximize processor performance
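To make the "de facto standard APIs" point concrete, here is a plain reference version of the operation that the BLAS DGEMM routine defines, C = alpha*A*B + beta*C. It is a sketch for illustration only, not MKL code: `reference_dgemm` is a hypothetical name, the matrices are assumed to be row-major with no transposition, and no blocking, threading or vectorization is attempted. An application linked against a CBLAS implementation would typically call `cblas_dgemm` for the same result.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical reference routine (not an MKL function).
// Computes C = alpha * A * B + beta * C for row-major matrices:
//   A is m x k, B is k x n, C is m x n.
void reference_dgemm(std::size_t m, std::size_t n, std::size_t k,
                     double alpha,
                     const std::vector<double>& a,
                     const std::vector<double>& b,
                     double beta,
                     std::vector<double>& c) {
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (std::size_t p = 0; p < k; ++p) {
                acc += a[i * k + p] * b[p * n + j];
            }
            c[i * n + j] = alpha * acc + beta * c[i * n + j];
        }
    }
}
```

Because the contract is standardized, a naive loop like this can serve as a correctness check while the production build links an optimized library.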

19 Components of Intel MKL 2017
- Linear Algebra: BLAS, LAPACK, ScaLAPACK, Sparse BLAS, Sparse Solvers (Iterative, PARDISO*, Cluster Sparse Solver)
- Fast Fourier Transforms: Multidimensional, FFTW interfaces, Cluster FFT
- Vector Math: Trigonometric, Hyperbolic, Exponential, Log, Power, Root, Vector RNGs
- Summary Statistics: Kurtosis, Variation coefficient, Order statistics, Min/max, Variance-covariance
- And More: Splines, Interpolation, Trust Region, Fast Poisson Solver
- Deep Neural Networks (New): Convolution, Pooling, Normalization, ReLU, Softmax

20 Performance Benefit to Applications: Intel MKL
Significant LAPACK performance boost using Intel Math Kernel Library versus ATLAS*

[Chart: DGETRF on Intel Xeon E Processor. X axis: matrix size; Y axis: performance (GFlops). Series: Intel MKL - 16 threads, Intel MKL - 8 threads, ATLAS - 16 threads, ATLAS - 8 threads. Intel MKL provides significant performance boost over ATLAS*.]

Configuration: Hardware: CPU: Dual Intel Xeon E5-2697v2@2.70Ghz; 64 GB RAM. Interconnect: Mellanox Technologies* MT27500 Family [ConnectX*-3] FDR. Software: RedHat* RHEL 6.2; OFED 3.5-2; Intel MPI Library 5.0; Intel MPI Benchmarks (default parameters; built with Intel C++ Compiler XE for Linux*).

The latest version of Intel MKL unleashes the performance benefits of Intel architectures.

21 What's New: Intel MKL 2017
- Optimized math functions to enable neural networks (CNN and DNN) for deep learning
- Improved ScaLAPACK performance for symmetric eigensolvers on HPC clusters
- New data fitting functions based on B-splines and monotonic splines
- Improved optimizations for newer Intel processors, especially the Intel Xeon Phi processor (Knights Landing)
- Extended TBB threading layer support for all BLAS level-1 functions


23 Intel DAAL Overview
Industry-leading performance C++/Java/Python library for machine learning and deep learning, optimized for Intel architectures.
Domains: Scientific/Engineering, Web/Social, Business
Stages: Pre-processing, Transformation, Analysis, Modeling, Validation, Decision Making
Algorithms: (De-)Compression; PCA; Statistical moments; Variance matrix; QR, SVD, Cholesky; Apriori; Linear regression; Naïve Bayes; SVM; Classifier boosting; K-means; EM GMM; Collaborative filtering; Neural Networks

24 Example Performance: Intel DAAL vs. Spark* MLlib

[Chart: PCA (correlation method) on an 8-node Hadoop* cluster based on Intel Xeon Processors E v. Y axis: speedup; X axis: table size (M x 200, 1M x 400, 1M x 600, 1M x 800, 1M x 1000). Bar labels as extracted: X, 6X, 6X, 7X, 7X.]

Configuration Info - Versions: Intel Data Analytics Acceleration Library 2016, CDH v5.3.1, Apache Spark* v1.2.0; Hardware: Intel Xeon Processor E v3, 2 Eighteen-core CPUs (45MB LLC, 2.3GHz), 128GB of RAM per node; Operating System: CentOS 6.6 x86_64.

25 What's New: Intel DAAL 2017
- Neural networks
- Python API (a.k.a. PyDAAL): easy installation through Anaconda or pip
- New data source connector for KDB+
- Open source project on GitHub
Fork me on GitHub:


27 Rich Feature Set for Parallelism: Intel Threading Building Blocks (Intel TBB)

Parallel algorithms and data structures:
- Generic Parallel Algorithms: an efficient, scalable way to exploit the power of multicore without having to start from scratch
- Flow Graph: a set of classes to express parallelism as a graph of compute dependencies and/or data flow
- Concurrent Containers: concurrent access, and a scalable alternative to containers that are externally locked for thread safety

Threads and synchronization:
- Synchronization Primitives: atomic operations, a variety of mutexes with different properties, condition variables
- Timers and Exceptions: thread-safe timers and exception classes
- Threads: OS API wrappers
- Thread Local Storage: efficient implementation for an unlimited number of thread-local variables

Memory allocation and task scheduling:
- Task Scheduler: sophisticated work-scheduling engine that empowers parallel algorithms and the flow graph
- Memory Allocation: scalable memory manager and false-sharing-free allocators

28 What's New: Intel Threading Building Blocks 2017
- static_partitioner class: helps minimize the overhead of parallel loops
- streaming_node class: enables heterogeneous streaming computations within the flow graph
- Added a method to isolate execution of a group of tasks or an algorithm from other tasks submitted to the scheduler
- A preview feature for a Python* module is added to replace Python's thread pool class
- Graph/stereo example is added
- Improvements to the graph/fgbzip example (added async_msg usage example)
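The idea behind a static partitioner can be sketched without the library: the iteration range is divided into near-equal contiguous chunks once, up front, so there is no further splitting or balancing cost while the loop runs. The code below is a standalone illustration, not TBB source; `Range`, `static_partition` and `chunked_sum` are hypothetical names, and the chunks are processed sequentially here where a real runtime would hand each one to a worker thread. With TBB itself, a partitioner object is passed as the last argument to `tbb::parallel_for`.

```cpp
#include <cstddef>
#include <vector>

// Half-open index range [begin, end).
struct Range {
    std::size_t begin;
    std::size_t end;
};

// Hypothetical helper: split [0, n) into `parts` contiguous chunks whose
// sizes differ by at most one. A request for zero parts is treated as one.
std::vector<Range> static_partition(std::size_t n, std::size_t parts) {
    if (parts == 0) {
        parts = 1;
    }
    std::vector<Range> chunks;
    chunks.reserve(parts);
    const std::size_t base = n / parts;
    const std::size_t extra = n % parts;  // first `extra` chunks get one more
    std::size_t begin = 0;
    for (std::size_t p = 0; p < parts; ++p) {
        const std::size_t len = base + (p < extra ? 1 : 0);
        chunks.push_back({begin, begin + len});
        begin += len;
    }
    return chunks;
}

// Sum a vector chunk by chunk. Each partial sum is independent, so every
// chunk could run on its own thread before the partials are combined.
double chunked_sum(const std::vector<double>& v, std::size_t parts) {
    double total = 0.0;
    for (const Range& r : static_partition(v.size(), parts)) {
        double partial = 0.0;
        for (std::size_t i = r.begin; i < r.end; ++i) {
            partial += v[i];
        }
        total += partial;
    }
    return total;
}
```

Fixed chunks suit loops whose iterations cost about the same; when the work per iteration varies widely, a partitioner that rebalances at run time is usually the safer choice.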


30 Intel IPP Domain Applications
- Image Processing: medical imaging, computer vision, digital surveillance, biometric identification, automated sorting, ADAS, visual search
- Signal Processing: games (sophisticated audio content or effects), echo cancellation, telecommunications, energy
- Data Compression & Cryptography: data centers, enterprise data management, ID verification, smart cards/wallets, electronic signature, information security / cybersecurity

31 What's New: Intel Integrated Performance Primitives 2017
- Extended optimization for Intel AVX-512 on KNL and Intel Xeon processors
- Intel IPP Platform-Aware APIs are added in the image and signal processing domains to support external threading and 64-bit data length
- Significantly improved performance of zlib compression functions
- Extension of IPP-optimized functionality in OpenCV
- Limited pre-silicon optimizations for KNH and CNL EP/XE server

32 Intel VTune Amplifier XE Performance Profiler Intel Inspector XE Memory & Thread Debugger Intel Advisor XE Vectorization Optimization and Thread Prototyping


34 Intel VTune Amplifier: Faster, Scalable Code, Faster

Get the data you need:
- Hotspot (statistical call tree), call counts (statistical)
- Thread profiling: concurrency and locks & waits analysis
- Cache miss, bandwidth analysis (1)
- GPU offload and OpenCL kernel tracing

Find answers fast:
- View results on the source / assembly
- OpenMP scalability analysis, graphical frame analysis
- Filter out extraneous data
- Organize data with viewpoints
- Visualize thread & task activity on the timeline

Easy to use:
- No special compiles: C, C++, C#, Fortran, Java, ASM
- Visual Studio* integration or standalone
- Graphical interface & command line
- Local & remote data collection
- Analyze Windows* & Linux* data on OS X* (2)

(1) Events vary by processor. (2) No data collection on OS X*.

[Screenshots: Quickly Find Tuning Opportunities; See Results on the Source Code; Tune OpenMP Scalability; Visualize & Filter Data]

35 New for 2017! Python, FLOPS, Storage & More: Intel VTune Amplifier Performance Profiler
- New! Profile Python and mixed Python / C++ / Fortran
- Tune Intel Xeon Phi (Knights Landing) processors
- Quickly see 3 keys to HPC performance
- Optimize memory access
- Storage analysis: I/O bound or CPU bound?
- Enhanced OpenCL & GPU profiling
- Easier remote and command-line usage
- Add custom counters to the timeline
- Preview: Application & Storage Performance Snapshots
- Intel Advisor: optimize vectorization for AVX-512 (with or without hardware)

36 Intel VTune Amplifier Tunes Knights Landing Processors (New!)
4 critical optimizations for Intel Xeon Phi processors:
1) High bandwidth memory
   - Decide which data structures to place in MCDRAM
   - See performance problems by memory hierarchy
   - Measure DRAM and MCDRAM bandwidth
2) Scalability of MPI and OpenMP
   - Serial vs. parallel time
   - Imbalance, overhead cost, parallel loop parameters
3) Microarchitecture efficiency
   - See the efficiency of your code in the core pipeline
   - Zero in on details with custom PMU events
4) Vectorization efficiency
   - Use Intel Advisor
   - Optimize for AVX-512 with or without AVX-512 hardware

37 Optimize Memory Access: Memory Access Analysis, Intel VTune Amplifier 2017 (Improved!)

Tune data structures for performance:
- Attribute cache misses to data structures (not just the code causing the miss)
- Support for custom memory allocators

Optimize NUMA latency & scalability:
- True & false sharing optimization
- Auto-detect max system bandwidth
- Easier tuning of inter-socket bandwidth

Easier install, latest processors:
- No special drivers required on Linux*
- Intel Xeon Phi processor MCDRAM (high bandwidth memory) analysis
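False sharing, one of the problems this analysis is meant to expose, happens when threads update different variables that sit in the same cache line. A common remedy is to pad or align each per-thread variable to its own line. The sketch below shows only the layout side of that fix; it assumes a 64-byte cache line, and `PackedCounter`, `PaddedCounter` and `same_cache_line` are hypothetical names, not part of any Intel tool.

```cpp
#include <cstddef>
#include <cstdint>

// Assumption: 64-byte cache lines, typical for current x86 processors.
constexpr std::size_t kCacheLine = 64;

// Packed layout: eight of these fit in one cache line, so counters owned
// by different threads can end up sharing a line.
struct PackedCounter {
    std::uint64_t value;
};

// Padded layout: alignas rounds both alignment and size up to a full
// cache line, so neighbouring array elements never share one.
struct alignas(kCacheLine) PaddedCounter {
    std::uint64_t value;
};

static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each padded counter occupies exactly one cache line");

// Hypothetical helper: true when both addresses fall in the same line.
bool same_cache_line(const void* a, const void* b) {
    const auto pa = reinterpret_cast<std::uintptr_t>(a);
    const auto pb = reinterpret_cast<std::uintptr_t>(b);
    return (pa / kCacheLine) == (pb / kCacheLine);
}
```

The padded layout trades memory for isolation: each counter costs 64 bytes instead of 8, which is usually acceptable for a small number of per-thread counters that are updated frequently.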

38 Storage Device Analysis (HDD, SATA or NVMe SSD) Intel VTune Amplifier Are You I/O Bound or CPU Bound? Explore imbalance between I/O operations (async & sync) and compute Storage accesses mapped to the source code See when CPU is waiting for I/O Measure bus bandwidth to storage New! Sliders set thresholds for I/O Queue Depth Slow task with I/O Wait Latency analysis Tune storage accesses with latency histogram Distribution of I/O over multiple devices 38

39 Intel Performance Snapshots Three Fast Ways to Discover Untapped Performance Is your application making good use of modern computer hardware? Run a test case during your coffee break. A high-level summary shows which apps can benefit most from code modernization and faster storage. Pick a Performance Snapshot: Application, for non-MPI apps; MPI, for MPI apps; Storage, for systems (servers and workstations) with directly attached storage. Free download: Also included with Intel Parallel Studio and Intel VTune Amplifier products. New! New! 39


41 Find & Debug Memory & Threading Errors Intel Inspector Memory & Thread Debugger Correctness Tools Increase ROI By 12%-21% 1 Errors found earlier are less expensive to fix Several studies, ROI% varies, but earlier is cheaper Diagnosing Some Errors Can Take Months Races & deadlocks not easily reproduced Memory errors hard to find without a tool Debugger Integration Speeds Diagnosis Breakpoint set just before the problem Examine variables & threads with the debugger Diagnose in hours instead of months 1 Cost Factors Square Project Analysis CERT: U.S. Computer Emergency Readiness Team, and Carnegie Mellon CyLab NIST: National Institute of Standards & Technology : Square Project Results Debugger Breakpoints Part of Intel Parallel Studio Professional For Windows* and Linux* From $1,599 Intel Inspector dramatically sped up our ability to track down difficult to isolate threading errors before our packages are released to the field. Peter von Kaenel, Director, Software Development, Harmonic Inc. 41

42 New for 2017! New Processors, New C++ Language Features Intel Inspector 2017 Memory and Thread Debugger New C++ Language Features Full C++11 support including std::mutex and std::atomic Easier Identification of Threading Bugs The name of the variable causing the error is shown (global, static & stack) in addition to the code lines Run Native on Intel Xeon Phi Processors This simplifies workflow for Intel Xeon Phi processor development Tip: Reduce thread count to 30 for best KNL performance while running Intel Inspector New! 42


44 Get Faster Code Faster! Intel Advisor Thread Prototyping Have you: Threaded an app, but seen little benefit? Hit a scalability barrier? Delayed release due to sync. errors? Data Driven Threading Design: Quickly prototype multiple options Project scaling on larger systems Find synchronization errors before implementing threading Design without disrupting development Add Parallelism with Less Effort, Less Risk and More Impact Intel Advisor has allowed us to quickly prototype ideas for parallelism, saving developer time and effort Simon Hammond Senior Technical Staff Sandia National Laboratories 44

45 Faster Code Faster with Data Driven Design Intel Advisor Vectorization Optimization and Thread Prototyping Faster Vectorization Optimization: Vectorize where it will pay off most Quickly ID what is blocking vectorization Tips for effective vectorization Safely force compiler vectorization Optimize memory stride Breakthrough for Threading Design: Quickly prototype multiple options Project scaling on larger systems Find synchronization errors before implementing threading Design without disrupting development Less Effort, Less Risk and More Impact Part of Intel Parallel Studio for Windows* and Linux* 45

46 New! New for 2017! AVX-512, FLOPS, & More Intel Advisor Vectorization Optimization Next Gen Intel Xeon Phi Support Tune for AVX-512 with or without AVX-512 hardware Precise FLOPS calculation Enhanced Memory Access Analysis Easier Selection of High Impact Loops Batch Mode Workflow Saves Time Fast Answers with Loop Analytics 46

47 Intel MPI Library Intel Trace Analyzer and Collector

48 Intel MPI Library Overview Optimized MPI application performance Application-specific tuning Automatic tuning New! Support for Intel Xeon Phi Processor (code named Knights Landing) New! Support for Intel Omni-Path Architecture Fabric Lower latency and multi-vendor interoperability Industry leading latency Performance optimized support for the fabric capabilities through OpenFabrics* Interfaces (OFI) Faster MPI communication Optimized collectives Sustainable scalability up to 340K cores Native InfiniBand* interface support allows for lower latencies, higher bandwidth, and reduced memory requirements More robust MPI applications Seamless interoperability with Intel Trace Analyzer and Collector [Diagram: applications (CFD, Crash, Climate, QCD, BIO, Other...) run on the Intel MPI Library, which selects the interconnect fabric at runtime (Shared Memory, TCP/IP, Omni-Path, InfiniBand, iWARP, other networks) on the cluster. Develop applications for one fabric; select interconnect fabric at runtime; achieve optimized MPI performance. One MPI Library to develop, maintain & test for multiple fabrics.] 48

49 What's New: Intel MPI Library 2017 Ready for Intel Xeon Phi Processors (code named Knights Landing (KNL)) Ready for Intel Omni-Path Architecture fabric Usage of specially optimized memcpy for KNL Tuning of shared memory collectives on single KNL nodes General optimization of RMA General optimization and speedup of startup time and the MPI tune utility 49

50 Intel Trace Analyzer and Collector Overview Intel Trace Analyzer and Collector helps the developer: Visualize and understand parallel application behavior Evaluate profiling statistics and load balancing Identify communication hotspots Features Event-based approach Low overhead Excellent scalability Powerful aggregation and filtering functions Idealizer Automatically detect performance issues and their impact on runtime 50

51 MPI Performance Snapshot Scalable profiling for MPI and Hybrid Lightweight Low overhead profiling for 100K+ Ranks Scalability- Performance variation at scale can be detected sooner Identifying Key Metrics Shows MPI/OpenMP imbalances 51

52 What's New: Intel Trace Analyzer and Collector Intel Trace Analyzer and Collector will be ready for KNL Improved scalability of imbalance profiler by up to 10x Improved MPI Snapshot feature HTML output 52

53 Additional Material Product page overview, features, FAQs, support Training materials movies, tech briefs, documentation Evaluation guides step by step walk through Reviews Additional Development Products: Intel Software Development Products For more detail on each component of Parallel Studio XE, visit Inside Blue. 53


55 Enhanced application performance with AVX-512 support Enhanced performance due to AVX-512 instructions taking advantage of FMA units, memcpy, new pre-fetch instructions, new transcendental instructions, MCDRAM, and increased number of cores. 55

56 Enhanced application performance with AVX-512 support Key functionality / library domain KNL features used to deliver enhanced performance (instructions, other) *GEMMs/BLAS MP Linpack LU/Cholesky/QR/LAPACK/SMP Linpack Two FMA units + 2 instruction decoders are key AVX512 FMA (vfmadd231ps or vfmadd231pd) Same as in BLAS (as main LAPACK kernel is ?*gemm) + greater core count Prefetcht0 instruction MCDRAM Intel Math Kernel Library Intel Integrated Performance Primitives Intel Data Analytics Acceleration Library 2D and 3D FFTs DNN Sparse Vector Statistics Vector Math All from Signal Processing (1D) and up to Image (2D) and Volume (3D) processing Two FMA units + 2 instruction decoders MCDRAM, tile-to-tile mesh Two FMA units + 2 instruction decoders MCDRAM, tile-to-tile mesh AVX512 FMA Two FMA units + 2 instruction decoders MCDRAM AVX512 FMA Similar to BLAS/LAPACK, greater number of cores AVX512 FMA Two FMA units + 2 instruction decoders Large number of cores for MT performance AVX512 FMA Prefetcht1 instruction Prefetcht0, prefetcht1 instruction Masking support Large core count Prefetcht1 instruction Depend on seq. BLAS level 3 Knights Landing improvement New Transcendental Support Instructions: VGETEXP, VGETMANT, VRNDSCALE, VSCALEF, VFIXUPIMM, VRCP28, VRSQRT28, VEXP2 The main advantage inherited from LRB/KNC is support of mask registers and therefore support of predicates for all new instructions. Next, full 512-bit register palign support (no lane restrictions as with the old AVX palign): _mm512_alignr_epi32, _mm512_alignr_epi64. Next, on-the-fly integer conversions: vpmovq{w b d}, vpmovq{w b}. And last, integer any-direction comparison: vpcmp{d q} and vpcmpu{d q}. Similar to BLAS/LAPACK, greater number of cores Intel MPI Library Used compiler's AVX-512 version of memcpy (but w/ fix, failed CQ on ICC) Build IMPI w/ -fvisibility=hidden (make all symbols hidden by default and export only those needed externally). Addressed KNL micro-arch features, such as short BTB, by reducing access to PLT/GOT Reduced/simplified critical path where possible. Addressed KNL front-end specifics. 56

57 Easy access to Parallel Studio XE Runtimes For Amazon Web Services users only Intel Parallel Studio XE Runtime Required to be able to run applications built with the Intel Performance Libraries or Intel Compilers. Includes latest optimizations for Intel Architecture for faster application performance Linux Only Easy access for Amazon Web Services users at no cost Latest runtimes through Linux native repos YUM repo available now! ( 57

58 Educating with Webinar series about 2017 tools Expert talks about the new features Series of live webinars, Sept 13 to Nov 8, 2016 Attend live, or watch after the fact. 58

59 Educating with High Performance Programming Book Knights Landing specific details, programming advice and real world examples. Intel Xeon Phi Processor High Performance Programming Techniques to generally increase program performance on any system and prepare you better for Intel Xeon Phi processors. Available as of June 2016 I believe you will find this book is an invaluable reference to help develop your own Unfair Advantage James A. Manager Sandia National Laboratories 59

60 More education with software.intel.com/moderncode Online community growing collection of tools, trainings, support features Black Belts in parallelism from Intel & industry Intel HPC Developer Conferences developers share proven techniques and best practices hpcdevcon.intel.com Hands on training for developers and partners with remote access to Intel Xeon processor and Xeon Phi coprocessor-based clusters. software.intel.com/icmp Developer Access Program provides early access to Intel Xeon Phi processor codenamed Knights Landing + 1 year license for Intel Parallel Studio XE Cluster Edition. 60

61 Choices to Fit Needs Intel Tools All Products with support worldwide, for purchase. Intel Premier Support - private direct support from Intel support for past versions software.intel.com/products Most Products without Premier support via special programs for those who qualify students, educators, classroom use, open source developers, and academic researchers software.intel.com/qualify-for-free-software Community support only all tools: Students, Educators, classroom use, Open Source Developers, Academic Researchers (qualification required) Intel Performance Libraries without Premier support -Community licensing for Intel performance libraries no royalties, no restrictions based on company or project size software.intel.com/nest Community support only Intel Performance Libraries: Community Licensing (no qualification required) 61

62 What's New Intel C++ Compiler SIMD Data Layout Templates to facilitate vectorization for your C++ code Enhanced C11 and C++14 language standards support: sized deallocation, relaxed constexpr restrictions, variable templates, single quotation mark as a digit separator Enhanced GNU* and Microsoft* compatibility SSE Cast Support Diagnostic improvements on template argument 62

63 What's New Intel Fortran Compiler Substantial Coarray Fortran performance improvement on non-trivial programs Almost complete Fortran 2008 support Enhanced Fortran 2008 and draft Fortran 2015 language standards support implied-shape PARAMETER arrays 2008 bind C internal procedures extended EXIT for all named blocks pointer initialization VS2013 Shell replaces VS2010 Shell on Windows 63

64 Impressive Performance Improvement Intel Compiler OpenMP* Explicit Vectorization
Three lines added that take full advantage of both SSE and AVX
Pragmas are ignored by other compilers, so the code is portable

    #pragma omp declare simd linear(z:40) uniform(L, N, Nmat) linear(k)
    float path_calc(float *z, float L[][VLEN], int k, int N, int Nmat)

    #pragma omp declare simd uniform(L, N, Nopt, Nmat) linear(k)
    float portfolio(float L[][VLEN], int k, int N, int Nopt, int Nmat)

    for (path=0; path<NPATH; path+=VLEN) {
      /* Initialise forward rates */
      z = z0 + path * Nmat;
      #pragma omp simd linear(z:Nmat)
      for(int k=0; k < VLEN; k++) {
        for(i=0;i<N;i++) {
          L[i][k] = L0[i];
        }
        /* LIBOR path calculation */
        float temp = path_calc(z, L, k, N, Nmat);
        v[k+path] = portfolio(L, k, N, Nopt, Nmat);
        /* move pointer to start of next block */
        z += Nmat;
      }
    }

LIBOR calculation speedup (normalized performance data, higher is better): Serial 1.00, SSE 4.2 3.51, Core-AVX2 6.61
Configuration: Intel Xeon CPU 3.50 GHz Haswell system (4 cores with Hyper-Threading On), running at 3.50GHz, with 32.0GB RAM, L1 Cache 256KB, L2 Cache 1.0MB, L3 Cache 8.0MB, 64-bit Windows* Server 2012 R2 Datacenter. Compiler options: SSE4.2: -O3 -Qopenmp -simd -QxSSE4.2; AVX2: -O3 -Qopenmp -simd -QxCORE-AVX2. For more information go to
* Other brands and names are the property of their respective owners.
Benchmark Source: Intel Corporation Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #

65 Impressive Performance Improvement Intel C++ Explicit Vectorization: SIMD Performance
One line added that takes full advantage of both SSE and AVX
Pragmas are ignored by other compilers, so the code is portable

    #pragma simd vectorlength(8)
    for (int x = x0; x < x1; ++x) {
      float div = coef[0] * A_cur[x]
        + coef[1] * ((A_cur[x + 1] + A_cur[x - 1])
                   + (A_cur[x + Nx] + A_cur[x - Nx])
                   + (A_cur[x + Nxy] + A_cur[x - Nxy]))
        + coef[2] * ((A_cur[x + 2] + A_cur[x - 2])
                   + (A_cur[x + sx2] + A_cur[x - sx2])
                   + (A_cur[x + sxy2] + A_cur[x - sxy2]))
        + coef[3] * ((A_cur[x + 3] + A_cur[x - 3])
                   + (A_cur[x + sx3] + A_cur[x - sx3])
                   + (A_cur[x + sxy3] + A_cur[x - sxy3]))
        + coef[4] * ((A_cur[x + 4] + A_cur[x - 4])
                   + (A_cur[x + sx4] + A_cur[x - sx4])
                   + (A_cur[x + sxy4] + A_cur[x - sxy4]));
      A_next[x] = 2 * A_cur[x] - A_next[x] + vsq[s+x] * div;
    }

RTM-stencil calculation speedup (normalized performance data, higher is better): Serial 1.00, SSE 4.2 3.91, Core-AVX2 6.06
Configuration: Intel Xeon CPU 3.50 GHz Haswell system (4 cores with Hyper-Threading On), running at 3.50GHz, with 32.0GB RAM, L1 Cache 256KB, L2 Cache 1.0MB, L3 Cache 8.0MB, 64-bit Windows* Server 2012 R2 Datacenter. Compiler options: SSE4.2: -O3 -Qopenmp -simd -QxSSE4.2; AVX2: -O3 -Qopenmp -simd -QxCORE-AVX2. For more information go to
* Other brands and names are the property of their respective owners.

66 Substantial Coarray Fortran performance improvement on non-trivial programs
[Chart: runtime performance relative to Intel Fortran 16.0 (= 1.00), higher is better. Benchmarks: University of Edinburgh EPCC microbenchmarks; University of Houston NAS Parallel benchmarks, coarray kernels, and coarray microbenchmarks. Intel Fortran 17.0 bar values: 3.70, 1.40, 1.23, 1.01.]
Intel Confidential: Must be viewed under CNDA. All products, systems, dates and figures are preliminary based on current expectations, and are subject to change without notice.
Configuration: Windows hardware: HP DL320e Gen8 v2 (single-socket server) with Intel(R) Xeon(R) CPU E GHz, 32 GB RAM, HyperThreading is off; Linux hardware: HP BL460c Gen9 with Intel(R) Xeon(R) CPU E GHz, 256 GB RAM, HyperThreading is on. Software: Intel C++ compiler 16.0, Microsoft (R) C/C++ Optimizing Compiler Version for x86/x64, GCC Linux OS: Red Hat Enterprise Linux Server release 7.1 (Maipo), kernel el7.x86_64. Windows OS: Windows 8.1. SPEC* Benchmark (
* Other brands and names are the property of their respective owners. Benchmark Source: Intel Corporation 66

67 SIMD Data Layout Template - Improve productivity and boost C++ performance Quickly convert Array of Structures to Structure of Arrays representation. Increase productivity: Use predefined templates with minimal effort, and let SDLT do the vectorization for you. Improve performance: SDLT vectorizes your code by making memory access contiguous, which can lead to more efficient code and better performance. Seamless integration: SDLT follows the familiar Intel vector programming model. "We used SDLT to vectorize the deformer code in Premo, the in-house animation tool for DreamWorks Animation. The performance improvements we were able to achieve were dramatic, and these improvements will translate directly into higher quality characters that will be seen on-screen in future movies. Also the library itself was easy to use and integrate into our existing codebase." Martin Watt, Principal Engineer, DreamWorks Animation 67


69 Python Landscape Adoption of Python continues to grow among domain specialists and developers for its productivity benefits Challenge #1: Domain specialists are not professional software programmers. Challenge #2: Python performance limits migration to production systems Intel's solution is to: accelerate Python performance, enable easy access, empower the community

70 Access multiple options for faster Python Included in Intel Distribution for Python* Accelerate with native libraries: NumPy, SciPy, Scikit-Learn, Theano, Pandas, pydaal; Intel MKL, Intel DAAL Exploit vectorization and threading: Cython + Intel C++ compiler, Numba + Intel LLVM Better/Composable threading: Cython, Numba, Pyston; threading composability for MKL, CPython, Blaze/Dask, Numba Multi-node parallelism: Mpi4Py, Distarray; Intel native libraries: Intel MPI Integration with Big Data, ML platforms and frameworks: Spark, Hadoop, Trusted Analytics Platform Better performance profiling: extensions for profiling mixed Python & native/JIT codes "I expected Intel's numpy to be fast but it is significant that plain old python code is much faster with the Intel version too." Dr. Donald Kinghorn, Puget Systems Review

71 Intel Distribution for Python* Reviews "Intel's Python distribution provides a major math boost" The still-in-beta Python distribution uses Math Kernel Library to speed up processing on Intel hardware The distribution's main touted advantage is speed -- but not a PyPy-style general speedup via a JIT. Instead, the MKL speeds up certain math operations so that they run faster on one thread and multiple threads. "I expected Intel's numpy to be fast but it is significant that plain old python code is much faster with the Intel version too." Dr. Donald Kinghorn, Puget Systems Review "HPC Podcast Looks at Intel's Pending Distribution of Python" Yes, Intel is doing their own Python build! It is still in beta but I think it's a great idea..yeah, it's important!

72 Automatic Performance Scaling Back Up from the Core, to Multicore, to Many Core and Beyond Intel MKL Extracting performance from the computing resources Core: vectorization, prefetching, cache utilization Multi-Many core (processor/socket) level parallelization Multi-socket (node) level parallelization Clusters scaling Sequential Intel MKL MKL + OpenMP Many Core Intel Xeon Phi TM Coprocessor MKL + Intel MPI 72

73 Big Data & Machine Learning Challenge Volume Value Velocity Variety Problem: Big data needs high performance computing. Many big data applications leave performance on the table because they are not optimized for the underlying hardware. Solution: A performance library provides building blocks that are easily integrated into big data analytics workflows.

74 Intel Data Analytics Acceleration Library (Intel DAAL) An Intel-optimized library that provides building blocks for all data analytics stages, from data preparation to data mining & machine learning Python, Java & C++ APIs Can be used with many platforms (Hadoop*, Spark*, R*, Matlab*, ...) but not tied to any of them Flexible interface to connect to different data sources (CSV, SQL, HDFS, ...) Windows*, Linux*, and OS X* Developed by the same team as the industry-leading Intel Math Kernel Library Open source, free community-supported and commercial premium-supported options Also included in Parallel Studio XE suites 74

75 Intel Threading Building Blocks Good Tuning Data Gets Good Results "Using Intel TBB's new flow graph feature, we accomplished what was previously not possible, parallelize a very sizable task graph with thousands of interrelationships, all in about a week." Robert Link, GCAM Project Scientist, Pacific Northwest National Lab "Intel's TBB was an invaluable help in multithreading our in-house renderer CGIStudio and is now also used in animation and simulation software. Beside the ease of use, it takes care of the two most important aspects of running an application on multiple cores -- load balancing and scalability." Maurice van Swaaji, Blue Sky Studios "Intel TBB provided us with optimized code that we did not have to develop or maintain for critical system services. I could assign my developers to code what we bring to the software table." Michaël Rouillé, CTO, Golaem More Case Studies 75

76 Intel Threading Building Blocks (Intel TBB) C++ template library to simplify the task of adding parallelism on a single device or across multiple devices Specify tasks instead of manipulating threads Intel TBB maps your logical tasks onto threads with full support for nested parallelism Targets threading for scalable performance Uses proven, efficient parallel patterns Uses work stealing to balance load across tasks whose execution time is not known in advance. It has the advantage of low-overhead polymorphism. Flow graph feature allows developers to easily express dependency and data flow graphs "Using Intel TBB's new flow graph feature, we accomplished what was previously not possible, parallelize a very sizable task graph with thousands of interrelationships, all in about a week." Robert Link, GCAM Project Scientist, Pacific Northwest National Lab Has high-level parallel algorithms and concurrent containers, and low-level building blocks like a scalable memory allocator, locks and atomic operations. Commercial support for Intel Atom, Core, Xeon processors, and for Intel Xeon Phi processors and coprocessors More Case Studies 76

77 Resources and Availability Intel Threading Building Blocks (Intel TBB) Resources Commercial product page: software.intel.com/intel-tbb Flow Graph Designer: software.intel.com/articles/flow-graph-designer User Forum: software.intel.com/forums/intel-threading-building-blocks Available on Linux, Windows, macOS and Android Commercially available with Intel Parallel Studio XE 2017: software.intel.com/en-us/intel-parallel-studio-xe Community licensing for Intel Performance Libraries without Premier support: software.intel.com/nest The Open-Source Community Site: 77

78 Challenges faced by developers Performance optimization is a never-ending task. Completing key processing tasks within designated time constraints is a critical issue. Hand-optimizing code for one platform can make it perform worse on another platform. With manual optimization, code becomes more complex and difficult to maintain. Code should run as fast as possible without extra effort. 78

79 Different Domains in Intel IPP Image Processing Signal Processing Data Compression Computer Vision Cryptography Color Conversion Vector Math String Processing Image Domain Signal Domain Data Domain 79

80 Intel Integrated Performance Primitives Building Blocks for Image, Signal & Data Processing Provides developers with ready-to-use functions to accelerate image, signal, data processing & cryptography computation tasks. Optimized for Intel Atom, Core, Xeon processors, and for Intel Xeon Phi processors and coprocessors. Licensed versions available on Linux, Windows, macOS, Android Available as a part of: Intel Parallel Studio XE software.intel.com/en-us/intel-parallel-studio-xe Community Licensing for Intel Performance Libraries without Premier support: software.intel.com/nest 80

81 Correctness Tools Increase ROI By 12%-21% Cost Factors Square Project Analysis CERT: U.S. Computer Emergency Readiness Team, and Carnegie Mellon CyLab NIST: National Institute of Standards & Technology : Square Project Results Size and complexity of applications is growing Correctness tools find defects during development prior to shipment Reworking defects is 40%-50% of total project effort Reduce time, effort, and cost to repair Find errors earlier when they are less expensive to fix 81

82 Race Conditions Are Difficult to Diagnose They only occur occasionally and are difficult to reproduce

Correct
Thread 1       Thread 2       Shared Counter
Read count                    0
Increment                     0
Write count                   1
               Read count     1
               Increment      1
               Write count    2

Incorrect
Thread 1       Thread 2       Shared Counter
Read count                    0
               Read count     0
Increment                     0
               Increment      0
Write count                   1
               Write count    1
82

83 Debug Memory & Threading Errors Intel Inspector Find and eliminate errors Memory leaks, invalid access Races & deadlocks C, C++ and Fortran (or a mix) Simple, Reliable, Accurate No special recompiles Use any build, any compiler 1 Analyzes dynamically generated or linked code Inspects 3rd-party libraries without source Productive user interface + debugger integration Command line for automated regression analysis Clicking an error instantly displays source code snippets and the call stack Fits your existing process 1 That follows common OS standards. 83

84 Profile Python & Go! And Mixed Python / C++ / Fortran Intel VTune Amplifier New! Low Overhead Sampling Accurate performance data without high overhead instrumentation Launch application or attach to a running process Precise Line Level Details No guessing, see source line level detail Mixed Python / native C, C++, Fortran Optimize native code driven by Python 84

85 Three Keys to HPC Performance: Threading, Memory Access, Vectorization Intel VTune Amplifier New! Threading: CPU Utilization Serial vs. Parallel time Top OpenMP regions by potential gain Tip: Use hotspot OpenMP region analysis for more detail Memory Access Efficiency Stalls by memory hierarchy Bandwidth utilization Tip: Use Memory Access analysis Vectorization: FPU Utilization FLOPS estimates from sampling Tip: Use Intel Advisor for precise metrics and vectorization optimization For 3rd, 5th, 6th Generation Intel Core processors and second generation Intel Xeon Phi processor code named Knights Landing. 85

86 Application Performance Snapshot Discover opportunities for better performance with vectorization & threading Objectives Simple enough to run during a coffee break Highlight where code modernization can help Users Performance teams fast prioritization of which apps will benefit most All Developers size the potential performance gain from code modernization Non-Objectives Actionable tuning data that is another tool. Snapshot is just a fast health check. Free download: Also included with Intel Parallel Studio and Intel VTune Amplifier products. Preview! 86

87 Free download: Also included with Intel Parallel Studio Cluster Edition. 87

88 Storage Performance Snapshot Discover if faster storage can improve server/workstation performance Learn It On One Coffee Break Easy setup Quickly see meaningful data System view of workload Any architecture Targeted Systems Servers & workstations with directly attached storage Not scale out storage clusters Linux kernel 2.6 or newer dstat 0.7 or newer Windows Server 2012, Windows 8 or newer Windows OS Preview! Free download: Also included with Intel Parallel Studio and Intel VTune Amplifier products. 88

89 Get Faster Code Faster! Intel Advisor Vectorization Optimization. Have you: Recompiled for AVX2 with little gain? Wondered where to vectorize? Recoded intrinsics for a new architecture? Struggled with compiler reports? New! Data Driven Vectorization: What vectorization will pay off most? What's blocking vectorization? Why? Are my loops vector friendly? Will reorganizing data increase performance? Is it safe to just use pragma simd? "Intel Advisor's Vectorization Advisor permitted me to focus my work where it really mattered. When you have only a limited amount of time to spend on optimization, it is invaluable." Gilles Civario, Senior Software Architect, Irish Centre for High-End Computing
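One case where the answer to "Is it safe to just use pragma simd?" is no is a loop-carried dependency, where each iteration reads a value the previous iteration wrote. The sketch below is a pure-Python simulation of that effect under a simplified model (all loads in a chunk happen before any store). It does not reproduce what any particular compiler emits, and the function names and the chunk width of 4 are hypothetical.

```python
def scalar_prefix(a, b):
    """Reference loop: a[i] = a[i-1] + b[i], executed strictly in order."""
    out = list(a)
    for i in range(1, len(out)):
        out[i] = out[i - 1] + b[i]
    return out


def forced_simd_prefix(a, b, width=4):
    """Same loop run `width` iterations at a time, ignoring the dependency.

    Simplified model: every lane loads a[i-1] before any lane stores a[i].
    """
    out = list(a)
    for start in range(1, len(out), width):
        idx = range(start, min(start + width, len(out)))
        loaded = [out[i - 1] for i in idx]      # "vector load" of a[i-1]
        for lane, i in enumerate(idx):          # "vector add" and store
            out[i] = loaded[lane] + b[i]
    return out


a = [0] * 8
b = [1] * 8
print(scalar_prefix(a, b))       # [0, 1, 2, 3, 4, 5, 6, 7]
print(forced_simd_prefix(a, b))  # [0, 1, 1, 1, 1, 2, 1, 1]
```

The two results differ because lanes in the same chunk read stale values, which is the kind of hazard a dependency check is meant to catch before vectorization is forced.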

90 Next Gen Intel Xeon Phi Support. Vectorization Advisor runs on and optimizes for Intel Xeon Phi. AVX-512 ERI specific to Intel Xeon Phi. New! Efficiency (72%), Speed-up (11.5x), Vector Length (16). Performance optimization problem and advice on how to fix it.
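The three figures quoted on this slide are consistent with one another if efficiency is taken to be the estimated speed-up divided by the vector length. That definition is an assumption made for this sketch, and the function name `vector_efficiency` is hypothetical.

```python
def vector_efficiency(speedup: float, vector_length: int) -> float:
    """Fraction of the ideal speed-up (one per vector lane) actually achieved."""
    if vector_length <= 0:
        raise ValueError("vector_length must be positive")
    return speedup / vector_length


# Figures from the slide: 11.5x speed-up with 16 lanes.
efficiency = vector_efficiency(11.5, 16)
print(f"{efficiency:.0%}")  # 72%
```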

91 Precise Repeatable FLOPS Metrics. Intel Advisor Vectorization Optimization. New! FLOPS by loop and function. All recent Intel processors (not co-processors). Instrumentation (count FLOP) plus sampling (time with low overhead). Adjusted for masking with AVX-512 processors.
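The combination described on this slide, an operation count from instrumentation divided by a time from sampling, with masked-off lanes excluded, can be sketched as follows. This is an illustration of the arithmetic only, not of how Intel Advisor is implemented; the function names, the mask value, and the timing are all hypothetical.

```python
def masked_flop(mask: int, flop_per_lane: int = 1) -> int:
    """FLOP for one masked vector instruction: only lanes with a set mask bit count."""
    if mask < 0:
        raise ValueError("mask must be non-negative")
    return bin(mask).count("1") * flop_per_lane


def gflops(total_flop: int, seconds: float) -> float:
    """Counted floating-point operations divided by measured time, in GFLOPS."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return total_flop / seconds / 1e9


# Hypothetical loop: 1,000,000 fused multiply-adds (2 FLOP per lane) on
# 16-lane vectors, with only the low 12 lanes enabled by the mask.
mask = 0x0FFF
total = 1_000_000 * masked_flop(mask, flop_per_lane=2)
print(total)                  # 24000000
print(gflops(total, 0.004))   # roughly 6 GFLOPS for a 4 ms run
```

Counting all 16 lanes would report 32,000,000 FLOP for the same loop, so ignoring the mask would overstate the rate by a third in this example.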

92 Enhanced Memory Access Analysis. Intel Advisor. New! Are you bandwidth or compute limited? Measure Footprint: compare to cache size. Does it fit in cache? Variable References: map data to variable names for easier analysis. Gather/Scatter: detect unneeded gather/scatters that reduce performance.
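The "does it fit in cache?" question can be made concrete with a small footprint calculation. The sketch below counts distinct cache lines touched by a list of byte addresses and compares the total with a cache size. The 64-byte line, the 32 KiB cache, and the function names are assumptions for illustration, and this is not a description of how Intel Advisor measures footprint.

```python
def footprint_bytes(addresses, element_size: int, line_size: int = 64) -> int:
    """Bytes covered by the distinct cache lines that the accesses touch."""
    lines = set()
    for addr in addresses:
        first = addr // line_size
        last = (addr + element_size - 1) // line_size
        lines.update(range(first, last + 1))
    return len(lines) * line_size


def fits_in_cache(footprint: int, cache_bytes: int) -> bool:
    return footprint <= cache_bytes


CACHE = 32 * 1024   # hypothetical 32 KiB cache
N = 4096            # number of 8-byte elements read by the loop

unit_stride = footprint_bytes((i * 8 for i in range(N)), element_size=8)
stride_8 = footprint_bytes((i * 64 for i in range(N)), element_size=8)

print(unit_stride, fits_in_cache(unit_stride, CACHE))  # 32768 True
print(stride_8, fits_in_cache(stride_8, CACHE))        # 262144 False
```

Both loops read the same number of elements, but the strided one touches eight times as many cache lines, which is the sort of access pattern where reorganizing data can help.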

93 Start Tuning for AVX-512 without AVX-512 hardware. Intel Advisor - Vectorization Advisor. New! Use -axCOMMON-AVX512 -xAVX compiler flags to generate both code paths: AVX(2) code path (executed on Haswell and earlier processors); AVX-512 code path for newer hardware. Compare AVX and AVX-512 code with Intel Advisor: Inserts (AVX2) vs. Gathers (AVX-512); Speed-up estimate: 13.5x (AVX2) vs. 30.6x (AVX-512).

94 Faster Code Faster Using Intel Advisor. Vectorization: "Intel Advisor's Vectorization Advisor permitted me to focus my work where it really mattered. When you have only a limited amount of time to spend on optimization, it is invaluable." Gilles Civario, Senior Software Architect, Irish Centre for High-End Computing. "Intel Advisor's Vectorization Advisor fills a gap in code performance analysis. It can guide the informed user to better exploit the vector capabilities of modern processors and coprocessors." Dr. Luigi Iapichino, Scientific Computing Expert, Leibniz Supercomputing Centre. Threading: "Intel Advisor has been extremely helpful in identifying the best pieces of code for parallelization. We can save several days of manual work by targeting the right loops and we can use Advisor to find potential thread safety issues to help avoid problems later on." Carlos Boneti, HPC software engineer, Schlumberger. "Intel Advisor has allowed us to quickly prototype ideas for parallelism, saving developer time and effort, and has already been used to highlight subtle parallel correctness issues in complex multi-file, multi-function algorithms." Simon Hammond, Senior Technical Staff, Sandia National Laboratories. More Case Studies

95 Speaker: the speaker notes are important for this presentation. Be sure to read them.

96 Optimizing Performance On Parallel Hardware. It's an iterative process. Cluster Scalable? N: Tune MPI (ignore if you are not targeting clusters). Y: Effective threading? N: Thread. Y: Memory Bandwidth Sensitive? Y: Optimize Bandwidth. N: Vectorize.
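One reading of the flowchart on this slide can be written as a small decision function. The branch assignments are an interpretation of the flattened diagram (not scalable leads to MPI tuning, ineffective threading leads to threading work, bandwidth sensitivity leads to bandwidth optimization, otherwise vectorize), and the function and parameter names are hypothetical.

```python
def next_tuning_step(*, cluster_scalable: bool, effective_threading: bool,
                     bandwidth_sensitive: bool, targets_cluster: bool = True) -> str:
    """Return the next step suggested by one reading of the tuning flowchart."""
    # The cluster question is skipped when the application does not target clusters.
    if targets_cluster and not cluster_scalable:
        return "Tune MPI"
    if not effective_threading:
        return "Thread"
    if bandwidth_sensitive:
        return "Optimize Bandwidth"
    return "Vectorize"


# Single-node application that already threads well and is not bandwidth bound.
print(next_tuning_step(cluster_scalable=False, effective_threading=True,
                       bandwidth_sensitive=False, targets_cluster=False))  # Vectorize
```

Because the process is iterative, the function would be called again after each change, with the answers refreshed from new measurements.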

97 Performance Analysis Tools for Diagnosis. Intel Parallel Studio XE. Cluster Scalable? N: Tune MPI, using Intel Trace Analyzer & Collector (ITAC), Intel MPI Snapshot, Intel MPI Tuner. Y: Effective threading? N: Thread, using Intel VTune Amplifier. Y: Memory Bandwidth Sensitive? Y: Optimize Bandwidth, using Intel VTune Amplifier. N: Vectorize, using Intel Advisor.

98 Tools for High Performance Implementation. Intel Parallel Studio XE. Cluster Scalable? N: Tune MPI, using Intel MPI Library, Intel MPI Benchmarks. Y: Effective threading? N: Thread. Y: Memory Bandwidth Sensitive? Y: Optimize Bandwidth. N: Vectorize. Tools for these steps: Intel Compiler; Intel Math Kernel Library; Intel IPP (Media & Data Library); Intel Data Analytics Library; Intel Cilk Plus; Intel OpenMP*; Intel TBB (Threading Library).

99 Legal Disclaimer & Optimization Notice. INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #


Intel Parallel Studio XE 2015

Intel Parallel Studio XE 2015 2015 Create faster code faster with this comprehensive parallel software development suite. Faster code: Boost applications performance that scales on today s and next-gen processors Create code faster:

More information

Faster Code. Faster. Intel Parallel Studio XE Unleash the Beast

Faster Code. Faster. Intel Parallel Studio XE Unleash the Beast Faster Code. Faster Intel Parallel Studio XE 2017 Unleash the Beast Create Faster Code Faster Intel Parallel Studio XE Design, build, verify, and tune C++, C, Fortran*, Python* and Java* Standards-driven

More information

Intel Performance Libraries

Intel Performance Libraries Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation

More information

Intel Distribution for Python* и Intel Performance Libraries

Intel Distribution for Python* и Intel Performance Libraries Intel Distribution for Python* и Intel Performance Libraries 1 Motivation * L.Prechelt, An empirical comparison of seven programming languages, IEEE Computer, 2000, Vol. 33, Issue 10, pp. 23-29 ** RedMonk

More information

Jackson Marusarz Software Technical Consulting Engineer

Jackson Marusarz Software Technical Consulting Engineer Jackson Marusarz Software Technical Consulting Engineer What Will Be Covered Overview Memory/Thread analysis New Features Deep dive into debugger integrations Demo Call to action 2 Analysis Tools for Diagnosis

More information

Performance Profiler. Klaus-Dieter Oertel Intel-SSG-DPD IT4I HPC Workshop, Ostrava,

Performance Profiler. Klaus-Dieter Oertel Intel-SSG-DPD IT4I HPC Workshop, Ostrava, Performance Profiler Klaus-Dieter Oertel Intel-SSG-DPD IT4I HPC Workshop, Ostrava, 08-09-2016 Faster, Scalable Code, Faster Intel VTune Amplifier Performance Profiler Get Faster Code Faster With Accurate

More information

Sergey Maidanov. Software Engineering Manager for Intel Distribution for Python*

Sergey Maidanov. Software Engineering Manager for Intel Distribution for Python* Sergey Maidanov Software Engineering Manager for Intel Distribution for Python* Introduction Python is among the most popular programming languages Especially for prototyping But very limited use in production

More information

Memory & Thread Debugger

Memory & Thread Debugger Memory & Thread Debugger Here is What Will Be Covered Overview Memory/Thread analysis New Features Deep dive into debugger integrations Demo Call to action Intel Confidential 2 Analysis Tools for Diagnosis

More information

Scaling Out Python* To HPC and Big Data

Scaling Out Python* To HPC and Big Data Scaling Out Python* To HPC and Big Data Sergey Maidanov Software Engineering Manager for Intel Distribution for Python* What Problems We Solve: Scalable Performance Make Python usable beyond prototyping

More information

Intel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant

Intel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant Intel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant Parallel is the Path Forward Intel Xeon and Intel Xeon Phi Product Families are both going parallel Intel Xeon processor

More information

Maximizing performance and scalability using Intel performance libraries

Maximizing performance and scalability using Intel performance libraries Maximizing performance and scalability using Intel performance libraries Roger Philp Intel HPC Software Workshop Series 2016 HPC Code Modernization for Intel Xeon and Xeon Phi February 17 th 2016, Barcelona

More information

Intel tools for High Performance Python 데이터분석및기타기능을위한고성능 Python

Intel tools for High Performance Python 데이터분석및기타기능을위한고성능 Python Intel tools for High Performance Python 데이터분석및기타기능을위한고성능 Python Python Landscape Adoption of Python continues to grow among domain specialists and developers for its productivity benefits Challenge#1:

More information

Using Intel VTune Amplifier XE and Inspector XE in.net environment

Using Intel VTune Amplifier XE and Inspector XE in.net environment Using Intel VTune Amplifier XE and Inspector XE in.net environment Levent Akyil Technical Computing, Analyzers and Runtime Software and Services group 1 Refresher - Intel VTune Amplifier XE Intel Inspector

More information

Intel Architecture and Tools Jureca Tuning for the platform II. Dr. Heinrich Bockhorst Intel SSG/DPD/ Date:

Intel Architecture and Tools Jureca Tuning for the platform II. Dr. Heinrich Bockhorst Intel SSG/DPD/ Date: Intel Architecture and Tools Jureca Tuning for the platform II Dr. Heinrich Bockhorst Intel SSG/DPD/ Date: 23.11.2017 Agenda Introduction Processor Architecture Overview Composer XE Compiler Intel Python

More information

Intel VTune Amplifier XE for Tuning of HPC Applications Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel

Intel VTune Amplifier XE for Tuning of HPC Applications Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel Intel VTune Amplifier XE for Tuning of HPC Applications Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel Agenda Which performance analysis tool should I use first? Intel Application

More information

Kevin O Leary, Intel Technical Consulting Engineer

Kevin O Leary, Intel Technical Consulting Engineer Kevin O Leary, Intel Technical Consulting Engineer Moore s Law Is Going Strong Hardware performance continues to grow exponentially We think we can continue Moore's Law for at least another 10 years."

More information

Graphics Performance Analyzer for Android

Graphics Performance Analyzer for Android Graphics Performance Analyzer for Android 1 What you will learn from this slide deck Detailed optimization workflow of Graphics Performance Analyzer Android* System Analysis Only Please see subsequent

More information

Intel Math Kernel Library 10.3

Intel Math Kernel Library 10.3 Intel Math Kernel Library 10.3 Product Brief Intel Math Kernel Library 10.3 The Flagship High Performance Computing Math Library for Windows*, Linux*, and Mac OS* X Intel Math Kernel Library (Intel MKL)

More information

intel System Studio 2018 Beta 새로운플랫폼을위한새로운맞춤형개발자경험

intel System Studio 2018 Beta 새로운플랫폼을위한새로운맞춤형개발자경험 intel System Studio 2018 Beta 새로운플랫폼을위한새로운맞춤형개발자경험 Introduction to Developer Products Division Technical Computing IoT, Wearables, Embedded & Mobile Systems Computer Vision Performance Client Media & Apps

More information

Munara Tolubaeva Technical Consulting Engineer. 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries.

Munara Tolubaeva Technical Consulting Engineer. 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries. Munara Tolubaeva Technical Consulting Engineer 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries. notices and disclaimers Intel technologies features and benefits depend

More information

A Simple Path to Parallelism with Intel Cilk Plus

A Simple Path to Parallelism with Intel Cilk Plus Introduction This introductory tutorial describes how to use Intel Cilk Plus to simplify making taking advantage of vectorization and threading parallelism in your code. It provides a brief description

More information

Efficiently Introduce Threading using Intel TBB

Efficiently Introduce Threading using Intel TBB Introduction This guide will illustrate how to efficiently introduce threading using Intel Threading Building Blocks (Intel TBB), part of Intel Parallel Studio XE. It is a widely used, award-winning C++

More information

Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature

Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel Agenda Intel Advisor for vectorization

More information

Jackson Marusarz Intel Corporation

Jackson Marusarz Intel Corporation Jackson Marusarz Intel Corporation Intel VTune Amplifier Quick Introduction Get the Data You Need Hotspot (Statistical call tree), Call counts (Statistical) Thread Profiling Concurrency and Lock & Waits

More information

Bei Wang, Dmitry Prohorov and Carlos Rosales

Bei Wang, Dmitry Prohorov and Carlos Rosales Bei Wang, Dmitry Prohorov and Carlos Rosales Aspects of Application Performance What are the Aspects of Performance Intel Hardware Features Omni-Path Architecture MCDRAM 3D XPoint Many-core Xeon Phi AVX-512

More information

Achieving Peak Performance on Intel Hardware. Intel Software Developer Conference London, 2017

Achieving Peak Performance on Intel Hardware. Intel Software Developer Conference London, 2017 Achieving Peak Performance on Intel Hardware Intel Software Developer Conference London, 2017 Welcome Aims for the day You understand some of the critical features of Intel processors and other hardware

More information

Revealing the performance aspects in your code

Revealing the performance aspects in your code Revealing the performance aspects in your code 1 Three corner stones of HPC The parallelism can be exploited at three levels: message passing, fork/join, SIMD Hyperthreading is not quite threading A popular

More information

VLPL-S Optimization on Knights Landing

VLPL-S Optimization on Knights Landing VLPL-S Optimization on Knights Landing 英特尔软件与服务事业部 周姗 2016.5 Agenda VLPL-S 性能分析 VLPL-S 性能优化 总结 2 VLPL-S Workload Descriptions VLPL-S is the in-house code from SJTU, paralleled with MPI and written in C++.

More information

Intel VTune Amplifier XE

Intel VTune Amplifier XE Intel VTune Amplifier XE Vladimir Tsymbal Performance, Analysis and Threading Lab 1 Agenda Intel VTune Amplifier XE Overview Features Data collectors Analysis types Key Concepts Collecting performance

More information

Eliminate Threading Errors to Improve Program Stability

Eliminate Threading Errors to Improve Program Stability Eliminate Threading Errors to Improve Program Stability This guide will illustrate how the thread checking capabilities in Parallel Studio can be used to find crucial threading defects early in the development

More information

Eliminate Threading Errors to Improve Program Stability

Eliminate Threading Errors to Improve Program Stability Introduction This guide will illustrate how the thread checking capabilities in Intel Parallel Studio XE can be used to find crucial threading defects early in the development cycle. It provides detailed

More information

FAST FORWARD TO YOUR <NEXT> CREATION

FAST FORWARD TO YOUR <NEXT> CREATION FAST FORWARD TO YOUR CREATION THE ULTIMATE PROFESSIONAL WORKSTATIONS POWERED BY INTEL XEON PROCESSORS 7 SEPTEMBER 2017 WHAT S NEW INTRODUCING THE NEW INTEL XEON SCALABLE PROCESSOR BREAKTHROUGH PERFORMANCE

More information

Intel VTune Amplifier XE. Dr. Michael Klemm Software and Services Group Developer Relations Division

Intel VTune Amplifier XE. Dr. Michael Klemm Software and Services Group Developer Relations Division Intel VTune Amplifier XE Dr. Michael Klemm Software and Services Group Developer Relations Division Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED AS IS. NO LICENSE, EXPRESS

More information

What s New August 2015

What s New August 2015 What s New August 2015 Significant New Features New Directory Structure OpenMP* 4.1 Extensions C11 Standard Support More C++14 Standard Support Fortran 2008 Submodules and IMPURE ELEMENTAL Further C Interoperability

More information

Chao Yu, Technical Consulting Engineer, Intel IPP and MKL Team

Chao Yu, Technical Consulting Engineer, Intel IPP and MKL Team Chao Yu, Technical Consulting Engineer, Intel IPP and MKL Team Agenda Intel IPP and Intel MKL Benefits What s New in Intel MKL 11.3 What s New in Intel IPP 9.0 New Features and Changes Tips to Move Intel

More information

Intel Advisor XE. Vectorization Optimization. Optimization Notice

Intel Advisor XE. Vectorization Optimization. Optimization Notice Intel Advisor XE Vectorization Optimization 1 Performance is a Proven Game Changer It is driving disruptive change in multiple industries Protecting buildings from extreme events Sophisticated mechanics

More information

Intel Distribution For Python*

Intel Distribution For Python* Intel Distribution For Python* Intel Distribution for Python* 2017 Advancing Python performance closer to native speeds Easy, out-of-the-box access to high performance Python High performance with multiple

More information

Intel C++ Compiler Professional Edition 11.1 for Mac OS* X. In-Depth

Intel C++ Compiler Professional Edition 11.1 for Mac OS* X. In-Depth Intel C++ Compiler Professional Edition 11.1 for Mac OS* X In-Depth Contents Intel C++ Compiler Professional Edition 11.1 for Mac OS* X. 3 Intel C++ Compiler Professional Edition 11.1 Components:...3 Features...3

More information

Eliminate Memory Errors to Improve Program Stability

Eliminate Memory Errors to Improve Program Stability Introduction INTEL PARALLEL STUDIO XE EVALUATION GUIDE This guide will illustrate how Intel Parallel Studio XE memory checking capabilities can find crucial memory defects early in the development cycle.

More information

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms

Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,

More information

Intel C++ Compiler Professional Edition 11.0 for Linux* In-Depth

Intel C++ Compiler Professional Edition 11.0 for Linux* In-Depth Intel C++ Compiler Professional Edition 11.0 for Linux* In-Depth Contents Intel C++ Compiler Professional Edition for Linux*...3 Intel C++ Compiler Professional Edition Components:...3 Features...3 New

More information

Installation Guide and Release Notes

Installation Guide and Release Notes Intel Parallel Studio XE 2013 for Linux* Installation Guide and Release Notes Document number: 323804-003US 10 March 2013 Table of Contents 1 Introduction... 1 1.1 What s New... 1 1.1.1 Changes since Intel

More information

Intel C++ Compiler Professional Edition 11.1 for Linux* In-Depth

Intel C++ Compiler Professional Edition 11.1 for Linux* In-Depth Intel C++ Compiler Professional Edition 11.1 for Linux* In-Depth Contents Intel C++ Compiler Professional Edition 11.1 for Linux*.... 3 Intel C++ Compiler Professional Edition Components:......... 3 s...3

More information

Expressing and Analyzing Dependencies in your C++ Application

Expressing and Analyzing Dependencies in your C++ Application Expressing and Analyzing Dependencies in your C++ Application Pablo Reble, Software Engineer Developer Products Division Software and Services Group, Intel Agenda TBB and Flow Graph extensions Composable

More information

Fastest and most used math library for Intel -based systems 1

Fastest and most used math library for Intel -based systems 1 Fastest and most used math library for Intel -based systems 1 Speaker: Alexander Kalinkin Contributing authors: Peter Caday, Kazushige Goto, Louise Huot, Sarah Knepper, Mesut Meterelliyoz, Arthur Araujo

More information

Simplified and Effective Serial and Parallel Performance Optimization

Simplified and Effective Serial and Parallel Performance Optimization HPC Code Modernization Workshop at LRZ Simplified and Effective Serial and Parallel Performance Optimization Performance tuning Using Intel VTune Performance Profiler Performance Tuning Methodology Goal:

More information

Case Study. Optimizing an Illegal Image Filter System. Software. Intel Integrated Performance Primitives. High-Performance Computing

Case Study. Optimizing an Illegal Image Filter System. Software. Intel Integrated Performance Primitives. High-Performance Computing Case Study Software Optimizing an Illegal Image Filter System Intel Integrated Performance Primitives High-Performance Computing Tencent Doubles the Speed of its Illegal Image Filter System using SIMD

More information

Performance Tools for Technical Computing

Performance Tools for Technical Computing Christian Terboven terboven@rz.rwth-aachen.de Center for Computing and Communication RWTH Aachen University Intel Software Conference 2010 April 13th, Barcelona, Spain Agenda o Motivation and Methodology

More information

Intel Cluster Checker 3.0 webinar

Intel Cluster Checker 3.0 webinar Intel Cluster Checker 3.0 webinar June 3, 2015 Christopher Heller Technical Consulting Engineer Q2, 2015 1 Introduction Intel Cluster Checker 3.0 is a systems tool for Linux high performance compute clusters

More information

IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor

IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor D.Sc. Mikko Byckling 17th Workshop on High Performance Computing in Meteorology October 24 th 2016, Reading, UK Legal Disclaimer & Optimization

More information

Agenda. Optimization Notice Copyright 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.

Agenda. Optimization Notice Copyright 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others. Agenda VTune Amplifier XE OpenMP* Analysis: answering on customers questions about performance in the same language a program was written in Concepts, metrics and technology inside VTune Amplifier XE OpenMP

More information

Eliminate Memory Errors to Improve Program Stability

Eliminate Memory Errors to Improve Program Stability Eliminate Memory Errors to Improve Program Stability This guide will illustrate how Parallel Studio memory checking capabilities can find crucial memory defects early in the development cycle. It provides

More information

Outline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends

Outline. Motivation Parallel k-means Clustering Intel Computing Architectures Baseline Performance Performance Optimizations Future Trends Collaborators: Richard T. Mills, Argonne National Laboratory Sarat Sreepathi, Oak Ridge National Laboratory Forrest M. Hoffman, Oak Ridge National Laboratory Jitendra Kumar, Oak Ridge National Laboratory

More information

Using Intel VTune Amplifier XE for High Performance Computing

Using Intel VTune Amplifier XE for High Performance Computing Using Intel VTune Amplifier XE for High Performance Computing Vladimir Tsymbal Performance, Analysis and Threading Lab 1 The Majority of all HPC-Systems are Clusters Interconnect I/O I/O... I/O I/O Message

More information

Alexei Katranov. IWOCL '16, April 21, 2016, Vienna, Austria

Alexei Katranov. IWOCL '16, April 21, 2016, Vienna, Austria Alexei Katranov IWOCL '16, April 21, 2016, Vienna, Austria Hardware: customization, integration, heterogeneity Intel Processor Graphics CPU CPU CPU CPU Multicore CPU + integrated units for graphics, media

More information

Getting Started with Intel SDK for OpenCL Applications

Getting Started with Intel SDK for OpenCL Applications Getting Started with Intel SDK for OpenCL Applications Webinar #1 in the Three-part OpenCL Webinar Series July 11, 2012 Register Now for All Webinars in the Series Welcome to Getting Started with Intel

More information

Intel Many Integrated Core (MIC) Architecture

Intel Many Integrated Core (MIC) Architecture Intel Many Integrated Core (MIC) Architecture Karl Solchenbach Director European Exascale Labs BMW2011, November 3, 2011 1 Notice and Disclaimers Notice: This document contains information on products

More information

Microarchitectural Analysis with Intel VTune Amplifier XE

Microarchitectural Analysis with Intel VTune Amplifier XE Microarchitectural Analysis with Intel VTune Amplifier XE Michael Klemm Software & Services Group Developer Relations Division 1 Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION

More information

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms. Whitepaper Introduction A Library Based Approach to Threading for Performance David R. Mackay, Ph.D. Libraries play an important role in threading software to run faster on Intel multi-core platforms.

More information

More performance options

More performance options More performance options OpenCL, streaming media, and native coding options with INDE April 8, 2014 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Intel Xeon, and Intel

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2016 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture How Programming

More information

This guide will show you how to use Intel Inspector XE to identify and fix resource leak errors in your programs before they start causing problems.

This guide will show you how to use Intel Inspector XE to identify and fix resource leak errors in your programs before they start causing problems. Introduction A resource leak refers to a type of resource consumption in which the program cannot release resources it has acquired. Typically the result of a bug, common resource issues, such as memory

More information

Intel Architecture for HPC

Intel Architecture for HPC Intel Architecture for HPC Georg Zitzlsberger georg.zitzlsberger@vsb.cz 1st of March 2018 Agenda Salomon Architectures Intel R Xeon R processors v3 (Haswell) Intel R Xeon Phi TM coprocessor (KNC) Ohter

More information

Intel Math Kernel Library (Intel MKL) Latest Features

Intel Math Kernel Library (Intel MKL) Latest Features Intel Math Kernel Library (Intel MKL) Latest Features Sridevi Allam Technical Consulting Engineer Sridevi.allam@intel.com 1 Agenda - Introduction to Support on Intel Xeon Phi Coprocessors - Performance

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2017 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Дмитрий Рябцев, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture

More information

IXPUG 16. Dmitry Durnov, Intel MPI team

IXPUG 16. Dmitry Durnov, Intel MPI team IXPUG 16 Dmitry Durnov, Intel MPI team Agenda - Intel MPI 2017 Beta U1 product availability - New features overview - Competitive results - Useful links - Q/A 2 Intel MPI 2017 Beta U1 is available! Key

More information

Intel Software Development Products for High Performance Computing and Parallel Programming

Intel Software Development Products for High Performance Computing and Parallel Programming Intel Software Development Products for High Performance Computing and Parallel Programming Multicore development tools with extensions to many-core Notices INFORMATION IN THIS DOCUMENT IS PROVIDED IN

More information

Klaus-Dieter Oertel, May 28 th 2013 Software and Services Group Intel Corporation

Klaus-Dieter Oertel, May 28 th 2013 Software and Services Group Intel Corporation S c i c o m P 2 0 1 3 T u t o r i a l Intel Xeon Phi Product Family Programming Tools Klaus-Dieter Oertel, May 28 th 2013 Software and Services Group Intel Corporation Agenda Intel Parallel Studio XE 2013

More information

Intel C++ Compiler Professional Edition 11.0 for Windows* In-Depth

Intel C++ Compiler Professional Edition 11.0 for Windows* In-Depth Intel C++ Compiler Professional Edition 11.0 for Windows* In-Depth Contents Intel C++ Compiler Professional Edition for Windows*..... 3 Intel C++ Compiler Professional Edition At A Glance...3 Intel C++

More information

What s P. Thierry

What s P. Thierry What s new@intel P. Thierry Principal Engineer, Intel Corp philippe.thierry@intel.com CPU trend Memory update Software Characterization in 30 mn 10 000 feet view CPU : Range of few TF/s and

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

Intel Software Development Products Licensing & Programs Channel EMEA

Intel Software Development Products Licensing & Programs Channel EMEA Intel Software Development Products Licensing & Programs Channel EMEA Intel Software Development Products Advanced Performance Distributed Performance Intel Software Development Products Foundation of

More information
