Faster Code. Faster. Intel Parallel Studio XE Unleash the Beast
- Dennis Dennis
- 6 years ago
Transcription
1 Faster Code. Faster. Intel Parallel Studio XE 2017: Unleash the Beast
2 Create Faster Code Faster: Intel Parallel Studio XE
Design, build, verify and tune C++, C, Fortran, Python* and Java* code.
Standards-driven parallel models: OpenMP, MPI & TBB.
Highlights from the 2017 edition:
- Faster Python* application performance using Intel Distribution for Python and Intel VTune Amplifier XE.
- Faster deep learning on IA using Intel Math Kernel Library and Intel Data Analytics Acceleration Library.
- Quickly assess application performance using the snapshot features of VTune Amplifier XE and Intel Trace Analyzer and Collector.
- Scale to next-generation platforms, including the latest Intel Xeon Phi processor: optimizations for AVX-512, high bandwidth memory and explicit vectorization in the compiler and analysis tools.
3 Intel Parallel Studio XE
Compilers and Python:
- Intel C/C++ & Fortran Compilers
- Intel Distribution for Python: Performance Scripting
Performance Libraries:
- Intel Math Kernel Library: Optimized Routines for Science, Engineering & Financial
- Intel Data Analytics Acceleration Library: Optimized for Data Analytics & Machine Learning
- Intel Integrated Performance Primitives: Image, Signal & Data Processing
- Intel Threading Building Blocks: Task Based Parallel C++ Template Library
Profiling, Analysis & Architecture:
- Intel VTune Amplifier: Performance Profiler
- Intel Inspector: Memory & Threading Checking
- Intel Advisor: Vectorization Optimization & Thread Prototyping
Cluster Tools:
- Intel MPI Library
- Intel Trace Analyzer & Collector: MPI Profiler
- Intel Cluster Checker: Cluster Diagnostic Expert System
4 Boost application performance on Windows* and Linux*: Intel C++ and Fortran Compilers
[Chart] Boost C++ application performance on Windows* & Linux* using Intel C++ Compiler (higher is better). Relative geomean performance, SPEC* benchmark. Panels: Floating Point (estimated SPECfp_rate_base2006) and Integer (estimated SPECint_rate_base2006), each for Windows and Linux. Compilers compared: Intel C++ 17.0, Visual C++* 2015, PGI*, Clang* 3.8, GCC*.
Configuration: Windows hardware: Intel(R) Xeon(R) CPU E GHz, HT enabled, TB enabled, 32 GB RAM; Linux hardware: Intel(R) Xeon(R) CPU E GHz, 256 GB RAM, HyperThreading is on. Software: Intel compilers 17.0, Microsoft (R) C/C++ Optimizing Compiler Version for x86/x64, GCC PGI 15.10, Clang/LLVM 3.8. Linux OS: Red Hat Enterprise Linux Server release 7.1 (Maipo), kernel el7.x86_64. Windows OS: Windows 10 Pro (N/A Build 10240). SPEC* Benchmark. SmartHeap libs 11.3 for Visual C++ and Intel Compiler were used for SPECint benchmarks.
[Chart] Boost Fortran application performance on Windows* & Linux* using Intel Fortran Compiler (higher is better). Relative geomean performance, Polyhedron* benchmark, for Windows and Linux. Compilers compared: Intel Fortran 17.0, PGI Fortran*, Absoft*, Open64*, gfortran*.
Configuration: Hardware: Intel(R) Xeon(R) CPU E GHz, Hyperthreading enabled, TB enabled, 32 GB RAM. Software: Intel Fortran compiler 17.0, Absoft* 15.0.1, PGI Fortran* (Windows)/16.4 (Linux), Open64* 4.5.2, gfortran*. Linux OS: Red Hat Enterprise Linux Server release 7.2, Kernel el7.x86_64. Windows OS: Windows 10 Pro (N/A Build 10240). Polyhedron Fortran Benchmark.
Windows compiler switches: Absoft: -m64 -O5 -speed_math=10 -fast_math -march=core -xinteger -stack:0x. Intel Fortran compiler: /fast /Qparallel /QxCORE-AVX2 /nostandard-realloc-lhs /link /stack:. PGI Fortran: -fastsse -Munroll=n:4 -Mipa=fast,inline -Mconcur=numa.
Linux compiler switches: Absoft: -m64 -mavx -O5 -speed_math=10 -march=core -xinteger. Gfortran: -Ofast -mfpmath=sse -flto -march=native -funroll-loops -ftree-parallelize-loops=4. Intel Fortran compiler: -fast -parallel -xcore-avx2 -nostandard-realloc-lhs. PGI Fortran: -fast -Mipa=fast,inline -Msmartalloc -Mfprelaxed -Mstack_arrays -Mconcur=bind. Open64: -march=auto -Ofast -mso -apo.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. * Other brands and names are the property of their respective owners. Benchmark Source: Intel Corporation.
Optimization Notice: Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #
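The charts on this slide report a "relative geomean": each compiler's per-benchmark score is divided by the baseline compiler's score, and the geometric mean of those ratios is plotted. A minimal sketch of that arithmetic follows; the function name and the input values in the usage note are illustrative assumptions, not figures from the slide.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Geometric mean of positive per-benchmark ratios (score / baseline score).
// Summing logarithms avoids overflow when many ratios are multiplied.
double relative_geomean(const std::vector<double>& ratios) {
    assert(!ratios.empty());
    double log_sum = 0.0;
    for (double r : ratios) {
        assert(r > 0.0);  // ratios of benchmark scores are always positive
        log_sum += std::log(r);
    }
    return std::exp(log_sum / static_cast<double>(ratios.size()));
}
```

For example, hypothetical ratios of 2.0 and 8.0 on two benchmarks give a geomean of 4.0, whereas the arithmetic mean would report 5.0; the geomean keeps a single outlier from dominating the summary.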
5 What's Inside Intel Parallel Studio XE 2017
Editions: Composer Edition from $699; Professional Edition from $1,699; Cluster Edition from $2,949.
Build:
- Intel C++ Compiler
- Intel Fortran Compiler
- Intel Distribution for Python*
- Intel Math Kernel Library: fast math library
- Intel Integrated Performance Primitives: image, signal & data processing
- Intel Threading Building Blocks: threading library
- Intel Data Analytics Acceleration Library: machine learning & analytics
Analyze:
- Intel VTune Amplifier XE: performance profiler
- Intel Advisor: vectorization optimization and thread prototyping
- Intel Inspector: memory and thread debugging
Scale:
- Intel MPI Library: message passing interface library
- Intel Trace Analyzer and Collector: MPI tuning and analysis
- Intel Cluster Checker: cluster diagnostic expert system
Rogue Wave IMSL* Library (Fortran numerical analysis): bundle or add-on, add-on, add-on.
Additional configurations, including floating and academic, are available at:
6 Staying Current with Support for the Latest Standards, Operating Systems & Processors
Enhanced C11 and C++14 standards support:
- Sized deallocation
- Relaxed constexpr restrictions
- Variable templates
- Single quotation mark as a digit separator
Enhanced Fortran 2008 and draft 2015 standards support:
- Implied-shape PARAMETER arrays
- 2008 bind C internal procedures
- Extended EXIT for all named blocks
- Pointer initialization
Operating systems: Windows* 7 thru 10, Windows Server; Debian* 7.0, 8.0; Fedora* 23, 24; Red Hat Enterprise Linux* 6, 7; SuSE LINUX Enterprise Server* 11, 12; Ubuntu* LTS LTS; macOS*
Latest processors: support and tuning added for the latest Intel Xeon Phi, codenamed Knights Landing, and AVX-512
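Three of the C++14 features named on this slide can be shown in a few lines of standard C++. This is a small sketch, not Intel sample code; the identifiers (kBudget, kPi, factorial) are made up for illustration, and sized deallocation is left out because it changes how operator delete is called rather than how user code is written.

```cpp
#include <cassert>

// Digit separator (C++14): the single quotation mark groups digits
// without changing the value of the literal.
constexpr long kBudget = 1'000'000;

// Variable template (C++14): one definition, instantiated per type.
template <typename T>
constexpr T kPi = T(3.1415926535897932385L);

// Relaxed constexpr (C++14): local variables and loops are allowed
// inside a constexpr function.
constexpr unsigned long long factorial(unsigned n) {
    unsigned long long result = 1;
    for (unsigned i = 2; i <= n; ++i) {
        result *= i;
    }
    return result;
}

static_assert(kBudget == 1000000, "separators do not affect the value");
static_assert(factorial(10) == 3'628'800ULL, "evaluated at compile time");
```

The static_assert lines only compile if the compiler evaluates factorial at compile time, which is a quick way to check that a toolchain is really in C++14 mode or later.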
8 Intel Compilers for Parallel Studio XE 2017: What's New in Intel C++ and Intel Fortran 17.0
Productive language-level vectorization & parallelism models for advanced developers driving application performance.
Common updates:
- Enhanced support for the newest AVX2 and AVX-512 instruction sets for the latest Intel processors (including Intel Xeon Phi)
- Enhanced optimization/vectorization reports, register allocation
- Tight integration with Intel Advisor
- Initial support for OpenMP* 4.5, offering improved vectorization control, new SIMD instructions, and much more
Intel C++ Compiler:
- SIMD Data Layout Template to facilitate vectorization for your C++ code
- Virtual function vectorization capability
- Improved compiler loop and function alignment
- Full support for the latest C11 and C++14 standards
Intel Fortran Compiler:
- Substantial coarray performance improvement: up to twice as fast as previous versions on non-trivial coarray Fortran programs
- Almost complete Fortran 2008 support
- Further interoperability with C (part of draft Fortran 2015)
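The data-layout problem that a SIMD data layout template addresses can be sketched in plain C++: an array of structures interleaves fields in memory, while a structure of arrays keeps each field contiguous so a loop over one field has unit stride. The code below is a hand-written illustration of that idea and does not use the SIMD Data Layout Template API; all type and function names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array-of-structures element: x, y, z of consecutive points are
// interleaved in memory, so a loop over x alone strides by three floats.
struct Point3 { float x, y, z; };

// Structure-of-arrays: each field is stored contiguously (unit stride).
struct Points3SoA {
    std::vector<float> x, y, z;
};

// Copy an AoS container into the SoA layout.
Points3SoA to_soa(const std::vector<Point3>& aos) {
    Points3SoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    for (const Point3& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
    }
    return soa;
}

// Unit-stride loop over one field. The pragma is a hint for compilers
// with OpenMP SIMD support; other compilers ignore it.
void scale_x(Points3SoA& pts, float factor) {
    float* x = pts.x.data();
    const std::size_t n = pts.x.size();
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        x[i] *= factor;
    }
}
```

The conversion cost is paid once; whether it is worth it depends on how many vectorizable passes run over the data afterwards.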
10 Impressive Performance Improvement: Intel Compiler OpenMP* Explicit Vectorization
Two added lines (the OpenMP pragmas) let the code take full advantage of either SSE or AVX. Compilers that do not support the pragmas ignore them, so the code stays portable.

    #include <complex.h>
    #include <stdint.h>

    typedef float complex fcomplex;
    const uint32_t max_iter = 3000;

    /* Added line 1: request a SIMD version of mandel(); max_iter is the
       same for every lane. */
    #pragma omp declare simd uniform(max_iter), simdlen(16)
    uint32_t mandel(fcomplex c, uint32_t max_iter)
    {
        uint32_t count = 1;
        fcomplex z = c;
        while ((cabsf(z) < 2.0f) && (count < max_iter)) {
            z = z * z + c;
            count++;
        }
        return count;
    }

    /* Rows are indexed by y and columns by x, matching count[y][x] below. */
    uint32_t count[ImageHeight][ImageWidth];
    ..
    for (int32_t y = 0; y < ImageHeight; ++y) {
        float c_im = max_imag - y * imag_factor;
        /* Added line 2: vectorize the loop over x. */
        #pragma omp simd safelen(16)
        for (int32_t x = 0; x < ImageWidth; ++x) {
            fcomplex in_vals_tmp = (min_real + x * real_factor) + (c_im * 1.0iF);
            count[y][x] = mandel(in_vals_tmp, max_iter);
        }
    }

[Chart] Mandelbrot calculation speedup, normalized performance data, higher is better: Serial 1; SSE 4.2 2.48; Core-AVX2 4.27.
Configuration: Intel Xeon CPU 3.50 GHz Haswell system (4 cores with Hyper-Threading On), running at 3.50GHz, with 32.0GB RAM, L1 Cache 256KB, L2 Cache 1.0MB, L3 Cache 8.0MB, 64-bit Windows* Server 2012 R2 Datacenter. Compiler options: SSE4.2: -O3 -Qopenmp-simd -QxSSE4.2; AVX2: -O3 -Qopenmp-simd -QxCORE-AVX2.
11 Impressive Performance Improvement: Intel C++ Explicit Vectorization using OpenMP* SIMD
[Chart] SIMD speedup on Intel Xeon processor, normalized performance data, higher is better. Benchmarks: AoBench, Collision Detection, Grassshader, Mandelbrot, Libor, RTM-stencil, and their geomean. Series: Serial (1.00 for every benchmark), SSE4.2, Core-AVX2.
Configuration: Intel Xeon CPU 3.50 GHz Haswell system (4 cores with Hyper-Threading On), running at 3.50GHz, with 32.0GB RAM, L1 Cache 256KB, L2 Cache 1.0MB, L3 Cache 8.0MB, 64-bit Windows* Server 2012 R2 Datacenter. Compiler options: SSE4.2: -O3 -Qopenmp-simd -QxSSE4.2; AVX2: -O3 -Qopenmp-simd -QxCORE-AVX2.
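Besides plain loops and SIMD-enabled functions, OpenMP SIMD also covers loops that accumulate into a scalar, which a compiler often cannot vectorize on its own because of the loop-carried dependence. A minimal sketch with a reduction clause follows; the function name is an assumption, and no speedup is claimed for it.

```cpp
#include <cassert>
#include <cstddef>

// Dot product with an explicit SIMD reduction. The reduction clause tells
// the compiler that partial sums may be kept per SIMD lane and combined at
// the end. A compiler without OpenMP SIMD support ignores the pragma and
// runs the loop serially.
float dot(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    #pragma omp simd reduction(+:sum)
    for (std::size_t i = 0; i < n; ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

Because floating-point addition is not associative, a vectorized reduction may differ from the serial result in the last bits for general inputs.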
13 Boost NumPy/SciPy Performance with Intel MKL: Intel Distribution for Python*
Easy access to high-performance Python:
- NumPy/SciPy/Scikit-Learn/pandas accelerated with Intel MKL
- Close to 100X performance speedups on select functions
- Includes Python optimized modules for Intel TBB, Intel DAAL
- Includes numba, Cython, pydaal
Integrated distribution, out-of-the-box access to performance:
- Python 2.7 & 3.5
- Windows, Linux, macOS
- Latest optimizations for Intel Xeon and Intel Xeon Phi processors
- Available as free standalone, via conda* and Intel Parallel Studio XE
14 Close to 100X faster for select functions
15 Profile Python & Go using Intel VTune Amplifier, and Mixed Python / C++ / Fortran (New!)
Low Overhead Sampling:
- Accurate performance data without high overhead instrumentation
- Launch application or attach to a running process
Precise Line Level Details:
- No guessing, see source line level detail
Mixed Python / native C, C++, Fortran:
- Optimize native code driven by Python
16 Intel Math Kernel Library Intel Data Analytics Acceleration Library Intel Integrated Performance Primitives Intel Threading Building Blocks
18 Intel Math Kernel Library
Speeds math processing for machine learning, scientific, engineering, financial and design applications.
Application areas: Energy, Science & Research, Engineering Design, Financial Analytics, Signal Processing, Digital Content Creation.
- Includes functions for dense and sparse linear algebra (BLAS, LAPACK, PARDISO), FFTs, vector math, summary statistics and more
- De facto standard APIs for easy switching from other math libraries
- Highly optimized, threaded and vectorized to maximize processor performance
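The "de facto standard APIs" point is easiest to see with BLAS: a routine such as DGEMM has fixed semantics, so code written against it can switch between libraries. The sketch below spells out what DGEMM computes (without transposes, row-major storage) as a naive triple loop. It is a reference for the semantics only, not how Intel MKL implements it; an optimized library exposes the same operation through the standard cblas_dgemm interface.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Reference semantics of DGEMM without transposes:
//   C = alpha * A * B + beta * C
// with A (m x k), B (k x n) and C (m x n) stored row-major in flat vectors.
void dgemm_reference(std::size_t m, std::size_t n, std::size_t k,
                     double alpha, const std::vector<double>& A,
                     const std::vector<double>& B,
                     double beta, std::vector<double>& C) {
    assert(A.size() == m * k && B.size() == k * n && C.size() == m * n);
    for (std::size_t i = 0; i < m; ++i) {
        for (std::size_t j = 0; j < n; ++j) {
            double acc = 0.0;
            for (std::size_t p = 0; p < k; ++p) {
                acc += A[i * k + p] * B[p * n + j];
            }
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
    }
}
```

A loop like this is useful as a correctness oracle when validating a call into an optimized library on small matrices.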
19 Components of Intel MKL 2017
- Linear Algebra: BLAS, LAPACK, ScaLAPACK, Sparse BLAS, Sparse Solvers (Iterative, PARDISO*, Cluster Sparse Solver)
- Fast Fourier Transforms: Multidimensional, FFTW interfaces, Cluster FFT
- Vector Math: Trigonometric, Hyperbolic, Exponential, Log, Power, Root, Vector RNGs
- Summary Statistics: Kurtosis, Variation coefficient, Order statistics, Min/max, Variance-covariance
- And More: Splines, Interpolation, Trust Region, Fast Poisson Solver
- Deep Neural Networks (New): Convolution, Pooling, Normalization, ReLU, Softmax
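Two of the deep-neural-network building blocks listed here, ReLU and softmax, have short textbook definitions. The sketch below gives scalar reference versions in standard C++ to show what the primitives compute; it does not use or reproduce the Intel MKL DNN interface, and the function names are illustrative.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <vector>

// ReLU: replace every negative activation with zero.
std::vector<float> relu(std::vector<float> v) {
    for (float& x : v) {
        x = std::max(x, 0.0f);
    }
    return v;
}

// Softmax: exponentiate and normalize so the outputs sum to 1.
// Subtracting the maximum first keeps exp() from overflowing and does
// not change the result.
std::vector<float> softmax(const std::vector<float>& logits) {
    assert(!logits.empty());
    const float max_logit = *std::max_element(logits.begin(), logits.end());
    std::vector<float> out(logits.size());
    float sum = 0.0f;
    for (std::size_t i = 0; i < logits.size(); ++i) {
        out[i] = std::exp(logits[i] - max_logit);
        sum += out[i];
    }
    for (float& p : out) {
        p /= sum;
    }
    return out;
}
```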
20 Performance Benefit to Applications: Intel MKL
Significant LAPACK performance boost using Intel Math Kernel Library versus ATLAS*.
[Chart] DGETRF on Intel Xeon E Processor: performance (GFlops) versus matrix size. Series: Intel MKL with 16 threads, Intel MKL with 8 threads, ATLAS with 16 threads, ATLAS with 8 threads. Intel MKL provides a significant performance boost over ATLAS*.
Configuration: Hardware: CPU: Dual Intel Xeon E5-2697v2@2.70Ghz; 64 GB RAM. Interconnect: Mellanox Technologies* MT27500 Family [ConnectX*-3] FDR. Software: RedHat* RHEL 6.2; OFED 3.5-2; Intel MPI Library 5.0; Intel MPI Benchmarks (default parameters; built with Intel C++ Compiler XE for Linux*).
The latest version of Intel MKL unleashes the performance benefits of Intel architectures.
21 What's New: Intel MKL 2017
- Optimized math functions to enable neural networks (CNN and DNN) for deep learning
- Improved ScaLAPACK performance for symmetric eigensolvers on HPC clusters
- New data fitting functions based on B-splines and monotonic splines
- Improved optimizations for newer Intel processors, especially Knights Landing Xeon Phi
- Extended TBB threading layer support for all BLAS level-1 functions
23 Intel DAAL Overview
Industry-leading performance; a C++/Java/Python library for machine learning and deep learning, optimized for Intel architectures.
Application domains: Scientific/Engineering, Web/Social, Business.
Stages covered: Pre-processing, Transformation, Analysis, Modeling, Validation, Decision Making.
Algorithms: (De-)Compression, PCA, Statistical moments, Variance matrix, QR, SVD, Cholesky, Apriori, Linear regression, Naïve Bayes, SVM, Classifier boosting, Kmeans, EM GMM, Collaborative filtering, Neural Networks.
24 Example Performance: Intel DAAL vs. Spark* MLLib
[Chart] PCA (correlation method) on an 8-node Hadoop* cluster based on Intel Xeon Processors E v. Speedup versus table size for tables of 1M x 200, 1M x 400, 1M x 600, 1M x 800 and 1M x 1000; the labels that survive for the four larger tables read 6X, 6X, 7X and 7X.
Configuration Info - Versions: Intel Data Analytics Acceleration Library 2016, CDH v5.3.1, Apache Spark* v1.2.0; Hardware: Intel Xeon Processor E v3, 2 Eighteen-core CPUs (45MB LLC, 2.3GHz), 128GB of RAM per node; Operating System: CentOS 6.6 x86_64.
25 What's New: Intel DAAL 2017
- Neural Networks
- Python API (a.k.a. PyDAAL): easy installation through Anaconda or pip
- New data source connector for KDB+
- Open source project on GitHub
Fork me on GitHub:
27 Rich Feature Set for Parallelism: Intel Threading Building Blocks (Intel TBB)
Parallel algorithms and data structures:
- Generic Parallel Algorithms: efficient, scalable way to exploit the power of multicore without having to start from scratch
- Flow Graph: a set of classes to express parallelism as a graph of compute dependencies and/or data flow
- Concurrent Containers: concurrent access, and a scalable alternative to containers that are externally locked for thread-safety
Threads and synchronization:
- Synchronization Primitives: atomic operations, a variety of mutexes with different properties, condition variables
- Timers and Exceptions: thread-safe timers and exception classes
- Threads: OS API wrappers
- Thread Local Storage: efficient implementation for an unlimited number of thread-local variables
Memory allocation and task scheduling:
- Task Scheduler: sophisticated work scheduling engine that empowers parallel algorithms and the flow graph
- Memory Allocation: scalable memory manager and false-sharing free allocators
28 What's New: Intel Threading Building Blocks 2017
- static_partitioner class: helps minimize the overhead of parallel loops
- streaming_node class: enables heterogeneous streaming computations within the flow graph
- Added a method to isolate execution of a group of tasks or an algorithm from other tasks submitted to the scheduler
- A preview feature for the Python* module is added to replace Python's thread pool class
- Graph/stereo example is added
- Improvements to the graph/fgbzip example (added an async_msg usage example)
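The idea behind a static partitioner is that the iteration range is divided among workers once, up front, with no further splitting or load balancing at run time; that is where the reduced overhead comes from. The sketch below shows that division in plain C++. It illustrates the general technique only and is not TBB's implementation; in TBB itself, a tbb::static_partitioner object is passed as the partitioner argument of an algorithm such as tbb::parallel_for. The function name here is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split the half-open range [0, n) into `workers` contiguous chunks whose
// sizes differ by at most one. Each pair is (begin, end) for one worker.
std::vector<std::pair<std::size_t, std::size_t>>
static_chunks(std::size_t n, std::size_t workers) {
    assert(workers > 0);
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    chunks.reserve(workers);
    const std::size_t base = n / workers;   // minimum chunk length
    const std::size_t extra = n % workers;  // first `extra` chunks get one more
    std::size_t begin = 0;
    for (std::size_t w = 0; w < workers; ++w) {
        const std::size_t len = base + (w < extra ? 1 : 0);
        chunks.emplace_back(begin, begin + len);
        begin += len;
    }
    return chunks;
}
```

A fixed split works well when every iteration costs about the same; with uneven iteration costs, a partitioner that rebalances at run time is usually the better choice.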
30 Intel IPP Domain Applications
- Image Processing: Medical Imaging, Computer Vision, Digital Surveillance, Biometric Identification, Automated Sorting, ADAS, Visual Search
- Signal Processing: Games (sophisticated audio content or effects), Echo cancellation, Telecommunications, Energy
- Data Compression & Cryptography: Data Centers, Enterprise data management, ID verification, Smart cards/wallets, Electronic signature, Information security / cybersecurity
31 What's New: Intel Integrated Performance Primitives 2017
- Extended optimization for Intel AVX-512 on KNL and Intel Xeon processors
- Intel IPP Platform-Aware APIs are added in the image and signal processing domains to support external threading and 64-bit data length
- Significantly improved performance of zlib compression functions
- Extension of IPP optimized functionality in OpenCV
- Limited pre-silicon optimizations for KNH and CNL EP/XE server
32 Intel VTune Amplifier XE Performance Profiler Intel Inspector XE Memory & Thread Debugger Intel Advisor XE Vectorization Optimization and Thread Prototyping
34 Intel VTune Amplifier: Faster, Scalable Code, Faster
Get the Data You Need:
- Hotspot (statistical call tree), call counts (statistical)
- Thread profiling: concurrency and locks & waits analysis
- Cache miss, bandwidth analysis (1)
- GPU offload and OpenCL kernel tracing
Find Answers Fast:
- View results on the source / assembly
- OpenMP scalability analysis, graphical frame analysis
- Filter out extraneous data, organize data with viewpoints
- Visualize thread & task activity on the timeline
Easy to Use:
- No special compiles: C, C++, C#, Fortran, Java, ASM
- Visual Studio* integration or stand-alone
- Graphical interface & command line
- Local & remote data collection
- Analyze Windows* & Linux* data on OS X* (2)
(1) Events vary by processor. (2) No data collection on OS X*.
Screenshot captions: Quickly Find Tuning Opportunities; See Results on the Source Code; Tune OpenMP Scalability; Visualize & Filter Data.
35 New for 2017! Python, FLOPS, Storage & More: Intel VTune Amplifier Performance Profiler
- New! Profile Python and mixed Python / C++ / Fortran
- Tune Intel Xeon Phi Knights Landing processors
- Quickly see 3 keys to HPC performance
- Optimize memory access
- Storage analysis: I/O bound or CPU bound?
- Enhanced OpenCL & GPU profiling
- Easier remote and command line usage
- Add custom counters to the timeline
- Preview: Application & Storage Performance Snapshots
- Intel Advisor: optimize vectorization for AVX-512 (with or without hardware)
36 Intel VTune Amplifier Tunes Knights Landing Processors (New!)
4 Critical Optimizations for Intel Xeon Phi Processors
1) High Bandwidth Memory
- Decide which data structures to place in MCDRAM
- See performance problems by memory hierarchy
- Measure DRAM and MCDRAM bandwidth
2) Scalability of MPI and OpenMP
- Serial vs. parallel time
- Imbalance, overhead cost, parallel loop parameters
3) Micro Architecture Efficiency
- See the efficiency of your code in the core pipeline
- Zero in on details with custom PMU events
4) Vectorization Efficiency
- Use Intel Advisor
- Optimize for AVX-512 with or without AVX-512 hardware
37 Optimize Memory Access: Memory Access Analysis, Intel VTune Amplifier 2017 (Improved!)
Tune data structures for performance:
- Attribute cache misses to data structures (not just the code causing the miss)
- Support for custom memory allocators
Optimize NUMA latency & scalability:
- True & false sharing optimization
- Auto detect max system bandwidth
- Easier tuning of inter-socket bandwidth
Easier install, latest processors:
- No special drivers required on Linux*
- Intel Xeon Phi processor MCDRAM (high bandwidth memory) analysis
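False sharing, mentioned above, happens when independent variables written by different threads sit on the same cache line, so every write invalidates the line for the other cores. Once a profiler points at the data structure, a common fix is to pad each per-thread item to a full line. The sketch below assumes a 64-byte cache line, which is typical for x86 but is an assumption here; the type names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Assumed cache-line size in bytes; common on x86 but not universal.
constexpr std::size_t kCacheLine = 64;

// Unpadded: several adjacent counters fit in one cache line, so counters
// updated by different threads can falsely share that line.
struct PackedCounter {
    long value = 0;
};

// Padded: alignas rounds the size of each element up to a whole line, so
// neighbouring array elements never share one.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};

static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");
static_assert(alignof(PaddedCounter) == kCacheLine, "starts on a line boundary");

// Number of objects of type T that fit in one (assumed) cache line.
template <typename T>
constexpr std::size_t counters_per_line() {
    return kCacheLine / sizeof(T);
}
```

The trade-off is memory: an array of padded counters is several times larger than the packed one, so padding is usually reserved for data that is written frequently by different threads.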
38 Storage Device Analysis (HDD, SATA or NVMe SSD): Intel VTune Amplifier (New!)
Are You I/O Bound or CPU Bound?
- Explore imbalance between I/O operations (async & sync) and compute
- Storage accesses mapped to the source code
- See when the CPU is waiting for I/O
- Measure bus bandwidth to storage
Latency analysis:
- Tune storage accesses with the latency histogram
- Distribution of I/O over multiple devices
Screenshot captions: Sliders set thresholds for I/O Queue Depth; Slow task with I/O Wait.
39 Intel Performance Snapshots Three Fast Ways to Discover Untapped Performance Is your application making good use of modern computer hardware? Run a test case during your coffee break. High level summary shows which apps can benefit most from code modernization and faster storage. Pick a Performance Snapshot: Application for non-MPI apps; MPI for MPI apps; Storage for systems (servers and workstations) with directly attached storage. Free download: Also included with Intel Parallel Studio and Intel VTune Amplifier products. New!
41 Find & Debug Memory & Threading Errors Intel Inspector Memory & Thread Debugger Correctness Tools Increase ROI By 12%-21% 1 Errors found earlier are less expensive to fix Several studies, ROI% varies, but earlier is cheaper Diagnosing Some Errors Can Take Months Races & deadlocks not easily reproduced Memory errors hard to find without a tool Debugger Integration Speeds Diagnosis Breakpoint set just before the problem Examine variables & threads with the debugger Diagnose in hours instead of months 1 Cost Factors Square Project Analysis CERT: U.S. Computer Emergency Readiness Team, and Carnegie Mellon CyLab NIST: National Institute of Standards & Technology : Square Project Results Debugger Breakpoints Part of Intel Parallel Studio Professional For Windows* and Linux* From $1,599 Intel Inspector dramatically sped up our ability to track down difficult to isolate threading errors before our packages are released to the field. Peter von Kaenel, Director, Software Development, Harmonic Inc. 41
42 New for 2017! New Processors, New C++ Language Features Intel Inspector 2017 Memory and Thread Debugger New C++ Language Features Full C++ 11 support including std::mutex and std::atomic Easier Identification of Threading Bugs Variable name causing error is shown (global, static & stack) in addition to the code lines Run Native on Intel Xeon Phi Processors This simplifies workflow for Intel Xeon Phi processor development Tip: Reduce thread count to 30 for best KNL performance while running Intel Inspector New! 42
44 Get Faster Code Faster! Intel Advisor Thread Prototyping Have you: Threaded an app, but seen little benefit? Hit a scalability barrier? Delayed release due to sync. errors? Data Driven Threading Design: Quickly prototype multiple options Project scaling on larger systems Find synchronization errors before implementing threading Design without disrupting development Add Parallelism with Less Effort, Less Risk and More Impact Intel Advisor has allowed us to quickly prototype ideas for parallelism, saving developer time and effort Simon Hammond Senior Technical Staff Sandia National Laboratories 44
45 Faster Code Faster with Data Driven Design Intel Advisor Vectorization Optimization and Thread Prototyping Faster Vectorization Optimization: Vectorize where it will pay off most Quickly ID what is blocking vectorization Tips for effective vectorization Safely force compiler vectorization Optimize memory stride Breakthrough for Threading Design: Quickly prototype multiple options Project scaling on larger systems Find synchronization errors before implementing threading Design without disrupting development Less Effort, Less Risk and More Impact Part of Intel Parallel Studio for Windows* and Linux* 45
46 New! New for 2017! AVX-512, FLOPS, & More Intel Advisor Vectorization Optimization Next Gen Intel Xeon Phi Support Tune for AVX-512 with or without AVX-512 hardware Precise FLOPS calculation Enhanced Memory Access Analysis Easier Selection of High Impact Loops Batch Mode Workflow Saves Time Fast Answers with Loop Analytics 46
47 Intel MPI Library Intel Trace Analyzer and Collector
48 Intel MPI Library Overview Optimized MPI application performance Application-specific tuning Automatic tuning New! - Support for Intel Xeon Phi Processor (code named Knights Landing) New! Support for Intel Omni-Path Architecture Fabric Lower latency and multi-vendor interoperability Industry leading latency Performance optimized support for the fabric capabilities through OpenFabrics*(OFI) Faster MPI communication Optimized collectives Sustainable scalability up to 340K cores Native InfiniBand* interface support allows for lower latencies, higher bandwidth, and reduced memory requirements More robust MPI applications Seamless interoperability with Intel Trace Analyzer and Collector Applications CFD Crash Climate OCD BIO Other... Develop applications for one fabric Intel MPI Library Select interconnect fabric at runtime TCP/IP Omni-Path InfiniBand iwarp Achieve optimized MPI performance Shared Memory Intel MPI Library One MPI Library to develop, maintain & test for multiple fabrics Other Networks Fabrics Cluster 48
49 What's New: Intel MPI Library 2017 Ready for Intel Xeon Phi processors (code named Knights Landing, KNL); Ready for Intel Omni-Path Architecture fabric; Uses a specially optimized memcpy for KNL; Tuned shared memory collectives on single KNL nodes; General optimization of RMA; General optimization, faster startup time, and a faster MPI tune utility
50 Intel Trace Analyzer and Collector Overview Intel Trace Analyzer and Collector helps the developer: Visualize and understand parallel application behavior Evaluate profiling statistics and load balancing Identify communication hotspots Features Event-based approach Low overhead Excellent scalability Powerful aggregation and filtering functions Idealizer Automatically detect performance issues and their impact on runtime 50
51 MPI Performance Snapshot Scalable profiling for MPI and Hybrid Lightweight Low overhead profiling for 100K+ Ranks Scalability- Performance variation at scale can be detected sooner Identifying Key Metrics Shows MPI/OpenMP imbalances 51
52 What's New: Intel Trace Analyzer and Collector Intel Trace Analyzer and Collector will be ready for KNL; Scalability of the imbalance profiler improved by up to 10x; Improved MPI Snapshot feature HTML output
53 Additional Material Product page overview, features, FAQs, support Training materials movies, tech briefs, documentation Evaluation guides step by step walk through Reviews Additional Development Products: Intel Software Development Products For more detail on each component of Parallel Studio XE, visit Inside Blue. 53
55 Enhanced application performance with AVX-512 support Enhanced performance due to AVX-512 instructions taking advantage of FMA units, memcpy, new pre-fetch instructions, new transcendental instructions, MCDRAM, and increased number of cores. 55
56 Enhanced application performance with AVX-512 support
Key functionality / library domain: KNL features used to deliver enhanced performance (instructions, other)
Intel Math Kernel Library
*GEMMs/BLAS: Two FMA units + 2 instruction decoders are key; AVX512 FMA (vfmadd231ps or vfmadd231pd)
MP Linpack, LU/Cholesky/QR/LAPACK/SMP Linpack: Same as in BLAS (as the main LAPACK kernel is ?*GEMM) + greater core count; Prefetcht0 instruction; MCDRAM
2D and 3D FFTs, DNN, Sparse, Vector Statistics, Vector Math: Two FMA units + 2 instruction decoders; MCDRAM, tile-to-tile mesh; AVX512 FMA; Similar to BLAS/LAPACK, greater number of cores; Large number of cores for MT performance; Prefetcht0, prefetcht1 instructions; Masking support; Large core count; Depend on seq. BLAS level 3 Knights Landing improvement; New Transcendental Support Instructions: VGETEXP, VGETMANT, VRNDSCALE, VSCALEF, VFIXUPIMM, VRCP28, VRSQRT28, VEXP2
Intel Integrated Performance Primitives
All from Signal Processing (1D) up to Image (2D) and Volume (3D) processing: The main advantage inherited from LRB/KNC is support of mask registers and therefore support of predicates for all new instructions. Then, full 512-bit register palign support (no lane restrictions as for old AVX palign): _mm512_alignr_epi32, _mm512_alignr_epi64. Then, on-the-fly integer conversions: vpmovq{w b d}, vpmovq{w b}. And the last one, integer any-direction comparison: vpcmp{d q} and vpcmpu{d q}.
Intel Data Analytics Acceleration Library
Similar to BLAS/LAPACK, greater number of cores
Intel MPI Library
Used the compiler's AVX-512 version of memcpy (but w/ fix, failed CQ on ICC). Built IMPI w/ -fvisibility=hidden (make all symbols hidden by default and export only those needed externally). Addressed KNL micro-arch features, such as the short BTB, by reducing access to PLT/GOT. Reduced/simplified the critical path where possible. Addressed KNL front-end specifics.
57 Easy access to Parallel Studio XE Runtimes For Amazon Web Services users only Intel Parallel Studio XE Runtime Required to run applications built with the Intel Performance Libraries or Intel Compilers. Includes latest optimizations for Intel Architecture for faster application performance Linux Only Easy access for Amazon Web Services users at no cost Latest runtimes through Linux native repos YUM repo available now!
58 Educating with Webinar series about 2017 tools Expert talks about the new features Series of live webinars, Sept 13 - Nov 8, 2016 Attend live, or watch after the fact.
59 Educating with High Performance Programming Book Knights Landing specific details, programming advice and real world examples. Intel Xeon Phi Processor High Performance Programming Techniques to generally increase program performance on any system and prepare you better for Intel Xeon Phi processors. Available as of June 2016 I believe you will find this book is an invaluable reference to help develop your own Unfair Advantage James A. Manager Sandia National Laboratories 59
60 More education with software.intel.com/moderncode Online community growing collection of tools, trainings, support features Black Belts in parallelism from Intel & industry Intel HPC Developer Conferences developers share proven techniques and best practices hpcdevcon.intel.com Hands on training for developers and partners with remote access to Intel Xeon processor and Xeon Phi coprocessor-based clusters. software.intel.com/icmp Developer Access Program provides early access to Intel Xeon Phi processor codenamed Knights Landing + 1 year license for Intel Parallel Studio XE Cluster Edition. 60
61 Choices to Fit Needs Intel Tools All Products with support worldwide, for purchase. Intel Premier Support - private direct support from Intel support for past versions software.intel.com/products Most Products without Premier support via special programs for those who qualify students, educators, classroom use, open source developers, and academic researchers software.intel.com/qualify-for-free-software Community support only all tools: Students, Educators, classroom use, Open Source Developers, Academic Researchers (qualification required) Intel Performance Libraries without Premier support -Community licensing for Intel performance libraries no royalties, no restrictions based on company or project size software.intel.com/nest Community support only Intel Performance Libraries: Community Licensing (no qualification required) 61
62 What's New: Intel C++ Compiler SIMD Data Layout Templates to facilitate vectorization for your C++ code; Enhanced C11 and C++14 language standards support: sized deallocation, relaxed constexpr restrictions, variable templates, single quotation mark as a digit separator; Enhanced GNU* and Microsoft* compatibility; SSE Cast Support; Diagnostic improvements on template arguments
63 What's New: Intel Fortran Compiler Substantial Coarray Fortran performance improvement on non-trivial programs; Almost complete Fortran 2008 support; Enhanced Fortran 2008 and draft Fortran 2015 language standards support: implied-shape PARAMETER arrays, 2008 BIND(C) internal procedures, extended EXIT for all named blocks, pointer initialization; VS2013 Shell replaces VS2010 Shell on Windows
64 Impressive Performance Improvement: Intel Compiler OpenMP* Explicit Vectorization
Three lines added that take full advantage of both SSE and AVX. Pragmas are ignored by other compilers, so the code is portable.

    #pragma omp declare simd linear(z:40) uniform(L, N, Nmat) linear(k)
    float path_calc(float *z, float L[][VLEN], int k, int N, int Nmat)

    #pragma omp declare simd uniform(L, N, Nopt, Nmat) linear(k)
    float portfolio(float L[][VLEN], int k, int N, int Nopt, int Nmat)

    for (path=0; path<npath; path+=VLEN) {
        /* Initialise forward rates */
        z = z0 + path * Nmat;
        #pragma omp simd linear(z:Nmat)
        for(int k=0; k < VLEN; k++) {
            for(i=0;i<N;i++) {
                L[i][k] = L0[i];
            }
            /* LIBOR path calculation */
            float temp = path_calc(z, L, k, N, Nmat);
            v[k+path] = portfolio(L, k, N, Nopt, Nmat);
            /* move pointer to start of next block */
            z += Nmat;
        }
    }

Libor calculation speedup (normalized performance data, higher is better): Serial 1; SSE 4.2 3.51; Core-AVX2 6.61
Configuration: Intel Xeon CPU 3.50 GHz Haswell system (4 cores with Hyper-Threading On), running at 3.50GHz, with 32.0GB RAM, L1 Cache 256KB, L2 Cache 1.0MB, L3 Cache 8.0MB, 64-bit Windows* Server 2012 R2 Datacenter. Compiler options: SSE4.2: -O3 -Qopenmp -simd -QxSSE4.2 or AVX2: -O3 -Qopenmp -simd -QxCORE-AVX2. For more information go to
Benchmark Source: Intel Corporation. Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #
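For readers who want to try the explicit vectorization pattern from slide 64 outside the LIBOR code, here is a minimal, self-contained sketch. The kernel `euler_step` and the driver `advance_rates` are hypothetical and not part of the benchmark; they only show how `#pragma omp declare simd` with a `uniform` clause pairs with `#pragma omp simd` on the calling loop. A compiler without OpenMP SIMD support, or one run without the relevant option, ignores the pragmas and builds the scalar version.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical kernel (not from the slide): one explicit Euler step.
// uniform(drift, dt) tells the compiler these arguments hold the same
// value in every SIMD lane; `rate` varies per lane.
#pragma omp declare simd uniform(drift, dt)
inline float euler_step(float rate, float drift, float dt) {
    return rate + drift * rate * dt;
}

// Applies the kernel to every element. `#pragma omp simd` asks the
// compiler to vectorize the loop and call the SIMD variant of euler_step.
void advance_rates(std::vector<float>& rates, float drift, float dt) {
    float* r = rates.data();
    const std::size_t n = rates.size();
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        r[i] = euler_step(r[i], drift, dt);
    }
}
```

Whether the loop is actually vectorized depends on the compiler and its options, so it is worth confirming with a vectorization report rather than assuming it.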
65 Impressive Performance Improvement: Intel C++ Explicit Vectorization: SIMD Performance
One line added that takes full advantage of both SSE and AVX. Pragmas are ignored by other compilers, so the code is portable.

    #pragma simd vectorlength(8)
    for (int x = x0; x < x1; ++x) {
        float div = coef[0] * A_cur[x]
            + coef[1] * ((A_cur[x + 1] + A_cur[x - 1]) + (A_cur[x + Nx] + A_cur[x - Nx]) + (A_cur[x + Nxy] + A_cur[x - Nxy]))
            + coef[2] * ((A_cur[x + 2] + A_cur[x - 2]) + (A_cur[x + sx2] + A_cur[x - sx2]) + (A_cur[x + sxy2] + A_cur[x - sxy2]))
            + coef[3] * ((A_cur[x + 3] + A_cur[x - 3]) + (A_cur[x + sx3] + A_cur[x - sx3]) + (A_cur[x + sxy3] + A_cur[x - sxy3]))
            + coef[4] * ((A_cur[x + 4] + A_cur[x - 4]) + (A_cur[x + sx4] + A_cur[x - sx4]) + (A_cur[x + sxy4] + A_cur[x - sxy4]));
        A_next[x] = 2 * A_cur[x] - A_next[x] + vsq[s+x] * div;
    }

RTM-stencil calculation speedup (normalized performance data, higher is better): Serial 1; SSE 4.2 3.91; Core-AVX2 6.06
Configuration: Intel Xeon CPU 3.50 GHz Haswell system (4 cores with Hyper-Threading On), running at 3.50GHz, with 32.0GB RAM, L1 Cache 256KB, L2 Cache 1.0MB, L3 Cache 8.0MB, 64-bit Windows* Server 2012 R2 Datacenter. Compiler options: SSE4.2: -O3 -Qopenmp -simd -QxSSE4.2 or AVX2: -O3 -Qopenmp -simd -QxCORE-AVX2. For more information go to
66 Substantial Coarray Fortran performance improvement on non-trivial programs
[Figure: runtime performance relative to Intel Fortran 16.0 (higher is better), Intel 16.0 vs. Intel 17.0, for four benchmarks: EPCC microbenchmarks (University of Edinburgh); NAS Parallel benchmarks, coarray kernels, and coarray microbenchmarks (University of Houston). Plotted values as extracted: 3.70, 1.00, 1.40, 1.00, 1.00, 1.00, 1.23, 1.01]
Intel Confidential: Must be viewed under CNDA. All products, systems, dates and figures are preliminary based on current expectations, and are subject to change without notice.
Configuration: Windows hardware: HP DL320e Gen8 v2 (single-socket server) with Intel(R) Xeon(R) CPU E GHz, 32 GB RAM, HyperThreading is off; Linux hardware: HP BL460c Gen9 with Intel(R) Xeon(R) CPU E GHz, 256 GB RAM, HyperThreading is on. Software: Intel C++ compiler 16.0, Microsoft (R) C/C++ Optimizing Compiler Version for x86/x64, GCC Linux OS: Red Hat Enterprise Linux Server release 7.1 (Maipo), kernel el7.x86_64. Windows OS: Windows 8.1. SPEC* Benchmark ( Benchmark Source: Intel Corporation
67 SIMD Data Layout Template - Improve productivity and boost C++ performance Quickly convert Array of Structures to Structure of Arrays representation. Increase productivity: Use predefined templates with minimal effort, and let SDLT do the vectorization for you. Improve performance: SDLT vectorizes your code by making memory access contiguous, which can lead to more efficient code and better performance. Seamless integration: SDLT follows the familiar Intel vector programming model. "We used SDLT to vectorize the deformer code in Premo, the in-house animation tool for DreamWorks Animation. The performance improvements we were able to achieve were dramatic, and these improvements will translate directly into higher quality characters that will be seen on-screen in future movies. Also the library itself was easy to use and integrate into our existing codebase." Martin Watt, Principal Engineer, DreamWorks Animation
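The Array of Structures to Structure of Arrays conversion that this slide describes can be sketched by hand in plain C++. This is not the SDLT API; `Point3`, `Points3SoA`, `to_soa`, and `scale_x` are hypothetical names used only to show why the SoA layout gives the contiguous, unit-stride access the slide refers to.

```cpp
#include <cstddef>
#include <vector>

// Array of Structures element: x, y, z of one point sit next to each other,
// so the x values of consecutive points are 3 floats apart in memory.
struct Point3 {
    float x, y, z;
};

// Structure of Arrays: each field is stored contiguously.
struct Points3SoA {
    std::vector<float> x, y, z;
};

// Hand-written conversion from AoS to SoA (SDLT automates this step).
Points3SoA to_soa(const std::vector<Point3>& aos) {
    Points3SoA soa;
    soa.x.reserve(aos.size());
    soa.y.reserve(aos.size());
    soa.z.reserve(aos.size());
    for (const Point3& p : aos) {
        soa.x.push_back(p.x);
        soa.y.push_back(p.y);
        soa.z.push_back(p.z);
    }
    return soa;
}

// Unit-stride loop over one field: a simple pattern for a vectorizer.
void scale_x(Points3SoA& pts, float factor) {
    float* x = pts.x.data();
    const std::size_t n = pts.x.size();
    #pragma omp simd
    for (std::size_t i = 0; i < n; ++i) {
        x[i] *= factor;
    }
}
```

The trade-off in a hand-written version is that every field access in the existing code has to change; keeping an AoS-like interface over SoA storage is the productivity argument the slide makes for the template library.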
69 Python Landscape Adoption of Python continues to grow among domain specialists and developers for its productivity benefits Challenge #1: Domain specialists are not professional software programmers. Challenge #2: Python performance limits migration to production systems. Intel's solution is to: Accelerate Python performance; Enable easy access; Empower the community
70 Access multiple options for faster Python
Included in Intel Distribution for Python*
Accelerate with native libraries: NumPy, SciPy, Scikit-Learn, Theano, Pandas, pyDAAL; Intel MKL, Intel DAAL
Exploit vectorization and threading: Cython + Intel C++ compiler; Numba + Intel LLVM
Better/Composable threading: Cython, Numba, Pyston; Threading composability for MKL, CPython, Blaze/Dask, Numba
Multi-node parallelism: mpi4py, Distarray; Intel native libraries: Intel MPI
Integration with Big Data, ML platforms and frameworks: Spark, Hadoop, Trusted Analytics Platform
Better performance profiling: Extensions for profiling mixed Python & native/JIT codes
"I expected Intel's numpy to be fast but it is significant that plain old python code is much faster with the Intel version too." Dr. Donald Kinghorn, Puget Systems Review
71 Intel Distribution for Python* Reviews Intel's Python distribution provides a major math boost The still-in-beta Python distribution uses Math Kernel Library to speed up processing on Intel hardware The distribution's main touted advantage is speed -- but not a PyPy-style general speedup via a JIT. Instead, the MKL speeds up certain math operations so that they run faster on one thread and multiple threads. "I expected Intel's numpy to be fast but it is significant that plain old python code is much faster with the Intel version too." Dr. Donald Kinghorn, Puget Systems Review HPC Podcast Looks at Intel's Pending Distribution of Python Yes, Intel is doing their own Python build! It is still in beta but I think it's a great idea..yeah, it's important!
72 Automatic Performance Scaling Back Up from the Core, to Multicore, to Many Core and Beyond Intel MKL Extracting performance from the computing resources Core: vectorization, prefetching, cache utilization Multi-Many core (processor/socket) level parallelization Multi-socket (node) level parallelization Clusters scaling Sequential Intel MKL MKL + OpenMP Many Core Intel Xeon Phi TM Coprocessor MKL + Intel MPI 72
73 Big Data & Machine Learning Challenge Volume, Value, Velocity, Variety Problem: Big data needs high performance computing. Many big data applications leave performance on the table because they are not optimized for the underlying hardware. Solution: A performance library provides building blocks that can be easily integrated into big data analytics workflows.
74 Intel Data Analytics Acceleration Library (Intel DAAL) An Intel-optimized library that provides building blocks for all data analytics stages, from data preparation to data mining & machine learning Python, Java & C++ APIs Can be used with many platforms (Hadoop*, Spark*, R*, Matlab*, ...) but not tied to any of them Flexible interface to connect to different data sources (CSV, SQL, HDFS, ...) Windows*, Linux*, and OS X* Developed by the same team as the industry-leading Intel Math Kernel Library Open source; free community-supported and commercial premium-supported options Also included in Parallel Studio XE suites
75 Intel Threading Building Blocks Good Tuning Data Gets Good Results "Using Intel TBB's new flow graph feature, we accomplished what was previously not possible, parallelize a very sizable task graph with thousands of interrelationships, all in about a week." Robert Link, GCAM Project Scientist, Pacific Northwest National Lab "Intel's TBB was an invaluable help in multithreading our in-house renderer CGIStudio and is now also used in animation and simulation software. Beside the ease of use, it takes care of the two most important aspects of running an application on multiple cores -- load balancing and scalability." Maurice van Swaaji, Blue Sky Studios "Intel TBB provided us with optimized code that we did not have to develop or maintain for critical system services. I could assign my developers to code what we bring to the software table." Michaël Rouillé, CTO, Golaem More Case Studies
76 Intel Threading Building Blocks (Intel TBB) C++ template library to simplify the task of adding parallelism on a single device or across multiple devices Specify tasks instead of manipulating threads Intel TBB maps your logical tasks onto threads with full support for nested parallelism Targets threading for scalable performance Uses proven, efficient parallel patterns Uses work stealing to support the load balance of unknown execution time for tasks. It has the advantage of low-overhead polymorphism. Flow graph feature allows developers to easily express dependency and data flow graphs Using Intel TBB s new flow graph feature, we accomplished what was previously not possible, parallelize a very sizable task graph with thousands of interrelationships all in about a week. Robert Link GCAM Project Scientist Pacific Northwest National Lab Has high level parallel algorithms and concurrent containers and low level building blocks like scalable memory allocator, locks and atomic operations. Commercial support for Intel Atom, Core, Xeon processors, and for Intel Xeon Phi processors and coprocessors More Case Studies 76
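The idea of specifying tasks instead of manipulating threads, with nested parallelism, can be shown in a few lines. To stay self-contained this sketch deliberately uses `std::async` from the C++ standard library as a stand-in and is not Intel TBB code: it has no work-stealing scheduler and no thread pool, which are exactly what the library adds. The function `task_sum` and its `grain` parameter are hypothetical.

```cpp
#include <cstddef>
#include <future>
#include <numeric>
#include <vector>

// Recursively splits [first, last) into two halves until a chunk holds at
// most `grain` elements, then sums the chunk serially. Each split creates
// one logical task; no thread is named or managed explicitly.
long long task_sum(const int* first, const int* last, std::size_t grain) {
    const std::size_t n = static_cast<std::size_t>(last - first);
    if (n <= grain || n < 2) {
        return std::accumulate(first, last, 0LL);
    }
    const int* mid = first + n / 2;
    // Left half becomes an asynchronous task (std::launch::async runs it
    // on another thread).
    std::future<long long> left =
        std::async(std::launch::async, task_sum, first, mid, grain);
    // Right half is processed by the current thread; the recursion gives
    // nested parallelism.
    const long long right = task_sum(mid, last, grain);
    return left.get() + right;
}
```

The grain size bounds how many tasks are created; with a task library that reuses a fixed set of worker threads, the same decomposition avoids the cost of starting a thread per task.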
77 Resources and Availability Intel Threading Building Blocks (Intel TBB) Resources Commercial product page: software.intel.com/intel-tbb Flow Graph Designer: software.intel.com/articles/flow-graph-designer User Forum: software.intel.com/forums/intel-threading-building-blocks Available on Linux, Windows, macOS and Android Commercially available with Intel Parallel Studio XE 2017: software.intel.com/en-us/intel-parallel-studio-xe Community licensing for Intel Performance Libraries, without Premier support: software.intel.com/nest The Open-Source Community Site:
78 Challenges faced by developers Performance optimization is a never-ending task. Completing key processing tasks within designated time constraints is a critical issue. Hand-optimizing code for one platform can make it perform worse on another platform. With manual optimization, code becomes more complex and difficult to maintain. Code should run as fast as possible without extra effort.
79 Different Domains in Intel IPP Image Processing Signal Processing Data Compression Computer Vision Cryptography Color Conversion Vector Math String Processing Image Domain Signal Domain Data Domain 79
80 Intel Integrated Performance Primitives Building Blocks for Image, Signal & Data Processing Provides developers with ready-to-use functions to accelerate image, signal, data processing & cryptography computation tasks. Optimized for Intel Atom, Core, Xeon processors, and for Intel Xeon Phi processors and coprocessors. Licensed versions available on Linux, Windows, macOS, Android Available as a part of: Intel Parallel Studio XE software.intel.com/en-us/intel-parallel-studio-xe Community Licensing for Intel Performance Libraries, without Premier support: software.intel.com/nest
81 Correctness Tools Increase ROI By 12%-21% Cost Factors Square Project Analysis CERT: U.S. Computer Emergency Readiness Team, and Carnegie Mellon CyLab NIST: National Institute of Standards & Technology : Square Project Results Size and complexity of applications is growing Correctness tools find defects during development prior to shipment Reworking defects is 40%-50% of total project effort Reduce time, effort, and cost to repair Find errors earlier when they are less expensive to fix 81
82 Race Conditions Are Difficult to Diagnose
They only occur occasionally and are difficult to reproduce.

Correct
Thread 1       Thread 2       Shared Counter
                              0
Read count                    0
Increment                     0
Write count                   1
               Read count     1
               Increment      1
               Write count    2

Incorrect
Thread 1       Thread 2       Shared Counter
                              0
Read count                    0
               Read count     0
Increment                     0
               Increment      0
Write count                   1
               Write count    1
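The two schedules on slide 82 can be replayed deterministically to see why the second one loses an update. The sketch below is an illustration only: it uses no real threads, and the names `Step`, `Op`, `Worker`, `replay`, `kCorrect`, and `kIncorrect` are made up for this example. Each worker reads the shared counter into a private copy, increments the copy, and writes it back, which is what a non-atomic `count++` amounts to.

```cpp
#include <vector>

// The three steps of a non-atomic increment.
enum class Op { Read, Increment, Write };

// One scheduled step: which thread (0 or 1) performs which operation.
struct Step {
    int thread;
    Op op;
};

// A thread's private copy of the counter between Read and Write.
struct Worker {
    int local = 0;
};

// Replays a fixed interleaving against one shared counter that starts at 0
// and returns the final counter value.
int replay(const std::vector<Step>& schedule) {
    int shared = 0;
    Worker workers[2];
    for (const Step& s : schedule) {
        Worker& me = workers[s.thread];
        switch (s.op) {
            case Op::Read:      me.local = shared; break;
            case Op::Increment: ++me.local;        break;
            case Op::Write:     shared = me.local; break;
        }
    }
    return shared;
}

// "Correct" schedule: thread 2 starts only after thread 1 has written.
const std::vector<Step> kCorrect = {
    {0, Op::Read}, {0, Op::Increment}, {0, Op::Write},
    {1, Op::Read}, {1, Op::Increment}, {1, Op::Write}};

// "Incorrect" schedule: both threads read 0 before either one writes.
const std::vector<Step> kIncorrect = {
    {0, Op::Read}, {1, Op::Read},
    {0, Op::Increment}, {1, Op::Increment},
    {0, Op::Write}, {1, Op::Write}};
```

`replay(kCorrect)` yields 2 and `replay(kIncorrect)` yields 1, matching the final values on the slide. In real code the usual remedies are to make the counter a `std::atomic<int>` or to guard the read-modify-write with a `std::mutex`.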
83 Debug Memory & Threading Errors Intel Inspector Find and eliminate errors Memory leaks, invalid access Races & deadlocks C, C++ and Fortran (or a mix) Simple, Reliable, Accurate No special recompiles Use any build, any compiler (1) Analyzes dynamically generated or linked code Inspects 3rd-party libraries without source Productive user interface + debugger integration Command line for automated regression analysis Clicking an error instantly displays source code snippets and the call stack Fits your existing process (1) That follows common OS standards.
84 Profile Python & Go! And Mixed Python / C++ / Fortran Intel VTune Amplifier New! Low Overhead Sampling Accurate performance data without high overhead instrumentation Launch application or attach to a running process Precise Line Level Details No guessing, see source line level detail Mixed Python / native C, C++, Fortran Optimize native code driven by Python 84
85 Three Keys to HPC Performance: Threading, Memory Access, Vectorization Intel VTune Amplifier New! Threading: CPU Utilization Serial vs. Parallel time Top OpenMP regions by potential gain Tip: Use hotspot OpenMP region analysis for more detail Memory Access Efficiency Stalls by memory hierarchy Bandwidth utilization Tip: Use Memory Access analysis Vectorization: FPU Utilization FLOPS estimates from sampling Tip: Use Intel Advisor for precise metrics and vectorization optimization For 3rd, 5th, 6th Generation Intel Core processors and second generation Intel Xeon Phi processor code named Knights Landing. 85
86 Application Performance Snapshot Discover opportunities for better performance with vectorization & threading Objectives Simple enough to run during a coffee break Highlight where code modernization can help Users Performance teams fast prioritization of which apps will benefit most All Developers size the potential performance gain from code modernization Non-Objectives Actionable tuning data that is another tool. Snapshot is just a fast health check. Free download: Also included with Intel Parallel Studio and Intel VTune Amplifier products. Preview! 86
87 Free download: Also included with Intel Parallel Studio Cluster Edition. 87
88 Storage Performance Snapshot Discover if faster storage can improve server/workstation performance Learn It On One Coffee Break Easy setup Quickly see meaningful data System view of workload Any architecture Targeted Systems Servers & workstations with directly attached storage Not scale out storage clusters Linux kernel 2.6 or newer dstat 0.7 or newer Windows Server 2012, Windows 8 or newer Windows OS Preview! Free download: Also included with Intel Parallel Studio and Intel VTune Amplifier products. 88
89 Get Faster Code Faster! Intel Advisor Vectorization Optimization Have you: Recompiled for AVX2 with little gain Wondered where to vectorize? Recoded intrinsics for new arch.? Struggled with compiler reports? New! Data Driven Vectorization: What vectorization will pay off most? What s blocking vectorization? Why? Are my loops vector friendly? Will reorganizing data increase performance? Is it safe to just use pragma simd? "Intel Advisor s Vectorization Advisor permitted me to focus my work where it really mattered. When you have only a limited amount of time to spend on optimization, it is invaluable." Gilles Civario Senior Software Architect Irish Centre for High-End Computing 89
90 Next Gen Intel Xeon Phi Support. Vectorization Advisor runs on and optimizes for Intel Xeon Phi. New! AVX-512 ERI specific to Intel Xeon Phi. Efficiency (72%), Speed-up (11.5x), Vector Length (16). Performance optimization problem and advice on how to fix it. 90
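The three numbers quoted on this slide appear to be related. A minimal sketch, assuming efficiency is the estimated speed-up divided by the vector length (our reading of the figures, not a definition the slide states); the function name is illustrative only:

```python
def vectorization_efficiency(speedup: float, vector_length: int) -> float:
    """Fraction of the ideal speed-up achieved.

    Assumption: the ideal speed-up of a vectorized loop equals its
    vector length (number of lanes), so efficiency = speedup / lanes.
    """
    if vector_length <= 0:
        raise ValueError("vector_length must be positive")
    return speedup / vector_length


# Figures taken from the slide: speed-up 11.5x, vector length 16.
eff = vectorization_efficiency(11.5, 16)
print(f"{eff:.0%}")  # prints 72%
```

Under that assumption, a loop that reached the full 16x would score 100%, so the 72% figure indicates how much headroom remains.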
91 Precise Repeatable FLOPS Metrics. Intel Advisor Vectorization Optimization. New! FLOPS by loop and function. All recent Intel processors (not co-processors). Instrumentation (count FLOP) plus sampling (time with low overhead). Adjusted for masking with AVX-512 processors. 91
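The idea of combining a counted number of floating-point operations with a measured time can be illustrated with a little arithmetic. This is only a sketch of the concept, not how Intel Advisor is implemented; the loop, mask value, and timing below are hypothetical, and "adjusted for masking" is interpreted here as counting only the active lanes of a masked vector instruction:

```python
def active_lanes(mask: int, vector_length: int) -> int:
    """Count the set bits among the low `vector_length` bits of `mask`."""
    return bin(mask & ((1 << vector_length) - 1)).count("1")


def gflops(flop_count: int, seconds: float) -> float:
    """Rate in GFLOPS from a counted FLOP total and a measured time."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return flop_count / seconds / 1e9


# Hypothetical loop: 250 million 16-lane vector adds executed under
# mask 0x00FF, so only 8 of the 16 lanes do useful work each time.
iterations = 250_000_000
flop = iterations * active_lanes(0x00FF, 16)   # 2e9, not 4e9
print(gflops(flop, 2.0))  # prints 1.0
```

Counting all 16 lanes instead would report twice the rate for the same work, which is why a mask-aware count matters on AVX-512 hardware.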
92 Enhanced Memory Access Analysis. Intel Advisor. New! Are you bandwidth or compute limited? Measure Footprint: Compare to cache size. Does it fit in cache? Variable References: Map data to variable names for easier analysis. Gather/Scatter: Detect unneeded gather/scatters that reduce performance. 92
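Two of the questions on this slide can be illustrated with simple checks. The sketch below is an assumption-laden illustration, not the tool's method: the 1 MiB cache size is hypothetical, the footprint is approximated as element count times element size, and a gather is treated as "unneeded" when its indices are consecutive, since a plain contiguous vector load would fetch the same data.

```python
from typing import Sequence


def fits_in_cache(num_elements: int, element_bytes: int, cache_bytes: int) -> bool:
    """Compare a simple footprint estimate against a cache capacity."""
    return num_elements * element_bytes <= cache_bytes


def gather_is_unneeded(indices: Sequence[int]) -> bool:
    """True when the gathered indices are consecutive (unit stride)."""
    return all(b - a == 1 for a, b in zip(indices, indices[1:]))


L2_BYTES = 1024 * 1024  # hypothetical 1 MiB cache, for illustration only

print(fits_in_cache(100_000, 8, L2_BYTES))    # 800,000 bytes of doubles
print(fits_in_cache(1_000_000, 8, L2_BYTES))  # 8,000,000 bytes of doubles
print(gather_is_unneeded([4, 5, 6, 7]))       # contiguous access
print(gather_is_unneeded([0, 8, 16, 24]))     # strided access
```

A strided pattern like the last one is a typical sign that reorganizing the data layout (for example, from an array of structures to a structure of arrays) may allow unit-stride loads.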
93 Start Tuning for AVX-512 without AVX-512 hardware. Intel Advisor - Vectorization Advisor. New! Use the -axCOMMON-AVX512 -xAVX compiler flags to generate both code paths: an AVX(2) code path (executed on Haswell and earlier processors) and an AVX-512 code path for newer hardware. Compare AVX and AVX-512 code with Intel Advisor: Inserts (AVX2) vs. Gathers (AVX-512). Speed-up estimate: 13.5x (AVX2) vs. 30.6x (AVX-512)
94 Faster Code Faster Using Intel Advisor. Vectorization: "Intel Advisor's Vectorization Advisor permitted me to focus my work where it really mattered. When you have only a limited amount of time to spend on optimization, it is invaluable." Gilles Civario, Senior Software Architect, Irish Centre for High-End Computing. "Intel Advisor's Vectorization Advisor fills a gap in code performance analysis. It can guide the informed user to better exploit the vector capabilities of modern processors and coprocessors." Dr. Luigi Iapichino, Scientific Computing Expert, Leibniz Supercomputing Centre. Threading: "Intel Advisor has been extremely helpful in identifying the best pieces of code for parallelization. We can save several days of manual work by targeting the right loops and we can use Advisor to find potential thread safety issues to help avoid problems later on." Carlos Boneti, HPC software engineer, Schlumberger. "Intel Advisor has allowed us to quickly prototype ideas for parallelism, saving developer time and effort, and has already been used to highlight subtle parallel correctness issues in complex multi-file, multi-function algorithms." Simon Hammond, Senior Technical Staff, Sandia National Laboratories. More Case Studies 94
95 Speaker: the speaker notes are important for this presentation. Be sure to read them.
96 Optimizing Performance On Parallel Hardware. It's an iterative process. Cluster scalable? N: Tune MPI (ignore if you are not targeting clusters). Y: Effective threading? N: Thread. Y: Memory bandwidth sensitive? Y: Optimize Bandwidth. N: Vectorize. 96
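The flowchart on this slide can be read as a short decision procedure. The sketch below is one plausible reading of the flattened diagram (the branch order is our assumption, since the arrows did not survive transcription), and the function and parameter names are made up for illustration:

```python
def next_tuning_step(
    cluster_scalable: bool,
    threading_effective: bool,
    bandwidth_sensitive: bool,
    targets_cluster: bool = True,
) -> str:
    """Return the next optimization step suggested by the flowchart.

    Assumed order: MPI scaling first (skipped when not targeting
    clusters), then threading, then memory bandwidth, then vectorization.
    """
    if targets_cluster and not cluster_scalable:
        return "Tune MPI"
    if not threading_effective:
        return "Thread"
    if bandwidth_sensitive:
        return "Optimize Bandwidth"
    return "Vectorize"


# Because the process is iterative, the function is re-evaluated after
# each round of tuning with freshly measured answers.
print(next_tuning_step(False, False, False))                      # Tune MPI
print(next_tuning_step(True, False, True))                        # Thread
print(next_tuning_step(True, True, True))                         # Optimize Bandwidth
print(next_tuning_step(False, True, False, targets_cluster=False))  # Vectorize
```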
97 Performance Analysis Tools for Diagnosis. Intel Parallel Studio XE. Same flow as the previous slide (Cluster scalable? Effective threading? Memory bandwidth sensitive?), annotated with diagnosis tools. Tune MPI: Intel Trace Analyzer & Collector (ITAC), Intel MPI Snapshot, Intel MPI Tuner. Thread, Vectorize, Optimize Bandwidth: Intel VTune Amplifier, Intel Advisor. 97
98 Tools for High Performance Implementation. Intel Parallel Studio XE. Same flow as the previous slides (Cluster scalable? Effective threading? Memory bandwidth sensitive?), annotated with implementation tools. Tune MPI: Intel MPI Library, Intel MPI Benchmarks. Thread, Vectorize, Optimize Bandwidth: Intel Compiler, Intel Math Kernel Library, Intel IPP Media & Data Library, Intel Data Analytics Library, Intel Cilk Plus, Intel OpenMP*, Intel TBB Threading Library. 98
99 Legal Disclaimer & Optimization Notice. INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT. Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products. Copyright 2015, Intel Corporation. All rights reserved. Intel, Pentium, Xeon, Xeon Phi, Core, VTune, Cilk, and the Intel logo are trademarks of Intel Corporation in the U.S. and other countries. Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice. Notice revision #