Hans Pabst, January 24th 2013, Software and Services Group, Intel Corporation
1 VPE Swiss Workshop: Infrastruktur für rechenintensive Anwendungen (Infrastructure for Compute-Intensive Applications)
Intel Xeon Phi Product Family: An Overview
Hans Pabst, January 24th 2013, Software and Services Group, Intel Corporation
2 Agenda
- Overview and Motivation
- Software Developer Tools
- Roadmap and Summary
- Questions and Answers
3 Agenda
- Overview and Motivation
- Software Developer Tools
- Roadmap and Summary
- Questions and Answers
[Figure: block diagram showing vector IA cores with coherent caches on an interprocessor network, fixed-function logic, GDDR5 memory and I/O interfaces, and PCIe]
4 Overview
Intel Xeon
- General instruction streams
- High single-thread performance
- High memory capacity
- Core/memory aggregation via sockets and nodes
- Instruction set extensions: SIMD (e.g., Intel AVX), virtualization, AES, etc.
Intel Xeon Phi
- General instruction streams
- Highly parallel workloads
- High memory bandwidth
- Up to 61 cores/die, aggregation via PCIe and nodes
- SIMD (512-bit registers): gather/scatter, FMA, masked instructions
Intel Xeon Phi is a coprocessor for highly parallel workloads.
5 History of SIMD ISA extensions*
- Intel Pentium processor (1993)
- MMX (1997)
- Intel Streaming SIMD Extensions (Intel SSE in 1999 to Intel SSE4.2 in 2008)
- Intel Advanced Vector Extensions (Intel AVX in 2011 and Intel AVX2 in 2013)
- Intel Many Integrated Core Architecture (Intel MIC Architecture in 2013)
* Illustrated with the number of 32-bit data elements that are processed by one packed instruction.
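The slide's chart did not survive the transcript. The quantity it illustrates, the number of 32-bit elements per packed instruction, follows directly from the register width. A minimal sketch, assuming the usual register widths of 64 bits (MMX), 128 bits (Intel SSE), 256 bits (Intel AVX), and 512 bits (Intel MIC Architecture); the function name is hypothetical:

```c
#include <assert.h>

/* Number of 32-bit data elements that fit into one SIMD register of the
   given width in bits, i.e., the elements handled by one packed instruction. */
unsigned simd_lanes32(unsigned register_bits)
{
    assert(register_bits % 32u == 0u); /* whole elements only */
    return register_bits / 32u;
}
```

With these widths the progression is 2, 4, 8, and 16 elements per instruction.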
6 Performance Motivation
Remember Pollack's rule: Performance ~ sqrt(Die Area)
- 4x the die area gives 2x the performance in one core, but 4x the performance when dedicated to 4 cores
Conclusions (with respect to Pollack's rule)
- A powerful handle to adjust Performance/Watt
- Weaker cores can be beneficial (but many of them)
- Parallel hardware, parallel algorithms, appropriate tools
[Figure: timeline from the GHz era through multicore to manycore]
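The trade-off behind Pollack's rule can be made concrete with a little arithmetic. The sketch below is illustrative only: it measures die area in units of one baseline core, assumes perfectly parallel work for the many-core case, and uses hypothetical function names.

```c
#include <assert.h>

/* Die area (in baseline-core units) that ONE core needs in order to reach
   `perf` times the baseline performance. Pollack's rule says performance
   grows with the square root of the area, so the area grows with perf^2. */
double single_core_area(double perf)
{
    assert(perf > 0.0);
    return perf * perf;
}

/* Die area needed to reach `perf` times the baseline throughput with
   baseline-sized cores, assuming the work is perfectly parallel. */
double many_core_area(double perf)
{
    assert(perf > 0.0);
    return perf;
}
```

For a 4x target, the single large core needs 16x the area, while four small cores need 4x; this is the "weaker cores, but many of them" argument.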
7 Speedup?
Peak performance by example
- Intel Xeon E (not the top bin): 2S x 8C x 2.7 GHz x 4F DP x 2 ops* ~345 GF/s
- Intel Xeon Phi 3120A (lowest bin): 57C x 1.1 GHz x 8F DP x 2 ops* ~1 TF/s
Amdahl's Law determines the total speedup of a mixture of serial and parallel code sections, e.g., P=80% and S=1000/345 give ~2x
Leverage people's knowledge with common tools and practices!
* Two operations ("ops") due to separate FMUL/FADD ports on Intel Xeon, and FMA on Intel Xeon Phi.
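Both numbers on this slide can be reproduced with a few lines of C. This is a sketch with hypothetical function names. The peak formula multiplies the factors exactly as listed on the slide, and the Amdahl formula is the standard 1 / ((1 - P) + P / S).

```c
#include <assert.h>

/* Peak GFLOP/s = sockets x cores x GHz x SIMD lanes (DP) x ops per cycle. */
double peak_gflops(unsigned sockets, unsigned cores, double ghz,
                   unsigned lanes, unsigned ops_per_cycle)
{
    return (double)sockets * cores * ghz * lanes * ops_per_cycle;
}

/* Amdahl's Law: total speedup when a fraction p (0..1) of the runtime is
   accelerated by a factor s and the remaining part runs unchanged. */
double amdahl_speedup(double p, double s)
{
    assert(p >= 0.0 && p <= 1.0 && s > 0.0);
    return 1.0 / ((1.0 - p) + p / s);
}
```

With P = 0.8 and S = 1000/345 (about 2.9), the total speedup is about 2.1, which is the "~2x" on the slide: the 20% that stays serial takes up almost half of the new runtime.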
8 What about 100x or more?
Compared with a single core and without SIMD
- For example: Intel Xeon E, 1C x 3.5 GHz x 1F DP x 2 ops ~7 GF/s (Turbo Boost: 3.5 GHz)
- Many programs today are not fully optimized for the cores and vectors in today's CPUs
Conclusions
- Because of Pollack's rule, we want accelerators
- Because of Amdahl's Law ("low speedup"), high effort ("rewrite"), or workloads that are less applicable*, we need:
  - Hardware not breaking assumptions made in software
  - Common software tools that are not vendor-locked
* Examples: to miss FMA (50% off from peak perf.), incoherent branches, etc.
9 Performance: GEMM, STREAM, and SMP Linpack
- GEMM, Cholesky / LU / QR decomposition, SMP Linpack, etc.
- Example:
10 Synthetic Benchmark Summary (Intel MKL)
[Figure: four bar charts comparing a 2S Intel Xeon processor with 1 Intel Xeon Phi coprocessor; higher is better. Chart annotations: 86% efficient, 82% efficient, 75% efficient; ECC on / ECC off]
- SGEMM (GF/s): up to 2.9X
- DGEMM (GF/s): up to 2.8X
- SMP Linpack (GF/s): up to 2.6X
- STREAM Triad (GB/s): up to 2.2X
Notes
1. Intel Xeon Processor E used for all: SGEMM matrix = x 13824, DGEMM matrix 7936 x 7936, SMP Linpack matrix x
Intel Xeon Phi coprocessor SE10P (ECC on) with Gold Release Candidate SW stack: SGEMM matrix = x 15360, DGEMM matrix 7680 x 7680, SMP Linpack matrix x
11 Agenda
- Overview and Motivation
- Software Developer Tools
- Roadmap and Summary
- Questions and Answers
Shared Memory Application Development; Shared, Distributed, and Hybrid Memory App. Development
12 Intel Parallel Studio XE 2013
Phase | Productivity Tool | Feature | Benefit
Advanced Parallel Design | Intel Advisor XE | Analyze existing code base and find opportunities for parallelization. | Easier analysis and performance heuristics, find compute hotspots and check for parallelization strategies.
Advanced Build and Debug | Intel Composer XE | C/C++ and Fortran compilers, performance libraries, and parallel models | Application performance, scalability and quality for current multicore and future many-core systems.
Advanced Verify | Intel Inspector XE | Memory & threading error checking tool for higher code reliability & quality | Increases productivity and lowers cost by catching memory and threading defects early
Advanced Tune | Intel VTune Amplifier XE | Performance profiler to optimize performance and scalability | Removes guesswork, saves time, makes it easier to find performance and scalability bottlenecks. Combines ease of use with deeper insights.
13 How and where to optimize?
1. Prefer a library that solves the problem, and/or choose an appropriate/parallel algorithm*
2. Optimize your own code
   a) Across multiple cores
   b) In-core, e.g., SIMD
(ordered by anticipated impact; use tools to qualify and check)
    /* c (M x N) = a (M x K) * b (K x N), all matrices row-major */
    for (int i = 0; i < M; ++i) {
      for (int j = 0; j < N; ++j) {
        c[i*N+j] = 0;
        for (int k = 0; k < K; ++k) {
          c[i*N+j] += a[i*K+k] * b[k*N+j];
        }
      }
    }
Intel Performance Library!
* A parallel algorithm is not necessarily an incremental optimization of a serial algorithm.
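For the case where no library is used, steps 2a and 2b could be applied to the slide's triple loop roughly as follows. This is a sketch, not the author's code: the function name is hypothetical, the matrices are assumed to be dense, row-major, and non-overlapping, and the OpenMP pragma only takes effect when the compiler's OpenMP option is enabled. A tuned library GEMM would normally still be the first choice (step 1).

```c
#include <assert.h>
#include <stddef.h>

/* c (M x N) = a (M x K) * b (K x N); all matrices are row-major.
   `restrict` promises the compiler that c does not overlap a or b. */
void matmul(int M, int N, int K,
            const double *a, const double *b, double *restrict c)
{
    assert(M >= 0 && N >= 0 && K >= 0);
    /* (a) Across cores: the rows of c are independent of each other, so the
       outer loop can be shared among threads. */
#ifdef _OPENMP
#pragma omp parallel for
#endif
    for (int i = 0; i < M; ++i) {
        double *ci = c + (size_t)i * N;
        for (int j = 0; j < N; ++j) ci[j] = 0.0;
        /* (b) In-core: with k outside of j, the innermost loop walks b and c
           with unit stride, which makes it a candidate for vectorization. */
        for (int k = 0; k < K; ++k) {
            const double aik = a[(size_t)i * K + k];
            const double *bk = b + (size_t)k * N;
            for (int j = 0; j < N; ++j) ci[j] += aik * bk[j];
        }
    }
}
```

The loop interchange changes only the order of the additions per element, not which products are summed.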
14 Intel Math Kernel Library (Intel MKL)
- Linear Algebra: BLAS, Sparse BLAS; LAPACK solvers; Sparse Solvers (DSS, PARDISO); Iterative solver (RCI); ScaLAPACK, PBLAS
- Fast Fourier Transforms: Multidimensional; FFTW interfaces; Cluster FFT; Trig. Transforms; Poisson solver; Convolution via VSL
- Vector Math: Trigonometric; Hyperbolic; Exponential, Logarithmic; Power / Root
- Random Number Generators: Congruential; Wichmann-Hill; Mersenne Twister; Sobol; Niederreiter; Non-deterministic
- Summary Statistics: Kurtosis; Variation coefficient; Quantiles; Order statistics; Min/max; Variance-covariance
- Data Fitting: Spline-based; Interpolation; Cell search
15 Intel MKL Features
- Single-threaded and multi-threaded libraries*
- Cluster support for important domains
- Support for large problem sizes (ILP64)
- Conditional Numerical Reproducibility (CNR)
- Support for Intel Xeon Phi coprocessors
  - Automatic offload and compiler-assisted offload
  - Manycore-hosted execution, cluster support, etc.
- As always: enabled early for future hardware
  - Haswell support: AVX2 and FMA3 instruction sets
* Intel MKL Link Line Advisor:
16 Intel MKL Compilation and Linkage
- Intel MKL supports Linux*, Mac OS* X, and Windows* (platform's default compiler, Intel Compiler as well as non-Intel compilers and their OpenMP* runtimes)
- Intel MKL Link Line Advisor: -us/articles/intel-mkl-linkline-advisor/
17 Use Cases
- Iterative Solver (RCI): customize solver steps
- PBLAS: distribute easily
- VML: balance accuracy and performance
- RNG: safety and reliability
- VSL*: Did you know that Intel MKL comes with some statistics?
* For example, to detect outliers or to predict values.
18 Execution Models (Intel Xeon Phi)
Intel MKL Automatic Offload (AO)
- Transparent data transfer and execution management
- Limited to key functions (sufficient FLOP/Byte ratio)
- Automatically uses host and (multiple) targets
- No code changes required
Compiler Assisted Offload (CAO)
- Explicit control of data transfer / persistence
- Intel Compiler offload pragmas/directives
  - Language Extension for Offload (LEO)
  - OpenMP* 4.0 standard (draft)
- Can be used together with Automatic Offload
Native Execution*
- Uses the coprocessors as independent nodes (a.k.a. manycore-hosted execution)
- Input data is copied to targets in advance
* In fact, an offloaded code section (CAO) that calls, e.g., Intel MKL is calling into a library that is native.
19 Intel MKL Automatic Offload (AO)
Control automatic offload (hybrid execution!)
- Environment variable: MKL_MIC_ENABLE=1
- Remember: sufficient problem size needed (Byte/FLOP ratio)
- Service functions take precedence (work division, etc.)
- Upcoming optimizations (Intel MKL )
Supported functions (more to come)
- BLAS level 3: GEMM, TRMM, TRSM
- LAPACK: Cholesky, LU, QR
Multiple cards per node
- Only GEMM (Intel MKL )
Check for offload (also applies to CAO)
- OFFLOAD_REPORT=<0|1|2>, or call mkl_mic_set_offload_report( )
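Why the problem size matters for offload can be estimated from the ratio of floating-point work to transferred data. The sketch below is a rough model, not Intel MKL's actual offload heuristic: it assumes double precision (8 bytes per element), that A and B are sent to the coprocessor once and C is only written back, and the function name is hypothetical.

```c
#include <assert.h>

/* Estimated FLOP per transferred byte for C (m x n) = A (m x k) * B (k x n)
   in double precision. Assumption: A and B are transferred to the target,
   C is transferred back, and each matrix crosses the bus exactly once. */
double dgemm_flops_per_byte(double m, double n, double k)
{
    assert(m > 0.0 && n > 0.0 && k > 0.0);
    const double flops = 2.0 * m * n * k;               /* one mul + one add */
    const double bytes = 8.0 * (m * k + k * n + m * n); /* A, B in; C out */
    return flops / bytes;
}
```

Under these assumptions a square problem of size n yields n/12 FLOP per byte, so the ratio grows linearly with the matrix size. Level 3 BLAS shares this property, which fits the slide's list of supported functions.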
20 Intel Compiler (C/C++ and Fortran)
Supports Intel Xeon Phi (since V13.0)
- Compiler Assisted Offload (CAO)
- Cross-compilation ("-mmic")
No effort to get code working on Intel Xeon Phi
- Tuning takes effort, but leverages existing standards
- Optimizations usually lead to perf. gain on Intel Xeon
Standards and existing code
- Cross-compiled code can be used in offload section, for example: Intel Threading Building Blocks
- Intel OpenMP* 4.0 (incl. offload pragmas/directives)
- MYO (Mine Yours Ours) shared virtual memory
21 Compiler Features
- Task and data parallelism, e.g., Intel Cilk Plus
- Compiler techniques, e.g., automatic vectorization
- Current standards, e.g., Fortran 2003 and C++11; OpenMP 4.0
- Compatibility, e.g., link compatibility with the platform's default compiler
- Security, e.g., Pointer Checker
    subroutine quad(len,a,b,c,x1,x2)
      real(4) a(len),b(len), c(len)
      real x1(len), x2(len), s
      do i=1,len
        s = b(i)**2-4.*a(i)*c(i)
        if (s.ge.0.) then
          x1(i) = sqrt(s)
          x2(i) = (-x1(i) - b(i)) *0.5 / a(i)
          x1(i) = ( x1(i) - b(i)) *0.5 / a(i)
        else
          x2(i)=0.
          x1(i)=0.
        endif
      enddo
    end
    > ifort -c -vec-report2 quad.f90
    quad.f90(4): (col. 3) remark: LOOP WAS VECTORIZED.
* Performance e.g., Polyhedron Benchmarks (F90):
22 Automatic Vectorization
- Guided Auto-Parallelization (GAP): user/advice-oriented terminology
- Vectorization report: compiler terminology, more complete
[Figure: workflow with the steps "Implement Advice" (GAP Report) and "Resolve Issues" (Vectorization Report): remove vectorization blockers, user-mandated vectorization, break vector dependencies]
23 Vectorization Report
Get details on vectorization's success and failure
- L&M (Linux and Mac OS X): -vec-report<n>, n=0,1,2,3,4,5*
- W (Windows): /Qvec-report<n>, n=0,1,2,3,4,5*
    35: subroutine fd( y )
    36:   integer :: i
    37:   real, dimension(10), intent(inout) :: y
    38:   do i=2,10
    39:     y(i) = y(i-1)
    40:   end do
    41: end subroutine fd
    novec.f90(38): (col. 3) remark: loop was not vectorized: existence of vector dependence.
    novec.f90(39): (col. 5) remark: vector dependence: proven FLOW dependence between y line 39, and y line 39.
    novec.f90(38:3-38:3):vec:main_: loop was not vectorized: existence of vector dependence
* Diagnostic level: (0) no diagnostic, (1) vectorized loops, (2) vectorized loops and non-vect. loops
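The flow dependence reported for line 39 can be reproduced in C. In the sketch below the first loop carries a value from one iteration to the next, while the second reads only from a separate input array and is therefore free of loop-carried dependences. The function names and the `factor` argument are hypothetical; the right-hand side of the slide's line 39 is not fully preserved in the transcript.

```c
#include <assert.h>
#include <stddef.h>

/* Loop-carried flow dependence: iteration i reads y[i-1], which iteration
   i-1 has just written. The iterations cannot be executed side by side in
   SIMD lanes, so the compiler reports a vector dependence. */
void scale_chain(float *y, size_t n, float factor)
{
    assert(y != NULL || n == 0);
    for (size_t i = 1; i < n; ++i) y[i] = y[i - 1] * factor;
}

/* No loop-carried dependence: each iteration reads only x and writes only
   y, and `restrict` states that the two arrays do not overlap. */
void scale_shifted(const float *restrict x, float *restrict y,
                   size_t n, float factor)
{
    assert((x != NULL && y != NULL) || n == 0);
    for (size_t i = 1; i < n; ++i) y[i] = x[i - 1] * factor;
}
```

Breaking the dependence changes what is computed (a running product versus an element-wise scaling), so this is only an option when the algorithm allows it.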
24 Multiple Code Paths (Retargeting)
    double A[1000], B[1000], C[1000];
    void add() {
      for (int i = 0; i < 1000; ++i) {
        if (A[i] > 0) { A[i] += B[i]; }
        else          { A[i] += C[i]; }
      }
    }
AVX
    .B1.2::
      vmovaps   ymm3, A[rdx*8]
      vmovaps   ymm1, C[rdx*8]
      vcmpgtpd  ymm2, ymm3, ymm0
      vblendvpd ymm4, ymm1, B[rdx*8], ymm2
      vaddpd    ymm5, ymm3, ymm4
      vmovaps   A[rdx*8], ymm5
      add       rdx, 4
      cmp       rdx, 1000
      jl        .B1.2
SSE2
    .B1.2::
      movaps    xmm2, A[rdx*8]
      xorps     xmm0, xmm0
      cmpltpd   xmm0, xmm2
      movaps    xmm1, B[rdx*8]
      andps     xmm1, xmm0
      andnps    xmm0, C[rdx*8]
      orps      xmm1, xmm0
      addpd     xmm2, xmm1
      movaps    A[rdx*8], xmm2
      add       rdx, 2
      cmp       rdx, 1000
      jl        .B1.2
SSE4.1
    .B1.2::
      movaps    xmm2, A[rdx*8]
      xorps     xmm0, xmm0
      cmpltpd   xmm0, xmm2
      movaps    xmm1, C[rdx*8]
      blendvpd  xmm1, B[rdx*8], xmm0
      addpd     xmm2, xmm1
      movaps    A[rdx*8], xmm2
      add       rdx, 2
      cmp       rdx, 1000
      jl        .B1.2
Intel Xeon Phi can be just one of the multiple code paths.
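All three listings replace the branch with a per-element selection: a blend in the AVX and SSE4.1 versions, and an and/andn/or mask sequence in the SSE2 version. The same idea can be written at source level, as sketched below. The function name and the pointer-based signature are hypothetical; the slide's version operates on global arrays.

```c
#include <assert.h>
#include <stddef.h>

/* Branch-free form of the slide's loop: per element, one of the two
   candidate addends is selected by the comparison result and then added.
   This mirrors what the compare + blend (or mask) instructions do. */
void add_select(double *a, const double *b, const double *c, size_t n)
{
    assert((a != NULL && b != NULL && c != NULL) || n == 0);
    for (size_t i = 0; i < n; ++i) {
        const double addend = (a[i] > 0.0) ? b[i] : c[i];
        a[i] += addend;
    }
}
```

The result is the same as with the if/else form. The compiler can then emit whichever selection instruction the target instruction set offers.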
25 Intel Xeon Phi Infrastructure and Open Source Contributions
Operating System (OS)
- You can assume at least a BusyBox environment
- Embedded Linux* (very few customizations)
- May go into the Yocto Project
Other infrastructure
- Intel Manycore Platform Software Stack (Intel MPSS)
- Intel Coprocessor Offload Infrastructure (Intel COI)
- Intel Symmetric Communications Infrastructure (Intel SCI)
Other/upcoming contributions to:
- GNU* Compiler Collection (GCC)
- GNU* Debugger (GDB)
26 Intel VTune Amplifier XE 2013
- All available analysis types
- Different ways to start the analysis
- Helps create new analysis types
- Copy correct command line syntax to clipboard
27 Intel Inspector XE 2013
Dynamic Analysis: Finds Memory and Threading Errors
Find and eliminate errors
- Memory leaks, invalid access
- Races & deadlocks
- Analyze hybrid MPI cluster apps
- Heap growth analysis
Faster & easier to use
- Debugger breakpoints
- Break on selected errors
- Run faster to known error
- Pause/resume collection
- Narrow analysis focus
- Better performance
- Improved error suppression
Find errors early (when they are less expensive).
28 Intel Inspector XE
29 Intel Advisor XE
Tool for what-if analysis
- Modeling: use code annotations to introduce parallelism
- Evaluation: estimate the effect, e.g., the speedup
- GUI-driven assistant (5 steps)
Productivity and Safety
- Parallel correctness is checked based on a correct program
- Non-intrusive API
- It's not auto-parallelization
- It's not modifying the code
30 Agenda
- Overview and Motivation
- Software Developer Tools
- Roadmap and Summary
- Questions and Answers
31 Philosophy of the Intel Xeon Family
Modest core count increase
- e.g., E5-2xxx introduced 8 cores/die (prev. 6 cores/die)
- E5-2xxx v2 is expected to have 12 cores/die
- E5-4xxx (EP) allows 4 sockets (prev. only EX up to 8)
Performance per core must increase
- e.g., E5 introduced Intel AVX (increased FP throughput)
- Haswell with 16F DP / cycle (SP perf. 2x over DP)
Power consumption to go down (or to stay flat)
- e.g., Turbo Boost, configurable TDP, low-voltage
Balanced platform (compute, memory, I/O)
- e.g., more main memory for Intel Xeon Phi
32 Software Roadmap
Upcoming features
- OpenCL* SDK for Intel Xeon Phi
- Windows-hosted Intel Xeon Phi
Intel Composer XE 2013 Update 2 (next week)
- OpenMP* 4.0 support *
33 Summary
- Expect a large performance boost with Intel Xeon 3rd generation ("Haswell"): FP throughput will double
- Intel Xeon Phi can be targeted with regular developer tools and standards; perf. tuning benefits Intel Xeon as well
- Intel Xeon Phi coprocessor behaves similarly to a normal computer system (OS, login, etc.)
34 References
- Intel Xeon Phi Coprocessor Developer Forum
- Intel Xeon Phi Coprocessor Quick Start Guide
- Programming Intel's Xeon Phi: A Jumpstart Introduction
- Phi Programming for CUDA Developers
35 Available since April/May
- Teaches parallel programming in a cookbook style with many examples
- Shared-memory programming and debugging on x86 architecture
- Not about Intel Xeon Phi coprocessors
(c) 2012, publisher: Wrox
36 Available since July
- Teaches parallel programming in a new, more effective manner.
- Not about Intel Xeon Phi coprocessors.
- Not about any specific hardware.
- It's about effective parallel programming.
(c) 2012, publisher: Morgan Kaufmann
37 Availability: ~March 2013
- Completely focused on Intel Xeon Phi coprocessors.
- Volume 1: essentials, ~350 pages of explanation of programming.
- Teaches us how to use and obtain high performance on the Intel MIC architecture.
"The authors have provided a very readable, big-picture introduction to programming the Intel Xeon Phi Coprocessor. By chronicling step-by-step optimizations of several computational kernels, software interfaces are illustrated for getting the most out of key architectural features of the Intel Xeon Phi Coprocessor." James L. Schwarzmeier, Cray Inc, January 2013
(c) 2013, Morgan Kaufmann Publ. Inc
38 Agenda
- Overview and Motivation
- Software Developer Tools
- Roadmap and Summary
- Questions?
39 Call to Action
1. Start to evaluate Intel Composer XE
   - Evaluation includes Premier support as well
   - Under Windows, a trial plays well with an evaluation of Microsoft* Visual Studio
   - Feel free to ask for on-site training
2. Start to optimize your application for multicore
   - C/C++ and Fortran as well
   - OpenCL where appropriate
   - Avoid getting stuck with frequency scaling
3. Interested in targeting Intel Xeon Phi for Windows workstations? Get in contact.
   - Intel/Switzerland, local consultants, etc.
40 Thank You
42 Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision # Copyright 2013, 2012, Intel Corporation. All rights reserved.
Intel Math Kernel Library (Intel MKL) Overview. Hans Pabst Software and Services Group Intel Corporation
Intel Math Kernel Library (Intel MKL) Overview Hans Pabst Software and Services Group Intel Corporation Agenda Motivation Functionality Compilation Performance Summary 2 Motivation How and where to optimize?
More informationIntel Performance Libraries
Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation
More informationKlaus-Dieter Oertel, May 28 th 2013 Software and Services Group Intel Corporation
S c i c o m P 2 0 1 3 T u t o r i a l Intel Xeon Phi Product Family Programming Tools Klaus-Dieter Oertel, May 28 th 2013 Software and Services Group Intel Corporation Agenda Intel Parallel Studio XE 2013
More informationFastest and most used math library for Intel -based systems 1
Fastest and most used math library for Intel -based systems 1 Speaker: Alexander Kalinkin Contributing authors: Peter Caday, Kazushige Goto, Louise Huot, Sarah Knepper, Mesut Meterelliyoz, Arthur Araujo
More informationGraphics Performance Analyzer for Android
Graphics Performance Analyzer for Android 1 What you will learn from this slide deck Detailed optimization workflow of Graphics Performance Analyzer Android* System Analysis Only Please see subsequent
More informationIntel Math Kernel Library 10.3
Intel Math Kernel Library 10.3 Product Brief Intel Math Kernel Library 10.3 The Flagship High Performance Computing Math Library for Windows*, Linux*, and Mac OS* X Intel Math Kernel Library (Intel MKL)
More informationSergey Maidanov. Software Engineering Manager for Intel Distribution for Python*
Sergey Maidanov Software Engineering Manager for Intel Distribution for Python* Introduction Python is among the most popular programming languages Especially for prototyping But very limited use in production
More informationIntel Math Kernel Library (Intel MKL) Latest Features
Intel Math Kernel Library (Intel MKL) Latest Features Sridevi Allam Technical Consulting Engineer Sridevi.allam@intel.com 1 Agenda - Introduction to Support on Intel Xeon Phi Coprocessors - Performance
More informationLIBXSMM Library for small matrix multiplications. Intel High Performance and Throughput Computing (EMEA) Hans Pabst, March 12 th 2015
LIBXSMM Library for small matrix multiplications. Intel High Performance and Throughput Computing (EMEA) Hans Pabst, March 12 th 2015 Abstract Library for small matrix-matrix multiplications targeting
More informationJackson Marusarz Software Technical Consulting Engineer
Jackson Marusarz Software Technical Consulting Engineer What Will Be Covered Overview Memory/Thread analysis New Features Deep dive into debugger integrations Demo Call to action 2 Analysis Tools for Diagnosis
More informationIntel Parallel Studio XE 2015
2015 Create faster code faster with this comprehensive parallel software development suite. Faster code: Boost applications performance that scales on today s and next-gen processors Create code faster:
More informationMaximizing performance and scalability using Intel performance libraries
Maximizing performance and scalability using Intel performance libraries Roger Philp Intel HPC Software Workshop Series 2016 HPC Code Modernization for Intel Xeon and Xeon Phi February 17 th 2016, Barcelona
More informationMemory & Thread Debugger
Memory & Thread Debugger Here is What Will Be Covered Overview Memory/Thread analysis New Features Deep dive into debugger integrations Demo Call to action Intel Confidential 2 Analysis Tools for Diagnosis
More informationWhat s New August 2015
What s New August 2015 Significant New Features New Directory Structure OpenMP* 4.1 Extensions C11 Standard Support More C++14 Standard Support Fortran 2008 Submodules and IMPURE ELEMENTAL Further C Interoperability
More informationIntel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant
Intel Advisor XE Future Release Threading Design & Prototyping Vectorization Assistant Parallel is the Path Forward Intel Xeon and Intel Xeon Phi Product Families are both going parallel Intel Xeon processor
More informationChao Yu, Technical Consulting Engineer, Intel IPP and MKL Team
Chao Yu, Technical Consulting Engineer, Intel IPP and MKL Team Agenda Intel IPP and Intel MKL Benefits What s New in Intel MKL 11.3 What s New in Intel IPP 9.0 New Features and Changes Tips to Move Intel
More informationIntel Software Development Products for High Performance Computing and Parallel Programming
Intel Software Development Products for High Performance Computing and Parallel Programming Multicore development tools with extensions to many-core Notices INFORMATION IN THIS DOCUMENT IS PROVIDED IN
More informationGetting Started with Intel SDK for OpenCL Applications
Getting Started with Intel SDK for OpenCL Applications Webinar #1 in the Three-part OpenCL Webinar Series July 11, 2012 Register Now for All Webinars in the Series Welcome to Getting Started with Intel
More informationGet Ready for Intel MKL on Intel Xeon Phi Coprocessors. Zhang Zhang Technical Consulting Engineer Intel Math Kernel Library
Get Ready for Intel MKL on Intel Xeon Phi Coprocessors Zhang Zhang Technical Consulting Engineer Intel Math Kernel Library Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL
More informationAchieving High Performance. Jim Cownie Principal Engineer SSG/DPD/TCAR Multicore Challenge 2013
Achieving High Performance Jim Cownie Principal Engineer SSG/DPD/TCAR Multicore Challenge 2013 Does Instruction Set Matter? We find that ARM and x86 processors are simply engineering design points optimized
More informationIntel Parallel Studio XE 2015 Composer Edition for Linux* Installation Guide and Release Notes
Intel Parallel Studio XE 2015 Composer Edition for Linux* Installation Guide and Release Notes 23 October 2014 Table of Contents 1 Introduction... 1 1.1 Product Contents... 2 1.2 Intel Debugger (IDB) is
More informationIntel tools for High Performance Python 데이터분석및기타기능을위한고성능 Python
Intel tools for High Performance Python 데이터분석및기타기능을위한고성능 Python Python Landscape Adoption of Python continues to grow among domain specialists and developers for its productivity benefits Challenge#1:
More informationIntel Many Integrated Core (MIC) Architecture
Intel Many Integrated Core (MIC) Architecture Karl Solchenbach Director European Exascale Labs BMW2011, November 3, 2011 1 Notice and Disclaimers Notice: This document contains information on products
More informationInstallation Guide and Release Notes
Intel Parallel Studio XE 2013 for Linux* Installation Guide and Release Notes Document number: 323804-003US 10 March 2013 Table of Contents 1 Introduction... 1 1.1 What s New... 1 1.1.1 Changes since Intel
More informationAchieving Peak Performance on Intel Hardware. Intel Software Developer Conference London, 2017
Achieving Peak Performance on Intel Hardware Intel Software Developer Conference London, 2017 Welcome Aims for the day You understand some of the critical features of Intel processors and other hardware
More informationUsing Intel VTune Amplifier XE and Inspector XE in.net environment
Using Intel VTune Amplifier XE and Inspector XE in.net environment Levent Akyil Technical Computing, Analyzers and Runtime Software and Services group 1 Refresher - Intel VTune Amplifier XE Intel Inspector
More informationUsing Intel VTune Amplifier XE for High Performance Computing
Using Intel VTune Amplifier XE for High Performance Computing Vladimir Tsymbal Performance, Analysis and Threading Lab 1 The Majority of all HPC-Systems are Clusters Interconnect I/O I/O... I/O I/O Message
More informationSarah Knepper. Intel Math Kernel Library (Intel MKL) 25 May 2018, iwapt 2018
Sarah Knepper Intel Math Kernel Library (Intel MKL) 25 May 2018, iwapt 2018 Outline Motivation Problem statement and solutions Simple example Performance comparison 2 Motivation Partial differential equations
More informationIntel Advisor XE. Vectorization Optimization. Optimization Notice
Intel Advisor XE Vectorization Optimization 1 Performance is a Proven Game Changer It is driving disruptive change in multiple industries Protecting buildings from extreme events Sophisticated mechanics
More informationMaximize Performance and Scalability of RADIOSS* Structural Analysis Software on Intel Xeon Processor E7 v2 Family-Based Platforms
Maximize Performance and Scalability of RADIOSS* Structural Analysis Software on Family-Based Platforms Executive Summary Complex simulations of structural and systems performance, such as car crash simulations,
More informationIFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor
IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor D.Sc. Mikko Byckling 17th Workshop on High Performance Computing in Meteorology October 24 th 2016, Reading, UK Legal Disclaimer & Optimization
More informationUsing Intel Math Kernel Library with MathWorks* MATLAB* on Intel Xeon Phi Coprocessor System
Using Intel Math Kernel Library with MathWorks* MATLAB* on Intel Xeon Phi Coprocessor System Overview This guide is intended to help developers use the latest version of Intel Math Kernel Library (Intel
More informationH.J. Lu, Sunil K Pandey. Intel. November, 2018
H.J. Lu, Sunil K Pandey Intel November, 2018 Issues with Run-time Library on IA Memory, string and math functions in today s glibc are optimized for today s Intel processors: AVX/AVX2/AVX512 FMA It takes
More informationVisualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature. Intel Software Developer Conference London, 2017
Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature Intel Software Developer Conference London, 2017 Agenda Vectorization is becoming more and more important What is
More informationIntel Math Kernel Library Perspectives and Latest Advances. Noah Clemons Lead Technical Consulting Engineer Developer Products Division, Intel
Intel Math Kernel Library Perspectives and Latest Advances Noah Clemons Lead Technical Consulting Engineer Developer Products Division, Intel After Compiler and Threading Libraries, what s next? Intel
More informationKevin O Leary, Intel Technical Consulting Engineer
Kevin O Leary, Intel Technical Consulting Engineer Moore s Law Is Going Strong Hardware performance continues to grow exponentially We think we can continue Moore's Law for at least another 10 years."
More informationIntel Math Kernel Library (Intel MKL) BLAS. Victor Kostin Intel MKL Dense Solvers team manager
Intel Math Kernel Library (Intel MKL) BLAS Victor Kostin Intel MKL Dense Solvers team manager Intel MKL BLAS/Sparse BLAS Original ( dense ) BLAS available from www.netlib.org Additionally Intel MKL provides
More informationBei Wang, Dmitry Prohorov and Carlos Rosales
Bei Wang, Dmitry Prohorov and Carlos Rosales Aspects of Application Performance What are the Aspects of Performance Intel Hardware Features Omni-Path Architecture MCDRAM 3D XPoint Many-core Xeon Phi AVX-512
More informationIntel Xeon Phi Coprocessor
Intel Xeon Phi Coprocessor http://tinyurl.com/inteljames twitter @jamesreinders James Reinders it s all about parallel programming Source Multicore CPU Compilers Libraries, Parallel Models Multicore CPU
More informationUsing Intel Inspector XE 2011 with Fortran Applications
Using Intel Inspector XE 2011 with Fortran Applications Jackson Marusarz Intel Corporation Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS
More informationKirill Rogozhin. Intel
Kirill Rogozhin Intel From Old HPC principle to modern performance model Old HPC principles: 1. Balance principle (e.g. Kung 1986) hw and software parameters altogether 2. Compute Density, intensity, machine
More informationVectorization Advisor: getting started
Vectorization Advisor: getting started Before you analyze Run GUI or Command Line Set-up environment Linux: source /advixe-vars.sh Windows: \advixe-vars.bat Run GUI or Command
More informationEliminate Threading Errors to Improve Program Stability
Introduction This guide will illustrate how the thread checking capabilities in Intel Parallel Studio XE can be used to find crucial threading defects early in the development cycle. It provides detailed
More informationInstallation Guide and Release Notes
Intel C++ Studio XE 2013 for Windows* Installation Guide and Release Notes Document number: 323805-003US 26 June 2013 Table of Contents 1 Introduction... 1 1.1 What s New... 2 1.1.1 Changes since Intel
More informationAgenda. Optimization Notice Copyright 2017, Intel Corporation. All rights reserved. *Other names and brands may be claimed as the property of others.
Agenda VTune Amplifier XE OpenMP* Analysis: answering on customers questions about performance in the same language a program was written in Concepts, metrics and technology inside VTune Amplifier XE OpenMP
More informationEliminate Threading Errors to Improve Program Stability
Eliminate Threading Errors to Improve Program Stability This guide will illustrate how the thread checking capabilities in Parallel Studio can be used to find crucial threading defects early in the development
More informationComputer Architecture and Structured Parallel Programming James Reinders, Intel
Computer Architecture and Structured Parallel Programming James Reinders, Intel Parallel Computing CIS 410/510 Department of Computer and Information Science Lecture 17 Manycore Computing and GPUs Computer
More informationIntel Math Kernel Library. Getting Started Tutorial: Using the Intel Math Kernel Library for Matrix Multiplication
Intel Math Kernel Library Getting Started Tutorial: Using the Intel Math Kernel Library for Matrix Multiplication Legal Information INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS.
Introduction to Xeon Phi. Bill Barth January 11, 2013
Introduction to Xeon Phi Bill Barth January 11, 2013 What is it? Co-processor PCI Express card Stripped down Linux operating system Dense, simplified processor Many power-hungry operations removed Wider
This guide will show you how to use Intel Inspector XE to identify and fix resource leak errors in your programs before they start causing problems.
Introduction A resource leak refers to a type of resource consumption in which the program cannot release resources it has acquired. Typically the result of a bug, common resource issues, such as memory
Intel VTune Amplifier XE
Intel VTune Amplifier XE Vladimir Tsymbal Performance, Analysis and Threading Lab 1 Agenda Intel VTune Amplifier XE Overview Features Data collectors Analysis types Key Concepts Collecting performance
Intel Software Development Products Licensing & Programs Channel EMEA
Intel Software Development Products Licensing & Programs Channel EMEA Intel Software Development Products Advanced Performance Distributed Performance Intel Software Development Products Foundation of
Intel Parallel Studio XE 2011 for Windows* Installation Guide and Release Notes
Intel Parallel Studio XE 2011 for Windows* Installation Guide and Release Notes Document number: 323803-001US 4 May 2011 Table of Contents 1 Introduction... 1 1.1 What's New... 2 1.2 Product Contents...
Efficiently Introduce Threading using Intel TBB
Introduction This guide will illustrate how to efficiently introduce threading using Intel Threading Building Blocks (Intel TBB), part of Intel Parallel Studio XE. It is a widely used, award-winning C++
More performance options
More performance options OpenCL, streaming media, and native coding options with INDE April 8, 2014 2014, Intel Corporation. All rights reserved. Intel, the Intel logo, Intel Inside, Intel Xeon, and Intel
Munara Tolubaeva Technical Consulting Engineer. 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries.
Munara Tolubaeva Technical Consulting Engineer 3D XPoint is a trademark of Intel Corporation in the U.S. and/or other countries. notices and disclaimers Intel technologies' features and benefits depend
OpenMP * 4 Support in Clang * / LLVM * Andrey Bokhanko, Intel
OpenMP * 4 Support in Clang * / LLVM * Andrey Bokhanko, Intel Clang * : An Excellent C++ Compiler LLVM * : Collection of modular and reusable compiler and toolchain technologies Created by Chris Lattner
Becca Paren Cluster Systems Engineer Software and Services Group. May 2017
Becca Paren Cluster Systems Engineer Software and Services Group May 2017 Clusters are complex systems! Challenge is to reduce this complexity barrier for: Cluster architects System administrators Application
What s P. Thierry
What's new@intel P. Thierry Principal Engineer, Intel Corp philippe.thierry@intel.com CPU trend Memory update Software Characterization in 30 mn 10 000 feet view CPU : Range of few TF/s and
HPCG on Intel Xeon Phi 2 nd Generation, Knights Landing. Alexander Kleymenov and Jongsoo Park Intel Corporation SC16, HPCG BoF
HPCG on Intel Xeon Phi 2 nd Generation, Knights Landing Alexander Kleymenov and Jongsoo Park Intel Corporation SC16, HPCG BoF 1 Outline KNL results Our other work related to HPCG 2 ~47 GF/s per KNL ~10
Intel Xeon Phi Coprocessor. Technical Resources. Intel Xeon Phi Coprocessor Workshop Pawsey Centre & CSIRO, Aug Intel Xeon Phi Coprocessor
Technical Resources Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS
INTEL MKL Vectorized Compact routines
INTEL MKL Vectorized Compact routines Mesut Meterelliyoz, Peter Caday, Timothy B. Costa, Kazushige Goto, Louise Huot, Sarah Knepper, Arthur Araujo Mitrano, Shane Story 2018 BLIS RETREAT 09/17/2018 OUTLINE
High Performance Computing The Essential Tool for a Knowledge Economy
High Performance Computing The Essential Tool for a Knowledge Economy Rajeeb Hazra Vice President & General Manager Technical Computing Group Datacenter & Connected Systems Group July 22 nd 2013 1 What
A Simple Path to Parallelism with Intel Cilk Plus
Introduction This introductory tutorial describes how to use Intel Cilk Plus to simplify taking advantage of vectorization and threading parallelism in your code. It provides a brief description
Achieving Peak Performance on Intel Hardware. Jim Cownie: Intel Software Developer Conference Frankfurt, December 2017
Achieving Peak Performance on Intel Hardware Jim Cownie: Intel Software Developer Conference Frankfurt, December 2017 Welcome Aims for the day You understand some of the critical features of Intel processors
Performance Profiler. Klaus-Dieter Oertel Intel-SSG-DPD IT4I HPC Workshop, Ostrava,
Performance Profiler Klaus-Dieter Oertel Intel-SSG-DPD IT4I HPC Workshop, Ostrava, 08-09-2016 Faster, Scalable Code, Faster Intel VTune Amplifier Performance Profiler Get Faster Code Faster With Accurate
extreme XQCD Bern Aug 5th, 2013 Edmund Preiss Manager Business Development, EMEA
extreme XQCD Bern Aug 5th, 2013 Edmund Preiss Manager Business Development, EMEA Topics Covered Today 2 Intel's offerings to HPC Update on Intel Architecture Roadmap Overview on Intel Development Tools
Crosstalk between VMs. Alexander Komarov, Application Engineer Software and Services Group Developer Relations Division EMEA
Crosstalk between VMs Alexander Komarov, Application Engineer Software and Services Group Developer Relations Division EMEA 2 September 2015 Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT
Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature
Visualizing and Finding Optimization Opportunities with Intel Advisor Roofline feature Intel Software Developer Conference Frankfurt, 2017 Klaus-Dieter Oertel, Intel Agenda Intel Advisor for vectorization
Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2
Intel C++ Compiler User's Guide With Support For The Streaming Simd Extensions 2 This release of the Intel C++ Compiler 16.0 product is a Pre-Release, and as such is 64 architecture processor supporting
Eliminate Memory Errors to Improve Program Stability
Introduction INTEL PARALLEL STUDIO XE EVALUATION GUIDE This guide will illustrate how Intel Parallel Studio XE memory checking capabilities can find crucial memory defects early in the development cycle.
Compiling for Scalable Computing Systems the Merit of SIMD. Ayal Zaks Intel Corporation Acknowledgements: too many to list
Compiling for Scalable Computing Systems the Merit of SIMD Ayal Zaks Intel Corporation Acknowledgements: too many to list Takeaways 1. SIMD is mainstream and ubiquitous in HW 2. Compiler support for SIMD
IXPUG 16. Dmitry Durnov, Intel MPI team
IXPUG 16 Dmitry Durnov, Intel MPI team Agenda - Intel MPI 2017 Beta U1 product availability - New features overview - Competitive results - Useful links - Q/A 2 Intel MPI 2017 Beta U1 is available! Key
The Intel Xeon Phi Coprocessor. Dr-Ing. Michael Klemm Software and Services Group Intel Corporation
The Intel Xeon Phi Coprocessor Dr-Ing. Michael Klemm Software and Services Group Intel Corporation (michael.klemm@intel.com) Legal Disclaimer & Optimization Notice INFORMATION IN THIS DOCUMENT IS PROVIDED
Intel Direct Sparse Solver for Clusters, a research project for solving large sparse systems of linear algebraic equation
Intel Direct Sparse Solver for Clusters, a research project for solving large sparse systems of linear algebraic equation Alexander Kalinkin Anton Anders Roman Anders 1 Legal Disclaimer INFORMATION IN
Improving the Energy Efficiency of Mobile Applications Through Parallelization. Examples. Vladimir Polin
Improving the Energy Efficiency of Mobile Applications Through Parallelization. Examples. Vladimir Polin Legal Notices This presentation is for informational purposes only. INTEL MAKES NO WARRANTIES, EXPRESS
Bitonic Sorting. Intel SDK for OpenCL* Applications Sample Documentation. Copyright Intel Corporation. All Rights Reserved
Intel SDK for OpenCL* Applications Sample Documentation Copyright 2010 2012 Intel Corporation All Rights Reserved Document Number: 325262-002US Revision: 1.3 World Wide Web: http://www.intel.com Document
Intel Array Building Blocks
Intel Array Building Blocks Productivity, Performance, and Portability with Intel Parallel Building Blocks Intel SW Products Workshop 2010 CERN openlab 11/29/2010 1 Agenda Legal Information Vision Call
Bitonic Sorting Intel OpenCL SDK Sample Documentation
Intel OpenCL SDK Sample Documentation Document Number: 325262-002US Legal Information INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL
Introduction to Intel Xeon Phi programming techniques. Fabio Affinito Vittorio Ruggiero
Introduction to Intel Xeon Phi programming techniques Fabio Affinito Vittorio Ruggiero Outline High level overview of the Intel Xeon Phi hardware and software stack Intel Xeon Phi programming paradigms:
Guy Blank Intel Corporation, Israel March 27-28, 2017 European LLVM Developers Meeting Saarland Informatics Campus, Saarbrücken, Germany
Guy Blank Intel Corporation, Israel March 27-28, 2017 European LLVM Developers Meeting Saarland Informatics Campus, Saarbrücken, Germany Motivation C AVX2 AVX512 New instructions utilized! Scalar performance
Demonstrating Performance Portability of a Custom OpenCL Data Mining Application to the Intel Xeon Phi Coprocessor
Demonstrating Performance Portability of a Custom OpenCL Data Mining Application to the Intel Xeon Phi Coprocessor Alexander Heinecke (TUM), Dirk Pflüger (Universität Stuttgart), Dmitry Budnikov, Michael
Intel Architecture for Software Developers
Intel Architecture for Software Developers 1 Agenda Introduction Processor Architecture Basics Intel Architecture Intel Core and Intel Xeon Intel Atom Intel Xeon Phi Coprocessor Use Cases for Software
PRACE PATC Course: Intel MIC Programming Workshop, MKL. Ostrava,
PRACE PATC Course: Intel MIC Programming Workshop, MKL Ostrava, 7-8.2.2017 1 Agenda A quick overview of Intel MKL Usage of MKL on Xeon Phi Compiler Assisted Offload Automatic Offload Native Execution Hands-on
Intel Parallel Studio XE 2011 SP1 for Linux* Installation Guide and Release Notes
Intel Parallel Studio XE 2011 SP1 for Linux* Installation Guide and Release Notes Document number: 323804-002US 21 June 2012 Table of Contents 1 Introduction... 1 1.1 What's New... 1 1.2 Product Contents...
Intel Cluster Checker 3.0 webinar
Intel Cluster Checker 3.0 webinar June 3, 2015 Christopher Heller Technical Consulting Engineer Q2, 2015 1 Introduction Intel Cluster Checker 3.0 is a systems tool for Linux high performance compute clusters
April 2 nd, Bob Burroughs Director, HPC Solution Sales
April 2 nd, 2019 Bob Burroughs Director, HPC Solution Sales Today - Introducing 2 nd Generation Intel Xeon Scalable Processors how Intel Speeds HPC performance Work Time System Peak Efficiency Software
Accelerating HPC. (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing
Accelerating HPC (Nash) Dr. Avinash Palaniswamy High Performance Computing Data Center Group Marketing SAAHPC, Knoxville, July 13, 2010 Legal Disclaimer Intel may make changes to specifications and product
Overview of Intel Parallel Studio XE
Overview of Intel Parallel Studio XE Stephen Blair-Chappell 1 30-second pitch Intel Parallel Studio XE 2011 Advanced Application Performance What Is It? Suite of tools to develop high performing, robust
Programming for the Intel Many Integrated Core Architecture By James Reinders. The Architecture for Discovery. PowerPoint Title
Programming for the Intel Many Integrated Core Architecture By James Reinders The Architecture for Discovery PowerPoint Title Intel Xeon Phi coprocessor 1. Designed for Highly Parallel workloads 2. and
Intel Xeon Phi Coprocessor Performance Analysis
Intel Xeon Phi Coprocessor Performance Analysis Legal Disclaimer INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO
Presenter: Georg Zitzlsberger. Date:
Presenter: Georg Zitzlsberger Date: 07-09-2016 1 Agenda Introduction to SIMD for Intel Architecture Compiler & Vectorization Validating Vectorization Success Intel Cilk Plus OpenMP* 4.x Summary 2 Vectorization
Bring your application to a new era:
Bring your application to a new era: learning by example how to parallelize and optimize for Intel Xeon processor and Intel Xeon Phi TM coprocessor Manel Fernández, Roger Philp, Richard Paul Bayncore Ltd.
Knights Corner: Your Path to Knights Landing
Knights Corner: Your Path to Knights Landing James Reinders, Intel Wednesday, September 17, 2014; 9-10am PDT Photo (c) 2014, James Reinders; used with permission; Yosemite Half Dome rising through forest
Obtaining the Last Values of Conditionally Assigned Privates
Obtaining the Last Values of Conditionally Assigned Privates Hideki Saito, Serge Preis*, Aleksei Cherkasov, Xinmin Tian Intel Corporation (* at submission time) 2016/10/04 OpenMPCon2016 Legal Disclaimer
Eliminate Memory Errors to Improve Program Stability
Eliminate Memory Errors to Improve Program Stability This guide will illustrate how Parallel Studio memory checking capabilities can find crucial memory defects early in the development cycle. It provides
Getting Reproducible Results with Intel MKL
Getting Reproducible Results with Intel MKL Why do results vary? Root cause for variations in results Floating-point numbers order of computation matters! Single precision example where (a+b)+c ≠ a+(b+c)
GAP Guided Auto Parallelism A Tool Providing Vectorization Guidance
GAP Guided Auto Parallelism A Tool Providing Vectorization Guidance 7/27/12 1 GAP Guided Automatic Parallelism Key design ideas: Use compiler to help detect what is blocking optimizations in particular
Mikhail Dvorskiy, Jim Cownie, Alexey Kukanov
Mikhail Dvorskiy, Jim Cownie, Alexey Kukanov What is the Parallel STL? C++17 C++ Next An extension of the C++ Standard Template Library algorithms with the execution policy argument Support for parallel
Overview of Intel Xeon Phi Coprocessor
Overview of Intel Xeon Phi Coprocessor Sept 20, 2013 Ritu Arora Texas Advanced Computing Center Email: rauta@tacc.utexas.edu This talk is only a trailer A comprehensive training on running and optimizing