Hans Pabst, January 24th, 2013, Software and Services Group, Intel Corporation


1 VPE Swiss Workshop: Infrastructure for Compute-Intensive Applications
Intel Xeon Phi Product Family: An Overview
Hans Pabst, January 24th, 2013
Software and Services Group, Intel Corporation

2 Agenda
- Overview and Motivation
- Software Developer Tools
- Roadmap and Summary
- Questions and Answers

3 Agenda
- Overview and Motivation
- Software Developer Tools
- Roadmap and Summary
- Questions and Answers
(Figure: Intel Xeon Phi block diagram showing vector IA cores with coherent caches on an interprocessor network, fixed-function logic, GDDR5 memory and I/O interfaces, and PCIe.)

4 Overview
Intel Xeon
- General instruction streams
- High single-thread performance
- High memory capacity
- Core/memory aggregation via sockets and nodes
- Instruction set extensions: SIMD (e.g., Intel AVX), virtualization, AES, etc.
Intel Xeon Phi
- General instruction streams
- Highly parallel workloads
- High memory bandwidth
- Up to 61 cores/die, aggregation via PCIe and nodes
- SIMD (512-bit registers): gather/scatter, FMA, masked instructions
Intel Xeon Phi is a coprocessor for highly parallel workloads.

5 History of SIMD ISA extensions*
- Intel Pentium processor (1993)
- MMX (1997)
- Intel Streaming SIMD Extensions (Intel SSE in 1999 to Intel SSE4.2 in 2008)
- Intel Advanced Vector Extensions (Intel AVX in 2011 and Intel AVX2 in 2013)
- Intel Many Integrated Core Architecture (Intel MIC Architecture in 2013)
* Illustrated with the number of 32-bit data elements that are processed by one packed instruction.

6 Performance Motivation
- Remember Pollack's rule: Performance ~ sqrt(Die Area)
  - 4x the die area gives 2x the performance in one core, but 4x the performance when dedicated to 4 cores
- Conclusions (with respect to Pollack's rule)
  - A powerful handle to adjust Performance/Watt
  - Weaker cores can be beneficial (but many of them)
- Parallel hardware, parallel algorithms, appropriate tools
(Figure: timeline from the GHz era through multicore to manycore.)
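The arithmetic behind the slide's example can be written out as follows. This is a sketch that assumes Pollack's rule in its usual square-root form and, as an idealization not stated on the slide, a perfectly parallel workload for the four-core case; $P$ and $A$ are illustrative symbols for per-core performance and die area.

```latex
\begin{align*}
P(A) &\propto \sqrt{A} \\
\text{one core on area } 4A:\quad P(4A) &= \sqrt{4}\,P(A) = 2\,P(A) \\
\text{four cores on area } A \text{ each}:\quad 4 \cdot P(A) &= 4\,P(A)
\end{align*}
```

Under these assumptions the same silicon budget yields twice the throughput when split into four smaller cores, which is the slide's argument for many weaker cores.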

7 Speedup?
- Peak performance by example
  - Intel Xeon E (not the top bin): 2S x 8C x 2.7 GHz x 4F DP x 2 ops* = ~345 GF/s
  - Intel Xeon Phi 3120A (lowest bin): 57C x 1.1 GHz x 8F DP x 2 ops* = ~1 TF/s
- Amdahl's Law determines the total speedup of a mixture of serial and parallel code sections, e.g., P=80% and S=1000/345 gives ~2x
- Leverage people's knowledge with common tools and practices!
* Two operations ("ops") due to separate FMUL/FADD ports on Intel Xeon, and FMA on Intel Xeon Phi.
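The peak-performance products and the Amdahl estimate on this slide can be reproduced with a few lines of C. This is a minimal sketch; the function names are illustrative and not part of any Intel tool, and the numbers passed in are the ones quoted on the slide.

```c
#include <stdio.h>

/* Theoretical peak in GFLOP/s:
 * sockets x cores x clock (GHz) x DP SIMD lanes x operations per lane per cycle. */
double peak_gflops(int sockets, int cores_per_socket, double ghz,
                   int dp_lanes, int ops_per_cycle)
{
    return sockets * cores_per_socket * ghz * dp_lanes * ops_per_cycle;
}

/* Amdahl's Law: overall speedup when a fraction p of the runtime
 * is accelerated by a factor s and the rest stays serial. */
double amdahl_speedup(double p, double s)
{
    return 1.0 / ((1.0 - p) + p / s);
}

/* Prints the slide's example and returns the overall speedup. */
double report_slide_example(void)
{
    double xeon = peak_gflops(2, 8, 2.7, 4, 2);   /* 2S x 8C x 2.7 GHz x 4F x 2 ops */
    double phi  = peak_gflops(1, 57, 1.1, 8, 2);  /* 57C x 1.1 GHz x 8F x 2 ops */
    double total = amdahl_speedup(0.80, phi / xeon);

    printf("Xeon peak:     %.1f GF/s\n", xeon);
    printf("Xeon Phi peak: %.1f GF/s\n", phi);
    printf("Overall speedup at P=80%%: %.2fx\n", total);
    return total;
}
```

Even with a roughly 2.9x faster parallel part, the 20% serial fraction limits the overall gain to about 2x, which is the point of the slide.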

8 What about 100x or more?
- Compared with a single core and without SIMD
  - For example: Intel Xeon E (single core): 3.5 GHz x 1F DP x 2 ops = 7 GF/s (Turbo Boost: 3.5 GHz)
  - Many programs today are not fully optimized for the cores and vectors in today's CPUs
- Conclusions
  - Because of Pollack's rule, we want accelerators
  - Because of Amdahl's Law ("low speedup"), high effort ("rewrite"), or workloads that are less applicable*, we need:
    - Hardware that does not break assumptions made in software
    - Common software tools that are not vendor-locked
* Examples: missing FMA (50% off peak performance), incoherent branches, etc.

9 Performance: GEMM, STREAM, and SMP Linpack
- Examples: GEMM, Cholesky / LU / QR decomposition, SMP Linpack, etc.

10 Synthetic Benchmark Summary (Intel MKL)
(Figure: four bar charts comparing 2S Intel Xeon processor with 1 Intel Xeon Phi coprocessor; higher is better in all four.)
- SGEMM (GF/s): up to 2.9X, 86% efficient
- DGEMM (GF/s): up to 2.8X, 82% efficient
- SMP Linpack (GF/s): up to 2.6X, 75% efficient
- STREAM Triad (GB/s): up to 2.2X, shown with ECC on and ECC off
Notes
1. Intel Xeon Processor E used for all: SGEMM Matrix = x 13824, DGEMM Matrix 7936 x 7936, SMP Linpack Matrix x
2. Intel Xeon Phi coprocessor SE10P (ECC on) with Gold Release Candidate SW stack: SGEMM Matrix = x 15360, DGEMM Matrix 7680 x 7680, SMP Linpack Matrix x

11 Agenda
- Overview and Motivation
- Software Developer Tools
  - Shared Memory Application Development
  - Shared, Distributed, and Hybrid Memory App. Development
- Roadmap and Summary
- Questions and Answers

12 Intel Parallel Studio XE 2013
Phase: Advanced Parallel Design
  Productivity tool: Intel Advisor XE
  Feature: Analyze existing code base and find opportunities for parallelization.
  Benefit: Easier analysis and performance heuristics, find compute hotspots and check for parallelization strategies.
Phase: Advanced Build and Debug
  Productivity tool: Intel Composer XE
  Feature: C/C++ and Fortran compilers, performance libraries, and parallel models
  Benefit: Application performance, scalability and quality for current multicore and future many-core systems.
Phase: Advanced Verify
  Productivity tool: Intel Inspector XE
  Feature: Memory and threading error checking tool for higher code reliability and quality
  Benefit: Increases productivity and lowers cost by catching memory and threading defects early.
Phase: Advanced Tune
  Productivity tool: Intel VTune Amplifier XE
  Feature: Performance profiler to optimize performance and scalability
  Benefit: Removes guesswork, saves time, makes it easier to find performance and scalability bottlenecks. Combines ease of use with deeper insights.

13 How and where to optimize?
(ordered by anticipated impact; use tools to qualify and check)
1. Prefer a library that solves the problem (Intel Performance Library!), and/or choose an appropriate/parallel algorithm*
2. Optimize your own code
   a) Across multiple cores.
   b) In-core, e.g., SIMD

    /* c (M x N) = a (M x K) * b (K x N), all row-major */
    for (int i = 0; i < M; ++i) {
      for (int j = 0; j < N; ++j) {
        c[i*N+j] = 0;
        for (int k = 0; k < K; ++k) {
          c[i*N+j] += a[i*K+k] * b[k*N+j];
        }
      }
    }

* A parallel algorithm is not necessarily an incremental optimization of a serial algorithm.
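As an illustration of item 2b (in-core optimization), the sketch below contrasts the slide's i-j-k loop order with an i-k-j order whose innermost loop accesses `b` and `c` with unit stride, which is generally easier for a compiler to vectorize. The loop interchange is a technique chosen here for illustration and is not taken from the slide; the function names are hypothetical. Per item 1, a tuned library routine would normally be preferred over either version.

```c
#include <stddef.h>

/* Reference version in i-j-k order (row-major, c = a * b). */
void matmul_ijk(size_t M, size_t N, size_t K,
                const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; ++j) {
            double sum = 0.0;
            for (size_t k = 0; k < K; ++k) {
                sum += a[i * K + k] * b[k * N + j];  /* b is read with stride N */
            }
            c[i * N + j] = sum;
        }
    }
}

/* Interchanged i-k-j order: the innermost loop walks b and c contiguously. */
void matmul_ikj(size_t M, size_t N, size_t K,
                const double *a, const double *b, double *c)
{
    for (size_t i = 0; i < M; ++i) {
        for (size_t j = 0; j < N; ++j) {
            c[i * N + j] = 0.0;
        }
        for (size_t k = 0; k < K; ++k) {
            const double aik = a[i * K + k];  /* loop-invariant in the inner loop */
            for (size_t j = 0; j < N; ++j) {
                c[i * N + j] += aik * b[k * N + j];
            }
        }
    }
}
```

Both functions compute the same product; only the memory access pattern of the innermost loop differs.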

14 Intel Math Kernel Library (Intel MKL)
- Linear Algebra: BLAS, Sparse BLAS; LAPACK solvers; Sparse Solvers (DSS, PARDISO); Iterative solver (RCI); ScaLAPACK, PBLAS
- Fast Fourier Transforms: Multidimensional; FFTW interfaces; Cluster FFT; Trig. Transforms; Poisson solver; Convolution via VSL
- Vector Math: Trigonometric; Hyperbolic; Exponential, Logarithmic; Power / Root
- Random Number Generators: Congruential; Wichmann-Hill; Mersenne Twister; Sobol; Niederreiter; Non-deterministic
- Summary Statistics: Kurtosis; Variation coefficient; Quantiles; Order statistics; Min/max; Variance-covariance
- Data Fitting: Spline-based; Interpolation; Cell search

15 Intel MKL Features
- Single-threaded and multi-threaded libraries*
- Cluster support for important domains
- Support for large problem sizes (ILP)
- Conditional Numerical Reproducibility (CNR)
- Support for Intel Xeon Phi coprocessors
  - Automatic offload, and compiler-assisted offload
  - Manycore-hosted execution, cluster support, etc.
- As always: early enabled for future hardware
  - Haswell support: AVX2 and FMA3 instruction set
* Intel MKL Link Line Advisor:

16 Intel MKL Compilation and Linkage
- Intel MKL supports Linux*, Mac OS* X, and Windows* (the platform's default compiler, the Intel Compiler, as well as non-Intel compilers and their OpenMP* runtimes)
- Intel MKL Link Line Advisor: -us/articles/intel-mkl-linkline-advisor/

17 Use Cases
- Iterative Solver (RCI): customize solver steps
- PBLAS: distribute easily
- VML: balance accuracy and performance
- RNG: safety and reliability
- VSL*: Did you know that Intel MKL comes with some statistics?
* For example, to detect outliers or to predict values.

18 Execution Models (Intel Xeon Phi)
- Intel MKL Automatic Offload (AO)
  - Transparent data transfer and execution management
  - Limited to key functions (sufficient FLOP/Byte ratio)
  - Automatically uses host and (multiple) targets
  - No code changes required
- Compiler Assisted Offload (CAO)
  - Explicit control of data transfer / persistence
  - Intel Compiler offload pragmas/directives
    - Language Extension for Offload (LEO)
    - OpenMP* 4.0 standard (draft)
  - Can be used together with Automatic Offload
- Native Execution*
  - Uses the coprocessors as independent nodes (a.k.a. manycore-hosted execution)
  - Input data is copied to targets in advance
* In fact, an offloaded code section (CAO) that calls, e.g., Intel MKL is calling into a library that is native.

19 Intel MKL Automatic Offload (AO)
- Control automatic offload (hybrid execution!)
  - Environment variable: MKL_MIC_ENABLE=1
  - Remember: sufficient problem size needed (Byte/FLOP ratio)
  - Service functions take precedence (work division, etc.)
  - Upcoming optimizations (Intel MKL )
- Supported functions (more to come)
  - BLAS level 3: GEMM, TRMM, TRSM
  - LAPACK: Cholesky, LU, QR
- Multiple cards per node
  - Only GEMM (Intel MKL )
- Check for offload (also applies to CAO)
  - OFFLOAD_REPORT=<0|1|2>, or call mkl_mic_set_offload_report( )
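The "sufficient problem size" remark can be made concrete by estimating how many floating-point operations a GEMM performs per byte moved over PCIe. The sketch below uses the standard 2·M·N·K operation count for GEMM; the transfer model (A, B, and C sent to the coprocessor, C sent back) and the function names are assumptions made for illustration and do not describe how Intel MKL actually decides whether to offload.

```c
#include <stddef.h>

/* Floating-point operations of an (m x k) * (k x n) GEMM:
 * one multiply and one add per inner-product step. */
double gemm_flops(size_t m, size_t n, size_t k)
{
    return 2.0 * (double)m * (double)n * (double)k;
}

/* Bytes crossing PCIe under the assumed model: A, B, and C in, C out. */
double gemm_transfer_bytes(size_t m, size_t n, size_t k, size_t elem_size)
{
    double elems = (double)m * (double)k      /* A */
                 + (double)k * (double)n      /* B */
                 + 2.0 * (double)m * (double)n; /* C in and out */
    return elems * (double)elem_size;
}

/* Arithmetic intensity in FLOP per transferred byte. */
double gemm_flop_per_byte(size_t m, size_t n, size_t k, size_t elem_size)
{
    return gemm_flops(m, n, k) / gemm_transfer_bytes(m, n, k, elem_size);
}
```

For square double-precision matrices of order n this model gives n/16 FLOP per byte, so the ratio grows linearly with the matrix order. That is one way to see why level-3 BLAS functions on large matrices are the natural candidates for offload.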

20 Intel Compiler (C/C++ and Fortran)
- Supports Intel Xeon Phi (since V13.0)
  - Compiler Assisted Offload (CAO)
  - Cross-compilation ("-mmic")
- No effort to get code working on Intel Xeon Phi
  - Tuning takes effort, but leverages existing standards
  - Optimizations usually lead to a performance gain on Intel Xeon as well
- Standards and existing code
  - Cross-compiled code can be used in an offload section
  - For example:
    - Intel Threading Building Blocks
    - Intel OpenMP* 4.0 (incl. offload pragmas/directives)
    - MYO (Mine Yours Ours) shared virtual memory
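A compiler-assisted offload section might look like the sketch below, which uses the Intel compiler's `#pragma offload` syntax from the Language Extension for Offload. The function name and the choice of an in-place scaling loop are hypothetical. Compilers that do not know the pragma ignore it (possibly with a warning) and run the loop on the host, so the same source remains usable everywhere.

```c
#include <stddef.h>

/* Scales x in place by factor.
 * With the Intel compiler, the pragma requests that the block run on a
 * coprocessor: x is copied to the target before the block and copied back
 * afterwards (inout), with length(n) giving the element count.
 * Other compilers ignore the pragma and execute the loop on the host. */
void scale_offload(double *x, size_t n, double factor)
{
    #pragma offload target(mic) inout(x : length(n))
    {
        for (size_t i = 0; i < n; ++i) {
            x[i] *= factor;
        }
    }
}
```

Because the offloaded region is ordinary C, the loop body can be tuned and tested on the host first.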

21 Compiler Features
- Task and data parallelism, e.g., Intel Cilk Plus
- Compiler techniques, e.g., automatic vectorization
- Current standards, e.g., Fortran 2003, C++11, OpenMP 4.0
- Compatibility, e.g., link compatibility with the platform's default compiler
- Security, e.g., Pointer Checker

    subroutine quad(len,a,b,c,x1,x2)
      real(4) a(len),b(len), c(len)
      real x1(len), x2(len), s
      do i=1,len
        s = b(i)**2-4.*a(i)*c(i)
        if (s.ge.0.) then
          x1(i) = sqrt(s)
          x2(i) = (-x1(i) - b(i)) *0.5 / a(i)
          x1(i) = ( x1(i) - b(i)) *0.5 / a(i)
        else
          x2(i)=0.
          x1(i)=0.
        endif
      enddo
    end

    > ifort -c -vec-report2 quad.f90
    quad.f90(4): (col. 3) remark: LOOP WAS VECTORIZED.

* Performance e.g., Polyhedron Benchmarks (F90):

22 Automatic Vectorization
- Guided Auto-Parallelization (GAP)
  - User/advice-oriented terminology
- Vectorization report
  - Compiler terminology
  - More complete
(Figure: workflow diagram. GAP Report leads to "Implement Advice"; Vectorization Report leads to "Resolve Issues". Listed actions: remove vectorization blockers, user-mandated vectorization, break vector dependencies.)

23 Vectorization Report
- Get details on vectorization's success and failure
  - Linux and Mac: -vec-report<n>, n=0,1,2,3,4,5*
  - Windows: /Qvec-report<n>, n=0,1,2,3,4,5*

    35: subroutine fd( y )
    36:   integer :: i
    37:   real, dimension(10), intent(inout) :: y
    38:   do i=2,10
    39:     y(i) = y(i-1)
    40:   end do
    41: end subroutine fd

    novec.f90(38): (col. 3) remark: loop was not vectorized: existence of vector dependence.
    novec.f90(39): (col. 5) remark: vector dependence: proven FLOW dependence between y line 39, and y line 39.
    novec.f90(38:3-38:3):vec:main_: loop was not vectorized: existence of vector dependence

* Diagnostic level: (0) no diagnostic, (1) vectorized loops, (2) vectorized loops and non-vect. loops
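The flow dependence reported above can be reproduced in C. In the sketch below, the first loop reads the value written by the previous iteration, so its iterations cannot run side by side in SIMD lanes; the second loop reads from a separate input array and has independent iterations. The `+ 1.0f` update and the function names are hypothetical, since the slide's right-hand side is truncated.

```c
#include <stddef.h>

/* Loop-carried (flow) dependence: iteration i needs the result of
 * iteration i-1, which blocks straightforward vectorization. */
void chain_dependent(float *y, size_t n)
{
    for (size_t i = 1; i < n; ++i) {
        y[i] = y[i - 1] + 1.0f;  /* hypothetical update */
    }
}

/* Independent iterations: input and output do not overlap (restrict),
 * so each y[i] can be computed without waiting for y[i-1]. */
void chain_independent(const float *restrict x, float *restrict y, size_t n)
{
    for (size_t i = 1; i < n; ++i) {
        y[i] = x[i - 1] + 1.0f;
    }
}
```

Starting from all zeros, the dependent loop produces a running count while the independent loop produces a constant, which shows that treating the first loop as if its iterations were independent would change the result.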

24 Multiple Code Paths (Retargeting)

    double A[1000], B[1000], C[1000];
    void add() {
      for (int i = 0; i < 1000; ++i) {
        if (A[i] > 0) {
          A[i] += B[i];
        } else {
          A[i] += C[i];
        }
      }
    }

AVX:

    .B1.2::
      vmovaps   ymm3, A[rdx*8]
      vmovaps   ymm1, C[rdx*8]
      vcmpgtpd  ymm2, ymm3, ymm0
      vblendvpd ymm4, ymm1, B[rdx*8], ymm2
      vaddpd    ymm5, ymm3, ymm4
      vmovaps   A[rdx*8], ymm5
      add       rdx, 4
      cmp       rdx, 1000
      jl        .B1.2

SSE2:

    .B1.2::
      movaps   xmm2, A[rdx*8]
      xorps    xmm0, xmm0
      cmpltpd  xmm0, xmm2
      movaps   xmm1, B[rdx*8]
      andps    xmm1, xmm0
      andnps   xmm0, C[rdx*8]
      orps     xmm1, xmm0
      addpd    xmm2, xmm1
      movaps   A[rdx*8], xmm2
      add      rdx, 2
      cmp      rdx, 1000
      jl       .B1.2

SSE4.1:

    .B1.2::
      movaps   xmm2, A[rdx*8]
      xorps    xmm0, xmm0
      cmpltpd  xmm0, xmm2
      movaps   xmm1, C[rdx*8]
      blendvpd xmm1, B[rdx*8], xmm0
      addpd    xmm2, xmm1
      movaps   A[rdx*8], xmm2
      add      rdx, 2
      cmp      rdx, 1000
      jl       .B1.2

Intel Xeon Phi can be just one of the multiple code paths.
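All three listings replace the branch with a compare followed by a blend (or an and/andn/or sequence on SSE2). The same transformation can be written at source level as a select, as in the sketch below; the function names and the array parameters are illustrative additions, not part of the slide.

```c
#include <stddef.h>

/* Branchy form, as on the slide. */
void add_branch(double *a, const double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        if (a[i] > 0) {
            a[i] += b[i];
        } else {
            a[i] += c[i];
        }
    }
}

/* Select form: pick the addend per element, then add unconditionally.
 * This mirrors the compare + blend pattern in the vectorized listings. */
void add_select(double *a, const double *b, const double *c, size_t n)
{
    for (size_t i = 0; i < n; ++i) {
        const double addend = (a[i] > 0) ? b[i] : c[i];
        a[i] += addend;
    }
}
```

Both forms give identical results; a vectorizing compiler typically performs this if-conversion itself, so the original source can stay unchanged while the instruction set is retargeted.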

25 Intel Xeon Phi Infrastructure and Open Source Contributions
- Operating System (OS)
  - You can assume at least a BusyBox environment
  - Embedded Linux* (very few customizations)
  - May go into the Yocto Project
- Other infrastructure
  - Intel Manycore Platform Software Stack (Intel MPSS)
  - Intel Coprocessor Offload Infrastructure (Intel COI)
  - Intel Symmetric Communications Interface (Intel SCIF)
- Other/upcoming contributions to:
  - GNU* Compiler Collection (GCC)
  - GNU* Debugger (GDB)

26 Intel VTune Amplifier XE 2013
(Figure: screenshot with callouts.)
- All available analysis types
- Different ways to start the analysis
- Helps create new analysis types
- Copy correct command line syntax to clipboard

27 Intel Inspector XE 2013
Dynamic Analysis: Finds Memory and Threading Errors
- Find and eliminate errors
  - Memory leaks, invalid access
  - Races and deadlocks
  - Analyze hybrid MPI cluster apps
  - Heap growth analysis
- Faster and easier to use
  - Debugger breakpoints
    - Break on selected errors
    - Run faster to known error
  - Pause/resume collection
    - Narrow analysis focus
  - Better performance
  - Improved error suppression
Find errors early (when they are less expensive).

28 Intel Inspector XE

29 Intel Advisor XE
- Tool for what-if analysis
  - Modeling: use code annotations to introduce parallelism
  - Evaluation: estimate the effect, e.g., the speedup
  - GUI-driven assistant (5 steps)
- Productivity and Safety
  - Parallel correctness is checked based on a correct program
  - Non-intrusive API
  - It's not auto-parallelization
  - It's not modifying the code

30 Agenda
- Overview and Motivation
- Software Developer Tools
- Roadmap and Summary
- Questions and Answers

31 Philosophy of the Intel Xeon Family
- Modest core count increase
  - e.g., E5-2xxx introduced 8 cores/die (prev. 6 cores/die)
  - E5-2xxx v2 is expected to have 12 cores/die
  - E5-4xxx (EP) allows 4 sockets (prev. only EX up to 8)
- Performance per core must increase
  - e.g., E5 introduced Intel AVX (increased FP throughput)
  - Haswell with 16F DP / cycle (SP perf. 2x over DP)
- Power consumption to go down (or to stay flat)
  - e.g., Turbo Boost, configurable TDP, low-voltage
- Balanced platform (compute, memory, I/O)
  - e.g., more main memory for Intel Xeon Phi

32 Software Roadmap
- Upcoming features
  - OpenCL* SDK for Intel Xeon Phi
  - Windows-hosted Intel Xeon Phi
- Intel Composer XE 2013 Update 2 (next week)
  - OpenMP* 4.0 support

33 Summary
- Expect a large performance boost with the 3rd generation of Intel Xeon ("Haswell")
  - FP throughput will double
- Intel Xeon Phi can be targeted with regular developer tools and standards; performance tuning benefits Intel Xeon as well
- The Intel Xeon Phi coprocessor behaves similarly to a normal computer system (OS, login, etc.)

34 References
- Intel Xeon Phi Coprocessor Developer Forum
- Intel Xeon Phi Coprocessor Quick Start Guide
- Programming Intel's Xeon Phi: A Jumpstart Introduction
- Phi Programming for CUDA Developers

35 (Book recommendation)
- Available since April/May
- Teaches parallel programming in a cookbook style with many examples
- Shared-memory programming and debugging on x86 architecture
- Not about Intel Xeon Phi coprocessors
(c) 2012, publisher: Wrox

36 (Book recommendation)
- Available since July
- Teaches parallel programming in a new, more effective manner
- Not about Intel Xeon Phi coprocessors
- Not about any specific hardware
- It's about effective parallel programming
(c) 2012, publisher: Morgan Kaufmann

37 (Book recommendation)
- Availability: ~March 2013
- Completely focused on Intel Xeon Phi coprocessors
- Volume 1: essentials; ~350 pages of explanation of programming
- Teaches us how to use and obtain high performance on the Intel MIC architecture
"The authors have provided a very readable, big-picture introduction to programming the Intel Xeon Phi Coprocessor. By chronicling step-by-step optimizations of several computational kernels, software interfaces are illustrated for getting the most out of key architectural features of the Intel Xeon Phi Coprocessor." James L. Schwarzmeier, Cray Inc, January 2013
(c) 2013, Morgan Kaufmann Publ. Inc

38 Agenda
- Overview and Motivation
- Software Developer Tools
- Roadmap and Summary
- Questions?

39 Call to Action
1. Start to evaluate Intel Composer XE
   - Evaluation includes Premier support as well
   - Under Windows, a trial plays well with an evaluation of Microsoft* Visual Studio
   - Feel free to ask for an on-site training
2. Start to optimize your application for multicore
   - C/C++ and Fortran as well
   - OpenCL where appropriate
   - Avoid getting stuck with frequency scaling
3. Interested in targeting Intel Xeon Phi for Windows workstations?
   - Get in contact: Intel/Switzerland, local consultants, etc.

40 Thank You

42 Legal Disclaimer & Optimization Notice
INFORMATION IN THIS DOCUMENT IS PROVIDED "AS IS". NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR OTHER INTELLECTUAL PROPERTY RIGHT.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product when combined with other products.
Copyright, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and Cilk are trademarks of Intel Corporation in the U.S. and other countries.
Optimization Notice
Intel's compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #
Copyright 2013, 2012, Intel Corporation. All rights reserved.
