Intel Xeon Phi Coprocessor

Meagan Sutton
5 years ago

1 Intel Xeon Phi Coprocessor

2 Agenda
- Introduction
- Intel Xeon Phi Architecture
- Programming Models
- Outlook
- Summary

3 Intel Multicore Architecture and Intel Many Integrated Core Architecture (Intel MIC)
Intel Multicore Architecture:
- Foundation of HPC performance
- Suited for the full scope of workloads
- Industry-leading performance and performance/watt for serial & parallel workloads
- Focus on fast single core/thread performance with a moderate number of cores
Intel Many Integrated Core Architecture (Intel MIC):
- Performance and performance/watt optimized for highly parallelized compute workloads
- Common software tools with Xeon, enabling efficient application readiness and performance tuning
- IA extension to manycore
- Many cores/threads with wide SIMD

4 Intel Architecture Multicore and Manycore
More cores. Wider vectors. Co-processors.
Processors shown, in order:
- Intel Xeon processor 64-bit
- Intel Xeon processor 5100 series
- Intel Xeon processor 5500 series
- Intel Xeon processor 5600 series
- Intel Xeon processor E5 Product Family
- Intel Xeon processor code name Ivy Bridge
- Intel Xeon processor code name Haswell
- Intel Xeon Phi Coprocessor
Core(s): 1, 2, 4, 6, 8, 12 (for the first six processors, in the order listed). Threads: TBD.
Images do not reflect actual die sizes. Actual production die may differ from images.
Intel Xeon Phi Coprocessor extends established CPU architecture and programming concepts to highly parallel applications.

5 Introducing Intel Xeon Phi Coprocessors: Highly-parallel Processing for Unparalleled Discovery
Groundbreaking differences:
- Up to 61 IA cores / 1.1 GHz / 244 threads
- Up to 16 GB memory with up to 352 GB/s bandwidth
- 512-bit SIMD vector instructions
- Linux operating system, IP addressable
- Standard programming languages and tools
Leading to groundbreaking results:
- Over 1 TeraFlop/s double precision peak performance [1]
- Up to 2.2x higher memory bandwidth than on an Intel Xeon processor E5 family-based server [2]
- Up to 4x more performance per watt than with an Intel Xeon processor E5 family-based server [3]

6 MPSS - MIC Software Stack
- The MIC operating system is Linux!
- Host OS: Linux; Windows* to be added
- No need to deal with the low-level APIs!

7 Intel Xeon Phi Coprocessors: Introduced Full Portfolio (June, ISC)
3xxx Family: Outstanding Parallel Computing Solution
- Performance/$ leadership
- 6 GB GDDR5, 240 GB/s, >1 TFlops DP
- Models: 3120P, 3120A
5xxx Family: Optimized for High Density Environments
- Performance/watt leadership
- 8 GB GDDR5, >300 GB/s, >1 TFlops DP
- Models: 5110P, 5120D
7xxx Family: Highest Level of Features
- Performance leadership
- 16 GB GDDR5, 352 GB/s, >1.2 TFlops DP
- Models: 7120P, 7120X

8 Announced at ISC'13
Tianhe-2 ("MilkyWay 2"): largest supercomputer of the world, installed in Guangzhou, China
- First in the Top500 list
- 55 petaflops peak performance
- Intel Xeon processors E V2, based on 3rd Generation Intel Core architecture code name Ivy Bridge
- Intel Xeon Phi processors!

10 Intel Xeon Phi Architecture Overview
- Up to 61 cores at 1.1 GHz, in-order, supporting 4 threads each
- 512-bit Vector Processing Unit with 32 native registers
- High-speed bi-directional ring interconnect
- Fully coherent L2 cache
- 8 memory controllers, 16-channel GDDR5, up to ~350 GB/s bandwidth
- PCIe Gen2
- Reliability features: parity on L1 cache, ECC on memory, CRC on memory IO, CAP on memory

11 Core Architecture Overview
Scalar unit based on the Intel Pentium processor:
- Two pipelines; dual issue with vector instructions
- Pipelined one-per-clock scalar throughput
SIMD Vector Processing Engine
4 hardware threads per core:
- 4-clock latency, hidden by round-robin scheduling of threads
- Cannot issue back-to-back instructions in the same thread!
Caches:
- 32 KB L1 I-cache and 32 KB L1 D-cache
- Coherent 512 KB L2 cache per core, attached to the ring
Block diagram elements: instruction decode, scalar unit with scalar registers, vector unit with vector registers, L1 caches, L2 cache, ring.

12 Vector Processing Unit Extends the Scalar IA Core
[Figure: core pipeline diagram. Pipeline stages PPF, PF, D0, D1, D2, E, WB; instruction pointers for threads 0 to 3 (4 threads, in-order); L1 TLB and 32 KB L1 instruction cache; decoder fed at 16 B/cycle (2 IPC) and ucode; pipe 0 (u-pipe) and pipe 1 (v-pipe); VPU, x87 and scalar register files; 512-bit SIMD VPU, x87 unit, ALU 0 and ALU 1; L1 TLB and 32 KB L1 data cache; TLB miss handler, L2 TLB, hardware prefetcher (HWP) for L2; CRI; 512 KB L2 cache; on-die interconnect.]

14 One Source Base, Tuned to many Targets
Source -> compilers, libraries, parallel models -> targets:
- Multicore: multicore CPU
- Many-core: multicore CPU plus Intel MIC Architecture
- Cluster: multicore cluster; multicore and many-core cluster

15 Intel Developer Products - Intel Cluster Studio XE 2013
(Columns on the slide: Phase, Product, Feature, Benefit)
Intel Advisor XE
- Feature: threading design assistant (Studio products only)
- Benefit: simplifies, demystifies, and speeds parallel application design
Build: Intel Composer XE
- Feature: C/C++ and Fortran compilers, Intel Threading Building Blocks, Intel Cilk Plus, Intel Integrated Performance Primitives, Intel Math Kernel Library
- Benefit: enabling solution to achieve the application performance and scalability benefits of multicore and forward scale to many-core
Build: Intel MPI Library
- Feature: high performance Message Passing (MPI) library
- Benefit: enabling high performance scalability, interconnect independence, runtime fabric selection, and application tuning capability
Verify & Tune: Intel VTune Amplifier XE
- Feature: performance profiler for optimizing application performance and scalability
- Benefit: removes guesswork, saves time, makes it easier to find performance and scalability bottlenecks
Verify & Tune: Intel Inspector XE
- Feature: memory & threading dynamic analysis for code quality; static analysis for code quality
- Benefit: increases productivity and code quality and lowers cost; finds memory, threading, and security defects before they happen
Verify & Tune: Intel Trace Analyzer & Collector
- Feature: MPI performance profiler for understanding application correctness & behavior
- Benefit: analyze performance of MPI programs and visualize parallel application behavior and communication patterns to identify hotspots

16 Data Parallelism of Intel Processors (2)
Intel AVX:
- Vector size: 256 bit (bits 255..0)
- Data types: 32 and 64 bit float
- VL: 4, 8
Intel MIC:
- Vector size: 512 bit (bits 511..0)
- Data types: 32 and 64 bit integer, 32 and 64 bit float
- VL: 8, 16
[Figure: element-wise operation on vectors X and Y, lanes X1/Y1 to X8/Y8 for AVX and X1/Y1 to X16/Y16 for MIC. Illustrations: Xi, Yi & results are 32 bit float.]

17 Intel Cilk Plus Array Notation
A simple and elegant solution: a language construct for vector-level parallelism.
Syntax: <array base> [<lower bound>:<length>[:<stride>]]+
    A[:]       // All of vector A
    B[2:6]     // Elements 2 to 7 of vector B
    C[:][5]    // Column 5 of matrix C
    D[0:3:2]   // Elements 0,2,4 of vector D
Example:
    if (a[:] > b[:])
        c[:] = d[:] * e[:];
    else
        c[:] = d[:] * 2;

18 Vectorization for MIC: A new Target
Input: C/C++/FORTRAN source code
Ways to express/expose vector parallelism:
- Fully automatic analysis
- Vectorization hints (ivdep/vector pragmas)
- Data parallel part of the Intel Cilk Plus extension: array notation, elemental functions, SIMD pragma
Vectorizer: maps vector parallelism to the vector ISA (Intel SSE, Intel AVX, Intel MIC), followed by optimization and code generation.
Vectorizer makes retargeting easy!

19 Many-Core Hosted: Native Model
- Enabled by the -mmic compiler switch
- Fully supported by compiler vectorization, Intel MKL, OpenMP*, Intel TBB, Intel Cilk Plus, Intel MPI
- Might be an option for some applications:
  - Needs to fit into memory!
  - Should be highly parallel code
  - Serial parts are slower on MIC than on host!
  - Limited access to external environment like I/O
  - Native MIC file system exists in memory only!
  - NFS allows external I/O but

20 Many-Core Hosted with MPI
- MPI ranks on Intel Xeon Phi coprocessors (only)
- All messages into/out of Intel Xeon Phi coprocessors
- Programmed as a homogeneous network of many-core CPUs
[Figure: several Xeon + MIC nodes connected by a network; data and MPI reside on the MIC side of each node.]

21 Heterogeneous Programming
Parallel programming is the same on MIC and CPU.
[Figure: a CPU executable and a MIC native executable, connected via PCIe, each doing parallel compute with the same stack: tools, MKL, Fortran (CAF), TBB, OpenMP, C++, Cilk Plus, OpenCL.]

22 Heterogeneous Programming (with offload)
Parallel programming is the same on MIC and CPU.
[Figure: a CPU executable and a MIC native executable, connected via PCIe, each doing parallel compute with the same stack: tools, MKL, Fortran (CAF), TBB, OpenMP, C++, Cilk Plus, OpenCL. The two sides are linked by offload directives (non-shared model) and offload keywords (virtual shared-memory).]

23 Choices for Offloading Application Code
Two Intel-specific models for offloading are supported (Intel Composer 2013):
- LEO: Language Extensions for Offload, for C/C++ and Fortran
  - Explicit data transfer by compiler directives
- MYO: Mine-Yours-Ours, for C/C++
  - Shared virtual memory
  - Implicit offload controlled by language extensions for variable declaration etc.
Offloading and parallelism are orthogonal:
- Offloading only transfers control to the MIC devices
- Parallelism needs to be exploited by a second model (e.g. OpenMP*)

24 LEO: Explicit Offload
- Completely realized by directives and attributes
- Ignored by other compilers (might result in a warning)
- Requires bit-wise copyable data objects
- Programmer designates variables that need to be copied between host and card in the offload directive
C/C++ example:
    #pragma offload target(mic) in(data:length(size))
Fortran example:
    !dir$ offload target(mic) in(a1:length(size))
Strongly influenced the accelerator ("target") extension of the coming OpenMP* 4.0 standard:
    #pragma omp target map(to(b:count)) map(from(a:count))

25 C/C++ Extensions for explicit Offload
Offload pragma
- Syntax: #pragma offload <clauses> <statement block>
- Semantics: allow the next statement block to execute on Intel MIC Architecture or host CPU
Keyword for variable & function definitions
- Syntax: __attribute__((target(mic)))
- Semantics: compile function for, or allocate variable on, both CPU and Intel MIC Architecture
Entire blocks of code
- Syntax: #pragma offload_attribute(push, target(mic)) ... #pragma offload_attribute(pop)
- Semantics: mark entire files or large blocks of code for generation on both host CPU and Intel MIC Architecture
Data transfer
- Syntax: #pragma offload_transfer target(mic) <clauses>
- Semantics: initiates asynchronous data transfer, or initiates and completes synchronous data transfer

26 Example
    float reduction(float *data, int numberof)
    {
        float ret = 0.f;
        #pragma offload target(mic) in(data:length(numberof))
        {
            #pragma omp parallel for reduction(+:ret)
            for (int i = 0; i < numberof; ++i)
                ret += data[i];
        }
        return ret;
    }
Note: the length clause counts elements, not bytes. This copies numberof elements to the coprocessor, not numberof*sizeof(float) bytes; the compiler knows the type of data.

27 Example: Call Intel MKL on Coprocessor
    int main()
    {
        // initialize variables

        #pragma offload target(mic) \
            in(transa, transb, N, alpha, beta) \
            in(A:length(matrix_elements)) \
            in(B:length(matrix_elements)) \
            inout(C:length(matrix_elements))
        sgemm(&transa, &transb, &N, &N, &N, &alpha, A, &N, B, &N, &beta, C, &N);
    }
sgemm performs C = beta*C + alpha*A*B. transa and transb control the transposition of A and B, and the N arguments define the sizes of the matrices (see documentation). C is input and output; all others are input only. MKL will automatically make optimal use of MIC.

28 MYO: Implicit Offload
- Real language extensions; not accepted by other compilers
- Alternative model for non-compact data objects like a linked list
- Programmer marks variables that should be (virtually) shared between host and card
- Run-time system automatically maintains coherence at the boundary of offloaded code regions
Sample:
    _Cilk_shared double foo;
    _Offload func(y);

29 MYO Memory Model
- A section of memory is maintained at the same (!!) virtual address on both the host and the coprocessor
- Reserving the same address range on both devices allows:
  - Seamless sharing of complex pointer-containing data structures
  - Elimination of user marshaling and data management
  - Use of simple language extensions to C/C++
[Figure: a C/C++ executable on the host and the offload code on the Intel MIC, with the same address range reserved in host memory and in MIC memory.]

30 MYO: _Cilk_shared for Data & Routines
Function
- Syntax: int _Cilk_shared f(int x) { return x+1; }
- Semantics: versions generated for both CPU and card; may be called from either side
Global
- Syntax: _Cilk_shared int x = 0;
- Semantics: visible on both sides
File/Function static
- Syntax: static _Cilk_shared int x;
- Semantics: visible on both sides, only to code within the file/function
Class
- Syntax: class _Cilk_shared x { };
- Semantics: class methods, members, and operators are available on both sides
Pointer to shared data
- Syntax: int _Cilk_shared *p;
- Semantics: p is local (not shared), can point to shared data
A shared pointer
- Syntax: int *_Cilk_shared p;
- Semantics: p is shared; should only point at shared data
Entire blocks of code
- Syntax: #pragma offload_attribute(push, _Cilk_shared) ... #pragma offload_attribute(pop)
- Semantics: mark entire files or large blocks of code _Cilk_shared using this pragma

31 Example
    // Shared variable declaration for pi
    _Cilk_shared float pi;

    _Cilk_shared void compute_pi(int count)
    {
        int i;
        #pragma omp parallel for \
            reduction(+:pi)
        for (i = 0; i < count; i++) {
            float t = (float)((i + 0.5f) / count);
            pi += 4.0f / (1.0f + t * t);
        }
    }

    int main()
    {
        int count = 10000;
        // Initialize shared global variables
        pi = 0.0f;
        // Compute pi on target
        _Offload compute_pi(count);
        pi /= count;
    }

32 Offload Models plus MPI
- MPI ranks on Intel Xeon processors (only)
- All messages into/out of processors
- Offload models used to accelerate MPI ranks
- Homogeneous network of hybrid nodes
[Figure: several Xeon + MIC nodes connected by a network; data and MPI reside on the Xeon side, with offload to the MIC.]

33 Intel Xeon Phi Coprocessor OpenCL* Compiler
[Figure: compiler pipeline built on Clang* and LLVM*. Components shown: Clang* OpenCL* front end producing LLVM* IR; LLVM* standard passes; LLVM* OpenCL passes (barriers, builtins, kernel arguments); LLVM* vectorizer (scalarizer, divergence analysis, predicator, packetizer, bypasses); code generator emitting Xeon Phi code.]

34 Intel VTune Amplifier XE 2013 for Intel Xeon Phi
- Select which problem areas you want to analyze
- Beginning with update 4, the methodology in this presentation is implemented in the General Exploration profile
- Events to collect will be configured automatically

35 MIC Debugging with GDB*
Run the Host-MIC GDB* on your localhost (you can't use the default host gdb!):
    /usr/linux-k1om-4.7/bin/x86_64-k1om-linux-gdb
Start gdbserver on the Intel Xeon Phi Coprocessor:
- To remote debug using pipe to ssh:
    (gdb) target extended-remote ssh -T mic0 ~/gdbserver --multi IP:port
- To remote debug using stdio:
    (gdb) target extended-remote ssh -T mic0 ~/gdbserver --multi -
To attach to a running application via the process ID (pid):
    (gdb) file /local/path/to/application
    (gdb) attach <remote-pid>
To run an application directly from GDB*:
    (gdb) file /local/path/to/application
    (gdb) set remote exec-file /target/path/to/application

36 Extension to Microsoft Windows* OS
- Microsoft Windows is being added as a second host operating system (Windows 2008 Server)
- No change for the operating system of the coprocessor: it remains Linux
- Developer environment is the same as for Linux
- Same programming models
- In beta testing today
- To be released H2/

37 Future Product Line: Knights Landing
- Knights Landing is the code name for the 2nd generation product for the Intel Many Integrated Core Architecture
- Knights Landing targets Intel's 14nm manufacturing process
- Knights Landing will be productized as a processor (running the host OS) and as a coprocessor (a PCI end-point device)
- Knights Landing will feature on-package high-bandwidth memory

39 Summary
- The Intel Xeon Phi coprocessor is a real product now!
- The flexibility of the programming models offers a solution for all needs: from a single MIC card in a workstation up to thousands of MIC cards attached to HPC cluster nodes
- Compilers and other tools make it easy to develop or port code to run natively on the coprocessors, heterogeneously on host CPU + Intel MIC systems, and on collections of these connected in a compute cluster
- No new parallel programming models needed: all Intel-supported models are available for MIC too, including innovative models like Coarray Fortran
- Using MIC is a simple extension of CPU programming

40 Get Educated on Intel Xeon Phi Coprocessors
Tools That Enable You to be Ready!
- Intel Xeon Phi Coprocessor (codename Knights Corner)
- Intel Xeon Phi Coprocessor Instruction Set Architecture Reference Manual
- An Overview of Programming for Intel Xeon processors and Intel Xeon Phi coprocessors
- Intel Manycore Platform Software Stack (MPSS)
- Intel Xeon Phi Coprocessor Developer's Quick Start Guide
- Intel Xeon Phi Coprocessor System Software Developers Guide
- Intel and Third Party Tools and Libraries available with support for Intel Xeon Phi Coprocessor
- Optimization and Performance Tuning for Intel Xeon Phi Coprocessors, Part 1: Optimization Essentials
- Optimization and Performance Tuning for Intel Xeon Phi Coprocessors, Part 2: Understanding and Using Hardware Events

43 Vector Instruction Performance
- VPU contains 16 SP ALUs and 8 DP ALUs
- Most VPU instructions have a latency of 4 cycles and a throughput (TPT) of 1 cycle
- Load/Store/Scatter have 7-cycle latency
- Convert/Shuffle have 6-cycle latency
- VPU instructions are issued in the u-pipe
- Certain instructions can also go to the v-pipe: Vector Mask, Vector Store, Vector Packstore, Vector Prefetch, Scalar

44 Spectrum of Execution Models
From CPU-centric (Intel Xeon processor) to Intel MIC-centric (Intel Many Integrated Core); the three models in the middle are the heterogeneous models:
- Multi-core Hosted: general purpose serial and parallel computing
- Offload: codes with highly-parallel phases
- Symmetric: codes with balanced needs
- Reverse Offload: codes with serial phases
- Many-core Hosted: highly-parallel codes
[Figure: for each model, placement of Main(), Foo() and MPI_*() on the multi-core and many-core sides, connected via PCIe.]
Supported with Intel Tools
