Training Agenda
Session 1: Introduction, 8:00-9:45
Session 2: Native: MIC stand-alone, 10:00-11:45
Lunch break
Session 3: Offload: MIC as coprocessor, 1:00-2:45
Session 4: Symmetric: MPI, 3:00-4:45
An Introduction to the Intel Xeon Phi
Si Liu
Feb 6, 2015
Last modified: Feb 1st, 2015
The University of Texas at Austin, 2015
Please see the final slide for copyright and licensing information.
Key References
James Jeffers and James Reinders, Intel Xeon Phi Coprocessor High-Performance Programming, 2013 (but some material is no longer current)
Jim Browne, Quick Start Guide - Incorporation of MICs in Application Execution
Intel Developer Zone: http://software.intel.com/en-us/mic-developer
Stampede User Guide and related TACC resources: https://portal.tacc.utexas.edu/user-guides/stampede
Other specific recommendations throughout the day
Outline
What is MIC (Xeon Phi)?
Why use MIC?
How to use MIC?
Assessment and Summary
Lab 1
MIC and Xeon Phi
First product of Intel's Many Integrated Core (MIC) architecture
Described as a Symmetric Multiprocessor (SMP) on-a-chip
Coprocessor on a PCI Express card
Stripped-down Linux operating system
Dense, simplified processor: wider vector unit, higher hardware thread count, many power-hungry operations removed
Lots of names:
- Many Integrated Core architecture, aka MIC
- Knights Corner (codename)
- Intel Xeon Phi Coprocessor SE10P (product name)
TACC Stampede Supercomputer
Flagship supercomputer of TACC and XSEDE
Dell Linux cluster with CentOS
6,400+ Dell PowerEdge server nodes
Outfitted with Intel Xeon E5 processors and Intel Xeon Phi coprocessors
500k cores, 1.8M threads, 9+ PetaFLOPS
Login nodes, large-memory nodes, graphics nodes
Global parallel Lustre file system + local disk
Typical Stampede Node ("blade")
Dell PowerEdge C8220 ("DCS Zeus") compute node
Stampede is the first large-scale production system with MIC technology.
Typical Stampede Node: processor
The main processor in each node is a multicore CPU (host): Xeon E5 ("Sandy Bridge")
Two 8-core Intel Xeon E5 (Sandy Bridge) processors in a 2-socket motherboard: 16 cores per node
32GB DDR3 memory per node
6,400 nodes, 102,400 cores: 2.2 PF base system
Typical Stampede Node: coprocessor
Host and coprocessor connected via x16 PCIe: ~64Gb/sec each way*
Very roughly the same bandwidth as the 56 Gb/sec inter-node InfiniBand fabric
*There's quite a bit of fine print here, especially regarding bandwidth associated with MPI communications.
Typical Stampede Node: coprocessor
The "innovative component" of Stampede: 7+ PetaFLOPS across 400,000 cores
Coprocessor (MIC): Xeon Phi ("Knights Corner"), attached via x16 PCIe to the host (Xeon E5, "Sandy Bridge", 16 cores, 32GB RAM)
Low-power, simplified ("lightweight") cores designed and intended for a high degree of parallelism
x86 instruction set (source-code compatible) with device-specific binaries
61 lightweight cores and 8GB RAM per coprocessor; each core has 4 hardware threads, for 244 threads total
Typical Stampede Node: coprocessor
MIC runs a stripped-down Linux and a bash shell
Can ssh to the MIC, though you seldom need to do so
Only essential tools and a minimalist environment
No hard drive or SSD: everything's in RAM
Shared file systems are mounted alongside the local file systems
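From a compute node the coprocessor looks like a separate Linux host. A quick sketch (the hostname alias mic0 follows Stampede convention; the thread count is 61 cores x 4 hardware threads):

```
ssh mic0                                # log in to the coprocessor from the compute node
grep -c processor /proc/cpuinfo         # should report 244 logical processors (61 cores x 4)
```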
Typical Stampede Node: coprocessor
Signature characteristics of Xeon Phi:
- Lower clock speed: ~1.1GHz, vs 2.7-3.1GHz on the host
- Power-hungry functionality removed (e.g. branch prediction, out-of-order execution)
- 512-bit vector unit (twice the width of the Xeon E5's): 8 DP or 16 SP results per cycle
- Focus on FLOPS as well as fast memory (GDDR5); peak memory bandwidth 352 GB/s
- Roughly 1 TFLOP vs the host's 225 GFLOPS, each at ~300W
Vectorization on the Xeon Phi
Each core has a 512-bit (64-byte) vector unit
Up to 8 double-precision (or 16 single-precision) results per cycle
Supports Fused Multiply-Add (FMA3), so 16 DP ops/cycle
[Figure: arrays a, x, and b combined elementwise as y[i] = a[i]*x[i] + b[i]]
Multiple MICs on a Node
480 nodes have two MICs
Each MIC is connected via PCIe to the host motherboard*
MICs are not connected to each other: all transfers go through the host
Use the normal-2mic queue
*The schematic is deceptive. Each MIC connects to the two-socket motherboard. Do not think that each MIC connects to its own 8-core Xeon E5 chip.
Specification of Xeon Phi Coprocessor
Clock frequency: ~1.1 GHz
Cores: 61, with 4 threads/core
Vector unit: 512 bits wide
Peak: >1.0 TeraFLOPS DP, >2.1 TeraFLOPS SP
Data cache: L1 32KB/core; L2 512KB/core, 30.5 MB/chip (bidirectional ring network)
Memory: 8 GB GDDR5, ~352 GB/s (in theory)
PCIe: ~64Gb/s each way
Outline
What is MIC (Xeon Phi)?
Why use MIC?
How to use MIC?
Assessment and Summary
Lab 1
Better performance
[Performance comparison chart; source: software.intel.com]
Advantages of MIC
Intel's MIC is based on x86 technology
- x86 cores with caches and cache coherency
- SIMD instruction set
Programming for MIC is similar to programming for CPUs
- Familiar languages: C/C++ and Fortran
- Familiar parallel programming models: OpenMP and MPI
- MPI on the host and on the coprocessor
- Any code can run on MIC, not just kernels
Optimizing for MIC is similar to optimizing for CPUs
- Optimize once, run anywhere
- Our early MIC porting efforts for codes in the field are frequently doubling performance on Sandy Bridge
Market Penetration
MIC is hitting its stride
- Top 10: Tianhe-2 (#1) and Stampede (#7)
- Beacon (NICS) and SuperMIC (LSU)
- DOE: Cori (NERSC), the first major next-generation MIC (KNL) system
- MIC-equipped systems are now generating more Top500 FLOPS than GPU systems (49PF vs 46PF)
Next generation (Knights Landing) shows great promise
- Available as a stand-alone host
- Up to 16GB of high-speed (5x) "on package" memory plus additional "traditional" DDR4 RAM
- (Near complete) binary compatibility with Haswell processors
Outline
What is MIC (Xeon Phi)?
Why use MIC?
How to use MIC?
Assessment and Summary
Lab 1
How to use the Xeon Phi
Several distinct execution models
How you run your application depends on how it uses the MIC
Hybrid execution models are common
Host-Only: Ignore the MIC
Traditional CPU mode; no source-code changes
Compile your C/C++/Fortran code with Intel compilers
Pure MPI or MPI + X, where X is OpenMP, TBB, Cilk+, etc.
Run on any compute node
Application runs on one or more hosts (only)
Native: One MIC running autonomously
To compile: add one new flag, -mmic
To run: can manually ssh to the MIC, but you don't need to
Multiple programming models available
- Plausible: OpenMP, MPI+X
- Implausible: serial, pure MPI
Can safely use all 61 cores
- 60 cores with 60/120/180/240 threads are recommended
Good performance requires a high degree of parallelism
App runs on a single MIC (only)
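The native workflow can be sketched as follows (the source file name is illustrative; only the -mmic flag is MIC-specific, and the exact launch mechanics on Stampede are covered in the lab):

```
# Cross-compile on the host for native execution on the coprocessor
icc -openmp -mmic hello.c -o hello.mic

# One way to run: log in to the coprocessor and launch it there
ssh mic0 ./hello.mic
```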
Offload: MIC as assistant processor
A program running on the host offloads work by directing the MIC to execute a specified block of code
The host directs the exchange of data between host and MIC
Resembles GPU programming, but in ordinary C/C++/Fortran with OpenMP-style directives
Compile and run on the host
Ideally, the host stays active while the MIC coprocessor does its assigned work
Automatic Offload (AO)
Pre-packaged offload in the Intel Math Kernel Library (MKL)
Accelerates frequently used math processing
- Computationally intensive linear algebra: highly vectorized and threaded matrix computation, FFT, etc.
- xgemm, xgesv, etc.
MKL automatically manages the details
More than offload: work division/balance across host and MIC!
Works from C/C++, Fortran, R, Python, MATLAB, etc.
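AO requires no code changes; it is switched on through the environment. A sketch, assuming a binary already linked to threaded MKL (the application name is illustrative; the variable names follow Intel's MKL documentation of this era):

```
export MKL_MIC_ENABLE=1          # turn on MKL Automatic Offload
export MKL_MIC_WORKDIVISION=0.5  # optional: fraction of the work sent to the MIC
./my_mkl_app                     # unmodified binary; large MKL calls may now offload
```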
Symmetric: MPI Tasks Across Multiple Devices
MPI tasks on MICs and hosts (or on multiple MICs only)
To compile: build two versions of the executable, one for each side
To run: use a special TACC MPI launcher that manages the details
Challenges include
- Heterogeneity and capacity (RAM, performance)
- Communication (MICs connected only through their hosts)
- File operations (MIC-based I/O is slow at best, problematic at worst)
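The two-executable workflow can be sketched as follows (the source file name is illustrative, and the launcher invocation follows the pattern in TACC's Stampede documentation; check the user guide for the current syntax):

```
# Build one executable per architecture from the same source
mpicc       mycode.c -o mycode.host
mpicc -mmic mycode.c -o mycode.mic

# TACC's symmetric launcher starts MPI tasks on both hosts and MICs
ibrun.symm -m ./mycode.mic -c ./mycode.host
```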
Outline
What is MIC (Xeon Phi)?
Why use MIC?
How to use MIC?
Assessment and Summary
Lab 1
MIC-Enabled Applications: Stampede Modules

Application | Module              | Domain       | Comments
LAMMPS      | lammps              | chem         | offload; 1-3x improvement
NAMD        | namd/2014_07_31-mic | chem         | offload; 2x improvement
WRF         | wrf/3.6             | weather      | symmetric; 1.4x improvement
DDT         | ddt                 | debugging    |
VTune       | vtune               | profiling    |
PAPI        | papi                | profiling    |
PerfExpert  | perfexpert          | profiling    |
ITAC        | pending             | profiling    |
NetCDF/3    | netcdf/3.6.3        | i/o          | but note issues with MIC-based file ops
NetCDF/4    | pending             | parallel i/o | but note issues with MIC-based file ops
HDF5        | pending             | parallel i/o | but note issues with MIC-based file ops
MVAPICH2    | mvapich2-mic        | MPI          |
IMPI        | impi                | MPI          |
MIC-Enabled Applications: Other Research

Application      | Availability        | Domain     | Comments
Chroma           | release soon        | chem       | symmetric; 1.5x improvement
Quantum Espresso | download and build* | chem       | AO; 2.5x on 2-mic
SeisSol          | contact developers  | seismology | offload; scales to 6000 nodes
GROMACS          | 5.0.1-dev*          | chem       | MIC optimization ongoing
PvFMM            | github              | FM Solver  | offload; 2.5x improvement

*Stampede module expected in the near term
MIC-Enabled Applications: Stampede Modules Supporting AO

Application | Module              | Domain     | Comments
C/C++       | intel (any)         | dev        |
Fortran     | intel (any)         | dev        |
R           | Rstats/3.0.3        | dev (stat) |
MATLAB      | matlab              | dev        | bring your own license
Python      | python/2.7.6        | dev        |
PETSc       | petsc/3.3, 3.4, 3.5 | solvers    | excludes gcc builds of 3.4
gsl         | gsl/1.16            | solvers    |

AO is available to any app built with an Intel compiler and linked to threaded MKL. When using an Intel compiler, MKL is available without loading a module.
Can I port my code to the MIC?
Answer: probably
- "Porting" is generally very easy: little more than compiling with a new flag
- The only major issue is availability of libraries
But that's the wrong question. The right question is: will MIC improve my code's performance?
How to Achieve High Performance
A high degree of parallelism:
- Vectorization (local data parallelism)
- Threading (multi-thread parallelism)
Hard work invested in optimization
- But it is well worth the effort
More demonstrations in the following sessions
MIC Strengths
Not just a coprocessor
- More than just kernels: MPI, native codes
Familiar languages and programming models
- C, C++, Fortran
- OpenMP and MPI (and others)
Optimizing MIC code will benefit host code
- 2x improvement (new host to old host) is common
Accelerator programmers have already done the hard part (encapsulating the offload)
MIC Issues and Questions
Current issues
- Amortizing the transfers
- MIC and host working together
- File operations from the MIC
- Software (especially library) availability
- Memory limitations
Emerging issues
- Evolution of accelerator specifications
- Evolution of the hardware
Summary
We're at the front end of a disruptive technology
- Good news: opportunities for breakthroughs
- Bad news: software stack and best practices still evolving
Things to remember about the transition
- Porting is generally easy
- Performance requires hard work
- That work will also benefit the host side
Outline
What is MIC (Xeon Phi)?
Why use MIC?
How to use MIC?
Assessment and Summary
Lab 1
Lab 1
Launching codes on Stampede
- Batch (sbatch) vs interactive (idev)
- Host-only MPI, native MIC launched from the host, native MIC launched from the MIC
To begin: tar xvf ~train00/mic1502/stampede_tour.tar
Follow the instructions in README.TOUR
Si Liu
siliu@tacc.utexas.edu
For more information: www.tacc.utexas.edu
License
The University of Texas at Austin, 2015
This work is licensed under the Creative Commons Attribution Non-Commercial 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/
When attributing this work, please use the following text: "Step Up to the MIC: An Introduction to the Intel Xeon Phi, Texas Advanced Computing Center, 2013. Available under a Creative Commons Attribution Non-Commercial 3.0 Unported License."