An Introduction to the Intel Xeon Phi. Si Liu Feb 6, 2015


Training Agenda
Session 1: Introduction, 8:00-9:45
Session 2: Native: MIC stand-alone, 10:00-11:45
Lunch break
Session 3: Offload: MIC as coprocessor, 1:00-2:45
Session 4: Symmetric: MPI, 3:00-4:45

An Introduction to the Intel Xeon Phi
Si Liu
Feb 6, 2015 (last modified Feb 1st, 2015)
The University of Texas at Austin, 2015
Please see the final slide for copyright and licensing information.

Key References
- James Jeffers and James Reinders, Intel Xeon Phi Coprocessor High-Performance Programming, 2013 (but some material is no longer current)
- Jim Browne, Quick Start Guide - Incorporation of MICs in Application Execution
- Intel Developer Zone: http://software.intel.com/en-us/mic-developer
- Stampede User Guide and related TACC resources: https://portal.tacc.utexas.edu/user-guides/stampede
- Other specific recommendations throughout the day

Outline
- What is MIC (Xeon Phi)?
- Why use MIC?
- How to use MIC?
- Assessment and Summary
- Lab 1


MIC and Xeon Phi
First product of Intel's Many Integrated Core (MIC) architecture
Described as a "Symmetric Multiprocessor (SMP) on-a-chip"
Coprocessor
- PCI Express card
- Stripped-down Linux operating system
Dense, simplified processor
- Wider vector unit
- Higher hardware thread count
- Many power-hungry operations removed
Lots of names
- Many Integrated Core architecture, aka MIC
- Knights Corner (codename)
- Intel Xeon Phi Coprocessor SE10P (product name)

TACC Stampede Supercomputer
Flagship supercomputer of TACC and XSEDE
Dell Linux cluster running CentOS
6,400+ Dell PowerEdge server nodes
Outfitted with Intel Xeon E5 processors and Intel Xeon Phi coprocessors
500k+ cores, 1.8M+ threads, 9+ PetaFLOPS
Login nodes, large-memory nodes, graphics nodes
Global parallel Lustre file system + local disk

Typical Stampede Node (= "blade")
Dell PowerEdge C8220 ("DCS Zeus") compute node
Stampede is the first large-scale production system with MIC technology.

Typical Stampede Node: processor
Main processor in each node is multicore: two 8-core Intel Xeon E5 ("Sandy Bridge") processors in a 2-socket motherboard
32GB DDR3 memory per node
6,400 nodes, 102,400 cores: 2.2 PF base system
[Diagram: CPU (host) with two 8-core sockets, 16 cores total, 32GB RAM]

Typical Stampede Node: coprocessor
Host and coprocessor connected via x16 PCIe
~64 Gb/s each way*: very roughly the same bandwidth as the 56 Gb/s inter-node InfiniBand fabric
*There's quite a bit of fine print here, especially regarding bandwidth associated with MPI communications.

Typical Stampede Node: coprocessor
The coprocessor (MIC) is a Xeon Phi ("Knights Corner"), attached to the host over x16 PCIe
Together the coprocessors form the 7+ PetaFLOPS, 400,000-core "innovative component" of the system
Low-power, simplified ("lightweight") cores designed and intended for a high degree of parallelism
x86 instruction set (source-code compatible) with device-specific binaries
61 cores, 244 threads per coprocessor (4 hardware threads per core), 8GB RAM
[Diagram: host with two 8-core sockets and 32GB RAM; MIC with 61 lightweight cores and 8GB RAM]

Typical Stampede Node: coprocessor
MIC runs stripped-down Linux and a bash shell
Can ssh to the MIC, but you seldom need to do so
Only essential tools and a minimalist environment
No hard drive or SSD: everything's in RAM
Shared FS mounts plus local file systems

Typical Stampede Node: coprocessor
Signature characteristics of the Xeon Phi:
- Lower clock speed: ~1.1 GHz, vs. 2.7-3.1 GHz on the Xeon E5
- Power-hungry functionality removed (e.g. branch prediction, out-of-order execution)
- 512-bit vector unit (twice the width of the Xeon E5's): 8 DP or 16 SP results per cycle
- Focus on FLOPS as well as fast memory (GDDR5): peak memory bandwidth 352 GB/s
- ~1 TFLOPS peak at ~300W, vs. ~225 GFLOPS at roughly the same power for the host's two Xeon E5 sockets

Vectorization on the Xeon Phi
Each core has a 512-bit (64-byte) vector unit
- up to 8 double-precision (or 16 single-precision) results per cycle
- supports Fused Multiply-Add (FMA), so 16 DP ops/cycle
[Diagram: arrays a, x, and b combined elementwise as y[i] = a[i]*x[i] + b[i], eight double-precision lanes at a time]

Multiple MICs on a Node
480 nodes have two MICs
Each MIC connects via PCIe to the host motherboard*
MICs are not connected to each other; all transfers go through the host
Use the normal-2mic queue
*The schematic is deceptive: each MIC connects to the two-socket motherboard. Do not think that each MIC connects to its own 8-core Xeon E5 chip.

Specification of the Xeon Phi Coprocessor
~1.1 GHz clock frequency
61 cores, 4 threads/core
512-bit wide vector unit
>1.0 TeraFLOPS DP, >2.1 TeraFLOPS SP
Data cache: L1 32KB/core; L2 512KB/core, 30.5 MB/chip (bidirectional ring network)
Memory: 8 GB GDDR5, ~352 GB/s (in theory)
PCIe: ~64 Gb/s each way


Better Performance
[Figure: performance comparison chart, source: software.intel.com]

Advantages of MIC
Intel's MIC is based on x86 technology
- x86 cores with caches and cache coherency
- SIMD instruction set
Programming for MIC is similar to programming for CPUs
- Familiar languages: C/C++ and Fortran
- Familiar parallel programming models: OpenMP & MPI
- MPI on the host and on the coprocessor
- Any code can run on MIC, not just kernels
Optimizing for MIC is similar to optimizing for CPUs
- "Optimize once, run anywhere"
- Our early MIC porting efforts for codes in the field are frequently doubling performance on Sandy Bridge

Market Penetration
MIC is hitting its stride
- Top 10: Tianhe-2 (#1) and Stampede (#7)
- Beacon (NICS) and SuperMIC (LSU)
- DOE: Cori (NERSC) will be the first major next-generation MIC (KNL) system
- MIC-equipped systems are now generating more Top500 FLOPS than GPU systems (49 PF vs. 46 PF)
Next generation (Knights Landing) shows great promise
- Available as a stand-alone host
- Up to 16GB of high-speed (5x) "on-package" memory plus additional "traditional" DDR4 RAM
- (Near-complete) binary compatibility with Haswell processors


How to Use the Xeon Phi
Several distinct execution models
How you run your application depends on how it uses the MIC
Hybrid execution models are common

Host-Only: Ignore the MIC
Traditional CPU mode; no source-code changes
Compile your C/C++/Fortran code with the Intel compilers
Pure MPI, or MPI + X (X: OpenMP, TBB, Cilk Plus, etc.)
Run on any compute node
Application runs on one or more hosts (only)

Native: One MIC Running Autonomously
To compile: add one new flag, -mmic
To run: you can manually ssh to the MIC, though you don't need to
Multiple programming models available
- Plausible: OpenMP, MPI+X
- Implausible: serial, pure MPI
Can safely use all 61 cores
- 60 cores with 60/120/180/240 threads are recommended
Good performance requires a high degree of parallelism
App runs on a single MIC (only)

Offload: MIC as Assistant Processor
A program running on the host offloads work by directing the MIC to execute a specified block of code
The host directs the exchange of data between host and MIC
Resembles GPU programming, but in ordinary C/C++/Fortran
Compile and run on the host
OpenMP-style directives
Ideally, the host stays active while the MIC coprocessor does its assigned work
[Diagram: app running on host; MIC does work and delivers results as directed]

Automatic Offload (AO)
Pre-packaged offload in the Intel Math Kernel Library (MKL)
Accelerates frequently-used math processing
- Computationally intensive linear algebra: highly vectorized and threaded matrix computation, FFT, etc.
- xgemm, xgesv, etc.
MKL automatically manages the details
More than offload: work division/balance across host and MIC!
Works from C/C++, Fortran, R, Python, MATLAB, etc.
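A sketch of how AO is typically switched on from the environment, with no source changes (variable names from the MKL documentation of that era; the values shown are illustrative):

```shell
# Enable MKL Automatic Offload for suitably large xgemm/xgesv/FFT calls
export MKL_MIC_ENABLE=1

# Optional: report what was offloaded and how long it took
export OFFLOAD_REPORT=2

# Optional: pin the host/MIC work split instead of letting MKL decide
# (here, send half the work to the coprocessor)
export MKL_MIC_WORKDIVISION=0.5
```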

Symmetric: MPI Tasks Across Multiple Devices
MPI tasks on MICs and hosts (or on multiple MICs only)
To compile: build two versions of the executable, one for each side
To run: use a special TACC MPI launcher that manages the details
Challenges include
- Heterogeneity and capacity (RAM, performance)
- Communication (MICs connected only through their hosts)
- File operations (MIC-based I/O is slow at best, problematic at worst)
[Diagram: MPI tasks spread across 16-core hosts and MICs, each with its own RAM]


MIC-Enabled Applications: Stampede Modules

Application  Module               Domain        Comments
LAMMPS       lammps               chem          offload; 1-3x improvement
NAMD         namd/2014_07_31-mic  chem          offload; 2x improvement
WRF          wrf/3.6              weather       symmetric; 1.4x improvement
DDT          ddt                  debugging
VTune        vtune                profiling
PAPI         papi                 profiling
PerfExpert   perfexpert           profiling
ITAC         pending              profiling
NetCDF/3     netcdf/3.6.3         i/o           but note issues with MIC-based file ops
NetCDF/4     pending              parallel i/o  but note issues with MIC-based file ops
HDF5         pending              parallel i/o  but note issues with MIC-based file ops
MVAPICH2     mvapich2-mic         MPI
IMPI         impi                 MPI

MIC-Enabled Applications: Other Research Codes

Application       Availability         Domain      Comments
Chroma            release soon         chem        symmetric; 1.5x improvement
Quantum Espresso  download and build*  chem        AO; 2.5x on 2-MIC nodes
SeisSol           contact developers   seismology  offload; scales to 6,000 nodes
GROMACS           5.0.1-dev*           chem        MIC optimization ongoing
PvFMM             github               FM solver   offload; 2.5x improvement

*Stampede module expected in the near term

MIC-Enabled Applications: Stampede Modules Supporting AO

Application  Module               Domain      Comments
C/C++        intel (any)          dev
Fortran      intel (any)          dev
R            Rstats/3.0.3         dev (stat)
MATLAB       matlab               dev         bring your own license
Python       python/2.7.6         dev
PETSc        petsc/3.3, 3.4, 3.5  solvers     excludes gcc builds of 3.4
GSL          gsl/1.16             solvers

AO is available to any app built with an Intel compiler and linked to threaded MKL.
When using an Intel compiler, MKL is available without loading a module.

Can I Port My Code to the MIC?
Answer: probably
- "Porting" is generally very easy: it involves little more than compiling with a new flag
- The only major issue is availability of libraries
But that's the wrong question. The right question is: will the MIC improve my code's performance?

How to Achieve High Performance
Degree of parallelism
- Vectorization (local data parallelism)
- Threading (multi-thread parallelism)
Hard work invested in optimization
- But it is well worth the effort
- More demonstration in the following sessions

MIC Strengths
Not just a coprocessor
- More than just kernels: MPI, native codes
Familiar languages and programming models
- C, C++, Fortran
- OpenMP and MPI (and others)
Optimizing MIC code will benefit host code
- 2x improvement (new host vs. old host) is common
Accelerator programmers have already done the hard part (encapsulating the offload)

MIC Issues and Questions
Current issues
- Amortizing the transfers
- MIC and host working together
- File operations from the MIC
- Software (especially library) availability
- Memory limitations
Emerging issues
- Evolution of accelerator specifications
- Evolution of the hardware

Summary
We're at the front end of a disruptive technology
- Good news: opportunities for breakthroughs
- Bad news: the software stack and best practices are still evolving
Things to remember about the transition
- Porting is generally easy
- Performance requires hard work
- That work will also benefit the host side


Lab 1
Launching codes on Stampede
- Batch (sbatch) vs. interactive (idev)
- Host-only MPI; native MIC launched from host; native MIC launched from MIC
Get the materials: tar xvf ~train00/mic1502/stampede_tour.tar
Follow the instructions in README.TOUR

Si Liu
siliu@tacc.utexas.edu
For more information: www.tacc.utexas.edu

License
The University of Texas at Austin, 2015
This work is licensed under the Creative Commons Attribution Non-Commercial 3.0 Unported License. To view a copy of this license, visit http://creativecommons.org/licenses/by-nc/3.0/
When attributing this work, please use the following text: "Step Up to the MIC: An Introduction to the Intel Xeon Phi, Texas Advanced Computing Center, 2013. Available under a Creative Commons Attribution Non-Commercial 3.0 Unported License."