Introduction to the Intel Xeon Phi on Stampede

John Cazes, Texas Advanced Computing Center, June 10, 2014

Stampede - High-Level Overview
Base cluster (Dell/Intel/Mellanox):
- Intel Sandy Bridge processors
- Dell dual-socket nodes w/ 32GB RAM (2GB/core)
- 6,400 nodes
- 56 Gb/s Mellanox FDR InfiniBand interconnect
- More than 100,000 cores, 2.2 PF peak performance
Co-processors:
- Intel Xeon Phi "MIC" (Many Integrated Core) processors
- Special release of Knights Corner (61 cores)
- 7+ PF peak performance
Max total concurrency: exceeds 500,000 cores and 1.8M threads

Additional Integrated Subsystems
- Stampede includes 16 1TB Sandy Bridge shared-memory nodes with dual GPUs.
- 128 of the compute nodes are also equipped with NVIDIA Kepler K20 GPUs (and MICs, for performance bake-offs).
- 16 login, data mover, and management servers (batch, subnet manager, provisioning, etc.)
- Software included for high-throughput computing and remote visualization.
- Storage subsystem driven by Dell storage nodes:
  - Aggregate bandwidth greater than 150GB/s
  - More than 14PB of capacity
  - Similar partitioning of disk space into multiple Lustre filesystems as previous TACC systems ($HOME, $WORK, and $SCRATCH)

Power/Physical
Stampede spans 182 48U cabinets. Power density (after the upgrade in 2015) will exceed 40 kW per rack. Estimated 2015 peak power is 6.2 MW.

Stampede Footprint
                  Ranger      Stampede
  Machine space   3,000 ft2   8,000 ft2
  Peak            0.6 PF      ~10 PF
  Power           3 MW        6.5 MW
The machine room expansion added 6.5 MW of additional power.

Actually, there is far more utility space than machine space. It turns out the utilities for the datacenter cost more, take more time to build, and occupy more space than the computing systems themselves.

Innovative Component
One of the goals of the NSF solicitation was to introduce a major new innovative capability component to the science and engineering research communities. We proposed the Intel Xeon Phi coprocessor (Many Integrated Core, or MIC):
- one first-generation Phi installed per host during initial deployment
- a confirmed injection of 1,600 future-generation MICs in 2015 (5+ PF)

An Inflection Point (or Two) in High Performance Computing
Relatively boring, but rapidly improving, architectures for the last 16 years:
- Performance rising much faster than Moore's Law
- But power rising faster
- And concurrency rising faster than that, with serial performance decreasing
Something had to give.

Accelerated Computing for Exascale
Exascale systems, predicted for 2018, would have required 500MW on the old curves. Something new was clearly needed. The accelerated computing movement was reborn (this happens periodically, starting with 387 math coprocessors).

Key Aspects of Acceleration
We have lots of transistors: Moore's Law is holding, so this isn't necessarily the problem. We don't really need lower power per transistor, we need lower power per *operation*. How to do this?
- NVIDIA GPU
- AMD Fusion
- FPGA
- ARM cores

Intel's MIC Approach
Since the days of RISC vs. CISC, Intel has mastered the art of figuring out what is important about a new processing technology and asking "why can't we do this in x86?" The Intel Many Integrated Core (MIC) architecture is about a large die, simpler circuits, and much more parallelism, in the x86 line.

Xeon Phi MIC
Xeon Phi = first product of Intel's Many Integrated Core (MIC) architecture
- Coprocessor on a PCI Express card
- Stripped-down Linux operating system
- But still a full host: you can log in to the Phi directly and run stuff!
Lots of names:
- Many Integrated Core architecture, aka MIC
- Knights Corner (code name)
- Intel Xeon Phi Coprocessor SE10P (product name of the version in Stampede)
(I should have many more trademark symbols in my slides if you ask Intel!)

Intel Xeon Phi Chip
- 22 nm process
- Based on what Intel learned from Larrabee, the SCC, and the TeraFlops Research Chip

MIC Architecture
- Many cores on the die
- L1 and L2 cache
- Bidirectional ring network for L2
- Memory and PCIe connection

George Chrysos, Intel, Hot Chips 24 (2012): http://www.slideshare.net/intelxeon/under-the-armor-of-knights-corner-intel-mic-architecture-at-hotchips-2012

Speeds and Feeds (Stampede SE10P version; your mileage may vary)
- Processor: ~1.1 GHz, 61 cores, 512-bit-wide vector unit, 1.074 TF peak DP
- Data cache: L1 32KB/core; L2 512KB/core, 30.5 MB/chip
- Memory: 8GB GDDR5 DRAM, 5.5 GT/s, 512-bit wide
- PCIe: 5.0 GT/s, 16-bit

What We at TACC Like About MIC
Intel's MIC is based on x86 technology:
- x86 cores w/ caches and cache coherency
- SIMD instruction set
Programming for MIC is similar to programming for CPUs:
- Familiar languages: C/C++ and Fortran
- Familiar parallel programming models: OpenMP & MPI
- MPI on the host and on the coprocessor
- Any code can run on MIC, not just kernels
Optimizing for MIC is similar to optimizing for CPUs:
- "Optimize once, run anywhere"
- Optimizing can be hard, but everything you do to your code should *also* improve performance on current and future regular Intel chips, AMD CPUs, etc.

Programming the Xeon Phi
Some universal truths of Xeon Phi coding:
- You need lots of threads. Really. Lots. At least 2 per core to have a chance; 3 or 4 per core are often optimal. This is not like old-school hyperthreading: if you don't have 2 threads per core, half your cycles will be wasted.
- You need to vectorize. This helps on regular processors too, but the effect is much more pronounced on the MIC chips. You can no longer ignore the vectorization report from the compiler.
If you can do these 2 things, good things will happen, in most any language.
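As a rough sketch (not from the original slides), here is one way to apply these two rules when building and running natively; -vec-report2 is the report flag for the Intel 13 compilers (newer compilers use -qopt-report), and the thread count is just the several-threads-per-core rule of thumb:

  # request the compiler's vectorization report while cross-compiling
  icc -openmp -mmic -vec-report2 mysource.c -o myapp.mic

  # on the coprocessor: several threads per core, spread evenly across cores
  export OMP_NUM_THREADS=240      # ~60 cores x 4 hardware threads (one core left to the OS)
  export KMP_AFFINITY=balanced    # spread threads evenly across the cores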

MIC Programming Experiences at TACC
Codes port easily:
- Minutes to days, depending mostly on library dependencies
Performance requires real work (while the silicon continues to evolve):
- Getting codes to run *at all* is almost too easy; you really need to put in the effort to get what you expect
Scalability is pretty good:
- Multiple threads per core: *really important*
- Getting your code to vectorize: *really important*

Typical Stampede Node (= "blade")
- CPU (host): Sandy Bridge - two Xeon E5 8-core processors, 16 cores, 32GB RAM
- Coprocessor (MIC): Knights Corner Xeon Phi - 61 lightweight cores, 8GB RAM, 4 hardware threads per core
The two are connected by x16 PCIe; the MIC runs a lightweight Linux-like OS (BusyBox).

Stampede Programming Models
Traditional cluster, or "host only":
- Pure MPI and MPI+X (OpenMP, TBB, Cilk Plus, OpenCL, ...)
- Ignore the MIC
Native Phi:
- Use one Phi and run OpenMP or MPI programs directly
Offload - MPI on the host, offload to Xeon Phi:
- Targeted offload through OpenMP extensions
- Automatically offload some library routines with MKL
Symmetric - MPI tasks on host and Phi:
- Treat the Phi (mostly) like another host

What Is a Native Application?
It is an application built to run exclusively on the MIC coprocessor.
- The MIC is not binary compatible with the host processor.
- The instruction set is similar to Pentium, but not all 64-bit scalar extensions are included.
- The MIC has 512-bit vector extensions, but does NOT have MMX, SSE, or AVX extensions.
- Native applications can't be used on the host CPU, and vice versa.

Why Run a Native Application?
It is possible to log in and run applications on the MIC without any host intervention.
- Easy way to get acquainted with the properties of the MIC
- Performance studies: single-card scaling tests (OpenMP/MPI), with no issues of data exchange with the host
- The native code probably performs quite well on the host CPU once you build a host version - a good path for symmetric runs

Will My Code Run on Xeon Phi?
Yes... but that's the wrong question. The better questions are "Will my code run *best* on Phi?" and "Will I get great Phi performance without additional work?"

Building a Native Application
Cross-compile on the host (login or compute nodes); no compilers are installed on the coprocessors. MIC is fully supported by the Intel C/C++ and Fortran compilers (v13+):

  icc   -openmp -mmic mysource.c   -o myapp.mic
  ifort -openmp -mmic mysource.f90 -o myapp.mic

The -mmic flag causes the compiler to generate a native MIC executable. It is convenient to use a .mic extension to differentiate MIC executables.
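A minimal run sketch, assuming the Stampede convention that the coprocessor on a compute node is reachable as mic0 and shares the home/work filesystems (both assumptions are site-specific):

  # from an interactive compute-node session
  ssh mic0 ./myapp.mic
  # or set the environment on the card in the same step
  ssh mic0 "OMP_NUM_THREADS=240 ./myapp.mic"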

Native Execution Quirks
The MIC runs a lightweight version of Linux, based on BusyBox.
- Some tools are missing: w, numactl
- Some tools have reduced functionality: ps
- Relatively few libraries have been ported to the coprocessor environment
These issues make the implicit or explicit launcher approach even more convenient.

What Is Offloading?
- Send a block of code to be executed on the coprocessor (MIC).
- There must be a binary of the code (code block or function); the compiler makes that binary and stores it in the executable (a.out).
- During execution on the CPU, the runtime is contacted to begin executing the MIC binary at an offload point.
- When the coprocessor is finished, the CPU resumes executing the CPU part of the code.
[Diagram: CPU execution on the host (Linux OS) is directed across PCIe to run a MIC binary section of code on the MIC (Linux micro OS).]

MPI with Offload to Phi
- Existing codes using accelerators have already identified regions where offload works well; porting these to OpenMP offload should be straightforward.
- Automatic offload where MKL kernel routines can be used: xGEMM, etc.

Directives
Directives can be inserted before code blocks and functions to run the code on the Xeon Phi coprocessor (the "MIC"). No recoding is required (optimization may require some changes). Directives are simple, but more details (specifiers) may be needed for optimal performance.
Data must be moved to the MIC. For large amounts of data:
- The transfer cost is amortized when doing large amounts of work.
- Keep data resident ("persistent").
- Move data asynchronously.
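The fragment below (a sketch, not from the slides) shows what such specifiers look like with the Intel offload pragmas: offload_transfer with alloc_if/free_if keeps an array resident on the card, and a later offload reuses it with nocopy. The array names and sizes are illustrative.

  /* persistent_offload.c -- build with: icc -openmp persistent_offload.c */
  #include <stdio.h>

  #define N     100000
  #define KEEP  alloc_if(1) free_if(0)   /* allocate on the MIC, leave it resident */
  #define REUSE alloc_if(0) free_if(0)   /* reuse the resident allocation          */

  int main(){
      float a[N], b[N];
      int i;
      for (i = 0; i < N; i++) a[i] = (float) i;

      /* move a to the card once and keep it there */
      #pragma offload_transfer target(mic:0) in(a : length(N) KEEP)

      /* the compute offload reuses the resident copy; only b comes back */
      #pragma offload target(mic:0) nocopy(a : length(N) REUSE) out(b : length(N))
      #pragma omp parallel for
      for (i = 0; i < N; i++)
          b[i] = 2.0f * a[i];

      printf("%f\n", b[10]);
      return 0;
  }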

Simple OMP Example: OpenMP Parallel Regions Can Be Offloaded Directly

C version:

  #include <stdio.h>
  #define N 10000
  int main(){
      float a[N];
      int i;
      #pragma offload target(mic)
      #pragma omp parallel for
      for(i=0; i<N; i++)
          a[i] = (float) i;
      printf("%f \n", a[10]);
  }

Fortran version:

  program main
    integer, parameter :: N=10000
    real :: a(N)
    !dir$ offload target(mic)
    !$omp parallel do
    do i=1,N
       a(i)=i
    end do
    print*, a(10)
  end program

OpenMP regions can be offloaded directly; OpenMP parallel regions can exist in offloaded code blocks or functions.

Automatic Offload
Offloads some MKL routines automatically:
- No coding change
- No recompiling
Makes sense with BLAS-3-type routines: minimal data, O(n^2), with maximal compute, O(n^3).
Supported routines (more to come):
  Type                 Routines
  Level-3 BLAS         xGEMM, xTRSM, xTRMM
  LAPACK "3 amigos"    LU, QR, Cholesky
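As an illustration (a sketch, not from the slides), a sufficiently large DGEMM can be handed to automatic offload simply by enabling it, either with MKL_MIC_ENABLE=1 in the environment or with a call to mkl_mic_enable(); the matrix size below is only an example of "big enough to be worth offloading".

  /* ao_dgemm.c -- build with: icc -mkl ao_dgemm.c */
  #include <stdio.h>
  #include <mkl.h>

  int main(){
      const int n = 4096;                 /* large enough for offload to pay off */
      long i;
      double *A = (double *) mkl_malloc((size_t) n * n * sizeof(double), 64);
      double *B = (double *) mkl_malloc((size_t) n * n * sizeof(double), 64);
      double *C = (double *) mkl_malloc((size_t) n * n * sizeof(double), 64);
      for (i = 0; i < (long) n * n; i++) { A[i] = 1.0; B[i] = 2.0; C[i] = 0.0; }

      mkl_mic_enable();                   /* opt in to automatic offload */

      /* C = A*B; with AO enabled, MKL may split the work between host and MIC */
      cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                  n, n, n, 1.0, A, n, B, n, 0.0, C, n);

      printf("C[0] = %f\n", C[0]);
      mkl_free(A); mkl_free(B); mkl_free(C);
      return 0;
  }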

Symmetric Computing
- Run MPI tasks on both the MIC and the host, and across nodes (also called "heterogeneous computing")
- Two executables are required: one for the CPU, one for the MIC
- Currently only works with Intel MPI; MVAPICH2 support is coming
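A build sketch (the launcher name and flags are Stampede-specific and quoted from memory, so treat them as assumptions and check the user guide); it assumes the Intel compiler and Intel MPI modules are loaded so that mpicc wraps icc:

  # one source, two executables, built with the MPI compiler wrapper
  mpicc        -openmp mpi_code.c -o app.cpu    # host (Sandy Bridge) binary
  mpicc -mmic  -openmp mpi_code.c -o app.mic    # coprocessor binary

  # in the job script, launch both sides together with the symmetric
  # launcher, e.g.:  ibrun.symm -c ./app.cpu -m ./app.mic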

Balance
How to balance the code?
                      Sandy Bridge     Xeon Phi
  Memory              32 GB            8 GB
  Cores               16               61
  Clock speed         2.7 GHz          1.1 GHz
  Memory bandwidth    51.2 GB/s (x2)   352 GB/s
  Vector length       4 DP words       8 DP words

Balance Example: Memory Balance
Balance memory use and performance by using a different number of tasks/threads on the host and the MIC:
- Host: 16 tasks, 1 thread/task, 2GB/task
- Xeon Phi: 4 tasks, 60 threads/task, 2GB/task
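One way to express this split in a job script (the MIC_* variable names are the ones Stampede's symmetric launcher used; treat them as site-specific assumptions):

  export OMP_NUM_THREADS=1         # 1 thread per host MPI task (16 tasks per host)
  export MIC_PPN=4                 # 4 MPI tasks on each Phi
  export MIC_OMP_NUM_THREADS=60    # 60 threads per MIC task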

Balance Example: Performance Balance
Balance performance by tuning the number of tasks and threads on the host and the MIC:
- Host: 16 tasks, 1 thread/task, 2GB/task
- Xeon Phi: ? tasks, ? threads/task, 2GB/task

LBM Example
- Lattice Boltzmann Method CFD code (Carlos Rosales, TACC)
- OpenMP code
- Finding all the right routines to parallelize is critical

Other Options
At TACC, we've focused mainly on C/C++ and Fortran, with threading through OpenMP. Other options are out there, notably:
- Intel's Cilk Plus extensions for C/C++
- TBB (Threading Building Blocks) for C++
- OpenCL support is available, but we have little experience with it so far
- It's Linux: you can run whatever you want (e.g., I have run a pthreads code on the Phi)

Summary: MIC Experiences
- Early scaling looks good; application porting is fairly straightforward since the MIC can run native C/C++ and Fortran code.
- Optimization work is still required to get at all the available raw performance for a wide variety of applications:
  - Vectorization on these large many-core devices is key.
  - Affinitization can have a strong impact (positive or negative) on performance.
  - Algorithmic threading performance is also key; if the kernel of interest does not have high scaling efficiency on a standard x86_64 processor (8-16 cores), it will not scale on many-core.
- MIC optimization efforts also yield fruit on normal Xeon.

MIC Information
- Stampede User Guide: http://www.tacc.utexas.edu/user-services/user-guides/stampede-user-guide
- TACC Advanced Offloading: search for and click on the "Advanced Offloading" document in the Stampede User Guide.
- Intel programming and compiler documentation for MIC: Intel Compiler Manuals (C/C++ and Fortran)
- Example code: /opt/apps/intel/13/composer_xe_2013.1.117/samples/en_us/

Acknowledgements
Thanks/kudos to:
- Sponsor: National Science Foundation
  - NSF Grant #OCI-1134872 - Stampede Award, "Enabling, Enhancing, and Extending Petascale Computing for Science and Engineering"
  - NSF Grant #OCI-0926574 - "Topology-Aware MPI Collectives and Scheduling"
- Many, many Dell, Intel, and Mellanox engineers
- All my colleagues at TACC