Intel Xeon Phi Coprocessors

Intel Xeon Phi Coprocessors
Reference: Parallel Programming and Optimization with Intel Xeon Phi Coprocessors, by A. Vladimirov and V. Karpusenko, 2013

Ring Bus on Intel Xeon Phi
Example with 8 cores

Xeon Phi Card
- Coprocessor connected via PCIe Gen 2 (6 GB/s)
- 6-8 GB of memory on the card

Intel Xeon Phi Coprocessors
- Many Integrated Core (MIC) architecture, with 57-61 cores on a chip
- Cores connected via a ring bus
- Peak approx. 1 Tflop/s double precision
- Simple in-order x86 cores at approx. 1 GHz clock speed, which enhances code portability
- Each core supports 4-way hyperthreading; each hyperthread issues instructions every other cycle
- Each core has a 512-bit SIMD unit
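A hedged sanity check on that peak figure (assuming a 61-core part at roughly 1.05 GHz): a 512-bit SIMD unit holds 8 doubles, and a fused multiply-add counts as 2 flops per lane, so

    61 cores x 1.05 GHz x 8 lanes x 2 flops/FMA ≈ 1.02 Tflop/s double precision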

Memory Characteristics
- L1D: 32 KB, L1I: 32 KB (per core)
- L2: 512 KB (per core), unified
- Memory BW: 350 GB/s peak; 200 GB/s attainable

Programming Modes
- Offload: the program executes on the host and offloads work to coprocessors
- Native: the coprocessor runs a Linux uOS and could run several MPI processes
- Symmetric: MPI between hosts and coprocessors

Where's the disk?
- The file system is a virtual file system stored in DRAM on the card
- When the coprocessor is rebooted, home directories are reset and the necessary libraries must be copied back to the file system
- Alternatively, an actual disk file system can be mounted on the coprocessor, but access is slow (across PCIe)

Offloading vs. Native/Symmetric Mode
Offloading is good when:
- The application is not highly parallel throughout, so not all cores of the coprocessor can always be used (few fast cores vs. many slow cores)
- The application has high memory requirements and the offloaded portions can use less memory (the coprocessor is limited to 8 GB in native mode)
Offloading is bad when:
- The overhead of data transfer is high compared to the offloaded computation

Offloading
Common offload procedure:
    #pragma offload target(mic) inout(data : length(size))
The following is performed automatically:
- allocate memory on the coprocessor
- transfer data to the coprocessor
- perform the offloaded calculation
- transfer data back to the host
- deallocate memory on the coprocessor
Execution falls back to the host if no coprocessor is available, and the code still works if the directives are disabled.
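A minimal compilable sketch of this pattern (the array, its size, and the doubling loop are illustrative, not from the slides; it assumes the Intel compiler, which recognizes the offload pragmas):

    #include <stdio.h>

    #define N 1000000

    static double data[N];   /* statically allocated, so its length is known */

    int main(void) {
        for (int i = 0; i < N; i++)
            data[i] = (double)i;

        /* One pragma triggers the whole sequence: allocate on the coprocessor,
           copy data in, run the loop there, copy data back, deallocate.
           If no coprocessor is present, the loop simply runs on the host. */
        #pragma offload target(mic) inout(data : length(N))
        #pragma omp parallel for        /* spread the loop across the card's cores */
        for (int i = 0; i < N; i++)
            data[i] *= 2.0;

        printf("data[N-1] = %f\n", data[N - 1]);
        return 0;
    }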

Persistent Data in Offloads
- Data allocation and transfer are expensive
- Memory allocated on the coprocessor can be reused across offloads (alloc_if, free_if)
- When data transfers occur between host and coprocessor can be controlled:
  - in, out, inout clauses
  - nocopy to avoid transferring statically allocated variables
  - length(0) to avoid transferring data referenced by pointers
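A hedged sketch of the reuse idiom (the buffer name, size, and loop bodies are illustrative): allocate on the first offload, retain the allocation across offloads, and free on the last.

    #include <stdlib.h>

    #define ALLOC  alloc_if(1)   /* allocate coprocessor memory this offload  */
    #define REUSE  alloc_if(0)   /* reuse memory allocated by a prior offload */
    #define RETAIN free_if(0)    /* keep the allocation after this offload    */
    #define FREE   free_if(1)    /* free the allocation after this offload    */

    void persistent_demo(int n) {
        double *buf = (double *)malloc(n * sizeof(double));
        for (int i = 0; i < n; i++) buf[i] = 0.0;

        /* First offload: allocate on the coprocessor, copy in, keep allocation. */
        #pragma offload target(mic:0) in(buf : length(n) ALLOC RETAIN)
        { for (int i = 0; i < n; i++) buf[i] += 1.0; }

        /* Middle offload: reuse the allocation; length(0) transfers no data. */
        #pragma offload target(mic:0) nocopy(buf : length(0) REUSE RETAIN)
        { for (int i = 0; i < n; i++) buf[i] += 1.0; }

        /* Last offload: copy results back to the host and free on the card. */
        #pragma offload target(mic:0) out(buf : length(n) REUSE FREE)
        { /* empty body: transfer only */ }

        free(buf);
    }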

What does this code do?

Asynchronous Offloading
- By default, the host program blocks until the offload completes
- For a non-blocking offload, use the signal clause
- Works for asynchronous data transfers as well
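A minimal sketch of the signal/offload_wait pairing (the tag variable and the loop body are illustrative):

    #include <stdio.h>

    #define N 1000000
    static double data[N];

    int main(void) {
        char sig;  /* any variable's address serves as the signal tag */

        /* Non-blocking: control returns to the host as soon as the
           offload is queued. */
        #pragma offload target(mic:0) inout(data : length(N)) signal(&sig)
        for (int i = 0; i < N; i++)
            data[i] += 1.0;

        /* ... overlapping host work would go here ... */

        /* Block until the offload tagged with &sig has finished. */
        #pragma offload_wait target(mic:0) wait(&sig)

        printf("data[0] = %f\n", data[0]);
        return 0;
    }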

Offloading on Intel Xeon Phi
Two methods:
- pragma offload (fast; the compiler and runtime manage data allocation and transfer)
- shared virtual memory model (convenient, but slower because software maintains coherency)
Compare to GPU offloading:
- CUDA approach: launch kernels from the host; explicit function calls allocate data on the GPU and transfer data between host and GPU
- OpenACC: pragmas
Other options:
- OpenMP (pragmas for offloading)
- OpenCL (explicit function calls), more suitable for GPUs

Virtual Shared Memory for Offloading
- Logically shared memory between host and coprocessor
- The programmer marks variables that are shared
- The runtime maintains coherence at the beginning and end of offload statements (only modified data is copied)
- The _Cilk_shared keyword marks shared variables/data:
  - Shared variables have the same addresses on host and coprocessor, to simplify offloading of complex data structures
  - Shared variables are allocated dynamically (not on the stack)
- The _Cilk_offload keyword marks offloaded functions
- Dynamically allocated memory can also be shared: _Offload_shared_malloc, _Offload_shared_free
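A compilable sketch of this model (the array, function, and scale factor are illustrative): _Cilk_shared places data at the same virtual address on both sides, and _Cilk_offload runs a call on the coprocessor. For dynamically sized data, _Offload_shared_malloc could replace the static array.

    #include <stdio.h>

    #define N 1000

    /* Shared variable: same virtual address on host and coprocessor;
       allocated by the runtime, not on the stack. */
    _Cilk_shared double data[N];

    /* Shared function: compiled for both host and coprocessor. */
    _Cilk_shared void scale(double factor) {
        for (int i = 0; i < N; i++)
            data[i] *= factor;
    }

    int main(void) {
        for (int i = 0; i < N; i++) data[i] = (double)i;

        /* Run the call on the coprocessor; the runtime copies only the
           modified shared data at the start and end of the offload. */
        _Cilk_offload scale(2.0);

        printf("data[N-1] = %f\n", data[N - 1]);
        return 0;
    }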

Multiple Coprocessors
- Use OpenMP threads on the host; each thread offloads to one coprocessor
- On the coprocessor, use OpenMP to parallelize across cores
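A hedged sketch of this two-level scheme (the array, chunking, and kernel are illustrative; it assumes _Offload_number_of_devices() from offload.h and that the device count divides N evenly):

    #include <offload.h>   /* _Offload_number_of_devices() */
    #include <stdio.h>

    #define N 1000000
    static double data[N];

    int main(void) {
        int ndev = _Offload_number_of_devices();   /* coprocessors present */
        if (ndev < 1) return 1;                    /* no card: bail out    */
        int chunk = N / ndev;                      /* assumes ndev divides N */

        /* One host thread per coprocessor; thread d offloads chunk d. */
        #pragma omp parallel for num_threads(ndev)
        for (int d = 0; d < ndev; d++) {
            double *part = data + d * chunk;

            #pragma offload target(mic : d) inout(part : length(chunk))
            #pragma omp parallel for   /* parallelize across the card's cores */
            for (int i = 0; i < chunk; i++)
                part[i] += 1.0;
        }

        printf("used %d coprocessor(s)\n", ndev);
        return 0;
    }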

Native/Symmetric MPI vs. MPI+Offload