Intel MIC Programming Workshop, Hardware Overview & Native Execution LRZ,


Intel MIC Programming Workshop, Hardware Overview & Native Execution, LRZ, 27.6.-29.6.2016

Agenda Introduction to accelerators on HPC. Architecture overview of the Intel Xeon Phi products. Programming models. Native mode programming & hands-on.

Why do we Need Coprocessors or Accelerators on HPC? In the past, computers got faster by increasing the clock frequency of the core, but this has now reached its limit, mainly due to power requirements and heat-dissipation restrictions (an unmanageable problem). Today, processor cores are not getting any faster; instead, the number of cores per chip increases. In HPC, we need chips that provide higher computing performance at lower energy cost.

Why do we Need Coprocessors or Accelerators on HPC? The current solution is a heterogeneous system containing both CPUs and accelerators, plus other forms of parallelism such as vector instruction support. It is widely accepted that heterogeneous systems with accelerators deliver the highest-performance and most energy-efficient computing in HPC. Today, accelerated computing is revolutionising HPC.

Why do we Need Coprocessors or Accelerators on HPC?

TOP 4 Sites for June 2016

Architectures Comparison (CPU vs GPU) CPU: a large cache and sophisticated flow control minimise latency for arbitrary memory access in serial processing. GPU: simple flow control and limited cache; more transistors are devoted to parallel computation (up to 17 billion on a Pascal GPU), and the same operation is executed on many data elements in parallel (SIMD).

Intel Xeon Phi Products The first product, released in 2012 and named Knights Corner (KNC), was the first architecture supporting 512-bit vectors. The 2nd generation, released last week and named Knights Landing (KNL), also supports 512-bit vectors, with a new instruction set called Intel Advanced Vector Extensions 512 (Intel AVX-512). KNL has a peak performance of 6 TFLOP/s in single precision, ~3 times that of KNC, due in part to its 2 vector processing units (VPUs) per core, double the number in KNC. Each VPU operates independently on 512-bit vector registers, which can hold 16 single-precision or 8 double-precision floating-point numbers.

Intel Xeon Phi: KNC Coprocessors Add-on to CPU-based systems (PCIe interface). High power efficiency: ~1 TFLOP/s in DP. Heterogeneous clustering. Currently a major part of the world's 2nd-fastest supercomputer: Tianhe-2 in Guangzhou, China. Today most of the major HPC vendors support the Xeon Phi.

Intel Xeon Phi Coprocessors & MIC Architecture Multi-core Intel Xeon processor: ~16 physical cores at ~3 GHz, e.g. Intel Sandy Bridge, 32 nm, 256-bit vector registers (AVX). Many-core Intel Xeon Phi coprocessor: ~61 cores (244 threads) at ~1 GHz, 22 nm, 512-bit vector registers. Parallel Programming and Optimization with Intel Xeon Phi Coprocessors, Colfax 2013, http://www.colfaxintl.com/nd/xeonphi/book.aspx

Intel MIC Architecture: in common with Intel multi-core Xeon CPUs: x86 architecture; C, C++ and Fortran; standard parallelisation libraries; similar optimisation methods. Multi-core Xeon CPU: up to 18 cores/socket, up to 3 GHz, up to 768 GB of DDR3 RAM, 256-bit AVX vectors, 2-way hyper-threading. Many-core Xeon Phi: PCIe bus connection, IP-addressable, runs its own Linux-based OS, full Intel software tool suite, 8-16 GB GDDR5 DRAM (cached), up to 61 x86 (64-bit) cores, up to 1 GHz, 512-bit wide vector registers, 4-way hyper-threading, 64-bit addressing. SSE, AVX and AVX2 are not supported; the Phi instead uses the Intel Initial Many Core Instructions (IMCI).

Architectures Comparison CPU: general-purpose architecture. Intel MIC: a power-efficient, massively data-parallel multiprocessor built on the x86 design.

The Intel Xeon Phi Microarchitecture A high-bandwidth interconnect with a bidirectional ring topology. The ring connects all 61 cores and their L2 caches (kept coherent through a distributed global Tag Directory (TD)), the PCIe client logic, the GDDR5 memory controllers, etc.

Network Access Network access is possible using TCP/IP tools like ssh. NFS mounts on the Xeon Phi are supported. First experiences with the Intel MIC architecture at LRZ, Volker Weinberg and Momme Allalen, inside Vol. 11 No. 2, Autumn 2013

Cache Structure
Cache line size: 64 B
L1 size: 32 KB data cache + 32 KB instruction cache (per core)
L1 latency: 1 cycle
L2 size: 512 KB (per core)
L2 ways: 8
L2 latency: 15-30 cycles
Memory → L2 prefetching: hardware and software
L2 → L1 prefetching: software only
Translation Lookaside Buffer (TLB) coverage options (L1, data): 64 pages of size 4 KB (256 KB coverage), or 8 pages of size 2 MB (16 MB coverage)
Intel's Jim Jeffers and James Reinders: Intel Xeon Phi Coprocessor High Performance Programming, the first book on how to program this HPC vector chip, is available on Amazon

Cache Structure The effective L2 size depends on how data/code is shared between the cores: if no data is shared between cores, the aggregate L2 size is 30.5 MB (61 cores); if every core shares the same data, the effective L2 size is 512 KB. Cache coherency is maintained across the entire coprocessor: data remains consistent without software intervention.

Features of the Initial Many Core Instructions (IMCI) Set 512-bit wide registers can pack up to eight 64-bit elements (long int, double) or up to sixteen 32-bit elements (int, float). Arithmetic instructions: addition, subtraction and multiplication; fused multiply-add (FMA); division and reciprocal calculation; exponential functions and the power function; logarithms (natural, base 2 and base 10); square root, inverse square root, hypotenuse value and cubic root; trigonometric functions (sin, cos, tan, sinh, cosh, tanh, asin, acos, atan).

Lab1: Access SuperMIC

Interacting with Intel Xeon Phi Coprocessors
user@host~$ micinfo -listdevices
MicInfo Utility Log
Created Thu Jan 7 09:33:20 2016
List of Available Devices
deviceid domain bus# pcidev# hardwareid
0 0 20 0 22508086
1 0 8b 0 22508086
user@host~$ micinfo | grep -i cores
Cores : Total No of Active Cores : 61
Cores : Total No of Active Cores : 61

Interacting with Intel Xeon Phi Coprocessors
user@host~$ /sbin/lspci | grep -i "co-processor"
20:00.0 Co-processor: Intel Corporation Xeon Phi. (rev 20)
8b:00.0 Co-processor: Intel Corporation Xeon Phi. (rev 20)
user@host~$ cat /etc/hosts | grep mic1
user@host~$ cat /etc/hosts | grep mic1-ib0 | wc -l
user@host~$ ssh mic0 (or ssh mic1)
user@host-mic0~$ ls /
bin boot dev etc home init lib lib64 lrz media mnt proc root sbin sys tmp usr var
micsmc: a utility for monitoring the physical parameters of Intel Xeon Phi coprocessors: model, memory, core rail temperatures, core frequency, power usage, etc.

Parallelism on the Heterogeneous Systems Levels of parallelism to exploit: (1) vectorisation (SIMD); (2) MPI/OpenMP parallelism across cores; (3) ccNUMA domains; (4) multiple accelerators; (5) PCIe buses; (6) other I/O resources.

MIC Programming Models

MIC Programming Models
Native mode: programs are started on the Xeon Phi; cross-compilation using -mmic; user access to the Xeon Phi is necessary.
Offload to MIC: offload using OpenMP extensions (the host main() marks a region with #pragma offload target(mic) so that, e.g., myfunction() runs on the MIC); automatically offload some routines using MKL, via compiler-assisted offload (CAO) or MKL automatic offload (AO).
MPI tasks on host and MIC: treat the coprocessor like another host; MPI only, or MPI + X (X may be OpenMP, TBB, Cilk, OpenCL etc.).


Native Mode First ensure that the application is suitable for native execution: the application runs entirely on the MIC coprocessor, without offload from a host system. Compile the application for native execution using the flag -mmic. Build the required libraries for native execution as well. Copy the executable and any dependencies, such as runtime libraries, to the coprocessor. Mount file shares on the coprocessor for accessing input data sets and saving output data sets. Log in to the coprocessor via console, set up the environment and run the executable. You can debug the native application via a debug server running on the coprocessor.

Native Mode Compile on the host:
~$ source $INTEL_BASE/linux/bin/compilervars.sh intel64
~$ icpc -mmic hello.c -o hello
~$ ifort -mmic hello.f90 -o hello
Launch execution from the MIC:
~$ scp hello $HOST-mic0:
~$ ssh $HOST-mic0
~$ ./hello
hello, world

Native Mode: micnativeloadex The tool automatically transfers the binary and its dependent libraries and executes it from the host:
~$ ./hello
-bash: ./hello: cannot execute binary file
~$ export SINK_LD_LIBRARY_PATH= /intel/compiler/lib/mic
~$ micnativeloadex ./hello
hello, world
~$ micnativeloadex ./hello -v
hello, world
Remote process returned: 0
Exit reason: SHUTDOWN OK

Native Mode Use the Intel compiler flag -mmic. Remove assembly and unnecessary dependencies. Use --host=x86_64 to avoid "program does not run" errors. Example: the GNU Multiple Precision Arithmetic Library (GMP):
~$host: wget http://mirror.bibleonline.ru/gnu/gmp/gmp-6.1.0.tar.bz2
~$host: tar xf gmp-6.1.0.tar.bz2
~$host: cd gmp-6.1.0
~$host: ./configure --prefix=${gmp_build} CC=icc CFLAGS="-mmic" --host=x86_64 --disable-assembly --enable-cxx
~$host: make

Lab 2: Native Mode

micinfo and _SC_NPROCESSORS_ONLN
~$ micinfo -listdevices
~$ micinfo | grep -i cores
~$ cat hello.c
#include <stdio.h>
#include <unistd.h>
int main(){
    printf("Hello world! I have %ld logical cores.\n",
           sysconf(_SC_NPROCESSORS_ONLN));
}
~$ icc hello.c -o hello-host && ./hello-host
~$ icc -mmic hello.c -o hello-mic
~$ micnativeloadex ./hello-mic

Thank you.