Heterogeneous Computing and OpenCL

Hongsuk Yi (hsyi@kisti.re.kr), KISTI (Korea Institute of Science and Technology Information)

Contents:
- Overview of Heterogeneous Computing
- Introduction to the Intel Xeon Phi Coprocessor
- OpenCL Programming Model
- Summary

Method development: what do we do?
- Heterogeneous computing programming models: Xeon Phi coprocessors, AMD GPUs, NVIDIA GPUs; OpenCL, CUDA, Offload, OpenHMPP, MPI, OpenMP, and hybrid combinations
- Main interests:
  - GSL-CL: OpenCL for the GNU Scientific Library (http://sourceforge.net/projects/gsl-cl/develop)
  - Highly parallel quantum Monte Carlo: KISTI Monte Carlo (KMC)
  - Parallel molecular dynamics code: KISTI Molecular Dynamics (KMD)

Amdahl's Law
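
As a reminder (the slide shows only a figure), Amdahl's law bounds the speed-up of a program in which a fraction P of the work is parallelizable and runs on N processors:

    Speedup(N) = 1 / ((1 - P) + P/N)

Even with N unbounded, the speed-up saturates at 1/(1 - P), which is why the serial fraction dominates at extreme scale.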

Extreme-scale computing issues. KMD code: KISTI Quantum Diffusion Monte Carlo.

Heterogeneous Programming Models: OpenACC, OpenHMPP, CUDA, OpenCL, OpenMP 4.0 SIMD, Offloading

CUDA: Multi-GPU Scalability. MPI+CUDA on multiple Tesla 2050 GPUs; weak scaling. Optimization: asynchronous overlap between kernel execution and MPI communication.

Benefits of Heterogeneous Computing. (Chart: measured speed-ups of 1x, 5x, and 19x.)

What is Heterogeneous Multi-core Computing? Computing with many-core processors, coprocessors, and accelerators: GPU, FPGA, MIC, DSP.

The Heterogeneous Computing Era. (Chart: computing performance, 10^3 to 10^18 FLOPS, versus year, 1970-2030; milestones include Japan's K computer, China's Tianhe-2, and a projected exaflops system around 2020.)

PART II: XEON PHI COPROCESSOR

Xeon Phi Node: 320 GFLOPS

Coprocessor:
- Cross-compile for the coprocessor; OpenMP, POSIX threads, OpenCL, and MPI are all usable
- 240 hardware threads on a single chip, but in-order cores
- Very small memory (8 GB), small caches (only 2 levels)
- Limited hardware prefetching
- Poor single-thread performance (~1 GHz); the host CPU will be bored
- Only suitable for highly parallelized / scalable codes

Vector registers:
- Scalar: 64(80)-bit register, 1 double per instruction: add [a] [b]
- Intel SSE: 128-bit register (xmm), 2 doubles per instruction; SSE2, SSE4.2
- Intel AVX: 256-bit register (ymm), 4 doubles per instruction; shipped with Sandy Bridge, later also implemented by AMD
- Intel Phi: 512-bit register (zmm), 8 doubles per instruction; shipped with Knights Corner in 2012
- Intrinsics: _mm{width}_{operation}_{packed,scalar}{precision}(), e.g. _mm_add_pd() and, with icc's SVML, _mm_exp_pd()
- Compiler auto-vectorization, C extension for array notation (Cilk Plus), ArBB (C++)
- OpenCL: easy vectorization combined with multi-threading
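
To make the intrinsic style concrete, here is a minimal sketch using the SSE2 _mm_add_pd() mentioned above; the function name vadd and the use of unaligned loads are illustrative choices, not from the slides:

    #include <emmintrin.h>   /* SSE2 intrinsics */

    void vadd(int n, const double *a, const double *b, double *c)
    {
        int i;
        for (i = 0; i + 2 <= n; i += 2) {        /* 2 doubles per 128-bit op */
            __m128d va = _mm_loadu_pd(&a[i]);
            __m128d vb = _mm_loadu_pd(&b[i]);
            _mm_storeu_pd(&c[i], _mm_add_pd(va, vb));
        }
        for (; i < n; ++i)                        /* scalar remainder */
            c[i] = a[i] + b[i];
    }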

SIMD Fused Multiply-Add (FMA): d = a*b + c computed in a single instruction, doubling the peak FLOP rate per cycle.

There is no single driving force. (Source: Georg Hager, http://ircc.fiu.edu/sc13/practitionerscookbook_slides.pdf)

Perf. = n_cores x SIMD width x frequency x FMA
- Parallelization: OpenMP, Cilk
- Heterogeneous: MPI, OpenCL
- Vectorization: SIMD
- Optimization: prefetching, loops, cache, latency

Heterogeneous computing is here to stay: SIMD + OpenMP + MPI + OpenCL/CUDA

CPU/GPU/MIC comparison. (TDP: Thermal Design Power.)

Computing node (diagram): two sockets, each with an 8-core Xeon at 2.53 GHz and 12 GB of memory, each attached to a 61-core Xeon Phi at ~1.1 GHz with 8 GB of GDDR5, connected over PCI Express at ~6 GB/s.

MPI Programming Models:
- Host-only model: all MPI ranks reside on the host; the coprocessors can be used via offload pragmas
- Coprocessor-only model: all MPI ranks reside on the coprocessor (native mode)
- Symmetric model: MPI ranks reside on both the host and the coprocessor

Native MPI: Symmetric Mode. (Diagram: host ranks 1-6 with memory and L3 cache, connected over PCIe to ranks on mic0 and mic1, each with its own memory.)

    mpirun -host localhost -n 6 a.cpu : -host mic0 -n 120 a.mic : -host mic1 -n 120 a.mic

SAXPY in native mode:

    #include <stdlib.h>

    void saxpycpu(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }

    int main(int argc, const char *argv[])
    {
        int n = 10240;
        float a = 2.0f;
        float *x = (float *) malloc(n * sizeof(float));
        float *y = (float *) malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) {
            x[i] = i;
            y[i] = 2.0f * i + 1.0f;
        }
        saxpycpu(n, a, x, y);
        free(x);
        free(y);
        return 0;
    }

    $ icc -mmic -o a.mic saxpy.c
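
To actually run the cross-compiled binary, one common approach (assuming the coprocessor is reachable as mic0 and the MIC runtime libraries are on its library path) is to copy it over and launch it via ssh:

    $ scp a.mic mic0:/tmp/
    $ ssh mic0 /tmp/a.mic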

Bandwidth on a single core. Example array update: x(:) = a*x(:) + y(:). How does data travel from memory to the CPU and back? (Plot: copy bandwidth versus memory size in KB, for Sandy Bridge and Xeon Phi.)

Copy bandwidth on the Intel Phi. Setup:
- Double-precision 1D array, memory aligned to 64 bytes
- TRIAD kernel on a Xeon Phi 5110P (theoretical aggregate bandwidth = 352 GB/s)
- 2x the number of real cores; OpenMP with the Intel C/C++ compiler 2013 XE; KMP_AFFINITY=scatter
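
A minimal sketch of a STREAM-style TRIAD measurement like the one described above; the array size, the 64-byte alignment attribute, and the bandwidth accounting (three 8-byte accesses per element) are illustrative assumptions:

    #include <stdio.h>
    #include <omp.h>

    #define N (1 << 23)   /* 8M doubles (64 MB) per array */

    static double a[N] __attribute__((aligned(64)));
    static double b[N] __attribute__((aligned(64)));
    static double c[N] __attribute__((aligned(64)));

    int main(void)
    {
        double scalar = 3.0;

        /* First-touch initialization; also spins up the thread team. */
        #pragma omp parallel for
        for (int i = 0; i < N; ++i) { a[i] = 0.0; b[i] = 1.0; c[i] = 2.0; }

        double t0 = omp_get_wtime();
        #pragma omp parallel for
        for (int i = 0; i < N; ++i)
            a[i] = b[i] + scalar * c[i];   /* TRIAD kernel */
        double t = omp_get_wtime() - t0;

        /* 3 x 8 bytes move per element: read b, read c, write a. */
        printf("TRIAD bandwidth: %.1f GB/s\n", 3.0 * 8.0 * N / t / 1e9);
        return 0;
    }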

Roofline Model for the Xeon Phi (DP). (Source: http://crd.lbl.gov/assets/pubs_presos/parlab08-roofline-talk.pdf) Performance is bounded above by both the peak flop rate and the product of streaming bandwidth and the flop:byte ratio.
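
In formula form (this restates the slide's statement; it is the standard roofline bound):

    Attainable GFLOP/s = min(peak GFLOP/s, streaming bandwidth [GB/s] x arithmetic intensity [flops/byte])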

Thread Affinity Choices. Intel OpenMP supports the following affinity types (set via KMP_AFFINITY, as shown below):
- compact: assign threads to consecutive hardware contexts on the same physical core
- scatter: assign consecutive threads to different physical cores, maximizing access to memory
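
For example, one would typically export something like the following before launching (the thread count of 240 assumes a 60-core Phi with 4 hardware threads per core):

    $ export OMP_NUM_THREADS=240
    $ export KMP_AFFINITY=scatter    # or compact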

Go Parallel with OpenMP
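
As a minimal sketch (not from the slides), the SAXPY loop from the native-mode example parallelizes with a single pragma; compile with the Intel compiler's OpenMP flag (-openmp on the 2013 compiler, -qopenmp on later ones):

    #include <omp.h>

    void saxpy_omp(int n, float a, const float *x, float *y)
    {
        #pragma omp parallel for
        for (int i = 0; i < n; ++i)
            y[i] = a * x[i] + y[i];
    }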

Offloading Strategy. Offload directive: #pragma offload target(mic:0 or mic:1). It is a great time to venture into many-core:
- Try offloading a compute-intensive section
- Optimize data transfers
- Split the calculation
- Use asynchronous transfers (see the sketch below)

Offloading is a good fit when:
- The code spends a lot of time doing computation without I/O
- The data is relatively easy to encapsulate
- The computation time is substantially higher than the data-transfer time
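
A hedged sketch of the asynchronous-transfer idea above, using Intel's offload_transfer pragma with signal/wait; the variable names, the helpers do_other_host_work() and process_on_mic(), and the alloc_if/free_if buffer-persistence clauses are illustrative assumptions, not from the slides:

    /* Start the upload of x to the coprocessor without blocking the host. */
    #pragma offload_transfer target(mic:0) \
            in(x : length(n) alloc_if(1) free_if(0)) signal(x)

    do_other_host_work();            /* overlapped host computation */

    /* Launch the kernel only once the transfer has completed;
       nocopy reuses the buffer already resident on the card. */
    #pragma offload target(mic:0) wait(x) \
            nocopy(x : length(n) alloc_if(0) free_if(1))
    {
        result = process_on_mic(x, n);
    }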

Count3s example with offloading:

    #include <stdio.h>
    #include <stdlib.h>

    __attribute__((target(mic)))
    int count3s(const int N, const int *data)
    {
        int icount = 0;
        for (int i = 0; i < N; i++) {
            if (data[i] == 3)
                icount++;
        }
        return icount;
    }

    /* ... in main(), after filling array[nsize] ... */
    int num_c3s;
    #pragma offload target(mic) in(array : length(nsize))
    {
        printf("Counting 3s from MIC!\n");
        fflush(0);
        num_c3s = count3s(nsize, array);
    }

    $ ./a.out
    Hello World from CPU!
    Hello World from MIC!
    Counting 3s from MIC!
    num_c3s=3332344 in the array[10000000] ratio= 33.32(%)

LJ code benchmark: native mode, N2 = 500, file I/O. (Chart.)

PART III: OPENCL

The need for OpenCL:
- Diverse vendors and hardware
- An industry-standard programming platform is required: that is OpenCL.

Table: OpenCL-applicable devices on the market. Vendors: Altera, AMD, ARM, IBM, Intel, NVIDIA. Device types: CPU, GPU, APU, accelerator.

Recent trends in computing devices:
- Multi-core CPU: ~16 cores per socket, a few tens of cores per node; NUMA, large on-chip caches, high-frequency (~3.5 GHz) operation; CPU vector instructions: SSE, AVX (256-bit), AVX2, ...
- Intel Xeon Phi: many integrated cores; 60 cores, 512-bit-wide vectors
- GPU: NVIDIA Tesla K20, AMD HD7970 (~2,000 cores); good double-precision performance, large memory (~6 GB)
- FPGA: programmable gate array; Altera OpenCL version

Prerequisites for OpenCL development. Identify the computing devices installed in your computer, then visit the appropriate vendor homepages and download the following:
- AMD (x86_64 CPUs, Radeon GPU, Fusion APU): SDK: http://developer.amd.com/sdks/amdappsdk/pages/default.aspx ; GPU driver: http://www.amd.com
- Intel (Intel x64 CPUs, GPU, Phi): SDK: http://software.intel.com/en-us/articles/vcsource-tools-opencl-sdk/
- NVIDIA (GPU): SDK: http://developer.nvidia.com/cuda-downloads ; GPU driver: http://www.nvidia.com/download/index.aspx?lang=en-us
- IBM (POWER, common runtime for x86): https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityuuid=80367538-d04a-47cb-9463-428643140bf1
- ARM: mobile CPU
- SNU: http://opencl.snu.ac.kr
- Portland Group: http://www.pgroup.com/products/pgcl.htm

Memory Model. (Diagram: each work-item has its own private memory; the work-items of a compute unit share that unit's local memory; all compute units on the device read and write global memory at ~70-150 GB/s; host memory connects over PCIe at ~5 GB/s, which is slow.)
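
To show how these address spaces appear in kernel code, here is a minimal work-group reduction sketch (assuming a power-of-two work-group size; the kernel name and arguments are illustrative):

    __kernel void reduce(__global const float *in, __global float *out,
                         __local float *scratch)
    {
        int lid = get_local_id(0);
        float priv = in[get_global_id(0)];   /* private memory (registers) */
        scratch[lid] = priv;                 /* stage into fast local memory */
        barrier(CLK_LOCAL_MEM_FENCE);
        for (int s = get_local_size(0) / 2; s > 0; s >>= 1) {
            if (lid < s)
                scratch[lid] += scratch[lid + s];
            barrier(CLK_LOCAL_MEM_FENCE);
        }
        if (lid == 0)                        /* one partial sum per group */
            out[get_group_id(0)] = scratch[0];
    }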

OpenCL Diagram. (Figure: platforms contain devices, the hardware; a context is set up over them; programs are compiled into kernels, the code; memory objects hold data and arguments; command queues enqueue work and send it to execution.)

Platform Model. (Diagram: the host connects to compute devices over an interconnect such as PCIe; each compute device contains compute units, which contain processing elements.) You can simply identify the OpenCL platform with a vendor, that is, AMD, Intel, etc. Platform == vendor.

OpenCL Context. You can think of the OpenCL context as a computational workspace: Context == our table for work. You can identify an OpenCL device with a CPU, GPU, etc.: Device == CPU or GPU.

OpenCL Command Queue. You can think of the OpenCL command queue as a job manager: Command Queue == manager (or professor).

OpenCL as a game of cards:
- Host == dealer
- Context == table
- Command Queue == hand
- Program == deck of cards
- Kernel == card (A, K, Q, J)
- Device == player (Player 0 .. Player 3)
- Platform == type of card game

Work flow:
1. Get the platform ID, then the device ID
2. Create a context and a command queue
3. Create the program, build it, create the kernel, and set the kernel arguments
4. Create buffers and write the input buffers
5. Execute the kernel
6. Read back the result buffer

(A full host-side sketch follows.)
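
A minimal host-side sketch of this workflow in C, running SAXPY on the first available device; error checking is collapsed and object releases are omitted for brevity, and the kernel source, names, and sizes are illustrative:

    #include <stdio.h>
    #include <stdlib.h>
    #include <CL/cl.h>

    static const char *src =
        "__kernel void saxpy(const int n, const float a,                  \n"
        "                    __global const float *x, __global float *y) {\n"
        "    int i = get_global_id(0);                                    \n"
        "    if (i < n) y[i] = a * x[i] + y[i];                           \n"
        "}                                                                \n";

    int main(void)
    {
        int n = 10240;
        float a = 2.0f;
        size_t bytes = n * sizeof(float);
        float *x = (float *) malloc(bytes);
        float *y = (float *) malloc(bytes);
        for (int i = 0; i < n; ++i) { x[i] = i; y[i] = 2.0f * i + 1.0f; }

        cl_platform_id plat;
        cl_device_id   dev;
        cl_int         err;
        clGetPlatformIDs(1, &plat, NULL);                               /* PlatformID */
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);    /* DeviceID   */
        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, &err);
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, &err);

        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, &err);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);                /* Build Program */
        cl_kernel k = clCreateKernel(prog, "saxpy", &err);              /* Create Kernel */

        cl_mem bx = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  bytes, NULL, &err);
        cl_mem by = clCreateBuffer(ctx, CL_MEM_READ_WRITE, bytes, NULL, &err);
        clEnqueueWriteBuffer(q, bx, CL_TRUE, 0, bytes, x, 0, NULL, NULL);  /* Write Buffer */
        clEnqueueWriteBuffer(q, by, CL_TRUE, 0, bytes, y, 0, NULL, NULL);

        clSetKernelArg(k, 0, sizeof(int),    &n);                       /* Set Kernel Arg */
        clSetKernelArg(k, 1, sizeof(float),  &a);
        clSetKernelArg(k, 2, sizeof(cl_mem), &bx);
        clSetKernelArg(k, 3, sizeof(cl_mem), &by);

        size_t global = n;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL); /* Execute kernel */
        clEnqueueReadBuffer(q, by, CL_TRUE, 0, bytes, y, 0, NULL, NULL);     /* Buffer Read */

        printf("y[1] = %f (expected 5.0)\n", y[1]);
        free(x); free(y);   /* cl object releases omitted for brevity */
        return 0;
    }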

OpenCL Program. You must load and keep the source code of your OpenCL kernels before executing them (kernel == function): Create Program With Source. The loaded kernel source is then compiled: Build Program == compile + link. You must also specify each kernel's function arguments explicitly (there is no call stack on a GPU): Set Kernel Arg == function argument.

Read/Write Buffer. Synchronizes the contents of an array on the host with the corresponding array on the GPU: Buffer R/W == synchronization API.

Execute OpenCL Kernel. You can simply run an OpenCL kernel on the GPU/CPU: Enqueue Task => simple run API.

Comparison with the offload model:

OpenCL: PlatformID, DeviceID, Context, Command Queue, Program, Kernel, Buffer, Execute kernel, Buffer Read. A low-level API; one source runs on heterogeneous devices.

Offload model: a high-level API; the better choice when computing on CPU/Phi only.

    __declspec(target(mic)) function1();
    __declspec(target(mic)) function2();
    #pragma offload target(mic) inout(x:length(3*n)) in(v:length(3*n)) nocopy(f:length(3*n))
    {
        /* functions */
    }

Performance comparison: Lennard-Jones MD, 3,200 molecules, 100 time steps, work-group size 16 or 32. OpenCL shows better performance; both codes are optimized to the same degree. (Plot: elapsed time in seconds versus number of threads, 0-250; at the best point the offload version takes 27.252 s and the OpenCL version 20.13 s.)

Summary. You will benefit from the Xeon Phi if you can:
- Take advantage of the high memory bandwidth
- Take advantage of the wide vectors
- Take advantage of the high thread count

If you can't, you're probably better off with Xeons. If you want good performance, you will need to optimize your code by extracting parallelism with OpenMP, SIMD, OpenCL, and MPI; this is at least the same amount of work you would put into porting to CUDA.