Heterogeneous Computing and OpenCL Hongsuk Yi (hsyi@kisti.re.kr) (Korea Institute of Science and Technology Information)
Contents
- Overview of heterogeneous computing
- Introduction to the Intel Xeon Phi coprocessor
- OpenCL programming model
- Summary
Method Development: What Do We Do?
- Heterogeneous computing programming models
  - Xeon Phi coprocessors, AMD GPUs, NVIDIA GPUs
  - OpenCL, CUDA, Offload, OpenHMPP, MPI, OpenMP, hybrid approaches
- Main interests:
  - GSL-CL: OpenCL for the GNU Scientific Library (http://sourceforge.net/projects/gsl-cl/develop)
  - Highly parallel quantum Monte Carlo: KISTI Monte Carlo (KMC)
  - Parallel molecular dynamics code: KISTI Molecular Dynamics (KMD)
Amdahl's Law
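The law named in the slide title can be stated in its standard form (the formula itself is not on the slide): if a fraction p of the work parallelizes perfectly across N processors, the overall speedup is

```latex
S(N) = \frac{1}{(1-p) + \dfrac{p}{N}},
\qquad
\lim_{N \to \infty} S(N) = \frac{1}{1-p}
```

For example, with p = 0.95 the speedup can never exceed 1/(1-0.95) = 20, no matter how many cores or accelerators are added.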
Extreme-Scale Computing Issues
- KMD code: KISTI Quantum Diffusion Monte Carlo
Heterogeneous Programming Models
- OpenACC
- OpenHMPP
- CUDA
- OpenCL
- OpenMP 4.0 SIMD
- Offloading
CUDA: Multi-GPU Scalability
- MPI+CUDA on multi-GPU Tesla C2050 nodes
- Weak scaling with MPI+CUDA
- Optimization: asynchronous overlap between kernel execution and MPI communication
Benefits of Heterogeneous Computing
(figure: measured speed-ups of 1x, 5x, and 19x)
What Is Heterogeneous Multi-core Computing?
- Many-core processors, coprocessors, and accelerators: GPU, FPGA, MIC, DSP
The Heterogeneous Computing Era
(figure: computing performance versus year, from 10^3 FLOPS in 1970 toward 10^18 FLOPS (exaflops) in the 2020-2030 range; milestones include China's Tianhe-2 and Japan's K computer, with an exaflops system projected for Japan)
PART II: XEON PHI COPROCESSOR
Xeon Phi Node: 320 GFLOPS
Coprocessor
- Cross-compile for the coprocessor; OpenMP, POSIX threads, OpenCL, and MPI are usable
- 240 hardware threads on a single chip; however, the cores are in-order
- Very small memory (8 GB), small caches (only 2 levels), limited hardware prefetching
- Poor single-thread performance (~1 GHz); the host CPU will be bored
- Only suitable for highly parallelized / scalable codes
Vector Registers
- Scalar: 64(80)-bit wide register, 1 double per instruction: add [a] [b]
- Intel SSE: 128-bit wide registers (xmm), 2 doubles per instruction (SSE2, SSE4.2)
- Intel AVX: 256-bit wide registers (ymm), 4 doubles per instruction; shipped with Sandy Bridge, later also implemented by AMD
- Intel Phi: 512-bit wide registers (zmm), 8 doubles per instruction; shipped with Knights Corner in 2012
- Intrinsics: _mm{width}_{operation}_{packed,scalar}{precision}(), e.g. _mm_add_pd() (SSE2) or _mm256_add_pd() (AVX); icc's SVML additionally provides math intrinsics such as _mm_exp_pd()
- Compiler auto-vectorization, C extension for array notation (Cilk Plus), ArBB (C++)
- OpenCL: easy vectorization combined with multi-threading
SIMD Fused Multiply-Add (FMA)
There Is No Single Driving Force
Source: George Hager, http://ircc.fiu.edu/sc13/practitionerscookbook_slides.pdf
Perf. = ncores * SIMD * Freq * FMA
- Parallelization: OpenMP, Cilk
- Heterogeneous: MPI, OpenCL
- Vectorization: SIMD
- Optimization: prefetch, loops, cache, latency
Heterogeneous computing is here to stay: SIMD + OpenMP + MPI + OpenCL/CUDA
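The Perf. = ncores * SIMD * Freq * FMA rule can be made concrete with the Knights Corner figures quoted elsewhere in these slides (61 cores at about 1.1 GHz, 8 double-precision SIMD lanes, FMA counting as 2 flops per cycle); the exact product depends on the SKU:

```latex
P_{\text{peak}}
= n_{\text{cores}} \times \text{SIMD width} \times f \times \text{FMA}
= 61 \times 8 \times 1.1\,\text{GHz} \times 2
\approx 1.07\ \text{TFLOP/s (double precision)}
```

Missing any one factor costs the corresponding multiple: scalar, non-FMA code on a single core reaches only 1/(61 * 8 * 2) of this peak.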
CPU/GPU/MIC Comparison (TDP: Thermal Design Power)
Computing Node
(diagram: two Xeon sockets, 8 cores each at 2.53 GHz with 12 GB memory per socket, each attached over PCI Express (~6 GB/s) to a Xeon Phi card with 61 cores at ~1.1 GHz and 8 GB GDDR5)
MPI Programming Models
- Host-only model: all MPI ranks reside on the host; the coprocessors can be used via offload pragmas
- Coprocessor-only model: all MPI ranks reside on the coprocessor (native mode)
- Symmetric model: MPI ranks reside on both the host and the coprocessor
Native MPI: Symmetric Mode
(diagram: host ranks with their own memory and L3 cache, plus ranks on mic0 and mic1, connected over PCIe)

$ mpirun -host localhost -n 6 a.cpu : -host mic0 -n 120 a.mic : -host mic1 -n 120 a.mic
SAXPY in Native Mode

#include <stdlib.h>

void saxpycpu(int n, float a, float *x, float *y) {
    for (int i = 0; i < n; ++i)
        y[i] = a * x[i] + y[i];
}

int main(int argc, const char *argv[]) {
    int n = 10240;
    float a = 2.0f;
    float *x = (float *) malloc(n * sizeof(float));
    float *y = (float *) malloc(n * sizeof(float));
    for (int i = 0; i < n; ++i) {
        x[i] = i;
        y[i] = 2.0f * i + 1.0f;
    }
    saxpycpu(n, a, x, y);
    free(x);
    free(y);
    return 0;
}

$ icc -mmic -o a.mic saxpy.c
Bandwidth on a Single Core
- Example array update: x(:) = a*x(:) + y(:)
- How does data travel from memory to the CPU and back?
(figure: bandwidth versus memory size (kB) for Sandy Bridge and Xeon Phi)
Copy Bandwidth on Intel Phi
- Setup: double-precision 1D array, memory aligned to 64 bytes
- TRIAD benchmark on a Xeon Phi 5110P
- Theoretical aggregate bandwidth = 352 GB/s
- OpenMP with the Intel C/C++ Compiler 2013 XE, 2x the number of real cores, KMP_AFFINITY=scatter
Roofline Model for Xeon Phi (DP)
Source: http://crd.lbl.gov/assets/pubs_presos/parlab08-roofline-talk.pdf
Performance is bounded above by both the peak flop rate and the product of streaming bandwidth and the flop:byte ratio.
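The bound described above is usually written as (standard roofline form, not from the slide):

```latex
P_{\text{attainable}} = \min\left( P_{\text{peak}},\; B \times I \right)
```

where P_peak is the peak flop rate, B the streaming memory bandwidth (bytes/s), and I the arithmetic intensity (flops/byte). As a rough illustration with the 5110P numbers from these slides (~1 TFLOP/s DP peak, 352 GB/s), a kernel is memory-bound below I ≈ 1000/352 ≈ 2.9 flops per byte; the TRIAD kernel, at 2 flops per 24 bytes, sits far below that ridge point.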
Thread Affinity Choices
Intel OpenMP supports the following affinity types:
- compact: assign threads to consecutive hardware contexts on the same physical core
- scatter: assign consecutive threads to different physical cores, maximizing access to memory
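The affinity types above are selected through the Intel runtime's KMP_AFFINITY environment variable; a minimal sketch (the thread count and binary name are illustrative assumptions):

```shell
# Spread consecutive threads across physical cores (bandwidth-friendly)
export OMP_NUM_THREADS=120
export KMP_AFFINITY=scatter
./a.mic

# Pack threads onto consecutive hardware contexts of the same core
# (cache-sharing-friendly)
export KMP_AFFINITY=compact
./a.mic
```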
Go Parallel with OpenMP
Offloading Strategy
- Offload directive: #pragma offload target(mic:0) or target(mic:1)
- A great time to venture into many-core: try offloading a compute-intensive section
- Optimize data transfers: split the calculation, use asynchronous transfers
- Good for code where:
  - a lot of time is spent doing computation without I/O
  - the data is relatively easy to encapsulate
  - computation time is substantially higher than the data-transfer time
Count3s Example with Offloading

#include <stdio.h>
#include <stdlib.h>

__attribute__((target(mic)))
int count3s(const int N, const int *data) {
    int icount = 0;
    for (int i = 0; i < N; i++) {
        if (data[i] == 3) icount++;
    }
    return icount;
}

int main(void) {
    /* setup (reconstructed): fill the array with random digits */
    const int nsize = 10000000;
    int *array = (int *) malloc(nsize * sizeof(int));
    for (int i = 0; i < nsize; i++) array[i] = rand() % 10;

    int num_c3s;
    #pragma offload target(mic) in(array : length(nsize))
    {
        printf("Counting 3s from MIC!\n"); fflush(0);
        num_c3s = count3s(nsize, array);
    }
    printf("num_c3s=%d\n", num_c3s);
    free(array);
    return 0;
}

$ ./a.out
Hello World from CPU!
Hello World from MIC!
Counting 3s from MIC!
num_c3s=3332344 in the array[10000000] ratio= 33.32(%)
LJ Code Benchmark
- Native mode, N2=500, file I/O
PART III: OPENCL
The Need for OpenCL
- Diverse vendors and hardware
- An industry-standard programming platform is required: that is OpenCL.
(table: OpenCL-capable devices in the market, by vendor — Altera, AMD, ARM, IBM, Intel, NVIDIA — and device type — CPU, GPU, APU, accelerator)
Recent Trends in Computing Devices
- Multi-core CPU: ~16 cores per socket, a few tens per node; NUMA, large on-chip caches, high-frequency (~3.5 GHz) operation
- CPU vector instructions: SSE, AVX (256-bit), AVX2, ...
- Intel Xeon Phi: many integrated cores; 60 cores, 512-bit-wide vectors
- GPU: NVIDIA Tesla K20, AMD HD 7970 (~2,000 cores); good double-precision performance, large memory (~6 GB)
- FPGA: programmable gate arrays; Altera's OpenCL version
Prerequisites for OpenCL Development
Identify the computing devices installed on your computer, then visit the appropriate vendor homepages and download the following:
- AMD (x86_64 CPUs, Radeon GPU, Fusion APU): SDK at http://developer.amd.com/sdks/amdappsdk/pages/default.aspx, GPU driver at http://www.amd.com
- Intel (Intel x64 CPUs, GPU, Phi): SDK at http://software.intel.com/en-us/articles/vcsource-tools-opencl-sdk/
- NVIDIA (GPU): SDK at http://developer.nvidia.com/cuda-downloads, GPU driver at http://www.nvidia.com/download/index.aspx?lang=en-us
- IBM (POWER, common runtime for x86): https://www.ibm.com/developerworks/community/groups/service/html/communityview?communityuuid=80367538-d04a-47cb-9463-428643140bf1
- ARM: mobile CPUs
- SNU: http://opencl.snu.ac.kr
- Portland Group: http://www.pgroup.com/products/pgcl.htm
2) Memory Model
(diagram: within each compute unit, work-items have private memory and share a local memory; all compute units on the device share global memory at ~70-150 GB/s; host memory is connected over slow PCIe at ~5 GB/s)
OpenCL Diagram
- Hardware setup: platforms → devices → context → queues
- Compile code: programs → kernels
- Data and arguments: memory objects
- Send to execution: enqueue
Platform Model
(diagram: the host is connected over an interconnect (PCIe) to a compute device, which contains compute units made of processing elements)
- You can simply identify an OpenCL platform with a vendor, e.g. AMD or Intel: Platform == Vendor
OpenCL Context
- You can simply identify an OpenCL context with a computational workspace: Context == our table for work
- You can simply identify an OpenCL device with a CPU, GPU, etc.: Device == CPU or GPU
OpenCL Command Queue
- You can simply identify an OpenCL command queue with a job manager: Command Queue == manager API
OpenCL: Game of Cards (analogy)
- Command Queue == Hand
- Context == Table
- Host == Dealer
- Program == Deck of Cards
- Kernel == Card (A, K, Q, J)
- Device == Player (Player 0 .. Player 3)
- Platform == Type of Card Game
Work Flow
- Setup: platform ID → device ID → context → command queue
- Program: create program → build program
- Kernel: create kernel → set kernel arguments
- Buffer: create buffer → write buffer
- Run: execute kernel → read buffer
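The steps above can be sketched as a minimal OpenCL 1.x host program for the SAXPY kernel used earlier. This is a sketch, not a complete program: error checking and object releases are omitted, and building it requires an OpenCL SDK and at least one OpenCL device.

```c
#include <CL/cl.h>
#include <stdio.h>

/* Kernel source held as a string; each work-item handles one element. */
static const char *src =
    "__kernel void saxpy(const float a, __global const float *x,\n"
    "                    __global float *y) {\n"
    "    int i = get_global_id(0);\n"
    "    y[i] = a * x[i] + y[i];\n"
    "}\n";

int main(void) {
    enum { N = 1024 };
    float a = 2.0f, x[N], y[N];
    for (int i = 0; i < N; ++i) { x[i] = i; y[i] = 1.0f; }

    /* Setup: platform ID -> device ID -> context -> command queue */
    cl_platform_id plat;  cl_device_id dev;
    clGetPlatformIDs(1, &plat, NULL);
    clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);
    cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
    cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

    /* Program: create -> build; Kernel: create -> set arguments */
    cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
    clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
    cl_kernel k = clCreateKernel(prog, "saxpy", NULL);

    /* Buffer: create -> write */
    cl_mem bx = clCreateBuffer(ctx, CL_MEM_READ_ONLY,  sizeof x, NULL, NULL);
    cl_mem by = clCreateBuffer(ctx, CL_MEM_READ_WRITE, sizeof y, NULL, NULL);
    clEnqueueWriteBuffer(q, bx, CL_TRUE, 0, sizeof x, x, 0, NULL, NULL);
    clEnqueueWriteBuffer(q, by, CL_TRUE, 0, sizeof y, y, 0, NULL, NULL);
    clSetKernelArg(k, 0, sizeof a,  &a);
    clSetKernelArg(k, 1, sizeof bx, &bx);
    clSetKernelArg(k, 2, sizeof by, &by);

    /* Run: execute kernel -> read result back */
    size_t gsz = N;
    clEnqueueNDRangeKernel(q, k, 1, NULL, &gsz, NULL, 0, NULL, NULL);
    clEnqueueReadBuffer(q, by, CL_TRUE, 0, sizeof y, y, 0, NULL, NULL);

    printf("y[1] = %f\n", y[1]);  /* expect a*x[1] + y[1] = 2*1 + 1 = 3 */
    return 0;
}
```

The same host code runs unchanged on a CPU, GPU, or Xeon Phi device, which is the portability argument made in the comparison slide below.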
OpenCL Program
- Before executing kernels, you must load their source code (kernel == function): Create Program With Source
- The loaded kernel source is then compiled: Build Program == compile + link
- You must specify the arguments of each kernel (there is no call stack on the GPU): Set Kernel Arg == function arguments
Read/Write Buffer
- Synchronizes the contents of an array on the host with the corresponding array on the GPU: Buffer R/W == synchronization API
Execute OpenCL Kernel
- You can simply run an OpenCL kernel on the GPU/CPU: Enqueue Task == simple-run API
Comparison with the Offload Model
- OpenCL: platform ID → device ID → context → command queue → program → kernel → buffer → execute kernel → buffer read
  - Low-level API; one source runs on heterogeneous devices
- Offload model:
    __declspec(target(mic)) function1();
    __declspec(target(mic)) function2();
    #pragma offload target(mic) inout(x:length(3*n)) in(v:length(3*n)) nocopy(f:length(3*n))
    { functions }
  - High-level API; the better choice when computing on CPU/Phi only
Performance Comparison
- Lennard-Jones MD: 3,200 molecules, 100 time steps, work-group size 16 or 32
- With both codes optimized to the same degree, OpenCL shows better performance
(figure: elapsed time (s) versus number of threads; best times 27.252 s for offload vs. 20.13 s for OpenCL)
Summary
- You will benefit from the Xeon Phi if you can take advantage of its high memory bandwidth, wide vectors, and high thread count
- If you can't, you are probably better off with Xeons
- For good performance you will need to optimize your code by extracting parallelism with OpenMP, SIMD, OpenCL, and MPI
- This is at least the same amount of work you would put into porting to CUDA