Heterogeneous Computing and OpenCL

1 Heterogeneous Computing and OpenCL Hongsuk Yi (Korea Institute of Science and Technology Information)

2 Contents
- Overview of Heterogeneous Computing
- Introduction to the Intel Xeon Phi Coprocessor
- OpenCL Programming Model
- Summary

3 What do we do?
Method development
- Heterogeneous computing programming model
  - Xeon Phi coprocessors, AMD GPUs, NVIDIA GPUs
  - OpenCL, CUDA, Offload, OpenHMPP, MPI, OpenMP, Hybrid
Main interests
- GSL-CL: OpenCL for the GNU Scientific Library
- Highly parallel quantum Monte Carlo: KISTI Monte Carlo (KMC)
- Parallel molecular dynamics code: KISTI Molecular Dynamics (KMD)

4 Amdahl's Law

5 Extreme-scale computing issues
KMD code: KISTI Quantum Diffusion Monte Carlo

6 Heterogeneous Programming Model
OpenACC, OpenHMPP, CUDA, OpenCL, OpenMP 4.0, SIMD, Offloading

7 CUDA: Multi-GPU Scalability
- MPI+CUDA on multiple GPUs (Tesla 2050)
- MPI+CUDA weak scaling
- Optimization: asynchronous execution of kernels and MPI communication

8 Benefits of Heterogeneous Computing
[Figure: three cases with speed-up = 1, speed-up = 5, and speed-up = 19]

9 What is Heterogeneous Multi-core Computing?
Many-core processors, coprocessors, and accelerators: GPU, FPGA, MIC, DSP

10 Heterogeneous Computing Era
[Figure: computing performance by year, with K (Japan) and Tianhe-2 (China) marked and an exaflops system (Japan?) projected]

11 PART II: XEON PHI COPROCESSOR

12 Xeon Phi Node
320 Gflops

13 Coprocessor
- Cross-compile for the coprocessor
- OpenMP, POSIX threads, OpenCL, and MPI are usable
- 240 hardware threads on a single chip
- However, the cores are in-order
- Very small memory (8 GB), small caches (only 2 levels)
- Limited hardware prefetching
- Poor single-thread performance (~1 GHz)
- Host CPU will be bored
- Only suitable for highly parallelized / scalable codes

14 Vector register
- Scalar: 64(80)-bit wide register, 1 double/instruction
  - add [a] [b]
- Intel SSE: 128-bit wide register (xmm), 2 doubles/instruction
  - SSE2, SSE4.2
- Intel AVX: 256-bit wide register (ymm), 4 doubles/instruction
  - Shipped with Sandy Bridge; later implemented by AMD
- Intel Phi: 512-bit wide register (zmm), 8 doubles/instruction
  - Shipped with Knights Corner in 2012
- Intrinsics: _mm{bit}_{operation}_{packed, scalar}{precision}(), where {bit} is empty for 128-bit registers
  - e.g. _mm_add_pd(), _mm_exp_pd() on icc
- Compiler auto-vectorization, C Extension for Array Notation (Cilk+), ArBB (C++)
- OpenCL: easy vectorization combined with multi-threading

15 SIMD Fused Multiply-Add

16 There is no single driving force
Source: Georg Hager
Perf. = ncores * SIMD * Freq * FMA
- Parallelization: OpenMP, Cilk
- Heterogeneous: MPI, OpenCL
- Vectorization: SIMD
- Optimization: prefetch, loop; cache, latency
HC (heterogeneous computing) is here to stay: SIMD + OpenMP + MPI + OpenCL/CUDA

17 CPU/GPU/MIC
TDP: Thermal Design Power

18 Computing Node
- Host (two sockets): Xeon, 8 cores, 2.53 GHz, 12 GB memory each
- Coprocessors (two cards): Xeon Phi, 61 cores, 1.1 GHz (one card is labeled 1.13 GHz), 8 GB GDDR5 each
- Interconnect: PCI Express, 6 GB/s

19 MPI Programming Models
- Host-only model
  - All MPI ranks reside on the host
  - The coprocessors can be used through offload pragmas
- Coprocessor-only model (native mode)
  - All MPI ranks reside on the coprocessor
- Symmetric model
  - The MPI ranks reside on both the host and the coprocessor

20 Native MPI: Symmetric Mode
[Diagram: host with memory and L3 cache connected over PCIe to two mic coprocessors, each with its own memory]
    mpirun -host localhost -n 6 a.cpu : -host mic0 -n 120 a.mic : -host mic1 -n 120 a.mic

21 SAXPY in native mode
    #include <stdlib.h>
    /* y = a*x + y in single precision */
    void saxpycpu(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    int main(int argc, const char* argv[])
    {
        int n = 10240;
        float a = 2.0f;
        float* x = (float*) malloc(n * sizeof(float));
        float* y = (float*) malloc(n * sizeof(float));
        for (int i = 0; i < n; ++i) {
            x[i] = i;
            y[i] = 2.0*i + 1.0;
        }
        saxpycpu(n, a, x, y);
        free(x);
        free(y);
        return 0;
    }
Compile for the coprocessor:
    $ icc -mmic -o a.mic saxpy.c

22 Bandwidth on a single core
- Example: array copy x(:) = a*x(:) + y(:)
- How does data travel from memory to the CPU and back?
[Figure: single-core bandwidth versus memory size (kB) for Sandy Bridge and Xeon Phi]

23 Copy bandwidth on Intel Phi
Setup
- Double-precision 1D array
- Memory aligned to 64 bytes
- TRIAD
- Xeon Phi 5110P: theoretical aggregate bandwidth = 352 GB/s
- 2x # of real cores
- OpenMP with Intel C/C++ compiler 2013 XE
- KMP_AFFINITY=scatter

24 Roofline Model for Xeon Phi (DP)
Performance is bounded above by both the peak flop rate and the product of streaming bandwidth and the flop:byte ratio.

25 Thread Affinity Choices
Intel OpenMP supports the following affinity types:
- Compact: assign threads to consecutive hardware contexts on the same physical core
- Scatter: assign consecutive threads to different physical cores to maximize access to memory

26 Go Parallel with OpenMP

27 Offloading Strategy
- Offload directive: #pragma offload target (mic:0 or mic:1)
- Great time to venture into many-core computing
  - Try offloading the compute-intensive sections
  - Optimize data transfers
  - Split the calculation
  - Use asynchronous transfers
- Good for
  - Code that spends a lot of time doing computation without I/O
  - Data that is relatively easy to encapsulate
  - Computation time that is substantially higher than the data transfer time

28 Count3s example with Offloading
    #include <stdio.h>
    #include <stdlib.h>
    /* Compiled for both the host and the coprocessor. */
    __attribute__((target(mic)))
    int count3s(const int N, const int* data)
    {
        int icount = 0;
        int i;
        for (i = 0; i < N; i++) {
            if (data[i] == 3) icount++;
        }
        return icount;
    }
    int main(void)
    {
        /* ... setup of array and nsize is not shown on the slide ... */
        int num_c3s;
        #pragma offload target(mic) in(array:length(nsize))
        {
            printf("Counting 3s from MIC! \n");
            fflush(0);
            num_c3s = count3s(nsize, array);
        }
        /* ... result printing is not shown on the slide ... */
    }
Output:
    $ ./a.out
    Hello World from CPU!
    Hello World from MIC!
    Counting 3s from MIC!
    num_c3s= in the array[ ] ratio= 33.32(%)

29 LJ code benchmark
Native mode, N2=500, file I/O

30 PART III: OPENCL

31 Need for OpenCL
- Diverse vendors and hardware
- An industry-standard programming platform is required: OpenCL
Table: OpenCL-applicable devices in the market (vendors: Altera, AMD, ARM, IBM, Intel, NVIDIA; device types: CPU, GPU, APU, accelerator)

32 Recent trends in computing devices
- Multi-core CPU: ~16 cores / socket, a few tens per node
  - NUMA, large on-chip caches, high-frequency (~3.5 GHz) operation
  - CPU vector instructions: SSE, AVX (256 bit), AVX2
- Intel Xeon Phi: many integrated cores
  - 60 cores, 512-bit wide vectors
- GPU: NVIDIA Tesla K20, AMD HD7970 (~2,000 cores)
  - Good double-precision performance, large memory (~6 GB)
- FPGA: programmable gate array, Altera OpenCL version

33 Prerequisites for OpenCL development
Identify the computing devices installed on your computer, then visit the appropriate vendor homepages and download the following:
- AMD (x86_64 CPUs, Radeon GPUs, Fusion APUs): SDK and GPU driver
- Intel (Intel x64 CPUs, GPUs, Phi): SDK
- NVIDIA (GPUs): SDK and GPU driver
- IBM (POWER, common runtime for x86)
- ARM (mobile CPUs)
- SNU
- Portland Group

34 2) Memory Model
- Private memory: one per work-item
- Local memory: one per compute unit, shared by its work-items
- Global memory: on the device, 70~150 GB/s
- Host memory: on the host, reached over PCIe (slow, ~5 GB/s)

35 OpenCL Diagram
- Hardware setup: Platforms, Devices, Context
- Compile code: Programs, Kernels
- Data & arguments: Memory
- Send to execution: Queues, Enqueues

36 Platform Model
[Diagram: host connected by an interconnect (PCIe) to compute devices; each compute device contains compute units made up of processing elements]
You can think of the OpenCL platform as a vendor, that is, AMD, Intel, etc.
Platform == Vendor

37 OpenCL Context
You can think of the OpenCL context as a computational workspace.
Context == Our work table
You can think of the OpenCL device as a CPU, GPU, etc.
Device == CPU or GPU

38 OpenCL Command Queue
You can think of the OpenCL command queue as a job manager.
Command Queue == Manager, Professor

39 OpenCL: Game of Cards
- Platform: type of card game
- Host: dealer
- Context: table
- Device: player (Player 0, Player 1, Player 2, Player 3)
- Command Queue: hand
- Program: deck of cards
- Kernel: card (A, K, Q, J)

40 Workflow
- Main path: Platform ID -> Device ID -> Context -> Command Queue -> Execute Kernel -> Read Buffer
- Program: Create Program -> Build Program
- Kernel: Create Kernel -> Set Kernel Arg.
- Buffer: Create Buffer -> Write Buffer

41 OpenCL Program
- Load and keep the source code of the OpenCL kernels before executing them (kernel == function): Create Program With Source
- The loaded kernel source code is then compiled: Build Program == Compile + Link
- Specify the function (kernel) arguments of each kernel, since there is no call stack on the GPU: Set Kernel Arg == function arguments

42 Read/Write Buffer
Synchronizes the contents of an array on the host with the corresponding array on the GPU.
Buffer R/W == Synchronization

43 Execute OpenCL Kernel
You can simply run an OpenCL kernel on the GPU/CPU.
Enqueue Task => Simple run

44 Comparison with the offload model
OpenCL
- Steps: Platform ID, Device ID, Context, Command Queue, Program, Kernel, Buffer, Execute Kernel, Read Buffer
- Low-level API
- One source for heterogeneous devices
Offload model
    __declspec(target(mic)) function1();
    __declspec(target(mic)) function2();
    #pragma offload target(mic) inout(x:length(3*n)) in(v:length(3*n)) nocopy(f:length(3*n))
    { functions }
- High-level API
- Better choice for CPU/Phi-only computing

45 Performance comparison
- Benchmark: Lennard-Jones MD (# of molecules = ..., ... time steps)
- Work group size = 16 or 32
- OpenCL shows better performance.
- Both codes are optimized to the same degree.
[Figure: elapsed time (seconds) versus number of threads for offload and OpenCL]

46 Summary
- You will benefit from using Xeon Phi if you can
  - Take advantage of the high memory bandwidth
  - Take advantage of the wide vectors
  - Take advantage of the high thread count
- If you can't, you're probably better off with Xeons
- If you want good performance, you will need to optimize your code by extracting parallelism with OpenMP, SIMD, OpenCL, or MPI
- This is at least the same amount of work you would put into porting to CUDA.
