Graphics Processing Units: Can we really trust the compiler to reach raw performance? Emmanuel Quemener, IR CBP

Graphics Processing Units: Can we really trust the compiler to reach raw performance? Emmanuel Quemener, IR CBP 1/28

Starting with a tiny joke! What do you call people who speak 3 languages? Trilingual people! What do you call people who speak 2 languages? Bilingual people! What do you call people who speak 1 language? French people! I'm French: if I hurt your eardrums, I apologize... 2/28

Plan of talk...
What's the Blaise Pascal Center?
Where are we (now)? Raw processing power is NO longer inside the CPU.
Where do we want to go (tomorrow)? What facilities are provided for developing, learning and exploring in IT? What can we expect, at best, to reduce the elapsed time?
How do we get there? Compiler? High-level libraries? Hybrid approach?
3/28

What's the Blaise Pascal Center?
Director: Pr Ralf Everaers. Centre Blaise Pascal: «maison de la modélisation» (the "modelling house"), a hotel for projects, conferences and training on all IT services.
Hotel for projects: technical benches in an experimental platform open to everybody; digital laboratory benches for laboratories with specific purposes.
Hotel for training: ATOSIM (Erasmus Mundus); continuing education for researchers, teachers & engineers; advanced education: M1, M2 in physics, chemistry, geology, biology,...
Test center: to reuse, to divert, to explore in HPC & HPDA.
4/28

Multinode TechBenches: 9 clusters, 116 nodes, 4 InfiniBand speeds.
2 nodes Sun V20Z with AMD 250: 4 physical cores @2400MHz, InfiniBand SDR 10 Gb/s interconnect
2 nodes Sun X2200 with AMD 2218HE: 8 physical cores @2600MHz, InfiniBand DDR 20 Gb/s interconnect
8 nodes Sun X4150 with Xeon E5440: 64 physical cores @2833MHz, InfiniBand DDR 20 Gb/s interconnect
64 nodes Dell R410 with Xeon X5550: 512 physical cores HT @2666MHz, InfiniBand QDR 40 Gb/s interconnect
4 nodes Dell C6100 with Xeon X5650: 48 physical cores HT @2666MHz, InfiniBand QDR 40 Gb/s interconnect + C410X with 4 GPGPUs
16 nodes Dell C6100 with Xeon X5650: 192 physical cores HT @2666MHz, InfiniBand QDR 40 Gb/s interconnect
8 nodes HP SL230 with Xeon E5-2667: 64 physical cores HT @2666MHz, InfiniBand FDR 56 Gb/s interconnect
8 nodes Dell R410 with Xeon X5550: 64 physical cores HT @2666MHz, InfiniBand DDR 20 Gb/s interconnect
4 nodes Sun X4500 with AMD 285: 16 physical cores @2400MHz, InfiniBand SDR 10 Gb/s interconnect
5/28

Multicore TechBenches: from 2 to 28 cores, some examples... 6/28

Multishader TechBenches: (GP)GPUs, 72 different models...
GPU Gamer (18): Nvidia GTX 560 Ti, GTX 680, GTX 690, GTX Titan, GTX 780, GTX 780 Ti, GTX 750, GTX 750 Ti, GTX 960, GTX 970, GTX 980, GTX 980 Ti, GTX 1050 Ti, GTX 1060, GTX 1070, GTX 1080, GTX 1080 Ti, RTX 2080 Ti
GPGPU (9): Nvidia Tesla C1060, Tesla M2050, Tesla M2070, Tesla M2090, Tesla K20m, Tesla K40c, Tesla K40m, Tesla K80, Tesla P100
GPU desktop & pro (27): NVS 290, Nvidia FX 4800, NVS 310, NVS 315, Quadro 600, Quadro 2000, Quadro 4000, Quadro K2000, Quadro K4000, Quadro K420, Quadro P600, 8400 GS, 8500 GT, 8800 GT, 9500 GT, GT 220, GT 320, GT 430, GT 620, GT 640, GT 710, GT 730, GT 1030, Quadro 2000M, Quadro K4000M, Quadro M2200, MX150
GPU AMD Gamer & al (18): HD 4350, HD 4890, HD 5850, HD 5870, HD 6450, HD 6670, Fusion E2-1800 GPU, HD 7970, FirePro V5900, FirePro W5000, Kaveri A10-7850K GPU, R7 240, R9 290, R9 295X2, Nano Fury, R9 Fury, R9 380, RX Vega64
7/28

How to represent these testbenches? A question of frequencies, cores,... [Figure: (GP)GPUs versus CPUs] 8/28

On the TechBenches at CBP: SIDUS.
What? A system that is never installed, just started!
Why? To guarantee unicity of configurations and to limit the footprint of systems on drives.
For whom? Researchers, teachers, students, engineers,... anyone deploying a system simply on a set of machines.
When & where? Centre Blaise Pascal: since 2010, nearly 200 machines. PSMN: since 2011, more than 500 nodes (their own instance). Laboratories: UMPA, LBMC, IGFL, CRAL,...
How? Use a network-shared folder and divert a hack used in live CDs for decades.
9/28

Where are we? The TOP500 & the accelerators.
Where: in the TOP500 list. When: June 2018.
How much: about 1/4 of the systems use accelerators, either MIC (Intel & clones) or GPGPU (Nvidia, AMD), including 7 of the Top 10.
10/28

Where are we? What to expect? CPU & GPU benchmarks on a laptop: a «Pi Dart Dash» in C/OpenCL and in Python/OpenCL, and a naive «N-Body» (a sketch of the dart-throwing loop follows below). [Figure: CPU versus GPU results on the laptop] 11/28
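The «Pi Dart Dash» used throughout the following slides appears to be a Monte Carlo estimation of pi: each task throws random darts at the unit square and counts how many land inside the quarter circle (the host code later sums per-task «inside» counts). Below is a minimal serial sketch of such an inner loop in C; it is an illustration under that assumption, not the CBP source, and the generator constants are the standard Marsaglia MWC values suggested by the seed_w/seed_z arguments shown later.

/* Hypothetical sketch of a "Pi Dart Dash" inner loop (not the original CBP source). */
#include <stdint.h>

/* Marsaglia multiply-with-carry generator, assumed from the seed_w/seed_z arguments. */
static inline uint32_t MWC(uint32_t *w, uint32_t *z)
{
    *z = 36969 * (*z & 65535) + (*z >> 16);
    *w = 18000 * (*w & 65535) + (*w >> 16);
    return (*z << 16) + *w;
}

/* Throw 'iterations' random darts at the unit square, count those inside the quarter circle. */
uint64_t MainLoopGlobal(uint64_t iterations, uint32_t seed_w, uint32_t seed_z)
{
    uint64_t inside = 0;
    for (uint64_t i = 0; i < iterations; i++) {
        float x = (float)MWC(&seed_w, &seed_z) / 4294967296.0f; /* uniform in [0,1) */
        float y = (float)MWC(&seed_w, &seed_z) / 4294967296.0f;
        if (x * x + y * y <= 1.0f)
            inside++;
    }
    return inside; /* pi is then estimated as 4 * (total inside) / (total iterations) */
}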

Raw performances across generations: a «Pi Dart Dash» in OpenCL.
[Figure: FP32 and FP64 rates (up to ~6×10¹¹) for Intel & AMD CPUs (from 2x Xeon X5550 up to 2x Xeon Gold 6142 and 2x AMD Epyc 7401, plus the 64n8c and 8n16c clusters), the MIC Phi 7120P, AMD/ATI GPUs (HD 5850 to RX Vega64), Nvidia GTX cards for gamers (GTX 560 Ti to RTX 2080 Ti) and Tesla boards for pros (C1060 to P100); dual-GPU boards are flagged «2 GPUs per board!»]
12/28

Raw performances in FP64: the same «Pi Dart Dash».
[Figure: FP64-only rates (up to ~2×10¹¹) for the same CPUs, clusters, MIC, AMD/ATI GPUs, GTX gamer cards and Tesla boards]
13/28

For a more «physical» code: Nbody.py in OpenCL/OpenGL (statistics & execution). A deliberately simple implicit Euler implementation with 32768 particles. How many distances are computed at each step? (See the sketch below.) 14/28
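As a rough answer to the question above: a naive all-pairs scheme evaluates about N² distances per step, i.e. 32768² ≈ 1.07×10⁹. A minimal sketch of such a loop in C (an illustration of the complexity, not the Nbody.py source), assuming unit masses and G = 1:

/* Naive O(N^2) pairwise acceleration, illustrating the per-step cost (not the Nbody.py code). */
#include <math.h>
#define N 32768

void accelerations(const float x[N], const float y[N], const float z[N],
                   float ax[N], float ay[N], float az[N])
{
    for (int i = 0; i < N; i++) {
        float axi = 0.f, ayi = 0.f, azi = 0.f;
        for (int j = 0; j < N; j++) {               /* N*N = 32768^2, about 1.07e9 pairs per step */
            float dx = x[j] - x[i], dy = y[j] - y[i], dz = z[j] - z[i];
            float r2 = dx * dx + dy * dy + dz * dz + 1e-6f; /* softening avoids the i==j singularity */
            float inv_r3 = 1.0f / (r2 * sqrtf(r2));
            axi += dx * inv_r3; ayi += dy * inv_r3; azi += dz * inv_r3;
        }
        ax[i] = axi; ay[i] = ayi; az[i] = azi;
    }
}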

For a more «physical» code: the «N-Body» Newtonian code.
[Figure: D²/T rates in FP32 and FP64 for Intel & AMD CPUs, the MIC Phi 7120P, AMD/ATI GPUs, GTX gamer cards and Tesla boards]
15/28

In double precision, another reading: GPGPUs are the best, but CPUs are not so bad!
[Figure: D²/T rates in FP64 for the same CPUs, MIC, AMD/ATI GPUs, GTX gamer cards and Tesla boards]
16/28

The Xeon Phi deserves a mention: a stillborn, but instructive. Three ways of programming it:
Intel compiler, cross-compiling, execution on the embedded micro-system
Intel compiler, OpenMP in «offload» mode, transparent execution
Intel implementation of OpenCL for MIC

What is offload in OpenMP? Classical OpenMP on independent tasks:

#pragma omp parallel for
for (int i = 0; i < process; i++) {
    inside[i] = MainLoopGlobal(iterations / process, seed_w + i, seed_z + i);
}

Offload call on the Xeon Phi MIC (num_teams(60) and thread_limit(4) mirror the Phi's 60 usable cores with 4 hardware threads each):

#pragma omp target device(0)
#pragma omp teams num_teams(60) thread_limit(4)
#pragma omp distribute
for (int i = 0; i < process; i++) {
    inside[i] = MainLoopGlobal(iterations / process, seed_w + i, seed_z + i);
}
17/28

And in the C/OpenCL implementation? Not really simple... but wait!

// Select the platform & device
err = clGetPlatformIDs(0, NULL, &platformCount);
platforms = (cl_platform_id*) malloc(sizeof(cl_platform_id) * platformCount);
err = clGetPlatformIDs(platformCount, platforms, NULL);
err = clGetDeviceIDs(platforms[MyPlatform], CL_DEVICE_TYPE_ALL, 0, NULL, &deviceCount);
devices = (cl_device_id*) malloc(sizeof(cl_device_id) * deviceCount);
err = clGetDeviceIDs(platforms[MyPlatform], CL_DEVICE_TYPE_ALL, deviceCount, devices, NULL);
cl_context GPUContext = clCreateContext(props, 1, &devices[MyDevice], NULL, NULL, &err);
cl_command_queue cqCommandQueue = clCreateCommandQueue(GPUContext, devices[MyDevice], 0, &err);
cl_mem GPUInside = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY, sizeof(uint64_t) * ParallelRate, NULL, NULL);
cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 130, OpenCLSource, NULL, NULL);
clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);
cl_kernel OpenCLMainLoopGlobal = clCreateKernel(OpenCLProgram, "MainLoopGlobal", NULL);
// Set attributes to the OpenCL kernel
clSetKernelArg(OpenCLMainLoopGlobal, 0, sizeof(cl_mem), &GPUInside);
clSetKernelArg(OpenCLMainLoopGlobal, 1, sizeof(uint64_t), &IterationsEach);
clSetKernelArg(OpenCLMainLoopGlobal, 2, sizeof(uint32_t), &seed_w);
clSetKernelArg(OpenCLMainLoopGlobal, 3, sizeof(uint32_t), &seed_z);
clSetKernelArg(OpenCLMainLoopGlobal, 4, sizeof(uint32_t), &MyType);
size_t WorkSize[1] = {ParallelRate}; // one-dimensional range
clEnqueueNDRangeKernel(cqCommandQueue, OpenCLMainLoopGlobal, 1, NULL, WorkSize, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(cqCommandQueue, GPUInside, CL_TRUE, 0, ParallelRate * sizeof(uint64_t), HostInside, 0, NULL, NULL);
18/28
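The host code above builds and launches a kernel named MainLoopGlobal whose source is not reproduced on the slide. A plausible device-side counterpart, inferred from the host-side arguments (it is not the original CBP kernel), would run the same dart-throwing loop as the earlier serial sketch in each work-item and write its own hit count into the GPUInside buffer:

// Hypothetical device-side counterpart of the host code above (not the original CBP kernel).
__kernel void MainLoopGlobal(__global ulong *Inside, const ulong IterationsEach,
                             const uint seed_w, const uint seed_z, const uint MyType)
{
    const uint gid = get_global_id(0);
    uint w = seed_w + gid, z = seed_z + gid;   // one independent RNG stream per work-item
    ulong hits = 0;
    for (ulong i = 0; i < IterationsEach; i++) {
        z = 36969 * (z & 65535) + (z >> 16);   // Marsaglia MWC, as in the serial sketch
        w = 18000 * (w & 65535) + (w >> 16);
        float x = (float)((z << 16) + w) / 4294967296.0f;
        z = 36969 * (z & 65535) + (z >> 16);
        w = 18000 * (w & 65535) + (w >> 16);
        float y = (float)((z << 16) + w) / 4294967296.0f;
        if (x * x + y * y <= 1.0f) hits++;
    }
    // MyType presumably selects the working precision in the real code; this sketch stays in FP32.
    Inside[gid] = hits;   // the host reads back ParallelRate partial counts and sums them
}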

And what about the performance on the Xeon Phi, with its 60 cores? Performance in itops, for a parallel rate (PR) swept from 60 to 7680 in steps of 60.
OpenMP offload: better for a parallel rate of 32× the number of cores, optimum at PR=7680.
OpenCL: better than OpenMP, but only when PR is a multiple of 4× the number of cores (240); optimum at PR=3840.
[Figure: OMP FP32/FP64 and OCL FP32/FP64 rates, up to ~3×10¹⁰, for parallel rates from 0 to 7680]
19/28

Performance on the Xeon Phi: the OpenCL/OpenMP ratio.
Speed-up of OpenCL above 2.5 when PR is a multiple of 240 (the Phi's 60 cores × 4 hardware threads); slow-down of OpenCL for all other values of PR.
[Figure: OpenCL/OpenMP ratios in FP32 and FP64, between 0 and 4.5, for parallel rates from 0 to 7680]
20/28

Performance around a local maximum.
For OpenCL, maxima recur with a period of 16 in the parallel rate, and performance drops by a factor of 40 between the best and worst parallel rates; OpenMP performance is stable.
[Figure: OMP and OCL rates in FP32 and FP64 for parallel rates between 3780 and 3900]
21/28

What does OpenACC promise? The same thing as Intel promised with OpenMP offload? From the presentation https://www.olcf.ornl.gov/wp-content/uploads/2012/08/openacc.pdf :

Parallel construct: #pragma acc parallel [clause [[,] clause]...] new-line
Data construct: #pragma acc data [clause [[,] clause]...] new-line
Loop construct: #pragma acc loop [clause [[,] clause]...] new-line

Why not... but can we test it? Yes! From Portland (PGI), a complete implementation of OpenACC on Nvidia.
22/28
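To make the three constructs concrete before applying them to the Pi Dart Dash, here is a minimal, illustrative sketch (not taken from the cited OLCF slides) of a vector update in C that uses data, parallel and loop together; with a compiler that does not support OpenACC, the pragmas are simply ignored and the loop runs on the CPU.

/* Minimal illustration of the OpenACC data / parallel / loop constructs (illustrative sketch). */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    /* data construct: copy x in, copy y in and out, keep both on the device for the region */
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        /* parallel + loop constructs: offload the loop and distribute its iterations */
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }
}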

OpenACC on the «Pi Dart Dash»: a very simple implementation...

Previous core routine:

#pragma acc routine
LENGTH MainLoopGlobal(LENGTH iterations, unsigned int seed_w, unsigned int seed_z)

Previous distribution loop:

#pragma omp parallel for shared(ParallelRate, inside)
#pragma acc kernels loop
for (int i = 0; i < ParallelRate; i++) {
    inside[i] = MainLoopGlobal(IterationsEach, seed_w + i, seed_z + i);
}
23/28

OpenACC/OpenMP on the «Pi Dart Dash»: mixed results — the good, the bad and the ugly.
[Figure: OpenCL, OpenACC and OpenMP rates in FP32 and FP64, on linear and log scales, for a Tesla P100, an RTX 2080 Ti, a 2x4-core E5-2637v4 and a 2x10-core E5-2640v4]
24/28

Performance of a GPUfied code: PKDGRAV3, a small hybrid code (MPI + OpenMP + CUDA kernels).
Compilation: classical cmake operations, BUT an old compiler is required.
Execution: 256³ particles; by default the GPU is used (be careful on a multi-GPU machine); -cqs 0 performs a CPU-only simulation.
25/28

PKDGRAV3 vs PiDartDash in FP64: comparison between CPU & GPU.
[Figure: 1/Elapsed for PKDGRAV3 (CPU only and CPU+GPU) next to the FP64 Pi Dart Dash rate, on recent CPU sets (2x E5-2640v4 10c, 2x E5-2637v4 4c, 2x Epyc 7401 24c, 2x Gold 6142 16c) and GPUs (Tesla P100, RTX 2080 Ti, GTX 1080 Ti); annotated speed-ups of x2, x3 and x4, against x15 on the raw FP64 benchmark]
Lots of work to do...
26/28

How do we get there? You have seen:
Hybrid approach with directives (OpenMP, OpenACC). The pros: easy to use, the legacy code is kept. The cons: compiler licence & continuity, locked in with Nvidia.
Kokkos or SYCL approach. The pros & cons: C++ & continuity, still mostly locked in with Nvidia.
But there is an alternative, the «thinking device» approach: use Python as a skeleton calling small & efficient kernels; write the kernels in OpenCL (or CUDA) for CPU, GPU and GPGPU; provide «matrix» codes as «inspiring» codes for grain size, scalability, etc. «Thinking device» demands fitting the parallel rate to the device.
27/28

Conclusion... In all cases, keep 3 questions in mind:
Do you need FP64 only sometimes, not all the time?
Is your use case (code + data) compatible with a myriad of parallel tasks? Myriad means tens of thousands; the optimum sits around 16 times the number of ALUs (for example, roughly 16 x 3584 ≈ 57000 work-items on a 3584-ALU gamer card such as the GTX 1080 Ti).
Do you need less than 16 GB of RAM for the core?
If yes, a (GP)GPU is a solution, and a gamer GTX can be much cheaper!
28/28