Graphics Processing Units: Can we really trust the compiler to reach raw performance? Emmanuel Quemener, IR CBP

Graphics Processing Units: Can we really trust the compiler to reach raw performance? Emmanuel Quemener, IR CBP 1/28

Starting with a tiny joke! What do you call people who speak 3 languages? Trilingual people! What do you call people who speak 2 languages? Bilingual people! What do you call people who speak 1 language? French people! I'm French: if I hurt your eardrums, I apologize... 2/28

Plan of talk...
What's the Blaise Pascal Center?
Where are we (now)? Raw processing power is NO longer inside the CPU.
Where do we want to go (tomorrow)? What facilities are provided for developing, learning and exploring in IT? What can we expect, at best, to reduce the elapsed time?
How do we get there? Compiler? High-level libraries? Hybrid approach?
3/28

What's the Blaise Pascal Center?
Director: Pr Ralf Everaers. Centre Blaise Pascal: «maison de la modélisation» (the "modelling house"), a hotel for projects, conferences and training on all IT services.
Hotel for projects: technical benches in an experimental platform open to everybody; digital laboratory benches for laboratories with specific purposes.
Hotel for training: ATOSIM (Erasmus Mundus); continuing education for researchers, teachers & engineers; advanced education: M1, M2 in physics, chemistry, geology, biology,...
Test center: to reuse, to divert, to explore in HPC & HPDA.
4/28

Multinode TechBenches: 9 clusters, 116 nodes, 4 InfiniBand speeds.
2 nodes Sun V20Z with AMD 250: 4 physical cores @2400MHz, InfiniBand SDR 10 Gb/s interconnect
2 nodes Sun X2200 with AMD 2218HE: 8 physical cores @2600MHz, InfiniBand DDR 20 Gb/s interconnect
8 nodes Sun X4150 with Xeon E5440: 64 physical cores @2833MHz, InfiniBand DDR 20 Gb/s interconnect
64 nodes Dell R410 with Xeon X5550: 512 physical cores HT @2666MHz, InfiniBand QDR 40 Gb/s interconnect
4 nodes Dell C6100 with Xeon X5650: 48 physical cores HT @2666MHz, InfiniBand QDR 40 Gb/s interconnect + C410X with 4 GPGPUs
16 nodes Dell C6100 with Xeon X5650: 192 physical cores HT @2666MHz, InfiniBand QDR 40 Gb/s interconnect
8 nodes HP SL230 with Xeon E5-2667: 64 physical cores HT @2666MHz, InfiniBand FDR 56 Gb/s interconnect
8 nodes Dell R410 with Xeon X5550: 64 physical cores HT @2666MHz, InfiniBand DDR 20 Gb/s interconnect
4 nodes Sun X4500 with AMD 285: 16 physical cores @2400MHz, InfiniBand SDR 10 Gb/s interconnect
5/28

Multicore TechBenches: from 2 to 28 cores, some examples... 6/28

Multishader TechBenches: (GP)GPUs, 72 different models...
GPU Gamer (18): Nvidia GTX 560 Ti, GTX 680, GTX 690, GTX Titan, GTX 780, GTX 780 Ti, GTX 750, GTX 750 Ti, GTX 960, GTX 970, GTX 980, GTX 980 Ti, GTX 1050 Ti, GTX 1060, GTX 1070, GTX 1080, GTX 1080 Ti, RTX 2080 Ti
GPGPU (9): Nvidia Tesla C1060, Tesla M2050, Tesla M2070, Tesla M2090, Tesla K20m, Tesla K40c, Tesla K40m, Tesla K80, Tesla P100
GPU desktop & pro (27): NVS 290, Nvidia FX 4800, NVS 310, NVS 315, Quadro 600, Quadro 2000, Quadro 4000, Quadro K2000, Quadro K4000, Quadro K420, Quadro P600, 8400 GS, 8500 GT, 8800 GT, 9500 GT, GT 220, GT 320, GT 430, GT 620, GT 640, GT 710, GT 730, GT 1030, Quadro 2000M, Quadro K4000M, Quadro M2200, MX150
GPU AMD Gamer & al (18): HD 4350, HD 4890, HD 5850, HD 5870, HD 6450, HD 6670, Fusion E2-1800 GPU, HD 7970, FirePro V5900, FirePro W5000, Kaveri A10-7850K GPU, R7 240, R9 290, R9 295X2, Nano Fury, R9 Fury, R9 380, RX Vega64
7/28

How to represent these testbenches? A question of frequencies, cores,... [Figure: (GP)GPUs versus CPUs] 8/28

On the TechBenches at CBP: SIDUS.
What? A system that is never installed, just started!
Why? To guarantee unicity of configurations and to limit the footprint of systems on drives.
For whom? Researchers, teachers, students, engineers,... anyone deploying a system simply on a set of machines.
When & where? Centre Blaise Pascal: since 2010, nearly 200 machines. PSMN: since 2011, more than 500 nodes (their own instance). Laboratories: UMPA, LBMC, IGFL, CRAL,...
How? Use a network-shared folder and divert a hack used in live CDs for decades.
9/28

Where are we? The TOP500 & the accelerators.
Where: in the TOP500 list. When: June 2018.
How much: about 1/4 of the systems use accelerators, either MIC (Intel & clones) or GPGPU (Nvidia, AMD), including 7 of the Top 10.
10/28

Where are we? What to expect? CPU & GPU benchmarks on a laptop: a «Pi Dart Dash» in C/OpenCL and in Python/OpenCL, and a naive «N-Body» (a sketch of the dart-throwing loop follows below). [Figure: CPU versus GPU results on the laptop] 11/28
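The «Pi Dart Dash» used throughout the following slides appears to be a Monte Carlo estimation of pi: each task throws random darts at the unit square and counts how many land inside the quarter circle (the host code later sums per-task «inside» counts). Below is a minimal serial sketch of such an inner loop in C; it is an illustration under that assumption, not the CBP source, and the generator constants are the standard Marsaglia MWC values suggested by the seed_w/seed_z arguments shown later.

/* Hypothetical sketch of a "Pi Dart Dash" inner loop (not the original CBP source). */
#include <stdint.h>

/* Marsaglia multiply-with-carry generator, assumed from the seed_w/seed_z arguments. */
static inline uint32_t MWC(uint32_t *w, uint32_t *z)
{
    *z = 36969 * (*z & 65535) + (*z >> 16);
    *w = 18000 * (*w & 65535) + (*w >> 16);
    return (*z << 16) + *w;
}

/* Throw 'iterations' random darts at the unit square, count those inside the quarter circle. */
uint64_t MainLoopGlobal(uint64_t iterations, uint32_t seed_w, uint32_t seed_z)
{
    uint64_t inside = 0;
    for (uint64_t i = 0; i < iterations; i++) {
        float x = (float)MWC(&seed_w, &seed_z) / 4294967296.0f; /* uniform in [0,1) */
        float y = (float)MWC(&seed_w, &seed_z) / 4294967296.0f;
        if (x * x + y * y <= 1.0f)
            inside++;
    }
    return inside; /* pi is then estimated as 4 * (total inside) / (total iterations) */
}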

Raw performances across generations: a «Pi Dart Dash» in OpenCL.
[Figure: FP32 and FP64 rates (up to ~6×10¹¹) for Intel & AMD CPUs (from 2x Xeon X5550 up to 2x Xeon Gold 6142 and 2x AMD Epyc 7401, plus the 64n8c and 8n16c clusters), the MIC Phi 7120P, AMD/ATI GPUs (HD 5850 to RX Vega64), Nvidia GTX cards for gamers (GTX 560 Ti to RTX 2080 Ti) and Tesla boards for pros (C1060 to P100); dual-GPU boards are flagged «2 GPUs per board!»]
12/28

Raw performances in FP64: the same «Pi Dart Dash».
[Figure: FP64-only rates (up to ~2×10¹¹) for the same CPUs, clusters, MIC, AMD/ATI GPUs, GTX gamer cards and Tesla boards]
13/28

For a more «physical» code: Nbody.py in OpenCL/OpenGL (statistics & execution). A deliberately simple implicit Euler implementation with 32768 particles. How many distances are computed at each step? (See the sketch below.) 14/28
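As a rough answer to the question above: a naive all-pairs scheme evaluates about N² distances per step, i.e. 32768² ≈ 1.07×10⁹. A minimal sketch of such a loop in C (an illustration of the complexity, not the Nbody.py source), assuming unit masses and G = 1:

/* Naive O(N^2) pairwise acceleration, illustrating the per-step cost (not the Nbody.py code). */
#include <math.h>
#define N 32768

void accelerations(const float x[N], const float y[N], const float z[N],
                   float ax[N], float ay[N], float az[N])
{
    for (int i = 0; i < N; i++) {
        float axi = 0.f, ayi = 0.f, azi = 0.f;
        for (int j = 0; j < N; j++) {               /* N*N = 32768^2, about 1.07e9 pairs per step */
            float dx = x[j] - x[i], dy = y[j] - y[i], dz = z[j] - z[i];
            float r2 = dx * dx + dy * dy + dz * dz + 1e-6f; /* softening avoids the i==j singularity */
            float inv_r3 = 1.0f / (r2 * sqrtf(r2));
            axi += dx * inv_r3; ayi += dy * inv_r3; azi += dz * inv_r3;
        }
        ax[i] = axi; ay[i] = ayi; az[i] = azi;
    }
}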

For a more «physical» code: the «N-Body» Newtonian code.
[Figure: D²/T rates in FP32 and FP64 for Intel & AMD CPUs, the MIC Phi 7120P, AMD/ATI GPUs, GTX gamer cards and Tesla boards]
15/28

In double precision, another reading: GPGPUs are the best, but CPUs are not so bad!
[Figure: D²/T rates in FP64 for the same CPUs, MIC, AMD/ATI GPUs, GTX gamer cards and Tesla boards]
16/28

The Xeon Phi deserves a mention: a stillborn, but instructive. Three ways of programming it:
Intel compiler, cross-compiling, execution on the embedded micro-system
Intel compiler, OpenMP in «offload» mode, transparent execution
Intel implementation of OpenCL for MIC

What is offload in OpenMP? Classical OpenMP on independent tasks:

#pragma omp parallel for
for (int i = 0; i < process; i++) {
    inside[i] = MainLoopGlobal(iterations / process, seed_w + i, seed_z + i);
}

Offload call on the Xeon Phi MIC (num_teams(60) and thread_limit(4) mirror the Phi's 60 usable cores with 4 hardware threads each):

#pragma omp target device(0)
#pragma omp teams num_teams(60) thread_limit(4)
#pragma omp distribute
for (int i = 0; i < process; i++) {
    inside[i] = MainLoopGlobal(iterations / process, seed_w + i, seed_z + i);
}
17/28

And in the C/OpenCL implementation? Not really simple... but wait!

// Select the platform & device
err = clGetPlatformIDs(0, NULL, &platformCount);
platforms = (cl_platform_id*) malloc(sizeof(cl_platform_id) * platformCount);
err = clGetPlatformIDs(platformCount, platforms, NULL);
err = clGetDeviceIDs(platforms[MyPlatform], CL_DEVICE_TYPE_ALL, 0, NULL, &deviceCount);
devices = (cl_device_id*) malloc(sizeof(cl_device_id) * deviceCount);
err = clGetDeviceIDs(platforms[MyPlatform], CL_DEVICE_TYPE_ALL, deviceCount, devices, NULL);
cl_context GPUContext = clCreateContext(props, 1, &devices[MyDevice], NULL, NULL, &err);
cl_command_queue cqCommandQueue = clCreateCommandQueue(GPUContext, devices[MyDevice], 0, &err);
cl_mem GPUInside = clCreateBuffer(GPUContext, CL_MEM_WRITE_ONLY, sizeof(uint64_t) * ParallelRate, NULL, NULL);
cl_program OpenCLProgram = clCreateProgramWithSource(GPUContext, 130, OpenCLSource, NULL, NULL);
clBuildProgram(OpenCLProgram, 0, NULL, NULL, NULL, NULL);
cl_kernel OpenCLMainLoopGlobal = clCreateKernel(OpenCLProgram, "MainLoopGlobal", NULL);
// Set attributes to the OpenCL kernel
clSetKernelArg(OpenCLMainLoopGlobal, 0, sizeof(cl_mem), &GPUInside);
clSetKernelArg(OpenCLMainLoopGlobal, 1, sizeof(uint64_t), &IterationsEach);
clSetKernelArg(OpenCLMainLoopGlobal, 2, sizeof(uint32_t), &seed_w);
clSetKernelArg(OpenCLMainLoopGlobal, 3, sizeof(uint32_t), &seed_z);
clSetKernelArg(OpenCLMainLoopGlobal, 4, sizeof(uint32_t), &MyType);
size_t WorkSize[1] = {ParallelRate}; // one-dimensional range
clEnqueueNDRangeKernel(cqCommandQueue, OpenCLMainLoopGlobal, 1, NULL, WorkSize, NULL, 0, NULL, NULL);
clEnqueueReadBuffer(cqCommandQueue, GPUInside, CL_TRUE, 0, ParallelRate * sizeof(uint64_t), HostInside, 0, NULL, NULL);
18/28
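The host code above builds and launches a kernel named MainLoopGlobal whose source is not reproduced on the slide. A plausible device-side counterpart, inferred from the host-side arguments (it is not the original CBP kernel), would run the same dart-throwing loop as the earlier serial sketch in each work-item and write its own hit count into the GPUInside buffer:

// Hypothetical device-side counterpart of the host code above (not the original CBP kernel).
__kernel void MainLoopGlobal(__global ulong *Inside, const ulong IterationsEach,
                             const uint seed_w, const uint seed_z, const uint MyType)
{
    const uint gid = get_global_id(0);
    uint w = seed_w + gid, z = seed_z + gid;   // one independent RNG stream per work-item
    ulong hits = 0;
    for (ulong i = 0; i < IterationsEach; i++) {
        z = 36969 * (z & 65535) + (z >> 16);   // Marsaglia MWC, as in the serial sketch
        w = 18000 * (w & 65535) + (w >> 16);
        float x = (float)((z << 16) + w) / 4294967296.0f;
        z = 36969 * (z & 65535) + (z >> 16);
        w = 18000 * (w & 65535) + (w >> 16);
        float y = (float)((z << 16) + w) / 4294967296.0f;
        if (x * x + y * y <= 1.0f) hits++;
    }
    // MyType presumably selects the working precision in the real code; this sketch stays in FP32.
    Inside[gid] = hits;   // the host reads back ParallelRate partial counts and sums them
}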

And what about the performance on the Xeon Phi, with its 60 cores? Performance in itops, for a parallel rate (PR) swept from 60 to 7680 in steps of 60.
OpenMP offload: better for a parallel rate of 32× the number of cores, optimum at PR=7680.
OpenCL: better than OpenMP, but only when PR is a multiple of 4× the number of cores (240); optimum at PR=3840.
[Figure: OMP FP32/FP64 and OCL FP32/FP64 rates, up to ~3×10¹⁰, for parallel rates from 0 to 7680]
19/28

Performance on the Xeon Phi: the OpenCL/OpenMP ratio.
Speed-up of OpenCL above 2.5 when PR is a multiple of 240 (the Phi's 60 cores × 4 hardware threads); slow-down of OpenCL for all other values of PR.
[Figure: OpenCL/OpenMP ratios in FP32 and FP64, between 0 and 4.5, for parallel rates from 0 to 7680]
20/28

Performance around a local maximum.
For OpenCL, maxima recur with a period of 16 in the parallel rate, and performance drops by a factor of 40 between the best and worst parallel rates; OpenMP performance is stable.
[Figure: OMP and OCL rates in FP32 and FP64 for parallel rates between 3780 and 3900]
21/28

What does OpenACC promise? The same thing as Intel promised with OpenMP offload? From the presentation https://www.olcf.ornl.gov/wp-content/uploads/2012/08/openacc.pdf :

Parallel construct: #pragma acc parallel [clause [[,] clause]...] new-line
Data construct: #pragma acc data [clause [[,] clause]...] new-line
Loop construct: #pragma acc loop [clause [[,] clause]...] new-line

Why not... but can we test it? Yes! From Portland (PGI), a complete implementation of OpenACC on Nvidia.
22/28
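To make the three constructs concrete before applying them to the Pi Dart Dash, here is a minimal, illustrative sketch (not taken from the cited OLCF slides) of a vector update in C that uses data, parallel and loop together; with a compiler that does not support OpenACC, the pragmas are simply ignored and the loop runs on the CPU.

/* Minimal illustration of the OpenACC data / parallel / loop constructs (illustrative sketch). */
void saxpy(int n, float a, const float *restrict x, float *restrict y)
{
    /* data construct: copy x in, copy y in and out, keep both on the device for the region */
    #pragma acc data copyin(x[0:n]) copy(y[0:n])
    {
        /* parallel + loop constructs: offload the loop and distribute its iterations */
        #pragma acc parallel loop
        for (int i = 0; i < n; i++)
            y[i] = a * x[i] + y[i];
    }
}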

OpenACC on the «Pi Dart Dash»: a very simple implementation...

Previous core routine:

#pragma acc routine
LENGTH MainLoopGlobal(LENGTH iterations, unsigned int seed_w, unsigned int seed_z)

Previous distribution loop:

#pragma omp parallel for shared(ParallelRate, inside)
#pragma acc kernels loop
for (int i = 0; i < ParallelRate; i++) {
    inside[i] = MainLoopGlobal(IterationsEach, seed_w + i, seed_z + i);
}
23/28

OpenACC/OpenMP on the «Pi Dart Dash»: mixed results — the good, the bad and the ugly.
[Figure: OpenCL, OpenACC and OpenMP rates in FP32 and FP64, on linear and log scales, for a Tesla P100, an RTX 2080 Ti, a 2x4-core E5-2637v4 and a 2x10-core E5-2640v4]
24/28

Performance of a GPUfied code: PKDGRAV3, a small hybrid code (MPI + OpenMP + CUDA kernels).
Compilation: classical cmake operations, BUT an old compiler is required.
Execution: 256³ particles; by default the GPU is used (be careful on a multi-GPU machine); -cqs 0 performs a CPU-only simulation.
25/28

PKDGRAV3 vs PiDartDash in FP64: comparison between CPU & GPU.
[Figure: 1/Elapsed for PKDGRAV3 (CPU only and CPU+GPU) next to the FP64 Pi Dart Dash rate, on recent CPU sets (2x E5-2640v4 10c, 2x E5-2637v4 4c, 2x Epyc 7401 24c, 2x Gold 6142 16c) and GPUs (Tesla P100, RTX 2080 Ti, GTX 1080 Ti); annotated speed-ups of x2, x3 and x4, against x15 on the raw FP64 benchmark]
Lots of work to do...
26/28

How do we get there? You have seen:
Hybrid approach with directives (OpenMP, OpenACC). The pros: easy to use, the legacy code is kept. The cons: compiler licence & continuity, locked in with Nvidia.
Kokkos or SYCL approach. The pros & cons: C++ & continuity, still mostly locked in with Nvidia.
But there is an alternative, the «thinking device» approach: use Python as a skeleton calling small & efficient kernels; write the kernels in OpenCL (or CUDA) for CPU, GPU and GPGPU; provide «matrix» codes as «inspiring» codes for grain size, scalability, etc. «Thinking device» demands fitting the parallel rate to the device.
27/28

Conclusion... In all cases, keep 3 questions in mind:
Do you need FP64 only sometimes, not all the time?
Is your use case (code + data) compatible with a myriad of parallel tasks? Myriad means tens of thousands; the optimum sits around 16 times the number of ALUs (for example, roughly 16 x 3584 ≈ 57000 work-items on a 3584-ALU gamer card such as the GTX 1080 Ti).
Do you need less than 16 GB of RAM for the core?
If yes, a (GP)GPU is a solution, and a gamer GTX can be much cheaper!
28/28