GPU ACCELERATORS AT JSC OF THREADS AND KERNELS


GPU Accelerators at JSC: Of Threads and Kernels
28 May 2018, Andreas Herten, Forschungszentrum Jülich, Member of the Helmholtz Association

Outline
GPUs at JSC
GPU Architecture: Empirical Motivation, Comparisons; 3 Core Features: Memory, Asynchronicity, SIMT; High Throughput; Summary
Programming GPUs: Libraries; OpenACC/OpenMP; CUDA C/C++; Performance Analysis; Advanced Topics
Using GPUs on JURECA: Compiling; Resource Allocation

JURECA — Jülich's Multi-Purpose Supercomputer
1872 nodes with Intel Xeon E5-2680 v3 Haswell CPUs (2 × 12 cores)
75 nodes with 2 NVIDIA Tesla K80 cards each (appearing as 4 GPUs); each card with 2 × 12 GB RAM
JURECA Booster: 1640 nodes with Intel Xeon Phi Knights Landing
1.8 (CPU) + 0.44 (GPU) + 5 (KNL) PFLOP/s peak performance (#29)

JURON — A Human Brain Project Prototype (one of the two HBP pilot systems in Jülich, alongside JULIA)
18 nodes with IBM POWER8NVL CPUs (2 × 10 cores)
Per node: 4 NVIDIA Tesla P100 cards (16 GB HBM2 memory), connected via NVLink
GPU: 0.38 PFLOP/s peak performance

JUWELS — Jülich's New Large System (currently under construction)
2500 nodes with Intel Xeon CPUs (2 × 24 cores)
48 nodes with 4 NVIDIA Tesla V100 cards (16 GB HBM2 memory)
10.4 (CPU) + 1.6 (GPU) PFLOP/s peak performance

GPU Architecture

Why?

Status Quo Across Architectures
Two comparisons over time (2008–2016) for Intel Xeon CPUs, NVIDIA Tesla GPUs, AMD Radeon GPUs, and Intel Xeon Phis: theoretical peak performance in double precision (GFLOP/s, log scale) and theoretical peak memory bandwidth (GB/s). Accelerators lead contemporary CPUs in both metrics by a wide margin; e.g., the Tesla P100 reaches about 720 GB/s of memory bandwidth, far above the Xeon E5-2699 v4. Graphic: Rupp [2]

CPU vs. GPU — A matter of specialties
A CPU is like a motorbike: transporting one, fast. A GPU is like a coach: transporting many, with high throughput. Graphics: Lee [3] and Shearings Holidays [4]

CPU vs. GPU — Chip
Schematic die comparison: the CPU devotes much of its area to control logic and cache next to a few powerful ALUs; the GPU is dominated by many ALUs with only small control and cache portions. Each attaches to its own DRAM.

GPU Architecture Overview
Aim: hide latency — everything else follows.
Three core features: Memory, Asynchronicity, SIMT.


Memory — GPU memory ain't no CPU memory: Unified Virtual Addressing
The GPU is an accelerator / extension card: a separate device from the CPU, with separate memory, but Unified Virtual Addressing (UVA).
Memory transfers need special consideration — do as little of them as possible!
Formerly: explicitly copy data to/from the GPU. Now: done automatically (at what performance cost?).
Connection between host DRAM and device DRAM: PCIe, < 16 GB/s.

Memory — GPU memory ain't no CPU memory: Unified Memory
Separate memory, but UVA and Unified Memory (UM).
P100: 16 GB RAM at 720 GB/s; V100: 16 (32) GB RAM at 900 GB/s.
Device memory: HBM2, < 720 GB/s; CPU–GPU connection: NVLink, 80 GB/s.

Processing Flow: CPU → GPU → CPU
The GPU consists of many SMs behind a scheduler, with an L2 cache and DRAM, connected to the CPU and its memory via an interconnect.
1. Transfer data from CPU memory to GPU memory; transfer the program.
2. Load the GPU program, execute it on the SMs, get (cached) data from memory; write back.
3. Transfer the results back to host memory.
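The three steps can be sketched in CUDA C with explicit transfers (a minimal sketch; the `scale` kernel, the data, and the launch configuration are illustrative, not from the slides):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

// Illustrative kernel: doubles every element
__global__ void scale(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] *= 2.0f;
}

int main() {
    const int n = 1024;
    float h[n];                          // host (CPU) memory
    for (int i = 0; i < n; i++) h[i] = (float)i;

    float *d;                            // device (GPU) memory
    cudaMalloc(&d, n * sizeof(float));

    // 1. Transfer data from CPU memory to GPU memory
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // 2. Execute the GPU program on the SMs
    scale<<<(n + 255) / 256, 256>>>(d, n);

    // 3. Transfer the results back to host memory
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);

    cudaFree(d);
    printf("h[3] = %f\n", h[3]);
    return 0;
}
```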


Async — Following different streams
Problem: memory transfer is comparatively slow.
Solution: do something else in the meantime (computation)! Overlap tasks.
Copy and compute engines run separately (streams), so copies and kernel executions can be interleaved: Copy/Compute, Copy/Compute, ...
The GPU needs to be fed: schedule many computations.
The CPU can do other work while the GPU computes; synchronization where needed.
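Such copy/compute overlap can be sketched with two CUDA streams (a minimal sketch; the `process` kernel, the chunking, and the sizes are illustrative assumptions):

```cuda
#include <cuda_runtime.h>

// Illustrative kernel operating on one chunk of the data
__global__ void process(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d[i] += 1.0f;
}

int main() {
    const int n = 1 << 20, half = n / 2;
    float *h, *d;
    // Pinned host memory is required for copies to actually run asynchronously
    cudaMallocHost(&h, n * sizeof(float));
    cudaMalloc(&d, n * sizeof(float));

    cudaStream_t s[2];
    cudaStreamCreate(&s[0]);
    cudaStreamCreate(&s[1]);

    // While one stream computes on its chunk, the other can already copy:
    // within a stream, operations are ordered; across streams, they overlap.
    for (int k = 0; k < 2; k++) {
        float *hc = h + k * half, *dc = d + k * half;
        cudaMemcpyAsync(dc, hc, half * sizeof(float), cudaMemcpyHostToDevice, s[k]);
        process<<<(half + 255) / 256, 256, 0, s[k]>>>(dc, half);
        cudaMemcpyAsync(hc, dc, half * sizeof(float), cudaMemcpyDeviceToHost, s[k]);
    }
    cudaDeviceSynchronize();  // the CPU could do other work before this point

    cudaStreamDestroy(s[0]); cudaStreamDestroy(s[1]);
    cudaFreeHost(h); cudaFree(d);
    return 0;
}
```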


SIMT — SIMT = SIMD + SMT
CPU: Single Instruction, Multiple Data (SIMD) — a vector unit computes, e.g., C = A + B elementwise (A0+B0=C0, ..., A3+B3=C3) — plus Simultaneous Multithreading (SMT): multiple threads per core.
GPU: Single Instruction, Multiple Threads (SIMT). A CPU core corresponds to a GPU multiprocessor (SM).
Working unit: a set of 32 threads (a warp).
Fast switching of threads (large register file); branching via if is possible.
Graphics: Nvidia Corporation [5] (Tesla V100).
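Branching inside a warp deserves care: because a warp shares one instruction stream, a divergent if executes both paths one after the other, masking off the threads not on the current path. A hypothetical sketch (kernel and condition are illustrative, not from the slides):

```cuda
#include <cuda_runtime.h>

// Illustrative kernel showing warp divergence: even and odd threads sit in
// the same warp, so the two branch bodies below run serially, each with the
// other half of the warp's threads masked off.
__global__ void diverge(float *d, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0)
        d[i] *= 2.0f;   // runs first, odd threads masked
    else
        d[i] += 1.0f;   // runs afterwards, even threads masked
}

int main() {
    const int n = 64;
    float *d;
    cudaMallocManaged(&d, n * sizeof(float));
    for (int i = 0; i < n; i++) d[i] = 1.0f;
    diverge<<<1, n>>>(d, n);   // one block of two warps
    cudaDeviceSynchronize();   // d[0] == 2.0f (1*2), d[1] == 2.0f (1+1)
    cudaFree(d);
    return 0;
}
```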

Low Latency vs. High Throughput — Maybe the GPU's ultimate feature
CPU: minimizes latency within each thread. A CPU core processes threads T1...T4 one at a time, with context switches in between.
GPU: hides latency with computations from other thread warps. A streaming multiprocessor interleaves many warps W1...W4 — while one warp waits (e.g., for memory), a ready warp is processed instead.

CPU vs. GPU — Let's summarize this!
CPU, optimized for low latency:
+ Large main memory
+ Fast clock rate
+ Large caches
+ Branch prediction
+ Powerful ALU
− Relatively low memory bandwidth
− Cache misses costly
− Low performance per watt
GPU, optimized for high throughput:
+ High-bandwidth main memory
+ Latency tolerant (parallelism)
+ More compute resources
+ High performance per watt
− Limited memory capacity
− Low per-thread performance
− Extension card

Programming GPUs

Preface: CPU — A simple CPU program
SAXPY: y = a*x + y, with single precision. Part of BLAS Level 1 (as in LAPACK).

void saxpy(int n, float a, float *x, float *y) {
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

float a = 42;
int n = 10;
float x[n], y[n];
// fill x, y
saxpy(n, a, x, y);

Programming GPUs: Libraries


Libraries — Programming GPUs is easy: just don't!
Use applications & libraries: cuBLAS, cuSPARSE, cuFFT, cuRAND, CUDA Math, Theano, ... Wizard graphic: Breazell [6]

cuBLAS — Parallel algebra
GPU-parallel BLAS (all 152 routines)
Single, double, complex data types
Constant competition with Intel's MKL
Multi-GPU support
https://developer.nvidia.com/cublas, http://docs.nvidia.com/cuda/cublas


cuBLAS — Code example

float a = 42;
int n = 10;
float x[n], y[n];
// fill x, y

// Initialize
cublasHandle_t handle;
cublasCreate(&handle);

// Allocate GPU memory
float *d_x, *d_y;
cudaMallocManaged(&d_x, n * sizeof(x[0]));
cudaMallocManaged(&d_y, n * sizeof(y[0]));

// Copy data to GPU
cublasSetVector(n, sizeof(x[0]), x, 1, d_x, 1);
cublasSetVector(n, sizeof(y[0]), y, 1, d_y, 1);

// Call BLAS routine
cublasSaxpy(handle, n, &a, d_x, 1, d_y, 1);

// Copy result to host
cublasGetVector(n, sizeof(y[0]), d_y, 1, y, 1);

// Finalize
cudaFree(d_x);
cudaFree(d_y);
cublasDestroy(handle);

Programming GPUs: OpenACC/OpenMP

GPU Programming with Directives — Keepin' you portable
Annotate serial source code with directives:

#pragma acc loop
for (int i = 0; i < n; i++) {}

OpenACC: especially for GPUs. OpenMP: has GPU support (in theory).
The compiler interprets the directives and creates the corresponding instructions.
Pro: Portability — a different compiler? No problem: to it, it's a serial program; different target architectures from the same code. Easy to program.
Con: Only a few compilers; not all the raw power available; harder to debug; easy to program wrong.

OpenACC — Code example

void saxpy_acc(int n, float a, float *x, float *y) {
    #pragma acc kernels
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

float a = 42;
int n = 10;
float x[n], y[n];
// fill x, y
saxpy_acc(n, a, x, y);

OpenACC — Code example

void saxpy_acc(int n, float a, float *x, float *y) {
    #pragma acc parallel loop copy(y) copyin(x)
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}

float a = 42;
int n = 10;
float x[n], y[n];
// fill x, y
saxpy_acc(n, a, x, y);

Programming GPUs: CUDA C/C++

Programming GPUs Directly — Finally
Two solutions:
OpenCL — Open Computing Language, by the Khronos Group (Apple, IBM, NVIDIA, ...), 2009. Platform: programming language (OpenCL C/C++), API, and compiler. Targets CPUs, GPUs, FPGAs, and other many-core machines. Fully open source; different compilers available.
CUDA — NVIDIA's GPU platform, 2007. Platform: drivers, programming language (CUDA C/C++), API, compiler, debuggers, profilers, ... Only NVIDIA GPUs. Compilation with nvcc (free, but not open); clang has CUDA support, but CUDA is needed for the last step. Also: CUDA Fortran.
Choose what flavor you like and what colleagues/collaborations are using.
Hardest part: coming up with a parallelized algorithm.

CUDA's Parallel Model — In software: threads and blocks
Methods to exploit parallelism: threads are grouped into blocks, blocks into a grid; threads and blocks can be arranged in up to three dimensions.
Parallel function: kernel

__global__ void kernel(int a, float *b) { }

Each thread accesses its own ID via the built-in variables threadIdx.x, blockIdx.y, ...
Execution entity: threads — lightweight, fast switching! Thousands of threads execute simultaneously; the execution order is non-deterministic!

CUDA SAXPY — With runtime-managed data transfers

__global__ void saxpy_cuda(int n, float a, float *x, float *y) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

float a = 42;
int n = 10;
float *x, *y;
cudaMallocManaged(&x, n * sizeof(float));
cudaMallocManaged(&y, n * sizeof(float));
// fill x, y
saxpy_cuda<<<2, 5>>>(n, a, x, y);
cudaDeviceSynchronize();

Programming GPUs: Performance Analysis

GPU Tools — The helpful helpers helping helpless (and others)
NVIDIA:
cuda-gdb — GDB-like command-line utility for debugging
cuda-memcheck — like Valgrind's memcheck, for checking errors in memory accesses
Nsight — IDE for GPU development, based on Eclipse (Linux, OS X) or Visual Studio (Windows)
nvprof — command-line profiler, including detailed performance counters
Visual Profiler — timeline profiling and annotated performance experiments
OpenCL: CodeXL (open source, GPUOpen/AMD) — debugging, profiling

nvprof — Command that line

$ nvprof ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
==37064== Profiling application: ./matrixMul -wA=1024 -hA=1024 -wB=1024 -hB=1024
==37064== Profiling result:
Time(%)   Time      Calls  Avg       Min       Max       Name
 99.19%  262.43ms    301   871.86us  863.88us  882.44us  void matrixMulCUDA<int=32>(float*, float*, float*, int, int)
  0.58%  1.5428ms      2   771.39us  764.65us  778.12us  [CUDA memcpy HtoD]
  0.23%  599.40us      1   599.40us  599.40us  599.40us  [CUDA memcpy DtoH]
==37064== API calls:
Time(%)   Time      Calls  Avg       Min       Max       Name
 61.26%  258.38ms      1   258.38ms  258.38ms  258.38ms  cudaEventSynchronize
 35.68%  150.49ms      3   50.164ms  914.97us  148.65ms  cudaMalloc
  0.73%  3.0774ms      3   1.0258ms  1.0097ms  1.0565ms  cudaMemcpy
  0.62%  2.6287ms      4   657.17us  655.12us  660.56us  cuDeviceTotalMem
  0.56%  2.3408ms    301   7.7760us  7.3810us  53.103us  cudaLaunch
  0.48%  2.0111ms    364   5.5250us  235ns     201.63us  cuDeviceGetAttribute
  0.21%  872.52us      1   872.52us  872.52us  872.52us  cudaDeviceSynchronize
  0.15%  612.20us   1505   406ns     361ns     1.1970us  cudaSetupArgument
  0.12%  499.01us      3   166.34us  140.45us  216.16us  cudaFree

Visual Profiler

Advanced Topics — So many more interesting things to show!
Optimize memory transfers to reduce overhead
Optimize applications for the GPU architecture
Drop-in BLAS acceleration with NVBLAS ($LD_PRELOAD)
Tensor Cores for deep learning
Use multiple GPUs: on one node; across many nodes (MPI)
Most of that: addressed at dedicated training courses
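The NVBLAS drop-in mentioned above works by preloading the library so that Level-3 BLAS calls of an unmodified binary are intercepted and routed to the GPU; roughly as follows (the library path, the binary name, and the CPU BLAS named in the config are system-dependent assumptions):

```shell
# nvblas.conf (read from the current directory by default, or via
# NVBLAS_CONFIG_FILE) must name a CPU BLAS to fall back to, e.g. a line:
#   NVBLAS_CPU_BLAS_LIB libopenblas.so
LD_PRELOAD=libnvblas.so ./my-blas-app
```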

Using GPUs on JURECA

Compiling on JURECA
CUDA — Module: module load CUDA/9.1.85. Compile: nvcc file.cu. Default host compiler: g++; use nvcc_pgc++ for the PGI compiler. cuBLAS: g++ file.cpp -I$CUDA_HOME/include -L$CUDA_HOME/lib64 -lcublas -lcudart
OpenACC — Module: module load PGI/17.10-GCC-5.5.0. Compile: pgc++ -acc -ta=tesla file.cpp
MPI — Module: module load MVAPICH2/2.3a-GDR. Enabled for CUDA (CUDA-aware); no need to copy data to the host before transfer.

Running on JURECA
Dedicated GPU partitions: gpus and develgpus (+ vis)
--partition=develgpus — 4 nodes total (job: < 2 h, up to 2 nodes)
--partition=gpus — 70 nodes total (job: < 1 d, up to 32 nodes)
Needed: resource configuration with --gres, e.g. --gres=gpu:2, --gres=gpu:4, or --gres=mem1024,gpu:2 with --partition=vis
See the online documentation.

Example — 96 tasks in total, running on 4 nodes; per node: 4 GPUs

#!/bin/bash -x
#SBATCH --nodes=4
#SBATCH --ntasks=96
#SBATCH --ntasks-per-node=24
#SBATCH --output=gpu-out.%j
#SBATCH --error=gpu-err.%j
#SBATCH --time=00:15:00
#SBATCH --partition=gpus
#SBATCH --gres=gpu:4

srun ./gpu-prog

Conclusion, Resources
GPUs provide highly parallel computing power, and we have many devices installed at JSC, ready to be used!
Training courses by JSC: CUDA course, April 2019; OpenACC course, 29–30 October 2018.
Generally: see the online documentation and sc@fz-juelich.de.
Further consultation via our lab: NVIDIA Application Lab in Jülich — contact me!
Interested in JURON? Get access!
Thank you for your attention! a.herten@fz-juelich.de

APPENDIX

Appendix: Glossary, References

Glossary I
API — A programmatic interface to software via well-defined functions. Short for application programming interface.
CUDA — Computing platform for GPUs from NVIDIA. Provides, among others, CUDA C/C++.
JSC — Jülich Supercomputing Centre, the supercomputing institute of Forschungszentrum Jülich, Germany.
JURECA — A multi-purpose supercomputer with 1800 nodes at JSC.
JURON — One of the two HBP pilot systems in Jülich; name derived from Jülich and Neuron.
JUWELS — Jülich's new supercomputer, the successor of JUQUEEN.

Glossary II
MPI — The Message Passing Interface, an API definition for multi-node computing.
NVIDIA — US technology company creating GPUs.
NVLink — NVIDIA's communication protocol connecting CPU–GPU and GPU–GPU with high bandwidth.
OpenACC — Directive-based programming, primarily for many-core machines.
OpenCL — The Open Computing Language. Framework for writing code for heterogeneous architectures (CPU, GPU, DSP, FPGA). The alternative to CUDA.
OpenMP — Directive-based programming, primarily for multi-threaded machines.
P100 — A large GPU with the Pascal architecture from NVIDIA. It employs NVLink as its interconnect and has fast HBM2 memory.

Glossary III
Pascal — GPU architecture from NVIDIA (announced 2016).
POWER — CPU architecture from IBM, earlier: PowerPC. See also POWER8.
POWER8 — Version 8 of IBM's POWER processor, available also under the OpenPOWER Foundation.
SAXPY — Single-precision A·X + Y. A simple code example of scaling a vector and adding an offset.
Tesla — The GPU product line for general-purpose computing of NVIDIA.
CPU — Central Processing Unit.

Glossary IV
GPU — Graphics Processing Unit.
HBP — Human Brain Project.
SIMD — Single Instruction, Multiple Data.
SIMT — Single Instruction, Multiple Threads.
SM — Streaming Multiprocessor.
SMT — Simultaneous Multithreading.

References I
[2] Karl Rupp. Pictures: CPU/GPU Performance Comparison. URL: https://www.karlrupp.net/2013/06/cpu-gpu-and-mic-hardware-characteristics-over-time/
[6] Wes Breazell. Picture: Wizard. URL: https://thenounproject.com/wes13/collection/its-a-wizards-world/

References: Images, Graphics I
[1] Alexandre Debiève. Bowels of computer. Freely available at Unsplash. URL: https://unsplash.com/photos/fo7jilwjotu
[3] Mark Lee. Picture: kawasaki ninja. URL: https://www.flickr.com/photos/pochacco20/39030210/
[4] Shearings Holidays. Picture: Shearings coach 636. URL: https://www.flickr.com/photos/shearings/13583388025/
[5] Nvidia Corporation. Pictures: Volta GPU. Volta Architecture Whitepaper. URL: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper-v1.0.pdf