High Performance Computing - Introduction to GPGPU Programming. Prof Matt Probert. http://www-users.york.ac.uk/~mijp1


Overview: What are GPUs? GPU Architecture. Memory. Programming in CUDA. The Future.

What is a GPU? A GPU is a massive vector processor: hundreds of processing units, large memory bandwidth, and processing units that can collaborate. Some problems can be solved more efficiently on vector processors (remember Crays). GPUs have some big advantages: cheap and very fast for vector problems; computation is asynchronous with the CPU; hardware tricks such as texture filtering (zero overhead!). Other accelerators (e.g. Xeon Phi) are available.

Value for Money (2016 prices): P100 ~ 9k, Ebor 40k, K80 5k, Xeon E7 5k. However, Ebor is a distributed-memory machine with 16 nodes of 16 Xeon cores each and a total of 256 GB of RAM etc. These are different tools for different problems.

When is GPU Programming useful? The GPU devotes more transistors to data processing (nvidia C Programming Guide). Good at doing lots of numeric calculations simultaneously - this is actually required for the GPU to be efficient - there must be many more numeric operations than memory operations to break even. Useful as a co-processor - can offload GPU-efficient calculations while the CPU continues with the rest. Brute forcing! - sometimes a brute force method is more efficient with many processors than an elegant solution on just one.

Floating Point Standard: nvidia architectures 2.x onwards (i.e. Fermi onwards) are IEEE 754 compliant. Old nvidia architectures were mostly compliant: generally, exceptions were handled in a non-compliant way, and some mathematics, e.g. FMAD, division and sqrt, was not standard. Standard-compliant intrinsic functions are available but at a large computational penalty (software implemented).

nvidia Architectures:
V1 = Tesla (2008) - introduced CUDA, with performance of 0.5 GFLOP/watt
V2 = Fermi (2010) - 64-bit floating point, performance = 2 GFLOP/W
V3 = Kepler (2012) - dynamic parallelism
V5 = Maxwell (2014) - faster Kepler: 3.8 TFLOP DP, 24 GB GDRAM, 2 GPU, 500 GB/s, 300 W
V6 = Pascal (2016) - unified memory, stacked DRAM, direct interconnect between GPU and main RAM to eliminate a common bottleneck
V7 = Volta (2018) - new tensor cores with half-precision math for machine learning; now 7.8 TFLOP DP, 15.7 TFLOP SP, 125 TFLOP HP

GPU Architecture: Lots of arithmetic units sharing small caches. Execution blocks are scheduled across processing cores. Large memory bandwidth but slow memory access. Fermi onwards have a full cache hierarchy.

Threads: Each thread runs one instance of a kernel. Threads are organised into blocks (which can be up to 3D); blocks are scheduled on and off the processors. A block executes on the processors as groups of threads called warps. Grids (which can be 2D) contain many blocks, which are executed on the device.
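As an illustrative sketch (CUDA C; the kernel name my_kernel2d and the doubling operation are invented for this example), each thread computes its own global coordinates from the built-in block and thread indices:

__global__ void my_kernel2d( float *data, int nx, int ny )
{
    // Global (x,y) position of this thread within the whole grid
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if ( x < nx && y < ny )          // guard: the grid may overhang the array
        data[y * nx + x] *= 2.0f;    // each thread handles one element
}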

Memory Model:
- Texture: write via CPU; allows hardware interpolation; 2D locality of arrays
- Constant: write via CPU; small, used for random access instructions
- Global: write via CPU and GPU
- Shared: local to block; low latency; fastest comms between threads
- Local: per-thread-only memory
- Registers: thread only; fastest memory available; limited space

Memory Scope: Registers are local to a thread and have thread lifetime. Shared memory is shared between the threads in a block and has block lifetime. Global memory is accessible everywhere and is persistent.
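A small sketch (CUDA C; scope_demo is hypothetical and assumes blocks of exactly 256 threads) showing all three scopes in one kernel:

__global__ void scope_demo( float *gdata )      // gdata: global memory, persistent
{
    float r = gdata[threadIdx.x];               // r: a register, thread lifetime
    __shared__ float s[256];                    // s: shared memory, block lifetime
    s[threadIdx.x] = r;
    __syncthreads();                            // make s visible to the whole block
    gdata[threadIdx.x] = s[255 - threadIdx.x];  // threads exchange data via s
}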

Tensor Cores: One of the key operations when training a neural network or doing ML is matrix-matrix multiplication (DGEMM). Each tensor core has a 4x4x4 matrix processing array, and can do 64 mixed-precision OPs/clock with half-precision inputs and either HP or SP output.
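Tensor cores are exposed in CUDA C via the WMMA API; below is a minimal sketch (assuming CUDA 9+, a single warp, and the canonical 16x16x16 fragment shape with half-precision inputs and single-precision accumulation):

#include <cuda_fp16.h>
#include <mma.h>
using namespace nvcuda;

__global__ void tensor_tile( half *a, half *b, float *c )
{
    // Per-warp fragments for one 16x16x16 matrix-multiply-accumulate
    wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
    wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> fb;
    wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc;
    wmma::fill_fragment( fc, 0.0f );
    wmma::load_matrix_sync( fa, a, 16 );   // 16 = leading dimension
    wmma::load_matrix_sync( fb, b, 16 );
    wmma::mma_sync( fc, fa, fb, fc );      // fc = fa*fb + fc on the tensor cores
    wmma::store_matrix_sync( c, fc, 16, wmma::mem_row_major );
}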

Programming Technologies (I): CUDA (Compute Unified Device Architecture) - Nvidia's proprietary platform. Offers both an API (extensions to C/C++) and driver-level programming. The proprietary PGI Fortran compiler allows Fortran development in both directive and kernel modes. OpenCL (Open Computing Language) - general platform for computing platforms (GPUs and CPUs). Implemented at driver level, e.g. built into MacOS. The specification is manufacturer (and device) independent: write once, run anywhere.

Programming Technologies (II): OpenACC - open standard version of the directives approach of PGI. Spec v2.0 (July 2013) has better support for control of data movement, calling external functions, and separate compilation for host & device, so can build libraries. Led by Cray, nvidia, PGI and CAPS, with support for Fortran and C/C++. Now in gcc and gfortran (since v6.1). OpenMP v4.0 - pushed by Intel to support Xeon Phi etc. A subset of OpenACC functionality at present; in gcc/gfortran since v4.9.1. Supposed to include all of OpenACC in future, but difficulties with Intel vs nvidia.

CUDA Program Structure: Serial (host) code; kernel (device) code - grid, blocks, threads. Must remember to allocate data on both host and device. The kernel executes asynchronously.
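Putting those pieces together, a minimal end-to-end sketch in CUDA C (the scale kernel and the sizes are invented for illustration):

#include <cuda_runtime.h>
#include <stdio.h>

__global__ void scale( float *x, float s, int n )   // kernel (device) code
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ( i < n ) x[i] *= s;
}

int main( void )                                    // serial (host) code
{
    const int N = 1024;
    float h_x[N], *d_x;                             // host and device copies
    for ( int i = 0; i < N; i++ ) h_x[i] = (float)i;
    cudaMalloc( (void **)&d_x, N * sizeof(float) ); // allocate on the device too
    cudaMemcpy( d_x, h_x, N * sizeof(float), cudaMemcpyHostToDevice );
    scale<<< N/256, 256 >>>( d_x, 2.0f, N );        // asynchronous kernel launch
    cudaDeviceSynchronize();                        // wait for the kernel to finish
    cudaMemcpy( h_x, d_x, N * sizeof(float), cudaMemcpyDeviceToHost );
    printf( "h_x[10] = %f\n", h_x[10] );
    cudaFree( d_x );
    return 0;
}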

Starting CUDA: Headers must be included; in this introduction only the API will be demonstrated.
use cudafor
#include <cuda.h>
#include <cuda_runtime.h>
C/C++ files with CUDA kernels must have the extension .cu; Fortran CUDA files must have the extension .cuf. See references at end for more complete examples...

Device Kernels: A kernel is a global subroutine (static function) which runs on the device. One instance of the kernel will be executed by every thread which is invoked.
attributes(global) subroutine my_kernel( a, b, c )
static __global__ void my_kernel( float *a, float *b, float *c );
Kernels can only address device memory spaces. Each executing kernel's behaviour differs by using the thread index, which uniquely identifies the thread:
tx = threadidx%x + (blockidx%x * blockdim%x)
int tx = threadIdx.x + (blockIdx.x * blockDim.x);

Host Code: Device memory must be allocated in advance by the host:
real, device, allocatable :: Adev(:,:)
... allocate( Adev(m,n), stat=istat )
double *devptr = NULL;
... cudaMalloc( (void **)&devptr, N*sizeof(double) );
Data must be copied (over the bus) to the device, and if necessary copied back to the host later. This is very slow! (Host memory can be allocated in non-paged (pinned) memory to avoid one copy operation.)
Adev = A   ! Host to device
cudaMemcpy( devptr, hostptr, N*sizeof(double), cudaMemcpyHostToDevice );
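A sketch of the pinned-memory variant (CUDA C; the buffer names are hypothetical) - page-locked host memory lets the driver DMA directly from the buffer, avoiding the extra staging copy:

const size_t N = 1 << 20;
double *h_buf, *d_buf;
cudaMallocHost( (void **)&h_buf, N * sizeof(double) );  // pinned host memory
cudaMalloc( (void **)&d_buf, N * sizeof(double) );      // device memory
cudaMemcpy( d_buf, h_buf, N * sizeof(double), cudaMemcpyHostToDevice );
cudaFreeHost( h_buf );   // pinned memory must be freed with cudaFreeHost
cudaFree( d_buf );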

Kernel Execution:
call my_kernel<<<grid,block>>>( arg1, arg2, ... )
my_kernel<<<grid,block>>>( arg1, arg2, ... );
grid specifies the dimensions of the grid (i.e. the number of blocks launched will be the product of the dimensions). block specifies the dimensions of each block (i.e. the number of threads per block is the product of the dimensions). dim3 is a derived type which has three members:
type(dim3) :: grid
integer :: x=5, y=5, z=1
grid = dim3(x,y,z)
int x=5, y=5, z=1;
dim3 grid(x,y,z);

Kernel Compilation and Run: CUDA kernels must be compiled with a CUDA compiler (pgfortran or nvcc). At runtime the kernel will be copied to the device on first execution - note that for accurate profiling, a kernel should be executed at least once before timing. OpenCL kernels are usually compiled at runtime; this allows the kernel to be optimised for the running context but has a larger initial overhead.
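A timing sketch using CUDA events (reusing the hypothetical scale kernel, d_x and N from the earlier sketch) - note the untimed warm-up launch so the one-off copy of the kernel to the device is excluded:

cudaEvent_t t0, t1;
float ms;
cudaEventCreate( &t0 ); cudaEventCreate( &t1 );
scale<<< N/256, 256 >>>( d_x, 2.0f, N );   // warm-up launch (untimed)
cudaDeviceSynchronize();
cudaEventRecord( t0 );
scale<<< N/256, 256 >>>( d_x, 2.0f, N );   // timed launch
cudaEventRecord( t1 );
cudaEventSynchronize( t1 );
cudaEventElapsedTime( &ms, t0, t1 );       // elapsed time in milliseconds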

Advanced Features: Streaming - data can be moved across the system bus while computation is happening; this is good for large data structures which don't fit in device memory. Texture/Surface memory - memory can be accessed via non-integer or surface coordinates, and linear interpolation can also be performed. Fast intrinsic functions - some special functions exist which allow certain operations to be performed very quickly although they are not IEEE compliant, for example rsqrt, the reciprocal square root. These can be useful for areas of the code where precision is less important.
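A streaming sketch (CUDA C; again reusing the hypothetical scale kernel, d_x and N, and assuming h_x was allocated with cudaMallocHost, since async copies only overlap from pinned memory) - the array is split in two so the copy for one half overlaps the compute on the other:

cudaStream_t s[2];
for ( int i = 0; i < 2; i++ ) cudaStreamCreate( &s[i] );
for ( int i = 0; i < 2; i++ ) {
    int off = i * (N/2);
    cudaMemcpyAsync( d_x + off, h_x + off, (N/2)*sizeof(float),
                     cudaMemcpyHostToDevice, s[i] );
    scale<<< (N/2)/256, 256, 0, s[i] >>>( d_x + off, 2.0f, N/2 );  // 4th <<<>>> arg = stream
    cudaMemcpyAsync( h_x + off, d_x + off, (N/2)*sizeof(float),
                     cudaMemcpyDeviceToHost, s[i] );
}
cudaDeviceSynchronize();   // wait for both streams to drain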

CUDA Libraries: V6 (2014+) is a drop-in replacement for BLAS etc. with auto offload from CPU to GPU, so free speed-up!
CUBLAS - BLAS library; includes all S,D,C,Z level 1-3 BLAS routines
CUFFT - FFT library (free); 1D, 2D, 3D complex and real; stream enabled for parallel data movement & computation
CUSPARSE - sparse matrix library; BLAS-style routines between sparse & dense matrices
CURAND - random number library
Now also XT versions for multi-GPU support (non-free)
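Calling CUBLAS from C is straightforward; a sketch with the v2 API (assuming m, n, k are set and dA, dB, dC are device pointers already holding column-major m x k, k x n and m x n matrices):

#include <cublas_v2.h>

cublasHandle_t h;
cublasCreate( &h );
float alpha = 1.0f, beta = 0.0f;
cublasSgemm( h, CUBLAS_OP_N, CUBLAS_OP_N,   // C = alpha*A*B + beta*C, no transposes
             m, n, k, &alpha,
             dA, m,                         // device pointer + leading dimension
             dB, k, &beta,
             dC, m );
cublasDestroy( h );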

CUBLAS Fortran Interfacing: Interfacing the CUBLAS library (written in C) with CUDA Fortran from PGI requires an interface to be written using the C-interoperability part of F2003, e.g.

module cublas
  interface cuda_gemm
    subroutine cuda_sgemm( cta, ctb, m, n, k, alpha, A, &
        &  lda, B, ldb, beta, C, ldc ) bind(c,name='cublasSgemm')
      use iso_c_binding
      character(1,c_char), value :: cta, ctb
      integer(c_int), value :: m, n, k, lda, ldb, ldc
      real(c_float), value :: alpha, beta
      real(c_float), device, dimension(lda,*) :: A
      real(c_float), device, dimension(ldb,*) :: B
      real(c_float), device, dimension(ldc,*) :: C
    end subroutine cuda_sgemm
  end interface cuda_gemm
end module cublas

Fortran Example:

subroutine mmul( A, B, C )
  use cudafor
  real, dimension(:,:) :: A, B, C
  integer :: N, M, L
  real, device, allocatable, dimension(:,:) :: Adev, Bdev, Cdev
  type(dim3) :: dimGrid, dimBlock
  N = size(A,1) ; M = size(A,2) ; L = size(B,2)
  allocate( Adev(N,M), Bdev(M,L), Cdev(N,L) )
  Adev = A(1:N,1:M) ; Bdev = B(1:M,1:L)
  dimGrid = dim3( N/16, L/16, 1 )
  dimBlock = dim3( 16, 16, 1 )
  call mmul_kernel<<<dimGrid,dimBlock>>>( Adev, Bdev, Cdev, N, M, L )
  C(1:N,1:L) = Cdev
  deallocate( Adev, Bdev, Cdev )
end subroutine

Fortran Kernel:

attributes(global) subroutine MMUL_KERNEL( A, B, C, N, M, L )
  real, device :: A(N,M), B(M,L), C(N,L)
  integer, value :: N, M, L
  integer :: i, j, kb, k, tx, ty
  real, shared :: Ab(16,16), Bb(16,16)
  real :: Cij
  tx = threadidx%x ; ty = threadidx%y
  i = (blockidx%x-1) * 16 + tx ; j = (blockidx%y-1) * 16 + ty
  Cij = 0.0
  do kb = 1, M, 16
    ! Fetch one element each into Ab and Bb; NB the 16x16 = 256 threads
    ! in this thread-block are fetching separate elements of Ab and Bb
    Ab(tx,ty) = A(i,kb+ty-1)
    Bb(tx,ty) = B(kb+tx-1,j)
    ! Wait until all elements of Ab and Bb are filled
    call syncthreads()
    do k = 1, 16
      Cij = Cij + Ab(tx,k) * Bb(k,ty)
    enddo
    ! Wait until all threads in the thread-block finish with this Ab and Bb
    call syncthreads()
  enddo
  C(i,j) = Cij
end subroutine

OpenMP v4.5: Spec finalized in Nov 2015 (C/C++/Fortran). Extensions to SIMD and TASK; array reduction now allowed; extended TARGET attributes for better accelerator performance; lots of new stuff for DEVICE. Available in GNU v6.1 and Intel v18.0. V5.0 due any day now...

OpenMP v4.5: On CPU: #pragma omp parallel for. On GPU: #pragma omp target teams distribute parallel for. [Performance figure: IBM XL compiler with nvidia offloading (Dec 16), LULESH benchmark from OpenMP.org]
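A sketch of what the GPU directive looks like around an actual loop (C; x and n are hypothetical, and the map clause moves the array to the device and back):

#pragma omp target teams distribute parallel for map(tofrom: x[0:n])
for ( int i = 0; i < n; i++ )
    x[i] = 2.0f * x[i];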

OpenACC (I): The compiler can auto-generate kernels for a section of code (e.g. multiple loops and/or Fortran array operations). Maximum flexibility; auto-detects dependencies:
!$acc kernels
   ! fortran loop(s) to be executed on device
!$acc end kernels
#pragma acc kernels
{
   /* C loop(s) to be executed on device */
}

OpenACC (II): Or the user can assert that a single loop is dependency-free and hence safe to be parallelized:
!$acc parallel loop
do i = 1, n   ! single fortran loop; NB no matching end-parallel
#pragma acc parallel loop
for (i=0; i<n; i++) { /* single C loop to be executed on device */ }   // NB no block braces
NB data may be copied back to the host between successive parallel sections, but stays on the device for the duration of a kernels section.
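A sketch in C (x and n hypothetical) combining the parallel loop assertion with a data region, so that x stays on the device across two successive loops instead of being copied back between them:

#pragma acc data copy(x[0:n])
{
    #pragma acc parallel loop
    for ( int i = 0; i < n; i++ ) x[i] *= 2.0f;   // first device loop

    #pragma acc parallel loop
    for ( int i = 0; i < n; i++ ) x[i] += 1.0f;   // second device loop; no copy-back in between
}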

Further Reading:
CUDA Developer Zone: https://developer.nvidia.com/cuda-education-training
OpenCL standard: http://www.khronos.org/opencl
OpenACC standard: http://www.openacc.org
OpenMP v4.5: http://www.openmp.org/wp-content/uploads/openmp-4.5.pdf