GPU programming: Code optimization part 1
Sylvain Collange, Inria Rennes – Bretagne Atlantique
sylvain.collange@inria.fr

Outline
- Analytical performance modeling
- Optimizing host-device data transfers
- Optimizing for memory locality
  - Matrix multiplication example
- Optimizing memory access patterns
  - Memory access patterns
  - Global memory optimization

First-order performance model
Questions you should ask yourself before starting to code or optimize:
- Will my code run faster on the GPU?
- Is my existing code running as fast as it should?
- Is performance limited by computation or by memory bandwidth?
Pencil-and-paper calculations can (often) answer such questions.

Performance: metrics and definitions
Optimistic evaluation: upper bound on performance, assuming perfect overlap of computations and memory accesses.
- Memory accesses: bytes
  - Only external memory, not caches or registers
- Computations: flops
  - Only useful computations (usually floating-point), not address calculations or loop iterators
- Arithmetic intensity: flops / bytes = computations / memory accesses
  - Property of the code
- Arithmetic throughput: flops / s
  - Property of code + architecture

The roofline model
How much performance can I get for a given arithmetic intensity?
- Upper bound on arithmetic throughput, as a function of arithmetic intensity
- Property of the architecture
[Roofline plot: arithmetic throughput (Gflop/s) vs. arithmetic intensity (flop/byte); memory-bound region to the left of the ridge point, compute-bound region to the right]
S. Williams, A. Waterman, D. Patterson. Roofline: an insightful visual performance model for multicore architectures. Communications of the ACM, 2009.
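In formula form (a restatement of the model above, with I the arithmetic intensity of the code, B_peak the peak memory bandwidth and P_peak the peak compute throughput of the machine):

\[
\text{Attainable throughput} \;=\; \min\bigl(P_{\text{peak}},\; I \times B_{\text{peak}}\bigr)
\]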

Building the machine model
Compute or measure:
- Peak memory throughput. GTX 980: 224 GB/s
- Ideal arithmetic intensity = peak compute throughput / memory throughput.
  GTX 980: 4612 (Gflop/s) / 224 (GB/s) = 20.6 flop/B; at 4 B per float, that is about 82 flops per word (dimensionless).
  Beware of units: float = 4 B, double = 8 B!
[Roofline plot for the GTX 980: the compute ceiling is 4612 Gflop/s, reached at an arithmetic intensity of about 20.6 flop/B]
Achievable peaks may be lower than theoretical peaks: lower curves when adding realistic constraints.
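Plugging the GTX 980 figures quoted in these slides into the model gives the ridge point, i.e. the intensity at which the memory and compute ceilings meet:

\[
I^{*} \;=\; \frac{P_{\text{peak}}}{B_{\text{peak}}} \;=\; \frac{4612\ \text{Gflop/s}}{224\ \text{GB/s}} \;\approx\; 20.6\ \text{flop/B} \;\approx\; 82\ \text{flops per 4-byte word}
\]

Any kernel with a lower arithmetic intensity is memory-bound on this GPU.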

Using the model
- Compute the arithmetic intensity and measure the performance of the program
- Identify the bottleneck: memory or computation
- Take the optimization decision
[Roofline plot annotated with the measured performance point: under the memory slope, optimize memory accesses or reuse data to raise arithmetic intensity; under the compute ceiling, optimize computation]

Example: dot product
for i = 1 to n
  r += a[i] * b[i]
- How many computations?
- How many memory accesses?
- Arithmetic intensity?
- Compute-bound or memory-bound?
- How many Gflop/s on a GTX 980 GPU? With data in GPU memory? With data in CPU memory?
- How many Gflop/s on an i7 4790 CPU?
(GTX 980: 4612 Gflop/s, 224 GB/s. i7 4790: 460 Gflop/s, 25.6 GB/s. PCIe link: 16 GB/s)

Example: dot product
for i = 1 to n
  r += a[i] * b[i]
- How many computations? 2n flops
- How many memory accesses? 2n words
- Arithmetic intensity? 1 flop/word = 0.25 flop/B
- Compute-bound or memory-bound? Highly memory-bound
- How many Gflop/s on a GTX 980 GPU?
  - With data in GPU memory: 224 GB/s × 0.25 flop/B = 56 Gflop/s
  - With data in CPU memory: 16 GB/s × 0.25 flop/B = 4 Gflop/s
- How many Gflop/s on an i7 4790 CPU? 25.6 GB/s × 0.25 flop/B = 6.4 Gflop/s
Conclusion: don't bother porting to GPU!
(GTX 980: 4612 Gflop/s, 224 GB/s. i7 4790: 460 Gflop/s, 25.6 GB/s. PCIe link: 16 GB/s)
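To make the numbers concrete, here is a minimal dot-product kernel sketch (not from the slides; a block size of 256 threads is assumed, and block results are accumulated into *r with atomicAdd). Even a reasonably tuned kernel like this stays memory-bound, as the model predicts:

__global__ void dot(const float *a, const float *b, float *r, int n)
{
    __shared__ float partial[256];          // one partial sum per thread of the block
    int tid = threadIdx.x;
    float sum = 0.f;
    // Grid-stride loop: each thread accumulates its share of the products
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        sum += a[i] * b[i];
    partial[tid] = sum;
    __syncthreads();
    // Tree reduction in shared memory
    for (int s = blockDim.x / 2; s > 0; s /= 2) {
        if (tid < s) partial[tid] += partial[tid + s];
        __syncthreads();
    }
    if (tid == 0) atomicAdd(r, partial[0]); // one atomic per block into the final result
}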

Outline
- Analytical performance modeling
- Optimizing host-device data transfers
- Optimizing for memory locality
  - Matrix multiplication example
- Optimizing memory access patterns
  - Memory access patterns
  - Global memory optimization

Asynchronous transfers
Overlap CPU work with GPU work.
[Timeline: the CPU issues cudaMemcpy HtoD (synchronous: blocking), then kernel1<<<>>>, kernel2<<<>>>, kernel3<<<>>> (asynchronous), then cudaMemcpy DtoH, which waits; meanwhile the DMA engine performs the HtoD and DtoH copies and the GPU runs kernel1, kernel2, kernel3 in order]
Can we do better?
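A minimal sketch of this pattern (assumed host code, not from the slides; d_in, d_tmp, d_out, kernel1, kernel2 and do_cpu_work are placeholder names). Kernel launches return immediately, so the CPU is free until the blocking copy at the end:

cudaMemcpy(d_in, h_in, size, cudaMemcpyHostToDevice);    // blocking copy to the device
kernel1<<<grid, block>>>(d_in, d_tmp);                   // asynchronous: returns immediately
kernel2<<<grid, block>>>(d_tmp, d_out);                  // queued after kernel1
do_cpu_work();                                           // overlaps with GPU execution
cudaMemcpy(h_out, d_out, size, cudaMemcpyDeviceToHost);  // waits for the kernels, then copies back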

Streams: pipelining commands
Streams are known as command queues in OpenCL.
- Commands from the same stream run in order
- Commands from different streams run out of order
[Timeline: for each stream a, b, c the CPU issues cudaMemcpyAsync HtoD, kernel<<<,,,stream>>>, cudaMemcpyAsync DtoH, then cudaStreamSynchronize on each stream; the DMA engine pipelines the HtoD and DtoH copies of streams a, b, c while the GPU runs kernel a, kernel b, kernel c]

Streams: benefits
- Overlap CPU-GPU communication and computation: the Direct Memory Access (DMA) copy engine runs CPU-GPU memory transfers in the background
  - Requires page-locked memory
  - Some Tesla GPUs have 2 DMA engines: simultaneous send and receive
- Concurrent kernel execution
  - Start the next kernel before the previous kernel finishes
  - Reduces the loss due to load imbalance
Example: Kernel<<<5,,,a>>> and Kernel<<<4,,,b>>>
[Diagram: with serial kernel execution, blocks a0-a4 all run before blocks b0-b3; with concurrent kernel execution, blocks b0-b3 fill the multiprocessors left idle while a2-a4 finish]

Page-locked memory
- By default, allocated memory is pageable: it can be swapped out to disk, moved by the OS...
- DMA transfers are only safe on page-locked memory: fixed virtual-to-physical mapping
- cudaMemcpy on pageable memory needs an intermediate copy: slower, synchronous only
- cudaMallocHost allocates page-locked memory
  - Mandatory when using streams
- Warning: page-locked memory is a limited resource!
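A minimal sketch of pinned allocation (not from the slides; h_buf, d_buf, stream and n are placeholder names):

float *h_buf;
cudaMallocHost(&h_buf, n * sizeof(float));        // page-locked (pinned) host allocation
// ... fill h_buf ...
cudaMemcpyAsync(d_buf, h_buf, n * sizeof(float),
                cudaMemcpyHostToDevice, stream);  // DMA transfer, overlaps with other work
// ...
cudaFreeHost(h_buf);                              // release the pinned buffer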

Streams: example
Send data, execute, receive data:

cudaStream_t stream[2];
for (int i = 0; i < 2; ++i)
    cudaStreamCreate(&stream[i]);
float* hostPtr;
cudaMallocHost(&hostPtr, 2 * size);
for (int i = 0; i < 2; ++i) {
    cudaMemcpyAsync(inputDevPtr + i * size, hostPtr + i * size,
                    size, cudaMemcpyHostToDevice, stream[i]);
    MyKernel<<<100, 512, 0, stream[i]>>>(outputDevPtr + i * size, inputDevPtr + i * size, size);
    cudaMemcpyAsync(hostPtr + i * size, outputDevPtr + i * size,
                    size, cudaMemcpyDeviceToHost, stream[i]);
}
for (int i = 0; i < 2; ++i)
    cudaStreamDestroy(stream[i]);

Events
Schedule synchronization of one stream with another: events specify dependencies between tasks.

cudaEvent_t e;
cudaEventCreate(&e);
kernel1<<<,,,a>>>();
cudaEventRecord(e, a);
cudaStreamWaitEvent(b, e);
kernel2<<<,,,b>>>();
cudaEventDestroy(e);

[Diagram: event e links the end of kernel1 in stream a to the start of kernel2 in stream b]

Events can also measure timing:

cudaEventRecord(start, 0);
kernel<<<>>>();
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);

[Diagram: events start and stop bracket the kernel on the timeline]

Scheduling data dependency graphs
With streams and events, we can express task dependency graphs.
- Equivalent to threads and events (e.g. semaphores) on the CPU
Example: 2 GPU streams (a and b) and 1 CPU thread.
Where should we place events?
[Dependency graph: kernel1 precedes kernel3 and kernel5; kernel2 precedes kernel3; kernel3 precedes the CPU code; the CPU code and kernel5 precede kernel4]

Scheduling data dependency graphs

kernel1<<<,,,a>>>();
cudaEventRecord(e1, a);
kernel2<<<,,,b>>>();
cudaStreamWaitEvent(b, e1);
kernel3<<<,,,b>>>();
cudaEventRecord(e2, b);
kernel5<<<,,,a>>>();
cudaEventRecord(e3, a);
cudaEventSynchronize(e2);
// CPU code
cudaStreamWaitEvent(b, e3);
kernel4<<<,,,b>>>();

[Diagram: the same task graph annotated with events e1, e2, e3 and the corresponding wait points]

Outline
- Analytical performance modeling
- Optimizing host-device data transfers
- Optimizing for memory locality
  - Matrix multiplication example
- Optimizing memory access patterns
  - Memory access patterns
  - Global memory optimization

Classic example: matrix multiplication
Naive algorithm:
for i = 0 to n-1
  for j = 0 to n-1
    for k = 0 to n-1
      C[i,j] += A[i,k]*B[k,j]
[Figure: C = A × B; C[i,j] is the dot product of row i of A and column j of B, traversed along k]
Arithmetic intensity: 1 flop per memory access (1:1) :(
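A direct CUDA translation of the naive algorithm (a sketch, not from the slides; row-major storage and one thread per output element assumed). Every thread streams a full row of A and a full column of B from global memory, which is what keeps the arithmetic intensity at 1:1:

__global__ void matmul_naive(const float *A, const float *B, float *C, int n)
{
    int i = blockIdx.y * blockDim.y + threadIdx.y;  // row of C
    int j = blockIdx.x * blockDim.x + threadIdx.x;  // column of C
    if (i >= n || j >= n) return;
    float acc = 0.f;
    for (int k = 0; k < n; ++k)
        acc += A[i * n + k] * B[k * n + j];         // 2 flops for 2 global loads
    C[i * n + j] = acc;
}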

Reusing inputs
Move the loop on k up:
for k = 0 to n-1
  for i = 0 to n-1
    for j = 0 to n-1
      C[i,j] += A[i,k]*B[k,j]
This enables data reuse on the inputs, but we would need to keep all of matrix C in registers!

With tiling
Block the loops on i and j:
for i = 0 to n-1 step 16
  for j = 0 to n-1 step 16
    for k = 0 to n-1
      for i2 = i to i+15
        for j2 = j to j+15
          C[i2,j2] += A[i2,k]*B[k,j2]
For one block: product between a horizontal panel of A and a vertical panel of B, accumulated into a 16×16 block of C of constant size.

With more tiling
Block the loop on k as well:
for i = 0 to n-1 step 16
  for j = 0 to n-1 step 16
    for k = 0 to n-1 step 16
      for k2 = k to k+15
        for i2 = i to i+15
          for j2 = j to j+15
            C[i2,j2] += A[i2,k2]*B[k2,j2]
Each innermost step now works on constant-size 16×16 blocks of A, B and C.

Pre-loading data
for i = 0 to n-1 step 16
  for j = 0 to n-1 step 16
    c = {0}
    for k = 0 to n-1 step 16
      a = A[i..i+15,k..k+15]      // load submatrix a
      b = B[k..k+15,j..j+15]      // load submatrix b
      for k2 = 0 to 15            // multiply submatrices: c += a*b
        for i2 = 0 to 15
          for j2 = 0 to 15
            c[i2,j2] += a[i2,k2]*b[k2,j2]
    C[i..i+15,j..j+15] = c        // store submatrix c

Breaking into two levels
Run the loops on i, j, i2, j2 in parallel (for // denotes a parallel loop):
for // i = 0 to n-1 step 16       (level 2: blocks)
  for // j = 0 to n-1 step 16
    c = {0}
    for k = 0 to n-1 step 16
      a = A[i..i+15,k..k+15]
      b = B[k..k+15,j..j+15]
      for k2 = 0 to 15
        for // i2 = 0 to 15       (level 1: threads)
          for // j2 = 0 to 15
            c[i2,j2] += a[i2,k2]*b[k2,j2]
    C[i..i+15,j..j+15] = c
Let's focus on threads.

Level 1: SIMD (PRAM-style) version
Each processor has an ID (x,y); the loops on i2, j2 are implicit.
c[x,y] = 0
for k = 0 to n-1 step 16
  a[x,y] = A[i+x,k+y]            // load submatrices a and b
  b[x,y] = B[k+x,j+y]
  for k2 = 0 to 15
    c[x,y] += a[x,k2]*b[k2,y]    // multiply submatrices: c += a*b
C[i+x,j+y] = c[x,y]              // store submatrix c
Private writes: no conflict. Reads from data written by other processors.
How to translate to SPMD (BSP-style)?

SPMD version
Place synchronization barriers:
c[x,y] = 0
for k = 0 to n-1 step 16
  a[x,y] = A[i+x,k+y]
  b[x,y] = B[k+x,j+y]
  Barrier
  for k2 = 0 to 15
    c[x,y] += a[x,k2]*b[k2,y]
  Barrier
C[i+x,j+y] = c[x,y]
Why do we need the second barrier?

Data allocation
3 memory spaces: global, shared, local.
Where should we put A, B, C, a, b, c?
c[x,y] = 0
for k = 0 to n-1 step 16
  a[x,y] = A[i+x,k+y]
  b[x,y] = B[k+x,j+y]
  Barrier
  for k2 = 0 to 15
    c[x,y] += a[x,k2]*b[k2,y]
  Barrier
C[i+x,j+y] = c[x,y]
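One common mapping, given as a sketch under assumed placements (not the slides' answer): A, B, C in global memory, the tiles a and b in shared memory, the accumulator c in a register, with __syncthreads() as the barrier; 16×16 thread blocks and n a multiple of 16 are assumed:

__global__ void matmul_tiled(const float *A, const float *B, float *C, int n)
{
    __shared__ float a[16][16];                    // tile of A, shared within the block
    __shared__ float b[16][16];                    // tile of B
    int x = threadIdx.y, y = threadIdx.x;          // processor ID (x,y) inside the block
    int i = blockIdx.y * 16, j = blockIdx.x * 16;  // origin of the C block
    float c = 0.f;                                 // private accumulator
    for (int k = 0; k < n; k += 16) {
        a[x][y] = A[(i + x) * n + (k + y)];
        b[x][y] = B[(k + x) * n + (j + y)];
        __syncthreads();                           // first barrier: tiles fully loaded
        for (int k2 = 0; k2 < 16; ++k2)
            c += a[x][k2] * b[k2][y];
        __syncthreads();                           // second barrier: tiles no longer in use
    }
    C[(i + x) * n + (j + y)] = c;
}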