GPGPU: Multi-GPU Programming


GPGPU: Multi-GPU Programming, Fall 2012

HW4

    __global__ void cuda_transpose(const float *ad, const int n, float *atd) {
        int i = threadIdx.y + blockIdx.y*blockDim.y;
        int j = threadIdx.x + blockIdx.x*blockDim.x;
        if (i < n && j < n)
            atd[i*n + j] = ad[j*n + i];   /* naive transpose: strided reads from ad */
        return;
    }

HW4

    __global__ void cuda_transpose_shared(const float *ad, const int n, float *atd) {
        extern __shared__ float atile[];
        int loci = threadIdx.y, locj = threadIdx.x;
        int bi = blockIdx.y, bj = blockIdx.x;
        int blocksize = blockDim.x;
        int i = loci + bi*blocksize;
        int j = locj + bj*blocksize;

        /* read in shared data */
        if (i < n && j < n) {
            atile[loci*blocksize + locj] = ad[i*n + j];
        }
        __syncthreads();

        /* write the transposed tile back out */
        if (i < n && j < n) {
            atd[(loci+bj*blocksize)*n + (locj+bi*blocksize)] = atile[locj*blocksize + loci];
        }
        return;
    }

HW4 (same kernel as above)
Advantages: coalesced reads and writes. (What is the optimum blocksize for coalescing? Can get 100% memory efficiency.)
Downsides: __syncthreads()

HW4
32x32 blocks - perfect coalescing, but you have to wait for 32 memory transactions to complete before proceeding.
At least 132 cycles for scheduling.
Can get around this with more blocks... but at 4 kB of shared memory per block, past 12 blocks per SM (= 168 blocks, a 400^2 matrix) occupancy suffers.

HW4
What can we do? Want to either reduce each block's shared-memory footprint, or have fewer memory transactions to wait on before __syncthreads(), without impacting coalescing.

HW4 Narrower blocks?

HW4 Narrower blocks? Means outputting the transpose is inefficient

HW4 Multiple transactions per thread...

HW4 (kernel fragment)

    if (j < n) {
        for (int iter = 0; iter < multi; iter++) {
            int inrow   = multi*i + iter;
            int tilerow = multi*loci + iter;
            if (inrow < n) {
                /* each thread loads `multi` rows into the padded shared tile */
                atile[tilerow*(blockDim.x+1) + locj] = ad[inrow*n + j];
            }
        }
    }
    __syncthreads();

    /* write the transposed elements back out to global memory */
    for (int iter = 0; iter < multi; iter++) {
        int outrow = blockDim.x*bj + loci/multi;
        int outcol = blockDim.y*bi*multi + (loci%multi)*blockDim.y + locj;
        if (outrow < n && outcol < n) {
            atd[outrow*n + outcol] = atile[(multi*locj+iter)*(blockDim.x+1) + loci];
        }
    }
    return;

HW4

arc01-$ ./transpose --matsize=384 --nblocks=12 --multi=4
Matrix size = 384, Number of blocks = 12, Blocksize = 32.
CPU time = 1.161 millisec.
GPU time = 0.529 millisec (global mem).
CUDA global mem and CPU results differ by 0.000000
GPU time = 0.571 millisec (shared mem).
CUDA shared mem and CPU results differ by 0.000000
GPU time = 0.583 millisec (shared mem).
CUDA shared mem - bank and CPU results differ by 0.000000
GPU time = 0.499 millisec (multiple transactions).
CUDA multiple transactions and CPU results differ by 0.000000

arc01-$ ./transpose --matsize=1024 --nblocks=32 --multi=4
Matrix size = 1024, Number of blocks = 32, Blocksize = 32.
CPU time = 11.133 millisec.
GPU time = 2.606 millisec (global mem).
CUDA global mem and CPU results differ by 0.000000
GPU time = 2.841 millisec (shared mem).
CUDA shared mem and CPU results differ by 0.000000
GPU time = 2.815 millisec (shared mem).
CUDA shared mem - bank and CPU results differ by 0.000000
GPU time = 2.333 millisec (multiple transactions).

Optimizing around Tradeoffs
Want memory accesses that are 32 floats wide.
Want many blocks (latency hiding).
Want few blocks (shared mem/node: occupancy).
Not clear a priori what the best choices are; you have to experiment (and different hardware can shift the balance).
For sticky problems, people have often already found good solutions: http://www.cs.colostate.edu/~cs675/MatrixTranspose.pdf

Multiple GPUs
Might want to use multiple GPUs in a node for:
More cores (if you have enough work; but watch occupancy!)
More memory (~6 GB/card isn't a tonne)
More memory bandwidth between global memory and the cores

Multiple GPUs
If you have an existing parallel program, this is relatively easy:
Each thread/process drives its own GPU.
They communicate amongst themselves as before.
MPI programs lend themselves particularly well to this.

Multiple GPUs
Multi-GPU machines are becoming more common.
We haven't talked about CPU parallelism in this class.
Here we'll talk about driving multiple GPUs from a single CPU program.

Very simple since CUDA 4.0
All you really need to know is:

    int ngpus;
    cudaGetDeviceCount( &ngpus );

and

    cudaSetDevice(gpunum);
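As a quick illustration (a minimal sketch, not code from the slides; the kernel do_work and the problem size are placeholders), looping over the visible devices looks like this:

    #include <cuda_runtime.h>

    __global__ void do_work(float *x, int n) {
        int i = threadIdx.x + blockIdx.x*blockDim.x;
        if (i < n) x[i] *= 2.0f;               /* stand-in for real per-GPU work */
    }

    int main(void) {
        const int n = 1 << 20;
        int ngpus;
        cudaGetDeviceCount(&ngpus);            /* how many GPUs can we see? */
        if (ngpus > 16) ngpus = 16;

        float *xd[16];
        for (int dev = 0; dev < ngpus; dev++) {
            cudaSetDevice(dev);                /* subsequent CUDA calls target this GPU */
            cudaMalloc(&xd[dev], n*sizeof(float));
            cudaMemset(xd[dev], 0, n*sizeof(float));
            do_work<<<(n+255)/256, 256>>>(xd[dev], n);   /* launches on the current device */
        }
        for (int dev = 0; dev < ngpus; dev++) {
            cudaSetDevice(dev);
            cudaDeviceSynchronize();           /* wait for each GPU to finish */
            cudaFree(xd[dev]);
        }
        return 0;
    }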

Very simple since CUDA 4.0
Consider our simple GPU C = A x B matrix multiply.
Note that we can recast this as a block-matrix multiply, with C and A both broken down by rows.
Each chunk of A and C can be assigned to a different GPU.
[diagram: each row-block of C = the corresponding row-block of A * all of B]

matmult.cu

matmult.cu

matmult-multi.cu

matmult-multi.cu

matmult-multi.cu

matmult-multi.cu

matmult-multi.cu
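(The code on these matmult-multi.cu slides isn't transcribed above. As a rough sketch only, under assumptions and not the course's actual file: each GPU gets a contiguous block of rows of A and C plus a full copy of B, and runs the same sgemm kernel on its piece. The host arrays a, b, c, the launch configuration, and cuda_sgemm_simple are assumed to come from the single-GPU matmult.cu.)

    float *multi_ad[16], *multi_bd[16], *multi_cd[16];
    int ngpus;
    cudaGetDeviceCount(&ngpus);
    int locnrows = n / ngpus;              /* rows of A and C per GPU (assume n divides evenly) */

    for (int dev = 0; dev < ngpus; dev++) {
        cudaSetDevice(dev);
        cudaMalloc(&multi_ad[dev], locnrows*n*sizeof(float));
        cudaMalloc(&multi_bd[dev], n*n*sizeof(float));
        cudaMalloc(&multi_cd[dev], locnrows*n*sizeof(float));

        /* each GPU gets its own block of rows of A, and all of B */
        cudaMemcpy(multi_ad[dev], &a[dev*locnrows*n], locnrows*n*sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(multi_bd[dev], b, n*n*sizeof(float), cudaMemcpyHostToDevice);

        cuda_sgemm_simple<<<gridsize, blocksize>>>(multi_ad[dev], multi_bd[dev],
                                                   locnrows, n, n, multi_cd[dev]);
    }

    /* gather each GPU's block of rows of C; these blocking copies also act as a sync */
    for (int dev = 0; dev < ngpus; dev++) {
        cudaSetDevice(dev);
        cudaMemcpy(&c[dev*locnrows*n], multi_cd[dev], locnrows*n*sizeof(float), cudaMemcpyDeviceToHost);
    }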

matmult-stream.cu
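(Again, the file itself isn't transcribed here. A minimal sketch of what an asynchronous variant could look like, assuming a, b, c were allocated as pinned memory with cudaHostAlloc so that cudaMemcpyAsync can overlap; this is an assumption about the structure, not the actual matmult-stream.cu.)

    cudaStream_t stream[16];               /* one stream per GPU */

    for (int dev = 0; dev < ngpus; dev++) {
        cudaSetDevice(dev);
        cudaStreamCreate(&stream[dev]);

        /* enqueue copies and the kernel; these calls return immediately,
           so all GPUs proceed concurrently */
        cudaMemcpyAsync(multi_ad[dev], &a[dev*locnrows*n], locnrows*n*sizeof(float),
                        cudaMemcpyHostToDevice, stream[dev]);
        cudaMemcpyAsync(multi_bd[dev], b, n*n*sizeof(float),
                        cudaMemcpyHostToDevice, stream[dev]);
        cuda_sgemm_simple<<<gridsize, blocksize, 0, stream[dev]>>>(multi_ad[dev], multi_bd[dev],
                                                                   locnrows, n, n, multi_cd[dev]);
        cudaMemcpyAsync(&c[dev*locnrows*n], multi_cd[dev], locnrows*n*sizeof(float),
                        cudaMemcpyDeviceToHost, stream[dev]);
    }

    for (int dev = 0; dev < ngpus; dev++) {    /* only now wait for everything to finish */
        cudaSetDevice(dev);
        cudaStreamSynchronize(stream[dev]);
        cudaStreamDestroy(stream[dev]);
    }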

Blocking vs Async

arc01-$ ./matmult-multi --matsize=512 --nblocks=32
Matrix size = 512, Number of blocks = 32.
CPU time = 1087.42 millisec.
GPU time = 15.894 millisec (1 gpu).
Single GPU and CPU results differ by 0.000000
GPU time = 9.163 millisec (2 gpu).
2-GPU and CPU results differ by 0.000000

arc01-$ ./matmult-stream --matsize=512 --nblocks=32
Matrix size = 512, Number of blocks = 32.
CPU time = 1087.06 millisec.
GPU time = 14.755 millisec (1 gpu).
Single GPU and CPU results differ by 0.000000
GPU time = 7.925 millisec (2 gpu).
2-GPU and CPU results differ by 0.000000

Memory exchange
In general, you don't just fire off one kernel and you're done: iterative methods, timesteps, etc.
Dealing with multiple GPUs introduces more complex CPU/GPU memory issues:
More bookkeeping.
Efficiency (waiting for the CPU to copy data out of one GPU and into another).
CUDA 4 and 5 introduce nicer ways to deal with this.

Memory exchange
Options:
Explicit CPU copying
Zero-copy from host (less bookkeeping, maybe lower efficiency)
Peer-to-peer transfers
Peer-to-peer memory access

Zero (explicit) copy
The GPU can see a region of host memory.
As with async memory copy, and for the same reasons, it must be pinned memory, allocated with cudaHostAlloc (or registered).
Access goes through the slow PCIe bus.
Host/global is like global/shared: fine if you're only going to access it once, slow if you need to access it repeatedly.

Zero Copy
Let's use this for (say) B...

    cudaHostAlloc(&a, n*n*sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc(&b, n*n*sizeof(float), cudaHostAllocMapped | cudaHostAllocPortable);
    cudaHostAlloc(&c, n*n*sizeof(float), cudaHostAllocDefault);
    cudaHostAlloc(&ccuda, n*n*sizeof(float), cudaHostAllocDefault);

    for (int dev=0; dev<ngpus; dev++) {
        cudaSetDevice(dev);
        ...
        CHK_CUDA( cudaMalloc(&(multi_ad[dev]), locnrows*n*sizeof(float)) );
        /* no copy of B to the device: just get a device pointer to the mapped host allocation */
        CHK_CUDA( cudaHostGetDevicePointer(&(multi_bd[dev]), b, 0) );
        CHK_CUDA( cudaMalloc(&(multi_cd[dev]), locnrows*n*sizeof(float)) );
    }
    ...
    cuda_sgemm_simple<<<gridsize, blocksize>>>(multi_ad[dev], multi_bd[dev], locnrows, n, n, multi_cd[dev]);

Zero Copy
Slower, but less explicit copying and less memory usage on the GPU.

arc01-$ ./matmult-zero --matsize=512 --nblocks=128
Matrix size = 512, Number of blocks = 128.
CPU time = 1086.17 millisec.
GPU time = 78.694 millisec (1 gpu).
Single GPU and CPU results differ by 0.000000
GPU time = 82.103 millisec (2 gpu).
2-GPU and CPU results differ by 0.000000

Unified Virtual Addressing
Fermi-class Tesla cards and later; CUDA 4.0 and later.
From the programmer's point of view, there is one global address space.
Can just use cudaMemcpy for everything.
Mainly a convenience.
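(As a small illustration, not from the slides: with UVA the runtime can tell which device a pointer belongs to, so a copy between buffers on two different GPUs can be written with cudaMemcpyDefault. The pointer names here are placeholders.)

    /* d_src was cudaMalloc'ed on GPU 0, d_dst on GPU 1; with UVA the runtime
       works out the source and destination locations from the pointers themselves */
    cudaMemcpy(d_dst, d_src, nbytes, cudaMemcpyDefault);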

Direct Peer Copies
With UVA, we can enable copies directly from GPU to GPU.
First enable peer access for each peer pair:

    cudaDeviceEnablePeerAccess(int peerDevice, unsigned int flags)

flags is always zero (for now).
Then give the appropriate device pointer.
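(A minimal sketch, with placeholder pointer names, of enabling peer access between GPUs 0 and 1 and then copying between them.)

    int can01, can10;
    cudaDeviceCanAccessPeer(&can01, 0, 1);     /* is peer access possible at all? */
    cudaDeviceCanAccessPeer(&can10, 1, 0);

    if (can01 && can10) {
        cudaSetDevice(0);
        cudaDeviceEnablePeerAccess(1, 0);      /* GPU 0 may now access GPU 1's memory */
        cudaSetDevice(1);
        cudaDeviceEnablePeerAccess(0, 0);      /* and vice versa */
    }

    /* with UVA + peer access enabled, a plain cudaMemcpy between device pointers
       goes directly GPU-to-GPU without staging through host memory */
    cudaMemcpy(d_on_gpu1, d_on_gpu0, nbytes, cudaMemcpyDefault);

    /* or spell out the devices explicitly: */
    cudaMemcpyPeer(d_on_gpu1, 1, d_on_gpu0, 0, nbytes);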

Homework
Goal: repeatedly smooth the large image from the previous homework (models a PDE calculation) on multiple GPUs.
Starting from the smoothimage HW, using just global memory for the image, loop over the memcopy/kernel/memcopy with 1 GPU.
Then split over multiple GPUs, as with the simple matmult.
Then enable peer copying, and have the kernels start with a copy from their neighbour. How to synchronize?
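(One possible skeleton for the iteration loop, with hypothetical names (band, halo_top, halo_bottom, band_top, band_bottom, smooth_kernel, rowbytes); this is a sketch of the idea, not the assignment's solution. Synchronizing each device before the halo exchange is one answer to the question above.)

    for (int iter = 0; iter < niters; iter++) {

        /* make sure the previous smoothing pass has finished on every GPU
           before its boundary rows are read by a neighbour */
        for (int dev = 0; dev < ngpus; dev++) {
            cudaSetDevice(dev);
            cudaDeviceSynchronize();
        }

        /* exchange halo rows directly between neighbouring GPUs (peer copies) */
        for (int dev = 0; dev < ngpus-1; dev++) {
            cudaMemcpyPeer(halo_top[dev+1],  dev+1, band_bottom[dev], dev,   rowbytes);
            cudaMemcpyPeer(halo_bottom[dev], dev,   band_top[dev+1],  dev+1, rowbytes);
        }

        /* smooth each GPU's band of rows */
        for (int dev = 0; dev < ngpus; dev++) {
            cudaSetDevice(dev);
            smooth_kernel<<<gridsize, blocksize>>>(band[dev], locnrows, ncols);
        }
    }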