Performance Diagnosis for Hybrid CPU/GPU Environments


Michael M. Smith and Karen L. Karavanic, Computer Science Department, Portland State University. Work in progress.

Introduction
The shift to GPUs: single-core processors have hit a performance wall due to heat dissipation and power requirements, pushing designs toward multicore and manycore. In the June 2011 Top500 list, 17 of the 500 systems are GPU-accelerated, including #2 (Tianhe-1A), #4 (Nebulae), and #5 (Tsubame 2.0). The hybrid GPU programming model is different, and programmers need tools that provide a unified view of an application's performance.

Performance Diagnosis of Hybrid Applications: Project Overview
A benchmark suite for evaluating performance tools (under development). High-level diagnostic metrics that convey the most important performance trends (intended to work for both CUDA and OpenCL). Visualizations that convey the most important performance trends. Funding for Curriculum Innovation in Multicore Computing; new PSU course: General Purpose GPU Computing.

Graphics Processing Unit (GPU)
Scheduling unit: a warp = 32 threads. Maximum per multiprocessor: 8 blocks, 32 warps. Source: NVIDIA CUDA Programming Guide, http://developer.download.nvidia.com/compute/cuda/2_0/docs/nvidia_cuda_programming_guide_2.0.pdf

GPU Programming
(Slide images: Intel Xeon processor, http://www.geek.com/wp-content/uploads/2009/03/xeon-processor.jpg; NVIDIA Tesla C1060, http://cc.doc.ic.ac.uk/projects/prj_axel/images/tesla_c1060.png)
Compute capability >= 2.0: streams. Compute capability >= 2.0: multiple concurrent kernels.

GPU Programming With CUDA

Serial version:

int main(int argc, char** argv) {
    int thread_id = 0;   /* placeholder; 'thread_id' is not defined in the slide's listing */
    int a_host;
    int b_host;
    int answer;

    a_host = thread_id;
    b_host = 3;

    answer = a_host + b_host;

    return 0;
}

CUDA version:

/* define kernel */
__global__ void add(int *a, int *b, int *answer_dev) {
    *answer_dev = *a + *b;
}

int main(int argc, char** argv) {
    /* serial code */
    int thread_id = 0;   /* placeholder; 'thread_id' is not defined in the slide's listing */
    int a_host;
    int *a_dev;
    int b_host;
    int *b_dev;
    int answer;
    int *answer_dev;

    a_host = thread_id;
    b_host = 3;

    /* allocate memory */
    cudaMalloc((void **) &answer_dev, sizeof(int));
    cudaMalloc((void **) &a_dev, sizeof(int));
    cudaMalloc((void **) &b_dev, sizeof(int));

    /* move data */
    cudaMemcpy(a_dev, &a_host, sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(b_dev, &b_host, sizeof(int), cudaMemcpyHostToDevice);

    /* launch kernel */
    dim3 dimGrid(1);
    dim3 dimBlock(1, 32);
    add<<<dimGrid, dimBlock, 0>>>(a_dev, b_dev, answer_dev);

    cudaThreadSynchronize();

    /* move data */
    cudaMemcpy(&answer, answer_dev, sizeof(int), cudaMemcpyDeviceToHost);

    /* serial code */
    return 0;
}

A Simple Performance Tool Benchmark Suite
Inspired by the APART Test Suite. Goals: easily understandable behavior; used to check correctness of diagnosis. We started with an existing single-device implementation of matrix multiplication, based on Programming Massively Parallel Processors: A Hands-on Approach by Kirk and Hwu, and extended it to support multiple GPU devices, overlapping CPU and GPU computation, pinned memory, and asynchronous memory transfers.

Naive Kernel
P = M x N [4x4]; d: data in global memory; ds: data in shared memory. Matrix multiplication is performed by computing the dot product of rows of M and columns of N. The kernel is configured to use a single block of threads, so the matrix size is limited by the maximum number of threads in a block (512 on this hardware). In our case the limit was 16x16: 16x16 = 256 threads is the largest power-of-two square that fits within one block.
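To make that behavior concrete, here is a minimal sketch of such a single-block naive kernel (the names Md, Nd, Pd, and Width are illustrative, not necessarily the benchmark's own identifiers):

// Single-block naive matrix multiplication: one thread per output element.
// The single block is what limits Width to 16 on this hardware.
__global__ void matrixMulNaive(float *Md, float *Nd, float *Pd, int Width)
{
    int row = threadIdx.y;
    int col = threadIdx.x;

    float sum = 0.0f;
    for (int k = 0; k < Width; ++k)
        sum += Md[row * Width + k] * Nd[k * Width + col];

    Pd[row * Width + col] = sum;
}

// Launched with a single block of Width x Width threads, e.g.:
//   dim3 dimBlock(Width, Width);
//   matrixMulNaive<<<1, dimBlock>>>(Md, Nd, Pd, Width);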

Tiled Kernel
Breaks the Pd matrix up into tiles, with each tile the same size as a thread block. P = M x N [4x4]; d: data in global memory; ds: data in shared memory.

Tiled Plus Shared Memory Kernel
The tiled kernel accesses data in global memory multiple times. The tiled plus shared memory kernel uses shared memory to reduce global memory traffic: the computation is broken into phases, and in each phase the threads cooperatively load tile elements into shared memory. P = M x N [4x4]; d: data in global memory; ds: data in shared memory.

Phase One / Phase Two
[Figures illustrating the two phases of the tiled plus shared memory kernel on the 4x4 example.] Global memory: 4 GB; shared memory per block: 16 KB. P = M x N [4x4]; d: data in global memory; ds: data in shared memory.
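For reference, a minimal sketch of a tiled plus shared memory kernel of the kind described above, in the Kirk and Hwu style (TILE_WIDTH, Md/Mds, Nd/Nds, and Pd are illustrative names; the sketch assumes Width is a multiple of TILE_WIDTH):

#define TILE_WIDTH 16

__global__ void matrixMulTiledShared(float *Md, float *Nd, float *Pd, int Width)
{
    // Per-block shared memory tiles of M and N (the "ds" data).
    __shared__ float Mds[TILE_WIDTH][TILE_WIDTH];
    __shared__ float Nds[TILE_WIDTH][TILE_WIDTH];

    int row = blockIdx.y * TILE_WIDTH + threadIdx.y;
    int col = blockIdx.x * TILE_WIDTH + threadIdx.x;

    float sum = 0.0f;

    // Loop over phases: each phase cooperatively loads one tile of M and
    // one tile of N into shared memory, then accumulates the partial dot
    // product from shared memory instead of global memory.
    for (int ph = 0; ph < Width / TILE_WIDTH; ++ph) {
        Mds[threadIdx.y][threadIdx.x] = Md[row * Width + ph * TILE_WIDTH + threadIdx.x];
        Nds[threadIdx.y][threadIdx.x] = Nd[(ph * TILE_WIDTH + threadIdx.y) * Width + col];
        __syncthreads();

        for (int k = 0; k < TILE_WIDTH; ++k)
            sum += Mds[threadIdx.y][k] * Nds[k][threadIdx.x];
        __syncthreads();
    }

    Pd[row * Width + col] = sum;
}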

Multiple Device Support
The CUDA programming model requires at least one host thread per device, so we divide the work among multiple host threads. P = M x N [4x4]; d: data in global memory; ds: data in shared memory.
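A minimal sketch of the one-host-thread-per-device pattern, using pthreads and cudaSetDevice (deviceWork and numDevices are illustrative names; the per-device allocation, copies, and launch are elided):

#include <pthread.h>
#include <cuda_runtime.h>

// Each host thread binds to one GPU and runs its share of the multiply.
void *deviceWork(void *arg)
{
    int dev = (int)(size_t)arg;
    cudaSetDevice(dev);              // bind this host thread to device 'dev'
    // ... allocate, copy, and launch the kernel on this device's slice of P ...
    return NULL;
}

int main(void)
{
    int numDevices = 0;
    cudaGetDeviceCount(&numDevices);

    pthread_t threads[8];            // sketch assumes at most 8 devices
    for (int d = 0; d < numDevices; ++d)
        pthread_create(&threads[d], NULL, deviceWork, (void *)(size_t)d);
    for (int d = 0; d < numDevices; ++d)
        pthread_join(threads[d], NULL);
    return 0;
}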

Pinned Memory Optimizations
Asynchronous memory transfers require pinned (page-locked) host memory and are non-blocking with respect to the host.
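A minimal sketch of the pinned-memory, asynchronous-transfer pattern using cudaMallocHost, cudaMemcpyAsync, and a stream (buffer names and sizes are illustrative):

#include <cuda_runtime.h>

int main(void)
{
    cudaStream_t stream;
    cudaStreamCreate(&stream);

    float *h_data;                        // pinned (page-locked) host buffer
    size_t bytes = 1024 * sizeof(float);  // illustrative size
    cudaMallocHost((void **)&h_data, bytes);

    float *d_data;
    cudaMalloc((void **)&d_data, bytes);

    // Asynchronous copy: returns immediately, so the host can keep working
    // (e.g., on its own share of the multiplication) while the DMA proceeds.
    cudaMemcpyAsync(d_data, h_data, bytes, cudaMemcpyHostToDevice, stream);

    // ... launch a kernel into the same stream, do CPU work here ...

    cudaStreamSynchronize(stream);        // wait for the work queued in the stream

    cudaFree(d_data);
    cudaFreeHost(h_data);
    cudaStreamDestroy(stream);
    return 0;
}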

Hybrid Matrix Multiplication
Overlaps CPU and GPU computation. The work is divided the same way as in the multiple-device version: one host thread performs its share of the computation on the device, while the other thread performs its share on the host.

Device Efficiency Metric
Goal: indicate whether the overhead of moving data to the device was justified. Theoretical definition: kernel compute time / kernel wall clock time. Kernel device time: the time the kernel executes on the device. Device time: the time that kernel and data-movement functions execute on the device. Minimum value: 0; maximum value: 1.

DE = kernel device time / device time

Device Utilization
Goal: show how much of the device's available computation is used. Device time: the time that kernel and data-movement functions execute on the device. Wall clock time: the time the application ran. Minimum value: 0; maximum value: 1.

DU = device time / wall clock time
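Once the per-device times have been gathered, both metrics reduce to simple ratios; a minimal sketch (the struct, field names, and numbers are illustrative, not CudaProf output or measured results):

#include <stdio.h>

/* Times gathered per device, all in the same units (e.g., seconds). */
struct DeviceTimes {
    double kernel_device_time;   /* time kernels execute on the device            */
    double device_time;          /* time kernels plus data movement on the device */
    double wall_clock_time;      /* application run time (external time utility)  */
};

int main(void)
{
    struct DeviceTimes t = { 2.0, 5.0, 10.0 };   /* illustrative values */

    double de = t.kernel_device_time / t.device_time;   /* DE, between 0 and 1 */
    double du = t.device_time / t.wall_clock_time;      /* DU, between 0 and 1 */

    printf("DE = %.2f, DU = %.2f\n", de, du);           /* prints DE = 0.40, DU = 0.50 */
    return 0;
}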

Experiments
Goals: use the matrix multiplication benchmark suite to study the behavior of the derived metrics, and ask what the derived metrics tell us about a real scientific application.

Experimental Design - Instrumentation
CudaProf configured to gather: kernel device time (gputime for kernel functions) and device time (gputime for kernel and data-movement functions); wall clock time measured with the external time utility. Hardware: Tesla C1060 (1 GPU) and S1070 (4 GPUs), compute capability 1.3.

Experimental Design - Matrix Multiplication
Kernels: naive; tiled; tiled plus shared memory. Memory optimizations: paged memory; pinned memory; pinned memory with asynchronous memory transfers. Configurations: multiple devices; overlapped CPU and GPU computation. Matrix sizes: 16x16 up to 16,384x16,384.

Matrix Multiplication Case Study - Device Efficiency
[Figure: DE = kernel device time / device time, plotted for matrix sizes 16x16 through 16384x16384.] Matrix multiplication configured with: tiled plus shared memory kernel, pinned memory, asynchronous memory transfers, single device.

Matrix Multiplication Case Study - Device Efficiency
[Figure: DE per device (dev0, dev1) for matrix sizes 32x32 through 16384x16384.] Matrix multiplication configured with: tiled plus shared memory kernel, pinned memory, asynchronous memory transfers, two devices.

Matrix Multiplication Case Study - Device Efficiency
[Figure: DE for matrix sizes 32x32 through 4096x4096.] Matrix multiplication configured with: tiled plus shared memory kernel, pinned memory, asynchronous memory transfers, overlapping CPU and GPU computation.

Matrix Multiplication Case Study
[Figures: DE and DU for matrix sizes 512x512 through 16384x16384, for (left) matrix multiplication using the tiled kernel, paged memory, single device, and (right) matrix multiplication using the tiled plus shared memory kernel, paged memory, single device.]

Matrix Multiplication Case Study - Device Utilization
DU = device time / wall clock time. Matrix multiplication configured with: tiled plus shared memory kernel, pinned memory, asynchronous memory transfers.

Matrix Size     1 device (dev0)   2 devices (dev0)   2 devices (dev1)
16x16           0.00              0.00               0.00
512x512         0.00              0.00               0.00
1024x1024       0.01              0.01               0.01
2048x2048       0.09              0.04               0.04
4096x4096       0.37              0.23               0.23
8192x8192       0.78              0.61               0.61
16384x16384     0.93              0.85               0.85

Experimental Design - NAMD
Are these metrics useful for a full-scale application? NAMD is a molecular dynamics simulator, extended by Phillips et al. to support GPUs. We configured it to simulate the Satellite Tobacco Mosaic Virus (STMV), modifying an existing simulation configuration.

NAMD Case Study - Device Efficiency
[Figure: DE = kernel device time / device time for the single-device run and for the two-device run (dev0, dev1).]

NAMD Case Study - Device Utilization
[Figure: DU = device time / wall clock time for the single-device run and for the two-device run (dev0, dev1).]

Observations & Questions
Device efficiency can be used to indicate performance in a hybrid environment, but it doesn't reflect optimizations between kernels. Device utilization can be used to indicate performance in a hybrid environment, but it doesn't show how much an application can be accelerated. Does device utilization matter? What about CPU utilization? Is CPU wait time free? What about FLOPs/watt? How can we visualize the asynchronous transfers and streams? The most common student questions: How can I tell what the GPU is doing? (performance, scheduling) How should I break down the problem? (blocks, streams)

Acknowledgments and Contact Info
This work was funded in part by National Science Foundation Award #1044973. Thanks to all students in Performance Analysis of Heterogeneous Multicore Systems. The benchmark suite was completed for CUDA and OpenCL. Contact: karavan@cs.pdx.edu, www.cs.pdx.edu/~karavan