CUDA Update: Present & Future. Mark Ebersole, NVIDIA CUDA Educator

1 CUDA Update: Present & Future Mark Ebersole, NVIDIA CUDA Educator

2 Recent CUDA News

3 Kepler K20 & K20X

4 Kepler GPU Architecture: Streaming Multiprocessor (SMX)
   - 192 SP CUDA Cores per SMX
   - 64 DP CUDA Cores per SMX
   - 4 warp schedulers
   - Up to 2048 threads concurrently
   - 32 special-function units
   - 64KB shared mem + L1 cache
   - 48KB Read-Only Data cache
   - 64K 32-bit registers
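The SMX limits above imply a few derived figures that are useful when reasoning about occupancy. The following is a minimal sketch in C, assuming a warp size of 32 threads (the usual CUDA value, not stated on this slide); the struct and function names are hypothetical and only illustrate the arithmetic.

```c
#include <assert.h>
#include <stdio.h>

/* Per-SMX limits copied from the slide. */
typedef struct {
    int max_threads;      /* up to 2048 concurrent threads */
    int warp_schedulers;  /* 4 warp schedulers */
    int registers;        /* 64K 32-bit registers */
} SmxLimits;

/* Assumption: 32 threads per warp. */
enum { WARP_SIZE = 32 };

/* Number of warps that can be resident on one SMX. */
int warps_per_smx(SmxLimits s)
{
    return s.max_threads / WARP_SIZE;
}

/* Resident warps divided evenly among the warp schedulers. */
int warps_per_scheduler(SmxLimits s)
{
    assert(s.warp_schedulers > 0);
    return warps_per_smx(s) / s.warp_schedulers;
}

/* Registers available to each thread when every thread slot is in use. */
int registers_per_thread_at_full_occupancy(SmxLimits s)
{
    assert(s.max_threads > 0);
    return s.registers / s.max_threads;
}

/* Prints the three derived figures for a given set of limits. */
void print_smx_summary(SmxLimits s)
{
    printf("warps per SMX: %d\n", warps_per_smx(s));
    printf("warps per scheduler: %d\n", warps_per_scheduler(s));
    printf("registers per thread at full occupancy: %d\n",
           registers_per_thread_at_full_occupancy(s));
}
```

With the slide's figures (2048 threads, 4 schedulers, 65536 registers) this gives 64 resident warps, 16 per scheduler, and 32 registers per thread when all 2048 thread slots are occupied.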

5 Kepler vs. Fermi
                                               Fermi GF100   Fermi GF104   Kepler GK104      Kepler GK110
   Shared Memory Size Configurations (bytes)   16K / 48K     16K / 48K     16K / 32K / 48K   16K / 32K / 48K
   Max X Grid Dimension                        2^16-1        2^16-1        2^32-1            2^32-1
   Hyper-Q                                     No            No            No                Yes
   Dynamic Parallelism                         No            No            No                Yes
   Also compared (values missing here): Compute Capability, Threads / Warp, Max Warps / Multiprocessor, Max Threads / Multiprocessor, Max Thread Blocks / Multiprocessor, 32-bit Registers / Multiprocessor, Max Registers / Thread, Max Threads / Thread Block

6 Fastest Performance on Scientific Applications: Tesla K20X
   [Bar chart: speed-up over Sandy Bridge CPUs, axis from 0.0x to 20.0x, for MATLAB (FFT)* (Higher Ed), Chroma (Physics), SPECFEM3D (Earth Science), and AMBER (Molecular Dynamics)]
   System config: CPU results: dual-socket E5-2687W, 3.10 GHz. GPU results: dual-socket E5-2687W + 2 Tesla K20X GPUs.
   *MATLAB results comparing one i7-2600K CPU vs. with Tesla K20 GPU

7 Titan: World's #1 Open Science Supercomputer
   - 18,688 Tesla K20X GPUs
   - 27 Petaflops Peak: 90% of Performance from GPUs
   - Petaflops Sustained Performance on Linpack
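The Titan figures above can be combined into a rough per-GPU number. This is a minimal sketch in C, assuming that "90% of Performance from GPUs" applies to the 27-petaflop peak; the function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Peak FLOP/s attributable to one GPU, given the system peak,
   the fraction of that peak delivered by GPUs, and the GPU count. */
double per_gpu_peak_flops(double system_peak_flops, double gpu_fraction,
                          long gpu_count)
{
    assert(gpu_count > 0);
    return system_peak_flops * gpu_fraction / (double)gpu_count;
}

/* Prints the estimate for the figures quoted on the slide. */
void print_titan_estimate(void)
{
    double per_gpu = per_gpu_peak_flops(27.0e15, 0.90, 18688);
    printf("approx. %.2f TFLOPS peak per GPU\n", per_gpu / 1.0e12);
}
```

Under that assumption the estimate comes out at roughly 1.3 TFLOPS of peak performance per GPU.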

8 Dynamic Parallelism
   [Diagram with two panels: CPU and Fermi GPU; CPU and Kepler GPU]

9 Dynamic Work Generation
   - Coarse grid: Higher Performance, Lower Accuracy
   - Fine grid: Lower Performance, Higher Accuracy
   - Dynamic grid: Target performance where accuracy is required
   - Supported on GK110 GPUs

10 CUDA 5.0

11 CUDA 5 (nvidia.com/getcuda): Application Acceleration Made Easier
   - New Nsight Eclipse Edition: Develop, Debug, and Optimize, All in one tool!
   - GPUDirect: RDMA between GPUs and PCIe devices
   - GPU Library Object Linking: Libraries and plug-ins for GPU code
   - Dynamic Parallelism: Spawn new parallel work from within GPU code on GK110

12 NVIDIA GPUDirect Now Supports RDMA
   [Diagram: Server 1 and Server 2, each with a CPU, System Memory, and two GPUs (GPU1, GPU2) with GDDR5 Memory on PCI-e; the servers are joined through their Network Cards over the Network]

13 nvprof (CUDA 5.0 Toolkit)
   - Textual reports
      - Summary of GPU and CPU activity
      - Trace of GPU and CPU activity
      - Event collection
   - Headless profile collection
      - Use nvprof on headless node to collect data
      - Visualize timeline with Visual Profiler

14 GPU Programmability

15 3 Ways to Accelerate Applications
   - Libraries: Drop-in Acceleration
   - OpenACC Directives: Easily Accelerate Applications
   - Programming Languages: Maximum Flexibility

16 GPU Accelerated Libraries: Drop-in Acceleration for your Applications
   - NVIDIA cuBLAS
   - NVIDIA cuRAND
   - NVIDIA cuSPARSE
   - NVIDIA NPP
   - NVIDIA cuFFT
   - Vector Signal Image Processing
   - GPU Accelerated Linear Algebra
   - Matrix Algebra on GPU and Multicore
   - IMSL Library
   - Building-block Algorithms for CUDA
   - ArrayFire Matrix Computations
   - Sparse Linear Algebra
   - C++ STL Features for CUDA

17 Explore the CUDA (Libraries) Ecosystem
   - CUDA Tools and Ecosystem described in detail on NVIDIA Developer Zone: developer.nvidia.com/cuda-tools-ecosystem
   - Watch past GTC library talks

18 3 Ways to Accelerate Applications
   - Libraries: Drop-in Acceleration
   - OpenACC Directives: Easily Accelerate Applications
   - Programming Languages: Maximum Flexibility

19 OpenACC Directives: Simple Compiler hints
      Program myscience
         ... serial code ...
      !$acc kernels
         do k = 1,n1
            do i = 1,n2
               ... parallel code ...
            enddo
         enddo
      !$acc end kernels
         ...
      End Program myscience
   - Your original Fortran or C code, with the !$acc lines as the OpenACC Compiler Hint
   - Compiler Parallelizes code
   - Works on many-core GPUs & multicore CPUs

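20 Familiar to OpenMP Programmers
   OpenMP (CPU):
      #include <stdio.h>

      int main(void) {
          const long n = 1000000;  /* interval count; assumed, not given on the slide */
          double pi = 0.0;
          long i;

          #pragma omp parallel for reduction(+:pi)
          for (i = 0; i < n; i++) {
              double t = (double)((i + 0.05) / n);
              pi += 4.0 / (1.0 + t * t);
          }
          printf("pi = %f\n", pi / n);
          return 0;
      }
   OpenACC (CPU, GPU):
      #include <stdio.h>

      int main(void) {
          const long n = 1000000;  /* interval count; assumed, not given on the slide */
          double pi = 0.0;
          long i;

          #pragma acc kernels
          for (i = 0; i < n; i++) {
              double t = (double)((i + 0.05) / n);
              pi += 4.0 / (1.0 + t * t);
          }
          printf("pi = %f\n", pi / n);
          return 0;
      }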
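The loop on this slide is a Riemann-sum estimate of pi, the integral of 4/(1+t^2) over [0,1]. The sketch below packages it as a function so the result can be checked; the function names and the `_OPENMP` guard are additions for illustration, not part of the slide. Without OpenMP enabled the pragma is skipped and the loop runs serially, with OpenMP it becomes a parallel reduction.

```c
#include <assert.h>
#include <stdio.h>

/* Estimates pi as a Riemann sum of 4/(1+t^2) over [0,1] with n intervals,
   sampling each interval at (i + 0.05)/n as in the slide's loop. */
double estimate_pi(long n)
{
    double sum = 0.0;
    long i;

    assert(n > 0);
#ifdef _OPENMP
#pragma omp parallel for reduction(+:sum)
#endif
    for (i = 0; i < n; i++) {
        double t = (i + 0.05) / (double)n;
        sum += 4.0 / (1.0 + t * t);
    }
    return sum / (double)n;
}

/* Prints the estimate in the same format as the slide. */
void print_pi(long n)
{
    printf("pi = %f\n", estimate_pi(n));
}
```

The reduction clause gives every thread a private partial sum and combines them at the end, which is why the loop body can stay identical to the serial version.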

21 Start Now with OpenACC Directives Free trial license to PGI Accelerator Tools for quick ramp Sign up for a free trial of the directives compiler now!

22 3 Ways to Accelerate Applications
   - Libraries: Drop-in Acceleration
   - OpenACC Directives: Easily Accelerate Applications
   - Programming Languages: Maximum Flexibility

23 Opening the CUDA Compiler
   - Developers want to build front-ends for Java, Python, R, DSLs
   - Target other processors like ARM, FPGA, GPUs, x86
   - CUDA Compiler Contributed to Open Source LLVM
   [Diagram: LLVM Compiler For CUDA at the center; front ends: CUDA C, C++, Fortran, plus New Language Support; targets: NVIDIA GPUs, x86 CPUs, plus New Processor Support]

24 Unified Virtual Addressing (UVA)
   - CPU and GPU allocations use unified virtual address space
      - Think of each one (CPU, GPU) getting its own range of a single VA space
      - Driver/device can determine from an address where data resides
   - Requires:
      - 64-bit Linux or 64-bit Windows with TCC driver
      - Fermi or later architecture GPUs (compute capability 2.0 or higher)
      - CUDA 4.0 or later
   - A GPU can dereference a pointer that is:
      - an address on another GPU
      - an address on the host (CPU)
NVIDIA Corporation 2012
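The idea that an address alone identifies where data resides can be illustrated with a toy model. The sketch below is plain C with a made-up address layout; it is not how the CUDA driver is implemented, and every name and range in it is hypothetical. It only shows why disjoint ranges in a single address space make the lookup unambiguous.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Where an address lives in the toy model. */
typedef enum { LOC_UNKNOWN = -1, LOC_HOST = 0, LOC_GPU0 = 1, LOC_GPU1 = 2 } Location;

/* One contiguous range of the shared virtual address space. */
typedef struct {
    uintptr_t base;
    size_t size;
    Location where;
} VaRange;

/* Hypothetical layout: three disjoint ranges of one address space. */
static const VaRange kRanges[] = {
    { 0x10000000u, 0x10000000u, LOC_HOST },
    { 0x20000000u, 0x10000000u, LOC_GPU0 },
    { 0x30000000u, 0x10000000u, LOC_GPU1 },
};

/* Returns the owner of addr, or LOC_UNKNOWN if it falls in no range.
   Because the ranges never overlap, at most one entry can match. */
Location locate(uintptr_t addr)
{
    size_t i;
    for (i = 0; i < sizeof kRanges / sizeof kRanges[0]; i++) {
        if (addr >= kRanges[i].base && addr - kRanges[i].base < kRanges[i].size)
            return kRanges[i].where;
    }
    return LOC_UNKNOWN;
}
```

In real CUDA code the equivalent query goes through the runtime rather than a user-side table; the model is only meant to make the "own range of a single VA space" bullet concrete.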

25 UVA and Multi-GPU Programming
   - Two interesting aspects:
      - Peer-to-peer (P2P) memcopies
      - Accessing another GPU's addresses
   - Both require peer-access to be enabled
   - Peer-access is not available if:
      - One of the GPUs is pre-Fermi
      - GPUs are connected to different Intel IOH chips on the motherboard (QPI and PCIe protocols disagree on P2P)

26 P2P Memory Access
      cudaSetDevice(0);                     // Set device 0 as current
      float* p0;
      size_t size = 1024 * sizeof(float);
      cudaMalloc(&p0, size);                // Allocate memory on device 0
      MyKernel<<<1000, 128>>>(p0);          // Launch kernel on device 0
      cudaSetDevice(1);                     // Set device 1 as current
      cudaDeviceEnablePeerAccess(0, 0);     // Enable peer-to-peer access with device 0
      // Launch kernel on device 1
      // This kernel launch can access memory on device 0 at address p0
      MyKernel<<<1000, 128>>>(p0);

27 P2P Memory Copy
      cudaSetDevice(0);                     // Set device 0 as current
      float* p0;
      size_t size = 1024 * sizeof(float);
      cudaMalloc(&p0, size);                // Allocate memory on device 0
      cudaSetDevice(1);                     // Set device 1 as current
      float* p1;
      cudaMalloc(&p1, size);                // Allocate memory on device 1
      cudaSetDevice(0);                     // Set device 0 as current
      MyKernel<<<1000, 128>>>(p0);          // Launch kernel on device 0
      cudaSetDevice(1);                     // Set device 1 as current
      cudaMemcpyPeer(p1, 1, p0, 0, size);   // Copy p0 to p1
      MyKernel<<<1000, 128>>>(p1);          // Launch kernel on device 1

28 Topology Matters for P2P Communication
   - P2P Communication is Not Supported Between Bridges (*)
   (*) The IOH does not support non-contiguous byte enables from PCI Express for remote peer-to-peer MMIO transactions. This is an additional restriction over the PCI Express standard requirements to prevent incompatibility with Intel QuickPath Interconnect.

29 Topology Matters
   - Best P2P Performance Between GPUs on the Same PCIe Switch
   - P2P Communication Supported Between GPUs on the Same IOH
   - PCI-e Switches: Fully Supported
   [Diagram: CPU0 connected over QPI to IOH0; IOH0 connects by x16 links to PCI-e Switch 0 and Switch 1; GPU0 to GPU3 attach to the switches by x16 links]

30 How Does P2P Memcopy Help Multi-GPU?
   - Ease of programming
      - No need to manually maintain memory buffers on the host for inter-GPU exchanges
   - Increased throughput
      - Especially when communication path does not include IOH (GPUs connected to a PCIe switch):
         - Single-directional transfers achieve up to ~6.3 GB/s
         - Duplex transfers achieve ~12.2 GB/s
      - GPU-pairs can communicate concurrently if paths don't overlap

31 Peer-to-Peer Throughputs
                                                               Simplex    Duplex
   Via PCIe switch (GPUs attached to the same PCIe switch)     6.3 GB/s   12.2 GB/s
   Via IOH chip (GPUs attached to the same IOH chip)           5.3 GB/s    9.0 GB/s
   Via host (GPUs attached to different IOH chips)             2.2 GB/s    3.9 GB/s
   [Diagram: CPU-0 and CPU-1, each with DRAM and an IOH 36D, a PCIe switch, and GPU-0 to GPU-3]
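The throughput figures translate directly into transfer times, which shows how much the path between two GPUs matters. This is a minimal sketch in C, assuming 1 GB = 10^9 bytes and ignoring per-transfer latency; the macro and function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Simplex throughputs quoted on the slide, in GB/s. */
#define GBPS_VIA_PCIE_SWITCH 6.3
#define GBPS_VIA_IOH         5.3
#define GBPS_VIA_HOST        2.2

/* Milliseconds needed to move `bytes` at `gb_per_s`,
   assuming 1 GB = 1e9 bytes and no fixed latency. */
double transfer_ms(double bytes, double gb_per_s)
{
    assert(gb_per_s > 0.0);
    return bytes / (gb_per_s * 1.0e9) * 1.0e3;
}

/* Prints the estimated one-way copy time over each of the three paths. */
void print_transfer_estimates(double bytes)
{
    printf("via PCIe switch: %.1f ms\n", transfer_ms(bytes, GBPS_VIA_PCIE_SWITCH));
    printf("via IOH chip:    %.1f ms\n", transfer_ms(bytes, GBPS_VIA_IOH));
    printf("via host:        %.1f ms\n", transfer_ms(bytes, GBPS_VIA_HOST));
}
```

For a 256 MB buffer this works out to roughly 41 ms through a shared PCIe switch and roughly 116 ms when the copy has to go through the host, a factor of almost three.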

32 P2P transfer between multiple CPU processes
   - CUDA 4.1 introduced a new family of functions called cudaIpc*
      - Create a handle to a GPU device memory segment which can be exported to other processes within a node
      - API functions also provide an IPC mechanism for passing events between processes
   - MVAPICH2-1.8 and later already supports this efficient GPU-GPU transfer within a node
   - IPC functionality is restricted to devices with support for unified addressing on Linux operating systems

33 Communication for Multiple Hosts, Multiple GPUs

34 Communication Between GPUs in Different Nodes
   - GPUs in different network nodes
      - Requires network communication
   - With CUDA 4.0, transparent interoperability between CUDA pinned memory and InfiniBand network card
      - export CUDA_NIC_INTEROP=1
      - With CUDA 4.1 and later, default behavior
   - If each node also has multiple GPUs:
      - Can continue using P2P within the node
      - Can overlap some PCIe transfers with network communication (in addition to kernel execution)

35 GPU-aware MPI Implementations

36 GPU-aware MPI
   - Support GPU to GPU communication through standard MPI interfaces without exposing low level details to the programmer
      - e.g. enable MPI_Send, MPI_Recv from/to GPU memory
   - Made possible by Unified Virtual Addressing (UVA) in CUDA 4.0
   - MVAPICH2, OpenMPI, Platform MPI (Beta)
   Code without MPI integration
      At Sender:
         cudaMemcpy(s_buf, s_device, size, cudaMemcpyDeviceToHost);
         MPI_Send(s_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
      At Receiver:
         MPI_Recv(r_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status);
         cudaMemcpy(r_device, r_buf, size, cudaMemcpyHostToDevice);
   Code with MPI integration
      At Sender:
         MPI_Send(s_device, size, ...);
      At Receiver:
         MPI_Recv(r_device, size, ...);

37 RDMA Requirements
   - All Tesla products with Kepler GPUs
   - IB NIC and GPU on same IOH

38 Performance Results: Two Nodes
   [Chart: bandwidth (BW) in GB/s against message size, from 1 B to 4 MB, for three cases: MVAPICH D to D (regular Host Memory), MVAPICH D to D with GPUDirect*, and MVAPICH MPI H to H]
   Latency (1 byte): ... µs, ... µs, 2.30 µs
   * Accelerated Communication with Network & Storage Devices

39 OpenMP
   - Set device for each OpenMP thread
      int num_devices;
      cudaGetDeviceCount(&num_devices);
      #pragma omp parallel shared(m, n, Anew, A)
      {
          /* Wrap by the device count so extra threads still get a valid device. */
          cudaSetDevice(omp_get_thread_num() % num_devices);
          cudaMalloc(...);
          cudaMemcpy(...);
          #pragma omp for
          for (int j = 1; j < n; j++) {
              kernel<<<blocks, threads>>>(j, n, Anew, A, ...);
              cudaMemcpy(...);
          }
          cudaMemcpy(...);
      }
   - Useful for quickly dividing work among multiple GPUs in a node
   - Not as explicit as MPI-based solution
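Assigning one device to each OpenMP thread is a round-robin mapping from thread index to device index. The sketch below isolates that mapping in plain C so it can be checked without a GPU; it assumes the device count is already known (in CUDA it would come from cudaGetDeviceCount), and the function names are hypothetical.

```c
#include <assert.h>
#include <stdio.h>

/* Round-robin assignment: wrap the thread index by the device count,
   so any number of threads maps onto valid device indices. */
int device_for_thread(int thread_id, int num_devices)
{
    assert(thread_id >= 0 && num_devices > 0);
    return thread_id % num_devices;
}

/* Prints the device chosen for each of num_threads threads. */
void print_assignment(int num_threads, int num_devices)
{
    int t;
    for (t = 0; t < num_threads; t++)
        printf("thread %d -> device %d\n", t, device_for_thread(t, num_devices));
}
```

With as many threads as GPUs every thread gets its own device; with more threads than GPUs, several threads share a device instead of requesting an index that does not exist.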

40 NUMA Considerations

41 NUMA and GPUs
   - Host (CPU) NUMA affects PCIe transfer throughput in dual-IOH systems
      - Transfers to remote GPUs achieve lower throughput because of the additional QPI hops
      - This affects any PCIe device, not just GPUs (e.g. network cards)
   - When possible, lock CPU threads to a socket that's closest to the GPU's IOH chip
      - For example, by using taskset, numactl, GOMP_CPU_AFFINITY, KMP_AFFINITY, etc.
         taskset -c 2 gpuapp device=0
         numactl --physcpubind=2 yourapp device=0

42 NUMA and GPUs
   - C function which uses the sched_setaffinity() system function:
         #define _GNU_SOURCE   /* needed for cpu_set_t and the CPU_* macros */
         #include <sched.h>

         void set_cpu_affinity(int id)
         {
             cpu_set_t mask;
             int gpu_table[3] = {2, 3, 5};   /* CPU core to use for each GPU id */

             /* Set the affinity for the GPU specified by id. */
             CPU_ZERO(&mask);
             CPU_SET(gpu_table[id], &mask);
             sched_setaffinity(0, sizeof(mask), &mask);
         }
   - A sample usage:
         /* Select the CUDA GPU and set the CPU processor affinity. */
         cudaSetDevice(dev);
         set_cpu_affinity(dev);
   - For some applications, it is not practical to restrict GPU-to-memory transfers to the memory of a single CPU socket
      - Use alternatives like numactl --interleave=all yourapp to minimize the difference between using near memory and far memory

43 NUMA and GPUs
   - Developers should use hwloc when concerned about NUMA topologies
      - Used by Open MPI, MVAPICH2, others
      - Handles other PCIe devices too
   - Built-in support to get topology information for CUDA devices
      - Interaction between hwloc and CUDA driver to get for instance a list of processors near NVIDIA GPUs

44 Cluster Solutions
   - Allinea and TotalView Cluster Debuggers
      - Multi-GPU debugging support
      - CUDA-MEMCHECK support for memory errors
      - MPI and CUDA support for GPU clusters
      - Breakpoints, thread control, and data evaluation
   - VAMPIR Cluster Profiler
      - Visualization and Analysis of MPI + CUDA code
   - Other profiler partners
      - TAU, PAPI, HPC-Toolkit

45 Summary
   - CUDA provides a number of features to facilitate multi-GPU programming
   - Single-process / multiple GPUs:
      - Unified virtual address space
      - Ability to directly access peer GPU's data
      - Ability to issue P2P memcopies
         - No staging via CPU memory
         - High aggregate throughput for many-GPU nodes
   - Multiple-processes:
      - GPU Direct to maximize performance when both PCIe and IB transfers are needed
   - Streams and asynchronous kernel/copies
      - Allow overlapping of communication and execution
   - Keep NUMA in mind on multi-IOH systems

46 Finally

47 GPU Test Drive At A Glance
   - 5x: Average speed up with GPUs compared to CPUs, experienced by 100s of test drive users
   - 21+Amazon: Try on AWS EC2 or remotely hosted clusters by 21 service providers across the globe
   - 6: Key Apps in Computational Chemistry for focused penetration in Higher Ed.
   - 4 out of 5: satisfaction rating from test drive users
   What are customers saying?
      "GPU-acceleration speeds up MD simulations by 15-20 times, enabling us to identify significantly more possible treatments. Based on the GPU Test Drive experience, we decided to invest heavily in the technology for a new computing cluster." Dr. David Gohara, Professor, St. Louis University
   What is next for GPU Test Drive?
      - K10 & K20
      - Available now: Version

48 Links
   - Get CUDA:
   - Nsight:
   - Programming Guide/Best Practices: docs.nvidia.com
   - Questions: NVIDIA Developer forums, devtalk.nvidia.com; Search or ask on
   - General:

Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA Outline Introduction to Multi-GPU Programming Communication for Single Host, Multiple GPUs Communication for Multiple Hosts, Multiple GPUs