CUDA Update: Present & Future. Mark Ebersole, NVIDIA CUDA Educator

1 CUDA Update: Present & Future Mark Ebersole, NVIDIA CUDA Educator

2 Recent CUDA News

3 Kepler K20 & K20X

4 Kepler GPU Architecture: Streaming Multiprocessor (SMX). 192 SP CUDA cores per SMX; 64 DP CUDA cores per SMX; 4 warp schedulers; up to 2048 threads concurrently; 32 special-function units; 64 KB shared memory + L1 cache; 48 KB read-only data cache; 64K 32-bit registers.

5 Kepler vs. Fermi. Columns: Fermi GF100, Fermi GF104, Kepler GK104, Kepler GK110.
    Compute Capability
    Threads / Warp
    Max Warps / Multiprocessor
    Max Threads / Multiprocessor
    Max Thread Blocks / Multiprocessor
    32-bit Registers / Multiprocessor
    Max Registers / Thread
    Max Threads / Thread Block
    Shared Memory Size Configurations (bytes): 16K, 48K | 16K, 48K | 16K, 32K, 48K | 16K, 32K, 48K
    Max X Grid Dimension: 2^16-1 | 2^16-1 | 2^32-1 | 2^32-1
    Hyper-Q: No | No | No | Yes
    Dynamic Parallelism: No | No | No | Yes

6 Fastest Performance on Scientific Applications. Chart: Tesla K20X speed-up over Sandy Bridge CPUs (axis 0x to 20x) for MATLAB (FFT)* (Higher Ed), Chroma (Physics), SPECFEM3D (Earth Science), and AMBER (Molecular Dynamics). System configuration: CPU results: dual-socket E5-2687W, 3.10 GHz; GPU results: dual-socket E5-2687W + 2 Tesla K20X GPUs. *MATLAB results comparing one i7-2600K CPU vs. with Tesla K20 GPU.

7 Titan: World's #1 Open Science Supercomputer. 18,688 Tesla K20X GPUs. 27 petaflops peak, 90% of performance from GPUs. Petaflops sustained performance on Linpack.

8 Dynamic Parallelism. (Diagram: CPU with Fermi GPU compared with CPU with Kepler GPU.)

9 Dynamic Work Generation. Coarse grid: higher performance, lower accuracy. Fine grid: lower performance, higher accuracy. Dynamic grid: target performance where accuracy is required. Supported on GK110 GPUs.

10 CUDA 5.0

11 CUDA 5 (nvidia.com/getcuda): Application Acceleration Made Easier. New Nsight, Eclipse Edition: develop, debug, and optimize, all in one tool! GPUDirect: RDMA between GPUs and PCIe devices. GPU Library Object Linking: libraries and plug-ins for GPU code. Dynamic Parallelism: spawn new parallel work from within GPU code on GK110.

12 NVIDIA GPUDirect Now Supports RDMA. (Diagram: Server 1 and Server 2, each with a CPU, system memory, GPU1 and GPU2 with GDDR5 memory, and a network card on PCI-e, connected over the network.)

13 nvprof (CUDA 5.0 Toolkit). Textual reports: summary of GPU and CPU activity; trace of GPU and CPU activity; event collection. Headless profile collection: use nvprof on a headless node to collect data, then visualize the timeline with Visual Profiler.

14 GPU Programmability

15 3 Ways to Accelerate Applications: Libraries (drop-in acceleration), OpenACC Directives (easily accelerate applications), Programming Languages (maximum flexibility).

16 GPU Accelerated Libraries: Drop-in Acceleration for your Applications. NVIDIA cuBLAS; NVIDIA cuRAND; NVIDIA cuSPARSE; NVIDIA NPP; NVIDIA cuFFT; IMSL Library; ArrayFire Matrix Computations; Vector Signal Image Processing; GPU Accelerated Linear Algebra; Matrix Algebra on GPU and Multicore; Building-block Algorithms for CUDA; Sparse Linear Algebra; C++ STL Features for CUDA.

17 Explore the CUDA (Libraries) Ecosystem. CUDA tools and ecosystem are described in detail on the NVIDIA Developer Zone: developer.nvidia.com/cuda-tools-ecosystem. Watch past GTC library talks.

18 3 Ways to Accelerate Applications: Libraries (drop-in acceleration), OpenACC Directives (easily accelerate applications), Programming Languages (maximum flexibility).

19 OpenACC Directives. Simple compiler hints; the compiler parallelizes the code; works on many-core GPUs & multicore CPUs. Your original Fortran or C code, with the OpenACC compiler hint added:
    Program myscience
      ... serial code ...
    !$acc kernels
      do k = 1,n1
        do i = 1,n2
          ... parallel code ...
        enddo
      enddo
    !$acc end kernels
      ...
    End Program myscience

20 Familiar to OpenMP Programmers
OpenMP (CPU):
    main() {
      double pi = 0.0; long i;
      #pragma omp parallel for reduction(+:pi)
      for (i=0; i<n; i++) {
        double t = (double)((i+0.5)/n);   /* midpoint of interval i */
        pi += 4.0/(1.0+t*t);
      }
      printf("pi = %f\n", pi/n);
    }
OpenACC (CPU and GPU):
    main() {
      double pi = 0.0; long i;
      #pragma acc kernels
      for (i=0; i<n; i++) {
        double t = (double)((i+0.5)/n);   /* midpoint of interval i */
        pi += 4.0/(1.0+t*t);
      }
      printf("pi = %f\n", pi/n);
    }
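
The loop on this slide is a midpoint-rule integration of 4/(1+x^2) over [0,1]. A minimal, self-contained sketch of the same idea as a plain C function follows. The function name estimate_pi and the _OPENMP guard are illustrative choices, not part of the slides. The guard lets the code compile as serial C when OpenMP is not enabled.

```c
#include <assert.h>

/* Midpoint-rule estimate of pi = integral of 4/(1+x^2) over [0,1].
 * The OpenMP pragma is only seen when the compiler defines _OPENMP
 * (for example with gcc -fopenmp); otherwise the loop runs serially. */
double estimate_pi(long n)
{
    assert(n > 0);                      /* need at least one interval */
    double sum = 0.0;
    long i;
#ifdef _OPENMP
#pragma omp parallel for reduction(+:sum)
#endif
    for (i = 0; i < n; i++) {
        double t = (i + 0.5) / (double)n;   /* midpoint of interval i */
        sum += 4.0 / (1.0 + t * t);
    }
    return sum / (double)n;
}
```

Calling estimate_pi(1000000) should agree with pi to well beyond six decimal places, since the midpoint rule's error shrinks with the square of the interval width.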

21 Start Now with OpenACC Directives. Free trial license to PGI Accelerator Tools for quick ramp. Sign up for a free trial of the directives compiler now!

22 3 Ways to Accelerate Applications: Libraries (drop-in acceleration), OpenACC Directives (easily accelerate applications), Programming Languages (maximum flexibility).

23 Opening the CUDA Compiler. CUDA compiler contributed to open-source LLVM. Developers want to build front-ends for Java, Python, R, and DSLs, and to target other processors like ARM, FPGA, GPUs, and x86. (Diagram: CUDA C, C++, Fortran and new language support feed the LLVM Compiler for CUDA, which targets NVIDIA GPUs, x86 CPUs, and new processor support.)

24 Unified Virtual Addressing (UVA). CPU and GPU allocations use a unified virtual address space: think of each one (CPU, GPU) getting its own range of a single VA space. The driver/device can determine from an address where data resides. Requires: 64-bit Linux or 64-bit Windows with TCC driver; Fermi or later architecture GPUs (compute capability 2.0 or higher); CUDA 4.0 or later. A GPU can dereference a pointer that is an address on another GPU or an address on the host (CPU). NVIDIA Corporation 2012

25 UVA and Multi-GPU Programming. Two interesting aspects: peer-to-peer (P2P) memcopies and accessing another GPU's addresses. Both require peer access to be enabled. Peer access is not available if one of the GPUs is pre-Fermi, or if the GPUs are connected to different Intel IOH chips on the motherboard (QPI and PCIe protocols disagree on P2P).
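
The two availability conditions above can be written down as a small predicate. This is only a sketch of the rule as stated on the slide: the gpu_info struct, its fields, and the function name are hypothetical, and a real program would ask the runtime via cudaDeviceCanAccessPeer() instead of reasoning about topology by hand.

```c
#include <assert.h>

/* Hypothetical description of one GPU; the field names are illustrative. */
struct gpu_info {
    int cc_major;   /* compute capability major version (Fermi is 2) */
    int ioh_id;     /* index of the IOH chip the GPU is attached to */
};

/* Returns 1 when neither condition from the slide rules out peer access. */
int peer_access_possible(const struct gpu_info *a, const struct gpu_info *b)
{
    assert(a != 0 && b != 0);
    if (a->cc_major < 2 || b->cc_major < 2)
        return 0;               /* one of the GPUs is pre-Fermi */
    if (a->ioh_id != b->ioh_id)
        return 0;               /* GPUs are behind different IOH chips */
    return 1;
}
```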

26 P2P Memory Access
    cudaSetDevice(0);                   // Set device 0 as current
    float* p0;
    size_t size = 1024 * sizeof(float);
    cudaMalloc(&p0, size);              // Allocate memory on device 0
    MyKernel<<<1000, 128>>>(p0);        // Launch kernel on device 0
    cudaSetDevice(1);                   // Set device 1 as current
    cudaDeviceEnablePeerAccess(0, 0);   // Enable peer-to-peer access with device 0
    // Launch kernel on device 1
    // This kernel launch can access memory on device 0 at address p0
    MyKernel<<<1000, 128>>>(p0);

27 P2P Memory Copy
    cudaSetDevice(0);                   // Set device 0 as current
    float* p0;
    size_t size = 1024 * sizeof(float);
    cudaMalloc(&p0, size);              // Allocate memory on device 0
    cudaSetDevice(1);                   // Set device 1 as current
    float* p1;
    cudaMalloc(&p1, size);              // Allocate memory on device 1
    cudaSetDevice(0);                   // Set device 0 as current
    MyKernel<<<1000, 128>>>(p0);        // Launch kernel on device 0
    cudaSetDevice(1);                   // Set device 1 as current
    cudaMemcpyPeer(p1, 1, p0, 0, size); // Copy p0 to p1
    MyKernel<<<1000, 128>>>(p1);        // Launch kernel on device 1

28 Topology Matters for P2P Communication. P2P communication is not supported between bridges (*). (*) The IOH does not support non-contiguous byte enables from PCI Express for remote peer-to-peer MMIO transactions. This is an additional restriction over the PCI Express standard requirements to prevent incompatibility with Intel QuickPath Interconnect.

29 Topology Matters. PCI-e switches: fully supported. Best P2P performance between GPUs on the same PCIe switch. P2P communication supported between GPUs on the same IOH. (Diagram: CPU0, QPI, IOH0, PCI-e Switch 0 and Switch 1, and GPU0 to GPU3, joined by x16 links.)

30 How Does P2P Memcopy Help Multi-GPU? Ease of programming: no need to manually maintain memory buffers on the host for inter-GPU exchanges. Increased throughput, especially when the communication path does not include the IOH (GPUs connected to a PCIe switch): single-directional transfers achieve up to ~6.3 GB/s; duplex transfers achieve ~12.2 GB/s. GPU pairs can communicate concurrently if their paths don't overlap.

31 Peer-to-Peer Throughputs. Via PCIe switch (GPUs attached to the same PCIe switch): simplex 6.3 GB/s, duplex 12.2 GB/s. Via IOH chip (GPUs attached to the same IOH chip): simplex 5.3 GB/s, duplex 9.0 GB/s. Via host (GPUs attached to different IOH chips): simplex 2.2 GB/s, duplex 3.9 GB/s. (Diagram: CPU-0 and CPU-1, each with DRAM and an IOH 36D; a PCIe switch; GPU-0 to GPU-3.)
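
To get a feel for what these throughputs mean for a given buffer, here is a rough back-of-the-envelope estimator using the simplex figures quoted on the slide. It is a sketch under stated assumptions: 1 GB is taken as 1e9 bytes, latency and setup costs are ignored, and the enum and function names are made up for illustration.

```c
#include <assert.h>

/* The three communication paths listed on the slide. */
enum p2p_path { P2P_VIA_SWITCH, P2P_VIA_IOH, P2P_VIA_HOST };

/* Simplex throughput quoted on the slide for each path, in GB/s. */
static double simplex_gb_per_s(enum p2p_path path)
{
    switch (path) {
    case P2P_VIA_SWITCH: return 6.3;
    case P2P_VIA_IOH:    return 5.3;
    default:             return 2.2;   /* P2P_VIA_HOST */
    }
}

/* Estimated one-way copy time in milliseconds.
 * Assumption: 1 GB = 1e9 bytes; latency and setup cost are ignored. */
double p2p_copy_ms(unsigned long long bytes, enum p2p_path path)
{
    double bytes_per_s = simplex_gb_per_s(path) * 1e9;
    assert(bytes_per_s > 0.0);
    return (double)bytes / bytes_per_s * 1000.0;
}
```

Under these assumptions a 63 MB buffer takes about 10 ms between GPUs on the same PCIe switch, and nearly three times as long when the copy has to go via the host.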

32 P2P Transfer Between Multiple CPU Processes. CUDA 4.1 introduced a new family of functions called cudaIpc*: create a handle to a GPU device memory segment which can be exported to other processes within a node. The API functions also provide an IPC mechanism for passing events between processes. MVAPICH2-1.8 and later already supports this efficient GPU-GPU transfer within a node. IPC functionality is restricted to devices with support for unified addressing on Linux operating systems.

33 Communication for Multiple Hosts, Multiple GPUs

34 Communication Between GPUs in Different Nodes. GPUs in different network nodes require network communication. With CUDA 4.0: transparent interoperability between CUDA pinned memory and the InfiniBand network card (export CUDA_NIC_INTEROP=1). With CUDA 4.1 and later: default behavior. If each node also has multiple GPUs: can continue using P2P within the node; can overlap some PCIe transfers with network communication (in addition to kernel execution).

35 GPU-aware MPI Implementations

36 GPU-aware MPI. Support GPU-to-GPU communication through standard MPI interfaces without exposing low-level details to the programmer, e.g. enable MPI_Send, MPI_Recv from/to GPU memory. Made possible by Unified Virtual Addressing (UVA) in CUDA 4.0. MVAPICH2, Open MPI, Platform MPI (Beta).
Code without MPI integration
At Sender:
    cudaMemcpy(s_buf, s_device, size, cudaMemcpyDeviceToHost);
    MPI_Send(s_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);
At Receiver:
    MPI_Recv(r_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &req);
    cudaMemcpy(r_device, r_buf, size, cudaMemcpyHostToDevice);
Code with MPI integration
At Sender:
    MPI_Send(s_device, size, ...);
At Receiver:
    MPI_Recv(r_device, size, ...);

37 RDMA Requirements. All Tesla with Kepler GPUs. IB NIC and GPU on same IOH.

38 Performance Results, Two Nodes. Chart: bandwidth in GB/s against message size (1 b to 4 Mb) for three series: MVAPICH D to D (regular host memory), MVAPICH D to D with GPUDirect*, and MVAPICH MPI H to H. Latency (1 byte): µs, µs, 2.30 µs. * Accelerated Communication with Network & Storage Devices

39 OpenMP. Set the device for each OpenMP thread:
    #pragma omp parallel shared(m, n, Anew, A)
    {
      // Effectively the thread number: assumes one OpenMP thread per GPU
      cudaSetDevice(omp_get_thread_num() % omp_get_num_threads());
      cudaMalloc(...);
      cudaMemcpy(...);
      #pragma omp for
      for (int j = 1; j < n; j++) {
        kernel<<<blocks,threads>>>(j, n, Anew, A, ...);
        cudaMemcpy(...);
      }
      cudaMemcpy(...);
    }
Useful for quickly dividing work among multiple GPUs in a node. Not as explicit as an MPI-based solution.
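
The device selection on this slide works when there is exactly one OpenMP thread per GPU. If there may be more threads than GPUs, a common variation is to wrap around the device count instead. The helper below is a minimal sketch of that mapping; the name device_for_thread is hypothetical, and in a real program num_devices would come from cudaGetDeviceCount().

```c
#include <assert.h>

/* Round-robin assignment of OpenMP threads to GPUs.
 * thread_id:   value of omp_get_thread_num() for the calling thread
 * num_devices: number of GPUs in the node (assumed to be known) */
int device_for_thread(int thread_id, int num_devices)
{
    assert(thread_id >= 0);
    assert(num_devices > 0);
    return thread_id % num_devices;
}
```

Inside the parallel region the result would be passed to cudaSetDevice(), so that with 2 GPUs threads 0 and 2 share device 0 and threads 1 and 3 share device 1.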

40 NUMA Considerations

41 NUMA and GPUs. Host (CPU) NUMA affects PCIe transfer throughput in dual-IOH systems: transfers to remote GPUs achieve lower throughput because of additional QPI hops. (This affects any PCIe device, not just GPUs, e.g. network cards.) When possible, lock CPU threads to a socket that's closest to the GPU's IOH chip, for example by using taskset, numactl, GOMP_CPU_AFFINITY, KMP_AFFINITY, etc.:
    taskset -c 2 gpuapp device=0
    numactl --physcpubind=2 yourapp device=0

42 NUMA and GPUs. C function which uses the sched_setaffinity() system function:
    #define _GNU_SOURCE   /* exposes CPU_ZERO and CPU_SET in glibc's <sched.h> */
    #include <sched.h>
    void set_cpu_affinity(int id)
    {
      cpu_set_t mask;
      int gpu_table[3] = {2, 3, 5};
      /* Set the affinity for the GPU specified by id. */
      CPU_ZERO(&mask);
      CPU_SET(gpu_table[id], &mask);
      sched_setaffinity(0, sizeof(mask), &mask);
    }
A sample usage:
    /* Select the CUDA GPU and set the CPU processor affinity. */
    cudaSetDevice(dev);
    set_cpu_affinity(dev);
For some applications, it is not practical to restrict GPU-to-memory transfers to the memory of a single CPU socket. Use alternatives like numactl --interleave=all yourapp to minimize the difference between using near memory and far memory.
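
The slide's function indexes gpu_table without checking id, so an out-of-range device number would read past the array. One possible hardening, sketched here with the same illustrative table values, is to separate the lookup into a bounds-checked helper. The name cpu_for_gpu is an assumption for this example, not something from the slides.

```c
#include <assert.h>
#include <stddef.h>

/* Same illustrative GPU-to-core table as on the slide. */
static const int gpu_table[] = {2, 3, 5};

/* Returns the CPU core to pin to for a GPU index, or -1 when the
 * index is outside the table. */
int cpu_for_gpu(int gpu_id)
{
    size_t count = sizeof(gpu_table) / sizeof(gpu_table[0]);
    assert(count > 0);
    if (gpu_id < 0 || (size_t)gpu_id >= count)
        return -1;
    return gpu_table[gpu_id];
}
```

A caller would pass a non-negative result to CPU_SET() before calling sched_setaffinity(), and skip pinning when -1 is returned.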

43 NUMA and GPUs. Developers should use hwloc when concerned about NUMA topologies. Used by Open MPI, MVAPICH2, and others. Handles other PCIe devices too. Built-in support to get topology information for CUDA devices: interaction between hwloc and the CUDA driver to get, for instance, a list of processors near NVIDIA GPUs.

44 Cluster Solutions. Allinea and TotalView cluster debuggers: multi-GPU debugging support; CUDA-MEMCHECK support for memory errors; MPI and CUDA support for GPU clusters; breakpoints, thread control, and data evaluation. VAMPIR cluster profiler: visualization and analysis of MPI + CUDA code. Other profiler partners: TAU, PAPI, HPCToolkit.

45 Summary. CUDA provides a number of features to facilitate multi-GPU programming. Single process / multiple GPUs: unified virtual address space; ability to directly access a peer GPU's data; ability to issue P2P memcopies, with no staging via CPU memory and high aggregate throughput for many-GPU nodes. Multiple processes: GPUDirect to maximize performance when both PCIe and IB transfers are needed. Streams and asynchronous kernels/copies allow overlapping of communication and execution. Keep NUMA in mind on multi-IOH systems.

46 Finally

47 GPU Test Drive At A Glance. 5x: average speed-up with GPUs compared to CPUs, experienced by 100s of test drive users. 21 + Amazon: try on AWS EC2 or remotely hosted clusters by 21 service providers across the globe. 6: key apps in computational chemistry for focused penetration in Higher Ed. 4 out of 5: satisfaction rating from test drive users. What are customers saying? "GPU-acceleration speeds up MD simulations by times, enabling us to identify significantly more possible treatments. Based on the GPU Test Drive experience, we decided to invest heavily in the technology for a new computing cluster." Dr. David Gohara, Professor, St. Louis University. What is next for GPU Test Drive? K10 & K20 Available now: Version

48 Links. Get CUDA: Nsight: Programming Guide/Best Practices: docs.nvidia.com. Questions: NVIDIA Developer forums, devtalk.nvidia.com. Search or ask on General:

Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA. NVIDIA Corporation 2012

Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA. NVIDIA Corporation 2012 Scalable Cluster Computing with NVIDIA GPUs Axel Koehler NVIDIA Outline Introduction to Multi-GPU Programming Communication for Single Host, Multiple GPUs Communication for Multiple Hosts, Multiple GPUs

More information

Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator

Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator What is CUDA? Programming language? Compiler? Classic car? Beer? Coffee? CUDA Parallel Computing Platform www.nvidia.com/getcuda Programming

More information

NVIDIA GPUDirect Technology. NVIDIA Corporation 2011

NVIDIA GPUDirect Technology. NVIDIA Corporation 2011 NVIDIA GPUDirect Technology NVIDIA GPUDirect : Eliminating CPU Overhead Accelerated Communication with Network and Storage Devices Peer-to-Peer Communication Between GPUs Direct access to CUDA memory for

More information

GPU Computing. Axel Koehler Sr. Solution Architect HPC

GPU Computing. Axel Koehler Sr. Solution Architect HPC GPU Computing Axel Koehler Sr. Solution Architect HPC 1 NVIDIA: Parallel Computing Company GPUs: GeForce, Quadro, Tesla ARM SoCs: Tegra VGX 2 Continued Demand for Ever Faster Supercomputers First-principles

More information

CUDA 5 and Beyond. Mark Ebersole. Original Slides: Mark Harris 2012 NVIDIA

CUDA 5 and Beyond. Mark Ebersole. Original Slides: Mark Harris 2012 NVIDIA CUDA 5 and Beyond Mark Ebersole Original Slides: Mark Harris The Soul of CUDA The Platform for High Performance Parallel Computing Accessible High Performance Enable Computing Ecosystem Introducing CUDA

More information

CUDA. Matthew Joyner, Jeremy Williams

CUDA. Matthew Joyner, Jeremy Williams CUDA Matthew Joyner, Jeremy Williams Agenda What is CUDA? CUDA GPU Architecture CPU/GPU Communication Coding in CUDA Use cases of CUDA Comparison to OpenCL What is CUDA? What is CUDA? CUDA is a parallel

More information

Multi-GPU Programming

Multi-GPU Programming Multi-GPU Programming Paulius Micikevicius Developer Technology, NVIDIA 1 Outline Usecases and a taxonomy of scenarios Inter-GPU communication: Single host, multiple GPUs Multiple hosts Case study Multiple

More information

GPU Computing Ecosystem

GPU Computing Ecosystem GPU Computing Ecosystem CUDA 5 Enterprise level GPU Development GPU Development Paths Libraries, Directives, Languages GPU Tools Tools, libraries and plug-ins for GPU codes Tesla K10 Kepler! Tesla K20

More information

Getting Started with CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator

Getting Started with CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator Getting Started with CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator Heterogeneous Computing CPU GPU Once upon a time Past Massively Parallel Supercomputers Goodyear MPP Thinking Machine MasPar Cray 2 1.31

More information

High-Productivity CUDA Programming. Cliff Woolley, Sr. Developer Technology Engineer, NVIDIA

High-Productivity CUDA Programming. Cliff Woolley, Sr. Developer Technology Engineer, NVIDIA High-Productivity CUDA Programming Cliff Woolley, Sr. Developer Technology Engineer, NVIDIA HIGH-PRODUCTIVITY PROGRAMMING High-Productivity Programming What does this mean? What s the goal? Do Less Work

More information

High-Productivity CUDA Programming. Levi Barnes, Developer Technology Engineer, NVIDIA

High-Productivity CUDA Programming. Levi Barnes, Developer Technology Engineer, NVIDIA High-Productivity CUDA Programming Levi Barnes, Developer Technology Engineer, NVIDIA MORE RESOURCES How to learn more GTC -- March 2014 San Jose, CA gputechconf.com Video archives, too Qwiklabs nvlabs.qwiklabs.com

More information

MPI + X programming. UTK resources: Rho Cluster with GPGPU George Bosilca CS462

MPI + X programming. UTK resources: Rho Cluster with GPGPU   George Bosilca CS462 MPI + X programming UTK resources: Rho Cluster with GPGPU https://newton.utk.edu/doc/documentation/systems/rhocluster George Bosilca CS462 MPI Each programming paradigm only covers a particular spectrum

More information

GPU Computing with NVIDIA s new Kepler Architecture

GPU Computing with NVIDIA s new Kepler Architecture GPU Computing with NVIDIA s new Kepler Architecture Axel Koehler Sr. Solution Architect HPC HPC Advisory Council Meeting, March 13-15 2013, Lugano 1 NVIDIA: Parallel Computing Company GPUs: GeForce, Quadro,

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline Fermi/Kepler Architecture Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Kepler Overview Mark Ebersole

Kepler Overview Mark Ebersole Kepler Overview Mark Ebersole TFLOPS TFLOPS 3x Performance in a Single Generation 3.5 3 2.5 2 1.5 1 0.5 0 1.25 1 Single Precision FLOPS (SGEMM) 2.90 TFLOPS.89 TFLOPS.36 TFLOPS Xeon E5-2690 Tesla M2090

More information

Operational Robustness of Accelerator Aware MPI

Operational Robustness of Accelerator Aware MPI Operational Robustness of Accelerator Aware MPI Sadaf Alam Swiss National Supercomputing Centre (CSSC) Switzerland 2nd Annual MVAPICH User Group (MUG) Meeting, 2014 Computing Systems @ CSCS http://www.cscs.ch/computers

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

n N c CIni.o ewsrg.au

n N c CIni.o ewsrg.au @NCInews NCI and Raijin National Computational Infrastructure 2 Our Partners General purpose, highly parallel processors High FLOPs/watt and FLOPs/$ Unit of execution Kernel Separate memory subsystem GPGPU

More information

Fundamental CUDA Optimization. NVIDIA Corporation

Fundamental CUDA Optimization. NVIDIA Corporation Fundamental CUDA Optimization NVIDIA Corporation Outline! Fermi Architecture! Kernel optimizations! Launch configuration! Global memory throughput! Shared memory access! Instruction throughput / control

More information

CUDA OPTIMIZATIONS ISC 2011 Tutorial

CUDA OPTIMIZATIONS ISC 2011 Tutorial CUDA OPTIMIZATIONS ISC 2011 Tutorial Tim C. Schroeder, NVIDIA Corporation Outline Kernel optimizations Launch configuration Global memory throughput Shared memory access Instruction throughput / control

More information

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &

More information

Future Directions for CUDA Presented by Robert Strzodka

Future Directions for CUDA Presented by Robert Strzodka Future Directions for CUDA Presented by Robert Strzodka Authored by Mark Harris NVIDIA Corporation Platform for Parallel Computing Platform The CUDA Platform is a foundation that supports a diverse parallel

More information

arxiv: v1 [physics.comp-ph] 4 Nov 2013

arxiv: v1 [physics.comp-ph] 4 Nov 2013 arxiv:1311.0590v1 [physics.comp-ph] 4 Nov 2013 Performance of Kepler GTX Titan GPUs and Xeon Phi System, Weonjong Lee, and Jeonghwan Pak Lattice Gauge Theory Research Center, CTP, and FPRD, Department

More information

GPUs as better MPI Citizens

GPUs as better MPI Citizens s as better MPI Citizens Author: Dale Southard, NVIDIA Date: 4/6/2011 www.openfabrics.org 1 Technology Conference 2011 October 11-14 San Jose, CA The one event you can t afford to miss Learn about leading-edge

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

Technology for a better society. hetcomp.com

Technology for a better society. hetcomp.com Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction

More information

NVIDIA GPUs in Earth System Modelling Thomas Bradley

NVIDIA GPUs in Earth System Modelling Thomas Bradley NVIDIA GPUs in Earth System Modelling Thomas Bradley Agenda: GPU Developments for CWO Motivation for GPUs in CWO Parallelisation Considerations GPU Technology Roadmap MOTIVATION FOR GPUS IN CWO NVIDIA

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

Approaches to GPU Computing. Libraries, OpenACC Directives, and Languages

Approaches to GPU Computing. Libraries, OpenACC Directives, and Languages Approaches to GPU Computing Libraries, OpenACC Directives, and Languages Add GPUs: Accelerate Applications CPU GPU 146X 36X 18X 50X 100X Medical Imaging U of Utah Molecular Dynamics U of Illinois, Urbana

More information

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth

Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth Support for GPUs with GPUDirect RDMA in MVAPICH2 SC 13 NVIDIA Booth by D.K. Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda Outline Overview of MVAPICH2-GPU

More information

NEW FEATURES IN CUDA 6 MAKE GPU ACCELERATION EASIER MARK HARRIS

NEW FEATURES IN CUDA 6 MAKE GPU ACCELERATION EASIER MARK HARRIS NEW FEATURES IN CUDA 6 MAKE GPU ACCELERATION EASIER MARK HARRIS 1 Unified Memory CUDA 6 2 3 XT and Drop-in Libraries GPUDirect RDMA in MPI 4 Developer Tools 1 Unified Memory CUDA 6 2 3 XT and Drop-in Libraries

More information

Ian Buck, GM GPU Computing Software

Ian Buck, GM GPU Computing Software Ian Buck, GM GPU Computing Software History... GPGPU in 2004 GFLOPS recent trends multiplies per second (observed peak) NVIDIA NV30, 35, 40 ATI R300, 360, 420 Pentium 4 July 01 Jan 02 July 02 Jan 03 July

More information

CUDA Lecture 2. Manfred Liebmann. Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17

CUDA Lecture 2. Manfred Liebmann. Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 CUDA Lecture 2 Manfred Liebmann Technische Universität München Chair of Optimal Control Center for Mathematical Sciences, M17 manfred.liebmann@tum.de December 15, 2015 CUDA Programming Fundamentals CUDA

More information

Accelerating High Performance Computing.

Accelerating High Performance Computing. Accelerating High Performance Computing http://www.nvidia.com/tesla Computing The 3 rd Pillar of Science Drug Design Molecular Dynamics Seismic Imaging Reverse Time Migration Automotive Design Computational

More information

Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR

Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR Exploiting Full Potential of GPU Clusters with InfiniBand using MVAPICH2-GDR Presentation at Mellanox Theater () Dhabaleswar K. (DK) Panda - The Ohio State University panda@cse.ohio-state.edu Outline Communication

More information

S WHAT THE PROFILER IS TELLING YOU OPTIMIZING WHOLE APPLICATION PERFORMANCE. Mathias Wagner, Jakob Progsch GTC 2017

S WHAT THE PROFILER IS TELLING YOU OPTIMIZING WHOLE APPLICATION PERFORMANCE. Mathias Wagner, Jakob Progsch GTC 2017 S7445 - WHAT THE PROFILER IS TELLING YOU OPTIMIZING WHOLE APPLICATION PERFORMANCE Mathias Wagner, Jakob Progsch GTC 2017 BEFORE YOU START The five steps to enlightenment 1. Know your application What does

More information

Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2

Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2 Optimized Non-contiguous MPI Datatype Communication for GPU Clusters: Design, Implementation and Evaluation with MVAPICH2 H. Wang, S. Potluri, M. Luo, A. K. Singh, X. Ouyang, S. Sur, D. K. Panda Network-Based

More information

Using JURECA's GPU Nodes

Using JURECA's GPU Nodes Mitglied der Helmholtz-Gemeinschaft Using JURECA's GPU Nodes Willi Homberg Supercomputing Centre Jülich (JSC) Introduction to the usage and programming of supercomputer resources in Jülich 23-24 May 2016

More information

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming Overview Lecture 1: an introduction to CUDA Mike Giles mike.giles@maths.ox.ac.uk hardware view software view Oxford University Mathematical Institute Oxford e-research Centre Lecture 1 p. 1 Lecture 1 p.

More information

GPU Computing with OpenACC Directives Dr. Timo Stich Developer Technology Group NVIDIA Corporation

GPU Computing with OpenACC Directives Dr. Timo Stich Developer Technology Group NVIDIA Corporation GPU Computing with OpenACC Directives Dr. Timo Stich Developer Technology Group NVIDIA Corporation WHAT IS GPU COMPUTING? Add GPUs: Accelerate Science Applications CPU GPU Small Changes, Big Speed-up Application

More information

Introduction to OpenACC. 16 May 2013

Introduction to OpenACC. 16 May 2013 Introduction to OpenACC 16 May 2013 GPUs Reaching Broader Set of Developers 1,000,000 s 100,000 s Early Adopters Research Universities Supercomputing Centers Oil & Gas CAE CFD Finance Rendering Data Analytics

More information

CUDA Tools for Debugging and Profiling. Jiri Kraus (NVIDIA)

CUDA Tools for Debugging and Profiling. Jiri Kraus (NVIDIA) Mitglied der Helmholtz-Gemeinschaft CUDA Tools for Debugging and Profiling Jiri Kraus (NVIDIA) GPU Programming with CUDA@Jülich Supercomputing Centre Jülich 25-27 April 2016 What you will learn How to

More information

Timothy Lanfear, NVIDIA HPC

Timothy Lanfear, NVIDIA HPC GPU COMPUTING AND THE Timothy Lanfear, NVIDIA FUTURE OF HPC Exascale Computing will Enable Transformational Science Results First-principles simulation of combustion for new high-efficiency, lowemision

More information

Efficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory

Efficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory Institute of Computational Science Efficient CPU GPU data transfers CUDA 6.0 Unified Virtual Memory Juraj Kardoš (University of Lugano) July 9, 2014 Juraj Kardoš Efficient GPU data transfers July 9, 2014

More information

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand

Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Latest Advances in MVAPICH2 MPI Library for NVIDIA GPU Clusters with InfiniBand Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Steve Scott, Tesla CTO SC 11 November 15, 2011

Steve Scott, Tesla CTO SC 11 November 15, 2011 Steve Scott, Tesla CTO SC 11 November 15, 2011 What goal do these products have in common? Performance / W Exaflop Expectations First Exaflop Computer K Computer ~10 MW CM5 ~200 KW Not constant size, cost

More information
