CUDA. Matthew Joyner, Jeremy Williams


Agenda
- What is CUDA?
- CUDA GPU Architecture
- CPU/GPU Communication
- Coding in CUDA
- Use Cases of CUDA
- Comparison to OpenCL

What is CUDA?

What is CUDA?
- CUDA is a parallel computing platform and programming model developed by NVIDIA
- It simplifies the process of performing general-purpose computing on the GPU
- Capable of scheduling up to 30,720 concurrent threads per GPU (on the Kepler GK110 discussed below)
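
To make the programming model concrete, here is a minimal vector-addition sketch (our own illustration, not from the original slides; names such as vecAdd are hypothetical):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    // Each thread adds one element of a and b.
    __global__ void vecAdd(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
        if (i < n) c[i] = a[i] + b[i];
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);
        float *ha = (float*)malloc(bytes), *hb = (float*)malloc(bytes), *hc = (float*)malloc(bytes);
        for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

        float *da, *db, *dc;
        cudaMalloc((void**)&da, bytes);
        cudaMalloc((void**)&db, bytes);
        cudaMalloc((void**)&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        int threads = 256;
        int blocks = (n + threads - 1) / threads;        // round up
        vecAdd<<<blocks, threads>>>(da, db, dc, n);
        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);

        printf("c[0] = %f\n", hc[0]);                    // expect 3.000000
        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }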

CUDA GPU Architecture

Kepler GK110
- 15 SMX units
- 6x 64-bit memory controllers
- 1.5 MB L2 cache

Streaming Multiprocessor
- A CUDA GPU is built from a scalable array of Streaming Multiprocessors
- Each multiprocessor operates independently
- SMX is the multiprocessor used in the Kepler architecture

SMX
- 192 single-precision CUDA cores
- 64 double-precision CUDA cores
- 32 Special Function Units
- 32 Load/Store Units
- 16 Texture Units
- 4 Warp Schedulers, each with 2 instruction Dispatch Units
- 65,536 x 32-bit registers
- 64 KB shared memory/L1 cache
- 48 KB read-only data cache
- Max 2,048 concurrent threads; with 15 SMX units this results in 30,720 concurrent threads per GK110
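
Many of these limits can be queried at runtime with the CUDA runtime API; a short sketch (our own illustration):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);            // properties of device 0
        printf("Name: %s\n", prop.name);
        printf("Multiprocessors: %d\n", prop.multiProcessorCount);
        printf("Max threads per SM: %d\n", prop.maxThreadsPerMultiProcessor);
        printf("Registers per block: %d\n", prop.regsPerBlock);
        printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
        printf("L2 cache: %d bytes\n", prop.l2CacheSize);
        return 0;
    }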

Memory
SMX level:
- 65,536 x 32-bit registers; max 255 registers per thread
- 64 KB of shared memory/L1 cache; supports 16/48, 32/32, and 48/16 KB splits; can be configured per application or per kernel (see the sketch after this list)
- 48 KB of read-only data cache; this used to be the texture cache and could only be reached through the texture units, but Kepler allows the processing cores to access it directly
GPU level:
- 1.5 MB of L2 cache; the primary point of data unification across SMX units
- DRAM; amount varies based on graphics card make/model; uses Single-Error Correct Double-Error Detect (SECDED) Error Correction Code (ECC)
- The read-only data cache also supports single-error correction using a parity check
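
The per-application and per-kernel configuration maps to two CUDA runtime calls; a brief sketch (myKernel is a hypothetical kernel name):

    #include <cuda_runtime.h>

    __global__ void myKernel(void) { /* hypothetical work */ }

    int main(void) {
        // Per-application (device-wide) preference: favor the 48 KB shared split.
        cudaDeviceSetCacheConfig(cudaFuncCachePreferShared);

        // Per-kernel preference: favor 48 KB of L1 for this kernel only;
        // cudaFuncCachePreferEqual would request the 32 KB / 32 KB split.
        cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);

        myKernel<<<1, 1>>>();
        cudaDeviceSynchronize();
        return 0;
    }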

CPU/GPU Communication

Communication
- Code sent from the CPU to the GPU is in Parallel Thread Execution (PTX) form
- PTX is a pseudo-assembly language
- The graphics driver converts PTX into an executable binary for the target GPU
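
As an aside, you can inspect the PTX that nvcc generates for a source file with nvcc -ptx kernel.cu (kernel.cu is a placeholder file name), which writes the intermediate assembly to kernel.ptx.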

Communication
- Kernel: launched by the CPU and sent to the GPU for execution; assigned to the GPU; a kernel launch creates one grid
- Grid: 2D array of blocks; assigned to the GPU
- Block: group of concurrent threads; assigned to an SMX; broken down into warps by the SMX scheduler; the largest unit of synchronizable threads
- Warp: group of 32 threads; assigned to a warp scheduler in the SMX
- Thread: a sequence of instructions; assigned to an individual core
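
This hierarchy is visible directly in CUDA source code; a small sketch (our own illustration, whoAmI is a hypothetical name):

    #include <stdio.h>

    // Each thread can compute its position within the launch hierarchy.
    __global__ void whoAmI(void) {
        int globalId = blockIdx.x * blockDim.x + threadIdx.x;  // unique thread index
        int warpInBlock = threadIdx.x / 32;                    // warp within the block
        if (globalId == 0)
            printf("grid: %d blocks of %d threads (%d warps each)\n",
                   gridDim.x, blockDim.x, blockDim.x / 32);
        (void)warpInBlock;
    }

    int main(void) {
        // <<<blocks, threadsPerBlock>>> defines the grid: 16 blocks x 256 threads
        // = 4,096 threads, split by the SMX schedulers into warps of 32.
        whoAmI<<<16, 256>>>();
        cudaDeviceSynchronize();
        return 0;
    }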

Hyper-Q
- A GPU work queue can execute one kernel at a time
- Previous generations had a single work queue per GPU, resulting in poor utilization due to the inability to saturate the GPU
- Multiple processes could submit to that one queue, causing the scheduler to flag artificial dependencies between kernels from different processes
- Kepler implements Hyper-Q: the GPU now has 32 work queues
- Allows for better saturation of the GPU
- Multiple processes can submit to different queues, removing the artificial dependencies
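
In CUDA code the work queues are reached through streams; with Hyper-Q, kernels launched into separate streams can land in separate hardware queues. A minimal sketch (kernelA and kernelB are hypothetical kernels):

    #include <cuda_runtime.h>

    __global__ void kernelA(void) { /* hypothetical work */ }
    __global__ void kernelB(void) { /* hypothetical work */ }

    int main(void) {
        cudaStream_t s1, s2;
        cudaStreamCreate(&s1);
        cudaStreamCreate(&s2);

        // Kernels placed in different streams may run concurrently on Kepler,
        // rather than serializing through a single work queue.
        kernelA<<<1, 256, 0, s1>>>();
        kernelB<<<1, 256, 0, s2>>>();

        cudaStreamSynchronize(s1);
        cudaStreamSynchronize(s2);
        cudaStreamDestroy(s1);
        cudaStreamDestroy(s2);
        return 0;
    }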

Dynamic Parallelism
- Previous generations: kernels could only be launched by the CPU
- Kepler: kernels can be launched by both the CPU and the GPU
- Allows for less communication between the CPU and GPU
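
A sketch of a device-side launch (our own illustration; parentKernel and childKernel are hypothetical names; requires compute capability 3.5 and building with nvcc -arch=sm_35 -rdc=true -lcudadevrt):

    #include <cuda_runtime.h>

    __global__ void childKernel(int *data) {
        data[threadIdx.x] *= 2;            // runs as a child grid
    }

    __global__ void parentKernel(int *data) {
        if (threadIdx.x == 0) {
            // The GPU launches more work without a round trip to the CPU.
            childKernel<<<1, 32>>>(data);
        }
    }

    int main(void) {
        int *d;
        cudaMalloc((void**)&d, 32 * sizeof(int));
        cudaMemset(d, 0, 32 * sizeof(int));
        parentKernel<<<1, 32>>>(d);
        cudaDeviceSynchronize();           // waits for parent and child grids
        cudaFree(d);
        return 0;
    }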

GPUDirect
- Allows the GPU to communicate with other devices directly
- Eliminates CPU bandwidth and latency bottlenecks
- Currently supports a variety of InfiniBand adapters, NICs, and SSDs
- Will complement NVIDIA's planned Unified Virtual Address Space (UVA)
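
One GPUDirect capability, peer-to-peer transfer between two GPUs in the same system, is exposed through the runtime API; a brief sketch (our own illustration, assuming devices 0 and 1 exist):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int canAccess = 0;
        cudaDeviceCanAccessPeer(&canAccess, 0, 1);   // can device 0 reach device 1?
        if (!canAccess) { printf("no peer access\n"); return 0; }

        size_t bytes = 1 << 20;
        float *src1, *dst0;
        cudaSetDevice(1); cudaMalloc((void**)&src1, bytes);
        cudaSetDevice(0); cudaMalloc((void**)&dst0, bytes);
        cudaDeviceEnablePeerAccess(1, 0);            // flags argument must be 0

        // Direct device-to-device copy, without staging through host memory.
        cudaMemcpyPeer(dst0, 0, src1, 1, bytes);

        cudaFree(dst0);
        cudaSetDevice(1); cudaFree(src1);
        return 0;
    }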

Programming for CUDA

Code Modification
- Makes use of the OpenACC standard of compiler directives (similar to OpenMP)
- A program can be optimized for GPU acceleration without modifying the code itself, using only compiler directives
- Directives can instruct the compiler to automatically parallelize a given code block, or can be more specific

Example of Compiler Directives

    #include <stdio.h>
    #define N 1000000

    int main(void) {
        double pi = 0.0;
        long i;

    #pragma acc parallel loop reduction(+:pi)
        for (i = 0; i < N; i++) {
            double t = (double)((i + 0.5) / N);
            pi += 4.0 / (1.0 + t * t);
        }

        printf("pi=%16.15f\n", pi / N);
        return 0;
    }
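
Assuming an OpenACC-capable compiler such as PGI's, this builds without further source changes, e.g. pgcc -acc -Minfo=accel pi.c (pi.c is a placeholder file name); -Minfo=accel makes the compiler report which loops it offloaded to the GPU.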

CUDA-ready Libraries Many common math and science libraries have versions optimized for CUDA GPUs. Developers can simply replace their traditional libraries with the CUDA versions to gain speedups of up to 10x, depending on the operation.

CUDA-ready Libraries Popular math/science operations with CUDA libraries:
- BLAS
- FFT
- LAPACK
- OpenCV
- Sparse matrix functions
- Random number generation
- Ocean simulation
- Many others
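
As an illustration of the drop-in idea, a minimal cuBLAS sketch (SAXPY, y = alpha*x + y, replacing a CPU BLAS call; array names are our own; build with nvcc saxpy.cu -lcublas):

    #include <stdio.h>
    #include <cuda_runtime.h>
    #include <cublas_v2.h>

    int main(void) {
        const int n = 4;
        float hx[] = {1, 2, 3, 4}, hy[] = {10, 20, 30, 40};
        float *dx, *dy, alpha = 2.0f;

        cudaMalloc((void**)&dx, n * sizeof(float));
        cudaMalloc((void**)&dy, n * sizeof(float));
        cudaMemcpy(dx, hx, n * sizeof(float), cudaMemcpyHostToDevice);
        cudaMemcpy(dy, hy, n * sizeof(float), cudaMemcpyHostToDevice);

        cublasHandle_t handle;
        cublasCreate(&handle);
        cublasSaxpy(handle, n, &alpha, dx, 1, dy, 1);   // y = 2*x + y on the GPU
        cublasDestroy(handle);

        cudaMemcpy(hy, dy, n * sizeof(float), cudaMemcpyDeviceToHost);
        printf("%g %g %g %g\n", hy[0], hy[1], hy[2], hy[3]);  // 12 24 36 48
        cudaFree(dx); cudaFree(dy);
        return 0;
    }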

Use Cases of CUDA

Use Cases of CUDA - Titan Supercomputer
- Finished in October 2012 at the Oak Ridge National Laboratory
- #1 on the TOP500 list until June 2013
- 18,688 compute nodes, each with an AMD multi-core CPU and an NVIDIA K20x CUDA-based GPU
- First hybrid system (GPU and CPU) to perform at over 10 petaflops (17.59 petaflops LINPACK)
- The Gemini interconnect can transmit tens of millions of MPI messages per second at sub-microsecond latency, using a 3D torus network topology

Scientific Applications accelerated by Titan's CUDA upgrade Enhanced modeling of:
- Nuclear reactors
- Supernovae
- Molecular physics of combustion
- Earth's atmosphere
Some of these applications saw improvements of over 2x once GPUs were utilized.

Scientific Applications accelerated by CUDA

Application                        Speedup
Seismic Database                   66 to 100X
Mobile Phone Antenna Simulation    45X
Molecular Dynamics                 240X
Neuron Simulation                  100X
MRI Processing                     245 to 415X
Atmospheric Cloud Simulation       50X

From http://www.nvidia.com/object/io_43499.html

CUDA vs. OpenCL

OpenCL
- Not tied to a specific GPU architecture; built to enable acceleration by any co-processor (ASIC, DSP, FPGA, etc.)
- Implemented as a set of libraries and functions, which places limitations on what programming constructs can be used
- Requires a major rewrite to port existing code

Comparison
- CUDA is tied to a single hardware platform (NVIDIA's GPUs), while OpenCL can be implemented on any kind of co-processor
- Since OpenCL requires a rewrite and CUDA can be targeted with only compiler directives (via OpenACC), many more libraries have already been ported to CUDA
- The tradeoff is between code complexity and portability

Questions?