GPU Computing using CUDA C/C++ Dr. Timo Stich Developer Technology Group


1 GPU Computing using CUDA C/C++ Dr. Timo Stich Developer Technology Group

2 Why CUDA? Mainstream: massively parallel programming, over 300 million CUDA-capable GPUs sold, runs on GPU and CPU (PGI CUDA-x86). Additional compute resource: significant speedups of parallel computing tasks, offloads tasks from the CPU. Efficient: great perf/watt and perf/$ ratios.

3 Tesla GPUs Power 3 of Top 5 Supercomputers. #1: Tianhe-1A, 7168 Tesla GPUs, 2.5 PFLOPS. #3: Nebulae, 4650 Tesla GPUs, 1.2 PFLOPS. #4: Tsubame, Tesla GPUs, PFLOPS. "We not only created the world's fastest computer, but also implemented a heterogeneous computing architecture incorporating CPU and GPU; this is a new innovation." -- Premier Wen Jiabao, public comments acknowledging Tianhe-1A

4 World's Fastest MD Simulation: sustained performance of 1.87 petaflops. Institute of Process Engineering (IPE), Chinese Academy of Sciences (CAS). MD simulation of crystalline silicon; used all 7168 Tesla GPUs on the Tianhe-1A GPU supercomputer.

5 GPU Computing Research & Education. World-class research leadership and teaching: University of Cambridge, Harvard University, University of Utah, University of Tennessee, University of Maryland, University of Illinois at Urbana-Champaign, Tsinghua University, Tokyo Institute of Technology, Chinese Academy of Sciences, National Taiwan University, Georgia Institute of Technology. Proven research vision: Johns Hopkins University, Mass. Gen. Hospital/NE Univ, Nanyang University, North Carolina State University, Swinburne University of Tech., CSIRO, TU Munich, SINTEF, UCLA, HP Labs, University of New Mexico, ICHEC, University of Warsaw-ICM, Barcelona Supercomputing Center, Clemson University, VSB-Technical University of Ostrava (Czech Republic), Uni Bonn/Fraunhofer SCAI, TU Darmstadt, Karlsruhe Institute of Technology, and more coming shortly. Academic partnerships / fellowships. GPGPU education at 350+ universities.

6 Commercial Apps Accelerated by GPUs. Molecular dynamics: AMBER, CHARMM, DL_POLY, GROMACS, LAMMPS, NAMD. Fluid dynamics: Altair AcuSolve, Autodesk Moldflow, OpenFOAM, Prometech Particleworks, Turbostream. Earth sciences: ASUCA, HOMME, NASA GEOS-5, NOAA NIM, WRF. Engineering simulation: Agilent EMPro, ANSYS Mechanical, ANSYS Nexxim, CST Microwave Studio, Impetus AFEA, Remcom XFdtd, SIMULIA Abaqus. Others: GADGET2, MATLAB, Mathematica, NBODY, Paradigm VoxelGeo, PARATEC, Schlumberger Petrel.

7 Rich Toolchain & Ecosystem for Fast Ramp-up on GPUs. Numerical packages: MATLAB, Mathematica, NI LabVIEW, PyCUDA. Debuggers & profilers: cuda-gdb, NV Visual Profiler, Parallel Nsight for Visual Studio, Allinea, TotalView. GPU compilers: C, C++, Fortran, OpenCL, DirectCompute, Java, Python. Parallelizing compilers: PGI Accelerator, CAPS HMPP, MCUDA, OpenMP. Libraries: BLAS, FFT, LAPACK, NPP, Sparse, Imaging, RNG. GPGPU consultants & training, OEM solution providers: ANEO, GPU Tech.

8 GPU COMPUTING FLOW

9-12 Processing Flow (PCIe Bus)
1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Data is persistent in GPU memory; while the GPU is computing, data can also be copied
4. Copy results from GPU memory to CPU memory
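
A minimal host-side sketch of this four-step flow (the kernel name process is hypothetical; error checking omitted):

    #include <cuda_runtime.h>

    __global__ void process(float *data, int n);                   // hypothetical kernel

    void run(float *h_in, float *h_out, int n)
    {
        float *d_data;
        size_t bytes = n * sizeof(float);

        cudaMalloc((void **)&d_data, bytes);                       // allocate GPU memory
        cudaMemcpy(d_data, h_in, bytes, cudaMemcpyHostToDevice);   // 1. copy input CPU -> GPU

        process<<<(n + 255) / 256, 256>>>(d_data, n);              // 2. launch GPU program

        // 3. d_data stays resident in GPU memory between kernel launches

        cudaMemcpy(h_out, d_data, bytes, cudaMemcpyDeviceToHost);  // 4. copy results GPU -> CPU
        cudaFree(d_data);
    }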

13 NVIDIA GPUDirect (new with CUDA 4.0). Efficient access to GPU memory for 3rd-party devices over the PCIe bus and InfiniBand NICs (Mellanox & QLogic). GPU peer-to-peer memory access, transfers & synchronization. MPI implementations natively support GPU data transfers (MVAPICH2-GPU, OpenMPI).

14 GPU ARCHITECTURE

15 GPU Architecture: Two Main Components.
Global memory: analogous to RAM in a CPU server; accessible by both GPU and CPU; currently up to 6 GB; bandwidth currently up to 150 GB/s for Quadro and Tesla products; ECC on/off option for Quadro and Tesla products.
Streaming Multiprocessors (SMs): perform the actual computations; each SM has its own control units, registers, execution pipelines, and caches.
[Diagram: SMs around an L2 cache, with GigaThread scheduler, host interface, and six DRAM interfaces]

16 GPU Architecture Fermi: Streaming Multiprocessor (SM). 32 CUDA cores per SM: 32 fp32 ops/clock, 16 fp64 ops/clock, 32 int32 ops/clock. 2 warp schedulers; up to 1536 threads concurrently. 4 special-function units. 64 KB shared memory + L1 cache. 32K 32-bit registers. [Diagram: SM with instruction cache, dual scheduler/dispatch, register file, 32 cores, 16 load/store units, 4 SFUs, interconnect network, 64 KB configurable cache/shared memory, uniform cache]

17 GPU Architecture Fermi: CUDA Core. Floating-point & integer unit: IEEE 754 floating-point standard; fused multiply-add (FMA) instruction for both single and double precision. Logic unit. Move, compare unit. Branch unit. [Diagram: CUDA core with dispatch port, operand collector, FP unit, INT unit, and result queue]

18 GPU Architecture Fermi: Memory System. L1: 16 or 48 KB per SM, selectable by the program; hardware-managed; aggregate bandwidth per GPU: 1.03 TB/s. Shared memory: user-managed scratch-pad; hardware will not evict until threads overwrite; 16 or 48 KB per SM (the 64 KB total is split between shared memory and L1); aggregate bandwidth per GPU: 1.03 TB/s.
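
The 16/48 KB split is selected per kernel from host code. A sketch using the runtime API (myKernel is a placeholder name):

    // Prefer 48 KB shared memory / 16 KB L1 for this kernel
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferShared);

    // ...or prefer 48 KB L1 / 16 KB shared memory instead
    cudaFuncSetCacheConfig(myKernel, cudaFuncCachePreferL1);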

19 GPU Architecture Fermi: Memory System. ECC protection: DRAM ECC supported for GDDR5 memory. All major internal memories are ECC protected: register file, L1 cache, L2 cache.

20 CUDA PROGRAMMING MODEL

21 Anatomy of a CUDA C/C++ Application. Serial code executes in a host (CPU) thread. Parallel code executes in many device (GPU) threads across multiple processing elements. [Diagram: application alternating between serial host (CPU) code and parallel device (GPU) code]

22 Compiling CUDA C Applications. Starting point, a serial C application:

    void serial_function(...) { ... }
    void other_function(int ...) { ... }
    void saxpy_serial(float ...)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    void main(...)
    {
        float x;
        saxpy_serial(...);
        ...
    }

Modify it into parallel CUDA C code. NVCC (Open64) compiles the CUDA C functions into CUDA object files; the CPU compiler compiles the rest of the C application into CPU object files; the linker combines both into a single CPU-GPU executable.

23 CUDA C: C with a few keywords.

Standard C code:

    void saxpy_serial(int n, float a, float *x, float *y)
    {
        for (int i = 0; i < n; ++i)
            y[i] = a*x[i] + y[i];
    }
    // Invoke serial SAXPY kernel
    saxpy_serial(n, 2.0, x, y);

Parallel C code:

    __global__ void saxpy_parallel(int n, float a, float *x, float *y)
    {
        int i = blockIdx.x*blockDim.x + threadIdx.x;
        if (i < n) y[i] = a*x[i] + y[i];
    }
    // Invoke parallel SAXPY kernel with 256 threads/block
    int nblocks = (n + 255) / 256;
    saxpy_parallel<<<nblocks, 256>>>(n, 2.0, x, y);

24 CUDA C: C with a few keywords. Kernel: a function called by the host that executes on the GPU. It can only access GPU memory; no variable number of arguments; no static variables; no recursion before Fermi (since Fermi, recursion is supported). Functions must be declared with a qualifier: __global__: GPU kernel function, launched by the CPU, must return void. __device__: can be called from GPU functions. __host__: can be called from CPU functions (default). The __host__ and __device__ qualifiers can be combined.
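
A short sketch of the qualifiers (function names are illustrative):

    __device__ float square(float x)           // callable from GPU code only
    {
        return x * x;
    }

    __host__ __device__ float twice(float x)   // compiled for both CPU and GPU
    {
        return 2.0f * x;
    }

    __global__ void kernel(float *out)         // launched by the CPU, runs on the GPU, returns void
    {
        out[threadIdx.x] = square(twice(1.0f));
    }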

25 CUDA Kernels. The parallel portion of an application executes as a kernel: the entire GPU executes the kernel, with many threads. CUDA threads are lightweight, switch fast, and execute in the 1000s simultaneously. The CPU (host) executes functions; the GPU (device) executes kernels.

26 CUDA Kernels: Parallel Threads. A kernel is a function executed on the GPU as an array of threads in parallel. All threads execute the same code but can take different paths:

    float x = input[threadIdx.x];
    float y = func(x);
    output[threadIdx.x] = y;

Each thread has an ID, used to select input/output data and make control decisions.

27-31 CUDA Kernels: Subdivide into Blocks. Threads are grouped into blocks. Blocks are grouped into a grid. A kernel is executed on the GPU as a grid of blocks of threads.
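
Grids and blocks can be multi-dimensional; a sketch of a 2D launch with dim3 (kernel and variable names are illustrative):

    __global__ void fill(float *img, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   // global 2D coordinates
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x < width && y < height)
            img[y * width + x] = 1.0f;
    }

    // Launch: a 2D grid of 16x16-thread blocks covering the image
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    fill<<<grid, block>>>(d_img, width, height);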

32 Kernel Execution. Each CUDA thread is executed by a CUDA core. Each thread block is executed by one SM and does not migrate; several concurrent blocks can reside on one SM, depending on the blocks' resource requirements and the SM's available resources. Each kernel grid is executed on one device; multiple kernels can execute on a device at one time.

33 Thread blocks allow cooperation. Threads may need to cooperate: cooperatively load/store blocks of memory that all will use; share results with each other or cooperate to produce a single result; synchronize with each other. [Diagram: SM with instruction cache, schedulers, register file, cores, load/store units, SFUs, interconnect network, and configurable cache/shared memory]
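
A minimal sketch of such cooperation: threads stage data in shared memory, synchronize, then read values written by other threads (assumes a 1D block of 256 threads):

    __global__ void reverseBlock(int *d)
    {
        __shared__ int s[256];             // one element per thread
        int t = threadIdx.x;

        s[t] = d[t];                       // cooperative load into shared memory
        __syncthreads();                   // wait until every thread has written

        d[t] = s[blockDim.x - 1 - t];      // read an element another thread stored
    }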

34 Thread blocks allow scalability. Blocks can execute in any order, concurrently or sequentially. This independence between blocks gives scalability: a kernel scales across any number of SMs. [Diagram: the same 8-block kernel grid scheduled onto a device with 2 SMs and onto a device with 4 SMs]

35 CUDA MEMORY SYSTEM

36-38 Memory hierarchy. Per thread: registers and local memory. Per block of threads: shared memory.

39 Memory hierarchy: Shared memory.

    __shared__ int a[size];

Allocated per thread block, same lifetime as the block. Accessible by any thread in the block. Latency: a few cycles. High aggregate bandwidth: 14 SMs * 32 banks * 4 B * 1.15 GHz / 2 = 1.03 TB/s. Several uses: sharing data among threads in a block; user-managed cache (reducing gmem accesses).

40 Memory hierarchy (full picture). Per thread: registers and local memory. Per block of threads: shared memory. Across all blocks: global memory.

41 Memory hierarchy: Global memory. Accessible by all threads of any kernel. Data lifetime: from allocation to deallocation by host code:

    cudaMalloc(void **pointer, size_t nbytes)
    cudaMemset(void *pointer, int value, size_t count)
    cudaFree(void *pointer)

Latency: hundreds of cycles. Bandwidth: 156 GB/s. Note: there are requirements on the access pattern to reach peak performance.

42 HARDWARE DETAILS

43 Warps: Vectors of Threads. A warp is a group of 32 threads that share one program counter (PC). Threads of a warp execute in lock-step (similar to a vector processor). Divergence within warps is supported at the HW level through automatic serialization / instruction replays. This improves maintainability and flexibility over explicit vector programming. [Diagram: one warp spanning ThreadIdx 0-31]

44-46 Divergence Examples: No Serialization (Warp 0, Warp 1)

    // ...
    if (threadIdx.x < 32) {
        // Do something
    } else {
        // Do something else
    }
    // ...

The branch condition is uniform within each warp (warp 0 takes the if-branch, warp 1 the else-branch), so no serialization occurs.

47-50 Divergence Examples: Serialization (Warp 0, Warp 1)

    // ...
    if (threadIdx.x < 16) {
        // Do something
    } else {
        // Do something else
    }
    // ...

Here warp 0 diverges (threads 0-15 take the if-branch, threads 16-31 the else-branch), so both paths are serialized within that warp.

51 Memory Operations. Memory operations are issued per warp (32 threads): the threads in a warp provide memory addresses; the hardware determines which lines/segments are needed and requests them. Hardware supports two types of loads. Full caching (default mode): attempts to hit in L1, then L2, then GMEM; load granularity is a 128-byte line. L2-only: compile with the -Xptxas -dlcm=cg option to nvcc; attempts to hit in L2, then GMEM; load granularity is 32 bytes. Stores: invalidate L1, write-back for L2.

52 Caching Load. Warp requests 32 aligned, consecutive 4-byte words. Addresses fall within 1 cache line. Warp needs 128 bytes; 128 bytes move across the bus on a miss. Bus utilization: 100%.

53 L2-only Load. Warp requests 32 aligned, consecutive 4-byte words. Addresses fall within 4 segments. Warp needs 128 bytes; 128 bytes move across the bus on a miss. Bus utilization: 100%.

54 Caching Load. Warp requests 32 aligned, permuted 4-byte words. Addresses fall within 1 cache line. Warp needs 128 bytes; 128 bytes move across the bus on a miss. Bus utilization: 100%.

55 L2-only Load. Warp requests 32 aligned, permuted 4-byte words. Addresses fall within 4 segments. Warp needs 128 bytes; 128 bytes move across the bus on a miss. Bus utilization: 100%.

56 Caching Load. Warp requests 32 misaligned, consecutive 4-byte words. Addresses fall within 2 cache lines. Warp needs 128 bytes; 256 bytes move across the bus on misses. Bus utilization: 50%.

57 L2-only Load. Warp requests 32 misaligned, consecutive 4-byte words. Addresses fall within at most 5 segments. Warp needs 128 bytes; 160 bytes move across the bus on misses. Bus utilization: at least 80% (some misaligned patterns fall within 4 segments, giving 100% utilization).

58 Optimization Strategies SHARED MEMORY

59 Shared Memory. Organization: 32 banks, 4 bytes wide each; successive 4-byte words belong to different banks. Performance: 4 bytes per bank per 2 clocks per multiprocessor; smem accesses are issued per 32 threads (warp). Serialization: if n threads of the 32 access different 4-byte words in the same bank, the n accesses are executed serially. Multicast: n threads accessing the same word get it in one fetch (they may even read different bytes within the same word). Prior to Fermi, only broadcast was available, and sub-word accesses within the same bank caused serialization.

60 Bank Addressing Examples: No Bank Conflicts. [Diagram: two access patterns, each mapping threads 0-31 one-to-one onto banks 0-31, so no conflicts occur]

61 Bank Addressing Examples: 2-way and 8-way Bank Conflicts. [Diagram: left, pairs of threads share a bank (2-way conflict); right, groups of eight threads share a bank (8-way conflict, marked x8)]

62 Shared Memory: Avoiding Bank Conflicts. A 32x32 float array in SMEM: when a warp accesses a column, all 32 threads hit the same bank, giving 32-way bank conflicts. [Diagram: 32x32 array with each column falling entirely within one bank]

63 Shared Memory: Avoiding Bank Conflicts. Add a column for padding: a 32x33 SMEM array. When a warp accesses a column, the 32 accesses now fall into 32 different banks: no bank conflicts. [Diagram: 32x33 array; the padding column shifts successive column elements across banks]
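
A sketch of the padding trick in context, transposing a 32x32 tile (the 33rd column is never used; it only shifts the bank mapping):

    __global__ void transposeTile(float *out, const float *in)
    {
        __shared__ float tile[32][33];          // 33, not 32: the extra column avoids bank conflicts

        int x = threadIdx.x, y = threadIdx.y;   // launched as one 32x32 block
        tile[y][x] = in[y * 32 + x];            // row-wise write: conflict-free
        __syncthreads();
        out[y * 32 + x] = tile[x][y];           // column-wise read: conflict-free thanks to padding
    }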

64 Measure Before You Optimize. "Premature optimization is the root of all evil." -- Donald Knuth. Use the Visual Profiler: identify the limiting factor; analyze instruction throughput; analyze memory throughput; analyze kernel occupancy; compare to theoretical device limits.

65 CUDA 4.0: Highlights. Easier parallel application porting: share GPUs across multiple threads; single-thread access to all GPUs; no-copy pinning of system memory; new CUDA C/C++ features; Thrust templated primitives library; NPP image/video processing library; layered textures. Faster multi-GPU programming: NVIDIA GPUDirect v2.0; peer-to-peer access; peer-to-peer transfers; unified virtual addressing. New & improved developer tools: auto performance analysis; C++ debugging; GPU binary disassembler; cuda-gdb for MacOS.

66 CUDA Programming Resources. CUDA Toolkit (current version 4.0): compiler, libraries, and documentation; free download for Windows, Linux, and MacOS. GPU Computing SDK: code samples, whitepapers. Instructional materials on the NVIDIA developer site: CUDA introduction & optimization webinar (slides and audio); parallel programming course at the University of Illinois (UIUC); tutorials; forums.

67 GPU Tools (more in the hands-on session). Profiler: available for all supported OSs; command-line or GUI; samples signals on the GPU for memory access parameters and execution (serialization, divergence); automatic performance analysis (new in CUDA 4.0). Debugger: cuda-gdb on Linux, Parallel Nsight on Windows; true step debugging on the GPU.

68 GPU DIRECT

69 MPI Integration of GPUDirect. GPUDirect + UVA allow native support of GPU data transfers in MPI implementations: they enable MPI_Send / MPI_Recv from/to GPU memory; the MPI library can differentiate between device memory and host memory without any hints from the user; the programmer doesn't have to deal with low-level details.

Code without MPI integration. At the sender:

    cudaMemcpy(s_buf, s_device, size, cudaMemcpyDeviceToHost);
    MPI_Send(s_buf, size, MPI_CHAR, 1, 1, MPI_COMM_WORLD);

At the receiver:

    MPI_Recv(r_buf, size, MPI_CHAR, 0, 1, MPI_COMM_WORLD, &status);
    cudaMemcpy(r_device, r_buf, size, cudaMemcpyHostToDevice);

Code with MPI integration. At the sender:

    MPI_Send(s_device, size, ...);

At the receiver:

    MPI_Recv(r_device, size, ...);

70 MVAPICH2-GPU: upcoming MVAPICH2 support for GPU-GPU communication. (Measurements from: H. Wang, S. Potluri, M. Luo, A. Singh, S. Sur and D. K. Panda, "MVAPICH2-GPU: Optimized GPU to GPU Communication for InfiniBand Clusters", Int'l Supercomputing Conference 2011 (ISC), Hamburg.) Two-sided: with GPUDirect, 45% improvement compared to Memcpy+Send (4 MB) and 24% compared to MemcpyAsync+Isend (4 MB); without GPUDirect, 38% compared to Memcpy+Send (4 MB) and 33% compared to MemcpyAsync+Isend (4 MB). One-sided: with GPUDirect, 45% improvement compared to Memcpy+Put; without GPUDirect, 39% compared to Memcpy+Put; similar improvement for the Get operation. Major improvement in programming.

71 MVAPICH2-GPU (measurements from the same ISC 2011 paper by Wang et al.). Collectives: with GPUDirect, 37% improvement for MPI_Gather (1 MB) and 32% for MPI_Scatter (4 MB); without GPUDirect, 33% for MPI_Gather (1 MB) and 23% for MPI_Scatter (4 MB). With GPUDirect, 30% improvement for 4 MB messages; without GPUDirect, 26% for 4 MB messages.

72 OpenMPI. OpenMPI support to send data directly from CUDA device memory via MPI calls, developed by Rolf vandeVaart (NVIDIA). The code is currently available in the trunk. By default, the code is disabled and has to be configured into the library: --with-cuda(=DIR) builds CUDA support, optionally adding DIR/include, DIR/lib, and DIR/lib64; --with-cuda-libdir=DIR searches for CUDA libraries in DIR.

73 CUDA Math Libraries. High-performance math routines for your applications: cufft (Fast Fourier Transforms library), cublas (complete BLAS library), cusparse (sparse matrix library), curand (random number generation, RNG, library), NPP (performance primitives for image & video processing), Thrust (templated parallel algorithms & data structures), math.h (C99 floating-point library). All included in the CUDA Toolkit (free download).

74 cufft: Multi-dimensional FFTs. New in CUDA 4.0: significant performance improvements in double precision, radix 2, 3, 5 and 7, and 2D/3D sizes that contain prime factors larger than 7. Flexible input and output data layouts*, similar to the FFTW advanced interface, eliminating extra data transposes and copies. (*Only supported for complex-to-complex transforms in this release.)
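
A minimal sketch of a 1D complex-to-complex transform (in-place, batch of 1; error checking omitted):

    #include <cufft.h>

    cufftHandle plan;
    cufftComplex *d_data;                          // N complex values in device memory

    cudaMalloc((void **)&d_data, N * sizeof(cufftComplex));
    cufftPlan1d(&plan, N, CUFFT_C2C, 1);           // 1D C2C plan, batch = 1
    cufftExecC2C(plan, d_data, d_data, CUFFT_FORWARD);
    cufftDestroy(plan);
    cudaFree(d_data);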

75 FFTs up to 10x Faster than MKL. 1D FFTs are used in audio processing and as a foundation for 2D and 3D FFTs. [Charts: GFLOPS vs. log2(size) for cufft in single and double precision, radix 2/3/5/7, compared against MKL.] MKL 10.1r1 on Intel quad-core Core i7, 2.93 GHz; cufft 4.0 on Tesla C2070, ECC on. Performance measured for ~16M total elements, split into batches of transforms of the size on the x-axis.

76 2D/3D Primes Now Use the Bluestein Algorithm. Significant performance improvement for 2D and 3D transform sizes. [Chart: GFLOPS vs. transform size (NxN, for prime N), single-precision 2D, cufft 4.0 vs. cufft 3.2 vs. MKL.] MKL 10.1r1 on Intel quad-core Core i7, 2.93 GHz; cufft 4.0 on C2070, ECC on.

77 cublas: Dense Linear Algebra on GPUs. Complete BLAS implementation plus useful extensions: supports all 152 standard routines for single, double, complex, and double complex. New in CUDA 4.0: a new API that facilitates multi-GPU programming, is thread-safe, and lets more routines provide parallelism using streams (the previous legacy API is still supported out-of-the-box); documentation rewritten from scratch; performance improvements (e.g., ZGEMM performance improved 10% on Fermi, 325 GFLOPS peak on C2050).
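
A sketch of the new handle-based API (SGEMM on N x N matrices already resident on the device; error checking omitted):

    #include <cublas_v2.h>

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha*A*B + beta*C, column-major; d_A, d_B, d_C allocated with cudaMalloc
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                N, N, N, &alpha, d_A, N, d_B, N, &beta, d_C, N);

    cublasDestroy(handle);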

78 cublas Level 3 Performance: up to ~800 GFLOPS and ~17x speedup over MKL. [Charts: GFLOPS and speedup over MKL for the Level 3 routines GEMM, SYMM/HEMM, SYRK, TRMM, and TRSM in single, complex, double, and double-complex precision.] 4Kx4K matrix size; cublas 4.0, Tesla C2050 (Fermi), ECC on; MKL, 4-core 2.66 GHz.

79 ZGEMM Performance vs. Matrix Size: up to 8x speedup over MKL. cublas 4.0 reaches 97% of peak performance at 1024x1024; MKL reaches 80% of peak at 1024x1024. [Chart: GFLOPS vs. matrix size, cublas 4.0 vs. MKL.] Performance may vary based on OS version and motherboard configuration. cublas 4.0, Tesla C2050 (Fermi), ECC on; MKL, 4-core 2.66 GHz.

80 cusparse: Sparse Linear Algebra Routines. Conversion routines for dense, COO, CSR and CSC formats. Optimized sparse matrix-vector multiplication for CSR, computing y = \alpha \cdot op(A) \cdot x + \beta \cdot y. New: sparse triangular solve. CUDA 4.0 API optimized for common iterative solve algorithms.

81 cusparse is up to 6x Faster than MKL: sparse matrix x dense vector (csrmv). [Chart: speedup over MKL for single, double, complex, and double-complex precision.] Performance may vary based on OS version and motherboard configuration. cusparse 4.0, NVIDIA C2050 (Fermi), ECC on; MKL, 4-core 3.07 GHz.

82 Up to 35x Faster with 6 Dense Vectors (csrmm): useful for block iterative solve schemes. [Chart: speedup over MKL for single, double, complex, and double-complex precision.] Performance may vary based on OS version and motherboard configuration. cusparse 4.0, NVIDIA C2050 (Fermi), ECC on; MKL, 4-core 3.07 GHz.

83 curand: Random Number Generation. New in CUDA 4.0: scrambled and 64-bit Sobol'; log-normal distribution; new parallel ordering supports faster XORWOW initialization. Results of curand generators against standard statistical test batteries are reported in the documentation.
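
A sketch of the host-side generator API, filling a device buffer with uniform floats (d_data and n are assumed to exist):

    #include <curand.h>

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_XORWOW);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);
    curandGenerateUniform(gen, d_data, n);     // n floats in (0,1], written to device memory
    curandDestroyGenerator(gen);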

84 curand Performance: curand 64-bit scrambled Sobol' is 8x faster than MKL's 32-bit plain Sobol'. [Chart: GSamples/sec for curand XORWOW, 32-bit Sobol', 32-bit scrambled Sobol', 64-bit Sobol', 64-bit scrambled Sobol', and single-threaded MKL 32-bit Sobol', for uniform, normal, and log-normal distributions in single and double precision.] Performance may vary based on OS version and motherboard configuration. curand 4.0, NVIDIA C2050 (Fermi), ECC on.

85 NVIDIA Performance Primitives: up to 40x speedups. Arithmetic, logic, conversions, filters, statistics, etc. ~420 image functions (+70 in 4.0); ~500 signal functions (+400 in 4.0). The majority of primitives are 5x to 10x faster than analogous routines in Intel IPP. NPP 4.0, NVIDIA C2050 (Fermi); IPP 6.1, dual-socket Core i7 2.67 GHz.

86 Thrust: CUDA C++ Template Library. Added to the CUDA Toolkit as of CUDA 4.0 (also available on Google Code). Template library for CUDA: host and device containers that mimic the C++ STL; optimized algorithms for sort, reduce, scan, etc.; OpenMP backend for portability. Allows applications and prototypes to be built quickly.
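
A sketch of the STL-like style: copy host data to the device, sort it, then reduce it (h_vec is an assumed std::vector<int>):

    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>

    thrust::device_vector<int> d_vec(h_vec.begin(), h_vec.end());  // host -> device copy
    thrust::sort(d_vec.begin(), d_vec.end());                      // parallel sort on the GPU
    int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0);       // parallel reduction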

87 Thrust Algorithm Performance. [Charts: speedup over C++ STL for Thrust 4.0 and Intel TBB. Left: various algorithms (reduce, transform, scan, sort) on 32M integers. Right: sort on 32M samples across datatypes (char, short, int, long, float, double).] Thrust 4.0, NVIDIA Tesla C2050 (Fermi); Core i7 3.07 GHz.

88 math.h: C99 Floating-Point Library + Extras. CUDA math.h is industry proven, high performance, high accuracy. Basics: +, *, /, 1/x, sqrt, FMA (all IEEE-754 accurate for float and double, all rounding modes). Exponentials: exp, exp2, log, log2, log10, ... Trigonometry: sin, cos, tan, asin, acos, atan2, sinh, cosh, asinh, acosh, ... Special functions: lgamma, tgamma, erf, erfc. Utility: fmod, remquo, modf, trunc, round, ceil, floor, fabs, ... Extras: rsqrt, rcbrt, exp10, sinpi, sincos, cospi, erfinv, erfcinv, ... Performance improvements in CUDA 4.0: double-precision /, rsqrt(), erfc(), and sinh() are all >~30% faster on Fermi; cospi() added in CUDA 4.0.
