CUDA 6.0 Performance Report. April 2014



CUDA 6 Performance Report

CUDART: CUDA Runtime Library
cuFFT: Fast Fourier Transforms Library
cuBLAS: Complete BLAS Library
cuSPARSE: Sparse Matrix Library
cuRAND: Random Number Generation (RNG) Library
NPP: Performance Primitives for Image & Video Processing
Thrust: Templated Parallel Algorithms & Data Structures
math.h: C99 floating-point Library

Included in the CUDA Toolkit (free download): developer.nvidia.com/cuda-toolkit
For more information on CUDA libraries: developer.nvidia.com/gpu-accelerated-libraries

CUDA 6: 2x Faster GPU Kernel Launches

[Bar charts, dynamic parallel kernel launch time in usec (0-40), CUDA 5 vs. CUDA 6: 1.5x faster for back-to-back launches, 2.0x faster for launch-and-synchronize]

Back to Back Launches: a<<<...>>>; b<<<...>>>;
Launch and Synchronize: a<<<...>>>; cudaDeviceSynchronize();

Performance may vary based on OS version and motherboard configuration. CUDA 5.0 and CUDA 6.0 on Tesla K20.
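As a sketch of the two patterns being timed (assuming, per the chart title, that these are device-side launches via dynamic parallelism; the empty kernels a and b and the launch configurations are placeholders, not code from the report):

    // Requires compute capability >= 3.5 and nvcc -rdc=true.
    #include <cuda_runtime.h>

    __global__ void a() {}
    __global__ void b() {}

    __global__ void parent() {
        // Back-to-back launches: both child grids are queued
        // asynchronously; the parent does not wait between them.
        a<<<1, 128>>>();
        b<<<1, 128>>>();

        // Launch and synchronize: the parent blocks until the
        // child grid has finished.
        a<<<1, 128>>>();
        cudaDeviceSynchronize();
    }

    int main() {
        parent<<<1, 1>>>();
        cudaDeviceSynchronize();
        return 0;
    }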

cuFFT: Multi-dimensional FFTs

Real and complex; single- and double-precision data types
1D, 2D and 3D batched transforms
Flexible input and output data layouts

New in CUDA 6: XT interface supports dual-GPU cards (Tesla K10, GeForce GTX 690, ...)
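A minimal usage sketch of a batched 1D transform; the size and batch count are illustrative, not the benchmark configuration:

    #include <cufft.h>
    #include <cuda_runtime.h>

    int main() {
        const int n = 1024;     // transform size (example value)
        const int batch = 1000; // number of transforms in the batch

        // Device buffer holding `batch` contiguous signals of length n.
        cufftComplex *data;
        cudaMalloc(&data, sizeof(cufftComplex) * n * batch);

        // Plan a batch of 1D single-precision complex-to-complex FFTs.
        cufftHandle plan;
        cufftPlan1d(&plan, n, CUFFT_C2C, batch);

        // Execute all transforms in-place, forward direction.
        cufftExecC2C(plan, data, data, CUFFT_FORWARD);

        cufftDestroy(plan);
        cudaFree(data);
        return 0;
    }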

cuFFT: up to 700 GFLOPS

1D complex, batched FFTs. Used in audio processing and as a foundation for 2D and 3D FFTs.

[Line charts, GFLOPS vs. log2(transform size) from 1 to 25: single precision (0-800 GFLOPS) and double precision (0-300 GFLOPS)]

Performance may vary based on OS version and motherboard configuration. cuFFT 6.0 on K40c, ECC ON, 32M elements, input and output data on device.

cuFFT: Consistently High Performance

1D complex, batched FFTs. Used in audio processing and as a foundation for 2D and 3D FFTs.

[Line charts, GFLOPS vs. transform size from 1 to 1,000,000 on a log scale: single precision (0-800 GFLOPS) and double precision (0-300 GFLOPS)]

Performance may vary based on OS version and motherboard configuration. cuFFT 6.0 on K40c, ECC ON, 28M-33M elements, input and output data on device.

New in CUDA 6: cuFFT-XT Boosts Performance on K10

[Bar charts, execution time in ms, cuFFT vs. cuFFT-XT: 25% faster for a 256x256x256 transform, 65% faster for a 512x512x512 transform]

Performance may vary based on OS version and motherboard configuration. cuFFT 6.0 on K10, ECC ON, input and output data on device.
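A sketch of how the XT interface might be driven for this benchmark, assuming device IDs 0 and 1 are the two GPUs of the card and omitting error checking:

    #include <cufftXt.h>
    #include <cuda_runtime.h>
    #include <cstdlib>

    int main() {
        const int n = 512; // one 512^3 transform, as in the right-hand chart

        // Create an empty plan, then attach both GPUs of the dual-GPU card.
        cufftHandle plan;
        cufftCreate(&plan);
        int gpus[2] = {0, 1}; // device IDs are an assumption
        cufftXtSetGPUs(plan, 2, gpus);

        size_t workSizes[2];
        cufftMakePlan3d(plan, n, n, n, CUFFT_C2C, workSizes);

        // Host data to transform (left uninitialized for brevity).
        cufftComplex *host =
            (cufftComplex*)malloc(sizeof(cufftComplex) * (size_t)n * n * n);

        // Allocate library-managed storage split across the GPUs,
        // copy the data in, execute, and copy the result back.
        cudaLibXtDesc *desc;
        cufftXtMalloc(plan, &desc, CUFFT_XT_FORMAT_INPLACE);
        cufftXtMemcpy(plan, desc, host, CUFFT_COPY_HOST_TO_DEVICE);
        cufftXtExecDescriptorC2C(plan, desc, desc, CUFFT_FORWARD);
        cufftXtMemcpy(plan, host, desc, CUFFT_COPY_DEVICE_TO_HOST);

        cufftXtFree(desc);
        cufftDestroy(plan);
        free(host);
        return 0;
    }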

cuBLAS: Dense Linear Algebra on GPUs

Complete BLAS implementation plus useful extensions; supports all 152 standard routines for single, double, complex, and double complex. Host- and device-callable interface.

New in CUDA 6: XT Interface for Level 3 BLAS
Distributed computations across multiple GPUs
Out-of-core streaming to GPU, no upper limit on matrix size
Drop-in BLAS intercepts CPU BLAS calls, streams to GPU
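A minimal cuBLAS sketch of the SGEMM configuration used in the chart that follows; matrices are allocated but left unfilled for brevity:

    #include <cublas_v2.h>
    #include <cuda_runtime.h>

    int main() {
        const int n = 4096; // m = n = k = 4096, as in the benchmark
        const float alpha = 1.0f, beta = 0.0f;

        float *A, *B, *C;
        cudaMalloc(&A, sizeof(float) * n * n);
        cudaMalloc(&B, sizeof(float) * n * n);
        cudaMalloc(&C, sizeof(float) * n * n);

        cublasHandle_t handle;
        cublasCreate(&handle);

        // C = alpha * A * B + beta * C (column-major, no transpose).
        cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                    n, n, n, &alpha, A, n, B, n, &beta, C, n);

        cublasDestroy(handle);
        cudaFree(A); cudaFree(B); cudaFree(C);
        return 0;
    }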

cuBLAS: >3 TFLOPS single-precision, >1 TFLOPS double-precision

[Bar chart, GFLOPS (0-3500) for GEMM, SYMM, TRSM, and SYRK in each precision: SGEMM, SSYMM, STRSM, SSYRK; CGEMM, CSYMM, CTRSM, CSYRK; DGEMM, DSYMM, DTRSM, DSYRK; ZGEMM, ZSYMM, ZTRSM, ZSYRK]

Performance may vary based on OS version and motherboard configuration. cuBLAS 6.0 on K40m, ECC ON, input and output data on device; m=n=k=4096, transpose=no, side=right, fill=lower.

cuBLAS: ZGEMM 5x Faster than MKL

[Line chart, GFLOPS (0-1200) vs. matrix dimension (m=n=k, 500 to 4000): cuBLAS vs. MKL]

Performance may vary based on OS version and motherboard configuration. cuBLAS 6.0 on K40m, ECC ON, input and output data on device. MKL 11.0.4 on Intel IvyBridge 12-core E5-2697 v2 @ 2.7GHz.

New in CUDA 6: cuBLAS-XT Multi-GPU Performance Scaling

[Bar chart, 16K x 16K SGEMM on Tesla K10: 2.2 TFLOPS on 1 x K10, 4.2 TFLOPS on 2 x K10, 6.0 TFLOPS on 3 x K10, 7.9 TFLOPS on 4 x K10]

Performance may vary based on OS version and motherboard configuration. cuBLAS-XT 6.0 on K10, ECC ON, input and output data on host.
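A sketch of the same computation through the XT interface; the two device IDs are an assumption about the system layout. Note that cuBLAS-XT takes host-resident matrices and streams tiles to the selected GPUs, which is why matrix size is not limited by GPU memory:

    #include <cublasXt.h>
    #include <cstdlib>

    int main() {
        const size_t n = 16384; // 16K x 16K, matching the benchmark above
        const float alpha = 1.0f, beta = 0.0f;

        // Host-resident matrices (left unfilled for brevity).
        float *A = (float*)malloc(sizeof(float) * n * n);
        float *B = (float*)malloc(sizeof(float) * n * n);
        float *C = (float*)malloc(sizeof(float) * n * n);

        cublasXtHandle_t handle;
        cublasXtCreate(&handle);

        // Select the GPUs that will share the work.
        int devices[2] = {0, 1};
        cublasXtDeviceSelect(handle, 2, devices);

        // Same GEMM semantics as cuBLAS, but operands live on the host.
        cublasXtSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                      n, n, n, &alpha, A, n, B, n, &beta, C, n);

        cublasXtDestroy(handle);
        free(A); free(B); free(C);
        return 0;
    }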

cuSPARSE: Sparse Linear Algebra Routines

Optimized sparse linear algebra BLAS routines: matrix-vector, matrix-matrix, triangular solve. Support for a variety of formats (CSR, COO, block variants).

New in CUDA 6: many improvements to triangular solvers, incomplete-LU, and Cholesky preconditioners.

[Figure: sparse matrix-vector multiplication $y = \alpha A x + \beta y$, illustrated with a 4x4 sparse matrix with nonzero values 1.0 through 7.0]
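A sketch of SpMV as exposed in the cuSPARSE 6.0 era (the cusparse*csrmv entry points were later replaced by the generic cusparseSpMV API); the 4x4 dimensions are illustrative and the buffers are left unfilled:

    #include <cusparse.h>
    #include <cuda_runtime.h>

    int main() {
        // y = alpha * A * x + beta * y, with A in CSR format.
        const int m = 4, n = 4, nnz = 7;
        const float alpha = 1.0f, beta = 1.0f;

        float *csrVal, *x, *y;
        int *csrRowPtr, *csrColInd;
        cudaMalloc(&csrVal, sizeof(float) * nnz);
        cudaMalloc(&csrRowPtr, sizeof(int) * (m + 1));
        cudaMalloc(&csrColInd, sizeof(int) * nnz);
        cudaMalloc(&x, sizeof(float) * n);
        cudaMalloc(&y, sizeof(float) * m);
        // ... fill csrVal/csrRowPtr/csrColInd/x/y ...

        cusparseHandle_t handle;
        cusparseCreate(&handle);
        cusparseMatDescr_t descr;
        cusparseCreateMatDescr(&descr); // defaults: general, zero-based

        cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                       m, n, nnz, &alpha, descr,
                       csrVal, csrRowPtr, csrColInd, x, &beta, y);

        cusparseDestroyMatDescr(descr);
        cusparseDestroy(handle);
        return 0;
    }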

cuSPARSE: 5x Faster than MKL

[Bar chart, speedup over MKL (0x-6x) for sparse matrix x dense vector (SpMV), one bar per test matrix]

Performance may vary based on OS version and motherboard configuration. Average of s/c/d/z routines. cuSPARSE 6.0 on K40m, ECC ON, input and output data on device. MKL 11.0.4 on Intel IvyBridge 12-core E5-2697 v2 @ 2.7GHz. Matrices obtained from: http://www.cise.ufl.edu/research/sparse/matrices/

cuRAND: Random Number Generation

Generating high-quality random numbers in parallel is hard; don't do it yourself, use a library!

Pseudo- and quasi-RNGs. Supports several output distributions. Statistical test results in documentation.

New in CUDA 6: Mersenne Twister 19937 (MT19937).
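A host-API sketch using the MT19937 generator added in CUDA 6; the sample count and seed are arbitrary:

    #include <curand.h>
    #include <cuda_runtime.h>

    int main() {
        const size_t n = 1 << 20; // number of samples (example value)

        double *devData;
        cudaMalloc(&devData, n * sizeof(double));

        // Host-API generator; results are written to device memory.
        curandGenerator_t gen;
        curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MT19937);
        curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

        // Fill the device buffer with uniform doubles in (0, 1].
        curandGenerateUniformDouble(gen, devData, n);

        curandDestroyGenerator(gen);
        cudaFree(devData);
        return 0;
    }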

cuRAND: Up to 75x Faster vs. Intel MKL

[Bar chart, Gsamples/sec (0-16), cuRAND vs. MKL: Sobol32 and MRG32k3a generators under uniform, normal, and log-normal distributions]

Performance may vary based on OS version and motherboard configuration. cuRAND 6.0 on K40c, ECC ON, double-precision input and output data on device. MKL 11.0.1 on Intel SandyBridge 6-core E5-2620 @ 2.0 GHz.

cuRAND: High Performance RNGs

[Bar chart, Gsamples/sec (0-18): pseudo-random generators XORWOW, Philox, MRG32k3a, MTGP32 and quasi-random generators Sobol32 Scrambled, Sobol64 Scrambled under uniform, normal, and log-normal distributions]

Performance may vary based on OS version and motherboard configuration. cuRAND 6.0 on K40m, ECC ON, double-precision input and output data on device.

NPP: NVIDIA Performance Primitives

Over 5,000 image and signal processing routines: color transforms, geometric transforms, move operations, linear filters, image & signal statistics, image & signal arithmetic, JPEG building blocks, image segmentation.

New in CUDA 6: over 500 new routines, including median filter, BGR/YUV conversion, 3D LUT color conversion, improvements to JPEG primitives, plus many more.
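As a minimal sketch of the signal-arithmetic family (one of the simplest routines in npps.h; buffers are left unfilled for brevity):

    #include <npps.h>
    #include <cuda_runtime.h>

    int main() {
        const int len = 1 << 20; // signal length (example value)

        // Device buffers; NPP also provides nppsMalloc_32f for this.
        Npp32f *src1, *src2, *dst;
        cudaMalloc(&src1, len * sizeof(Npp32f));
        cudaMalloc(&src2, len * sizeof(Npp32f));
        cudaMalloc(&dst,  len * sizeof(Npp32f));

        // Element-wise signal addition: dst[i] = src1[i] + src2[i].
        nppsAdd_32f(src1, src2, dst, len);

        cudaFree(src1); cudaFree(src2); cudaFree(dst);
        return 0;
    }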

NPP Speedup vs. Intel IPP

[Bar chart, speedup over IPP (0x-30x) for: Image Set (8-bit RGB), Image Set Channel (8-bit RGB), Image Resize (8-bit RGB), Image Gaussian Filter (32-bit float), Color Conversion 8-bit YUV422 to 8-bit RGB, JPEG 8x8 Forward DCT; labeled speedups of 5.7x, 14.4x, 17.8x, 28.5x, 6.3x, and 12.9x]

Performance may vary based on OS version and motherboard configuration. NPP 6.0 on K40m, input and output data on device. IPP 7.0 on Intel IvyBridge 12-core E5-2697 v2 @ 2.7GHz.

Thrust: CUDA C++ Template Library

Template library for CUDA C++. Host and device containers that mimic the C++ STL. Optimized algorithms for sort, reduce, scan, etc. OpenMP backend for portability. Allows applications and prototypes to be built quickly.

Also available on GitHub: thrust.github.com
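A short sketch of the container/algorithm style, sized to match the benchmark on the next slide:

    #include <thrust/host_vector.h>
    #include <thrust/device_vector.h>
    #include <thrust/sort.h>
    #include <thrust/reduce.h>
    #include <cstdlib>

    int main() {
        // 32M random integers, matching the scale of the benchmark.
        thrust::host_vector<int> h(32 * 1024 * 1024);
        for (size_t i = 0; i < h.size(); ++i) h[i] = rand();

        // Copying to a device_vector moves the data to the GPU.
        thrust::device_vector<int> d = h;

        // Both algorithms dispatch to optimized CUDA implementations.
        thrust::sort(d.begin(), d.end());
        long long sum = thrust::reduce(d.begin(), d.end(), 0LL);

        (void)sum;
        return 0;
    }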

Thrust Performance vs. Intel TBB

[Bar charts: Thrust vs. TBB on 32M integers for reduce, transform, scan, and sort (left, 0x-14x); Thrust sort vs. TBB on 32M samples by data type char, short, int, long, float, double (right, 0x-50x); labeled speedups include 5.2x, 5.8x, 13.3x, 15.0x, 24.8x, and 43.7x]

Performance may vary based on OS version and motherboard configuration. Thrust v1.7.1 on K40m, ECC ON, input and output data on device. TBB 4.2 on Intel IvyBridge 12-core E5-2697 v2 @ 2.7GHz.

math.h: C99 floating-point library + extras

CUDA math.h is industry proven, high performance, accurate.
Basic: +, *, /, 1/, sqrt, FMA (all IEEE-754 accurate for float, double, all rounding modes)
Exponentials: exp, exp2, log, log2, log10, ...
Trigonometry: sin, cos, tan, asin, acos, atan2, sinh, cosh, asinh, acosh, ...
Special functions: lgamma, tgamma, erf, erfc
Utility: fmod, remquo, modf, trunc, round, ceil, floor, fabs, ...
Extras: rsqrt, rcbrt, exp10, sinpi, sincos[pi], cospi, erfinv, erfcinv, normcdf[inv], ...

New in CUDA 6:
Over 80 new SIMD instructions, useful for video processing: __v*2, __v*4
Cylindrical Bessel functions: cyl_bessel_i{0,1}
1/hypotenuse: rhypot
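A small kernel sketch exercising a few of the extras listed above; the kernel, its launch configuration, and the (uninitialized) buffers are illustrative only:

    #include <cuda_runtime.h>

    // Exercises sincospi, rsqrt, and erfinv from the extras list.
    // erfinvf expects inputs in (-1, 1); real code would fill `in`.
    __global__ void extras(const float *in, float *out, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) return;
        float s, c;
        sincospif(in[i], &s, &c);        // sin(pi*x) and cos(pi*x) in one call
        out[i] = rsqrtf(s * s + c * c)   // reciprocal square root
               + erfinvf(in[i]);         // inverse error function
    }

    int main() {
        const int n = 256;
        float *in, *out;
        cudaMalloc(&in, n * sizeof(float));
        cudaMalloc(&out, n * sizeof(float));
        extras<<<(n + 127) / 128, 128>>>(in, out, n);
        cudaDeviceSynchronize();
        cudaFree(in); cudaFree(out);
        return 0;
    }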