CUDA Toolkit 5.0 Performance Report. January 2013


CUDA Math Libraries
High performance math routines for your applications:
- cufft: Fast Fourier Transforms library
- cublas: complete BLAS library
- cusparse: sparse matrix library
- curand: random number generation (RNG) library
- NPP: performance primitives for image & video processing
- Thrust: templated parallel algorithms & data structures
- math.h: C99 floating-point library

Included in the CUDA Toolkit (free download): www.nvidia.com/getcuda
For more information on CUDA libraries: http://developer.download.nvidia.com/gtc/pdf/gtc2012/presentationpdf/s0629-monday-cuda-accelerated.pdf

cufft: Multi-dimensional FFTs
- Real and complex data types
- Single- and double-precision
- 1D, 2D and 3D batched transforms
- Flexible input and output data layouts, similar to the FFTW Advanced Interface
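As a concrete illustration of the interface described above, here is a minimal sketch of a batched single-precision 1D complex-to-complex transform. The transform size and batch count are illustrative, not values taken from the report:

```c
#include <cufft.h>
#include <cuda_runtime.h>

int main(void)
{
    const int n = 1 << 20;  /* transform size (illustrative) */
    const int batch = 8;    /* number of 1D transforms executed in one call */

    cufftComplex *data;
    cudaMalloc((void **)&data, sizeof(cufftComplex) * n * batch);
    /* ... fill `data` on the device ... */

    cufftHandle plan;
    cufftPlan1d(&plan, n, CUFFT_C2C, batch);       /* batched 1D C2C plan */
    cufftExecC2C(plan, data, data, CUFFT_FORWARD); /* in-place forward FFT */
    cudaDeviceSynchronize();

    cufftDestroy(plan);
    cudaFree(data);
    return 0;
}
```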

cufft: up to 600 GFLOPS
1D FFTs are used in audio processing and as a foundation for 2D and 3D FFTs.
[Charts: Single Precision 1D Complex and Double Precision 1D Complex performance, GFLOPS vs. log2(size) for sizes 2^2 through 2^28; y-axes run to 700 GFLOPS (single) and 300 GFLOPS (double).]
Performance may vary based on OS version and motherboard configuration. cufft 5.0 on K20X, input and output data on device.

cublas: Dense Linear Algebra on GPUs
- Complete BLAS implementation plus useful extensions
- Supports all 152 standard routines for single, double, complex, and double complex
- New in CUDA 5.0: cublas interface callable from device kernels on K20, K20X
Full list of new features and optimizations:
http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cublas
http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cublas-performance-improvements
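For readers new to the library, here is a minimal sketch of one standard routine (SGEMM) called through the host API. The matrix size is illustrative, and the device arrays are assumed to be filled by the caller:

```c
#include <cublas_v2.h>
#include <cuda_runtime.h>

int main(void)
{
    const int n = 1024;  /* square matrices (illustrative) */
    const float alpha = 1.0f, beta = 0.0f;

    float *dA, *dB, *dC;
    cudaMalloc((void **)&dA, n * n * sizeof(float));
    cudaMalloc((void **)&dB, n * n * sizeof(float));
    cudaMalloc((void **)&dC, n * n * sizeof(float));
    /* ... fill dA and dB on the device (column-major, as BLAS expects) ... */

    cublasHandle_t handle;
    cublasCreate(&handle);

    /* C = alpha * A * B + beta * C */
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
    return 0;
}
```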

cublas: >1 TFLOPS double-precision
[Chart: performance in GFLOPS (axis to 3000) and speedup over MKL (1x to 10x) across BLAS routines.]
Performance may vary based on OS version and motherboard configuration. cublas 5.0 on K20X, input and output data on device; MKL 10.3.6 on Intel SandyBridge E5-2687W @ 3.10GHz.

ZGEMM Performance vs. Matrix Size
[Chart: GFLOPS (0 to 1200) vs. matrix size (0 to 4096, in steps of 512) for CUBLAS and MKL.]
Performance may vary based on OS version and motherboard configuration. cublas 5.0 on K20X, input and output data on device; MKL 10.3.6 on Intel SandyBridge E5-2687W @ 3.10GHz.

cusparse: Sparse linear algebra routines
- Format conversion: dense, COO, CSR, CSC, HYB, BlockCSR
- Optimized sparse linear algebra for CSR and HYB formats
New in CUDA 5.0:
- Incomplete-LU and incomplete-Cholesky preconditioners (ilu0 and ic0)
- BlockCSR storage format (bsr)
Complete list of new features and optimizations:
http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cusparse
http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#cusparse-performance-improvements
[Figure: sparse matrix-vector multiplication y = alpha*A*x + beta*y, illustrated with a 4x4 sparse matrix multiplying the dense vector (1.0, 2.0, 3.0, 4.0).]
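The operation in the figure, a CSR sparse matrix-vector product, is the core of most of the benchmarks that follow. A minimal sketch using the CUDA 5.0-era interface (cusparseScsrmv was later superseded by the generic cusparseSpMV API); the device pointers are assumed to be allocated and populated by the caller:

```c
#include <cusparse_v2.h>
#include <cuda_runtime.h>

/* y = alpha * A * x + beta * y, with the m-by-n matrix A stored in CSR
   format (nnz nonzeros). Pointer names are illustrative. */
void spmv_csr(cusparseHandle_t handle, int m, int n, int nnz,
              const float *d_val, const int *d_rowPtr, const int *d_colInd,
              const float *d_x, float *d_y)
{
    const float alpha = 1.0f, beta = 0.0f;

    cusparseMatDescr_t descr;
    cusparseCreateMatDescr(&descr);
    cusparseSetMatType(descr, CUSPARSE_MATRIX_TYPE_GENERAL);
    cusparseSetMatIndexBase(descr, CUSPARSE_INDEX_BASE_ZERO);

    /* CSR sparse matrix-vector multiply */
    cusparseScsrmv(handle, CUSPARSE_OPERATION_NON_TRANSPOSE,
                   m, n, nnz, &alpha, descr,
                   d_val, d_rowPtr, d_colInd, d_x, &beta, d_y);

    cusparseDestroyMatDescr(descr);
}
```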

cusparse performance: Sparse Matrix x Dense Vector
[Chart: speedup over MKL, 0x to 3.5x, across a set of sparse matrices.]
Performance may vary based on OS version and motherboard configuration. Average of s/d/c/z routines. cusparse 5.0 on K20X, input and output data on device; MKL 10.3.6 on Intel SandyBridge E5-2687W @ 3.10GHz.

cusparse: up to 12x faster than MKL
Sparse Matrix x 6 Dense Vectors (useful for block iterative solvers)
[Chart: speedup over MKL, 0x to 14x, across a set of sparse matrices.]
Performance may vary based on OS version and motherboard configuration. Average of s/d/c/z routines. cusparse 5.0 on K20X, input and output data on device; MKL 10.3.6 on Intel SandyBridge E5-2687W @ 3.10GHz.

Tri-diagonal Solver Performance vs. MKL
Speedup for tri-diagonal solver (gtsv)
[Chart: speedup over MKL, 0x to 18x, vs. system size (16384 to 4194304), for single, double, complex, and double-complex.]
Performance may vary based on OS version and motherboard configuration. cusparse 5.0 on K20X, input and output data on device; MKL 10.3.6 on Intel SandyBridge E5-2687W @ 3.10GHz.

curand: Random Number Generation
- Generating high-quality random numbers in parallel is hard. Don't do it yourself, use a library!
- Pseudo- and quasi-RNGs
- Supports several output distributions
- Statistical test results in documentation
- New in CUDA 5.0: Poisson distribution
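A minimal sketch of the host-side API, generating uniform samples and the Poisson distribution that is new in CUDA 5.0. The sample count, seed, and lambda are illustrative:

```c
#include <curand.h>
#include <cuda_runtime.h>

int main(void)
{
    const size_t n = 1 << 20;  /* number of samples (illustrative) */

    float *d_uniform;
    unsigned int *d_poisson;
    cudaMalloc((void **)&d_uniform, n * sizeof(float));
    cudaMalloc((void **)&d_poisson, n * sizeof(unsigned int));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_PSEUDO_MRG32K3A);
    curandSetPseudoRandomGeneratorSeed(gen, 1234ULL);

    /* Uniformly distributed floats in (0, 1] */
    curandGenerateUniform(gen, d_uniform, n);

    /* Poisson-distributed samples (new in CUDA 5.0), lambda = 4.0 */
    curandGeneratePoisson(gen, d_poisson, n, 4.0);

    curandDestroyGenerator(gen);
    cudaFree(d_uniform);
    cudaFree(d_poisson);
    return 0;
}
```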

curand Performance: Double Precision RNGs
[Chart: Gsamples/sec (0 to 16), GPU vs. CPU, for the MRG32k3a and 32-bit Sobol generators under uniform and normal distributions.]
Performance may vary based on OS version and motherboard configuration. curand 5.0 on K20X, input and output data on device; MKL 10.2.3 on Intel SandyBridge E5-2687W @ 3.10GHz.

Thrust: CUDA C++ Template Library
- Template library for CUDA
- Host and device containers that mimic the C++ STL
- Optimized algorithms for sort, reduce, scan, etc.
- OpenMP backend for portability
- Also available on GitHub: http://thrust.github.com/
- Allows applications and prototypes to be built quickly
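A minimal sketch of the container-plus-algorithm style described above. Sorting 32M integers matches the benchmark on the next slide, though the data here is just rand() output:

```cpp
#include <thrust/host_vector.h>
#include <thrust/device_vector.h>
#include <thrust/generate.h>
#include <thrust/sort.h>
#include <thrust/reduce.h>
#include <cstdlib>

int main(void)
{
    // 32M random integers on the host (host_vector mimics std::vector).
    thrust::host_vector<int> h_vec(32 << 20);
    thrust::generate(h_vec.begin(), h_vec.end(), rand);

    // Copying to a device_vector transfers the data to the GPU.
    thrust::device_vector<int> d_vec = h_vec;

    // Sort and reduce run as optimized CUDA kernels.
    thrust::sort(d_vec.begin(), d_vec.end());
    int sum = thrust::reduce(d_vec.begin(), d_vec.end(), 0);

    (void)sum;
    return 0;
}
```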

Thrust Performance
[Charts: speedup over TBB for various algorithms on 32M integers (reduce, transform, scan, sort; axis to 14x) and for sort on 32M samples by datatype (char, short, int, long, float, double; axis to 35x).]
Performance may vary based on OS version and motherboard configuration. Thrust 5.0 on K20X, input and output data on device; TBB 4.1 on Intel SandyBridge E5-2687W @ 3.10GHz.

math.h: C99 floating-point library + extras
CUDA math.h is industry proven, high performance, accurate.
- Basic: +, *, /, 1/x, sqrt, FMA (all IEEE-754 accurate for float, double, all rounding modes)
- Exponentials: exp, exp2, log, log2, log10, ...
- Trigonometry: sin, cos, tan, asin, acos, atan2, sinh, cosh, asinh, acosh, ...
- Special functions: lgamma, tgamma, erf, erfc
- Utility: fmod, remquo, modf, trunc, round, ceil, floor, fabs, ...
- Extras: rsqrt, rcbrt, exp10, sinpi, sincos, cospi, erfinv, erfcinv, ...
New in CUDA 5.0:
- sincospi[f]() and normcdf[inv][f]()
- sin(), cos() and erfcinvf() more accurate and faster
Full list of new features and optimizations:
http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#math
http://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#math-performance-improvements
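A short device-kernel sketch exercising two of the CUDA 5.0 additions named above, sincospif() and normcdff(); the kernel and array names are illustrative, and the file compiles with nvcc:

```cuda
#include <cuda_runtime.h>

// For each element, compute sin(pi*x) and cos(pi*x) in one call, plus
// the standard normal CDF (both functions are new in CUDA 5.0).
__global__ void math_demo(const float *x, float *s, float *c, float *p, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        sincospif(x[i], &s[i], &c[i]);  // sin(pi*x) and cos(pi*x)
        p[i] = normcdff(x[i]);          // standard normal CDF
    }
}
```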

GPU Technology Conference 2013
March 18-21, San Jose, CA

Why attend GTC? GTC advances global awareness of the dramatic changes we're seeing in science and research, graphics, cloud computing, game development, and mobile computing, and how the GPU is central to innovation in all areas.

Ways to participate:
- Submit a research poster: share your work and gain exposure as a thought leader
- Register: learn from the experts and network with your peers
- Exhibit/sponsor: promote your organization as a key player in the GPU ecosystem

Visit www.gputechconf.com