CUDA 6.0 Performance Report. April 2014

1 CUDA 6.0 Performance Report, April 2014

2 CUDA 6 Performance Report
CUDART: CUDA Runtime Library
cuFFT: Fast Fourier Transforms Library
cuBLAS: Complete BLAS Library
cuSPARSE: Sparse Matrix Library
cuRAND: Random Number Generation (RNG) Library
NPP: Performance Primitives for Image & Video Processing
Thrust: Templated Parallel Algorithms & Data Structures
math.h: C99 floating-point Library
Included in the CUDA Toolkit (free download): developer.nvidia.com/cuda-toolkit
For more information on CUDA libraries: developer.nvidia.com/gpu-accelerated-libraries

3 CUDA 6: 2x Faster GPU Kernel Launches
[Chart: Dynamic Parallel Kernel Launches, launch time in usec, CUDA 5 vs. CUDA 6. Panels: Back to Back Launches (a<<<...>>>; b<<<...>>>;) and Launch and Synchronize (a<<<...>>>; cudaDeviceSynchronize();)]
Performance may vary based on OS version and motherboard configuration.
CUDA 5.0 and CUDA 6.0 on Tesla K20.

4 cuFFT: Multi-dimensional FFTs
Real and complex
Single- and double-precision data types
1D, 2D and 3D batched transforms
Flexible input and output data layouts
New in CUDA 6:
  XT interface supports dual-GPU cards (Tesla K10, GeForce GTX 690, ...)

5 cuFFT: up to 700 GFLOPS
1D Complex, Batched FFTs
Used in Audio Processing and as a Foundation for 2D and 3D FFTs
[Charts: GFLOPS vs. log2(transform_size). Panels: Single Precision, Double Precision]
Performance may vary based on OS version and motherboard configuration.
cuFFT 6.0 on K40c, ECC ON, 32M elements, input and output data on device.

6 cuFFT: Consistently High Performance
1D Complex, Batched FFTs
Used in Audio Processing and as a Foundation for 2D and 3D FFTs
[Charts: GFLOPS vs. Transform Size. Panels: Single Precision, Double Precision]
Performance may vary based on OS version and motherboard configuration.
cuFFT 6.0 on K40c, ECC ON, 28M-33M elements, input and output data on device.
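The GFLOPS values on the two cuFFT slides are derived rates, not raw timings. The sketch below shows one common way to turn a measured batch time into such a rate. It assumes the widely used nominal count of 5 * N * log2(N) floating-point operations per complex transform of length N; the slides do not state which count was used, and the helper name `fft_gflops` is hypothetical.

```cpp
#include <cassert>
#include <cmath>

// Nominal GFLOPS for `batch` complex 1D FFTs of length `n` that finish in
// `seconds`. ASSUMPTION: each transform is counted as 5 * n * log2(n)
// floating-point operations (a common benchmarking convention, not
// something the slides confirm).
double fft_gflops(double n, double batch, double seconds) {
    assert(n >= 1.0 && batch >= 1.0 && seconds > 0.0);
    const double flops = 5.0 * n * std::log2(n) * batch;
    return flops / seconds / 1e9;
}
```

Under this convention, 32768 transforms of length 1024 (32M elements in total) contain about 1.68 billion nominal operations, so a hypothetical batch time of 2.4 ms would be reported as roughly 700 GFLOPS.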

7 New in CUDA 6: cuFFT-XT Boosts Performance on K10
[Charts: Execution Time (ms), cuFFT vs. cuFFT-XT, for two 3D transform sizes (partially extracted: 256x256x, x512x512)]
Performance may vary based on OS version and motherboard configuration.
cuFFT 6.0 on K10, ECC ON, input and output data on device.

8 cuBLAS: Dense Linear Algebra on GPUs
Complete BLAS implementation plus useful extensions
Supports all 152 standard routines for single, double, complex, and double complex
Host and device-callable interface
New in CUDA 6:
  XT Interface for Level 3 BLAS
    Distributed computations across multiple GPUs
    Out-of-core streaming to GPU, no upper limit on matrix size
  Drop-in BLAS intercepts CPU BLAS calls, streams to GPU

9 cuBLAS: >3 TFLOPS single-precision, >1 TFLOPS double-precision
[Chart: GFLOPS per routine. Single: SGEMM, SSYMM, STRSM, SSYRK. Single Complex: CGEMM, CSYMM, CTRSM, CSYRK. Double: DGEMM, DSYMM, DTRSM, DSYRK. Double Complex: ZGEMM, ZSYMM, ZTRSM, ZSYRK]
Performance may vary based on OS version and motherboard configuration.
cuBLAS 6.0 on K40m, ECC ON, input and output data on device.
m=n=k=4096, transpose=no, side=right, fill=lower.
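To relate the TFLOPS figures to the stated problem size, the following sketch counts the nominal work in a real GEMM and converts a sustained rate into a run time. It assumes the usual 2 * m * n * k count (one multiply and one add per inner-product term), which applies to SGEMM and DGEMM but not to the complex routines; the function names are hypothetical and the slide does not say how its rates were computed.

```cpp
#include <cassert>

// Nominal operation count for a real GEMM, C = alpha*A*B + beta*C, with
// A of size m-by-k and B of size k-by-n.
// ASSUMPTION: counted as 2*m*n*k, the common convention for real GEMM.
double real_gemm_flops(double m, double n, double k) {
    assert(m > 0.0 && n > 0.0 && k > 0.0);
    return 2.0 * m * n * k;
}

// Seconds needed to execute `flops` operations at a sustained rate
// expressed in TFLOPS.
double seconds_at_tflops(double flops, double tflops) {
    assert(tflops > 0.0);
    return flops / (tflops * 1e12);
}
```

With m=n=k=4096 the count is about 1.37e11 operations, so a sustained 3 TFLOPS corresponds to roughly 46 ms per multiplication.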

10 cuBLAS: ZGEMM 5x Faster than MKL
[Chart: GFLOPS vs. Matrix Dimension (m=n=k), cuBLAS vs. MKL]
Performance may vary based on OS version and motherboard configuration.
cuBLAS 6.0 on K40m, ECC ON, input and output data on device.
MKL on Intel IvyBridge 12-core E GHz

11 New in CUDA 6: cuBLAS-XT Multi-GPU Performance Scaling
16K x 16K SGEMM on Tesla K10
1 x K10: 2.2 TFLOPS
2 x K10: 4.2 TFLOPS
3 x K10: 6.0 TFLOPS
4 x K10: 7.9 TFLOPS
Performance may vary based on OS version and motherboard configuration.
cuBLAS-XT 6.0 on K10, ECC ON, input and output data on host.
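The scaling figures on this slide (2.2, 4.2, 6.0 and 7.9 TFLOPS for one to four K10 boards) can be summarized as parallel efficiency relative to a single board. The sketch below is only arithmetic on those four numbers; it assumes the i-th entry corresponds to i+1 boards, and `scaling_efficiency` is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Parallel efficiency of each configuration relative to the first entry:
// efficiency[i] = throughput[i] / ((i + 1) * throughput[0]).
// ASSUMPTION: entry i was measured on i + 1 identical boards.
std::vector<double> scaling_efficiency(const std::vector<double>& tflops) {
    assert(!tflops.empty() && tflops[0] > 0.0);
    std::vector<double> eff;
    eff.reserve(tflops.size());
    for (std::size_t i = 0; i < tflops.size(); ++i) {
        eff.push_back(tflops[i] / (static_cast<double>(i + 1) * tflops[0]));
    }
    return eff;
}
```

Calling `scaling_efficiency({2.2, 4.2, 6.0, 7.9})` gives efficiencies of about 95%, 91% and 90% for two, three and four boards.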

12 cuSPARSE: Sparse linear algebra routines
Optimized sparse linear algebra BLAS routines: matrix-vector, matrix-matrix, triangular solve
Support for a variety of formats (CSR, COO, block variants)
New in CUDA 6:
  Many improvements to triangular solvers, Incomplete-LU, and Cholesky preconditioners
[Figure: sparse matrix-vector product of the form y = \alpha A x + \beta y, with y = (y1, y2, y3, y4)]
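The matrix-vector routine and the CSR format named on this slide can be illustrated with a plain host-side reference. This is a minimal sketch of y = alpha * A * x + beta * y over a CSR matrix, not cuSPARSE code: the `CsrMatrix` struct and `csr_spmv` function are hypothetical names, and the loop runs serially on the CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sparse matrix in compressed sparse row (CSR) form (hypothetical type).
struct CsrMatrix {
    std::size_t rows = 0;
    std::size_t cols = 0;
    std::vector<std::size_t> row_ptr;  // rows + 1 offsets into col_idx/values
    std::vector<std::size_t> col_idx;  // column index of each stored entry
    std::vector<double> values;        // value of each stored entry
};

// Host reference for y = alpha * A * x + beta * y.
void csr_spmv(double alpha, const CsrMatrix& a, const std::vector<double>& x,
              double beta, std::vector<double>& y) {
    assert(a.row_ptr.size() == a.rows + 1);
    assert(x.size() == a.cols && y.size() == a.rows);
    for (std::size_t i = 0; i < a.rows; ++i) {
        double dot = 0.0;
        // Entries of row i occupy the half-open range [row_ptr[i], row_ptr[i+1]).
        for (std::size_t k = a.row_ptr[i]; k < a.row_ptr[i + 1]; ++k) {
            dot += a.values[k] * x[a.col_idx[k]];
        }
        y[i] = alpha * dot + beta * y[i];
    }
}
```

Each row is independent of the others, which is what makes the operation a natural fit for a GPU; a reference like this is mainly useful for checking a library result on small inputs.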

13 cuSPARSE: 5x Faster than MKL
Sparse Matrix x Dense Vector (SpMV)
[Chart: Speedup over MKL per test matrix, axis from 0x to 6x]
Performance may vary based on OS version and motherboard configuration.
Average of s/c/d/z routines.
cuSPARSE 6.0 on K40m, ECC ON, input and output data on device.
MKL on Intel IvyBridge 12-core E GHz
Matrices obtained from:

14 cuRAND: Random Number Generation
Generating high quality random numbers in parallel is hard
Don't do it yourself, use a library!
Pseudo- and Quasi-RNGs
Supports several output distributions
Statistical test results in documentation
New in CUDA 6:
  Mersenne Twister

15 cuRAND: Up to 75x Faster vs. Intel MKL
[Chart: Gsamples / sec, cuRAND vs. MKL, for the Sobol32 and MRG32k3a generators with Uniform, Normal, and Log-Normal distributions]
Performance may vary based on OS version and motherboard configuration.
cuRAND 6.0 on K40c, ECC ON, double-precision input and output data on device.
MKL on Intel SandyBridge 6-core 2.0 GHz

16 cuRAND: High Performance RNGs
[Chart: Gsamples / sec for Uniform, Normal, and Log-Normal distributions. Pseudo-random: XORWOW, Philox, MRG32k3a, MTGP32. Quasi-random: Sobol32 Scrambled, Sobol64 Scrambled]
Performance may vary based on OS version and motherboard configuration.
cuRAND 6.0 on K40m, ECC ON, double-precision input and output data on device.

17 NPP: NVIDIA Performance Primitives
Over 5000 image and signal processing routines: color transforms, geometric transforms, move operations, linear filters, image & signal statistics, image & signal arithmetic, JPEG building blocks, image segmentation
New in CUDA 6:
  Over 500 new routines, including: median filter, BGR/YUV conversion, 3D LUT color conversion, improvements to JPEG primitives, plus many more

18 NPP Speedup vs. Intel IPP
[Chart: speedup per operation. Operations: Image Set (8-bit RGB), Image Set Channel (8-bit RGB), Image Resize (8-bit RGB), Image Gaussian Filter (32-bit float), Color Conversion 8-bit YUV422 to 8-bit RGB, JPEG 8x8 Forward DCT. Speedup labels in extraction order: 28.5x, 5.7x, 14.4x, 17.8x, 6.3x, 12.9x]
Performance may vary based on OS version and motherboard configuration.
NPP 6.0 on K40m, input and output data on device.
IPP 7.0 on Intel IvyBridge 12-core E GHz

19 Thrust: CUDA C++ Template Library
Template library for CUDA C++
Host and Device Containers that mimic the C++ STL
Optimized Algorithms for sort, reduce, scan, etc.
OpenMP Backend for portability
Also available on github: thrust.github.com
Allows applications and prototypes to be built quickly
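Because the slide describes the library as mimicking the C++ STL, the four algorithm families it names can be shown with their standard-library counterparts. The sketch below is plain host C++ and does not use Thrust; the `Summary` struct and `run_stl_counterparts` function are hypothetical names. Thrust offers similarly named algorithms (`thrust::reduce`, `thrust::transform`, `thrust::inclusive_scan`, `thrust::sort`) that operate on its own host and device containers.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// Results of the four algorithm families named on the slide.
struct Summary {
    int total = 0;              // reduce
    std::vector<int> doubled;   // transform
    std::vector<int> prefix;    // inclusive scan
    std::vector<int> sorted;    // sort
};

Summary run_stl_counterparts(const std::vector<int>& in) {
    Summary s;
    // reduce: sum of all elements
    s.total = std::accumulate(in.begin(), in.end(), 0);
    // transform: apply a function to every element
    s.doubled.resize(in.size());
    std::transform(in.begin(), in.end(), s.doubled.begin(),
                   [](int v) { return 2 * v; });
    // scan: running (inclusive) prefix sums
    s.prefix.resize(in.size());
    std::partial_sum(in.begin(), in.end(), s.prefix.begin());
    // sort: ascending order on a copy of the input
    s.sorted = in;
    std::sort(s.sorted.begin(), s.sorted.end());
    return s;
}
```

For the input {3, 1, 2} this yields a total of 6, doubled values {6, 2, 4}, prefix sums {3, 4, 6} and the sorted sequence {1, 2, 3}.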

20 Thrust Performance vs. Intel TBB
[Charts: Speedup. Panel 1: Thrust vs. TBB on 32M integers (reduce, transform, scan, sort). Panel 2: Thrust Sort vs. TBB on 32M samples (char, short, int, long, float, double). Speedup labels in extraction order: 43.7x, 24.8x, 13.3x, 15.0x, 5.2x, 5.8x]
Performance may vary based on OS version and motherboard configuration.
Thrust v1.7.1 on K40m, ECC ON, input and output data on device.
TBB 4.2 on Intel IvyBridge 12-core E GHz

21 math.h: C99 floating-point library + extras
CUDA math.h is industry proven, high performance, accurate
Basic: +, *, /, 1/, sqrt, FMA (all IEEE-754 accurate for float, double, all rounding modes)
Exponentials: exp, exp2, log, log2, log10, ...
Trigonometry: sin, cos, tan, asin, acos, atan2, sinh, cosh, asinh, acosh, ...
Special functions: lgamma, tgamma, erf, erfc
Utility: fmod, remquo, modf, trunc, round, ceil, floor, fabs, ...
Extras: rsqrt, rcbrt, exp10, sinpi, sincos[pi], cospi, erfinv, erfcinv, normcdf[inv], ...
New in CUDA 6:
  Over 80 new SIMD instructions, useful for video processing: _v*2, _v*4
  Cylindrical bessel: cyl_i{0,1}
  1/hypotenuse: rhypot
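Several of the "extras" are not part of standard C99, so their meaning may not be obvious from the names. The sketch below gives naive host-side reference definitions in terms of standard `<cmath>` functions. These show only what each function computes; they are assumptions for illustration, placed in a hypothetical `ref` namespace, and they do not reproduce the accuracy or speed of the CUDA versions.

```cpp
#include <cassert>
#include <cmath>

namespace ref {

const double kPi = 3.14159265358979323846;

// Reciprocal square root: 1 / sqrt(x), for x > 0.
inline double rsqrt(double x) {
    assert(x > 0.0);
    return 1.0 / std::sqrt(x);
}

// Reciprocal cube root: 1 / cbrt(x), for x != 0.
inline double rcbrt(double x) {
    assert(x != 0.0);
    return 1.0 / std::cbrt(x);
}

// Sine and cosine of pi * x (naive: loses accuracy for large |x|).
inline double sinpi(double x) { return std::sin(kPi * x); }
inline double cospi(double x) { return std::cos(kPi * x); }

// Base-10 exponential: 10 raised to the power x.
inline double exp10(double x) { return std::pow(10.0, x); }

// Reciprocal hypotenuse: 1 / sqrt(x*x + y*y).
inline double rhypot(double x, double y) { return 1.0 / std::hypot(x, y); }

}  // namespace ref
```

Dedicated versions of these functions exist mainly because the fused form can be both faster and more accurate than composing the standard calls, for example when the argument of `sinpi` is large.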

CUDA 6.5 Performance Report

CUDA 6.5 Performance Report CUDA 6.5 Performance Report 1 CUDA 6.5 Performance Report CUDART CUDA Runtime Library cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse Matrix Library curand Random Number

More information

CUDA 7.0 Performance Report. May 2015

CUDA 7.0 Performance Report. May 2015 CUDA 7.0 Performance Report May 2015 1 CUDA 7.0 Performance Report cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse Matrix Library New in cusolver Linear Solver Library

More information

CUDA Toolkit 4.0 Performance Report. June, 2011

CUDA Toolkit 4.0 Performance Report. June, 2011 CUDA Toolkit 4. Performance Report June, 211 CUDA Math Libraries High performance math routines for your applications: cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse

More information

CUDA Toolkit 5.0 Performance Report. January 2013

CUDA Toolkit 5.0 Performance Report. January 2013 CUDA Toolkit 5.0 Performance Report January 2013 CUDA Math Libraries High performance math routines for your applications: cufft Fast Fourier Transforms Library cublas Complete BLAS Library cusparse Sparse

More information

Introduction to GPGPUs and to CUDA programming model: CUDA Libraries

Introduction to GPGPUs and to CUDA programming model: CUDA Libraries Introduction to GPGPUs and to CUDA programming model: CUDA Libraries www.cineca.it Marzia Rivi m.rivi@cineca.it NVIDIA CUDA Libraries http://developer.nvidia.com/technologies/libraries CUDA Toolkit includes

More information

Using OpenACC With CUDA Libraries

Using OpenACC With CUDA Libraries Using OpenACC With CUDA Libraries John Urbanic with NVIDIA Pittsburgh Supercomputing Center Copyright 2015 3 Ways to Accelerate Applications Applications Libraries Drop-in Acceleration CUDA Libraries are

More information

CUDA Accelerated Compute Libraries. M. Naumov

CUDA Accelerated Compute Libraries. M. Naumov CUDA Accelerated Compute Libraries M. Naumov Outline Motivation Why should you use libraries? CUDA Toolkit Libraries Overview of performance CUDA Proprietary Libraries Address specific markets Third Party

More information

Using OpenACC With CUDA Libraries

Using OpenACC With CUDA Libraries Using OpenACC With CUDA Libraries John Urbanic with NVIDIA Pittsburgh Supercomputing Center Copyright 2018 3 Ways to Accelerate Applications Applications Libraries Drop-in Acceleration CUDA Libraries are

More information

NVIDIA CUDA Libraries

NVIDIA CUDA Libraries NVIDIA CUDA Libraries Ujval Kapasi*, Elif Albuz*, Philippe Vandermersch*, Nathan Whitehead*, Frank Jargstorff* San Jose Convention Center Sept 22, 2010 *NVIDIA NVIDIA CUDA Libraries Applications 3 rd Party

More information

CUDA 8 PERFORMANCE OVERVIEW. November 2016

CUDA 8 PERFORMANCE OVERVIEW. November 2016 CUDA 8 PERFORMANCE OVERVIEW November 2016 CUDA 8 PERFORMANCE HIGHLIGHTS 2X 1.5-2X higher performance out-of-the-box Solve larger problems than possible before with Unified Memory SOCIAL NETWORK ANALYSIS

More information

GPU Computing using CUDA C/C++ Dr. Timo Stich Developer Technology Group

GPU Computing using CUDA C/C++ Dr. Timo Stich Developer Technology Group GPU Computing using CUDA C/C++ Dr. Timo Stich Developer Technology Group Why CUDA? Mainstream Massively Parallel Programming Over 300 Million CUDA capable GPUs sold Runs on GPU and CPU (PGI CUDA-x86) Additional

More information

A Sampling of CUDA Libraries Michael Garland

A Sampling of CUDA Libraries Michael Garland A Sampling of CUDA Libraries Michael Garland NVIDIA Research CUBLAS Implementation of BLAS (Basic Linear Algebra Subprograms) on top of CUDA driver Self-contained at the API level, no direct interaction

More information

CUDA math libraries APC

CUDA math libraries APC CUDA math libraries APC CUDA Libraries http://developer.nvidia.com/cuda-tools-ecosystem CUDA Toolkit CUBLAS linear algebra CUSPARSE linear algebra with sparse matrices CUFFT fast discrete Fourier transform

More information

Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation

Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA. NVIDIA Corporation Porting the NAS-NPB Conjugate Gradient Benchmark to CUDA NVIDIA Corporation Outline! Overview of CG benchmark! Overview of CUDA Libraries! CUSPARSE! CUBLAS! Porting Sequence! Algorithm Analysis! Data/Code

More information

CSE 591: GPU Programming. Programmer Interface. Klaus Mueller. Computer Science Department Stony Brook University

CSE 591: GPU Programming. Programmer Interface. Klaus Mueller. Computer Science Department Stony Brook University CSE 591: GPU Programming Programmer Interface Klaus Mueller Computer Science Department Stony Brook University Compute Levels Encodes the hardware capability of a GPU card newer cards have higher compute

More information

CUDA libraries. Lecture 5: libraries and tools. CUDA libraries. CUDA libraries

CUDA libraries. Lecture 5: libraries and tools. CUDA libraries. CUDA libraries Lecture 5: libraries and tools cublas Prof. Mike Giles mike.giles@maths.ox.ac.uk Oxford University Mathematical Institute Oxford e-research Centre Lecture 5 p. 1 basic linear algebra subroutines for dense

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010

More information

Introduction to OpenACC Directives. Duncan Poole, NVIDIA

Introduction to OpenACC Directives. Duncan Poole, NVIDIA Introduction to OpenACC Directives Duncan Poole, NVIDIA GPUs Reaching Broader Set of Developers 1,000,000 s 100,000 s Early Adopters Research Universities Supercomputing Centers Oil & Gas CAE CFD Finance

More information

Technology for a better society. hetcomp.com

Technology for a better society. hetcomp.com Technology for a better society hetcomp.com 1 J. Seland, C. Dyken, T. R. Hagen, A. R. Brodtkorb, J. Hjelmervik,E Bjønnes GPU Computing USIT Course Week 16th November 2011 hetcomp.com 2 9:30 10:15 Introduction

More information

Premiers retours d expérience sur l utilisation de GPU pour des applications de mécanique des structures

Premiers retours d expérience sur l utilisation de GPU pour des applications de mécanique des structures Premiers retours d expérience sur l utilisation de GPU pour des applications de mécanique des structures Antoine Petitet et Stefanos Vlachoutsis Juin 2011 Copyright ESI Group, 2009. 2010. All rights reserved.

More information

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento

More information

Intel Math Kernel Library

Intel Math Kernel Library Intel Math Kernel Library Release 7.0 March 2005 Intel MKL Purpose Performance, performance, performance! Intel s scientific and engineering floating point math library Initially only basic linear algebra

More information

Leveraging the NVIDIA CUDA BLAS in the IMSL FORTRAN Library

Leveraging the NVIDIA CUDA BLAS in the IMSL FORTRAN Library Leveraging the NVIDIA CUDA BLAS in the IMSL FORTRAN Library Benchmarking the NVIDIA GPU A White Paper by Rogue Wave Software. October, 2010 Rogue Wave Softw are 5500 Flatiron Parkw ay, Suite 200 Boulder,

More information

Level-3 BLAS on the TI C6678 multi-core DSP

Level-3 BLAS on the TI C6678 multi-core DSP Level-3 BLAS on the TI C6678 multi-core DSP Murtaza Ali, Eric Stotzer Texas Instruments {mali,estotzer}@ti.com Francisco D. Igual Dept. Arquitectura de Computadores y Automática Univ. Complutense de Madrid

More information

NVIDIA CUDA TOOLKIT V6.0

NVIDIA CUDA TOOLKIT V6.0 NVIDIA CUDA TOOLKIT V6.0 RN-06722-001 _v6.0 February 2014 Release Notes for Windows, Linux, and Mac OS TABLE OF CONTENTS Errata... iii CUDA 6.0 Release Candidate... iii Chapter 1. CUDA Toolkit Major Components...

More information

CUDA 7 AND BEYOND MARK HARRIS, NVIDIA

CUDA 7 AND BEYOND MARK HARRIS, NVIDIA CUDA 7 AND BEYOND MARK HARRIS, NVIDIA C++11 CUDA 7 cusolver Runtime Compilation [&](char)c)){) ))for)(auto)x):)letters))) ))))if)(c)==)x))return)true;) ))return)false;) }) C++11 FEELS LIKE A NEW LANGUAGE

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

More information

CUDA 7.5 OVERVIEW WEBINAR 7/23/15

CUDA 7.5 OVERVIEW WEBINAR 7/23/15 CUDA 7.5 OVERVIEW WEBINAR 7/23/15 CUDA 7.5 https://developer.nvidia.com/cuda-toolkit 16-bit Floating-Point Storage 2x larger datasets in GPU memory Great for Deep Learning cusparse Dense Matrix * Sparse

More information

A Standard for Batching BLAS Operations

A Standard for Batching BLAS Operations A Standard for Batching BLAS Operations Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 5/8/16 1 API for Batching BLAS Operations We are proposing, as a community

More information

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield NVIDIA GTX200: TeraFLOPS Visual Computing August 26, 2008 John Tynefield 2 Outline Execution Model Architecture Demo 3 Execution Model 4 Software Architecture Applications DX10 OpenGL OpenCL CUDA C Host

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

NEW FEATURES IN CUDA 6 MAKE GPU ACCELERATION EASIER MARK HARRIS

NEW FEATURES IN CUDA 6 MAKE GPU ACCELERATION EASIER MARK HARRIS NEW FEATURES IN CUDA 6 MAKE GPU ACCELERATION EASIER MARK HARRIS 1 Unified Memory CUDA 6 2 3 XT and Drop-in Libraries GPUDirect RDMA in MPI 4 Developer Tools 1 Unified Memory CUDA 6 2 3 XT and Drop-in Libraries

More information

Massively Parallel Computing with CUDA. Carlos Alberto Martínez Angeles Cinvestav-IPN

Massively Parallel Computing with CUDA. Carlos Alberto Martínez Angeles Cinvestav-IPN Massively Parallel Computing with CUDA Carlos Alberto Martínez Angeles Cinvestav-IPN What is a GPU? A graphics processing unit (GPU) The term GPU was popularized by Nvidia in 1999 marketed the GeForce

More information

Intel Math Kernel Library ( Intel MKL )

Intel Math Kernel Library ( Intel MKL ) Intel Math Kernel Library ( Intel MKL ) Part of Intel Parallel Studio XE Composer Edition December 2014 Copyright 2014, Intel Corporation. All rights reserved. *Other brands and names are the property

More information

CUDA 5 and Beyond. Mark Ebersole. Original Slides: Mark Harris 2012 NVIDIA

CUDA 5 and Beyond. Mark Ebersole. Original Slides: Mark Harris 2012 NVIDIA CUDA 5 and Beyond Mark Ebersole Original Slides: Mark Harris The Soul of CUDA The Platform for High Performance Parallel Computing Accessible High Performance Enable Computing Ecosystem Introducing CUDA

More information

GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU

GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU April 4-7, 2016 Silicon Valley GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU Steve Rennich, Darko Stosic, Tim Davis, April 6, 2016 OBJECTIVE Direct sparse methods are among the most widely

More information

Built-in Types of Data

Built-in Types of Data Built-in Types of Data Types A data type is set of values and a set of operations defined on those values Python supports several built-in data types: int (for integers), float (for floating-point numbers),

More information

Intel Performance Libraries

Intel Performance Libraries Intel Performance Libraries Powerful Mathematical Library Intel Math Kernel Library (Intel MKL) Energy Science & Research Engineering Design Financial Analytics Signal Processing Digital Content Creation

More information

Applications of Berkeley s Dwarfs on Nvidia GPUs

Applications of Berkeley s Dwarfs on Nvidia GPUs Applications of Berkeley s Dwarfs on Nvidia GPUs Seminar: Topics in High-Performance and Scientific Computing Team N2: Yang Zhang, Haiqing Wang 05.02.2015 Overview CUDA The Dwarfs Dynamic Programming Sparse

More information

GPU Computing Past, Present, Future. Ian Buck, GM GPU Computing Sw

GPU Computing Past, Present, Future. Ian Buck, GM GPU Computing Sw GPU Computing Past, Present, Future Ian Buck, GM GPU Computing Sw History... GPGPU in 2004 GFLOPS recent trends multiplies per second (observed peak) NVIDIA NV30, 35, 40 ATI R300, 360, 420 Pentium 4 July

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

Introduction to CUDA

Introduction to CUDA Introduction to CUDA Overview HW computational power Graphics API vs. CUDA CUDA glossary Memory model, HW implementation, execution Performance guidelines CUDA compiler C/C++ Language extensions Limitations

More information

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators Karl Rupp, Barry Smith rupp@mcs.anl.gov Mathematics and Computer Science Division Argonne National Laboratory FEMTEC

More information

Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009

Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009 Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009 Introduction CUDA is a tool to turn your graphics card into a small computing cluster. It s not always

More information

OpenFOAM + GPGPU. İbrahim Özküçük

OpenFOAM + GPGPU. İbrahim Özküçük OpenFOAM + GPGPU İbrahim Özküçük Outline GPGPU vs CPU GPGPU plugins for OpenFOAM Overview of Discretization CUDA for FOAM Link (cufflink) Cusp & Thrust Libraries How Cufflink Works Performance data of

More information

Technical Report Performance Analysis of CULA on different NVIDIA GPU Architectures. Prateek Gupta

Technical Report Performance Analysis of CULA on different NVIDIA GPU Architectures. Prateek Gupta Technical Report 2014-02 Performance Analysis of CULA on different NVIDIA GPU Architectures Prateek Gupta May 20, 2014 1 Spring 2014: Performance Analysis of CULA on different NVIDIA GPU Architectures

More information

Turing Architecture and CUDA 10 New Features. Minseok Lee, Developer Technology Engineer, NVIDIA

Turing Architecture and CUDA 10 New Features. Minseok Lee, Developer Technology Engineer, NVIDIA Turing Architecture and CUDA 10 New Features Minseok Lee, Developer Technology Engineer, NVIDIA Turing Architecture New SM Architecture Multi-Precision Tensor Core RT Core Turing MPS Inference Accelerated,

More information

PARALUTION - a Library for Iterative Sparse Methods on CPU and GPU

PARALUTION - a Library for Iterative Sparse Methods on CPU and GPU - a Library for Iterative Sparse Methods on CPU and GPU Dimitar Lukarski Division of Scientific Computing Department of Information Technology Uppsala Programming for Multicore Architectures Research Center

More information

In-Situ Statistical Analysis of Autotune Simulation Data using Graphical Processing Units

In-Situ Statistical Analysis of Autotune Simulation Data using Graphical Processing Units Page 1 of 17 In-Situ Statistical Analysis of Autotune Simulation Data using Graphical Processing Units Niloo Ranjan Jibonananda Sanyal Joshua New Page 2 of 17 Table of Contents In-Situ Statistical Analysis

More information

A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection

A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection A Linear Algebra Library for Multicore/Accelerators: the PLASMA/MAGMA Collection Jack Dongarra University of Tennessee Oak Ridge National Laboratory 11/24/2009 1 Gflop/s LAPACK LU - Intel64-16 cores DGETRF

More information

Cray Scientific Libraries. Overview

Cray Scientific Libraries. Overview Cray Scientific Libraries Overview What are libraries for? Building blocks for writing scientific applications Historically allowed the first forms of code re-use Later became ways of running optimized

More information

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli High performance 2D Discrete Fourier Transform on Heterogeneous Platforms Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli Motivation Fourier Transform widely used in Physics, Astronomy, Engineering

More information

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić How to perform HPL on CPU&GPU clusters Dr.sc. Draško Tomić email: drasko.tomic@hp.com Forecasting is not so easy, HPL benchmarking could be even more difficult Agenda TOP500 GPU trends Some basics about

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware

More information

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators

Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Solving Dense Linear Systems on Platforms with Multiple Hardware Accelerators Francisco D. Igual Enrique S. Quintana-Ortí Gregorio Quintana-Ortí Universidad Jaime I de Castellón (Spain) Robert A. van de

More information

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC

GPGPUs in HPC. VILLE TIMONEN Åbo Akademi University CSC GPGPUs in HPC VILLE TIMONEN Åbo Akademi University 2.11.2010 @ CSC Content Background How do GPUs pull off higher throughput Typical architecture Current situation & the future GPGPU languages A tale of

More information

Implementing a Speech Recognition System on a GPU using CUDA. Presented by Omid Talakoub Astrid Yi

Implementing a Speech Recognition System on a GPU using CUDA. Presented by Omid Talakoub Astrid Yi Implementing a Speech Recognition System on a GPU using CUDA Presented by Omid Talakoub Astrid Yi Outline Background Motivation Speech recognition algorithm Implementation steps GPU implementation strategies

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/

More information

CUB. collective software primitives. Duane Merrill. NVIDIA Research

CUB. collective software primitives. Duane Merrill. NVIDIA Research CUB collective software primitives Duane Merrill NVIDIA Research What is CUB?. A design model for collective primitives How to make reusable SIMT software constructs. A library of collective primitives

More information

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve

Administrative Issues. L11: Sparse Linear Algebra on GPUs. Triangular Solve (STRSM) A Few Details 2/25/11. Next assignment, triangular solve Administrative Issues L11: Sparse Linear Algebra on GPUs Next assignment, triangular solve Due 5PM, Tuesday, March 15 handin cs6963 lab 3 Project proposals Due 5PM, Wednesday, March 7 (hard

More information

Data Parallel Execution Model

Data Parallel Execution Model CS/EE 217 GPU Architecture and Parallel Programming Lecture 3: Kernel-Based Data Parallel Execution Model David Kirk/NVIDIA and Wen-mei Hwu, 2007-2013 Objective To understand the organization and scheduling

More information

Advanced CUDA Optimization 1. Introduction

Advanced CUDA Optimization 1. Introduction Advanced CUDA Optimization 1. Introduction Thomas Bradley Agenda CUDA Review Review of CUDA Architecture Programming & Memory Models Programming Environment Execution Performance Optimization Guidelines

More information

May 8-11, 2017 Silicon Valley. CUDA 9 AND BEYOND Mark Harris, May 10, 2017

May 8-11, 2017 Silicon Valley. CUDA 9 AND BEYOND Mark Harris, May 10, 2017 May 8-11, 2017 Silicon Valley CUDA 9 AND BEYOND Mark Harris, May 10, 2017 INTRODUCING CUDA 9 BUILT FOR VOLTA FASTER LIBRARIES Tesla V100 New GPU Architecture Tensor Cores NVLink Independent Thread Scheduling

More information

Tesla Architecture, CUDA and Optimization Strategies

Tesla Architecture, CUDA and Optimization Strategies Tesla Architecture, CUDA and Optimization Strategies Lan Shi, Li Yi & Liyuan Zhang Hauptseminar: Multicore Architectures and Programming Page 1 Outline Tesla Architecture & CUDA CUDA Programming Optimization

More information

Accelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University

Accelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University Accelerating GPU computation through mixed-precision methods Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University Outline Motivation Truncated Precision using CUDA Solving Linear

More information

NEW ADVANCES IN GPU LINEAR ALGEBRA

NEW ADVANCES IN GPU LINEAR ALGEBRA GTC 2012: NEW ADVANCES IN GPU LINEAR ALGEBRA Kyle Spagnoli EM Photonics 5/16/2012 QUICK ABOUT US» HPC/GPU Consulting Firm» Specializations in:» Electromagnetics» Image Processing» Fluid Dynamics» Linear

More information

On the Parallel Solution of Sparse Triangular Linear Systems. M. Naumov* San Jose, CA May 16, 2012 *NVIDIA

On the Parallel Solution of Sparse Triangular Linear Systems. M. Naumov* San Jose, CA May 16, 2012 *NVIDIA On the Parallel Solution of Sparse Triangular Linear Systems M. Naumov* San Jose, CA May 16, 2012 *NVIDIA Why Is This Interesting? There exist different classes of parallel problems Embarrassingly parallel

More information

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013

GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS. Kyle Spagnoli. Research EM Photonics 3/20/2013 GTC 2013: DEVELOPMENTS IN GPU-ACCELERATED SPARSE LINEAR ALGEBRA ALGORITHMS Kyle Spagnoli Research Engineer @ EM Photonics 3/20/2013 INTRODUCTION» Sparse systems» Iterative solvers» High level benchmarks»

More information

May 8-11, 2017 Silicon Valley CUDA 9 AND BEYOND. Mark Harris, May 10, 2017

INTRODUCING CUDA 9 BUILT FOR VOLTA FASTER LIBRARIES Tesla V100 New GPU Architecture Tensor Cores NVLink Independent Thread Scheduling

CS 179: Lecture 10. Introduction to cublas

Table of contents, you are here. Welcome to week 4, this is new material from here on out so please ask questions and help the TAs to improve the lectures and

CAPS Technology. ProHMPT, 2009 March 12th

Overview of the Talk 1. HMPP in a nutshell Directives for Hardware Accelerators (HWA) 2. HMPP Code Generation Capabilities Efficient code generation for CUDA 3.

GPGPU, 1st Meeting Mordechai Butrashvily, CEO GASS

Agenda Forming a GPGPU WG 1st meeting Future meetings Activities Forming a GPGPU WG To raise needs and enhance information sharing A platform for knowledge

Where is the Data? Why You Cannot Debate CPU vs. GPU Performance Without the Answer. University of Virginia Computer Engineering Labs

Chris Gregg and Kim Hazelwood, University of Virginia Computer Engineering Labs. GPUs and Data Transfer GPU computing

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca

bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent

Fastest and most used math library for Intel-based systems

Speaker: Alexander Kalinkin Contributing authors: Peter Caday, Kazushige Goto, Louise Huot, Sarah Knepper, Mesut Meterelliyoz, Arthur Araujo

Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments

Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, and Jack Dongarra SIAM Conference on Computational Science and Engineering

VSC Users Day 2018 Start to GPU Ehsan Moravveji

Outline A brief intro Available GPUs at VSC GPU architecture Benchmarking tests General Purpose GPU Programming Models Image courtesy of Nvidia.com Generally

Languages, Libraries and Development Tools for GPU Computing

GPUs have evolved to the point where many real-world applications are easily implemented on them and run significantly faster than on

GPGPU: HIGH-PERFORMANCE COMPUTING

Joint Advanced Student School (JASS) 2009, March 29 - April 7, 2009, St. Petersburg, Russia. Dmitry Puzyrev, St. Petersburg State University, Faculty

Introduction to CUDA C/C++ Mark Ebersole, NVIDIA CUDA Educator

What is CUDA? Programming language? Compiler? Classic car? Beer? Coffee? CUDA Parallel Computing Platform www.nvidia.com/getcuda Programming

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

Tutorial: Parallel programming technologies on hybrid architectures HybriLIT Team

Laboratory of Information Technologies Joint Institute for Nuclear Research The Helmholtz International Summer School Lattice

CSE 599 I Accelerated Computing - Programming GPUS. Parallel Pattern: Sparse Matrices

Objective Learn about various sparse matrix representations Consider how input data affects run-time performance of

PyCUDA. Continued...

gpuarray Vector Types pycuda.gpuarray.vec All CUDA vector types are supported: float3, int3, long4, etc, Available as numpy data types Field names x, y, z, and w as in CUDA Construct

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

Sparse Matrix Generated by FEM, being as the graph data Often require solving sparse linear equation

Maximizing performance and scalability using Intel performance libraries

Roger Philp Intel HPC Software Workshop Series 2016 HPC Code Modernization for Intel Xeon and Xeon Phi February 17th 2016, Barcelona

CUDA Architecture & Programming Model

Course on Multi-core Architectures & Programming Oliver Taubmann May 9, 2012 Outline Introduction Architecture Generation Fermi A Brief Look Back At Tesla What's New

HPC with the NVIDIA Accelerated Computing Toolkit Mark Harris, November 16, 2015

Accelerators Surge in World's Top Supercomputers. Top500: # of Accelerated Supercomputers 100+ accelerated systems

Accelerating Linpack Performance with Mixed Precision Algorithm on CPU+GPGPU Heterogeneous Cluster

th IEEE International Conference on Computer and Information Technology (CIT ) WANG Lei ZHANG Yunquan

QR Decomposition on GPUs

QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of

PSEUDORANDOM numbers are very important in practice

Proceedings of the 2013 Federated Conference on Computer Science and Information Systems pp. 515-519 Template Library for Multi- Pseudorandom Number Recursion-based Generators Dominik Szałkowski Institute

Intel Math Kernel Library 10.3

Product Brief. The Flagship High Performance Computing Math Library for Windows*, Linux*, and Mac OS* X Intel Math Kernel Library (Intel MKL)

Realization of a low energy HPC platform powered by renewables - A case study: Technical, numerical and implementation aspects

Markus Geveler, Stefan Turek, Dirk Ribbrock PACO Magdeburg 2015 / 7 / 7 markus.geveler@math.tu-dortmund.de

NVIDIA GPU TECHNOLOGY UPDATE

May 2015 Axel Koehler Senior Solutions Architect, NVIDIA NVIDIA: The VISUAL Computing Company GAMING DESIGN ENTERPRISE VIRTUALIZATION HPC & CLOUD SERVICE PROVIDERS AUTONOMOUS

Taipei Embedded Outreach OpenCL DSP Profile Proposals

Copyright 2018 The Khronos Group Inc. Prof. Jenq-Kuen Lee, NTHU Taipei, January 2018 Outline Speaker

Outline. Introduction Intel Vector Math Library (VML) o Features and performance VML in Finance Useful links

Introduction VML is one component of Intel MKL Support HPC applications: o o Scientific & engineering

Technische Universität München. GPU Programming. Rüdiger Westermann Chair for Computer Graphics & Visualization. Faculty of Informatics

Overview Programming interfaces and support libraries The CUDA programming abstraction An in-depth

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

Outline Introduction Review of related work Cyclic Reduction Algorithm

GPU Computing. Axel Koehler Sr. Solution Architect HPC

NVIDIA: Parallel Computing Company GPUs: GeForce, Quadro, Tesla ARM SoCs: Tegra VGX Continued Demand for Ever Faster Supercomputers First-principles