Accelerating MATLAB with CUDA

Similar documents
S05: High Performance Computing with CUDA. CUDA Libraries. Massimiliano Fatica, NVIDIA. CUBLAS: BLAS implementation CUFFT: FFT implementation

Master Thesis Accelerating Image Registration on GPUs

Using a GPU in InSAR processing to improve performance

Accelerating image registration on GPUs

Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009

RAPID MULTI-GPU PROGRAMMING WITH CUDA LIBRARIES. Nikolay Markovskiy

Paralization on GPU using CUDA An Introduction

Introduction to CUDA

INTRODUCTION TO GPU COMPUTING WITH CUDA. Topi Siro

GPU & High Performance Computing (by NVIDIA) CUDA. Compute Unified Device Architecture Florian Schornbaum

Scientific discovery, analysis and prediction made possible through high performance computing.

GPU ARCHITECTURE Chris Schultz, June 2017

GPU Programming. Alan Gray, James Perry EPCC The University of Edinburgh

JCudaMP: OpenMP/Java on CUDA

High-Performance Computing Using GPUs

GPU ARCHITECTURE Chris Schultz, June 2017

CUDA PROGRAMMING MODEL. Carlo Nardone Sr. Solution Architect, NVIDIA EMEA

Practical Introduction to CUDA and GPU

Using OpenACC With CUDA Libraries

CUFFT Library PG _V1.0 June, 2007

Introduction to GPU hardware and to CUDA

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

GPU Programming. Lecture 2: CUDA C Basics. Miaoqing Huang University of Arkansas 1 / 34

Speed Up Your Codes Using GPU

A Sampling of CUDA Libraries Michael Garland

GPU programming: CUDA basics. Sylvain Collange Inria Rennes Bretagne Atlantique

Introduction to Parallel Computing with CUDA. Oswald Haan

An Introduction to GPU Architecture and CUDA C/C++ Programming. Bin Chen April 4, 2018 Research Computing Center

Is OpenMP 4.5 Target Off-load Ready for Real Life? A Case Study of Three Benchmark Kernels

GPU Computing: A Quick Start

ATS-GPU Real Time Signal Processing Software

ECE 574 Cluster Computing Lecture 17

GeoImaging Accelerator Pansharpen Test Results. Executive Summary

High Performance and GPU Computing in MATLAB

GPGPU/CUDA/C Workshop 2012

Using OpenACC With CUDA Libraries

How to perform HPL on CPU&GPU clusters. Dr.sc. Draško Tomić

General Purpose GPU programming (GP-GPU) with Nvidia CUDA. Libby Shoop

Supporting Data Parallelism in Matcloud: Final Report

Learn CUDA in an Afternoon. Alan Gray EPCC The University of Edinburgh

Tesla GPU Computing A Revolution in High Performance Computing

Deep learning in MATLAB From Concept to CUDA Code

Introduction to GPU (Graphics Processing Unit) Architecture & Programming

Overview. Lecture 1: an introduction to CUDA. Hardware view. Hardware view. hardware view software view CUDA programming

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS

This is a draft chapter from an upcoming CUDA textbook by David Kirk from NVIDIA and Prof. Wen-mei Hwu from UIUC.

Getting Started with MATLAB Francesca Perino

OP2 FOR MANY-CORE ARCHITECTURES

GPGPU. Alan Gray/James Perry EPCC The University of Edinburgh.

Accelerator cards are typically PCIx cards that supplement a host processor, which they require to operate Today, the most common accelerators include

CUDA-GDB: The NVIDIA CUDA Debugger

GPU Acceleration of the Longwave Rapid Radiative Transfer Model in WRF using CUDA Fortran. G. Ruetsch, M. Fatica, E. Phillips, N.

Parallel Computing: What has changed lately? David B. Kirk

CUDA C Programming Mark Harris NVIDIA Corporation

Introduction to Numerical General Purpose GPU Computing with NVIDIA CUDA. Part 1: Hardware design and programming model

GPGPU, 4th Meeting Mordechai Butrashvily, CEO GASS Company for Advanced Supercomputing Solutions

University of Bielefeld

On Level Scheduling for Incomplete LU Factorization Preconditioners on Accelerators

Don t reinvent the wheel. BLAS LAPACK Intel Math Kernel Library

Vector Addition on the Device: main()

CUDA Architecture & Programming Model

Tesla Architecture, CUDA and Optimization Strategies

Unified Memory. Notes on GPU Data Transfers. Andreas Herten, Forschungszentrum Jülich, 24 April Member of the Helmholtz Association

Technology for a better society. hetcomp.com

Applications of Berkeley s Dwarfs on Nvidia GPUs

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture

ECE 574 Cluster Computing Lecture 15

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

INTRODUCTION TO GPU COMPUTING IN AALTO. Topi Siro

Mathematical computations with GPUs

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

GpuWrapper: A Portable API for Heterogeneous Programming at CGG

Predictive Runtime Code Scheduling for Heterogeneous Architectures

[ NOTICE ] YOU HAVE TO INSTALL ALL FILES PREVIOUSLY, BECAUSE A INSTALLATION TIME IS TOO LONG.

INTRODUCTION TO GPU COMPUTING IN AALTO. Topi Siro

Accelerating Leukocyte Tracking Using CUDA: A Case Study in Leveraging Manycore Coprocessors

NLSEmagic: Nonlinear Schrödinger Equation Multidimensional. Matlab-based GPU-accelerated Integrators using Compact High-order Schemes

AES Cryptosystem Acceleration Using Graphics Processing Units. Ethan Willoner Supervisors: Dr. Ramon Lawrence, Scott Fazackerley

Parallel Programming and Debugging with CUDA C. Geoff Gerfin Sr. System Software Engineer

compiler, CUBLAS and CUFFT (required for development) collection of examples and documentation

PART IV. CUDA programming! code, libraries and tools. Dr. Christian Napoli, M.Sc.! Dpt. Mathematics and Informatics, University of Catania!

Cuda C Programming Guide Appendix C Table C-

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

CSC266 Introduction to Parallel Computing using GPUs Introduction to CUDA

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

General Purpose GPU Computing in Partial Wave Analysis

OpenACC. Part I. Ned Nedialkov. McMaster University Canada. October 2016

Introduction to GPU Computing Using CUDA. Spring 2014 Westgid Seminar Series

A Proposal to Extend the OpenMP Tasking Model for Heterogeneous Architectures

GPU Computing: Development and Analysis. Part 1. Anton Wijs Muhammad Osama. Marieke Huisman Sebastiaan Joosten

Introduction to CUDA Programming

OpenACC/CUDA/OpenMP... 1 Languages and Libraries... 3 Multi-GPU support... 4 How OpenACC Works... 4

High Performance Computing and GPU Programming

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

CUFFT LIBRARY USER'S GUIDE. DU _v7.5 September 2015

CUDA Kenjiro Taura 1 / 36

Tutorial: Parallel programming technologies on hybrid architectures HybriLIT Team

Particle-in-Cell Simulations on Modern Computing Platforms. Viktor K. Decyk and Tajendra V. Singh UCLA

NVIDIA CUDA DEBUGGER CUDA-GDB. User Manual

GPU 1. CSCI 4850/5850 High-Performance Computing Spring 2018

Transcription:

Accelerating MATLAB with CUDA Massimiliano Fatica NVIDIA mfatica@nvidia.com Won-Ki Jeong University of Utah wkjeong@cs.utah.edu

Overview MATLAB can be easily extended via MEX files to take advantage of the computational power offered by the latest NVIDIA GPUs (GeForce 8800, Quadro FX5600, Tesla). Programming the GPU for computational purposes was a very cumbersome task before CUDA. Using CUDA, it is now very easy to achieve impressive speed-up with minimal effort. This work is a proof of concept that shows the feasibility and benefits of using this approach.

MEX file Even though MATLAB is built on many well-optimized libraries, some functions can perform better when written in a compiled language (e.g. C and Fortran). MATLAB provides a convenient API for interfacing code written in C and FORTRAN to MATLAB functions with MEX files. MEX files could be used to exploit multi-core processors with OpenMP or threaded codes or like in this case to offload functions to the GPU.

NVMEX Native MATLAB script cannot parse CUDA code New MATLAB script nvmex.m compiles CUDA code (.cu) to create MATLAB function files Syntax similar to original mex script: >> nvmex f nvmexopts.bat filename.cu IC:\cuda\include LC:\cuda\lib -lcudart Available for Windows and Linux from: http://developer.nvidia.com/object/matlab_cuda.html

Mex files for CUDA A typical mex file will perform the following steps: 1. Convert from double to single precision 2. Rearrange the data layout for complex data 3. Allocate memory on the GPU 4. Transfer the data from the host to the GPU 5. Perform computation on GPU (library, custom code) 6. Transfer results from the GPU to the host 7. Rearrange the data layout for complex data 8. Convert from single to double 9. Clean up memory and return results to MATLAB Some of these steps will go away with new versions of the library (2,7) and new hardware (1,8)

CUDA MEX example Additional code in MEX file to handle CUDA /*Parse input, convert to single precision and to interleaved complex format */.. /* Allocate array on the GPU */ cufftcomplex *rhs_complex_d; cudamalloc( (void **) &rhs_complex_d,sizeof(cufftcomplex)*n*m); /* Copy input array in interleaved format to the GPU */ cudamemcpy( rhs_complex_d, input_single, sizeof(cufftcomplex)*n*m, cudamemcpyhosttodevice); /* Create plan for CUDA FFT NB: transposing dimensions*/ cufftplan2d(&plan, N, M, CUFFT_C2C) ; /* Execute FFT on GPU */ cufftexecc2c(plan, rhs_complex_d, rhs_complex_d, CUFFT_INVERSE) ; /* Copy result back to host */ cudamemcpy( input_single, rhs_complex_d, sizeof(cufftcomplex)*n*m, cudamemcpydevicetohost); /* Clean up memory and plan on the GPU */ cufftdestroy(plan); cudafree(rhs_complex_d); /*Convert back to double precision and to split complex format */.

Initial study Focus on 2D FFTs. FFT-based methods are often used in single precision ( for example in image processing ) Mex files to overload MATLAB functions, no modification between the original MATLAB code and the accelerated one. Application selected for this study: solution of the Euler equations in vorticity form using a pseudo-spectral method.

Implementation details: Case A) FFT2.mex and IFFT2.mex Mex file in C with CUDA FFT functions. Standard mex script could be used. Overall effort: few hours Case B) Szeta.mex: Vorticity source term written in CUDA Mex file in CUDA with calls to CUDA FFT functions. Small modifications necessary to handle files with a.cu suffix Overall effort: ½ hour (starting from working mex file for 2D FFT)

Configuration Hardware: Software: AMD Opteron 250 with 4 GB of memory NVIDIA GeForce 8800 GTX Windows XP and Microsoft VC8 compiler RedHat Enterprise Linux 4 32 bit, gcc compiler MATLAB R2006b CUDA 1.0

FFT2 performance

Vorticity source term http://www.amath.washington.edu/courses/571-winter-2006/matlab/szeta.m function S = Szeta(zeta,k,nu4) % Pseudospectral calculation of vorticity source term % S = -(- psi_y*zeta_x + psi_x*zeta_y) + nu4*del^4 zeta % on a square periodic domain, where zeta = psi_xx + psi_yy is an NxN matrix % of vorticity and k is vector of Fourier wavenumbers in each direction. % Output is an NxN matrix of S at all pseudospectral gridpoints zetahat = fft2(zeta); [KX KY] = meshgrid(k,k); % Matrix of (x,y) wavenumbers corresponding % to Fourier mode (m,n) del2 = -(KX.^2 + KY.^2); del2(1,1) = 1; % Set to nonzero to avoid division by zero when inverting % Laplacian to get psi psihat = zetahat./del2; dpsidx = real(ifft2(1i*kx.*psihat)); dpsidy = real(ifft2(1i*ky.*psihat)); dzetadx = real(ifft2(1i*kx.*zetahat)); dzetady = real(ifft2(1i*ky.*zetahat)); diff4 = real(ifft2(del2.^2.*zetahat)); S = -(-dpsidy.*dzetadx + dpsidx.*dzetady) - nu4*diff4;

Caveats The current CUDA FFT library only supports interleaved format for complex data while MATLAB stores all the real data followed by the imaginary data. Complex to complex (C2C) transforms used The accelerated computations are not taking advantage of the symmetry of the transforms. The current GPU hardware only supports single precision (double precision will be available in the next generation GPU towards the end of the year). Conversion to/from single from/to double is consuming a significant portion of wall clock time.

Advection of an elliptic vortex 256x256 mesh, 512 RK4 steps, Linux, MATLAB file http://www.amath.washington.edu/courses/571-winter-2006/matlab/fs_vortex.m MATLAB 168 seconds MATLAB with CUDA (single precision FFTs) 14.9 seconds (11x)

Pseudo-spectral simulation of 2D Isotropic turbulence. 512x512 mesh, 400 RK4 steps, Windows XP, MATLAB file http://www.amath.washington.edu/courses/571-winter-2006/matlab/fs_2dturb.m MATLAB 992 seconds MATLAB with CUDA (single precision FFTs) 93 seconds

Power spectrum of vorticity is very sensitive to fine scales. Result from original MATLAB run and CUDA accelerated one are in excellent agreement MATLAB run CUDA accelerated MATLAB run

Timing details 1024x1024 mesh, 400 RK4 steps on Windows, 2D isotropic turbulence Runtime Opteron 250 Speed up Runtime Opteron 2210 Speed up PCI-e Bandwidth: 1135 MB/s 1483 MB/s Host to/from device 1003 MB/s 1223 MB/s Standard MATLAB 8098 s 9525s Overload FFT2 and IFFT2 4425 s 1.8x 4937s 1.9x Overload Szeta 735 s 11.x 789s 12.X Overload Szeta, FFT2 and IFFT2 577 s 14.x 605s 15.7x

Conclusion Integration of CUDA is straightforward as a MEX plug-in No need for users to leave MATLAB to run big simulations: high productivity Relevant speed-ups even for small size grids Plenty of opportunities for further optimizations