High Performance and GPU Computing in MATLAB

Similar documents
Speeding up MATLAB Applications Sean de Wolski Application Engineer

Parallel Programming in MATLAB on BioHPC

Large Data in MATLAB: A Seismic Data Processing Case Study U. M. Sundar Senior Application Engineer

Mit MATLAB auf der Überholspur Methoden zur Beschleunigung von MATLAB Anwendungen

Multicore Computer, GPU 및 Cluster 환경에서의 MATLAB Parallel Computing 기능

Getting Started with MATLAB Francesca Perino

Parallel and Distributed Computing with MATLAB The MathWorks, Inc. 1

Parallel and Distributed Computing with MATLAB Gerardo Hernández Manager, Application Engineer

Modeling a 4G LTE System in MATLAB

Introduction to Matlab GPU Acceleration for. Computational Finance. Chuan- Hsiang Han 1. Section 1: Introduction

Accelerating System Simulations

Georgia Institute of Technology Center for Signal and Image Processing Steve Conover February 2009

MATLAB AND PARALLEL COMPUTING

Optimizing and Accelerating Your MATLAB Code

G P G P U : H I G H - P E R F O R M A N C E C O M P U T I N G

Parallel Computing with MATLAB

Parallel Computing with MATLAB

Tesla Architecture, CUDA and Optimization Strategies

DIFFERENTIAL. Tomáš Oberhuber, Atsushi Suzuki, Jan Vacata, Vítězslav Žabka

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

High performance 2D Discrete Fourier Transform on Heterogeneous Platforms. Shrenik Lad, IIIT Hyderabad Advisor : Dr. Kishore Kothapalli

An Introduction to GPGPU Pro g ra m m ing - CUDA Arc hitec ture

Mit MATLAB auf der Überholspur Methoden zur Beschleunigung von MATLAB Anwendungen

General Purpose GPU Computing in Partial Wave Analysis

Parallel Programming Principle and Practice. Lecture 9 Introduction to GPGPUs and CUDA Programming Model

HPC Middle East. KFUPM HPC Workshop April Mohamed Mekias HPC Solutions Consultant. Introduction to CUDA programming

Implementation of the finite-difference method for solving Maxwell`s equations in MATLAB language on a GPU

HPC with Multicore and GPUs

MATLAB Based Optimization Techniques and Parallel Computing

THE AUSTRALIAN NATIONAL UNIVERSITY First Semester Examination June COMP3320/6464/HONS High Performance Scientific Computing

INTRODUCTION TO MATLAB PARALLEL COMPUTING TOOLBOX

GPU Accelerated Solvers for ODEs Describing Cardiac Membrane Equations

CUDA and OpenCL Implementations of 3D CT Reconstruction for Biomedical Imaging

DEEP NEURAL NETWORKS AND GPUS. Julie Bernauer

Lecture 1: Introduction and Computational Thinking

MATLAB: The challenges involved in providing a high-level language on a GPU

GPU-Accelerated Deep Learning

HYPERDRIVE IMPLEMENTATION AND ANALYSIS OF A PARALLEL, CONJUGATE GRADIENT LINEAR SOLVER PROF. BRYANT PROF. KAYVON 15618: PARALLEL COMPUTER ARCHITECTURE

Study and implementation of computational methods for Differential Equations in heterogeneous systems. Asimina Vouronikoy - Eleni Zisiou

NVIDIA GTX200: TeraFLOPS Visual Computing. August 26, 2008 John Tynefield

Applications of Berkeley s Dwarfs on Nvidia GPUs

CUDA 6.0 Performance Report. April 2014

CUDA PROGRAMMING MODEL Chaithanya Gadiyam Swapnil S Jadhav

Using a GPU in InSAR processing to improve performance

Massively Parallel Computing with CUDA. Carlos Alberto Martínez Angeles Cinvestav-IPN

On Massively Parallel Algorithms to Track One Path of a Polynomial Homotopy

Tesla GPU Computing A Revolution in High Performance Computing

Accelerating MATLAB with CUDA

Introduction to GPU hardware and to CUDA

Deep learning in MATLAB From Concept to CUDA Code

CS 179: GPU Computing LECTURE 4: GPU MEMORY SYSTEMS

Multi-Processors and GPU

A GPU Enhanced Linux Cluster for Accelerated FMS

Accelerating image registration on GPUs

Technical Report TR

MATLAB on BioHPC. portal.biohpc.swmed.edu Updated for

An Introduction to GPU Architecture and CUDA C/C++ Programming. Bin Chen April 4, 2018 Research Computing Center

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

Parallel Computing with MATLAB on Discovery Cluster

CMSC 858M/AMSC 698R. Fast Multipole Methods. Nail A. Gumerov & Ramani Duraiswami. Lecture 20. Outline

N-Body Simulation using CUDA. CSE 633 Fall 2010 Project by Suraj Alungal Balchand Advisor: Dr. Russ Miller State University of New York at Buffalo

CUDA 7.5 OVERVIEW WEBINAR 7/23/15

AIRWC : Accelerated Image Registration With CUDA. Richard Ansorge 1 st August BSS Group, Cavendish Laboratory, University of Cambridge UK.

Jedox Suite. Platform Support Guide

Optimizing Data Locality for Iterative Matrix Solvers on CUDA

GPU-Accelerated Parallel Sparse LU Factorization Method for Fast Circuit Analysis

GeoImaging Accelerator Pansharpen Test Results. Executive Summary

Lecture 1: What is MATLAB?

Accelerating GPU computation through mixed-precision methods. Michael Clark Harvard-Smithsonian Center for Astrophysics Harvard University

GPU Performance Optimisation. Alan Gray EPCC The University of Edinburgh

Exploiting GPU Caches in Sparse Matrix Vector Multiplication. Yusuke Nagasaka Tokyo Institute of Technology

[ NOTICE ] YOU HAVE TO INSTALL ALL FILES PREVIOUSLY, BECAUSE A INSTALLATION TIME IS TOO LONG.

A TALENTED CPU-TO-GPU MEMORY MAPPING TECHNIQUE

SPOC : GPGPU programming through Stream Processing with OCaml

CSE 591: GPU Programming. Introduction. Entertainment Graphics: Virtual Realism for the Masses. Computer games need to have: Klaus Mueller

CS GPU and GPGPU Programming Lecture 8+9: GPU Architecture 7+8. Markus Hadwiger, KAUST

CUDA Experiences: Over-Optimization and Future HPC

PyCUDA. An Introduction

What s New in MATLAB and Simulink The MathWorks, Inc. 1

International Journal of Computer Science and Network (IJCSN) Volume 1, Issue 4, August ISSN

INTRODUCTION TO GPU COMPUTING WITH CUDA. Topi Siro

How do you know your GPU or manycore program is correct?

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters

Predictive Runtime Code Scheduling for Heterogeneous Architectures

60x Computational Fluid Dynamics and Visualisation

Platform Support Guide

ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS

CUDA Programming Model

Cougar Open CL v1.0. Users Guide for Open CL support for Delphi/ C++Builder and.net

Very fast simulation of nonlinear water waves in very large numerical wave tanks on affordable graphics cards

Parallel Computing with MATLAB

Accelerated Test Execution Using GPUs

Using Intel Math Kernel Library with MathWorks* MATLAB* on Intel Xeon Phi Coprocessor System

7 DAYS AND 8 NIGHTS WITH THE CARMA DEV KIT

Parallel Programming in MATLAB on BioHPC

Technology for a better society. hetcomp.com

CSE 591/392: GPU Programming. Introduction. Klaus Mueller. Computer Science Department Stony Brook University

Flux Vector Splitting Methods for the Euler Equations on 3D Unstructured Meshes for CPU/GPU Clusters

CUDA. GPU Computing. K. Cooper 1. 1 Department of Mathematics. Washington State University

CS 470 Spring Other Architectures. Mike Lam, Professor. (with an aside on linear algebra)

Transcription:

High Performance and GPU Computing in MATLAB Jan Houška houska@humusoft.cz http://www.humusoft.cz 1

About HUMUSOFT Company: Humusoft s.r.o. Founded: 1990 Number of employees: 18 Location: Praha 8, Pobřežní 20 MATLAB, Simulink Comsol Multiphysics dspace development systems HeavyHorse multiprocessor workstations Training, consulting, development 2

Computations in MATLAB The basic data type is a matrix algorithms optimized to work with vectors and matrices vector and matrix operations systems of linear equations polynomials and data fitting statistic functions trigonometric functions discrete Fourier transform ordinary differential equations More than 1000 functions from different areas available elementary math functions 3

How do I accelerate my computations? Q: I don t see any significant speedup on my high-end multi-core workstation Why are most of my CPU cores idle? How do I utilize my GPU? A: The algorithm must be parallelized to utilize multiple cores computations must be at least partially independent there must be no or little data dependency between concurrently running tasks Some algorithms are inherently parallel little or no additional work required many MATLAB functions that work on individual vector and matrix elements good candidates for GPU acceleration Some algorithms are parallelizable, but not parallelized various amount of additional work needed candidates for explicitly parallel jobs Some algorithms are not parallelizable at all the previous step of algorithm needs to be fully finished before the next step can start many ODE solvers there is no way to benefit from multiple cores 4

Multiprocessing in MATLAB Built-in multithreading built into core MATLAB for specific vector and matrix operations automatically enabled, no extra work necessary Parallel programming division to tasks controlled by MATLAB programmer various levels of control, from semi-automatic to fully controlled code is run on multiple CPU cores suitable for generic parallel computing moderate number of tasks of arbitrary complexity GPU computing code is run on a separate GPU device with dedicated memory data needs to be transferred to GPU and back suitable for massively parallel computing very high number of simple tasks 5

Parallel Computing Toolbox Design and implementation of parallel algorithms Structure client MATLAB commands for job and task creation local scheduler worker distributes tasks to workers, gathers job results computational unit for a task Up to 8 workers on a local machine easily scalable to cluster of arbitrary size using MATLAB Distributed Computing Server GPU hardware interface NVidia CUDA compute capability 1.3 and higher required 6

Graphics Processing Unit (GPU) Originally for graphics acceleration, now also used for scientific calculations Massively parallel array of integer and floating point processors Typically hundreds of processors per card GPU cores complement CPU cores Dedicated high-speed memory 7

Parallel Computing Toolbox GPU functions Interface for GPU computing math functions implemented on GPU running MATLAB code on GPU interface for running CUDA code from MATLAB argument passing to/from MATLAB New since MATLAB Release 2010b initial version of the interface extensions expected in future versions new release of MATLAB is available twice per year Typical applications acceleration by parallel processing on the GPU CUDA code development and debugging 8

Parallel Computing Toolbox GPU functions Object gpudevice identifies and selects the device one GPU device per one MATLAB instance can be used up to 8 devices per machine using parallel workers gpudevicecount >> gpudevice Properties: Name: 'GeForce GTX 460' Index: 1 ComputeCapability: '2.1' SupportsDouble: 1 DriverVersion: 3.1000 MaxThreadsPerBlock: 1024 MaxShmemPerBlock: 49152 MaxThreadBlockSize: [1024 1024 64] MaxGridSize: [65535 65535] SIMDWidth: 32 TotalMemory: 1.0417e+009 FreeMemory: 580116480 MultiprocessorCount: 7 GPUOverlapsTransfers: 1 KernelExecutionTimeout: 1 DeviceSupported: 1 DeviceSelected: 1 9

Parallel Computing Toolbox GPU functions Object gpuarray handle to GPU data in MATLAB math operations defined directly on the object indexing, multiplication, abs, sin, floor, max, fft, lu, gamma, erfc,... gather transfers data back to MATLAB supports real and complex numbers, different data types data types can differ in speed magic(4); xg = gpuarray(x); rg = xg*xg; r = gather(rg); Data can be created directly on the GPU no need for transfer from MATLAB parallel.gpu.gpuarray.zeros(1000) 10

Parallel Computing Toolbox GPU functions Running MATLAB code on GPU function arrayfun applies the function to each array element function can have multiple input and output arguments automatically translates MATLAB code to GPU supports a subset of MATLAB language function [z, w] = sqrtsincos(x, y) z = sqrt(sin(x)*cos(y)); w = sqrt(sin(y)*cos(x)); a = rand(4096); b = rand(4096); ag = gpuarray(a); bg = gpuarray(b); rg = arrayfun(@sqrtsincos, ag, bg); r = gather(rg); 11

Parallel Computing Toolbox GPU functions Running CUDA code from MATLAB object parallel.gpu.cudakernel directly runs CUDA PTX kernel automatic conversion of input arguments output arguments are gpuarray objects global void sqrtsincos(double* v1, const double* v2) { int idx = blockidx.x*blockdim.x + threadidx.x; v1[idx] = sqrt(sin(v1[idx])*cos(v2[idx])); } sqk = parallel.gpu.cudakernel('sqrtsincos.ptx', 'sqrtsincos.cu'); a = rand(1024, 1); b = rand(1024, 1); sqk.threadblocksize = 1024; rk = feval(sqk, a, b); r = gather(rk); 12

Humusoft HeavyHorse 13 AMD Opteron processors two or four processors 8 to 48 cores CPU frequency 2.2 to 3.1 GHz 8 to 128 GB RAM Graphics accelerator NVIDIA Tesla C2050 supports GPU computing available as an option Choice of operating systems Microsoft Windows 64-bit: XP, Vista, 7, Server Linux 64-bit: OpenSUSE, Ubuntu,... Application software optionally pre-installed MATLAB Parallel Computing Toolbox MATLAB Distributed Computing Server COMSOL Multiphysics

Additional Information Web sites www.humusoft.cz homepage of Humusoft s.r.o. www.mathworks.com homepage of MathWorks MATLAB Central discussions, blogs, file exchange, questions & answers,... www.mathworks.com/matlabcentral/ International Conference Technical Computing Prague 2011 www.humusoft.cz/akce/matlab11 Discussion groups Czech and Slovak MATLAB Users Group (CSMUG) www.humusoft.cz/produkty/matlab/csmug Usenet News comp.soft-sys.matlab 14