Parallel Programming in MATLAB on BioHPC


Parallel Programming in MATLAB on BioHPC
Web: portal.biohpc.swmed.edu
Email: biohpc-help@utsouthwestern.edu
Updated for 2017-05-17

What is MATLAB?
A high-level language and development environment for:
- Algorithm and application development
- Data analysis
- Mathematical modeling
Extensive math, engineering, and plotting functionality, plus add-on products for image and video processing, communications, signal processing, and more. Powerful and easy to learn!

Ways of MATLAB Parallel Computing: Built-in Multithreading (Implicit Parallelism)
Automatically handled by MATLAB: use vector operations in place of for-loops (+, -, *, /, SORT, CEIL, LOG, FFT, IFFT, SVD, ...).

Example: element-wise product

    C = A .* B;

versus the equivalent for-loop:

    for i = 1 : size(A, 1)
        for j = 1 : size(A, 2)
            C(i, j) = A(i, j) * B(i, j);
        end
    end

If A and B are both 1000-by-1000 matrices, the MATLAB built-in element-wise product is 50-100 times faster than the for-loop. (Testing environment: MATLAB R2015a on a BioHPC compute node.)
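The claimed speedup can be checked with a quick timing sketch. This is an illustration only; the matrix size and the tic/toc timing harness are choices made here, not taken from the slides, and the measured ratio will vary by machine:

```matlab
% Compare the vectorized element-wise product against an explicit double loop.
A = rand(1000);            % 1000-by-1000 random matrices
B = rand(1000);

tic;
C1 = A .* B;               % built-in, multithreaded element-wise product
t_vec = toc;

tic;
C2 = zeros(size(A));       % preallocate so the loop is not penalized for growth
for i = 1 : size(A, 1)
    for j = 1 : size(A, 2)
        C2(i, j) = A(i, j) * B(i, j);
    end
end
t_loop = toc;

fprintf('vectorized: %.4f s, loop: %.4f s, speedup: %.1fx\n', ...
        t_vec, t_loop, t_loop / t_vec);
```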

Ways of MATLAB Parallel Computing: Parallel Computing Toolbox
Accelerates MATLAB code with very few code changes; the job runs on your workstation, thin client, or a reserved compute node.
- Parallel for-loops (parfor)
- spmd (Single Program Multiple Data)
- pmode (interactive environment for spmd development)
- Parallel computing with GPU

MATLAB Distributed Computing Server (MDCS)
Submits MATLAB jobs directly to the BioHPC cluster.
- MATLAB job scheduler integrated with SLURM
- A single job can use multiple nodes

Parallel Computing Toolbox: Parallel for-loops (parfor)
parfor allows several MATLAB workers to execute individual loop iterations simultaneously. The only syntactic difference from a for-loop is the keyword parfor instead of for. When the loop begins, MATLAB opens a parallel pool of sessions, called workers, to execute the iterations in parallel.

    <open matlab pool>
    parfor <statements>
    <close matlab pool>
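The pattern above can be sketched as a minimal, self-contained example (the pool size of 4 and the sqrt loop body are illustrative choices, not from the slides):

```matlab
% Open a pool, run a parfor loop, close the pool.
parpool('local', 4);          % <open matlab pool> with 4 workers

n = 1e6;
s = zeros(1, n);              % sliced output variable
parfor i = 1 : n
    s(i) = sqrt(i);           % iterations are distributed across the workers
end

delete(gcp('nocreate'));      % <close matlab pool>
```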

Parallel Computing Toolbox: Parallel for-loops (parfor)
Example 1: Estimating an Integral

    F = integral from a to b of 0.2*sin(2x) dx

The integral is estimated by summing the areas of n rectangles under the F(x) curve:

    F ≈ sum over i = 1..n of height_i * width

We can calculate the subareas in any order!

Parallel Computing Toolbox: Parallel for-loops (parfor)
Example 1: Estimating an Integral

quad_fun.m (serial version):

    function q = quad_fun (n, a, b)
    q = 0.0;
    w = (b - a) / n;
    for i = 1 : n
        x = ((n - i) * a + (i - 1) * b) / (n - 1);
        fx = 0.2 * sin(2 * x);
        q = q + w * fx;
    end
    return

quad_fun_parfor.m (parfor version):

    function q = quad_fun_parfor (n, a, b)
    q = 0.0;
    w = (b - a) / n;
    parfor i = 1 : n
        x = ((n - i) * a + (i - 1) * b) / (n - 1);
        fx = 0.2 * sin(2 * x);
        q = q + w * fx;
    end
    return

12 workers: speedup = t1/t2 = 6.14x

Parallel Computing Toolbox: Parallel for-loops (parfor)
Q: How many workers can I use for a MATLAB parallel pool?
- 2013a/2013b: max number of workers = 12
- 2014a/2014b/2015a/2015b: max number of workers = number of physical cores

    Node type            Default   Maximum
    128GB, 384GB, GPU    12        16
    256GB                12        24
    Workstation          6         6
    Thin client          2         2

Use parpool('local', numOfWorkers) to specify how many workers you want (or matlabpool('local', numOfWorkers) if you are using matlab_r2013a).

Parallel Computing Toolbox: Parallel for-loops (parfor)
Q: How many workers can I use for a MATLAB parallel pool?
- 2016a/2016b/2017a: max number of workers = 512; default number of workers = number of physical cores. Choose wisely:

    Node type            Number of workers
    128GB, 384GB, GPU    <= 32
    256GB                <= 48
    Workstation          <= 12
    Thin client          <= 4

    % load local profile configuration
    mycluster = parcluster('local');
    % increase the number of workers
    mycluster.NumWorkers = 32;
    % open matlab pool
    parpool(mycluster);
    % parfor loop
    parfor <statements>
    % close matlab pool
    delete(gcp('nocreate'));

Limitations of parfor

No nested parfor loops (a worker cannot open its own parallel pool):

    parfor i = 1 : 10
        parfor j = 1 : 10      % NOT allowed
            <statement>
        end
    end

Instead, nest an ordinary for-loop inside the parfor:

    parfor i = 1 : 10
        for j = 1 : 10
            <statement>
        end
    end

The loop variable must consist of increasing integers:

    parfor x = 0 : 0.1 : 1     % NOT allowed
        <statement>
    end

Workaround: index into a precomputed vector:

    xall = 0 : 0.1 : 1;
    parfor ii = 1 : length(xall)
        x = xall(ii);
        <statement>
    end

Loop iterations must be independent:

    parfor i = 2 : 10
        A(i) = 0.5 * ( A(i-1) + A(i+1) );   % NOT allowed: iterations depend on each other
    end
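Note that accumulations such as q = q + w*fx in quad_fun_parfor are not a violation of the independence rule: parfor recognizes this reduction pattern and combines the per-worker partial sums safely. A minimal sketch (the loop bound and variable names are illustrative):

```matlab
% A reduction variable: every iteration updates 'total', but only through
% the recognized pattern total = total + (expression), so parfor allows it.
total = 0;
parfor i = 1 : 100
    total = total + i;    % partial sums are combined across workers
end
disp(total)               % 5050 regardless of execution order
```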

Parallel Computing Toolbox: spmd and Distributed Arrays
spmd: Single Program, Multiple Data
- single program: the identical code runs on multiple workers.
- multiple data: each worker can have different, unique data for that code.
- numlabs: the number of workers.
- labindex: each worker has a unique identifier between 1 and numlabs.
- point-to-point communication between workers: labSend, labReceive, labSendReceive.
Ideal for: 1) programs that take a long time to execute; 2) programs operating on large data sets.

    <open matlab pool>
    spmd
        labdata = load(['datafile_' num2str(labindex) '.ascii']);  % multiple data
        result = MyFunction(labdata);                              % single program
    end
    <close matlab pool>
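A minimal spmd sketch showing how numlabs and labindex let identical code act on different data (the pool size of 4 and the slicing scheme are illustrative choices):

```matlab
% Each worker runs the same block but sees its own labindex.
parpool('local', 4);
spmd
    fprintf('I am worker %d of %d\n', labindex, numlabs);
    chunk = labindex : numlabs : 20;   % each worker takes a different slice of 1..20
end
% Back on the client, per-worker results are available as a Composite:
allchunks = [chunk{:}];
delete(gcp('nocreate'));
```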

Parallel Computing Toolbox: spmd and Distributed Arrays
Example 1: Estimating an Integral

quad_fun.m (serial version, as before):

    function q = quad_fun (n, a, b)
    q = 0.0;
    w = (b - a) / n;
    for i = 1 : n
        x = ((n - i) * a + (i - 1) * b) / (n - 1);
        fx = 0.2 * sin(2 * x);
        q = q + w * fx;
    end
    return

quad_fun_spmd.m (spmd version):

    a = 0.13; b = 1.53; n = 120000000;
    spmd
        aa = (labindex - 1) / numlabs * (b - a) + a;
        bb = labindex / numlabs * (b - a) + a;
        nn = round(n / numlabs);
        q_part = quad_fun(nn, aa, bb);
    end
    q = sum([q_part{:}]);

12 workers: speedup = t1/t3 = 8.26x
Only 12 communications between client and workers are needed to update q, whereas the parfor version needs 120,000,000 communications between client and workers.

Parallel Computing Toolbox: spmd and Distributed Arrays
- distributed array: data distributed from the client and readily accessible on the client. Data are always distributed along the last dimension, as evenly as possible along that dimension among the workers.
- codistributed array: data distributed within an spmd block; the user defines along which dimension to distribute. Arrays are created directly (locally) on the workers.

Matrix multiply A*B with a distributed array:

    parpool('local');
    A = rand(3000);
    B = rand(3000);
    a = distributed(A);
    b = distributed(B);
    c = a * b;        % runs on the workers; c is distributed
    delete(gcp);

With a codistributed array:

    parpool('local');
    A = rand(3000);
    B = rand(3000);
    spmd
        u = codistributed(A, codistributor1d(1));  % by row (1st dimension)
        v = codistributed(B, codistributor1d(2));  % by column (2nd dimension)
        w = u * v;
    end
    delete(gcp);

Parallel Computing Toolbox: pmode
Learn parallel programming interactively. (Screenshot of the pmode Parallel Command Window, with the command prompt at the bottom.)

Parallel Computing Toolbox: spmd and Distributed Arrays
Example 2: Solving Ax = b

linearsolver.m (serial version):

    n = 10000;
    M = rand(n);
    X = ones(n, 1);
    A = M + M';
    b = A * X;
    u = A \ b;

linearsolverspmd.m (spmd version):

    n = 10000;
    M = rand(n);
    X = ones(n, 1);
    spmd
        m = codistributed(M, codistributor('1d', 2));
        x = codistributed(X, codistributor('1d', 1));
        A = m + m';
        b = A * x;
        utmp = A \ b;
    end
    u1 = gather(utmp);

Parallel Computing Toolbox: spmd and Distributed Arrays
Factors that reduce the speedup of parfor and spmd:
- The computation inside the loop is too simple.
- Memory limitations (parallel code creates more data than the serial code).
- Transferring and converting data is time-consuming.
- Unbalanced computational load.
- Synchronization overhead.

Parallel Computing Toolbox: spmd and Distributed Arrays
Example 3: Contrast Enhancement
Image size: 480 x 640, uint8.

contrast_enhance.m:

    function y = contrast_enhance (x)
    x = double (x);
    n = size (x, 1);
    x_average = sum ( sum ( x(:,:) ) ) / n / n;
    s = 3.0;   % the contrast s should be greater than 1
    y = (1.0 - s) * x_average + s * x((n+1)/2, (n+1)/2);
    return

contrast_serial.m:

    x = imread('surfsup.tif');
    y = nlfilter (x, [3,3], @contrast_enhance);
    y = uint8(y);

Running time is 14.46 s. (Example provided by John Burkardt @ FSU.)

Parallel Computing Toolbox: spmd and Distributed Arrays
Example 3: Contrast Enhancement

contrast_spmd.m (spmd version):

    x = imread('surfsup.tif');
    xd = distributed(x);
    spmd
        xl = getLocalPart(xd);
        xl = nlfilter(xl, [3,3], @contrast_enhance);  % general sliding-neighborhood operation, zero padding at edges
    end
    y = [xl{:}];

12 workers: running time is 2.01 s, speedup = 7.19x.
When the image is divided by columns among the workers, artificial internal boundaries are created! Solution: set up communication between the workers.

Parallel Computing Toolbox: spmd and Distributed Arrays
Example 3: Contrast Enhancement
For a 3x3 filtering window, you need one column of ghost boundary on each side. Each worker exchanges its edge columns with its previous and next neighbors:

    column = labSendReceive ( previous, next, xl(:,1) );    % send first column to previous, receive next's first column
    column = labSendReceive ( next, previous, xl(:,end) );  % send last column to next, receive previous's last column

Parallel Computing Toolbox: spmd and Distributed Arrays
Example 3: Contrast Enhancement

contrast_mpi.m (spmd version with worker-to-worker communication):

    x = imread('surfsup.tif');
    xd = distributed(x);
    spmd
        xl = getLocalPart ( xd );

        % find out previous & next neighbors from labindex
        if ( labindex ~= 1 )
            previous = labindex - 1;
        else
            previous = numlabs;
        end
        if ( labindex ~= numlabs )
            next = labindex + 1;
        else
            next = 1;
        end

        % exchange edge columns, then attach the ghost boundaries
        col_from_next = labSendReceive ( previous, next, xl(:,1) );
        col_from_prev = labSendReceive ( next, previous, xl(:,end) );
        if ( labindex < numlabs )
            xl = [ xl, col_from_next ];
        end
        if ( 1 < labindex )
            xl = [ col_from_prev, xl ];
        end

        xl = nlfilter ( xl, [3,3], @contrast_enhance );

        % remove the ghost boundaries
        if ( 1 < labindex )
            xl = xl(:, 2:end);
        end
        if ( labindex < numlabs )
            xl = xl(:, 1:end-1);
        end
        xl = uint8 ( xl );
    end
    y = [ xl{:} ];

12 workers: running time 2.00 s, speedup = 7.23x.

Parallel Computing Toolbox: GPU
Parallel Computing Toolbox enables you to program MATLAB to use your computer's graphics processing unit (GPU) for matrix operations. For some problems, execution on the GPU is faster than on the CPU.
How to enable GPU computing in MATLAB on the BioHPC cluster:
- Step 1: reserve a GPU node.
- Step 2: inside a terminal, type export CUDA_VISIBLE_DEVICES=0 to enable the hardware rendering.
You can launch MATLAB directly if you have a GPU card in your workstation.
Currently supported versions on BioHPC: 2013a/2013b/2014a/2014b/2015a/2015b.

Parallel Computing Toolbox: GPU
Example 2: Solving Ax = b

linearsolvergpu.m (GPU version):

    n = 10000;
    M = rand(n);
    X = ones(n, 1);
    Mgpu = gpuArray(M);   % copy data from CPU to GPU (~0.3 s)
    Xgpu = gpuArray(X);
    Agpu = Mgpu + Mgpu';
    bgpu = Agpu * Xgpu;
    ugpu = Agpu \ bgpu;
    u = gather(ugpu);     % copy data from GPU to CPU (~0.0002 s)

Speedup = t1/t3 = 2.54x

Parallel Computing Toolbox: GPU
Check GPU information (inside MATLAB).

Q: Is a GPU available?

    if ( gpuDeviceCount > 0 )
        ...
    end

Q: How to check memory usage?

    currentGPU = gpuDevice;
    currentGPU.AvailableMemory

Note: TotalMemory refers to the GPU accelerator memory, e.g. K40m ~12 GB; K20m ~5 GB.
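The checks above can be combined into a defensive pattern that only offloads to the GPU when one is present. This is a sketch; the CPU fallback branch and the problem size are additions made here for illustration, not from the slides:

```matlab
% Solve A*x = b on the GPU if available, otherwise on the CPU.
n = 2000;
M = rand(n);
X = ones(n, 1);

if gpuDeviceCount > 0
    currentGPU = gpuDevice;
    fprintf('Using %s, %.1f GB free\n', currentGPU.Name, ...
            currentGPU.AvailableMemory / 1e9);
    Agpu = gpuArray(M) + gpuArray(M)';   % data now lives on the GPU
    bgpu = Agpu * gpuArray(X);
    u = gather(Agpu \ bgpu);             % bring the result back to the CPU
else
    A = M + M';                          % plain CPU path
    u = A \ (A * X);
end
```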

Parallel Computing Toolbox: benchmark
(Benchmark chart; large inputs run out of memory.) Note that the actual memory consumption is much higher than the input data size.

Matlab Distributed Computing Server

Matlab Distributed Computing Server: Setup and Test your MATLAB environment
Home -> ENVIRONMENT -> Parallel -> Manage Cluster Profiles
Add -> Import Cluster Profiles from file
Then select the profile file matching your MATLAB version from:
/project/apps/apps/matlab/profile/nucleus_r2017a.settings
/project/apps/apps/matlab/profile/nucleus_r2016b.settings
/project/apps/apps/matlab/profile/nucleus_r2016a.settings
/project/apps/apps/matlab/profile/nucleus_r2015b.settings
/project/apps/apps/matlab/profile/nucleus_r2015a.settings
/project/apps/apps/matlab/profile/nucleus_r2014b.settings
/project/apps/apps/matlab/profile/nucleus_r2014a.settings
/project/apps/apps/matlab/profile/nucleus_r2013b.settings
/project/apps/apps/matlab/profile/nucleus_r2013a.settings

Matlab Distributed Computing Server: Setup and Test your MATLAB environment
Test your parallel profile. You should have two cluster profiles ready for MATLAB parallel computing:
- 'local' for running jobs on your workstation, thin client, or any single compute node of the BioHPC cluster (uses the Parallel Computing Toolbox)
- 'nucleus_r<version No.>' for running jobs on the BioHPC cluster with multiple nodes (uses MDCS)
Ignore the MATLAB Parallel Cloud profile if you are using 2016a/2016b/2017a.

Matlab Distributed Computing Server: Setup the SLURM environment
From the MATLAB command window:

    ClusterInfo.setQueueName('128GB');                        % use the 128GB partition
    ClusterInfo.setNNode(2);                                  % request 2 nodes
    ClusterInfo.setWallTime('1:00:00');                       % set the time limit to 1 hour
    ClusterInfo.setEmailAddress('yi.du@utsouthwestern.edu');  % email notification

You can check your settings with:

    ClusterInfo.getQueueName();
    ClusterInfo.getNNode();
    ClusterInfo.getWallTime();
    ClusterInfo.getEmailAddress();

Matlab Distributed Computing Server: batch processing
Job submission.

For scripts:

    job = batch('ascript', 'Pool', 15, 'profile', 'nucleus_r2016a');

Here 'ascript' is the script name, 15 is the number of workers minus 1, and 'nucleus_r2016a' is the configuration profile.

For functions:

    job = batch(@sfun, 1, {50, 4000}, 'Pool', 15, 'profile', 'nucleus_r2016a');

Here @sfun is the function name, 1 is the number of variables returned by the function, and {50, 4000} is a cell array containing the input arguments.

Matlab Distributed Computing Server: batch processing
Waiting for a batch job to finish:
- Block the session until the job has finished: wait(job, 'finished');
- Occasionally check on the state of the job: get(job, 'state');
- Load results after the job has finished: results = fetchOutputs(job); or load(job);
- Delete the job: delete(job);
You can also open the job monitor from Home -> Parallel -> Monitor Jobs.
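The non-blocking variant of this workflow can be sketched as a polling loop (the 30-second interval is an illustrative choice; quad_fun, its arguments, and the profile name follow the examples elsewhere in these slides):

```matlab
% Submit a function as a batch job and poll its state instead of blocking.
job = batch(@quad_fun, 1, {120000000, 0.13, 1.53}, ...
            'Pool', 15, 'profile', 'nucleus_r2016a');

while ~strcmp(get(job, 'state'), 'finished')
    pause(30);                 % check every 30 s; do other work in between
end

results = fetchOutputs(job);   % cell array holding quad_fun's return value
q = results{1};
delete(job);                   % clean up the job record
```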

Matlab Distributed Computing Server
Example 1: Estimating an Integral

quad_submission.m:

    ClusterInfo.setQueueName('super');
    ClusterInfo.setNNode(2);
    ClusterInfo.setWallTime('00:30:00');
    job = batch(@quad_fun, 1, {120000000, 0.13, 1.53}, 'Pool', 63, 'profile', 'nucleus_r2016a');
    wait(job, 'finished');
    results = fetchOutputs(job);

Matlab Distributed Computing Server
Running time is 24 s. Recall that the serial job needs only 5.7881 s; MDCS is not designed for small jobs.
If we submit the spmd version of the code with MDCS:

    job = batch('quad_fun_spmd', 'Pool', 63, 'profile', 'nucleus_r2016a');

Running time is 15 s.

Matlab Distributed Computing Server
Example 4: Vessel Extraction
Microscopy image (size: 2126 x 5517); goal: find the central line of the vessels (fast marching method), ~6.5 hours serial.
Preprocessing: local normalization, remove salt & pepper noise, threshold to a binary image (~21.3 seconds); separate vessel groups based on connectivity (~6.6 seconds), giving 243 subsets of the binary image.

Matlab Distributed Computing Server
Example 4: Vessel Extraction (step 3)

vesselextraction_submission.m:

    % get the file list
    subimages = dir(['BWimages', '/*.jpg']);
    parfor i = 1 : length(subimages)
        filename = subimages(i).name;
        fastmatching(filename);
    end

The time needed to process each sub-image varies. t = 784 s; the job running on 4 nodes requires the minimum computational time. Speedup = 28.3x.

License limitations of Matlab Distributed Computing Server
- Maximum concurrent licenses = 128 workers, shared among all BioHPC users.
- Jobs can be left pending on MDCS licenses.
Solutions:
- Single-node jobs: use the 'local' profile and increase the number of workers (2016a).
- Multiple-node jobs: SLURM + GNU parallel (ex_04: example_parallel_sbatch.sh). 4 nodes with a pool size of 128 threads: t = 699 s.

SLURM + GNU parallel
GNU parallel is a shell tool for executing jobs in parallel using one or more computers. http://www.gnu.org/software/parallel/
A naive way to parallelize is to run a fixed batch of 8 jobs on each CPU at a time; GNU Parallel instead spawns a new process whenever one finishes, keeping the CPUs active and thus saving time.
Usage: sbatch -> GNU parallel -> srun -> base script.
(Image retrieved from https://www.biostars.org/p/63816/)

Submit multiple jobs to multiple nodes with srun & GNU parallel:

    #!/bin/bash
    #SBATCH --job-name=parallelexample
    #SBATCH --partition=super
    #SBATCH --nodes=4
    #SBATCH --ntasks=128
    #SBATCH --time=1-00:00:00

    CORES_PER_TASK=1
    INPUTS_COMMAND="ls BWimages"    # generates the input file list
    TASK_SCRIPT="single.sh"         # base script

    module add parallel

    SRUN="srun --exclusive -N1 -n1 -c $CORES_PER_TASK"
    PARALLEL="parallel --delay .2 -j $SLURM_NTASKS --joblog task.log"
    eval $INPUTS_COMMAND | $PARALLEL "$SRUN sh $TASK_SCRIPT {}"

single.sh:

    #!/bin/bash
    module add matlab/2016a
    matlab -nodisplay -nodesktop -singleCompThread -r "fastmatching('$1'), exit"

Questions and Comments
For assistance with MATLAB, please contact the BioHPC group:
Email: biohpc-help@utsouthwestern.edu
Submit a help ticket at http://portal.biohpc.swmed.edu
When reporting a problem, please tell us:
- When did it happen? At what time?
- Cluster or client? Which job ID? How did you run it?
- Which MATLAB version? Is the code compatible across versions?
- Can we look at your scripts and data? Tell us if you are happy for us to access your scripts/data to help troubleshoot.