Getting the most out of your CPUs: Parallel computing strategies in R


1 Getting the most out of your CPUs: Parallel computing strategies in R
Stefan Theussl
Department of Statistics and Mathematics, Wirtschaftsuniversität Wien
July 2, 2008

2 Outline
- Introduction
- Parallel Computing Strategies in R
- HPC WU
- Benchmarks
- Parallel Monte Carlo Simulation
- Conclusions

3 Introduction

4 Why Parallel Computing?
- Multi-core CPUs are already available in commodity PCs
- Demand for computing power steadily increases
- In statistical analysis, data volumes are becoming larger
- Statistical techniques are becoming more computationally intensive

5 Why Parallel Computing?
- Software for parallel computing is already available
- R, a language for statistical computing and graphics, already offers extensions to use this software
- High performance computers are available for a reasonable price
- Parallel programming models are getting easier to use

6 Parallel Execution of Tasks

Table: Interleaved Concurrency
Cycle       1  2  3  4  5  6  7  8  9
Process 1   X        X        X
Process 2      X        X        X
Process 3         X        X        X

Table: Parallelism
Cycle       1  2  3  4  5  6  7
Process 1   X  X  X  X  X  X
Process 2   X  X  X  X  X  X  X
Process 3   X  X  X  X  X

Under interleaved concurrency only one process makes progress per cycle; under true parallelism several processes execute in the same cycle.

7 Computer Architecture
Shared memory platforms (SMPs):
- Many-core desktop computers or laptops
- High performance computing servers with lots of RAM
Distributed memory platforms (DMPs):
- Beowulf clusters
- The grid

8 Shared Memory Platforms
- Multiple processors share one global memory
- Connected to global memory, mostly via bus technology
- Communication via shared variables
- SMPs are now commonplace because of multi-core CPUs
- Limited number of processors (up to 64 in one machine)

9-11 Shared Memory Platforms
[Architecture diagrams]

12 Distributed Memory Platforms
- Provide access to cheap computational power
- Can easily scale up to several hundreds or thousands of processors
- Communication between the nodes is achieved through common network technology
- Typically, message passing libraries like MPI or PVM are used

13 Distributed Memory Platforms
[Architecture diagram]

14 General Strategies
R can call routines implemented in C or FORTRAN, so we can achieve:
- Implicit parallelism via parallelizing compilers
  - Depends on a corresponding compiler
  - Poor performance, as such compilers are still in their infancy
- Explicit parallelism with implicit decomposition, e.g., OpenMP
  - Parallelism is easy to achieve using compiler directives
  - Sequential code can be parallelized incrementally
  - Depends on a corresponding compiler
- Explicit parallelism, e.g., with message passing libraries
  - Interfaces available in R
  - Development of parallel programs is difficult
  - Delivers good performance

15 Parallel Computing Strategies in R

16 Example: Matrix Multiplication
We want to parallelize

  C = AB,   c_ij = Σ_{k=1}^{r} a_ik b_kj

17 Matrix Multiplication Algorithm
Require: A ∈ ℝ^{m×r} and B ∈ ℝ^{r×n}
Ensure: C ∈ ℝ^{m×n}
1: m ← nrow(A)
2: r ← ncol(A)
3: n ← ncol(B)
4: for i = 1 : m do
5:   for j = 1 : n do
6:     for k = 1 : r do
7:       C(i, j) ← C(i, j) + A(i, k) · B(k, j)
8:     end for
9:   end for
10: end for
(C code: slide 38)
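For reference, a direct R transcription of this pseudocode. This is a sketch for exposition only; in practice R's built-in %*% dispatches to optimized BLAS and is far faster than explicit loops:

mat_mult_naive <- function(A, B) {
    m <- nrow(A); r <- ncol(A); n <- ncol(B)
    stopifnot(nrow(B) == r)                 # inner dimensions must agree
    C <- matrix(0, m, n)
    for (i in seq_len(m))
        for (j in seq_len(n))
            for (k in seq_len(r))
                C[i, j] <- C[i, j] + A[i, k] * B[k, j]
    C
}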

18 Parallel Computing Strategies in R
- Threaded R with OpenMP on SMPs
- Parallel R using MPI on a cluster of workstations
- Parallel R using package snow

19 Parallel Computing Strategies in R
A selection of R infrastructure packages for parallel computing:
- Rmpi (Hao Yu; on CRAN)
- rpvm (Na Li and A. J. Rossini; on CRAN)
- snow (Luke Tierney; on CRAN)
- RScaLAPACK (Samatova et al.; on CRAN)
- paRc (Stefan Theussl; under development on R-Forge)
snow is used by 9, Rmpi by 7, and rpvm by 2 other CRAN packages.

20 Threaded R with OpenMP on SMPs

21 Threaded R with OpenMP on SMPs
OpenMP Parallel Algorithm
Require: A ∈ ℝ^{m×r} and B ∈ ℝ^{r×n}
Ensure: C ∈ ℝ^{m×n}
1: m ← nrow(A)
2: r ← ncol(A)
3: n ← ncol(B)
4: !$omp parallel for shared(A, B, C) private(j, k)
5: for i = 1 : m do
6:   for j = 1 : n do
7:     for k = 1 : r do
8:       C(i, j) ← C(i, j) + A(i, k) · B(k, j)
9:     end for
10:   end for
11: end for
Note that the inner loop indices j and k must be private to each thread; sharing them would cause a data race. (C code: slide 39)
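To call such a threaded routine from R, the shared library can be built with R CMD SHLIB and loaded via dyn.load(). A minimal sketch, assuming the C function from slide 39 lives in a file OMP_matrix_mult.c and GCC is used; the -fopenmp flag and the wrapper function omp_mm() are illustrative, not from the slides:

## in a shell: PKG_CFLAGS="-fopenmp" PKG_LIBS="-fopenmp" R CMD SHLIB OMP_matrix_mult.c
dyn.load(paste0("OMP_matrix_mult", .Platform$dynlib.ext))

omp_mm <- function(A, B) {
    ## .C passes the matrices down as flat double vectors, matching the
    ## column-major layout the C routine expects
    res <- .C("OMP_matrix_mult",
              as.double(A), as.integer(nrow(A)), as.integer(ncol(A)),
              as.double(B), as.integer(nrow(B)), as.integer(ncol(B)),
              z = double(nrow(A) * ncol(B)))
    matrix(res$z, nrow = nrow(A))   # reshape the flat result into C
}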

22 Parallel R Using MPI on a Cluster

23 Parallel R Using MPI on a Cluster

      | A_1 |
  A = |  .  |
      | A_p |

A_i is the i-th block (sub-matrix) of A, with dimensions m_i × r, where Σ_{i=1}^{p} m_i = m. We say that A = [A_i] is an m_i × r block matrix. The workers each calculate one block of matrix C:

  C_i = A_i B

The master combines the results into a single matrix.
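The row-block decomposition itself is easy to express in plain R. A small sketch (no message passing involved; the worker step C_i = A_i B is shown inline, and the grouping is slightly more balanced than the ceiling(m/p) scheme on the next slide):

row_blocks <- function(A, p) {
    ## assign each of the m rows to one of p contiguous, balanced groups
    grp <- sort(rep_len(seq_len(p), nrow(A)))
    lapply(split(seq_len(nrow(A)), grp),
           function(rows) A[rows, , drop = FALSE])   # A_i
}

blocks <- row_blocks(A, p)                                   # A_1, ..., A_p with sum(m_i) = m
C <- do.call(rbind, lapply(blocks, function(Ai) Ai %*% B))   # stack the C_i back together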

24 Message Passing Algorithm: Master
Require: A ∈ ℝ^{m×r}, B ∈ ℝ^{r×n}, and p
Ensure: C ∈ ℝ^{m×n}
1: m ← nrow(A)
2: n_worker ← ceiling(m/p)
3: n_last ← m − (p − 1) · n_worker
4: decompose A into A_i such that A_1, ..., A_{p−1} ∈ ℝ^{n_worker×r} and A_p ∈ ℝ^{n_last×r}
5: spawn p worker processes
6: for i = 1 : p do
7:   send A_i and B to process i; start multiplication on process i
8: end for
9: for i = 1 : p do
10:  receive local result C_i from worker i
11: end for
12: combine the C_i into C
(R code: slides 40-41)

25 Message Passing Algorithm: Workers
Require: A_rank ∈ ℝ^{n_rank×r}, B ∈ ℝ^{r×n}, p
Ensure: C_rank ∈ ℝ^{n_rank×n}
1: C_rank ← A_rank B
2: send local result C_rank to the master
(R code: slide 42)

26 Parallel R using package snow
snow stands for "simple network of workstations". It is easy to create R worker processes:

library("snow")
cl <- makeCluster(10, type = "MPI")

To carry out the matrix multiplication in parallel, simply use:

C <- parMM(cl, A, B)

snow also offers parallel versions of apply(), lapply(), ...
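Conceptually, parMM() performs exactly the row-block scheme of the previous slides. A sketch of what it does, using snow's lower-level building blocks splitRows() and clusterApply(); this mirrors, but is not literally, the package internals:

rows  <- splitRows(A, length(cl))                # one row block A_i per worker
parts <- clusterApply(cl, rows, get("%*%"), B)   # worker i computes C_i = A_i %*% B
C     <- do.call(rbind, parts)                   # the master stacks the C_i
stopCluster(cl)                                  # shut the worker processes down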

27 HPC WU

28 bignode.q: 4 nodes, each with 2 dual-core Intel Xeon 2.33 GHz CPUs and 16 GB RAM
node.q: 68 nodes, each with 1 Intel Core 2 Duo 2.4 GHz CPU and 4 GB RAM
This is a total of 72 64-bit computation nodes and a total of 336 gigabytes of RAM.

29 Coming soon: IBM System p
8 IBM 3.5 GHz cores and 128 GB RAM. This is a total of 8 64-bit computation nodes which have access to 128 gigabytes of shared memory.

30 Benchmarks

31 Results on an SMP
Task: Matrix Multiplication
[Figure: execution time in seconds versus number of CPUs for the variants normal, MPI wb, OpenMP, and PVM wb]

32 Results on a DMP
Task: Matrix Multiplication
[Figure: execution time in seconds versus number of CPUs for the variants native BLAS, MPI, PVM, snow MPI, and snow PVM]

33 Parallel Monte Carlo Simulation

34 Pricing of Derivatives
1. Sample a random path for S in a risk-neutral world.
2. Calculate the payoff of the derivative for this path.
3. Repeat steps 1 and 2 to get many sample values of the payoff from the derivative in a risk-neutral world.
4. Calculate the mean of the sample payoffs to get an estimate of the expected payoff in a risk-neutral world.
5. Discount the expected payoff at the risk-free rate to get an estimate of the value of the derivative.
Step 3 can be parallelized; a sketch follows below.
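A minimal sketch of the parallel step with snow, pricing a European call under geometric Brownian motion. All parameter values (spot, strike, rate, volatility, maturity, path counts) are illustrative, and for trustworthy results each worker needs its own random number stream (see the outlook slide):

library(snow)
cl <- makeCluster(4, type = "MPI")     # or type = "SOCK" without MPI

payoffs <- function(n, S0 = 100, K = 100, r = 0.05, sigma = 0.2, T = 1) {
    ## terminal price of a risk-neutral GBM path (steps 1 and 2)
    ST <- S0 * exp((r - 0.5 * sigma^2) * T + sigma * sqrt(T) * rnorm(n))
    pmax(ST - K, 0)                    # payoff of the call per sampled path
}

## step 3 in parallel: each worker draws 25000 sample payoffs
samples <- clusterApply(cl, rep(25000, length(cl)), payoffs)

## steps 4 and 5: average, then discount (r = 0.05, T = 1 as above)
price <- exp(-0.05 * 1) * mean(unlist(samples))
stopCluster(cl)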

35 Conclusions
OpenMP:
- Parallelization of sequential (C or FORTRAN) code
- Performance is good on SMPs
Message passing libraries:
- MPI-based code delivers better results than PVM-based code (in combination with R)
- MPI routines have to be implemented by hand
- The routines of package snow are easy to handle
- Packages Rmpi and snow allow interactive handling of R processes

36 Outlook and future work
- Implement high-level R functions for existing parallel algorithms
- Finish package paRc, which will include OpenMP-aware code
- Efficient parallel pseudo-random number generation
- An interface to Google's MapReduce

37 Thank you for your attention
Further reading:
- Erricos Kontoghiorghes, editor. Handbook of Parallel Computing and Statistics. Chapman & Hall.
- Anthony Rossini, Luke Tierney, and Na Li. Simple Parallel Statistical Computing in R. UW Biostatistics Working Paper Series, Working Paper 193.
- Stefan Theussl. Applied High Performance Computing Using R. Master's thesis, Wirtschaftsuniversität Wien, 2007.

38 Matrix Multiplication in C

void Serial_matrix_mult(double *x, int *nrx, int *ncx,
                        double *y, int *nry, int *ncy,
                        double *z)
{
    int i, j, k;
    double sum;

    /* matrices are stored column-major, as R passes them:
       x[i + k * *nrx] is element (i, k) of x */
    for (i = 0; i < *nrx; i++)
        for (j = 0; j < *ncy; j++) {
            sum = 0.0;
            for (k = 0; k < *ncx; k++)
                sum += x[i + k * *nrx] * y[k + j * *nry];
            z[i + j * *nrx] = sum;
        }
}

Back

39 OpenMP: Algorithm

void OMP_matrix_mult(double *x, int *nrx, int *ncx,
                     double *y, int *nry, int *ncy,
                     double *z)
{
    int i, j, k;
    double sum;

    /* the inner loop indices j, k and the accumulator sum must be
       private to each thread; sharing them would be a data race */
#pragma omp parallel for private(j, k, sum) \
        shared(x, y, z, nrx, nry, ncy, ncx)
    for (i = 0; i < *nrx; i++)
        for (j = 0; j < *ncy; j++) {
            sum = 0.0;
            for (k = 0; k < *ncx; k++)
                sum += x[i + k * *nrx] * y[k + j * *nry];
            z[i + j * *nrx] = sum;
        }
}

Back

40 MPI: Algorithm, Master (1)

mm_rmpi <- function(A, B, n_cpu = 1, spawnRworkers = FALSE) {
    dA <- dim(A)   ## dimensions of matrix A
    dB <- dim(B)   ## dimensions of matrix B
    ## input validation
    matrix_mult_validate(A, B, dA, dB)
    if (n_cpu == 1)
        return(A %*% B)
    ## spawn R workers?
    if (spawnRworkers)
        mpi.spawn.Rslaves(nslaves = n_cpu)
    ## broadcast data and functions
    mpi.bcast.Robj2slave(A)
    mpi.bcast.Robj2slave(B)
    mpi.bcast.Robj2slave(n_cpu)

Back

41 MPI: Algorithm, Master (2)

    ## how many rows on the workers?
    nrows_on_workers <- ceiling(dA[1] / n_cpu)
    nrows_on_last <- dA[1] - (n_cpu - 1) * nrows_on_workers
    ## broadcast the number of rows and the function to apply
    mpi.bcast.Robj2slave(nrows_on_workers)
    mpi.bcast.Robj2slave(nrows_on_last)
    mpi.bcast.Robj2slave(mm_rmpi_worker)
    ## start the partial matrix multiplications on the workers
    mpi.bcast.cmd(mm_rmpi_worker())
    ## gather the partial results from the workers (the master does
    ## not contribute to the calculation)
    local_results <- NULL
    results <- mpi.gather.Robj(local_results, root = 0, comm = 1)
    C <- NULL
    ## Rmpi returns a list if the vectors are of different length
    for (i in 1:n_cpu)
        C <- rbind(C, results[[i + 1]])
    if (spawnRworkers)
        mpi.close.Rslaves()
    C
}

Back

42 MPI: Algorithm, Workers

mm_rmpi_worker <- function() {
    commrank <- mpi.comm.rank() - 1
    if (commrank == (n_cpu - 1))
        local_results <- A[(nrows_on_workers * commrank + 1):
                           (nrows_on_workers * commrank + nrows_on_last), ] %*% B
    else
        local_results <- A[(nrows_on_workers * commrank + 1):
                           (nrows_on_workers * commrank + nrows_on_workers), ] %*% B
    mpi.gather.Robj(local_results, root = 0, comm = 1)
}

Back
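For illustration, a hedged usage sketch of mm_rmpi(). It assumes Rmpi and a working MPI installation, plus the matrix_mult_validate() helper referenced on slide 40; the matrix sizes are arbitrary:

library(Rmpi)
A <- matrix(rnorm(2000 * 1000), nrow = 2000)
B <- matrix(rnorm(1000 * 500), nrow = 1000)

## spawn 4 R workers, multiply, and close them again
C <- mm_rmpi(A, B, n_cpu = 4, spawnRworkers = TRUE)
stopifnot(all.equal(C, A %*% B))    # matches the serial product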
