Parallel Computing with R. Le Yan LSU


Parallel Computing with R
Le Yan, HPC @ LSU
11/1/2017, HPC training series, Fall 2017

Parallel Computing: Why?
Getting results faster: running in parallel may speed up the time to reach a solution. Example: moving 200 boxes by 1 person vs. 10 people.
Dealing with bigger data sets: running in parallel may allow you to use more memory than is available on a single computer. Example: moving a grand piano by 1 person vs. 10 people.

Parallel Computing: How?
Identify a group of workers (for simplicity we will use worker/process/thread interchangeably).
Divide the workload into chunks.
Assign one chunk or a number of chunks to each worker.
Each worker processes its own assignment in parallel with the other workers.

Parallel Computing: Requirements
Hardware: modern computers are equipped with more than one CPU core and are capable of processing workloads in parallel. Your laptop/desktop/workstation has many cores; an HPC cluster is composed of many nodes (servers), each of which has many cores.
Software: many software packages are aware of parallel hardware and are capable of coordinating workload processing among workers.

Base R is single threaded, i.e. not parallel: regardless of how many cores are available, R can only use one of them.

The goal of this training is to show how to use some R packages to achieve parallel processing.

Where We Are with Base R
[Diagram: QB2 cluster = login node plus multiple compute nodes, each with multiple CPU cores; base R uses a single core on a single node. Cluster = multiple nodes (servers) x multiple cores per node.]

What We Want to Achieve
[Diagram: the same QB2 cluster, with the R workload spread across many cores on one or more nodes.]

Parallel Computing: Caveats
Using more workers does not always make your program run faster. Example: moving 200 boxes by 200 people vs. 1,000 people.
Efficiency of parallel programs: defined as speedup divided by the number of workers. Low efficiency means idle workers, and vice versa.
4 workers, 3x speedup: efficiency = 3/4 = 75%
8 workers, 4x speedup: efficiency = 4/8 = 50%
Efficiency usually decreases with an increasing number of workers.
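Speedup and efficiency can be computed directly from system.time() measurements. A minimal sketch (the toy workload and the choice of 4 workers are illustrative assumptions, not from the slides):

library(parallel)
# Toy workload: summaries of 100 random samples, run serially and with 4 workers
t.serial   <- system.time(lapply(1:100, function(i) summary(rnorm(1e6))))["elapsed"]
t.parallel <- system.time(mclapply(1:100, function(i) summary(rnorm(1e6)),
                                   mc.cores = 4))["elapsed"]
n.workers  <- 4
speedup    <- as.numeric(t.serial / t.parallel)
efficiency <- speedup / n.workers   # e.g. a 3x speedup on 4 workers gives 75%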

Is Parallel Computing for You?
Does your R code run slow? If no, don't bother; it is perhaps not wise to spend weeks parallelizing a program that finishes in 30 seconds.
Is your code parallelizable? If no, don't bother; there is not much we can do in R if the target function is written in C or Fortran.
Is your code parallel already? If yes, don't bother; some functions utilize parallel numerical libraries and are implicitly parallel already.
If the answers are yes, yes and no: try parallelization.

Implicit Parallelization
Some functions in R can call parallel numerical libraries. On LONI and LSU HPC clusters this is the multithreaded Intel MKL library.
Mostly linear algebra and related functions. Examples: linear regression, matrix decomposition, computing the inverse and determinant of a matrix.
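On an R build linked against multithreaded MKL (as described above), the thread count used by these implicitly parallel routines can usually be capped with the MKL_NUM_THREADS environment variable. A minimal sketch; this detail is not in the slides and the exact behavior depends on the R/MKL build:

# Assumption: R is linked against multithreaded Intel MKL. MKL usually honors
# MKL_NUM_THREADS; on some builds it must be exported in the shell before R
# starts rather than set from within the session.
Sys.setenv(MKL_NUM_THREADS = "4")
A <- matrix(rnorm(5000*5000), 5000, 5000)   # matrix creation: single threaded
system.time(solve(A))                       # implicitly parallel, capped at ~4 threads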

# Matrix creation and random number generation are NOT implicitly parallel
# Matrix inversion is implicitly parallel
# Each node has 20 cores

# Only 1 out of 20 cores is busy when running this line:
A <- matrix(rnorm(10000*10000), 10000, 10000)
Ainv <- solve(A)

[top output while the matrix is created: Cpu1 at ~98% user, the other 19 CPUs 100% idle; a single R process at ~100% CPU]
R running on one node of the QB2 cluster: 20 cores total, 1 busy, 19 idle

The same code, observed while the inversion runs:
A <- matrix(rnorm(10000*10000), 10000, 10000)
# 20 out of 20 cores are busy when running this line:
Ainv <- solve(A)

[top output while solve(A) runs: all 20 CPU cores at ~100% user; the R process shows ~2000% CPU]
R running on one node of the QB2 cluster: 20 cores total, 20 busy, 0 idle

Know Your Program
Before you start writing parallel programs, you need to be able to answer these questions:
How do I know the program runs faster after parallelization?
Which part of my code slows the execution down the most?
The first step in parallelization is performance analysis. Purpose: know which part takes how long, and locate the hotspot first.
The two most frequently used methods in R: system.time(), and Rprof() with summaryRprof().

system.time()
## Output from the system.time() function
## User: time spent in user mode
## System: time spent in the kernel (I/O etc.)
## Elapsed: wall clock time
## Usage: system.time(<code segment>)
system.time({
  A <- matrix(rnorm(10000*10000), 10000, 10000)
  Ainv <- solve(A)
})
   user  system elapsed
156.582   0.948  16.325
How much wall clock time it takes is perhaps the most important metric.

system.time(): measure the execution times of different functions.
Code:
[lyan1@qb032]$ cat inv_st.R
print("Matrix creation:")
system.time({
  A <- matrix(rnorm(10000*10000), 10000, 10000)
})
print("Matrix inversion:")
system.time({
  Ainv <- solve(A)
})
Output:
[lyan1@qb032]$ Rscript inv_st.R
[1] "Matrix creation:"
   user  system elapsed
  7.437   0.278   7.711
[1] "Matrix inversion:"
   user  system elapsed
149.092   0.768   9.417
Note the huge discrepancy between user and elapsed time for the inversion: it is an indication of implicit parallelization.

Rprof() and summaryRprof()
[lyan1@qb032]$ cat inv_prof.R
Rprof()           # Start profiling
A <- matrix(rnorm(10000*10000), 10000, 10000)
Ainv <- solve(A)
Rprof(NULL)       # End profiling
summaryRprof()    # Print profiling result

Rprof() and summaryRprof()
[lyan1@qb032]$ Rscript inv_prof.R

$by.self (how much time is spent in this function itself)
                self.time self.pct total.time total.pct
"solve.default"    153.36    95.09     153.58     95.23
".External"          6.68     4.14       6.68      4.14
"matrix"             1.02     0.63       7.70      4.77
"diag"               0.22     0.14       0.22      0.14

$by.total (how much time is spent in this function and the functions it calls)
                total.time total.pct self.time self.pct
"solve.default"     153.58     95.23    153.36    95.09
"solve"             153.58     95.23      0.00     0.00
"matrix"              7.70      4.77      1.02     0.63
".External"           6.68      4.14      6.68     4.14
"rnorm"               6.68      4.14      0.00     0.00
"diag"                0.22      0.14      0.22     0.14

Writing Parallel Code: the parallel Package
Introduced in R 2.14.1; integrated the earlier multicore and snow packages.
Coarse-grained parallelization: suited for workloads where the chunks of computation are unrelated and do not need to communicate.
Two ways of using the parallel package:
The mc*apply functions (e.g. mclapply)
A for loop with foreach %dopar% (needs the foreach and doParallel packages)

Function mclapply
Parallelized version of the lapply function, with similar syntax; mc.cores indicates how many cores/workers to use:
mclapply(X, FUN, ..., mc.cores = <number of cores>)
Returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
Can use all cores on one node, but not cores on multiple nodes.

# Quadratic equation: a*x^2 + b*x + c = 0
# Function solve.quad.eq
#   Input: three coefficients of a quadratic equation
#   Output: solutions of the quadratic equation
solve.quad.eq <- function(a, b, c) {
  # Return solutions
  x.delta <- sqrt(b*b - 4*a*c)
  x1 <- (-b + x.delta)/(2*a)
  x2 <- (-b - x.delta)/(2*a)
  return(c(x1, x2))
}

# Create 10 million sets of randomly generated coefficients
len <- 1e7
a <- runif(len, -10, 10); b <- runif(len, -10, 10); c <- runif(len, -10, 10)

# Serial: lapply calls solve.quad.eq for each set of coefficients
res1.s <- lapply(1:len, FUN = function(x) { solve.quad.eq(a[x], b[x], c[x]) })

# Parallel: mclapply with 4 cores (same arguments with one extra: mc.cores)
library(parallel)
res1.p <- mclapply(1:len,
                   FUN = function(x) { solve.quad.eq(a[x], b[x], c[x]) },
                   mc.cores = 4)

Timing the serial (lapply) and parallel (mclapply, 4 cores) versions of the workload above:
> system.time(
+   res1.s <- lapply(1:len, FUN = function(x) { solve.quad.eq(a[x], b[x], c[x]) })
+ )
   user  system elapsed
358.878   0.375 359.046
> system.time(
+   res1.p <- mclapply(1:len,
+                      FUN = function(x) { solve.quad.eq(a[x], b[x], c[x]) },
+                      mc.cores = 4)
+ )
   user  system elapsed
 11.098   0.342  81.581

It is always a good idea to check the efficiency of a parallel program:
Speedup = 359.046/81.581 = 4.40
Efficiency = 4.40/4 = 110% (!)

%dopar%
From the doParallel package, built on top of the parallel, foreach and iterators packages.
Purpose: parallelize a for loop.
Can run on multiple nodes.

%dopar% Steps
Create a cluster of workers (makeCluster)
Register the cluster (registerDoParallel)
Process the for loop in parallel (foreach ... %dopar%)
Stop the cluster (stopCluster)

%dopar%: On a Single Node
# Workload:
# Create 1,000 random samples, each with
# 1,000,000 observations from a standard
# normal distribution, then take a
# summary of each sample.
iters <- 1000

# Sequential version
for (i in 1:iters) {
  to.ls <- rnorm(1e6)
  to.ls <- summary(to.ls)
}

# Parallel version with %dopar%
library(doParallel)   # also loads foreach, iterators and parallel
# Step 1: Create a cluster of 4 workers
cl <- makeCluster(4)
# Step 2: Register the cluster
registerDoParallel(cl)
# Step 3: Process the loop
ls <- foreach(icount(iters)) %dopar% {
  to.ls <- rnorm(1e6)
  to.ls <- summary(to.ls)
}
# Step 4: Stop the cluster
stopCluster(cl)

%dopar%: On a Single Node
# Sequential
system.time(
  for (i in 1:iters) {
    to.ls <- rnorm(1e6)
    to.ls <- summary(to.ls)
  }
)
   user  system elapsed
 60.249   3.499  63.739

# Parallel with 4 cores
system.time({
  cl <- makeCluster(4)
  registerDoParallel(cl)
  ls <- foreach(icount(iters)) %dopar% {
    to.ls <- rnorm(1e6)
    to.ls <- summary(to.ls)
  }
  stopCluster(cl)
})
   user  system elapsed
  0.232   0.032  17.738

Speedup = 63.739/17.738 = 3.59
Efficiency = 3.59/4 = 90%

makeCluster()
We specify how many workers to use.
On the same node: cl <- makeCluster(<number of workers>)
On multiple nodes: cl <- makeCluster(<list of hostnames>)
Example: create 4 workers, 2 on qb101 and 2 on qb102:
cl <- makeCluster(c("qb101", "qb101", "qb102", "qb102"))

%dopar%: Multiple Nodes on the QB2 Cluster
# Read all host names (get the host names of the nodes allocated to the job)
hosts <- as.vector(unique(read.table(Sys.getenv("PBS_NODEFILE"), stringsAsFactors = F))[,1])
# Count number of hosts
nh <- length(hosts)
# Use 4 workers
nc <- 4
# Make a cluster on multiple nodes, then follow the same steps as before:
# make a cluster, register it, process the loop with %dopar%, stop the cluster
cl <- makeCluster(rep(hosts, each = nc/nh))
registerDoParallel(cl)
ls <- foreach(icount(iters)) %dopar% {
  to.ls <- rnorm(1e6)
  to.ls <- summary(to.ls)
}
stopCluster(cl)

Running Parallel Codes
Now we have a code that can run in parallel, so the next question is: how many workers should we run it with? Is more always better (faster)?

The answer: run a scaling test (trial and error).

cl <- makePSOCKcluster(rep(hosts, each = clustersize[i]/nh))
registerDoParallel(cl)
t <- system.time(
  ls <- foreach(icount(iters)) %dopar% {
    to.ls <- rnorm(1e6)
    to.ls <- summary(to.ls)
  }
)
stopCluster(cl)
(makePSOCKcluster() creates the same kind of socket cluster that makeCluster() creates by default.)
Nothing is returned, so the chunks of the workload are independent of each other.
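A minimal self-contained sketch of such a scaling test on a single node (an illustration, not verbatim from the slides; the worker counts and workload size are assumptions):

library(doParallel)   # also attaches foreach, iterators and parallel
iters <- 200
for (nw in c(1, 2, 4, 8, 16)) {
  cl <- makePSOCKcluster(nw)            # nw workers on the local node
  registerDoParallel(cl)
  t <- system.time(
    ls <- foreach(icount(iters)) %dopar% summary(rnorm(1e6))
  )
  stopCluster(cl)
  cat(nw, "workers:", round(t["elapsed"], 1), "seconds elapsed\n")
}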

ls <- foreach(icount(iters)) %dopar% {
  to.ls <- rnorm(1e6)
  to.ls <- summary(to.ls)
}
[Plot: run time vs. number of workers for the loop above, with curves for one node (20 cores total), two nodes (40 cores total), and the ideal behavior.]
We call this relationship between the number of workers and the run time the scaling behavior.

Observations:
Deviation from ideal behavior should be expected.
The number of workers can differ from the number of cores, but it does not make sense to have more workers than the number of cores.
More is not necessarily faster, and could be slower.

res2.p <- foreach(i = 1:core, .combine = 'rbind') %dopar% {
  # local data for results
  res <- matrix(0, nrow = chunk.size, ncol = 2)
  for (x in ((i-1)*chunk.size+1):(i*chunk.size)) {
    res[x - (i-1)*chunk.size, ] <- solve.quad.eq(a[x], b[x], c[x])
  }
  # return local results
  res
}
The results from each chunk are aggregated into res2.p, so the chunks of the workload are not independent of each other.

[Plot: run time vs. number of workers for the chunked version above, with curves for one node (20 cores total), two nodes (40 cores total), and the ideal behavior.]

Observations:
If there is data dependency, performance deteriorates faster (compared to cases where there is none).
Performance deteriorates faster when some workers are on one node and some on the other.

What We Have Learned About Parallel R Codes
With an increasing number of workers, efficiency decreases, and eventually adding more workers slows the code down.
The best scaling behavior is typically found in codes with no data dependency (we call them "embarrassingly parallel").
With this in mind, when developing our codes we should reduce data dependency as much as possible.

How Many Workers to Use
Sometimes our goal should be to maximize efficiency.
If there is no constraint, minimize the wall clock time.

Summary: Steps of Developing Parallel R Codes
Step 1: Analyze performance. Find hot spots: parallelizable code segments that slow down the execution the most.
Step 2: Parallelize the code segments.
Step 3: Run scaling tests. How do efficiency and run time change with an increasing number of workers? What is the optimal number of workers?
Step 4: Is the code fast enough? If yes, stop developing (for now) and move on to production runs; if no, go back to Step 1 and start another iteration.

Memory Management
Replicas of data objects could be created for every worker, so memory usage increases with the number of workers.
R does not necessarily clean them up, even after you close the cluster.
You need to monitor the memory footprint closely.
The Rprof() function is capable of memory profiling as well.
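For example, a minimal sketch of memory profiling with Rprof() and summaryRprof() (the workload is the matrix inversion example from earlier; profiling output goes to the default Rprof.out file):

Rprof(memory.profiling = TRUE)    # start profiling, recording memory use as well
A <- matrix(rnorm(5000*5000), 5000, 5000)
Ainv <- solve(A)
Rprof(NULL)                       # stop profiling
summaryRprof(memory = "both")     # report time and memory statistics per function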

Running the chunked foreach example above with 4 workers:
[top output: 4 R worker processes, each at ~100% CPU and about 314 MB resident memory]
With 4 workers: memory = 314 MB * 4 = 1256 MB

Running the same example with 20 workers:
[top output: 20 R worker processes, each at roughly 87-100% CPU and about 276 MB resident memory]
With 20 workers: memory = 276 MB * 20 = 5520 MB

The memory footprint does not increase exactly linearly with the number of workers, but it comes quite close, so we need to monitor it closely when changing the number of workers.

R with GPU
GPU stands for Graphics Processing Unit. Originally designed to process graphics data, GPUs can also tremendously accelerate certain types of computation, e.g. matrix multiplication.
All nodes on the LONI QB2 cluster are equipped with two GPUs.
The gpuR package brings the processing power of the GPU to R.

Example: Matrix Multiplication on GPU
[lyan1@qb032]$ cat matmul_gpu.R
# Load necessary library
library(gpuR)
ORDER <- 8192

# On CPU: create matrices A and B, then multiply them
A <- matrix(rnorm(ORDER^2), nrow = ORDER)
B <- matrix(rnorm(ORDER^2), nrow = ORDER)
ctime <- system.time(C <- A %*% B)
print(paste("On CPU:", ctime["elapsed"], "seconds"))

# On GPU: create matrices A and B (with a different function than on CPU), then multiply them
vclA <- vclMatrix(rnorm(ORDER^2), nrow = ORDER, ncol = ORDER)
vclB <- vclMatrix(rnorm(ORDER^2), nrow = ORDER, ncol = ORDER)
gtime <- system.time(vclC <- vclA %*% vclB)
print(paste("On GPU:", gtime["elapsed"], "seconds"))

print(paste("The speedup is", ctime["elapsed"]/gtime["elapsed"]))

Example: Matrix Multiplication on GPU
[lyan1@qb072]$ Rscript matmul_gpu.R
Loading required package: methods
Number of platforms: 1
- platform: NVIDIA Corporation: OpenCL 1.2 CUDA 8.0.0
  - gpu index: 0 - Tesla K20Xm
  - gpu index: 1 - Tesla K20Xm
checked all devices
completed initialization
gpuR 1.2.1
Attaching package: 'gpuR'
The following objects are masked from 'package:base': colnames, svd
[1] "On CPU: 4.295 seconds"
[1] "On GPU: 0.0309999999999988 seconds"
[1] "The speedup is 138.54838709678"
Wow, a huge speedup! Especially so given that the CPU result is obtained with 20 cores (implicitly parallel matrix multiplication).

The Other Side of the Coin
[lyan1@qb072]$ cat matmul_gpu_overall.R
# Load necessary library
library(gpuR)
ORDER <- 8192

# Same code with only one difference: now we are measuring the run time including matrix creation.
# On CPU
ctime <- system.time({
  A <- matrix(rnorm(ORDER^2), nrow = ORDER)
  B <- matrix(rnorm(ORDER^2), nrow = ORDER)
  C <- A %*% B
})
print(paste("On CPU:", ctime["elapsed"], "seconds"))

# On GPU
gtime <- system.time({
  vclA <- vclMatrix(rnorm(ORDER^2), nrow = ORDER, ncol = ORDER)
  vclB <- vclMatrix(rnorm(ORDER^2), nrow = ORDER, ncol = ORDER)
  vclC <- vclA %*% vclB
})
print(paste("On GPU:", gtime["elapsed"], "seconds"))

print(paste("The speedup is", ctime["elapsed"]/gtime["elapsed"]))

The Other Side of the Coin
[lyan1@qb072]$ Rscript matmul_gpu_overall.R
(same gpuR start-up messages as before)
[1] "On CPU: 14.298 seconds"
[1] "On GPU: 13.897 seconds"
[1] "The speedup is 1.02885514859322"
This time the speedup is not impressive at all: matrix creation is not much work, but moving data to and from the GPU takes a lot of time. And again, wall clock time is what matters at the end of the day.

Deep Learning in R
Since 2012, deep neural networks (DNNs) have gained great popularity in applications such as image and pattern recognition and natural language processing.
There are a few R packages that support DNNs:
MXNet (multiple nodes, with GPU support)
h2o (multiple nodes)
darch
deepnet
rpud
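As one illustration, a minimal sketch of training a small network with the h2o package (an assumption, not from the slides; it uses the standard h2o.init()/h2o.deeplearning() interface and the built-in iris data set):

library(h2o)
h2o.init(nthreads = -1)              # start a local H2O instance using all cores
train <- as.h2o(iris)                # copy a small example data set to H2O
model <- h2o.deeplearning(
  x = 1:4, y = "Species",            # predictor columns and response
  training_frame = train,
  hidden = c(32, 32),                # two hidden layers
  epochs = 20
)
h2o.shutdown(prompt = FALSE)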

References
ParallelR (www.parallelr.com); code: https://github.com/PatricZhao/ParallelR
Documentation for the R packages mentioned in this tutorial

Thank you!