Using Rmpi within the HPC4Stats framework

1 Using Rmpi within the HPC4Stats framework
Dorit Hammerling
Analytics and Integrative Machine Learning Group, National Center for Atmospheric Research (NCAR)
Based on work by Doug Nychka (Applied Mathematics and Statistics Department, Colorado School of Mines), with contributions from Daniel Milroy (Department of Computer Science, CU Boulder), Brian Vanderwende (IT Consulting Services Group, NCAR), Sophia Chen (Department of Computer Science, Brown University), and Nathan Lenssen (Columbia University)
October 11, 2018

2 Overview
Context and introduction
Strategies for parallel analysis:
Find the right platform for your problem (NCAR's HPC system as an example)
Choices for splitting up the data
Rmpi as a general strategy
HPC4Stats framework
In a nutshell: leverage computational infrastructure to conduct embarrassingly parallel data analysis in R using existing tools.

3 Typical data amenable to parallel inference
Daily data for 35 years: 12,775 values per grid cell
288 longitudes × 192 latitudes: 55,296 grid cells
12,775 × 55,296 = 706,406,400 data points (2.83 GB)

4 Fitting a Generalized Pareto distribution
This is a complementary approach to block maxima for Extreme Value Analysis.
For data above a given threshold $\mu$, fit a probability density of the form
\[ f(x) = \frac{1}{\sigma}\left[1 + \frac{\xi (x - \mu)}{\sigma}\right]^{-(1/\xi + 1)} \quad \text{for } x \ge \mu, \]
with scale parameter $\sigma$ and shape parameter $\xi$.
We ignore all the data below the threshold and just fit the tail.
Having selected the threshold, estimate $\sigma$ and $\xi$ by maximum likelihood.
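
The slides do not write out the likelihood that is maximized; as a reminder (the standard Generalized Pareto result, not taken from the slides), the log-likelihood for the $n$ exceedances $y_1, \dots, y_n$ above $\mu$ (with $\xi \neq 0$) is
\[ \ell(\sigma, \xi) = -n \log \sigma - \left(\frac{1}{\xi} + 1\right) \sum_{i=1}^{n} \log\left[1 + \frac{\xi (y_i - \mu)}{\sigma}\right], \]
subject to $1 + \xi (y_i - \mu)/\sigma > 0$ for all $i$. Up to numerical details, this is what the fevd() call on the next slide maximizes.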

5 Fitting a Generalized Pareto distribution: R code

library(extRemes)   # provides fevd() and return.level()

tailProb <- 0.01              # tail probability used in extremes fitting
returnLevelYear <- 100        # years used for the return level
Y <- dataset[lonIndex, latIndex, ]
threshold <- quantile(Y, 1 - tailProb)
frac <- sum(Y > threshold) / length(Y)
GPFit <- fevd(Y, threshold = threshold, type = "GP", method = "MLE")
ReturnLevel <- return.level(GPFit, returnLevelYear, do.ci = TRUE)

Depending on your machine this takes somewhere from 0.3 to 1 second.
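
One way to reproduce the per-cell timing quoted above is to wrap the fit in system.time(); a minimal sketch, assuming dataset, lonIndex, and latIndex are defined as in the slide code:

library(extRemes)
timing <- system.time({
  Y <- dataset[lonIndex, latIndex, ]
  threshold <- quantile(Y, 1 - 0.01)
  fit <- fevd(Y, threshold = threshold, type = "GP", method = "MLE")
  rl <- return.level(fit, 100, do.ci = TRUE)
})
timing["elapsed"]   # roughly 0.3 to 1 s per grid cell, machine dependent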

7 Why use HPC systems for statistical computing?
Doing repetitive tasks can take a lot of time. Even short tasks add up quickly:
0.33 seconds for one location corresponds to approx. 5 hours for 55,000 locations.
1 second for one location corresponds to approx. 15 hours for 55,000 locations.
And that is for a single data set; often we want to analyze hundreds of data sets and test different parameters.
And we might not have to worry about obtaining the data: moving the analysis to the data is becoming more and more common.
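
A quick back-of-the-envelope check of those totals in R, using the 55,296 grid cells from slide 3:

cells <- 55296
c(fast = 0.33 * cells / 3600,   # about 5.1 hours at 0.33 s per cell
  slow = 1.00 * cells / 3600)   # about 15.4 hours at 1 s per cell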

11 NCAR's high performance computing (HPC) system
Cheyenne (online since 2017):
5.34 petaflops peak
145,152 cores, on nodes with 64 or 128 GB of memory
313 TB total memory
100 Gb/s interconnects
Core-hours available to the NSF research community
Simple application process for graduate students

12 Cores and nodes on HPC systems
Cores on one node usually share memory (cache).
Memory between nodes is typically not shared, but can be accessed.
Understanding the basics of the architecture and the interconnects can be really helpful!

13 Relevant details: memory and parallelization tools
Memory available on compute nodes; two classes of CPU nodes on Cheyenne:
Standard nodes have 64 GB of memory (46 GB usable).
Large-memory nodes have 128 GB of memory (110 GB usable).
Data Analysis cluster: many GPUs, but limited memory; configured for deep learning applications.
You need to know what is installed and how it is configured!
Rmpi: limits on workers? What physical interconnect is it using?
MATLAB Distributed Computing Server
Spark for Python or Scala

15 Application benchmarking
Even if one knows the architecture very well and has data on low-level benchmarks, application benchmarking is critical.
Application benchmarking: benchmarking that uses code as close as possible to the real production code (including I/O operations!).
For large data sets, how you read in and distribute the data matters! Typical options (illustrated in the sketch below):
All the data at once
In blocks: e.g. one latitude or longitude band at a time
Smallest possible unit: e.g. one grid cell at a time
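
A hedged illustration of the three read strategies using the ncdf4 package; the file name precip.nc and the variable name "pr" are hypothetical, the variable is assumed to be dimensioned (longitude, latitude, time), and latIndex/lonIndex are indices as in the slide 5 code:

library(ncdf4)
nc <- nc_open("precip.nc")   # hypothetical file name

# 1. All the data at once (about 2.8 GB in memory for the example data set)
allData <- ncvar_get(nc, "pr")

# 2. In blocks: one latitude band (all longitudes, all times) at a time
latBand <- ncvar_get(nc, "pr", start = c(1, latIndex, 1), count = c(-1, 1, -1))

# 3. Smallest possible unit: the time series for a single grid cell
oneCell <- ncvar_get(nc, "pr", start = c(lonIndex, latIndex, 1), count = c(1, 1, -1))

nc_close(nc)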

17 Technical Report, Data and Code available
Milroy, D., Chen, S., Vanderwende, B., and Hammerling, D.: Accelerating CMIP Data Analysis with Parallel Computing in R. NCAR Technical Notes NCAR/TN-534+CODE, National Center for Atmospheric Research.

18 Rmpi in a picture (figure only; see the supervisor/worker description on the next slide)

19 Rmpi overview
An R interface (wrapper) to MPI.
A convenient way to run R with many R tasks on many cores; little knowledge of MPI and parallel computing is needed.
Uses a supervisor/worker model: all are full-fledged R sessions, but the worker sessions only receive instructions from the supervisor.
The task assigned to each worker is an R function that is passed a unique index. The index is used to determine exactly what task to do (in our case the index determines a range of grid boxes); a sketch of such a task function follows below.
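
A minimal, hypothetical sketch of such a task function: the names dotask and blockSize and the block layout are illustrative rather than the actual HPC4Stats example code, and the 3-d array dataset from slide 5 is assumed to have been broadcast to the workers:

blockSize <- 12   # latitude bands handled per task (illustrative choice)
dotask <- function(taskID) {
  # Map the task index to a contiguous range of latitude indices.
  latRange <- ((taskID - 1) * blockSize + 1):min(taskID * blockSize, dim(dataset)[2])
  fits <- list()
  for (latIndex in latRange) {
    for (lonIndex in seq_len(dim(dataset)[1])) {
      Y <- dataset[lonIndex, latIndex, ]
      threshold <- quantile(Y, 1 - 0.01)
      fits[[paste(lonIndex, latIndex, sep = "-")]] <-
        extRemes::fevd(Y, threshold = threshold, type = "GP", method = "MLE")
    }
  }
  fits   # in practice one would extract just the estimates or return levels
}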

20 Schematic of using Rmpi
1. LIBRARIES: install/load all the libraries you need.
2. SETUP: read in any common data sets (e.g. climate data, lat/lon grids), define any input functions, and define objects to control the computation.
3. WORKERS: spawn them and broadcast objects to them.
4. APPLY: mpi.iapplyLB applies a single R function (e.g. dotask) to a sequence of IDs.
5. SAVE RESULTS.
A skeleton of these five steps is sketched below.
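
A hedged skeleton of the five steps in plain Rmpi (outside the HPC4Stats framework); nWorkers, the file names, and the task IDs are placeholders, and dotask is the hypothetical function sketched after slide 19:

library(Rmpi)                                  # 1. LIBRARIES
library(extRemes)

dataset <- readRDS("data/precipDaily.rds")     # 2. SETUP (hypothetical data file)
taskIDs <- 1:16                                #    e.g. one task per block of latitudes
nWorkers <- 35                                 #    one less than the processes requested

mpi.spawn.Rslaves(nslaves = nWorkers)          # 3. WORKERS: spawn them ...
mpi.bcast.cmd(library(extRemes))               #    ... load packages on the workers
mpi.bcast.Robj2slave(dataset)                  #    ... and broadcast common objects
mpi.bcast.Robj2slave(dotask)

results <- mpi.iapplyLB(taskIDs, dotask)       # 4. APPLY: load-balanced apply over the IDs

save(results, file = "output/GPFits.RData")    # 5. SAVE RESULTS
mpi.close.Rslaves()
mpi.quit()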

21 HPC4Stats: a reusable framework to implement Rmpi
The batch script that calls Rmpi does not need to be changed.
Organizes the information broadcast to the workers.
Particular parallel tasks are determined by a short R namelist.
Working directories default to a specific organization.
(Relatively) simple to switch between a laptop/cluster and an HPC system.

22 HPC4Stats layout
Assumes three subdirectories (plus two optional ones):
batch: holds R namelists (.rnl), example batch scripts (.pbs), README files, examples, and output from R scripts.
src: the batchsupervisor.r script and any other source code.
output: where output is saved to (usually in R binary format).
data (optional): where any common data sets are located.
plots (optional): location for plots.

23 HPC4Stats batch execution
We assume that batch jobs are executed from the batch directory. To submit an R batch job:
1. Create a specific R namelist and export the name of this file:
export HPC4StatsNAMELIST=template.rnl
2. Execute the R batch command:
R CMD BATCH --no-save ../src/supervisorbatch.r template.rout
On a local machine: use a terminal window to export the namelist and to submit the batch command.
On an HPC system: create a batch job wrapper using a queue script (e.g. PBS) and submit it to the queuing system.

24 Some details
The dotask function is expected to take only a task ID; the function needs to figure out from this what to do.
Keep in mind that any objects broadcast to the workers will be found by this function through the usual way R searches for objects.
Make sure that the number of workers (nworkers in the R namelist) is at least one less than the number of processes requested.
The R namelist is included as part of the output object.
Keep in mind that while there are many defaults, most choices can be changed through the R namelist (a hypothetical example is sketched below).
Avoid changing the batchsupervisor.r script.
Be creative with your namelist structure!
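
For illustration only, a hypothetical sketch of what a short R namelist (e.g. template.rnl) might contain, assuming the namelist is simply a file of R assignments that the supervisor script reads; apart from nworkers, the names below are made up rather than taken from the actual HPC4Stats scripts:

nworkers <- 35                            # at least one less than the processes requested
dataFile <- "../data/precipDaily.rds"     # hypothetical common data set
outputFile <- "../output/GPFits.RData"    # where results are saved
taskIDs <- 1:16                           # task indices handed to dotask()
tailProb <- 0.01                          # analysis parameters passed on to the workers
returnLevelYear <- 100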

25 Exercise for this afternoon
Running an Rmpi example using HPC4Stats:
Make sure you have Rmpi (and the required MPI and compiler libraries) installed on your laptop!
Download the directory tree HPC4Stats Asheville.
We will start with the README file and take it from there!
