Using R for HPC Data Science. Session: Parallel Programming Paradigms. George Ostrouchov


1 Using R for HPC Data Science. Session: Parallel Programming Paradigms. George Ostrouchov, Oak Ridge National Laboratory and University of Tennessee, and the pbdR Core Team. Course at IT4Innovations, Ostrava, October 6-7, 2016.

2 Outline
Brief introduction to parallel hardware and software
Parallel programming paradigms:
- Shared memory vs. distributed memory
- Manager-workers and fork-join
- MapReduce
- SPMD: single program, multiple data
- Data-flow

3 Three Basic Flavors of Hardware (brief introduction to parallel hardware and software)
[Figure: the three flavors. Distributed memory: separate memories joined by an interconnection network. Shared memory: processors sharing one memory. Co-processor: accelerators with their own local memory (GPU: Graphics Processing Unit; MIC: Many Integrated Core).]

4 Your Laptop or Desktop
[Figure: the same three-flavor diagram, placing a laptop or desktop within it: a shared-memory machine, possibly with a GPU co-processor.]

5 Server to Cluster to Supercomputer
[Figure: the same three-flavor diagram; servers, clusters, and supercomputers combine shared-memory nodes and co-processors under distributed memory over an interconnection network.]

6 Native Programming Models and Tools
- Distributed memory: default is parallel. What is my data and what do I need from others? Tools: Sockets, SPMD (MPI), MapReduce (shuffle).
- Shared memory: default is serial. Which tasks can the compiler make parallel? Tools: OpenMP, Pthreads, fork.
- Co-processor (GPU: Graphics Processing Unit; MIC: Many Integrated Core): offload data and tasks. We are slow but many! Tools: CUDA, OpenCL, OpenACC.

7 30+ Years of Parallel Computing Research
[Figure: the same hardware-and-tools diagram. Distributed memory (default is parallel, SPMD): Sockets, MPI, MapReduce. Shared memory (default is serial): OpenMP, Pthreads, fork. Co-processor: CUDA, OpenCL, OpenACC.]

8 Last 10 Years of Advances
[Figure: the same hardware-and-tools diagram as the previous slide.]

9 Distributed Programming Works in Shared Memory
[Figure: the same diagram; the distributed-memory tools (Sockets, SPMD (MPI), MapReduce (shuffle)) run equally well within a single shared-memory machine.]

10 R Interfaces to Low-Level Native Tools
- Distributed memory: snow, Rmpi, Rhpc, and pbdMPI over sockets and MPI; RHadoop and SparkR over MapReduce.
- Shared memory: multicore over fork; snow + multicore = parallel.
- Co-processor and compiled code: foreign language interfaces .C, .Call, Rcpp, OpenCL, inline, ...
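As a concrete illustration of the foreign language interfaces named above, here is a minimal Rcpp sketch (assuming the Rcpp package and a C++ compiler are available; the function sumSquares is purely illustrative):

    library(Rcpp)

    # Compile a small C++ function and expose it to R in one call.
    cppFunction('
    double sumSquares(NumericVector x) {
      double s = 0.0;
      for (int i = 0; i < x.size(); i++) s += x[i] * x[i];
      return s;
    }')

    sumSquares(as.numeric(1:10))  # 385

Rcpp automates the .Call glue that one would otherwise write by hand.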

11 Some packages in R for parallel computing
- parallel: multicore + snow (see the sketch after this list)
- multicore: an interface to Unix fork (no Windows)
- snow: simple network of workstations
- pbdMPI, pbdDMAT, and other pbd packages: use HPC concepts, simplify, and use scalable libraries
- foreach, doParallel: an interface that hides the hardware reality; can be difficult to debug
- Rmpi: simplified by pbdMPI for SPMD
- RHadoop, RHIPE: need HDFS; slow because file-backed
- datadr: divide-recombine, currently with a MapReduce/Hadoop back end
- SparkR: in-memory, needs HDFS, limited to the shuffle; MPI is generally faster and more flexible
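Of these, the parallel package ships with R and covers both the fork and the snow-style cluster models; a minimal sketch (the toy task is illustrative):

    library(parallel)

    # Fork-based (multicore side; Unix only): workers share the parent's memory image.
    res1 <- mclapply(1:8, function(i) sum(rnorm(1e5)), mc.cores = 2)

    # Socket-based (snow side; also works on Windows): workers are fresh R processes.
    cl <- makeCluster(2)
    res2 <- parLapply(cl, 1:8, function(i) sum(rnorm(1e5)))
    stopCluster(cl)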

12 Manager-Workers
1. A serial program (manager) divides up work and/or data
2. Manager sends work (and data) to workers
3. Workers run in parallel without interaction
4. Manager collects/combines results from workers
Divide-recombine fits this model. The concept appears similar to interactive use and to client-server.
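A minimal manager-workers sketch using the parallel package (the data, the four-way split, and the crossprod task are illustrative assumptions; the same X^T X computation reappears in SPMD form on slide 17):

    library(parallel)

    # 1. The manager divides the data into row chunks.
    X <- matrix(rnorm(4000 * 10), ncol = 10)
    idx <- split(seq_len(nrow(X)), rep(1:4, length.out = nrow(X)))
    chunks <- lapply(idx, function(i) X[i, , drop = FALSE])

    # 2.-3. The manager sends one chunk to each worker; workers run without interaction.
    cl <- makeCluster(4)
    partial <- clusterApply(cl, chunks, crossprod)

    # 4. The manager combines the results: A = X'X since the chunks are row blocks.
    A <- Reduce(`+`, partial)
    stopCluster(cl)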

13 MapReduce
A concept born of a search engine. Decouples certain coupled problems with an intermediate communication step: the shuffle. The user decomposes the computation into Map and Reduce steps and writes two serial codes: Map and Reduce.
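A minimal serial sketch of the decomposition, assuming a word-count task: Map emits (key, value) pairs, the shuffle groups values by key, and Reduce combines each group:

    docs <- c("the cat", "the dog", "cat and dog")

    # Map: a serial function applied per record, emitting (word, 1) pairs.
    words <- unlist(lapply(docs, function(d) strsplit(d, " ")[[1]]))

    # Shuffle: the intermediate communication step, grouping values by key.
    groups <- split(rep(1, length(words)), words)

    # Reduce: a serial function applied per key group.
    counts <- vapply(groups, sum, numeric(1))
    counts  # and: 1, cat: 2, dog: 2, the: 2

In a real MapReduce system only Map and Reduce are written by the user; the shuffle is performed by the framework across the network.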

14 MapReduce: a Parallel Search Engine Concept
Search MANY documents; serve MANY users.
[Figure: web pages (records) partitioned across processors p0-p3 are re-partitioned by index words (keys) through the shuffle (MPI Alltoallv); the block layout before and after is exactly a blocked matrix transpose.]
Matrix transpose in another language?

15 MapReduce Can Use Different Sets of Processors
[Figure: index-word blocks B1-B4 on processors p0-p3 are streamed through the shuffle (MPI Scatter) to a different processor set p4-p7.]

16 SPMD: Single Program Multiple Data
Write one general program so that many copies of it can run asynchronously and cooperate (usually via MPI) to solve the problem.
- The prevalent way of distributed programming in HPC for 30+ years
- Can handle tightly coupled parallel computations
- Designed for batch computing
- Usually no manager; rather, all cooperate
- The prime driver behind the MPI specification
- The way to program the server side in client-server
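A minimal SPMD sketch with pbdMPI (assuming the package and an MPI installation; every rank runs the same script, launched e.g. as mpirun -np 4 Rscript hello.R, where hello.R is an illustrative file name):

    library(pbdMPI)
    init()                         # join the MPI world

    # Every copy runs this same code; behavior differs only by rank.
    msg <- paste("Hello from rank", comm.rank(), "of", comm.size())
    comm.print(msg, all.rank = TRUE)

    finalize()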

17 SPMD example: compute A = X^T X with X row-block partitioned as X = [X_1; ...; X_8], rank i owning X_i, so that A = sum_i X_i^T X_i. In code: A = reduce( crossprod( X_i ) ) leaves the result on one rank; A = allreduce( crossprod( X_i ) ) leaves it on all ranks. [1]
[1] Ostrouchov (1987). Parallel Computing on a Hypercube: An overview of the architecture and some applications. Proceedings of the 19th Symposium on the Interface of Computer Science and Statistics.
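A hedged pbdMPI sketch of this computation (the local data generation is illustrative; crossprod(X_i) computes X_i^T X_i, and allreduce sums the partial products across ranks):

    library(pbdMPI)
    init()

    # Each rank owns its own row block X_i of X.
    set.seed(comm.rank())
    X_i <- matrix(rnorm(1000 * 10), ncol = 10)

    # A = allreduce( X_i' X_i ) = X' X, available on every rank afterwards.
    A <- allreduce(crossprod(X_i), op = "sum")

    comm.print(dim(A))             # 10 x 10 on every rank
    finalize()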

18 SPMD and MapReduce: both concepts are about communication.
- SPMD makes communication explicit and gives choices (MPI)
- MapReduce hides communication and uses one choice (the shuffle)

19 Data-flow: Parallel Runtime Scheduling and Execution Controller (PaRSEC)
[Graphic from icl.cs.utk.edu]
Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Herault, T., Dongarra, J. PaRSEC: Exploiting Heterogeneity to Enhance Scalability. IEEE Computing in Science and Engineering, Vol. 15, No. 6, 36-45, November 2013.
- The master data-flow controller runs distributed on all cores
- Dynamic generation of the current level in the flow graph
- Effectively removes collective synchronizations

20 Libraries. Recall: hardware flavors and low-level native tools
[Figure: the slide 10 diagram repeated: R packages (snow, Rmpi, Rhpc, pbdMPI; RHadoop, SparkR; multicore; foreign language interfaces .C, .Call, Rcpp, OpenCL, inline) over the native tools for distributed memory, shared memory, and co-processors.]

21 Scalable Libraries Mapped to Hardware
- Distributed memory: MPI, ScaLAPACK, PBLAS, BLACS, PETSc, Trilinos, CombBLAS, DPLASMA, ZeroMQ
- Shared memory: LibSci (Cray), MKL (Intel), ACML (AMD), PLASMA
- Co-processor: MAGMA, cuBLAS (NVIDIA), cuSPARSE (NVIDIA)
- Profiling: Tau, mpiP, fpmpi, PAPI
- I/O: NetCDF4, ADIOS

22 R and pbdR Interfaces to HPC Libraries
- Distributed memory: pbdMPI over MPI; pbdDMAT, pbdBASE, and pbdSLAP over ScaLAPACK, PBLAS, and BLACS; pbdCS, pbdZMQ, remoter, and getpass over ZeroMQ
- Shared memory backends: LibSci (Cray), MKL (Intel), ACML (AMD), OpenBLAS, PLASMA, DPLASMA
- Co-processor: MAGMA, cuBLAS (NVIDIA), cuSPARSE (NVIDIA)
- Profiling: pbdPROF (mpiP, fpmpi, Tau), pbdPAPI (PAPI)
- I/O: pbdIO, pbdNCDF4 (NetCDF4), rhdf5 (HDF5), pbdADIOS (ADIOS)
- Machine learning: pbdML; learning: pbdDEMO
(Some of these packages are released, others under development.)
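A hedged sketch of the distributed-matrix layer (pbdDMAT over ScaLAPACK/PBLAS; assumes the package is installed and the script is launched under mpirun; the random matrix is illustrative):

    library(pbdDMAT)
    init.grid()                    # set up the 2d BLACS process grid

    # A distributed matrix: each rank owns a block, but the code reads like serial R.
    X <- ddmatrix("rnorm", nrow = 1000, ncol = 10)

    A <- crossprod(X)              # distributed X'X via PBLAS/ScaLAPACK
    A_local <- as.matrix(A)        # gather back to an ordinary R matrix

    finalize()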

23 Libraries. Recall: hardware flavors and low-level native tools
[Figure: the slide 10 diagram repeated once more.]

24 Acknowledgments. Prepared by the pbdR Core Team
Engaging parallel libraries at scale: R language unchanged; new distributed concepts; new profiling capabilities; new interactive SPMD; in situ distributed capability; in situ staging capability via ADIOS; plans for DPLASMA GPU capability.
pbdR Core Team Developers: Wei-Chen Chen (FDA), George Ostrouchov (ORNL & UTK), Drew Schmidt (UTK)
Developers: Christian Heckendorf, Pragneshkumar Patel, Gaurav Sehrawat
Contributors: Whit Armstrong, Ewan Higgs, Michael Lawrence, Michael Matheson, David Pierce, Andrew Raim, Brian Ripley, ZhaoKang Wang, Hao Yu
Support: This material is based upon work supported by the National Science Foundation Division of Mathematical Sciences under Grant No. This work used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR. This work also used resources of the National Institute for Computational Sciences at the University of Tennessee, Knoxville, which is supported by the Office of Cyberinfrastructure of the U.S. National Science Foundation.
