Using R for HPC Data Science. Session: Parallel Programming Paradigms. George Ostrouchov
1 Using R for HPC Data Science
Session: Parallel Programming Paradigms
George Ostrouchov, Oak Ridge National Laboratory and University of Tennessee, and pbdR Core Team
Course at IT4Innovations, Ostrava, October 6-7, 2016
2 Outline
- Brief introduction to parallel hardware and software
- Parallel programming paradigms
- Shared memory vs. distributed memory
- Manager-workers and fork-join
- MapReduce
- SPMD: single program, multiple data
- Data-flow
3 Brief introduction to parallel hardware and software: Three Basic Flavors of Hardware
[Diagram: three hardware flavors. Distributed memory: nodes, each with its own memory (Mem), joined by an interconnection network. Shared memory: cores sharing one memory. Co-processor: a device with its own local memory, such as a GPU (Graphics Processing Unit) or MIC (Many Integrated Core).]
4 Your Laptop or Desktop
[Same hardware diagram.]
5 Server to Cluster to Supercomputer
[Same hardware diagram.]
6 Native Programming Models and Tools
[Hardware diagram (distributed memory, shared memory, co-processor) annotated with programming models.]
- Distributed memory. Default is parallel: what is my data and what do I need from others? Tools: Sockets, SPMD (MPI), MapReduce (shuffle).
- Co-processor (GPU, MIC). Offload data and tasks. We are slow but many!
- Shared memory. Default is serial: which tasks can the compiler make parallel?
- Tools shown for shared memory and co-processors: CUDA, OpenCL, OpenACC, OpenMP, Pthreads, fork.
7 30+ Years of Parallel Computing Research
[Same annotated diagram as slide 6.]
8 Last 10 Years of Advances
[Same annotated diagram as slide 6.]
9 Distributed Programming Works in Shared Memory
[Same annotated diagram as slide 6.]
10 R Interfaces to Low-Level Native Tools
[Hardware diagram (distributed memory, shared memory, co-processor) annotated with native tools and the R packages that interface to them.]
- Distributed memory. Default is parallel (SPMD): what is my data and what do I need from others? Native tools: Sockets, MPI, MapReduce. R packages shown: snow, Rmpi, Rhpc, pbdMPI, RHadoop, SparkR.
- Co-processor (GPU, MIC). Offload data and tasks. We are slow but many!
- Shared memory. Default is serial: which tasks can the compiler make parallel?
- Native tools shown for shared memory and co-processors: CUDA, OpenCL, OpenACC, OpenMP, Pthreads, fork. R package shown: multicore.
- Foreign language interfaces: .C, .Call, Rcpp, OpenCL, inline, ...
- snow + multicore = parallel
11 Some packages in R for parallel computing
- parallel: multicore + snow
- multicore: an interface to Unix fork (no Windows)
- snow: simple network of workstations
- pbdMPI, pbdDMAT, and other pbd packages: use HPC concepts, simplify, and use scalable libraries
- foreach, doParallel: interface to hide hardware reality; can be difficult to debug
- Rmpi: simplified with pbdMPI for SPMD
- RHadoop, RHIPE: need HDFS; slow because file-backed
- datadr: divide-recombine; currently MapReduce/Hadoop back end
- SparkR: in-memory, needs HDFS, limited to shuffle; MPI generally faster and more flexible
12 Manager-Workers
1. A serial program (Manager) divides up work and/or data
2. Manager sends work (and data) to workers
3. Workers run in parallel without interaction
4. Manager collects/combines results from workers
Divide-Recombine fits this model.
Concept appears similar to interactive and to client-server.
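The four manager-workers steps can be sketched in a few lines. This is a minimal illustration in Python rather than any of the R packages named in these slides; the names `manager` and `worker` and the sum-of-squares task are hypothetical, and a thread pool stands in for real worker processes, so it shows the structure rather than a speedup.

```python
from concurrent.futures import ThreadPoolExecutor


def worker(chunk):
    """Worker task: processes its own chunk with no interaction with other workers."""
    return sum(x * x for x in chunk)


def manager(data, n_workers=4):
    """Manager: divide the data, send chunks to workers, combine their results."""
    # Step 1: divide the data into contiguous chunks (ceiling division, at least 1).
    size = max(1, -(-len(data) // n_workers))
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    # Steps 2-3: send one chunk to each worker task; the tasks run independently.
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, chunks))
    # Step 4: collect and combine the partial results.
    return sum(partials)


total = manager(list(range(1, 101)))
print(total)
```

The combine step here is a plain sum; in a divide-recombine analysis it would be whatever operation merges the per-chunk results.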
13 MapReduce
- A concept born of a search engine
- Decouples certain coupled problems with an intermediate communication: shuffle
- User needs to decompose computation into Map and Reduce steps
- User writes two serial codes: Map and Reduce
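To make the Map, shuffle, and Reduce roles concrete, here is a small single-process sketch in Python. It is an illustration only, not Hadoop, Spark, or any R package; the function names and the word-count task are assumptions chosen for the example. The user-written parts are the two serial functions `map_step` and `reduce_step`; `shuffle` plays the part a framework would normally provide.

```python
from collections import defaultdict


def map_step(record):
    """Map: serial code applied to one record; emits (key, value) pairs."""
    return [(word.lower(), 1) for word in record.split()]


def shuffle(mapped):
    """Shuffle: the intermediate communication that groups all values by key."""
    groups = defaultdict(list)
    for pairs in mapped:
        for key, value in pairs:
            groups[key].append(value)
    return groups


def reduce_step(key, values):
    """Reduce: serial code applied to all values of one key."""
    return key, sum(values)


def map_reduce(records):
    mapped = [map_step(r) for r in records]      # independent per record
    grouped = shuffle(mapped)                    # the only communication step
    return dict(reduce_step(k, v) for k, v in grouped.items())  # independent per key


counts = map_reduce(["R for HPC", "HPC data science", "data for R"])
print(counts)
```

Because each `map_step` call touches one record and each `reduce_step` call touches one key, both loops could be distributed; only the shuffle moves data between them.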
14 MapReduce: a Parallel Search Engine Concept
Search MANY documents. Serve MANY users.
[Diagram: data layout before and after the shuffle.]
Before the shuffle, each process holds a block of web pages (records) covering all index words (keys):
p0: A1 A2 A3 A4
p1: B1 B2 B3 B4
p2: C1 C2 C3 C4
p3: D1 D2 D3 D4
Shuffle (MPI Alltoallv)
After the shuffle, each process holds a block of index words (keys) covering all web pages (records):
p0: A1 B1 C1 D1
p1: A2 B2 C2 D2
p2: A3 B3 C3 D3
p3: A4 B4 C4 D4
Matrix transpose in another language?
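The "matrix transpose" reading of the shuffle can be checked with a short simulation. This Python sketch runs in a single process and only imitates the data movement of an all-to-all exchange; the function name `shuffle_alltoall` is hypothetical, and it assumes every process holds exactly one block per destination, whereas MPI_Alltoallv also allows blocks of varying size.

```python
def shuffle_alltoall(blocks_by_process):
    """Simulate an all-to-all exchange: process src sends its dst-th block to process dst."""
    n = len(blocks_by_process)
    return [[blocks_by_process[src][dst] for src in range(n)] for dst in range(n)]


# Before: process i holds page block A..D, split into word blocks 1..4.
before = [[f"{page}{word}" for word in range(1, 5)] for page in "ABCD"]
after = shuffle_alltoall(before)

for rank, blocks in enumerate(after):
    print(f"p{rank}:", " ".join(blocks))
```

Applying the exchange twice returns the original layout, which is the behavior expected of a transpose.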
15 MapReduce: Can use different sets of processors
[Diagram: blocks B1 B2 B3 B4, with web pages (records) on the process set p0-p3, pass through a streaming shuffle (MPI Scatter) to a different process set p4-p7 organized by index words (keys).]
16 SPMD: Single Program Multiple Data
Write one general program so many copies of it can run asynchronously and cooperate (usually via MPI) to solve the problem.
- The prevalent way of distributed programming in HPC for 30+ years
- Can handle tightly coupled parallel computations
- It is designed for batch computing
- There is usually no manager - rather, all cooperate
- Prime driver behind MPI specification
- Way to program server side in client-server
17 SPMD
$A = X^T X$, where $X = \begin{bmatrix} X_1 \\ \vdots \\ X_8 \end{bmatrix}$ (row-block partition)
$A = \mathrm{reduce}(\mathrm{crossprod}(X_i))$
$A = \mathrm{allreduce}(\mathrm{crossprod}(X_i))$ [1]
[1] Ostrouchov (1987). Parallel Computing on a Hypercube: An overview of the architecture and some applications. Proceedings of the 19th Symposium on the Interface of Computer Science and Statistics, p
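The row-block formula works because $X^T X = \sum_{i=1}^{8} X_i^T X_i$, so each process needs only its own block and one sum-reduction. The sketch below simulates this in a single Python process with eight pretend ranks; it is not MPI or pbdMPI code. The helper names `crossprod` and `allreduce_sum` are stand-ins written for this example, and the small integer matrix is made up so the comparison is exact.

```python
def crossprod(block):
    """Return block^T block for a matrix stored as a list of rows."""
    p = len(block[0])
    return [[sum(row[a] * row[b] for row in block) for b in range(p)]
            for a in range(p)]


def allreduce_sum(local_results):
    """Simulated allreduce: sum the local matrices and give every rank its own copy."""
    p = len(local_results[0])
    total = [[sum(m[a][b] for m in local_results) for b in range(p)]
             for a in range(p)]
    return [[row[:] for row in total] for _ in local_results]


# A 16 x 3 integer matrix, split into 8 row blocks of 2 rows each (one per rank).
X = [[i, (i * i) % 7, i - 3] for i in range(16)]
blocks = [X[2 * k:2 * k + 2] for k in range(8)]

# Every rank runs the same program on its own block, then all ranks combine.
local = [crossprod(block) for block in blocks]
A_on_each_rank = allreduce_sum(local)

print(A_on_each_rank[0] == crossprod(X))
```

With a reduce instead of an allreduce, only one rank would end up holding $A$; the allreduce variant leaves the full result on every rank, which suits the SPMD style where there is no manager.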
18 SPMD and MapReduce (MPI and Shuffle): Both Concepts are about Communication
- SPMD makes communication explicit, gives choices (MPI)
- MapReduce hides communication, uses one choice (shuffle)
19 Data-flow: Parallel Runtime Scheduling and Execution Controller (PaRSEC)
[Graphic from icl.cs.utk.edu]
- Master data-flow controller runs distributed on all cores
- Dynamic generation of current level in flow graph
- Effectively removes collective synchronizations
Bosilca, G., Bouteiller, A., Danalis, A., Faverge, M., Herault, T., Dongarra, J. PaRSEC: Exploiting Heterogeneity to Enhance Scalability, IEEE Computing in Science and Engineering, Vol. 15, No. 6, 36-45, November,
20 Libraries. Recall: Hardware Flavors and Low-Level Native Tools
[Repeat of the slide 10 diagram: hardware flavors annotated with native tools and their R interfaces.]
21 Libraries: Scalable Libraries Mapped to Hardware
[Hardware diagram (distributed memory, shared memory, co-processor) annotated with library names.]
- Distributed memory and interconnection network: ZeroMQ, MPI, ScaLAPACK, PBLAS, BLACS, PETSc, Trilinos, CombBLAS
- Shared memory and co-processors (GPU, MIC): LibSci (Cray), MKL (Intel), ACML (AMD), DPLASMA, PLASMA, MAGMA, cuBLAS (NVIDIA), cuSPARSE (NVIDIA)
- Profiling: Tau, mpiP, fpmpi, PAPI
- I/O: NetCDF4, ADIOS
22 Libraries: R and pbdR Interfaces to HPC Libraries
[Hardware diagram annotated with HPC libraries and the R packages that interface to them; a legend distinguishes released packages from those under development.]
- HPC libraries shown: ZeroMQ, MPI, ScaLAPACK, PBLAS, BLACS, PETSc, Trilinos, CombBLAS, LibSci (Cray), MKL (Intel), ACML (AMD), OpenBLAS, DPLASMA, PLASMA, MAGMA, cuBLAS (NVIDIA), cuSPARSE (NVIDIA)
- R packages shown alongside them: pbdCS, pbdZMQ, remoter, getPass, pbdDMAT, pbdBASE, pbdSLAP, pbdMPI
- Profiling: PAPI, Tau, mpiP, fpmpi; R packages: pbdPROF, pbdPAPI
- I/O: HDF5, NetCDF4, ADIOS; R packages: pbdIO, rhdf5, pbdNCDF4, pbdADIOS
- Machine learning: pbdML
- Learning pbdR: pbdDEMO
23 Libraries. Recall: Hardware Flavors and Low-Level Native Tools
[Repeat of the slide 10 diagram: hardware flavors annotated with native tools and their R interfaces.]
24 Acknowledgments
Prepared by pbdR Core Team.
- Engaging parallel libraries at scale
- R language unchanged
- New distributed concepts
- New profiling capabilities
- New interactive SPMD
- In situ distributed capability
- In situ staging capability via ADIOS
- Plans for DPLASMA GPU capability
pbdR Core Team Developers: Wei-Chen Chen, FDA; George Ostrouchov, ORNL & UTK; Drew Schmidt, UTK
Developers: Christian Heckendorf, Pragneshkumar Patel, Gaurav Sehrawat
Contributors: Whit Armstrong, Ewan Higgs, Michael Lawrence, Michael Matheson, David Pierce, Andrew Raim, Brian Ripley, ZhaoKang Wang, Hao Yu
Support:
This material is based upon work supported by the National Science Foundation Division of Mathematical Sciences under Grant No
This work used resources of the Oak Ridge Leadership Computing Facility at the Oak Ridge National Laboratory, which is supported by the Office of Science of the U.S. Department of Energy under Contract No. DE-AC05-00OR
This work also used resources of the National Institute for Computational Sciences at the University of Tennessee, Knoxville, which is supported by the Office of Cyberinfrastructure of the U.S. National Science Foundation.
More informationA scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems
CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. () Published online in Wiley Online Library (wileyonlinelibrary.com)..33 A scalable approach to solving dense linear
More informationParallel Programming. Libraries and implementations
Parallel Programming Libraries and implementations Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License. http://creativecommons.org/licenses/by-nc-sa/4.0/deed.en_us
More informationThe Constellation Project. Andrew W. Nash 14 November 2016
The Constellation Project Andrew W. Nash 14 November 2016 The Constellation Project: Representing a High Performance File System as a Graph for Analysis The Titan supercomputer utilizes high performance
More informationHPC with GPU and its applications from Inspur. Haibo Xie, Ph.D
HPC with GPU and its applications from Inspur Haibo Xie, Ph.D xiehb@inspur.com 2 Agenda I. HPC with GPU II. YITIAN solution and application 3 New Moore s Law 4 HPC? HPC stands for High Heterogeneous Performance
More informationParallel Programming Models. Parallel Programming Models. Threads Model. Implementations 3/24/2014. Shared Memory Model (without threads)
Parallel Programming Models Parallel Programming Models Shared Memory (without threads) Threads Distributed Memory / Message Passing Data Parallel Hybrid Single Program Multiple Data (SPMD) Multiple Program
More informationShared memory programming model OpenMP TMA4280 Introduction to Supercomputing
Shared memory programming model OpenMP TMA4280 Introduction to Supercomputing NTNU, IMF February 16. 2018 1 Recap: Distributed memory programming model Parallelism with MPI. An MPI execution is started
More informationIntroduction to HPC Parallel I/O
Introduction to HPC Parallel I/O Feiyi Wang (Ph.D.) and Sarp Oral (Ph.D.) Technology Integration Group Oak Ridge Leadership Computing ORNL is managed by UT-Battelle for the US Department of Energy Outline
More informationIntroduction to Runtime Systems
Introduction to Runtime Systems Towards Portability of Performance ST RM Static Optimizations Runtime Methods Team Storm Olivier Aumage Inria LaBRI, in cooperation with La Maison de la Simulation Contents
More informationAddressing Heterogeneity in Manycore Applications
Addressing Heterogeneity in Manycore Applications RTM Simulation Use Case stephane.bihan@caps-entreprise.com Oil&Gas HPC Workshop Rice University, Houston, March 2008 www.caps-entreprise.com Introduction
More informationA Standard for Batching BLAS Operations
A Standard for Batching BLAS Operations Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 5/8/16 1 API for Batching BLAS Operations We are proposing, as a community
More informationStarPU: a runtime system for multigpu multicore machines
StarPU: a runtime system for multigpu multicore machines Raymond Namyst RUNTIME group, INRIA Bordeaux Journées du Groupe Calcul Lyon, November 2010 The RUNTIME Team High Performance Runtime Systems for
More informationAmgX 2.0: Scaling toward CORAL Joe Eaton, November 19, 2015
AmgX 2.0: Scaling toward CORAL Joe Eaton, November 19, 2015 Agenda Introduction to AmgX Current Capabilities Scaling V2.0 Roadmap for the future 2 AmgX Fast, scalable linear solvers, emphasis on iterative
More informationPerformance of deal.ii on a node
Performance of deal.ii on a node Bruno Turcksin Texas A&M University, Dept. of Mathematics Bruno Turcksin Deal.II on a node 1/37 Outline 1 Introduction 2 Architecture 3 Paralution 4 Other Libraries 5 Conclusions
More informationParallel and Distributed Computing
Parallel and Distributed Computing NUMA; OpenCL; MapReduce José Monteiro MSc in Information Systems and Computer Engineering DEA in Computational Engineering Department of Computer Science and Engineering
More informationParallel and High Performance Computing CSE 745
Parallel and High Performance Computing CSE 745 1 Outline Introduction to HPC computing Overview Parallel Computer Memory Architectures Parallel Programming Models Designing Parallel Programs Parallel
More informationTOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT
TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware
More informationMAGMA: a New Generation
1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release
More informationSome notes on efficient computing and high performance computing environments
Some notes on efficient computing and high performance computing environments Abhi Datta 1, Sudipto Banerjee 2 and Andrew O. Finley 3 July 31, 2017 1 Department of Biostatistics, Bloomberg School of Public
More informationCode Auto-Tuning with the Periscope Tuning Framework
Code Auto-Tuning with the Periscope Tuning Framework Renato Miceli, SENAI CIMATEC renato.miceli@fieb.org.br Isaías A. Comprés, TUM compresu@in.tum.de Project Participants Michael Gerndt, TUM Coordinator
More informationSteve Scott, Tesla CTO SC 11 November 15, 2011
Steve Scott, Tesla CTO SC 11 November 15, 2011 What goal do these products have in common? Performance / W Exaflop Expectations First Exaflop Computer K Computer ~10 MW CM5 ~200 KW Not constant size, cost
More information