PaRSEC: Distributed task-based runtime for scalable hybrid applications
1 PaRSEC: Distributed task-based runtime for scalable hybrid applications
George Bosilca and many others
Intertwine Workshop, April 20, 2018
2 PaRSEC: a generic runtime system for asynchronous, architecture-aware scheduling of fine-grained tasks on distributed many-core heterogeneous architectures

Concepts
- Clear separation of concerns: the compiler optimizes each task class, the developer describes dependencies between tasks, and the runtime orchestrates the dynamic execution
- Interfaces with application developers through specialized domain-specific languages (PTG/JDF, TTG, Python, insert_task, fork/join, ...)
- Separates algorithms from data distribution
- Makes control-flow execution a relic

Runtime
- Portability layer for heterogeneous architectures
- Scheduling policies adapt every execution to the hardware and the ongoing system status
- Data movements between producers and consumers are inferred from dependencies; communication/computation overlap unfolds naturally
- Coherency protocols minimize data movements
- Memory hierarchies (including NVRAM and disk) are an integral part of the scheduling decisions
3 Tasks and data collections
[Figure: a graph of tasks A(k), B(k), C(k), F(A(k)), F(C(k)) reading and writing data collections; user data can be a dense matrix, sparse, structured or unstructured]

PaRSEC is a data-centric programming environment based on asynchronous tasks executing on a heterogeneous distributed environment.
- A task is an execution unit taking a set of input data and generating, upon completion, a different set of output data
- Tasks and data have a coherent distributed scope (managed by the runtime)
- Low-level API allowing the design of domain-specific languages (JDF, DTD, TTG)
- Supports distributed heterogeneous environments (and triggers task execution on them)
- Communications are implicit (the runtime moves data)
- Resources are dynamic (threads, accelerators)
- Built-in resilience, performance instrumentation and analysis (R, Python)
4 PaRSEC Architecture
[Figure: layered architecture. Domain Specific Extensions (dense LA, sparse LA, chemistry, ...) define task classes and specialized kernels through a compact representation (PTG), a dynamically discovered representation (DTD), or TTG. The parallel runtime provides scheduling (local and distributed), a memory allocator, data collections, and data movement with collective patterns and a software MOESI* coherence protocol. The hardware layer covers cores, memory hierarchies, accelerators (NVIDIA, Xeon Phi, OpenACC) and communication (MPI, UCX, GasNet*). Items marked * are not yet componentized.]

- Software design based on the Modular Component Architecture (MCA) of Open MPI
- Clear component APIs
- Runtime selection of components
- Implementing a new component has little impact on the rest of the software stack
5 The PaRSEC data
[Figure: a data collection with data A(k); user-defined and runtime-defined views; data copies at versions v1 and v2]

- A data is a manipulation token, the basic logical element (view) used in the description of the dataflow
  - Locations: can have multiple coherent copies (remote node, device, checkpoint)
  - Shape: can have different memory layouts
  - Visibility: only accessible via the most current version of the data
  - State: can be migrated / logged
- Data collections are ensembles of data distributed among the nodes
  - Can be regular (multi-dimensional matrices) or irregular (sparse data, graphs)
  - Can be regularly distributed (cyclic-k) or user-defined
- A data view is a subset of a data collection used in a particular algorithm (a submatrix, row, column, ...)
- A data copy is the practical unit of data
  - Has a memory layout (think MPI datatype)
  - Has a locality property (device, NUMA domain, node)
  - Has a version associated with it
  - Multiple instances can coexist
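The versioned-copy idea above can be sketched in a few lines of C. This is an illustration only, not the real PaRSEC API: the struct and function names are invented, and a logical data simply keeps a small array of physical copies, each with a location and a version, with readers directed to the most recent one.

```c
#include <stddef.h>

/* Illustrative sketch (not PaRSEC's actual types): a logical "data"
 * owns several physical copies; each copy has its own location and
 * version, and only the most recent version may be read. */
typedef struct data_copy {
    int   device;    /* -1 = host memory, >= 0 = accelerator id */
    int   version;   /* incremented on every write to the data */
    void *ptr;       /* the actual memory, with its own layout */
} data_copy_t;

#define MAX_COPIES 4

typedef struct data {
    data_copy_t copies[MAX_COPIES];
    int         ncopies;
} data_t;

/* Return the copy holding the most recent version, or NULL if none. */
static data_copy_t *data_current(data_t *d)
{
    data_copy_t *best = NULL;
    for (int i = 0; i < d->ncopies; i++)
        if (NULL == best || d->copies[i].version > best->version)
            best = &d->copies[i];
    return best;
}
```

With two coexisting copies, say version 1 on the host and version 2 on device 0, `data_current` selects the device copy; a coherence protocol would transfer or invalidate the stale one before a host task may read.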
6 DSL: The PaRSEC application

    /* Define a distributed collection of data (here a 1-dimensional array of integers) */
    parsec_vector_t dDATA;
    parsec_vector_init( &dDATA, matrix_integer, matrix_tile, nodes, rank,
                        1,    /* tile size */
                        N,    /* global vector size */
                        0,    /* starting point */
                        1 );  /* block size */

    /* Start the PaRSEC engine (resource allocation) */
    parsec_context_t* parsec = parsec_context_init(NULL, NULL);

    /* Create a task placeholder and associate it with the PaRSEC context */
    parsec_handle_t* parsec_dtd_handle = parsec_handle_new(parsec);
    parsec_enqueue(parsec, (parsec_handle_t*)parsec_dtd_handle);
    parsec_context_start(parsec);

    /* Add tasks; a configurable window limits the number of pending tasks */
    for( n = 0; n < N; n++ ) {
        parsec_insert_task( parsec_dtd_handle, call_to_kernel_type_write, "Create_Data",
                            PASSED_BY_REF, DATA_AT(&dDATA, n), OUT | REGION_FULL,
                            0 /* DONE */ );
        for( k = 0; k < K; k++ ) {
            parsec_insert_task( parsec_dtd_handle, call_to_kernel_type_read, "Read_Data",
                                PASSED_BY_REF, DATA_AT(&dDATA, n), INPUT | REGION_FULL,
                                0 /* DONE */ );
        }
    }

    /* Wait until completion */
    parsec_handle_wait( parsec_dtd_handle );

Data initialization and the PaRSEC context setup are common to all DSLs.
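The mechanism behind the example above, deducing dependencies from the sequential insertion order and the declared access modes, can be shown with a self-contained toy. This is not PaRSEC's internal code: it is a deliberately simplified sketch (one datum per task, true dependencies only) where the runtime remembers the last task that wrote each datum, and every later access to that datum depends on it.

```c
/* Toy sketch of insert_task-style dependency inference (simplified;
 * not PaRSEC internals): tasks are discovered in sequential program
 * order, and each access depends on the datum's last writer. */

#define MAX_DATA  8
#define MAX_TASKS 64

enum access_mode { ACCESS_IN, ACCESS_OUT };

static int last_writer[MAX_DATA]; /* last OUT task per datum, -1 if none */
static int dep_on[MAX_TASKS];     /* task -> the task it depends on, -1 if none */
static int ntasks;

static void graph_reset(void)
{
    ntasks = 0;
    for (int i = 0; i < MAX_DATA; i++) last_writer[i] = -1;
}

/* Insert one task touching a single datum; return its id. */
static int toy_insert_task(int datum, enum access_mode mode)
{
    int id = ntasks++;
    dep_on[id] = last_writer[datum];           /* read-/write-after-write */
    if (ACCESS_OUT == mode) last_writer[datum] = id;
    return id;
}
```

Replaying the slide's pattern, one write to dDATA(n) followed by K reads, makes every read task a successor of the write task, which is exactly what lets the runtime move the data from producer to consumers without explicit communication.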
7 Uncountable ways
How to describe a graph of tasks?
- Generic: Dagger (Charm++), Legion, ParalleX, Parameterized Task Graph (PaRSEC), Dynamic Task Discovery (StarPU, StarSs), Yvette (XML), fork/join (spawn), CnC, Uintah, DARMA, Kokkos, RAJA
- Application specific: MADNESS

The PaRSEC runtime is agnostic to the domain-specific language (DSL); different DSLs interoperate through the data collections.
The DSLs share:
- Distributed schedulers
- Communication engine
- Hardware resources
- Data management (coherence, versioning, ...)
They don't share:
- The task structure
- The internal dataflow depiction
8 DSL: The insert_task interface
- Define a distributed collection of data (here a 1-dimensional array of integers)
- Start PaRSEC (resource allocation)
- Create a task placeholder and associate it with the PaRSEC context
- Add tasks; a configurable window limits the number of pending tasks
- Wait until completion

    parsec_vector_t dDATA;
    parsec_vector_init( &dDATA, matrix_integer, matrix_tile, nodes, rank,
                        1,    /* tile size */
                        N,    /* global vector size */
                        0,    /* starting point */
                        1 );  /* block size */

    parsec_context_t* parsec = parsec_context_init(NULL, NULL); /* start the PaRSEC engine */

    parsec_handle_t* parsec_dtd_handle = parsec_handle_new(parsec);
    parsec_enqueue(parsec, (parsec_handle_t*)parsec_dtd_handle);
    parsec_context_start(parsec);

    for( n = 0; n < N; n++ ) {
        parsec_insert_task( parsec_dtd_handle, call_to_kernel_type_write, "Create_Data",
                            PASSED_BY_REF, DATA_AT(&dDATA, n), OUT | REGION_FULL,
                            0 /* DONE */ );
        for( k = 0; k < K; k++ ) {
            parsec_insert_task( parsec_dtd_handle, call_to_kernel_type_read, "Read_Data",
                                PASSED_BY_REF, DATA_AT(&dDATA, n), INPUT | REGION_FULL,
                                0 /* DONE */ );
        }
    }

    parsec_handle_wait( parsec_dtd_handle );

Data initialization and the PaRSEC context setup are common to all DSLs.
9 The Parameterized Task Graph (JDF)

    GEQRT(k)
      k = 0 .. ( (MT < NT) ? MT-1 : NT-1 )
      : A(k, k)

      RW A <- (k == 0)    ? A(k, k) : A1 TSMQR(k-1, k, k)
           -> (k < NT-1)  ? A  UNMQR(k, k+1 .. NT-1)   [type = LOWER]
           -> (k < MT-1)  ? A1 TSQRT(k, k+1)           [type = UPPER]
           -> (k == MT-1) ? A(k, k)                    [type = UPPER]
      WRITE T <- T(k, k)
            -> T(k, k)
            -> (k < NT-1) ? T UNMQR(k, k+1 .. NT-1)

    BODY [type = CPU]  /* default */
      zgeqrt( A, T );
    END
    BODY [type = CUDA]
      cuda_zgeqrt( A, T );
    END

- Control flow is eliminated, therefore maximum parallelism is possible
- A parameterized and concise dataflow language
- Accepts non-dense iterators
- Allows inlined C/C++ code to augment the language [any expression]
- The termination mechanism is part of the runtime (i.e., it needs to know the number of tasks per node)
- The dependencies must be globally (and statically) defined prior to the execution
- Dynamic DAGs are unnatural; no data-dependent DAGs
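The key property of the parameterized form is that a task's successors are computed algebraically from its indices instead of being stored in an unrolled DAG. As a sketch, reading the A dataflow of GEQRT(k) off the JDF above (the helper name is ours, not PaRSEC's), the number of direct task successors for that output is:

```c
/* Illustrative helper (not part of PaRSEC): count the task successors
 * of GEQRT(k)'s A output as written in the JDF above, for an MT x NT
 * tiled matrix.  No task list is ever materialized. */
static int geqrt_a_nsuccessors(int k, int MT, int NT)
{
    int n = 0;
    if (k < NT - 1)  n += NT - 1 - k;  /* A  -> UNMQR(k, k+1 .. NT-1) */
    if (k < MT - 1)  n += 1;           /* A1 -> TSQRT(k, k+1)         */
    /* when k == MT-1, the UPPER part flows back to memory A(k, k)    */
    return n;
}
```

For a 4 x 4 tile grid, GEQRT(0) thus feeds four tasks (three UNMQR and one TSQRT) while GEQRT(3) feeds none; the runtime evaluates such expressions on demand, which keeps the graph representation O(1) in the problem size.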
10 Dense Linear Algebra: QR on heterogeneous hardware
[Figure: QR factorization performance on Arc machines (20 cores, gcc 6.3, MKL 2016, CUDA), comparing PaRSEC-2.0-rc1 and StarPU over kernels GEQRT, TSQRT, UNMQR, TSMQR; MAGMA runs out of GPU memory. B: big tile size, b: small tile size, ib: inner block size]
11 Dense Linear Algebra
DPLASMA = ScaLAPACK + runtime (PaRSEC)
[Figure: DGEQRF strong scaling on Cray XT5 (Kraken), N = M = 41,472; performance (TFlop/s) vs. number of cores for Systolic QR over PULSAR, Systolic QR over PaRSEC (2D), DPLASMA HQR (best single tree), and LibSci ScaLAPACK]
[Figure: results on Keeneland with kernels GEQRT, TSQRT, UNMQR, TSMQR; "h" stands for dynamic hierarchical algorithms (a task can divide itself)]
12 Overhead of insert_task
[Figure: Cholesky (double precision) on Haswell, 8 nodes, 20 cores/node, 160 cores; Gflop/s vs. problem size N / number of tasks (millions) for tile sizes 128 and 320, comparing PaRSEC-PTG, PaRSEC-DTD, and PaRSEC-DTD execution only]
13 Overhead of insert_task: DTD overhead
[Figure: same Cholesky (double precision) experiment on Haswell, 8 nodes, 20 cores/node, 160 cores, as on the previous slide]
Model variables:
- T_DTD/PTG: overall time
- N: total number of tasks
- C_T: cost/duration of each task
- P: total number of nodes/processes
- n: total number of cores
- C_D: cost of discovering a task
- C_R: cost of building the DAG/relationships
14 Overhead of insert_task: DTD overhead
[Figure: weak scaling with fixed task duration and a fixed number of tasks per process, up to 6,144 cores, using the model variables T_DTD/PTG, N, C_T, P, n, C_D, C_R defined above]

- Benefits: the critical path is defined by the sequential ordering
- Drawbacks: impossible to build collective patterns; selecting the window size is difficult; all data movements must be known globally (and their order is critically important)

There are three types of insertion scenario:
- Insert All: each rank inserts all tasks, and executes only the local ones
- Select Insert: each rank inserts only local tasks, but iterates over all tasks
- Insert Local: each rank only inserts local tasks
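The difference between the scenarios is where the O(N) discovery cost falls. A minimal sketch of "Select Insert", assuming for illustration a cyclic owner(t) = t mod P distribution (the distribution and function names are ours):

```c
/* Sketch of the "Select Insert" scheme: every rank iterates over the
 * full task range (discovery cost O(N) per rank), but inserts only
 * the ~N/P tasks it owns under an assumed cyclic distribution. */
static int owner(int task, int nranks) { return task % nranks; }

/* Number of tasks `rank` actually inserts out of N. */
static int select_insert(int rank, int nranks, int N)
{
    int inserted = 0;
    for (int t = 0; t < N; t++)        /* every rank still discovers all N */
        if (owner(t, nranks) == rank)  /* ...but inserts only its own */
            inserted++;
    return inserted;
}
```

"Insert All" would insert inside the loop unconditionally (paying C_R for all N tasks on every rank), while "Insert Local" would iterate only over the rank's own index range, avoiding even the discovery cost but requiring the remote data movements to be deducible some other way.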
15 Explicitly expose collective patterns
SLATE - PaRSEC
- Task dependencies are done on the coordination doorbells instead of data
- Need for lookahead to compensate for the high-level lack of parallelism
- One-to-many dependencies such as A(k, k) -> A(k+1 .. N, k) are collective (broadcast) patterns
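Once a dependency like A(k, k) -> A(k+1 .. N, k) is recognized as a broadcast, the runtime need not have the producer send N-k point-to-point messages; it can forward the tile along a tree. As a sketch, one classic binomial-tree parent rule (an illustrative choice, not necessarily the one PaRSEC uses) clears the lowest set bit of a participant's index:

```c
/* Illustrative binomial-tree parent for a broadcast over participants
 * indexed 0..N-1, with the producer at index 0: each non-root
 * receives the data from the index obtained by clearing its lowest
 * set bit, giving O(log N) depth instead of N-1 root sends. */
static int bcast_parent(int idx)
{
    if (0 == idx) return -1;     /* the root (the producer) has no parent */
    return idx & (idx - 1);      /* clear the lowest set bit */
}
```

For example, participants 1 and 2 receive from the root, 5 and 6 from 4, and 12 from 8, so the root sends only log2(N) messages and the rest of the fan-out happens in parallel.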
16 Natural data-dependent DAG composition
Example: POTRI = POTRF + TRTRI + LAUUM
Three approaches:
- Fork/join (traditional): complete POTRF before starting TRTRI
- Compiler-based: give the three sequential algorithms to the Q2J compiler, and get a single PTG for POINV
- Runtime-based (PaRSEC): tell the runtime that after POTRF is done on a tile, TRTRI can start, and let the runtime compose
[Figure: the DAGs of POTRF, TRTRI and LAUUM, composed by PaRSEC vs. executed traditionally]
17 More dynamic applications
NWChem (PaRSEC): beta-carotene (C40H56)
[Figure: execution time (sec) of the original code vs. PaRSEC on 64 nodes, and efficiency vs. number of threads per node]
- Interoperability between GlobalArrays + PaRSEC + MPI
- Not yet HT-ready

DIP: elastodynamic wave propagation
[Figure: scaling of PaRSEC vs. MPI vs. perfect, per cores/node]
- Dynamically redistribute the data: use PAPI counters to estimate the imbalance, and reshuffle the frontiers to balance the workload
18 Dynamic distributed workload
[Figure: trees for functions f(x) and g(x)]
- Nodes in the trees are tasks, and they are distributed across P processes
- We have a forest of trees, each tree representing a different function or intermediary expression to compute
- Active process: has pending actions or pending emissions to perform
- Idle process: no further pending local actions; returns to the active state upon reception of a message
- Workload termination is a global state in which, in a snapshot, all processes are idle and there are no in-transit messages

Wave algorithms (4C): once idle, a process starts a termination wave counting communications (sent and received). Two identical waves are necessary to detect termination.
Optimal delay algorithms (EDOD): reduce the idleness to the root of a tree (a non-leaf becomes idle when all its children are idle and it is locally idle). Acknowledge message receptions to allow processes to become idle (follow the same tree and inverse the parents' idleness).
Credit distribution algorithms (CDA): spread credits via messages; return credits to an entity once idle; detect termination once all the credits are accounted for.
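The CDA scheme lends itself to a tiny single-threaded simulation. This is a sketch of the idea only (process count, credit pool, and the half-the-credit sending rule are arbitrary illustrative choices): a controller grants all credit to the root process, every message carries part of the sender's remaining credit, an idle process flushes its credit back, and termination is detected once the controller has recovered the whole credit.

```c
#include <stdint.h>

/* Toy simulation of credit-distribution termination detection (CDA). */
#define NPROC 4
#define TOTAL_CREDIT (1u << 20)

static uint32_t credit[NPROC];  /* credit currently held by each process */
static uint32_t recovered;      /* credit returned to the controller */

static void cda_start(void)
{
    for (int p = 0; p < NPROC; p++) credit[p] = 0;
    recovered = 0;
    credit[0] = TOTAL_CREDIT;   /* the root holds everything at startup */
}

static void cda_send(int from, int to)
{
    uint32_t share = credit[from] / 2;  /* message carries half the credit */
    credit[from] -= share;
    credit[to]   += share;
}

static void cda_idle(int p)             /* FLUSH: return all local credit */
{
    recovered += credit[p];
    credit[p] = 0;
}

static int cda_terminated(void) { return recovered == TOTAL_CREDIT; }
```

As long as any process still holds credit, or a message carrying credit is in flight, the controller's total stays short of TOTAL_CREDIT, so termination cannot be declared prematurely; once every process has flushed, the sums match exactly.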
19 Random walk application
Implementation in PaRSEC (experiments on Mira)
- MADNESS-style trees; a threshold defines the level of refinement (the accuracy of the operations)
- 10^8 nodes in the tree (i.e., tasks)
- CDA needs about half the control messages compared with 4C, and all of them are FLUSH (no BORROW)
[Figure: number of control messages for a projection operation with a threshold of 10^-13]
20 Conclusions
Programming can be made easy(ier):
- Portability: inherently take advantage of all hardware capabilities
- Efficiency: deliver the best performance on several families of algorithms
- Domain-specific languages to facilitate development
- Interoperability: data is the centric piece

Build a scientific enabler allowing different communities to focus on different problems:
- Application developers on their algorithms
- Language specialists on domain-specific languages
- System developers on system issues
- Compilers on optimizing the task code
- Interact with hardware designers to improve support for runtime needs (HiHAT: A New Way Forward for Hierarchical Heterogeneous Asynchronous Tasking)
21 Priority-based wish list
- Portable and efficient API/library for accelerator support
- Portable, efficient and interoperable communication library (UCX, libfabric, ...); moving away from MPI will require an efficient datatype engine, also supported by the rest of the software stack (for interoperability)
- Resource management/allocation system: PaRSEC supports dynamic resource provisioning, but we need a portable system to bridge the gap between different programming runtimes
- Memory allocator: thread-safe, with runtime-defined properties and arenas (with and without sbrk) (memkind?)
- Generic profiling system and tools integration; any type of task-based debugger and performance analysis
More informationTesla GPU Computing A Revolution in High Performance Computing
Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory
More informationSearch Algorithms for Discrete Optimization Problems
Search Algorithms for Discrete Optimization Problems Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Topic
More informationOS Design Approaches. Roadmap. OS Design Approaches. Tevfik Koşar. Operating System Design and Implementation
CSE 421/521 - Operating Systems Fall 2012 Lecture - II OS Structures Roadmap OS Design and Implementation Different Design Approaches Major OS Components!! Memory management! CPU Scheduling! I/O Management
More informationGivy, a software DSM runtime using raw pointers
Givy, a software DSM runtime using raw pointers François Gindraud UJF/Inria Compilation days (21/09/2015) F. Gindraud (UJF/Inria) Givy 21/09/2015 1 / 22 Outline 1 Distributed shared memory systems Examples
More informationOVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI
CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing
More informationExploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API
EuroPAR 2016 ROME Workshop Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API Suyang Zhu 1, Sunita Chandrasekaran 2, Peng Sun 1, Barbara Chapman 1, Marcus Winter 3,
More informationEvolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University Scalable Tools Workshop 7 August 2017
Evolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University http://hpctoolkit.org Scalable Tools Workshop 7 August 2017 HPCToolkit 1 HPCToolkit Workflow source code compile &
More informationHybrid Model Parallel Programs
Hybrid Model Parallel Programs Charlie Peck Intermediate Parallel Programming and Cluster Computing Workshop University of Oklahoma/OSCER, August, 2010 1 Well, How Did We Get Here? Almost all of the clusters
More informationWelcome to the 2017 Charm++ Workshop!
Welcome to the 2017 Charm++ Workshop! Laxmikant (Sanjay) Kale http://charm.cs.illinois.edu Parallel Programming Laboratory Department of Computer Science University of Illinois at Urbana Champaign 2017
More informationThe Art of Parallel Processing
The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a
More informationrcuda: an approach to provide remote access to GPU computational power
rcuda: an approach to provide remote access to computational power Rafael Mayo Gual Universitat Jaume I Spain (1 of 60) HPC Advisory Council Workshop Outline computing Cost of a node rcuda goals rcuda
More informationSolving Dense Linear Systems on Graphics Processors
Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad
More informationQR Decomposition on GPUs
QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of
More informationINTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017
INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and
More informationA Pluggable Framework for Composable HPC Scheduling Libraries
A Pluggable Framework for Composable HPC Scheduling Libraries Max Grossman 1, Vivek Kumar 2, Nick Vrvilo 1, Zoran Budimlic 1, Vivek Sarkar 1 1 Habanero Extreme Scale So=ware Research Group, Rice University
More informationDesigning Parallel Programs. This review was developed from Introduction to Parallel Computing
Designing Parallel Programs This review was developed from Introduction to Parallel Computing Author: Blaise Barney, Lawrence Livermore National Laboratory references: https://computing.llnl.gov/tutorials/parallel_comp/#whatis
More informationCray Scientific Libraries. Overview
Cray Scientific Libraries Overview What are libraries for? Building blocks for writing scientific applications Historically allowed the first forms of code re-use Later became ways of running optimized
More informationModeling and Simulation of Hybrid Systems
Modeling and Simulation of Hybrid Systems Luka Stanisic Samuel Thibault Arnaud Legrand Brice Videau Jean-François Méhaut CNRS/Inria/University of Grenoble, France University of Bordeaux/Inria, France JointLab
More informationHANDLING LOAD IMBALANCE IN DISTRIBUTED & SHARED MEMORY
HANDLING LOAD IMBALANCE IN DISTRIBUTED & SHARED MEMORY Presenters: Harshitha Menon, Seonmyeong Bak PPL Group Phil Miller, Sam White, Nitin Bhat, Tom Quinn, Jim Phillips, Laxmikant Kale MOTIVATION INTEGRATED
More informationNew Mapping Interface
New Mapping Interface Mike Bauer NVIDIA Research 1 Why Have A Mapping Interface? Legion Program Legion Legion Mapper Scheduling is hard Lots of runtimes have heuristics What do you do when they are wrong?
More informationWhy? High performance clusters: Fast interconnects Hundreds of nodes, with multiple cores per node Large storage systems Hardware accelerators
Remote CUDA (rcuda) Why? High performance clusters: Fast interconnects Hundreds of nodes, with multiple cores per node Large storage systems Hardware accelerators Better performance-watt, performance-cost
More informationDeep Learning Frameworks. COSC 7336: Advanced Natural Language Processing Fall 2017
Deep Learning Frameworks COSC 7336: Advanced Natural Language Processing Fall 2017 Today s lecture Deep learning software overview TensorFlow Keras Practical Graphical Processing Unit (GPU) From graphical
More informationHPC future trends from a science perspective
HPC future trends from a science perspective Simon McIntosh-Smith University of Bristol HPC Research Group simonm@cs.bris.ac.uk 1 Business as usual? We've all got used to new machines being relatively
More informationSearch Algorithms for Discrete Optimization Problems
Search Algorithms for Discrete Optimization Problems Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. 1 Topic
More informationMapping MPI+X Applications to Multi-GPU Architectures
Mapping MPI+X Applications to Multi-GPU Architectures A Performance-Portable Approach Edgar A. León Computer Scientist San Jose, CA March 28, 2018 GPU Technology Conference This work was performed under
More informationIllinois Proposal Considerations Greg Bauer
- 2016 Greg Bauer Support model Blue Waters provides traditional Partner Consulting as part of its User Services. Standard service requests for assistance with porting, debugging, allocation issues, and
More informationRecent Advances in Heterogeneous Computing using Charm++
Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing
More informationPGI Accelerator Programming Model for Fortran & C
PGI Accelerator Programming Model for Fortran & C The Portland Group Published: v1.3 November 2010 Contents 1. Introduction... 5 1.1 Scope... 5 1.2 Glossary... 5 1.3 Execution Model... 7 1.4 Memory Model...
More informationScalasca support for Intel Xeon Phi. Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany
Scalasca support for Intel Xeon Phi Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany Overview Scalasca performance analysis toolset support for MPI & OpenMP
More informationMPI in 2020: Opportunities and Challenges. William Gropp
MPI in 2020: Opportunities and Challenges William Gropp www.cs.illinois.edu/~wgropp MPI and Supercomputing The Message Passing Interface (MPI) has been amazingly successful First released in 1992, it is
More information