PaRSEC: Distributed task-based runtime for scalable hybrid applications

Size: px
Start display at page:

Download "PaRSEC: Distributed task-based runtime for scalable hybrid applications"

Transcription

1 PaRSEC: Distributed task-based runtime for scalable hybrid applications George Bosilca and many others Intertwine Workshop, April. 20, 2018

2 PaRSEC: a generic runtime system for asynchronous, architecture aware scheduling of fine-grained tasks on distributed many-core heterogeneous architectures Concepts Clear separation of concerns: compiler optimize each task class, developer describe dependencies between tasks, the runtime orchestrate the dynamic execution Interface with the application developers through specialized domain specific languages (PTG/JDF/TTG, Python, insert_task, fork/join, ) Separate algorithms from data distribution Make control flow executions a relic Runtime Portability layer for heterogeneous architectures Scheduling policies adapt every execution to the hardware & ongoing system status Data movements between producers and consumers are inferred from dependencies. Communications/computations overlap naturally unfold Coherency protocols minimize data movements Memory hierarchies (including NVRAM and disk) integral part of the scheduling decisions

3 Tasks A B C Data collections PaRSEC User data: dense matrix, sparse, structured or unstructured A(k) F(A(k)) F(C(k)) C(k) B(k) Graph of tasks = a data centric programming environment based on asynchronous tasks executing on a heterogeneous distributed environment An execution unit taking a set of input data and generating, upon completion, a different set of output data. Tasks and data have a coherent distributed scope (managed by the runtime) Low-level API allowing the design of Domain Specific Languages (JDF, DTD, TTG) Supports distributed heterogeneous (and trigger tasks execution) environments. Communications are implicit (the runtime moves data) Resources are dynamic (threads, accelerators) Built-in resilience, performance instrumentation and analysis (R, python)

4 PaRSEC Architecture Domain Specific Extensions Parallel Runtime Hardware Compact Representation - PTG Scheduling Scheduling Distributed Scheduling Memory Allocator Cores Dense LA Sparse LA Architecture information Memory Hierarchies Dynamic Discovered Representation - DTG Data Collections Data Data Data Data Movement Collective Patterns Coherence Soft. MOESI * Tasks Tasks Task classes TTG / Chemistry Hard core Specialized Specialized Kernels Specialized Kernels Kernels Data Accelerators Movement MPI Xeon Phi OpenACC UCX NVidia GasNet * Software design based on Modular Component Architecture (MCA) of Open MPI. Clear components API Runtime selection of components Implementing a new component has little impact on the rest of the software stack. Not yet componentized

5 The PaRSEC data Data View User defined Runtime defined A(k) v2 v2 v1 Data Collection A data is a manipulation token, the basic logical element (view) used in the description of the dataflow Locations: have multiple coherent copies (remote node, device, checkpoint) Shape: can have different memory layout Visibility: only accessible via the most current version of the data State: can be migrated / logged Data collections are ensemble of data distributed among the nodes Can be regular (multi-dimensional matrices) Or irregular (sparse data, graphs) Can be regularly distributed (cyclic-k) or user-defined Data View a subset of the data collection used in a particular algorithm (aka. submatrix, row, column, ) A data-copy is the practical unit of data Has a memory layout (think MPI datatype) Has a property of locality (device, NUMA domain, node) Has a version associated with Multiple instances can coexist

6 DSL: The PaRSEC application parsec_vector_t ddata; parsec_vector_init( &ddata, matrix_integer, matrix_tile, nodes, rank, 1, /* tile_size*/ N, /* Global vector size*/ 0, /* starting point */ 1 ); /* block size */ parsec_context_t* parsec; parsec = parsec_context_init(null, NULL); /* start the PaRSEC engine */ parsec_handle_t* parsec_dtd_handle = parsec_handle_new (parsec); parsec_enqueue(parsec, (parsec_handle_t*) parsec_dtd_handle); parsec_context_start(parsec); for( n = 0; n < N; n++ ) { parsec_insert_task( parsec_dtd_handle, call_to_kernel_type_write, Create Data", PASSED_BY_REF, DATA_AT(&dDATA, n), OUT REGION_FULL, 0 /* DONE */); for( k = 0; k < K; k++ ) { parsec_insert_task( parsec_dtd_handle, call_to_kernel_type_read, "Read_Data", PASSED_BY_REF, DATA_AT(&dDATA, n), INPUT REGION_FULL, 0 /* DONE */ ); } } parsec_handle_wait( parsec_dtd_handle ); Define a distributed collection of data (here 1 dimension array of integers) Start PaRSEC (resource allocation) Create a tasks placeholder and associate it with the PaRSEC context Add tasks. A configurable window will limit the number of pending tasks Wait till completion Data initialization and PaRSEC context setup. Common to all DSL

7 Uncountable ways How to describe a graph of tasks? Generic: Dagguer (Charm++), Legion, ParalleX, Parameterized Task Graph (PaRSEC), Dynamic Task Discovery (StarPU, StarSS), Yvette (XML), Fork/Join (spawn). CnC, Uintah, DARMA, Kokkos, RAJA Application specific: MADNESS, PaRSEC runtime The runtime is agnostic to the domain specific language (DSL) Different DSL interoperate through the data collections The DSL share Distributed schedulers Communication engine Hardware resources Data management (coherence, versioning, ) They don t share The task structure The internal dataflow depiction 7

8 DSL: The insert_task interface Define a distributed collection of data (here 1 dimension array of integers) Start PaRSEC (resource allocation) Create a tasks placeholder and associate it with the PaRSEC context Add tasks. A configurable window will limit the number of pending tasks Wait till completion parsec_vector_t ddata; parsec_vector_init( &ddata, matrix_integer, matrix_tile, nodes, rank, 1, /* tile_size*/ N, /* Global vector size*/ 0, /* starting point */ 1 ); /* block size */ parsec_context_t* parsec; parsec = parsec_context_init(null, NULL); /* start the PaRSEC engine */ parsec_handle_t* parsec_dtd_handle = parsec_handle_new (parsec); parsec_enqueue(parsec, (parsec_handle_t*) parsec_dtd_handle); parsec_context_start(parsec); for( n = 0; n < N; n++ ) { parsec_insert_task( parsec_dtd_handle, call_to_kernel_type_write, } PASSED_BY_REF, 0 /* DONE */); for( k = 0; k < K; k++ ) { parsec_insert_task( parsec_dtd_handle, } Create Data", DATA_AT(&dDATA, n), OUT REGION_FULL, call_to_kernel_type_read, "Read_Data", PASSED_BY_REF, DATA_AT(&dDATA, n), INPUT REGION_FULL, 0 /* DONE */ ); parsec_handle_wait( parsec_dtd_handle ); Data initialization and PaRSEC context setup. Common to all DSL

9 The Parameterized Task Graph (JDF) GEQRT(k) k = 0..( MT < NT )? MT-1 : NT-1 ) : A(k, k) RW A <- (k == 0)? A(k, k) : A1 TSMQR(k-1, k, k) -> (k < NT-1)? A UNMQR(k, k+1.. NT-1) [type = LOWER] -> (k < MT-1)? A1 TSQRT(k, k+1) [type = UPPER] -> (k == MT-1)? A(k, k) [type = UPPER] WRITE T <- T(k, k) -> T(k, k) -> (k < NT-1)? T UNMQR(k, k+1.. NT-1) BODY [type = CPU] /* default */ zgeqrt( A, T ); END BODY [type = CUDA] cuda_zgeqrt( A, T ); END Control flow is eliminated, therefore maximum parallelism is possible A dataflow parameterized and concise language Accept non-dense iterators Allow inlined C/C++ code to augment the language [any expression] Termination mechanism part of the runtime (i.e. needs to know the number of tasks per node) The dependencies had to be globally (and statically) defined prior to the execution Dynamic DAGs non-natural No data dependent DAGs 9

10 Dense Linear Algebra: QR heterogeneous Experiments on Arc machines, E GHz 20 cores gcc 6.3 MKL 2016 PaRSEC-2.0-rc1 StarPU CUDA GEQRT 0 TSQRT UNMQR TSMQR MAGMA runs out of GPU memory B big tile size b: small tile size ib: inner block size 10

11 PERFORMANCE (TFLOP/S) DGEQRF performance strong scaling Cray XT5 (Kraken) - N = M = 41,472 Systolic QR over PULSAR Systolic QR over PaRSEC (2D) DPLASMA HQR (best single tree) LibSCI Scalapack Dense Linear Algebra DPLASMA = ScaLAPACK + runtime (PaRSEC) NUMBER OF CORES Keeneland h stands for dynamic Hierarchical algorithms (a task can divide itself) GEQRT TSQRT UNMQR TSMQR

12 Overhead of insert_task 4500 Cholesky(double precision) on Haswell, 8 nodes, 20 cores/node, 160 cores Size(N)/No. of Tasks(millions), Tile size: k/ k/0.2 30k/0.8 40k/2.0 50k/3.8 60k/6.5 70k/ k/ Gflops PaRSEC-PTG, Tile size: 128 PaRSEC-DTD, Tile size: 128 PaRSEC-DTD, 128, Execution Only 10k/0.5 20k/3.8 30k/ k/ k/ k/102.5 Size(N)/No. of Tasks(millions), Tile size: 128 PaRSEC-PTG, Tile size: 320 PaRSEC-DTD, Tile size: k/ k/

13 Overhead of insert_task DTD overhead 4500 Cholesky(double precision) on Haswell, 8 nodes, 20 cores/node, 160 cores Size(N)/No. of Tasks(millions), Tile size: k/ k/0.2 30k/0.8 40k/2.0 50k/3.8 60k/6.5 70k/ k/ T DTD/PTG : Overall time N: Total number of tasks C T : Cost/duration of each task P: Total number of nodes/process n: Total number of cores C D : Cost of discovering a task C R : Cost of building DAG/relationship Gflops PaRSEC-PTG, Tile size: 128 PaRSEC-DTD, Tile size: 128 PaRSEC-DTD, 128, Execution Only 10k/0.5 20k/3.8 30k/ k/ k/ k/102.5 Size(N)/No. of Tasks(millions), Tile size: 128 PaRSEC-PTG, Tile size: 320 PaRSEC-DTD, Tile size: k/ k/

14 Overhead of insert_task DTD overhead Fixed task duration Weak scaling: Fixed number of tasks per process T DTD/PTG : Overall time N: Total number of tasks C T : Cost/duration of each task P: Total number of nodes/process n: Total number of cores C D : Cost of discovering a task C R : Cost of building DAG/relationship 6144 cores Benefits: critical path is defined by the sequential ordering Drawbacks: impossible to build collective patterns, selecting the window size is difficult, all data movement There are three types of scenario must be known globally (and their order is critically important) Insert All: Each rank inserts all tasks, and executes only locals Select Insert: Each rank inserts only local tasks, but iterates over all tasks. Insert Local: Each rank only inserts local tasks. 14

15 Explicitly expose collective patterns SLATE - PaRSEC Task dependencies done on the coordination doorbells instead of data Need for lookahead to compensate for the high level lack of parallelism A(k, k) -> A(k+1.. N, k),

16 Natural data-dependent DAG Composition 16 POTRF TRTRI LAUUM Example POTRI = POTRF + TRTRI + LAUUM 3 approaches: Fork/join: complete POTRF before starting TRTRI Compiler-based: give the three sequential algorithms to the Q2J compiler, and get a single PTG for POINV Runtime-based: tell the runtime that after POTRF is done on a tile, TRTRI can start, and let the runtime compose PaRSEC Traditional

17 More dynamic applications Execution Time (sec) Original 64 PaRSEC 64 NWChem (PaRSEC) Beta-carotene efficiency DIP: Elastodynamic Wave Propagation Perfect PaRSEC MPI Not yet HT ready Interoperability between GlobalArray + PaRSEC + MPI Cores/Node C 40 H number of threads Dynamically redistribute the data - use PAPI counters to estimate the imbalance - reshuffle the frontiers to balance the workload

18 Dynamic distributed Workload f(x) Nodes in the tree are tasks, and they are distributed across P processes g(x) We have a forest of trees, each tree representing a different function or intermediary expression to compute Active process: process has pending actions or has pending emissions to perform Idle process: no further pending local actions Idle process returns to active state upon reception of a message Workload termination is a global state when in a snapshot all processes are idle and there are no in-transit messages Wave Algorithms (4C) Once idle a process start a termination wave counting communications (sent and recv). Two identical waves are necessary to detect termination. Optimal Delay Algorithms (EDOD) Reduce the idleness to the root of a tree (a nonleaf become idle when all children are idle + local). Acknowledge message reception to allow processes to become idle (follow the same tree and inverse the parents idleness) Credit Distribution Algorithms (CDA) Spread credits via messages. Return credits to an entity once idle. Detect termination once all the credits are accounted for.

19 Random Walk application Number of control messages Implementation in PaRSEC (exp. on Mira) MADNESS style trees Threshold define the level of refinement (accuracy of the operations) 10 8 nodes in the tree (aka tasks) CDA ½ the control messages compared with 4C All of them are FLUSH (no BORROW) Projection Operation with threshold of 10 13

20 Conclusions Programming can be made easy(ier) Portability: inherently take advantage of all hardware capabilities Efficiency: deliver the best performance on several families of algorithms Domain Specific Languages to facilitate development Interoperability: data is the centric piece Build a scientific enabler allowing different communities to focus on different problems Application developers on their algorithms Language specialists on Domain Specific Languages System developers on system issues Compilers on optimizing the task code Interact with hardware designers to improve support for runtime needs HiHAT: A New Way Forward for Hierarchical Heterogeneous Asynchronous Tasking

21 Priority based Wish List Portable and efficient API/library for accelerator support Portable, efficient and interoperable communication library (UCX, libfabric, ) Moving away from MPI will require an efficient datatype engine Also supported by rest of the software stack (for interoperability) Resource management/allocation system PaRSEC supports dynamic resource provisioning, but we need a portable system to bridge the gap between different programming runtimes Memory allocator: thread safe, runtime defined properties, arenas (with and without sbrk). (memkind?) Generic profiling system, tools integration Any type of task-based debugger and performance analysis

Hierarchical DAG Scheduling for Hybrid Distributed Systems

Hierarchical DAG Scheduling for Hybrid Distributed Systems June 16, 2015 Hierarchical DAG Scheduling for Hybrid Distributed Systems Wei Wu, Aurelien Bouteiller, George Bosilca, Mathieu Faverge, Jack Dongarra IPDPS 2015 Outline! Introduction & Motivation! Hierarchical

More information

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca

Distributed Dense Linear Algebra on Heterogeneous Architectures. George Bosilca Distributed Dense Linear Algebra on Heterogeneous Architectures George Bosilca bosilca@eecs.utk.edu Centraro, Italy June 2010 Factors that Necessitate to Redesign of Our Software» Steepness of the ascent

More information

MAGMA: a New Generation

MAGMA: a New Generation 1.3 MAGMA: a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Jack Dongarra T. Dong, M. Gates, A. Haidar, S. Tomov, and I. Yamazaki University of Tennessee, Knoxville Release

More information

MAGMA. Matrix Algebra on GPU and Multicore Architectures

MAGMA. Matrix Algebra on GPU and Multicore Architectures MAGMA Matrix Algebra on GPU and Multicore Architectures Innovative Computing Laboratory Electrical Engineering and Computer Science University of Tennessee Piotr Luszczek (presenter) web.eecs.utk.edu/~luszczek/conf/

More information

Do we need dataflow programming?

Do we need dataflow programming? Do we need dataflow programming? Anthony Danalis Innovative Computing Laboratory University of Tennessee CCDSC 16, Chateau des Contes Programming vs Execution ü Dataflow based execution ü Think ILP, Out

More information

Best Practice Guide to Hybrid PaRSEC + OpenMP Programming

Best Practice Guide to Hybrid PaRSEC + OpenMP Programming Best Practice Guide to Hybrid PaRSEC + OpenMP Programming Version 1.0, 30 th September 2018 Jakub Šístek, Reazul Hoque and George Bosilca Table of Contents TABLE OF CONTENTS... 2 1 INTRODUCTION... 3 2

More information

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE)

GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) GPU ACCELERATION OF WSMP (WATSON SPARSE MATRIX PACKAGE) NATALIA GIMELSHEIN ANSHUL GUPTA STEVE RENNICH SEID KORIC NVIDIA IBM NVIDIA NCSA WATSON SPARSE MATRIX PACKAGE (WSMP) Cholesky, LDL T, LU factorization

More information

Towards a codelet-based runtime for exascale computing. Chris Lauderdale ET International, Inc.

Towards a codelet-based runtime for exascale computing. Chris Lauderdale ET International, Inc. Towards a codelet-based runtime for exascale computing Chris Lauderdale ET International, Inc. What will be covered Slide 2 of 24 Problems & motivation Codelet runtime overview Codelets & complexes Dealing

More information

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures

MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures MAGMA a New Generation of Linear Algebra Libraries for GPU and Multicore Architectures Stan Tomov Innovative Computing Laboratory University of Tennessee, Knoxville OLCF Seminar Series, ORNL June 16, 2010

More information

Overview of research activities Toward portability of performance

Overview of research activities Toward portability of performance Overview of research activities Toward portability of performance Do dynamically what can t be done statically Understand evolution of architectures Enable new programming models Put intelligence into

More information

Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems

Power Profiling of Cholesky and QR Factorizations on Distributed Memory Systems International Conference on Energy-Aware High Performance Computing Hamburg, Germany Bosilca, Ltaief, Dongarra (KAUST, UTK) Power Sept Profiling, DLA Algorithms ENAHPC / 6 Power Profiling of Cholesky and

More information

qr_mumps a multithreaded, multifrontal QR solver The MUMPS team,

qr_mumps a multithreaded, multifrontal QR solver The MUMPS team, qr_mumps a multithreaded, multifrontal QR solver The MUMPS team, MUMPS Users Group Meeting, May 29-30, 2013 The qr_mumps software the qr_mumps software qr_mumps version 1.0 a multithreaded, multifrontal

More information

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer

Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems. Ed Hinkel Senior Sales Engineer Addressing the Increasing Challenges of Debugging on Accelerated HPC Systems Ed Hinkel Senior Sales Engineer Agenda Overview - Rogue Wave & TotalView GPU Debugging with TotalView Nvdia CUDA Intel Phi 2

More information

Thinking Outside of the Tera-Scale Box. Piotr Luszczek

Thinking Outside of the Tera-Scale Box. Piotr Luszczek Thinking Outside of the Tera-Scale Box Piotr Luszczek Brief History of Tera-flop: 1997 1997 ASCI Red Brief History of Tera-flop: 2007 Intel Polaris 2007 1997 ASCI Red Brief History of Tera-flop: GPGPU

More information

Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh.

Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization. Dmitry I. Lyakh. Portable Heterogeneous High-Performance Computing via Domain-Specific Virtualization Dmitry I. Lyakh liakhdi@ornl.gov This research used resources of the Oak Ridge Leadership Computing Facility at the

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

Preliminary Experiences with the Uintah Framework on on Intel Xeon Phi and Stampede

Preliminary Experiences with the Uintah Framework on on Intel Xeon Phi and Stampede Preliminary Experiences with the Uintah Framework on on Intel Xeon Phi and Stampede Qingyu Meng, Alan Humphrey, John Schmidt, Martin Berzins Thanks to: TACC Team for early access to Stampede J. Davison

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008

SHARCNET Workshop on Parallel Computing. Hugh Merz Laurentian University May 2008 SHARCNET Workshop on Parallel Computing Hugh Merz Laurentian University May 2008 What is Parallel Computing? A computational method that utilizes multiple processing elements to solve a problem in tandem

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

The Legion Mapping Interface

The Legion Mapping Interface The Legion Mapping Interface Mike Bauer 1 Philosophy Decouple specification from mapping Performance portability Expose all mapping (perf) decisions to Legion user Guessing is bad! Don t want to fight

More information

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050

More information

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography

A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography 1 A Scalable Parallel LSQR Algorithm for Solving Large-Scale Linear System for Seismic Tomography He Huang, Liqiang Wang, Po Chen(University of Wyoming) John Dennis (NCAR) 2 LSQR in Seismic Tomography

More information

High Performance Linear Algebra

High Performance Linear Algebra High Performance Linear Algebra Hatem Ltaief Senior Research Scientist Extreme Computing Research Center King Abdullah University of Science and Technology 4th International Workshop on Real-Time Control

More information

Distributed & Heterogeneous Programming in C++ for HPC at SC17

Distributed & Heterogeneous Programming in C++ for HPC at SC17 Distributed & Heterogeneous Programming in C++ for HPC at SC17 Michael Wong (Codeplay), Hal Finkel DHPCC++ 2018 1 The Panel 2 Ben Sanders (AMD, HCC, HiP, HSA) Carter Edwards (SNL, Kokkos, ISO C++) CJ Newburn

More information

A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems

A scalable approach to solving dense linear algebra problems on hybrid CPU-GPU systems CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. () Published online in Wiley Online Library (wileyonlinelibrary.com)..33 A scalable approach to solving dense linear

More information

Principles of Parallel Algorithm Design: Concurrency and Mapping

Principles of Parallel Algorithm Design: Concurrency and Mapping Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 17 January 2017 Last Thursday

More information

Introduction to parallel Computing

Introduction to parallel Computing Introduction to parallel Computing VI-SEEM Training Paschalis Paschalis Korosoglou Korosoglou (pkoro@.gr) (pkoro@.gr) Outline Serial vs Parallel programming Hardware trends Why HPC matters HPC Concepts

More information

Making Dataflow Programming Ubiquitous for Scientific Computing

Making Dataflow Programming Ubiquitous for Scientific Computing Making Dataflow Programming Ubiquitous for Scientific Computing Hatem Ltaief KAUST Supercomputing Lab Synchronization-reducing and Communication-reducing Algorithms and Programming Models for Large-scale

More information

NEW ADVANCES IN GPU LINEAR ALGEBRA

NEW ADVANCES IN GPU LINEAR ALGEBRA GTC 2012: NEW ADVANCES IN GPU LINEAR ALGEBRA Kyle Spagnoli EM Photonics 5/16/2012 QUICK ABOUT US» HPC/GPU Consulting Firm» Specializations in:» Electromagnetics» Image Processing» Fluid Dynamics» Linear

More information

Accelerated Library Framework for Hybrid-x86

Accelerated Library Framework for Hybrid-x86 Software Development Kit for Multicore Acceleration Version 3.0 Accelerated Library Framework for Hybrid-x86 Programmer s Guide and API Reference Version 1.0 DRAFT SC33-8406-00 Software Development Kit

More information

Charm++ for Productivity and Performance

Charm++ for Productivity and Performance Charm++ for Productivity and Performance A Submission to the 2011 HPC Class II Challenge Laxmikant V. Kale Anshu Arya Abhinav Bhatele Abhishek Gupta Nikhil Jain Pritish Jetley Jonathan Lifflander Phil

More information

HPC with Multicore and GPUs

HPC with Multicore and GPUs HPC with Multicore and GPUs Stan Tomov Electrical Engineering and Computer Science Department University of Tennessee, Knoxville COSC 594 Lecture Notes March 22, 2017 1/20 Outline Introduction - Hardware

More information

A Standard for Batching BLAS Operations

A Standard for Batching BLAS Operations A Standard for Batching BLAS Operations Jack Dongarra University of Tennessee Oak Ridge National Laboratory University of Manchester 5/8/16 1 API for Batching BLAS Operations We are proposing, as a community

More information

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620

Introduction to Parallel and Distributed Computing. Linh B. Ngo CPSC 3620 Introduction to Parallel and Distributed Computing Linh B. Ngo CPSC 3620 Overview: What is Parallel Computing To be run using multiple processors A problem is broken into discrete parts that can be solved

More information

CUDA GPGPU Workshop 2012

CUDA GPGPU Workshop 2012 CUDA GPGPU Workshop 2012 Parallel Programming: C thread, Open MP, and Open MPI Presenter: Nasrin Sultana Wichita State University 07/10/2012 Parallel Programming: Open MP, MPI, Open MPI & CUDA Outline

More information

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs

An Extension of the StarSs Programming Model for Platforms with Multiple GPUs An Extension of the StarSs Programming Model for Platforms with Multiple GPUs Eduard Ayguadé 2 Rosa M. Badia 2 Francisco Igual 1 Jesús Labarta 2 Rafael Mayo 1 Enrique S. Quintana-Ortí 1 1 Departamento

More information

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions

Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Data Partitioning on Heterogeneous Multicore and Multi-GPU systems Using Functional Performance Models of Data-Parallel Applictions Ziming Zhong Vladimir Rychkov Alexey Lastovetsky Heterogeneous Computing

More information

A Scalable Approach to Solving Dense Linear Algebra Problems. on Hybrid CPU-GPU Systems

A Scalable Approach to Solving Dense Linear Algebra Problems. on Hybrid CPU-GPU Systems CONCURRENCY AND COMPUTATION: PRACTICE AND EXPERIENCE Concurrency Computat.: Pract. Exper. 2014; 00:2 35 Published online in Wiley InterScience (www.interscience.wiley.com). A Scalable Approach to Solving

More information

High Performance Computing. University questions with solution

High Performance Computing. University questions with solution High Performance Computing University questions with solution Q1) Explain the basic working principle of VLIW processor. (6 marks) The following points are basic working principle of VLIW processor. The

More information

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT

TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT TOOLS FOR IMPROVING CROSS-PLATFORM SOFTWARE DEVELOPMENT Eric Kelmelis 28 March 2018 OVERVIEW BACKGROUND Evolution of processing hardware CROSS-PLATFORM KERNEL DEVELOPMENT Write once, target multiple hardware

More information

QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators

QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators QR Factorization on a Multicore Node Enhanced with Multiple GPU Accelerators Emmanuel Agullo, Cédric Augonnet, Jack Dongarra, Mathieu Faverge, Hatem Ltaief, Samuel Thibault and Stanimire Tomov INRIA, LaBRI,

More information

Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc.

Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. Portable and Productive Performance on Hybrid Systems with libsci_acc Luiz DeRose Sr. Principal Engineer Programming Environments Director Cray Inc. 1 What is Cray Libsci_acc? Provide basic scientific

More information

Adrian Tate XK6 / openacc workshop Manno, Mar

Adrian Tate XK6 / openacc workshop Manno, Mar Adrian Tate XK6 / openacc workshop Manno, Mar6-7 2012 1 Overview & Philosophy Two modes of usage Contents Present contents Upcoming releases Optimization of libsci_acc Autotuning Adaptation Asynchronous

More information

OpenACC Course. Office Hour #2 Q&A

OpenACC Course. Office Hour #2 Q&A OpenACC Course Office Hour #2 Q&A Q1: How many threads does each GPU core have? A: GPU cores execute arithmetic instructions. Each core can execute one single precision floating point instruction per cycle

More information

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows

More information

A Simulation of Global Atmosphere Model NICAM on TSUBAME 2.5 Using OpenACC

A Simulation of Global Atmosphere Model NICAM on TSUBAME 2.5 Using OpenACC A Simulation of Global Atmosphere Model NICAM on TSUBAME 2.5 Using OpenACC Hisashi YASHIRO RIKEN Advanced Institute of Computational Science Kobe, Japan My topic The study for Cloud computing My topic

More information

What is DARMA? DARMA is a C++ abstraction layer for asynchronous many-task (AMT) runtimes.

What is DARMA? DARMA is a C++ abstraction layer for asynchronous many-task (AMT) runtimes. DARMA Janine C. Bennett, Jonathan Lifflander, David S. Hollman, Jeremiah Wilke, Hemanth Kolla, Aram Markosyan, Nicole Slattengren, Robert L. Clay (PM) PSAAP-WEST February 22, 2017 Sandia National Laboratories

More information

Piz Daint: Application driven co-design of a supercomputer based on Cray s adaptive system design

Piz Daint: Application driven co-design of a supercomputer based on Cray s adaptive system design Piz Daint: Application driven co-design of a supercomputer based on Cray s adaptive system design Sadaf Alam & Thomas Schulthess CSCS & ETHzürich CUG 2014 * Timelines & releases are not precise Top 500

More information

Intel Xeon Phi Coprocessor

Intel Xeon Phi Coprocessor Intel Xeon Phi Coprocessor A guide to using it on the Cray XC40 Terminology Warning: may also be referred to as MIC or KNC in what follows! What are Intel Xeon Phi Coprocessors? Hardware designed to accelerate

More information

HiHAT: A New Way Forward for Hierarchical Heterogeneous Asynchronous Tasking. CJ Newburn

HiHAT: A New Way Forward for Hierarchical Heterogeneous Asynchronous Tasking. CJ Newburn HiHAT: A New Way Forward for Hierarchical Heterogeneous Asynchronous Tasking CJ Newburn TRENDS Relevant to small or large scale HPC, AI Scale Hierarchical Differentiation for efficiency Heterogeneity Unpredictability

More information

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication

More information

Automatic Development of Linear Algebra Libraries for the Tesla Series

Automatic Development of Linear Algebra Libraries for the Tesla Series Automatic Development of Linear Algebra Libraries for the Tesla Series Enrique S. Quintana-Ortí quintana@icc.uji.es Universidad Jaime I de Castellón (Spain) Dense Linear Algebra Major problems: Source

More information

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit

Analyzing the Performance of IWAVE on a Cluster using HPCToolkit Analyzing the Performance of IWAVE on a Cluster using HPCToolkit John Mellor-Crummey and Laksono Adhianto Department of Computer Science Rice University {johnmc,laksono}@rice.edu TRIP Meeting March 30,

More information

Scalability of Uintah Past Present and Future

Scalability of Uintah Past Present and Future DOE for funding the CSAFE project (97-10), DOE NETL, DOE NNSA NSF for funding via SDCI and PetaApps, INCITE, XSEDE Scalability of Uintah Past Present and Future Martin Berzins Qingyu Meng John Schmidt,

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Gernot Ziegler, Developer Technology (Compute) (Material by Thomas Bradley) Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction

More information

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop

Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop Center for Scalable Application Development Software (CScADS): Automatic Performance Tuning Workshop http://cscads.rice.edu/ Discussion and Feedback CScADS Autotuning 07 Top Priority Questions for Discussion

More information

Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments

Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments Batch Linear Algebra for GPU-Accelerated High Performance Computing Environments Ahmad Abdelfattah, Azzam Haidar, Stanimire Tomov, and Jack Dongarra SIAM Conference on Computational Science and Engineering

More information

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS

Hybrid KAUST Many Cores and OpenACC. Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Hybrid Computing @ KAUST Many Cores and OpenACC Alain Clo - KAUST Research Computing Saber Feki KAUST Supercomputing Lab Florent Lebeau - CAPS + Agenda Hybrid Computing n Hybrid Computing n From Multi-Physics

More information

Toward a supernodal sparse direct solver over DAG runtimes

Toward a supernodal sparse direct solver over DAG runtimes Toward a supernodal sparse direct solver over DAG runtimes HOSCAR 2013, Bordeaux X. Lacoste Xavier LACOSTE HiePACS team Inria Bordeaux Sud-Ouest November 27, 2012 Guideline Context and goals About PaStiX

More information

DuctTeip: A TASK-BASED PARALLEL PROGRAMMING FRAMEWORK FOR DISTRIBUTED MEMORY ARCHITECTURES

DuctTeip: A TASK-BASED PARALLEL PROGRAMMING FRAMEWORK FOR DISTRIBUTED MEMORY ARCHITECTURES DuctTeip: A TASK-BASED PARALLEL PROGRAMMING FRAMEWORK FOR DISTRIBUTED MEMORY ARCHITECTURES AFSHIN ZAFARI, ELISABETH LARSSON, AND MARTIN TILLENIUS Abstract. Current high-performance computer systems used

More information

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster

Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Performance Analysis of Memory Transfers and GEMM Subroutines on NVIDIA TESLA GPU Cluster Veerendra Allada, Troy Benjegerdes Electrical and Computer Engineering, Ames Laboratory Iowa State University &

More information

Critically Missing Pieces on Accelerators: A Performance Tools Perspective

Critically Missing Pieces on Accelerators: A Performance Tools Perspective Critically Missing Pieces on Accelerators: A Performance Tools Perspective, Karthik Murthy, Mike Fagan, and John Mellor-Crummey Rice University SC 2013 Denver, CO November 20, 2013 What Is Missing in GPUs?

More information

Principles of Parallel Algorithm Design: Concurrency and Mapping

Principles of Parallel Algorithm Design: Concurrency and Mapping Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 28 August 2018 Last Thursday Introduction

More information

OpenACC 2.6 Proposed Features

OpenACC 2.6 Proposed Features OpenACC 2.6 Proposed Features OpenACC.org June, 2017 1 Introduction This document summarizes features and changes being proposed for the next version of the OpenACC Application Programming Interface, tentatively

More information

Intel Xeon Phi архитектура, модели программирования, оптимизация.

Intel Xeon Phi архитектура, модели программирования, оптимизация. Нижний Новгород, 2017 Intel Xeon Phi архитектура, модели программирования, оптимизация. Дмитрий Прохоров, Дмитрий Рябцев, Intel Agenda What and Why Intel Xeon Phi Top 500 insights, roadmap, architecture

More information

The StarPU Runtime System

The StarPU Runtime System The StarPU Runtime System A Unified Runtime System for Heterogeneous Architectures Olivier Aumage STORM Team Inria LaBRI http://starpu.gforge.inria.fr/ 1Introduction Olivier Aumage STORM Team The StarPU

More information

GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU

GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU April 4-7, 2016 Silicon Valley GPU ACCELERATION OF CHOLMOD: BATCHING, HYBRID AND MULTI-GPU Steve Rennich, Darko Stosic, Tim Davis, April 6, 2016 OBJECTIVE Direct sparse methods are among the most widely

More information

Acceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP

Acceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP Acceleration of a Python-based Tsunami Modelling Application via CUDA and OpenHMPP Zhe Weng and Peter Strazdins*, Computer Systems Group, Research School of Computer Science, The Australian National University

More information

HPX The C++ Standards Library for Concurrency and Parallelism. Hartmut Kaiser

HPX The C++ Standards Library for Concurrency and Parallelism. Hartmut Kaiser HPX The C++ Standards Library for Concurrency and Hartmut Kaiser (hkaiser@cct.lsu.edu) HPX A General Purpose Runtime System The C++ Standards Library for Concurrency and Exposes a coherent and uniform,

More information

Tesla GPU Computing A Revolution in High Performance Computing

Tesla GPU Computing A Revolution in High Performance Computing Tesla GPU Computing A Revolution in High Performance Computing Mark Harris, NVIDIA Agenda Tesla GPU Computing CUDA Fermi What is GPU Computing? Introduction to Tesla CUDA Architecture Programming & Memory

More information

Search Algorithms for Discrete Optimization Problems

Search Algorithms for Discrete Optimization Problems Search Algorithms for Discrete Optimization Problems Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Topic

More information

OS Design Approaches. Roadmap. OS Design Approaches. Tevfik Koşar. Operating System Design and Implementation

OS Design Approaches. Roadmap. OS Design Approaches. Tevfik Koşar. Operating System Design and Implementation CSE 421/521 - Operating Systems Fall 2012 Lecture - II OS Structures Roadmap OS Design and Implementation Different Design Approaches Major OS Components!! Memory management! CPU Scheduling! I/O Management

More information

Givy, a software DSM runtime using raw pointers

Givy, a software DSM runtime using raw pointers Givy, a software DSM runtime using raw pointers François Gindraud UJF/Inria Compilation days (21/09/2015) F. Gindraud (UJF/Inria) Givy 21/09/2015 1 / 22 Outline 1 Distributed shared memory systems Examples

More information

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI

OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI CMPE 655- MULTIPLE PROCESSOR SYSTEMS OVERHEADS ENHANCEMENT IN MUTIPLE PROCESSING SYSTEMS BY ANURAG REDDY GANKAT KARTHIK REDDY AKKATI What is MULTI PROCESSING?? Multiprocessing is the coordinated processing

More information

Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API

Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API EuroPAR 2016 ROME Workshop Exploring Task Parallelism for Heterogeneous Systems Using Multicore Task Management API Suyang Zhu 1, Sunita Chandrasekaran 2, Peng Sun 1, Barbara Chapman 1, Marcus Winter 3,

More information

Evolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University Scalable Tools Workshop 7 August 2017

Evolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University   Scalable Tools Workshop 7 August 2017 Evolving HPCToolkit John Mellor-Crummey Department of Computer Science Rice University http://hpctoolkit.org Scalable Tools Workshop 7 August 2017 HPCToolkit 1 HPCToolkit Workflow source code compile &

More information

Hybrid Model Parallel Programs

Hybrid Model Parallel Programs Hybrid Model Parallel Programs Charlie Peck Intermediate Parallel Programming and Cluster Computing Workshop University of Oklahoma/OSCER, August, 2010 1 Well, How Did We Get Here? Almost all of the clusters

More information

Welcome to the 2017 Charm++ Workshop!

Welcome to the 2017 Charm++ Workshop! Welcome to the 2017 Charm++ Workshop! Laxmikant (Sanjay) Kale http://charm.cs.illinois.edu Parallel Programming Laboratory Department of Computer Science University of Illinois at Urbana Champaign 2017

More information

The Art of Parallel Processing

The Art of Parallel Processing The Art of Parallel Processing Ahmad Siavashi April 2017 The Software Crisis As long as there were no machines, programming was no problem at all; when we had a few weak computers, programming became a

More information

rcuda: an approach to provide remote access to GPU computational power

rcuda: an approach to provide remote access to GPU computational power rcuda: an approach to provide remote access to computational power Rafael Mayo Gual Universitat Jaume I Spain (1 of 60) HPC Advisory Council Workshop Outline computing Cost of a node rcuda goals rcuda

More information

Solving Dense Linear Systems on Graphics Processors

Solving Dense Linear Systems on Graphics Processors Solving Dense Linear Systems on Graphics Processors Sergio Barrachina Maribel Castillo Francisco Igual Rafael Mayo Enrique S. Quintana-Ortí High Performance Computing & Architectures Group Universidad

More information

QR Decomposition on GPUs

QR Decomposition on GPUs QR Decomposition QR Algorithms Block Householder QR Andrew Kerr* 1 Dan Campbell 1 Mark Richards 2 1 Georgia Tech Research Institute 2 School of Electrical and Computer Engineering Georgia Institute of

More information

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017

INTRODUCTION TO OPENACC. Analyzing and Parallelizing with OpenACC, Feb 22, 2017 INTRODUCTION TO OPENACC Analyzing and Parallelizing with OpenACC, Feb 22, 2017 Objective: Enable you to to accelerate your applications with OpenACC. 2 Today s Objectives Understand what OpenACC is and

More information

A Pluggable Framework for Composable HPC Scheduling Libraries

A Pluggable Framework for Composable HPC Scheduling Libraries A Pluggable Framework for Composable HPC Scheduling Libraries Max Grossman 1, Vivek Kumar 2, Nick Vrvilo 1, Zoran Budimlic 1, Vivek Sarkar 1 1 Habanero Extreme Scale So=ware Research Group, Rice University

More information

Designing Parallel Programs. This review was developed from Introduction to Parallel Computing

Designing Parallel Programs. This review was developed from Introduction to Parallel Computing Designing Parallel Programs This review was developed from Introduction to Parallel Computing Author: Blaise Barney, Lawrence Livermore National Laboratory references: https://computing.llnl.gov/tutorials/parallel_comp/#whatis

More information

Cray Scientific Libraries. Overview

Cray Scientific Libraries. Overview Cray Scientific Libraries Overview What are libraries for? Building blocks for writing scientific applications Historically allowed the first forms of code re-use Later became ways of running optimized

More information

Modeling and Simulation of Hybrid Systems

Modeling and Simulation of Hybrid Systems Modeling and Simulation of Hybrid Systems Luka Stanisic Samuel Thibault Arnaud Legrand Brice Videau Jean-François Méhaut CNRS/Inria/University of Grenoble, France University of Bordeaux/Inria, France JointLab

More information

HANDLING LOAD IMBALANCE IN DISTRIBUTED & SHARED MEMORY

HANDLING LOAD IMBALANCE IN DISTRIBUTED & SHARED MEMORY HANDLING LOAD IMBALANCE IN DISTRIBUTED & SHARED MEMORY Presenters: Harshitha Menon, Seonmyeong Bak PPL Group Phil Miller, Sam White, Nitin Bhat, Tom Quinn, Jim Phillips, Laxmikant Kale MOTIVATION INTEGRATED

More information

New Mapping Interface

New Mapping Interface New Mapping Interface Mike Bauer NVIDIA Research 1 Why Have A Mapping Interface? Legion Program Legion Legion Mapper Scheduling is hard Lots of runtimes have heuristics What do you do when they are wrong?

More information

Why? High performance clusters: Fast interconnects Hundreds of nodes, with multiple cores per node Large storage systems Hardware accelerators

Why? High performance clusters: Fast interconnects Hundreds of nodes, with multiple cores per node Large storage systems Hardware accelerators Remote CUDA (rcuda) Why? High performance clusters: Fast interconnects Hundreds of nodes, with multiple cores per node Large storage systems Hardware accelerators Better performance-watt, performance-cost

More information

Deep Learning Frameworks. COSC 7336: Advanced Natural Language Processing Fall 2017

Deep Learning Frameworks. COSC 7336: Advanced Natural Language Processing Fall 2017 Deep Learning Frameworks COSC 7336: Advanced Natural Language Processing Fall 2017 Today s lecture Deep learning software overview TensorFlow Keras Practical Graphical Processing Unit (GPU) From graphical

More information

HPC future trends from a science perspective

HPC future trends from a science perspective HPC future trends from a science perspective Simon McIntosh-Smith University of Bristol HPC Research Group simonm@cs.bris.ac.uk 1 Business as usual? We've all got used to new machines being relatively

More information

Search Algorithms for Discrete Optimization Problems

Search Algorithms for Discrete Optimization Problems Search Algorithms for Discrete Optimization Problems Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. 1 Topic

More information

Mapping MPI+X Applications to Multi-GPU Architectures

Mapping MPI+X Applications to Multi-GPU Architectures Mapping MPI+X Applications to Multi-GPU Architectures A Performance-Portable Approach Edgar A. León Computer Scientist San Jose, CA March 28, 2018 GPU Technology Conference This work was performed under

More information

Illinois Proposal Considerations Greg Bauer

Illinois Proposal Considerations Greg Bauer - 2016 Greg Bauer Support model Blue Waters provides traditional Partner Consulting as part of its User Services. Standard service requests for assistance with porting, debugging, allocation issues, and

More information

Recent Advances in Heterogeneous Computing using Charm++

Recent Advances in Heterogeneous Computing using Charm++ Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing

More information

PGI Accelerator Programming Model for Fortran & C

PGI Accelerator Programming Model for Fortran & C PGI Accelerator Programming Model for Fortran & C The Portland Group Published: v1.3 November 2010 Contents 1. Introduction... 5 1.1 Scope... 5 1.2 Glossary... 5 1.3 Execution Model... 7 1.4 Memory Model...

More information

Scalasca support for Intel Xeon Phi. Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany

Scalasca support for Intel Xeon Phi. Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany Scalasca support for Intel Xeon Phi Brian Wylie & Wolfgang Frings Jülich Supercomputing Centre Forschungszentrum Jülich, Germany Overview Scalasca performance analysis toolset support for MPI & OpenMP

More information

MPI in 2020: Opportunities and Challenges. William Gropp

MPI in 2020: Opportunities and Challenges. William Gropp MPI in 2020: Opportunities and Challenges William Gropp www.cs.illinois.edu/~wgropp MPI and Supercomputing The Message Passing Interface (MPI) has been amazingly successful First released in 1992, it is

More information