Solvers, Programming Models and Proto Apps. Tom Vander Aa, Application Workshop, October 2016, Manchester


2 Strong Exa-Scaling is Hard. CFD application today: 50M mesh points; in ten years: 500M. ExaScale computers: 10M cores, hence 50 mesh points per core. CFD Proxy Application: proto application of EXA2CT.
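The "50 mesh points per core" figure follows directly from the two numbers on the slide. A minimal sketch of that arithmetic, assuming an even split of the mesh over all cores (the function name is illustrative, not taken from the talk):

```python
def mesh_points_per_core(mesh_points: int, cores: int) -> float:
    """Average number of mesh points each core owns under an even split."""
    return mesh_points / cores


future_mesh_points = 500_000_000  # "In ten years: 500M" (from the slide)
exascale_cores = 10_000_000       # "10M cores" (from the slide)

points = mesh_points_per_core(future_mesh_points, exascale_cores)
print(f"{points:.0f} mesh points per core")  # prints "50 mesh points per core"
```

With so little work per core, communication cost is hard to amortise, which is the motivation for the strong-scaling experiments that follow.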

3 CFD-Proxy on >1 Xeon-Phi. [Figure: speedup versus number of cores]

4 CFD-Proxy on >1 Xeon-Phi. [Figure: speedup versus number of cores for the variants comm_free, gaspi_bulk_sync, gaspi_async, mpi_bulk_sync, mpi_early_recv, mpi_async, mpi_fence_bulk_sync, mpi_fence_async, mpi_pscw_bulk_sync, mpi_pscw_async]

5 Strong Exa-Scaling is Possible. DON'T: bulk synchronous; single-threaded communication; MPI data types. DO: asynchronous (GASPI write+notify, MPI ISend/IRecv); thread-to-thread communication; multi-threaded packing.
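The "multi-threaded packing" recommendation can be sketched as follows. This is an illustration only, not code from the CFD proxy: the names pack_slice, pack_multithreaded, field and halo_indices are hypothetical, and Python threads are used purely to show the structure (every thread packs its own disjoint share of the send buffer, instead of one thread packing everything or the library walking an MPI data type).

```python
from concurrent.futures import ThreadPoolExecutor


def pack_slice(field, halo_indices, send_buffer, start, stop):
    """Pack halo entries [start, stop) into the shared send buffer."""
    for k in range(start, stop):
        send_buffer[k] = field[halo_indices[k]]


def pack_multithreaded(field, halo_indices, num_threads=4):
    """Split the halo into disjoint chunks so that no locking is needed."""
    n = len(halo_indices)
    send_buffer = [0.0] * n
    chunk = max(1, (n + num_threads - 1) // num_threads)
    with ThreadPoolExecutor(max_workers=num_threads) as pool:
        futures = [
            pool.submit(pack_slice, field, halo_indices, send_buffer,
                        start, min(start + chunk, n))
            for start in range(0, n, chunk)
        ]
        for future in futures:
            future.result()  # re-raise any exception from a worker
    return send_buffer


# Toy data: a 20-entry field and the (hypothetical) indices a neighbour needs.
field = [float(i * i) for i in range(20)]
halo_indices = [19, 3, 7, 0, 12, 5, 16]
packed = pack_multithreaded(field, halo_indices, num_threads=3)
```

In a real code each thread would also send its own chunk as soon as it is packed (thread-to-thread communication); that step is left out here because it depends on the communication library.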

6 EXA2CT. Solvers that scale to ExaScale. Programming models that scale to ExaScale. Using relevant real-life proto applications. [Logos: TBB, CILK, PATUS, GASPI, SHARK]

7 Solvers that scale to ExaScale. Programming models that scale to ExaScale. Using relevant real-life proto applications.


9 Overlap communication and computation in pipelined solvers. Pipelined GMRES overlaps the global communication latency of the dot product with the SpMV. Available in PETSc.
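The overlap idea can be illustrated with a small, single-process sketch. It is not pipelined GMRES and not PETSc code: global_dot is a stand-in that fakes the latency of a global reduction with a sleep, and all names are hypothetical. The point is only the ordering: start the reduction, do the sparse matrix-vector product while it is in flight, then wait for the result.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def global_dot(x, y, latency=0.05):
    """Stand-in for a global reduction: a local dot product plus fake latency."""
    time.sleep(latency)
    return sum(a * b for a, b in zip(x, y))


def spmv(rows, x):
    """Sparse matrix-vector product; each row is a list of (column, value)."""
    return [sum(value * x[col] for col, value in row) for row in rows]


def overlapped_step(rows, v, w):
    """Start the reduction, compute the SpMV meanwhile, then collect both."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(global_dot, v, w)  # reduction is now in flight
        z = spmv(rows, w)                        # useful work hides the latency
        dot = pending.result()                   # wait only when it is needed
    return dot, z


# 4x4 tridiagonal test matrix with 2 on the diagonal and -1 next to it.
rows = [
    [(0, 2.0), (1, -1.0)],
    [(0, -1.0), (1, 2.0), (2, -1.0)],
    [(1, -1.0), (2, 2.0), (3, -1.0)],
    [(2, -1.0), (3, 2.0)],
]
v = [1.0, 2.0, 3.0, 4.0]
w = [1.0, 1.0, 1.0, 1.0]
dot, z = overlapped_step(rows, v, w)
```

In a distributed code the role of the background thread would be played by a non-blocking reduction, so that the SpMV runs while the reduction result travels across the machine.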

10 Counter Rounding Errors due to more Local Computations


12 GASPI in a nutshell. PGAS API, designed to be: simple; multithreaded; global asynchronous dataflow; interoperable with MPI. Calls shown: gaspi_write, gaspi_notify.
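The write-then-notify pattern behind gaspi_write and gaspi_notify can be mimicked in a single process. The sketch below is a toy model under stated assumptions, not the GASPI API: ToySegment, write_notify and wait_and_reset are hypothetical names, threads stand in for remote ranks, and threading.Event stands in for a notification. What it shows is the dataflow style: the receiver waits for one specific notification and then reads exactly the data that notification covers, with no global barrier.

```python
import threading


class ToySegment:
    """In-process stand-in for a remotely writable segment (not GASPI itself)."""

    def __init__(self, size, num_notifications):
        self.data = [0.0] * size
        self._flags = [threading.Event() for _ in range(num_notifications)]

    def write_notify(self, offset, values, notification_id):
        """Copy the values into the segment, then raise the notification."""
        self.data[offset:offset + len(values)] = values
        self._flags[notification_id].set()  # flag is raised after the data

    def wait_and_reset(self, notification_id, timeout=5.0):
        """Block until the notification arrives, then clear it for reuse."""
        arrived = self._flags[notification_id].wait(timeout)
        if arrived:
            self._flags[notification_id].clear()
        return arrived


segment = ToySegment(size=8, num_notifications=2)

# Two "remote" writers, each responsible for one half of the segment.
writers = [
    threading.Thread(target=segment.write_notify,
                     args=(0, [1.0, 2.0, 3.0, 4.0], 0)),
    threading.Thread(target=segment.write_notify,
                     args=(4, [5.0, 6.0, 7.0, 8.0], 1)),
]
for writer in writers:
    writer.start()

# The receiver consumes each half as soon as its own notification is there.
received = []
for notification_id, offset in ((1, 4), (0, 0)):
    if segment.wait_and_reset(notification_id):
        received.append(sum(segment.data[offset:offset + 4]))

for writer in writers:
    writer.join()
```

Because each notification guards only its own data, the receiver can start working on one half while the other half is still on its way.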

13 GASPI: key in EXA2CT. In proto-applications: aviation, machine learning, nano-electronics. In libraries: for task-based programming, for distributed work-stealing, for resilience.

14 Example: TITUS. Distributed work stealing using GASPI, using the small-world principle. Hide latency with work. Very high efficiency even for ill-balanced problems. [Figure: median iteration time versus process count for the groups orig 014M, orig 110M, TITUS 014M, TITUS 110M]
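The slide does not say how TITUS picks its victims, so the following is only one plausible reading of "work stealing using the small-world principle": an idle process usually tries a direct ring neighbour and occasionally a random distant process. The simulation runs in a single process with plain deques; pick_victim, run_simulation and the 10% long-range probability are assumptions made for this sketch.

```python
import random
from collections import deque


def pick_victim(thief, num_procs, rng, long_range_prob=0.1):
    """Mostly a ring neighbour, sometimes a random other process."""
    if rng.random() < long_range_prob:
        victim = rng.randrange(num_procs - 1)
        return victim if victim < thief else victim + 1  # never pick oneself
    return (thief + rng.choice((-1, 1))) % num_procs


def run_simulation(task_counts, seed=0):
    """Round-based simulation; returns how many tasks each process executed."""
    rng = random.Random(seed)
    queues = [deque(range(count)) for count in task_counts]
    done = [0] * len(task_counts)
    remaining = sum(task_counts)
    while remaining:
        for proc, queue in enumerate(queues):
            if queue:
                queue.popleft()  # execute one local task
                done[proc] += 1
                remaining -= 1
            else:
                victim = queues[pick_victim(proc, len(queues), rng)]
                for _ in range(len(victim) // 2):
                    queue.append(victim.pop())  # steal half, from the tail
    return done


# Ill-balanced start: all 40 tasks sit on process 0 of 8.
done = run_simulation([40, 0, 0, 0, 0, 0, 0, 0])
print(done)
```

Stealing half of the victim's queue at once is a common heuristic that keeps the number of steal attempts low; whether TITUS does the same is not stated in the slides.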


16 Proto Applications. MUPHY. Proto application data: millions of compounds, 1000s of targets, quarterly updated; only ~1% can be filled up with experimental dose response data. Why? Experimental cost >5$ xm cpds x000 targets. [Figure labels: ABCD, 10µM, 1nM]


18 EXA2CT: open source for you! Solvers: in PETSc. Programming libraries: GASPI, dynamic programming. Proto-applications: FEM/CFD, but also machine learning, multi-physics.


21 Partners

GASPI AND THE EXA2CT PROJECT JUNE 2015, CRIHAN- CORIA ERIC PETIT UVSQ

GASPI AND THE EXA2CT PROJECT JUNE 2015, CRIHAN- CORIA ERIC PETIT UVSQ GASPI AND THE EXA2CT PROJECT JUNE 2015, CRIHAN- CORIA ERIC PETIT UVSQ Outline A brief introduction to Exa2ct. About proto- applications Distributed/shared, harware/software, address space Outline An introduction

More information

European exascale applications workshop, Manchester, 11th and 12th October 2016 DLR TAU-Code - Application in INTERWinE

European exascale applications workshop, Manchester, 11th and 12th October 2016 DLR TAU-Code - Application in INTERWinE European exascale applications workshop, Manchester, 11th and 12th October 2016 DLR TAU-Code - Application in INTERWinE Thomas Gerhold, Barbara Brandfass, Jens Jägersküpper, DLR Christian Simmendinger,

More information

Evaluating New Communication Models in the Nek5000 Code for Exascale

Evaluating New Communication Models in the Nek5000 Code for Exascale Evaluating New Communication Models in the Nek5000 Code for Exascale Ilya Ivanov (KTH), Rui Machado (Fraunhofer), Mirko Rahn (Fraunhofer), Dana Akhmetova (KTH), Erwin Laure (KTH), Jing Gong (KTH), Philipp

More information

Towards a codelet-based runtime for exascale computing. Chris Lauderdale ET International, Inc.

Towards a codelet-based runtime for exascale computing. Chris Lauderdale ET International, Inc. Towards a codelet-based runtime for exascale computing Chris Lauderdale ET International, Inc. What will be covered Slide 2 of 24 Problems & motivation Codelet runtime overview Codelets & complexes Dealing

More information

GPI-2: a PGAS API for asynchronous and scalable parallel applications

GPI-2: a PGAS API for asynchronous and scalable parallel applications GPI-2: a PGAS API for asynchronous and scalable parallel applications Rui Machado CC-HPC, Fraunhofer ITWM Barcelona, 13 Jan. 2014 1 Fraunhofer ITWM CC-HPC Fraunhofer Institute for Industrial Mathematics

More information

HPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser

HPX. High Performance ParalleX CCT Tech Talk Series. Hartmut Kaiser HPX High Performance CCT Tech Talk Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 What s HPX? Exemplar runtime system implementation Targeting conventional architectures (Linux based SMPs and clusters) Currently,

More information

AUTOMATIC SMT THREADING

AUTOMATIC SMT THREADING AUTOMATIC SMT THREADING FOR OPENMP APPLICATIONS ON THE INTEL XEON PHI CO-PROCESSOR WIM HEIRMAN 1,2 TREVOR E. CARLSON 1 KENZO VAN CRAEYNEST 1 IBRAHIM HUR 2 AAMER JALEEL 2 LIEVEN EECKHOUT 1 1 GHENT UNIVERSITY

More information

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA

HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES. Cliff Woolley, NVIDIA HARNESSING IRREGULAR PARALLELISM: A CASE STUDY ON UNSTRUCTURED MESHES Cliff Woolley, NVIDIA PREFACE This talk presents a case study of extracting parallelism in the UMT2013 benchmark for 3D unstructured-mesh

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

Principles of Parallel Algorithm Design: Concurrency and Mapping

Principles of Parallel Algorithm Design: Concurrency and Mapping Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 17 January 2017 Last Thursday

More information

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it

CS 590: High Performance Computing. Parallel Computer Architectures. Lab 1 Starts Today. Already posted on Canvas (under Assignment) Let s look at it Lab 1 Starts Today Already posted on Canvas (under Assignment) Let s look at it CS 590: High Performance Computing Parallel Computer Architectures Fengguang Song Department of Computer Science IUPUI 1

More information

EXASCALE IN 2018 REALLY? FRANCK CAPPELLO INRIA&UIUC

EXASCALE IN 2018 REALLY? FRANCK CAPPELLO INRIA&UIUC EASCALE IN 2018 REALLY? FRANCK CAPPELLO INRIA&UIUC What are we talking about? 100M cores 12 cores/node Power Challenges Exascale Technology Roadmap Meeting San Diego California, December 2009. $1M per

More information

Principles of Parallel Algorithm Design: Concurrency and Mapping

Principles of Parallel Algorithm Design: Concurrency and Mapping Principles of Parallel Algorithm Design: Concurrency and Mapping John Mellor-Crummey Department of Computer Science Rice University johnmc@rice.edu COMP 422/534 Lecture 3 28 August 2018 Last Thursday Introduction

More information

Towards Exascale Programming Models HPC Summit, Prague Erwin Laure, KTH

Towards Exascale Programming Models HPC Summit, Prague Erwin Laure, KTH Towards Exascale Programming Models HPC Summit, Prague Erwin Laure, KTH 1 Exascale Programming Models With the evolution of HPC architecture towards exascale, new approaches for programming these machines

More information

Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation

Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation Achieving Efficient Strong Scaling with PETSc Using Hybrid MPI/OpenMP Optimisation Michael Lange 1 Gerard Gorman 1 Michele Weiland 2 Lawrence Mitchell 2 Xiaohu Guo 3 James Southern 4 1 AMCG, Imperial College

More information

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System

The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System The Uintah Framework: A Unified Heterogeneous Task Scheduling and Runtime System Alan Humphrey, Qingyu Meng, Martin Berzins Scientific Computing and Imaging Institute & University of Utah I. Uintah Overview

More information

ESPRESO ExaScale PaRallel FETI Solver. Hybrid FETI Solver Report

ESPRESO ExaScale PaRallel FETI Solver. Hybrid FETI Solver Report ESPRESO ExaScale PaRallel FETI Solver Hybrid FETI Solver Report Lubomir Riha, Tomas Brzobohaty IT4Innovations Outline HFETI theory from FETI to HFETI communication hiding and avoiding techniques our new

More information

AllScale Pilots Applications AmDaDos Adaptive Meshing and Data Assimilation for the Deepwater Horizon Oil Spill

AllScale Pilots Applications AmDaDos Adaptive Meshing and Data Assimilation for the Deepwater Horizon Oil Spill This project has received funding from the European Union s Horizon 2020 research and innovation programme under grant agreement No. 671603 An Exascale Programming, Multi-objective Optimisation and Resilience

More information

AutoTune Workshop. Michael Gerndt Technische Universität München

AutoTune Workshop. Michael Gerndt Technische Universität München AutoTune Workshop Michael Gerndt Technische Universität München AutoTune Project Automatic Online Tuning of HPC Applications High PERFORMANCE Computing HPC application developers Compute centers: Energy

More information

Continuum Computer Architecture

Continuum Computer Architecture Plenary Presentation to the Workshop on Frontiers of Extreme Computing: Continuum Computer Architecture Thomas Sterling California Institute of Technology and Louisiana State University October 25, 2005

More information

Parallel Programming Libraries and implementations

Parallel Programming Libraries and implementations Parallel Programming Libraries and implementations Partners Funding Reusing this material This work is licensed under a Creative Commons Attribution- NonCommercial-ShareAlike 4.0 International License.

More information

The next-generation CFD solver Flucs HPC aspects

The next-generation CFD solver Flucs HPC aspects The next-generation CFD solver Flucs HPC aspects Jens Jägersküpper German Aerospace Center Institute of Aerodynamics and Flow Technology Center for Computer Applications in AeroSpace Science and Engineering

More information

European exascale applications workshop, Edinburgh, 19th/20th April 2018 Asynchronous Execution in DLR's CFD Solvers

European exascale applications workshop, Edinburgh, 19th/20th April 2018 Asynchronous Execution in DLR's CFD Solvers European exascale applications workshop, Edinburgh, 19th/20th April 2018 Asynchronous Execution in DLR's CFD Solvers Thomas Gerhold Institute of Software Methods for Product Virtualization, Dresden DLR

More information

Welcome to the 2017 Charm++ Workshop!

Welcome to the 2017 Charm++ Workshop! Welcome to the 2017 Charm++ Workshop! Laxmikant (Sanjay) Kale http://charm.cs.illinois.edu Parallel Programming Laboratory Department of Computer Science University of Illinois at Urbana Champaign 2017

More information

Improving Uintah s Scalability Through the Use of Portable

Improving Uintah s Scalability Through the Use of Portable Improving Uintah s Scalability Through the Use of Portable Kokkos-Based Data Parallel Tasks John Holmen1, Alan Humphrey1, Daniel Sunderland2, Martin Berzins1 University of Utah1 Sandia National Laboratories2

More information

Top-Down System Design Approach Hans-Christian Hoppe, Intel Deutschland GmbH

Top-Down System Design Approach Hans-Christian Hoppe, Intel Deutschland GmbH Exploiting the Potential of European HPC Stakeholders in Extreme-Scale Demonstrators Top-Down System Design Approach Hans-Christian Hoppe, Intel Deutschland GmbH Motivation & Introduction Computer system

More information

Efficient AMG on Hybrid GPU Clusters. ScicomP Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann. Fraunhofer SCAI

Efficient AMG on Hybrid GPU Clusters. ScicomP Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann. Fraunhofer SCAI Efficient AMG on Hybrid GPU Clusters ScicomP 2012 Jiri Kraus, Malte Förster, Thomas Brandes, Thomas Soddemann Fraunhofer SCAI Illustration: Darin McInnis Motivation Sparse iterative solvers benefit from

More information

COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES

COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES COMPUTING ELEMENT EVOLUTION AND ITS IMPACT ON SIMULATION CODES P(ND) 2-2 2014 Guillaume Colin de Verdière OCTOBER 14TH, 2014 P(ND)^2-2 PAGE 1 CEA, DAM, DIF, F-91297 Arpajon, France October 14th, 2014 Abstract:

More information

A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores

A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores A Work Stealing Scheduler for Parallel Loops on Shared Cache Multicores Marc Tchiboukdjian Vincent Danjean Thierry Gautier Fabien Le Mentec Bruno Raffin Marc Tchiboukdjian A Work Stealing Scheduler for

More information

swsptrsv: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu

swsptrsv: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu swsptrsv: a Fast Sparse Triangular Solve with Sparse Level Tile Layout on Sunway Architecture 1,3 2 1,3 1,3 Xinliang Wang, Weifeng Liu, Wei Xue, Li Wu 1 2 3 Outline 1. Background 2. Sunway architecture

More information

Accelerating the Iterative Linear Solver for Reservoir Simulation

Accelerating the Iterative Linear Solver for Reservoir Simulation Accelerating the Iterative Linear Solver for Reservoir Simulation Wei Wu 1, Xiang Li 2, Lei He 1, Dongxiao Zhang 2 1 Electrical Engineering Department, UCLA 2 Department of Energy and Resources Engineering,

More information

ET International HPC Runtime Software. ET International Rishi Khan SC 11. Copyright 2011 ET International, Inc.

ET International HPC Runtime Software. ET International Rishi Khan SC 11. Copyright 2011 ET International, Inc. HPC Runtime Software Rishi Khan SC 11 Current Programming Models Shared Memory Multiprocessing OpenMP fork/join model Pthreads Arbitrary SMP parallelism (but hard to program/ debug) Cilk Work Stealing

More information

ICON for HD(CP) 2. High Definition Clouds and Precipitation for Advancing Climate Prediction

ICON for HD(CP) 2. High Definition Clouds and Precipitation for Advancing Climate Prediction ICON for HD(CP) 2 High Definition Clouds and Precipitation for Advancing Climate Prediction High Definition Clouds and Precipitation for Advancing Climate Prediction ICON 2 years ago Parameterize shallow

More information

Unified Runtime for PGAS and MPI over OFED

Unified Runtime for PGAS and MPI over OFED Unified Runtime for PGAS and MPI over OFED D. K. Panda and Sayantan Sur Network-Based Computing Laboratory Department of Computer Science and Engineering The Ohio State University, USA Outline Introduction

More information

The ECMWF forecast model, quo vadis?

The ECMWF forecast model, quo vadis? The forecast model, quo vadis? by Nils Wedi European Centre for Medium-Range Weather Forecasts wedi@ecmwf.int contributors: Piotr Smolarkiewicz, Mats Hamrud, George Mozdzynski, Sylvie Malardel, Christian

More information

Uni-Address Threads: Scalable Thread Management for RDMA-based Work Stealing

Uni-Address Threads: Scalable Thread Management for RDMA-based Work Stealing Uni-Address Threads: Scalable Thread Management for RDMA-based Work Stealing Shigeki Akiyama, Kenjiro Taura The University of Tokyo June 17, 2015 HPDC 15 Lightweight Threads Lightweight threads enable

More information

How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries

How to Boost the Performance of Your MPI and PGAS Applications with MVAPICH2 Libraries How to Boost the Performance of Your MPI and PGAS s with MVAPICH2 Libraries A Tutorial at the MVAPICH User Group (MUG) Meeting 18 by The MVAPICH Team The Ohio State University E-mail: panda@cse.ohio-state.edu

More information

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11

Contents. Preface xvii Acknowledgments. CHAPTER 1 Introduction to Parallel Computing 1. CHAPTER 2 Parallel Programming Platforms 11 Preface xvii Acknowledgments xix CHAPTER 1 Introduction to Parallel Computing 1 1.1 Motivating Parallelism 2 1.1.1 The Computational Power Argument from Transistors to FLOPS 2 1.1.2 The Memory/Disk Speed

More information

Written Exam / Tentamen

Written Exam / Tentamen Written Exam / Tentamen Computer Organization and Components / Datorteknik och komponenter (IS1500), 9 hp Computer Hardware Engineering / Datorteknik, grundkurs (IS1200), 7.5 hp KTH Royal Institute of

More information

Adaptive Power Profiling for Many-Core HPC Architectures

Adaptive Power Profiling for Many-Core HPC Architectures Adaptive Power Profiling for Many-Core HPC Architectures Jaimie Kelley, Christopher Stewart The Ohio State University Devesh Tiwari, Saurabh Gupta Oak Ridge National Laboratory State-of-the-Art Schedulers

More information

Hybrid Programming with MPI and SMPSs

Hybrid Programming with MPI and SMPSs Hybrid Programming with MPI and SMPSs Apostolou Evangelos August 24, 2012 MSc in High Performance Computing The University of Edinburgh Year of Presentation: 2012 Abstract Multicore processors prevail

More information

Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand

Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand Multi-Threaded UPC Runtime for GPU to GPU communication over InfiniBand Miao Luo, Hao Wang, & D. K. Panda Network- Based Compu2ng Laboratory Department of Computer Science and Engineering The Ohio State

More information

The GASPI API: A Failure Tolerant PGAS API for Asynchronous Dataflow on Heterogeneous Architectures

The GASPI API: A Failure Tolerant PGAS API for Asynchronous Dataflow on Heterogeneous Architectures The GASPI API: A Failure Tolerant PGAS API for Asynchronous Dataflow on Heterogeneous Architectures Christian Simmendinger, Mirko Rahn, and Daniel Gruenewald Abstract The Global Address Space Programming

More information

Multi-core Programming Evolution

Multi-core Programming Evolution Multi-core Programming Evolution Based on slides from Intel Software ollege and Multi-ore Programming increasing performance through software multi-threading by Shameem Akhter and Jason Roberts, Evolution

More information

Parallel Programming on Larrabee. Tim Foley Intel Corp

Parallel Programming on Larrabee. Tim Foley Intel Corp Parallel Programming on Larrabee Tim Foley Intel Corp Motivation This morning we talked about abstractions A mental model for GPU architectures Parallel programming models Particular tools and APIs This

More information

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman)

CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC. Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) CMSC 714 Lecture 6 MPI vs. OpenMP and OpenACC Guest Lecturer: Sukhyun Song (original slides by Alan Sussman) Parallel Programming with Message Passing and Directives 2 MPI + OpenMP Some applications can

More information

MPI+X on The Way to Exascale. William Gropp

MPI+X on The Way to Exascale. William Gropp MPI+X on The Way to Exascale William Gropp http://wgropp.cs.illinois.edu Likely Exascale Architectures (Low Capacity, High Bandwidth) 3D Stacked Memory (High Capacity, Low Bandwidth) Thin Cores / Accelerators

More information

SSS: An Implementation of Key-value Store based MapReduce Framework. Hirotaka Ogawa (AIST, Japan) Hidemoto Nakada Ryousei Takano Tomohiro Kudoh

SSS: An Implementation of Key-value Store based MapReduce Framework. Hirotaka Ogawa (AIST, Japan) Hidemoto Nakada Ryousei Takano Tomohiro Kudoh SSS: An Implementation of Key-value Store based MapReduce Framework Hirotaka Ogawa (AIST, Japan) Hidemoto Nakada Ryousei Takano Tomohiro Kudoh MapReduce A promising programming tool for implementing largescale

More information

How HPC Hardware and Software are Evolving Towards Exascale

How HPC Hardware and Software are Evolving Towards Exascale How HPC Hardware and Software are Evolving Towards Exascale Kathy Yelick Associate Laboratory Director and NERSC Director Lawrence Berkeley National Laboratory EECS Professor, UC Berkeley NERSC Overview

More information

Place Value. Verbal Form: 30,542 = Thirty thousand, five hundred forty-two. (Notice we don t use the word and.)

Place Value. Verbal Form: 30,542 = Thirty thousand, five hundred forty-two. (Notice we don t use the word and.) 1, etc.. π, 2, 3, etc.. SECTION 1.1 A set is a collection of objects. The set of natural numbers is {1,2,3,4,5,.} The set of whole numbers is {0,1,2,3,4,5, } Whole numbers are used for counting objects

More information

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory

Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Generating Efficient Data Movement Code for Heterogeneous Architectures with Distributed-Memory Roshan Dathathri Thejas Ramashekar Chandan Reddy Uday Bondhugula Department of Computer Science and Automation

More information

Parallel dense linear algebra computations (1)

Parallel dense linear algebra computations (1) Parallel dense linear algebra computations (1) Prof. Richard Vuduc Georgia Institute of Technology CSE/CS 8803 PNA, Spring 2008 [L.07] Tuesday, January 29, 2008 1 Sources for today s material Mike Heath

More information

HPX A GENERAL PURPOSE C++ RUNTIME SYSTEM FOR PARALLEL AND DISTRIBUTED APPLICATIONS OF ANY SCALE

HPX A GENERAL PURPOSE C++ RUNTIME SYSTEM FOR PARALLEL AND DISTRIBUTED APPLICATIONS OF ANY SCALE HPX A GENERAL PURPOSE C++ RUNTIME SYSTEM FOR PARALLEL AND DISTRIBUTED APPLICATIONS OF ANY SCALE The Venture Point TECHNOLOGY DEMANDS NEW RESPONSE 2 Technology Demands new Response 3 Technology Demands

More information

Introduction to Xeon Phi. Bill Barth January 11, 2013

Introduction to Xeon Phi. Bill Barth January 11, 2013 Introduction to Xeon Phi Bill Barth January 11, 2013 What is it? Co-processor PCI Express card Stripped down Linux operating system Dense, simplified processor Many power-hungry operations removed Wider

More information

MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores

MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores MetaFork: A Compilation Framework for Concurrency Platforms Targeting Multicores Presented by Xiaohui Chen Joint work with Marc Moreno Maza, Sushek Shekar & Priya Unnikrishnan University of Western Ontario,

More information

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters

Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Enabling Efficient Use of UPC and OpenSHMEM PGAS models on GPU Clusters Presentation at GTC 2014 by Dhabaleswar K. (DK) Panda The Ohio State University E-mail: panda@cse.ohio-state.edu http://www.cse.ohio-state.edu/~panda

More information

Next Generation CEA Computing Centres

Next Generation CEA Computing Centres Next Generation IO @ CEA Computing Centres J-Ch Lafoucriere ORAP Forum #39 2017-03-28 A long History of Storage Architectures Last Century Compute Systems Few Cray Supercomputers (vectors and MPP) Few

More information

Parallel Programming Environments. Presented By: Anand Saoji Yogesh Patel

Parallel Programming Environments. Presented By: Anand Saoji Yogesh Patel Parallel Programming Environments Presented By: Anand Saoji Yogesh Patel Outline Introduction How? Parallel Architectures Parallel Programming Models Conclusion References Introduction Recent advancements

More information

CGAR: Strong Consistency without Synchronous Replication. Seo Jin Park Advised by: John Ousterhout

CGAR: Strong Consistency without Synchronous Replication. Seo Jin Park Advised by: John Ousterhout CGAR: Strong Consistency without Synchronous Replication Seo Jin Park Advised by: John Ousterhout Improved update performance of storage systems with master-back replication Fast: updates complete before

More information

Contributors: Surabhi Jain, Gengbin Zheng, Maria Garzaran, Jim Cownie, Taru Doodi, and Terry L. Wilmarth

Contributors: Surabhi Jain, Gengbin Zheng, Maria Garzaran, Jim Cownie, Taru Doodi, and Terry L. Wilmarth Presenter: Surabhi Jain Contributors: Surabhi Jain, Gengbin Zheng, Maria Garzaran, Jim Cownie, Taru Doodi, and Terry L. Wilmarth May 25, 2018 ROME workshop (in conjunction with IPDPS 2018), Vancouver,

More information

An Example of Porting PETSc Applications to Heterogeneous Platforms with OpenACC

An Example of Porting PETSc Applications to Heterogeneous Platforms with OpenACC An Example of Porting PETSc Applications to Heterogeneous Platforms with OpenACC Pi-Yueh Chuang The George Washington University Fernanda S. Foertter Oak Ridge National Laboratory Goal Develop an OpenACC

More information

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser

ParalleX. A Cure for Scaling Impaired Parallel Applications. Hartmut Kaiser ParalleX A Cure for Scaling Impaired Parallel Applications Hartmut Kaiser (hkaiser@cct.lsu.edu) 2 Tianhe-1A 2.566 Petaflops Rmax Heterogeneous Architecture: 14,336 Intel Xeon CPUs 7,168 Nvidia Tesla M2050

More information

Integrating Analysis and Computation with Trios Services

Integrating Analysis and Computation with Trios Services October 31, 2012 Integrating Analysis and Computation with Trios Services Approved for Public Release: SAND2012-9323P Ron A. Oldfield Scalable System Software Sandia National Laboratories Albuquerque,

More information

Online Course Evaluation. What we will do in the last week?

Online Course Evaluation. What we will do in the last week? Online Course Evaluation Please fill in the online form The link will expire on April 30 (next Monday) So far 10 students have filled in the online form Thank you if you completed it. 1 What we will do

More information

Sayantan Sur, Intel. SEA Symposium on Overlapping Computation and Communication. April 4 th, 2018

Sayantan Sur, Intel. SEA Symposium on Overlapping Computation and Communication. April 4 th, 2018 Sayantan Sur, Intel SEA Symposium on Overlapping Computation and Communication April 4 th, 2018 Legal Disclaimer & Benchmark results were obtained prior to implementation of recent software patches and

More information

Building NVLink for Developers

Building NVLink for Developers Building NVLink for Developers Unleashing programmatic, architectural and performance capabilities for accelerated computing Why NVLink TM? Simpler, Better and Faster Simplified Programming No specialized

More information

Towards Exascale Computing with Fortran 2015

Towards Exascale Computing with Fortran 2015 Towards Exascale Computing with Fortran 2015 Alessandro Fanfarillo National Center for Atmospheric Research Damian Rouson Sourcery Institute Outline Parallelism in Fortran 2008 SPMD PGAS Exascale challenges

More information

Polyhedral Optimizations of Explicitly Parallel Programs

Polyhedral Optimizations of Explicitly Parallel Programs Habanero Extreme Scale Software Research Group Department of Computer Science Rice University The 24th International Conference on Parallel Architectures and Compilation Techniques (PACT) October 19, 2015

More information

Zigbee 3.0 and Dotdot Connecting the IoT. Jean-Pierre Desbenoit Schneider Electric Bruno Vulcano Legrand

Zigbee 3.0 and Dotdot Connecting the IoT. Jean-Pierre Desbenoit Schneider Electric Bruno Vulcano Legrand Zigbee 3.0 and Dotdot Connecting the IoT Jean-Pierre Desbenoit Schneider Electric Bruno Vulcano Legrand 1 We are the Standard Bearer of the Open IoT Established in 2002, our wideranging global membership

More information

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing

Programming Models for Multi- Threading. Brian Marshall, Advanced Research Computing Programming Models for Multi- Threading Brian Marshall, Advanced Research Computing Why Do Parallel Computing? Limits of single CPU computing performance available memory I/O rates Parallel computing allows

More information

Latency-Tolerant Software Distributed Shared Memory

Latency-Tolerant Software Distributed Shared Memory Latency-Tolerant Software Distributed Shared Memory Jacob Nelson, Brandon Holt, Brandon Myers, Preston Briggs, Luis Ceze, Simon Kahan, Mark Oskin University of Washington USENIX ATC 2015 July 9, 2015 25

More information

Lustre* - Fast Forward to Exascale High Performance Data Division. Eric Barton 18th April, 2013

Lustre* - Fast Forward to Exascale High Performance Data Division. Eric Barton 18th April, 2013 Lustre* - Fast Forward to Exascale High Performance Data Division Eric Barton 18th April, 2013 DOE Fast Forward IO and Storage Exascale R&D sponsored by 7 leading US national labs Solutions to currently

More information

Bring your application to a new era:

Bring your application to a new era: Bring your application to a new era: learning by example how to parallelize and optimize for Intel Xeon processor and Intel Xeon Phi TM coprocessor Manel Fernández, Roger Philp, Richard Paul Bayncore Ltd.

More information

DIAMOND RINGS ACKNOWLEDGED EVENT PROPAGATION IN MANY-CORE PROCESSORS

DIAMOND RINGS ACKNOWLEDGED EVENT PROPAGATION IN MANY-CORE PROCESSORS th August DIAMOND RINGS ACKNOWLEDGED EVENT PROPAGATION IN MANY-CORE PROCESSORS Stefan Nürnberger, Randolf Rotta, Gabor Drescher, Daniel Danner, Jörg Nolte ACKNOWLEDGED EVENT PROPAGATION What does it do?

More information

Overview of research activities Toward portability of performance

Overview of research activities Toward portability of performance Overview of research activities Toward portability of performance Do dynamically what can t be done statically Understand evolution of architectures Enable new programming models Put intelligence into

More information

Improving the interoperability between MPI and OmpSs-2

Improving the interoperability between MPI and OmpSs-2 Improving the interoperability between MPI and OmpSs-2 Vicenç Beltran Querol vbeltran@bsc.es 19/04/2018 INTERTWinE Exascale Application Workshop, Edinburgh Why hybrid MPI+OmpSs-2 programming? Gauss-Seidel

More information

Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends

Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Dense Linear Algebra on Heterogeneous Platforms: State of the Art and Trends Paolo Bientinesi AICES, RWTH Aachen pauldj@aices.rwth-aachen.de ComplexHPC Spring School 2013 Heterogeneous computing - Impact

More information
