Welcome to the 2017 Charm++ Workshop!

Welcome to the 2017 Charm++ Workshop!
Laxmikant (Sanjay) Kale
http://charm.cs.illinois.edu
Parallel Programming Laboratory
Department of Computer Science
University of Illinois at Urbana-Champaign

A bit of history
This is the 15th workshop in a series that began in 2001.

A Reflection on the History
The name Charm++ dates from 1993, and most of the foundational concepts were in place by 2002.
So, what does this long period of 15 years signify? Maybe I was too slow. But I prefer this interpretation: we have been enhancing and adding features based on large-scale application development, in a long co-design cycle.
The research agenda opened up by the foundational concepts is vast. Although the foundations were laid by 2002, fleshing out the adaptive runtime capabilities is where much of the intellectual challenge and engineering work lies.

What is Charm++?
Charm++ is a generalized approach to writing parallel programs: an alternative to the likes of MPI, UPC, GA, etc., but not to sequential languages such as C, C++, and Fortran.
It represents the style of writing parallel programs, the runtime system, and the entire ecosystem that surrounds it.
Three design principles: overdecomposition, migratability, asynchrony.

Overdecomposition
Decompose the work units and data units into many more pieces than there are execution units (cores, nodes, ...).
Not so hard: we do decomposition anyway.
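
As a minimal sketch of what overdecomposition looks like in Charm++ code (not from the slides): create a chare array with many more elements than processors. The names Block, createBlocks, and compute(), and the factor of 8, are illustrative; the array is assumed to be declared in the module's .ci interface file.

    // Create many more chares (work/data units) than there are processors.
    // Assumes "array [1D] Block" with entry method compute() is declared in the .ci file.
    void createBlocks() {
      int numChares = 8 * CkNumPes();               // e.g., 8 pieces per processor
      CProxy_Block blocks = CProxy_Block::ckNew(numChares);
      blocks.compute();                             // broadcast: each chare works on its piece
    }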

Migratability
Allow these work and data units to be migratable at runtime, i.e. the programmer or the runtime can move them.
Consequences for the app developer: communication must now be addressed to logical units with global names, not to physical processors. But this is a good thing.
Consequences for the RTS: it must keep track of where each unit is; naming and location management.
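
In Charm++ this migratability is expressed through the PUP (pack/unpack) framework: a chare that provides a pup() routine and a migration constructor can be packed, moved, and unpacked by the RTS. A minimal sketch with illustrative member names (Block is assumed to be declared in the .ci file):

    #include <vector>
    #include "pup_stl.h"                    // PUP support for STL containers
    class Block : public CBase_Block {
      int step;
      std::vector<double> localData;
    public:
      Block() : step(0) {}
      Block(CkMigrateMessage *m) {}         // migration constructor, used by the RTS
      void pup(PUP::er &p) {                // serialize/deserialize this chare's state
        CBase_Block::pup(p);
        p | step;
        p | localData;
      }
    };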

Asynchrony: Message-Driven Execution
With overdecomposition and migratability, you have multiple units on each processor, and they address each other via logical names.
This creates a need for scheduling: in what sequence should the work units execute? One answer: let the programmer sequence them, as seen in current codes, e.g. some AMR frameworks.
Message-driven execution: let the work unit that happens to have data (a message) available for it execute next; let the RTS select among ready work units. The programmer should not specify what executes next, but can influence it via priorities.
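
Conceptually, the per-processor scheduler loop looks like the sketch below. This is illustrative pseudocode only, not the actual Charm++ implementation; all names (readyQueue, locate, invokeEntryMethod) are made up.

    // Illustrative message-driven scheduler loop (not the real RTS code).
    while (true) {
      Message *msg = readyQueue.dequeue();        // pick a message whose data has arrived,
                                                  // respecting priorities if any
      Chare *target = locate(msg->targetId);      // location management: find the object
      target->invokeEntryMethod(msg);             // run the corresponding entry method
    }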

Realization of this model in Charm++
Overdecomposed entities: chares. Chares are C++ objects with methods designated as entry methods, which can be invoked asynchronously by remote chares.
Chares are organized into indexed collections. Each collection may have its own indexing scheme: 1D, ..., 7D; sparse; bitvector or string as an index.
Chares communicate via asynchronous method invocations: A[i].foo(...); where A is the name of a collection and i is the index of the particular chare.
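
A minimal sketch of these pieces, assuming a hypothetical 1D collection named Hello with an entry method foo(); the interface (.ci) file and the C++ code are shown together:

    // hello.ci -- interface file (names are illustrative)
    mainmodule hello {
      mainchare Main {
        entry Main(CkArgMsg *m);
      };
      array [1D] Hello {
        entry Hello();
        entry void foo(int x);                 // entry method: asynchronously invocable
      };
    };

    // hello.C
    #include "hello.decl.h"
    class Main : public CBase_Main {
    public:
      Main(CkArgMsg *m) {
        CProxy_Hello A = CProxy_Hello::ckNew(100);   // a collection of 100 chares
        A[23].foo(42);          // asynchronous method invocation on the chare with index 23
      }
    };
    class Hello : public CBase_Hello {
    public:
      Hello() {}
      Hello(CkMigrateMessage *m) {}
      void foo(int x) { CkPrintf("chare %d received %d\n", thisIndex, x); }
    };
    #include "hello.def.h"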

Parallel Address Space
[Figure: chares distributed over Processor 0 through Processor 3.]

Message-driven Execution
[Figure: animation of an invocation A[23].foo(...) being delivered and scheduled; the frames show Processor 0 through Processor 3.]

Empowering the RTS
Overdecomposition, migratability, and asynchrony give the adaptive runtime system introspection and adaptivity.
The adaptive RTS can: dynamically balance loads; optimize communication (spread it over time, asynchronous collectives); provide automatic latency tolerance; and prefetch data with almost perfect predictability.
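
For example, dynamic load balancing is driven from the application side by the AtSync protocol: an array element opts in with usesAtSync, calls AtSync() periodically, and the RTS calls ResumeFromSync() after migrating objects. A sketch with illustrative names; iterate() is assumed to be an entry method, LB_PERIOD and doWork() are made up, and the .ci declarations are omitted.

    class Block : public CBase_Block {
      int step = 0;
      static const int LB_PERIOD = 10;         // hypothetical: balance every 10 iterations
      void doWork() { /* ... */ }
    public:
      Block() { usesAtSync = true; }           // opt in to measurement-based load balancing
      Block(CkMigrateMessage *m) {}
      void iterate() {
        doWork();
        if (++step % LB_PERIOD == 0)
          AtSync();                            // hand control to the load balancer
        else
          thisProxy[thisIndex].iterate();      // continue asynchronously
      }
      void ResumeFromSync() {                  // called by the RTS after (possible) migration
        thisProxy[thisIndex].iterate();
      }
    };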

Some Production Applications

Application       | Domain                        | Previous parallelization | Scale
NAMD              | Classical MD                  | PVM                      | 500k
ChaNGa            | N-body gravity & SPH          | MPI                      | 500k
EpiSimdemics      | Agent-based epidemiology      | MPI                      | 500k
OpenAtom          | Electronic structure          | MPI                      | 128k
SpECTRE           | Relativistic MHD              | -                        | 100k
FreeON/SpAMM      | Quantum chemistry             | OpenMP                   | 50k
Enzo-P/Cello      | Astrophysics/Cosmology        | MPI                      | 32k
ROSS              | PDES                          | MPI                      | 16k
SDG               | Elastodynamic fracture        | -                        | 10k
ADHydro           | Systems hydrology             | -                        | 1000
Disney ClothSim   | Textile & rigid body dynamics | TBB                      | 768
Particle Tracking | Velocimetry reconstruction    | -                        | 512
JetAlloc          | Stochastic MIP optimization   | -                        | 480

Relevance to Exascale
Intelligent, introspective, adaptive runtime systems, developed for handling applications' dynamic variability, already have features that can deal with the challenges posed by exascale hardware.

Relevant capabilities for Exascale
Load balancing.
Data-driven execution in support of task-based models.
Resilience, with multiple approaches: in-memory checkpoint, leveraging NVM, and message-logging for low MTBF, all leveraging object-based overdecomposition (see the checkpoint sketch below).
Power/thermal optimizations.
Shrink/expand the set of processors allocated during execution.
Adaptivity-aware resource management for whole-machine optimizations.
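
As a concrete example of the resilience support, the in-memory checkpoint is a single collective call. The sketch below assumes a main chare with an entry method resume() and a readonly proxy mainProxy; both names are illustrative.

    // Trigger an in-memory (double) checkpoint: each chare's state is saved via its
    // pup() routine in a buddy node's memory; on a failure the RTS restarts from it.
    void Main::startCheckpoint() {
      CkCallback cb(CkIndex_Main::resume(), mainProxy);  // continue here when done
      CkStartMemCheckpoint(cb);
    }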

IEEE Computer highlights Charm++'s energy-efficient runtime.

Interaction Between the Runtime System and the Resource Manager
Allows dynamic interaction between the system resource manager or scheduler and the job runtime system.
Meets system-level constraints such as power caps and hardware configurations.
Achieves the objectives of both datacenter users and system administrators.

Charm++ interoperates with MPI
So, you can write one module in Charm++ while keeping the rest in MPI.
[Figure: execution alternating between MPI control and Charm++ control.]
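
A sketch of the interoperation pattern from the MPI side, assuming the CharmLibInit/CharmLibExit interface from Charm++'s MPI interoperation support; HelloStart() stands in for a hypothetical entry point exported by the Charm++ module.

    #include <mpi.h>
    #include "mpi-interoperate.h"

    void HelloStart();                     // hypothetical: runs the Charm++ module

    int main(int argc, char **argv) {
      MPI_Init(&argc, &argv);
      // ... pure MPI phase ...
      CharmLibInit(MPI_COMM_WORLD, argc, argv);   // bring up Charm++ on the same ranks
      HelloStart();                               // hand control to the Charm++ module
      CharmLibExit();                             // tear Charm++ down
      // ... back to pure MPI ...
      MPI_Finalize();
      return 0;
    }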

Integration of Loop Parallelism
Used for transient load balancing within a node.
Mechanisms: Charm++'s old CkLoop construct; new integration with OpenMP (gomp, and now the LLVM runtime); BSC's OmpSs integration is orthogonal; other new OpenMP schedulers.
The RTS splits a loop into Charm++ messages, pushed into each local work-stealing queue, where idle threads within the same node can steal tasks.

[Figure: the integrated RTS (using the Charm++ construct or OpenMP pragmas) splits a loop, for (i = 0; i < n; i++) { ... }, into tasks placed on per-core task queues (Core 0, Core 1).]
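
A minimal sketch of the pattern the figure shows, using an OpenMP pragma inside a chare's entry method; doWork(), n, localData, result, and compute() are all illustrative names, and the program is assumed to be built with Charm++'s integrated OpenMP support.

    void Block::doWork() {
      // With the integrated runtime, these iterations become tasks pushed onto the
      // per-core task queues; idle threads within the node can steal them.
      #pragma omp parallel for
      for (int i = 0; i < n; i++)
        result[i] = compute(localData[i]);
    }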

Recent Developments: Charmworks, Inc.
Charm++ is now a commercially supported system, via Charmworks, Inc., supported by a DoE SBIR and a small set of initial customers.
Non-profit use (academia, US government labs, ...) remains free.
We are bringing improvements made by Charmworks into the university version (no forking of the code so far). Specific improvements have included better handling of errors, robustness and ease-of-use improvements, and production versions of research capabilities.
A new project at Charmworks provides support and improvements for Adaptive MPI (AMPI).

Upcoming Challenges and Opportunities
Fatter nodes.
Improved global load balancing support in the presence of GPGPUs.
Complex memory hierarchies (e.g., HBM): I think we are well-equipped for that, with prefetch.
Fine-grained messaging and lots of tiny chares: graph algorithms, some solvers, DES, ...
Subscale simulations, multiple simulations.
In-situ analytics.
Funding!

A glance at the Workshop
Keynotes: Michael Norman, Rajeev Thakur.
PPL talks: capabilities (load balancing, heterogeneity, DES) and algorithms (sorting, connected components).
Languages: DARMA, Green-Marl, HPX (non-Charm).
Applications: NAMD, ChaNGa, OpenAtom, multi-level summation, TaBaSCo (LANL, proxy app), Quinoa (LANL, adaptive CFD), SpECTRE (relativistic astrophysics).
Panel: relevance of exascale to mid-range HPC.