
Advanced Parallel Programming: Is there life beyond MPI?

Outline: MPI vs. high-level languages; declarative languages; MapReduce and Hadoop; shared global address space languages; Charm++; ChaNGa; ChaNGa on GPUs.

Parallel Programming in MPI: good performance; highly portable (the de facto standard). But it is a poor match to some architectures (active messages, shared memory), and new machines are hybrid architectures: multicore, vector, RDMA, Cell. Has MPI become a parallel assembly language?

Parallel Programming in High-Level Languages: abstraction allows easy expression of new algorithms; the low-level architecture is hidden (or abstracted away); integrated debugging and performance tools. On the other hand: sometimes a poor mapping of the algorithm onto the language, and a steep learning curve.

Parallel Programming Hierarchy: decomposition of the computation into parallel components (parallelizing compilers, Chapel); mapping of components to processors (Charm++); scheduling of components (OpenMP, HPF); expressing all of the above explicitly as data movement and thread execution (MPI).

Language Requirements: general purpose; expressive for the application domain, including matching representations (*(a + i) vs. a[i]); high level; efficiency and an obvious cost model; modularity and reusability (context-independent libraries); similar to, or interoperable with, existing languages.

Declarative Languages. SQL example: SELECT SUM(L_Bol) FROM stars WHERE tform > 12.0. Performance comes through abstraction, but expressivity is limited; beyond that, queries get complicated or slow (user-defined functions).

MapReduce & Hadoop. Map: a function that produces (key, value) pairs. Reduce: collects and combines the Map output. Pig: an SQL-like query language on top of Hadoop. An effective data-reduction framework, but not suitable for HPC.
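To make the Map and Reduce roles concrete, here is a minimal, single-node word-count sketch in C++ (illustrative only; this is not Hadoop code, and the record set is hypothetical): map emits (key, value) pairs and reduce sums the values per key.

```cpp
// Minimal in-memory sketch of the MapReduce pattern:
// "map" emits (key, value) pairs, "reduce" combines all values for a key.
#include <iostream>
#include <map>
#include <sstream>
#include <string>
#include <vector>

// Map phase: split each input record into words and emit (word, 1) pairs.
std::vector<std::pair<std::string, int>> map_record(const std::string& record) {
    std::vector<std::pair<std::string, int>> pairs;
    std::istringstream in(record);
    std::string word;
    while (in >> word) pairs.emplace_back(word, 1);
    return pairs;
}

// Reduce phase: sum the values emitted for each key.
std::map<std::string, int> reduce(const std::vector<std::pair<std::string, int>>& pairs) {
    std::map<std::string, int> counts;
    for (const auto& kv : pairs) counts[kv.first] += kv.second;
    return counts;
}

int main() {
    std::vector<std::string> records = {"the quick brown fox", "the lazy dog"};
    std::vector<std::pair<std::string, int>> emitted;
    for (const auto& r : records) {                 // map over all records
        auto pairs = map_record(r);
        emitted.insert(emitted.end(), pairs.begin(), pairs.end());
    }
    for (const auto& kv : reduce(emitted))          // reduce by key
        std::cout << kv.first << ": " << kv.second << "\n";
    return 0;
}
```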

Array Languages, e.g., Co-Array Fortran (CAF): arrays are distributed across images, and each processor can access data on other processors via the co-array syntax:

call sync_all((/up, down/))
new_a(1:ncol) = new_a(1:ncol) + A(1:ncol)[up] + A(1:ncol)[down]
call sync_all((/up, down/))

Easy expression of the array model; the cost model is transparent.

Charm++: Migratable Objects. Programmer: [over]decomposition into virtual processors (VPs). Runtime (system implementation): assigns VPs to physical processors, enabling adaptive runtime strategies. Benefits: software engineering (the number of virtual processors can be controlled independently of the number of processors; separate VPs for different modules); message-driven execution (adaptive overlap of communication and computation); dynamic mapping (heterogeneous clusters: vacate, adjust to speed, share; automatic checkpointing; changing the set of processors used; automatic dynamic load balancing; communication optimization).
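To make the migratable-objects model concrete, here is a minimal Charm++ sketch (hypothetical module and class names; it assumes a standard Charm++ install, where charmc generates the .decl.h/.def.h headers from the .ci interface file): the program creates several times more chare array elements (virtual processors) than physical processors, and the runtime maps and schedules them, delivering entry-method invocations as asynchronous messages.

```cpp
/* hello.ci (hypothetical interface file):
 * mainmodule hello {
 *   readonly CProxy_Main mainProxy;
 *   mainchare Main {
 *     entry Main(CkArgMsg* m);
 *     entry void done();
 *   };
 *   array [1D] Worker {
 *     entry Worker();
 *     entry void work();
 *   };
 * };
 */
#include "hello.decl.h"                    // generated by charmc from hello.ci

/*readonly*/ CProxy_Main mainProxy;        // set once, visible on every PE

class Main : public CBase_Main {
  int pending;                             // workers still running
public:
  Main(CkArgMsg* m) {
    delete m;
    mainProxy = thisProxy;
    int nWorkers = 4 * CkNumPes();         // overdecomposition: several VPs per PE
    pending = nWorkers;
    CProxy_Worker workers = CProxy_Worker::ckNew(nWorkers);
    workers.work();                        // broadcast an asynchronous entry-method call
  }
  void done() { if (--pending == 0) CkExit(); }
};

class Worker : public CBase_Worker {
public:
  Worker() {}
  Worker(CkMigrateMessage* m) {}           // lets the runtime migrate elements
  void work() {
    CkPrintf("element %d running on PE %d\n", thisIndex, CkMyPe());
    mainProxy.done();                      // message-driven reply to the main chare
  }
};

#include "hello.def.h"
```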

User view

System View

Gravity Implementations: standard tree code. Send: distribute particles to the tree nodes as the walk proceeds; naturally expressed in Charm++, but extremely communication intensive. Cache: request tree nodes from off-processor as they are needed; more complicated programming, but the cache is now part of the language.
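Below is a toy, self-contained C++ sketch of the request-on-miss idea (illustrative only; the class names are hypothetical and this is not ChaNGa's actual cache manager): on a miss the walker registers a callback and moves on, and when the simulated remote reply arrives the deferred work completes, which is how the cache overlaps communication with local work.

```cpp
// Toy sketch of a software node cache for a tree walk: miss -> register a
// callback and continue; a later "reply" fills the cache and runs the callback.
#include <functional>
#include <iostream>
#include <queue>
#include <unordered_map>
#include <vector>

struct TreeNode { double mass = 1.0; };               // stand-in for real node data

class NodeCache {
  std::unordered_map<long, TreeNode> cache;                       // local copies
  std::unordered_map<long, std::vector<std::function<void(const TreeNode&)>>> waiting;
  std::queue<long> pendingRequests;                               // "in flight"
public:
  void request(long key, std::function<void(const TreeNode&)> onArrival) {
    auto it = cache.find(key);
    if (it != cache.end()) { onArrival(it->second); return; }     // cache hit
    bool first = waiting[key].empty();
    waiting[key].push_back(std::move(onArrival));
    if (first) pendingRequests.push(key);                         // one fetch per key
  }
  // Simulate the network delivering one reply; in Charm++ this would be an
  // entry method invoked by a message from the owning processor.
  bool deliverOne() {
    if (pendingRequests.empty()) return false;
    long key = pendingRequests.front(); pendingRequests.pop();
    TreeNode node;                                                // "remote" data
    cache[key] = node;
    for (auto& cb : waiting[key]) cb(node);
    waiting.erase(key);
    return true;
  }
};

int main() {
  NodeCache cache;
  double force = 0.0;
  for (long key : {42L, 7L, 42L})                                 // walk touches nodes
    cache.request(key, [&](const TreeNode& n) { force += n.mass; });
  std::cout << "local work overlaps here...\n";                   // hide latency
  while (cache.deliverOne()) {}                                   // replies arrive
  std::cout << "force accumulated from 3 visits: " << force << "\n";
  return 0;
}
```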

ChaNGa Features: tree-based gravity solver; high-order multipole expansion; periodic boundaries (if needed); SPH (Gasoline compatible); individual (multiple) timesteps; dynamic load balancing with a choice of strategies; checkpointing (via migration to disk); visualization.

Cosmological Comparisons: Mass Function (Heitmann et al. 2005).

Overall structure (diagram).

Remote/local latency hiding (timeline): clustered dataset on 1,024 BlueGene/L processors; remote-data work overlapped with local-data work; 5.0 s per step.

Load balancing with GreedyLB (timeline): zoom-in 5M dataset on 1,024 BlueGene/L processors; 5.6 s vs. 6.1 s per step, with 4x the messages.

Load balancing with OrbRefineLB (timeline): zoom-in 5M dataset on 1,024 BlueGene/L processors; 5.6 s vs. 5.0 s per step.

Scaling with load balancing (plot): number of processors × execution time per iteration (s).

Cosmo Load Balancer: use the Charm++ measurement-based load balancer, modified to provide the load-balancing database with information about timestepping: for a large timestep, balance based on the previous large step; for a small step, balance based on the previous small step.
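A minimal sketch of how a Charm++ array element hooks into measurement-based load balancing (module, class names, and the rung bookkeeping are hypothetical; ChaNGa's actual modification additionally feeds rung information into the LB database): each element calls AtSync() at a safe point, the runtime measures its load and may migrate it, and execution resumes via ResumeFromSync().

```cpp
/* lb.ci (hypothetical interface file):
 * mainmodule lb {
 *   readonly CProxy_Main mainProxy;
 *   mainchare Main {
 *     entry Main(CkArgMsg* m);
 *     entry void done();
 *   };
 *   array [1D] Piece {
 *     entry Piece();
 *     entry void step(int rung);
 *   };
 * };
 */
#include "lb.decl.h"                          // generated by charmc from lb.ci

/*readonly*/ CProxy_Main mainProxy;
static const int NSTEPS = 3;

class Main : public CBase_Main {
  int pending;
public:
  Main(CkArgMsg* m) {
    delete m;
    mainProxy = thisProxy;
    pending = 4 * CkNumPes();                 // overdecomposed pieces
    CProxy_Piece pieces = CProxy_Piece::ckNew(pending);
    pieces.step(0);
  }
  void done() { if (--pending == 0) CkExit(); }
};

class Piece : public CBase_Piece {
  int rung;                                   // hypothetical rung bookkeeping
public:
  Piece() : rung(0) { usesAtSync = true; }    // opt in to measurement-based LB
  Piece(CkMigrateMessage* m) {}
  void pup(PUP::er& p) { CBase_Piece::pup(p); p | rung; }  // state moves with the object

  void step(int r) {
    rung = r;
    // ... compute this rung's work; the runtime measures the time spent here ...
    AtSync();                                 // safe point: load balancing may run
  }
  void ResumeFromSync() {                     // resumes here, possibly on another PE
    if (rung + 1 < NSTEPS) thisProxy[thisIndex].step(rung + 1);
    else mainProxy.done();
  }
};

#include "lb.def.h"
```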

Results on a 3-rung example (chart): 613 s, 429 s, 228 s.

Multistep Scaling

SPH Scaling

ChaNGa on GPU clusters: GPUs have immense computational power, but feeding the monster is a problem. Charm++ GPU Manager: the user submits work requests with a callback; the system transfers the memory, executes the kernel, and returns via the callback; the GPU operates asynchronously; execution is pipelined.
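For illustration, here is a plain CUDA/C++ sketch of the pattern the GPU Manager automates (this is not the GPU Manager API; the kernel and names are hypothetical): the work request's data transfers and kernel are queued asynchronously on a stream, and a host callback fires when the request completes, so the CPU is free to keep working in the meantime.

```cpp
// Asynchronous "work request" pattern in plain CUDA: copy in, compute, copy out
// on a stream, then a host callback when the whole request has finished.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void scale(float* x, int n, float a) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) x[i] *= a;                        // toy stand-in for a force kernel
}

// Host callback: in Charm++ this is where the user's callback would fire.
void CUDART_CB workDone(cudaStream_t, cudaError_t status, void* userData) {
  std::printf("work request finished (status %d) for %s\n",
              (int)status, (const char*)userData);
}

int main() {
  const int n = 1 << 20;
  float *h, *d;
  cudaMallocHost(&h, n * sizeof(float));       // pinned memory for async copies
  cudaMalloc(&d, n * sizeof(float));
  for (int i = 0; i < n; ++i) h[i] = 1.0f;

  cudaStream_t stream;
  cudaStreamCreate(&stream);
  // Pipeline: copy in, compute, copy out -- all asynchronous w.r.t. the host.
  cudaMemcpyAsync(d, h, n * sizeof(float), cudaMemcpyHostToDevice, stream);
  scale<<<(n + 255) / 256, 256, 0, stream>>>(d, n, 2.0f);
  cudaMemcpyAsync(h, d, n * sizeof(float), cudaMemcpyDeviceToHost, stream);
  cudaStreamAddCallback(stream, workDone, (void*)"request 0", 0);

  // The host is free to do other (CPU) work here while the GPU runs.
  cudaStreamSynchronize(stream);               // for the demo, wait at the end
  std::printf("h[0] = %f\n", h[0]);            // 2.0 after the scale kernel
  cudaFree(d); cudaFreeHost(h); cudaStreamDestroy(stream);
  return 0;
}
```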

Execution of Work Requests

GPU Scaling

GPU optimization

Summary: we successfully created a highly scalable code in a high-level language: computation/communication overlap, object migration for load balancing and checkpointing, method prioritization, and the GPU Manager framework. But a HLL is not a silver bullet: communication still needs to be considered, and the productivity gain is unclear ("Real programmers can write Fortran in any language").

Thomas Quinn, Graeme Lufkin, Joachim Stadel, James Wadsley, Laxmikant Kale, Filippo Gioachin, Pritish Jetley, Celso Mendes, Amit Sharma, Lukasz Wesolowski, Edgar Solomonik

Availability
Charm++: http://charm.cs.uiuc.edu
ChaNGa download: http://software.astro.washington.edu/nchilada/
Release information: http://hpcc.astro.washington.edu/tools/changa.htm
Mailing list: changa-users@u.washington.edu