A CASE STUDY OF COMMUNICATION OPTIMIZATIONS ON 3D MESH INTERCONNECTS


Abhinav Bhatele, Eric Bohm, Laxmikant V. Kale
Parallel Programming Laboratory, University of Illinois at Urbana-Champaign
Euro-Par 2009

Outline
Motivation
Solution: Mapping of OpenAtom
Performance Benefits
Bigger Picture: Resources Needed, Heuristic Solutions, Automatic Mapping

OpenAtom
Ab-initio molecular dynamics code
Considers electrostatic interactions between the nuclei and electrons
Calculates different energy terms
Divided into different phases with a lot of communication

OpenAtom on Blue Gene/L
[Plot: time per step (secs) vs. number of cores (512 to 8192) for the w32 Default run; w32 = 32 water molecules with 70 Ry cutoff]
Runs on Blue Gene/L at IBM T. J. Watson Research Center, CO mode

The problem lies in...
Performance analysis and visualization tool: Projections (part of Charm++)
[Figure: timeline view]

Solution
Topology Aware Mapping

Processor Virtualization
Programmer: decomposes the computation into objects
Runtime: maps the computation onto the processors
[Figure: user view vs. system view]

Benefits of Charm++
Computation is divided into objects/chares/virtual processors (VPs)
Separates decomposition from mapping
VPs can be flexibly mapped to actual physical processors (PEs)

Topology Manager API
The application needs information such as:
  Dimensions of the partition
  Rank to physical coordinates and vice versa
TopoManager: a uniform API
  On BG/L and BG/P: provides a wrapper for system calls
  On XT3/4/5: there are no such system calls
  Provides a clean and uniform interface to the application
http://charm.cs.uiuc.edu/~bhatele/phd/topomgr.htm
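As a rough, self-contained sketch of the idea behind such a uniform interface, the toy class below translates ranks to 3D mesh coordinates and computes hop distances. The class name, dimensions, and rank ordering are illustrative assumptions, not the actual TopoManager code, which obtains the partition's dimensions and coordinate mapping from the system or other sources.

```cpp
// Toy sketch of a uniform topology interface: translate ranks to 3D mesh
// coordinates and back, and compute hop distances. The class name,
// dimensions and rank ordering here are illustrative assumptions.
#include <cstdio>
#include <cstdlib>

struct ToyTopoManager {
  int nx, ny, nz;  // dimensions of the allocated partition

  void rankToCoordinates(int pe, int &x, int &y, int &z) const {
    x = pe % nx;
    y = (pe / nx) % ny;
    z = pe / (nx * ny);
  }
  int coordinatesToRank(int x, int y, int z) const {
    return x + nx * (y + ny * z);
  }
  // Manhattan distance on a mesh (a torus would also consider wraparound)
  int hopsBetweenRanks(int pe1, int pe2) const {
    int x1, y1, z1, x2, y2, z2;
    rankToCoordinates(pe1, x1, y1, z1);
    rankToCoordinates(pe2, x2, y2, z2);
    return std::abs(x1 - x2) + std::abs(y1 - y2) + std::abs(z1 - z2);
  }
};

int main() {
  ToyTopoManager tmgr{8, 8, 16};  // e.g., a 1024-node partition
  int x, y, z;
  tmgr.rankToCoordinates(700, x, y, z);
  std::printf("rank 700 -> (%d,%d,%d)\n", x, y, z);
  std::printf("hops(0, 700) = %d\n", tmgr.hopsBetweenRanks(0, 700));
  return 0;
}
```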

Parallelization using Charm++
Eric Bohm, Glenn J. Martyna, Abhinav Bhatele, Sameer Kumar, Laxmikant V. Kale, John A. Gunnels, and Mark E. Tuckerman. Fine Grained Parallelization of the Car-Parrinello ab initio MD Method on Blue Gene/L. IBM Journal of Research and Development: Applications of Massively Parallel Systems, 52(1/2):159-174, 2008.

Mapping Challenge
Load balancing: multiple VPs per PE
Multiple groups of communicating objects
  Intra-group communication
  Inter-group communication
Conflicting communication requirements

Topology Mapping of Chare Arrays
RealSpace and GSpace have state-wise communication
PairCalculator and GSpace have plane-wise communication
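To make the placement idea concrete, here is a minimal sketch (not OpenAtom's actual mapping code) of a function that places a 2D chare array indexed by (state, plane) so that all chares of one state occupy a compact column of the mesh, keeping state-wise partners close while spreading planes along the remaining dimension. The blocking scheme and dimensions are assumptions for exposition.

```cpp
// Illustrative sketch only: place a 2D chare array indexed by (state, plane)
// so that a state's chares share an (x, y) column of the mesh and its planes
// are spread evenly along Z. Not OpenAtom's actual mapping.
#include <cstdio>

struct Coord { int x, y, z; };

Coord placeChare(int state, int plane,
                 int nx, int ny, int nz, int nPlanes) {
  Coord c;
  c.x = state % nx;               // spread states over the X-Y face
  c.y = (state / nx) % ny;
  c.z = (plane * nz) / nPlanes;   // stack a state's planes along Z
  return c;
}

int main() {
  // e.g., an 8 x 8 x 16 partition, 64 states, 128 planes per state
  for (int plane = 0; plane < 128; plane += 32) {
    Coord c = placeChare(/*state=*/5, plane, 8, 8, 16, 128);
    std::printf("state 5, plane %3d -> (%d,%d,%d)\n", plane, c.x, c.y, c.z);
  }
  return 0;
}
```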

Performance Improvements on BG/L
[Plot: time per step (secs) vs. number of cores (512 to 8192) comparing w32 Default and w32 Topology; w32 = 32 water molecules with 70 Ry cutoff]
Runs on Blue Gene/L at IBM T. J. Watson Research Center, CO mode, year: 2006

Improved Timeline Views
[Figure: timeline views after mapping]

Results on Blue Gene/L
[Plot: time per step (secs) vs. number of cores (1024 to 16384) comparing w256 Default, w256 Topology, GST_BIG Default and GST_BIG Topology; GST_BIG = 64 Ge, 128 Sb and 256 Te molecules with 20 Ry cutoff]
Runs on Blue Gene/L at IBM T. J. Watson Research Center, CO mode

Results on Blue Gene/P
[Plot: time per step (secs) vs. number of cores (1024 to 8192) comparing w256 Default and w256 Topology; w256 = 256 water molecules with 70 Ry cutoff]
Runs on Blue Gene/P at Argonne National Laboratory, VN mode

Results on Cray XT3
[Plot: time per step (secs) vs. number of cores (512 to 2048) comparing w256 Default, w256 Topology, GST_BIG Default and GST_BIG Topology]
Runs on Cray XT3 (Bigben) at Pittsburgh Supercomputing Center, VN mode (with system reservation to obtain complete 3D mesh shapes)

Performance Analysis
[Plot: idle time (secs), summed across all processors, vs. number of cores (1024 to 8192) for w256m_70ry on Blue Gene/L, Default vs. Topology]
Performance analysis and visualization tool: Projections

Reduction in Communication Volume
[Plot: bandwidth (GB) vs. number of cores (1024 to 8192) for w256m_70ry on Blue Gene/P, Default vs. Topology]
Data obtained from Blue Gene/P's Universal Performance Counters

Relative Performance Improvement
[Plot: % improvement for w256m_70ry vs. number of cores (512 to 8192) on Cray XT3, Blue Gene/L and Blue Gene/P]

Bigger Picture
Different kinds of applications:
  Computation bound
  Communication bound
  Latency tolerant
  Latency sensitive
Technique:
  Obtain processor topology and application communication graph
  Heuristic techniques for mapping
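One common way to drive such heuristics is to score candidate mappings with a metric like hop-bytes: bytes exchanged weighted by the hops they travel. The sketch below computes that score for a toy communication graph; the data structures are illustrative assumptions, not the authors' implementation.

```cpp
// Illustrative sketch: score a candidate mapping by the hop-bytes metric,
// i.e., the sum over all edges of the communication graph of
// (bytes exchanged) x (hops between the PEs the endpoints are placed on).
// Lower hop-bytes means the mapping injects less traffic into the network.
#include <cstdio>
#include <cstdlib>
#include <vector>

struct Edge { int src, dst; long bytes; };  // one edge of the communication graph
struct Coord { int x, y, z; };

int hops(const Coord &a, const Coord &b) {  // Manhattan distance on a mesh
  return std::abs(a.x - b.x) + std::abs(a.y - b.y) + std::abs(a.z - b.z);
}

// mapping[i] = mesh coordinates of the PE that object i is placed on
long hopBytes(const std::vector<Edge> &graph,
              const std::vector<Coord> &mapping) {
  long total = 0;
  for (const Edge &e : graph)
    total += e.bytes * hops(mapping[e.src], mapping[e.dst]);
  return total;
}

int main() {
  std::vector<Edge> graph = {{0, 1, 1 << 20}, {1, 2, 1 << 20}};
  std::vector<Coord> mapping = {{0, 0, 0}, {0, 0, 1}, {3, 3, 3}};
  std::printf("hop-bytes = %ld\n", hopBytes(graph, mapping));
  return 0;
}
```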

Why does distance affect message latencies?
Consider a 3D mesh/torus interconnect
Message latency can be modeled as (L_f / B) x D + L / B, where L_f = length of a flit, B = link bandwidth, D = number of hops, and L = message size
When (L_f x D) << L, the first term is negligible
But in the presence of contention, messages that travel more hops share links with more traffic, so latencies do grow with distance
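Plugging representative numbers into this model shows why the distance term is negligible in the absence of contention. The flit size, link bandwidth, and message size below are assumed values for illustration, not figures from the talk.

```cpp
// Worked example of the latency model above, with assumed parameters:
// L_f = 32-byte flits, B = 400 MB/s link bandwidth, L = 1 MB messages.
#include <cstdio>

int main() {
  const double Lf = 32.0;       // flit length (bytes)
  const double B  = 400e6;      // link bandwidth (bytes/sec)
  const double L  = 1 << 20;    // message size (bytes)
  for (int D = 1; D <= 16; D *= 2) {       // number of hops
    double perHop = (Lf / B) * D;          // grows with distance
    double serial = L / B;                 // independent of distance
    std::printf("D = %2d hops: per-hop term = %.2f us, L/B = %.0f us\n",
                D, perHop * 1e6, serial * 1e6);
  }
  return 0;
}
```

With these assumed numbers, even at 16 hops the distance term is about a microsecond against roughly 2600 microseconds of serialization time, which is why distance only starts to matter once messages contend for shared links.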

Automatic Topology Aware Mapping
Many MPI applications exhibit a simple two-dimensional near-neighbor communication pattern
Examples: MILC, WRF, POP, Stencil, ...
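As an illustration of how such a pattern can be exploited, the sketch below folds a 2D process grid onto a 3D mesh by cutting the grid into slabs and stacking them along Z, so most grid neighbors stay one hop apart. The dimensions and folding scheme are assumptions, not the automatic mapping framework itself.

```cpp
// Illustrative sketch: fold a 2D near-neighbor process grid onto a 3D mesh
// by cutting the grid into slabs of ny columns and stacking the slabs along
// Z, so most grid neighbors remain one hop apart (rows at slab boundaries
// end up farther apart).
#include <cstdio>

struct Coord { int x, y, z; };

// Map process (i, j) of a px x py grid onto an nx x ny x nz mesh,
// assuming px == nx and py == ny * nz.
Coord fold2Dto3D(int i, int j, int nx, int ny) {
  Coord c;
  c.x = i % nx;
  c.y = j % ny;
  c.z = j / ny;
  return c;
}

int main() {
  // e.g., an 8 x 128 process grid folded onto an 8 x 8 x 16 mesh
  for (int j = 0; j < 128; j += 40) {
    Coord c = fold2Dto3D(/*i=*/3, j, 8, 8);
    std::printf("grid (3,%3d) -> mesh (%d,%d,%d)\n", j, c.x, c.y, c.z);
  }
  return 0;
}
```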

Acknowledgements
Shawn Brown and Chad Vizino (PSC)
Glenn Martyna, Sameer Kumar, Fred Mintzer (IBM)
TeraGrid for running time on Bigben (XT3)
ANL for running time on Blue Gene/P
E-mail: bhatele@illinois.edu
Webpage: http://charm.cs.illinois.edu
Funding: DOE Grant B341494 (CSAR), DOE Grant DE-FG05-08OR23332 (ORNL LCF) and NSF Grant ITR 0121357