The Potential of Diffusive Load Balancing at Large Scale

Similar documents
Shape Optimizing Load Balancing for Parallel Adaptive Numerical Simulations Using MPI

New Challenges In Dynamic Load Balancing

Partitioning and Partitioning Tools. Tim Barth NASA Ames Research Center Moffett Field, California USA

Requirements of Load Balancing Algorithm

Scalable Dynamic Load Balancing of Detailed Cloud Physics with FD4

Improvements in Dynamic Partitioning. Aman Arora Snehal Chitnavis

simulation framework for piecewise regular grids

New challenges in dynamic load balancing

Advanced Topics in Numerical Analysis: High Performance Computing

Graph Partitioning for High-Performance Scientific Simulations. Advanced Topics Spring 2008 Prof. Robert van Engelen

Scalable High-Quality 1D Partitioning

A Parallel Shape Optimizing Load Balancer

Peta-Scale Simulations with the HPC Software Framework walberla:

Load Balancing and Data Migration in a Hybrid Computational Fluid Dynamics Application

Shape Optimizing Load Balancing for MPI-Parallel Adaptive Numerical Simulations

Penalized Graph Partitioning for Static and Dynamic Load Balancing

Planar: Parallel Lightweight Architecture-Aware Adaptive Graph Repartitioning

Hierarchical Partitioning and Dynamic Load Balancing for Scientific Computation

Parallel static and dynamic multi-constraint graph partitioning

Load Balancing Myths, Fictions & Legends

Parallel Rendering. Jongwook Jin Kwangyun Wohn VR Lab. CS Dept. KAIST MAR

This article appeared in a journal published by Elsevier. The attached copy is furnished to the author for internal non-commercial research and

HPC Methods for Coupling Spectral Cloud Microphysics with the COSMO Model

Parallel repartitioning and remapping in

CS 140: Sparse Matrix-Vector Multiplication and Graph Partitioning

On Partitioning FEM Graphs using Diffusion

Enabling High Performance Computational Science through Combinatorial Algorithms

Load Balancing for Parallel Multi-core Machines with Non-Uniform Communication Costs

On Delay Adjustment for Dynamic Load Balancing in Distributed Virtual Environments

Image-Space-Parallel Direct Volume Rendering on a Cluster of PCs

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Topology and affinity aware hierarchical and distributed load-balancing in Charm++

A dynamic load-balancing strategy for large scale CFD-applications

AN APPROACH FOR LOAD BALANCING FOR SIMULATION IN HETEROGENEOUS DISTRIBUTED SYSTEMS USING SIMULATION DATA MINING

Parallel Multilevel Algorithms for Multi-constraint Graph Partitioning

CHRONO::HPC DISTRIBUTED MEMORY FLUID-SOLID INTERACTION SIMULATIONS. Felipe Gutierrez, Arman Pazouki, and Dan Negrut University of Wisconsin Madison

Some aspects of parallel program design. R. Bader (LRZ) G. Hager (RRZE)

PuLP: Scalable Multi-Objective Multi-Constraint Partitioning for Small-World Networks

Combinatorial problems in a Parallel Hybrid Linear Solver

Optimizing Parallel Sparse Matrix-Vector Multiplication by Corner Partitioning

Dr Tay Seng Chuan Tel: Office: S16-02, Dean s s Office at Level 2 URL:

Cluster Analysis. Ying Shen, SSE, Tongji University

Community Detection. Community

Communication and Topology-aware Load Balancing in Charm++ with TreeMatch

Parallel Graph Partitioning and Sparse Matrix Ordering Library Version 4.0

Consultation for CZ4102

Partitioning Effects on MPI LS-DYNA Performance

Contributions au partitionnement de graphes parallèle multi-niveaux

Preclass Warmup. ESE535: Electronic Design Automation. Motivation (1) Today. Bisection Width. Motivation (2)

Fast Dynamic Load Balancing for Extreme Scale Systems

Extreme Scalability Challenges in Micro-Finite Element Simulations of Human Bone

Partitioning of Public Transit Networks

Dynamic Load Balancing in Distributed Virtual Environments using Heat Diffusion

Joint Advanced Student School 2007 Martin Dummer

Proceedings of the 2014 Winter Simulation Conference A. Tolk, S. Y. Diallo, I. O. Ryzhov, L. Yilmaz, S. Buckley, and J. A. Miller, eds.

Basic Communication Operations Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

The JOSTLE executable user guide : Version 3.1

BBS654 Data Mining. Pinar Duygulu. Slides are adapted from Nazli Ikizler

Accelerating HPL on Heterogeneous GPU Clusters

High Performance Computing: Tools and Applications

Applying Graph Partitioning Methods in Measurement-based Dynamic Load Balancing

PARALLEL DECOMPOSITION OF 100-MILLION DOF MESHES INTO HIERARCHICAL SUBDOMAINS

Massively Parallel OpenMP-MPI Implementation of the SPH Code DualSPHysics

Mesh reordering in Fluidity using Hilbert space-filling curves

Exercises: April 11. Hermann Härtig, TU Dresden, Distributed OS, Load Balancing

CHAPTER 6 DEVELOPMENT OF PARTICLE SWARM OPTIMIZATION BASED ALGORITHM FOR GRAPH PARTITIONING

Parallel Algorithm for Multilevel Graph Partitioning and Sparse Matrix Ordering

Research Article Large-Scale CFD Parallel Computing Dealing with Massive Mesh

Graph Operators for Coupling-aware Graph Partitioning Algorithms

Dynamic Data Repartitioning for Load-Balanced Parallel Particle Tracing

Parallel Graph Partitioning for Complex Networks

JÜLICH SUPERCOMPUTING CENTRE Site Introduction Michael Stephan Forschungszentrum Jülich

28.1 Introduction to Parallel Processing

Adaptive-Mesh-Refinement Pattern

Architecture-Aware Graph Repartitioning for Data-Intensive Scientific Computing

F k G A S S1 3 S 2 S S V 2 V 3 V 1 P 01 P 11 P 10 P 00

ICON for HD(CP) 2. High Definition Clouds and Precipitation for Advancing Climate Prediction

Two-Level Dynamic Load Balancing Algorithm Using Load Thresholds and Pairwise Immigration

Applying Graph Partitioning Methods in Measurement-based Dynamic Load Balancing

APPLICATION SPECIFIC MESH PARTITION IMPROVEMENT

Dynamic load balancing in OSIRIS

AllScale Pilots Applications AmDaDos Adaptive Meshing and Data Assimilation for the Deepwater Horizon Oil Spill

Preconditioning Linear Systems Arising from Graph Laplacians of Complex Networks

Scalable Clustering of Signed Networks Using Balance Normalized Cut

Automated Mapping of Regular Communication Graphs on Mesh Interconnects

Using Clustering to Address Heterogeneity and Dynamism in Parallel Scientific Applications

PuLP. Complex Objective Partitioning of Small-World Networks Using Label Propagation. George M. Slota 1,2 Kamesh Madduri 2 Sivasankaran Rajamanickam 1

Charm++ Workshop 2010

CS 267 Applications of Parallel Computers. Lecture 23: Load Balancing and Scheduling. James Demmel

Algorithm XXX: Mongoose, A Graph Coarsening and Partitioning Library

Accelerated Load Balancing of Unstructured Meshes

TRAFFIC SIMULATION USING MULTI-CORE COMPUTERS. CMPE-655 Adelia Wong, Computer Engineering Dept Balaji Salunkhe, Electrical Engineering Dept

LOAD BALANCING DISTRIBUTED OPERATING SYSTEMS, SCALABILITY, SS Hermann Härtig

Hierarchy. No or very little supervision Some heuristic quality guidances on the quality of the hierarchy. Jian Pei: CMPT 459/741 Clustering (2) 1

Implementing the MPI Process Topology Mechanism

Interdisciplinary practical course on parallel finite element method using HiFlow 3

CSE 5243 INTRO. TO DATA MINING

Explicit and Implicit Coupling Strategies for Overset Grids. Jörg Brunswig, Manuel Manzke, Thomas Rung

Clustering Algorithms for general similarity measures

Interconnection networks

Transcription:

Center for Information Services and High Performance Computing The Potential of Diffusive Load Balancing at Large Scale EuroMPI 2016, Edinburgh, 27 September 2016 Matthias Lieber, Kerstin Gößner, Wolfgang E. Nagel matthias.lieber@tu-dresden.de

Motivation: Load Balancing Load Balance A challenge for HPC at large scale Especially for applications with workload variations Goals of load balancing Repartition application to balance workload Reduce comm. costs between partitions (edge cut) Reduce task migration costs Fast & scalable decision making Particle density, laser wakefield acceleration simulation with particle-in-cell code PIConGPU Slide 2

Motivation: Diffusive Load Balancing Fully distributed method Local operations lead to global convergence Cybenko, Dynamic Load Balancing for Distributed Memory Multiprocessors, J. Parallel Distr. Com. 7(2), 1989. Load per node over iterations Practical application is rare Well described since the 1990's Only few papers show real use in HPC Watts, Taylor, IEEE T. Parall. Distr. 9, 1998. Diekmann, Preis, Schlimbach, Walshaw, Parallel Computing 26(12), 2000. Schloegel, Karypis, Kumar, SC 2000. Motivation of this work Performance comparison to other state-of-the-art methods at large scale Slide 3

Contents Motivation Load Balancing Diffusive Load Balancing Short Diffusion Intro Concept Algorithms Performance Comparison Benchmark Setup Other Methods Results Slide 4

Short Diffusion Intro Concept Arrange processes/nodes in a graph G, e.g. mesh Balance virtual load with neighbors for several iterations until global convergence Result: minimal* load flow between neighbors in G that leads to global balance How to realize the flows? 2nd step required: task selection Satisfy flows best possible, keep edge cut and migration low (to reduce communication) * Most methods minimize sum of squares of individual flows between nodes (two-norm) Slide 5

Short Diffusion Intro: Algorithms Original Diffusion Algorithm (Orig Diff) In each iteration i each node v updates its i i i load: li+1 v =l v + α vw (l w l v ) w N v l v...load value of node v N v... neighbor nodes of node v α vw... diffusion parameter Cybenko, J. Parallel Distr. Com. 7(2), 1989. Estimation: One iteration should take few 10 µs only Second Order Diffusion (SO Diff) Prev. iteration s transfer influences current Improved Diffusion (Impr Diff) Update rule is adapted during iterations based on Laplacian matrix of graph G Dimension Exchange (Dim Exch) Local load is updated immediately before exchanging with next neighbor Muthukrishnan, Ghosh, Schultz, Theory Comput. Sys. 31, 1998. Hu, Blake, Parallel Computing 25(4), 1999. Cybenko, 1989. Xu, Monien, Lüling, Lau, Conc. Pract. E. 7, 1995. Slide 6

Contents Motivation Load Balancing Diffusive Load Balancing Short Diffusion Intro Concept Algorithms Performance Comparison Benchmark Setup Other Methods Results Slide 7

Performance Comparison: Diffusion Benchmark Benchmark setup 3D task grid, 3D process mesh, 512 tasks per proc Artificial imbalanced workload data* Iterations terminate at target imbalance of 0.1% 2D grid example of BOX scenario Red part is overloaded such that imbalance is 11% (i.e. max load / avg load - 1) Simplifications Time measurement w/o checking termination criterion Simple task selection algorithm (single pass) * in the paper we also use the particle-in-cell application szenario Slide 8

Performance Comparison: Other Methods Zoltan load balancing library MPI-based library implementations RCB: recursive coordinate bisection HSFC: Hilbert space-filling curve ParMetis graph partitioning via Zoltan Hierarchical space-filling curve Own fast and scalable method Leads to high migration http://www.cs.sandia.gov/zoltan Boman, Catalyurek, Chevalier, Devine, The Zoltan and Isorropia Parallel Toolkits for Combinatorial Scientific Computing: Partitioning, Ordering, and Coloring, Scientific Programming, 20(2), 2012. Schloegel, Karypis, Kumar, A Unified Algorithm for Load-balancing Adaptive Scientific Simulations, SC 2000. Lieber, Nagel, Scalable High-Quality 1D Partitioning, HPCS 2014. Slide 9

Performance Comparison: 1Ki-8Ki weak scaling = max(l v ) 1 avg (l v ) Max tasks sent+received among all procs Max number of task mesh edges cut by partition borders among all procs Median run time of 61 runs on Taurus, Intel Haswell + Infiniband FDR cluster with Intel MPI, error bars show 25/75 percentiles Iterations until flows lead to 0.1% imbalance (before task selection) Diffusion leads to smallest migration Diffusion achieves very good edge cut Diffusion run time ca. 2 ms for 8192 processes, Zoltan much slower Slide 10

Performance: 8Ki-128Ki, without task selection Max / total load transfer computed by diffusion relative to to avg / total load of procs Median run time of 19 runs on Juqueen, IBM Blue Gene/Q, error bars show 25/75 percentiles Iterations until flows lead to 0.1% imbalance Dimension exchange scales better than second order diffusion Diffusion takes few ms even on 128k processes* * task selection time does not depend on process count and takes few ms on Juqueen Slide 11

Summary Conclusion Diffusive load balancing is attractive on large scale when overhead (time for decision making, task migration) has to be low, e.g. in case of frequent rebalancing. Future work Improve task selection Scalable termination criterion: estimate required iterations or check convergence? Optimal process graph topology: match the hardware or the application? Add to Zoltan / Charm++ / application XYZ Slide 12

Thank you very much for your attention Acknowledgments / Funding: Slide 13