New Challenges In Dynamic Load Balancing

New Challenges In Dynamic Load Balancing Karen D. Devine, et al. Presentation by Nam Ma & J. Anthony Toghia

What is load balancing? The assignment of work to processors. Goal: maximize parallel performance by minimizing CPU idle time and interprocessor communication overhead.

Static vs. dynamic load balancing. Static: applications with constant (i.e., predictable) workloads; runs as a pre-processor to the computation. Dynamic: unpredictable workloads (e.g., adaptive finite element methods); on-the-fly adjustment of the application's decomposition.

Current challenges: most load balancers are custom-written for an application; lack of code reuse; inability to compare different load-balancing algorithms; limited range of applications (symmetric, sparsely connected relationships); architecture dependence.

Zoltan Parallel Data Services Toolkit: a library of expert implementations of related algorithms. General purpose, applicable to a wide variety of applications and algorithms. Benefits: code reuse and portability; more thorough testing; comparison of various load balancers by changing a run-time parameter. Data-structure-neutral design: data are represented as generic objects with weights representing their computational costs.

Zoltan: additional tools. Data migration tools to move data between old and new decompositions. Callbacks to user-defined constructors for data structures, to support data-structure-neutral migration. An unstructured communication package (for custom, complex migrations). All tools may be used independently (e.g., the user can use Zoltan to compute a decomposition but perform data migration manually, or vice versa).

Tutorial: The Zoltan Toolkit. Karen Devine and Cedric Chevalier, Sandia National Laboratories, NM; Umit Çatalyürek, Ohio State University. CSCAPES Workshop, June 2008. Sandia is a multiprogram laboratory operated by Sandia Corporation, a Lockheed Martin Company, for the United States Department of Energy's National Nuclear Security Administration under contract DE-AC04-94AL85000.

Slide 8: The Zoltan Toolkit. A library of data management services for unstructured, dynamic, and/or adaptive computations: Dynamic Load Balancing, Graph Coloring, Data Migration, Matrix Ordering, Unstructured Communication, Distributed Data Directories.

Slide 9: Zoltan Application Interface. Application side: initialize Zoltan (Zoltan_Initialize, Zoltan_Create); select the method and parameters (Zoltan_Set_Param); register query functions (Zoltan_Set_Fn); compute; re-partition (Zoltan_LB_Partition); move data (Zoltan_Migrate); clean up (Zoltan_Destroy). Zoltan side: Zoltan_LB_Partition calls the query functions, builds its data structures, computes the new decomposition, and returns import/export lists. Zoltan_Migrate calls the packing query functions for exports, sends the exports, receives the imports, and calls the unpacking query functions for the imports.
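A minimal sketch of this call sequence in C. The query functions get_num_obj and get_obj_list are hypothetical application callbacks; argument lists follow the Zoltan User's Guide but should be checked against the installed version, and geometric methods such as RCB additionally require geometry query functions that are omitted here.

```c
#include <mpi.h>
#include <zoltan.h>

/* Hypothetical application query functions (names are illustrative). */
static int get_num_obj(void *data, int *ierr) {
  *ierr = ZOLTAN_OK;
  return 0;   /* replace with the number of objects owned by this rank */
}

static void get_obj_list(void *data, int num_gid, int num_lid,
                         ZOLTAN_ID_PTR gids, ZOLTAN_ID_PTR lids,
                         int wgt_dim, float *wgts, int *ierr) {
  *ierr = ZOLTAN_OK;
  /* fill gids/lids (and weights if wgt_dim > 0) for the locally owned objects */
}

int main(int argc, char **argv) {
  float ver;
  MPI_Init(&argc, &argv);
  Zoltan_Initialize(argc, argv, &ver);

  struct Zoltan_Struct *zz = Zoltan_Create(MPI_COMM_WORLD);
  Zoltan_Set_Param(zz, "LB_METHOD", "RCB");       /* select a partitioner     */
  Zoltan_Set_Num_Obj_Fn(zz, get_num_obj, NULL);   /* register query functions */
  Zoltan_Set_Obj_List_Fn(zz, get_obj_list, NULL);
  /* RCB also needs ZOLTAN_NUM_GEOM_FN / ZOLTAN_GEOM_MULTI_FN callbacks (omitted). */

  int changes, num_gid, num_lid, num_imp, num_exp;
  ZOLTAN_ID_PTR imp_gids, imp_lids, exp_gids, exp_lids;
  int *imp_procs, *imp_parts, *exp_procs, *exp_parts;

  /* Compute a new decomposition; Zoltan returns import/export lists. */
  Zoltan_LB_Partition(zz, &changes, &num_gid, &num_lid,
                      &num_imp, &imp_gids, &imp_lids, &imp_procs, &imp_parts,
                      &num_exp, &exp_gids, &exp_lids, &exp_procs, &exp_parts);

  /* ... move data here with Zoltan_Migrate or application-specific code ... */

  Zoltan_LB_Free_Part(&imp_gids, &imp_lids, &imp_procs, &imp_parts);
  Zoltan_LB_Free_Part(&exp_gids, &exp_lids, &exp_procs, &exp_parts);
  Zoltan_Destroy(&zz);
  MPI_Finalize();
  return 0;
}
```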

Slide 10: Static Partitioning. Flow: Initialize Application; Partition Data; Distribute Data; Compute Solutions; Output & End. Static partitioning in an application: the data partition is computed; data are distributed according to the partition map; the application computes. Ideal partition: processor idle time is minimized; inter-processor communication costs are kept low. Zoltan_Set_Param(zz, "LB_APPROACH", "PARTITION");

Slide 11: Dynamic Repartitioning (a.k.a. Dynamic Load Balancing). Flow: Initialize Application; Partition Data; Redistribute Data; Compute Solutions & Adapt; Output & End. Dynamic repartitioning (load balancing) in an application: the data partition is computed; data are redistributed according to the partition map; the application computes and, perhaps, adapts; the process repeats until the application is done. Ideal partition: processor idle time is minimized; inter-processor communication costs are kept low; the cost to redistribute data is also kept low. Zoltan_Set_Param(zz, "LB_APPROACH", "REPARTITION");
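A rough sketch of that loop in C, assuming the Zoltan handle zz was set up as above; compute_and_adapt, load_imbalance, and migrate_data are hypothetical application hooks, not Zoltan calls.

```c
#include <zoltan.h>

/* Hypothetical application hooks (not part of Zoltan). */
void   compute_and_adapt(int step);
double load_imbalance(void);
void   migrate_data(int num_exp, ZOLTAN_ID_PTR exp_gids, int *exp_procs);

/* Sketch of the dynamic repartitioning loop on slide 11. */
void run_simulation(struct Zoltan_Struct *zz, int num_steps, double tol) {
  Zoltan_Set_Param(zz, "LB_APPROACH", "REPARTITION");
  for (int step = 0; step < num_steps; ++step) {
    compute_and_adapt(step);            /* application work; may adapt the mesh */
    if (load_imbalance() > tol) {       /* rebalance only when imbalance grows  */
      int changes, num_gid, num_lid, num_imp, num_exp;
      ZOLTAN_ID_PTR imp_gids, imp_lids, exp_gids, exp_lids;
      int *imp_procs, *imp_parts, *exp_procs, *exp_parts;
      Zoltan_LB_Partition(zz, &changes, &num_gid, &num_lid,
                          &num_imp, &imp_gids, &imp_lids, &imp_procs, &imp_parts,
                          &num_exp, &exp_gids, &exp_lids, &exp_procs, &exp_parts);
      migrate_data(num_exp, exp_gids, exp_procs);  /* or Zoltan_Migrate with pack/unpack callbacks */
      Zoltan_LB_Free_Part(&imp_gids, &imp_lids, &imp_procs, &imp_parts);
      Zoltan_LB_Free_Part(&exp_gids, &exp_lids, &exp_procs, &exp_parts);
    }
  }
}
```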

Slide 12: Zoltan Toolkit: Suite of Partitioners. No single partitioner works best for all applications. Trade-offs: quality vs. speed; geometric locality vs. data dependencies; high data-movement costs vs. tolerance for remapping. Application developers may not know which partitioner is best for their application. Zoltan contains a suite of partitioning methods, and the application changes only one parameter to switch methods: Zoltan_Set_Param(zz, "LB_METHOD", new_method_name); This allows experimentation and comparisons to find the most effective partitioner for the application.

Partitioning methods. Geometric (coordinate-based) methods: Recursive Coordinate Bisection (Berger, Bokhari); Recursive Inertial Bisection (Taylor, Nour-Omid); Space-Filling Curve Partitioning (Warren & Salmon, et al.); Refinement-tree Partitioning (Mitchell). Combinatorial (topology-based) methods: Hypergraph Partitioning; Hypergraph Repartitioning; PaToH (Catalyurek & Aykanat); Zoltan Graph Partitioning; ParMETIS (U. Minnesota); Jostle (U. Greenwich). 13

Geometric Partitioning. Suitable for a class of applications, such as crash and particle simulations, in which geometric-coordinate-based decomposition is a more natural and straightforward approach, graph partitioning is difficult or impossible, and frequent changes in proximity require fast, dynamic repartitioning strategies. Based on two main algorithms: recursive coordinate bisection and space-filling curves. 14

Recursive Coordinate Bisection (RCB). Divide the domain into two equal subdomains using a cutting plane orthogonal to a coordinate axis, then recursively cut the resulting subdomains. A variation of RCB is Recursive Inertial Bisection, which computes cuts orthogonal to the principal inertia axes of the geometry. 15
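A minimal serial sketch of the RCB idea (not Zoltan's implementation): recursively split the point set at the median along its widest coordinate axis until the requested number of parts is reached.

```c
#include <stdlib.h>
#include <string.h>

typedef struct { double x[3]; int part; } Point;

static int cmp_axis;   /* axis used by the comparator below (sketch; not thread-safe) */
static int cmp_points(const void *a, const void *b) {
  double da = ((const Point *)a)->x[cmp_axis], db = ((const Point *)b)->x[cmp_axis];
  return (da > db) - (da < db);
}

/* Assign parts [part0, part0 + nparts) to points[0..n) by recursive bisection. */
static void rcb(Point *points, int n, int part0, int nparts) {
  if (nparts <= 1 || n <= 1) {
    for (int i = 0; i < n; ++i) points[i].part = part0;
    return;
  }
  /* Pick the coordinate axis with the largest extent. */
  double lo[3] = {points[0].x[0], points[0].x[1], points[0].x[2]}, hi[3];
  memcpy(hi, lo, sizeof lo);
  for (int i = 1; i < n; ++i)
    for (int d = 0; d < 3; ++d) {
      if (points[i].x[d] < lo[d]) lo[d] = points[i].x[d];
      if (points[i].x[d] > hi[d]) hi[d] = points[i].x[d];
    }
  int axis = 0;
  for (int d = 1; d < 3; ++d)
    if (hi[d] - lo[d] > hi[axis] - lo[axis]) axis = d;

  /* Cut at the median along that axis (a full sort is simple but O(n log n);
     real implementations use selection or coordinate histograms instead). */
  cmp_axis = axis;
  qsort(points, n, sizeof(Point), cmp_points);
  int nleft = (int)((long long)n * (nparts / 2) / nparts);
  rcb(points, nleft, part0, nparts / 2);
  rcb(points + nleft, n - nleft, part0 + nparts / 2, nparts - nparts / 2);
}
```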

Space Filling Curve (SFC). An SFC is a mapping from R^n to R^1 that completely fills a domain. SFC partitioning: run an SFC through the domain; order objects according to their position on the curve; perform a 1-D partition of the curve. 16
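A small sketch of the same idea using a Morton (Z-order) curve rather than the Hilbert curve used by Zoltan's HSFC method: build a key by interleaving the bits of quantized coordinates, sort objects by key, and cut the sorted list into equal pieces.

```c
#include <stdint.h>
#include <stdlib.h>

/* Spread the low 21 bits of v so they occupy every third bit position. */
static uint64_t spread3(uint64_t v) {
  v &= 0x1fffff;
  v = (v | v << 32) & 0x1f00000000ffffULL;
  v = (v | v << 16) & 0x1f0000ff0000ffULL;
  v = (v | v <<  8) & 0x100f00f00f00f00fULL;
  v = (v | v <<  4) & 0x10c30c30c30c30c3ULL;
  v = (v | v <<  2) & 0x1249249249249249ULL;
  return v;
}

/* Morton key for coordinates normalized to [0,1). */
static uint64_t morton_key(double x, double y, double z) {
  uint64_t xi = (uint64_t)(x * (1 << 21));
  uint64_t yi = (uint64_t)(y * (1 << 21));
  uint64_t zi = (uint64_t)(z * (1 << 21));
  return spread3(xi) | (spread3(yi) << 1) | (spread3(zi) << 2);
}

typedef struct { uint64_t key; int id; } SfcObj;

static int cmp_key(const void *a, const void *b) {
  uint64_t ka = ((const SfcObj *)a)->key, kb = ((const SfcObj *)b)->key;
  return (ka > kb) - (ka < kb);
}

/* 1-D partition of the curve: sort by key, then cut into equal chunks. */
static void sfc_partition(SfcObj *objs, int n, int nparts, int *part_of) {
  qsort(objs, n, sizeof(SfcObj), cmp_key);
  for (int i = 0; i < n; ++i)
    part_of[objs[i].id] = (int)(((long long)i * nparts) / n);
}
```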

Geometric Partitioning - Advantages and Disadvantages. Advantages: simple, fast, inexpensive; effective when geometric locality is important; no connectivity needed; implicitly incremental for repartitioning. Disadvantages: no explicit control of communication costs; coordinate information is needed. 17

Repartitioning. Implicitly incremental: small changes in workloads produce only small changes in the decomposition, which reduces the cost of moving application data. 18

An Application: Contact Detection in Crash Simulation. Identify which partition's subdomains intersect a given point (point assignment) or region (box assignment). Point assignment: given a point, return the partition owning the region of space containing that point. Box assignment: given an axis-aligned region of space, return a list of partitions whose assigned regions overlap the specified box. 19
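Zoltan exposes these queries once a geometric decomposition has been computed with cuts kept; a short hedged sketch (argument lists follow the Zoltan User's Guide and should be checked against the installed version):

```c
#include <zoltan.h>

/* Point and box assignment against the stored RCB/RIB/HSFC decomposition.
   Requires Zoltan_Set_Param(zz, "KEEP_CUTS", "1") before partitioning. */
void contact_queries(struct Zoltan_Struct *zz) {
  double pt[3] = {0.1, 0.2, 0.3};
  int proc;
  Zoltan_LB_Point_Assign(zz, pt, &proc);    /* which processor owns this point? */

  int procs[64], numprocs;
  Zoltan_LB_Box_Assign(zz, 0.0, 0.0, 0.0,   /* xmin, ymin, zmin */
                           0.5, 0.5, 0.5,   /* xmax, ymax, zmax */
                       procs, &numprocs);   /* processors overlapping the box   */
}
```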

Performance. Experimental setup: initialized 96 partitions on a 16-processor cluster and performed 10,000 box assignments. Results compare RCB and SFC. 20

Slide 21: Graph Partitioning. Best for mesh-based partial differential equation (PDE) simulations. Represent the problem as a weighted graph: vertices = objects to be partitioned; edges = dependencies between two objects; weights = work load or amount of dependency. Partition the graph so that parts have equal vertex weight and the weight of edges cut by part boundaries is small.
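For concreteness, here is a weighted graph in the compressed sparse row (CSR) form that graph partitioners such as METIS, ParMETIS, and Zoltan's graph interface typically consume; the struct and field names are illustrative, not any particular library's API.

```c
/* Weighted graph in CSR (adjacency-list) form.  Field names are illustrative. */
typedef struct {
  int  nvtx;     /* number of vertices (objects to be partitioned)        */
  int *xadj;     /* size nvtx+1: neighbors of vertex v are                */
                 /* adjncy[xadj[v]] .. adjncy[xadj[v+1]-1]                */
  int *adjncy;   /* concatenated neighbor lists (edges = dependencies)    */
  int *vwgt;     /* size nvtx: vertex weights = computational work        */
  int *adjwgt;   /* same length as adjncy: edge weights = dependency size */
} CsrGraph;

/* Example: the triangle 0-1-2 with unit vertex and edge weights. */
static int xadj[]   = {0, 2, 4, 6};
static int adjncy[] = {1, 2, 0, 2, 0, 1};
static int vwgt[]   = {1, 1, 1};
static int adjwgt[] = {1, 1, 1, 1, 1, 1};
static CsrGraph triangle = {3, xadj, adjncy, vwgt, adjwgt};
```

The partitioner's job is then to split the vertices into parts of roughly equal total vwgt while minimizing the total adjwgt of edges whose endpoints land in different parts.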

Slide 22: Graph Partitioning: Advantages and Disadvantages. Advantages: highly successful model for mesh-based PDE problems; explicit control of communication volume gives higher partition quality than geometric methods; excellent software available. Serial: Chaco (SNL), Jostle (U. Greenwich), METIS (U. Minn.), Party (U. Paderborn), Scotch (U. Bordeaux). Parallel: Zoltan (SNL), ParMETIS (U. Minn.), PJostle (U. Greenwich). Disadvantages: more expensive than geometric methods; the edge-cut model only approximates communication volume.

Slide 23: Hypergraph Partitioning. Zoltan_Set_Param(zz, "LB_METHOD", "HYPERGRAPH"); Zoltan_Set_Param(zz, "HYPERGRAPH_PACKAGE", "ZOLTAN"); or Zoltan_Set_Param(zz, "HYPERGRAPH_PACKAGE", "PATOH"); (Alpert, Kahng, Hauck, Borriello, Çatalyürek, Aykanat, Karypis, et al.) Hypergraph model: vertices = objects to be partitioned; hyperedges = dependencies between two or more objects. Partitioning goal: assign equal vertex weight while minimizing hyperedge cut weight. Figure: graph partitioning model vs. hypergraph partitioning model.

Hypergraph partitioning (multilevel scheme). Coarsening phase (clustering). Hierarchical (simultaneous) clustering: 1. pick an unmatched vertex at random; 2. connect it to another vertex based on a selection criterion (e.g., shortest edge). Agglomerative clustering (build one cluster at a time): let N(u,c) be the connectivity of vertex u to cluster c (the number of edges connecting u to c) and W(c) the weight of cluster c (the number of vertices in it); 1. choose the cluster or singleton with the highest ratio N(u,c) / W(c); 2. choose the edge based on the selection criterion. Partitioning phase: bisect the coarsest hypergraph. Uncoarsening phase: project the coarsened, bisected graph back to the previous level; refine the bisection by running boundary FM refinement (BFM). 24
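A toy sketch of the agglomerative criterion on a plain CSR graph (the hypergraph case generalizes the connectivity count to hyperedges); the names are illustrative.

```c
/* Agglomerative coarsening step (sketch): for an unmatched vertex u, choose
   the neighboring cluster c maximizing N(u,c) / W(c), where N(u,c) counts
   edges from u into c and W(c) is the number of vertices already in c. */
typedef struct {
  int  nvtx;
  int *xadj, *adjncy;    /* CSR adjacency                             */
  int *cluster;          /* cluster[v] = current cluster of v, or -1  */
  int *cluster_weight;   /* W(c): number of vertices currently in c   */
} Coarsening;

/* conn is a scratch array of size >= number of clusters, zero-initialized. */
static int best_cluster(const Coarsening *g, int u, int *conn) {
  int best = -1;
  double best_ratio = 0.0;
  for (int e = g->xadj[u]; e < g->xadj[u + 1]; ++e) {
    int c = g->cluster[g->adjncy[e]];
    if (c < 0) continue;                 /* neighbor not yet clustered      */
    conn[c]++;                           /* running count of N(u,c)         */
    double ratio = (double)conn[c] / (double)g->cluster_weight[c];
    if (ratio > best_ratio) { best_ratio = ratio; best = c; }
  }
  for (int e = g->xadj[u]; e < g->xadj[u + 1]; ++e) {
    int c = g->cluster[g->adjncy[e]];
    if (c >= 0) conn[c] = 0;             /* reset scratch counts            */
  }
  return best;                           /* -1 means: start a new singleton */
}
```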

Hypergraph Example 25

Kernighan/Lin Algorithm. E(a) = external cost of a in A = Σ { W(a,b) for b in B }. I(a) = internal cost of a in A = Σ { W(a,a') for other a' in A }. D(a) = total cost of a in A = E(a) - I(a). Consider swapping a in A with b in B: New Partition Cost = Old Partition Cost - D(a) - D(b) + 2*W(a,b). Algorithm: compute D(n) for all nodes; repeat: unmark all nodes; while there are unmarked pairs: find the unmarked pair (a,b) maximizing the gain D(a) + D(b) - 2*W(a,b), mark a and b (but do not swap), and update D(n) for all unmarked nodes as if a and b had been swapped; then pick the sequence of marked pairs with maximum cumulative gain; if that gain > 0, perform the swaps; until gain <= 0. Worst case O(N² log N). 26
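The pseudocode above translates almost line for line into C; below is a compact sketch of one KL pass on a dense symmetric weight matrix (the two sides are assumed to have equal size). Repeating kl_pass until it returns 0 reproduces the outer repeat-until loop.

```c
#include <stdlib.h>

/* One Kernighan/Lin pass on a dense symmetric weight matrix W (n x n, row-major),
   with side[v] in {0,1}.  Returns the gain actually realized (0 if none). */
static double kl_pass(int n, const double *W, int *side) {
  double *D = malloc(n * sizeof *D);
  int *locked = calloc(n, sizeof *locked);
  int npairs = n / 2;
  int *pa = malloc(npairs * sizeof *pa), *pb = malloc(npairs * sizeof *pb);
  double *gain = malloc(npairs * sizeof *gain);

  for (int v = 0; v < n; ++v) {          /* D(v) = E(v) - I(v) */
    D[v] = 0.0;
    for (int u = 0; u < n; ++u)
      if (u != v) D[v] += (side[u] != side[v]) ? W[v*n+u] : -W[v*n+u];
  }

  for (int k = 0; k < npairs; ++k) {
    int best_a = -1, best_b = -1;
    double best_g = -1e300;
    for (int a = 0; a < n; ++a) {        /* unmarked pair maximizing the gain */
      if (locked[a] || side[a] != 0) continue;
      for (int b = 0; b < n; ++b) {
        if (locked[b] || side[b] != 1) continue;
        double g = D[a] + D[b] - 2.0 * W[a*n+b];
        if (g > best_g) { best_g = g; best_a = a; best_b = b; }
      }
    }
    if (best_a < 0) { npairs = k; break; }
    pa[k] = best_a; pb[k] = best_b; gain[k] = best_g;
    locked[best_a] = locked[best_b] = 1; /* mark, but do not swap yet */
    for (int u = 0; u < n; ++u) {        /* update D as if (a,b) were swapped */
      if (locked[u]) continue;
      double s = (side[u] == 0) ? 1.0 : -1.0;
      D[u] += 2.0 * s * (W[u*n+best_a] - W[u*n+best_b]);
    }
  }

  /* Pick the prefix of tentative swaps with the largest cumulative gain. */
  double cum = 0.0, best_cum = 0.0; int best_k = 0;
  for (int k = 0; k < npairs; ++k) {
    cum += gain[k];
    if (cum > best_cum) { best_cum = cum; best_k = k + 1; }
  }
  if (best_cum > 0.0)
    for (int k = 0; k < best_k; ++k) { side[pa[k]] = 1; side[pb[k]] = 0; }

  free(D); free(locked); free(pa); free(pb); free(gain);
  return best_cum;
}
```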

Slide 27: Hypergraph Repartitioning. Augment the hypergraph with data redistribution costs: account for the data's current processor assignments, and weight dependencies by their size and frequency of use. Partitioning then tries to minimize the total communication volume: data redistribution volume + application communication volume = total communication volume. Data redistribution volume: a callback returns data sizes: Zoltan_Set_Fn(zz, ZOLTAN_OBJ_SIZE_MULTI_FN_TYPE, myobjsizefn, 0); Application communication volume = hyperedge cuts * the number of times the communication is done between repartitionings: Zoltan_Set_Param(zz, "PHG_REPART_MULTIPLIER", "100"); Best Algorithms Paper Award at IPDPS'07: "Hypergraph-based Dynamic Load Balancing for Adaptive Scientific Computations," Çatalyürek, Boman, Devine, Bozdag, Heaphy, & Riesen.
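In symbols, the cost model on this slide, with α standing for the PHG_REPART_MULTIPLIER value (the number of communications performed between repartitionings):

```latex
\text{total communication volume}
  \;=\; \underbrace{(\text{hyperedge cut weight}) \times \alpha}_{\text{application communication}}
  \;+\; \underbrace{\text{data redistribution volume}}_{\text{migration}}
```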

Slide 28: Hypergraph Partitioning: Advantages and Disadvantages. Advantages: communication volume reduced 30-38% on average relative to graph partitioning (Catalyurek & Aykanat), with a 5-15% reduction for mesh-based applications; a more accurate communication model than graph partitioning; better representation of highly connected and/or non-homogeneous systems; greater applicability than the graph model (can represent rectangular systems and non-symmetric dependencies). Disadvantages: usually more expensive than graph partitioning.

Hypergraph results: hexagonal graph 29

Hypergraph results: asymmetric, sparse graph 30

Resource-aware Load Balancing: Heterogeneous Architectures. Clusters may contain different types of processors with varying capacities: assign a capacity weight to each processor and balance with respect to processor capacity. Hierarchical partitioning allows different partitioners at different architecture levels. 31

Dynamic Resource Utilization Model (DRUM). DRUM provides applications with aggregated information about the computation and communication capabilities of an execution environment. The tree constructed by DRUM represents a heterogeneous network: leaves represent individual computation nodes (i.e., single processors (SP) or SMPs); non-leaf nodes represent routers or switches and have an aggregate power. 32

Figure: the tree constructed by DRUM to represent a heterogeneous network. 33

Power Representation. The power of a node is computed as a weighted sum of a processing power p_n and a communication power c_n. For each node n in L_i (the set of nodes at level i) of the network, the final power is computed using pp_n, the power of the parent of node n. 34
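A plausible form of this computation (the weights w^cpu and w^comm and the normalization over sibling nodes are assumptions, not values taken from the slide):

```latex
% Per-node power: weighted sum of processing and communication power
\mathrm{power}_n = w^{\mathrm{cpu}}\, p_n + w^{\mathrm{comm}}\, c_n ,
\qquad w^{\mathrm{cpu}} + w^{\mathrm{comm}} = 1

% Hierarchical scaling: node n at level i receives a share of its parent's
% power pp_n proportional to its own power among its siblings
\mathrm{power}_n \;\leftarrow\;
  \frac{\mathrm{power}_n}{\sum_{m \in L_i,\ \mathrm{parent}(m)=\mathrm{parent}(n)} \mathrm{power}_m}\; pp_n
```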

Slide 35: For More Information... Zoltan Home Page: http://www.cs.sandia.gov/zoltan (User's and Developer's Guides; download the Zoltan software under the GNU LGPL). Email: {kddevin,ccheval,egboman}@sandia.gov, umit@bmi.osu.edu

Questions and Comments