Dynamic load balancing in OSIRIS

R. A. Fonseca¹,²
¹ GoLP/IPFN, Instituto Superior Técnico, Lisboa, Portugal
² DCTI, ISCTE - Instituto Universitário de Lisboa, Portugal

Maintaining parallel load balance is crucial
- Full scale 3D modeling of relevant scenarios requires scalability to a large number of cores
- Code performance must be sustained throughout the entire simulation; the overall performance is limited by the slowest node
- Simulation time is dominated by particle calculations, and some nodes may have more particles than others
- If the distribution of particles remains approximately constant throughout the simulation we could tackle this at initialization (static load balancing); however, the distribution will usually depend on the dynamics of the simulation
- Shared memory parallelism can help mitigate the problem: a particle domain decomposition inside the shared memory region smears out localized computational load peaks
- Dynamic load balancing is required for optimal efficiency

[Figure: LWFA simulation with static load balancing; parallel partition 94 x 24 x 24 = 55k cores; load imbalance (max/avg load) 9.04; average performance ~12% of peak; particles/node at iz = 12]

Adjusting processor load dynamically
The code can change node boundaries dynamically to attempt to maintain an even load across nodes: determine the best possible partition from the current particle distribution, then reshape the simulation. The procedure uses a minimal number of messages; only the first communication depends on the number of nodes, scaling as log2(# nodes), and the reshape step only communicates with overlapping nodes.

Algorithm outline:
1. Gather the integral of the load (global communication, log2(# nodes))
2. If the load imbalance is below a threshold, the partition is still balanced: do nothing
3. Otherwise, compute the best partition
4. If the improvement is not significant, keep the current partition
5. Otherwise, reshape the simulation (communication with overlapping nodes only)

[Figure: node boundaries adjusted across 4 processes (process 0-3, 4 cores each), giving an even particle load across processes]
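The decision logic in the outline above can be sketched in a few lines. This is a minimal Python illustration, not OSIRIS code: the threshold, the minimum-gain criterion and the best_partition callable (which operates on the integral load, as sketched two slides below) are assumed placeholders.

```python
import numpy as np

def imbalance(load, bounds):
    """Max/avg load of a 1D partition; `load` is a NumPy array with the load per
    cell along the balance direction, node i owns cells [bounds[i], bounds[i+1])."""
    per_node = [load[bounds[i]:bounds[i + 1]].sum() for i in range(len(bounds) - 1)]
    return max(per_node) / (sum(per_node) / len(per_node))

def maybe_rebalance(load, bounds, best_partition, threshold=1.2, min_gain=0.05):
    """Move node boundaries only when the gain justifies the cost of a reshape."""
    if imbalance(load, bounds) <= threshold:
        return bounds, False                              # still balanced: do nothing
    # best partition for the current load (takes the cumulative/integral load)
    new_bounds = best_partition(np.cumsum(load), len(bounds) - 1)
    if imbalance(load, bounds) - imbalance(load, new_bounds) < min_gain:
        return bounds, False                              # improvement too small to justify a reshape
    return new_bounds, True                               # caller then reshapes grids and particles
```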

Calculating the cumulative load along the load balance direction
The algorithm for calculating the best partition requires the cumulative computational load, which is essentially a scan operation. It is implemented as a reduce-all operation followed by local calculations:
- Get the load along the load balancing direction on each node
- Add the load from all nodes and distribute the result to all nodes (MPI_ALLREDUCE, a few kB)
- Integrate the load along the load balancing direction locally
All nodes know the result, which simplifies the best partition calculations. The reduce-all is the main overhead (~90%), but it scales well for a large number of nodes, as log2(# nodes).

[Figure: load per node (node 0-3) summed over all nodes; the global integral load is known by all nodes]
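A sketch of this reduce-all plus local integration step using mpi4py and NumPy; the number of cells, the slab ownership and the per-cell load below are dummy data chosen for illustration, not taken from OSIRIS.

```python
import numpy as np
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()
n_cells = 1024                                   # cells along the load-balance direction

# Load measured locally: zero outside the slab this rank owns (dummy values here).
local_load = np.zeros(n_cells, dtype='f8')
lo, hi = rank * n_cells // size, (rank + 1) * n_cells // size
local_load[lo:hi] = np.random.default_rng(rank).poisson(64.0, hi - lo)

# Single collective: every rank receives the summed load profile (a few kB).
global_load = np.empty_like(local_load)
comm.Allreduce(local_load, global_load, op=MPI.SUM)

# Local scan: the cumulative (integral) load is now known by every rank,
# so the best-partition search needs no further communication.
cumulative_load = np.cumsum(global_load)
```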

Finding the best partition
Algorithm for 1D partitions (binary search):
- Only the 1D integral load is required (less communication)
- Find the point where the integral load is halfway between the endpoints
- Repeat with the two resulting intervals until all node boundaries are calculated
- The same calculation is done by all nodes, so no extra communication is needed
Properties:
- Fast algorithm, yields a well balanced distribution
- Handles impossible scenarios well: when no perfectly balanced partition exists, it finds the best available one
- Can handle extra constraints, e.g. a minimum number of cells per node, or cells per node being an integer multiple of some quantity

[Figure: cumulative load curve with the successive bisection points]
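A minimal sketch of this recursive bisection on the cumulative load, in Python. The weighting for uneven node counts and the one-cell-per-node clamp are simplifications of my own; they are not necessarily how OSIRIS implements the extra constraints mentioned above.

```python
import numpy as np

def best_partition(cumulative_load, n_nodes):
    """Recursive bisection: each boundary is placed where the integral load is
    halfway between the endpoints of the current interval (1D sketch)."""
    bounds = np.empty(n_nodes + 1, dtype=int)
    bounds[0], bounds[-1] = 0, len(cumulative_load)

    def split(i0, i1, n0, n1):
        if n1 - n0 < 2:
            return
        mid = (n0 + n1) // 2
        lo = cumulative_load[i0 - 1] if i0 > 0 else 0.0
        hi = cumulative_load[i1 - 1]
        # target integral load for the left group of nodes (half, for even splits)
        target = lo + (hi - lo) * (mid - n0) / (n1 - n0)
        cut = i0 + int(np.searchsorted(cumulative_load[i0:i1], target))
        cut = min(max(cut, i0 + 1), i1 - 1)    # keep at least one cell per node
        bounds[mid] = cut
        split(i0, cut, n0, mid)
        split(cut, i1, mid, n1)

    split(0, len(cumulative_load), 0, n_nodes)
    return bounds
```

With the cumulative_load from the previous step, `bounds = best_partition(cumulative_load, size)` gives a partition in which node i owns cells [bounds[i], bounds[i+1]), and every node obtains the same answer without additional communication.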

Partition reshaping
All nodes know both the old and the new partition. For each node:
- Detect the overlap between the local node on the old partition and other nodes on the new partition
- Detect the overlap between the local node on the new partition and other nodes on the old partition
- Send data to / receive data from the appropriate nodes
For particles: remove leaving particles from / add arriving particles to the particle buffer.
For grids: allocate a new grid with the new shape, copy in values from the old grid and data received from other nodes, then free the old grid.

[Figure: communication pattern for node 1 going from the old partition (nodes 0-2) to the new partition: send data to node 0, get data from node 2]
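The overlap bookkeeping can be sketched as below (1D, cell indices, same boundary convention as the best_partition sketch above); the send/receive and buffer manipulation are only outlined in comments, since they depend on the code's data structures.

```python
def overlaps(my_lo, my_hi, bounds):
    """Which nodes of partition `bounds` overlap the cell range [my_lo, my_hi)?"""
    out = []
    for node in range(len(bounds) - 1):
        lo, hi = max(my_lo, bounds[node]), min(my_hi, bounds[node + 1])
        if lo < hi:
            out.append((node, lo, hi))         # overlapping node and shared cell range
    return out

# For each node:
#   send_list = overlaps(old_lo, old_hi, new_bounds)  # parts of my old slab now owned by others
#   recv_list = overlaps(new_lo, new_hi, old_bounds)  # parts of my new slab previously owned by others
# Grids: allocate the new-shape array, copy the locally kept region, receive the rest.
# Particles: remove leaving particles from the buffer, append the arriving ones.
```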

Including cells in the computational load
Cell calculations (field solve + grid filtering) can be accounted for by including an effective cell weight in the load estimate. In this example each cell is considered to have the same computational load as α = 1.5 particles.

[Figure: node boundaries for a load estimate considering only particles per vertical slice vs. particles + cells with α = 1.5]
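As a simple illustration of this weighting (not OSIRIS's internal routine), the load per slice along the balance direction can be estimated as particles plus α times the number of cells; the value α = 1.5 is the one used in the example above and in practice depends on the machine and the field solver.

```python
import numpy as np

def load_per_slice(particles_per_slice, cells_per_slice, alpha=1.5):
    """Load estimate per slice along the load-balance direction:
    particles plus an effective weight alpha per cell (field solve + filtering)."""
    return np.asarray(particles_per_slice) + alpha * np.asarray(cells_per_slice)
```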

Hydrodynamic nano plasma explosion
Simulation setup:
- 2D Cartesian geometry, 3000 iterations
- Grid: 1024² cells, 102.4² (c/ωp)²
- Particles: electrons + positrons (2 species), 8² particles/species/cell
- Generalized thermal velocity u_th = 0.1 c, initial radius 12.8 c/ωp
Results:
- The algorithm maintains the best possible load balance
- Repartitioning the simulation is a time consuming task, so don't do it too often: n_loadbalance ~ 64 iterations yields good results
- Including particles & grids in the load estimate is ~50% better than particles only
- Performance boost of ~2x for this problem; other scenarios can be as high as 4x

[Figures: min./max. number of particles per node vs. iteration (0-3000) for fixed boundaries and dynamic load balance; total simulation time [s] vs. iterations between load balance (8-128) for no dynamic load balance, particles only, and particles and grids]

Multi-dimensional dynamic load balancing
This is a difficult task:
- A single best-partition solution exists only for 1D parallel partitions
- Improving the load balance in one direction might give worse results in another direction
Multidimensional partitions:
- Assume a separable load function, i.e. the load is a product f_x(x) · f_y(y) · f_z(z)
- Each partition direction then becomes independent
- Transverse directions are accounted for by using the max / sum value across the perpendicular partition

[Figure: node boundaries at t = 0 and t > 0]
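A small sketch of the separability assumption: reduce the full cell-load array to one 1D profile per direction, each of which is then balanced independently with the 1D algorithm. The function name and the `use_max` switch (the "max across the perpendicular partition" option mentioned above) are illustrative, not OSIRIS identifiers.

```python
import numpy as np

def axis_loads(load_3d, use_max=False):
    """Project a 3D cell-load array onto each axis, assuming a separable load
    f_x(x)*f_y(y)*f_z(z); use_max highlights hot spots instead of summing."""
    reduce_ = np.max if use_max else np.sum
    return [reduce_(load_3d, axis=tuple(a for a in range(3) if a != d))
            for d in range(3)]
```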

Large scale LWFA run: close but no cigar
The ASCR problems are very difficult to load balance, since the problem size per node is very small.
- Choosing partitions with fewer nodes along the propagation direction degrades the imbalance significantly, and there is not enough room to dynamically load balance along the propagation direction
- Dynamic load balancing in the transverse directions does not yield big improvements: the distributions are very different from slice to slice, and load balancing in just one direction does not improve the imbalance
- Using the max node load helps highlight the hot spots for the algorithm
Best result: dynamic load balancing along x2 and x3 using the max load gives a > 30% improvement in imbalance, but no overall speedup; it led to a 5% overall slowdown.

[Figure: laser pulse and x1-x2 slice at the box center, with a similar partition along x3]

Alternative: patch based load balancing
Implemented in PSC / SMILEI.
- Partition the space into (10-100x) more domains (patches) than processing elements (PEs)
- Dynamically assign patches to PEs:
  - Assign a similar load to each PE
  - Attempt to keep neighboring patches on the same PE

[Figures: LWFA simulation comparing a standard MPI partition with a patch based partitioning, and a patch assignment example with 7 PEs. From Germaschewski et al.: traditional domain decomposition vs. patch-based load balancing for a density increased across the diagonal, showing plasma density, computational load per decomposition unit, and aggregate load per MPI process. From Beck et al.: a 32 x 32 patch decomposition among 7 MPI processes, with MPI domains following a Hilbert curve]

Germaschewski et al., JCP 318, 305 (2016); Beck et al., NIMPR-A 829, 418 (2016)
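A toy sketch of the assignment idea, not the PSC/SMILEI implementation: assuming the patches have already been ordered along a space-filling curve (SMILEI uses a Hilbert curve), the ordered list is cut into contiguous chunks of roughly equal load, which naturally keeps neighboring patches on the same PE.

```python
import numpy as np

def assign_patches(patch_load_in_curve_order, n_pe):
    """Cut the curve-ordered patch list into n_pe contiguous chunks of roughly
    equal load; returns owner[i] = PE of the i-th patch along the curve."""
    load = np.asarray(patch_load_in_curve_order, dtype='f8')
    cum = np.cumsum(load)
    targets = cum[-1] * np.arange(1, n_pe) / n_pe   # ideal cumulative load at each cut
    cuts = np.searchsorted(cum, targets)            # patch index where each PE ends
    owner = np.zeros(len(load), dtype=int)
    for pe, cut in enumerate(cuts, start=1):
        owner[cut:] = pe                             # later patches belong to later PEs
    return owner
```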

Overview
- Maintaining load balance is crucial for large scale simulations: the lowest performing node dominates, and the particle load per PE evolves dynamically throughout the simulation
- Shared memory parallelism helps smear out load spikes but does not solve the problem
- Dynamic load balancing is implemented by moving node boundaries: it works well for many problems, but is difficult to extend to multi-dimensional partitions and unable to handle difficult LWFA scenarios
- Patch based dynamic load balancing appears to be a viable solution: it requires moving to a tiling scheme to organize the data, which is a lot of work but has other advantages, e.g. improved data locality and similarity with GPU code; current implementations rely on MPI_THREAD_MULTIPLE