Communication and Topology-aware Load Balancing in Charm++ with TreeMatch
1 Communication and Topology-aware Load Balancing in Charm++ with TreeMatch
Joint Lab 10th Workshop (IEEE Cluster 2013, Indianapolis, IN)
Emmanuel Jeannot, Esteban Meneses-Rojas, Guillaume Mercier, François Tessier, Gengbin Zheng
November 27, 2013
Emmanuel Jeannot - Communication-aware load balancing - 1 / 25
2 Introduction
Scalable execution of parallel applications:
- The number of cores is increasing, but the memory per core is decreasing.
- Applications will need to communicate even more than they do now.
Issue: process placement should take process affinity into account.
Here: load balancing in Charm++ taking into account:
- load
- affinity
- topology
- migration cost (transfer time)
3 Outline
1 Introduction
2 Problem and models
3 Load balancing for compute-bound applications
4 Load balancing for communication-bound applications
5 Conclusion
5 Charm++ Features
- Parallel object-oriented programming language based on C++.
- Programs are decomposed into a number of cooperating message-driven objects called chares.
- In general there are more chares than processing units.
- Chares are mapped to physical processors by an adaptive runtime system.
- Load balancers can be called to migrate chares.
- Chare placement and load balancing are transparent to the programmer.
6 Chare/Process Placement - Why we should consider it
- Many current and future parallel platforms have several levels of hierarchy (cache hierarchy, memory bus, high-performance network...).
- Application chares/processes do not exchange the same amount of data (affinity).
- The process placement policy may therefore impact performance.
- In this work we deal with tree topologies only.
[Figure: a tree topology, from the switch down through cabinets, nodes, and processors to cores]
7 Problems
Given:
- the parallel machine topology
- the application communication pattern
Map application processes/chares to physical resources (cores) to reduce communication costs.
[Figure: communication matrix zeus16.map, sender rank vs. receiver rank]
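The mapping problem above can be made concrete with a small cost model: weight each pair's traffic by the distance its two cores are apart in the topology tree. This is an illustrative sketch in Python (not TreeMatch's actual algorithm); the `arities` tuple describing the tree fan-out is a hypothetical example shape.

```python
from itertools import product

def core_path(core_id, arities):
    """Path of a core through the topology tree.
    `arities` lists the fan-out at each level, e.g. (2, 2) for
    2 nodes x 2 cores (a hypothetical shape)."""
    path = []
    for a in reversed(arities):
        path.append(core_id % a)
        core_id //= a
    return list(reversed(path))

def hop_distance(i, j, arities):
    """Number of tree levels cores i and j do NOT share (0 = same core)."""
    pi, pj = core_path(i, arities), core_path(j, arities)
    for level in range(len(arities)):
        if pi[level] != pj[level]:
            return len(arities) - level
    return 0

def mapping_cost(comm, sigma, arities):
    """Total message volume weighted by topology distance.
    comm[i][j]: traffic between processes i and j
    sigma[i]:   core on which process i is placed"""
    n = len(comm)
    return sum(comm[i][j] * hop_distance(sigma[i], sigma[j], arities)
               for i, j in product(range(n), range(n)) if i != j)
```

With two chatty pairs (0,1) and (2,3) on a 2-node x 2-core tree, keeping each pair on one node gives a lower cost than splitting them across nodes, which is exactly the objective a topology-aware mapper minimizes.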
8 TreeMatch - The TreeMatch Algorithm
Algorithm and environment to compute process placement based on process affinities and the NUMA topology.
Input:
- The communication pattern of the application: a preliminary execution with a monitored MPI implementation for static placement, or dynamic recording on iterative applications with Charm++.
- A model (tree) of the underlying architecture: hwloc can provide this.
Output:
- A process permutation σ such that σ_i is the core number on which process i has to be bound.
10 Example
σ = (0, 2, 8, 10, 4, 6, 12, 14, 1, 3, 9, 11, 5, 7, 13, 15)
[Figure: communication matrices example16.mat and example16_treematch.mat, sender rank vs. receiver rank]
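The effect of σ on the communication matrix can be visualized by permuting the matrix into its core-level view, as the two heatmaps on this slide do. A minimal sketch, using the σ from the example (the `permute_matrix` helper is hypothetical, not part of TreeMatch):

```python
def permute_matrix(comm, sigma):
    """Core-level view of a process communication matrix.
    sigma[i] is the core assigned to process i (TreeMatch's output),
    so cores sigma[i] and sigma[j] exchange comm[i][j]."""
    n = len(comm)
    out = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            out[sigma[i]][sigma[j]] = comm[i][j]
    return out

# The permutation from the slide's example
sigma = [0, 2, 8, 10, 4, 6, 12, 14, 1, 3, 9, 11, 5, 7, 13, 15]
```

For instance, processes 0 and 8 are mapped to cores 0 and 1, so any traffic between them moves next to the diagonal of the permuted matrix, i.e. onto neighboring cores.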
11 TreeMatch vs. Existing Solutions
Graph partitioners:
- parallel Scotch
- (Par)Metis
Other static algorithms:
- [Träff 02]: placement through graph embedding and graph partitioning
- MPIPP [Chen et al. 2006]: placement through local exchange of processes
- LibTopoMap [Hoefler & Snir 11]: placement through a network model + graph partitioning (ParMetis)
Other topology-aware load-balancing algorithms:
- NUCOLB [L. L. Pilla et al. 2012]: shared-memory machines
- HwTopoLB [L. L. Pilla et al. 2012]
All these solutions require quantitative information about the network and the communication durations. TreeMatch requires only qualitative information about the topology (its structure).
12 Load Balancing
Principle:
- Iterative applications: the load balancer is called at regular intervals.
- Chares are migrated in order to optimize several criteria.
- The Charm++ runtime system provides the chares' load, the chares' affinity, etc.
Constraints:
- dealing with complex modern architectures
- taking into account communications between elements
- the cost of migrations
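For the pure load-levelling side of the problem, the classic baseline (the idea behind Charm++'s GreedyLB, which later slides compare against) is to hand the heaviest chares first to the currently least-loaded core. An illustrative sketch, not the actual Charm++ implementation:

```python
import heapq

def greedy_balance(loads, n_cores):
    """Assign each chare to the currently least-loaded core,
    heaviest chares first. loads[c] is the measured load of chare c;
    returns {chare: core}."""
    heap = [(0.0, core) for core in range(n_cores)]
    heapq.heapify(heap)
    placement = {}
    for chare in sorted(range(len(loads)), key=lambda c: -loads[c]):
        core_load, core = heapq.heappop(heap)
        placement[chare] = core
        heapq.heappush(heap, (core_load + loads[chare], core))
    return placement
```

This levels CPU load well but ignores affinity, topology, and migration cost, which is precisely what the TreeMatch-based balancers add.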
13 What about Charm++? Not so easy... several issues are raised:
- scalability of TreeMatch
- the need to find a relevant compromise between process affinities and load balancing (compute-bound vs. communication-bound applications)
- the impact of chare migrations
- the load balancing time itself
The next slides present two load balancers relying on TreeMatch:
- Compute-bound applications: TMLB_Min_Weight, which applies communication-aware load balancing while favoring CPU load levelling and minimizing migrations.
- Communication-bound applications: TMLB_TreeBased, which performs parallel communication-aware load balancing, giving priority to minimizing the communication cost.
15 Strategy for Charm++ - TMLB_Min_Weight
- Applies TreeMatch on all chares (fake topology: #leaves = #chares).
- Binds chares according to their load, levelling on the less loaded chares: sort each part by CPU load, then combine chare placement and load balancing into groups of chares.
- Uses the Hungarian algorithm to minimize group-of-chares migrations (minimum-weight matching).
[Figure: animation showing chares sorted by CPU load and assembled into groups]
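The final step of TMLB_Min_Weight, matching groups of chares to cores so that as few chares as possible move, can be sketched as follows. The slide names the Hungarian algorithm; this sketch brute-forces the same minimum-weight matching for clarity (fine for a handful of cores, and the helper name is hypothetical):

```python
from itertools import permutations

def min_migration_assignment(groups, current_core):
    """Choose which chare group goes to which core so that the fewest
    chares migrate.
    groups[g]       -> list of chare ids forming group g
    current_core[c] -> core where chare c currently lives
    Returns a tuple assign where group g goes to core assign[g]."""
    n = len(groups)
    def migrations(assign):
        return sum(1 for g, core in enumerate(assign)
                   for chare in groups[g]
                   if current_core[chare] != core)
    return min(permutations(range(n)), key=migrations)
```

With groups that already sit together on some core, the matching simply sends each group back where most of its chares are, yielding zero migrations in the best case.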
18 Results - LeanMD
- Molecular dynamics application: massive imbalance, few communications.
- Experiments on 8 nodes with 8 cores each (Intel Xeon 5550).
[Figure: execution time (in seconds) of LeanMD on 64 cores vs. particles per cell, for Baseline, GreedyLB, RefineLB, and TMLB_Min_Weight]
19 Results - LeanMD Migrations
Compared to TMLB_Min_Weight without migration minimization:
- execution time up to 5% better
- around 200 fewer migrations
[Figure: number of migrated chares in LeanMD (960 chares, 64 cores) vs. particles per cell, for GreedyLB, RefineLB, and TMLB_Min_Weight]
21 Strategy for Charm++ - TMLB_TreeBased
1st step: applies TreeMatch while considering groups of chares on cores.
2nd step: reorders the chares inside each node, each node in parallel:
- defines the subtree
- creates a fake topology with as many leaves as the number of chares, plus some extra (constraints)
- applies TreeMatch on this topology and the chares' communication pattern
- binds chares according to their load (levelling on the less loaded chares)
[Figure: animation showing groups of chares assigned to cores by CPU load, then reordered within each node]
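The two-phase structure described above can be sketched end to end: first pack chatty chares onto the same node, then, independently per node, order that node's chares on its cores. This is an illustrative sketch only, with greedy packing and brute force standing in for TreeMatch at both levels, and it assumes one chare per core:

```python
from itertools import permutations

def two_step_map(comm, n_nodes, cores_per_node):
    """Two-phase hierarchical mapping (illustrative sketch).
    comm[i][j] is the traffic between chares i and j.
    Returns {chare: core}, with cores numbered node-major."""
    n = len(comm)
    remaining = set(range(n))
    nodes = []
    for _ in range(n_nodes):
        # Phase 1: seed each node with the chattiest remaining chare,
        # then grow the group towards its strongest partners.
        seed = max(sorted(remaining),
                   key=lambda c: sum(comm[c][d] for d in remaining))
        group = [seed]
        remaining.remove(seed)
        while len(group) < cores_per_node:
            nxt = max(sorted(remaining),
                      key=lambda c: sum(comm[c][g] for g in group))
            group.append(nxt)
            remaining.remove(nxt)
        nodes.append(group)
    # Phase 2: each node independently (this loop is what TMLB_TreeBased
    # runs in parallel) orders its chares so heavy pairs sit on
    # adjacent cores.
    placement = {}
    for node_id, group in enumerate(nodes):
        order = max(permutations(group),
                    key=lambda p: sum(comm[p[k]][p[k + 1]]
                                      for k in range(len(p) - 1)))
        for slot, chare in enumerate(order):
            placement[chare] = node_id * cores_per_node + slot
    return placement
```

Because phase 2 only touches chares within one node, the per-node work needs no coordination, which is what makes the parallel version of the balancer natural.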
28 Results - kneighbor
- Benchmark application designed to simulate intensive communication between processes.
- Experiments on 8 nodes with 8 cores each (Intel Xeon 5550).
- Particularly compared to RefineCommLB, which takes both load and communication into account and minimizes migrations.
[Figure: execution time (in seconds) of kneighbor on 64 cores with 64, 128, and 256 elements, 1 MB message size, for DummyLB, GreedyCommLB, GreedyLB, RefineCommLB, and TMLB_TreeBased]
29 Results - Impact on Communication
Evolution of communications over ten iterations:
[Figure: communication over 10 iterations without any load balancing strategy vs. after the first call of TreeMatchLB (in thousands of messages sent)]
30 Results - Stencil3D
- Three-dimensional stencil with regular communication with fixed neighbors.
- One chare per core: balancing considers only communications.
- Only one load balancing step, after 10 iterations.
- Experiments on 8 nodes with 8 cores each (Intel Xeon 5550).
[Figure: execution time (in seconds) of Stencil3D on 64 cores with 64 elements, for DummyLB, GreedyCommLB, GreedyLB, RefineCommLB, and TMLB_TreeBased]
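The fixed, regular pattern of a 3D stencil is easy to reconstruct for experimentation: each chare exchanges messages with its six face neighbors on the grid. A small sketch (the `stencil3d_pattern` helper is hypothetical, not part of the benchmark):

```python
def stencil3d_pattern(nx, ny, nz):
    """Unit-weight communication matrix of a 3D face-neighbor stencil
    on an nx x ny x nz grid of chares."""
    n = nx * ny * nz
    idx = lambda x, y, z: (x * ny + y) * nz + z
    comm = [[0] * n for _ in range(n)]
    for x in range(nx):
        for y in range(ny):
            for z in range(nz):
                for dx, dy, dz in ((1, 0, 0), (-1, 0, 0), (0, 1, 0),
                                   (0, -1, 0), (0, 0, 1), (0, 0, -1)):
                    X, Y, Z = x + dx, y + dy, z + dz
                    if 0 <= X < nx and 0 <= Y < ny and 0 <= Z < nz:
                        comm[idx(x, y, z)][idx(X, Y, Z)] = 1
    return comm
```

Such a matrix is a convenient input for a communication-only balancing step like the one evaluated on this slide, since every chare's load is identical.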
31 Results - What about the load balancing time?
- The load balancing time grows linearly as the number of chares is doubled.
- TMLB_TreeBased is clearly slower than the other strategies, but the parallel version is almost implemented.
[Figure: load balancing time (in ms) of the different strategies vs. number of chares, running on 64 cores, for the kneighbor application: DummyLB, GreedyCommLB, GreedyLB, RefineCommLB, TMLB_TreeBased]
33 Conclusion and Future Work
Conclusion:
- The topology is not flat, and process affinities are not homogeneous: taking this information into account when mapping chares yields improvements.
- We need to distinguish between compute-bound and communication-bound applications.
- Several criteria are taken into account: affinity, topology, load, migration cost, etc.
Future work:
- Find a better way to gather the topology (hwloc?).
- Distribute and parallelize TMLB_TreeBased across the different nodes (work in progress with the PPL).
- Make TMLB_TreeBased more scalable for large clusters: allow choosing the level in the hierarchy at which the algorithm is distributed.
- Hybrid architectures? Intel MIC?
- Continue the collaboration between Inria and the PPL.
34 The End
Thanks for your attention! Any questions?
More informationStrong Scaling for Molecular Dynamics Applications
Strong Scaling for Molecular Dynamics Applications Scaling and Molecular Dynamics Strong Scaling How does solution time vary with the number of processors for a fixed problem size Classical Molecular Dynamics
More informationUNDERSTANDING THE IMPACT OF MULTI-CORE ARCHITECTURE IN CLUSTER COMPUTING: A CASE STUDY WITH INTEL DUAL-CORE SYSTEM
UNDERSTANDING THE IMPACT OF MULTI-CORE ARCHITECTURE IN CLUSTER COMPUTING: A CASE STUDY WITH INTEL DUAL-CORE SYSTEM Sweety Sen, Sonali Samanta B.Tech, Information Technology, Dronacharya College of Engineering,
More informationA Study of High Performance Computing and the Cray SV1 Supercomputer. Michael Sullivan TJHSST Class of 2004
A Study of High Performance Computing and the Cray SV1 Supercomputer Michael Sullivan TJHSST Class of 2004 June 2004 0.1 Introduction A supercomputer is a device for turning compute-bound problems into
More informationMore Data Locality for Static Control Programs on NUMA Architectures
More Data Locality for Static Control Programs on NUMA Architectures Adilla Susungi 1, Albert Cohen 2, Claude Tadonki 1 1 MINES ParisTech, PSL Research University 2 Inria and DI, Ecole Normale Supérieure
More informationParallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)
Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication
More informationTechniques to improve the scalability of Checkpoint-Restart
Techniques to improve the scalability of Checkpoint-Restart Bogdan Nicolae Exascale Systems Group IBM Research Ireland 1 Outline A few words about the lab and team Challenges of Exascale A case for Checkpoint-Restart
More informationParallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs
Parallel Direct Simulation Monte Carlo Computation Using CUDA on GPUs C.-C. Su a, C.-W. Hsieh b, M. R. Smith b, M. C. Jermy c and J.-S. Wu a a Department of Mechanical Engineering, National Chiao Tung
More informationDISTRIBUTED COMPUTER SYSTEMS ARCHITECTURES
DISTRIBUTED COMPUTER SYSTEMS ARCHITECTURES Dr. Jack Lange Computer Science Department University of Pittsburgh Fall 2015 Outline System Architectural Design Issues Centralized Architectures Application
More informationRecent Advances in Heterogeneous Computing using Charm++
Recent Advances in Heterogeneous Computing using Charm++ Jaemin Choi, Michael Robson Parallel Programming Laboratory University of Illinois Urbana-Champaign April 12, 2018 1 / 24 Heterogeneous Computing
More informationParallel 3D Sweep Kernel with PaRSEC
Parallel 3D Sweep Kernel with PaRSEC Salli Moustafa Mathieu Faverge Laurent Plagne Pierre Ramet 1 st International Workshop on HPC-CFD in Energy/Transport Domains August 22, 2014 Overview 1. Cartesian
More informationPerformance of the 3D-Combustion Simulation Code RECOM-AIOLOS on IBM POWER8 Architecture. Alexander Berreth. Markus Bühler, Benedikt Anlauf
PADC Anual Workshop 20 Performance of the 3D-Combustion Simulation Code RECOM-AIOLOS on IBM POWER8 Architecture Alexander Berreth RECOM Services GmbH, Stuttgart Markus Bühler, Benedikt Anlauf IBM Deutschland
More informationDistributed Data Infrastructures, Fall 2017, Chapter 2. Jussi Kangasharju
Distributed Data Infrastructures, Fall 2017, Chapter 2 Jussi Kangasharju Chapter Outline Warehouse-scale computing overview Workloads and software infrastructure Failures and repairs Note: Term Warehouse-scale
More informationCapability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL
SABELA RAMOS, TORSTEN HOEFLER Capability Models for Manycore Memory Systems: A Case-Study with Xeon Phi KNL spcl.inf.ethz.ch Microarchitectures are becoming more and more complex CPU L1 CPU L1 CPU L1 CPU
More informationScalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance
Scalable Replay with Partial-Order Dependencies for Message-Logging Fault Tolerance Jonathan Lifflander*, Esteban Meneses, Harshitha Menon*, Phil Miller*, Sriram Krishnamoorthy, Laxmikant V. Kale* jliffl2@illinois.edu,
More informationParallel FFT Program Optimizations on Heterogeneous Computers
Parallel FFT Program Optimizations on Heterogeneous Computers Shuo Chen, Xiaoming Li Department of Electrical and Computer Engineering University of Delaware, Newark, DE 19716 Outline Part I: A Hybrid
More informationAdvanced Job Launching. mapping applications to hardware
Advanced Job Launching mapping applications to hardware A Quick Recap - Glossary of terms Hardware This terminology is used to cover hardware from multiple vendors Socket The hardware you can touch and
More informationBenchmark results on Knight Landing architecture
Benchmark results on Knight Landing architecture Domenico Guida, CINECA SCAI (Bologna) Giorgio Amati, CINECA SCAI (Roma) Milano, 21/04/2017 KNL vs BDW A1 BDW A2 KNL cores per node 2 x 18 @2.3 GHz 1 x 68
More informationMulti-core Computing Lecture 2
Multi-core Computing Lecture 2 MADALGO Summer School 2012 Algorithms for Modern Parallel and Distributed Models Phillip B. Gibbons Intel Labs Pittsburgh August 21, 2012 Multi-core Computing Lectures: Progress-to-date
More informationShared memory parallel algorithms in Scotch 6
Shared memory parallel algorithms in Scotch 6 François Pellegrini EQUIPE PROJET BACCHUS Bordeaux Sud-Ouest 29/05/2012 Outline of the talk Context Why shared-memory parallelism in Scotch? How to implement
More informationAdaptive MPI Multirail Tuning for Non-Uniform Input/Output Access
Adaptive MPI Multirail Tuning for Non-Uniform Input/Output Access S. Moreaud, B. Goglin and R. Namyst INRIA Runtime team-project University of Bordeaux, France Context Multicore architectures everywhere
More informationIFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor
IFS RAPS14 benchmark on 2 nd generation Intel Xeon Phi processor D.Sc. Mikko Byckling 17th Workshop on High Performance Computing in Meteorology October 24 th 2016, Reading, UK Legal Disclaimer & Optimization
More informationACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS
ACCELERATING THE PRODUCTION OF SYNTHETIC SEISMOGRAMS BY A MULTICORE PROCESSOR CLUSTER WITH MULTIPLE GPUS Ferdinando Alessi Annalisa Massini Roberto Basili INGV Introduction The simulation of wave propagation
More informationDynamic load balancing in OSIRIS
Dynamic load balancing in OSIRIS R. A. Fonseca 1,2 1 GoLP/IPFN, Instituto Superior Técnico, Lisboa, Portugal 2 DCTI, ISCTE-Instituto Universitário de Lisboa, Portugal Maintaining parallel load balance
More informationBei Wang, Dmitry Prohorov and Carlos Rosales
Bei Wang, Dmitry Prohorov and Carlos Rosales Aspects of Application Performance What are the Aspects of Performance Intel Hardware Features Omni-Path Architecture MCDRAM 3D XPoint Many-core Xeon Phi AVX-512
More informationUsing Charm++ to Support Multiscale Multiphysics
LA-UR-17-23218 Using Charm++ to Support Multiscale Multiphysics On the Trinity Supercomputer Robert Pavel, Christoph Junghans, Susan M. Mniszewski, Timothy C. Germann April, 18 th 2017 Operated by Los
More informationAutomatic NUMA Balancing. Rik van Riel, Principal Software Engineer, Red Hat Vinod Chegu, Master Technologist, HP
Automatic NUMA Balancing Rik van Riel, Principal Software Engineer, Red Hat Vinod Chegu, Master Technologist, HP Automatic NUMA Balancing Agenda What is NUMA, anyway? Automatic NUMA balancing internals
More informationScalable Interaction with Parallel Applications
Scalable Interaction with Parallel Applications Filippo Gioachin Chee Wai Lee Laxmikant V. Kalé Department of Computer Science University of Illinois at Urbana-Champaign Outline Overview Case Studies Charm++
More informationNAMD Performance Benchmark and Profiling. February 2012
NAMD Performance Benchmark and Profiling February 2012 Note The following research was performed under the HPC Advisory Council activities Participating vendors: AMD, Dell, Mellanox Compute resource -
More informationIntroduction to I/O Efficient Algorithms (External Memory Model)
Introduction to I/O Efficient Algorithms (External Memory Model) Jeff M. Phillips August 30, 2013 Von Neumann Architecture Model: CPU and Memory Read, Write, Operations (+,,,...) constant time polynomially
More informationQuantifying Load Imbalance on Virtualized Enterprise Servers
Quantifying Load Imbalance on Virtualized Enterprise Servers Emmanuel Arzuaga and David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston MA 1 Traditional Data Centers
More informationAccelerated Load Balancing of Unstructured Meshes
Accelerated Load Balancing of Unstructured Meshes Gerrett Diamond, Lucas Davis, and Cameron W. Smith Abstract Unstructured mesh applications running on large, parallel, distributed memory systems require
More informationIntroduction to Runtime Systems
Introduction to Runtime Systems Towards Portability of Performance ST RM Static Optimizations Runtime Methods Team Storm Olivier Aumage Inria LaBRI, in cooperation with La Maison de la Simulation Contents
More informationVincent C. Betro, R. Glenn Brook, & Ryan C. Hulguin XSEDE Xtreme Scaling Workshop Chicago, IL July 15-16, 2012
Vincent C. Betro, R. Glenn Brook, & Ryan C. Hulguin XSEDE Xtreme Scaling Workshop Chicago, IL July 15-16, 2012 Outline NICS and AACE Architecture Overview Resources Native Mode Boltzmann BGK Solver Native/Offload
More informationHANDLING LOAD IMBALANCE IN DISTRIBUTED & SHARED MEMORY
HANDLING LOAD IMBALANCE IN DISTRIBUTED & SHARED MEMORY Presenters: Harshitha Menon, Seonmyeong Bak PPL Group Phil Miller, Sam White, Nitin Bhat, Tom Quinn, Jim Phillips, Laxmikant Kale MOTIVATION INTEGRATED
More information