On Partitioning FEM Graphs using Diffusion


On Partitioning FEM Graphs using Diffusion

Stefan Schamberger
Universität Paderborn, Fakultät für Elektrotechnik, Informatik und Mathematik
Fürstenallee 11, D-33102 Paderborn

Abstract

To solve the graph partitioning problem, efficient heuristics have been developed that are also capable of distributing the computational load in parallel FEM computations. However, although a few parallel implementations do exist, the involved algorithms are hard to parallelize due to their sequential nature. This paper presents a new approach to the FEM graph partitioning problem. Applying diffusion as the growing mechanism, we are able to eliminate restrictions of former implementations based on the bubble framework and construct a relatively simple algorithm with a high degree of natural parallelism. We demonstrate that it computes solutions comparable to those of established heuristics. Its drawback is the long execution time if the parallelism is not exploited.

Keywords: FEM graph partitioning, load balancing, diffusion, first-order-scheme, bubble

1 Introduction

Graph partitioning is an important subproblem in many applications. One of them consists in balancing the computational load in distributed (adaptive) Finite Element Method (FEM) computations. Since these computations usually follow the single-program multiple-data paradigm, the same code is executed on all processors but on different parts of the data. This implies that the mesh discretizing the continuous simulation space has to be partitioned into P subdomains, each assigned to one of the P processors involved in the computation. The applied iterative solvers mainly perform local operations defined by adjacencies in the mesh, hence the parallel algorithms mostly require communication at the partition boundaries. Thus, the parallel efficiency depends on two factors: an equal distribution of load on the processors and a small communication overhead, achieved by minimizing the number of messages exchanged between the different parts of the mesh.

This work was partly supported by the German Science Foundation (DFG) project SFB-376 and by the IST Program of the EU under contract numbers IST (ALCOM-FT) and IST (FLAGS).

The communication pattern of FEM computations can be modeled by a graph where the vertices represent the data and the edges the dependencies. It is known that the graph partitioning problem is NP-complete, but fortunately a number of efficient heuristics have been developed that find good solutions. State-of-the-art sequential libraries like Metis [1], Jostle [2] and Party [3] are based on the multilevel paradigm introduced in [4], combined with a matching strategy to coarsen the graph and a local refinement, often a Kernighan-Lin like algorithm [5], that is applied on every level during the uncoarsening. By exchanging vertices or vertex sets between partitions, these refinement steps try to further reduce the edge cut (or any alternative metric) and are therefore essential for finding a good solution. However, due to the fine granularity of these steps, their interdependencies and the involvement of more than one partition, these procedures are hard to parallelize. Different approaches have been proposed to maintain the integrity of the distributed data structure and ensure the efficiency of the local heuristics. For example, the parallel version of Metis [6] disallows adjacent vertices to be active at the same time by computing a graph coloring. In the same round, only vertices of one color may be transferred to neighboring partitions.
More advanced methods are implemented in Jostle [2], which is also a good reference for the problems usually encountered when parallelizing the sequential improvement steps and for different ways to deal with them. Another overview of distributed graph partitioning algorithms can be found in [7].

The global edge cut is the classical metric that most graph partitioners optimize. In case of FEM computations, this is not necessarily the best metric to follow because it does not model the real communication and runtime costs, as described in [8]. Hence, different metrics have been implemented inside the local refinement process that model the real objectives more closely. In [9], the costs emerging from vertex transfers are taken into consideration, while Metis [1] is capable of minimizing the subdomain connectivity as well as the communication costs. A completely different approach is undertaken in [10]: since the convergence rate of the domain decomposition solver in the PadFEM environment depends on the geometric shape of a partition, the integrated load balancer focuses on iteratively optimizing the partitions' aspect ratio by applying a bubble-like algorithm.
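To make these objectives concrete, the following minimal sketch (illustrative Python, not taken from any of the cited libraries; the adjacency-list representation and the function names are assumptions) evaluates the classical edge cut and the balance of a given vertex-to-partition assignment:

def edge_cut(adj, part):
    # Number of edges whose two endpoints belong to different partitions.
    cut = sum(1 for v, nbrs in adj.items() for w in nbrs if part[v] != part[w])
    return cut // 2                                  # each undirected edge counted twice

def balance(adj, part, num_parts):
    # Size of the largest partition divided by the average size (1.0 = perfectly balanced).
    sizes = [0] * num_parts
    for v in adj:
        sizes[part[v]] += 1
    return max(sizes) / (len(adj) / num_parts)

# Example: a 2x2 grid split into two halves has edge cut 2 and perfect balance.
adj = {0: [1, 2], 1: [0, 3], 2: [0, 3], 3: [1, 2]}
print(edge_cut(adj, {0: 0, 1: 0, 2: 1, 3: 1}), balance(adj, {0: 0, 1: 0, 2: 1, 3: 1}, 2))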

Figure 1. The three phases of a bubble algorithm: determination of initial seeds for each partition (left); growing in a breadth-first manner around the seeds (middle); movement of the seeds to the partition centers (right). This is repeated until a stable state is reached.

Although different from the multilevel schemes, this approach still contains a strictly serial part and suffers from some other difficulties which we describe in more detail in section 2. The new approach proposed in this paper is based on the bubble framework, too, but we replace the most important operations by diffusive ones. Although we have only implemented a sequential version of the algorithm yet, we are convinced that its parallelization is not too complicated since it does not contain any strictly sequential parts. We demonstrate that the proposed diffusion based method is applicable and that the delivered solutions are comparable with those of state-of-the-art partitioning libraries.

The remaining part of this paper is organized as follows: In the next section we describe the bubble framework in more detail and also discuss its existing implementations. In section 3 we briefly introduce the first order diffusion scheme which we integrate as the growing mechanism. The resulting algorithm is described in section 4. Section 5 shows how we perform our experiments and presents comparisons between results of the new algorithm and those of other partitioning libraries. The last section contains a conclusion, a discussion of further work and a number of open questions.

2 The Bubble Framework

The bubble framework (also described in [10]) has evolved from simple greedy algorithms computing bisections of graphs. Starting with an initial, often randomly chosen vertex (seed) per partition, all subdomains are grown simultaneously in a breadth-first manner. Colliding parts form a common border and keep on growing along this border just like soap bubbles. After the whole mesh has been covered and all vertices of the graph have been assigned this way, each component computes its new center, which acts as the seed in the next iteration. This is usually repeated until a stable state is reached, in which the movement of all seeds is small enough. The procedure is based on the observation that within perfect bubbles, the center and the seed vertex coincide. Distances in this framework may either be chosen as Euclidean distances or as path lengths in the graph; in the latter case, no geometrical information is required. Summarized, a bubble algorithm mainly consists of the following three phases, which are also illustrated in figure 1:

Init: A vertex for each partition is determined. These vertices act as the seeds in the first iteration.
Grow: Starting from their seeds, all partitions grow in a breadth-first manner until all vertices are assigned.
Move: All partitions determine their center vertices, which become the seeds in the next iteration.

To our knowledge, and apart from simple greedy heuristics, there are two implementations that apply the bubble framework to solve the FEM graph partitioning problem. The first one is part of a former version of the Party graph partitioning library. There, the implementation of the three phases can roughly be described as follows:

Init: The initial seeds are determined randomly.
Grow: Starting from every seed, a breadth-first algorithm is applied on the graph. During this process the partitions alternately acquire one of their free neighbor vertices until all vertices are assigned (a sketch of this growing phase is given below).
Move: The vertex with the minimal maximal distance to all other vertices of the same partition becomes the new seed.
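The following minimal sketch (illustrative Python with assumed names, not the original Party code) captures such a breadth-first growing phase, in which the partitions take turns acquiring one free neighboring vertex at a time:

from collections import deque

def grow_from_seeds(adj, seeds):
    """Grow all partitions simultaneously in a breadth-first manner.

    adj   : {vertex: [neighbors]} of a connected graph
    seeds : one seed vertex per partition
    Returns {vertex: partition id} once every vertex is assigned.
    """
    part = {s: p for p, s in enumerate(seeds)}
    frontiers = [deque([s]) for s in seeds]
    assigned = len(seeds)
    while assigned < len(adj):
        for p, frontier in enumerate(frontiers):     # partitions take turns
            while frontier:
                v = frontier.popleft()
                free = [w for w in adj[v] if w not in part]
                if not free:
                    continue                         # v has no free neighbors anymore
                w = free[0]                          # acquire one free neighbor
                part[w] = p
                assigned += 1
                frontier.appendleft(v)               # v may still have free neighbors
                frontier.append(w)
                break                                # only one vertex per turn
    return part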

This approach shows several problems. The initial placement of the partitions may be very bad, requiring many iterations until it is fixed, but even then the partition sizes usually vary extremely. The time spent on finding new seeds is quite long since a breadth-first search has to be performed for every vertex. Moreover, the partition quality is not considered at all. Another important disadvantage is that the growing phase cannot be parallelized because vertices are assigned serially and earlier assignments have a great impact on later decisions.

A second approach is described in [10] and, as already mentioned, it has been implemented in a former version of the FEM simulation tool PadFEM:

Init: The first initial seed is randomly chosen among the vertices with smallest degree. Then, to determine the seed for the next partition, a breadth-first search is performed with all existing seeds as starting points. The last vertex found becomes the seed for the next partition. This is repeated until all seeds have been determined.
Grow: The smallest partition with at least one adjacent unassigned vertex grabs the vertex with the smallest Euclidean distance to its seed.
Move: The new seed of a partition becomes the vertex for which the sum of Euclidean distances to all other vertices of the same partition is minimal. To find this vertex quickly, some successive approximation is used.

This algorithm solves some of the problems we have seen in the first approach. The initial seeds are distributed more evenly over the graph. Since the smallest possible partition gathers the next vertex, more attention is paid to balance, and the determination of the center has also been improved to work faster. By including coordinates in the choice of the next vertex, the partitions are usually also geometrically well shaped (and connected), which is the main goal of this approach. Other quality metrics are not considered. By relying on vertex coordinates, this approach is only applicable if these are provided, and sometimes the Euclidean distance does not coincide at all with the path length between vertices. This can often be observed if an FEM mesh contains holes, in which case a partition may be placed around one. It is a general problem when working with coordinates and occurs even more heavily, for example, in space-filling-curve based partitionings [11]. The experiments made in [10] also reveal that the selection mechanism, though improved by preferring under-weighted partitions, still does not lead to sufficiently well balanced domains. Hence, to fix this, some additional computations are added after the last bubble iteration. Concerning a possible parallelization, the situation stays the same as described before because the selection process of the vertices is still strictly serial.

3 Diffusion

In this section we briefly describe diffusive schemes. We assume that these are well known in the graph partitioning community since they are often applied as load balancers. The simplest one, first introduced in [12], is the first order scheme (FOS). Let G = (V, E) be a connected, undirected graph and l_v ∈ R the load of node v ∈ V. With l̄ ∈ R^V we denote the balanced load vector, whose entries all equal Σ_{v∈V} l_v / |V|. Now, the task of a load balancing algorithm is to compute a flow f ∈ R^E such that A f = l − l̄, where A ∈ {−1, 0, +1}^{|V|×|E|} is the node-edge incidence matrix of G.
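As a small illustration of this setting, the following sketch (illustrative Python using numpy; the edge orientation and names are assumptions) builds the incidence matrix A, the balanced load vector l̄, and one way to obtain the l_2-minimal balancing flow f with A f = l − l̄ for a tiny path graph:

import numpy as np

def incidence_matrix(num_nodes, edges):
    # Node-edge incidence matrix A with entries in {-1, 0, +1}: edge e = (i, j)
    # gets +1 in row i and -1 in row j, so positive flow on e moves load from i to j.
    A = np.zeros((num_nodes, len(edges)))
    for e, (i, j) in enumerate(edges):
        A[i, e], A[j, e] = 1.0, -1.0
    return A

# Example: path graph 0 - 1 - 2 with all load on node 0.
edges = [(0, 1), (1, 2)]
l = np.array([6.0, 0.0, 0.0])
l_bar = np.full(3, l.sum() / 3)              # balanced load vector
A = incidence_matrix(3, edges)
L = A @ A.T                                  # Laplacian matrix of the graph
f = A.T @ np.linalg.pinv(L) @ (l - l_bar)    # l2-minimal balancing flow
print(np.allclose(A @ f, l - l_bar), f)      # True [4. 2.]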
FOS performs on each node v_i ∈ V the iteration [13]:

x_e^{k-1} = α (l_i^{k-1} − l_j^{k-1})   for all e = (v_i, v_j) ∈ E,
f_e^k = f_e^{k-1} + x_e^{k-1},
l_i^k = l_i^{k-1} − Σ_{e=(v_i,v_j) ∈ E} x_e^{k-1},

where x_e^k is the amount of load exchanged via edge e in iteration k and α is a properly chosen parameter, e.g. α = (1 + max_{v∈V} deg(v))^{-1}. (Note that better values of α can be computed, as shown for example in [14].) In matrix notation, this can be written as l^k = M l^{k-1} with the diffusion matrix M = I − αL ∈ R^{|V|×|V|}. Here, L = A A^T is the Laplacian matrix of G. For the error ε^k = l^k − l̄ it holds that ||ε^k||_2 ≤ γ^k ||ε^0||_2, where γ = max_{i=2,...,m} |µ_i| is the second largest absolute value of the m distinct eigenvalues 1 = µ_1 > ... > µ_m > −1 of M. Moreover, it has been shown that the calculated flows f^k (like in all diffusion schemes) converge toward the l_2-minimal balancing flow [15].

FOS has some interesting properties. First, in every iteration communication only occurs between adjacent vertices. Thus, no global view of the graph G is required, and all calculations in one iteration of the diffusion scheme can be performed in parallel on all edges and vertices, respectively. The second observation we make is that load tends to spread faster in regions of the graph where more distinct paths between two nodes exist. We call these regions densely connected. Figure 2 gives an example. The biplane9 graph shown is a square-based mesh, and some parts of the mesh, mainly around the two air-wings, have been refined more often than others. Hence, about half of the vertices located at the transition between these regions contain more edges toward the finer part. The load distribution in this example originates at a single vertex close to such a transition. One can see that, on the one hand, the amount of load received by a vertex depends on its distance (path length) to this source; the reason for this is obvious when looking at the iteration scheme. On the other hand, vertices at the same distance to the seed but placed before a transition receive more load than those behind it. A similar behavior can be observed in refined graphs that do not contain the described T-intersections. This leads to the idea of this paper, which we present in the next section.

Figure 2. Load distribution originating at a single seed after 50 FOS iterations. Vertices with high load are colored red while empty vertices are black.
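A minimal sketch of this iteration (illustrative Python; function and parameter names are assumptions, and the flow accumulation is omitted):

def fos(adj, load, alpha, iterations):
    """First order scheme (FOS) on a graph given as adjacency lists.

    adj        : {vertex: [neighbors]} (undirected, so both directions are listed)
    load       : {vertex: initial load l_v}
    alpha      : diffusion parameter, e.g. 1.0 / (1 + maximum degree)
    iterations : number of FOS iterations k
    Returns the load vector after k iterations; the flow f could be
    accumulated per edge in the same loop.
    """
    l = dict(load)
    for _ in range(iterations):
        nxt = dict(l)
        for v, neighbors in adj.items():
            for w in neighbors:
                # net load v sends to w over edge (v, w); the matching
                # update for w happens when the pair (w, v) is processed
                nxt[v] -= alpha * (l[v] - l[w])
        l = nxt
    return l

# Example: a path 0 - 1 - 2 with all load on vertex 0 slowly balances out.
adj = {0: [1], 1: [0, 2], 2: [1]}
print(fos(adj, {0: 6.0, 1: 0.0, 2: 0.0}, alpha=1.0 / 3, iterations=50))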

4 DB - A Diffusive Bubble Algorithm

In this section we describe how we integrate diffusion into the bubble framework. The main idea is based on the observation described in the last section: load primarily diffuses into densely connected regions of the graph rather than into sparsely connected ones. Following this observation, we expect to identify sets of vertices that possess a high number of internal and a small number of external edges.

This procedure shows many similarities to graph clustering algorithms, with the difference that we fix the number of clusters to P by executing the diffusion algorithm exactly P times, each time with a different kind of load. To distinguish between these loads, we color them with colors from 1 to P, respectively. After all load is distributed, we assign the vertices to the partition they obtained load from. If a vertex contains more than one kind of load, meaning that it could be part of more than one cluster, it is assigned to the partition from which it received the highest amount. Hence, a partition can crowd others out of parts of the graph if it itself already contains a higher load nearby. These dynamic movements are addressed by the bubble framework. During its iterations, the partition centers, from which the diffusion process is initiated, are rearranged such that they are finally well distributed over the graph and are preferably placed within a densely connected region. Summarized, the three phases of the bubble framework can now be described as follows:

Init: The initial seeds are determined randomly, although other methods could be implemented easily. For each partition, |V|/P load of its color is placed on its seed vertex. All other vertices stay empty.
Grow: We use FOS as the growing mechanism. The load distribution is computed independently for all partitions (colors) and is stopped after k iterations, far before all vertices contain an equal amount of load.
Move (Contraction): The vertex containing the most load of color p becomes the new seed of partition p. At the beginning of the next iteration, only the seed vertices contain |V|/P load; all other vertices stay empty.
Move (Consolidation): All vertices are assigned to the partition they have received the highest amount of load from. To prepare for the next iteration, each partition distributes |V|/P load evenly among all its vertices.

The Init phase is identical to the former version of the Party library. As mentioned, a random distribution may place vertices suboptimally, but on the other hand it is more likely that vertices in dense regions are chosen. To grow partitions, we use FOS. Note that, in contrast to the implementations described in section 2, all partitions operate independently. The decision about the vertex assignments is delayed and integrated into the movement process. Hence, this is the most interesting point, also because two different methods are applied. The first operation, called contraction, comes close to the former movement implementations described in section 2. A single vertex in the center of each partition, containing the maximal load of the according color, is determined and becomes the new seed for the next iteration. However, since only a few FOS iterations are performed, this will very likely be the same vertex that initiated the diffusion process in the current iteration; hence, no movement would occur. To fix this, we introduce a second operation which we call consolidation. In contrast to the contraction, not a single vertex is assigned as the new seed but the whole partition is used. Since more vertices lie within the dense regions of the graph, this operation directs the partition toward its desired position and also ensures that in the following loops a different vertex contains the maximum load, as long as the final state has not been reached. Furthermore, the consolidation step helps to avoid numerical problems that could otherwise occur if the load values became too small after executing too many FOS iterations. The resulting diffusive bubble algorithm (DB) is sketched in figure 3.

00 Algorithm DB(G, i, j, k)
01   in each iteration i
02     if i = 1
03       determine-seeds(G)
04     else
05       parallel for each partition p
06         distribute-load(G, p)
07         FOS(G, k)
08       contraction(G)
09     in each loop j
10       parallel for each partition p
11         distribute-load(G, p)
12         FOS(G, k)
13       π = consolidation(G)
14   return π

Figure 3. Sketch of the DB algorithm.

The input consists of the graph G, which is also capable of storing load and flow vectors, and several parameters i, j and k, all specifying the number of the different iterations to be performed. The outer iteration (line 01), executed at least once, always starts by determining a single seed vertex for each of the partitions. Only in the first iteration are random vertices chosen (line 03). In the following iterations the seeds are determined by a contraction step (lines 05-08), consisting, for each partition p independently, of an equal load distribution (following a consolidation step) (line 06) and k FOS iterations (line 07). The determination of the vertex that contains the highest amount of load is performed in the contraction (line 08). It is followed by j loops (line 09) containing a consolidation step. This again involves an equal distribution of load on all vertices of a particular partition (line 11) and the FOS iterations (line 12). Note that during the first loop following a contraction, all partitions consist only of a single vertex, the seed. The last operation, the consolidation itself (line 13), determines the maximal load of each color on every vertex and therefore also the resulting partitioning π.

The DB algorithm mainly consists of a collection of loops. Of those, the loops over the partitions (lines 05 and 10) and the FOS iterations are independent and can therefore be fully parallelized. The consolidation is also a per-vertex operation; only the maximum computation during the contraction requires a more global view of the whole partition. Another interesting point is that DB does not contain any explicit objectives; these are hidden inside the growth and movement operations.

As with all other bubble implementations, some difficulties arise in establishing the balance of the partitionings. However, adjusting the total load of a domain (either placed on a single vertex or equally distributed on all of them) provides a handy method to address this problem. Instead of supplying all partitions with the same load amount W = |V|/P, we increase the load on under-weighted and decrease it on over-weighted partitions. To prevent flipping, a damping factor ∆ is included. The total load amount L_p^j placed on partition p in loop j is calculated as

δ_p^{j-1} = W − W_p^{j-1},    θ_p^j = (1 − ∆) θ_p^{j-1} + ∆ δ_p^{j-1},    L_p^j = W + θ_p^j,

where W_p^{j-1} denotes p's weight (size) after the last consolidation and is set to W_p^0 = W during a contraction. This procedure grants us some control over the partition sizes. However, our experiments show that in some cases this does not suffice. Furthermore, the load balancing process is sometimes very slow and takes many iterations to show an effect, especially if the under-weighted and over-weighted partitions are situated far from each other. Thus, more advanced methods are needed, as discussed in section 6.
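A compact sketch of the control flow in figure 3 (illustrative Python, not the author's implementation; it reuses a FOS routine like the one sketched in section 3, keeps the load amount fixed at W = |V|/P instead of the adaptive L_p^j, and assumes every partition keeps at least one vertex):

import random

def db_partition(adj, P, iters, loops, k, fos):
    """Illustrative sketch of the DB control flow from figure 3.

    adj   : {vertex: [neighbors]};  P : number of partitions
    iters, loops, k : the parameters i, j and k of the algorithm
    fos   : a first order diffusion routine like the one sketched in section 3
    """
    W = len(adj) / P
    alpha = 1.0 / (1 + max(len(n) for n in adj.values()))
    members = [[s] for s in random.sample(list(adj), P)]   # Init: one random seed per color

    def diffuse_all():
        # distribute-load followed by k FOS iterations, independently for every color p
        loads = []
        for p in range(P):
            mem, share = set(members[p]), W / len(members[p])
            init = {v: (share if v in mem else 0.0) for v in adj}
            loads.append(fos(adj, init, alpha, k))
        return loads

    part = {}
    for i in range(iters):
        if i > 0:                                          # contraction step
            loads = diffuse_all()
            members = [[max(adj, key=loads[p].get)] for p in range(P)]   # new seeds
        for _ in range(loops):                             # consolidation loops
            loads = diffuse_all()
            part = {v: max(range(P), key=lambda q: loads[q][v]) for v in adj}
            members = [[v for v in adj if part[v] == p] for p in range(P)]
    return part

With the default parameters used in section 5, a call would look like db_partition(adj, 16, 4, 25, 25, fos).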
Figure 4 shows an example of the load distribution after 4 iterations (25 loops and 25 FOS iterations each) on the biplane9 graph. Note that only the dominant amount of load on a vertex is displayed. The partitions in the lower left part of the graph are smaller than their counterparts on the right, which can easily be recognized because they have been assigned a higher total load. With 1409 vertices, the weight of the heaviest partition is about 4% too high, which might be noticed in the resulting partitioning displayed in figure 5.

5 First Experimental Results

In this section we present some results obtained with a sequential implementation of the proposed DB algorithm and compare them to solutions of the (sequential) partitioning libraries kmetis, Jostle and Party. All of the heuristics are invoked with their default parameters; in case of the DB algorithm this means 4 iterations with 25 loops and FOS iterations each and a damping factor of ∆ = 1/4. Furthermore, we replace the global α introduced for the FOS iterations in section 3 by a separate one for every edge e = (v_i, v_j) and set it to α_e = (1 + max(deg(v_i), deg(v_j)))^{-1}.

Our experiments are based on the set of graphs shown in table 1. Some of these graphs have been frequently used to compare graph partitioning heuristics, while we included some others from the PadFEM simulation tool. Note that all graphs are FEM graphs (or their duals) of a relatively small size; the reasons for this limitation are discussed in section 6.

To judge the quality of a partitioning, several metrics are possible. The classical one is the edge cut, that is, the number of edges between vertices of different partitions.

Figure 4. The maximal load per vertex after 4 iterations.

Figure 5. The resulting partitioning after 4 iterations.

Since it is known that this metric does not model the real communication costs of FEM applications, we also measure the number of external edges, the number of boundary vertices and the communication volume (send and receive), assuming that vertices represent information and edges the communication pattern. For each partition p these metrics can be described as follows:

external edges: The number of edges that are incident to exactly one vertex of partition p.
boundary vertices: The number of vertices of partition p that are adjacent to at least one vertex from a different partition.
send volume: The amount of outgoing information, i.e. the sum over all vertices residing inside partition p of the number of adjacent partitions different from p.
receive volume: The amount of incoming information, i.e. the number of vertices of partitions different from p that are adjacent to at least one vertex of partition p.

Note that Metis is also capable of minimizing the communication volume [1]; however, we did not make use of this option. Furthermore, for each metric we consider three different norms. Given the values x_1, ..., x_P, the norms are defined as follows:

l_1: ||X||_1 = |x_1| + ... + |x_P|
l_2: ||X||_2 = (x_1^2 + ... + x_P^2)^{1/2}
l_∞: ||X||_∞ = max(|x_1|, ..., |x_P|)

The l_1-norm (summation norm) is a global norm; the global edge cut belongs into this category (it equals half the number of external edges in this norm). In contrast to the l_1-norm, the l_∞-norm (maximum norm) is a local norm that only considers the worst value. This norm is favorable if synchronized processes are involved. The l_2-norm (Euclidean norm) lies in between the l_1- and the l_∞-norm and reflects the global situation as well as local peaks.

It has been shown that comparisons based on a single test per graph usually do not lead to meaningful conclusions. Therefore, we apply the permutation based evaluation scheme from [3]. We perform 100 runs for the multilevel heuristics but reduce this number to 10 for the DB algorithm due to its longer run-time. The results are summarized in a collection of charts generated by a script. Due to space limitations, we restrict our presentation to 16-partitionings and include only one chart collection. We also omit the timings because our current sequential implementation of the DB algorithm is not optimized; thus, its run-time is much higher than those of the multilevel heuristics. While on modern computers the latter take only a fraction of a second to compute a solution, the DB algorithm needs about a minute to compute a 16-partitioning of the biplane9 graph. Possible enhancements to DB are discussed in section 6.

Figure 6 gives the detailed results we obtained dividing the biplane9 graph into 16 domains. Each of the charts contains values for the different partitioning libraries and displays the average of 100 (10) runs as well as the standard deviation and the extremes, respectively.
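The following sketch (illustrative Python with assumed names) computes these four per-partition metrics and the three norms directly from the definitions above:

import math

def partition_metrics(adj, part, p):
    """External edges, boundary vertices, send and receive volume of partition p."""
    external, send = 0, 0
    boundary, receivers = set(), set()
    for v, neighbors in adj.items():
        if part[v] == p:
            foreign = {part[w] for w in neighbors if part[w] != p}
            external += sum(1 for w in neighbors if part[w] != p)
            send += len(foreign)                 # v is sent once to each adjacent partition
            if foreign:
                boundary.add(v)
        elif any(part[w] == p for w in neighbors):
            receivers.add(v)                     # v has to be received by partition p
    return external, len(boundary), send, len(receivers)

def norms(values):
    # l1-, l2- and l-infinity norm of the per-partition values x_1, ..., x_P.
    return (sum(abs(x) for x in values),
            math.sqrt(sum(x * x for x in values)),
            max(abs(x) for x in values))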

Table 1. Graphs used in this paper and some of their properties (columns: graph, |V|, |E|, minimum, average and maximum degree, diameter, and origin). The test set consists of the 2D FEM graphs grid100, airfoil1, biplane9 and crack, the 2D duals airfoil1 dual and crack dual, and the 3D FEM graphs (or duals) cs4, padfem and sphere.

The cut chart compares the classical global edge cut. Note that this is equivalent to half the number of external edges measured in the l_1-norm. One can see that the DB algorithm finds solutions comparable to those of kmetis and Jostle without imbalance allowance. However, much better results exist, as for example demonstrated by Party. Looking at the balance chart, this becomes even more visible because the solutions computed by the DB algorithm are less balanced than those of the other libraries. As already mentioned, the results concerning the number of external edges match those for the global edge cut in the l_1-norm. Regarding the l_2- and l_∞-norm, the situation is similar, with a decreasing deviation between the libraries.

A different picture is drawn by the metrics that model the real communication costs more closely. Concerning both the boundary vertices and the communication volume, the solutions computed by the DB algorithm are much better and result in shorter boundaries and fewer messages than those of any other library. This advantage is especially noticeable in the l_1- and l_2-norm, but also exists in the l_∞-norm. This means smaller communication costs and a better distribution among the partitions. Remembering the edge cuts, it becomes clear that the two objectives, minimizing the sum of external edges and minimizing the maximal number of messages, do not coincide well in case of the biplane9 graph.

As seen in figure 6, though relative differences exist, all three norms show the same tendency, which is also true for all other graphs of our test set. Therefore, we restrict the results of the latter to the l_2-norm without omitting too much information. The summarized results can be found in table 2. These can be roughly categorized into two groups: the first one is the set of graphs that leads to similar results as obtained for the biplane9 graph. This holds for grid100, crack dual, cs4, and padfem. On the other hand, our test set contains some instances where DB finds comparable solutions concerning the edge cut but does not deliver better partitionings when looking at the communication volume. This is the case for crack, sphere and airfoil1. The airfoil1 dual lies between these groups. Whether this can be explained by different graph properties or by a varying performance of the heuristics is unclear.

As already mentioned, the DB algorithm shows weaknesses in computing balanced partitions on some graphs. In case of the airfoil1 graph, the largest partition contains about 5% too many vertices. On the other hand, the absolute number of excess vertices is quite small, and we think that these small partition sizes partly cause the problem. An idea to address this problem is discussed in section 6. Another observation is that the standard deviation of the communication volume metrics is often smaller for DB than for the other heuristics. This can be explained by the different metric the multilevel heuristics optimize. What cannot be seen in table 2 is the partitions' shape. In all cases, only connected partitions have been computed by the DB algorithm and, if the observations from 2D graphs also hold for 3D, their shape is very compact. A notably nice example is shown in figure 7.
6 Conclusion and Further Work

In this paper we have proposed a new approach to partition FEM graphs by merging the bubble framework and the first order diffusion scheme. The resulting algorithm requires many computations, hence it performs slowly when executed serially, but on the other hand it is very simple and contains a high degree of possible parallelism. Our results show that its solutions are comparable to those of state-of-the-art partitioning heuristics and even outperform them on most of our test graphs concerning the boundary and communication volume metrics. Thus, we think that the DB algorithm has some interesting potential. But before it can be applied in real computations, numerous questions have to be answered.

From the practical point of view, the most urgent need is to decrease the algorithm's run-time. There are several possible ways to achieve this.

Figure 6. Detailed results obtained for the biplane9 graph: charts for the cut, the balance, and the external edges, boundary vertices, send and receive volumes, each in the l_1-, l_2- and l_∞-norm. Results are shown (from left to right) for: kmetis (dark blue triangles), pmetis (light blue triangles), Jostle (squares) with 0% (yellow), 1% (orange) and 3% (red) imbalance allowance, Party (green diamonds), Party (black circles) and DB (magenta circles). Each bar displays the average value of 100 (10 in case of DB) independent runs with a large mark, the standard deviation of the values with a wide bar, the minimum and maximum values with thin bars, and the result for the first run (on the original, not randomized instance of the graph [3]) with a small mark.

Table 2. Overview of some average results and the standard deviation (±) regarding the l_2-norm, listing for each graph (grid100, airfoil1, airfoil1 dual, biplane9, crack, crack dual, cs4, padfem and sphere) and each partitioner (kmetis, Jostle, Party and DB) the global cut, the balance, and the l_2-norm values of the external edges, boundary vertices, send and receive volumes. The default imbalance allowance of kmetis and Jostle is 3% while Party uses 1%. Note that the DB algorithm does not provide any explicit balancing option.

During our experiments we have observed that using the CPU's cache properly can lead to an up to 5 times faster execution. Furthermore, the parallelism of the algorithm should be exploited on both shared and distributed memory machines, as well as on combinations thereof. Though this will definitely shorten the execution time, we doubt that this alone already allows partitioning FEM graphs containing several million vertices. The computations performed inside the FOS iterations are simple; hence, another possibility to speed them up is the use of dedicated hardware other than the main CPUs.

Another point to be addressed is the slow propagation of over- and underweight between the partitions. Many iterations are required until a high load is recognized at the other end of the graph, especially if many well balanced partitions are in the way. The integration of an additional explicit load balancing component could help to improve this process and also take care of those cases where the implicit mechanism does not work properly.

Furthermore, it has been shown that FOS is also applicable to weighted graphs. Thus, the bubble framework can be merged with the multilevel approach. First experiments have been successful, but in some cases the partitions on lower levels converge to different constellations than they do on higher ones, making the additional effort useless. We think that better parameters and an adapted graph coarsening strategy might fix this problem.

The question of good parameters is also of theoretical interest. In the experiments presented, we set the number of iterations to 4 while performing 25 loops and FOS iterations, respectively. This works well for the graphs included in our test set, all of which are FEM originated and of relatively small size. But even for these graphs the chosen constants are definitely not optimal, and we would expect good parameters to depend on the graphs' properties. A better understanding here would also answer the question on what kinds of graphs the DB algorithm is applicable. Another important question to be solved is the metric that the DB algorithm optimizes. From our experiments we can see that it might be closely related to the local communication volume. Also, the partition shapes are very compact (e.g. figure 7). However, much more work addressing this question has to be done. Finally, we can think of integrating DB into the state-of-the-art partitioning libraries, either as a pre- or postprocessor or as a global partitioner in the lower levels of the multilevel schemes, improving the otherwise difficult overall partition placement.

Figure 7. The 16 partitions of a 100x100 grid after 25 iterations.

References

[1] G. Karypis and V. Kumar, Metis User Manual, Version 4.0. [Online]. Available: www-users.cs.umn.edu/~karypis/metis/metis/files/manual.ps
[2] M. Cross and C. Walshaw, Parallel optimisation algorithms for multilevel mesh partitioning, Parallel Computing, vol. 26, no. 12.
[3] S. Schamberger, Improvements to the helpful-set heuristic and a new evaluation scheme for graph partitioners, in International Conference on Computational Science and its Applications, ICCSA 03, ser. LNCS, no. 2667, 2003.
[4] B. Hendrickson and R. Leland, A multi-level algorithm for partitioning graphs, in Supercomputing 95. ACM/IEEE Press.
[5] B. W. Kernighan and S. Lin, An efficient heuristic for partitioning graphs, Bell Systems Technical Journal, vol. 49.
[6] G. Karypis and V. Kumar, Parallel multilevel k-way partitioning scheme for irregular graphs, in Supercomputing 96. ACM/IEEE Press, 1996, p. 35.
[7] K. Devine and B. Hendrickson, Dynamic load balancing in computational mechanics, Computational Methods in Applied Mechanical Engineering, vol. 184.
[8] B. Hendrickson, Graph partitioning and parallel solvers: Has the emperor no clothes?, in Irregular 98, ser. LNCS, no. 1457, 1998.
[9] R. Biswas and L. Oliker, PLUM: Parallel load balancing for adaptive unstructured meshes, Parallel and Distributed Computing, vol. 51, no. 2.
[10] R. Diekmann, R. Preis, F. Schlimbach, and C. Walshaw, Shape-optimized mesh partitioning and load balancing for parallel adaptive FEM, Parallel Computing, vol. 26.
[11] S. Schamberger and J. M. Wierum, Graph partitioning in scientific simulations: Multilevel schemes vs. space-filling curves, in Parallel Computing Technologies, PACT 03, ser. LNCS, no. 2763, 2003.
[12] G. Cybenko, Load balancing for distributed memory multiprocessors, Parallel and Distributed Computing, vol. 7.
[13] R. Elsässer, B. Monien, and R. Preis, Diffusion schemes for load balancing on heterogeneous networks, Theory of Computing Systems, vol. 35.
[14] R. Elsässer, B. Monien, and S. Schamberger, Toward optimal diffusion matrices, in International Parallel and Distributed Processing Symposium, IPDPS 02, 2002, p. 67 (CD).
[15] R. Diekmann, A. Frommer, and B. Monien, Efficient schemes for nearest neighbor load balancing, Parallel Computing, vol. 25, no. 7.


Introduction to Parallel & Distributed Computing Parallel Graph Algorithms Introduction to Parallel & Distributed Computing Parallel Graph Algorithms Lecture 16, Spring 2014 Instructor: 罗国杰 gluo@pku.edu.cn In This Lecture Parallel formulations of some important and fundamental

More information

CS 140: Sparse Matrix-Vector Multiplication and Graph Partitioning

CS 140: Sparse Matrix-Vector Multiplication and Graph Partitioning CS 140: Sparse Matrix-Vector Multiplication and Graph Partitioning Parallel sparse matrix-vector product Lay out matrix and vectors by rows y(i) = sum(a(i,j)*x(j)) Only compute terms with A(i,j) 0 P0 P1

More information

Improvements in Dynamic Partitioning. Aman Arora Snehal Chitnavis

Improvements in Dynamic Partitioning. Aman Arora Snehal Chitnavis Improvements in Dynamic Partitioning Aman Arora Snehal Chitnavis Introduction Partitioning - Decomposition & Assignment Break up computation into maximum number of small concurrent computations that can

More information

Space Filling Curves and Hierarchical Basis. Klaus Speer

Space Filling Curves and Hierarchical Basis. Klaus Speer Space Filling Curves and Hierarchical Basis Klaus Speer Abstract Real world phenomena can be best described using differential equations. After linearisation we have to deal with huge linear systems of

More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 22.1 Introduction We spent the last two lectures proving that for certain problems, we can

More information

Unsupervised Learning : Clustering

Unsupervised Learning : Clustering Unsupervised Learning : Clustering Things to be Addressed Traditional Learning Models. Cluster Analysis K-means Clustering Algorithm Drawbacks of traditional clustering algorithms. Clustering as a complex

More information

6 Distributed data management I Hashing

6 Distributed data management I Hashing 6 Distributed data management I Hashing There are two major approaches for the management of data in distributed systems: hashing and caching. The hashing approach tries to minimize the use of communication

More information

A STUDY OF LOAD IMBALANCE FOR PARALLEL RESERVOIR SIMULATION WITH MULTIPLE PARTITIONING STRATEGIES. A Thesis XUYANG GUO

A STUDY OF LOAD IMBALANCE FOR PARALLEL RESERVOIR SIMULATION WITH MULTIPLE PARTITIONING STRATEGIES. A Thesis XUYANG GUO A STUDY OF LOAD IMBALANCE FOR PARALLEL RESERVOIR SIMULATION WITH MULTIPLE PARTITIONING STRATEGIES A Thesis by XUYANG GUO Submitted to the Office of Graduate and Professional Studies of Texas A&M University

More information

Parallel Graph Partitioning on a CPU-GPU Architecture

Parallel Graph Partitioning on a CPU-GPU Architecture Parallel Graph Partitioning on a CPU-GPU Architecture Bahareh Goodarzi Martin Burtscher Dhrubajyoti Goswami Department of Computer Science Department of Computer Science Department of Computer Science

More information

CS6702 GRAPH THEORY AND APPLICATIONS 2 MARKS QUESTIONS AND ANSWERS

CS6702 GRAPH THEORY AND APPLICATIONS 2 MARKS QUESTIONS AND ANSWERS CS6702 GRAPH THEORY AND APPLICATIONS 2 MARKS QUESTIONS AND ANSWERS 1 UNIT I INTRODUCTION CS6702 GRAPH THEORY AND APPLICATIONS 2 MARKS QUESTIONS AND ANSWERS 1. Define Graph. A graph G = (V, E) consists

More information

Lecture 3: Art Gallery Problems and Polygon Triangulation

Lecture 3: Art Gallery Problems and Polygon Triangulation EECS 396/496: Computational Geometry Fall 2017 Lecture 3: Art Gallery Problems and Polygon Triangulation Lecturer: Huck Bennett In this lecture, we study the problem of guarding an art gallery (specified

More information

MCL. (and other clustering algorithms) 858L

MCL. (and other clustering algorithms) 858L MCL (and other clustering algorithms) 858L Comparing Clustering Algorithms Brohee and van Helden (2006) compared 4 graph clustering algorithms for the task of finding protein complexes: MCODE RNSC Restricted

More information

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents

E-Companion: On Styles in Product Design: An Analysis of US. Design Patents E-Companion: On Styles in Product Design: An Analysis of US Design Patents 1 PART A: FORMALIZING THE DEFINITION OF STYLES A.1 Styles as categories of designs of similar form Our task involves categorizing

More information

An Edge-Swap Heuristic for Finding Dense Spanning Trees

An Edge-Swap Heuristic for Finding Dense Spanning Trees Theory and Applications of Graphs Volume 3 Issue 1 Article 1 2016 An Edge-Swap Heuristic for Finding Dense Spanning Trees Mustafa Ozen Bogazici University, mustafa.ozen@boun.edu.tr Hua Wang Georgia Southern

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Quasi-Dynamic Network Model Partition Method for Accelerating Parallel Network Simulation

Quasi-Dynamic Network Model Partition Method for Accelerating Parallel Network Simulation THE INSTITUTE OF ELECTRONICS, INFORMATION AND COMMUNICATION ENGINEERS TECHNICAL REPORT OF IEICE. 565-871 1-5 E-mail: {o-gomez,oosaki,imase}@ist.osaka-u.ac.jp QD-PART (Quasi-Dynamic network model PARTition

More information

I. Meshing and Accuracy Settings

I. Meshing and Accuracy Settings Guidelines to Set CST Solver Accuracy and Mesh Parameter Settings to Improve Simulation Results with the Time Domain Solver and Hexahedral Meshing System illustrated with a finite length horizontal dipole

More information

Graph and Hypergraph Partitioning for Parallel Computing

Graph and Hypergraph Partitioning for Parallel Computing Graph and Hypergraph Partitioning for Parallel Computing Edmond Chow School of Computational Science and Engineering Georgia Institute of Technology June 29, 2016 Graph and hypergraph partitioning References:

More information

CS 5220: Parallel Graph Algorithms. David Bindel

CS 5220: Parallel Graph Algorithms. David Bindel CS 5220: Parallel Graph Algorithms David Bindel 2017-11-14 1 Graphs Mathematically: G = (V, E) where E V V Convention: V = n and E = m May be directed or undirected May have weights w V : V R or w E :

More information

LECTURES 3 and 4: Flows and Matchings

LECTURES 3 and 4: Flows and Matchings LECTURES 3 and 4: Flows and Matchings 1 Max Flow MAX FLOW (SP). Instance: Directed graph N = (V,A), two nodes s,t V, and capacities on the arcs c : A R +. A flow is a set of numbers on the arcs such that

More information

PuLP. Complex Objective Partitioning of Small-World Networks Using Label Propagation. George M. Slota 1,2 Kamesh Madduri 2 Sivasankaran Rajamanickam 1

PuLP. Complex Objective Partitioning of Small-World Networks Using Label Propagation. George M. Slota 1,2 Kamesh Madduri 2 Sivasankaran Rajamanickam 1 PuLP Complex Objective Partitioning of Small-World Networks Using Label Propagation George M. Slota 1,2 Kamesh Madduri 2 Sivasankaran Rajamanickam 1 1 Sandia National Laboratories, 2 The Pennsylvania State

More information

Note Set 4: Finite Mixture Models and the EM Algorithm

Note Set 4: Finite Mixture Models and the EM Algorithm Note Set 4: Finite Mixture Models and the EM Algorithm Padhraic Smyth, Department of Computer Science University of California, Irvine Finite Mixture Models A finite mixture model with K components, for

More information

Piecewise-Planar 3D Reconstruction with Edge and Corner Regularization

Piecewise-Planar 3D Reconstruction with Edge and Corner Regularization Piecewise-Planar 3D Reconstruction with Edge and Corner Regularization Alexandre Boulch Martin de La Gorce Renaud Marlet IMAGINE group, Université Paris-Est, LIGM, École Nationale des Ponts et Chaussées

More information

Scalable Dynamic Adaptive Simulations with ParFUM

Scalable Dynamic Adaptive Simulations with ParFUM Scalable Dynamic Adaptive Simulations with ParFUM Terry L. Wilmarth Center for Simulation of Advanced Rockets and Parallel Programming Laboratory University of Illinois at Urbana-Champaign The Big Picture

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu HITS (Hypertext Induced Topic Selection) Is a measure of importance of pages or documents, similar to PageRank

More information

A substructure based parallel dynamic solution of large systems on homogeneous PC clusters

A substructure based parallel dynamic solution of large systems on homogeneous PC clusters CHALLENGE JOURNAL OF STRUCTURAL MECHANICS 1 (4) (2015) 156 160 A substructure based parallel dynamic solution of large systems on homogeneous PC clusters Semih Özmen, Tunç Bahçecioğlu, Özgür Kurç * Department

More information

Scalable Clustering of Signed Networks Using Balance Normalized Cut

Scalable Clustering of Signed Networks Using Balance Normalized Cut Scalable Clustering of Signed Networks Using Balance Normalized Cut Kai-Yang Chiang,, Inderjit S. Dhillon The 21st ACM International Conference on Information and Knowledge Management (CIKM 2012) Oct.

More information

Some aspects of parallel program design. R. Bader (LRZ) G. Hager (RRZE)

Some aspects of parallel program design. R. Bader (LRZ) G. Hager (RRZE) Some aspects of parallel program design R. Bader (LRZ) G. Hager (RRZE) Finding exploitable concurrency Problem analysis 1. Decompose into subproblems perhaps even hierarchy of subproblems that can simultaneously

More information

New Challenges In Dynamic Load Balancing

New Challenges In Dynamic Load Balancing New Challenges In Dynamic Load Balancing Karen D. Devine, et al. Presentation by Nam Ma & J. Anthony Toghia What is load balancing? Assignment of work to processors Goal: maximize parallel performance

More information

Introduction to Graph Theory

Introduction to Graph Theory Introduction to Graph Theory Tandy Warnow January 20, 2017 Graphs Tandy Warnow Graphs A graph G = (V, E) is an object that contains a vertex set V and an edge set E. We also write V (G) to denote the vertex

More information

Complementary Graph Coloring

Complementary Graph Coloring International Journal of Computer (IJC) ISSN 2307-4523 (Print & Online) Global Society of Scientific Research and Researchers http://ijcjournal.org/ Complementary Graph Coloring Mohamed Al-Ibrahim a*,

More information

Segmentation and Grouping

Segmentation and Grouping Segmentation and Grouping How and what do we see? Fundamental Problems ' Focus of attention, or grouping ' What subsets of pixels do we consider as possible objects? ' All connected subsets? ' Representation

More information

Welcome to the course Algorithm Design

Welcome to the course Algorithm Design Welcome to the course Algorithm Design Summer Term 2011 Friedhelm Meyer auf der Heide Lecture 13, 15.7.2011 Friedhelm Meyer auf der Heide 1 Topics - Divide & conquer - Dynamic programming - Greedy Algorithms

More information

Part 3: Image Processing

Part 3: Image Processing Part 3: Image Processing Image Filtering and Segmentation Georgy Gimel farb COMPSCI 373 Computer Graphics and Image Processing 1 / 60 1 Image filtering 2 Median filtering 3 Mean filtering 4 Image segmentation

More information

Combinatorial Maps. University of Ljubljana and University of Primorska and Worcester Polytechnic Institute. Maps. Home Page. Title Page.

Combinatorial Maps. University of Ljubljana and University of Primorska and Worcester Polytechnic Institute. Maps. Home Page. Title Page. Combinatorial Maps Tomaz Pisanski Brigitte Servatius University of Ljubljana and University of Primorska and Worcester Polytechnic Institute Page 1 of 30 1. Maps Page 2 of 30 1.1. Flags. Given a connected

More information