Algorithmica 2002 Springer-Verlag New York Inc.

Size: px

Start display at page:

Download "Algorithmica 2002 Springer-Verlag New York Inc."

Bryce Riley
5 years ago
Views:

1 Algorithmia (2002) 33: DOI: /s Algorithmia 2002 Springer-Verlag New York In. Effiient Parallel Graph Algorithms for Coarse-Grained Multiomputers and BSP 1 F. Dehne, 2 A. Ferreira, 3 E. Cáeres, 4 S. W. Song, 5 and A. Ronato 6 Abstrat. In this paper we present deterministi parallel algorithms for the oarse-grained multiomputer (CGM) and bulk synhronous parallel (BSP) models for solving the following well-known graph problems: (1) list ranking, (2) Euler tour onstrution in a tree, (3) omputing the onneted omponents and spanning forest, (4) lowest ommon anestor preproessing, (5) tree ontration and expression tree evaluation, (6) omputing an ear deomposition or open ear deomposition, and (7) 2-edge onnetivity and bionnetivity (testing and omponent omputation). The algorithms require O(log p) ommuniation rounds with linear sequential work per round ( p = no. proessors, N = total input size). Eah proessor reates, during the entire algorithm, messages of total size O(log(p)(N/p)). The algorithms assume that the loal memory per proessor (i.e., N/p) is larger than p ε, for some fixed ε>0. Our results imply BSP algorithms with O(log p) supersteps, O(g log( p)(n/ p)) ommuniation time, and O(log(p)(N/p)) loal omputation time. It is important to observe that the number of ommuniation rounds/supersteps obtained in this paper is independent of the problem size, and grows only logarithmially with respet to p. With growing problem size, only the sizes of the messages grow but the total number of messages remains unhanged. Due to the onsiderable protool overhead assoiated with eah message transmission, this is an important property. The result for Problem (1) is a onsiderable improvement over those previously reported. The algorithms for Problems (2) (7) are the first pratially relevant parallel algorithms for these standard graph problems. Key Words. Coarse grained parallel omputing, Graph algorithms. 1. Introdution. Speedup results for theoretial PRAM algorithms do not neessarily math the speedups observed on real mahines [1], [36]. Given suffiient slakness in the number of proessors, Valiant s bulk synhronous parallel (BSP) approah [39] simulates PRAM algorithms optimally on distributed memory parallel systems. Valiant points out, however, that one may want to design algorithms that utilize loal omputations and minimize global operations [38], [39]. The BSP approah requires that g (= loal omputation speed router bandwidth) is low, or fixed, even for inreasing 1 This researh was partially supported by the Natural Sienes and Engineering Researh Counil of Canada, FAPESP (Brasil), CNPq (Brasil), PROTEM-2-TCPAC (Brasil), the Commission of the European Communities (ESPRIT Long Term Researh Projet 20244, ALCOM-IT), DFG-SFB 376 Massive Parallelität (Germany), and the Région Rhône-Alpes (Frane). A preliminary version of this paper was published in the proeedings of the 1997 International Colloquium on Automata, Languages and Programming [5]. 2 Shool of Computer Siene, Carleton University, Ottawa, Ontario, Canada K1S 5B6. frank@dehne.net, 3 CNRS - I3S, INRIA, Sophia-Antipolis, Frane. ferreira@sophia.inria.fr. 4 Univ. Federal de Mato Grosso do Sul, Campo Grande, Brasil. edson@dt.ufms.br. 5 University of São Paulo, São Paulo, Brazil. song@ime.usp.br. 6 Faolta di Sienze Mat. Fis. e Nat., Mestre, Italia. ronato@dsi.unive.it. Reeived July 5, 2000; revised April 16, Communiated by F. P. Preparata. Online publiation Marh 11, 2002.

2 184 F. Dehne, A. Ferreira, E. Cáeres, S. W. Song, and A. Ronato number of proessors. Gerbessiotis and Valiant [18] desribe irumstanes where PRAM simulations annot be performed effiiently, among others, if the fator g is high. Unfortunately, this is true for most urrently available multiproessors. The parallel algorithms presented in this paper onsider this ase for graph problems. As pointed out in [39], the ost of a message also ontains a onstant overhead ost s. The value of s an be fairly large and the total message overhead ost an have a onsiderable impat on the speedup observed (see, e.g., [9]). In this paper we use a more pratial version of the BSP model, referred to as the oarse-grained multiomputer (CGM) model [8] [11]. It is omprised of a set of p proessors P 1,...,P p with O(N/ p) loal memory per proessor and an arbitrary ommuniation network (or shared memory). All algorithms onsist of alternating loal omputation and global ommuniation rounds. Eah ommuniation round onsists of routing a single h-relation with h = O(N/p), i.e., eah proessor sends O(N/p) data and reeives O(N/p) data. We require that all information sent from a given proessor to another proessor in one ommuniation round is paked into one long message, thereby minimizing the message overhead. A CGM omputation/ommuniation round orresponds to a BSP superstep with ommuniation ost g(n/p) (plus the above paking requirement ). Finding an optimal algorithm in the CGM model is equivalent to minimizing the number of ommuniation rounds as well as the total loal omputation time and total message size. This onsiders all parameters disussed above that are affeting the final observed speedup and it requires no assumption on g. Furthermore, it has been shown that minimizing the number of supersteps also leads to improved portability aross different parallel arhitetures [38], [39], [14]. The above model has been used (expliitly or impliitly) in parallel algorithm design for various problems [4], [8] [11], [13], [15], [25] and shown very good pratial timing results. In this paper we study deterministi parallel graph algorithms for the CGM and BSP models. We onsider the following well-known graph problems: 1. List ranking. 2. Euler tour onstrution. 3. Computing the onneted omponents and spanning forest. 4. Lowest ommon anestor preproessing. 5. Tree ontration and expression tree evaluation. 6. Computing an ear deomposition or open ear deomposition Edge onnetivity and bionnetivity (testing and omponent omputation). These problems have been extensively studied for fine-grained parallel networks and for the PRAM (see, e.g., [32]). However, for the pratially muh more relevant CGM/BSP model there exist, to the best of our knowledge, only a few results on parallel graph algorithms. For the remainder, let n denote the number of verties, let m be the number of edges of a given input graph G, and let N = n +m. Reid-Miller [31] presented an empirial study of parallel list ranking for the Cray C-90. The paper followed essentially the CGM/BSP model and laimed that this was the fastest list ranking implementation at that time. More detailed empirial studies and tradeoffs with respets to the ommuniation volume are presented in [35]. The algorithm in [31] required O(log n) ommuniation rounds. In [12], an improved algorithm was presented whih required, with high probability, only

3 Effiient Parallel Graph Algorithms for Coarse-Grained Multiomputers and BSP 185 O(k log p) rounds, where k log n. In [14] the list ranking problem is onsidered as well. Here O(log p) ommuniation rounds are ahieved by a randomized algorithm. Bäumker and Dittrih presented in [3] a onneted omponents algorithm for planar graphs using O(log p) ommuniation rounds. They suggest an extension of this algorithm for general graphs with the same number of ommuniation rounds. Again, both algorithms are randomized. We improve these results by giving the first deterministi algorithms for list ranking and omputing onneted omponents in O(log p) rounds. Algorithms with O(1) ommuniation rounds have been presented for various Computational Geometry problems [9] [12], [17], but the graph problems studied in this paper have onsiderably less internal struture whih ould be exploited to obtain suh solutions. It is not known whether solutions with O(1) ommuniation rounds exist for these graph problems. Note that, in pratie, the number of proessors is usually fixed. In ontrast to the previous deterministi results, the improved number of ommuniation rounds obtained in this paper, O(log p), isindependent of n. This is of onsiderable pratial relevane. With growing problem size, the number of messages remains unhanged. Only the sizes of these messages grow, linear with respet to the growth in problem size. Due to the onsiderable protool overhead assoiated with eah message transmission, this is an important property. In fat, our experiene in implementing parallel algorithms on standard ommerial mahines indiates that this property is, in most ases, a ruial ingredient for pratially relevant parallel algorithms. As in [31] we, in general, assume that N p (oarse-grained), beause this is usually the ase in pratie. More preisely, we assume that N/p p ε for some fixed ε>0, whih is true for most ommerially available multiproessors. In Setion 3 of this paper we use a tehnique alled deterministi list ompression to obtain a deterministi list ranking algorithm with O(log p) rounds. The onneted omponents algorithm is presented in Setion 4.1. It uses a tehnique alled aelerated asading. That is, it simulates an existing PRAM algorithm for the same problem but stops the exeution of this algorithm after O(log p) rounds and then finishes the omputation with a different (new) CGM algorithm. In Setions we present deterministi parallel CGM/BSP algorithms with O(log p) ommuniation rounds for solving Problems 4 7, respetively. All algorithms require linear sequential work per round and eah proessor reates, during the entire algorithm, messages of total size O(log(p)(N/p)). To our knowledge, these are the only urrently known CGM and BSP algorithms for these problems. Before we proeed with presenting the above-mentioned results we give, in Setion 2, an overview of the BSP and CGM models and their relationship to eah other. 2. The BSP, CGM, and Related Parallel Computing Models. The BSP model was introdued in [38], and the CGM model was presented in [9] [11]. A BSP omputer is a olletion of proessor/memory modules onneted by a router that an deliver messages in a point to point fashion between the proessors. A BSP-style omputation is divided into a sequene of supersteps separated by barrier synhronizations. In omputation supersteps the proessors perform omputations on data that was present loally at the beginning of the superstep. In ommuniation supersteps data is

4 186 F. Dehne, A. Ferreira, E. Cáeres, S. W. Song, and A. Ronato exhanged among the proessors via the router. A BSP omputer has the following parameters: N refers to the problem size, p is the number of proessors, L is the minimum time between synhronization steps (measured in basi omputation units), and g is the ratio of overall system omputational apaity (number of omputation operations) per unit time divided by the overall system ommuniation apaity (number of messages of unit size that an be delivered by the router) per unit time. A BSP algorithm with a total of λ supersteps has the following omputation and ommuniation osts: The omputation ost T omp of the algorithm is T omp = λ i=1 wi omp, where the ith omputation superstep is assigned ost womp i = max{l, t 1,...,t p }. Here t j is the number of basi omputation operations performed by proessor j in the ith superstep. The ommuniation ost T omm of the algorithm is T omm = λ i=1 wi omm, where the ith ommuniation superstep is assigned ost womm i = maxp j=1 {wi omm, j }. Here wi omm, j is the ommuniation ost inurred by proessor j in the ith superstep. Assuming that proessor p j reeives messages of lengths r 1,...,r j and sends messages of lengths {s 1,...,s j } during the ith superstep, womm, i j = max{l, g( j u=1 r i + j u=1 s i)}. A CGM(N, p) uses only two parameters, N and p, and assumes a olletion of p proessors with N/ p loal memory eah onneted by a router that an deliver messages in a point to point fashion. A CGM algorithm onsists of an alternating sequene of omputation rounds and ommuniation rounds separated by barrier synhronizations. A omputation round is equivalent to a omputation superstep in the BSP model, and the total omputation ost T omp is defined analogously. A ommuniation round onsists of a single h-relation with h N/p. The ost womm i of eah ommuniation round has the same value, referred to as H N,p. Therefore, the total ommuniation ost T omm of a CGM algorithm with λ ommuniation rounds is simply T omm = λh N,p. In a reent overview of different BSP and related models, Goodrih [20] referred to the CGM as the weak-crew BSP. The main differene between the BSP and CGM models is that the latter allows only one single type of ommuniation operation, the h-relation, and simply ounts the number of h-relations as its main measure of ommuniation ost. Note that every CGM algorithm is also a BSP algorithm but not vie versa. The CGM model aims at designing simple and pratial yet theoretially optimal or effiient parallel algorithms for oarse-grained parallel systems (N/p 1). Algorithms do usually require a lower bound on N/p, e.g., N/p p or N/p p ε. The CGM model targets in partiular the ase where the overall omputation speed is onsiderably larger than the overall ommuniation speed, whih is usually the ase. Sine the message size is maximal, the CGM model minimizes the message overhead assoiated with sending a single message (regardless of its length), whih is very important in pratie. In summary, the main advantage of the CGM model is that it allows us to model the ommuniation ost of a parallel algorithm by one single parameter, the number of ommuniation rounds, λ. Note that, while λ is the main parameter determining the performane of a CGM algorithm, we will also indiate other parameters like the loal omputation and total ommuniation when analyzing CGM algorithms (more disussion below). Previous definitions of the CGM model (e.g., [9]) distinguished between the osts of an h-relation and the ost of sorting. However, due to the reent results in [20] these are equivalent for N/p p ε. Also, it is not neessary to distinguish between

5 Effiient Parallel Graph Algorithms for Coarse-Grained Multiomputers and BSP 187 balaned and unbalaned h-relations. In both ases eah proessor sends/reeives O(h) data but in the balaned ase the data exhanged between any two proessors is always O(h/p) whereas it may vary in the unbalaned ase. It has been shown in [2] that an unbalaned h-relation an be simulated by O(1) balaned h-relations. Another possible ase of imbalane ours when some proessors send/reeive less than O(h) data. This problem has been studied in [22] (E-BSP model) but is not a topi of this paper. Another related model is the QSM [19] where ommuniation is performed via a shared memory with emphasis on memory ontention, i.e., simultaneous aesses to the same shared memory ell. The main differene between QSM and CGM is that the latter allows the use of only one single ommuniation sheme, the h-relation, for resolving memory ontention. Consult [20] for an overview of the different BSP related models. The CGM model has reently attrated onsiderable interest in the parallel algorithms ommunity. Several researhers have used it to design parallel algorithms for various problem areas (see, e.g., [8] and various CGM artiles in the reent ACM SPAA and IPPS/SPDP proeedings). For most parallel proessing implementations, the number of h-relations required is the overriding fator determining the performane beause the protool overhead assoiated with eah message is usually substantial. In the extreme ase this may of ourse not be true. A few extremely large messages may require more time than a larger number of very small messages. Therefore, it is also useful to study the total message size per proessor over all rounds, i.e., the sum of the sizes of all message sent by a proessor during the entire omputation. For all algorithms presented in this paper, the total message size per proessor over all rounds is O(log(p)(N/p)). In the remainder of this paper we design and analyze our parallel graph algorithms in the CGM model. The relationship to the BSP model is given by the following: OBSERVATION 1. A CGM algorithm with λ rounds and omputation ost T omp orresponds to a BSP algorithm with λ supersteps, ommuniation ost O(gλ(N/ p)), and the same omputation ost T omp. 3. List Ranking. Let L be a linked list of length n represented by a vetor s[1 n]. For eah i {1 n}, s[i] is a pointer to the list element following i in the list L. We refer to i and s[i]ass-neighbors. The last element λ of the list L is the one with s[λ] = λ. The distane between i and j, d L (i, j), is the number of nodes in L between i and j plus one (i.e., the distane is zero if and only if i = j, and it is one if and only if one node follows the other). The list ranking problem onsists of omputing for eah i L the distane between i and λ, referred to as rank L (i) = d L (i,λ). On a CGM(n, p), eah proessor initially stores n/p list elements of L with their respetive pointers. Figure 1 shows an example of a linked list with n = 24 elements stored on a CGM with p = 4 proessors, n/p = 6 elements per proessor. In the remainder of this setion we present a deterministi CGM list ranking algorithm whih requires O(log p) rounds. We need the following definitions. An r-ruling set L of L is defined as a subset of seleted list elements of L that has the following properties: (1) No two neighboring elements are seleted. (2) The distane of any unseleted element to the next seleted element is at most r. For eah i L let s L [i] be the next seleted element

6 188 F. Dehne, A. Ferreira, E. Cáeres, S. W. Song, and A. Ronato rst ; - - ;; ; 1? A 1 ; 6? ; + ;? 3 ; 6 ) ; - ;? ) ; 6 last ; Pro. 1 Pro. 2 Pro. 3 Pro. 4 Fig. 1. A linked list L stored on a CGM(n, p). j L with respet to the order implied by L. We represent an r-ruling set L again as a linked list where eah element i L is assigned a pointer to s L [i] and a value w(i) = d L (i, s L [i]) representing the distane between i and s L (i). See Figure 2 for an illustration. The weighted list ranking problem on L with weights w( ) refers to omputing for eah i L the sum of the weights w( j) of all nodes j L between i and λ. Algorithm 1 outlines the top-level of our CGM list ranking method. An illustrating example is given in Figure 3. Algorithm 1. CGM List Ranking Input: A linked list L of length n stored on a CGM(n, p), n/p p ε. Eah proessor stores n/p list elements i L and their respetive pointers s[i]. Output: For eah list element i its rank rank L (i) in L. 1. The CGM omputes an O(p 2 )-ruling set R of size R = O(n/p) as desribed in Algorithm 2 below. 2. R is broadast to all proessors. This broadast is implemented as O(log p) ommuniation rounds where the number of proessors storing R is initially one and then doubled in eah ommuniation round. L L' Fig. 2. A list L and a 3-ruling set L.

7 Effiient Parallel Graph Algorithms for Coarse-Grained Multiomputers and BSP 189 Input: After Steps 1&2: After Step 3: After Step 4: After Step 5: Fig. 3. Illustration of Algorithm Eah proessor, sequentially, performs weighted list ranking on R with weights w( ), thereby omputing for eah j R its rank rank L ( j) in L. 4. Eah list element i L R has at most distane O(p 2 ), with respet to L, to its next element s R [i] inr. Using O(log p) CGM sorting steps, the CGM simulates O(log p) pointer jumping steps of the standard PRAM list ranking algorithm, thereby omputing for eah i L R its distane d L (i, s R [i]) to its next element s R [i]inr. 5. Eah proessor loally omputes the ranks of its list elements i L R as follows: rank L (i) = d L (i, s R [i]) + rank L (s R [i]). The hard part of the algorithm is the omputation of a O(p 2 )-ruling set R of size O(n/p) whih we disuss in detail below. Given suh a O(p 2 )-ruling set, the orretness of Algorithm 1 is straightforward. We also observe that Steps 2 5 an be easily implemented on a CGM in O(log p) ommuniation rounds with O(n/ p) loal omputation per round. Note that in Step 4 we are simulating PRAM pointer jumping steps on the CGM. Eah suh step an be implemented in O(1) rounds by applying Goodrih s sorting algorithm [20] for n/p p ε. In the remainder of this setion we introdue a new tehnique, alled deterministi list ompression, whih allows us to ompute a O(p 2 )-ruling set in O(log p) ommuniation rounds. The basi idea behind deterministi list ompression is to have an alternating sequene of ompress and onatenate phases. In a ompress phase we selet a subset of list elements, and in a onatenate phase we use pointer jumping to work our way towards building a linked list of seleted elements.

8 190 F. Dehne, A. Ferreira, E. Cáeres, S. W. Song, and A. Ronato We define an s-interval of length k as a sequene I = (i 1,...,i k ) of list elements with s[i j ] = i j+1,1 j k 1. The two neighbors of the s-interval I are the two list elements n 1, n 2 not in I suh that s[n 1 ] = i 1 and s[i k ] = n 2. We refer to a maximal s-interval I = (i 1,...,i k ) of list elements whih are all stored at the same proessor as a loal s-interval. A maximal s-interval I = (i 1,...,i k ) of list elements where no two subsequent i j, i j+1 are stored at the same proessor is alled a nonloal s-interval. For the ompress phase we apply the deterministi oin tossing tehnique of [7] but with a different set of labels. Instead of the memory address used in [7], we use the number of the proessor storing list element i as its label l(i). A list element s[i] is alled a loal maximum if l(s 1 [i]) <l(i) >l(s[i]) where s 1 [i] is the list element j suh that s[ j] = i. Deterministi oin tossing selets the loal maxima with respet to our new label l(i). Note that there are at most p different labels. Therefore, for every nonloal s-interval, the distane between two seleted elements is O(p). Deterministi oin tossing will not selet any element of a loal s-interval. However, sine all elements of a loal s-interval are stored at the same proessor, we an proess them sequentially. Details will be disussed later. An example is shown in Figure 4. The linked list is the same as in Figure 1. The x-axis represents the ranks of the list elements and the y-axis represents the number of the proessor on whih eah element is stored. The solid irles are the loal maxima, i.e., those list elements seleted by our modified deterministi oin tossing. The above proedure selets, within eah nonloal s-interval, list elements with a distane between 2 and O( p) between onseutive seleted elements. If we would onnet the seleted elements by diret links between them, and repeat the proedure on this new linked list, and iterate this O(log p) times, then we would obtain a O(p 2 )-ruling set of size O(n/ p). However, this would require more than O(log p) ommuniation rounds. In order to apply deterministi oin tossing for a seond, third, et., time, the previously seleted elements need to be linked by pointers. Sine two subsequent elements seleted by deterministi oin tossing an have distane O( p), this may require O(log p) ommuniation rounds eah. Hene, this approah would require a total of O(log 2 p) ommuniation rounds. Notie, however, that if two seleted elements are at distane (p), then it is unneessary to apply further deterministi oin tossing in order to redue the number of seleted elements. The basi approah of our algorithm is therefore to interleave pointer jumping steps (onatenate) and deterministi oin tossing operations with respet to our new labeling sheme (ompress). More preisely, we will have only one pointer jumping step between subsequent deterministi oin tossing steps, and suh pointer jumping label=pro. # 4 ; A 3 ; AAU 2 1 s s ; rst last Fig. 4. The linked list elements from Figure 1 labeled by their proessor numbers.

9 Effiient Parallel Graph Algorithms for Coarse-Grained Multiomputers and BSP 191 operations will not be applied to those list elements that are pointing to seleted elements. This onludes the high level overview of our deterministi list ompression tehniques. The following desribes the algorithm in detail. Algorithm 2. CGM Algorithm for Computing a O(p 2 )-Ruling Set Input: A linked list L of length n stored on a CGM(n, p), n/p p ε. Eah proessor stores n/p list elements i L and their respetive pointers s[i]. Output: A set of seleted nodes of L representing a O(p 2 )-ruling set of size O(n/p). 1. Eah proessor loally marks all its list elements as not seleted. 2. Using global sort [20], all proessors determine for eah list element i its two neighbors s 1 [i] and s[i]. Then eah proessor loally performs for eah loal list element i: IF l(s 1 [i]) <l(i) >l(s[i]) THEN mark i as seleted. 3. Eah proessor loally determines its loal s-intervals. Using global sort, all proessors determine the two neighbors of eah loal s-intervals. Eah proessor examines loally all of its loal s-intervals. For eah loal s- interval of size larger than two, every seond element is marked as seleted. If a loal s-interval has size two and not both neighbors have a smaller label, then both elements of the loal s-interval are marked as not seleted. 4. FOR k = 1 log p DO 4.1. Using global sort, all proessors determine for eah list element i the urrent s[i] and s[s[i]]. Then eah proessor loally performs for eah loal list element i: IF s[i] is not seleted THEN set s[i] := s[s[i]] Using global sort [20], all proessors determine for eah list element i its two urrent neighbors s 1 [i] and s[i]. Then eah proessor loally performs for eah loal list element i: IF (s 1 [i], i and s[i] are seleted) AND NOT (l(s 1 [i]) <l(i) > l(s[i])) AND (l(s 1 [i]) l(i)) AND (l(i) l(s[i])) THEN mark i as not seleted Eah proessor loally determines its loal s-intervals. Using global sort, all proessors determine the two neighbors of eah loal s- interval. Eah proessor examines loally all of its loal s-intervals. For eah loal s-interval of size larger than two, every seond element is marked as not seleted. If a loal s-interval has size two and not both neighbors have a smaller label, then both elements of the loal s-interval are marked as not seleted. 5. The proessor storing the last element λ of L marks λ as seleted. We first prove that the set of elements seleted at the end of Algorithm 2 is of size at most O(n/ p).

10 192 F. Dehne, A. Ferreira, E. Cáeres, S. W. Song, and A. Ronato LEMMA 1. After the kth iteration in Step 4, there are no more than two seleted elements in any s-interval of length 2 k for the original list L. PROOF. By indution on k. The lemma is trivial for k = 1. Let S be an s-interval of length 2 k for the original list L and let S 1 and S 2 be the first and seond half of S, respetively. By assumption, after iteration k 1 in Step 4, S 1 and S 2 ontain at most two seleted elements eah. Denote the at most four seleted elements, ordered with respet to L,bye 1,...,e 4. Note that the distane between these elements, with respet to L,is at most 2 k. Consider now iteration k in Step 4. After Step 4.1, any two seleted elements that have distane at most 2 k and no seleted element between them (with respet to L) are diretly onneted by a link whih is represented by the urrent s vetor. Hene, s[e 1 ] = e 2, s[e 2 ] = e 3, s[e 3 ] = e 4. Note that there are at most four (nonsymmetri) possible ases for applying Steps 4.2 and 4.3 to e 1,...,e 4.Ife 1,...,e 4 are pairwise distint, then only Step 4.2 applies. If e 1,...,e 4 are all equal, then Step 4.3 applies. If e 1,...,e 3 are pairwise distint and e 3 = e 4, then Step 4.2 applies to e 1,...,e 3 and Step 4.3 to e 3, e 4. The other possible ase is that e 1 e 2 = e 3 e 4. Then Step 4.2 applies to e 1 and e 4, and Step 4.3 applies to e 2, e 3. In any ase, after Steps 4.2 and 4.3, at most two elements among e 1,...,e 4 are still seleted. We now prove that onseutive elements seleted at the end of Algorithm 2 have distane at most O(p 2 ) in the original list L. We first need the following. LEMMA 2. After every exeution of Step 4.3, the distane (with respet to the urrent vetor s) of two onseutive seleted elements is at most O(p). PROOF. Consider two onseutive seleted elements e 1 and e 2. There are three possible ases: (1) e 1 and e 2 and all elements between them, with respet to s, have the same label. (2) For e 1 and e 2 and all elements between them, with respet to s, any pair of onseutive elements has different labels. (3) The mixed ase where some pairs of onseutive elements have the same label. Note that in the mixed ase it is impossible for three or more onseutive elements to have the same label, beause one of them would be a seleted element (Step 4.3). In Case (1) the distane of e 1 and e 2 is at most two beause of Step 4.3. In Case (2), due to our modified labeling sheme, there are at most p different labels. Hene, by the standard argument for deterministi oin tossing [7] the distane is at most O(p). Case (3) is equivalent to Case (2) exept for a fator of two. LEMMA 3. After the kth exeution of Step 4.3, two s-neighbors with respet to the urrent vetor s have distane O(2 k ) with respet to the original list L. PROOF. Obvious onsequene of the fat that only k pointer jumping operations were so far exeuted in Step 4.1. LEMMA 4. No two onseutive seleted elements have a distane of more than O(p 2 ) with respet to the original list L.

11 Effiient Parallel Graph Algorithms for Coarse-Grained Multiomputers and BSP 193 PROOF. Follows from Lemmas 2 and 3. LEMMA 5. On a CGM with p proessors and O(n/p) loal memory per proessor, n/p p ε (ε > 0), Algorithm 2 determines a O(p 2 )-ruling set of size O(n/p) in O(log p) ommuniation rounds with O(n/p) loal omputation per round. PROOF. The orretness of Algorithm 2 follows from Lemmas 1 4. For n/p p ε, the sorting algorithm in [20] requires O(1) rounds with O(n/p) loal omputation per round. The ommuniation performed by Algorithm 2 onsists of two global sorts for Steps 2 and 3, and 3 log p global sorts for Step 4. All loal omputation in eah round an be performed in linear time. Thus, the laimed time omplexity follows. In summary, we obtain THEOREM 1. The list ranking problem for a linked list with n verties an be solved on a CGM with p proessors and O(n/p) loal memory per proessor, n/p p ε (ε>0), using O(log p) ommuniation rounds and O(n/p) loal omputation per round. Euler Tour in a Tree. We omplete this setion with an important appliation of our list ranking algorithm. Let T = (V, E) be an undireted tree. We assume that the tree T is represented by an adjaeny list for eah vertex. Let T = (V, E ) be a direted graph with E ={(v, w), (w, v) {v, w} E}. T is Eulerian beause indegree(v) = outdegree(v) for eah vertex v. The Euler tour problem for T onsists of (1) omputing a path that traverses eah edge exatly one and returns to its starting point, and (2) omputing for eah vertex its rank in this path. THEOREM 2. The Euler tour problem for a tree T with n verties an be solved on a CGM with p proessors and O(n/p) loal memory per proessor, n/p p ε (ε>0), in O(log p) ommuniation rounds with O(n/p) loal omputation per round. PROOF. We ompute T and its adjaeny lists by doubling all edges of T and applying sorting [20]. Furthermore, we make the adjaeny list for eah vertex irular by applying list ranking. For eah edge (i, j) in T let next(i, j) be the suessor of its entry in the respetive adjaeny list. We now apply the well known method by Tarjan and Vishkin [37] to define an Euler tour ordering on the edges of T whih assigns to eah edge (i, j) as suessor the edge next( j, i). Computing next( j, i) for every edge (i, j) redues to sorting. Finally, we apply our list ranking method desribed above to determine the rank of eah vertex in the Euler tour. 4. Porting PRAM Algorithms to the CGM/BSP. In this setion we present CGM graph algorithms for onneted omponent labeling, lowest ommon anestor omputation, tree ontration, open ear deomposition, and bionneted omponent labeling. All algorithms require O(log p) rounds. They are obtained by porting the respetive PRAM algorithms to the CGM, using our CGM list ranking algorithm presented in Setion 3.

12 194 F. Dehne, A. Ferreira, E. Cáeres, S. W. Song, and A. Ronato Our approah for onneted omponent labeling is to simulate the respetive PRAM method using sorting [20], but only for O(log p) rounds, and then finish the omputation with a new O(log p) rounds CGM algorithm based on binary merge. Our solutions for the other problems redue these tasks to a sequene of list ranking and onneted omponent omputations Conneted Components and Spanning Forest. Consider an undireted graph G = (V, E) with n verties and m edges. Eah vertex v V has a unique label between 1 and n. Two verties u and v are onneted if there is an undireted path of edges from u to v. A onneted subset of verties is a subset of verties where eah pair of verties is onneted. A onneted omponent of G is defined as a maximal onneted subset. Algorithm 3 shown below omputes the onneted omponents of G on a CGM with p proessors and O((n+m)/p) loal memory per proessor. Step 1 simulates the PRAM algorithm by Shiloh and Vishkin [34] but with only log p iterations of the main loop instead of the O(log n) iterations in Shiloh and Vishkin s original algorithm. Step 2 onverts all resulting trees into stars. It follows from [34] that the obtained graph G = (V, E ) has at most O(n/p) verties. Hene, V an be broadast to all proessors. E an still be of size O(m) and is distributed over the p proessors. E i refers to the edges of E stored at proessor i. We note that the spanning forest of (V, E i E j ) for two sets E i, E j is of size O( V ) = O(n/p). Hene, eah of the spanning forests omputed in Step 3 an be stored in the loal memory of a single proessor. We merge pairs of spanning forests until, after log p rounds, the spanning forest of G is stored at proessor P 0.In Step 4 all proessors then update the partial onneted omponent information obtained in Step 1. Algorithm 3. CGM Algorithm for Conneted Component Computation Input: An undireted graph G = (V, E) with n verties and m edges stored on a p proessor CGM with total O(n + m) memory, (n + m)/p p ε. Output: The onneted omponents of G represented by the values parent(v) for all verties v V. 1. Using sorting [20], simulate log p iterations of the main loop of the PRAM algorithm by Shiloh and Vishkin [34]. 2. Use the Euler tour algorithm in Setion 3 to onvert all resulting trees into stars. For eah v V, set parent(v) to be the root of the star ontaining v. Let G = (V, E ) be the graph onsisting of the superverties and live edges obtained. Distribute G suh that eah proessor stores the entire set V and a subset of m/p edges of E. Let E i be the edges stored at proessor i,0 i p Set all proessors to ative mode. FOR k := 1tologp DO Partition the ative proessors into groups of size two. FOR eah group P i, P j of ative proessors, i < j, IN PARALLEL DO Proessor P j sends its edge set E j to proessor P i. Proessor P j is set to passive mode.

13 Effiient Parallel Graph Algorithms for Coarse-Grained Multiomputers and BSP 195 Proessor P i omputes the spanning forest (V, E s ) of the graph SF = (V, E i E j ) and sets E i := E s. Set all proessors to ative mode and broadast E Eah proessor P i omputes sequentially the onneted omponents of the graph G = (V, E 0 ). For eah vertex v of V let parent (v) be the smallest label parent(w) of a vertex w V whih is in the same onneted omponent with respet to G = (V, E 0 ). For eah vertex u V stored at proessor P i set parent(u) := parent (parent(u)). (Note that parent(u) V.) We obtain THEOREM 3. Algorithm 3 omputes the onneted omponents and spanning forest of a graph G = (V, E) with n verties and m edges on a CGM with p proessors and O((n + m)/p) loal memory per proessor, (n + m)/p p ε (ε > 0), using O(log p) ommuniation rounds and O((n + m)/p) loal omputation per round Lowest Common Anestor. The lowest ommon anestor, LCA(u,v), of two verties u and v of a rooted tree T = (V, E) is the vertex w that is an anestor to both u and v, and is farthest from the root. We apply the approah in [21] whih uses Euler tour and range-minimum alulation. It onsists of the following operations: 1. ompute an Euler tour for T ; 2. find the levels, in T, for all verties of the Euler tour; 3. for eah vertex v find l(v) and r(v) whih denote the leftmost and rightmost appearanes, respetively, of v in the Euler tour; 4. solve the range-minima problem defined as follows: given a list of numbers {b 1, b 2,...,b n } and an interval [i, j], with 1 i j n, find the minimum of {b i,...,b j }. Operation 1 an be performed in O(log p) ommuniation rounds as shown in Setion 3. The same holds for Operation 2 beause it an also be redued to Euler tour omputation. We now onsider Operation 3. Given an Euler tour of verties a 1, a 2,...,a n, a 1. The element a i = v is the leftmost (rightmost) appearane of v if and only if level(a i 1 )=level(v) -1(level(a i+1 )=level(v) - 1, respetively) [21]. This requires the use of indies of the verties in the Euler tour. Our Euler tour is not given as an array of verties but rather by pointers to suessor verties in the tour. This is easily solved by using as index the rank obtained by list ranking from Setion 3. The rank an be viewed as an index going bakwards from the list. After list ranking (in O(log p) ommuniation rounds), Operation 3 an be ompleted in O(1) ommuniation rounds. Operation 4 also uses indies. Likewise, instead of indies, we utilize the ranks of the verties of the Euler tour. Given two verties u and v of T. In order to find the minimum level over the interval [r(u), l(v)], let rank(r(u)) =i and rank(l(v) = j. To find the required minimum, eah of the p proessors onsiders verties in its loal memory with ranks between j and i and finds the minimum level (O(n/p time). The

14 196 F. Dehne, A. Ferreira, E. Cáeres, S. W. Song, and A. Ronato minimum of the resulting p numbers an be found in O(1) ommuniation rounds. We obtain LEMMA 6. Consider a rooted tree T = (V, E) with n verties. The LCA problem an be solved on a CGM with p proessors and O(n/p) loal memory per proessor, n/p p ε (ε > 0), using O(log p) ommuniation rounds and O(n/ p) loal omputation per round Tree Contration and Expression Tree Evaluation. We observe that the lassial tree ontration and expression tree evaluation algorithm of [27] an be easily implemented on a CGM to run in O(log p) ommuniation rounds. Reall that the tree ontration algorithm of [27] applies an alternating sequene of log n rake and ompress operations to ontrat a tree T into a single node. On a CGM, one an simply apply log p rake and ompress operations, whih require O(log p) rounds, and ompresses the tree into a smaller tree T of size O(n/p). The tree T an then be proessed sequentially at a single proessor. In order to perform expression tree evaluation suh that not only the value of the root but the value of every node is alulated, the PRAM method of [27] an be employed for the CGM as well. Let T = T 1,...,T t = T be the sequene of log p trees reated by the alternating sequene of log p rake and ompress operations. After T has been evaluated sequentially, log p expansion steps rereate the above sequene of trees in reverse order. A node v that is added in an expansion was deleted either by a rake or a ompress operation. In both ases its value an easily be omputed in O(1) rounds by a loal neighborhood operation. LEMMA 7. Tree ontration and expression tree evaluation on a tree T with n nodes an be performed on a CGM with p proessors and O(n/p) loal memory per proessor, n/p p ε (ε > 0), using O(log p) ommuniation rounds and O(n/p) loal omputation per round Open Ear Deomposition. We first reall the definition of an ear deomposition and open ear deomposition (see, e.g., [30]). Consider an undireted graph G = (V, E) with n verties and m edges. For the remainder, we assume that G is onneted. An ear deomposition of G is an ordered partition of E into r simple paths P 1,...,P r suh that P 1 is a yle, and, for eah 2 i r, P i is a simple path with endpoints belonging to P 1 P i 1 but with none of its internal verties belonging to P j, j < i. The paths P i are alled ears. If none of the P i, i > 1, is a yle, then the deomposition is alled an open ear deomposition. For an edge e in P i, let i be the ear number of e. An edge e E is a ut-edge if e does not lie on a yle in G. A onneted undireted graph G is 2-edge onneted if it ontains no ut-edge. G has an ear deomposition if and only if G is 2-edge onneted. A ut-vertex is a vertex whose removal leaves G disonneted. G is bionneted if it ontains at least three verties and has no utvertex. It has been shown that G has an open ear deomposition if and only if it is bionneted [40]. Let T be a spanning tree of G rooted at some node r. Consider the preorder numbering of T with respet to r and let preorder(v) be the preorder number of a node v, 0

15 Effiient Parallel Graph Algorithms for Coarse-Grained Multiomputers and BSP 197 preorder(v) n 1. For an edge e let la(e) denote the lowest ommon anestor of e = (u,v), as defined in Setion 4.2. An edge in G T is a nontree edge with respet to T. Any nontree edge e of G reates a yle in T {e}, alled the fundamental yle of e with respet to T. For eah vertex v onsider all fundamental yles reated by nontree edges inident to a desendant of v in T. Let low(v) be the minimum preorder number of a node w whih lies on any suh fundamental yle. (If no suh w exists, then let low(v) = n.) The lassial O(log n) time PRAM algorithms for ear deomposition and open ear deomposition [26], [28], [30] onsist of a onstant number of the following operations: 1. find a spanning tree T for G; 2. find the lowest ommon anestor la(e) of every nontree edge e = (u,v); 3. number the verties of T in preorder from 0 to n 1; 4. ompute low(v) for eah vertex v of V ; 5. find the onneted omponents of a graph with at most n verties and m edges; 6. sort at most m numbers. In the previous setions we have shown that Operations 1, 2, and 5 an be performed in O(log p) ommuniation rounds, and we an use [20] for Operation 6. We now disuss Operations 3 and 4. Preorder numbering of a tree an be solved by applying the Euler tour tehnique of Setion 3. The preorder number of a vertex in a tree is one plus the number of forward edges found in the Euler tour before enountering the vertex. The omputation of low(v) for eah vertex v of V an be redued to tree ontration and lowest ommon anestor omputation disussed in Setions 4.3 and 4.2, respetively. For eah node w of T define as label(w) the minimum preorder label of la(e) for all nontree edges inident to w. For eah vertex v of V, low(v) is the minimum label(w) of all nodes w in the subtree of T rooted at v. Hene, tree ontration on T using the min operation omputes all low(v) values. LEMMA 8. For a graph G = (V, E) with n verties and m edges, the ear deomposition, open ear deomposition, set of ut-edges, and set of ut-verties, if they exist, an be omputed on a CGM with p proessors and O((n + m)/p) loal memory per proessor, (n + m)/p p ε (ε>0), using O(log p) ommuniation rounds and O(n/p) loal omputation per round Bionneted Components. Testing 2-edge onnetivity and bionnetivity in O (log p) rounds is a simple onsequene of Lemma 8 and Theorem 3. The algorithms shown in the previous setion ompute the ear deomposition or open ear deomposition if G is 2-edge onneted or bionneted, respetively. If G is not 2-edge onneted or bionneted, then we obtain the ut-edges or ut-verties, respetively, as follows. We observe that ut-edges are tree edges (parent(v), v) with the property that low(v) > preorder(v). IfG is 2-edge onneted, a ut-vertex v an be deteted by examining the ear numbers of all edges inident to v. The smallest of those ear numbers will our twie, while any other of those ear numbers ours twie if and only if v is a ut-vertex.

16 198 F. Dehne, A. Ferreira, E. Cáeres, S. W. Song, and A. Ronato LEMMA 9. The 2-edge onneted and bionneted omponents of a onneted undireted graph G with n verties and m edges an be determined on a CGM with p proessors and O((n + m)/p) loal memory per proessor, (n + m)/p p ε (ε> 0), using O(log p) ommuniation rounds and O((n + m)/ p) loal omputation per round. 5. Conlusion. In this paper we presented deterministi parallel CGM and BSP algorithms for the following well-known graph problems: (1) list ranking, (2) Euler tour onstrution in a tree, (3) omputing the onneted omponents and spanning forest, (4) lowest ommon anestor preproessing, (5) tree ontration and expression tree evaluation, (6) omputing an ear deomposition or open ear deomposition, and (7) 2-edge onnetivity and bionnetivity (testing and omponent omputation). The CGM algorithms require O(log p) ommuniation rounds and linear sequential work per round, assuming loal memory (n + m)/p p ε (ε>0) whih is true for most ommerially available multiproessors. Our results imply BSP algorithms with O(log p) supersteps, O(g log( p)((n + m)/ p)) ommuniation time, and O(log( p)((n + m)/ p)) loal omputation time. The number of ommuniation rounds obtained is independent of the problem size and grows only logarithmially with respet to p. It is still an open question whether better algorithms exist (even randomized). Our algorithm for Problem (1) improves signifiantly on previous results, and our algorithms for Problems (2) (7) are the first pratially relevant parallel algorithms for these standard graph problems. Aknowledgments. The authors thank P. Flohini, N. Santoro, and I. Rieping for their helpful disussions on the first version of this paper [5]. Referenes [1] R. J. Anderson, and L. Snyder, A Comparison of Shared and Nonshared Memory Models of Computation, Pro. IEEE, 79(4) (1993), [2] D. Bader, D. Helman, and J. Jájá, Pratial Parallel Algorithms for Personalized Communiation and Integer Sorting, J. Exper. Algorithms, 1 (1996), [3] A. Bäumker and W. Dittrih, Parallel Algorithms for Image Proessing: Pratial Algorithms with Experiments, Pro. International Parallel Proessing Symposium, IEEE Computer Soiety Press, Los Alamitos, CA, 1996, pp [4] G. E. Blelloh, C. E. Leiserson, B. M. Maggs, and C. G. Plaxton, A Comparison of Sorting Algorithms for the Connetion Mahine CM-2, Pro. ACM Symposium on Parallel Algorithms and Arhitetures, 1991, pp [5] E. Caeres, F. Dehne, A. Ferreira, P. Flohini, I. Rieping, A. Ronato, N. Santoro, and S. W. Song, Effiient Parallel Graph Algorithms for Coarse Grained Multiomputers and BSP, Pro. 24th International Colloquium on Automata, Languages and Programming (ICALP 97), Leture Notes in Computer Siene, Vol. 1256, Springer-Verlag, Berlin, 1997, pp [6] R. Cole, Parallel Merge Sort, SIAM J. Comput., 17(4) (1988), [7] R. Cole and U. Vishkin, Approximate Parallel Sheduling. Part I: The Basi Tehnique with Appliations to Optimal Parallel List Ranking in Logarithmi Time, SIAM J. Comput., 17(1) (1988).

17 Effiient Parallel Graph Algorithms for Coarse-Grained Multiomputers and BSP 199 [8] F. Dehne (editor), Speial Issue on Coarse Grained Parallel Algorithms, Algorithmia, 24(3/4) (1999). [9] F. Dehne, A. Fabri, and A. Rau-Chaplin, Salable Parallel Geometri Algorithms for Coarse Grained Multiomputers, Pro. ACM Symposium on Computational Geometry, 1993, pp [10] F. Dehne, A. Fabri, and C. Kenyon, Salable and Arhiteture Independent Parallel Geometri Algorithms with High Probability Optimal Time, Pro. IEEE Symposium on Parallel and Distributed Proessing, 1994, pp [11] F. Dehne, X. Deng, P. Dymond, A. Fabri, and A. A. Kokhar, A Randomized Parallel 3D Convex Hull Algorithm for Coarse Grained Multiomputers, Pro. ACM Symposium on Parallel Algorithms and Arhitetures, 1995, pp [12] F. Dehne and S. W. Song, Randomized Parallel List Ranking for Distributed Memory Multiproessors, Pro. Asian Computing Siene Conferene, Singapore, De. 1996, Leture Notes in Computer Siene, Vol. 1179, Springer-Verlag, Berlin, 1996, pp [13] X. Deng, A Convex Hull Algorithm for Coarse Grained Multiproessors, Pro. International Symposium on Algorithms and Computation, [14] X. Deng and P. Dymond, Effiient Routing and Message Bounds for Optimal Parallel Algorithms, Pro. International Parallel Proessing Symposium, [15] X. Deng and N. Gu, Good Programming Style on Multiproessors, Pro. IEEE Symposium on Parallel and Distributed Proessing, 1994, pp [16] G. A. Dira, On Rigid Ciruit Graphs, Abh. Math. Sem. Univ. Hamburg, 25 (1961), [17] A. Ferreira, A. Rau-Chaplin, and S. Ubeda, Salable 2D Convex Hull and Triangulation Algorithms for Coarse-Grained Multiomputers, Pro. IEEE Symposium on Parallel and Distributed Proessing, San Antonio, IEEE Press, New York, 1995, pp [18] A. V. Gerbessiotis and L. G. Valiant, Diret Bulk-Synhronous Parallel Algorithms, Pro. Sandinavian Workshop on Algorithm Theory, Leture Notes in Computer Siene, Vol. 621, Springer-Verlag, Berlin, 1992, pp [19] P. B. Gibson, Y. Matias, and V. Ramahandran, The Queue-Read Queue-Write Asynhronous PRAM Model, Pro. 9th ACM SPAA, 1997, pp [20] M. T. Goodrih, Communiation Effiient Parallel Sorting, ACM Symposium on Theory of Computing, [21] J. Jájá, An Introdution to Parallel Algorithms, Addison-Wesley, Reading, MA, [22] B. H. H. Juurlink and H. A. G. Wijshoff, The E-BSP Model: Inorporating General Loality and Unbalaned Communiation into the BSP Model, Pro. Euro-Par 96, vol. II, Leture Notes in Computer Siene, Vol. 1124, Springer-Verlag, Berlin, [23] P. Klein, Effiient Parallel Algorithms for Chordal Graphs, Pro. IEEE Symposium on Foundations of Computer Siene, 1989, pp [24] P. Klein, Parallel Algorithms for Chordal Graphs, in [32], pp [25] Hui Li and K. C. Sevik, Parallel Sorting by Overpartitioning, Pro. ACM Symposium on Parallel Algorithms and Arhitetures, 1994, pp [26] Y. Maon, B. Shieber, and U. Vishkin, Parallel Ear Deomposition Searh (EDS) and ST-Numbering in Graphs, Theoret. Comput. Si., 47 (1986), [27] G. L. Miller and J. H. Reif, Parallel Tree Contration and Its Appliation, Pro. IEEE Symposium on Foundations of Computer Siene, 1985, pp [28] G. L. Miller and V. Ramahandran, Effiient Parallel Ear Deomposition with Appliations, Manusript, MSRI, Berkeley, January [29] K. Mulmuley, Computational Geometry: An Introdution Through Randomized Algorithms, Prentie- Hall, Englewood Cliffs, NJ, [30] V. Ramahandran, Parallel Open Ear Deomposition with Appliations to Graph Bionnetivity and Trionnetivity, in [32], pp [31] M. Reid-Miller, List Ranking and List San on the Cray C-90, Pro. ACM Symposium on Parallel Algorithms and Arhitetures, 1994, pp [32] J. H. Reif (editor), Synthesis of Parallel Algorithms, Morgan Kaufmann, San Mateo, CA, [33] D. J. Rose, R. E. Tarjan, and G. S. Lueker. Algorithmi Aspets of Vertex Elimination on Graphs, SIAM J. Comput., 5 (1976), [34] Y. Shiloh and U. Vishkin, An O(log n) Parallel Connetivity Algorithm, J. Algorithms, 3(1) (1983),

Algorithms for External Memory Lecture 6 Graph Algorithms - Weighted List Ranking

Algorithms for External Memory Leture 6 Graph Algorithms - Weighted List Ranking Leturer: Nodari Sithinava Sribe: Andi Hellmund, Simon Ohsenreither 1 Introdution & Motivation After talking about I/O-effiient