PRO: a Model for Parallel Resource-Optimal Computation


Assefaw Hadish Gebremedhin, Isabelle Guérin Lassous, Jens Gustedt, Jan Arne Telle

Abstract

We present a new parallel computation model that enables the design of resource-optimal scalable parallel algorithms and simplifies their analysis. The model rests on the novel idea of incorporating relative optimality as an integral part and measuring the quality of a parallel algorithm in terms of granularity.

Key words: Parallel computers, Parallel models, Parallel algorithms, Complexity analysis

Research supported by IS-AUR of The Aurora Programme, a France-Norway Collaboration Research Project of The Research Council of Norway, The French Ministry of Foreign Affairs and The Ministry of Education, Research and Technology.

Department of Informatics, University of Bergen, N-5020, Norway. {assefaw, telle}@ii.uib.no
LIP & INRIA Rhône-Alpes, France. Isabelle.Guerin-Lassous@inria.fr
LORIA & INRIA Lorraine, France. gustedt@loria.fr

1 Introduction

One of the challenges in parallel processing is the development of a general purpose and effective model of parallel computation. Unlike the realm of sequential computation, where the Random Access Machine (RAM) has successfully served as a standard computational model, no such single unifying model exists in the field of parallel computation.

From an algorithmic point of view, the performance of a sequential algorithm is adequately evaluated using its execution time, making the RAM powerful enough for analysis and design. On the other hand, the performance evaluation of a parallel algorithm involves several metrics, the most important of which are speedup, optimality (or efficiency), and scalability. Speedup and optimality are relative in nature as they are expressed with respect to some sequential algorithm. The notion of relativity is also relevant from a practical point of view. A parallel algorithm is often not designed from scratch, but rather starting from a sequential algorithm.

We believe that a parallel computation model should incorporate the most important performance evaluation metrics of parallel algorithms as the RAM does for sequential algorithms. In light of this, the objective of the current work is to develop a model that simplifies the design and analysis of resource-optimal scalable parallel algorithms.

In an interesting survey paper [21], Maggs et al. suggest that an ideal parallel computation model be designed within the philosophy of simplicity and descriptivity balanced with prescriptivity. The Parallel Resource-Optimal (PRO) computation model proposed here is developed within this spirit.

The key features of the PRO model that distinguish it from existing parallel computation models are relativity, resource-optimality, and a new quality measure referred to as granularity. Relativity pertains to the fact that the design and analysis of a parallel algorithm in PRO is done relative to the time and space complexity of a specific sequential algorithm.
Consequently, the parameters involved in the analysis of a PRO-algorithm are the number of processors $p$, the input size $n$, and the time and space complexity of the reference sequential algorithm $A_{\mathrm{seq}}$.

A PRO-algorithm is required to be both time- and space-optimal (hence resource-optimal). A parallel algorithm is said to be time- (or work-) optimal if the overall computation and communication cost involved in the algorithm is proportional to the time complexity of the sequential algorithm used as a reference. Similarly, it is said to be space-optimal if the overall memory space used by the algorithm is of the same order as the memory usage of the underlying sequential version. As a consequence of its time-optimality, a PRO-algorithm always yields linear speedup relative to the reference sequential algorithm; i.e., the ratio between the sequential and parallel runtime is a linear function of $p$.

The quality of a PRO-algorithm is measured by the range of values $p$ can assume while linear speedup is maintained. This range is captured by an attribute of the model called the granularity function $\mathrm{Grain}(n)$. In other words, a PRO-algorithm with granularity $\mathrm{Grain}(n)$ is required to be fully scalable for all values of $p$ such that $p = O(\mathrm{Grain}(n))$. The granularity function $\mathrm{Grain}(n)$ determines the quality of one PRO-algorithm over another relative to the same sequential time and space complexity. The higher the function value $\mathrm{Grain}(n)$, the better the algorithm.

Note that since optimality (and consequently linear speedup) is hard-wired into the model, the runtime cannot be a quality measure for a PRO-algorithm. However, in a sense, the time and space complexity of the reference sequential algorithm $A_{\mathrm{seq}}$ can also be seen as a quality measure of the PRO-algorithm. This means that the selection of the reference sequential algorithm is of significant importance.

The rest of the paper is organized as follows. In Section 2 we give an overview of existing parallel computation models and highlight their limitations. In Section 3 the PRO model is presented in detail and in Section 4 it is compared with a selection of existing parallel models. In Section 5 we illustrate how the model is used in design and analysis using the matrix multiplication problem as an example. In Section 6 we give a PRO-algorithm for one-to-all broadcast, as an example of a primitive communication routine found in a potential PRO library. Finally, we conclude the paper in Section 7 with some remarks.

2 Existing models and their limitations

There exists a plethora of parallel computation models in the literature.
On the theoretical end, we find the Parallel Random Access Machine (PRAM) model [8, 17], which in its simplest form posits a set of processors, with global shared memory, executing the same program in lockstep. In this model, every processor can access any memory location at unit cost of time regardless of the memory location. This assumption is in obvious disagreement with the reality of practical parallel computers. However, despite its serious limitation of being an idealized model of parallel computation, the standard PRAM model still serves as a theoretical framework for investigating the maximum possible computational parallelism in a given task. Specifically, on this model, the NC versus P-complete dichotomy [14] is used to reflect the ease/hardness of finding a parallel algorithm for a problem. Recall that NC denotes the class of problems which have PRAM-algorithms with polylogarithmic runtime and a polynomial number of processors in the input size. A problem is said to be P-complete if an NC-algorithm for it would imply that all polynomial time sequential problems have NC-algorithms. The problem of whether or not P = NC has long been an open problem.

The NC versus P-complete dichotomy has its own practical limitations. First, P-completeness does not depict a full picture of non-parallelizability since the runtime requirement for an NC parallel algorithm is so stringent that the classification is confined to the case where up to a polynomial number of processors in the input size is available (fine-grained setting). For example, there are P-complete problems for which a less ambitious, but still satisfactory, runtime can be obtained by parallelization in PRAM [23]. In a fine-grained setting, since the number of processors $p$ is a function of the input size $n$, it is customary to express speedup as a function of $n$. Thus the speedup obtained using an NC-algorithm is sometimes referred to as exponential. In a coarse-grained setting, i.e., the case where $n$ and $p$ are orders of magnitude apart, speedup is expressed as a function of only $p$, and some recent results [4, 7, 9, 15] show that this approach is practically relevant. Second, an NC-algorithm is not necessarily work-optimal, and thus not resource-optimal considering runtime and memory space as resources that one wants to use efficiently. Third, even if we restrict ourselves to work-optimal NC-algorithms and apply Brent's scheduling principle, which says an algorithm in theory can be simulated on a machine with fewer processors by only a constant factor more work, implementations of PRAM algorithms often do not reflect this optimality in practice [6].
This is mainly because the PRAM model does not account for non-local memory access (communication), and a Brent-type simulation relies heavily on cheap communication.

To overcome the defects of the PRAM related to its failure of capturing real machine characteristics, the advocates of shared memory models propose several modifications to the standard PRAM model. In particular, they enhance the standard PRAM model by taking practical machine features such as memory access, synchronization, latency and bandwidth issues into account. Pointers to the PRAM family of models can be found in [21].

Critics of shared memory models argue that the PRAM family of models fails to capture the nature of existing parallel computers with distributed memory architectures. Examples of distributed memory computational models suggested as alternatives include the Postal Model [2] and the Block Distributed Memory (BDM) model [18]. Other categories of parallel models such as low-level, hierarchical memory, and network models are briefly reviewed in [21].

A more recent category of parallel models is that of bridging models, a notion popularized by Valiant with his introduction of the Bulk Synchronous Parallel (BSP) model [22]. The BSP model is a distributed memory coarse-grained model in which parallel computation proceeds as a sequence of barrier synchronized supersteps where local computation and communication are distinct rather than intermingled phases. Culler et al. [5] extended the BSP model by allowing asynchronous execution and better accounting for communication overhead. Their model is coined LogP, an acronym for the four parameters involved. A common feature of the BSP, LogP, and other related models is their lack of simplicity: each model involves relatively many parameters, making analysis and design of algorithms cumbersome. The Coarse Grained Multicomputer (CGM) model [4, 7] was later proposed in an effort to retain the advantages of BSP while keeping the model simple (making the number of parameters fewer). The BSP and its special case CGM have been the primary inspirations for our model. Thus, we believe that many optimal CGM and BSP algorithms can easily be adapted to PRO.

The PRO model attempts to partially address the limitations of existing parallel models highlighted in the foregoing discussion and compromises between theoretical and practical considerations. One of its advantages from a theoretical point of view is that it is a step forward towards the identification of the class of problems for which good parallel algorithms exist in a more realistic (practical) way than the existing NC versus P-complete classification.

Our main goal in suggesting the PRO model is to enable the development of scalable and resource-optimal parallel algorithms and to simplify their analysis.
The model identifies the salient features of a parallel algorithm that make its practical scalability and optimality highly likely. In this regard, it can be considered as a set of guidelines for the algorithm designer in the quest for developing scalable and efficient parallel algorithms. Hence, PRO can be seen as a mix of a parallel computation model and a parallel algorithm design scheme, which makes it biased towards the software side in its role as a bridging model.

3 The PRO model

The PRO model is an algorithm design and analysis tool used to deliver a practical, optimal, and scalable parallel algorithm relative to a specific sequential algorithm whenever this is possible. Let $\mathrm{Time}(n)$ and $\mathrm{Space}(n)$ denote the time and space complexity of a specific sequential algorithm for a given problem with input size $n$. The PRO model is defined to have the following attributes.

Machine: The underlying machine is assumed to consist of $p$ processors with $M = O(\mathrm{Space}(n)/p)$ private memory each, interconnected by some communication network (or shared memory) that can deliver messages in a point-to-point fashion. A message can consist of several machine words.

Coarseness: We assume that $p \le M$, i.e., the size of the local memory of each processor is big enough to store $p$ words.

Execution: For any value $p = O(\mathrm{Grain}(n))$, a PRO-algorithm consists of $O(\mathrm{Time}(n)/p^2)$ supersteps. A superstep consists of a local computation phase and an interprocessor communication phase. In particular, in each superstep, each processor
- sends at most one message to every other processor,
- sends and receives at most $M$ words in total, and pays a unit of time per word sent and received,
- performs local computation, and pays a unit of time per operation.
The algorithm has parallel runtime $\mathrm{Time}(n, p) = O(\mathrm{Time}(n)/p)$.

Note that the granularity function $\mathrm{Grain}(n)$ is a quality measure of a PRO-algorithm.

As discussed in the LogP paper [5], technological factors are forcing parallel systems to converge towards systems formed by a collection of essentially complete computers connected by a robust communication network. The machine model assumption of PRO is consistent with this convergence and maps well on several existing parallel computer architectures. The memory requirement $M = O(\mathrm{Space}(n)/p)$ ensures that the space utilized by the underlying sequential algorithm is uniformly distributed among the processors. Since we may, without loss of generality, assume that $\mathrm{Space}(n) = \Omega(n)$, the implication is that the private memory of each processor is large enough to store its share of the input and any additional space the sequential algorithm might require. When $\mathrm{Space}(n) = \Theta(n)$, note that the input data must be uniformly distributed on the processors. In this case the machine model assumption of PRO is similar to the assumption in the CGM model [7].

The coarseness assumption $p \le M$ is consistent with the structure of existing parallel machines and machines to be built in the foreseeable future. The assumption is required to simplify the implementation of collecting messages (from possibly all other processors) on a single processor.

The execution of a PRO-algorithm consists of a sequence of supersteps (or rounds). The length of (time spent in) a superstep on each processor is determined by the sum of the time used for communication and the time used for local computation. The length of a superstep $s$ in the parallel algorithm seen as a whole, denoted by $\mathrm{Time}_s(n, p)$, is the maximum over the lengths of the superstep on all processors. Conceptually, we can think of the supersteps as being synchronized by a barrier set at the end of the longest superstep across the processors. However, note that in PRO the processors are not in reality required to synchronize at the end of each superstep. The parallel runtime $\mathrm{Time}(n, p)$ of the algorithm is the sum of the lengths of all the supersteps. Notice that the hypothetical barriers result in only a constant factor more time compared with an analysis that does not assume the barriers.

In PRO, since a processor sends at most one message to every other processor in each superstep, each processor is involved in at most $2(p - 1)$ messages per superstep. Therefore, the requirement $\mathrm{Steps} = O(\mathrm{Time}(n)/p^2)$ on the number of supersteps implies that the overall time paid per processor for communication overhead and latency is $O(\mathrm{Time}(n)/p)$ and hence can be neglected from the analysis, since our goal is to achieve an $O(\mathrm{Time}(n)/p)$ parallel runtime.
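
The overhead argument can be spelled out as a short calculation. This is only a sketch: the constant $c$, bounding the fixed overhead and latency paid per message, is a symbol introduced here for illustration and does not appear in the model's definition.

```latex
\begin{align*}
\text{messages per processor per superstep} &\le 2(p-1),\\
\text{overhead per processor over the whole run}
  &\le c \cdot 2(p-1) \cdot \mathrm{Steps}
   = O\!\left(p \cdot \frac{\mathrm{Time}(n)}{p^{2}}\right)
   = O\!\left(\frac{\mathrm{Time}(n)}{p}\right).
\end{align*}
```

The bound has the same order as the target runtime, which is why the per-message overhead does not need its own parameter in the analysis.
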
Notice that the bandwidth restriction of the underlying architecture, which in turn contributes to the communication cost, is accounted for since each processor pays a unit of time per word sent and received. This is not an unrealistic assumption, noting that the network throughput (accounted in machine words) on modern architectures such as high performance clusters is relatively close to the CPU frequency and to the CPU/memory bandwidth.

The condition $\mathrm{Time}(n, p) = O(\mathrm{Time}(n)/p)$ requires that a PRO-algorithm be optimal and yield linear speedup relative to the sequential algorithm used as a reference. This requirement ensures the potential practical use of the parallel algorithm.
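
As an illustration of the cost accounting described in this section, the following minimal sketch computes the length of a superstep as the maximum over processors of computation plus communication time, and the parallel runtime as the sum over supersteps. The function names and the list-based encoding of per-processor costs are assumptions made for this example; they are not part of the model's definition.

```python
from typing import Sequence, Tuple


def superstep_length(comp_ops: Sequence[int], words_moved: Sequence[int]) -> int:
    """Length of one superstep: the slowest processor's computation plus communication.

    comp_ops[i]    -- local operations performed by processor i (one time unit each)
    words_moved[i] -- words sent and received by processor i (one time unit each)
    """
    if len(comp_ops) != len(words_moved):
        raise ValueError("need one entry per processor in both sequences")
    return max(c + w for c, w in zip(comp_ops, words_moved))


def parallel_runtime(supersteps: Sequence[Tuple[Sequence[int], Sequence[int]]]) -> int:
    """Parallel runtime: the sum of the lengths of all supersteps."""
    return sum(superstep_length(comp, words) for comp, words in supersteps)


def respects_memory(p: int, M: int, words_moved: Sequence[int]) -> bool:
    """Check the coarseness assumption p <= M and the per-superstep limit of M words."""
    return p <= M and all(w <= M for w in words_moved)


if __name__ == "__main__":
    # Two supersteps on three processors (made-up costs, for illustration only).
    run = [([5, 7, 6], [2, 1, 4]), ([3, 3, 3], [0, 0, 0])]
    print(parallel_runtime(run))
```

Because each superstep is charged at the pace of its slowest processor, an unbalanced distribution of work or data shows up directly in the computed runtime.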

Observation 1. A PRO-algorithm relative to a sequential algorithm with runtime $O(\mathrm{Time}(n))$ and space requirement $O(\mathrm{Space}(n))$ has maximum granularity $\mathrm{Grain}(n) = O(\min\{\sqrt{\mathrm{Space}(n)}, \sqrt{\mathrm{Time}(n)}\}) = O(\sqrt{\mathrm{Space}(n)})$. A PRO-algorithm that achieves this is said to have optimal grain.

Observation 1 is due to the limit on the memory size of each processor, the coarseness assumption, and the bound on the number of supersteps. The limit on the size of the private memory of each processor ($M = O(\mathrm{Space}(n)/p)$) together with the coarseness assumption $p \le M$ imply $p = O(\sqrt{\mathrm{Space}(n)})$. The fact that the number of supersteps of a PRO-algorithm should be $\mathrm{Steps} = O(\mathrm{Time}(n)/p^2)$ gives $p = O(\sqrt{\mathrm{Time}(n)/\mathrm{Steps}})$ upon resolving for $p$, and we clearly have $\mathrm{Steps} \ge 1$. Finally, note that $\mathrm{Time}(n) \ge \mathrm{Space}(n)$, since an algorithm has to at least read the input.

Since a PRO-algorithm yields linear speedup for any $p = O(\mathrm{Grain}(n))$, a result like Brent's scheduling principle is implicit for these values of $p$. But Observation 1 shows that we cannot start with an arbitrary number of processors and efficiently simulate on a smaller number. So Brent's scheduling principle does not hold with full generality in the PRO model, which is in accordance with practical observations.

The design of a PRO-algorithm may sometimes involve subroutines for which there do not exist sequential counterparts. Examples of such tasks include communication primitives such as broadcasting, data (re)-distribution routines, and load balancing routines. Such routines are often required in various parallel algorithms. With a slight abuse of notation, we call such parallel routines PRO-algorithms if the overall computation and communication cost is linear in the input size to the routines.

4 Comparison with other models

In this section we compare the PRO model with PRAM, QSM, BSP, LogP, and CGM. Our tabular format for comparison is inspired by a similar presentation in [13], where the Queuing Shared Memory (QSM) model is proposed.
The columns of Table 1 are labeled with the names of the selected models in our comparison and some relevant features of a model are listed along the rows.

| | PRAM [8] | QSM [13] | BSP [22] | LogP [5] | CGM [4] | PRO |
|---|---|---|---|---|---|---|
| synch. | lock-step | bulk-synch. | bulk-synch. | asynch. | asynch. | asynch. |
| memory | sh. | sh. | dist. | dist. | priv. | priv. |
| commun. | SM | SM | MP | MP | MP/SM | MP/SM |
| parameters | $n$ | $p, g, n$ | $p, g, L, n$ | $p, g, l, o, n$ | $p, n$ | $p, n, A_{\mathrm{seq}}$ |
| granularity | fine | fine | coarse | fine | coarse | $\mathrm{Grain}(n)$ |
| speedup | NA | NA | NA | NA | NA | $\Theta(p)$ |
| optimal | NA | NA | NA | NA | NA | rel. $A_{\mathrm{seq}}$ |
| quality | time | time | time | time | rounds | $\mathrm{Grain}(n)$ |

Table 1: Comparison of parallel computational models

The synchrony assumption of the model is indicated in the row labeled synch. Lock-step indicates that the processors are fully synchronized at each step (of a universal clock), without accounting for synchronization. Bulk-synchrony indicates that there can be asynchronous operations between synchronization barriers. The row labeled memory shows how the model views the memory of the parallel computer: sh. indicates globally accessible shared memory, dist. stands for distributed memory, and priv. is an abstraction for the case where the only assumption is that each processor has access to private (local) memory. In the last variant the whole memory could either be distributed or shared. The row labeled commun. shows the type of interprocessor communication assumed by the model. Shared memory (SM) indicates that communication is effected by reading from and writing to a globally accessible shared memory. Message-passing (MP) denotes the situation where processors communicate by explicitly exchanging messages in a point-to-point fashion. The MP abstraction hides the details of how the message is routed through the interprocessor communication network.

The parameters involved in the model are indicated in the row labeled parameters. The number of processors is denoted by $p$, $n$ is the input size, $A_{\mathrm{seq}}$ is the reference sequential algorithm, $l$ is the communication cost (latency), $L$ is a single parameter that accounts for the sum of latency ($l$) and the cost for a barrier synchronization, $g$ is the bandwidth gap, and $o$ is the overhead associated with sending or receiving a message. Note that the machine characteristics $l$ and $o$ are taken into account in PRO, even though they are not explicitly used as parameters. Latency is taken into consideration since the length of a superstep is determined by the sum of the computational and communication cost.
Communication overhead is hidden by the PRO-requirement that states $\mathrm{Steps} = O(\mathrm{Time}(n)/p^2)$.

The row labeled granularity indicates whether the model is fine-grained, coarse-grained, or whether a more precise measure is used. We say that a model is coarse-grained if it applies to the case where $n \gg p$ and call it fine-grained if it relies on using up to a polynomial number of processors in the input size. In PRO granularity is exactly the quality measure $\mathrm{Grain}(n)$, and appears as one of the attributes of the model.

The rows labeled speedup and optimal indicate the speedup and resource optimality requirements imposed by the model. Whenever these issues are not directly addressed by the model or are not applicable, the word NA is used. Note that these requirements are hard-wired in the model in the case of PRO. The label rel. $A_{\mathrm{seq}}$ means that the algorithm is optimal relative to the time and space complexity of $A_{\mathrm{seq}}$. We point out that the goal in the design of algorithms using the CGM model [7, 4] is usually stated as that of achieving optimal algorithms, but the model per se does not impose an optimality requirement.

The last row indicates the quality measure of an algorithm designed using the different models. For all models except CGM and PRO, the quality measure is running time. In CGM, the number of supersteps (rounds) is usually presented as a quality measure. In PRO the quality measure is granularity, one of the features that make PRO fundamentally different from all existing parallel computation models.

5 Algorithm example: matrix multiplication

In this section we illustrate how the PRO model is used, by starting from a given sequential algorithm and then designing and analyzing a parallel algorithm relative to it. We use the standard matrix multiplication algorithm with three nested for-loops as an example. This example is chosen for its simplicity and since our objective at this stage is to illustrate the use of a new model rather than solving a difficult problem.

Consider the problem of computing the product $C$ of two $m \times m$ matrices $A$ and $B$ (input size $n = m^2$). We want to design a PRO-algorithm relative to the standard sequential matrix multiplication algorithm, which has $\mathrm{Time}(n) = O(n^{3/2})$ and $\mathrm{Space}(n) = O(n)$.

We assume that the input matrices $A$ and $B$ are distributed among the processors $P_0, \ldots, P_{p-1}$ so that processor $P_i$ stores rows (respectively columns) $\frac{m}{p} i + 1$ to $\frac{m}{p}(i + 1)$ of $A$ (respectively $B$). The output matrix $C$ will be row-partitioned among the processors in a similar fashion. Notice that with this data distribution each processor can, without communication, compute a block of $m^2/p^2$ of the $m^2/p$ entries of $C$ expected to reside on it.
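
The data distribution just described can be made concrete with a small sketch. It is only an illustration under stated assumptions: indices are 0-based (the text counts rows from 1), $p$ is assumed to divide $m$, and the helper names `block_range`, `sequential_matmul`, and `local_block` are hypothetical. The sketch checks that the rows of $A$ and the columns of $B$ held by one processor suffice to compute an $(m/p) \times (m/p)$ block of $C$ without communication.

```python
from typing import Dict, List, Tuple

Matrix = List[List[int]]


def block_range(i: int, m: int, p: int) -> range:
    """0-based indices of the rows of A (and columns of B) stored on processor P_i.

    Assumes p divides m; this is the 0-based form of rows (m/p)i+1 .. (m/p)(i+1).
    """
    size = m // p
    return range(i * size, (i + 1) * size)


def sequential_matmul(A: Matrix, B: Matrix) -> Matrix:
    """Reference sequential algorithm: three nested loops, m^3 multiply-adds."""
    m = len(A)
    C = [[0] * m for _ in range(m)]
    for i in range(m):
        for j in range(m):
            for k in range(m):
                C[i][j] += A[i][k] * B[k][j]
    return C


def local_block(A: Matrix, B: Matrix, i: int, p: int) -> Dict[Tuple[int, int], int]:
    """Entries of C that P_i can compute from its own rows of A and columns of B."""
    m = len(A)
    own = block_range(i, m, p)
    # Only rows `own` of A and columns `own` of B are read: no communication needed.
    return {(r, c): sum(A[r][k] * B[k][c] for k in range(m)) for r in own for c in own}


if __name__ == "__main__":
    A = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
    B = [[1, 0, 2, 0], [0, 1, 0, 2], [3, 0, 1, 0], [0, 3, 0, 1]]
    C = sequential_matmul(A, B)
    block = local_block(A, B, i=1, p=2)
    print(all(C[r][c] == v for (r, c), v in block.items()))
```

Each local block has $(m/p)^2$ entries and costs $\Theta(m)$ operations per entry, which is the per-superstep computation that the analysis of this section counts.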
In order to compute the next block of $m^2/p^2$ entries, processor $P_i$ needs the columns of matrix $B$ that reside on processor $P_{i+1}$ (indices taken modulo $p$). In each superstep the processors in the PRO-algorithm will therefore exchange columns in a round-robin fashion and then each will compute a new block of results. Note that the block of columns exchanged by a processor in a superstep constitutes one single message. Note also that the initial distribution of the rows of matrix $A$ remains unchanged. In Algorithm 1, we have organized this sequence of computation and communication steps in a manner that meets the requirements of the PRO model.

Algorithm 1: Matrix multiplication
Input: Two $m \times m$ matrices $A$ and $B$. The rows (columns) of $A$ ($B$) are divided into contiguous blocks of $m/p$ rows (columns), stored on processors $P_0, P_1, \ldots, P_{p-1}$ respectively
Output: The product matrix $C$, where the rows are stored in contiguous blocks across the processors
for superstep $s = 1$ to $p$ do
    foreach processor $P_i$ do
        $P_i$ computes the local sub-matrix product of its rows and current columns;
        $P_{(i+1) \bmod p}$ sends its current block of columns to $P_i$;
        $P_i$ receives a new current block of columns from $P_{(i+1) \bmod p}$;

Algorithm 1 has $p$ supersteps ($\mathrm{Steps} = p$). In each superstep, the time spent in locally computing each of the $m^2/p^2$ entries is $\Theta(m)$, resulting in local computing time $\Theta(m^3/p^2) = \Theta(n^{3/2}/p^2)$ per superstep. Likewise, the total size of data (words) exchanged by each processor in a superstep is $\Theta(m^2/p) = \Theta(n/p)$. Thus, the length of a superstep $s$ is $\mathrm{Time}_s(n, p) = \Theta(n^{3/2}/p^2 + n/p)$. Note that for $p = O(\sqrt{n})$, $\mathrm{Time}_s(n, p) = \Theta(n^{3/2}/p^2)$. Hence, for $p = O(\sqrt{n})$, the overall parallel runtime of the algorithm is

$$\mathrm{Time}(n, p) = \mathrm{Steps} \cdot \Theta(n^{3/2}/p^2) = \Theta(n^{3/2}/p) = \Theta(\mathrm{Time}(n)/p). \tag{1}$$

Noting that $\mathrm{Space}(n) = \Theta(n)$, we see that the memory restriction of the PRO model is respected, i.e., each processor has enough memory to handle the transactions.

In order to be able to neglect communication overhead, the condition on the number of supersteps, which in this case is just $p$, should be met. In other words, we need $p = O(\mathrm{Time}(n)/p^2) = O(n^{3/2}/p^2)$, which is true for $p = O(\sqrt{n})$. Thus the granularity function of the PRO-algorithm is $\mathrm{Grain}(n) = \sqrt{n}$. In summary,

Lemma 1. Multiplication of two $m$ by $m$ matrices has a PRO-algorithm with $\mathrm{Grain}(n) = m$ relative to a sequential algorithm with $\mathrm{Time}(n) = m^3$ and $\mathrm{Space}(n) = m^2$ (input size $n = m^2$).

From Observation 1, we note that Algorithm 1 achieves optimal granularity. Note that on a relaxed model, where the assumption that $p \le M$ is not present, the strong regularity of matrix multiplication and the exact knowledge of the communication pattern allow for algorithms that have an even finer granularity than $m$. For example, a systolic matrix multiplication algorithm has a granularity of $m^2$. However, PRO is intended to be applicable for general problems and practically relevant parallel systems.

6 Communication primitive example: one-to-all broadcast

A good parallel computation model should have a selection of algorithms for primitive communication tasks available in its algorithm design toolbox. The PRO model is intended to meet this demand, but for lack of space we give only one example.

In this section we illustrate how the PRO model allows optimal one-to-all broadcasting among its processors. Since there is no sequential basis algorithm in this case, we want an algorithm whose overall communication and computation cost is linear in the input and output sizes. More precisely, we consider the situation where the input consists of a vector of size $m$ on a single processor and the output should be a copy of this vector on each of the $p$ processors, and we want an algorithm that achieves this in $O(m)$ time using $O(m)$ memory on each processor. See Algorithm 2.

Algorithm 2: One-to-All Broadcast
Input: A vector $V$ of size $m$ on processor $P_0$
Output: A copy of $V$ on each processor
S1: $P_0$ divides $V$ into $p$ equal sized parts;
    $P_0$ sends the $i$-th part of $V$ to processor $P_i$, for each $0 < i < p$;
    foreach processor $P_i$, $i > 0$ do
        processor $P_i$ receives the $i$-th part from $P_0$;
S2: foreach processor $P_i$ do
        $P_i$ sends out the $i$-th part to $P_j$, for each $j \ne i$ and $0 < j < p$;
    foreach processor $P_j$, $j \ne 0$ do
        $P_j$ receives the $i$-th part from $P_i$, for each $i \ne j$ and $0 \le i < p$

Lemma 2. PRO Algorithm 2 implements a one-to-all broadcast of $m$ memory words in two supersteps using $O(m)$ time and $O(m)$ space per processor, for any number of processors $p \le \sqrt{m}$.

Proof: First, we note that the algorithm correctly broadcasts the desired vector $V$, while observing the space restriction, in two supersteps. We turn to the timing. In step S1 processor $P_0$ in total sends out $(p - 1)m/p$ words and each of the other processors receives a message of size $m/p$. In step S2 processor $P_i$ in total sends out $\frac{p-2}{p} m$ words. Processor $P_j$, $j \ne 0$, in total receives $\frac{p-1}{p} m$ words. The total time is dominated by the communication, which is

\begin{align}
\frac{(p-1)m}{p} + \frac{m}{p} + \frac{(p-2)m}{p} + \frac{(p-1)m}{p} &= \tag{2}\\
\frac{(3p-3)\,m}{p} &< 3m \tag{3}
\end{align}

for total time $O(m)$ as claimed.

7 Conclusion

We have introduced a new parallel computation model (called PRO) that enables the development of efficient scalable parallel algorithms and simplifies the complexity analysis of such algorithms.

The distinguishing feature of the PRO model is the novel focus on relativity, resource-optimality, and a new quality measure (granularity). In particular, the model requires a parallel algorithm to be both time- and space-optimal relative to an underlying sequential algorithm. Having optimality as a built-in requirement, the quality of a PRO-algorithm is measured by the maximum number of processors that could be used while the optimality of the algorithm is maintained.

The focus on relativity has theoretical as well as practical justifications. From a theoretical point of view, the performance evaluation metrics of a parallel algorithm include speedup and optimality, both of which are always expressed relative to some sequential algorithm. Moreover, there is an inherent asymmetry between sequential and parallel computation. A parallel algorithm would always imply a sequential algorithm, whereas the converse is usually not true. Thus, in a sense, it is natural to think of an underlying sequential algorithm whenever one speaks of a parallel algorithm. From a practical point of view, one notes that the development of a parallel algorithm is often built on some known sequential algorithm.

The fact that optimality is incorporated as a requirement in the PRO model enables one to concentrate only on parallel algorithms that are practically useful. However, the PRO model is not just a collection of some ideal features of parallel algorithms; it is also a means to achieve these features.
In particular, the attributes of the model capture the salient characteristics of a parallel algorithm that make its practical optimality and scalability highly likely.

In this sense, it can also be seen as a parallel algorithm design scheme. Moreover, the simplicity of the model eases analysis.

We believe that the PRO model is a step forward towards the identification of problems for which practically good parallel algorithms exist. Much work remains to be done, and we hope that other members of the research community will join in. As a first item on the agenda, the PRO model needs to be tested for compatibility with already existing practical parallel algorithms.

Acknowledgments

We are grateful to the anonymous referees for their helpful comments.

References

[1] A. G. Alexandrakis, A. V. Gerbessiotis, D. S. Lecomber, and C. J. Siniolakis. Bandwidth, space and computation efficient PRAM programming: The BSP approach. In Proceedings of the SUP'EUR 96 Conference, Krakow, Poland, September.
[2] A. Bar-Noy and S. Kipnis. Designing broadcasting algorithms in the Postal Model for message passing systems. In The 4th Annual ACM Symposium on Parallel Algorithms and Architectures, pages 13-22, July.
[3] R. P. Brent. The parallel evaluation of general arithmetic expressions. Journal of the ACM, 21(2).
[4] E. Caceres, F. Dehne, A. Ferreira, P. Flocchini, I. Rieping, A. Roncato, N. Santoro, and S. W. Song. Efficient parallel graph algorithms for coarse grained multicomputers and BSP. In The 24th International Colloquium on Automata, Languages and Programming, volume 1256 of LNCS. Springer Verlag.
[5] D. E. Culler, R. M. Karp, D. A. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model of parallel computation. In 4th ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming, San Diego, CA, May.
[6] F. Dehne. Coarse grained parallel algorithms. Algorithmica Special Issue on Coarse Grained Parallel Algorithms, 24(3/4).

[7] F. Dehne, A. Fabri, and A. Rau-Chaplin. Scalable parallel computational geometry for coarse grained multicomputers. International Journal on Computational Geometry, 6(3).
[8] S. Fortune and J. Wyllie. Parallelism in random access machines. In 10th ACM Symposium on Theory of Computing, May.
[9] A. H. Gebremedhin, I. Guérin Lassous, J. Gustedt, and J. A. Telle. Graph coloring on a coarse grained multiprocessor. In Ulrik Brandes and Dorothea Wagner, editors, WG 2000, volume 1928 of LNCS. Springer-Verlag.
[10] A. V. Gerbessiotis, D. S. Lecomber, C. J. Siniolakis, and K. R. Sujithan. PRAM programming: Theory vs. practice. In Proceedings of 6th Euromicro Workshop on Parallel and Distributed Processing, Madrid, Spain. IEEE Computer Society Press, January.
[11] A. V. Gerbessiotis and C. J. Siniolakis. A new randomized sorting algorithm on the BSP model. Technical report, New Jersey Institute of Technology.
[12] A. V. Gerbessiotis and L. G. Valiant. Direct bulk-synchronous parallel algorithms. Journal of Parallel and Distributed Computing, 22.
[13] P. B. Gibbons, Y. Matias, and V. Ramachandran. Can a Shared-Memory Model Serve as a Bridging Model for Parallel Computation? Theory of Computing Systems, 32(3).
[14] R. Greenlaw, H. J. Hoover, and W. L. Ruzzo. Limits to Parallel Computation: P-Completeness Theory. Oxford University Press, New York.
[15] I. Guérin Lassous, J. Gustedt, and M. Morvan. Handling graphs according to a coarse grained approach: Experiments with MPI and PVM. In Jack Dongarra, Péter Kacsuk, and N. Podhorszki, editors, 7th European PVM/MPI Users Group Meeting, volume 1908 of LNCS. Springer Verlag.
[16] K. Hawick et al. High performance computing and communications glossary. See http://nhse.npac.syr.edu/hpccgloss/.
[17] J. JáJá. An Introduction to Parallel Algorithms. Addison-Wesley.

[18] J. JáJá and K. W. Ryu. The Block Distributed Memory model. IEEE Transactions on Parallel and Distributed Systems, 8(7).
[19] R. M. Karp and V. Ramachandran. Parallel Algorithms for Shared-Memory Machines. In Jan van Leeuwen, editor, Handbook of Theoretical Computer Science, volume A, Algorithms and Complexity. Elsevier Science Publishers B.V., Amsterdam.
[20] C. P. Kruskal, L. Rudolph, and M. Snir. A complexity theory of efficient parallel algorithms. Theoretical Computer Science, 71(1):95-132, March.
[21] B. M. Maggs, L. R. Matheson, and R. E. Tarjan. Models of parallel computation: A survey and synthesis. In 28th HICSS, volume 2, pages 61-70, January.
[22] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8).
[23] J. S. Vitter and R. A. Simons. New classes for parallel complexity: A study of unification and other complete problems for P. IEEE Transactions on Computers, C-35(5).
