PRO: a Model for Parallel Resource-Optimal Computation

Assefaw Hadish Gebremedhin, Isabelle Guérin Lassous, Jens Gustedt, Jan Arne Telle

Abstract

We present a new parallel computation model that enables the design of resource-optimal scalable parallel algorithms and simplifies their analysis. The model rests on the novel idea of incorporating relative optimality as an integral part and measuring the quality of a parallel algorithm in terms of granularity.

Key words: Parallel computers, Parallel models, Parallel algorithms, Complexity analysis

Research supported by IS-AUR of The Aurora Programme, a France-Norway Collaboration Research Project of The Research Council of Norway, The French Ministry of Foreign Affairs and The Ministry of Education, Research and Technology.

Assefaw Hadish Gebremedhin and Jan Arne Telle: Department of Informatics, University of Bergen, N-5020, Norway. {assefaw, telle}@ii.uib.no
Isabelle Guérin Lassous: LIP & INRIA Rhone-Alpes, France. Isabelle.Guerin-Lassous@inria.fr
Jens Gustedt: LORIA & INRIA Lorraine, France. gustedt@loria.fr

1 Introduction

One of the challenges in parallel processing is the development of a general-purpose and effective model of parallel computation. Unlike the realm of sequential computation, where the Random Access Machine (RAM) has successfully served as a standard computational model, no such single unifying model exists in the field of parallel computation. From an algorithmic point of view, the performance of a sequential algorithm is adequately evaluated using its execution time, which makes the RAM powerful enough for analysis and design. The performance evaluation of a parallel algorithm, on the other hand, involves several metrics, the most important of which are speedup, optimality (or efficiency), and scalability. Speedup and optimality are relative in nature, as they are expressed with respect to some sequential algorithm. The notion of relativity is also relevant from a practical point of view: a parallel algorithm is often not designed from scratch, but rather starting from a sequential algorithm.

We believe that a parallel computation model should incorporate the most important performance evaluation metrics of parallel algorithms, as the RAM does for sequential algorithms. In light of this, the objective of the current work is to develop a model that simplifies the design and analysis of resource-optimal scalable parallel algorithms.

In an interesting survey paper [21], Maggs et al. suggest that an ideal parallel computation model be designed within the philosophy of simplicity and descriptivity balanced with prescriptivity. The Parallel Resource-Optimal (PRO) computation model proposed here is developed within this spirit.

The key features of the PRO model that distinguish it from existing parallel computation models are relativity, resource-optimality, and a new quality measure referred to as granularity. Relativity pertains to the fact that the design and analysis of a parallel algorithm in PRO is done relative to the time and space complexity of a specific sequential algorithm. Consequently, the parameters involved in the analysis of a PRO-algorithm are the number of processors p, the input size n, and the time and space complexity of the reference sequential algorithm $A_{seq}$.

A PRO-algorithm is required to be both time- and space-optimal (hence resource-optimal). A parallel algorithm is said to be time- (or work-) optimal if the overall computation and communication cost involved in the algorithm is proportional to the time complexity of the sequential algorithm used as a reference. Similarly, it is said to be space-optimal if the overall memory space used by the algorithm is of the same order as the memory usage of the underlying sequential version. As a consequence of its time-optimality, a PRO-algorithm always yields linear speedup relative to the reference sequential algorithm; i.e., the ratio between the sequential and parallel runtime is a linear function of p.

The quality of a PRO-algorithm is measured by the range of values p can assume while linear speedup is maintained. This range is captured by an attribute of the model called the granularity function Grain(n). In other words, a PRO-algorithm with granularity Grain(n) is required to be fully scalable for all values of p such that $p = O(Grain(n))$. The granularity function Grain(n) determines the quality of one PRO-algorithm over another relative to the same sequential time and space complexity. The higher the function value Grain(n), the better the algorithm.

Note that since optimality (and consequently linear speedup) is hard-wired into the model, the runtime cannot be a quality measure for a PRO-algorithm. However, in a sense, the time and space complexity of the reference sequential algorithm $A_{seq}$ can also be seen as a quality measure of the PRO-algorithm. This means that the selection of the reference sequential algorithm is of significant importance.

The rest of the paper is organized as follows. In Section 2 we give an overview of existing parallel computation models and highlight their limitations. In Section 3 the PRO model is presented in detail and in Section 4 it is compared with a selection of existing parallel models. In Section 5 we illustrate how the model is used in design and analysis using the matrix multiplication problem as an example. In Section 6 we give a PRO-algorithm for one-to-all broadcast, as an example of a primitive communication routine found in a potential PRO library. Finally, we conclude the paper in Section 7 with some remarks.

2 Existing models and their limitations

There exists a plethora of parallel computation models in the literature.
On the theoretical end, we find the Parallel Random Access Machine (PRAM) model [8, 17], which in its simplest form posits a set of processors, with global shared memory, executing the same program in lockstep. In this model, every processor can access any memory location at unit cost of time regardless of the memory location. This assumption is in obvious disagreement with the reality of practical parallel computers. However, despite its serious limitation of being an idealized model of parallel computation, the standard PRAM model still serves as a theoretical framework for investigating the maximum possible computational parallelism in a given task. Specifically, on this model, the NC versus P-complete dichotomy [14] is used to reflect the ease/hardness of finding a parallel algorithm for a problem. Recall that NC denotes the class of problems which have PRAM-algorithms with polylogarithmic runtime and a polynomial number of processors in the input size. A problem is said to be P-complete if an NC-algorithm for it would imply that all polynomial time sequential problems have NC-algorithms. Whether or not P = NC has long been an open problem.

The NC versus P-complete dichotomy has its own practical limitations. First, P-completeness does not depict a full picture of non-parallelizability, since the runtime requirement for an NC parallel algorithm is so stringent that the classification is confined to the case where up to a polynomial number of processors in the input size is available (fine-grained setting). For example, there are P-complete problems for which a less ambitious, but still satisfactory, runtime can be obtained by parallelization in PRAM [23]. In a fine-grained setting, since the number of processors p is a function of the input size n, it is customary to express speedup as a function of n. Thus the speedup obtained using an NC-algorithm is sometimes referred to as exponential. In a coarse-grained setting, i.e., the case where n and p are orders of magnitude apart, speedup is expressed as a function of p only, and some recent results [4, 7, 9, 15] show that this approach is practically relevant. Second, an NC-algorithm is not necessarily work-optimal, and thus not resource-optimal considering runtime and memory space as resources that one wants to use efficiently. Third, even if we restrict ourselves to work-optimal NC-algorithms and apply Brent's scheduling principle, which says that an algorithm in theory can be simulated on a machine with fewer processors by only a constant factor more work, implementations of PRAM algorithms often do not reflect this optimality in practice [6].
This is mainly because the PRAM model does not account for non-local memory access (communication), and a Brent-type simulation relies heavily on cheap communication.

To overcome the defects of the PRAM related to its failure to capture real machine characteristics, the advocates of shared memory models propose several modifications to the standard PRAM model. In particular, they enhance the standard PRAM model by taking practical machine features such as memory access, synchronization, latency and bandwidth issues into account. Pointers to the PRAM family of models can be found in [21].

Critics of shared memory models argue that the PRAM family of models fails to capture the nature of existing parallel computers with distributed memory architectures. Examples of distributed memory computational models suggested as alternatives include the Postal Model [2] and the Block Distributed Memory (BDM) model [18]. Other categories of parallel models such as low-level, hierarchical memory, and network models are briefly reviewed in [21].

A more recent category of parallel models is that of bridging models, a notion popularized by Valiant with his introduction of the Bulk Synchronous Parallel (BSP) model [22]. The BSP model is a distributed memory coarse-grained model in which parallel computation proceeds as a sequence of barrier-synchronized supersteps where local computation and communication are distinct rather than intermingled phases. Culler et al. [5] extended the BSP model by allowing asynchronous execution and better accounting for communication overhead. Their model is coined LogP, an acronym for the four parameters involved. A common feature of the BSP, LogP, and other related models is their lack of simplicity: each model involves relatively many parameters, making analysis and design of algorithms cumbersome. The Coarse Grained Multicomputer (CGM) model [4, 7] was later proposed in an effort to retain the advantages of BSP while keeping the model simple (making the number of parameters fewer). The BSP and its special case CGM have been the primary inspirations for our model. Thus, we believe that many optimal CGM and BSP algorithms can easily be adapted to PRO.

The PRO model attempts to partially address the limitations of existing parallel models highlighted in the foregoing discussion and compromises between theoretical and practical considerations. One of its advantages from a theoretical point of view is that it is a step forward towards the identification of the class of problems for which good parallel algorithms exist, in a more realistic (practical) way than the existing NC versus P-complete classification. Our main goal in suggesting the PRO model is to enable the development of scalable and resource-optimal parallel algorithms and to simplify their analysis.
The model identifies the salient features of a parallel algorithm that make its practical scalability and optimality highly likely. In this regard, it can be considered as a set of guidelines for the algorithm designer in the quest for developing scalable and efficient parallel algorithms. Hence, PRO can be seen as a mix of a parallel computation model and a parallel algorithm design scheme, which makes it biased towards the software side in its role as a bridging model.

3 The PRO model

The PRO model is an algorithm design and analysis tool used to deliver a practical, optimal, and scalable parallel algorithm relative to a specific sequential algorithm whenever this is possible. Let Time(n) and Space(n) denote the time and space complexity of a specific sequential algorithm for a given problem with input size n. The PRO model is defined to have the following attributes.

Machine: The underlying machine is assumed to consist of p processors with $M = O(Space(n)/p)$ private memory each, interconnected by some communication network (or shared memory) that can deliver messages in a point-to-point fashion. A message can consist of several machine words.

Coarseness: We assume that $p \le M$, i.e., the size of the local memory of each processor is big enough to store p words.

Execution: For any value $p = O(Grain(n))$, a PRO-algorithm consists of $O(Time(n)/p^2)$ supersteps. A superstep consists of a local computation phase and an interprocessor communication phase. In particular, in each superstep, each processor
- sends at most one message to every other processor,
- sends and receives at most M words in total, and pays a unit of time per word sent and received,
- performs local computation, and pays a unit of time per operation.
The algorithm has parallel runtime $Time(n, p) = O(Time(n)/p)$.

Note that the granularity function Grain(n) is a quality measure of a PRO-algorithm.

As discussed in the LogP paper [5], technological factors are forcing parallel systems to converge towards systems formed by a collection of essentially complete computers connected by a robust communication network. The machine model assumption of PRO is consistent with this convergence and maps well onto several existing parallel computer architectures. The memory requirement $M = O(Space(n)/p)$ ensures that the space utilized by the underlying sequential algorithm is uniformly distributed among the processors. Since we may, without loss of generality, assume that $Space(n) = \Omega(n)$, the implication is that the private memory of each processor is large enough to store its share of the input and any additional space the sequential algorithm might require. When $Space(n) = \Theta(n)$, note that the input data must be uniformly distributed on the processors. In this case the machine model assumption of PRO is similar to the assumption in the CGM model [7].

The coarseness assumption $p \le M$ is consistent with the structure of existing parallel machines and machines to be built in the foreseeable future. The assumption is required to simplify the implementation of collecting messages (from possibly all other processors) on a single processor.

The execution of a PRO-algorithm consists of a sequence of supersteps (or rounds). The length of (time spent in) a superstep on each processor is determined by the sum of the time used for communication and the time used for local computation. The length of a superstep s in the parallel algorithm seen as a whole, denoted by $Time_s(n, p)$, is the maximum over the lengths of the superstep on all processors. Conceptually, we can think of the supersteps as being synchronized by a barrier set at the end of the longest superstep across the processors. However, note that in PRO the processors are not in reality required to synchronize at the end of each superstep. The parallel runtime $Time(n, p)$ of the algorithm is the sum of the lengths of all the supersteps. Notice that the hypothetical barriers result in only a constant factor more time compared with an analysis that does not assume the barriers.

In PRO, since a processor sends at most one message to every other processor in each superstep, each processor is involved in at most $2(p - 1)$ messages per superstep. Therefore, the requirement $Steps = O(Time(n)/p^2)$ on the number of supersteps implies that the overall time paid per processor for communication overhead and latency is $O(Time(n)/p)$, and hence can be neglected from the analysis since our goal is to achieve an $O(Time(n)/p)$ parallel runtime.
Notice that the bandwidth restriction of the underlying architecture, which in turn contributes to the communication cost, is accounted for since each processor pays a unit of time per word sent and received. This is not an unrealistic assumption, noting that the network throughput (accounted in machine words) on modern architectures such as high performance clusters is relatively close to the CPU frequency and to the CPU/memory bandwidth.

The condition $Time(n, p) = O(Time(n)/p)$ requires that a PRO-algorithm be optimal and yield linear speedup relative to the sequential algorithm used as a reference. This requirement ensures the potential practical use of the parallel algorithm.
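
The cost accounting described above can be made concrete with a minimal sketch. It assumes that the per-processor costs of every superstep are already known; the function names and the sample numbers are hypothetical and serve only to illustrate how the length of a superstep and the parallel runtime are obtained.

```python
from typing import List, Tuple

# One entry per processor: (local operations, words sent plus words received).
ProcessorCost = Tuple[int, int]


def superstep_length(costs: List[ProcessorCost]) -> int:
    """Length of one superstep: the maximum, over all processors,
    of computation time plus communication time (unit cost each)."""
    return max(ops + words for ops, words in costs)


def parallel_runtime(supersteps: List[List[ProcessorCost]]) -> int:
    """Time(n, p): the sum of the lengths of all supersteps."""
    return sum(superstep_length(costs) for costs in supersteps)


def max_messages_per_processor(p: int) -> int:
    """A processor sends at most one message to, and receives at most one
    message from, each of the other p - 1 processors in a superstep."""
    return 2 * (p - 1)


# Hypothetical run on p = 3 processors with two supersteps.
example = [
    [(10, 4), (12, 4), (9, 6)],  # per-processor lengths 14, 16, 15
    [(5, 2), (5, 2), (7, 1)],    # per-processor lengths 7, 7, 8
]
print(parallel_runtime(example))  # 24
```

The sketch charges a superstep by its slowest processor, which corresponds to the conceptual barrier at the end of each superstep; it does not model latency or per-message overhead, which the bound on the number of supersteps is meant to absorb.
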

Observation 1. A PRO-algorithm relative to a sequential algorithm with runtime $O(Time(n))$ and space requirement $O(Space(n))$ has maximum granularity $Grain(n) = O(\min\{\sqrt{Space(n)}, \sqrt{Time(n)}\}) = O(\sqrt{Space(n)})$. A PRO-algorithm that achieves this is said to have optimal grain.

Observation 1 is due to the limit on the memory size of each processor, the coarseness assumption, and the bound on the number of supersteps. The limit on the size of the private memory of each processor ($M = O(Space(n)/p)$) together with the coarseness assumption $p \le M$ imply $p = O(\sqrt{Space(n)})$. The fact that the number of supersteps of a PRO-algorithm should be $Steps = O(Time(n)/p^2)$ gives $p = O(\sqrt{Time(n)/Steps})$ upon resolving for p, and we clearly have $Steps \ge 1$. Finally, note that $Time(n) \ge Space(n)$, since an algorithm has to at least read the input.

Since a PRO-algorithm yields linear speedup for any $p = O(Grain(n))$, a result like Brent's scheduling principle is implicit for these values of p. But Observation 1 shows that we cannot start with an arbitrary number of processors and efficiently simulate on a smaller number. So Brent's scheduling principle does not hold with full generality in the PRO model, which is in accordance with practical observations.

The design of a PRO-algorithm may sometimes involve subroutines for which there do not exist sequential counterparts. Examples of such tasks include communication primitives such as broadcasting, data (re)-distribution routines, and load balancing routines. Such routines are often required in various parallel algorithms. With a slight abuse of notation, we call such parallel routines PRO-algorithms if the overall computation and communication cost is linear in the input size to the routines.

4 Comparison with other models

In this section we compare the PRO model with PRAM, QSM, BSP, LogP, and CGM. Our tabular format for comparison is inspired by a similar presentation in [13], where the Queuing Shared Memory (QSM) model is proposed.
The columns of Table 1 are labeled with the names of the selected models in our comparison and some relevant features of a model are listed along the rows.

              PRAM [8]    QSM [13]      BSP [22]      LogP [5]        CGM [4]   PRO
synch.        lock-step   bulk-synch.   bulk-synch.   asynch.         asynch.   asynch.
memory        sh.         sh.           dist.         dist.           priv.     priv.
commun.       SM          SM            MP            MP              MP/SM     MP/SM
parameters    n           p, g, n       p, g, L, n    p, g, l, o, n   p, n      p, n, A_seq
granularity   fine        fine          coarse        fine            coarse    Grain(n)
speedup       NA          NA            NA            NA              NA        Θ(p)
optimal       NA          NA            NA            NA              NA        rel. A_seq
quality       time        time          time          time            rounds    Grain(n)

Table 1: Comparison of parallel computational models

The synchrony assumption of the model is indicated in the row labeled synch. Lock-step indicates that the processors are fully synchronized at each step (of a universal clock), without accounting for synchronization. Bulk-synchrony indicates that there can be asynchronous operations between synchronization barriers.

The row labeled memory shows how the model views the memory of the parallel computer: sh. indicates globally accessible shared memory, dist. stands for distributed memory, and priv. is an abstraction for the case where the only assumption is that each processor has access to private (local) memory. In the last variant the whole memory could either be distributed or shared.

The row labeled commun. shows the type of interprocessor communication assumed by the model. Shared memory (SM) indicates that communication is effected by reading from and writing to a globally accessible shared memory. Message-passing (MP) denotes the situation where processors communicate by explicitly exchanging messages in a point-to-point fashion. The MP abstraction hides the details of how the message is routed through the interprocessor communication network.

The parameters involved in the model are indicated in the row labeled parameters. The number of processors is denoted by p, n is the input size, $A_{seq}$ is the reference sequential algorithm, l is the communication cost (latency), L is a single parameter that accounts for the sum of latency (l) and the cost for a barrier synchronization, g is the bandwidth gap, and o is the overhead associated with sending or receiving a message. Note that the machine characteristics l and o are taken into account in PRO, even though they are not explicitly used as parameters. Latency is taken into consideration since the length of a superstep is determined by the sum of the computational and communication cost.
Communication overhead is hidden by the PRO-requirement that states $Steps = O(Time(n)/p^2)$.

The row labeled granularity indicates whether the model is fine-grained, coarse-grained, or a more precise measure is used. We say that a model is coarse-grained if it applies to the case where $n \gg p$ and call it fine-grained if it relies on using up to a polynomial number of processors in the input size. In PRO granularity is exactly the quality measure Grain(n), and appears as one of the attributes of the model.

The rows labeled speedup and optimal indicate the speedup and resource optimality requirements imposed by the model. Whenever these issues are not directly addressed by the model or are not applicable, the word NA is used. Note that these requirements are hard-wired in the model in the case of PRO. The label rel. $A_{seq}$ means that the algorithm is optimal relative to the time and space complexity of $A_{seq}$. We point out that the goal in the design of algorithms using the CGM model [7, 4] is usually stated as that of achieving optimal algorithms, but the model per se does not impose an optimality requirement.

The last row indicates the quality measure of an algorithm designed using the different models. For all models except CGM and PRO, the quality measure is running time. In CGM, the number of supersteps (rounds) is usually presented as a quality measure. In PRO the quality measure is granularity, one of the features that make PRO fundamentally different from all existing parallel computation models.

5 Algorithm example: matrix multiplication

In this section we illustrate how the PRO model is used, by starting from a given sequential algorithm and then designing and analyzing a parallel algorithm relative to it. We use the standard matrix multiplication algorithm with three nested for-loops as an example. This example is chosen for its simplicity and since our objective at this stage is to illustrate the use of a new model rather than solving a difficult problem.

Consider the problem of computing the product C of two $m \times m$ matrices A and B (input size $n = m^2$). We want to design a PRO-algorithm relative to the standard sequential matrix multiplication algorithm, which has $Time(n) = O(n^{3/2})$ and $Space(n) = O(n)$.

We assume that the input matrices A and B are distributed among the processors $P_0, \ldots, P_{p-1}$ so that processor $P_i$ stores rows (respectively columns) $\frac{m}{p} i + 1$ to $\frac{m}{p}(i + 1)$ of A (respectively B). The output matrix C will be row-partitioned among the processors in a similar fashion. Notice that with this data distribution each processor can, without communication, compute a block of $m^2/p^2$ of the $m^2/p$ entries of C expected to reside on it.
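
The data distribution just described, combined with a round-robin exchange of column blocks over p supersteps, can be simulated in a single process. The following is a minimal sketch under stated assumptions: the function name `pro_matmul` and its internal variables are hypothetical, p is assumed to divide m, indices are 0-based, and the p processors are simulated one after another rather than run in parallel.

```python
from typing import List

Matrix = List[List[int]]


def pro_matmul(A: Matrix, B: Matrix, p: int) -> Matrix:
    """Simulate the block-distributed product C = A * B on p virtual processors.

    Sketch assumption: A and B are square m x m matrices and p divides m.
    """
    m = len(A)
    assert m % p == 0, "this sketch assumes that p divides m"
    b = m // p  # rows of A (and columns of B) stored per processor

    # Processor i owns rows i*b .. (i+1)*b - 1 of A; this never changes.
    rows = [A[i * b:(i + 1) * b] for i in range(p)]
    # cols[i] is the column block currently held by processor i,
    # stored as (block index, list of columns of B).
    cols = [
        (i, [[B[r][c] for r in range(m)] for c in range(i * b, (i + 1) * b)])
        for i in range(p)
    ]

    C = [[0] * m for _ in range(m)]
    for _ in range(p):  # one iteration per superstep
        # Local computation: processor i fills one b x b block of its rows of C.
        for i in range(p):
            k, block = cols[i]
            for r, row in enumerate(rows[i]):
                for c, col in enumerate(block):
                    C[i * b + r][k * b + c] = sum(x * y for x, y in zip(row, col))
        # Communication: processor i receives the block held by processor (i+1) mod p.
        cols = [cols[(i + 1) % p] for i in range(p)]
    return C


print(pro_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], p=2))  # [[19, 22], [43, 50]]
```

In each simulated superstep a processor computes $b^2 = m^2/p^2$ entries of C and forwards one block of $m^2/p$ words, so every processor exchanges a single message per superstep.
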
In order to compute the next block of $m^2/p^2$ entries, processor $P_i$ needs the columns of matrix B that reside on processor $P_{i+1}$. In each superstep the processors in the PRO-algorithm will therefore exchange columns in a round-robin fashion and then each will compute a new block of results. Note that the block of columns a processor sends in a superstep constitutes one single message. Note also that the initial distribution of the rows of matrix A remains unchanged. In Algorithm 1, we have organized this sequence of computation and communication steps in a manner that meets the requirements of the PRO model.

Algorithm 1: Matrix multiplication
Input: Two m × m matrices A and B. The rows (columns) of A (B) are divided into contiguous blocks of m/p rows (columns), stored on processors P_0, P_1, ..., P_{p-1} respectively
Output: The product matrix C where the rows are stored in contiguous blocks across the processors
for superstep s = 1 to p do
    foreach processor P_i do
        P_i computes the local sub-matrix product of its rows and current columns;
        P_{(i+1) mod p} sends its current block of columns to P_i;
        P_i receives a new current block of columns from P_{(i+1) mod p};

Algorithm 1 has p supersteps ($Steps = p$). In each superstep, the time spent in locally computing each of the $m^2/p^2$ entries is $\Theta(m)$, resulting in local computing time $\Theta(m^3/p^2) = \Theta(n^{3/2}/p^2)$ per superstep. Likewise, the total size of data (words) exchanged by each processor in a superstep is $\Theta(m^2/p) = \Theta(n/p)$. Thus, the length of a superstep s is $Time_s(n, p) = \Theta(n^{3/2}/p^2 + n/p)$. Note that for $p = O(\sqrt{n})$, $Time_s(n, p) = \Theta(n^{3/2}/p^2)$. Hence, for $p = O(\sqrt{n})$, the overall parallel runtime of the algorithm is

$Time(n, p) = Steps \cdot \Theta(n^{3/2}/p^2) = \Theta(n^{3/2}/p) = \Theta(Time(n)/p)$.   (1)

Noting that $Space(n) = \Theta(n)$, we see that the memory restriction of the PRO model is respected, i.e., each processor has enough memory to handle the transactions.

In order to be able to neglect communication overhead, the condition on the number of supersteps, which in this case is just p, should be met. In other words, we need $p = O(Time(n)/p^2) = O(n^{3/2}/p^2)$, which is true for $p = O(\sqrt{n})$. Thus the granularity function of the PRO-algorithm is $Grain(n) = \sqrt{n}$. In summary,

Lemma 1. Multiplication of two m by m matrices has a PRO-algorithm with $Grain(n) = m$ relative to a sequential algorithm with $Time(n) = m^3$ and $Space(n) = m^2$ (input size $n = m^2$).

From Observation 1, we note that Algorithm 1 achieves optimal granularity. Note that on a relaxed model, where the assumption that $p \le M$ is not present, the strong regularity of matrix multiplication and the exact knowledge of the communication pattern allow for algorithms that have an even finer granularity than m. For example, a systolic matrix multiplication algorithm has a granularity of $m^2$. However, PRO is intended to be applicable for general problems and practically relevant parallel systems.

6 Communication primitive example: one-to-all broadcast

A good parallel computation model should have a selection of algorithms for primitive communication tasks available in its algorithm design toolbox. The PRO model is intended to meet this demand, but for lack of space we give only one example.

In this section we illustrate how the PRO model allows optimal one-to-all broadcasting among its processors. Since there is no sequential basis algorithm in this case, we want an algorithm whose overall communication and computation cost is linear in the input and output sizes. More precisely, we consider the situation where the input consists of a vector of size m on a single processor and the output should be a copy of this vector on each of the p processors, and we want an algorithm that achieves this in $O(m)$ time using $O(m)$ memory on each processor. See Algorithm 2.

Algorithm 2: One-to-All Broadcast
Input: A vector V of size m on processor P_0
Output: A copy of V on each processor
S1: P_0 divides V into p equal sized parts;
    P_0 sends the i-th part of V to processor P_i, for each 0 < i < p;
    foreach processor P_i, i > 0 do
        P_i receives the i-th part from P_0;
S2: foreach processor P_i do
        P_i sends out the i-th part to P_j, for each j ≠ i and 0 < j < p;
    foreach processor P_j, j ≠ 0 do
        P_j receives the i-th part from P_i, for each i ≠ j and 0 ≤ i < p;

Lemma 2. PRO Algorithm 2 implements a one-to-all broadcast of m memory words in two supersteps using $O(m)$ time and $O(m)$ space per processor, for any number of processors $p \le \sqrt{m}$.

Proof: First, we note that the algorithm correctly broadcasts the desired vector V, while observing the space restriction, in two supersteps. We turn to the timing. In step S1 processor $P_0$ in total sends out $(p - 1)m/p$ words and each of the other processors receives a message of size $m/p$. In step S2 processor $P_i$ in total sends out $\frac{p-2}{p} m$ words. Processor $P_j$, $j \ne 0$, in total receives $\frac{p-1}{p} m$ words. The total time is dominated by the communication, which is

$\frac{(p-1)m}{p} + \frac{m}{p} + \frac{(p-2)m}{p} + \frac{(p-1)m}{p}$   (2)
$= \frac{(3p-3)m}{p} < 3m$   (3)

for total time $O(m)$ as claimed.

7 Conclusion

We have introduced a new parallel computation model (called PRO) that enables the development of efficient scalable parallel algorithms and simplifies the complexity analysis of such algorithms.

The distinguishing feature of the PRO model is the novel focus on relativity, resource-optimality, and a new quality measure (granularity). In particular, the model requires a parallel algorithm to be both time- and space-optimal relative to an underlying sequential algorithm. Having optimality as a built-in requirement, the quality of a PRO-algorithm is measured by the maximum number of processors that could be used while the optimality of the algorithm is maintained.

The focus on relativity has theoretical as well as practical justifications. From a theoretical point of view, the performance evaluation metrics of a parallel algorithm include speedup and optimality, both of which are always expressed relative to some sequential algorithm. Moreover, there is an inherent asymmetry between sequential and parallel computation. A parallel algorithm would always imply a sequential algorithm, whereas the converse is usually not true. Thus, in a sense, it is natural to think of an underlying sequential algorithm whenever one speaks of a parallel algorithm. From a practical point of view, one notes that the development of a parallel algorithm is often built on some known sequential algorithm.

The fact that optimality is incorporated as a requirement in the PRO model enables one to concentrate only on parallel algorithms that are practically useful. However, the PRO model is not just a collection of some ideal features of parallel algorithms; it is also a means to achieve these features.
In articular, the attributes of the model cature the salient characteristics of a arallel algorithm that make its ractical otimality and scalability highly likely. 13

14 In this sense, it can also be seen as a arallel algorithm design scheme. Moreover, the simlicity of the model eases analysis. We believe that the PRO model is a ste forward towards the identification of roblems for which ractically good arallel algorithms exist. Much work remains to be done, and we hoe that other members of the research community will join in. As a first item on the agenda, the PRO model needs to be tested for comatibility with already existing ractical arallel algorithms. Acknowledgments helful comments. We are grateful to the anonymous referees for their References [1] A. G. Alexandrakis, A. V. Gerbessiotis, D. S. Lecomber, and C. J. Siniolakis. Bandwidth, sace and comutation efficient PRAM rogramming: The BSP aroach. In Proceedings of the SUP EUR 96 Conference, Krakow, Poland, Setember [2] A. Bar-Noy and S. Kinis. Designing broadcasting algorithms in the Postal Model for message assing systems. In The 4th annual ACM symosium on arallel algorithms and architectures, ages 13 22, July [3] R. P. Brent. The arallel evaluation of generic arithmetic exressions. Journal of the ACM, 21(2): , [4] E. Caceres, F. Dehne, A. Ferreira, P. Locchini, I. Rieing, A. Roncato, N. Santoro, and S. W. Song. Efficient arallel grah algorithms for coarse grained multicomuters and BSP. In The 24th International Colloquium on Automata Languages and Programming, volume 1256 of LNCS, ages Sringer Verlag, [5] D. E. Culler, R. M. Kar, D. A. Patterson, A. Sahay, K. E. Schauser, E. Santos, R. Subramonian, and T. von Eicken. LogP: Towards a realistic model of arallel comutation. In 4th ACM SIGPLAN Symosium on rinciles and ractice of arallel rogramming, San Diego, CA, May [6] F. Dehne. Coarse grained arallel algorithms. Algorithmica Secial Issue on Coarse grained arallel algorithms, 24(3/4): ,

15 [7] F. Dehne, A. Fabri, and A. Rau-Chalin. Scalable arallel comutational geometry for coarse grained multicomuters. International Journal on Comutational Geometry, 6(3): , [8] S. Fortune and J. Wyllie. Parallelism in random access machines. In 10th ACM Symosium on Theory of Comuting, ages , May [9] A. H. Gebremedhin, I. Guérin Lassous, J. Gustedt, and J. A. Telle. Grah coloring on a coarse grained multirocessor. In Ulrik Brandes and Dorothea Wagner, editors, WG 2000, volume 1928 of LNCS, ages Sringer-Verlag, [10] A. V. Gerbessiotis, D. S. Lecomber, C. J. Siniolakis, and K. R. Sujithan. PRAM rogramming: Theory vs. ractice. In Proceedings of 6th Euromicro Worksho on Parallel and Distributed Processing, Madrid, Sain. IEEE Comuter Society Press, January [11] A. V. Gerbessiotis and C. J. Siniolakis. A new randomized sorting algorithm on the BSP model. Technical reort, New Jersey Institute of Technology, [12] A. V. Gerbessiotis and L. G. Valiant. Direct bulk-synchronous arallel algorithms. Journal of Parallel and Distributed Comuting, 22: , [13] P. B. Gibbons, Y. Matias, and V. Ramachandran. Can a Shared- Memory Model Serve as a Bridging Model for Parallel Comutation? Theory of Comuting Systems, 32(3): , [14] R. Greenlaw, H.J. Hoover, and W. L. Ruzzo. Limits to Parallel Comutation: P-Comleteness Theory. Oxford University Press, New York, [15] I. Guérin Lassous, J. Gustedt, and M. Morvan. Handling grahs according to a coarse grained aroach: Exeriments with MPI and PVM. In Jack Dongarra, Péter Kacsuk, and N. Podhorszki, editors, 7th Euroean PVM/MPI Users Grou Meeting, volume 1908 of LNCS, ages Sringer Verlag, [16] K. Hawick et al. High erformance comuting and communications glossary. see htt://nhse.nac.syr.edu/hccgloss/. [17] J. Jájá. An Introduction to Parallel Algorithms. Addison-Wesley,

[18] J. JáJá and K. W. Ryu. The Block Distributed Memory model. IEEE Transactions on Parallel and Distributed Systems, 8(7).
[19] R. M. Karp and V. Ramachandran. Parallel Algorithms for Shared-Memory Machines. In Jan van Leeuwen, editor, Handbook of Theoretical Computer Science, volume A, Algorithms and Complexity. Elsevier Science Publishers B.V., Amsterdam.
[20] C. P. Kruskal, L. Rudolph, and M. Snir. A complexity theory of efficient parallel algorithms. Theoretical Computer Science, 71(1):95-132, March.
[21] B. M. Maggs, L. R. Matheson, and R. E. Tarjan. Models of parallel computation: A survey and synthesis. In 28th HICSS, volume 2, pages 61-70, January.
[22] L. G. Valiant. A bridging model for parallel computation. Communications of the ACM, 33(8).
[23] J. S. Vitter and R. A. Simons. New classes for parallel complexity: A study of unification and other complete problems for P. IEEE Transactions on Computers, C-35(5).


More information

A 2D Random Walk Mobility Model for Location Management Studies in Wireless Networks Abstract: I. Introduction

A 2D Random Walk Mobility Model for Location Management Studies in Wireless Networks Abstract: I. Introduction A D Random Walk Mobility Model for Location Management Studies in Wireless Networks Kuo Hsing Chiang, RMIT University, Melbourne, Australia Nirmala Shenoy, Information Technology Deartment, RIT, Rochester,

More information

Ad Hoc Networks. Latency-minimizing data aggregation in wireless sensor networks under physical interference model

Ad Hoc Networks. Latency-minimizing data aggregation in wireless sensor networks under physical interference model Ad Hoc Networks (4) 5 68 Contents lists available at SciVerse ScienceDirect Ad Hoc Networks journal homeage: www.elsevier.com/locate/adhoc Latency-minimizing data aggregation in wireless sensor networks

More information

Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chip Platform

Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chip Platform Models for Advancing PRAM and Other Algorithms into Parallel Programs for a PRAM-On-Chi Platform Uzi Vishkin George C. Caragea Bryant Lee Aril 2006 University of Maryland, College Park, MD 20740 UMIACS-TR

More information

Sequential Memory Access on the Unified Memory Machine with Application to the Dynamic Programming

Sequential Memory Access on the Unified Memory Machine with Application to the Dynamic Programming Sequential Memory Access on the Unified Memory Machine ith Alication to the Dynamic Programming Koji Nakano Deartment of Information Engineering Hiroshima University Kagamiyama --, Higashi Hiroshima, 79-87

More information

Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScript Objects

Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScript Objects Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScrit Objects Shiyi Wei and Barbara G. Ryder Deartment of Comuter Science, Virginia Tech, Blacksburg, VA, USA. {wei,ryder}@cs.vt.edu

More information

TOPP Probing of Network Links with Large Independent Latencies

TOPP Probing of Network Links with Large Independent Latencies TOPP Probing of Network Links with Large Indeendent Latencies M. Hosseinour, M. J. Tunnicliffe Faculty of Comuting, Information ystems and Mathematics, Kingston University, Kingston-on-Thames, urrey, KT1

More information

CS 470 Spring Mike Lam, Professor. Performance Analysis

CS 470 Spring Mike Lam, Professor. Performance Analysis CS 470 Sring 2018 Mike Lam, Professor Performance Analysis Performance analysis Why do we arallelize our rograms? Performance analysis Why do we arallelize our rograms? So that they run faster! Performance

More information

Improving the Performance of MPI Derived Datatypes by Optimizing Memory-Access Cost

Improving the Performance of MPI Derived Datatypes by Optimizing Memory-Access Cost Imroving the Performance of MPI Derived Datatyes by Otimizing Memory-Access Cost Surendra Byna William Gro Xian-He Sun Rajeev Thakur Deartment of Comuter Science Illinois Institute of Technology Chicago,

More information

10. Multiprocessor Scheduling (Advanced)

10. Multiprocessor Scheduling (Advanced) 10. Multirocessor Scheduling (Advanced) Oerating System: Three Easy Pieces AOS@UC 1 Multirocessor Scheduling The rise of the multicore rocessor is the source of multirocessorscheduling roliferation. w

More information

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH Jin Lu, José M. F. Moura, and Urs Niesen Deartment of Electrical and Comuter Engineering Carnegie Mellon University, Pittsburgh, PA 15213 jinlu, moura@ece.cmu.edu

More information

Patterned Wafer Segmentation

Patterned Wafer Segmentation atterned Wafer Segmentation ierrick Bourgeat ab, Fabrice Meriaudeau b, Kenneth W. Tobin a, atrick Gorria b a Oak Ridge National Laboratory,.O.Box 2008, Oak Ridge, TN 37831-6011, USA b Le2i Laboratory Univ.of

More information

Design and Analysis of a Dynamically Reconfigurable Network Processor

Design and Analysis of a Dynamically Reconfigurable Network Processor Design and Analysis of a Dynamically Reconfigurable Network Processor I.A. Troxel, A.D. George, and S. Oral HCS Research Lab, ECE Deartment, University of Florida, Gainesville, FL {troxel,george,oral}@hcs.ufl.edu

More information

Sensitivity of multi-product two-stage economic lotsizing models and their dependency on change-over and product cost ratio s

Sensitivity of multi-product two-stage economic lotsizing models and their dependency on change-over and product cost ratio s Sensitivity two stage EOQ model 1 Sensitivity of multi-roduct two-stage economic lotsizing models and their deendency on change-over and roduct cost ratio s Frank Van den broecke, El-Houssaine Aghezzaf,

More information