Collective Communication: Theory, Practice, and Experience. FLAME Working Note #22


Collective Communication: Theory, Practice, and Experience

FLAME Working Note #22

Ernie Chan, Marcel Heimlich, Avi Purkayastha, Robert van de Geijn

September 2006

Abstract

We discuss the design and high-performance implementation of collective communication operations on distributed-memory computer architectures. Using a combination of known techniques (many of which were first proposed in the 1980s and early 1990s) along with careful exploitation of communication modes supported by MPI, we have developed implementations that have improved performance in most situations compared to those currently supported by public domain implementations of MPI such as MPICH. Performance results from a large Intel Xeon/Pentium 4 (R) processor cluster are included.

1 Introduction

This paper makes a number of contributions to the topic of collective communication:

1. A review of best practices. Collective communication was an active research area in the 1980s and early 1990s as distributed-memory architectures with large numbers of processors were first introduced [,, 4, 9,,, 5]. Since then an occasional paper has been published [9,, 5, 7], but no dramatic new developments have been reported.

2. Focus on simplicity. Historically, hypercube topologies were the first topologies used for distributed-memory architectures. Examples include Caltech's Cosmic Cube [], the Intel iPSC, NCUBE [], and Thinking Machines' Connection Machine architectures. As a result, highly efficient algorithms for hypercubes were developed first, and these algorithms were then modified to target architectures with alternative topologies. Similarly, textbooks often start their discussion of collective communication by considering hypercubes. In our exposition we take a different approach by considering one-dimensional topologies first. Algorithms that perform well are then generalized to multidimensional meshes. Hypercubes are finally discussed briefly by observing that they are log₂(p)-dimensional meshes with two (computational) nodes in each dimension.
This fact allows us to focus on simple, effective solutions that naturally generalize to higher dimensions and ultimately hypercubes.

3. Algorithms. One consequence of Item 2 is that we can state the algorithms more simply and concisely. Minimum-spanning tree algorithms on hypercubes typically required loop-based algorithms that computed indices of destination nodes by toggling bits of the indices of source nodes. We instead present the algorithms recursively and avoid such obscuring bit manipulation.

echan@cs.utexas.edu. Department of Computer Sciences, The University of Texas at Austin, Austin, TX 78712. Texas Advanced Computing Center, The University of Texas at Austin, Austin, TX 78712.

4. Analysis. The cost of the algorithms is analyzed via a simple but effective model of parallel computation.

5. Tunable libraries. More recently, the topic of tuning, preferably automatically, of collective communication libraries has again become fashionable [8, 8]. Unfortunately, many of these papers focus on the mechanism for choosing algorithms from a loose collection of algorithms. Often this collection does not even include the fastest and/or most practical algorithm. Perhaps the most important contribution of this paper is that it shows how algorithms for a given operation can be organized as a parameterized family, which then clearly defines what parameters can be tuned to improve performance. This approach was already incorporated into the highly tuned InterCom library for the Intel Touchstone Delta and Paragon architectures of the early and mid 1990s [, 4, 5]. However, many details of the theory and practical techniques used to build that library were never published.

6. Implementation. The merits of the approach are verified via an MPI-compatible implementation of all the presented algorithms [7]. Experiments show that the resulting implementation is comparable to, and sometimes better than, the MPICH implementation of the Message-Passing Interface (MPI) [,, 4].

There is an entire class of algorithms that we do not treat: pipelined algorithms [, 4, 6]. The reason is that we do not consider these practical on current generation architectures.

The remainder of the paper is organized as follows. In Section 2 we explain some basic assumptions that are made for the purpose of presenting this paper. In Section 3 we discuss the collective communication operations. Section 4 delineates the lower bounds of the collective communication operations, followed by a discussion of network topologies in Section 5. In Section 6 we discuss different algorithms for varying data lengths. In Section 7 we discuss strategies for the special cases where short and long vectors of data are communicated.
More sophisticated hybrid algorithms that combine techniques for all vector lengths are discussed in Section 8. Performance results are given in Section 9. Concluding remarks can be found in Section 10.

2 Model of Parallel Computation

To analyze the cost of the presented algorithms, it is useful to assume a simple model of parallel computation. These assumptions are:

Target architectures: Currently, the target architectures are distributed-memory parallel architectures. However, we expect that the methods discussed in this paper also have applicability when many cores on a single processor become available.

Indexing: This paper assumes a parallel architecture with p computational nodes (nodes hereafter). The nodes are indexed from 0 to p − 1. Each node could consist of a Symmetric Multi-Processor (SMP) but for communication purposes will be treated as one unit.

Logically fully connected: We will assume that any node can send directly to any other node through a communication network in which some topology provides automatic routing.

Communicating between nodes: At any given time, a single node can send only one message to one other node. Similarly, it can only receive one message from one other node. We will assume a node can send and receive simultaneously.

Cost of communication: The cost of sending a message between two nodes will be modeled by α + nβ, in the absence of network conflicts. Here α and β respectively represent the message startup time and per data item transmission time. In our model, the cost of the message is not a function of the distance between two nodes. The startup cost is largely due to software overhead on the sending and the receiving nodes. The routing of messages between nodes is subsequently done in hardware using wormhole routing, which pipelines messages and incurs a very small extra overhead due to the distance between two nodes []. Typically, α is four to five orders of magnitude greater than β, where β is on the order of the cost of an instruction.

Network conflicts: Assuming that the path between two communicating nodes, determined by the topology and the routing algorithm, is completely occupied by a message, then if some link (connection between neighboring nodes) in the communication path is occupied by two or more messages, a network conflict occurs. This extra cost is modeled with α + knβ, where k is the maximum over all links (along the path of the message) of the number of conflicts on the links. Alternatively, links on which there is a conflict may be multiplexed, which yields the same modification to the cost.

Bidirectional channels: We will assume that messages traveling in opposite directions on a link do not conflict.

Cost of computation: The cost required to perform an arithmetic operation (e.g., a reduction operation) is denoted by γ.

Although simple, the above assumptions are useful when conducting an analysis of communication costs on actual architectures. Some additional discussion is necessary regarding the parameters α and β:

Communication protocols: The most generic communication uses the so-called three-pass protocol. A message is sent to alert the receiving node that a message of a given size will be sent. After the buffer space for the message has been allocated, the receiving node responds. Finally, the message itself is sent. Notice that this requires three control messages to be sent between the sending and receiving nodes. We will denote the latency associated with this protocol by α₃. If the sender can rely on the fact that a receive buffer already exists, a one-pass protocol can be used, in which the message is simply sent without the above-described handshake. We will denote the latency associated with this protocol by α₁. In particular, we will assume that there is always static buffer space for very short messages. The three-pass protocol can easily cost up to three times more than the one-pass protocol. Thus, in our discussion we will assume that α = α₃.
Relative order of send and receive calls: The cost per item sent is affected by the relative order in which a send and corresponding receive are posted. If a send is initiated before the corresponding receive is posted on the receiving node, the incoming message is buffered in temporary space and copied when the receive, which indicates where the message is to be finally stored, is posted. If, on the other hand, the receive is posted before the send, or the send blocks until the receive is posted, no such extra copy is required. We will denote the cost per item transferred by β₁ if no extra copy is required and by β₂ if it is.

3 Collective Communication

When the nodes of a distributed-memory architecture collaborate to solve a given problem, computation previously performed on a single node is inherently distributed among the nodes. Communication is performed when data is shared and/or contributions from different nodes must be consolidated. Communication operations that simultaneously involve a group of nodes are called collective communication operations. In our discussions, we will assume that the group includes all nodes. The most typically encountered collective communications, discussed in this section, fall into two categories:

Data redistribution operations: broadcast, scatter, gather, and allgather. These operations move data between processors.

Data consolidation operations: reduce(-to-one), reduce-scatter, and allreduce. These operations consolidate contributions from different processors by applying a reduction operation. We will only consider reduction operations that are both commutative and associative.

Figure 1: Collective communications considered in this paper.

Communication      Latency        Bandwidth          Computation
Broadcast          ⌈log₂(p)⌉ α    nβ                 —
Reduce(-to-one)    ⌈log₂(p)⌉ α    nβ                 ((p − 1)/p) nγ
Scatter            ⌈log₂(p)⌉ α    ((p − 1)/p) nβ     —
Gather             ⌈log₂(p)⌉ α    ((p − 1)/p) nβ     —
Allgather          ⌈log₂(p)⌉ α    ((p − 1)/p) nβ     —
Reduce-scatter     ⌈log₂(p)⌉ α    ((p − 1)/p) nβ     ((p − 1)/p) nγ
Allreduce          ⌈log₂(p)⌉ α    2((p − 1)/p) nβ    ((p − 1)/p) nγ

Table 1: Lower bounds for the different components of communication cost. Pay particular attention to the conditions for the lower bounds given in the text.

The operations discussed in this paper are illustrated in Fig. 1. In that figure, x indicates a vector of data of length n. For some operations, x is subdivided into subvectors x_i, i = 0, …, p − 1, where p equals the number of nodes. A superscript is used to indicate a vector that must be reduced with other vectors from other nodes. Σ_j x^(j) indicates the result of that reduction. The summation sign is used because summation is the most commonly encountered reduction operation. We present these collective communications as pairs of dual operations. We will show later that an implementation of an operation can be transformed into that of its dual by reversing the communication (and adding to or deleting reduction operations from the implementation). These dual pairs are indicated by the groupings in Fig. 1 (separated by the thick lines): broadcast and reduce(-to-one), scatter and gather, and allgather and reduce-scatter. Allreduce is the only operation that does not have a dual (or it can be viewed as its own dual).

4 Lower Bounds

It is useful to present lower bounds on the cost of these operations under our model, regardless of the implementation. In this section, we give informal arguments to derive these lower bounds. We will treat the three terms of communication cost separately: latency, bandwidth, and computation. Lower bounds are summarized in Table 1. It is assumed that p > 1 and that the subvectors x_i and x_i^(j) have equal length.

Latency: The lower bound on latency is derived from the simple observation that, for all collective communications, at least one node has data that must somehow arrive at all other nodes.
Under our model, at each step we can at most double the number of nodes that have received the data.

Computation: Only the reduction communications require computation. The computation involved would require (p − 1)n operations if executed on a single node, or time (p − 1)nγ. Distributing this computation perfectly among the nodes reduces the time to ((p − 1)/p) nγ under ideal circumstances. Hence the lower bound.

Bandwidth: For the broadcast and reduce(-to-one), the root node must either send or receive n items. The cost of this is bounded below by nβ. For the gather and scatter, the root node must either send or receive (p − 1)/p of the n items, with a cost of at least ((p − 1)/p) nβ. The same is the case for all nodes during the allgather and reduce-scatter. The allreduce is somewhat more complicated. If the lower bound on computation is to be achieved, one can argue that ((p − 1)/p) n items must leave each node, and ((p − 1)/p) n items must be received by each node after the computation is completed, for a total cost of at least 2((p − 1)/p) nβ. For further details see [4].
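The doubling argument for the latency lower bound can be checked with a short sketch (our own code, not from the paper; the function name is ours):

```python
import math

def latency_steps(p):
    """Minimum number of message steps before all p nodes can hold data
    that starts on one node, when the set of nodes holding the data can
    at most double in size per step."""
    informed = 1  # only the source node has the data initially
    steps = 0
    while informed < p:
        informed *= 2  # every informed node sends to one uninformed node
        steps += 1
    return steps
```

The returned count agrees with the ⌈log₂(p)⌉ latency lower bound of Table 1.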

Figure 2: Construction of hypercubes. A hypercube of dimension d is obtained by duplicating the (d − 1)-dimensional cube and connecting corresponding nodes with a link.

5 Topologies

In this section, we discuss a few topologies. The topology with the least connectivity is the linear array. A fully connected network is on the other end of the spectrum. In between, we consider higher dimensional mesh topologies. Hypercubes, which have historical and theoretical value, are shown to be mesh architectures of dimension log₂(p) with two nodes in each dimension. Many current architectures have multiple levels of switches that route messages between subnetworks that are typically fully connected.

Linear arrays: We will assume that the nodes of a linear array are connected so that node i has neighbors i − 1 (left) and i + 1 (right). Nodes 0 and p − 1 do not have left and right neighbors, respectively. A ring architecture would connect node p − 1 with node 0. However, the communication that a ring facilitates is important to us (e.g., simultaneous sending by all nodes to their right neighbor), and this communication can be achieved on a linear array because the message from node p − 1 to node 0 does not conflict with any of the other messages under our model.

Mesh architectures: The nodes of a mesh architecture of dimension D can be indexed using a D-tuple, (i₀, …, i_{D−1}), with 0 ≤ i_j < d_j and p = d₀ ⋯ d_{D−1}. The nodes indexed by (i₀, …, i_{j−1}, k, i_{j+1}, …, i_{D−1}), 0 ≤ k < d_j, form a linear array. A torus architecture is the natural extension of a ring to multiple dimensions. We will not need tori for our discussion, for the same reason that we do not need rings.

Hypercubes: In Fig. 2 we show how a hypercube can be inductively constructed:

A hypercube of dimension 0 consists of a single node. No bits are required to index this one node.

A hypercube of dimension d is constructed from two hypercubes of dimension d − 1 by connecting nodes with corresponding indices, and adding a leading binary bit to the index of each node. For all nodes in one of the two hypercubes this leading bit is set to 0, while it is set to 1 for the nodes in the other hypercube.

Some observations are:

Two nodes are neighbors if and only if the binary representations of their indices differ in exactly one bit.

View the nodes of a hypercube as a linear array with nodes indexed 0, …, p − 1. Then the subsets of nodes {0, …, p/2 − 1} and {p/2, …, p − 1} each form a hypercube, and node i of the left subset is a neighbor of node i + p/2 in the right subset. This observation applies recursively to the two subsets.

A hypercube is a mesh of dimension log₂(p) with two nodes in each dimension.

Fully connected architectures: In a fully connected architecture, all nodes are neighbors of all other nodes. We will see in the remainder of this paper that the primary advantage of a fully connected architecture is that one can view such architectures as higher dimensional meshes by factoring the number of nodes, p, into integer factors.

6 Commonly Used Algorithms

Depending on the amount of data involved in a collective communication, the strategy for reducing the cost of the operation differs. When the amount of data is small, it is the cost of initiating messages, α, that tends to dominate, and algorithms should strive to reduce this cost. In other words, it is the lower bound on the latency in Table 1 that becomes the dominating factor. When the amount of data is large, it is the costs per item sent and/or computed, β and/or γ, that become the dominating factors. In this case the lower bounds due to bandwidth and computation in Table 1 are the dominating factors.

6.1 Broadcast and reduce

Minimum-spanning tree algorithms

The best known broadcast algorithm is the minimum-spanning tree algorithm (MST Bcast). On an arbitrary number of nodes, this algorithm can be described as follows.
The nodes {0, …, p − 1} are partitioned into two (almost equal) disjoint subsets, {0, …, m} and {m + 1, …, p − 1}, where m = ⌊(p − 1)/2⌋ is the middle node. A destination node is chosen in the subset that does not contain the root. The message is sent from the root to the destination, after which the root and the destination become the roots for broadcasts within their respective subsets of nodes. This algorithm is given in Fig. 3(a). In this algorithm, x is the vector of data to be communicated, me and root indicate the index of the node that participates and the current root of the broadcast, and left and right indicate the indices of the left- and right-most nodes in the current subset of nodes. The broadcast among all nodes is then accomplished by calling MSTBcast(x, root, 0, p − 1). The algorithm is illustrated in Fig. 4. It is not hard to see that, in the absence of network conflicts, the cost of this algorithm is

T_MSTBcast(p, n) = ⌈log₂(p)⌉ (α + nβ)

in the generic case when the Send and Recv routines use a three-pass protocol. This cost achieves the lower bound for the latency component of the cost of communication. Under our model, the algorithm does not incur network conflicts on fully connected networks and on linear arrays, regardless of how the destination is chosen at each step. (The choice of dest in Fig. 3(a) is

MSTBcast( x, root, left, right )
  if left == right, return
  mid = ⌊(left + right)/2⌋
  if root ≤ mid then dest = right else dest = left
  if me == root, Send( x, dest )
  if me == dest, Recv( x, root )
  if me ≤ mid and root ≤ mid, MSTBcast( x, root, left, mid )
  else if me ≤ mid and root > mid, MSTBcast( x, dest, left, mid )
  else if me > mid and root ≤ mid, MSTBcast( x, dest, mid+1, right )
  else if me > mid and root > mid, MSTBcast( x, root, mid+1, right )
(a)

MSTReduce( x, root, left, right )
  if left == right, return
  mid = ⌊(left + right)/2⌋
  if root ≤ mid then srce = right else srce = left
  if me ≤ mid and root ≤ mid, MSTReduce( x, root, left, mid )
  else if me ≤ mid and root > mid, MSTReduce( x, srce, left, mid )
  else if me > mid and root ≤ mid, MSTReduce( x, srce, mid+1, right )
  else if me > mid and root > mid, MSTReduce( x, root, mid+1, right )
  if me == srce, Send( x, root )
  if me == root, Recv( tmp, srce ) and x = x + tmp
(b)

MSTScatter( x, root, left, right )
  if left == right, return
  mid = ⌊(left + right)/2⌋
  if root ≤ mid then dest = right else dest = left
  if root ≤ mid
    if me == root, Send( x_{mid+1:right}, dest )
    if me == dest, Recv( x_{mid+1:right}, root )
  else
    if me == root, Send( x_{left:mid}, dest )
    if me == dest, Recv( x_{left:mid}, root )
  if me ≤ mid and root ≤ mid, MSTScatter( x, root, left, mid )
  else if me ≤ mid and root > mid, MSTScatter( x, dest, left, mid )
  else if me > mid and root ≤ mid, MSTScatter( x, dest, mid+1, right )
  else if me > mid and root > mid, MSTScatter( x, root, mid+1, right )
(c)

MSTGather( x, root, left, right )
  if left == right, return
  mid = ⌊(left + right)/2⌋
  if root ≤ mid then srce = right else srce = left
  if me ≤ mid and root ≤ mid, MSTGather( x, root, left, mid )
  else if me ≤ mid and root > mid, MSTGather( x, srce, left, mid )
  else if me > mid and root ≤ mid, MSTGather( x, srce, mid+1, right )
  else if me > mid and root > mid, MSTGather( x, root, mid+1, right )
  if root ≤ mid
    if me == srce, Send( x_{mid+1:right}, root )
    if me == root, Recv( x_{mid+1:right}, srce )
  else
    if me == srce, Send( x_{left:mid}, root )
    if me == root, Recv( x_{left:mid}, srce )
(d)
Figure 3: Minimum-spanning tree algorithms.

simply convenient.) On a hypercube, the destination needs to be chosen as a neighbor of the current root. This change requires the algorithm in Fig. 3(a) to be modified by choosing dest as

if root ≤ mid, dest = root − left + (mid + 1); else dest = root + left − (mid + 1)

in other words, choose the node, in the subset that does not contain the current root, that is in the same relative position as the root.

The MST Reduce can be implemented by reversing the communication and applying a reduction operation with the data that is received. Again, the nodes {0, …, p − 1} are partitioned into two (almost equal) disjoint subsets, {0, …, m} and {m + 1, …, p − 1}. This time, a source node is chosen in the subset that does not contain the root. Recursively, all contributions within each subset are reduced to the root and to the source node. Finally, the reduced result is sent from the source node to the root, where it is reduced with the data that is already at the root node. This algorithm is given in Fig. 3(b) and illustrated in Fig. 5. Comparing the algorithms in Fig. 3(a) and (b), we note that the partitioning into subsets of nodes is identical. For the broadcast, the message is sent by the root, after which the broadcast proceeds recursively in each subset of nodes. For the reduce, the recursion comes first, after which a message is sent to the root, where the data is reduced with the data at the root. In effect, the communication is reversed in order and direction. The cost of this algorithm, identical to that of the broadcast except that now a γ term must be added for the reduction at each step, is given by

T_MSTReduce(p, n) = ⌈log₂(p)⌉ (α + nβ + nγ).

Both algorithms achieve the lower bound of ⌈log₂(p)⌉ α for the latency component of the cost.

6.2 Scatter and gather

A scatter can be implemented much like MST Bcast, except that at each step of the recursion only the data that ultimately must reside in the subnetwork of which the destination is a member needs to be sent from the root to the destination.
The resulting algorithm is given in Fig. 3(c) and is illustrated in Fig. 6. The MST Gather is similarly obtained by reversing the communications in the MST Scatter, as given in Fig. 3(d). Under the assumption that all subvectors are of equal length, the cost of these algorithms is given by

T_MSTScatter(p, n) = T_MSTGather(p, n) = Σ_{k=1}^{⌈log₂(p)⌉} (α + (n/2^k) β) = ⌈log₂(p)⌉ α + ((p − 1)/p) nβ.

This cost achieves the lower bounds for the latency and bandwidth components. Under the stated assumptions these algorithms are optimal.

6.3 Allgather and reduce-scatter

Bidirectional exchange algorithms

The best known algorithm for the allgather assumes that p = 2^d for some integer d and uses so-called bidirectional exchanges (BDE), which can be described as follows. Partition the network in two halves. Recursively perform an allgather of all the data in the respective halves. Next, exchange the so-gathered data between disjoint pairs of nodes, where each pair consists of one node from each of the two halves. Generally, node i (0 ≤ i < p/2) is paired with node i + p/2, which are neighbors if the network is a hypercube. This algorithm, called recursive doubling, is given in Fig. 7(a) and illustrated in Fig. 8. In the absence of network conflicts (on a hypercube or fully connected architecture), and assuming all subvectors are of equal length, the cost is

T_BDEAllgather(p, n) = Σ_{k=1}^{log₂(p)} (α + (2^{k−1}/p) nβ) = log₂(p) α + ((p − 1)/p) nβ.
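The recursive-doubling allgather can be simulated with a small sketch (our own code; it uses the equivalent flat XOR formulation of the partner choice, with the smallest-distance exchanges first, as in the recursion):

```python
def bde_allgather(p):
    """Simulate recursive doubling for p = 2**d nodes: at each step, node i
    exchanges every block it currently holds with partner i XOR step, so the
    data held per node doubles; after log2(p) steps all nodes hold all blocks."""
    have = [{i} for i in range(p)]  # node i starts with subvector x_i
    step = 1
    while step < p:
        have = [have[i] | have[i ^ step] for i in range(p)]
        step *= 2
    return have
```

After the final exchange, every node holds all p subvectors, which is the defining postcondition of the allgather.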

Figure 4: Minimum-spanning tree algorithm for broadcast.

Figure 5: Minimum-spanning tree algorithm for reduce.

This cost attains the lower bound for both the latency and bandwidth components and is thus optimal under these assumptions.

Problems arise with BDE algorithms when the number of nodes is not a power of two. If a subnetwork contains an odd number of nodes, one node does not have a corresponding node in the other subnetwork. In one remedy for this situation, a node from the opposing subnetwork must also send its data to the odd node. Unfortunately, this solution requires that one node send data twice at each step, so the cost of BDE algorithms doubles when not using a power of two number of nodes. In practice, BDE algorithms still perform quite well because the node needing to send data twice is different at each step of the recursion, so the communication can be overlapped between steps. Nonetheless, the result is rather haphazard.

Reduce-scatter can again be implemented by reversing the communications and their order and by adding reduction operations when the data arrives. This algorithm, called recursive halving, is given in Fig. 7(b) and illustrated in Fig. 9. Again assuming all subvectors are of equal length, the cost is

T_BDEReduce-scatter(p, n) = Σ_{k=1}^{log₂(p)} (α + (2^{k−1}/p) n(β + γ)) = log₂(p) α + ((p − 1)/p) n(β + γ).
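The closed form of the recursive-halving cost can be checked numerically with a short sketch (our own code, with arbitrary illustrative parameter values):

```python
import math

def bde_reduce_scatter_cost(p, n, alpha, beta, gamma):
    """Sum the per-step costs of recursive halving: at step k each node
    sends, receives, and reduces n * 2**(k-1) / p items (p = 2**d)."""
    d = int(math.log2(p))
    return sum(alpha + (2 ** (k - 1) / p) * n * (beta + gamma)
               for k in range(1, d + 1))

def closed_form(p, n, alpha, beta, gamma):
    """log2(p)*alpha + ((p-1)/p)*n*(beta+gamma), as derived in the text."""
    return math.log2(p) * alpha + ((p - 1) / p) * n * (beta + gamma)
```

The geometric sum Σ_{k=1}^{log₂(p)} 2^{k−1}/p collapses to (p − 1)/p, which is where the bandwidth and computation terms come from.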

Figure 6: Minimum-spanning tree algorithm for scatter.

BDEAllgather( x, left, right )
  if left == right, return
  size = right − left + 1
  mid = ⌊(left + right)/2⌋
  if me ≤ mid then partner = me + size/2 else partner = me − size/2
  if me ≤ mid, BDEAllgather( x, left, mid )
  else, BDEAllgather( x, mid+1, right )
  if me ≤ mid
    Send( x_{left:mid}, partner )
    Recv( x_{mid+1:right}, partner )
  else
    Send( x_{mid+1:right}, partner )
    Recv( x_{left:mid}, partner )
(a)

BDEReduce-scatter( x, left, right )
  if left == right, return
  size = right − left + 1
  mid = ⌊(left + right)/2⌋
  if me ≤ mid then partner = me + size/2 else partner = me − size/2
  if me ≤ mid
    Send( x_{mid+1:right}, partner )
    Recv( tmp, partner )
    x_{left:mid} = x_{left:mid} + tmp
  else
    Send( x_{left:mid}, partner )
    Recv( tmp, partner )
    x_{mid+1:right} = x_{mid+1:right} + tmp
  if me ≤ mid, BDEReduce-scatter( x, left, mid )
  else, BDEReduce-scatter( x, mid+1, right )
(b)

Figure 7: Bidirectional exchange algorithms for allgather and reduce-scatter.
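The correctness of the recursive-halving reduce-scatter can also be simulated (our own sketch, mirroring the structure of Fig. 7(b): the cross-half exchange happens first, then the recursion). Instead of numeric values, each block tracks the set of source nodes already folded into it:

```python
def bde_reduce_scatter(contrib, left, right):
    """Recursive halving on nodes left..right: contrib[i][j] is the set of
    source nodes whose contribution x^(j) has been reduced into node i's
    copy of block j. Nodes in the left half keep the left blocks."""
    if left == right:
        return
    mid = (left + right) // 2
    half = (right - left + 1) // 2
    for me in range(left, mid + 1):
        partner = me + half
        for j in range(left, mid + 1):       # me absorbs the left blocks
            contrib[me][j] |= contrib[partner][j]
        for j in range(mid + 1, right + 1):  # partner absorbs the right blocks
            contrib[partner][j] |= contrib[me][j]
    bde_reduce_scatter(contrib, left, mid)
    bde_reduce_scatter(contrib, mid + 1, right)

def simulate_reduce_scatter(p):
    contrib = [[{i} for _ in range(p)] for i in range(p)]
    bde_reduce_scatter(contrib, 0, p - 1)
    return contrib
```

The expected postcondition is that node i ends up holding block i fully reduced, i.e., containing contributions from all p nodes.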

Figure 8: Recursive-doubling algorithm for allgather.

Figure 9: Recursive-halving algorithm for reduce-scatter. Notation: x_i^(j₁:j₂:s) = Σ_j x_i^(j) where j ∈ {j₁, j₁ + s, …, j₂}, and x_i^(j₁:j₂) = Σ_j x_i^(j) where j ∈ {j₁, j₁ + 1, …, j₂}.

We will revisit BDE algorithms, as a special case of the bucket algorithms discussed next, on hypercubes in Section 7.3.

Bucket algorithms

An alternative approach to the implementation of the allgather operation views the nodes as a ring, embedded in a linear array by taking advantage of the fact that messages traversing a link in opposite directions do not conflict. At each step, all nodes send data to the node to their right. In this fashion, the subvectors that start on the individual nodes are eventually distributed to all nodes. The bucket (BKT) algorithm, also called the cyclic algorithm, is given in Fig. 10(a) and is illustrated in Fig. 11. Notice that if each node starts with an equal subvector of data, the cost of this approach is given by

T_BKTAllgather(p, n) = (p − 1)(α + (n/p) β) = (p − 1)α + ((p − 1)/p) nβ,

achieving the lower bound for the bandwidth component of the cost.
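The ring traversal can be simulated with a short sketch (our own code, not the paper's): in each of the p − 1 steps, every node forwards the block it received most recently to its right neighbor.

```python
def bkt_allgather(p):
    """Simulate the bucket (cyclic) allgather on a ring of p nodes."""
    have = [{i} for i in range(p)]   # node i starts with subvector x_i
    passing = list(range(p))         # index of the block node i forwards next
    for _ in range(p - 1):
        received = [passing[(i - 1) % p] for i in range(p)]  # from the left
        for i in range(p):
            have[i].add(received[i])
        passing = received           # forward what was just received
    return have
```

After p − 1 steps each node has seen every block exactly once, which matches the (p − 1)(α + (n/p)β) cost: one message of n/p items per step.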

BKTAllgather( x )
  prev = me − 1
  if prev < 0, then prev = p − 1
  next = me + 1
  if next == p, then next = 0
  curi = me
  for i = 1, …, p − 1
    Send( x_{curi}, next )
    curi = curi − 1
    if curi < 0, then curi = p − 1
    Recv( x_{curi}, prev )
  endfor
(a)

BKTReduce-scatter( x )
  prev = me − 1
  if prev < 0, then prev = p − 1
  next = me + 1
  if next == p, then next = 0
  curi = next
  for i = 1, …, p − 1
    Send( x_{curi}, prev )
    curi = curi + 1
    if curi == p, then curi = 0
    Recv( tmp, next )
    x_{curi} = x_{curi} + tmp
  endfor
(b)

Figure 10: Bucket algorithms for allgather and reduce-scatter.

A simple optimization comes from preposting all receives, after which a single synchronization with the node to the left (indicating that all receives have been posted) is required before sending commences, so that a one-pass protocol can be used. This synchronization itself can be implemented by sending a zero-length message, at a cost of α₁ in our model. The remaining sends each also incur only α₁ as a latency cost, for a total cost of

T_BKTAllgather(p, n) = α₁ + (p − 1)α₁ + ((p − 1)/p) nβ = pα₁ + ((p − 1)/p) nβ.

Similarly, the reduce-scatter operation can be implemented by a simultaneous passing of messages around the ring to the left. The algorithm is given in Fig. 10(b) and is illustrated in Fig. 12. This time, partial results toward the total reduction of the subvectors are accumulated as the messages pass around the ring. With a similar strategy for preposting receives, the cost of this algorithm is given by

T_BKTReduce-scatter(p, n) = pα₁ + ((p − 1)/p) n(β + γ).

6.4 Allreduce

Like the reduce-scatter, the allreduce can also be implemented using a BDE algorithm. This time, at each step the entire vector is exchanged and added to the local result. This algorithm is given in Fig. 13 and is illustrated in Fig. 14. In the absence of network conflicts the cost of this algorithm is

T_BDEAllreduce(p, n) = log₂(p)(α + nβ + nγ).

This cost attains the lower bound only for the latency component.

7 Moving On

We now discuss how to pick and/or combine algorithms as a function of architecture, number of nodes, and vector length.
We do so by presenting strategies for different types of architectures, building upon the algorithms already presented. Recall that short messages incur a latency of α₁.
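The reason the strategy depends on vector length follows directly from the cost model of Section 2, as this toy illustration shows (our own code, with made-up parameter values):

```python
# Hypothetical machine parameters: alpha is typically four to five orders
# of magnitude larger than beta, as noted in Section 2.
ALPHA = 1.0e-5  # startup time per message (seconds, illustrative)
BETA = 1.0e-9   # transfer time per data item (seconds, illustrative)

def message_cost(n, alpha=ALPHA, beta=BETA):
    """Cost of one point-to-point message of n items: alpha + n*beta."""
    return alpha + n * beta

# Fraction of the message cost spent on startup:
startup_share_short = ALPHA / message_cost(100)    # latency dominates
startup_share_long = ALPHA / message_cost(10**8)   # bandwidth dominates
```

For short vectors almost the entire cost is the α term, so algorithms should minimize the number of message steps; for long vectors the β (and γ) terms dominate, so algorithms should minimize the volume sent per node.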

Figure 11: Bucket algorithm for allgather.

Figure 12: Bucket algorithm for reduce-scatter.

7.1 Linear arrays

On linear arrays, the MST Bcast and MST Reduce algorithms achieve the lower bound for the α term, while the BKT Allgather and BKT Reduce-scatter algorithms achieve the lower bound for the β term. The MST Scatter and MST Gather algorithms achieve the lower bounds for all vector lengths. BDE algorithms are undesirable since they require p = 2^d nodes and because they inherently incur network conflicts. The following strategy provides simple algorithms that have merit in the extreme cases of short and long vector lengths. Fig. 15 and Fig. 16 summarize this strategy, where short and long vector algorithms can be used as building blocks to compose the different collective communication operations.

7.1.1 Broadcast

Short vectors: MST algorithm.

Long vectors: MST Scatter followed by BKT Allgather. The approximate cost is

T_Scatter-Allgather(p, n) = ⌈log₂(p)⌉ α + ((p − 1)/p) nβ + pα₁ + ((p − 1)/p) nβ

BDEAllreduce( x, left, right )
  if left == right, return
  size = right − left + 1
  mid = ⌊(left + right)/2⌋
  if me ≤ mid then partner = me + size/2 else partner = me − size/2
  if me ≤ mid
    Send( x, partner )
    Recv( tmp, partner )
    x = x + tmp
  else
    Send( x, partner )
    Recv( tmp, partner )
    x = x + tmp
  if me ≤ mid, BDEAllreduce( x, left, mid )
  else, BDEAllreduce( x, mid+1, right )

Figure 13: Bidirectional exchange algorithm for allreduce.

Figure 14: Bidirectional exchange algorithm for allreduce.

    Reduce                                ⌈log(p)⌉ α + ⌈log(p)⌉ n(β + γ)
    Scatter                               ⌈log(p)⌉ α + ((p−1)/p) nβ
    Gather                                ⌈log(p)⌉ α + ((p−1)/p) nβ
    Broadcast                             ⌈log(p)⌉ α + ⌈log(p)⌉ nβ
    Reduce-scatter (Reduce/Scatter)       2⌈log(p)⌉ α + ⌈log(p)⌉ n(β + γ) + ((p−1)/p) nβ
    Allreduce (Reduce/Broadcast)          2⌈log(p)⌉ α + ⌈log(p)⌉ n(β + γ) + ⌈log(p)⌉ nβ
    Allgather (Gather/Broadcast)          2⌈log(p)⌉ α + ⌈log(p)⌉ nβ + ((p−1)/p) nβ

Figure 5: A building block approach to short vector algorithms on linear arrays.

    Reduce-scatter                        (p−1) α + ((p−1)/p) n(β + γ)
    Gather                                ⌈log(p)⌉ α + ((p−1)/p) nβ
    Scatter                               ⌈log(p)⌉ α + ((p−1)/p) nβ
    Allgather                             (p−1) α + ((p−1)/p) nβ
    Reduce (Reduce-scatter/Gather)        ((p−1) + ⌈log(p)⌉) α + 2((p−1)/p) nβ + ((p−1)/p) nγ
    Allreduce (Reduce-scatter/Allgather)  2(p−1) α + 2((p−1)/p) nβ + ((p−1)/p) nγ
    Broadcast (Scatter/Allgather)         ((p−1) + ⌈log(p)⌉) α + 2((p−1)/p) nβ

Figure 6: A building block approach to long vector algorithms on linear arrays.
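To make the short/long trade-off concrete, the two broadcast entries from Figs. 5 and 6 can be evaluated directly. The machine parameters below (10 μs latency, 1 ns per item) are illustrative assumptions for demonstration, not measured values:

```python
import math

def mst_bcast_cost(p, n, alpha, beta):
    """Short vector broadcast of Fig. 5: MST algorithm."""
    lg = math.ceil(math.log2(p))
    return lg * alpha + lg * n * beta

def scatter_allgather_cost(p, n, alpha, beta):
    """Long vector broadcast of Fig. 6: MST Scatter followed by BKT Allgather."""
    lg = math.ceil(math.log2(p))
    return (p - 1 + lg) * alpha + 2 * (p - 1) / p * n * beta

# Illustrative (assumed) parameters: alpha = 1e-5 s, beta = 1e-9 s per item.
P, ALPHA, BETA = 256, 1e-5, 1e-9
```

For small n the ⌈log(p)⌉ α term of the MST broadcast wins; once the β terms dominate, the scatter/allgather cost is roughly ⌈log(p)⌉/2 times cheaper.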

        = ((p−1) + ⌈log(p)⌉) α + 2((p−1)/p) nβ.

As n gets large and the β term dominates, this cost is approximately ⌈log(p)⌉/2 times faster than the MST Bcast algorithm and within a factor of two of the lower bound.

7.1.2 Reduce

Short vectors: MST algorithm.

Long vectors: BKT Reduce-scatter (the dual of the allgather) followed by an MST Gather (the dual of the scatter). This time all the receives for the gather can be preposted before the reduce-scatter commences, by observing that the completion of the reduce-scatter signals that all buffers for the gather are already available. Since ready-receive sends can be used during the gather, the cost becomes

    T_{Reduce-scatter-Gather}(p, n) = (p−1) α + ((p−1)/p) n(β + γ) + ⌈log(p)⌉ α + ((p−1)/p) nβ
                                    = ((p−1) + ⌈log(p)⌉) α + 2((p−1)/p) nβ + ((p−1)/p) nγ.

Again, the β term is within a factor of two of the lower bound while the γ term is optimal.

7.1.3 Scatter

Short vectors: MST algorithm.

Long vectors: Sending individual messages from the root to each of the other nodes. While the cost, (p−1) α + ((p−1)/p) nβ, is clearly worse than that of the MST algorithm, in practice the β term has sometimes been observed to be smaller, possibly because the cost of each message can be overlapped with those of other messages. We will call this the simple (SMPL) algorithm.

7.1.4 Gather

Same as scatter, in reverse.

7.1.5 Allgather

Short vectors: MST Gather followed by MST Bcast. To reduce the α term, the receives for the MST Bcast can be posted before the MST Gather commences, yielding a cost of

    T_{Gather-Bcast}(p, n) = ⌈log(p)⌉ α + ((p−1)/p) nβ + ⌈log(p)⌉ α + ⌈log(p)⌉ nβ
                           = 2⌈log(p)⌉ α + (⌈log(p)⌉ + (p−1)/p) nβ.

The α term of this cost is close to the lower bound of ⌈log(p)⌉ α.

Long vectors: BKT algorithm.

7.1.6 Reduce-scatter

Short vectors: MST Reduce followed by MST Scatter. The receives for the MST Scatter can be posted before the MST Reduce commences, for a cost of

    T_{Reduce-Scatter}(p, n) = ⌈log(p)⌉ α + ⌈log(p)⌉ n(β + γ) + ⌈log(p)⌉ α + ((p−1)/p) nβ
                             = 2⌈log(p)⌉ α + (⌈log(p)⌉ + (p−1)/p) nβ + ⌈log(p)⌉ nγ.

The α term of this cost is close to the lower bound of ⌈log(p)⌉ α.

Long vectors: BKT algorithm.

7.1.7 Allreduce

Short vectors: MST Reduce followed by MST Bcast. The receives for the MST Bcast can be posted before the MST Reduce commences, for a cost of

    T_{Reduce-Bcast}(p, n) = ⌈log(p)⌉ α + ⌈log(p)⌉ n(β + γ) + ⌈log(p)⌉ α + ⌈log(p)⌉ nβ
                           = 2⌈log(p)⌉ α + 2⌈log(p)⌉ nβ + ⌈log(p)⌉ nγ.

The α term of this cost is close to the lower bound of ⌈log(p)⌉ α.

Long vectors: BKT Reduce-scatter followed by BKT Allgather. The approximate cost is

    T_{Reduce-scatter-Allgather}(p, n) = (p−1) α + ((p−1)/p) n(β + γ) + (p−1) α + ((p−1)/p) nβ
                                       = 2(p−1) α + 2((p−1)/p) nβ + ((p−1)/p) nγ.

This cost achieves the lower bound for the β and γ terms.

7.2 Multidimensional meshes

Next, we show that on multidimensional meshes the α term can be substantially improved for long vector algorithms, relative to linear arrays. The key here is to observe that each row and column in a two-dimensional mesh forms a linear array and that all of our collective communication operations can be formulated as performing the operation first within rows and then within columns, or vice versa. For the two-dimensional mesh, we will assume that the p nodes physically form an r × c mesh and that the nodes are indexed in row-major order. For a mesh of dimension d, we will assume that the nodes are physically organized as a d₀ × d₁ × ⋯ × d_{d−1} mesh.

7.2.1 Broadcast

Short vectors: MST algorithm within columns followed by MST algorithm within rows. The cost is

    T_{Bcast-Bcast}(r, c, n) = (⌈log(r)⌉ + ⌈log(c)⌉)(α + nβ) ≈ ⌈log(p)⌉ (α + nβ).

Generalizing to a d-dimensional mesh yields an algorithm with a cost of

    Σ_{k=0}^{d−1} ⌈log(d_k)⌉ (α + nβ) ≈ ⌈log(p)⌉ (α + nβ).

Observe that the MST algorithm executed successively in multiple dimensions incurs approximately the same cost as performing it in just one dimension (e.g., on a linear array).

Long vectors: MST Scatter within columns, MST Scatter within rows, BKT Allgather within rows, BKT Allgather within columns. The approximate cost is

    T_{Scatter-Scatter-Allgather-Allgather}(r, c, n)
        = ⌈log(r)⌉ α + ((r−1)/r) nβ + ⌈log(c)⌉ α + ((c−1)/c)(n/r) β
          + (c−1) α + ((c−1)/c)(n/r) β + (r−1) α + ((r−1)/r) nβ

        = (c + r − 2) α + (⌈log(c)⌉ + ⌈log(r)⌉) α + 2((p−1)/p) nβ
        ≈ (c + r − 2) α + ⌈log(p)⌉ α + 2nβ.

As n gets large and the β term dominates, this cost is still within a factor of two of the lower bound for β, while the α term has been greatly reduced. Generalizing to a d-dimensional mesh yields an algorithm with a cost of

    Σ_{k=0}^{d−1} (d_k − 1) α + Σ_{k=0}^{d−1} ⌈log(d_k)⌉ α + 2((p−1)/p) nβ ≈ Σ_{k=0}^{d−1} (d_k − 1) α + ⌈log(p)⌉ α + 2nβ.

7.2.2 Reduce

Short vectors: MST algorithm within rows followed by MST algorithm within columns. The cost is

    T_{Reduce-Reduce}(r, c, n) = (⌈log(c)⌉ + ⌈log(r)⌉)(α + n(β + γ)) ≈ ⌈log(p)⌉ (α + n(β + γ)).

Generalizing to a d-dimensional mesh yields an algorithm with a cost of

    Σ_{k=0}^{d−1} ⌈log(d_k)⌉ α + Σ_{k=0}^{d−1} ⌈log(d_k)⌉ n(β + γ) ≈ ⌈log(p)⌉ (α + n(β + γ)).

Long vectors: BKT Reduce-scatter within rows, BKT Reduce-scatter within columns, MST Gather within columns, MST Gather within rows. The approximate cost is

    T_{Reduce-scatter-Reduce-scatter-Gather-Gather}(r, c, n)
        = (c−1) α + ((c−1)/c) n(β + γ) + (r−1) α + ((r−1)/r)(n/c)(β + γ)
          + ⌈log(r)⌉ α + ((r−1)/r)(n/c) β + ⌈log(c)⌉ α + ((c−1)/c) nβ
        = (c + r − 2) α + (⌈log(r)⌉ + ⌈log(c)⌉) α + 2((p−1)/p) nβ + ((p−1)/p) nγ
        ≈ (c + r − 2) α + ⌈log(p)⌉ α + 2nβ + nγ.

Generalizing to a d-dimensional mesh yields an algorithm with a cost of

    Σ_{k=0}^{d−1} (d_k − 1) α + Σ_{k=0}^{d−1} ⌈log(d_k)⌉ α + 2((p−1)/p) nβ + ((p−1)/p) nγ
        ≈ Σ_{k=0}^{d−1} (d_k − 1) α + ⌈log(p)⌉ α + 2nβ + nγ.

As n gets large and the β term dominates, this cost is within a factor of two of the lower bound for β. The γ term is optimal.

7.2.3 Scatter

Short vectors: MST algorithms successively in each of the dimensions.

Long vectors: SMPL algorithms successively in each of the dimensions.
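The α savings of the mesh formulation over the linear array can be checked numerically. The sketch below compares the two long vector broadcast costs; the parameters are again illustrative assumptions:

```python
import math

def linear_bcast_long(p, n, alpha, beta):
    """Long vector broadcast on a linear array: MST Scatter + BKT Allgather."""
    return (p - 1 + math.ceil(math.log2(p))) * alpha + 2 * (p - 1) / p * n * beta

def mesh2d_bcast_long(r, c, n, alpha, beta):
    """Long vector broadcast on an r x c mesh:
    Scatter (columns), Scatter (rows), Allgather (rows), Allgather (columns)."""
    p = r * c
    lg = math.ceil(math.log2(r)) + math.ceil(math.log2(c))
    return (c + r - 2 + lg) * alpha + 2 * (p - 1) / p * n * beta
```

The β terms are identical, so the mesh version wins for every n: its α coefficient is c + r − 2 rather than p − 1 (for example, 30 versus 255 when p = 256 and r = c = 16).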

7.2.4 Gather

Same as scatter, in reverse.

7.2.5 Allgather

Short vectors: MST Gather within rows, MST Bcast within rows, MST Gather within columns, MST Bcast within columns. Again, preposting can be used to reduce the α term for the broadcasts, for a total cost of

    T_{Gather-Bcast-Gather-Bcast}(r, c, n)
        = ⌈log(c)⌉ α + ((c−1)/c)(n/r) β + ⌈log(c)⌉ α + ⌈log(c)⌉ (n/r) β
          + ⌈log(r)⌉ α + ((r−1)/r) nβ + ⌈log(r)⌉ α + ⌈log(r)⌉ nβ
        = 2(⌈log(r)⌉ + ⌈log(c)⌉) α + ((p−1)/p + ⌈log(c)⌉/r + ⌈log(r)⌉) nβ
        ≈ 2⌈log(p)⌉ α + ((p−1)/p + ⌈log(c)⌉/r + ⌈log(r)⌉) nβ.

Generalizing to a d-dimensional mesh yields an algorithm with a cost of

    2 Σ_{k=0}^{d−1} ⌈log(d_k)⌉ α + ((p−1)/p + Σ_{k=0}^{d−1} ⌈log(d_k)⌉/(d_{k+1} ⋯ d_{d−1})) nβ
        ≈ 2⌈log(p)⌉ α + ((p−1)/p + Σ_{k=0}^{d−1} ⌈log(d_k)⌉/(d_{k+1} ⋯ d_{d−1})) nβ,

where the empty product (for k = d−1) equals one. Notice that the α term remains close to the lower bound of ⌈log(p)⌉ α while the β term has been reduced!

Long vectors: BKT Allgather within rows, BKT Allgather within columns. The cost is

    T_{Allgather-Allgather}(r, c, n) = (c−1) α + ((c−1)/c)(n/r) β + (r−1) α + ((r−1)/r) nβ
                                     = (c + r − 2) α + ((p−1)/p) nβ.

Generalizing to a d-dimensional mesh yields an algorithm with a cost of

    Σ_{k=0}^{d−1} (d_k − 1) α + ((p−1)/p) nβ.

7.2.6 Reduce-scatter

Short vectors: MST Reduce within columns, MST Scatter within columns, MST Reduce within rows, MST Scatter within rows. Preposting the receives for the scatter operations yields a total cost of

    T_{Reduce-Scatter-Reduce-Scatter}(r, c, n)
        = ⌈log(r)⌉ α + ⌈log(r)⌉ n(β + γ) + ⌈log(r)⌉ α + ((r−1)/r) nβ
          + ⌈log(c)⌉ α + ⌈log(c)⌉ (n/r)(β + γ) + ⌈log(c)⌉ α + ((c−1)/c)(n/r) β
        = 2(⌈log(r)⌉ + ⌈log(c)⌉) α + ((p−1)/p + ⌈log(c)⌉/r + ⌈log(r)⌉) nβ + (⌈log(c)⌉/r + ⌈log(r)⌉) nγ
        ≈ 2⌈log(p)⌉ α + ((p−1)/p + ⌈log(c)⌉/r + ⌈log(r)⌉) nβ + (⌈log(c)⌉/r + ⌈log(r)⌉) nγ.

Generalizing to a d-dimensional mesh yields an algorithm with a cost of

    2 Σ_{k=0}^{d−1} ⌈log(d_k)⌉ α + ((p−1)/p + Σ_{k=0}^{d−1} ⌈log(d_k)⌉/(d₀ ⋯ d_{k−1})) nβ
      + (Σ_{k=0}^{d−1} ⌈log(d_k)⌉/(d₀ ⋯ d_{k−1})) nγ
    ≈ 2⌈log(p)⌉ α + ((p−1)/p + Σ_{k=0}^{d−1} ⌈log(d_k)⌉/(d₀ ⋯ d_{k−1})) nβ + (Σ_{k=0}^{d−1} ⌈log(d_k)⌉/(d₀ ⋯ d_{k−1})) nγ,

where the empty product (for k = 0) equals one. Notice that the α term remains close to the lower bound of ⌈log(p)⌉ α while both the β and γ terms have been reduced!

Long vectors: BKT Reduce-scatter within rows, BKT Reduce-scatter within columns. The cost is

    T_{Reduce-scatter-Reduce-scatter}(r, c, n) = (c−1) α + ((c−1)/c) n(β + γ) + (r−1) α + ((r−1)/r)(n/c)(β + γ)
                                               = (r + c − 2) α + ((p−1)/p) n(β + γ).

Generalizing to a d-dimensional mesh yields an algorithm with a cost of

    Σ_{k=0}^{d−1} (d_k − 1) α + ((p−1)/p) n(β + γ).

7.2.7 Allreduce

Short vectors: MST Reduce followed by MST Bcast (both discussed above), preposting the receives for the broadcast. The approximate cost is

    T_{Reduce-Bcast}(d, n) = 2 Σ_{k=0}^{d−1} ⌈log(d_k)⌉ α + 2 Σ_{k=0}^{d−1} ⌈log(d_k)⌉ nβ + Σ_{k=0}^{d−1} ⌈log(d_k)⌉ nγ.

Long vectors: BKT Reduce-scatter followed by BKT Allgather (both discussed above). The approximate cost is

    T_{Reduce-scatter-Allgather}(d, n) = 2 Σ_{k=0}^{d−1} (d_k − 1) α + 2((p−1)/p) nβ + ((p−1)/p) nγ.

This cost achieves the lower bound for the β and γ terms.

7.3 Hypercubes

We now argue that the discussion of algorithms for multidimensional meshes includes all algorithms already discussed for hypercubes. Recall that a d-dimensional hypercube is simply a d-dimensional mesh with d₀ = ⋯ = d_{d−1} = 2. Now, the short vector algorithm for broadcasting successively in each of the dimensions

becomes the MST Bcast on hypercubes. The long vector algorithm for allgather that successively executes a BKT algorithm in each dimension is equivalent to a BDE algorithm on hypercubes. Similar connections can be made for the other algorithms discussed for multidimensional meshes. The conclusion is that we can concentrate on optimizing algorithms for multidimensional meshes. A by-product of the analyses will be optimized algorithms for hypercubes.

7.4 Fully connected architectures

Fully connected architectures can be viewed as multidimensional meshes, so that, as noted for hypercube architectures, it suffices to analyze the optimization of algorithms for multidimensional meshes.

8 Strategies for All Vector Lengths

We have developed numerous algorithms for the short and long vector cases. They have been shown to be part of a consistent family rather than a bag full of algorithms. The natural next question becomes how to deal with vectors of intermediate length. A naive solution would be to determine the crossover point between the short and long vector costs and switch algorithms at that crossover point. In this section, we show that one can do much better with hybrid algorithms. Key to this approach is the recognition that all collective communications have in common the property that the operation performed among all nodes yields the same result as when the nodes are logically viewed as a two-dimensional mesh and the operation is performed first in one dimension and next in the second dimension. This observation leaves the possibility of using a different algorithm in each of the two dimensions.

8.1 A prototypical example: broadcast

Consider a broadcast on an r × c mesh of nodes. A broadcast among all nodes can then be implemented as

Step 1: A broadcast within the row of nodes that includes the original root.

Step 2: Simultaneous broadcasts within columns of nodes, where the roots of the columns are in the same row as the original root.
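The two steps can be sketched as a sequential simulation over node buffers. This is a toy correctness model of the decomposition, not MPI code:

```python
def two_step_broadcast(r, c, root, data):
    """Simulate a broadcast on an r x c grid as a row broadcast from the
    root's row, followed by simultaneous column broadcasts."""
    buf = [[None] * c for _ in range(r)]
    ri, ci = divmod(root, c)   # row-major indexing, as assumed in the text
    buf[ri][ci] = data
    # Step 1: broadcast within the row that includes the original root.
    for j in range(c):
        buf[ri][j] = buf[ri][ci]
    # Step 2: broadcasts within columns, rooted at row ri.
    for j in range(c):
        for i in range(r):
            buf[i][j] = buf[ri][j]
    return buf
```

After both steps every node buffer holds the root's data, which is what licenses choosing a different broadcast algorithm for each step.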
For each of these two steps, a different broadcast algorithm can be chosen. Now, consider the case where in Step 1 a long vector algorithm is chosen: MST Scatter followed by BKT Allgather. It is beneficial to orchestrate the broadcast as

Step 1a: MST Scatter within the row of nodes that includes the original root.

Step 2: Simultaneous broadcasts within columns of nodes, where the roots of the columns are in the same row as the original root.

Step 1b: Simultaneous BKT Allgather within rows of nodes.

The benefit is that now in Step 2 the broadcasts involve vectors of length approximately n/c. This algorithm is illustrated in Fig. 7. For fixed r and c, combinations of short and long vector algorithms for Steps 1 and 2 are examined in Fig. 8. Notice that:

Option 2 is never better than Option 1, since ⌈log(r)⌉ + ⌈log(c)⌉ ≥ ⌈log(p)⌉.

Option 4 is never better than Option 3. The observation here is that if a scatter-allgather broadcast is to be used, it should be as part of Steps 1a-1b, so that the length of the vectors to be broadcast during Step 2 is reduced.

Option 5 is generally better than Option 6, since the α terms are identical while the β term is smaller for Option 5 than for Option 6 (since n/p < n/r).

[Figure 7: Broadcast on a (logical) 2D mesh implemented by (1a) a scatter within the row that includes the root, followed by (2) a broadcast within the columns, followed by (1b) an allgather within rows.]

    Option 1: MST Bcast (all nodes).
        Cost: ⌈log(p)⌉ α + ⌈log(p)⌉ nβ
    Option 2: Step 1: MST Bcast; Step 2: MST Bcast.
        Cost: (⌈log(r)⌉ + ⌈log(c)⌉) α + (⌈log(r)⌉ + ⌈log(c)⌉) nβ
    Option 3: Step 1a: MST Scatter; Step 2: MST Bcast; Step 1b: BKT Allgather.
        Cost: (c−1) α + (⌈log(r)⌉ + ⌈log(c)⌉) α + (⌈log(r)⌉/c + 2(c−1)/c) nβ
    Option 4: Step 1: MST Bcast; Step 2: MST Scatter-BKT Allgather.
        Cost: (r−1) α + (⌈log(r)⌉ + ⌈log(c)⌉) α + (⌈log(c)⌉ + 2(r−1)/r) nβ
    Option 5: Step 1a: MST Scatter; Step 2: MST Scatter-BKT Allgather; Step 1b: BKT Allgather.
        Cost: (r + c − 2) α + (⌈log(r)⌉ + ⌈log(c)⌉) α + 2((p−1)/p) nβ
    Option 6: Step 1: MST Scatter-BKT Allgather; Step 2: MST Scatter-BKT Allgather.
        Cost: (r + c − 2) α + (⌈log(r)⌉ + ⌈log(c)⌉) α + 2((r−1)/r + (c−1)/c) nβ
    Option 7: MST Scatter-BKT Allgather (all nodes).
        Cost: (p−1) α + ⌈log(p)⌉ α + 2((p−1)/p) nβ

Figure 8: Various approaches to implementing a broadcast on an r × c mesh. The cost analysis assumes no network conflicts occur.

Option 5 is generally better than Option 7, because the β terms are identical while the α term has been reduced, since r + c − 2 < p − 1 when p is not prime. Thus, we find that there are three algorithms of interest:

Option 1: MST Bcast among all nodes. This algorithm is best when the vector lengths are short.

Option 3: MST Scatter-MST Bcast-BKT Allgather. This algorithm is best when the vector length is such that the scatter leaves subvectors that are considered to be short when broadcast among r nodes.

Option 5: MST Scatter-MST Scatter-BKT Allgather-BKT Allgather. This algorithm is best when the vector length is such that the scatter leaves subvectors that are considered to be long when broadcast among r nodes, if multiple integer factorizations of p exist.

In Fig. 9 (left) we show the predicted cost, in time, of each of these options on a fully connected machine with p = 256, r = c = 16, and representative values of α, β, and γ. The graph clearly shows how the different options trade lowering the α term for increasing the β term, the extremes being the MST Scatter-BKT Allgather, with the greatest α term and lowest β term, and the MST Bcast, with the lowest α term and greatest β term.
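The regimes in which Options 1, 3, and 5 each win can be reproduced from the Fig. 8 cost expressions. The parameter values below are illustrative stand-ins, not the machine constants used for Fig. 9:

```python
import math

def lg(x):
    return math.ceil(math.log2(x))

def option1(p, r, c, n, a, b):   # MST Bcast among all nodes
    return lg(p) * a + lg(p) * n * b

def option3(p, r, c, n, a, b):   # Scatter, column Bcast, Allgather
    return (c - 1 + lg(r) + lg(c)) * a + (lg(r) / c + 2 * (c - 1) / c) * n * b

def option5(p, r, c, n, a, b):   # Scatter, Scatter, Allgather, Allgather
    return (r + c - 2 + lg(r) + lg(c)) * a + 2 * (p - 1) / p * n * b

def option7(p, r, c, n, a, b):   # Scatter + Allgather among all nodes
    return (p - 1 + lg(p)) * a + 2 * (p - 1) / p * n * b

def best_option(n, p=256, r=16, c=16, a=1e-5, b=1e-9):
    """Return which option has the lowest predicted time for length n."""
    costs = {1: option1(p, r, c, n, a, b), 3: option3(p, r, c, n, a, b),
             5: option5(p, r, c, n, a, b), 7: option7(p, r, c, n, a, b)}
    return min(costs, key=costs.get)
```

With these (assumed) parameters, Option 1 wins for short vectors, Option 3 for intermediate lengths, and Option 5 once the β term dominates; Option 7 never wins, matching the discussion above.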
Naturally, many combinations of r and c can be chosen, and the technique can be extended to more than two dimensions, which we discuss next.

8.2 Optimal hybridization on hypercubes

We now discuss how to optimally combine short and long vector algorithms for the broadcast operation. Our strategy is to develop theoretical results for hypercubes, first reported in [5] for the allgather operation. This theoretical result will then motivate heuristics for multidimensional meshes, discussed in Section 8.3. Assume that p = 2^d and that the nodes form a hypercube architecture (or, alternatively, are fully connected). We assume that a vector of length n is to be broadcast and, for simplicity, that n is an integer multiple of p. A strategy will be indicated by an integer D, 1 ≤ D ≤ d, and two vectors, (a₁, a₂, …, a_D) and (d₁, d₂, …, d_D), with d₁ + ⋯ + d_D = d. The idea is that the processors are viewed as a D-dimensional mesh, that the ith dimension of that mesh has 2^{d_i} nodes that themselves form a hypercube, and that a_i ∈ {long, short} indicates whether a long or short vector algorithm is to be used in the ith dimension.

[Figure 9: Predicted performance of various hybrid broadcast algorithms (p = 256). Left: performance of Options 1, 3, 5, and 7. Right: performance of many hybrids on a hypercube or fully connected architecture.]

[Figure 10: Predicted performance of various hybrid broadcast algorithms on a 5 × 7 mesh.]

If a_i = long, then a long vector algorithm is used in the ith dimension of the mesh. The vector is scattered among the 2^{d_i} processors, each piece (now reduced in length by a factor 2^{d_i}) is broadcast among the remaining dimensions of the mesh using the strategy (a_{i+1}, …, a_D), (d_{i+1}, …, d_D), and the result is then collected via BKT Allgather.

If a_i = short, then MST Bcast is used in the ith dimension of the mesh, and a broadcast among the remaining dimensions of the mesh using the strategy (a_{i+1}, …, a_D), (d_{i+1}, …, d_D) is employed.

The cost of a strategy is given by the inductively defined cost function

    C(n, (a₁, …, a_D), (d₁, …, d_D)) =
        0                                                                             if D = 0,
        (2^{d₁} − 1) α + d₁ α + 2((2^{d₁} − 1)/2^{d₁}) nβ
            + C(n/2^{d₁}, (a₂, …, a_D), (d₂, …, d_D))                                 if D > 0 and a₁ = long,
        d₁ α + d₁ nβ + C(n, (a₂, …, a_D), (d₂, …, d_D))                               if D > 0 and a₁ = short.

Some simple observations are:

Assume d_i > 1. Then

    C(n, (a₁, …, a_i, …, a_D), (d₁, …, d_i, …, d_D))
        ≥ C(n, (a₁, …, a_i, a_i, …, a_D), (d₁, …, 1, d_i − 1, …, d_D)),

that is, splitting the ith dimension into two dimensions of 2 and 2^{d_i − 1} nodes that use the same algorithm does not increase the cost. This observation tells us that an optimal (minimal cost) strategy can satisfy the restriction that D = d and d₁ = ⋯ = d_d = 1.

Assume a_i = short and a_{i+1} = long. Then

    C(n, (a₁, …, a_i, a_{i+1}, …, a_D), (d₁, …, d_i, d_{i+1}, …, d_D))
        ≥ C(n, (a₁, …, a_{i+1}, a_i, …, a_D), (d₁, …, d_{i+1}, d_i, …, d_D)).

This observation tells us that an optimal strategy can satisfy the restriction that a short vector algorithm is never used before a long vector algorithm.

Together these two observations indicate that an optimal strategy exists among the strategies that view the nodes as a d-dimensional mesh, with two nodes in each dimension, and that have the form (a₁, …, a_k, a_{k+1}, …, a_d), where a_i = long if i ≤ k and a_i = short if i > k. An optimal strategy can thus be determined by choosing k_opt, the number of times a long vector algorithm is chosen.
The cost of such a strategy is now given by

    C(n, k) = 2k α + ((2^{k+1} − 2)/2^k) nβ + (d − k) α + (d − k)(n/2^k) β
            = k α + d α + (2^{k+1} − 2 + d − k)(n/2^k) β.

Let us now examine C(n, k) vs. C(n, k+1):

    C(n, k+1) − C(n, k) = ((k+1) α + d α + (2^{k+2} − 2 + d − k − 1)(n/2^{k+1}) β)
                          − (k α + d α + (2^{k+1} − 2 + d − k)(n/2^k) β)
                        = α + (1 − d + k)(n/2^{k+1}) β.    (1)
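The closed form above, and the sign of the difference (1), can be checked by brute force. The following sketch uses assumed illustrative parameters and searches all k directly:

```python
def C(n, k, d, alpha, beta):
    """Closed-form cost of the hybrid strategy that uses a long vector
    algorithm in the first k of d hypercube dimensions, MST Bcast in the rest."""
    return k * alpha + d * alpha + (2 ** (k + 1) - 2 + d - k) * (n / 2 ** k) * beta

def k_opt(n, d, alpha, beta):
    """Number of long vector dimensions minimizing the predicted cost."""
    return min(range(d + 1), key=lambda k: C(n, k, d, alpha, beta))
```

Note that the difference formula shows k = d is never strictly preferable to k = d − 1: at k = d − 1 the difference reduces to α > 0, so at most d − 1 long vector steps are ever chosen.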

Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience 2007; 19:1749-1783. Published online 5 July 2007 in Wiley InterScience (www.interscience.wiley.com).


More information

AN INTEGER LINEAR MODEL FOR GENERAL ARC ROUTING PROBLEMS

AN INTEGER LINEAR MODEL FOR GENERAL ARC ROUTING PROBLEMS AN INTEGER LINEAR MODEL FOR GENERAL ARC ROUTING PROBLEMS Philie LACOMME, Christian PRINS, Wahiba RAMDANE-CHERIF Université de Technologie de Troyes, Laboratoire d Otimisation des Systèmes Industriels (LOSI)

More information

CMSC 754 Computational Geometry 1

CMSC 754 Computational Geometry 1 CMSC 754 Comutational Geometry 1 David M. Mount Deartment of Comuter Science University of Maryland Fall 2002 1 Coyright, David M. Mount, 2002, Det. of Comuter Science, University of Maryland, College

More information

10 File System Mass Storage Structure Mass Storage Systems Mass Storage Structure Mass Storage Structure FILE SYSTEM 1

10 File System Mass Storage Structure Mass Storage Systems Mass Storage Structure Mass Storage Structure FILE SYSTEM 1 10 File System 1 We will examine this chater in three subtitles: Mass Storage Systems OERATING SYSTEMS FILE SYSTEM 1 File System Interface File System Imlementation 10.1.1 Mass Storage Structure 3 2 10.1

More information

A Symmetric FHE Scheme Based on Linear Algebra

A Symmetric FHE Scheme Based on Linear Algebra A Symmetric FHE Scheme Based on Linear Algebra Iti Sharma University College of Engineering, Comuter Science Deartment. itisharma.uce@gmail.com Abstract FHE is considered to be Holy Grail of cloud comuting.

More information

Distributed Systems (5DV147)

Distributed Systems (5DV147) Distributed Systems (5DV147) Mutual Exclusion and Elections Fall 2013 1 Processes often need to coordinate their actions Which rocess gets to access a shared resource? Has the master crashed? Elect a new

More information

Patterned Wafer Segmentation

Patterned Wafer Segmentation atterned Wafer Segmentation ierrick Bourgeat ab, Fabrice Meriaudeau b, Kenneth W. Tobin a, atrick Gorria b a Oak Ridge National Laboratory,.O.Box 2008, Oak Ridge, TN 37831-6011, USA b Le2i Laboratory Univ.of

More information

Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations

Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, ith GPU imlementations Akihiko Kasagi, Koji Nakano, and Yasuaki Ito Deartment of Information Engineering Hiroshima

More information

Lecture 18. Today, we will discuss developing algorithms for a basic model for parallel computing the Parallel Random Access Machine (PRAM) model.

Lecture 18. Today, we will discuss developing algorithms for a basic model for parallel computing the Parallel Random Access Machine (PRAM) model. U.C. Berkeley CS273: Parallel and Distributed Theory Lecture 18 Professor Satish Rao Lecturer: Satish Rao Last revised Scribe so far: Satish Rao (following revious lecture notes quite closely. Lecture

More information

Mitigating the Impact of Decompression Latency in L1 Compressed Data Caches via Prefetching

Mitigating the Impact of Decompression Latency in L1 Compressed Data Caches via Prefetching Mitigating the Imact of Decomression Latency in L1 Comressed Data Caches via Prefetching by Sean Rea A thesis resented to Lakehead University in artial fulfillment of the requirement for the degree of

More information

Extracting Optimal Paths from Roadmaps for Motion Planning

Extracting Optimal Paths from Roadmaps for Motion Planning Extracting Otimal Paths from Roadmas for Motion Planning Jinsuck Kim Roger A. Pearce Nancy M. Amato Deartment of Comuter Science Texas A&M University College Station, TX 843 jinsuckk,ra231,amato @cs.tamu.edu

More information

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH Jin Lu, José M. F. Moura, and Urs Niesen Deartment of Electrical and Comuter Engineering Carnegie Mellon University, Pittsburgh, PA 15213 jinlu, moura@ece.cmu.edu

More information

A Study of Protocols for Low-Latency Video Transport over the Internet

A Study of Protocols for Low-Latency Video Transport over the Internet A Study of Protocols for Low-Latency Video Transort over the Internet Ciro A. Noronha, Ph.D. Cobalt Digital Santa Clara, CA ciro.noronha@cobaltdigital.com Juliana W. Noronha University of California, Davis

More information

Ad Hoc Networks. Latency-minimizing data aggregation in wireless sensor networks under physical interference model

Ad Hoc Networks. Latency-minimizing data aggregation in wireless sensor networks under physical interference model Ad Hoc Networks (4) 5 68 Contents lists available at SciVerse ScienceDirect Ad Hoc Networks journal homeage: www.elsevier.com/locate/adhoc Latency-minimizing data aggregation in wireless sensor networks

More information

To appear in IEEE TKDE Title: Efficient Skyline and Top-k Retrieval in Subspaces Keywords: Skyline, Top-k, Subspace, B-tree

To appear in IEEE TKDE Title: Efficient Skyline and Top-k Retrieval in Subspaces Keywords: Skyline, Top-k, Subspace, B-tree To aear in IEEE TKDE Title: Efficient Skyline and To-k Retrieval in Subsaces Keywords: Skyline, To-k, Subsace, B-tree Contact Author: Yufei Tao (taoyf@cse.cuhk.edu.hk) Deartment of Comuter Science and

More information

Auto-Tuning Distributed-Memory 3-Dimensional Fast Fourier Transforms on the Cray XT4

Auto-Tuning Distributed-Memory 3-Dimensional Fast Fourier Transforms on the Cray XT4 Auto-Tuning Distributed-Memory 3-Dimensional Fast Fourier Transforms on the Cray XT4 M. Gajbe a A. Canning, b L-W. Wang, b J. Shalf, b H. Wasserman, b and R. Vuduc, a a Georgia Institute of Technology,

More information

Distributed Estimation from Relative Measurements in Sensor Networks

Distributed Estimation from Relative Measurements in Sensor Networks Distributed Estimation from Relative Measurements in Sensor Networks #Prabir Barooah and João P. Hesanha Abstract We consider the roblem of estimating vectorvalued variables from noisy relative measurements.

More information

Layered Switching for Networks on Chip

Layered Switching for Networks on Chip Layered Switching for Networks on Chi Zhonghai Lu zhonghai@kth.se Ming Liu mingliu@kth.se Royal Institute of Technology, Sweden Axel Jantsch axel@kth.se ABSTRACT We resent and evaluate a novel switching

More information

Hardware-Accelerated Formal Verification

Hardware-Accelerated Formal Verification Hardare-Accelerated Formal Verification Hiroaki Yoshida, Satoshi Morishita 3 Masahiro Fujita,. VLSI Design and Education Center (VDEC), University of Tokyo. CREST, Jaan Science and Technology Agency 3.

More information

Fast Distributed Process Creation with the XMOS XS1 Architecture

Fast Distributed Process Creation with the XMOS XS1 Architecture Communicating Process Architectures 20 P.H. Welch et al. (Eds.) IOS Press, 20 c 20 The authors and IOS Press. All rights reserved. Fast Distributed Process Creation with the XMOS XS Architecture James

More information

Introduction to Visualization and Computer Graphics

Introduction to Visualization and Computer Graphics Introduction to Visualization and Comuter Grahics DH2320, Fall 2015 Prof. Dr. Tino Weinkauf Introduction to Visualization and Comuter Grahics Grids and Interolation Next Tuesday No lecture next Tuesday!

More information

Computing the Convex Hull of. W. Hochstattler, S. Kromberg, C. Moll

Computing the Convex Hull of. W. Hochstattler, S. Kromberg, C. Moll Reort No. 94.60 A new Linear Time Algorithm for Comuting the Convex Hull of a Simle Polygon in the Plane by W. Hochstattler, S. Kromberg, C. Moll 994 W. Hochstattler, S. Kromberg, C. Moll Universitat zu

More information

arxiv: v1 [cs.dc] 13 Nov 2018

arxiv: v1 [cs.dc] 13 Nov 2018 Task Grah Transformations for Latency Tolerance arxiv:1811.05077v1 [cs.dc] 13 Nov 2018 Victor Eijkhout November 14, 2018 Abstract The Integrative Model for Parallelism (IMP) derives a task grah from a

More information

A GPU Heterogeneous Cluster Scheduling Model for Preventing Temperature Heat Island

A GPU Heterogeneous Cluster Scheduling Model for Preventing Temperature Heat Island A GPU Heterogeneous Cluster Scheduling Model for Preventing Temerature Heat Island Yun-Peng CAO 1,2,a and Hai-Feng WANG 1,2 1 School of Information Science and Engineering, Linyi University, Linyi Shandong,

More information

SPITFIRE: Scalable Parallel Algorithms for Test Set Partitioned Fault Simulation

SPITFIRE: Scalable Parallel Algorithms for Test Set Partitioned Fault Simulation To aear in IEEE VLSI Test Symosium, 1997 SITFIRE: Scalable arallel Algorithms for Test Set artitioned Fault Simulation Dili Krishnaswamy y Elizabeth M. Rudnick y Janak H. atel y rithviraj Banerjee z y

More information

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field Learning Motion Patterns in Crowded Scenes Using Motion Flow Field Min Hu, Saad Ali and Mubarak Shah Comuter Vision Lab, University of Central Florida {mhu,sali,shah}@eecs.ucf.edu Abstract Learning tyical

More information

Communication-Avoiding Parallel Algorithms for Solving Triangular Matrix Equations

Communication-Avoiding Parallel Algorithms for Solving Triangular Matrix Equations Research Collection Bachelor Thesis Communication-Avoiding Parallel Algorithms for Solving Triangular Matrix Equations Author(s): Wicky, Tobias Publication Date: 2015 Permanent Link: htts://doi.org/10.3929/ethz-a-010686133

More information

A Simple and Robust Approach to Computation of Meshes Intersection

A Simple and Robust Approach to Computation of Meshes Intersection A Simle and Robust Aroach to Comutation of Meshes Intersection Věra Skorkovská 1, Ivana Kolingerová 1 and Bedrich Benes 2 1 Deartment of Comuter Science and Engineering, University of West Bohemia, Univerzitní

More information

Coarse grained gather and scatter operations with applications

Coarse grained gather and scatter operations with applications Available online at www.sciencedirect.com J. Parallel Distrib. Comut. 64 (2004) 1297 1310 www.elsevier.com/locate/jdc Coarse grained gather and scatter oerations with alications Laurence Boxer a,b,, Russ

More information

EE678 Application Presentation Content Based Image Retrieval Using Wavelets

EE678 Application Presentation Content Based Image Retrieval Using Wavelets EE678 Alication Presentation Content Based Image Retrieval Using Wavelets Grou Members: Megha Pandey megha@ee. iitb.ac.in 02d07006 Gaurav Boob gb@ee.iitb.ac.in 02d07008 Abstract: We focus here on an effective

More information

Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScript Objects

Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScript Objects Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScrit Objects Shiyi Wei and Barbara G. Ryder Deartment of Comuter Science, Virginia Tech, Blacksburg, VA, USA. {wei,ryder}@cs.vt.edu

More information

Implementations of Partial Document Ranking Using. Inverted Files. Wai Yee Peter Wong. Dik Lun Lee

Implementations of Partial Document Ranking Using. Inverted Files. Wai Yee Peter Wong. Dik Lun Lee Imlementations of Partial Document Ranking Using Inverted Files Wai Yee Peter Wong Dik Lun Lee Deartment of Comuter and Information Science, Ohio State University, 36 Neil Ave, Columbus, Ohio 4321, U.S.A.

More information

521493S Computer Graphics Exercise 3 (Chapters 6-8)

521493S Computer Graphics Exercise 3 (Chapters 6-8) 521493S Comuter Grahics Exercise 3 (Chaters 6-8) 1 Most grahics systems and APIs use the simle lighting and reflection models that we introduced for olygon rendering Describe the ways in which each of

More information

Randomized Selection on the Hypercube 1

Randomized Selection on the Hypercube 1 Randomized Selection on the Hyercube 1 Sanguthevar Rajasekaran Det. of Com. and Info. Science and Engg. University of Florida Gainesville, FL 32611 ABSTRACT In this aer we resent randomized algorithms

More information

Chapter 8: Adaptive Networks

Chapter 8: Adaptive Networks Chater : Adative Networks Introduction (.1) Architecture (.2) Backroagation for Feedforward Networks (.3) Jyh-Shing Roger Jang et al., Neuro-Fuzzy and Soft Comuting: A Comutational Aroach to Learning and

More information

Theoretical Analysis of Graphcut Textures

Theoretical Analysis of Graphcut Textures Theoretical Analysis o Grahcut Textures Xuejie Qin Yee-Hong Yang {xu yang}@cs.ualberta.ca Deartment o omuting Science University o Alberta Abstract Since the aer was ublished in SIGGRAPH 2003 the grahcut

More information

Pivot Selection for Dimension Reduction Using Annealing by Increasing Resampling *

Pivot Selection for Dimension Reduction Using Annealing by Increasing Resampling * ivot Selection for Dimension Reduction Using Annealing by Increasing Resamling * Yasunobu Imamura 1, Naoya Higuchi 1, Tetsuji Kuboyama 2, Kouichi Hirata 1 and Takeshi Shinohara 1 1 Kyushu Institute of

More information

The Scalability and Performance of Common Vector Solution to Generalized Label Continuity Constraint in Hybrid Optical/Packet Networks

The Scalability and Performance of Common Vector Solution to Generalized Label Continuity Constraint in Hybrid Optical/Packet Networks The Scalability and Performance of Common Vector Solution to Generalized abel Continuity Constraint in Hybrid Otical/Pacet etwors Shujia Gong and Ban Jabbari {sgong, bjabbari}@gmuedu George Mason University

More information

1.5 Case Study. dynamic connectivity quick find quick union improvements applications

1.5 Case Study. dynamic connectivity quick find quick union improvements applications . Case Study dynamic connectivity quick find quick union imrovements alications Subtext of today s lecture (and this course) Stes to develoing a usable algorithm. Model the roblem. Find an algorithm to

More information

Object and Native Code Thread Mobility Among Heterogeneous Computers

Object and Native Code Thread Mobility Among Heterogeneous Computers Object and Native Code Thread Mobility Among Heterogeneous Comuters Bjarne Steensgaard Eric Jul Microsoft Research DIKU (Det. of Comuter Science) One Microsoft Way University of Coenhagen Redmond, WA 98052

More information

The Anubis Service. Paul Murray Internet Systems and Storage Laboratory HP Laboratories Bristol HPL June 8, 2005*

The Anubis Service. Paul Murray Internet Systems and Storage Laboratory HP Laboratories Bristol HPL June 8, 2005* The Anubis Service Paul Murray Internet Systems and Storage Laboratory HP Laboratories Bristol HPL-2005-72 June 8, 2005* timed model, state monitoring, failure detection, network artition Anubis is a fully

More information

Skip List Based Authenticated Data Structure in DAS Paradigm

Skip List Based Authenticated Data Structure in DAS Paradigm 009 Eighth International Conference on Grid and Cooerative Comuting Ski List Based Authenticated Data Structure in DAS Paradigm Jieing Wang,, Xiaoyong Du,. Key Laboratory of Data Engineering and Knowledge

More information

Stereo Disparity Estimation in Moment Space

Stereo Disparity Estimation in Moment Space Stereo Disarity Estimation in oment Sace Angeline Pang Faculty of Information Technology, ultimedia University, 63 Cyberjaya, alaysia. angeline.ang@mmu.edu.my R. ukundan Deartment of Comuter Science, University

More information

Brigham Young University Oregon State University. Abstract. In this paper we present a new parallel sorting algorithm which maximizes the overlap

Brigham Young University Oregon State University. Abstract. In this paper we present a new parallel sorting algorithm which maximizes the overlap Aeared in \Journal of Parallel and Distributed Comuting, July 1995 " Overlaing Comutations, Communications and I/O in Parallel Sorting y Mark J. Clement Michael J. Quinn Comuter Science Deartment Deartment

More information

Distributed Algorithms

Distributed Algorithms Course Outline With grateful acknowledgement to Christos Karamanolis for much of the material Jeff Magee & Jeff Kramer Models of distributed comuting Synchronous message-assing distributed systems Algorithms

More information

Visualization, Estimation and User-Modeling for Interactive Browsing of Image Libraries

Visualization, Estimation and User-Modeling for Interactive Browsing of Image Libraries Visualization, Estimation and User-Modeling for Interactive Browsing of Image Libraries Qi Tian, Baback Moghaddam 2 and Thomas S. Huang Beckman Institute, University of Illinois, Urbana-Chamaign, IL 680,

More information

RST(0) RST(1) RST(2) RST(3) RST(4) RST(5) P4 RSR(0) RSR(1) RSR(2) RSR(3) RSR(4) RSR(5) Processor 1X2 Switch 2X1 Switch

RST(0) RST(1) RST(2) RST(3) RST(4) RST(5) P4 RSR(0) RSR(1) RSR(2) RSR(3) RSR(4) RSR(5) Processor 1X2 Switch 2X1 Switch Sub-logarithmic Deterministic Selection on Arrays with a Recongurable Otical Bus 1 Yijie Han Electronic Data Systems, Inc. 750 Tower Dr. CPS, Mail Sto 7121 Troy, MI 48098 Yi Pan Deartment of Comuter Science

More information

CS2 Algorithms and Data Structures Note 8

CS2 Algorithms and Data Structures Note 8 CS2 Algorithms and Data Structures Note 8 Heasort and Quicksort We will see two more sorting algorithms in this lecture. The first, heasort, is very nice theoretically. It sorts an array with n items in

More information

A Petri net-based Approach to QoS-aware Configuration for Web Services

A Petri net-based Approach to QoS-aware Configuration for Web Services A Petri net-based Aroach to QoS-aware Configuration for Web s PengCheng Xiong, YuShun Fan and MengChu Zhou, Fellow, IEEE Abstract With the develoment of enterrise-wide and cross-enterrise alication integration

More information

GEOMETRIC CONSTRAINT SOLVING IN < 2 AND < 3. Department of Computer Sciences, Purdue University. and PAMELA J. VERMEER

GEOMETRIC CONSTRAINT SOLVING IN < 2 AND < 3. Department of Computer Sciences, Purdue University. and PAMELA J. VERMEER GEOMETRIC CONSTRAINT SOLVING IN < AND < 3 CHRISTOPH M. HOFFMANN Deartment of Comuter Sciences, Purdue University West Lafayette, Indiana 47907-1398, USA and PAMELA J. VERMEER Deartment of Comuter Sciences,

More information

Record Route IP Traceback: Combating DoS Attacks and the Variants

Record Route IP Traceback: Combating DoS Attacks and the Variants Record Route IP Traceback: Combating DoS Attacks and the Variants Abdullah Yasin Nur, Mehmet Engin Tozal University of Louisiana at Lafayette, Lafayette, LA, US ayasinnur@louisiana.edu, metozal@louisiana.edu

More information

An improved algorithm for Hausdorff Voronoi diagram for non-crossing sets

An improved algorithm for Hausdorff Voronoi diagram for non-crossing sets An imroved algorithm for Hausdorff Voronoi diagram for non-crossing sets Frank Dehne, Anil Maheshwari and Ryan Taylor May 26, 2006 Abstract We resent an imroved algorithm for building a Hausdorff Voronoi

More information

Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed page of such transmission.

Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed page of such transmission. An Interactive Programming Method for Solving the Multile Criteria Problem Author(s): Stanley Zionts and Jyrki Wallenius Source: Management Science, Vol. 22, No. 6 (Feb., 1976),. 652-663 Published by:

More information

Near-Optimal Routing Lookups with Bounded Worst Case Performance

Near-Optimal Routing Lookups with Bounded Worst Case Performance Near-Otimal Routing Lookus with Bounded Worst Case Performance Pankaj Guta Balaji Prabhakar Stehen Boyd Deartments of Electrical Engineering and Comuter Science Stanford University CA 9430 ankaj@stanfordedu

More information

Complexity analysis and performance evaluation of matrix product on multicore architectures

Complexity analysis and performance evaluation of matrix product on multicore architectures Comlexity analysis and erformance evaluation of matrix roduct on multicore architectures Mathias Jacquelin, Loris Marchal and Yves Robert École Normale Suérieure de Lyon, France {Mathias.Jacquelin Loris.Marchal

More information

Power Savings in Embedded Processors through Decode Filter Cache

Power Savings in Embedded Processors through Decode Filter Cache Power Savings in Embedded Processors through Decode Filter Cache Weiyu Tang Rajesh Guta Alexandru Nicolau Deartment of Information and Comuter Science University of California, Irvine Irvine, CA 92697-3425

More information

Grouping of Patches in Progressive Radiosity

Grouping of Patches in Progressive Radiosity Grouing of Patches in Progressive Radiosity Arjan J.F. Kok * Abstract The radiosity method can be imroved by (adatively) grouing small neighboring atches into grous. Comutations normally done for searate

More information