Collective Communication: Theory, Practice, and Experience. FLAME Working Note #22


Collective Communication: Theory, Practice, and Experience

FLAME Working Note #22

Ernie Chan, Marcel Heimlich, Avi Purkayastha, Robert van de Geijn

September 2006

Abstract

We discuss the design and high-performance implementation of collective communication operations on distributed-memory computer architectures. Using a combination of known techniques (many of which were first proposed in the 1980s and early 1990s) along with careful exploitation of communication modes supported by MPI, we have developed implementations that have improved performance in most situations compared to those currently supported by public domain implementations of MPI such as MPICH. Performance results from a large Intel Xeon/Pentium 4 (R) processor cluster are included.

1 Introduction

This paper makes a number of contributions to the topic of collective communication:

1. A review of best practices. Collective communication was an active research area in the 1980s and early 1990s as distributed-memory architectures with large numbers of processors were first introduced [,, 4, 9,,, 5]. Since then an occasional paper has been published [9,, 5, 7], but no dramatic new developments have been reported.

2. Focus on simplicity. Historically, hypercube topologies were the first topologies used for distributed-memory architectures. Examples include Caltech's Cosmic Cube [], the Intel iPSC, NCUBE [], and Thinking Machines' Connection Machine architectures. As a result, highly efficient algorithms for hypercubes were developed first, and these algorithms were then modified to target architectures with alternative topologies. Similarly, textbooks often start their discussion of collective communication by considering hypercubes. In our exposition we take a different approach by considering one-dimensional topologies first. Algorithms that perform well are then generalized to multidimensional meshes. Hypercubes are finally discussed briefly by observing that they are log₂(p)-dimensional meshes with two (computational) nodes in each dimension.
This fact allows us to focus on simple, effective solutions that naturally generalize to higher dimensions and ultimately hypercubes.

3. Algorithms. One consequence of Item 2 is that we can state the algorithms more simply and concisely. Minimum-spanning tree algorithms on hypercubes typically required loop-based algorithms that computed indices of destination nodes by toggling bits of the indices of source nodes. We instead present the algorithms recursively and avoid such obscuring bit manipulation.

echan@cs.utexas.edu. Department of Computer Sciences, The University of Texas at Austin, Austin, TX 78712. Texas Advanced Computing Center, The University of Texas at Austin, Austin, TX 78712.

4. Analysis. The cost of the algorithms is analyzed via a simple but effective model of parallel computation.

5. Tunable libraries. More recently, the topic of tuning, preferably automatically, of collective communication libraries has again become fashionable [8, 8]. Unfortunately, many of these papers focus on the mechanism for choosing algorithms from a loose collection of algorithms. Often this collection does not even include the fastest and/or most practical algorithm. Perhaps the most important contribution of this paper is that it shows how algorithms for a given operation can be organized as a parameterized family, which then clearly defines what parameters can be tuned to improve performance. This approach was already incorporated into the highly tuned InterCom library for the Intel Touchstone Delta and Paragon architectures of the early and mid 1990s [, 4, 5]. However, many details of the theory and practical techniques used to build that library were never published.

6. Implementation. The merits of the approach are verified via an MPI-compatible implementation of all the presented algorithms [7]. Experiments show that the resulting implementation is comparable to, and sometimes better than, the MPICH implementation of the Message-Passing Interface (MPI) [,, 4].

There is an entire class of algorithms that we do not treat: pipelined algorithms [, 4, 6]. The reason is that we do not consider these practical on current generation architectures.

The remainder of the paper is organized as follows. In Section 2 we explain some basic assumptions that are made for the purpose of presenting this paper. In Section 3 we discuss the collective communication operations. Section 4 delineates the lower bounds of the collective communication operations, followed by a discussion of network topologies in Section 5. In Section 6 we discuss different algorithms for varying data lengths. In Section 7 we discuss strategies for the special cases where short and long vectors of data are communicated.
More sophisticated hybrid algorithms that combine techniques for all vector lengths are discussed in Section 8. Performance results are given in Section 9. Concluding remarks can be found in Section 10.

2 Model of Parallel Computation

To analyze the cost of the presented algorithms, it is useful to assume a simple model of parallel computation. These assumptions are:

Target architectures: Currently, the target architectures are distributed-memory parallel architectures. However, we expect that the methods discussed in this paper also have applicability when many cores on a single processor become available.

Indexing: This paper assumes a parallel architecture with p computational nodes (nodes hereafter). The nodes are indexed from 0 to p − 1. Each node could consist of a Symmetric Multi-Processor (SMP) but for communication purposes will be treated as one unit.

Logically fully connected: We will assume that any node can send directly to any other node through a communication network in which some topology provides automatic routing.

Communicating between nodes: At any given time, a single node can send only one message to one other node. Similarly, it can only receive one message from one other node. We will assume a node can send and receive simultaneously.

Cost of communication: The cost of sending a message between two nodes will be modeled by α + nβ, in the absence of network conflicts. Here α and β respectively represent the message startup time and per data item transmission time. In our model, the cost of the message is not a function of the distance between two nodes. The startup cost is largely due to software overhead on the sending and the receiving nodes. The routing of messages between nodes is subsequently done in hardware using wormhole routing, which pipelines messages and incurs a very small extra overhead due to the distance between two nodes []. Typically, α is four to five orders of magnitude greater than β, where β is on the order of the cost of an instruction.

Network conflicts: Assuming that the path between two communicating nodes, determined by the topology and the routing algorithm, is completely occupied by a message, then if some link (connection between neighboring nodes) in the communication path is occupied by two or more messages, a network conflict occurs. This extra cost is modeled with α + knβ, where k is the maximum over all links (along the path of the message) of the number of conflicts on the links. Alternatively, links on which there is a conflict may be multiplexed, which yields the same modification to the cost.

Bidirectional channels: We will assume that messages traveling in opposite directions on a link do not conflict.

Cost of computation: The cost required to perform an arithmetic operation (e.g., a reduction operation) is denoted by γ.

Although simple, the above assumptions are useful when conducting an analysis of communication costs on actual architectures. Some additional discussion is necessary regarding the parameters α and β:

Communication protocols: The most generic communication uses the so-called three-pass protocol. A message is sent to alert the receiving node that a message of a given size will be sent. After the buffer space for the message has been allocated, the receiving node responds. Finally, the message itself is sent. Notice that this requires three control messages to be sent between the sending and receiving nodes. We will denote the latency associated with this protocol by α₃. If the sender can rely on the fact that a receive buffer already exists, a one-pass protocol can be used, in which the message is simply sent without the above-described handshake. We will denote the latency associated with this protocol by α₁. In particular, we will assume that there is always static buffer space for very short messages. The three-pass protocol can easily cost up to three times more than the one-pass protocol. Thus, in our discussion we will assume that α = α₃.
Relative order of send and receive calls: The cost per item sent is affected by the relative order in which a send and corresponding receive are posted. If a send is initiated before the corresponding receive is posted on the receiving node, the incoming message is buffered in temporary space and copied when the receive, which indicates where the message is to be finally stored, is posted. If, on the other hand, the receive is posted before the send, or the send blocks until the receive is posted, no such extra copy is required. We will denote the cost per item transferred by β₁ if no extra copy is required and by β₂ if it is.

3 Collective Communication

When the nodes of a distributed-memory architecture collaborate to solve a given problem, computation previously performed on a single node is inherently distributed among the nodes. Communication is performed when data is shared and/or contributions from different nodes must be consolidated. Communication operations that simultaneously involve a group of nodes are called collective communication operations. In our discussions, we will assume that the group includes all nodes. The most typically encountered collective communications, discussed in this section, fall into two categories:

Data redistribution operations: broadcast, scatter, gather, and allgather. These operations move data between processors.

Data consolidation operations: reduce(-to-one), reduce-scatter, and allreduce. These operations consolidate contributions from different processors by applying a reduction operation. We will only consider reduction operations that are both commutative and associative.

Figure 1: Collective communications considered in this paper.

Communication      Latency        Bandwidth          Computation
Broadcast          ⌈log₂(p)⌉ α    nβ                 —
Reduce(-to-one)    ⌈log₂(p)⌉ α    nβ                 ((p − 1)/p) nγ
Scatter            ⌈log₂(p)⌉ α    ((p − 1)/p) nβ     —
Gather             ⌈log₂(p)⌉ α    ((p − 1)/p) nβ     —
Allgather          ⌈log₂(p)⌉ α    ((p − 1)/p) nβ     —
Reduce-scatter     ⌈log₂(p)⌉ α    ((p − 1)/p) nβ     ((p − 1)/p) nγ
Allreduce          ⌈log₂(p)⌉ α    2((p − 1)/p) nβ    ((p − 1)/p) nγ

Table 1: Lower bounds for the different components of communication cost. Pay particular attention to the conditions for the lower bounds given in the text.

The operations discussed in this paper are illustrated in Fig. 1. In that figure, x indicates a vector of data of length n. For some operations, x is subdivided into subvectors x_i, i = 0, …, p − 1, where p equals the number of nodes. A superscript is used to indicate a vector that must be reduced with other vectors from other nodes. Σ_j x^(j) indicates the result of that reduction. The summation sign is used because summation is the most commonly encountered reduction operation. We present these collective communications as pairs of dual operations. We will show later that an implementation of an operation can be transformed into that of its dual by reversing the communication (and adding to or deleting reduction operations from the implementation). These dual pairs are indicated by the groupings in Fig. 1 (separated by the thick lines): broadcast and reduce(-to-one), scatter and gather, and allgather and reduce-scatter. Allreduce is the only operation that does not have a dual (or it can be viewed as its own dual).

4 Lower Bounds

It is useful to present lower bounds on the cost of these operations under our model, regardless of the implementation. In this section, we give informal arguments to derive these lower bounds. We will treat the three terms of communication cost separately: latency, bandwidth, and computation. Lower bounds are summarized in Table 1. It is assumed that p > 1 and that the subvectors x_i and x_i^(j) have equal length.

Latency: The lower bound on latency is derived from the simple observation that, for all collective communications, at least one node has data that must somehow arrive at all other nodes.
Under our model, at each step we can at most double the number of nodes that have received the data.

Computation: Only the reduction communications require computation. The computation involved would require (p − 1)n operations if executed on a single node, or time (p − 1)nγ. Distributing this computation perfectly among the nodes reduces the time to ((p − 1)/p) nγ under ideal circumstances. Hence the lower bound.

Bandwidth: For the broadcast and reduce(-to-one), the root node must either send or receive n items. The cost of this is bounded below by nβ. For the gather and scatter, the root node must either send or receive (p − 1)/p of the n items, with a cost of at least ((p − 1)/p) nβ. The same is the case for all nodes during the allgather and reduce-scatter. The allreduce is somewhat more complicated. If the lower bound on computation is to be achieved, one can argue that ((p − 1)/p) n items must leave each node, and ((p − 1)/p) n items must be received by each node after the computation is completed, for a total cost of at least 2((p − 1)/p) nβ. For further details see [4].
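The doubling argument for the latency lower bound can be checked with a short sketch (our own code, not from the paper; the function name is ours):

```python
import math

def latency_steps(p):
    """Minimum number of message steps before all p nodes can hold data
    that starts on one node, when the set of nodes holding the data can
    at most double in size per step."""
    informed = 1  # only the source node has the data initially
    steps = 0
    while informed < p:
        informed *= 2  # every informed node sends to one uninformed node
        steps += 1
    return steps
```

The returned count agrees with the ⌈log₂(p)⌉ latency lower bound of Table 1.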

Figure 2: Construction of hypercubes. A hypercube of dimension d is obtained by duplicating the (d − 1)-dimensional cube and connecting corresponding nodes with a link.

5 Topologies

In this section, we discuss a few topologies. The topology with the least connectivity is the linear array. A fully connected network is on the other end of the spectrum. In between, we consider higher dimensional mesh topologies. Hypercubes, which have historical and theoretical value, are shown to be mesh architectures of dimension log₂(p) with two nodes in each dimension. Many current architectures have multiple levels of switches that route messages between subnetworks that are typically fully connected.

Linear arrays: We will assume that the nodes of a linear array are connected so that node i has neighbors i − 1 (left) and i + 1 (right). Nodes 0 and p − 1 do not have left and right neighbors, respectively. A ring architecture would connect node p − 1 with node 0. However, the communication that a ring facilitates is important to us (e.g., simultaneous sending by all nodes to their right neighbor), and this communication can be achieved on a linear array because the message from node p − 1 to node 0 does not conflict with any of the other messages under our model.

Mesh architectures: The nodes of a mesh architecture of dimension D can be indexed using a D-tuple, (i₀, …, i_{D−1}), with 0 ≤ i_j < d_j and p = d₀ ⋯ d_{D−1}. The nodes indexed by (i₀, …, i_{j−1}, k, i_{j+1}, …, i_{D−1}), 0 ≤ k < d_j, form a linear array. A torus architecture is the natural extension of a ring to multiple dimensions. We will not need tori for our discussion, for the same reason that we do not need rings.

Hypercubes: In Fig. 2 we show how a hypercube can be inductively constructed:

A hypercube of dimension 0 consists of a single node. No bits are required to index this one node.

A hypercube of dimension d is constructed from two hypercubes of dimension d − 1 by connecting nodes with corresponding indices, and adding a leading binary bit to the index of each node. For all nodes in one of the two hypercubes this leading bit is set to 0, while it is set to 1 for the nodes in the other hypercube.

Some observations are:

Two nodes are neighbors if and only if the binary representations of their indices differ in exactly one bit.

View the nodes of a hypercube as a linear array with nodes indexed 0, …, p − 1. Then the subsets of nodes {0, …, p/2 − 1} and {p/2, …, p − 1} each form a hypercube, and node i of the left subset is a neighbor of node i + p/2 in the right subset. This observation applies recursively to the two subsets.

A hypercube is a mesh of dimension log₂(p) with two nodes in each dimension.

Fully connected architectures: In a fully connected architecture, all nodes are neighbors of all other nodes. We will see in the remainder of this paper that the primary advantage of a fully connected architecture is that one can view such architectures as higher dimensional meshes by factoring the number of nodes, p, into integer factors.

6 Commonly Used Algorithms

Depending on the amount of data involved in a collective communication, the strategy for reducing the cost of the operation differs. When the amount of data is small, it is the cost of initiating messages, α, that tends to dominate, and algorithms should strive to reduce this cost. In other words, it is the lower bound on the latency in Table 1 that becomes the dominating factor. When the amount of data is large, it is the costs per item sent and/or computed, β and/or γ, that become the dominating factors. In this case the lower bounds due to bandwidth and computation in Table 1 are the dominating factors.

6.1 Broadcast and reduce

Minimum-spanning tree algorithms

The best known broadcast algorithm is the minimum-spanning tree algorithm (MST Bcast). On an arbitrary number of nodes, this algorithm can be described as follows.
The nodes {0, …, p − 1} are partitioned into two (almost equal) disjoint subsets, {0, …, m} and {m + 1, …, p − 1}, where m = ⌊(p − 1)/2⌋ is the middle node. A destination node is chosen in the subset that does not contain the root. The message is sent from the root to the destination, after which the root and the destination become the roots for broadcasts within their respective subsets of nodes. This algorithm is given in Fig. 3(a). In this algorithm, x is the vector of data to be communicated, me and root indicate the index of the node that participates and the current root of the broadcast, and left and right indicate the indices of the left- and right-most nodes in the current subset of nodes. The broadcast among all nodes is then accomplished by calling MSTBcast(x, root, 0, p − 1). The algorithm is illustrated in Fig. 4. It is not hard to see that, in the absence of network conflicts, the cost of this algorithm is

T_MSTBcast(p, n) = ⌈log₂(p)⌉ (α + nβ)

in the generic case when the Send and Recv routines use a three-pass protocol. This cost achieves the lower bound for the latency component of the cost of communication. Under our model, the algorithm does not incur network conflicts on fully connected networks and on linear arrays, regardless of how the destination is chosen at each step. (The choice of dest in Fig. 3(a) is

MSTBcast( x, root, left, right )
  if left == right, return
  mid = ⌊(left + right)/2⌋
  if root ≤ mid then dest = right else dest = left
  if me == root, Send( x, dest )
  if me == dest, Recv( x, root )
  if me ≤ mid and root ≤ mid, MSTBcast( x, root, left, mid )
  else if me ≤ mid and root > mid, MSTBcast( x, dest, left, mid )
  else if me > mid and root ≤ mid, MSTBcast( x, dest, mid+1, right )
  else if me > mid and root > mid, MSTBcast( x, root, mid+1, right )
(a)

MSTReduce( x, root, left, right )
  if left == right, return
  mid = ⌊(left + right)/2⌋
  if root ≤ mid then srce = right else srce = left
  if me ≤ mid and root ≤ mid, MSTReduce( x, root, left, mid )
  else if me ≤ mid and root > mid, MSTReduce( x, srce, left, mid )
  else if me > mid and root ≤ mid, MSTReduce( x, srce, mid+1, right )
  else if me > mid and root > mid, MSTReduce( x, root, mid+1, right )
  if me == srce, Send( x, root )
  if me == root, Recv( tmp, srce ) and x = x + tmp
(b)

MSTScatter( x, root, left, right )
  if left == right, return
  mid = ⌊(left + right)/2⌋
  if root ≤ mid then dest = right else dest = left
  if root ≤ mid
    if me == root, Send( x_{mid+1:right}, dest )
    if me == dest, Recv( x_{mid+1:right}, root )
  else
    if me == root, Send( x_{left:mid}, dest )
    if me == dest, Recv( x_{left:mid}, root )
  if me ≤ mid and root ≤ mid, MSTScatter( x, root, left, mid )
  else if me ≤ mid and root > mid, MSTScatter( x, dest, left, mid )
  else if me > mid and root ≤ mid, MSTScatter( x, dest, mid+1, right )
  else if me > mid and root > mid, MSTScatter( x, root, mid+1, right )
(c)

MSTGather( x, root, left, right )
  if left == right, return
  mid = ⌊(left + right)/2⌋
  if root ≤ mid then srce = right else srce = left
  if me ≤ mid and root ≤ mid, MSTGather( x, root, left, mid )
  else if me ≤ mid and root > mid, MSTGather( x, srce, left, mid )
  else if me > mid and root ≤ mid, MSTGather( x, srce, mid+1, right )
  else if me > mid and root > mid, MSTGather( x, root, mid+1, right )
  if root ≤ mid
    if me == srce, Send( x_{mid+1:right}, root )
    if me == root, Recv( x_{mid+1:right}, srce )
  else
    if me == srce, Send( x_{left:mid}, root )
    if me == root, Recv( x_{left:mid}, srce )
(d)
Figure 3: Minimum-spanning tree algorithms.

simply convenient.) On a hypercube, the destination needs to be chosen as a neighbor of the current root. This change requires the algorithm in Fig. 3(a) to be modified by choosing dest as

if root ≤ mid, dest = root − left + (mid + 1); else dest = root + left − (mid + 1)

in other words, choose the node, in the subset that does not contain the current root, that is in the same relative position as the root.

The MST Reduce can be implemented by reversing the communication and applying a reduction operation with the data that is received. Again, the nodes {0, …, p − 1} are partitioned into two (almost equal) disjoint subsets, {0, …, m} and {m + 1, …, p − 1}. This time, a source node is chosen in the subset that does not contain the root. Recursively, all contributions within each subset are reduced to the root and to the source node. Finally, the reduced result is sent from the source node to the root, where it is reduced with the data that is already at the root node. This algorithm is given in Fig. 3(b) and illustrated in Fig. 5. Comparing the algorithms in Fig. 3(a) and (b), we note that the partitioning into subsets of nodes is identical. For the broadcast, the message is sent by the root, after which the broadcast proceeds recursively in each subset of nodes. For the reduce, the recursion comes first, after which a message is sent to the root, where the data is reduced with the data at the root. In effect, the communication is reversed in order and direction. The cost of this algorithm, identical to that of the broadcast except that now a γ term must be added for the reduction at each step, is given by

T_MSTReduce(p, n) = ⌈log₂(p)⌉ (α + nβ + nγ).

Both algorithms achieve the lower bound of ⌈log₂(p)⌉ α for the latency component of the cost.

6.2 Scatter and gather

A scatter can be implemented much like MST Bcast, except that at each step of the recursion only the data that ultimately must reside in the subnetwork of which the destination is a member needs to be sent from the root to the destination.
The resulting algorithm is given in Fig. 3(c) and is illustrated in Fig. 6. The MST Gather is similarly obtained by reversing the communications in the MST Scatter, as given in Fig. 3(d). Under the assumption that all subvectors are of equal length, the cost of these algorithms is given by

T_MSTScatter(p, n) = T_MSTGather(p, n) = Σ_{k=1}^{⌈log₂(p)⌉} (α + (n/2^k) β) = ⌈log₂(p)⌉ α + ((p − 1)/p) nβ.

This cost achieves the lower bounds for the latency and bandwidth components. Under the stated assumptions these algorithms are optimal.

6.3 Allgather and reduce-scatter

Bidirectional exchange algorithms

The best known algorithm for the allgather assumes that p = 2^d for some integer d and uses so-called bidirectional exchanges (BDE), which can be described as follows. Partition the network in two halves. Recursively perform an allgather of all the data in the respective halves. Next, exchange the so-gathered data between disjoint pairs of nodes, where each pair consists of one node from each of the two halves. Generally, node i (0 ≤ i < p/2) is paired with node i + p/2, which are neighbors if the network is a hypercube. This algorithm, called recursive doubling, is given in Fig. 7(a) and illustrated in Fig. 8. In the absence of network conflicts (on a hypercube or fully connected architecture), and assuming all subvectors are of equal length, the cost is

T_BDEAllgather(p, n) = Σ_{k=1}^{log₂(p)} (α + (2^{k−1}/p) nβ) = log₂(p) α + ((p − 1)/p) nβ.
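The recursive-doubling allgather can be simulated with a small sketch (our own code; it uses the equivalent flat XOR formulation of the partner choice, with the smallest-distance exchanges first, as in the recursion):

```python
def bde_allgather(p):
    """Simulate recursive doubling for p = 2**d nodes: at each step, node i
    exchanges every block it currently holds with partner i XOR step, so the
    data held per node doubles; after log2(p) steps all nodes hold all blocks."""
    have = [{i} for i in range(p)]  # node i starts with subvector x_i
    step = 1
    while step < p:
        have = [have[i] | have[i ^ step] for i in range(p)]
        step *= 2
    return have
```

After the final exchange, every node holds all p subvectors, which is the defining postcondition of the allgather.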

Figure 4: Minimum-spanning tree algorithm for broadcast.

Figure 5: Minimum-spanning tree algorithm for reduce.

This cost attains the lower bound for both the latency and bandwidth components and is thus optimal under these assumptions.

Problems arise with BDE algorithms when the number of nodes is not a power of two. If a subnetwork contains an odd number of nodes, one node does not have a corresponding node in the other subnetwork. In one remedy for this situation, a node from the opposing subnetwork must also send its data to the odd node. Unfortunately, this solution requires that one node send data twice at each step, so the cost of BDE algorithms doubles when not using a power of two number of nodes. In practice, BDE algorithms still perform quite well because the node needing to send data twice is different at each step of the recursion, so the communication can be overlapped between steps. Nonetheless, the result is rather haphazard.

Reduce-scatter can again be implemented by reversing the communications and their order and by adding reduction operations when the data arrives. This algorithm, called recursive halving, is given in Fig. 7(b) and illustrated in Fig. 9. Again assuming all subvectors are of equal length, the cost is

T_BDEReduce-scatter(p, n) = Σ_{k=1}^{log₂(p)} (α + (2^{k−1}/p) n(β + γ)) = log₂(p) α + ((p − 1)/p) n(β + γ).
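The closed form of the recursive-halving cost can be checked numerically with a short sketch (our own code, with arbitrary illustrative parameter values):

```python
import math

def bde_reduce_scatter_cost(p, n, alpha, beta, gamma):
    """Sum the per-step costs of recursive halving: at step k each node
    sends, receives, and reduces n * 2**(k-1) / p items (p = 2**d)."""
    d = int(math.log2(p))
    return sum(alpha + (2 ** (k - 1) / p) * n * (beta + gamma)
               for k in range(1, d + 1))

def closed_form(p, n, alpha, beta, gamma):
    """log2(p)*alpha + ((p-1)/p)*n*(beta+gamma), as derived in the text."""
    return math.log2(p) * alpha + ((p - 1) / p) * n * (beta + gamma)
```

The geometric sum Σ_{k=1}^{log₂(p)} 2^{k−1}/p collapses to (p − 1)/p, which is where the bandwidth and computation terms come from.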

Figure 6: Minimum-spanning tree algorithm for scatter.

BDEAllgather( x, left, right )
  if left == right, return
  size = right − left + 1
  mid = ⌊(left + right)/2⌋
  if me ≤ mid then partner = me + size/2 else partner = me − size/2
  if me ≤ mid, BDEAllgather( x, left, mid )
  else, BDEAllgather( x, mid+1, right )
  if me ≤ mid
    Send( x_{left:mid}, partner )
    Recv( x_{mid+1:right}, partner )
  else
    Send( x_{mid+1:right}, partner )
    Recv( x_{left:mid}, partner )
(a)

BDEReduce-scatter( x, left, right )
  if left == right, return
  size = right − left + 1
  mid = ⌊(left + right)/2⌋
  if me ≤ mid then partner = me + size/2 else partner = me − size/2
  if me ≤ mid
    Send( x_{mid+1:right}, partner )
    Recv( tmp, partner )
    x_{left:mid} = x_{left:mid} + tmp
  else
    Send( x_{left:mid}, partner )
    Recv( tmp, partner )
    x_{mid+1:right} = x_{mid+1:right} + tmp
  if me ≤ mid, BDEReduce-scatter( x, left, mid )
  else, BDEReduce-scatter( x, mid+1, right )
(b)

Figure 7: Bidirectional exchange algorithms for allgather and reduce-scatter.
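The correctness of the recursive-halving reduce-scatter can also be simulated (our own sketch, mirroring the structure of Fig. 7(b): the cross-half exchange happens first, then the recursion). Instead of numeric values, each block tracks the set of source nodes already folded into it:

```python
def bde_reduce_scatter(contrib, left, right):
    """Recursive halving on nodes left..right: contrib[i][j] is the set of
    source nodes whose contribution x^(j) has been reduced into node i's
    copy of block j. Nodes in the left half keep the left blocks."""
    if left == right:
        return
    mid = (left + right) // 2
    half = (right - left + 1) // 2
    for me in range(left, mid + 1):
        partner = me + half
        for j in range(left, mid + 1):       # me absorbs the left blocks
            contrib[me][j] |= contrib[partner][j]
        for j in range(mid + 1, right + 1):  # partner absorbs the right blocks
            contrib[partner][j] |= contrib[me][j]
    bde_reduce_scatter(contrib, left, mid)
    bde_reduce_scatter(contrib, mid + 1, right)

def simulate_reduce_scatter(p):
    contrib = [[{i} for _ in range(p)] for i in range(p)]
    bde_reduce_scatter(contrib, 0, p - 1)
    return contrib
```

The expected postcondition is that node i ends up holding block i fully reduced, i.e., containing contributions from all p nodes.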

Figure 8: Recursive-doubling algorithm for allgather.

Figure 9: Recursive-halving algorithm for reduce-scatter. Notation: x_i^(j₁:j₂:s) = Σ_j x_i^(j) where j ∈ {j₁, j₁ + s, …, j₂}, and x_i^(j₁:j₂) = Σ_j x_i^(j) where j ∈ {j₁, j₁ + 1, …, j₂}.

We will revisit BDE algorithms, as a special case of the bucket algorithms discussed next, on hypercubes in Section 7.3.

Bucket algorithms

An alternative approach to the implementation of the allgather operation views the nodes as a ring, embedded in a linear array by taking advantage of the fact that messages traversing a link in opposite directions do not conflict. At each step, all nodes send data to the node to their right. In this fashion, the subvectors that start on the individual nodes are eventually distributed to all nodes. The bucket (BKT) algorithm, also called the cyclic algorithm, is given in Fig. 10(a) and is illustrated in Fig. 11. Notice that if each node starts with an equal subvector of data, the cost of this approach is given by

T_BKTAllgather(p, n) = (p − 1)(α + (n/p) β) = (p − 1)α + ((p − 1)/p) nβ,

achieving the lower bound for the bandwidth component of the cost.
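The ring traversal can be simulated with a short sketch (our own code, not the paper's): in each of the p − 1 steps, every node forwards the block it received most recently to its right neighbor.

```python
def bkt_allgather(p):
    """Simulate the bucket (cyclic) allgather on a ring of p nodes."""
    have = [{i} for i in range(p)]   # node i starts with subvector x_i
    passing = list(range(p))         # index of the block node i forwards next
    for _ in range(p - 1):
        received = [passing[(i - 1) % p] for i in range(p)]  # from the left
        for i in range(p):
            have[i].add(received[i])
        passing = received           # forward what was just received
    return have
```

After p − 1 steps each node has seen every block exactly once, which matches the (p − 1)(α + (n/p)β) cost: one message of n/p items per step.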

BKTAllgather( x )
  prev = me − 1
  if prev < 0, then prev = p − 1
  next = me + 1
  if next == p, then next = 0
  curi = me
  for i = 1, …, p − 1
    Send( x_{curi}, next )
    curi = curi − 1
    if curi < 0, then curi = p − 1
    Recv( x_{curi}, prev )
  endfor
(a)

BKTReduce-scatter( x )
  prev = me − 1
  if prev < 0, then prev = p − 1
  next = me + 1
  if next == p, then next = 0
  curi = next
  for i = 1, …, p − 1
    Send( x_{curi}, prev )
    curi = curi + 1
    if curi == p, then curi = 0
    Recv( tmp, next )
    x_{curi} = x_{curi} + tmp
  endfor
(b)

Figure 10: Bucket algorithms for allgather and reduce-scatter.

A simple optimization comes from preposting all receives, after which a single synchronization with the node to the left (indicating that all receives have been posted) is required before sending commences, so that a one-pass protocol can be used. This synchronization itself can be implemented by sending a zero-length message, at a cost of α₁ in our model. The remaining sends each also incur only α₁ as a latency cost, for a total cost of

T_BKTAllgather(p, n) = α₁ + (p − 1)α₁ + ((p − 1)/p) nβ = pα₁ + ((p − 1)/p) nβ.

Similarly, the reduce-scatter operation can be implemented by a simultaneous passing of messages around the ring to the left. The algorithm is given in Fig. 10(b) and is illustrated in Fig. 12. This time, partial results toward the total reduction of the subvectors are accumulated as the messages pass around the ring. With a similar strategy for preposting receives, the cost of this algorithm is given by

T_BKTReduce-scatter(p, n) = pα₁ + ((p − 1)/p) n(β + γ).

6.4 Allreduce

Like the reduce-scatter, the allreduce can also be implemented using a BDE algorithm. This time, at each step the entire vector is exchanged and added to the local result. This algorithm is given in Fig. 13 and is illustrated in Fig. 14. In the absence of network conflicts the cost of this algorithm is

T_BDEAllreduce(p, n) = log₂(p)(α + nβ + nγ).

This cost attains the lower bound only for the latency component.

7 Moving On

We now discuss how to pick and/or combine algorithms as a function of architecture, number of nodes, and vector length.
We do so by presenting strategies for different types of architectures, building upon the algorithms already presented. Recall that short messages incur a latency of α₁.
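The reason the strategy depends on vector length follows directly from the cost model of Section 2, as this toy illustration shows (our own code, with made-up parameter values):

```python
# Hypothetical machine parameters: alpha is typically four to five orders
# of magnitude larger than beta, as noted in Section 2.
ALPHA = 1.0e-5  # startup time per message (seconds, illustrative)
BETA = 1.0e-9   # transfer time per data item (seconds, illustrative)

def message_cost(n, alpha=ALPHA, beta=BETA):
    """Cost of one point-to-point message of n items: alpha + n*beta."""
    return alpha + n * beta

# Fraction of the message cost spent on startup:
startup_share_short = ALPHA / message_cost(100)    # latency dominates
startup_share_long = ALPHA / message_cost(10**8)   # bandwidth dominates
```

For short vectors almost the entire cost is the α term, so algorithms should minimize the number of message steps; for long vectors the β (and γ) terms dominate, so algorithms should minimize the volume sent per node.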

Figure 11: Bucket algorithm for allgather.

Figure 12: Bucket algorithm for reduce-scatter.

7.1 Linear arrays

On linear arrays, the MST Bcast and MST Reduce algorithms achieve the lower bound for the α term, while the BKT Allgather and BKT Reduce-scatter algorithms achieve the lower bound for the β term. The MST Scatter and MST Gather algorithms achieve the lower bounds for all vector lengths. BDE algorithms are undesirable since they require p = 2^d nodes and because they inherently incur network conflicts. The following strategy provides simple algorithms that have merit in the extreme cases of short and long vector lengths. Fig. 15 and Fig. 16 summarize this strategy, where short and long vector algorithms can be used as building blocks to compose the different collective communication operations.

7.1.1 Broadcast

Short vectors: MST algorithm.

Long vectors: MST Scatter followed by BKT Allgather. The approximate cost is

T_Scatter-Allgather(p, n) = ⌈log₂(p)⌉ α + ((p − 1)/p) nβ + pα₁ + ((p − 1)/p) nβ

BDEAllreduce( x, left, right )
  if left == right, return
  size = right − left + 1
  mid = ⌊(left + right)/2⌋
  if me ≤ mid then partner = me + size/2 else partner = me − size/2
  if me ≤ mid
    Send( x, partner )
    Recv( tmp, partner )
    x = x + tmp
  else
    Send( x, partner )
    Recv( tmp, partner )
    x = x + tmp
  if me ≤ mid, BDEAllreduce( x, left, mid )
  else, BDEAllreduce( x, mid+1, right )

Figure 13: Bidirectional exchange algorithm for allreduce.

Figure 14: Bidirectional exchange algorithm for allreduce.

    Reduce                                ⌈log(p)⌉ α + ⌈log(p)⌉ n(β + γ)
    Scatter                               ⌈log(p)⌉ α + ((p−1)/p) nβ
    Gather                                ⌈log(p)⌉ α + ((p−1)/p) nβ
    Broadcast                             ⌈log(p)⌉ α + ⌈log(p)⌉ nβ
    Reduce-scatter (Reduce/Scatter)       2⌈log(p)⌉ α + ⌈log(p)⌉ n(β + γ) + ((p−1)/p) nβ
    Allreduce (Reduce/Broadcast)          2⌈log(p)⌉ α + ⌈log(p)⌉ n(β + γ) + ⌈log(p)⌉ nβ
    Allgather (Gather/Broadcast)          2⌈log(p)⌉ α + ⌈log(p)⌉ nβ + ((p−1)/p) nβ

Figure 5: A building block approach to short vector algorithms on linear arrays.

    Reduce-scatter                        (p−1) α + ((p−1)/p) n(β + γ)
    Gather                                ⌈log(p)⌉ α + ((p−1)/p) nβ
    Scatter                               ⌈log(p)⌉ α + ((p−1)/p) nβ
    Allgather                             (p−1) α + ((p−1)/p) nβ
    Reduce (Reduce-scatter/Gather)        ((p−1) + ⌈log(p)⌉) α + 2((p−1)/p) nβ + ((p−1)/p) nγ
    Allreduce (Reduce-scatter/Allgather)  2(p−1) α + 2((p−1)/p) nβ + ((p−1)/p) nγ
    Broadcast (Scatter/Allgather)         ((p−1) + ⌈log(p)⌉) α + 2((p−1)/p) nβ

Figure 6: A building block approach to long vector algorithms on linear arrays.
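To make the short/long trade-off concrete, the two broadcast entries from Figs. 5 and 6 can be evaluated directly. The machine parameters below (10 μs latency, 1 ns per item) are illustrative assumptions for demonstration, not measured values:

```python
import math

def mst_bcast_cost(p, n, alpha, beta):
    """Short vector broadcast of Fig. 5: MST algorithm."""
    lg = math.ceil(math.log2(p))
    return lg * alpha + lg * n * beta

def scatter_allgather_cost(p, n, alpha, beta):
    """Long vector broadcast of Fig. 6: MST Scatter followed by BKT Allgather."""
    lg = math.ceil(math.log2(p))
    return (p - 1 + lg) * alpha + 2 * (p - 1) / p * n * beta

# Illustrative (assumed) parameters: alpha = 1e-5 s, beta = 1e-9 s per item.
P, ALPHA, BETA = 256, 1e-5, 1e-9
```

For small n the ⌈log(p)⌉ α term of the MST broadcast wins; once the β terms dominate, the scatter/allgather cost is roughly ⌈log(p)⌉/2 times cheaper.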

        = ((p−1) + ⌈log(p)⌉) α + 2((p−1)/p) nβ.

As n gets large and the β term dominates, this cost is approximately ⌈log(p)⌉/2 times faster than the MST Bcast algorithm and within a factor of two of the lower bound.

7.1.2 Reduce

Short vectors: MST algorithm.

Long vectors: BKT Reduce-scatter (the dual of the allgather) followed by an MST Gather (the dual of the scatter). This time all the receives for the gather can be preposted before the reduce-scatter commences, by observing that the completion of the reduce-scatter signals that all buffers for the gather are already available. Since ready-receive sends can be used during the gather, the cost becomes

    T_{Reduce-scatter-Gather}(p, n) = (p−1) α + ((p−1)/p) n(β + γ) + ⌈log(p)⌉ α + ((p−1)/p) nβ
                                    = ((p−1) + ⌈log(p)⌉) α + 2((p−1)/p) nβ + ((p−1)/p) nγ.

Again, the β term is within a factor of two of the lower bound while the γ term is optimal.

7.1.3 Scatter

Short vectors: MST algorithm.

Long vectors: Sending individual messages from the root to each of the other nodes. While the cost, (p−1) α + ((p−1)/p) nβ, is clearly worse than that of the MST algorithm, in practice the β term has sometimes been observed to be smaller, possibly because the cost of each message can be overlapped with those of other messages. We will call this the simple (SMPL) algorithm.

7.1.4 Gather

Same as scatter, in reverse.

7.1.5 Allgather

Short vectors: MST Gather followed by MST Bcast. To reduce the α term, the receives for the MST Bcast can be posted before the MST Gather commences, yielding a cost of

    T_{Gather-Bcast}(p, n) = ⌈log(p)⌉ α + ((p−1)/p) nβ + ⌈log(p)⌉ α + ⌈log(p)⌉ nβ
                           = 2⌈log(p)⌉ α + (⌈log(p)⌉ + (p−1)/p) nβ.

The α term of this cost is close to the lower bound of ⌈log(p)⌉ α.

Long vectors: BKT algorithm.

7.1.6 Reduce-scatter

Short vectors: MST Reduce followed by MST Scatter. The receives for the MST Scatter can be posted before the MST Reduce commences, for a cost of

    T_{Reduce-Scatter}(p, n) = ⌈log(p)⌉ α + ⌈log(p)⌉ n(β + γ) + ⌈log(p)⌉ α + ((p−1)/p) nβ
                             = 2⌈log(p)⌉ α + (⌈log(p)⌉ + (p−1)/p) nβ + ⌈log(p)⌉ nγ.

The α term of this cost is close to the lower bound of ⌈log(p)⌉ α.

Long vectors: BKT algorithm.

7.1.7 Allreduce

Short vectors: MST Reduce followed by MST Bcast. The receives for the MST Bcast can be posted before the MST Reduce commences, for a cost of

    T_{Reduce-Bcast}(p, n) = ⌈log(p)⌉ α + ⌈log(p)⌉ n(β + γ) + ⌈log(p)⌉ α + ⌈log(p)⌉ nβ
                           = 2⌈log(p)⌉ α + 2⌈log(p)⌉ nβ + ⌈log(p)⌉ nγ.

The α term of this cost is close to the lower bound of ⌈log(p)⌉ α.

Long vectors: BKT Reduce-scatter followed by BKT Allgather. The approximate cost is

    T_{Reduce-scatter-Allgather}(p, n) = (p−1) α + ((p−1)/p) n(β + γ) + (p−1) α + ((p−1)/p) nβ
                                       = 2(p−1) α + 2((p−1)/p) nβ + ((p−1)/p) nγ.

This cost achieves the lower bound for the β and γ terms.

7.2 Multidimensional meshes

Next, we show that on multidimensional meshes the α term can be substantially improved for long vector algorithms, relative to linear arrays. The key here is to observe that each row and column in a two-dimensional mesh forms a linear array and that all of our collective communication operations can be formulated as performing the operation first within rows and then within columns, or vice versa. For the two-dimensional mesh, we will assume that the p nodes physically form an r × c mesh and that the nodes are indexed in row-major order. For a mesh of dimension d, we will assume that the nodes are physically organized as a d₀ × d₁ × ⋯ × d_{d−1} mesh.

7.2.1 Broadcast

Short vectors: MST algorithm within columns followed by MST algorithm within rows. The cost is

    T_{Bcast-Bcast}(r, c, n) = (⌈log(r)⌉ + ⌈log(c)⌉)(α + nβ) ≈ ⌈log(p)⌉ (α + nβ).

Generalizing to a d-dimensional mesh yields an algorithm with a cost of

    Σ_{k=0}^{d−1} ⌈log(d_k)⌉ (α + nβ) ≈ ⌈log(p)⌉ (α + nβ).

Observe that the MST algorithm executed successively in multiple dimensions incurs approximately the same cost as performing it in just one dimension (e.g., on a linear array).

Long vectors: MST Scatter within columns, MST Scatter within rows, BKT Allgather within rows, BKT Allgather within columns. The approximate cost is

    T_{Scatter-Scatter-Allgather-Allgather}(r, c, n)
        = ⌈log(r)⌉ α + ((r−1)/r) nβ + ⌈log(c)⌉ α + ((c−1)/c)(n/r) β
          + (c−1) α + ((c−1)/c)(n/r) β + (r−1) α + ((r−1)/r) nβ

        = (c + r − 2) α + (⌈log(c)⌉ + ⌈log(r)⌉) α + 2((p−1)/p) nβ
        ≈ (c + r − 2) α + ⌈log(p)⌉ α + 2nβ.

As n gets large and the β term dominates, this cost is still within a factor of two of the lower bound for β, while the α term has been greatly reduced. Generalizing to a d-dimensional mesh yields an algorithm with a cost of

    Σ_{k=0}^{d−1} (d_k − 1) α + Σ_{k=0}^{d−1} ⌈log(d_k)⌉ α + 2((p−1)/p) nβ ≈ Σ_{k=0}^{d−1} (d_k − 1) α + ⌈log(p)⌉ α + 2nβ.

7.2.2 Reduce

Short vectors: MST algorithm within rows followed by MST algorithm within columns. The cost is

    T_{Reduce-Reduce}(r, c, n) = (⌈log(c)⌉ + ⌈log(r)⌉)(α + n(β + γ)) ≈ ⌈log(p)⌉ (α + n(β + γ)).

Generalizing to a d-dimensional mesh yields an algorithm with a cost of

    Σ_{k=0}^{d−1} ⌈log(d_k)⌉ α + Σ_{k=0}^{d−1} ⌈log(d_k)⌉ n(β + γ) ≈ ⌈log(p)⌉ (α + n(β + γ)).

Long vectors: BKT Reduce-scatter within rows, BKT Reduce-scatter within columns, MST Gather within columns, MST Gather within rows. The approximate cost is

    T_{Reduce-scatter-Reduce-scatter-Gather-Gather}(r, c, n)
        = (c−1) α + ((c−1)/c) n(β + γ) + (r−1) α + ((r−1)/r)(n/c)(β + γ)
          + ⌈log(r)⌉ α + ((r−1)/r)(n/c) β + ⌈log(c)⌉ α + ((c−1)/c) nβ
        = (c + r − 2) α + (⌈log(r)⌉ + ⌈log(c)⌉) α + 2((p−1)/p) nβ + ((p−1)/p) nγ
        ≈ (c + r − 2) α + ⌈log(p)⌉ α + 2nβ + nγ.

Generalizing to a d-dimensional mesh yields an algorithm with a cost of

    Σ_{k=0}^{d−1} (d_k − 1) α + Σ_{k=0}^{d−1} ⌈log(d_k)⌉ α + 2((p−1)/p) nβ + ((p−1)/p) nγ
        ≈ Σ_{k=0}^{d−1} (d_k − 1) α + ⌈log(p)⌉ α + 2nβ + nγ.

As n gets large and the β term dominates, this cost is within a factor of two of the lower bound for β. The γ term is optimal.

7.2.3 Scatter

Short vectors: MST algorithms successively in each of the dimensions.

Long vectors: SMPL algorithms successively in each of the dimensions.
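The α savings of the mesh formulation over the linear array can be checked numerically. The sketch below compares the two long vector broadcast costs; the parameters are again illustrative assumptions:

```python
import math

def linear_bcast_long(p, n, alpha, beta):
    """Long vector broadcast on a linear array: MST Scatter + BKT Allgather."""
    return (p - 1 + math.ceil(math.log2(p))) * alpha + 2 * (p - 1) / p * n * beta

def mesh2d_bcast_long(r, c, n, alpha, beta):
    """Long vector broadcast on an r x c mesh:
    Scatter (columns), Scatter (rows), Allgather (rows), Allgather (columns)."""
    p = r * c
    lg = math.ceil(math.log2(r)) + math.ceil(math.log2(c))
    return (c + r - 2 + lg) * alpha + 2 * (p - 1) / p * n * beta
```

The β terms are identical, so the mesh version wins for every n: its α coefficient is c + r − 2 rather than p − 1 (for example, 30 versus 255 when p = 256 and r = c = 16).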

7.2.4 Gather

Same as scatter, in reverse.

7.2.5 Allgather

Short vectors: MST Gather within rows, MST Bcast within rows, MST Gather within columns, MST Bcast within columns. Again, preposting can be used to reduce the α term for the broadcasts, for a total cost of

    T_{Gather-Bcast-Gather-Bcast}(r, c, n)
        = ⌈log(c)⌉ α + ((c−1)/c)(n/r) β + ⌈log(c)⌉ α + ⌈log(c)⌉ (n/r) β
          + ⌈log(r)⌉ α + ((r−1)/r) nβ + ⌈log(r)⌉ α + ⌈log(r)⌉ nβ
        = 2(⌈log(r)⌉ + ⌈log(c)⌉) α + ((p−1)/p + ⌈log(c)⌉/r + ⌈log(r)⌉) nβ
        ≈ 2⌈log(p)⌉ α + ((p−1)/p + ⌈log(c)⌉/r + ⌈log(r)⌉) nβ.

Generalizing to a d-dimensional mesh yields an algorithm with a cost of

    2 Σ_{k=0}^{d−1} ⌈log(d_k)⌉ α + ((p−1)/p + Σ_{k=0}^{d−1} ⌈log(d_k)⌉/(d_{k+1} ⋯ d_{d−1})) nβ
        ≈ 2⌈log(p)⌉ α + ((p−1)/p + Σ_{k=0}^{d−1} ⌈log(d_k)⌉/(d_{k+1} ⋯ d_{d−1})) nβ,

where the empty product (for k = d−1) equals one. Notice that the α term remains close to the lower bound of ⌈log(p)⌉ α while the β term has been reduced!

Long vectors: BKT Allgather within rows, BKT Allgather within columns. The cost is

    T_{Allgather-Allgather}(r, c, n) = (c−1) α + ((c−1)/c)(n/r) β + (r−1) α + ((r−1)/r) nβ
                                     = (c + r − 2) α + ((p−1)/p) nβ.

Generalizing to a d-dimensional mesh yields an algorithm with a cost of

    Σ_{k=0}^{d−1} (d_k − 1) α + ((p−1)/p) nβ.

7.2.6 Reduce-scatter

Short vectors: MST Reduce within columns, MST Scatter within columns, MST Reduce within rows, MST Scatter within rows. Preposting the receives for the scatter operations yields a total cost of

    T_{Reduce-Scatter-Reduce-Scatter}(r, c, n)
        = ⌈log(r)⌉ α + ⌈log(r)⌉ n(β + γ) + ⌈log(r)⌉ α + ((r−1)/r) nβ
          + ⌈log(c)⌉ α + ⌈log(c)⌉ (n/r)(β + γ) + ⌈log(c)⌉ α + ((c−1)/c)(n/r) β
        = 2(⌈log(r)⌉ + ⌈log(c)⌉) α + ((p−1)/p + ⌈log(c)⌉/r + ⌈log(r)⌉) nβ + (⌈log(c)⌉/r + ⌈log(r)⌉) nγ
        ≈ 2⌈log(p)⌉ α + ((p−1)/p + ⌈log(c)⌉/r + ⌈log(r)⌉) nβ + (⌈log(c)⌉/r + ⌈log(r)⌉) nγ.

Generalizing to a d-dimensional mesh yields an algorithm with a cost of

    2 Σ_{k=0}^{d−1} ⌈log(d_k)⌉ α + ((p−1)/p + Σ_{k=0}^{d−1} ⌈log(d_k)⌉/(d₀ ⋯ d_{k−1})) nβ
      + (Σ_{k=0}^{d−1} ⌈log(d_k)⌉/(d₀ ⋯ d_{k−1})) nγ
    ≈ 2⌈log(p)⌉ α + ((p−1)/p + Σ_{k=0}^{d−1} ⌈log(d_k)⌉/(d₀ ⋯ d_{k−1})) nβ + (Σ_{k=0}^{d−1} ⌈log(d_k)⌉/(d₀ ⋯ d_{k−1})) nγ,

where the empty product (for k = 0) equals one. Notice that the α term remains close to the lower bound of ⌈log(p)⌉ α while both the β and γ terms have been reduced!

Long vectors: BKT Reduce-scatter within rows, BKT Reduce-scatter within columns. The cost is

    T_{Reduce-scatter-Reduce-scatter}(r, c, n) = (c−1) α + ((c−1)/c) n(β + γ) + (r−1) α + ((r−1)/r)(n/c)(β + γ)
                                               = (r + c − 2) α + ((p−1)/p) n(β + γ).

Generalizing to a d-dimensional mesh yields an algorithm with a cost of

    Σ_{k=0}^{d−1} (d_k − 1) α + ((p−1)/p) n(β + γ).

7.2.7 Allreduce

Short vectors: MST Reduce followed by MST Bcast (both discussed above), preposting the receives for the broadcast. The approximate cost is

    T_{Reduce-Bcast}(d, n) = 2 Σ_{k=0}^{d−1} ⌈log(d_k)⌉ α + 2 Σ_{k=0}^{d−1} ⌈log(d_k)⌉ nβ + Σ_{k=0}^{d−1} ⌈log(d_k)⌉ nγ.

Long vectors: BKT Reduce-scatter followed by BKT Allgather (both discussed above). The approximate cost is

    T_{Reduce-scatter-Allgather}(d, n) = 2 Σ_{k=0}^{d−1} (d_k − 1) α + 2((p−1)/p) nβ + ((p−1)/p) nγ.

This cost achieves the lower bound for the β and γ terms.

7.3 Hypercubes

We now argue that the discussion of algorithms for multidimensional meshes includes all algorithms already discussed for hypercubes. Recall that a d-dimensional hypercube is simply a d-dimensional mesh with d₀ = ⋯ = d_{d−1} = 2. Now, the short vector algorithm for broadcasting successively in each of the dimensions

becomes the MST Bcast on hypercubes. The long vector algorithm for allgather that successively executes a BKT algorithm in each dimension is equivalent to a BDE algorithm on hypercubes. Similar connections can be made for the other algorithms discussed for multidimensional meshes. The conclusion is that we can concentrate on optimizing algorithms for multidimensional meshes. A by-product of the analyses will be optimized algorithms for hypercubes.

7.4 Fully connected architectures

Fully connected architectures can be viewed as multidimensional meshes, so that, as noted for hypercube architectures, it suffices to analyze the optimization of algorithms for multidimensional meshes.

8 Strategies for All Vector Lengths

We have developed numerous algorithms for the short and long vector cases. They have been shown to be part of a consistent family rather than a bag full of algorithms. The natural next question becomes how to deal with vectors of intermediate length. A naive solution would be to determine the crossover point between the short and long vector costs and switch algorithms at that crossover point. In this section, we show that one can do much better with hybrid algorithms. Key to this approach is the recognition that all collective communications have in common the property that the operation performed among all nodes yields the same result as when the nodes are logically viewed as a two-dimensional mesh and the operation is performed first in one dimension and next in the second dimension. This observation leaves the possibility of using a different algorithm in each of the two dimensions.

8.1 A prototypical example: broadcast

Consider a broadcast on an r × c mesh of nodes. A broadcast among all nodes can then be implemented as

Step 1: A broadcast within the row of nodes that includes the original root.

Step 2: Simultaneous broadcasts within columns of nodes, where the roots of the columns are in the same row as the original root.
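The two steps can be sketched as a sequential simulation over node buffers. This is a toy correctness model of the decomposition, not MPI code:

```python
def two_step_broadcast(r, c, root, data):
    """Simulate a broadcast on an r x c grid as a row broadcast from the
    root's row, followed by simultaneous column broadcasts."""
    buf = [[None] * c for _ in range(r)]
    ri, ci = divmod(root, c)   # row-major indexing, as assumed in the text
    buf[ri][ci] = data
    # Step 1: broadcast within the row that includes the original root.
    for j in range(c):
        buf[ri][j] = buf[ri][ci]
    # Step 2: broadcasts within columns, rooted at row ri.
    for j in range(c):
        for i in range(r):
            buf[i][j] = buf[ri][j]
    return buf
```

After both steps every node buffer holds the root's data, which is what licenses choosing a different broadcast algorithm for each step.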
For each of these two steps, a different broadcast algorithm can be chosen. Now, consider the case where in Step 1 a long vector algorithm is chosen: MST Scatter followed by BKT Allgather. It is beneficial to orchestrate the broadcast as

Step 1a: MST Scatter within the row of nodes that includes the original root.

Step 2: Simultaneous broadcasts within columns of nodes, where the roots of the columns are in the same row as the original root.

Step 1b: Simultaneous BKT Allgather within rows of nodes.

The benefit is that now in Step 2 the broadcasts involve vectors of length approximately n/c. This algorithm is illustrated in Fig. 7. For fixed r and c, combinations of short and long vector algorithms for Steps 1 and 2 are examined in Fig. 8. Notice that:

Option 2 is never better than Option 1, since ⌈log(r)⌉ + ⌈log(c)⌉ ≥ ⌈log(p)⌉.

Option 4 is never better than Option 3. The observation here is that if a scatter-allgather broadcast is to be used, it should be as part of Steps 1a-1b, so that the length of the vectors to be broadcast during Step 2 is reduced.

Option 5 is generally better than Option 6, since the α terms are identical while the β term is smaller for Option 5 than for Option 6 (since n/p < n/r).

[Figure 7: Broadcast on a (logical) 2D mesh implemented by (1a) a scatter within the row that includes the root, followed by (2) a broadcast within the columns, followed by (1b) an allgather within rows.]

    Option 1: MST Bcast (all nodes).
        Cost: ⌈log(p)⌉ α + ⌈log(p)⌉ nβ
    Option 2: Step 1: MST Bcast; Step 2: MST Bcast.
        Cost: (⌈log(r)⌉ + ⌈log(c)⌉) α + (⌈log(r)⌉ + ⌈log(c)⌉) nβ
    Option 3: Step 1a: MST Scatter; Step 2: MST Bcast; Step 1b: BKT Allgather.
        Cost: (c−1) α + (⌈log(r)⌉ + ⌈log(c)⌉) α + (⌈log(r)⌉/c + 2(c−1)/c) nβ
    Option 4: Step 1: MST Bcast; Step 2: MST Scatter-BKT Allgather.
        Cost: (r−1) α + (⌈log(r)⌉ + ⌈log(c)⌉) α + (⌈log(c)⌉ + 2(r−1)/r) nβ
    Option 5: Step 1a: MST Scatter; Step 2: MST Scatter-BKT Allgather; Step 1b: BKT Allgather.
        Cost: (r + c − 2) α + (⌈log(r)⌉ + ⌈log(c)⌉) α + 2((p−1)/p) nβ
    Option 6: Step 1: MST Scatter-BKT Allgather; Step 2: MST Scatter-BKT Allgather.
        Cost: (r + c − 2) α + (⌈log(r)⌉ + ⌈log(c)⌉) α + 2((r−1)/r + (c−1)/c) nβ
    Option 7: MST Scatter-BKT Allgather (all nodes).
        Cost: (p−1) α + ⌈log(p)⌉ α + 2((p−1)/p) nβ

Figure 8: Various approaches to implementing a broadcast on an r × c mesh. The cost analysis assumes no network conflicts occur.

Option 5 is generally better than Option 7, because the β terms are identical while the α term has been reduced, since r + c − 2 < p − 1 when p is not prime. Thus, we find that there are three algorithms of interest:

Option 1: MST Bcast among all nodes. This algorithm is best when the vector lengths are short.

Option 3: MST Scatter-MST Bcast-BKT Allgather. This algorithm is best when the vector length is such that the scatter leaves subvectors that are considered to be short when broadcast among r nodes.

Option 5: MST Scatter-MST Scatter-BKT Allgather-BKT Allgather. This algorithm is best when the vector length is such that the scatter leaves subvectors that are considered to be long when broadcast among r nodes, if multiple integer factorizations of p exist.

In Fig. 9 (left) we show the predicted cost, in time, of each of these options on a fully connected machine with p = 256, r = c = 16, and representative values of α, β, and γ. The graph clearly shows how the different options trade lowering the α term for increasing the β term, the extremes being the MST Scatter-BKT Allgather, with the greatest α term and lowest β term, and the MST Bcast, with the lowest α term and greatest β term.
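The regimes in which Options 1, 3, and 5 each win can be reproduced from the Fig. 8 cost expressions. The parameter values below are illustrative stand-ins, not the machine constants used for Fig. 9:

```python
import math

def lg(x):
    return math.ceil(math.log2(x))

def option1(p, r, c, n, a, b):   # MST Bcast among all nodes
    return lg(p) * a + lg(p) * n * b

def option3(p, r, c, n, a, b):   # Scatter, column Bcast, Allgather
    return (c - 1 + lg(r) + lg(c)) * a + (lg(r) / c + 2 * (c - 1) / c) * n * b

def option5(p, r, c, n, a, b):   # Scatter, Scatter, Allgather, Allgather
    return (r + c - 2 + lg(r) + lg(c)) * a + 2 * (p - 1) / p * n * b

def option7(p, r, c, n, a, b):   # Scatter + Allgather among all nodes
    return (p - 1 + lg(p)) * a + 2 * (p - 1) / p * n * b

def best_option(n, p=256, r=16, c=16, a=1e-5, b=1e-9):
    """Return which option has the lowest predicted time for length n."""
    costs = {1: option1(p, r, c, n, a, b), 3: option3(p, r, c, n, a, b),
             5: option5(p, r, c, n, a, b), 7: option7(p, r, c, n, a, b)}
    return min(costs, key=costs.get)
```

With these (assumed) parameters, Option 1 wins for short vectors, Option 3 for intermediate lengths, and Option 5 once the β term dominates; Option 7 never wins, matching the discussion above.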
Naturally, many combinations of r and c can be chosen, and the technique can be extended to more than two dimensions, which we discuss next.

8.2 Optimal hybridization on hypercubes

We now discuss how to optimally combine short and long vector algorithms for the broadcast operation. Our strategy is to develop theoretical results for hypercubes, first reported in [5] for the allgather operation. This theoretical result will then motivate heuristics for multidimensional meshes, discussed in Section 8.3. Assume that p = 2^d and that the nodes form a hypercube architecture (or, alternatively, are fully connected). We assume that a vector of length n is to be broadcast and, for simplicity, that n is an integer multiple of p. A strategy will be indicated by an integer D, 1 ≤ D ≤ d, and two vectors, (a₁, a₂, …, a_D) and (d₁, d₂, …, d_D), with d₁ + ⋯ + d_D = d. The idea is that the processors are viewed as a D-dimensional mesh, that the ith dimension of that mesh has 2^{d_i} nodes that themselves form a hypercube, and that a_i ∈ {long, short} indicates whether a long or short vector algorithm is to be used in the ith dimension.

[Figure 9: Predicted performance of various hybrid broadcast algorithms (p = 256). Left: performance of Options 1, 3, 5, and 7. Right: performance of many hybrids on a hypercube or fully connected architecture.]

[Figure 10: Predicted performance of various hybrid broadcast algorithms on a 5 × 7 mesh.]

If a_i = long, then a long vector algorithm is used in the ith dimension of the mesh. The vector is scattered among the 2^{d_i} processors, each piece (now reduced in length by a factor 2^{d_i}) is broadcast among the remaining dimensions of the mesh using the strategy (a_{i+1}, …, a_D), (d_{i+1}, …, d_D), and the result is then collected via BKT Allgather.

If a_i = short, then MST Bcast is used in the ith dimension of the mesh, and a broadcast among the remaining dimensions of the mesh using the strategy (a_{i+1}, …, a_D), (d_{i+1}, …, d_D) is employed.

The cost of a strategy is given by the inductively defined cost function

    C(n, (a₁, …, a_D), (d₁, …, d_D)) =
        0                                                                             if D = 0,
        (2^{d₁} − 1) α + d₁ α + 2((2^{d₁} − 1)/2^{d₁}) nβ
            + C(n/2^{d₁}, (a₂, …, a_D), (d₂, …, d_D))                                 if D > 0 and a₁ = long,
        d₁ α + d₁ nβ + C(n, (a₂, …, a_D), (d₂, …, d_D))                               if D > 0 and a₁ = short.

Some simple observations are:

Assume d_i > 1. Then

    C(n, (a₁, …, a_i, …, a_D), (d₁, …, d_i, …, d_D))
        ≥ C(n, (a₁, …, a_i, a_i, …, a_D), (d₁, …, 1, d_i − 1, …, d_D)),

that is, splitting the ith dimension into two dimensions of 2 and 2^{d_i − 1} nodes that use the same algorithm does not increase the cost. This observation tells us that an optimal (minimal cost) strategy can satisfy the restriction that D = d and d₁ = ⋯ = d_d = 1.

Assume a_i = short and a_{i+1} = long. Then

    C(n, (a₁, …, a_i, a_{i+1}, …, a_D), (d₁, …, d_i, d_{i+1}, …, d_D))
        ≥ C(n, (a₁, …, a_{i+1}, a_i, …, a_D), (d₁, …, d_{i+1}, d_i, …, d_D)).

This observation tells us that an optimal strategy can satisfy the restriction that a short vector algorithm is never used before a long vector algorithm.

Together these two observations indicate that an optimal strategy exists among the strategies that view the nodes as a d-dimensional mesh, with two nodes in each dimension, and that have the form (a₁, …, a_k, a_{k+1}, …, a_d), where a_i = long if i ≤ k and a_i = short if i > k. An optimal strategy can thus be determined by choosing k_opt, the number of times a long vector algorithm is chosen.
The cost of such a strategy is now given by

    C(n, k) = 2k α + ((2^{k+1} − 2)/2^k) nβ + (d − k) α + (d − k)(n/2^k) β
            = k α + d α + (2^{k+1} − 2 + d − k)(n/2^k) β.

Let us now examine C(n, k) vs. C(n, k+1):

    C(n, k+1) − C(n, k) = ((k+1) α + d α + (2^{k+2} − 2 + d − k − 1)(n/2^{k+1}) β)
                          − (k α + d α + (2^{k+1} − 2 + d − k)(n/2^k) β)
                        = α + (1 − d + k)(n/2^{k+1}) β.    (1)
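The closed form above, and the sign of the difference (1), can be checked by brute force. The following sketch uses assumed illustrative parameters and searches all k directly:

```python
def C(n, k, d, alpha, beta):
    """Closed-form cost of the hybrid strategy that uses a long vector
    algorithm in the first k of d hypercube dimensions, MST Bcast in the rest."""
    return k * alpha + d * alpha + (2 ** (k + 1) - 2 + d - k) * (n / 2 ** k) * beta

def k_opt(n, d, alpha, beta):
    """Number of long vector dimensions minimizing the predicted cost."""
    return min(range(d + 1), key=lambda k: C(n, k, d, alpha, beta))
```

Note that the difference formula shows k = d is never strictly preferable to k = d − 1: at k = d − 1 the difference reduces to α > 0, so at most d − 1 long vector steps are ever chosen.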

Collective communication: theory, practice, and experience. Concurrency and Computation: Practice and Experience 2007; 19:1749-1783. Published online 5 July 2007 in Wiley InterScience (www.interscience.wiley.com).


More information

AN INTEGER LINEAR MODEL FOR GENERAL ARC ROUTING PROBLEMS

AN INTEGER LINEAR MODEL FOR GENERAL ARC ROUTING PROBLEMS AN INTEGER LINEAR MODEL FOR GENERAL ARC ROUTING PROBLEMS Philie LACOMME, Christian PRINS, Wahiba RAMDANE-CHERIF Université de Technologie de Troyes, Laboratoire d Otimisation des Systèmes Industriels (LOSI)

More information

CMSC 754 Computational Geometry 1

CMSC 754 Computational Geometry 1 CMSC 754 Comutational Geometry 1 David M. Mount Deartment of Comuter Science University of Maryland Fall 2002 1 Coyright, David M. Mount, 2002, Det. of Comuter Science, University of Maryland, College

More information

10 File System Mass Storage Structure Mass Storage Systems Mass Storage Structure Mass Storage Structure FILE SYSTEM 1

10 File System Mass Storage Structure Mass Storage Systems Mass Storage Structure Mass Storage Structure FILE SYSTEM 1 10 File System 1 We will examine this chater in three subtitles: Mass Storage Systems OERATING SYSTEMS FILE SYSTEM 1 File System Interface File System Imlementation 10.1.1 Mass Storage Structure 3 2 10.1

More information

A Symmetric FHE Scheme Based on Linear Algebra

A Symmetric FHE Scheme Based on Linear Algebra A Symmetric FHE Scheme Based on Linear Algebra Iti Sharma University College of Engineering, Comuter Science Deartment. itisharma.uce@gmail.com Abstract FHE is considered to be Holy Grail of cloud comuting.

More information

Distributed Systems (5DV147)

Distributed Systems (5DV147) Distributed Systems (5DV147) Mutual Exclusion and Elections Fall 2013 1 Processes often need to coordinate their actions Which rocess gets to access a shared resource? Has the master crashed? Elect a new

More information

Patterned Wafer Segmentation

Patterned Wafer Segmentation atterned Wafer Segmentation ierrick Bourgeat ab, Fabrice Meriaudeau b, Kenneth W. Tobin a, atrick Gorria b a Oak Ridge National Laboratory,.O.Box 2008, Oak Ridge, TN 37831-6011, USA b Le2i Laboratory Univ.of

More information

Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations

Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, with GPU implementations Parallel Algorithms for the Summed Area Table on the Asynchronous Hierarchical Memory Machine, ith GPU imlementations Akihiko Kasagi, Koji Nakano, and Yasuaki Ito Deartment of Information Engineering Hiroshima

More information

Lecture 18. Today, we will discuss developing algorithms for a basic model for parallel computing the Parallel Random Access Machine (PRAM) model.

Lecture 18. Today, we will discuss developing algorithms for a basic model for parallel computing the Parallel Random Access Machine (PRAM) model. U.C. Berkeley CS273: Parallel and Distributed Theory Lecture 18 Professor Satish Rao Lecturer: Satish Rao Last revised Scribe so far: Satish Rao (following revious lecture notes quite closely. Lecture

More information

Mitigating the Impact of Decompression Latency in L1 Compressed Data Caches via Prefetching

Mitigating the Impact of Decompression Latency in L1 Compressed Data Caches via Prefetching Mitigating the Imact of Decomression Latency in L1 Comressed Data Caches via Prefetching by Sean Rea A thesis resented to Lakehead University in artial fulfillment of the requirement for the degree of

More information

Extracting Optimal Paths from Roadmaps for Motion Planning

Extracting Optimal Paths from Roadmaps for Motion Planning Extracting Otimal Paths from Roadmas for Motion Planning Jinsuck Kim Roger A. Pearce Nancy M. Amato Deartment of Comuter Science Texas A&M University College Station, TX 843 jinsuckk,ra231,amato @cs.tamu.edu

More information

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH

A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH A CLASS OF STRUCTURED LDPC CODES WITH LARGE GIRTH Jin Lu, José M. F. Moura, and Urs Niesen Deartment of Electrical and Comuter Engineering Carnegie Mellon University, Pittsburgh, PA 15213 jinlu, moura@ece.cmu.edu

More information

A Study of Protocols for Low-Latency Video Transport over the Internet

A Study of Protocols for Low-Latency Video Transport over the Internet A Study of Protocols for Low-Latency Video Transort over the Internet Ciro A. Noronha, Ph.D. Cobalt Digital Santa Clara, CA ciro.noronha@cobaltdigital.com Juliana W. Noronha University of California, Davis

More information

Ad Hoc Networks. Latency-minimizing data aggregation in wireless sensor networks under physical interference model

Ad Hoc Networks. Latency-minimizing data aggregation in wireless sensor networks under physical interference model Ad Hoc Networks (4) 5 68 Contents lists available at SciVerse ScienceDirect Ad Hoc Networks journal homeage: www.elsevier.com/locate/adhoc Latency-minimizing data aggregation in wireless sensor networks

More information

To appear in IEEE TKDE Title: Efficient Skyline and Top-k Retrieval in Subspaces Keywords: Skyline, Top-k, Subspace, B-tree

To appear in IEEE TKDE Title: Efficient Skyline and Top-k Retrieval in Subspaces Keywords: Skyline, Top-k, Subspace, B-tree To aear in IEEE TKDE Title: Efficient Skyline and To-k Retrieval in Subsaces Keywords: Skyline, To-k, Subsace, B-tree Contact Author: Yufei Tao (taoyf@cse.cuhk.edu.hk) Deartment of Comuter Science and

More information

Auto-Tuning Distributed-Memory 3-Dimensional Fast Fourier Transforms on the Cray XT4

Auto-Tuning Distributed-Memory 3-Dimensional Fast Fourier Transforms on the Cray XT4 Auto-Tuning Distributed-Memory 3-Dimensional Fast Fourier Transforms on the Cray XT4 M. Gajbe a A. Canning, b L-W. Wang, b J. Shalf, b H. Wasserman, b and R. Vuduc, a a Georgia Institute of Technology,

More information

Distributed Estimation from Relative Measurements in Sensor Networks

Distributed Estimation from Relative Measurements in Sensor Networks Distributed Estimation from Relative Measurements in Sensor Networks #Prabir Barooah and João P. Hesanha Abstract We consider the roblem of estimating vectorvalued variables from noisy relative measurements.

More information

Layered Switching for Networks on Chip

Layered Switching for Networks on Chip Layered Switching for Networks on Chi Zhonghai Lu zhonghai@kth.se Ming Liu mingliu@kth.se Royal Institute of Technology, Sweden Axel Jantsch axel@kth.se ABSTRACT We resent and evaluate a novel switching

More information

Hardware-Accelerated Formal Verification

Hardware-Accelerated Formal Verification Hardare-Accelerated Formal Verification Hiroaki Yoshida, Satoshi Morishita 3 Masahiro Fujita,. VLSI Design and Education Center (VDEC), University of Tokyo. CREST, Jaan Science and Technology Agency 3.

More information

Fast Distributed Process Creation with the XMOS XS1 Architecture

Fast Distributed Process Creation with the XMOS XS1 Architecture Communicating Process Architectures 20 P.H. Welch et al. (Eds.) IOS Press, 20 c 20 The authors and IOS Press. All rights reserved. Fast Distributed Process Creation with the XMOS XS Architecture James

More information

Introduction to Visualization and Computer Graphics

Introduction to Visualization and Computer Graphics Introduction to Visualization and Comuter Grahics DH2320, Fall 2015 Prof. Dr. Tino Weinkauf Introduction to Visualization and Comuter Grahics Grids and Interolation Next Tuesday No lecture next Tuesday!

More information

Computing the Convex Hull of. W. Hochstattler, S. Kromberg, C. Moll

Computing the Convex Hull of. W. Hochstattler, S. Kromberg, C. Moll Reort No. 94.60 A new Linear Time Algorithm for Comuting the Convex Hull of a Simle Polygon in the Plane by W. Hochstattler, S. Kromberg, C. Moll 994 W. Hochstattler, S. Kromberg, C. Moll Universitat zu

More information

arxiv: v1 [cs.dc] 13 Nov 2018

arxiv: v1 [cs.dc] 13 Nov 2018 Task Grah Transformations for Latency Tolerance arxiv:1811.05077v1 [cs.dc] 13 Nov 2018 Victor Eijkhout November 14, 2018 Abstract The Integrative Model for Parallelism (IMP) derives a task grah from a

More information

A GPU Heterogeneous Cluster Scheduling Model for Preventing Temperature Heat Island

A GPU Heterogeneous Cluster Scheduling Model for Preventing Temperature Heat Island A GPU Heterogeneous Cluster Scheduling Model for Preventing Temerature Heat Island Yun-Peng CAO 1,2,a and Hai-Feng WANG 1,2 1 School of Information Science and Engineering, Linyi University, Linyi Shandong,

More information

SPITFIRE: Scalable Parallel Algorithms for Test Set Partitioned Fault Simulation

SPITFIRE: Scalable Parallel Algorithms for Test Set Partitioned Fault Simulation To aear in IEEE VLSI Test Symosium, 1997 SITFIRE: Scalable arallel Algorithms for Test Set artitioned Fault Simulation Dili Krishnaswamy y Elizabeth M. Rudnick y Janak H. atel y rithviraj Banerjee z y

More information

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field

Learning Motion Patterns in Crowded Scenes Using Motion Flow Field Learning Motion Patterns in Crowded Scenes Using Motion Flow Field Min Hu, Saad Ali and Mubarak Shah Comuter Vision Lab, University of Central Florida {mhu,sali,shah}@eecs.ucf.edu Abstract Learning tyical

More information

Communication-Avoiding Parallel Algorithms for Solving Triangular Matrix Equations

Communication-Avoiding Parallel Algorithms for Solving Triangular Matrix Equations Research Collection Bachelor Thesis Communication-Avoiding Parallel Algorithms for Solving Triangular Matrix Equations Author(s): Wicky, Tobias Publication Date: 2015 Permanent Link: htts://doi.org/10.3929/ethz-a-010686133

More information

A Simple and Robust Approach to Computation of Meshes Intersection

A Simple and Robust Approach to Computation of Meshes Intersection A Simle and Robust Aroach to Comutation of Meshes Intersection Věra Skorkovská 1, Ivana Kolingerová 1 and Bedrich Benes 2 1 Deartment of Comuter Science and Engineering, University of West Bohemia, Univerzitní

More information

Coarse grained gather and scatter operations with applications

Coarse grained gather and scatter operations with applications Available online at www.sciencedirect.com J. Parallel Distrib. Comut. 64 (2004) 1297 1310 www.elsevier.com/locate/jdc Coarse grained gather and scatter oerations with alications Laurence Boxer a,b,, Russ

More information

EE678 Application Presentation Content Based Image Retrieval Using Wavelets

EE678 Application Presentation Content Based Image Retrieval Using Wavelets EE678 Alication Presentation Content Based Image Retrieval Using Wavelets Grou Members: Megha Pandey megha@ee. iitb.ac.in 02d07006 Gaurav Boob gb@ee.iitb.ac.in 02d07008 Abstract: We focus here on an effective

More information

Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScript Objects

Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScript Objects Identity-sensitive Points-to Analysis for the Dynamic Behavior of JavaScrit Objects Shiyi Wei and Barbara G. Ryder Deartment of Comuter Science, Virginia Tech, Blacksburg, VA, USA. {wei,ryder}@cs.vt.edu

More information

Implementations of Partial Document Ranking Using. Inverted Files. Wai Yee Peter Wong. Dik Lun Lee

Implementations of Partial Document Ranking Using. Inverted Files. Wai Yee Peter Wong. Dik Lun Lee Imlementations of Partial Document Ranking Using Inverted Files Wai Yee Peter Wong Dik Lun Lee Deartment of Comuter and Information Science, Ohio State University, 36 Neil Ave, Columbus, Ohio 4321, U.S.A.

More information

521493S Computer Graphics Exercise 3 (Chapters 6-8)

521493S Computer Graphics Exercise 3 (Chapters 6-8) 521493S Comuter Grahics Exercise 3 (Chaters 6-8) 1 Most grahics systems and APIs use the simle lighting and reflection models that we introduced for olygon rendering Describe the ways in which each of

More information

Randomized Selection on the Hypercube 1

Randomized Selection on the Hypercube 1 Randomized Selection on the Hyercube 1 Sanguthevar Rajasekaran Det. of Com. and Info. Science and Engg. University of Florida Gainesville, FL 32611 ABSTRACT In this aer we resent randomized algorithms

More information

Chapter 8: Adaptive Networks

Chapter 8: Adaptive Networks Chater : Adative Networks Introduction (.1) Architecture (.2) Backroagation for Feedforward Networks (.3) Jyh-Shing Roger Jang et al., Neuro-Fuzzy and Soft Comuting: A Comutational Aroach to Learning and

More information

Theoretical Analysis of Graphcut Textures

Theoretical Analysis of Graphcut Textures Theoretical Analysis o Grahcut Textures Xuejie Qin Yee-Hong Yang {xu yang}@cs.ualberta.ca Deartment o omuting Science University o Alberta Abstract Since the aer was ublished in SIGGRAPH 2003 the grahcut

More information

Pivot Selection for Dimension Reduction Using Annealing by Increasing Resampling *

Pivot Selection for Dimension Reduction Using Annealing by Increasing Resampling * ivot Selection for Dimension Reduction Using Annealing by Increasing Resamling * Yasunobu Imamura 1, Naoya Higuchi 1, Tetsuji Kuboyama 2, Kouichi Hirata 1 and Takeshi Shinohara 1 1 Kyushu Institute of

More information

The Scalability and Performance of Common Vector Solution to Generalized Label Continuity Constraint in Hybrid Optical/Packet Networks

The Scalability and Performance of Common Vector Solution to Generalized Label Continuity Constraint in Hybrid Optical/Packet Networks The Scalability and Performance of Common Vector Solution to Generalized abel Continuity Constraint in Hybrid Otical/Pacet etwors Shujia Gong and Ban Jabbari {sgong, bjabbari}@gmuedu George Mason University

More information

1.5 Case Study. dynamic connectivity quick find quick union improvements applications

1.5 Case Study. dynamic connectivity quick find quick union improvements applications . Case Study dynamic connectivity quick find quick union imrovements alications Subtext of today s lecture (and this course) Stes to develoing a usable algorithm. Model the roblem. Find an algorithm to

More information

Object and Native Code Thread Mobility Among Heterogeneous Computers

Object and Native Code Thread Mobility Among Heterogeneous Computers Object and Native Code Thread Mobility Among Heterogeneous Comuters Bjarne Steensgaard Eric Jul Microsoft Research DIKU (Det. of Comuter Science) One Microsoft Way University of Coenhagen Redmond, WA 98052

More information

The Anubis Service. Paul Murray Internet Systems and Storage Laboratory HP Laboratories Bristol HPL June 8, 2005*

The Anubis Service. Paul Murray Internet Systems and Storage Laboratory HP Laboratories Bristol HPL June 8, 2005* The Anubis Service Paul Murray Internet Systems and Storage Laboratory HP Laboratories Bristol HPL-2005-72 June 8, 2005* timed model, state monitoring, failure detection, network artition Anubis is a fully

More information

Skip List Based Authenticated Data Structure in DAS Paradigm

Skip List Based Authenticated Data Structure in DAS Paradigm 009 Eighth International Conference on Grid and Cooerative Comuting Ski List Based Authenticated Data Structure in DAS Paradigm Jieing Wang,, Xiaoyong Du,. Key Laboratory of Data Engineering and Knowledge

More information

Stereo Disparity Estimation in Moment Space

Stereo Disparity Estimation in Moment Space Stereo Disarity Estimation in oment Sace Angeline Pang Faculty of Information Technology, ultimedia University, 63 Cyberjaya, alaysia. angeline.ang@mmu.edu.my R. ukundan Deartment of Comuter Science, University

More information

Brigham Young University Oregon State University. Abstract. In this paper we present a new parallel sorting algorithm which maximizes the overlap

Brigham Young University Oregon State University. Abstract. In this paper we present a new parallel sorting algorithm which maximizes the overlap Aeared in \Journal of Parallel and Distributed Comuting, July 1995 " Overlaing Comutations, Communications and I/O in Parallel Sorting y Mark J. Clement Michael J. Quinn Comuter Science Deartment Deartment

More information

Distributed Algorithms

Distributed Algorithms Course Outline With grateful acknowledgement to Christos Karamanolis for much of the material Jeff Magee & Jeff Kramer Models of distributed comuting Synchronous message-assing distributed systems Algorithms

More information

Visualization, Estimation and User-Modeling for Interactive Browsing of Image Libraries

Visualization, Estimation and User-Modeling for Interactive Browsing of Image Libraries Visualization, Estimation and User-Modeling for Interactive Browsing of Image Libraries Qi Tian, Baback Moghaddam 2 and Thomas S. Huang Beckman Institute, University of Illinois, Urbana-Chamaign, IL 680,

More information

RST(0) RST(1) RST(2) RST(3) RST(4) RST(5) P4 RSR(0) RSR(1) RSR(2) RSR(3) RSR(4) RSR(5) Processor 1X2 Switch 2X1 Switch

RST(0) RST(1) RST(2) RST(3) RST(4) RST(5) P4 RSR(0) RSR(1) RSR(2) RSR(3) RSR(4) RSR(5) Processor 1X2 Switch 2X1 Switch Sub-logarithmic Deterministic Selection on Arrays with a Recongurable Otical Bus 1 Yijie Han Electronic Data Systems, Inc. 750 Tower Dr. CPS, Mail Sto 7121 Troy, MI 48098 Yi Pan Deartment of Comuter Science

More information

CS2 Algorithms and Data Structures Note 8

CS2 Algorithms and Data Structures Note 8 CS2 Algorithms and Data Structures Note 8 Heasort and Quicksort We will see two more sorting algorithms in this lecture. The first, heasort, is very nice theoretically. It sorts an array with n items in

More information

A Petri net-based Approach to QoS-aware Configuration for Web Services

A Petri net-based Approach to QoS-aware Configuration for Web Services A Petri net-based Aroach to QoS-aware Configuration for Web s PengCheng Xiong, YuShun Fan and MengChu Zhou, Fellow, IEEE Abstract With the develoment of enterrise-wide and cross-enterrise alication integration

More information

GEOMETRIC CONSTRAINT SOLVING IN < 2 AND < 3. Department of Computer Sciences, Purdue University. and PAMELA J. VERMEER

GEOMETRIC CONSTRAINT SOLVING IN < 2 AND < 3. Department of Computer Sciences, Purdue University. and PAMELA J. VERMEER GEOMETRIC CONSTRAINT SOLVING IN < AND < 3 CHRISTOPH M. HOFFMANN Deartment of Comuter Sciences, Purdue University West Lafayette, Indiana 47907-1398, USA and PAMELA J. VERMEER Deartment of Comuter Sciences,

More information

Record Route IP Traceback: Combating DoS Attacks and the Variants

Record Route IP Traceback: Combating DoS Attacks and the Variants Record Route IP Traceback: Combating DoS Attacks and the Variants Abdullah Yasin Nur, Mehmet Engin Tozal University of Louisiana at Lafayette, Lafayette, LA, US ayasinnur@louisiana.edu, metozal@louisiana.edu

More information

An improved algorithm for Hausdorff Voronoi diagram for non-crossing sets

An improved algorithm for Hausdorff Voronoi diagram for non-crossing sets An imroved algorithm for Hausdorff Voronoi diagram for non-crossing sets Frank Dehne, Anil Maheshwari and Ryan Taylor May 26, 2006 Abstract We resent an imroved algorithm for building a Hausdorff Voronoi

More information

Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed page of such transmission.

Each copy of any part of a JSTOR transmission must contain the same copyright notice that appears on the screen or printed page of such transmission. An Interactive Programming Method for Solving the Multile Criteria Problem Author(s): Stanley Zionts and Jyrki Wallenius Source: Management Science, Vol. 22, No. 6 (Feb., 1976),. 652-663 Published by:

More information

Near-Optimal Routing Lookups with Bounded Worst Case Performance

Near-Optimal Routing Lookups with Bounded Worst Case Performance Near-Otimal Routing Lookus with Bounded Worst Case Performance Pankaj Guta Balaji Prabhakar Stehen Boyd Deartments of Electrical Engineering and Comuter Science Stanford University CA 9430 ankaj@stanfordedu

More information

Complexity analysis and performance evaluation of matrix product on multicore architectures

Complexity analysis and performance evaluation of matrix product on multicore architectures Comlexity analysis and erformance evaluation of matrix roduct on multicore architectures Mathias Jacquelin, Loris Marchal and Yves Robert École Normale Suérieure de Lyon, France {Mathias.Jacquelin Loris.Marchal

More information

Power Savings in Embedded Processors through Decode Filter Cache

Power Savings in Embedded Processors through Decode Filter Cache Power Savings in Embedded Processors through Decode Filter Cache Weiyu Tang Rajesh Guta Alexandru Nicolau Deartment of Information and Comuter Science University of California, Irvine Irvine, CA 92697-3425

More information

Grouping of Patches in Progressive Radiosity

Grouping of Patches in Progressive Radiosity Grouing of Patches in Progressive Radiosity Arjan J.F. Kok * Abstract The radiosity method can be imroved by (adatively) grouing small neighboring atches into grous. Comutations normally done for searate

More information