A Bandwidth Latency Tradeoff for Broadcast and Reduction

Peter Sanders and Jop F. Sibeyn
Max-Planck-Institut für Informatik, Im Stadtwald, 66123 Saarbrücken, Germany.
sanders, jopsi@mpi-sb.mpg.de. http://www.mpi-sb.mpg.de/sanders, jopsi

Abstract. A novel algorithm is presented for broadcasting in the single-port and duplex model on a completely connected interconnection network with start-up latencies. None of the existing broadcasting algorithms performs close to optimal for packets that are neither very large nor very small. The broadcasting schedule of the new algorithm is based on trees of fractional degree. It performs close to optimal for all packet sizes. For realistic values of P it is up to 50% faster than the best alternative.

1 Introduction

Broadcasting. Consider P processing units, PUs, of a parallel machine. Broadcasting, the operation in which one processor has to send a message M to all other PUs, is a crucial building block for many parallel algorithms. Since it can be implemented once and for all in communication libraries such as MPI [2], it makes sense to strive for optimal algorithms for all P and all message lengths k. Since broadcasting is sometimes a bottleneck operation, even constant factors should not be neglected. In addition, by reversing the direction of communication, broadcasting algorithms can usually be turned into reduction algorithms. Reduction is the task of computing a generalized sum ⊕_{i<P} M_i, where initially message M_i is stored on PU i and where ⊕ can be any associative operator. Broadcasting and reduction are among the most important communication primitives. For example, some of the best algorithms for matrix multiplication or dense matrix-vector multiplication have these two functions as their sole communication routines [1].

Assumptions and Communication Models.
We study broadcasting long messages using the following assumptions, which reflect the properties of many current machines, at least for the transmission of long messages:
- The PUs are connected by a network.
- Communication is synchronous, i.e., both sender and receiver have to cooperate in transmitting a message.
- Transferring a message of length k between any pair of PUs takes time t + k. Here t is the start-up overhead, and the time unit is defined as the time needed to transfer one data item.
In the last point we assume that the topology of the network does not play a role. However, our algorithms have enough locality to work well on networks with limited bisection width, and we will also informally discuss how the communication structures can be embedded into meshes. We consider two communication models. The first is the single-port model, in which a PU can only participate in one communication at a time: it can either send or receive a packet, but not both. The second is the somewhat more powerful duplex model, in which a PU can concurrently send one message and receive another message (not necessarily from the same PU).

Lower Bounds. In the duplex model, all non-source PUs must receive the k data, and the whole broadcast takes at least log P steps. Thus in the duplex model there is a lower bound of

T lower, duplex = k + t log P.

In the single-port model all non-source PUs must receive the k data. These (P − 1)k data have been sent by at most P PUs, of which at most half can be sending at any time. So, even with perfect balance, the PUs are sending packets for 2(1 − 1/P) k time. This gives

T lower, one-port = 2(1 − 1/P) k + t log P.

For odd P there is always one idle PU; in that case the first term is 2k. These lower bounds hold in full generality. For a large and natural class of algorithms we can prove a stronger bound, though. Consider algorithms that divide the total data set of size k into s packets of size k/s each. All PUs operate synchronously, and in every step they send or receive at most one packet. So, until step s, there is still at least one packet known only to the source. Thus, for given s, at least s + log P steps are required in the duplex model. Each step takes k/s + t time. For given k, t and P, the minimum is attained for s = √(k log P / t):

T lower, duplex = k + t log P + 2 √(k t log P).    (1)

We were not able to come up with a corresponding bound for the one-port model. The problem is that one cannot conclude that the routing requires 2(s + log P) steps.

Previous Results.
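The packet-counting argument above can be checked numerically. The following sketch (ours, not from the paper) verifies that s = √(k log P / t) minimizes the step cost (s + log P)(k/s + t) and that the resulting minimum matches the closed form in (1); the sample values of k, t and P are arbitrary.

```python
import math

def duplex_steps_bound(s, k, t, P):
    """Cost of an s-packet schedule: at least s + log P steps of k/s + t each."""
    return (s + math.log2(P)) * (k / s + t)

k, t, P = 4096.0, 1.0, 1024
s_opt = math.sqrt(k * math.log2(P) / t)          # analytic minimizer
closed_form = k + t * math.log2(P) + 2 * math.sqrt(k * t * math.log2(P))

# The closed form agrees with the cost at the analytic optimum ...
assert abs(duplex_steps_bound(s_opt, k, t, P) - closed_form) < 1e-6
# ... and no integer packet count does better (the cost is convex in s).
assert all(duplex_steps_bound(s, k, t, P) >= closed_form - 1e-6
           for s in range(1, 10000))
```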
Peter may still write something nice here. In that case the next paragraph must be reworked. Delimitation against the hypercube algorithm; possibly an example (P = 6) for the H-tree embedding.

New Contributions. The earlier algorithms perform optimally in two extreme cases. If k is very large, the start-up effects become negligible, and one should concentrate on minimizing the amount of data to be transferred. In the other extreme case, if t is very large, the transmission time is negligible, and the sole objective is to minimize the number of communication steps. For intermediate cases the best of these approaches takes up to 50% more time than the lower bound; in the case of duplex connections even twice as much. Our refined broadcasting schedule comes much closer to the lower bounds for all values of k and t. For the duplex model we come to within lower-order terms of (1). The schedule is rather simple and might be implemented easily.
In general a duplex algorithm cannot be turned into a one-port algorithm by simply replacing every duplex step by two one-port steps: this would amount to constructing a two-coloring of a graph with degree two. For graphs containing odd cycles, such as the communication graph of our duplex algorithm, this is impossible. However, a modification of the schedule leads to a good one-port algorithm. It requires twice as long as the duplex algorithm.

Contents. After describing the known algorithms, we present fractional-tree broadcasting: first for the simpler duplex model, then for the one-port model. At the end of the paper we consider embedding the fractional trees on sparse networks.

2 Simple Algorithms

In order to be able to compare the performance of our new algorithm, we first present two well-known algorithms.

2.1 Linear Pipeline Broadcasting

For messages with k ≫ t P, a simple pipelined algorithm performs very well: the PUs are organized into a directed Hamiltonian path starting at the source PU, and the message is subdivided into s packets of size k/s. The packets are handed through the path in a pipelined fashion. An internal PU simply alternates between receiving a packet from its predecessor and forwarding it to its successor. An example is given in Figure 1.

Fig. 1. The simple pipelined algorithm for P = 4 and s = 5.

It is easy to see that the broadcasting time is

T pipe, one-port = (d + 2(s − 1)) (t + k/s).    (2)
Here d = P − 1 is the time step in which the first packet arrives at the last PU. (2) is minimized for s = √(k d/(2t)). Substitution gives

T pipe, one-port = (√(d t) + √(2k))² = (2 + o(1)) k + O(t P).

This means that for k ≫ t P broadcasting takes little more than the (near) optimal 2k time. For the duplex model, the constant two can be dropped in the analysis. For long messages, the algorithm is then almost twice as fast.

2.2 Binary Tree Broadcasting

For shorter messages, the high latency of O(t P) until the last PU begins to receive data is prohibitive for large P. This latency can be reduced to O(t log P) by using a binary tree instead of a Hamiltonian path. Each internal PU performs the following three steps in a loop: receive a packet from the father; send the packet to the left child; send the packet to the right child. This approach is illustrated in Figure 2.

Fig. 2. The tree structures used for the broadcasting, for trees of various depths d. The tree for depth d is constructed out of the trees for depths d − 1 and d − 2. The numbers indicate the routing steps in which a packet coming from the root is routed over the connections. Every three steps the root injects one packet.

For single-ported communication, each pass through the loop takes 3(t + k/s). Similar to the above calculation, we get

T bin, one-port = (d + 3(s − 1)) (t + k/s).    (3)

Here d = O(log P) is the time step in which the first packet arrives at the last PU. (3) is minimized for s = √(k d/(3t)). Substitution gives

T bin, one-port = (√(d t) + √(3k))² = (3 + o(1)) k + O(t log P).
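The crossover between the two simple algorithms can be illustrated with a small script. This is our own sketch: the latency terms d = P − 1 in (2) and d = 3⌈log P⌉ in (3) are simplifying assumptions, and the brute-force search over s stands in for the analytic optimum.

```python
import math

def pipeline_time(k, t, P, s):
    """Single-port linear pipeline, eq. (2): latency d = P - 1,
    then 2 extra steps per additional packet."""
    d = P - 1
    return (d + 2 * (s - 1)) * (t + k / s)

def binary_tree_time(k, t, P, s):
    """Single-port binary tree, eq. (3): assumed latency d = 3*ceil(log P),
    3 steps per pass through the receive/send/send loop."""
    d = 3 * math.ceil(math.log2(P))
    return (d + 3 * (s - 1)) * (t + k / s)

def best(f, k, t, P):
    """Brute-force the number of packets s."""
    return min(f(k, t, P, s) for s in range(1, 5001))

P, t = 1024, 1.0
# Short messages: the tree's O(t log P) latency wins.
assert best(binary_tree_time, 100.0, t, P) < best(pipeline_time, 100.0, t, P)
# Very long messages: the pipeline's 2k transfer term beats the tree's 3k.
assert best(pipeline_time, 10**7, t, P) < best(binary_tree_time, 10**7, t, P)
```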
This is a good algorithm even for rather small k, but for large k it takes up to 50% more time than the linear pipeline. In the duplex model, the PUs perform a two-step loop: receive a packet from the father and send the previous packet to the right child; send the latest packet to the left child. Thus each pass takes 2(t + k/s) time. For large k the linear pipeline is up to twice as fast.

3 Fractional Tree Broadcasting

The question arises whether better tradeoffs between bandwidth and latency can be achieved for medium sized messages than either a pipeline with linear latency or a binary tree with logarithmic latency but slower message transmission. Indeed, we now design a family of broadcast algorithms which achieve execution times (2 + 2/r + o(1)) k + O(r t log P) in the single-ported model and (1 + 1/r + o(1)) k + O(r t log P) in the duplex model, for any integer r ≥ 1. The linear pipeline and the binary tree can be viewed as special cases of this scheme for r = P and r = 1, respectively.

3.1 Duplex Model

The idea is to replace each node of a binary tree by a chain of r PUs. The input is fed into the head of this chain. The data is passed down the chain and on to a successor chain as in the linear pipeline algorithm. In addition, the PUs of the chain cooperate in feeding the data to the head of a second successor chain. Figure 3 shows the communication structure and an example. The resulting communication structure can be viewed as a tree with degree 1 + 1/r. We call it a fractional tree.

Schedule Details. Now we describe the routing schedule in more detail. The total data set of size k is divided into s packets, which are grouped into s/r runs of r packets each. The packets in each run are indexed from 1 through r. The same indices are used for the PUs in each chain. The routing is composed of phases. Each phase takes r + 1 steps. The steps are indexed from 0 through r. Initially all packets are stored in PU 1 of chain 0. Consider a chain at depth l.
In phase t ≥ l this chain is involved in forwarding the packets of run t − l, which enter at PU 1, through the chain. Also, it is involved in further forwarding the packets of run t − l − 1 through the chain and into both successor chains. Now we consider step j of phase t more precisely. Any PU i with i > j sends packet r + 1 + j − i of run t − l − 1 to the PU below it (for i = r, this means to the head of the first successor chain). For j > 0, PU i = j sends packet i of run t − l − 1 to the head of the second successor chain. Any PU i with i < j sends packet j − i of run t − l to the PU below it. The routing is illustrated in Figure 4.

Performance Analysis. Having established a smooth timing of the algorithm, the analysis can proceed analogously to that of the simple algorithms from Section 2. A tree of
Fig. 3. Left: the communication structure in a chain of r PUs which cooperate to supply two successor chains with all the data they have to receive. Right: a complete binary tree built out of such chains.

depth d consists of 2^{d+1} − 1 chains, each with r PUs. Thus, for given r and P, the tree has a depth of d = ⌊log(P/r + 1)⌋ − 1. If the data are subdivided into s packets, then every phase takes (r + 1)(t + k/s) time. There are s/r runs of packets. Thus, the whole routing consists of s/r + d + 1 phases, which can be performed in time

T fract, duplex = (s (r + 1)/r + d′) (t + k/s).    (4)

Here d′ = (d + 1)(r + 1) is the time step in which the first packet arrives at the last PU. (4) is minimized for s = √(k d′ r / ((r + 1) t)). Substitution gives

T fract, duplex = (1 + 1/r) k + d′ t + 2 √(k d′ (1 + 1/r) t) = (1 + o(1)) (k + k/r + d r t).

There are two additional terms: one is due to the slower progress of the packets, the other to the start-up latencies. If t is very large, the best strategy is to choose r = 1. If k is very large, one should take r = P. For moderate r the dependency of d on r is not so important. In that case, the optimum is attained for r = √(k/(d t)). Substitution gives

T fract = (1 + o(1)) (k + 2 √(k log P t)).    (5)

For many problem instances, the new approach is considerably faster than the best previous one. Actually, our result matches the lower bound of (1) for this type of algorithms.
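The per-step send rules of the duplex schedule can be sanity-checked by enumerating one phase. The sketch below is ours (the function name phase_sends is an invention); it encodes the three cases for step j and PU i as we read them, and verifies that over one phase both successor heads receive a complete run of r packets.

```python
def phase_sends(r):
    """One phase (steps j = 0..r) of the duplex fractional-tree schedule for a
    chain of r PUs. Returns (step, sender, destination, packet, run) tuples,
    where run 'prev' is run t-l-1 and 'new' is run t-l."""
    sends = []
    for j in range(r + 1):
        for i in range(1, r + 1):
            if j > 0 and i == j:
                # PU j feeds the head of the second successor chain.
                sends.append((j, i, 'second head', i, 'prev'))
            elif i > j:
                sends.append((j, i, 'down', r + 1 + j - i, 'prev'))
            else:  # i < j: forward the newly entering run
                sends.append((j, i, 'down', j - i, 'new'))
    return sends

r = 3
sends = phase_sends(r)
# Every PU sends exactly once per step (being duplex, it may also receive).
assert len(sends) == (r + 1) * r
# The second successor head receives every packet of run t-l-1 exactly once.
assert sorted(p for (_, _, dst, p, _) in sends if dst == 'second head') == [1, 2, 3]
# PU r feeds the first successor head every packet of run t-l-1 as well.
assert sorted(p for (_, i, dst, p, _) in sends if i == r and dst == 'down') == [1, 2, 3]
```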
Fig. 4. The fractional-tree algorithm for the duplex model, for r = 3. Depicted is the movement of packets through a chain at depth l and into the successor chains at depth l + 1. The five pictures show the development as a result of the 4 = r + 1 routing steps performed during some phase t. The packets with indices 1, 2, 3 denote the packets of run t − l − 1; the packets with indices 4, 5, 6 denote the packets of run t − l. The PUs that in the next step send to the head of the second successor chain are marked '&'. All other PUs send to the PUs below them.

Performance Example. For large P, almost a factor 2 is gained for t P = k. In that case both previous algorithms require more than 2k. According to (5), the fractional-tree algorithm requires only (1 + 2 √(log P / P)) k. For large P this is hardly more than k. For concrete moderate k the situation is slightly less good than the above result suggests. As an example we consider P = 1024 and t = k/4096. We can take r = 8, d = 6 and s = 512. Then there are 64 runs of length 8 and the routing consists of 71 phases of 9 steps each. Thus, there are 639 steps that take 639 (k/512 + k/4096) ≈ 1.40 k all together. How about the other algorithms? The pipelined algorithm in the duplex model takes T pipe, duplex = (P + s) (t + k/s) time, which is minimized for s = √(P k/t). So in our case, we should take s = 2048 and achieve a time consumption of 2.25 k. The binary tree algorithm in the duplex model takes T bin, duplex = (2 log P + 2s) (t + k/s), which is minimized for s = √(k log P / t). In our case, we should take s = 202 and achieve a time consumption of about 2.2 k.
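The arithmetic of such an example can be replayed mechanically. The sketch below is ours; it assumes the parameters P = 1024, t = k/4096, r = 8, d = 6, s = 512, and measures time in units of one data item, so that t = 1 and k = 4096.

```python
def fract_duplex_steps(r, d, s):
    """Eq. (4): (s/r + d + 1) phases of r + 1 steps each."""
    return (s // r + d + 1) * (r + 1)

k, t = 4096.0, 1.0        # time unit = one data item, so t = k/4096 means t = 1
r, d, s = 8, 6, 512
steps = fract_duplex_steps(r, d, s)
assert steps == 71 * 9 == 639           # 71 phases of 9 steps
T = steps * (t + k / s)
assert abs(T / k - 1.404) < 0.001       # about 1.40 k

# Duplex pipeline for comparison: (P + s)(t + k/s) with s = sqrt(P k / t) = 2048.
P, sp = 1024, 2048
assert abs((P + sp) * (t + k / sp) / k - 2.25) < 1e-9
```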
3.2 One-Port Model

As was pointed out in the introduction, for one-port machines scheduling the send and receive operations so that all communicating pairs of PUs agree cannot be achieved by simply doubling every step of a duplex algorithm. Therefore we do not simply copy the above algorithm, but construct a new schedule based on the same idea of trees of fractional degree.

Schedule Description. Again we work with chains of length r, but now we choose r even. Another difference with the duplex schedule is that now only the PUs with odd index within their chain are in charge of transferring packets to the head of the second successor chain. In this way, the synchronization is trivial: PUs with even indices receive in even steps and send in odd steps; PUs with odd indices send in even steps and receive in odd steps (though in some steps a PU is idle). An example is given in Figure 5.

Fig. 5. The fractional-tree algorithm for the one-port model, for r = 4. Depicted is the movement of packets through a chain at depth l and into the successor chains at depth l + 1. The seven pictures show the development as a result of the 6 = r + 2 routing steps performed during some phase t. The packets with indices 1, 2 denote the packets of run t − l − 1; the packets with indices 3, 4 denote the packets of run t − l. The PUs that in the next step send to the head of the second successor chain are marked '&'.

For chains of length r every phase takes r + 2 steps. PUs with even indices are idle for two steps: one sending and one receiving step. PUs with odd indices are idle during
one receiving step. In the duplex schedule PUs were sending all the time and receiving r packets during r + 1 steps. Thus here we pay some penalty: the average usage of the connections is slightly lower. During each phase a run of length r/2 progresses r steps deeper into the tree. As before, the depth d of the tree measured in chains equals d = ⌊log(P/r + 1)⌋ − 1.

Performance Analysis. We copy the analysis from Section 3.1 and modify it where necessary. If the data are subdivided into s packets, then every phase takes (r + 2)(t + k/s) time. There are 2s/r runs of packets. Thus, the whole routing consists of 2s/r + d + 1 phases, which can be performed in time

T fract, one-port = (2 s (r + 2)/r + d′) (t + k/s).    (6)

Here d′ = (d + 1)(r + 2) is the time step in which the first packet arrives at the last PU. (6) is minimized for s = √(k d′ r / (2 (r + 2) t)). Substitution gives

T fract, one-port = (2 + 4/r) k + d′ t + 2 √(2 k (1 + 2/r) d′ t) = (1 + o(1)) (2k + 4k/r + r d t).

A good choice is r = 2 √(k/(d t)). Substitution gives

T fract, one-port = (1 + o(1)) (2k + 4 √(k log P t)).    (7)

Except for lower-order terms, this is exactly twice the result for the duplex algorithm.

Performance Example. For large P, almost a factor 3/2 is gained for t P = k. In that case both previous algorithms require more than 3k. According to (7), the fractional-tree algorithm requires only (2 + 4 √(log P / P)) k. For large P this is hardly more than 2k. As a concrete example we consider P = 1024 and t = k/4096. We can take r = 16, d = 5 and s = 408. Then there are 51 runs of length 8 and the routing consists of 57 phases of 18 steps each. Thus, there are 1026 steps that take 1026 (k/408 + k/4096) ≈ 2.77 k all together. How about the other algorithms? The pipelined algorithm takes T pipe, one-port = (d + 2(s − 1)) (t + k/s) time with d = 2P − 3, which is minimized for s = √(P k/t). So in our case, we should take s = 2048 and achieve a time consumption of about 4.5 k. The binary tree algorithm in the one-port model takes T bin, one-port = (3 log P + 3(s − 1)) (t + k/s), which is minimized for s = √(k log P / t).
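The numbers of the one-port example can be replayed in the same way. This is our own sketch, again assuming P = 1024, t = k/4096, r = 16, d = 5, s = 408 and measuring time in units of one data item (t = 1, k = 4096).

```python
def fract_one_port_steps(r, d, s):
    """One-port schedule: 2s/r + d + 1 phases of r + 2 steps each."""
    return (2 * s // r + d + 1) * (r + 2)

k, t = 4096.0, 1.0        # time unit = one data item, so t = k/4096 means t = 1
r, d, s = 16, 5, 408
steps = fract_one_port_steps(r, d, s)
assert steps == 57 * 18 == 1026         # 57 phases of 18 steps
T = steps * (t + k / s)
assert abs(T / k - 2.77) < 0.01         # about 2.77 k
```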
In our case, we should take s = 202 and achieve a time consumption of about 3.3 k.

4 Sparse Interconnection Networks

5 Conclusion

We have shown that the fractional-tree schedule leads to faster broadcasting in the one-port and duplex model. It is up to twice as fast as earlier simple approaches. In comparison to the optimal hypercube schedule it has the advantage of being simpler, and it requires much smaller bandwidth when embedded on networks such as meshes and hierarchical crossbars.
References

1. Kumar, V., A. Grama, A. Gupta, G. Karypis, Introduction to Parallel Computing: Design and Analysis of Algorithms, Benjamin/Cummings, 1994.
2. Snir, M., S.W. Otto, S. Huss-Lederman, D.W. Walker, J. Dongarra, MPI: The Complete Reference, MIT Press, 1996.