Intensive Hypercube Communication: Prearranged Communication in Link-Bound Machines 1 2

Size: px

Start display at page:

Download "Intensive Hypercube Communication: Prearranged Communication in Link-Bound Machines 1 2"

Aron Franklin
5 years ago
Views:

1 This paper appears in J. of Parallel an Distribute Computing 10 (1990), pp Intensive Hypercube Communication: Prearrange Communication in Link-Boun Machines 1 2 Quentin F. Stout an Bruce Wagar Department of Electrical Engineering an Computer Science University of Michigan Ann Arbor, MI USA September 7, A preliminary conense version of this paper appeare as Passing messages in link-boun hypercubes, Hypercube Mutiprocessors 1987, M. Heath, e., pp This research was partially supporte by National Science Founation grant DCR an an Incentives for Excellence awar from Digital Equipment Corp.

2 Abstract Hypercube algorithms are evelope for a variety of communication-intensive tasks such as transposing a matrix, histogramming, one noe sening a (long) message to another, broacasting a message from one noe to all others, each noe broacasting a message to all others, an noes exchanging messages via a fixe permutation. The algorithm for exchanging via a fixe permutation can be viewe as a eterministic analogue of Valiant s ranomize routing. The algorithms are for link-boun hypercubes in which local processing time is ignore, communication time preominates, message heaers are not neee because all noes know the task being performe, an all noes can use all communication links simultaneously. Through systematic use of techniques such as pipelining, batching, variable packet sizes, symmetrizing, an completing, for all problems algorithms are obtaine which achieve a time with an optimal highest-orer term.

3 1 Introuction This paper gives efficient hypercube algorithms for a variety of communication-intensive tasks. The emphasis is on problems where the communication pattern (i.e., which noes are sening information an where it is going) is known in avance by all processors, an all messages are of the same size. This situation is common in SIMD (Single Instruction Multiple Data) or SCMD (Single Coe Multiple Data) applications such as matrix multiplication or inversion, some atabase operations, solving PDEs on a regular gri, image manipulation, an histogramming. By systematic use of a few basic techniques, algorithms are evelope which are significantly faster than the simplest or most common ones, an faster than those publishe previously. Our algorithms show that these techniques are quite powerful, an also show that it is avantageous to be able to use all communication links simultaneously when communication becomes a bottleneck. 1.1 Definition of a Hypercube A -imensional hypercube computer is a istribute-memory multiprocessor consisting of 2 separate processing elements (or noes), linke together in a -imensional binary cube network (see Figure 1). Each noe is given a unique -bit ientification number (henceforth referre to as the noe i..) an two noes linke if an only if their noe i.. s iffer in exactly one bit position. Two noes in a hypercube are sai to be ajacent or neighboring if they share a link. Hypercube networks have some useful properties such as a logarithmic communication iameter an a totally symmetric layout. Because the number of links per noe grows only linearly with the imension, reasonably-size hypercubes (of say, imension 10 or 12) can be practically built with current technology. NCUBE, Intel, Floating Point Systems, Ametek, an Thinking Machines have alreay introuce hypercube computers on the commercial market. 1.2 The Link-Boun Moel This paper uses a moel of hypercubes in which communication time is assume to preominate, an local processing time by the noes can be ignore. We are intereste in problems where extensive communication is require because very long messages are being sent, an where all noes are available to participate in the task an know the communication task being performe. This latter point helps reuce the communication time by insuring that messages o not nee to inclue heaer information such as message estination, message route, packet sequence number, etc. Further, we assume that each noe can utilize all of its communication links simultaneously, where the links between neighboring noes can be use in both irections simultaneously. Thus, a noe in a -imensional hypercube may be hanling 2 messages simultaneously, receiving an sening. This property we call link-boun, as oppose to other possibilities such as processor-boun (in which each noe can o only one operation at a time) or DMA-boun (in which there is an upper boun on the number of messages which can go in an out of a noe at one time). While no hypercube can currently use all of its communication links simultaneously, several manufacturers are trying to attain such ability. The NCUBE machines apparently come the closest [4], an the FPS T-Series machines have noes capable of 4 bi-irectional communications at the same time [3]. The link-boun moel has been stuie in [1, 5, 6, 7, 8, 9, 10, 11, 12, 13]. 1.3 Message-Passing Problems Throughout the paper, various hypercube communication patterns will be looke at. Each will involve sening messages between some known combination of the noes. These messages will be the same length m, but their contents may vary between ifferent pairs of communicating noes. It 1

4 = 0 = 1 = 2 = 3 Figure 1: Some Hypercubes for Small will be assume that messages may be broken own into packets at any time, while existing packets may be either recombine or broken own still further, in orer to facilitate sening, so long as the ultimate estination eventually receives the entire original message. Another key assumption is that a noe receiving a packet must finish receiving it before any of its contents can be utilize. This is sometimes calle the store-an-forwar or packet-switche moel, as oppose to a circuit-switche moel. All existing hypercubes use store-an-forwar. We assume that it takes m + β time for a noe to sen a packet of length m to a neighbor, where represents the transfer rate an β the time for start-up an termination. In real systems it is generally the case that β, an hence it is important to inclue the effect of start-up costs. Also note that the total internal I/O banwith of the computer is 2. In general, an β are treate as constants an the algorithms are analyze as functions of an m. As in [5, 6, 7, 8, 11], we ignore the effects of rouning or truncating. Furthermore, because special cases arise for various relative values of the parameters, the exact formulas will be given only for m sufficiently large. Because we are most intereste in processing long messages, the term containing the highest power of m will be calle the highest-orer term. Some of the problems stuie in this paper, such as broacasting a message from one noe to all others, have been previously consiere by [5, 9, 11]. These papers use a similar moel to the link-boun moel, but give slightly slower algorithms. In particular, [11] consiere a moel where communication between noes can only take place one irection at a time, so where this mae a ifference, the times of their algorithms have been ivie by two so that they can be fairly compare with ours. For most of the problems, several algorithms are evelope, the last being the fastest (an most sophisticate). In all problems, simple arguments show that the best versions of our algorithms have optimal high-orer terms. In some cases we believe that our algorithms are completely optimal, but are unable to prove this because our arguments can boun terms involving, m, an, or terms involving β an, but not sums of terms of ifferent types. 1.4 Notation Several of the patterns consiere here are oriente aroun one noe. For such patterns, it will be assume without loss of generality that noe 0 is the special noe. The set of all noes (or their corresponing i.. s, epening on context) which are istance k from noe 0 will be enote by C k. Note that because of the symmetry of the hypercube, any algorithm written for noe 0 can be 2

5 converte to an ientical algorithm for noe n {1,2,...,2 1} by exclusive-oring all noe i.. s reference in the noe 0 algorithm with n. Some of the algorithms utilize somewhat subtle message routing schemes, for which the following notation will be useful. Let k (n) enote the noe i.. forme by taking the bit-wise exclusive-or of n an 2 k. I.e., it is just n with the kth bit flippe. For example, if = 5, then 1 (01001) = an 3 (01001) = Observe then that the neighbors of noe n are noes 0 (n), 1 (n),..., 1 (n). Let i (n) enote the left circular-shift (rotation) of the -bit representation of n by i bits, where i {0,1,..., 1} an n {0,1,...,2 1}. For example, if = 5, then 2 (01001) = For each n {1,2,...,2 1}, let W n enote the set of all bit positions in the binary representation of n which are 1. That is, W n is the unique set of integers such that n = w W n 2 w. Likewise, let Z n = {0,1,..., 1} \ W n enote the corresponing set of 0-bit positions of n. Define n : Z n W n as follows: min W n if z > maxw n n (z) = min{w W n : w > z} otherwise for each z Z n. I.e., n (z) is the bit position of the first 1 (in left circular orer) after z in n. For example, when = 5, then (1) = 3 an (4) = 0. 2 Sening the Same Message From One Noe to All Others One of the funamental hypercube communication patterns is broacasting, in which one noe has to sen the same message to all the other noes in the hypercube. This problem has been previously examine in [5, 11], but our algorithms are somewhat faster. It is interesting to note that the NCUBE machine has special harware instructions to allow a noe to simultaneously broacast to all of its neighbors, though the current operating system oes not make use of these instructions. Assume noe 0 is to o the broacasting. The most common an straightforwar way to broacast in a hypercube is recursive oubling. Algorithm: BROADCAST 1 (Recursive Doubling) There are stages, numbere 0,1,..., 1. During stage k, noes 0,1,...,2 k 1 sen the message to noes k (0), k (1),..., k (2 k 1), respectively (an concurrently). Analysis of BROADCAST 1 Each of the stages takes m + β time, so the whole algorithm requires time (m + β) = m + β. BROADCAST 1 works on only one imension at a time an thus fails to take avantage of the concurrent rea/write channels of the link-boun hypercube. To alleviate this, it is necessary to symmetrize the algorithm. That is, break up the problem into subproblems an run them simultaneously along separate imensions. 3

6 Algorithm: BROADCAST 2 Symmetrize BROADCAST 1. I.e., break up the message into packets P 0,P 1,...,P 1, each of i size m/. During stage k, noes (0), i (1),..., (2 k 1) sen P i to noes (i+k) mo ( (0)), i (i+k) mo ( i (1)),..., (i+k) mo ( i (2 k 1)), respectively, for each i in {0,1,..., 1}. Notice that packet P 0 reaches each processor at the same stage as the single packet in BROAD- CAST 1, an in general at the en of stage k each packet has been broacast to a k-imensional subcube. The moular packet routing guarantees that no processor is trying to sen two packets along the same link at the same time. Analysis of BROADCAST 2 The same as for BROADCAST 1, only now each stage takes time (m/) + β for a total time of ( m ) + β = m + β. i Although much improve from the previous version, BROADCAST 2 still suffers from the fact that a large percentage of the links are ile most of the time. In fact, half of the links are never use (those which are irecte towars 0). A better algorithm woul be one in which each link is always busy sening ata to a noe which has not yet seen it. This is impossible to fully achieve, but it provies a goal to aim for. The next version attempts this in a systolic fashion, for the message travels across the hypercube in a wave going from one C k to the next, with every noe actively sening along all of its links when the wave hits it. Algorithm: BROADCAST 3A There are +1 stages, numbere 0,1,...,. Break up the message into packets as in BROADCAST 2. During stage k, only the noes in C k actively sen along their their outgoing links, each noe sening exactly one packet along each such link. Every noe in C k receives k istinct packets from C k 1 uring stage k 1 an the remaining k packets from C k+1 uring stage k+1. More specifically, uring stage k, each noe n C k starts out with P w for every w W n. For each such w, it sens P w to noe w (n) C k 1. In aition, for each z Z n, it sens P n(z) to noe z (n) C k+1. Figure 2 shows this process for = 3. By inuction, it is easy to verify that uring stage k 1, a noe n C k receives P w for every w W n, receiving it from noe v (n) C k 1, where v W n is such that n (v) = w. During stage k+1 noe n receives P i from i (n) C k+1 for each i W n. Therefore, at the en of stage k+1, every noe in C k has receive all of the packets. Analysis of BROADCAST 3A Similar to that of BROADCAST 2, only now there are +1 stages an thus, a total time of (+1) ( m ) + β = +1 m + (+1)β. Although slower than the previous broacast, 3A has the special property that only noes in C k are sening in the kth stage. This will be exploite later, but first notice that the only purpose of the last stage is to sen P 0,P 1,...,P 1 from noe 2 1 to noes 0 (2 1), 1 (2 1),..., 1 (2 1), respectively. 4

7 P 1 P P 1 P 1 P 1 P 1 P 0 P 0 P 1 P 2 P 2 P 2 P 0 P P 0 P 2 P 1 P 1 P 0 P 0 P 0 P 2 P 0 P Stage 0 Stage 1 Stage 2 Stage 3 Figure 2: BROADCAST 3A for = 3 Algorithm: BROADCAST 3B Ientical to BROADCAST 3A, but the last stage is eliminate by sening a secon wave right after the first. During stage 1, noe 0 relabels P 0,P 1,P 2,...,P 1 as Q 1,Q 0,Q 1,...,Q 2, respectively, an starts a secon BROADCAST 3A, only this time using Q 0,Q 1,...,Q 1 in place of P 0,P 1,...,P 1, respectively. This secon broacast is run for only 1 stages, after which each noe in C 1 will have receive every packet (an then some). Analysis of BROADCAST 3B The one stage saving results in a final time of ( m ) + β = m + β, the same as for BROADCAST 2. BROADCAST 3 can be significantly spe up through the use of pipelining. Basically, pipelining consists of breaking up a problem into smaller pieces an sening these out as separate waves, one after another, in orer to keep more of the noes busy at the same time. It is a useful tool for speeing up asymmetrical communication algorithms. Algorithm: BROADCAST 4 Pipeline BROADCAST 3. That is, ivie up the message into g groups, numbere 0,1,...,g 1, each of length m/g. Execute g 1 separate BROADCAST 3A s, one for each of the first g 1 groups, followe by a BROADCAST 3B for the last group. These shoul be one concurrently, but staggere one stage apart so that no noe is working on more than one at a time. To be more precise, there are +g 1 stages, numbere 0,1,...,+g 2. During stage k, each noe in C j, k g+1 j k, executes stage j of BROADCAST 3 for group k j. 5

8 Analysis of BROADCAST 4 Each of the +g 1 stages now takes (m/g) + β time, so the whole algorithm requires time (+g 1) ( m ) g + β = (+g 1) g By simple calculus, it can be shown that this time is minimize when which prouces a final time of g = 1 = 1 ( 1) β m 1/2 2 m + (+g 1)β., m + β = 1 m + 2 ( 1)β m 1/2 + ( 1)β 2. For purposes of comparison, the best broacast in [5] has a running time of m + 2 βm 1/2 + β. One last observation. BROADCAST 4 is absolutely optimal if all sens have to use fixe-size packets. To see this, suppose the packet size is s. Then each noe will have to receive at least m/s of these packets an, since only packets can be sent out at a time from noe 0, at least one such packet will have to wait (m/s 1)(s + β) time for a free link before it can leave noe 0. To reach noe 2 1, this packet will then have to travel links, which takes time (s+β), for a total minimum time of ( + ms 1 ) (s + β). But this is just what BROADCAST 4 takes when you let g = m/s. 3 Sening From One Noe to the Opposite Corner Noe Another important communication pattern is when some noe n sens to its opposite corner (o.c.) noe 2 1 n. Obviously, this is similar to, an coul be accomplishe by, broacasting. Although this might appear to waste time, the analysis of BROADCAST 4 shows that that algorithm alreay performs an optimal o.c. sen, if one is limite to fixe-size packets. Which is not to say that variable-size packets will help much, for at least (/)m + β time is neee just to get all of the ata out of noe n. More important, though, is to cut own all of the unnecessary communication. Assume that noe 0 wants to sen to noe 2 1. Algorithm: O.C. SEND 1 There are stages, numbere 0,1,..., 1. During stage k, noe 2 k 1 sens the message to noe k (2 k 1). 6

9 Analysis of O.C. SEND 1 Ientical to BROADCAST 1: (m + β) = m + β. Algorithm: O.C. SEND 2 Symmetrize O.C. SEND 1. I.e., break up the ata into packets P 0,P 1,...,P 1, each of size m/. During stage k, noe (2 i k 1) sens P i to noe (i+k) mo ( (2 i k 1)) for each i {0,1,..., 1}. Analysis of O.C. SEND 2 Ientical to BROADCAST 2: ( m ) + β = m + β. These algorithms reuce the total communication neee in their broacast equivalents consierably. Unfortunately, they on t save any time. In orer to o that, it is necessary to spee up the intermeiate stages. Algorithm: O.C. SEND 3 There are stages, numbere 0,1,..., 1. The packets will vary in size epening on the stage. During stage k, only the noes of C k will be sening, while only noes in C k+1 will be receiving. At the start of stage k, the ata starts out evenly istribute among each of the ( k) noes in Ck. Each such noe then breaks up its ata into k equal-size packets an sens a ifferent one out along each of its k links to C k+1, at which point the message will be evenly istribute among C k+1 s noes. Analysis of O.C. SEND 3 Stage k takes time so the whole algorithm requires where [ 1 k=0 m ( + β = k) ( k) ( 1 k ] ) m + β = ( 1 k [ 1 k=0 ) m + β, 1 ( 1 k ) ] m + β = K 1 m + β, K = k=0 1 ( k). 7

10 To help unerstan K, note that it s first six values are 1,2,2 1 2,22 3,22 3, an For 5, an hence, K ( 1) + 6( 5) ( 1)( 2) < ( 1), K The upshot of all this is that O.C. SEND 3 is faster than BROADCAST 3 by a factor of about /2. Like BROADCAST 3, O.C. SEND 3 also lens itself well to pipelining. Algorithm: O.C. SEND 4 Pipeline O.C. SEND 3. I.e., break up the message into g groups of length m/g an execute g separate O.C. SEND 3 s, one for each group. These shoul be one concurrently, but staggere one stage apart. Because O.C. SEND 3 s stages have varying lengths, there will be no synchronization between the various stages of O.C. SEND 3 being worke on for each of the groups. Since O.C. SEND 3 s first stage, which takes time (m/g) + β, is at least as long as any other of its stages, it sets the rate at which the groups can be starte. Each noe will be able to complete sening the previous group by the time it has finishe receiving the next group. Analysis of O.C. SEND 4 Due to the fact that there is no chance of conflict, the time neee for this algorithm is just the time neee to start the first g 1 O.C. SEND 3 s, as well as all of the time neee for the last one. This works out to be (g 1) ( m ) g + β + K 1 m g + β = g+k 1 1 m + (g+ 1)β, g which is minimize when g = 1 = 1 (K 1 1) β m 1/2 2. This prouces a final time of m + β = 1 m + 2 (K 1 1)β m 1/2 + ( 1)β 2. Notice that this is ientical to the time for BROADCAST 4, except that the coefficient of the m 1/2 term has been reuce by a factor of approximately 1. A final observation about the O.C. SEND algorithms: they only use links in one irection (i.e., towars noe 2 1). Consequently, another o.c. sen coul be one from noe 2 1 to noe 0 concurrently (that is, an opposite corner exchange) since they each woul use ifferent links. 8

11 4 Sening Messages Between Two Arbitrary Noes One of the more interesting questions that can be aske is, given the full use of the link-boun hypercube, what is the fastest possible time for one noe to sen to an arbitrary noe istance n away (1 n )? For simplicity, assume noe 0 is sening to noe 2 n 1. Algorithm: ARBITRARY SEND 1 Use O.C. SEND 4 on the n-imensional subcube containing noes 0,1,...,2 n 1. Analysis of ARBITRARY SEND 1 m + β n = 1 n m + 2 (K n 1 1)β m 1/2 + (n 1)β n 2 n. Like previous algorithms, this approach fails to make use of the tremenous banwith available an is slower than broacasting for all n <. Algorithm: ARBITRARY SEND 2 Use BROADCAST 4, eleting the last 2 n stages if n < 2. These stages were only neee to get the message to noes farther than istance n from noe 0. Analysis of ARBITRARY SEND 2 There are n+g 1 +g 1 stages, which prouce minimum times when g = 0 n 2 2 n 1 = 1 (n+1) β m 1/2 ( 1) m 1/2 β 1 n n. The corresponing times are then m + β = 1 m + 2 (n+1)β m 1/2 + (n+1)β 1 n 2 m + 2 ( 1)β m 1/2 + ( 1)β 0 2 n. 9

12 Closer inspection of BROADCAST 3 reveals that yet another stage can be save when /2 n 2 an a fraction of a stage save when 1 n < /2, but these only reuce the m 1/2 an β coefficients by negligible amounts. What s neee is a way to combine the link utilization of BROADCAST 4 with the efficiency of O.C. SEND 4. With this in min, it is necessary to look at a new communication pattern, the extene o.c. (x.o.c.) sen. It is similar to a regular o.c. sen, except that the sening an receiving noes are connecte to opposite corners of an n-imensional subcube S, an all communication must go through S. Algorithm: X.O.C. SEND 1 Ientical to O.C. SEND 3, except two stages, numbere 1 an n, are ae to get the items into an out of S. Analysis of X.O.C. SEND 1 The first an last stages each take time m + β, so the whole algorithm requires 2(m + β) + K n 1 m + nβ = n (2n + K n 1) n m + (n+2)β. Algorithm: X.O.C. SEND 2 Pipeline X.O.C. SEND 1 as in O.C. SEND 4. There are still g groups of m/g items apiece, only in this case the (m/g) + β time neee to get a group out of the sening noe sets the pace for the rest of the stages. Analysis of X.O.C. SEND 2 Similar to O.C. SEND 2. The total time neee is (g 1) ( m ) g + β + (2n + K n 1) m n g + (n+2)β = (gn + n + K n 1) gn m + (g+n+1)β, which is minimize when g = (n+k n 1 ) m 1/2 nβ with a resulting time of m + 2 (n+k n 1 )β n m 1/2 + (n+1)β. 10

13 Algorithm: ARBITRARY SEND 3 Break up the ata into n+1 packets, n of which contain m/ apiece while the other has the remaining nm/ items. Sen the larger packet via an O.C. SEND 4 through the subcube T containing noes 0 an 2 n 1 as opposite corners. Sen each of the other n packets via an X.O.C. SEND 2 through a ifferent one of the n n-imensional subcubes which both run parallel to T an are istance 1 from T (i.e., the subcubes containing noes 2 n+1 through 2 n+1 +2 n 1, 2 n+2 through 2 n+2 +2 n 1,..., 2 1 through n 1). Analysis of ARBITRARY SEND 3 The n+1 separate sens use ifferent links, so they can all be one concurrently with no conflicts. Each of the X.O.C. SEND 2 s of length m/ takes time m + 2 (n+k n 1 )β m 1/2 + (n+1)β n whereas the O.C. SEND 4 of length nm/ requires m + β n = 1 m + 2 (K n 1 1)β m 1/2 + (n 1)β 2 n Hence, the X.O.C. SEND 2 s will be slower than the O.C. SEND 4 whenever n+k n 1 K n 1 1, n which is true by inspection for 1 n 4 an is false for bigger n since n < K n hols for all n 4. What this all means is that the actual time for ARBITRARY SEND works out to m + β = n = 1 m + 2 (n+k n 1 )β m 1/2 + (n+1)β 1 n 4 an n 1 n. m + 2 (K n 1 1)β m 1/2 + (n 1)β 5 n or n = 2 This compares to the neee by [11]. m + 2 (n+1)β m + 2 ( 1)β m 1/2 + (n+1)β n 1 m 1/2 + ( 1)β n = As was the case with O.C. SEND 4, ARBITRARY SEND 3 is only a slight improvement over broacasting. More importantly, it uses only one link between noes, so an arbitrary exchange is possible between two noes in the same amount of time by using the links in the other irection. 11

14 5 Sening Different Messages From One Noe to Every Other Noe The next communication pattern to be looke at is where one noe nees to sen a ifferent message to each of the other noes. This operation has no stanar name (it was referre to as scatter in [11] an personalize communications in [5]), so we will call it istributing. Its ual operation, collecting, where one noe has to receive a message from each of the other noes, is exactly the same operation, only run in reverse. Hence, it suffices to esign an analyze istributing algorithms. Both of these operations are useful in asymmetrical situations where one noe of the hypercube acts as a master processor an the others as its slaves. The master istributes ifferent ata sets to each of the slaves, which in turn perform computations on them. Then the master collects all of the results. Assume without loss of generality that noe 0 is the istributing noe. The following algorithm makes use of O.C. SEND 2, which gets execute on every subcube containing noe 0. Algorithm: DISTRIBUTE 1 There are stages, numbere 0,1,..., 1. Each noe is sent its corresponing message via a separate O.C. SEND 2 applie to the subcube containing it an noe 0 as opposite corner noes. These 2 1 o.c. sens are run concurrently, with batching, an staggere so that the one to C goes first, followe by the ones to C 1, then C 2, etc. Specifically, uring stage k, the ( j) noes in Cj, 0 j k, o their share of the work for the ( ) ( +j k = k j) O.C. SEND 2 s whose estinations are in C +j k. Analysis of DISTRIBUTE 1 The time for each stage epens on the maximal amount of ata being sent out over a single link. For stage k, C j contains ( ( k j) messages of length m to be sent out evenly along its ) ( j ( j) = 1 ) j links to C j+1. This means that each such link sens a packet of length ( k j) m ( 1 j ), which is maximize when j = 0, for ( ) k i ( 1 ) > i ( ) k (i+1) ( 1 ) i+1 hols for all i {0,1,...,k 1}. Consequently, the time spent by noe 0 uring each stage is longer than noes in any other C j, so the total time for the algorithm is [ 1 k=0 ( ) k m + β The istribute algorithm in [11] ha a time of ] = [ 1 k=0 ( )] k m + β = (2 1) m + β. (2 1)m + β 12

15 an [5] s ha a time strictly greater than ours. The time in [5] is ificult to represent, but has the property that for fixe > 1, the coefficient of m is strictly greater than (2 1)/, but it tens to (2 1)/ as approaches. Note that, as was the case with the o.c. sen algorithms, DISTRIBUTE 1 uses links in only one irection, namely towars noe 2 1. Hence, a DISTRIBUTE 1 from noe 2 1 can be one concurrently using the opposite set of links. Also, for m sufficiently large (epening on the values of, β, an ), it is possible to reuce the β coefficient, which represents the number of stages or waves of ata leaving noe 0, by grouping together some of the waves as they leave noe 0 an then breaking them apart in C 1. For example, using = 3, first a wave containing messages for C 2 an C 3 is sent out, taking 4 3m + β time to leave noe 0. When this wave arrives at the noes of C 1, the portion estine for C 3 is sent, taking 1 6 m + β time, an then the portion estine for C 2 is sent, taking 1 2 m + β time. When the portion estine for C 3 reaches the noes of C 2, it is sent on to C 3, taking 1 3m + β time. Meanwhile, the secon wave of messages sent by noe 0 are those estine for C 1, taking m + β time. Messages for C 1 finish arriving at time 7 3m + 2β, messages for C 2 finish at time 2m + 3β, an messages for C 3 finish at time 11 6 m + 3β. If m/3 > β, then all messages arrive by 7 3m+2β, which has improve upon the coefficient of β. This can be shown to be absolutely optimal. This approach is extene in DISTRIBUTE 2. Algorithm: DISTRIBUTE 2 Fix, let k be the smallest integer such that ( 1)k , let r be such that rk 1 r 1 = 2 1. (Since r may be irrational, in practice one may prefer to use some rational r such that r r < 1.) There will be exactly k waves. Wave k will be those messages estine for C 1, where each link from noe 0 carries a message of size m. Wave k 1 starts from noe 0 with packets of size rm along each link, an will contain all of the messages for C 2 an (for large ) portions of each of the messages for C 3, where the portion is chosen to fill the packet size. Wave k 2 will start with packets of size r 2 m, containing the rest of each of the messages for C 3, plus messages for C 4, plus (for sufficiently large ) portions of messages for C 5. Each wave starts with packets r times larger than the following wave, an contains messages estine for a set of further C i s. When a wave reaches the noes of C 1 it is broken into wavelets, one for each of the estination C i in the wave. These wavelets continue on to their estination, ajusting the packet sizes at each step as in DISTRIBUTE 1, but not subiviing into smaller wavelets. Analysis of DISTRIBUTE 2 Since the banwith from C 1 to C 2 is 1 times the banwith from 0 to C 1, an r < 1, for m sufficiently large each wave can be sent on from C 1 before the next wave arrives. The reason m must be sufficiently large is that the breaking into wavelets introuces aitional β terms, but since r is less than 1 there is a slight bit of extra banwith, which can mask the extra start-up for sufficiently large m. As in DISTRIBUTE 1, it can be shown that, for suffiently large m, all wavelets reach their estination by the time the last wave reaches C 1. Therefore the total time is etermine 13

16 by the time it takes noe 0 to sen all k waves, which is (2 1) m + kβ (2 1) m + log 2 β. This algorithm shows that one cannot obtain a lower boun by simply aing the banwith lower boun, which etermines the optimal coefficient of m, to the start-up lower boun which shows that at least β time is neee to move any message across the hypercube. In general one can only take the maximum of these two components as a lower boun, since operations can be overlappe. 6 Completing Hypercube Algorithms Completing a hypercube operation refers to taking an operation centere aroun one noe an then simultaneously performing it on all of the noes. This prouces highly symmetrical communication patterns which utilize all of the available banwith. The simplest way to complete an operation is just to run 2 single-noe operations concurrently. In terms of algorithms, this amounts to using the same number of stages as the single-noe version. During each stage of the complete algorithm, however, each noe oes all the work necessary for the corresponing stage in all of the single noe algorithms. Link conflicts are resolve by batching. That is, grouping together all of the separate packets that have to be sent along a particular link uring the same stage an sening them as one big packet. This also reuces communication overhea (i.e., β terms) consierably. The best single-noe algorithms to complete are usually the simplest versions which still take avantage of the concurrent link capability. Sophisticate techniques such as pipelining an link balancing aren t necessary because the complete operations are so symmetric. The first operation to be complete will be the broacast. This pattern is useful for various matrix operations as well as vector multiplication. Algorithm: COMPLETE BROADCAST Complete BROADCAST 2. There are stages, numbere 0,1,..., 1, an uring stage k, each noe oes its share of the work for the corresponing stage of BROADCAST 2 for all ( k) noes which are istance k from it. Analysis of COMPLETE BROADCAST During stage k of BROADCAST 2, the total amount of ata being sent out is 2 k m/ = 2 k m, so the corresponing amount being sent out in COMPLETE BROADCAST is 2 2 k m. Due to the overall symmetry of COMPLETE BROADCAST, this outgoing ata will be evenly ivie among all 2 links of the hypercube, so each link ens up sening a packet of size 2 k m/. Therefore, the time for the algorithm is ( 1 k=0 2k m + β ) = (2 1) m + β. For purposes of comparison, [11] prouce an optimal complete broacast (which they referre to as a total exchange ) with a running time of (2 + 2 ) m + β. 14

17 The next operation to be complete will be the o.c. sen. A complete o.c. sen, henceforth referre to as an inversion is another funamental communication pattern useful for reversing the orer of ata which is store by noe i.. an for transposing matrices (to be iscusse later). Algorithm: INVERSION (Complete Opposite Corner Sen) Complete O.C. SEND 2 in exactly the same manner as BROADCAST 2 was in COMPLETE BROADCAST. Analysis of INVERSION During each stage of O.C. SEND 2, a total of packets of size m/ were being sent over separate links. Now there are 2 such packets, but there are also that many links an the symmetry of the algorithm guarantees that no more than one packet will be sent along the same link uring a stage. Thus, the time for INVERSION is ientical to O.C. SEND 2: ( m ) + β = m + β. Now consier the ultimate communication pattern, the complete exchange. This is when every noe wants to sen (as well as receive) a ifferent message to (from) each of the other noes. In other wors, it s the same thing as completing the istributing or collecting operations. The complete exchange turns out to be useful for matrix transpositions as well as ranom communication patterns (both to be iscusse later). In [11], complete exchange was calle multigather/scatter. Algorithm: COMPLETE EXCHANGE Just complete DISTRIBUTE in the same manner that O.C. SEND 1 was complete to prouce INVERSION. There are still stages, numbere 0, 1,..., 1. Analysis of COMPLETE EXCHANGE As in INVERSION s analysis, all that has to be etermine is the amount of ata each noe has to pass along each stage. Basically, every noe starts out with (2 1)m items which have to be sent out to the other noes. During stage k, it starts sening messages to the ( k) noes which are istance k away. No messages reach their proper estinations until the last stage, which means that a total of 2 k j=0 ( ) j messages are being worke on uring stage k. Due to the symmetric pattern of the sens, each of the 2 links thus sens a packet of size ( k j) m = j=0 k j=0 ( 1 ) j j m, so the algorithm nees time 1 k=0 ( k 1 ) j j m + β = j=0 1 1 j=0 k=j ( 1 ) j m + β j 15

18 = 1 j=0 ( ) 1 m + β j compare to the neee by [11] s corresponing algorithm. 2 1 m + β = 2 1 m + β, Finally, sometimes a situation arises where each noe wants to sen to one other noe as well as receive from just one noe. This will be terme a permute sen, although in some ways it is analogous to a complete arbitrary sen. An obvious example is an inversion. Another one is when each noe wants to sen to the next higher-numbere noe (mo 2 ), which can be thought of as a rotation. There are 2! such permutations, so etermining the most efficient algorithm for each one seems neither possible nor practical, though recently some papers have appeare analyzing specific permutations [8, 10]. For arbitrary permutatations, however, a eterministic analogue of Valiant s ranomize routing [12, 13] can be employe. It consists of two complete exchanges: one to isperse all the ata evenly throughout the cube an another to collect it all up at the appropriate estinations. We explicitly use the fact that all noes know the permutation being performe so that estination information nee not be sent with the ata. Algorithm: PERMUTED SEND Each noe breaks up its m items into 2 packets of m/2 items apiece. These packets are istribute throughout the cube via a complete exchange so that each noe has one packet from every noe in the cube. Since the communication pattern is a permutation, this also means that each noe has a ifferent packet to sen to every other noe in the cube. As a result, another complete exchange can be use to route all of the packets to their correct estinations. Analysis of PERMUTED SEND There are two complete exchanges, each involving message lengths of m/2 items, so the time neee for PERMUTED SEND is 2 (2 1 m2 ) + β = m + 2β. 7 Matrix Transposition Transposing a matrix in a hypercube is an interesting communication problem which can make goo use of some of the algorithms evelope so far. It has been previously consiere in [9, 11], but faster algorithms will be evelope here. Suppose you want to transpose an N N matrix M store in a -imensional hypercube, with each noe containing N 2 /2 entries. It is necessary to specify exactly how M is store, where the usual ways are either by rows(columns) or as square submatrices. Storage by rows is the easier of the two, so it will be consiere first. 16

19 7.1 Storage by Rows(Columns) There are many ways of storing by rows(columns), where we assume that N is evenly ivisible by 2. For example, the rows may be store by partitioning the rows into blocks of N/2 consecutive rows, where the assignment of blocks to noes may or may not use a Gray coe. Or it may be that a stripe pattern is use, partitioning the rows into sets of rows 2 apart, again with variations possible on how the sets are mappe onto the noes. However, no matter what metho is use to assign rows to noes (as long as the assignment evenly istributes the ata), to perform transposition each noe must sen exactly N 2 /2 2 entries to each other noe. In other wors, a complete exchange has to be performe with a message length of N 2 /2 2. This takes time 2 1 N β = 2 +1N2 + β, as compare to neee by [11]. 2 +1N2 + β 7.2 Storage by Submatrices When store as submatrices, it is convenient to assume that is even, say = 2c, an that N is evenly ivisible by 2 c. Assume that M has been partitione into submatrices M x,y, x,y {0,1,...,2 c 1}, where M x,y is forme by the intersection of rows with columns xn/2 c + 1 through (x+1)n/2 c yn/2 c + 1 through (y+1)n/2 c. Let G enote any permutation of {0,...,2 c 1}, an assume that M x,y is store in noe G(x)2 c + G(y). Typical choices for G inclue the ientity, in which case this is known as row-major orering, or a Gray coe, in which case ajacent submatrices are store in ajacent noes. No matter what G is use, transposition reuces to the problem of noe a2 c + b exchanging its entries with noe b2 c + a, for all a,b {0,1,...,2 c 1}. We provie an algorithm for this operation. First observe that this is a permute sen with message length N 2 /2. Hence, it can be accomplishe by using PERMUTED SEND in time ( ) N β = N2 + β. This is twice as long as when M is store by rows, yet on average, each item moves only half as far. Consequently, it woul not be unreasonable to expect there to be an algorithm which works in half the time. In fact, such an algorithm oes exist, but escribing an analyzing it requires examining the bit patterns of the noe i.. s. Consier noe a2 c + b, where a,b {0,1,...,2 c 1}. In their base two representations, a an b iffer by say, k bits, an agree on the other c k. Now let S a,b enote the set of all noes whose first c bits of their i.. s iffer from their last c bits in exactly the same k positions that a an b o. Observe that S a,b contains 2 k noes. By themselves, they o not form a proper subcube, but something close to one. The istance between any two noes of S a,b is always even, an if there 17

20 A D C B physical link logical link Figure 3: Logical an Physical Links Between two Noes were links between the noes that are istance two apart, then S a,b plus these new connections woul form a k-imensional hypercube. With these thoughts in min, efine a logical link between two hypercube noes A an B, which are istance two apart, to be the four physical links which connect A an B along the two possible paths of length 2 (see Figure 3). Let C an D enote the intermeiate noes connecting A an B. A logically connecte (l.c.) subcube can then be efine to be a subset of noes whose logical links connect them together in a hypercube network. That sai, S a,b is a l.c. subcube. Furthermore, its intermeiate noe i.. s have the property that their first c bits iffer from their last c bits in precisely k 1 positions. Finally, from the efinition of S a,b, every noe in the hypercube belongs to exactly one such l.c. subcube. Combining these last two statements, it becomes apparent that no two such subcubes can share the same physical link. As a consequence, algorithms can be run concurrently on all of the these l.c. subcubes without the possibility of link conflict. Returning to the original problem, noe a2 c + b can exchange ata with noe b2 c + a by simply performing an o.c. exchange in S a,b. In fact, every noe in S a,b can exchange ata with their corresponing noe for the transposition by performing an inversion in S a,b. This brings up the nee then for an inversion algorithm for l.c. subcubes. Inversion in a logically connecte subcube Sening ata along a logical link is equivalent to oing an o.c. sen in a 2-imensional hypercube. Hence, a stanar sen woul take time 2 m + 2 β 2 m1/2 + β to perform, so a l.c. inversion coul be accomplishe by performing a regular INVERSION using these logical sens. For a k-imensional l.c. subcube, this woul require time k ( ) m 2 k + 2 β m 1/2 + β = 2 k 2 m + 2 kβ 2 m1/2 + kβ. 18

21 As long as the number of packets sent out along each physical link in a logical sen is at least two, however, then there is no point in waiting for all of the incoming packets to arrive before starting to pass them along. That is, after all, the whole iea behin pipelining. Algorithm: L.C. INVERSION Perform a regular INVERSION along the logical links of the l.c. subcube, making sure to pipeline the stages together an breaking the ata up into at least four packets so that incoming packets start arriving no later than when the last of the outgoing packets are being sent out. Analysis of L.C. INVERSION Let p enote the number of packets to be sent out along each outgoing physical link (note that p has to be at least 2). Then the packet size for each sen is m/2kp. With pipelining then, it takes a total of kp sens for each noe to pass along its ata, with the intermeiate noes being one sen behin the regular noes. Therefore, kp + 1 sen stages are neee, so the algorithm runs in time which is minimize when ( ) (kp + 1) 2kp m + β, p = 1 k 2β m1/2. This prouces a final time of 2 m + 2 β 2 m1/2 + β. Observe that this time is the same time as neee by INVERSION for a 2-imensional cube. Also, it is inepenent of k so long as m is large enough to insure p 2. Algorithm: MATRIX TRANSPOSITION (store by submatrices) With L.C. INVERSION, transposing M becomes trivial. Just perform it on every l.c. subcube of M. Analysis of MATRIX TRANSPOSITION Since the l.c. inversions can all be one concurrently without overlap, an all take the same amount of time, the transposition is complete in time N β 2 ( ) N 2 1/2 + β = 2 β 2 +1N N + β, which is nearly the same time neee when M was store by rows. This is approximately half of the 2 N2 + ( 1) time require by [9], where it is assume that β is zero. 19

22 8 Histogramming The same techniques use previously can be applie to the problem of histogramming. We consier a simple variant in in which there are m ifferent bins, each noe starts with a value for each of the bins, an the goal is to fin the sum of the values for each bin. The sum for bins im/2 through (i+1)m/2 1 will be in noe i. (If it is esire that all noes contain all sums, then a complete broacast can be use at the en.) We assume that each value an sum is of unit length. Algorithm: HISTOGRAM First consier the algorithm where the ata is exchange one imension at a time, using recursive halving to ecrease the number of subtotals in each noe. During the first stage, noes in the bottom half of the subcube sen up their values for the secon half of the bins to their neighbors in the upper half, which are concurrently sening own their values for the first half of the bins. Each noe as the values receive to its own, an recursively continues on to the next stage. The final HISTOGRAM algorithm is just the symmetrize version of this simple algorithm. Analysis of HISTOGRAM For the unsymmetric algorithm, stage k, 1 k, takes (/2 k )m + β time, so for the symmetric algorithm it takes (/2 k )m + β time. The total time is (1 2 ) m + β. 9 Optimality of the Algorithms As was mentione earlier, the coefficients of the high-orer terms are the least possible. In all cases, a proof can be given base on a simple counting argument. The easiest such approach is pick a subset S of links, show that the total message loa that must be sent over S is at least some amount a, an conclue that at least one link sens at least a/ S an thus, takes at least time oing so. a S m + β For example, in BROADCAST, O.C. SEND, ARBITRARY SEND, DISTRIBUTE, an HISTOGRAM, let S be the outgoing links of noe 0. Then a is m, m, m, (2 1)m, an (1 2 )m, respectively. In the case of COMPLETE BROADCAST, pick S to be the incoming links to noe 0 an set a to (2 1)m, since noe 0 receives a ifferent message from each other broacast. For INVERSION an COMPLETE EXCHANGE, consier S to be the 2 links connecting noes 0,1,...,2 1 1 in the lower subcube L with noes 2 1,2 1 +1,...,2 1 in the upper subcube U. Then a is 2 m an m, respectively, since every noe in L sens one an 2 1 messages, respectively, to the noes in U (an vice versa). PERMUTED SEND is optimal since it is slower than the special case INVERSION by only an aitive β. Finally, for MATRIX TRANSPOSITION, when the 20

23 matrix is store by rows or columns the problem is just complete exchange, which was shown to be optimal. When the ata is store as submatrices, let S be all 2 links. Then a is N 2 /2 since the total istance travele by all messages, each of size N 2 /2, is 2 1. The lower boun for the permutation reflection, where every noe in L exchanges with its corresponing neighbor in U, has the same high-orer term as oes inversion (by the same argument). This occurs espite the fact that reflection is a fixe-point free permuation with the smallest total message istance, while inversion has the greatest total message istance. Given this, an the fact that PERMUTED SEND shows that all permutations can be route with this highest-orer term, one might guess that all fixe-point free permutations require the same highest-orer term. (If permutations with fixe points are consiere, then the ientity can be complete in zero time.) However, it has been shown that some fixe-point free permutations can be route with a highest-orer term smaller than that of reflection [10], an therefore PERMUTED SEND is only worst-case optimal among fixe-point free permutations. Beyon the highest-orer term, we believe that some of the algorithms herein are absolutely optimal. Unfortunately, we have generally been unable prove this because of the ifficulty in fining goo lower bouns which go beyon the highest-orer term. Such bouns must incorporate both banwith consierations an an accounting of start-up times, but, as was note in Section 5, one cannot simply a these components together to obtain a correct lower boun. We can, however, prove absolute optimality for DISTRIBUTE 2 for any, if m is sufficiently large. Notice that at least one of noe 0 s neighbors must receive at least (2 1)m/ items, an all except perhaps m of these items nee to be forware. Therefore it suffices to show that if a noe 0 is connecte to a noe 1, which in turn is connecte to 1 aitional noes, an if noe 0 starts with m items estine for noe 0, an (2 1)m/ m items estine for the aitional noes, then the time neee is at least the time taken by DISTRIBUTE 2. We assume that we have complete freeom in eciing which aitional noe to eliver a specific item to. Without increasing the time, we can alter any algorithm so that the items estine for noe 1 are the last items sent from noe 0. Suppose the first packet to arrive at noe 1 has size p, an the secon packet has size q, an both are estine for the aitional noes. If p < ( 1)q, then the items cannot finish arriving at the aitional noes until time (p + β) + (q + β) + (q/( 1) + β). By moving some of the items from the secon message to the first, creating new messages with lengths p an q, where p = ( 1)q an p +q = p+q, the messages can finish arriving at the aitional noes at time (p + β) + (q + β) + (q /( 1) + β). Since q < q, this is faster. A similar argument applies if p > ( 1)q, an therefore without increasing the time, we can assume that the first packet is 1 times as long as the secon. This argument can be applie inuctively, showing that we may assume that each packet sent from noe 0 is ( 1) as long as the following one. (Temporarily ignore the fact that this argument oes not apply to the last packets, since some of the items in them are not forware). Suppose noe 0 sens k packets, with sizes x( 1) k 1, x( 1) k 2,...,x, where x is such that the sum of the message sizes is (2 1)m/. The time for noe 1 to receive these messages an sen on the items estine for the aitional noes is at least (2 1) m + kβ + (x m) 1 where the last two terms are inclue only if x > m. For fixe,, an β, an sufficiently large m, this is minimize when k = log 1 [1 + ( 2)(2 1)/], which gives the time taken by DISTRIBUTE 2. To be correct, this argument must be moifie to eal with the sizes of the last packets, since the argument showing each packet must be 1 times as long as the following one assume that all items were estine for the aitional noes. An analysis by cases shows that the same time boun hols. + β, 21

24 10 Conclusion We have shown that link-boun hypercubes can make effective use of all of their communication links to perform some common communication-intensive tasks. Since a lower boun for some of these tasks is the time neee to sen out the ata from an originating noe, such tasks woul take longer on more restricte machines in which noes cannot use all of their communication links at one time. Thus our algorithms provie support for the belief that it is useful to buil machines where all communication links can be use simultaneously. By systematically applying a few techniques such as pipelining, symmetrizing, an completing, we were able to evelop a collection of algorithms giving efficient solutions to a wie range of problems. We concentrate on communication problems that are rather funamental, an have not trie to evelop all of their uses. However, we note that several aitional matrix manipulation problems can be solve by our algorithms. For example, if a matrix is store by rows or columns, then switching between blocke an stripe storage, or rotating by a quarter-turn, are all examples of complete exchange. If a matrix is store via submatrices, an the G function use in the assignment is either the ientity or a reflexive Gray coe, then rotation via quarter-turns or half-turns can be accomplishe by algorithms closely relate to MATRIX TRANSPOSITION. Since the initial announcement of our results in [14] an the submission of this paper, aitional papers have appeare which pursue the use of such techniques for matrix problems [6, 8, 9]. These papers inclue experimental results on Intel an Thinking Machines hypercubes, showing that our techniques o inee result in faster message transmission. Though our algorithms are eterministic, this paper has ties to Valiant s work on ranomize routing [12, 13]. He showe that inivisible unit-length messages in a link-boun hypercube coul be route in Θ() expecte time, no matter what the permutation, by routing each message to a ranom intermeiate estination an then on to its original estination. For long ivisible messages an a known permutation (so that heaer information nee not be attache), PERMUTED SEND eliminates the ranom estination by sening a portion of the message to every processor. Further, in [12] he use four ba examples to empirically show the usefulness of ranomization. One of these is equivalent to matrix transposition for a matrix store as submatrices, an the worst one was inversion. MATRIX TRANSPOSITION an INVERSION show that there are efficient eterministic routing schemes for these permutations. Finally, espite the intense interest in hypercube communication [1, 2, 5, 6, 7, 8, 9, 10, 11, 12, 13], still little is known about optimal hypercube performance on communication-intensive tasks such as sorting, routing, ata balancing, atabase operations, an image warping. For example, it is not known if a -imensional hypercube, starting with one item per noe, can sort the items in Θ() worst-case time. Aitional open questions inclue extening analyses to processor-boun an DMA-boun hypercubes, an to problems where the communication pattern is not known in avance an/or the message lengths are not uniform. 22

Lecture 1 September 4, 2013

Lecture 1 September 4, 2013 CS 84r: Incentives an Information in Networks Fall 013 Prof. Yaron Singer Lecture 1 September 4, 013 Scribe: Bo Waggoner 1 Overview In this course we will try to evelop a mathematical unerstaning for the