Proc. of the International Parallel Processing Symposium (IPPS '95), Apr. 1995.

Global Reduction in Wormhole k-ary n-cube Networks with Multidestination Exchange Worms

Dhabaleswar K. Panda
Dept. of Computer and Information Science
The Ohio State University, Columbus, OH

Abstract

This paper presents a new approach to implementing global reduction operations in wormhole k-ary n-cubes. The novelty lies in using a multidestination message-passing mechanism instead of single-destination (unicast) messages. Using pairwise exchange worms along each dimension, it is shown that complete global reduction and barrier synchronization operations, as defined by the Message Passing Interface (MPI) standard, can be implemented with n communication start-ups, compared to the 2n⌈log2 k⌉ start-ups required with unicast-based message passing. Analytical results for different values of communication start-up time, system size, and data size are presented and compared with the unicast-based scheme. The analysis indicates that the proposed framework can be used effectively in wormhole-routed systems to achieve fast global reduction without a separate control network.

1 Introduction

The wormhole-routing switching technique is becoming the trend in building future parallel systems due to its inherent advantages of low-latency communication and reduced communication hardware overhead [5]. The Intel Paragon, Cray T3D, Ncube, J-Machine, and Stanford DASH are representative systems in this category. Such systems with direct interconnections are being used to support either distributed-memory or distributed-shared-memory programming paradigms. To support these paradigms, the systems need fast communication and synchronization support from the underlying network. The Message Passing Interface standard [4] has recently emphasized the importance of collective communication operations.
One important category in this class is global reduction (sum, max, min, or user-defined functions), where all processes of a user-defined group are involved. (This research is supported in part by a National Science Foundation MIP grant.) As defined by the standard, the result of an operation may be available to only one member of the group or to all members, and the operations can be carried out on either scalar or vector data. Barrier synchronization [6] is a special case of this class where there is no data (just an event) and the result is available to all members of the group. In this paper, we consider both reduction and barrier synchronization as a single class of reduction operations.

Many software schemes have recently been proposed in the literature to efficiently implement reduction [1] and barrier synchronization [11] on wormhole-routed systems. All these schemes use multiple phases of point-to-point message passing and incur long latency due to multiple communication start-ups. Systems like the Cray T3D and CM-5 use dedicated tree-based networks to provide fast global reduction and barrier synchronization. However, these schemes are not physically scalable [9]. This raises the question of whether fast reduction and barrier synchronization can be implemented on wormhole-routed direct networks using software message passing with minimal architectural support associated with each router. This would alleviate the need for a separate control network and provide easy scalability as the system size grows.

Traditionally, wormhole-routed systems have supported only point-to-point (unicast) message passing [5]. This mechanism allows a message to have only a single destination. Using unicast send and receive message-passing primitives, reduction and barrier synchronization operations can easily be achieved with a two-step procedure: gather (report) and broadcast. During the gather step, data/information is gathered through phases of upward tree communication. At the end of this step, the reduced data/information is available at a single (root) processor. The second step broadcasts the reduced data to the other processors in multiple phases using downward tree communication. For a k-ary n-cube system with k^n processors, such an approach requires 2n⌈log2 k⌉ communication phases. In current-generation machines, the communication start-up time (t_s) is around 1.0-35.0 microseconds, while the propagation time per flit per hop (t_p) is in the range of 5.0-15.0 nanoseconds. Hence, the latency of a communication phase is dominated by t_s, and the latency of any software-based synchronization scheme becomes proportional to the number of communication phases involved. This makes the cost of global reduction quite high. In [1] it is reported that a reduction operation on 8 bytes of data on a 16x32 Intel Paragon takes around 76 microseconds. This raises the question of whether efficient mechanisms are possible to reduce the overhead of such reduction operations. In this paper, we take up this challenge and propose a new approach to implement fast global reduction in k-ary n-cube wormhole networks.

Recently, we have introduced the concept of a multidestination wormhole mechanism [9, 10]. We used multidestination broadcast worms in [10] to design algorithms for broadcast and multicast operations in wormhole-routed k-ary n-cube systems with reduced latency. In [9], we introduced the concept of a multidestination gather worm and, in conjunction with the broadcast worm, showed how to implement complete and arbitrary barrier synchronization in wormhole-routed systems with reduced latency. In this paper, we introduce the concept of a multidestination exchange worm and demonstrate that complete global reduction operations (including barrier synchronization) can be implemented efficiently with very small latency. The multidestination exchange worms are designed to travel along any single dimension of a k-ary n-cube system in a pair-wise manner.
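As a quick sanity check on the phase counts above, the following sketch (function names are ours, not from the paper) compares the number of communication phases of the two approaches:

```python
import math

def unicast_phases(k: int, n: int) -> int:
    """Phases for the unicast two-step reduction on a k-ary n-cube:
    n*ceil(log2 k) gather phases up the tree plus the same number
    of broadcast phases back down."""
    return 2 * n * math.ceil(math.log2(k))

def exchange_phases(n: int) -> int:
    """Phases for the exchange-worm scheme proposed here: one per dimension."""
    return n

# An 8-ary 3-cube (512 nodes): 18 unicast phases vs. 3 with exchange worms.
```

Since each phase pays the dominant start-up cost t_s, the ratio of these two counts is a first-order estimate of the speedup.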
Each router interface supports a number of fixed-size buffers. A virtual cut-through technique is proposed to ensure deadlock-free movement of these exchange worms. Based on such architectural support, we present algorithms for complete (all processors participating) global reduction. We do not emphasize reduction with an arbitrary number of participating processors because it can be done efficiently using the scheme proposed in [9]. Using exchange worms, it is demonstrated that reduction operations can be implemented with n communication start-ups on k-ary n-cube systems when the data size is smaller than the individual buffer size available at a router. For data sizes larger than the buffer size, we present a pipelined algorithm. Analytical results for the latency of reduction operations are derived and compared with the unicast-based scheme. The proposed scheme is also shown to be superior and scalable compared to the unicast-based scheme for various system sizes, topologies, and data sizes.

The paper is organized as follows. An overview of the multidestination mechanism is presented in section 2. The exchange worm is introduced in section 3. In section 4, we present algorithms for complete global reduction on k-ary n-cube systems. Performance analysis for both schemes is presented in section 5.

2 Multidestination Mechanism

In this section, we provide an overview of the wormhole message-passing mechanism with multiple destinations. The reader is referred to [9, 10] for details. In single-destination (unicast) wormhole message passing, every message consists of a body and a header with the destination number. For a multidestination message, the header consists of multiple destinations and can span multiple flits depending on the encoding of the destinations [2]. The sender node creates the list of destinations as an ordered list, depending on their intended order of traversal, and incorporates it into the header.
Once the worm is injected into the network by the source processor, it is routed in a piece-wise manner from one destination to another. The worm is completely consumed at the last destination. This multidestination scheme is quite general in the sense that unicast messages can always be implemented under it with only one destination.

Such multidestination worms can be designed to have different functionality. For example, a broadcast worm can be used to broadcast/multicast a message to multiple destinations using a single communication start-up [10]. A multidestination broadcast worm uses a forward-and-absorb capability at the router of each destination (except the last one); i.e., the flits are forwarded to an adjacent router as well as copied to the system buffer of the associated processor-router interface. It is to be noted that such a worm is quite powerful and can deliver a message to multiple destinations much faster than using multiple unicast messages. In [10], we have presented broadcast/multicast algorithms using such worms. We have shown that the cost of multicast can in fact be reduced as the number of destinations increases beyond a certain number, depending on the system size, the architectural parameters, and the routing scheme being used. In [9], we introduced the multidestination gather worm. The functionality of such a worm is opposite to that of a broadcast worm. Instead of forwarding and absorbing a flit at the router of every intermediate destination, a gather worm gathers/collects information at the router (supplied by the associated processor) and moves ahead. Using both gather and broadcast worms, we have shown how to implement fast barrier synchronization (complete and arbitrary barriers) in [9]. In the following section, we introduce a new exchange worm type.

In order to route such multidestination messages, we have proposed a Base Routing Conformed Path (BRCP) model in [10]. Figure 1 shows examples of multidestination worms on a 2D mesh with different base routing schemes. In an e-cube system, a multidestination worm can cover a set of destinations in row/column/row-column order. It is to be noted that a set of destinations ordered in a column-row manner would be an invalid path under the BRCP model for an e-cube system. For a planar adaptive system, a multidestination worm can cover a set of destinations along any diagonal, in addition to the flexibility supported by the e-cube system. Such additional paths are shown as bold lines in the figure. If the underlying routing scheme supports the west-first non-minimal turn model, it can provide further flexibility in covering many destinations using a single worm. Hence, this model is quite general and can be used with any routing scheme.

Figure 1: Examples of multidestination broadcast worms under the BRCP model conforming to different base routing schemes (e-cube, planar adaptive, and west-first non-minimal turn model) in a 2-D mesh.

Even the simplest e-cube routing scheme can take advantage of this model by grouping destinations on a row, column, or row-column path. As the adaptivity of the base routing scheme increases, more and more destinations can be covered by a single multidestination worm. There are different ways to encode the addresses in a multidestination worm [2]. The all-destination encoding scheme uses one or more flits per destination.
However, as the number of destinations increases, such encoding makes the header quite long and increases the message size. In contrast, a bit-string encoding scheme can be used to represent each destination by a single bit, which makes the header quite compact. For a k-ary n-cube system with paths along a single dimension (x, y, z, ...), a maximum of k bits is sufficient to encode the destinations. For deterministic/adaptive paths along multiple dimensions, more bits (up to k^n) may be needed. Current-generation systems support channel widths (flit sizes) of 16-32 bits. Hence, a maximum of 1 or 2 flits is sufficient to encode the destinations of a single-dimensional path in a k-ary n-cube system with k ≤ 32. We have used such bit-string encoding in [9] and will be using it in this paper too.

3 Reduction on Linear Array with Exchange Worms

In this section we first provide an overview of the single-directional gather worm, as introduced in [9]. Then we introduce the concept of bi-directional exchange worms and demonstrate how to use them to perform global reduction on a linear array.

3.1 Single-directional Gather Worm

The name of this worm indicates that it gathers information from multiple processors as it propagates. The gather operation can be any associative and commutative function (sum, max, min, or user-defined) as defined under collective communication by the MPI standard [4]. The required reduction operation is assumed to be an element-wise operation on the data packets. Barrier synchronization is a special case of such a reduction operation where there is no data (just an event). We assume that all processors in the system participate in a given reduction operation and that the result is required to be available to all processors. Consider a multidestination gather worm initiated by processor P5, as shown in Fig. 2a. The bit-encoded multidestination address of this worm covers the remaining processors of the array; a typical format for such a worm is shown in Fig. 2b.
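The bit-string encoding just described can be sketched as follows (function names are illustrative, not from the paper); bit i of a k-bit string is set exactly when node i along the dimension is a destination:

```python
def encode_bitstring(dests, k):
    # pack a destination set along one dimension of a k-ary system
    # into a k-bit integer: bit i set <=> node i is a destination
    bits = 0
    for d in dests:
        if not 0 <= d < k:
            raise ValueError("destination outside the dimension")
        bits |= 1 << d
    return bits

def decode_bitstring(bits, k):
    # recover the destination set from the k-bit string
    return [i for i in range(k) if (bits >> i) & 1]

# With k <= 32, the k bits fit into one or two 16/32-bit flits.
```

The header length is thus fixed by k rather than by the number of destinations, which is what keeps the header compact.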
It consists of a message type (single destination vs. multidestination), a unique id reflecting the current reduction operation in the system, a function field indicating the type of reduction operation (max, min, etc.), a bit-string encoded multidestination address, and a packet of b flits of data. The significant benefit of multidestination message passing under the BRCP model comes from the fact that such a message can pass through multiple destinations with the same start-up overhead as sending it to a single destination. Let us consider the movement of this worm, assuming virtual cut-through wormhole routing. Based on the multidestination address, the worm first gets routed to router R4 by R5. After reaching R4, the bit-encoded address is modified (the bit corresponding to the destination just reached is cleared).

Figure 2: Implementing a reduction operation on a linear array with a multidestination gather worm: (a) propagation of the gather worm, (b) its message format, and (c) the necessary router interface organization.

Similar to the concept of registers in the Cray T3D barrier network and buffers in the CM-5 control network, each router interface consists of a set of buffers. Each buffer carries a few bits for the id, a flag sa indicating whether its associated processor has arrived at the gather point during its execution or not, a flag ma indicating whether the message for the corresponding id has arrived or not, a buffer sdata to hold the data supplied by the associated processor, a buffer message to hold the incoming message, and a buffer result to hold the result. A typical router interface organization with s such buffers is shown in Fig. 2c. These buffers are accessed by the associated processor through memory-mapped I/O references. The worm, after arriving at router interface R4, checks whether the flag sa is `on(1)' in the buffer carrying the same id as its own. If this flag is `on', it indicates that the processor has also arrived at its gather execution point and has supplied its data in the buffer sdata. Now the appropriate logic (indicated by the function field of the worm) is activated and operates on sdata and the data portion of the message to produce result. It may also happen that the processor has already supplied its data to the router interface and turned the flag sa `on(1)' before the arrival of the message. In this case, the logic operation starts as soon as the message arrives.
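The router-interface state described above can be sketched as a small structure (a sketch of ours; the field names follow Fig. 2c, and the reduction fires once both the worm and the local data have arrived, in either order):

```python
from dataclasses import dataclass

@dataclass
class ReductionBuffer:
    """One buffer of the router interface in Fig. 2c (memory-mapped
    to the local processor in the paper's design)."""
    id: int            # identifier of the reduction operation
    sa: bool = False   # self arrived: processor reached its gather point
    ma: bool = False   # message arrived for this id
    sdata: int = 0     # data supplied by the local processor
    message: int = 0   # data carried in by the worm
    result: int = 0    # outcome produced by the logic unit

def try_reduce(buf: ReductionBuffer, op) -> bool:
    # the logic unit (selected by the worm's function field) fires only
    # when both flags are set, regardless of which event came first
    if buf.sa and buf.ma:
        buf.result = op(buf.sdata, buf.message)
        return True
    return False
```

Whether the processor deposits sdata before or after the worm arrives, the same try_reduce check covers both orders of arrival.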
Once the logic operation at R4 is over, the worm is forwarded to router R3 while the data of the message is replaced with the result. In this manner, the worm moves ahead step by step, gathering results on its way. Finally, the message is consumed by router R0 and the gathered result is available at P0. The operation of a gather worm traversing k processors on a path can be expressed more formally as follows:

gather[0, k-1] = sdata_0 ⊕ sdata_1 ⊕ ... ⊕ sdata_{k-1}

where sdata_i is the data item(s) associated with processor P_i, the operation ⊕ specifies the required gather function, and gather[0, k-1] is the result gathered by the worm. The result computed at the router interface by the gather worm is also available for the processor to use. Assume the gather worm is initiated by processor P0, and let the result available at processor P_i be result_i. Based on the above operation of the gather worm, result_i can be logically defined as follows:

result_i = sdata_0 ⊕ sdata_1 ⊕ ... ⊕ sdata_i

It is to be noted that result_i is nothing but the prefix computation [4] of the operation ⊕ over the data items sdata associated with the respective processors.

3.2 Bidirectional Exchange Worm

It can be seen that after the gather operation, the intermediate results available at the individual processors are prefix results, and the final result is available only at the last processor. If the final result needs to be available at all processors, a multidestination broadcast worm, as discussed in section 2, can be initiated by the end processor to distribute the final result to all processors on the array. However, this takes an additional communication step. In this section, we propose a new scheme where this second step can be eliminated.
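The prefix behaviour of a single gather worm can be simulated in a few lines (a sketch of ours, under the paper's assumption of an associative and commutative operation, with initiation at P0):

```python
def gather_worm(sdata, op):
    """Simulate a gather worm initiated by P0 sweeping a linear array.
    Each router combines the incoming worm data with its local sdata,
    keeps the combination as its result, and forwards it."""
    results = []
    carry = None
    for x in sdata:
        carry = x if carry is None else op(carry, x)
        results.append(carry)  # result_i = sdata_0 op ... op sdata_i
    return results             # results[-1] is the fully gathered value

# prefix maxima: gather_worm([3, 1, 4, 1, 5], max) -> [3, 3, 4, 4, 5]
```

Only the last entry holds the global result, which is exactly why a second broadcast step, or the exchange worms below, is needed.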
Instead of a single gather worm from one of the end processors, let us consider initiating a pair of gather worms from the two end processors of the linear array. Both worms carry the same id and the same function. We define such a pair of gather worms as exchange worms. Figure 3 shows a pair of positive and negative exchange worms for the linear array example being considered. The positive worm is initiated by P0 and the negative worm by P5; both worms use the message format shown in Fig. 2b. In order to support exchange worms, the router interface shown in Fig. 2c needs minor modification. The single flag ma is replaced by two flags (pma for the positive worm and nma for the negative worm). Similarly, the message field of each buffer is replaced by two fields (pmessage and nmessage).

Figure 3: Implementing a reduction operation on a linear array with a pair of positive and negative exchange worms.

The intermediate processors, after arriving at their reduction point, make their respective sdata available at the router interface, make a copy of it as result, and turn the flag sa `on(1)'. Since the movement of the worms can be asynchronous, the exchange worms can arrive at a given router interface in any order. The logic at the router interface implements the following operations when the respective worms arrive. Since a common logic unit is involved, the operations are sequentialized (the ordering does not matter) when both worms arrive simultaneously. The following operation is invoked when the positive exchange worm (pexchange) arrives at a router interface and the processor's data is available (sa ∧ pma being true):

result_i = result_i ⊕ pexchange_in
pexchange_out = sdata_i ⊕ pexchange_in

where pexchange_in is the data coming into the router interface and pexchange_out is the data going out of the interface. Similarly, the following operation is carried out by the router interface upon arrival of the negative exchange worm (nexchange):

result_i = result_i ⊕ nexchange_in
nexchange_out = sdata_i ⊕ nexchange_in

When the worms reach their respective end destination processors, they are consumed. When both worms have crossed a given router interface, the reduction operation is complete with respect to its associated processor and the result is available to the processor for use. This leads to:

Lemma 1 The operations described above for a pair of exchange worms implement global reduction on a linear array where the result is available to all processors.

It can be seen that the above scheme needs only one communication step to perform complete (all processors participating) global reduction on a linear array. The number of communication steps is independent of the size of the linear array.
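Lemma 1 can be checked with a direct simulation of the two worms (a sketch of ours; op must be associative and commutative, as the paper requires):

```python
def exchange_reduce(sdata, op):
    """Simulate a positive/negative exchange-worm pair on a linear
    array (Fig. 3). Each processor first copies its own data as its
    result; each passing worm folds its carried value into the local
    result and folds the local sdata into what it carries onward."""
    k = len(sdata)
    result = list(sdata)   # result_i initialised to sdata_i
    pos = sdata[0]         # positive worm, initiated by P0
    for i in range(1, k):
        result[i] = op(result[i], pos)  # result_i := result_i op in
        pos = op(sdata[i], pos)         # out := sdata_i op in
    neg = sdata[k - 1]     # negative worm, initiated by P(k-1)
    for i in range(k - 2, -1, -1):
        result[i] = op(result[i], neg)
        neg = op(sdata[i], neg)
    return result  # after both sweeps, every entry is the global result
```

One sweep in each direction, i.e. a single communication step, leaves every processor with the full reduction, regardless of array size.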
Using a tree-based approach with unicast messages, the number of communication steps required is 2⌈log2 k⌉, where k is the size of the linear array. This leads to:

Theorem 1 Global reduction on a linear array of k processors can be implemented in one communication step using a pair of multidestination exchange worms. Compared to the unicast-based scheme, this mechanism reduces the number of communication steps by a factor of 2⌈log2 k⌉.

Besides the communication start-up time, for both the unicast-based and the exchange worm-based scheme, the overall reduction time depends on other parameters like node delay, path length, link propagation delay, and the time to compute the reduction operation at each router interface. In section 5, we take all these parameters into account in the performance analysis.

3.3 Design Considerations

Such gather/exchange worm-based schemes are prone to deadlock if careful consideration is not taken. The problem of deadlock for multiple concurrent barriers using path-based schemes was pointed out in our earlier works [3, 7]. In this paper, we alleviate this deadlock problem by using the virtual cut-through technique. The gather/exchange worms get stored in the buffers at the router interface if they cannot move. For a single reduction operation at a time, two buffers at each router interface are sufficient. It can easily be seen that with k buffers at each router interface, k/2 concurrent reduction operations can be implemented in a given system. The size of each buffer puts a restriction on the maximum data size on which the reduction operation can be carried out. In order to implement reduction on data of larger size, we present a pipelined algorithm in the next section. In section 5, we study the trade-off of various buffer sizes on the overall latency of the reduction operation. The analysis indicates that as t_s continues to decrease, buffers of 8-16 bytes are sufficient to provide faster reduction using exchange worms.
Current-generation systems like the IBM SP1/SP2 already provide 1K bytes of buffer space at each router switch as a central queue. Such an amount of resource can provide 4-8 buffers for our scheme, allowing 2-4 concurrent reduction operations. Hence, our proposed exchange worm-based scheme is technically feasible in current-generation systems.

4 Reduction on k-ary n-cube

In this section, we formulate global reduction algorithms for k-ary n-cube systems.

4.1 Two-dimensional System

Complete reduction can be achieved on a 2D mesh by considering it as a set of linear arrays. Based on the scheme proposed in Fig. 3, reduction is achieved first along all rows in parallel. At the end of this step, each processor has the reduced result of its row. The second step involves a similar exchange operation along the columns in parallel, operating on the reduced results of the first step. It can easily be verified that this two-step scheme implements global reduction over all processors with the final result available to all processors. Figure 4 illustrates this two-step scheme.

Figure 4: Complete global reduction on a 2D mesh in two steps using exchange worms.

In the last section, we assumed fixed-size buffers at the router interface. Let the maximum size of each buffer be b bytes. The above example illustrates that a reduction operation on data smaller than b bytes can be carried out in 2 steps on a 2D mesh. For larger data sizes, however, a pipelined version of this algorithm is needed. Assuming the message size is l bytes (l > b), the message can be broken into ⌈l/b⌉ packets. During every step along a dimension, the end processors, instead of initiating only one packet, initiate ⌈l/b⌉ packets one after another. These packets move in a pipelined manner along the dimension. They are tagged with sequence numbers in addition to the id, and the logic block at a router interface picks up packets with the same sequence number to perform the reduction operation. Hence, for messages with large data size there will be 2⌈l/b⌉ communication steps on a 2D mesh.

4.2 Higher Dimensional Systems

For an n-dimensional system, each step (using a pair of exchange worms) is repeated n times along the different dimensions. For a 3D system, three steps are needed to perform the operation along the x, y, and z dimensions, respectively. It is to be noted that the ordering of these dimensions does not matter because the reduction operations being considered are associative and commutative. For a k-ary n-cube system, the proposed scheme thus uses n communication steps.
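The row-then-column scheme of Fig. 4 can be sketched as follows (names are ours; each call to exchange_1d stands for one parallel exchange-worm step):

```python
def exchange_1d(vals, op):
    # one exchange-worm step on a linear array: every position ends
    # up with the reduction of the whole array (see section 3.2)
    k = len(vals)
    res = list(vals)
    p = vals[0]
    for i in range(1, k):
        res[i] = op(res[i], p)
        p = op(vals[i], p)
    n = vals[-1]
    for i in range(k - 2, -1, -1):
        res[i] = op(res[i], n)
        n = op(vals[i], n)
    return res

def mesh_reduce(grid, op):
    """Two-step complete reduction on a 2D mesh: exchange worms along
    every row in parallel, then along every column on the row results."""
    rows = [exchange_1d(r, op) for r in grid]              # step 1
    cols = [exchange_1d(list(c), op) for c in zip(*rows)]  # step 2
    return [list(r) for r in zip(*cols)]                   # row-major
```

Adding a third loop over the z dimension would give the 3D case; the order of the dimensions is immaterial because op is associative and commutative.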
With the unicast-based scheme, the number of communication steps required in a k-ary n-cube system is 2⌈log2 k^n⌉. This leads to:

Theorem 2 Complete global reduction across the k^n processors of a k-ary n-cube system can be implemented with n⌈l/b⌉ communication steps using multidestination exchange worms, where l and b are the data size and buffer size in bytes, respectively.

If an arbitrary number of processors participate in the operation, the exchange steps cannot always be carried out. If some end processors do not participate, the structure of the problem (the set of participating processors) becomes asymmetric and it becomes difficult to carry the result from one phase to another [7]. Under these circumstances, one can use a two-phase method as suggested in [9].

5 Performance Analysis

In this section, we develop timing models for both the unicast-based and the exchange worm-based scheme and compare them for different system and technological parameters.

5.1 Latency of Reduction Operation

Let us consider the unicast-based scheme first. There are 2n⌈log2 k⌉ communication steps. The upward-tree communication (n⌈log2 k⌉ steps) can be implemented dimension-wise. The steps along a dimension involve hops of distance 1, 2, 4, 8, ..., k/2, respectively. During the upward-tree communication, at the end of each level of communication, data reduction is applied to the entire message. During the broadcast phase, however, there is no reduction involved. This leads to the following latency for implementing reduction on a k-ary n-cube system using the unicast-based scheme:

T_uni = n(2(⌈log2 k⌉ t_s + (k - 1 + ⌈log2 k⌉) t_node + (k - 1 + ⌈log2 k⌉⌈l/w⌉) t_p) + ⌈log2 k⌉ t_comp l)   (1)

where t_s = communication start-up time, t_p = propagation time per link, t_node = node delay, t_comp = time to compute the reduction per element, l = message length in bytes, and w = channel width in bytes.
In the exchange worm-based scheme, the buffer size at the router interface, as discussed in the last section, plays an important role in determining the overall latency. If the data size is less than or equal to the buffer size (l ≤ b), there is no pipelining and only one communication start-up is needed along each dimension. Hence, the overall latency of the non-pipelined exchange worm-based scheme on a k-ary n-cube system is:

T_np_ex = n(t_s + k t_node + (⌈l/w⌉ + 1)(k - 1) t_p + (k - 1) t_comp l)   (2)
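Equations (1) and (2) can be evaluated directly; the sketch below uses our own parameter names (times in seconds, sizes in bytes) and mirrors the two formulas term by term:

```python
import math

def t_uni(k, n, l, ts, tp, tnode, tcomp, w):
    """Unicast latency, eq. (1): 2*ceil(log2 k) phases per dimension,
    with element-wise reduction at every gather level."""
    lk = math.ceil(math.log2(k))
    return n * (2 * (lk * ts
                     + (k - 1 + lk) * tnode
                     + (k - 1 + lk * math.ceil(l / w)) * tp)
                + lk * tcomp * l)

def t_np_ex(k, n, l, ts, tp, tnode, tcomp, w):
    """Non-pipelined exchange-worm latency, eq. (2); valid when the
    data fits in one router buffer (l <= b)."""
    return n * (ts + k * tnode
                + (math.ceil(l / w) + 1) * (k - 1) * tp
                + (k - 1) * tcomp * l)
```

With start-up times on the order of a microsecond, the single start-up per dimension makes t_np_ex the smaller of the two for modest data sizes.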

For larger data sizes, the pipelined algorithm is used. In [8], we have derived the latency of the pipelined algorithm as follows:

T_p_ex = n(⌈l/b⌉ t_max + k t_node + (⌈b/w⌉ + 1)(k - 1) t_p + (k - 1) t_comp b)   (3)

Using the above models, we compared our scheme with the unicast-based scheme. The parameters were chosen to represent the current trend in technology. For the overall evaluation, we assumed the following parameters: t_s = 0.5, 1.0, and 5.0 microsec, t_p = 5.0 nsec, t_node = 2.0 nsec, and t_comp = 15.0 nsec.

5.2 Overall Comparison for 2D Mesh

Figure 5 shows the comparison results for two different mesh sizes. It can be observed that as t_s reduces, the exchange worm-based scheme outperforms the unicast-based scheme over a wider range of data sizes. For a 1-flit data size (corresponding to barrier synchronization), the exchange worm-based scheme can implement a global barrier over the 1024 processors of a 32x32 system in just 4.83 microseconds with t_s = 1.0 microsecond. The buffer size plays an important role in the overall latency of the reduction operation. With high t_s, a smaller buffer size introduces more pipelined start-ups and increases the latency. Hence, a larger buffer size is necessary for systems with high t_s. However, when t_s is low, a configuration with a larger buffer size makes the latency higher because the message propagation time starts to dominate. Hence, a smaller buffer size should be used for systems with lower t_s. A smaller buffer size also leads to a low-cost implementation of the scheme and to the availability of more buffers at a router interface. As t_s keeps decreasing in new-generation systems, these results indicate that the proposed exchange worm-based scheme can implement fast reduction in a cost-effective manner.

5.3 Comparison for 3D Mesh

Figure 6 compares the exchange worm-based scheme with the unicast-based scheme for a 10x10x10 system with various t_s.
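Equation (3) and the buffer-size trade-off can be explored numerically (names are ours; t_max is the per-packet start-up cost taken from [8]):

```python
import math

def t_p_ex(k, n, l, b, tmax, tp, tnode, tcomp, w):
    """Pipelined exchange-worm latency, eq. (3): the l-byte message is
    split into ceil(l/b) packets of at most b bytes that stream through
    each dimension back to back."""
    return n * (math.ceil(l / b) * tmax + k * tnode
                + (math.ceil(b / w) + 1) * (k - 1) * tp
                + (k - 1) * tcomp * b)

def best_buffer(k, n, l, tmax, tp, tnode, tcomp, w, sizes=(8, 16, 32)):
    # pick the buffer size minimising eq. (3): with a high start-up cost
    # fewer, larger packets win; with a low one, smaller buffers win
    return min(sizes,
               key=lambda b: t_p_ex(k, n, l, b, tmax, tp, tnode, tcomp, w))
```

Scanning a few candidate buffer sizes this way reproduces the qualitative trend reported here: the optimal buffer shrinks as the start-up cost shrinks.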
It can be observed that as t_s reduces, the exchange worm-based scheme demonstrates consistent superiority for all data sizes. With t_s = 0.5 microsec, it takes just 2.78 microsec to barrier-synchronize 1K processors using the exchange worm-based scheme, compared to a considerably higher latency with the unicast-based scheme. Likewise, with t_s = 0.5 microsec, a global reduction on 512 bytes of data can be implemented in 55.0 microsec using exchange worms, again well below the time needed by the unicast-based scheme. As t_s reduces, a buffer size of 16 bytes appears to be optimal for all data sizes.

5.4 Scalability with System Size

We studied the latency of the reduction operation for data sizes of 32 and 256 bytes on various system configurations (4x4, 8x8, 16x16, and 32x32) to analyze scalability. It was observed [8] that, with an appropriate buffer size, the exchange worm-based scheme continues to perform the reduction operation at lower cost as the system size increases, for both the smaller (32 bytes) and the larger (256 bytes) data size. Such performance improvement demonstrates scalability with respect to increases in system and data size.

6 Conclusions and Future Research

In this paper we have presented a new approach to implement fast and scalable global reduction in k-ary n-cube wormhole systems using multidestination exchange worms. The necessary architectural support and design modifications to the router interface are presented. Algorithms are developed and evaluated for reduction operations on various data sizes. It is shown that only n⌈l/b⌉ communication steps are needed with exchange worms to implement complete reduction on l bytes of data with buffers of size b bytes at each router interface. Compared to the unicast-based scheme, the proposed framework demonstrates an asymptotic reduction in communication steps by a factor of 2⌈log2 k⌉ for smaller data sizes.
Analytical results indicate that the proposed scheme is far superior and more scalable than the unicast-based scheme for a wide variety of system topologies, system sizes, data sizes, and values of t_s. The performance of the scheme is shown to be sensitive to the buffer size available at the router interfaces. As t_s continues to decrease in current- and future-generation systems, a few buffers of 8-16 bytes each are sufficient to take advantage of the proposed scheme. Hence, this framework is quite suitable for current- and future-generation wormhole systems. In this paper, we have studied the use of the multidestination mechanism for the global reduction operation. We are extending our work to other collective communication patterns like scatter, complete-exchange, and parallel-prefix. As the system size grows, the proposed scheme encounters more delay due to increased path length. We are working on alternative schemes to reduce the impact of path length.

Figure 5: Comparison of unicast-based and exchange worm-based schemes to implement global reduction on 2D meshes with different data size, system size, communication start-up time, and buffer size (8, 16, and 32 bytes). Panels: 16x16 mesh (ts = 5.0 microsec), 16x16 mesh (ts = 1.0 microsec), and 32x32 mesh (ts = 1.0 microsec).

Figure 6: Comparison of unicast-based and exchange worm-based schemes to implement global reduction on a 10x10x10 mesh with different data size, communication start-up time, and buffer size. Panels: ts = 5.0, 1.0, and 0.5 microsec.

References

[1] M. Barnett, S. Gupta, D. G. Payne, L. Shuler, R. van de Geijn, and J. Watts. Interprocessor Collective Communication Library (InterCom). In Scalable High Performance Computing Conference, pages 357-364.
[2] C.-M. Chiang and L. M. Ni. Multi-Address Encoding for Multicast. In Proceedings of the Parallel Computer Routing and Communication Workshop, pages 146-160, May.
[3] S. K. S. Gupta and D. K. Panda. Barrier Synchronization in Distributed-Memory Multiprocessors using Rendezvous Primitives. In Proceedings of the International Parallel Processing Symposium, pages 51-56, April.
[4] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, Mar.
[5] L. Ni and P. K. McKinley. A Survey of Wormhole Routing Techniques in Direct Networks. IEEE Computer, pages 62-76, Feb.
[6] M. T. O'Keefe and H. G. Dietz. Hardware Barrier Synchronization: Static Barrier MIMD (SBM). In Proceedings of the International Conference on Parallel Processing, pages I:35-42, Aug.
[7] D. K. Panda. Optimal Phase Barrier Synchronization in k-ary n-cube Wormhole-routed Systems using Multirendezvous Primitives. In Workshop on Fine-Grain Massively Parallel Coordination, pages 24-26, May.
[8] D. K. Panda. Global Reduction in Wormhole k-ary n-cube Networks with Multidestination Exchange Worms. Technical Report OSU-CISRC-8/94-TR53.
[9] D. K. Panda. Fast Barrier Synchronization in Wormhole k-ary n-cube Networks with Multidestination Worms. In International Symposium on High Performance Computer Architecture, pages 200-209.
[10] D. K. Panda, S. Singal, and P. Prabhakaran. Multidestination Message Passing Mechanism Conforming to Base Wormhole Routing Scheme. In Proceedings of the Parallel Computer Routing and Communication Workshop, pages 131-145.
[11] H. Xu, P. K. McKinley, and L. Ni. Efficient Implementation of Barrier Synchronization in Wormhole-routed Hypercube Multicomputers. Journal of Parallel and Distributed Computing, 16:172-184, 1992.


More information

Technical Report No On the Power of Arrays with. Recongurable Optical Buses CANADA. Abstract

Technical Report No On the Power of Arrays with. Recongurable Optical Buses CANADA. Abstract Technical Report No. 95-374 On the Power of Arrays with Recongurable Optical Buses Sandy Pavel, Selim G. Akl Department of Computing and Information Science Queen's University, Kingston, Ontario, K7L 3N6

More information

Programming with Message Passing PART I: Basics. HPC Fall 2012 Prof. Robert van Engelen

Programming with Message Passing PART I: Basics. HPC Fall 2012 Prof. Robert van Engelen Programming with Message Passing PART I: Basics HPC Fall 2012 Prof. Robert van Engelen Overview Communicating processes MPMD and SPMD Point-to-point communications Send and receive Synchronous, blocking,

More information

COMMUNICATION IN HYPERCUBES

COMMUNICATION IN HYPERCUBES PARALLEL AND DISTRIBUTED ALGORITHMS BY DEBDEEP MUKHOPADHYAY AND ABHISHEK SOMANI http://cse.iitkgp.ac.in/~debdeep/courses_iitkgp/palgo/index.htm COMMUNICATION IN HYPERCUBES 2 1 OVERVIEW Parallel Sum (Reduction)

More information

Network Properties, Scalability and Requirements For Parallel Processing. Communication assist (CA)

Network Properties, Scalability and Requirements For Parallel Processing. Communication assist (CA) Network Properties, Scalability and Requirements For Parallel Processing Scalable Parallel Performance: Continue to achieve good parallel performance "speedup"as the sizes of the system/problem are increased.

More information

TDT Appendix E Interconnection Networks

TDT Appendix E Interconnection Networks TDT 4260 Appendix E Interconnection Networks Review Advantages of a snooping coherency protocol? Disadvantages of a snooping coherency protocol? Advantages of a directory coherency protocol? Disadvantages

More information

Concurrent/Parallel Processing

Concurrent/Parallel Processing Concurrent/Parallel Processing David May: April 9, 2014 Introduction The idea of using a collection of interconnected processing devices is not new. Before the emergence of the modern stored program computer,

More information

TASK FLOW GRAPH MAPPING TO "ABUNDANT" CLIQUE PARALLEL EXECUTION GRAPH CLUSTERING PARALLEL EXECUTION GRAPH MAPPING TO MAPPING HEURISTIC "LIMITED"

TASK FLOW GRAPH MAPPING TO ABUNDANT CLIQUE PARALLEL EXECUTION GRAPH CLUSTERING PARALLEL EXECUTION GRAPH MAPPING TO MAPPING HEURISTIC LIMITED Parallel Processing Letters c World Scientic Publishing Company FUNCTIONAL ALGORITHM SIMULATION OF THE FAST MULTIPOLE METHOD: ARCHITECTURAL IMPLICATIONS MARIOS D. DIKAIAKOS Departments of Astronomy and

More information

Request Network Reply Network CPU L1 Cache L2 Cache STU Directory Memory L1 cache size unlimited L1 write buer 8 lines L2 cache size unlimited L2 outs

Request Network Reply Network CPU L1 Cache L2 Cache STU Directory Memory L1 cache size unlimited L1 write buer 8 lines L2 cache size unlimited L2 outs Evaluation of Communication Mechanisms in Invalidate-based Shared Memory Multiprocessors Gregory T. Byrd and Michael J. Flynn Computer Systems Laboratory Stanford University, Stanford, CA Abstract. Producer-initiated

More information

[ 7.2.5] Certain challenges arise in realizing SAS or messagepassing programming models. Two of these are input-buffer overflow and fetch deadlock.

[ 7.2.5] Certain challenges arise in realizing SAS or messagepassing programming models. Two of these are input-buffer overflow and fetch deadlock. Buffering roblems [ 7.2.5] Certain challenges arise in realizing SAS or messagepassing programming models. Two of these are input-buffer overflow and fetch deadlock. Input-buffer overflow Suppose a large

More information

New Fault Tolerant Multicast Routing Techniques to Enhance Distributed-Memory Systems Performance

New Fault Tolerant Multicast Routing Techniques to Enhance Distributed-Memory Systems Performance The University of Southern Mississippi The Aquila Digital Community Dissertations Fall 12-2013 New Fault Tolerant Multicast Routing Techniques to Enhance Distributed-Memory Systems Performance Masoud Esmail

More information

Message-Passing Programming with MPI

Message-Passing Programming with MPI Message-Passing Programming with MPI Message-Passing Concepts David Henty d.henty@epcc.ed.ac.uk EPCC, University of Edinburgh Overview This lecture will cover message passing model SPMD communication modes

More information

Basic Low Level Concepts

Basic Low Level Concepts Course Outline Basic Low Level Concepts Case Studies Operation through multiple switches: Topologies & Routing v Direct, indirect, regular, irregular Formal models and analysis for deadlock and livelock

More information

Analysis of Matrix Multiplication Computational Methods

Analysis of Matrix Multiplication Computational Methods European Journal of Scientific Research ISSN 1450-216X / 1450-202X Vol.121 No.3, 2014, pp.258-266 http://www.europeanjournalofscientificresearch.com Analysis of Matrix Multiplication Computational Methods

More information

Estimate the Routing Protocols for Internet of Things

Estimate the Routing Protocols for Internet of Things Estimate the Routing Protocols for Internet of Things 1 Manjushree G, 2 Jayanthi M.G 1,2 Dept. of Computer Network and Engineering Cambridge Institute of Technology Bangalore, India Abstract Internet of

More information

A Novel Energy Efficient Source Routing for Mesh NoCs

A Novel Energy Efficient Source Routing for Mesh NoCs 2014 Fourth International Conference on Advances in Computing and Communications A ovel Energy Efficient Source Routing for Mesh ocs Meril Rani John, Reenu James, John Jose, Elizabeth Isaac, Jobin K. Antony

More information

Lecture 18: Communication Models and Architectures: Interconnection Networks

Lecture 18: Communication Models and Architectures: Interconnection Networks Design & Co-design of Embedded Systems Lecture 18: Communication Models and Architectures: Interconnection Networks Sharif University of Technology Computer Engineering g Dept. Winter-Spring 2008 Mehdi

More information

Eect of fan-out on the Performance of a. Single-message cancellation scheme. Atul Prakash (Contact Author) Gwo-baw Wu. Seema Jetli

Eect of fan-out on the Performance of a. Single-message cancellation scheme. Atul Prakash (Contact Author) Gwo-baw Wu. Seema Jetli Eect of fan-out on the Performance of a Single-message cancellation scheme Atul Prakash (Contact Author) Gwo-baw Wu Seema Jetli Department of Electrical Engineering and Computer Science University of Michigan,

More information

Deadlock and Livelock. Maurizio Palesi

Deadlock and Livelock. Maurizio Palesi Deadlock and Livelock 1 Deadlock (When?) Deadlock can occur in an interconnection network, when a group of packets cannot make progress, because they are waiting on each other to release resource (buffers,

More information

Parallel Architectures

Parallel Architectures Parallel Architectures CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Parallel Architectures Spring 2018 1 / 36 Outline 1 Parallel Computer Classification Flynn s

More information

BARP-A Dynamic Routing Protocol for Balanced Distribution of Traffic in NoCs

BARP-A Dynamic Routing Protocol for Balanced Distribution of Traffic in NoCs -A Dynamic Routing Protocol for Balanced Distribution of Traffic in NoCs Pejman Lotfi-Kamran, Masoud Daneshtalab *, Caro Lucas, and Zainalabedin Navabi School of Electrical and Computer Engineering, The

More information