Proc. of the International Parallel Processing Symposium (IPPS '95), Apr. 1995.

Global Reduction in Wormhole k-ary n-cube Networks with Multidestination Exchange Worms

Dhabaleswar K. Panda
Dept. of Computer and Information Science
The Ohio State University, Columbus, OH

Abstract

This paper presents a new approach to implementing global reduction operations in wormhole k-ary n-cubes. The novelty lies in using a multidestination message-passing mechanism instead of single-destination (unicast) messages. Using pairwise exchange worms along each dimension, it is shown that complete global reduction and barrier synchronization operations, as defined by the Message Passing Interface (MPI) standard, can be implemented with n communication start-ups, compared to the 2n⌈log2 k⌉ start-ups required with unicast-based message passing. Analytical results for different values of communication start-up time, system size, and data size are presented and compared with the unicast-based scheme. The analysis indicates that the proposed framework can be used effectively in wormhole-routed systems to achieve fast global reduction without a separate control network.

1 Introduction

The wormhole-routing switching technique is becoming the trend in building future parallel systems due to its inherent advantages of low-latency communication and reduced communication hardware overhead [5]. The Intel Paragon, Cray T3D, Ncube, J-Machine, and Stanford DASH are representative systems in this category. Such systems with direct interconnections are being used to support either distributed-memory or distributed-shared-memory programming paradigms. To support these paradigms, the systems need fast communication and synchronization support from the underlying network. The Message Passing Interface standard [4] has recently emphasized the importance of collective communication operations.
One important category in this class is global reduction (sum, max, min, or user-defined functions), where all processes of a user-defined group are involved. (This research is supported in part by a National Science Foundation MIP grant.) As defined by the standard, the result of an operation may be available to only one member of the group or to all members, and the operations can be carried out on either scalar or vector data. Barrier synchronization [6] is a special case of this class where there is no data (just an event) and the result is available to all members of the group. In this paper, we consider both reduction and barrier synchronization as a single class of reduction operations.

Many software schemes have recently been proposed in the literature to efficiently implement reduction [1] and barrier synchronization [11] on wormhole-routed systems. All these schemes use multiple phases of point-to-point message passing and incur long latency due to multiple communication start-ups. Systems like the Cray T3D and CM-5 use dedicated tree-based networks to provide fast global reduction and barrier synchronization. However, these schemes are not physically scalable [9]. This raises the question of whether fast reduction and barrier synchronization can be implemented on wormhole-routed direct networks using software message passing with minimal architectural support associated with each router. This would alleviate the need for a separate control network and provide easy scalability as the system size grows.

Traditionally, wormhole-routed systems have supported only point-to-point (unicast) message passing [5]. This mechanism allows a message to have only a single destination. Using unicast send and receive message-passing primitives, reduction and barrier synchronization operations can easily be achieved with a two-step procedure: gather (report) and broadcast. During the gather step, data/information is gathered through phases of upward tree communication. At the end of this step, the reduced data/information is available at a single (root) processor. The second step broadcasts the reduced data to the other processors in multiple phases using downward tree communication. For a k-ary n-cube system with k^n processors, such an approach requires 2n⌈log2 k⌉ communication phases. In current-generation machines, the communication start-up time (t_s) is around 1.0-35.0 microseconds, while the propagation time per flit per hop (t_p) is in the range of 5.0-15.0 nanoseconds. Hence, the latency of a communication phase is dominated by t_s, and the latency of any software-based synchronization scheme becomes proportional to the number of communication phases involved. This makes the cost of global reduction quite high. In [1] it is reported that a reduction operation on 8 bytes of data on a 16x32 Intel Paragon takes around 76 microseconds. This raises the question of whether efficient mechanisms are possible to reduce the overhead of such reduction operations. In this paper, we take up this challenge and propose a new approach to implement fast global reduction in k-ary n-cube wormhole networks.

Recently, we have introduced the concept of a multidestination wormhole mechanism [9, 10]. We used multidestination broadcast worms in [10] to design algorithms for broadcast and multicast operations in wormhole-routed k-ary n-cube systems with reduced latency. In [9], we introduced the concept of a multidestination gather worm and, in conjunction with the broadcast worm, showed how to implement complete and arbitrary barrier synchronization in wormhole-routed systems with reduced latency. In this paper, we introduce the concept of a multidestination exchange worm and demonstrate that complete global reduction operations (including barrier synchronization) can be implemented efficiently with very small latency. The multidestination exchange worms are designed to travel along any single dimension of a k-ary n-cube system in a pair-wise manner.
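As a quick sanity check on the phase counts above, the following sketch (function names are ours, not from the paper) compares the number of communication phases of the two approaches:

```python
import math

def unicast_phases(k: int, n: int) -> int:
    """Phases for the unicast two-step reduction on a k-ary n-cube:
    n*ceil(log2 k) gather phases up the tree plus the same number
    of broadcast phases back down."""
    return 2 * n * math.ceil(math.log2(k))

def exchange_phases(n: int) -> int:
    """Phases for the exchange-worm scheme proposed here: one per dimension."""
    return n

# An 8-ary 3-cube (512 nodes): 18 unicast phases vs. 3 with exchange worms.
```

Since each phase pays the dominant start-up cost t_s, the ratio of these two counts is a first-order estimate of the speedup.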
Each router interface supports a number of fixed-size buffers. A virtual cut-through technique is proposed to ensure deadlock-free movement of these exchange worms. Based on such architectural support, we present algorithms for complete (all processors participating) global reduction. We do not emphasize reduction with an arbitrary number of participating processors because it can be done efficiently using the scheme proposed in [9]. Using exchange worms, it is demonstrated that reduction operations can be implemented with n communication start-ups on k-ary n-cube systems when the data size is smaller than the individual buffer size available at a router. For data sizes larger than the buffer size, we present a pipelined algorithm. Analytical results for the latency of reduction operations are derived and compared with the unicast-based scheme. The proposed scheme is also shown to be superior and scalable compared to the unicast-based scheme for various system sizes, topologies, and data sizes.

The paper is organized as follows. An overview of the multidestination mechanism is presented in section 2. The exchange worm is introduced in section 3. In section 4, we present algorithms for complete global reduction on k-ary n-cube systems. Performance analysis for both schemes is presented in section 5.

2 Multidestination Mechanism

In this section, we provide an overview of the wormhole message-passing mechanism with multiple destinations. The reader is referred to [9, 10] for details. In single-destination (unicast) wormhole message passing, every message consists of a body and a header with the destination number. For a multidestination message, the header consists of multiple destinations and can span multiple flits depending on the encoding of the destinations [2]. The sender node creates the list of destinations as an ordered list, depending on their intended order of traversal, and incorporates it into the header.
Once the worm is injected into the network by the source processor, it is routed in a piece-wise manner from one destination to another. The worm is completely consumed at the last destination. This multidestination scheme is quite general in the sense that unicast messages can always be implemented under it with only one destination.

Such multidestination worms can be designed to have different functionality. For example, a broadcast worm can be used to broadcast/multicast a message to multiple destinations using a single communication start-up [10]. A multidestination broadcast worm uses a forward-and-absorb capability at the router of each destination (except the last one); i.e., the flits are forwarded to an adjacent router as well as copied to the system buffer of the associated processor-router interface. It is to be noted that such a worm is quite powerful and can deliver a message to multiple destinations much faster than using multiple unicast messages. In [10], we have presented broadcast/multicast algorithms using such worms. We have shown that the cost of multicast can in fact be reduced as the number of destinations increases beyond a certain number, depending on the system size, the architectural parameters, and the routing scheme being used. In [9], we introduced the multidestination gather worm. The functionality of such a worm is opposite to that of a broadcast worm. Instead of forwarding and absorbing a flit at the router of every intermediate destination, a gather worm gathers/collects information at the router (supplied by the associated processor) and moves ahead. Using both gather and broadcast worms, we have shown how to implement fast barrier synchronization (complete and arbitrary barriers) in [9]. In the following section, we introduce a new exchange worm type.

In order to route such multidestination messages, we have proposed a Base Routing Conformed Path (BRCP) model in [10]. Figure 1 shows examples of multidestination worms on a 2D mesh with different base routing schemes. In an e-cube system, a multidestination worm can cover a set of destinations in row/column/row-column order. It is to be noted that a set of destinations ordered in a column-row manner would be an invalid path under the BRCP model for an e-cube system. For a planar adaptive system, a multidestination worm can cover a set of destinations along any diagonal, in addition to the flexibility supported by the e-cube system. Such additional paths are shown as bold lines in the figure. If the underlying routing scheme supports the west-first non-minimal turn model, it can provide further flexibility in covering many destinations using a single worm. Hence, this model is quite general and can be used with any routing scheme.

Figure 1: Examples of multidestination broadcast worms under the BRCP model conforming to different base routing schemes (e-cube, planar adaptive, and west-first non-minimal turn model) in a 2-D mesh.

Even the simplest e-cube routing scheme can take advantage of this model by grouping destinations on a row, column, or row-column path. As the adaptivity of the base routing scheme increases, more and more destinations can be covered by a single multidestination worm. There are different ways to encode the addresses in a multidestination worm [2]. The all-destination encoding scheme uses one or more flits per destination.
However, as the number of destinations increases, such encoding makes the header quite long and increases the message size. In contrast, a bit-string encoding scheme can be used to represent each destination by a single bit, which makes the header quite compact. For a k-ary n-cube system with paths along a single dimension (x, y, z, ...), a maximum of k bits is sufficient to encode the destinations. For deterministic/adaptive paths along multiple dimensions, more bits (up to k^n) may be needed. Current-generation systems support channel widths (flit sizes) of 16-32 bits. Hence, a maximum of 1 or 2 flits is sufficient to encode the destinations of a single-dimensional path in a k-ary n-cube system with k ≤ 32. We have used such bit-string encoding in [9] and will be using it in this paper too.

3 Reduction on Linear Array with Exchange Worms

In this section we first provide an overview of the single-directional gather worm, as introduced in [9]. Then we introduce the concept of bi-directional exchange worms and demonstrate how to use them to perform global reduction on a linear array.

3.1 Single-directional Gather Worm

The name of this worm indicates that it gathers information from multiple processors as it propagates. The gather operation can be any associative and commutative function (sum, max, min, or user-defined) as defined under collective communication by the MPI standard [4]. The required reduction operation is assumed to be an element-wise operation on the data packets. Barrier synchronization is a special case of such a reduction operation where there is no data (just an event). We assume that all processors in the system participate in a given reduction operation and that the result is required to be available to all processors. Consider a multidestination gather worm initiated by processor P5, as shown in Fig. 2a. The bit-encoded multidestination address of this worm covers the remaining processors of the array; a typical format for such a worm is shown in Fig. 2b.
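The bit-string encoding just described can be sketched as follows (function names are illustrative, not from the paper); bit i of a k-bit string is set exactly when node i along the dimension is a destination:

```python
def encode_bitstring(dests, k):
    # pack a destination set along one dimension of a k-ary system
    # into a k-bit integer: bit i set <=> node i is a destination
    bits = 0
    for d in dests:
        if not 0 <= d < k:
            raise ValueError("destination outside the dimension")
        bits |= 1 << d
    return bits

def decode_bitstring(bits, k):
    # recover the destination set from the k-bit string
    return [i for i in range(k) if (bits >> i) & 1]

# With k <= 32, the k bits fit into one or two 16/32-bit flits.
```

The header length is thus fixed by k rather than by the number of destinations, which is what keeps the header compact.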
It consists of a message type (single destination vs. multidestination), a unique id reflecting the current reduction operation in the system, a function field indicating the type of reduction operation (max, min, etc.), a bit-string encoded multidestination address, and a packet of b flits of data. The significant benefit of multidestination message passing under the BRCP model comes from the fact that such a message can pass through multiple destinations with the same start-up overhead as sending it to a single destination. Let us consider the movement of this worm, assuming virtual cut-through wormhole routing. Based on the multidestination address, the worm first gets routed to router R4 by R5. After reaching R4, the bit-encoded address is modified (the bit corresponding to the destination just reached is cleared).

Figure 2: Implementing a reduction operation on a linear array with a multidestination gather worm: (a) propagation of the gather worm, (b) its message format, and (c) the necessary router interface organization.

Similar to the concept of registers in the Cray T3D barrier network and buffers in the CM-5 control network, each router interface consists of a set of buffers. Each buffer carries a few bits for the id, a flag sa indicating whether its associated processor has arrived at the gather point during its execution or not, a flag ma indicating whether the message for the corresponding id has arrived or not, a buffer sdata to hold the data supplied by the associated processor, a buffer message to hold the incoming message, and a buffer result to hold the result. A typical router interface organization with s such buffers is shown in Fig. 2c. These buffers are accessed by the associated processor through memory-mapped I/O references. The worm, after arriving at router interface R4, checks whether the flag sa is `on(1)' in the buffer carrying the same id as its own. If this flag is `on', it indicates that the processor has also arrived at its gather execution point and has supplied its data in the buffer sdata. Now the appropriate logic (indicated by the function field of the worm) is activated and operates on sdata and the data portion of the message to produce result. It may also happen that the processor has already supplied its data to the router interface and turned the flag sa `on(1)' before the arrival of the message. In this case, the logic operation starts as soon as the message arrives.
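The router-interface state described above can be sketched as a small structure (a sketch of ours; the field names follow Fig. 2c, and the reduction fires once both the worm and the local data have arrived, in either order):

```python
from dataclasses import dataclass

@dataclass
class ReductionBuffer:
    """One buffer of the router interface in Fig. 2c (memory-mapped
    to the local processor in the paper's design)."""
    id: int            # identifier of the reduction operation
    sa: bool = False   # self arrived: processor reached its gather point
    ma: bool = False   # message arrived for this id
    sdata: int = 0     # data supplied by the local processor
    message: int = 0   # data carried in by the worm
    result: int = 0    # outcome produced by the logic unit

def try_reduce(buf: ReductionBuffer, op) -> bool:
    # the logic unit (selected by the worm's function field) fires only
    # when both flags are set, regardless of which event came first
    if buf.sa and buf.ma:
        buf.result = op(buf.sdata, buf.message)
        return True
    return False
```

Whether the processor deposits sdata before or after the worm arrives, the same try_reduce check covers both orders of arrival.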
Once the logic operation at R4 is over, the worm is forwarded to router R3 while the data of the message is replaced with the result. In this manner, the worm moves ahead step by step, gathering results on its way. Finally, the message is consumed by router R0 and the gathered result is available at P0. The operation of a gather worm traversing k processors on a path can be expressed more formally as follows:

gather[0, k-1] = sdata_0 ⊕ sdata_1 ⊕ ... ⊕ sdata_{k-1}

where sdata_i is the data item(s) associated with processor P_i, the operation ⊕ specifies the required gather function, and gather[0, k-1] is the result gathered by the worm. The result computed at the router interface by the gather worm is also available for the processor to use. Assume the gather worm is initiated by processor P0, and let the result available at processor P_i be result_i. Based on the above operation of the gather worm, result_i can be logically defined as follows:

result_i = sdata_0 ⊕ sdata_1 ⊕ ... ⊕ sdata_i

It is to be noted that result_i is nothing but the prefix computation [4] of the operation ⊕ over the data items sdata associated with the respective processors.

3.2 Bidirectional Exchange Worm

It can be seen that after the gather operation, the intermediate results available at the individual processors are prefix results, and the final result is available only at the last processor. If the final result needs to be available at all processors, a multidestination broadcast worm, as discussed in section 2, can be initiated by the end processor to distribute the final result to all processors on the array. However, this takes an additional communication step. In this section, we propose a new scheme where this second step can be eliminated.
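The prefix behaviour of a single gather worm can be simulated in a few lines (a sketch of ours, under the paper's assumption of an associative and commutative operation, with initiation at P0):

```python
def gather_worm(sdata, op):
    """Simulate a gather worm initiated by P0 sweeping a linear array.
    Each router combines the incoming worm data with its local sdata,
    keeps the combination as its result, and forwards it."""
    results = []
    carry = None
    for x in sdata:
        carry = x if carry is None else op(carry, x)
        results.append(carry)  # result_i = sdata_0 op ... op sdata_i
    return results             # results[-1] is the fully gathered value

# prefix maxima: gather_worm([3, 1, 4, 1, 5], max) -> [3, 3, 4, 4, 5]
```

Only the last entry holds the global result, which is exactly why a second broadcast step, or the exchange worms below, is needed.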
Instead of a single gather worm from one of the end processors, let us consider initiating a pair of gather worms from the two end processors of the linear array. Both worms carry the same id and the same function. We define such a pair of gather worms as exchange worms. Figure 3 shows a pair of positive and negative exchange worms for the linear array example being considered. The positive worm is initiated by P0 and the negative worm by P5; both worms use the message format shown in Fig. 2b. In order to support exchange worms, the router interface shown in Fig. 2c needs minor modification. The single flag ma is replaced by two flags (pma for the positive worm and nma for the negative worm). Similarly, the message field of each buffer is replaced by two fields (pmessage and nmessage).

Figure 3: Implementing a reduction operation on a linear array with a pair of positive and negative exchange worms.

The intermediate processors, after arriving at their reduction point, make their respective sdata available at the router interface, make a copy of it as result, and turn the flag sa `on(1)'. Since the movement of the worms can be asynchronous, the exchange worms can arrive at a given router interface in any order. The logic at the router interface implements the following operations when the respective worms arrive. Since a common logic unit is involved, the operations are sequentialized (the ordering does not matter) when both worms arrive simultaneously. The following operation is invoked when the positive exchange worm (pexchange) arrives at a router interface and the processor's data is available (sa ∧ pma being true):

result_i = result_i ⊕ pexchange_in
pexchange_out = sdata_i ⊕ pexchange_in

where pexchange_in is the data coming into the router interface and pexchange_out is the data going out of the interface. Similarly, the following operation is carried out by the router interface upon arrival of the negative exchange worm (nexchange):

result_i = result_i ⊕ nexchange_in
nexchange_out = sdata_i ⊕ nexchange_in

When the worms reach their respective end destination processors, they are consumed. When both worms have crossed a given router interface, the reduction operation is complete with respect to its associated processor and the result is available to the processor for use. This leads to:

Lemma 1 The operations described above for a pair of exchange worms implement global reduction on a linear array where the result is available to all processors.

It can be seen that the above scheme needs only one communication step to perform complete (all processors participating) global reduction on a linear array. The number of communication steps is independent of the size of the linear array.
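Lemma 1 can be checked with a direct simulation of the two worms (a sketch of ours; op must be associative and commutative, as the paper requires):

```python
def exchange_reduce(sdata, op):
    """Simulate a positive/negative exchange-worm pair on a linear
    array (Fig. 3). Each processor first copies its own data as its
    result; each passing worm folds its carried value into the local
    result and folds the local sdata into what it carries onward."""
    k = len(sdata)
    result = list(sdata)   # result_i initialised to sdata_i
    pos = sdata[0]         # positive worm, initiated by P0
    for i in range(1, k):
        result[i] = op(result[i], pos)  # result_i := result_i op in
        pos = op(sdata[i], pos)         # out := sdata_i op in
    neg = sdata[k - 1]     # negative worm, initiated by P(k-1)
    for i in range(k - 2, -1, -1):
        result[i] = op(result[i], neg)
        neg = op(sdata[i], neg)
    return result  # after both sweeps, every entry is the global result
```

One sweep in each direction, i.e. a single communication step, leaves every processor with the full reduction, regardless of array size.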
Using a tree-based approach with unicast messages, the number of communication steps required is 2⌈log2 k⌉, where k is the size of the linear array. This leads to:

Theorem 1 Global reduction on a linear array of k processors can be implemented in one communication step using a pair of multidestination exchange worms. Compared to the unicast-based scheme, this mechanism reduces the number of communication steps by a factor of 2⌈log2 k⌉.

Besides the communication start-up time, for both the unicast-based and the exchange worm-based scheme, the overall reduction time depends on other parameters like node delay, path length, link propagation delay, and the time to compute the reduction operation at each router interface. In section 5, we take all these parameters into account in the performance analysis.

3.3 Design Considerations

Such gather/exchange worm-based schemes are prone to deadlock if careful consideration is not taken. The problem of deadlock for multiple concurrent barriers using path-based schemes was pointed out in our earlier works [3, 7]. In this paper, we alleviate this deadlock problem by using the virtual cut-through technique. The gather/exchange worms get stored in the buffers at the router interface if they cannot move. For a single reduction operation at a time, two buffers at each router interface are sufficient. It can easily be seen that with k buffers at each router interface, k/2 concurrent reduction operations can be implemented in a given system. The size of each buffer puts a restriction on the maximum data size on which the reduction operation can be carried out. In order to implement reduction on data of larger size, we present a pipelined algorithm in the next section. In section 5, we study the trade-off of various buffer sizes on the overall latency of the reduction operation. The analysis indicates that as t_s continues to decrease, buffers of 8-16 bytes are sufficient to provide faster reduction using exchange worms.
Current-generation systems like the IBM SP1/SP2 already provide 1K bytes of buffer space at each router switch as a central queue. Such an amount of resource can provide 4-8 buffers for our scheme, allowing 2-4 concurrent reduction operations. Hence, our proposed exchange worm-based scheme is technically feasible in current-generation systems.

4 Reduction on k-ary n-cube

In this section, we formulate global reduction algorithms for k-ary n-cube systems.

4.1 Two-dimensional System

Complete reduction can be achieved on a 2D mesh by considering it as a set of linear arrays. Based on the scheme proposed in Fig. 3, reduction is achieved first along all rows in parallel. At the end of this step, each processor has the reduced result of its row. The second step involves a similar exchange operation along the columns in parallel, operating on the reduced results of the first step. It can easily be verified that this two-step scheme implements global reduction over all processors with the final result available to all processors. Figure 4 illustrates this two-step scheme.

Figure 4: Complete global reduction on a 2D mesh in two steps using exchange worms.

In the last section, we assumed fixed-size buffers at the router interface. Let the maximum size of each buffer be b bytes. The above example illustrates that a reduction operation on data smaller than b bytes can be carried out in 2 steps on a 2D mesh. For larger data sizes, however, a pipelined version of this algorithm is needed. Assuming the message size is l bytes (l > b), the message can be broken into ⌈l/b⌉ packets. During every step along a dimension, the end processors, instead of initiating only one packet, initiate ⌈l/b⌉ packets one after another. These packets move in a pipelined manner along the dimension. They are tagged with sequence numbers in addition to the id, and the logic block at a router interface picks up packets with the same sequence number to perform the reduction operation. Hence, for messages with large data size there will be 2⌈l/b⌉ communication steps on a 2D mesh.

4.2 Higher Dimensional Systems

For an n-dimensional system, each step (using a pair of exchange worms) is repeated n times along the different dimensions. For a 3D system, three steps are needed to perform the operation along the x, y, and z dimensions, respectively. It is to be noted that the ordering of these dimensions does not matter because the reduction operations being considered are associative and commutative. For a k-ary n-cube system, the proposed scheme thus uses n communication steps.
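The row-then-column scheme of Fig. 4 can be sketched as follows (names are ours; each call to exchange_1d stands for one parallel exchange-worm step):

```python
def exchange_1d(vals, op):
    # one exchange-worm step on a linear array: every position ends
    # up with the reduction of the whole array (see section 3.2)
    k = len(vals)
    res = list(vals)
    p = vals[0]
    for i in range(1, k):
        res[i] = op(res[i], p)
        p = op(vals[i], p)
    n = vals[-1]
    for i in range(k - 2, -1, -1):
        res[i] = op(res[i], n)
        n = op(vals[i], n)
    return res

def mesh_reduce(grid, op):
    """Two-step complete reduction on a 2D mesh: exchange worms along
    every row in parallel, then along every column on the row results."""
    rows = [exchange_1d(r, op) for r in grid]              # step 1
    cols = [exchange_1d(list(c), op) for c in zip(*rows)]  # step 2
    return [list(r) for r in zip(*cols)]                   # row-major
```

Adding a third loop over the z dimension would give the 3D case; the order of the dimensions is immaterial because op is associative and commutative.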
With the unicast-based scheme, the number of communication steps required in a k-ary n-cube system is 2⌈log2 k^n⌉. This leads to:

Theorem 2 Complete global reduction across the k^n processors of a k-ary n-cube system can be implemented with n⌈l/b⌉ communication steps using multidestination exchange worms, where l and b are the data size and buffer size in bytes, respectively.

If an arbitrary number of processors participate in the operation, the exchange steps cannot always be carried out. If some end processors do not participate, the structure of the problem (the set of participating processors) becomes asymmetric and it becomes difficult to carry the result from one phase to another [7]. Under these circumstances, one can use a two-phase method as suggested in [9].

5 Performance Analysis

In this section, we develop timing models for both the unicast-based and the exchange worm-based scheme and compare them for different system and technological parameters.

5.1 Latency of Reduction Operation

Let us consider the unicast-based scheme first. There are 2n⌈log2 k⌉ communication steps. The upward-tree communication (n⌈log2 k⌉ steps) can be implemented dimension-wise. The steps along a dimension involve hops of distance 1, 2, 4, 8, ..., k/2, respectively. During the upward-tree communication, at the end of each level of communication, data reduction is applied to the entire message. During the broadcast phase, however, there is no reduction involved. This leads to the following latency for implementing reduction on a k-ary n-cube system using the unicast-based scheme:

T_uni = n(2(⌈log2 k⌉ t_s + (k - 1 + ⌈log2 k⌉) t_node + (k - 1 + ⌈log2 k⌉⌈l/w⌉) t_p) + ⌈log2 k⌉ t_comp l)   (1)

where t_s = communication start-up time, t_p = propagation time per link, t_node = node delay, t_comp = time to compute the reduction per element, l = message length in bytes, and w = channel width in bytes.
In the exchange worm-based scheme, the buffer size at the router interface, as discussed in the last section, plays an important role in determining the overall latency. If the data size is less than or equal to the buffer size (l ≤ b), there is no pipelining and only one communication start-up is needed along each dimension. Hence, the overall latency of the non-pipelined exchange worm-based scheme on a k-ary n-cube system is:

T_np_ex = n(t_s + k t_node + (⌈l/w⌉ + 1)(k - 1) t_p + (k - 1) t_comp l)   (2)
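Equations (1) and (2) can be evaluated directly; the sketch below uses our own parameter names (times in seconds, sizes in bytes) and mirrors the two formulas term by term:

```python
import math

def t_uni(k, n, l, ts, tp, tnode, tcomp, w):
    """Unicast latency, eq. (1): 2*ceil(log2 k) phases per dimension,
    with element-wise reduction at every gather level."""
    lk = math.ceil(math.log2(k))
    return n * (2 * (lk * ts
                     + (k - 1 + lk) * tnode
                     + (k - 1 + lk * math.ceil(l / w)) * tp)
                + lk * tcomp * l)

def t_np_ex(k, n, l, ts, tp, tnode, tcomp, w):
    """Non-pipelined exchange-worm latency, eq. (2); valid when the
    data fits in one router buffer (l <= b)."""
    return n * (ts + k * tnode
                + (math.ceil(l / w) + 1) * (k - 1) * tp
                + (k - 1) * tcomp * l)
```

With start-up times on the order of a microsecond, the single start-up per dimension makes t_np_ex the smaller of the two for modest data sizes.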

For larger data sizes, the pipelined algorithm is used. In [8], we have derived the latency of the pipelined algorithm as follows:

T_p_ex = n(⌈l/b⌉ t_max + k t_node + (⌈b/w⌉ + 1)(k - 1) t_p + (k - 1) t_comp b)   (3)

Using the above models, we compared our scheme with the unicast-based scheme. The parameters were chosen to represent the current trend in technology. For the overall evaluation, we assumed the following parameters: t_s = 0.5, 1.0, and 5.0 microsec, t_p = 5.0 nsec, t_node = 2.0 nsec, and t_comp = 15.0 nsec.

5.2 Overall Comparison for 2D Mesh

Figure 5 shows the comparison results for two different mesh sizes. It can be observed that as t_s reduces, the exchange worm-based scheme outperforms the unicast-based scheme over a wider range of data sizes. For a 1-flit data size (corresponding to barrier synchronization), the exchange worm-based scheme can implement a global barrier over the 1024 processors of a 32x32 system in just 4.83 microseconds with t_s = 1.0 microsecond. The buffer size plays an important role in the overall latency of the reduction operation. With high t_s, a smaller buffer size introduces more pipelined start-ups and increases the latency. Hence, a larger buffer size is necessary for systems with high t_s. However, when t_s is low, a configuration with a larger buffer size makes the latency higher because the message propagation time starts to dominate. Hence, a smaller buffer size should be used for systems with lower t_s. A smaller buffer size also leads to a low-cost implementation of the scheme and to the availability of more buffers at a router interface. As t_s keeps decreasing in new-generation systems, these results indicate that the proposed exchange worm-based scheme can implement fast reduction in a cost-effective manner.

5.3 Comparison for 3D Mesh

Figure 6 compares the exchange worm-based scheme with the unicast-based scheme for a 10x10x10 system with various t_s.
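Equation (3) and the buffer-size trade-off can be explored numerically (names are ours; t_max is the per-packet start-up cost taken from [8]):

```python
import math

def t_p_ex(k, n, l, b, tmax, tp, tnode, tcomp, w):
    """Pipelined exchange-worm latency, eq. (3): the l-byte message is
    split into ceil(l/b) packets of at most b bytes that stream through
    each dimension back to back."""
    return n * (math.ceil(l / b) * tmax + k * tnode
                + (math.ceil(b / w) + 1) * (k - 1) * tp
                + (k - 1) * tcomp * b)

def best_buffer(k, n, l, tmax, tp, tnode, tcomp, w, sizes=(8, 16, 32)):
    # pick the buffer size minimising eq. (3): with a high start-up cost
    # fewer, larger packets win; with a low one, smaller buffers win
    return min(sizes,
               key=lambda b: t_p_ex(k, n, l, b, tmax, tp, tnode, tcomp, w))
```

Scanning a few candidate buffer sizes this way reproduces the qualitative trend reported here: the optimal buffer shrinks as the start-up cost shrinks.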
It can be observed that as t_s reduces, the exchange worm-based scheme demonstrates consistent superiority for all data sizes. With t_s = 0.5 microsec, it takes just 2.78 microsec to barrier-synchronize 1K processors using the exchange worm-based scheme, compared to a considerably higher latency with the unicast-based scheme. Likewise, with t_s = 0.5 microsec, a global reduction on 512 bytes of data can be implemented in 55.0 microsec using exchange worms, again well below the time needed by the unicast-based scheme. As t_s reduces, a buffer size of 16 bytes appears to be optimal for all data sizes.

5.4 Scalability with System Size

We studied the latency of the reduction operation for data sizes of 32 and 256 bytes on various system configurations (4x4, 8x8, 16x16, and 32x32) to analyze scalability. It was observed [8] that, with an appropriate buffer size, the exchange worm-based scheme continues to perform the reduction operation at lower cost as the system size increases, for both the smaller (32 bytes) and the larger (256 bytes) data size. Such performance improvement demonstrates scalability with respect to increases in system and data size.

6 Conclusions and Future Research

In this paper we have presented a new approach to implement fast and scalable global reduction in k-ary n-cube wormhole systems using multidestination exchange worms. The necessary architectural support and design modifications to the router interface are presented. Algorithms are developed and evaluated for reduction operations on various data sizes. It is shown that only n⌈l/b⌉ communication steps are needed with exchange worms to implement complete reduction on l bytes of data with buffers of size b bytes at each router interface. Compared to the unicast-based scheme, the proposed framework demonstrates an asymptotic reduction in communication steps by a factor of 2⌈log2 k⌉ for smaller data sizes.
Analytical results indicate that the proposed scheme is far superior and more scalable than the unicast-based scheme for a wide variety of system topologies, system sizes, data sizes, and values of t_s. The performance of the scheme is shown to be sensitive to the buffer size available at the router interfaces. As t_s continues to decrease in current- and future-generation systems, a few buffers of 8-16 bytes each are sufficient to take advantage of the proposed scheme. Hence, this framework is quite suitable for current- and future-generation wormhole systems. In this paper, we have studied the use of the multidestination mechanism for the global reduction operation. We are extending our work to other collective communication patterns like scatter, complete-exchange, and parallel-prefix. As the system size grows, the proposed scheme encounters more delay due to increased path length. We are working on alternative schemes to reduce the impact of path length.

Figure 5: Comparison of unicast-based and exchange worm-based schemes to implement global reduction on 2D meshes with different data size, system size, communication start-up time, and buffer size (8, 16, and 32 bytes). Panels: 16x16 mesh (ts = 5.0 microsec), 16x16 mesh (ts = 1.0 microsec), and 32x32 mesh (ts = 1.0 microsec).

Figure 6: Comparison of unicast-based and exchange worm-based schemes to implement global reduction on a 10x10x10 mesh with different data size, communication start-up time, and buffer size. Panels: ts = 5.0, 1.0, and 0.5 microsec.

References

[1] M. Barnett, S. Gupta, D. G. Payne, L. Shuler, R. van de Geijn, and J. Watts. Interprocessor Collective Communication Library (InterCom). In Scalable High Performance Computing Conference, pages 357-364.
[2] C.-M. Chiang and L. M. Ni. Multi-Address Encoding for Multicast. In Proceedings of the Parallel Computer Routing and Communication Workshop, pages 146-160, May.
[3] S. K. S. Gupta and D. K. Panda. Barrier Synchronization in Distributed-Memory Multiprocessors using Rendezvous Primitives. In Proceedings of the International Parallel Processing Symposium, pages 51-56, April.
[4] Message Passing Interface Forum. MPI: A Message-Passing Interface Standard, Mar.
[5] L. Ni and P. K. McKinley. A Survey of Wormhole Routing Techniques in Direct Networks. IEEE Computer, pages 62-76, Feb.
[6] M. T. O'Keefe and H. G. Dietz. Hardware Barrier Synchronization: Static Barrier MIMD (SBM). In Proceedings of the International Conference on Parallel Processing, pages I:35-42, Aug.
[7] D. K. Panda. Optimal Phase Barrier Synchronization in k-ary n-cube Wormhole-routed Systems using Multirendezvous Primitives. In Workshop on Fine-Grain Massively Parallel Coordination, pages 24-26, May.
[8] D. K. Panda. Global Reduction in Wormhole k-ary n-cube Networks with Multidestination Exchange Worms. Technical Report OSU-CISRC-8/94-TR53.
[9] D. K. Panda. Fast Barrier Synchronization in Wormhole k-ary n-cube Networks with Multidestination Worms. In International Symposium on High Performance Computer Architecture, pages 200-209.
[10] D. K. Panda, S. Singal, and P. Prabhakaran. Multidestination Message Passing Mechanism Conforming to Base Wormhole Routing Scheme. In Proceedings of the Parallel Computer Routing and Communication Workshop, pages 131-145.
[11] H. Xu, P. K. McKinley, and L. Ni. Efficient Implementation of Barrier Synchronization in Wormhole-routed Hypercube Multicomputers. Journal of Parallel and Distributed Computing, 16:172-184, 1992.


More information

Technical Report No On the Power of Arrays with. Recongurable Optical Buses CANADA. Abstract

Technical Report No On the Power of Arrays with. Recongurable Optical Buses CANADA. Abstract Technical Report No. 95-374 On the Power of Arrays with Recongurable Optical Buses Sandy Pavel, Selim G. Akl Department of Computing and Information Science Queen's University, Kingston, Ontario, K7L 3N6

More information

Programming with Message Passing PART I: Basics. HPC Fall 2012 Prof. Robert van Engelen

Programming with Message Passing PART I: Basics. HPC Fall 2012 Prof. Robert van Engelen Programming with Message Passing PART I: Basics HPC Fall 2012 Prof. Robert van Engelen Overview Communicating processes MPMD and SPMD Point-to-point communications Send and receive Synchronous, blocking,

More information

COMMUNICATION IN HYPERCUBES

COMMUNICATION IN HYPERCUBES PARALLEL AND DISTRIBUTED ALGORITHMS BY DEBDEEP MUKHOPADHYAY AND ABHISHEK SOMANI http://cse.iitkgp.ac.in/~debdeep/courses_iitkgp/palgo/index.htm COMMUNICATION IN HYPERCUBES 2 1 OVERVIEW Parallel Sum (Reduction)

More information

Network Properties, Scalability and Requirements For Parallel Processing. Communication assist (CA)

Network Properties, Scalability and Requirements For Parallel Processing. Communication assist (CA) Network Properties, Scalability and Requirements For Parallel Processing Scalable Parallel Performance: Continue to achieve good parallel performance "speedup"as the sizes of the system/problem are increased.

More information

TDT Appendix E Interconnection Networks

TDT Appendix E Interconnection Networks TDT 4260 Appendix E Interconnection Networks Review Advantages of a snooping coherency protocol? Disadvantages of a snooping coherency protocol? Advantages of a directory coherency protocol? Disadvantages

More information

Concurrent/Parallel Processing

Concurrent/Parallel Processing Concurrent/Parallel Processing David May: April 9, 2014 Introduction The idea of using a collection of interconnected processing devices is not new. Before the emergence of the modern stored program computer,

More information

TASK FLOW GRAPH MAPPING TO "ABUNDANT" CLIQUE PARALLEL EXECUTION GRAPH CLUSTERING PARALLEL EXECUTION GRAPH MAPPING TO MAPPING HEURISTIC "LIMITED"

TASK FLOW GRAPH MAPPING TO ABUNDANT CLIQUE PARALLEL EXECUTION GRAPH CLUSTERING PARALLEL EXECUTION GRAPH MAPPING TO MAPPING HEURISTIC LIMITED Parallel Processing Letters c World Scientic Publishing Company FUNCTIONAL ALGORITHM SIMULATION OF THE FAST MULTIPOLE METHOD: ARCHITECTURAL IMPLICATIONS MARIOS D. DIKAIAKOS Departments of Astronomy and

More information

Request Network Reply Network CPU L1 Cache L2 Cache STU Directory Memory L1 cache size unlimited L1 write buer 8 lines L2 cache size unlimited L2 outs

Request Network Reply Network CPU L1 Cache L2 Cache STU Directory Memory L1 cache size unlimited L1 write buer 8 lines L2 cache size unlimited L2 outs Evaluation of Communication Mechanisms in Invalidate-based Shared Memory Multiprocessors Gregory T. Byrd and Michael J. Flynn Computer Systems Laboratory Stanford University, Stanford, CA Abstract. Producer-initiated

More information

[ 7.2.5] Certain challenges arise in realizing SAS or messagepassing programming models. Two of these are input-buffer overflow and fetch deadlock.

[ 7.2.5] Certain challenges arise in realizing SAS or messagepassing programming models. Two of these are input-buffer overflow and fetch deadlock. Buffering roblems [ 7.2.5] Certain challenges arise in realizing SAS or messagepassing programming models. Two of these are input-buffer overflow and fetch deadlock. Input-buffer overflow Suppose a large

More information

New Fault Tolerant Multicast Routing Techniques to Enhance Distributed-Memory Systems Performance

New Fault Tolerant Multicast Routing Techniques to Enhance Distributed-Memory Systems Performance The University of Southern Mississippi The Aquila Digital Community Dissertations Fall 12-2013 New Fault Tolerant Multicast Routing Techniques to Enhance Distributed-Memory Systems Performance Masoud Esmail

More information

Message-Passing Programming with MPI

Message-Passing Programming with MPI Message-Passing Programming with MPI Message-Passing Concepts David Henty d.henty@epcc.ed.ac.uk EPCC, University of Edinburgh Overview This lecture will cover message passing model SPMD communication modes

More information

Basic Low Level Concepts

Basic Low Level Concepts Course Outline Basic Low Level Concepts Case Studies Operation through multiple switches: Topologies & Routing v Direct, indirect, regular, irregular Formal models and analysis for deadlock and livelock

More information

Analysis of Matrix Multiplication Computational Methods

Analysis of Matrix Multiplication Computational Methods European Journal of Scientific Research ISSN 1450-216X / 1450-202X Vol.121 No.3, 2014, pp.258-266 http://www.europeanjournalofscientificresearch.com Analysis of Matrix Multiplication Computational Methods

More information

Estimate the Routing Protocols for Internet of Things

Estimate the Routing Protocols for Internet of Things Estimate the Routing Protocols for Internet of Things 1 Manjushree G, 2 Jayanthi M.G 1,2 Dept. of Computer Network and Engineering Cambridge Institute of Technology Bangalore, India Abstract Internet of

More information

A Novel Energy Efficient Source Routing for Mesh NoCs

A Novel Energy Efficient Source Routing for Mesh NoCs 2014 Fourth International Conference on Advances in Computing and Communications A ovel Energy Efficient Source Routing for Mesh ocs Meril Rani John, Reenu James, John Jose, Elizabeth Isaac, Jobin K. Antony

More information

Lecture 18: Communication Models and Architectures: Interconnection Networks

Lecture 18: Communication Models and Architectures: Interconnection Networks Design & Co-design of Embedded Systems Lecture 18: Communication Models and Architectures: Interconnection Networks Sharif University of Technology Computer Engineering g Dept. Winter-Spring 2008 Mehdi

More information

Eect of fan-out on the Performance of a. Single-message cancellation scheme. Atul Prakash (Contact Author) Gwo-baw Wu. Seema Jetli

Eect of fan-out on the Performance of a. Single-message cancellation scheme. Atul Prakash (Contact Author) Gwo-baw Wu. Seema Jetli Eect of fan-out on the Performance of a Single-message cancellation scheme Atul Prakash (Contact Author) Gwo-baw Wu Seema Jetli Department of Electrical Engineering and Computer Science University of Michigan,

More information

Deadlock and Livelock. Maurizio Palesi

Deadlock and Livelock. Maurizio Palesi Deadlock and Livelock 1 Deadlock (When?) Deadlock can occur in an interconnection network, when a group of packets cannot make progress, because they are waiting on each other to release resource (buffers,

More information

Parallel Architectures

Parallel Architectures Parallel Architectures CPS343 Parallel and High Performance Computing Spring 2018 CPS343 (Parallel and HPC) Parallel Architectures Spring 2018 1 / 36 Outline 1 Parallel Computer Classification Flynn s

More information

BARP-A Dynamic Routing Protocol for Balanced Distribution of Traffic in NoCs

BARP-A Dynamic Routing Protocol for Balanced Distribution of Traffic in NoCs -A Dynamic Routing Protocol for Balanced Distribution of Traffic in NoCs Pejman Lotfi-Kamran, Masoud Daneshtalab *, Caro Lucas, and Zainalabedin Navabi School of Electrical and Computer Engineering, The

More information