Message-Ordering for Wormhole-Routed Multiport Systems with Link Contention and Routing Adaptivity


In Scalable High Performance Computing Conference.

Dhabaleswar K. Panda and Vibha A. Dixit-Radiya
Department of Computer and Information Science, The Ohio State University, Columbus, OH

Abstract

This paper analyzes the impact of message-ordering among the outgoing messages from a sender to multiple receivers (called multicasts) on the completion time of a program for wormhole-routed distributed-memory systems. In most existing systems, the messages of a multicast are sent as separate unicast messages by the source processor itself. We study how best to order a set of outgoing messages by taking into account message criticality and architectural issues including link contention, multiple ports, and adaptivity in routing. First, the simple algorithm of Dikaiakos et al. [8] is extended to obtain a static algorithm for non-fully-connected systems. Next, a dynamic message-ordering algorithm is proposed which works for any number of ports and takes advantage of routing adaptivity. Simulation results on random task graphs show improvement in completion time by 34% for static and 44% for dynamic, over naive sequential message-ordering.

1 Introduction

One of the major problems in software development for parallel systems is the mapping problem. It is defined as the allocation of the set of tasks of a parallel application onto available processors to obtain minimal program completion time [2, 6].
The mapping problem, being NP-hard, is usually solved by solving its subproblems, viz., (a) clustering: grouping of tasks for a bounded number of processors, (b) assignment: one-to-one mapping of task clusters onto processors with a specific topology, (c) intra-processor scheduling: ordering of ready tasks allocated to the same processor, and (d) message-ordering: ordering between outgoing messages from a sender to multiple receiver processors. There has been extensive work in the literature on developing near-optimal heuristics for subproblems (a) and (b) above. Even though subproblems (c) and (d) also significantly affect program completion time, they have not received any attention in the mapping literature. We have solved the clustering and intra-processor scheduling problems in [7] and the assignment problem in [6]. In this paper, we address the message-ordering problem.

(This research is supported in part by the National Science Foundation Grant # MIP.)

Logical communication between tasks in a parallel program can be categorized as either unicast (a source sending a message to one destination) or multicast (a source sending messages to multiple destinations at the same time). A broadcast is a special case of multicast. Multicasts are further classified into personalized (different data going to each destination) and non-personalized (the same data going to different processors). Personalized multicasts are common in scientific and numerical computations using scattering techniques. Sophisticated multicasting schemes using path-based routing [13] and unicast-based schemes [12] have been shown to be efficient for non-personalized multicast. However, personalized multicasts cannot take advantage of these schemes. Hence, for systems without support for these schemes and for personalized multicasts, there is no alternative but to send messages from the source node to the destination processors as a collection of unicast messages.
In such cases, the order in which these messages are sent from the source has a direct impact on the program completion time.

Wormhole routing is becoming increasingly popular as a switching technique for building massively parallel systems due to its inherent advantages, such as low-latency communication and reduced communication hardware overhead [13]. In addition to basic wormhole routing, systems are gradually incorporating multiple communication ports and routing schemes with varying adaptivity. Intel Paragon [10], Cray T3D [4], and Stanford DASH [11] are some early representative systems in this trend. These systems provide low-latency communication when the traffic in the system is low. However, as communication traffic increases, messages undergo severe link contention and the system starts performing poorly. Similarly, when a single processor sends multicast messages, they encounter port contention at the router of the sender node, depending on the routing strategy used by the system. Such contention may increase the completion time of a program significantly. This increase can be reduced by determining an effective message-ordering strategy. Though commercial wormhole systems are becoming available, there is no study in the literature about the interplay between multiple ports and routing adaptivity and their impact on determining a suitable message-ordering strategy. In this paper, we take such an approach in determining an effective message-ordering strategy for adaptive wormhole systems with multiple ports.

While mapping an application to a distributed-memory system, Dikaiakos et al. [8] have shown that the completion time of a program can be reduced if multicast messages are ordered using a Latest Start Time (LST) strategy. However, this message-ordering strategy is determined by considering the system to be a fully-connected architecture. Hence, the ordering does not necessarily provide the best completion time when the program is executed on a non-fully-connected architecture. In our previous study of mapping applications onto distributed-memory systems [6, 7], we have used LST-based strategies for clustering, task assignment, and intra-processor scheduling. In this paper, we first extend the strategy proposed by Dikaiakos et al. to non-fully-connected architectures and evaluate its performance against the sequential message-ordering strategy. Then we propose new dynamic message-ordering algorithms to take advantage of adaptivity and multiple ports.
We analyze the performance of these algorithms under two different models of communication start-up: concurrent and skewed. We study the effectiveness of these algorithms against the sequential message-ordering scheme for random task graphs with varying computation-to-communication characteristics and for systems with varying routing adaptivity and multiple ports.

The paper is organized as follows. Section 2 discusses the significance of message-ordering in wormhole systems under link and port contention. Message-ordering algorithms are presented in section 3. Simulation experiments and results are presented in section 4. The conclusions and future work are presented in section 5.

2 Message-Ordering in Wormhole Routed Systems

In this section, we introduce the basic concepts of wormhole routing and show how adaptivity in routing reduces link contention. We discuss the operational principles of multiport wormhole systems and show the situations which lead to port contention. Through an example, we show the significance of message-ordering by taking into account routing adaptivity and multiple ports.

2.1 Routing Adaptivity and Link Contention

In wormhole-routed systems [5], the header flit of a message establishes the path, the intermediate flits follow the path, and the tail flit releases the path. During message propagation, if a desired link is already being used by another message, the current message gets blocked. The blocked message waits in the network, occupying all the links it is traversing. This phenomenon is known as link contention. It is closely tied to the underlying routing scheme, the topology of the system, and the communication traffic. To alleviate link contention, several routing schemes with varying adaptivity have been proposed in the literature. Deterministic or e-cube routing [5] defines a single path from a source to a destination node and thus has zero adaptivity. Such routing is simple to implement and deadlock-free. However, it does not make effective use of all communication links in a system. Fully adaptive algorithms [9] allow a message to be routed along any of the shortest paths from the source to the destination processor. Partially adaptive algorithms like planar [3] restrict routing freedom to two dimensions at a time. Figure 1 illustrates the differences between these three routing schemes.

Higher adaptivity has the potential to reduce link contention and hence is useful for reducing the overall execution time of a given program. However, for any of the above schemes, the system performance depends strongly on how messages are pushed into the network or taken out of the network by the processor-router interface. The number of available ports at this interface plays a significant role in determining the completion time of a program and hence the system performance. With a limited number of ports at a processor-router interface, there is an added chance that a message will undergo port contention in addition to link contention. Hence, a good message-ordering strategy should consider routing adaptivity, link contention, and port contention together.
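The difference between deterministic and fully adaptive routing can be made concrete with a small sketch for a binary n-cube. It assumes that nodes are labeled with n-bit integers and that e-cube routing corrects differing address bits from the lowest dimension upward; this ordering convention and the function names `ecube_path` and `minimal_paths` are illustrative assumptions, not taken from the paper. Planar-adaptive routing is not modeled.

```python
from itertools import permutations


def ecube_path(src, dst, n):
    """Deterministic e-cube route in a binary n-cube.

    Assumption: differing address bits are corrected in increasing
    dimension order, so there is exactly one path (zero adaptivity).
    """
    path = [src]
    cur = src
    for d in range(n):
        if (cur ^ dst) & (1 << d):
            cur ^= 1 << d          # traverse the link in dimension d
            path.append(cur)
    return path


def minimal_paths(src, dst, n):
    """All shortest paths a fully adaptive router may choose from.

    Each shortest path corresponds to one order in which the
    differing dimensions are corrected.
    """
    dims = [d for d in range(n) if (src ^ dst) & (1 << d)]
    paths = []
    for order in permutations(dims):
        cur, path = src, [src]
        for d in order:
            cur ^= 1 << d
            path.append(cur)
        paths.append(path)
    return paths
```

For a 3-cube, `ecube_path(0, 7, 3)` gives the single route 0, 1, 3, 7, whereas `minimal_paths(0, 7, 3)` enumerates six shortest routes, which is the extra freedom a fully adaptive scheme can use to avoid a busy link.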

Figure 1: Possible paths from a source to a destination under different routing schemes in a 3-cube: (a) deterministic, (b) planar-adaptive, (c) fully-adaptive.

2.2 Contention in Multiport Systems

Consider a typical processor-router organization in a multicomputer with two injection and two consumption channels (ports), as shown in Fig. 2. Any message originating at a processor must be sent via an injection port to the router. This port remains busy until the message leaves the router. If the message gets blocked on its way due to link contention, the assigned port remains busy and cannot be allocated to any other message. This leads to messages being queued at the injection channel(s), introducing port contention. With multiple ports, the contention may be alleviated. However, it is not eliminated completely, because the number of outgoing messages at any time may be greater than the number of available injection ports. Similar contention due to a limited number of consumption channels also degrades system performance significantly. This aspect has been studied separately in [1]. In this paper we focus on contention due to injection channels and determine the best ways to utilize the injection ports of a system together with its routing scheme.

Figure 2: Processor and router organization of a node supporting two-port communication. (Labels: processor/memory, local injection ports, local consumption ports, router, internal channels, external input channels from 4 neighbors, external output channels to 4 neighbors.)

Consider the example multicast pattern of Fig. 3 in a 4x4 mesh. The source processor P6 sends 6 messages (m1 to m6) to 6 respective destinations. Assume e-cube routing [5] is used for determining routing paths. It can be seen that multiple outgoing messages may contend for the same outgoing link (messages m1, m2, and m4 for the westbound link from P6, and m3 together with one further message for the eastbound link). Since messages come to the router through injection channels (ports), the order in which they are propagated from the router to the network depends on (a) the order in which the processor prepares and presents them to the set of ports and (b) the number of ports available in the system.

Figure 3: An example ordering of a multicast pattern in a 4x4 mesh with e-cube routing.

If the system supports only 1 injection port, then the order of message propagation is identical to the message order prepared by the processor, provided there is no link contention due to previous multicasts or other messages passing through the outgoing links of the node. For example, a message-ordering of (m1, m2, ..., m6) will force the messages to be propagated in that order. If the system supports 2 ports, it can be observed that the above message-ordering is not efficient. Messages m1 and m2 will grab the two injection ports. Due to link contention, m1 will propagate and m2 will get blocked. This is a poor utilization of ports. A message-ordering of (m1, m3, ...) will instead allow both m1 and m3 to propagate simultaneously by using the two ports. If the underlying routing scheme is fully adaptive [9], then the original message-ordering (m1, m2, ...) would have allowed both messages m1 and m2 to move simultaneously.

2.3 Effect of Message-Ordering

Even for a 1-port system, suitable message-ordering can increase port utilization by assigning the port to a message whose outgoing link is free. Besides increasing the utilization of ports, from an application perspective there exists criticality in messages, i.e., some messages are more critical than others, and the program completion time increases significantly [6, 8] if the critical messages get delayed. Hence, a good message-ordering algorithm should take into account criticality in messages, routing scheme, link contention, and port contention in order to reduce program completion time as well as increase the utilization of ports. We emphasize these issues in the following section.
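The two-port scenario of the multicast example can be replayed with a toy model. This is only a sketch under stated assumptions: the function `launched_in_first_round` and the dictionary layout are hypothetical, each message is reduced to the list of outgoing links its routing scheme permits, and the alternative southbound link given to m2 under adaptive routing is an invented placeholder, since the actual destinations of Fig. 3 are not recoverable here. The westbound and eastbound assignments for m1 to m4 follow the text.

```python
def launched_in_first_round(order, candidate_links, ports):
    """Return the messages that start propagating immediately.

    The first `ports` messages in `order` grab the injection ports.
    A message holding a port propagates only if one of its permitted
    outgoing links is still free; otherwise it blocks and keeps its port.
    """
    busy_links = set()
    moving = []
    for msg in order[:ports]:
        free = [link for link in candidate_links[msg] if link not in busy_links]
        if free:
            busy_links.add(free[0])   # take the first free permitted link
            moving.append(msg)
    return moving


# e-cube routing: exactly one permitted outgoing link per message.
ecube = {"m1": ["W"], "m2": ["W"], "m3": ["E"], "m4": ["W"]}

# Fully adaptive routing: m2 is given a second permitted link ("S").
# This alternative is a hypothetical choice made for illustration only.
adaptive = dict(ecube, m2=["W", "S"])

print(launched_in_first_round(["m1", "m2", "m3", "m4"], ecube, ports=2))
print(launched_in_first_round(["m1", "m3", "m2", "m4"], ecube, ports=2))
print(launched_in_first_round(["m1", "m2", "m3", "m4"], adaptive, ports=2))
```

With e-cube routing the ordering (m1, m2, ...) lets only m1 move while m2 holds the second port idle, the ordering (m1, m3, ...) lets both ports be used, and with the adaptive alternative the original ordering also moves two messages at once.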

3 Message-Ordering Algorithms

In deriving message-ordering algorithms, we take an application perspective and emphasize program completion time. In a typical program, multiple multicast communication steps happen at different points in the program execution. Since there are direct and indirect temporal dependencies through communication steps in the program execution, our objective here is to reduce the overall program execution time by deriving an effective message-ordering strategy. The objective is not to reduce the latency of a single multicast communication step.

We use a Temporal Communication Graph (TCG) model for representing the temporal dependencies in a distributed-memory parallel program. This model has been successfully used in our previous study of clustering and assignment problems [6, 7]. Details of the model can be found in these references.

First we explain the naive sequential ordering strategy, which takes into account neither architectural nor application characteristics. Then we discuss the work done by Dikaiakos et al. [8], which presents a message-ordering scheme based on a precedence graph model [7] for a 1-port fully-connected system (not a realistic architecture for large-scale systems) and on criticality in messages only. Using the TCG model, we extend their result to wormhole-routed systems with any topology. This improved static message-ordering algorithm uses temporal properties of communication steps in the application and takes into account the criticality in messages. It is based on Latest Start Time (LST) estimates of the destination computational nodes associated with the outgoing messages. Using these estimates, we develop a new dynamic algorithm to obtain better message ordering by taking into account architectural characteristics of a system such as routing scheme, link contention, and port contention. In this section, we present these algorithms. Simulation experiments and results showing the impact of these message-ordering algorithms on various applications are described in the next section.

3.1 Sequential Ordering

This is a naive message-ordering scheme in which the messages in a multicast are sent by the router in increasing order of receiving task identifiers. It is a simple method which does not consider application or architectural characteristics. For example, for the TCG in Fig. 4, the messages in the multicast originating from source v11 are sent in the order v21 and then v31 (following the identifiers of their receiving tasks) under the sequential message-ordering scheme.

Depending on the design of the processor-router interface and the software components associated with the communication primitives in a system, there can be two different costs (delays) in the way messages are injected into the network. We identify them as the concurrent start-up and skewed start-up cost models. In the first model, all messages belonging to the multicast pattern from a single node are prepared and presented to the processor-router interface concurrently with a single communication start-up. In the second model, the messages are prepared and presented one after another, encountering a start-up for each message. Thus, the messages are presented to the processor-router interface in a skewed manner. In our analysis, we consider both these models.

Figure 4: Temporal communication graph with earliest and latest computation times for an example program with 4 tasks. (Legend: earliest start time, earliest finish time, latest start time, latest finish time; inter-task communication edge; intra-task sequence edge; critical path.)

3.4 Dynamic Ordering

The basic concept behind dynamic ordering is not to assign a port to a message which cannot propagate out of the router due to link contention. This is achieved by incorporating a message-scheduler with each processor-router pair.
The scheduler maintains a queue of unicast and LST-ordered multicast messages which it receives continuously from the application task running on the processor. Under the concurrent startup communication cost model, a multicast encounters a single startup cost for all its messages before they are submitted to its processor's message-scheduler. Under the skewed startup cost model, each message in a multicast encounters a startup cost, resulting in skewed submissions of messages (within a multicast) to its message-scheduler. A message of a multicast is stamped with (a) msg.time_in, which indicates the time the message enters the queue, and (b) msg.float, which is set equal to the difference between the message's latest start time and its earliest start time. This information enables the message-scheduler to estimate a message's criticality with respect to the total completion time of the program.

The message-scheduler on each processor functions as follows. Whenever an injection port becomes free, it selects a message from the queue such that it has the least LST (let us define it as the earliest message) and its outgoing link is free. However, overtaking of an earlier message by a later message in the queue is allowed only if the earlier message is not "very severe" with respect to the later message. This severity of a message is determined based on its float (the difference between its Latest Start Time and Earliest Start Time estimates), the duration for which it has waited in the queue, and the time that would be taken by the overtaking message to free the injection port. This algorithm provides improved utilization of ports while trying to minimize completion time. For a deadlock-free underlying routing scheme, our algorithm is deadlock-free. The algorithm also ensures starvation freedom because the float of a message decreases as it waits longer in the queue. Once the float of a message reduces to less than or equal to zero, it becomes critical and is scheduled immediately when a free port becomes available. The steps of the message-scheduler are described in pseudocode form in Fig. 5.

It is to be noted that this dynamic message-ordering scheme appears to be useful only for systems having dedicated hardware to perform the task of the message-scheduler. However, such hardware is not a must. The scheme we are proposing is quite general and can be used as the last optimization step (after clustering and task assignment) while mapping an application to a system. Since the dynamic message-ordering scheme uses the dynamic state of the network, the message-ordering derived by this algorithm for each multicast communication step can be fed back to the program for use at run time. The program with the modified message ordering will behave exactly as it would have performed in the presence of a message-scheduler. Hence, the scheme can be used on any system without message-scheduler hardware.

    Dynamic Message-Ordering algorithm

    while ((there exists a free outgoing port)
           and (message_queue is not empty)) do
      first_msg = message on top of message_queue;
      updated_float_of_first_msg = msg.float -
          (current_clock_time - msg.time_in);
      if (updated_float_of_first_msg <= 0) then
        /* first_msg is critical, so schedule it */
        remove first_msg from message_queue;
        schedule (first_msg);
      end_if
      else
        /* updated_float_of_first_msg > 0 */
        if (outlink_of_first_msg is free) then
          /* outlink of a message will depend on the routing
             strategy of the architecture */
          remove first_msg from message_queue;
          schedule (first_msg);
        else
          /* outlink of first_msg is not free, so check if
             other messages in queue can be scheduled */
          while (queue not empty) do
            get next_msg from message_queue;
            if ((outlink_of_next_msg is free) and
                (updated_float_of_first_msg -
                 est_lat(first_msg) > 0)) then
              /* est_lat of a message is the wormhole-routed
                 latency of the message without contention */
              /* first_msg is not very severe with
                 respect to next_msg */
              remove next_msg from message_queue;
              schedule (next_msg);
              break; /* from while loop */
            end_if
          end_while
          if (no next_msg was scheduled) then
            /* Better candidate than first_msg not found */
            remove first_msg from message_queue;
            schedule (first_msg);
          end_if
        end_else
      end_else
    end_while

Figure 5: Dynamic Message-Ordering Algorithm.
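The selection step of the message-scheduler described above can be sketched in Python as follows. This is a minimal illustration rather than the authors' implementation: the `Msg` fields, the function `pick_next`, and the representation of free links as a set of labels are all hypothetical. The overtaking test subtracts the contention-free latency estimate of the overtaking message, following the prose description (the time the overtaking message needs to free the injection port); the pseudocode figure as transcribed applies `est_lat` to `first_msg` instead. Each message is also assumed to have a single outgoing link, which in a real system would depend on the routing scheme.

```python
from dataclasses import dataclass


@dataclass
class Msg:
    name: str
    lst: float       # latest start time of the destination computation
    float_: float    # LST minus EST, stamped when the message is created
    time_in: float   # time at which the message entered the queue
    outlink: str     # outgoing link at the source (assumed fixed here)
    est_lat: float   # estimated contention-free wormhole latency


def pick_next(queue, now, free_links):
    """Choose the message that gets a newly freed injection port.

    `queue` must already be ordered by increasing LST. The chosen
    message is removed from the queue and returned (None if empty).
    """
    if not queue:
        return None
    first = queue[0]
    updated_float = first.float_ - (now - first.time_in)
    # A critical message, or one whose link is free, goes immediately.
    if updated_float <= 0 or first.outlink in free_links:
        return queue.pop(0)
    # Otherwise let a later message overtake, but only if the earliest
    # message can afford the delay caused by the overtaking message.
    for i in range(1, len(queue)):
        nxt = queue[i]
        if nxt.outlink in free_links and updated_float - nxt.est_lat > 0:
            return queue.pop(i)
    # No better candidate found: fall back to the earliest message.
    return queue.pop(0)
```

As a usage example, with a queue holding message `a` (float 5, westbound link busy) ahead of message `b` (eastbound link free, latency estimate 2), a port freed at time 1 is given to `b`, while a port freed at time 4 is given to `a`, because the remaining float of 1 no longer covers the delay of being overtaken.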

4 Simulation Experiments and Results

We performed simulation experiments to study the impact on program completion time of the LST-based static and dynamic message-ordering algorithms with respect to the sequential message-ordering algorithm. Our study included architectures with varying degrees of routing adaptivity (e-cube (least adaptive) to fully adaptive [9]), multiple ports, and TCGs with varying application characteristics. Experiments were performed on random TCGs using an event-driven simulator written in CSIM [14] for 16x16 and 8x8 wormhole-routed meshes. The following system parameters, representing current-generation multicomputers, were used: startup time of 1 microsecond, link propagation time of [...] ns, and router node delay of 2 ns. Four different TCG classes, representing applications with varying computation-communication ratio and degree of parallelism, were used. For 16x16 meshes having 256 tasks, the degree of multicast was chosen from an exponential distribution with mean [...]. The mean for 8x8 meshes with 64 tasks was 3.

Figures 6 and 7 summarize the simulation results for the concurrent and skewed communication start-up models, respectively. The following observations can be made from these simulation results.

The LST-based static (ST) message-ordering algorithm reduces program completion time by 1% to 34% for the concurrent startup model and by 6[...] compared to sequential (SEQ) message-ordering. The dynamic (DYN) algorithm provides an additional reduction in completion time of up to 1% for both startup models.

The percentage improvement by the static and dynamic algorithms is highest for systems having 1 injection port. Thus, the algorithms are well suited for current-generation systems which support only a single port.

As the number of injection ports provided by the architecture increases, the improvement in completion time provided by static and dynamic over sequential decreases. This is because, with multiple ports, more messages can move in parallel.

For communication-bound, high-parallelism (CmHp) applications, the improvement by both static and dynamic algorithms is greater than that for communication-bound, low-parallelism (CmLp) or computation-bound (CpHp, CpLp) applications. This is because link contention increases as we go from CpLp to CmHp, and the benefits of message-ordering are reflected in the reduction in completion time.

For the varying port models and varying application classes, the additional improvement of the dynamic over the static algorithm is better for the concurrent startup model than for the corresponding skewed startup cases. This is because the concurrent start-up model gives the dynamic message-ordering algorithm more messages to choose from.

As the routing adaptivity provided by the system increases from e-cube to fully adaptive, the additional improvement provided by the dynamic algorithm over static increases for the concurrent startup model, while it decreases for the skewed startup model. This is because, under the skewed startup model, the messages of a multicast are not available simultaneously for scheduling. Hence the scheduler is unable to effectively utilize the extra outgoing links provided by fully adaptive routing.

5 Conclusions

This study indicates that message-ordering plays a significant role in determining program completion time in distributed-memory systems. We have proposed message-ordering algorithms that consider temporal information of task computations and communications, criticality of messages, and dynamic system state such as link and port contention to reduce program completion time. Such reductions in completion time are shown to be significant, particularly for systems with 1-2 injection ports supporting e-cube or fully adaptive routing and for applications that are communication-bound. These results indicate that a dynamic message-ordering strategy, instead of sequential or static LST-based algorithms, can be used in current-generation high-performance wormhole-routed distributed-memory systems when mapping applications, to reduce program completion time significantly.

The analysis presented in this paper is based on random task graphs. We are continuing our work to apply the dynamic message-ordering algorithms to scientific and numeric applications. In this paper, we have emphasized reducing the completion time of the program. However, direct networks with wormhole routing are increasingly being used to support distributed-shared memory systems.

[Figure 6 panels: (a) CmHp TCGs on 16x16 mesh, concurrent Ts; (b) CmLp TCGs on 16x16 mesh, concurrent Ts; (c) e-cube routing, 1-port 8x8 mesh, concurrent Ts; (d) fully adaptive routing, 1-port 8x8 mesh, concurrent Ts. Panels (a) and (b) plot DYN and ST under fully adaptive and e-cube routing against the number of injection ports; panels (c) and (d) plot DYN and ST against TCG class (CmHp, CmLp, CpHp, CpLp).]

Figure 6: Comparing Completion Time (CT) of LST-based static (ST) and dynamic (DYN) message-ordering algorithms over sequential (SEQ) message-ordering for 16x16 mesh and 8x8 mesh. Concurrent start-up (Ts) model is assumed for all cases. Percentage reduction in completion time is shown in (a) and (b) for varying routing adaptivity (e-cube to fully adaptive) and number of injection ports (1 to 4). Similar reduction is shown in (c) and (d) for varying application characteristics (Communication-bound, High Parallelism (CmHp) to Computation-bound, Low Parallelism (CpLp)) of task graphs.

[Figure 7 panels: (a) CmHp TCGs on 8x8 mesh, skewed Ts, plotting DYN and ST under fully adaptive and e-cube routing against the number of injection ports; (b) e-cube routing, 1-port 8x8 mesh, skewed Ts, plotting DYN and ST against TCG class (CmHp, CmLp, CpHp, CpLp).]

Figure 7: Comparing Completion Time (CT) of LST-based static (ST) and dynamic (DYN) message-ordering algorithms over sequential (SEQ) message-ordering for 8x8 mesh. Skewed start-up (Ts) model is assumed for all cases. Percentage reduction in completion time is shown in (a) for e-cube routing and number of injection ports (1 to 4). Similar reduction is shown in (b) for varying application characteristics (Communication-bound, High Parallelism (CmHp) to Computation-bound, Low Parallelism (CpLp)) of task graphs.

It will be interesting to see how the dynamic message-ordering scheme, in the presence of adaptive routing and multiple ports, can be used effectively in these systems to send invalidation/update multicast messages in order to reduce cache-coherency overheads. Such an approach will not only allow fast cache-coherency support but will also contribute to better throughput of these systems.

References

[1] S. Balakrishnan and D.K. Panda, "Impact of Multiple Consumption Channels on Wormhole Routed k-ary n-cube Networks," In Proceedings of the International Parallel Processing Symposium.

[2] V. Chaudhary and J.K. Aggarwal, "A Generalized Scheme for Mapping Parallel Algorithms," IEEE Trans. on Parallel and Distributed Systems, Vol. 4, No. 3, March.

[3] A.A. Chien and J.H. Kim, "Planar-Adaptive Routing: Low-Cost Adaptive Networks for Multiprocessors," In International Symposium on Computer Architecture, pp. 268-277.

[4] Cray Research, Inc., Cray T3D System Architecture Overview.

[5] W.J. Dally, "Virtual-channel Flow Control," IEEE Trans. on Parallel and Distributed Systems, Vol. 3, pp. 194-, March.

[6] V.A. Dixit-Radiya and D.K. Panda, "Task Assignment on Distributed-Memory Systems with Adaptive Wormhole Routing," In Symposium on Parallel and Distributed Processing.

[7] V.A. Dixit-Radiya and D.K. Panda, "Clustering and Intra-Processor Scheduling for Explicitly-Parallel Programs on Distributed-Memory Systems," In International Parallel Processing Symposium, 1994, accepted to be presented.

[8] M. Dikaiakos, A. Rogers, and K. Steiglitz, "Message Ordering in Multiprocessors with Synchronous Communication," Int'l Conference on Parallel Processing, Vol. III.

[9] J. Duato, "Deadlock-Free Adaptive Routing Algorithms for Multicomputers: Evaluation of a New Algorithm," Sym. on Parallel and Distributed Processing, Dec.

[10] Intel Corporation, Paragon XP/S Product Overview, 1991.

[11] D. Lenoski et al., "The Stanford DASH Multiprocessor," IEEE Computer, pp. 63-79, Mar.

[12] P. McKinley et al., "Unicast-Based Multicast Communication in Wormhole-Routed Networks," Int'l Conference on Parallel Processing, Vol. II, pp. 1-19.

[13] Lionel M. Ni and P.K. McKinley, "A Survey of Wormhole Routing Techniques in Direct Networks," IEEE Computer, pp. 62-76, Feb.

[14] H. Schwetman, "Introduction to Process-Oriented Simulation and CSIM," Proc. of Winter Simulation Conf., 199.


Architecture-Dependent Tuning of the Parameterized Communication Model for Optimal Multicasting Architecture-Dependent Tuning of the Parameterized Communication Model for Optimal Multicasting Natawut Nupairoj and Lionel M. Ni Department of Computer Science Michigan State University East Lansing,

More information

A New Theory of Deadlock-Free Adaptive. Routing in Wormhole Networks. Jose Duato. Abstract

A New Theory of Deadlock-Free Adaptive. Routing in Wormhole Networks. Jose Duato. Abstract A New Theory of Deadlock-Free Adaptive Routing in Wormhole Networks Jose Duato Abstract Second generation multicomputers use wormhole routing, allowing a very low channel set-up time and drastically reducing

More information

Eect of fan-out on the Performance of a. Single-message cancellation scheme. Atul Prakash (Contact Author) Gwo-baw Wu. Seema Jetli

Eect of fan-out on the Performance of a. Single-message cancellation scheme. Atul Prakash (Contact Author) Gwo-baw Wu. Seema Jetli Eect of fan-out on the Performance of a Single-message cancellation scheme Atul Prakash (Contact Author) Gwo-baw Wu Seema Jetli Department of Electrical Engineering and Computer Science University of Michigan,

More information

Akhilesh Kumar and Laxmi N. Bhuyan. Department of Computer Science. Texas A&M University.

Akhilesh Kumar and Laxmi N. Bhuyan. Department of Computer Science. Texas A&M University. Evaluating Virtual Channels for Cache-Coherent Shared-Memory Multiprocessors Akhilesh Kumar and Laxmi N. Bhuyan Department of Computer Science Texas A&M University College Station, TX 77-11, USA. E-mail:

More information

FB(9,3) Figure 1(a). A 4-by-4 Benes network. Figure 1(b). An FB(4, 2) network. Figure 2. An FB(27, 3) network

FB(9,3) Figure 1(a). A 4-by-4 Benes network. Figure 1(b). An FB(4, 2) network. Figure 2. An FB(27, 3) network Congestion-free Routing of Streaming Multimedia Content in BMIN-based Parallel Systems Harish Sethu Department of Electrical and Computer Engineering Drexel University Philadelphia, PA 19104, USA sethu@ece.drexel.edu

More information

Analytical Modeling of Routing Algorithms in. Virtual Cut-Through Networks. Real-Time Computing Laboratory. Electrical Engineering & Computer Science

Analytical Modeling of Routing Algorithms in. Virtual Cut-Through Networks. Real-Time Computing Laboratory. Electrical Engineering & Computer Science Analytical Modeling of Routing Algorithms in Virtual Cut-Through Networks Jennifer Rexford Network Mathematics Research Networking & Distributed Systems AT&T Labs Research Florham Park, NJ 07932 jrex@research.att.com

More information

The Odd-Even Turn Model for Adaptive Routing

The Odd-Even Turn Model for Adaptive Routing IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 11, NO. 7, JULY 2000 729 The Odd-Even Turn Model for Adaptive Routing Ge-Ming Chiu, Member, IEEE Computer Society AbstractÐThis paper presents

More information

A Hybrid Interconnection Network for Integrated Communication Services

A Hybrid Interconnection Network for Integrated Communication Services A Hybrid Interconnection Network for Integrated Communication Services Yi-long Chen Northern Telecom, Inc. Richardson, TX 7583 kchen@nortel.com Jyh-Charn Liu Department of Computer Science, Texas A&M Univ.

More information

Deadlock. Reading. Ensuring Packet Delivery. Overview: The Problem

Deadlock. Reading. Ensuring Packet Delivery. Overview: The Problem Reading W. Dally, C. Seitz, Deadlock-Free Message Routing on Multiprocessor Interconnection Networks,, IEEE TC, May 1987 Deadlock F. Silla, and J. Duato, Improving the Efficiency of Adaptive Routing in

More information

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T

is developed which describe the mean values of various system parameters. These equations have circular dependencies and must be solved iteratively. T A Mean Value Analysis Multiprocessor Model Incorporating Superscalar Processors and Latency Tolerating Techniques 1 David H. Albonesi Israel Koren Department of Electrical and Computer Engineering University

More information

Efficient Multicast on Irregular Switch-Based Cut-Through Networks with Up-Down Routing

Efficient Multicast on Irregular Switch-Based Cut-Through Networks with Up-Down Routing 808 IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 12, NO. 8, AUGUST 2001 Efficient Multicast on Irregular Switch-Based Cut-Through Networks with Up-Down Routing Ram Kesavan and Dhabaleswar

More information

A Simple and Efficient Mechanism to Prevent Saturation in Wormhole Networks Λ

A Simple and Efficient Mechanism to Prevent Saturation in Wormhole Networks Λ A Simple and Efficient Mechanism to Prevent Saturation in Wormhole Networks Λ E. Baydal, P. López and J. Duato Depto. Informática de Sistemas y Computadores Universidad Politécnica de Valencia, Camino

More information

the possibility of deadlock if the routing scheme is not appropriately constrained [3]. A good introduction to various aspects of wormhole routing is

the possibility of deadlock if the routing scheme is not appropriately constrained [3]. A good introduction to various aspects of wormhole routing is The Red Rover Algorithm for DeadlockFree Routing on Bidirectional Rings Je Draper USC/Information Sciences Institute 4676 Admiralty Way Marina del Rey, CA 90292 (310)822 1511 x750 Email: draper@isi.edu,

More information

A taxonomy of race. D. P. Helmbold, C. E. McDowell. September 28, University of California, Santa Cruz. Santa Cruz, CA

A taxonomy of race. D. P. Helmbold, C. E. McDowell. September 28, University of California, Santa Cruz. Santa Cruz, CA A taxonomy of race conditions. D. P. Helmbold, C. E. McDowell UCSC-CRL-94-34 September 28, 1994 Board of Studies in Computer and Information Sciences University of California, Santa Cruz Santa Cruz, CA

More information

BARP-A Dynamic Routing Protocol for Balanced Distribution of Traffic in NoCs

BARP-A Dynamic Routing Protocol for Balanced Distribution of Traffic in NoCs -A Dynamic Routing Protocol for Balanced Distribution of Traffic in NoCs Pejman Lotfi-Kamran, Masoud Daneshtalab *, Caro Lucas, and Zainalabedin Navabi School of Electrical and Computer Engineering, The

More information

Some Thoughts on Distributed Recovery. (preliminary version) Nitin H. Vaidya. Texas A&M University. Phone:

Some Thoughts on Distributed Recovery. (preliminary version) Nitin H. Vaidya. Texas A&M University. Phone: Some Thoughts on Distributed Recovery (preliminary version) Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 Phone: 409-845-0512 Fax: 409-847-8578 E-mail:

More information

Generic Methodologies for Deadlock-Free Routing

Generic Methodologies for Deadlock-Free Routing Generic Methodologies for Deadlock-Free Routing Hyunmin Park Dharma P. Agrawal Department of Computer Engineering Electrical & Computer Engineering, Box 7911 Myongji University North Carolina State University

More information

Fault-Tolerant Wormhole Routing Algorithms in Meshes in the Presence of Concave Faults

Fault-Tolerant Wormhole Routing Algorithms in Meshes in the Presence of Concave Faults Fault-Tolerant Wormhole Routing Algorithms in Meshes in the Presence of Concave Faults Seungjin Park Jong-Hoon Youn Bella Bose Dept. of Computer Science Dept. of Computer Science Dept. of Computer Science

More information

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax:

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax: Consistent Logical Checkpointing Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 hone: 409-845-0512 Fax: 409-847-8578 E-mail: vaidya@cs.tamu.edu Technical

More information

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and

More information

Wormhole Routing Techniques for Directly Connected Multicomputer Systems

Wormhole Routing Techniques for Directly Connected Multicomputer Systems Wormhole Routing Techniques for Directly Connected Multicomputer Systems PRASANT MOHAPATRA Iowa State University, Department of Electrical and Computer Engineering, 201 Coover Hall, Iowa State University,

More information

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,

More information

Generalized Theory for Deadlock-Free Adaptive Wormhole Routing and its Application to Disha Concurrent

Generalized Theory for Deadlock-Free Adaptive Wormhole Routing and its Application to Disha Concurrent Generalized Theory for Deadlock-Free Adaptive Wormhole Routing and its Application to Disha Concurrent Anjan K. V. Timothy Mark Pinkston José Duato Pyramid Technology Corp. Electrical Engg. - Systems Dept.

More information

Virtual Multi-homing: On the Feasibility of Combining Overlay Routing with BGP Routing

Virtual Multi-homing: On the Feasibility of Combining Overlay Routing with BGP Routing Virtual Multi-homing: On the Feasibility of Combining Overlay Routing with BGP Routing Zhi Li, Prasant Mohapatra, and Chen-Nee Chuah University of California, Davis, CA 95616, USA {lizhi, prasant}@cs.ucdavis.edu,

More information

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX

(Preliminary Version 2 ) Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University. College Station, TX Towards an Adaptive Distributed Shared Memory (Preliminary Version ) Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3 E-mail: fjhkim,vaidyag@cs.tamu.edu

More information

Traffic Control in Wormhole Routing Meshes under Non-Uniform Traffic Patterns

Traffic Control in Wormhole Routing Meshes under Non-Uniform Traffic Patterns roceedings of the IASTED International Conference on arallel and Distributed Computing and Systems (DCS) November 3-6, 1999, Boston (MA), USA Traffic Control in Wormhole outing Meshes under Non-Uniform

More information

Interconnection Network

Interconnection Network Interconnection Network Recap: Generic Parallel Architecture A generic modern multiprocessor Network Mem Communication assist (CA) $ P Node: processor(s), memory system, plus communication assist Network

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

/$10.00 (c) 1998 IEEE

/$10.00 (c) 1998 IEEE Dual Busy Tone Multiple Access (DBTMA) - Performance Results Zygmunt J. Haas and Jing Deng School of Electrical Engineering Frank Rhodes Hall Cornell University Ithaca, NY 85 E-mail: haas, jing@ee.cornell.edu

More information

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer

MIMD Overview. Intel Paragon XP/S Overview. XP/S Usage. XP/S Nodes and Interconnection. ! Distributed-memory MIMD multicomputer MIMD Overview Intel Paragon XP/S Overview! MIMDs in the 1980s and 1990s! Distributed-memory multicomputers! Intel Paragon XP/S! Thinking Machines CM-5! IBM SP2! Distributed-memory multicomputers with hardware

More information

Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing?

Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing? Combining In-Transit Buffers with Optimized Routing Schemes to Boost the Performance of Networks with Source Routing? J. Flich 1,P.López 1, M. P. Malumbres 1, J. Duato 1, and T. Rokicki 2 1 Dpto. Informática

More information

Message Passing Models and Multicomputer distributed system LECTURE 7

Message Passing Models and Multicomputer distributed system LECTURE 7 Message Passing Models and Multicomputer distributed system LECTURE 7 DR SAMMAN H AMEEN 1 Node Node Node Node Node Node Message-passing direct network interconnection Node Node Node Node Node Node PAGE

More information

CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song

CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS Xiaodong Zhang and Yongsheng Song 1. INTRODUCTION Networks of Workstations (NOW) have become important distributed

More information

EE482, Spring 1999 Research Paper Report. Deadlock Recovery Schemes

EE482, Spring 1999 Research Paper Report. Deadlock Recovery Schemes EE482, Spring 1999 Research Paper Report Deadlock Recovery Schemes Jinyung Namkoong Mohammed Haque Nuwan Jayasena Manman Ren May 18, 1999 Introduction The selected papers address the problems of deadlock,

More information

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E)

Lecture 12: Interconnection Networks. Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E) Lecture 12: Interconnection Networks Topics: communication latency, centralized and decentralized switches, routing, deadlocks (Appendix E) 1 Topologies Internet topologies are not very regular they grew

More information

Laxmi N. Bhuyan, Ravi R. Iyer, Tahsin Askar, Ashwini K. Nanda and Mohan Kumar. Abstract

Laxmi N. Bhuyan, Ravi R. Iyer, Tahsin Askar, Ashwini K. Nanda and Mohan Kumar. Abstract Performance of Multistage Bus Networks for a Distributed Shared Memory Multiprocessor 1 Laxmi N. Bhuyan, Ravi R. Iyer, Tahsin Askar, Ashwini K. Nanda and Mohan Kumar Abstract A Multistage Bus Network (MBN)

More information

1 Introduction Data format converters (DFCs) are used to permute the data from one format to another in signal processing and image processing applica

1 Introduction Data format converters (DFCs) are used to permute the data from one format to another in signal processing and image processing applica A New Register Allocation Scheme for Low Power Data Format Converters Kala Srivatsan, Chaitali Chakrabarti Lori E. Lucke Department of Electrical Engineering Minnetronix, Inc. Arizona State University

More information

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance Lecture 13: Interconnection Networks Topics: lots of background, recent innovations for power and performance 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees,

More information

The Effect of Adaptivity on the Performance of the OTIS-Hypercube under Different Traffic Patterns

The Effect of Adaptivity on the Performance of the OTIS-Hypercube under Different Traffic Patterns The Effect of Adaptivity on the Performance of the OTIS-Hypercube under Different Traffic Patterns H. H. Najaf-abadi 1, H. Sarbazi-Azad 2,1 1 School of Computer Science, IPM, Tehran, Iran. 2 Computer Engineering

More information

SERVICE ORIENTED REAL-TIME BUFFER MANAGEMENT FOR QOS ON ADAPTIVE ROUTERS

SERVICE ORIENTED REAL-TIME BUFFER MANAGEMENT FOR QOS ON ADAPTIVE ROUTERS SERVICE ORIENTED REAL-TIME BUFFER MANAGEMENT FOR QOS ON ADAPTIVE ROUTERS 1 SARAVANAN.K, 2 R.M.SURESH 1 Asst.Professor,Department of Information Technology, Velammal Engineering College, Chennai, Tamilnadu,

More information

3. Evaluation of Selected Tree and Mesh based Routing Protocols

3. Evaluation of Selected Tree and Mesh based Routing Protocols 33 3. Evaluation of Selected Tree and Mesh based Routing Protocols 3.1 Introduction Construction of best possible multicast trees and maintaining the group connections in sequence is challenging even in

More information

Under Bursty Trac. Ludmila Cherkasova, Al Davis, Vadim Kotov, Ian Robinson, Tomas Rokicki. Hewlett-Packard Laboratories Page Mill Road

Under Bursty Trac. Ludmila Cherkasova, Al Davis, Vadim Kotov, Ian Robinson, Tomas Rokicki. Hewlett-Packard Laboratories Page Mill Road Analysis of Dierent Routing Strategies Under Bursty Trac Ludmila Cherkasova, Al Davis, Vadim Kotov, Ian Robinson, Tomas Rokicki Hewlett-Packard Laboratories 1501 Page Mill Road Palo Alto, CA 94303 Abstract.

More information

Module 17: "Interconnection Networks" Lecture 37: "Introduction to Routers" Interconnection Networks. Fundamentals. Latency and bandwidth

Module 17: Interconnection Networks Lecture 37: Introduction to Routers Interconnection Networks. Fundamentals. Latency and bandwidth Interconnection Networks Fundamentals Latency and bandwidth Router architecture Coherence protocol and routing [From Chapter 10 of Culler, Singh, Gupta] file:///e /parallel_com_arch/lecture37/37_1.htm[6/13/2012

More information

Adaptive Multimodule Routers

Adaptive Multimodule Routers daptive Multimodule Routers Rajendra V Boppana Computer Science Division The Univ of Texas at San ntonio San ntonio, TX 78249-0667 boppana@csutsaedu Suresh Chalasani ECE Department University of Wisconsin-Madison

More information

Boosting the Performance of Myrinet Networks

Boosting the Performance of Myrinet Networks IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. XX, NO. Y, MONTH 22 1 Boosting the Performance of Myrinet Networks J. Flich, P. López, M. P. Malumbres, and J. Duato Abstract Networks of workstations

More information

Total-Exchange on Wormhole k-ary n-cubes with Adaptive Routing

Total-Exchange on Wormhole k-ary n-cubes with Adaptive Routing Total-Exchange on Wormhole k-ary n-cubes with Adaptive Routing Fabrizio Petrini Oxford University Computing Laboratory Wolfson Building, Parks Road Oxford OX1 3QD, England e-mail: fabp@comlab.ox.ac.uk

More information

A New Adaptive Hardware Tree-Based Multicast Routing in K-Ary N-Cubes

A New Adaptive Hardware Tree-Based Multicast Routing in K-Ary N-Cubes IEEE TRANSACTIONS ON COMPUTERS, VOL. 50, NO. 7, JULY 2001 1 A New Adaptive Hardware Tree-Based Multicast Routing in K-Ary N-Cubes Dianne R. Kumar, Member, IEEE, Walid A. Najjar, and Pradip K. Srimani,

More information

Process Allocation for Load Distribution in Fault-Tolerant. Jong Kim*, Heejo Lee*, and Sunggu Lee** *Dept. of Computer Science and Engineering

Process Allocation for Load Distribution in Fault-Tolerant. Jong Kim*, Heejo Lee*, and Sunggu Lee** *Dept. of Computer Science and Engineering Process Allocation for Load Distribution in Fault-Tolerant Multicomputers y Jong Kim*, Heejo Lee*, and Sunggu Lee** *Dept. of Computer Science and Engineering **Dept. of Electrical Engineering Pohang University

More information

Study and Comparison of Mesh and Tree- Based Multicast Routing Protocols for MANETs

Study and Comparison of Mesh and Tree- Based Multicast Routing Protocols for MANETs Study and Comparison of Mesh and Tree- Based Multicast Routing Protocols for MANETs Rajneesh Gujral Associate Proffesor (CSE Deptt.) Maharishi Markandeshwar University, Mullana, Ambala Sanjeev Rana Associate

More information

A FORWARDING CACHE VLAN PROTOCOL (FCVP) IN WIRELESS NETWORKS

A FORWARDING CACHE VLAN PROTOCOL (FCVP) IN WIRELESS NETWORKS A FORWARDING CACHE VLAN PROTOCOL (FCVP) IN WIRELESS NETWORKS Tzu-Chiang Chiang,, Ching-Hung Yeh, Yueh-Min Huang and Fenglien Lee Department of Engineering Science, National Cheng-Kung University, Taiwan,

More information

Connection-oriented Multicasting in Wormhole-switched Networks on Chip

Connection-oriented Multicasting in Wormhole-switched Networks on Chip Connection-oriented Multicasting in Wormhole-switched Networks on Chip Zhonghai Lu, Bei Yin and Axel Jantsch Laboratory of Electronics and Computer Systems Royal Institute of Technology, Sweden fzhonghai,axelg@imit.kth.se,

More information

A Reliable Hardware Barrier Synchronization Scheme

A Reliable Hardware Barrier Synchronization Scheme A Reliable Hardware Barrier Synchronization Scheme Rajeev Sivaram Craig B. Stunkel y Dhabaleswar K. Panda Dept. of Computer and Information Science y IBM T. J. Watson Research Center The Ohio State University

More information

Lecture: Interconnection Networks

Lecture: Interconnection Networks Lecture: Interconnection Networks Topics: Router microarchitecture, topologies Final exam next Tuesday: same rules as the first midterm 1 Packets/Flits A message is broken into multiple packets (each packet

More information

EE 6900: Interconnection Networks for HPC Systems Fall 2016

EE 6900: Interconnection Networks for HPC Systems Fall 2016 EE 6900: Interconnection Networks for HPC Systems Fall 2016 Avinash Karanth Kodi School of Electrical Engineering and Computer Science Ohio University Athens, OH 45701 Email: kodi@ohio.edu 1 Acknowledgement:

More information

Demand Based Routing in Network-on-Chip(NoC)

Demand Based Routing in Network-on-Chip(NoC) Demand Based Routing in Network-on-Chip(NoC) Kullai Reddy Meka and Jatindra Kumar Deka Department of Computer Science and Engineering, Indian Institute of Technology Guwahati, Guwahati, India Abstract

More information

A Fully Adaptive Fault-Tolerant Routing Methodology Based on Intermediate Nodes

A Fully Adaptive Fault-Tolerant Routing Methodology Based on Intermediate Nodes A Fully Adaptive Fault-Tolerant Routing Methodology Based on Intermediate Nodes N.A. Nordbotten 1, M.E. Gómez 2, J. Flich 2, P.López 2, A. Robles 2, T. Skeie 1, O. Lysne 1, and J. Duato 2 1 Simula Research

More information

On Topology and Bisection Bandwidth of Hierarchical-ring Networks for Shared-memory Multiprocessors

On Topology and Bisection Bandwidth of Hierarchical-ring Networks for Shared-memory Multiprocessors On Topology and Bisection Bandwidth of Hierarchical-ring Networks for Shared-memory Multiprocessors Govindan Ravindran Newbridge Networks Corporation Kanata, ON K2K 2E6, Canada gravindr@newbridge.com Michael

More information

Request Network Reply Network CPU L1 Cache L2 Cache STU Directory Memory L1 cache size unlimited L1 write buer 8 lines L2 cache size unlimited L2 outs

Request Network Reply Network CPU L1 Cache L2 Cache STU Directory Memory L1 cache size unlimited L1 write buer 8 lines L2 cache size unlimited L2 outs Evaluation of Communication Mechanisms in Invalidate-based Shared Memory Multiprocessors Gregory T. Byrd and Michael J. Flynn Computer Systems Laboratory Stanford University, Stanford, CA Abstract. Producer-initiated

More information

A Formal View of Multicomputers. Jose A. Galludy, Jose M. Garcazand Francisco J. Quilesy

A Formal View of Multicomputers. Jose A. Galludy, Jose M. Garcazand Francisco J. Quilesy A Formal View of Multicomputers Jose A. Galludy, Jose M. Garcazand Francisco J. Quilesy ydepartamento de Informatica, Universidad de Castilla-La Mancha, Escuela Universitaria Politecnica de Albacete, Campus

More information

Parallel Pipeline STAP System

Parallel Pipeline STAP System I/O Implementation and Evaluation of Parallel Pipelined STAP on High Performance Computers Wei-keng Liao, Alok Choudhary, Donald Weiner, and Pramod Varshney EECS Department, Syracuse University, Syracuse,

More information

Deadlock: Part II. Reading Assignment. Deadlock: A Closer Look. Types of Deadlock

Deadlock: Part II. Reading Assignment. Deadlock: A Closer Look. Types of Deadlock Reading Assignment T. M. Pinkston, Deadlock Characterization and Resolution in Interconnection Networks, Chapter 13 in Deadlock Resolution in Computer Integrated Systems, CRC Press 2004 Deadlock: Part

More information

TDT Appendix E Interconnection Networks

TDT Appendix E Interconnection Networks TDT 4260 Appendix E Interconnection Networks Review Advantages of a snooping coherency protocol? Disadvantages of a snooping coherency protocol? Advantages of a directory coherency protocol? Disadvantages

More information

TASK FLOW GRAPH MAPPING TO "ABUNDANT" CLIQUE PARALLEL EXECUTION GRAPH CLUSTERING PARALLEL EXECUTION GRAPH MAPPING TO MAPPING HEURISTIC "LIMITED"

TASK FLOW GRAPH MAPPING TO ABUNDANT CLIQUE PARALLEL EXECUTION GRAPH CLUSTERING PARALLEL EXECUTION GRAPH MAPPING TO MAPPING HEURISTIC LIMITED Parallel Processing Letters c World Scientic Publishing Company FUNCTIONAL ALGORITHM SIMULATION OF THE FAST MULTIPOLE METHOD: ARCHITECTURAL IMPLICATIONS MARIOS D. DIKAIAKOS Departments of Astronomy and

More information

Fault-Tolerant Routing in Fault Blocks. Planarly Constructed. Dong Xiang, Jia-Guang Sun, Jie. and Krishnaiyan Thulasiraman. Abstract.

Fault-Tolerant Routing in Fault Blocks. Planarly Constructed. Dong Xiang, Jia-Guang Sun, Jie. and Krishnaiyan Thulasiraman. Abstract. Fault-Tolerant Routing in Fault Blocks Planarly Constructed Dong Xiang, Jia-Guang Sun, Jie and Krishnaiyan Thulasiraman Abstract A few faulty nodes can an n-dimensional mesh or torus network unsafe for

More information

Ecient Parallel Data Mining for Association Rules. Jong Soo Park, Ming-Syan Chen and Philip S. Yu. IBM Thomas J. Watson Research Center

Ecient Parallel Data Mining for Association Rules. Jong Soo Park, Ming-Syan Chen and Philip S. Yu. IBM Thomas J. Watson Research Center Ecient Parallel Data Mining for Association Rules Jong Soo Park, Ming-Syan Chen and Philip S. Yu IBM Thomas J. Watson Research Center Yorktown Heights, New York 10598 jpark@cs.sungshin.ac.kr, fmschen,

More information

Abstract. provide substantial improvements in performance on a per application basis. We have used architectural customization

Abstract. provide substantial improvements in performance on a per application basis. We have used architectural customization Architectural Adaptation in MORPH Rajesh K. Gupta a Andrew Chien b a Information and Computer Science, University of California, Irvine, CA 92697. b Computer Science and Engg., University of California,

More information

Adaptive Migratory Scheme for Distributed Shared Memory 1. Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University

Adaptive Migratory Scheme for Distributed Shared Memory 1. Jai-Hoon Kim Nitin H. Vaidya. Department of Computer Science. Texas A&M University Adaptive Migratory Scheme for Distributed Shared Memory 1 Jai-Hoon Kim Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 E-mail: fjhkim,vaidyag@cs.tamu.edu

More information

Recall: The Routing problem: Local decisions. Recall: Multidimensional Meshes and Tori. Properties of Routing Algorithms

Recall: The Routing problem: Local decisions. Recall: Multidimensional Meshes and Tori. Properties of Routing Algorithms CS252 Graduate Computer Architecture Lecture 16 Multiprocessor Networks (con t) March 14 th, 212 John Kubiatowicz Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~kubitron/cs252

More information

Computation of Multiple Node Disjoint Paths
