

In Scalable High Performance Computing Conference

Message-Ordering for Wormhole-Routed Multiport Systems with Link Contention and Routing Adaptivity

Dhabaleswar K. Panda and Vibha A. Dixit-Radiya
Department of Computer and Information Science
The Ohio State University, Columbus, OH

Abstract

This paper analyzes the impact of message-ordering, between outgoing messages from a sender to multiple receivers (called multicasts), on the completion time of a program for wormhole-routed distributed-memory systems. In most existing systems, messages in a multicast are generally sent as separate unicast messages by the source processor itself. We study how best to order a set of outgoing messages by taking into account message criticality and architectural issues including link contention, multiple ports, and adaptivity in routing. First, the simple algorithm of Dikaiakos et al. [8] is extended to obtain a static algorithm for non-fully-connected systems. Next, a dynamic message-ordering algorithm is proposed which works for any number of ports and takes advantage of routing adaptivity. Simulation results on random task graphs show improvement in completion time by 34% for static and 44% for dynamic, over naive sequential message-ordering.

1 Introduction

One of the major problems in software development for parallel systems is the mapping problem. It is defined as the allocation of the set of tasks of a parallel application onto available processors to obtain minimal program completion time [2, 6].
The mapping problem, being NP-hard, is usually solved by solving its subproblems, viz., (a) clustering: grouping of tasks for a bounded number of processors, (b) assignment: one-to-one mapping of task clusters onto processors with a specific topology, (c) intra-processor scheduling: ordering of ready tasks allocated to the same processor, and (d) message-ordering: ordering between outgoing messages for a sender to multiple receiver processors. There has been extensive work in the literature on developing near-optimal heuristics for subproblems (a) and (b) above. Even though subproblems (c) and (d) also significantly affect program completion time, they have not received any attention in the mapping literature. We have solved the clustering and intra-processor scheduling problems in [7] and the assignment problem in [6]. In this paper, we address the message-ordering problem.

Logical communication between tasks in a parallel program can be categorized as either unicast (a source sending a message to a destination) or multicast (a source sending messages to multiple destinations at the same time). A broadcast is a special case of multicast. Multicasts are further classified into personalized (different data going to each destination) and non-personalized (the same data going to different processors). Personalized multicasts are common in scientific and numerical computations using scattering techniques. Sophisticated multicasting schemes using path-based routing [13] and unicast-based schemes [12] have been shown to be efficient for non-personalized multicast. However, personalized multicasts cannot take advantage of these schemes. Hence, for systems not having support for these schemes and for personalized multicasts, there is no alternative but to send messages from the source node to destination processors as a collection of unicast messages.

This research is supported in part by the National Science Foundation Grant # MIP
In such cases, the order in which these messages are sent from the source has a direct impact on the program completion time.

The wormhole-routing switching technique is becoming increasingly popular in building massively parallel systems due to its inherent advantages like low-latency communication and reduced communication hardware overhead [13]. In addition to basic wormhole-routing switching, systems are gradually incorporating multiple communication ports and routing schemes with varying adaptivity. Intel Paragon [10], Cray T3D [4], and Stanford DASH [11] are some early representative systems in this trend. These systems provide low-latency communication when the traffic in the system is low. However, with an increase in communication traffic, messages undergo severe link contention and the system starts performing poorly. Similarly, when a single processor sends multicast messages, they encounter port contention at the router of the sender node depending on the routing strategy used by the system. Such contention may increase the completion time of a program significantly. This increase in completion time can be reduced by determining an effective message-ordering strategy. Though commercial wormhole systems are becoming available, there is no study in the literature about the interplay between multiple ports and routing adaptivity and their impact on determining a suitable message-ordering strategy. In this paper, we take such an approach in determining an effective message-ordering strategy for adaptive wormhole systems with multiple ports.

While mapping an application to a distributed-memory system, Dikaiakos et al. [8] have shown that the completion time of a program can be reduced if multicast messages are ordered using a Latest Start Time (LST) strategy. However, this message-ordering strategy is determined by considering the system to be a fully-connected architecture. Hence, the ordering does not necessarily provide the best completion time of a program when executed on a non-fully-connected architecture. In our previous study of mapping applications onto distributed-memory systems [6, 7], we have used LST-based strategies for clustering, task assignment, and intra-processor scheduling. In this paper, we first extend the strategy proposed by Dikaiakos et al. to non-fully-connected architectures and evaluate its performance compared to the sequential message-ordering strategy. Then we propose new dynamic message-ordering algorithms to take advantage of adaptivity and multiple ports.
We analyze the performance of these algorithms under two different models of communication start-up: concurrent and skewed. We study the effectiveness of these algorithms against the sequential message-ordering scheme for random task graphs with varying computation to communication characteristics and for systems with varying routing adaptivity and multiple ports.

The paper is organized as follows. Section 2 discusses the significance of message-ordering in wormhole systems under link- and port-contention. Message-ordering algorithms are presented in section 3. Simulation experiments and results are presented in section 4. The conclusions and future work are presented in section 5.

2 Message-Ordering in Wormhole Routed Systems

In this section, we introduce the basic concepts of wormhole routing and show how adaptivity in routing reduces link contention. We discuss the operational principles of multiport wormhole systems and show the situations which lead to port-contention. Through an example, we show the significance of message-ordering by taking into account routing adaptivity and multiple ports.

2.1 Routing Adaptivity and Link Contention

In wormhole-routed systems [5], the header flit of a message establishes the path, the intermediate flits follow the path, and the tail flit releases the path. During message propagation, if a desired link is already being used by another message, the current message gets blocked. This message waits in the network, occupying all the links it is traversing. Such a phenomenon is known as link-contention. This phenomenon is closely tied to the underlying routing scheme, the topology of the system, and the communication traffic. To alleviate link-contention, several routing schemes with varying adaptivity have been proposed in the literature. Deterministic or e-cube routing [5] defines a single path from a source to a destination node and thus has zero adaptivity. Such routing is simple to implement and deadlock free.
However, it does not make effective use of all communication links in a system. Fully adaptive algorithms [9] allow a message to be routed along any of the shortest paths from the source to the destination processor. Partially adaptive algorithms like planar [3] restrict routing freedom to two dimensions at a time. Figure 1 illustrates the differences between these three routing schemes. Higher adaptivity has the potential to reduce link-contention and hence is useful in reducing the overall execution time of a given program.

However, for any of the above schemes, the system performance depends heavily on how messages are pushed into the network or taken out from the network by the processor-router interface. The number of available ports at this interface plays a significant role in determining the completion time of a program and hence the system performance. With a limited number of ports at a processor-router interface, there is an added chance that a message will undergo port-contention in addition to link-contention. Hence, a good message-ordering strategy should consider routing adaptivity, link-contention, and port-contention together.
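The difference between zero adaptivity and full adaptivity can be made concrete with a small sketch for the 3-cube of Figure 1. This is an illustration only, not code from the paper: the function names are hypothetical, nodes are assumed to be numbered by their binary addresses, and e-cube routing is assumed to correct differing address bits from the lowest dimension upward. Planar-adaptive routing is not modeled.

```python
from itertools import permutations


def differing_dims(src: int, dst: int, n: int) -> list[int]:
    """Dimensions (bit positions) in which the two node addresses differ."""
    return [d for d in range(n) if ((src ^ dst) >> d) & 1]


def ecube_path(src: int, dst: int, n: int) -> list[int]:
    """Deterministic routing: fix differing bits in increasing dimension order
    (the fixed order is an assumption of this sketch)."""
    node, path = src, [src]
    for d in differing_dims(src, dst, n):
        node ^= 1 << d
        path.append(node)
    return path


def fully_adaptive_paths(src: int, dst: int, n: int) -> list[list[int]]:
    """Fully adaptive routing: any shortest path is allowed, i.e. the
    differing dimensions may be corrected in any order."""
    paths = []
    for order in permutations(differing_dims(src, dst, n)):
        node, path = src, [src]
        for d in order:
            node ^= 1 << d
            path.append(node)
        paths.append(path)
    return paths


if __name__ == "__main__":
    print(ecube_path(0b000, 0b111, 3))
    print(len(fully_adaptive_paths(0b000, 0b111, 3)))
```

Between opposite corners of the 3-cube the deterministic scheme offers one path while the fully adaptive scheme offers six, which is why a blocked link is less likely to stall a message under higher adaptivity.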

Figure 1: Possible paths from a source to a destination under different routing schemes in a 3-cube: (a) deterministic, (b) planar-adaptive, (c) fully-adaptive.

2.2 Contention in Multiport Systems

Consider a typical processor-router organization in a multicomputer with two injection and two consumption channels (ports), as shown in Fig. 2. Any message originating at a processor must be sent via an injection port to the router. This port remains busy until the message leaves the router. If the message gets blocked on its way due to link-contention, the assigned port remains busy and cannot be allocated to any other message. This leads to messages being queued at the injection channel(s), introducing port-contention. With multiple ports, the contention may be alleviated. However, it is not eliminated completely, because the number of outgoing messages at any time may be greater than the number of available injection ports. Similar contention due to a limited number of consumption channels also degrades system performance significantly. This aspect has been studied separately in [1]. In this paper, we focus on contention due to injection channels and determine the best ways to utilize injection ports in a system together with its routing scheme.

Figure 2: Processor and router organization of a node supporting two-port communication. (Labels: processor/memory, injection ports, consumption ports, internal channels, router, external input channels from 4 neighbors, external output channels to 4 neighbors.)

2.3 Effect of Message-Ordering

Consider the example multicast pattern of Fig. 3 in a 4x4 mesh. The source processor P6 sends 6 messages (m1 to m6) to 6 respective destinations. Assume e-cube routing [5] is being used for determining routing paths. It can be seen that multiple outgoing messages may contend for the same outgoing link (messages m1, m2, and m4 for the westbound link from P6, and messages m6 and m3 for the eastbound link). Since messages come to the router through injection channels (ports), the order in which they get propagated from the router to the network is based on (a) the order in which the processor prepares and presents them to the set of ports and (b) the number of ports available in the system.

Figure 3: An example ordering of a multicast pattern in a 4x4 mesh with e-cube routing.

If the system supports only 1 injection port, then the order of message propagation is identical to the message-order prepared by the processor, provided there is no link contention due to previous multicasts or other messages passing through the outgoing links of the node. For example, a message-ordering of (m1, m2, ..., m6) will force the messages to get propagated in that order. If the system supports 2 ports, it can be observed that the above message-ordering is not efficient. Messages m1 and m2 will grab the two injection ports. Due to link contention, m1 will propagate and m2 will get blocked. This is a poor utilization of ports. A message-ordering of (m1, m3, ...) will allow both m1 and m3 to propagate simultaneously by using the two ports. If the underlying routing scheme is fully-adaptive [9], then the original message-ordering (m1, m2, ...) would have allowed both messages m1 and m2 to move simultaneously.

Even for a 1-port system, suitable message-ordering can increase the port utilization by assigning the port to a message whose outgoing link is free. Besides increasing the utilization of ports, from an application perspective, there exists criticality in messages, i.e., some messages are more critical than others, and the program completion time increases significantly [6, 8] if the critical messages get delayed. Hence, a good message-ordering algorithm should take into account criticality in messages, routing scheme, link contention, and port contention to reduce program completion time as well as increase the utilization of ports. We emphasize these issues in the following section.
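The two-port example can be checked with a short sketch. It is a simplification made for illustration, not the simulator used in the paper: `first_wave` is a hypothetical helper, routing is assumed deterministic so that each message needs exactly one outgoing link, and only m1, m2 (westbound) and m3 (eastbound) are modeled because those are the messages the example discusses.

```python
def first_wave(order, out_link, ports):
    """Return (propagating, blocked) among the messages that grab the
    injection ports first.

    order    -- message names in the order the processor presents them
    out_link -- message name -> outgoing link it must use (assumed fixed,
                as under deterministic routing)
    ports    -- number of injection ports
    """
    busy_links = set()
    propagating, blocked = [], []
    for msg in order[:ports]:          # the first `ports` messages hold a port
        link = out_link[msg]
        if link in busy_links:
            blocked.append(msg)        # holds its port but cannot move
        else:
            busy_links.add(link)
            propagating.append(msg)
    return propagating, blocked


# Link directions as described in the example; other messages are omitted.
links = {"m1": "west", "m2": "west", "m3": "east"}

if __name__ == "__main__":
    print(first_wave(["m1", "m2", "m3"], links, ports=2))
    print(first_wave(["m1", "m3", "m2"], links, ports=2))
```

With the order (m1, m2, m3) one of the two ports is held by a blocked message, whereas the order (m1, m3, m2) keeps both ports productive, which is the port-utilization argument made above.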

3 Message-Ordering Algorithms

In deriving message-ordering algorithms, we take an application perspective and emphasize program completion time. In a typical program, multiple multicast communication steps happen at different points in the program execution. Since there are direct and indirect temporal dependencies through communication steps in the program execution, our objective here is to reduce the overall program execution time by deriving an effective message-ordering strategy. The objective is not to reduce the latency of a single multicast communication step. We use a Temporal Communication Graph (TCG) model for representing the temporal dependencies in a distributed-memory parallel program. This model has been successfully used in our previous study of clustering and assignment problems [6, 7]. Details of the model can be found in these references.

First we explain the naive sequential ordering strategy, which takes into account neither architectural nor application characteristics. Then we discuss the work done by Dikaiakos et al. [8], which presents a message-ordering scheme based on a precedence graph model [7] for a 1-port fully-connected system (not a realistic architecture for large-scale systems) based on criticality in messages only. Using a Temporal Communication Graph (TCG) model, we extend their result to wormhole-routed systems with any topology. This improved static message-ordering algorithm uses temporal properties of communication steps in the application and takes into account the criticality in messages. It is based on Latest Start Time (LST) estimates of the destination computational nodes associated with the outgoing messages. Using these estimates, we develop a new dynamic algorithm to obtain better message ordering by taking into account architectural characteristics of a system like routing scheme, link contention, and port contention. In this section, we present these algorithms.
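As a rough illustration of the difference between the naive and the LST-based static orderings, the sketch below sorts the messages of one multicast in two ways. It is not the paper's implementation: the `Message` type, its field names, and the time values are hypothetical, and sending the message with the smallest LST first is assumed here as the reading of the LST strategy.

```python
from dataclasses import dataclass


@dataclass
class Message:
    dest_task: int   # identifier of the receiving task
    est: float       # earliest start time of the destination computation node
    lst: float       # latest start time of the destination computation node

    @property
    def slack(self) -> float:
        """Float of the message: how long it can be delayed without
        delaying the program (LST minus EST)."""
        return self.lst - self.est


def sequential_order(msgs):
    """Naive ordering: increasing receiving-task identifier."""
    return sorted(msgs, key=lambda m: m.dest_task)


def lst_order(msgs):
    """Static LST-based ordering: most urgent destination first."""
    return sorted(msgs, key=lambda m: m.lst)


# Hypothetical multicast with three destinations (times in arbitrary units).
multicast = [
    Message(dest_task=2, est=10.0, lst=40.0),
    Message(dest_task=3, est=10.0, lst=12.0),   # nearly critical
    Message(dest_task=4, est=10.0, lst=25.0),
]

if __name__ == "__main__":
    print([m.dest_task for m in sequential_order(multicast)])
    print([m.dest_task for m in lst_order(multicast)])
```

In this made-up case the sequential scheme sends the message for task 3 second even though it has the least slack, while the LST-based order sends it first.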
Simulation experiments and results showing the impact of these message-ordering algorithms on various applications are described in the next section.

Sequential Ordering

This is a naive message-ordering scheme in which messages in a multicast are sent by the router in increasing order of receiving task identifiers. It is a simple method which does not consider application or architectural characteristics. For example, for the TCG in Fig. 4, the messages in the multicast originating from source v11 are sent in the order v21 and v31 (corresponding to tasks T1 and T2 respectively) under the sequential message-ordering scheme.

Depending on the design of the processor-router interface and the software components associated with the communication primitives in a system, there can be two different costs (delays) in the way messages are injected into the network. We identify them as the concurrent start-up and skewed start-up cost models. In the first model, all messages belonging to the multicast pattern from a single node are prepared and presented to the processor-router interface concurrently with a single communication start-up. In the second model, the messages are prepared and presented one after another, encountering a start-up for each message. Thus, the messages are presented to the processor-router interface in a skewed manner. In our analysis, we consider both these models.

Figure 4: Temporal communication graph with earliest and latest computation times for an example program with 4 tasks. (Legend: Earliest Start Time, Earliest Finish Time, Latest Start Time, Latest Finish Time, inter-task communication edge, intra-task sequence edge, critical path.)

Dynamic Ordering

The basic concept behind dynamic ordering is not to assign a port to a message which cannot propagate out from the router due to link contention. This is achieved by incorporating a message-scheduler with each processor-router pair.
The scheduler maintains a queue of unicast and LST-ordered multicast messages which it receives continuously from the application task running on the processor. Under the concurrent startup communication cost model, a multicast encounters a single startup cost for all its messages before they are submitted to its processor's message-scheduler. Under the skewed startup cost model, each message in a multicast encounters a startup cost, resulting in skewed submissions of messages (within a multicast) to its message-scheduler. A message of a multicast is stamped with (a) msg.time_in, which indicates the time the message enters the queue, and (b) msg.float, which is set equal to the difference between the message's latest start time and its earliest start time. This information enables the message-scheduler to estimate a message's criticality with respect to the total completion time of the program.

The message-scheduler on each processor functions as follows. Whenever an injection port becomes free, it selects a message from the queue such that it has the least LST (let us define it as the earliest message) and its outgoing link is free. However, overtaking of an earlier message by a later message in the queue is allowed only if the earlier message is not "very severe" with respect to the later message. This severity of a message is determined based on its float (the difference between Latest Start Time and Earliest Start Time estimates), the duration for which it has waited in the queue, and the time that would be taken by the overtaking message to free the injection port. This algorithm provides improved utilization of ports while trying to minimize completion time. For a deadlock-free underlying routing scheme, our algorithm is deadlock-free. The algorithm also ensures starvation freedom because the float of a message decreases as it waits longer in the message queue. Once the float of a message reduces to less than or equal to zero, it becomes critical and is scheduled immediately when a free port becomes available. The steps of the message-scheduler are described in pseudocode form in Fig. 5.

It is to be noted that this dynamic message-ordering scheme appears to be useful only for systems having dedicated hardware to perform the task of the message-scheduler. However, this is not a must. The scheme we are proposing is quite general and can be used as the last optimization step (after clustering and task assignment) while mapping an application to a system. Since the dynamic message-ordering scheme uses the dynamic state of the network, the message-ordering derived by this algorithm for each multicast communication step can be fed back to the program for use at run time. The program with modified message ordering will behave exactly as it would have performed in the presence of a message-scheduler. Hence, the scheme can be used on any system without message-scheduler hardware.

    Dynamic Message-Ordering algorithm

    while ((there exists a free port) and (message_queue is not empty)) do
        first_msg = message on top of message_queue;
        updated_float_of_first_msg = msg.float - (current_clock_time - msg.time_in);
        if (updated_float_of_first_msg <= 0) then
            /* first_msg is critical, so schedule it */
            remove first_msg from message_queue;
            schedule (first_msg);
        end_if
        else
            /* updated_float_of_first_msg > 0 */
            if (outlink_of_first_msg is free) then
                /* outlink of a message will depend on the routing strategy of the architecture */
                remove first_msg from message_queue;
                schedule (first_msg);
            else
                /* outlink of first_msg is not free, so check if other
                   messages in queue can be scheduled */
                while (queue not empty) do
                    get next_msg from message_queue;
                    if ((outlink_of_next_msg is free) and
                        (updated_float_of_first_msg - est_lat(first_msg) > 0)) then
                        /* est_lat of a message is the wormhole-routed latency
                           of a message without contention */
                        /* first_msg is not very severe with respect to next_msg */
                        remove next_msg from message_queue;
                        schedule (next_msg);
                        break; /* from while loop */
                    end_if
                end_while
                /* Better candidate than first_msg not found */
                remove first_msg from message_queue;
                schedule (first_msg);
            end_else
        end_else
    end_while

Figure 5: Dynamic Message-Ordering Algorithm.
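The selection rule of the message-scheduler can be sketched as a single function that is called whenever an injection port becomes free. This is a minimal sketch under stated assumptions rather than the authors' implementation: the `Msg` type, `pick_message`, and the way free links are passed in are hypothetical; each message is assumed to need one known outgoing link; and the final "better candidate not found" step is read as applying only when no later message was scheduled.

```python
from dataclasses import dataclass


@dataclass
class Msg:
    name: str
    time_in: float    # time the message entered the queue
    slack: float      # float = latest start time - earliest start time
    out_link: str     # outgoing link it needs (assumed known from routing)


def pick_message(queue, now, free_links, est_lat):
    """Choose the message to hand to a free injection port and remove it
    from the LST-ordered queue. Returns None if the queue is empty.

    est_lat -- function giving the contention-free latency of a message
    """
    if not queue:
        return None
    first = queue[0]
    updated_float = first.slack - (now - first.time_in)

    # Critical message, or its link is free: schedule it right away.
    if updated_float <= 0 or first.out_link in free_links:
        return queue.pop(0)

    # first is blocked but still has slack: a later message whose link is
    # free may overtake, provided first can afford the extra delay.
    if updated_float - est_lat(first) > 0:
        for i in range(1, len(queue)):
            if queue[i].out_link in free_links:
                return queue.pop(i)

    # No better candidate (assumed reading): schedule first anyway.
    return queue.pop(0)


if __name__ == "__main__":
    q = [Msg("m1", 0.0, 10.0, "west"),
         Msg("m2", 0.0, 12.0, "west"),
         Msg("m3", 0.0, 20.0, "east")]
    chosen = pick_message(q, now=2.0, free_links={"east"}, est_lat=lambda m: 3.0)
    print(chosen.name, [m.name for m in q])
```

In the usage shown, m1 is blocked on the westbound link but still has slack, so m3 overtakes it. Calling the function again at a later time, once the remaining float of m1 no longer covers the estimated latency, returns m1 itself, which is how the float-based test prevents starvation.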

4 Simulation Experiments and Results

We performed simulation experiments to study the impact on program completion time of the LST-based static message-ordering and dynamic message-ordering algorithms with respect to the sequential message-ordering algorithm. Our study included architectures with varying degrees of routing adaptivity (e-cube (least adaptive) to fully adaptive [9]), multiple numbers of ports, and TCGs with varying application characteristics. Experiments were performed on random TCGs using an event-driven simulator written in CSIM [14] for 16x16 and 8x8 wormhole-routed meshes. The following system parameters, representing current-generation multicomputers, were used: startup time of 1 microsecond, link propagation time of ns, and router node delay of 2 ns. Four different TCG classes, representing applications with varying computation-communication ratio and degree of parallelism, were used. For 16x16 meshes having 6 tasks, the degree of multicast was chosen to be an exponential distribution with mean. The mean for 8x8 meshes with 64 tasks was 3.

Figures 6 and 7 summarize the simulation results for the concurrent and skewed communication start-up models, respectively. The following observations can be made from these simulation results.

The LST-based static (ST) message-ordering algorithm reduces program completion time by 1% to 34% for the concurrent startup model and by 6 compared to sequential (SEQ) message-ordering. The dynamic (DYN) algorithm provides an additional reduction in completion time of up to 10% for both startup models.

The percentage improvement by the static and dynamic algorithms is the highest for systems having 1 injection port. Thus, the algorithms are well suited for current-generation systems which support only a single port.

As the number of injection ports provided by the architecture increases, the improvement in completion time provided by static and dynamic over sequential decreases.
This is because with multiple ports, there is more parallelism between the messages to move.

For communication-bound, high-parallelism (CmHp) applications, the improvement by both the static and dynamic algorithms is greater than that for communication-bound, low-parallelism (CmLp) or computation-bound (CpHp, CpLp) applications. This is because link contention increases as we go from CpLp to CmHp, and the benefits of message-ordering are reflected in the reduction in completion time.

For the varying port models and varying application classes, the additional improvement of the dynamic over the static algorithm is better for the concurrent startup model than for the corresponding skewed startup cases. This is because the concurrent start-up model introduces more choices for the dynamic message-ordering algorithm to select messages.

As the routing adaptivity provided by the system increases from e-cube to fully adaptive, the additional improvement provided by the dynamic algorithm over static increases for the concurrent startup model, while it decreases for the skewed startup model. This is because, in the skewed startup model, messages of a multicast are not available simultaneously for scheduling. Hence the scheduler is unable to effectively utilize the extra outgoing links provided by fully adaptive routing.

5 Conclusions

This study indicates that message-ordering plays a significant role in determining program completion time in distributed-memory systems. We have proposed message-ordering algorithms that consider temporal information of task computations and communications, criticality of messages, and dynamic system state such as link and port contention to reduce program completion time. Such reductions in completion time are shown to be significant, particularly for systems with 1-2 injection ports supporting e-cube or fully-adaptive routing and for applications that are communication-bound.
These results indicate that the dynamic message-ordering strategy, instead of the sequential or static LST-based algorithms, can be used in current-generation high-performance wormhole-routed distributed-memory systems when mapping applications to reduce program completion time significantly.

The analysis presented in this paper is based on random task graphs. We are continuing our work to apply the dynamic message-ordering algorithms to scientific and numeric applications. In this paper, we have emphasized reducing the completion time of the program. However, direct networks with wormhole routing are increasingly being used to support distributed-shared memory systems.

Figure 6: Comparing Completion Time (CT) of LST-based static (ST) and dynamic (DYN) message-ordering algorithms over sequential (SEQ) message-ordering for 16x16 mesh and 8x8 mesh. Concurrent start-up (Ts) model is assumed for all cases. Percentage reduction in completion time is shown in (a) and (b) for varying routing adaptivity (ecube to fully) and number of injection ports (1 to 4). Similar reduction is shown in (c) and (d) for varying application characteristics (Communication-bound, High Parallelism (CmHp) to Computation-bound, Low Parallelism (CpLp)) of task graphs. Panels: (a) CmHp TCGs on 16x16 mesh, concurrent Ts; (b) CmLp TCGs, 16x16 mesh, concurrent Ts; (c) ecube routing, 1-port 8x8 mesh, concurrent Ts; (d) fully-adp routing, 1-port 8x8 mesh, concurrent Ts. Horizontal axes: No. of injection ports in (a) and (b), TCG class in (c) and (d).

Figure 7: Comparing Completion Time (CT) of LST-based static (ST) and dynamic (DYN) message-ordering algorithms over sequential (SEQ) message-ordering for 8x8 mesh. Skewed start-up (Ts) model is assumed for all cases. Percentage reduction in completion time is shown in (a) for ecube routing and number of injection ports (1 to 4). Similar reduction is shown in (b) for varying application characteristics (Communication-bound, High Parallelism (CmHp) to Computation-bound, Low Parallelism (CpLp)) of task graphs. Panels: (a) CmHp TCGs on 8x8 mesh, skewed Ts; (b) ecube routing, 1-port 8x8 mesh, skewed Ts. Horizontal axes: No. of injection ports in (a), TCG class in (b).

It will be interesting to see how the dynamic message-ordering scheme, in the presence of adaptive routing and multiple ports, can be used effectively in these systems to send invalidation/update multicast messages in order to reduce cache-coherency overheads. Such an approach will not only allow fast cache-coherency support but will also contribute to better throughput of these systems.

References

[1] S. Balakrishnan and D.K. Panda, "Impact of Multiple Consumption Channels on Wormhole Routed k-ary n-cube Networks," In Proceedings of the International Parallel Processing Symposium.

[2] V. Chaudhary and J.K. Aggarwal, "A Generalized Scheme for Mapping Parallel Algorithms," IEEE Trans. on Parallel and Distributed Systems, Vol. 4, No. 3, March.

[3] A. A. Chien and J. H. Kim, "Planar-Adaptive Routing: Low-Cost Adaptive Networks for Multiprocessors," In International Symposium on Computer Architecture, pp. 268-277.

[4] Cray Research, Inc., Cray T3D System Architecture Overview.

[5] W.J. Dally, "Virtual-channel Flow Control," IEEE Trans. on Parallel and Distributed Systems, Vol. 3, pp. 194-, March.

[6] V.A. Dixit-Radiya and D.K. Panda, "Task Assignment on Distributed-Memory Systems with Adaptive Wormhole Routing," In Symposium on Parallel and Distributed Processing.

[7] V.A. Dixit-Radiya and D.K. Panda, "Clustering and Intra-Processor Scheduling for Explicitly-Parallel Programs on Distributed-Memory Systems," In International Parallel Processing Symposium, 1994, accepted to be presented.

[8] M. Dikaiakos, A. Rogers, and K. Steiglitz, "Message Ordering in Multiprocessors with Synchronous Communication," Int'l Conference on Parallel Processing, Vol. III.

[9] J. Duato, "Deadlock-Free Adaptive Routing Algorithms for Multicomputers: Evaluation of a New Algorithm," Sym. on Parallel and Distributed Processing, Dec.

[10] Intel Corporation, Paragon XP/S Product Overview, 1991.

[11] D. Lenoski et al., "The Stanford DASH Multiprocessor," IEEE Computer, pp. 63-79, Mar.

[12] P. McKinley et al., "Unicast-Based Multicast Communication in Wormhole-Routed Networks," Int'l Conference on Parallel Processing, Vol. II, pp. 1-19.

[13] Lionel M. Ni and P.K. McKinley, "A Survey of Wormhole Routing Techniques in Direct Networks," IEEE Computer, pp. 62-76, Feb.

[14] H. Schwetman, "Introduction to Process-Oriented Simulation and CSIM," Proc. of Winter Simulation Conf., 199.
