
HYCORE: A Hybrid Static-Dynamic Technique to Reduce Communication in Parallel Systems via Scheduling and Re-routing*

David R. Surma, Edwin H.-M. Sha, Peter M. Kogge
Department of Computer Science and Engineering
University of Notre Dame
Notre Dame, IN

Technical Report TR
October 1997

* This work was supported in part by NSF MIP and NSF ACS

With the advent of massively parallel machines, there have been considerable gains in reducing task processing times. However, these gains are significantly diminished by the inherent communication overhead. As one of the point design teams sponsored by NSF to develop Petaflop supercomputers, our research group encountered this problem while implementing a parallel solution for simulating partial differential equations representing fluid dynamics problems. With the platform being a tightly-coupled architecture such as the processor-in-memory EXECUBE [1], we realized that the communication overhead impeded our efforts to obtain an optimized execution time. To reduce this overhead, we present a study of the communication incurred when nodes transfer information. Our novel technique involves both compile-time analysis and run-time scheduling. Experiments show that it reduces communication overhead by over 20% compared with a first-come first-served (FCFS) baseline.

The creation of a new scheduling technique was required since most existing scheduling methods do not consider the communication characteristics of the problem [2, 3] and are unable to achieve an optimal schedule. Furthermore, most techniques developed for parallel compilers do not consider this overhead [2, 4]. This research assumes that a suitable task allocation scheme has been used and deals specifically with the ordering and routing of the message transmissions. Therefore, the new scheduling technique differs markedly from traditional multiprocessor scheduling [3] because it schedules at a lower level.

Static techniques, while able to achieve an optimal or near-optimal solution, require known information about the message traffic. Unfortunately, this a priori information may be unavailable or inaccurate. Dynamic scheduling techniques suffer from being unable to utilize information that might be known about the processing environment. Thus, this research presents a hybrid technique utilizing the appealing components of both approaches.

To exemplify this type of scheduling, consider the task directed acyclic graph, or DAG, of Figure 1. Figure 1(b) shows one possible assignment of this graph to a two-dimensional mesh network of six processors. While tasks assigned to the same processor require no internode communication, this assignment scheme indicates that messages must be exchanged. For example, node 1 sends messages to nodes 2, 3, 4, and 5 corresponding to edges A→E, A→H, A→M, and A→K of the DAG. Since there is only a single bidirectional link between each pair of neighboring nodes, network collisions occur. By collisions we mean that messages will compete for at least one physical link in the network.

Figure 1: (a) Task Flow DAG. (b) Tasks assigned to processing nodes

Schedule 1        Schedule 2          Re-routed Schedule
A→E; A→M          A→K; A→M            A→K′; A→H
A→H               A→H; L→J; L→O       A→E; A→M; I→G; L→O; L→J
A→K; I→G          A→E; I→G
L→O; L→J

Table 1: Example Communication Schedules

The first two columns of Table 1 give possible orderings of the resulting message traffic when XY-routing is used. Messages on the same line may be sent in parallel without collisions. In wormhole-routed networks, the time to transmit a message is relatively insensitive to distance [5], so we can assume that equal-length messages will take the same amount of time, t, to traverse the network. Thus, Schedule 1 gives an ordering which completes at time 4t while Schedule 2 completes at time 3t, a savings of 25% based on the communication schedule. An even greater improvement can be obtained if message (A→K) is re-routed to traverse in a YX direction. The third column shows this new schedule with the re-routed message denoted as A→K′. The completion time of this new ordering is 2t.
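
The effect of switching one message from XY to YX routing can be illustrated with a short sketch. It is not taken from the report: the `(row, column)` coordinates, the function names `route` and `collide`, and the decision to treat each direction of a link as a separate channel are all assumptions made for illustration, and the mesh positions below are hypothetical rather than the layout of Figure 1.

```python
from typing import List, Tuple

Node = Tuple[int, int]      # assumed (row, column) position in the mesh
Link = Tuple[Node, Node]    # one directed hop between neighbouring nodes


def route(src: Node, dst: Node, order: str = "XY") -> List[Link]:
    """Links used by dimension-ordered routing; "XY" moves along the row first."""
    r, c = src
    hops: List[Link] = []
    for axis in order:
        if axis == "X":                      # change the column index
            while c != dst[1]:
                nc = c + (1 if dst[1] > c else -1)
                hops.append(((r, c), (r, nc)))
                c = nc
        else:                                # change the row index
            while r != dst[0]:
                nr = r + (1 if dst[0] > r else -1)
                hops.append(((r, c), (nr, c)))
                r = nr
    return hops


def collide(a: List[Link], b: List[Link]) -> bool:
    """Two messages collide when they need at least one common link."""
    return bool(set(a) & set(b))


# Hypothetical example: one node sends two messages at the same time.
near = route((0, 0), (0, 2))                 # stays in row 0
far_xy = route((0, 0), (1, 2), "XY")         # row 0 first, then down
far_yx = route((0, 0), (1, 2), "YX")         # down first, then along row 1
print(collide(near, far_xy))                 # True: both need (0,0)->(0,1)
print(collide(near, far_yx))                 # False: the YX path avoids row 0
```

In this toy case the two messages need two time steps under XY routing but only one once the second message is re-routed, which is the same kind of saving the re-routed schedule in Table 1 obtains.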
Thus, this work addresses the ordering or scheduling of the messages as well as the re-routing of some of them to reduce the overall completion time. The term used for this research is communication scheduling. It not only encompasses the routing aspects and path selection issues discussed in [5, 6], it also determines the order in which the messages in the system should be sent.

There have been several studies related to this problem. One effort develops a "traffic scheduling" algorithm for multiprocessor networks to balance the network links based on the fact that a large number of messages must eventually be delivered [7]. That work, however, uses a First-Come First-Served (FCFS) approach and does not perform any scheduling of the individual message transmissions. Lee and Kim perform path selection in a wormhole-routed network, but they search for unique paths for pairs of communicating nodes [6]. Kandlur and Shin [8] present a work similar to [6] in that dedicated paths are found. The problem with these techniques is that the dedicated paths can cause other messages to follow longer paths even though the dedicated links are unused. Additionally, no scheduling is done, although scheduling can improve the overall performance. Recent work by Eberhart and Li [9] does perform a type of dynamic communication scheduling. However, their work is restricted to analyzing communication patterns that are commonly used in data parallel applications. The work presented here can apply to any type of message-passing activity.

This paper presents a hybrid technique which uses known information about the required message traffic to statically determine priorities for the individual messages. Then, at run time, when a node has several messages to transmit along the same physical link, preference is given to the message with the highest priority. The basis for the priority determination is the recently developed collision graph model [10]. The communication scheduling problem has been addressed previously in a purely static manner using fixed routing and a specific message traffic model [11]. This research greatly improves on that effort by presenting a technique for a general model of message traffic which allows re-routing of messages and operates dynamically.

The starting point is a list of N messages to be transmitted by the network nodes. The goal is to find an optimal communication schedule which reduces the overall processing time. Table 2 shows a sample message list to be executed on a 10×10 two-dimensional mesh processor network.

Message ID   Est. Departure Time   Source   Destination
                                   (3,1)    (7,7)
                                   (2,5)    (5,7)
                                   (3,4)    (5,6)
                                   (2,2)    (7,8)
                                   (3,1)    (6,4)
                                   (2,1)    (6,3)
                                   (2,1)    (5,4)

Table 2: Example message list

This work considers single-packet messages composed of an arbitrary number of flits.
Nodes of the multiprocessor system are attached to all-port routers, and the routing scheme is XY by default or a re-routed scheme which will be discussed shortly.

Definition 1 A message is defined to be $M = (m_{edt}, m_S, m_D)$, where $m_{edt}$ is the estimated departure time of the message, $m_S$ is the source node of the message, and $m_D$ is the destination node of the message.

PRIMAR Algorithm

The first step in arriving at the communication schedule is to determine the priorities for each message. The algorithm to do this is called the Priority Mapping and Re-routing, or PRIMAR, algorithm, and it begins by transforming the problem into a graph model called a collision graph, or CG.

Definition 2 A CG is defined as $G = (V, E)$, where $V$ is the set of nodes $v_1, v_2, \ldots, v_N$ representing messages $M_1, M_2, \ldots, M_N$, and $E = \{(v_i, v_j) \mid \text{the paths of } M_i \text{ and } M_j \text{ intersect}\}$.

Since the estimated departure times vary throughout the message list, two messages can traverse the same paths without colliding if these times are sufficiently far apart. Consequently, a CG is not constructed for the entire message list. Rather, the message list is first sorted by estimated departure time and then processed in sections determined by a user input parameter called a window. This window is used as the range of departure times for the message traffic to be operated on as a set, $S$. Figure 2 shows a CG constructed for the nodes in $S$ from Table 2 when the window parameter is 4.

Figure 2: Collision Graph for S with window = 4.

To get the ordering from the undirected CG, arrows indicating message precedence must be added to the graph. An edge directed from $v_1$ to $v_2$ denotes that the message corresponding to $v_1$ is to be scheduled before the message corresponding to $v_2$. If no edge exists between two nodes, they may be scheduled in parallel. Once an edge orientation has been established, the actual priorities are determined by first finding the node(s) without any incoming edges and assigning them the highest priority. Next, these nodes and their edges are removed from the graph, and the process repeats, assigning the next highest priority, and so on for all messages. Thus, the major problem is determining the edge orientation for the CG that yields a priority scheme which produces the best performance. Central to getting the best performance is finding the maximum number of messages that can be transmitted in parallel at any one time. This corresponds to finding the maximum independent set of the CG.
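
The construction of a windowed collision graph can be sketched as follows. This is a minimal illustration rather than the authors' implementation: the names `Message`, `xy_links`, and `collision_graph` are invented here, coordinates are assumed to be `(row, column)` with XY routing moving along the row first, and the departure times are hypothetical placeholders because the times in Table 2 are not available.

```python
from itertools import combinations
from typing import Dict, List, NamedTuple, Set, Tuple

Node = Tuple[int, int]              # assumed (row, column)


class Message(NamedTuple):
    edt: float                      # estimated departure time, m_edt
    src: Node                       # source node, m_S
    dst: Node                       # destination node, m_D


def xy_links(m: Message) -> Set[Tuple[Node, Node]]:
    """Directed links of the XY route: along the row first, then the column."""
    (r, c), (dr, dc) = m.src, m.dst
    links = set()
    while c != dc:
        nc = c + (1 if dc > c else -1)
        links.add(((r, c), (r, nc)))
        c = nc
    while r != dr:
        nr = r + (1 if dr > r else -1)
        links.add(((r, c), (nr, c)))
        r = nr
    return links


def collision_graph(msgs: List[Message], window: float) -> Dict[int, Set[int]]:
    """Adjacency sets of the CG for the first window of the message list."""
    order = sorted(range(len(msgs)), key=lambda i: msgs[i].edt)
    limit = msgs[order[0]].edt + window
    in_window = [i for i in order if msgs[i].edt <= limit]
    graph: Dict[int, Set[int]] = {i: set() for i in in_window}
    for i, j in combinations(in_window, 2):
        if xy_links(msgs[i]) & xy_links(msgs[j]):   # paths share a link
            graph[i].add(j)
            graph[j].add(i)
    return graph


# Source/destination pairs from Table 2; the times 0..6 are hypothetical.
pairs = [((3, 1), (7, 7)), ((2, 5), (5, 7)), ((3, 4), (5, 6)), ((2, 2), (7, 8)),
         ((3, 1), (6, 4)), ((2, 1), (6, 3)), ((2, 1), (5, 4))]
messages = [Message(float(t), s, d) for t, (s, d) in enumerate(pairs)]
cg = collision_graph(messages, window=4)
print(sorted((i + 1, j + 1) for i in cg for j in cg[i] if i < j))
# Under these assumptions: [(1, 2), (1, 3), (1, 5), (2, 4)]
```

With the assumed times, the window of 4 covers the first five messages. The edge list printed here depends entirely on the assumed coordinate convention, so it should not be read as a reproduction of Figure 2.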
Since finding a maximum independent set is an NP-complete problem, our problem is also NP-complete, and heuristics are needed to arrive at a solution. Consider again the CG of Figure 2. The maximum independent set has size 3 and comprises nodes 2, 3, and 5. Those messages will be assigned priority 0 (highest) and are said to be in $S'$. The other nodes in $S$, namely $S - S'$, have collisions with the nodes in $S'$. Therefore, to enlarge $S'$, re-routing of the messages in $S - S'$ is considered. Re-routing is a process where the message routing path is changed from XY to YX. However, since deadlocks are a concern in wormhole-routed networks, some restrictions are required. Eight turns are possible in two-dimensional mesh networks, and XY routing is deadlock free because it prohibits four of them. We restrict only the two turns shown in Figure 3. Thus, our term for this type of routing is XY and restricted YX routing.
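
Because the text only says that heuristics are needed, the sketch below uses a common minimum-degree greedy heuristic as a stand-in; it is not claimed to be the heuristic the authors use. The function name and the five-node example graph are assumptions, with the graph chosen so that the result matches the set {2, 3, 5} named above.

```python
from typing import Dict, Set


def greedy_independent_set(graph: Dict[int, Set[int]]) -> Set[int]:
    """Repeatedly take the node with the fewest remaining neighbours."""
    remaining = {v: set(nbrs) for v, nbrs in graph.items()}
    chosen: Set[int] = set()
    while remaining:
        # Ties are broken by the smaller node id to keep the result deterministic.
        v = min(remaining, key=lambda u: (len(remaining[u]), u))
        chosen.add(v)
        removed = remaining[v] | {v}        # v and everything it collides with
        for u in removed:
            remaining.pop(u, None)
        for nbrs in remaining.values():
            nbrs -= removed
    return chosen


# Hypothetical collision graph over messages 1..5 (adjacency sets).
example_cg = {1: {2, 3, 5}, 2: {1, 4}, 3: {1}, 4: {2}, 5: {1}}
print(sorted(greedy_independent_set(example_cg)))   # [2, 3, 5]
```

A greedy pass like this gives no optimality guarantee in general; it simply produces a set of messages that can share priority 0 because none of them collide.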

Figure 3: Illustration of allowable routing turns

In the example, nodes 1 and 4 are eligible to be re-routed since they do not violate the turn restrictions. Node 1 is arbitrarily selected first for re-routing, and it can be routed in a YX direction without colliding with any message in $S'$. Thus, it will be assigned priority 0, added to $S'$, and its routing flag set to YX. This flag is part of the flit header, and each router must be able to interpret it for proper routing. Next, node 4 is considered. If it is re-routed it will collide with a member of $S'$ (re-routed message 1), so it cannot be re-routed. After the nodes with top priority have been determined, they are eliminated from the graph and the nodes in $S - S'$ (node 4 in this example) are aged. Aging is a process where messages have their departure times updated to a later time. The value used for aging is determined by the length of the standard message. Next, the entire list of remaining messages is re-sorted and the process repeats, assigning priority 1. The algorithm is executed with several window sizes, a metric is produced for each, and the best priority scheme is used.

Algorithm 1 PRIMAR
Input: $G = (V, E)$, and $M$
Output: $M(v)_{pri}\ \forall v \in V$
begin
    pri = 0;
    Input window from user;
    $I = \emptyset$;
    repeat until $V = \emptyset$;
        sort $V$ by estimated departure time, edt;
        limit1 = earliest estimated departure time of a node $v \in V$;
        limit2 = limit1 + window;
        Build $G_t = (V_t, E_t)$ such that $V_t = \{v \mid limit1 \le M(v)_{edt} \le limit2\}$ and $E_t = \{e \mid u \xrightarrow{e} v \text{ and } u, v \in V_t\}$;
        Determine the maximum independent set, $I \subseteq G_t$;
        $\forall v \in I$, $M(v)_{pri}$ = pri;
        $\forall v \in G_t$, $v \notin I$: explore re-routing for each $v$;
            if re-routing can be done, $M(v)_{pri}$ = pri, path direction = YX, and add $M(v)$ to $I$;
        $\forall v \in (V_t - I)$, $M(v)_{edt} = M(v)_{edt}$ + age;
        pri = pri + 1;
        $V = V - I$;
    end loop;
end algorithm PRIMAR

HYCORE Technique and Results

The Hybrid Communication Scheduling with Re-routing, or HYCORE, technique utilizes the results of the PRIMAR algorithm.
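
To make the kind of result PRIMAR hands to the run-time stage concrete, the following is a rough Python sketch of the loop in Algorithm 1. Several simplifications are assumptions of this sketch and not statements about the authors' code: the maximum independent set is replaced by a minimum-degree greedy heuristic, the turn restrictions of Figure 3 are ignored, coordinates are read as `(row, column)`, and the departure times and the `age` value are hypothetical.

```python
from typing import Dict, List, Set, Tuple

Node = Tuple[int, int]                      # assumed (row, column)
Link = Tuple[Node, Node]


def links(src: Node, dst: Node, order: str) -> Set[Link]:
    """Directed links of a dimension-ordered route ("XY" or "YX")."""
    r, c = src
    out: Set[Link] = set()
    for axis in order:
        if axis == "X":
            while c != dst[1]:
                nc = c + (1 if dst[1] > c else -1)
                out.add(((r, c), (r, nc)))
                c = nc
        else:
            while r != dst[0]:
                nr = r + (1 if dst[0] > r else -1)
                out.add(((r, c), (nr, c)))
                r = nr
    return out


def primar(msgs: Dict[int, Tuple[float, Node, Node]],
           window: float, age: float) -> Dict[int, Tuple[int, str]]:
    """Map each message id to (priority, routing flag); msgs is id -> (edt, src, dst)."""
    edt = {i: m[0] for i, m in msgs.items()}
    result: Dict[int, Tuple[int, str]] = {}
    pending, pri = set(msgs), 0
    while pending:
        limit = min(edt[i] for i in pending) + window
        win = sorted(i for i in pending if edt[i] <= limit)
        xy = {i: links(msgs[i][1], msgs[i][2], "XY") for i in win}
        adj = {i: {j for j in win if j != i and xy[i] & xy[j]} for i in win}
        chosen: Set[int] = set()
        used: List[Set[Link]] = []
        while adj:                          # greedy stand-in for the maximum independent set
            v = min(adj, key=lambda u: (len(adj[u]), u))
            chosen.add(v)
            used.append(xy[v])
            result[v] = (pri, "XY")
            gone = adj[v] | {v}
            adj = {u: n - gone for u, n in adj.items() if u not in gone}
        for i in win:                       # explore YX re-routing for the rest
            if i in chosen:
                continue
            yx = links(msgs[i][1], msgs[i][2], "YX")
            if not any(yx & u for u in used):
                chosen.add(i)
                used.append(yx)
                result[i] = (pri, "YX")
            else:
                edt[i] += age               # aging: push the departure time back
        pending -= chosen
        pri += 1
    return result


# First five source/destination pairs of Table 2 with hypothetical times 0..4.
table2 = {1: (0.0, (3, 1), (7, 7)), 2: (1.0, (2, 5), (5, 7)), 3: (2.0, (3, 4), (5, 6)),
          4: (3.0, (2, 2), (7, 8)), 5: (4.0, (3, 1), (6, 4))}
priorities = primar(table2, window=4, age=1.0)
print(sorted(priorities.items()))
```

Under these assumptions, messages 2, 3, and 5 receive priority 0 with XY routing, message 1 receives priority 0 with the YX flag, and message 4 is aged and receives priority 1, which agrees with the worked example in the text.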
At run time, each node selects a message to transmit based on several factors. If a node has only one message ready to transmit, it checks the routing flag, and if the appropriate link is available the message is transmitted. However, if the node has several messages that are ready to be transmitted, the priority is used as the arbiter.

A simulation program was developed to determine the time at which each message reaches its destination, and a performance metric was established. This metric is the average completion time, or ACT, for all messages transmitted. The ACT is used because our focus is on the individual message transfers. While we are interested in having the shortest final completion time, we also want as many messages as possible to be transmitted as soon as possible. Thus, by using the ACT we can distinguish between two schedules which have equivalent final completion times.

In the example message list of Table 2, the ACT for a statically determined schedule is [...]. A FCFS approach has a time of [...], while our hybrid approach without re-routing (called HYSTAD in the table) yields a value of [...]. Utilizing re-routing, the static approach value decreases to [...] while the HYCORE technique is [...]. Thus, the improvement gained by the HYCORE technique over a FCFS approach is a significant 23.28%. It makes sense that the statically determined algorithm is the best, because if exact information about the message traffic is known a priori, a schedule can be optimized. However, obtaining this information with much accuracy is difficult. Consequently, in the experiments a variance is introduced which takes into account network uncertainties, congestion, and other performance fluctuations. This variance is distributed uniformly over the estimated departure times of all messages, and experiments were performed to study its effects.

Operation          Msgs Sent   SCORE   FCFS   HYSTAD   Re-routed FCFS   HYCORE   % HYCORE Improvement
LU Factorization
Matrix Multiply
Bitonic Sorting

Table 3: Comparison of scheduling techniques without variance

Two models of message traffic were considered in our experiments. First, LU factorization, matrix multiplication, and bitonic sorting were analyzed to determine the message passing that occurs when they are mapped to a two-dimensional mesh architecture.
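
The run-time arbitration rule and the ACT metric described in this section can be sketched in a few lines. The function names, the tie-break by message id, and the formula used for percentage improvement are assumptions made for illustration; the report does not spell them out, and the completion times below are hypothetical.

```python
from typing import Optional, Sequence, Tuple


def pick_next(ready: Sequence[Tuple[int, int]], link_free: bool) -> Optional[int]:
    """Choose which ready message a node sends on one outgoing link.

    ready holds (message id, priority) pairs; priority 0 is the highest.
    Returns None when nothing can be sent.
    """
    if not ready or not link_free:
        return None
    return min(ready, key=lambda m: (m[1], m[0]))[0]


def average_completion_time(completions: Sequence[float]) -> float:
    """ACT: mean of the times at which the messages reach their destinations."""
    return sum(completions) / len(completions)


def improvement_percent(baseline_act: float, new_act: float) -> float:
    """Assumed definition: relative ACT reduction with respect to the baseline."""
    return 100.0 * (baseline_act - new_act) / baseline_act


# Two hypothetical schedules that both finish at time 4 but differ in ACT.
schedule_a = [1.0, 2.0, 3.0, 4.0]
schedule_b = [2.0, 4.0, 4.0, 4.0]
print(average_completion_time(schedule_a))   # 2.5
print(average_completion_time(schedule_b))   # 3.5
print(pick_next([(7, 1), (3, 0), (9, 2)], link_free=True))   # 3
```

The two example schedules show why ACT is useful as a tie-breaker: their final completion times are equal, yet the first delivers most messages earlier and therefore has the lower ACT.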
ACT values are given in Table 3 for the results of the SCORE static scheduling algorithm utilizing re-routing [12], a FCFS approach both with and without re-routing, and the HYSTAD and HYCORE techniques. In this table the variance is 0, so the static approach again performed the best. Further note that the HYCORE technique outperforms the FCFS approach by approximately 20%.

Table 4 shows the results when the variance is 4. Static scheduling no longer works best, as it must compensate for worst-case times, and HYCORE still works better than FCFS, although the percentage is not as great. This is due to the deteriorating accuracy of the information used to determine the priorities. It is still better, indicating that having some knowledge, albeit not totally accurate, improves the performance.

Operation          Msgs Sent   SCORE   FCFS   HYSTAD   Re-routed FCFS   HYCORE   % HYCORE Improvement
LU Factorization
Matrix Multiply
Bitonic Sorting

Table 4: Comparison of scheduling techniques with variance = 4

Table 5 shows results obtained when applying the five scheduling techniques to randomly generated traffic patterns consisting of 30 messages. A hotspot index was used to vary the number of collisions by causing a given percentage of the message destinations to fall in a certain area. The results are averages of 100 trials for each case.

Hotspot Index   SCORE   FCFS   HYSTAD   Rescheduled FCFS   HYCORE   Percent Improvement
10%
[...]%
[...]%
[...]%
[...]%

Table 5: Experiments with 30 messages and variance = 0

Note that the amount of improvement that can be obtained depends on the nature of the message traffic. The HYCORE technique works best on traffic with a moderate number of collisions. At low collision rates (10% hotspot index in the table), there is not much parallelism to exploit, and consequently the improvement that can be obtained, while still significant, is comparatively low. At high collision rates, the CG resembles a clique, where the FCFS approach will begin to work as well as other approaches. Since the comparison is with this FCFS approach, as the number of collisions increases, the amount of improvement that can be obtained decreases. In the table, note the falloff in improvement when the hotspot index exceeds 75%. In between these extremes, however, the improvement obtained by the HYCORE technique steadily increases to a maximum of 21%.

Two parameters are changed to study the effects of additional message transmissions and of the introduction of a variance. Table 6 shows results for experiments using a 40% hotspot index and varying the number of messages transmitted when the variance is 4. From this table it can be seen that the static SCORE technique performs poorly while the HYCORE technique is again better than the FCFS approach. Note that the amount of improvement begins to diminish when the number of messages is greater than 40. This is the case because more messages result in more collisions for a fixed hotspot index. As shown in the previous analysis, once the number of collisions becomes great, the performance begins to diminish.

This paper presents a framework for studying communication scheduling.
The HYCORE technique combines static and run-time elements along with re-routing to reduce the communication overhead by over 20% for both application-specific message traffic and randomly generated message traffic. This technique will almost always perform better than a FCFS approach because it uses re-routing and because it acts first to schedule its messages on a FCFS basis. In the presence of variances, this technique will outperform baseline static scheduling techniques as well.

Msgs Sent   SCORE   FCFS   HYCORE   Percent Improvement

Table 6: Experiments with 40% hotspot index and variance = 4

References

[1] P. M. Kogge, "EXECUBE - A New Architecture for Scalable MPPs," in 1994 International Conference on Parallel Processing, vol. I, pp. 77-84, August.
[2] H. Kasahara and S. Narita, "Practical multiprocessor scheduling algorithms for efficient parallel processing," IEEE Transactions on Computers, vol. C-33, November.
[3] H. El-Rewini, T. G. Lewis, and H. H. Ali, Task Scheduling in Parallel and Distributed Systems. Englewood Cliffs, NJ: Prentice Hall.
[4] S. Shukla, B. Little, and A. Zaky, "A compile-time technique for controlling real-time execution of task-level data-flow graphs," in 1992 International Conference on Parallel Processing, vol. II, pp. 49-56.
[5] L. M. Ni and P. McKinley, "A survey of wormhole routing techniques in direct networks," IEEE Computer, vol. 26, February.
[6] S. Lee and J. Kim, "Path selection for communicating tasks in a wormhole-routed multicomputer," in 1994 International Conference on Parallel Processing, vol. 3, pp. 172-175.
[7] R. P. Bianchini and J. P. Shen, "Interprocessor traffic scheduling algorithm for multiple-processor networks," IEEE Transactions on Computers, vol. C-36, pp. 396-409, April.
[8] D. D. Kandlur and K. G. Shin, "Traffic routing for multicomputer networks with virtual cut-through capability," IEEE Transactions on Computers, vol. C-41, pp. 1257-1270, October.
[9] A. Eberhart and J. Li, "Contention-free communication scheduling on 2D meshes," in 1996 International Conference on Parallel Processing, pp. 44-51.
[10] D. R. Surma and E. Sha, "Collision graph based communication scheduling for parallel systems," to be published in Journal of Computers and their Applications, December.
[11] D. R. Surma and E. Sha, "Efficient communication scheduling with re-routing based on collision graphs," in International Symposium on High Performance Computing Systems, July.
[12] D. R. Surma and E. Sha, "SCORE: An efficient technique to reduce congestion in parallel systems," to be presented at the Tenth International Conference on Parallel and Distributed Computing Systems, September.
