Performance of Multistage Bus Networks for a Distributed Shared Memory Multiprocessor

Laxmi N. Bhuyan, Ravi R. Iyer, Tahsin Askar, Ashwini K. Nanda and Mohan Kumar

Abstract: A Multistage Bus Network (MBN) is proposed in this paper to overcome some of the shortcomings of conventional multistage interconnection networks (MINs), single-bus and hierarchical bus interconnection networks. The MBN consists of multiple stages of buses connected in a manner similar to the MINs and has the same bandwidth at each stage. A switch in an MBN is similar to a MIN switch except that there is a single bus connection instead of a crossbar. MBNs support bidirectional routing, and there exist a number of paths between any source and destination pair. In this paper we develop self-routing techniques for the various paths, present an algorithm to route a request along the path with minimum distance, and analyze the probabilities of a packet taking different routes. Further, we derive a performance analysis of a synchronous packet-switched MBN in a distributed shared memory environment and compare the results with those of an equivalent bidirectional MIN (BMIN). Finally, we present the execution times of various applications on the MBN and the BMIN through execution-driven simulation. We show that the MBN provides performance similar to a BMIN while offering simpler hardware and more fault tolerance than a conventional MIN.

Keywords: interconnection network, routing, queueing model, performance analysis, packet switching, execution-driven simulation

This research was supported by NSF grants MIP and . L. N. Bhuyan and R. Iyer are with the Department of Computer Science, Texas A&M University, College Station, TX; (bhuyan,ravi)@cs.tamu.edu. T. Askar is with Advanced Micro Devices, Austin, TX. A. Nanda is with IBM T. J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY. M. Kumar is with the Department of Computer Science, Curtin University of Technology, GPO Box U 1987, Perth, WA 6001, Australia.
I. Introduction

In order to achieve significant performance in parallel computing, it is necessary to keep the communication overhead as low as possible. The communication overhead of a multiprocessor system depends to a great extent on the underlying interconnection network. An interconnection network (IN) can be either static or dynamic. Dynamic networks can connect any input to any output by enabling some switches, and they are applicable to both shared memory and message passing multiprocessors. Among such dynamic INs, hierarchical buses or rings [1], [2] and Multistage Interconnection Networks (MINs) [3], [4] have been employed commercially. In a strictly hierarchical bus architecture [1], a number of buses are connected in the form of a tree between the processors and the memories. The use of multiple buses makes hierarchical bus-based systems more scalable than the popular single-bus multiprocessors. However, the bandwidth of this interconnection decreases as one moves toward the top of the tree; thus, the scalability of a hierarchical bus system is limited by the bandwidth of the topmost bus. This bandwidth problem can be alleviated through the fat-tree design [5]. The simplicity of bus-based designs and the availability of a fast broadcasting mechanism make bus-based systems very attractive. The MINs, on the other hand, offer a uniform bandwidth across all stages of the network. The bandwidth of the network increases in proportion to the system size, making the MIN a highly scalable interconnection. The switches in a MIN are small crossbar switches. When the system size grows, bigger switches can be used to keep the number of stages and, hence, the memory latency low [6]. However, the complexity of a crossbar switch grows as the square of its size, and therefore the total network cost becomes predominant in larger systems.
We have observed that the traffic in the network is very low, making the crossbar-based MIN switches highly underutilized. In a system using private caches, which is common in today's shared memory multiprocessors, the effective traffic handled by the switches in the network is reduced even further.
A novel interconnection scheme, called the Multistage Bus Network (MBN), is introduced in this paper that combines the positive features of hierarchical buses and MINs. The MBN consists of several stages of buses, with an equal number of buses at each stage. This provides a uniform bandwidth across the stages and forms multiple trees between processors and memories. Unlike hierarchical bus networks, the MBN comprises multiple buses at higher levels, reducing the traffic at those levels. Maintaining cache coherence is a major problem in shared memory multiprocessors. Unlike MINs, snoopy cache coherence protocols can be applied to the MBN [7], which can improve performance to a large extent. The MBN also provides much better fault tolerance and reliability than a conventional MIN [8]. It is known that a distributed shared memory organization has better scalability than a centralized organization [2], [3]. In such an organization, a request or response packet can make U-turns in the network and reach a destination quickly, since the intermediate levels of an MBN consist of buses and bidirectional connections. Four different routing techniques [8] are presented in this paper. We also develop equations for the probability of taking each path based on the memory requests. In order to make a realistic comparison with MINs, we introduce the design and analysis of a corresponding Bidirectional MIN (BMIN). The BMIN allows U-turns, and a packet can be routed based on the same techniques presented here for the MBN. Recently, Xu and Ni [9] have discussed a U-turn strategy for bidirectional MINs as applicable to the IBM SP architecture [4]. However, the MIN employed in SP architectures is cluster-based and works differently than the proposed MBN or BMIN. In this paper, we analyze the performance of an MBN for distributed shared memory multiprocessors based on different self-routing techniques.
Unlike the previous analysis [8], the present analysis is based on routing along the minimum of the four paths for a given source and destination pair. The MBN has some inherent fault tolerance capability due to the number of switch-disjoint paths between any source and destination pair. In this paper, we concentrate only on the routing and performance evaluation
using queueing analysis and execution-driven simulations of various applications. Our execution-driven simulator is an extended version of Proteus [10] that simulates the behavior of a cache-coherent distributed shared memory multiprocessor for various applications.

Fig. 1. A distributed shared memory multiprocessor

The rest of the paper is organized as follows. We present the structure of the MBN and introduce four types of self-routing techniques in Section 2. We define the routing tags required to implement the four routing strategies in Section 3 and, in the same section, present an algorithm that finds the optimal path in the network for a given source-destination pair. A performance analysis of the MBN and BMIN is then presented in Section 4. Results and comparisons with the conventional and bidirectional MINs are presented in Section 5. Section 6 presents the execution-driven simulation specifications and results. Finally, Section 7 concludes the paper.

II. Structure of the MBN

We consider a distributed shared memory (DSM) architecture throughout this paper. In such an environment, the memory modules are directly connected to the corresponding processors, as shown in Figure 1, but the address space is shared. An example of a hierarchical bus interconnection with two levels of buses is shown in Figure 2a [1]. In this example, there are 16 processors, 4 memories, four level-1 buses and one level-2 bus. Naturally, the top-level bus is the bottleneck in the system. In order to improve the performance, a number of buses must be connected at the top level with an interleaved memory design. Such a connection is shown in Figure 2b for a 16*16 system with two levels of buses.
Fig. 2. Hierarchical Bus Interconnection and the MBN: (a) a 16-processor hierarchical bus system; (b) a 16*16 MBN-based system using 4*4 switches (M: memory, P: processor)

We propose that each bus, along with its controller, be placed in a switch analogous to a MIN switch. Such a network is called a Multistage Bus Network (MBN). In an N*N multistage network using k*k switches, there are l = log_k N stages of switches, numbered from stage 0 to stage l − 1, as shown in Fig. 3a. Every switch has a set of left connections closer to the processor side and a set of right connections closer to the memory side. The construction of a 4*4 MBN switch incorporating a bus, a bus access controller and output buffers is shown in Figure 3b. There are control lines associated with each port to carry arbitration information to the bus access controller. Suzuki et al. have studied a similar bus structure in [11]. We also propose a Bidirectional MIN (BMIN) structure for comparison. The difference between the switch architectures of the BMIN and the MBN is evident from Figs. 3b and 3c: the BMIN switch is a crossbar, whereas the MBN switch is a bus. For both networks, a packet from a stage i is passed on to stage i + 1, or vice versa, using the destination tag digits. For a k*k MBN switch there will be up to 2k packets (k inputs from either side) potentially competing for the bus in a cycle. When there is more than one such packet, the bus access controller chooses one of them at random; the others are queued to be transmitted later. In a k*k BMIN switch, on the other hand, all 2k inputs can be connected to the 2k outputs if the requests are to different destinations. The k*k MBN and BMIN switches support forward, backward, and turn-around connections, as explained in the next section.

Fig. 3. Comparing switch architectures: (a) a 16*16 multistage network; (b) 4*4 MBN switch architecture; (c) 4*4 BMIN switch architecture

We describe the structure of the MBN below; the structure of the BMIN is similar. The processors P_0, P_1, ..., P_{N−1} are connected to the left connections of the MBN switches at stage 0. Memory modules M_0, M_1, ..., M_{N−1} are connected to the right connections at stage l − 1. Memory module M_i is also directly connected to processor P_i and is called the local memory of P_i. A source is assigned a tag S = s_0 s_1 ... s_j ... s_{l−1} and a destination is assigned a destination tag D = d_0 d_1 ... d_j ... d_{l−1}, where s_j and d_j are digits in the k-ary system. The digits s_0 and d_0 are the most significant, and s_{l−1} and d_{l−1} are the least significant digits. The connection between stages in the MBN is a k-shuffle [6], which means the right connection at position a_0 a_1 ... a_{l−1} of stage i is connected to the left connection at position a_1 ... a_{l−1} a_0 of stage i + 1, for i = 0, 1, ..., l − 2. A memory request is satisfied internally by the local memory when the source tag and the destination tag of a request are the same. If the tags are different, the request travels to a remote memory through the MBN. As an example, a 16*16 MBN with 2*2 switches is shown in Figure 4. There may or may not be a shuffle interconnection before the first stage of switches. Our routings are developed based on Figure 4, where there is no shuffle before the first stage. Hence, a set of processors with their memories are
connected to one switch at the first stage and to another switch at the same position at the last stage. If there were a k-shuffle connection before the first stage, a different set of processors would be connected to the first-stage and last-stage switches. In Figure 4, a request travels in the forward direction when it starts from the processor side and passes through stages 0, 1, ..., (l − 1), in that order. It travels in the backward direction when it starts from the memory side and passes in the reverse direction through stages (l − 1), ..., 1, 0, as shown in Figure 5. A packet can also travel from left to right and make a U-turn at an intermediate stage, as shown in Figure 6. This is called Forward-U (FU) routing. Similarly, Figure 7 shows Backward-U (BU) routing, where a message enters the network from the right and makes a U-turn. These four routings provide four distinct paths between a source and a destination in the MBN. As a result, the fault tolerance and reliability of the MBN are much better than those of a conventional MIN. Exact expressions for the MBN reliability are derived in [8]; they are also valid for the BMINs introduced in this paper. In conventional MINs like the Omega, Delta and GSN [6], the destination tag is used for self-routing of a request only in the forward direction. In the MBN, the destination tag can likewise be used for self-routing in the forward direction. Since the stage-0 connections are straight instead of a k-shuffle, however, the destination tag itself cannot be used for self-routing in the backward direction. As explained later, the routing tag in the backward case is obtained by reverse-shuffling the destination tag by one digit. In order to determine where to take a turn in the two routing techniques involving U-turns, we need to combine the source tag and the destination tag into a combined tag. The following definitions are needed to develop exact routing algorithms later.
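The k-shuffle wiring between stages amounts to a left rotation of the l-digit base-k representation of a port position. A minimal Python sketch (the function name is ours, not the paper's):

```python
def k_shuffle(position, k, l):
    """Map a right connection of one stage to the left connection of the next
    stage: rotate the l-digit base-k representation of position left by one,
    i.e. a_0 a_1 ... a_{l-1} -> a_1 ... a_{l-1} a_0."""
    digits = []
    for _ in range(l):                 # extract base-k digits, least significant first
        digits.append(position % k)
        position //= k
    digits.reverse()                   # digits[0] is now the most significant, a_0
    rotated = digits[1:] + digits[:1]  # left-rotate by one digit
    value = 0
    for d in rotated:                  # reassemble the base-k number
        value = value * k + d
    return value
```

For example, in a 16*16 network with 2*2 switches (k = 2, l = 4), right position 0011 of stage i connects to left position 0110 of stage i + 1, i.e. `k_shuffle(3, 2, 4)` returns 6. Since the rotation is a bijection, the mapping is a permutation of all N positions.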
Definition 1 (FRT): The Forward Routing Tag (FRT) is the same as the destination tag of a memory request, i.e., FRT = d_0 d_1 ... d_{l−1}.

Definition 2 (BRT): The Backward Routing Tag (BRT) is the destination tag reverse-shuffled by one digit. If d_0 d_1 ... d_{l−1} is the destination tag, then BRT = b_0 b_1 ... b_{l−1} = d_{l−1} d_0 d_1 ... d_{l−2}, where b_j = d_{(j−1) mod l}.

Definition 3 (CT): The Combined Tag (CT) is the digit-wise exclusive-OR of the source tag and the
destination tag, i.e., CT = c_0 c_1 ... c_j ... c_{l−1}, where c_j = s_j ⊕ d_j. The operation ⊕ here means c_j = 0 if s_j = d_j, and c_j = 1 if s_j ≠ d_j. Note that although the digits in S and D are k-ary, the digits in the CT are binary.

Definition 4 (RCT): The Rotated Combined Tag (RCT) is the Combined Tag (CT) reverse-shuffled, or right-rotated, by one digit, i.e., RCT = r_0 r_1 ... r_{l−1} = c_{l−1} c_0 c_1 ... c_{l−2}, where r_j = c_{(j−1) mod l}.

Definition 5 (FTS): The Forward Turning Stage (FTS) is the rightmost nonzero position in the Rotated Combined Tag (RCT). That is, FTS = m such that r_m = 1 and r_j = 0 for m < j ≤ l − 1.

Definition 6 (BTS): The Backward Turning Stage (BTS) is the leftmost nonzero position in the Combined Tag (CT). That is, BTS = n such that c_n = 1 and c_j = 0 for 0 ≤ j < n.

The routing tags FRT and BRT are used for self-routing in the forward and backward directions, respectively. The tags RCT and CT are used to find the U-turn stages FTS and BTS, which determine where to take forward and backward turns during U-turn routings. The various routing schemes possible in an MBN are described below.

III. Routing Algorithms for MBN

In this section, we first present the four routing techniques for the MBN and then present an algorithm that chooses the path with minimum distance. Although these techniques are described for the MBN, they are equally valid for the BMIN.

A. Routing Techniques

A. Forward (FW) Routing: In Forward (FW) routing, a request from source processor S moves from stage 0 through stage l − 1 of the MBN to the destination memory D. An example of FW routing from source 0011 to destination 1011 is shown with a bold line in Figure 4. The jth digit of the forward routing tag (FRT) is used by a switch at stage j for self-routing. Thus, a request that started
at position S = s_0 s_1 ... s_j ... s_{l−1} at the left of stage 0 is switched to position s_0 s_1 ... s_{l−2} d_0 at the right of stage 0, then undergoes a k-shuffle and arrives at position s_1 s_2 ... s_{l−2} d_0 s_0 at the input of stage 1, where it is switched to s_1 s_2 ... s_{l−2} d_0 d_1 at the output of stage 1. In general, when a request arrives at position s_j s_{j+1} ... s_{l−2} d_0 d_1 ... d_{j−1} s_{j−1} at the left of stage j, it is switched to position s_j ... s_{l−2} d_0 d_1 ... d_{j−1} d_j at the right of stage j, goes through a k-shuffle (except at the last stage, j = l − 1) and arrives at position s_{j+1} ... s_{l−2} d_0 d_1 ... d_j s_j at the left of stage j + 1. Finally it reaches the destination d_0 d_1 ... d_j ... d_{l−1} at the output of the last stage of the MBN.

Fig. 4. Forward (FW) routing in MBN (S = 0011, D = 1011, FRT = 1011)

B. Backward (BW) Routing: In Backward (BW) routing, a request from source node S moves backward from stage l − 1 through stage 0 to the destination node D. An example of BW routing from 0011 to 1011 is shown with a bold line in Figure 5. The jth digit of the backward routing tag (BRT) is used by a switch at stage j for self-routing. Thus, a request that started at position S = s_0 s_1 ... s_j ... s_{l−1} at the right of stage l − 1 is switched to position s_0 s_1 ... s_{l−2} d_{l−2} at the left of stage l − 1, then undergoes a reverse k-shuffle and arrives at position d_{l−2} s_0 s_1 ... s_{l−2} at the right of stage l − 2. In general, when a request arrives at position d_j d_{j+1} ... d_{l−2} s_0 s_1 ... s_{j−1} s_j at the right of stage j, it is switched to position d_j d_{j+1} ... d_{l−2} s_0 s_1 ... s_{j−1} d_{(j−1) mod l} (the jth digit of the BRT, b_j = d_{(j−1) mod l}) at the
left of stage j. Then it goes through a reverse shuffle (except at the last stage, j = 0) and arrives at position d_{j−1} d_j d_{j+1} ... d_{l−2} s_0 s_1 ... s_{j−1} at the left of stage j − 1.

Fig. 5. Backward (BW) routing in MBN (S = 0011, D = 1011, BRT = 1101)

C. Forward-U (FU) Routing: In Forward-U routing, the request starts from source S at stage 0, follows FW routing (using the FRT) up to stage FTS − 1 and reaches the left of stage FTS at position s_{FTS} s_{FTS+1} ... s_{l−2} d_0 d_1 ... d_{FTS−1} s_{FTS−1}. At stage FTS it takes a U-turn instead of being switched to the right of stage FTS: the request is switched to the left of stage FTS at position s_{FTS} s_{FTS+1} ... s_{l−2} d_0 d_1 ... d_{FTS−1} d_{(FTS−1) mod l} and follows BW routing (using the BRT) down to stage 0. Finally it reaches position d_0 d_1 ... d_{FTS−1} s_{FTS} ... s_{l−2} d_{l−1} at the left of stage 0. An example of FU routing from 0010 to 0110 is shown with a bold line in Figure 6.

D. Backward-U (BU) Routing: In Backward-U routing, the request starts from source S at stage l − 1, follows BW routing (using the BRT) up to stage BTS + 1 and reaches position d_{BTS} d_{BTS+1} ... d_{l−2} s_0 s_1 ... s_{BTS−1} s_{BTS} at the right of stage BTS. At stage BTS it takes a U-turn instead of being switched to the left of stage BTS: the request is switched to the right of stage BTS at position d_{BTS} d_{BTS+1} ... d_{l−2} s_0 s_1 ... s_{BTS−1} d_{BTS} and follows FW routing (using the FRT) up to stage l − 1. Finally the request reaches position s_0 s_1 ... s_{BTS−1} d_{BTS} d_{BTS+1} ... d_{l−1} at the right of stage l − 1. An
example of BU routing from 0010 to 0110 is shown in Figure 7.

Fig. 6. Forward-U (FU) routing in MBN (S = 0010, D = 0110, FRT = 0110, BRT = 0011, CT = 0100, RCT = 0010, FTS = 2)

B. Optimal Path Algorithm

The distance between a source and a destination in an MBN is defined as the minimum number of switches that the packet has to traverse. For a conventional MIN, this distance is always equal to l, the number of stages in the network. In an MBN, however, the distance may be less than l if FU or BU routing is chosen. The FU and BU (Forward-U and Backward-U) routings are used when the turning stage falls before the center stage of the network. Therefore, there is a net saving in terms of distance between a given source and all the destinations. Detailed expressions for the overall savings in distance for such an MBN are given in Section 4. We present below an algorithm to choose the optimal routing for a given source-destination pair.

Optimal Path Algorithm
1. S = s_0 s_1 ... s_{l−1}
Fig. 7. Backward-U (BU) routing in MBN (S = 0010, D = 0110, FRT = 0110, BRT = 0011, CT = 0100, RCT = 0010, BTS = 1)

2. D = d_0 d_1 ... d_{l−1}
3. CT = S ⊕ D = c_0 c_1 ... c_j ... c_{l−1}
4. RCT = c_{l−1} c_0 c_1 ... c_j ... c_{l−2}
5. d_l = ⌊l/2⌋, d_u = ⌈l/2⌉
6. IF (source = destination)
7. THEN request is to local memory
8. ELSE
9. Find FTS and BTS (from the tags RCT and CT, respectively)
10. IF (FTS = 0 AND BTS = l − 1)
11. THEN select Forward-U (FU) routing OR Backward-U (BU) routing
12. ELSE IF (FTS < d_l)
13. THEN select Forward-U (FU) routing
14. ELSE IF (BTS ≥ d_u)
15. THEN select Backward-U (BU) routing
16. ELSE
17. select Forward (FW) routing OR Backward (BW) routing

The optimal path algorithm chooses a route that has minimum path length. Given a source S = s_0 s_1 ... s_{l−1} and a destination D = d_0 d_1 ... d_{l−1}, the algorithm computes the tags described earlier in this section and then compares them to decide which of the four routings gives the minimum path length through the network. In the algorithm, d_l = ⌊l/2⌋ and d_u = ⌈l/2⌉ correspond to the center stage of the MBN. It must be pointed out that the optimal routing between two nodes is fixed in a given network. Hence the optimal path can be precomputed and stored in a table that is read when a request is issued; there is no need to execute the algorithm every time a message is sent. If the source and the destination are the same, the request is for the local memory, and no traversal through the MBN is required. All other requests pass through at least one stage of the MBN. The memories that are connected to a processor through the first or last stage of the MBN are called cluster memories; similarly, processors that are one switch away from a memory are called the cluster processors of that memory. Requests to cluster memories require that only one switch be traversed; thus, when FTS = 0 and BTS = l − 1, FU or BU routing is taken to serve this purpose. If this condition is not satisfied, we still check for FU or BU routing, because these would be the next possible minimum paths. If FTS < ⌊l/2⌋ or BTS ≥ ⌈l/2⌉, the turning stage lies before (respectively after) the center stage. This reduces the total path length to less than l, and thus FU or BU routing is selected. If none of the above conditions is true, we have FTS ≥ ⌊l/2⌋ and BTS < ⌈l/2⌉; in this case, Forward (FW) routing or Backward (BW) routing are the only options.
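The Optimal Path Algorithm above, together with the path-length formulas that follow, can be written directly in Python. The sketch below takes source and destination tags as digit lists (the digit-list representation and all names are our choices, not the paper's):

```python
def optimal_routing(s_digits, d_digits):
    """Return the minimum-distance routing and its path length in switches
    for k-ary source/destination digit lists (most significant digit first)."""
    l = len(s_digits)
    if s_digits == d_digits:
        return "LOCAL", 0                              # served by the local memory
    ct = [0 if s == d else 1 for s, d in zip(s_digits, d_digits)]  # Combined Tag
    rct = ct[-1:] + ct[:-1]                            # RCT: right-rotate CT by one
    fts = max(j for j in range(l) if rct[j] == 1)      # rightmost 1 in RCT
    bts = min(j for j in range(l) if ct[j] == 1)       # leftmost 1 in CT
    if fts == 0 and bts == l - 1:                      # cluster memory
        return "FU/BU", 1
    if fts < l // 2:                                   # U-turn before the center stage
        return "FU", 2 * fts + 1
    if bts >= (l + 1) // 2:                            # U-turn after the center stage
        return "BU", 2 * (l - 1 - bts) + 1
    return "FW/BW", l                                  # full forward or backward path
```

Consistent with the Table I examples, in a 1024*1024 network with 2*2 switches (l = 10), source 0 reaches destination 2 fastest by BU routing (3 switches) and destination 256 by FU routing (5 switches).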
The actual path lengths, in terms of the number of switches traversed, are presented below:

Local memory: 0 switches (the MBN is not traversed)
Forward (FW) or Backward (BW) routing: l switches
Forward-U (FU) routing:
1. Cluster memories: 1 switch
2. Other memories: 2·FTS + 1 switches
Backward-U (BU) routing:
1. Cluster memories: 1 switch
2. Other memories: 2·(l − 1 − BTS) + 1 switches

These path length equations can be used to form a table for a given source and destination. As an example, Table I shows the path lengths from source 0 to different destinations in a 1024*1024 network.

TABLE I. The path lengths for each of the routings (FW/BW, FU, BU) given source = 0 and different destinations i

The path length for each routing is quite different, and thus a routing algorithm is required to route the request through the optimal path. For example, if the destination is 2, then Backward-U routing results in the optimal path length; if the destination is 256, then Forward-U routing results in the optimal path length. The other two requests should use forward or backward routing strategies.

IV. Performance of the MBN

The Multistage Bus Network (MBN) is analyzed here in the distributed shared memory environment shown in Figure 1. We also analyze the BMIN and compare its results with those of the MBN. In both cases, the memory module M_i is directly connected to the processor P_i and is called the local memory of P_i. Requests from a processor to its local memory are called internal requests and are
carried over the internal bus between the processor and its local memory. A memory can also receive external requests, which originate from other processors and are carried over the MBN.

A. Network operation

In a distributed memory system, there are k − 1 processors that can be reached through the size-k switch at the first stage or the last stage to which P_i is connected. Thus, an external request destined to a cluster processor or memory returns from the first stage (Forward-U routing) or the last stage (Backward-U routing) without going through the whole MBN. If the request is neither to a local nor to a cluster memory, however, it may take any of the four routings described earlier. Both internal and external requests arrive at a memory queue. Only one of them is selected for service, on an FCFS basis, while the remaining requests are queued in the buffer of the memory. After receiving a request, a memory module sends a reply packet either directly to its local processor or to another processor through the network, depending on whether the request is internal or external. We compare the performance of the MBN with that of a BMIN. The transmission of request and reply packets goes through the network following the routings given earlier in the paper. We assume a synchronous, packet-switched system for analyzing the multistage networks. Since a buffer size of four or more gives the same effect as an infinite buffer [12], [13], for simplicity we assume an infinite buffer for the MBN and BMIN. The analysis can be extended to finite buffers, but the equations become fairly complicated [13]; since our aim here is to analyze the routing schemes, we prefer to give the basic infinite-buffer analysis. The bus service time (for the MBN) or the link service time (for the BMIN) to transfer a message forms one system cycle time. The service times of the memory modules are assumed to be integral multiples of this system cycle time.
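The synchronous MBN switch model (up to 2k packets compete for the bus each cycle; one is served, the rest are queued) can be illustrated with a toy simulation. The function and the Bernoulli per-port arrival model are our own simplifications, not the paper's analysis:

```python
import random

def bus_utilization(arrival_prob, ports=8, cycles=20000, seed=1):
    """Toy synchronous model of one MBN bus switch: each of the 2k ports
    independently offers a packet with probability arrival_prob per cycle;
    the bus serves exactly one queued packet per cycle. (Which waiting packet
    wins the random arbitration does not affect aggregate utilization.)"""
    rng = random.Random(seed)
    queued = 0
    busy_cycles = 0
    for _ in range(cycles):
        queued += sum(rng.random() < arrival_prob for _ in range(ports))
        if queued:
            queued -= 1
            busy_cycles += 1
    return busy_cycles / cycles
```

With a light offered load (e.g. `arrival_prob = 0.01` on 8 ports) the bus is busy only about 8% of the time, consistent with the observation in the introduction that network switches are often underutilized.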
A processor is represented by a delay center: in a given cycle, if it is busy computing, it submits a memory request with some given probability. Once it sends the memory request, the processor remains idle until the memory response packet (in the case of a read) or acknowledgment (in the case of a write) is received. The various
system parameters are defined below:

k*k : size of the MBN or MIN switches
N : number of processors or memories in the system
l : number of stages in the IN, l = log_k N
t_s : switch service time
t_m : memory service time
p : probability that a processor submits a memory request in a given cycle, provided it is busy
m : probability that a processor requests its local memory, provided it has made a memory request
p_i : probability that a request passes through stage i
r_i : mean response time of the switch at stage i, 0 ≤ i ≤ l − 1
q_i : average number of local requests by a processor per cycle
q_e : average number of remote requests from a processor per cycle
d_n : total delay in the network (considering all stages)
l_m : average queue length in a memory module
d_m : average delay in a memory module
P_u : processor utilization (fraction of time the processor is busy)

The performance analysis of the MBNs and BMINs is carried out under the following assumptions [12], [13]. Packets are generated at each source node by independent and identically distributed random processes. At any point in time, a processor is either busy doing internal computation or waiting for the response to a memory request. If there is no pending request, each busy processor generates a packet with probability p in each cycle. The probability that this request is to the local
memory (internal request) is m, and the probability that it is to any other memory module (external request) is (1 − m). A reply from memory travels in the opposite direction through the same path in the MBN or BMIN. It may be noted that in a MIN like the Butterfly [3], a reply has to traverse the network in the same direction (i.e., from the processor to the memory side) to reach the requesting processor, because the MIN has unidirectional links. In [9], bidirectional links are used between stages, and hence the request and reply messages may travel in the forward and backward directions, respectively. The messages from processor to memory are generated using the probabilities specified below:

Request probability (p): The request probability is used as a means of modeling processor behavior in terms of memory requests. When a processor is busy computing, i.e., no request is outstanding in the switches or a memory module, it can send a memory request. In each cycle, the processor decides whether or not a message is to be sent based on this probability. On average, it takes 1/p cycles for the processor to send out a request.

Local memory request probability (m): Given that a request is to be made to memory, the probability m is used to decide whether the request is to local or external memory.

Though simple, the above probabilities play an important role and are the only inputs to the analysis. After each request to memory, the processor waits for an acknowledgment. Once an acknowledgment is received, the processor does useful computation for one cycle and then, based on the above probabilities, decides whether to continue or to send another request to memory.

Processor utilization: The processor utilization P_u, defined as the fraction of time a processor is busy, is determined by the waiting time and service time faced by a request at the various service centers.
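Under this model, the number of cycles a busy processor waits before issuing a request is geometrically distributed with mean 1/p, which a quick Monte Carlo check confirms (the code and names are our own illustration):

```python
import random

def cycles_until_request(p, rng):
    """Count cycles until a busy processor issues a request,
    with one Bernoulli(p) trial per cycle."""
    cycles = 1
    while rng.random() >= p:
        cycles += 1
    return cycles

rng = random.Random(42)
p = 0.25
samples = [cycles_until_request(p, rng) for _ in range(100000)]
mean = sum(samples) / len(samples)  # close to 1/p = 4 cycles
```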
In a number of applications, a large portion of the requests are made to the cluster processors. In [8], we studied the performance of the MBN with varying probabilities of cluster requests. In that study, Forward-U and Backward-U routings were allowed only at the first and last stages; all other requests were routed by Forward (FW) routing. The processor utilization for such a case is given by
the following equation:

P_u = 1 / {1 + p(1 − m)(1 − m_1)(d_n + d_m) + p·m·d_m + p(1 − m)·m_1·(2r_0 + d_m)}     (1)

where m_1 is the probability that an external request is to a cluster memory. In this paper, a message in the MBN or BMIN is sent along the minimum-distance path. In such a case,

P_u = 1 / (1 + α + β + γ + δ)     (2)

where α corresponds to the expected delay for a local memory request to be served; β corresponds to the expected delay for serving requests to cluster memories; γ corresponds to the expected delay for serving all requests, except those to cluster memories, that follow FU or BU routing; and δ corresponds to the expected delay for serving all requests that follow Forward (FW) or Backward (BW) routing. The derivation of the terms α, β, γ and δ is presented below. These terms depend on (a) the routing probabilities along each path, (b) the amount of traffic in the network, and (c) the service demand at the individual service centers. Thus we get a non-linear equation with P_u as the single variable, which is solved using iteration techniques.

B. Routing probabilities and path delays

The routing probabilities and path delays are derived here for the MBN and BMIN under the assumption that all the non-local memories are addressed equally by a processor. These equations can be modified for non-uniform remote memory references. Since the path length of Backward (BW) routing is the same as that of FW routing, we derive the term based on FW routing and multiply it by 2 to include BW routing. A similar method is used for FU and BU routing as well.
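The routing classes partition the N − 1 remote destinations of a source, so the counting arguments used below for the cluster probability (k − 1)/(N − 1) and the non-cluster FU/BU probability can be cross-checked by enumeration. The following sketch (our own check, not from the paper) classifies every destination of source 0 in a 1024*1024 MBN with 2*2 switches by its turning stages:

```python
def turning_stages(src, dst, k, l):
    """Return (FTS, BTS) computed from the combined tag of two k-ary tags."""
    ct = []
    for i in range(l - 1, -1, -1):     # most significant digit first
        ct.append(0 if (src // k**i) % k == (dst // k**i) % k else 1)
    rct = ct[-1:] + ct[:-1]            # right-rotate CT by one digit
    fts = max(j for j in range(l) if rct[j] == 1)  # rightmost 1 in RCT
    bts = min(j for j in range(l) if ct[j] == 1)   # leftmost 1 in CT
    return fts, bts

k, l = 2, 10
N = k ** l
cluster = fu = bu = fwbw = 0
for dst in range(1, N):                # every remote destination of source 0
    fts, bts = turning_stages(0, dst, k, l)
    if fts == 0 and bts == l - 1:
        cluster += 1
    elif fts < l // 2:
        fu += 1
    elif bts >= (l + 1) // 2:
        bu += 1
    else:
        fwbw += 1

# closed forms: (k - 1) cluster memories, and
# 2 * sum_{i=1}^{floor(l/2)-1} (k - 1) * k^i non-cluster FU/BU destinations
assert cluster == k - 1
assert fu + bu == 2 * sum((k - 1) * k**i for i in range(1, l // 2))
assert cluster + fu + bu + fwbw == N - 1
```

The enumeration confirms that the four classes are disjoint and exhaustive, which is what allows the per-class probabilities below to be summed in equation (2).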
Local memory requests (α): A local memory request does not involve any switch traversal, so the only delay is that of servicing the request at the memory module (d_m). Given that the probability of a processor issuing a memory request is p and that of the request being local is m, we can deduce that

α = p · m · d_m    (3)

Cluster routing (β): Requests to cluster processors travel to the first- or last-stage switch and take an FU or BU routing to the destination processor. All source-destination pairs in which all bits of the CT except the least significant log_2 k bits are zero entail this type of routing. Thus the number of cluster memories for a given source is k − 1, since k × k is the size of an MBN or BMIN switch. The switch at stage 0 is traversed once for reaching the cluster memory and once for sending back the acknowledgment. Given that an external memory is requested, the probability of requesting a cluster memory can be expressed as

p_β = (k − 1) / (N − 1)    (4)

Thus we have a delay of 2r_0 for the switch traversals and d_m for the memory service, giving the following equation:

β = p(1 − m) · ((k − 1) / (N − 1)) · (2r_0 + d_m)    (5)

Non-cluster FU or BU routing (γ): In forward-U and backward-U routing, the request traverses in one direction up to a particular stage (as explained in Section 2) and makes a U-turn to reach the destination processor. Given the turning stage, FTS, the path length is 2·FTS + 1: the turning stage is traversed only once, while all stages to its left are traversed twice, though not necessarily through the same switch. We should have FTS < ⌊l/2⌋ for path
length optimization, and BTS ≥ ⌈l/2⌉ for optimal path length. Since we have already covered cluster memories (FTS = 0, BTS = l − 1), we start with FTS ≥ 1 and BTS ≤ l − 2. Consider FTS < ⌊l/2⌋; a similar derivation can be done for BTS ≥ ⌈l/2⌉. The total number of destinations is N − 1. For a given turning stage i, 1 ≤ i < d, since the FTS is determined by the rightmost nonzero digit of the tag, all digits to the left of this position may take any value, which gives k^i combinations. As discussed in Section 3, the Rotated Combined Tag (RCT) is defined as the digit-wise EX-OR of the source and destination tags; thus each digit of the RCT is either zero or nonzero regardless of the source and destination tags, and the number of ways in which a digit of the RCT can be nonzero is k − 1. Thus, given that an external memory is requested, the probability of non-cluster FU and BU routing is

p_γ = 2 · Σ_{i=1}^{d−1} ((k − 1) / (N − 1)) · k^i    (6)

where d = ⌊l/2⌋. The delay of such a routing depends on the stage at which the U-turn takes place, so within the summation of the above equation we must include the delay of each switch traversed on that particular path. As discussed above, for a turning stage FTS = i we traverse all stages to the left of the turning stage twice, so the delay excluding the turning stage is 2·(Σ_{j=0}^{i−1} 2r_j); this term is multiplied by two because it accounts for the acknowledgment packets as well. The request and acknowledgment also traverse the turning stage and the memory module, with delay r_i + d_m. Including this delay, 2·(Σ_{j=0}^{i−1} 2r_j + r_i) + d_m, along with the probability gives the equation for γ:

γ = p(1 − m) · 2 · Σ_{i=1}^{d−1} [ ((k − 1)/(N − 1)) · k^i · ( 2·(Σ_{j=0}^{i−1} 2r_j + r_i) + d_m ) ]    (7)

Forward routing (δ): Finally, for all those source-destination pairs which do not fall into the above
routing categories, the forward routing path is taken. Since forward or backward routing is the last choice for any other type of source-destination pair, its probability is simply 1 − p_β − p_γ, where p_β and p_γ are the cluster and non-cluster FU/BU routing probabilities derived above. In this type of routing all switches are traversed, giving a summation of all switch response times, d_n = Σ_{i=0}^{l−1} r_i. Thus the expected delay for all such routings can be expressed as

δ = p(1 − m) · p_δ · ( (Σ_{i=0}^{l−1} r_i) + d_m )    (8)

where

p_δ = 1 − p_β − p_γ    (9)

and p_β and p_γ are given by equations 4 and 6 respectively. The above equations are valid when the local memory is accessed with probability m and all other memories are addressed with equal probability, (1 − m)/(N − 1). In practice there will be more interaction between the tasks within a cluster; the equations can easily be extended to include such cases.

Table II shows the number of destinations that can be reached from processor 0 with each of the routings, as a function of the network size; the switch size of the network is 2 × 2.

    Routing              Number of destinations (MBN of size N)
    FU or BU routing     (N − 1)(p_β + p_γ)
    FW or BW routing     (N − 1) p_δ

    TABLE II: Number of destinations for different network sizes using different routings

It can be observed from the table that a significant number of connections benefit from routings other than the FW or BW routing that is commonly adopted today. Also, the same number of processors use FU or BU routing in two successive network sizes. We can explain this behavior with an example. Consider l = 6 and l = 7,
corresponding to network sizes of 64 and 128 respectively. The networks, though of different sizes, have the same number of destinations for FU routing because the addition of one stage introduces a true center stage (center stage = 3) for l = 7, while there is no true center stage for l = 6. Since the center bit of the CT tag has to be 0 for FU or BU routing, the addition of the center stage does not increase the number of possible FU or BU routings.

The delays r_0, r_i, d_n and d_m depend on (a) the amount of traffic in the network, which in turn is a function of P_u itself, and (b) the service demand at the individual service centers. The queueing analysis for these delays is given next.

C. Queueing delays in switches

To make the analysis simpler, each stage of the network is considered in isolation from the other stages. Consider a queueing center with n inputs. Let q be the probability that there is a packet at one of the inputs in any given cycle, and let t be the service demand of a packet at the service center, in cycles. The number of requests arriving at the queue during the service time of a previous request then forms a Binomial distribution with nt trials and success probability q; the mean number of arriving requests is E = ntq and the variance is V = ntq(1 − q). The average queue length Q at the queueing center can be found using the Pollaczek-Khinchine (P-K) mean value formula [14],

Q = (E² + V) / (2(1 − E))    (10)

The throughput of these requests is E/t. Hence, by Little's law, the mean response time of the center, r, can be derived as

r = Q·t / E = ((1 − q + ntq) t) / (2(1 − ntq))    (11)
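As a sanity check on Eqs. (10) and (11), the sketch below (Python) computes Q from the binomial moments and r via Little's law, and confirms that the result matches the closed form (1 − q + ntq)·t / (2(1 − ntq)) obtained by substituting E = ntq and V = ntq(1 − q). The formulas follow the reconstruction of the equations given above; the example parameter values are arbitrary:

```python
def pk_response_time(n, t, q):
    """Mean response time of a service center with n inputs (Eqs. 10-11).

    Arrivals during a service time of t cycles are Binomial(nt, q), so
    E = n*t*q and V = n*t*q*(1-q).  The P-K mean value formula gives the
    average queue length Q, and Little's law gives r = Q*t/E.
    Requires E < 1 for stability.
    """
    E = n * t * q
    V = n * t * q * (1.0 - q)
    assert E < 1.0, "service center is saturated"
    Q = (E * E + V) / (2.0 * (1.0 - E))   # Eq. 10
    r = Q * t / E                         # Little's law, Eq. 11
    return r

# Example: a stage-i MBN switch bus has n = 2k inputs and t = t_s, with
# q = P_u * p * (1 - m) * p_i (the substitution used for Eq. 12).
r_example = pk_response_time(n=4, t=1, q=0.1)   # 2x2 switch, one-cycle bus
closed_form = (1 - 0.1 + 4 * 1 * 0.1) * 1 / (2 * (1 - 4 * 1 * 0.1))
```
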
Fig. 8. Queues at the BMIN and MBN switches: (a) the MBN switch queue, where packets from k left and k right ports share a single bus; (b) the BMIN switch queue, with 2k inputs and 2k outputs.

Queueing models of an MBN switch and a BMIN switch are shown in Figure 8. In an MBN switch, packets from the k right ports and k left ports contend for the bus. For a switch at stage i, n = 2k, t = t_s and q = P_u · p · (1 − m) · p_i, where p_i is the probability that a packet visits stage i. The mean switch response time r_i of any MBN switch can then be calculated from the following equation:

r_i = ((1 − q + 2k·t_s·q) t_s) / (2(1 − 2k·t_s·q))    (12)

The network delay d_n is the sum of the response times of the stages a packet visits while being routed through the network. In the case of the BMIN, a switch has 2k inputs and 2k outputs, and the request probability at an input or output of a BMIN switch at stage i is P_u · p · (1 − m) · p_i. Following the model shown in Fig. 8b, the response time of a BMIN switch is calculated by using n = 2k, t = t_s and q = P_u · p · (1 − m) · p_i / n. The total network delay d_n is again the sum of the response times of the switches at the different stages.

In both networks, the mean number of requests arriving at a memory module is E_m = q_i + t_m·q_e, where q_i and q_e are the internal and external request probabilities for that memory module, respectively, and the variance is V_m = q_i(1 − q_i) + t_m·q_e(1 − q_e). Hence the average memory queue length is

l_m = ( (q_i + t_m·q_e)² + q_i(1 − q_i) + t_m·q_e(1 − q_e) ) / ( 2(1 − q_i − t_m·q_e) )    (13)
and the mean memory response time, or delay, is d_m = l_m · t_m / E_m.

A packet (or request) takes the optimal path from source to destination; the number of switches traversed depends on the nature of the CT and RCT. The delays derived here are inserted into equations 3, 5, 7 and 8, which in turn are plugged into equation 2 to obtain the processor utilization and the response time of the network. This yields a nonlinear equation with P_u as the single variable, which is solved using an iteration technique. The iteration used to compute the processor utilization P_u can be presented as follows:

1. Initialize P_u with a guess of the expected processor utilization. The better the guess, the fewer iterations the computation needs.
2. Calculate the request probabilities at each stage of the network and at the memory module. An intermediate step is to calculate the static values of p_i (the probability that a stage of the network is traversed).
3. Calculate the mean switch response times r_i and the memory response time r_m.
4. Based on the above values, calculate the network delay and memory delay using the equations provided for α, β, γ and δ.
5. Based on these values, calculate a new processor utilization P_u.
6. Repeat steps 2-5 until the new P_u is within some tolerance of the previous P_u.

An initial value of 0.5 for P_u and a small convergence tolerance were used to generate the analytical results presented in the next section.

V. Results and Discussions

We performed extensive cycle-by-cycle simulations to verify that the proposed routings work, and measured the routing probabilities and network delays [16]. The simulation assumes a synchronous packet-switched distributed memory environment. The simulation specifications are the same as those of the analysis and are detailed below to make the network operation clearer.
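Steps 1-6 above can be sketched as a generic fixed-point loop. In the sketch below (Python), the whole of steps 2-4 is abstracted into a caller-supplied delay_terms(pu) function, and the toy delay model used in the example (delays proportional to P_u) is purely illustrative, not the paper's network model:

```python
def solve_pu(delay_terms, pu0=0.5, tol=1e-6, max_iter=1000):
    """Fixed-point iteration for processor utilization P_u (steps 1-6).

    `delay_terms(pu)` must return the tuple (alpha, beta, gamma, delta)
    of expected delays evaluated at the given utilization -- i.e. it
    carries out steps 2-4 (request probabilities, switch and memory
    response times, per-routing delay terms) for the network modeled.
    """
    pu = pu0                                     # step 1: initial guess
    for _ in range(max_iter):
        a, b, g, d = delay_terms(pu)             # steps 2-4
        new_pu = 1.0 / (1.0 + a + b + g + d)     # step 5 (Eq. 2)
        if abs(new_pu - pu) < tol:               # step 6: convergence test
            return new_pu
        pu = new_pu
    raise RuntimeError("fixed-point iteration did not converge")

# Toy model: total delay 3*pu, so the fixed point satisfies
# pu * (1 + 3*pu) = 1, i.e. pu = (sqrt(13) - 1) / 6 ~ 0.434.
pu = solve_pu(lambda pu: (1.0 * pu, 0.5 * pu, 0.5 * pu, 1.0 * pu))
```
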
Fig. 9. Comparison of analysis and simulation for processor utilizations of the MBN, varying m

In our simulations, each cycle is the time required for the transmission of a packet from one output buffer of a switch to an output buffer of the next stage. This includes the transmission of the packet through the link and the time the switch takes to route it to the corresponding destination buffer. The minimum time taken for a packet to reach memory is determined by the number of switches the routing traverses. All four routings discussed in Section 2 are used in the simulations. The simulator compares each source and destination by running the optimal routing algorithm and then chooses the proper routing. The choice between backward and forward routing is made as follows: all memory requests that could use either forward (FW) or backward (BW) routing use forward routing, and all acknowledgment packets use backward routing, to keep the load on the two routings balanced. Apart from these differences, the routing decisions are based solely on the tags generated by the optimal routing algorithm. The probabilities p, m, etc. are fed to the simulation as input parameters. Upon a memory request, all memories except the local memory are equally likely to be addressed.

In this section we present the relative performance of the BMIN and MBN. We start by comparing the
Fig. 10. Comparison of analysis and simulation for response time of the MBN, varying m

Fig. 11. Comparison of processor utilizations, varying m
Fig. 12. Comparison of response time, varying m

results from the simulation versus those obtained from the analysis of the MBN. Many simulation experiments were run to verify the analytical models developed in this paper, and the simulation results closely matched the analysis under all parameter variations. Here we present some results for a system with 2 × 2 switches; the memory service time is assumed to be 4 cycles. Processor utilization, P_u, is defined as the average amount of useful work the processor does in a given cycle. Response time is defined as the average difference between the time a processor submits a memory request and the time it gets the reply back.

Figures 9 and 10 compare the analysis and simulation results for the processor utilization and response time of the MBN. In both figures, the analytical results match those of simulation very closely, indicating that the independence of queues assumed in the analysis does not introduce much error. The plots show the results as a function of the memory request probability p, which is varied from 0.1 to 1.0, for two values of the local memory request probability m (0.1 and 0.9). For larger values of m, more requests are satisfied without going through the MBN. Thus, for m = 0.9, P_u
Fig. 13. Processor Utilization: Scalability of MBN vs BMIN, varying p

is much higher in Fig. 9 and the response time is much lower in Fig. 10. As p gets larger, more requests are generated, so the response time increases and the processor utilization drops due to the higher amount of traffic and the larger queueing delays.

Figures 11 and 12 compare the performance of the MBN with that of the conventional MIN (CMIN) and the proposed bidirectional MIN (BMIN). The conventional MIN is similar to the network employed in the Butterfly machine [3], where both the request and the response packets travel in one direction, from the processor to the memory side. BMINs, on the other hand, allow all four routings proposed for the MBN. The two plots show the processor utilization and the response time of the three networks for two different values of m. The MBN behaves almost identically to the CMIN and the BMIN in terms of processor utilization. The response time is also the same for all three networks for m = 0.9; for m = 0.1, the BMIN performs better than the MBN, and the MBN performs better than the CMIN.

Figures 13 and 14 show the processor utilization and response times for various system sizes. The results were obtained with the local memory request probability m fixed at 0.5 and for two different values of p (0.1 and 0.5). We can see from the figures that, even as the system size grows, the
Fig. 14. Response Time: Scalability of MBN vs BMIN, varying p

performance of the MBN remains close to that of the BMIN. The curves for the CMIN are left out for clarity, but the MBN always performs better than the CMIN. It can also be seen from the figures that doubling the system size causes only a small reduction in performance, indicating that the MBN is highly scalable for the given traffic load. The processor utilization stays approximately between 0.5 and 0.4 as the system size changes from 32 to 1024 with 2 × 2 switches, whereas the request probability p has a much greater effect on performance.

Finally, we present in Table III the processor utilization and response time of the MBN obtained for different switch sizes and different numbers of processors. Some entries in the table are left empty because an N × N MBN cannot be built using only k × k switches of that size. Both the request probability (p) and the local memory request probability (m) are fixed at 0.5. For N = 64, there is a decrease in P_u when k is increased from 4 to 8, because the MBN becomes less efficient due to the increased contention and delay in an 8 × 8 bus-based switch. On the other hand, in a larger system, when k is increased from 2 to 8, there is a good improvement: the number of switches in the entire network is still quite high, keeping the contention low enough to gain in performance.
TABLE III: Processor utilization and response time of the MBN and BMIN, varying k and N (columns: switch sizes k = 2, 4, 8; entries: P_u and RT for both the MBN and the BMIN at each number of processors)

If we compare the MBN's performance to the BMIN's, we can see that as the switch size increases, the BMIN gives a higher processor utilization and a lower response time. This gain comes from the lower contention in the crossbar switch. However, the BMIN achieves this higher performance at a higher cost. In [8], a cost parameter based on the number of connection points in a switch is presented: the number of connections is k² for a k × k crossbar switch, whereas for a bus the number of connections is 2k. Thus, the total costs of the BMIN and the MBN are kN·log_k N and 2N·log_k N respectively. If these cost parameters are combined with the processor utilization and the response time, the cost-effectiveness of the MBN is higher than that of the BMIN, as shown in [8]. A 4 × 4 switch size works out to be the most cost-efficient for different network sizes and workload inputs.

VI. Execution-driven Simulation and Results

The execution time of an application on a multiprocessor architecture is the ultimate indicator of performance. To show that the MBN performs similarly to the bidirectional MIN (BMIN), we study their performance using an execution-driven simulation of various applications. Our simulator is based on Proteus [10], originally developed at MIT. The original simulator, however, modeled indirect interconnection networks with an analytical model; we have modified it extensively to model the BMIN and the MBN exactly, using 2 × 2 switches and a packet-switching strategy. The system considered in this paper has private cache memories that operate under a directory-based cache-coherence protocol [15]. The node configuration and the network
More informationEE382 Processor Design. Illinois
EE382 Processor Design Winter 1998 Chapter 8 Lectures Multiprocessors Part II EE 382 Processor Design Winter 98/99 Michael Flynn 1 Illinois EE 382 Processor Design Winter 98/99 Michael Flynn 2 1 Write-invalidate
More informationNetworks. Wu-chang Fengy Dilip D. Kandlurz Debanjan Sahaz Kang G. Shiny. Ann Arbor, MI Yorktown Heights, NY 10598
Techniques for Eliminating Packet Loss in Congested TCP/IP Networks Wu-chang Fengy Dilip D. Kandlurz Debanjan Sahaz Kang G. Shiny ydepartment of EECS znetwork Systems Department University of Michigan
More informationScalable Cache Coherence. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Scalable Cache Coherence Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels
More informationMemory Systems in Pipelined Processors
Advanced Computer Architecture (0630561) Lecture 12 Memory Systems in Pipelined Processors Prof. Kasim M. Al-Aubidy Computer Eng. Dept. Interleaved Memory: In a pipelined processor data is required every
More informationCOEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence
1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
More informationperform well on paths including satellite links. It is important to verify how the two ATM data services perform on satellite links. TCP is the most p
Performance of TCP/IP Using ATM ABR and UBR Services over Satellite Networks 1 Shiv Kalyanaraman, Raj Jain, Rohit Goyal, Sonia Fahmy Department of Computer and Information Science The Ohio State University
More informationLondon SW7 2BZ. in the number of processors due to unfortunate allocation of the. home and ownership of cache lines. We present a modied coherency
Using Proxies to Reduce Controller Contention in Large Shared-Memory Multiprocessors Andrew J. Bennett, Paul H. J. Kelly, Jacob G. Refstrup, Sarah A. M. Talbot Department of Computing Imperial College
More informationParallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?
Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing
More informationMulti-Processor / Parallel Processing
Parallel Processing: Multi-Processor / Parallel Processing Originally, the computer has been viewed as a sequential machine. Most computer programming languages require the programmer to specify algorithms
More informationLimits on Interconnection Network Performance. Anant Agarwal. Massachusetts Institute of Technology. Cambridge, MA Abstract
Limits on Interconnection Network Performance Anant Agarwal Laboratory for Computer Science Massachusetts Institute of Technology Cambridge, MA 0139 Abstract As the performance of interconnection networks
More informationReduction of Periodic Broadcast Resource Requirements with Proxy Caching
Reduction of Periodic Broadcast Resource Requirements with Proxy Caching Ewa Kusmierek and David H.C. Du Digital Technology Center and Department of Computer Science and Engineering University of Minnesota
More informationPerformance of Multihop Communications Using Logical Topologies on Optical Torus Networks
Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,
More informationModule 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 10: Introduction to Coherence. The Lecture Contains:
The Lecture Contains: Four Organizations Hierarchical Design Cache Coherence Example What Went Wrong? Definitions Ordering Memory op Bus-based SMP s file:///d /...audhary,%20dr.%20sanjeev%20k%20aggrwal%20&%20dr.%20rajat%20moona/multi-core_architecture/lecture10/10_1.htm[6/14/2012
More informationModule 17: "Interconnection Networks" Lecture 37: "Introduction to Routers" Interconnection Networks. Fundamentals. Latency and bandwidth
Interconnection Networks Fundamentals Latency and bandwidth Router architecture Coherence protocol and routing [From Chapter 10 of Culler, Singh, Gupta] file:///e /parallel_com_arch/lecture37/37_1.htm[6/13/2012
More informationNetwork-on-chip (NOC) Topologies
Network-on-chip (NOC) Topologies 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance
More informationMDP Routing in ATM Networks. Using the Virtual Path Concept 1. Department of Computer Science Department of Computer Science
MDP Routing in ATM Networks Using the Virtual Path Concept 1 Ren-Hung Hwang, James F. Kurose, and Don Towsley Department of Computer Science Department of Computer Science & Information Engineering University
More information3. Evaluation of Selected Tree and Mesh based Routing Protocols
33 3. Evaluation of Selected Tree and Mesh based Routing Protocols 3.1 Introduction Construction of best possible multicast trees and maintaining the group connections in sequence is challenging even in
More informationLecture 2: Topology - I
ECE 8823 A / CS 8803 - ICN Interconnection Networks Spring 2017 http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/ Lecture 2: Topology - I Tushar Krishna Assistant Professor School of Electrical and
More informationEcube Planar adaptive Turn model (west-first non-minimal)
Proc. of the International Parallel Processing Symposium (IPPS '95), Apr. 1995, pp. 652-659. Global Reduction in Wormhole k-ary n-cube Networks with Multidestination Exchange Worms Dhabaleswar K. Panda
More informationNeuro-Remodeling via Backpropagation of Utility. ABSTRACT Backpropagation of utility is one of the many methods for neuro-control.
Neuro-Remodeling via Backpropagation of Utility K. Wendy Tang and Girish Pingle 1 Department of Electrical Engineering SUNY at Stony Brook, Stony Brook, NY 11794-2350. ABSTRACT Backpropagation of utility
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #4 1/24/2018 Xuehai Qian xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Announcements PA #1
More informationModule 5 Introduction to Parallel Processing Systems
Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this
More informationLecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance
Lecture 13: Interconnection Networks Topics: lots of background, recent innovations for power and performance 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees,
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationRelative Reduced Hops
GreedyDual-Size: A Cost-Aware WWW Proxy Caching Algorithm Pei Cao Sandy Irani y 1 Introduction As the World Wide Web has grown in popularity in recent years, the percentage of network trac due to HTTP
More informationreasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap
Storage-Ecient Finite Field Basis Conversion Burton S. Kaliski Jr. 1 and Yiqun Lisa Yin 2 RSA Laboratories 1 20 Crosby Drive, Bedford, MA 01730. burt@rsa.com 2 2955 Campus Drive, San Mateo, CA 94402. yiqun@rsa.com
More informationComputer parallelism Flynn s categories
04 Multi-processors 04.01-04.02 Taxonomy and communication Parallelism Taxonomy Communication alessandro bogliolo isti information science and technology institute 1/9 Computer parallelism Flynn s categories
More informationCS 204 Lecture Notes on Elementary Network Analysis
CS 204 Lecture Notes on Elementary Network Analysis Mart Molle Department of Computer Science and Engineering University of California, Riverside CA 92521 mart@cs.ucr.edu October 18, 2006 1 First-Order
More informationMULTIPROCESSORS. Characteristics of Multiprocessors. Interconnection Structures. Interprocessor Arbitration
MULTIPROCESSORS Characteristics of Multiprocessors Interconnection Structures Interprocessor Arbitration Interprocessor Communication and Synchronization Cache Coherence 2 Characteristics of Multiprocessors
More informationAssignment 5. Georgia Koloniari
Assignment 5 Georgia Koloniari 2. "Peer-to-Peer Computing" 1. What is the definition of a p2p system given by the authors in sec 1? Compare it with at least one of the definitions surveyed in the last
More informationInterconnection networks
Interconnection networks When more than one processor needs to access a memory structure, interconnection networks are needed to route data from processors to memories (concurrent access to a shared memory
More information1. Memory technology & Hierarchy
1. Memory technology & Hierarchy Back to caching... Advances in Computer Architecture Andy D. Pimentel Caches in a multi-processor context Dealing with concurrent updates Multiprocessor architecture In
More informationMultiprocessor Interconnection Networks- Part Three
Babylon University College of Information Technology Software Department Multiprocessor Interconnection Networks- Part Three By The k-ary n-cube Networks The k-ary n-cube network is a radix k cube with
More informationThe final publication is available at
Document downloaded from: http://hdl.handle.net/10251/82062 This paper must be cited as: Peñaranda Cebrián, R.; Gómez Requena, C.; Gómez Requena, ME.; López Rodríguez, PJ.; Duato Marín, JF. (2016). The
More informationReinforcement Learning Scheme. for Network Routing. Michael Littman*, Justin Boyan. School of Computer Science. Pittsburgh, PA
A Distributed Reinforcement Learning Scheme for Network Routing Michael Littman*, Justin Boyan Carnegie Mellon University School of Computer Science Pittsburgh, PA * also Cognitive Science Research Group,
More informationB.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2
Introduction :- Today single CPU based architecture is not capable enough for the modern database that are required to handle more demanding and complex requirements of the users, for example, high performance,
More informationChapter 5 Lempel-Ziv Codes To set the stage for Lempel-Ziv codes, suppose we wish to nd the best block code for compressing a datavector X. Then we ha
Chapter 5 Lempel-Ziv Codes To set the stage for Lempel-Ziv codes, suppose we wish to nd the best block code for compressing a datavector X. Then we have to take into account the complexity of the code.
More informationCluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s]
Fast, single-pass K-means algorithms Fredrik Farnstrom Computer Science and Engineering Lund Institute of Technology, Sweden arnstrom@ucsd.edu James Lewis Computer Science and Engineering University of
More informationA Comparative Study of Bidirectional Ring and Crossbar Interconnection Networks
A Comparative Study of Bidirectional Ring and Crossbar Interconnection Networks Hitoshi Oi and N. Ranganathan Department of Computer Science and Engineering, University of South Florida, Tampa, FL Abstract
More informationShared Memory Architecture Part One
Babylon University College of Information Technology Software Department Shared Memory Architecture Part One By Classification Of Shared Memory Systems The simplest shared memory system consists of one
More informationLecture 30: Multiprocessors Flynn Categories, Large vs. Small Scale, Cache Coherency Professor Randy H. Katz Computer Science 252 Spring 1996
Lecture 30: Multiprocessors Flynn Categories, Large vs. Small Scale, Cache Coherency Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Flynn Categories SISD (Single Instruction Single
More informationArchitecture-Dependent Tuning of the Parameterized Communication Model for Optimal Multicasting
Architecture-Dependent Tuning of the Parameterized Communication Model for Optimal Multicasting Natawut Nupairoj and Lionel M. Ni Department of Computer Science Michigan State University East Lansing,
More informationA Study of Query Execution Strategies. for Client-Server Database Systems. Department of Computer Science and UMIACS. University of Maryland
A Study of Query Execution Strategies for Client-Server Database Systems Donald Kossmann Michael J. Franklin Department of Computer Science and UMIACS University of Maryland College Park, MD 20742 f kossmann
More informationunder Timing Constraints David Filo David Ku Claudionor N. Coelho, Jr. Giovanni De Micheli
Interface Optimization for Concurrent Systems under Timing Constraints David Filo David Ku Claudionor N. Coelho, Jr. Giovanni De Micheli Abstract The scope of most high-level synthesis eorts to date has
More informationScalable Cache Coherent Systems
NUM SS Scalable ache oherent Systems Scalable distributed shared memory machines ssumptions: rocessor-ache-memory nodes connected by scalable network. Distributed shared physical address space. ommunication
More informationAvailability of Coding Based Replication Schemes. Gagan Agrawal. University of Maryland. College Park, MD 20742
Availability of Coding Based Replication Schemes Gagan Agrawal Department of Computer Science University of Maryland College Park, MD 20742 Abstract Data is often replicated in distributed systems to improve
More informationLecture 2: Snooping and Directory Protocols. Topics: Snooping wrap-up and directory implementations
Lecture 2: Snooping and Directory Protocols Topics: Snooping wrap-up and directory implementations 1 Split Transaction Bus So far, we have assumed that a coherence operation (request, snoops, responses,
More information