Performance of Multistage Bus Networks for a Distributed Shared Memory Multiprocessor

Laxmi N. Bhuyan, Ravi R. Iyer, Tahsin Askar, Ashwini K. Nanda and Mohan Kumar

Abstract: A Multistage Bus Network (MBN) is proposed in this paper to overcome some of the shortcomings of conventional multistage interconnection networks (MINs), single-bus and hierarchical bus interconnection networks. The MBN consists of multiple stages of buses connected in a manner similar to the MINs and has the same bandwidth at each stage. A switch in an MBN is similar to a MIN switch except that there is a single bus connection instead of a crossbar. MBNs support bidirectional routing, and there exist a number of paths between any source and destination pair. In this paper we develop self-routing techniques for the various paths, present an algorithm to route a request along the path with minimum distance, and analyze the probabilities of a packet taking different routes. Further, we derive a performance analysis of a synchronous packet-switched MBN in a distributed shared memory environment and compare the results with those of an equivalent bidirectional MIN (BMIN). Finally, we present the execution times of various applications on the MBN and the BMIN through execution-driven simulation. We show that the MBN provides performance similar to a BMIN while offering simpler hardware and more fault tolerance than a conventional MIN.

Keywords: interconnection network, routing, queueing model, performance analysis, packet switching, execution-driven simulation

This research was supported by NSF grants MIP and . L. N. Bhuyan and R. Iyer are with the Department of Computer Science, Texas A&M University, College Station, TX; (bhuyan,ravi)@cs.tamu.edu. T. Askar is with Advanced Micro Devices, Austin, TX. A. Nanda is with IBM T. J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY. M. Kumar is with the Department of Computer Science, Curtin University of Technology, GPO Box U 1987, Perth, WA 6001, Australia.
I. Introduction

In order to achieve significant performance in parallel computing, it is necessary to keep the communication overhead as low as possible. The communication overhead of a multiprocessor system depends to a great extent on the underlying interconnection network. An interconnection network (IN) can be either static or dynamic. Dynamic networks can connect any input to any output by enabling some switches, and they are applicable to both shared memory and message passing multiprocessors. Among such dynamic INs, hierarchical buses or rings [1], [2] and Multistage Interconnection Networks (MINs) [3], [4] have been employed commercially. In a strictly hierarchical bus architecture [1], a number of buses are connected in the form of a tree between the processors and the memories. The use of multiple buses makes hierarchical bus-based systems more scalable than the popular single-bus multiprocessors. However, the bandwidth of this interconnection decreases as one moves toward the top of the tree; thus, the scalability of a hierarchical bus system is limited by the bandwidth of the topmost bus. This bandwidth problem can be alleviated through the fat-tree design [5]. The simplicity of bus-based designs and the availability of a fast broadcasting mechanism make bus-based systems very attractive. The MINs, on the other hand, offer a uniform bandwidth across all stages of the network. The bandwidth of the network increases in proportion to the system size, making the MIN a highly scalable interconnection. The switches in a MIN are small crossbar switches. When the system size grows, bigger switches can be used to keep the number of stages and, hence, the memory latency low [6]. However, the complexity of a crossbar switch grows as the square of its size, and therefore the total network cost becomes predominant in larger systems.
We have observed that the traffic in the network is very low, making the crossbar-based MIN switches highly underutilized. In a system using private caches, which is common in today's shared memory multiprocessors, the effective traffic handled by the switches in the network is reduced even further.
A novel interconnection scheme, called the Multistage Bus Network (MBN), is introduced in this paper that combines the positive features of hierarchical buses and MINs. The MBN consists of several stages of buses, with an equal number of buses at each stage. This provides a uniform bandwidth across the stages and forms multiple trees between processors and memories. Unlike hierarchical bus networks, the MBN comprises multiple buses at higher levels, reducing the traffic at those levels. Maintaining cache coherence is a major problem in shared memory multiprocessors. Unlike MINs, snoopy cache coherence protocols can be applied to the MBN [7], which can improve performance to a large extent. The MBN also provides much better fault tolerance and reliability than a conventional MIN [8]. It is known that a distributed shared memory organization has better scalability than a centralized organization [2], [3]. In such an organization, a request or response packet can make U-turns in the network and reach a destination quickly, since the intermediate levels of an MBN consist of buses and bidirectional connections. Four different routing techniques [8] are presented in this paper. We also develop equations for the probability of taking each path based on the memory requests. In order to make a realistic comparison with MINs, we introduce the design and analysis of a corresponding Bidirectional MIN (BMIN). The BMIN allows U-turns, and a packet can be routed based on the same techniques presented here for the MBN. Recently, Xu and Ni [9] have discussed a U-turn strategy for bidirectional MINs as applicable to the IBM SP architecture [4]. However, the MIN employed in SP architectures is cluster-based and works differently than the proposed MBN or BMIN. In this paper, we analyze the performance of an MBN for distributed shared memory multiprocessors based on different self-routing techniques.
Unlike the previous analysis [8], the present analysis is based on routing along the minimum of the four paths for a given source and destination pair. The MBN has some inherent fault tolerance capability due to the number of switch-disjoint paths between any source and destination pair. In this paper, we concentrate only on the routing and performance evaluation
using queueing analysis and execution-driven simulations of various applications. Our execution-driven simulator is an extended version of Proteus [10] that simulates the behavior of a cache-coherent distributed shared memory multiprocessor for various applications.

Fig. 1. A distributed shared memory multiprocessor

The rest of the paper is organized as follows. We present the structure of the MBN and introduce four types of self-routing techniques in Section 2. We define the routing tags required to implement the four routing strategies in Section 3 and, in the same section, present an algorithm that finds the optimal path in the network for a given source-destination pair. A performance analysis of the MBN and BMIN is then presented in Section 4. Results and comparisons with the conventional and bidirectional MINs are presented in Section 5. Section 6 presents the execution-driven simulation specifications and results. Finally, Section 7 concludes the paper.

II. Structure of the MBN

We consider a distributed shared memory (DSM) architecture throughout this paper. In such an environment, the memory modules are directly connected to the corresponding processors, as shown in Figure 1, but the address space is shared. An example of a hierarchical bus interconnection with two levels of buses is shown in Figure 2a [1]. In this example, there are 16 processors, 4 memories, four level-1 buses and one level-2 bus. Naturally, the top-level bus is the bottleneck in the system. In order to improve the performance, a number of buses must be connected at the top level with an interleaved memory design. Such a connection is shown in Figure 2b for a 16*16 system with two levels of buses.
Fig. 2. Hierarchical Bus Interconnection and the MBN: (a) a 16-processor hierarchical bus system; (b) a 16*16 MBN-based system using 4*4 switches (M: memory, P: processor)

We propose that each bus, along with its controller, be placed in a switch analogous to a MIN switch. Such a network is called a Multistage Bus Network (MBN). In an N*N multistage network using k*k switches, there are l = log_k N stages of switches, numbered from stage 0 to stage l − 1, as shown in Fig. 3a. Every switch has a set of left connections closer to the processor side and a set of right connections closer to the memory side. The construction of a 4*4 MBN switch incorporating a bus, a bus access controller and output buffers is shown in Figure 3b. There are control lines associated with each port to carry arbitration information to the bus access controller. Suzuki et al. have studied a similar bus structure in [11]. We also propose a Bidirectional MIN (BMIN) structure for comparison. The difference between the switch architectures of the BMIN and the MBN is evident from Figs. 3b and 3c: the BMIN switch is a crossbar, whereas the MBN switch is a bus. For both networks, a packet from a stage i is passed on to stage i + 1, or vice versa, using the destination tag digits. For a k*k MBN switch there will be up to 2k packets (k inputs from either side) potentially competing for the bus in a cycle. When there is more than one such packet, the bus access controller chooses one of them at random; the others are queued to be transmitted later. In a k*k BMIN switch, on the other hand, all 2k inputs can be connected to the 2k outputs if the requests are to different destinations. The k*k MBN and BMIN switches support forward, backward, and turn-around connections, as explained in the next section.

Fig. 3. Comparing switch architectures: (a) a 16*16 multistage network; (b) 4*4 MBN switch architecture; (c) 4*4 BMIN switch architecture

We describe the structure of the MBN below; the structure of the BMIN is similar. The processors P_0, P_1, ..., P_{N−1} are connected to the left connections of the MBN switches at stage 0. Memory modules M_0, M_1, ..., M_{N−1} are connected to the right connections at stage l − 1. Memory module M_i is also directly connected to processor P_i and is called the local memory of P_i. A source is assigned a tag S = s_0 s_1 ... s_j ... s_{l−1} and a destination is assigned a destination tag D = d_0 d_1 ... d_j ... d_{l−1}, where s_j and d_j are digits in the k-ary system. The digits s_0 and d_0 are the most significant, and s_{l−1} and d_{l−1} are the least significant digits. The connection between stages in the MBN is a k-shuffle [6], which means the right connection at position a_0 a_1 ... a_{l−1} of stage i is connected to the left connection at position a_1 ... a_{l−1} a_0 of stage i + 1, for i = 0, 1, ..., l − 2. A memory request is satisfied internally by the local memory when the source tag and the destination tag of a request are the same. If the tags are different, the request travels to a remote memory through the MBN. As an example, a 16*16 MBN with 2*2 switches is shown in Figure 4. There may or may not be a shuffle interconnection before the first stage of switches. Our routings are developed based on Figure 4, where there is no shuffle before the first stage. Hence, a set of processors with their memories are
connected to one switch at the first stage and to another switch at the same position at the last stage. If there were a k-shuffle connection before the first stage, a different set of processors would be connected to the first-stage and last-stage switches. In Figure 4, a request travels in the forward direction when it starts from the processor side and passes through stages 0, 1, ..., (l − 1), in that order. It travels in the backward direction when it starts from the memory side and passes in the reverse direction through stages (l − 1), ..., 1, 0, as shown in Figure 5. A packet can also travel from left to right and make a U-turn at an intermediate stage, as shown in Figure 6. This is called Forward-U (FU) routing. Similarly, Figure 7 shows Backward-U (BU) routing, where a message enters the network from the right and makes a U-turn. These four routings provide four distinct paths between a source and a destination in the MBN. As a result, the fault tolerance and reliability of the MBN are much better than those of a conventional MIN. Exact expressions for the MBN reliability are derived in [8]; they are also valid for the BMINs introduced in this paper. In conventional MINs like the Omega, Delta and GSN [6], the destination tag is used for self-routing of a request only in the forward direction. In the MBN, the destination tag can likewise be used for self-routing in the forward direction. Since the stage-0 connections are straight instead of a k-shuffle, however, the destination tag itself cannot be used for self-routing in the backward direction. As explained later, the routing tag in the backward case is obtained by reverse-shuffling the destination tag by one digit. In order to determine where to take a turn in the two routing techniques involving U-turns, we need to combine the source tag and the destination tag into a combined tag. The following definitions are needed to develop exact routing algorithms later.
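The k-shuffle wiring between stages amounts to a left rotation of the l-digit base-k representation of a port position. A minimal Python sketch (the function name is ours, not the paper's):

```python
def k_shuffle(position, k, l):
    """Map a right connection of one stage to the left connection of the next
    stage: rotate the l-digit base-k representation of position left by one,
    i.e. a_0 a_1 ... a_{l-1} -> a_1 ... a_{l-1} a_0."""
    digits = []
    for _ in range(l):                 # extract base-k digits, least significant first
        digits.append(position % k)
        position //= k
    digits.reverse()                   # digits[0] is now the most significant, a_0
    rotated = digits[1:] + digits[:1]  # left-rotate by one digit
    value = 0
    for d in rotated:                  # reassemble the base-k number
        value = value * k + d
    return value
```

For example, in a 16*16 network with 2*2 switches (k = 2, l = 4), right position 0011 of stage i connects to left position 0110 of stage i + 1, i.e. `k_shuffle(3, 2, 4)` returns 6. Since the rotation is a bijection, the mapping is a permutation of all N positions.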
Definition 1 (FRT): The Forward Routing Tag (FRT) is the same as the destination tag of a memory request, i.e., FRT = d_0 d_1 ... d_{l−1}.

Definition 2 (BRT): The Backward Routing Tag (BRT) is the destination tag reverse-shuffled by one digit. If d_0 d_1 ... d_{l−1} is the destination tag, then BRT = b_0 b_1 ... b_{l−1} = d_{l−1} d_0 d_1 ... d_{l−2}, where b_j = d_{(j−1) mod l}.

Definition 3 (CT): The Combined Tag (CT) is the digit-wise exclusive-OR of the source tag and the
destination tag, i.e., CT = c_0 c_1 ... c_j ... c_{l−1}, where c_j = s_j ⊕ d_j. The operation ⊕ here means c_j = 0 if s_j = d_j, and c_j = 1 if s_j ≠ d_j. Note that although the digits in S and D are k-ary, the digits in the CT are binary.

Definition 4 (RCT): The Rotated Combined Tag (RCT) is the Combined Tag (CT) reverse-shuffled, or right-rotated, by one digit, i.e., RCT = r_0 r_1 ... r_{l−1} = c_{l−1} c_0 c_1 ... c_{l−2}, where r_j = c_{(j−1) mod l}.

Definition 5 (FTS): The Forward Turning Stage (FTS) is the rightmost nonzero position in the Rotated Combined Tag (RCT). That is, FTS = m such that r_m = 1 and r_j = 0 for m < j ≤ l − 1.

Definition 6 (BTS): The Backward Turning Stage (BTS) is the leftmost nonzero position in the Combined Tag (CT). That is, BTS = n such that c_n = 1 and c_j = 0 for 0 ≤ j < n.

The routing tags FRT and BRT are used for self-routing in the forward and backward directions, respectively. The tags RCT and CT are used to find the U-turn stages FTS and BTS, which determine where to take forward and backward turns during U-turn routings. The various routing schemes possible in an MBN are described below.

III. Routing Algorithms for MBN

In this section, we first present the four routing techniques for the MBN and then present an algorithm that chooses the path with minimum distance. Although these techniques are described for the MBN, they are equally valid for the BMIN.

A. Routing Techniques

A. Forward (FW) Routing: In Forward (FW) routing, a request from source processor S moves from stage 0 through stage l − 1 of the MBN to the destination memory D. An example of FW routing from source 0011 to destination 1011 is shown with a bold line in Figure 4. The jth digit of the forward routing tag (FRT) is used by a switch at stage j for self-routing. Thus, a request that started
at position S = s_0 s_1 ... s_j ... s_{l−1} at the left of stage 0 is switched to position s_0 s_1 ... s_{l−2} d_0 at the right of stage 0, then undergoes a k-shuffle and arrives at position s_1 s_2 ... s_{l−2} d_0 s_0 at the input of stage 1, where it is switched to s_1 s_2 ... s_{l−2} d_0 d_1 at the output of stage 1. In general, when a request arrives at position s_j s_{j+1} ... s_{l−2} d_0 d_1 ... d_{j−1} s_{j−1} at the left of stage j, it is switched to position s_j ... s_{l−2} d_0 d_1 ... d_{j−1} d_j at the right of stage j, goes through a k-shuffle (except at the last stage, j = l − 1) and arrives at position s_{j+1} ... s_{l−2} d_0 d_1 ... d_j s_j at the left of stage j + 1. Finally it reaches the destination d_0 d_1 ... d_j ... d_{l−1} at the output of the last stage of the MBN.

Fig. 4. Forward (FW) routing in MBN (S = 0011, D = 1011, FRT = 1011)

B. Backward (BW) Routing: In Backward (BW) routing, a request from source node S moves backward from stage l − 1 through stage 0 to the destination node D. An example of BW routing from 0011 to 1011 is shown with a bold line in Figure 5. The jth digit of the backward routing tag (BRT) is used by a switch at stage j for self-routing. Thus, a request that started at position S = s_0 s_1 ... s_j ... s_{l−1} at the right of stage l − 1 is switched to position s_0 s_1 ... s_{l−2} d_{l−2} at the left of stage l − 1, then undergoes a reverse k-shuffle and arrives at position d_{l−2} s_0 s_1 ... s_{l−2} at the right of stage l − 2. In general, when a request arrives at position d_j d_{j+1} ... d_{l−2} s_0 s_1 ... s_{j−1} s_j at the right of stage j, it is switched to position d_j d_{j+1} ... d_{l−2} s_0 s_1 ... s_{j−1} d_{(j−1) mod l} (the jth digit of the BRT, b_j = d_{(j−1) mod l}) at the
left of stage j. Then it goes through a reverse shuffle (except at the last stage, j = 0) and arrives at position d_{j−1} d_j d_{j+1} ... d_{l−2} s_0 s_1 ... s_{j−1} at the left of stage j − 1.

Fig. 5. Backward (BW) routing in MBN (S = 0011, D = 1011, BRT = 1101)

C. Forward-U (FU) Routing: In Forward-U routing, the request starts from source S at stage 0, follows FW routing (using the FRT) up to stage FTS − 1 and reaches the left of stage FTS at position s_{FTS} s_{FTS+1} ... s_{l−2} d_0 d_1 ... d_{FTS−1} s_{FTS−1}. At stage FTS it takes a U-turn instead of being switched to the right of stage FTS: the request is switched to the left of stage FTS at position s_{FTS} s_{FTS+1} ... s_{l−2} d_0 d_1 ... d_{FTS−1} d_{(FTS−1) mod l} and follows BW routing (using the BRT) down to stage 0. Finally it reaches position d_0 d_1 ... d_{FTS−1} s_{FTS} ... s_{l−2} d_{l−1} at the left of stage 0. An example of FU routing from 0010 to 0110 is shown with a bold line in Figure 6.

D. Backward-U (BU) Routing: In Backward-U routing, the request starts from source S at stage l − 1, follows BW routing (using the BRT) up to stage BTS + 1 and reaches position d_{BTS} d_{BTS+1} ... d_{l−2} s_0 s_1 ... s_{BTS−1} s_{BTS} at the right of stage BTS. At stage BTS it takes a U-turn instead of being switched to the left of stage BTS: the request is switched to the right of stage BTS at position d_{BTS} d_{BTS+1} ... d_{l−2} s_0 s_1 ... s_{BTS−1} d_{BTS} and follows FW routing (using the FRT) up to stage l − 1. Finally the request reaches position s_0 s_1 ... s_{BTS−1} d_{BTS} d_{BTS+1} ... d_{l−1} at the right of stage l − 1. An
example of BU routing from 0010 to 0110 is shown in Figure 7.

Fig. 6. Forward-U (FU) routing in MBN (S = 0010, D = 0110, FRT = 0110, BRT = 0011, CT = 0100, RCT = 0010, FTS = 2)

B. Optimal Path Algorithm

The distance between a source and a destination in an MBN is defined as the minimum number of switches that the packet has to traverse. For a conventional MIN, this distance is always equal to l, the number of stages in the network. In an MBN, however, the distance may be less than l if FU or BU routing is chosen. The FU and BU (Forward-U and Backward-U) routings are used when the turning stage falls before the center stage of the network. Therefore, there is a net saving in terms of distance between a given source and all the destinations. Detailed expressions for the overall savings in distance for such an MBN are given in Section 4. We present below an algorithm to choose the optimal routing for a given source-destination pair.

Optimal Path Algorithm
1. S = s_0 s_1 ... s_{l−1}
Fig. 7. Backward-U (BU) routing in MBN (S = 0010, D = 0110, FRT = 0110, BRT = 0011, CT = 0100, RCT = 0010, BTS = 1)

2. D = d_0 d_1 ... d_{l−1}
3. CT = S ⊕ D = c_0 c_1 ... c_j ... c_{l−1}
4. RCT = c_{l−1} c_0 c_1 ... c_j ... c_{l−2}
5. d_l = ⌊l/2⌋, d_u = ⌈l/2⌉
6. IF (source = destination)
7. THEN request is to local memory
8. ELSE
9. Find FTS and BTS (from the tags RCT and CT, respectively)
10. IF (FTS = 0 AND BTS = l − 1)
11. THEN select Forward-U (FU) routing OR Backward-U (BU) routing
12. ELSE IF (FTS < d_l)
13. THEN select Forward-U (FU) routing
14. ELSE IF (BTS ≥ d_u)
15. THEN select Backward-U (BU) routing
16. ELSE
17. select Forward (FW) routing OR Backward (BW) routing

The optimal path algorithm chooses a route that has minimum path length. Given a source S = s_0 s_1 ... s_{l−1} and a destination D = d_0 d_1 ... d_{l−1}, the algorithm computes the tags described earlier in this section and then compares them to decide which of the four routings gives the minimum path length through the network. In the algorithm, d_l = ⌊l/2⌋ and d_u = ⌈l/2⌉ correspond to the center stage of the MBN. It must be pointed out that the optimal routing between two nodes is fixed in a given network. Hence the optimal path can be precomputed and stored in a table that is read when a request is issued; there is no need to execute the algorithm every time a message is sent. If the source and the destination are the same, the request is for the local memory, and no traversal through the MBN is required. All other requests pass through at least one stage of the MBN. The memories that are connected to a processor through the first or last stage of the MBN are called cluster memories; similarly, processors that are one switch away from a memory are called the cluster processors of that memory. Requests to cluster memories require that only one switch be traversed; thus, when FTS = 0 and BTS = l − 1, FU or BU routing is taken to serve this purpose. If this condition is not satisfied, we still check for FU or BU routing, because these would be the next possible minimum paths. If FTS < ⌊l/2⌋ or BTS ≥ ⌈l/2⌉, the turning stage lies before (respectively after) the center stage. This reduces the total path length to less than l, and thus FU or BU routing is selected. If none of the above conditions is true, we have FTS ≥ ⌊l/2⌋ and BTS < ⌈l/2⌉; in this case, Forward (FW) routing or Backward (BW) routing are the only options.
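The Optimal Path Algorithm above, together with the path-length formulas that follow, can be written directly in Python. The sketch below takes source and destination tags as digit lists (the digit-list representation and all names are our choices, not the paper's):

```python
def optimal_routing(s_digits, d_digits):
    """Return the minimum-distance routing and its path length in switches
    for k-ary source/destination digit lists (most significant digit first)."""
    l = len(s_digits)
    if s_digits == d_digits:
        return "LOCAL", 0                              # served by the local memory
    ct = [0 if s == d else 1 for s, d in zip(s_digits, d_digits)]  # Combined Tag
    rct = ct[-1:] + ct[:-1]                            # RCT: right-rotate CT by one
    fts = max(j for j in range(l) if rct[j] == 1)      # rightmost 1 in RCT
    bts = min(j for j in range(l) if ct[j] == 1)       # leftmost 1 in CT
    if fts == 0 and bts == l - 1:                      # cluster memory
        return "FU/BU", 1
    if fts < l // 2:                                   # U-turn before the center stage
        return "FU", 2 * fts + 1
    if bts >= (l + 1) // 2:                            # U-turn after the center stage
        return "BU", 2 * (l - 1 - bts) + 1
    return "FW/BW", l                                  # full forward or backward path
```

Consistent with the Table I examples, in a 1024*1024 network with 2*2 switches (l = 10), source 0 reaches destination 2 fastest by BU routing (3 switches) and destination 256 by FU routing (5 switches).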
The actual path lengths, in terms of the number of switches traversed, are presented below:

Local memory: 0 switches (the MBN is not traversed)
Forward (FW) or Backward (BW) routing: l switches
Forward-U (FU) routing:
1. Cluster memories: 1 switch
2. Other memories: 2·FTS + 1 switches
Backward-U (BU) routing:
1. Cluster memories: 1 switch
2. Other memories: 2·(l − 1 − BTS) + 1 switches

These path length equations can be used to form a table for a given source and destination. As an example, Table I shows the path lengths from source 0 to different destinations in a 1024*1024 network.

TABLE I. The path lengths for each of the routings (FW/BW, FU, BU) given source = 0 and different destinations i

The path length for each routing is quite different, and thus a routing algorithm is required to route the request through the optimal path. For example, if the destination is 2, then Backward-U routing results in the optimal path length; if the destination is 256, then Forward-U routing results in the optimal path length. The other two requests should use forward or backward routing strategies.

IV. Performance of the MBN

The Multistage Bus Network (MBN) is analyzed here in the distributed shared memory environment shown in Figure 1. We also analyze the BMIN and compare its results with those of the MBN. In both cases, the memory module M_i is directly connected to the processor P_i and is called the local memory of P_i. Requests from a processor to its local memory are called internal requests and are
carried over the internal bus between the processor and its local memory. A memory can also receive external requests, which originate from other processors and are carried over the MBN.

A. Network operation

In a distributed memory system, there are k − 1 processors that can be reached through the size-k switch at the first stage or the last stage to which P_i is connected. Thus, an external request destined to a cluster processor or memory returns from the first stage (Forward-U routing) or the last stage (Backward-U routing) without going through the whole MBN. If the request is neither to a local nor to a cluster memory, however, it may take any of the four routings described earlier. Both internal and external requests arrive at a memory queue. Only one of them is selected for service, on an FCFS basis, while the remaining requests are queued in the buffer of the memory. After receiving a request, a memory module sends a reply packet either directly to its local processor or to another processor through the network, depending on whether the request is internal or external. We compare the performance of the MBN with that of a BMIN. The transmission of request and reply packets goes through the network following the routings given earlier in the paper. We assume a synchronous, packet-switched system for analyzing the multistage networks. Since a buffer size of four or more gives the same effect as an infinite buffer [12], [13], for simplicity we assume an infinite buffer for the MBN and BMIN. The analysis can be extended to finite buffers, but the equations become fairly complicated [13]; since our aim here is to analyze the routing schemes, we prefer to give the basic infinite-buffer analysis. The bus service time (for the MBN) or the link service time (for the BMIN) to transfer a message forms one system cycle time. The service times of the memory modules are assumed to be integral multiples of this system cycle time.
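The synchronous MBN switch model (up to 2k packets compete for the bus each cycle; one is served, the rest are queued) can be illustrated with a toy simulation. The function and the Bernoulli per-port arrival model are our own simplifications, not the paper's analysis:

```python
import random

def bus_utilization(arrival_prob, ports=8, cycles=20000, seed=1):
    """Toy synchronous model of one MBN bus switch: each of the 2k ports
    independently offers a packet with probability arrival_prob per cycle;
    the bus serves exactly one queued packet per cycle. (Which waiting packet
    wins the random arbitration does not affect aggregate utilization.)"""
    rng = random.Random(seed)
    queued = 0
    busy_cycles = 0
    for _ in range(cycles):
        queued += sum(rng.random() < arrival_prob for _ in range(ports))
        if queued:
            queued -= 1
            busy_cycles += 1
    return busy_cycles / cycles
```

With a light offered load (e.g. `arrival_prob = 0.01` on 8 ports) the bus is busy only about 8% of the time, consistent with the observation in the introduction that network switches are often underutilized.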
A processor is represented by a delay center: in a given cycle, if it is busy computing, it submits a memory request with some given probability. Once it sends the memory request, the processor remains idle until the memory response packet (in the case of a read) or acknowledgment (in the case of a write) is received. The various
system parameters are defined below:

k*k : size of the MBN or MIN switches
N : number of processors or memories in the system
l : number of stages in the IN, l = log_k N
t_s : switch service time
t_m : memory service time
p : probability that a processor submits a memory request in a given cycle, provided it is busy
m : probability that a processor requests its local memory, provided it has made a memory request
p_i : probability that a request passes through stage i
r_i : mean response time of the switch at stage i, 0 ≤ i ≤ l − 1
q_i : average number of local requests by a processor per cycle
q_e : average number of remote requests from a processor per cycle
d_n : total delay in the network (considering all stages)
l_m : average queue length in a memory module
d_m : average delay in a memory module
P_u : processor utilization (fraction of time the processor is busy)

The performance analysis of the MBNs and BMINs is carried out under the following assumptions [12], [13]. Packets are generated at each source node by independent and identically distributed random processes. At any point in time, a processor is either busy doing internal computation or waiting for the response to a memory request. If there is no pending request, each busy processor generates a packet with probability p in each cycle. The probability that this request is to the local
memory (internal request) is m, and the probability that it is to any other memory module (external request) is (1 − m). A reply from memory travels in the opposite direction through the same path in the MBN or BMIN. It may be noted that in a MIN like the Butterfly [3], a reply has to traverse the network in the same direction (i.e., from the processor to the memory side) to reach the requesting processor, because the MIN has unidirectional links. In [9], bidirectional links are used between stages, and hence the request and reply messages may travel in the forward and backward directions, respectively. The messages from processor to memory are generated using the probabilities specified below:

Request probability (p): The request probability is used as a means of modeling processor behavior in terms of memory requests. When a processor is busy computing, i.e., no request is outstanding in the switches or a memory module, it can send a memory request. In each cycle, the processor decides whether or not a message is to be sent based on this probability. On average, it takes 1/p cycles for the processor to send out a request.

Local memory request probability (m): Given that a request is to be made to memory, the probability m is used to decide whether the request is to local or external memory.

Though simple, the above probabilities play an important role and are the only inputs to the analysis. After each request to memory, the processor waits for an acknowledgment. Once an acknowledgment is received, the processor does useful computation for one cycle and then, based on the above probabilities, decides whether to continue or to send another request to memory.

Processor utilization: The processor utilization P_u, defined as the fraction of time a processor is busy, is determined by the waiting time and service time faced by a request at the various service centers.
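Under this model, the number of cycles a busy processor waits before issuing a request is geometrically distributed with mean 1/p, which a quick Monte Carlo check confirms (the code and names are our own illustration):

```python
import random

def cycles_until_request(p, rng):
    """Count cycles until a busy processor issues a request,
    with one Bernoulli(p) trial per cycle."""
    cycles = 1
    while rng.random() >= p:
        cycles += 1
    return cycles

rng = random.Random(42)
p = 0.25
samples = [cycles_until_request(p, rng) for _ in range(100000)]
mean = sum(samples) / len(samples)  # close to 1/p = 4 cycles
```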
In a number of applications, a large portion of the requests are made to the cluster processors. In [8], we studied the performance of the MBN with varying probabilities of cluster requests. In that study, Forward-U and Backward-U routings were allowed only at the first and last stages; all other requests were routed by Forward (FW) routing. The processor utilization for such a case is given by
the following equation:

P_u = 1 / {1 + p(1 − m)(1 − m_1)(d_n + d_m) + p·m·d_m + p(1 − m)·m_1·(2r_0 + d_m)}     (1)

where m_1 is the probability that an external request is to a cluster memory. In this paper, a message in the MBN or BMIN is sent along the minimum-distance path. In such a case,

P_u = 1 / (1 + α + β + γ + δ)     (2)

where α corresponds to the expected delay for a local memory request to be served; β corresponds to the expected delay for serving requests to cluster memories; γ corresponds to the expected delay for serving all requests, except those to cluster memories, that follow FU or BU routing; and δ corresponds to the expected delay for serving all requests that follow Forward (FW) or Backward (BW) routing. The derivation of the terms α, β, γ and δ is presented below. These terms depend on (a) the routing probabilities along each path, (b) the amount of traffic in the network, and (c) the service demand at the individual service centers. Thus we get a non-linear equation with P_u as the single variable, which is solved using iteration techniques.

B. Routing probabilities and path delays

The routing probabilities and path delays are derived here for the MBN and BMIN under the assumption that all the non-local memories are addressed equally by a processor. These equations can be modified for non-uniform remote memory references. Since the path length of Backward (BW) routing is the same as that of FW routing, we derive the term based on FW routing and multiply it by 2 to include BW routing. A similar method is used for FU and BU routing as well.
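The routing classes partition the N − 1 remote destinations of a source, so the counting arguments used below for the cluster probability (k − 1)/(N − 1) and the non-cluster FU/BU probability can be cross-checked by enumeration. The following sketch (our own check, not from the paper) classifies every destination of source 0 in a 1024*1024 MBN with 2*2 switches by its turning stages:

```python
def turning_stages(src, dst, k, l):
    """Return (FTS, BTS) computed from the combined tag of two k-ary tags."""
    ct = []
    for i in range(l - 1, -1, -1):     # most significant digit first
        ct.append(0 if (src // k**i) % k == (dst // k**i) % k else 1)
    rct = ct[-1:] + ct[:-1]            # right-rotate CT by one digit
    fts = max(j for j in range(l) if rct[j] == 1)  # rightmost 1 in RCT
    bts = min(j for j in range(l) if ct[j] == 1)   # leftmost 1 in CT
    return fts, bts

k, l = 2, 10
N = k ** l
cluster = fu = bu = fwbw = 0
for dst in range(1, N):                # every remote destination of source 0
    fts, bts = turning_stages(0, dst, k, l)
    if fts == 0 and bts == l - 1:
        cluster += 1
    elif fts < l // 2:
        fu += 1
    elif bts >= (l + 1) // 2:
        bu += 1
    else:
        fwbw += 1

# closed forms: (k - 1) cluster memories, and
# 2 * sum_{i=1}^{floor(l/2)-1} (k - 1) * k^i non-cluster FU/BU destinations
assert cluster == k - 1
assert fu + bu == 2 * sum((k - 1) * k**i for i in range(1, l // 2))
assert cluster + fu + bu + fwbw == N - 1
```

The enumeration confirms that the four classes are disjoint and exhaustive, which is what allows the per-class probabilities below to be summed in equation (2).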
Local memory requests (α): A local memory request does not involve any switch traversal, so the only delay is that of servicing the request at the memory module (d_m). Given that the probability of a processor issuing a memory request is p and that of the request being local is m, we can deduce that

α = p · m · d_m    (3)

Cluster routing (β): Requests to cluster processors travel to the first- or last-stage switch and take an FU or BU routing to the destination processor. All source-destination pairs in which all bits of the CT except the least significant log_2 k bits are zero entail this type of routing. Thus the number of cluster memories for a given source is k − 1, since k × k is the size of an MBN or BMIN switch. The switch at stage 0 is traversed once for reaching the cluster memory and once for sending back the acknowledgment. Given that an external memory is requested, the probability of requesting a cluster memory can be expressed as

p_β = (k − 1) / (N − 1)    (4)

Thus we have a delay of 2r_0 for the switch traversals and d_m for the memory service, giving the following equation:

β = p(1 − m) · ((k − 1) / (N − 1)) · (2r_0 + d_m)    (5)

Non-cluster FU or BU routing (γ): In forward-U and backward-U routing, the request traverses in one direction up to a particular stage (as explained in Section 2) and makes a U-turn to reach the destination processor. Given the turning stage, FTS, the path length is 2·FTS + 1: the turning stage is traversed only once, while all stages to its left are traversed twice, though not necessarily through the same switch. We should have FTS < ⌊l/2⌋ for path
length optimization, and BTS ≥ ⌈l/2⌉ for optimal path length. Since we have already covered cluster memories (FTS = 0, BTS = l − 1), we start with FTS ≥ 1 and BTS ≤ l − 2. Consider FTS < ⌊l/2⌋; a similar derivation can be done for BTS ≥ ⌈l/2⌉. The total number of destinations is N − 1. For a given turning stage i, 1 ≤ i < d, since the FTS is determined by the rightmost nonzero digit of the tag, all digits to the left of this position may take any value, which gives k^i combinations. As discussed in Section 3, the Rotated Combined Tag (RCT) is defined as the digit-wise EX-OR of the source and destination tags; thus each digit of the RCT is either zero or nonzero regardless of the source and destination tags, and the number of ways in which a digit of the RCT can be nonzero is k − 1. Thus, given that an external memory is requested, the probability of non-cluster FU and BU routing is

p_γ = 2 · Σ_{i=1}^{d−1} ((k − 1) / (N − 1)) · k^i    (6)

where d = ⌊l/2⌋. The delay of such a routing depends on the stage at which the U-turn takes place, so within the summation of the above equation we must include the delay of each switch traversed on that particular path. As discussed above, for a turning stage FTS = i we traverse all stages to the left of the turning stage twice, so the delay excluding the turning stage is 2·(Σ_{j=0}^{i−1} 2r_j); this term is multiplied by two because it accounts for the acknowledgment packets as well. The request and acknowledgment also traverse the turning stage and the memory module, with delay r_i + d_m. Including this delay, 2·(Σ_{j=0}^{i−1} 2r_j + r_i) + d_m, along with the probability gives the equation for γ:

γ = p(1 − m) · 2 · Σ_{i=1}^{d−1} [ ((k − 1)/(N − 1)) · k^i · ( 2·(Σ_{j=0}^{i−1} 2r_j + r_i) + d_m ) ]    (7)

Forward routing (δ): Finally, for all those source-destination pairs which do not fall into the above
routing categories, the forward routing path is taken. Since forward or backward routing is the last choice for any other type of source-destination pair, its probability is simply 1 − p_β − p_γ, where p_β and p_γ are the cluster and non-cluster FU/BU routing probabilities derived above. In this type of routing all switches are traversed, giving a summation of all switch response times, d_n = Σ_{i=0}^{l−1} r_i. Thus the expected delay for all such routings can be expressed as

δ = p(1 − m) · p_δ · ( (Σ_{i=0}^{l−1} r_i) + d_m )    (8)

where

p_δ = 1 − p_β − p_γ    (9)

and p_β and p_γ are given by equations 4 and 6 respectively. The above equations are valid when the local memory is accessed with probability m and all other memories are addressed with equal probability, (1 − m)/(N − 1). In practice there will be more interaction between the tasks within a cluster; the equations can easily be extended to include such cases.

Table II shows the number of destinations that can be reached from processor 0 with each of the routings, as a function of the network size; the switch size of the network is 2 × 2.

    Routing              Number of destinations (MBN of size N)
    FU or BU routing     (N − 1)(p_β + p_γ)
    FW or BW routing     (N − 1) p_δ

    TABLE II: Number of destinations for different network sizes using different routings

It can be observed from the table that a significant number of connections benefit from routings other than the FW or BW routing that is commonly adopted today. Also, the same number of processors use FU or BU routing in two successive network sizes. We can explain this behavior with an example. Consider l = 6 and l = 7,
corresponding to network sizes of 64 and 128 respectively. The networks, though of different sizes, have the same number of destinations for FU routing because the addition of one stage introduces a true center stage (center stage = 3) for l = 7, while there is no true center stage for l = 6. Since the center bit of the CT tag has to be 0 for FU or BU routing, the addition of the center stage does not increase the number of possible FU or BU routings.

The delays r_0, r_i, d_n and d_m depend on (a) the amount of traffic in the network, which in turn is a function of P_u itself, and (b) the service demand at the individual service centers. The queueing analysis for these delays is given next.

C. Queueing delays in switches

To make the analysis simpler, each stage of the network is considered in isolation from the other stages. Consider a queueing center with n inputs. Let q be the probability that there is a packet at one of the inputs in any given cycle, and let t be the service demand of a packet at the service center, in cycles. The number of requests arriving at the queue during the service time of a previous request then forms a Binomial distribution with nt trials and success probability q; the mean number of arriving requests is E = ntq and the variance is V = ntq(1 − q). The average queue length Q at the queueing center can be found using the Pollaczek-Khinchine (P-K) mean value formula [14],

Q = (E² + V) / (2(1 − E))    (10)

The throughput of these requests is E/t. Hence, by Little's law, the mean response time of the center, r, can be derived as

r = Q·t / E = ((1 − q + ntq) t) / (2(1 − ntq))    (11)
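As a sanity check on Eqs. (10) and (11), the sketch below (Python) computes Q from the binomial moments and r via Little's law, and confirms that the result matches the closed form (1 − q + ntq)·t / (2(1 − ntq)) obtained by substituting E = ntq and V = ntq(1 − q). The formulas follow the reconstruction of the equations given above; the example parameter values are arbitrary:

```python
def pk_response_time(n, t, q):
    """Mean response time of a service center with n inputs (Eqs. 10-11).

    Arrivals during a service time of t cycles are Binomial(nt, q), so
    E = n*t*q and V = n*t*q*(1-q).  The P-K mean value formula gives the
    average queue length Q, and Little's law gives r = Q*t/E.
    Requires E < 1 for stability.
    """
    E = n * t * q
    V = n * t * q * (1.0 - q)
    assert E < 1.0, "service center is saturated"
    Q = (E * E + V) / (2.0 * (1.0 - E))   # Eq. 10
    r = Q * t / E                         # Little's law, Eq. 11
    return r

# Example: a stage-i MBN switch bus has n = 2k inputs and t = t_s, with
# q = P_u * p * (1 - m) * p_i (the substitution used for Eq. 12).
r_example = pk_response_time(n=4, t=1, q=0.1)   # 2x2 switch, one-cycle bus
closed_form = (1 - 0.1 + 4 * 1 * 0.1) * 1 / (2 * (1 - 4 * 1 * 0.1))
```
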
Fig. 8. Queues at the BMIN and MBN switches: (a) the MBN switch queue, where packets from k left and k right ports share a single bus; (b) the BMIN switch queue, with 2k inputs and 2k outputs.

Queueing models of an MBN switch and a BMIN switch are shown in Figure 8. In an MBN switch, packets from the k right ports and k left ports contend for the bus. For a switch at stage i, n = 2k, t = t_s and q = P_u · p · (1 − m) · p_i, where p_i is the probability that a packet visits stage i. The mean switch response time r_i of any MBN switch can then be calculated from the following equation:

r_i = ((1 − q + 2k·t_s·q) t_s) / (2(1 − 2k·t_s·q))    (12)

The network delay d_n is the sum of the response times of the stages a packet visits while being routed through the network. In the case of the BMIN, a switch has 2k inputs and 2k outputs, and the request probability at an input or output of a BMIN switch at stage i is P_u · p · (1 − m) · p_i. Following the model shown in Fig. 8b, the response time of a BMIN switch is calculated by using n = 2k, t = t_s and q = P_u · p · (1 − m) · p_i / n. The total network delay d_n is again the sum of the response times of the switches at the different stages.

In both networks, the mean number of requests arriving at a memory module is E_m = q_i + t_m·q_e, where q_i and q_e are the internal and external request probabilities for that memory module, respectively, and the variance is V_m = q_i(1 − q_i) + t_m·q_e(1 − q_e). Hence the average memory queue length is

l_m = ( (q_i + t_m·q_e)² + q_i(1 − q_i) + t_m·q_e(1 − q_e) ) / ( 2(1 − q_i − t_m·q_e) )    (13)
and the mean memory response time, or delay, is d_m = l_m · t_m / E_m.

A packet (or request) takes the optimal path from source to destination; the number of switches traversed depends on the nature of the CT and RCT. The delays derived here are inserted into equations 3, 5, 7 and 8, which in turn are plugged into equation 2 to obtain the processor utilization and the response time of the network. This yields a nonlinear equation with P_u as the single variable, which is solved using an iteration technique. The iteration used to compute the processor utilization P_u can be presented as follows:

1. Initialize P_u with a guess of the expected processor utilization. The better the guess, the fewer iterations the computation needs.
2. Calculate the request probabilities at each stage of the network and at the memory module. An intermediate step is to calculate the static values of p_i (the probability that a stage of the network is traversed).
3. Calculate the mean switch response times r_i and the memory response time r_m.
4. Based on the above values, calculate the network delay and memory delay using the equations provided for α, β, γ and δ.
5. Based on these values, calculate a new processor utilization P_u.
6. Repeat steps 2-5 until the new P_u is within some tolerance of the previous P_u.

An initial value of 0.5 for P_u and a small convergence tolerance were used to generate the analytical results presented in the next section.

V. Results and Discussions

We performed extensive cycle-by-cycle simulations to verify that the proposed routings work, and measured the routing probabilities and network delays [16]. The simulation assumes a synchronous packet-switched distributed memory environment. The simulation specifications are the same as those of the analysis and are detailed below to make the network operation clearer.
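Steps 1-6 above can be sketched as a generic fixed-point loop. In the sketch below (Python), the whole of steps 2-4 is abstracted into a caller-supplied delay_terms(pu) function, and the toy delay model used in the example (delays proportional to P_u) is purely illustrative, not the paper's network model:

```python
def solve_pu(delay_terms, pu0=0.5, tol=1e-6, max_iter=1000):
    """Fixed-point iteration for processor utilization P_u (steps 1-6).

    `delay_terms(pu)` must return the tuple (alpha, beta, gamma, delta)
    of expected delays evaluated at the given utilization -- i.e. it
    carries out steps 2-4 (request probabilities, switch and memory
    response times, per-routing delay terms) for the network modeled.
    """
    pu = pu0                                     # step 1: initial guess
    for _ in range(max_iter):
        a, b, g, d = delay_terms(pu)             # steps 2-4
        new_pu = 1.0 / (1.0 + a + b + g + d)     # step 5 (Eq. 2)
        if abs(new_pu - pu) < tol:               # step 6: convergence test
            return new_pu
        pu = new_pu
    raise RuntimeError("fixed-point iteration did not converge")

# Toy model: total delay 3*pu, so the fixed point satisfies
# pu * (1 + 3*pu) = 1, i.e. pu = (sqrt(13) - 1) / 6 ~ 0.434.
pu = solve_pu(lambda pu: (1.0 * pu, 0.5 * pu, 0.5 * pu, 1.0 * pu))
```
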
Fig. 9. Comparison of analysis and simulation for processor utilizations of the MBN, varying m

In our simulations, each cycle is the time required for the transmission of a packet from one output buffer of a switch to an output buffer of the next stage. This includes the transmission of the packet through the link and the time the switch takes to route it to the corresponding destination buffer. The minimum time taken for a packet to reach memory is determined by the number of switches the routing traverses. All four routings discussed in Section 2 are used in the simulations. The simulator compares each source and destination by running the optimal routing algorithm and then chooses the proper routing. The choice between backward and forward routing is made as follows: all memory requests that could use either forward (FW) or backward (BW) routing use forward routing, and all acknowledgment packets use backward routing, to keep the load on the two routings balanced. Apart from these differences, the routing decisions are based solely on the tags generated by the optimal routing algorithm. The probabilities p, m, etc. are fed to the simulation as input parameters. Upon a memory request, all memories except the local memory are equally likely to be addressed.

In this section we present the relative performance of the BMIN and MBN. We start by comparing the
Fig. 10. Comparison of analysis and simulation for response time of the MBN, varying m

Fig. 11. Comparison of processor utilizations, varying m
Fig. 12. Comparison of response time, varying m

results from the simulation versus those obtained from the analysis of the MBN. Many simulation experiments were run to verify the analytical models developed in this paper, and the simulation results closely matched the analysis under all parameter variations. Here we present some results for a system with 2 × 2 switches; the memory service time is assumed to be 4 cycles. Processor utilization, P_u, is defined as the average amount of useful work the processor does in a given cycle. Response time is defined as the average difference between the time a processor submits a memory request and the time it gets the reply back.

Figures 9 and 10 compare the analysis and simulation results for the processor utilization and response time of the MBN. In both figures, the analytical results match those of simulation very closely, indicating that the independence of queues assumed in the analysis does not introduce much error. The plots show the results as a function of the memory request probability p, which is varied from 0.1 to 1.0, for two values of the local memory request probability m (0.1 and 0.9). For larger values of m, more requests are satisfied without going through the MBN. Thus, for m = 0.9, P_u
Fig. 13. Processor Utilization: Scalability of MBN vs BMIN, varying p

is much higher in Fig. 9 and the response time is much lower in Fig. 10. As p gets larger, more requests are generated, so the response time increases and the processor utilization drops due to the higher amount of traffic and the larger queueing delays.

Figures 11 and 12 compare the performance of the MBN with that of the conventional MIN (CMIN) and the proposed bidirectional MIN (BMIN). The conventional MIN is similar to the network employed in the Butterfly machine [3], where both the request and the response packets travel in one direction, from the processor to the memory side. BMINs, on the other hand, allow all four routings proposed for the MBN. The two plots show the processor utilization and the response time of the three networks for two different values of m. The MBN behaves almost identically to the CMIN and the BMIN in terms of processor utilization. The response time is also the same for all three networks for m = 0.9; for m = 0.1, the BMIN performs better than the MBN, and the MBN performs better than the CMIN.

Figures 13 and 14 show the processor utilization and response times for various system sizes. The results were obtained with the local memory request probability m fixed at 0.5 and for two different values of p (0.1 and 0.5). We can see from the figures that, even as the system size grows, the
Fig. 14. Response Time: Scalability of MBN vs BMIN, varying p

performance of the MBN remains close to that of the BMIN. The curves for the CMIN are left out for clarity, but the MBN always performs better than the CMIN. It can also be seen from the figures that doubling the system size causes only a small reduction in performance, indicating that the MBN is highly scalable for the given traffic load. The processor utilization stays approximately between 0.5 and 0.4 as the system size changes from 32 to 1024 with 2 × 2 switches, whereas the request probability p has a much greater effect on performance.

Finally, we present in Table III the processor utilization and response time of the MBN obtained for different switch sizes and different numbers of processors. Some entries in the table are left empty because an N × N MBN cannot be built using only k × k switches of that size. Both the request probability (p) and the local memory request probability (m) are fixed at 0.5. For N = 64, there is a decrease in P_u when k is increased from 4 to 8, because the MBN becomes less efficient due to the increased contention and delay in an 8 × 8 bus-based switch. On the other hand, in a larger system, when k is increased from 2 to 8, there is a good improvement: the number of switches in the entire network is still quite high, keeping the contention low enough to gain in performance.
TABLE III: Processor utilization and response time of the MBN and BMIN, varying k and N (columns: switch sizes k = 2, 4, 8; entries: P_u and RT for both the MBN and the BMIN at each number of processors)

If we compare the MBN's performance to the BMIN's, we can see that as the switch size increases, the BMIN gives a higher processor utilization and a lower response time. This gain comes from the lower contention in the crossbar switch. However, the BMIN achieves this higher performance at a higher cost. In [8], a cost parameter based on the number of connection points in a switch is presented: the number of connections is k² for a k × k crossbar switch, whereas for a bus the number of connections is 2k. Thus, the total costs of the BMIN and the MBN are kN·log_k N and 2N·log_k N respectively. If these cost parameters are combined with the processor utilization and the response time, the cost-effectiveness of the MBN is higher than that of the BMIN, as shown in [8]. A 4 × 4 switch size works out to be the most cost-efficient for different network sizes and workload inputs.

VI. Execution-driven Simulation and Results

The execution time of an application on a multiprocessor architecture is the ultimate indicator of performance. To show that the MBN performs similarly to the bidirectional MIN (BMIN), we study their performance using an execution-driven simulation of various applications. Our simulator is based on Proteus [10], originally developed at MIT. The original simulator, however, modeled indirect interconnection networks with an analytical model; we have modified it extensively to model the BMIN and the MBN exactly, using 2 × 2 switches and a packet-switching strategy. The system considered in this paper has private cache memories that operate under a directory-based cache-coherence protocol [15]. The node configuration and the network
More informationEE382 Processor Design. Illinois
EE382 Processor Design Winter 1998 Chapter 8 Lectures Multiprocessors Part II EE 382 Processor Design Winter 98/99 Michael Flynn 1 Illinois EE 382 Processor Design Winter 98/99 Michael Flynn 2 1 Write-invalidate
More informationNetworks. Wu-chang Fengy Dilip D. Kandlurz Debanjan Sahaz Kang G. Shiny. Ann Arbor, MI Yorktown Heights, NY 10598
Techniques for Eliminating Packet Loss in Congested TCP/IP Networks Wu-chang Fengy Dilip D. Kandlurz Debanjan Sahaz Kang G. Shiny ydepartment of EECS znetwork Systems Department University of Michigan
More informationScalable Cache Coherence. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University
Scalable Cache Coherence Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels
More informationMemory Systems in Pipelined Processors
Advanced Computer Architecture (0630561) Lecture 12 Memory Systems in Pipelined Processors Prof. Kasim M. Al-Aubidy Computer Eng. Dept. Interleaved Memory: In a pipelined processor data is required every
More informationCOEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence
1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations
More informationperform well on paths including satellite links. It is important to verify how the two ATM data services perform on satellite links. TCP is the most p
Performance of TCP/IP Using ATM ABR and UBR Services over Satellite Networks 1 Shiv Kalyanaraman, Raj Jain, Rohit Goyal, Sonia Fahmy Department of Computer and Information Science The Ohio State University
More informationLondon SW7 2BZ. in the number of processors due to unfortunate allocation of the. home and ownership of cache lines. We present a modied coherency
Using Proxies to Reduce Controller Contention in Large Shared-Memory Multiprocessors Andrew J. Bennett, Paul H. J. Kelly, Jacob G. Refstrup, Sarah A. M. Talbot Department of Computing Imperial College
More informationParallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?
Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing
More informationMulti-Processor / Parallel Processing
Parallel Processing: Multi-Processor / Parallel Processing Originally, the computer has been viewed as a sequential machine. Most computer programming languages require the programmer to specify algorithms
More informationLimits on Interconnection Network Performance. Anant Agarwal. Massachusetts Institute of Technology. Cambridge, MA Abstract
Limits on Interconnection Network Performance Anant Agarwal Laboratory for Computer Science Massachusetts Institute of Technology Cambridge, MA 0139 Abstract As the performance of interconnection networks
More informationReduction of Periodic Broadcast Resource Requirements with Proxy Caching
Reduction of Periodic Broadcast Resource Requirements with Proxy Caching Ewa Kusmierek and David H.C. Du Digital Technology Center and Department of Computer Science and Engineering University of Minnesota
More informationPerformance of Multihop Communications Using Logical Topologies on Optical Torus Networks
Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,
More informationModule 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 10: Introduction to Coherence. The Lecture Contains:
The Lecture Contains: Four Organizations Hierarchical Design Cache Coherence Example What Went Wrong? Definitions Ordering Memory op Bus-based SMP s file:///d /...audhary,%20dr.%20sanjeev%20k%20aggrwal%20&%20dr.%20rajat%20moona/multi-core_architecture/lecture10/10_1.htm[6/14/2012
More informationModule 17: "Interconnection Networks" Lecture 37: "Introduction to Routers" Interconnection Networks. Fundamentals. Latency and bandwidth
Interconnection Networks Fundamentals Latency and bandwidth Router architecture Coherence protocol and routing [From Chapter 10 of Culler, Singh, Gupta] file:///e /parallel_com_arch/lecture37/37_1.htm[6/13/2012
More informationNetwork-on-chip (NOC) Topologies
Network-on-chip (NOC) Topologies 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance
More informationMDP Routing in ATM Networks. Using the Virtual Path Concept 1. Department of Computer Science Department of Computer Science
MDP Routing in ATM Networks Using the Virtual Path Concept 1 Ren-Hung Hwang, James F. Kurose, and Don Towsley Department of Computer Science Department of Computer Science & Information Engineering University
More information3. Evaluation of Selected Tree and Mesh based Routing Protocols
33 3. Evaluation of Selected Tree and Mesh based Routing Protocols 3.1 Introduction Construction of best possible multicast trees and maintaining the group connections in sequence is challenging even in
More informationLecture 2: Topology - I
ECE 8823 A / CS 8803 - ICN Interconnection Networks Spring 2017 http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/ Lecture 2: Topology - I Tushar Krishna Assistant Professor School of Electrical and
More informationEcube Planar adaptive Turn model (west-first non-minimal)
Proc. of the International Parallel Processing Symposium (IPPS '95), Apr. 1995, pp. 652-659. Global Reduction in Wormhole k-ary n-cube Networks with Multidestination Exchange Worms Dhabaleswar K. Panda
More informationNeuro-Remodeling via Backpropagation of Utility. ABSTRACT Backpropagation of utility is one of the many methods for neuro-control.
Neuro-Remodeling via Backpropagation of Utility K. Wendy Tang and Girish Pingle 1 Department of Electrical Engineering SUNY at Stony Brook, Stony Brook, NY 11794-2350. ABSTRACT Backpropagation of utility
More informationEE/CSCI 451: Parallel and Distributed Computation
EE/CSCI 451: Parallel and Distributed Computation Lecture #4 1/24/2018 Xuehai Qian xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Announcements PA #1
More informationModule 5 Introduction to Parallel Processing Systems
Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this
More informationLecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance
Lecture 13: Interconnection Networks Topics: lots of background, recent innovations for power and performance 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees,
More informationMultiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering
Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to
More informationRelative Reduced Hops
GreedyDual-Size: A Cost-Aware WWW Proxy Caching Algorithm Pei Cao Sandy Irani y 1 Introduction As the World Wide Web has grown in popularity in recent years, the percentage of network trac due to HTTP
More informationreasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap
Storage-Ecient Finite Field Basis Conversion Burton S. Kaliski Jr. 1 and Yiqun Lisa Yin 2 RSA Laboratories 1 20 Crosby Drive, Bedford, MA 01730. burt@rsa.com 2 2955 Campus Drive, San Mateo, CA 94402. yiqun@rsa.com
More informationComputer parallelism Flynn s categories
04 Multi-processors 04.01-04.02 Taxonomy and communication Parallelism Taxonomy Communication alessandro bogliolo isti information science and technology institute 1/9 Computer parallelism Flynn s categories
More informationCS 204 Lecture Notes on Elementary Network Analysis
CS 204 Lecture Notes on Elementary Network Analysis Mart Molle Department of Computer Science and Engineering University of California, Riverside CA 92521 mart@cs.ucr.edu October 18, 2006 1 First-Order
More informationMULTIPROCESSORS. Characteristics of Multiprocessors. Interconnection Structures. Interprocessor Arbitration
MULTIPROCESSORS Characteristics of Multiprocessors Interconnection Structures Interprocessor Arbitration Interprocessor Communication and Synchronization Cache Coherence 2 Characteristics of Multiprocessors
More informationAssignment 5. Georgia Koloniari
Assignment 5 Georgia Koloniari 2. "Peer-to-Peer Computing" 1. What is the definition of a p2p system given by the authors in sec 1? Compare it with at least one of the definitions surveyed in the last
More informationInterconnection networks
Interconnection networks When more than one processor needs to access a memory structure, interconnection networks are needed to route data from processors to memories (concurrent access to a shared memory
More information1. Memory technology & Hierarchy
1. Memory technology & Hierarchy Back to caching... Advances in Computer Architecture Andy D. Pimentel Caches in a multi-processor context Dealing with concurrent updates Multiprocessor architecture In
More informationMultiprocessor Interconnection Networks- Part Three
Babylon University College of Information Technology Software Department Multiprocessor Interconnection Networks- Part Three By The k-ary n-cube Networks The k-ary n-cube network is a radix k cube with
More informationThe final publication is available at
Document downloaded from: http://hdl.handle.net/10251/82062 This paper must be cited as: Peñaranda Cebrián, R.; Gómez Requena, C.; Gómez Requena, ME.; López Rodríguez, PJ.; Duato Marín, JF. (2016). The
More informationReinforcement Learning Scheme. for Network Routing. Michael Littman*, Justin Boyan. School of Computer Science. Pittsburgh, PA
A Distributed Reinforcement Learning Scheme for Network Routing Michael Littman*, Justin Boyan Carnegie Mellon University School of Computer Science Pittsburgh, PA * also Cognitive Science Research Group,
More informationB.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2
Introduction :- Today single CPU based architecture is not capable enough for the modern database that are required to handle more demanding and complex requirements of the users, for example, high performance,
More informationChapter 5 Lempel-Ziv Codes To set the stage for Lempel-Ziv codes, suppose we wish to nd the best block code for compressing a datavector X. Then we ha
Chapter 5 Lempel-Ziv Codes To set the stage for Lempel-Ziv codes, suppose we wish to nd the best block code for compressing a datavector X. Then we have to take into account the complexity of the code.
More informationCluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s]
Fast, single-pass K-means algorithms Fredrik Farnstrom Computer Science and Engineering Lund Institute of Technology, Sweden arnstrom@ucsd.edu James Lewis Computer Science and Engineering University of
More informationA Comparative Study of Bidirectional Ring and Crossbar Interconnection Networks
A Comparative Study of Bidirectional Ring and Crossbar Interconnection Networks Hitoshi Oi and N. Ranganathan Department of Computer Science and Engineering, University of South Florida, Tampa, FL Abstract
More informationShared Memory Architecture Part One
Babylon University College of Information Technology Software Department Shared Memory Architecture Part One By Classification Of Shared Memory Systems The simplest shared memory system consists of one
More informationLecture 30: Multiprocessors Flynn Categories, Large vs. Small Scale, Cache Coherency Professor Randy H. Katz Computer Science 252 Spring 1996
Lecture 30: Multiprocessors Flynn Categories, Large vs. Small Scale, Cache Coherency Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Flynn Categories SISD (Single Instruction Single
More informationArchitecture-Dependent Tuning of the Parameterized Communication Model for Optimal Multicasting
Architecture-Dependent Tuning of the Parameterized Communication Model for Optimal Multicasting Natawut Nupairoj and Lionel M. Ni Department of Computer Science Michigan State University East Lansing,
More informationA Study of Query Execution Strategies. for Client-Server Database Systems. Department of Computer Science and UMIACS. University of Maryland
A Study of Query Execution Strategies for Client-Server Database Systems Donald Kossmann Michael J. Franklin Department of Computer Science and UMIACS University of Maryland College Park, MD 20742 f kossmann
More informationunder Timing Constraints David Filo David Ku Claudionor N. Coelho, Jr. Giovanni De Micheli
Interface Optimization for Concurrent Systems under Timing Constraints David Filo David Ku Claudionor N. Coelho, Jr. Giovanni De Micheli Abstract The scope of most high-level synthesis eorts to date has
More informationScalable Cache Coherent Systems
NUM SS Scalable ache oherent Systems Scalable distributed shared memory machines ssumptions: rocessor-ache-memory nodes connected by scalable network. Distributed shared physical address space. ommunication
More informationAvailability of Coding Based Replication Schemes. Gagan Agrawal. University of Maryland. College Park, MD 20742
Availability of Coding Based Replication Schemes Gagan Agrawal Department of Computer Science University of Maryland College Park, MD 20742 Abstract Data is often replicated in distributed systems to improve
More informationLecture 2: Snooping and Directory Protocols. Topics: Snooping wrap-up and directory implementations
Lecture 2: Snooping and Directory Protocols Topics: Snooping wrap-up and directory implementations 1 Split Transaction Bus So far, we have assumed that a coherence operation (request, snoops, responses,
More information