Performance of Multistage Bus Networks for a Distributed Shared Memory Multiprocessor

Laxmi N. Bhuyan, Ravi R. Iyer, Tahsin Askar, Ashwini K. Nanda and Mohan Kumar

Abstract

A Multistage Bus Network (MBN) is proposed in this paper to overcome some of the shortcomings of conventional multistage interconnection networks (MINs), single-bus and hierarchical bus interconnection networks. The MBN consists of multiple stages of buses connected in a manner similar to the MINs and has the same bandwidth at each stage. An MBN switch is similar to a MIN switch, except that it contains a single bus connection instead of a crossbar. MBNs support bidirectional routing, and there exist a number of paths between any source and destination pair. In this paper we develop self-routing techniques for the various paths, present an algorithm to route a request along the path with minimum distance, and analyze the probabilities of a packet taking different routes. Further, we derive a performance analysis of a synchronous packet-switched MBN in a distributed shared memory environment and compare the results with those of an equivalent bidirectional MIN (BMIN). Finally, we present the execution time of various applications on the MBN and the BMIN through an execution-driven simulation. We show that the MBN provides performance similar to a BMIN while offering simpler hardware and more fault tolerance than a conventional MIN.

Keywords: interconnection network, routing, queueing model, performance analysis, packet-switching, execution-driven simulation

This research was supported by NSF MIP grants. L. N. Bhuyan and R. Iyer are with the Department of Computer Science, Texas A&M University, College Station, TX, (bhuyan,ravi)@cs.tamu.edu. T. Askar is with Advanced Micro Devices, Austin, TX. A. Nanda is with the IBM T. J. Watson Research Center, P.O. Box 218, Yorktown Heights, NY. M. Kumar is with the Department of Computer Science, Curtin University of Technology, GPO Box U 1987, Perth, WA 6001, Australia.

I. Introduction

In order to achieve significant performance in parallel computing, it is necessary to keep the communication overhead as low as possible. The communication overhead of a multiprocessor system depends to a great extent on the underlying interconnection network. An interconnection network (IN) can be either static or dynamic. Dynamic networks can connect any input to any output by enabling some switches; they are applicable to both shared memory and message passing multiprocessors. Among such dynamic INs, the hierarchical buses or rings [1], [2] and Multistage Interconnection Networks (MINs) [3], [4] have been commercially employed.

In a strictly hierarchical bus architecture [1], a number of buses are connected in the form of a tree between the processors and the memories. The use of multiple buses makes hierarchical bus-based systems more scalable than the popular single-bus multiprocessors. However, the bandwidth of this interconnection decreases as one moves toward the top of the tree, so the scalability of a hierarchical bus system is limited by the bandwidth of the topmost bus. This bandwidth problem can be alleviated through the fat-tree design [5]. The simplicity of bus-based designs and the availability of a fast broadcasting mechanism make bus-based systems very attractive.

The MINs, on the other hand, offer a uniform bandwidth across all stages of the network. The bandwidth of the network increases in proportion to the system size, making the MIN a highly scalable interconnection. The switches in a MIN are small crossbar switches. When the system size grows, bigger switches can be used to keep the number of stages and, hence, the memory latency low [6]. However, the complexity of a crossbar switch grows as the square of its size, and therefore the total network cost becomes predominant in larger systems. We have observed that the traffic in the network is very low, leaving crossbar-based MIN switches highly underutilized. In a system using private caches, which is common in today's shared memory multiprocessors, the effective traffic handled by the switches in the network is reduced even further.

A novel interconnection scheme, called the Multistage Bus Network (MBN), is introduced in this paper; it combines the positive features of hierarchical buses and MINs. The MBN consists of several stages of buses with an equal number of buses at each stage. This provides a uniform bandwidth across the stages and forms multiple trees between processors and memories. Unlike hierarchical bus networks, the MBN comprises multiple buses at higher levels, reducing the traffic at those levels. Maintaining cache coherence is a major problem in shared memory multiprocessors; unlike MINs, the MBN admits snoopy cache coherence protocols [7], which can improve performance to a large extent. The MBN also provides much better fault tolerance and reliability than a conventional MIN [8].

It is known that a distributed shared memory organization has better scalability than a centralized organization [2], [3]. In such an organization, a request or response packet can make U-turns in the network and reach a destination quickly, since the intermediate levels of an MBN consist of buses and bidirectional connections. Four different routing techniques [8] are presented in this paper. We also develop equations for the probabilities of taking each path based on the memory requests. In order to make a realistic comparison with MINs, we introduce the design and analysis of a corresponding Bidirectional MIN (BMIN). The BMIN allows U-turns, and a packet can be routed using the same techniques presented here for the MBN. Recently, Xu and Ni [9] have discussed a U-turn strategy for bidirectional MINs as applicable to the IBM SP architecture [4]. However, the MIN employed in SP architectures is cluster-based and works differently from the proposed MBN or BMIN.

In this paper, we analyze the performance of an MBN for distributed shared memory multiprocessors based on different self-routing techniques. Unlike the previous analysis [8], the present analysis is based on routing along the minimum of the four paths for a given source and destination pair. The MBN has some inherent fault tolerance due to a number of switch-disjoint paths between any source and destination pair. In this paper, we concentrate only on the routing and performance evaluation

using queueing analysis and execution-driven simulations of various applications. Our execution-driven simulator is an extended version of Proteus [10] that simulates the behavior of a cache-coherent distributed shared memory multiprocessor for various applications.

The rest of the paper is organized as follows. We present the structure of the MBN and introduce four types of self-routing techniques in Section 2. We define the routing tags required to implement the four routing strategies in Section 3 and, in the same section, present an algorithm that selects the optimal path in the network for a given source-destination pair. A performance analysis of the MBN and BMIN is then presented in Section 4. Results and comparisons with the conventional and bidirectional MINs are presented in Section 5. Section 6 presents the execution-driven simulation specifications and results. Finally, Section 7 concludes the paper.

II. Structure of the MBN

We consider a distributed shared memory (DSM) architecture throughout this paper. In such an environment, the memory modules are directly connected to the corresponding processors, as shown in Figure 1, but the address space is shared.

[Figure 1. A distributed shared memory multiprocessor]

An example of a hierarchical bus interconnection with two levels of buses is shown in Figure 2a [1]. In this example, there are 16 processors, 4 memories, four level-1 buses and one level-2 bus. Naturally, the top-level bus is the bottleneck in the system. In order to improve performance, a number of buses must be connected at the top level with an interleaved memory design. Such a connection is shown in Figure 2b for a 16*16 system with two levels of buses.

[Figure 2. Hierarchical Bus Interconnection and the MBN: (a) a 16-processor hierarchical bus system; (b) a 16*16 MBN-based system using 4*4 switches]

We propose that each bus, along with its controller, be placed in a switch analogous to a MIN switch. Such a network is called a Multistage Bus Network (MBN). In an N*N multistage network using k*k switches, there are $l = \log_k N$ stages of switches, numbered from stage 0 to stage $l-1$, as shown in Fig. 3a. Every switch has a set of left connections closer to the processor side and a set of right connections closer to the memory side. The construction of a 4*4 MBN switch incorporating a bus, a bus access controller and output buffers is shown in Figure 3b. Control lines associated with each port carry arbitration information to the bus access controller. Suzuki et al. have studied a similar bus structure in [11].

We also propose a Bidirectional MIN (BMIN) structure for comparison. The difference between the switch architectures of the BMIN and the MBN is evident from Figs. 3b and 3c: the BMIN switch is a crossbar, whereas the MBN switch is a bus. In both networks, a packet is passed from stage i to stage i+1, or vice versa, using the destination tag digits. In a k*k MBN switch, up to 2k packets (k inputs from either side) may compete for the bus in a cycle. When there is more than one such packet, the bus access controller chooses one of them at random; the others are queued to be transmitted later. In a k*k BMIN switch, on the other hand, all 2k inputs can be connected to the 2k outputs if the requests are to different destinations. The k*k MBN and BMIN switches support forward, backward and turnaround connections, as explained in the next section.
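The random bus arbitration just described is simple to model. The sketch below is our own illustration (the class and method names are ours, not the paper's): in each cycle, one contending port is granted the bus at random, and all other packets remain queued at their ports.

```python
import random
from collections import deque

class MBNSwitchBus:
    """Simplified model of the bus in a k*k MBN switch: the k left and
    k right ports contend, one packet is served per cycle, losers wait."""
    def __init__(self, k):
        self.ports = [deque() for _ in range(2 * k)]  # k left + k right ports

    def offer(self, port, packet):
        self.ports[port].append(packet)               # packet requests the bus

    def cycle(self):
        contending = [p for p in self.ports if p]
        if not contending:
            return None                               # bus idle this cycle
        winner = random.choice(contending)            # random arbitration
        return winner.popleft()                       # other packets stay queued
```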

[Figure 3. Comparing switch architectures: (a) a 16*16 multistage network; (b) 4*4 MBN switch architecture; (c) 4*4 BMIN switch architecture]

We describe the structure of the MBN below; the structure of the BMIN is similar. The processors $P_0, P_1, \ldots, P_{N-1}$ are connected to the left connections of the MBN switches at stage 0. Memory modules $M_0, M_1, \ldots, M_{N-1}$ are connected to the right connections at stage $l-1$. Memory module $M_i$ is also directly connected to processor $P_i$ and is called the local memory of $P_i$. A source is assigned a source tag $S = s_0 s_1 \ldots s_j \ldots s_{l-1}$ and a destination is assigned a destination tag $D = d_0 d_1 \ldots d_j \ldots d_{l-1}$, where $s_j$ and $d_j$ are digits in the k-ary system. The digits $s_0$ and $d_0$ are the most significant, and $s_{l-1}$ and $d_{l-1}$ are the least significant. The connection between stages in the MBN is a k-shuffle [6], which means the right connection at position $a_0 a_1 \ldots a_{l-1}$ of stage i is connected to the left connection at position $a_1 \ldots a_{l-1} a_0$ of stage i+1, for $i = 0, 1, \ldots, l-2$. A memory request is satisfied internally by the local memory when the source tag and the destination tag of the request are the same. If the tags are different, the request travels to a remote memory through the MBN. As an example, a 16*16 MBN with 2*2 switches is shown in Figure 4. There may or may not be a shuffle interconnection before the first stage of switches. Our routings are developed based on Figure 4, where there is no shuffle before the first stage.
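For concreteness, the k-shuffle wiring is just a one-digit rotation of the base-k position. The sketch below is our own illustration (the helper names are ours):

```python
def to_digits(pos, k, l):
    """Express a position as l base-k digits, most significant first."""
    digits = []
    for _ in range(l):
        digits.append(pos % k)
        pos //= k
    return digits[::-1]

def from_digits(digits, k):
    value = 0
    for d in digits:
        value = value * k + d
    return value

def k_shuffle(pos, k, l):
    """Right connection a0 a1 ... a(l-1) of stage i feeds the left
    connection a1 ... a(l-1) a0 of stage i+1 (left-rotate one digit)."""
    a = to_digits(pos, k, l)
    return from_digits(a[1:] + a[:1], k)

# 16*16 network with 2*2 switches (k = 2, l = 4):
# position 0011 (3) at stage i is wired to 0110 (6) at stage i+1.
assert k_shuffle(3, 2, 4) == 6
```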

Hence, a set of processors with their memories is connected to one switch at the first stage and to another switch at the same position at the last stage. If there were a k-shuffle connection before the first stage, a different set of processors would be connected to the first-stage and last-stage switches.

In Figure 4, a request travels in the forward direction when it starts from the processor side and passes through stages $0, 1, \ldots, l-1$, in that order. It travels in the backward direction when it starts from the memory side and passes through stages $l-1, \ldots, 1, 0$ in the reverse order, as shown in Figure 5. A packet can also travel from left to right and make a U-turn at an intermediate stage, as shown in Figure 6; this is called Forward-U (FU) routing. Similarly, Figure 7 shows Backward-U (BU) routing, where a message enters the network from the right and makes a U-turn. These four routings provide four distinct paths between a source and a destination in the MBN. As a result, the fault tolerance and reliability of the MBN are much better than those of a conventional MIN. Exact expressions for the MBN reliability are derived in [8]; they are also valid for the BMINs introduced in this paper.

In conventional MINs like the Omega, Delta and GSN [6], the destination tag is used for self-routing of a request only in the forward direction. In the MBN, the destination tag can likewise be used for self-routing in the forward direction. Since the stage-0 connections are straight instead of a k-shuffle, however, the destination tag itself cannot be used for self-routing in the backward direction. As explained later, the routing tag in the backward case is obtained by reverse-shuffling the destination tag by one digit. In order to determine where to take a turn in the two routing techniques involving U-turns, we need to combine the source tag and the destination tag into a combined tag. The following definitions are needed to develop exact routing algorithms later.

Definition 1 (FRT): The Forward Routing Tag (FRT) is the same as the destination tag of a memory request, i.e., $FRT = d_0 d_1 \ldots d_{l-1}$.

Definition 2 (BRT): The Backward Routing Tag (BRT) is the destination tag reverse-shuffled by one digit. If $d_0 d_1 \ldots d_{l-1}$ is the destination tag, then $BRT = b_0 b_1 \ldots b_{l-1} = d_{l-1} d_0 d_1 \ldots d_{l-2}$, where $b_j = d_{(j-1) \bmod l}$.

Definition 3 (CT): The Combined Tag (CT) is the digit-wise exclusive-or of the source tag and the destination tag, i.e., $CT = c_0 c_1 \ldots c_j \ldots c_{l-1}$, where $c_j = s_j \oplus d_j$. The operation $\oplus$ means $c_j = 0$ if $s_j = d_j$, and $c_j = 1$ if $s_j \neq d_j$. Note that although the digits in S and D are k-ary, the digits in the CT are binary.

Definition 4 (RCT): The Rotated Combined Tag (RCT) is the Combined Tag (CT) reverse-shuffled, or right-rotated, by one digit, i.e., $RCT = r_0 r_1 \ldots r_{l-1} = c_{l-1} c_0 c_1 \ldots c_{l-2}$, where $r_j = c_{(j-1) \bmod l}$.

Definition 5 (FTS): The Forward Turning Stage (FTS) is the rightmost nonzero position in the Rotated Combined Tag (RCT). That is, $FTS = m$ such that $r_m = 1$ and $r_j = 0$ for $m < j \leq l-1$.

Definition 6 (BTS): The Backward Turning Stage (BTS) is the leftmost nonzero position in the Combined Tag (CT). That is, $BTS = n$ such that $c_n = 1$ and $c_j = 0$ for $0 \leq j < n$.

The routing tags FRT and BRT are used for self-routing in the forward and backward directions, respectively. The tags RCT and CT are used to find the U-turn stages FTS and BTS, respectively, which determine where to take forward and backward turns during the U-turn routings. The various routing schemes possible in an MBN are described below.

III. Routing Algorithms for MBN

In this section, we first present the four routing techniques for the MBN and then present an algorithm that chooses the path with minimum distance. Although these techniques are described for the MBN, they are equally valid for the BMIN.

A. Routing Techniques

A. Forward (FW) Routing: In Forward (FW) routing, a request from source processor S moves through the MBN from stage 0 through stage $l-1$ to the destination memory D. An example of FW routing from source 0011 to destination 1011 is shown in bold in Figure 4. The jth digit of the forward routing tag (FRT) is used by a switch at stage j for self-routing.
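These tag and turning-stage definitions translate directly into code. The sketch below is our own illustration (the function names are ours); tags are digit lists, and the assertions reproduce the tag values listed in Figures 6 and 7 for S = 0010 and D = 0110:

```python
def brt(d):
    """BRT: destination tag right-rotated (reverse-shuffled) by one digit."""
    return d[-1:] + d[:-1]

def ct(s, d):
    """CT: digit-wise exclusive-or indicator; binary even for k-ary tags."""
    return [0 if sj == dj else 1 for sj, dj in zip(s, d)]

def rct(s, d):
    """RCT: combined tag right-rotated by one digit."""
    c = ct(s, d)
    return c[-1:] + c[:-1]

def fts(s, d):
    """FTS: rightmost nonzero position in the RCT (assumes s != d)."""
    return max(j for j, rj in enumerate(rct(s, d)) if rj == 1)

def bts(s, d):
    """BTS: leftmost nonzero position in the CT (assumes s != d)."""
    return min(j for j, cj in enumerate(ct(s, d)) if cj == 1)

S, D = [0, 0, 1, 0], [0, 1, 1, 0]     # Figures 6 and 7 (k = 2, l = 4)
assert D == [0, 1, 1, 0]              # FRT = 0110 (the destination tag)
assert brt(D) == [0, 0, 1, 1]         # BRT = 0011
assert ct(S, D) == [0, 1, 0, 0]       # CT  = 0100
assert rct(S, D) == [0, 0, 1, 0]      # RCT = 0010
assert fts(S, D) == 2 and bts(S, D) == 1
```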

Following the FRT digits, a request that starts at position $S = s_0 s_1 \ldots s_j \ldots s_{l-1}$ at the left of stage 0 is switched to position $s_0 s_1 \ldots s_{l-2} d_0$ at the right of stage 0, then undergoes a k-shuffle and reaches position $s_1 s_2 \ldots s_{l-2} d_0 s_0$ at the input of stage 1, where it is switched to $s_1 s_2 \ldots s_{l-2} d_0 d_1$ at the output of stage 1. In general, when a request arrives at position $s_j s_{j+1} \ldots s_{l-2} d_0 d_1 \ldots d_{j-1} s_{j-1}$ at the left of stage j, it is switched to position $s_j \ldots s_{l-2} d_0 d_1 \ldots d_{j-1} d_j$ at the right of stage j, goes through a k-shuffle (except at the last stage, $j = l-1$) and arrives at position $s_{j+1} \ldots s_{l-2} d_0 d_1 \ldots d_j s_j$ at the left of stage $j+1$. Finally, it reaches the destination $d_0 d_1 \ldots d_j \ldots d_{l-1}$ at the output of the last stage of the MBN.

[Figure 4. Forward (FW) routing in MBN: S = 0011, D = 1011, FRT = 1011]
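The position algebra above can be checked mechanically. The following sketch (ours, not the authors' simulator) traces FW routing as alternating switch and shuffle steps on the digit lists and confirms the Figure 4 example:

```python
def fw_route(s, d):
    """Trace Forward (FW) routing: the switch at stage j replaces the last
    digit with FRT digit d[j]; a k-shuffle (left rotation) follows at every
    stage except the last. Tags are base-k digit lists."""
    l = len(s)
    pos = list(s)
    for j in range(l):
        pos[-1] = d[j]                # switch at stage j applies FRT digit j
        if j < l - 1:
            pos = pos[1:] + pos[:1]   # k-shuffle to the next stage
    return pos

# Figure 4: S = 0011, D = 1011 in a 16*16 MBN with 2*2 switches.
assert fw_route([0, 0, 1, 1], [1, 0, 1, 1]) == [1, 0, 1, 1]
```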

B. Backward (BW) Routing: In Backward (BW) routing, a request from source node S moves backward from stage $l-1$ through stage 0 to the destination node D. An example of BW routing from 0011 to 1011 is shown in bold in Figure 5.

[Figure 5. Backward (BW) routing in MBN: S = 0011, D = 1011, BRT = 1101]

The jth digit of the backward routing tag (BRT) is used by a switch at stage j for self-routing. Thus, a request that starts at position $S = s_0 s_1 \ldots s_j \ldots s_{l-1}$ at the right of stage $l-1$ is switched to position $s_0 s_1 \ldots s_{l-2} d_{l-2}$ at the left of stage $l-1$, then undergoes a reverse k-shuffle and reaches position $d_{l-2} s_0 s_1 \ldots s_{l-2}$ at the right of stage $l-2$. In general, when a request arrives at position $d_j d_{j+1} \ldots d_{l-2} s_0 s_1 \ldots s_{j-1} s_j$ at the right of stage j, it is switched to position $d_j d_{j+1} \ldots d_{l-2} s_0 s_1 \ldots s_{j-1} d_{(j-1) \bmod l}$ (the jth digit of the BRT, $b_j = d_{(j-1) \bmod l}$) at the left of stage j. It then goes through a reverse shuffle (except at the last stage, $j = 0$) and arrives at position $d_{j-1} d_j d_{j+1} \ldots d_{l-2} s_0 s_1 \ldots s_{j-1}$ at the right of stage $j-1$.

C. Forward-U (FU) Routing: In Forward-U routing, the request starts from source S at stage 0, follows FW routing (using the FRT) up to stage FTS-1 and reaches the left of stage FTS at position $s_{FTS} s_{FTS+1} \ldots s_{l-2} d_0 d_1 \ldots d_{FTS-1} s_{FTS-1}$. At FTS it takes a U-turn: instead of being switched to the right of stage FTS, the request is switched to the left of stage FTS at position $s_{FTS} s_{FTS+1} \ldots s_{l-2} d_0 d_1 \ldots d_{FTS-1} d_{(FTS-1) \bmod l}$ and follows BW routing (using the BRT) down to stage 0. Finally, it reaches position $d_0 d_1 \ldots d_{FTS-1} s_{FTS} \ldots s_{l-2} d_{l-1}$ at the left of stage 0. An example of FU routing from 0010 to 0110 is shown in bold in Figure 6.

D. Backward-U (BU) Routing: In Backward-U routing, the request starts from source S at stage $l-1$, follows BW routing (using the BRT) up to stage BTS+1 and reaches position $d_{BTS} d_{BTS+1} \ldots d_{l-2} s_0 s_1 \ldots s_{BTS-1} s_{BTS}$ at the right of stage BTS. At stage BTS it takes a U-turn: instead of being switched to the left of stage BTS, the request is switched to the right of stage BTS at position $d_{BTS} d_{BTS+1} \ldots d_{l-2} s_0 s_1 \ldots s_{BTS-1} d_{BTS}$ and follows FW routing (using the FRT) up to stage $l-1$. Finally, the request reaches position $s_0 s_1 \ldots s_{BTS-1} d_{BTS} d_{BTS+1} \ldots d_{l-1}$ at the right of stage $l-1$. An example of BU routing from 0010 to 0110 is shown in Figure 7.

[Figure 6. Forward-U (FU) routing in MBN: S = 0010, D = 0110, FRT = 0110, BRT = 0011, CT = 0100, RCT = 0010, FTS = 2]

[Figure 7. Backward-U (BU) routing in MBN: S = 0010, D = 0110, FRT = 0110, BRT = 0011, CT = 0100, RCT = 0010, BTS = 1]

B. Optimal Path Algorithm

The distance between a source and a destination in an MBN is defined as the minimum number of switches that the packet has to traverse. For a conventional MIN, this distance is always equal to l, the number of stages in the network. In an MBN, however, the distance may be less than l if FU or BU routing is chosen. The FU and BU routings are used when the turning stage falls before the center stage of the network; this yields a net saving in distance between a given source and all the destinations. Detailed expressions for the overall savings in distance for such an MBN are given in Section 4. We present below an algorithm that chooses the optimal routing for a given source-destination pair.

Optimal Path Algorithm

1. $S = s_0 s_1 \ldots s_{l-1}$

2. $D = d_0 d_1 \ldots d_{l-1}$
3. $CT = S \oplus D = c_0 c_1 \ldots c_j \ldots c_{l-1}$
4. $RCT = c_{l-1} c_0 c_1 \ldots c_j \ldots c_{l-2}$
5. $d_l = \lfloor l/2 \rfloor$, $d_u = \lceil l/2 \rceil$
6. IF (source = destination)
7. THEN request is to local memory
8. ELSE
9. Find FTS and BTS (based on the tags RCT and CT, respectively)
10. IF ($FTS = 0$, equivalently $l-1-BTS = 0$)
11. THEN select Forward-U (FU) routing OR Backward-U (BU) routing
12. ELSE IF ($FTS < d_l$)
13. THEN select Forward-U (FU) routing
14. ELSE IF ($BTS \geq d_u$)
15. THEN select Backward-U (BU) routing

16. ELSE
17. select Forward (FW) routing OR Backward (BW) routing

The optimal path algorithm chooses a route with minimum path length. Given a source $S = s_0 s_1 \ldots s_{l-1}$ and a destination $D = d_0 d_1 \ldots d_{l-1}$, the algorithm computes the tags described earlier in this section. It then compares these tags to decide which of the four routings gives the minimum path length through the network. In the algorithm, $d_l = \lfloor l/2 \rfloor$ and $d_u = \lceil l/2 \rceil$ delimit the center stage of the MBN. It must be pointed out that the optimal routing between two nodes is fixed in a given network. Hence, the optimal path can be precomputed and stored in a table that is read when a request is issued; there is no need to execute the algorithm every time a message is sent.

If the source and the destination are the same, the request is for the local memory, and no traversal through the MBN is required. All other requests pass through at least one stage of the MBN. The memories that are connected to a processor through the first or last stage of the MBN are called cluster memories; similarly, processors that are one switch away from a memory are called cluster processors of that memory. Requests to cluster memories require that only one switch be traversed; thus, when $FTS = 0$ (equivalently, $BTS = l-1$), FU or BU routing is taken to serve this purpose. If this condition is not satisfied, we check next for FU or BU routing because these give the next possible minimum paths: if $FTS < \lfloor l/2 \rfloor$ or $BTS \geq \lceil l/2 \rceil$, the turning stage lies before or after the center stage, which reduces the total path length to less than l, and FU or BU routing is selected. If none of the above conditions holds, we have $FTS \geq \lfloor l/2 \rfloor$ and $BTS < \lceil l/2 \rceil$; in this case, Forward (FW) routing or Backward (BW) routing are the only options.

The actual path lengths, in terms of the number of switches traversed, are as follows (see the sketch after this list):

Local memory: 0 switches (the MBN is not traversed)
Forward or Backward routing: $l$ switches
Forward-U routing: 1 switch for cluster memories; $2 \cdot FTS + 1$ switches for other memories
Backward-U routing: 1 switch for cluster memories; $2(l - 1 - BTS) + 1$ switches for other memories
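The algorithm and path-length rules above fit in a few lines of code. The sketch below is our own rendering of the pseudocode (not the authors' implementation); it reproduces the 1024*1024 examples discussed next, where destination 2 is best served by BU routing and destination 256 by FU routing:

```python
def optimal_route(s, d):
    """Minimum-distance routing for digit-list tags s, d.
    Returns (routing, path_length_in_switches)."""
    l = len(s)
    if s == d:
        return ("local", 0)
    c = [0 if sj == dj else 1 for sj, dj in zip(s, d)]   # CT
    r = c[-1:] + c[:-1]                                  # RCT
    fts = max(j for j in range(l) if r[j] == 1)
    bts = min(j for j in range(l) if c[j] == 1)
    if fts == 0:                      # equivalently bts == l-1: cluster memory
        return ("FU/BU", 1)
    if fts < l // 2:                  # U-turn before the center stage
        return ("FU", 2 * fts + 1)
    if bts >= (l + 1) // 2:           # U-turn after the center stage
        return ("BU", 2 * (l - 1 - bts) + 1)
    return ("FW/BW", l)

def digits(x, l):
    """Binary tag of length l (for k = 2)."""
    return [(x >> (l - 1 - j)) & 1 for j in range(l)]

# 1024*1024 network with 2*2 switches (l = 10), source 0:
assert optimal_route(digits(0, 10), digits(2, 10)) == ("BU", 3)
assert optimal_route(digits(0, 10), digits(256, 10)) == ("FU", 5)
```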

[Table I. The path lengths of each routing, given source = 0 and different destinations. Columns: destination i; FW/BW routing; FU routing; BU routing]

These path-length equations can be used to form a table for a given source and destination. As an example, Table I shows the path lengths from source 0 to different destinations i in a 1024*1024 network. The path lengths of the routings differ widely, so a routing algorithm is required to route each request along the optimal path. For example, if the destination is 2, then Backward-U routing results in the optimal path length; if the destination is 256, then Forward-U routing results in the optimal path length. The remaining destinations in the table use the Forward or Backward routing strategies.

IV. Performance of the MBN

The Multistage Bus Network (MBN) is analyzed here in the distributed shared memory environment shown in Figure 1. We also analyze the BMIN and compare its results with those of the MBN. In both cases, memory module $M_i$ is directly connected to processor $P_i$ and is called the local memory of $P_i$. Requests from a processor to its local memory are called internal requests and are

carried over the internal bus between the processor and its local memory. A memory can also receive external requests, which originate at other processors and are carried over the MBN.

A. Network Operation

In a distributed memory system, there are $k-1$ processors that can be reached through the size-k switch at the first or last stage to which $P_i$ is connected. Thus, an external request destined for a cluster processor or memory returns from the first stage (Forward-U routing) or the last stage (Backward-U routing) without going through the whole MBN. If the request is neither to a local nor to a cluster memory, it may take any of the four routings described earlier.

Both internal and external requests arrive at a memory queue. One request at a time is selected for service on an FCFS basis, while the remaining requests wait in the memory's buffer. After receiving a request, a memory module sends a reply packet either directly to its local processor or to another processor through the network, depending on whether the request is internal or external.

We will compare the performance of the MBN with that of a BMIN. The transmission of request and reply packets goes through the network following the routings given earlier in the paper. We assume a synchronous, packet-switched system for analyzing the multistage networks. Since a buffer size of four or more gives the same effect as an infinite buffer [12], [13], for simplicity we assume infinite buffers for the MBN and BMIN. The analysis can be extended to finite buffers, but the equations become fairly complicated [13]; since our aim here is to analyze the routing schemes, we prefer to give the basic infinite-buffer analysis. The bus service time (for the MBN) or the link service time (for the BMIN) to transfer a message forms one system cycle. The service times of the memory modules are assumed to be integral multiples of this system cycle. A processor is represented by a delay center; in a given cycle, it submits a memory request with some given probability if it is busy in computation. Once it sends the memory request, the processor remains idle until the memory response packet (in case of a read) or acknowledgment (in case of a write) is received. The various

system parameters are defined below:

k*k : size of the MBN or MIN switches
N : number of processors or memories in the system
l : $\log_k N$, the number of stages in the IN
$t_s$ : switch service time
$t_m$ : memory service time
p : probability that a processor submits a memory request in a given cycle, provided it is busy
m : probability that a processor requests its local memory, provided it has made a memory request
$p_i$ : probability that a request passes through stage i
$r_i$ : mean response time of a switch at stage i, $0 \leq i \leq l-1$
$q_i$ : average number of local requests by a processor per cycle
$q_e$ : average number of remote requests from a processor per cycle
$d_n$ : total delay in the network (considering all stages)
$l_m$ : average queue length in a memory module
$d_m$ : average delay in a memory module
$P_u$ : processor utilization (fraction of time the processor is busy)

The performance analysis of the MBNs and BMINs is carried out under the following assumptions [12], [13]. Packets are generated at each source node by independent and identically distributed random processes. At any point of time, a processor is either busy doing internal computation or waiting for the response to a memory request. If there is no pending request, each busy processor generates a packet with probability p in each cycle. The probability that this request is to the local

memory (internal request) is m, and the probability that it is to any other memory module (external request) is $1-m$. A reply from memory travels in the opposite direction through the same path in the MBN or BMIN. Note that in a MIN like the Butterfly [3], a reply has to traverse the network in the same direction (i.e., from the processor to the memory side) to reach the requesting processor, because the MIN has unidirectional links. In [9], bidirectional links are used between stages, and hence the request and reply messages may travel in the forward and backward directions, respectively. The messages from processor to memory are generated using the probabilities specified below:

Request probability (p): The request probability, defined above, is used as a means of estimating the processor behavior in terms of memory requests. When a processor is busy in computation, i.e., no request is outstanding at the switches or a memory module, it can send a memory request. In each cycle, the processor decides whether or not a message is to be sent based on this probability. On average, it takes $1/p$ cycles to send out a request from the processor.

Local memory request probability (m): Given that a request is to be made to memory, the probability m decides whether the request is to local or external memory.

Though simple, the above probabilities play an important role and are the only inputs to the analysis. After each request to memory, the processor waits for an acknowledgment. Once an acknowledgment is received, the processor does useful computation for one cycle and then, based on the above probabilities, decides whether to continue computing or to send another request to memory.

Processor utilization: The processor utilization $P_u$, defined as the fraction of time a processor is busy, is determined by the waiting times and service times faced by a request at the various service centers. In a number of applications, a large portion of the requests are made to the cluster processors. In [8], we studied the performance of the MBN with varying probabilities for cluster requests; in that study, Forward-U and Backward-U routings were allowed only at the first and last stages, and all other requests were routed by Forward (FW) routing. The processor utilization for such a case is given by

the following equation:

$$P_u = \frac{1}{1 + p(1-m)(1-m_1)(d_n + d_m) + p\,m\,d_m + p(1-m)\,m_1(2r_0 + d_m)} \quad (1)$$

where $m_1$ is the probability that an external request is to a cluster memory. In this paper, a message in the MBN or BMIN is sent along the minimum-distance path. In such a case,

$$P_u = \frac{1}{1 + \alpha + \beta + \gamma + \delta} \quad (2)$$

where $\alpha$ corresponds to the expected delay for a local memory request to be served; $\beta$ corresponds to the expected delay for serving requests to cluster memories; $\gamma$ corresponds to the expected delay for serving all requests, except those to cluster memories, that follow FU or BU routing; and $\delta$ corresponds to the expected delay for serving all requests that follow Forward (FW) or Backward (BW) routing.

The derivation of the terms $\alpha$, $\beta$, $\gamma$ and $\delta$ is presented below. These terms depend on (a) the routing probabilities along each path, (b) the amount of traffic in the network, and (c) the service demand at the individual service centers. We thus get a nonlinear equation with $P_u$ as the single variable, which is solved using iteration techniques.

B. Routing Probabilities and Path Delays

The routing probabilities and path delays are derived here for the MBN and BMIN under the assumption that all non-local memories are equally likely to be addressed by a processor; the equations can be modified for nonuniform remote memory references. Since the path length of Backward (BW) routing is the same as that of FW routing, we derive the term $\delta$ based on FW routing alone. A similar method is used for the FU and BU routings, where a factor of two in the probability accounts for both directions.

Local memory requests ($\alpha$): A local memory request does not involve any switch traversal, so the only delay is the memory service delay $d_m$. Given that the probability of a processor making a memory request is p and that of requesting the local memory is m, we have

$$\alpha = p \, m \, d_m \quad (3)$$

Cluster routing ($\beta$): Requests to cluster memories travel to the first-stage or last-stage switch and take FU or BU routing to the destination. All source-destination pairs where all bits except the least significant $\log_2 k$ bits of the CT are zero entail this type of routing. Thus, the number of cluster memories for a given source is $k-1$, since k*k is the size of an MBN or BMIN switch. The switch at stage 0 is traversed once for reaching the cluster memory and once for sending back the acknowledgment. Given that an external memory is requested, the probability of requesting a cluster memory is

$$p_\beta = \frac{k-1}{N-1} \quad (4)$$

With a delay of $2r_0$ for the switch traversals and $d_m$ for the memory service, we get

$$\beta = p(1-m)\,\frac{k-1}{N-1}\,(2r_0 + d_m) \quad (5)$$

Non-cluster FU or BU routing ($\gamma$): In Forward-U and Backward-U routing, the request traverses the network in one direction up to a particular stage (as explained in Section 2) and makes a U-turn to reach the destination. Given the turning stage FTS, the path length is $2 \cdot FTS + 1$: the turning stage is traversed only once, while all stages to its left are traversed twice, though not necessarily through the same switch. For path-length optimization we need $FTS < \lfloor l/2 \rfloor$ for FU routing, and $BTS \geq \lceil l/2 \rceil$ for BU routing.

As we have already covered the cluster memories ($FTS = 0$, $BTS = l-1$), we start with $FTS \geq 1$ and $BTS \leq l-2$. Consider $FTS < \lfloor l/2 \rfloor$; a similar derivation can be done for $BTS \geq \lceil l/2 \rceil$. The total number of destinations is $N-1$. For a given turning stage $1 \leq i < d$, since FTS is defined as the rightmost nonzero bit in the tag, the bits to the left of this position may be 1 or 0; this gives $k^i$ possibilities. As discussed in Section 3, the Combined Tag is the digit-wise EX-OR of the source and destination tags, so the RCT is made up of 1's and 0's, i.e., it is binary regardless of the source and destination tags being k-ary. The number of ways in which a given bit of the RCT can be 1 is $k-1$. Thus, given that an external memory is requested, the probability of non-cluster FU or BU routing is

$$p_\gamma = 2 \sum_{i=1}^{d-1} \frac{k-1}{N-1}\,k^i \quad (6)$$

where $d = \lfloor l/2 \rfloor$.

The delay of such a routing depends on the stage at which the U-turn takes place, so within the summation we must include the delay of each switch traversed on the particular path. As discussed above, for a turning stage FTS = i, all stages to the left of i are traversed twice, so the delay excluding the turning stage is $2 \sum_{j=0}^{i-1} 2r_j$; the outer factor of two accounts for the acknowledgment packets as well. The request and acknowledgment also traverse the turning stage and the memory module, with delay $r_i + d_m$. Including this delay, $2 \left( \sum_{j=0}^{i-1} 2r_j + r_i \right) + d_m$, with the probability gives us the equation for $\gamma$:

$$\gamma = p(1-m)\cdot 2 \sum_{i=1}^{d-1} \frac{k-1}{N-1}\,k^i \left( 2\left( \sum_{j=0}^{i-1} 2r_j + r_i \right) + d_m \right) \quad (7)$$

Forward routing ($\delta$): Finally, for all source-destination pairs that do not fall into the above

routing categories, the forward routing path is taken. Since Forward or Backward routing is the last choice for all other types of source-destination pairs, we can simply express the routing probability as $p_\delta = 1 - p_\beta - p_\gamma$. In this type of routing, all switches are traversed, giving the summation of all switch response times $d_n = \sum_{i=0}^{l-1} r_i$. Thus, the expected delay for all such routings can be expressed as

$$\delta = p(1-m)\,p_\delta \left( \sum_{i=0}^{l-1} r_i + d_m \right) \quad (8)$$

where

$$p_\delta = 1 - p_\beta - p_\gamma \quad (9)$$

and $p_\beta$ and $p_\gamma$ are given by equations (4) and (6), respectively.

The above equations are valid when the local memory is accessed with probability m and all other memories are addressed with equal probability $(1-m)/(N-1)$. In an actual system there will be more interaction between the tasks within a cluster; the equations can easily be extended to include such cases.

[Table II. Number of destinations reachable from processor 0 with each routing, as a function of network size: FU or BU routing covers $(N-1)(p_\beta + p_\gamma)$ destinations and FW or BW routing covers $(N-1)\,p_\delta$ destinations]

Table II shows the number of destinations that can be reached from processor 0 with each of the routings as a function of network size, for 2*2 switches. It can be observed from the table that a significant number of connections benefit from routings other than the FW or BW routing commonly adopted today. Also, the same number of processors use FU or BU routing in two successive network sizes. We can explain this behavior by an example: consider $l = 6$ and $l = 7$, corresponding to network sizes of 64 and 128, respectively.
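As an illustration of equations (4), (6) and (9), the sketch below (ours) counts the destinations from a fixed source that each routing covers, and confirms the observation that the successive sizes l = 6 and l = 7 share the same FU/BU destination count:

```python
def destination_counts(k, l):
    """Destinations covered by each routing, per equations (4), (6), (9):
    cluster = k-1; non-cluster FU/BU = 2 * sum_{i=1}^{d-1} (k-1)*k^i with
    d = floor(l/2); FW/BW = the remaining destinations."""
    N = k ** l
    d = l // 2
    cluster = k - 1
    fu_bu = 2 * sum((k - 1) * k**i for i in range(1, d))
    fw_bw = (N - 1) - cluster - fu_bu
    return cluster + fu_bu, fw_bw

# 2*2 switches: l = 6 (N = 64) and l = 7 (N = 128) both cover 13
# destinations with FU/BU routing.
assert destination_counts(2, 6)[0] == destination_counts(2, 7)[0] == 13
```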

The networks, though of different sizes, have the same number of destinations for FU routing because the addition of one stage introduces a true center stage (center stage = 3) in $l = 7$, while there is no true center stage in $l = 6$. Since the center bit of the CT has to be 0 for FU or BU routing, the addition of the center stage does not increase the number of possible FU or BU routings.

The delays $r_0$, $r_i$, $d_n$ and $d_m$ depend on (a) the amount of traffic in the network, which in turn is a function of $P_u$ itself, and (b) the service demand at the individual service centers. The queueing analysis for the delays is given next.

C. Queueing Delays in Switches

In order to simplify the analysis, each stage in the network is considered in isolation from the other stages. Consider a queueing center with n inputs. Let q be the probability that there is a packet at one of the inputs in any given cycle, and let t be the service demand of a packet at the service center, in cycles. The number of requests arriving at the queue during the service time of a previous request forms a binomial distribution with number of trials $nt$ and success probability q. The mean number of arriving requests is $E = ntq$ and the variance is $V = ntq(1-q)$. The average queue length Q at the queueing center can be found using the Pollaczek-Khinchine (P-K) mean value formula [14],

$$Q = \frac{E}{2} + \frac{V}{2(1-E)} \quad (10)$$

The throughput of these requests is $E/t$. Hence, by Little's law, the mean response time of the center can be derived as

$$r = \frac{Q\,t}{E} = \frac{t}{2} + \frac{(1-q)\,t}{2(1 - ntq)} \quad (11)$$
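Equations (10) and (11) transcribe directly into code. The sketch below is our own illustration; it assumes the stability condition $ntq < 1$ and a nonzero arrival probability q:

```python
def response_time(q, n, t):
    """Mean response time of a queueing center with n inputs, per-input
    packet probability q per cycle, and service time t cycles
    (equations (10) and (11)); requires 0 < n*t*q < 1."""
    E = n * t * q                       # mean arrivals per service time
    V = n * t * q * (1 - q)             # variance (binomial arrivals)
    Q = E / 2 + V / (2 * (1 - E))       # P-K mean value formula, eq. (10)
    return Q * t / E                    # Little's law, eq. (11)

# MBN switch at stage i: n = 2k ports contend for the bus.
r0 = response_time(q=0.05, n=4, t=1)    # 2*2 switch, t_s = 1 cycle
```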

[Figure 8. Queues at the MBN and BMIN switches: (a) MBN switch queue, with k inputs per side contending for the bus and 2k outputs; (b) BMIN switch queue, with 2k inputs and 2k outputs]

Queueing models of an MBN switch and a BMIN switch are shown in Figure 8. In an MBN switch, packets from the k right ports and k left ports contend for the bus. For a switch at stage i, we use $n = 2k$, $t = t_s$ and $q = P_u\,p(1-m)\,p_i$, where $p_i$ is the probability that a packet visits stage i. We can then calculate the mean response time $r_i$ of any of these MBN switches using the following equation:

$$r_i = \frac{t_s}{2} + \frac{(1-q)\,t_s}{2(1 - 2k\,t_s\,q)} \quad (12)$$

The network delay $d_n$ is the sum of the response times of the stages a packet visits while being routed through the network. In the case of the BMIN, a switch has 2k inputs and 2k outputs. The request probability at an input or output of a BMIN switch at stage i is $P_u\,p(1-m)\,p_i$. Following the model shown in Fig. 8b, we can calculate the response time of a BMIN switch by using $n = 2k$, $t = t_s$ and $q = P_u\,p(1-m)\,p_i / n$. The total network delay $d_n$ is again the sum of the response times of the switches at the different stages.

In both networks, the mean number of requests arriving at a memory module is $E_m = q_i + t_m q_e$, where $q_i$ and $q_e$ are the internal and external requests for that memory module, respectively, and the variance is $V_m = q_i(1-q_i) + t_m q_e(1-q_e)$. Hence, the average memory queue length is

$$l_m = \frac{q_i + t_m q_e}{2} + \frac{q_i(1-q_i) + t_m q_e(1-q_e)}{2\,(1 - q_i - t_m q_e)} \quad (13)$$

and the mean memory response time, or delay, is $d_m = l_m t_m / E_m$.

A packet (or request) takes the optimal path from its source to its destination; the number of switches traversed depends on the nature of the CT and RCT. The delays derived here are inserted into equations (3), (5), (7) and (8), which in turn are plugged into equation (2) to obtain the processor utilization and the response time of the network. This yields a nonlinear equation with $P_u$ as the single variable, which is solved by the following iteration technique:

1. Initialize $P_u$ with a guess of the expected processor utilization. The better the guess, the fewer iterations the computation takes.
2. Calculate the request probabilities at each stage of the network and at the memory module. An intermediate step is to calculate the static values of $p_i$ (the probability that a stage of the network is traversed).
3. Calculate the mean switch response times $r_i$ and the memory response time.
4. Based on the above values, calculate the network and memory delays using the equations for $\alpha$, $\beta$, $\gamma$ and $\delta$.
5. Based on these values, calculate a new processor utilization $P_u$.
6. Repeat steps 2-5 until the new $P_u$ is within some tolerance of the previous $P_u$.

An initial value of 0.5 for $P_u$ and a fixed convergence tolerance were used to generate the analytical results presented in the next section.
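A minimal sketch of this fixed-point iteration (ours; steps 2-4 are abstracted into a caller-supplied function that evaluates the $\alpha$, $\beta$, $\gamma$, $\delta$ delay terms at the current estimate of $P_u$):

```python
def solve_pu(delay_terms, pu0=0.5, tol=1e-6, max_iter=1000):
    """Iterate equation (2) to a fixed point. delay_terms(pu) must return
    (alpha, beta, gamma, delta) computed from the traffic implied by pu."""
    pu = pu0
    for _ in range(max_iter):
        a, b, g, d = delay_terms(pu)          # steps 2-4: delays at this pu
        new_pu = 1.0 / (1.0 + a + b + g + d)  # step 5: equation (2)
        if abs(new_pu - pu) < tol:            # step 6: convergence test
            return new_pu
        pu = new_pu
    return pu
```

In a full implementation, delay_terms would evaluate equations (3)-(8), recomputing the switch and memory response times from the request rates implied by the current $P_u$.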

25 "simulation_mbn_m=0.1" "analysis_mbn_m=0.1" "simulation_mbn_m=0.9" "analysis_mbn_m=0.9" Processor Utilization Request Probability Fig. 9. Comparison of analysis and simulation for processor utilizations of the MBN, varying m In our simulations each cycle was considered to be the time required for the transmission of a packet from one output buer of a switch to the next stage output buer. This includes the transmission of the packet through the link and the time a switch takes to route it to the corresponding destination buer. The minimum time taken for a packet to reach memory is based on the number of switches that the routing covers. All four routings discussed in Section 2 are used in the simulations. The simulation compares each source and destination by running the optimal routing algorithm and then chooses the proper routing. The choice between backward or forward routing is made as follows. All memory requests that could use either forward (FW) or backward (BW) routing use forward routing. All acknowledgements packets use backward routing to keep the load distribution same on both routings. Apart from these dierences the routing decisions are based solely on the tags generated by the optimal routing algorithm. The probabilities p, m, etc. are fed to the simulation as input parameters. All the memories except for the local memories are equally likely to be addressed upon a memory request. In this section we present the relative performance of BMIN and MBN. We start by comparing the

26 "simulation_mbn_m=0.1" "analysis_mbn_m=0.1" "simulation_mbn_m=0.9" "analysis_mbn_m=0.9" Response Time Request Probability Fig. 10. Comparison of analysis and simulation for response time of the MBN, varying m "mbn_m=0.1" "bmin_m=0.1" "cmin_m=0.1" "mbn_m=0.9" "bmin_m=0.9" "cmin_m=0.9" Processor Utilization Request Probability Fig. 11. Comparison of processor utilizations, varying m

27 "mbn_m=0.1" "bmin_m=0.1" "cmin_m=0.1" "mbn_m=0.9" "bmin_m=0.9" "cmin_m=0.9" Response Time Request Probability Fig. 12. Comparison of response time, varying m results from the simulation versus those obtained from the analysis of the MBN. Many simulation experiments were run to verify the analytical models developed in this paper. The simulation results closely matched the analysis under all varied parameters. Here we present some results for a system with 2 2 switches. Memory service time is assumed to be 4 cycles. Processor utilization, P u is dened as the average amount of useful work the processor does in a given cycle. Response time is dened as the average dierence between the time when a processor submits a memory request and the time when it gets the reply back. Figures 9 and 10 show a comparison of analysis and simulation results for the processor utilization and response time of the MBN. In both the gures, the analytical results match very closely with those of simulation indicating that the independence of queues assumed during the analysis does not cause much of an error. The plots show the results from the analysis and simulation as a function of the memory request probability, p. In this plot the memory request probability (p) is varied from 0.1 to 1.0 and two values for the local memory request probability m (0.1 and 0.9) are chosen. For larger values of m, more requests are satised without going through the MBN. Thus, for m = 0:9, P u

28 "p=0.1_mbn" "p=0.1_bmin" "p=0.5_mbn" "p=0.5_bmin" Processor Utilization Number of Processors Fig. 13. Processor Utilization: Scalability of MBN vs BMIN, varying p is much higher in Fig 9 and response time is much lower in Fig. 10. As p gets larger, more requests are generated and the response time increases due and the processor utilization reduces to a higher amount of trac and queueing delays. Figures 11 and 12 show a comparison of the performance of the MBN to that of the conventional MIN (CMIN) and the proposed bidirectional MIN (BMIN). Conventional MIN is similar to the network employed in the Buttery machine [3], where both the request and the response packets travel in one direction from processor to the memory side. On the other hand, BMINs allow all the four routings proposed for the MBN. The two plots show the processor utilization and the response time of the three networks for two dierent values of m. The MBN behaves exactly similar to the CMIN and the BMIN in terms of processor utilization. The response time is also same for all the networks for m = 0:9. For m = 0:1 the BMIN performs better than the MBN and the MBN performs better than the CMIN. Figures 13 and 14 show the processor utilization and response times for various system sizes. The results were obtained with a local memory request probability, m, xed at 0:5 and for two dierent values of p (0.1 and 0.5). We can see from the gures that, even as the system size grows, the

[Figure 14. Response Time: scalability of MBN vs. BMIN, varying p]

performance of the MBN remains close to that of the BMIN. The curves for the CMIN are left off for clarity, but we observed that the MBN always performs better than the CMIN. It can also be seen from the figures that the reduction in performance as the system size doubles is not large, indicating that the MBN is highly scalable for the given traffic load. The processor utilization remains approximately between 0.5 and 0.4 as the system size changes from 32 to 1024 with 2*2 switches, whereas the request probability p has a much greater effect on performance.

Finally, we present the processor utilization and the response time of the MBN for different switch sizes and different numbers of processors in Table III. Some places in the table are left empty because an N*N MBN cannot be built using only those k*k switches. Both the request probability (p) and the local memory request probability (m) are fixed at 0.5. For N = 64, there is a decrease in $P_u$ when k is increased from 4 to 8, because the MBN is less efficient due to the increased contention and delay in an 8*8 bus-based switch. On the other hand, in a larger system, when k is increased from 2 to 8, there is a good improvement: the number of switches in the entire network is still quite high, keeping the contention low enough to gain in performance.

[Table III. Processor utilization ($P_u$) and response time (RT) of the MBN and BMIN, varying the switch size k (2, 4, 8) and the number of processors N]

If we compare the MBN's performance to the BMIN's, we can see that as the switch size increases, the BMIN gives a higher processor utilization and a lower response time. This increase in performance is due to the lower contention in the crossbar switch. However, the BMIN buys this performance at the expense of cost. In [8], a cost parameter based on the number of connection points in a switch is presented: the number of connections is $k^2$ for a k*k crossbar switch, whereas for a bus it is $2k$. Thus, the total costs of the BMIN and the MBN are $kN\log_k N$ and $2N\log_k N$, respectively. If we include these parameters along with the processor utilization and the response time, the cost-effectiveness of the MBN is higher than that of the BMIN, as shown in [8]. A 4*4 switch size works out to be the most cost-efficient for the different network sizes and workload inputs.

VI. Execution-driven Simulation and Results

The execution time of an application on a multiprocessor architecture is the ultimate indicator of performance. In order to show that the MBN performs similarly to the Bidirectional MIN (BMIN), we study their performance using an execution-driven simulation of various applications. Our simulator is based on Proteus [10], originally developed at MIT. The original simulator modeled indirect interconnection networks with an analytical model; we have modified it extensively to model the BMIN and the MBN exactly, using 2*2 switches and a packet-switching strategy. The system considered in this paper has private cache memories that operate under a directory-based cache-coherence protocol [15]. The node configuration and the network


More information

Physical Organization of Parallel Platforms. Alexandre David

Physical Organization of Parallel Platforms. Alexandre David Physical Organization of Parallel Platforms Alexandre David 1.2.05 1 Static vs. Dynamic Networks 13-02-2008 Alexandre David, MVP'08 2 Interconnection networks built using links and switches. How to connect:

More information

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers

Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Non-Uniform Memory Access (NUMA) Architecture and Multicomputers Parallel and Distributed Computing MSc in Information Systems and Computer Engineering DEA in Computational Engineering Department of Computer

More information

Intersection of sets *

Intersection of sets * OpenStax-CNX module: m15196 1 Intersection of sets * Sunil Kumar Singh This work is produced by OpenStax-CNX and licensed under the Creative Commons Attribution License 2.0 We have pointed out that a set

More information

Scalable Cache Coherence

Scalable Cache Coherence arallel Computing Scalable Cache Coherence Hwansoo Han Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels of caches on a processor Large scale multiprocessors with hierarchy

More information

Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano

Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano Introduction to Multiprocessors (Part II) Cristina Silvano Politecnico di Milano Outline The problem of cache coherence Snooping protocols Directory-based protocols Prof. Cristina Silvano, Politecnico

More information

Multiprocessor Interconnection Networks

Multiprocessor Interconnection Networks Multiprocessor Interconnection Networks Todd C. Mowry CS 740 November 19, 1998 Topics Network design space Contention Active messages Networks Design Options: Topology Routing Direct vs. Indirect Physical

More information

This paper describes and evaluates the Dual Reinforcement Q-Routing algorithm (DRQ-Routing)

This paper describes and evaluates the Dual Reinforcement Q-Routing algorithm (DRQ-Routing) DUAL REINFORCEMENT Q-ROUTING: AN ON-LINE ADAPTIVE ROUTING ALGORITHM 1 Shailesh Kumar Risto Miikkulainen The University oftexas at Austin The University oftexas at Austin Dept. of Elec. and Comp. Engg.

More information

A Bandwidth Latency Tradeoff for Broadcast and Reduction

A Bandwidth Latency Tradeoff for Broadcast and Reduction A Bandwidth Latency Tradeoff for Broadcast and Reduction Peter Sanders and Jop F. Sibeyn Max-Planck-Institut für Informatik Im Stadtwald, 66 Saarbrücken, Germany. sanders, jopsi@mpi-sb.mpg.de. http://www.mpi-sb.mpg.de/sanders,

More information

PE PE PE. Network Interface. Processor Pipeline + Register Windows. Thread Management & Scheduling. N e t w o r k P o r t s.

PE PE PE. Network Interface. Processor Pipeline + Register Windows. Thread Management & Scheduling. N e t w o r k P o r t s. Latency Tolerance: A Metric for Performance Analysis of Multithreaded Architectures Shashank S. Nemawarkar, Guang R. Gao School of Computer Science McGill University 348 University Street, Montreal, H3A

More information

Numerical Evaluation of Hierarchical QoS Routing. Sungjoon Ahn, Gayathri Chittiappa, A. Udaya Shankar. Computer Science Department and UMIACS

Numerical Evaluation of Hierarchical QoS Routing. Sungjoon Ahn, Gayathri Chittiappa, A. Udaya Shankar. Computer Science Department and UMIACS Numerical Evaluation of Hierarchical QoS Routing Sungjoon Ahn, Gayathri Chittiappa, A. Udaya Shankar Computer Science Department and UMIACS University of Maryland, College Park CS-TR-395 April 3, 1998

More information

IBM Almaden Research Center, at regular intervals to deliver smooth playback of video streams. A video-on-demand

IBM Almaden Research Center, at regular intervals to deliver smooth playback of video streams. A video-on-demand 1 SCHEDULING IN MULTIMEDIA SYSTEMS A. L. Narasimha Reddy IBM Almaden Research Center, 650 Harry Road, K56/802, San Jose, CA 95120, USA ABSTRACT In video-on-demand multimedia systems, the data has to be

More information

3-ary 2-cube. processor. consumption channels. injection channels. router

3-ary 2-cube. processor. consumption channels. injection channels. router Multidestination Message Passing in Wormhole k-ary n-cube Networks with Base Routing Conformed Paths 1 Dhabaleswar K. Panda, Sanjay Singal, and Ram Kesavan Dept. of Computer and Information Science The

More information

paper provides an in-depth comparison of two existing reliable multicast protocols, and identies NACK suppression as a problem for reliable multicast.

paper provides an in-depth comparison of two existing reliable multicast protocols, and identies NACK suppression as a problem for reliable multicast. Scalability of Multicast Communication over Wide-Area Networks Donald Yeung Laboratory for Computer Science Cambridge, MA 02139 April 24, 1996 Abstract A multitude of interesting applications require multicast

More information

On Checkpoint Latency. Nitin H. Vaidya. In the past, a large number of researchers have analyzed. the checkpointing and rollback recovery scheme

On Checkpoint Latency. Nitin H. Vaidya. In the past, a large number of researchers have analyzed. the checkpointing and rollback recovery scheme On Checkpoint Latency Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 E-mail: vaidya@cs.tamu.edu Web: http://www.cs.tamu.edu/faculty/vaidya/ Abstract

More information

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for

Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc. Abstract. Direct Volume Rendering (DVR) is a powerful technique for Comparison of Two Image-Space Subdivision Algorithms for Direct Volume Rendering on Distributed-Memory Multicomputers Egemen Tanin, Tahsin M. Kurc, Cevdet Aykanat, Bulent Ozguc Dept. of Computer Eng. and

More information

Interconnection Network

Interconnection Network Interconnection Network Recap: Generic Parallel Architecture A generic modern multiprocessor Network Mem Communication assist (CA) $ P Node: processor(s), memory system, plus communication assist Network

More information

Ecient Processor Allocation for 3D Tori. Wenjian Qiao and Lionel M. Ni. Department of Computer Science. Michigan State University

Ecient Processor Allocation for 3D Tori. Wenjian Qiao and Lionel M. Ni. Department of Computer Science. Michigan State University Ecient Processor llocation for D ori Wenjian Qiao and Lionel M. Ni Department of Computer Science Michigan State University East Lansing, MI 4884-107 fqiaow, nig@cps.msu.edu bstract Ecient allocation of

More information

On Topology and Bisection Bandwidth of Hierarchical-ring Networks for Shared-memory Multiprocessors

On Topology and Bisection Bandwidth of Hierarchical-ring Networks for Shared-memory Multiprocessors On Topology and Bisection Bandwidth of Hierarchical-ring Networks for Shared-memory Multiprocessors Govindan Ravindran Newbridge Networks Corporation Kanata, ON K2K 2E6, Canada gravindr@newbridge.com Michael

More information

IV. PACKET SWITCH ARCHITECTURES

IV. PACKET SWITCH ARCHITECTURES IV. PACKET SWITCH ARCHITECTURES (a) General Concept - as packet arrives at switch, destination (and possibly source) field in packet header is used as index into routing tables specifying next switch in

More information

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax:

Consistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax: Consistent Logical Checkpointing Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 hone: 409-845-0512 Fax: 409-847-8578 E-mail: vaidya@cs.tamu.edu Technical

More information

ECE 697J Advanced Topics in Computer Networks

ECE 697J Advanced Topics in Computer Networks ECE 697J Advanced Topics in Computer Networks Switching Fabrics 10/02/03 Tilman Wolf 1 Router Data Path Last class: Single CPU is not fast enough for processing packets Multiple advanced processors in

More information

Optimal Topology for Distributed Shared-Memory. Multiprocessors: Hypercubes Again? Jose Duato and M.P. Malumbres

Optimal Topology for Distributed Shared-Memory. Multiprocessors: Hypercubes Again? Jose Duato and M.P. Malumbres Optimal Topology for Distributed Shared-Memory Multiprocessors: Hypercubes Again? Jose Duato and M.P. Malumbres Facultad de Informatica, Universidad Politecnica de Valencia P.O.B. 22012, 46071 - Valencia,

More information

Scalable Cache Coherence

Scalable Cache Coherence Scalable Cache Coherence [ 8.1] All of the cache-coherent systems we have talked about until now have had a bus. Not only does the bus guarantee serialization of transactions; it also serves as convenient

More information

Advanced Parallel Architecture. Annalisa Massini /2017

Advanced Parallel Architecture. Annalisa Massini /2017 Advanced Parallel Architecture Annalisa Massini - 2016/2017 References Advanced Computer Architecture and Parallel Processing H. El-Rewini, M. Abd-El-Barr, John Wiley and Sons, 2005 Parallel computing

More information

Multiprocessors - Flynn s Taxonomy (1966)

Multiprocessors - Flynn s Taxonomy (1966) Multiprocessors - Flynn s Taxonomy (1966) Single Instruction stream, Single Data stream (SISD) Conventional uniprocessor Although ILP is exploited Single Program Counter -> Single Instruction stream The

More information

Lecture: Interconnection Networks

Lecture: Interconnection Networks Lecture: Interconnection Networks Topics: Router microarchitecture, topologies Final exam next Tuesday: same rules as the first midterm 1 Packets/Flits A message is broken into multiple packets (each packet

More information

Netsim: A Network Performance Simulator. University of Richmond. Abstract

Netsim: A Network Performance Simulator. University of Richmond. Abstract Netsim: A Network Performance Simulator B. Lewis Barnett, III Department of Mathematics and Computer Science University of Richmond Richmond, VA 23233 barnett@armadillo.urich.edu June 29, 1992 Abstract

More information

EE382 Processor Design. Illinois

EE382 Processor Design. Illinois EE382 Processor Design Winter 1998 Chapter 8 Lectures Multiprocessors Part II EE 382 Processor Design Winter 98/99 Michael Flynn 1 Illinois EE 382 Processor Design Winter 98/99 Michael Flynn 2 1 Write-invalidate

More information

Networks. Wu-chang Fengy Dilip D. Kandlurz Debanjan Sahaz Kang G. Shiny. Ann Arbor, MI Yorktown Heights, NY 10598

Networks. Wu-chang Fengy Dilip D. Kandlurz Debanjan Sahaz Kang G. Shiny. Ann Arbor, MI Yorktown Heights, NY 10598 Techniques for Eliminating Packet Loss in Congested TCP/IP Networks Wu-chang Fengy Dilip D. Kandlurz Debanjan Sahaz Kang G. Shiny ydepartment of EECS znetwork Systems Department University of Michigan

More information

Scalable Cache Coherence. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

Scalable Cache Coherence. Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University Scalable Cache Coherence Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Hierarchical Cache Coherence Hierarchies in cache organization Multiple levels

More information

Memory Systems in Pipelined Processors

Memory Systems in Pipelined Processors Advanced Computer Architecture (0630561) Lecture 12 Memory Systems in Pipelined Processors Prof. Kasim M. Al-Aubidy Computer Eng. Dept. Interleaved Memory: In a pipelined processor data is required every

More information

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence

COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence 1 COEN-4730 Computer Architecture Lecture 08 Thread Level Parallelism and Coherence Cristinel Ababei Dept. of Electrical and Computer Engineering Marquette University Credits: Slides adapted from presentations

More information

perform well on paths including satellite links. It is important to verify how the two ATM data services perform on satellite links. TCP is the most p

perform well on paths including satellite links. It is important to verify how the two ATM data services perform on satellite links. TCP is the most p Performance of TCP/IP Using ATM ABR and UBR Services over Satellite Networks 1 Shiv Kalyanaraman, Raj Jain, Rohit Goyal, Sonia Fahmy Department of Computer and Information Science The Ohio State University

More information

London SW7 2BZ. in the number of processors due to unfortunate allocation of the. home and ownership of cache lines. We present a modied coherency

London SW7 2BZ. in the number of processors due to unfortunate allocation of the. home and ownership of cache lines. We present a modied coherency Using Proxies to Reduce Controller Contention in Large Shared-Memory Multiprocessors Andrew J. Bennett, Paul H. J. Kelly, Jacob G. Refstrup, Sarah A. M. Talbot Department of Computing Imperial College

More information

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors?

Parallel Computers. CPE 631 Session 20: Multiprocessors. Flynn s Tahonomy (1972) Why Multiprocessors? Parallel Computers CPE 63 Session 20: Multiprocessors Department of Electrical and Computer Engineering University of Alabama in Huntsville Definition: A parallel computer is a collection of processing

More information

Multi-Processor / Parallel Processing

Multi-Processor / Parallel Processing Parallel Processing: Multi-Processor / Parallel Processing Originally, the computer has been viewed as a sequential machine. Most computer programming languages require the programmer to specify algorithms

More information

Limits on Interconnection Network Performance. Anant Agarwal. Massachusetts Institute of Technology. Cambridge, MA Abstract

Limits on Interconnection Network Performance. Anant Agarwal. Massachusetts Institute of Technology. Cambridge, MA Abstract Limits on Interconnection Network Performance Anant Agarwal Laboratory for Computer Science Massachusetts Institute of Technology Cambridge, MA 0139 Abstract As the performance of interconnection networks

More information

Reduction of Periodic Broadcast Resource Requirements with Proxy Caching

Reduction of Periodic Broadcast Resource Requirements with Proxy Caching Reduction of Periodic Broadcast Resource Requirements with Proxy Caching Ewa Kusmierek and David H.C. Du Digital Technology Center and Department of Computer Science and Engineering University of Minnesota

More information

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks

Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks Performance of Multihop Communications Using Logical Topologies on Optical Torus Networks X. Yuan, R. Melhem and R. Gupta Department of Computer Science University of Pittsburgh Pittsburgh, PA 156 fxyuan,

More information

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 10: Introduction to Coherence. The Lecture Contains:

Module 5: Performance Issues in Shared Memory and Introduction to Coherence Lecture 10: Introduction to Coherence. The Lecture Contains: The Lecture Contains: Four Organizations Hierarchical Design Cache Coherence Example What Went Wrong? Definitions Ordering Memory op Bus-based SMP s file:///d /...audhary,%20dr.%20sanjeev%20k%20aggrwal%20&%20dr.%20rajat%20moona/multi-core_architecture/lecture10/10_1.htm[6/14/2012

More information

Module 17: "Interconnection Networks" Lecture 37: "Introduction to Routers" Interconnection Networks. Fundamentals. Latency and bandwidth

Module 17: Interconnection Networks Lecture 37: Introduction to Routers Interconnection Networks. Fundamentals. Latency and bandwidth Interconnection Networks Fundamentals Latency and bandwidth Router architecture Coherence protocol and routing [From Chapter 10 of Culler, Singh, Gupta] file:///e /parallel_com_arch/lecture37/37_1.htm[6/13/2012

More information

Network-on-chip (NOC) Topologies

Network-on-chip (NOC) Topologies Network-on-chip (NOC) Topologies 1 Network Topology Static arrangement of channels and nodes in an interconnection network The roads over which packets travel Topology chosen based on cost and performance

More information

MDP Routing in ATM Networks. Using the Virtual Path Concept 1. Department of Computer Science Department of Computer Science

MDP Routing in ATM Networks. Using the Virtual Path Concept 1. Department of Computer Science Department of Computer Science MDP Routing in ATM Networks Using the Virtual Path Concept 1 Ren-Hung Hwang, James F. Kurose, and Don Towsley Department of Computer Science Department of Computer Science & Information Engineering University

More information

3. Evaluation of Selected Tree and Mesh based Routing Protocols

3. Evaluation of Selected Tree and Mesh based Routing Protocols 33 3. Evaluation of Selected Tree and Mesh based Routing Protocols 3.1 Introduction Construction of best possible multicast trees and maintaining the group connections in sequence is challenging even in

More information

Lecture 2: Topology - I

Lecture 2: Topology - I ECE 8823 A / CS 8803 - ICN Interconnection Networks Spring 2017 http://tusharkrishna.ece.gatech.edu/teaching/icn_s17/ Lecture 2: Topology - I Tushar Krishna Assistant Professor School of Electrical and

More information

Ecube Planar adaptive Turn model (west-first non-minimal)

Ecube Planar adaptive Turn model (west-first non-minimal) Proc. of the International Parallel Processing Symposium (IPPS '95), Apr. 1995, pp. 652-659. Global Reduction in Wormhole k-ary n-cube Networks with Multidestination Exchange Worms Dhabaleswar K. Panda

More information

Neuro-Remodeling via Backpropagation of Utility. ABSTRACT Backpropagation of utility is one of the many methods for neuro-control.

Neuro-Remodeling via Backpropagation of Utility. ABSTRACT Backpropagation of utility is one of the many methods for neuro-control. Neuro-Remodeling via Backpropagation of Utility K. Wendy Tang and Girish Pingle 1 Department of Electrical Engineering SUNY at Stony Brook, Stony Brook, NY 11794-2350. ABSTRACT Backpropagation of utility

More information

EE/CSCI 451: Parallel and Distributed Computation

EE/CSCI 451: Parallel and Distributed Computation EE/CSCI 451: Parallel and Distributed Computation Lecture #4 1/24/2018 Xuehai Qian xuehai.qian@usc.edu http://alchem.usc.edu/portal/xuehaiq.html University of Southern California 1 Announcements PA #1

More information

Module 5 Introduction to Parallel Processing Systems

Module 5 Introduction to Parallel Processing Systems Module 5 Introduction to Parallel Processing Systems 1. What is the difference between pipelining and parallelism? In general, parallelism is simply multiple operations being done at the same time.this

More information

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance

Lecture 13: Interconnection Networks. Topics: lots of background, recent innovations for power and performance Lecture 13: Interconnection Networks Topics: lots of background, recent innovations for power and performance 1 Interconnection Networks Recall: fully connected network, arrays/rings, meshes/tori, trees,

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

Relative Reduced Hops

Relative Reduced Hops GreedyDual-Size: A Cost-Aware WWW Proxy Caching Algorithm Pei Cao Sandy Irani y 1 Introduction As the World Wide Web has grown in popularity in recent years, the percentage of network trac due to HTTP

More information

reasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap

reasonable to store in a software implementation, it is likely to be a signicant burden in a low-cost hardware implementation. We describe in this pap Storage-Ecient Finite Field Basis Conversion Burton S. Kaliski Jr. 1 and Yiqun Lisa Yin 2 RSA Laboratories 1 20 Crosby Drive, Bedford, MA 01730. burt@rsa.com 2 2955 Campus Drive, San Mateo, CA 94402. yiqun@rsa.com

More information

Computer parallelism Flynn s categories

Computer parallelism Flynn s categories 04 Multi-processors 04.01-04.02 Taxonomy and communication Parallelism Taxonomy Communication alessandro bogliolo isti information science and technology institute 1/9 Computer parallelism Flynn s categories

More information

CS 204 Lecture Notes on Elementary Network Analysis

CS 204 Lecture Notes on Elementary Network Analysis CS 204 Lecture Notes on Elementary Network Analysis Mart Molle Department of Computer Science and Engineering University of California, Riverside CA 92521 mart@cs.ucr.edu October 18, 2006 1 First-Order

More information

MULTIPROCESSORS. Characteristics of Multiprocessors. Interconnection Structures. Interprocessor Arbitration

MULTIPROCESSORS. Characteristics of Multiprocessors. Interconnection Structures. Interprocessor Arbitration MULTIPROCESSORS Characteristics of Multiprocessors Interconnection Structures Interprocessor Arbitration Interprocessor Communication and Synchronization Cache Coherence 2 Characteristics of Multiprocessors

More information

Assignment 5. Georgia Koloniari

Assignment 5. Georgia Koloniari Assignment 5 Georgia Koloniari 2. "Peer-to-Peer Computing" 1. What is the definition of a p2p system given by the authors in sec 1? Compare it with at least one of the definitions surveyed in the last

More information

Interconnection networks

Interconnection networks Interconnection networks When more than one processor needs to access a memory structure, interconnection networks are needed to route data from processors to memories (concurrent access to a shared memory

More information

1. Memory technology & Hierarchy

1. Memory technology & Hierarchy 1. Memory technology & Hierarchy Back to caching... Advances in Computer Architecture Andy D. Pimentel Caches in a multi-processor context Dealing with concurrent updates Multiprocessor architecture In

More information

Multiprocessor Interconnection Networks- Part Three

Multiprocessor Interconnection Networks- Part Three Babylon University College of Information Technology Software Department Multiprocessor Interconnection Networks- Part Three By The k-ary n-cube Networks The k-ary n-cube network is a radix k cube with

More information

The final publication is available at

The final publication is available at Document downloaded from: http://hdl.handle.net/10251/82062 This paper must be cited as: Peñaranda Cebrián, R.; Gómez Requena, C.; Gómez Requena, ME.; López Rodríguez, PJ.; Duato Marín, JF. (2016). The

More information

Reinforcement Learning Scheme. for Network Routing. Michael Littman*, Justin Boyan. School of Computer Science. Pittsburgh, PA

Reinforcement Learning Scheme. for Network Routing. Michael Littman*, Justin Boyan. School of Computer Science. Pittsburgh, PA A Distributed Reinforcement Learning Scheme for Network Routing Michael Littman*, Justin Boyan Carnegie Mellon University School of Computer Science Pittsburgh, PA * also Cognitive Science Research Group,

More information

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2 Introduction :- Today single CPU based architecture is not capable enough for the modern database that are required to handle more demanding and complex requirements of the users, for example, high performance,

More information

Chapter 5 Lempel-Ziv Codes To set the stage for Lempel-Ziv codes, suppose we wish to nd the best block code for compressing a datavector X. Then we ha

Chapter 5 Lempel-Ziv Codes To set the stage for Lempel-Ziv codes, suppose we wish to nd the best block code for compressing a datavector X. Then we ha Chapter 5 Lempel-Ziv Codes To set the stage for Lempel-Ziv codes, suppose we wish to nd the best block code for compressing a datavector X. Then we have to take into account the complexity of the code.

More information

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s]

Cluster quality 15. Running time 0.7. Distance between estimated and true means Running time [s] Fast, single-pass K-means algorithms Fredrik Farnstrom Computer Science and Engineering Lund Institute of Technology, Sweden arnstrom@ucsd.edu James Lewis Computer Science and Engineering University of

More information

A Comparative Study of Bidirectional Ring and Crossbar Interconnection Networks

A Comparative Study of Bidirectional Ring and Crossbar Interconnection Networks A Comparative Study of Bidirectional Ring and Crossbar Interconnection Networks Hitoshi Oi and N. Ranganathan Department of Computer Science and Engineering, University of South Florida, Tampa, FL Abstract

More information

Shared Memory Architecture Part One

Shared Memory Architecture Part One Babylon University College of Information Technology Software Department Shared Memory Architecture Part One By Classification Of Shared Memory Systems The simplest shared memory system consists of one

More information

Lecture 30: Multiprocessors Flynn Categories, Large vs. Small Scale, Cache Coherency Professor Randy H. Katz Computer Science 252 Spring 1996

Lecture 30: Multiprocessors Flynn Categories, Large vs. Small Scale, Cache Coherency Professor Randy H. Katz Computer Science 252 Spring 1996 Lecture 30: Multiprocessors Flynn Categories, Large vs. Small Scale, Cache Coherency Professor Randy H. Katz Computer Science 252 Spring 1996 RHK.S96 1 Flynn Categories SISD (Single Instruction Single

More information

Architecture-Dependent Tuning of the Parameterized Communication Model for Optimal Multicasting

Architecture-Dependent Tuning of the Parameterized Communication Model for Optimal Multicasting Architecture-Dependent Tuning of the Parameterized Communication Model for Optimal Multicasting Natawut Nupairoj and Lionel M. Ni Department of Computer Science Michigan State University East Lansing,

More information

A Study of Query Execution Strategies. for Client-Server Database Systems. Department of Computer Science and UMIACS. University of Maryland

A Study of Query Execution Strategies. for Client-Server Database Systems. Department of Computer Science and UMIACS. University of Maryland A Study of Query Execution Strategies for Client-Server Database Systems Donald Kossmann Michael J. Franklin Department of Computer Science and UMIACS University of Maryland College Park, MD 20742 f kossmann

More information

under Timing Constraints David Filo David Ku Claudionor N. Coelho, Jr. Giovanni De Micheli

under Timing Constraints David Filo David Ku Claudionor N. Coelho, Jr. Giovanni De Micheli Interface Optimization for Concurrent Systems under Timing Constraints David Filo David Ku Claudionor N. Coelho, Jr. Giovanni De Micheli Abstract The scope of most high-level synthesis eorts to date has

More information

Scalable Cache Coherent Systems

Scalable Cache Coherent Systems NUM SS Scalable ache oherent Systems Scalable distributed shared memory machines ssumptions: rocessor-ache-memory nodes connected by scalable network. Distributed shared physical address space. ommunication

More information

Availability of Coding Based Replication Schemes. Gagan Agrawal. University of Maryland. College Park, MD 20742

Availability of Coding Based Replication Schemes. Gagan Agrawal. University of Maryland. College Park, MD 20742 Availability of Coding Based Replication Schemes Gagan Agrawal Department of Computer Science University of Maryland College Park, MD 20742 Abstract Data is often replicated in distributed systems to improve

More information

Lecture 2: Snooping and Directory Protocols. Topics: Snooping wrap-up and directory implementations

Lecture 2: Snooping and Directory Protocols. Topics: Snooping wrap-up and directory implementations Lecture 2: Snooping and Directory Protocols Topics: Snooping wrap-up and directory implementations 1 Split Transaction Bus So far, we have assumed that a coherence operation (request, snoops, responses,

More information