Fault-Tolerant Hierarchical Networks for Shared Memory Multiprocessors and their Bandwidth Analysis


© British Computer Society 2002

SYED MASUD MAHMUD, L. TISSA SAMARATUNGA AND SHILPA KOMMIDI
Department of Electrical and Computer Engineering, Wayne State University, Detroit, MI 48202, USA
Email: smahmud@ece.eng.wayne.edu

Many researchers have paid significant attention to the design of cluster-based systems, because such systems need much less expensive networks than non-cluster-based systems. A number of hierarchical interconnection networks (HINs) have also been proposed in the literature for building large cluster-based systems. Most of the existing HINs are not fault tolerant. It is very desirable that a HIN be fault tolerant, because even a single fault in the network can completely disconnect a large number of processors and/or memory modules from the rest of the processors and memory modules of the system; as a result, the performance of the system decreases significantly. In this paper we propose two types of hierarchical interconnection networks which are fault tolerant and can be used to build large cluster-based multiprocessor systems. We also develop analytical models to determine the performance of the proposed fault-tolerant HINs under fault-free and faulty conditions, and simulation models to verify the accuracy of the analytical models. The results obtained from the analytical models were found to be very close to those obtained from the simulation models. The technique used to develop the models in this paper can also be used to develop models for other hierarchical systems.

Received 1997; revised 6 September 2001

1. INTRODUCTION

Recently a great deal of attention has been paid to the design of cluster-based multiprocessor systems [1–23].
Cluster-based design is very appealing when a system is to be built with a very large number of processors and memory modules: a cluster-based multiprocessor system needs a less expensive interconnection network than a non-cluster-based system. A number of cluster-based designs are available in the literature. The Cm* [1] is made up of 50 processor–memory pairs called compute modules, grouped into clusters. Communication within a cluster is via a parallel bus controlled by an address-mapping processor termed a Kmap. There are five clusters, and these communicate via an intercluster bus. The CEDAR system [2, 3] uses a bus interconnection between the processors within a cluster and the cluster memory they share, and a multistage interconnection network between all processors and the global memory shared among all clusters. The DASH multiprocessor [4] is also a cluster-based system: the processors and memory modules of a cluster are connected by a bus, the system can have as many clusters as needed, and all the clusters can be connected by a general interconnection network. (A short version of this paper was presented at the IEEE International Conference on Algorithms and Architectures for Parallel Processing, Brisbane, Australia, April 19–21, 1995.) A cluster structure using shared buses as the basic interconnection medium has been proposed by Wu and Liu [5]. Multiple levels of clustering may be present in their organization: shared buses interconnect the units within a cluster, and the entire system is built using a hierarchy of buses. The Fat Tree network [24] provides uniform bandwidth between any two end-points on a net by doubling the number of paths as one goes up the tree; its cost, however, increases significantly compared with that of the other hierarchical systems described below. The CM-5 [25] is a message-passing system. Its internal networks include two components, a data network and a control network.
The topology of the data network is a fat tree. The KSR-1 [26] is a shared memory system consisting of a hierarchy of rings. Agrawal and Mahgoub [6, 7] proposed a cluster-based multiprocessor system where a hierarchical interconnection network (HIN) is used for communication. Conflict-free access within each cluster is provided by relatively small crossbar switches. They showed that a cluster-based scheme provides results close to those of a fully connected crossbar system if every processor accesses memory modules within its own cluster more frequently than other memory modules. Mahgoub and Elmagarmid [8] proposed a generalized class of cluster-based multiprocessor systems. They proposed

a multilevel hierarchical network for their systems, which consists of a large number of small crossbar switches. The performance of their network is very close to that of a full crossbar connection if a processor accesses its nearer memory modules more frequently than remote memory modules. Potlapalli and Agrawal [9] proposed a HIN called the Hierarchical Multistage Interconnection Network. This network consists of many levels, and the network at each level is built using multistage interconnection networks. A number of other hierarchical interconnection networks have been proposed in the literature [10–29] which can be used for multiprocessor and multicomputer systems. The motivation behind designing HINs is to exploit the inherent locality that exists in many general and parallel computations. The success of all cluster-based systems with reduced or limited interconnection depends on the locality of computations: a processor must access the memory modules within its own cluster more frequently than those in other clusters. In fact, the analysis of all hierarchical networks assumes that the probability that a processor generates a reference for one of its $i$th-level memory modules is $p_i$, where $p_i > p_j$ for all $i < j$. The performance of a HIN is very sensitive to network faults. Sometimes a single fault in the network can degrade the performance of the system very significantly, depending upon the location of the fault. For example, if any one of the HINs presented in [5, 6, 8, 9] has a faulty link, then that faulty link will isolate a number of devices (processors and/or memory modules) from the rest of the system. The number of devices that will be isolated depends on the location of the fault: a fault at a higher level will isolate more devices than a fault at a lower level.
Since all the devices of a hierarchical system cannot be used together in the presence of a fault in the HIN, the performance of the system will degrade, and the amount of degradation will depend on the location of the fault. The degradation is significant if the fault occurs at or near the highest level of the system. Moreover, if multiple faults exist in a HIN, these faults may divide the entire system into many small isolated subsystems; in the presence of multiple faults, the system may not be usable at all. In this paper we present two fault-tolerant HINs. Both HINs are designed using many small crossbar switches. In the first type of HIN, multiple links are used at every input and output port of the crossbar switches. The bandwidth available from a port of a crossbar depends on the number of links present at that port. Thus, when a link becomes faulty, the bandwidth of the corresponding port decreases, rather than a number of devices becoming disconnected from the rest of the system. Hence, all the devices of the entire system can still be used together, with only a slight degradation in performance. In the second type of HIN we use only one link at every input and output port of a crossbar, but we add a small backup circuit to every crossbar in order to tolerate one or more faults within the crossbar. We have developed analytical models to determine the memory bandwidth of both types of HINs under fault-free and faulty conditions, and we have verified the analytical models using extensive simulations. Most of the results from the analytical models match those from the simulation models very closely (within 5%). Section 2 describes our fault-tolerant HINs. The analytical models are presented in Section 3. Results from the analytical and simulation models are presented and discussed in Section 4, and the conclusions are presented in Section 5.
2. DESCRIPTION OF FAULT-TOLERANT HIERARCHICAL INTERCONNECTION NETWORKS

The performance of a hierarchical system is very sensitive to network faults. If a link cannot be used, either because the link itself is faulty or because there is a fault in the network which makes the link unusable, then a set of processors and memory modules becomes disconnected from the rest of the system. As a result, the performance of the system may degrade significantly, depending upon what fraction of the processors and memory modules falls within the disconnected set. Thus, it is very desirable that a hierarchical interconnection network be fault tolerant.

2.1. A multiple-link-based HIN

The multiple-link-based HIN presented in this paper has many levels of hierarchy. The processors and memory modules of the system are grouped into a number of processor–memory clusters (PMCs), called the local-level or zeroth-level clusters. Every zeroth-level cluster has $n_0$ processors and $m_0$ memory modules. Every zeroth-level cluster also has an inlet with $b_1$ links coming from the first-level (parent) IN, and an outlet with $a_1$ links going to the parent IN. The interconnection network inside a zeroth-level cluster, called the zeroth-level IN, is built using an $(n_0 + b_1) \times (m_0 + a_1)$ crossbar switch. A first-level IN is connected to $k_1$ zeroth-level INs (child INs) on one side and to a second-level IN (parent IN) on the other side. A first-level IN has $k_1 + 1$ input ports (inlets) and $k_1 + 1$ output ports (outlets). A first-level IN is connected to its second-level parent IN using an inlet and an outlet containing $b_2$ and $a_2$ links, respectively. Each of the other inlets and outlets of a first-level IN has $a_1$ and $b_1$ links, respectively, and these inlets and outlets are used to make connections between the first-level IN and the $k_1$ zeroth-level child INs. A first-level IN is built using a $(k_1 a_1 + b_2) \times (k_1 b_1 + a_2)$ crossbar switch.
In general, if a hierarchical system has $L$ levels, then an $i$th-level ($1 \le i \le L-2$) IN is connected to $k_i$ $(i-1)$th-level INs on one side and to an $(i+1)$th-level IN on the other side. An $i$th-level IN has $k_i + 1$ inlets and $k_i + 1$ outlets. One inlet has $b_{i+1}$ links coming from the $(i+1)$th-level parent IN, and each of the other $k_i$ inlets has $a_i$ links coming from an $(i-1)$th-level child IN.

FIGURE 1. A four-level multiple-link-based hierarchical interconnection network.

One outlet has $a_{i+1}$ links going to the $(i+1)$th-level parent IN, and each of the other $k_i$ outlets has $b_i$ links going to an $(i-1)$th-level child IN. This network is built using a $(k_i a_i + b_{i+1}) \times (k_i b_i + a_{i+1})$ crossbar switch. The highest-level IN of an $L$-level hierarchical system has $k_{L-1}$ inlets and $k_{L-1}$ outlets. Every inlet has $a_{L-1}$ links coming from an $(L-2)$th-level child IN, and every outlet has $b_{L-1}$ links going to an $(L-2)$th-level child IN. This network is built using a $(k_{L-1} a_{L-1}) \times (k_{L-1} b_{L-1})$ crossbar switch. From the above description it is clear that $k_1$ zeroth-level clusters are connected to a first-level IN to form a first-level cluster, $k_2$ first-level clusters are connected to a second-level IN to form a second-level cluster, $k_3$ second-level clusters are connected to a third-level IN to form a third-level cluster, and so on. Thus, the total number of zeroth-level clusters in an $L$-level hierarchical system is $k_1 k_2 k_3 \cdots k_{L-1}$. Since there are $n_0$ processors and $m_0$ memory modules in each zeroth-level cluster, the total numbers of processors and memory modules in the system are $N = n_0 k_1 k_2 k_3 \cdots k_{L-1}$ and $M = m_0 k_1 k_2 k_3 \cdots k_{L-1}$, respectively. Figure 1 shows a four-level multiple-link-based hierarchical system. If a processor generates a memory reference for one of its local (zeroth-level) memory modules, then that reference goes to the memory module through the local interconnection network. However, if a processor generates a reference for one of its $i$th-level ($i > 0$) memory modules, then that reference first keeps moving up through the parent outlets of different INs until it reaches the $i$th-level IN of the $i$th-level cluster in which the processor is located. The reference then moves down through the child outlets of different INs until it reaches the referenced memory module.
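For concreteness, the cluster counts and crossbar dimensions described above can be computed with a short sketch. The parameter values and helper names below are ours, chosen only for illustration:

```python
# Sketch: sizes of a multiple-link-based HIN (illustrative parameters).
# L levels; k[i], a[i], b[i] are indexed 1..L-1 as in the paper.

def system_size(n0, m0, k):
    """Total processors N and memory modules M for cluster fan-outs k[1..L-1]."""
    prod = 1
    for ki in k[1:]:
        prod *= ki
    return n0 * prod, m0 * prod

def crossbar_dims(level, L, k, a, b, n0, m0):
    """(inputs, outputs) of the crossbar used by the IN at a given level."""
    if level == 0:                       # zeroth-level IN: (n0+b1) x (m0+a1)
        return n0 + b[1], m0 + a[1]
    if level == L - 1:                   # highest level: (k a) x (k b)
        return k[level] * a[level], k[level] * b[level]
    # intermediate level: (k_i a_i + b_{i+1}) x (k_i b_i + a_{i+1})
    return (k[level] * a[level] + b[level + 1],
            k[level] * b[level] + a[level + 1])

L = 4
k = [None, 4, 4, 4]          # k[1..3]: fan-out at each level
a = [None, 2, 2, 2]          # links per child inlet
b = [None, 2, 2, 2]          # links per child outlet
N, M = system_size(n0=4, m0=4, k=k)
print(N, M)                                 # 256 processors, 256 modules
print(crossbar_dims(0, L, k, a, b, 4, 4))   # (6, 6)
print(crossbar_dims(1, L, k, a, b, 4, 4))   # (10, 10)
print(crossbar_dims(3, L, k, a, b, 4, 4))   # (8, 8)
```

The zeroth-level and intermediate-level switches stay small even though the system has 256 processors, which is the cost argument made for cluster-based designs in the Introduction.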
2.2. A HIN with fault-tolerant INs

Here we propose another type of fault-tolerant HIN. This type of HIN is designed using only one link at every inlet and outlet port. However, every IN has a backup circuit, as shown in Figure 2, to tolerate faults within the crossbar (the main crossbar). If a reference cannot move through the main crossbar due to the presence of faults in the main crossbar, then that reference tries to move through the backup circuit. For an $i$th-level fault-tolerant IN, the backup circuit is composed of two small crossbars: one $(k_i + 1) \times z_i$ crossbar and another $z_i \times (k_i + 1)$ crossbar. All the $k_i + 1$ inlets of an $i$th-level fault-tolerant IN are connected to both the $(k_i + 1) \times (k_i + 1)$ main crossbar and the $(k_i + 1) \times z_i$ backup crossbar. The $z_i$ outlets of the $(k_i + 1) \times z_i$ backup crossbar are in turn connected to the $z_i$ inlets of the $z_i \times (k_i + 1)$ backup crossbar, and all the $k_i + 1$ outlets of the $z_i \times (k_i + 1)$ backup crossbar are connected to the $k_i + 1$ outlets of the main crossbar. It is assumed that the main crossbar has a built-in fault-detection circuit, which generates control signals to route memory references through the backup circuit when necessary. The backup circuit allows a maximum of $z_i$ references to move through it, if these references cannot move through the main crossbar due to faults in the main crossbar. Thus, if $z_i$ or fewer references cannot move through the main crossbar due to faults, the performance of the system will not degrade, because all these references can move through the backup circuit. However, the performance of the system will degrade if more than $z_i$ references need to be moved through the backup circuit.
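The per-cycle routing decision at a fault-tolerant IN can be sketched as follows. This is a simplified model of our own (the paper specifies the structure of the backup circuit, not a routing algorithm); the representation of references and faults is purely illustrative:

```python
# Sketch: routing references through an i-th level fault-tolerant IN.
# A reference is an (inlet, outlet) pair; faulty_cross holds the broken
# (inlet, outlet) crosspoints of the main crossbar. At most z_i blocked
# references can be diverted through the backup circuit per cycle.

def route_cycle(references, faulty_cross, z_i):
    """Return (via_main, via_backup, resubmit_next_cycle)."""
    via_main, blocked = [], []
    for ref in references:
        (via_main if ref not in faulty_cross else blocked).append(ref)
    via_backup = blocked[:z_i]          # backup circuit capacity is z_i
    resubmit = blocked[z_i:]            # these must retry in the next cycle
    return via_main, via_backup, resubmit

refs = [(0, 1), (1, 2), (2, 0), (3, 3)]
faulty = {(1, 2), (2, 0), (3, 3)}       # three faulty crosspoints, z_i = 2
main, backup, retry = route_cycle(refs, faulty, z_i=2)
print(len(main), len(backup), len(retry))   # 1 via main, 2 via backup, 1 retried
```

With three blocked references but only $z_i = 2$ backup paths, one reference is delayed a cycle, which is exactly the overload case in which the text says performance degrades.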

FIGURE 2. An $i$th-level IN with a backup circuit.

3. PERFORMANCE ANALYSIS

In this section we present analytical models to determine the memory bandwidths of the HINs presented in the previous section. Different analytical models are developed to determine the bandwidths of the HINs in the presence of different types of faults. The following main notation is used to describe the parameters of the system. In addition to this notation, some other parameters are also used in this paper, and they are defined just before they are used in the text.

3.1. Notation

$N$ — total number of processors in the system
$M$ — total number of memory modules in the system
$L$ — total number of levels in the system, including the local (zeroth) level
$N_i$ — number of processors in an $i$th-level cluster
$M_i$ — total number of memory modules in an $i$th-level cluster
$n_i$ — number of $i$th-level processors of a memory module
$m_i$ — number of $i$th-level memory modules of a processor
$k_i$ — number of $(i-1)$th-level clusters used to make an $i$th-level cluster
$C_i$ — total number of $i$th-level clusters in the entire system
$a_i$ — number of links in a child inlet of an $i$th-level IN
$b_i$ — number of links in a child outlet of an $i$th-level IN
$s_i$ — probability that a processor's generated reference is an $i$th-level reference
$pu_{i+1}$ — rate of reference at a link of the parent outlet of an $i$th-level IN
$pd_i$ — rate of reference at a link of a child outlet of an $i$th-level IN
$du_i$ — number of distinct references competing for the parent outlet of an $i$th-level IN
$dd_i$ — number of distinct references competing for a child outlet of an $i$th-level IN
$BW$ — total bandwidth of the HIN
$BW_i$ — bandwidth contribution from all the $i$th-level references of the system
$bw_i$ — bandwidth contribution from all the processors of an $i$th-level cluster (note that $bw_{L-1}$ is the same as $BW$)
$bw_{i,j}$ — bandwidth contribution from the $j$th-level references of an $i$th-level cluster
$\Delta BW(F)$ — loss in bandwidth due to the presence of the set of faulty outlets $F$
$bw_i(F)$ — bandwidth contribution from an $i$th-level cluster in the presence of the set of faulty outlets $F$
$BW(F)$ — total bandwidth of the HIN in the presence of the set of faulty outlets $F$

The number of processors in an $i$th-level cluster is

$$N_i = \begin{cases} n_0, & \text{for } i = 0,\\ k_i N_{i-1}, & \text{for } 1 \le i \le L-1. \end{cases} \qquad (1)$$

The number of memory modules in an $i$th-level cluster is

$$M_i = \begin{cases} m_0, & \text{for } i = 0,\\ k_i M_{i-1}, & \text{for } 1 \le i \le L-1. \end{cases} \qquad (2)$$

The number of $i$th-level processors of a memory module is

$$n_i = \begin{cases} n_0, & \text{for } i = 0,\\ N_i - N_{i-1}, & \text{for } 1 \le i \le L-1. \end{cases} \qquad (3)$$

The number of $i$th-level memory modules of a processor is

$$m_i = \begin{cases} m_0, & \text{for } i = 0,\\ M_i - M_{i-1}, & \text{for } 1 \le i \le L-1. \end{cases} \qquad (4)$$

The total number of $i$th-level clusters in the entire system is

$$C_i = \begin{cases} \prod_{j=i+1}^{L-1} k_j, & \text{for } 0 \le i \le L-2,\\ 1, & \text{for } i = L-1. \end{cases} \qquad (5)$$

The models presented in this paper are developed based on the following assumptions:

1. the multiprocessor system is synchronous, with $N$ processors and $M$ memory modules;
2. the new references generated in a cycle are random and independent of each other;
3. the references which are not accepted during a memory cycle are resubmitted for the same memory modules in the next cycle;
4. $\psi$ is the probability with which an active (unblocked) processor generates a new reference in a memory cycle.

Assumptions 1 and 2 are used by almost all the bandwidth analysis models available in the literature. Assumption 3 is used to make our model more realistic.

3.2. Bandwidth analysis of a multiple-link-based HIN

3.2.1. Model for a fault-free multiple-link-based system

Let $f$ be the fraction of the processors which are active at steady state. Note that normally $f$ is less than 1, because some processors may be blocked because their references were not accepted during previous cycles. A blocked processor remains inactive until its reference is accepted by the memory modules, and then it returns to the active state. Since $s_i$ is the fraction of a processor's references which are directed to its $i$th-level memory modules and a processor has $m_i$ $i$th-level memory modules, using the empirical expression developed by Yen et al. [30], one can show that the average number of distinct references competing for the parent outlet of a local cluster is

$$du_0 = \sum_{i=1}^{L-1} m_i\left[1 - \left(1 - \frac{\psi f s_i}{m_i}\right)^{n_0}\left(1 - \frac{f_i}{m_i}\right)^{n_0}\right] \qquad (6)$$

where $f_i$ is the fraction of processors blocked for $i$th-level memory modules.

Computation of the rate of reference at a link of the parent outlet of an IN. Since the parent outlet of a local cluster has $a_1$ links, during a memory cycle the average number of distinct references arriving at a link of the parent outlet of a local cluster is $du_0/a_1$. Since $pu_1$ is the rate of reference at a link of the parent outlet of a zeroth-level IN and the rate of reference at a link cannot be greater than unity, the value of $pu_1$ can be determined as

$$pu_1 = \begin{cases} \dfrac{du_0}{a_1}, & \text{if } \dfrac{du_0}{a_1} < 1,\\[4pt] 1, & \text{if } \dfrac{du_0}{a_1} \ge 1. \end{cases} \qquad (7)$$

During a memory cycle, the probability that a processor either generates a new reference for an $i$th-level memory module or is already blocked for an $i$th-level memory module is $\psi f s_i + f_i$. An $i$th-level IN can receive only $i$th- and higher-level references from its child INs. Let $u_{i,j}$ be the portion of these references which is directed to the $j$th-level memory modules. The value of $u_{i,j}$ can be expressed as

$$u_{i,j} = \frac{\psi f s_j + f_j}{\sum_{k=i}^{L-1} (\psi f s_k + f_k)}, \quad \text{for } 1 \le i \le L-1 \text{ and } i \le j \le L-1. \qquad (8)$$

The average number of distinct references competing for the parent outlet of an $i$th-level IN is

$$du_i = \sum_{j=i+1}^{L-1} m_j\left[1 - \left(1 - \frac{pu_i u_{i,j} a_i}{m_j}\right)^{k_i}\right], \quad \text{for } 1 \le i \le L-2. \qquad (9)$$

The rate of reference at a link of the parent outlet of an $i$th-level IN can be expressed as follows:

$$pu_{i+1} = \begin{cases} \dfrac{du_i}{a_{i+1}}, & \text{if } \dfrac{du_i}{a_{i+1}} < 1,\\[4pt] 1, & \text{if } \dfrac{du_i}{a_{i+1}} \ge 1, \end{cases} \quad \text{for } 1 \le i \le L-2. \qquad (10)$$

Computation of the rate of reference at a link of a child outlet of an IN. The references which come to an $i$th-level IN through the parent inlet are uniformly distributed over all the memory modules of the $i$th-level cluster which exists underneath the $i$th-level IN. The average number of distinct references competing for a child outlet of an $i$th-level IN is

$$dd_i = M_i\left[1 - \left(1 - \frac{pd_{i+1} b_{i+1}}{M_i}\right)\left(1 - \frac{pu_i u_{i,i} a_i}{m_i}\right)^{k_i - 1}\right], \quad \text{for } 1 \le i \le L-1 \qquad (11)$$

where $pd_L = 0$ and $u_{L-1,L-1} = 1$. The rate of reference at a link of a child outlet of an $i$th-level IN can be expressed as

$$pd_i = \begin{cases} \dfrac{dd_i}{b_i}, & \text{if } \dfrac{dd_i}{b_i} < 1,\\[4pt] 1, & \text{if } \dfrac{dd_i}{b_i} \ge 1, \end{cases} \quad \text{for } 1 \le i \le L-1. \qquad (12)$$
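As a numerical illustration, the upward rates can be evaluated bottom-up, level by level. The sketch below does this for a symmetric system with no blocked processors ($f_i = 0$, so each $u_{i,j}$ reduces to a ratio of the $s_j$); all parameter values and the helper name are ours:

```python
# Sketch: upward reference rates pu_1 .. pu_{L-1}, following the distinct-
# reference expressions above. Symmetric system: the same k and a at every
# level; f_i = 0 (no blocked processors), so u_{i,j} = s[j] / sum_{k>=i} s[k].

def upward_rates(L, n0, m, k, a, psi, f, s):
    """m[i], s[i] given for levels 0..L-1; returns [pu_1, ..., pu_{L-1}]."""
    # Distinct non-local references leaving a local cluster (cf. eq. (6)):
    du = sum(m[i] * (1.0 - (1.0 - psi * f * s[i] / m[i]) ** n0)
             for i in range(1, L))
    pu = [min(du / a, 1.0)]                      # cf. eq. (7)
    for i in range(1, L - 1):
        tail = sum(s[j] for j in range(i, L))
        # cf. eqs. (8)-(9) with f_i = 0: u_{i,j} = s[j] / tail.
        du = sum(m[j] * (1.0 - (1.0 - pu[-1] * (s[j] / tail) * a / m[j]) ** k)
                 for j in range(i + 1, L))
        pu.append(min(du / a, 1.0))              # cf. eq. (10)
    return pu

# 3-level example: 4 processors per cluster, m = [4, 12, 32],
# 60/25/15% locality, psi = 0.8, all processors active (f = 1).
rates = upward_rates(L=3, n0=4, m=[4, 12, 32], k=4, a=2,
                     psi=0.8, f=1.0, s=[0.60, 0.25, 0.15])
print(rates)        # pu_1 and pu_2, each capped at 1.0
assert all(0.0 < r <= 1.0 for r in rates)
```

Note that the rate can grow toward 1.0 at higher levels even with good locality, since traffic from many clusters concentrates on a single parent outlet; this is the saturation that the $\min(\cdot, 1)$ clamp models.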

1. $f := 1.0$; $n := 0$; For $i := 0$ to $L-1$ do $f_i(n) := 0$; Done := False;
REPEAT
2.   $n := n + 1$;
3.   Do the analysis shown by (6) through (13), and get the value of $BW$ using (13).
4.   If $|BW - fN\psi| < \varepsilon$ (where $\varepsilon$ is a very small number) Then Done := True;
     Else Begin
       Determine $f_i(n+1)$, $0 \le i \le L-1$, using (18);
       $f := 1.0$;
       For $i := 0$ to $L-1$ do $f := f - f_i(n+1)$;
     End;
UNTIL Done;
5. Accept $BW$ as the bandwidth of the system.

ALGORITHM 1. Bandwidth computation algorithm.

The total memory bandwidth of the hierarchical system is

$$BW = M\left[1 - \left(1 - \frac{pd_1 b_1}{m_0}\right)\left(1 - \frac{\psi f s_0}{m_0}\right)^{n_0}\left(1 - \frac{f_0}{m_0}\right)^{n_0}\right]. \qquad (13)$$

Computation of bandwidth contribution from the $i$th-level ($0 \le i \le L-1$) references. An $i$th-level IN receives $(i+1)$th- and higher-level references through its parent inlet. Let $v_{i+1,j}$ be the portion of these references which is directed to the $j$th-level ($i+1 \le j \le L-1$) memory modules. The value of $v_{i,j}$ ($j \ge i$) can be determined as follows:

$$v_{i,i} = \frac{M_i\left[1 - \left(1 - \dfrac{pu_i u_{i,i} a_i}{m_i}\right)^{k_i - 1}\right]}{pd_{i+1} b_{i+1} + M_i\left[1 - \left(1 - \dfrac{pu_i u_{i,i} a_i}{m_i}\right)^{k_i - 1}\right]}, \quad \text{for } 1 \le i \le L-2, \qquad (14a)$$

and

$$v_{i,j} = \frac{pd_{i+1} b_{i+1}\, v_{i+1,j}}{pd_{i+1} b_{i+1} + M_i\left[1 - \left(1 - \dfrac{pu_i u_{i,i} a_i}{m_i}\right)^{k_i - 1}\right]}, \quad \text{for } 1 \le i \le L-2,\ i+1 \le j \le L-1, \qquad (14b)$$

where $v_{L-1,L-1} = 1$. The bandwidth contributions from different types of references will be proportional to the number of corresponding references which arrive at a local cluster. Let $d_z$ be the average number of distinct zeroth-level references generated by the active and blocked processors of a local cluster. The value of $d_z$ can be expressed as

$$d_z = m_0\left[1 - \left(1 - \frac{\psi f s_0}{m_0}\right)^{n_0}\left(1 - \frac{f_0}{m_0}\right)^{n_0}\right]. \qquad (15)$$

The average number of $i$th-level references which arrive at a local cluster from the first-level parent IN is $pd_1 b_1 v_{1,i}$. Hence,

$$BW_0 = \frac{d_z}{d_z + pd_1 b_1}\, BW \qquad (16)$$

and

$$BW_i = \frac{pd_1 b_1 v_{1,i}}{d_z + pd_1 b_1}\, BW, \quad \text{for } 1 \le i \le L-1. \qquad (17)$$

The fraction of processors which attempt to access their $i$th-level memory modules during a memory cycle is $\psi f s_i + f_i$ ($0 \le i \le L-1$).
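Algorithm 1 above is a fixed-point iteration on the blocked-processor fractions. The sketch below implements it for the degenerate single-cluster case $L = 1$ (so only $f_0$ and the local term of the bandwidth expression survive); the parameter values are ours, and the update rule is the one given below for $f_i$:

```python
# Sketch: Algorithm 1 specialized to a single cluster (L = 1, s_0 = 1).
# bw follows the distinct-request form of equations (13)/(15); f_0 is
# updated with the next-iteration rule of equation (18).

def bandwidth_single_cluster(n0, m0, psi, eps=1e-9, max_iter=100_000):
    f, f0 = 1.0, 0.0                      # step 1: all processors active
    for _ in range(max_iter):
        # Expected number of busy memory modules this cycle:
        busy = 1.0 - ((1.0 - psi * f / m0) ** n0) * ((1.0 - f0 / m0) ** n0)
        bw = m0 * busy
        if abs(bw - f * n0 * psi) < eps:  # step 4: steady state reached
            return bw, f
        f0 = psi * f + f0 - bw / n0       # eq. (18): unserved requests block
        f = 1.0 - f0                      # remaining processors stay active
    return bw, f

bw, f = bandwidth_single_cluster(n0=16, m0=16, psi=0.5)
print(round(bw, 3), round(f, 3))
```

At convergence the accepted bandwidth equals the issue rate $fN\psi$ of the active processors, which is exactly the steady-state condition tested in step 4 of Algorithm 1.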
The value of $f_i$ for the next iteration of the bandwidth computation can be determined as

$$f_i(n+1) = \psi f s_i + f_i(n) - \frac{BW_i}{N}, \quad \text{for } 0 \le i \le L-1, \qquad (18)$$

where $n$ is the iteration number. The iterative procedure shown in Algorithm 1 can be used to determine the bandwidth of the hierarchical system. Note that, at steady state, the bandwidth of the system must be equal to $fN\psi$.

3.2.2. Models for multiple-link-based HINs with faulty links

In this subsection we show the analytical models for different types of link faults. By link faults we mean that some links cannot be used due to the presence of faults in the network. A link may be unusable either because there is a fault on the link itself, or because there is a fault in an IN which makes the link unusable. First we present a few lemmas, and then we show the analytical models for different types of faulty HINs.

LEMMA 1. At steady state, the bandwidth contribution $bw_i$ from the processors of an $i$th-level ($0 \le i \le L-1$) cluster is proportional to the bandwidth contribution $bw_{i,j}$ from the $j$th-level ($0 \le j \le L-1$) references of those processors.

LEMMA 2. At steady state, the bandwidth contribution $bw_i$ from the processors of an $i$th-level ($0 \le i \le L-2$) cluster is proportional to the bandwidth available from the parent outlet of the corresponding $i$th-level IN.

The bandwidth available from an outlet means the total bandwidth contribution from all the references which go through that outlet.

Analytical model for the HIN with one faulty parent outlet. Let the faulty parent outlet be $u_i$, let $u_i$ be the parent outlet of an $i$th-level IN, and let the number of faulty links in $u_i$ be $x$ ($x < a_{i+1}$). Due to the presence of the faulty links in an $i$th-level parent outlet, the bandwidth contribution from the corresponding faulty $i$th-level cluster will be less than that from other good $i$th-level clusters. In this subsection we present an approximate analytical model for bandwidth analysis of a faulty HIN. Since the results obtained from this approximate analytical model were found to be very close to those obtained from the simulation model, we did not consider it worthwhile to develop the exact probabilistic model for the faulty HIN, which is very complex. Let $pu^*_{i+1}$ be the rate of reference at a link of the faulty parent outlet of an $i$th-level IN. Note that the superscript $*$ is used to indicate that the corresponding value is for a faulty cluster. The value of $pu^*_{i+1}$ can be expressed as

$$pu^*_{i+1} = \begin{cases} \dfrac{du_i}{a_{i+1} - x}, & \text{if } \dfrac{du_i}{a_{i+1} - x} < 1,\\[4pt] 1, & \text{if } \dfrac{du_i}{a_{i+1} - x} \ge 1, \end{cases} \qquad (19)$$

where $du_i$ is given by (9). However, the rate of reference at a link of the parent outlet of an $i$th-level IN of a good $i$th-level cluster is $pu_{i+1}$, as given by (10).

$U_0 = \{u_i \mid u_i \in U$ and $u_i$ is neither an ancestor nor a descendant of $u_j \in U$ for all $i \ne j\}$;
$t := 0$;
If $U_0 \ne U$ Then
  Repeat
    Pick $u_k$ such that $u_k \in U$ and $u_k \notin U_0 \cup \cdots \cup U_t$;
    $t := t + 1$;
    $U_t = \{u_k\} \cup \{u_i \mid u_i \in U$ and $u_i$ is either an ancestor or a descendant of $u_k\}$;
  Until $U_0 \cup \cdots \cup U_t = U$

ALGORITHM 2.
In our approximate model we assume that the bandwidth available from the ith-level parent outlet of a good ith-level cluster is proportional to pu_{i+1} a_{i+1} and that available from the faulty ith-level parent outlet is proportional to pu*_{i+1}(a_{i+1} − x). Now, using Lemma 2, we can say that the bandwidth contribution from the processors of a good ith-level cluster is proportional to pu_{i+1} a_{i+1} and that from the processors of the faulty ith-level cluster is proportional to pu*_{i+1}(a_{i+1} − x). Let us use the term Δu(i, x), as shown below, to indicate the degradation due to the presence of x faulty links in an ith-level parent outlet:

Δu(i, x) = 1 − [pu*_{i+1}(a_{i+1} − x)] / [pu_{i+1} a_{i+1}].   (20)

Hence, the bandwidth loss due to the presence of x faulty links in an ith-level parent outlet is given by bw_i Δu(i, x). Note that bw_i = BW/C_i, where BW is the total bandwidth of a good HIN and C_i is the total number of ith-level clusters in the entire system. Hence, the total bandwidth of the HIN in the presence of the faulty parent outlet u_i is given by

BW_{{u_i}} = BW [1 − Δu(i, x)/C_i].   (21)

Models for multiple faulty parent outlets. Let the set of faulty parent outlets be U = {u_1, u_2, u_3, ..., u_r}, where r is the number of faulty outlets. Let the faulty parent outlet u_i ∈ U be at level h_i and let the number of faulty links in that outlet be x_i. First we use Algorithm 2 to generate a number of disjoint sets from U. From Algorithm 2 it is clear that U_i ∩ U_j = ∅ for 0 ≤ i, j ≤ t and i ≠ j. Now we determine the loss in bandwidth due to the presence of the faulty outlets of the set U_0. Since an outlet u_i ∈ U_0 is neither an ancestor nor a descendant of any outlet u_j ∈ U (i ≠ j), the references which move through u_i ∈ U_0 cannot move through any other outlet of the set U, and vice versa. Thus, in our approximate model we assume that an outlet u_i ∈ U_0 does not have any significant effect on any other outlet of the set U, and vice versa.
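As a small numeric sketch, Eqs. (20)–(21), together with the additive loss used for the pairwise-unrelated set U_0 in Eq. (22), translate directly into code. Here `pu_star`, `pu` and `a_next` stand for pu*_{i+1}, pu_{i+1} and a_{i+1}; the demonstration values are invented for illustration only.

```python
def delta_u(pu_star, pu, a_next, x):
    """Eq. (20): degradation from x faulty links in an ith-level parent outlet."""
    return 1.0 - (pu_star * (a_next - x)) / (pu * a_next)

def bw_one_faulty_parent(BW, d_u, C_i):
    """Eq. (21): total bandwidth with one faulty parent outlet."""
    return BW * (1.0 - d_u / C_i)

def loss_unrelated_outlets(BW, deltas, Cs):
    """Eq. (22): the outlets of U_0 are pairwise unrelated, so their
    individual losses bw_{h_i} * Delta_u(h_i, x_i) = BW * Delta_u / C_{h_i}
    simply add up."""
    return BW * sum(d / c for d, c in zip(deltas, Cs))
```

For example, with pu*_{i+1} = pu_{i+1} = 0.5, a_{i+1} = 4 and x = 1, the degradation Δu(i, x) is 0.25: the faulty outlet retains three of its four links and the reference rate per link is unchanged.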
Hence, we can still assume that the bandwidth contribution from the h_i th-level faulty cluster, which became faulty due to the presence of the faulty outlet u_i ∈ U_0, is proportional to the bandwidth available from u_i. Therefore, the loss in bandwidth due to the presence of the faulty outlet u_i ∈ U_0 is bw_{h_i} Δu(h_i, x_i), where bw_{h_i} is the bandwidth contribution from a good h_i th-level cluster and Δu(h_i, x_i) is given by (20). Hence, the total loss in bandwidth due to the presence of all the faulty outlets of the set U_0 is

ΔBW(U_0) = Σ_{u_i ∈ U_0} bw_{h_i} Δu(h_i, x_i) = BW Σ_{u_i ∈ U_0} Δu(h_i, x_i)/C_{h_i}.   (22)

If U_0 ≠ U, then there are more sets of faulty outlets. Since U_i ∩ U_j = ∅ for i ≠ j, it is clear that the references which move through one outlet of a set cannot move through an outlet of another set. Thus, in our model we assume that the outlets of one set do not have any significant effect on the outlets of another set. Let us determine the loss in bandwidth due to the presence of the faulty outlets of the set U_1. Without loss of generality, we can assume that U_1 = {u_1, u_2, u_3, ..., u_v}, where v is the number of faulty outlets in U_1. It is clear that, for every pair of faulty outlets u_i, u_j ∈ U_1, the outlet u_i is either an ancestor or a descendant of the outlet u_j. Recall that the faulty parent outlet u_i ∈ U is at level h_i. Without loss of generality, we can assume that h_i < h_j for all i < j. Thus, u_j is an ancestor of u_i for all u_i, u_j ∈ U_1 and j > i. Since u_j is an ancestor of u_i, the

S. M. MAHMUD, L. T. SAMARATUNGA AND S. KOMMIDI

references which move through u_i also move through u_j. As a result, one faulty outlet has a direct effect on the other. The loss in bandwidth due to the presence of the faulty outlet u_i is bw_{h_i} Δu(h_i, x_i). Hence, the bandwidth contribution from the faulty h_j th-level (h_j > h_i) cluster, due to the presence of the faulty outlet u_i, is given by bw_{h_j} − bw_{h_i} Δu(h_i, x_i). Since the references which move through u_i also move through u_j, in our approximate model we assume that the bandwidth contribution from the faulty h_j th-level cluster, due to the presence of both faulty outlets u_i and u_j, is [bw_{h_j} − bw_{h_i} Δu(h_i, x_i)][1 − Δu(h_j, x_j)]. Hence, the total loss in bandwidth due to the presence of all the faulty parent outlets of the set U_1 can be expressed as

ΔBW(U_1) = BW { Δu(h_v, x_v)/C_{h_v} + Σ_{i=1}^{v−1} [Δu(h_i, x_i)/C_{h_i}] Π_{j=i+1}^{v} [1 − Δu(h_j, x_j)] }.   (23)

The loss in bandwidth due to the presence of the faulty parent outlets of a set U_i (2 ≤ i ≤ t) can be determined in the same way as for the set U_1. Thus, the total loss in bandwidth due to the presence of all the faulty parent outlets of the set U is

ΔBW(U) = Σ_{i=0}^{t} ΔBW(U_i).   (24)

Hence, the bandwidth of the HIN in the presence of all the faulty parent outlets of the set U is

BW_U = BW − ΔBW(U).   (25)

Now we develop analytical models for the multiple-link-based HIN with faulty child outlets. First we present three more lemmas and then we give the analytical model for a system with faulty child outlets.

LEMMA 3. An ith-level child outlet carries the jth-level (i ≤ j ≤ L−1) references of (k_j − 1) (j−1)th-level clusters.

LEMMA 4. The ith-level references of a processor move through Nd_j jth-level (j ≤ i) child outlets, where

Nd_j = { k_i − 1, for j = i,
         (k_i − 1) Π_{y=j}^{i−1} k_y, for j < i.   (26)

LEMMA 5. Assume that D is a set of child outlets such that, for every pair of outlets d_m, d_n ∈ D, d_m is either an ancestor or a descendant of d_n, and all the outlets of D carry some references of a given processor.
If d_m ∈ D carries the ith-level references of the given processor, then every d_n ∈ D also carries the ith-level references of that processor, and no outlet of D can carry any other type of references, say jth-level (j ≠ i) references, from that processor.

A cluster is called an affected cluster if the bandwidth contribution from every processor of that cluster is affected by the presence of faults in the system.

Analytical model for the HIN with one faulty child outlet. Let the faulty child outlet be d_i. Let d_i be a child outlet of an ith-level IN and let the number of faulty links in d_i be x (x < b_i). Let pd*_i be the rate of reference at a link of the faulty child outlet. The value of pd*_i can be expressed as

pd*_i = { dd_i/(b_i − x), if dd_i/(b_i − x) < 1,
          1,              if dd_i/(b_i − x) ≥ 1,   (27)

where dd_i is given by (11). However, the rate of reference at a link of a good ith-level child outlet is pd_i, as given by (12). In our approximate model we assume that the bandwidth available from the faulty ith-level child outlet is proportional to pd*_i(b_i − x) and that available from a good ith-level child outlet is proportional to pd_i b_i. Lemma 3 shows that an ith-level child outlet carries the jth-level (i ≤ j ≤ L−1) references of (k_j − 1) (j−1)th-level clusters. Thus, a faulty child outlet will affect the bandwidth contribution of those clusters whose references move through the faulty outlet. If an ith-level child outlet is faulty, then this faulty outlet will affect the bandwidth contribution of many (i−1)th- and higher-level clusters. Now let us determine the loss in bandwidth contribution from the different types of affected clusters. From Lemma 3 it is clear that the bandwidth contribution from (k_i − 1) (i−1)th-level clusters will be affected by a faulty ith-level child outlet, because the ith-level references from these (i−1)th-level clusters move through the faulty ith-level child outlet.
From Lemma 4 we see that the ith-level references of a processor move through (k_i − 1) ith-level child outlets. Thus, the ith-level references of an (i−1)th-level affected cluster move through (k_i − 2) good ith-level child outlets and the faulty child outlet. Hence, in our approximate model we assume that the bandwidth contribution from the ith-level references of an (i−1)th-level affected cluster is proportional to (k_i − 2) pd_i b_i + pd*_i(b_i − x) and that from a non-affected (i−1)th-level cluster is proportional to (k_i − 1) pd_i b_i. Let us use the term Δd(i, x), as shown below, to indicate the degradation due to the presence of x faulty links in an ith-level child outlet:

Δd(i, x) = 1 − [pd*_i(b_i − x)] / [pd_i b_i].   (28)

Let us use the term δ(j, i, x) to indicate the fraction of the bandwidth contribution which will be lost from a jth-level affected cluster due to the presence of x faulty links in an ith-level child outlet. In general, the value of δ(j, i, x) can be expressed as

δ(j, i, x) = { Δd(i, x)/(k_i − 1), for j = i − 1,
               Δd(i, x)/[(k_{j+1} − 1) k_j k_{j−1} ... k_{i+1} k_i], for i ≤ j ≤ L−2.   (29)
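Equation (29) can be sketched numerically as follows. The helper also sums the per-level losses, as is done in Eq. (30), to show that each level contributes the same amount bw_{i−1} Δd(i, x), so the sum collapses to the closed form (L − i) BW Δd(i, x)/C_{i−1} of Eq. (31). The lists `k` and `bw` are indexed by level (index 0 unused) and the numbers below are invented for illustration.

```python
def delta_j(j, i, delta_d, k):
    """Eq. (29): fraction of bandwidth lost by a jth-level affected cluster
    due to x faulty links in an ith-level child outlet (delta_d = Delta_d(i, x))."""
    if j == i - 1:
        return delta_d / (k[i] - 1)
    # case i <= j <= L-2
    denom = k[j + 1] - 1
    for y in range(i, j + 1):        # product k_i * k_{i+1} * ... * k_j
        denom *= k[y]
    return delta_d / denom

def total_loss(L, i, delta_d, k, bw):
    """Eq. (30): sum the losses of the (k_{j+1} - 1) affected jth-level
    clusters over j = i-1 .. L-2; bw[j] holds bw_j."""
    s = 0.0
    for j in range(i - 1, L - 1):
        s += bw[j] * (k[j + 1] - 1) * delta_j(j, i, delta_d, k)
    return s
```

With L = 4, i = 2, k_1..k_3 = 4, 6, 8, bw_1 = 10 (so bw_2 = k_2 bw_1 = 60) and Δd(i, x) = 0.3, each of the two terms of the sum equals bw_1 Δd = 3.0, giving the closed-form total (L − i) bw_{i−1} Δd = 6.0.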

ALGORITHM 3.
For i := 1 to C_0 do
Begin
  D_{i,0} := {d_j | d_j ∈ D_i and d_j is neither an ancestor nor a descendant of d_k ∈ D_i for j ≠ k};
  m := 0;
  If D_{i,0} ≠ D_i Then
    Repeat
      Pick u_k such that u_k ∈ D_i and u_k ∉ D_{i,0} ∪ ... ∪ D_{i,m};
      m := m + 1;
      D_{i,m} := {u_k} ∪ {d_j | d_j ∈ D_i and d_j is either an ancestor or a descendant of u_k};
    Until D_{i,0} ∪ ... ∪ D_{i,m} = D_i
End;

ALGORITHM 4.
For j := 1 to L − 1 do
  Δbw_i(j) := 0;
For all d_j ∈ D_{i,0} do
  Δbw_i(g_j) := Δbw_i(g_j) + bw_0 δ(g_j − 1, h_j, x_j);
If D_{i,0} ≠ D_i Then
  For n := 1 to m do
    Δbw_i(g_n) := Δbw_i(g_n) + bw_0 [1 − Π_{d_j ∈ D_{i,n}} (1 − δ(g_n − 1, h_j, x_j))]

The total loss in bandwidth from all the affected clusters in the system can be expressed as

ΔBW{d_i} = Σ_{j=i−1}^{L−2} bw_j (k_{j+1} − 1) δ(j, i, x).   (30)

Since bw_j = k_i k_{i+1} ... k_j bw_{i−1} and bw_{i−1} = BW/C_{i−1}, (30) can be reduced to the following closed-form expression:

ΔBW{d_i} = (L − i) BW Δd(i, x)/C_{i−1}.   (31)

Hence, the bandwidth of the HIN in the presence of the faulty child outlet is

BW_{{d_i}} = BW [1 − (L − i) Δd(i, x)/C_{i−1}].   (32)

Model for multiple faulty child outlets. Let the set of faulty child outlets be D = {d_1, d_2, d_3, ..., d_s}, where s is the number of faulty outlets. Let the faulty outlet d_i ∈ D be at level h_i and let the number of faulty links in that outlet be x_i. We know that the total number of zeroth-level local clusters in the system is C_0. Let us assume that the zeroth-level clusters are numbered 1, 2, 3, ..., C_0. Now we form the sets D_i (1 ≤ i ≤ C_0) as

D_i = {d_j | d_j ∈ D and d_j carries some references generated by the processors of the zeroth-level cluster #i}, for 1 ≤ i ≤ C_0.

We then use Algorithm 3 to generate a number of disjoint sets from every D_i (1 ≤ i ≤ C_0). Now we determine the loss in bandwidth from the zeroth-level cluster #i (1 ≤ i ≤ C_0) due to the presence of the faulty outlets of the sets D_{i,n} (0 ≤ n ≤ m).
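A possible rendering of Algorithm 4, together with the max step of Eq. (33), is sketched below. Each faulty outlet visible to zeroth-level cluster #i is reduced to a pair (g, δ), where g is the reference level it carries for this cluster and δ the corresponding δ(g − 1, h_j, x_j); the grouping into D_{i,0} and the chains D_{i,n} is taken as given from Algorithm 3. This pair encoding and the demo values are our assumptions for illustration.

```python
def cluster_bandwidth(bw0, L, D_sets):
    """Algorithm 4 + Eq. (33) for one zeroth-level cluster.

    bw0    : fault-free bandwidth contribution of the cluster
    D_sets : D_sets[0] is D_{i,0} (pairwise-unrelated outlets);
             each further entry is one chain D_{i,n}.
             Every outlet is a pair (g, delta) as described above.
    """
    loss = [0.0] * L                      # loss[g]: loss via g-level references
    for (g, delta) in D_sets[0]:          # unrelated outlets: losses add
        loss[g] += bw0 * delta
    for chain in D_sets[1:]:              # a chain carries a single level g_n
        g = chain[0][0]
        surviving = 1.0
        for (_, delta) in chain:          # ancestors mask descendants' losses
            surviving *= (1.0 - delta)
        loss[g] += bw0 * (1.0 - surviving)
    # Eq. (33): the reference level with the worst loss limits the cluster
    return bw0 - max(loss[1:])
```

For example, with bw_0 = 10, one unrelated outlet losing 20% of the level-1 references, and a chain of two outlets each with δ = 0.5 on the level-2 references, the chain loses 10·(1 − 0.25) = 7.5, which dominates, so the cluster contributes 2.5.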
Since an outlet d_j ∈ D_{i,0} is neither an ancestor nor a descendant of any outlet d_k ∈ D_i (j ≠ k), we assume that d_j ∈ D_{i,0} does not have any significant effect on any d_k ∈ D_i, and vice versa, for all j ≠ k. Assume that the outlet d_j ∈ D_{i,0} carries some g_j th-level references of the zeroth-level cluster #i. The loss in bandwidth from the zeroth-level cluster #i due to the presence of the faulty outlet d_j ∈ D_{i,0} is then given by bw_0 δ(g_j − 1, h_j, x_j).

For every pair of outlets d_j, d_k ∈ D_{i,n} (1 ≤ n ≤ m), d_j is either an ancestor or a descendant of d_k. Hence, any two outlets of D_{i,n} have a direct effect on each other. Lemma 5 shows that the outlets of D_{i,n} carry only one type of reference (say, g_n th-level references) generated by the processors of the zeroth-level cluster #i. If the faulty outlets of the set D_{i,n} were the only faulty outlets in the system, the loss in bandwidth from the zeroth-level cluster #i could be expressed as

bw_0 [1 − Π_{d_j ∈ D_{i,n}} (1 − δ(g_n − 1, h_j, x_j))].

Since, at steady state, the total bandwidth available from a cluster is proportional to the bandwidth available from the ith-level (0 ≤ i ≤ L−1) references of the cluster, the maximum bandwidth which can be obtained from the zeroth-level cluster #i is limited by the references which cause the maximum loss in bandwidth. Let Δbw_i(j) be the loss in bandwidth from the zeroth-level cluster #i caused by the jth-level references and let bw*_0(i) be the total bandwidth contribution from the zeroth-level cluster #i. Then we can write

bw*_0(i) = bw_0 − max[Δbw_i(1), Δbw_i(2), ..., Δbw_i(L−1)].   (33)

The values of Δbw_i(j) (1 ≤ j ≤ L−1) can be determined using Algorithm 4. Thus, the bandwidth of the HIN in the

presence of the faulty child outlets of D is

BW_D = Σ_{i=1}^{C_0} bw*_0(i),   (34)

where bw*_0(i) is given by (33).

3.3. Bandwidth analysis of a HIN with fault-tolerant INs

When there is no fault in this type of HIN, the analytical model is the same as that of the multiple-link-based HIN. The only difference is that for the HIN with fault-tolerant INs a_i = b_i = 1 for 1 ≤ i ≤ L−1. Since an ith-level backup circuit can move z_i references through it, the performance of the HIN will not degrade if z_i or fewer references cannot move through the main crossbar. Thus, the analytical models are developed for the case when more than z_i references cannot move through the main crossbar. Here we present an analytical model for only one faulty IN in the system. The model can easily be extended to multiple faulty INs in the same way as was done for multiple faults in the other type of HIN, described in the previous subsection.

Let the faulty IN be an ith-level IN. Some of the inlets of the fault-tolerant IN will not be able to send their references through the main crossbar when there are faults in the main crossbar. Let us call these inlets faulty inlets. The model depends on whether or not the parent inlet is one of the faulty inlets.

3.3.1. The parent inlet is not one of the faulty inlets

Let the number of faulty inlets be g (g > z_i). The references from these g faulty inlets will move to the backup circuit. The average number of distinct references which will try to move through the backup circuit is given by

db_i = Σ_{j=i}^{L−1} m_j [1 − (1 − pu_i u_{i,j}/m_j)^g].   (35)

Hence, the probability that there is a reference on any output line of the (k_i + 1) × z_i backup crossbar is given by

qu_i = { db_i/z_i, if db_i/z_i < 1,
         1,        if db_i/z_i ≥ 1.   (36)

The probability that the parent outlet of the faulty IN is going to be accessed by at least one reference is

pu*_{i+1} = 1 − (1 − pu_i u_{i,i})^{k_i − g} (1 − qu_i u_{i,i})^{z_i}.   (37)
In our approximate model we assume that the bandwidth contribution from the (i+1)th- and higher-level references of the faulty ith-level cluster is proportional to pu*_{i+1} and that of a good ith-level cluster is proportional to pu_{i+1}. Let dd*(i, i) be the average number of distinct ith-level references competing for all the child outlets of the faulty ith-level IN. The value of dd*(i, i) can be expressed as

dd*(i, i) = (k_i − g)[1 − (1 − pu_i u_{i,i})^{k_i−1−g} (1 − qu_i u_{i,i})^{z_i}]
          + g[1 − (1 − pu_i u_{i,i})^{k_i−g} (1 − (1 − 1/g) qu_i u_{i,i})^{z_i}].   (38)

Let dd(i, i) be the average number of distinct ith-level references competing for all the child outlets of a good ith-level IN. The value of dd(i, i) can be expressed as

dd(i, i) = k_i [1 − (1 − pu_i u_{i,i})^{k_i−1}].   (39)

The total bandwidth of the HIN with one ith-level faulty IN, BW*, can be expressed as

BW* = BW {1 − (1/C_i)[1 − min(pu*_{i+1}/pu_{i+1}, dd*(i, i)/dd(i, i))]}.   (40)

3.3.2. The parent inlet is one of the faulty inlets

The average number of distinct references which will try to move through the backup circuit is given by

db_i = pd_{i+1} + Σ_{j=i}^{L−1} m_j [1 − (1 − pu_i u_{i,j}/m_j)^{g−1}].   (41)

Now the value of qu_i can be determined using (36). Let x be the fraction of db_i which are in-cluster references (the references which came through the g − 1 faulty child inlets). The value of x can be expressed as

x = (1/db_i) Σ_{j=i}^{L−1} m_j [1 − (1 − pu_i u_{i,j}/m_j)^{g−1}].   (42)

The probability that the parent outlet of the faulty IN is going to be accessed by at least one reference is

pu*_{i+1} = 1 − (1 − pu_i u_{i,i})^{k_i+1−g} (1 − x qu_i u_{i,i})^{z_i}.   (43)

The total number of distinct ith-level references competing for all the child outlets of the faulty ith-level IN can be expressed as

dd*(i, i) = (k_i + 1 − g)[1 − (1 − pu_i u_{i,i})^{k_i−g} (1 − qu_i u_{i,i} x)^{z_i}]
          + (g − 1)[1 − (1 − pu_i u_{i,i})^{k_i+1−g} (1 − (1 − 1/(g−1)) qu_i u_{i,i} x)^{z_i}].   (44)
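Equations (41), (42) and (36) chain together as shown in the sketch below. The parameter names (`pd_next` for pd_{i+1}, and `u_ij`/`m_j` for the per-level routing terms u_{i,j} and m_j) and the demonstration numbers are illustrative assumptions, not values from the paper.

```python
def backup_load(pd_next, pu_i, u_ij, m_j, g, z_i):
    """Traffic through the backup circuit of a faulty fault-tolerant IN
    whose parent inlet is among the g faulty inlets.

    Returns (db_i, x, qu_i): backup traffic (Eq. 41), the in-cluster
    fraction (Eq. 42) and the per-line backup utilisation (Eq. 36).
    """
    # Eq. (41): distinct references offered by the g-1 faulty child inlets
    in_cluster = sum(m * (1.0 - (1.0 - pu_i * u / m) ** (g - 1))
                     for u, m in zip(u_ij, m_j))
    db_i = pd_next + in_cluster           # plus the parent-inlet references
    # Eq. (42): fraction of db_i generated inside the cluster
    x = in_cluster / db_i
    # Eq. (36): reference probability on a backup-crossbar output line
    qu_i = min(db_i / z_i, 1.0)
    return db_i, x, qu_i
```

For instance, with pd_{i+1} = 0.5, pu_i = 0.8, a single reference level with u_{i,j} = 0.5 and m_j = 4, g = 3 faulty inlets and z_i = 2, the two faulty child inlets offer 4·(1 − 0.9²) = 0.76 distinct references, so db_i = 1.26 and qu_i = 0.63.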

The probability that an out-of-cluster reference (a reference which comes from the parent IN) will pass through the (k_i + 1) × z_i crossbar of the backup circuit is

pd*_{i+1} = qu_i z_i pd_{i+1}/db_i.   (45)

Let us use the term Δc(i+1, g), as shown below, to indicate the degradation due to the fact that references from g inlets of an ith-level IN cannot move through the main crossbar and one of these g inlets is the parent inlet:

Δc(i+1, g) = 1 − pd*_{i+1}/pd_{i+1}.   (46)

The effect of this degradation is similar to that of a faulty (i+1)th-level child outlet of the multiple-link-based HIN. Thus, the total loss in bandwidth can be expressed as

(BW/C_i)[1 − min(pu*_{i+1}/pu_{i+1}, dd*(i, i)/dd(i, i))] + (L − i − 1) BW Δc(i+1, g)/C_i.

Hence, the total bandwidth of the HIN with one ith-level faulty IN is

BW* = BW {1 − (1/C_i)[1 + (L − i − 1) Δc(i+1, g) − min(pu*_{i+1}/pu_{i+1}, dd*(i, i)/dd(i, i))]}.   (47)

4. NUMERICAL RESULTS AND DISCUSSIONS

We have analyzed a number of four-level hierarchical systems. In our analysis, the memory references of a processor were distributed among the memory modules in such a way that the INs at no particular level became a bottleneck; that is, we tried to keep the utilization of all the INs the same. Note that when the INs at a particular level become a bottleneck, the performance of a hierarchical system degrades severely, because most of the processors are blocked waiting for the memory modules at the corresponding level.

We have developed simulation models to verify the accuracy of the analytical models presented in this paper. A simulation program was written to simulate the synchronous behavior of the hierarchical multiprocessors. Queues were maintained in the simulation model to keep track of the blocked processors. The simulation program was driven by a linear congruential random number generator.
In order to determine the bandwidth of a particular system, with a 95% confidence interval, for a given set of parameters, the program was run ten times with different seeds and each run lasted 400,000 memory cycles. The first 50,000 memory cycles of each run were ignored in order to avoid the initial transients.

4.1. Results for the multiple-link-based system

In order to have the same utilization for all the INs of a multiple-link-based system, we determined the values of s_i (0 ≤ i ≤ L−1) as follows:

s_0 = m_0/(m_0 + a_1),   (48)
w_1 = a_1/(m_0 + a_1),   (49)
s_i = w_i k_i b_i/(k_i b_i + a_{i+1}), for 1 ≤ i ≤ L−2,   (50)
w_{i+1} = w_i a_{i+1}/(k_i b_i + a_{i+1}), for 1 ≤ i ≤ L−2,   (51)
s_{L−1} = w_{L−1}.   (52)

Since the bandwidth available from the parent outlet of an ith-level IN is the same as that from the parent inlet of the ith-level IN, for a real system we can assume that a_i = b_i (1 ≤ i ≤ L−1).

Table 1 shows the parameters of five different four-level HINs and their bandwidth under fault-free conditions. For all these systems a_i = b_i = 2 (1 ≤ i ≤ L−1). Both the analytical and simulation results shown in Table 1 were obtained for ψ = 1. The simulation results are shown with a 95% confidence interval. From Table 1 it is seen that, for fault-free conditions, the results from the analytical model are very close to those from the simulation model. The error column of Table 1 shows the error in the analytical results, which is under 5%.

Each of the above five systems was analyzed under six different types of faulty conditions, and we assumed that a faulty outlet has only one faulty link. Table 2 shows the types of faults that were investigated for the above-mentioned five HINs. Tables 3 and 4 show the performance of the systems HIN1–HIN5 in the presence of the different types of faulty outlets. These tables also show that the results from the analytical models are very close to those from the simulation models. For most of the cases, the results from the analytical models are within 5% of those from the simulation models.
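Equations (48)–(52) telescope: s_0 + w_1 = 1 and s_i + w_{i+1} = w_i at every step, so the reference fractions s_i always sum to one. A sketch follows, with the lists k, a and b indexed from level 1 (index 0 unused); the HIN1-like parameter values in the usage example are taken from Table 1.

```python
def reference_fractions(m0, k, a, b, L):
    """Eqs. (48)-(52): reference fractions s_i that equalise IN utilization.

    m0 : number of zeroth-level memory modules
    k, a, b : per-level parameters, valid for indices 1 .. L-1
    """
    s = [0.0] * L
    s[0] = m0 / (m0 + a[1])                              # Eq. (48)
    w = a[1] / (m0 + a[1])                               # Eq. (49): w_1
    for i in range(1, L - 1):
        s[i] = w * k[i] * b[i] / (k[i] * b[i] + a[i + 1])  # Eq. (50)
        w = w * a[i + 1] / (k[i] * b[i] + a[i + 1])        # Eq. (51)
    s[L - 1] = w                                         # Eq. (52)
    return s

# HIN1-like parameters: m0 = 2, k = (4, 8, 8), a_i = b_i = 2
s = reference_fractions(2, [0, 4, 8, 8], [0, 2, 2, 2], [0, 2, 2, 2], 4)
```

Here s_0 = 2/4 = 0.5, s_1 = 0.5·8/10 = 0.4, and the remaining 0.1 is split between levels 2 and 3; the fractions sum to exactly one, as the telescoping argument predicts.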
Comparing the results of Table 1 with those of Tables 3 and 4, we see that the performance of a system degrades in the presence of faults. The degradation depends on the position of the faults as well as on the number of faults. The performance degradation of the different hierarchical systems in the presence of different types of faulty parent outlets is summarized in Table 5. This table shows that the performance of a system is more sensitive to the position of a fault than to the number of faults. For example, the degradation due to one third-level faulty parent outlet (see fault F1) is more than that due to two faulty parent outlets at the first and second levels (see fault F2). Since the performance degradation of a HIN with a large number of processors is very significant when the highest-level outlets become faulty, we further analyzed a HIN with a large number of processors and with faults at the highest level in order to determine the accuracy of our analytical model at this high degradation. The parameters of the HIN which we investigated are k_1 = k_2 = k_3 = 8. The HIN

TABLE 1. Some HINs and their bandwidths under fault-free conditions (ψ = 1). Parameters of the HINs: a_i = b_i = 2 for 1 ≤ i ≤ L−1.

System | n_0 = m_0 | k_1 | k_2 | k_3 | N = M | Analytical | Simulation      | Error in %
HIN1   | 2         | 4   | 8   | 8   | 512   | 317.82     | 309.23 ± 0.03   | 2.8
HIN2   | 4         | 4   | 8   | 8   | 1024  | 420.95     | 405.48 ± 0.04   | 3.8
HIN3   | 4         | 6   | 8   | 8   | 1536  | 622.41     | 598.34 ± 0.02   | 4.0
HIN4   | 4         | 8   | 8   | 8   | 2048  | 824.04     | 788.28 ± 0.04   | 4.5
HIN5   | 5         | 8   | 8   | 8   | 2560  | 859.99     | 828.14 ± 0.04   | 3.8

TABLE 2. Types of faults investigated for the systems HIN1 through HIN5 (note that a faulty outlet has only one faulty link).

F1: One third-level parent outlet is faulty.
F2: Two parent outlets (a first-level and a second-level outlet) are faulty. Neither faulty outlet is the ancestor/descendant of the other.
F3: Three parent outlets (a first-level, a second-level and a third-level outlet) are faulty. No faulty outlet is the ancestor/descendant of the other faulty outlets.
F4: One third-level child outlet is faulty.
F5: Two child outlets (a first-level and a second-level outlet) are faulty. Neither faulty outlet is the ancestor/descendant of the other.
F6: Three child outlets (a first-level, a second-level and a third-level outlet) are faulty. No faulty outlet is the ancestor/descendant of the other faulty outlets.

TABLE 3. Bandwidths of the systems HIN1–HIN5 in the presence of faults F1 and F2 (ψ = 1).

       | Fault F1                                  | Fault F2
System | Analytical | Simulation    | Error in %   | Analytical | Simulation    | Error in %
HIN1   | 310.38     | 302.01 ± 0.03 | 2.8          | 315.90     | 306.23 ± 0.03 | 3.2
HIN2   | 405.51     | 389.58 ± 0.04 | 4.1          | 417.68     | 400.21 ± 0.02 | 4.4
HIN3   | 595.86     | 573.44 ± 0.02 | 3.9          | 617.57     | 593.21 ± 0.03 | 4.1
HIN4   | 786.42     | 750.33 ± 0.04 | 4.8          | 817.62     | 780.26 ± 0.03 | 4.8
HIN5   | 820.69     | 784.11 ± 0.04 | 4.7          | 853.29     | 819.91 ± 0.04 | 4.1

TABLE 4. Bandwidths of the systems HIN1–HIN5 in the presence of faults F3–F6 (ψ = 1).
Bandwidth of the HINs under different types of faults:

System | Fault F3 | Fault F4 | Fault F5 | Fault F6
HIN1   | 308.47   | 310.65   | 314.90   | 307.73
HIN2   | 402.24   | 405.68   | 414.49   | 399.23
HIN3   | 591.02   | 596.04   | 612.85   | 586.47
HIN4   | 780.00   | 786.60   | 811.34   | 773.91
HIN5   | 813.99   | 820.83   | 846.52   | 807.37

TABLE 5. Loss in bandwidth (in %) in the presence of different types of faults (ψ = 1).

System | F1   | F2   | F3   | F4   | F5   | F6
HIN1   | 2.34 | 0.60 | 2.94 | 2.26 | 0.92 | 3.17
HIN2   | 3.67 | 0.78 | 4.44 | 3.63 | 1.53 | 5.16
HIN3   | 4.27 | 0.78 | 5.04 | 4.24 | 1.54 | 5.77
HIN4   | 4.57 | 0.78 | 5.34 | 4.54 | 1.54 | 6.08
HIN5   | 4.57 | 0.78 | 5.35 | 4.55 | 1.57 | 6.12

FIGURE 3. Bandwidth versus the number of processors of a HIN in the presence of different types of faulty outlets (k_1 = k_2 = k_3 = 8 and a_i = b_i = 2 for 1 ≤ i ≤ 3).

TABLE 6. Types of faults investigated for the HIN with k_1 = k_2 = k_3 = 8.

S1: Fault-free HIN.
S2: Four highest-level parent outlets are faulty.
S3: Four highest-level child outlets are faulty.
S4: Two highest-level parent outlets and two highest-level child outlets are faulty.

TABLE 7. Types of faults investigated for the HINs with fault-tolerant INs.

X1: Fault-free HIN.
X2: A second-level crossbar has four faulty inlets.
X3: A third-level crossbar has four faulty inlets.
X4: A second-level crossbar has four faulty inlets and one of the faulty inlets is the parent inlet.

was analyzed under four different conditions, as shown in Table 6. The bandwidth of the HIN was determined for ψ = 1, and the number of processors in the system was varied from 1024 to 8192. The results from the simulation model are shown in Figure 3. This figure shows that for all three types of faults (S2, S3 and S4) the degradation is almost the same. The system saturates after 3072 processors. At the saturation point, the degradation due to all three types of faults is about 20%, which is significant. Since the degradation due to all three types of faults is the same, we can conclude that the degradation due to a faulty parent outlet and that due to a faulty child outlet are the same, as long as the faults occur at the same level and a_i = b_i for all i.
Since in practice it is unlikely that a given system will have a very large number of faults at the same time, and since the simulation is very time consuming, we did not try to determine how accurate our analytical model is for a system with a very large number of faults.

4.2. Results for the HINs with fault-tolerant INs

For the HINs with fault-tolerant INs, we again determined the values of s_i using (48)–(52). The parameters of the HINs which we investigated are the same as those shown in Table 1, except for the values of a_i and b_i for 1 ≤ i ≤ 3; note that for the HINs with fault-tolerant INs a_i = b_i = 1 for 1 ≤ i ≤ 3. We assumed that for all the fault-tolerant INs z_i = 1 for 1 ≤ i ≤ 3. This means that the maximum number of references which can move through the backup circuit of any IN is one. Since the parameters k_1, k_2 and k_3 of the HINs with fault-tolerant INs which we investigated are the same as those of the HINs shown in Table 1, we still use the notation HIN1, HIN2, HIN3, ..., HIN5 to indicate these HINs. Each of these five HINs has been analyzed under the four different conditions shown in Table 7. Figure 4 shows the bandwidth of the HINs under the different conditions. This figure also shows that the degradation in