The Penta-S: A Scalable Crossbar Network for Distributed Shared Memory Multiprocessor Systems

The Peta-S: A Scalable Crossbar Network for Distributed Shared Memory Multiprocessor Systems Abdulkarim Ayyad Departmet of Computer Egieerig, Al-Quds Uiversity, Jerusalem, P.O. Box 20002 Tel: 02-2797024, Fax: 02-2797023 Abstract: - It is well kow that crossbar switch has the best performace amog the multiprocessor itercoectio etworks. However expadig this switch so that the performace grows liearly with the umber of processig elemets (N) is costly ad complicated. The size of the switch grows as a fuctio of N 2. This paper presets a expasio scheme, which keeps the liearity betwee the performace, the umber of processig elemets ad the umber of switches.16x16 ad 32x32 crossbar switch modules are used as buildig blocks of this scheme. It is a modular scheme as well ad so othig eeds to be redesiged. The results of mathematical aalysis show that the performace of this scheme is better tha that of a Delta MIN built if 4x4crossbar switch modules (Delta-4) ad comparable to that of MINs built with the same crossbar switch module size, i.e., Delta-16 ad Delta-32.The cost efficiecy of this scheme is better tha that of ay other etwork except Delta-4 MIN. Key Words: Itercoectio Networks, Multiprocessors, Scalability, Modularity, Crossbar, Performace, Cost Efficiecy, Mathematical Model. 1. Itroductio I multiprocessor ad multi-computer systems, the itercoectio etwork is of prime importace. It is the medium through which the processig elemets or the computers of the system exchage iformatio. Over the last three decades, large umber of etwork topologies were proposed, desiged, implemeted ad tested. May were mathematically aalyzed or simulated [1, 2]. I all etworks, the major cocer was ad is still, to achieve a better performace ad a better scalability. Performace is measured i terms of badwidth or throughput, probability of acceptace ad latecy. Scalability is a rather imprecise term used to idicate a desig that allows the system to be icreased i size ad i doig so, obtai icreased performace. This is similar to the defiitio of modularity give by Stestrom [3], which states " for a multiprocessor to be modular, the badwidth of the etwork must be proportioal to the umber of processors i the system". Icreasig the system meas expadig it by addig more fuctioal uits to obtai higher performace without redesigig these uits (expadability). The crossbar switch etwork was first proposed by Wulf ad Bell i 1972, ad implemeted by Wulf ad his team i 1981 as reported i [1]. From its early aalysis ad implemetatios [1,2,4], It became well kow that the crossbar switch etwork has the best performace amog the multiprocessor etworks. Its badwidth steadily icreases after a certai size. However, for a full crossbar etwork (NXN), doublig the umber of its iputs ad outputs meas icreasig the umber of switches four times i order to keep the scalability measures. This process ecessitates the redesig of the system. May attempts were made i order to desig scalable ad easily expadable crossbar-based systems. I all those attempts, the desigers realized that they could t achieve full scalability ad expadability at the same time. So, they resorted to compromise the two measures. The most famous example is the large umber ad variety of multistage itercoectio etworks (MINs) that were desiged usig small crossbar modules as buildig blocks. Samples of these MINs are show i [5-7]. The performace of the MIN etwork lies betwee the performace of the crossbar ad that of the shared bus[1,2 ]. Delta etwork built with 4x4 crossbar uits proved to be the most cost effective amog the MIN etworks if the cost is calculated i terms of the umber of switches used 1

i the etwork [5]. However, the advet i VLSI techology has made this measure of less importace. I 1993, Barry Wilkiso proposed, aalyzed ad simulated a overlapped scalable topology for the crossbar switch. I this topology rhombic crossbar modules are coected to each other so that the processor ca access two memory modules i two eighborig crossbar modules. As far as the processor is tryig to access memories i eighborig modules, its performace matches that of a full crossbar, but if the rage of requests reaches farther, the badwidth will be degraded drastically [8].The mai issue ow is to achieve a better performace for a modular(scalable ad easily expadable) desig. I this paper, the author presets a scalable crossbar ad a cost effective scheme which is easy to expad, easy to cotrol, ad yet has a performace comparable, ad sometimes better tha those of MINs havig the same umber of iputs ad outputs. The scheme will be aalyzed for multiprocessor systems the, i a separate work, it will be show how the scheme ca be used for multi-computer systems. This work was doe eight moths ago for two purposes. Firstly to model the proposed expasio of the traditioal crossbar switch used i multiprocessor systems, ad secodly, to use its results as a idicator for the expected results of the expasio scheme of a ewly proposed STC104 like crossbar switch for multi-computer etworks. The idea of this switch was first proposed, desiged ad simulated at The Jordaia Uiversity of Sciece ad Techology (JUST) i Jorda [9]. The results obtaied the bettered those of the STC104 switch show i [10]. Curretly the author is supervisig a M.Sc project o a improved versio of the JUST switch ad its expasio accordig to this scheme (Peta-S) at Al-Quds Uiversity i Jerusalem. The prelimiary simulatio results have show to be cosistet with the results of the mathematical model preseted i this paper ad they will be submitted i separate papers i the ear future. 2. The Peta-S Scheme The buildig block i this scheme is a full (x crossbar module of the multiprocessor type. By multiprocessor type we mea that the data ad the address are preseted o separate parallel busses ad each destiatio bus of the crossbar has its ow arbiter. No iput or output buffers are used. Each module ca accommodate processig elemets (PEs) of a multiprocessor. A -shuffle coectio is used to coect up to (+1) of these modules. For example, if 32X32 size crossbar modules are used as buildig blocks, the up to 1056 PEs ca be coected usig this scheme, see figure 1-A. Each PE i the etwork is provided with two ports; oe to coect it to its crossbar module (hereiafter referred to as the crossbar port), ad the other to coect it to a PE of aother module i the scheme via the shuffle (hereiafter called the shuffle port). Each port has a iput ad a output see figure 1-B. The iput of crossbar port has a bypass cotrol uit to pass the iformatio to the output of the shuffle port if the iformatio belogs to a PE i aother crossbar module. The iput of the shuffle port is provided with a buffer i order to keep the icomig requests from other modules pedig their service via the output of the crossbar port i the comig etwork cycles. This topology forms a Sigle Stage-Sigle Shuffle Scalable system, hece the ame Peta-S scheme. The -shuffle provides a direct lik betwee each PE i the crossbar module ad its correspodet (cliet) i aother module. Each two correspodets are meat to provide a uique commuicatio chael betwee two crossbar modules. For example, the shuffle port of PE5 of module 2 (PE25) is coected via the shuffle to the shuffle port of PE2 of module 5 (PE52). This makes PE25 ad PE52 two correspodets, which serve the commuicatio betwee the PEs of modules 2 ad 5. Figure 1 shows a block diagram of the system ad the two ports of the PE. The shuffle lik is desiged to coect +1 crossbar modules of x size. Note i figure 2 that the shuffle is desiged so that the shuffle port output of PEij is coected to the shuffle port iput of PEji ad vice versa for all (i /= j). Note also that the ports of PEij are coected to PEkj for all (i = j ),where k =+1. I figure 2, PE11, PE22, ad PE33 are coected to PE41, PE42 ad PE43 respectively. 2

. Figure 1-A: A block diagram of the system oe etwork cycle PE13 seds the data to PE12. The Bypass mechaism of PE12 port recogizes that the data does ot belog to PE12 ad directs it to the output of the shuffle port. Thus the data goes directly to the iput of PE21 shuffle port ad stored i a local buffer. I the ext cycle PE21 seds the data locally via module 2 to PE22. Figure 1-B: The Crossbar ad the Shuffle ports of the processig elemet. 3. The Commuicatio Process I this system, all the requests are preseted iterally, i.e., withi the crossbar module, regardless of whether the destiatio belogs to the same module as the source or belogs to aother module. I case the destiatio belogs to aother module, the bypass mechaism allows the data to go directly through the shuffle lik to the cliet PE i that module. The data is stored i a temporary buffer i that PE i order to be set to the fial destiatio i the ext cycle. For example, assume i figure 1 (see also figure 2 to follow the example) that the output of device PE12 shuffle port is coected via the shuffle to the iput of device PE21 shuffle port. This meas that device PE21 represets the cliet destiatio of module 1 i module 2. Assume ow that device PE13 wat to sed data to device PE22.The commuicatio process takes place as follows: I Figure 2: A 3-shuffle, which coects four 3X3 crossbar modules (a total of 12 PEs). Regardig the priority, the devices give higher priority to their ow requests i oe cycle ad a higher priority to the requests comig from other modules i the ext. This pre-determied priority guaratees fairess betwee local ad global requests. The priority for accessig the destiatio withi the module depeds o the priority algorithm used i the module desig. 3

4. The System Mathematical Aalysis Mathematical aalysis suits the tightly coupled multiprocessor etworks more tha multicomputer etworks. This is due to the fact that i multiprocessor etworks the request ca be preseted, arbitrated for ad accepted or rejected i the same cycle. I multicomputer systems, other requests ca arrive while previous requests are beig served. This makes multi-computer etworks difficult to be mathematically aalyzed. However, for certai topology, aalyzig the multiprocessor etwork gives a good idicatio of the correspodig multi-computer etwork behavior. So the author decided to mathematically aalyze the multiprocessor versio of the PENTA- S scheme. Simulatio for the multi-computer versio of PENTA-S will be preseted i a ear future work. I aalyzig this scheme, the followig assumptios are used: 1.x crossbar modules, which ca coect sources to destiatios, are used as a buildig block i this scheme. 2. k modules are used i buildig the system, where k ca have a maximum value of +1. 3. The devices preset their ow requests i oe cycle ad requests comig from other modules i the ext. 4. The rejected requests will be discarded. 5. The cycle is defied as the time required for the request to be preseted, arbitrated for ad served. 6. The probability that a device is makig a request durig oe cycle is r, ad the probability that it is havig a request comig from aother module durig a previous cycle is p. 4.1 Badwidth ad Probability of Acceptace Aalysis I the ith cycle the badwidth of each module is give by the followig equatio: r BW (, ) = 1 (1) Assume k blocks are used the it is expected that 1 BW (, Requests are accepted for local k k 1 PEs, ad BW (, requests are accepted ad k trasferred to other blocks via the shuffle. I the (i+1)th cycle, the probability that requests from other block have arrived is p, ad the probability that a local processor has made a request is r. The priority is give for requests from other blocks. The the probability that there is a request is give by: R = r + p rp (2) So the expected badwidth i the (i+1)th cycle is R BW2 (, ) = 1 (3) So the total badwidth for each block i both cycles 1 r R BW12 (, = 1 + 1 k (4) where R is give by equatio (2), ad p is give k 1 r by: p = 1 i.e., k k 1 r p = 1 1 (5) k Multiplyig equatio (4) by K gives the badwidth of the system for two successive cycles r R BW12 ( k,, = 1 + k 1.(6) So the average badwidth of the system per cycle is give by: BW12 ( k,, BW ( k,, = (7) 2 This equatio reduces to N p ( r + p rp) BW ( k,, = 1 + 1 2 k 1 (8) Where N = k The probability of acceptace P a is defied as the ratio betwee the badwidth ad the expected umber of requests durig a cycle. BW ( k,, P a ( k,, = (9) A 4

( r + R) N Where A = 2 For the purpose of compariso, the badwidth ad the probability of acceptace of full crossbar ad a delta-4 multistage itercoectio etwork are stated i the followig equatios: The badwidth of NXN crossbar is give by r BW ( N, N ) = N N 1 (10) N where r is the probability that a PE is makig a request durig a cycle. The probability of acceptace of the crossbar is give by BW ( N, N) P a ( N, N) = (11) rn The badwidth of a delta-b etwork which uses (bxb)crossbar modules as buildig blocks is give by: BW = i delta ( b, b) b ri (12) Where the umber of iputs ad outputs N is i give by: N = b, where i is the umber of stages i the etwork The results obtaied from equatios 8 to 12 are plotted ad discussed i the ext sectio of this paper. 5. The Results of Mathematical Aalysis 5.1 The Badwidth The badwidth ad the probability of acceptace of PENTA-S, NXN crossbar ad delta-4 etworks are obtaied from the above equatios over various values of N,two values of r (1 ad 0.5), ad two values of K (32 ad 16). The results are show i figures 3 to 17. Note that the term Delta- meas a Delta MIN built of xcrossbar modules, the term Peta- meas a Peta etwork topology built of x crossbar modules, BW meas the badwidth, BW(NXN) meas the badwidth of a full NXN crossbar ad BW(NX0.5N) meas the badwidth of half full crossbar etwork. Figure 3 shows the badwidth of Peta-32 as compared to Delta ad crossbar etworks. The figure shows that Peta-32 has the same badwidth as Delta-4 for N<=256 at r = 1 ad a N higher badwidth tha Delta-4 for N>256 at (r=1). It also has a better badwidth at r = 1 tha the half crossbar etwork (NX0.5N) at r =0.5. The badwidth of Peta-32 is also compareable to those of half crossbar, Delta-32 ad Delta-16 for N<512 at r = 1. It must be oted that Peta etwork betters the Deltas ad the crossbar etworks i beig modular ad easily expadable. As most of MINs use 4X4 switches as buildig blocks, i.e., they are of Delta-4 type, we ca say that Peta32 betters them i havig higher badwidth, beig more modular ad easier to expad. Most MINs are of Delta-4 type because the 4X4 switch prooved to be the most cost efficiet switch size for buildig MINs [5]. Figure 3: Badwidth of Peta-32 as compared to other etworks at r=1 Figure 4:Badwidth of Peta-16 as compared to Halfad Full-Crossbar etworks 5

Figure 4 shows the badwidth of Peta-16 as compared to other etworks for 16=<N=<256 (The maximum size of Peta-16 is 272 PEs). Over this rage, the badwidth of Peta-16 at r = 1 is higher tha badwidth of the half crossbar at r = 0.5 ad less tha that of Peta-32 at r=1. Its badwidth is also comparable to the badwidth of the half crossbar at r = 1. Figure 5 shows the badwidth of Peta-16 as compared to those of Delta etworks at r=1. The figure shows clearly that the badwidth of Peta-16 is early the same as that of Delta-4 ad less tha those of Deltas 16 ad 32. Regardig the modularity ad expadability, the above argumet of Peta-32 applies to Peta-16 as well. Fifure 7 :Badwidth of Peta-16 as compared to other etworks at r=0.5 Figures 6 ad 7compares the badwidth of Petas 32 ad 16 to the badwidth of other etworks at r = 0.5. These figures show that reducig r does ot chage the positio of Peta etworks with respect to other etworks regardig the badwidth. 5.2 The Probability of Acceptace The probability of acceptace is a measure which shows the probability of acceptig a request durig a cycle. The reciprocal of the probability of acceptace idicates the average latecy, i cycles, of servig the request. Figure 5: Badwidth of Peta-16 as compared to Delta etworks at r = 1. Figure 6 : The Badwidth of Peta-32 as compared to other etworks at r=0.5 Figure 8:Probability of Acceptace of Peta-32 as compared to other etworks at r=1 6

Figure 8 shows the probability of acceptace of Peta-32 as compared to other etworks at r =1. It is clear from the figure that Peta-32 has a higher probability of acceptace tha Delta-4 etwork for all values of N exept N =256 where they have equal probability of acceptace. We ca ote from the figure that Delta-16, Delta-32, ad Full crossbar etworks have higher probability of acceptace tha Peta-32. The probability of acceptace of Half Crossbar etwork falls betwee the probability of acceptace of Delta-32 ad that of delta-16 for N>384 ad r =1. acceptace. Figure 9 shows that Peta-16 ad Delta-4 have, more or less, the same probability of acceptace for N > 48. Figures 10 ad 11 show that chagig the value of r to 0.5 does ot chage the positio of Peta etworks with respect to other etworks regardig the probability of acceptace. Figure 11 :Probability of Acceptace of Peta 16 as compared with other etworks at r = 0.5 Figure 9:Probability of Acceptace of Peta-16 as compared to other etworks at r=1 5.3 The Cost Efficiecy I this study the author calculates the cost efficiecy i terms of the badwidth per simple switch. This is simply because the etworks used i this study utilizes various switch sizes. However, regardless of the switch size, all the switchig elemets are built of simple switches. Figure 10:Probability of Acceptace of Peta-32 as compared to other etworks at r=0.5 For N<384 Half-crossbar has less probability of acceptace tha both Delta-32 ad Delta-16. Also, we ca ote that i Delta etworks, the larger the size of the switch, the higher the probabilty of Figure 12:Cost Efficiecy of Petas 16 ad 32 as compared to other etworks For example the 4X4 switch used i delta-4 is a 4X4 crossbar switch that is made of 16 simple 7

switches ad the 16X16 switch used i Delta-16 ad Peta-16 is a 16X16 crossbar switch made of 256 simple switches. Figure12 shows the cost efficiecy of Peta-32 ad Peta-16 as compared to those of other etworks. It is clear that the most cost efficiet Network is Delta-4, the thig with agrees with the calculatios preseted i [5]. However as ow we have 16X16 ad 32X32 crossbar switches itegrated o oe chip it is more temptig to build Delta MINS of these chips, i.e., Delta-16 & Delta- 32, to have a better performace. Figure 13 compares the cost efficiecy of Peta-32 to those of Delta-16 ad Delta-32. It is clear that the cost efficiecy of Peta-32 is higher tha that of Delta- 32. Delta-16 has a higher cost efficiecy that that of Peta-32 oly for umber of PEs<100. grows with the same rate as the umber of PEs, i.e., the size of the switch grows as a fuctio of N 2, where N is the umber of the PEs the switch desiged to serve. Figure 14:Cost per PE of Petas 16 ad 32 as compared to Delta etworks Figure 13:Cost Effeciecy of Peta-32 as compared to Delta Networks 5.4 The Cost Per Processig Elemet Figures 14 ad 15 show two thigs, Firstly, Peta etworks have a fixed cost per processig elemet regardless of the size of the etworks, 16 simple switches per PE for Peta-16 ad 32 simple switches per PE for Peta-32. For umber of processig elemets N>=128, Delta-4 matches Peta-16, i this measure. For umber of PEs N>=64 Delta-16 matches Peta-32. So Peta etwork allows the usage of larger switch size without the pealty of the cost. Note that the cost per processig elemet for the full crossbar C/PE(FCB) ad the half crossbar C/PE(HCB) Figure 15:Cost per PE of Petas 32 ad 16 as compared to Half ad Full Crossbar etworks 6. Coclusio I this paper we have mathematically aalyzed the behavior of a ew scheme of coectig clusters of multiprocessor systems. The term scheme meas the topology plus its usage (especially, the local ad global addressig plus assigig a cliet for each module i all other modules of the etwork). This proposed scheme is a mixture of circuit switchig ad wormhole routig schemes, i the 8

sese that iside the crossbar module, the Network behaves i a circuit switchig fashio, where as i the case the request is made for other modules, the request bypasses th the local cliet PE of the destiatio module to the buffer associated with the cliet PE i the destiatio module. The cliet i the destiatio module allows the buffer to preset its request iterally i order to address its fial destiatio. This is very much like the wormhole routig but with a sigle middle stage i the way to its destiatio. The performace of such a scheme proved to be better tha that of the traditioal MINs built of (4X4) crossbar switch modules, ad comparable to that of moder MINs built of 16X16 ad 32X32 crossbar modules. It is cost effective, easy to expad, ad has a reasoable aggregate scalability Overlappig Coectivity, IEEE Tras. Compu., V. 41, No. 6, PP. 738-46. [9] Nabeel Hasaseh, Router Architecture With Collisio Avoidace Cotrol Usig Crossbar Switch i Multicomputer Systems, M.Sc Thesis, Supervised by T. El-Dos ad A. Ayyad, Computer Egieerig Departmet, JUST Uiversity, Jorda, July 2000. [10] P.W Thompso ad J.D Lewis, The STC104 Packet Routig Chip, SGS- Thomso Microelectroics Limited, Bristol,Eglad, 1997, PP. 1347-1356. Refereces [1] K. Hwag ad F. Briggs, " Computer Architecture ad Parallel Processig", McGraw-Hill, USA, 1984. [2] B. Wilkiso, Computer Architecture; Desig ad Performace, 2 d editio, Pretice Hall Europe, Hertford shire, U.K,1996. [3] P. Stestrom, Reducig the cotetio i Shared memory Multiprocessors, IEEE Computer, V. 21, No. 11,Nov, 1988, PP 26-37. [4] B. Wilkiso ad H. Abachi, " Crossbar Switch Multiprocessor System", Microprocessors ad Microsystems, Vol. 7, No. 2, March 1983, PP 75-79 [5] L. Bhuya ad D. Agrawal, " Desig ad Performace of Geeralized Shuffle Networks", IEEE Trasactio o Computers, Vol. C-32, No. 12, Dec., 1983, PP 1081-1089. [6] C-L. Wu ad T-Y. Feg, "O a Class of Multistage Itercoectio Networks, IEEE Trasactio o Computers, Vol. C-29, No. 8, Aug. 1980, PP 694-702. [7] J. Patel, Performace of Processor-Memory Itercoectio for Multiprocessors IEEE Trasactio o Computers, Oct., 1981, PP 771-780. [8] A.B. Wilkiso, O Crossbar Switch ad Multiple Bus Itercoectio Networks With 9