I containing data and synchronization information between

Size: px

Start display at page:

Download "I containing data and synchronization information between"

Shannon Jacobs
5 years ago
Views:

~ ~ 1016 IEEE TRANSACTIONS ON COMPUTERS, VOL. 40, NO. 9, SEPTEMBER 1991 Express Cubes: Improvig the Performace of k-ary -cube Itercoectio Networks William J.

1 ~ ~ 1016 IEEE TRANSACTIONS ON COMPUTERS, VOL. 40, NO. 9, SEPTEMBER 1991 Express Cubes: Improvig the Performace of k-ary -cube Itercoectio Networks William J. Dally, Member, ZEEE Abstruct-Express cubes are k-ary -cube itercoectio etworks augmeted by express chaels that provide a short path for olocal messages. A express cube combies the logarithmic diameter of a multistage etwork with the wire-efficiecy ad ability to exploit locality of a low-dimesioal mesh etwork. The isertio of express chaels reduces the etwork diameter ad thus the distace compoet of etwork latecy. Wire legth is icreased allowig etworks to operate with latecies that approach the physical speed-of-light limitatio rather tha beig limited by ode delays. Express chaels icrease wire bisectio i a maer that allows the bisectio to be cotrolled idepedet of the choice of radix, dimesio, ad chael width. By icreasig wire bisectio to saturate the available wirig media, throughput ca be substatially icreased. With a express cube both latecy ad throughput are wire-limited ad withi a small factor of the physical limit o performace. Express chaels may be iserted ito existig itercoectio etworks usig iterchages. No chages to the local commuicatio cotrollers are required. Idex Terms-Commuicatio etworks, cocurret computig, itercoectio etworks, multicomputers, packet routig, packet switchig, parallel processig, topology. I. INTRODU~ON NTERCONNECTION etworks are used to pass messages I cotaiig data ad sychroizatio iformatio betwee the odes of cocurret computers [l], [2], [18], [19]. The messages may be set betwee the processig odes of a message-passig multicomputer [ 11 or betw'ee the processors ad memories of a shared-memory multiprocessor [2]. A itercoectio etwork is characterized by its topology, routig, ad flow cotrol [lo]. The topology of a etwork is the arragemet of its odes ad chaels ito a graph. Routig determies the path chose by a message i this graph. Flow cotrol deals with the allocatio of chael ad buffer resources to a message as it travels alog this path. This paper deals oly with topology. Express cubes ca be applied idepedet of routig ad flow cotrol strategies. The performace of a etwork is measured i terms of its latecy ad its throughput. The latecy of a message is the elapsed time from whe the message sed is iitiated util the message is completely received. Network latecy is the Mauscript received October 14, 1989; revised April 27, This work was supported i part by the Defese Advaced Research Projects Agecy uder Cotracts "14-88K-0738 ad N K-0825 ad i part by a Natioal Sciece Foudatio Presidetial Youg Ivestigator Award, Grat MIP , with matchig fuds from Geeral Electric Corporatio ad IBM Corporatio. The author is with the Artificial Itelligece Laboratory ad the Laboratory for Computer Sciece, Massachusetts Istitute of Techology, Camhridge, MA IEEE Log Number f&9340/91/ $ IEEE average message latecy uder specified coditios. Network throughput is the umber of messages the etwork ca deliver per uit time. Low-dimesioal k-ary -cube etworks usig wormhole routig have bee show to provide low latecy ad high throughput for etworks that are wire-limited [4], [5], [9]. For 3, the k-ary -cube topology is wire-efficiet i that it makes efficiet use of the available bisectio width. This topology maps ito the three physical dimesios i a maer that allows messages to use all of the available badwidth alog their path without ever havig to double back o themselves. Also, low-dimesioal k-ary -cubes cocetrate badwidth ito a few wide chaels so that the compoet of latecy due to message legth is reduced. I most cotemporary cocurret computers, this is the domiat compoet of latecy. Because of their low-latecy, high throughput, ad affiity for implemetatio i VLSI, these k-ary -cube etworks with = 2 or 3 have bee used successfully i the desig of several cocurret computers icludig the Ametek 2010 [19], the J-Machie [7], [8], ad the Mosaic [20]. However, low-dimesioal k-ary -cube itercoectio etworks have two sigificat shortcomigs: Because wires are short, ode delays domiate wire delays ad the distace related compoet of latecy falls more tha a order of magitude short of speed-of-light limitatios. I the J-Machie [7], for example, ode delay is 50 s while the logest wire is 225 mm ad has a time-of-flight delay of 1.5 s. The chael width of these etworks is ofte limited by ode pi cout rather tha by wire bisectio. For example, the J-Machie chael width is limited to 9-bits by pi cout limitatios. I the physical ode width of 50 mm, a six-layer prited circuit board ca hadle over four times this chael width after accoutig for through holes ad local coectios. I short, may regular k-ary -cube itercoectio etworks are ode-limited rather tha wire-limited. I these etworks, ode delay ad pi limitatios domiate wire delay ad wire desity limitatios. The ratios of ode delay to wire delays ad pi desity to wire desity caot be balaced i a regular k-ary -cube. Express cubes overcome this problem by allowig wire legth ad wire desity to be adjusted idepedetly of the choice of radix k, dimesio, ad chael width W. A express cube is a k-ary -cube augmeted by oe or more levels of express chaels that allow olocal messages to

2 DALLY EXPRESS CUBES: IMPROVING PERFORMANCE OF k-ary -CUBE INTERCO"E(JI1ON NETWORKS 1017 bypass odes. The wire legth of the express chaels ca be icreased to the poit that wire delays domiate ode delays. The umber of express chaels ca be adjusted to icrease throughput util the available wirig media is saturated. This ability to balace ode ad wire limitatios is achieved without sacrificig the wire-efficiecy of k-ary -cube etworks. The umber of chaels traversed by a message i a hierarchical express cube grows logarithmically with distace as i a multistage itercoectio etwork [12], [21]. The express cube, however, is able to exploit locality while i a multistage etwork all messages must traverse the diameter of the etwork. With a express cube, both latecy ad throughput are wire-limited ad are withi a small costat factor of the physical limit o performace. The remaider of this paper describes the express cube topology ad aalyzes its performace. Sectio I1 summarizes the otatio that will be used throughout the paper. Sectio 111 itroduces the express cube topology i steps. Basic express cubes (Sectio 111-A) reduce latecy to twice the delay of dedicated wire for messages travelig log distaces. Throughput ca be icreased to saturate the available wirig desity by addig multiple express chaels (Sectio 111-B). With a hierarchical express cube (Sectio 111-C), latecy for short distaces, while ode-limited, is withi a small costat factor of the best that ca be achieved by ay bouded degree etwork. Some desig cosideratios for express cube iterchages are discussed i Sectio IV. 11. NOTATION The followig symbols are used i this paper. They are listed here for referece. the set of chaels i the etwork. Mahatta distace traveled by a message, 12, - zd( + (y,- yd( + (z,- zdl, where the source is at (zs, ys, z,) ad the destiatio is at (zd, yd, zd). the fractio of traffic at level j i a hierarchical express cube. hops, the umber of odes traversed by a message. umber of odes betwee iterchages i a express cube. the radix of the etwork-the legth i each dimesio. the umber of levels of hierarchy i a hierarchical express cube. the message legth i bits. mi, the umber of multiple express chaels at level j. M, the umber of express chaels through each ode., the dimesio of the etwork. N, the set of odes i the etwork. Where it is uambiguous, N is also used for the umber of odes i the etwork, IN(. T,, T,, the latecy of a ode. the latecy of a wire that coects two physically adjacet odes. Tp, the pipelie period of a ode. W, the width of a chael i bits. W, the width of a ode-the umber of wires that may pass ito a ode i each dimesio. a, the ratio of ode latecy to wire latecy, T/T,. p, the ratio of chael width to ode width, WfW. A itercoectio etwork cosists of a set of odes N that are coected by a set of chaels, C N x N. Each chael is uidirectioal ad carries data from a source ode to a destiatio ode. For the purposes of this paper it is assumed that the etwork is bidirectioal: chaels occur i pairs so that (1,722) E C * (2,i) E C. Commuicatio betwee odes is performed by sedig messages. A message may be broke ito oe or more packets for trasmissio. A packet is the smallest uit that cotais routig ad sequecig iformatio. Packets cotai oe or more flow cotrol digits or flits. A flit is the smallest uit o which flow cotrol is performed. A flit i tur is composed of oe or more physical trasfer uits or phits.' A phit is W-bits, the size of the physical commuicatio media. The express cube topology is particularly suitable for use with wormhole routig, a flow-cotrol protocol that advaces each flit of a packet as soo as it arrives at a ode (pipeliig) ad blocks packets i place whe required resources are uavailable [4], [5], [9]. Wormhole routig is attractive i that 1) it reduces the latecy of message delivery compared to store ad forward routig, ad 2) it requires oly a few flit buffers per ode. Wormhole routig differs from virtual cutthrough routig [ll] i that with wormhole routig it is ot ecessary for a ode to allocate a etire packet buffer before acceptig each packet. This distictio reduces the amout of bufferig required o each ode makig it possible to build fast, iexpesive routers. The bisectio width of a etwork is the miimum umber of chaels that must be cut to partitio the etwork ito two equal halves. The wire bisectio is the umber of wires i this chael cutset. Bisectio width gives a lower boud o wire desity, the maximum umber of wires that must cross a uit distace (2-D) or area (3-D) EXPRESS CUBES A. Express Chaels Reduce Latecy Fig. 1 illustrates the applicatio of express chaels to a k-ary 1-cube or liear array. A regular k-ary 1-cube is show i Fig. l(a). The etwork is a liear array of k processig odes, labeled N, each coected to its earest eighbors by chaels of width W. The delay of a phit propagatig through a ode is T,. The delay of the wire coectig two odes is T,. Each chael ca accept a ew phit every Tp. The latecy of a message of legth L set distace D is L T, = HT, -k DT, -k - W Tp. 'There is o costrait that the physical uit of trasfer, phit, must be smaller tha the flow cotrol uit, flit. It is possible to costruct systems with several flits i each phit.

3 1018 IEEE TRANSACTIONS ON COMPUTERS, VOL. 40, NO. 9, SEPTEMBER 1991 ( b) Fig. 1. Isertio of express chaels reduces latecy: (a) A regular k-ary 1-cube etwork may be domiated by ode delay. (b) A Ic-ary 1-cube with express chaels reduces the ode delay compoet of latecy. Message latecy is composed of three compoets as show i (1).2 e first compoet is the ode latecy, due to the umber of hops H. The secod compoet is the wire latecy, due to the distace D. The third compoet is due to message legth L. For a covetioal k-ary -cube, H = D givig L T, = (T, + T,)D + - Tp. W For most etworks T, >> T, so the ode latecy domiates the wire latecy. Express cubes reduce the ode latecy by icreasig wire legth to reduce the umber of hops H. A express k-ary 1-cube is show i Fig. l(b). Express chaels have bee added to the array by isertig a iterchage, labeled I, every i odes. A iterchage is ot a processig ode. It performs oly commuicatio fuctios ad is ot assiged a address. Each iterchage is coected to its eighborig iterchages by a additioal chael of width W, the express chael. Whe a message arrives at a iterchage it is routed directly to the ext iterchage if it is ot destied for oe of the iterveig odes. To preserve the wire-efficiecy of the etwork, messages are ever routed past their destiatios o the express chaels eve though doig so would reduce H i may cases. The delay T,, ad throughput l/tp, of a iterchage are assumed to be idetical to those of a ode. The wire delay of the express chael is assumed to be it,. To simplify the followig aalysis, it is assumed that iterchages add o physical distace to the etwork. Assumig i I D, H = D/i + i ad isertio of express chaels reduces the latecy to I the geeral case, i 1 D, a average message traversig D processig odes travels over Hi = (i + 1)/2 local chaels to reach a iterchage, He = LD/i - 1/2 + 1/(2i)J express chaels to reach the last iterchage before the destiatio, ad fially Hf = (1 + (D - i/2-1/2) mod i) local chaels to the destiatio. The total umber of hops is H = Hi + He + Hf givig a latecy of Tb=(F + 1 D *Throughout this paper the term latecy is used to refer to the latecy of a sigle message i the absece of traffic. For a discussio of the effects of traffic o latecy see [5] ad 191. (2) (3) LTP + -. W For large distaces, D >> a = T,/T,, choosig i = Q balaces the ode ad wire delay. With this choice of i, the latecy due to distace is approximately twice the wire latecy, To M 2T,D. The latecy for large distaces of a express chael etwork with i = a is withi a factor of two of the latecy of a dedicated Mahatta wire betwee the source ad destiatio.3 For small distaces or large a, the i term i the coefficiet of T i (3) is sigificat ad ode delay domiates. For such etworks, latecy is miimized by choosig i = resultig i To E 2 ( - l)t. The use of hierarchical express chaels (Sectio 111-C) ca further improve the latecy for small distaces. B. Multiple Express Chaels Icrease Throughput to Saturate Wire Desity To first order, etwork throughput is proportioal to wire bisectio ad hece wire desity. If more wires are available to trasmit data across the etwork, throughput will be icreased provided that routig ad flow cotrol strategies are able to profitably schedule traffic oto these wires. May regular etwork topologies, such as low-dimesioal k-ary -cubes, are uable to make use of all available wire desity because of pi limitatios. The wire bisectio of a express cube ca be cotrolled idepedet of the choice of radix k, dimesio, or chael width W by addig multiple express chaels to the etwork to match etwork throughput with the available wirig desity W. Fig. 2 shows two methods of isertig multiple express chaels. Multiple express chaels may be hadled by each iterchage as show i Fig. 2(a). Alteratively, simplex iterchages ca be iterleaved as show i Fig. 2(b). I method (a), usig multiple chael iterchages, a iterchage is iserted every i odes as above ad each iterchage is coected to its eighbors usig m parallel express chaels. Fig. 2(a) shows a etwork with i = 4 ad m = 2. The iterchage acts as a cocetrator combiig messages arrivig o the m icomig express chaels with olocal messages arrivig o the local chael ad cocetratig these messages streams oto the m outgoig express chaels. This method has the advatage of makig better use of the express chaels sice ay message ca route o ay express chael. Flexibility i express chael assigmet is achieved at the expese of higher pi cout ad limited expasio. With method (b), iterleavig simplex iterchages, m simplex iterchages are iserted ito each group of i odes. Each iterchage is coected to the correspodig iterchage i the ext group by a sigle express chael. All messages from the odes immediately before a iterchage will be routed o that iterchage s express chaels. Because load caot 3There is othig special about the factor of two. By choosig i = ja the distace compoet of latecy will be (1 + l/j) times the latecy of a Mahatta wire. (4)

4 . I DALLY: EXPRESS CUBES: IMPROVING PERFORMANCE OF k-ary -CUBE INTERCONNECTION NETWORKS 1019 ( b) Fig. 2. Multiple express chaels allow wire desity to be icreased to saturate the available wirig media. Express chaels ca be added usig either (a) iterchages with multiple express chaels, or (b) iterleaved simplex iterchages. be shared amog iterleaved express chaels, a ueve distributio of traffic may result i some chaels beig saturated while parallel chaels are idle. Method (b) has the advatage of usig simple iterchages ad allowig arbitrary expasio. I the extreme case of isertig a iterchage betwee every pair of odes the resultig topology is almost the same as the topology that would result from doublig the umber of dimesios. Both of the methods illustrated i Fig. 2 have the effect of icreasig the wire desity (ad bisectio) by a factor of m + 1. To first order, etwork throughput will icrease by a similar amout. There will be some degradatio due to ueve loadig of parallel chaels. The use of multiple express chaels offsets the load imbalace betwee express ad local chaels. If traffic is uiformly distributed, the average fractio of messages crossig a poit i the ceter of the etwork o a local chael is fo = 2i/k as compared to f1 = (k - 2i)/IC crossig o a express chael. For large etworks where IC >> i, the bulk of the traffic is o express chaels. Icreasig the umber of express chaels applies more of the etwork badwidth where it is most eeded. The issue of allocatig multiple express chaels is discussed further i Sectio 111-E. Multiple express chaels are a effective method of icreasig throughput i etworks where the chael width is limited by piout costraits. For example, i the J-Machie the chael width W = 9 is set by pi limitatio^.^ The prited-circuit board techology is capable of ruig W = 80 wires i each dimesio across the 50 mm width of a ode. Eve with may of these wires used for local coectios, four parallel 15-bit (data + cotrol) wide chaels ca be easily ru across each ode. A multiple express chael etwork with m = 3 could use this available wire desity to quadruple the throughput of the etwork. C. Hierarchical Express Cubes Have Logarithmic Node Delay With a sigle level of express chaels, a average of i local chaels are traversed by each olocal message. The ode delay o these local chaels represets a sigificat compoet of latecy ad causes etworks with short distaces, D 5 a2, 4Each J-Machie ode is packaged i a 168-pi pi-grid array. The six commuicatio chaels each require 9 data bits ad six cotrol bits cosumig 90 of these pis. Power coectios use 48 pis. The remaiig 30 pis are used by exteral memory iterface ad cotrol to be ode limited. Hierarchical express cubes overcome this limitatio by usig several levels of express chaels to make ode delay icrease logarithmically with distace for short distaces. The use of hierarchical express chaels, show i Fig. 3, reduces the latecy due to ode delay o local chaels. With hierarchical express chaels, there are 1 levels of iterchages. A first-level iterchage is iserted every i odes. A secod-level iterchage replaces every ith first level iterchage, every i2 odes. I geeral, a jth level iterchage replaces every ith j - 1st level iterchage, every i j odes.' Fig. 3 illustrates a hierarchical express cube with i = 2, 1 = 2. A jth level iterchage has j + 1 iputs ad j + 1 outputs. Arrivig messages are treated idetically regardless of the iput o which they arrive. Messages that are destied for oe of the ext i odes are routed to the local (0th) output. Those remaiig messages that are destied for oe of the ext i2 odes are routed to the 1st output. The process cotiues with all messages with a destiatio betwee ip ad ip+' odes away, 0 5 p 5 j - 1, routed to the pth output. All remaiig messages are routed to the jth output. A message i a hierarchical express cube is delivered i three phases: ascet, cruise, ad descet. I the ascet phase, a average message travels (i + 1)/2 hops to get to the first iterchage, ad (i - 1)/2 hops at each level for a total of H, = (i - 1)/2 + 1 hops ad a distace of D, = (2' - 1)/2. Durig the cruise phase, a message travels H, = [(D - Da)/i'J hops o level 1 chaels for a distace of D, = i'hc. Fially, the message desceds back through the levels routig o each level, j, as log as the remaiig distace is greater tha ij. For the special case where i' ID, the descedig message takes Hd = (i - 1)1/2 + 1 hops for a distace of Dd = (2' + 1)/2. This gives a latecy of D LTP T,= ( +(i-1)1+1 T+TwD+ -. W (5) Choosig a ad 1 so that a' = a balaces ode ad wire delay for large distaces. With this choice, the delay due to local odes is (i- 1) 1 T = (i- 1) log, at. Give that i is a iteger greater tha uity, this expressio is miimized for i = 2. Choosig i to be a power of two facilitates decodig of biary addresses i iterchages. Networks with i = 4, i = 8, ad i = 16 may be desirable uder some circumstaces. I the geeral case, iz + D, the latecy of a hierarchical express cube is calculated by represetig the source ad destiatio coordiates as h = log,k-digit radix4 umbers, S = Sh-1 1. SO, ad D = dhfl... do. Without loss of geerality we assume that S < D. Durig the ascet phase, a message routes from S to Sh-1.--s takig Ha = Ci=i ((2 - sj)mod i) hops for a distace of D, = Eizi((i - sj) mod i)ij. The cruise phase takes the message H, = E;:;' (dj- sj)ij-' hops for a distace of D, = Hci'. Fially, the descet phase takes the message from dh-1. + d10.o to D takig Hd = cizb dj hops for a distace of Dd = Eizi djij. 'This costructio yields a fixed-radix express cube, with radix i for each level. It is also possible to costruct mixed-radix express cubes where the radix varies from level to level.

5 1020 IEEE TRANSACTIONS ON COMPUTERS, VOL. 40, NO. 9, SEFEMBER 1991 Fig. 3. Hierarchical express chaels reduce latecy due to local routig. E Dlstace Delay vs. Dlstace for Express Cubes Fig. 4. Latecy as a fuctio of distace for a hierarchical express chael cube with i = 4, I = 3, a = 64, ad a flat express chael cube with i = 16, a = 64. I a hierarchical express chael cube latecy is logarithmic for short distaces ad liear for log distaces. The crossover occurs betwee D = a ad D = ia log, a. The flat cube has liear delay domiated by T, for short distaces ad T, for log distaces. For short distaces the cruise phase will ever be reached. The message will move from ascet to descet as soo as it reaches a ode where all ozero coordiates agree with D. The total latecy for the geeral case is plotted as a fuctio of distace i Fig. 4. Fig. 5 shows how hierarchical iterchages ca be implemeted usig pi-bouded modules. A level-j iterchage requires j + 1 iputs ad outputs if implemeted as a sigle module as show for a third level iterchage i Fig. 5(a). A level-j iterchage ca be decomposed ito 2j - 1 level-oe iterchages as show for j = 2 i Fig. 5(b). A series of j - 1 ascedig iterchages that route olocal traffic toward higher levels is followed by a top-level iterchage ad a series of j - 1 descedig iterchages that allow local traffic to desced. With some degradatio i performace, the ascedig iterchages ca be elimiated as show i Fig. 5(c). This chage requires extra hops i some cases as a message caot skip levels o its way up to a high-level express chael. Each message must traverse at least oe level j - 1 chael before beig switched to a level-j chael. By restrictig messages to also travel o at least oe chael at each level as they desced, the descedig iterchages ca be elimiated as well leavig oly the sigle top-level iterchage as show i Fig. 5(d). D. Performace Compariso Fig. 4 shows how latecy varies with distace i hierar- chical ad flat express cubes ad compares these latecies to the latecy of a covetioal k-ary 1-cube ad of a direct wire. These curves assume that the message source is midway betwee two iterchages. The latecies are ormalized to uits of the wire delay betwee adjacet odes. The latecy of a covetioal k-ary 1-cube is liear with slope a while the latecy of a wire is liear with slope 1. For short distaces, util the first express chael is reached, a flat (ohierarchical) express cube has the same delay as a covetioal k-ary -cube, TD = ad. Oce the message begis travelig o express chaels, latecy icreases liearly with slope 1 + a/i. This occurs at distace D = 24 i the figure. There is a periodic variatio i delay aroud this asymptote due to the umber of local chaels beig traversed, Dlocal = (i + 1)/2 + ((D - i/2 + 1/2) mod i). The hierarchical express cube has a latecy that is logarithmic for short distaces ad liear for log distaces. The latecy of messages travelig a short distace, D < Q is ode limited ad icreases logarithmically with distace, TD M (i - 1) log, DT,. This delay is withi a factor of i - 1 of the best that ca be achieved with radix i switches. Log distace messages have a latecy of TD M (1 + a/iz)tw. If iz = a, this log distace latecy is approximately twice the latecy of a dedicated Mahatta wire. I a hierarchical etwork, the iterchage spacig i ca be made small, givig good performace for short distaces, without compromisig the delay of log distace messages which depeds o the ratio a/i'. I a flat etwork with a sigle parameter i, it is ot possible to simultaeously optimize performace for both short ad log distaces. E. Area Tradeoffs Assume that a ode has a cross-sectioal area that permits W wires to pass through i each dimesio. W of these wires are used for a local chael. The remaiig W - W wires are allocated as M = - 11 W-wire chaels sice a arrower chael will form a bottleeck that will slow other chaels. The M available chaels should be divided amog the levels i a hierarchical express cube i a maer that evely balaces the load. Assumig radom traffic, at each level j, from j = 0, local chaels, to j = 1-1, the fractio of traffic carried at level j o a chael ear the ceter of the machie is The fractio of traffic o the top-level chael is the remaider, To balace the load betwee levels of the express cube, mj chaels should be allocated to each level j of the cube i proportio to the fractio of traffic fj at that level. I practice, M is ot large eough to permit a exact balace. For example, i a cube with k = 64, i = 2, 1 = 3, ad radom traffic, the fractios, fo to f3, are , 0.125, 0.250, ad , (7)

6 DALLY EXPRESS CUBES: IMPROVING PERFORMANCE OF k-ary -CUBE INTERCONNECTION NETWORKS 1021 Fig. 5. Hierarchical iterchages. (a) A third-level iterchage. (b) A third-level iterchage implemeted from first-level iterchages. (c), (d) With a small performace pealty, ascedig ad/or descedig iterchages ca be elimiated. respectively. If A4 = 7, a reasoable allocatio to mo through m3 is 1, 1, 2, ad 4, respectively. If there is cosiderable locality i the traffic patter, more traffic will travel o the lower levels ad additioal chaels should be allocated to these levels. It is better to overallocate chaels to lower levels tha to higher levels because the lower-level chaels are more versatile. Log distace messages ca make use of lower level chaels with some icrease i latecy; however, local messages caot make progress o high-level chaels. F. Express Chaels i May Dimesios A multidimesioal express cube may be costructed by isertig iterchages ito each dimesio separately as show i Fig. 6(a). The figure shows part of a two-dimesioal express cube with i = 4, 1 = 1. Iterchages have bee iserted separately ito the X ad Y dimesios. A similar costructio ca be realized for higher dimesios ad for hierarchical etworks. With this approach iterchage picout is miimal as each iterchage hadles oly a sigle dimesio. Also, the desig is easy to package ito modules as the iterchages are located i regular rows ad colums. This approach has the disadvatage that messages must desced to local chaels to switch dimesios. A alterate costructio of a multidimesioal express cube is to iterleave multidimesioal iterchages ito the array as show i Fig. 6(b) for i = 4, 1 = 1. This approach allows messages o express chaels to chage dimesios without descedig to a local chael. It is particularly useful i etworks that use adaptive routig [14], [15] as it provides alterate paths at each level of the etwork. The iterleaved costructio has the disadvatages of requirig a higher iterchage pi cout ad beig more difficult to package ito modules. G. Modularity The iterchages i a express cube ca be used to chage wire desity, speed, ad sigalig levels at module boudaries as show i Fig. 7. Large etworks are built from may modules i a physical hierarchy. A typical hierarchy icludes itegrated circuits, prited circuit boards, chassis, ad cabiets. Available wire desity ad badwidth chage sigificatly betwee levels of the hierarchy. For example, a typical itegrated circuit has a wire desity of 250 wires/" per layer while a prited circuit board ca hadle oly 2 wires/" per layer.6 Iterchages placed at module boudaries as show 6This itegrated circuit wire desity is typical of first-level metal i a 1 pm CMOS process. The prited circuit wire desity is for a board with 8 mil wires ad spaces. Both desities assume all area is available for wirig. Fig. 6. A multidimesioal express cube may be costructed either by (a) isertig iterchages ito each dimesio separately, or (b) iterleavig multidimesioal iterchages ito the array. i Fig. 7 cdh be used to vary the umber ad width of express ad local chaels. These boudary iterchages may also covert iteral module sigalig levels ad speeds to levels ad speeds more,appropriate betwee modules. Usig express chaels ad boudary iterchages, the etwork ca be adjusted to saturate the available wirig desity eve though this desity is ot uiform across the packagig hierarchy. TO make use of the available badwidth, computatios ruig o the etwork must exploit locality. IV. INTERCHANGE DESIGN Fig. 8 shows the block diagram of a uidirectioal iterchage. A bidirectioal iterchage icludes a idetical circuit i the opposite directio. The basic desig is similar to that of a router [17], [6], [3]. Two iput latches hold arrivig flits ad two output latches hold departig flits. If additioal bufferig is desired, ay of these latches may be replaced by a FIFO buffer. If a phit is a differet size tha a flit,

7 1022 IEEE TRANSACTIONS ON COMPUTERS, VOL. 40, NO. 9, SEPTEMBER A -I I - l l 1 ; : ' 1, I, AI I ) I I Le Fig. 7. Iterchages allow wire desity, speed, ad sigalig levels to be chaged at module boudaries. --c IN REG MUX ) d OUT REG Fig. 8. Block diagram of a iterchage. lbo multiplexors perform switchig betwee iput ad output registers based o a compariso of the high address bits i a message header. multiplexig ad demultiplexig is required betwee the flit buffers ad the iterchage pis. Associated with each output latch is a multiplexor that selects which iput is routed to the latch. Routig decisios are made by comparig the address iformatio i the head flit(s) of the message to the local address. If the destiatio lies withi the ext i odes, the local chael is chose, otherwise the express chael is chose. If i is a power of two, iterchages are aliged, ad absolute addresses are used i headers, the compariso ca be made by checkig all but the 1 log, i least sigificat bits for equality to the local address. The iterchage state icludes presece bits for each register, a iput state for each iput, ad a output state for each output. The presece bits are used for flit-level flow cotrol. A flit is allowed to advace oly if the presece bit of its destiatio register is clear (o data preset), or if the register is to be emptied i the same cycle. The iput state bits hold the destiatio port ad status (empty, head, advacig, blocked) of the message curretly usig each iput. The output state cosists of a bit to idetify whether the output is busy ad a secod bit to idetify which iput has bee grated the output. The combiatioal logic to maitai these state bits ad cotrol the data path is straightforward. V. CONCLUSION Express cubes are k-ary -cubes augmeted by express chaels that provide a short path for olocal messages. A express cube retais the wire efficiecy of a covetioal k-ary -cube while providig improved latecy ad through- a latecy that is withi a small factor of the best that ca be achieved with a bouded degree etwork. For log distaces, the latecy ca be made arbitrarily close to that of a dedicated Mahatta wire. Multiple express chaels ca be used to icrease throughput to the limit of the available wire desity. The express cube combies the low diameter of multistage itercoectio etworks with the wire efficiecy ad ability to exploit locality of a low-dimesioal mesh etwork. The result is a etwork with latecy ad throughput that are withi a small factor of the physical limit. Express chaels are added to a k-ary -cube by periodically isertig iterchages ito each dimesio. NO modificatios are required to the routers i each processig ode; express chaels ca be added to most existig k-ary -cube etworks. Iterchages also allow wire desity, speed, ad sigalig levels to be chaged at module boudaries. A express cube ca make use of all available wire desity eve if the wire desity is ouiform. This is ofte required as the wire desity ad speed may chage sigificatly betwee levels of packagig. Express cubes achieve their performace at the cost of addig iterchages, icreasig the latecy for some shortdistace messages, ad icreasig the bisectio width of the etwork. Each iterchage adds a compoet to the system ad icreases the latecy of local messages that cross a iterchage but do ot take the express chael by oe ode delay, (T, + TW). Express chaels icrease the wire bisectio by usig available uused wirig capacity. I parts of the etwork that are already wire-limited the express ad local chaels ca be combied as show i Fig. 7. As the performace of itercoectio etworks approaches the limits of the uderlyig wirig media their rage of applicatio icreases. These etworks ca go beyod exchagig messages betwee the odes of cocurret computers to servig as a geeral itercoectio media for digital electroic systems. For distaces larger tha D' = ai log, cy, the delay of a hierarchical express cube etwork is withi a factor of three of that of a dedicated wire. The etwork may provide better performace tha the wire because it is able to share its wirig resources amog may paths i the etwork while a dedicated wire serves oly a sigle source ad destiatio. For distaces smaller tha D', dedicated wirig offers a sigificat latecy advatage at the cost of elimiatig resource sharig. ACKNOWLEDGMENT I thak C. Leiserso for poitig out the ode-limited ature of most real k-ary -cubes. Express cubes have bee strogly iflueced by his work o fat trees [13]. I thak S. Ward, A. Agarwal, ad T. Kight for may helpful commets ad suggestios about routig etworks ad their aalysis. I thak C. Seitz for may helpful suggestios about etworks, routers, ad cocurret computers. 1 thak A. Chie ad M. Noakes for their careful review of early drafts of this mauscript. Fially I thak all the members of the MIT Cocurret VLSI

8 DALLY EXPRESS CUBES: IMPROVING PERFORMANCE OF IC-ARY -CUBE 1NTERCO E~ION NETWORKS 1023 Architecture group for their help with ad cotributios to this paper. REFERENCES [l] W. C. Athas ad C. L. Seitz, Multicomputers: Message-passig cocurret computers, IEEE Corput. Mag., vol. 21, pp. 9-24, Aug [2] BBN Advaced Computers, Ic., Butterfly parallel processor overview, BBN Rep. 6148, Mar [3] W. J. Dally ad C. L. Seitz, The torus routig chip, J. Distributedsyst., vol. 1, o. 3, pp , [4] W. J. Dally,A VLSIArchitecture for CocurretData Structures. Higham, MA. Kluwer, [5] -, Wire efficiet VLSI multiprocessor commuicatio etworks, i Proc. Staford Co$ Advaced Res. VLSI, P. Loslebe, Ed. Cambridge, MA: MIT Press, Mar. 1987, pp [6] W.J. Dally ad P. Sog, Desig of a self-timed VLSI multicomputer commuicatio cotroller, i Proc. It. Co$ Comput. Desig, ICCD-87, 1987, pp [7] W. J. Dally et az., The J-Machie: A fie-grai cocurret computer, i Proc. IFIP Cogress, I81 W.J. Dally, The J-Machie: Svstem suooort for actors. i Actors: 1. Kowledge-Based Cocurret Coputig, Hewitt ad Agha, Eds. Cam- bridge, MA: MIT Press, 1991., Performace aalysis of k-ary -cube itercoectio etworks, IEEE Tras. Comput., vol. 39, pp , Jue 1990., Network ad processor architecture for message-drive computig, i ESI ad Parallel Processig, R. Suaya ad G. Birtwistle, Eds. Los Altos, CA Morga Kaufma, P. Kermai ad L. Kleirock, Virtual cut-through: A ew computer commuicatio switchig techique, Comput. Networks, vol. 3, pp , ~ 1121 D.H. Lawrie, AIiamet ad access of data i a arrav orocessor. IEEE Tras. Compit., vol. C-24, pp , Dec. 197;. [13] C. E. Leiserso, Fat-trees: Uiversal etworks for hardware-efficiet supercomputig, IEEE Tras. Comput., vol. C-34, pp , Oct [14] J. Mailhot, A comparative study of routig ad flow cotrol strategies i k-ary -cube etworks, S.B. thesis, Massachusetts Istit. of Techo]., May [15] J. Ngai, A framework for adaptive routig i multicomputer etworks, Ph.D. dissertatio, Caltech Computer Sciece Tech. Rep., Caltech-CS- TR-89-09, May [16] M. 0. Noakes ad W. J. Dally, System desig of the J-Machie, i Proc. Sixth MIT Co& Advaced Res. VLSI, MIT Press, 1990, pp [17] P. R. Nuth, Router protocol, MIT Cocurret VLSI Architecture Memo 23, Feb [18] C. L. Seitz, The Cosmic Cube, Commu. ACM, vol. 28, pp , Ja [19] C.L. Seitz et al., The architecture ad programmig of the Ametek Series 2010 Multicomputer, i Proc. Third Cofi Hypercube Cocurret Compu?. Appl., ACM, Ja. 1988, pp [20] C. L. Seitz et al., Submicro systems architecture project semiaual techical report, Caltech Computer Sciede Tech. Rep., Caltech-CS- TR-88-18, p. 2 ad pp , Nov [21] C.-L. Wu ad T. Feg, O a class of multistage itercoectio etworks, IEEE Tras. Comput., vol. C-29, pp , Aug William J. Dally (S 78-M 86) received the B.S. degree i electrical egieerig from Virgiia Polytechic Istitute, the M.S. degree i electrical egieerig from Staford Uiversity, ad the Ph.D. degree i computer sciece from Caltech. He has worked at Bell Telephoe Laboratories where he cotributed to desig of the BELLMAC- 32 microprocessor. Later as a cosultat to Bell Laboratories he helped desig the MARS hardware accelerator. He was a Research Assistat ad the a Research Fellow at Caltech where he desiged the MOSSIM Simulatio Egie ad the TONS Routig Chip. He is curretly a Associate Professor of Computer Sciece at the Massachusetts Istitute of Techology where he directs a research group that is buildig the J-Machie, a fie-grai cocurret computer. His research iterests iclude cocurret computig, computer architecture, computer-aided desig, ad VLSI desig.

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5

Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5 Morga Kaufma Publishers 26 February, 28 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Set-Associative Cache Architecture Performace Summary Whe CPU performace icreases: