I containing data and synchronization information between
|
|
- Shannon Jacobs
- 5 years ago
- Views:
Transcription
1 ~ ~ 1016 IEEE TRANSACTIONS ON COMPUTERS, VOL. 40, NO. 9, SEPTEMBER 1991 Express Cubes: Improvig the Performace of k-ary -cube Itercoectio Networks William J. Dally, Member, ZEEE Abstruct-Express cubes are k-ary -cube itercoectio etworks augmeted by express chaels that provide a short path for olocal messages. A express cube combies the logarithmic diameter of a multistage etwork with the wire-efficiecy ad ability to exploit locality of a low-dimesioal mesh etwork. The isertio of express chaels reduces the etwork diameter ad thus the distace compoet of etwork latecy. Wire legth is icreased allowig etworks to operate with latecies that approach the physical speed-of-light limitatio rather tha beig limited by ode delays. Express chaels icrease wire bisectio i a maer that allows the bisectio to be cotrolled idepedet of the choice of radix, dimesio, ad chael width. By icreasig wire bisectio to saturate the available wirig media, throughput ca be substatially icreased. With a express cube both latecy ad throughput are wire-limited ad withi a small factor of the physical limit o performace. Express chaels may be iserted ito existig itercoectio etworks usig iterchages. No chages to the local commuicatio cotrollers are required. Idex Terms-Commuicatio etworks, cocurret computig, itercoectio etworks, multicomputers, packet routig, packet switchig, parallel processig, topology. I. INTRODU~ON NTERCONNECTION etworks are used to pass messages I cotaiig data ad sychroizatio iformatio betwee the odes of cocurret computers [l], [2], [18], [19]. The messages may be set betwee the processig odes of a message-passig multicomputer [ 11 or betw'ee the processors ad memories of a shared-memory multiprocessor [2]. A itercoectio etwork is characterized by its topology, routig, ad flow cotrol [lo]. The topology of a etwork is the arragemet of its odes ad chaels ito a graph. Routig determies the path chose by a message i this graph. Flow cotrol deals with the allocatio of chael ad buffer resources to a message as it travels alog this path. This paper deals oly with topology. Express cubes ca be applied idepedet of routig ad flow cotrol strategies. The performace of a etwork is measured i terms of its latecy ad its throughput. The latecy of a message is the elapsed time from whe the message sed is iitiated util the message is completely received. Network latecy is the Mauscript received October 14, 1989; revised April 27, This work was supported i part by the Defese Advaced Research Projects Agecy uder Cotracts "14-88K-0738 ad N K-0825 ad i part by a Natioal Sciece Foudatio Presidetial Youg Ivestigator Award, Grat MIP , with matchig fuds from Geeral Electric Corporatio ad IBM Corporatio. The author is with the Artificial Itelligece Laboratory ad the Laboratory for Computer Sciece, Massachusetts Istitute of Techology, Camhridge, MA IEEE Log Number f&9340/91/ $ IEEE average message latecy uder specified coditios. Network throughput is the umber of messages the etwork ca deliver per uit time. Low-dimesioal k-ary -cube etworks usig wormhole routig have bee show to provide low latecy ad high throughput for etworks that are wire-limited [4], [5], [9]. For 3, the k-ary -cube topology is wire-efficiet i that it makes efficiet use of the available bisectio width. This topology maps ito the three physical dimesios i a maer that allows messages to use all of the available badwidth alog their path without ever havig to double back o themselves. Also, low-dimesioal k-ary -cubes cocetrate badwidth ito a few wide chaels so that the compoet of latecy due to message legth is reduced. I most cotemporary cocurret computers, this is the domiat compoet of latecy. Because of their low-latecy, high throughput, ad affiity for implemetatio i VLSI, these k-ary -cube etworks with = 2 or 3 have bee used successfully i the desig of several cocurret computers icludig the Ametek 2010 [19], the J-Machie [7], [8], ad the Mosaic [20]. However, low-dimesioal k-ary -cube itercoectio etworks have two sigificat shortcomigs: Because wires are short, ode delays domiate wire delays ad the distace related compoet of latecy falls more tha a order of magitude short of speed-of-light limitatios. I the J-Machie [7], for example, ode delay is 50 s while the logest wire is 225 mm ad has a time-of-flight delay of 1.5 s. The chael width of these etworks is ofte limited by ode pi cout rather tha by wire bisectio. For example, the J-Machie chael width is limited to 9-bits by pi cout limitatios. I the physical ode width of 50 mm, a six-layer prited circuit board ca hadle over four times this chael width after accoutig for through holes ad local coectios. I short, may regular k-ary -cube itercoectio etworks are ode-limited rather tha wire-limited. I these etworks, ode delay ad pi limitatios domiate wire delay ad wire desity limitatios. The ratios of ode delay to wire delays ad pi desity to wire desity caot be balaced i a regular k-ary -cube. Express cubes overcome this problem by allowig wire legth ad wire desity to be adjusted idepedetly of the choice of radix k, dimesio, ad chael width W. A express cube is a k-ary -cube augmeted by oe or more levels of express chaels that allow olocal messages to
2 DALLY EXPRESS CUBES: IMPROVING PERFORMANCE OF k-ary -CUBE INTERCO"E(JI1ON NETWORKS 1017 bypass odes. The wire legth of the express chaels ca be icreased to the poit that wire delays domiate ode delays. The umber of express chaels ca be adjusted to icrease throughput util the available wirig media is saturated. This ability to balace ode ad wire limitatios is achieved without sacrificig the wire-efficiecy of k-ary -cube etworks. The umber of chaels traversed by a message i a hierarchical express cube grows logarithmically with distace as i a multistage itercoectio etwork [12], [21]. The express cube, however, is able to exploit locality while i a multistage etwork all messages must traverse the diameter of the etwork. With a express cube, both latecy ad throughput are wire-limited ad are withi a small costat factor of the physical limit o performace. The remaider of this paper describes the express cube topology ad aalyzes its performace. Sectio I1 summarizes the otatio that will be used throughout the paper. Sectio 111 itroduces the express cube topology i steps. Basic express cubes (Sectio 111-A) reduce latecy to twice the delay of dedicated wire for messages travelig log distaces. Throughput ca be icreased to saturate the available wirig desity by addig multiple express chaels (Sectio 111-B). With a hierarchical express cube (Sectio 111-C), latecy for short distaces, while ode-limited, is withi a small costat factor of the best that ca be achieved by ay bouded degree etwork. Some desig cosideratios for express cube iterchages are discussed i Sectio IV. 11. NOTATION The followig symbols are used i this paper. They are listed here for referece. the set of chaels i the etwork. Mahatta distace traveled by a message, 12, - zd( + (y,- yd( + (z,- zdl, where the source is at (zs, ys, z,) ad the destiatio is at (zd, yd, zd). the fractio of traffic at level j i a hierarchical express cube. hops, the umber of odes traversed by a message. umber of odes betwee iterchages i a express cube. the radix of the etwork-the legth i each dimesio. the umber of levels of hierarchy i a hierarchical express cube. the message legth i bits. mi, the umber of multiple express chaels at level j. M, the umber of express chaels through each ode., the dimesio of the etwork. N, the set of odes i the etwork. Where it is uambiguous, N is also used for the umber of odes i the etwork, IN(. T,, T,, the latecy of a ode. the latecy of a wire that coects two physically adjacet odes. Tp, the pipelie period of a ode. W, the width of a chael i bits. W, the width of a ode-the umber of wires that may pass ito a ode i each dimesio. a, the ratio of ode latecy to wire latecy, T/T,. p, the ratio of chael width to ode width, WfW. A itercoectio etwork cosists of a set of odes N that are coected by a set of chaels, C N x N. Each chael is uidirectioal ad carries data from a source ode to a destiatio ode. For the purposes of this paper it is assumed that the etwork is bidirectioal: chaels occur i pairs so that (1,722) E C * (2,i) E C. Commuicatio betwee odes is performed by sedig messages. A message may be broke ito oe or more packets for trasmissio. A packet is the smallest uit that cotais routig ad sequecig iformatio. Packets cotai oe or more flow cotrol digits or flits. A flit is the smallest uit o which flow cotrol is performed. A flit i tur is composed of oe or more physical trasfer uits or phits.' A phit is W-bits, the size of the physical commuicatio media. The express cube topology is particularly suitable for use with wormhole routig, a flow-cotrol protocol that advaces each flit of a packet as soo as it arrives at a ode (pipeliig) ad blocks packets i place whe required resources are uavailable [4], [5], [9]. Wormhole routig is attractive i that 1) it reduces the latecy of message delivery compared to store ad forward routig, ad 2) it requires oly a few flit buffers per ode. Wormhole routig differs from virtual cutthrough routig [ll] i that with wormhole routig it is ot ecessary for a ode to allocate a etire packet buffer before acceptig each packet. This distictio reduces the amout of bufferig required o each ode makig it possible to build fast, iexpesive routers. The bisectio width of a etwork is the miimum umber of chaels that must be cut to partitio the etwork ito two equal halves. The wire bisectio is the umber of wires i this chael cutset. Bisectio width gives a lower boud o wire desity, the maximum umber of wires that must cross a uit distace (2-D) or area (3-D) EXPRESS CUBES A. Express Chaels Reduce Latecy Fig. 1 illustrates the applicatio of express chaels to a k-ary 1-cube or liear array. A regular k-ary 1-cube is show i Fig. l(a). The etwork is a liear array of k processig odes, labeled N, each coected to its earest eighbors by chaels of width W. The delay of a phit propagatig through a ode is T,. The delay of the wire coectig two odes is T,. Each chael ca accept a ew phit every Tp. The latecy of a message of legth L set distace D is L T, = HT, -k DT, -k - W Tp. 'There is o costrait that the physical uit of trasfer, phit, must be smaller tha the flow cotrol uit, flit. It is possible to costruct systems with several flits i each phit.
3 1018 IEEE TRANSACTIONS ON COMPUTERS, VOL. 40, NO. 9, SEPTEMBER 1991 ( b) Fig. 1. Isertio of express chaels reduces latecy: (a) A regular k-ary 1-cube etwork may be domiated by ode delay. (b) A Ic-ary 1-cube with express chaels reduces the ode delay compoet of latecy. Message latecy is composed of three compoets as show i (1).2 e first compoet is the ode latecy, due to the umber of hops H. The secod compoet is the wire latecy, due to the distace D. The third compoet is due to message legth L. For a covetioal k-ary -cube, H = D givig L T, = (T, + T,)D + - Tp. W For most etworks T, >> T, so the ode latecy domiates the wire latecy. Express cubes reduce the ode latecy by icreasig wire legth to reduce the umber of hops H. A express k-ary 1-cube is show i Fig. l(b). Express chaels have bee added to the array by isertig a iterchage, labeled I, every i odes. A iterchage is ot a processig ode. It performs oly commuicatio fuctios ad is ot assiged a address. Each iterchage is coected to its eighborig iterchages by a additioal chael of width W, the express chael. Whe a message arrives at a iterchage it is routed directly to the ext iterchage if it is ot destied for oe of the iterveig odes. To preserve the wire-efficiecy of the etwork, messages are ever routed past their destiatios o the express chaels eve though doig so would reduce H i may cases. The delay T,, ad throughput l/tp, of a iterchage are assumed to be idetical to those of a ode. The wire delay of the express chael is assumed to be it,. To simplify the followig aalysis, it is assumed that iterchages add o physical distace to the etwork. Assumig i I D, H = D/i + i ad isertio of express chaels reduces the latecy to I the geeral case, i 1 D, a average message traversig D processig odes travels over Hi = (i + 1)/2 local chaels to reach a iterchage, He = LD/i - 1/2 + 1/(2i)J express chaels to reach the last iterchage before the destiatio, ad fially Hf = (1 + (D - i/2-1/2) mod i) local chaels to the destiatio. The total umber of hops is H = Hi + He + Hf givig a latecy of Tb=(F + 1 D *Throughout this paper the term latecy is used to refer to the latecy of a sigle message i the absece of traffic. For a discussio of the effects of traffic o latecy see [5] ad 191. (2) (3) LTP + -. W For large distaces, D >> a = T,/T,, choosig i = Q balaces the ode ad wire delay. With this choice of i, the latecy due to distace is approximately twice the wire latecy, To M 2T,D. The latecy for large distaces of a express chael etwork with i = a is withi a factor of two of the latecy of a dedicated Mahatta wire betwee the source ad destiatio.3 For small distaces or large a, the i term i the coefficiet of T i (3) is sigificat ad ode delay domiates. For such etworks, latecy is miimized by choosig i = resultig i To E 2 ( - l)t. The use of hierarchical express chaels (Sectio 111-C) ca further improve the latecy for small distaces. B. Multiple Express Chaels Icrease Throughput to Saturate Wire Desity To first order, etwork throughput is proportioal to wire bisectio ad hece wire desity. If more wires are available to trasmit data across the etwork, throughput will be icreased provided that routig ad flow cotrol strategies are able to profitably schedule traffic oto these wires. May regular etwork topologies, such as low-dimesioal k-ary -cubes, are uable to make use of all available wire desity because of pi limitatios. The wire bisectio of a express cube ca be cotrolled idepedet of the choice of radix k, dimesio, or chael width W by addig multiple express chaels to the etwork to match etwork throughput with the available wirig desity W. Fig. 2 shows two methods of isertig multiple express chaels. Multiple express chaels may be hadled by each iterchage as show i Fig. 2(a). Alteratively, simplex iterchages ca be iterleaved as show i Fig. 2(b). I method (a), usig multiple chael iterchages, a iterchage is iserted every i odes as above ad each iterchage is coected to its eighbors usig m parallel express chaels. Fig. 2(a) shows a etwork with i = 4 ad m = 2. The iterchage acts as a cocetrator combiig messages arrivig o the m icomig express chaels with olocal messages arrivig o the local chael ad cocetratig these messages streams oto the m outgoig express chaels. This method has the advatage of makig better use of the express chaels sice ay message ca route o ay express chael. Flexibility i express chael assigmet is achieved at the expese of higher pi cout ad limited expasio. With method (b), iterleavig simplex iterchages, m simplex iterchages are iserted ito each group of i odes. Each iterchage is coected to the correspodig iterchage i the ext group by a sigle express chael. All messages from the odes immediately before a iterchage will be routed o that iterchage s express chaels. Because load caot 3There is othig special about the factor of two. By choosig i = ja the distace compoet of latecy will be (1 + l/j) times the latecy of a Mahatta wire. (4)
4 . I DALLY: EXPRESS CUBES: IMPROVING PERFORMANCE OF k-ary -CUBE INTERCONNECTION NETWORKS 1019 ( b) Fig. 2. Multiple express chaels allow wire desity to be icreased to saturate the available wirig media. Express chaels ca be added usig either (a) iterchages with multiple express chaels, or (b) iterleaved simplex iterchages. be shared amog iterleaved express chaels, a ueve distributio of traffic may result i some chaels beig saturated while parallel chaels are idle. Method (b) has the advatage of usig simple iterchages ad allowig arbitrary expasio. I the extreme case of isertig a iterchage betwee every pair of odes the resultig topology is almost the same as the topology that would result from doublig the umber of dimesios. Both of the methods illustrated i Fig. 2 have the effect of icreasig the wire desity (ad bisectio) by a factor of m + 1. To first order, etwork throughput will icrease by a similar amout. There will be some degradatio due to ueve loadig of parallel chaels. The use of multiple express chaels offsets the load imbalace betwee express ad local chaels. If traffic is uiformly distributed, the average fractio of messages crossig a poit i the ceter of the etwork o a local chael is fo = 2i/k as compared to f1 = (k - 2i)/IC crossig o a express chael. For large etworks where IC >> i, the bulk of the traffic is o express chaels. Icreasig the umber of express chaels applies more of the etwork badwidth where it is most eeded. The issue of allocatig multiple express chaels is discussed further i Sectio 111-E. Multiple express chaels are a effective method of icreasig throughput i etworks where the chael width is limited by piout costraits. For example, i the J-Machie the chael width W = 9 is set by pi limitatio^.^ The prited-circuit board techology is capable of ruig W = 80 wires i each dimesio across the 50 mm width of a ode. Eve with may of these wires used for local coectios, four parallel 15-bit (data + cotrol) wide chaels ca be easily ru across each ode. A multiple express chael etwork with m = 3 could use this available wire desity to quadruple the throughput of the etwork. C. Hierarchical Express Cubes Have Logarithmic Node Delay With a sigle level of express chaels, a average of i local chaels are traversed by each olocal message. The ode delay o these local chaels represets a sigificat compoet of latecy ad causes etworks with short distaces, D 5 a2, 4Each J-Machie ode is packaged i a 168-pi pi-grid array. The six commuicatio chaels each require 9 data bits ad six cotrol bits cosumig 90 of these pis. Power coectios use 48 pis. The remaiig 30 pis are used by exteral memory iterface ad cotrol to be ode limited. Hierarchical express cubes overcome this limitatio by usig several levels of express chaels to make ode delay icrease logarithmically with distace for short distaces. The use of hierarchical express chaels, show i Fig. 3, reduces the latecy due to ode delay o local chaels. With hierarchical express chaels, there are 1 levels of iterchages. A first-level iterchage is iserted every i odes. A secod-level iterchage replaces every ith first level iterchage, every i2 odes. I geeral, a jth level iterchage replaces every ith j - 1st level iterchage, every i j odes.' Fig. 3 illustrates a hierarchical express cube with i = 2, 1 = 2. A jth level iterchage has j + 1 iputs ad j + 1 outputs. Arrivig messages are treated idetically regardless of the iput o which they arrive. Messages that are destied for oe of the ext i odes are routed to the local (0th) output. Those remaiig messages that are destied for oe of the ext i2 odes are routed to the 1st output. The process cotiues with all messages with a destiatio betwee ip ad ip+' odes away, 0 5 p 5 j - 1, routed to the pth output. All remaiig messages are routed to the jth output. A message i a hierarchical express cube is delivered i three phases: ascet, cruise, ad descet. I the ascet phase, a average message travels (i + 1)/2 hops to get to the first iterchage, ad (i - 1)/2 hops at each level for a total of H, = (i - 1)/2 + 1 hops ad a distace of D, = (2' - 1)/2. Durig the cruise phase, a message travels H, = [(D - Da)/i'J hops o level 1 chaels for a distace of D, = i'hc. Fially, the message desceds back through the levels routig o each level, j, as log as the remaiig distace is greater tha ij. For the special case where i' ID, the descedig message takes Hd = (i - 1)1/2 + 1 hops for a distace of Dd = (2' + 1)/2. This gives a latecy of D LTP T,= ( +(i-1)1+1 T+TwD+ -. W (5) Choosig a ad 1 so that a' = a balaces ode ad wire delay for large distaces. With this choice, the delay due to local odes is (i- 1) 1 T = (i- 1) log, at. Give that i is a iteger greater tha uity, this expressio is miimized for i = 2. Choosig i to be a power of two facilitates decodig of biary addresses i iterchages. Networks with i = 4, i = 8, ad i = 16 may be desirable uder some circumstaces. I the geeral case, iz + D, the latecy of a hierarchical express cube is calculated by represetig the source ad destiatio coordiates as h = log,k-digit radix4 umbers, S = Sh-1 1. SO, ad D = dhfl... do. Without loss of geerality we assume that S < D. Durig the ascet phase, a message routes from S to Sh-1.--s takig Ha = Ci=i ((2 - sj)mod i) hops for a distace of D, = Eizi((i - sj) mod i)ij. The cruise phase takes the message H, = E;:;' (dj- sj)ij-' hops for a distace of D, = Hci'. Fially, the descet phase takes the message from dh-1. + d10.o to D takig Hd = cizb dj hops for a distace of Dd = Eizi djij. 'This costructio yields a fixed-radix express cube, with radix i for each level. It is also possible to costruct mixed-radix express cubes where the radix varies from level to level.
5 1020 IEEE TRANSACTIONS ON COMPUTERS, VOL. 40, NO. 9, SEFEMBER 1991 Fig. 3. Hierarchical express chaels reduce latecy due to local routig. E Dlstace Delay vs. Dlstace for Express Cubes Fig. 4. Latecy as a fuctio of distace for a hierarchical express chael cube with i = 4, I = 3, a = 64, ad a flat express chael cube with i = 16, a = 64. I a hierarchical express chael cube latecy is logarithmic for short distaces ad liear for log distaces. The crossover occurs betwee D = a ad D = ia log, a. The flat cube has liear delay domiated by T, for short distaces ad T, for log distaces. For short distaces the cruise phase will ever be reached. The message will move from ascet to descet as soo as it reaches a ode where all ozero coordiates agree with D. The total latecy for the geeral case is plotted as a fuctio of distace i Fig. 4. Fig. 5 shows how hierarchical iterchages ca be implemeted usig pi-bouded modules. A level-j iterchage requires j + 1 iputs ad outputs if implemeted as a sigle module as show for a third level iterchage i Fig. 5(a). A level-j iterchage ca be decomposed ito 2j - 1 level-oe iterchages as show for j = 2 i Fig. 5(b). A series of j - 1 ascedig iterchages that route olocal traffic toward higher levels is followed by a top-level iterchage ad a series of j - 1 descedig iterchages that allow local traffic to desced. With some degradatio i performace, the ascedig iterchages ca be elimiated as show i Fig. 5(c). This chage requires extra hops i some cases as a message caot skip levels o its way up to a high-level express chael. Each message must traverse at least oe level j - 1 chael before beig switched to a level-j chael. By restrictig messages to also travel o at least oe chael at each level as they desced, the descedig iterchages ca be elimiated as well leavig oly the sigle top-level iterchage as show i Fig. 5(d). D. Performace Compariso Fig. 4 shows how latecy varies with distace i hierar- chical ad flat express cubes ad compares these latecies to the latecy of a covetioal k-ary 1-cube ad of a direct wire. These curves assume that the message source is midway betwee two iterchages. The latecies are ormalized to uits of the wire delay betwee adjacet odes. The latecy of a covetioal k-ary 1-cube is liear with slope a while the latecy of a wire is liear with slope 1. For short distaces, util the first express chael is reached, a flat (ohierarchical) express cube has the same delay as a covetioal k-ary -cube, TD = ad. Oce the message begis travelig o express chaels, latecy icreases liearly with slope 1 + a/i. This occurs at distace D = 24 i the figure. There is a periodic variatio i delay aroud this asymptote due to the umber of local chaels beig traversed, Dlocal = (i + 1)/2 + ((D - i/2 + 1/2) mod i). The hierarchical express cube has a latecy that is logarithmic for short distaces ad liear for log distaces. The latecy of messages travelig a short distace, D < Q is ode limited ad icreases logarithmically with distace, TD M (i - 1) log, DT,. This delay is withi a factor of i - 1 of the best that ca be achieved with radix i switches. Log distace messages have a latecy of TD M (1 + a/iz)tw. If iz = a, this log distace latecy is approximately twice the latecy of a dedicated Mahatta wire. I a hierarchical etwork, the iterchage spacig i ca be made small, givig good performace for short distaces, without compromisig the delay of log distace messages which depeds o the ratio a/i'. I a flat etwork with a sigle parameter i, it is ot possible to simultaeously optimize performace for both short ad log distaces. E. Area Tradeoffs Assume that a ode has a cross-sectioal area that permits W wires to pass through i each dimesio. W of these wires are used for a local chael. The remaiig W - W wires are allocated as M = - 11 W-wire chaels sice a arrower chael will form a bottleeck that will slow other chaels. The M available chaels should be divided amog the levels i a hierarchical express cube i a maer that evely balaces the load. Assumig radom traffic, at each level j, from j = 0, local chaels, to j = 1-1, the fractio of traffic carried at level j o a chael ear the ceter of the machie is The fractio of traffic o the top-level chael is the remaider, To balace the load betwee levels of the express cube, mj chaels should be allocated to each level j of the cube i proportio to the fractio of traffic fj at that level. I practice, M is ot large eough to permit a exact balace. For example, i a cube with k = 64, i = 2, 1 = 3, ad radom traffic, the fractios, fo to f3, are , 0.125, 0.250, ad , (7)
6 DALLY EXPRESS CUBES: IMPROVING PERFORMANCE OF k-ary -CUBE INTERCONNECTION NETWORKS 1021 Fig. 5. Hierarchical iterchages. (a) A third-level iterchage. (b) A third-level iterchage implemeted from first-level iterchages. (c), (d) With a small performace pealty, ascedig ad/or descedig iterchages ca be elimiated. respectively. If A4 = 7, a reasoable allocatio to mo through m3 is 1, 1, 2, ad 4, respectively. If there is cosiderable locality i the traffic patter, more traffic will travel o the lower levels ad additioal chaels should be allocated to these levels. It is better to overallocate chaels to lower levels tha to higher levels because the lower-level chaels are more versatile. Log distace messages ca make use of lower level chaels with some icrease i latecy; however, local messages caot make progress o high-level chaels. F. Express Chaels i May Dimesios A multidimesioal express cube may be costructed by isertig iterchages ito each dimesio separately as show i Fig. 6(a). The figure shows part of a two-dimesioal express cube with i = 4, 1 = 1. Iterchages have bee iserted separately ito the X ad Y dimesios. A similar costructio ca be realized for higher dimesios ad for hierarchical etworks. With this approach iterchage picout is miimal as each iterchage hadles oly a sigle dimesio. Also, the desig is easy to package ito modules as the iterchages are located i regular rows ad colums. This approach has the disadvatage that messages must desced to local chaels to switch dimesios. A alterate costructio of a multidimesioal express cube is to iterleave multidimesioal iterchages ito the array as show i Fig. 6(b) for i = 4, 1 = 1. This approach allows messages o express chaels to chage dimesios without descedig to a local chael. It is particularly useful i etworks that use adaptive routig [14], [15] as it provides alterate paths at each level of the etwork. The iterleaved costructio has the disadvatages of requirig a higher iterchage pi cout ad beig more difficult to package ito modules. G. Modularity The iterchages i a express cube ca be used to chage wire desity, speed, ad sigalig levels at module boudaries as show i Fig. 7. Large etworks are built from may modules i a physical hierarchy. A typical hierarchy icludes itegrated circuits, prited circuit boards, chassis, ad cabiets. Available wire desity ad badwidth chage sigificatly betwee levels of the hierarchy. For example, a typical itegrated circuit has a wire desity of 250 wires/" per layer while a prited circuit board ca hadle oly 2 wires/" per layer.6 Iterchages placed at module boudaries as show 6This itegrated circuit wire desity is typical of first-level metal i a 1 pm CMOS process. The prited circuit wire desity is for a board with 8 mil wires ad spaces. Both desities assume all area is available for wirig. Fig. 6. A multidimesioal express cube may be costructed either by (a) isertig iterchages ito each dimesio separately, or (b) iterleavig multidimesioal iterchages ito the array. i Fig. 7 cdh be used to vary the umber ad width of express ad local chaels. These boudary iterchages may also covert iteral module sigalig levels ad speeds to levels ad speeds more,appropriate betwee modules. Usig express chaels ad boudary iterchages, the etwork ca be adjusted to saturate the available wirig desity eve though this desity is ot uiform across the packagig hierarchy. TO make use of the available badwidth, computatios ruig o the etwork must exploit locality. IV. INTERCHANGE DESIGN Fig. 8 shows the block diagram of a uidirectioal iterchage. A bidirectioal iterchage icludes a idetical circuit i the opposite directio. The basic desig is similar to that of a router [17], [6], [3]. Two iput latches hold arrivig flits ad two output latches hold departig flits. If additioal bufferig is desired, ay of these latches may be replaced by a FIFO buffer. If a phit is a differet size tha a flit,
7 1022 IEEE TRANSACTIONS ON COMPUTERS, VOL. 40, NO. 9, SEPTEMBER A -I I - l l 1 ; : ' 1, I, AI I ) I I Le Fig. 7. Iterchages allow wire desity, speed, ad sigalig levels to be chaged at module boudaries. --c IN REG MUX ) d OUT REG Fig. 8. Block diagram of a iterchage. lbo multiplexors perform switchig betwee iput ad output registers based o a compariso of the high address bits i a message header. multiplexig ad demultiplexig is required betwee the flit buffers ad the iterchage pis. Associated with each output latch is a multiplexor that selects which iput is routed to the latch. Routig decisios are made by comparig the address iformatio i the head flit(s) of the message to the local address. If the destiatio lies withi the ext i odes, the local chael is chose, otherwise the express chael is chose. If i is a power of two, iterchages are aliged, ad absolute addresses are used i headers, the compariso ca be made by checkig all but the 1 log, i least sigificat bits for equality to the local address. The iterchage state icludes presece bits for each register, a iput state for each iput, ad a output state for each output. The presece bits are used for flit-level flow cotrol. A flit is allowed to advace oly if the presece bit of its destiatio register is clear (o data preset), or if the register is to be emptied i the same cycle. The iput state bits hold the destiatio port ad status (empty, head, advacig, blocked) of the message curretly usig each iput. The output state cosists of a bit to idetify whether the output is busy ad a secod bit to idetify which iput has bee grated the output. The combiatioal logic to maitai these state bits ad cotrol the data path is straightforward. V. CONCLUSION Express cubes are k-ary -cubes augmeted by express chaels that provide a short path for olocal messages. A express cube retais the wire efficiecy of a covetioal k-ary -cube while providig improved latecy ad through- a latecy that is withi a small factor of the best that ca be achieved with a bouded degree etwork. For log distaces, the latecy ca be made arbitrarily close to that of a dedicated Mahatta wire. Multiple express chaels ca be used to icrease throughput to the limit of the available wire desity. The express cube combies the low diameter of multistage itercoectio etworks with the wire efficiecy ad ability to exploit locality of a low-dimesioal mesh etwork. The result is a etwork with latecy ad throughput that are withi a small factor of the physical limit. Express chaels are added to a k-ary -cube by periodically isertig iterchages ito each dimesio. NO modificatios are required to the routers i each processig ode; express chaels ca be added to most existig k-ary -cube etworks. Iterchages also allow wire desity, speed, ad sigalig levels to be chaged at module boudaries. A express cube ca make use of all available wire desity eve if the wire desity is ouiform. This is ofte required as the wire desity ad speed may chage sigificatly betwee levels of packagig. Express cubes achieve their performace at the cost of addig iterchages, icreasig the latecy for some shortdistace messages, ad icreasig the bisectio width of the etwork. Each iterchage adds a compoet to the system ad icreases the latecy of local messages that cross a iterchage but do ot take the express chael by oe ode delay, (T, + TW). Express chaels icrease the wire bisectio by usig available uused wirig capacity. I parts of the etwork that are already wire-limited the express ad local chaels ca be combied as show i Fig. 7. As the performace of itercoectio etworks approaches the limits of the uderlyig wirig media their rage of applicatio icreases. These etworks ca go beyod exchagig messages betwee the odes of cocurret computers to servig as a geeral itercoectio media for digital electroic systems. For distaces larger tha D' = ai log, cy, the delay of a hierarchical express cube etwork is withi a factor of three of that of a dedicated wire. The etwork may provide better performace tha the wire because it is able to share its wirig resources amog may paths i the etwork while a dedicated wire serves oly a sigle source ad destiatio. For distaces smaller tha D', dedicated wirig offers a sigificat latecy advatage at the cost of elimiatig resource sharig. ACKNOWLEDGMENT I thak C. Leiserso for poitig out the ode-limited ature of most real k-ary -cubes. Express cubes have bee strogly iflueced by his work o fat trees [13]. I thak S. Ward, A. Agarwal, ad T. Kight for may helpful commets ad suggestios about routig etworks ad their aalysis. I thak C. Seitz for may helpful suggestios about etworks, routers, ad cocurret computers. 1 thak A. Chie ad M. Noakes for their careful review of early drafts of this mauscript. Fially I thak all the members of the MIT Cocurret VLSI
8 DALLY EXPRESS CUBES: IMPROVING PERFORMANCE OF IC-ARY -CUBE 1NTERCO E~ION NETWORKS 1023 Architecture group for their help with ad cotributios to this paper. REFERENCES [l] W. C. Athas ad C. L. Seitz, Multicomputers: Message-passig cocurret computers, IEEE Corput. Mag., vol. 21, pp. 9-24, Aug [2] BBN Advaced Computers, Ic., Butterfly parallel processor overview, BBN Rep. 6148, Mar [3] W. J. Dally ad C. L. Seitz, The torus routig chip, J. Distributedsyst., vol. 1, o. 3, pp , [4] W. J. Dally,A VLSIArchitecture for CocurretData Structures. Higham, MA. Kluwer, [5] -, Wire efficiet VLSI multiprocessor commuicatio etworks, i Proc. Staford Co$ Advaced Res. VLSI, P. Loslebe, Ed. Cambridge, MA: MIT Press, Mar. 1987, pp [6] W.J. Dally ad P. Sog, Desig of a self-timed VLSI multicomputer commuicatio cotroller, i Proc. It. Co$ Comput. Desig, ICCD-87, 1987, pp [7] W. J. Dally et az., The J-Machie: A fie-grai cocurret computer, i Proc. IFIP Cogress, I81 W.J. Dally, The J-Machie: Svstem suooort for actors. i Actors: 1. Kowledge-Based Cocurret Coputig, Hewitt ad Agha, Eds. Cam- bridge, MA: MIT Press, 1991., Performace aalysis of k-ary -cube itercoectio etworks, IEEE Tras. Comput., vol. 39, pp , Jue 1990., Network ad processor architecture for message-drive computig, i ESI ad Parallel Processig, R. Suaya ad G. Birtwistle, Eds. Los Altos, CA Morga Kaufma, P. Kermai ad L. Kleirock, Virtual cut-through: A ew computer commuicatio switchig techique, Comput. Networks, vol. 3, pp , ~ 1121 D.H. Lawrie, AIiamet ad access of data i a arrav orocessor. IEEE Tras. Compit., vol. C-24, pp , Dec. 197;. [13] C. E. Leiserso, Fat-trees: Uiversal etworks for hardware-efficiet supercomputig, IEEE Tras. Comput., vol. C-34, pp , Oct [14] J. Mailhot, A comparative study of routig ad flow cotrol strategies i k-ary -cube etworks, S.B. thesis, Massachusetts Istit. of Techo]., May [15] J. Ngai, A framework for adaptive routig i multicomputer etworks, Ph.D. dissertatio, Caltech Computer Sciece Tech. Rep., Caltech-CS- TR-89-09, May [16] M. 0. Noakes ad W. J. Dally, System desig of the J-Machie, i Proc. Sixth MIT Co& Advaced Res. VLSI, MIT Press, 1990, pp [17] P. R. Nuth, Router protocol, MIT Cocurret VLSI Architecture Memo 23, Feb [18] C. L. Seitz, The Cosmic Cube, Commu. ACM, vol. 28, pp , Ja [19] C.L. Seitz et al., The architecture ad programmig of the Ametek Series 2010 Multicomputer, i Proc. Third Cofi Hypercube Cocurret Compu?. Appl., ACM, Ja. 1988, pp [20] C. L. Seitz et al., Submicro systems architecture project semiaual techical report, Caltech Computer Sciede Tech. Rep., Caltech-CS- TR-88-18, p. 2 ad pp , Nov [21] C.-L. Wu ad T. Feg, O a class of multistage itercoectio etworks, IEEE Tras. Comput., vol. C-29, pp , Aug William J. Dally (S 78-M 86) received the B.S. degree i electrical egieerig from Virgiia Polytechic Istitute, the M.S. degree i electrical egieerig from Staford Uiversity, ad the Ph.D. degree i computer sciece from Caltech. He has worked at Bell Telephoe Laboratories where he cotributed to desig of the BELLMAC- 32 microprocessor. Later as a cosultat to Bell Laboratories he helped desig the MARS hardware accelerator. He was a Research Assistat ad the a Research Fellow at Caltech where he desiged the MOSSIM Simulatio Egie ad the TONS Routig Chip. He is curretly a Associate Professor of Computer Sciece at the Massachusetts Istitute of Techology where he directs a research group that is buildig the J-Machie, a fie-grai cocurret computer. His research iterests iclude cocurret computig, computer architecture, computer-aided desig, ad VLSI desig.
Morgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5
Morga Kaufma Publishers 26 February, 28 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Set-Associative Cache Architecture Performace Summary Whe CPU performace icreases:
More information1. SWITCHING FUNDAMENTALS
. SWITCING FUNDMENTLS Switchig is the provisio of a o-demad coectio betwee two ed poits. Two distict switchig techiques are employed i commuicatio etwors-- circuit switchig ad pacet switchig. Circuit switchig
More informationMultiprocessors. HPC Prof. Robert van Engelen
Multiprocessors Prof. Robert va Egele Overview The PMS model Shared memory multiprocessors Basic shared memory systems SMP, Multicore, ad COMA Distributed memory multicomputers MPP systems Network topologies
More informationThe Penta-S: A Scalable Crossbar Network for Distributed Shared Memory Multiprocessor Systems
The Peta-S: A Scalable Crossbar Network for Distributed Shared Memory Multiprocessor Systems Abdulkarim Ayyad Departmet of Computer Egieerig, Al-Quds Uiversity, Jerusalem, P.O. Box 20002 Tel: 02-2797024,
More informationCourse Site: Copyright 2012, Elsevier Inc. All rights reserved.
Course Site: http://cc.sjtu.edu.c/g2s/site/aca.html 1 Computer Architecture A Quatitative Approach, Fifth Editio Chapter 2 Memory Hierarchy Desig 2 Outlie Memory Hierarchy Cache Desig Basic Cache Optimizatios
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Part A Datapath Design
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter The Processor Part A path Desig Itroductio CPU performace factors Istructio cout Determied by ISA ad compiler. CPI ad
More informationMulti-Threading. Hyper-, Multi-, and Simultaneous Thread Execution
Multi-Threadig Hyper-, Multi-, ad Simultaeous Thread Executio 1 Performace To Date Icreasig processor performace Pipeliig. Brach predictio. Super-scalar executio. Out-of-order executio. Caches. Hyper-Threadig
More informationElementary Educational Computer
Chapter 5 Elemetary Educatioal Computer. Geeral structure of the Elemetary Educatioal Computer (EEC) The EEC coforms to the 5 uits structure defied by vo Neuma's model (.) All uits are preseted i a simplified
More informationImprovement of the Orthogonal Code Convolution Capabilities Using FPGA Implementation
Improvemet of the Orthogoal Code Covolutio Capabilities Usig FPGA Implemetatio Naima Kaabouch, Member, IEEE, Apara Dhirde, Member, IEEE, Saleh Faruque, Member, IEEE Departmet of Electrical Egieerig, Uiversity
More informationLecture 1: Introduction and Strassen s Algorithm
5-750: Graduate Algorithms Jauary 7, 08 Lecture : Itroductio ad Strasse s Algorithm Lecturer: Gary Miller Scribe: Robert Parker Itroductio Machie models I this class, we will primarily use the Radom Access
More informationOn Nonblocking Folded-Clos Networks in Computer Communication Environments
O Noblockig Folded-Clos Networks i Computer Commuicatio Eviromets Xi Yua Departmet of Computer Sciece, Florida State Uiversity, Tallahassee, FL 3306 xyua@cs.fsu.edu Abstract Folded-Clos etworks, also referred
More information3D Model Retrieval Method Based on Sample Prediction
20 Iteratioal Coferece o Computer Commuicatio ad Maagemet Proc.of CSIT vol.5 (20) (20) IACSIT Press, Sigapore 3D Model Retrieval Method Based o Sample Predictio Qigche Zhag, Ya Tag* School of Computer
More informationOn (K t e)-saturated Graphs
Noame mauscript No. (will be iserted by the editor O (K t e-saturated Graphs Jessica Fuller Roald J. Gould the date of receipt ad acceptace should be iserted later Abstract Give a graph H, we say a graph
More informationParallel Polygon Approximation Algorithm Targeted at Reconfigurable Multi-Ring Hardware
Parallel Polygo Approximatio Algorithm Targeted at Recofigurable Multi-Rig Hardware M. Arif Wai* ad Hamid R. Arabia** *Califoria State Uiversity Bakersfield, Califoria, USA **Uiversity of Georgia, Georgia,
More informationOnes Assignment Method for Solving Traveling Salesman Problem
Joural of mathematics ad computer sciece 0 (0), 58-65 Oes Assigmet Method for Solvig Travelig Salesma Problem Hadi Basirzadeh Departmet of Mathematics, Shahid Chamra Uiversity, Ahvaz, Ira Article history:
More informationCIS 121 Data Structures and Algorithms with Java Spring Stacks, Queues, and Heaps Monday, February 18 / Tuesday, February 19
CIS Data Structures ad Algorithms with Java Sprig 09 Stacks, Queues, ad Heaps Moday, February 8 / Tuesday, February 9 Stacks ad Queues Recall the stack ad queue ADTs (abstract data types from lecture.
More informationSwitch Construction CS
Switch Costructio CS 00 Workstatio-Based Aggregate badwidth /2 of the I/O bus badwidth capacity shared amog all hosts coected to switch example: Gbps bus ca support 5 x 00Mbps ports (i theory) I/O bus
More informationCSC 220: Computer Organization Unit 11 Basic Computer Organization and Design
College of Computer ad Iformatio Scieces Departmet of Computer Sciece CSC 220: Computer Orgaizatio Uit 11 Basic Computer Orgaizatio ad Desig 1 For the rest of the semester, we ll focus o computer architecture:
More informationOperating System Concepts. Operating System Concepts
Chapter 4: Mass-Storage Systems Logical Disk Structure Logical Disk Structure Disk Schedulig Disk Maagemet RAID Structure Disk drives are addressed as large -dimesioal arrays of logical blocks, where the
More informationAppendix D. Controller Implementation
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Appedix D Cotroller Implemetatio Cotroller Implemetatios Combiatioal logic (sigle-cycle); Fiite state machie (multi-cycle, pipelied);
More informationLecture 5. Counting Sort / Radix Sort
Lecture 5. Coutig Sort / Radix Sort T. H. Corme, C. E. Leiserso ad R. L. Rivest Itroductio to Algorithms, 3rd Editio, MIT Press, 2009 Sugkyukwa Uiversity Hyuseug Choo choo@skku.edu Copyright 2000-2018
More informationCIS 121 Data Structures and Algorithms with Java Spring Stacks and Queues Monday, February 12 / Tuesday, February 13
CIS Data Structures ad Algorithms with Java Sprig 08 Stacks ad Queues Moday, February / Tuesday, February Learig Goals Durig this lab, you will: Review stacks ad queues. Lear amortized ruig time aalysis
More informationCopyright 2016 Ramez Elmasri and Shamkant B. Navathe
Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe CHAPTER 18 Strategies for Query Processig Copyright 2016 Ramez Elmasri ad Shamkat B. Navathe Itroductio DBMS techiques to process a query Scaer idetifies
More informationSwitching Hardware. Spring 2018 CS 438 Staff, University of Illinois 1
Switchig Hardware Sprig 208 CS 438 Staff, Uiversity of Illiois Where are we? Uderstad Differet ways to move through a etwork (forwardig) Read sigs at each switch (datagram) Follow a kow path (virtual circuit)
More informationcondition w i B i S maximum u i
ecture 10 Dyamic Programmig 10.1 Kapsack Problem November 1, 2004 ecturer: Kamal Jai Notes: Tobias Holgers We are give a set of items U = {a 1, a 2,..., a }. Each item has a weight w i Z + ad a utility
More information6.854J / J Advanced Algorithms Fall 2008
MIT OpeCourseWare http://ocw.mit.edu 6.854J / 18.415J Advaced Algorithms Fall 2008 For iformatio about citig these materials or our Terms of Use, visit: http://ocw.mit.edu/terms. 18.415/6.854 Advaced Algorithms
More informationUNIVERSITY OF MORATUWA
UNIVERSITY OF MORATUWA FACULTY OF ENGINEERING DEPARTMENT OF COMPUTER SCIENCE & ENGINEERING B.Sc. Egieerig 2014 Itake Semester 2 Examiatio CS2052 COMPUTER ARCHITECTURE Time allowed: 2 Hours Jauary 2016
More informationIMP: Superposer Integrated Morphometrics Package Superposition Tool
IMP: Superposer Itegrated Morphometrics Package Superpositio Tool Programmig by: David Lieber ( 03) Caisius College 200 Mai St. Buffalo, NY 4208 Cocept by: H. David Sheets, Dept. of Physics, Caisius College
More informationCS 683: Advanced Design and Analysis of Algorithms
CS 683: Advaced Desig ad Aalysis of Algorithms Lecture 6, February 1, 2008 Lecturer: Joh Hopcroft Scribes: Shaomei Wu, Etha Feldma February 7, 2008 1 Threshold for k CNF Satisfiability I the previous lecture,
More informationAPPLICATION NOTE PACE1750AE BUILT-IN FUNCTIONS
APPLICATION NOTE PACE175AE BUILT-IN UNCTIONS About This Note This applicatio brief is iteded to explai ad demostrate the use of the special fuctios that are built ito the PACE175AE processor. These powerful
More informationMaster Informatics Eng. 2017/18. A.J.Proença. Memory Hierarchy. (most slides are borrowed) AJProença, Advanced Architectures, MiEI, UMinho, 2017/18 1
Advaced Architectures Master Iformatics Eg. 2017/18 A.J.Proeça Memory Hierarchy (most slides are borrowed) AJProeça, Advaced Architectures, MiEI, UMiho, 2017/18 1 Itroductio Programmers wat ulimited amouts
More informationOverview. Some Definitions. Some definitions. High Performance Computing Programming Paradigms and Scalability Part 2: High-Performance Networks
Overview High Performace Computig Programmig Paradigms ad Scalability Part : High-Performace Networks some defiitios static etwork topologies dyamic etwork topologies examples PD Dr. rer. at. habil. Ralf-Peter
More informationCreating Exact Bezier Representations of CST Shapes. David D. Marshall. California Polytechnic State University, San Luis Obispo, CA , USA
Creatig Exact Bezier Represetatios of CST Shapes David D. Marshall Califoria Polytechic State Uiversity, Sa Luis Obispo, CA 93407-035, USA The paper presets a method of expressig CST shapes pioeered by
More informationLecture 28: Data Link Layer
Automatic Repeat Request (ARQ) 2. Go ack N ARQ Although the Stop ad Wait ARQ is very simple, you ca easily show that it has very the low efficiecy. The low efficiecy comes from the fact that the trasmittig
More informationAn Improved Shuffled Frog-Leaping Algorithm for Knapsack Problem
A Improved Shuffled Frog-Leapig Algorithm for Kapsack Problem Zhoufag Li, Ya Zhou, ad Peg Cheg School of Iformatio Sciece ad Egieerig Hea Uiversity of Techology ZhegZhou, Chia lzhf1978@126.com Abstract.
More informationAdaptive Resource Allocation for Electric Environmental Pollution through the Control Network
Available olie at www.sciecedirect.com Eergy Procedia 6 (202) 60 64 202 Iteratioal Coferece o Future Eergy, Eviromet, ad Materials Adaptive Resource Allocatio for Electric Evirometal Pollutio through the
More informationLower Bounds for Sorting
Liear Sortig Topics Covered: Lower Bouds for Sortig Coutig Sort Radix Sort Bucket Sort Lower Bouds for Sortig Compariso vs. o-compariso sortig Decisio tree model Worst case lower boud Compariso Sortig
More information. Written in factored form it is easy to see that the roots are 2, 2, i,
CMPS A Itroductio to Programmig Programmig Assigmet 4 I this assigmet you will write a java program that determies the real roots of a polyomial that lie withi a specified rage. Recall that the roots (or
More informationMorgan Kaufmann Publishers 26 February, COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 5.
Morga Kaufma Publishers 26 February, 208 COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 5 Virtual Memory Review: The Memory Hierarchy Take advatage of the priciple
More informationAnalysis Metrics. Intro to Algorithm Analysis. Slides. 12. Alg Analysis. 12. Alg Analysis
Itro to Algorithm Aalysis Aalysis Metrics Slides. Table of Cotets. Aalysis Metrics 3. Exact Aalysis Rules 4. Simple Summatio 5. Summatio Formulas 6. Order of Magitude 7. Big-O otatio 8. Big-O Theorems
More informationAnnouncements. Reading. Project #4 is on the web. Homework #1. Midterm #2. Chapter 4 ( ) Note policy about project #3 missing components
Aoucemets Readig Chapter 4 (4.1-4.2) Project #4 is o the web ote policy about project #3 missig compoets Homework #1 Due 11/6/01 Chapter 6: 4, 12, 24, 37 Midterm #2 11/8/01 i class 1 Project #4 otes IPv6Iit,
More information1 Graph Sparsfication
CME 305: Discrete Mathematics ad Algorithms 1 Graph Sparsficatio I this sectio we discuss the approximatio of a graph G(V, E) by a sparse graph H(V, F ) o the same vertex set. I particular, we cosider
More informationPattern Recognition Systems Lab 1 Least Mean Squares
Patter Recogitio Systems Lab 1 Least Mea Squares 1. Objectives This laboratory work itroduces the OpeCV-based framework used throughout the course. I this assigmet a lie is fitted to a set of poits usig
More informationCOMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface. Chapter 4. The Processor. Single-Cycle Disadvantages & Advantages
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Iterface 5 th Editio Chapter 4 The Processor Pipeliig Sigle-Cycle Disadvatages & Advatages Clk Uses the clock cycle iefficietly the clock cycle must
More informationCMSC Computer Architecture Lecture 11: More Caches. Prof. Yanjing Li University of Chicago
CMSC 22200 Computer Architecture Lecture 11: More Caches Prof. Yajig Li Uiversity of Chicago Lecture Outlie Caches 2 Review Memory hierarchy Cache basics Locality priciples Spatial ad temporal How to access
More informationThe University of Adelaide, School of Computer Science 22 November Computer Architecture. A Quantitative Approach, Sixth Edition.
Computer Architecture A Quatitative Approach, Sixth Editio Chapter 2 Memory Hierarchy Desig 1 Itroductio Programmers wat ulimited amouts of memory with low latecy Fast memory techology is more expesive
More informationReliable Transmission. Spring 2018 CS 438 Staff - University of Illinois 1
Reliable Trasmissio Sprig 2018 CS 438 Staff - Uiversity of Illiois 1 Reliable Trasmissio Hello! My computer s ame is Alice. Alice Bob Hello! Alice. Sprig 2018 CS 438 Staff - Uiversity of Illiois 2 Reliable
More informationSorting in Linear Time. Data Structures and Algorithms Andrei Bulatov
Sortig i Liear Time Data Structures ad Algorithms Adrei Bulatov Algorithms Sortig i Liear Time 7-2 Compariso Sorts The oly test that all the algorithms we have cosidered so far is compariso The oly iformatio
More informationImproving Template Based Spike Detection
Improvig Template Based Spike Detectio Kirk Smith, Member - IEEE Portlad State Uiversity petra@ee.pdx.edu Abstract Template matchig algorithms like SSE, Covolutio ad Maximum Likelihood are well kow for
More informationTransitioning to BGP
Trasitioig to BGP ISP Workshops These materials are licesed uder the Creative Commos Attributio-NoCommercial 4.0 Iteratioal licese (http://creativecommos.org/liceses/by-c/4.0/) Last updated 24 th April
More informationCOMP Parallel Computing. PRAM (1): The PRAM model and complexity measures
COMP 633 - Parallel Computig Lecture 2 August 24, 2017 : The PRAM model ad complexity measures 1 First class summary This course is about parallel computig to achieve high-er performace o idividual problems
More informationAdaptive Graph Partitioning Wireless Protocol S. L. Ng 1, P. M. Geethakumari 1, S. Zhou 2, and W. J. Dewar 1 1
Adaptive Graph Partitioig Wireless Protocol S. L. Ng 1, P. M. Geethakumari 1, S. Zhou 2, ad W. J. Dewar 1 1 School of Electrical Egieerig Uiversity of New South Wales, Australia 2 Divisio of Radiophysics
More informationFundamentals of Media Processing. Shin'ichi Satoh Kazuya Kodama Hiroshi Mo Duy-Dinh Le
Fudametals of Media Processig Shi'ichi Satoh Kazuya Kodama Hiroshi Mo Duy-Dih Le Today's topics Noparametric Methods Parze Widow k-nearest Neighbor Estimatio Clusterig Techiques k-meas Agglomerative Hierarchical
More informationAlpha Individual Solutions MAΘ National Convention 2013
Alpha Idividual Solutios MAΘ Natioal Covetio 0 Aswers:. D. A. C 4. D 5. C 6. B 7. A 8. C 9. D 0. B. B. A. D 4. C 5. A 6. C 7. B 8. A 9. A 0. C. E. B. D 4. C 5. A 6. D 7. B 8. C 9. D 0. B TB. 570 TB. 5
More informationLecture Notes 6 Introduction to algorithm analysis CSS 501 Data Structures and Object-Oriented Programming
Lecture Notes 6 Itroductio to algorithm aalysis CSS 501 Data Structures ad Object-Orieted Programmig Readig for this lecture: Carrao, Chapter 10 To be covered i this lecture: Itroductio to algorithm aalysis
More informationCMSC Computer Architecture Lecture 12: Virtual Memory. Prof. Yanjing Li University of Chicago
CMSC 22200 Computer Architecture Lecture 12: Virtual Memory Prof. Yajig Li Uiversity of Chicago A System with Physical Memory Oly Examples: most Cray machies early PCs Memory early all embedded systems
More informationGraphs. Minimum Spanning Trees. Slides by Rose Hoberman (CMU)
Graphs Miimum Spaig Trees Slides by Rose Hoberma (CMU) Problem: Layig Telephoe Wire Cetral office 2 Wirig: Naïve Approach Cetral office Expesive! 3 Wirig: Better Approach Cetral office Miimize the total
More informationThe Counterchanged Crossed Cube Interconnection Network and Its Topology Properties
WSEAS TRANSACTIONS o COMMUNICATIONS Wag Xiyag The Couterchaged Crossed Cube Itercoectio Network ad Its Topology Properties WANG XINYANG School of Computer Sciece ad Egieerig South Chia Uiversity of Techology
More informationData Structures and Algorithms. Analysis of Algorithms
Data Structures ad Algorithms Aalysis of Algorithms Outlie Ruig time Pseudo-code Big-oh otatio Big-theta otatio Big-omega otatio Asymptotic algorithm aalysis Aalysis of Algorithms Iput Algorithm Output
More informationLoad balanced Parallel Prime Number Generator with Sieve of Eratosthenes on Cluster Computers *
Load balaced Parallel Prime umber Geerator with Sieve of Eratosthees o luster omputers * Soowook Hwag*, Kyusik hug**, ad Dogseug Kim* *Departmet of Electrical Egieerig Korea Uiversity Seoul, -, Rep. of
More informationAlgorithms for Disk Covering Problems with the Most Points
Algorithms for Disk Coverig Problems with the Most Poits Bi Xiao Departmet of Computig Hog Kog Polytechic Uiversity Hug Hom, Kowloo, Hog Kog csbxiao@comp.polyu.edu.hk Qigfeg Zhuge, Yi He, Zili Shao, Edwi
More informationAPPLICATION NOTE. Automated Gain Flattening. 1. Experimental Setup. Scope and Overview
APPLICATION NOTE Automated Gai Flatteig Scope ad Overview A flat optical power spectrum is essetial for optical telecommuicatio sigals. This stems from a eed to balace the chael powers across large distaces.
More informationDESIGN AND ANALYSIS OF LDPC DECODERS FOR SOFTWARE DEFINED RADIO
DESIGN AND ANALYSIS OF LDPC DECODERS FOR SOFTWARE DEFINED RADIO Sagwo Seo, Trevor Mudge Advaced Computer Architecture Laboratory Uiversity of Michiga at A Arbor {swseo, tm}@umich.edu Yumig Zhu, Chaitali
More informationThe isoperimetric problem on the hypercube
The isoperimetric problem o the hypercube Prepared by: Steve Butler November 2, 2005 1 The isoperimetric problem We will cosider the -dimesioal hypercube Q Recall that the hypercube Q is a graph whose
More informationThroughput-Delay Scaling in Wireless Networks with Constant-Size Packets
Throughput-Delay Scalig i Wireless Networks with Costat-Size Packets Abbas El Gamal, James Mamme, Balaji Prabhakar, Devavrat Shah Departmets of EE ad CS Staford Uiversity, CA 94305 Email: {abbas, jmamme,
More informationA Study on the Performance of Cholesky-Factorization using MPI
A Study o the Performace of Cholesky-Factorizatio usig MPI Ha S. Kim Scott B. Bade Departmet of Computer Sciece ad Egieerig Uiversity of Califoria Sa Diego {hskim, bade}@cs.ucsd.edu Abstract Cholesky-factorizatio
More informationPerformance Plus Software Parameter Definitions
Performace Plus+ Software Parameter Defiitios/ Performace Plus Software Parameter Defiitios Chapma Techical Note-TG-5 paramete.doc ev-0-03 Performace Plus+ Software Parameter Defiitios/2 Backgroud ad Defiitios
More informationRedundancy Allocation for Series Parallel Systems with Multiple Constraints and Sensitivity Analysis
IOSR Joural of Egieerig Redudacy Allocatio for Series Parallel Systems with Multiple Costraits ad Sesitivity Aalysis S. V. Suresh Babu, D.Maheswar 2, G. Ragaath 3 Y.Viaya Kumar d G.Sakaraiah e (Mechaical
More informationAn Efficient Algorithm for Graph Bisection of Triangularizations
A Efficiet Algorithm for Graph Bisectio of Triagularizatios Gerold Jäger Departmet of Computer Sciece Washigto Uiversity Campus Box 1045 Oe Brookigs Drive St. Louis, Missouri 63130-4899, USA jaegerg@cse.wustl.edu
More informationAnalysis of Algorithms
Aalysis of Algorithms Ruig Time of a algorithm Ruig Time Upper Bouds Lower Bouds Examples Mathematical facts Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite
More informationFast Fourier Transform (FFT) Algorithms
Fast Fourier Trasform FFT Algorithms Relatio to the z-trasform elsewhere, ozero, z x z X x [ ] 2 ~ elsewhere,, ~ e j x X x x π j e z z X X π 2 ~ The DFS X represets evely spaced samples of the z- trasform
More informationAnalysis of Algorithms
Presetatio for use with the textbook, Algorithm Desig ad Applicatios, by M. T. Goodrich ad R. Tamassia, Wiley, 2015 Aalysis of Algorithms Iput 2015 Goodrich ad Tamassia Algorithm Aalysis of Algorithms
More informationMedia Access Protocols. Spring 2018 CS 438 Staff, University of Illinois 1
Media Access Protocols Sprig 2018 CS 438 Staff, Uiversity of Illiois 1 Where are We? you are here 00010001 11001001 00011101 A midterm is here Sprig 2018 CS 438 Staff, Uiversity of Illiois 2 Multiple Access
More informationGPUMP: a Multiple-Precision Integer Library for GPUs
GPUMP: a Multiple-Precisio Iteger Library for GPUs Kaiyog Zhao ad Xiaowe Chu Departmet of Computer Sciece, Hog Kog Baptist Uiversity Hog Kog, P. R. Chia Email: {kyzhao, chxw}@comp.hkbu.edu.hk Abstract
More informationProperties and Embeddings of Interconnection Networks Based on the Hexcube
JOURNAL OF INFORMATION PROPERTIES SCIENCE AND AND ENGINEERING EMBEDDINGS OF 16, THE 81-95 HEXCUBE (2000) 81 Short Paper Properties ad Embeddigs of Itercoectio Networks Based o the Hexcube JUNG-SING JWO,
More informationLecturers: Sanjam Garg and Prasad Raghavendra Feb 21, Midterm 1 Solutions
U.C. Berkeley CS170 : Algorithms Midterm 1 Solutios Lecturers: Sajam Garg ad Prasad Raghavedra Feb 1, 017 Midterm 1 Solutios 1. (4 poits) For the directed graph below, fid all the strogly coected compoets
More informationCS2410 Computer Architecture. Flynn s Taxonomy
CS2410 Computer Architecture Dept. of Computer Sciece Uiversity of Pittsburgh http://www.cs.pitt.edu/~melhem/courses/2410p/idex.html 1 Fly s Taxoomy SISD Sigle istructio stream Sigle data stream (SIMD)
More informationPseudocode ( 1.1) Analysis of Algorithms. Primitive Operations. Pseudocode Details. Running Time ( 1.1) Estimating performance
Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Pseudocode ( 1.1) High-level descriptio of a algorithm More structured
More informationStone Images Retrieval Based on Color Histogram
Stoe Images Retrieval Based o Color Histogram Qiag Zhao, Jie Yag, Jigyi Yag, Hogxig Liu School of Iformatio Egieerig, Wuha Uiversity of Techology Wuha, Chia Abstract Stoe images color features are chose
More informationImage Segmentation EEE 508
Image Segmetatio Objective: to determie (etract) object boudaries. It is a process of partitioig a image ito distict regios by groupig together eighborig piels based o some predefied similarity criterio.
More informationΣ P(i) ( depth T (K i ) + 1),
EECS 3101 York Uiversity Istructor: Ady Mirzaia DYNAMIC PROGRAMMING: OPIMAL SAIC BINARY SEARCH REES his lecture ote describes a applicatio of the dyamic programmig paradigm o computig the optimal static
More informationAn Efficient Algorithm for Graph Bisection of Triangularizations
Applied Mathematical Scieces, Vol. 1, 2007, o. 25, 1203-1215 A Efficiet Algorithm for Graph Bisectio of Triagularizatios Gerold Jäger Departmet of Computer Sciece Washigto Uiversity Campus Box 1045, Oe
More informationFAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS
SIAM J. SCI. COMPUT. Vol. 22, No. 6, pp. 2113 2134 c 21 Society for Idustrial ad Applied Mathematics FAST BIT-REVERSALS ON UNIPROCESSORS AND SHARED-MEMORY MULTIPROCESSORS ZHAO ZHANG AND XIAODONG ZHANG
More informationChapter 3 Classification of FFT Processor Algorithms
Chapter Classificatio of FFT Processor Algorithms The computatioal complexity of the Discrete Fourier trasform (DFT) is very high. It requires () 2 complex multiplicatios ad () complex additios [5]. As
More informationRunning Time. Analysis of Algorithms. Experimental Studies. Limitations of Experiments
Ruig Time Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects. The
More informationEE260: Digital Design, Spring /16/18. n Example: m 0 (=x 1 x 2 ) is adjacent to m 1 (=x 1 x 2 ) and m 2 (=x 1 x 2 ) but NOT m 3 (=x 1 x 2 )
EE26: Digital Desig, Sprig 28 3/6/8 EE 26: Itroductio to Digital Desig Combiatioal Datapath Yao Zheg Departmet of Electrical Egieerig Uiversity of Hawaiʻi at Māoa Combiatioal Logic Blocks Multiplexer Ecoders/Decoders
More informationDLECTE fl * IELECTE! ,- D& APPROVED FOR ... PUBLIC DISTRIBUTION. N VLSI Memo No October 1989
APPROVED FOR... PUBLIC DISTRIBUTION MASSACHUSETTS INTITUTE OF TECHNOLOGY D T IC VLSI PUBLICATIONS IELECTE! N VLSI Memo No. 89-564 3 October 1989,- D& DLECTE fl Express Cubes: Improving the Performance
More informationMassachusetts Institute of Technology Lecture : Theory of Parallel Systems Feb. 25, Lecture 6: List contraction, tree contraction, and
Massachusetts Istitute of Techology Lecture.89: Theory of Parallel Systems Feb. 5, 997 Professor Charles E. Leiserso Scribe: Guag-Ie Cheg Lecture : List cotractio, tree cotractio, ad symmetry breakig Work-eciet
More informationBezier curves. Figure 2 shows cubic Bezier curves for various control points. In a Bezier curve, only
Edited: Yeh-Liag Hsu (998--; recommeded: Yeh-Liag Hsu (--9; last updated: Yeh-Liag Hsu (9--7. Note: This is the course material for ME55 Geometric modelig ad computer graphics, Yua Ze Uiversity. art of
More informationExact Minimum Lower Bound Algorithm for Traveling Salesman Problem
Exact Miimum Lower Boud Algorithm for Travelig Salesma Problem Mohamed Eleiche GeoTiba Systems mohamed.eleiche@gmail.com Abstract The miimum-travel-cost algorithm is a dyamic programmig algorithm to compute
More informationRunning Time ( 3.1) Analysis of Algorithms. Experimental Studies. Limitations of Experiments
Ruig Time ( 3.1) Aalysis of Algorithms Iput Algorithm Output A algorithm is a step- by- step procedure for solvig a problem i a fiite amout of time. Most algorithms trasform iput objects ito output objects.
More informationAnalysis of Algorithms
Aalysis of Algorithms Iput Algorithm Output A algorithm is a step-by-step procedure for solvig a problem i a fiite amout of time. Ruig Time Most algorithms trasform iput objects ito output objects. The
More informationSecurity of Bluetooth: An overview of Bluetooth Security
Versio 2 Security of Bluetooth: A overview of Bluetooth Security Marjaaa Träskbäck Departmet of Electrical ad Commuicatios Egieerig mtraskba@cc.hut.fi 52655H ABSTRACT The purpose of this paper is to give
More informationChapter 24. Sorting. Objectives. 1. To study and analyze time efficiency of various sorting algorithms
Chapter 4 Sortig 1 Objectives 1. o study ad aalyze time efficiecy of various sortig algorithms 4. 4.7.. o desig, implemet, ad aalyze bubble sort 4.. 3. o desig, implemet, ad aalyze merge sort 4.3. 4. o
More informationNew HSL Distance Based Colour Clustering Algorithm
The 4th Midwest Artificial Itelligece ad Cogitive Scieces Coferece (MAICS 03 pp 85-9 New Albay Idiaa USA April 3-4 03 New HSL Distace Based Colour Clusterig Algorithm Vasile Patrascu Departemet of Iformatics
More informationBig-O Analysis. Asymptotics
Big-O Aalysis 1 Defiitio: Suppose that f() ad g() are oegative fuctios of. The we say that f() is O(g()) provided that there are costats C > 0 ad N > 0 such that for all > N, f() Cg(). Big-O expresses
More informationThreads and Concurrency in Java: Part 1
Cocurrecy Threads ad Cocurrecy i Java: Part 1 What every computer egieer eeds to kow about cocurrecy: Cocurrecy is to utraied programmers as matches are to small childre. It is all too easy to get bured.
More informationPython Programming: An Introduction to Computer Science
Pytho Programmig: A Itroductio to Computer Sciece Chapter 6 Defiig Fuctios Pytho Programmig, 2/e 1 Objectives To uderstad why programmers divide programs up ito sets of cooperatig fuctios. To be able to
More informationAN OPTIMIZATION NETWORK FOR MATRIX INVERSION
397 AN OPTIMIZATION NETWORK FOR MATRIX INVERSION Ju-Seog Jag, S~ Youg Lee, ad Sag-Yug Shi Korea Advaced Istitute of Sciece ad Techology, P.O. Box 150, Cheogryag, Seoul, Korea ABSTRACT Iverse matrix calculatio
More informationSession Initiated Protocol (SIP) and Message-based Load Balancing (MBLB)
F5 White Paper Sessio Iitiated Protocol (SIP) ad Message-based Load Balacig (MBLB) The ability to provide ew ad creative methods of commuicatios has esured a SIP presece i almost every orgaizatio. The
More information