Communication Speed Selection and Functional Partitioning for Low-Energy On-Chip Networked Multiprocessor


Jinfeng Liu, Pai H. Chou, Nader Bagherzadeh
Department of Electrical & Computer Engineering
University of California, Irvine, CA , USA
{jinfengl, chou, nader}@ece.uci.edu

Abstract

High-speed serial network interfaces are becoming the primary way for modern embedded systems and systems-on-chip to connect with each other and with peripheral devices. Modern communication interfaces are capable of operating at multiple speeds and are opening a new dimension of trade-offs between computation and communication. Unfortunately, today's CPU-centric techniques often fail to consider multi-speed communication and the balance between communication and computation for time and energy; as a result, they yield sub-optimal if not incorrect designs. This paper presents a new technique for global energy optimization through coordinated functional partitioning and speed selection for the processors and their communication interfaces. We propose a multi-dimensional dynamic programming formulation for energy-optimal functional partitioning with CPU/communication speed selection for a class of data-regular applications under performance constraints. We demonstrate the effectiveness of our optimization techniques with an image processing application mapped onto a multi-processor architecture with a multi-speed Ethernet.

Keywords: communication speed selection, functional partitioning, on-chip networked multi-processor, low-power design

1 Introduction

Towards High-Speed Serial Busses on SoC

A key trend in systems-on-chip is towards the use of high-speed serial busses for system-level interconnect. Serial busses offer many compelling advantages, including modularity, composability, scalability, form factor, and power efficiency [5,, 3]. Modularity and composability are extremely important, because the sheer complexity of these chips forces designers to raise the level of abstraction.
Most SoC designs are done by integration of intellectual property (IP) components as a way to manage complexity while meeting time-to-market deadlines. Serial protocols are well understood and have long been used in automotive control (e.g., CAN from Bosch) and consumer electronics (e.g., I2C from Philips). More recent protocols such as FireWire (IEEE 1394) and USB are commonly used not only for peripheral devices but also for connecting multiple embedded processors. They provide a simple, standardized, efficient, and scalable way of building loosely coupled systems. High-speed serial controllers such as Ethernet are now an integral part of many embedded processors.

Serial busses also have power and form factor advantages. From automobiles to computer peripherals, serial interconnects such as FireWire and USB are compact and low power compared to SCSI or parallel busses, which are bulky, high power, and limited in length. This is especially important for systems-on-chip, where gates are virtually free, but wires are the most expensive part of the chip real estate. Long, parallel, shared wires are not only high power but also suffer from clock skew and even crosstalk as the feature size shrinks. Serial controllers provide a clean abstraction by shielding components from these low-level concerns. Moreover, modern protocols also support plug-and-play and power management features such as subnet shutdown or link suspension. These features and more make high-speed serial protocols an attractive choice for rapid integration of SoC architectures.

Power/Performance Issues with Serial Networks

Of course, serial controllers come at a price. The area and IP licensing will have a cost, but this cost might be justified by time-to-market or other overriding business concerns. In fact, it might be even less of an issue for future IP, which will likely have these serial controllers integrated. For example, AMD's newly announced Au [] is a MIPS-based microcontroller with integrated 10/100-base T Ethernet, USB, and many other I/O.

This research was sponsored by DARPA grant F and Printronix Fellowship.
However, power and performance will become the critical issues, as they directly affect the correctness of the design. For power optimization, previous efforts focused on the processor for several reasons. The CPU was the main consumer of power, and it also offered the most options for power management, including voltage scaling. However, recent advances in both processors and communication interfaces are driving a shift in how power should be managed. CPU-centric power management has given rise to a new generation of processors with dramatically improved power efficiency, and the CPU is now drawing a smaller percentage of the overall system power. The insatiable demand for bandwidth has also resulted in high-speed communication interfaces. Even though their power efficiency (i.e., energy per bit transmitted) has also been improved, communication power now matches or surpasses the CPU, and is thus a larger fraction of the system power. For instance, the Intel XScale processor consumes .6W at full speed, while a GigaBit Ethernet interface consumes 6W.

System Power Management with Speed Selection

Many communication interfaces today support multiple data rates. However, the scaling effects tend to be opposite those of voltage-scalable CPUs. For CPUs, slower speed generally means lower power and lower energy per instruction; but for communication, faster speed means higher power but often less energy per bit. This is highly dependent on the specific controller. Few research works to date have explored communication speed as a key parameter for power optimization.

Speed selection cannot be performed for just communication or computation in isolation, because a local decision can have a global impact. One reason is that communication now goes through a shared medium rather than point-to-point. The CPUs cannot all be run at the slowest, most power-efficient speeds, because they must compete for the available time and power with each other and with the communication interfaces. A faster communication speed, even at a higher energy-per-bit, can save energy by creating opportunities for subsystem shutdown or voltage scaling the processors. Greedily saving communication power may actually result in higher overall energy. At the same time, functional partitioning must be an integral part of the optimization loop, because different partitioning schemes can dramatically alter the communication payload and computation workload for each node.
Approach

For a given workload on a networked architecture, our problem statement is to generate a functional partitioning scheme and to select the speeds of communication interfaces and processors, such that the total energy is minimized. In general, such a problem is extremely difficult. Fortunately, for a class of systems with pipelined tasks under an overall latency constraint, efficient, exact solutions exist. This paper presents a multi-dimensional dynamic programming solution to such a problem. It formulates the energy consumed by the processors and communication interfaces with their power/speed scaling factors within their available time budget. We demonstrate the effectiveness of this technique with an image processing algorithm mapped onto a multi-processor architecture interconnected by a GigaBit Ethernet. This technique is also applicable as a heuristic to general dataflow problems.

2 Related Work

Previous works have explored communication synthesis and optimization in distributed multi-processor systems. [7] presents communication scheduling to work with rate-monotonic tasks, while [7] assumes the more deterministic time-triggered protocol (TTP). [] distributes timing constraints on communication among segments through priority assignment on serial busses (such as controller area network) and customization of device drivers. While these assume a bus or a network protocol, LYCOS [9] integrates the ability to select among several communication protocols (with different delays, data sizes, burstiness) into the main partitioning loop. Although these and many other works can be extended to SoC architectures, they do not specifically optimize for energy minimization by exploiting the processors' voltage scaling capabilities.

Related techniques that optimize for power consumption of processors typically assume a fixed communication data rate. [4] uses simulated heating search strategies to find low-power design points for voltage-scalable embedded processors. [] performs battery-aware task post-scheduling for distributed, voltage-scalable processors by moving tasks to smooth the power profile.
[6, 5] propose partitioning the computation onto a multi-processor architecture that consumes significantly less power than a single processor. [6] reduces switching activities of both functional units and communication links by partitioning tasks onto a multi-chip architecture, while [8] maximizes the opportunity to shut down idle processors through functional partitioning. All these techniques focus on the computational aspect without exploring the speed/power scalability of the communication interfaces.

Existing techniques cannot be readily combined to explore many timing/power trade-offs between computation and communication. The quadratic voltage scaling properties for CPUs do not generalize to communication interfaces. Even if they did, these techniques have not considered the partitioning of power and timing budgets among computation/communication components across the network. Selecting communication attributes by only considering deadlines without power will lead to unexpected, often incorrect results at the system level.

3 System Model

This section defines a system-level performance/energy model for both computation and communication components in a networked on-chip multi-processor architecture. In this paper, a system consists of $M$ processing nodes $N_i$, $i = 1, 2, \ldots, M$, connected by a shared communication medium. Each processing node (or node for short) consists of a processor, a local memory, and one or more communication interfaces that send and/or receive data from other nodes.

3.1 Jobs and Tasks

A processing job assigned to a node has three tasks: RECV, PROC, and SEND, which must be executed serially in that order. RECV and SEND are communication tasks on the interfaces, and PROC is a computation task on the processor. The workload for each task is defined as follows. For communication tasks RECV and SEND, workloads $W_r$ and $W_s$ indicate the number of bits to be received and sent, respectively. For the computation task PROC, the workload $W_p$ is the number of cycles. Let $T_p, T_r, T_s$ denote the delays of tasks PROC, RECV, and SEND, respectively. Let $F_p$ denote the clock frequency of the processor, and $F_r$ and $F_s$ the respective data bit rates for receiving and sending. We have

$$T_p = \frac{W_p}{F_p}; \quad T_r = \frac{W_r}{F_r}; \quad T_s = \frac{W_s}{F_s} \qquad (1)$$

(1) is reasonable for processors executing data-dominated programs, where the total cycles $W_p$ can be analyzed and bounded statically. However, it does not hold true in general if the effective data rate can be reduced by collisions and errors on the shared communication medium. We present the collision-free condition of the shared medium in Section 4. To model the non-ideal aspect of the medium, we introduce the communication efficiency terms $\rho_r$ and $\rho_s$, $0 \le \rho_r, \rho_s \le 1$, such that $T_r = \frac{W_r}{\rho_r F_r}$ and $T_s = \frac{W_s}{\rho_s F_s}$. Note that $\rho_r$ and $\rho_s$ need not be constants, but may be functions of communication speeds $F_r, F_s$. For brevity, our experimental results assume an ideal communication medium ($\rho_r = \rho_s = 1$) without loss of generality. A more practical communication model can be directly applied, since $\rho_r$ and $\rho_s$ can be very well bounded for a collision-free medium.

$D$ is a deadline on each processing job, which requires $T_r + T_p + T_s \le D$ for the three serialized tasks. If any slack time exists, then we can slow down task PROC by voltage scaling to reduce energy. Therefore, we assume the job finishes at the deadline. That is,

$$D = T_r + T_p + T_s \qquad (2)$$

3.2 Power Scaling

On each node, we assume only the processor and the communication interfaces are power-manageable by speed selection.
The power consumption by the communication medium is interpreted to be the total power consumed by all active communication interfaces. We assume a processor's voltage-scaling characteristics can be expressed by a scaling function $Scale_p$ that maps the CPU frequency to its power level. A communication interface also has scaling functions that characterize the power levels at different communication data rates for sending and receiving. (2) implies $Scale_p$ is continuous, while communication interfaces support only a few discrete scaling points. Let $P_p$, $P_r$, and $P_s$ denote the power levels of tasks PROC, RECV, and SEND, respectively; then

$$P_p = Scale_p(F_p); \quad P_r = Scale_r(F_r); \quad P_s = Scale_s(F_s) \qquad (3)$$

Figure 1: Timing and power properties of a processing node. (a) Block diagram: RECV (receiving $W_r$ bits), PROC ($W_p$ cycles on the processor), SEND (sending $W_s$ bits). (b) Timing-power diagram: RECV (delay $T_r = W_r / F_r$, power $P_r$, speed $F_r$), PROC (delay $T_p = W_p / F_p$, power $P_p$, speed $F_p$), SEND (delay $T_s = W_s / F_s$, power $P_s$, speed $F_s$), and OVERHEAD (power $P_{ovh}$).

Let $P_{ovh}$ denote the power overhead when introducing an additional node into the system. It captures the power of the memory, the minimum power of the CPU and communication interface, the CPU's power during RECV and SEND (DMA), and the communication interfaces' power during PROC. The energy consumption of a task is the power-delay product. Let $E_p, E_r, E_s$, and $E_{ovh}$ denote the energy consumption of tasks PROC, RECV, SEND, and the overhead of a node:

$$E_p = P_p T_p; \quad E_r = P_r T_r; \quad E_s = P_s T_s; \quad E_{ovh} = P_{ovh} D \qquad (4)$$

For one node $N$ with tasks PROC, RECV, and SEND, the total energy of node $N$ is

$$E_N = E_p + E_r + E_s + E_{ovh} \qquad (5)$$

Fig. 1 shows the structure of a processing node. The gray bar represents the overhead and white bars represent tasks RECV, PROC, and SEND. The area of the bars refers to the energy contribution of the tasks and overhead. Finally, the total energy of the system is the sum of energy consumption on each node,

$$E_{sys} = \sum_{i=1}^{M} E_{N_i} \qquad (6)$$

3.3 M-Node Pipeline

This paper considers a special case called an $M$-node pipeline.
It consists of $M$ identical nodes $N_i$, $i = 1, 2, \ldots, M$, as characterized by $Scale_p, Scale_r, Scale_s, E_{ovh}$. Each node $N_i$ receives $W_{r_i}$ bits of data from the previous node $N_{i-1}$ (except $N_1$), processes the data in $W_{p_i}$ cycles, and sends the $W_{s_i}$-bit result to the next node $N_{i+1}$ (except $N_M$). Each $SEND_i \rightarrow RECV_{i+1}$ communication pair sends and receives the same amount of data at the same communication speed, with the same communication delay, and we assume they start and finish at the same time. That is, $W_{s_i} = W_{r_{i+1}}$, $F_{s_i} = F_{r_{i+1}}$, $T_{s_i} = T_{r_{i+1}}$. All nodes have the same deadline $D$, and each node acts as a pipeline stage with delay $D$. Fig. 2 shows an example of a three-node pipeline. For brevity, the overhead is not shown.

Figure 2: A 3-node pipeline. (a) Block diagram. (b) Serialized timing-power diagram. (c) Pipelined timing-power diagram.

An $M$-node pipeline can be partitioned and mapped onto a $K$-node pipeline ($K \le M$) by merging adjacent nodes $N_i, N_{i+1}, \ldots, N_{i+n}$ ($n \ge 0$) into a new node $N'_j$. The new node $N'_j$ combines all computation workload, receives $W_{r_i}$ bits of data, and sends $W_{s_{i+n}}$ bits of data. Communication within a node becomes local data accesses. That is, $W'_{p_j} = \sum_{l=0}^{n} W_{p_{i+l}}$, and $W'_{r_j} = W_{r_i}$, $W'_{s_j} = W_{s_{i+n}}$. The new $K$-node pipeline is called a partitioning of the initial $M$-node pipeline.

4 Schedulability Conditions

This section presents the schedulability conditions for the pipelined on-chip multi-processor system. In the pipelined timing diagram Fig. 2(c) of the three-node pipeline, we fold the tasks in Fig. 2(b) into a common interval with duration $D$, which is the delay of each pipeline stage. Note that there appear to be two instances of task PROC on node $N_3$. This does not mean that task PROC on node $N_3$ is preempted. In fact, each instance is a part of an integrated task PROC across the boundary between pipeline stages. In other words, the boundary between pipeline stages resides in the middle of the execution of task PROC. Fig. 2(c) shows that due to the common deadline, communication activities are shifted to different time slots, such that at any given time, there is at most one active communication instance (a $SEND_i \rightarrow RECV_{i+1}$ pair, e.g. $SEND_2 \rightarrow RECV_3$ and $SEND_1 \rightarrow RECV_2$ are serialized). This is especially meaningful if all nodes share a communication medium such as Ethernet instead of point-to-point connections.
If collision does not occur, then our estimation of both performance and energy of the whole system can be well bounded. Collision is always undesirable because retransmission costs both time and energy. Communication activities should be scheduled such that the system is collision-free.

Lemma 1 (Collision-free Condition) In an $M$-node pipeline with a deadline $D$, let $T_i$, $i = 0, 1, \ldots, M$ indicate the delays of the $M + 1$ instances of data communication.

$$T_i = \begin{cases} T_{r_1} & (i = 0) \\ T_{s_i} = T_{r_{i+1}} & (i = 1, 2, \ldots, M-1) \\ T_{s_M} & (i = M) \end{cases}$$

The system does not have collision on the shared communication medium iff the utilization of the shared communication medium is less than or equal to 1. That is,

$$U = \frac{1}{D} \sum_{i=0}^{M} T_i \le 1 \qquad (7)$$

Note that for a general multi-processor, Lemma 1 expresses the overload condition and can be only a necessary condition for a collision-free schedule. However, it is also a sufficient condition for $M$-node pipelines as defined in Section 3.3, because this special case of pipelining has the property of serializing all communication instances. Lemma 1 is also the schedulability condition for the shared communication medium.

Lemma 2 (Schedulability Condition of One Node) In an $M$-node pipeline with a deadline $D$, node $N_i$, $i = 1, 2, \ldots, M$, is able to meet the deadline iff $N_i$ is not overloaded, that is,

$$\frac{W_{p_i}}{\max(F_p)} \le D - T_{r_i} - T_{s_i} \qquad (8)$$

Lemma 2 states the overload condition of one node: given the communication speeds (that determine communication delays $T_{r_i}, T_{s_i}$), if its computation task cannot be completed within the time budget $D - T_{r_i} - T_{s_i}$ by operating at the maximum CPU clock rate, then this node will fail to meet the deadline and thus the whole pipeline will malfunction. If Lemma 2 cannot be satisfied, then the only way to meet the deadline is to select higher communication speeds to reduce $T_{r_i}, T_{s_i}$, in order to allocate additional time budget for computation. High-speed communication can also reduce communication collision to satisfy Lemma 1.
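The two conditions above can be checked mechanically. The following is a minimal Python sketch, assuming an ideal medium and that the caller supplies one bit count and one bit rate per communication instance; the function names and the argument layout are illustrative and not taken from the paper.

```python
from typing import Sequence


def medium_utilization(comm_delays: Sequence[float], deadline: float) -> float:
    """Utilization U of the shared medium: sum of T_0..T_M divided by D, as in (7)."""
    return sum(comm_delays) / deadline


def node_meets_deadline(w_p: float, f_max: float, t_r: float, t_s: float,
                        deadline: float) -> bool:
    """Node check from (8): W_p / max(F_p) <= D - T_r - T_s."""
    return w_p / f_max <= deadline - t_r - t_s


def pipeline_is_schedulable(w_p: Sequence[float], w_comm: Sequence[float],
                            f_comm: Sequence[float], f_max: float,
                            deadline: float) -> bool:
    """Check every node and the shared medium of an M-node pipeline.

    w_comm[i] and f_comm[i] give the bits and bit rate of communication
    instance i (i = 0..M), so node n (0-based) receives on instance n and
    sends on instance n + 1.
    """
    if len(w_comm) != len(w_p) + 1 or len(f_comm) != len(w_comm):
        raise ValueError("expected M + 1 communication instances for M nodes")
    delays = [w / f for w, f in zip(w_comm, f_comm)]
    nodes_ok = all(
        node_meets_deadline(w_p[n], f_max, delays[n], delays[n + 1], deadline)
        for n in range(len(w_p))
    )
    return nodes_ok and medium_utilization(delays, deadline) <= 1.0
```

Keeping the per-node check and the medium check as separate functions mirrors the structure of the two lemmas, so either condition can be inspected on its own when a configuration is rejected.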

Lemma 3 (Schedulability Condition of the System) An $M$-node pipeline is schedulable to meet a deadline $D$ iff (1) every node $N_i$, $i = 1, 2, \ldots, M$, meets the deadline (Lemma 2), and (2) the shared communication medium is collision-free (Lemma 1).

Lemma 3 says that the system's schedulability is determined by the schedulability of all resources, including nodes and the communication medium. If and only if none of them is overloaded, the system can be pipelined by the deadline $D$. Lemma 3 holds true only for this $M$-node pipeline organization; it is a necessary but not sufficient condition for a general multi-processor system.

5 Motivating Example

We use an automatic target recognition (ATR) algorithm (Fig. 3) as our motivating example. Originally it is a serial algorithm. We reconstructed a parallel version and mapped it onto pipelined multiple processors. Pipelining allows each processor to run at a much slower speed with a lower voltage level to reduce overall computation energy, while parallelism compensates for the performance. Of course, a multi-processor platform incurs energy for inter-processor communication, extra processors, memory, and other overhead.

Figure 3: Functional blocks of the ATR algorithm: $N_1$ Target Detection, $N_2$ FFT, $N_3$ Filter, $N_4$ IFFT, $N_5$ Compute Distance.

Figure 4: The impact of different partitioning schemes and communication speed settings. (a) A fine-grain partitioning scheme reduces energy on computation, at the cost of inter-processor communication and overhead of additional nodes. (b) The combined node reduces communication and overhead, but it requires more energy for computation. (c) The computation energy can be reduced by increasing communication speeds, which leaves more time for computation.
Mapping Tasks to Nodes through Partitioning

Given the five functional blocks (tasks) of the ATR algorithm, several partitioning schemes are possible for mapping the tasks to a number of pipelined nodes. Fig. 4 shows an example by considering how to map the first two tasks onto nodes. In Fig. 4(a), they are mapped onto two nodes $N_1$ and $N_2$ that are both allowed to operate at a lower speed (3Hz) for computation. This scheme has lower computation energy than if they were mapped onto one node, but it requires energy for communication tasks $SEND_1 \rightarrow RECV_2$, plus overhead. Fig. 4(b) shows a mapping onto one node. It eliminates the communication $SEND_1 \rightarrow RECV_2$ and the overhead of an extra node. However, the combined node has much more computation workload and must run at a faster clock rate (6Hz), a less energy-efficient level.

Zooming out, many partitioning schemes are possible, even when limited to a pipelined organization. For example, one partitioning [N1, N2][N3, N4, N5] may be optimal for nodes N1 and N2; but it will preclude another solution [N1][N2, N3][N4, N5] that may lead to lower energy for the whole system.

Speed Selection for CPU and Communication

In addition to partitioning, the selection of communication speed is an equally critical issue. For example, we consider a 10/100/1000Base-T Ethernet interface. It consumes more power than the CPU at high (100/1000 Mbps) speeds, but less power than the CPU at the slower, 10 Mbps data rate. In Fig. 4(b), the processor must operate at a high clock rate due to the low-speed communication. Because of the deadline $D$, communication and computation compete for this budget. Low-speed communication leaves less time for computation, thereby forcing the processor to run faster to meet the deadline. Conversely, high-speed communication could free more time budget for computation, as shown in Fig. 4(c), where the CPU's clock rate is dropped (to 3Hz) by using faster communication. Although extra energy could be allocated to communication, if the energy saving on the CPU could compensate for this cost, then (c) would be more energy-efficient than (b).
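The budget competition described above can be made concrete with a small worked example in Python. Every number below is hypothetical and chosen only for illustration: the workload, the deadline, the two interface operating points, and the cubic CPU power model with coefficient `k_cpu` are assumptions, not values from the paper.

```python
def required_cpu_frequency(w_p, w_r, w_s, f_comm, deadline):
    """Clock rate that finishes PROC exactly at the deadline (D = T_r + T_p + T_s)."""
    budget = deadline - (w_r + w_s) / f_comm
    if budget <= 0:
        raise ValueError("communication alone exceeds the deadline")
    return w_p / budget


def job_energy(w_p, w_r, w_s, f_comm, p_comm, deadline, k_cpu=1e-25):
    """Return (CPU frequency, total job energy) for one communication speed."""
    t_comm = (w_r + w_s) / f_comm
    f_p = required_cpu_frequency(w_p, w_r, w_s, f_comm, deadline)
    t_p = w_p / f_p
    e_cpu = k_cpu * f_p ** 3 * t_p   # assumed cubic CPU power model
    e_comm = p_comm * t_comm         # interface power times time on the link
    return f_p, e_cpu + e_comm


# Hypothetical job: 6M cycles, 100 kbit in, 100 kbit out, 50 ms deadline.
W_P, W_R, W_S, D = 6e6, 1e5, 1e5, 0.05
slow = job_energy(W_P, W_R, W_S, f_comm=1e7, p_comm=0.1, deadline=D)
fast = job_energy(W_P, W_R, W_S, f_comm=1e8, p_comm=2.0, deadline=D)
print(f"slow link: CPU at {slow[0] / 1e6:.0f} MHz, {slow[1] * 1e3:.3f} mJ")
print(f"fast link: CPU at {fast[0] / 1e6:.0f} MHz, {fast[1] * 1e3:.3f} mJ")
```

Under these assumed numbers the fast link costs twice the energy per bit (2 W at 100 Mbps versus 0.1 W at 10 Mbps), yet the job as a whole needs roughly half the energy, because the CPU can drop from about 200 MHz to about 125 MHz. With a different power model or a lighter computation workload the balance can tip the other way, which is why the speeds have to be selected jointly.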
The communication-computation interaction becomes more intricate in a multi-processor environment. Any data dependency between different nodes must involve their communication interfaces. The communication speed of a sender will not only determine the receiver's communication speed but also influence the choice of the receiver's computation speed. The communication speed on the first node of the pipeline will have a chain effect on all other nodes in the system. A locally optimal speed for the first node will not necessarily lead to a globally optimal solution.

Combining Partitioning and Speed Selection

Partitioning and communication speed selection are mutually enabling. Given a fixed partitioning scheme, the designers can always find the corresponding optimal speed setting that minimizes energy for that scheme. However, energy-optimal speed selection for a partitioning is not necessarily optimal over all partitionings. Instead, partitioning and speed selection must be considered together. In this paper, we take a multi-dimensional optimization approach that considers performance requirement, schedulability, load balancing, communication-computation trade-offs, and multi-processor overhead in a system-level context.

6 Problem Formulation

Given an $M$-node pipeline, the choices of partitioning and communication speed settings will lead to different energy consumption at the system level. This section formulates the energy minimization problems by means of partitioning and communication speed selection. In both cases, the optimal solutions can be obtained by dynamic programming. Finally, the combined optimization problem with both partitioning and communication speed selection can be addressed synergistically by multi-dimensional dynamic programming.

Problem 1 (Optimal Partitioning) Given (a) $M$ pipelined nodes $N_i$ with workload $W_{p_i}, W_{r_i}, W_{s_i}$, $i = 1, 2, \ldots, M$, (b) a deadline $D$ for all nodes, and (c) the constraint that the speed settings of all communication instances must match: $F_{r_1}$, $F_{s_i} = F_{r_{i+1}}$, $F_{s_M}$, for $i = 1, 2, \ldots, M-1$, find a partitioning scheme that minimizes energy $E_{sys}$.

To avoid exhaustive enumeration in the $O(2^M)$ solution space, we construct a series of sub-problems as follows. We consider a sub-problem $P[j, i]$ that maps the first $i$ original nodes $N_1, N_2, \ldots, N_i$ onto a sub-partitioning of $j$ nodes $N'_1, N'_2, \ldots, N'_j$. The optimal solution of $P[j, i]$ has the minimum energy $E[j, i]$. It can be decomposed into two parts shown in Fig. 5: (a) a sub-partitioning $P[j-1, l]$ that maps the first $l$ original nodes to $j-1$ new nodes, plus (b) the $j$th new node $N'_j$ that combines the original nodes $N_{l+1}, \ldots, N_i$ with its energy denoted as $E_{N'_j}$. In order to achieve the minimum energy $E[j, i]$, the energy consumption of (a) must also be an optimal sub-solution $E[j-1, l]$. Since $l$ can be any value in the range $j-1 \le l \le i-1$, $E[j, i]$ must also be the minimum value of $E[j-1, l] + E_{N'_j}$ over all these possible values of $l$. That is, $E[j, i] = \min_{j-1 \le l \le i-1} \{E[j-1, l] + E_{N'_j}\}$. Any optimal sub-solution $E[j, i]$ can be derived from other optimal sub-solutions $E[j-1, l]$. Therefore, the problem has an optimal sub-structure and a dynamic programming approach is appropriate. It is illustrated in Fig. 6. Matrix $E[j, i]$ is initialized to $\infty$ for $0 \le j, i \le M$.
We define $E[0, 0] = 0$, and it can be used to compute the first row $E[1, i]$, $i = 1, 2, \ldots, M$. For any entry $E[j, i]$, its value can be computed from entries in the previous row $E[j-1, l]$, $j-1 \le l \le i-1$. These entries are shaded in Fig. 6. Thus, a series of optimal sub-solutions $E[2, i], E[3, i], \ldots, E[M, i]$ in each row of the matrix can be computed subsequently. Finally, these sub-solutions lead to the global optimal solution $\min_j \{E[j, M]\}$, which maps all $M$ original nodes onto a new partitioning with minimum energy. Note that the same algorithm can also solve the optimal partitioning onto a fixed number of nodes. For example, $E[j, M]$ is the optimal energy for mapping $M$ nodes onto an arbitrary $j$-node new partitioning.

Figure 5: The optimal sub-structure of Problem 1. The $j$-node optimal sub-partitioning of the first $i$ original nodes, with minimum energy $E[j, i]$, consists of (a) a sub-partitioning that maps $l$ nodes $N_1, \ldots, N_l$ onto $j-1$ new nodes $N'_1, \ldots, N'_{j-1}$ with minimum energy $E[j-1, l]$, and (b) the last new node $N'_j$, which combines nodes $N_{l+1}, \ldots, N_i$ with energy $E_{N'_j}$.

Figure 6: The dynamic programming approach to solve Problem 1. Each entry $E[j, i]$ can be computed by the shaded entries in the previous row ($E[j-1, l]$, $l = j-1, \ldots, i-1$). The global optimal energy, $E_{opt} = \min \{E[j, M]\}$, $j = 1, 2, \ldots, M$, is the minimum value of the last column.

To summarize, the optimal cost function $E$ is defined as follows:

$$E[j, i] = \begin{cases} 0 & \text{for } i = j = 0 \\ \min_{j-1 \le l \le i-1} \{ E[j-1, l] + E_{N'_j} \} & \text{if } U[j-1, l] + \frac{W_{s_i}}{F_{s_i} D} \le 1, \text{ for } 1 \le j \le i \le M \end{cases} \qquad (9)$$

To guarantee each optimal sub-solution is schedulable, by Lemma 3, the communication medium must be collision-free, and any node in the new sub-partitioning must not be overloaded. We define a utilization matrix $U[j, i]$ indicating the utilization of the communication medium corresponding to the optimal solution of a sub-problem $P[j, i]$, which is guarded by $U[j, i] \le 1$ (Lemma 1). $U$ is initialized to $\infty$, while setting $U[0, 0] = \frac{W_{r_1}}{F_{r_1} D}$ ($= T_0 / D$ in (7)), indicating the bandwidth used by the first communication instance $RECV_1$. We also define the energy consumption of a node $N'_j$ as $E_{N'_j}$, which refines (5) by Lemma 2. If a node is overloaded, then its energy consumption is $\infty$, indicating an invalid solution.
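The recurrence (9), together with the utilization guard and the overload rule just described, can be sketched in Python as follows. This is an illustrative sketch rather than the authors' implementation: lists are 0-based, `scale_p`, `scale_r`, and `scale_s` are assumed to be caller-supplied functions from speed to power, `f_max` stands for the maximum CPU clock rate, and at least one node is assumed.

```python
import math


def node_energy(w_p, w_r, f_r, w_s, f_s, deadline, f_max,
                scale_p, scale_r, scale_s, p_ovh):
    """Energy of one (possibly merged) node, or infinity if it is overloaded."""
    t_r, t_s = w_r / f_r, w_s / f_s
    t_p = deadline - t_r - t_s            # PROC fills the remaining budget
    if t_p <= 0 or w_p / t_p > f_max:
        return math.inf
    f_p = w_p / t_p
    return (scale_r(f_r) * t_r + scale_s(f_s) * t_s
            + scale_p(f_p) * t_p + p_ovh * deadline)


def optimal_partitioning(w_r, w_s, w_p, f_r, f_s, deadline, f_max,
                         scale_p, scale_r, scale_s, p_ovh):
    """Return (minimum energy, groups); groups hold 1-based original node numbers."""
    M = len(w_p)
    prefix = [0.0]                         # prefix[i] = cycles of nodes 1..i
    for w in w_p:
        prefix.append(prefix[-1] + w)
    E = [[math.inf] * (M + 1) for _ in range(M + 1)]
    U = [[math.inf] * (M + 1) for _ in range(M + 1)]
    P = [[0] * (M + 1) for _ in range(M + 1)]
    E[0][0] = 0.0
    U[0][0] = w_r[0] / f_r[0] / deadline   # first receive instance
    for j in range(1, M + 1):              # number of new nodes
        for i in range(j, M + 1):          # number of original nodes covered
            for l in range(j - 1, i):      # last new node merges l+1..i
                if math.isinf(E[j - 1][l]):
                    continue
                e = E[j - 1][l] + node_energy(
                    prefix[i] - prefix[l], w_r[l], f_r[l], w_s[i - 1], f_s[i - 1],
                    deadline, f_max, scale_p, scale_r, scale_s, p_ovh)
                u = U[j - 1][l] + w_s[i - 1] / f_s[i - 1] / deadline
                if u <= 1.0 and e < E[j][i]:
                    E[j][i], U[j][i], P[j][i] = e, u, l
    best_j = min(range(1, M + 1), key=lambda j: E[j][M])
    if math.isinf(E[best_j][M]):
        return math.inf, []                # no schedulable partitioning
    groups, i = [], M
    for j in range(best_j, 0, -1):         # walk the recorded split points back
        l = P[j][i]
        groups.append(list(range(l + 1, i + 1)))
        i = l
    groups.reverse()
    return E[best_j][M], groups
```

The prefix sums are a convenience added here so that the merged workload of nodes $l+1, \ldots, i$ is available in constant time; the three nested loops correspond directly to the indices $j$, $i$, and $l$ of the recurrence.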

$$U[j, i] = \begin{cases} \frac{W_{r_1}}{F_{r_1} D} & \text{for } i = j = 0 \\ U[j-1, l] + \frac{W_{s_i}}{F_{s_i} D} & \text{for } l \text{ that achieves } \min\{E[j, i]\} \text{ in (9), for } 1 \le j \le i \le M \end{cases} \qquad (10)$$

$$E_{N'_j} = \begin{cases} Scale_r(F_r) T_r + Scale_s(F_s) T_s + Scale_p(F_p) T_p + P_{ovh} D & \text{if } F_p = \frac{W_p}{D - T_r - T_s} \le F_{max} \quad (T_r = \frac{W_r}{F_r}, T_s = \frac{W_s}{F_s}) \\ \infty & \text{otherwise} \end{cases} \qquad (11)$$

    PARTITIONING(W_r[1:M], W_s[1:M], W_p[1:M], F_r[1:M], F_s[1:M],
                 Scale_r, Scale_s, Scale_p, D, P_ovh)
      for j := 0 to M do
        for i := 0 to M do
          E[j,i] := ∞;  U[j,i] := ∞;  P[j,i] := 0
      E[0,0] := 0
      U[0,0] := W_r[1] / F_r[1] / D
      for j := 1 to M do
        for i := j to M do
          for l := j-1 to i-1 do
            e := E[j-1,l] + E_N'j
            u := U[j-1,l] + W_s[i] / F_s[i] / D
            if u ≤ 1 and e < E[j,i] then
              E[j,i] := e
              U[j,i] := u
              P[j,i] := l
      E_opt, P_opt := retrieve from matrices E, P
      return E_opt, P_opt

Figure 7: Optimal partitioning algorithm.

Fig. 7 shows the optimal partitioning algorithm derived from (9) and (11). The partitioning matrix $P[j, i]$ records the previous optimal sub-solutions for each sub-problem. This information can be used to retrieve the optimal partitioning $P_{opt}$. The time complexity of this algorithm is $O(M^3)$, determined by the three-level nested loop.

Problem 2 (Optimal Speed Selection) Given (a) a fixed partitioning scheme with $M$ pipelined nodes $N_i$ with workload $W_{p_i}, W_{r_i}, W_{s_i}$, $i = 1, 2, \ldots, M$, (b) a deadline $D$ for all nodes, and (c) the available choices for communication speed settings $F_{c_k}$, $k = 1, 2, \ldots, C$, find all processor speeds $F_{p_i}$ and communication speeds $F_{r_i}, F_{s_i}$ that minimize energy $E_{sys}$.

We also perform dynamic programming as opposed to exhaustive search in the $O(C^{M+1})$ solution space. Since communication speeds decide processor speeds, we only select communication speeds for each node. Given that the sending speed and receiving speed are equal for each communication instance, selecting only the sending speed is sufficient.

Figure 8: The optimal sub-structure of Problem 2. The first $i$ nodes, where the last sending speed is $F_{s_i} = F_{c_k}$, with minimum energy $E[i, k]$, consist of (a) a sub speed selection problem where node $N_{i-1}$'s sending speed is selected as $F_{s_{i-1}} = F_{c_m}$ with minimum energy $E[i-1, m]$, and (b) the last node $N_i$, whose receiving speed is $F_{c_m}$ and sending speed is $F_{c_k}$, with energy $E_{N_i}(F_r = F_{c_m}, F_s = F_{c_k})$.
Figure 9: The dynamic programming approach to solve Problem 2. Each entry $E[i, k]$ can be computed by the shaded row $E[i-1, m]$. The global optimal energy, $E_{opt} = \min \{E[M, k]\}$, $k = 1, 2, \ldots, C$, is the minimum value of the last row.

We define a sub-problem $S[i, k]$ that selects communication speeds for the first $i$ nodes, with the last node $N_i$'s sending speed selected to be the $k$th choice of speed settings, $F_{s_i} = F_{c_k}$. Its optimal sub-solution has minimum energy $E[i, k]$. As illustrated in Fig. 8, a sub-problem $S[i, k]$ consists of two parts: (a) another sub-problem $S[i-1, m]$ that selects speed settings for the first $i-1$ nodes with node $N_{i-1}$'s sending speed $F_{s_{i-1}} = F_{c_m}$, combined with (b) node $N_i$ with receiving speed $F_{r_i} = F_{c_m}$ and sending speed $F_{s_i} = F_{c_k}$. (a) must be an optimal sub-solution with minimum energy $E[i-1, m]$. (b) has only one node $N_i$ that receives data from (a) at speed $F_{c_m}$; and its sending speed is $F_{c_k}$. Its energy is denoted as $E_{N_i}(F_r = F_{c_m}, F_s = F_{c_k})$. Therefore, $E[i, k] = E[i-1, m] + E_{N_i}(F_r = F_{c_m}, F_s = F_{c_k})$. In the sub-problem $S[i-1, m]$, $F_{c_m}$ can be any choice among $F_{c_1}, F_{c_2}, \ldots, F_{c_C}$. In order to achieve the minimum energy $E[i, k]$, it must be the minimum value among all possible $F_{c_m}$. That is, the optimal sub-structure of this problem can be defined as

$$E[i, k] = \min_{1 \le m \le C} \{E[i-1, m] + E_{N_i}(F_r = F_{c_m}, F_s = F_{c_k})\}$$

The dynamic programming algorithm is illustrated in Fig. 9. Since each $E[i, k]$ can be derived from the previous row $E[i-1, m]$, $m = 1, 2, \ldots, C$, the algorithm can compute all rows of matrix $E$ from $E[1, k], E[2, k], \ldots$, to $E[M, k]$, $k = 1, 2, \ldots, C$, sequentially. The global optimal energy is the minimum value in the last row, $\min_k \{E[M, k]\}$.

The energy matrix $E[i, k]$ and utilization matrix $U[i, k]$ are defined as follows. $U[i, k]$ guarantees that each optimal sub-solution $E[i, k]$ is schedulable. Both $E$ and $U$ are initialized to $\infty$, except $E[0, k] = 0$, and $U[0, k]$ is set to the utilization of the first communication instance $RECV_1$ using communication speed $F_{c_k}$, for $k = 1, 2, \ldots, C$.

$$E[i, k] = \begin{cases} 0 & \text{for } i = 0, \ 1 \le k \le C \\ \min_{1 \le m \le C} \{ E[i-1, m] + E_{N_i}(F_r = F_{c_m}, F_s = F_{c_k}) \} & \text{if } U[i-1, m] + \frac{W_{s_i}}{F_{c_k} D} \le 1, \text{ for } 1 \le i \le M, \ 1 \le k \le C \end{cases} \qquad (12)$$

$$U[i, k] = \begin{cases} \frac{W_{r_1}}{F_{c_k} D} & \text{for } i = 0, \ 1 \le k \le C \\ U[i-1, m] + \frac{W_{s_i}}{F_{c_k} D} & \text{for } m \text{ that achieves } \min\{E[i, k]\} \text{ in (12), for } 1 \le i \le M, \ 1 \le k \le C \end{cases} \qquad (13)$$

    SPEEDSELECTION(W_r[1:M], W_s[1:M], W_p[1:M], F_c[1:C],
                   Scale_r, Scale_s, Scale_p, D, P_ovh)
      for i := 0 to M do
        for k := 1 to C do
          E[i,k] := ∞;  U[i,k] := ∞;  S[i,k] := 0
      for k := 1 to C do
        E[0,k] := 0
        U[0,k] := W_r[1] / F_c[k] / D
      for i := 1 to M do
        for k := 1 to C do
          for m := 1 to C do
            e := E[i-1,m] + E_Ni(F_r = F_c[m], F_s = F_c[k])
            u := U[i-1,m] + W_s[i] / F_c[k] / D
            if u ≤ 1 and e < E[i,k] then
              E[i,k] := e
              U[i,k] := u
              S[i,k] := m
      E_opt, S_opt := retrieve from matrices E, S
      return E_opt, S_opt

Figure 10: Optimal speed selection algorithm.

The algorithm is shown in Fig. 10. The speed matrix $S$ records the previous optimal sub-solutions. The optimal speed setting $S_{opt}$ will be retrieved from $S$. The time complexity of this algorithm is $O(M C^2)$. Note that the algorithm can be modified trivially to fit the case where the first communication speed $F_{r_1}$ and the last communication speed $F_{s_M}$ are fixed. This refers to the situation where the pipelined multi-processor has a fixed communication speed setting to other components while its internal communication speeds can be selected optimally.

Problem 3 (Optimal Partitioning and Speed Selection) Given (a) $M$ pipelined nodes $N_i$ with workload $W_{p_i}, W_{r_i}, W_{s_i}$, $i = 1, 2, \ldots, M$, (b) a deadline $D$ for all nodes, and (c) the available choices for communication speed settings $F_{c_k}$, $k = 1, 2, \ldots, C$, find a partitioning scheme and corresponding communication speed settings that minimize energy $E_{sys}$.

Due to the inter-dependency between speed settings and partitioning schemes, the optimal solution cannot be achieved by solving the two previous problems individually. Exhaustively enumerating over one dimension and dynamic programming over the other is quite expensive, with a time complexity of either $O(2^M M C^2)$ or $O(C^{M+1} M^3)$. We propose a multi-dimensional dynamic programming algorithm, given the fact that the two previous problems are both characterized by optimal sub-structures.
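Returning to the speed selection problem (Problem 2), its recurrence over the last chosen sending speed can be sketched in Python as below. This is a hedged illustration under assumptions, not the authors' code: lists are 0-based, the scaling functions are caller-supplied callables from speed to power, `f_max` stands for the maximum CPU clock rate, and the returned list holds one bit rate per communication instance (the first receive plus one send per node).

```python
import math


def node_energy(w_p, w_r, f_r, w_s, f_s, deadline, f_max,
                scale_p, scale_r, scale_s, p_ovh):
    """Energy of one node, or infinity if it cannot meet the deadline."""
    t_r, t_s = w_r / f_r, w_s / f_s
    t_p = deadline - t_r - t_s            # PROC fills the remaining budget
    if t_p <= 0 or w_p / t_p > f_max:
        return math.inf
    f_p = w_p / t_p
    return (scale_r(f_r) * t_r + scale_s(f_s) * t_s
            + scale_p(f_p) * t_p + p_ovh * deadline)


def optimal_speed_selection(w_r, w_s, w_p, f_c, deadline, f_max,
                            scale_p, scale_r, scale_s, p_ovh):
    """Return (minimum energy, speeds); speeds[i] is the rate of instance i."""
    M, C = len(w_p), len(f_c)
    E = [[math.inf] * C for _ in range(M + 1)]
    U = [[math.inf] * C for _ in range(M + 1)]
    S = [[0] * C for _ in range(M + 1)]
    for k in range(C):                     # row 0: speed of the first receive
        E[0][k] = 0.0
        U[0][k] = w_r[0] / f_c[k] / deadline
    for i in range(1, M + 1):              # node i (1-based)
        for k in range(C):                 # its sending speed
            for m in range(C):             # its receiving speed
                if math.isinf(E[i - 1][m]):
                    continue
                e = E[i - 1][m] + node_energy(
                    w_p[i - 1], w_r[i - 1], f_c[m], w_s[i - 1], f_c[k],
                    deadline, f_max, scale_p, scale_r, scale_s, p_ovh)
                u = U[i - 1][m] + w_s[i - 1] / f_c[k] / deadline
                if u <= 1.0 and e < E[i][k]:
                    E[i][k], U[i][k], S[i][k] = e, u, m
    best_k = min(range(C), key=lambda k: E[M][k])
    if math.isinf(E[M][best_k]):
        return math.inf, []                # no schedulable speed setting
    choice = [best_k]
    for i in range(M, 0, -1):              # follow recorded predecessors
        choice.append(S[i][choice[-1]])
    choice.reverse()
    return E[M][best_k], [f_c[k] for k in choice]
```

The two inner loops over `k` and `m` are what give the $C^2$ factor per node; fixing the first or last speed, as mentioned in the text, would amount to restricting the range of `k` in row 0 or in the final minimum.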
Based on the dynamic programming approaches in the previous problems, we define a sub-problem $PS[k,i,j]$ that maps the original nodes $N_1, N_2, \ldots, N_i$ onto a $k$-node new sub-partitioning $N'_1, N'_2, \ldots, N'_k$, with the last node $N'_k$'s sending speed $F'_{s_k} = F_{c_j}$. The optimal sub-solution has minimum energy $E[k,i,j]$. Similar to the previous problems, a sub-problem $PS[k,i,j]$ can be decomposed with an optimal sub-structure, shown in Fig. 11. Part (a) is a previous sub-problem $PS[k-1,l,m]$, which maps the first $l$ original nodes $N_1, N_2, \ldots, N_l$ onto $k-1$ new nodes with node $N'_{k-1}$'s sending speed selected as $F_{c_m}$. Part (b) is the new node $N'_k$ that combines original nodes $N_{l+1}, \ldots, N_i$ with receiving speed $F_{c_m}$ and sending speed $F_{c_j}$. Part (a) must be an optimal sub-solution with the minimum energy $E[k-1,l,m]$. Note that (b) has only one node, $N'_k$, and its energy is denoted as $E_{N'_k}(F_r = F_{c_m}, F_s = F_{c_j})$. For sub-solution $E[k-1,l,m]$, $l$ can be any value in the range $k-1 \le l \le i-1$ and $F_{c_m}$ is one of the speed choices $F_{c_1}, F_{c_2}, \ldots, F_{c_C}$. $E[k,i,j]$ must be derived from all possible pairs $(l,m)$ to achieve the minimum value. Therefore,

$$E[k,i,j] = \min_{k-1 \le l \le i-1,\ 1 \le m \le C} \{E[k-1,l,m] + E_{N'_k}(F_r = F_{c_m}, F_s = F_{c_j})\}.$$

The algorithm is illustrated in Fig. 12. The three-dimensional matrix $E[k,i,j]$ is represented by a series of two-dimensional sub-matrices indexed by $k = 1, 2, \ldots, M$. Any $E[k,i,j]$ can be computed from entries in the sub-matrix $E[k-1,l,m]$, $k-1 \le l \le i-1$, $1 \le m \le C$. The algorithm constructs all optimal sub-solutions from $E[1,1,1], E[1,1,2], \ldots$ to $E[k,M,j]$, $1 \le k \le M$, $1 \le j \le C$. The global minimum energy is $\min_{k,j} \{E[k,M,j]\}$, that is, the minimum value of the last rows in all sub-matrices. The energy matrix $E[k,i,j]$ and the utilization matrix $U[k,i,j]$ are defined as follows.

Figure 11: The optimal sub-structure of Problem 3. A $k$-node optimal sub-partitioning of the original nodes, where the last sending speed is $F'_{s_k} = F_{c_j}$, with minimum energy $E[k,i,j]$, consists of: (a) a sub-partitioning that maps $l$ nodes $N_1, \ldots, N_l$ onto $k-1$ new nodes $N'_1, \ldots, N'_{k-1}$, where node $N'_{k-1}$'s sending speed is selected as $F'_{s_{k-1}} = F_{c_m}$, with minimum energy $E[k-1,l,m]$; and (b) the last new node $N'_k$, which combines nodes $N_{l+1}, \ldots, N_i$, whose receiving speed is $F_{c_m}$ and sending speed is $F_{c_j}$, with energy $E_{N'_k}(F_r = F_{c_m}, F_s = F_{c_j})$.

Figure 12: The multi-dimensional dynamic programming approach to solve Problem 3. Each entry $E[k,i,j]$ can be computed from the shaded entries in the previous sub-matrix, $E[k-1,l,m]$ with $l = k-1, \ldots, i-1$ and $m = 1, 2, \ldots, C$. The global optimal energy is the minimum value in the last row of all sub-matrices, $E_{opt} = \min\{E[k,M,j]\}$, $k = 1, 2, \ldots, M$, $j = 1, 2, \ldots, C$.

$$E[k,i,j] = \begin{cases} 0 & \text{for } k = i = 0, \\ \min_{l,m} \{E[k-1,l,m] + E_{N'_k}(F_r = F_{c_m}, F_s = F_{c_j})\} \ \text{ if } U[k-1,l,m] + \dfrac{W_{s_i}}{F_{c_j} D} \le 1 & \text{for } 1 \le k \le i \le M \end{cases} \quad (4)$$

$$U[k,i,j] = \begin{cases} \dfrac{W_{r_1}}{F_{c_j} D} & \text{for } k = i = 0, \\ U[k-1,l,m] + \dfrac{W_{s_i}}{F_{c_j} D} \ \text{ for the } (l,m) \text{ that achieve } \min\{E[k,i,j]\} \text{ in (4)} & \text{for } 1 \le k \le i \le M \end{cases} \quad (5)$$

The algorithm is shown in Fig. 13. It combines the two previous algorithms by two-dimensional dynamic programming. The time complexity of the algorithm is $O(M^3 C^2)$. It also applies to situations where the new partitioning has a fixed number of nodes, or where the pipeline has a fixed communication interface to other components while only the internal communication speeds can be selected.
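The combined recurrence for partitioning with speed selection can likewise be sketched in Python. This is an illustrative sketch under stated assumptions rather than the authors' code: `partition_speed_selection`, the `node_energy(first, last, f_r, f_s)` callback for a merged node, and the toy energy model in `make_toy_energy` (CPU energy proportional to work times frequency squared, plus a constant per-node overhead) are hypothetical, and indices are 0-based.

```python
import math

def partition_speed_selection(w_r, w_s, f_c, deadline, node_energy):
    """E[k][i][j]: minimum energy of mapping the first i original nodes onto
    k new nodes, with the last new node sending at speed f_c[j]."""
    n, c = len(w_s), len(f_c)
    E = [[[math.inf] * c for _ in range(n + 1)] for _ in range(n + 1)]
    U = [[[math.inf] * c for _ in range(n + 1)] for _ in range(n + 1)]
    back = [[[None] * c for _ in range(n + 1)] for _ in range(n + 1)]
    for j in range(c):
        E[0][0][j] = 0.0
        U[0][0][j] = w_r[0] / f_c[j] / deadline
    for k in range(1, n + 1):
        for i in range(k, n + 1):
            for j in range(c):
                for l in range(k - 1, i):      # last new node merges l..i-1
                    for m in range(c):         # its receiving speed
                        if math.isinf(E[k - 1][l][m]):
                            continue           # unreachable sub-problem
                        e = E[k - 1][l][m] + node_energy(l, i - 1, f_c[m], f_c[j])
                        u = U[k - 1][l][m] + w_s[i - 1] / f_c[j] / deadline
                        if u <= 1.0 and e < E[k][i][j]:
                            E[k][i][j], U[k][i][j] = e, u
                            back[k][i][j] = (l, m)
    # Global optimum: minimum over the last rows of all sub-matrices.
    e_opt, k, j = min(
        ((E[k][n][j], k, j) for k in range(1, n + 1) for j in range(c)),
        key=lambda t: t[0],
    )
    if math.isinf(e_opt):
        return e_opt, None, None
    groups, speeds, i = [], [j], n
    while k > 0:
        l, m = back[k][i][j]
        groups.append((l, i - 1))  # (first, last) original node, inclusive
        speeds.append(m)
        k, i, j = k - 1, l, m
    return e_opt, groups[::-1], speeds[::-1]

def make_toy_energy(w_p, w_r, w_s, deadline, overhead):
    """Hypothetical node model: the CPU must finish the merged work in the
    time left after communication; CPU energy = work * frequency**2."""
    def energy(first, last, f_r, f_s):
        t_cpu = deadline - w_r[first] / f_r - w_s[last] / f_s
        if t_cpu <= 0:
            return math.inf
        work = sum(w_p[first:last + 1])
        return overhead + work * (work / t_cpu) ** 2 + w_r[first] + w_s[last]
    return energy

w_r, w_s = [1.0, 1.0], [1.0, 1.0]
toy = make_toy_energy([4.0, 4.0], w_r, w_s, 10.0, overhead=100.0)
print(partition_speed_selection(w_r, w_s, [1.0], 10.0, toy))
```

With a large per-node overhead the sketch keeps both blocks on one node; lowering the overhead and raising the computation workload shifts the optimum toward a pipeline, which mirrors the trade-off the formulation is meant to capture.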
    partitioning-speedselection(W_r[1:M], W_s[1:M], W_p[1:M], F_c[1:C], Scale_r, Scale_s, Scale_p, D, P_ovh)
      for k := 0 to M do
        for i := 0 to M do
          for j := 1 to C do
            E[k,i,j] := ∞
            U[k,i,j] := ∞
            P[k,i,j] := 0
            S[k,i,j] := 0
      for j := 1 to C do
        E[0,0,j] := 0
        U[0,0,j] := W_r[1]/F_c[j]/D
      for k := 1 to M do
        for i := k to M do
          for j := 1 to C do
            for l := k-1 to i-1 do
              for m := 1 to C do
                e := E[k-1,l,m] + E_node(merge(N_{l+1},...,N_i), with F_r = F_c[m], F_s = F_c[j])
                u := U[k-1,l,m] + W_s[i]/F_c[j]/D
                if u ≤ 1 and e < E[k,i,j] then
                  E[k,i,j] := e
                  U[k,i,j] := u
                  P[k,i,j] := l
                  S[k,i,j] := m
      E_opt, P_opt, S_opt := retrieve from matrices E, P, S
      return E_opt, P_opt, S_opt

Figure 13: Combined partitioning with speed selection.

7 Analytical Results

To evaluate our energy optimization technique, we experimented with mapping the ATR algorithm [14] (Fig. 3) onto two fixed partitioning schemes: (a) a single node that combines all blocks, and (b) a five-node pipeline that maps each block onto an individual node. Schemes (a) and (b) are two extremes representing serial vs. parallel schemes. For both (a) and (b) we apply optimal speed selection. We also find the optimal partitioning with speed selection as (c) and compare it with (a) and (b) under three types of performance requirements: (1) high performance, $D$ = ms, (2) moderate performance, $D$ = 5ms, and (3) low performance, $D$ = ms.

Each node consists of an XScale processor and an LXT- Ethernet interface from Intel. The $Scale_p$ and $Scale_s$ (same as $Scale_r$) functions, which indicate the power vs. performance characteristics of a node, are extracted from their data sheets [2, 3] and are shown in Fig. 14 and 15. Besides the power draw from the CPU and communication interfaces, we assume each node has a constant power draw $P_{ovh}$ = mW.

The results are presented in Fig. 16. In all cases, the highest communication speed is always the optimal speed setting. The low-power, lowest-speed communication setting results in the highest energy. This is because it leaves so little time for computation that the processors must run faster, with more energy, to meet the deadline, and because it has the highest energy-per-bit rating. Low-speed communication also tends to violate the schedulability conditions (Lemma 3).
Given the properties of this particular Ethernet interface, communication at the highest speed will always lead to the lowest energy consumption, since it requires the least amount of energy per bit and leaves the maximum time budget for reducing CPU energy. However, in cases where the energy-per-bit rating does not decrease monotonically with the communication speed, the optimal speed setting may involve some combination of low-speed and high-speed settings between different nodes. For example, node $N_i$ may communicate with $N_{i-1}$ at one speed and with $N_{i+1}$ at a different one.

Figure 16: Analytical results. Energy per frame (mJ), broken down into computation, communication, and overhead, for (1) high performance: (a) 1-node, (b) 5-node, (c) optimal [N1,N2],[N3,N4],[N5]; (2) moderate performance: (a) 1-node, (b) 5-node, (c) optimal [N1,N2,N3,N4],[N5]; (3) low performance: (a) 1-node (optimal), (b) 5-node.

Fig. 16(1) shows the energy consumption per image frame in the three partitioning schemes. With a tight performance constraint, the single node (a) is heavily loaded with computation. Therefore it is desirable to reduce CPU energy by pipelining. As a result, the five-node pipeline (b) is more energy-efficient, at the cost of additional communication and overhead. However, the optimal partitioning is (c) with three nodes: [N1,N2],[N3,N4],[N5]. It consumes more CPU energy than (b), but overall it is optimal, with less energy spent on communication and overhead.

Figure 14: Power vs. performance of the XScale processor.

Figure 15: Power modes of the Ethernet interface.

    Mode    Power consumption
    bps     8 mW
    bps     .5W
    bps     6W

In the case of the moderate performance constraint (Fig. 16(2)), (a) is still dominated by computation, but it is not heavily loaded due to the relaxed deadline. The reduction of CPU energy by (b) cannot compensate for the added overhead of new nodes and communication. Therefore (a) is better than (b), and pipelining seems inefficient. However, the optimal partitioning (c) is still a pipelined solution. It combines N1, N2, N3, N4 into one node and maps N5 to another node. Scheme (c) achieves minimum energy by appropriately balancing computation and communication against pipelining overhead.
In cases where the performance is not critical, pipelining is not efficient and the serial solution (a) is optimal. Fig. 16(3) shows that the computation load on (a) is very light. Introducing additional nodes will only save marginal CPU energy, which will be offset by the extra communication and overhead.
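The role of the energy-per-bit rating in this discussion can be illustrated with a small calculation. The sketch below is only illustrative: the three modes and their power and bit-rate values are hypothetical placeholders, not the data sheet figures, and `energy_per_bit` and `transfer_cost` are names introduced here.

```python
def energy_per_bit(power_w, rate_bps):
    """Energy (J) spent per transmitted bit at a given power and bit rate."""
    return power_w / rate_bps

def transfer_cost(bits, power_w, rate_bps):
    """Return (time in s, energy in J) for sending `bits` at one speed mode."""
    seconds = bits / rate_bps
    return seconds, power_w * seconds

# Hypothetical modes: (power in W, rate in bit/s). Power grows more slowly
# than the bit rate, so the fastest mode has the lowest energy per bit.
MODES = {"slow": (1.0, 10e6), "medium": (2.0, 100e6), "fast": (5.0, 1000e6)}
FRAME_BITS = 1e6  # hypothetical frame size

for name, (power, rate) in MODES.items():
    seconds, joules = transfer_cost(FRAME_BITS, power, rate)
    print(f"{name}: {seconds:.6f} s, {joules:.6f} J per frame")

best = min(MODES, key=lambda name: energy_per_bit(*MODES[name]))
print("lowest energy per bit:", best)
```

When power rises more slowly than the bit rate, as assumed here, the fastest mode both costs the least energy per bit and leaves the most time for computation. If a mode's power rose faster than its bit rate, the ranking would change and mixed speed settings could become optimal.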

8 Conclusion

We present a combined partitioning and speed selection technique for the energy optimization of embedded multiprocessor-on-chip architectures with high-speed on-chip networks. As communication power approaches or surpasses that of processor power, communication must be treated as a primary concern in system-level energy optimization. We exploit the multi-speed feature of modern high-speed communication interfaces as an effective way to complement and enhance today's CPU-centric power optimization approaches. In such systems, communication and computation compete over opportunities for operating at the most energy-efficient points. It is critical not only to balance the load among processors by functional partitioning, but also to balance the speeds between communication and computation on each node and across the whole system. Our multi-dimensional dynamic programming formulation is exact and is of polynomial time complexity. It produces energy-optimal solutions as defined by a partitioning scheme and by the speed selections for all computation and communication tasks. We expect this technique to be applicable to a large class of data-dominated systems-on-chip that can be structured in a pipelined organization.

References

[1] The Alchemy Au from AMD: Internet edge processor. info/- au/index.html.
[2] INTEL ethernet PHYs/transceivers. intel.com/design/network/products/ethernet/- linecard ept.htm.
[3] INTEL XScale microarchitecture. intel.com/design/intelxscale/.
[4] N. K. Bambha, S. S. Bhattacharyya, J. Teich, and E. Zitzler. Hybrid global/local search strategies for dynamic voltage scaling in embedded multiprocessors. In Proc. International Symposium on Hardware/Software Codesign, pages 43 48.
[5] L. Benini and G. De Micheli. Networks on chips: a new SoC paradigm. IEEE Computer, 35(1):70–78, Jan. 2002.
[6] R. Cherabuddi, M. Bayoumi, and H. Krishnamurthy. A low power based system partitioning and binding technique for multi-chip module architectures. In Proc. Great Lakes Symposium on VLSI, pages 56 6, 1997.
[7] P. Eles, A. Doboli, P. Pop, and Z. Peng. Scheduling with bus access optimization for distributed embedded systems. IEEE Transactions on VLSI Systems, 8(5):47 49.
[8] E. Huwang, F. Vahid, and Y.-C. Hsu. FSMD functional partitioning for low power. In Proc. Design, Automation and Test in Europe, pages 8, 1999.
[9] P. V. Knudsen and J. Madsen. Integrating communication protocol selection with hardware/software codesign. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 8(8):77 95, August 1999.
[10] K. Lahiri, A. Raghunathan, and G. Lakshminarayana. LOTTERYBUS: a new high-performance communication architecture for system-on-chip designs. In Proc. Design Automation Conference, pages 5, June.
[11] J. Luo and N. K. Jha. Battery-aware static scheduling for distributed real-time embedded systems. In Proc. Design Automation Conference, June.
[12] R. Ortega and G. Borriello. Communication synthesis for distributed embedded systems. In Proc. International Conference on Computer-Aided Design, 1998.
[13] M. Sgroi, M. Sheets, A. Mihal, K. Keutzer, S. Malik, J. Rabaey, and A. Sangiovanni-Vincentelli. Addressing the system-on-a-chip interconnect woes through communication-based design. In Proc. Design Automation Conference, June.
[14] R. Sims. Signal to clutter measurement and ATR performance. Proc. of the SPIE - The International Society for Optical Engineering, 337():3 7, April 1998.
[15] A. Wang and A. Chandrakasan. Energy efficient system partitioning for distributed wireless sensor networks. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pages 95 98, May.
[16] E. F. Weglarz, K. K. Saluja, and M. H. Lipasti. Minimizing energy consumption for high-performance processing. In Proc. Asian and South Pacific Design Automation Conference, pages 99 4.
[17] W. Wolf. An architectural co-synthesis algorithm for distributed embedded computing systems. IEEE Transactions on VLSI Systems, pages 8 9, June 1997.
