Combined Functional Partitioning and Communication Speed Selection for Networked Voltage-Scalable Processors


Jinfeng Liu, Pai H. Chou, Nader Bagherzadeh
Department of Electrical & Computer Engineering
University of California, Irvine, CA, USA

Categories and Subject Descriptors
C.3 [SPECIAL-PURPOSE AND APPLICATION-BASED SYSTEMS]: Real-time and embedded systems

General Terms
Design, Performance, Algorithms

Keywords
functional partitioning, communication speed selection, communication/computation trade-offs, embedded multi-processor, low-power design

ABSTRACT
This paper presents a new technique for global energy optimization through coordinated functional partitioning and speed selection for embedded processors interconnected by a high-speed serial bus. Many such serial interfaces are capable of operating at multiple speeds and can open up a new dimension of trade-offs to complement today's CPU-centric voltage scaling techniques for processors. We propose a multi-dimensional dynamic programming formulation for energy-optimal functional partitioning with CPU/communication speed selection for a class of data-regular applications under performance constraints. We demonstrate the effectiveness of our optimization techniques with an image processing application mapped onto a multi-processor architecture with a multi-speed Ethernet.

This research was sponsored by a DARPA grant and a Printronix Fellowship.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
ISSS'02, October 2-4, 2002, Kyoto, Japan. Copyright 2002 ACM.

1. INTRODUCTION
A key trend in embedded systems is towards the use of high-speed serial busses for system-level interconnect. High-speed serial controllers such as Ethernet are now an integral part of many embedded processors. Newer protocols such as FireWire (IEEE 1394) and USB are commonly used not only for peripheral devices but also for connecting embedded processors. Many have advocated high-speed, serial packet networks for systems-on-chip for their compelling advantages, including modularity, composability, scalability, form factor, and power efficiency.

For power optimization, previous efforts focused on the processor for several reasons. The CPU was the main consumer of power, and it also offered the most options for power management, including voltage scaling. However, recent advances in both processors and communication interfaces are driving a shift in how power should be managed.

Low-power CPU, High-power Communication
CPU-centric power management has given rise to a new generation of processors with dramatically improved power efficiency, and the CPU is now drawing a smaller percentage of the overall system power. The insatiable demand for bandwidth has also resulted in high-speed communication interfaces. Even though their power efficiency (i.e., energy per bit transmitted) has also improved, communication power now matches or surpasses that of the CPU, and is thus a larger fraction of the system power. For instance, the Intel XScale processor consumes 1.6 W at full speed, while a GigaBit Ethernet interface consumes 6 W.

Multi-speed Communication Interfaces
Many communication interfaces today support multiple data rates. However, the scaling effects tend to be the opposite of those of voltage-scalable CPUs. For CPUs, slower speed generally means lower power and lower energy per instruction; but for communication, faster speed means higher power but often less energy per bit. This is highly dependent on the specific controller. Few research works to date have explored communication speed as a key parameter for power optimization.

Speed Selection and Functional Partitioning
Speed selection cannot be performed for just communication or computation in isolation, because a local decision can have a global impact.
The CPUs cannot all be run at the slowest, most power-efficient speeds, because they must compete for the available time and power with each other and with the communication interfaces. A faster communication speed, even at a higher energy per bit, can save energy by creating opportunities for voltage scaling the processors. Greedily saving communication power may actually result in higher overall energy. At the same time, functional partitioning must be an integral part of the optimization loop, because different partitioning schemes can dramatically alter the communication and computation workload for each node.

Approach
For a given workload on a networked architecture, our problem statement is to generate a functional partitioning scheme and to select the speeds of communication interfaces and processors, such that the total energy is minimized. In general, this problem is extremely difficult. Fortunately, for a class of systems with pipelined multiple processors under a latency constraint, efficient, exact solutions exist. We construct such a system model and formulate the energy consumed by the processors and communication interfaces with their power/speed scaling factors within their available time budget. In [8], we presented the schedulability conditions and the problem of communication speed selection and sketched solutions by exhaustive search. This paper combines communication speed selection with functional partitioning and presents an efficient multi-dimensional dynamic programming solution to minimize system energy. We demonstrate the effectiveness of this technique with an image processing algorithm mapped onto a pipelined multi-processor architecture interconnected by a GigaBit Ethernet.

2. RELATED WORK
Previous works have explored communication synthesis and optimization in distributed multi-processor systems. [13] presents communication scheduling to work with rate-monotonic tasks, while [5] assumes the more deterministic time-triggered protocol (TTP). [10] distributes timing constraints on communication among segments through priority assignment on serial busses (such as controller-area network) and customization of device drivers. While these assume a bus or a network protocol, LYCOS [7] integrates the ability to select among several communication protocols (with different delays, data sizes, burstiness) into the main partitioning loop. These techniques do not specifically optimize for energy by exploiting the processors' voltage scaling capabilities or the characteristics of the communication interfaces' power consumption.

Related techniques that optimize for power consumption of processors typically assume a fixed communication data rate.
[3] uses simulated heating search strategies to find low-power design points for voltage-scalable embedded processors. [9] performs battery-aware task post-scheduling for distributed, voltage-scalable processors by moving tasks to smooth the power profile. [12, 11] propose partitioning the computation onto a multi-processor architecture that consumes significantly less power than a single processor. [4] reduces switching activities of both functional units and communication links by partitioning tasks onto a multi-chip architecture, while [6] maximizes the opportunity to shut down idle processors through functional partitioning. All these techniques focus on the computational aspect without exploring the speed/power scalability of the communication interfaces.

Existing techniques cannot be readily combined to explore the many timing/power trade-offs between computation and communication. The quadratic voltage scaling properties of CPUs do not generalize to communication interfaces. Even if they did, these techniques have not considered the partitioning of power and timing budgets among computation and communication components across the network. Selecting communication attributes by considering only deadlines, without power, will lead to unexpected, often incorrect results at the system level.

3. SYSTEM MODEL
This section defines a system-level performance/energy model for both computation and communication components in a networked, multiple-processor embedded system. In this paper, such a system consists of $M$ processing nodes $N_i$, $i = 1, 2, \ldots, M$, connected by a shared communication medium. Each processing node (or node for short) consists of a processor, a local memory, and one or more communication interfaces that send and/or receive data from other nodes.

A processing job assigned to a node is modeled in terms of three tasks: RECV, PROC, and SEND, which must be executed serially in that order. RECV and SEND are communication tasks on the interfaces, and PROC is a computation task on the processor.
For communication tasks RECV and SEND, workloads $W_r$ and $W_s$ indicate the number of bits to be received and sent, respectively. For the computation task PROC, the workload $W_p$ is the number of cycles. Let $T_p$, $T_r$, $T_s$ denote the delays of tasks PROC, RECV and SEND, respectively. Let $F_p$ denote the clock frequency of the processor, and $F_r$ and $F_s$ the respective data bit rates for receiving and sending. We have

\[ T_p = \frac{W_p}{F_p}, \quad T_r = \frac{W_r}{F_r}, \quad T_s = \frac{W_s}{F_s} \tag{1} \]

(1) is reasonable for processors executing data-dominated programs, where the total cycle count $W_p$ can be analyzed and bounded statically. To model non-ideal aspects of the medium, we introduce the communication efficiency terms $\rho_r$ and $\rho_s$, where $0 \le \rho_r, \rho_s \le 1$, such that $T_r = W_r / (\rho_r F_r)$ and $T_s = W_s / (\rho_s F_s)$. Note that $\rho_r$ and $\rho_s$ need not be constants, but may be functions of the communication speeds $F_r$, $F_s$. For brevity, our experimental results assume an ideal communication medium ($\rho_r = \rho_s = 1$) without loss of generality.

$D$ is a deadline on each processing job, which requires $T_r + T_p + T_s \le D$ for the three serialized tasks. If any slack time exists, then we assume we can always slow down task PROC by voltage scaling to reduce energy, based on the capability of modern embedded processors. Therefore, we convert the inequality into an equality in the deadline equation. That is,

\[ D = T_r + T_p + T_s \tag{2} \]

We assume a processor's voltage-scaling characteristics can be expressed by a scaling function $Scale_p$ that maps the CPU's frequency to its power level. A communication interface also has scaling functions $Scale_s$ and $Scale_r$ for sending and receiving. (2) implies $Scale_p$ is continuous, while communication interfaces support only a few discrete scaling points. Let $P_p$, $P_r$, and $P_s$ denote the power for the processor, receiving, and sending, respectively. Then,

\[ P_p = Scale_p(F_p), \quad P_r = Scale_r(F_r), \quad P_s = Scale_s(F_s) \tag{3} \]

Let $P_{ovh}$ denote the power overhead associated with adding a node to the system.
It captures the power of the memory, the minimum power of the CPU and communication interface, the CPU's power during RECV and SEND (DMA), and the communication interfaces' power during PROC.

The energy consumption of a task is the power-delay product. Let $E_p$, $E_r$, $E_s$, and $E_{ovh}$ denote the energy consumption of tasks PROC, RECV, SEND, and the overhead of a node, respectively. Let $E_{N_i}$ denote the total energy of node $N_i$. Finally, the total energy of the system is the sum of the energy consumption on each node. To summarize,

\[ E_p = P_p T_p, \quad E_r = P_r T_r, \quad E_s = P_s T_s, \quad E_{ovh} = P_{ovh} D \tag{4} \]
\[ E_{N_i} = E_p + E_r + E_s + E_{ovh} \tag{5} \]
\[ E_{sys} = \sum_{i=1}^{M} E_{N_i} \tag{6} \]
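As a minimal sketch of how Eqs. (1) to (5) combine, the following Python function evaluates the energy of one node for a chosen pair of communication speeds. It treats the deadline as an equality per Eq. (2), so the CPU frequency follows from the time left after communication. The names `Node`, `node_energy`, `cpu_power`, and `comm_power`, the cubic CPU power curve, the 1 GHz frequency cap, and all numeric values are illustrative assumptions rather than values from the paper; an ideal medium ($\rho_r = \rho_s = 1$) is assumed.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Node:
    w_r: float  # bits to receive (W_r)
    w_p: float  # CPU cycles to execute (W_p)
    w_s: float  # bits to send (W_s)


def node_energy(node: Node, f_r: float, f_s: float, deadline: float,
                scale_p: Callable[[float], float],
                scale_r: Callable[[float], float],
                scale_s: Callable[[float], float],
                p_ovh: float, f_p_max: float) -> Optional[float]:
    """Energy of one node per Eqs. (1)-(5); None if the deadline cannot be met."""
    t_r = node.w_r / f_r            # Eq. (1): receive delay
    t_s = node.w_s / f_s            # Eq. (1): send delay
    t_p = deadline - t_r - t_s      # Eq. (2): all remaining time goes to PROC
    if t_p <= 0:
        return None
    f_p = node.w_p / t_p            # slowest clock that still meets the deadline
    if f_p > f_p_max:               # assumed upper bound on the CPU clock
        return None
    e_p = scale_p(f_p) * t_p        # Eqs. (3)-(4)
    e_r = scale_r(f_r) * t_r
    e_s = scale_s(f_s) * t_s
    e_ovh = p_ovh * deadline
    return e_p + e_r + e_s + e_ovh  # Eq. (5)


# Hypothetical scaling functions, for illustration only.
COMM_POWER = {10e6: 0.8, 100e6: 1.5, 1000e6: 6.0}  # bit/s -> watts


def cpu_power(f_hz: float) -> float:
    return 1.6 * (f_hz / 1e9) ** 3  # watts, assumed cubic in frequency


def comm_power(f_bps: float) -> float:
    return COMM_POWER[f_bps]


node = Node(w_r=1e6, w_p=5e6, w_s=1e6)
energy = node_energy(node, 1000e6, 1000e6, 0.01,
                     cpu_power, comm_power, comm_power, 0.1, 1e9)
```

Summing `node_energy` over all nodes gives $E_{sys}$ as in Eq. (6). Returning `None` for an infeasible speed choice is a design choice of this sketch, so that a caller can skip such choices explicitly.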

[Figure 1: Timing and power properties of a processing node. (a) block diagram; (b) timing-power diagram, showing delay, power, and speed for RECV, PROC, and SEND, plus the overhead power $P_{ovh}$.]

[Figure 2: A three-node pipeline. (a) serialized timing diagram; (b) pipelined timing diagram.]

Fig. 1 shows the timing and power breakdown of the tasks on a node. The gray bar represents the overhead, while the white bars represent tasks RECV, PROC and SEND. The area of a bar represents the energy consumption by the corresponding task or overhead.

This paper considers a special case called an M-node pipeline. It consists of identical nodes $N_i$, $i = 1, 2, \ldots, M$, as characterized by $Scale_p$, $Scale_r$, $Scale_s$, $E_{ovh}$. Each node $N_i$ receives $W_{r_i}$ bits of data from the previous node $N_{i-1}$ (except $N_1$), processes the data in $W_{p_i}$ cycles, and sends the $W_{s_i}$-bit result to the next node $N_{i+1}$ (except $N_M$). Each $SEND_i \rightarrow RECV_{i+1}$ communication pair sends and receives the same amount of data at the same communication speed, with the same communication delay, and we assume they start and finish at the same time. That is, $W_{s_i} = W_{r_{i+1}}$, $F_{s_i} = F_{r_{i+1}}$, $T_{s_i} = T_{r_{i+1}}$. All nodes have the same deadline $D$, and each node acts as a pipeline stage with delay $D$.

Fig. 2 shows an example of a three-node pipeline. For brevity, the overhead is not shown. Fig. 2(b) shows the pipelined timing diagram obtained by folding the tasks in Fig. 2(a) into a common interval with duration $D$, which is the delay of each pipeline stage. [8] presented the schedulability conditions for an M-node pipeline based on collision and utilization of the shared communication medium.

An M-node pipeline can be partitioned and mapped onto an M'-node pipeline ($M' \le M$) by merging adjacent nodes $N_i, N_{i+1}, \ldots, N_j$ ($i \le j$) into a new node $N'_k$.
The new node $N'_k$ combines all computation workload, receives $W_{r_i}$ bits of data, and sends $W_{s_j}$ bits of data. Communication within a node becomes local data access. That is, $W'_{p_k} = \sum_{l=i}^{j} W_{p_l}$, $W'_{r_k} = W_{r_i}$, and $W'_{s_k} = W_{s_j}$. The new M'-node pipeline is called a partitioning of the initial M-node pipeline.

4. MOTIVATING EXAMPLE
We use an automatic target recognition (ATR) algorithm (Fig. 3) as our motivating example. Originally it is a serial algorithm. We reconstructed a parallel version and mapped it onto pipelined multiple processors. Pipelining allows each processor to run at a much slower speed with a lower voltage level to reduce overall computation energy, while parallelism compensates for the performance. Of course, having extra processors costs energy overhead for inter-processor communication, memory, etc.

[Figure 3: Stages of the ATR algorithm. N1: Target Detection; N2: FFT; N3: Filter; N4: IFFT; N5: Compute Distance. Each stage's output feeds the next ($W_{s_1} = W_{r_2}$, $W_{s_2} = W_{r_3}$, $W_{s_3} = W_{r_4}$, $W_{s_4} = W_{r_5}$).]

[Figure 4: The impact of different partitioning schemes and communication speed settings. (a) A fine-grain partitioning scheme reduces energy on computation, at the cost of inter-processor communication and overhead of additional nodes. (b) The combined node (N1 and N2 merged) reduces communication and overhead, but it requires more energy for computation. (c) The computation energy can be reduced by high-speed communication, which leaves more time for computation.]

Task to Node Mapping
Given the decomposition of the ATR algorithm into five stages, several partitioning schemes are possible for mapping them onto a number of pipelined nodes. Fig. 4 shows an example by considering how to map the first two stages onto (a) two nodes and (b) one node. In Fig. 4(a), mapping onto two nodes N1 and N2 enables both processors to operate at a reduced speed (300 MHz) for computation.
The two nodes together consume lower computation energy than one node at a faster speed but must pay the price of communication energy for $SEND_1 \rightarrow RECV_2$. In Fig. 4(b), even though merging the two stages onto one node eliminates the $SEND_1 \rightarrow RECV_2$ communication, the CPU must execute the combined computation workload at a faster clock rate (600 MHz), a less energy-efficient level. Zooming out, many partitioning schemes are possible, even when limited to a pipelined organization. For example, one partitioning [N1, N2], [N3, N4, N5] may be optimal for nodes N1 and N2, but it will preclude another solution [N1], [N2, N3], [N4, N5] that may lead to less energy for the whole system.

Speed Selection for CPU and Communication
The selection of communication speed is an equally critical issue. For example, a 10/100/1000 Base-T Ethernet interface can consume more power than a CPU at high (100/1000 Mbps) speeds, but less power at the slower, 10 Mbps data rate. In Fig. 4(b), the processor must operate at a high clock rate due to the low-speed communication at 10 Mbps. Because of the deadline $D$, communication and computation compete for this time budget. Low-speed communication leaves less time for computation, thereby forcing the processor to run faster to meet the deadline. Conversely, high-speed communication could free up more time budget for computation, as shown in Fig. 4(c), where the CPU's clock rate is dropped to 300 MHz. Although extra energy must be allocated to high-speed communication, if the energy saving on the CPU compensates for this cost, then (c) is more energy-efficient than (b).
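The competition for the time budget can be made concrete with a small calculation. The sketch below computes the CPU clock a node needs once its communication time is subtracted from the deadline, following $T_p = D - T_r - T_s$ and $F_p = W_p / T_p$. The function name `required_cpu_frequency` and the workload figures (3M cycles, 25 Kbit in and out, 10 ms deadline) are hypothetical, chosen only so that the results land in the same range as the clock rates discussed above.

```python
from typing import Optional


def required_cpu_frequency(w_p: float, w_r: float, w_s: float,
                           f_comm: float, deadline: float) -> Optional[float]:
    """Clock rate (Hz) needed to finish w_p cycles in the time left after
    receiving w_r bits and sending w_s bits at f_comm bit/s.
    Returns None when communication alone uses up the deadline."""
    t_comm = w_r / f_comm + w_s / f_comm
    t_p = deadline - t_comm
    if t_p <= 0:
        return None
    return w_p / t_p


# Hypothetical workload: 3M cycles, 25 Kbit in, 25 Kbit out, 10 ms deadline.
for rate in (10e6, 100e6, 1000e6):
    f_p = required_cpu_frequency(3e6, 25e3, 25e3, rate, 10e-3)
    print(f"{rate / 1e6:6.0f} Mbps -> CPU at {f_p / 1e6:6.1f} MHz")
```

With these assumed numbers, 10 Mbps communication takes half of the deadline and forces a 600 MHz clock, while 1000 Mbps leaves almost the whole deadline and allows roughly 302 MHz. Whether the faster link pays off then depends on the power curves of the CPU and the interface.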

The communication-computation interaction becomes more intricate in a multi-processor environment. Any data dependency between different nodes must involve their communication interfaces. The communication speed of a sender will not only determine the receiver's communication speed but also influence the choice of the receiver's computation speed. The communication speed on the first node of the pipeline will have a chain effect on all other nodes in the system. A locally optimal speed for the first node will not necessarily lead to a globally optimal solution.

Combining Partitioning and Speed Selection
Given a fixed partitioning scheme, the designers can always find the corresponding optimal speed setting that minimizes energy for that scheme. However, energy-optimal speed selection for one partitioning is not necessarily optimal over all partitionings. Instead, partitioning and speed selection are mutually enabling. In this paper, we take a multi-dimensional optimization approach that considers performance requirements, schedulability, load balancing, communication-computation trade-offs, and multi-processor overhead in a system-level context.

5. PROBLEM FORMULATION
Given an M-node pipeline, choices of partitioning and communication speed settings will lead to different levels of energy consumption at the system level. This section formulates three energy minimization problems: by partitioning, by communication speed selection, and by both. In the first two problems, the optimal solution can be obtained by dynamic programming, and the combined optimization problem can be solved by multi-dimensional dynamic programming.

Problem 1 (Optimal Partitioning)
Given (a) $M$ pipelined nodes $N_i$ with workload $W_{p_i}$, $W_{r_i}$, $W_{s_i}$, $i = 1, 2, \ldots, M$, (b) a deadline $D$ for all nodes, and (c) the constraint that the speed settings of all communication instances must match: $F_{r_1}$; $F_{s_i} = F_{r_{i+1}}$ for $i = 1, 2, \ldots, M-1$; $F_{s_M}$, find a partitioning scheme that minimizes energy $E_{sys}$.

To avoid exhaustive enumeration in the $O(2^{M-1})$ solution space, we construct a series of optimal solutions to sub-problems by mapping the original $M$ nodes one by one onto new sub-partitionings.
We compute the optimal cost function in terms of the minimum energy consumption over the sub-partitionings. Upon mapping each node, the new optimal sub-solution can be computed from past optimal sub-solutions. Therefore, a dynamic programming approach is applicable.

For dynamic programming, we use an energy matrix $E$ to store the cost function. Each entry $E[i, j]$ indicates the minimum energy of a sub-problem that maps the first $j$ original nodes $N_1, N_2, \ldots, N_j$ onto a new sub-partitioning with $i$ nodes $N'_1, N'_2, \ldots, N'_i$. Matrix $E$ is initialized to $\infty$.

\[ E[i, j] = \begin{cases} 0 & \text{for } i = j = 0 \\ \min\limits_{i-1 \le l \le j-1} \left( E[i-1, l] + E_{N'_i} \right) & \text{for } 1 \le i \le j \le M \end{cases} \tag{7} \]

(7) indicates that the optimal $i$-node sub-partitioning that maps the first $j$ original nodes must be a combination of the following: (a) a sub-partitioning that maps the first $l$ original nodes $N_1, N_2, \ldots, N_l$ to $i-1$ new nodes, and (b) the $i$-th new node $N'_i$ that combines the original nodes $N_{l+1}, \ldots, N_j$. The sub-partitioning (a) must be optimal with minimum energy $E[i-1, l]$. (b) has only one node $N'_i$; its energy is denoted as $E_{N'_i}$. Since $E[i, j]$ is the optimal energy for the sub-problem, it must be the minimum value of (7) among all possible choices of $l$.

The dynamic programming algorithm can iterate (7) from $i = j = 0$ until $i = j = M$. Each optimal sub-solution $E[i, j]$ can be derived from previously computed $E[i-1, l]$. Finally, the minimum energy is $\min(E[i, M])$, $i = 1, 2, \ldots, M$. We omit the algorithm for brevity. Its time complexity is $O(M^3)$.

Problem 2 (Optimal Communication Speed Selection)
Given (a) a fixed partitioning scheme with $M$ pipelined nodes $N_i$ with workload $W_{p_i}$, $W_{r_i}$, $W_{s_i}$, $i = 1, 2, \ldots, M$, (b) a deadline $D$ for all nodes, and (c) the available choices for communication speed settings $F_{c_k}$, $k = 1, 2, \ldots, C$, find all processor speeds $F_{p_i}$ and communication speeds $F_{r_i}$, $F_{s_i}$ that minimize energy $E_{sys}$.

We again use dynamic programming, as opposed to exhaustive search in the $O(C^{M+1})$ solution space. During step $i$, when processing node $N_i$, we only select the communication speeds $F_{r_i}$, $F_{s_i}$ of $N_i$, because they determine $F_{p_i}$, and the previous speed settings of the sub-problems have already been selected to be optimal.
For each choice of $F_{r_i}$, $F_{s_i}$, we compute the energy of node $N_i$, plus the optimal energy of a sub-problem computed by step $i-1$ with $F_{s_{i-1}} = F_{r_i}$, to find the optimal energy of the new sub-problem in step $i$. Each element $E[i, k]$ in the energy matrix $E$ indicates the minimum energy of a sub-problem. It has $i$ nodes $N_1, N_2, \ldots, N_i$, with the last node $N_i$'s sending speed selected to be the $k$-th speed choice $F_{c_k}$. $E$ is initialized to $\infty$.

\[ E[i, k] = \begin{cases} 0 & \text{for } i = 0 \\ \min\limits_{1 \le m \le C} \left( E[i-1, m] + E_{N_i}(F_{r_i} = F_{c_m}, F_{s_i} = F_{c_k}) \right) & \text{for } 1 \le i \le M \end{cases} \tag{8} \]

(8) indicates that the optimal speed setting for the sub-problem up to node $N_i$ whose sending speed is $F_{s_i} = F_{c_k}$ is determined by: (a) a previous optimal sub-solution where node $N_{i-1}$'s sending speed is $F_{s_{i-1}} = F_{c_m}$, plus (b) node $N_i$, whose receiving speed is $F_{r_i} = F_{c_m}$ and sending speed is $F_{s_i} = F_{c_k}$. (a) includes $i-1$ nodes $N_1, N_2, \ldots, N_{i-1}$ and communicates with (b) at speed $F_{c_m}$. The optimal energy of sub-problem (a) is $E[i-1, m]$. (b) has only one node $N_i$ that receives data from (a) at speed $F_{c_m}$, and its sending speed is $F_{c_k}$. Its energy is denoted as $E_{N_i}(F_{r_i} = F_{c_m}, F_{s_i} = F_{c_k})$. Since $E[i, k]$ is optimal, it must be the minimum value among all possible speed settings $F_{c_m}$ in (8).

The algorithm is omitted for brevity. It iterates (8) until $i = M$, $k = C$. Each $E[i, k]$ can be derived from previously computed $E[i-1, m]$. The global minimum energy is $\min(E[M, k])$, $k = 1, 2, \ldots, C$. The time complexity of the algorithm is $O(MC^2)$.

Problem 3 (Optimal Partitioning and Speed Selection)
Given (a) $M$ pipelined nodes $N_i$ with workload $W_{p_i}$, $W_{r_i}$, $W_{s_i}$, $i = 1, 2, \ldots, M$, (b) a deadline $D$ for all nodes, and (c) the available choices for communication speed settings $F_{c_k}$, $k = 1, 2, \ldots, C$, find a partitioning scheme and corresponding communication speed settings that minimize energy $E_{sys}$.

Due to the inter-dependency between speed setting and partitioning, the optimal solution cannot be achieved by solving the two previous problems individually. Exhaustively enumerating over one dimension and dynamic programming over the other is quite expensive, with a time complexity of either $O(2^{M-1} M C^2)$ or $O(C^{M+1} M^3)$. We propose a multi-dimensional dynamic programming algorithm, given that the previous two problems can each be solved by dynamic programming independently.

Based on the previous two dynamic programming approaches, the energy matrix $E$ for the combined problem is defined as follows: each element $E[i, j, k]$ stores the minimum energy of a sub-problem that maps the first $j$ original nodes $N_1, N_2, \ldots, N_j$ onto a new $i$-node sub-partitioning, whose last node $N'_i$ has sending speed $F_{s_i} = F_{c_k}$.

\[ E[i, j, k] = \begin{cases} 0 & \text{for } i = j = 0 \\ \min\limits_{\substack{i-1 \le l \le j-1 \\ 1 \le m \le C}} \left( E[i-1, l, m] + E_{N'_i}(F_{r_i} = F_{c_m}, F_{s_i} = F_{c_k}) \right) & \text{for } 1 \le i \le j \le M \end{cases} \tag{9} \]

The optimal energy $E[i, j, k]$ is derived from: (a) $E[i-1, l, m]$ of a previous optimal sub-solution, which maps $l$ original nodes $N_1, \ldots, N_l$ onto $i-1$ new nodes $N'_1, \ldots, N'_{i-1}$ with the last node $N'_{i-1}$'s sending speed selected to be $F_{c_m}$, plus (b) the new node $N'_i$ that combines original nodes $N_{l+1}, \ldots, N_j$ with receiving speed $F_{c_m}$ and sending speed $F_{c_k}$. The sub-solution (a) has the optimal energy $E[i-1, l, m]$. Note that (b) has only one node $N'_i$, and its energy is denoted as $E_{N'_i}(F_{r_i} = F_{c_m}, F_{s_i} = F_{c_k})$. $E[i, j, k]$ must be derived from all possible pairs $(l, m)$ to achieve the minimum value of (9).

The algorithm is shown in Fig. 5. It combines the two previous algorithms by two-dimensional dynamic programming.

    partitioning-speedselection(W_r[1:M], W_s[1:M], W_p[1:M],
                                F_c[1:C], Scale_r, Scale_s, Scale_p, D, P_ovh)
      for i := 0 to M do
        for j := 0 to M do
          for k := 1 to C do
            E[i,j,k] := ∞
            initialize U[i,j,k], P[i,j,k], S[i,j,k]
      for k := 1 to C do
        E[0,0,k] := 0
        U[0,0,k] := W_r[1] / F_c[k] / D
      for i := 1 to M do
        for j := i to M do
          for k := 1 to C do
            for l := i-1 to j-1 do
              for m := 1 to C do
                e := E[i-1,l,m] + E_N'i(F_r = F_c[m], F_s = F_c[k])
                u := U[i-1,l,m] + W_s[j] / F_c[k] / D
                if u ≤ 1 and e < E[i,j,k] then
                  E[i,j,k] := e
                  U[i,j,k] := u
                  P[i,j,k] := l
                  S[i,j,k] := m
      E_opt, P_opt, S_opt := retrieve from matrices E, U, P, S
      return E_opt, P_opt, S_opt

Figure 5: Combined partitioning with speed selection.
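The combined recurrence (9) can be sketched in Python as follows. This is an illustrative reading of Fig. 5 under stated assumptions, not the authors' implementation: the function name `combined_dp`, the single `back` table (standing in for the matrices P and S), the CPU frequency cap `f_max`, the use of one power value per speed for both sending and receiving, and all numbers in the usage example are hypothetical. Infeasible merged nodes are given infinite energy, and the utilization guard mirrors the `u ≤ 1` test of Fig. 5 rather than the full schedulability conditions of [8].

```python
import itertools
import math


def combined_dp(w_r, w_p, w_s, speeds, comm_power, cpu_power,
                f_max, deadline, p_ovh):
    """Sketch of recurrence (9): returns (energy, groups, link_speeds).

    groups      : (first, last) 0-based original-node indices of each new node
    link_speeds : input speed of the first node, then each node's sending speed
    """
    M, C = len(w_p), len(speeds)
    prefix = [0.0] + list(itertools.accumulate(w_p))  # prefix sums of cycles
    INF = math.inf

    def node_energy(lo, hi, m, k):
        # Merged node covering original nodes lo..hi, receiving at speeds[m]
        # and sending at speeds[k]; INF if it cannot meet the deadline.
        t_r = w_r[lo] / speeds[m]
        t_s = w_s[hi] / speeds[k]
        t_p = deadline - t_r - t_s
        if t_p <= 0:
            return INF
        f_p = (prefix[hi + 1] - prefix[lo]) / t_p
        if f_p > f_max:                       # assumed CPU frequency cap
            return INF
        return (cpu_power(f_p) * t_p + comm_power[m] * t_r
                + comm_power[k] * t_s + p_ovh * deadline)

    E = [[[INF] * C for _ in range(M + 1)] for _ in range(M + 1)]
    U = [[[INF] * C for _ in range(M + 1)] for _ in range(M + 1)]
    back = [[[None] * C for _ in range(M + 1)] for _ in range(M + 1)]
    for k in range(C):                        # base case: nothing mapped yet
        E[0][0][k] = 0.0
        U[0][0][k] = w_r[0] / speeds[k] / deadline

    for i in range(1, M + 1):                 # number of new nodes
        for j in range(i, M + 1):             # original nodes mapped so far
            for k in range(C):                # sending speed of new node i
                for l in range(i - 1, j):     # nodes covered by first i-1 new nodes
                    for m in range(C):        # receiving speed of new node i
                        if E[i - 1][l][m] == INF:
                            continue
                        e = E[i - 1][l][m] + node_energy(l, j - 1, m, k)
                        u = U[i - 1][l][m] + w_s[j - 1] / speeds[k] / deadline
                        if u <= 1 and e < E[i][j][k]:
                            E[i][j][k], U[i][j][k] = e, u
                            back[i][j][k] = (l, m)

    e_opt, i, k = min((E[i][M][k], i, k)
                      for i in range(1, M + 1) for k in range(C))
    if e_opt == INF:
        return INF, [], []
    groups, link, j = [], [speeds[k]], M
    while i > 0:                              # walk the back-pointers
        l, m = back[i][j][k]
        groups.append((l, j - 1))
        link.append(speeds[m])
        i, j, k = i - 1, l, m
    return e_opt, groups[::-1], link[::-1]


# Hypothetical three-stage pipeline; every value here is an assumption.
energy, groups, link = combined_dp(
    w_r=[12e3, 8e3, 8e3], w_p=[4e5, 1.2e6, 2.6e6], w_s=[8e3, 8e3, 14e3],
    speeds=[10e6, 100e6, 1000e6], comm_power=[0.8, 1.5, 6.0],
    cpu_power=lambda f: 1.6 * (f / 1e9) ** 3,
    f_max=1e9, deadline=0.01, p_ovh=0.1)
```

The five nested loops make the $O(M^3 C^2)$ cost visible directly. Like Fig. 5, the sketch keeps only the minimum-energy entry per state, so the utilization test is applied to that entry alone.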
There are three additional matrices. The utilization matrix $U$ tracks the schedulability condition [8] and guards each optimal sub-solution to guarantee its schedulability. The partitioning matrix $P$ and speed matrix $S$ are used to record the intermediate solutions and to retrieve the optimal partitioning $P_{opt}$ and optimal speed setting $S_{opt}$ when the algorithm terminates. The global minimum energy is $\min(E[i, M, k])$, $i = 1, 2, \ldots, M$, $k = 1, 2, \ldots, C$. The time complexity of the algorithm is $O(M^3 C^2)$.

[Figure 6: Two fixed partitioning schemes of ATR. (a) single-node partitioning, merging N1, N2, N3, N4, N5 into one node with $W_p = W_{p_1} + W_{p_2} + W_{p_3} + W_{p_4} + W_{p_5}$; (b) five-node partitioning.]

6. EXPERIMENTAL RESULTS
To evaluate our energy optimization techniques, we experiment with mapping the ATR algorithm onto two fixed partitioning schemes: (a) a single node that combines all blocks, and (b) a five-node pipeline that maps each block onto an individual node (Fig. 6). The input data size is 12K bits, and the output is 14K bits per frame. In scheme (a), the single node combines all the workload of the five nodes in (b), and it eliminates all internal communication instances between nodes in (b). (a) and (b) are two extremes representing serial vs. parallel schemes. For both (a) and (b) we apply optimal speed selection. We also find the optimal partitioning with speed selection as (c) and compare its energy consumption per image frame with (a) and (b) under two types of performance requirements: (1) high performance, $D = 10$ ms, and (2) moderate performance, $D = 15$ ms.

Each node consists of an Intel XScale processor [2], whose power vs. performance level ranges from 50 mW at 150 MHz to 1.6 W at 1 GHz (Fig. 7), and an LXT-1000 Ethernet interface [1] with power levels of 0.8 W at 10 Mbps, 1.5 W at 100 Mbps, and 6 W at 1000 Mbps (Fig. 8). We assume each node has a constant power draw $P_{ovh}$ = 1mW. The results are presented in Fig. 9.
In all cases, 1000 Mbps is the optimal speed setting for communication. The low-power, 10 Mbps communication speed results in the highest energy. This is because it leaves so little time for computation that the processors must run faster, with more energy, to meet the deadline, and because it has the highest energy-per-bit rating. The low-speed communication also tends to violate the schedulability conditions [8]. Given the properties of this particular Ethernet interface, 1000 Mbps communication will always lead to the lowest energy consumption, since it requires the least amount of energy per bit and leaves the maximum amount of time budget for reducing CPU energy. However, in cases where the energy-per-bit rating does not decrease monotonically with the communication speed, the optimal speed setting may involve some combination of low-speed and high-speed settings between different nodes. For example, the node $N_i$ may communicate with $N_{i-1}$ at one data rate and with $N_{i+1}$ at another.

Fig. 9(1) shows the energy consumption of all three partitioning schemes under a tight performance constraint. The single node (a) is heavily loaded with computation. Therefore it is desirable to reduce CPU energy by pipelining. As a result, the five-node pipeline (b) is more energy-efficient, at the cost of additional communication and overhead. However, the optimal partitioning is (c) with three nodes: [N1, N2], [N3, N4], [N5]. It consumes more CPU energy than (b), but overall it is optimal, with less energy spent on communication and overhead.
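The energy-per-bit argument can be checked with simple arithmetic: energy per bit is the interface power divided by the bit rate. The short sketch below assumes the interface's three operating points read as 0.8 W at 10 Mbps, 1.5 W at 100 Mbps, and 6 W at 1000 Mbps; the names `energy_per_bit` and `ETHERNET_MODES` are hypothetical.

```python
def energy_per_bit(modes):
    """Map each bit rate (bit/s) to joules per transmitted bit (power / rate)."""
    return {rate: power / rate for rate, power in modes.items()}


# Assumed reading of the interface's operating points: bit/s -> watts.
ETHERNET_MODES = {10e6: 0.8, 100e6: 1.5, 1000e6: 6.0}

for rate, joules in sorted(energy_per_bit(ETHERNET_MODES).items()):
    print(f"{rate / 1e6:5.0f} Mbps: {joules * 1e9:5.1f} nJ/bit")
```

Under these assumptions the three modes cost 80, 15, and 6 nJ per bit respectively, so the fastest mode draws the most power yet spends the least energy per bit, which is the monotonic case described above.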


In the case of the moderate performance constraint (Fig. 9(2)), (a) is still dominated by computation, but it is not heavily loaded due to the relaxed deadline. The reduction of CPU energy by (b) cannot compensate for the added overhead of new nodes and communication. Therefore (a) is better than (b), and pipelining seems inefficient. However, the optimal partitioning (c) is still a pipelined solution. It combines N1, N2, N3, N4 into one node and maps N5 to another node. (c) achieves minimum energy by appropriately balancing computation and communication against pipelining overhead. If the performance constraint is further relaxed, the serial solution (a) will become optimal.

[Figure 7: Power vs. performance of the XScale processor.]

Figure 8: Power modes of the Ethernet interface.

    Mode         Power consumption
    10 Mbps      800 mW
    100 Mbps     1.5 W
    1000 Mbps    6 W

[Figure 9: Energy consumption of three partitioning schemes. Vertical axis: energy per frame (mJ), broken down into overhead, communication, and computation. Panel (1), high performance, D = 10 ms: (a) 1-node, (b) 5-node, (c) optimal [N1, N2], [N3, N4], [N5]. Panel (2), moderate performance, D = 15 ms: (a) 1-node, (b) 5-node, (c) optimal [N1, N2, N3, N4], [N5].]

7. CONCLUSION
We present an energy optimization technique for networked embedded processors and emerging system-on-chip architectures with high-speed on-chip networks. We exploit the multi-speed feature of modern high-speed communication interfaces as an effective way to complement and enhance today's CPU-centric power optimization approaches. In such systems, communication and computation compete over opportunities for operating at the most energy-efficient points. It is critical not only to balance the load among processors by functional partitioning, but also to balance the speeds between communication and computation on each node and across the whole system. Our multi-dimensional dynamic programming formulation is exact and produces the energy-optimal solution as defined by a partitioning scheme and the speed selections for all computation and communication tasks. We expect this technique to be applicable to a large class of data-dominated systems that can be structured in a pipelined organization.
8. REFERENCES
[1] INTEL ethernet PHYs/transceivers. ethernet/lnecard ept.htm.
[2] INTEL XScale microarchitecture.
[3] N. K. Bambha, S. S. Bhattacharyya, J. Teich, and E. Zitzler. Hybrid global/local search strategies for dynamic voltage scaling in embedded multiprocessors. In Proc. International Symposium on Hardware/Software Codesign, 2001.
[4] R. Cherabuddi, M. Bayoumi, and H. Krishnamurthy. A low power based system partitioning and binding technique for multi-chip module architectures. In Proc. Great Lakes Symposium on VLSI.
[5] P. Eles, A. Doboli, P. Pop, and Z. Peng. Scheduling with bus access optimization for distributed embedded systems. IEEE Transactions on VLSI Systems, 8(5), 2000.
[6] E. Huwang, F. Vahid, and Y.-C. Hsu. FSM functional partitioning for low power. In Proc. Design, Automation and Test in Europe.
[7] P. V. Knudsen and J. Madsen. Integrating communication protocol selection with hardware/software codesign. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, August.
[8] J. Liu, P. H. Chou, and N. Bagherzadeh. Communication speed selection for embedded systems with networked voltage-scalable processors. In Proc. International Symposium on Hardware/Software Codesign, April 2002.
[9] J. Luo and N. K. Jha. Battery-aware static scheduling for distributed real-time embedded systems. In Proc. Design Automation Conference, June 2001.
[10] R. Ortega and G. Borriello. Communication synthesis for distributed embedded systems. In Proc. International Conference on Computer-Aided Design, 1998.
[11] A. Wang and A. Chandrakasan. Energy efficient system partitioning for distributed wireless sensor networks. In Proc. IEEE International Conference on Acoustics, Speech and Signal Processing, pages 905-908, May 2001.
[12] E. F. Weglarz, K. K. Saluja, and M. H. Lipasti. Minimizing energy consumption for high-performance processing. In Proc. Asian and South Pacific Design Automation Conference, 2002.
[13] W. Wolf. An architectural co-synthesis algorithm for distributed embedded computing systems. IEEE Transactions on VLSI Systems, June.
