Balancing Register Allocation Across Threads for a Multithreaded Network Processor

Xiaotong Zhuang, Georgia Institute of Technology, College of Computing, Atlanta, GA, x2000@cc.gatech.edu
Santosh Pande, Georgia Institute of Technology, College of Computing, Atlanta, GA, santosh@cc.gatech.edu

ABSTRACT

Modern network processors employ multi-threading to allow concurrency among multiple packet processing tasks. We studied the properties of applications running on network processors and observed that their imbalanced register requirements across different threads at different program points could lead to poor performance. Application needs often demand that some threads be more performance critical than others; thus, by controlling the register allocation across threads, one can influence the performance of the threads and obtain the desired performance properties for concurrent threads. This prompts our work.

Our register allocator aims to distribute available registers to different threads according to their needs. The compiler analyzes the register needs of each thread both at the points of context switch and internally. The compiler then designates some registers as shared and some as private to each thread. Shared registers are allocated across all threads explicitly by the compiler. Values that are live across a context switch cannot be kept in shared registers for safety reasons; thus, only those live ranges that are internal to the context switch can be safely allocated to shared registers. Spills can cause a context switch, and thus the problems of context switch and allocation are closely coupled; we propose a solution to this problem. The proposed interference graphs (GIG, BIG, IIG) distinguish variables that must use a thread's private registers from those that can use shared registers. We first estimate the register requirement bounds, then reduce from the upper bound gradually to achieve a good register balance among threads. To reduce the register needs, move instructions are inserted at program points that split the live ranges, or the nodes on the interference graph. We show that the lower bound is reachable via live range splitting and is adequate for our benchmark programs for simultaneously assigning them on different threads.
As our objective, the number of move instructions is minimized. Empirical results show that the compiler is able to effectively control the register allocation across threads by maximizing the number of shared registers. Speed-up for performance critical threads ranges from 18 to 24%, whereas degradation of performance for non-critical threads ranges only from 1 to 4%.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. PLDI'04, June 9-11, 2004, Washington, DC, USA. Copyright 2004 ACM /04/ $5.00.

Categories and Subject Descriptors: D.3.4 [Programming Languages]: Processors - Optimization, Code generation, Run-time environments.
General Terms: Algorithms, Languages, Performance.
Keywords: Network Processor, Register Allocation, Multithreaded Processor.

1. INTRODUCTION

The dramatic growth in Internet traffic has motivated a specialized category of embedded processors called Network Processors (NPs) with fast processing speed and specialized hardware support for network applications. Network processors are distinguished by their fast processing cores and are programmed in a dedicated manner for catering to the specific needs of underlying applications. Compiler optimization for network processors is an emerging topic for research [3][4][5][19]. In this paper, we attempt the register allocation problem for a multithreaded network processor, the IXP. The IXP network processor model can be applied to any network processor with a shared CPU and register file for multiple threads and with fast context switch to hide long latency operations such as memory accesses.

Typically, network processor applications consist of multiple threads concurrently executing multiple tasks of a network processing application. The tasks can range from ones as simple as packet routing to complex ones that process packet contents for viruses and malignant codes etc.
In contrast to general processors, the tasks that execute on different threads of a network processor are bound to them at compilation time; in other words, no run time thread assignment takes place. Since low level operations originally done by the OS or hardware, such as context switch, are exposed to the programmer, the compiler has knowledge of the thread interactions, which are predictable. It is obvious that different tasks have different complexities and also different levels of desired performance. Some tasks may be more performance critical than others. Implementing (effecting) such performance needs across threads is currently impossible for any user. This is so because the compiler allocates a fixed number (32) of registers to each thread and does not undertake inter-thread analysis to balance their overall register needs. It may be noted that the performance of a thread is quite sensitive to register needs; even though the number of spills may be small for a larger number of registers, each spill is very expensive (latency of about 20 cycles). Our experience with the Intel IXP network processor family, which largely follows this model, tells us that: 1) we can achieve register balancing among different threads; 2) we can reduce spills through the safe use of shared registers which are not live across context switch instructions for individual threads; and 3) through the use of register sharing, overall, we make more registers available to threads, boosting their performance. Thus, overall, by balancing register needs across threads we can meet their performance requirements. These optimizations are necessary due to the disparity of register pressure across threads and across different regions of code in each thread. We first discuss the network processor architecture to gain some understanding of the problem of balancing register requirements across threads.

1.1 Network processors

State-of-the-art network processors like the Intel IXP1200/2400/2800, MMC nP series, IBM PowerNP etc. [21] have programmable processing cores that can be coded for application needs. In contrast to traditional processors, network processors have their special properties.

Speed vs. Flexibility

Network processors face the dilemma of offering both prompt processing of the network traffic and flexibility to the software programmer to meet the requirements of different applications. As network speeds continue to increase, the time to process each packet must be shortened to avoid packet loss. For example, processing at OC-192 allows only 52ns for each packet and OC-768 leaves only 13ns for processing. The higher speeds require both a shorter processing time for each packet and a shorter time a packet can stay in the system (waiting time + processing time). To speed up the critical path for packet processing, normally a number of RISC processor cores are equipped to work in parallel. Although sometimes a co-processor (typically a general purpose processor) is added to handle other slow tasks, the packet processing cores must be optimized for speed. Therefore, features such as explicit multithreading, explicit and fast context switching (only the pc is saved), and direct memory access (without the complication of caches) are commonly seen in network processor designs. As memory operations are extremely time-consuming, solutions should focus on hiding the latency with tolerable hardware and software complexity. With fast context switch, each processor core can hide latencies by context switching to other threads when accessing the peripherals.
Even if caches are enabled, context switching to other threads is generally a clever way to avoid the deviation in memory access time, as in the MMC nP series. Network processors are also aimed at providing a plethora of solutions for network applications, which were originally implemented with dedicated hardware (not flexible) or general purpose processors (too slow). Recent research [1][2][22] has attempted complicated tasks such as content inspection, software routers, intrusion detection, etc. As more code base is added to network processors, writing in assembly can be error-prone and time-consuming, which greatly hampers fast prototyping and increases time to market. To provide programmability, a High Level Language (HLL) compiler is sometimes provided, although typically with limited language feature support. There are several on-going research efforts to build proper optimizing compilers for network processors [3][4][5]. As mentioned earlier, none of the current compilers undertakes inter-thread analysis, forcing programmers to manage the register pressure across threads. Without any help from the compiler it is impossible for a user to hand-tune (multi-threaded) code. This motivates our work.

Intel IXP Network Processor

In this paper, we base our work on the Intel IXP network processor. Since its successful design has made it a very popular product in the network processor market, we generalize its features as a general model in section 2. Here, we present several prominent features of the IXP network processor, which prompt the thread register allocation problem (details in section 1.2). Figure 1 shows the block diagram of the IXP1200 network processor. The chip has 6 micro-engines (processing units or PUs), and 4 threads share the same PU. The chip has connections to off-chip SRAM, SDRAM, the PCI bus etc. As shown in Figure 2.a, typically, each PU gets packets from its input queue, processes them and then writes to its output queue, or to the input queue of the next PU in the next pipeline stage. With pipeline processing, typically, some PUs are in charge of getting packets from the input ports, some handle packet processing and some are for output ports. Our optimization focuses on the code on different threads of the same PU. Figure 2.b shows the major components inside each PU.
Some of the important features are as follows:

1. Shared register file but typically non-overlapped partition. Figure 2.b shows that the general purpose register (GPR) file is shared by the 4 threads. Each thread has access to all registers; however, without optimization, each thread is normally allocated a non-overlapping part of the register file. The reason for the register file partition is the light-weight context switch, as discussed below.

2. Non-preemptable thread execution. There is no operating system and no control present over the threads sharing the CPU. A thread gives up the CPU only when it blocks on I/O or other long latency operations or executes a context switch (ctx_switch) instruction voluntarily (see footnote 1).

3. Light-weight context switch. Context switch is cheap (only the PC is saved); this is also the reason registers are normally allocated in a non-overlapped fashion from the register file. If a register is allocated to two threads, after a context switch, the contents of that register may be modified by the other thread. Since registers are neither automatically saved nor restored during a context switch, such possibilities exist, and this is where managing registers becomes a compiler problem.

4. Cheap ALU, expensive memory access. No cache is available for memory accesses; at least 20 cycles are needed for each load/store instruction. Context switches typically follow to hide the long latency of memory accesses. In contrast, all ALU instructions can be completed in 1 cycle. Large memory latency makes overall performance sensitive to spills even though they may be few in number.

Figure 1. IXP1200 block diagram.

Footnote 1: ctx_switch instructions can be inserted by the programmer to achieve fair sharing of the CPU.
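The hazard described in feature 3 can be illustrated with a small simulation. This is only a sketch under stated assumptions: Python generators stand in for threads, each `yield` stands in for a ctx_switch, a dictionary stands in for the shared register file, and the names `thread1`, `thread2` and `run` are hypothetical and not part of the IXP tool chain.

```python
def thread1(regs, reg, out):
    regs[reg] = 5           # a = 5, kept in register `reg`
    yield                   # ctx_switch: only the PC is saved, no registers
    out.append(regs[reg])   # use of a after the thread resumes


def thread2(regs, reg):
    regs[reg] = 99          # d = 99, kept in register `reg`
    yield                   # ctx_switch


def run(reg_t1, reg_t2):
    """Round-robin the two threads over one register file and return the
    value thread 1 reads back after its context switch."""
    regs, out = {}, []
    threads = [thread1(regs, reg_t1, out), thread2(regs, reg_t2)]
    while threads:
        for t in list(threads):
            try:
                next(t)     # run the thread until its next ctx_switch
            except StopIteration:
                threads.remove(t)
    return out[0]


overlapping = run("r1", "r1")   # both threads use r1: thread 1 sees 99
disjoint = run("r1", "r2")      # non-overlapping registers: thread 1 sees 5
```

In the overlapping case the value of `a` is live across the context switch and is clobbered by the other thread, which is exactly why such values need a register that no other thread touches.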

Figure 2. IXP1200 threads and register file on a PU.

The above features of the IXP network processor are driven by a design philosophy of simplifying the hardware so as to increase the clock rate and execution speed. For instance, context switch is kept very simple and fast (1 cycle latency). For this, only the program counter (pc) is saved but no registers are saved, because saving them can cause long delays in context switch which may offset the benefit of CPU sharing. On the other hand, since all the hardware details are exposed, compilers can make prudent decisions regarding register sharing etc. Next, we propose the multi-threaded register allocation problem.

1.2 The Register Allocation Problem

As mentioned above, although the register file can be accessed by all threads, it has to be partitioned without overlap across threads because no register is saved/restored during context switch. Here, we argue that some registers can be safely shared by all threads through compiler analysis, since thread switch is predictable. The example in Figure 3 illustrates the problem and the possible ways to solve it. In Figure 3.a, the code for two threads is shown. Assume all variables are dead after their last use in the code. In thread 1, a code segment contains 12 instructions, including two context switch instructions: ctx_switch gives up the CPU voluntarily, and a load causes a context switch to wait for the I/O operation. Any pair of the 3 variables interfere with each other (co-live at some program point), so in Figure 3.b, they are assigned 3 different physical registers. Notice that variable a is live across the ctx_switch instruction, so it must be allocated to a physical register that is not used by any other thread, because when thread 1 is context switched at this point, other threads should not modify the physical register of variable a, which means only thread 1 should use the register. On the contrary, variables b and c are only used between two context switch instructions. In other words, when thread 1 is switched out of the CPU, both b and c must be dead. Therefore it is safe to reuse the physical registers allocated to b and c in other threads. Thread 2 has 4 instructions, with two context switch instructions.
d is only live between two context switch instructions, therefore d can share a physical register with other threads. Simply, r2 is shared: it is used for b in thread 1 and for d in thread 2, because the code guarantees that when context is switched to thread 2, r2 contains a dead value for thread 1. Similarly, when context is switched to thread 1, r2 contains a dead value (d) for thread 2. This example shows the benefit of sharing registers and lowering the total register requirement from four to three.

We now show that through another technique (live range splitting) one can reduce the total register requirement further. Three registers seem necessary for thread 1; however, we notice that at any program point, only two variables are co-live. This prompts our technique of splitting one of the variables and inserting a move instruction at a certain point. This is demonstrated in Figure 3.c. In instruction 6, r3 is replaced by r1, while from instruction 8 to 9, r3 is replaced by r2. Instruction 10 copies r2 to r1, so in instruction 12, we have a consistent replacement (r3 → r1). We have managed to reduce the total register requirement down to two now.

(a) Original code

    Thread 1                 Thread 2
        1.  a=                   1. ctx_switch
        2.  ctx_switch           2. d=
        3.  if( ) br L1          3. =d+
        4.  b=                   4. store
        5.  =a+b
        6.  c=
        7.  br L2
    L1: 8.  c=
        9.  =a+c
        10. b=
    L2: 11. =b+c
        12. load

(b) After register assignment with sharing

    Thread 1                 Thread 2
        1.  r1=                  1. ctx_switch
        2.  ctx_switch           2. r2=
        3.  if( ) br L1          3. =r2+
        4.  r2=                  4. store
        5.  =r1+r2
        6.  r3=
        7.  br L2
    L1: 8.  r3=
        9.  =r1+r3
        10. r2=
    L2: 11. =r2+r3
        12. load

(c) Thread 1 after live range splitting (* marks the inserted move)

    Thread 1
        1.  r1=
        2.  ctx_switch
        3.  if( ) br L1
        4.  r2=
        5.  =r1+r2
        6.  r1(r3)=
        7.  br L2
    L1: 8.  r2(r3)=
        9.  =r1+r2(r3)
        10. r1(r3)=r2(r3) *
        11. r2=
    L2: 12. =r2+r1(r3)
        13. load

Figure 3. Example of register sharing and move insertion.

The above example illustrates the potential benefits of register sharing across threads and live range splitting. To further justify that multi-threaded register allocation is important and that a compiler solution is feasible, we list some properties of the programs that run on the network processor to support this argument.

1. For the IXP1200, the hardware provides seemingly enough registers. 128 general purpose registers (GPRs) can be used for each PU. However, for each thread, only 32 GPRs are available if no GPR is shared across threads.
Register sharing in the IXP is a purely software solution, unlike some SMT (Simultaneous Multi-threading) processors where it is hardware managed. The compiler designates and allocates a register either as a shared or as a private one.

2. Since there is no operating system to manage threads, memory accesses, context switches etc. are all explicit, and thus context switch is predictable at compile time.

3. As shown in our experiments, context switch instructions are typically less than 10% of the total instructions and many variables are not live across context switch instructions.

4. PUs are assigned different tasks. Packets are processed in a pipeline fashion (Figure 2.a). Currently, task assignment cannot be done automatically, although in most cases the same task is assigned to the threads on the same microengine. This actually leads to low utilization of the CPU, because it is hard to chop tasks properly so that they all take roughly ¼ of the computation power of the PU. Therefore, we should assume tasks might be different for threads on the same PU.

Item 1 indicates that the registers may not be sufficient on the network processor. Items 2 and 3 support the feasibility of a compiler solution to optimize the register allocation. Finally, item 4 prompts two kinds of problems, i.e. symmetric vs. asymmetric register allocation, which will be defined in the next section.

This paper is organized as follows. Section 2 describes the system model and problem formulation, section 3 talks about the construction of the interference graphs, section 4 is the overall framework, section 5 proposes the algorithms to estimate bounds of register numbers, sections 6 and 7 are for inter-thread and intra-thread register allocation, section 8 mentions the SRA problem briefly, section 9 shows performance evaluation results, section 10 talks about related work and section 11 is the conclusion.

2. PRELIMINARIES

System Model

In this paper, we study a multithreaded network processor that can run multiple threads on a single processing unit (PU, i.e. micro-engine for the IXP). The threads on one PU share the computation power of the PU, the register file etc. Formally, the model is as follows:

1. There are in total N_reg registers that can be used by the N_thd threads sharing a single PU.

2. Explicit context switch. A thread won't give up the CPU once it starts execution on it, until a context switch instruction is met. A context switch can happen due to explicit instructions or long latency instructions like a load or a store.

3. Context switch is very cheap (only the pc is saved) and it is intended to hide long latency operations.

4. Since network packets are mostly independent of each other, so are threads. The purpose of multithreading on the same PU is mainly latency hiding and concurrency. When one thread is stalled due to I/O or other long latency operations, other threads can take the CPU. Therefore, the codes on different threads are almost independent (Figure 2.a). Thread communication or synchronization rarely happens; however, our current solution still works under such circumstances. As future work, knowledge about thread communication or synchronization might be exploited to improve the register allocator.

5. All registers are accessible by all threads, but the registers used by one thread at the point of a context switch should not be used anywhere by other threads (later, we will define these registers as private registers), because this might cause unexpected modifications to the registers and lead to unsafe code.

6. A move instruction is much cheaper than a spill.

7. The code on different threads of the same PU can be different.

Problem Classification

As mentioned in section 1.2, programs executing on different threads can be identical. We call the register allocation problem under such circumstances Symmetric Register Allocation (SRA).
On the contrary, Asymmetric Register Allocation (ARA) assumes different programs for different threads. Mixing threads with different computation requirements can achieve better CPU utilization. Since SRA is a sub-problem of ARA, in this paper, we develop our approaches based on ARA. Notice that, although currently most real programs are for SRA, we are not intentionally complicating the problem, because our algorithms are equally necessary and important to SRA. As will be illustrated later, SRA only reduces the search space during inter-thread register allocation, while all techniques in this paper are applicable to both problems. Our goal is to develop general techniques that apply without undue restrictions.

Objective

The total number of available registers is limited. Therefore, in a multithreaded network processor model, we aim (for ARA) to balance the register allocation among all threads, so that more registers are allocated to the threads with higher register pressure and the register allocation is catered to the requirements of different threads in the system. Furthermore, designating a larger number of shared registers can help all threads to internally adjust their register pressure without causing spills. In case there are not enough registers available for all threads, we attempt to split the live ranges inside a thread by using move instructions. Also, our objective is to minimize the number of move instructions inserted. The results show that move insertion is cheap and effective.

Problem Formulation

To formalize the problem, we define several concepts.

DEFINITIONS:

PR_i: Number of private registers for thread i; these are physical registers only (exclusively) used by thread i.

SR_i: Number of shared registers needed by thread i; these are physical registers used by thread i, but other threads may use them as well.

R_i: Total number of physical registers needed by thread i; it equals PR_i + SR_i.

SGR: Number of globally shared registers needed. Since shared registers can be used by all threads, this is the maximum of the shared register demands of all threads, i.e. the maximum of all SR_i.

N_reg: Total number of physical registers available in a PU.
For a thread, PR is the number of physical registers that are exclusively allocated to it, or the number of physical registers that can be live across context switch instructions, while SR is the number of allocated physical registers that are dead during context switch, which means they can be shared across threads. For example, in Figure 3.b, for thread 1, PR_1=1 and SR_1=2; for thread 2, PR_2=0 and SR_2=1; therefore, SGR=2. The relationships and restrictions among these variables are given by the following conditions:

$$SGR = \max(SR_1, SR_2, \ldots, SR_{N_{thd}})$$

$$\sum_i PR_i + SGR \le N_{reg}$$

$$PR_i + SR_i = R_i$$

For SRA, all PR_i and SR_i are equal. Given these restrictions, we need to assign registers in a way that the overall register need is satisfied and spills are minimized.

3. CONSTRUCTION OF INTERFERENCE GRAPHS

3.1 Non-Switch Region

DEFINITIONS:

Non-Switch Region (NSR): A non-switch region is a maximal connected sub-graph of the CFG without any internal context switch instruction. It contains connected parts from several basic blocks. The boundaries of the NSR are either context switch instructions or program entry/exit points.

Context Switch Boundary (CSB): The program point of a context switch instruction. A CSB separates the basic block in which it resides, and thus becomes the boundary of NSR(s).

An NSR can be constructed by starting from an individual instruction and growing it until all nearby instructions are context switch instructions or program entry/exit points.
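The region-growing construction just described can be sketched as a graph traversal. The following is a minimal illustration, not the authors' implementation: it assumes an instruction-level CFG given as a successor map, treats every context switch instruction as a boundary that is never traversed, and uses the hypothetical name `build_nsrs`. The sample data encodes thread 1 of Figure 3.a, with the ctx_switch at instruction 2 and the load at instruction 12 as the context switch instructions.

```python
from collections import defaultdict


def build_nsrs(succ, switches):
    """Grow non-switch regions over an instruction-level CFG.

    succ:     dict mapping an instruction id to its successor ids
    switches: set of instruction ids that cause a context switch
    Returns a list of (body, boundary) pairs: `body` holds the non-switch
    instructions of one region, `boundary` the context switch instructions
    (CSBs) that delimit it.
    """
    nodes = set(succ)
    adj = defaultdict(set)            # undirected view: regions are connected sub-graphs
    for u, targets in succ.items():
        for v in targets:
            nodes.add(v)
            adj[u].add(v)
            adj[v].add(u)

    seen, regions = set(), []
    for start in sorted(nodes):
        if start in switches or start in seen:
            continue
        body, boundary, stack = set(), set(), [start]
        seen.add(start)
        while stack:                  # grow until only switch instructions remain nearby
            u = stack.pop()
            body.add(u)
            for v in adj[u]:
                if v in switches:
                    boundary.add(v)   # stop at the CSB, do not cross it
                elif v not in seen:
                    seen.add(v)
                    stack.append(v)
        regions.append((body, boundary))
    return regions


# Thread 1 of Figure 3.a: instruction 3 branches to L1 (8), 7 jumps to L2 (11).
succ = {1: [2], 2: [3], 3: [4, 8], 4: [5], 5: [6], 6: [7], 7: [11],
        8: [9], 9: [10], 10: [11], 11: [12], 12: []}
regions = build_nsrs(succ, switches={2, 12})
```

For this input the sketch yields two regions: instruction 1 bounded by instruction 2, and instructions 3 to 11 bounded by instructions 2 and 12. One simplification to keep in mind is that two adjacent context switch instructions would have an empty region between them, which this sketch does not report.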

To illustrate, Figure 4.a shows the CFG and NSRs for a code segment from the benchmark frag in the CommBench suite [15]. This code segment is from one of the functions that calculate the IP checksum. The CFG consists of 10 basic blocks. Noticeably, there are four context switch instructions, i.e. the read instructions in BB3 and BB7, and the explicit ctx_switch instructions in BB5 and BB6. The ctx_switch instructions are inserted by the programmer to avoid monopoly of the CPU. Figure 4.b shows the NSRs. After terminating the CFG at the points of context switch instructions (boundaries), we get 3 NSRs. The NSRs are bounded by either program entry/exit points or context switch instructions (CSBs). We can assume all terminations are inside basic blocks; therefore some basic blocks are split, e.g. BB5 is split into BB5.a in NSR2 and BB5.b in NSR1. Sometimes, two parts of a separated basic block still belong to the same NSR, like BB7 in Figure 4. For the example in Figure 3, thread 1 has two NSRs: instructions 1 and 2 are in NSR1 and instructions 2 to 12 constitute NSR2. For thread 2, all instructions form one NSR.

Figure 4. Program CFG and the constructed NSRs. (a) CFG with basic blocks BB1 to BB10. (b) NSR1, NSR2 and NSR3 after splitting at the context switch instructions.

3.2 Interference Graph

After building the NSRs, we build the interference graphs, which will guide the register requirement estimation and register allocation. We need to distinguish two kinds of interference and introduce some other definitions for the interference graphs.

DEFINITIONS:

Node: Live range of a virtual register or variable (see footnote 2).

Boundary Node: Node that is live across a CSB, which may interfere with other boundary nodes.

Internal Node: Node that is not live across a CSB.
Boundary Interference: If two boundary nodes are co-live across the same CSB, they are said to be boundary interfering with each other.

Internal Interference: If two nodes (internal or boundary nodes) interfere (are co-live at a program point) within an NSR.

Boundary Interference Graph (BIG): A graph that consists of all boundary nodes, with edges representing only boundary interference.

Internal Interference Graph (IIG): For each NSR, we have an IIG, which only includes the internal nodes live within this NSR and their interference edges.

Global Interference Graph (GIG): The global interference graph includes both boundary nodes and internal nodes. An edge is added if any two nodes (internal or boundary) interfere with each other.

The GIG of the code for the example in Figure 4 is drawn in Figure 5. We assume both len and buf are live at the entry point, as the length and the buffer pointer of the packet to be calculated. Also, we assume all variables are dead after their last use in the code. From Figure 4.b, we can see that both variables tmp1 and tmp2 are only live within an NSR, so they are internal nodes. The other variables are live across CSB boundaries. They are boundary nodes. For a memory read, since all data is first loaded into transfer registers (see footnote 3), the destination register is not assumed to be live across the memory read, i.e. the CSB. At BB1, sum, buf and len interfere with each other internally (they also interfere at a CSB); thus, the 3 nodes form a clique on the GIG. tmp1 interferes with sum, buf and len in BB3.b, but at the live point of tmp2 in BB7.b, both buf and len are dead. Thus, sum, buf and len form the BIG; the IIG_1 for NSR1 is empty; the IIG_2 for NSR2 includes only tmp1; the IIG_3 for NSR3 includes only tmp2. Obviously, we have the following claims for each thread.

Claim 1: To avoid spills, the GIG should be colored with R colors and the BIG should be colored with PR colors. Each IIG, as a part of the GIG, should be colored with no more than R colors.

Claim 2: Internal nodes on different IIGs are not connected, i.e. they do not interfere with each other.

Figure 5.
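Global interference graph for the example. (Boundary nodes sum, buf and len form the BIG; internal node tmp1 belongs to IIG_2 and internal node tmp2 to IIG_3; IIG_1 is empty.)

Notice that NSRs and interference graphs can be constructed inter-procedurally. The CFGs and NSRs of different functions are connected with edges linking function calls and return points.

Footnote 2: Here, we assume each live range represents one variable.

Footnote 3: Transfer registers are special registers on the IXP used to store data from/to the memory; generally we can assume they are temporary registers dedicated to memory accesses but unavailable as GPRs.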
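The definitions of Section 3.2 can be turned into a small classification routine. The sketch below is illustrative only: it assumes liveness has already been computed and is supplied as sets of co-live variables, and the function name `build_graphs`, the CSB labels and the exact live sets are hypothetical choices made to be consistent with the description of Figures 4 and 5 (they are not read off the original figure).

```python
from itertools import combinations


def build_graphs(live_across_csb, live_points):
    """Classify nodes and build the BIG, the GIG and one IIG per NSR.

    live_across_csb: dict CSB label -> set of variables live across that CSB
    live_points:     dict NSR label -> list of sets of variables that are
                     co-live at some program point inside that NSR
    Edges are represented as frozensets of two variable names.
    """
    boundary = set().union(*live_across_csb.values())
    all_vars = set(boundary)
    for points in live_points.values():
        for live in points:
            all_vars |= live
    internal = all_vars - boundary

    # Boundary interference: co-live across the same CSB.
    big = set()
    for live in live_across_csb.values():
        big |= {frozenset(p) for p in combinations(sorted(live), 2)}

    gig = set(big)   # the GIG holds every interference edge
    iig = {}         # per NSR: (internal nodes, edges among internal nodes)
    for nsr, points in live_points.items():
        nodes, edges = set(), set()
        for live in points:
            nodes |= live & internal
            for p in combinations(sorted(live), 2):
                edge = frozenset(p)
                gig.add(edge)
                if edge <= internal:
                    edges.add(edge)
        iig[nsr] = (nodes, edges)
    return boundary, internal, big, gig, iig


# Hypothetical live sets matching the narrative around Figure 5.
live_across_csb = {"csb_read_BB3": {"sum", "buf", "len"},
                   "csb_read_BB7": {"sum"}}
live_points = {"NSR1": [{"sum", "buf", "len"}],
               "NSR2": [{"tmp1", "sum", "buf", "len"}],
               "NSR3": [{"tmp2", "sum"}]}
boundary, internal, big, gig, iig = build_graphs(live_across_csb, live_points)
```

With this data, sum, buf and len come out as boundary nodes forming a triangle in the BIG, tmp1 and tmp2 come out as internal nodes, and the IIG for NSR1 is empty, in line with the text above.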

4. OVERALL FRAMEWORK

Figure 6. Overall framework. (Stages: build NSRs and interference graphs; estimate lower/upper bounds; inter-thread register allocation; intra-thread register allocation.)

Figure 6 shows our framework to perform the register allocation. Our first step is to build the NSRs and interference graphs; we then try to estimate the lower and upper bounds of PR and R for each thread. Starting from the upper bound, the inter-thread register allocator reduces the overall register requirement gradually until it is within N_reg. During this process, when the inter-thread register allocator intends to reduce PR or SR, it calls the intra-thread allocator for all threads. The inter-thread allocator goes in the direction of the smallest cost increase. The framework allows the intra-thread register allocator to be built separately from the inter-thread register allocator.

5. REGISTER NUMBER ESTIMATION

As the first step toward assigning registers to multiple threads, we need to estimate the number of registers each thread needs based on the interference graphs. The estimation helps to guide the distribution of registers to threads at the beginning. Here, we are concerned with finding the bounds for R and PR as defined below. We do not estimate bounds for SR, since the number of SR is always equal to R-PR.

DEFINITIONS:

MinPR, MaxPR: Minimal, maximal number of PR

MinR, MaxR: Minimal, maximal number of R

Lower Bound Estimation

The lower bound is the minimum number of registers a thread needs. First, we can get an estimate for the minimum number of private registers (MinPR) one thread needs. A rough estimate is

$$MinPR \ge RegPCSB_{max} = \max(\text{number of co-live registers at a CSB})$$

It is obvious that if at a CSB point there are RegPCSB_max nodes (variables) co-live, we need at least this number of private registers, since they cannot be shared during context switch. In other words, the minimal number of private registers needed is at least equal to the maximal number of nodes co-live at the CSB boundaries. The following lemma says this bound can be reached if enough move instructions are inserted. Also, we will explain more about move instruction insertion in Section 7.
Lemma 1: Regardless of shared registers, MinPR can be made equal to RegPCSB_max by inserting move instructions.

Proof: Suppose we are given private registers PR_1, PR_2, ..., PR_RegPCSBmax, and at a certain CSB there are in total n variables V_1, V_2, ..., V_n live across it, with RegPCSB_max ≥ n. Simply inserting n move instructions PR_1=V_1, PR_2=V_2, ..., PR_n=V_n before the CSB and n move instructions V_1=PR_1, V_2=PR_2, ..., V_n=PR_n after the CSB makes the code equivalent to the original, and the number of private registers needed is no more than RegPCSB_max.

However, in reality, a move instruction still costs 1 cycle in our model; although it is much cheaper than a spill, we still need to keep the number of inserted move instructions small. Similarly, we can estimate the MinR needed.

$$MinR \ge RegP_{max} = \max(\text{number of co-live registers at a program point})$$

This lower bound is also achievable given enough move instructions. The proof is similar to the one above.

Upper Bound Estimation

The upper bound gives the maximal number of registers required without any extra move instructions inserted. According to claim 1 in section 3.2, the best estimates for MaxPR and MaxR are the minimal numbers of colors required to color the BIG and the GIG. However, for the GIG the coloring problem is slightly different from traditional graph coloring. The problem is to find a coloring scheme for a thread which satisfies:

1. All boundary nodes are colored with at most MaxPR colors
2. All nodes are MaxR colorable
3. Any two interfering nodes are colored differently

For the GIG in Figure 5, all boundary nodes can be minimally colored with 3 colors; thus MaxPR=3. And all nodes can be minimally colored with 4 colors (there is one 4-node clique), so MaxR=4 and SR=1. Actually, there is a tradeoff between the MaxPR and MaxR estimates. Reducing MaxPR may induce a larger MaxR. To minimize MaxPR, we can first remove all internal nodes and color the BIG minimally, then insert back the internal nodes and color the graph assuming all boundary nodes have fixed colors. To find the tightest (minimal) value of MaxR sufficient to color, we should ignore condition 1 above, i.e.
we could assume that all nodes are indistinguishable and simply color the GIG as usual using any coloring allocator. Such a coloring would then minimize MaxR but may give a higher MaxPR.

Figure 7. Estimate the maximal register requirement. (a) Flowchart: PR = mini_color(BIG); R = Max(mini_color(IIG_1), mini_color(IIG_2), ...); while any conflict edge between the BIG and an IIG remains, try to change the color of one end node of the edge (possibly adjusting the colors of its neighbors); if that does not succeed, R++ and color one end node of the edge with the new color. Here mini_color(g) is a graph coloring algorithm, and a conflict edge is an edge whose two end nodes have the same color. (b) The BIG is colored with PR colors; each of IIG_1 ... IIG_k is colored with up to R colors.

We take an approach slightly different from the first one, i.e. we minimize MaxPR first. This approach is motivated by the fact that an increase in PR causes a direct increase in the total number of registers, while an increase in SR only affects the total number of registers when this SR is the maximum among all threads (refer to the formulas at the end of section 2). Based on claim 2 mentioned in section 3.2 (i.e. IIGs are not connected with each other), we can color the IIGs and the BIG separately and then merge them together to keep a tight control on colorability. After merging, edges added between the BIG and the IIGs may cause conflicts. For example, in Figure 5, when the IIGs and the BIG are colored separately, variable sum may get the same color as tmp1, leading to a color conflict when the edge between them is added during the merge. A general algorithm to color the whole graph altogether may take much more time, since the graph can be big (it includes all live ranges in the program; some codes in our experiments contain hundreds of nodes). Our approach is similar to fusion-based or region-based register allocation [23], except that our regions are chosen as the IIGs and the BIG.

The algorithm (Figure 7.a) first builds the BIG and the IIGs from the GIG and colors each of them independently. In other words, the BIG is colored with color numbers from 1 to PR, while each IIG is colored with color numbers from 1 up to R. Some IIGs may be colored with fewer than R colors, but an IIG can be colored with at most R colors. The next step tries to merge each IIG with the BIG. The edges between an IIG and the BIG can cause problems if the two end nodes of an edge have the same color. Such edges are called Conflict Edges. The loop in Figure 7.a shows how to resolve all the conflict edges. We illustrate the procedure in Figure 7.b. Suppose a boundary node and an internal node are colored with the same color. If the boundary node's color can be changed to another color within color numbers 1 to PR, or the internal node's color can be changed to another color within color numbers 1 to R, then one of them can be changed to another color to remove this conflict edge. If that fails, we heuristically try to change their neighbors' colors to see if the two nodes can be recolored after that. After all these attempts fail, we have to increase R, and one end node of the edge is re-colored with the new color. The algorithm finally gives MaxPR and MaxR. The complexity of the algorithm is ΣO(mini_color(IIG_i)) + O(mini_color(BIG)) + O(#Edges between BIG and IIGs). In contrast, the complexity of coloring the whole graph is O(mini_color(GIG)). This means the algorithm is also quite fast for trying out a given coloring for a thread.

6. INTERTHREAD REGISTER ALLOCATION

6.1 Our approach

One of the difficulties in register allocation for multiple threads is that we do not know exactly how many registers each thread needs.
Trying all combinations to find the best register allocation would cause a tremendous amount of compilation time and would be infeasible to build into any practical system. Our approach is to first get an estimation (range) of how many registers are needed by each individual thread via the algorithm proposed in the previous section. From this starting point, we use a greedy heuristic algorithm to approach a sub-optimal solution by gradually reducing the total number of required physical registers. The algorithm also encapsulates the intra-thread register allocator, so that it can be developed independently.

6.2 The Register Allocation Algorithm

After getting the estimated upper bounds MaxPR_i and MaxR_i for each thread, let SR_i = MaxR_i - MaxPR_i and PR_i = MaxPR_i. We can check the following condition:

$$\sum_{i=1}^{N_{thd}} PR_i + \max(SR_1, SR_2, \ldots, SR_{N_{thd}}) \le N_{reg} \qquad (**)$$

If this holds, we can assign SGR = Max(SR_1, SR_2, ..., SR_Nthd) as the number of globally shared registers and MaxPR_i as the number of private registers for each thread to satisfy all register requirements.

If condition (**) does not hold, the register requirement is too high, and we must reduce either the PR(s) or the SR(s) to satisfy (**). From (**), we can see that there are two ways to reduce the left-side value. We can reduce one of the PR_i, which results in a direct reduction of the left-side value. The other way is to reduce SR_i, in which case we should reduce the one(s) with the maximal value. In case multiple SR_i have the same maximal value, we should consider reducing one of the PR_i if that costs less.

The inter-thread register allocation algorithm is shown in Figure 8. The algorithm first builds the GIG and gets the estimation for each thread. If the available registers are enough (the requirement is less than N_reg), the program simply allocates registers and returns. Otherwise, it enters a loop to gradually reduce the overall register requirement through a greedy algorithm, i.e. every time we choose the direction that achieves the minimal cost. To reduce the register requirement (i.e. the left side of (**)) by 1, we have many choices: we can either reduce one of the PR by 1 or reduce all the maximal SR(s) by 1 to cut down Max(SR_1, SR_2, ..., SR_Nthd).
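As a rough illustration of the greedy loop described above, the following Python sketch reduces the left side of (**) by one per iteration. It assumes that the intra-thread allocation cost is available as a callable `cost(pr, sr)`; the names `required_registers`, `inter_thread_allocate` and `make_toy_cost`, the toy cost model, and the error raised when no reduction is possible are illustrative assumptions and not part of the algorithm in Figure 8.

```python
from typing import Callable, List, Sequence, Tuple

CostFn = Callable[[Sequence[int], Sequence[int]], float]


def required_registers(pr: Sequence[int], sr: Sequence[int]) -> int:
    """Left side of condition (**): all private registers plus the largest shared need."""
    return sum(pr) + max(sr)


def inter_thread_allocate(pr: Sequence[int], sr: Sequence[int],
                          min_pr: Sequence[int], min_r: Sequence[int],
                          n_reg: int, cost: CostFn) -> Tuple[List[int], List[int], int]:
    """Greedily lower PR/SR until (**) holds; returns (PR, SR, SGR)."""
    pr, sr = list(pr), list(sr)
    while required_registers(pr, sr) > n_reg:
        candidates = []  # entries are (cost, new_pr, new_sr)
        # Option 1: reduce a single PR_i by 1, respecting MinPR_i and MinR_i.
        for i in range(len(pr)):
            if pr[i] > min_pr[i] and pr[i] + sr[i] > min_r[i]:
                new_pr = pr[:i] + [pr[i] - 1] + pr[i + 1:]
                candidates.append((cost(new_pr, sr), new_pr, sr))
        # Option 2: reduce every SR_i that equals the maximum by 1,
        # allowed only if each such thread stays above MinR_i.
        max_sr = max(sr)
        at_max = [i for i, s in enumerate(sr) if s == max_sr]
        if max_sr > 0 and all(pr[i] + sr[i] > min_r[i] for i in at_max):
            new_sr = [s - 1 if s == max_sr else s for s in sr]
            candidates.append((cost(pr, new_sr), pr, new_sr))
        if not candidates:
            raise ValueError("register budget cannot be met within the lower bounds")
        _, pr, sr = min(candidates, key=lambda cand: cand[0])
    return pr, sr, max(sr)


def make_toy_cost(pr0: Sequence[int], sr0: Sequence[int],
                  pr_weight: int = 3, sr_weight: int = 1) -> CostFn:
    """Hypothetical stand-in for the move cost: giving up a private register
    is assumed to cost more than giving up a shared one."""
    def cost(pr: Sequence[int], sr: Sequence[int]) -> float:
        return pr_weight * (sum(pr0) - sum(pr)) + sr_weight * (sum(sr0) - sum(sr))
    return cost


# Two threads, 9 physical registers, initial requirement 4 + 4 + max(3, 2) = 11.
toy_cost = make_toy_cost([4, 4], [3, 2])
pr, sr, sgr = inter_thread_allocate([4, 4], [3, 2], [2, 2], [4, 4], 9, toy_cost)
```

Under this toy cost model both reductions are taken from the shared registers, so the private counts are left unchanged. With a real intra-thread allocator the cost would be the number of inserted move instructions, and the choice could differ at every step.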
Every time we reduce the PR of one thread, we check if it is larger than the lower bound. Also, the lower bound R_i = PR_i + SR_i >= MinR_i is verified whenever either PR_i or SR_i is reduced.

INPUT: N_thd, N_reg, CFGs of all threads
OUTPUT: all PR_i and SR_i, SGR, CFGs after register allocation

/* Intra-thread register allocator, returns move cost */
Intra_thd_allocator(CFG, GIG, PR, SR);

ALGORITHM: Inter_thd_reg_allocation
  Build_GIG()
  Estimate_reg_requirement()
  While (sum(PR_i) + max(SR_1, SR_2, ..., SR_Nthd) > N_reg)
    Foreach PR_i > MinPR_i and PR_i + SR_i > MinR_i do
      cost_PR_i = register allocation cost after reducing PR_i by 1
    od
    max_SR = max(SR_1, SR_2, ..., SR_Nthd)
    cost_SR = register allocation cost after reducing all SR that equal max_SR by 1,
              if all such SR can be reduced (by checking PR+SR > MinR)
    Find the min one among cost_SR, cost_PR_1, cost_PR_2, ...
    Choose the one with minimal cost, modify PR and SR.
  Endw
  Actually modify the CFGs based on the new PR and SR
  SGR = Max(SR_i)
  Return all PR_i and SR_i, SGR, all CFGs

Figure 8. Algorithm for inter-thread register allocation.

The function Intra_thd_allocator is an intra-thread register allocator. It accepts the PR and SR, then tries to return an allocation using PR and SR number of registers. This function is called when we calculate register allocation costs for each thread and when we finally modify the CFG. It returns the allocation cost. The interference graph and coloring scheme given by the function Estimate_reg_requirement can be passed to the intra-thread register allocator as a starting point. However, to provide more flexibility, we leave this to the implementation of Intra_thd_allocator.

The complexity of our heuristic algorithm is O(N_reg * N_thd) * O(Intra_thd_allocator), which largely depends on the complexity of the intra-thread register allocator. Our register allocation algorithm generates satisfactory solutions for all benchmark programs within almost negligible compilation time.

7. INTRATHREAD REGISTER ALLOCATION

The intra-thread register allocator attempts to allocate up to
PR number of physical registers to boundary nodes and up to R=PR+SR physical registers to all nodes.

7.1 Move Insertion and Live Range Splitting

Our intra-thread register allocation is based on live range splitting and move instruction insertion. Live range splitting has been used in register allocation [11] to spill part of a live range to memory. In this paper, we attempt to split the live range by inserting move instructions to reduce the chromatic number. Lemma 1 has shown that MinPR can be reached through live range splitting. Figure 9 gives another example. In Figure 9.a, live ranges A, B and C interfere with each other at three different CSB points. The lower bound lemma in section 3 gives MinPR=2, but the interference graph must be colored with 3 colors, because A, B and C form a clique. In Figure 9.b we split the live range of variable A into A1 and A2 by inserting a move instruction at the split point. The resulting interference graph can be colored with 2 colors, which is equal to MinPR. Notice that this is also the way we reduce the number of registers required in the first example (Figure 3.c).

In our intra-thread allocation algorithm, we focus on live range splitting through move insertion because spills are too expensive on network processors and our experiments show that MinPR (MinR) is much smaller than MaxPR (MaxR). This gives us room to reduce the chromatic number toward the lower bound by inserting move instructions.

[Figure 9: (a) live ranges A, B and C meeting at CSBs, colored A=1, B=2, C=3; (b) after a move splits A into A1 and A2, colored A1=1, A2=2, B=2, C=1.]

Figure 9. Live range splitting via move insertion.

7.2 Intra-thread Register Allocation Algorithm

Our register allocator works incrementally, i.e. it records the context (the interference graph with split nodes and the positions of move instructions) of the last 2 invocations and modifies the context to satisfy the new PR and SR values. Notice that the inter-thread allocation algorithm in Figure 8 calls Intra_thd_allocator multiple times. In each step, it either accepts the previous context and reduces PR or SR by 1, or it rejects the previous modification, starts from the previous-to-previous context, and reduces PR or SR by 1. Incremental modification saves time on otherwise repetitive work.
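The effect of the split in Figure 9 (Section 7.1) can be checked mechanically. The sketch below is only an illustration: the edge lists are an assumed reading of the interference relations implied by the color tables of Figure 9 (in particular, the split graph with edges A1-B, B-C and C-A2 is an assumption), and `chromatic_number` is a hypothetical brute-force helper that is practical only for very small graphs.

```python
from itertools import product
from typing import List, Sequence, Tuple


def chromatic_number(nodes: Sequence[str], edges: List[Tuple[str, str]]) -> int:
    """Smallest k such that the nodes can be k-colored with no edge joining equal colors."""
    for k in range(1, len(nodes) + 1):
        # Try every assignment of k colors to the nodes (exponential, tiny graphs only).
        for assignment in product(range(k), repeat=len(nodes)):
            color = dict(zip(nodes, assignment))
            if all(color[u] != color[v] for u, v in edges):
                return k
    return 0  # only reached for an empty node list


# Figure 9.a: A, B and C form a clique.
before_split = chromatic_number(["A", "B", "C"],
                                [("A", "B"), ("A", "C"), ("B", "C")])

# Figure 9.b (assumed edges): A is split into A1 and A2 by a move, so A1 only
# interferes with B and A2 only interferes with C; B and C still interfere.
after_split = chromatic_number(["A1", "A2", "B", "C"],
                               [("A1", "B"), ("B", "C"), ("C", "A2")])
```

Here `before_split` evaluates to 3 and `after_split` to 2, matching the statement that splitting A lets the graph be colored with MinPR = 2 colors at the price of one move instruction.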
Further, based on the record of the two contexts, we can assume that each time the allocator is invoked, it attempts to reduce either PR or SR by 1 from one of the recorded contexts. We name these two kinds of invocation Reduce-PR invocation and Reduce-SR invocation.

Reduce-PR Invocation

In this type of invocation the allocator wants to reduce the PR by one from its last invocation. In other words, the last accepted context can color all boundary nodes with PR colors and this invocation wants to color them with PR-1 colors. In this stage, we assume all move instructions are inserted near the CSB. With this assumption, we do not need to alter the colors of internal nodes. Normally, changing the colors of both internal and boundary nodes might induce more move instructions (in this case we must split the live range to recolor an internal node) and increase the cost accordingly. Later, we will show that some of the move instructions at the CSB can be eliminated by merging them with move instructions inside the NSR. This actually relocates the move instructions from the CSB boundary. Before the discussion of our algorithm, we first define the Neighbor Color Number (NCN).

Definition: Neighbor Color Number (NCN): The number of colors used by the neighbors of a given node in a colored graph.

INPUT: PR, SR
OUTPUT: cost (number of inserted move instructions)
Static context_pre, context_pre_pre

FUNCTION Reduce_PR(context): cost
Begin
  Foreach color c in PR do
    Cost = 0
    Foreach node s in Set_color_node(c, BIG) do
      If NCN(s, BIG) < PR-1 then
        Change s to another color c' in PR other than c.
        Cost += min(Cut_if_conflict(s, c, c')) for all possible c'
      Else
        Cost += min(NSR_exclusion_cost(s, c, c')) for each color c' in PR other than c
        Add newly split node with color c to Set_color_node(c, BIG) if it is a boundary node
      Endif
    od
    Eliminate_unnecessary_move()
    Record to min_cost if this cost is smaller and record the context.
  od
  Keep the minimal cost context and return min_cost
End

FUNCTION Reduce_SR(context): cost
Begin
  Foreach color c in SR
    Cost = 0
    Foreach NSR_i where color c is used do
      Foreach internal node s in Set_color_node(c, IIG_i) do
        If NCN(s, GIG) < R-1 then
          Color s with a color other than c.
        Else
          Cost += min(live_range_exclusion_cost(s, c, c')) for each color c' in R other than c
          Add newly split node with color c to Set_color_node(c, IIG_i)
        Endif
      od
    od
    Eliminate_unnecessary_move()
    Record to min_cost if this cost is smaller and record the context.
  od
  Keep the minimal cost context and return min_cost
End

FUNCTION Intra_thd_allocator(PR, SR): cost
Begin
  According to the accepted context, pick the stored context_pre or context_pre_pre => context.
  If (PR is reduced) return Reduce_PR(context)
  Else if (SR is reduced) return Reduce_SR(context)
  Else return cost for the context  // no change
End

Figure 10. Algorithm for intra-thread register allocation.

The algorithm in Figure 10 uses function NCN(s, BIG) to get the neighbor color number of node s on the BIG. The algorithm also works in a greedy manner. It tries each color c among the PR colors and checks the cost to eliminate that color. Then, the color with the least elimination cost is selected to be eliminated and all needed move instructions are inserted. Function Set_color_node(c, BIG) returns the set of nodes on the BIG with color c. We need to change every node in this set to a different color in PR.
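The NCN test used by Reduce_PR can be pictured with a small sketch. It assumes that the interference graph is stored as a dictionary of neighbor sets and the coloring as a dictionary from node to color number; the names `ncn` and `recolor_candidates` and the example graph are hypothetical and serve only to illustrate the definition.

```python
from typing import Dict, Iterable, List, Set

Graph = Dict[str, Set[str]]


def ncn(node: str, graph: Graph, color: Dict[str, int]) -> int:
    """Neighbor Color Number: how many distinct colors the node's neighbors use."""
    return len({color[n] for n in graph[node] if n in color})


def recolor_candidates(node: str, graph: Graph, color: Dict[str, int],
                       palette: Iterable[int]) -> List[int]:
    """Colors in the palette, other than the node's own, not used by any neighbor."""
    used = {color[n] for n in graph[node] if n in color}
    return sorted(c for c in palette if c != color[node] and c not in used)


# Hypothetical BIG colored with PR = 3 colors; color 1 is the one to eliminate.
PR = 3
big: Graph = {"s": {"x"}, "u": {"x", "y"}, "x": {"s", "u"}, "y": {"u"}}
coloring = {"s": 1, "u": 1, "x": 2, "y": 3}
palette = range(1, PR + 1)

s_free = recolor_candidates("s", big, coloring, palette)  # NCN(s) = 1 < PR-1
u_free = recolor_candidates("u", big, coloring, palette)  # NCN(u) = 2 = PR-1
```

In this example node s has a spare color available, so it falls into the first branch of Reduce_PR, while node u has none and would require NSR exclusion.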

Firstly, we check the NCN of s that has color c on the BIG. If this number is less than PR-1 (which means there is at least one color available in PR not used by its neighbors), we can change s to another color. Since we have changed s's color on the BIG and s may internally interfere with other internal nodes or boundary nodes (two boundary nodes can interfere only inside an NSR but not on the CSB), we need to check if there is a color conflict. The function Cut_if_conflict(s, c, c') attempts to insert move instructions to disconnect such edges.

Figure 11 shows how the disconnection is done and the corresponding changes on the GIG. In Figure 11.a, s is originally colored with color c; after node s is changed to color c' from color c, it conflicts with internal node t. We insert a move at the CSB, so live range s is split. The part of the live range in NSR2 becomes s', and this part can keep color c, so it does not conflict with t, while on the BIG s is changed to color c'. Figure 11.b shows the changes on the GIG. The edge between s and t is eliminated after s' splits from s. s' keeps the original color of s, so in the IIG it is compatible with t, while on the BIG the color of s is changed. In the algorithm, we try every candidate color for s and pick the one with minimal cost.

[Figure 11: (a) boundary node s crossing the CSB from NSR1 into NSR2, where a move splits off s' next to internal node t; (b) the corresponding change on the GIG.]

Figure 11. Node splitting to change the color of node s.

If this step fails, i.e. NCN(s, BIG)=PR-1, the algorithm calls function NSR_exclusion_cost(s, c, c') to get the cost of changing s to another color c' and of excluding all the NSRs with conflict nodes. NSR_exclusion_cost looks at each NSR where s is live to see if there is any node with color c' in it. If so, the NSR is excluded by splitting the live range of s in that NSR and by inserting move instructions. In our approach, the NSRs are split in whole, i.e. either the live range in that NSR is kept with color c' (if there is no conflict) or the live range is split (after splitting, s' in that NSR keeps color c).

[Figure 12: boundary node s live across NSR1 to NSR4; moves at the CSBs split off s' in NSR2 and NSR3, where internal node t and boundary node r are live.]

Figure 12. NSR exclusion to reduce PR.

Figure 12 shows how NSR exclusion is done.
Boundary node s cannot change to color c' because the boundary node r and the internal node t are using color c'. The conflict NSRs are NSR2 and NSR3, where t and r are live. So, these two NSRs are excluded from the live range of the original boundary node s. On the GIG, we see that s' is split from s, and s can now be colored with c'. s' keeps color c and it is still compatible with t and r. Notice that, after splitting, the edge originally connected from r to s is connected to s'. Therefore, the NCN of s is reduced and s can be recolored with c'. The algorithm tries each color other than c to recolor s and finds the minimal value to finally color s. Also notice that, after this step, s' is colored with c and, if it is a boundary node, we should add s' to Set_color_node(c, BIG), and we will color it with some other color during later iterations. Set_color_node(c, BIG) will not grow infinitely, since further splitting will finally generate internal nodes.

Reduce-SR Invocation

To reduce SR, we check each color c in SR to see which one can be eliminated with minimal cost. The cost is calculated by adding up the costs in every NSR where this color is used. Also notice that in this step, all boundary nodes are assumed to have fixed colors so that the phase will not affect the PR number. The algorithm tries to recolor nodes with color c in an NSR to other colors. If the node on the GIG has NCN less than R-1, we can just pick a free color and color the node without any cost. Otherwise, live range splitting is needed.

Live range splitting is illustrated in Figure 13. In Figure 13.a, the example has 3 basic blocks. Live range s is recolored with color c'; however, live range t also uses color c'. Our algorithm then splits s at the boundary where the two live ranges overlap. After splitting, s' can still use color c and s now changes to c'. We assign the color with minimal cost to node s. After the splitting, node s' is pushed into Set_color_node(c, IIG_i), because now it bears color c. This process will finally stop. After each splitting, the live range with color c is reduced.
Since the value R-1 ≥ RegP_max (according to the lower bound estimation in section 5 and the algorithm in Figure 7), in the extreme case where each live range is a single program point, there will be at most RegP_max nodes co-live and a live range with color c can always be recolored.

[Figure 13: a move inserted inside the NSR splits a live range where it overlaps another live range of the same color.]

Figure 13. Excluding a live range within NSR to reduce SR.

Eliminate Unnecessary Moves

During the attempt to reduce PR, we assume that all move instructions are inserted near the CSB boundary, and during Reduce-SR some move instructions are inserted inside the NSR. At this point, we can merge some of the internal move instructions with those at the boundary. For two consecutive moves, the first move instruction on the live range is unnecessary if the color at the entrance to the first move is also acceptable in the region between the two move instructions. We can safely eliminate the first move, and this actually relaxes the restriction in Reduce_PR that binds moves to the CSB.

8. THE SRA PROBLEM

For the SRA problem (defined in section 2), given the
PR are equal and SR are also equal. The restriction can be rewritten in a simple form:

$$N_{thd} \cdot PR + SR \le N_{reg}$$

Thus, the inter-thread register allocation algorithm can also be simplified. There are only two possibilities to reduce the register requirement. Due to the shrunk solution space, for the algorithm in Figure 8, we can actually traverse all the possible PR and SR values to find the best solution.

9. EXPERIMENTAL RESULTS

The evaluation of our algorithm is done with the Intel-provided simulation environment IXP1200 Developer Benchmark. The IXP1200 workbench supports cycle-accurate simulation for IXP microengines and other peripheries with high fidelity. In this section, we experiment with 11 benchmark programs and some of their combinations to see the effectiveness of the register allocator. These benchmarks are collected from Commbench [15], Netbench [16], Intel-provided example code and a packet scheduling algorithm from [18].

To evaluate our algorithm, the benchmark programs are rewritten in IXP C code (a subset of standard C) and a few of them are directly written in assembly (microcode). For those written in assembly code, we restore the virtual registers so that our register allocator can work on the live ranges from scratch. Our pass builds the CFG and interference graphs from the assembly code, after a simple translation of the assembly directives. The assembly code is then passed to the assembler to generate machine code. The IXP assembly consists of only 40 RISC instructions, which makes the translation easy. The assembler simply exits if too many registers are required. However, after our pass, the register requirements are always satisfied, so the machine code can be generated properly.

Table 1 shows the properties of the benchmark programs. The code size is the number of instructions after code generation. The cycle counts are measured as follows: some programs like L2l3forward cannot run to a stop in finite time, since these programs all run in a while loop to accept and process packets, so their cycle counts are average numbers per iteration of the main loop. We list the number of CTX instructions (context switch instructions, which include load/store, voluntary context switch and other I/O operations that can cause a context switch) each benchmark has.
Roughly, about 10% of the instructions are CTX instructions. The CTX instructions here do not include spill instructions, as we have removed all spills and reconstructed the original live ranges (we did this based on the source code and the annotation embedded in the generated assembly code by the Intel IXP compiler). The number of live ranges (nodes on the GIG) is listed in the 5th column. These numbers come from the restored virtual registers. Columns 6 and 7 are the maximal register pressure in the program (RegP_max) and the maximal register pressure at the CSB (RegPCSB_max). These are the lower bound estimations for the register requirement of the thread. Columns 8 and 9 are the upper bound estimations for R and PR based on the algorithm in Figure 7. The 10th and 11th columns give statistics for the number of NSRs and their average sizes. One observation is that normally larger NSRs lead to a bigger difference between the maximal and minimal value of P and PR. Because more internal nodes can exist in larger NSRs, the register pressure for the GIG should exceed that of the BIG by a larger margin.

Figure 14 evaluates our inter-thread register allocation algorithm for SRA. The same evaluation for ARA is combined in Table 3. For each benchmark program, we show two relevant bars. The first bar is the number of registers allocated to the benchmark assuming only a single thread is available. We use a Chaitin [9] style register allocator for comparison with our shared register allocator. The second and third bars are the number of private registers and shared registers assigned with our inter-thread register allocation algorithm. The same benchmark is assumed to execute on four threads. The algorithm continues until the cost returned is non-zero, which means we want to test how many PR and SR are needed without any move instruction insertion with the inter-thread allocation algorithm. The figure shows that the number of private registers allocated for the multi-threaded case is less than the number of registers needed for standalone register allocation. This is not surprising because shared registers can take care of the higher register pressure inside the NSR. If no shared registers are used and each thread runs the single-thread register allocator, many registers are wasted.
Compared to the case with the multi-threaded register requirement, i.e. 4*PR+SR, the average total register saving for all benchmarks is 24%.

In Table 2, we collect data for the extreme case with our register allocation algorithm, i.e. the maximal number of move instructions that will be inserted if only the minimal number of registers is allocated. This means our algorithm must split many live ranges to reach the minimal number of registers. The move insertion overhead in the extreme case is mostly within 10% of the total number of instructions for the benchmarks. This cost is affordable compared to the overhead due to register spills if the register number is out of range with the single-thread register allocation algorithm.

Finally, Table 3 evaluates our register allocation algorithm for ARA with 3 scenarios. Notice that all tasks are periodic, independently share the CPU and execute forever. Thus, we measure the performance improvement of each thread in terms of the percentage reduction of cycles per iteration. The first scenario puts two Md5 programs on threads 0 and 1, and two fir2dim on threads 2 and 3. This can be a processing module between the receiving and sending modules. Our data shows the PR and SR assigned, the number of live ranges after the register allocation (#Live Range), the reduction in the number of context switch instructions and the cycle changes. The column #CTX Reg Spill is for the original code generated by the Intel compiler, which allocates registers with spilling and without register sharing across threads (it only allocates 32 registers for each thread). #CTX Reg Sharing is the number with our allocator (actually no change compared with Table 1, because we avoid spills). The same is true for cycle counts (#Cycle Reg Spill and #Cycle Reg Sharing). The fir2dim actually runs slower due to inserted moves, but this is profitable due to the big saving from Md5. Thus, the allocator is able to boost the performance-critical threads (Md5) by slightly slowing down less performance-critical ones (fir2dim).

The second scenario consists of L2l3fwd receive and send on threads 0 and 1 and Md5 on threads 2 and 3. This can be a complete processing module serving one sending and one receiving port.
The results still show that the spills are saved for Md5 with a minor cost for moves on the L2l3fwd threads. The last scenario runs wrap receive and send on threads 0 and 1, and fir2dim and frag on threads 2 and 3. The allocator balances register allocation to satisfy the wrap threads. Due to a high register pressure, wrap receive and send can run much slower (due to spills) if registers are not allocated properly. Our results show that over 20% speedup is achieved for wrap, whereas only a slight slowdown is incurred for the other two benchmarks, which is in accordance with our optimization objective of boosting performance-critical threads.
