Loop Pipelining in Hardware-Software Partitioning


Jinhwan Jeon and Kiyoung Choi
School of Electrical Engineering, Seoul National University, Seoul, Korea

Abstract

This paper presents a hardware-software partitioning algorithm that exploits a loop pipelining technique. The partitioning algorithm is based on iterative improvement. The algorithm tries to minimize hardware cost through hardware sharing and hardware implementation selection without violating a given performance constraint. The proposed loop pipelining technique, which is an adaptation of a compiler optimization technique for instruction-level parallelism, increases parallelism within a loop by transforming the structure of an input system description. By combining this technique with our partitioning algorithm, we can further reduce the hardware cost and/or improve the performance of the partitioned system. Experiments show about 19% performance improvement and 44% reduced hardware cost for a JPEG encoder design, compared to the results without loop pipelining.

1. Introduction

Mixed hardware and software implementation is common in the design of digital systems such as communication systems, DSP applications, and embedded systems. In general, software is easy to modify, maintain, and upgrade, though it is slow compared to hardware. Hardware can be made faster than software, but the cost of an all-hardware solution is usually too high. An issue raised in designing such systems is to find an optimal point between the all-hardware solution and the all-software solution, where we obtain maximum performance at minimum hardware cost. This process is called hardware-software partitioning.

Several heuristics have been proposed for the hardware-software partitioning problem [1, 3, 4, 5, 6, 7, 9, 11]. Gupta et al. [4] proposed a hardware-oriented approach in which all the operations except for the data-dependent delay operations are initially mapped to hardware. Then the partitioner repeatedly moves a node to software while the performance constraints are met. The node to be moved to software is selected by a greedy method.
Ernst, Henkel, and Benner [3, 5] proposed a software-oriented approach in which all the units of partitioning (called BSBs) are initially mapped to software. They used a simulated annealing algorithm for partitioning. Vahid et al. [11] proposed a binary-constraint search algorithm which searches the design space while changing the hardware constraint in a binary-search fashion. For each hardware constraint, they run a simulated annealing algorithm and check whether the performance constraint is met.

Kalavade et al. [6, 7] proposed the global criticality/local phase (GCLP) driven algorithm and the hardware-software mapping and implementation-bin selection (MIBS) algorithm. The key feature of the GCLP algorithm is its adaptive objective mechanism, driven by global and local measures. The algorithm applies one of two objectives according to the global time criticality (GC). If the global time is critical, the objective function is to reduce the latency. Otherwise, the objective function is to minimize the hardware resources. Since GC does not contain information on each node, the local phase (LP) is used to represent the preference of each node. MIBS is an extension of GCLP. It not only partitions the nodes but also finds the implementation method of each hardware node that minimizes hardware resources.

Knudsen et al. [9] proposed a dynamic programming algorithm. This algorithm assumes an execution model in which software cannot execute other jobs while hardware is running, and assumes a realistic communication model. Based on this model, the algorithm finds an optimal solution using dynamic programming.

Adams et al. [1] proposed a multiple-process behavioral synthesis algorithm for heterogeneous systems. They use inter-process code motion to partition and allocate an input system description which is originally composed of one process. To schedule code segments within a process, intra-process code motion is used. Such code motions are made at random to increase concurrency between processes and to improve performance and cost, while overall performance and cost are optimized by simulated annealing.
The approaches in [3, 5, 9, 11] assume an execution model that does not allow parallel execution of hardware and software, leading to limited performance improvement. Though other approaches [1, 4, 6, 7] exploit the parallelism that resides in the system, they only utilize the explicit parallelism given by the input description. Therefore, a codesign approach for software acceleration seems to have little advantage over a pure software solution [12].

In this paper, we propose a loop pipelining technique which increases parallelism for more effective hardware-software partitioning. In addition, we propose yet another partitioning algorithm which maps nodes to hardware or software considering various hardware implementation alternatives and hardware resource sharing. Though the proposed loop pipelining technique is not a new idea, we show that a simple combination of the pipelining technique with a partitioning algorithm gives more room for software acceleration with less hardware cost than existing algorithms, which is the contribution of our work. Our partitioning algorithm with loop pipelining is suitable for computation-intensive applications mainly composed of loops, as is common in most DSP applications.

Bakshi et al. recently proposed a method [13] which deals with the same subject: partitioning and pipelining. They perform hardware-software partitioning by simply mapping a node to hardware if it cannot meet the throughput constraint in software. Then they repeat an optimization process consisting of pipelining, scheduling, and processor allocation until the throughput constraint is met. However, their simple partitioning scheme does not work (that is, all the nodes are mapped to software) when every node meets the throughput constraint, as is often the case when small granularity is used. In that case, they prefer an all-software solution on a multi-processor target architecture to a mixed hardware-software solution. However, a multi-processor solution is not always cheaper than a one-processor solution with a small ASIC. In our approach we perform pipelining before partitioning. Therefore, we can consider all the nodes as hardware candidates. Moreover, we can combine our loop pipelining technique with any other existing partitioning algorithm.

Our paper is organized as follows. In Section 2 we give an overview of our partitioning algorithm. In Section 3, we propose a loop pipelining technique for partitioning and an algorithm for solving the extended hardware-software partitioning problem. Section 4 shows experimental results before we conclude in Section 5.

2. Overview

Figure 1 illustrates the steps used to partition an input system description. The first step is to transform an input behavioral description into a CDFG, which is used as an intermediate format for hardware-software partitioning. The CDFG is formally defined as a graph G = (N, E), where each node represents an operation or a set of operations (e.g., task, process, or code grouping) and each edge represents a data or control dependency between nodes.

[Figure 1. Overview of hardware-software partitioning steps. Blocks: input description (C, VHDL), CDFG, hardware synthesis information, software profiling information, estimator, cost function, loop pipelining, partitioner, performance constraint.]

The second step is to obtain hardware synthesis and software profiling information for partitioning.
Task-level granularity (a task being a leaf procedure or function) is used to obtain such information from the CDFG. Hardware synthesis is done by Hyper [2]. Hyper is a performance-constrained, area-minimizing high-level synthesis tool. We can obtain many implementation alternatives by changing the performance constraint. The synthesis information from Hyper includes the number of execution units used, the number of register files, the total execution delay, and the cost of each execution unit. For hardware synthesis, we must translate the CDFG into Silage [2]. Since we have not implemented an automatic translator yet, the translation is done manually. However, we have implemented a tool for translation from CDFG to C code, and we use it to obtain software profiling information, including the execution delay and invocation count of each task.

The final step is to partition the CDFG into hardware and software parts to satisfy the performance constraint given by the user. The hardware synthesis information and the software profiling information obtained in the previous step are used for estimating the execution time and the hardware cost. The partitioning step is composed of a loop pipelining stage and an iterative improvement stage. Loop pipelining is performed before the iterative improvement stage in order to increase parallelism within a loop. In the iterative improvement stage, hardware-software partitioning is performed such that the cost is minimized while the performance is kept above the constraint. In this stage, besides performing hardware-software mapping, we consider various hardware implementation alternatives to select the best possible one, and we share hardware modules.

Currently our target architecture consists of a single general-purpose processor and multiple ASICs, although the proposed algorithm can be extended to the case of multiple processors by replacing the performance estimation method (in Section 3.2) with the scheduling method proposed in [4]. We assume a memory-mapped communication model where no hardware is dedicated to communication; that is, software is blocked until communication completes.

3. Partitioning Approach

3.1. Notation

Our partitioning algorithm focuses on performance-constrained hardware cost minimization. The performance constraint given by the user is denoted as $D$. We denote a node in the CDFG as $n_i$ and an edge from $n_i$ to $n_j$ as $e_{i,j}$. Each node $n_i$ has information including the execution delay ($d_i$) and the hardware-software mapping. Each node $n_i$ has hardware implementation alternatives which can be represented by an implementation curve $IH_i$. Such curves can be obtained using Hyper [2]. We denote the set of all the predecessors (successors) of $n_i$ as $pred(n_i)$ ($succ(n_i)$). If node $n_i$ is chosen to be implemented in hardware and its predecessors (successors) are implemented in software, we insert communication nodes between $n_i$'s predecessors (successors) and $n_i$. We denote the communication node between $n_j$ and $n_i$ ($n_i$ and $n_k$) as $nc_{j,i}$ ($nc_{i,k}$), and the communication delay as $dc_{j,i}$ ($dc_{i,k}$), where $n_j \in pred(n_i)$ ($n_k \in succ(n_i)$).

3.2. Estimation

The partitioner evaluates the quality of a partitioned system based on two metrics: total execution delay and total hardware cost.

First, the total execution delay is estimated by a simple list scheduling algorithm, similar to the one proposed in [10]. For the list scheduling of hardware nodes, priority is given to the node with the largest sum of its own delay and all successors' delays, thereby allowing the most critical hardware node to be scheduled first. For software nodes, priority is given to the node which has the hardware successor with the highest priority, thereby allowing the software node that leads to the most critical hardware node to be scheduled first. We prioritize software nodes only for better scheduling of hardware nodes, because the ordering of software nodes does not affect the performance of the software when we use a single processor. According to this scheme, the priority value $p_i$ of node $n_i$ is defined as

$$
p_i = \begin{cases}
\sum_{n_k \in succ(n_i)} d_k + d_i & \text{if } n_i \in N_{HW} \\
\max_{n_k \in succ(n_i) \cap N_{HW}} (p_k) & \text{if } (n_i \in N_{SW}) \wedge (succ(n_i) \cap N_{HW} \neq \emptyset) \\
0 & \text{otherwise}
\end{cases}
$$

where $N_{HW}$ ($N_{SW}$) denotes the set of all hardware (software) nodes. We consider the hardware sharing effect during list scheduling by making the sharing nodes have the same resource id. This list scheduling algorithm is applied to each basic block in the CDFG to obtain an estimate of the execution delay of each basic block. Then, by recursively summing up the values obtained by multiplying the invocation count of each basic block by the block's execution delay, we can calculate the total execution delay.

We estimate the total hardware cost based on the synthesis information provided by Hyper. In Hyper, this cost is hardware area. If there is no sharing among hardware nodes, the hardware cost is estimated by simply summing up the hardware cost of each hardware node. If multiple hardware nodes share hardware resources, the total cost is reduced by the amount of shared resources. To account for resource sharing, we need to examine the hardware architecture. The target architecture of Hyper is composed of execution units, register files, a control unit, and multiplexers, which are connected by a crossbar network.
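Returning to the scheduling priority $p_i$ defined earlier in this section, the rule can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: it assumes that succ(n_i) means the immediate successors, and the four-node graph, its node names, and its delays are hypothetical.

```python
from typing import Dict, List, Set


def priorities(delay: Dict[str, float],
               succ: Dict[str, List[str]],
               hw: Set[str]) -> Dict[str, float]:
    """Compute the list-scheduling priority p_i of every node.

    delay: execution delay d_i per node
    succ:  successors of each node (assumed to be immediate successors)
    hw:    nodes mapped to hardware (N_HW); all other nodes are software
    """
    p: Dict[str, float] = {}
    # Hardware nodes: own delay plus the delays of all successors.
    for n in delay:
        if n in hw:
            p[n] = delay[n] + sum(delay[k] for k in succ.get(n, []))
    # Software nodes: highest priority among hardware successors, else 0.
    for n in delay:
        if n not in hw:
            hw_succ = [p[k] for k in succ.get(n, []) if k in hw]
            p[n] = max(hw_succ) if hw_succ else 0.0
    return p


# Hypothetical graph: s1 -> h1 -> h2 -> s2, plus s1 -> h2.
delay = {"s1": 2.0, "h1": 3.0, "h2": 1.0, "s2": 4.0}
succ = {"s1": ["h1", "h2"], "h1": ["h2"], "h2": ["s2"], "s2": []}
p = priorities(delay, succ, hw={"h1", "h2"})
```

In this toy graph the software node s1 inherits the priority of its most critical hardware successor, while s2, which has no hardware successor, gets priority 0.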
Currently, among the hardware resources of the Hyper target architecture, we consider only the execution units as resources that can be shared. We estimate the total hardware cost by subtracting the cost of shared resources from the sum of all the hardware nodes' costs. We ignore the area increase due to the added multiplexers and wiring.

3.3. Loop Pipelining

Since a loop is generally the most time-critical part of a computation-intensive application, there have been many loop optimization techniques for parallel computing. Software pipelining is one of those techniques; it overlaps the execution of code blocks in different iteration steps. To allow such an execution overlap, there must be no data dependency between subsequent loop iterations. Most data processing algorithms, which receive an input data stream and generate an output data stream, have a structure suitable for this kind of optimization. The loop pipelining technique for partitioning, which we propose in this paper, is an adaptation of the software pipelining technique to increase the parallelism within a loop. We can exploit the parallelism through concurrent execution of hardware and software as well as concurrent execution of hardware modules.

Our loop pipelining technique for partitioning consists of the following three steps. We assume that the user gives the number of pipeline stages ($N_{ps}$) beforehand.

1. Find feedback edges, which represent data dependencies to the next iteration of the loop. Then, for each feedback edge $e_{i,j}$, make a cluster node which consists of the nodes that exist between $n_i$ and $n_j$. Finally, recursively merge cluster nodes that share a node into one cluster node, such that there are no common nodes among cluster nodes.
2. By grouping nodes and/or cluster nodes in topological order, make initial pipeline blocks which can be overlapped within the loop. Then, by repeatedly moving a node from one pipeline block to the neighboring pipeline block, find an optimized set of pipeline blocks such that the communication between pipeline blocks is minimized and the delays of all the blocks are balanced. During this process, we keep the number of pipeline blocks equal to $N_{ps}$.
3. Transform the loop such that all the pipeline blocks can run in parallel.

The purpose of the first step is to prevent a feedback edge from being cut by a pipeline block boundary. All the nodes connected by a feedback edge are put into one cluster node. Otherwise, they may cause a data dependency violation. Figure 2 (a) shows this step.

In the second step, an initial set of pipeline blocks is built by grouping nodes such that the number of pipeline blocks is equal to $N_{ps}$. In Figure 2 (b), two pipeline blocks, 1 and 2, are built by the procedure listed in step 2. The criteria for grouping are the execution delay of each pipeline block and the communication between subsequent pipeline blocks. First, it is desirable that the execution delays of the pipeline blocks be equal, in order to reduce the critical path of the transformed loop and increase parallelism among the partitioned blocks. Second, since variable copy instructions (s7 in Figure 2 (c)) must be inserted to compensate for the cut edges, it is desirable that the communication between subsequent pipeline blocks be minimized, in order to reduce the overhead induced by variable copy instructions and the communication overhead from or to hardware nodes. For these optimizations, we use a greedy method which reduces the cost function $f_L$ defined as

$$
f_L = \sum_{i=1}^{P} \left| d_i - \frac{d_{loop}}{P} \right| + \alpha \sum_{i=1}^{P-1} n_{comm}(i)
$$

where $P$, $d_i$, $d_{loop}$, $n_{comm}(i)$, and $\alpha$ are the number of pipeline blocks, the delay of pipeline block $i$, the execution delay of the loop, the communication overhead from block $i$ to block $i+1$, and a weighting factor, respectively.
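To make the balancing criterion concrete, the following Python sketch evaluates a cost of this kind for two hypothetical block groupings. It assumes that the cost is each block's deviation from the ideal delay $d_{loop}/P$ plus the weighted communication overhead, and that $d_{loop}$ equals the sum of the block delays; the function name and all numbers are illustrative, not taken from the paper.

```python
from typing import List


def pipeline_cost(block_delays: List[float],
                  comm: List[float],
                  alpha: float) -> float:
    """Evaluate f_L = sum_i |d_i - d_loop / P| + alpha * sum_i n_comm(i).

    block_delays: delay d_i of each of the P pipeline blocks
    comm:         communication overhead between block i and i+1 (P-1 values)
    alpha:        weighting factor
    """
    p = len(block_delays)
    if len(comm) != p - 1:
        raise ValueError("expected one communication term per block boundary")
    # Assumption: the loop delay is the sum of the block delays.
    d_loop = sum(block_delays)
    imbalance = sum(abs(d - d_loop / p) for d in block_delays)
    return imbalance + alpha * sum(comm)


# Two hypothetical groupings of the same loop into P = 2 blocks.
balanced = pipeline_cost([5.0, 5.0], comm=[2.0], alpha=0.5)
skewed = pipeline_cost([8.0, 2.0], comm=[1.0], alpha=0.5)
```

With these numbers the balanced grouping scores lower than the skewed one even though it communicates more, which is the trade-off the weighting factor controls.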

In the final step, we transform the loop such that all the pipeline blocks can run in parallel, as shown in Figure 2 (c). Note that the variable assigned in s3 is renamed inside the loop so that blocks 1 and 2 can run in parallel, and s7 is added as epilogue code of the loop so that block 2 can use the updated value of the variable in the next iteration. Recall that in the second step of the loop pipelining technique, we try to reduce the overhead of this epilogue code.

[Figure 2. Loop transformation by loop pipelining: (a) merging nodes connected by a feedback edge; (b) making pipeline blocks; (c) overlapping pipeline blocks. The figure lists a loop over i = 0...N with statements s2 to s6 and the transformed loop over i = 1...N with the copy statement s7 added.]

This method can also be used as a post-optimizer to improve the performance of a partitioned system. Figure 3 illustrates the performance improvement by loop transformation. Assume Figure 3 (a) is the system partitioned by a hardware-software partitioner, where s4 and s5 are mapped to hardware. In the original structure, the processor must idle while s4 and s5 are running because there are no jobs to execute in parallel. However, if we transform the structure of the loop as shown in Figure 3 (b), the processor can execute s2 and s3 while the hardware is running s4 and s5.

[Figure 3. Loop transformation as a post-optimizer for a partitioned system: (a) partitioned system; (b) transformed system. The listings add Send and Recv statements around the hardware statements s4 and s5.]

3.4. Partitioning Algorithm

Our partitioning algorithm makes decisions regarding the implementation of each hardware node, sharing among hardware nodes, and hardware-software mapping, subject to a given performance constraint. Figure 4 shows pseudo code of the partitioning algorithm, where the total_delay() and total_cost() procedures are the total execution delay estimator and the total hardware cost estimator, respectively. The ReduceCost() procedure used in the algorithm tries to incrementally reduce and update the total hardware cost according to the steps listed below:

1. Find a pair of hardware modules which reduces the total hardware cost maximally by sharing, while satisfying the performance constraint.
2. Find a node $n_i \in N_{HW}$ which reduces the total hardware cost maximally while satisfying the performance constraint, when it is moved to another point on the implementation curve $IH_i$.
3. Find a node $n_i \in N_{HW}$ which reduces the total hardware cost maximally while satisfying the performance constraint, when it is mapped to software.
4. Among the candidates obtained from steps 1-3, adopt the candidate whose cost reduction is maximum.
5. Repeat steps 1-4 until no more candidates are available.
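The five steps above amount to a greedy loop over three kinds of candidates. The Python sketch below shows only that control structure; the candidate generators are hypothetical stand-ins (the real ones would consult the delay and cost estimators), and each is assumed to return only feasible candidates with a positive cost reduction.

```python
from typing import Callable, Dict, List, Optional, Tuple

# A candidate is (cost_reduction, apply), where apply() commits the change.
Candidate = Tuple[float, Callable[[], None]]
Generator = Callable[[], Optional[Candidate]]


def reduce_cost(generators: List[Generator]) -> float:
    """Greedily apply the best candidate until none remains.

    generators: one function per candidate kind (sharing, implementation
    change, move to software); each returns its best candidate that still
    satisfies the performance constraint, or None if it has none.
    Returns the total cost reduction achieved.
    """
    total = 0.0
    while True:
        found = [c for c in (g() for g in generators) if c is not None]
        if not found:
            return total
        reduction, apply = max(found, key=lambda c: c[0])
        apply()
        total += reduction


# Hypothetical pending reductions per candidate kind.
pending: Dict[str, List[float]] = {
    "share": [30.0, 10.0], "impl": [25.0], "to_sw": []}


def make_generator(kind: str) -> Generator:
    def generate() -> Optional[Candidate]:
        options = pending[kind]
        if not options:
            return None
        best = max(options)
        return best, lambda: options.remove(best)
    return generate


total = reduce_cost([make_generator(k) for k in ("share", "impl", "to_sw")])
```

Keeping the three candidate kinds behind one interface lets step 4 compare them directly by cost reduction, whatever the kind of change.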
/* Input: CDFG G(N, E), performance constraint D */
/* Output: partitioned CDFG G(N, E) */
Partition(G(N, E), D) {
    G_best = G; N_HW = φ; N_SW = N;
    /* first phase */
    G = GreedyPartition(G);            /* initial partition by greedy method */
    G_best = ReduceCost(G);            /* reduce hardware cost */
    cost_best = total_cost(G_best);    /* cost of initial partition */
    /* second phase: iterative improvement */
    do {
        N_fixed = φ;
        /* first inner loop: over-allocate hardware nodes */
        for (i = 0; i < N_max; i++) {
            n_cand = n_i ∈ N_SW, where {speedup/cost} is maximum in HW;
            N_HW = N_HW ∪ n_cand; N_SW = N_SW − n_cand;
            G' = ReduceCost(G);
        }
        if (cost_best > total_cost(G')) { G_best = G'; cost_best = total_cost(G'); }
        /* second inner loop: deallocate hardware nodes */
        do {
            /* map the node with the maximum hardware cost to software */
            n_cand = n_i ∈ (N_HW − N_fixed), where cost reduction is maximum in SW;
            N_SW = N_SW ∪ n_cand; N_HW = N_HW − n_cand; N_fixed = N_fixed ∪ n_cand;
            /* map SW nodes to HW to meet the performance constraint */
            while (total_delay(G) > D) {
                n_cand = n_i ∈ (N_SW − N_fixed), where {speedup/cost} is maximum in HW;
                N_HW = N_HW ∪ n_cand; N_SW = N_SW − n_cand; N_fixed = N_fixed ∪ n_cand;
            }
            G' = ReduceCost(G);
            if (cost_best > total_cost(G')) { G_best = G'; cost_best = total_cost(G'); }
        } while (N_fixed ≠ N);
        G = G_best;
    } while (cost improvement is obtainable);
    return G;
}

Figure 4. Pseudo code of the partitioning algorithm.
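As an illustration of the first phase only, the sketch below moves nodes to hardware in order of speedup per cost until a delay constraint is met. It is a simplified stand-in rather than the algorithm of Figure 4: the delay model assumes purely sequential execution (the paper uses list scheduling instead), and the node names and numbers are hypothetical.

```python
from typing import Dict, Set, Tuple

# Per node: (software delay, hardware delay, hardware cost); values hypothetical.
Node = Tuple[float, float, float]


def total_delay(nodes: Dict[str, Node], hw: Set[str]) -> float:
    # Simplifying assumption: nodes run one after another, so delays add up.
    return sum(n[1] if name in hw else n[0] for name, n in nodes.items())


def greedy_partition(nodes: Dict[str, Node], d_max: float) -> Set[str]:
    """Start from all software and move the node with the best speedup per
    hardware cost to hardware until the delay constraint d_max is met."""
    hw: Set[str] = set()
    while total_delay(nodes, hw) > d_max:
        candidates = [n for n in nodes
                      if n not in hw and nodes[n][0] > nodes[n][1]]
        if not candidates:
            raise ValueError("performance constraint cannot be met")
        best = max(candidates,
                   key=lambda n: (nodes[n][0] - nodes[n][1]) / nodes[n][2])
        hw.add(best)
    return hw


nodes = {
    "dct": (10.0, 2.0, 4.0),     # speedup 8 at cost 4
    "quant": (6.0, 3.0, 1.0),    # speedup 3 at cost 1
    "huffman": (4.0, 3.0, 2.0),  # speedup 1 at cost 2
}
hw = greedy_partition(nodes, d_max=12.0)
```

In this example the cheap node with the best ratio is moved first, and the loop stops as soon as the constraint holds, which leaves the least profitable node in software.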

The partitioning algorithm consists of two phases. In the first phase, starting from the all-software solution, the GreedyPartition() procedure makes an initial partition that satisfies the performance constraint by repeatedly mapping to hardware the node which has the maximum speedup per cost. During GreedyPartition(), the implementation of a hardware candidate node $n_i$ is selected at the fastest point of $IH_i$.

In the second phase, the algorithm iteratively improves the initial partition in two inner loops nested within an outer loop. In the first inner loop, software nodes are mapped to hardware while the cost is reduced by the ReduceCost() procedure, until the number of moved nodes reaches $N_{max}$, which is proportional to the number of nodes. The purpose of the first inner loop is to give the ReduceCost() procedure more chances of cost reduction by allocating more hardware nodes than are needed.

In the second inner loop, we select a node from the set of nodes currently mapped to hardware and move it to software. We select the node which will reduce the total hardware cost maximally after the move. Then nodes mapped to software are repeatedly moved to hardware until the performance constraint is met. The implementation of the nodes moved to hardware is selected at the fastest point of $IH_i$. During this procedure, once a node is moved to the other partition group (hardware to software or software to hardware), it is fixed in order to prevent it from being re-selected as a candidate node. After the performance constraint is met through the above procedure, the ReduceCost() procedure is called in order to reduce the total hardware cost. If the total hardware cost reduced by ReduceCost() is less than the best hardware cost obtained so far, the current partition is saved as the best partition.

Note that the partitioning result of ReduceCost() is saved to G', not G. One reason for this is that ReduceCost() can change the mapping of a fixed hardware node to reduce the total hardware cost, and saving the result to G could cause a problem in the iteration process.
Another reason is that accumulating the results of ReduceCost() may prevent G from escaping from a local optimum in the iterative improvement process. The inner loop repeats the above procedure until all the nodes are fixed, while the outer loop repeats the inner loop until no more improvement is available.

4. Experimental Results

The partitioning algorithm with loop pipelining is implemented in C++ under a UNIX environment. We use a JPEG encoder, described in 977 lines of C code, as an example. This example is a computation-intensive application which consists of two main loops and 20 tasks. The granularity of partitioning is task-level (i.e., the leaf task), where each task has 4-5 hardware implementation alternatives. Software profiling information is obtained on SPARC 1. We assume a memory-mapped communication model with 2 clocks of communication overhead for a 32-bit data transfer. Figure 5 shows design space curves for the example obtained by our partitioning algorithm in the following two cases:

case 1: partitioning without loop pipelining.
case 2: partitioning with loop pipelining.

[Figure 5. Hardware-software partitioning result for the JPEG encoder.]

In case 2, we set the number of pipeline blocks P to 3 for each main loop. From the figure, we can see that the curve of case 2 is always on the left side of that of case 1. This means that the partitioned system of case 2 is always faster than that of case 1 when they have the same hardware cost. The minimum delay of case 1 is 1.1 with a hardware cost of 1.83E+8, whereas that of case 2 is 0.89 with a hardware cost of 1.01E+8. The improvement of the minimum delay by the loop pipelining technique is about 19%. The improvement of the cost with the delay kept constant is 7 to 70%. This shows that loop pipelining for hardware-software partitioning is effective in improving both performance and cost. To the best of our knowledge, there is no previous algorithm that considers hardware sharing and implementation selection at the same time.
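As a quick check of the percentages, the relative reductions can be recomputed from the minimum-delay points reported above. The helper name is illustrative, and reading the 44% figure in the abstract as the hardware cost reduction between these two points is an inference from the numbers, not something the text states explicitly.

```python
def relative_reduction(before: float, after: float) -> float:
    """Fractional reduction going from `before` to `after`."""
    return (before - after) / before


# Minimum-delay points reported for the JPEG encoder (case 1 vs. case 2).
delay_gain = relative_reduction(1.1, 0.89)
cost_gain = relative_reduction(1.83e8, 1.01e8)

print(f"delay reduced by {delay_gain:.1%}, hardware cost by {cost_gain:.1%}")
```

The delay reduction comes out at roughly 19% and the cost reduction at a little under 45%, consistent with the figures quoted in the abstract.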
For lack of a comparable prior algorithm, we compare the result of the proposed algorithm with that of our own simulated annealing algorithm in order to show the effectiveness of our iterative improvement partitioning algorithm. Simulated annealing is an algorithm based on the concept of the probabilistic selection of randomly generated states [8]. We generate a new state by one of three random moves: toggling the hardware-software mapping (M1), changing the state of hardware sharing (M2), and changing the implementation of a hardware node (M3). The generation probability of M1 is 0.5, while M2 and M3 are generated with equal probability. The cost function $f_A$ of the annealing is defined as

$$
f_A = c_1 \cdot d + c_2 \cdot \mathit{hardware\_cost}
$$

where $c_1$ and $c_2$ are weighting factors, and $d$ is 0 if the total execution delay is smaller than $D$ and is (total execution delay $- D$) otherwise. To satisfy the performance constraint, $c_1$ is scheduled to increase as the annealing process proceeds with a cooling ratio of 0.9.

Table 1 compares the partitioning result of the proposed algorithm with that of simulated annealing. We exclude the information on computation time, since our comparison intends to show not the efficiency but the effectiveness of the proposed algorithm. The proposed algorithm is of course much faster (by about an order of magnitude) than simulated annealing. The results of simulated annealing are obtained by selecting the best cost after running the program 5 times. From the table, we can see that our algorithm finds solutions comparable to those of simulated annealing.

[Table 1. Comparison of the proposed algorithm with simulated annealing. Columns: D (sec); HW cost of our algorithm without and with loop pipelining; HW cost of simulated annealing without and with loop pipelining.]

To see the effect of hardware sharing and implementation selection, we test our algorithm without these features. Table 2 shows the results obtained by our algorithm with loop pipelining but without hardware sharing or implementation selection. When we test our algorithm without implementation selection, we fix the implementation of each node $n_i$ at the median point of the implementation curve $IH_i$. We do not fix the implementation at the slowest point of the implementation curve because we cannot satisfy a tight performance constraint using such an implementation. The results in Table 2 show that sharing and implementation selection are effective for reducing the total hardware cost of the partitioned system.

[Table 2. Partitioning results without hardware sharing or implementation selection. Columns: D (sec); without sharing: HW cost, % cost increase; without implementation selection (median point): HW cost, % cost increase.]

5. Conclusions

In this paper, we proposed a hardware-software partitioning algorithm with a loop pipelining technique, which is suitable for computation-intensive applications composed of loops. The loop pipelining technique is effective for partitioning because it increases parallelism within a loop, thereby allowing the partitioning algorithm to find a better solution. Thanks to the increased parallelism, we can find a lower-cost solution at a given performance constraint and a better-performance solution at a given hardware cost constraint. In addition, our hardware-software partitioning algorithm, which is based on an iterative improvement method, allows hardware implementation selection and hardware sharing to minimize hardware cost, subject to a performance constraint.
Experimental results show that (i) the loop pipelining technique is effective in that it provides more chances of overlapping the execution of hardware and software during the partitioning process and (ii) the proposed algorithm efficiently finds a solution comparable to that of a simulated annealing algorithm.

Future work includes dynamically changing pipeline blocks according to the current partition state, thereby increasing the effect of loop pipelining further. Recall that in the current implementation, we determine pipeline blocks statically before the partitioning process. We are also working on considering a multiple-processor target architecture for software implementation.

References

[1] J. K. Adams and D. E. Thomas, "Multiple-Process Behavioral Synthesis for Mixed Hardware-Software Systems," Proceedings of International Symposium on System Synthesis.
[2] C. Chu, et al., "HYPER: An interactive synthesis environment for high performance real time applications," Proceedings of International Conference on Computer Design, November.
[3] R. Ernst and J. Henkel, "Hardware-Software Cosynthesis for Microcontrollers," IEEE Design & Test of Computers, December.
[4] R. K. Gupta and G. De Micheli, "System-level Synthesis using Reprogrammable Components," Proceedings of EURO-DAC 92, pp. 2-7, February.
[5] J. Henkel, T. Benner, and R. Ernst, "Hardware generation and partitioning effects in the COSYMA system," Proceedings of Int'l Workshop on Hardware-Software Codesign, October.
[6] A. Kalavade and E. A. Lee, "A Global Criticality/Local Phase Driven Algorithm for the Constrained Hardware-Software Partitioning Problem," Proceedings of Int'l Workshop on Hardware/Software Codesign, pp. 42-48, September.
[7] A. Kalavade and E. A. Lee, "The Extended Partitioning Problem: Hardware-Software Mapping and Implementation-Bin Selection," Proceedings of Int'l Workshop on Rapid Systems Prototyping, June.
[8] S. Kirkpatrick, C. D. Gelatt, Jr., and M. P. Vecchi, "Optimization by Simulated Annealing," Science, vol. 220, May.
[9] P. V. Knudsen and J. Madsen, "PACE: A Dynamic Programming Algorithm for Hardware-Software Partitioning," Proceedings of Int'l Workshop on Hardware/Software Codesign, March.
[10] K. Olukotun, R. Helaihel, J. Levitt, and R. Ramirez, "A Software-Hardware Cosynthesis Approach to Digital System Simulation," IEEE Micro, August.
[11] F. Vahid, J. Gong, and D. D. Gajski, "A Binary-Constraint Search Algorithm for Minimizing Hardware during Hardware-Software Partitioning," Proceedings of EURO-DAC 94.
[12] M. Edwards, "Software Acceleration Using Coprocessors: Is it Worth the Effort?," Proceedings of Int'l Workshop on Hardware/Software Codesign, March.
[13] S. Bakshi and D. D. Gajski, "Hardware/Software Partitioning and Pipelining," Proceedings of Design Automation Conference, June.
[14] P. Bjorn-Jorgensen and J. Madsen, "Critical Path Driven Cosynthesis for Heterogeneous Target Architecture," Proceedings of Int'l Workshop on Hardware/Software Codesign, March 1997.
