WCET-Directed Dynamic Scratchpad Memory Allocation of Data

Jean-François Deverge and Isabelle Puaut
Université Européenne de Bretagne / IRISA, Rennes, France

Abstract

Many embedded systems feature processors coupled with a small and fast scratchpad memory. Unlike caches, allocation of data to scratchpad memory must be handled by software. The major gain is improved predictability of memory access latencies. A compile-time dynamic allocation approach enables eviction and placement of data to the scratchpad memory at runtime. Previous dynamic scratchpad memory allocation approaches aimed to reduce average-case program execution time or the energy consumption due to memory accesses. For real-time systems, worst-case execution time is the main metric to optimize. In this paper, we propose a WCET-directed algorithm to dynamically allocate static data and stack data of a program to scratchpad memory. The granularity of placement of memory transfers (e.g. on function or basic block boundaries) is discussed from the perspective of its computation complexity and the quality of allocation.

1. Introduction

The worst-case execution time (WCET) of a program is the maximum time this program may take to execute on a specific hardware platform [14, 25, 28]. Knowing a program's WCET is of prime importance for hard real-time systems, to guarantee that computations will complete before their deadline.

Direct-addressed scratchpad memories are being used as an alternative to processor caches as they consume less area and less power. Approaches for static [2] and dynamic [15, 26, 27] allocation have been designed to automatically place code and data on scratchpad memories. So far, many studies have addressed allocation of code and data on scratchpad memory for average execution time [2] or energy reduction [27]. A study [28] has demonstrated the superiority of scratchpad memory placement over some cache modeling techniques for execution time predictability of hard real-time systems.
Recently, algorithms for static data allocation on scratchpad memories [25] and for dynamic code allocation [21] have been specially designed for WCET optimization. As far as we know, no dynamic scratchpad memory data allocation method for WCET optimization has been proposed.

In this paper, we present an approach to allocate program data to scratchpad memory for WCET reduction. Our approach determines at compile-time the possible program locations where data will be transferred on and off the scratchpad memory at runtime, using a two-step method. First, memory accesses to data along the worst-case execution path of the program are analyzed. Second, a 0/1 integer linear program (ILP) problem is formulated to select these data for dynamic scratchpad memory allocation. However, the worst-case execution path of the program may change after a data allocation. Consequently, the ILP problem is greedily refined to compute a WCET-directed allocation.

The two steps of the method are described in Sections 2 and 3. In Section 2, we propose a compiler technique to determine the potential targets of any data memory access of a program. This information is employed to estimate the profit of a data allocation. Section 3 describes the approach for dynamic scratchpad memory allocation. Section 4 provides some results and studies the performance improvements of our proposal over previous scratchpad memory allocation techniques. Section 5 overviews related work while Section 6 describes future work and concludes.

2. Determination of load-store instructions targets

In many programs, a large share of data accesses is dynamic: the target address of a load-store instruction may change from one execution to the next. Table 1 gives a two-dimensional classification of data storage and load-store accesses from [18]. The storage type defines the location of a given data. Static data, stack data and heap data are respectively stored in the global, stack and heap sections of the program memory space layout.
Literals are compiler-generated constants stored in the code section; these data are used to reduce the size of the program code.

Storage type      Description
Static            Global and static structures.
Stack             Function stack frame, spilled temporaries and stack allocated structures.
Heap              Dynamically allocated structures on the heap.
Literals          Constants stored in program code section.

Access type       Explanation
Scalar            Only one element.
Regular           Array accessed by regular, stride accesses.
Irregular         Non-regular but still input data independent.
Input dependent   Reference directly depends on input data.

Table 1. Data structure classification based on storage type (upper table) and access type (lower table) [18].

The access type defines the way a data is accessed. Scalar access types are accesses to a unique data address for statics, and to an address relative to the stack frame base address for stack data. The access type of a load-store instruction is regular if this instruction accesses multiple elements of a unique array with a constant stride (classified as linear address sequence accesses in []). Irregular accesses include accesses to (possibly multiple) data through pointers and are still independent of the input data. Lastly, input dependent accesses include any access whose address is computed at runtime from unknown input data (mentioned as indirect address sequence accesses in []). In the next section, we motivate the need for a method to analyze any data memory access of programs.

Quantitative study of data memory accesses by types

Table 2 gives the impact of data by access types and storage types on the worst-case execution path of programs. Benchmark programs are individually described later in Section 4; these programs do not access heap data. Programs are compiled for the StrongARM-1 [22] with loop-related optimizations (loop unrolling, etc.) disabled. WCET analyses of programs are performed with the Heptane WCET timing analyser [6].

Columns: Benchmark; Static (scal. or reg.); Static (irreg. or input dep.); Stack (scal. or reg.); Stack (irreg. or input dep.); Literals.

Adpcm       17.%   6.%   9.1%   -   13.%
Engine      16.9%   %   3.%   12.9%
G721        %   8.6%   39.1%   %
Histogram   99.9%   -   .1%   -   -
Lpc         96.1%   -   .5%   -   3.4%
Pocsag      62.4%   .4%   13.9%   %
Spectral    31.7%   24.5%   37.8%   -   6.1%
Statemate   6.4%   -   6.1%   %

Table 2. Impact ratio of load-store instructions by storage types and access types.

Table 2 presents the ratio of accesses to static data, stack data and literals along the worst-case execution path. To illustrate the partition between accesses with or without pointers, we have merged the results into two sub-categories for static and stack data: scalar/regular accesses on one side, and irregular/input dependent accesses on the other.
The ratio of load-store instructions to literals is up to 33.5% and is significant for most programs of the benchmark set. On the one hand, most programs have a large share (between 16.9% and 99.9%) of data accesses to statics. On the other hand, stack data represent a large part of data accesses for three of the eight benchmarks (between 33.2% and 69.8%). All programs except Histogram make use of irregular/input dependent accesses (e.g. memory accesses through pointers). Moreover, two benchmark programs make intensive use (24.5% and 6.%) of such accesses. In conclusion, accesses to both static data and stack data are important, and we need a method to calculate the targets of memory accesses of any access type found in programs.

Calculation of targets of data memory accesses

In programming languages such as C or C++, programs typically employ pointers to array elements, dynamic data structures (e.g. linked lists) and procedure parameters. As shown in the previous section, irregular and input dependent access types represent a large part of load-store instructions. Traditionally, pointer analysis has been used in compilers to build aliasing information [4, 11]. In this paper, we propose to reuse the existing pointer analysis methods of the compiler to determine the possible data accessed by any load-store instruction of a program. In order to exhaustively associate any memory access with its (possibly multiple) data target(s), we have to apply pointer analysis to (i) all pointer definitions of the program interprocedurally, and to (ii) the text of the whole program [11] with its related libraries.

[Figure 1: boxes labeled program sources (C, C++, ...), language front-end, collection of intermediate representations, pointer analysis, aliasing information, code transformations 1..n, code generation back-end, load-store instructions targets annotations, output assembly.]
Figure 1. Pointer analysis in compilation process.

As shown in Figure 1, a compiler infrastructure typically contains a collection of intermediate representations [].
A set of code transformations is applied iteratively on the intermediate representations. Pointer analyses must be run in the early phases of program transformation; their results are carried through the remaining optimization phases as annotations to the intermediate representations. Then, the code generation back-end phase translates a low-level intermediate representation (similar to [12]) to the output assembly file.

GCC supports whole-program compilation and currently provides an intraprocedural pointer analysis [4]: targets of pointers passed as procedure parameters are not computed. For the purpose of this paper's study, we have slightly modified the compiler infrastructure to apply the pointer analysis interprocedurally; the compiler keeps the results of pointer analyses during the whole compilation lifetime. We have also modified the ARM back-end to produce the set of possible pointer targets for each generated load-store instruction in the output assembly file.

1 GCC GNU C Compiler:

The pointer analysis applied in this paper supports static and stack storage types. None of our real-time benchmarks make use of dynamic heap allocation. In our study, we do not differentiate between individual elements of arrays or between the fields of data structures (whereas such information is computed by the current pointer analysis implementation of the compiler [4]). Moreover, the stack data of the whole stack frame of each function are managed as a single data structure instance.

Related work on determination of load-store instructions targets

Some approaches have previously succeeded in generating information on some access types. Disassembly of binary files enables extraction of scalar accesses to static data; some dataflow analysis techniques have been applied on assembly code to extend scalar access detection to stack data [3, 9, 13]. Data dependence analysis techniques and loop induction analyses have been applied on the low-level representation of VPO [19] to determine regular accesses in [29].

[24] uses a processor simulator to generate a program memory profile. The profile contains the trace of all memory addresses accessed. The trace must cover all instructions of the program and directly associates an observed target data address with each load-store instruction. This approach enables analysis of any scalar memory access. Non-scalar accesses can be resolved by checking whether the captured address falls within the address range of any known data. This approach guarantees detection of the target of any memory access to a unique data.

An external module, based on abstract interpretation techniques [5], has been employed on the intermediate representation of the SUIF compiler [3] for pointer analysis.
The results of this analysis are re-associated after code generation with the output assembly within the execution of the SimpleScalar simulator. Their approach is the most similar to ours. The aim of their study is the impact of memory access aliasing information on the scheduling of the processor memory request queue [5].

3. Dynamic scratchpad memory allocation

The previous section has presented a complete program analysis framework to determine the targets of load-store instructions of a program. In this section, we employ this information to define at compile-time a data allocation in a single scratchpad memory device. First, we describe the program flowgraph representation considered (Section 3.1). Second, a 0/1 integer linear program (ILP) formulation is given for the allocation of static data (Section 3.2), based on initial knowledge of frequencies along the worst-case execution path. We apply this formulation on the considered flowgraph to generate an ILP problem. A solution to this ILP problem provides the location of memory transfer operations on the considered flowgraph. The formulation is later extended to handle stack data (Section 3.2.3). Finally, we describe an iterative algorithm to tackle the instability of the worst-case execution path of a program. This algorithm incrementally generates the ILP problem for a better WCET optimization.

Flowgraphs and computation of worst-case execution path information

We have multiple choices for the placement of memory transfer operations (e.g. on function entry and exit, on basic block boundaries, etc.). We propose to introduce a generic graph representation of the program flow. The chosen representation level of the generated program flowgraph may lead to different placements of memory transfer operations. In Figures 2 and 3, the (right-side) flowgraph is generated from the (left-side) original graph representation. The edges in the generated flowgraph are required to describe any possible flow of execution of the program.

[Figure 2: call graph of functions main(), proc_a(), proc_b(), proc_c() and proc_d(), and the generated flowgraph with a start node and labeled edges.]
Figure 2. Call graph transformation to a (coarse-grain) flowgraph.
For example, one can build a flowgraph from the original call graph of an application (see Figure 2). There is one node in the flowgraph for each function in the call graph. We can also build a flowgraph from the interprocedural control flow graph of the application (see Figure 3). Here, there is one node in the flowgraph for each basic block in the interprocedural control flow graph. Other levels of representation are possible; one may balance between coarseness and size of the resulting flowgraph. The size of the flowgraph has a practical impact on the complexity of the memory allocation problem, as shown later in the experimental results of Section 4.2.

Previous approaches for dynamic scratchpad allocation [15, 26, 27] have focused on the optimization of the average case. Data access statistics are typically computed from execution profiling with a training input. In order to reduce the WCET of a real-time application, we rely on information about data memory accesses on the worst-case execution path of the program, obtained by WCET analysis. Consequently, we apply static timing analysis as an initial step to determine this information, as proposed in [21, 25]. Heptane produces information on the execution frequencies of individual basic blocks on the worst-case execution path. Since we are able to compute the set of targets for each load-store instruction (see Section 2.2), we can determine the impact of any data for each basic block of the worst-case execution path. Moreover, we are able to determine whether this data is MOD (modified) or USE (used) during the execution of this basic block. The outgoing edges of the generated flowgraph associated with these basic blocks are annotated with this information.

Figure 3. Control flow graph transformation to a (fine-grain) flowgraph.
Figure 4. (Diagram: data v allocated on adjacent edges, with its load, allocation and store variables set to 1.)
Figure 5. (Diagram: a node with multiple outgoing edges.)

More formally, the flowgraph is a directed graph with the following definitions:

N = Number of nodes in flowgraph;
E = Number of edges in flowgraph;
e_j = jth edge of flowgraph, j ∈ [1, E];
C_{e_j}(v) = Estimated contribution to WCET reduction for data v scratchpad-allocated on edge e_j;
U_{e_j}(v) = Type of usage of data v on edge e_j, where U_{e_j}(v) ∈ {MOD, USE}.

Some real-time applications are designed to be activated from multiple entry points. We have added the start node to the flowgraph to represent these flows of execution. Edges are added to link the start node with every possible program entry; the start node artificially acts as a single entry point for the program. In the same way, all program exits are linked to this start node.

Formulation for static data

We consider an initial problem formulation to allocate static data only, with the following definitions:

M = Size of scratchpad memory;
G = Number of static data in application;
v_i = ith static data, i ∈ [1, G];
S(v_i) = Size of variable v_i in bytes;
X_copy(v_i) = Time to transfer variable v_i between main memory and scratchpad, in cycles.

The optimization problem is formulated as a 0/1 integer linear programming problem. We define the following set of binary variables, ∀i ∈ [1, G], ∀j ∈ [1, E]:

$$\mathit{load}^{v_i}_{e_j} = \begin{cases} 1 & \text{if data } v_i \text{ is transferred to scratchpad memory at the beginning of edge } e_j, \\ 0 & \text{otherwise.} \end{cases}$$

$$\mathit{store}^{v_i}_{e_j} = \begin{cases} 1 & \text{if data } v_i \text{ is transferred back to main memory at the end of edge } e_j, \\ 0 & \text{otherwise.} \end{cases}$$

$$\mathit{alloc\_rw}^{v_i}_{e_j} = \begin{cases} 1 & \text{if mutable data } v_i \text{ is allocated on scratchpad memory on edge } e_j, \\ 0 & \text{otherwise.} \end{cases}$$

$$\mathit{alloc\_ro}^{v_i}_{e_j} = \begin{cases} 1 & \text{if read-only data } v_i \text{ is allocated on scratchpad memory on edge } e_j, \\ 0 & \text{otherwise.} \end{cases}$$

Variables load and store determine where data v_i are respectively loaded to and stored back from scratchpad memory. Variables alloc_rw/alloc_ro give the modified/not modified state of the scratchpad-allocated data v_i. A modified data v_i must be transferred back to main memory at the end of its allocation.

The objective function to maximize is the sum of contributions to the WCET of all memory accesses to allocated static data in the application, minus the cost of transfer operations of data between main memory and scratchpad memory:

$$\sum_{i=1}^{G} \sum_{j=1}^{E} \Big( \mathit{alloc\_rw}^{v_i}_{e_j}\, C_{e_j}(v_i) + \mathit{alloc\_ro}^{v_i}_{e_j}\, C_{e_j}(v_i) - \mathit{load}^{v_i}_{e_j}\, X_{copy}(v_i) - \mathit{store}^{v_i}_{e_j}\, X_{copy}(v_i) \Big)$$

Preliminary constraints have to be added to prevent inconsistencies on binary variables. Data v_i is allocated on scratchpad memory with alloc_rw or alloc_ro exclusively, ∀i ∈ [1, G], ∀j ∈ [1, E]:

$$\mathit{alloc\_rw}^{v_i}_{e_j} + \mathit{alloc\_ro}^{v_i}_{e_j} \le 1 \qquad (1)$$

The MOD and USE annotations of the edges of the flowgraph have a direct impact on the problem formulation. We have to unset alloc_ro variables for edges that may update this data:

$$\mathit{alloc\_ro}^{v_i}_{e_j} = 0 \quad \text{if } U_{e_j}(v_i) = \text{MOD} \qquad (2)$$

Flow constraints

Figure 4 illustrates the need for and the objective of flow constraints. Let us consider data v_i allocated on scratchpad memory on adjacent and connected edges e_{j-1} and e_j. In this example, this data is loaded on the execution of e_{j-1} and stored back in main memory at the end of the execution of e_j. The flow constraints are stated ∀i ∈ [1, G], ∀(j-1, j) ∈ ([1, E], [1, E]) where e_{j-1} is an incoming edge of e_j. They are written here as inequalities, following the prose description of each constraint:

$$\mathit{alloc\_rw}^{v_i}_{e_j} + \mathit{alloc\_ro}^{v_i}_{e_j} - \mathit{alloc\_rw}^{v_i}_{e_{j-1}} - \mathit{alloc\_ro}^{v_i}_{e_{j-1}} - \mathit{load}^{v_i}_{e_j} \le 0 \qquad (3)$$

$$\mathit{alloc\_rw}^{v_i}_{e_{j-1}} - \mathit{alloc\_rw}^{v_i}_{e_j} - \mathit{store}^{v_i}_{e_{j-1}} \le 0 \qquad (4)$$

$$\mathit{alloc\_ro}^{v_i}_{e_j} - \mathit{alloc\_ro}^{v_i}_{e_{j-1}} - \mathit{load}^{v_i}_{e_j} \le 0 \qquad (5)$$

Constraint 3 enables scratchpad allocation of data v_i on edge e_j if this data was already scratchpad-allocated on the incoming edge e_{j-1}, or if this data is loaded on edge e_j. Constraint 4 ensures that data v_i, updated on edge e_{j-1}, must either be stored and transferred back to main memory or remain alloc_rw on the next edge e_j. Constraint 5 ensures that data v_i, read-only allocated on edge e_j, must be loaded on this edge or be alloc_ro on the incoming edge e_{j-1}.

Figure 5 illustrates a node with multiple outgoing edges. Constraint 6 guarantees consistent values for the variables of the outgoing edges of a node in the flowgraph, ∀i ∈ [1, G], ∀(j, j') ∈ ([1, E], [1, E]) where edges e_j and e_{j'} are outgoing edges of the same node:

$$\mathit{alloc\_rw}^{v_i}_{e_j} + \mathit{alloc\_ro}^{v_i}_{e_j} - \mathit{alloc\_rw}^{v_i}_{e_{j'}} - \mathit{alloc\_ro}^{v_i}_{e_{j'}} = 0 \qquad (6)$$

Finally, Constraint 7 specifies the upper bound on the sum of the sizes of all allocated data on each edge, ∀j ∈ [1, E]:

$$\sum_{i=1}^{G} \Big( \mathit{alloc\_rw}^{v_i}_{e_j}\, S(v_i) + \mathit{alloc\_ro}^{v_i}_{e_j}\, S(v_i) \Big) \le M \qquad (7)$$

Optional support of dynamically scheduled architectures

WCET analysis requires complete knowledge of instruction execution times. In dynamically scheduled architectures [17], pipeline modeling should take into account all possible timings for each instruction of varying timing, increasing the complexity of the WCET analysis [16]. For example, load-store instructions may have multiple execution latencies if their possible data targets are stored in different memories with heterogeneous latencies. In order to reduce the complexity of WCET analysis, we would like to guarantee a unique timing for each load-store instruction. Therefore, we have to force allocation of all target data of a load-store instruction to the same level of the memory hierarchy (here, the scratchpad memory or the main memory). Constraint 8 enforces removal of timing anomalies due to data memory accesses, ∀j ∈ [1, E], ∀(i_1, i_2) ∈ ([1, G], [1, G]) where v_{i_1} and v_{i_2} are possibly accessed on e_j by the same load-store instruction:

$$\mathit{alloc\_rw}^{v_{i_1}}_{e_j} + \mathit{alloc\_ro}^{v_{i_1}}_{e_j} - \mathit{alloc\_rw}^{v_{i_2}}_{e_j} - \mathit{alloc\_ro}^{v_{i_2}}_{e_j} = 0 \qquad (8)$$

The impact of Constraint 8 may directly depend on the number of possible targets of the pointers in programs.
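The interplay of the allocation, flow and size constraints can be exercised on a toy instance. The sketch below is only an illustration under stated assumptions, not the formulation fed to an ILP solver: it handles a single datum on a chain of edges e_1 → ... → e_n, reads Constraints 3 to 5 as "≤ 0" inequalities (as their prose description suggests), assumes nothing is allocated before e_1 or after e_n, and replaces the solver with exhaustive enumeration. The names `feasible` and `best_allocation` are hypothetical.

```python
from itertools import product

MOD, USE = "MOD", "USE"

def feasible(edges, usage, size, capacity):
    """Check Constraints (1)-(5) and (7) for one datum on a chain of edges.

    edges[j] is a tuple (alloc_rw, alloc_ro, load, store) of 0/1 values.
    """
    prev_rw = prev_ro = 0  # assumption: nothing is allocated before e_1
    for j, (rw, ro, load, store) in enumerate(edges):
        # assumption: a virtual successor after e_n holds no allocation
        next_rw = edges[j + 1][0] if j + 1 < len(edges) else 0
        ok = (
            rw + ro <= 1                                  # (1) rw/ro exclusive
            and not (usage[j] == MOD and ro)              # (2) no ro on MOD edges
            and rw + ro - prev_rw - prev_ro - load <= 0   # (3) allocated before or loaded
            and rw - next_rw - store <= 0                 # (4) modified data stored or kept rw
            and ro - prev_ro - load <= 0                  # (5) ro was ro before or is loaded
            and (rw + ro) * size <= capacity              # (7) scratchpad size
        )
        if not ok:
            return False
        prev_rw, prev_ro = rw, ro
    return True

def best_allocation(contrib, usage, x_copy, size, capacity):
    """Enumerate all 0/1 assignments and return (best gain, variables per edge)."""
    n = len(contrib)
    best_gain, best_edges = None, None
    for bits in product((0, 1), repeat=4 * n):
        edges = [bits[4 * j:4 * j + 4] for j in range(n)]
        if not feasible(edges, usage, size, capacity):
            continue
        # Objective: WCET contributions minus transfer costs.
        gain = sum((rw + ro) * contrib[j] - (load + store) * x_copy
                   for j, (rw, ro, load, store) in enumerate(edges))
        if best_gain is None or gain > best_gain:
            best_gain, best_edges = gain, edges
    return best_gain, best_edges
```

For instance, `best_allocation([5, 40, 5], [USE, MOD, USE], 12, 8, 16)` keeps the datum allocated on all three edges with one load and one store, for a gain of 50 - 24 = 26 cycles; with a 4-byte scratchpad the 8-byte datum cannot be placed and the best gain is 0. Enumeration is exponential in the number of variables, which is why the paper relies on an ILP solver.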
The StrongARM-1 [22], however, is a statically scheduled architecture and does not enable an evaluation of Constraint 8 in the experiments of this paper.

Memory data address assignment

Our formulation provides an optimistic solution to data allocation (variables alloc_rw or alloc_ro) on the edges of the flowgraph. It is optimistic in the sense that not all data selected by the ILP problem resolution necessarily fit on scratchpad memory, due to fragmentation. An address assignment algorithm has been proposed in [27] to place data on scratchpad memory at compile-time. If no free place is found for one data, this data is simply left in main memory. In their approach, each data can be transferred multiple times between main memory and scratchpad memory; however, each data must have only one address in the scratchpad memory for the whole program execution. We propose an improvement to the address assignment algorithm of [27] with the detection of individual data regions. Our proposal, detailed in Algorithm 1, may decrease placement conflicts of data on scratchpad memory. A data region is defined by a subgraph of connected edges of the flowgraph where data v is scratchpad-allocated. Algorithm 1 enables the assignment of a different address on scratchpad memory for each data region.

Algorithm 1 Address assignment algorithm with detection of data regions
1: data_regions ← extract data regions from computed allocation
2: sort data_regions list on their impacting order on WCET
3: for all individual region (data, edge_set) from data_regions list do
4:   if data fits in free memory space on edges of edge_set then
5:     select free placement for data on edges of edge_set with first-fit policy
6:   else
7:     remove the unallocatable data region
8:   end if
9: end for
10: return data_regions

First, Algorithm 1 reads the optimistic allocation computed from the ILP problem and generates a list of data regions (line 1). This step requires an analysis of the connected components of the flowgraph for each allocated data. This list is then sorted by the impact of each individual data region on the program WCET (line 2).
We then try to assign a concrete address to each data region: for all edges covered by a data region, we look for a valid slot to hold the data (lines 3-9). If a data region cannot be loaded on scratchpad memory, we ignore this data region and simply remove all transfer operations for this data in this region.

Extension for stack data

We now consider an extension to our previous formulation to support the limited lifetime of stack data. These data require neither initialization nor content backup to main memory at the end of their lifetime. Similarly to static data, the flowgraph is annotated with MOD or USE for any usage of stack data. The DEF attribute is now defined on function entry and on function exit for stack data associated with the function life span. DEAD is an attribute set on flowgraph edges to avoid memory transfers for non-live stack data.

F = Number of stack data in the application;
f_i = ith stack data, i ∈ [1, F];
U_{e_j}(f_i) = Type of usage of variable f_i on edge e_j, where U_{e_j}(f_i) ∈ {DEF, MOD, USE, DEAD};
S(f_i), C_{e_j}(f_i), X_copy(f_i) and the variables alloc_rw, alloc_ro, load and store are similarly defined for stack data.

The general flow Constraints 3, 4, 5 and 6 for static data are directly applicable to stack data. The objective function to maximize is the contribution to WCET reduction of all accesses to the static data and to the stack data in the application, minus all the dynamic transfers of data between main memory and scratchpad memory:

$$\sum_{i=1}^{G} \sum_{j=1}^{E} \Big( \mathit{alloc\_rw}^{v_i}_{e_j}\, C_{e_j}(v_i) + \mathit{alloc\_ro}^{v_i}_{e_j}\, C_{e_j}(v_i) - \mathit{load}^{v_i}_{e_j}\, X_{copy}(v_i) - \mathit{store}^{v_i}_{e_j}\, X_{copy}(v_i) \Big) + \sum_{i=1}^{F} \sum_{j=1}^{E} \Big( \mathit{alloc\_rw}^{f_i}_{e_j}\, C_{e_j}(f_i) + \mathit{alloc\_ro}^{f_i}_{e_j}\, C_{e_j}(f_i) - \mathit{load}^{f_i}_{e_j}\, X_{copy}(f_i) - \mathit{store}^{f_i}_{e_j}\, X_{copy}(f_i) \Big)$$

Stack data are created and destroyed on function entry and on function exit (where U_{e_j}(f_i) = DEF). Consequently, these data do not require memory transfer operations (Constraints 9 and 10). On such edges, stack data are initialized with default values and Constraint 11 forbids read-only allocation. Moreover, we enforce (Constraint 12) zero-cost memory transfer operations to scratchpad before and after the stack data lifetime, ∀i ∈ [1, F], ∀j ∈ [1, E]:

$$\mathit{load}^{f_i}_{e_j} = 0 \quad \text{if } U_{e_j}(f_i) = \text{DEF} \qquad (9)$$

$$\mathit{store}^{f_i}_{e_j} = 0 \quad \text{if } U_{e_j}(f_i) = \text{DEF} \qquad (10)$$

$$\mathit{alloc\_ro}^{f_i}_{e_j} = 0 \quad \text{if } U_{e_j}(f_i) = \text{DEF} \qquad (11)$$

$$X_{copy}(f_i) = 0 \quad \text{if } U_{e_j}(f_i) = \text{DEAD} \qquad (12)$$

As described in [2], stack data may have disjoint lifetimes. The program call graph is analyzed to provide information on the lifetime of stack data:

L = The set of all leaf nodes in the call graph;
NP(l) = Total number of unique paths to the lth leaf node in the call graph, l ∈ [1, L];
P_t(l) = The set of stack data definitions on the tth unique path to the lth leaf node in the call graph, t ∈ [1, NP(l)].

The sets P_t(l) enumerate all possible combinations of stack data that are simultaneously alive. The size constraint should be formulated as, ∀j ∈ [1, E], ∀l ∈ L, ∀t ∈ [1, NP(l)]:

$$\sum_{i=1}^{G} \Big( \mathit{alloc\_rw}^{v_i}_{e_j}\, S(v_i) + \mathit{alloc\_ro}^{v_i}_{e_j}\, S(v_i) \Big) + \sum_{f \in P_t(l)} \Big( \mathit{alloc\_rw}^{f}_{e_j}\, S(f) + \mathit{alloc\_ro}^{f}_{e_j}\, S(f) \Big) \le M \qquad (13)$$

An additional extension would be to support heap allocated data as proposed in [8]. The DEF machinery is an ideal attribute to define limited-lifetime data and may be the basis for such an extension of our formulation to heap data.
Currently, real-time benchmark programs rarely employ dynamic heap allocation and would not enable us to conduct a complete study on dynamically allocated data.

Support for instability of the worst-case execution path

For many programs, the worst-case execution path may change after some data allocations. Consequently, it may be necessary to evaluate all possible combinations of data allocations to find the optimal reduction of the program WCET. However, an exhaustive evaluation of all possible combinations would be too time-consuming [25]. A greedy heuristic that greatly enhances the quality of allocation has been proposed in [25] for WCET-centric static scratchpad memory allocation. The method iteratively allocates one data on scratchpad memory and re-estimates data frequency information at every iteration. We propose to adapt this approach to ILP-based dynamic scratchpad memory allocation schemes in Algorithm 2.

Algorithm 2 Iterative dynamic allocation algorithm
1: allocations ← empty
2: repeat
3:   change ← false
4:   perform WCET estimation
5:   extract information on worst-case execution path
6:   generate dynamic allocation ILP problem
7:   generate additional Constraints (14) for allocations
8:   new_allocations ← call solver on ILP problem
9:   if new_allocations ≠ empty then
10:    change ← true
11:    allocations ← allocations ∪ most impacting allocation from new_allocations
12:  end if
13: until change = false
14: return allocations

The main idea behind Algorithm 2 is to incrementally refine the ILP problem formulation to support greedy allocation of the most impacting data. Initially, the worst-case execution path of the application is determined (line 4) and data access information is computed (line 5). An initial ILP problem is generated (line 6) and the problem solver computes a list of data to allocate (line 8). On subsequent iterations, WCET estimation is performed again and Constraint 14 is added to the problem formulation (line 7) to enforce allocation of the data already selected.
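The control structure of Algorithm 2 can be sketched as a short loop. This is a minimal sketch and not the authors' implementation: the WCET analyser and the ILP solver are passed in as two hypothetical callbacks, `analyze_wcet` and `solve_ilp`, and the solver is assumed to return a mapping from each selected datum to its estimated WCET gain. The toy callbacks model a program with two paths whose worst-case path switches once datum A is allocated.

```python
def iterative_allocation(analyze_wcet, solve_ilp):
    """Greedy loop of Algorithm 2 (sketch).

    analyze_wcet(allocations) -> access profile on the current worst-case path
    solve_ilp(profile, forced) -> {datum: estimated WCET gain}; 'forced' plays
                                  the role of Constraint 14
    """
    allocations = []
    while True:
        profile = analyze_wcet(allocations)         # lines 4-5
        gains = solve_ilp(profile, allocations)     # lines 6-8
        fresh = {d: g for d, g in gains.items()
                 if d not in allocations and g > 0}
        if not fresh:                               # no change: stop (line 13)
            return allocations
        # Keep only the most impacting new allocation (line 11).
        allocations.append(max(fresh, key=fresh.get))

# Toy model (all values are made up for illustration).
BASE = {"p1": 100, "p2": 90}                 # path lengths in cycles
GAINS = {"p1": {"A": 60}, "p2": {"B": 50}}   # gain per datum on each path

def toy_analyze(allocations):
    def cost(path):
        return BASE[path] - sum(g for d, g in GAINS[path].items()
                                if d in allocations)
    worst = max(BASE, key=cost)              # current worst-case path
    return GAINS[worst]

def toy_solve(profile, forced):
    return dict(profile)                     # pretend every datum fits

result = iterative_allocation(toy_analyze, toy_solve)
```

In this toy run the first iteration selects A on path p1; the worst-case path then becomes p2, so the second iteration selects B, and the third finds nothing new and stops.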
Constraint 14 is stated ∀i ∈ [1, G], ∀j ∈ [1, E] where v_i has been selected for allocation on e_j:

$$\mathit{alloc\_rw}^{v_i}_{e_j} + \mathit{alloc\_ro}^{v_i}_{e_j} = 1 \qquad (14)$$

The algorithm selects the most impacting data not already allocated and allocates it to the scratchpad memory (line 11). This process is applied iteratively until no further allocation can reduce the WCET of the application.

Similarly, we have modified the ILP problem formulation of [2] to obtain an iterative static memory allocation algorithm. This algorithm is used in the experiments of this paper. Lines 6-7 of Algorithm 2 are modified (1) to generate the static allocation ILP problem of [2] and (2) to generate additional constraints that enforce allocation of selected data on subsequent iterations. Due to lack of space, we do not describe the additional constraints applied to the ILP problem of [2] (line 7).

4. Results

The evaluation of the approach for dynamic scratchpad memory allocation is performed for the StrongARM-1 processor [22]. Benchmark programs are compiled using a modified GCC 4.1 compiler (see Section 2.2 for the modifications applied to the compiler to compute load-store instructions targets). The compiler generates two files: the output program and an additional file holding the load-store instructions targets annotations. The Heptane timing analyzer then reads the program binary together with the annotation file to model the complete memory behavior of the program. The maximum iteration counts of program loops are given as annotations to enable determination of possible execution paths in the program. This study reports results on optimized code with loop-related optimizations disabled.

Columns: Benchmark; Source; Lines of code; Static data size; Max. stack size; Load-store inst. ratio; Description.

Adpcm       WCET B.      87 bytes   116 bytes   39%   Speech coding
Engine      Powerstone   bytes   116 bytes   6%   Engine control
G721        Powerstone   bytes   284 bytes   16%   Voice compression
Histogram   UTDSP        bytes   bytes   39%   Image enhancing application
Lpc         UTDSP        bytes   72 bytes   22%   Speech coding
Pocsag      Powerstone   bytes   112 bytes   22%   Communication protocol
Spectral    UTDSP        bytes   116 bytes   44%   Speech power spectral estimation
Statemate   WCET B       bytes   132 bytes   6%   Car window lift control

Table 3. Information on benchmark programs.

The Heptane timing analyser supports the pipelined execution and the instruction cache of the StrongARM-1. The latency of a word access to main memory is 11 cycles [22]. The latency for accesses to data allocated on the scratchpad memory is 1 cycle. A penalty model for scratchpad memory transfer operations is integrated into the timing analysis of Heptane. We applied a penalty latency of 12 cycles per word of data to transfer. The commercial ILP solver CPLEX (footnote 2) is configured to stop on the first valid solution found. The proposed technique is evaluated on an assorted set of benchmarks from WCET benchmarks (footnote 3), Powerstone [23] and UTDSP (footnote 4).

Scratchpad allocation results

We have compared the impact on program WCET of our iterative dynamic scheme against non-iterative static scratchpad memory allocation [2].
In this study, (fine-grained) flowgraphs are generated from the interprocedural control flow graph of benchmark programs. This gives the maximum latitude for placement of memory transfers in programs. Figure 6 gives the improvement ratio of iterative dynamic scratchpad memory allocation over non-iterative static allocation (y-axis) for a range of scratchpad memory sizes (x-axis), computed by

$$\frac{\text{\#cycles reduction from iterative dynamic allocation}}{\text{\#cycles reduction from non-iterative static allocation}}$$

Figure 6 additionally gives the improvement of iterative static allocation over non-iterative static allocation, providing insight into the stability of the worst-case execution paths of programs.

2 ILOG CPLEX, high-performance software for mathematical programming and optimization: cplex/
3 WCET benchmarks: wcet/benchmarks.html
4 UTDSP DSP Benchmark Suite: edu/~corinna/dsp/infrastructure/utdsp.html

For five out of eight benchmarks, iterative static allocation improves on non-iterative static scratchpad memory allocation by up to 3%, particularly for programs with a large amount of control flow (Engine, Pocsag, Statemate).

Histogram is a typical example of the benefit of the dynamic capability of our scratchpad memory allocation method. This program contains two frequently used arrays of 24 bytes, used separately in two program phases. Static allocation succeeds in placing one of these two arrays in a 24 bytes scratchpad memory. Dynamic allocation moves these two arrays alternately into the scratchpad memory unit, for an improvement of 47% over the original performance enhancement due to a static scratchpad allocation. On a scratchpad memory larger than 48 bytes, there is enough room to statically place the two arrays, and both schemes yield an identical WCET value.

Major benefits for dynamic scratchpad allocation are achieved for small ratios of scratchpad memory size over the whole program data working set. For example, dynamic scratchpad allocation is valuable for scratchpad size ratios lower than % of the working set for the programs Adpcm, Engine and G721.
On these ranges, the method outperforms static allocation by 12% to 85%. The approach is notably profitable for systems with a scratchpad memory shared among several real-time tasks. Thanks to the support of stack data (typically smaller than 32 bytes), our method takes advantage of very small scratchpad memory sizes, except for programs Histogram, Lpc and Spectral. These benchmarks have very few stack data instances (see Table 4) or few accesses to stack data (see Table 2).

We have conducted some preliminary evaluations of the address assignment algorithms described above. In the experiments of [27], the address assignment algorithm is shown to be fairly close to the optimal address assignment. In our experiments, the algorithm with detection of data regions gives marginal performance improvements over the address assignment algorithm of [27]. The iterative allocation algorithm selects data in order of performance impact. Consequently, data with a high performance impact have a higher chance to get a valid address assignment. Programs Adpcm, G721 and Lpc get a relative performance increase of 3%-7% when the detection of data regions is enabled. Gains are observed for small (less than 3 bytes) scratchpad memories. The scratchpad memory usage is high for such configurations and many data are transferred to scratchpad memory multiple times.
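The address assignment with data regions evaluated here (Algorithm 1) can be illustrated with a short first-fit sketch. It is a simplified illustration under assumptions, not the implementation used in the experiments: data regions are assumed to be already extracted, each one is given as a hypothetical tuple (label, size in bytes, set of edges, WCET impact), and free space is tracked as a list of occupied address intervals per edge.

```python
def assign_addresses(regions, capacity):
    """First-fit address assignment per data region (sketch of Algorithm 1).

    regions:  list of (label, size, edge_set, wcet_impact) tuples
    capacity: scratchpad size in bytes
    Returns {label: address} for the regions that could be placed.
    """
    occupied = {}   # edge -> list of (start, end) intervals already in use
    placed = {}
    # Line 2: handle the most impacting regions first.
    for label, size, edge_set, _impact in sorted(
            regions, key=lambda r: r[3], reverse=True):
        busy = [iv for e in edge_set for iv in occupied.get(e, [])]
        # First fit: the lowest free address is 0 or the end of a busy interval.
        for addr in sorted({0} | {hi for _, hi in busy}):
            end = addr + size
            if end <= capacity and all(end <= lo or addr >= hi
                                       for lo, hi in busy):
                for e in edge_set:
                    occupied.setdefault(e, []).append((addr, end))
                placed[label] = addr
                break
        # Line 7: a region that fits nowhere is left out of the result.
    return placed

example = assign_addresses(
    [("a", 24, {"e1", "e2"}, 100),
     ("b", 24, {"e3"}, 80),
     ("c", 8, {"e2", "e3"}, 50),
     ("d", 16, {"e1"}, 10)],
    capacity=32,
)
```

In this example, regions "a" and "b" live on disjoint edges and can both use address 0, region "c" overlaps both and is placed at address 24, and region "d" does not fit next to "a" on edge e1 and is dropped. Because addresses are chosen per region rather than per datum, two regions of the same datum may receive different addresses.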

[Figure 6: one plot per benchmark (adpcm, engine, g721, histogram, lpc, pocsag, spectral, statemate), with one series for iterative dynamic allocation and one for iterative static allocation.]
Figure 6. Improvement (in percent) of iterative dynamic and iterative static allocations over non-iterative static allocation.

Solver execution time

Allocation solving time tightly depends on the number of variables of the ILP problem. The number of variables for the static allocation problem is O(D), where D is the number of static data and stack data in the program. In our experiments, there is implicitly one stack data instance for each defined function. The counts of static data and functions in the programs of the benchmark set are given in Table 4. The number of variables for the dynamic allocation problem is O(D × E), where E is the number of edges in the flowgraph. The number of edges depends on the representation level of the flowgraph. Table 4 gives the number of functions and the number of basic blocks (BBs) of the programs. In this table, the number of functions of a program gives an idea of the size of the (coarse-grain) flowgraph generated from its call graph. In the same way, the number of basic blocks of a program gives an idea of the size of the (fine-grain) flowgraph generated from its interprocedural control flow graph.

In our experiments, we have observed that CPLEX running time is worst for the scratchpad memory size configurations where dynamic scratchpad memory allocation gives the best improvement over static scratchpad memory allocation. Table 4 compares the maximum observed running time for the CPLEX solver to produce a solution for (A) the static allocation problem, (B) the dynamic allocation problem with the (coarse-grain) flowgraph, and (C) the dynamic allocation problem with the (fine-grain) flowgraph. For the same program, B has fewer possible placements for memory transfer operations than C, and consequently gives a lower-quality allocation. The final column of Table 4 gives the relative allocation quality reduction (B - A) / (C - A). A value of …% means B allocation is as efficient as C, and …% means the allocation provides results as low as A static allocation.
This ratio is computed for the scratchpad memory size where C does its best over A static allocation.

First of all, the running time of the ILP solver is typically not an issue for any static scratchpad memory allocation problem. Second, programs (Adpcm, G721, Pocsag, Statemate) with a large number of data and a large generated (fine-grain) flowgraph may have a huge solver running time. Applying our method to many more benchmark programs may enable us to draw general conclusions on the impact of the number of ILP variables on the solver's running time.

Unsurprisingly, B dynamic allocation at function granularity produces lower-quality results than C dynamic allocation at basic-block granularity for most programs. One can remark that B is as efficient as C for two of the eight benchmarks (Engine, Statemate), even though their respective solving time is shorter. Conversely, Histogram contains only one function, and B allocation is strictly equivalent to A static allocation.

The major conclusion of this study is two-fold. First, the practical limitation of our method is the running time needed to solve ILP problems, which is problematic for the largest benchmarks studied in this paper. Second, approaches exist to scale up the applicability of our method to larger programs. A coarse flowgraph induces smaller ILP problems, potentially leading to a lower allocation quality. An orthogonal approach may be to apply the method to regions of the program (i.e. program subgraphs), to generate smaller ILP sub-problems. Moreover, it may be profitable to ignore some non-profitable data within a region, reducing the number of data considered in the generated sub-problems.

Table 4. Program sizes vs. problem solving time. Columns A, B and C give the allocation problem solving time.

Benchmark   Data  Functions  BBs  A (static)  B    C      Allocation quality improv. ratio
Adpcm       …     …          …    …s          …s   179s   59.2%
Engine      …     …          …    …s          1s   35s    …%
G721        …     …          …    …s          9s   42s    14.3%
Histogram   …     …          …    …s          1s   1s     …%
Lpc         …     …          …    …s          1s   …s     68.8%
Pocsag      …     …          …    …s          2s   51s    47.…%
Spectral    …     …          …    …s          1s   2s     39.8%
Statemate   …     …          …    …s          7s   8367s  …%

5. Related work

A main issue for dynamic scratchpad memory allocation is the preliminary selection of possible placements for memory transfer operations. [15] and [27] consider the placement of memory transfer operations at the level of basic blocks. [26] proposes to restrict memory transfer operations to interesting program points, such as function, conditional or loop entries/exits with high execution frequencies, in a flexible way. Moreover, [26] associates execution timestamps with program points in order to capture the program execution context. Data access statistics are recorded using these timestamps on a profiled execution. The manageable granularity of our flowgraph enables flexible selection of possible places for memory transfer operations. Moreover, the support of program execution order in our flowgraph seems possible through the replication of subgraphs of the generated flowgraph. However, it is unclear how WCET analysers could generate useful data access information in association with execution timestamps.

The formulation for dynamic scratchpad memory allocation introduced in Section 3.2 is an adaptation of [27] to manage read-only and modified data. The main benefit is to avoid useless store memory transfer operations from scratchpad to the reference copy in main memory for non-modified data. The support of stack data described in Section 3.3 is similar to the work for static memory allocation in [2], applied to our formulation for dynamic memory allocation. [7] studies the allocation of spilled data to a small and fast direct-addressed memory. Our formulation for dynamic scratchpad memory allocation supports stack data and it supersedes this original work. [2] contains an interesting study on the granularity of allocation: whole stack frames (as applied in this paper's experiments) or individual stack data. Each stack data can be allocated in a different memory; hence, the program has to manage multiple program stack pointers during its execution.
Their study concludes that individual stack data allocation gives only a marginal performance increase over whole stack frame allocation, due to the increased cost of managing multiple stacks.

[25] proposes an algorithm for greedy static allocation on scratchpad memory. Their algorithm iteratively (i) evaluates the worst-case execution path of the application, (ii) selects and allocates the most impacting (not already allocated) data to the scratchpad memory, then applies (i) and (ii) again until no more allocation in free memory space is possible. Our approach differs because the solver is iteratively called on a refined ILP problem and it supports allocation of stack data. Moreover, our approach is portable to both static and dynamic scratchpad memory allocation ILP problems.

In Section 3.2.3, due to scratchpad memory fragmentation, we have proposed to leave unallocatable data in main memory. Memory compaction is an interesting alternative to rearrange data in scratchpad memory. [26] has shown that such a mechanism has a minor impact on program performance. [26] also addresses major implementation issues on static data and stack data relocation for dynamic scratchpad memory allocation.

Fixed-size scratchpad memories are unable to hold data that are too large, and therefore cannot benefit from the temporal locality of accesses to such data, unlike data caches. As studied in [15], program transformations such as tiling of big arrays enable better scratchpad memory usage and increase the global effectiveness of the allocation.

6. Conclusion and future work

The main contributions of this paper are two-fold. First, we have described an approach to calculate the targets of load-store instructions. Our approach is based on a common compiler infrastructure and relies on the presence of an interprocedural pointer analysis. Exhaustive knowledge of load-store instruction targets in a program requires a whole-program analysis mode, available in our compiler infrastructure. Second, we have proposed a dynamic scratchpad memory allocation algorithm to support both static data and stack data.
Our approach attempts to reduce the WCET of real-time programs through the allocation of the most impacting data on their worst-case execution paths. Due to the variability of the worst-case execution paths in programs [25], we have applied an iterative scheme for data allocation. This scheme requires multiple iterations of WCET program analysis and it has demonstrated improved results [25]. Our experiments have shown the increasing computational complexity of allocation problem solving with program size. To tackle this issue, we have proposed to limit data transfers to entry and exit of functions, reducing the allocation problem size at the cost of a decrease in allocation quality.

Scratchpad memory allocation of data provides a fully predictable latency of load-store instructions [1]. Furthermore, some compiler optimizations (e.g. instruction scheduling) could take advantage of such information for better code generation. [28] have compared static scratchpad memory allocation with some instruction cache WCET analyses. We plan to compare our approach for dynamic scratchpad memory allocation with data cache analyses.

In this paper, we have considered a system with only one scratchpad memory device. The optimal static memory allocation [2] supports multiple scratchpad memory devices. This may drastically increase the number of variables of the generated allocation problem. We leave such an extension for dynamic memory allocation as future work.

Acknowledgments. The authors thank Olivier Rochecouste for his comments that helped improve the quality of this paper.

References

[1] S. G. Abraham, R. A. Sugumar, D. Windheiser, B. R. Rau, and R. Gupta. Predictability of load/store instruction latencies. In Proceedings of the 26th Annual International Symposium on Microarchitecture, Austin, TX, Dec.
[2] O. Avissar, R. Barua, and D. Stewart. An optimal memory allocation scheme for scratch-pad-based embedded systems. ACM Transactions on Embedded Computing Systems, 1(1):6-26, Nov. 2002.
[3] G. Balakrishnan and T. W. Reps. Analyzing memory accesses in x86 executables. In Proceedings of the 13th International Conference on Compiler Construction, volume 2985 of Lecture Notes in Computer Science, pages 5-23, Barcelona, Spain, Mar. 2004.
[4] D. Berlin. Structure aliasing in GCC. In Proceedings of the 2005 GCC Developer's Summit, pages 25-35, Ottawa, Canada, June 2005.
[5] H. Cassé, L. Féraud, C. Rochange, and P. Sainrat. Using Abstract Interpretation Techniques for Static Pointer Analysis. Computer Architecture News, 27(1):47-5, Mar.
[6] A. Colin and I. Puaut. A modular & retargetable framework for tree-based WCET analysis. In Proceedings of the 13th Euromicro Conference on Real-Time Systems, pages 37-44, Delft, The Netherlands, June 2001.
[7] K. D. Cooper and T. J. Harvey. Compiler-controlled memory. In Proceedings of the 8th International Conference on Architectural Support for Programming Languages and Operating Systems, pages 2-11, San Jose, CA, Oct.
[8] A. Dominguez, S. Udayakumaran, and R. Barua. Heap data allocation to scratch-pad memory in embedded systems. Journal of Embedded Computing, 1(4):521-54, July 2005.
[9] C. Ferdinand, R. Heckmann, M. Langenbach, F. Martin, M. Schmidt, H. Theiling, S. Thesing, and R. Wilhelm. Reliable and precise WCET determination for a real-life processor.
In Proceedings of the 1st International Workshop on Embedded Software, volume 2211 of Lecture Notes in Computer Science, Tahoe City, CA, Oct. 2001.
[10] L. J. Hendren, C. Donawa, M. Emami, G. R. Gao, Justiani, and B. Sridharan. Designing the McCAT compiler based on a family of structured intermediate representations. In Proceedings of the 5th International Workshop on Languages and Compilers for Parallel Computing, pages 46-4, New Haven, CT, Aug.
[11] M. Hind. Pointer analysis: haven't we solved this problem yet? In Proceedings of the ACM SIGPLAN-SIGSOFT 2001 Workshop on Program Analysis for Software Tools and Engineering, pages 54-61, Snowbird, UT, June 2001.
[12] R. E. Johnson, C. McConnell, and J. M. Lake. The RTL system: A framework for code optimization. In Proceedings of the International Workshop on Code Generation, Dagstuhl, Germany, May.
[13] S.-K. Kim, S. L. Min, and R. Ha. Efficient worst case timing analysis of data caching. In Proceedings of the 2nd IEEE Real-Time Technology and Applications Symposium, pages 23-24, Brookline, MA, June.
[14] R. Kirner and P. P. Puschner. Classification of WCET analysis techniques. In Proceedings of the 8th IEEE International Symposium on Object-Oriented Real-Time Distributed Computing, Seattle, WA, May 2005.
[15] L. Li, L. Gao, and J. Xue. Memory coloring: A compiler approach for scratchpad memory management. In Proceedings of the 14th International Conference on Parallel Architectures and Compilation Techniques, St. Louis, MO, Sept. 2005.
[16] X. Li, A. Roychoudhury, and T. Mitra. Modeling out-of-order processors for WCET analysis. Real-Time Systems, 34(3), Nov. 2006.
[17] T. Lundqvist and P. Stenström. Timing anomalies in dynamically scheduled microprocessors. In Proceedings of the IEEE Real-Time Systems Symposium, pages 12-21, Phoenix, AZ, Dec.
[18] T. Lundqvist and P. Stenström. A method to improve the estimated worst-case performance of data caching. In Proceedings of the 6th International Conference on Real-Time Computing Systems and Applications, Hong Kong, China, Dec.
[19] M. E. Benitez and J. W. Davidson. A portable global optimizer and linker. In Proceedings of the ACM SIGPLAN 1988 Conference on Programming Language Design and Implementation, Atlanta, GA, June 1988.
[20] S. Mehrotra and L. Harrison. Examination of a memory access classification scheme for pointer-intensive and numeric programs. In Proceedings of the 1996 International Conference on Supercomputing, Philadelphia, PA, May 1996.
[21] I. Puaut and C. Pais. Scratchpad memories vs locked caches in hard real-time systems, a quantitative comparison. In Proceedings of the 2007 Conference on Design Automation and Test Europe, Nice, France, Apr. 2007.
[22] SA-1 microprocessor timing: an application note. Digital Equipment Corporation, June.
[23] J. Scott, L. H. Lee, J. Arends, and B. Moyer. Designing the low-power mcore architecture. In Proceedings of the Workshop on Power Driven Microarchitecture, Barcelona, Spain, June.
[24] J. Staschulat and R. Ernst. Worst case timing analysis of input dependent data cache behavior. In Proceedings of the 18th Euromicro Conference on Real-Time Systems, Dresden, Germany, July 2006.
[25] V. Suhendra, T. Mitra, A. Roychoudhury, and T. Chen. WCET centric data allocation to scratchpad memory. In Proceedings of the 26th IEEE Real-Time Systems Symposium, Miami, FL, Dec. 2005.
[26] S. Udayakumaran, A. Dominguez, and R. Barua. Dynamic allocation for scratch-pad memory using compile-time decisions. ACM Transactions on Embedded Computing Systems, 5(2), May 2006.
[27] M. Verma and P. Marwedel. Overlay techniques for scratchpad memories in low-power embedded processors. IEEE Transactions on Very Large Scale Integration Systems, 4(8):82-815, Aug. 2006.
[28] L. Wehmeyer and P. Marwedel. Influence of memory hierarchies on predictability for time constrained embedded software. In Proceedings of 2005 Design, Automation and Test in Europe Conference and Exposition, pages 6-65, Munich, Germany, Mar. 2005.
[29] R. T. White, F. Mueller, C. A. Healy, D. B. Whalley, and M. G. Harmon. Timing analysis for data and wrap-around fill caches. Real-Time Systems, 17(2-3):9-233, Nov.
[30] R. P. Wilson, R. S. French, C. S. Wilson, S. P. Amarasinghe, J.-A. M. Anderson, S. W. K. Tjiang, S.-W. Liao, C.-W. Tseng, M. W. Hall, M. S. Lam, and J. L. Hennessy. SUIF: An infrastructure for research on parallelizing and optimizing compilers. SIGPLAN Notices, 29(12):31-37, Dec.
