Extending STI for Demanding Hard-Real-Time Systems


Benjamin Welch, Shobhit Kanaujia, Adarsh Seetharam, Deepaksrivats Thirumalai, Alexander G. Dean
Center for Embedded Systems Research, North Carolina State University, Raleigh, NC
{sokanau, aseetha,

ABSTRACT
Software thread integration (STI) is a compilation technique which enables the efficient use of an application's fine-grain idle time on generic processors without special hardware support. With STI, a primary function (with real-time requirements on specific instructions) is automatically interleaved with a secondary function to create a single implicitly multithreaded function which minimizes context switching and hence both improves performance and offers very fine-grain concurrency.

In this paper we extend STI techniques to address two challenges. First, we reduce response time for interrupts or other high-priority threads by introducing polling servers into integrated threads. Currently integrated threads disable interrupts, delaying all other work until their completion. Second, we enable integration with long host threads, expanding the domain of STI. With current techniques, if there are frequent interrupts, only host threads which can finish execution before the next interrupt can be integrated. We derive methods to evaluate the response time for threads in systems with and without these new integration methods.

We demonstrate these concepts with the integration of various threads in a sample hard-real-time system on a highly-constrained microcontroller. We use an inexpensive 20 MHz AVR 8-bit microcontroller to generate monochrome NTSC video while servicing a high-speed (115.2 kbaud) serial communication link. We have built and tested this system and demonstrate graphics rendering speed-ups of 3.99x to 3.5x.
Categories and Subject Descriptors
B.1.4 [Control Structures and Microprogramming]: Microprogram Design Aids - Languages and compilers, optimization; C.3 [Special-Purpose and Application-Based Systems]: Real-time and embedded systems; D.3.4 [Programming Languages]: Processors - Code Generation, Compilers, Optimization, Run-Time Environments; D.4.1 [Operating Systems]: Process Management - Concurrency, multitasking, scheduling, threads; D.4.7 [Operating Systems]: Organization and Design - Real-time systems and embedded systems.

General Terms
Algorithms, Design, Experimentation, Performance, Theory.

Keywords
Embedded systems, software thread integration, hardware-to-software migration, post-pass compiler, fine-grain concurrency, NTSC video, AVR, STIGLitz.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. CASES'03, Oct. 30 - Nov. 2, 2003, San Jose, California, USA. Copyright 2003 ACM /03/000 $5.00.

1. INTRODUCTION
1.1 Software Thread Integration
Software thread integration (STI) is a back-end compiler technique that provides fine-grain concurrency on generic processors by eliminating many context switches. By eliminating the need for special architectural features it allows generic, low-cost processors to replace more expensive specialized devices. STI reduces the clock speed needed to implement given functionality on a generic processor, saving money, power, and energy and simplifying design efforts.

Figure 1. Overview of hardware to software migration with STI. Idle time is statically filled at compile time with other useful work from the system.

Hardware to software migration (HSM) is often performed in industry to reduce costs and increase design flexibility. The economies of scale make microcontrollers much less expensive than most dedicated hardware circuits (e.g. protocol controllers).

STI simplifies hardware-to-software migration by squeezing more performance out of generic processors while insulating the programmer from implementation details. STI works by merging two functions into one implicitly multithreaded function, as shown in Figure 1 [8][9][10][11][12]. When used for real-time software, it enables the placement of time-critical instructions from one thread so they execute at a given time relative to the beginning of the integrated function, regardless of the control or data flow characteristics of either thread.

STI begins with timing regularization; execution paths of uneven duration are padded to last the same amount of time. Next, either the register file is partitioned or def-use webs are extracted for later register reallocation. Code from one thread can now be moved to execute at a given time in the other, using code transformations such as motion, replication, and loop peeling, splitting, guarding, unrolling and fusion. The transformations are driven by user-supplied timing directives which indicate when specific real-time instructions must execute. We have implemented a thread-integrating compiler, Thrint, which implements many of these analyses and transformations for the AVR [] architecture, which is 8-bit, load/store, and optimized for embedded C code.

Although the timing variability of modern high-performance CPUs and memory hierarchies greatly reduces the temporal determinism which STI requires, this is a non-issue. STI targets applications which neither need nor can afford these CPUs and memory systems. For perspective, in % of the 8 billion microprocessors sold were four- and eight-bit units [8]. These microcontrollers run applications which are not computationally intensive, and do not need more parallelism or faster clock rates. They lack sophisticated microarchitectures and memory systems, and often cannot afford them. Instead these applications are constrained by other issues such as functionality, cost, power dissipation, design time and use of commercial off-the-shelf products. Hardware to software migration with STI allows the designer to address these issues efficiently.

Each thread consists of one or more functions; it is the functions which are integrated. Previous STI work classifies threads to be integrated as either guest threads (moved from hardware into software, and hence containing the most time-critical instructions) or host threads (application threads which have always been software). When not referring to hardware-to-software migration, guest threads are called primary threads due to their fine-grain (internal) idle time, and host threads are called secondary threads due to their lack of fine-grain idle time.

1.2 Demanding Hard Real-Time Systems and Software Thread Integration
Figure 2 classifies in two dimensions the system's threads (including interrupt service routines (ISRs)). The horizontal axis indicates the worst-case execution time (WCET) $C_i$ of a thread, while the vertical axis shows its laxity $L_i$. The lower a thread is on the plot, the less its laxity and the sooner it must be executed. The farther to the right a thread is, the more processing it requires per instance. These are the threads from which one or more secondary functions for integration must be chosen.

Current STI methods make portions of this design space unreachable for two reasons. First, a function integrated with STI disables interrupts (and hence context switches) to ensure that all its instructions execute at the appropriate times. In turn, this can delay the execution of other threads, potentially resulting in missed deadlines, which will make hard real-time systems fail. This conflict limits the use of STI to systems whose threads have laxities of at least the worst-case duration of the integrated functions. In Figure 2, sections 2 and 4 show response times which can no longer be achieved when STI is applied.

Figure 2. Design space overview considers real-time requirements. (Vertical axis: laxity, the maximum latency allowed, divided at the worst case execution time of the integrated thread. Horizontal axis: WCET for application thread, divided at the minimum primary thread period minus maximum primary thread work.)

Second, existing STI methods rely upon integrating primary functions with secondary functions that are short enough to complete execution before the occurrence of the next primary function (whether periodically scheduled or triggered by an interrupt). In sections 3 and 4 of Figure 2 the next instance of the periodic primary thread arrives before the already-running instance completes its execution.

These constraints limit a designer's options for using existing STI methods. First, all interrupt service routines (and threads in general) can be delayed by the duration of the longest integrated function. This implies that if any threads are present in regions 2 and 4, STI cannot be used in the system. Second, integrated functions must run to completion to ensure the real-time instructions within are executed on time. An integrated function consists of a primary function lasting up to $C_{Pr}$ with some idle time $I_{Idle}$ and a secondary function lasting up to $C_{Sec}$. This implies that the integrated function will last at least $C_{Pr} + C_{Sec} - I_{Idle}$. The system designer must choose the secondary function(s) such that the integrated function's duration is shorter than the minimum period of either the primary or secondary thread. In the event of a frequently running primary thread, the number of valid secondary functions is dramatically reduced.

In this paper we present new methods which eliminate both of these constraints, allowing integration with threads in regions 2, 3 and 4. First we show how to integrate polling servers to handle short-latency interrupts to reduce the response time. Second we present methods which partition a secondary function into multiple shorter segments, each of which is integrated with the primary function, enabling longer secondary functions to be used.

The paper is organized as follows. Section 2 summarizes software thread integration. Section 3 presents our new methods based

upon polling servers and thread partitioning. Section 4 applies the methods to a demanding hard real-time system: a 20 MHz 8-bit microcontroller which generates monochrome NTSC video with minimal hardware support. Section 5 analyzes the results of integration upon the system. Section 6 draws conclusions and presents future work.

2. STI OVERVIEW
Figure 1 presents an overview of how STI is used for hardware-to-software migration (HSM). A hardware function is replaced with software written by a programmer. This code consists of one or more guest threads (represented by the solid bar) with real-time requirements. When the threads are scheduled for execution on a sufficiently fast CPU, gaps will appear in the schedule of guest instructions, as illustrated by the white gaps in the black bar. These gaps are pieces of idle time which can be reclaimed to perform useful host work. STI recovers fine-grain idle time efficiently and automatically.

STI uses a control dependence graph (CDG, a subset of the program dependence graph [4]) to represent each function in a program. In this hierarchical graph, control dependence regions such as conditionals and loops are represented as non-leaf nodes, and assembly language instructions are stored in leaf nodes. Conditional nesting is represented vertically while execution order is horizontal. The CDG is a good form for holding a program for STI because this structure simplifies analysis and transformation through its hierarchy. Program constructs such as loops and conditionals as well as single basic blocks are moved efficiently in a coarse-grain fashion, yet the transformations also provide fine-grain scheduling of instructions as needed.

Using STI for HSM involves moving guest code into the correct position within the host code for execution at the correct time. The guest function code is first appended to the end of each host function. The resulting function is then automatically integrated by moving guest nodes to the left in the CDG to locations which correspond to the target time ranges.
A tight target time range may fall completely within a host node, forcing movement down into that node or its subgraph.

Before code motion the host and guest threads are statically analyzed for timing behavior, with best and worst cases predicted. Hardware and software both conspire to make this a difficult problem in the general case. However, we focus on applications without recursion or dynamic function calls, and processors without superscalar execution, virtual memory or variable latency instructions. We assume locked caches or fast on-chip memory and predictable pipelining; these restrictions have little impact on the design space targeted.

During integration, programmer-supplied timing directives guide integration. Timing jitter from uneven predicates in the host thread is automatically reduced (using padding instructions) to meet guest requirements. The CDG's structure makes the timing analysis and integration straightforward.

STI produces code which is more efficient than context-switching or busy-waiting. The processor spends fewer cycles performing overhead work. The price is expanded code memory. STI may duplicate code, unroll and split loops and add guard instructions. It may also duplicate both host and guest threads. The memory expansion can be reduced by trading off execution speed or timing accuracy. This flexibility allows the tailoring of STI transformations to a particular embedded system's constraints. For more details please see [9] and related work.

3. NEW TECHNIQUES
The methods presented here use polling servers to support short laxity secondary threads, and segmentation to enable integration of secondary functions with frequent, short primary functions. These extensions assume that the integrated threads have the highest priority in the system, are run soon enough to meet their deadlines, and are not preemptable (by other threads, the OS, or ISRs). Other than this there are no restrictions on other threads, any OS/RTOS, or the scheduling policy.

3.1 Integrated Polling Servers
Table 1.
Terms for Scheduling Analysis

| Term | Definition |
|---|---|
| a-z | Thread |
| $D_i$ | Deadline of thread $i$ |
| $T_i$ | Period of thread $i$ |
| $T_i^{PS}$ | Period of polling server for thread $i$ |
| $R_i$ | Response time of thread $i$ |
| $R'_i$ | Response time of thread $i$ after integrating polling servers |
| $C_i$ | Worst case computation time of thread $i$ (including fine-grain idle time $Z_i$) |
| $C'_i$ | Worst case computation time of thread $i$ (including fine-grain idle time $Z_i$) after integrating polling servers |
| $L_i$ | Laxity or slack time of thread $i$ |
| $I_i$ | Interference from higher-priority threads for thread $i$ |
| $N_i^{PS}$ | Number of polling servers for thread $i$ |
| $hp(i)$ | Set of threads with higher priority than thread $i$ |
| $it$ | Set of integrated threads before polling server integration |
| $ps$ | Set of polling servers |
| $Z_i$ | Idle time in thread $i$ |
| $Z_i^*$ | Minimum idle time in thread $i$ including minimum idle time between invocations |

Table 1 presents the terms used in this analysis. At time $t$, thread $i$ is released. To complete on time (before its deadline $D_i$), it must start no later than its worst case computation time $C_i$ added to the maximum time which it can be delayed by interference $I_i$ (from higher priority threads $hp(i)$) before its deadline. Integrated threads ($it$) cannot be preempted, hence they are all in $hp(i)$ and can interfere with thread $i$:

$$I_i \ge \sum_{k \in it} C_k$$

In a system with multiple integrated threads, this may raise response times to levels which make real-time performance unachievable. We solve this problem by integrating polling servers into these integrated threads.

A polling server is a periodic thread with relatively high priority which services an aperiodic thread in order to ensure a fixed response time. It consists of polling code and conditionally executed service (thread or ISR) code. If laxity $L_i$ of thread $i$ is greater than an integrated thread $j$'s duration $C_j$, $j \in it$, then $i$ can be deferred until $j$ completes, and no polling server is needed. Otherwise, in order to ensure that an aperiodic thread is serviced soon enough to meet its deadline, a polling server is integrated at least every $T_i^{PS}$ in each integrated thread. The maximum polling server period is $T_i^{PS} = D_i - R_i$. The response time $R_i$ can be computed from the standard recurrence equation [2], in which estimate $n+1$ depends upon the previous estimate, and $R_i^0 = C_i$:

$$R_i^{n+1} = C_i + \sum_{j \in hp(i)} \left\lceil \frac{R_i^n}{T_j} \right\rceil C_j$$

Overlapping multiple polling servers within an integrated thread are scheduled in order of decreasing priority, leading to the delay (static preemption) of the lower priority polling servers. After integration there will be one or more copies of the polling server for each thread $i$ in the integrated thread $j$. Threads with little tolerance for delay ($T_i^{PS} \ll D_i$) will require many more polling server copies than those with more tolerance ($T_i^{PS} \approx D_i$):

$$N_i^{PS} = \left\lceil \frac{C_j}{T_i^{PS}} \right\rceil$$

Integration requires reducing timing jitter (typically through nop padding) to ensure that upcoming primary thread instructions execute on time. This implies that the polling server code must be padded to last a constant time regardless of whether the service is actually needed or not. As a result, the polling server lasts $C_i$ in both cases and is not bandwidth-preserving. The secondary thread's execution time then rises to:

$$C'_j = C_j + \sum_{i \in ps} N_i^{PS} C_i$$

The interference time also rises, as polling servers for higher priority threads will affect thread $i$ regardless of whether these threads are released. These run more often and always last the worst case time $C_i$.
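The recurrence and the two polling-server quantities above can be evaluated mechanically. The following is a minimal sketch in C, assuming integer time units and a flat array of higher-priority threads; the type `thread_t` and the function names are hypothetical and do not come from the paper or from Thrint.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical descriptor for one higher-priority thread j (integer time units). */
typedef struct {
    unsigned long period; /* T_j */
    unsigned long wcet;   /* C_j */
} thread_t;

/* Integer ceiling of a / b. */
static unsigned long ceil_div(unsigned long a, unsigned long b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

/* Iterate R^{n+1} = C_i + sum over hp(i) of ceil(R^n / T_j) * C_j, starting
 * from R^0 = C_i. Returns the fixed point, or 0 once an estimate exceeds
 * `limit` (for example the deadline D_i), meaning no usable bound exists. */
unsigned long response_time(unsigned long c_i, const thread_t *hp,
                            size_t n_hp, unsigned long limit)
{
    unsigned long r = c_i;
    for (;;) {
        unsigned long next = c_i;
        for (size_t j = 0; j < n_hp; j++)
            next += ceil_div(r, hp[j].period) * hp[j].wcet;
        if (next > limit)
            return 0;
        if (next == r)
            return r;
        r = next;
    }
}

/* Maximum polling server period: T_PS = D_i - R_i. */
unsigned long polling_period(unsigned long d_i, unsigned long r_i)
{
    assert(d_i > r_i);
    return d_i - r_i;
}

/* Copies of the polling server needed in an integrated thread of duration
 * C_j: N_PS = ceil(C_j / T_PS). */
unsigned long polling_copies(unsigned long c_j, unsigned long t_ps)
{
    return ceil_div(c_j, t_ps);
}
```

With two higher-priority threads (T = 10, C = 3) and (T = 20, C = 4) and C_i = 2, the estimates go 2, 9, 9, so the sketch settles on a response time of 9; a deadline of 50 then allows a polling period of 41, and an integrated thread lasting 100 units would need three polling server copies.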
The response time, below, includes three summations which consider the polling servers which service higher priority threads, all integrated threads (not including the polling servers) and the remaining higher priority threads which do not use integrated polling servers.

$$R_i'^{\,n+1} = C_i + \sum_{j \in hp(i) \cap ps} \left\lceil \frac{R_i'^{\,n}}{T_j^{PS}} \right\rceil C_j + \sum_{k \in it \setminus ps} \left\lceil \frac{R_i'^{\,n}}{T_k} \right\rceil C_k + \sum_{l \in hp(i) \setminus ps} \left\lceil \frac{R_i'^{\,n}}{T_l} \right\rceil C_l$$

Once laxity drops much below the duration of a secondary thread, both the code size and secondary thread execution time suffer. The execution time also rises as the duration of the polling server rises, emphasizing the importance of minimizing the duration of these high-priority threads, typically implemented as interrupt service routines.

There has been much work extending polling servers (e.g. deferred server, sporadic server [20]); investigating what is compatible with STI's static scheduling characteristics is left for future work.

3.2 Partitioning Long Secondary Threads into Segments
A primary thread which runs frequently will not be able to provide much idle time for a secondary thread to execute, as it must complete before the next instance of the primary thread (as shown in Figure 3). Because integrated threads are not preemptable, the maximum computation time $C_j$ for a secondary thread must be less than $Z_i^*$, the available idle time per period.

$$C_j < Z_i^* = Z_i + T_i - C_i - 2\,T_{cocall}$$

Our solution is to partition the secondary thread into segments and integrate each segment with the primary thread. Coroutine calls and a separate stack frame for the integrated thread preserve live variables across its successive invocations, until the completion of the secondary thread. Each invocation of the integrated thread executes a segment of the secondary thread.

Figure 3. Frequently executing primary thread has limited idle time, limiting duration of potential secondary thread for integration. (Labels: WCET $C_i$ includes fine-grain idle time $Z_i$; $T_i$: period or minimum interarrival time; $Z_i^*$: maximum time available for secondary thread; $T_{CS}$: context-switching delay.)

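To make the idle-time bound and the partitioning idea concrete, the sketch below computes the available idle time per period (idle time plus period, minus WCET and two coroutine-call delays) and then packs a straight-line list of node durations greedily into segments that each fit in it. This is an illustration under stated assumptions, not the paper's algorithm: the function names are hypothetical, a node is assumed to fit when it is no longer than the remaining time, and conditionals and loops are not modeled.

```c
#include <assert.h>
#include <stddef.h>

/* Z* = Z + T - C - 2 * T_cocall, all in CPU cycles. C already includes the
 * fine-grain idle time Z. Returns 0 when the coroutine-call overhead uses up
 * all of the idle time. */
unsigned long available_idle(unsigned long z, unsigned long t,
                             unsigned long c, unsigned long t_cocall)
{
    assert(t >= c);
    unsigned long gross = z + (t - c);
    unsigned long overhead = 2 * t_cocall;
    return gross > overhead ? gross - overhead : 0;
}

/* Greedily pack straight-line node durations into segments of at most z_star
 * cycles, keeping the nodes in program order. Returns the number of segments,
 * or 0 if the list is empty or a single node is longer than z_star (such a
 * node would have to be split, which this sketch does not attempt). */
size_t count_segments(const unsigned long *node_cycles, size_t n,
                      unsigned long z_star)
{
    size_t segments = 0;
    unsigned long remaining = 0;
    for (size_t k = 0; k < n; k++) {
        if (node_cycles[k] > z_star)
            return 0;
        if (segments == 0 || node_cycles[k] > remaining) {
            segments++;            /* start a new segment */
            remaining = z_star;
        }
        remaining -= node_cycles[k];
    }
    return segments;
}
```

For example, with Z = 9, T = 100, C = 60 and a coroutine call of 5 cycles, 39 cycles are available per period, and nodes lasting 10, 20, 15, 30 and 5 cycles fall into three segments: {10, 20}, {15} and {30, 5}.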
Segments are formed by traversing the CDG of the secondary thread function. Subroutine calls are not supported as it might be necessary to integrate primary code with the called function. Unstructured or irreducible code (strongly connected components with multiple entries [7]) is not supported as it introduces timing variability; future work will use node splitting and padding to correct this.

In the first step of segment formation, the conditionals in the secondary thread function are padded so each case has the same duration. However, in the case of loops with unknown iterations, this padding is not performed. Second, the CDG is split into one or more segments, each of which completes within the available idle time. Third, a coroutine call is placed at the end of each segment. Fourth, the primary thread is integrated with each segment using existing STI techniques.

The algorithm for segment formation appears in Appendix A and is applied hierarchically. A new segment is created and its available idle time is set to the maximum available ($Z_i^*$) and the first node in the function is examined. If its duration is less than the available idle time, the node is added to the segment and the method is applied to the successor node. The segment grows until the next node does not fit. If the node's duration is known, but is too long, there are two possible actions. If the node is a

conditional, a new segment is started for each case (true and false). If the node is a loop, it is split to fill but not exceed the remaining idle time. If the duration of the node is unknown, then its type is examined. For a predicate, new segments are started with the successor node and also with each condition. For a loop, a new segment is started with the successor node and also the first child in the loop body.

Previous work [] integrates primary and secondary loops by creating a fused loop and two clean-up loops. The fused version matches secondary work to available idle time in the primary loop body through unrolling and padding. The fused loop executes while both primary and secondary work exist. The second loop (primary clean-up) finishes any remaining primary iterations, padded with nops, and the third loop (secondary clean-up) finishes any remaining secondary iterations. These techniques are now modified so that if the fused loop terminates because the primary loop iterations have been completed, but enough secondary loop iterations remain to execute the segment again, the segment will repeat. Recall that the segment begins at the start of the loop body. This is implemented by saving the current (not next) segment's start address when performing the coroutine call.

These transformations enable long secondary threads to be partitioned into segments short enough to be integrated with frequent primary threads with little idle time.

3.3 Combining Integration of Polling Servers with Partitioned Threads
In order to combine integrated polling servers with partitioned long secondary threads, the following approach is used. First, the polling servers are integrated with the primary thread. This fills in the first $C_i$ of idle time with the polling server. Polling servers for multiple threads are integrated sequentially. The remaining idle time now determines the segment time used to partition the long secondary threads. In this fashion the polling servers are executed each time the integrated thread is run, regardless of which segment executes.

4. EXPERIMENTAL METHOD
4.1
Target Application

Figure 4. Video portion of monochrome NTSC video signal (front porch 1.5 usec, horizontal sync 4.7 usec, back porch 4.7 usec, active video 52.6 usec).

To demonstrate the benefits of STI for HSM we use an NTSC video refresh controller application for driving a display. We replace a complex video generator chip with simple, inexpensive hardware and software-implemented functionality, as shown in Figure 5. NTSC video signal generation represents a large application domain, with video outputs present in consumer electronic devices such as DVD players, digital cameras, camcorders, and video games.

Video overlay, or on-screen display, is a related function in which locally generated graphics are overlaid atop the incoming video signal. This overlaying requires precise timing analysis of that video signal's synchronization information to selectively replace pixels with frame-buffer contents. This appears in televisions, hospital and security monitoring systems. Picture-in-picture display is also related, but requires actual sampling, buffering and resizing of multiple incoming video streams and hence requires much more processing. From a computational perspective, the generation and overlay threads are quite similar, and we have developed on-screen display software and hardware [6] based upon the work presented here. This paper's application can be easily applied to video overlay.

The processor must generate an NTSC-compatible monochrome video signal; Figure 4 summarizes the video portion of the signal. Although the CRT's electron beam scans 525 times per frame (in two interlaced passes (fields) per 33.3 ms frame), only 494 rows are visible and require video data. There are additional features in a video signal (vertical sync and equalization pulses); these are generated by our software as well. The video data portion of the signal is the most demanding, as a pixel of video data must be generated every 200 ns (for 256 pixels per row). We use an external shift register to serialize a byte packed with data, reducing the processor loading.
With a 20 MHz CPU this corresponds to 16 clock cycles per byte, which is too frequent for context switching or dynamic scheduling. A digital-to-analog converter (DAC) converts the serialized pixels from the data byte to an analog voltage for the NTSC output. Our system generates a monochrome 256 x 254 pixel image with two bits per pixel, although resolutions of up to 512 x 525 with 1 bit per pixel are possible with minor modifications.

4.2 Experimental Platform
4.2.1 Hardware

Figure 5. Block diagram of overall circuit. (Blocks: ATmega128 MCU, MCU clock, latch, 64 kbyte SRAM, pixel clock divider, byte clock divider, two 4-bit shift registers, NTSC video out.)

Our experimental platform, called STIGLitz [3], provides a library of graphics primitive rendering functions integrated with video refresh code and high-speed serial communications with very simple hardware. We target the Atmel AVR architecture, which features 8-bit native word size, 32 general-purpose registers, and limited support for 16-bit operations. Three register pairs can be used as index registers, speeding memory access. The ATmega128 processor [] is inexpensive (about $3 in volume) and provides 128 kilobytes of Flash program memory, 4 kilobytes of on-board data SRAM and numerous peripherals. The CPU core features a two-stage pipeline; most instructions take one cycle, but some take more (branches are 2 cycles, multiplies 2, calls 5, returns 4, loads 2-3 and stores 2-3). Data memory is byte-accessible and byte-aligned, and there is no cache. An Atmel STK500 evaluation board and STK501 processor expansion card are used to execute the integrated code. These are clocked at 20 MHz.

64 kilobytes of external SRAM are used as well, with a one cycle performance penalty, so loads and stores take three cycles. The C compiler used is GCC 3.2 [3]. No operating system is used, although STI does not preclude the use of one.

Figure 6. Photograph of video serializer board.

Every 16 cycles the circuit shown in Figure 5 and Figure 6 samples the data byte present on an MCU output port, serializes it into four pixels (using two four-bit shift registers) and converts it into an NTSC-compliant voltage. This hardware relieves the MCU of shifting out individual pixels, but still leaves it with the responsibility of ensuring that the video data is present on its output port at the appropriate time. The shift registers are clocked by the pixel clock divider and are loaded by the byte clock divider. We have built and tested this circuit and it works correctly.

4.3 Software
4.3.1 Overview and Software Architecture

Figure 7. NTSC ISR calls dispatcher, which calls or resumes an integrated thread to render most of video data and service the UART. (Timeline of one 63.5 us scan line: HSync, back porch, video, front porch.)

Figure 7 shows a timeline of processor activity during the generation of a video scan line and the relationship to portions of the NTSC signal. A periodic timer-based interrupt triggers an ISR which generates the video signal. Its two responsibilities are to draw a full field (which takes 16.7 milliseconds and occurs 60 times per second) and to generate the equalization pulses of the vertical synchronization signal (Figure 4). The ISR examines the queues and selects one of the integrated functions (if data is present in the queue) or else a dedicated busy-wait refresh function. Figure 9 presents an overview of what integrated threads are available and how they are composed. The chosen thread then reads video data from the frame buffer in memory and sends it out to the CRT through the DAC. The ISR consumes 98.2% of the processor's time.

Serial communication at 115.2 kbaud is handled by a UART, two circular queues, and routines for enqueueing and dequeuing data.

Figure 8 presents the overview of the software constructs which allow the application to gather work (i.e. rendering graphics primitives) to perform during video refresh. The application program can specify if rendering work is to be performed immediately or can be deferred. In the latter case, parameters for each deferred primitive are saved in the appropriate deferred work queue.

Figure 8. Overview of software architecture for STIGLitz. Deferrable graphics work is enqueued for later rendering during video refresh. (The dispatcher in the periodic ISR selects one of the integrated functions, such as DrawHLine PumpPixel UART, DrawVLine PumpPixel UART, DrawXLine PumpPixel UART or DrawCircle PumpPixel UART, to refresh the display.)

Figure 9. Overview of integrated threads, including original threads and sequence of integration. (The UART Tx and Rx polling servers are integrated with the video refresh function PumpPixel to form PumpPixel_UART, which is then integrated with the segments of each graphics primitive to form integrated functions with polling servers such as DrawHLine_PumpPixel_UART, DrawDLine_PumpPixel_UART and DrawXLine_PumpPixel_UART.)
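The dispatcher's selection rule described above (use an integrated function if its deferred work queue holds data, otherwise fall back to the busy-wait refresh) can be sketched as follows. This is a hedged illustration, not the STIGLitz source: `work_queue_t`, `select_refresh` and the `stub_*` functions are hypothetical names, queues are scanned in array order, and the coroutine resume of a partially executed segment is not modeled.

```c
#include <assert.h>
#include <stddef.h>

typedef void (*refresh_fn)(void);

/* Hypothetical deferred work queue: a count of waiting primitives plus the
 * integrated function that renders them while refreshing one scan line. */
typedef struct {
    size_t pending;
    refresh_fn integrated;
} work_queue_t;

/* Stand-ins for integrated functions and for the dedicated busy-wait refresh;
 * the real ones would pump pixels, service the UART and render. */
void stub_hline_refresh(void) {}
void stub_xline_refresh(void) {}
void stub_busy_wait_refresh(void) {}

/* Return the refresh function to run for the next scan line: the first
 * integrated function whose queue holds work, else the busy-wait refresh. */
refresh_fn select_refresh(const work_queue_t *queues, size_t n,
                          refresh_fn busy_wait)
{
    assert(busy_wait != NULL);
    for (size_t q = 0; q < n; q++) {
        if (queues[q].pending > 0 && queues[q].integrated != NULL)
            return queues[q].integrated;
    }
    return busy_wait;
}
```

Either way the display is refreshed on every scan line; the choice only determines whether the idle cycles inside the refresh are spent rendering deferred primitives or waiting.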

4.3.2 Real-Time Workload Analysis
The real-time threads appear in Table 2. Without integration, the UART service code could be delayed for the entire duration of the video refresh ISR, or over 16 ms. In this case most incoming data on the UART will be lost due to preemption by the video refresh. Similarly the effective transmission speed will be a fraction of the speed possible, as the transmit ISR will only be serviced between video refresh processing. With integration of the polling servers the response time drops to 63.5 us, guaranteeing proper UART operation at full speed.

Table 2. STIGLitz Real-Time Thread Set for 20 MHz MCU

| Term | Definition | Video Refresh (highest priority) | UART Receive (115.2 kbaud) | UART Transmit (115.2 kbaud, lowest priority) |
|---|---|---|---|---|
| D | Deadline | ms | 86.8 us | 86.8 us |
| T | Period | 16.72 ms | 86.8 us | 86.8 us |
| C | Worst case computation time (including fine-grain idle time) | 16.7 ms | 4.2 us | 1.6 us |
| I | Interference from higher-priority threads | n/a | 16.72 ms | ms |
| L | Laxity or slack time | n/a | 82.6 us | 85.2 us |
| R | Maximum response time | n/a | ms | ms |
| R' | Maximum response time after integrating polling servers | n/a | 63.5 us | 63.5 us |

4.3.3 Primary Thread Function
The function PumpPixel (Figure 10) is called once per scan line (263 times per video refresh interrupt) to send out a row of video data from the frame buffer. The processing time is 7 cycles per byte, while idle time varies with display resolution, bit depth and processor clock speed. With a 256 pixel wide, two-bit-per-pixel display, a byte must be sent out every 16 cycles, so the idle time is 9 cycles per byte. However, integration may involve loop unrolling, which will reduce the loop overhead (3 cycles) and raise the idle time toward its bound of 12 cycles per byte. The idle time remaining within PumpPixel is thus between 567 (28.35 us) and 756 cycles (37.8 us) for 64 bytes of video data.
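The cycle budget behind these figures is simple arithmetic and can be checked directly. The sketch below is an illustration with hypothetical function names; it assumes the 16-cycle byte period implied by 200 ns pixels, four pixels per byte and a 20 MHz clock. Note that the quoted totals of 567 and 756 cycles equal 9 x 63 and 12 x 63, so they correspond to 63 loop iterations; 64 iterations would give 576 and 768 cycles.

```c
#include <assert.h>

/* CPU cycles available between successive output bytes, given the CPU clock
 * in Hz, the pixel time in nanoseconds and the number of pixels per byte. */
unsigned long cycles_per_byte(unsigned long cpu_hz, unsigned long pixel_ns,
                              unsigned int pixels_per_byte)
{
    assert(pixels_per_byte > 0);
    unsigned long long byte_ns = (unsigned long long)pixel_ns * pixels_per_byte;
    return (unsigned long)(((unsigned long long)cpu_hz * byte_ns) / 1000000000ULL);
}

/* Idle cycles left in the refresh loop: the per-byte budget minus the cycles
 * the loop body itself needs, summed over all iterations. */
unsigned long idle_cycles(unsigned long budget, unsigned long busy,
                          unsigned long iterations)
{
    assert(budget >= busy);
    return (budget - busy) * iterations;
}
```

Here `cycles_per_byte(20000000, 200, 4)` yields the 16-cycle budget; a 7-cycle loop body leaves 9 idle cycles per byte, and removing the 3 cycles of loop overhead by unrolling leaves 12.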
    unsigned char *FBptr;
    unsigned char i;

    void PumpPixel() {
        FBptr = CurFrameBuffer;        /* start of the current frame buffer row */
        i = FrameBufferWidth;          /* number of bytes to send */
        do {
            PORTE = *FBptr++;          /* put the next packed byte on the output port */
        } while (--i);
    }

    PP_Init:
        ldi r26,lo8(framebuffer)
        ldi r27,hi8(framebuffer)
        ldi r4,framebufferwidth
    PP_Loop:
        ld r5,x+          ; 3 cy
        out PORTE,r5      ; 1 cy
        dec r4            ; 1 cy
        brne PP_Loop      ; 1-2 cy

Figure 10. PumpPixel function body in C and assembly code

4.3.4 Secondary Thread Functions

Table 3. Secondary Thread Function Sizes

| Function | Lines of C Code | Compiled Size (.text, bytes) |
|---|---|---|
| DrawCircle | | |
| DrawHorizontalLine | | |
| DrawVerticalLine | | |
| DrawDiagonalLine | | |
| DrawXMajorLine | | |
| DrawYMajorLine | | |
| DrawSprite | | |
| DrawSpriteOVR | | |
| UART_Rx | 3 | 94 |
| UART_Tx | 0 | 46 |

The idle time within a single invocation of PumpPixel is too short for most graphics rendering primitives. For example, rendering an 80-pixel long x-major line takes 435 us. DrawLine is split into five sub-functions based on line type in order to simplify integration. In order to be able to use STI, we partition the frequently executed graphics primitives DrawLine, DrawCircle, DrawSprite and DrawSpriteOVR. The sizes of these functions are presented in Table 3. The UART polling servers, previously described in Table 2, have no fine-grain (internal) idle time and so are secondary threads.

4.3.5 Impact of Register File Partitioning upon Performance
STI requires that all threads share the register file, so high register pressure could lead to more spilling and filling and hence less efficient code. The optimal method for sharing the register file is to perform register allocation [7] after integration, allowing each thread to use registers for live values only. However, this could lead to spill and fill code which would disrupt timing. The alternative we use is to partition the register file using avr-gcc and allocate specific registers to each thread statically before integration. This may reduce performance but eliminates post-integration timing perturbations.

One common characteristic of embedded processors is a small and irregular register set. This complicates the task of register allocation. Much research has been performed attempting to improve register allocation for these architectures, building upon allocation methods for general CPUs and DSPs [5][5][5] [9]. Although the AVR architecture features 32 registers, their use is irregular. Various classes of registers arise due to limitations such as register pairing for word instructions, implicit result registers for certain instructions, pointer registers and immediate operand constraints. This irregularity may lead to reduced performance after register allocation.

For simplicity we divide the AVR registers into three classes for our sensitivity analysis. Pointer registers (r26-r31) can be used as address pointers and operate on immediate operands. Immediate registers (r16-r25) can use immediate operands. Other registers (r0-r15) can be used only for register-register and I/O operations. We examine the sensitivity of the host threads' execution time to the number of registers available by decreasing the number of registers in a given category available to the compiler's register allocator through gcc's -ffixed option. Each of the three host functions is called 100 times with varying parameters by a test harness to render a series of graphics primitives.
Execution time is measured using an on-chip timer/counter, and all interrupts (other than the timer/counter overflow) are disabled. Compilation fails when more than one pair of pointer registers is excluded. Figure 11 shows how host function execution time was affected by the register reductions in three categories. DrawSprite is most sensitive to reduced immediate or other registers, while DrawLine and DrawCircle are not very sensitive to the reductions. Based on these measurements, we choose to exclude eight other registers and two pointer registers when compiling DrawLine and DrawCircle, but exclude only one other register and two pointer registers when compiling DrawSprite. This will accelerate the context switching for DrawLine and DrawCircle (reducing overhead) but not slow down DrawSprite's execution.

[Figure 11 plot: normalized run time versus number of registers excluded, one series per function (DrawSprite, DrawCircle, DrawLine) and register class (immediate, pointer, other).]

Figure 11. Execution times for graphics functions rise as fewer registers are available, reflecting varying register pressures.

4.4 Integration

Thread integration is performed as follows. First, the secondary thread functions are prepared for integration by compiling them with gcc and excluding various registers as described above. Second, Thrint is used to pad away timing jitter from conditionals using nops as well as create time-annotated CDGs to guide integration. The remaining steps are summarized in Figure 9. Third, the polling servers are manually integrated with the PumpPixel function to create PumpPixel_UART. Fourth, segments are manually formed for the secondary thread functions, using the timing information previously derived. Fifth, PumpPixel_UART is manually integrated with each segment of each secondary thread function. The resulting code can now be assembled, linked and downloaded.

5. ANALYSIS

We evaluate the timing accuracy of the integrated code through empirical testing; the video signal successfully drives two different types of television set.
The timing of the packed video byte from the microcontroller is evaluated with a digital sampling oscilloscope to ensure its correctness. Serial communications at 115.2 kbaud are performed using a PC and are correct. The oscilloscope screen shot in Figure 12 demonstrates the proper simultaneous operation of video generation with serial data reception and transmission. We measure performance of the original discrete rendering and new integrated rendering using debugging signals and the oscilloscope.

Figure 12. Oscilloscope shows simultaneous high-speed serial transmission and reception during video signal generation. Traces: Serial receive data, serial transmit data, sync out, video out.

5.1 Rendering performance of integrated code

[Figure 13 chart: MIPS used, shown as stacked bars for No Integration, Int. - Horizontal Line, Int. - Vertical Line, Int. - Diagonal Line and Int. - X-Major Line. Categories: wasted capacity, integrated rendering, UART polling servers, dispatcher with context switching, display refresh and sync, foreground processing.]

Figure 13. Processor utilization for video signal generation by STIGLitz. Without integration, 2 MIPS of idle time is trapped in fragments only 9 cycles long and is unusable for other processing.

Figure 13 shows the processor utilization for our system under various conditions. The leftmost bar (No Integration) shows that before applying STI, the MCU spends most of its time refreshing the display or executing the nops between video output instructions. This leaves only 0.36 MIPS for foreground processing. The four bars to the right demonstrate processor utilization when rendering various types of lines. STI reclaims large amounts of idle time, providing .3 to 4.5 MIPS of line rendering and 2. MIPS of serial communication processing instead. Some time is wasted in the dispatcher or context switching, while some is lost because STI integration is not completely efficient when dealing with unpredictable loops.

[Figure 14 chart: normalized performance (1/time) of discrete and integrated rendering for H-Line, V-Line, D-Line and X-Major-Line.]

Figure 14. Integration leads to speed-ups of 3.99x to 13.5x in time for rendering graphics primitives.

Figure 14 shows the normalized rendering performance for two different design alternatives: discrete rendering and display refresh, and integrated rendering and refresh.
In the first case, the graphics primitives are rendered with discrete (non-integrated) code, which can run only when the video refresh ISR is not active, or during the 1.8% of the total time available. The second case uses integrated code when possible to render graphics primitives. Integration speeds up rendering time by 3.99x to 13.54x over the discrete case. The variation in speed-up comes from the amount of rendering work performed per segment and the number of segments needed per secondary thread. Recall that each loop with an unknown iteration count requires at least one segment; this is very inefficient if the loop's execution time is much less than the idle time of the segment. The horizontal, vertical and diagonal functions all contain a single such loop with a single level of conditionals, allowing efficient integration and only three segments. The x-major function has a much more complex CDG and contains four such loops, one doubly nested. These loops require the formation of nine segments, wasting much of the available idle time.

5.2 Code memory expansion

Table 4 shows how code sizes increase by a factor of 3x to 5x after integration. Various factors contribute to the increase, including padding, loop unrolling and splitting, and code replication into conditionals. Although these code size increases are significant, they apply only to integrated functions, and may be an acceptable price to pay given the dramatic performance improvement.

Table 4. Sizes of Original and Integrated Functions
Columns: Function, Original Size (bytes), Padded Size, Integrated Size, Code Expansion Ratio
Functions: DrawHorizontalLine, DrawVerticalLine, DrawDiagonalLine, DrawXMajorLine

6. CONCLUSIONS AND FUTURE WORK

This paper introduces two new methods which enable STI to be used for hard real-time systems with urgent threads and long application work. First, previous STI work did not allow maximum response times to fall below the duration of the integrated threads; this work presents polling servers and methods of finding the resulting maximum response time. Second, previous work required that secondary (host) functions be no longer than the worst-case idle time + slack time for the primary (guest) function. This work introduces segmentation methods which remove this restriction. Our methods are demonstrated on a software video generation application and enable simultaneous high-speed serial communication, video refresh and graphics rendering. We have built and tested the system and performance improves by a factor of about 4x to over 13.5x.

There are various directions for improvement of these techniques. Better segment formation methods could offer more consistent performance improvement from STI by better using available idle time when loops have unknown iteration counts. It may be possible to reduce the code size expansion through schedule-sensitive abstraction of common code segments. Implementing the new segment-based transformations automatically would accelerate the integration process and simplify debugging.

7. ACKNOWLEDGEMENTS

This work was funded in part by NSF CAREER award CCR. The authors thank the various students who contributed to this project: Robert Morrison, Barret Krupnow, Jimmy Hill, Craig Nowell, and Paul Lee. In addition, thanks go to Atmel and BITS for the kind donations of development tools.

8. REFERENCES

[1] Atmel Corp., ATmega128: 8-Bit AVR Microcontroller with 128K Bytes In-System Programmable Flash
[2] Audsley, N. et al. Applying New Scheduling Theory to Static Priority Preemptive Scheduling, Software Engineering Journal, 8(5):284-292
[3] avr-gcc
[4] Baker, T. P. and Shaw, A. The Cyclic Executive Model and Ada, Real-Time Systems, 1(1):7-25, 1989
[5] Barag, D., Pande, S. and Agarwal, D. P. A Framework for Enhancing Code Quality in Limited Register Set Embedded Processors, ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems (LCTES) 2000, June
[6] Barthelmann, V. Inter-Task Register-Allocation for Static Operating Systems, ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems / Software and Compilers for Embedded Systems (LCTES 02-SCOPES 02), June
[7] Chaitin, G. J. Register Allocation and Spilling via Graph Coloring, ACM SIGPLAN Notices, 17(6):98-105
[8] Dean, A. G. and Grzybowski, R. R. A High-Temperature Embedded Network Interface using Software Thread Integration, Second Workshop on Compiler and Architectural Support for Embedded Systems, Washington, DC, October
[9] Dean, A. G. Compiling for Concurrency: Planning and Performing Software Thread Integration, 23rd IEEE Real-Time Systems Symposium, December 3-5, 2002, Austin, TX
[10] Dean, A. and Shen, J. P. System-Level Issues for Software Thread Integration: Guest Triggering and Host Selection, 20th IEEE Real-Time Systems Symposium, Phoenix, Arizona, December 1-3, 1999
[11] Dean, A. and Shen, J. P. Techniques for Software Thread Integration in Real-Time Embedded Systems, 19th IEEE Real-Time Systems Symposium, Madrid, Spain, December 2-4, 1998
[12] Dean, A. and Shen, J. P. Hardware to Software Migration with Real-Time Thread Integration, 24th EuroMicro Conference, Västerås, Sweden, August 25-27, 1998
[13] Dean, A. STIGLitz Project Manual, CESR Technical Report, June
[14] Ferrante, J., Ottenstein, K. J. and Warren, J. D. The Program Dependence Graph and Its Use in Optimization, ACM Transactions on Programming Languages, July 1987, 9(3)
[15] Kong, T. and Wilken, K. Precise Register Allocation for Irregular Architectures, 31st International Microarchitecture Conference (MICRO-31), December 1998
[16] Krupnow, B., Hill, J., Nowell, C. and Lee, P. Video Software Thread Integration, CESR Technical Report, December
[17] Muchnick, S. S. Advanced Compiler Design and Implementation, Morgan Kaufmann Publishers, 1997
[18] Nisley, E. Rising Tides, Dr. Dobb's Journal, #346, March 2003
[19] Scholz, B. and Eckstein, E. Register Allocation for Irregular Architectures, ACM SIGPLAN Workshop on Languages, Compilers and Tools for Embedded Systems / Software and Compilers for Embedded Systems (LCTES 02-SCOPES 02), June
[20] Strosnider, J. K., Lehoczky, J. P. and Sha, L. The Deferrable Server Algorithm for Enhanced Aperiodic Responsiveness in Hard Real-Time Environments, IEEE Transactions on Computers, 44(1), January 1995, pp. 73-91
