A Predictable Execution Model for COTS-based Embedded Systems

2011 17th IEEE Real-Time and Embedded Technology and Applications Symposium

Rodolfo Pellizzoni, Emiliano Betti, Stanley Bak, Gang Yao, John Criswell, Marco Caccamo and Russell Kegley

University of Waterloo, Canada, rpellizz@uwaterloo.ca
University of Illinois at Urbana-Champaign, USA, {ebetti, sbak2, criswell, mcaccamo}@illinois.edu
Scuola Superiore Sant'Anna, Italy, g.yao@sssup.it
Lockheed Martin Corp., USA, russell.b.kegley@lmco.com

Abstract

Building safety-critical real-time systems out of inexpensive, non-real-time, COTS components is challenging. Although COTS components generally offer high performance, they can occasionally incur significant timing delays. To prevent this, we propose controlling the operating point of each shared resource (like the cache, memory, and interconnection buses) to maintain it below its saturation limit. This is necessary because the low-level arbiters of these shared resources are not typically designed to provide real-time guarantees. In this work, we introduce a novel system execution model, the PRedictable Execution Model (PREM), which, in contrast to the standard COTS execution model, coschedules at a high level all active components in the system, such as CPU cores and I/O peripherals. In order to permit predictable, system-wide execution, we argue that real-time embedded applications should be compiled according to a new set of rules dictated by PREM. To experimentally validate our theory, we developed a COTS-based PREM testbed and modified the LLVM Compiler Infrastructure to produce PREM-compatible executables.

I. INTRODUCTION

Real-time embedded systems are increasingly being built using commercial-off-the-shelf (COTS) components such as mass-produced CPUs, peripherals and buses. Overall performance of mass-produced components is often significantly higher than custom-made systems. For example, a PCI Express bus [16] can transfer data three orders of magnitude faster than the real-time SAFEbus [9].
Unfortunately, COTS components are typically designed with little or no attention to worst-case timing guarantees required by real-time systems. Modern COTS-based embedded systems include multiple active components (such as CPU cores and I/O peripherals) that can independently initiate access to shared resources, which, in the worst case, cause contention leading to timing degradation.

Computing precise bounds on timing delays due to contention is difficult. Even though some existing approaches [18], [24] can produce safe upper bounds, they need to be very pessimistic due to the unpredictable behavior of arbiters of physically shared resources (like caches, memories, and buses). As a motivating example, we have previously shown that the computation time of a task can increase linearly with the number of suffered cache misses due to contention for access to main memory [20]. In a system with three active components, a task's worst-case computation time can nearly triple.

To exploit the high average performance of COTS components without occasionally experiencing long delays suffered by real-time tasks, we need to control the operating point of each shared resource and maintain it below saturation limits. This work aims at showing that this is indeed possible by carefully rethinking the execution model of real-time tasks and by enforcing a high-level coscheduling mechanism among all active components in the system. Briefly, the key idea is to coschedule active components so that contention for accessing shared resources is implicitly resolved by the high-level coscheduler without relying on low-level, non-real-time arbiters. In particular, in this work we focus on contention at the level of bus communication and main memory.

Several challenges had to be overcome to realize the PRedictable Execution Model (PREM):

I/O peripherals with DMA master capabilities contend for physically shared resources, including memory and buses, in an unpredictable manner.
To address this problem, we expand on our previous work [1] and introduce hardware to put the I/O subsystem under the discipline of real-time scheduling.

The bus and memory access patterns of tasks executed on COTS CPUs can exhibit high variance. In particular, predicting a precise pattern of cache fetches in main memory is very difficult, forcing the designer to make very pessimistic assumptions when performing schedulability analysis. To address this problem, PREM uses a novel program execution model with three main features: (1) jobs are divided into a sequence of non-preemptive scheduling intervals; (2) some of these scheduling intervals (named predictable intervals) are executed predictably and without cache misses by prefetching all required data at the beginning of the interval itself; (3) the execution time of predictable intervals is kept constant by monitoring CPU time counters at run-time.

Low-level COTS arbiters are usually designed to achieve fairness instead of real-time performance. To address this problem, we enforce a coscheduling mechanism that serializes arbitration requests of active components (CPU cores and I/O peripherals). During the execution of a task's predictable interval, a scheduled peripheral can
access the bus and memory without experiencing delays due to cache misses caused by the task's execution.

Our PRedictable Execution Model (PREM) can be used with a high-level programming language like C by setting some programming guidelines and by using a modified compiler to generate predictable executables. Based on annotations provided by the programmer, the compiler generates programs which perform cache prefetching and enforce a constant execution time in each predictable interval. The generated executable becomes highly predictable in its memory access behavior, and when run with the rest of the PREM system, shows significantly reduced worst-case execution time.

Some work is needed to code an application according to our programming guidelines. However, we argue that our requirements are not significantly stricter than those of state-of-the-art timing analysis, which requires detailed modeling of both the software application and hardware platform to produce tight bounds. While our code annotations are still dependent on architectural features such as the size and associativity of the cache, task execution under PREM becomes independent from low-level bus and memory arbiters, simplifying system design and verification.

The rest of the paper is organized as follows. Section II discusses related work. Section III describes our main contribution: a co-scheduling mechanism that schedules I/O interrupt handlers, task memory accesses and I/O peripheral data transfers in such a way that access to shared resources is serialized, achieving zero or negligible contention during memory accesses. Sections IV and V present our solutions to hardware architecture and code organization challenges that make creating predictable real-time tasks difficult. Section VI presents our schedulability analysis. In Section VII, we detail our prototype testbed, including our compiler implementation based on the LLVM Compiler Infrastructure [10], and provide an experimental evaluation. We conclude with future work in Section VIII.

II. RELATED WORK

Prior real-time research has proposed several solutions to address different sources of unpredictability in COTS components, including real-time handling of peripheral drivers, real-time compilation, and analysis of contention for memory and buses. For peripheral drivers, Facchinetti et al. [5] proposed using a non-preemptive interrupt server to better support the reuse of legacy drivers. Additionally, analysis can be done to model worst-case temporal interference caused by device drivers [12]. For real-time compilation, a tight coupling between compiler and worst-case execution time (WCET) analyzer can optimize a program's WCET [6]. Alternatively, a compiler-based approach can provide predictable paging [21]. For analysis of contention for memory and buses, existing timing analysis techniques can analyze the maximum delay caused by contention for a shared memory or bus under various access models [18], [24]. All these works attempt to analyze or control a single resource and obtain safe bounds that are often highly pessimistic. Instead, PREM is based on a global coschedule of all relevant system resources.

[Fig. 1. Real-Time I/O Management System.]

Instead of using COTS components, other researchers have discussed new architectural solutions that can greatly increase system predictability by removing significant sources of interference. Instead of a standard cache-based architecture, a real-time scratchpad architecture can be used to provide predictable access time to main memory [28]. The Precision Time (PRET) machine [4] promises to simultaneously deliver high computational performance together with cycle-accurate estimation of program execution time. Unfortunately, these solutions require extensive redesign of existing components (in particular, the CPU).
While our PREM execution model borrows some ideas from these works, it is compatible with available COTS platforms: all existing components can be reused, although some new devices must be connected to the motherboard and the COTS peripherals. This approach allows PREM to leverage the economy of scale of COTS systems and to support the progressive migration of legacy systems.

III. SYSTEM MODEL

We consider a typical COTS-based real-time embedded system composed of a CPU, main memory and multiple DMA peripherals. While this paper restricts our discussion to single-core systems with no hardware multithreading, we believe that our predictable execution model is also applicable to multicore systems. The CPU can implement one or more cache levels. We focus on the last cache level, which typically employs a write-back policy. Whenever a task suffers a cache miss in the last level, the cache controller must access main memory to fetch the newly referenced cache line and possibly write back a replaced cache line. Peripherals are connected to the system through COTS interconnects such as PCI or PCIe [16]. DMA peripherals can autonomously initiate data transfers on the interconnect. We assume that all data transfers target main memory, that is, data is always transferred between the peripheral's internal buffers and main memory. Therefore, we can treat main memory as a single resource shared by all peripherals and by the cache controller¹.

¹ Note that while using a dual-port memory could potentially reduce contention between a single core and peripheral interconnect, implementing a large external memory as a dual-port SRAM is impractical. Furthermore, such a solution does not scale to multiple interconnects and/or cores.

[Fig. 2. Predictable Interval with constant execution time.]

The CPU executes a set of $N$ real-time periodic tasks $\Gamma = \{\tau_1, \ldots, \tau_N\}$. Each task can use one or more peripherals to transfer input or output data to or from main memory. We model all peripheral activities as a set of $M$ periodic I/O flows $\Gamma^{I/O} = \{\tau^{I/O}_1, \ldots, \tau^{I/O}_M\}$ with assigned timing reservations, and we want to schedule them in such a way that only one flow is transferred at a time. Unfortunately, COTS peripherals do not typically conform to the described model. As an example, consider a task receiving input data from a Network Interface Card (NIC). Delays in the network could easily cause a burst of packets to arrive at the NIC. Since a high-performance COTS NIC is designed to autonomously transfer incoming packets to main memory as soon as possible, the NIC could potentially require memory access for significantly longer than its expected periodic reservation.

In [1], [2], we introduced a solution to this problem consisting of a real-time I/O management scheme. A diagram of our proposed architecture is depicted in Figure 1. A minimally intrusive hardware device, called the real-time bridge, is interposed between each peripheral and the rest of the system. The real-time bridge buffers all incoming traffic from the peripheral and delivers it predictably to main memory according to a global I/O schedule. Outgoing traffic is also retrieved from main memory in a predictable fashion. Bounds on the buffer sizes necessary to avoid data loss can be computed [1]. Our implemented real-time bridges are fully compatible with the PCIe standard and require no modification to either the system chipset or the controlled peripherals. Furthermore, real-time bridges provide traffic virtualization and isolation: a single peripheral can support multiple I/O flows with different timing reservations, each servicing a different task [2].
To maximize responsiveness and avoid CPU overhead, the I/O schedule is computed by a separate peripheral scheduler, a hardware device based on the previously developed reservation controller [1], which controls all real-time bridges.

Notice that our previously developed I/O management system [1] does not solve the problem of memory interference between peripherals and CPU tasks. When a typical real-time task is executed on a COTS CPU, cache misses are unpredictable, making it difficult to avoid low-level contention for access to main memory. To overcome this issue, we propose a set of compiler and OS techniques that enable us to predictably schedule all cache misses during a given portion of a task execution. The code for each task $\tau_i$ is divided into a set of $N_i$ scheduling intervals $\{s_{i,1}, \ldots, s_{i,N_i}\}$, which are executed sequentially at run-time. The timing requirements of $\tau_i$ can be expressed by a tuple $\{\{e_{i,1}, \ldots, e_{i,N_i}\}, p_i, D_i\}$, where $p_i$, $D_i$ are the period and relative deadline of the task, with $D_i \le p_i$, and $e_{i,j}$ is the maximum execution time of $s_{i,j}$, assuming that the interval runs in isolation with no memory interference. A job can only be preempted by a higher-priority job at the end of a scheduling interval. This ensures that the cache content cannot be altered by the preempting job during the execution of an interval. We classify the scheduling intervals into compatible intervals and predictable intervals.

Compatible intervals are compiled and executed without any special provisions (they are backwards compatible). Cache misses can happen at any time during these intervals. The task code is allowed to perform OS system calls, but blocking calls must have bounded blocking time. Furthermore, the task can be preempted by interrupt handlers of associated peripherals. We assume that the maximum execution time $e_{i,j}$ for a compatible interval is computed based on static analysis techniques. However, to reduce the pessimism in the analysis, we prohibit peripheral traffic from being transmitted during a compatible interval.
Ideally, there should be a small number of compatible intervals which are kept as short as possible.

Predictable intervals are specially compiled to execute according to the PREM model shown in Figure 2: they are divided into two different phases and exhibit three main properties. First, during the initial memory phase, the CPU accesses main memory to perform a set of cache line fetches and replacements. At the end of the memory phase, all cache lines required during the predictable interval are available in last level cache. Second, during the following execution phase, the task performs useful computation without suffering any last level cache misses. Predictable intervals do not contain any system calls and cannot be preempted by interrupt handlers. Hence, the CPU does not perform any external main memory access during the execution phase. This property allows peripheral traffic to be scheduled during the execution phase of a predictable interval without causing any contention for access to main memory. Third, at run-time, we force the time length of a predictable interval to always be equal to $e_{i,j}$. Let $e^{mem}_{i,j}$ be the maximum time required to complete the memory phase and $e^{exec}_{i,j}$ the maximum time required to complete the execution phase. Then offline we set $e_{i,j} = e^{mem}_{i,j} + e^{exec}_{i,j}$, and at run-time, if the predictable interval completes in less than $e_{i,j}$, we busy-wait until $e_{i,j}$ time units have elapsed since the beginning of the interval. This property ensures that peripherals can transmit for at least $e^{exec}_{i,j}$ time units in a time window of length $e_{i,j}$. If we did not enforce a constant interval length, then the execution phase could potentially complete in zero time, resulting in no peripheral traffic being sent during that predictable interval. In Section VI we will formally show how we can provide hard real-time guarantees to I/O flows based on the constant interval length property.

[Fig. 3. Example System-Level Predictable Schedule.]

[Fig. 4. Cache organization with one memory region.]

Figure 3 shows a concrete example of a system-level predictable schedule for a task set with two tasks $\tau_1$, $\tau_2$ together with two I/O flows $\tau^{I/O}_1$, $\tau^{I/O}_2$ which service $\tau_1$ and $\tau_2$, respectively. Both tasks and I/O flows are scheduled according to fixed priority, with $\tau_1$ having higher priority than $\tau_2$ and $\tau^{I/O}_1$ higher priority than $\tau^{I/O}_2$. We set $D_i = p_i$ and assign to each I/O flow the same period and deadline as its serviced task and a transmission time equal to 4 time units. As shown in Figure 3 for task $\tau_1$, this means that the input data for a given job is transmitted in the period before the job is executed, and the output data is transmitted in the period after. Task $\tau_1$ has a single predictable interval of length $e_{1,2} = 4$ while $\tau_2$ has two predictable intervals of lengths $e_{2,2} = 4$ and $e_{2,3} = 3$. The first and last interval of both $\tau_1$ and $\tau_2$ are special compatible intervals. These intervals are needed to execute the associated peripheral driver (including interrupt handlers) and set up the reception and transmission buffers in main memory (i.e., read and write system calls). More details are provided in Section VII. I/O flows can be scheduled both during execution phases and while the CPU is idle. As we will show in Section VI, the described scheme can be modeled as a hierarchical scheduling system [26], where the CPU schedule of predictable intervals supplies available transmission time to I/O flows. Therefore, existing tests can be reused to check the schedulability of I/O flows. However, due to the characteristics of predictable intervals, a more complex analysis is required to derive the supply function.
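
The constant-length rule for predictable intervals described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `run_predictable_interval`, its parameters, and the use of `time.perf_counter` as the time source are assumptions made for the example (the paper monitors CPU time counters instead).

```python
import time
from typing import Callable


def run_predictable_interval(
    memory_phase: Callable[[], None],
    execution_phase: Callable[[], None],
    e_mem: float,
    e_exec: float,
    now: Callable[[], float] = time.perf_counter,
) -> float:
    """Run both phases, then pad the interval to e_mem + e_exec.

    Hypothetical helper: e_mem and e_exec stand for the offline bounds on
    the memory and execution phases. Returns the measured interval length.
    """
    e_total = e_mem + e_exec      # constant interval length fixed offline
    start = now()
    memory_phase()                # caller-supplied memory phase (prefetching)
    execution_phase()             # caller-supplied execution phase
    while now() - start < e_total:
        pass                      # busy-wait until e_total has elapsed
    return now() - start
```

Calling it with phases that finish early, for example `run_predictable_interval(lambda: None, lambda: None, 0.001, 0.002)`, returns an elapsed time no shorter than `e_mem + e_exec`. The sketch does not handle a phase overrunning its bound, which the offline analysis is meant to rule out.
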
Executing a task according to the PREM model reduces its overall execution time because PREM ensures that peripheral traffic in main memory cannot contend with and stall cache fetches; in the worst case, the peripheral-induced delay can be very significant. A possible issue in our approach is that by deciding which cache lines to prefetch during the memory phase, we might need to prefetch more cache lines than the ones that are actually used at run-time in the execution phase. Our experimental evaluation in Section VII shows that for several embedded benchmarks, this increase in memory load is not significant. A second possible issue is that by blocking I/O flows during compatible intervals, we risk reducing peripheral bandwidth significantly. However, in Section VII, we show that for a significant category of benchmarks, the execution time of predictable intervals dominates the total length of the job.

IV. ARCHITECTURAL CONSTRAINTS AND SOLUTIONS

Predictable intervals are executed in a radically different way compared to the speculative execution model that COTS components are typically designed to support. In this section, we detail the challenges and solutions to implement the PREM execution model on top of a COTS architecture.

Caching and Prefetch: Our general strategy to implement the memory phase consists of two steps: (1) we determine the complete set of memory regions that are accessed during the interval. Each region is a contiguous area in virtual memory. In general, its start address can only be determined at run-time, but its size $A$ is known at compile time. (2) During the memory phase, we prefetch all cache lines that contain instructions and data for required regions; most instruction sets include a prefetch instruction that can be used to load specific cache lines in last level cache. Step (1) will be detailed in Section V. Step (2) can be successful only if there is no cache self-eviction, that is, prefetching a cache line never evicts another line loaded during the same memory phase. In the remainder of this subsection, we describe self-eviction prevention.
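
Step (2) can be illustrated with a short sketch that lists the line-aligned addresses a memory phase would have to touch for one region. The function name and the example numbers are assumptions chosen for illustration; a real memory phase would issue the target's prefetch instruction for each address rather than build a list.

```python
from typing import List


def lines_to_prefetch(start: int, size: int, line_size: int) -> List[int]:
    """Return the line-aligned addresses covering [start, start + size)."""
    if size <= 0 or line_size <= 0:
        raise ValueError("size and line_size must be positive")
    first = (start // line_size) * line_size               # line holding the first byte
    last = ((start + size - 1) // line_size) * line_size   # line holding the last byte
    return list(range(first, last + line_size, line_size))
```

For a 15-byte region starting at address 3 with 4-byte lines, the function returns `[0, 4, 8, 12, 16]`: five lines, because only one byte of the region falls in its first line.
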
Most COTS CPUs implement the last-level cache as a write-back, $N$-way set associative cache. Data is loaded into and written back from cache in units known as cache lines. Let $B$ be the total size of the cache and $L$ be the size of each cache line in bytes. An $N$-way set associative cache is divided into $N$ cache ways, each with a size of $W = B/N$ bytes. An associative set is the set of all cache lines, one for each of the $N$ cache ways, which have the same index in cache; there are $W/L$ associative sets. Last level cache is typically physically tagged and physically indexed, meaning that cache lines are accessed based on physical memory addresses only. Each cache line in main memory is associated with a single associative set based on its index in the physical address space, but can be loaded into any of the $N$ ways; the specific way is chosen at run-time by the cache replacement policy. We also assume that last level cache is not exclusive, that is, when a cache line is copied to a higher cache level, it is not removed
from the last level. Figure 4 shows an example of indexing in main memory where $L = 4$, $W = 16$ (parameters are chosen to simplify the discussion and are not representative of typical systems).

The main idea behind our conflict analysis is as follows: we compute the maximum number of entries in each associative set that are required to hold the cache lines prefetched for all memory regions in a scheduling interval. Based on the cache replacement policy, we then derive a safe lower bound on the number of entries that can be prefetched in an associative set without causing any self-eviction.

Consider a memory region of size $A$. As shown in Figure 4 for a region with $A = 15$, $K = 5$, the worst-case number of occupied cache lines is produced when the region uses a single byte in its first cache line. The remaining $A - 1$ bytes will then occupy $\lceil (A-1)/L \rceil$ other cache lines. Hence, the region requires at most $K = 1 + \lceil (A-1)/L \rceil$ cache lines. Assume now that virtual memory addresses coincide with physical addresses. Since the region is contiguous and there are $W/L$ cache lines in each way, the maximum number of entries used in any associative set by the region is $\lceil K/(W/L) \rceil$. For example, the region in Figure 4 requires two entries in the set with index 0. We then derive the maximum number of entries for the entire interval by summing the entries required for each memory region.

Unfortunately, this is not generally true if the system employs paged virtual memory. If the page size $P$ is smaller than the size $W$ of each way, the index of each cache line inside the cache way is different for virtual and physical addresses. In the example of Figure 4 with $P = 8$, the number of entries for the memory region is increased from 2 to 3. We consider two solutions: 1) if the system supports it, we can select a page size multiple of $W$ just for our specific process. This solution, which we employed in our implementation, solves the problem because the index in cache for virtual and physical addresses is the same no matter the page allocation. 2) We use a modified page allocation algorithm in the OS.
Note that a suitable allocation algorithm could decrease the required number of associative entries by controlling the allocation in physical memory of multiple regions. We will pursue this solution in future work.

Due to space constraints, a thorough discussion of cache replacement policies is provided in [17]. Let $Q$ be the maximum number of entries required by the scheduling interval in any $N$-way associative set. Based on the results in [8], [23], in our companion technical report [17] we show the following:

Theorem 1. A memory phase will not suffer any cache self-eviction if $Q$ is at most equal to:
- $N$ for FIFO or LRU replacement policy;
- $\log_2 N + 1$ for pseudo-LRU replacement policy;
- 1 for random replacement policy.

Computing Phase Length: We assume that traditional static analysis can be employed to derive bounds on the maximum time required for an execution phase $e^{exec}_{i,j}$, as well as on the execution time $e_{i,j}$ for a compatible interval. Obtaining tight bounds on WCET can be challenging in COTS cores exploiting architectural features such as deep pipelines and intermediate cache levels. We believe that the problem of intra-core execution analysis is fundamentally orthogonal to our approach; the role of PREM is to simplify timing analysis by making cache miss patterns predictable. Research in static timing analysis for complex core architectures is very active. For example, analyses that derive patterns of cache misses for split level 1 data and instruction caches have been proposed in [14], [22], while the work in [13] derives tight execution bounds by analyzing the pipeline status.

The length $e^{mem}_{i,j}$ of a memory phase depends on the time required to prefetch all accessed memory regions. An analysis to compute upper bounds for read/write operations using a COTS DRAM memory controller is detailed in [15]. Note that DRAM access times are significantly dependent on both space and time locality of data in main memory. Therefore, the current analysis must make pessimistic guesses.
However, since the number and relative virtual addresses of prefetched cache lines are known under PREM, the analysis could potentially be extended to consider features such as burst read and parallel bank access; we plan to investigate this direction in future work. Similarly, we expect that static analysis will significantly overestimate the execution time of compatible intervals; however, as previously mentioned, our experimental results in Section VII show that compatible intervals are typically short.

An important note concerns the write-back mechanism of last level cache. Since the task can be preempted at the boundary of scheduling intervals, the cache state is unknown at the beginning of the memory phase. Hence, in the analysis we must account for the fact that each prefetch could require both a cache line fetch and a replacement. An alternative solution would be to add a second memory phase after each execution phase and invalidate the whole cache, forcing write-back of all dirty cache lines. We have not pursued this solution because our implemented testbed does not allow full-cache invalidation.

Finally, for systems implementing paged virtual memory, we employ the following three assumptions: 1) the CPU supports hardware Translation Lookaside Buffer (TLB) management; 2) all pages used by predictable intervals are locked in main memory; 3) the TLB is large enough to contain all page entries for a predictable interval without suffering any conflict. Under such assumptions, each page used in a predictable interval can cause at most one TLB miss during the memory phase, which requires a number of fetches in main memory equal to at most the number of levels of the page table.

Scheduling synchronization: In our model, peripherals are only allowed to transmit during a predictable interval's execution phase or while the CPU is idle. To compute the peripheral schedule, the peripheral scheduler must thus know the status of the CPU schedule.
Synchronization is achieved by connecting the peripheral scheduler to a peripheral interconnection as shown in Figure 1; scheduling messages containing the amount of consecutive time in which peripherals are allowed to transmit are then sent by either a task or the OS to the peripheral scheduler. In particular, at the end of each memory phase the task sends to the peripheral scheduler the remaining amount of time until the end of the current predictable interval.
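
As a rough illustration of the message sent at the end of a memory phase, the sketch below computes the transmission budget a task would report. The function name, the time unit, and the `ValueError` raised on an overrun are assumptions made for the example; the paper does not specify a message format.

```python
def transmit_budget(e_total: float, interval_start: float, memory_phase_end: float) -> float:
    """Time left until the end of the current predictable interval.

    e_total is the constant interval length; both timestamps use the same
    clock and unit. The result is the value a scheduling message would carry.
    """
    used = memory_phase_end - interval_start   # actual memory phase duration
    if used < 0 or used > e_total:
        raise ValueError("memory phase must start and end inside the interval")
    return e_total - used
```

With an interval of length 4 that starts at time 10 and whose memory phase ends at time 11, the budget is 3 time units. As long as the memory phase stays within its offline bound, the budget is never smaller than the execution-phase bound.
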

Note that to simplify the discussion we do not consider issues of message propagation delay² and clock drift in this paper, but the described scheme together with the schedulability analysis of Section VI could be easily modified by suitably reducing the time allowed for peripheral traffic compared to the value contained in the scheduling message.

² In our implementation we measured an upper bound to the message propagation time of 1µs, while we envision scheduling intervals with a length of [...] µs.

Finally, to avoid executing interrupt handlers during predictable intervals, a peripheral should only raise interrupts to the CPU during compatible intervals of its serviced task. As we describe in Section VII, in our I/O management scheme, peripherals raise interrupts through their assigned real-time bridge. Since the peripheral scheduler communicates with each real-time bridge, it is used to block interrupt propagation outside the desired compatible intervals. Scheduling messages are again used to notify the peripheral scheduler of the length of interrupt-enabled intervals. Note that blocking real-time bridge interrupts to the CPU will not cause any loss of input data because the real-time bridge is capable of independently acknowledging the peripheral and storing all incoming data in the bridge local buffer.

V. PROGRAMMING MODEL

Our system supports existing applications written in standard high-level languages such as C. Unmodified code can be executed within one or more compatible intervals. We extend the source language with a predictable block construct that defines a single-entry, single-exit region of code that should execute as a single predictable interval. In C, we define the construct as the keyword predictable followed by a compound block of statements. During compilation, the PREM real-time compiler transforms code within a predictable block so that it first prefetches any data and code required in the predictable block into the cache; it also adds a busy-wait loop at the end of the predictable block to ensure that every execution of the predictable interval takes the same amount of time. This ensures that no cache misses occur during the execution phase and that the interval itself has a constant execution time.

In order to create a predictable interval, the programmer should first profile the code to determine the portions in which the task spends most of its execution time. The programmer should then perform the cache and architecture analysis as presented in Section IV for these portions of hot code and then, based on the results and task information, place portions of the code into predictable blocks. Since it is difficult to compile arbitrary code so that it does not induce cache misses, there are several constraints that the compiler must place on code within predictable blocks. These constraints are:

1) Predictable code blocks should only access memory objects, arrays, and scalar values that are capable of being referenced at the entry of the predictable block. There should be no traversal of link-based data structures (e.g., a binary tree) since the compiler cannot infer the memory that would need to be prefetched.
2) The code can use data structures, in particular arrays, that are allocated outside the predictable code block³. For global or heap allocated arrays, the programmer must specify the first and last address that is accessed within the predictable code block and (if necessary) the maximum difference between these two addresses if the compiler cannot infer this information via static analysis. This must be done for the code in the predictable block and for code within functions that are called (either directly or transitively) by the code in the predictable block. The compiler needs this information to add correct prefetching code to the predictable code block.
3) Code within a predictable block should not contain system calls, calls to heap allocators, or stack allocations within loops. System calls enter the kernel and execute code not generated by the PREM compiler, and the heap allocator executes code that can generate cache misses.
Stack allocations cannot occur in loops because the compiler must insert code to prefetch the stack frame at the beginning of a predictable interval; stack allocations in loops make the stack frame size unpredictable.
4) Code within a predictable block should not make recursive function calls. Recursive function calls can grow the stack frame to an unpredictable size, making it impossible for the compiler to prefetch the stack frame.
5) Code within a predictable block may use both direct and indirect function calls. The compiler can use points-to analysis combined with call-graph construction [11] to find all the targets of indirect function calls. Since points-to analysis may yield conservative results, the compiler may find more function targets than are actually possible. If too many function targets are found (making construction of a predictable interval impossible), the compiler may ask the programmer to use annotations to specify the valid function targets.

Footnote 3: Note that heap objects must have been allocated during a compatible interval.

The compiler employs several transforms to ensure that code within predictable blocks does not cause cache misses during the execution phase. First, the compiler inlines all functions called (either directly or transitively) by the predictable block into the predictable block (link-time optimization can inline functions across compilation units). This ensures that all code used by the predictable block is contiguous within virtual memory and uses a single stack frame. Second, the compiler inserts code at the beginning of the predictable block to prefetch the code and data needed to execute the interval; this prefetching code is the memory phase of the predictable interval. Based on the described constraints, this includes three types of contiguous memory regions: (1) the code for the function; (2) the actual parameters passed to the function and the stack frame (which contains local variables and register spill slots); and (3) the global and heap memory objects accessed within the predictable block. Third, the compiler inserts code to send scheduling messages to the peripheral scheduler as described in Section IV. Finally, the compiler emits code at the end of the predictable block to enforce its constant, predictable execution time.

A. Porting Legacy Applications

Converting existing code to predictable intervals clearly requires some work. In particular, adding annotations to correctly split the code into predictable blocks requires some knowledge of cache parameters. While this might seem an undue limit on code portability, the type of data-intensive, real-time applications that we target in this work are typically already optimized based on hardware architecture. In this sense, we believe that the benefits of a more predictable behavior for program hot-spots, decoupled from the low-level details of bus and memory arbiters, outweigh the burden of code annotations. Additionally, we can design our compiler to help the programmer create predictable code blocks. The compiler can verify when the aforementioned restrictions are violated in a predictable block (e.g., it can use static analysis to find irregular data structure usage or use of system calls) and issue warnings to aid the programmer in correcting them. Furthermore, given cache size information, the compiler could verify that all prefetched memory regions fit in last level cache and issue warnings otherwise.

A second possible concern regards code constraints. In general, we believe that our constraints are not significantly more restrictive than those imposed by state-of-the-art static timing analysis. Typical real-time applications avoid recursive calls, stack or heap allocation within loops, and indirect function calls that are not decidable at compile time. Furthermore, we are not aware of any timing analysis tool that can provide WCET bounds if Constraints 3 or 4 are violated. Constraints 1-2 are more severe because they prevent using complex pointer-based data structures. However, existing code that is too complex to be compiled into predictable intervals can still be executed inside compatible intervals.
An alternative solution is presented in [25]: the cache is statically partitioned, for example using the OS page allocator, into an area for predictable code and data and a second area for complex, unpredictable data structures. During a predictable interval, the predictable area is prefetched while unpredictable data is handled by the caching logic. Static analysis can then be used to derive a (pessimistic) upper bound to the number of cache misses in the unpredictable area; ideally, most of the data would be placed in the predictable area, resulting in a very small number of unpredictable misses. While we do not discuss it in detail, the PREM model could be amended to tolerate a small number of cache misses in the predictable phase by using the analysis in [18], [19] to compute the (limited) contention delay on both the task and I/O flows.

VI. SCHEDULABILITY ANALYSIS

PREM allows us to enforce strict timing guarantees for both CPU tasks and their associated I/O flows. By setting timing parameters as shown in Figure 3, the task schedule becomes independent of I/O scheduling. Therefore, task schedulability can be checked using available schedulability tests. As an example, assume that tasks are scheduled according to fixed priority scheduling as in Figure 3. For a task $\tau_i$, let $e_i = \sum_{j=1}^{N_i} e_{i,j}$ be the sum of the execution times of its scheduling intervals, or equivalently, the execution time of the whole task. Furthermore, let $hp_i \subset \Gamma$ be the set of higher priority tasks than $\tau_i$, and $lp_i$ the set of lower priority tasks. Since scheduling intervals are executed non-preemptively, $\tau_i$ can suffer a blocking time due to lower priority tasks of at most $B_i = \max_{\tau_l \in lp_i} \max_{j=1 \ldots N_l} e_{l,j}$. The worst-case response time of $\tau_i$ can then be found [3] as the fixed point $r_i$ of the following iteration, starting from $r_i^0 = e_i + B_i$:

$$r_i^{k+1} = e_i + B_i + \sum_{l \in hp_i} \left\lceil \frac{r_i^k}{p_l} \right\rceil e_l, \qquad (1)$$

Task set $\Gamma$ is schedulable if $\forall \tau_i : r_i \le D_i$.

We now turn our attention to peripheral scheduling. Note that due to space limitations, proofs and a more detailed discussion are provided in [17].
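Before moving on to I/O flows, the fixed-point iteration of Equation 1 can be sketched in C. This is a minimal sketch under stated assumptions, not the authors' tool: the record layout `struct task`, the function names, the use of integer time units, and the convention that the task array is sorted by decreasing priority are all introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task record; the array is sorted by decreasing priority. */
struct task {
    long e;        /* total execution time e_i (sum over its intervals) */
    long max_int;  /* longest single scheduling interval of the task */
    long p;        /* period p_i (> 0) */
    long d;        /* relative deadline D_i */
};

/* Integer ceiling of a / b for a >= 0 and b > 0. */
static long ceil_div(long a, long b) { return (a + b - 1) / b; }

/*
 * Iterates Equation 1 for task i. Returns the fixed point r_i, or -1 as
 * soon as an iterate exceeds the deadline D_i (task not schedulable).
 */
long response_time(const struct task *ts, size_t n, size_t i)
{
    assert(i < n);

    /* B_i: longest interval among lower priority tasks. */
    long b = 0;
    for (size_t l = i + 1; l < n; l++)
        if (ts[l].max_int > b)
            b = ts[l].max_int;

    long r = ts[i].e + b;  /* r_i^0 = e_i + B_i */
    for (;;) {
        long next = ts[i].e + b;
        for (size_t l = 0; l < i; l++)  /* higher priority tasks */
            next += ceil_div(r, ts[l].p) * ts[l].e;
        if (next > ts[i].d)
            return -1;
        if (next == r)
            return r;
        r = next;
    }
}
```

Because the iterates never decrease and are bounded by the deadline check, the loop terminates for any task set with positive periods.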
Assume that each I/O flow $\tau_i^{I/O}$ is characterized by a maximum transmission time $e_i^{I/O}$ (with no interference in both main memory and the interconnect), period $p_i^{I/O}$ and relative deadline $D_i^{I/O}$, where $D_i^{I/O} \le p_i^{I/O}$. The schedulability analysis for I/O flows is more complex because the scheduling of data transfers depends on the task schedule. To solve this issue, we extend the hierarchical scheduling framework proposed by Shin and Lee in [26]. In this framework, tasks/flows in a child scheduling model execute using a timing resource provided by a parent scheduling model. Schedulability for the child model can be tested based on the supply bound function sbf(t), which represents the minimum resource supply provided to the child model in any interval of time t. In our model, the I/O flow schedule is the child model and sbf(t) represents the minimum amount of time in any interval of time t during which the execution phase of a predictable interval is scheduled or the CPU is idle. Define the service time bound function tbf(t) as the pseudo-inverse of sbf(t), that is, $tbf(t) = \min\{x \mid sbf(x) \ge t\}$. Then if I/O flows are scheduled according to fixed priority, in [26] it is shown that the response time $r_i^{I/O}$ of flow $\tau_i^{I/O}$ can be computed according to the iteration:

$$r_i^{I/O,k+1} = tbf\left( e_i^{I/O} + \sum_{l \in hp_i^{I/O}} \left\lceil \frac{r_i^{I/O,k}}{p_l^{I/O}} \right\rceil e_l^{I/O} \right), \qquad (2)$$

where $hp_i^{I/O}$ has the same meaning as $hp_i$.

In the remainder of this section, we detail how to compute sbf(t). For the sake of simplicity, let us initially assume that tasks are strictly periodic and that the initial activation time of each task is known. Furthermore, notice that using the solution described in Section IV, we could enforce interval lengths not just for predictable intervals but also for all compatible intervals. Finally, let h be the hyperperiod of task set Γ, defined as the least common multiple of all tasks' periods.
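To make the role of tbf(t) concrete, the pseudo-inverse and the iteration of Equation 2 can be sketched in C for a supply bound function that has already been tabulated at integer time points. Everything named here is an assumption for the example: the `struct flow` layout, the function names, the table representation of sbf, and the choice of $tbf(e_i^{I/O})$ as the starting value of the iteration, which the text above does not specify.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical I/O flow record; the array is sorted by decreasing priority. */
struct flow {
    long e;  /* maximum transmission time */
    long p;  /* period (> 0) */
    long d;  /* relative deadline */
};

/* Integer ceiling of a / b for a >= 0 and b > 0. */
static long ceil_div(long a, long b) { return (a + b - 1) / b; }

/*
 * tbf(s) = min{ x | sbf(x) >= s }, where sbf[x] holds the non-decreasing
 * supply bound at integer time x for x = 0..len-1.
 * Returns -1 if the table is too short to supply s.
 */
long tbf(const long *sbf, size_t len, long s)
{
    assert(sbf != NULL);
    for (size_t x = 0; x < len; x++)
        if (sbf[x] >= s)
            return (long)x;
    return -1;
}

/*
 * Iterates Equation 2 for flow i. Returns the fixed point, or -1 if the
 * iterate exceeds the deadline or leaves the tabulated range.
 */
long io_response_time(const struct flow *fs, size_t i,
                      const long *sbf, size_t len)
{
    long r = tbf(sbf, len, fs[i].e);  /* assumed starting value */
    while (r >= 0 && r <= fs[i].d) {
        long demand = fs[i].e;
        for (size_t l = 0; l < i; l++)  /* higher priority flows */
            demand += ceil_div(r, fs[l].p) * fs[l].e;
        long next = tbf(sbf, len, demand);
        if (next == r)
            return r;
        r = next;
    }
    return -1;
}
```

With a lower bound on the supply in place of the exact sbf table, the same code yields a sufficient (pessimistic) test, since a smaller supply can only increase the computed response time.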

Under these assumptions, it is easy to see that if Γ is feasible, the CPU schedule can be computed offline and repeats itself identically with period h after an initial interval of 2h time units (h time units if all tasks are activated simultaneously). Therefore, a tight sbf(t) can be computed as the minimum amount of supply (time during which the CPU is idle or in execution phase of a predictable interval) during any interval of time t in the periodic task schedule, starting from the initial two hyperperiods. More formally, let $\{t_1, \ldots, t_K\}$ be the set of start times for all scheduling intervals activated in the first two hyperperiods; sbf(t) is derived in the following theorem.

Theorem 2. Let $sf(t', t'')$ be the amount of supply provided in the periodic task schedule during interval $[t', t'']$. Then:

$$sbf(t) = \min_{k=1 \ldots K} sf(t_k, t_k + t). \qquad (3)$$

Unfortunately, the proposed sbf(t) derivation can only be applied to strictly periodic tasks. If Γ includes any sporadic task $\tau_i$ with minimum interarrival time $p_i$, the schedule cannot be computed offline. Therefore, in [17], we detail an alternative analysis that is independent of the CPU scheduling algorithm and computes a lower bound $sbf_L(t)$ to sbf(t). Note that since sbf(t) is the minimum supply in any interval t, using Equation 2 with $sbf_L(t)$ instead of sbf(t) will still result in a sufficient schedulability test. The analysis first computes $sbf_L(t)$ in a finite set of time points by solving a linear optimization problem and then derives $sbf_L(t)$ for all values of t using interpolation. The overall time complexity of the analysis is $O\left(N^2 \max_i D_i^{I/O} / \min_i p_i\right)$.

VII. EVALUATION

In order to verify the validity and practicality of PREM, we implemented the key components of the system. In this section, we first describe our implemented testbed. We then discuss our compiler implementation and analyze its effectiveness on several embedded benchmarks. Finally, using synthetic tasks, we measure the effectiveness of the PREM system as a function of cache stall time.

A. PREM Hardware/Software Testbed

To support I/O flow scheduling, we developed a real-time bridge and peripheral scheduler prototype as described in Section III. Due to space limitations, a detailed description of the implemented hardware components, based on FPGA prototyping boards, is provided in [17]. Compared to the previous work in [1], [2], the new components exhibit two main differences: (1) an additional interrupt_block wire between the real-time bridge and peripheral scheduler is used to control interrupt propagation as detailed in Section IV; (2) the peripheral scheduler is connected to the PCIe bus and exposes a set of registers accessible from the main CPU. In particular, the yield register is used to receive scheduling messages as described in Section IV.

Both the real-time bridge and the peripheral scheduler require software drivers to be controlled from the main CPU and interact with each peripheral. The driver for the peripheral scheduler is extremely simple, exposing to the CPU the peripheral scheduler's registers. The driver for each real-time bridge is more difficult, since each unique COTS peripheral requires a unique driver. However, since, in our implementation, we employ Linux (version ), we can reuse existing, thoroughly tested Linux drivers to drastically reduce the driver creation effort [1]. The presence of a real-time bridge is not apparent in user space, and software programs using the peripherals require no modification.

For our experiments, we use an Intel Q6700 CPU with an 82975X system controller; we set the CPU frequency to 1 GHz, obtaining a measured memory bandwidth of 1.8 Gbytes/s, to configure the system in line with typical values for embedded systems. We also disable the speculative CPU HW prefetcher since it negatively impacts the predictability of any real-time task. The Q6700 has four CPU cores and each pair of cores shares a common level 2 (last level) cache. Each cache is 16-way associative with a total size of B = 4 Mbytes and a line size of L = 64 bytes; reloading the whole cache requires roughly 2.2 ms.
Since we use a PC platform running a COTS Linux operating system, there are many potential sources of timing noise, such as interrupts, kernel threads, and other processes, which must be removed for our measurements to be meaningful. For this reason, in order to best emulate a typical uni-processor embedded real-time platform, we divided the 4 cores in two partitions. The system partition, running on the first pair of cores, receives all interrupts for non-critical devices (e.g., the keyboard) and runs all the system activities and non real-time processes (e.g., the shell we use to run the experiments). The real-time partition runs on the second pair of cores. One core in the real-time partition runs our real-time tasks together with the drivers for real-time bridges and the peripheral scheduler; the other core is turned off. Note that the cores of the system partition can still produce a small amount of unscheduled bus and main memory accesses or raise rare inter-processor interrupts (IPI) that cannot be easily prevented. However, in our experiments, we found these sources of noise to be negligible. Finally, to solve the paging issue detailed in Section IV, we used a large 4 MB page size just for the real-time tasks using the HugeTLB feature of the Linux kernel for large page support.

B. Compiler Evaluation

We built a PREM real-time C compiler prototype using the LLVM Compiler Infrastructure [10]. We extended LLVM by writing self-contained analysis and transformation passes which were then loaded into the compiler. For simplicity, in the current compiler prototype, interval partitioning is performed by putting each predictable interval into a separate function. Within the predictable interval function, the programmer adds macros that 1) indicate that non-local data should be prefetched (PREFETCH_DATA(start_address, size)) and 2) indicate that the execution phase is beginning and send scheduling messages (START_EXECUTION(WCET)). Our new LLVM compiler pass performs all remaining operations needed to transform the interval to be predictable. When a function representing a predictable interval is found, our transform first inlines all functions called within the predictable interval function. This ensures that there is only a single stack frame and segment of code that needs to be prefetched into the cache. Second, our transform inserts code to read and record the processor's cycle counter at the beginning of the interval. Third, it inserts code to prefetch the stack frame and function arguments by prefetching memory between the stack pointer and slightly beyond the frame pointer (to include function arguments) using the prefetcht2 instruction. Fourth, the transform prefetches the code of the function by deriving pointers to the beginning and end of the predictable function and then using the prefetcht2 instruction. Finally, the pass identifies all return instructions inside the predictable interval function and adds a special function epilog before them. The epilog performs interval-length enforcement by looping until the cycle counter reaches the worst-case cycle count based on the time value saved at the beginning of the interval and the predictable interval length (WCET) provided in START_EXECUTION.

To verify the correctness of the PREM compiler and to test its applicability, we used LLVM to compile several benchmarks mostly taken from MiBench [27], a commercially representative embedded benchmark suite. First, we elaborate on a DES cypher benchmark because its pattern of cache misses in the various PREM phases is representative of the other benchmarks we tested. Second, we discuss a JPEG benchmark because it has a larger, more complicated code base, and we believe the workflow to make it PREM-compliant will match realistic PREM applications. Lastly, we discuss the entire automotive program group of MiBench (6 benchmarks) to evaluate the broader necessity and feasibility of PREM compilation.

[TABLE I. DES BENCHMARK CACHE MISSES. Columns: input bytes (4K, 8K, 32K, 128K, 512K, 1M); rows: Non-PREM miss, PREM prefetch, PREM exec-miss.]
We ran all benchmarks with multiple input data sizes and have shown here representative measurements using the small input files from the benchmarks. No peripheral traffic occurs during execution. To capture worst-case behavior, the cache was invalidated prior to the start of each measured function by using a hand-written cache_trash function.

[TABLE II. MIBENCH RESULTS WITHOUT PERIPHERAL TRAFFIC. Columns: PREM (prefetch, exec-miss, time(µs)) and Non-PREM (miss, time(µs)); rows: JPEG (1 Mpx), JPEG (8 Mpx), qsort, susan smooth, susan edge, susan corner.]

The DES Cypher Benchmark is composed of one scheduling interval which encrypts a variable amount of data. We compiled it both in the standard, non-PREM way and also with PREM prefetching, and measured the number of cache misses and prefetches which occurred by using a CPU performance counter. Adapting the interval required no modification to any cypher functions and a total of 11 PREFETCH_DATA macros. As shown in Table I, non-PREM execution results in a significant number of cache misses throughout the interval, which increases roughly proportionally to the amount of processed data. If I/O peripherals were to transmit to main memory concurrently, the task's execution time would increase. Conversely, the execution phase (after prefetch) of the predictable interval has almost zero cache misses, only suffering a small increase when a large amount of data is being processed. This demonstrates the key result: with PREM, I/O peripherals can communicate with main memory freely during the execution phase without affecting the timing of the executing task. The reason the number of cache misses is not exactly zero in the PREM execution phase is that the Q6700 CPU core used in our experiments uses a random cache replacement policy, meaning that with more than one contiguous memory region, the probability of self-eviction is non-zero. In our experiments, we observed that the number of self-evictions is usually small. However, we still recommend that timing-critical applications avoid CPUs with a random cache replacement policy.
Many embedded platforms used in safety-critical markets such as avionics use processors like the Freescale PowerPC family [7] with more predictable policies like pseudo-LRU [23].

A typical PREM code augmentation workflow was exemplified by the JPEG Image Encoding Benchmark. In this benchmark, we first used gprof to find that around 80% of the execution time is spent in the compress_data() function which performs DCT transformation, quantization and Huffman encoding. We made compress_data() PREM-compliant by replacing constant function pointers with direct calls, adding 18 PREFETCH_DATA macros, and removing fwrite system calls from the predictable interval. The results for two image sizes are shown in Table II, where time(µs) represents the execution time of the whole interval.

We also went through the complete Automotive Program Group of MiBench and evaluated each of the six benchmarks to determine the broader necessity and feasibility of PREM compilation. Two of the benchmarks (basicmath and bitcount) were not data intensive, so PREM was not necessary. Of the remaining four benchmarks, three (qsort, susan smooth, susan edge) were found to be well-suited for PREM, and we were able to perform all computation inside predictable intervals. The final benchmark (susan corner) had variable-size output so PREM would typically need to prefetch more buffer space than was actually used. The results are again shown in Table II. Note that, except for susan corner, the number of prefetched cache lines is only slightly higher than the number of cache misses suffered in the non-PREM way.
