Bounding DMA Interference on Hard-Real-Time Embedded Systems *


JOURNAL OF INFORMATION SCIENCE AND ENGINEERING 22, (2006)

TAI-YI HUANG, CHIH-CHIEH CHOU AND PO-YUAN CHEN
Department of Computer Science
National Tsing Hua University
Hsinchu, 300 Taiwan
E-mail: {tyhuang; ccchou; pychen}@cs.nthu.edu.tw

A DMA controller that operates in the cycle-stealing mode transfers data by stealing bus cycles from the CPU. The concurrent contention for the I/O bus by a CPU task and a cycle-stealing DMA I/O task retards their progress and extends their execution times. In this paper we first describe a method for bounding the worst-case execution time (WCET) of a CPU task when cycle-stealing DMA I/O is present. We next use the dynamic-programming technique to develop a method for bounding the WCET of a cycle-stealing DMA I/O task executing concurrently with a set of CPU tasks. We conducted exhaustive simulations on a widely used embedded processor. The experimental results demonstrate that our methods tightly bound the WCETs of CPU tasks and of cycle-stealing DMA I/O tasks.

Keywords: hard-real-time systems, worst-case execution time, cycle-stealing DMA I/O, concurrent execution, embedded systems

1. INTRODUCTION

In a hard-real-time embedded system, both CPU tasks and I/O tasks are required to complete execution by their deadlines. A task that executes longer than its allocated computation time may lead to missed deadlines and the failure of the whole system. The schedulability analysis of such a system requires that the worst-case execution time (WCET) of each task be known in advance to ensure the completion of each task within its allocated computation time [14, 19, 20, 28, 30]. To tightly bound the WCET, the interference between concurrently executing CPU tasks and I/O tasks must be considered. This paper addresses the problems of bounding the WCETs of concurrently executing CPU tasks and cycle-stealing DMA I/O tasks in a hard-real-time embedded system.
We model each CPU task as a sequence of instructions (i.e., code without undetermined loop bounds or recursive function calls), a form quite commonly found in the synthesized code of hard-real-time embedded systems [10, 11, 27, 29]. A DMA controller (DMAC) transfers data between the main memory and I/O devices with minimal CPU involvement. A DMAC may operate either in the burst mode or in the cycle-stealing mode. A DMAC that operates in the burst mode gains control of the I/O bus once it is free and retains ownership until all data transfers in the DMA I/O task complete. In contrast, a DMAC that operates in the cycle-stealing mode transfers data by stealing bus cycles from an executing CPU task. A cycle-stealing DMA I/O task allows a CPU task to execute concurrently. The contention for the I/O bus by the cycle-stealing DMA I/O task and the CPU task retards their progress and extends their execution times.

Received June 21, 2004; accepted November 1, Communicated by David H. C. Du.
* This work was supported in part by the Ministry of Economic Affairs of Taiwan, under grant No. MOEA 95-EC-17-A-01-S1-038 and 94-EC-17-A-04-S

This paper first analyzes the delay caused by cycle-stealing DMA I/O activities on a concurrently executing CPU instruction. Based on this analysis, we develop a method that tightly bounds the WCET of a CPU task. We next proceed to the problem of bounding the WCET of a cycle-stealing DMA I/O task. We define the execution time of a DMA I/O task to be the interval from the instant when the DMAC is ready to transfer the first unit of data to the instant when the CPU receives from the DMAC the interrupt signal indicating that the transfer of the last unit of data is complete. We present here a method for bounding the WCET of a cycle-stealing DMA I/O task executing concurrently with a set of CPU tasks on a single-processor embedded system. We use the dynamic-programming technique in the development of this method. The running-time complexity of this method is $O(ZU) + O(K^2 Z^2)$, where $Z$ is the number of units of data to be transferred by the DMA I/O task, $K$ is the number of CPU tasks, and $U$ is the total number of instructions in the CPU tasks.

To demonstrate the effectiveness of our method in bounding the WCET of a CPU task, we compare our WCET prediction with the one obtained by the traditional pessimistic approach. Given a CPU task and a cycle-stealing DMA I/O task that are ready at the same time, the traditional pessimistic approach estimates the WCET of the CPU task to be the sum of the execution times of both tasks when each executes alone. We measure the performance of our method in terms of the amount of reduction from this most pessimistic WCET prediction. Among the several commonly used programs tested, our method achieves up to 39% improvement in the accuracy of the WCET prediction.
We demonstrate the correctness of our method for bounding the WCET of a cycle-stealing DMA I/O task through exhaustive simulations. Given a cycle-stealing DMA I/O task and a set of CPU tasks, we simulated all possible combinations of release times and concurrent executions of these preemptive CPU tasks and the cycle-stealing DMA I/O task. We compared the maximum execution time of the DMA I/O task recorded in the simulation experiment with the WCET prediction obtained by our method. The experimental results show that our method tightly bounds the WCET of a cycle-stealing DMA I/O task.

The rest of the paper is structured as follows. Section 2 describes our machine model. Section 3 analyzes the delay caused by cycle-stealing DMA I/O and presents the method for bounding the WCET of a CPU task. Section 4 describes the method for bounding the WCET of a cycle-stealing DMA I/O task. The experimental results are discussed in section 5. Section 6 describes related work. Finally, section 7 concludes this paper.

2. THE MACHINE MODEL

We adopt here the commonly used single-processor machine model shown in Fig. 1. In this model the DMAC operates in the cycle-stealing mode and shares the same I/O bus with the CPU. The bus controller allows only one bus master at any time. As a result, either the CPU or the DMAC, but not both, can hold the bus and transfer data at any given time.

Fig. 1. The architecture of the machine model (CPU, DMA controller, bus controller, I/O bus, main memory, I/O device).

Fig. 2. The instruction cycle of ADD 1, (A0).

We assume that signal transmission on the bus is instantaneous. The analytical method presented in this paper applies only to a processor whose cache memory and instruction pipeline are disabled. Although this assumption may seem impractical with current microprocessor technology, we argue that, given the extreme complexity of the problem being addressed, our method is the first solution to bound the WCET of cycle-stealing DMA I/O tasks.

The execution of an instruction in this machine model is shown in Fig. 2. An instruction cycle consists of a sequence of operations to fetch and execute an instruction. The sequence takes one or more machine cycles. A machine cycle requires one or more processor clock cycles to execute. The beginning of each machine cycle is triggered by the processor clock. For example, the instruction cycle of ADD 1, (A0), shown in Fig. 2, is composed of four machine cycles: a memory-read (bus-access) cycle to fetch the instruction, a memory-read (bus-access) cycle to fetch an operand, an execution (no-bus-access) cycle to carry out the addition, and a memory-write (bus-access) cycle to write back the data. These four machine cycles take 2, 4, 2, and 4 processor clock cycles, respectively.

We classify all machine cycles into two categories: B (bus-access) cycles and E (execution) cycles. A B-cycle is a machine cycle during which the CPU uses the I/O bus. In contrast, the CPU does not use the bus when it is in an E-cycle. In general, there may be several consecutive E-cycles in an instruction cycle. For the sake of concreteness, we assume that the bus contention between the CPU and the DMAC is regulated according to the VMEbus [33] protocol.
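The B/E classification above can be sketched in a few lines of Python. This is only an illustration: the tuple representation, the name `e_cycle_runs`, and the constant `ADD_EXAMPLE` are assumptions introduced here, not part of the paper. The sketch groups consecutive E-cycles into runs, since these runs are the intervals during which the DMAC may hold the bus.

```python
from typing import List, Tuple

# Hypothetical representation: a machine cycle is (kind, clock_cycles),
# where kind is "B" (bus-access) or "E" (execution, no bus access).
MachineCycle = Tuple[str, int]


def e_cycle_runs(cycles: List[MachineCycle]) -> List[int]:
    """Return the length, in clock cycles, of each maximal run of
    consecutive E-cycles in an instruction cycle."""
    runs: List[int] = []
    current = 0
    for kind, clocks in cycles:
        if kind == "E":
            current += clocks          # extend the current E-run
        else:
            if current:
                runs.append(current)   # a B-cycle closes the run
            current = 0
    if current:
        runs.append(current)           # run that ends the instruction
    return runs


# The instruction cycle of ADD 1, (A0): fetch (B, 2), operand read (B, 4),
# addition (E, 2), write-back (B, 4), as described in the text.
ADD_EXAMPLE: List[MachineCycle] = [("B", 2), ("B", 4), ("E", 2), ("B", 4)]

if __name__ == "__main__":
    print(e_cycle_runs(ADD_EXAMPLE))
```

For the ADD example there is a single E-run of two clock cycles, so this instruction offers the DMAC only one short window per execution.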
This protocol is sufficiently general that the analysis presented in this paper may be easily applied to many other commonly used bus protocols. To access the bus, the DMAC first sends a bus request. If the bus is already used by the CPU, the DMAC waits until the bus becomes free. When the bus is free, there is a short delay, called the bus master transfer time (BMT), while the DMAC gains control of the bus. The DMAC can transfer data when it becomes the bus master. At the end of each transfer of a unit of data, if there is no bus request from a higher-priority device (e.g., the CPU), the DMAC may continue to hold the bus and transfer data. Otherwise, the DMAC must release the bus, and after another BMT delay the higher-priority device gains control of the bus and becomes the bus master.

Fig. 3 illustrates the concurrent execution of DMA I/O and a sequence of machine cycles $B_i E_1 E_2 \cdots E_k B_{i+1}$. The DMAC gains the bus when the CPU enters the $E_1$ cycle from the $B_i$ cycle. It keeps transferring data during the interval from the $E_1$ cycle to the $E_k$ cycle. The CPU requests the bus at the end of the $E_k$ cycle. The DMAC checks whether there is any pending bus request at the end of each data transfer; the DMAC releases the bus at the end of the $m$-th transfer, and the CPU gains the bus after another BMT delay. The execution of the $B_{i+1}$ cycle is delayed by $(b + BMT)$, where $b$ is the delay between the instant when the CPU requests the bus and the instant when the request is checked and the DMAC releases the bus.

Fig. 3. The concurrent execution of DMA I/O and a sequence of E-cycles.

We assume that the transfer of each unit of data by the DMAC takes the same amount of time and denote this time by $DT$. Let $T$ be the total execution time of the $k$ consecutive E-cycles, and $m$ be the maximum number of units of data the DMAC can transfer during this sequence of E-cycles. Based on the facts that $0 \le b < DT$ and $T + b = m \cdot DT + BMT$, we have $(m - 1) < \frac{T - BMT}{DT} \le m$. We can compute $m$ by the equation

$$m = \left\lceil \frac{T - BMT}{DT} \right\rceil. \tag{1}$$

The worst-case delay suffered by the CPU execution of the sequence of machine cycles is $m \cdot DT + 2 \cdot BMT - T$. Because each machine cycle is triggered by the processor clock, the first B-cycle $B_{i+1}$ after the sequence of $k$ E-cycles cannot start until the next processor clock cycle. As a result, the exact worst-case delay suffered by the CPU execution is equal to

$$d = \left\lceil \frac{m \cdot DT + 2 \cdot BMT - T}{T_c} \right\rceil \cdot T_c, \tag{2}$$

where $T_c$ is the period of a clock cycle.

3. BOUNDING THE WCET OF A CPU TASK

Let $A_C$ denote a CPU task, which in turn is a sequence of $k$ CPU instructions $I_1 I_2 \cdots I_k$. Because on a simple architecture each instruction begins with a B-cycle to fetch the instruction from the memory, no DMA data transfer can take place across two instructions. Consequently, the effects of cycle-stealing on each instruction can be analyzed independently, without considering the other instructions. The WCET of the CPU task $A_C$ when it executes concurrently with DMA I/O is therefore bounded by the sum of the WCETs of its instructions, each executing concurrently with DMA I/O.

To bound the WCET of an instruction $I_i$ that executes concurrently with DMA I/O, we first use Eq. (2) to calculate the worst-case delay suffered by each sequence of E-cycles. We obtain the WCET of the instruction, denoted by $W(I_i)$, by summing the execution time of the instruction when it executes alone and the worst-case delays of all E-cycle sequences in the instruction. Similarly, by Eq. (1), we obtain the maximum number of units of data the DMAC can transfer during the execution of the instruction. We denote this value by $M(I_i)$. Finally, we obtain the WCET of $A_C$, denoted by $W(A_C)$, by the equation

$$W(A_C) = \sum_{i=1}^{k} W(I_i), \tag{3}$$

and the maximum number of units of data the DMAC can transfer during the execution of $A_C$, denoted by $M(A_C)$, by the equation

$$M(A_C) = \sum_{i=1}^{k} M(I_i). \tag{4}$$

The computation of $W(A_C)$ by Eq. (3) requires, as inputs, two parameters of the I/O bus, $BMT$ and $DT$. The other information needed by the equation includes how many machine cycles each instruction is composed of, the function of each machine cycle, and the execution time of each machine cycle. We can obtain this information from the reference manual provided by the manufacturer of the processor.

4. BOUNDING THE WCET OF A DMA I/O TASK

Because a DMA I/O task proceeds by stealing bus cycles from executing instructions, its execution time depends on the sequence of instructions executing concurrently with it. We generalize the problem of bounding the WCET of a DMA I/O task to a workload that consists of a DMA I/O task and $K$ independent CPU tasks. The DMA I/O task, denoted by $A_D$, transfers $Z$ units of data. Each of the $K$ CPU tasks, denoted by $A_1, A_2, \ldots, A_K$, is a sequence of CPU instructions. Each CPU task has an arbitrary release

time, and these $K$ CPU tasks are scheduled preemptively. In contrast, the DMA I/O task $A_D$ is nonpreemptable. $A_D$ is initialized by a task other than the $K$ CPU tasks. Consequently, $A_D$ may execute concurrently with any of the $K$ CPU tasks. The method presented here works under any scheduling algorithm.

In the following we first describe three properties exhibited by a sequence of instructions that execute concurrently with the DMA I/O task $A_D$. Based on these properties we develop a recursive formula to compute the WCET of $A_D$. The recursive formula forms the basis of a dynamic-programming method that can be used to bound the WCET of $A_D$. To simplify the discussion, we assume that the CPU is never idle during the execution of $A_D$. We will remove this assumption at the end of this section by modeling an idle period as an instruction of a special CPU task.

4.1 The Properties of a Concurrent Instruction Sequence

Let $S$ denote a sequence of instructions $I_a \cdots I_{a+j}$ executing concurrently with the DMA I/O task $A_D$. Because interrupts are processed between instruction cycles, $A_D$ and the first instruction $I_a$ begin at the same time. Similarly, $A_D$ and the last instruction $I_{a+j}$ end at the same time. Consequently, the WCET of the sequence $S$, denoted by $W(S)$, is bounded by the sum of the WCETs of its instructions, each executing concurrently with DMA I/O. That is, $W(S) = W(I_a) + \cdots + W(I_{a+j})$.

Fig. 4. The execution time of a DMA I/O task.

Example 1: Let $A_D$ execute concurrently with a sequence of instructions $I_a \cdots I_{a+j}$ as shown in Fig. 4. The CPU signals the DMAC to start its data transfer at time $t_1$ and starts the execution of the first instruction $I_a$ at the same time. The DMAC transfers the first unit of data when the CPU enters the first E-cycle. The DMAC signals the CPU the completion of the last unit of data during the execution of the last instruction $I_{a+j}$.
Because interrupt signals are processed between instruction cycles, the CPU is notified of the completion of $A_D$ at $t_2$, when the last instruction $I_{a+j}$ completes its execution. The execution time of $A_D$ is therefore equal to $(t_2 - t_1)$, which is bounded by $W(I_a) + \cdots + W(I_{a+j})$.

Property 1 The DMA I/O task $A_D$ and the sequence $S$ begin and end at the same time. The WCET of $S$ is bounded by the sum of the WCETs of its instructions, each executing concurrently with DMA I/O.
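Property 1 rests on the per-instruction quantities $W(I)$ and $M(I)$ from Section 3. The following Python sketch shows one way to evaluate Eqs. (1) and (2) for each E-cycle run and to sum the results as in Eqs. (3) and (4). The function names are hypothetical, times are integers in nanoseconds, and the sketch assumes every E-cycle run is longer than $BMT$; the parameter values in the usage lines are illustrative and match the bus parameters reported in Section 5.

```python
from typing import List, Tuple


def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for b > 0 (avoids floating-point rounding)."""
    return -(-a // b)


def max_transfers(T: int, DT: int, BMT: int) -> int:
    """Eq. (1): maximum units of data moved during an E-cycle run of length T."""
    return ceil_div(T - BMT, DT)


def worst_case_delay(T: int, DT: int, BMT: int, Tc: int) -> int:
    """Eq. (2): delay of the next B-cycle, rounded up to a clock boundary."""
    m = max_transfers(T, DT, BMT)
    return ceil_div(m * DT + 2 * BMT - T, Tc) * Tc


def instruction_bounds(alone: int, e_runs: List[int],
                       DT: int, BMT: int, Tc: int) -> Tuple[int, int]:
    """Return (W(I), M(I)) for one instruction.

    alone  : execution time of the instruction without DMA interference
    e_runs : duration of each run of consecutive E-cycles (assumed > BMT)
    """
    W = alone + sum(worst_case_delay(T, DT, BMT, Tc) for T in e_runs)
    M = sum(max_transfers(T, DT, BMT) for T in e_runs)
    return W, M


def sequence_bounds(instrs: List[Tuple[int, List[int]]],
                    DT: int, BMT: int, Tc: int) -> Tuple[int, int]:
    """Eqs. (3) and (4): sum W(I) and M(I) over a sequence of instructions."""
    pairs = [instruction_bounds(a, runs, DT, BMT, Tc) for a, runs in instrs]
    return sum(w for w, _ in pairs), sum(m for _, m in pairs)


if __name__ == "__main__":
    # ADD 1, (A0): 12 clock cycles alone, one E-run of 2 clock cycles,
    # with Tc = 50 ns, DT = 100 ns, BMT = 5 ns.
    print(instruction_bounds(600, [100], DT=100, BMT=5, Tc=50))
```

With these values the single 100 ns E-run admits one transfer and costs the CPU one extra clock cycle, so the instruction is bounded by 650 ns instead of 600 ns.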

The DMAC must transfer the last unit of data during the execution of the last instruction $I_{a+j}$. Because interrupts are only processed between instruction cycles, some of the E-cycles in $I_{a+j}$ may not be utilized by the DMAC, as shown by the example in Fig. 4. In contrast, because the DMA I/O task is nonpreemptable, the DMAC must fully utilize all the E-cycles in the rest of the instructions to transfer data. Again, let $M(I_i)$ be the maximum number of units of data transferred by the DMAC during the execution of the instruction $I_i$. The sequence of instructions $I_a \cdots I_{a+j}$ must satisfy

$$\sum_{i=a}^{a+j-1} M(I_i) < Z \le \sum_{i=a}^{a+j} M(I_i).$$

Property 2 The DMAC must fully utilize all the E-cycles in every instruction of $S$ except $I_{a+j}$. In addition, the last unit of data must be transferred during the execution of $I_{a+j}$.

Because these $K$ CPU tasks are scheduled preemptively, the sequence $S$ may contain instructions from any of the $K$ CPU tasks. Among the instructions in $I_a \cdots I_{a+j}$, let $S_i$ denote the set of instructions from the CPU task $A_i$. $S_i$ is either an empty set or a subsequence of contiguous instructions of the task $A_i$.

Property 3 Among the instructions of $S$, the set of instructions from the same CPU task must be a subsequence of contiguous instructions of that CPU task.

4.2 The Recursive Formula

Let $Y$ denote the set of all possible sequences of instructions that may execute concurrently with the DMA I/O task $A_D$. According to Property 1, $W(S)$, the WCET of a sequence $S$, is simply the sum of $W(I)$ over every instruction $I \in S$. Therefore, we can obtain the WCET of $A_D$, denoted by $W(A_D)$, as the maximum $W(S)$ over every $S \in Y$; that is,

$$W(A_D) = \max_{S \in Y} W(S).$$

The derivation

Let us divide $Y$ into $K$ disjoint subsets $Y_1, Y_2, \ldots, Y_K$ in such a way that the subset $Y_\alpha$ consists of all the sequences whose last instruction is from the task $A_\alpha$. Let $W_{(K,Z,\alpha)}$ denote the maximum $W(S)$ over every $S \in Y_\alpha$. We can redefine $W(A_D)$ as

$$W(A_D) = \max_{1 \le \alpha \le K} W_{(K,Z,\alpha)}. \tag{5}$$

Let us further divide $Y_\alpha$ into a number of disjoint subsets. Let $Y_\alpha^{(m_1, m_2, \ldots, m_K)}$ denote a subset of sequences in $Y_\alpha$ such that each sequence $S$ in this subset has the following property: the DMAC transfers $m_i$ units of data during the execution of the instructions from the task $A_i$, $i = 1$ to $K$. Let $W_\alpha^{(m_1, m_2, \ldots, m_K)}$ denote the maximum $W(S)$ over every $S \in Y_\alpha^{(m_1, m_2, \ldots, m_K)}$. We can define $W_{(K,Z,\alpha)}$ as

$$W_{(K,Z,\alpha)} = \max\left\{ W_\alpha^{(m_1, m_2, \ldots, m_K)} \right\}. \tag{6}$$

Because the DMA I/O task $A_D$ transfers $Z$ units of data in total, we have

$$m_1 + m_2 + \cdots + m_K = Z. \tag{7}$$

In addition, because the last instruction of each sequence $S \in Y_\alpha^{(m_1, m_2, \ldots, m_K)}$ is from the task $A_\alpha$, we have $m_\alpha > 0$ according to Property 2, and $m_i \ge 0$ for any $i \ne \alpha$. That is,

$$0 < m_\alpha, \text{ and } 0 \le m_i \text{ for } i \ne \alpha. \tag{8}$$

To compute $W_\alpha^{(m_1, m_2, \ldots, m_K)}$, we first define $f_i(m_i)$ and $p_i(m_i)$. Let $I_a \cdots I_{a+j}$ be a subsequence of contiguous instructions of the task $A_i$ such that

$$m_i = \sum_{l=a}^{a+j} M(I_l). \tag{9}$$

Let $F_i^{m_i}$ denote the set of all possible subsequences of $A_i$ that satisfy Eq. (9). Again, we use $W(S)$ to denote the maximum execution time of a subsequence $S$ when it executes concurrently with DMA I/O. We define $f_i(m_i)$ to be the maximum $W(S)$ over every $S \in F_i^{m_i}$. That is,

$$f_i(m_i) = \max_{S \in F_i^{m_i}} W(S). \tag{10}$$

Similarly, let $I_a \cdots I_{a+j}$ be a subsequence of contiguous instructions of the task $A_i$ such that

$$\sum_{l=a}^{a+j-1} M(I_l) < m_i \le \sum_{l=a}^{a+j} M(I_l). \tag{11}$$

Let $P_i^{m_i}$ denote the set of all possible subsequences of $A_i$ that satisfy Eq. (11). We define $p_i(m_i)$ as

$$p_i(m_i) = \max_{S \in P_i^{m_i}} W(S). \tag{12}$$

Let us get back to a sequence $S \in Y_\alpha^{(m_1, m_2, \ldots, m_K)}$. According to Property 3, the sequence $S$ is in fact the concatenation of the subsequences $S_i$ of each task $A_i$ such that the DMAC transfers $m_i$ units of data during the execution of the subsequence $S_i$, $i = 1$ to $K$. In addition, because no DMA transfer can take place across two subsequences, $W(S)$ is equal to the sum of $W(S_i)$, $i = 1$ to $K$. Consequently, we can use $f_i(m_i)$ and $p_i(m_i)$ to define $W_\alpha^{(m_1, m_2, \ldots, m_K)}$ as

$$W_\alpha^{(m_1, m_2, \ldots, m_K)} = e_1(m_1) + e_2(m_2) + \cdots + e_K(m_K) \tag{13}$$

where

$$e_i(m_i) = \begin{cases} p_i(m_i) & \text{if } i = \alpha, \\ f_i(m_i) & \text{if } i \ne \alpha. \end{cases} \tag{14}$$

By replacing Eq. (8) with the setting of $p_i(0)$ to $-\infty$, $i = 1$ to $K$, we can generalize the definition of $W_{(K,Z,\alpha)}$ given in Eqs. (6) to (8), (13), and (14) to the following form:

$$W_{(K,Z,\alpha)} = \max\{ e_1(m_1) + e_2(m_2) + \cdots + e_K(m_K) \}$$

where $e_i(m_i)$ is given by Eq. (14) and the max function is over all $m_1, m_2, \ldots, m_K$ such that (1) $m_1 + m_2 + \cdots + m_K = Z$, and (2) $0 \le m_i \le Z$, $i = 1, 2, \ldots, K$. By considering $m_K$ separately, we can further rewrite the above formula as

$$W_{(K,Z,\alpha)} = \max_{0 \le g \le Z} \left\{ \max\{ e_1(m_1) + e_2(m_2) + \cdots + e_{K-1}(m_{K-1}) \} + e_K(g) \right\}$$

where the inner max function is over all $m_1, m_2, \ldots, m_{K-1}$ such that (1) $m_1 + m_2 + \cdots + m_{K-1} = Z - g$, and (2) $0 \le m_i \le Z - g$, $i = 1, 2, \ldots, K - 1$. Since the inner term in the above formula is exactly $W_{(K-1,Z-g,\alpha)}$, we simplify it to

$$W_{(K,Z,\alpha)} = \max_{0 \le g \le Z} \left\{ W_{(K-1,Z-g,\alpha)} + e_K(g) \right\}.$$

After considering the terminating condition of this recursive formula, we obtain the definition of $W_{(K,Z,\alpha)}$ below:

$$W_{(K,Z,\alpha)} = \begin{cases} e_1(Z) & \text{if } K = 1, \\ \max_{0 \le g \le Z} \left\{ W_{(K-1,Z-g,\alpha)} + e_K(g) \right\} & \text{if } K > 1. \end{cases} \tag{15}$$

Again, $e_i(m_i)$ is given by Eq. (14). Finally, Eqs. (5) and (15) together give a recursive formula for computing the WCET of the DMA I/O task $A_D$.

Table construction

The computation of Eq. (15) requires frequent accesses to both $f_i(m_i)$ and $p_i(m_i)$. To avoid computing the same $f_i(m_i)$ and $p_i(m_i)$ repeatedly, we pre-compute each $f_i(m_i)$ and $p_i(m_i)$, and store the results in the tables $f[i, m_i]$ and $p[i, m_i]$, respectively, for $i = 1$ to $K$ and $m_i = 0$ to $Z$. We rewrite Eq. (14) as below to retrieve pre-computed results from these two tables.

$$e_i(m_i) = \begin{cases} p[i, m_i] & \text{if } i = \alpha, \\ f[i, m_i] & \text{if } i \ne \alpha. \end{cases} \tag{16}$$

Fig. 5 lists the procedure for constructing the tables $f[\alpha, z]$ and $p[\alpha, z]$ of a CPU task $A_\alpha$. Here we let $U_\alpha$ denote the number of instructions in $A_\alpha$. Initially, we set $f[\alpha, 0]$ to 0 and $f[\alpha, z]$ to $-\infty$, $z = 1$ to $Z$. In addition, we set $p[\alpha, z]$ to $-\infty$, $z = 0$ to $Z$. We update the table $f[\alpha, z]$ each time we locate a subsequence in $A_\alpha$ that belongs to $F_\alpha^z$ and whose WCET is larger than the current value. Similarly, we update the table $p[\alpha, z]$ each time we locate a subsequence in $A_\alpha$ that belongs to $P_\alpha^z$ and has a larger WCET. If at the end of the procedure an entry $f[\alpha, z]$ (or $p[\alpha, z]$) still has the value of $-\infty$, it is impossible to find in the task $A_\alpha$ a subsequence of instructions that belongs to $F_\alpha^z$ (or $P_\alpha^z$).

Input: the CPU task A_α, a sequence of U_α instructions.
Output: the entries f[α, z] and p[α, z], z = 0, 1, …, Z.
Procedure:
  for z = 0 to Z do
    for j = 1 to U_α do {
      1. find a longest subsequence that starts with the j-th instruction and belongs to F_α^z;
      2. if (such a subsequence exists) and (its WCET is larger than f[α, z])
         then set f[α, z] to the WCET of the subsequence;
      3. find a longest subsequence that starts with the j-th instruction and belongs to P_α^z;
      4. if (such a subsequence exists) and (its WCET is larger than p[α, z])
         then set p[α, z] to the WCET of the subsequence;
    }

Fig. 5. The procedure that computes f[α, z] and p[α, z] for the task A_α.

Running-time complexity

Instead of searching through the sequence of instructions repeatedly, steps 1 and 3 of the procedure shown in Fig. 5 can be carried out in constant time by utilizing the information calculated in a previous iteration of the loop. Specifically, the subsequences that start with the $(j - 1)$-th instruction can be used to locate the subsequences that start with the $j$-th instruction. Consequently, the running-time complexity of the procedure shown in Fig. 5 can be optimized to $O(Z U_\alpha)$.
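The table construction can be sketched in Python as follows. This is a deliberately simple illustration that enumerates every contiguous subsequence directly from Eqs. (9) and (11); it is not the constant-time-per-step optimization described above, so it runs in quadratic rather than $O(Z U_\alpha)$ time. The function name `build_tables` and the list-based inputs `M` and `W` (holding $M(I_l)$ and $W(I_l)$ for one task) are assumptions made for this sketch.

```python
from typing import List, Tuple

NEG_INF = float("-inf")


def build_tables(M: List[int], W: List[int], Z: int) -> Tuple[list, list]:
    """Build f[z] and p[z], z = 0..Z, for a single CPU task.

    M[l], W[l] : units transferable during, and WCET of, instruction l.
    f[z]       : max WCET of a contiguous subsequence moving exactly z units.
    p[z]       : max WCET of a contiguous subsequence during whose last
                 instruction the z-th unit is transferred (Eq. (11)).
    """
    f = [NEG_INF] * (Z + 1)
    f[0] = 0                      # initial setting: f[0] = 0, the rest -inf
    p = [NEG_INF] * (Z + 1)       # p[0] stays -inf by construction
    n = len(M)
    for a in range(n):            # first instruction of the subsequence
        units = 0
        wcet = 0
        for j in range(a, n):     # last instruction of the subsequence
            prev = units
            units += M[j]
            wcet += W[j]
            # Eq. (11): prev < z <= units, restricted to z <= Z
            for z in range(prev + 1, min(units, Z) + 1):
                p[z] = max(p[z], wcet)
            if units > Z:
                break             # longer subsequences cannot qualify
            # Eq. (9): the subsequence moves exactly `units` units
            f[units] = max(f[units], wcet)
    return f, p


if __name__ == "__main__":
    # Hypothetical three-instruction task.
    f, p = build_tables(M=[1, 0, 2], W=[10, 5, 20], Z=3)
    print(f)
    print(p)
```

Entries that remain at negative infinity mark transfer counts that no contiguous subsequence of the task can realize, which is how infeasible splits are excluded later in the maximization.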
To construct the whole tables $f[k, z]$ and $p[k, z]$, we apply this procedure to each of the $K$ CPU tasks. The time complexity is $\sum_{k=1}^{K} O(Z U_k) = O(ZU)$, where $U$ is the total number of instructions of these $K$ CPU tasks. The procedure shown in Fig. 6 uses the tables $f[\alpha, z]$ and $p[\alpha, z]$ together with Eqs. (15) and (16) to compute $W_{(K,Z,\alpha)}$. We implement Eq. (5) by the for-loop below. Initially, we set $W(A_D)$ to 0. At the end of the loop, $W(A_D)$ holds the WCET of the DMA I/O task $A_D$.

Input: the tables f[α, z] and p[α, z], z = 0, 1, …, Z;
       the definition of e_i(m_i) given in Eq. (16).
Output: the value of W_(k,z,α).
Procedure: EQ15(k, z, α)
  1. if (k == 1) then return e_1(z);
  2. set R to 0;
  3. for g = 0 to z do {
       set T to (EQ15(k − 1, z − g, α) + e_k(g));
       if (T > R) then set R to T;
     }
  4. return R;

Fig. 6. The procedure that implements Eq. (15).

for α = 1 to K do {
  W_(K,Z,α) = EQ15(K, Z, α);
  if (W_(K,Z,α) > W(A_D)) then set W(A_D) to W_(K,Z,α);
}

The time complexity of computing $W_{(K,Z,\alpha)}$ with the procedure shown in Fig. 6 is $O(W_{(K,Z,\alpha)}) = O((Z + 1) \cdot O(W_{(K-1,Z,\alpha)})) = O((Z + 1)^{K-1}) = O(Z^K)$. Finally, the time complexity of computing $W(A_D)$ with the recursive formula is $O(ZU) + O(K Z^K)$. In other words, the time complexity of the recursive formula grows exponentially as the number of CPU tasks grows.

4.3 A Dynamic-Programming Method

To avoid redundant computation, we implement Eq. (15) by the procedure shown in Fig. 7. The time complexity of this dynamic-programming method is $O(K Z^2)$, and the time complexity of computing $W(A_D)$ by this procedure is $O(K^2 Z^2)$. Because the time complexity of building the tables $f[k, z]$ and $p[k, z]$ is $O(ZU)$, the time complexity of computing $W(A_D)$ is $O(ZU) + O(K^2 Z^2)$, where $Z$ is the number of units of data to be transferred by $A_D$, $K$ is the number of CPU tasks that may execute concurrently with $A_D$, and $U$ is the total number of instructions of these $K$ CPU tasks.

Another advantage of implementing Eq. (15) by the dynamic-programming method is that the table $W[k, z, \alpha]$ built for the purpose of bounding the WCET of $A_D$ can be used to bound the WCET of other DMA I/O tasks. For example, to compute the WCET of another DMA I/O task $A_D'$ that transfers $Z'$ units of data, $Z' < Z$, by Eq. (15) we first need to compute $W[K, Z', \alpha]$. Because $W[K, Z', \alpha]$ has already been computed in the process of computing the WCET of $A_D$, we can use the results stored in the table directly to compute the WCET of $A_D'$.
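To make the gap between the two complexities concrete, the sketch below counts work rather than performing it. `recursive_calls` counts the invocations of EQ15 triggered by one top-level call, using the recurrence implied by Fig. 6, and `dp_inner_iterations` counts the innermost-loop iterations needed when every entry $W[k, z, \alpha]$ is computed exactly once for a single $\alpha$. Both function names are hypothetical, and memoization is used only to evaluate the counting recurrence quickly.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def recursive_calls(k: int, z: int) -> int:
    """Invocations of EQ15 caused by one call EQ15(k, z, alpha).

    A call with k == 1 returns immediately; otherwise it makes one
    recursive call for every g in 0..z, plus itself.
    """
    if k == 1:
        return 1
    return 1 + sum(recursive_calls(k - 1, z - g) for g in range(z + 1))


def dp_inner_iterations(K: int, Z: int) -> int:
    """Innermost-loop iterations when each W[k, z, alpha], k = 2..K and
    z = 0..Z, is filled in once by scanning g = 0..z."""
    return (K - 1) * sum(z + 1 for z in range(Z + 1))


if __name__ == "__main__":
    for K, Z in [(3, 3), (6, 20), (8, 50)]:
        print(K, Z, recursive_calls(K, Z), dp_inner_iterations(K, Z))
```

For very small inputs the two counts are comparable, but the recursive count grows as a polynomial in $Z$ whose degree increases with $K$, while the tabulated count stays quadratic in $Z$.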
Suppose that there are $\gamma$ DMA I/O tasks that can execute concurrently with these $K$ CPU tasks, and the $i$-th DMA I/O task transfers $Z_i$ units of data, $i = 1, 2, \ldots, \gamma$. The time complexity of bounding the WCETs of these DMA I/O tasks is $O(Z_{max} U) + O(K^2 Z_{max}^2)$, where $Z_{max}$ is the maximum value of $Z_1, Z_2, \ldots, Z_\gamma$.

Input: the tables f[k, z] and p[k, z], k = 1 to K, z = 0 to Z;
       the definition of e_i(m_i) given in Eq. (16).
Output: the table W[k, z, α], k = 1 to K, z = 0 to Z.
Procedure:
  1. set W[1, z, α] to e_1(z), z = 0 to Z.
  2. for k = 2 to K do
       for z = 0 to Z do {
         set W[k, z, α] to −∞;
         for g = 0 to z do {
           if (W[k − 1, z − g, α] + e_k(g) > W[k, z, α]) then {
             set W[k, z, α] to (W[k − 1, z − g, α] + e_k(g));
           }
         }
       }

Fig. 7. A dynamic-programming method for Eq. (15).

The discussion thus far assumes that the CPU is never idle. We now remove this assumption. Suppose that there is an idle period during the execution of $A_D$. Let $m$ denote the number of units of data the DMAC transfers during this period. We model this idle period as an instruction $I_l$ of a special CPU task $A_{K+1}$, called the background task. Because the DMAC takes at most $(2 \cdot BMT + DT)$ to transfer a unit of data, the execution time of this period is bounded by $m \cdot (2 \cdot BMT + DT)$. That is,

$$M(I_l) = m, \text{ and } W(I_l) = m \cdot (2 \cdot BMT + DT). \tag{17}$$

Furthermore, with Eq. (17), the WCET of an instruction $I_p$ that immediately precedes the idle period can still be bounded by $W(I_p)$ as described in section 3, whether the instruction $I_p$ ends with a B-cycle or an E-cycle. Let $S$ denote a mixed sequence of instructions and idle periods that executes concurrently with $A_D$. Let $S'$ denote the new sequence of instructions obtained by replacing each idle period in $S$ with an instruction of the background task $A_{K+1}$. The new sequence $S'$ has the three properties discussed in section 4.1. Consequently, by adding the background task $A_{K+1}$ to the set of the $K$ CPU tasks that can execute concurrently with $A_D$ and setting $f[K + 1, z] = p[K + 1, z] = z \cdot (2 \cdot BMT + DT)$, $z = 0, 1, \ldots, Z$, the dynamic-programming method given in Fig. 7 still bounds the WCET of $A_D$ with a time complexity of $O(ZU) + O(K^2 Z^2)$ when CPU idle periods are allowed.
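A compact Python rendering of the dynamic-programming method, combined with the outer maximization of Eq. (5), might look as follows. It is a sketch under stated assumptions: tasks are indexed from 0, the tables are passed as nested lists, the names `wcet_dma` and `background_table` are hypothetical, and the running maximum starts at negative infinity so that an infeasible workload is reported as such. The small tables in the usage lines are made-up values for illustration.

```python
from typing import List

NEG_INF = float("-inf")


def wcet_dma(f: List[list], p: List[list], Z: int) -> float:
    """Bound W(A_D) from per-task tables f[i][z] and p[i][z], z = 0..Z.

    For each alpha (the task that supplies the last instruction) the
    rows W[k, ., alpha] are filled in order, keeping only the previous row.
    """
    K = len(f)
    best = NEG_INF
    for alpha in range(K):
        def e(i: int, z: int) -> float:
            # Eq. (16): the last task uses p, every other task uses f
            return p[i][z] if i == alpha else f[i][z]

        prev = [e(0, z) for z in range(Z + 1)]        # W[1, z, alpha]
        for k in range(1, K):
            cur = [NEG_INF] * (Z + 1)
            for z in range(Z + 1):
                for g in range(z + 1):                # units given to task k
                    cand = prev[z - g] + e(k, g)
                    if cand > cur[z]:
                        cur[z] = cand
            prev = cur
        best = max(best, prev[Z])                     # Eq. (5)
    return best


def background_table(Z: int, DT: int, BMT: int) -> List[int]:
    """f and p entries of the background task: z * (2*BMT + DT)."""
    return [z * (2 * BMT + DT) for z in range(Z + 1)]


if __name__ == "__main__":
    f = [[5, 15, 25, 35], [0, 12, NEG_INF, NEG_INF]]
    p = [[NEG_INF, 25, 35, 35], [NEG_INF, 12, NEG_INF, NEG_INF]]
    print(wcet_dma(f, p, Z=3))

    idle = background_table(3, DT=100, BMT=5)         # idle periods allowed
    print(wcet_dma(f + [idle], p + [idle], Z=3))
```

Keeping only the previous row is enough to obtain $W(A_D)$ for a single $Z$; storing every row, as in Fig. 7, is what allows the results for smaller transfer sizes to be read off without recomputation.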

5. EXPERIMENTAL RESULTS

We demonstrate the effectiveness of our methods through exhaustive simulations on a widely used embedded processor. Table 1 lists the eight CPU tasks, each of which is a random execution trace of a commonly used program. We compiled each program into an MC68030 assembly program and executed it on an MC68030 simulator with randomly generated input data to obtain the execution trace. Column 3 of Table 1 lists the number of instructions in each CPU task.

Table 1. The CPU task set.
Program      Brief Description                   Instructions
QuickSort    Recursive QuickSort                 23,026
BubbleSort   Sequential BubbleSort               65,726
FFT          Fast Fourier Transform              249,107
Spline       Cubic Spline Function               209,837
Gaussian     Gaussian Elimination                47,242
Mtxmul       Matrix Multiplication               36,789
Correlate    Track-Correlate Function            26,543
Mtxmul2      Loop-Unrolled Version of Mtxmul     9,391

We obtained from the Motorola manual [2] the timing information of each instruction in the traces. The clock frequency of the microprocessor was 20 MHz; the period of a clock cycle $T_c$ was 50 ns. We assume that a 0-wait memory was used in this experiment and that each DMA transfer of a unit of data took two clock cycles. Hence, we set $DT$ to 100 ns. Finally, $BMT$ was 5 ns.

5.1 CPU Tasks

For each CPU task $A_C$ listed in Table 1, we first used Eq. (3) to compute $W(A_C)$. We next used Eq. (4) to compute $M(A_C)$. We then compared our WCET prediction with the one obtained by the traditional pessimistic approach. Given the CPU task $A_C$ and a DMA I/O task that transfers $M(A_C)$ units of data, both ready at the same time, the pessimistic approach estimates the WCET of $A_C$ to be equal to the sum of the execution time of $A_C$ when it executes alone and the execution time of the DMA I/O task when it executes alone. We denote this traditional pessimistic WCET prediction by $W_t(A_C)$. We use the percentage of reduction from $W_t(A_C)$,

$$R = \frac{W_t(A_C) - W(A_C)}{W_t(A_C)},$$

to measure the performance of our method.
We also investigated the relationship between the performance of our method and the computational requirement of a CPU task. We classify all instructions here into two categories: long instructions and short instructions. An instruction is a long one if, during its execution, the CPU does not need the bus for 8 processor clock cycles or more. In contrast, during the execution of a short instruction, the CPU never allows any I/O device to have the bus for such a long period. For example, the instructions MULU.W D1, D2 and DIVU.W D2, D0 are long instructions, and MOVE.L (A3)+, D0 and ADD.L D0, D1 are short instructions. Column 2 of Table 2 gives the percentage of long instructions in each CPU task.

Table 2. The simulation results for CPU tasks.
Program      Long Instructions %    R in %
QuickSort    0%                     8%
BubbleSort   0%                     10%
FFT          2%                     19%
Spline       3%                     21%
Gaussian     5%                     25%
Mtxmul       11%                    35%
Correlate    17%                    38%
Mtxmul2      22%                    39%

Column 3 of Table 2 gives the reduction percentage for each CPU task. Because the delay caused by cycle-stealing on each instruction is bounded by Eq. (2), the overhead of each DMA transfer in a long instruction is less than that in a short instruction. In addition, more DMA data transfers can be carried out in a long instruction than in a short instruction. Therefore, our method produces a larger percentage of reduction on a CPU task with a higher percentage of long instructions. Among the tested CPU tasks, Mtxmul2 is obtained by unrolling the whole innermost loop of Mtxmul. The loop-unrolling procedure significantly increases the percentage of long instructions in the trace. As a result, our method produces a higher percentage of reduction on the loop-unrolled version: a 39% reduction from the most pessimistic WCET prediction is achieved.

5.2 DMA I/O Tasks

We demonstrate the correctness of the dynamic-programming method through exhaustive simulations. To make exhaustive simulation feasible, we executed the programs listed in Table 1 with a much smaller data set to obtain the CPU task set listed in Table 3. Column 3 gives the number of instructions in each simplified CPU task.
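Returning briefly to the observation in section 5.1 that each DMA transfer costs less in a long instruction than in a short one, the effect can be checked numerically from Eqs. (1) and (2) with the bus parameters stated above ($T_c$ = 50 ns, $DT$ = 100 ns, $BMT$ = 5 ns). The function name `delay_per_transfer` is hypothetical, and the two E-run lengths, 2 and 8 clock cycles, are chosen only to illustrate a short and a long instruction.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer ceiling of a / b for b > 0."""
    return -(-a // b)


def delay_per_transfer(T_ns: int, DT: int = 100, BMT: int = 5, Tc: int = 50):
    """Return (m, d, d / m) for an E-cycle run lasting T_ns nanoseconds.

    m : transfers that fit in the run, Eq. (1)
    d : worst-case CPU delay caused by those transfers, Eq. (2)
    """
    m = ceil_div(T_ns - BMT, DT)
    d = ceil_div(m * DT + 2 * BMT - T_ns, Tc) * Tc
    return m, d, d / m


if __name__ == "__main__":
    short_run = 2 * 50    # 2 clock cycles, e.g. the E-cycle of the ADD example
    long_run = 8 * 50     # 8 clock cycles, the threshold for a long instruction
    print(delay_per_transfer(short_run))
    print(delay_per_transfer(long_run))
```

Under these assumptions both runs incur the same 50 ns worst-case delay, but the long run spreads it over four transfers instead of one, which is consistent with the larger reductions reported for tasks with more long instructions.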
We first used the dynamic-programming method to compute the WCET of a DMA I/O task $A_D$ when it executes concurrently with the eight CPU tasks. We next simulated the concurrent execution of these CPU tasks and $A_D$ under the round-robin scheduling algorithm and the fixed priority assignment algorithm, and recorded the execution time of $A_D$. CPU tasks were simulated for all possible combinations of release times, and in the case of fixed priority assignment, all possible combinations of priority assignments were simulated. We allowed scheduling points to occur only every 100 instructions. We use $W_r(A_D)$ and $W_p(A_D)$ to denote the maximum execution times of $A_D$ found by the simulation when the CPU tasks are scheduled by the round-robin and fixed priority assignment scheduling algorithms, respectively. We compared our WCET prediction $W(A_D)$ against $W_r(A_D)$ and $W_p(A_D)$ to show the correctness of the dynamic-programming method.

Table 3. The simplified CPU task set.
Program      Brief Description                   Instructions
QuickSort    Recursive QuickSort                 3,124
BubbleSort   Sequential BubbleSort               2,763
FFT          Fast Fourier Transform              3,662
Spline       Cubic Spline Function               2,101
Gaussian     Gaussian Elimination                1,436
Mtxmul       Matrix Multiplication               1,170
Correlate    Track-Correlate Function            814
Mtxmul2      Loop-Unrolled Version of Mtxmul     884

Table 4. The simulation results for DMA I/O tasks.
The length of the I/O task
W(A_D)/W_r(A_D)
W(A_D)/W_p(A_D)

Table 4 shows the experimental results for DMA I/O tasks that transfer different numbers of units of data. Rows 2 and 3 of Column 2 give the values of $W(A_D)/W_r(A_D)$ and $W(A_D)/W_p(A_D)$, respectively, when the DMA I/O task $A_D$ transfers 250 units of data. We also simulated the concurrent execution of the CPU task set and three other DMA I/O tasks that transfer 500, 750, and 1000 units of data; the results are shown in Columns 3, 4, and 5. As explained in section 4.3, our dynamic-programming method only computes the WCET of the DMA I/O task that transfers 1000 units of data. The WCETs of the other three DMA I/O tasks are obtained in a table-driven manner.

In each of the eight cases investigated in this experiment, our WCET prediction $W(A_D)$ is larger than the maximum execution time of the DMA I/O task recorded in the exhaustive simulations. Our method overestimates the WCET by at most 6.3%, which occurs when the CPU tasks are scheduled by the fixed priority assignment algorithm and the DMA I/O task transfers 250 units of data. The percentage of overestimation is smaller with a longer DMA I/O task. This behavior results from the overestimation of our method on the last instruction of the sequence that executes concurrently with the DMA I/O task.
Obviously, the overestimation will have a smaller effect on the WCET prediction of a longer DMA I/O task. Finally, our method still overestimates the WCET of the DMA I/O task that transfers 1000 units of data by 0.6% and 1.4% under the round-robin and the fixed priority assignment scheduling algorithms, respectively. This residual overestimation is caused by the 100-instruction scheduling distance, a limit that considerably trims down the set of possible instruction sequences. We are confident that, by allowing scheduling points to occur on every instruction, the overestimation by our method will be practically negligible.
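To give a rough feel for how the 100-instruction scheduling distance trims the simulated search space, the following Python sketch counts candidate scheduling points per task and the number of fixed priority assignments for the eight tasks of Table 3. The counting rule (one interior point after every `distance` instructions) and the names `scheduling_points` and `priority_assignments` are assumptions made for illustration; they are not taken from the simulator used in the experiments.

```python
from math import factorial

# Instruction counts copied from Table 3.
TASKS = {
    "QuickSort": 3124, "BubbleSort": 2763, "FFT": 3662, "Spline": 2101,
    "Gaussian": 1436, "Mtxmul": 1170, "Correlate": 814, "Mtxmul2": 884,
}

def scheduling_points(instructions, distance):
    """Interior points at which a task may be switched out.

    Assumed rule: one point after every `distance` instructions,
    excluding the point at the very end of the task.
    """
    return (instructions - 1) // distance

def priority_assignments(n_tasks):
    """Number of distinct fixed priority orderings of n_tasks tasks."""
    return factorial(n_tasks)

coarse = {name: scheduling_points(n, 100) for name, n in TASKS.items()}
fine = {name: scheduling_points(n, 1) for name, n in TASKS.items()}

print(priority_assignments(len(TASKS)))          # 40320
print(sum(coarse.values()), sum(fine.values()))  # 156 15946
```

Under this assumed rule, the coarse setting leaves 156 candidate points across the task set against 15,946 when every instruction boundary is allowed, which suggests why the coarser setting explores far fewer instruction sequences.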

6. RELATED WORK

Most of the previous studies focused on bounding the WCET of CPU tasks [1, 3-8, 12, 15-18, 21-26, 29, 31, 32]. Mueller et al. [23] developed a static cache simulation to bound the WCET of CPU tasks executed on a contemporary machine with an instruction cache. Li and Malik [16] presented the implicit path-enumeration method to convert the problem of bounding the WCET into one of solving a set of ILP constraints. Li et al. [17] later extended their approach to include the timing analysis of both direct-mapped and set-associative caches. Lim et al. [18] proposed a timing analysis technique for modern multiple-issue machines such as superscalar processors. Kim et al. [15] presented quantitative analysis results on the impacts of various architecture features on the accuracy of WCET predictions. Theiling et al. [31, 32] adopted abstract interpretation to analyze the performance of modern architectures and integrated it with the implicit path-enumeration method to bound the WCET.

All of the above methods invariably assume that the CPU task to be analyzed executes without any interference from concurrently executing I/O tasks in the system. Huang et al. [13] made the first attempt to bound the WCET of a CPU task when cycle-stealing DMA I/O is concurrently executing. Hahn et al. [9] bounded the worst-case DMA response time in a fixed-priority bus arbitration protocol. However, these papers did not address how to bound the WCET of a concurrently executing cycle-stealing DMA I/O task. To our knowledge, our work is the first that attempts to bound the worst-case interference between concurrently executing CPU tasks and cycle-stealing DMA I/O tasks.

7. CONCLUDING REMARKS

Cycle-stealing DMA I/O operations have often been disabled in hard-real-time embedded systems. In this paper we first presented an analysis for bounding the delay. Based on this analysis, we developed a method for bounding the WCET of a CPU task.
Simulation results demonstrate that our method produces much tighter WCET predictions than the traditional pessimistic method, especially when the CPU task contains a large percentage of computation-intensive instructions. We also derived a recursive formula for bounding the WCET of a cycle-stealing DMA I/O task executing concurrently with a set of CPU tasks with arbitrary release times and priority assignments. We reduced the running-time complexity of the recursive formula with a dynamic-programming technique. Consequently, the WCET prediction table constructed by a full evaluation of the dynamic-programming method can be used to bound the WCETs of all cycle-stealing DMA I/O tasks executing concurrently with the same set of CPU tasks. Our method of bounding the WCET of a cycle-stealing DMA I/O task is applicable to an embedded processor where each instruction begins with a B-cycle. This paper successfully provides the first solution for the real-time community to fully utilize the bandwidth of the I/O bus in a hard-real-time embedded system by allowing the concurrent execution of CPU tasks and cycle-stealing DMA I/O tasks.
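The reuse of one prediction table for DMA I/O tasks of different lengths can be illustrated with a minimal Python sketch. The recurrence below is a toy stand-in and is not the recursive formula derived in this paper: it assumes each hypothetical instruction kind takes a fixed number of cycles and lets the DMA controller transfer a fixed number of data units, and it bounds the time to transfer n units by the worst choice of instruction at each step. The names `build_wcet_table` and `KINDS` and all numeric values are invented for illustration.

```python
def build_wcet_table(max_units, instruction_kinds):
    """Fill table[n], a bound on the time to transfer n data units.

    instruction_kinds is a list of (cycles, units_transferred) pairs,
    a hypothetical stand-in for instruction-level timing data.
    Toy recurrence: table[n] = max_k(cycles_k + table[n - units_k]).
    """
    table = [0] * (max_units + 1)
    for n in range(1, max_units + 1):
        # Each entry depends only on smaller entries, so one bottom-up
        # pass replaces the exponential recursion.
        table[n] = max(
            cycles + table[max(0, n - units)]
            for cycles, units in instruction_kinds
        )
    return table

# Hypothetical instruction kinds: (cycles taken, units the DMA transfers).
KINDS = [(4, 1), (10, 2)]

# One full evaluation for the longest task of interest.
table = build_wcet_table(1000, KINDS)

# Shorter tasks are then answered by lookup, with no recomputation.
for units in (250, 500, 750, 1000):
    print(units, table[units])
```

The design point is that a single evaluation for the longest task fills every smaller entry along the way, so bounds for shorter tasks against the same CPU task set come from a table lookup.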

REFERENCES

1. A. Colin and I. Puaut, "Worst case execution time analysis for a processor with branch prediction," Journal of Real-Time Systems, Vol. 18, 2000.
2. Motorola, MC68030 Enhanced 32-bit Microprocessor: User's Manual, Prentice-Hall, Englewood Cliffs, New Jersey.
3. J. Engblom, A. Ermedahl, M. Sjödin, J. Gustafsson, and H. Hansson, "Worst-case execution-time analysis for embedded real-time systems," Journal of Software Tools for Technology Transfer, Vol. 4, 2001.
4. J. Engblom and A. Ermedahl, "Pipeline timing analysis using a trace-driven simulator," in Proceedings of the 6th International Conference on Real-Time Computing Systems and Applications, 1999.
5. J. Engblom and A. Ermedahl, "Modeling complex flows for worst-case execution time analysis," in Proceedings of the 21st Real-Time System Symposium, 2000.
6. C. Ferdinand, F. Martin, and R. Wilhelm, "Applying compiler techniques to cache behavior prediction," in Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers, and Tools for Real-Time Systems, 1997.
7. C. Ferdinand and R. Wilhelm, "Efficient and precise cache behavior prediction for real-time systems," Journal of Real-Time Systems, Vol. 17, 1999.
8. R. Gupta and P. Gopinath, "Correlation analysis techniques for refining execution time estimates of real-time applications," in Proceedings of the IEEE Workshop on Real-Time Operating Systems and Software, 1994.
9. J. Hahn, R. Ha, S. L. Min, and J. W. S. Liu, "Analysis of worst case DMA response time in a fixed-priority bus arbitration protocol," Journal of Real-Time Systems, Vol. 23, 2002.
10. H. Hansson, H. Lawson, O. Bridal, C. Eriksson, S. Larsson, H. Lönn, and M. Strömberg, "BASEMENT: an architecture and methodology for distributed automotive real-time systems," IEEE Transactions on Computers, Vol. 46, 1997.
11. D. Harel and A. Naamad, "The STATEMATE semantics of statecharts," ACM Transactions on Software Engineering Method, Vol. 5, 1996.
12. C. Healy, R. Arnold, F. Mueller, D. Whalley, and M. Harmon, "Bounding pipeline and instruction cache performance," IEEE Transactions on Computers, Vol. 48, 1999.
13. T. Y. Huang, J. W. S. Liu, and D. Hull, "A method for bounding the effect of DMA I/O interference on program execution time," in Proceedings of the 17th Real-Time System Symposium, 1996.
14. K. Jeffay, D. F. Stanat, and C. U. Martel, "On non-preemptive scheduling of periodic and sporadic tasks," in Proceedings of the 12th Real-Time System Symposium, 1991.
15. S. K. Kim, R. Ha, and S. L. Min, "Analysis of the impacts of overestimation sources on the accuracy of worst case timing analysis," in Proceedings of the 20th Real-Time System Symposium, 1999.
16. Y. T. S. Li and S. Malik, "Performance analysis of embedded software using implicit path enumeration," in Proceedings of the 32nd ACM/IEEE Design Automation Conference, 1995.
17. Y. T. S. Li, S. Malik, and A. Wolfe, "Cache modeling for real-time software: beyond direct mapped instruction caches," in Proceedings of the 17th Real-Time System Symposium, 1996.
18. S. S. Lim, J. H. Han, J. Kim, and S. L. Min, "A worst case timing analysis technique for multiple-issue machines," in Proceedings of the 19th Real-Time System Symposium, 1998.
19. C. L. Liu and J. Layland, "Scheduling algorithms for multiprogramming in a hard real-time environment," Journal of the ACM, Vol. 10, 1973.
20. J. W. S. Liu, W. K. Shih, K. J. Lin, R. Bettati, and J. Y. Chung, "Imprecise computations," IEEE Proceedings, Vol. 82, 1994.
21. T. Lundqvist and P. Stenström, "An integrated path and timing analysis method based on cycle-level symbolic execution," Journal of Real-Time Systems, Vol. 17, 1999.
22. T. Lundqvist and P. Stenström, "Timing anomalies in dynamically scheduled microprocessors," in Proceedings of the 20th Real-Time System Symposium, 1999.
23. F. Mueller, D. Whalley, and M. Harmon, "Predicting instruction cache behavior," in Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers, and Tools for Real-Time Systems.
24. G. Ottosson and M. Sjödin, "Worst-case execution time analysis for modern hardware architectures," in Proceedings of the ACM SIGPLAN Workshop on Languages, Compilers and Tools for Real-Time Systems.
25. C. Y. Park and A. C. Shaw, "Experiments with a program timing tool based on source-level timing schema," IEEE Computer, 1991.
26. P. Puschner and C. Koza, "Calculating the maximum execution time of real-time programs," Journal of Real-Time Systems, Vol. 1, 1989.
27. J. Richert, "Integration of mechatronic design tools with CAMeL, exemplified by vehicle convoy control design," in Proceedings of the IEEE International Symposium on Computer-Aided Control System Design, 1996.
28. L. Sha, R. Rajkumar, and J. P. Lehoczky, "Priority inheritance protocols: An approach to real-time synchronization," IEEE Transactions on Computers, Vol. 39, 1990.
29. F. Stappert and P. Altenbernd, "Complete worst-case execution time analysis of straight-line hard real-time programs," Journal of Systems Architecture, Vol. 46, 2000.
30. J. Sun, M. Gardner, and J. W. S. Liu, "Bounding completion times of jobs with arbitrary release times, variable execution times, and resource sharing," IEEE Transactions on Software Engineering, Vol. 23, 1997.
31. H. Theiling and C. Ferdinand, "Combining abstract interpretation and ILP for microarchitecture modelling and program path analysis," in Proceedings of the 19th Real-Time System Symposium, 1998.
32. H. Theiling, C. Ferdinand, and R. Wilhelm, "Fast and precise WCET prediction by separated cache and path analyses," Journal of Real-Time Systems, Vol. 18, 2000.
33. The VMEbus Specification, Motorola, 1985.

Tai-Yi Huang (黃泰一) received the B.S. degree in Computer Science and Information Engineering from National Taiwan University in 1991. He received both the M.S. and Ph.D. degrees in Computer Science from the University of Illinois at Urbana-Champaign in 1994 and 1996, respectively. From 1996 to 2001, he was a software design engineer in the Windows OS Kernel Performance Group, Microsoft Inc. Since February 2002, he has been an assistant professor with the Computer Science Department at National Tsing Hua University, Taiwan. He is currently the executive secretary of the Embedded Software Consortium, Ministry of Education, Taiwan. His research interests include low-power embedded systems, real-time operating systems, and high-performance clustered storage. He is a member of the IEEE and the ACM.

Chih-Chieh Chou (周智杰) received the B.S. degree in Computer Science from National Chung Cheng University, Taiwan and the M.S. degree in Computer Science from National Tsing Hua University, Taiwan. He is currently a software engineer at Trend Micro Enterprise, Taiwan. His projects concentrate on network security appliances.

Po-Yuan Chen (陳柏元) received the B.Sc. and M.Sc. degrees from the Computer Science Department, National Tsing Hua University, Taiwan, in 2002 and 2004, respectively. He is currently a Ph.D. student in the Computer Science Department, National Tsing Hua University, Taiwan. His research interests include operating systems, real-time systems, hardware/software codesign, digital signal processor design, and very large scale integrated circuit design.
