Predicting the Worst-Case Execution Time of the Concurrent Execution of Instructions and Cycle-Stealing DMA I/O Operations
ACM SIGPLAN Workshop on Languages, Compilers and Tools for Real-Time Systems, La Jolla, California, June 1995

Predicting the Worst-Case Execution Time of the Concurrent Execution of Instructions and Cycle-Stealing DMA I/O Operations

Tai-Yi Huang and Jane W.-S. Liu
Department of Computer Science
University of Illinois at Urbana-Champaign
Urbana, IL 61801

May 3, 1995

Abstract

This paper describes an efficient algorithm which gives a bound on the worst-case execution time of the concurrent execution of CPU instructions and cycle-stealing DMA I/O operations. Simulations of several programs were conducted to evaluate this algorithm. Compared with the traditional pessimistic approach, the bound on the worst-case execution time produced by the algorithm is significantly tighter. For a sample program that multiplies two matrices while the I/O bus is fully utilized, our algorithm achieves a 39% improvement in the accuracy of the prediction.

1 Introduction

Algorithms for scheduling tasks in hard-real-time systems typically assume that the worst-case execution times of the tasks are known. Such a system is designed to ensure that all tasks can complete by their deadlines as long as no task in the system executes longer than its worst-case execution time (WCET). A task that overruns may lead to missed deadlines and the failure of the whole system. For this reason, how to bound the WCET of programs has received a great deal of attention in recent years. Mok et al. [3] developed a graphical tool to analyze the timing behavior of assembly language programs and to bound their WCET. This tool requires that the maximum iteration count of each loop structure be known. Park and Shaw [6, 7] developed a similar method for source-level programs. Their dynamic path analysis method eliminates infeasible execution paths and thus tightens the prediction of the WCET. Puschner and Koza [8] introduced several new language constructs with which programmers can describe the timing behavior of their programs.
Their experiment showed that with this valuable information, the gap between the calculated WCET and the real WCET can be reduced significantly. To predict the WCET of concurrent programs, Niehaus [5] developed a semantics-preserving transformation for concurrent programming language constructs such as critical sections and synchronous communication. Zhang, Burns and Nicholson [11] developed a mathematical model to predict the WCET of programs executed on a two-stage pipeline. Mueller, Whalley and Harmon [4] developed a static cache simulation method to predict instruction cache behavior and bound its worst-case performance.

This paper first analyzes the delay caused by cycle-stealing direct-memory access (DMA) I/O activities. It then presents an algorithm to estimate the WCET of the concurrent execution of a stream of CPU instructions and DMA activities. A DMA controller transfers data between the main memory and I/O devices with minimal CPU involvement. As a result, the CPU can execute other instructions while a DMA controller is transferring data. A DMA controller operates either in burst mode or in cycle-stealing mode. A DMA controller in cycle-stealing mode transfers data by "stealing" bus cycles from an executing program. In this way, it retards the progress of the executing program and extends its execution time. A conservative estimate of the WCET of a stream of CPU instructions and a cycle-stealing DMA I/O operation, which are ready at the same time, is the sum of their WCETs obtained by assuming that each executes alone. Obviously this estimate is pessimistic. We present here an analysis method and an algorithm which give a tighter bound on the WCET. The performance of the algorithm, in terms of the amount of reduction from the most pessimistic WCET estimate, is demonstrated by simulation results.

The rest of the paper is structured as follows. Section 2 describes the machine model that is the basis of our analysis method.
Section 3 presents the method. Section 4 presents an algorithm to implement the method.
Our simulation results are presented in Section 5. Finally, Section 6 concludes the paper and discusses future work.

2 The Machine Model

We adopt a commonly used machine model according to which an instruction executes in the manner shown in Figure 1. The sequence of fetch and execution of an instruction is called an instruction cycle. Each instruction cycle is composed of one or more machine cycles. A machine cycle requires one to several processor clock cycles to execute. Different machine cycles perform different functions. For example, the instruction cycle of ADD 1,(A0), shown in Figure 1, is composed of four machine cycles: a memory read (bus-access) cycle to fetch the instruction, a memory read (bus-access) cycle to fetch the operand, an execution (no-bus-access) cycle to carry out the addition, followed by a memory write (bus-access) cycle to write back the data. The machine cycles in this example take 4, 3, 3, and 4 processor clock cycles to execute, respectively.

[Figure 1: The instruction cycle of ADD 1,(A0). The I/O bus is BUSY during the memory read (fetch instruction), memory read (fetch operand), and memory write (write data) cycles, and IDLE during the internal operation (execution) cycle.]

Since we focus on the analysis of how the DMA controller and the CPU contend for the bus, we are concerned primarily with whether the CPU accesses the I/O bus during each machine cycle. Therefore, we classify all machine cycles into two categories: bus-access (B) cycles and execution (E) cycles. B cycles are those machine cycles during which the CPU uses the I/O bus. In contrast, during E cycles, the CPU does not need the bus. In general, there may be several consecutive E cycles in an instruction cycle. We assume that the CPU is synchronous: the beginning of each machine cycle is triggered by the processor clock. Our analysis method is applicable only to systems that have no cache memory and no pipelining.
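Under this model, an instruction is fully described by its ordered machine cycles, each tagged B or E together with its clock-cycle count. As a minimal illustration (the tuple encoding below is ours, not the paper's), ADD 1,(A0) and the extraction of its runs of consecutive E cycles can be written as:

```python
# Each machine cycle: (kind, clock_cycles), where kind is "B" (bus-access)
# or "E" (execution). ADD 1,(A0) from Figure 1: instruction fetch, operand
# fetch, internal execution, write-back.
add_1_a0 = [("B", 4), ("B", 3), ("E", 3), ("B", 4)]

def execution_time(instr, t_c):
    """Stand-alone execution time of an instruction, given clock period t_c."""
    return sum(cycles for _, cycles in instr) * t_c

def e_run_times(instr, t_c):
    """Durations of the maximal runs of consecutive E cycles, in time units."""
    runs, current = [], 0
    for kind, cycles in instr:
        if kind == "E":
            current += cycles * t_c
        elif current:
            runs.append(current)
            current = 0
    if current:
        runs.append(current)
    return runs
```

With the paper's 60 ns clock period, ADD 1,(A0) takes (4+3+3+4) x 60 = 840 ns alone and contains a single E run of 180 ns.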
The DMA controller and the CPU share the same I/O bus, as shown in Figure 2. At any time, either the DMA controller or the CPU, but not both, can hold the bus (i.e., be the bus master) and transfer data. We focus on the case where the DMA controller operates in cycle-stealing mode. In this mode, it is allowed to access the bus only when the CPU is in an E cycle. The protocol we use to regulate the bus contention between the DMA controller and the CPU is based on the VMEbus specification [9]. Because this protocol is sufficiently general, the analysis method presented in Section 3 for bounding the delay caused by cycle-stealing DMA I/O activities is applicable to many other commonly used buses.

[Figure 2: The architecture of the machine model. The CPU, the memory, the DMA controller, and the I/O devices are connected by a single I/O bus.]

To become the bus master, the DMA controller first sends a bus request. If the CPU is in a B cycle, the DMA controller waits. The CPU releases the bus when it enters an E cycle. After a short delay, during which the ownership of the bus is transferred from the CPU to the DMA controller, the DMA controller gains the bus and starts its data transfer. We will refer to this delay period as the Bus Master transfer Time (BMT). After the DMA controller completes the transfer of each unit of data, the bus controller checks whether there is any pending bus request from the CPU. The DMA controller is allowed to continue with its next transfer if there is no request from the CPU. Otherwise, the DMA controller releases the bus, and the CPU gains the bus after a BMT delay.
3 Timing Analysis

Generally the DMA controller behaves in the following manner. After sending a bus request, the DMA controller waits when the CPU enters a B cycle from a B cycle, becomes the bus master when the CPU enters an E cycle from a B cycle, continues its transfers as long as the CPU remains in E cycles, and releases the bus when it finishes all the data transfers or when the CPU enters a B cycle from an E cycle. Again, whether there is any pending bus request is checked only at the end of each data transfer. The CPU does not gain the bus immediately after it sends a bus request if the DMA controller is currently transferring data. Therefore, the executing program may suffer delay, and its completion time is postponed accordingly.

Figure 3 illustrates the concurrent execution of the DMA controller and the sequence of machine cycles B_n -> E_1 -> E_2 -> ... -> E_k -> B_{n+1}. Our analysis method assumes that the number of consecutive E cycles in each instruction is known. As shown in Figure 3, the DMA controller gains the bus when the CPU enters the E_1 cycle from the B_n cycle. It keeps transferring data during the interval from the E_1 cycle to the E_k cycle. The DMA controller releases the bus when the CPU is entering the B_{n+1} cycle. The execution of the B_{n+1} cycle is delayed by (b + BMT), where b is the delay between when the CPU requests the bus and when the request is checked and the DMA controller releases the bus. Again, BMT is the delay from the time when a bus master releases the bus to the time when the next bus master gains the bus.

Let m denote the number of data transfers performed by the DMA controller during the k consecutive E cycles, and let DT be the amount of time for each transfer. To calculate m, we let T_{E_i} denote the execution time of the machine cycle E_i, and

    T_k = T_{E_1} + T_{E_2} + ... + T_{E_k}

be the total execution time of the k consecutive E cycles. Based on the facts that 0 <= b < DT and T_k + b = m * DT + BMT, we have

    (m - 1) * DT < T_k - BMT <= m * DT.

Therefore we conclude that

    m = ceil((T_k - BMT) / DT).    (1)

[Figure 3: The concurrent execution of cycle-stealing DMA I/O and a sequence of E cycles. After the B_n cycle, the DMA controller gains the bus following a BMT delay, performs m transfers of duration DT each during E_1 through E_k, and releases the bus after the delays b and BMT, postponing the start of the B_{n+1} cycle by d.]

The delay suffered by the CPU execution of the sequence of machine cycles is

    d' = m * DT + 2 * BMT - T_k

if the DMA controller holds the bus for m data transfers. On the other hand, if fewer than m data transfers are performed, the delay becomes shorter because the bus is transferred without the delay b. Because of the assumption that each machine cycle is triggered by the processor clock, the machine cycle B_{n+1} cannot start until the next clock cycle. As a result, the exact delay suffered by the CPU execution is at most equal to

    d = ceil(d' / T_c) * T_c,    (2)

where T_c is the period of a clock cycle.

4 Bounding the WCET

Given a stream of N CPU instructions, together with a DMA I/O operation that requires M data transfers and is ready at the same time as the stream of CPU instructions, we want to find the WCET of the concurrent execution, that is, the maximum amount of time required for both the instruction stream and the I/O operation to complete. We now present an algorithm which makes use of the knowledge about how the CPU instructions and the DMA operation interfere with one another. By doing so, it gives us a tighter bound on the WCET. Because each instruction begins with a B cycle, no DMA data transfer can take place across two instructions. Consequently, the effects of cycle stealing on each instruction can be analyzed independently, without considering the other instructions. The algorithm shown in Figure 4 uses Eqs. (1) and (2) to calculate, for each instruction, the worst-case delay caused by cycle stealing and the number of data transfers the DMA controller can perform. The information needed by this algorithm includes how many machine cycles each instruction is composed of, the function of each machine cycle, and the execution time of each machine cycle. This information can be obtained from the reference manual provided by the manufacturer of the processor. The algorithm also requires as inputs two parameters of the bus, BMT and DT. The rest of the algorithm is self-explanatory. Because the delay of each instruction obtained here is the worst-case delay, the value returned by the algorithm is an upper bound on the execution time. On the other hand, because the algorithm accounts for the effect of the concurrent execution of CPU instructions and DMA I/O, the WCET we get is much tighter than the pessimistic estimate.

5 Simulation Results

We now demonstrate the performance of our algorithm with several simulation results. Given a stream of CPU instructions and a DMA I/O operation, we first use the pessimistic approach to predict its WCET. According to this approach, the predicted value WCET_pessi is equal to the sum of the WCET of the instruction stream when it executes alone and the WCET of the DMA I/O operation when the operation is done alone. We next make the prediction with our algorithm. The value returned is denoted by WCET_ours. We use the percentage reduction from the pessimistic WCET,

    R = (WCET_pessi - WCET_ours) / WCET_pessi * 100%,

to measure the performance of the algorithm. Table 1 lists the C programs tested in our simulation. Because of the wide use of the CPU32 in embedded systems, we compile these programs into assembly language programs for the MC68332, one of the MC68300 family of embedded controllers. We execute these programs in a simulator to obtain their execution traces. The timing information of each instruction in the traces is given by [1].
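For concreteness, the procedure of Figure 4 together with Eqs. (1) and (2) can be sketched in Python. The per-instruction representation (its stand-alone execution time plus the durations of its maximal E-cycle runs) is our own encoding, and the sketch charges every E run its worst-case delay; it is an illustration under these assumptions, not the authors' implementation.

```python
import math

def transfers_and_delay(t_k, dt, bmt, t_c):
    """Worst-case number of transfers m (Eq. 1) and delay d (Eq. 2) for one
    run of consecutive E cycles of total duration t_k."""
    m = math.ceil((t_k - bmt) / dt)
    if m <= 0:
        return 0, 0  # run too short for the DMA controller to steal a cycle
    d_raw = m * dt + 2 * bmt - t_k           # the raw delay d'
    d = math.ceil(d_raw / t_c) * t_c         # rounded up to whole clock cycles
    return m, d

def wcet_bound(instructions, m_total, dt, bmt, t_c):
    """Bound the WCET of an instruction stream executing concurrently with a
    cycle-stealing DMA I/O operation of m_total data transfers (Figure 4).
    Each instruction is a pair (execution_time, [E-run durations])."""
    wcet = 0
    trans = 0  # data transfers completed so far
    for exec_time, runs in instructions:
        wcet += exec_time                    # step 2a: stand-alone time
        if trans < m_total:                  # step 2b: worst-case delay
            for t_k in runs:
                m, d = transfers_and_delay(t_k, dt, bmt, t_c)
                wcet += d
                trans += m
    if trans < m_total:                      # step 3: finish remaining
        wcet += (m_total - trans) * dt       # transfers alone
    return wcet
```

With the bus parameters used in Section 5 (DT = 120 ns, BMT = 10 ns, T_c = 60 ns), an E run of 180 ns yields m = ceil(170/120) = 2 transfers and a delay of ceil(80/60) x 60 = 120 ns.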
Since the clock frequency of the MC68332 microprocessor is 16.67 MHz, the period of a clock cycle T_c is 60 ns. Assuming that a 0-wait memory is used and that the size of the data in each DMA transfer is equal to the bus bandwidth, we set DT to 120 ns. Finally, BMT is 10 ns.

To investigate the relationship between the performance of our algorithm and the fraction of a trace which is overlapped with a DMA I/O operation, four simulations were conducted on each execution trace. For each trace, we generated four DMA I/O operations which carry out different numbers of data transfers. In particular, we chose the lengths of these DMA I/O operations so that the trace overlapped with each of the DMA I/O operations for either 25%, 50%, 75%, or 100% of the instructions. The case with 25% overlap means that the first quarter of the trace executed concurrently with the DMA I/O operation while the rest executed alone. Similarly, in the case of 50% (75%) overlap, the last 50% (25%) of the trace executed alone. We then computed the WCET of the concurrent execution of the trace and each of the four DMA I/O operations. Thus, four values of R were obtained.

    Name        Description
    qsort       a quicksort of 250 elements
    bubble      a bubble sort of 100 elements
    fft         a 128-node Fast Fourier Transform
    spline      a cubic spline function of 100 points
    gaussian    a 10x10 Gaussian elimination
    mtxmul      a multiplication of 2 10x10 matrices
    correlate   a correlate function of 500 tracks
    mtxmul2     a loop-unrolled version of mtxmul

    Table 1: The test set of C programs

Table 2 gives the results. Column 2 lists the number of instructions in each program trace. Columns 4, 5, 6, and 7 list the percentage reduction in the predicted WCET when the first quarter, the first half, the first three quarters, and the whole trace executed concurrently with a DMA I/O operation, respectively.
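The percentage reduction tabulated in those columns follows directly from the two predictions; as a trivial helper (the function name is ours):

```python
def reduction_percent(wcet_pessi, wcet_ours):
    """R = (WCET_pessi - WCET_ours) / WCET_pessi * 100%."""
    return (wcet_pessi - wcet_ours) / wcet_pessi * 100.0
```

For example, a bound 39% below the pessimistic one (as for mtxmul2 with full overlap) gives R = 39.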
These values of R indicate that, compared with the pessimistic approach, our algorithm produces a more accurate prediction of the WCET of a program when the program executes concurrently with a DMA I/O operation for a larger percentage of the time. This conclusion is expected, since WCET_pessi is more pessimistic when the percentage of overlap is larger.

We also investigated the relationship between the performance of our algorithm and the computational requirements of programs. We classify all instructions here into two categories: long instructions and short instructions. An instruction is a long one if, during its execution, the CPU does not need the bus for 8 processor clock cycles or more. In contrast, during the execution of a short instruction, the CPU never leaves the bus free for such a long period. Generally speaking, long instructions require intensive computation, and short instructions are those that do data movement or simple computation. For example, the instructions MULU.W D1,D2 and DIVU.W D2,D0 are long instructions, and MOVE.L (A3)+,D0 and ADD.L D0,D1 are short instructions. Because the delay caused by cycle stealing on each instruction is bounded by Eq. (2),
Input:
  - the number of CPU instructions, N, and the instructions inst[1], inst[2], ..., inst[N] in the stream S.
  - the number of data transfers, M, required in the cycle-stealing DMA I/O operation.
  - the execution time of each instruction, inst[i].execution_time, and its machine cycles, for i = 1, 2, ..., N.
  - BMT and DT, the two parameters of the I/O bus.
Output:
  WCET, the worst-case execution time of the concurrent execution of the instruction stream S and the DMA I/O operation.
Procedure:
  1. Set WCET and trans, the number of transfers completed, to zero.
  2. For i = 1 to N, compute the contribution of inst[i] to WCET and increment WCET by that amount as follows:
     a. update WCET = WCET + inst[i].execution_time.
     b. if (trans < M):
        - compute the worst-case delay d suffered by inst[i] according to Eq. (2), and update WCET = WCET + d.
        - compute the number of transfers, m, completed in inst[i] according to Eq. (1), and update trans = trans + m.
  3. If the I/O operation is not completed yet, increment WCET by the amount of time (M - trans) * DT needed to complete the last M - trans data transfers alone.

Figure 4: An algorithm which gives a tighter bound on the WCET

    Name        Instructions executed   Long instructions   R in %
    qsort       23,026                  0%
    bubble      65,726                  0%
    fft         249,107                 2%
    spline      209,837                 3%
    gaussian    47,272                  5%
    mtxmul      36,789                  11%
    correlate   26,543                  17%
    mtxmul2     9,391                   22%

    Table 2: The simulation results
the overhead of each DMA transfer in a long instruction is less than that in a short instruction. Consequently, when a trace contains a higher percentage of long instructions, the algorithm produces a larger percentage reduction. We tested programs with different computational requirements in the simulation. Column 3 of Table 2 gives the percentage of long instructions in each program trace. We note that the value of R increases monotonically with the percentage of long instructions. Among the tested programs, mtxmul2 is obtained by unrolling the whole innermost loop of mtxmul. The loop-unrolling procedure significantly increases the percentage of long instructions in the trace. As a result, our algorithm performs better on the loop-unrolled version: a 39% reduction from the most pessimistic WCET estimate is achieved.

6 Conclusion and Future Work

Cycle-stealing DMA operations have often been disabled in real-time systems because of the uncertainty in the amount of time such an operation may delay the completion of an executing program. We presented here an analysis method to determine this delay. Based on the method, we developed an algorithm which gives a tighter bound on the WCET of the concurrent execution of a stream of CPU instructions and a cycle-stealing DMA I/O operation. Simulation results demonstrate that the algorithm produces more accurate predictions of the WCET than the pessimistic estimates, especially when the program contains a large percentage of computation-intensive instructions.

Our analysis method is applicable only when there is no cache memory. If cache memory is used, the number of bus accesses by the CPU is significantly reduced because of cache hits. We therefore expect an even greater improvement from accurately accounting for the delay caused by the concurrent execution of cycle-stealing operations. In the future we will extend our analysis to systems in which on-chip cache memory is present.
Our work encourages the inclusion of I/O instructions in real-time programs. Because of the hardware-dependent features of I/O instructions, determining their WCET is extremely difficult. Traditionally, I/O instructions are not allowed, or are restricted to appear at predefined places such as the beginning and end of a program [2, 10]. By decomposing timing-related information in a table-driven manner, our work can be used to predict the WCET of a program containing DMA I/O instructions. Future work will build a tool capable of predicting the WCET of programs containing any I/O instruction in a table-driven manner.

References

[1] MC68000 Family: CPU32 Reference Manual. Motorola.
[2] Mark H. Klein and Thomas Ralya. An analysis of input/output paradigms for real-time systems. Technical Report CMU/SEI-90-TR-19, CMU Software Engineering Institute, July 1990.
[3] Aloysius K. Mok, Prasanna Amerasinghe, Moyer Chen, and Kamtorn Tantisirivat. Evaluating tight execution time bounds of programs by annotations. In Proceedings of the Sixth IEEE Workshop on Real-Time Operating Systems and Software, pages 272-279, May 1989.
[4] Frank Mueller, David Whalley, and Marion Harmon. Predicting instruction cache behavior. In ACM SIGPLAN Workshop on Language, Compiler, and Tool Support for Real-Time Systems, June 1994.
[5] Douglas Niehaus. Program representation and translation for predictable real-time systems. In Proceedings of the Real-Time Systems Symposium, pages 53-63, 1991.
[6] Chang Yun Park. Predicting program execution times by analyzing static and dynamic program paths. Journal of Real-Time Systems, 5:31-62, March 1993.
[7] Chang Yun Park and Alan C. Shaw. Experiments with a program timing tool based on source-level timing schema. IEEE Computer, pages 48-57, May 1991.
[8] P. Puschner and C. Koza. Calculating the maximum execution time of real-time programs. Journal of Real-Time Systems, 1:159-176, September 1989.
[9] The VMEbus Specification. Motorola.
[10] A. Vrchoticky and P. Puschner. On the feasibility of response time predictions - an experimental evaluation. Technical Report 2/91, Institut für Technische Informatik, Technische Universität Wien, March 1991.
[11] N. Zhang, A. Burns, and M. Nicholson. Pipelined processors and worst case execution times. Journal of Real-Time Systems, 5:319-343, October 1993.
More informationAutomatic flow analysis using symbolic execution and path enumeration
Automatic flow analysis using symbolic execution path enumeration D. Kebbal Institut de Recherche en Informatique de Toulouse 8 route de Narbonne - F-62 Toulouse Cedex 9 France Djemai.Kebbal@iut-tarbes.fr
More informationStorage System. Distributor. Network. Drive. Drive. Storage System. Controller. Controller. Disk. Disk
HRaid: a Flexible Storage-system Simulator Toni Cortes Jesus Labarta Universitat Politecnica de Catalunya - Barcelona ftoni, jesusg@ac.upc.es - http://www.ac.upc.es/hpc Abstract Clusters of workstations
More informationSCHEDULING REAL-TIME MESSAGES IN PACKET-SWITCHED NETWORKS IAN RAMSAY PHILP. B.S., University of North Carolina at Chapel Hill, 1988
SCHEDULING REAL-TIME MESSAGES IN PACKET-SWITCHED NETWORKS BY IAN RAMSAY PHILP B.S., University of North Carolina at Chapel Hill, 1988 M.S., University of Florida, 1990 THESIS Submitted in partial fulllment
More informationUtilizing Concurrency: A New Theory for Memory Wall
Utilizing Concurrency: A New Theory for Memory Wall Xian-He Sun (&) and Yu-Hang Liu Illinois Institute of Technology, Chicago, USA {sun,yuhang.liu}@iit.edu Abstract. In addition to locality, data access
More informationCOSC 6385 Computer Architecture - Memory Hierarchy Design (III)
COSC 6385 Computer Architecture - Memory Hierarchy Design (III) Fall 2006 Reducing cache miss penalty Five techniques Multilevel caches Critical word first and early restart Giving priority to read misses
More informationMemory Systems IRAM. Principle of IRAM
Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several
More informationGeneral Purpose Signal Processors
General Purpose Signal Processors First announced in 1978 (AMD) for peripheral computation such as in printers, matured in early 80 s (TMS320 series). General purpose vs. dedicated architectures: Pros:
More informationMARIE: An Introduction to a Simple Computer
MARIE: An Introduction to a Simple Computer 4.2 CPU Basics The computer s CPU fetches, decodes, and executes program instructions. The two principal parts of the CPU are the datapath and the control unit.
More informationPull based Migration of Real-Time Tasks in Multi-Core Processors
Pull based Migration of Real-Time Tasks in Multi-Core Processors 1. Problem Description The complexity of uniprocessor design attempting to extract instruction level parallelism has motivated the computer
More informationCS 426 Parallel Computing. Parallel Computing Platforms
CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:
More informationIntegrating MRPSOC with multigrain parallelism for improvement of performance
Integrating MRPSOC with multigrain parallelism for improvement of performance 1 Swathi S T, 2 Kavitha V 1 PG Student [VLSI], Dept. of ECE, CMRIT, Bangalore, Karnataka, India 2 Ph.D Scholar, Jain University,
More informationConsistent Logical Checkpointing. Nitin H. Vaidya. Texas A&M University. Phone: Fax:
Consistent Logical Checkpointing Nitin H. Vaidya Department of Computer Science Texas A&M University College Station, TX 77843-3112 hone: 409-845-0512 Fax: 409-847-8578 E-mail: vaidya@cs.tamu.edu Technical
More informationWhat is Pipelining? RISC remainder (our assumptions)
What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism
More informationUNIT I (Two Marks Questions & Answers)
UNIT I (Two Marks Questions & Answers) Discuss the different ways how instruction set architecture can be classified? Stack Architecture,Accumulator Architecture, Register-Memory Architecture,Register-
More informationFrequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System
Frequency Domain Acceleration of Convolutional Neural Networks on CPU-FPGA Shared Memory System Chi Zhang, Viktor K Prasanna University of Southern California {zhan527, prasanna}@usc.edu fpga.usc.edu ACM
More informationMotivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism
Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the
More informationr[2] = M[x]; M[x] = r[2]; r[2] = M[x]; M[x] = r[2];
Using a Swap Instruction to Coalesce Loads and Stores Apan Qasem, David Whalley, Xin Yuan, and Robert van Engelen Department of Computer Science, Florida State University Tallahassee, FL 32306-4530, U.S.A.
More informationCOSC 6385 Computer Architecture. - Memory Hierarchies (II)
COSC 6385 Computer Architecture - Memory Hierarchies (II) Fall 2008 Cache Performance Avg. memory access time = Hit time + Miss rate x Miss penalty with Hit time: time to access a data item which is available
More informationSupporting the Specification and Analysis of Timing Constraints *
Supporting the Specification and Analysis of Timing Constraints * Lo Ko, Christopher Healy, Emily Ratliff Marion Harmon Robert Arnold, and David Whalley Computer and Information Systems Dept. Computer
More informationHardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT
Hardware Assisted Recursive Packet Classification Module for IPv6 etworks Shivvasangari Subramani [shivva1@umbc.edu] Department of Computer Science and Electrical Engineering University of Maryland Baltimore
More informationHardware-Software Codesign. 9. Worst Case Execution Time Analysis
Hardware-Software Codesign 9. Worst Case Execution Time Analysis Lothar Thiele 9-1 System Design Specification System Synthesis Estimation SW-Compilation Intellectual Prop. Code Instruction Set HW-Synthesis
More informationStatic WCET Analysis: Methods and Tools
Static WCET Analysis: Methods and Tools Timo Lilja April 28, 2011 Timo Lilja () Static WCET Analysis: Methods and Tools April 28, 2011 1 / 23 1 Methods 2 Tools 3 Summary 4 References Timo Lilja () Static
More informationCompositional Schedulability Analysis of Hierarchical Real-Time Systems
Compositional Schedulability Analysis of Hierarchical Real-Time Systems Arvind Easwaran, Insup Lee, Insik Shin, and Oleg Sokolsky Department of Computer and Information Science University of Pennsylvania,
More informationCycle accurate transaction-driven simulation with multiple processor simulators
Cycle accurate transaction-driven simulation with multiple processor simulators Dohyung Kim 1a) and Rajesh Gupta 2 1 Engineering Center, Google Korea Ltd. 737 Yeoksam-dong, Gangnam-gu, Seoul 135 984, Korea
More informationArchitectural Issues for the 1990s. David A. Patterson. Computer Science Division EECS Department University of California Berkeley, CA 94720
Microprocessor Forum 10/90 1 Architectural Issues for the 1990s David A. Patterson Computer Science Division EECS Department University of California Berkeley, CA 94720 1990 (presented at Microprocessor
More informationFig. 1. AMBA AHB main components: Masters, slaves, arbiter and decoder. (Picture from AMBA Specification Rev 2.0)
AHRB: A High-Performance Time-Composable AMBA AHB Bus Javier Jalle,, Jaume Abella, Eduardo Quiñones, Luca Fossati, Marco Zulianello, Francisco J. Cazorla, Barcelona Supercomputing Center, Spain Universitat
More informationWorst Case Analysis of DRAM Latency in Multi-Requestor Systems. Zheng Pei Wu Yogen Krish Rodolfo Pellizzoni
orst Case Analysis of DAM Latency in Multi-equestor Systems Zheng Pei u Yogen Krish odolfo Pellizzoni Multi-equestor Systems CPU CPU CPU Inter-connect DAM DMA I/O 1/26 Multi-equestor Systems CPU CPU CPU
More informationCoarse-to-Fine Search Technique to Detect Circles in Images
Int J Adv Manuf Technol (1999) 15:96 102 1999 Springer-Verlag London Limited Coarse-to-Fine Search Technique to Detect Circles in Images M. Atiquzzaman Department of Electrical and Computer Engineering,
More informationThreshold-Based Markov Prefetchers
Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this
More informationCaches in Real-Time Systems. Instruction Cache vs. Data Cache
Caches in Real-Time Systems [Xavier Vera, Bjorn Lisper, Jingling Xue, Data Caches in Multitasking Hard Real- Time Systems, RTSS 2003.] Schedulability Analysis WCET Simple Platforms WCMP (memory performance)
More informationMemory hierarchy. 1. Module structure. 2. Basic cache memory. J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas
Memory hierarchy J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Computer Architecture ARCOS Group Computer Science and Engineering Department University Carlos III of Madrid
More informationEmbedded Systems Lecture 11: Worst-Case Execution Time. Björn Franke University of Edinburgh
Embedded Systems Lecture 11: Worst-Case Execution Time Björn Franke University of Edinburgh Overview Motivation Worst-Case Execution Time Analysis Types of Execution Times Measuring vs. Analysing Flow
More informationECE 3055: Final Exam
ECE 3055: Final Exam Instructions: You have 2 hours and 50 minutes to complete this quiz. The quiz is closed book and closed notes, except for one 8.5 x 11 sheet. No calculators are allowed. Multiple Choice
More informationSoftware Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors
Software Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors Francisco Barat, Murali Jayapala, Pieter Op de Beeck and Geert Deconinck K.U.Leuven, Belgium. {f-barat, j4murali}@ieee.org,
More informationDemand fetching is commonly employed to bring the data
Proceedings of 2nd Annual Conference on Theoretical and Applied Computer Science, November 2010, Stillwater, OK 14 Markov Prediction Scheme for Cache Prefetching Pranav Pathak, Mehedi Sarwar, Sohum Sohoni
More informationPIPELINE AND VECTOR PROCESSING
PIPELINE AND VECTOR PROCESSING PIPELINING: Pipelining is a technique of decomposing a sequential process into sub operations, with each sub process being executed in a special dedicated segment that operates
More informationUnderstanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures
Understanding The Behavior of Simultaneous Multithreaded and Multiprocessor Architectures Nagi N. Mekhiel Department of Electrical and Computer Engineering Ryerson University, Toronto, Ontario M5B 2K3
More informationCEC 450 Real-Time Systems
CEC 450 Real-Time Systems Lecture 6 Accounting for I/O Latency September 28, 2015 Sam Siewert A Service Release and Response C i WCET Input/Output Latency Interference Time Response Time = Time Actuation
More informationLow-Power Data Address Bus Encoding Method
Low-Power Data Address Bus Encoding Method Tsung-Hsi Weng, Wei-Hao Chiao, Jean Jyh-Jiun Shann, Chung-Ping Chung, and Jimmy Lu Dept. of Computer Science and Information Engineering, National Chao Tung University,
More informationOn the False Path Problem in Hard Real-Time Programs
On the False Path Problem in Hard Real-Time Programs Peter Altenbernd C-LAB D-33095 Paderborn, Germany peter@cadlab.de, http://www.cadlab.de/peter/ Abstract This paper addresses the important subject of
More informationWorst-Case Execution Times Analysis of MPEG-2 Decoding
Worst-Case Execution Times Analysis of MPEG-2 Decoding Peter Altenbernd Lars-Olof Burchard Friedhelm Stappert C-LAB PC 2 C-LAB 33095 Paderborn, GERMANY peter@c-lab.de baron@upb.de fst@c-lab.de Abstract
More informationThe CPU Design Kit: An Instructional Prototyping Platform. for Teaching Processor Design. Anujan Varma, Lampros Kalampoukas
The CPU Design Kit: An Instructional Prototyping Platform for Teaching Processor Design Anujan Varma, Lampros Kalampoukas Dimitrios Stiliadis, and Quinn Jacobson Computer Engineering Department University
More informationDSP/BIOS Kernel Scalable, Real-Time Kernel TM. for TMS320 DSPs. Product Bulletin
Product Bulletin TM DSP/BIOS Kernel Scalable, Real-Time Kernel TM for TMS320 DSPs Key Features: Fast, deterministic real-time kernel Scalable to very small footprint Tight integration with Code Composer
More informationDissecting Execution Traces to Understand Long Timing Effects
Dissecting Execution Traces to Understand Long Timing Effects Christine Rochange and Pascal Sainrat February 2005 Rapport IRIT-2005-6-R Contents 1. Introduction... 5 2. Long timing effects... 5 3. Methodology...
More informationELE 455/555 Computer System Engineering. Section 4 Parallel Processing Class 1 Challenges
ELE 455/555 Computer System Engineering Section 4 Class 1 Challenges Introduction Motivation Desire to provide more performance (processing) Scaling a single processor is limited Clock speeds Power concerns
More informationCHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS. Xiaodong Zhang and Yongsheng Song
CHAPTER 4 AN INTEGRATED APPROACH OF PERFORMANCE PREDICTION ON NETWORKS OF WORKSTATIONS Xiaodong Zhang and Yongsheng Song 1. INTRODUCTION Networks of Workstations (NOW) have become important distributed
More informationWhat is Pipelining? Time per instruction on unpipelined machine Number of pipe stages
What is Pipelining? Is a key implementation techniques used to make fast CPUs Is an implementation techniques whereby multiple instructions are overlapped in execution It takes advantage of parallelism
More informationMaintaining Temporal Consistency: Issues and Algorithms
Maintaining Temporal Consistency: Issues and Algorithms Ming Xiong, John A. Stankovic, Krithi Ramamritham, Don Towsley, Rajendran Sivasankaran Department of Computer Science University of Massachusetts
More informationEvaluation of Power Consumption of Modified Bubble, Quick and Radix Sort, Algorithm on the Dual Processor
Evaluation of Power Consumption of Modified Bubble, Quick and, Algorithm on the Dual Processor Ahmed M. Aliyu *1 Dr. P. B. Zirra *2 1 Post Graduate Student *1,2, Computer Science Department, Adamawa State
More informationREAL TIME DIGITAL SIGNAL PROCESSING
REAL TIME DIGITAL SIGNAL PROCESSING UTN - FRBA 2011 www.electron.frba.utn.edu.ar/dplab Introduction Why Digital? A brief comparison with analog. Advantages Flexibility. Easily modifiable and upgradeable.
More informationSolutions to exercises on Memory Hierarchy
Solutions to exercises on Memory Hierarchy J. Daniel García Sánchez (coordinator) David Expósito Singh Javier García Blas Computer Architecture ARCOS Group Computer Science and Engineering Department University
More informationShared Cache Aware Task Mapping for WCRT Minimization
Shared Cache Aware Task Mapping for WCRT Minimization Huping Ding & Tulika Mitra School of Computing, National University of Singapore Yun Liang Center for Energy-efficient Computing and Applications,
More information1.3 Data processing; data storage; data movement; and control.
CHAPTER 1 OVERVIEW ANSWERS TO QUESTIONS 1.1 Computer architecture refers to those attributes of a system visible to a programmer or, put another way, those attributes that have a direct impact on the logical
More informationReal Time Spectrogram
Real Time Spectrogram EDA385 Final Report Erik Karlsson, dt08ek2@student.lth.se David Winér, ael09dwi@student.lu.se Mattias Olsson, ael09mol@student.lu.se October 31, 2013 Abstract Our project is about
More informationA Time-Predictable Instruction-Cache Architecture that Uses Prefetching and Cache Locking
A Time-Predictable Instruction-Cache Architecture that Uses Prefetching and Cache Locking Bekim Cilku, Daniel Prokesch, Peter Puschner Institute of Computer Engineering Vienna University of Technology
More informationReal-Time Programming with GNAT: Specialised Kernels versus POSIX Threads
Real-Time Programming with GNAT: Specialised Kernels versus POSIX Threads Juan A. de la Puente 1, José F. Ruiz 1, and Jesús M. González-Barahona 2, 1 Universidad Politécnica de Madrid 2 Universidad Carlos
More informationHigh Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur
High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 18 Dynamic Instruction Scheduling with Branch Prediction
More informationAnalytical Modeling of Parallel Systems. To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003.
Analytical Modeling of Parallel Systems To accompany the text ``Introduction to Parallel Computing'', Addison Wesley, 2003. Topic Overview Sources of Overhead in Parallel Programs Performance Metrics for
More information2 MARKS Q&A 1 KNREDDY UNIT-I
2 MARKS Q&A 1 KNREDDY UNIT-I 1. What is bus; list the different types of buses with its function. A group of lines that serves as a connecting path for several devices is called a bus; TYPES: ADDRESS BUS,
More informationMain Points of the Computer Organization and System Software Module
Main Points of the Computer Organization and System Software Module You can find below the topics we have covered during the COSS module. Reading the relevant parts of the textbooks is essential for a
More information