Timing Anomalies and WCET Analysis

Ashrit Triambak


October 20, 2014

Contents

1 Abstract
2 Introduction
3 Timing Anomalies
  3.1 Related work
  3.2 Limitations of previous methods
  3.3 Classification (scheduling, speculation and cache timing anomalies)
  3.4 Examples of timing anomalies
  3.5 Further research on timing anomalies
4 WCET Analysis
  4.1 The influence of timing anomalies on WCET analysis
  4.2 Precise WCET analysis
  4.3 Methods of eliminating anomalies
5 Conclusion
References

1 Abstract

Timing anomalies add to the complexity of WCET analysis and hence make it difficult to apply divide-and-conquer strategies to simplify the analysis. Traditionally, timing anomalies were explained as a problem that occurs when the WCET of a control flow graph is computed from the WCETs of its sub-graphs. We consider examples illustrating that the worst-case execution time need not correspond to the worst-case behaviour. This study also presents examples where timing anomalies arise in much simpler hardware architectures, i.e. even in hardware that contains only in-order functional units. We further discuss that timing anomalies can occur in a parallel decomposition of the WCET problem, i.e. when complexity is reduced by splitting the hardware state space and performing a separate WCET analysis for hardware components in parallel. The potential occurrence of such parallel timing anomalies makes the parallel decomposition technique unsafe.

2 Introduction

Due to the temporal constraints required for the correct operation of a real-time system, predictability in the temporal domain is an important criterion to be satisfied. Worst-case execution time analysis is the research field investigating methods to assess the timing behaviour of real-time systems. A central part of analysing execution time behaviour is modeling the timing behaviour of the involved hardware within the analysis process. The analysis of simple hardware architectures without caches, complex pipelines or branch prediction mechanisms can be performed successfully with a number of well-established WCET analysis techniques. The estimation of an upper bound on the execution time, known as the worst-case execution time (WCET), is crucial for highly dependable real-time systems.
Because of pessimistic timing assumptions, the WCET is often grossly overestimated, which results in poor utilization of resources, mainly in real-time systems using high-performance processors with advanced pipelining and caching techniques. Recent research has shown that high-performance processors which exploit performance-enhancement mechanisms such as instruction-level parallelism and speculative execution cause considerable problems for WCET analysis. The interference of operations executed in different units of these processors and the high level of parallelism

makes it practically infeasible to evaluate exact worst-case timing models of such systems. Therefore the analysis is usually divided into a number of steps that model the timing behaviour of the sub-systems separately. The partial results thus computed are then combined into the final WCET bound for the code under consideration. It is important to note that this divide-and-conquer method only works if the partial results can be safely combined to yield the total result. Due to the complex inter-dependencies between the timing of parallel functional units, the safe application of the divide-and-conquer strategy is extremely complex or even impossible. The instruction schedule depends on the execution time of each individual instruction, so the scheduling of future instructions can cause a counter-intuitive increase or decrease in the execution time of the rest of the execution path. To find a safe estimate of the WCET in the presence of such anomalies, the effect of all possible schedules resulting from a variable-latency instruction has to be analysed, so that the instruction latency that leads to the longest overall execution time is found. In general, if we have n variable-latency instructions along a path in the program, where each instruction may lead to k different future schedules, then in the worst case one must analyse k^n different schedules. The two main tasks of WCET analysis tools are control flow analysis, which determines the feasible and infeasible paths in a program, and processor behaviour analysis, which models the hardware. The problems of processor behaviour analysis, namely timing anomalies, are discussed further below. Parallel timing anomalies are timing effects due to changes in the initial state.
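To make the k^n blow-up concrete, here is a brute-force search over all latency assignments. The additive timing model (`sum`) is a placeholder assumption; the point is only the size of the search space:

```python
from itertools import product

def wcet_by_enumeration(latency_choices, exec_time):
    """Exhaustively search all latency assignments.

    latency_choices: n lists, each holding the k possible latencies of one
    variable-latency instruction.  exec_time maps one concrete assignment
    to a total execution time.  The search space has k**n elements, which
    is what makes exhaustive analysis infeasible for realistic n.
    """
    return max(exec_time(assignment) for assignment in product(*latency_choices))

# Toy model: n = 4 instructions, k = 2 latencies each -> 2**4 = 16 schedules.
choices = [[1, 10]] * 4
print(wcet_by_enumeration(choices, sum))   # 40
```

With an anomaly-free model like `sum`, the maximum is reached simply by taking every instruction's worst case; the anomalies discussed below show why real pipelines do not permit this shortcut.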
Timing anomalies can also occur in a parallel decomposition of the WCET problem, i.e. when complexity is reduced by splitting the hardware state space and performing a separate WCET analysis for hardware components that work in parallel. The potential occurrence of parallel timing anomalies makes the parallel decomposition technique unsafe.

3 Timing Anomalies

The term timing anomaly is used to describe system behaviour where relaxing some constraint leads to an increase of the system's timing. This is typically caused by a greedy scheduler that cannot foresee the future impact of its local decisions. For example, with respect to worst-case execution time analysis, such a constraint may require the execution of two instruction sequences to finish within a given deadline. Decreasing the execution time of the first instruction sequence relaxes the constraint for the second instruction sequence to finish within the deadline, which can lead to timing anomalies [4]. An anomaly never exists alone, i.e. it is necessarily embedded in some context. The context of timing anomalies is the set of WCET analysis methods.

3.1 Related work

i. The term timing anomaly was introduced by Lundqvist and Stenström, who were among the first to discover this kind of abnormal timing behaviour on modern processor hardware. They presented examples of timing anomalies and identified out-of-order resources as the characteristic feature causing them. [2]

ii. A worst-case execution time analysis method was developed by Schneider for the Motorola PowerPC 755 architecture that handles the timing anomalies occurring in that specific architecture. [2]

iii. Thesing discussed the Motorola ColdFire 5307, which contains a simple in-order pipeline that does not have resources with overlapping abilities. He showed that the cache replacement strategy used in this processor, i.e. the round-robin method, causes timing anomalies. The effect of a cache miss on the cache state is sometimes different from that of a cache hit.

3.2 Limitations of previous methods

For a correct estimation of the WCET, the effects of all variations in instruction execution times on all possible instruction schedules have to be considered. The problems that arise when performing accurate pipeline analysis for dynamically scheduled processors, and how previous methods failed, are explained below. Consider the following definitions.

Definition 1 The current pipeline state is the current state of the pipeline timing model. It describes which instructions are currently executing in the pipeline and the current resource allocations.

Definition 2 The current cache state is the current content of the cache timing model. It consists of the cache tag memory, i.e. the identification tags of the blocks currently in the cache.

Consider a program that contains a single feasible path. The WCET is the longest execution time of the instruction sequence along this path. If the sequence contains n variable-latency instructions with unknown latencies, and we know that each instruction can have k different latencies, then to be safe we have to examine k^n instruction schedules, which is not feasible. Normally, timing analysis methods make safe decisions at the instruction or basic-block level. Consider a partial sequence of instructions, i.e. a basic block, containing a variable-latency instruction. When the execution of this partial sequence is simulated in the pipeline, we may end up with k different pipeline states. For safety, we must choose the pipeline state that will give the longest overall execution time, which is impossible without knowledge of the whole instruction sequence. Previous methods for cache and pipeline analysis perform the latter by looking at the instructions or basic blocks first, and then combine the WCETs of all these entities into a total WCET for the whole program. This, however, is not a sound way to obtain the overall longest execution time. Whenever it was not possible to classify a cache access as a miss or a hit, it was by default assumed to be a cache miss; this leads to pessimistic estimates, and, as the examples of timing anomalies show, the miss is not even guaranteed to be the worst case. Next, consider a program containing several feasible paths. The WCET is the maximum WCET found among all the paths, so to find it we would have to examine every path in the program.

3.3 Classification

Scheduling Timing Anomalies

In this type of timing anomaly, the length of the pivot task in two task sets is compared.
Let us take the example of a cache hit vs. a cache miss. In Figure 1, the two task sets differ only in the length of task A. This kind of timing anomaly has been extensively studied for different scheduling policies.

The resulting schedule depends on the length of task A, as can be seen in the figure. A greedy scheduler is in general unable to prevent such anomalies.

Figure 1: Scheduling Anomaly

Speculation Timing Anomalies

In this type of timing anomaly, the entire task set changes depending on the pivot task, not just its length. The example in Figure 2 shows that in both cases the processor prefetches instructions. A cache miss while prefetching the first instruction, which can be the local worst case, takes so much time that the branch condition can be evaluated before further prefetches can do more harm to the cache.

Figure 2: Speculation Anomaly

Cache Timing Anomalies

These timing anomalies occur due to abnormal cache behaviour. Consider the ColdFire 5307 processor, where the non-local worst case of a cache hit results in a different future cache state than the local worst case of a cache

miss. This difference in the cache state can cause the cache-hit branch to be stalled later on.

3.4 Examples of timing anomalies

This section gives examples of the timing anomalies present in dynamically scheduled processors. The term dynamically scheduled processor is used to describe a processor in which instructions may execute out of program order. The execution time of an instruction can take one of many discrete values depending on the input data. For example, the execution time of a load instruction depends on whether the address hits or misses in the cache. Another example is an arithmetic instruction whose execution time may depend on its operands. In the following sections we use the term latency for the instruction execution time, and execution time for the overall execution time of the program. To model instruction execution in a pipelined processor, a hardware model is most often used. In this model, whenever an instruction proceeding through the pipeline is stalled, it is due to resource contention with another instruction that accesses a common resource or operand. Examples of resources are functional units and registers; read and write ports, buses and buffers are also treated as resources if they can cause an instruction to stall. Resources are divided into two types. The first type is in-order resources, such as registers, which are allocated in program order of execution. The second type is out-of-order resources, such as functional units, in which a newer instruction can use a resource before an older instruction, according to some dynamic scheduling decision. If a processor contains only in-order resources, then no timing anomalies can occur [1], because two instructions can then only use a resource in program order.
If the completion of an instruction is postponed by x cycles, then later instructions are also postponed, as the resources cannot be allocated before the earlier instructions complete execution. If out-of-order resources are present, timing anomalies can occur. The timing anomalies presented below are studied on a simplified PowerPC architecture (Figure 3) containing no floating-point units. The architecture consists of a multiple-issue pipeline capable of dispatching two instructions each clock cycle, and separate instruction and data caches.

Figure 3: A simplified yet timing-anomalous PowerPC architecture

Each functional unit has two reservation stations to implement out-of-order execution of instructions. These can hold dispatched instructions before their operands are available. All resources in the processor are in-order resources except the integer unit (IU) and the multiple-cycle integer unit (MCIU), which are out-of-order resources. The timing anomalies observed in this architecture are discussed below.

Anomaly 1: Cache hits can result in worst-case timing

This anomaly covers a case where a data cache hit causes an overall longer execution time than a data cache miss. The table in Figure 4 shows when each functional unit is busy executing an instruction. The horizontal dashed lines show when the reservation stations are occupied. At the top, the arrows indicate when each instruction is dispatched to the reservation stations. We can think of two cases: one

Figure 4: Example of a cache hit causing a longer execution time than a cache miss

when the load address hits in the data cache and another when it misses. If the load address hits in the cache, the LD instruction executes for 2 cycles and can then forward its result to instruction B, which starts executing in cycle 3. Here it is assumed that B gets priority over C since B is older. If, on the other hand, the load address misses in the cache, the LD instruction executes for 10 cycles and the execution of B is postponed. This means that C can start executing in cycle 3, one cycle earlier than in the cache-hit case, which in turn makes D and E execute one cycle earlier. Hence there is an overall reduction of the execution time by 1 cycle in the cache-miss case. The anomaly is due to the fact that the unit is an out-of-order resource, permitting B and C to execute out of order.

Anomaly 2: Miss penalties can be higher than expected

This example shows that the overall penalty in execution time due to a cache miss can be greater than the cache-miss penalty. In the table depicted in Figure 5, the first instruction is a load instruction

which can either hit or miss the cache. Assume that the second load instruction (C) always misses. The first three instructions A, B and C depend on each other and must execute one at a time. In the cache-hit case, all instructions execute as soon as possible. As the last instruction D does not depend on any other instruction, it does not interfere with the execution of the others. If, on the other hand, the first load instruction experiences a cache miss, the execution of B is postponed. Instruction D, which is ready for execution, will already have started executing when B becomes eligible, and hence the execution of B is postponed further. In this case instruction C finishes its execution 11 clock cycles later than in the cache-hit case, which is greater than the normal cache-miss penalty of 8 clock cycles. The anomaly is due to the MCIU being an out-of-order resource, which allows instructions B and D to execute in arbitrary order.

Figure 5: Example of a cache-miss penalty being higher than expected
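The mechanism behind Anomalies 1 and 2 — an out-of-order unit running the oldest ready instruction first — can be reproduced in a few lines. The latencies below (LD = 2 cycles on a hit and 4 on a miss, B = C = 1 cycle, a 3-cycle dependent D, C ready at cycle 2) are assumptions chosen for brevity, not the exact values of Figures 4 and 5:

```python
def total_time(ld_latency):
    """Greedy oldest-first scheduling on one shared integer unit.

    B depends on the load LD; C is independent but needs the same unit;
    D (3 cycles, on another unit) depends on C.
    """
    ready = {"B": ld_latency, "C": 2}   # cycle at which each becomes ready
    dur = {"B": 1, "C": 1}
    pending = ["B", "C"]                # program order: B is older than C
    t = 0                               # cycle at which the unit is next free
    finish = {}
    while pending:
        t = max(t, min(ready[i] for i in pending))
        # among the instructions ready at cycle t, run the oldest first
        inst = next(i for i in pending if ready[i] <= t)
        finish[inst] = t = t + dur[inst]
        pending.remove(inst)
    return max(finish["C"] + 3, finish["B"])   # D finishes 3 cycles after C

print(total_time(2))   # cache hit:  LD = 2 -> total 7 cycles
print(total_time(4))   # cache miss: LD = 4 -> total 6 cycles (shorter!)
```

On the hit, B and C become ready together and the older B wins the unit, delaying C and the chain behind it; on the miss, C runs first and the total time drops.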

Anomaly 3: The impact on the WCET may not be bounded

The increase in the cache-miss penalty is not limited by a constant value, but can be proportional to the length of the program. This means that a small interference at the beginning of the execution may contribute an arbitrarily high penalty to the overall execution time. Consider the instruction sequence in Figure 6. The two instructions A and B constitute the body of a loop. E_A refers to the clock cycle when A executed in the previous iteration of the loop, and D_A to the clock cycle when A was dispatched in the current iteration. We consider two different execution scenarios. In the fast case, instruction A in the first iteration executes immediately when it is dispatched. In the slow case it is delayed by one clock cycle due to a dependency on an earlier instruction. This initial delay postpones the execution of A by one clock cycle in every iteration; this is known as the domino effect. The total penalty on the execution time caused by the small initial delay of A will be k clock cycles if the loop runs k iterations.

Figure 6: Example of domino effects
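A toy arithmetic model of the domino effect: assume, as in the figure's fast case, that consecutive iterations overlap and cost 2 cycles each, while a one-cycle initial delay breaks the overlap in every iteration and raises the cost to 3 cycles. The per-iteration numbers are assumptions; only the linear growth of the penalty matters:

```python
def loop_time(iterations, delayed):
    """Fast case: 2 cycles per iteration.  Slow case: the initial 1-cycle
    delay re-creates itself in each iteration, so every iteration costs 3."""
    per_iteration = 3 if delayed else 2
    return (1 if delayed else 0) + per_iteration * iterations

for k in (10, 100, 1000):
    penalty = loop_time(k, True) - loop_time(k, False)
    print(k, penalty)   # penalty = k + 1: proportional to the iteration count
```

The penalty grows with the number of iterations, so no constant bound on the effect of the initial delay exists.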

3.5 Further research on timing anomalies

Hardware-related terminology

A multiple-issue processor, also known as a superscalar processor, is characterized by the following properties:

1. The pipeline in a superscalar processor includes all features of a classical pipeline, but in addition instructions may execute simultaneously in the same pipeline stage.

2. The execution of several instructions can be initiated in one clock cycle.

The instructions are dynamically scheduled, i.e. the actual instruction grouping is done at runtime, in contrast to VLIW architectures. The term dispatch refers to the primary distribution of the instructions among the particular subsystems of functional units, and the term issue refers to the assignment of an instruction to a particular functional unit for immediate execution.

In-order resources: resources that are allocated in program order of execution, e.g. registers.

Out-of-order resources: resources that can be allocated to instructions dynamically, meaning that a newer instruction can use the resource before an older one, e.g. functional units.

Example hardware architecture

Consider the abstracted hardware architecture in Figure 7(a), which includes the following units. Instructions are dispatched by the DS stage to the respective reservation stations RSi. When no free reservation station buffers are available, the dispatch stalls. Subsequently, the instructions are issued to the respective functional units. When the functional unit is idle on dispatch, the dispatch and issue operations coincide in one cycle. Whenever an instruction is issued to a functional unit, its reservation station entry remains allocated to it until the instruction has finished its execution and can be sent to the reorder buffer (ROB), where it is committed. Model M1 in Figure 7a uses the following issue policy:
(i) the functional units serve disjoint sets of instruction types, and (ii) at most one instruction per cycle is assumed to be dispatched.

Model M2 in Figure 7b comprises two functional units serving overlapping sets of instruction types, without any reservation stations. Functional unit FU1 is able to serve all instructions of class IC1, while functional unit FU2 serves classes IC1 and IC2. This means that FU2 can handle more instruction types than FU1: instructions dispatched to FU1 could also be executed in FU2, but the reverse is not possible.

(a) Model M1 (b) Model M2

Figure 7: Model M1 with two reservation stations, allowing out-of-order allocation of the functional units. Model M2 consisting of two non-equivalent functional units.

(a) Resource requirements of model M1 (b) Resource requirements of model M2

Figure 8: Instruction sequences

The table in Figure 8a depicts the resource requirements of the instruction sequence (A, B, C, D) for model M1. Here it is assumed that instructions A and D require FU1, while B and C require FU2.

The table in Figure 8b depicts the resource requirements of the instruction sequence (A, B, C, D) for model M2. Here it is assumed that instructions A, C and D can be executed on either FU1 or FU2, while B can be executed only on FU2.

Timing anomalies caused by out-of-order resources

(a) Counter-directive timing anomaly (b) Strong-impact timing anomaly

Figure 9: Examples for model M1

Figures 9a and 9b show the execution of the sequence of Figure 8a on model M1 in a timing diagram. Figure 9a shows an example of a counter-directive timing anomaly and Figure 9b an example of a strong-impact

timing anomaly. The arrows below the instruction labels symbolize the instruction dispatch events. The instruction latencies are provided in a small box on the right side of the diagram, and the arrows next to the latencies represent the dependency relationships between the instructions. Each figure illustrates two cases: the first two rows represent case 1 and the next two rows case 2, showing the execution of the instruction sequence with the given latencies. The bars represent the utilization of the respective functional unit, and the dotted lines above the bars the reservation stations allocated by the instructions.

Counter-directive timing anomaly

In case 1 the instructions are executed with the latencies given; the execution of the instruction sequence completes in 10 clock cycles. In case 2 the latency of instruction A is increased by 2 clock cycles. Since instruction B depends on instruction A, its execution is delayed. While A is still executing, instruction C starts executing, as it depends on neither A nor B. When A and C have finished, instructions D and B execute during the same clock cycles, since they do not depend on each other. The total execution time is 8 clock cycles. Comparing the total execution times of case 1 and case 2, we observe a counter-directive timing anomaly: the increase in instruction latency leads to a decrease in total execution time.

Strong-impact timing anomaly

In case 1 the instructions are executed with the latencies given; the execution of the instruction sequence completes in 7 clock cycles. In case 2 the latency of instruction A is increased by 2 clock cycles. Since instruction B depends on instruction A, its execution is delayed.
While A is still executing, instruction C starts executing, as it depends on neither A nor B. When A and C have finished, instruction B starts executing, followed by instruction D. The total execution time is 11 clock cycles. Comparing the total execution times of case 1 and case 2, we observe a strong-impact timing anomaly: the increase in instruction latency leads to an even greater increase in total execution

time.

Timing anomalies caused by in-order resources

(a) Counter-directive timing anomaly (b) Strong-impact timing anomaly

Figure 10: Examples for model M2

In section 3.4 it was claimed that timing anomalies cannot occur when a processor contains only in-order resources, but it was later shown that timing anomalies can occur in the presence of in-order resources as well. Figures 10a and 10b show the execution of the sequence of Figure 8b on model M2 in a timing diagram. Figure 10a shows an example of a counter-directive timing anomaly and Figure 10b an example of a strong-impact timing anomaly. The arrows below the instruction labels symbolize the instruction dispatch events, and the instruction latencies are provided in a small box on the right side of the diagram. Each figure illustrates two cases: the first two rows represent case 1 and the next two rows case 2, showing the execution of the instruction sequence with the given latencies. The bars represent the utilization of the respective functional unit, and the dotted lines above the bars the reservation stations allocated by the instructions.

Counter-directive timing anomaly

In case 1 the instructions are executed with the latencies given; the execution of the instruction sequence completes in 8 clock cycles.

In case 2 the latency of instruction B is increased by 2 clock cycles, and the total execution time is 7 clock cycles. Comparing the total execution times of case 1 and case 2, we observe a counter-directive timing anomaly: the increase in instruction latency leads to a decrease in overall execution time.

Strong-impact timing anomaly

In case 1 the instructions are executed with the latencies given; the execution of the instruction sequence completes in 7 clock cycles. In case 2 the latency of instruction A is increased by 2 clock cycles, and the total execution time is 10 clock cycles. Comparing the total execution times of case 1 and case 2, we observe a strong-impact timing anomaly: the increase in instruction latency leads to an even greater increase in overall execution time.

4 WCET Analysis

4.1 The influence of timing anomalies on WCET analysis

The presence of timing anomalies makes the WCET analysis process extremely complex. Consequently, hardware allowing timing anomalies can only be analysed safely using the pessimistic serial-execution method [1], leading to nearly useless results due to high overestimation. For the traditional analysis methods, the following assumptions have to be satisfied.

Monotonicity assumption

When certain information is processed by a WCET analysis approach, it is assumed that a longer latency for an instruction imposes an at least equal or longer execution time on the overall instruction sequence under consideration [2]. This assumption is implicitly included in many WCET approaches. In the presence of timing anomalies it does not hold.
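A counter-directive anomaly is exactly a violation of this monotonicity assumption, and a strong-impact anomaly a violation of the related proportionality property. The two cases can be told apart mechanically from the latency change and the resulting change in total time; the deltas below are those of the model M1 examples:

```python
def classify(delta_latency, delta_total):
    """Classify the effect of increasing one instruction's latency."""
    if delta_latency > 0 and delta_total < 0:
        return "counter-directive anomaly"
    if delta_latency > 0 and delta_total > delta_latency:
        return "strong-impact anomaly"
    return "no anomaly"

# Model M1: A's latency grows by 2 clock cycles in both examples.
print(classify(2, 8 - 10))   # total drops 10 -> 8:  counter-directive anomaly
print(classify(2, 11 - 7))   # total grows 7 -> 11:  strong-impact anomaly
print(classify(2, 2))        # total grows by exactly the delta: no anomaly
```

A monotone, proportional processor would always land in the third branch, which is precisely what the traditional analysis methods assume.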

Basic composability assumption

In many WCET calculation methods, a basic composability assumption is taken for granted. It denotes the fact that WCET bounds for sub-paths can be safely composed by the WCET calculation method into WCET bounds for the composite paths [2]. This validity of sub-path WCET bounds is lost when timing anomalies are present. If the WCET bound of a loop body is underestimated in a particular execution context due to a violation of the monotonicity and composability assumptions, this error will multiply at runtime depending on the actual input data values; this phenomenon is called an unbounded timing effect.

4.2 Precise WCET analysis

In the previous sections we described the timing anomaly as a problem that occurs when the WCET of a control flow graph is computed from the WCETs of its sub-graphs, i.e. from a series decomposition. Timing anomalies occur not only between the timing of two subsequent instruction sequences, but also between the latency of individual processor components and the total execution time. A timing anomaly can occur in a parallel decomposition of the WCET problem, i.e. when complexity is reduced by splitting the hardware state space and performing a separate WCET analysis for hardware components in parallel, and the potential occurrence of such parallel timing anomalies makes the parallel decomposition technique unsafe. The two main tasks of WCET analysis tools are control flow analysis, which determines the feasible and infeasible paths in a program, and processor behaviour analysis, which models the hardware. Timing anomalies are a challenge for WCET analysis because they violate the continuity properties, proportionality and monotonicity, of program execution [1]. Parallel timing anomalies are timing effects due to changes in the initial state. WCET analysis with parallel decomposition is a method to reduce the complexity of processor behaviour analysis.
The idea of parallel decomposition is to calculate the WCET of an instruction sequence in two steps. Before performing this calculation, the state space is divided into two parts: the state space of one processor component, and the state space of the remaining components. For example, the hardware

component may be the instruction cache, while the other state fraction covers the pipeline and the remaining processor components. In the first step the timing of the chosen processor component is analysed and one state (say A) is chosen. Based on this choice, the overall processor timing is analysed in the second step by searching the state space while using the result for state A.

4.3 Methods of eliminating anomalies

In this section we discuss two approaches to estimating the WCET of a program running on a dynamically scheduled processor where timing anomalies may occur.

Pessimistic serial-execution method

One of the straightforward ways to make safe estimations for an architecture containing anomalies is the pessimistic serial-execution method. It assumes that all instructions are executed in order in the functional units: we sum up all instruction latencies in the functional units and additionally add the miss penalties for all instruction and data cache misses. The WCET corresponding to a serial execution of the instructions, assuming their worst-case latencies, is always higher than the WCET corresponding to any pipelined execution of the same sequence. Theoretically, instructions cannot be executed slower than in order, since that would mean some functional units are idle while instructions are ready for execution. The only remaining possibility for an instruction to stall is a cache miss, which we account for separately. A serial-execution estimate is therefore safe, but far too pessimistic. Its biggest advantage is that unknown events in the system are handled safely: they cannot lead to an execution time greater than the one estimated for serial execution.

Program modification method

The serial-execution method discussed above is very pessimistic.
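The serial-execution bound just described is a plain sum, which is both why it is safe and why it is so pessimistic. A minimal sketch, with latency and penalty numbers invented for illustration:

```python
def serial_wcet_bound(worst_latencies, possible_misses, miss_penalty):
    """Pessimistic serial-execution bound: every instruction runs alone
    with its worst-case latency, and every access that may miss is
    charged a full miss penalty."""
    return sum(worst_latencies) + possible_misses * miss_penalty

# 5 instructions, up to 3 accesses that may miss, at 8 cycles per miss.
print(serial_wcet_bound([2, 1, 1, 4, 1], possible_misses=3, miss_penalty=8))  # 33
```

The bound ignores all pipelining and overlap, so it is safe regardless of anomalies, at the cost of heavy overestimation.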
If we want a tighter WCET estimate, we must model the pipelined execution accurately and deal with the problem of timing anomalies. One way of accomplishing this is to modify the program so that we can rely on safe local

decisions. We want to make the following conditions true [2]:

1. All variable-latency instructions that have an unknown latency must, when simulated, still result in a predictable pipeline state. In addition, other unknown events such as unknown instruction cache accesses must also result in a predictable pipeline state.

2. If the number of paths in a small section of the program is reduced by selecting the longest one or discarding the shortest ones, then the states of the pipeline and the caches at the beginning and the end of the paths must not differ.

The first condition can be fulfilled by forcing in-order resource use when executing the variable-latency instruction. Then the pipeline state must be predictable before out-of-order resource use is allowed again. The way to accomplish this is highly architecture dependent. For example, the PowerPC architecture provides a memory synchronization instruction called sync, which inhibits further dispatching until the sync instruction completes. The use of sync instructions ensures the following:

1. All previously issued instructions have completed, at least to a point where they can no longer cause an exception. The sync instruction can be used to ensure that memory accesses are complete.

2. Previously issued instructions complete in the context in which they were issued (privilege, protection, address translation). Instructions issued after the synchronizing instruction execute in the new context.

3. The instruction queue is flushed and all its instructions are refetched with the new context in place, to ensure that context changes take effect for instructions after the synchronization.

This instruction can thus be used to force serialization around a variable-latency instruction: if a sync is placed after the variable-latency instruction, then the pipeline state is known afterwards.
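As a sketch, this modification can be phrased as a rewriting pass over an instruction list. Everything here is illustrative: `sync` is just a marker string, the predicate deciding which instructions are variable-latency is a stand-in, and real placement is architecture dependent:

```python
def insert_syncs(instructions, is_variable_latency):
    """Place a sync after every variable-latency instruction so that the
    pipeline state is predictable again afterwards."""
    out = []
    for inst in instructions:
        out.append(inst)
        if is_variable_latency(inst):
            out.append("sync")
    return out

program = ["add r1,r2", "lwz r3,0(r4)", "mul r5,r1"]
print(insert_syncs(program, lambda i: i.startswith("lwz")))
# ['add r1,r2', 'lwz r3,0(r4)', 'sync', 'mul r5,r1']
```

The cost of each inserted sync must of course be accounted for in the analysis, which is the price paid for the predictable pipeline state.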
Similarly, if another sync is placed before the variable-latency instruction, we know that the instruction will execute in order.

The second condition can also be fulfilled with the sync instruction, as far as the pipeline state is concerned: placing a sync instruction at the end of two paths, for example, makes the pipeline states of the two paths equal. It is also necessary that the cache states at the end of the two paths are equal. How to achieve this is again highly architecture dependent. There exist several options [2]:

1. One can invalidate all blocks in the caches. This should be possible in almost all processors.

2. One can invalidate only the blocks that differ between the two caches. This requires support for invalidation at the block level.

3. One can replace the blocks that differ with blocks that will be needed in the future, by preloading blocks into the caches. This requires support for explicitly loading blocks into a cache.

The first option, invalidating the entire contents of the cache, is rarely the best solution, since performance becomes poor. The same holds for the second option, since each invalidated block will cause an additional cache miss later on. The third option is the most promising one, but it requires special instructions to preload the cache. Examples of such instructions are the instruction and data cache block touch instructions (icbt and dcbt) found in the PowerPC architecture. When blocks are preloaded, it is best to preload blocks that will be needed somewhere along the worst-case path. Additionally, preload instructions should be placed outside loops where possible, to reduce the overhead.

5 Conclusion

The high complexity of today's processors makes processor behaviour analysis one of the most challenging problems of WCET analysis. Features such as pipelines and caches create a huge state space, and effects like timing anomalies make it impossible to construct an efficient processor behaviour analysis that does not search the whole state space for the whole program. Previous methods fail in estimating the WCET because they assumed that one can rely on worst-case assumptions for local entities, such as instructions and basic blocks, to estimate the effect on the overall WCET.
For a certain class of programs running on dynamically scheduled processors, it is possible to make safe and tight estimates of the WCET. There must, however, be support in the architecture to explicitly control the state of the caches and the resource allocation in the pipeline. The processor can then be forced to allocate resources in order, resulting in a stable scheduling of instructions, but probably also in lower performance. If there is no support for controlling the pipeline state, one is forced to use the serial-execution method, which often leads to more pessimism in the estimated WCET.

Whenever the processor contains resources that allow runtime resource allocation decisions (e.g., out-of-order pipelines, or functional units serving different instruction types), latency variations of single instructions may cause timing anomalies further down the instruction stream on this particular type of hardware. Such latency variations may result from caches, branch prediction mechanisms, or different operand values (e.g., in floating-point operations). If the actually executed instruction sequence does not cause any dynamic resource allocation decision at execution time, the combination of hardware and software can be guaranteed to be free of timing anomalies.

References

[1] Raimund Kirner, Albrecht Kadlec, and Peter Puschner. Precise worst-case execution time analysis for processors with timing anomalies. In 21st Euromicro Conference on Real-Time Systems (ECRTS 2009). IEEE, 2009.

[2] Thomas Lundqvist and Per Stenström. Timing anomalies in dynamically scheduled microprocessors. In Proceedings of the 20th IEEE Real-Time Systems Symposium (RTSS 1999). IEEE, 1999.

[3] Jan Reineke, Björn Wachter, Stefan Thesing, Reinhard Wilhelm, Ilia Polian, Jochen Eisinger, and Bernd Becker. A definition and classification of timing anomalies. In 6th International Workshop on Worst-Case Execution Time Analysis (WCET 2006), OASIcs (OpenAccess Series in Informatics), volume 4. Schloss Dagstuhl, Leibniz-Zentrum für Informatik, 2006.

[4] Ingomar Wenzel, Raimund Kirner, Peter Puschner, and Bernhard Rieder. Principles of timing anomalies in superscalar processors. In Fifth International Conference on Quality Software (QSIC 2005). IEEE, 2005.

More information

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline Instruction Pipelining Review: MIPS In-Order Single-Issue Integer Pipeline Performance of Pipelines with Stalls Pipeline Hazards Structural hazards Data hazards Minimizing Data hazard Stalls by Forwarding

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2. Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)

More information

CS252 Graduate Computer Architecture Midterm 1 Solutions

CS252 Graduate Computer Architecture Midterm 1 Solutions CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate

More information

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011 5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation M. Sonza Reorda Politecnico di Torino Dipartimento di Automatica e Informatica 1 Introduction Hardware-based speculation is a technique for reducing the effects of control dependences

More information

Hardware-Software Codesign. 9. Worst Case Execution Time Analysis

Hardware-Software Codesign. 9. Worst Case Execution Time Analysis Hardware-Software Codesign 9. Worst Case Execution Time Analysis Lothar Thiele 9-1 System Design Specification System Synthesis Estimation SW-Compilation Intellectual Prop. Code Instruction Set HW-Synthesis

More information

There are different characteristics for exceptions. They are as follows:

There are different characteristics for exceptions. They are as follows: e-pg PATHSHALA- Computer Science Computer Architecture Module 15 Exception handling and floating point pipelines The objectives of this module are to discuss about exceptions and look at how the MIPS architecture

More information

Architectural Time-predictability Factor (ATF) to Measure Architectural Time Predictability

Architectural Time-predictability Factor (ATF) to Measure Architectural Time Predictability Architectural Time-predictability Factor (ATF) to Measure Architectural Time Predictability Yiqiang Ding, Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University Outline

More information

VIII. DSP Processors. Digital Signal Processing 8 December 24, 2009

VIII. DSP Processors. Digital Signal Processing 8 December 24, 2009 Digital Signal Processing 8 December 24, 2009 VIII. DSP Processors 2007 Syllabus: Introduction to programmable DSPs: Multiplier and Multiplier-Accumulator (MAC), Modified bus structures and memory access

More information

Advanced processor designs

Advanced processor designs Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The

More information

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance 6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,

More information

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging 6.823, L8--1 Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Highly-Associative

More information

CSC D70: Compiler Optimization Register Allocation

CSC D70: Compiler Optimization Register Allocation CSC D70: Compiler Optimization Register Allocation Prof. Gennady Pekhimenko University of Toronto Winter 2018 The content of this lecture is adapted from the lectures of Todd Mowry and Phillip Gibbons

More information

15-740/ Computer Architecture Lecture 12: Issues in OoO Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011

15-740/ Computer Architecture Lecture 12: Issues in OoO Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011 15-740/18-740 Computer Architecture Lecture 12: Issues in OoO Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011 Reviews Due next Monday Mutlu et al., Runahead Execution: An Alternative

More information