Timing Anomalies and WCET Analysis

Ashrit Triambak


October 20, 2014

Contents

1 Abstract
2 Introduction
3 Timing Anomalies
  3.1 Related work
  3.2 Limitations of previous methods
  3.3 Classification (scheduling, speculation and cache timing anomalies)
  3.4 Examples of timing anomalies
  3.5 Further research on timing anomalies
4 WCET Analysis
  4.1 The influence of timing anomalies on WCET analysis
  4.2 Precise WCET analysis
  4.3 Methods of eliminating anomalies
5 Conclusion
References

1 Abstract

Timing anomalies add to the complexity of WCET analysis and hence make it difficult to apply divide-and-conquer strategies to simplify the analysis. Traditionally, timing anomalies were explained as a problem that occurs when the WCET of a control flow graph is computed from the WCETs of its sub-graphs. We consider examples illustrating that the worst-case execution time need not correspond to the worst-case behaviour. This study also presents examples where timing anomalies arise in much simpler hardware architectures, i.e. even in hardware that contains only in-order functional units. We further discuss that timing anomalies can occur in a parallel decomposition of the WCET problem, i.e. when complexity is reduced by splitting the hardware state space and performing a separate WCET analysis for hardware components in parallel. The potential occurrence of such parallel timing anomalies makes the parallel decomposition technique unsafe.

2 Introduction

Due to the temporal constraints required for the correct operation of a real-time system, predictability in the temporal domain is an important criterion to be satisfied. Worst-case execution time analysis is the research field investigating methods to assess the timing behaviour of real-time systems. A central part of analysing execution time behaviour is modeling the timing behaviour of the involved hardware within the analysis process. The analysis of simple hardware architectures without caches, complex pipelines or branch prediction mechanisms can be performed successfully with a number of well-established WCET analysis techniques. The estimation of an upper bound on the execution time, known as the worst-case execution time (WCET), is crucial for highly dependable real-time systems.
Because of pessimistic timing assumptions, the WCET is often grossly overestimated, which results in poor utilization of resources, mainly in real-time systems using high-performance processors with advanced pipelining and caching techniques. Recent research has shown that high-performance processors which exploit performance-enhancement mechanisms such as instruction-level parallelism and speculative execution cause considerable problems for WCET analysis. The interference of operations executed in different units of these processors and the high level of parallelism

makes it practically infeasible to evaluate exact worst-case timing models of such systems. Therefore the analysis is usually divided into a number of steps that model the timing behaviour of the sub-systems separately. The partial results thus computed are then combined into the final WCET bound for the code under consideration. It is important to note that this divide-and-conquer method only works if the partial results can be safely combined to yield the total result. Due to the complex inter-dependencies between the timing of parallel functional units, the safe application of the divide-and-conquer strategy is extremely complex or even impossible. The instruction schedule depends on the execution time of each individual instruction, so the scheduling of future instructions can cause a counter-intuitive increase or decrease in the execution time of the rest of the execution path. To find a safe estimate of the WCET in the presence of such anomalies, the effect of all possible schedules resulting from a variable-latency instruction has to be analysed, so that the instruction latency that leads to the longest overall execution time is found. In general, if we have n variable-latency instructions along a path in the program, where each instruction may lead to k different future schedules, then in the worst case one must analyse k^n different schedules. The two main tasks of WCET analysis tools are control flow analysis, which determines the feasible and infeasible paths in a program, and processor behaviour analysis, which models the hardware. The problems of processor behaviour analysis, namely timing anomalies, are discussed further below. Parallel timing anomalies are timing effects due to changes in the initial state.
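To make the k^n blow-up concrete, here is a brute-force search over all latency assignments. The additive timing model (`sum`) is a placeholder assumption; the point is only the size of the search space:

```python
from itertools import product

def wcet_by_enumeration(latency_choices, exec_time):
    """Exhaustively search all latency assignments.

    latency_choices: n lists, each holding the k possible latencies of one
    variable-latency instruction.  exec_time maps one concrete assignment
    to a total execution time.  The search space has k**n elements, which
    is what makes exhaustive analysis infeasible for realistic n.
    """
    return max(exec_time(assignment) for assignment in product(*latency_choices))

# Toy model: n = 4 instructions, k = 2 latencies each -> 2**4 = 16 schedules.
choices = [[1, 10]] * 4
print(wcet_by_enumeration(choices, sum))   # 40
```

With an anomaly-free model like `sum`, the maximum is reached simply by taking every instruction's worst case; the anomalies discussed below show why real pipelines do not permit this shortcut.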
Timing anomalies can also occur in a parallel decomposition of the WCET problem, i.e. when complexity is reduced by splitting the hardware state space and performing a separate WCET analysis for hardware components that work in parallel. The potential occurrence of parallel timing anomalies makes the parallel decomposition technique unsafe.

3 Timing Anomalies

The term timing anomaly is used to describe system behaviour where relaxing some constraint leads to an increase of the system's timing. This is typically caused by a greedy scheduler that cannot foresee the future impact of its local decisions. For example, with respect to worst-case execution time analysis, such a constraint may require the execution of two instruction sequences to finish within a given deadline. Decreasing the execution time of the first instruction sequence relaxes the constraint for the second instruction sequence to finish within the deadline, which can lead to timing anomalies [4]. An anomaly never exists alone, i.e. it is necessarily embedded in some context. The context of timing anomalies is the set of WCET analysis methods.

3.1 Related work

i. The term timing anomaly was introduced by Lundqvist and Stenström, who were among the first to discover this kind of abnormal timing behaviour on modern processor hardware. They presented examples of timing anomalies and identified out-of-order resources as the characteristic feature causing them. [2]

ii. A worst-case execution time analysis method was developed by Schneider for the Motorola PowerPC 755 architecture that handles the timing anomalies occurring in that specific architecture. [2]

iii. Thesing discussed the Motorola ColdFire 5307, which contains a simple in-order pipeline that does not have resources with overlapping abilities. He showed that the cache replacement strategy used in this processor, i.e. the round-robin method, causes timing anomalies. The effect of a cache miss on the cache state is sometimes different from that of a cache hit.

3.2 Limitations of previous methods

For a correct estimation of the WCET, the effects of all variations in instruction execution times on all possible instruction schedules have to be considered. The problems that arise when performing accurate pipeline analysis for dynamically scheduled processors, and how previous methods failed, are explained below. Consider the following definitions.

Definition 1 The current pipeline state is the current state of the pipeline timing model. It describes which instructions are currently executing in the pipeline and the current resource allocations.

Definition 2 The current cache state is the current content of the cache timing model. It consists of the cache tag memory, i.e. the identification tags of the blocks currently in the cache.

Consider a program that contains a single feasible path. The WCET is the longest execution time of the instruction sequence along this path. If the sequence contains n variable-latency instructions with unknown latencies, and we know that each instruction can have k different latencies, then to be safe we have to examine k^n instruction schedules, which is not feasible. Normally, timing analysis methods make safe decisions at the instruction or basic-block level. Consider a partial sequence of instructions, i.e. a basic block, containing a variable-latency instruction. When the execution of this partial sequence is simulated in the pipeline, we may end up with k different pipeline states. For safety, we must choose the pipeline state that will give the longest overall execution time, which is impossible without knowledge of the whole instruction sequence. Previous methods for cache and pipeline analysis perform the latter by looking at the instructions or basic blocks first, and then combine the WCETs of all these entities into a total WCET for the whole program. This, however, is not a sound way to obtain the overall longest execution time. Whenever it was not possible to classify a cache access as a miss or a hit, it was by default assumed to be a cache miss; this leads to pessimistic estimates, and, as the examples of timing anomalies show, the miss is not even guaranteed to be the worst case. Next, consider a program containing several feasible paths. The WCET is the maximum WCET found among all the paths, so to find it we would have to examine every path in the program.

3.3 Classification

Scheduling Timing Anomalies

In this type of timing anomaly, the length of the pivot task in two task sets is compared.
Let us take the example of a cache hit vs. a cache miss. In Figure 1, the two task sets differ only in the length of task A. This kind of timing anomaly has been extensively studied for different scheduling policies.

The resulting schedule depends on the length of task A, as can be seen in the figure. A greedy scheduler is in general unable to prevent such anomalies.

Figure 1: Scheduling Anomaly

Speculation Timing Anomalies

In this type of timing anomaly, the entire task set changes depending on the pivot task, not just its length. The example in Figure 2 shows that in both cases the processor prefetches instructions. A cache miss while prefetching the first instruction, which can be the local worst case, takes so much time that the branch condition can be evaluated before further prefetches can do more harm to the cache.

Figure 2: Speculation Anomaly

Cache Timing Anomalies

These timing anomalies occur due to abnormal cache behaviour. Consider the ColdFire 5307 processor, where the non-local worst case of a cache hit results in a different future cache state than the local worst case of a cache

miss. This difference in the cache state can cause the cache-hit branch to be stalled later on.

3.4 Examples of timing anomalies

This section gives examples of the timing anomalies present in dynamically scheduled processors. The term dynamically scheduled processor is used to describe a processor in which instructions may execute out of program order. The execution time of an instruction can take one of many discrete values depending on the input data. For example, the execution time of a load instruction depends on whether the address hits or misses in the cache. Another example is an arithmetic instruction whose execution time may depend on its operands. In the following sections we use the term latency for the instruction execution time, and execution time for the overall execution time of the program. To model instruction execution in a pipelined processor, a hardware model is most often used. In this model, whenever an instruction proceeding through the pipeline is stalled, it is due to resource contention with another instruction that accesses a common resource or operand. Examples of resources are functional units and registers; read and write ports, buses and buffers are also treated as resources if they can cause an instruction to stall. Resources are divided into two types. The first type is in-order resources, such as registers, which are allocated in program order of execution. The second type is out-of-order resources, such as functional units, in which a newer instruction can use a resource before an older instruction, according to some dynamic scheduling decision. If a processor contains only in-order resources, then no timing anomalies can occur [1], because two instructions can then only use a resource in program order.
If the completion of an instruction is postponed by x cycles, then later instructions are also postponed, as the resources cannot be allocated before the earlier instructions complete execution. If out-of-order resources are present, timing anomalies can occur. The timing anomalies presented below are studied on a simplified PowerPC architecture (Figure 3) containing no floating-point units. The architecture consists of a multiple-issue pipeline capable of dispatching two instructions each clock cycle, and separate instruction and data caches.

Figure 3: A simplified yet timing-anomalous PowerPC architecture

Each functional unit has two reservation stations to implement out-of-order execution of instructions. These can hold dispatched instructions before their operands are available. All resources in the processor are in-order resources except the integer unit (IU) and the multiple-cycle integer unit (MCIU), which are out-of-order resources. The timing anomalies observed in this architecture are discussed below.

Anomaly 1: Cache hits can result in worst-case timing

This anomaly covers a case where a data cache hit causes an overall longer execution time than a data cache miss. The table in Figure 4 shows when each functional unit is busy executing an instruction. The horizontal dashed lines show when the reservation stations are occupied. At the top, the arrows indicate when each instruction is dispatched to the reservation stations. We can think of two cases: one

Figure 4: Example of a cache hit causing a longer execution time than a cache miss

when the load address hits in the data cache and another when it misses. If the load address hits in the cache, the LD instruction executes for 2 cycles and can then forward its result to instruction B, which starts executing in cycle 3. Here it is assumed that B gets priority over C since B is older. If, on the other hand, the load address misses in the cache, the LD instruction executes for 10 cycles and the execution of B is postponed. This means that C can start executing in cycle 3, one cycle earlier than in the cache-hit case, which in turn makes D and E execute one cycle earlier. Hence there is an overall reduction of the execution time by 1 cycle in the cache-miss case. The anomaly is due to the fact that the unit is an out-of-order resource, permitting B and C to execute out of order.

Anomaly 2: Miss penalties can be higher than expected

This example shows that the overall penalty in execution time due to a cache miss can be greater than the cache-miss penalty. In the table depicted in Figure 5, the first instruction is a load instruction

which can either hit or miss the cache. Assume that the second load instruction (C) always misses. The first three instructions A, B and C depend on each other and must execute one at a time. In the cache-hit case, all instructions execute as soon as possible. As the last instruction D does not depend on any other instruction, it does not interfere with the execution of the others. If, on the other hand, the first load instruction experiences a cache miss, the execution of B is postponed. Instruction D, which is ready for execution, will already have started executing when B becomes eligible, and hence the execution of B is postponed further. In this case instruction C finishes its execution 11 clock cycles later than in the cache-hit case, which is greater than the normal cache-miss penalty of 8 clock cycles. The anomaly is due to the MCIU being an out-of-order resource, which allows instructions B and D to execute in arbitrary order.

Figure 5: Example of a cache-miss penalty being higher than expected
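The mechanism behind Anomalies 1 and 2 — an out-of-order unit running the oldest ready instruction first — can be reproduced in a few lines. The latencies below (LD = 2 cycles on a hit and 4 on a miss, B = C = 1 cycle, a 3-cycle dependent D, C ready at cycle 2) are assumptions chosen for brevity, not the exact values of Figures 4 and 5:

```python
def total_time(ld_latency):
    """Greedy oldest-first scheduling on one shared integer unit.

    B depends on the load LD; C is independent but needs the same unit;
    D (3 cycles, on another unit) depends on C.
    """
    ready = {"B": ld_latency, "C": 2}   # cycle at which each becomes ready
    dur = {"B": 1, "C": 1}
    pending = ["B", "C"]                # program order: B is older than C
    t = 0                               # cycle at which the unit is next free
    finish = {}
    while pending:
        t = max(t, min(ready[i] for i in pending))
        # among the instructions ready at cycle t, run the oldest first
        inst = next(i for i in pending if ready[i] <= t)
        finish[inst] = t = t + dur[inst]
        pending.remove(inst)
    return max(finish["C"] + 3, finish["B"])   # D finishes 3 cycles after C

print(total_time(2))   # cache hit:  LD = 2 -> total 7 cycles
print(total_time(4))   # cache miss: LD = 4 -> total 6 cycles (shorter!)
```

On the hit, B and C become ready together and the older B wins the unit, delaying C and the chain behind it; on the miss, C runs first and the total time drops.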

Anomaly 3: The impact on the WCET may not be bounded

The increase in the cache-miss penalty is not limited by a constant value, but can be proportional to the length of the program. This means that a small interference at the beginning of the execution may contribute an arbitrarily high penalty to the overall execution time. Consider the instruction sequence in Figure 6. The two instructions A and B constitute the body of a loop. E_A refers to the clock cycle when A executed in the previous iteration of the loop, and D_A to the clock cycle when A was dispatched in the current iteration. We consider two different execution scenarios. In the fast case, instruction A in the first iteration executes immediately when it is dispatched. In the slow case it is delayed by one clock cycle due to a dependency on an earlier instruction. This initial delay postpones the execution of A by one clock cycle in every iteration; this is known as the domino effect. The total penalty on the execution time caused by the small initial delay of A will be k clock cycles if the loop runs k iterations.

Figure 6: Example of domino effects
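A toy arithmetic model of the domino effect: assume, as in the figure's fast case, that consecutive iterations overlap and cost 2 cycles each, while a one-cycle initial delay breaks the overlap in every iteration and raises the cost to 3 cycles. The per-iteration numbers are assumptions; only the linear growth of the penalty matters:

```python
def loop_time(iterations, delayed):
    """Fast case: 2 cycles per iteration.  Slow case: the initial 1-cycle
    delay re-creates itself in each iteration, so every iteration costs 3."""
    per_iteration = 3 if delayed else 2
    return (1 if delayed else 0) + per_iteration * iterations

for k in (10, 100, 1000):
    penalty = loop_time(k, True) - loop_time(k, False)
    print(k, penalty)   # penalty = k + 1: proportional to the iteration count
```

The penalty grows with the number of iterations, so no constant bound on the effect of the initial delay exists.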

3.5 Further research on timing anomalies

Hardware-related terminology

A multiple-issue processor, also known as a superscalar processor, is characterized by the following properties:

1. The pipeline in a superscalar processor includes all features of a classical pipeline, but in addition instructions may execute simultaneously in the same pipeline stage.

2. The execution of several instructions can be initiated in one clock cycle.

The instructions are dynamically scheduled, i.e. the actual instruction grouping is done at runtime, in contrast to VLIW architectures. The term dispatch refers to the primary distribution of the instructions among the particular subsystems of functional units, and the term issue refers to the assignment of an instruction to a particular functional unit for immediate execution.

In-order resources: resources that are allocated in program order of execution, e.g. registers.

Out-of-order resources: resources that can be allocated to instructions dynamically, meaning that a newer instruction can use the resource before an older one, e.g. functional units.

Example hardware architecture

Consider the abstracted hardware architecture in Figure 7(a), which includes the following units. Instructions are dispatched by the DS stage to the respective reservation stations RSi. When no free reservation station buffers are available, the dispatch stalls. Subsequently, the instructions are issued to the respective functional units. When the functional unit is idle on dispatch, the dispatch and issue operations coincide in one cycle. Whenever an instruction is issued to a functional unit, its reservation station entry remains allocated to it until the instruction has finished its execution and can be sent to the reorder buffer (ROB), where it is committed. Model M1 in Figure 7a uses the following issue policy:
(i) the functional units serve disjoint sets of instruction types, and (ii) at most one instruction per cycle is assumed to be dispatched.

Model M2 in Figure 7b comprises two functional units serving overlapping sets of instruction types, without any reservation stations. Functional unit FU1 is able to serve all instructions of class IC1, while functional unit FU2 serves classes IC1 and IC2. This means that FU2 can handle more instruction types than FU1: instructions dispatched to FU1 could also be executed in FU2, but the reverse is not possible.

(a) Model M1 (b) Model M2

Figure 7: Model M1 with two reservation stations, allowing out-of-order allocation of the functional units. Model M2 consisting of two non-equivalent functional units.

(a) Resource requirements of model M1 (b) Resource requirements of model M2

Figure 8: Instruction sequences

The table in Figure 8a depicts the resource requirements of the instruction sequence (A, B, C, D) for model M1. Here it is assumed that instructions A and D require FU1, while B and C require FU2.

The table in Figure 8b depicts the resource requirements of the instruction sequence (A, B, C, D) for model M2. Here it is assumed that instructions A, C and D can be executed on either FU1 or FU2, while B can be executed only on FU2.

Timing anomalies caused by out-of-order resources

(a) Counter-directive timing anomaly (b) Strong-impact timing anomaly

Figure 9: Examples for model M1

Figures 9a and 9b show the execution of the sequence of Figure 8a on model M1 in a timing diagram. Figure 9a shows an example of a counter-directive timing anomaly and Figure 9b an example of a strong-impact

timing anomaly. The arrows below the instruction labels symbolize the instruction dispatch events. The instruction latencies are provided in a small box on the right side of the diagram, and the arrows next to the latencies represent the dependency relationships between the instructions. Each figure illustrates two cases: the first two rows represent case 1 and the next two rows case 2, showing the execution of the instruction sequence with the given latencies. The bars represent the utilization of the respective functional unit, and the dotted lines above the bars the reservation stations allocated by the instructions.

Counter-directive timing anomaly

In case 1 the instructions are executed with the latencies given; the execution of the instruction sequence completes in 10 clock cycles. In case 2 the latency of instruction A is increased by 2 clock cycles. Since instruction B depends on instruction A, its execution is delayed. While A is still executing, instruction C starts executing, as it depends on neither A nor B. When A and C have finished, instructions D and B execute during the same clock cycles, since they do not depend on each other. The total execution time is 8 clock cycles. Comparing the total execution times of case 1 and case 2, we observe a counter-directive timing anomaly: the increase in instruction latency leads to a decrease in total execution time.

Strong-impact timing anomaly

In case 1 the instructions are executed with the latencies given; the execution of the instruction sequence completes in 7 clock cycles. In case 2 the latency of instruction A is increased by 2 clock cycles. Since instruction B depends on instruction A, its execution is delayed.
While A is still executing, instruction C starts executing, as it depends on neither A nor B. When A and C have finished, instruction B starts executing, followed by instruction D. The total execution time is 11 clock cycles. Comparing the total execution times of case 1 and case 2, we observe a strong-impact timing anomaly: the increase in instruction latency leads to an even greater increase in total execution

time.

Timing anomalies caused by in-order resources

(a) Counter-directive timing anomaly (b) Strong-impact timing anomaly

Figure 10: Examples for model M2

In section 3.4 it was claimed that timing anomalies cannot occur when a processor contains only in-order resources, but it was later shown that timing anomalies can occur in the presence of in-order resources as well. Figures 10a and 10b show the execution of the sequence of Figure 8b on model M2 in a timing diagram. Figure 10a shows an example of a counter-directive timing anomaly and Figure 10b an example of a strong-impact timing anomaly. The arrows below the instruction labels symbolize the instruction dispatch events, and the instruction latencies are provided in a small box on the right side of the diagram. Each figure illustrates two cases: the first two rows represent case 1 and the next two rows case 2, showing the execution of the instruction sequence with the given latencies. The bars represent the utilization of the respective functional unit, and the dotted lines above the bars the reservation stations allocated by the instructions.

Counter-directive timing anomaly

In case 1 the instructions are executed with the latencies given; the execution of the instruction sequence completes in 8 clock cycles.

In case 2 the latency of instruction B is increased by 2 clock cycles, and the total execution time is 7 clock cycles. Comparing the total execution times of case 1 and case 2, we observe a counter-directive timing anomaly: the increase in instruction latency leads to a decrease in overall execution time.

Strong-impact timing anomaly

In case 1 the instructions are executed with the latencies given; the execution of the instruction sequence completes in 7 clock cycles. In case 2 the latency of instruction A is increased by 2 clock cycles, and the total execution time is 10 clock cycles. Comparing the total execution times of case 1 and case 2, we observe a strong-impact timing anomaly: the increase in instruction latency leads to an even greater increase in overall execution time.

4 WCET Analysis

4.1 The influence of timing anomalies on WCET analysis

The presence of timing anomalies makes the WCET analysis process extremely complex. Consequently, hardware allowing timing anomalies can only be analysed safely using the pessimistic serial-execution method [1], leading to nearly useless results due to high overestimation. For the traditional analysis methods, the following assumptions have to be satisfied.

Monotonicity assumption

When certain information is processed by a WCET analysis approach, it is assumed that a longer latency for an instruction imposes an at least equal or longer execution time on the overall instruction sequence under consideration [2]. This assumption is implicitly included in many WCET approaches. In the presence of timing anomalies it does not hold.
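A counter-directive anomaly is exactly a violation of this monotonicity assumption, and a strong-impact anomaly a violation of the related proportionality property. The two cases can be told apart mechanically from the latency change and the resulting change in total time; the deltas below are those of the model M1 examples:

```python
def classify(delta_latency, delta_total):
    """Classify the effect of increasing one instruction's latency."""
    if delta_latency > 0 and delta_total < 0:
        return "counter-directive anomaly"
    if delta_latency > 0 and delta_total > delta_latency:
        return "strong-impact anomaly"
    return "no anomaly"

# Model M1: A's latency grows by 2 clock cycles in both examples.
print(classify(2, 8 - 10))   # total drops 10 -> 8:  counter-directive anomaly
print(classify(2, 11 - 7))   # total grows 7 -> 11:  strong-impact anomaly
print(classify(2, 2))        # total grows by exactly the delta: no anomaly
```

A monotone, proportional processor would always land in the third branch, which is precisely what the traditional analysis methods assume.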

Basic composability assumption

In many WCET calculation methods, a basic composability assumption is taken for granted. It denotes the fact that WCET bounds for sub-paths can be safely composed by the WCET calculation method into WCET bounds for the composite paths [2]. This validity of sub-path WCET bounds is lost when timing anomalies are present. If the WCET bound of a loop body is underestimated in a particular execution context due to a violation of the monotonicity and composability assumptions, this error will multiply at runtime depending on the actual input data values; this phenomenon is called an unbounded timing effect.

4.2 Precise WCET analysis

In the previous sections we described the timing anomaly as a problem that occurs when the WCET of a control flow graph is computed from the WCETs of its sub-graphs, i.e. from a series decomposition. Timing anomalies occur not only between the timing of two subsequent instruction sequences, but also between the latency of individual processor components and the total execution time. A timing anomaly can occur in a parallel decomposition of the WCET problem, i.e. when complexity is reduced by splitting the hardware state space and performing a separate WCET analysis for hardware components in parallel, and the potential occurrence of such parallel timing anomalies makes the parallel decomposition technique unsafe. The two main tasks of WCET analysis tools are control flow analysis, which determines the feasible and infeasible paths in a program, and processor behaviour analysis, which models the hardware. Timing anomalies are a challenge for WCET analysis because they violate the continuity properties, proportionality and monotonicity, of program execution [1]. Parallel timing anomalies are timing effects due to changes in the initial state. WCET analysis with parallel decomposition is a method to reduce the complexity of processor behaviour analysis.
The idea of parallel decomposition is to calculate the WCET of an instruction sequence in two steps. Before performing this calculation, the state space is divided into two parts: the state space of one processor component, and the state space of the remaining components. For example, the hardware

component may be the instruction cache, while the other state fraction covers the pipeline and the remaining processor components. In the first step the timing of the chosen processor component is analysed and one state (say A) is chosen. Based on this choice, the overall processor timing is analysed in the second step by searching the state space while using the result for state A.

4.3 Methods of eliminating anomalies

In this section we discuss two approaches to estimating the WCET of a program running on a dynamically scheduled processor where timing anomalies may occur.

Pessimistic serial-execution method

One of the straightforward ways to make safe estimations for an architecture containing anomalies is the pessimistic serial-execution method. It assumes that all instructions are executed in order in the functional units: we sum up all instruction latencies in the functional units and additionally add the miss penalties for all instruction and data cache misses. The WCET corresponding to a serial execution of the instructions, assuming their worst-case latencies, is always higher than the WCET corresponding to any pipelined execution of the same sequence. Theoretically, instructions cannot be executed slower than in order, since that would mean some functional units are idle while instructions are ready for execution. The only remaining possibility for an instruction to stall is a cache miss, which we account for separately. A serial-execution estimate is therefore safe, but far too pessimistic. Its biggest advantage is that unknown events in the system are handled safely: they cannot lead to an execution time greater than the one estimated for serial execution.

Program modification method

The serial-execution method discussed above is very pessimistic.
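The serial-execution bound just described is a plain sum, which is both why it is safe and why it is so pessimistic. A minimal sketch, with latency and penalty numbers invented for illustration:

```python
def serial_wcet_bound(worst_latencies, possible_misses, miss_penalty):
    """Pessimistic serial-execution bound: every instruction runs alone
    with its worst-case latency, and every access that may miss is
    charged a full miss penalty."""
    return sum(worst_latencies) + possible_misses * miss_penalty

# 5 instructions, up to 3 accesses that may miss, at 8 cycles per miss.
print(serial_wcet_bound([2, 1, 1, 4, 1], possible_misses=3, miss_penalty=8))  # 33
```

The bound ignores all pipelining and overlap, so it is safe regardless of anomalies, at the cost of heavy overestimation.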
If we want a tighter WCET estimate, we must model the pipelined execution accurately and deal with the problem of timing anomalies. One way of accomplishing this is to modify the program so that we can rely on safe local

decisions. We want to make the following conditions true [2]:

1. All variable-latency instructions that have an unknown latency must, when simulated, still result in a predictable pipeline state. In addition, other unknown events such as unknown instruction cache accesses must also result in a predictable pipeline state.

2. If the number of paths in a small section of the program is reduced by selecting the longest one or discarding the shortest ones, then the states of the pipeline and the caches at the beginning and the end of the paths must not differ.

The first condition can be fulfilled by forcing in-order resource use when executing the variable-latency instruction. Then the pipeline state must be predictable before out-of-order resource use is allowed again. The way to accomplish this is highly architecture dependent. For example, the PowerPC architecture provides a memory synchronization instruction called sync, which inhibits further dispatching until the sync instruction completes. The use of sync instructions ensures the following:

1. All previously issued instructions have completed, at least to a point where they can no longer cause an exception. The sync instruction can be used to ensure that memory accesses are complete.

2. Previously issued instructions complete in the context in which they were issued (privilege, protection, address translation). Instructions issued after the synchronizing instruction execute in the new context.

3. The instruction queue is flushed and all its instructions are refetched with the new context in place, to ensure that context changes take effect for instructions after the synchronization.

This instruction can thus be used to force serialization around a variable-latency instruction: if a sync is placed after the variable-latency instruction, then the pipeline state is known afterwards.
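As a sketch, this modification can be phrased as a rewriting pass over an instruction list. Everything here is illustrative: `sync` is just a marker string, the predicate deciding which instructions are variable-latency is a stand-in, and real placement is architecture dependent:

```python
def insert_syncs(instructions, is_variable_latency):
    """Place a sync after every variable-latency instruction so that the
    pipeline state is predictable again afterwards."""
    out = []
    for inst in instructions:
        out.append(inst)
        if is_variable_latency(inst):
            out.append("sync")
    return out

program = ["add r1,r2", "lwz r3,0(r4)", "mul r5,r1"]
print(insert_syncs(program, lambda i: i.startswith("lwz")))
# ['add r1,r2', 'lwz r3,0(r4)', 'sync', 'mul r5,r1']
```

The cost of each inserted sync must of course be accounted for in the analysis, which is the price paid for the predictable pipeline state.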
Similarly, if another sync is placed before the variable-latency instruction, we know that the instruction will execute in order.

The second condition can also be fulfilled with the sync instruction, as far as the pipeline state is concerned: placing a sync instruction at the end of two paths, for example, makes the pipeline states of the two paths equal. It is also necessary that the cache states at the end of the two paths are equal. How to achieve this is again highly architecture dependent. There exist several options [2]:

1. One can invalidate all blocks in the caches. This should be possible in almost all processors.

2. One can invalidate only the blocks that differ between the two caches. This requires support for invalidation at the block level.

3. One can replace the blocks that differ with blocks that will be needed in the future, by preloading blocks into the caches. This requires support for explicitly loading blocks into a cache.

The first option, invalidating the entire contents of the cache, is rarely the best solution, since performance becomes poor. The same holds for the second option, since each invalidated block will cause an additional cache miss later on. The third option is the most promising one, but it requires special instructions to preload the cache. Examples of such instructions are the instruction and data cache block touch instructions (icbt and dcbt) found in the PowerPC architecture. When blocks are preloaded, it is best to preload blocks that will be needed somewhere along the worst-case path. Additionally, preload instructions should be placed outside loops where possible, to reduce the overhead.

5 Conclusion

The high complexity of today's processors makes processor behaviour analysis one of the most challenging problems of WCET analysis. Features such as pipelines and caches create a huge state space, and effects like timing anomalies make it impossible to construct an efficient processor behaviour analysis that does not search the whole state space for the whole program. Previous methods fail in estimating the WCET because they assumed that one can rely on worst-case assumptions for local entities, such as instructions and basic blocks, to estimate the effect on the overall WCET.
For a certain class of programs running on dynamically scheduled processors, it is possible to make safe and tight estimates of the WCET. There must, however, be support in the architecture to explicitly control the state of the caches and the resource allocation in the pipeline. The processor can then be forced to allocate resources in order, resulting in a stable scheduling of instructions, but probably also in lower performance. If there is no support for controlling the pipeline state, one is forced to use the serial-execution method, which often leads to more pessimism in the estimated WCET.

Whenever the processor contains resources that allow runtime resource allocation decisions (e.g., out-of-order pipelines, or functional units serving different instruction types), latency variations of single instructions may cause timing anomalies further down the instruction stream on this particular type of hardware. Such latency variations may result from caches, branch prediction mechanisms, or different operand values (e.g., in floating-point operations). If the actually executed instruction sequence does not cause any dynamic resource allocation decision at execution time, the combination of hardware and software can be guaranteed to be free of timing anomalies.

References

[1] Raimund Kirner, Albrecht Kadlec, and Peter Puschner. Precise worst-case execution time analysis for processors with timing anomalies. In 21st Euromicro Conference on Real-Time Systems (ECRTS 2009). IEEE, 2009.

[2] Thomas Lundqvist and Per Stenström. Timing anomalies in dynamically scheduled microprocessors. In Proceedings of the 20th IEEE Real-Time Systems Symposium (RTSS 1999). IEEE, 1999.

[3] Jan Reineke, Björn Wachter, Stefan Thesing, Reinhard Wilhelm, Ilia Polian, Jochen Eisinger, and Bernd Becker. A definition and classification of timing anomalies. In 6th International Workshop on Worst-Case Execution Time Analysis (WCET 2006), OASIcs (OpenAccess Series in Informatics), volume 4. Schloss Dagstuhl, Leibniz-Zentrum für Informatik, 2006.

[4] Ingomar Wenzel, Raimund Kirner, Peter Puschner, and Bernhard Rieder. Principles of timing anomalies in superscalar processors. In Fifth International Conference on Quality Software (QSIC 2005). IEEE, 2005.

More information

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline

Minimizing Data hazard Stalls by Forwarding Data Hazard Classification Data Hazards Present in Current MIPS Pipeline Instruction Pipelining Review: MIPS In-Order Single-Issue Integer Pipeline Performance of Pipelines with Stalls Pipeline Hazards Structural hazards Data hazards Minimizing Data hazard Stalls by Forwarding

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.

Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2. Instruction-Level Parallelism and its Exploitation: PART 2 Hardware-based speculation (2.6) Multiple-issue plus static scheduling = VLIW (2.7) Multiple-issue, dynamic scheduling, and speculation (2.8)

More information

CS252 Graduate Computer Architecture Midterm 1 Solutions

CS252 Graduate Computer Architecture Midterm 1 Solutions CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate

More information

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011

15-740/ Computer Architecture Lecture 10: Out-of-Order Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/3/2011 5-740/8-740 Computer Architecture Lecture 0: Out-of-Order Execution Prof. Onur Mutlu Carnegie Mellon University Fall 20, 0/3/20 Review: Solutions to Enable Precise Exceptions Reorder buffer History buffer

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation M. Sonza Reorda Politecnico di Torino Dipartimento di Automatica e Informatica 1 Introduction Hardware-based speculation is a technique for reducing the effects of control dependences

More information

Hardware-Software Codesign. 9. Worst Case Execution Time Analysis

Hardware-Software Codesign. 9. Worst Case Execution Time Analysis Hardware-Software Codesign 9. Worst Case Execution Time Analysis Lothar Thiele 9-1 System Design Specification System Synthesis Estimation SW-Compilation Intellectual Prop. Code Instruction Set HW-Synthesis

More information

There are different characteristics for exceptions. They are as follows:

There are different characteristics for exceptions. They are as follows: e-pg PATHSHALA- Computer Science Computer Architecture Module 15 Exception handling and floating point pipelines The objectives of this module are to discuss about exceptions and look at how the MIPS architecture

More information

Architectural Time-predictability Factor (ATF) to Measure Architectural Time Predictability

Architectural Time-predictability Factor (ATF) to Measure Architectural Time Predictability Architectural Time-predictability Factor (ATF) to Measure Architectural Time Predictability Yiqiang Ding, Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University Outline

More information

VIII. DSP Processors. Digital Signal Processing 8 December 24, 2009

VIII. DSP Processors. Digital Signal Processing 8 December 24, 2009 Digital Signal Processing 8 December 24, 2009 VIII. DSP Processors 2007 Syllabus: Introduction to programmable DSPs: Multiplier and Multiplier-Accumulator (MAC), Modified bus structures and memory access

More information

Advanced processor designs

Advanced processor designs Advanced processor designs We ve only scratched the surface of CPU design. Today we ll briefly introduce some of the big ideas and big words behind modern processors by looking at two example CPUs. The

More information

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance 6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,

More information

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging 6.823, L8--1 Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Highly-Associative

More information

CSC D70: Compiler Optimization Register Allocation

CSC D70: Compiler Optimization Register Allocation CSC D70: Compiler Optimization Register Allocation Prof. Gennady Pekhimenko University of Toronto Winter 2018 The content of this lecture is adapted from the lectures of Todd Mowry and Phillip Gibbons

More information

15-740/ Computer Architecture Lecture 12: Issues in OoO Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011

15-740/ Computer Architecture Lecture 12: Issues in OoO Execution. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011 15-740/18-740 Computer Architecture Lecture 12: Issues in OoO Execution Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 10/7/2011 Reviews Due next Monday Mutlu et al., Runahead Execution: An Alternative

More information