A Time-Predictable Instruction-Cache Architecture that Uses Prefetching and Cache Locking

Bekim Cilku, Daniel Prokesch, Peter Puschner
Institute of Computer Engineering, Vienna University of Technology, A-1040 Wien, Austria
Email: {bekim, daniel, peter}@vmars.tuwien.ac.at

Abstract - Trustable Worst-Case Execution-Time (WCET) bounds are a necessary component for the construction and verification of hard real-time computer systems. Deriving such bounds for contemporary hardware/software systems is a complex task. The single-path conversion overcomes this difficulty by transforming all unpredictable branch alternatives in the code into a sequential code structure with a single execution trace. However, the simpler code structure and analysis of single-path code come at the cost of a longer execution time. In this paper we address the execution performance of single-path code. We present a new cache organization that utilizes the locality of single-path code to reduce both the cache miss latency and the cache miss rate. The proposed cache architecture combines prefetching and cache locking, so that the prefetcher capitalizes on spatial locality while the locking mechanism makes use of temporal locality. The demonstration section shows how these two techniques complement each other.

I. INTRODUCTION

In hard real-time systems it is necessary to estimate the Worst-Case Execution Time (WCET) in order to assure that the timing requirements of critical tasks are met. To derive precise time bounds, timing analysis over the set of all possible control-flow paths of the tasks is required. Unfortunately, this process is not trivial. For typical processors with caches, branch prediction, and pipelines, the analysis has to consider the state dependencies of the hardware components as well. Keeping track of all possible states can lead to a state-space explosion in the analysis. State-of-the-art WCET tools avoid this complexity by using different models of abstraction for the hardware architecture [1]. However, such an approach can result in many unclassified states which, due to the safety-critical nature of the system, may lead to the computation of a highly overestimated WCET bound.

Puschner and Burns [2] address the problem of complexity by converting conventional code into single-path code. The approach converts all input-dependent alternatives of the code into pieces of sequential code and transforms all loops with input-dependent termination conditions into loops with constant execution count. This, in turn, eliminates all control-flow induced variations in execution time by forcing the execution to always follow the same trace of instructions. To obtain information about the timing of the code, it is sufficient to run the code only once and to identify the stream of instructions that is followed in every repeated execution. The major drawback of single-path code is its potentially long execution time. Our focus in this paper is on improving the performance of systems that run single-path code.
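To make the conversion concrete, the following C fragment sketches the idea at the source level. It is only an illustration: the actual transformation of [2] is performed by the compiler using predicated machine instructions, and the function names, the threshold, and the bound MAX_N are assumptions introduced for this example.

#define MAX_N 16                           /* assumed compile-time bound, n <= MAX_N */

/* Conventional code: execution time depends on the input. */
int sum_small(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < n; i++) {          /* input-dependent trip count       */
        if (a[i] < 10)                     /* input-dependent branch           */
            s += a[i];
    }
    return s;
}

/* Single-path version: one execution trace for every input. The ternary
 * selects stand in for the predicated (conditional-move) instructions a
 * single-path compiler would emit. */
int sum_small_sp(const int *a, int n)
{
    int s = 0;
    for (int i = 0; i < MAX_N; i++) {      /* constant trip count              */
        int live = (i < n);                /* guard for the original bound     */
        int v    = live ? a[i] : 0;        /* guarded load, result selected    */
        int take = live && (v < 10);       /* branch condition as a predicate  */
        s += take ? v : 0;                 /* select instead of branch         */
    }
    return s;
}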
The long latency of memory accesses is one of the key performance bottlenecks of contemporary systems. Although the use of caches is a crucial step towards improving memory performance, it is still not a complete solution. Cache misses can result in a significant performance loss by stalling the processor until the needed instruction is brought into the cache. Prefetching has been shown to be an effective solution for reducing the cache miss rate. Its purpose is to mask the large memory latency by loading instructions into the cache before they are actually needed [3]. To take full advantage of this improvement, prefetching commands have to be issued at the right time: if they are issued too late, memory latencies are only partially masked; if they are issued too early, there is the risk that the prefetched instructions will evict other useful instructions from the cache.

Prefetching algorithms can be divided into two main categories: correlated and non-correlated prefetching. Correlated prefetching associates each cache miss with some predefined target stored in a dedicated table [4], [5]. Prefetchers of this kind are usually active for non-sequential streams of instructions. The second group, non-correlated prefetching, predicts the next prefetch line according to some simple algorithm [6], [7], [8]; this group is also called sequential prefetching. For both techniques, however, the ability to guess the next reference is not perfect. Thus, prefetching can cause unnecessary cache pollution and memory traffic.

Cache locking is another technique that can reduce the cache miss rate [9]. It has the ability to prevent the overwriting of some or all lines in the cache. Cache locking can be static or dynamic. Static locking loads and locks the memory blocks at program start, and the set of locked cache lines remains unchanged throughout the program execution. Dynamic locking splits the program into appropriate regions marked with so-called lock points; as the program runs through a lock point, new memory blocks are reloaded and locked. Cache locking can lock the entire cache (full locking) or only a few cache lines (partial locking). Hits to a locked line have the same timing as hits to an unlocked line.

Using the predictability of single-path traces, we propose a new technique that combines prefetching and cache locking to reduce both the miss rate and the penalty of cache misses. On one hand, the proposed prefetching design is able to prefetch sequential and non-sequential streams of instructions with full accuracy in the value and time domain. This constitutes an effective instruction prefetching scheme that increases the execution performance of single-path code without cache pollution or useless memory traffic. On the other hand, the proposed cache locking is a dynamic mechanism that decreases the cache miss rate of the system by locking appropriate memory blocks. The two techniques complement each other: while the prefetcher exploits spatial locality, the cache locking makes use of temporal locality.

The rest of the paper is organized as follows. In Section II we review work related to prefetching and cache locking. The architecture of the proposed cache, prefetcher, and cache-locking mechanism is described in Section III. Section IV presents an example that demonstrates the advantages of the new architecture. Finally, we conclude the paper in Section V.

II. RELATED WORK

Designers have proposed several prefetching strategies to increase the performance of cache-memory systems. Some of these approaches use software support to perform prefetching, while others are strictly hardware-based. Software solutions need explicit prefetching instructions in the executable code, which have to be generated by the compiler. Because this paper introduces a hardware-based solution, the following paragraphs discuss only related work on hardware-based prefetching.

The simplest form of prefetching comes passively from the cache itself. Whenever a cache miss occurs, the cache fetches the missed instruction together with the other instructions that belong to the same cache line. An extension of the cache line size implies a larger granularity: more instructions are fetched on a cache miss. The disadvantages of longer cache lines are that they take longer to fill, generate useless memory traffic, and contribute to cache pollution [10].

The first prefetch algorithms mainly focused on sequential prefetching. One-block-lookahead prefetching, later extended to next-n-line prefetching, prefetches the cache lines that follow the one currently being fetched by the CPU [8]. The scheme requires only little additional hardware to compute the next sequential addresses. Unfortunately, next-n-line prefetching is unlikely to improve performance when execution proceeds along non-sequential execution paths; in this case the prefetching guess can be incorrect. Another sequential scheme is tagged prefetching, which associates a tag bit with each cache line [7]. When a line is prefetched, its tag bit is set to zero. If the line is used, the bit is set to one and the next sequential line is immediately prefetched. The stream buffer is a similar approach to tagged prefetching, except that the stream buffer sits between main memory and the cache in order to avoid polluting the cache with data that may never be needed.

The target prefetching scheme is one of the first algorithms that addresses the problem of non-sequential code [5]. It comprises a next-line prediction table whose entries have two components: a current line address and a target line address. Whenever the program counter changes its value, the prefetcher searches the prediction table; if there is a match, the target address is a candidate for prefetching. A hybrid scheme that combines next-line and target prefetching offers a cumulative accuracy for reducing cache misses.
However, this solution has limited effectiveness since the predicted direction is determined by the previous execution. A similar approach is used in wrong-path prefetching, except that instead of a target table this approach immediately prefetches the target of a conditional branch once the branch is recognized in the decode stage [6]. This solution can be effective only for not-taken branches. The Markov prefetcher [4] prefetches multiple reference predictions from the memory subsystem by modeling the miss-reference stream as a Markov model. Loop-based instruction prefetching [11] is similar to our prefetcher, except that in this solution the loop headers are always prefetched and the prefetch is issued at the end of the loop without specifying the prefetch distance. The cooperative approach [12] also considers sequential and non-sequential prefetching, but uses a software solution for the non-sequential part. The dual-mode instruction prefetch scheme [13] is an alternative that improves the worst-case execution time by associating a thread with each instruction block that is part of the WCET path; threads are generated by the compiler and remain static during task execution.

Cache locking, on the other hand, is considered an appropriate technique for improving the predictability of the cache in order to derive more accurate WCET bounds. The first algorithms for choosing the memory blocks to be locked were based on heuristics [14], [15]. It was shown that these approaches were able to reduce the WCET bound, even though the reduction was not optimal. Later, integer linear programming (ILP) was used to find the cache lines with the highest impact on the cache miss rate of the whole code. Plazar et al. [16] use ILP results for statically locking the entire cache; all instructions that are not part of the locked memory blocks are classified as cache misses. Ding et al. [17] show that static partial locking of the cache achieves a lower miss rate than full cache locking. In [9], Ding et al. improve their approach by using ILP for memory-block selection and dynamic locking. To the best of our knowledge, [18] is the only work that combines prefetching with cache locking, for a multitask system. The prefetcher uses two buffers, one for fetching and the other for prefetching with the tagged algorithm. A separate lockable instruction cache is preloaded at context switches such that each task can utilize the whole cache, and lines are locked in a way that minimizes the worst overall execution cost.

There have been various other approaches that address the time-predictability of the memory hierarchy. Reineke et al. [19] have analyzed the predictability of caches with respect to their replacement policy; they define metrics that capture aspects of cache-state predictability and evaluate the predictability of common replacement policies. Schoeberl [20] proposes a cache organization that consists of several separate caches for different data areas, which can be analyzed independently to derive tight WCET bounds. In particular, he proposes an instruction cache in which whole functions are loaded into the cache on a function call or return; cache misses only occur at these instructions, and only the call tree of a program needs to be considered for cache analysis.

In contrast to those approaches, we base our efforts on the inherently predictable execution of single-path programs, where every execution follows the same sequence of instructions. On one hand, this predictability comes at an additional cost in terms of the number of instructions fetched in a single execution. On the other hand, the exact knowledge of the actual instruction sequence provides additional possibilities for optimization, which we exploit in our approach.

III. HARDWARE ARCHITECTURE

This section presents the hardware architecture of the cache with its prefetcher and the cache-locking mechanism.

A. Architecture of the Cache Memory

Caches are small and fast memories that are used to bridge the performance gap between processors and main memory, based on the principle of locality. Locality can be observed in the temporal and spatial behavior of the execution. Temporal locality means that code that is executed at the moment is likely to be referenced again in the near future; this behavior is expected from program loops, in which both data and instructions are reused. Spatial locality means that instructions and data whose addresses are close to each other tend to be referenced in temporal proximity, because instructions are mostly executed sequentially [3].

As an application executes over time, the CPU references memory by sending memory addresses. Referenced instructions that are found in the cache are called hits, while those that are not in the cache are called misses. Usually the processor stalls on a cache miss until the instructions have been fetched from main memory. The time needed for transferring the instructions from main memory into the cache is called the miss penalty. To benefit from the spatial locality of the code, the cache always fetches one or more chunks of instructions, called cache blocks, and places them into the cache lines. The very first reference to a memory block always generates a cache miss (compulsory miss). Also, when the cache is full, some instructions must be evicted to create space for the incoming ones; if an evicted instruction is referenced again, another cache miss occurs (capacity miss) [10].

Fig. 1: Proposed cache architecture augmented with a prefetcher and cache locking.

Figure 1 shows the cache memory augmented with a prefetcher and cache locking. The cache has two banks, and each cache line consists of a tag, data, a prefetch bit P, and a lock bit L. The separation of the cache into two banks allows us to overlap fetching (by the CPU) with prefetching (by the prefetch unit). At any time, one bank services the CPU with instructions that are already in the cache, while the other bank is used by the prefetcher to bring instructions from main memory into the cache. Whenever the program counter (PC) changes its value, the new value is sent to the cache and to the prefetcher. Both the CPU and the prefetcher can issue requests to the cache memory at the same time; the difference is that the CPU requests the currently needed instruction, while the prefetcher issues requests for the next cache line to exploit spatial locality. When the prefetcher issues a request, the tag entry of the cache line is updated and the P bit of the associated cache line is set to zero. At the moment the prefetched line arrives in the cache, the P bit is set to one. The P bit prevents the cache from duplicating requests to main memory when the same request has already been issued by the prefetcher.

There are three possible scenarios when a CPU request is issued:
- No match within the tag columns: the instruction is not in the cache. The cache stalls the processor and forwards the address request to main memory. In this case the CPU stall time equals the miss penalty.
- Tag match, P bit is zero: the instruction is not in the cache, but the prefetcher has already sent the request for that cache line and the fetch is in progress. The cache stalls the processor and waits for the ongoing prefetch operation to finish. In this case the CPU stall time is shorter than the miss penalty.
- Tag match, P bit is one: the instruction is already in the cache (cache hit).
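The organization just described can be modeled with a few data structures, as sketched below. The field widths, the number of lines per bank, and the valid flag (which does not appear in Fig. 1) are assumptions made for this illustration only.

#include <stdbool.h>
#include <stdint.h>

#define LINES_PER_BANK 4        /* illustrative size, not taken from the paper */
#define LINE_BYTES     32

typedef struct {
    bool     valid;             /* line holds, or is awaiting, a memory block      */
    uint32_t tag;               /* tag of the cached memory block                  */
    bool     p;                 /* P = 1: line present; P = 0: prefetch in flight  */
    bool     l;                 /* L = 1: line locked against replacement          */
    uint8_t  data[LINE_BYTES];
} cache_line_t;

typedef struct {
    cache_line_t bank[2][LINES_PER_BANK];  /* two banks: fetch and prefetch overlap */
} icache_t;

/* The three outcomes of a CPU request listed above. */
typedef enum {
    MISS,                       /* no tag match: full miss penalty                 */
    PREFETCH_WAIT,              /* tag match, P = 0: wait for the ongoing prefetch */
    HIT                         /* tag match, P = 1: instruction is in the cache   */
} lookup_result_t;

lookup_result_t icache_lookup(const icache_t *c, uint32_t tag)
{
    for (int b = 0; b < 2; b++)
        for (int i = 0; i < LINES_PER_BANK; i++) {
            const cache_line_t *ln = &c->bank[b][i];
            if (ln->valid && ln->tag == tag)
                return ln->p ? HIT : PREFETCH_WAIT;
        }
    return MISS;                /* the request is forwarded to main memory         */
}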
The lock bit L is used to enable or disable the replacement policy. The last lines of the banks have no L bit; thus they can always be used by the prefetcher. So, while the locked cache lines exploit temporal locality, one of the unlockable cache lines takes advantage of the spatial locality inside that cache line (servicing the CPU), and the other unlockable cache line makes use of spatial locality outside the cache line (being filled by the prefetcher).

B. Prefetching for Single-Path Code

The prefetching algorithm for single-path code considers two forms of prefetching: sequential and non-sequential. Sequentially executed instruction streams are a trivial pattern to predict, because the next prefetch address is simply the current line address incremented by one cache line. A simple next-line prefetcher [8] is a suitable solution for this pattern of instruction execution. In contrast, a non-sequential prefetcher needs some source of information to determine the target address to be prefetched. Single-path code has a strong advantage here, because the target of every branching instruction is statically known. This target information can be given to the prefetcher in the form of instructions (software prefetching), or it can be kept in a local memory organized as a table and used by the prefetcher when it is needed (hardware prefetching). The software solution requires special prefetch instructions and modifications of the CPU hardware. In order to keep the development of the prefetcher independent from the CPU and the compiler, and also to avoid the overhead of executing prefetch instructions in software, we have decided to use a hardware solution for the single-path prefetcher.

Since single-path code consists of serial segments and loops, only the branch instructions of loops need to be handled by the non-sequential algorithm. Loops that are larger than the cache are easily handled: the loop body is prefetched with the sequential algorithm, while the non-sequential algorithm is used for the loop header. Loops that fit entirely into the cache do not need to be prefetched in each iteration. Such loops are identified, and the prefetcher does not generate any prefetch requests for them except in the last iteration, when the execution stream exits the loop.
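For the sequential case, the prefetch target can be computed from the current program-counter value alone, as the following sketch shows; the 32-byte line size is an arbitrary assumption.

#include <stdint.h>

#define LINE_SIZE 32u                      /* illustrative line size in bytes */

/* Next-line prefetch target: the cache line that follows the one containing
 * the current program-counter value. */
static inline uint32_t next_line_addr(uint32_t pc)
{
    return (pc & ~(LINE_SIZE - 1u)) + LINE_SIZE;  /* align down, step one line */
}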

The granularity of prefetching is one cache line. For larger amounts of prefetched instructions, the probability of overshooting the end of a sequence would increase, resulting in cache pollution from useless prefetches. The granularity also determines that the prefetch distance is one cache line ahead.

1) Reference Prediction Table: The Reference Prediction Table (RPT) is the part of the prefetcher that holds information about the instruction stream (Figure 1). Each RPT entry consists of a trigger line, a destination line, a count, and a type field. The trigger line refers to the set of program-counter addresses that trigger the non-sequential algorithm of the prefetcher. The destination line is the target address that is prefetched. Since loops in single-path code have a constant number of iterations, the count field informs the prefetcher how many times the target address should be prefetched. The type field indicates whether a loop fits into the cache or is larger than the cache. If the type value is zero, the prefetcher takes no action, since the loop is smaller than the cache and completely contained in it. When the counter of that loop reaches zero, the loop iterations are finished and the prefetcher triggers the prefetching of the next cache line. The RPT is populated by analyzing the (singleton) instruction trace of the single-path program. The analysis identifies loops (loop entries and backward branches), the number of iterations, and the size of each loop.

2) Architecture of the Prefetcher: As shown in Figure 1, the prefetch hardware for single-path code consists of the Reference Prediction Table (RPT), the next-line prefetcher, and the prefetch controller (a state machine). The next-line prefetcher is responsible for prefetching the sequential parts of the code, while the state machine in association with the RPT is used for prefetching non-sequential targets. At run time, when a new address is generated, its value is passed to the RPT and to the next-line prefetcher. If the PC value matches an entry in the RPT, the prefetch controller reads the type bit and the counter value to check whether the loop is smaller or larger than the cache; furthermore, on each iteration the controller checks whether the final iteration has been reached. If there is no match in the RPT, the next-line prefetcher increments the address by one cache line and issues that address to the cache. The RPT output has precedence over next-line prefetching.

C. Cache Locking

The cache-locking mechanism is very simple. It uses a table with two types of entries: the first column holds the addresses used to lock the cache, while the second column holds the addresses used to unlock the cache. During code execution, all addresses generated by the PC are also transmitted to the lock unit. When a lock address is detected, the lock unit sends a lock signal and sets all L bits of the cache to one, which disables line replacement in the cache. The cache holds this state until the unlock address is detected and the unlock signal resets all L bits to zero.
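A rough software model of the RPT-driven prefetch decision (Section III-B) and of the lock-unit check (Section III-C) is sketched below. The table sizes, field types, and function names are assumptions made for this illustration; in hardware the RPT lookup and the next-line computation proceed in parallel and the RPT result simply overrides the next-line address, which the fall-through order of the code mimics.

#include <stdbool.h>
#include <stdint.h>

#define LINE_SIZE 32u

typedef struct {
    uint32_t trigger;      /* line address that triggers this entry           */
    uint32_t destination;  /* target line to prefetch (e.g., the loop header) */
    uint32_t count;        /* remaining loop iterations                       */
    bool     large;        /* type bit: 1 = loop larger than the cache        */
} rpt_entry_t;

typedef struct {
    uint32_t lock_addr;    /* address that locks the cache (sets all L bits)  */
    uint32_t unlock_addr;  /* address that unlocks it (clears all L bits)     */
} lock_entry_t;

static uint32_t line_of(uint32_t pc) { return pc & ~(LINE_SIZE - 1u); }

/* Returns true and writes the line address to prefetch; false = no prefetch. */
bool prefetch_decision(rpt_entry_t rpt[], int n, uint32_t pc, uint32_t *out)
{
    uint32_t line = line_of(pc);

    for (int i = 0; i < n; i++) {              /* RPT output has precedence              */
        if (rpt[i].trigger != line)
            continue;
        if (rpt[i].count > 0) {
            rpt[i].count--;
            if (rpt[i].large) {                /* large loop: prefetch its header again  */
                *out = rpt[i].destination;
                return true;
            }
            return false;                      /* small loop: already cached, stay quiet */
        }
        *out = line + LINE_SIZE;               /* iterations finished: resume next line  */
        return true;
    }

    *out = line + LINE_SIZE;                   /* no RPT match: sequential next-line     */
    return true;
}

/* Lock-unit check: +1 = set all L bits, -1 = clear all L bits, 0 = no action. */
int lock_decision(const lock_entry_t tab[], int n, uint32_t pc)
{
    for (int i = 0; i < n; i++) {
        if (pc == tab[i].lock_addr)   return +1;
        if (pc == tab[i].unlock_addr) return -1;
    }
    return 0;
}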

IV. DEMONSTRATION OF THE BENEFITS

In this section we illustrate the benefits of combining the prefetcher with cache locking to improve cache efficiency. Figure 2 shows the memory layout of a piece of single-path code. The code consists of seven memory blocks (c1, ..., c7). Each reference i1, i2, ... represents the address of an instruction. The example assumes a fixed instruction length and four instructions per memory block. From the generated trace we can identify the control-flow instructions (b18, b27), the entry points (i3, i26) and exit points (b18, b27) of the loops, and the loop bounds (l1, l2). In this case we assume a loop bound of two for each loop, i.e., each back edge is taken only once.

Fig. 2: Comparison of the operation with no prefetching or locking, with prefetching only, and with both prefetching and locking.

TABLE I: Reference prediction table

  Trigger line   Destination line   Count   Type
  c5             c1                 1       1
  c5             c6                 1       0

The three columns at the right side of Figure 2 show the behavior of a simple cache, a cache with prefetching, and a cache with combined prefetching and cache locking. Each line of the table presents the cache state and the outcome of the cache-line request. For simplicity, we demonstrate cache hits and misses at the cache-line level, without showing the hits generated by instructions within the same cache line. The cache is fully associative with four lines and uses the Least Recently Used (LRU) replacement policy.

The first column shows single-path code executed on a system with a simple cache. In such a system, the cache is beneficial only for requested instructions belonging to the same cache line (spatial locality) and for loops that are smaller than the cache (temporal locality). In this case, the second loop l2 benefits from the cache at the level of cache lines. If we assume that a cache hit takes one clock cycle while a cache miss takes 20 cycles, then the time spent on memory accesses for all instructions is 278 cycles.

TABLE II: Locking table

  Lock address   Unlock address
  b18            i19

The second column in Figure 2 illustrates the operation of the cache with prefetching. Loop-exit points are the only places in the example where the prefetcher has to switch from sequential to non-sequential prefetching. These parameters of the branch instructions, together with the number of loop iterations, are entered into the Reference Prediction Table (Table I). When the CPU accesses a cache line ci for the first time, the prefetcher immediately issues a request for line ci+1 (sequential prefetching) or for a target line ci+n (non-sequential prefetching). The first loop l1 is larger than the cache, so its type entry in the RPT equals one, while l2 fits into the cache and its type entry is zero. In this example, non-sequential prefetching is activated only once, when cache line c5 is issued. The prefetcher also stops completely when execution enters l2, so as not to issue requests for cache lines that are already in the cache. We observe that aggressive prefetching with accurate prediction of the next cache line can reduce the miss latency of each cache miss (Figure 2). The reduction of the miss latency depends on the time it takes for a cache line to be executed.
Again, assuming one cycle for a cache hit and 20 cycles for a cache miss, the time spent on memory accesses in this scenario is 238 cycles. The ideal case would be reached if the execution time of one cache line were equal to the fetch time of the line.

The third column in Figure 2 demonstrates the benefit of combined prefetching and cache locking. Cache locking comes into play only for loops that are larger than the cache. From the trace of the single-path code, the exit point of the loop is identified (b18) and entered into the locking table (Table II); the following instruction i19, which does not belong to the loop, is entered as the unlock point. In this example, since the cache has four lines, only cache lines c4 and c5 are locked, while the other two lines remain unlocked so that the remaining cache lines of loop l1 can still be fetched and prefetched. After the loop has finished, the request issued for i19 also unlocks all cache lines, and the cache continues to operate with the LRU replacement policy. Figure 2 shows that combining the two techniques reduces both the miss latency and the miss rate. Using the same assumptions for the cache hit and miss latencies as above, the memory access time is 208 cycles.
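For reference, the cost model used in this comparison (one-cycle hits, 20-cycle misses, a fully associative four-line LRU cache, accounted at cache-line granularity) can be reproduced with a short program such as the sketch below. The line-level trace in main() is purely illustrative and is not the trace of Figure 2, so the program does not reproduce the 278/238/208-cycle figures.

#include <stdio.h>

#define WAYS      4             /* fully associative cache with four lines     */
#define HIT_TIME  1             /* cycles per line-level hit                   */
#define MISS_TIME 20            /* cycles per line-level miss                  */

/* Memory-access time of a line-level reference trace under LRU replacement. */
int access_time(const int *trace, int len)
{
    int lines[WAYS];            /* cached line ids, lines[0] is the MRU entry  */
    int used = 0, cycles = 0;

    for (int t = 0; t < len; t++) {
        int id = trace[t], pos = -1;
        for (int i = 0; i < used; i++)
            if (lines[i] == id) { pos = i; break; }

        if (pos >= 0) {                         /* hit: move line to MRU        */
            cycles += HIT_TIME;
            for (int i = pos; i > 0; i--) lines[i] = lines[i - 1];
        } else {                                /* miss: evict LRU, insert MRU  */
            cycles += MISS_TIME;
            if (used < WAYS) used++;
            for (int i = used - 1; i > 0; i--) lines[i] = lines[i - 1];
        }
        lines[0] = id;
    }
    return cycles;
}

int main(void)
{
    /* hypothetical line-level trace over lines c1..c7 */
    int trace[] = {1, 2, 3, 4, 5, 1, 2, 3, 4, 5, 6, 7};
    printf("memory access time: %d cycles\n",
           access_time(trace, (int)(sizeof trace / sizeof trace[0])));
    return 0;
}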

V. CONCLUSION

To overcome the problem of the long execution times of single-path code, we have proposed a new memory-hierarchy organization that attempts to reduce the memory access time by bringing instructions into the cache before they are accessed, and to reduce the cache miss rate by dynamically locking the most frequently requested memory blocks. The single-path prefetching algorithm combines a sequential and a non-sequential prefetching scheme with full accuracy in the predicted instruction stream, based on the predictable properties of single-path code. Designed as a hardware solution, the prefetcher does not produce any additional timing overhead for instruction prefetching. Moreover, our solution allows the prefetcher to operate independently, i.e., prefetching does not interfere with any stage of the CPU. The dual-bank cache makes it possible to pipeline CPU and prefetcher accesses to the cache memory in order to fully utilize the memory bandwidth. By using a prefetch granularity of one cache line we eliminate the possibility of cache pollution and useless memory traffic.

Cache locking comes into play when loops larger than the cache are executed. With a simple analysis of the singleton instruction trace of a single-path program we identify the entry and exit instructions of the loops. At run time, the addresses of these entry and exit instructions are used as lock and unlock points for the cache content. The last cache lines of the banks are not lockable, in order not to interfere with the prefetcher.

As future work, we plan to thoroughly evaluate our approach and to show the feasibility of the memory hierarchy by implementing it on an FPGA platform. Furthermore, we will extend the prefetcher to input-independent if-then-else structures that are not converted to sequential code.

ACKNOWLEDGMENT

This work has been supported in part by the ARTEMIS project under grant agreement 295311 (VeTeSS) and the EU COST Action IC1202: Timing Analysis on Code Level (TACLe).

REFERENCES

[1] R. Wilhelm, J. Engblom, A. Ermedahl, N. Holsti, S. Thesing, D. Whalley, G. Bernat, C. Ferdinand, R. Heckmann, T. Mitra et al., "The worst-case execution-time problem - overview of methods and survey of tools," ACM Transactions on Embedded Computing Systems (TECS), vol. 7, no. 3, p. 36, 2008.
[2] P. Puschner and A. Burns, "Writing temporally predictable code," in Proceedings of the Seventh International Workshop on Object-Oriented Real-Time Dependable Systems (WORDS 2002). IEEE, 2002, pp. 85-91.
[3] A. J. Smith, "Cache memories," ACM Computing Surveys (CSUR), vol. 14, no. 3, pp. 473-530, 1982.
[4] D. Joseph and D. Grunwald, "Prefetching using Markov predictors," in ACM SIGARCH Computer Architecture News, vol. 25, no. 2. ACM, 1997, pp. 252-263.
[5] J. E. Smith and W.-C. Hsu, "Prefetching in supercomputer instruction caches," in Proceedings of the 1992 ACM/IEEE Conference on Supercomputing. IEEE Computer Society Press, 1992, pp. 588-597.
[6] J. Pierce and T. Mudge, "Wrong-path instruction prefetching," in Proceedings of the 29th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-29). IEEE, 1996, pp. 165-175.
[7] N. P. Jouppi, "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers," in Proceedings of the 17th Annual International Symposium on Computer Architecture. IEEE, 1990, pp. 364-373.
[8] A. J. Smith, "Sequential program prefetching in memory hierarchies," Computer, vol. 11, no. 12, pp. 7-21, 1978.
[9] H. Ding, Y. Liang, and T. Mitra, "WCET-centric dynamic instruction cache locking," in Proceedings of the Conference on Design, Automation & Test in Europe. European Design and Automation Association, 2014, p. 27.
[10] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. Elsevier, 2012.
[11] Y. Ding and W. Zhang, "Loop-based instruction prefetching to reduce the worst-case execution time," IEEE Transactions on Computers, vol. 59, no. 6, pp. 855-864, 2010.
[12] C.-K. Luk and T. C. Mowry, "Architectural and compiler support for effective instruction prefetching: a cooperative approach," ACM Transactions on Computer Systems, no. 1, pp. 71-109, 2001.
[13] M. Lee, S. L. Min, C. Y. Park, Y. H. Bae, H. Shin, and C. S. Kim, "A dual-mode instruction prefetch scheme for improved worst case and average case program execution times," in Proceedings of the Real-Time Systems Symposium. IEEE, 1993, pp. 98-105.
[14] I. Puaut, "WCET-centric software-controlled instruction caches for hard real-time systems," in Proceedings of the 18th Euromicro Conference on Real-Time Systems. IEEE, 2006, 10 pp.
[15] H. Falk, S. Plazar, and H. Theiling, "Compile-time decided instruction cache locking using worst-case execution paths," in Proceedings of the 5th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis. ACM, 2007, pp. 143-148.
[16] S. Plazar, J. C. Kleinsorge, P. Marwedel, and H. Falk, "WCET-aware static locking of instruction caches," in Proceedings of the Tenth International Symposium on Code Generation and Optimization. ACM, 2012, pp. 44-52.
[17] H. Ding, Y. Liang, and T. Mitra, "WCET-centric partial instruction cache locking," in Proceedings of the 49th ACM/EDAC/IEEE Design Automation Conference (DAC). IEEE, 2012, pp. 412-420.
[18] L. C. Aparicio, J. Segarra, C. Rodriguez, and V. Vinals, "Combining prefetch with instruction cache locking in multitasking real-time systems," in Proceedings of the 16th IEEE International Conference on Embedded and Real-Time Computing Systems and Applications (RTCSA). IEEE, 2010, pp. 319-328.
[19] J. Reineke, D. Grund, C. Berg, and R. Wilhelm, "Timing predictability of cache replacement policies," Real-Time Systems, vol. 37, no. 2, pp. 99-122, 2007.
[20] M. Schoeberl, "Time-predictable cache organization," in Proceedings of the 2009 Software Technologies for Future Dependable Distributed Systems (STFSSD '09). Washington, DC, USA: IEEE Computer Society, 2009, pp. 11-16.