A Time-Predictable Instruction-Cache Architecture that Uses Prefetching and Cache Locking


Bekim Cilku, Daniel Prokesch, Peter Puschner
Institute of Computer Engineering, Vienna University of Technology
A-1040 Wien, Austria
{bekim, daniel, peter}@vmars.tuwien.ac.at

Abstract

Trustable Worst-Case Execution-Time (WCET) bounds are a necessary component for the construction and verification of hard real-time computer systems. Deriving such bounds for contemporary hardware/software systems is a complex task. The single-path conversion overcomes this difficulty by transforming all unpredictable branch alternatives in the code into a sequential code structure with a single execution trace. However, the simpler code structure and analysis of single-path code come at the cost of a longer execution time. In this paper we address the execution performance of single-path code. We present a new cache organization that utilizes the locality properties of single-path code to reduce both the cache miss latency and the cache miss rate. The proposed cache memory architecture combines cache prefetching and cache locking, so that the prefetcher capitalizes on spatial locality while the locking makes use of temporal locality. The demonstration section shows how these two techniques complement each other.

I. INTRODUCTION

In hard real-time systems it is necessary to estimate the Worst-Case Execution Time (WCET) in order to assure that the timing requirements of critical tasks are met. To derive precise time bounds, timing analysis over the set of all possible control-flow paths of the tasks is required. Unfortunately, this process is not trivial. For typical processors with caches, branch prediction, and pipelines, the analysis has to consider the state dependencies of the hardware components as well. Keeping track of all possible states can lead to state-space explosion in the analysis. State-of-the-art WCET tools avoid this complexity by using different models of abstraction for the hardware architecture [1]. However, such an approach can result in many unclassified states which, due to the safety-critical nature of the system, may lead to the computation of a highly overestimated WCET bound.

Puschner and Burns [2] address the problem of complexity by converting conventional code into single-path code. The approach converts all input-dependent alternatives of the code into pieces of sequential code and transforms all loops with input-dependent termination conditions into loops with a constant execution count. This, in turn, eliminates all control-flow induced variations in execution time by forcing the execution to always follow the same trace of instructions. To obtain information about the timing of the code, it is sufficient to run the code only once and to identify the stream of instructions that is followed in any repeated execution.
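
To give an intuition of what this conversion does, the following sketch shows an input-dependent branch rewritten into branch-free code that executes the same instructions for every input. It is a minimal illustration of our own (the function names are hypothetical), not code taken from [2]:

```c
/* Conventional code: the executed path depends on the input value. */
int clamp_branching(int x, int limit)
{
    if (x > limit)      /* input-dependent branch */
        x = limit;
    return x;
}

/* Single-path version: the branch is replaced by a predicated update,
 * so every execution follows the same instruction trace. */
int clamp_single_path(int x, int limit)
{
    int cond = (x > limit);               /* predicate, 0 or 1       */
    return cond * limit + (1 - cond) * x; /* branch-free selection   */
}
```

In the same spirit, loops with input-dependent termination conditions are made to always run for their constant, maximal iteration count.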

The major drawback of single-path code is its potentially long execution time. Our focus in this paper is on improving the performance of systems that run single-path code.

The long latency of memory accesses is one of the key performance bottlenecks of contemporary systems. Although the use of caches is a crucial step towards improving memory performance, it is still not a complete solution. Cache misses can result in a significant performance loss by stalling the processor until the needed instruction is brought into the cache. Prefetching has been shown to be an effective solution for reducing the cache miss rate. Its purpose is to mask the large memory latency by loading instructions into the cache before they are actually needed [3]. To take full advantage of this improvement, prefetching commands have to be issued at the right time: if they are issued too late, memory latencies are only partially masked; if they are issued too early, there is the risk that the prefetched instructions will evict other useful instructions from the cache.

Prefetching algorithms can be divided into two main categories: correlated and non-correlated prefetching. Correlated prefetching associates each cache miss with some predefined target stored in a dedicated table [4], [5]; prefetchers of this kind are usually active for non-sequential streams of instructions. Non-correlated prefetchers predict the next prefetch line according to some simple algorithm [6], [7], [8]; this group is also called sequential prefetching. For both techniques, however, the ability to guess the next reference is not perfect. Thus, prefetching can result in unnecessary cache pollution and memory traffic.

Cache locking is another technique that can reduce the cache miss rate [9]. It has the ability to prevent the overwriting of some or all lines in the cache. Cache locking can be static or dynamic. Static locking loads and locks the memory blocks at program start, and the set of locked cache lines remains unchanged throughout the program execution. Dynamic locking splits the program into appropriate regions marked with so-called lock points; as the program runs through a lock point, new memory blocks are reloaded and locked. Cache locking can lock the entire cache (full locking) or only a few cache lines (partial locking). Hits to a locked line have the same timing as hits to an unlocked line.

Using the predictability of single-path traces, we propose a new technique that combines prefetching and cache locking to reduce both the miss rate and the penalty of cache misses.

On one hand, the proposed prefetching design is able to prefetch sequential and non-sequential streams of instructions with full accuracy in the value and time domains. This constitutes an effective instruction prefetching scheme that increases the execution performance of single-path code without cache pollution or useless memory traffic. On the other hand, the proposed cache locking is a dynamic mechanism that decreases the cache miss rate of the system by locking appropriate memory blocks. The two techniques complement each other: while the prefetcher exploits spatial locality, the cache locking makes use of temporal locality.

The rest of the paper is organized as follows. In Section II we review the work related to prefetching and cache locking techniques. The architecture of the proposed cache, prefetcher, and cache locking is described in Section III. Section IV presents an example that demonstrates the advantages of the new architecture. Finally, we conclude the paper in Section V.

II. RELATED WORK

Designers have proposed several prefetching strategies to increase the performance of cache-memory systems. Some of these approaches use software support to perform prefetching, while others are strictly hardware-based. Software solutions need explicit prefetching instructions in the executable code, and these instructions have to be generated by the compiler. Because this paper introduces a hardware-based solution, the following paragraphs discuss only related work on hardware-based prefetching.

The simplest form of prefetching comes passively from the cache itself. Whenever a cache miss occurs, the cache fetches the missed instruction as well as the instructions that belong to the same cache line. An extension of the cache line size implies a larger granularity: more instructions are fetched on a cache miss. The disadvantages of longer cache lines are that they take longer to fill, generate useless memory traffic, and contribute to cache pollution [10].

The first prefetch algorithms mainly focused on sequential prefetching. One-block-lookahead prefetching, later extended to next-n-line prefetching, prefetches the cache lines that follow the one currently being fetched by the CPU [8]. The scheme requires only little additional hardware to compute the next sequential addresses. Unfortunately, next-n-line prefetching is unlikely to improve performance when execution proceeds through non-sequential execution paths; in this case the prefetching guess can be incorrect. Another, similar sequential scheme is tagged prefetching, which associates a tag bit with each cache line [7]. When a line is prefetched, its tag bit is set to zero. If the line is used, the bit is set to one and the next sequential line is immediately prefetched. The stream buffer is a similar approach to tagged prefetching, except that the stream buffer stands between main memory and cache in order to avoid polluting the cache with data that may never be needed.

The target prefetching scheme is one of the first algorithms that addresses the problem of non-sequential code [5]. This approach comprises a next-line prediction table whose entries have two components: a current line address and a target line address. When the program counter changes its value, the prefetcher searches the prediction table. If there is a match, the target address is a candidate for prefetching. A hybrid scheme that combines next-line and target prefetching offers a cumulative accuracy for reducing cache misses.
However, this solution has limited effectiveness, since the predicted direction is derived from the previous execution. A similar approach is used in wrong-path prefetching, except that, instead of using a target table, this approach immediately prefetches the target of a conditional branch once the branch instruction is recognized in the decode stage [6]. This solution can be effective only for non-taken branches. The Markov prefetcher [4] prefetches multiple reference predictions from the memory subsystem by modeling the miss-reference stream as a Markov model. Similar to our prefetcher is loop-based instruction prefetching [11], except that in this solution the loop headers are always prefetched and the prefetching is issued at the end of the loop, without giving the prefetch distance. The cooperative approach [12] also considers sequential and non-sequential prefetching, using a software solution for the non-sequential part. The dual-mode instruction prefetch scheme [13] is an alternative for improving the worst-case execution time by associating a thread with each instruction block that is part of the WCET path. Threads are generated by the compiler and remain static during task execution.

On the other hand, cache locking is considered an appropriate technique to improve the predictability of the cache in order to derive more accurate WCET bounds. The first algorithms for choosing the memory-block candidates to be locked were based on heuristics [14], [15]. It was shown that these approaches were able to reduce the WCET bound, even if the reduction was not optimal. Later, integer-linear programming (ILP) was used as a tool to find the cache lines with the highest impact on the cache miss rate of the whole code. Plazar et al. [16] use ILP results for statically locking the entire cache; all instructions that are not part of the locked memory blocks are classified as cache misses. Ding et al. [17] show that static partial locking of the cache achieves a lower miss rate than full cache locking. In [9], Ding et al. improve their approach by using ILP for memory-block selection and dynamic locking. To the best of our knowledge, [18] is the only work that combines prefetching with cache locking, for a multitask system. The prefetcher uses two buffers, one for fetching and the other for prefetching with the tagged algorithm. A separate lockable instruction cache is preloaded at context switches such that each task can utilize the whole cache, and lines are locked in a way that minimizes the worst overall execution cost.

There have been various other approaches that address the time-predictability of the memory hierarchy. Reineke et al. [19] have analyzed the predictability of caches with respect to their replacement policy; they have defined metrics to capture aspects of cache-state predictability and evaluated the predictability of common replacement policies. Schoeberl [20] proposes a cache organization that consists of several separate caches for different data areas, which can be analyzed independently to derive tight WCET bounds. In particular, he proposes an instruction cache in which whole functions are loaded into the cache on a function call or return. Cache misses only occur at these instructions, and only the call tree of a program needs to be considered for cache analysis.

In contrast to those approaches, we base our efforts on the inherently predictable execution of single-path programs, where every execution follows the same sequence of instructions. On one hand, this predictability comes at an additional cost in terms of the number of instructions fetched in a single execution. On the other hand, the exact knowledge of the actual instruction sequence provides additional possibilities for optimization, which we try to exploit in our approach.

III. HARDWARE ARCHITECTURE

This section presents the hardware architecture of the cache with its prefetcher and the cache locking.

A. Architecture of the Cache Memory

Caches are small and fast memories that are used to bridge the performance gap between processors and main memory, based on the principle of locality. The property of locality can be observed in the temporal and spatial behavior of the execution. Temporal locality means that the code that is executed at the moment is likely to be referenced again in the near future. This type of behavior is expected from program loops, in which both data and instructions are reused. Spatial locality means that instructions and data whose addresses are close to each other tend to be referenced in temporal proximity, because instructions are mostly executed sequentially [3].

As an application is executed over time, the CPU references the memory by sending memory addresses. Referenced instructions that are found in the cache are called hits, while those that are not in the cache are called misses. Usually the processor stalls on a cache miss until the instructions have been fetched from main memory. The time needed for transferring the instructions from main memory into the cache is called the miss penalty. To benefit from the spatial locality of the code, the cache always fetches one or more chunks of instructions, called cache blocks, and places them into the cache lines. The very first reference to a memory block always generates a cache miss (compulsory miss). Also, when the cache is full, some instructions must be evicted to create space for the incoming ones. If an evicted instruction is referenced again, a cache miss occurs (capacity miss) [10].

Figure 1 shows the cache memory augmented with a prefetcher and cache locking. The cache has two banks, where cache lines consist of tag, data, prefetch bit P, and lock bit L entries. The separation of the cache into two banks allows us to overlap the process of fetching (by the CPU) with prefetching (by the prefetch unit). At any time, one bank is servicing the CPU with instructions that are already in the cache, while the other bank is used by the prefetcher to bring instructions from main memory into the cache. Whenever the program counter (PC) changes its value, the value is sent to the cache and to the prefetcher. Both the CPU and the prefetcher can issue requests to the cache memory at the same time. The difference is that the CPU requests the currently needed instruction, while the prefetcher issues requests for the next cache line to exploit spatial locality. When the prefetcher issues a request, the tag entry of the cache line is updated and the P bit of the associated cache line is set to zero. At the moment the prefetched line arrives in the cache, the P bit is set to one. The P bit prevents the cache from duplicating requests to main memory when the same request has already been issued by the prefetcher.

Fig. 1: Proposed cache architecture augmented with a prefetcher and cache locking.

There are three possible scenarios when a CPU request is issued:
- No match within the tag columns: the instruction is not in the cache. The cache stalls the processor and forwards the address request to main memory. In this case the CPU stall time is equal to the miss penalty.
- Tag match, P bit is zero: the instruction is not yet in the cache, but the prefetcher has already sent the request for that cache line and the fetch is in progress. The cache stalls the processor and waits for the ongoing prefetch operation to finish. In this case the CPU stall time is shorter than the miss penalty.
- Tag match, P bit is one: the instruction is already in the cache (cache hit).
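
These three outcomes can be summarized in a few lines of C. The sketch below only illustrates the intended behavior; the type and constant names (cache_line_t, MISS_PENALTY, and so on) are our own and do not correspond to an actual implementation:

```c
#include <stdbool.h>
#include <stdint.h>

#define MISS_PENALTY     20   /* cycles, as assumed in Section IV         */
#define CACHE_LINE_BYTES 16   /* e.g., four 32-bit instructions per line  */

typedef struct {
    uint32_t tag;                        /* line address tag               */
    bool     p;                          /* prefetch bit: data has arrived */
    bool     l;                          /* lock bit: excluded from replacement */
    uint8_t  data[CACHE_LINE_BYTES];
} cache_line_t;

/* Stall cycles for one CPU request, given the line selected by the address
 * and the cycles an in-flight prefetch of that line still needs.           */
int cpu_request_stall(const cache_line_t *line, uint32_t addr_tag,
                      int prefetch_cycles_left)
{
    if (line == NULL || line->tag != addr_tag)
        return MISS_PENALTY;             /* 1) no tag match: full miss     */
    if (!line->p)
        return prefetch_cycles_left;     /* 2) prefetch in flight: wait    */
    return 0;                            /* 3) P bit is one: cache hit     */
}
```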

The lock bit L is used to enable or disable the replacement policy. The last lines of the banks have no L bit; thus they can always be used by the prefetcher. So, while the locked cache lines exploit temporal locality, one of the unlockable cache lines takes advantage of the spatial locality inside that cache line (servicing the CPU), and the other unlockable cache line makes use of spatial locality outside the cache line (being filled by the prefetcher).

B. Prefetching for Single-Path Code

The prefetching algorithm for single-path code considers two forms of prefetching: sequential and non-sequential prefetching. Sequentially executed instruction streams are a trivial pattern to predict, because the address of the next prefetch is simply an increment of the current line address. A simple next-line prefetcher [8] is a suitable solution for such a pattern of instruction execution. In contrast, a non-sequential prefetcher needs some source of information to determine the target address to be prefetched. Single-path code has a strong advantage in this part of the prefetching, because the target of every branching instruction is statically known. This target information can be given to the prefetcher in the form of instructions (software prefetching), or it can be kept in a local memory (organized as a table) and used by the prefetcher when needed (hardware prefetching). For the software solution, special prefetch instructions are needed and the CPU hardware has to be modified. In order to keep the development of the prefetcher independent from the CPU and the compiler, and also to avoid the overhead of executing prefetch instructions in software, we have decided to use a hardware solution for the single-path prefetcher.

Since single-path code consists of serial segments and loops, only the branch instructions of loops need to be handled by the non-sequential algorithm. Loops that are larger than the cache size are easily handled by the prefetching algorithm: the loop body is prefetched with the sequential algorithm, while the non-sequential algorithm is used for the loop header. Loops that fully fit into the cache do not need to be prefetched in each iteration. These loops are identified, and the prefetcher does not generate any prefetching requests for them except for the last iteration, when the execution stream exits the loop.
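
As a sketch of how this identification could be done offline, the singleton trace can be scanned for loop entries and backward branches, and each loop compared against the cache capacity. The function below is our own approximation under simplified assumptions (fixed line size, one entry per loop, hypothetical names); its result corresponds to the fields of the Reference Prediction Table introduced in Section III-B.1:

```c
#include <stdbool.h>
#include <stdint.h>

#define CACHE_LINES 4        /* cache capacity in lines (as in Section IV) */
#define LINE_BYTES  16       /* four fixed-length instructions per line    */

typedef struct {
    uint32_t trigger_line;       /* line holding the loop's backward branch */
    uint32_t dest_line;          /* line holding the loop entry              */
    uint32_t count;              /* constant iteration count of the loop     */
    bool     larger_than_cache;  /* 'type' field: 1 if the loop does not fit */
} rpt_entry_t;

/* Build the table entry for one loop found in the singleton trace. */
rpt_entry_t classify_loop(uint32_t entry_addr, uint32_t branch_addr,
                          uint32_t iterations)
{
    rpt_entry_t e;
    uint32_t loop_lines = branch_addr / LINE_BYTES
                        - entry_addr  / LINE_BYTES + 1;

    e.trigger_line      = branch_addr / LINE_BYTES;
    e.dest_line         = entry_addr  / LINE_BYTES;
    e.count             = iterations;
    e.larger_than_cache = (loop_lines > CACHE_LINES);
    return e;
}
```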

The granularity of prefetching is defined as one cache line. For larger amounts of prefetched instructions, the probability of overshooting the end of a sequence would increase, resulting in cache pollution through useless prefetching. The granularity also determines that the prefetch distance is one cache line ahead.

1) Reference Prediction Table: The Reference Prediction Table (RPT) is the part of the prefetcher that holds information about the instruction stream (Figure 1). The RPT entries consist of a trigger line, a destination line, a count, and a type column. The trigger line refers to the set of program-counter addresses that trigger the non-sequential algorithm of the prefetcher. The destination line is the target address that is prefetched. Since loops in single-path code have a constant number of iterations, the count field informs the prefetcher how many times the target address should be prefetched. The type field indicates which loops fit into the cache and which loops are larger than the cache. If the value of type is zero, the prefetcher takes no action, since the loop is smaller than the cache and completely contained in it. When the counter of that loop reaches zero, the loop iterations are finished and the prefetcher triggers the prefetching of the next cache line. The RPT is populated by analyzing the (singleton) instruction trace of the single-path program. The analysis identifies the loops (loop entries and backward branches), the number of iterations, and the size of each loop.

2) Architecture of the Prefetcher: As shown in Figure 1, the prefetch hardware for single-path code consists of the Reference Prediction Table (RPT), the next-line prefetcher, and the prefetch controller (a state machine). The next-line prefetcher is responsible for prefetching the sequential parts of the code, while the state machine, in association with the RPT, is used for prefetching distant targets. At run time, when a new address is generated, its value is passed to the RPT and to the next-line prefetcher. When the PC value matches an entry in the RPT, the prefetch controller reads the type bit and the counter value to check whether the loop is smaller or larger than the cache. The prefetch controller also checks on each iteration whether the final iteration has been reached. If there is no match with an RPT entry, the next-line prefetcher increments the address by one cache line and issues that address to the cache. The RPT output has precedence over next-line prefetching.
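
The interplay of the next-line prefetcher, the RPT, and the controller can be read from the following per-address decision sketch. This is our own simplified rendering of the described behavior (exact trigger addresses instead of trigger-line sets, simplified counter handling, hypothetical names), not the authors' hardware design:

```c
#include <stdbool.h>
#include <stdint.h>

#define LINE_BYTES  16
#define NO_PREFETCH UINT32_MAX   /* sentinel: issue no prefetch request */

typedef struct {
    uint32_t trigger_pc;         /* address of the loop's backward branch */
    uint32_t dest_line;          /* loop entry line to prefetch           */
    uint32_t count;              /* remaining iterations of the loop      */
    bool     larger_than_cache;  /* 'type' field of the RPT entry         */
} rpt_entry_t;

/* Decide, for one PC update, which cache line (if any) to prefetch.
 * Returns a line number, or NO_PREFETCH.                              */
uint32_t prefetch_decision(uint32_t pc, rpt_entry_t *rpt, int entries)
{
    static bool in_resident_loop = false;   /* controller state: a loop
                                             * that fits is executing    */
    uint32_t cur_line = pc / LINE_BYTES;

    for (int i = 0; i < entries; i++) {
        if (rpt[i].trigger_pc != pc)
            continue;                       /* RPT has precedence on match */
        if (rpt[i].count > 1) {             /* another iteration follows   */
            rpt[i].count--;
            if (rpt[i].larger_than_cache)
                return rpt[i].dest_line;    /* prefetch the loop header    */
            in_resident_loop = true;        /* loop is in cache: stay idle */
            return NO_PREFETCH;
        }
        in_resident_loop = false;           /* final iteration reached     */
        return cur_line + 1;                /* resume next-line prefetch   */
    }
    if (in_resident_loop)
        return NO_PREFETCH;                 /* suppress useless requests   */
    return cur_line + 1;                    /* default: next-line prefetch */
}
```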

C. Cache Locking

The cache locking mechanism is very simple. It uses a table with two types of entries: the first column holds the addresses used to lock the cache, while the second column holds the addresses used to unlock the cache. During code execution, all addresses generated by the PC are also transmitted to the lock unit. When a lock address is detected, the lock unit sends a lock signal and sets all L bits of the cache to one. This disables line replacement in the cache. The cache holds this state until the unlock address is detected and the unlock signal is sent to all L bits to set them to zero.
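
The lock unit itself can be captured in a few lines. Again, this is only a hedged sketch with invented names (lock_entry_t, lock_unit_update), not the authors' implementation:

```c
#include <stdbool.h>
#include <stdint.h>

typedef struct {
    uint32_t lock_pc;     /* address whose fetch freezes the cache content */
    uint32_t unlock_pc;   /* address whose fetch re-enables replacement    */
} lock_entry_t;

/* Called for every address the PC generates; returns the new state of the
 * global lock signal (true = all L bits set, replacement disabled).        */
bool lock_unit_update(uint32_t pc, const lock_entry_t *table, int entries,
                      bool locked)
{
    for (int i = 0; i < entries; i++) {
        if (pc == table[i].lock_pc)
            return true;     /* lock signal: set all L bits to one         */
        if (pc == table[i].unlock_pc)
            return false;    /* unlock signal: clear all L bits            */
    }
    return locked;           /* no lock point crossed: keep current state  */
}
```

For the example in Section IV, the table would hold a single row, locking at b18 and unlocking at i19 (Table II).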

IV. DEMONSTRATION OF THE BENEFITS

In this section we illustrate the benefits of combining the prefetcher with cache locking in order to improve the cache efficiency. Figure 2 shows a memory layout of single-path code. The code consists of seven memory blocks (c1, ..., c7). Each reference i_n represents the address of an instruction. The example assumes that the instruction length is fixed and that each memory block holds four instructions. From the generated trace we can identify the control-flow instructions (b18, b27), the entry (i3, i26) and exit (b18, b27) points of the loops, and the loop bounds of l1 and l2. In this case we assume that each loop iterates only once (the loop bound is two).

Fig. 2: Comparison of the operation with no prefetching or locking, with prefetching only, and with both prefetching and locking.

TABLE I: Reference prediction table
Trigger line | Destination line | Count | Type
c5           | c1               | 2     | 1
c7           | c7               | 2     | 0

The three columns at the right side of Figure 2 show the behavior of a simple cache, a cache with prefetching, and a cache with combined prefetching and cache locking. Each line presents the cache state and the outcome of the cache-line request. For simplicity, we demonstrate the cache hits and misses at cache-line level, without showing the hits generated by instructions within the same cache line. The cache is fully associative with four lines and uses the Least Recently Used (LRU) replacement policy.

The first column shows the behavior of single-path code executed in a system with a simple cache. In such a system, the cache is beneficial only for requested instructions belonging to the same cache line (spatial locality) and for loops that are smaller than the cache size (temporal locality). In this case, the second loop l2 benefits from the cache at the level of cache lines. If we assume that a cache hit takes one clock cycle while a cache miss takes 20 cycles, then the time spent on memory accesses for all instructions is 278 cycles.

TABLE II: Locking table
Lock address | Unlock address
b18          | i19

The second column in Figure 2 illustrates the operation of the cache with prefetching. Loop-exit points are the only places in the example where the prefetcher has to switch from sequential to non-sequential prefetching. These parameters, attached to the branch instructions, together with the number of iterations of the loops, are entered into the Reference Prediction Table (Table I). When the CPU accesses cache line c_i for the first time, the prefetcher immediately issues a request for line c_(i+1) in the case of sequential prefetching, or for line c_(i+n) in the case of non-sequential prefetching. The first loop l1 is larger than the cache size and therefore its type entry in the RPT is one, while l2 fits into the cache and its type entry is zero. In this example, non-sequential prefetching is activated only once, when cache line c5 is issued. The prefetcher also fully stops when the execution enters l2, in order not to issue requests for cache lines that are already in the cache. We observe that aggressive prefetching with accurate prediction of the next cache line can reduce the miss latency for each cache miss (Figure 2). The reduction of the cache miss latency depends on the time that it takes for a cache line to be executed. Again assuming one cycle for a cache hit and 20 cycles for a cache miss, the time spent on memory accesses in this scenario is 238 cycles. The ideal case would be reached if the execution time of one cache line were equal to the fetch time of the line.

The third column in Figure 2 demonstrates the benefit of combined prefetching and cache locking. Cache locking comes into play only for loops that are larger than the cache size. From the trace of the single-path code, the exit point of the loop is identified (b18) and entered into the locking table (Table II). Also, the following instruction i19, which does not belong to the loop, is entered as the unlock point. In this example, since the cache has a size of four lines, only cache lines c4 and c5 are locked, while the other two lines remain unlocked to fetch/prefetch the remaining cache lines that are part of the same loop l1. After the loop execution, the issued request for i19 is also used to unlock all cache lines, and the cache continues to operate with the LRU replacement policy. Figure 2 shows that combining the two techniques can improve the miss latency and also reduce the miss rate. Using the same assumptions for cache hit and miss latencies as above, the memory access time is 208 cycles.

V. CONCLUSION

To overcome the problem of long execution times of single-path code, we have proposed a new memory hierarchy organization that attempts to reduce the memory access time by bringing instructions into the cache before they are accessed, and to reduce the cache miss rate by dynamically locking the most frequently requested memory blocks. The single-path prefetching algorithm combines a sequential and a non-sequential prefetching scheme with full accuracy in the predicted instruction stream, based on the predictable properties of single-path code. Designed as a hardware solution, the prefetcher does not produce any additional timing overhead for the instruction prefetching. Moreover, our solution allows the prefetcher to operate independently, i.e., prefetching does not interfere with any stage of the CPU. The dual-bank cache makes it possible to pipeline the CPU and prefetcher accesses to the cache memory in order to fully utilize the memory bandwidth. By using a prefetch granularity of one cache line we eliminate the possibility of cache pollution and useless memory traffic. Cache locking comes into play when loops larger than the cache size are executed. With a simple analysis of the singleton instruction trace of a single-path program we identify the entry and exit instructions of the loops. During runtime, the addresses of these entry and exit instructions are used as lock/unlock points for the cache content.

The last cache lines of the banks of the cache are not lockable, in order not to interfere with the prefetcher. As future work, we plan to thoroughly evaluate our approach and to show the feasibility of the memory hierarchy by implementing it on an FPGA platform. Furthermore, we will extend the prefetcher to input-independent if-then-else structures that are not converted to sequential code.

ACKNOWLEDGMENT

This work has been supported in part by the ARTEMIS project VeTeSS and by the EU COST Action IC1202: Timing Analysis on Code Level (TACLe).

REFERENCES

[1] R. Wilhelm, J. Engblom, A. Ermedahl, N. Holsti, S. Thesing, D. Whalley, G. Bernat, C. Ferdinand, R. Heckmann, T. Mitra et al., "The worst-case execution-time problem: overview of methods and survey of tools," ACM Transactions on Embedded Computing Systems (TECS), vol. 7, no. 3, p. 36.
[2] P. Puschner and A. Burns, "Writing temporally predictable code," in Object-Oriented Real-Time Dependable Systems (WORDS 2002), Proceedings of the Seventh International Workshop on. IEEE, 2002.
[3] A. J. Smith, "Cache memories," ACM Computing Surveys (CSUR), vol. 14, no. 3.
[4] D. Joseph and D. Grunwald, "Prefetching using Markov predictors," in ACM SIGARCH Computer Architecture News, vol. 25, no. 2. ACM, 1997.
[5] J. E. Smith and W.-C. Hsu, "Prefetching in supercomputer instruction caches," in Proceedings of the 1992 ACM/IEEE Conference on Supercomputing. IEEE Computer Society Press, 1992.
[6] J. Pierce and T. Mudge, "Wrong-path instruction prefetching," in Microarchitecture, MICRO-29, Proceedings of the 29th Annual IEEE/ACM International Symposium on. IEEE, 1996.
[7] N. P. Jouppi, "Improving direct-mapped cache performance by the addition of a small fully-associative cache and prefetch buffers," in Computer Architecture, Proceedings, 17th Annual International Symposium on. IEEE, 1990.
[8] A. J. Smith, "Sequential program prefetching in memory hierarchies," Computer, vol. 11, no. 12, pp. 7-21.
[9] H. Ding, Y. Liang, and T. Mitra, "WCET-centric dynamic instruction cache locking," in Proceedings of the Conference on Design, Automation & Test in Europe. European Design and Automation Association, 2014, p. 27.
[10] J. L. Hennessy and D. A. Patterson, Computer Architecture: A Quantitative Approach. Elsevier.
[11] Y. Ding and W. Zhang, "Loop-based instruction prefetching to reduce the worst-case execution time," Computers, IEEE Transactions on, vol. 59, no. 6.
[12] C.-K. Luk and T. C. Mowry, "Architectural and compiler support for effective instruction prefetching: a cooperative approach," ACM Transactions on Computer Systems, no. 1.
[13] M. Lee, S. L. Min, C. Y. Park, Y. H. Bae, H. Shin, and C. S. Kim, "A dual-mode instruction prefetch scheme for improved worst case and average case program execution times," in Real-Time Systems Symposium, 1993, Proceedings. IEEE, 1993.
[14] I. Puaut, "WCET-centric software-controlled instruction caches for hard real-time systems," in Euromicro Conference on Real-Time Systems. IEEE, 2006.
[15] H. Falk, S. Plazar, and H. Theiling, "Compile-time decided instruction cache locking using worst-case execution paths," in Proceedings of the 5th IEEE/ACM International Conference on Hardware/Software Codesign and System Synthesis. ACM, 2007.
[16] S. Plazar, J. C. Kleinsorge, P. Marwedel, and H. Falk, "WCET-aware static locking of instruction caches," in Proceedings of the Tenth International Symposium on Code Generation and Optimization. ACM, 2012.
[17] H. Ding, Y. Liang, and T. Mitra, "WCET-centric partial instruction cache locking," in Design Automation Conference (DAC), ACM/EDAC/IEEE. IEEE, 2012.
[18] L. C. Aparicio, J. Segarra, C. Rodriguez, and V. Vinals, "Combining prefetch with instruction cache locking in multitasking real-time systems," in Embedded and Real-Time Computing Systems and Applications (RTCSA), 2010 IEEE 16th International Conference on. IEEE, 2010.
[19] J. Reineke, D. Grund, C. Berg, and R. Wilhelm, "Timing predictability of cache replacement policies," Real-Time Systems, vol. 37, no. 2.
[20] M. Schoeberl, "Time-predictable cache organization," in Proceedings of the 2009 Software Technologies for Future Dependable Distributed Systems, ser. STFSSD '09. Washington, DC, USA: IEEE Computer Society, 2009.


More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

AN OVERVIEW OF HARDWARE BASED CACHE OPTIMIZATION TECHNIQUES

AN OVERVIEW OF HARDWARE BASED CACHE OPTIMIZATION TECHNIQUES AN OVERVIEW OF HARDWARE BASED CACHE OPTIMIZATION TECHNIQUES Swadhesh Kumar 1, Dr. P K Singh 2 1,2 Department of Computer Science and Engineering, Madan Mohan Malaviya University of Technology, Gorakhpur,

More information

Dynamic Branch Prediction

Dynamic Branch Prediction #1 lec # 6 Fall 2002 9-25-2002 Dynamic Branch Prediction Dynamic branch prediction schemes are different from static mechanisms because they use the run-time behavior of branches to make predictions. Usually

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás

CACHE MEMORIES ADVANCED COMPUTER ARCHITECTURES. Slides by: Pedro Tomás CACHE MEMORIES Slides by: Pedro Tomás Additional reading: Computer Architecture: A Quantitative Approach, 5th edition, Chapter 2 and Appendix B, John L. Hennessy and David A. Patterson, Morgan Kaufmann,

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches

Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Highly-Associative Caches Improving Cache Performance and Memory Management: From Absolute Addresses to Demand Paging 6.823, L8--1 Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Highly-Associative

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 3. Instruction-Level Parallelism and Its Exploitation Computer Architecture A Quantitative Approach, Fifth Edition Chapter 3 Instruction-Level Parallelism and Its Exploitation Introduction Pipelining become universal technique in 1985 Overlaps execution of

More information

A Two-Expert Approach to File Access Prediction

A Two-Expert Approach to File Access Prediction A Two-Expert Approach to File Access Prediction Wenjing Chen Christoph F. Eick Jehan-François Pâris 1 Department of Computer Science University of Houston Houston, TX 77204-3010 tigerchenwj@yahoo.com,

More information

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017

Advanced Parallel Architecture Lessons 5 and 6. Annalisa Massini /2017 Advanced Parallel Architecture Lessons 5 and 6 Annalisa Massini - Pipelining Hennessy, Patterson Computer architecture A quantitive approach Appendix C Sections C.1, C.2 Pipelining Pipelining is an implementation

More information

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance

Cache Performance and Memory Management: From Absolute Addresses to Demand Paging. Cache Performance 6.823, L11--1 Cache Performance and Memory Management: From Absolute Addresses to Demand Paging Asanovic Laboratory for Computer Science M.I.T. http://www.csg.lcs.mit.edu/6.823 Cache Performance 6.823,

More information

Cache Optimisation. sometime he thought that there must be a better way

Cache Optimisation. sometime he thought that there must be a better way Cache sometime he thought that there must be a better way 2 Cache 1. Reduce miss rate a) Increase block size b) Increase cache size c) Higher associativity d) compiler optimisation e) Parallelism f) prefetching

More information

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken

Branch statistics. 66% forward (i.e., slightly over 50% of total branches). Most often Not Taken 33% backward. Almost all Taken Branch statistics Branches occur every 4-7 instructions on average in integer programs, commercial and desktop applications; somewhat less frequently in scientific ones Unconditional branches : 20% (of

More information

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome

Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Thoai Nam Pipelining concepts The DLX architecture A simple DLX pipeline Pipeline Hazards and Solution to overcome Reference: Computer Architecture: A Quantitative Approach, John L Hennessy & David a Patterson,

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

Transforming Execution-Time Boundable Code into Temporally Predictable Code

Transforming Execution-Time Boundable Code into Temporally Predictable Code Transforming Execution-Time Boundable Code into Temporally Predictable Code Peter Puschner Institut for Technische Informatik. Technische Universitdt Wien, Austria Abstract: Traditional Worst-Case Execution-Time

More information

A Survey of Data Prefetching Techniques

A Survey of Data Prefetching Techniques A Survey of Data Prefetching Techniques Technical Report No: HPPC-96-05 October 1996 Steve VanderWiel David J. Lilja High-Performance Parallel Computing Research Group Department of Electrical Engineering

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

Advanced optimizations of cache performance ( 2.2)

Advanced optimizations of cache performance ( 2.2) Advanced optimizations of cache performance ( 2.2) 30 1. Small and Simple Caches to reduce hit time Critical timing path: address tag memory, then compare tags, then select set Lower associativity Direct-mapped

More information

Master Thesis On. Instruction Prefetching Techniques for Ultra Low-Power Multicore Architectures

Master Thesis On. Instruction Prefetching Techniques for Ultra Low-Power Multicore Architectures ALMA MATER STUDIORUM - UNIVERSITÀ DI BOLOGNA Engineering and Architecture School. Electronic Engineering. Electronic and Communications Science and Technology Master Program. Master Thesis On Hardware-Software

More information

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System HU WEI CHEN TIANZHOU SHI QINGSONG JIANG NING College of Computer Science Zhejiang University College of Computer Science

More information