A Dynamic Instruction Scratchpad Memory for Embedded Processors Managed by Hardware


Stefan Metzlaff 1, Irakli Guliashvili 1, Sascha Uhrig 2, and Theo Ungerer 1

1 Department of Computer Science, University of Augsburg, Germany
{metzlaff,guliashvili,ungerer}@informatik.uni-augsburg.de
2 Robotics Research Institute, TU Dortmund, Germany
sascha.uhrig@tu-dortmund.de

This work has been supported by the EC Grant Agreement n. (MERASA).

Abstract. This paper proposes a hardware-managed instruction scratchpad on the granularity of functions that is designed for real-time systems. It guarantees that every instruction is fetched from the local, fast, and timing-predictable scratchpad memory. Thus, a predictable behavior is reached that eases a precise timing analysis of the system. We estimate the hardware resources required to implement the dynamic instruction scratchpad on an FPGA. An evaluation quantifies the impact of our scratchpad on average-case performance. It shows that the dynamic instruction scratchpad has a reasonable performance compared to standard instruction memories - while providing predictable behavior and easing timing analysis.

1 Introduction

Embedded systems in safety-critical application domains like automotive and avionic applications are subject to hard real-time (HRT) constraints. In HRT systems missing a deadline can cause serious system breakdowns that may harm the system or even human beings. Therefore, the timing of an HRT system has to be analyzed before it can be deployed. To be sure that the deadlines are met in all cases, independent of input values and system states, the worst-case execution time (WCET) has to be determined. To estimate the WCET of an application two methods [1] can be applied: static or measurement-based WCET analysis. Both WCET analysis techniques have to consider the whole system, including the processor and the memory system. Therefore, a predictable memory access is crucial for an HRT system. In memories like caches it is complex to determine whether a memory access will lead to a cache hit or miss [2]. If a cache hit is uncertain, a cache miss has to be assumed to assure that the calculated WCET is not underestimated. Thus, an upper bound of the memory access time can be estimated for caches, but it will be pessimistic. In the worst case the WCET analysis has to ignore the cache.

The content of a scratchpad is usually defined by the application designer at compile-time. This simplifies the analysis of timing behavior, because it is defined

what data is located in which memory. Hence, the usage of scratchpads with a static assignment of data or instructions allows exact timing estimations for the memory accesses and eases the computation of the WCET. The drawback of static assignment is that the scratchpad memory content cannot change during run-time. In contrast to caches, the memory utilization of statically assigned scratchpads is poor.

Besides the type of the memory, it is also important whether and how memory operations can interfere with each other. If a memory request can be delayed by another resource (e.g. when the instruction fetch unit and the load/store unit access a common memory controller), the complexity of a precise WCET analysis escalates. And if such memory interferences are not handled correctly by the analysis, the calculated WCET might be underestimated. In this paper we propose the Dynamic Instruction ScratchPad (D-ISP), which loads functions dynamically on demand, provides a predictable instruction fetch, and addresses the problem of memory interferences. By a two-phased execution scheme the D-ISP prevents memory interferences between instruction and data path and thus eases a precise WCET analysis.

The paper is structured as follows: The next section discusses related work. Section 3 describes architectural characteristics of the D-ISP and its implementation. Section 4 evaluates the impact of the scratchpad on average-case performance and Section 5 discusses the hardware requirements of the D-ISP. The last section concludes the paper and presents an outlook to future work.

2 Related Work

Scratchpad memories are mainly used in embedded systems to reduce energy consumption or to lower the WCET of HRT applications. Both goals can be reached because scratchpads (usually SRAMs) have a low and constant memory latency and do not need an energy-consuming tag memory like caches. The content of scratchpads can be assigned either statically, or it is managed during run-time by software or, like we propose, by hardware.

To statically assign code to the scratchpad, e.g. the most frequently used instructions are selected to reduce the energy consumption [3] or the WCET [4]. In [5] the authors reduce the WCET by statically locating basic blocks in an instruction scratchpad at compile-time. In contrast to static scratchpads, software-managed scratchpads allow changing the content during run-time, as proposed in [6,7]. Egger et al. [8] combine the static and software-managed approaches: a function can be statically located in the scratchpad or in the external memory, or the function is loaded dynamically on demand into the scratchpad. For functions that are copied on demand, a page manager handles the content lookup and the function copying. The decision which function will be placed in the scratchpad is made by an optimization algorithm to reduce the energy consumption. Since the page manager is implemented in software, the performance drawback will be higher than for a hardware-controlled solution like the one we propose. Moreover, this overhead must be taken into account during the WCET analysis.

Janapsatya et al. [9] describe a managed scratchpad that is optimized with regard to memory utilization to reduce energy consumption. The scratchpad contains basic blocks that are selected based on a temporal proximity metric: basic blocks that are executed consecutively are assigned to the scratchpad together at the same time. A scratchpad controller implemented in hardware is triggered by special copy instructions and loads the selected blocks into the scratchpad. In contrast to the D-ISP, the scratchpad proposed in [9] is not supposed to hide the memory hierarchy from the processor's fetch path. Using the D-ISP, every fetch is directed to it. This ensures that fetches do not interfere with other memory accesses, which is important for a precise WCET estimation.

Caches as hardware-managed memories are also used in real-time systems. But because a precise cache analysis is complicated for common caches, different techniques are used in real-time systems. To make caches predictable, the cache content can be locked [10]. Thus, the cache eviction can be controlled such that the analysis is simplified. In [11] a cache-based code partitioning for functions to decrease the WCET is proposed. A WCET-aware compiler decides which functions are put in a cacheable memory region and which bypass the cache. Thus, the miss rate of the cache for the worst-case path can be reduced and the WCET of the application is decreased.

Another approach based on functions, like the proposed D-ISP, is the predictable method cache by Schoeberl [12,13]. The method cache uses complete methods as replacement granularity. The proposed cache structure binds a memory block to its cache tag. So the usage of smaller memory blocks to improve memory density leads to a high number of cache tags, which causes either a slow or a hardware-intensive hit detection. In the D-ISP the scratchpad content is decoupled from the lookup tables, so the complexity of hit detection is restricted to the number of entries of the lookup tables only. Preußer et al. [14] address the complexity problem of a fine-grained method cache implementation and show an implementation with a stack-based replacement policy.

3 The Dynamic Instruction Scratchpad

The idea of the D-ISP is to bind all instructions of one function together and load them into a fast on-chip scratchpad at once. Thus, it is ensured that every instruction of the active function is held in the scratchpad before the function is executed. So while a function is executed, every instruction is fetched from the D-ISP and no instruction memory access on any other level of the memory hierarchy is needed. This fact is of importance, because any fetch that leads to a shared memory level, like the off-chip memory shown in Figure 1(a), will not only disrupt the timing of the execution (by waiting on the instructions), it also interferes with data memory accesses. If memory interferences are possible, a complex and detailed integrated pipeline and memory system analysis has to be applied to obtain a safe WCET. Otherwise, for each memory access an additional delay caused by interferences has to be assumed. But this pessimistic approach impairs the tightness of the estimated WCET. By eliminating memory

interferences of instruction and data memory accesses, a precise WCET analysis with reduced effort is possible.

Fig. 1. D-ISP block diagrams: (a) D-ISP overview; (b) detailed D-ISP block diagram with memories

The D-ISP precludes the interferences between data and instruction memory accesses by guaranteeing that all instructions of the active function are located in the D-ISP. This leads to a two-phased execution behavior: either the pipeline is stalled because a function is loaded into the D-ISP, or the pipeline executes a function.
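The paper does not spell out the two-phased scheme in code; the following C++ sketch is our own minimal illustration of it (all type and member names are hypothetical). The controller is either loading a function, during which every fetch request stalls, or executing, during which no instruction traffic reaches the shared memory level:

```cpp
#include <cstdint>

// Hypothetical sketch of the two-phased execution behavior: the D-ISP is
// either loading a function (the pipeline is stalled, so no data accesses
// compete with the load) or executing (every fetch hits the scratchpad).
enum class Phase { Loading, Executing };

struct TwoPhaseController {
    Phase phase = Phase::Executing;
    uint32_t loadCyclesLeft = 0;  // remaining cycles of an ongoing function load

    // Called on a D-ISP miss: the whole function load stalls execution.
    void startLoad(uint32_t cycles) {
        phase = Phase::Loading;
        loadCyclesLeft = cycles;  // assumed >= 1
    }

    // Advance one clock cycle; returns true iff the pipeline may fetch.
    bool tick() {
        if (phase == Phase::Loading && --loadCyclesLeft == 0)
            phase = Phase::Executing;
        return phase == Phase::Executing;
    }
};
```

Because the two phases never overlap, an analysis can bound the load phase from the function size alone and treat the execution phase as interference-free.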

3.1 D-ISP Architecture

The D-ISP is located on-chip near the fetch stage of the processor pipeline. It handles the fetch requests from the processor like a common scratchpad. The D-ISP requires control signals (see Figure 1(a)) from the pipeline to be notified of control-flow changes on calls and returns. Therefore, the host processor needs minor changes in logic and signal routing: if the pipeline executes call or return instructions, the D-ISP has to be informed. Furthermore, the call target address has to be routed to the D-ISP. With that information the D-ISP is capable of loading functions that are activated by calls or returns into the local memory before their execution. For functions that are stored in the D-ISP, every fetch request will be answered with the constant low latency of an on-chip SRAM.

The D-ISP consists of two parts: the fetch control, which is responsible for delivering the instructions to the pipeline, and the content management. For handling fetch requests the D-ISP has to translate the native addresses requested by the pipeline into the local addresses used by the scratchpad. To allow a correct address translation, information from the content management about the stored functions is needed. The D-ISP content management has to perform several subtasks to assure that the currently active function is in the scratchpad memory: check the content of the scratchpad, copy functions into the scratchpad, evict functions on overwrite, and update the address mappings. On calls or returns the content management first has to check if a function is already contained in the D-ISP. If so (D-ISP hit), the function execution can be started without any delay. On a D-ISP miss, the content management has to load the complete function from the next memory level into the scratchpad. If the scratchpad space is too small to load the function without overwriting others, these will be evicted. After completely copying the function into the scratchpad, the address mapping is updated. Since the loading of a function may take several cycles, the function execution is stalled until the function is completely loaded. This is necessary because otherwise the two-phased execution scheme would be disrupted by ongoing function loads requested by the D-ISP controller that may interfere with data memory accesses triggered by the function execution. The content management ensures that a function is either completely contained in the D-ISP or not at all. Therefore, the minimum D-ISP size is determined by the size of the largest function in the executed code that is intended to be loaded into the D-ISP.

3.2 D-ISP Implementation Details

The D-ISP requires the length of a function when copying it into the scratchpad. In contrast to previous work [15], where the D-ISP controller detected the end of the function on the fly, we decided to use instrumentation to obtain the function's length. This reduces the complexity of the D-ISP, because no instruction parsing has to be performed. Therefore, we created an instrumentation tool that hooks into the compilation and linking process of the application and adds a special instruction at the beginning of every function in the application code to encode the function length. Using this instruction the D-ISP obtains the function size and copies the appropriate number of bytes into the scratchpad. Functions larger than the scratchpad or without this special instruction will be ignored by the D-ISP.

As shown in Figure 1(b), the D-ISP controller consists of the two separate parts described in Section 3.1: fetch control and content management. Both are coupled by the context register. This register is written by the content management and used by the fetch control to determine the start address of the active function in the scratchpad. For simplicity of program execution the D-ISP hides its address space from the pipeline by an address mapping in the fetch control. The fetch control calculates the offset of the fetch relative to the function's start address and adds it to the address where the function is located in the scratchpad memory. Both addresses are stored in the context register. Because this is implemented asynchronously, fetches can be handled by the D-ISP within one cycle. To prevent the fetch control from accessing invalid entries in the context register and delivering wrong instructions, the fetch control stalls all pending fetch requests while the content management is active.
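The address mapping in the fetch control amounts to one subtraction and one addition per fetch, which is why it can be done asynchronously within a single cycle. A minimal C++ sketch of this translation, with hypothetical names, assuming the context register holds the two start addresses as described above:

```cpp
#include <cstdint>

// Context register as described above: written by the content management,
// read combinationally by the fetch control.
struct ContextRegister {
    uint32_t nativeStart;      // start of the active function in native address space
    uint32_t scratchpadStart;  // start of the active function inside the scratchpad
};

// Translate a native fetch address into a scratchpad address: the offset of
// the fetch relative to the function's start address is added to the address
// where the function is located in the scratchpad.
inline uint32_t translateFetch(const ContextRegister& ctx, uint32_t fetchAddr) {
    return ctx.scratchpadStart + (fetchAddr - ctx.nativeStart);
}
```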

The content management is activated on calls or returns only. It uses a mapping table to check whether the activated function is in the scratchpad or not. For each mapped function the mapping table holds the address in the native address space, the address in the scratchpad, and the function length. To find entries in the mapping table without inspecting the whole table (which takes one cycle per entry), an additional lookup table is used, see Figure 1(b). The lookup table delivers several addresses of mapped functions within one cycle to the D-ISP controller, which compares them to the address of the function that was called. On a lookup table hit, the corresponding mapping table entry is selected, the context register is updated, and the fetch control is reactivated.

If the called function is not found in the lookup table, the function has to be copied into the scratchpad: a new mapping is created and the first fetch block is requested from the main memory. Using the function size obtained by decoding the special instruction, the content management requests the remaining fetch blocks of the function and copies them into the scratchpad. If the first block of another function in the scratchpad is overwritten, the content management deletes the corresponding mapping table and lookup table entries. Thus, each function is maintained as a whole or not at all by the scratchpad. The scratchpad addressing is cyclic, such that the replacement policy of the scratchpad is FIFO. After copying the last block of a function into the scratchpad, the context register is updated and the fetch control is reactivated.

On return instructions the content management has to determine the address of the caller function to check if it is still in the scratchpad. To obtain this address without complex calculations or long latencies, the content management maintains its own stack memory (see Figure 1(b)). This memory works as a stack that contains the addresses of the called functions: on every call the function address is pushed onto the stack and on a return the top address is removed. Using the stack memory, the address of the reactivated function is determined on return without delay. With this address the content management is able to check the lookup table and proceed as described above for handling function calls.
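The behavior of the content management on calls and returns can be condensed into a few lines. The following C++ sketch is our own illustration (all names are hypothetical, the lookup table is folded into a linear search over the mapping table, and the actual copying of fetch blocks is elided); it mirrors the mapping table, the cyclic FIFO placement, the whole-function eviction rule, and the stack memory described above:

```cpp
#include <cstdint>
#include <optional>
#include <vector>

// Hypothetical sketch of the content management: a mapping-table entry stores
// the function's native address, its address in the scratchpad, and its length.
struct Mapping {
    uint32_t nativeAddr;
    uint32_t spmAddr;
    uint32_t length;
};

class ContentManagement {
public:
    explicit ContentManagement(uint32_t spmSize) : spmSize_(spmSize) {}

    // On a call: return the scratchpad address of the function, loading it
    // first on a miss. Cyclic placement of loads yields a FIFO replacement.
    uint32_t onCall(uint32_t funcAddr, uint32_t funcLen) {  // funcLen <= spmSize_
        callStack_.push_back(funcAddr);          // stack memory for returns
        if (auto hit = lookup(funcAddr)) return *hit;
        uint32_t dst = writePtr_;
        evictOverwritten(dst, funcLen);
        // ...copying funcLen bytes from the next memory level happens here...
        table_.push_back({funcAddr, dst, funcLen});
        writePtr_ = (dst + funcLen) % spmSize_;
        return dst;
    }

    // On a return: the caller's address comes from the stack memory, so no
    // address computation is needed; the caller may have been evicted, though.
    std::optional<uint32_t> onReturn() {
        callStack_.pop_back();                   // drop the returning function
        if (callStack_.empty()) return std::nullopt;
        return lookup(callStack_.back());        // empty optional => reload needed
    }

private:
    std::optional<uint32_t> lookup(uint32_t funcAddr) const {
        for (const auto& m : table_)
            if (m.nativeAddr == funcAddr) return m.spmAddr;
        return std::nullopt;                     // D-ISP miss
    }

    // A function whose first block is overwritten loses its mapping, so each
    // function is maintained in the scratchpad as a whole or not at all.
    void evictOverwritten(uint32_t dst, uint32_t len) {
        std::vector<Mapping> kept;
        for (const auto& m : table_) {
            uint32_t delta = (m.spmAddr + spmSize_ - dst) % spmSize_;  // cyclic distance
            if (delta >= len) kept.push_back(m);  // first block untouched
        }
        table_ = kept;
    }

    uint32_t spmSize_;
    uint32_t writePtr_ = 0;
    std::vector<Mapping> table_;
    std::vector<uint32_t> callStack_;
};
```

In the hardware, the linear search is replaced by the lookup table, which compares several stored addresses against the call target in one cycle.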

4 Evaluation of Performance Impact

The main contribution of the D-ISP is an improved predictability for instruction fetches: The two-phased execution scheme enforces the absence of instruction and data memory interferences, which impede a precise WCET estimation. The loading of a function into the D-ISP is timing-predictable, because the size of the function is known a priori. By the use of the D-ISP for instruction fetches during function execution, a constant memory access time for all fetches is achieved. Furthermore, because the content management operates on whole functions, the complexity of a content analysis that determines the worst-case behavior for function loading and eviction is reduced.

Besides the predictability aspect of the D-ISP, we will discuss its average-case execution performance. We will show that the cost in average performance is worth the gain in predictability that is inherent in the D-ISP usage. To classify the average performance of the D-ISP, we compare the fetch cost of several benchmarks for the D-ISP with memories commonly used in embedded systems: an instruction cache and a static instruction scratchpad. We implemented the D-ISP in a cycle-accurate SystemC simulator for architectural evaluation and in VHDL to estimate the hardware cost. The host processor for the D-ISP is the TriCore instruction set compatible CarCore [16] processor. The CarCore is an SMT processor with a predictable timing designed to execute HRT threads. We also implemented the D-ISP as first-level instruction memory for HRT threads in the MERASA multi-core processor [17]. The D-ISP needs to be informed by the host pipeline if a call or return is executed. Thus, we had to apply minor changes to the pipeline, which are limited to the signaling of call or return processing and the routing of the call target address to the D-ISP controller.

All memory types were evaluated in the cycle-accurate CarCore SystemC simulator. The instruction cache is direct-mapped and has a cache line size of 32 bytes, containing 4 fetch blocks each. The static instruction scratchpad (S-ISP) contains multiple functions of the benchmark code. For the selection of the functions that are put into the S-ISP we used a Knapsack optimization to fill the given scratchpad size as well as possible (see the sketch below). We used the maximum dynamic instruction count of a function as the metric for the selection. Functions that are not contained in the S-ISP are fetched directly from the off-chip memory.
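The S-ISP selection just described is a classic 0/1 Knapsack instance: function sizes act as weights, the maximum dynamic instruction count acts as the value, and the scratchpad size is the capacity. A minimal C++ sketch of such a selection follows; the paper does not publish its optimizer, so all names here are our own assumptions:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Function {
    uint32_t size;      // code size in bytes (knapsack weight)
    uint64_t dynInsns;  // maximum dynamic instruction count (knapsack value)
};

// 0/1 Knapsack over whole functions: choose the subset that maximizes the
// total dynamic instruction count while fitting into the scratchpad capacity.
std::vector<std::size_t> selectForSisp(const std::vector<Function>& fns,
                                       uint32_t capacity) {
    const std::size_t n = fns.size();
    // best[i][c]: best value using the first i functions with capacity c
    std::vector<std::vector<uint64_t>> best(
        n + 1, std::vector<uint64_t>(capacity + 1, 0));
    for (std::size_t i = 1; i <= n; ++i)
        for (uint32_t c = 0; c <= capacity; ++c) {
            best[i][c] = best[i - 1][c];  // skip function i-1 ...
            if (fns[i - 1].size <= c)     // ... or take it if it fits
                best[i][c] = std::max(
                    best[i][c],
                    best[i - 1][c - fns[i - 1].size] + fns[i - 1].dynInsns);
        }
    std::vector<std::size_t> chosen;  // backtrack the selected functions
    for (std::size_t i = n; i > 0; --i)
        if (best[i][capacity] != best[i - 1][capacity]) {
            chosen.push_back(i - 1);
            capacity -= fns[i - 1].size;
        }
    return chosen;
}
```

Since the evaluated scratchpad sizes grow in steps of 128 bytes, the capacity dimension of the table could also be expressed in such blocks to shrink the DP table.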

The memory access time for the S-ISP is one cycle. For fetches that lead to the off-chip memory level, 4 cycles are needed. A cache hit also takes one cycle. On a cache miss, four 64-bit off-chip memory accesses are needed plus one extra cycle to detect the miss. For the D-ISP, the hit detection on a call or return costs 4 cycles for the table lookup and the context register write. This delay is hidden by the call or return processing of the pipeline. A miss of the D-ISP takes as many cycles as are needed to load the complete function from the off-chip memory, plus 5 cycles for internal processing, like table lookup, context register write, and write latencies. Fetches are handled by the D-ISP within one cycle.

For the performance evaluation we measured the fetch cost of applications from three different benchmark suites: Mälardalen WCET benchmarks [18], MiBench [19], and EEMBC AutoBench [20]. We selected 6 benchmarks with different function lengths and call characteristics. The fetch cost measured in the evaluation is the number of cycles for all fetches requested by the processor for the whole benchmark run. We normalized these numbers to the configuration where the complete code is located in the S-ISP, which represents the minimal fetch cost. A normalized fetch cost of 4 will be reached if all instructions are fetched from the off-chip memory. To compare the three on-chip instruction memories we varied their sizes from 128 bytes to the overall size of the actual benchmark in steps of 128 bytes. The results are shown in Figure 2.

Fig. 2. Normalized fetch cost for S-ISP, cache and D-ISP with different memory sizes; panels: (a) ADPCM (Mälardalen), (b) Compress (Mälardalen), (c) EDN (Mälardalen), (d) Dijkstra (MiBench), (e) Rspeed (EEMBC), (f) Ttsprk (EEMBC); x-axes: memory size in bytes

Figures 2(a) to (f) show that, if the memory size is as large as the benchmark code, the S-ISP always performs slightly better than the cache and the D-ISP. This is caused by the fact that the content of the S-ISP is set up before the benchmark execution. The fetch cost of the S-ISP decreases with higher scratchpad sizes, because more functions can be assigned to the scratchpad. But there are exceptions to this behavior in Compress (b) and Rspeed (e). This is caused by the metric used by the static assignment algorithm, which takes the longest possible path for each function and the maximum number of function invocations into account, though this is independent of the real execution behavior.

The cache reduces the fetch cost to nearly the optimum even with a memory size lower than one third of the code size. In some configurations the cache shows a thrashing behavior, like for Compress (b) at 384 bytes. This is caused by the direct-mapped cache organization. Also the cache miss rate for the EEMBC benchmarks (e) and (f) is high. This behavior is due to the low spatial locality of the EEMBC benchmarks caused by code replication.

In Figures 2(a) to (f) the vertical lines denote that the memory size is at least as large as the size of the largest function in the benchmark code. This mark is important for the D-ISP, because it loads only complete functions into the scratchpad. Thus, for any measurement points to the left of the vertical line, the D-ISP has to ignore functions larger than its size. This is done automatically by the D-ISP controller. Furthermore, the D-ISP cannot enforce the two-phased execution scheme for the unmaintained functions. As a consequence, the timing analysis of these functions has to take memory interferences into account. For configurations with a scratchpad size larger than (or equal to) the largest function, every fetch request is directed to and handled by the D-ISP. Then no memory interferences can occur and the timing analysis can treat instruction and data memories independently. This assumption cannot be made for any cache and S-ISP configuration, except if the whole code is located in the S-ISP.

For benchmarks (a), (b) and (f) some outliers for the D-ISP appear if the size of the scratchpad is slightly larger than the largest function. This is caused by the fact that the largest function will evict nearly all other functions maintained in the scratchpad when it is loaded. So if the largest function frequently calls or is called, the fetch cost for the D-ISP will increase, caused by evicting and reloading functions frequently.

For Ttsprk (f) this behavior results in a fetch cost of 27.9, which is almost 7 times worse than using the off-chip memory instead of the D-ISP. In Ttsprk the call hierarchy is flat and every function is called directly by the large main function.

For scratchpad sizes that are not close to the size of the largest function, the D-ISP performs better than the S-ISP. This is explained by the dynamic content management of the D-ISP: the D-ISP maintains the functions that are used in the current phase of the application, whereas the S-ISP holds functions for the whole application execution. Comparing the D-ISP to the instruction cache, the cache mostly achieves a lower fetch cost. This is caused by the finer granularity of the cache lines, which hides the structure of the application. Thus, the D-ISP in many configurations cannot compete with the instruction cache, since the cache handles misses at a much finer granularity: on a miss only one line has to be loaded into the cache, in contrast to the D-ISP, which loads the complete function.

In general the average performance of the D-ISP is between the S-ISP and cache performance. This is expected, since it uses a dynamic memory management that is not as fine-grained as that of a cache. By the two-phased execution the D-ISP provides a predictable function execution behavior. From a timing analysis point of view, the D-ISP eases the analysis by allowing independent pipeline and memory analysis without decreasing the accuracy of the estimates. Furthermore, the WCET analysis of a function-based instruction memory is easier than for an instruction cache, because the content changes only on call/return instructions. These benefits of the D-ISP outweigh the moderate average-case performance, which is in general better than that of an S-ISP and in the same order of magnitude as that of the direct-mapped instruction cache.

5 D-ISP Hardware Effort Estimation

To estimate the hardware effort of the D-ISP we used an Altera Stratix II EP2S180F1020C3 FPGA. Table 1 shows the utilization of the D-ISP controller, without the memories needed to store and maintain the content of the D-ISP, in used ALUTs and registers. The maximum possible frequency of the controller is also provided. The number of functions in the table defines how many functions can be checked on lookup by the content management within one clock cycle. It is defined by the port width of the lookup table memory. This number does not represent the number of functions that can be handled by the D-ISP concurrently, which depends on the sizes of the mapping and lookup table memories.

Table 1. Utilization of the D-ISP controller on a Stratix II EP2S180F1020C3 FPGA (columns: functions compared per cycle, ALUTs, registers, maximum frequency in MHz)

As Table 1 shows, the number of comparisons for function lookup is critical for the usage of ALUTs and the maximum possible frequency. This relation is expected, since the parallel comparisons are very costly in hardware. Therefore, we propose to compare at most 32 functions within one cycle. To support more than 32 functions the lookup can be split into multiple cycles: e.g., if the D-ISP should handle 128 functions, then the function lookup will take 4 cycles at most. This is an acceptable delay for the hardware amount that is saved.

Another aspect of the D-ISP hardware amount is the memory that is used. The overall used memory is fragmented into the memory used to store the functions (size_func), the size of the function mapping table (size_map), the size of the lookup table (size_lookup), and the size of the function stack (size_stack):

size_DISP = size_func + size_stack + size_map + size_lookup
size_map = (width_naddr + width_saddr + length_f) · no_f1
size_lookup = width_naddr · no_f1
size_stack = width_naddr · no_f2

Three of these memories depend on the widths of the function addresses stored in the tables (width_naddr, width_saddr), on the number of bytes to encode the function length (length_f), and on additional parameters (no_f1 and no_f2). The mapping table needs to store two addresses for each function: one in the native (width_naddr) and one in the scratchpad address space (width_saddr). Also the size in bytes of the mapped function (length_f) is stored in the table. When allowing a maximum scratchpad size of 512 kB, the function length (length_f) can be encoded in 2 bytes for a fetch block size of 64 bit. Then 16 bit can be used for the scratchpad addresses (width_saddr). For the native addresses (width_naddr) the 32-bit addresses can be reduced to 24 bit by taking the 5-bit segment address of the CarCore into account and aligning all functions to 64-bit addresses. So in sum 7 bytes are necessary per mapping table entry. The lookup table and the stack memory store only the address of functions in the native address space, thus each entry holds a 24-bit address and is 3 bytes long. The variable no_f1 defines the number of functions that the D-ISP is able to maintain. Notice that this number need not correspond to the number of functions that can be looked up by the D-ISP controller within one cycle, which is represented by the logic amount used in Table 1. The maximum allowed stack depth is defined by no_f2. Both parameters no_f1 and no_f2 have to be powers of two.
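With the widths just derived (3-byte native address, 2-byte scratchpad address, 2-byte length field, hence 7 bytes per mapping entry and 3 bytes per lookup or stack entry), the table overhead of the controller follows directly from the formulas above. A small C++ sketch of that arithmetic, using our own (hypothetical) names and evaluating the configurations used in the comparison below:

```cpp
#include <cstdint>
#include <cstdio>

// Overhead memories of the D-ISP controller, following the formulas above:
//   size_map    = (width_naddr + width_saddr + length_f) * no_f1
//   size_lookup = width_naddr * no_f1
//   size_stack  = width_naddr * no_f2
constexpr uint32_t kWidthNaddr = 3;  // 24-bit native address
constexpr uint32_t kWidthSaddr = 2;  // 16-bit scratchpad address (<= 512 kB)
constexpr uint32_t kLengthF    = 2;  // 2-byte encoding of the function length

constexpr uint32_t dispTableOverhead(uint32_t noF1, uint32_t noF2) {
    uint32_t sizeMap    = (kWidthNaddr + kWidthSaddr + kLengthF) * noF1;  // 7 B/entry
    uint32_t sizeLookup = kWidthNaddr * noF1;                             // 3 B/entry
    uint32_t sizeStack  = kWidthNaddr * noF2;                             // 3 B/entry
    return sizeMap + sizeLookup + sizeStack;
}

int main() {
    // Configuration used in the paper's comparison: a 16-entry stack and
    // 2 to 128 supported functions (powers of two).
    for (uint32_t noF1 = 2; noF1 <= 128; noF1 *= 2)
        std::printf("no_f1 = %3u -> %4u bytes of table memory\n",
                    static_cast<unsigned>(noF1),
                    static_cast<unsigned>(dispTableOverhead(noF1, 16)));
    return 0;
}
```

Note that this overhead is independent of the scratchpad size itself, which is the root of the comparison with the cache tag memory discussed next.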

To evaluate the memory amount used by the D-ISP, we compare it with a direct-mapped cache with a cache line size of 32 bytes. The cache uses 27-bit addresses, because the CarCore's 5-bit segment address is taken into account. For the D-ISP we used a stack size of 16 functions (no_f2) and the number of supported functions (no_f1) varies from 2 to 128. Figure 3 compares the memory size used by the D-ISP to that of the cache and a static scratchpad. It shows the overall required memory (including cache tags and the additional memory structures of the D-ISP controller) in relation to the memory that is used to buffer instructions.

Fig. 3. Memory usage of the D-ISP (for 2 to 128 functions), a direct-mapped cache, and a static scratchpad: required memory over usable memory, both in bytes

As depicted in Figure 3, the D-ISP has a high memory overhead for small scratchpad sizes compared to the cache. This is caused by the fact that the tag memory size of a cache depends on the cache size (assuming that the cache line size is constant), whereas the additional memory used by the D-ISP is independent of the scratchpad size. But if the scratchpad size is increased above 2 kB, the memory amount used by the D-ISP and the cache is in the same order of magnitude - even for a rather large number of functions.

6 Conclusions

In this paper we proposed the Dynamic Instruction Scratchpad (D-ISP) as an alternative instruction memory for real-time systems. The D-ISP design is focused on predictability. It prevents memory interferences of data and instruction accesses on a shared memory level by the two-phased execution behavior and features a function-based content management. The D-ISP is able to outperform a statically managed scratchpad, but it cannot compete with a cache and its fine-grained content management. In contrast to a cache, however, the D-ISP allows a tight and separated timing analysis, owing to the absence of memory interferences. This influence on WCET analysis outweighs the moderate average-case performance for hard real-time systems.

The hardware effort needed to implement the D-ISP depends strongly on the number of functions that can be looked up in parallel. But by splitting the function lookup into multiple cycles it is possible to support an arbitrary number of function lookups without an excessive logic amount. We also found that the memory overhead for the D-ISP content management tables is larger than the tag memory of caches. But for larger scratchpad sizes the memory overhead of the D-ISP is in the same order of magnitude as the tag memory of a cache.

For our future work we plan to model the timing of the D-ISP and calculate WCET estimates. Then the impact of the D-ISP on the WCET can be compared with caches and other scratchpad memories that suffer from memory interferences.

References

1. Wilhelm, R., Engblom, J., Ermedahl, A., Holsti, N., Thesing, S., Whalley, D., Bernat, G., Ferdinand, C., Heckmann, R., Mitra, T., Mueller, F., Puaut, I., Puschner, P., Staschulat, J., Stenström, P.: The Worst-Case Execution-Time Problem - Overview of Methods and Survey of Tools. ACM Trans. Embed. Comput. Syst. 7(3), 1-53 (2008)
2. Reineke, J., Grund, D., Berg, C., Wilhelm, R.: Timing Predictability of Cache Replacement Policies. Real-Time Systems 37(2) (2007)
3. Banakar, R., Steinke, S., Lee, B.S., Balakrishnan, M., Marwedel, P.: Scratchpad memory: design alternative for cache on-chip memory in embedded systems. In: Proc. of the 10th Int. Symp. on Hardware/Software Codesign. ACM, New York (2002)
4. Wehmeyer, L., Marwedel, P.: Influence of Onchip Scratchpad Memories on WCET Prediction. In: Proc. of the 4th Int. Workshop on WCET Analysis (2004)
5. Falk, H., Kleinsorge, J.: Optimal static WCET-aware scratchpad allocation of program code. In: Proc. of the 46th Design Automation Conf. (2009)
6. Udayakumaran, S., Dominguez, A., Barua, R.: Dynamic allocation for scratch-pad memory using compile-time decisions. ACM Trans. Embed. Comput. Syst. 5(2) (2006)
7. Ravindran, R.A., Nagarkar, P.D., Dasika, G.S., Marsman, E.D., Senger, R.M., Mahlke, S.A., Brown, R.B.: Compiler managed dynamic instruction placement in a low-power code cache. In: Proc. of the Int. Symp. on Code Generation and Optimization. IEEE, Los Alamitos (2005)
8. Egger, B., Kim, C., Jang, C., Nam, Y., Lee, J., Min, S.L.: A Dynamic Code Placement Technique for Scratchpad Memory Using Postpass Optimization. In: Conf. on Compilers, Architecture, and Synthesis for Embedded Systems (2006)
9. Janapsatya, A., Ignjatović, A., Parameswaran, S.: A novel instruction scratchpad memory optimization method based on concomitance metric. In: Asia and South Pacific Conf. on Design Automation. IEEE, Los Alamitos (2006)
10. Puaut, I., Pais, C.: Scratchpad Memories vs Locked Caches in Hard Real-Time Systems: a Quantitative Comparison. In: Proc. of the Conf. on Design, Automation and Test in Europe (2007)
11. Plazar, S., Lokuciejewski, P., Marwedel, P.: WCET-driven Cache-aware Memory Content Selection. In: Proc. of the 13th IEEE Int. Symp. on Object/Component/Service-Oriented Real-Time Distributed Computing (2010)
12. Schoeberl, M.: A Time Predictable Instruction Cache for a Java Processor. In: Workshop on Java Technologies for Real-Time and Embedded Systems (2004)
13. Kirner, R., Schoeberl, M.: Modeling the Function Cache for Worst-Case Execution Time Analysis. In: Proc. of the 44th Design Automation Conf. (2007)
14. Preußer, T., Zabel, M., Spallek, R.: Bump-pointer method caching for embedded Java processors. In: Proc. of the 5th Int. Workshop on Java Technologies for Real-time and Embedded Systems. ACM, New York (2007)

15. Metzlaff, S., Uhrig, S., Mische, J., Ungerer, T.: Predictable dynamic instruction scratchpad for simultaneous multithreaded processors. In: Proc. of the 9th Workshop on Memory Performance. ACM, New York (2008)
16. Mische, J., Guliashvili, I., Uhrig, S., Ungerer, T.: How to Enhance a Superscalar Processor to Provide Hard Real-Time Capable In-Order SMT. In: Müller-Schloer, C., Karl, W., Yehia, S. (eds.) ARCS 2010. LNCS, vol. 5974. Springer, Heidelberg (2010)
17. Ungerer, T., Cazorla, F., Sainrat, P., Bernat, G., Petrov, Z., Rochange, C., Quinones, E., Gerdes, M., Paolieri, M., Wolf, J., Casse, H., Uhrig, S., Guliashvili, I., Houston, M., Kluge, F., Metzlaff, S., Mische, J.: MERASA: Multicore execution of hard real-time applications supporting analyzability. IEEE Micro 30 (2010)
18. Mälardalen Real-Time Research Center (MRTC): WCET Benchmark Suite
19. Guthaus, M., Ringenberg, J., Ernst, D., Austin, T., Mudge, T., Brown, R.: MiBench: A free, commercially representative embedded benchmark suite. In: 2001 IEEE Int. Workshop on Workload Characterization (2001)
20. Embedded Microprocessor Benchmark Consortium: AutoBench 1.1 software benchmark data book


LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

CEC 450 Real-Time Systems

CEC 450 Real-Time Systems CEC 450 Real-Time Systems Lecture 6 Accounting for I/O Latency September 28, 2015 Sam Siewert A Service Release and Response C i WCET Input/Output Latency Interference Time Response Time = Time Actuation

More information

Performance metrics for caches

Performance metrics for caches Performance metrics for caches Basic performance metric: hit ratio h h = Number of memory references that hit in the cache / total number of memory references Typically h = 0.90 to 0.97 Equivalent metric:

More information

A Cache Hierarchy in a Computer System

A Cache Hierarchy in a Computer System A Cache Hierarchy in a Computer System Ideally one would desire an indefinitely large memory capacity such that any particular... word would be immediately available... We are... forced to recognize the

More information

Handling Cyclic Execution Paths in Timing Analysis of Component-based Software

Handling Cyclic Execution Paths in Timing Analysis of Component-based Software Handling Cyclic Execution Paths in Timing Analysis of Component-based Software Luka Lednicki, Jan Carlson Mälardalen Real-time Research Centre Mälardalen University Västerås, Sweden Email: {luka.lednicki,

More information

MPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors

MPEG-2 Video Decompression on Simultaneous Multithreaded Multimedia Processors MPEG- Video Decompression on Simultaneous Multithreaded Multimedia Processors Heiko Oehring Ulrich Sigmund Theo Ungerer VIONA Development GmbH Karlstr. 7 D-733 Karlsruhe, Germany uli@viona.de VIONA Development

More information

August 1994 / Features / Cache Advantage. Cache design and implementation can make or break the performance of your high-powered computer system.

August 1994 / Features / Cache Advantage. Cache design and implementation can make or break the performance of your high-powered computer system. Cache Advantage August 1994 / Features / Cache Advantage Cache design and implementation can make or break the performance of your high-powered computer system. David F. Bacon Modern CPUs have one overriding

More information

V. Primary & Secondary Memory!

V. Primary & Secondary Memory! V. Primary & Secondary Memory! Computer Architecture and Operating Systems & Operating Systems: 725G84 Ahmed Rezine 1 Memory Technology Static RAM (SRAM) 0.5ns 2.5ns, $2000 $5000 per GB Dynamic RAM (DRAM)

More information

Reducing Power Consumption for High-Associativity Data Caches in Embedded Processors

Reducing Power Consumption for High-Associativity Data Caches in Embedded Processors Reducing Power Consumption for High-Associativity Data Caches in Embedded Processors Dan Nicolaescu Alex Veidenbaum Alex Nicolau Dept. of Information and Computer Science University of California at Irvine

More information

History-based Schemes and Implicit Path Enumeration

History-based Schemes and Implicit Path Enumeration History-based Schemes and Implicit Path Enumeration Claire Burguière and Christine Rochange Institut de Recherche en Informatique de Toulouse Université Paul Sabatier 6 Toulouse cedex 9, France {burguier,rochange}@irit.fr

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Basic Memory Management. Basic Memory Management. Address Binding. Running a user program. Operating Systems 10/14/2018 CSC 256/456 1

Basic Memory Management. Basic Memory Management. Address Binding. Running a user program. Operating Systems 10/14/2018 CSC 256/456 1 Basic Memory Management Program must be brought into memory and placed within a process for it to be run Basic Memory Management CS 256/456 Dept. of Computer Science, University of Rochester Mono-programming

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4. Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that

More information

Key Point. What are Cache lines

Key Point. What are Cache lines Caching 1 Key Point What are Cache lines Tags Index offset How do we find data in the cache? How do we tell if it s the right data? What decisions do we need to make in designing a cache? What are possible

More information

CONTENTION IN MULTICORE HARDWARE SHARED RESOURCES: UNDERSTANDING OF THE STATE OF THE ART

CONTENTION IN MULTICORE HARDWARE SHARED RESOURCES: UNDERSTANDING OF THE STATE OF THE ART CONTENTION IN MULTICORE HARDWARE SHARED RESOURCES: UNDERSTANDING OF THE STATE OF THE ART Gabriel Fernandez 1, Jaume Abella 2, Eduardo Quiñones 2, Christine Rochange 3, Tullio Vardanega 4 and Francisco

More information

Demand fetching is commonly employed to bring the data

Demand fetching is commonly employed to bring the data Proceedings of 2nd Annual Conference on Theoretical and Applied Computer Science, November 2010, Stillwater, OK 14 Markov Prediction Scheme for Cache Prefetching Pranav Pathak, Mehedi Sarwar, Sohum Sohoni

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

A Novel Technique to Use Scratch-pad Memory for Stack Management

A Novel Technique to Use Scratch-pad Memory for Stack Management A Novel Technique to Use Scratch-pad Memory for Stack Management Soyoung Park Hae-woo Park Soonhoi Ha School of EECS, Seoul National University, Seoul, Korea {soy, starlet, sha}@iris.snu.ac.kr Abstract

More information

CS 136: Advanced Architecture. Review of Caches

CS 136: Advanced Architecture. Review of Caches 1 / 30 CS 136: Advanced Architecture Review of Caches 2 / 30 Why Caches? Introduction Basic goal: Size of cheapest memory... At speed of most expensive Locality makes it work Temporal locality: If you

More information

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture

Chapter 2. Parallel Hardware and Parallel Software. An Introduction to Parallel Programming. The Von Neuman Architecture An Introduction to Parallel Programming Peter Pacheco Chapter 2 Parallel Hardware and Parallel Software 1 The Von Neuman Architecture Control unit: responsible for deciding which instruction in a program

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

One-Level Cache Memory Design for Scalable SMT Architectures

One-Level Cache Memory Design for Scalable SMT Architectures One-Level Cache Design for Scalable SMT Architectures Muhamed F. Mudawar and John R. Wani Computer Science Department The American University in Cairo mudawwar@aucegypt.edu rubena@aucegypt.edu Abstract

More information

The Memory Hierarchy & Cache

The Memory Hierarchy & Cache Removing The Ideal Memory Assumption: The Memory Hierarchy & Cache The impact of real memory on CPU Performance. Main memory basic properties: Memory Types: DRAM vs. SRAM The Motivation for The Memory

More information

LECTURE 10: Improving Memory Access: Direct and Spatial caches

LECTURE 10: Improving Memory Access: Direct and Spatial caches EECS 318 CAD Computer Aided Design LECTURE 10: Improving Memory Access: Direct and Spatial caches Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses

More information

Efficient Pointer Management of Stack Data for Software Managed Multicores

Efficient Pointer Management of Stack Data for Software Managed Multicores Efficient Pointer Management of Stack Data for Software Managed Multicores Jian Cai, Aviral Shrivastava Compiler Microarchitecture Laboratory Arizona State University Tempe, Arizona 85287 USA {jian.cai,

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

Computer Architecture Spring 2016

Computer Architecture Spring 2016 Computer Architecture Spring 2016 Lecture 08: Caches III Shuai Wang Department of Computer Science and Technology Nanjing University Improve Cache Performance Average memory access time (AMAT): AMAT =

More information

A NOVEL APPROACH FOR A HIGH PERFORMANCE LOSSLESS CACHE COMPRESSION ALGORITHM

A NOVEL APPROACH FOR A HIGH PERFORMANCE LOSSLESS CACHE COMPRESSION ALGORITHM A NOVEL APPROACH FOR A HIGH PERFORMANCE LOSSLESS CACHE COMPRESSION ALGORITHM K. Janaki 1, K. Indhumathi 2, P. Vijayakumar 3 and K. Ashok Kumar 4 1 Department of Electronics and Communication Engineering,

More information

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory Cache Memories Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory CPU looks first for data in caches (e.g., L1, L2, and

More information

SIC: Provably Timing-Predictable Strictly In-Order Pipelined Processor Core

SIC: Provably Timing-Predictable Strictly In-Order Pipelined Processor Core SIC: Provably Timing-Predictable Strictly In-Order Pipelined Processor Core Sebastian Hahn and Jan Reineke RTSS, Nashville December, 2018 saarland university computer science SIC: Provably Timing-Predictable

More information

A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems

A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems A Hardware Task-Graph Scheduler for Reconfigurable Multi-tasking Systems Abstract Reconfigurable hardware can be used to build a multitasking system where tasks are assigned to HW resources at run-time

More information

Fast, predictable and low energy memory references through architecture-aware compilation 1

Fast, predictable and low energy memory references through architecture-aware compilation 1 ; Fast, predictable and low energy memory references through architecture-aware compilation 1 Peter Marwedel, Lars Wehmeyer, Manish Verma, Stefan Steinke +, Urs Helmig University of Dortmund, Germany +

More information

CS429: Computer Organization and Architecture

CS429: Computer Organization and Architecture CS429: Computer Organization and Architecture Warren Hunt, Jr. and Bill Young Department of Computer Sciences University of Texas at Austin Last updated: August 26, 2014 at 08:54 CS429 Slideset 20: 1 Cache

More information