Scratchpad memory vs Caches - Performance and Predictability comparison


David Langguth
langguth@rhrk.uni-kl.de

Abstract

While caches are simple to use due to their transparency to programmer and compiler, they are a source of predictability problems in real-time systems. Scratchpads, in contrast, introduce additional programming complexity due to their software-controlled mapping, but offer high predictability in return. This paper compares scratchpads and caches with respect to speed, area, energy and predictability. Furthermore, scratchpad memory allocation is considered and an algorithm for off-line selection of scratchpad memory content is presented.

1 Introduction

Due to the increasing gap between processor and memory performance, memory has become an ever greater bottleneck in computation: a memory access is much slower than the processor, which forces the processor to wait for the access to complete. This "memory wall" [9] has led to the introduction of caches to bridge the gap and speed up memory accesses. Since the cache is controlled by hardware and is transparent to the software, no alterations of the software have to be performed. The cache, however, tends to improve only the average-case performance, not necessarily the worst-case execution time (WCET), which leads to predictability problems in hard real-time systems [3]: already during the design phase of a real-time system it must be guaranteed that certain deadlines will always be met. (A lot of progress has been made in the last ten years in statically predicting the WCET of tasks on architectures with caches. These techniques, however, are not always applicable due to the lack of documentation in hardware manuals concerning the cache replacement policies; moreover, they tend to be pessimistic for some replacement policies [5].) To improve cache predictability, cache locking techniques are of interest, which allow the software (compiler or programmer) to control the cache contents: load data into the cache and disable the cache replacement policy (lock or freeze the cache). The cache content can be locked for the whole execution of the task (static locking) or changed at run-time (dynamic locking), which allows for tighter WCET estimates [5].

An alternative approach to caches are scratchpad memories, also known as tightly coupled memories (TCM) [7]. Scratchpads are small on-chip static RAMs mapped into the processor's address space at a predefined address range. Due to their small size they are extremely fast and energy efficient. Unlike caches, where data allocation is performed by hardware, scratchpad data allocation is under software control (compiler or programmer). This leads to a high predictability, which has made scratchpads popular in real-time systems. Scratchpads, however, can replace caches only if they are supported by an effective compiler.

Significant effort has been invested in developing efficient allocation techniques for scratchpad memories, most of which aim at reducing the average-case execution time (ACET). For real-time systems, however, the upper bound of the execution time (the WCET) is of main interest, since deadlines have to be met in all situations. Previous work proposed a static mapping of the hot spots of an application to the scratchpad memory in order to gain performance and to save energy. While with static scratchpad allocation (fixed scratchpad content at run-time) the WCET is easily predictable, it raises performance issues if the code/data size is much larger than the scratchpad size.

The rest of the paper is structured as follows: The next section considers some of the previous work on scratchpad allocation algorithms, WCET analysis and scratchpad memory area/energy analysis. Section 3 summarizes the results, followed by a conclusion in Section 4.

2 Related Work

Today's expectation of steadily increasing computing power has forced computer architects to include performance-enhancing features, since limitations concerning miniaturization and ever faster gigahertz processors have surfaced. Such features are, e.g., pipelines or speculative execution using branch prediction units. The growing speed gap (the "memory wall") between processors and the slower memory is the main reason for the widespread integration of caches. All these techniques help increase the average-case performance of the system. Worst-case execution time analysis, which is required in real-time embedded systems to guarantee that deadlines are never violated, becomes increasingly difficult when many of the above features are present in the processor. Due to their highly dynamic behavior, an effective prediction of worst-case timings at design time is hard or even impossible.

A commercial tool for WCET analysis of several processor and cache architectures is aiT [2]. aiT is actively used in industry, e.g. by Airbus France, to determine upper bounds (WCET) on the execution time of time-critical avionics software. The input of the tool is an executable for a specific platform, along with user-supplied data concerning, e.g., loop bounds, access addresses, and architectural data concerning the memory layout. The output is a safe upper bound on the expected WCET. With this tool, the elaborate task of finding input sets that yield the WCET in simulation is no longer necessary. For scratchpad memory, no extra analysis module is necessary for a given architecture to investigate the WCET, since a scratchpad simply introduces a new, distinct memory region.

In general, scratchpads are an effective replacement for caches, since they bring down energy consumption and offer performance benefits comparable to those of caches [7]. The drawback of scratchpads is that they are, unlike caches, not transparent, and allocation has to be performed by software (programmer or compiler). This, however, increases energy efficiency, since there is no need for the complex cache control logic which dynamically manages the cache contents. Furthermore, allocating scratchpad content during compilation is advantageous, since the compiler has detailed knowledge about execution and access frequencies. Instead of the cache's dynamic ad-hoc decisions, an optimal distribution of memory objects among the available memories (fast on-chip scratchpad memory and slower off-chip memory) can be achieved with this information.
A very similar type of on-chip memory are locked caches. The possibility of loading and locking content in the cache allows the software to control its contents. Compared to scratchpads, locked caches do not suffer from fragmentation, which occurs, e.g., when blocks to be allocated are too large to fit into the remaining free space of the scratchpad. On the downside, the fixed cache block size of locked caches can lead to cache pollution if the data to be locked is not aligned on cache block boundaries, i.e. extra instructions may be locked which do not contribute to the worst-case execution path (WCEP). Furthermore, conflicts in cache mapping are possible if multiple distinct relevant code sections (basic blocks) map to the same cache block, especially in direct-mapped caches [5].

Section 2.1 considers some of the previous work on on-chip data selection.

2.1 Selection of on-chip memory data

The selection of on-chip memory content can be performed either in a static or in a dynamic way. In case of static allocation, the data and/or instructions to be placed in the on-chip memory (e.g. a scratchpad) are selected at compile time. The content is then held in the on-chip memory for the whole execution time. For the content to be allocated, the most beneficial code/data with respect to performance and energy consumption, i.e. heavily used content ("hotspots"), is to be selected. With this method, an optimal utilization of the on-chip memory is not always possible, especially if the amount of code/data is much larger than the on-chip memory, since not all hotspots can fit into the on-chip memory. In this case dynamic allocation is beneficial, because hotspots can be loaded into the on-chip memory at runtime as needed. Compared to the static approach, savings of up to 38% in energy consumption have been reported [7]. With dynamic allocation, however, the objects (code or data) and the reload points, at which the objects are copied to the on-chip memory, are usually still fixed at compile time.

The problem of finding the optimal set of objects to map to the on-chip memory is solved by formulating it as a variant of the knapsack problem, which can be solved, e.g., by an ILP (integer linear programming) solver. An example formulation of the knapsack problem for mapping objects could be (for a detailed example of knapsack-based mapping see [6]):

Maximise $\sum_{i=1}^{n} acc(i) \cdot x_i$ subject to $\sum_{i=1}^{n} size(i) \cdot x_i \le S$, with $x_i \in \{0,1\}$,

with n the total number of objects to map, S the size of the on-chip memory, acc(i) the number of accesses to object i, size(i) the size of object i, and x_i the decision variable that takes the value 1 if object i is to be mapped to the on-chip memory, and 0 otherwise. In short, a benefit function associates each memory object with a certain performance or energy gain if the object is allocated to the on-chip memory instead of the off-chip memory. In the objective function, this benefit is maximised under the constraint that the capacity of the on-chip memory is not exceeded.
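For the small instance sizes involved, this formulation can also be solved exactly with a classic dynamic-programming solution of the 0/1 knapsack problem instead of an ILP solver. The following is a minimal sketch of that selection; all access counts, object sizes and the capacity are hypothetical placeholders, not values from [6].

```python
def select_objects(acc, size, S):
    """Choose memory objects maximising the total access count, subject to
    the combined size fitting into an on-chip memory of S bytes."""
    # best[c] = (max benefit, chosen object indices) using at most c bytes
    best = [(0, [])] * (S + 1)
    for i in range(len(acc)):
        # walk capacities downwards so every object is mapped at most once
        for c in range(S, size[i] - 1, -1):
            benefit = best[c - size[i]][0] + acc[i]
            if benefit > best[c][0]:
                best[c] = (benefit, best[c - size[i]][1] + [i])
    return best[S]

# hypothetical objects: access counts, sizes in bytes, a 640-byte scratchpad
print(select_objects(acc=[120, 300, 75], size=[256, 512, 128], S=640))
# -> (375, [1, 2]): objects 1 and 2 fit together and maximise the accesses
```

An ILP solver handles the same formulation with richer benefit functions (e.g. energy gain per object), but the maximisation core is identical.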

2.2 A WCET-oriented algorithm for dynamic allocation

In [5], an algorithm for WCET-oriented dynamic allocation of code to scratchpad memories and locked caches is proposed, which is summarized hereinafter. As cache, a k-way set-associative instruction cache is considered. The algorithm consists of two independent parts: the selection of reload points and the selection of the on-chip memory content. The selection is based on knowledge of the execution frequencies of basic blocks and statements along the WCEP, obtained by an external WCET estimation tool. A sketch of both parts is given at the end of this subsection.

Reload points

The reload points are placed at loop pre-headers (the basic blocks before a loop header in the control flow graph, CFG) to exploit the temporal locality of code. The selection of reload points depends only on the code structure and on hardware parameters (on- and off-chip memory latencies). The cost function CF(L) for selecting a reload point for loop L is:

$WCET_{offchip}(L) = \sum_{i \in pl(L)} f(i) \cdot t_{off}$

$WCET_{onchip}(L) = \sum_{i \in mf(L)} f(i) \cdot t_{on} + \sum_{i \in instr(L) \setminus mf(L)} f(i) \cdot t_{off} + \sum_{i \in pre\_head(L)} f(i) \cdot (a + b \cdot |mf(L)|)$

$CF(L) = WCET_{offchip}(L) - WCET_{onchip}(L)$

with $t_{on}$ and $t_{off}$ the access latencies for on-chip and off-chip memory accesses, f(s) the total number of executions of statement s along the WCEP, mf(L) the most frequently executed instructions in loop L, instr(L) the whole set of instructions of loop L, and pre_head(L) the pre-headers of loop L; the term $a + b \cdot |mf(L)|$ models the time to copy mf(L) to the on-chip memory at a reload point. (pl(L) is not explained in [5]; it is probably equal to instr(L).) Briefly, the cost function CF(L) is the execution time of loop L when stored completely off-chip, minus the time of its execution with its most frequently used instructions stored on-chip plus the time of reloading these instructions to the on-chip memory in all pre-headers of loop L. A positive value of CF(L) means a WCET improvement.

Selection of on-chip memory content

The selection of the on-chip memory content is based on the execution frequency along the WCEP, which is re-evaluated on a regular basis, since the WCEP may change when objects are loaded into the on-chip memory. The algorithm [5, p. 3] progressively fills the on-chip memory at the reload points identified in the previous step. This is done by considering all basic blocks successively, starting from the one with the largest expected decrease of the WCET estimate. The basic blocks are iteratively loaded at their dominating load points by the function Load. This continues until the WCET estimate, re-evaluated at a predefined rate, does not improve anymore, or until all basic blocks have been considered. ([5] does not point out when previously allocated memory is released to free on-chip memory.) The function Load differs between the two kinds of on-chip memory:

Locked cache: Information is loaded and locked on a per-cache-block basis. A program line pl is inserted if there is a free block in the cache set that pl maps to. Note that no modification of the memory layout of the application is required.

Scratchpad: Information is allocated on a per-basic-block basis. Load uses a first-fit allocation strategy to find a free region of the basic block's size in the scratchpad memory, and thereby determines the address the basic block will be copied to at run-time.
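Below is a minimal sketch of both parts, assuming pl(L) = instr(L) as noted above; the Loop structure, the copy-cost parameters a and b, and all values are illustrative, not taken from [5].

```python
from dataclasses import dataclass

@dataclass
class Loop:
    instr: set     # all instructions of loop L
    mf: set        # most frequently executed instructions (on-chip candidates)
    pre_head: set  # instructions of the pre-headers of L (the reload points)
    f: dict        # instruction -> execution count along the WCEP

def cf(L, t_on, t_off, a, b):
    """CF(L) = WCET_offchip(L) - WCET_onchip(L); positive means the reload
    point improves the WCET estimate."""
    wcet_offchip = sum(L.f[i] * t_off for i in L.instr)  # pl(L) taken as instr(L)
    wcet_onchip = (sum(L.f[i] * t_on for i in L.mf)               # mf(L) on-chip
                   + sum(L.f[i] * t_off for i in L.instr - L.mf)  # rest off-chip
                   + sum(L.f[i] * (a + b * len(L.mf))             # reload cost at
                         for i in L.pre_head))                    # the pre-headers
    return wcet_offchip - wcet_onchip

def load_first_fit(free_regions, bb_size):
    """First-fit Load for the scratchpad: return the start address of the
    first free region that fits the basic block, or None (fragmentation)."""
    for start, length in free_regions:  # (start_address, length) pairs
        if length >= bb_size:
            return start
    return None
```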

A quantitative comparison of dynamically locked caches vs scratchpad memories

Using this algorithm, a quantitative comparison of dynamically locked caches and scratchpad memories has been made in [5]. Since the content of the locked cache or the scratchpad memory is selected at compile time, both schemes are predictable: the outcome of every memory access (on-chip or off-chip) is known offline.

Table 1: Task characteristics [5]

Several factors related to the nature of the on-chip memory (scratchpad vs locked cache) are expected to impact worst-case performance:

Addressing scheme: While the loading and locking of information in locked caches is under software control, the location of information in the cache is under hardware control and transparent to the software. As a positive aspect, no modification of the code layout is required when a basic block is locked into the instruction cache; on the other hand, there may be conflicts for cache locations. Since two basic blocks that map to the same cache address of a direct-mapped cache cannot be locked simultaneously, only one of them can be held in the cache. In a scratchpad this problem does not arise, since address selection is under software control.

Granularity of allocation: Since the smallest locking unit in a locked cache is a cache block, basic blocks may not be aligned to cache block boundaries if the code layout is not modified.

In the experiments of [5], the differences between dynamic allocation in locked caches and dynamic allocation in scratchpad memory are evaluated. The results are given on a per-task basis (see Table 1), with focus on the WCET. The WCET estimation accounts only for memory accesses (on-chip and off-chip), to isolate the impact of the memory hierarchy: one cycle is assumed for an on-chip memory access and 10 cycles for an off-chip memory access. Architectural elements other than the memory hierarchy (e.g. pipelining, branch prediction) are ignored, to stay as architecture-independent as possible. The experiments are conducted on MIPS R2000/R3000 binary code, but the results are independent of any specific MIPS-compatible processor, since only instruction caches and scratchpad memories are considered. For the instruction cache, 16-byte blocks (4 instructions) with a parametrized associativity degree are considered. By default, cache and scratchpad size is 1 KB, and blocks in the scratchpad memory are aligned to instruction boundaries (4 bytes). The results are expressed in terms of WCET estimates and of ratios of the categories of memory accesses, i.e. $n_{onchip}/(n_{onchip}+n_{offchip})$, $n_{offchip}/(n_{onchip}+n_{offchip})$ and $n_{reload}/(n_{onchip}+n_{offchip})$ (a small sketch of these metrics follows at the end of this passage).

WCET estimates for locked caches and scratchpads are very close to each other for most benchmarks, with an acceptable number of memory accesses for reloading the on-chip memory (see Figure 1); a direct-mapped cache has been used for this experiment. It is also notable that the task size does not have a direct impact on the on-chip memory access ratios: des, the biggest application, whose code size is 11 times the cache/scratchpad size, exhibits a better on-chip memory access ratio than some smaller applications (adpcm, jfdctint).
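A small sketch of the reported access-ratio metrics, with hypothetical counts:

```python
def access_ratios(n_onchip, n_offchip, n_reload):
    """On-chip, off-chip and reload ratios, each normalised by the total
    number of program memory accesses (reload accesses counted separately)."""
    total = n_onchip + n_offchip
    return n_onchip / total, n_offchip / total, n_reload / total

# e.g. 9000 on-chip and 1000 off-chip accesses, plus 300 reload accesses
print(access_ratios(9_000, 1_000, 300))  # -> (0.9, 0.1, 0.03)
```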

Figure 1: On-chip/off-chip/reload ratios for locked caches and scratchpad memories [5]

The impact of the cache block size is shown in Table 2a. For this experiment, a fully-associative cache has been used (so that conflicts for cache block locations do not exist), in order to focus on the impact of the cache block size only. It can be seen that an increase in cache block size always results in a higher WCET estimate. This is because the cache is locked on a cache-block basis, and bigger cache blocks result in the loading of code that belongs to basic blocks not necessarily on the WCEP (pollution). (The pollution problem could be removed, at the cost of a larger code size, by aligning basic blocks on cache block boundaries.) For the benchmarks adpcm and compress, pollution does increase the on-chip access ratio, because the extra instructions locked do not prevent more interesting instructions from being locked. In all cases the reload costs increase with bigger cache blocks, because more instructions than necessary are locked (pollution). Overall, however, the impact of bigger cache blocks on the WCET estimates is rather slim.

Table 2: Impact of block sizes [5]; (a) impact of cache block size, (b) impact of basic block size

The basic block size is the factor with the biggest impact on worst-case performance (see Table 2b). This study is done on the jfdctint benchmark with small basic blocks (small BB) and big basic blocks (big BB), using either a fully-associative cache of 1 KB or a scratchpad of 1 KB. The original version of the benchmark (small BB) has two inner loops which mainly consist of calls to a function with a very small body of only a few C statements. (The increase of the WCET estimate in this version is explained by the time required for function calls and parameter passing.)

In the modified benchmark (big BB), the function bodies are inlined at the call sites, which results in the inner loops mainly consisting of one big basic block of around 1.5 KB. The results show that locked caches are rather insensitive to the size of basic blocks, since the locking granularity is independent of the basic block size. Scratchpads, however, turn out to be very sensitive to the basic block size because of fragmentation: the ratio of on-chip memory accesses drastically drops from 60.4% to 32.2%, because a single big basic block cannot be loaded into the scratchpad memory due to fragmentation (no sufficiently large contiguous free space).

2.3 A comparison of execution times for caches and scratchpads

In [7], the average-case execution time and the WCET for scratchpads and caches are compared. For the selection of the scratchpad memory content (functions and global data), the knapsack-based algorithm from [6] is used; allocation is performed statically and in an energy-optimal way, using the energy model of the ARM7TDMI processor. A direct-mapped cache is used for code and data. For the experiment, the executables are generated with the encc compiler (Energy Aware C Compiler) in the 16-bit THUMB mode of the ARM7 processor, which is recommended for energy- and size-constrained systems because of its higher code density [7]. The executables are simulated using ARMulator [4] (for average-case execution times), and the WCET is analyzed using the aiT tool [2]. Due to the lack of a cache analysis for the ARM7 in aiT, a simple experimental cache analysis provided by AbsInt GmbH is used, which employs only a subset of the analysis techniques available in the commercial versions of aiT. Still, for a simple sorting algorithm with known worst-case input, the results obtained by simulation and by WCET analysis differed by only 0.2%, highlighting the high precision of the WCET analysis tool.

Figure 2: Workflow [7]

Figure 2 shows the workflow of the experiment. The left branch shows the scratchpad setup: the compiler is provided with the scratchpad size and the access costs to solve the corresponding knapsack problem for scratchpad allocation. The generated executable contains the address information for all memory objects, so their static location in main memory or scratchpad is known.

The executable is simulated using ARMulator, which is provided with the size and address range of the scratchpad memory to account for the latencies of accesses to main memory and scratchpad memory in the simulation. The result of this step is a simulated average-case execution time. The WCET analysis for the scratchpad is done with aiT, which requires the specification of the memory regions, i.e. the timings shown in Table 3. The code generation and simulation were repeated for scratchpad memory sizes from 64 bytes to 8 kB.

Table 3: Cycles per memory access (access + wait states) [7]

The right branch shows the cache setup. Since the cache is transparent to the software, encc does not need any special information regarding the cache, and one generated executable is sufficient for all cache sizes. The generated executables are simulated using ARMulator, which requires information about the cache size and organization to determine the number of cycles required for the execution of the benchmarks, for cache sizes from 64 bytes to 8 kB. For the experiments, a simple direct-mapped unified cache architecture as found in ARM processors is assumed. The WCET is analyzed using aiT's cache analysis for the ARM7. As in the scratchpad case, aiT has to be provided with timing information about cache hits and misses in its configuration files. A differentiation between 16- and 32-bit accesses is not necessary here, since the cache always performs 32-bit accesses (Table 3) to fill an entire cache line on a miss. Like a scratchpad access, a cache hit requires only one cycle. The model is based on an AT91EB01 evaluation board by ATMEL Corp.; the access times to main memory depend on the width of the access (see Table 3).

Table 4: Benchmarks [7]

Table 4 shows the benchmarks used in the experiment. As expected, the simulated execution time and the WCET estimate determined by aiT decrease (at the same rate) when the scratchpad memory capacity is increased (see Figure 3a for the G.721 benchmark). The step-function appearance of the simulation results is due to the granularity at which information is allocated to the scratchpad, i.e. only objects that fit into the free space can be allocated. For caches (see Figure 3b for the G.721 benchmark), the simulated times are quite similar to the scratchpad values of Figure 3a. For very small caches, the execution times go up due to the high number of conflict misses; for bigger caches, the times decrease roughly at the same rate as they do for a scratchpad. The WCET, however, stays at a nearly constant, very high level for all cache sizes.

Besides an overestimation by aiT, caused by its limited cache analysis techniques for the ARM7 and the hard-to-predict cache behaviour, the cache can indeed yield a nearly constant number of misses for a worst-case input set.

Figure 3: Results for the G.721 benchmark [7]; (a) using a scratchpad, (b) using a cache

Figure 4 shows the ratio of the WCET estimate and the simulated number of cycles for different scratchpad and cache sizes for the G.721 and MultiSort benchmarks; the number of simulated cycles was normalized to 1. It is notable that the ratio between average-case execution time and WCET remains roughly constant over all scratchpad memory sizes (especially for G.721 in Figure 4a). For caches, the difference between WCET and average-case execution time increases strongly with the cache size in both the G.721 and the MultiSort benchmark. This is due to the decreasing average-case execution time at a nearly constant WCET for increasing cache sizes. For the ADPCM benchmark, Figure 5 shows a clear performance benefit of a scratchpad compared to a cache for small sizes, due to the high number of cache misses occurring for small caches in this benchmark.

Figure 4: Ratio of WCET and simulated cycles for cache- and scratchpad-based systems [7]; (a) G.721 benchmark, (b) MultiSort benchmark

It is also remarkable that the deviation between the WCET and the simulated cycles is very low for this benchmark. This may either be because the input set chosen for the simulation is close to a worst-case set, or because the program is not very control-flow intensive and thus consists mainly of the critical path. All in all, it is notable that using a scratchpad translates directly into a reduced WCET estimate.

Figure 5: Results for the ADPCM benchmark [7]; (a) using a scratchpad, (b) using a cache

2.4 A comparison of area and energy consumption between caches and scratchpad memories

In [1], a comparison of the area and energy consumption of caches and scratchpad memories is performed. For that purpose, an area model and an energy model for both cache and scratchpad memory are developed. The area model is based on the transistor count, computed from the designs of the circuits.

Figure 6: Scratchpad memory array [1]

In the case of the scratchpad memory (see Figures 6 and 7a), the area is the sum of the areas occupied by the data decoder $A_{sde}$, the data array $A_{sda}$, the column multiplexer $A_{sco}$, the pre-charge circuit $A_{spr}$, the data sense amplifiers $A_{sse}$ and the output driver $A_{sou}$:

$A_s = A_{sde} + A_{sda} + A_{sco} + A_{spr} + A_{sse} + A_{sou}$

For the cache, the area is the sum of the areas occupied by the tag array $A_{tag}$ and the data array $A_{data}$:

$A_c = A_{tag} + A_{data}$

The tag array part consists of the tag decoder unit $A_{dt}$, the tag array $A_{ta}$, the column multiplexer $A_{co}$, the pre-charge circuit $A_{pr}$, the sense amplifiers $A_{se}$, the tag comparators $A_{com}$ and the multiplexer driver units $A_{mu}$:

$A_{tag} = A_{dt} + A_{ta} + A_{co} + A_{pr} + A_{se} + A_{com} + A_{mu}$

The data array part consists of the data decoder unit $A_{de}$, the data array $A_{da}$, the column multiplexer $A_{col}$, the pre-charge circuit $A_{pre}$, the data sense amplifiers $A_{sen}$ and the output driver units $A_{out}$:

$A_{data} = A_{de} + A_{da} + A_{col} + A_{pre} + A_{sen} + A_{out}$

For the energy estimation, the CACTI tool [8] is used. In the case of the scratchpad memory, the energy consumption is estimated from the energy consumption of its components, i.e. the decoder $E_{decoder}$ and the memory columns $E_{memcol}$:

$E_{scratchpad} = E_{decoder} + E_{memcol}$

The energy consumption in the memory array comprises the energy consumed in the sense amplifiers, the column multiplexers, the output driver circuitry, and the memory cells due to the word line, the pre-charge circuit and the bit line circuitry; the major share is due to the memory array unit itself. The energy consumption of the memory array is modeled as:

$E_{memcol} = C_{memcol} \cdot V_{dd}^2 \cdot P_{0 \to 1}$

with $P_{0 \to 1}$ the probability of a bit toggle (here taken as 0.5) and $C_{memcol}$ the capacitance of the memory array unit, computed as:

$C_{memcol} = ncols \cdot (C_{pre} + C_{readwrite})$

Figure 7: Memory organization [1]; (a) scratchpad, (b) cache

$C_{pre}$ is the effective load capacitance of the bit lines during pre-charging, $C_{readwrite}$ is the effective load capacitance of the cell read/write, and $ncols$ is the number of columns in the memory. The total energy spent in the scratchpad memory is then:

$E_{sp\,total} = SP_{accesses} \cdot E_{scratchpad}$

with $SP_{accesses}$ the number of accesses to the scratchpad. The power estimation for the cache is performed at the transistor level by the CACTI tool (the predicted area and energy are based on the CACTI model for 0.5 µm technology). The energy consumption per access is the sum of the energy consumed by the previously identified components; the analysis is similar to that described for the scratchpad memory.

The clock cycle estimation is based on the ARMulator trace output for cache and scratchpad memory. It is assumed that the clock cycle count directly reflects performance, i.e. the larger the number of clock cycles, the lower the performance. The code is generated with encc for the ARM7 core, and the identification and static allocation of critical code and data structures to the scratchpad is done via a knapsack packing algorithm. Figure 8 shows the experimental workflow. For the cache configuration, a 2-way set-associative cache is used. In the case of scratchpads, the memory accesses in the ARMulator trace file are classified as going to the scratchpad or going to main memory, and the appropriate latency is added to the overall program delay (see Table 5a). The scratchpad energy consumption is calculated as the number of accesses to the scratchpad multiplied by the energy per access described earlier (a sketch follows below).

Table 5: Memory accesses [1]

(a) Memory access cycles:

Access                  Number of cycles
Cache                   see Table 5b
Scratchpad              1 cycle
Main memory (16 bit)    1 cycle + 1 wait state = 2 cycles
Main memory (32 bit)    1 cycle + 3 wait states = 4 cycles

(b) Cache memory interaction model (accesses generated per cache access, with L the line length in words):

Access type    Ca read    Ca write    Mm read    Mm write
Read hit       1          0           0          0
Read miss      1          L           L          0
Write hit      0          1           0          1
Write miss     1          0           0          1

Figure 8: Experimental flow diagram [1]
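A minimal sketch of the scratchpad side of this energy model; all capacitances, the supply voltage and the access count are hypothetical placeholders rather than CACTI outputs:

```python
def e_memcol(ncols, c_pre, c_readwrite, v_dd, p_toggle=0.5):
    """E_memcol = C_memcol * Vdd^2 * P_0->1, with
    C_memcol = ncols * (C_pre + C_readwrite)."""
    c_memcol = ncols * (c_pre + c_readwrite)
    return c_memcol * v_dd ** 2 * p_toggle

def e_scratchpad(e_decoder, e_memcol_value):
    """Per-access scratchpad energy: decoder plus memory columns."""
    return e_decoder + e_memcol_value

def e_sp_total(sp_accesses, e_per_access):
    """Total scratchpad energy: number of accesses times energy per access."""
    return sp_accesses * e_per_access

# hypothetical 128-column array, capacitances in farads, 3.3 V supply
per_access = e_scratchpad(0.05e-9, e_memcol(128, 0.2e-12, 0.1e-12, 3.3))
print(e_sp_total(1_000_000, per_access))  # joules over one million accesses
```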

Table 6: Energy per access for various devices [1]

In the case of caches, the numbers of cache read hits, read misses, write hits and write misses can be obtained from ARMulator's trace file. The cache is a write-through cache, and the number of accesses is computed based on Table 5b, where the number of cycles required for each type of access is listed in Table 5a. Four cases of cache access are considered in the model (sketched below):

Cache read hit: On a read hit, the tag register is accessed and the data is read from the cache.

Cache read miss: On a read miss, the corresponding cache line (L words) has to be brought from main memory into the cache. Thus we have a cache read, followed by a memory read and a cache write of L words.

Cache write hit: On a write hit, we have a cache write followed by a main memory write.

Cache write miss: On a write miss, we have a cache tag read (to establish the miss) followed by a main memory write. The cache is not updated in this case.

Figure 9: Comparison of cache and scratchpad [1]; (a) area, (b) energy consumption
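A minimal sketch of this interaction model, mirroring Table 5b; the trace statistics, line length and per-access energy are hypothetical:

```python
def interactions(access_type, L):
    """(Ca_read, Ca_write, Mm_read, Mm_write) generated by one access of the
    given type in a write-through cache with lines of L words (Table 5b)."""
    table = {
        "read_hit":   (1, 0, 0, 0),  # tag check, data read from cache
        "read_miss":  (1, L, L, 0),  # line fill: L memory reads, L cache writes
        "write_hit":  (0, 1, 0, 1),  # cache write plus write-through memory write
        "write_miss": (1, 0, 0, 1),  # tag read establishes the miss, memory write
    }
    return table[access_type]

def cache_energy(trace_counts, L, e_per_cache_access):
    """E_cache = (N_c_read + N_c_write) * E, cf. the equation derived below."""
    n_c_read = n_c_write = 0
    for access_type, n in trace_counts.items():
        ca_read, ca_write, _, _ = interactions(access_type, L)
        n_c_read += n * ca_read
        n_c_write += n * ca_write
    return (n_c_read + n_c_write) * e_per_cache_access

# hypothetical ARMulator statistics for a cache with 4-word lines
counts = {"read_hit": 90_000, "read_miss": 5_000,
          "write_hit": 8_000, "write_miss": 2_000}
print(cache_energy(counts, L=4, e_per_cache_access=0.15e-9))
```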

Table 7: Area/performance ratios for bubblesort [1]

Using this model, the cache energy equation is derived as:

$E_{cache} = (N_{c\,read} + N_{c\,write}) \cdot E$

where $E_{cache}$ is the energy spent in the cache, $N_{c\,read}$ the number of cache read accesses and $N_{c\,write}$ the number of cache write accesses. The energy per access $E$ is computed like $E_{memcol}$, considering the appropriate load and cycle count. To compare the energy consumption, the energy consumption of the main memory needs to be known as well: the scratchpad values for 2048 bytes were obtained from the previously described models, while the main memory values were obtained by measurements on the ATMEL board (see Table 6). Thus both the main memory energy consumption and the on-chip memory consumption are accounted for.

A series of experiments was conducted to demonstrate the merits of using on-chip scratchpads and caches. Figure 9a shows an area comparison of cache and scratchpad for varying sizes: on average, the area occupied by the scratchpad is 34% smaller than that of the cache. Figure 9b shows the energy consumed for the biquad, matrixmult and quicksort examples for cache and scratchpad, respectively. In all cases except biquad with a cache size of 256 bytes, the scratchpad consumes less energy than a cache of the same size; on average, the scratchpad memory reduced the energy by 40% compared to a cache.

Table 7 gives the area/performance trade-off. Column 1 is the cache/scratchpad size in bytes. Columns 2 and 3 are the cache and scratchpad area in transistors. Columns 4 and 5 are the number of CPU cycles (in thousands) for the cache- and scratchpad-based systems. Columns 6 and 7 give the area and cycle reduction achieved by replacing the cache with a scratchpad, and column 8 gives the improvement of the area-time product AT (assuming constant cycles), computed as:

$AT = \frac{A_s \cdot N_s}{A_c \cdot N_c}$

The average reductions for this example are 34% for area, 18% for time and 46% for the AT product.
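A one-line check of the AT figures, plugging in the reported average reductions as hypothetical relative values:

```python
def at_ratio(a_s, n_s, a_c, n_c):
    """Area-time product of the scratchpad system relative to the cache
    system, AT = (A_s * N_s) / (A_c * N_c); below 1.0 favours the scratchpad."""
    return (a_s * n_s) / (a_c * n_c)

# 34% smaller area and 18% fewer cycles (relative, hypothetical inputs)
print(1 - at_ratio(a_s=0.66, n_s=0.82, a_c=1.0, n_c=1.0))  # ~0.46, i.e. 46%
```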

3 Scratchpad memory vs Caches - Performance and Predictability

As presented in the related work, especially in Section 2.3, scratchpad memories outperform caches on almost all counts with respect to predictability. Caches exhibit a very high WCET estimate compared to the average-case execution time because of their dynamic, hardware-controlled allocation and replacement strategies, which can lead to a high number of cache misses for a worst-case input set. Furthermore, a tight WCET estimate for caches is only possible if all cache parameters (mapping, block size, associativity, replacement policy, ...) are well known, which is often not the case in practice due to the lack of documentation; this leads to WCET overestimation. For scratchpad memories, in contrast, the parameters are well known and the allocation is under full compiler/software control. Thus the uncertainty concerning execution time caused by hard-to-predict cache behavior is eliminated, and a simple WCET analysis based on the software and the basic memory access times is sufficient. This also eases WCET-oriented optimization with scratchpad memory.

A similar type of on-chip memory are locked caches. Since locked caches allow the software to lock content, they yield a good predictability, similar to scratchpad memories. The downside is that the location of information in locked caches is still entirely under hardware control and transparent to the software, so conflicts for cache locations may occur. This in turn depends on the code/data alignment with respect to the cache blocks, which complicates WCET estimation.

Figure 10: Cache/scratchpad performance for the G.721 benchmark [7]; (a) average-case execution time, (b) worst-case execution time
Figure 11: Cache/scratchpad performance for the ADPCM benchmark [7]; (a) average-case execution time, (b) worst-case execution time

Figures 10b and 11b show the WCET estimates for cache and scratchpad memory for the G.721 and ADPCM benchmarks, respectively. The cache yields a higher WCET estimate than the scratchpad memory in almost all cases. Especially for small caches the WCET estimate is extremely high because of the high number of cache misses.

Figure 12: Cache/scratchpad CPU cycles for bubblesort [1]

As for the average-case performance, caches yield good results especially at larger sizes (see Figures 10a, 11a and 12 for the G.721, ADPCM and bubblesort benchmarks). For smaller sizes, scratchpads outperform caches because of the high number of cache misses, which result in very frequent lookups followed by reloads. For larger caches, the average-case performance converges towards the scratchpad performance and even surpasses it in some cases (Figures 10a, 11a and 12). This is because a worst-case input set, which would result in the WCET, does not occur very frequently under normal circumstances. With dynamic allocation, the benefit of a cache is that it reloads data on the fly as needed, so the software does not have to take care of reloading. With respect to energy efficiency, scratchpads outperform caches of the same size on almost all counts, as shown in Section 2.4.

4 Conclusion

In this paper, scratchpad memories and caches have been compared with respect to performance and predictability. The results indicate that scratchpad memories are a good replacement for caches, especially in low-power embedded systems and real-time systems: their better WCET performance and higher predictability, owing to their simplicity, make them suitable for hard real-time environments, and their energy efficiency is beneficial for low-power applications such as mobile devices.

On the other hand, scratchpads cannot always replace caches. The transparency of caches makes them beneficial for software portability, since, unlike a scratchpad, the cache need not be considered by the compiler or programmer. For desktop PCs, the WCET is rather unimportant, since it is the average-case performance that is normally experienced; with caches, an average-case performance comparable to scratchpads can be achieved while software portability is maintained.

As a trade-off, locked caches are of interest, since they maintain the transparency of a regular cache while still being able to lock information when requested by the software. This is especially beneficial in mixed scenarios with real-time components, where the cache can be controlled by the software in critical sections and remain transparent in uncritical sections, while still providing a performance improvement. A direct comparison between locked and regular caches, along with a direct comparison between dynamically and statically allocated scratchpads, would be of interest, since [5] only compares dynamically allocated scratchpads with locked caches. Furthermore, scenarios where scratchpads perform worse than caches, if any, would be of interest.

5 Bibliography

[1] R. Banakar, S. Steinke, Bo-Sik Lee, M. Balakrishnan & P. Marwedel (2002): Scratchpad memory: a design alternative for cache on-chip memory in embedded systems. In: Proceedings of the Tenth International Symposium on Hardware/Software Codesign (CODES 2002).

[2] AbsInt Angewandte Informatik GmbH (2004): aiT: Worst-Case Execution Time Analyzers. Available at http://www.absint.com.

[3] R. Heckmann, M. Langenbach, S. Thesing & R. Wilhelm (2003): The influence of processor architecture on the design and the results of WCET tools. Proceedings of the IEEE 91(7).

[4] ARM Ltd.: ARM Instruction Set Simulator ARMulator.

[5] I. Puaut & C. Pais (2007): Scratchpad memories vs locked caches in hard real-time systems: a quantitative comparison. In: Design, Automation & Test in Europe Conference & Exhibition (DATE '07), pp. 1-6.

[6] S. Steinke, L. Wehmeyer, Bo-Sik Lee & P. Marwedel (2002): Assigning program and data objects to scratchpad for energy reduction. In: Design, Automation and Test in Europe Conference and Exhibition (DATE 2002), Proceedings.

[7] L. Wehmeyer & P. Marwedel (2005): Influence of memory hierarchies on predictability for time constrained embedded software. In: Design, Automation and Test in Europe (DATE 2005), Proceedings, Vol. 1.

[8] S.J.E. Wilton & N.P. Jouppi (1996): CACTI: an enhanced cache access and cycle time model. IEEE Journal of Solid-State Circuits 31(5).

[9] Wm. A. Wulf & Sally A. McKee (1995): Hitting the Memory Wall: Implications of the Obvious. SIGARCH Computer Architecture News 23(1), pp. 20-24.


More information

Operating system integrated energy aware scratchpad allocation strategies for multiprocess applications

Operating system integrated energy aware scratchpad allocation strategies for multiprocess applications University of Dortmund Operating system integrated energy aware scratchpad allocation strategies for multiprocess applications Robert Pyka * Christoph Faßbach * Manish Verma + Heiko Falk * Peter Marwedel

More information

And in Review! ! Locality of reference is a Big Idea! 3. Load Word from 0x !

And in Review! ! Locality of reference is a Big Idea! 3. Load Word from 0x ! CS61C L23 Caches II (1)! inst.eecs.berkeley.edu/~cs61c CS61C Machine Structures Lecture 23 Caches II 2010-07-29!!!Instructor Paul Pearce! TOOLS THAT AUTOMATICALLY FIND SOFTWARE BUGS! Black Hat (a security

More information

Welcome to Part 3: Memory Systems and I/O

Welcome to Part 3: Memory Systems and I/O Welcome to Part 3: Memory Systems and I/O We ve already seen how to make a fast processor. How can we supply the CPU with enough data to keep it busy? We will now focus on memory issues, which are frequently

More information

Page 1. Multilevel Memories (Improving performance using a little cash )

Page 1. Multilevel Memories (Improving performance using a little cash ) Page 1 Multilevel Memories (Improving performance using a little cash ) 1 Page 2 CPU-Memory Bottleneck CPU Memory Performance of high-speed computers is usually limited by memory bandwidth & latency Latency

More information

Review: Performance Latency vs. Throughput. Time (seconds/program) is performance measure Instructions Clock cycles Seconds.

Review: Performance Latency vs. Throughput. Time (seconds/program) is performance measure Instructions Clock cycles Seconds. Performance 980 98 982 983 984 985 986 987 988 989 990 99 992 993 994 995 996 997 998 999 2000 7/4/20 CS 6C: Great Ideas in Computer Architecture (Machine Structures) Caches Instructor: Michael Greenbaum

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

LECTURE 11. Memory Hierarchy

LECTURE 11. Memory Hierarchy LECTURE 11 Memory Hierarchy MEMORY HIERARCHY When it comes to memory, there are two universally desirable properties: Large Size: ideally, we want to never have to worry about running out of memory. Speed

More information

Optimizations - Compilation for Embedded Processors -

Optimizations - Compilation for Embedded Processors - Springer, 2010 12 Optimizations - Compilation for Embedded Processors - Peter Marwedel TU Dortmund Informatik 12 Germany 2014 年 01 月 17 日 These slides use Microsoft clip arts. Microsoft copyright restrictions

More information

COSC4201. Chapter 5. Memory Hierarchy Design. Prof. Mokhtar Aboelaze York University

COSC4201. Chapter 5. Memory Hierarchy Design. Prof. Mokhtar Aboelaze York University COSC4201 Chapter 5 Memory Hierarchy Design Prof. Mokhtar Aboelaze York University 1 Memory Hierarchy The gap between CPU performance and main memory has been widening with higher performance CPUs creating

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Tutorial 11. Final Exam Review

Tutorial 11. Final Exam Review Tutorial 11 Final Exam Review Introduction Instruction Set Architecture: contract between programmer and designers (e.g.: IA-32, IA-64, X86-64) Computer organization: describe the functional units, cache

More information

Computer Architecture

Computer Architecture Computer Architecture Lecture 7: Memory Hierarchy and Caches Dr. Ahmed Sallam Suez Canal University Spring 2015 Based on original slides by Prof. Onur Mutlu Memory (Programmer s View) 2 Abstraction: Virtual

More information

Shared Cache Aware Task Mapping for WCRT Minimization

Shared Cache Aware Task Mapping for WCRT Minimization Shared Cache Aware Task Mapping for WCRT Minimization Huping Ding & Tulika Mitra School of Computing, National University of Singapore Yun Liang Center for Energy-efficient Computing and Applications,

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

DESIGN AND ANALYSIS OF ALGORITHMS. Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

CS161 Design and Architecture of Computer Systems. Cache $$$$$

CS161 Design and Architecture of Computer Systems. Cache $$$$$ CS161 Design and Architecture of Computer Systems Cache $$$$$ Memory Systems! How can we supply the CPU with enough data to keep it busy?! We will focus on memory issues,! which are frequently bottlenecks

More information

Write only as much as necessary. Be brief!

Write only as much as necessary. Be brief! 1 CIS371 Computer Organization and Design Midterm Exam Prof. Martin Thursday, March 15th, 2012 This exam is an individual-work exam. Write your answers on these pages. Additional pages may be attached

More information

Memory-architecture aware compilation

Memory-architecture aware compilation - 1- ARTIST2 Summer School 2008 in Europe Autrans (near Grenoble), France September 8-12, 8 2008 Memory-architecture aware compilation Lecturers: Peter Marwedel, Heiko Falk Informatik 12 TU Dortmund, Germany

More information

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES

Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES DESIGN AND ANALYSIS OF ALGORITHMS Unit 1 Chapter 4 ITERATIVE ALGORITHM DESIGN ISSUES http://milanvachhani.blogspot.in USE OF LOOPS As we break down algorithm into sub-algorithms, sooner or later we shall

More information

Loops. Announcements. Loop fusion. Loop unrolling. Code motion. Array. Good targets for optimization. Basic loop optimizations:

Loops. Announcements. Loop fusion. Loop unrolling. Code motion. Array. Good targets for optimization. Basic loop optimizations: Announcements HW1 is available online Next Class Liang will give a tutorial on TinyOS/motes Very useful! Classroom: EADS Hall 116 This Wed ONLY Proposal is due on 5pm, Wed Email me your proposal Loops

More information

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory II

Algorithm Performance Factors. Memory Performance of Algorithms. Processor-Memory Performance Gap. Moore s Law. Program Model of Memory II Memory Performance of Algorithms CSE 32 Data Structures Lecture Algorithm Performance Factors Algorithm choices (asymptotic running time) O(n 2 ) or O(n log n) Data structure choices List or Arrays Language

More information

Large and Fast: Exploiting Memory Hierarchy

Large and Fast: Exploiting Memory Hierarchy CSE 431: Introduction to Operating Systems Large and Fast: Exploiting Memory Hierarchy Gojko Babić 10/5/018 Memory Hierarchy A computer system contains a hierarchy of storage devices with different costs,

More information

CS 201 The Memory Hierarchy. Gerson Robboy Portland State University

CS 201 The Memory Hierarchy. Gerson Robboy Portland State University CS 201 The Memory Hierarchy Gerson Robboy Portland State University memory hierarchy overview (traditional) CPU registers main memory (RAM) secondary memory (DISK) why? what is different between these

More information

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur

High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur High Performance Computer Architecture Prof. Ajit Pal Department of Computer Science and Engineering Indian Institute of Technology, Kharagpur Lecture - 23 Hierarchical Memory Organization (Contd.) Hello

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

August 1994 / Features / Cache Advantage. Cache design and implementation can make or break the performance of your high-powered computer system.

August 1994 / Features / Cache Advantage. Cache design and implementation can make or break the performance of your high-powered computer system. Cache Advantage August 1994 / Features / Cache Advantage Cache design and implementation can make or break the performance of your high-powered computer system. David F. Bacon Modern CPUs have one overriding

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Improving Timing Analysis for Matlab Simulink/Stateflow

Improving Timing Analysis for Matlab Simulink/Stateflow Improving Timing Analysis for Matlab Simulink/Stateflow Lili Tan, Björn Wachter, Philipp Lucas, Reinhard Wilhelm Universität des Saarlandes, Saarbrücken, Germany {lili,bwachter,phlucas,wilhelm}@cs.uni-sb.de

More information

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism

Motivation for Parallelism. Motivation for Parallelism. ILP Example: Loop Unrolling. Types of Parallelism Motivation for Parallelism Motivation for Parallelism The speed of an application is determined by more than just processor speed. speed Disk speed Network speed... Multiprocessors typically improve the

More information

Memory. Lecture 22 CS301

Memory. Lecture 22 CS301 Memory Lecture 22 CS301 Administrative Daily Review of today s lecture w Due tomorrow (11/13) at 8am HW #8 due today at 5pm Program #2 due Friday, 11/16 at 11:59pm Test #2 Wednesday Pipelined Machine Fetch

More information

An introduction to SDRAM and memory controllers. 5kk73

An introduction to SDRAM and memory controllers. 5kk73 An introduction to SDRAM and memory controllers 5kk73 Presentation Outline (part 1) Introduction to SDRAM Basic SDRAM operation Memory efficiency SDRAM controller architecture Conclusions Followed by part

More information

CS152 Computer Architecture and Engineering Lecture 16: Memory System

CS152 Computer Architecture and Engineering Lecture 16: Memory System CS152 Computer Architecture and Engineering Lecture 16: System March 15, 1995 Dave Patterson (patterson@cs) and Shing Kong (shing.kong@eng.sun.com) Slides available on http://http.cs.berkeley.edu/~patterson

More information

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck

CS252 S05. Main memory management. Memory hardware. The scale of things. Memory hardware (cont.) Bottleneck Main memory management CMSC 411 Computer Systems Architecture Lecture 16 Memory Hierarchy 3 (Main Memory & Memory) Questions: How big should main memory be? How to handle reads and writes? How to find

More information

Memory Technologies. Technology Trends

Memory Technologies. Technology Trends . 5 Technologies Random access technologies Random good access time same for all locations DRAM Dynamic Random Access High density, low power, cheap, but slow Dynamic need to be refreshed regularly SRAM

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2015 Lecture 15

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2015 Lecture 15 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2015 Lecture 15 LAST TIME! Discussed concepts of locality and stride Spatial locality: programs tend to access values near values they have already accessed

More information

Using a Victim Buffer in an Application-Specific Memory Hierarchy

Using a Victim Buffer in an Application-Specific Memory Hierarchy Using a Victim Buffer in an Application-Specific Memory Hierarchy Chuanjun Zhang Depment of lectrical ngineering University of California, Riverside czhang@ee.ucr.edu Frank Vahid Depment of Computer Science

More information

Memory Hierarchy Y. K. Malaiya

Memory Hierarchy Y. K. Malaiya Memory Hierarchy Y. K. Malaiya Acknowledgements Computer Architecture, Quantitative Approach - Hennessy, Patterson Vishwani D. Agrawal Review: Major Components of a Computer Processor Control Datapath

More information