Efficient Cache Locking at Private First-Level Caches and Shared Last-Level Cache for Modern Multicore Systems

Abu Asaduzzaman 1, Kishore K. Chidella 2, Md Moniruzzaman 3
1,2,3 Electrical Engineering and Computer Science Department, Wichita State University, Wichita, Kansas, USA

Abstract: Most modern computing systems have multicore processors with multilevel caches for high performance. Caches increase total power consumption and worsen execution time unpredictability. Studies show that way (or partial) cache locking may improve timing predictability and the performance-to-power ratio for both single-core and multicore systems. Even though both private first-level and shared last-level cache locking improve timing predictability, it is difficult to justify the performance and power trade-off between these two locking mechanisms. In this work, we evaluate two cache locking schemes for multicore systems: one at the private first-level caches and one at the shared last-level cache. Both schemes are based on the analysis of the applications' worst-case execution time (WCET), and both allow changing the locked cache size during runtime to achieve the optimal performance-to-power ratio for the running applications. Using the Heptane WCET analyser, we generate workloads for H.264/AVC, MPEG4, FFT, MI, and DFT codes. Using the VisualSim tool, we model and simulate a system with four cores and two levels of caches. Experimental results confirm that both cache locking schemes improve timing predictability by decreasing the total number of cache misses. Results also indicate that for small applications like FFT, shared last-level cache locking outperforms private first-level cache locking; but for large applications like MPEG4 and H.264/AVC, private first-level cache locking performs better than shared last-level cache locking.

Keywords: Cache locking, multicore architecture, performance-to-power ratio, timing predictability, private first-level cache, shared last-level cache

I. INTRODUCTION

As multicore processors provide a higher performance-to-power ratio, the popularity and demand of multicore processors are increasing in both desktop and embedded markets [1-3]. In a multicore processor, two or more independent cores are combined on a single die. In most cases, each core has its own private first-level cache (CL1), which is split into an instruction cache (I1) and a data cache (D1); the multicore processor may have one shared last-level cache such as a shared level-2 cache (CL2) or multiple distributed CL2s [4-5]. A common choice for the memory organization of a multicore system is a two-level cache hierarchy (examples include Intel Xeon, IBM Power5, and Sun Niagara) [1, 3-7]. According to this memory hierarchy, level-1 caches are attached to and privately accessible by each core. A larger level-2 cache is shared by the cores (e.g., Intel Xeon). Please note that the level-2 cache can be private to each core (as in AMD Athlon), but that is beyond the scope of this work. The presence of a shared level-2 cache offers flexibility in adjusting the memory allocated per core according to its requirements, as well as the possibility of multiple cores getting fast access to shared code and/or data. Cache parameters significantly influence system performance, especially the worst-case performance of embedded systems [8-9].
New-generation multicore designs have shown that two (or more) cores running at (or below) one half of the frequency can approach the performance of a single core running at full frequency, while the multicore consumes less power. Multicore architectures are more suitable for real-time applications because concurrent execution of tasks on a single processor is inadequate for achieving the required level of performance and reliability, in many respects including thermal constraints (power consumption and heat dissipation). Real-time systems deal with timing constraints and usually interact with the environment rather than a human operator. Because timeliness and reliability are so important in their behaviour, real-time systems are often distributed among multiple program units (a.k.a. tasks) running simultaneously to perform the required functions. However, the increasing use of caches potentially increases execution time unpredictability. Real-time applications cannot afford to miss deadlines and hence demand timing predictability. As multicore systems have multiple levels of caches, supporting real-time applications on multicore architectures poses a significant challenge. For single-core systems, it has been proven that cache locking improves predictability [7-8, 10-16]. Cache locking is the ability to prevent some or all of the instruction or data cache from being overwritten. Cache entries can be locked for either an entire cache or for individual ways within the cache. Entire cache locking is inefficient if the number of instructions or the size of the data to be locked is small compared to the cache size. In way locking, only a portion of the cache is locked by locking ways within the cache.

Unlocked ways of the cache behave normally. Using way locking, the Intel Xeon processor achieves the effect of the local storage used in the IBM Cell architecture. Most processors (unlike the PowerPC 750GX) allow way locking at the level-1 or level-2 cache [17-18]. So, way locking may be an important alternative to entire locking. Effectively using the multilevel caches of multicore systems is a great challenge. To the best of our knowledge, current multicore processors are not able to take full advantage of cache locking, because no such efficient cache locking technique exists. Most existing cache locking mechanisms are not suitable for multicore systems. In this work, we explore two promising cache locking mechanisms for multicore systems: one at the private level-1 caches and another at the shared level-2 cache. Our work focuses on a higher level of abstraction. The target systems (that should be evaluated) may not even exist. Instead of pinpoint accuracy, our goal is to provide fast, reasonable accuracy at the early stage of the system design flow. The beauty of our approach is that this easy and fast solution can be used to simulate both existing and non-existing systems.

This paper is organized as follows. In Section II, some related articles are discussed. The cache locking schemes are introduced in Section III. The schemes are evaluated by simulating a multicore system; simulation details are presented in Section IV. In Section V, some important simulation results are discussed. Finally, this work is concluded in Section VI.

II. LITERATURE SURVEY

Cache locking has been used to improve timing predictability in single-core systems for years. However, cache locking in modern multicore systems is difficult due to the fact that a multicore architecture has multiple levels of caches. Therefore, we first discuss the cache memory subsystem of popular single-core and multicore systems; then we discuss some selected published articles closely related to level-1 and level-2 cache locking.

A traditional inclusive cache memory subsystem consists of an on-chip CL1 (split into instruction cache I1 and data cache D1) and an off-chip CL2 (unified). The schematic diagram in Figure 1 illustrates a single-core system where CL1 is on-chip and CL2 is off-chip [1]. In this system, both CL1 and CL2 are private.

Figure 1. Cache memory subsystem of a single-core system

The architecture of a dual-core system, where each core has its own private CL1 and CL2 is shared by both cores, is illustrated in Figure 2 [1]. Level-1 cache locking in a single-core system is easier than in a multicore system. Level-1 cache locking in multicore systems is difficult as each core keeps its dataset local and modifies its local dataset as required.

Figure 2. Cache memory subsystem of a dual-core system

Methods presented in [10-11] suggest that cache contents should be statically locked to make memory access time and cache-related pre-emption delay predictable. However, these approaches need to be tested using larger real benchmarks. An algorithm for off-line selection of the contents of two on-chip memories (locked caches and scratchpad memories) is proposed in [12]. Experimental results show that the algorithm generates good ratios of on-chip memory accesses on the worst-case execution path. However, worst-case performance with locked caches may be degraded when large cache lines are used, due to cache pollution.
In [13], static cache analysis is combined with data cache locking to estimate the worst-case memory performance in a safe, tight, and fast way. Experimental results show that this scheme is more predictable than a system without a cache. In [14], a memory hierarchy is proposed to provide high performance combined with high predictability for complex systems.

In [15], an algorithm is introduced which partitions a task into a set of regions; each region statically owns locked cache contents determined offline. A sharp improvement is observed compared with a system without any cache. In [16], various algorithms for selecting a set of instructions to be locked in the cache are compared. The algorithms mentioned in [13-16] show performance improvement and can be used to assess a tight upper bound on the response time of tasks. However, these techniques were developed for single-core systems. They are not useful for estimating power consumption, a crucial design factor for embedded/mobile systems. Therefore, these techniques are not adequate for analysing the performance, power consumption, and predictability of multicore (real-time) systems.

A miss table based static cache locking scheme is introduced at the level-2 cache in [7] to improve execution time predictability and the overall system performance/power ratio. The miss table holds the information of the block addresses (related to the applications being processed) that cause the most cache misses if not locked. Experimental results show that in addition to improving predictability, a reduction in mean delay per task and a reduction in total power consumption are achieved for MPEG4. However, this level-2 cache locking technique is not applicable to level-1 cache locking. Also, this static technique does not allow changing the total locked cache size during runtime.

III. PROPOSED CACHE LOCKING STRATEGIES FOR MULTICORE SYSTEMS

In this section, we present private level-1 and shared level-2 cache locking schemes suitable for multicore systems. For both schemes, cache miss information for the target applications is pre-processed and used by the cache locking algorithm to determine the right memory blocks to lock.

A. Cache Locking

Cache locking is a mechanism that prevents some or all of the instructions or data from being replaced in the cache. Cache entries can be locked for either an entire cache or for individual ways within the cache [19]. In entire cache locking, cache hits are treated in the same manner as hits to an unlocked cache; cache misses are treated as cache-inhibited accesses. Invalid cache entries at the time of the locking remain invalid and inaccessible until the cache is unlocked. Entire cache locking is inefficient if the number of instructions or the size of the data to be locked is small compared to the cache size. In way locking, only a portion of the cache is locked by locking ways within the cache. Invalid entries in way locking are accessible and available for data placement; this behaviour differs from entire cache locking. Unlocked ways of the cache behave normally. Understanding the impact of cache locking on the performance, power consumption, and predictability of multicore systems requires analysing them separately and observing their interaction with the entire system architecture using the target applications. In a multicore system, the cache can be locked at level-1 or at level-2. Although cache locking at level-1 in a multicore is difficult (each level-1 cache is private to a specific core), level-1 cache locking may be beneficial for some applications, as locked blocks are very close to the core. Cache locking at level-2 in a multicore is easy, as the level-2 cache is shared by all cores, and may be beneficial for some smaller applications.
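To make way locking concrete, a minimal sketch follows. It models one set of an 8-way set-associative cache in which each way carries a lock bit; replacement considers only the unlocked ways, and invalid unlocked ways remain available for placement. This is a simulation-level illustration under our own naming, not the locking interface of any particular processor.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_WAYS 8   /* 8-way set associativity, as in our simulations */

typedef struct {
    uint32_t tag;
    bool     valid;
    bool     locked;  /* way lock bit: a locked way is never evicted */
} CacheWay;

typedef struct {
    CacheWay way[NUM_WAYS];
} CacheSet;

/* Pick a victim way in a set, considering only unlocked ways.
 * Returns -1 if every way in the set is locked; the access is then
 * treated as cache-inhibited, as in entire cache locking. */
int select_victim(const CacheSet *set)
{
    int candidates[NUM_WAYS], n = 0;

    for (int w = 0; w < NUM_WAYS; w++) {
        if (!set->way[w].locked) {
            if (!set->way[w].valid)
                return w;          /* invalid unlocked way: place here first */
            candidates[n++] = w;
        }
    }
    if (n == 0)
        return -1;                 /* all ways in this set are locked */
    return candidates[rand() % n]; /* random replacement among unlocked ways */
}
```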
B. Block Address and Miss Information (BAMI)

For way cache locking, information about the blocks that cause cache misses is collected in a table called a BAMI (short for block address and miss information). In a BAMI, memory block addresses are sorted in descending order of their total number of misses. Generating a BAMI is tricky and involves manual activities. The major steps to create a BAMI are:

- Select the target applications (written in C). If an application is very large, select code segments that contain big loops, because the cache locking mechanism is expected to perform better on computation-intensive big loops.
- Generate the Heptane (Hades Embedded Processor Timing ANalyzEr) tree-graph using the application code (in C) [20].
- Manually collect instruction block (IB) miss information from the tree-graph.
- Create and sort the IB-address miss block list; the memory block with the maximum number of misses should be first.
- Select memory blocks for the BAMI based on the cache size, line size, and locked cache amount (see the sketch after this subsection).

For each application, after post-processing the Heptane tree-graph, one BAMI is generated. BAMIs can be used for cache locking, cache replacement, selective preloading, and so on. A BAMI is essentially a look-up table; information about a small number of selected blocks should be stored in it. For cache locking at CL1 in multicore systems, BAMIs can be implemented using registers or a small piece of cache in each core.
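The sort-and-select step can be sketched as follows, assuming the per-block miss counts have already been extracted from the Heptane tree-graph; MissRecord and build_bami are hypothetical names used for illustration only.

```c
#include <stdint.h>
#include <stdlib.h>

/* One record extracted (manually) from the Heptane tree-graph. */
typedef struct {
    uint32_t block_addr;  /* memory block address */
    uint32_t misses;      /* number of misses attributed to this block */
} MissRecord;

/* Sort in descending order of miss count. */
static int by_misses_desc(const void *a, const void *b)
{
    const MissRecord *x = a, *y = b;
    return (y->misses > x->misses) - (y->misses < x->misses);
}

/* Build a BAMI: keep the top entries that fit in the locked portion of
 * the cache. locked_bytes / line_size gives the number of lockable
 * blocks. Returns the number of BAMI entries produced. */
size_t build_bami(MissRecord *recs, size_t nrecs,
                  size_t locked_bytes, size_t line_size,
                  MissRecord *bami, size_t bami_cap)
{
    qsort(recs, nrecs, sizeof recs[0], by_misses_desc);

    size_t lockable = locked_bytes / line_size; /* blocks we can lock */
    size_t n = lockable < nrecs ? lockable : nrecs;
    if (n > bami_cap)
        n = bami_cap;
    for (size_t i = 0; i < n; i++)
        bami[i] = recs[i];   /* top miss-causing blocks, in order */
    return n;
}
```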

For cache locking at the shared CL2, by contrast, BAMIs can be implemented using a small portion of CL2. In a BAMI, the top-most entry should hold information about the memory block that has the maximum number of misses, and so on. Say the size of a BAMI cache is 128 Bytes and each BAMI entry is 64 bits (of which 1 bit is the lock-bit L, 32 bits are for the block address, and the remaining 31 bits are for the number of misses). Each BAMI may then hold 16 (= 128 * 8 / 64) entries. The lock-bit (L) can be set/cleared dynamically to indicate whether the associated block is locked.
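A minimal sketch of this 64-bit entry layout using C bit-fields is shown below; bit-field packing is implementation-defined, so a hardware design would fix the layout explicitly, and the names are ours.

```c
#include <stdint.h>

/* One 64-bit BAMI entry as described above: 1 lock-bit L, a 32-bit
 * block address, and a 31-bit miss count (1 + 32 + 31 = 64 bits). */
typedef struct {
    uint64_t lock   : 1;   /* L: set/cleared dynamically on (un)lock */
    uint64_t addr   : 32;  /* memory block address */
    uint64_t misses : 31;  /* miss count from the WCET analysis */
} BamiEntry;

/* Guard against a compiler that does not pack the fields into 64 bits. */
_Static_assert(sizeof(BamiEntry) == 8, "BAMI entry must pack into 64 bits");

/* A 128-Byte BAMI holds 128 * 8 / 64 = 16 such entries. */
#define BAMI_BYTES   128
#define BAMI_ENTRIES (BAMI_BYTES * 8 / 64)   /* = 16 */
```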

C. Work-Flow of the Cache Locking Schemes

In this subsection, we explain the basic work-flow of the private level-1 and shared level-2 cache locking techniques. According to these schemes, the blocks that are anticipated to cause more misses should be locked. Both schemes are based on WCET analysis of the target applications. A number of jobs are selected depending on the total number of available (free) cores. A job can be a complete application or a part of an application. Each job should have a number of tasks. A task is a segment of code (one or more instructions, such as threads). After the jobs are selected, the CL1s and CL2 are preloaded with selected blocks using the BAMI(s).

There are significant differences between the two cache locking schemes. In the level-1 locking scheme, the cache locking decision is made at the core level (by each core) after the jobs are assigned among the free cores. Every time a core gets a job to process, the cache locking decision and the locked I1 cache size are determined depending on the assigned job. In this scheme, the locked cache size may be changed during runtime to achieve the optimal predictability and performance/power ratio. The diagram in Figure 3 illustrates the work-flow of the level-1 cache locking strategy. In this scheme, how much cache should be locked is determined dynamically after a core decides to apply level-1 cache locking on a job. In the shared level-2 cache locking scheme, the cache locking decision is made by the master core before allocating the jobs among the free cores. The diagram in Figure 4 illustrates the work-flow of the shared level-2 cache locking strategy. The delay and power consumption are calculated using the same method as in the private first-level cache locking scheme.

Figure 3. Work flow of private first-level cache locking scheme

Figure 4. Work flow of shared last-level cache locking scheme

In both schemes, after a task is completed, the core becomes free. The maximum delay and power consumption for that batch of jobs are obtained, and the total delay and total power consumption are updated. After completing all the jobs, the mean delay per task and total power consumption are calculated. To estimate the average delay per task, the maximum delay of each batch of jobs is considered. To estimate power, the total power consumed by all the cores to complete all the jobs is considered. In addition to the cache locking strategies (which are based on WCET analysis), we also simulate a random cache locking strategy for both level-1 and level-2 cache locking. In the random cache locking strategy (where no WCET analysis is required), memory blocks are selected randomly to preload CL1/CL2 and to lock in the private level-1 caches or the shared level-2 cache. The mean delay per task and total power consumption are calculated the same way.

IV. SIMULATION DETAILS

In this work, we evaluate the private level-1 and shared level-2 way cache locking schemes via simulation. In this section, we briefly discuss the simulation details. We use VisualSim (short for VisualSim Architect) [21] and the Heptane tool to model and simulate a multicore system. Heptane is used for WCET analysis of applications on a single-core system with a one-level cache and to generate workloads. VisualSim is used to simulate multicore systems with multilevel caches and to obtain the simulation results. Assumptions, the simulated architecture, workloads, and important parameters are discussed in the following subsections.

A. Assumptions

We make some assumptions in this research work. Important assumptions include:

- A homogeneous multicore is simulated where all cores are identical.
- Level-1 (i.e., I1) and level-2 (i.e., CL2) cache locking are considered in this work. In I1 cache locking, each core can lock independently and dynamically per its needs. In CL2 locking, the cache locking decision is made by the master core (for a batch of jobs).
- A write-back memory update policy and a random cache replacement strategy are used as required.
- The delay introduced by the bus that connects CL2 and the main memory is 10 times longer than the delay introduced by the bus that connects the CL1 and CL2 caches.

B. Simulation Architecture

As mentioned earlier, we focus on a higher-level abstraction of the target system in this work. According to our modeling and simulation approach, the target system is not required to be present. This is to facilitate early estimation of future complex systems. However, this approach is extremely useful for analyzing/updating existing products as well. This approach is fast, cheap, risk-free, and reasonably accurate. We model and simulate a quad-core system which reflects the popular Intel Xeon processor architecture. Figure 5 illustrates the BAMI implementation inside the cores for level-1 cache locking. The information stored in each BAMI depends on the application being processed by the corresponding core. Along with the BAMI, each core has its private CL1 (split into I1 and D1 for improved performance). The system has one shared CL2. Two cores are connected to the CL2 using the same bus to reduce bus contention. In case of an I1 miss, an updated cache replacement policy is used to select a victim block using the BAMIs. An unlocked block with the minimum number of misses should be selected for replacement. In case of a tie in the number of misses, a block is selected randomly.
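The updated replacement policy just described (evict the unlocked block with the fewest recorded misses, breaking ties randomly) can be sketched as follows; the structure and function names are ours.

```c
#include <stdint.h>
#include <stdlib.h>

#define NUM_WAYS 8

/* Per-way state for one set; bami_misses caches the BAMI miss count of
 * the block currently resident in the way (0 if the block is not in
 * the BAMI). */
typedef struct {
    uint32_t block_addr;
    uint32_t bami_misses;
    int      valid;
    int      locked;
} Way;

/* On an I1 miss: among unlocked ways, evict the block with the minimum
 * BAMI miss count; break ties uniformly at random. */
int select_victim_bami(const Way *set)
{
    int best = -1, ties = 0;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (set[w].locked)
            continue;
        if (!set[w].valid)
            return w;  /* free way: no eviction needed */
        if (best < 0 || set[w].bami_misses < set[best].bami_misses) {
            best = w;
            ties = 1;
        } else if (set[w].bami_misses == set[best].bami_misses) {
            /* reservoir-style random tie break among equal-miss ways */
            if (rand() % ++ties == 0)
                best = w;
        }
    }
    return best;  /* -1 only if every way in the set is locked */
}
```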
It should be noted that the BAMI(s) should be implemented inside the shared CL2 (not inside the cores) for shared CL2 locking (which is not shown in the figure).

Figure 5. Simulated multicore architecture with 4 cores illustrating level-1 cache locking

C. Workloads

In this work, we use a diverse group of important applications to run the simulation programs: the Moving Picture Experts Group's MPEG4 (part-2), Advanced Video Coding widely known as H.264/AVC, Fast Fourier Transform (FFT), Matrix Inversion (MI), and Discrete Fourier Transform (DFT). The complete codes of FFT, MI, and DFT are considered. For H.264/AVC and MPEG4, code segments are carefully selected. The code size and number of instructions of the applications are shown in Table I. For each application, one BAMI is generated using Heptane, and the BAMI is used to run the VisualSim simulation programs.

TABLE I
IMPORTANT CHARACTERISTICS OF THE APPLICATIONS

Application | Code Size (Bytes) | Number of Instructions
FFT         | 2,335             | 365,184
MI          | 1,                | ,518
DFT         | 1,                | ,307
H.264/AVC   | 185,568           | 40,922,256
MPEG4       | 209,937           | 52,071,180

Because of their effectiveness, we use the Heptane package and the VisualSim tool for modelling and simulating the cache locking schemes. Heptane takes C code as the input application and generates a tree-graph. The tree-graph shows which memory blocks caused cache misses and how many (if any). This tree-graph information is used to create the BAMI. The BAMIs are used in the VisualSim programs to calculate the total power consumption (as Heptane cannot calculate power consumption).

D. Important Parameters

Some important input and output parameters are shown in Table II. We vary the total cache size and the locked cache size.

TABLE II
IMPORTANT INPUT AND OUTPUT SIMULATION PARAMETERS

Parameter                    | Value
I1 cache size (KB)           | 2, 4, 8, 16, 32
D1 cache size (KB)           | 2, 4, 8, 16, 32
CL2 cache size (KB)          | 128, 256, 512, 1024, 2048
Total locked cache size (%)  | 0.0, 12.5, 25.0, 37.5, 50.0
CL1/CL2 line size            | 128 Bytes (fixed)
CL1/CL2 associativity level  | 8-way (fixed)
Mean delay per task          | (to be obtained)
Total power consumption      | (to be obtained)

Output parameters include the mean delay per task and the total power consumption. Delay is defined as the time between the start of the execution of a task and its end. For power analysis, a system component is considered to be in one of three states: active (the component consumes an adequate amount of energy to be turned on and active), idle (the component consumes the minimum amount of energy just to be turned on), or sleep (the component is turned off and consumes no energy). For a system with X tasks and Y components, the total power consumption can be expressed as shown below:

P(total) = Σ_{i=1..X} P_i    (1)

P_i = Σ_{j=1..Y} (P_j(active) + P_j(idle))    (2)

where each task i is associated with Y (different) components and P_i is the power consumed by the i-th task. In our experiment, the following components are considered for power consumption: CPU, CL1 (i.e., I1 and D1), buses, CL2, and main memory (MM). Table III shows how power consumption is distributed among the processor components [22]. It is assumed that an idle component takes only ¼ of the power of its active state.

TABLE III
POWER CONSUMPTION BY PROCESSOR COMPONENTS

Component     | Power Consumed (Active)
I1            | 27%
D1            | 16%
CL1 (I1 + D1) | 43%
CPU           | 36%
Bus, others   | 21%
Total         | 100%

The power consumed by CL2 and MM is determined as:

CL2(power) = (CL1 power / CL1 size) * (CL2 size)    (3)

MM(power) = (CL1 power / CL1 size) * (MM size)    (4)
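A minimal sketch of this power model is given below, assuming hypothetical names and arbitrary power units; it follows Eqs. (1)-(4) and the ¼-idle assumption directly.

```c
enum state { ACTIVE, IDLE, SLEEP };

/* One system component (CPU, I1, D1, bus, CL2, MM) with its active
 * power draw. An idle component draws 1/4 of its active power; a
 * sleeping component draws nothing. */
struct component {
    double active_power;   /* arbitrary power units */
    enum state state;
};

/* Eq. (2): P_i = sum over the task's Y components, by state. */
double task_power(const struct component *comp, int ncomp)
{
    double p = 0.0;
    for (int j = 0; j < ncomp; j++) {
        if (comp[j].state == ACTIVE)
            p += comp[j].active_power;
        else if (comp[j].state == IDLE)
            p += comp[j].active_power / 4.0;
        /* SLEEP contributes nothing */
    }
    return p;
}

/* Eq. (1): P(total) = sum of P_i over all X tasks. */
double total_power(const double *per_task_power, int ntasks)
{
    double total = 0.0;
    for (int i = 0; i < ntasks; i++)
        total += per_task_power[i];
    return total;
}

/* Eqs. (3)-(4): CL2 and MM power scaled from CL1 power by size. */
double size_scaled_power(double cl1_power, double cl1_size,
                         double other_size)
{
    return (cl1_power / cl1_size) * other_size;
}
```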
V. RESULTS AND DISCUSSION

In this work, we investigate level-1 and level-2 cache locking methods for real-time multicore systems to study how predictability can be enhanced without compromising the performance/power ratio. Cache locking improves predictability by making the locked blocks local and closer to the cores.

However, aggressive cache locking may decrease performance and increase total power consumption due to the reduction of the effective cache size. We model a computing system with 4 cores and run the simulation programs using the MPEG4, H.264/AVC, FFT, MI, and DFT workloads. We obtain results by varying the locked cache size (from 0% to 50%), the I1 cache size (from 2 KB to 32 KB), and the CL2 cache size (from 128 KB to 2 MB). We present some important simulation results in the following subsections.

A. Impact of Level-1 Cache Locking

As the number of locked blocks increases, on the one hand, the cache blocks that cause most of the misses are locked; on the other hand, the effective cache size decreases. Due to these contradictory phenomena, it is difficult to predict the mean delay per task and total power consumption when cache locking is applied. Figure 6 illustrates that the mean delay per task starts decreasing with the increase in locked blocks (0% to 25% locking) for MPEG4, H.264/AVC, and FFT. However, when the locked cache size goes beyond 25%, the mean delay per task starts increasing with the increase in locked blocks for MPEG4, H.264/AVC, and FFT. This is because beyond 25% locking, the effective cache size decreases so much that the cache misses start increasing. Cache locking has zero or negligible positive impact for MI and DFT.

Figure 6. Mean delay per task versus the amount of I1 locked cache size

The impact of level-1 cache locking on total power consumption is shown in Figure 7. Like the mean delay per task, the total power consumption starts decreasing with the increase in locked blocks (0% to 25% locking) for MPEG4, H.264/AVC, and FFT. However, when the locked cache size goes beyond 25%, the total power consumption starts increasing. Like the mean delay per task, the total power consumption is not positively impacted by level-1 cache locking for MI and DFT.

Figure 7. Total power consumption due to I1 locked cache size

The results for the MPEG4 and H.264/AVC applications are very similar; so in the next subsection, the MPEG4 results will also represent the omitted H.264/AVC results. Also, the results for MI and DFT are omitted in the next subsection as they are not impacted by level-1 cache locking.

B. Impact of CL2 Cache Locking

In a multicore system, it is easier to implement cache locking at the shared level-2 cache than at the private level-1 caches. However, it is difficult to predict the impact of level-2 cache locking on the mean delay per task and total power consumption. Figure 8 illustrates that the mean delay per task starts decreasing with the increase in locked blocks (0% to 25% locking) for MPEG4 and H.264/AVC. However, when the locked cache size goes beyond 25%, the mean delay per task starts increasing with the increase in locked blocks for MPEG4 and H.264/AVC, as the effective cache size decreases so much that the cache misses start increasing. Shared level-2 cache locking has zero or negligible positive impact for FFT, MI, and DFT.

Figure 8. Mean delay per core versus the amount of CL2 locked cache size

The impact of shared level-2 cache locking on total power consumption is shown in Figure 9. The total power consumption starts decreasing with the increase in locked blocks (0% to 25% locking) for MPEG4, H.264/AVC, and FFT. The total power consumption for these applications increases for more than 25% CL2 locking. Level-2 cache locking has no positive impact on total power consumption for MI and DFT.

Figure 9. Total power consumption versus the amount of CL2 locked cache

C. Execution Time Predictability

We first study the impact of level-1 cache locking on execution time predictability on a single-core system with only a level-1 cache. Table IV presents some experimental results obtained by analyzing the Heptane tree-graph for the FFT code. A total of 246 misses is generated from 19 different blocks for a 2 KB I1 cache with a 128-Byte line size. The total number of blocks in I1 is 16 (2048/128). For an 8-way set-associative cache, if 2 blocks from each set are locked, 4 blocks out of 16 (i.e., 25% of I1) are locked. Based on the Heptane WCET analysis, the maximum number of misses from 4 blocks is 128. By locking those 4 blocks (which cause the maximum number of 128 misses), more than 50% of the cache misses are avoided while locking only 25% of the I1 size, enhancing predictability.

Then, we study the impact of shared level-2 cache locking on execution time predictability on a multicore system with two levels of caches using the FFT code. As shown in Table V, for a 128 KB CL2, all the blocks that cause misses can be locked at CL2. Therefore, by preloading those blocks into CL2 and applying level-2 locking, the system can be made totally predictable. This is because the total number of blocks in CL2 (1024) is much higher than the total number of blocks in CL1 (16); therefore, CL2 can hold all the blocks that cause cache misses (19).
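The coverage argument above generalizes: given per-block miss counts from the WCET analysis, the fraction of misses avoided by locking the k most miss-prone blocks can be computed as in the sketch below (our own naming; with the FFT numbers, locking the top 4 of 19 blocks covers 128 of 246 misses, i.e., over 50%).

```c
#include <stdlib.h>

/* Descending comparator for per-block miss counts. */
static int desc(const void *a, const void *b)
{
    return *(const int *)b - *(const int *)a;
}

/* Fraction of all cache misses removed by locking the locked_blocks
 * blocks with the highest miss counts. */
double miss_coverage(int *misses_per_block, int nblocks, int locked_blocks)
{
    int total = 0, covered = 0;

    qsort(misses_per_block, nblocks, sizeof(int), desc);
    for (int i = 0; i < nblocks; i++)
        total += misses_per_block[i];
    for (int i = 0; i < locked_blocks && i < nblocks; i++)
        covered += misses_per_block[i];
    return total ? (double)covered / total : 0.0;
}
```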

D. A Closer Look: Private First-Level or Shared Last-Level Cache Locking?

In this subsection, we discuss the impact of level-1 and level-2 cache locking on performance and power consumption. First, we analyze the impact of level-1 I1 cache locking on the mean delay per task and total power consumption. We collect the mean delay per task and total power consumption without any locking, with random locking, and with I1 locking using the BAMI. As mentioned earlier, blocks are selected randomly in random locking, without any WCET analysis. As a result, for new and large applications (i.e., where the miss information for the block addresses is not known and the code does not fit in the cache, respectively), random cache locking in a multicore may be very effective. For I1 cache locking (using the BAMI and using random blocks), the mean delay per task and total power consumption start decreasing as the I1 cache size increases from 2 KB (see Figure 10). It is observed that the impact of cache locking for a 32 KB I1 is not as significant as the impact of cache locking for a 2 KB I1. This is because cache hits increase as the cache size increases. For small I1 cache sizes, level-1 cache locking clearly outperforms random cache locking in our experiment for both the mean delay per task and total power consumption. Although both strategies help decrease the mean delay per task and total power consumption, the I1 locking strategy is more efficient for the applications used, as blocks are selected wisely using the BAMI. It should be noted that only two (MPEG4 and H.264/AVC) of the five applications are large and do not entirely fit in I1.

Figure 10. Impact of I1 cache size on level-1 cache locking

Similarly, we analyze the impact of shared level-2 cache locking by collecting the mean delay per task and total power consumption without any locking, with random locking, and with CL2 locking using the BAMI. Like level-1 locking, the mean delay per task and total power consumption start decreasing as the CL2 cache size increases from 128 KB. Although both strategies help decrease the mean delay per task and total power consumption, the CL2 locking strategy is more efficient, as blocks are selected using the BAMI. For the applications used, the experimental results show that the shared level-2 cache locking strategy is better than the random cache locking strategy for both the mean delay per task and total power consumption (see Figure 11). Again, it should be noted that three (FFT, MI, and DFT) of the five applications are small and entirely fit in a 4 KB I1 cache.

Figure 11. Impact of CL2 cache size on level-2 cache locking

The performance of private level-1 and shared level-2 cache locking for various locked cache sizes is depicted in Figure 12. Experimental results show that locking at the shared level-2 cache outperforms locking at the level-1 caches for the applications used, specifically at 25% locking. This indicates that small applications (like FFT, MI, and DFT) take more advantage of shared level-2 cache locking than of private level-1 cache locking (which is not the case for large applications like MPEG4 and H.264/AVC).

Figure 12. Impact of locked cache size on I1 and CL2 cache locking

We summarize some of the important characteristics of the private level-1 and shared level-2 cache locking strategies in Table IV. Here, a (+) sign means an advantage and a (-) sign means a disadvantage. For example, locking at the private first-level caches is more complicated compared with locking at the shared last-level cache.

TABLE IV
COMPARISON BETWEEN PRIVATE FIRST-LEVEL AND SHARED LAST-LEVEL CACHE LOCKING

Consideration          | Private First-Level Locking | Shared Last-Level Locking
Complexity             | More (-)                    | Less (+)
Predictability         | Low (-)                     | High (+)
Response Time          | Low (+)                     | High (-)
Large applications     | Very useful (+)             | Not useful (-)
Small applications     | Not useful (-)              | Very useful (+)
Unknown applications   | Not useful (-)              | Not useful (-)
Locking at each core?  | Depends (-)                 | No (+)
Need WCET analysis?    | Yes (-)                     | Yes (-)
Multicore architecture | Promising (+)               | Promising (+)

Therefore, where complexity is concerned, it is an advantage for shared level-2 cache locking and a disadvantage for private level-1 cache locking. However, response time is an advantage for private level-1 cache locking and a disadvantage for shared level-2 cache locking. Finally, we find both level-1 and level-2 cache locking schemes to be effective and promising for multicore systems. Even though the shared level-2 cache locking strategy shows the best performance in this experiment, the results may be different if all or most of the workloads are large.

For large applications, cache locking at the private first-level should be more beneficial than locking at the shared last-level, but it is difficult to implement cache locking at each core individually.

VI. CONCLUSIONS

Multicore architectures help improve the performance-to-power ratio but make execution time more unpredictable due to the caches' dynamic behavior. It has been proven that cache locking can be used to improve timing predictability. However, aggressive and ill-informed use of cache locking may reduce the performance-to-power ratio. In this work, we present private first-level and shared last-level way cache locking schemes for power-aware real-time multicore systems. Both schemes allow changing the locked cache size during runtime to achieve the optimal predictability and performance/power ratio. To evaluate these schemes, we model and simulate a system with four cores and a two-level cache memory subsystem using VisualSim. We generate a diverse set of workloads by post-processing the respective Heptane tree-graphs for the H.264/AVC, MPEG4, FFT, MI, and DFT codes. These workloads are used to run the VisualSim programs. Experimental results indicate that both performance and predictability can be increased, and power consumption decreased, by adding a cache locking mechanism to an efficient cache memory organization. It is noticed that up to a 43% reduction in mean delay and up to a 37% reduction in total power consumption are possible by locking only 25% of the cache size. The results suggest that for small applications like FFT, MI, and DFT, shared level-2 cache locking outperforms private level-1 cache locking, but for large applications like MPEG4 and H.264/AVC, private level-1 cache locking performs better than shared level-2 cache locking. This is probably because small applications take more advantage of shared level-2 cache locking than of private level-1 cache locking (which is not the case for large applications). We plan to investigate the impact of locking at victim cache(s) on the performance, power consumption, and execution time predictability of a multicore real-time system in our next endeavor.

REFERENCES

[1] A. Asaduzzaman, "Cache Optimization for Real-Time Embedded Systems," Ph.D. Dissertation, Florida Atlantic University.
[2] V. Suhendra and T. Mitra, "Exploring Locking & Partitioning for Predictable Shared Caches on Multi-Cores," in DAC'2008, Anaheim, CA.
[3] V. Romanchenko, "Evaluation of the multi-core processor architecture Intel core: Conroe, Kentsfield," Digital-Daily.com.
[4] "Multi-core (computing)," Wikipedia. DOI= wikipedia.org/wiki/xeon;../wiki/athlon.
[5] D.K. Every, "IBM's Cell Processor: The next generation of computing?," Shareware Press.
[6] A. Asaduzzaman and I. Mahgoub, "Cache Modeling and Optimization for Portable Devices Running MPEG-4 Video Decoder," MTAP Journal, MTAP'05.
[7] A. Asaduzzaman and F.N. Sibai, "Improving Cache Locking Performance of Modern Embedded Systems via the Addition of a Miss Table at the L2 Cache Level," JSA Journal.
[8] Y. Liang and T. Mitra, "Instruction cache locking using temporal reuse profile," in DAC'10.
[9] T. Liu, M. Li, and C.J. Xue, "Minimizing WCET for Real-Time Embedded Systems via Static Instruction Cache Locking," in RTAS 2009.
[10] I. Puaut and D. Decotigny, "Low-Complexity Algorithms for Static Cache Locking in Multitasking Hard RT Systems," in IEEE Conference.
[11] I. Puaut, "Cache Analysis Vs Static Cache Locking for Schedulability Analysis in Multitasking Real-Time Systems."
[12] I. Puaut and C. Pais, "Scratchpad memories vs locked caches in hard real-time systems: a quantitative comparison," in Design, Automation & Test in Europe Conference & Exhibition (DATE'07), pp. 1-6.
[13] X. Vera and B. Lisper, "Data Cache Locking for Higher Program Predictability," in SIGMETRICS'03, CA.
[14] E. Tamura, F. Rodriguez, J.V. Busquets-Mataix, and A.M. Campoy, "High Performance Memory Architectures with Dynamic Locking Cache for Real-Time Systems," in Proceedings of the 16th Euromicro Conference on Real-Time Systems, pp. 1-4, Italy.
[15] A. Arnaud and I. Puaut, "Dynamic Instruction Cache Locking in Hard Real-Time Systems."
[16] E. Tamura, J.V. Busquets-Mataix, J.J.S. Martin, and A.M. Campoy, "A Comparison of Three Genetic Algorithms for Locking-Cache Contents Selection in Real-Time Systems," in Proceedings of the Int'l Conference, Coimbra, Portugal.
[17] C. Harrison, "Programming the cache on the PowerPC 750GX/FX: Use cache management instructions to improve performance," IBM Microcontroller Applications Group. DOI= 128.ibm.com/developerworks/library/pa-ppccache.html
[18] J. Stokes, "Xenon's L2 vs. Cell's local storage, and some notes on IBM/Nintendo's Gekko." DOI= arstechnica.com/articles/paedia/cpu/xbox360-1.ars/6
[19] MPC8272 PowerQUICC II Family Reference Manual. DOI= 2RM.pdf
[20] Heptane - a WCET analysis tool. DOI=
[21] VisualSim - a system-level simulator. DOI=
[22] W. Tang, R. Gupta, and A. Nicolau, "Power Savings in Embedded Processors through Decode Filter Cache," in Proceedings of Design, Automation and Test in Europe Conference and Exhibition (DATE'02), pp. 1-6, 2002.


10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems 1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems To enhance system performance and, in some cases, to increase

More information

Single Chip Heterogeneous Multiprocessor Design

Single Chip Heterogeneous Multiprocessor Design Single Chip Heterogeneous Multiprocessor Design JoAnn M. Paul July 7, 2004 Department of Electrical and Computer Engineering Carnegie Mellon University Pittsburgh, PA 15213 The Cell Phone, Circa 2010 Cell

More information

Placement de processus (MPI) sur architecture multi-cœur NUMA

Placement de processus (MPI) sur architecture multi-cœur NUMA Placement de processus (MPI) sur architecture multi-cœur NUMA Emmanuel Jeannot, Guillaume Mercier LaBRI/INRIA Bordeaux Sud-Ouest/ENSEIRB Runtime Team Lyon, journées groupe de calcul, november 2010 Emmanuel.Jeannot@inria.fr

More information

Data Cache Locking for Tight Timing Calculations

Data Cache Locking for Tight Timing Calculations Data Cache Locking for Tight Timing Calculations XAVIER VERA and BJÖRN LISPER Mälardalens Högskola and JINGLING XUE University of New South Wales Caches have become increasingly important with the widening

More information

Cache Performance (H&P 5.3; 5.5; 5.6)

Cache Performance (H&P 5.3; 5.5; 5.6) Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st

More information

An Empirical Model for Predicting Cross-Core Performance Interference on Multicore Processors

An Empirical Model for Predicting Cross-Core Performance Interference on Multicore Processors An Empirical Model for Predicting Cross-Core Performance Interference on Multicore Processors Jiacheng Zhao Institute of Computing Technology, CAS In Conjunction with Prof. Jingling Xue, UNSW, Australia

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun EECS750: Advanced Operating Systems 2/24/2014 Heechul Yun 1 Administrative Project Feedback of your proposal will be sent by Wednesday Midterm report due on Apr. 2 3 pages: include intro, related work,

More information

Multimedia Systems 2011/2012

Multimedia Systems 2011/2012 Multimedia Systems 2011/2012 System Architecture Prof. Dr. Paul Müller University of Kaiserslautern Department of Computer Science Integrated Communication Systems ICSY http://www.icsy.de Sitemap 2 Hardware

More information

Write only as much as necessary. Be brief!

Write only as much as necessary. Be brief! 1 CIS371 Computer Organization and Design Midterm Exam Prof. Martin Thursday, March 15th, 2012 This exam is an individual-work exam. Write your answers on these pages. Additional pages may be attached

More information

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2) The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache

More information

Virtual Memory. Chapter 8

Virtual Memory. Chapter 8 Virtual Memory 1 Chapter 8 Characteristics of Paging and Segmentation Memory references are dynamically translated into physical addresses at run time E.g., process may be swapped in and out of main memory

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Improving Cache Performance

Improving Cache Performance Improving Cache Performance Computer Organization Architectures for Embedded Computing Tuesday 28 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition,

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2015 Lecture 15

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2015 Lecture 15 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2015 Lecture 15 LAST TIME! Discussed concepts of locality and stride Spatial locality: programs tend to access values near values they have already accessed

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Enhancements to Linux I/O Scheduling

Enhancements to Linux I/O Scheduling Enhancements to Linux I/O Scheduling Seetharami R. Seelam, UTEP Rodrigo Romero, UTEP Patricia J. Teller, UTEP William Buros, IBM-Austin 21 July 2005 Linux Symposium 2005 1 Introduction Dynamic Adaptability

More information

MARACAS: A Real-Time Multicore VCPU Scheduling Framework

MARACAS: A Real-Time Multicore VCPU Scheduling Framework : A Real-Time Framework Computer Science Department Boston University Overview 1 2 3 4 5 6 7 Motivation platforms are gaining popularity in embedded and real-time systems concurrent workload support less

More information

Exam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence

Exam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence Exam-2 Scope 1. Memory Hierarchy Design (Cache, Virtual memory) Chapter-2 slides memory-basics.ppt Optimizations of Cache Performance Memory technology and optimizations Virtual memory 2. SIMD, MIMD, Vector,

More information

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4. Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that

More information

Chapter 9 Memory Management

Chapter 9 Memory Management Contents 1. Introduction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threads 6. CPU Scheduling 7. Process Synchronization 8. Deadlocks 9. Memory Management 10. Virtual

More information

COMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy

COMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy COMPUTER ARCHITECTURE Virtualization and Memory Hierarchy 2 Contents Virtual memory. Policies and strategies. Page tables. Virtual machines. Requirements of virtual machines and ISA support. Virtual machines:

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

12 Cache-Organization 1

12 Cache-Organization 1 12 Cache-Organization 1 Caches Memory, 64M, 500 cycles L1 cache 64K, 1 cycles 1-5% misses L2 cache 4M, 10 cycles 10-20% misses L3 cache 16M, 20 cycles Memory, 256MB, 500 cycles 2 Improving Miss Penalty

More information

EITF20: Computer Architecture Part 5.1.1: Virtual Memory

EITF20: Computer Architecture Part 5.1.1: Virtual Memory EITF20: Computer Architecture Part 5.1.1: Virtual Memory Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Virtual memory Case study AMD Opteron Summary 2 Memory hierarchy 3 Cache performance 4 Cache

More information

LECTURE 10: Improving Memory Access: Direct and Spatial caches

LECTURE 10: Improving Memory Access: Direct and Spatial caches EECS 318 CAD Computer Aided Design LECTURE 10: Improving Memory Access: Direct and Spatial caches Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

WEEK 7. Chapter 4. Cache Memory Pearson Education, Inc., Hoboken, NJ. All rights reserved.

WEEK 7. Chapter 4. Cache Memory Pearson Education, Inc., Hoboken, NJ. All rights reserved. WEEK 7 + Chapter 4 Cache Memory Location Internal (e.g. processor registers, cache, main memory) External (e.g. optical disks, magnetic disks, tapes) Capacity Number of words Number of bytes Unit of Transfer

More information

MediaTek CorePilot 2.0. Delivering extreme compute performance with maximum power efficiency

MediaTek CorePilot 2.0. Delivering extreme compute performance with maximum power efficiency MediaTek CorePilot 2.0 Heterogeneous Computing Technology Delivering extreme compute performance with maximum power efficiency In July 2013, MediaTek delivered the industry s first mobile system on a chip

More information

SF-LRU Cache Replacement Algorithm

SF-LRU Cache Replacement Algorithm SF-LRU Cache Replacement Algorithm Jaafar Alghazo, Adil Akaaboune, Nazeih Botros Southern Illinois University at Carbondale Department of Electrical and Computer Engineering Carbondale, IL 6291 alghazo@siu.edu,

More information

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications September 2013 Navigating between ever-higher performance targets and strict limits

More information

4. Hardware Platform: Real-Time Requirements

4. Hardware Platform: Real-Time Requirements 4. Hardware Platform: Real-Time Requirements Contents: 4.1 Evolution of Microprocessor Architecture 4.2 Performance-Increasing Concepts 4.3 Influences on System Architecture 4.4 A Real-Time Hardware Architecture

More information

Design of Parallel Algorithms. Course Introduction

Design of Parallel Algorithms. Course Introduction + Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:

More information

(Advanced) Computer Organization & Architechture. Prof. Dr. Hasan Hüseyin BALIK (4 th Week)

(Advanced) Computer Organization & Architechture. Prof. Dr. Hasan Hüseyin BALIK (4 th Week) + (Advanced) Computer Organization & Architechture Prof. Dr. Hasan Hüseyin BALIK (4 th Week) + Outline 2. The computer system 2.1 A Top-Level View of Computer Function and Interconnection 2.2 Cache Memory

More information

The Impact of Write Back on Cache Performance

The Impact of Write Back on Cache Performance The Impact of Write Back on Cache Performance Daniel Kroening and Silvia M. Mueller Computer Science Department Universitaet des Saarlandes, 66123 Saarbruecken, Germany email: kroening@handshake.de, smueller@cs.uni-sb.de,

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

A Cache Utility Monitor for Multi-core Processor

A Cache Utility Monitor for Multi-core Processor 3rd International Conference on Wireless Communication and Sensor Network (WCSN 2016) A Cache Utility Monitor for Multi-core Juan Fang, Yan-Jin Cheng, Min Cai, Ze-Qing Chang College of Computer Science,

More information

Parallelizing Inline Data Reduction Operations for Primary Storage Systems

Parallelizing Inline Data Reduction Operations for Primary Storage Systems Parallelizing Inline Data Reduction Operations for Primary Storage Systems Jeonghyeon Ma ( ) and Chanik Park Department of Computer Science and Engineering, POSTECH, Pohang, South Korea {doitnow0415,cipark}@postech.ac.kr

More information

EE 4683/5683: COMPUTER ARCHITECTURE

EE 4683/5683: COMPUTER ARCHITECTURE EE 4683/5683: COMPUTER ARCHITECTURE Lecture 6A: Cache Design Avinash Kodi, kodi@ohioedu Agenda 2 Review: Memory Hierarchy Review: Cache Organization Direct-mapped Set- Associative Fully-Associative 1 Major

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT Hardware Assisted Recursive Packet Classification Module for IPv6 etworks Shivvasangari Subramani [shivva1@umbc.edu] Department of Computer Science and Electrical Engineering University of Maryland Baltimore

More information