Efficient Cache Locking at Private First-Level Caches and Shared Last-Level Cache for Modern Multicore Systems

Abu Asaduzzaman 1, Kishore K. Chidella 2, Md Moniruzzaman 3
1,2,3 Electrical Engineering and Computer Science Department, Wichita State University, Wichita, Kansas, USA

Abstract: Most modern computing systems have multicore processors with multilevel caches for high performance. Caches increase total power consumption and worsen execution time unpredictability. Studies show that way (or partial) cache locking may improve timing predictability and the performance-to-power ratio for both single-core and multicore systems. Even though both private first-level and shared last-level cache locking improve timing predictability, it is difficult to justify the performance and power trade-off between these two locking mechanisms. In this work, we evaluate two cache locking schemes for multicore systems: one at the private first-level caches and one at the shared last-level cache. Both schemes are based on the analysis of the applications' worst-case execution time (WCET), and both allow changing the locked cache size during runtime to achieve the optimal performance-to-power ratio for the running applications. Using the Heptane WCET analyser, we generate workloads for H.264/AVC, MPEG4, FFT, MI, and DFT codes. Using the VisualSim tool, we model and simulate a system with four cores and two levels of caches. Experimental results confirm that both cache locking schemes improve timing predictability by decreasing the total number of cache misses. Results also indicate that for small applications like FFT, shared last-level cache locking outperforms private first-level cache locking; but for large applications like MPEG4 and H.264/AVC, private first-level cache locking performs better than shared last-level cache locking.

Keywords: Cache locking, multicore architecture, performance-to-power ratio, timing predictability, private first-level cache, shared last-level cache

I. INTRODUCTION

As multicore processors provide a higher performance-to-power ratio, the popularity and demand of multicore processors are increasing in both desktop and embedded markets [1-3]. In a multicore processor, two or more independent cores are combined on a single die. In most cases, each core has its own private first-level cache (CL1), which is split into an instruction cache (I1) and a data cache (D1); the multicore processor may have one shared last-level cache such as a shared level-2 cache (CL2) or multiple distributed CL2s [4-5]. A common choice for the memory organization of a multicore system is a two-level cache hierarchy (examples include Intel Xeon, IBM Power5, and Sun Niagara) [1, 3-7]. According to this memory hierarchy, level-1 caches are attached to and privately accessible by each core. A larger level-2 cache is shared by the cores (e.g., Intel Xeon). Please note that the level-2 cache can be private to each core (as in AMD Athlon), but that is beyond the scope of this work. The presence of a shared level-2 cache offers flexibility in adjusting the memory allocated per core according to its requirements, as well as the possibility of multiple cores getting fast access to shared code and/or data. Cache parameters significantly influence system performance, especially the worst-case performance of embedded systems [8-9].
New-generation multicore designs have shown that two (or more) cores running at (or below) one half of the frequency can approach the performance of a single core running at full frequency, while the multicore consumes less power. Multicore architectures are more suitable for real-time applications because concurrent execution of tasks on a single processor is inadequate for achieving the required level of performance and reliability, in many respects including thermal constraints (power consumption and heat dissipation). Real-time systems deal with timing constraints and usually interact with the environment rather than a human operator. Because timeliness and reliability are so important in their behaviour, real-time systems are often distributed among multiple program units (a.k.a. tasks) running simultaneously to perform the required functions. However, the increasing use of caches potentially increases execution time unpredictability. Real-time applications cannot afford to miss deadlines and hence demand timing predictability. As multicore systems have multiple levels of caches, supporting real-time applications on multicore architectures poses a significant challenge. For single-core systems, it has been proven that cache locking improves predictability [7-8, 10-16]. Cache locking is the ability to prevent some or all of the instruction or data cache from being overwritten. Cache entries can be locked for either an entire cache or for individual ways within the cache. Entire cache locking is inefficient if the number of instructions or the size of the data to be locked is small compared to the cache size. In way locking, only a portion of the cache is locked by locking ways within the cache.

Unlocked ways of the cache behave normally. Using way locking, the Intel Xeon processor achieves the effect of the local storage used in the IBM Cell architecture. Most processors (unlike the PowerPC 750GX) allow way locking at the level-1 or level-2 cache [17-18]. So, way locking may be an important alternative to entire locking. Effectively using the multilevel caches of multicore systems is a great challenge. To the best of our knowledge, current multicore processors are not able to take full advantage of cache locking, because no such efficient cache locking technique exists. Most existing cache locking mechanisms are not suitable for multicore systems. In this work, we explore two promising cache locking mechanisms for multicore systems: one at the private level-1 caches and another at the shared level-2 cache. Our work focuses on a higher level of abstraction. The target systems (that should be evaluated) may not even exist. Instead of pinpoint accuracy, our goal is to provide fast, reasonable accuracy at the early stage of the system design flow. The beauty of our approach is that this easy and fast solution can be used to simulate both existing and non-existing systems.

This paper is organized as follows. In Section II, some related articles are discussed. The cache locking schemes are introduced in Section III. The schemes are evaluated by simulating a multicore system; simulation details are presented in Section IV. In Section V, some important simulation results are discussed. Finally, this work is concluded in Section VI.

II. LITERATURE SURVEY

Cache locking has been used to improve timing predictability in single-core systems for years. However, cache locking in modern multicore systems is difficult due to the fact that a multicore architecture has multiple levels of caches. Therefore, we first discuss the cache memory subsystem of popular single-core and multicore systems; then we discuss some selected published articles closely related to level-1 and level-2 cache locking.

A traditional inclusive cache memory subsystem consists of an on-chip CL1 (split into instruction cache I1 and data cache D1) and an off-chip CL2 (unified). The schematic diagram in Figure 1 illustrates a single-core system where CL1 is on-chip and CL2 is off-chip [1]. In this system, both CL1 and CL2 are private.

Figure 1. Cache memory subsystem of a single-core system

The architecture of a dual-core system, where each core has its own private CL1 and CL2 is shared by both cores, is illustrated in Figure 2 [1]. Level-1 cache locking in a single-core system is easier than in a multicore system. Level-1 cache locking in multicore systems is difficult as each core keeps its dataset local and modifies its local dataset as required.

Figure 2. Cache memory subsystem of a dual-core system

Methods presented in [10-11] suggest that cache contents should be statically locked to make memory access time and cache-related pre-emption delay predictable. However, these approaches need to be tested using larger real benchmarks. An algorithm for off-line selection of the contents of two on-chip memories (locked caches and scratchpad memories) is proposed in [12]. Experimental results show that the algorithm generates good ratios of on-chip memory accesses on the worst-case execution path. However, worst-case performance with locked caches may be degraded when large cache lines are used, due to cache pollution.
In [13], static cache analysis is combined with data cache locking to estimate the worst-case memory performance in a safe, tight, and fast way. Experimental results show that this scheme is more predictable than a system without a cache. In [14], a memory hierarchy is proposed to provide high performance combined with high predictability for complex systems.

In [15], an algorithm is introduced which partitions a task into a set of regions; each region statically owns locked cache contents determined offline. A sharp improvement is observed compared with a system without any cache. In [16], various algorithms for selecting a set of instructions to be locked in the cache are compared. The algorithms mentioned in [13-16] show performance improvement and can be used to assess a tight upper bound on the response time of tasks. However, these techniques were developed for single-core systems. They are not useful for estimating power consumption, a crucial design factor for embedded/mobile systems. Therefore, these techniques are not adequate for analysing the performance, power consumption, and predictability of multicore (real-time) systems.

A miss table based static cache locking scheme is introduced at the level-2 cache in [7] to improve execution time predictability and the overall system performance/power ratio. The miss table holds the information of the block addresses (related to the applications being processed) that cause the most cache misses if not locked. Experimental results show that in addition to improving predictability, a reduction in mean delay per task and a reduction in total power consumption are achieved for MPEG4. However, this level-2 cache locking technique is not applicable to level-1 cache locking. Also, this static technique does not allow changing the total locked cache size during runtime.

III. PROPOSED CACHE LOCKING STRATEGIES FOR MULTICORE SYSTEMS

In this section, we present private level-1 and shared level-2 cache locking schemes suitable for multicore systems. For both schemes, cache miss information for the target applications is pre-processed and used by the cache locking algorithm to determine the right memory blocks to lock.

A. Cache Locking

Cache locking is a mechanism that prevents some or all of the instructions or data from being replaced in the cache. Cache entries can be locked for either an entire cache or for individual ways within the cache [19]. In entire cache locking, cache hits are treated in the same manner as hits to an unlocked cache; cache misses are treated as cache-inhibited accesses. Invalid cache entries at the time of the locking remain invalid and inaccessible until the cache is unlocked. Entire cache locking is inefficient if the number of instructions or the size of the data to be locked is small compared to the cache size. In way locking, only a portion of the cache is locked by locking ways within the cache. Invalid entries in way locking are accessible and available for data placement; this behaviour differs from entire cache locking. Unlocked ways of the cache behave normally. Understanding the impact of cache locking on the performance, power consumption, and predictability of multicore systems requires analysing them separately and observing their interaction with the entire system architecture using the target applications. In a multicore system, the cache can be locked at level-1 or at level-2. Although cache locking at level-1 in a multicore is difficult (each level-1 cache is private to a specific core), level-1 cache locking may be beneficial for some applications, as locked blocks are very close to the core. Cache locking at level-2 in a multicore is easy, as the level-2 cache is shared by all cores, and may be beneficial for some smaller applications.
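To make way locking concrete, a minimal sketch follows. It models one set of an 8-way set-associative cache in which each way carries a lock bit; replacement considers only the unlocked ways, and invalid unlocked ways remain available for placement. This is a simulation-level illustration under our own naming, not the locking interface of any particular processor.

```c
#include <stdbool.h>
#include <stdint.h>
#include <stdlib.h>

#define NUM_WAYS 8   /* 8-way set associativity, as in our simulations */

typedef struct {
    uint32_t tag;
    bool     valid;
    bool     locked;  /* way lock bit: a locked way is never evicted */
} CacheWay;

typedef struct {
    CacheWay way[NUM_WAYS];
} CacheSet;

/* Pick a victim way in a set, considering only unlocked ways.
 * Returns -1 if every way in the set is locked; the access is then
 * treated as cache-inhibited, as in entire cache locking. */
int select_victim(const CacheSet *set)
{
    int candidates[NUM_WAYS], n = 0;

    for (int w = 0; w < NUM_WAYS; w++) {
        if (!set->way[w].locked) {
            if (!set->way[w].valid)
                return w;          /* invalid unlocked way: place here first */
            candidates[n++] = w;
        }
    }
    if (n == 0)
        return -1;                 /* all ways in this set are locked */
    return candidates[rand() % n]; /* random replacement among unlocked ways */
}
```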
B. Block Address and Miss Information (BAMI)

For way cache locking, information about the blocks that cause cache misses is collected in a table called a BAMI (short for block address and miss information). In a BAMI, memory block addresses are sorted in descending order of their total number of misses. Generating a BAMI is tricky and involves manual activities. The major steps to create a BAMI are:

- Select the target applications (written in C). If an application is very large, select code segments that contain big loops, because the cache locking mechanism is expected to perform better on computation-intensive big loops.
- Generate the Heptane (Hades Embedded Processor Timing ANalyzEr) tree-graph using the application code (in C) [20].
- Manually collect instruction block (IB) miss information from the tree-graph.
- Create and sort the IB-address miss block list; the memory block with the maximum number of misses should be first.
- Select memory blocks for the BAMI based on the cache size, line size, and locked cache amount (see the sketch after this subsection).

For each application, after post-processing the Heptane tree-graph, one BAMI is generated. BAMIs can be used for cache locking, cache replacement, selective preloading, and so on. A BAMI is essentially a look-up table; information about a small number of selected blocks should be stored in it. For cache locking at CL1 in multicore systems, BAMIs can be implemented using registers or a small piece of cache in each core.
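The sort-and-select step can be sketched as follows, assuming the per-block miss counts have already been extracted from the Heptane tree-graph; MissRecord and build_bami are hypothetical names used for illustration only.

```c
#include <stdint.h>
#include <stdlib.h>

/* One record extracted (manually) from the Heptane tree-graph. */
typedef struct {
    uint32_t block_addr;  /* memory block address */
    uint32_t misses;      /* number of misses attributed to this block */
} MissRecord;

/* Sort in descending order of miss count. */
static int by_misses_desc(const void *a, const void *b)
{
    const MissRecord *x = a, *y = b;
    return (y->misses > x->misses) - (y->misses < x->misses);
}

/* Build a BAMI: keep the top entries that fit in the locked portion of
 * the cache. locked_bytes / line_size gives the number of lockable
 * blocks. Returns the number of BAMI entries produced. */
size_t build_bami(MissRecord *recs, size_t nrecs,
                  size_t locked_bytes, size_t line_size,
                  MissRecord *bami, size_t bami_cap)
{
    qsort(recs, nrecs, sizeof recs[0], by_misses_desc);

    size_t lockable = locked_bytes / line_size; /* blocks we can lock */
    size_t n = lockable < nrecs ? lockable : nrecs;
    if (n > bami_cap)
        n = bami_cap;
    for (size_t i = 0; i < n; i++)
        bami[i] = recs[i];   /* top miss-causing blocks, in order */
    return n;
}
```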

For cache locking at the shared CL2, by contrast, BAMIs can be implemented using a small portion of CL2. In a BAMI, the top-most entry should hold information about the memory block that has the maximum number of misses, and so on. Say the size of a BAMI cache is 128 Bytes and each BAMI entry is 64 bits (of which 1 bit is the lock-bit L, 32 bits are for the block address, and the remaining 31 bits are for the number of misses). Each BAMI may then hold 16 (= 128 * 8 / 64) entries. The lock-bit (L) can be set/cleared dynamically to indicate whether the associated block is locked.
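A minimal sketch of this 64-bit entry layout using C bit-fields is shown below; bit-field packing is implementation-defined, so a hardware design would fix the layout explicitly, and the names are ours.

```c
#include <stdint.h>

/* One 64-bit BAMI entry as described above: 1 lock-bit L, a 32-bit
 * block address, and a 31-bit miss count (1 + 32 + 31 = 64 bits). */
typedef struct {
    uint64_t lock   : 1;   /* L: set/cleared dynamically on (un)lock */
    uint64_t addr   : 32;  /* memory block address */
    uint64_t misses : 31;  /* miss count from the WCET analysis */
} BamiEntry;

/* Guard against a compiler that does not pack the fields into 64 bits. */
_Static_assert(sizeof(BamiEntry) == 8, "BAMI entry must pack into 64 bits");

/* A 128-Byte BAMI holds 128 * 8 / 64 = 16 such entries. */
#define BAMI_BYTES   128
#define BAMI_ENTRIES (BAMI_BYTES * 8 / 64)   /* = 16 */
```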

C. Work-Flow of the Cache Locking Schemes

In this subsection, we explain the basic work-flow of the private level-1 and shared level-2 cache locking techniques. According to these schemes, the blocks that are anticipated to cause more misses should be locked. Both schemes are based on WCET analysis of the target applications. A number of jobs are selected depending on the total number of available (free) cores. A job can be a complete application or a part of an application. Each job should have a number of tasks. A task is a segment of code (one or more instructions, such as threads). After the jobs are selected, the CL1s and CL2 are preloaded with selected blocks using the BAMI(s).

There are significant differences between the two cache locking schemes. In the level-1 locking scheme, the cache locking decision is made at the core level (by each core) after the jobs are assigned among the free cores. Every time a core gets a job to process, the cache locking decision and the locked I1 cache size are determined depending on the assigned job. In this scheme, the locked cache size may be changed during runtime to achieve the optimal predictability and performance/power ratio. The diagram in Figure 3 illustrates the work-flow of the level-1 cache locking strategy. In this scheme, how much cache should be locked is determined dynamically after a core decides to apply level-1 cache locking on a job. In the shared level-2 cache locking scheme, the cache locking decision is made by the master core before allocating the jobs among the free cores. The diagram in Figure 4 illustrates the work-flow of the shared level-2 cache locking strategy. The delay and power consumption are calculated using the same method as in the private first-level cache locking scheme.

Figure 3. Work flow of private first-level cache locking scheme

Figure 4. Work flow of shared last-level cache locking scheme

In both schemes, after a task is completed, the core becomes free. The maximum delay and power consumption for that batch of jobs are obtained, and the total delay and total power consumption are updated. After completing all the jobs, the mean delay per task and total power consumption are calculated. To estimate the average delay per task, the maximum delay of each batch of jobs is considered. To estimate power, the total power consumed by all the cores to complete all the jobs is considered. In addition to the cache locking strategies (which are based on WCET analysis), we also simulate a random cache locking strategy for both level-1 and level-2 cache locking. In the random cache locking strategy (where no WCET analysis is required), memory blocks are selected randomly to preload CL1/CL2 and to lock in the private level-1 caches or the shared level-2 cache. The mean delay per task and total power consumption are calculated the same way.

IV. SIMULATION DETAILS

In this work, we evaluate the private level-1 and shared level-2 way cache locking schemes via simulation. In this section, we briefly discuss the simulation details. We use VisualSim (short for VisualSim Architect) [21] and the Heptane tool to model and simulate a multicore system. Heptane is used for WCET analysis of applications on a single-core system with a one-level cache and to generate workloads. VisualSim is used to simulate multicore systems with multilevel caches and to obtain the simulation results. Assumptions, the simulated architecture, workloads, and important parameters are discussed in the following subsections.

A. Assumptions

We make some assumptions in this research work. Important assumptions include:

- A homogeneous multicore is simulated where all cores are identical.
- Level-1 (i.e., I1) and level-2 (i.e., CL2) cache locking are considered in this work. In I1 cache locking, each core can lock independently and dynamically per its needs. In CL2 locking, the cache locking decision is made by the master core (for a batch of jobs).
- A write-back memory update policy and a random cache replacement strategy are used as required.
- The delay introduced by the bus that connects CL2 and the main memory is 10 times longer than the delay introduced by the bus that connects the CL1 and CL2 caches.

B. Simulation Architecture

As mentioned earlier, we focus on a higher-level abstraction of the target system in this work. According to our modeling and simulation approach, the target system is not required to be present. This is to facilitate early estimation of future complex systems. However, this approach is extremely useful for analyzing/updating existing products as well. This approach is fast, cheap, risk-free, and reasonably accurate. We model and simulate a quad-core system which reflects the popular Intel Xeon processor architecture. Figure 5 illustrates the BAMI implementation inside the cores for level-1 cache locking. The information stored in each BAMI depends on the application being processed by the corresponding core. Along with the BAMI, each core has its private CL1 (split into I1 and D1 for improved performance). The system has one shared CL2. Two cores are connected to the CL2 using the same bus to reduce bus contention. In case of an I1 miss, an updated cache replacement policy is used to select a victim block using the BAMIs. An unlocked block with the minimum number of misses should be selected for replacement. In case of a tie in the number of misses, a block is selected randomly.
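The updated replacement policy just described (evict the unlocked block with the fewest recorded misses, breaking ties randomly) can be sketched as follows; the structure and function names are ours.

```c
#include <stdint.h>
#include <stdlib.h>

#define NUM_WAYS 8

/* Per-way state for one set; bami_misses caches the BAMI miss count of
 * the block currently resident in the way (0 if the block is not in
 * the BAMI). */
typedef struct {
    uint32_t block_addr;
    uint32_t bami_misses;
    int      valid;
    int      locked;
} Way;

/* On an I1 miss: among unlocked ways, evict the block with the minimum
 * BAMI miss count; break ties uniformly at random. */
int select_victim_bami(const Way *set)
{
    int best = -1, ties = 0;
    for (int w = 0; w < NUM_WAYS; w++) {
        if (set[w].locked)
            continue;
        if (!set[w].valid)
            return w;  /* free way: no eviction needed */
        if (best < 0 || set[w].bami_misses < set[best].bami_misses) {
            best = w;
            ties = 1;
        } else if (set[w].bami_misses == set[best].bami_misses) {
            /* reservoir-style random tie break among equal-miss ways */
            if (rand() % ++ties == 0)
                best = w;
        }
    }
    return best;  /* -1 only if every way in the set is locked */
}
```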
It should be noted that the BAMI(s) should be implemented inside the shared CL2 (not inside the cores) for shared CL2 locking (which is not shown in the figure).

Figure 5. Simulated multicore architecture with 4 cores illustrating level-1 cache locking

C. Workloads

In this work, we use a diverse group of important applications to run the simulation programs: the Moving Picture Experts Group's MPEG4 (part-2), Advanced Video Coding widely known as H.264/AVC, Fast Fourier Transform (FFT), Matrix Inversion (MI), and Discrete Fourier Transform (DFT). The complete codes of FFT, MI, and DFT are considered. For H.264/AVC and MPEG4, code segments are carefully selected. The code size and number of instructions of the applications are shown in Table I. For each application, one BAMI is generated using Heptane, and the BAMI is used to run the VisualSim simulation programs.

TABLE I
IMPORTANT CHARACTERISTICS OF THE APPLICATIONS

Application | Code Size (Bytes) | Number of Instructions
FFT         | 2,335             | 365,184
MI          | 1,                | ,518
DFT         | 1,                | ,307
H.264/AVC   | 185,568           | 40,922,256
MPEG4       | 209,937           | 52,071,180

Because of their effectiveness, we use the Heptane package and the VisualSim tool for modelling and simulating the cache locking schemes. Heptane takes C code as the input application and generates a tree-graph. The tree-graph shows which memory blocks caused cache misses and how many (if any). This tree-graph information is used to create the BAMI. The BAMIs are used in the VisualSim programs to calculate the total power consumption (as Heptane cannot calculate power consumption).

D. Important Parameters

Some important input and output parameters are shown in Table II. We vary the total cache size and the locked cache size.

TABLE II
IMPORTANT INPUT AND OUTPUT SIMULATION PARAMETERS

Parameter                    | Value
I1 cache size (KB)           | 2, 4, 8, 16, 32
D1 cache size (KB)           | 2, 4, 8, 16, 32
CL2 cache size (KB)          | 128, 256, 512, 1024, 2048
Total locked cache size (%)  | 0.0, 12.5, 25.0, 37.5, 50.0
CL1/CL2 line size            | 128 Bytes (fixed)
CL1/CL2 associativity level  | 8-way (fixed)
Mean delay per task          | (to be obtained)
Total power consumption      | (to be obtained)

Output parameters include the mean delay per task and the total power consumption. Delay is defined as the time between the start of the execution of a task and its end. For power analysis, a system component is considered to be in one of three states: active (the component consumes an adequate amount of energy to be turned on and active), idle (the component consumes the minimum amount of energy just to be turned on), or sleep (the component is turned off and consumes no energy). For a system with X tasks and Y components, the total power consumption can be expressed as shown below:

P(total) = Σ_{i=1..X} P_i    (1)

P_i = Σ_{j=1..Y} (P_j(active) + P_j(idle))    (2)

where each task i is associated with Y (different) components and P_i is the power consumed by the i-th task. In our experiment, the following components are considered for power consumption: CPU, CL1 (i.e., I1 and D1), buses, CL2, and main memory (MM). Table III shows how power consumption is distributed among the processor components [22]. It is assumed that an idle component takes only ¼ of the power of its active state.

TABLE III
POWER CONSUMPTION BY PROCESSOR COMPONENTS

Component     | Power Consumed (Active)
I1            | 27%
D1            | 16%
CL1 (I1 + D1) | 43%
CPU           | 36%
Bus, others   | 21%
Total         | 100%

The power consumed by CL2 and MM is determined as:

CL2(power) = (CL1 power / CL1 size) * (CL2 size)    (3)

MM(power) = (CL1 power / CL1 size) * (MM size)    (4)
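A minimal sketch of this power model is given below, assuming hypothetical names and arbitrary power units; it follows Eqs. (1)-(4) and the ¼-idle assumption directly.

```c
enum state { ACTIVE, IDLE, SLEEP };

/* One system component (CPU, I1, D1, bus, CL2, MM) with its active
 * power draw. An idle component draws 1/4 of its active power; a
 * sleeping component draws nothing. */
struct component {
    double active_power;   /* arbitrary power units */
    enum state state;
};

/* Eq. (2): P_i = sum over the task's Y components, by state. */
double task_power(const struct component *comp, int ncomp)
{
    double p = 0.0;
    for (int j = 0; j < ncomp; j++) {
        if (comp[j].state == ACTIVE)
            p += comp[j].active_power;
        else if (comp[j].state == IDLE)
            p += comp[j].active_power / 4.0;
        /* SLEEP contributes nothing */
    }
    return p;
}

/* Eq. (1): P(total) = sum of P_i over all X tasks. */
double total_power(const double *per_task_power, int ntasks)
{
    double total = 0.0;
    for (int i = 0; i < ntasks; i++)
        total += per_task_power[i];
    return total;
}

/* Eqs. (3)-(4): CL2 and MM power scaled from CL1 power by size. */
double size_scaled_power(double cl1_power, double cl1_size,
                         double other_size)
{
    return (cl1_power / cl1_size) * other_size;
}
```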
V. RESULTS AND DISCUSSION

In this work, we investigate level-1 and level-2 cache locking methods for real-time multicore systems to study how predictability can be enhanced without compromising the performance/power ratio. Cache locking improves predictability by making the locked blocks local and closer to the cores.

However, aggressive cache locking may decrease performance and increase total power consumption due to the reduction of the effective cache size. We model a computing system with 4 cores and run the simulation programs using the MPEG4, H.264/AVC, FFT, MI, and DFT workloads. We obtain results by varying the locked cache size (from 0% to 50%), the I1 cache size (from 2 KB to 32 KB), and the CL2 cache size (from 128 KB to 2 MB). We present some important simulation results in the following subsections.

A. Impact of Level-1 Cache Locking

As the number of locked blocks increases, on the one hand, the cache blocks that cause most of the misses are locked; on the other hand, the effective cache size decreases. Due to these contradictory phenomena, it is difficult to predict the mean delay per task and total power consumption when cache locking is applied. Figure 6 illustrates that the mean delay per task starts decreasing with the increase in locked blocks (0% to 25% locking) for MPEG4, H.264/AVC, and FFT. However, when the locked cache size goes beyond 25%, the mean delay per task starts increasing with the increase in locked blocks for MPEG4, H.264/AVC, and FFT. This is because beyond 25% locking, the effective cache size decreases so much that the cache misses start increasing. Cache locking has zero or negligible positive impact for MI and DFT.

Figure 6. Mean delay per task versus the amount of I1 locked cache size

The impact of level-1 cache locking on total power consumption is shown in Figure 7. Like the mean delay per task, the total power consumption starts decreasing with the increase in locked blocks (0% to 25% locking) for MPEG4, H.264/AVC, and FFT. However, when the locked cache size goes beyond 25%, the total power consumption starts increasing. Like the mean delay per task, the total power consumption is not positively impacted by level-1 cache locking for MI and DFT.

Figure 7. Total power consumption due to I1 locked cache size

The results for the MPEG4 and H.264/AVC applications are very similar; so in the next subsection, the MPEG4 results will also represent the omitted H.264/AVC results. Also, the results for MI and DFT are omitted in the next subsection as they are not impacted by level-1 cache locking.

B. Impact of CL2 Cache Locking

In a multicore system, it is easier to implement cache locking at the shared level-2 cache than at the private level-1 caches. However, it is difficult to predict the impact of level-2 cache locking on the mean delay per task and total power consumption. Figure 8 illustrates that the mean delay per task starts decreasing with the increase in locked blocks (0% to 25% locking) for MPEG4 and H.264/AVC. However, when the locked cache size goes beyond 25%, the mean delay per task starts increasing with the increase in locked blocks for MPEG4 and H.264/AVC, as the effective cache size decreases so much that the cache misses start increasing. Shared level-2 cache locking has zero or negligible positive impact for FFT, MI, and DFT.

Figure 8. Mean delay per core versus the amount of CL2 locked cache size

The impact of shared level-2 cache locking on total power consumption is shown in Figure 9. The total power consumption starts decreasing with the increase in locked blocks (0% to 25% locking) for MPEG4, H.264/AVC, and FFT. The total power consumption for these applications increases for more than 25% CL2 locking. Level-2 cache locking has no positive impact on total power consumption for MI and DFT.

Figure 9. Total power consumption versus the amount of CL2 locked cache

C. Execution Time Predictability

We first study the impact of level-1 cache locking on execution time predictability on a single-core system with only a level-1 cache. Table IV presents some experimental results obtained by analyzing the Heptane tree-graph for the FFT code. A total of 246 misses is generated from 19 different blocks for a 2 KB I1 cache with a 128-Byte line size. The total number of blocks in I1 is 16 (2048/128). For an 8-way set-associative cache, if 2 blocks from each set are locked, 4 blocks out of 16 (i.e., 25% of I1) are locked. Based on the Heptane WCET analysis, the maximum number of misses from 4 blocks is 128. By locking those 4 blocks (which cause the maximum number of 128 misses), more than 50% of the cache misses are avoided while locking only 25% of the I1 size, enhancing predictability.

Then, we study the impact of shared level-2 cache locking on execution time predictability on a multicore system with two levels of caches using the FFT code. As shown in Table V, for a 128 KB CL2, all the blocks that cause misses can be locked at CL2. Therefore, by preloading those blocks into CL2 and applying level-2 locking, the system can be made totally predictable. This is because the total number of blocks in CL2 (1024) is much higher than the total number of blocks in CL1 (16); therefore, CL2 can hold all the blocks that cause cache misses (19).
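The coverage argument above generalizes: given per-block miss counts from the WCET analysis, the fraction of misses avoided by locking the k most miss-prone blocks can be computed as in the sketch below (our own naming; with the FFT numbers, locking the top 4 of 19 blocks covers 128 of 246 misses, i.e., over 50%).

```c
#include <stdlib.h>

/* Descending comparator for per-block miss counts. */
static int desc(const void *a, const void *b)
{
    return *(const int *)b - *(const int *)a;
}

/* Fraction of all cache misses removed by locking the locked_blocks
 * blocks with the highest miss counts. */
double miss_coverage(int *misses_per_block, int nblocks, int locked_blocks)
{
    int total = 0, covered = 0;

    qsort(misses_per_block, nblocks, sizeof(int), desc);
    for (int i = 0; i < nblocks; i++)
        total += misses_per_block[i];
    for (int i = 0; i < locked_blocks && i < nblocks; i++)
        covered += misses_per_block[i];
    return total ? (double)covered / total : 0.0;
}
```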

D. A Closer Look: Private First-Level or Shared Last-Level Cache Locking?

In this subsection, we discuss the impact of level-1 and level-2 cache locking on performance and power consumption. First, we analyze the impact of level-1 I1 cache locking on the mean delay per task and total power consumption. We collect the mean delay per task and total power consumption without any locking, with random locking, and with I1 locking using the BAMI. As mentioned earlier, blocks are selected randomly in random locking, without any WCET analysis. As a result, for new and large applications (i.e., where the miss information for the block addresses is not known and the code does not fit in the cache, respectively), random cache locking in a multicore may be very effective. For I1 cache locking (using the BAMI and using random blocks), the mean delay per task and total power consumption start decreasing as the I1 cache size increases from 2 KB (see Figure 10). It is observed that the impact of cache locking for a 32 KB I1 is not as significant as the impact of cache locking for a 2 KB I1. This is because cache hits increase as the cache size increases. For small I1 cache sizes, level-1 cache locking clearly outperforms random cache locking in our experiment for both the mean delay per task and total power consumption. Although both strategies help decrease the mean delay per task and total power consumption, the I1 locking strategy is more efficient for the applications used, as blocks are selected wisely using the BAMI. It should be noted that only two (MPEG4 and H.264/AVC) of the five applications are large and do not entirely fit in I1.

Figure 10. Impact of I1 cache size on level-1 cache locking

Similarly, we analyze the impact of shared level-2 cache locking by collecting the mean delay per task and total power consumption without any locking, with random locking, and with CL2 locking using the BAMI. Like level-1 locking, the mean delay per task and total power consumption start decreasing as the CL2 cache size increases from 128 KB. Although both strategies help decrease the mean delay per task and total power consumption, the CL2 locking strategy is more efficient, as blocks are selected using the BAMI. For the applications used, the experimental results show that the shared level-2 cache locking strategy is better than the random cache locking strategy for both the mean delay per task and total power consumption (see Figure 11). Again, it should be noted that three (FFT, MI, and DFT) of the five applications are small and entirely fit in a 4 KB I1 cache.

Figure 11. Impact of CL2 cache size on level-2 cache locking

The performance of private level-1 and shared level-2 cache locking for various locked cache sizes is depicted in Figure 12. Experimental results show that locking at the shared level-2 cache outperforms locking at the level-1 caches for the applications used, specifically at 25% locking. This indicates that small applications (like FFT, MI, and DFT) take more advantage of shared level-2 cache locking than of private level-1 cache locking (which is not the case for large applications like MPEG4 and H.264/AVC).

Figure 12. Impact of locked cache size on I1 and CL2 cache locking

We summarize some of the important characteristics of the private level-1 and shared level-2 cache locking strategies in Table IV. Here, a (+) sign means an advantage and a (-) sign means a disadvantage. For example, locking at the private first-level caches is more complicated compared with locking at the shared last-level cache.

TABLE IV
COMPARISON BETWEEN PRIVATE FIRST-LEVEL AND SHARED LAST-LEVEL CACHE LOCKING

Consideration          | Private First-Level Locking | Shared Last-Level Locking
Complexity             | More (-)                    | Less (+)
Predictability         | Low (-)                     | High (+)
Response Time          | Low (+)                     | High (-)
Large applications     | Very useful (+)             | Not useful (-)
Small applications     | Not useful (-)              | Very useful (+)
Unknown applications   | Not useful (-)              | Not useful (-)
Locking at each core?  | Depends (-)                 | No (+)
Need WCET analysis?    | Yes (-)                     | Yes (-)
Multicore architecture | Promising (+)               | Promising (+)

Therefore, where complexity is concerned, it is an advantage for shared level-2 cache locking and a disadvantage for private level-1 cache locking. However, response time is an advantage for private level-1 cache locking and a disadvantage for shared level-2 cache locking. Finally, we find both level-1 and level-2 cache locking schemes to be effective and promising for multicore systems. Even though the shared level-2 cache locking strategy shows the best performance in this experiment, the results may be different if all or most of the workloads are large.

For large applications, cache locking at the private first-level should be more beneficial than locking at the shared last-level, but it is difficult to implement cache locking at each core individually.

VI. CONCLUSIONS

Multicore architectures help improve the performance-to-power ratio but make execution time more unpredictable due to the caches' dynamic behavior. It has been proven that cache locking can be used to improve timing predictability. However, aggressive and ill-informed use of cache locking may reduce the performance-to-power ratio. In this work, we present private first-level and shared last-level way cache locking schemes for power-aware real-time multicore systems. Both schemes allow changing the locked cache size during runtime to achieve the optimal predictability and performance/power ratio. To evaluate these schemes, we model and simulate a system with four cores and a two-level cache memory subsystem using VisualSim. We generate a diverse set of workloads by post-processing the respective Heptane tree-graphs for the H.264/AVC, MPEG4, FFT, MI, and DFT codes. These workloads are used to run the VisualSim programs. Experimental results indicate that both performance and predictability can be increased, and power consumption decreased, by adding a cache locking mechanism to an efficient cache memory organization. It is noticed that up to a 43% reduction in mean delay and up to a 37% reduction in total power consumption are possible by locking only 25% of the cache size. The results suggest that for small applications like FFT, MI, and DFT, shared level-2 cache locking outperforms private level-1 cache locking, but for large applications like MPEG4 and H.264/AVC, private level-1 cache locking performs better than shared level-2 cache locking. This is probably because small applications take more advantage of shared level-2 cache locking than of private level-1 cache locking (which is not the case for large applications). We plan to investigate the impact of locking at victim cache(s) on the performance, power consumption, and execution time predictability of a multicore real-time system in our next endeavor.

REFERENCES

[1] A. Asaduzzaman, "Cache Optimization for Real-Time Embedded Systems," Ph.D. Dissertation, Florida Atlantic University.
[2] V. Suhendra and T. Mitra, "Exploring Locking & Partitioning for Predictable Shared Caches on Multi-Cores," in DAC'2008, Anaheim, CA.
[3] V. Romanchenko, "Evaluation of the multi-core processor architecture Intel core: Conroe, Kentsfield," Digital-Daily.com.
[4] "Multi-core (computing)," Wikipedia. DOI= wikipedia.org/wiki/xeon;../wiki/athlon.
[5] D.K. Every, "IBM's Cell Processor: The next generation of computing?," Shareware Press.
[6] A. Asaduzzaman and I. Mahgoub, "Cache Modeling and Optimization for Portable Devices Running MPEG-4 Video Decoder," MTAP Journal, MTAP'05.
[7] A. Asaduzzaman and F.N. Sibai, "Improving Cache Locking Performance of Modern Embedded Systems via the Addition of a Miss Table at the L2 Cache Level," JSA Journal.
[8] Y. Liang and T. Mitra, "Instruction cache locking using temporal reuse profile," in DAC'10.
[9] T. Liu, M. Li, and C.J. Xue, "Minimizing WCET for Real-Time Embedded Systems via Static Instruction Cache Locking," in RTAS 2009.
[10] I. Puaut and D. Decotigny, "Low-Complexity Algorithms for Static Cache Locking in Multitasking Hard RT Systems," in IEEE Conference.
[11] I. Puaut, "Cache Analysis Vs Static Cache Locking for Schedulability Analysis in Multitasking Real-Time Systems."
[12] I. Puaut and C. Pais, "Scratchpad memories vs locked caches in hard real-time systems: a quantitative comparison," in Design, Automation & Test in Europe Conference & Exhibition (DATE'07), pp. 1-6.
[13] X. Vera and B. Lisper, "Data Cache Locking for Higher Program Predictability," in SIGMETRICS'03, CA.
[14] E. Tamura, F. Rodriguez, J.V. Busquets-Mataix, and A.M. Campoy, "High Performance Memory Architectures with Dynamic Locking Cache for Real-Time Systems," in Proceedings of the 16th Euromicro Conference on Real-Time Systems, pp. 1-4, Italy.
[15] A. Arnaud and I. Puaut, "Dynamic Instruction Cache Locking in Hard Real-Time Systems."
[16] E. Tamura, J.V. Busquets-Mataix, J.J.S. Martin, and A.M. Campoy, "A Comparison of Three Genetic Algorithms for Locking-Cache Contents Selection in Real-Time Systems," in Proceedings of the Int'l Conference, Coimbra, Portugal.
[17] C. Harrison, "Programming the cache on the PowerPC 750GX/FX: Use cache management instructions to improve performance," IBM Microcontroller Applications Group. DOI= 128.ibm.com/developerworks/library/pa-ppccache.html
[18] J. Stokes, "Xenon's L2 vs. Cell's local storage, and some notes on IBM/Nintendo's Gekko." DOI= arstechnica.com/articles/paedia/cpu/xbox360-1.ars/6
[19] MPC8272 PowerQUICC II Family Reference Manual. DOI= 2RM.pdf
[20] Heptane - a WCET analysis tool. DOI=
[21] VisualSim - a system-level simulator. DOI=
[22] W. Tang, R. Gupta, and A. Nicolau, "Power Savings in Embedded Processors through Decode Filter Cache," in Proceedings of Design, Automation and Test in Europe Conference and Exhibition (DATE'02), pp. 1-6, 2002.


10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems 1 License: http://creativecommons.org/licenses/by-nc-nd/3.0/ 10 Parallel Organizations: Multiprocessor / Multicore / Multicomputer Systems To enhance system performance and, in some cases, to increase

More information

Single Chip Heterogeneous Multiprocessor Design

Single Chip Heterogeneous Multiprocessor Design Single Chip Heterogeneous Multiprocessor Design JoAnn M. Paul July 7, 2004 Department of Electrical and Computer Engineering Carnegie Mellon University Pittsburgh, PA 15213 The Cell Phone, Circa 2010 Cell

More information

Placement de processus (MPI) sur architecture multi-cœur NUMA

Placement de processus (MPI) sur architecture multi-cœur NUMA Placement de processus (MPI) sur architecture multi-cœur NUMA Emmanuel Jeannot, Guillaume Mercier LaBRI/INRIA Bordeaux Sud-Ouest/ENSEIRB Runtime Team Lyon, journées groupe de calcul, november 2010 Emmanuel.Jeannot@inria.fr

More information

Data Cache Locking for Tight Timing Calculations

Data Cache Locking for Tight Timing Calculations Data Cache Locking for Tight Timing Calculations XAVIER VERA and BJÖRN LISPER Mälardalens Högskola and JINGLING XUE University of New South Wales Caches have become increasingly important with the widening

More information

Cache Performance (H&P 5.3; 5.5; 5.6)

Cache Performance (H&P 5.3; 5.5; 5.6) Cache Performance (H&P 5.3; 5.5; 5.6) Memory system and processor performance: CPU time = IC x CPI x Clock time CPU performance eqn. CPI = CPI ld/st x IC ld/st IC + CPI others x IC others IC CPI ld/st

More information

An Empirical Model for Predicting Cross-Core Performance Interference on Multicore Processors

An Empirical Model for Predicting Cross-Core Performance Interference on Multicore Processors An Empirical Model for Predicting Cross-Core Performance Interference on Multicore Processors Jiacheng Zhao Institute of Computing Technology, CAS In Conjunction with Prof. Jingling Xue, UNSW, Australia

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun EECS750: Advanced Operating Systems 2/24/2014 Heechul Yun 1 Administrative Project Feedback of your proposal will be sent by Wednesday Midterm report due on Apr. 2 3 pages: include intro, related work,

More information

Multimedia Systems 2011/2012

Multimedia Systems 2011/2012 Multimedia Systems 2011/2012 System Architecture Prof. Dr. Paul Müller University of Kaiserslautern Department of Computer Science Integrated Communication Systems ICSY http://www.icsy.de Sitemap 2 Hardware

More information

Write only as much as necessary. Be brief!

Write only as much as necessary. Be brief! 1 CIS371 Computer Organization and Design Midterm Exam Prof. Martin Thursday, March 15th, 2012 This exam is an individual-work exam. Write your answers on these pages. Additional pages may be attached

More information

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2) The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache

More information

Virtual Memory. Chapter 8

Virtual Memory. Chapter 8 Virtual Memory 1 Chapter 8 Characteristics of Paging and Segmentation Memory references are dynamically translated into physical addresses at run time E.g., process may be swapped in and out of main memory

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Improving Cache Performance

Improving Cache Performance Improving Cache Performance Computer Organization Architectures for Embedded Computing Tuesday 28 October 14 Many slides adapted from: Computer Organization and Design, Patterson & Hennessy 4th Edition,

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2015 Lecture 15

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2015 Lecture 15 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2015 Lecture 15 LAST TIME! Discussed concepts of locality and stride Spatial locality: programs tend to access values near values they have already accessed

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Enhancements to Linux I/O Scheduling

Enhancements to Linux I/O Scheduling Enhancements to Linux I/O Scheduling Seetharami R. Seelam, UTEP Rodrigo Romero, UTEP Patricia J. Teller, UTEP William Buros, IBM-Austin 21 July 2005 Linux Symposium 2005 1 Introduction Dynamic Adaptability

More information

MARACAS: A Real-Time Multicore VCPU Scheduling Framework

MARACAS: A Real-Time Multicore VCPU Scheduling Framework : A Real-Time Framework Computer Science Department Boston University Overview 1 2 3 4 5 6 7 Motivation platforms are gaining popularity in embedded and real-time systems concurrent workload support less

More information

Exam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence

Exam-2 Scope. 3. Shared memory architecture, distributed memory architecture, SMP, Distributed Shared Memory and Directory based coherence Exam-2 Scope 1. Memory Hierarchy Design (Cache, Virtual memory) Chapter-2 slides memory-basics.ppt Optimizations of Cache Performance Memory technology and optimizations Virtual memory 2. SIMD, MIMD, Vector,

More information

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4. Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that

More information

Chapter 9 Memory Management

Chapter 9 Memory Management Contents 1. Introduction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threads 6. CPU Scheduling 7. Process Synchronization 8. Deadlocks 9. Memory Management 10. Virtual

More information

COMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy

COMPUTER ARCHITECTURE. Virtualization and Memory Hierarchy COMPUTER ARCHITECTURE Virtualization and Memory Hierarchy 2 Contents Virtual memory. Policies and strategies. Page tables. Virtual machines. Requirements of virtual machines and ISA support. Virtual machines:

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

12 Cache-Organization 1

12 Cache-Organization 1 12 Cache-Organization 1 Caches Memory, 64M, 500 cycles L1 cache 64K, 1 cycles 1-5% misses L2 cache 4M, 10 cycles 10-20% misses L3 cache 16M, 20 cycles Memory, 256MB, 500 cycles 2 Improving Miss Penalty

More information

EITF20: Computer Architecture Part 5.1.1: Virtual Memory

EITF20: Computer Architecture Part 5.1.1: Virtual Memory EITF20: Computer Architecture Part 5.1.1: Virtual Memory Liang Liu liang.liu@eit.lth.se 1 Outline Reiteration Virtual memory Case study AMD Opteron Summary 2 Memory hierarchy 3 Cache performance 4 Cache

More information

LECTURE 10: Improving Memory Access: Direct and Spatial caches

LECTURE 10: Improving Memory Access: Direct and Spatial caches EECS 318 CAD Computer Aided Design LECTURE 10: Improving Memory Access: Direct and Spatial caches Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

WEEK 7. Chapter 4. Cache Memory Pearson Education, Inc., Hoboken, NJ. All rights reserved.

WEEK 7. Chapter 4. Cache Memory Pearson Education, Inc., Hoboken, NJ. All rights reserved. WEEK 7 + Chapter 4 Cache Memory Location Internal (e.g. processor registers, cache, main memory) External (e.g. optical disks, magnetic disks, tapes) Capacity Number of words Number of bytes Unit of Transfer

More information

MediaTek CorePilot 2.0. Delivering extreme compute performance with maximum power efficiency

MediaTek CorePilot 2.0. Delivering extreme compute performance with maximum power efficiency MediaTek CorePilot 2.0 Heterogeneous Computing Technology Delivering extreme compute performance with maximum power efficiency In July 2013, MediaTek delivered the industry s first mobile system on a chip

More information

SF-LRU Cache Replacement Algorithm

SF-LRU Cache Replacement Algorithm SF-LRU Cache Replacement Algorithm Jaafar Alghazo, Adil Akaaboune, Nazeih Botros Southern Illinois University at Carbondale Department of Electrical and Computer Engineering Carbondale, IL 6291 alghazo@siu.edu,

More information

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications

Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications Multi-threading technology and the challenges of meeting performance and power consumption demands for mobile applications September 2013 Navigating between ever-higher performance targets and strict limits

More information

4. Hardware Platform: Real-Time Requirements

4. Hardware Platform: Real-Time Requirements 4. Hardware Platform: Real-Time Requirements Contents: 4.1 Evolution of Microprocessor Architecture 4.2 Performance-Increasing Concepts 4.3 Influences on System Architecture 4.4 A Real-Time Hardware Architecture

More information

Design of Parallel Algorithms. Course Introduction

Design of Parallel Algorithms. Course Introduction + Design of Parallel Algorithms Course Introduction + CSE 4163/6163 Parallel Algorithm Analysis & Design! Course Web Site: http://www.cse.msstate.edu/~luke/courses/fl17/cse4163! Instructor: Ed Luke! Office:

More information

(Advanced) Computer Organization & Architechture. Prof. Dr. Hasan Hüseyin BALIK (4 th Week)

(Advanced) Computer Organization & Architechture. Prof. Dr. Hasan Hüseyin BALIK (4 th Week) + (Advanced) Computer Organization & Architechture Prof. Dr. Hasan Hüseyin BALIK (4 th Week) + Outline 2. The computer system 2.1 A Top-Level View of Computer Function and Interconnection 2.2 Cache Memory

More information

The Impact of Write Back on Cache Performance

The Impact of Write Back on Cache Performance The Impact of Write Back on Cache Performance Daniel Kroening and Silvia M. Mueller Computer Science Department Universitaet des Saarlandes, 66123 Saarbruecken, Germany email: kroening@handshake.de, smueller@cs.uni-sb.de,

More information

Memory Systems IRAM. Principle of IRAM

Memory Systems IRAM. Principle of IRAM Memory Systems 165 other devices of the module will be in the Standby state (which is the primary state of all RDRAM devices) or another state with low-power consumption. The RDRAM devices provide several

More information

A Cache Utility Monitor for Multi-core Processor

A Cache Utility Monitor for Multi-core Processor 3rd International Conference on Wireless Communication and Sensor Network (WCSN 2016) A Cache Utility Monitor for Multi-core Juan Fang, Yan-Jin Cheng, Min Cai, Ze-Qing Chang College of Computer Science,

More information

Parallelizing Inline Data Reduction Operations for Primary Storage Systems

Parallelizing Inline Data Reduction Operations for Primary Storage Systems Parallelizing Inline Data Reduction Operations for Primary Storage Systems Jeonghyeon Ma ( ) and Chanik Park Department of Computer Science and Engineering, POSTECH, Pohang, South Korea {doitnow0415,cipark}@postech.ac.kr

More information

EE 4683/5683: COMPUTER ARCHITECTURE

EE 4683/5683: COMPUTER ARCHITECTURE EE 4683/5683: COMPUTER ARCHITECTURE Lecture 6A: Cache Design Avinash Kodi, kodi@ohioedu Agenda 2 Review: Memory Hierarchy Review: Cache Organization Direct-mapped Set- Associative Fully-Associative 1 Major

More information

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1

CSE 431 Computer Architecture Fall Chapter 5A: Exploiting the Memory Hierarchy, Part 1 CSE 431 Computer Architecture Fall 2008 Chapter 5A: Exploiting the Memory Hierarchy, Part 1 Mary Jane Irwin ( www.cse.psu.edu/~mji ) [Adapted from Computer Organization and Design, 4 th Edition, Patterson

More information

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT Hardware Assisted Recursive Packet Classification Module for IPv6 etworks Shivvasangari Subramani [shivva1@umbc.edu] Department of Computer Science and Electrical Engineering University of Maryland Baltimore

More information