LACS: A Locality-Aware Cost-Sensitive Cache Replacement Algorithm


Mazen Kharbutli and Rami Sheikh
(Submitted to IEEE Transactions on Computers)
Mazen Kharbutli is with Jordan University of Science and Technology. Rami Sheikh is with North Carolina State University.

Abstract - The design of an effective last-level cache (LLC) in general, and an effective cache replacement/partitioning algorithm in particular, is critical to the overall system performance. The processor's ability to hide the LLC miss penalty differs widely from one miss to another. The more instructions the processor manages to issue during the miss, the better it is capable of hiding the miss penalty and the lower the cost of that miss. This non-uniformity in the processor's ability to hide LLC miss latencies, and the resultant non-uniformity in the performance impact of LLC misses, opens up an opportunity for a new cost-sensitive cache replacement algorithm. This paper makes two key contributions. First, it proposes a framework for estimating the costs of cache blocks at run-time based on the processor's ability to (partially) hide their miss latencies. Second, it proposes a simple, low-hardware-overhead, yet effective cache replacement algorithm that is Locality-Aware and Cost-Sensitive (LACS). LACS is thoroughly evaluated using a detailed simulation environment. LACS speeds up 12 LLC-performance-constrained SPEC CPU2006 benchmarks by up to 51% and by 11% on average. When evaluated using a dual/quad-core CMP with a shared LLC, LACS significantly outperforms LRU in terms of performance and fairness, achieving improvements of up to 54%.

Index Terms - Cache memories, cache replacement algorithms, cost-sensitive cache replacement, shared caches

I. INTRODUCTION

As the performance gap between the processor and main memory continues to widen, the design of an effective cache hierarchy becomes more critical in order to reduce the average memory access times perceived by the processor. The design of an effective last-level cache (LLC) continues to be the center of substantial research for several reasons. First, while a processor may be able to hide a miss in the higher-level (L1 and L2) caches followed by an LLC (L3 cache) hit (Footnote 1) by exploiting ILP, out-of-order execution, and non-blocking caches, it is almost impossible to fully hide the long LLC miss penalty. Second, as multi-core processors sharing the LLC become the dominant computing platform, new cache design constraints arise with the goal of maximizing performance and throughput while ensuring thread fairness [1], [2].

Footnote 1: Without loss of generality, we assume throughout this paper a 3-level cache hierarchy where the LLC is the L3 cache. The concepts and algorithms developed in this paper are also applicable to a 2-level cache hierarchy.

A crucial design aspect of LLCs continues to be the cache replacement and partitioning algorithms. This is evident in the many papers proposing intelligent LLC replacement and partitioning algorithms found in the recent literature. Examples include dead block predictors [3]-[8], re-reference interval predictors and adaptive insertion algorithms [9]-[14], and CMP cache partitioning algorithms [15], [16], among others. Unfortunately, most of these algorithms only target the cache's miss rate while ignoring the aggregate miss cost. Only a few proposed replacement algorithms attempt to reduce the aggregate miss cost or penalty [17]-[21].

In modern superscalar processors, the processor attempts to hide cache misses by exploiting ILP, issuing and executing independent instructions in parallel and out-of-order. Unfortunately, even with the most aggressive superscalar processors, it is quite impossible to hide the large LLC miss penalty. During this long miss penalty, the reorder buffer (ROB) and the other processor queues fill up. This eventually stalls the whole processor waiting on the LLC miss. Yet, depending on the dependency chain, miss bursts, and other factors, the processor's ability to partially hide the LLC miss penalty differs widely from one miss to another [22].

Figure 1 illustrates this point by showing the histogram of the number of issued instructions during the service of an LLC miss for several SPEC CPU2006 benchmarks [23]. The vertical axis represents the number of misses, while the horizontal axis shows the number of issued instructions during the service of the miss (plotted using intervals of 20 instructions; see Footnote 2). For example, looking at the sub-figure for the mcf benchmark, the leftmost bar indicates that for about 64 million of its LLC misses, the processor managed to issue only 0-19 instructions per miss. The number of issued instructions is counted from the time the instruction suffering the LLC miss is placed in the LLC MSHR until the requested data is received. The figure clearly shows that for most benchmarks, the number of issued instructions during an LLC miss is not uniform and varies widely, asserting the statement: Not All Misses are Created Equal [21]. The more instructions the processor manages to issue during the miss, the better it is capable of hiding the miss penalty and the lower the cost of that miss.

Footnote 2: Although the ROB used in our evaluation has 128 entries, the number of instructions issued during an LLC miss may be larger than 128. There are 255 different instructions that may be in the ROB from the time the instruction suffering the miss is added to the tail of the ROB until it retires at the ROB's head. Some or all of these instructions may issue during the miss.

Fig. 1. Issued Instructions Per LLC Miss Histogram (one panel per benchmark: astar, bwaves, bzip2, dealII, gcc, gobmk, gromacs, hmmer, lbm, libquantum, mcf, milc, omnetpp, sjeng, soplex, sphinx3, xalancbmk, zeusmp; miss counts in millions). Simulation environment details are in Section V.

This non-uniformity in the processor's ability to hide the latencies of LLC misses, and the resultant non-uniformity in the performance impact of LLC misses, opens up an opportunity to develop new cost-sensitive cache replacement algorithms. We define cache blocks where the processor manages to issue a small/large number of instructions during a miss on that block as high-/low-cost blocks, respectively. Substituting high-cost misses with low-cost misses reduces the aggregate miss penalty and thus enhances the overall cache performance.

This paper proposes a novel, simple, yet effective cache replacement algorithm called LACS: Locality-Aware Cost-Sensitive Cache Replacement Algorithm. LACS estimates the cost of a cache block by counting the number of instructions issued during the block's LLC miss, which reflects the processor's ability to (partially) hide the miss penalty. Cache blocks are classified as low-cost or high-cost based on whether the number of issued instructions is larger or smaller than a threshold. On a cache miss, when a victim block needs to be found, LACS chooses a low-cost block, keeping high-cost blocks in the cache. This is referred to as high-cost block reservation [17]-[20]. However, since a block with a high cost cannot be reserved forever, a mechanism must exist to relinquish the reservation once the block is dead (no longer needed). To achieve this, LACS implements a simple locality-based algorithm that ages a block while it is not being accessed, inverting its cost from high to low. As a result, LACS attempts to reserve high-cost blocks in the cache, but only while their locality is still high (i.e., they have been accessed recently).

The underlying locality-based algorithm employed by LACS can also be a dead block predictor. Although not shown in this paper, we integrated LACS with two dead-block predictors [3], [6] and found that, even though the integrated dead block predictors outperform our simple locality-based algorithm as standalone cache replacement algorithms, both approaches perform almost equally when their goal is limited to providing locality hints to LACS. Moreover, our locality-based algorithm has a much smaller storage overhead compared to other dead-block predictors.

The fact that LACS reserves a small subset of high-cost blocks in the cache makes it thrash-resistant. In addition, LACS is scan-resistant, since its locality-aware component increases the costs of frequently accessed blocks while they are in the cache and decreases the costs of blocks that do not get re-accessed, leading to their early eviction. Both thrash-resistance and scan-resistance are key traits of an efficient cache replacement algorithm [9]-[12]. Consequently, while LACS reduces the miss penalty by substituting high-cost misses with low-cost misses, it also reduces the miss count by being both thrash- and scan-resistant. (A short code sketch of this classification-and-aging policy follows the contribution list below.)

This paper has four main contributions:

- Miss Costs: The non-uniformity in the performance impact (cost) of LLC misses, due to the non-uniformity in the processor's ability to hide LLC miss latencies, is asserted.
- Cost Estimation: A novel, simple, yet effective run-time cost estimation method for in-flight misses is presented. The cost is estimated based on the number of instructions the processor manages to issue during the miss, which reflects the miss's performance impact and how well the processor is capable of hiding the miss penalty.
- LACS: A cost-sensitive and locality-aware cache replacement algorithm that utilizes the devised cost estimation method is proposed. LACS is simple, has low hardware overhead, and is effective for private and shared LLCs.
- LACS Optimizations: The performance of LACS is further improved by introducing novel and effective run-time optimizations. These include: 1) a mechanism to dynamically and periodically update the threshold value, which allows LACS to better adapt to different applications and execution phases; and 2) a mechanism to turn the cost-sensitive component of LACS on and off based on the predictability of block costs.
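The cost heuristic and the aging mechanism just described can be made concrete with a minimal C++ sketch. The threshold comparison and the 2-bit per-line counter follow Sections III and IV; all identifiers here are illustrative assumptions rather than the authors' hardware.

```cpp
#include <cstdint>

// Minimal sketch of the LACS cost heuristic and aging policy (names assumed).

enum class MissCost { Low, High };

// A miss during which the processor issued fewer instructions than the
// threshold is considered high-cost (hard to hide); otherwise it is low-cost.
MissCost classifyMiss(uint32_t instrsIssuedDuringMiss, uint32_t thresh) {
  return (instrsIssuedDuringMiss < thresh) ? MissCost::High : MissCost::Low;
}

// Per-line 2-bit cost counter: promoted on hits, aged while the line is not
// re-accessed; a line whose counter reaches 0 becomes an eviction candidate.
uint8_t promote(uint8_t cost) { return cost < 3 ? cost + 1 : 3; }
uint8_t age(uint8_t cost)     { return cost > 0 ? cost - 1 : 0; }
```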

LACS is thoroughly evaluated using a detailed simulation environment. When evaluated using a uniprocessor architecture model, LACS speeds up 12 LLC-performance-constrained SPEC CPU2006 benchmarks by up to 51% and by 11% on average (relative to the base LRU), without slowing down any of the 23 SPEC CPU2006 benchmarks used in the study. This performance improvement is comparable to that achieved with a 5%-10% larger cache using LRU, and yet it is achieved using a simple implementation with low hardware overhead. In addition, LACS's effectiveness is demonstrated over a wide range of LLC sizes. Moreover, LACS is compared to and shown to outperform both a state-of-the-art cost-based replacement algorithm (MLP-SBAR) [21] and a state-of-the-art locality-based algorithm (SHiP) [9]. When evaluated using a dual-core CMP architecture model with a shared LLC, LACS improves 36 SPEC CPU2006 benchmark pairs by up to 54% and by 10% on average (Footnote 3). When evaluated using a quad-core CMP architecture model with a shared LLC, LACS improves 10 SPEC CPU2006 benchmark quadruples by up to 38% and by 10% on average (Footnote 3).

Footnote 3: The metric reported here is the harmonic mean of weighted IPCs normalized to the base LRU. It is a measure of both performance and fairness improvement.

The rest of the paper is organized as follows. Section II presents the related work and compares LACS to other replacement algorithms. Section III develops the foundations for LACS. Section IV discusses LACS and its optimizations in detail for both private and shared caches. Section V describes the evaluation environment, while Section VI discusses the experimental evaluation in detail. Finally, Section VII concludes the paper.

II. RELATED WORK

Traditionally, cache replacement algorithms were developed with the goal of reducing the aggregate miss count and thus assumed that misses were uniform in cost. Belady's optimal (OPT) replacement algorithm [24] victimizes the block in the set with the largest future usage distance. It guarantees a minimal miss count but requires future knowledge, and thus remains theoretical and can only be approximated. The LRU replacement algorithm, and its approximations, rely on the principle of temporal locality by victimizing the least recently used block in the set. However, studies have shown that the performance gap between LRU and OPT is wide for high-associativity L2 caches [7]. One factor that works against LRU is that locality is usually filtered by the L1 cache and thus is inverted in the lower cache levels [3].

To bridge the gap between OPT and LRU, many intelligent replacement algorithms have been proposed for LLCs, including but not limited to dead block predictors [3]-[8] and re-reference interval predictors and adaptive insertion algorithms [9]-[14]. Dead block predictors aim to predict dead blocks (blocks that will no longer be used during their current generation times; see Footnote 4) in the cache and evict them early while preserving live blocks. Re-reference interval predictors and adaptive insertion algorithms aim to predict the time interval between consecutive accesses to a cache block, which determines the insertion position of the block in the LRU or re-use stack. Moreover, cache replacement algorithms in shared CMP caches have been studied in the context of shared cache partitioning among the concurrently-running threads [15], [16].

Footnote 4: A block's generation time starts from when it is placed in the cache after a miss until it is evicted.

Replacement algorithms such as OPT, LRU, dead block predictors, and others only distinguish between cache blocks in terms of liveness and do not distinguish between blocks in terms of miss costs. However, in modern systems, cache misses are not uniform and have different costs [18], [21], [22], [25]. Thus, it is wiser to take the miss costs, in addition to the access locality, into consideration in the replacement algorithm in order to improve the cache's overall performance. This is exactly what LACS is designed to achieve.

In Section VI, we compare LACS against SHiP (Signature-based Hit Predictor) [9], a state-of-the-art locality-based cache replacement algorithm. SHiP associates a cache reference with a unique signature and attempts to predict the re-reference interval for that signature. A Signature History Counter Table (SHCT) of saturating counters is used to learn and predict the re-reference behavior of the signatures. The table is updated on cache hits and evictions. On a cache fill, SHiP indexes into the SHCT with the new block's signature to obtain a prediction of its re-reference interval. SHiP only tracks whether a signature is re-referenced or not, but not the actual re-reference timing. For block promotion and eviction decisions, SHiP utilizes SRRIP [10]. In our evaluation, LACS is found to outperform SHiP in terms of performance improvement in both a private and a shared LLC, while requiring about 20% less storage overhead.

Srinivasan and Lebeck [22] explore load latency tolerance in dynamically scheduled processors and show that load instructions are not equal in terms of processor tolerance for load latencies. They also show that load latency tolerance is a function of the number and types of dependent instructions, especially mispredicted branches.

Moreover, Puzak et al. [25] also assert that misses have variable costs and present a new simulation-based technique for calculating the cost of a miss for different cache levels.

The observation of the non-uniform impact of cache misses led to a new class of replacement algorithms called cost-sensitive cache replacement algorithms. These algorithms assign different costs to cache blocks according to well-defined criteria and rely on these costs to select which block to evict on a cache miss (the least-cost block gets evicted first). The miss cost may be latency, penalty, power consumption, bandwidth consumption, or any other property attached to a miss [17]-[21], [26]. LACS assigns costs based on the processor's ability to (partially) hide the miss latency, measured by counting the number of instructions issued during the miss.

One of the earliest implementations of cost-sensitive cache replacement was proposed by Jeong and Dubois [17]-[19] in the context of CC-NUMA multiprocessors, in which the cost of a miss mapping to a remote memory, as opposed to local memory, is higher in terms of latency, bandwidth, and power consumption. A cost-sensitive optimal replacement algorithm (CSOPT) for CC-NUMA multiprocessors with static miss costs is evaluated and found to outperform a traditional OPT algorithm in terms of overall miss cost savings, although the miss count increases. In addition, several realizable algorithms are evaluated with a cost based on the miss latency. In comparison, LACS estimates a block's cost based on the processor's ability to tolerate and hide the miss, not on the miss latency itself. Moreover, LACS is applicable to both uniprocessors and multiprocessors.

Jeong et al. [20] also proposed a cost-sensitive cache replacement algorithm for uniprocessors. The algorithm assigns cost based on whether a block's next access is predicted to be a load (high-cost) or a store (low-cost), since processors can better tolerate store misses than load misses. In their implementation, all loads are equal and are considered high-cost. In comparison, LACS does not treat load misses equally but distinguishes between load misses in terms of cost based on the processor's ability to tolerate and hide the load miss. Our study and the studies of others [22], [25] show that load miss costs are not uniform and thus should not be treated equally. Moreover, some store misses may be critical and can stall the processor. This, for example, can happen if the LLC MSHR or write buffers fill up after a long sequence of consecutive store misses, such as when initializing or copying an array. Moreover, an increase in the number of store misses can put pressure on the memory bandwidth.

Srinivasan et al. [26] proposed a hardware scheme in which critical blocks are either preserved in a special critical cache or used to initiate prefetching. Criticality is estimated by keeping track of a load's dependence chain and the processor's ability to execute independent instructions following the load. Although they demonstrate the effectiveness of their approach when all critical loads are guaranteed to hit in the cache, no significant improvement is achieved under a realistic configuration, due to the large working set of critical loads and the inefficient way of identifying critical loads. In comparison, LACS does not need to keep track of a load's dependence chain; instead, it uses a simpler, more effective approach for cost estimation. Moreover, LACS achieves considerable performance improvement under a realistic configuration because: (a) high-cost blocks are preserved in the LLC itself instead of a smaller critical cache, and (b) LACS includes a mechanism to relinquish high-cost blocks that may no longer be needed by the processor, making room for other useful blocks.

Qureshi et al. [21] proposed a cost-sensitive cache replacement algorithm based on Memory Level Parallelism (MLP). The MLP-aware cache replacement algorithm relies on the parallelism of miss occurrences: some cache misses occur in isolation (classified as high-cost and thus preserved in the cache) while others occur and get served concurrently (classified as low-cost). Because of its significant performance degradation in pathological cases, it is used in conjunction with a tournament-predictor-like Sampling Based Adaptive Replacement (SBAR) mechanism to choose between the MLP-aware algorithm and traditional LRU, depending on which provides better performance. In comparison, LACS estimates a block's cost based on the processor's ability to tolerate the miss. Moreover, LACS's performance is demonstrated to be more robust, with negligible pathological behavior. In Section VI, LACS is compared against and shown to outperform the MLP-aware algorithm with SBAR.

Finally, other replacement algorithms that utilize cost have been proposed outside the context of processor caches, such as in disk paging [27] and Web proxy caching [28].

III. LACS'S FOUNDATIONS AND UNDERLYING PRINCIPLES

This section lays the foundations and underlying principles for LACS. First, we discuss the impact of LLC misses in modern processors, thus establishing our cost heuristic. Second, we examine the predictability/consistency of block costs, a vital property for LACS's performance.

A. Anatomy of the Impact of LLC Misses in Modern Processors

Modern dynamically-scheduled superscalar processors improve performance by issuing and executing independent instructions in parallel and out-of-order. Multiple instructions are fetched, issued, and executed every clock cycle. After an instruction completes execution, it writes back its result and waits to be retired (by updating the architectural state) in program order. Although instructions get issued and executed out-of-order, program order is preserved at retirement using the ROB [1].

Dynamically-scheduled superscalar processors can tolerate the high latency of some instructions (e.g., loads that suffer L1 cache misses) by issuing and executing independent instructions. However, there is a limit to the delay a processor can tolerate, and it may eventually stall. This happens in particular when a load instruction suffers an LLC miss and has to be serviced by the long-latency main memory. The main reason a processor may stall after an LLC load miss is that the ROB fills up and dependencies clog it [29]. Even if there are not many dependent instructions, the load instruction will reach the head of the ROB and prevent the retirement of completed instructions following it, again filling up the ROB and preventing the dispatch of new instructions as no free ROB entries are available. The clogging of the ROB has a domino effect on other pipeline queues, such as the instruction queue and the issue queues, preventing the fetch and dispatch of new instructions. Moreover, even store instructions can stall the processor, despite the fact that they can be retired under an LLC miss. This can happen after a long sequence of LLC store misses that fills up the cache's MSHR or write buffer, preventing new instructions from being added and thus stalling the processor. Such a scenario could happen when a large array is being initialized or copied. This is why LACS treats store misses similarly to load misses and assigns costs at block granularity.

Yet, the processor's ability to tolerate the miss latency differs from one miss to another [22], [25]. Consider the following two extreme scenarios. In the first, a load instruction is followed by a long chain of dependent instructions that directly or indirectly depend on it. If the load instruction suffers an LLC miss, the processor may stall immediately, since none of the instructions following it can issue. After the load instruction completes, the dependent instructions need to be issued and executed. Their execution times will be added to the miss latency. In the second scenario, a load instruction is followed by independent instructions. If the load instruction suffers an LLC miss, the processor can still remain busy by issuing and executing the independent instructions.

It will only stall once the ROB is clogged by the load instruction and the completed (but not retired) independent instructions following it. However, once the load instruction retires, all the following completed instructions can retire relatively quickly. The execution times of these instructions will be overlapped with the miss latency and thus saved. Most load instructions exhibit scenarios between these two extremes. Figure 1 asserts this observation and demonstrates that the number of instructions that a processor manages to issue during an LLC miss differs widely from one miss to another. On the one hand, the processor only manages to issue 0-19 instructions while servicing some misses (leftmost bar); on the other hand, it manages to issue more than 160 instructions while servicing other misses (rightmost bar). Therefore, the impact and cost of a miss can be effectively estimated from the number of instructions issued during the miss. Cache misses in which the processor fails to issue many instructions are considered high-cost, while cache misses in which the processor manages to remain busy and issue many instructions are considered low-cost, since they can be largely tolerated by the processor.

LACS uses the above heuristic in estimating the cost of a cache block based on whether the miss on the block is a low-cost or a high-cost miss. If the number of issued instructions during the miss is larger than a certain threshold value, the block is considered a low-cost block. Otherwise, it is considered a high-cost block. LACS attempts to reserve high-cost blocks in the cache at the expense of low-cost blocks. When a victim block needs to be found, a low-cost block is chosen. At the same time, LACS is aware of the fact that high-cost blocks should not be reserved forever and must be evicted after they are no longer needed. Therefore, LACS relinquishes high-cost blocks if they have not been accessed for some time.

B. Cost Consistency and Predictability

LACS attempts to reserve blocks that suffered high-cost misses in the past, with the assumption that future misses on the same blocks will also be high-cost misses. These high-cost blocks are reserved at the expense of blocks that suffered low-cost misses in the past, with the assumption that future misses on those same blocks will also be low-cost misses. LACS thus substitutes high-cost misses with low-cost misses. Two factors determine a block's cost: the number of issued instructions during an LLC miss on the block (numissued) and the threshold value (thresh). In order for LACS to perform effectively, the cost of a block must be consistent and repetitive across consecutive generations.

In other words, the numissued values for an individual block must be repetitive and consistent. Fortunately, our studies and profiling assert that this is true.

Fig. 2. Profile of numissued Values (per-benchmark bars for the overall numissued average and for the average absolute difference between the numissued values of the same block over consecutive misses; benchmarks astar through zeusmp, plus the overall average). Simulation environment details are in Section V.

Figure 2 shows some profiling statistics for the values of numissued. During profiling, numissued values are simply recorded but are not used for cache replacement decisions; instead, the base LRU replacement is used. While profiling, the absolute difference between the numissued values for the same cache block over two consecutive misses is recorded. The right bar for each benchmark shows the average of these absolute differences. LACS does not use the exact numissued value to estimate cost; instead, cost is estimated based on whether this numissued value is larger or smaller than a threshold value. The left bar for each benchmark shows the average of all numissued values over all misses (numissued_avg). Comparing the two bars for each benchmark shows that the absolute differences between the values of numissued for the same cache block over consecutive misses (right bar) are much smaller than the overall numissued average (left bar). In other words, because the numissued values for the same block over consecutive misses are close to each other relative to the overall numissued average, the block will most likely have the same cost across its generations. The povray benchmark only suffers from cold misses, and thus its absolute-differences average bar is not shown.

Moreover, numissued_avg values must not change dramatically across periods. Otherwise, a block's cost relative to other blocks would not be consistent across periods. A block that was marked as high-cost in one period could be a low-cost block had its cost been evaluated in a subsequent period, and vice versa. For example, assume that a block has a numissued value of 30.

On one hand, if the value of numissued_avg in a period was 100, then the block would most likely be considered a high-cost block. On the other hand, if the value of numissued_avg drops to 20 in a subsequent period, then the same high-cost block should become a low-cost block. Fortunately, our studies and profiling of numissued_avg values across periods indicate that these averages are consistent and repetitive most of the time for most benchmarks.

Fig. 3. Plot of numissued_avg Values Over 100 (16K-Miss) Periods for bwaves, milc, and gcc. Simulation environment details are in Section V.

Figure 3 shows the plots of numissued_avg values over 100 consecutive periods (of 16K misses each) for three representative benchmarks. The horizontal axis covers the 100 periods, while the vertical axis shows the numissued_avg value in each period. The figure shows three different patterns. In the first plot (bwaves), the average values are equal over all the periods. In the second plot (milc), the average values are equal over long period stretches, with different stretches having different averages. Finally, in the third plot (gcc), the average values experience large fluctuations. For the first two patterns, it is expected that block costs would have high consistency and predictability. However, in the third pattern, it is expected that block costs would suffer low consistency. As we will describe in the next section, LACS employs a simple mechanism to detect large fluctuations in period averages and turn its cost-sensitive (CS) component on and off accordingly.

Overall, the above discussion asserts that block costs are mostly repetitive and predictable, a necessary characteristic for LACS to perform effectively.

IV. LACS'S IMPLEMENTATION

This section explains the implementation of LACS and its optimizations, in addition to its hardware and storage organization, for both private and shared LLCs.

A. LACS's Implementation Details

Table I summarizes the different LACS operations explained in this section. Throughout the discussion, the corresponding steps in the table will be pointed out.

TABLE I: SUMMARY OF LACS'S OPERATIONS

On an LLC miss on block B:
  Step A: misscount++;
  Step B: MSHR[B].IIR = IIC;
  Step C: Find a victim block with cost_current == 0; if none exists, cost_current-- for all blocks in the set and repeat the search.
When the miss on block B returns:
  Step D: numissued = IIC - MSHR[B].IIR;
  Step E: Update the history table:
    Step E1: if (numissued < thresh) { Table[B].cost_stored++; totalhigh++; }
    Step E2: else Table[B].cost_stored--;
  Step F: Initialize block B: B.cost_current = Table[B].cost_stored;
On an LLC hit on block B:
  Step G: Update block B: B.cost_current++;
Threshold calculation:
  Step Op1: if (totalhigh < minhigh) thresh_new = thresh_old + 8;
            else if (totalhigh > maxhigh) thresh_new = thresh_old - 8;
            else thresh_new = thresh_old;

In the previous section, we established that block costs are highly repetitive and predictable. However, there remain some blocks whose costs swing across consecutive misses. This is usually the case when the program is transitioning from one execution phase to another, or when the block's numissued values are around the threshold value. Therefore, a confidence mechanism is needed to distinguish between high- and low-confidence cost values. LACS uses a small history table to store the costs of individual blocks (whether they are in the cache or not) as observed on their previous misses. A 2-bit saturating counter (cost_stored) is stored per cache block, representing the cost of the block and the confidence in its cost. A cost_stored value of 3 or 2 designates a high-cost block, while a cost_stored value of 0 or 1 designates a low-cost block. Moreover, a cost_stored value of 3 or 0 reflects high confidence in the cost, whereas a cost_stored value of 2 or 1 reflects low confidence in the cost.

In order for LACS to calculate the number of issued instructions during a miss, a performance counter that tracks the number of issued instructions in the pipeline is needed. This performance counter is incremented for every issued instruction. Such a performance counter (or a similar one) is readily available in most modern processors. We will call this counter the Issued Instructions Counter (IIC).

In addition, every LLC MSHR entry is augmented with a field, the Issued Instructions Register (IIR), that stores the value of the IIC when the load/store instruction is added to the MSHR (Step B). Once the miss is serviced and the data is returned from main memory, the number of issued instructions during the miss (numissued) can be calculated as the difference between the current IIC value and the stored IIR value corresponding to the missed request (Step D). Once the value of numissued is calculated, it is compared to the threshold value (thresh). If numissued is smaller than thresh, the current miss cost is considered to be high, and the corresponding cost_stored counter in the history table is incremented (Step E1). Otherwise, the current miss cost is considered to be low, and the corresponding cost_stored counter in the history table is decremented (Step E2). In addition, a totalhigh counter is incremented for every high-cost block found. Figure 4 illustrates LACS's cost estimation and assignment logic.

Fig. 4. LACS's Cost Estimation and Assignment Logic (Steps D+E+F): the IIR stored in the MSHR entry is subtracted from the IIC to form numissued, which is compared against thresh to produce the high/low miss cost; this increments/decrements the cost_stored counter in the history table, which is then copied into the cache line's cost_current field.

Each cache block entry in the tag array is augmented with a 2-bit saturating counter (cost_current) that stores the cost of the block while it resides in the cache. This cost_current field is used when replacement decisions need to be made. Once the cost_stored value in the history table is updated, it is copied into the block's cost_current field in the tag array (Step F). When a victim block needs to be found, LACS searches for a block in the set with a cost_current value of 0. If more than one such block is found, a victim is chosen randomly from among them. However, if none exists, the cost_current values of all blocks in the set are decremented and the search is repeated until a victim is found (Step C).

In addition to being cost-sensitive, LACS is also locality-aware. LACS favors frequently-accessed blocks by incrementing their cost_current values on cache hits, thus increasing their costs (Step G).

On the other hand, the cost_current values of blocks that do not get re-accessed while in the cache are periodically decremented, consequently aging them and decreasing their costs (Step C). This locality-aware component of LACS combines features of both the Least Frequently Used (LFU) and the Not Recently Used (NRU) replacement algorithms. Both LFU and NRU have scan-resistance traits [10], which makes LACS also scan-resistant. Note that combining both the cost-sensitive and locality-aware components of LACS in a single cost_current counter reflects the philosophy of LACS of giving equal importance to both components.

B. LACS's Optimizations

LACS is further improved by introducing two novel run-time optimizations. The first dynamically changes the threshold value based on the number of high-cost blocks found in the last period. The second turns the cost-sensitive component of LACS on and off based on the predictability of block costs.

Optimization 1: In order to accommodate variances in inter- and intra-application behavior, LACS uses a dynamic threshold value that is updated periodically during execution. A static value cannot accommodate different applications with different numissued value distributions. In addition, programs normally go through many execution phases with different numissued value distributions. A simple and effective solution is to calculate and update the threshold value periodically at run-time. At the end of each period (16K misses in our implementation), the threshold value is re-calculated based on the number of high-cost blocks found during the period. This threshold value is then used throughout the next period, at the end of which it is re-calculated again, and so on. A misscount counter is incremented on every cache miss (Step A). The end of a period is reached when the 14 least-significant bits of the misscount counter become all zeros.

In order for LACS to be thrash-resistant, it must reserve a small subset of high-cost blocks in the cache. Because of the large working data sets of modern applications and the relatively small LLC sizes, only a small fraction of the application's data set must be assigned high cost and thus reserved. We experimented with different fraction values and found that a fraction value between 1/32 and 1/16 performed best. Unsurprisingly, the findings of others [10], [12] on thrash-resistance coincide with these values. To accomplish this, LACS raises and lowers the thresh value according to the number of high-cost blocks found during the period (totalhigh) (Step Op1 in Table I).

Raising/lowering the thresh value increases the fraction of blocks assigned high or low cost, respectively. If totalhigh is smaller than minhigh (= 16K x 1/32 = 512), the thresh value is incremented by 8. If totalhigh is larger than maxhigh (= 16K x 1/16 = 1024), the thresh value is decremented by 8. totalhigh is reset after being used at the end of a period.

Optimization 2: As pointed out earlier, large fluctuations in numissued_avg values across periods can lead to cost inconsistency and, consequently, pathological behavior in LACS. To prevent this, a simple mechanism is used to turn the cost-sensitive (CS) component of LACS on and off. A register, sumissued, accumulates the numissued values throughout a period. At the end of a period, numissued_avg is calculated by shifting the value in the sumissued register to the right 14 times, equivalent to dividing by the number of misses in a period (16K). sumissued is reset after being used at the end of a period. LACS compares the numissued_avg values in the current and previous periods. If they are within a tolerance level (equal to 1 in our implementation), the CS component of LACS is turned on. Otherwise, it is turned off. However, even while it is turned off, learning never stops: LACS continues to assign costs and calculate the different averages and values it uses. The only change is that all cost_current values are initialized to 1 on cache fills instead of to the corresponding cost_stored values. A numissued_old_avg register stores the previous period's numissued_avg value.

C. LACS's Storage Organization, Hardware Cost, and Overhead

Figure 5 shows the storage organization for LACS. The history table is direct-mapped and tagless, with 16K 2-bit entries. Its total size is thus a modest 4KB, which is less than 0.4% of the size of a 1MB cache. The history table acts as a confidence mechanism and is only accessed on an LLC miss, allowing it to be placed off-chip, if needed, and its access time to be fully overlapped with the LLC miss latency. To index into the table, the block address is divided into 14-bit parts that are XOR-ed together to produce a 14-bit index. We evaluated aliasing effects in the history table and found them to be minimal. All table entries are initialized to a value of 1.

Each LLC cache line in the tag array is augmented with a single 2-bit cost_current field that reflects both the block's cost and locality. Assuming a 1MB LLC with 64-byte lines, the total storage overhead for the cost fields would be only 4KB, which is less than 0.4% of the cache size.
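The per-miss bookkeeping and the two period-level optimizations above can be summarized in code. The following C++ sketch follows Table I (Steps A-G and Op1) and the constants given in the text (2-bit counters, 16K-miss periods, minhigh = 512, maxhigh = 1024, +/-8 threshold steps, a tolerance of 1 on numissued_avg, cost_current initialized to 1 when the CS component is off, and the 14-bit XOR-folded table index). The class layout, the initial threshold value, and the simulator-facing interfaces are illustrative assumptions, not the authors' hardware design.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

constexpr int      kAssoc        = 16;               // LLC associativity
constexpr uint32_t kIndexBits    = 14;               // 16K-entry history table
constexpr uint32_t kPeriodMask   = (1u << 14) - 1;   // 16K-miss periods
constexpr uint32_t kMinHigh      = 16384 / 32;       // 512
constexpr uint32_t kMaxHigh      = 16384 / 16;       // 1024
constexpr uint32_t kAvgTolerance = 1;                // tolerance on numissued_avg

// Fold the block address into a 14-bit history-table index by XOR-ing 14-bit slices.
uint32_t historyIndex(uint64_t blockAddr) {
  const uint32_t mask = (1u << kIndexBits) - 1;
  uint32_t idx = 0;
  for (; blockAddr != 0; blockAddr >>= kIndexBits)
    idx ^= static_cast<uint32_t>(blockAddr) & mask;
  return idx;
}

struct Line { uint8_t costCurrent = 1; /* plus tag, coherence state, ... */ };
struct Set  { Line way[kAssoc]; };

struct LacsEngine {
  std::vector<uint8_t> table = std::vector<uint8_t>(1u << kIndexBits, 1); // 2-bit cost_stored
  uint32_t iic = 0;      // Issued Instructions Counter (incremented per issued instruction)
  uint32_t thresh = 64;  // threshold (initial value is an assumption)
  uint32_t missCount = 0, totalHigh = 0, sumIssued = 0, oldAvg = 0;
  bool     csOn = true;

  static uint8_t inc2(uint8_t c) { return c < 3 ? c + 1 : 3; }  // 2-bit saturating ++
  static uint8_t dec2(uint8_t c) { return c > 0 ? c - 1 : 0; }  // 2-bit saturating --

  // Steps A+B: on an LLC miss, bump misscount and snapshot the IIC into the entry's IIR.
  uint32_t onMiss() { ++missCount; return iic; }

  // Step C: victimize a random cost_current==0 block; age the set until one exists.
  Line& findVictim(Set& set) {
    for (;;) {
      std::vector<Line*> zeros;
      for (Line& w : set.way)
        if (w.costCurrent == 0) zeros.push_back(&w);
      if (!zeros.empty()) return *zeros[std::rand() % zeros.size()];
      for (Line& w : set.way) w.costCurrent = dec2(w.costCurrent);
    }
  }

  // Steps D+E+F: when the miss returns, classify it and initialize cost_current.
  void onFill(uint64_t blockAddr, uint32_t iirSnapshot, Line& filled) {
    uint32_t numIssued = iic - iirSnapshot;                                // Step D
    sumIssued += numIssued;
    uint8_t& costStored = table[historyIndex(blockAddr)];
    if (numIssued < thresh) { costStored = inc2(costStored); ++totalHigh; } // Step E1
    else                    { costStored = dec2(costStored); }             // Step E2
    filled.costCurrent = csOn ? costStored : 1;                            // Step F
    if ((missCount & kPeriodMask) == 0) endOfPeriod();  // 14 LSBs of misscount are zero
  }

  // Step G: promote a block on an LLC hit.
  void onHit(Line& hit) { hit.costCurrent = inc2(hit.costCurrent); }

  // Step Op1 (Optimization 1) and the CS on/off check (Optimization 2).
  void endOfPeriod() {
    if      (totalHigh < kMinHigh)                thresh += 8;
    else if (totalHigh > kMaxHigh && thresh >= 8) thresh -= 8;
    uint32_t avg  = sumIssued >> 14;               // divide by 16K misses per period
    uint32_t diff = (avg > oldAvg) ? avg - oldAvg : oldAvg - avg;
    csOn = (diff <= kAvgTolerance);
    oldAvg = avg; totalHigh = 0; sumIssued = 0;
  }
};
```

In hardware these operations map onto the counters and comparator of Figure 4 and the end-of-period logic; the sketch only makes the sequencing of the table's steps explicit.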

The IIC is a 32-bit counter in the pipeline. Each MSHR entry is augmented with a 32-bit IIR. In addition, five 32-bit registers/counters are used by LACS: thresh, misscount, sumissued, totalhigh, and numissued_old_avg. Assuming a 32-entry MSHR, the total storage overhead for these registers and counters would be 4 x (32 + 1 + 5) = 152 bytes.

Fig. 5. LACS's Storage Organization (sizes are not to scale): the LACS engine and the IIC sit beside the processor core (with its L1 I/D and L2 caches), while the history table, the IIRs in the MSHR, and the per-line cost_current fields are attached to the LLC.

The operations performed by LACS are limited to increment/decrement, addition/subtraction, shifting, and comparison. These operations only require simple logic and a small hardware overhead. Moreover, most of these operations are performed infrequently (only on an LLC miss), allowing the same units to be shared between different components. For example, all MSHR entries can share the same cost estimation logic (Figure 4). Moreover, since most operations are performed on an LLC miss or at the end of a period, they can be performed off the critical path and their latencies can be overlapped with the miss latency.

D. LACS's Implementation in Shared Caches

LACS can be used in both private and shared LLCs. Since applications have different run-time LLC access characteristics, such as different numissued value distributions and different access frequencies, it is wiser to use private LACS meta-data per core (a core could be a physical core, as in CMPs, or a virtual core, as in SMTs). Sharing the meta-data can lead to skewness and bias problems. Therefore, LACS's storage and engine are duplicated for each core.

However, since many hardware units are used infrequently, they can be shared between cores. For example, a core's periodic updates are very infrequent (every 16K misses); thus, the hardware unit that performs these updates can be shared, although it will use and update each core's private meta-data individually.

LACS performs the operations explained earlier for each core independently. However, when a victim block needs to be chosen, all blocks in the set are possible candidates, regardless of the core they belong to. Studies have shown that threads do not put the same pressure on all sets [16], [30]. Allowing LACS to randomly choose a victim from among all the low-cost blocks in the set results in a dynamic and adaptive partitioning of individual cache sets that reflects each core's demand. However, such a solution, if used alone, can lead to cache thrashing and thread starvation, especially when one thread has a much larger cache access frequency compared to the other thread(s). Reserving a subset of each core's blocks in the cache can significantly reduce thrashing [10]-[12]. LACS can reserve blocks in the cache by assigning high cost to them. To prevent thrashing and improve fairness, LACS raises and lowers a core's threshold value based on its number of blocks in the cache. For instance, if a core has very few blocks in the cache, its threshold value is raised, which increases the likelihood of its blocks being classified as high-cost and thus reserved. On the other hand, if a core has too many blocks in the cache, its threshold value is lowered, which increases the likelihood of its blocks being classified as low-cost and thus evicted early. This can be done by utilizing the dynamic thresh calculation optimization discussed earlier (Section IV-B and Step Op1 in Table I).

E. Other Implementation Issues

Prefetching is commonly used and can be easily integrated with LACS. Prefetched blocks should not update the history table, since they do not stall the processor and thus do not have a numissued value. Moreover, LACS can even help by identifying and prioritizing the prefetching of high-cost blocks. This is left as future work.

Although not evaluated explicitly, we believe the impact of LACS on power consumption to be modest. First, the total storage required is small, adding a relatively negligible static power consumption. Second, most LACS operations are performed (infrequently) on an LLC miss, thus increasing dynamic power consumption only slightly. This slight increase in power consumption is offset by the expected power savings due to improved performance.
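The per-core threshold adjustment described above for shared LLCs can be sketched as follows. The direction of the adjustment (raise thresh for under-represented cores, lower it for over-represented ones) and the +/-8 step follow the text and Step Op1; the fair-share occupancy band, the core count, and the structure names are illustrative assumptions of this sketch.

```cpp
#include <array>
#include <cstdint>

constexpr int      kCores      = 4;
constexpr uint32_t kTotalLines = (4u * 1024 * 1024) / 64;   // 4MB shared LLC, 64B lines

struct CoreMeta {
  uint32_t thresh    = 64;   // per-core threshold (initial value assumed)
  uint32_t occupancy = 0;    // number of LLC lines currently owned by this core
};

// Called at the end of each core's period: raise thresh when the core holds far
// fewer lines than its fair share (so more of its blocks are kept as high-cost),
// lower it when it holds far more (so its blocks age out and are evicted sooner).
void adjustSharedThresh(std::array<CoreMeta, kCores>& cores) {
  const uint32_t fairShare = kTotalLines / kCores;
  for (auto& c : cores) {
    if      (c.occupancy < fairShare / 2)           c.thresh += 8;  // under-represented
    else if (c.occupancy > (3 * fairShare) / 2 &&
             c.thresh >= 8)                         c.thresh -= 8;  // over-represented
    // otherwise the regular Step Op1 update governs thresh
  }
}
```

As the text notes, this reuses the existing end-of-period thresh update machinery; the sketch only makes the occupancy-driven direction of the adjustment explicit.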

V. EVALUATION ENVIRONMENT

LACS is evaluated using a detailed, execution-driven, cycle-accurate simulator based on SESC [31]. The modeled processor is a 4-way out-of-order superscalar with a 128-entry ROB and 8 stages (with different execution-stage delays). The modeled cache hierarchy has 3 levels and is based on an Intel Core i7 system. Table II lists the different memory hierarchy parameters used in the evaluation. Latencies in the table correspond to contention-free conditions. We model port/bus contention and queuing delays in detail. LACS parameters follow those from Section IV.

TABLE II: PARAMETERS OF THE SIMULATED MEMORY HIERARCHY
  L1 Inst. Cache:  32 KB, 4-way, 64B lines, private, WB, LRU, 1-cycle, 32-entry MSHR
  L1 Data Cache:   32 KB, 8-way, 64B lines, private, WB, LRU, 2-cycle, 32-entry MSHR
  L2 Cache:        256 KB, 8-way, 64B lines, private, WB, LRU, 10-cycle, 32-entry MSHR
  L3 Cache (LLC):  1 MB per core (1MB/2MB/4MB for uni-/dual-/quad-core, respectively), 16-way, 64B lines, shared, WB, LRU, 30-cycle, 32-entry MSHR
  Main Memory:     250 cycles
  Memory Bus:      Split-transactions, 16B wide, 1/4 processor frequency

In addition, 23 of the 29 SPEC CPU2006 benchmarks [23] are used in the evaluation. The missing 6 benchmarks are excluded because our cross-compiler does not support Fortran 9x. The benchmarks were compiled with gcc using the -O3 optimization level. The reference input sets were used for all benchmarks. Table III lists the 23 benchmarks used in the evaluation along with their LLC miss rates, LLC misses per 1000 instructions (MPKI), and the fraction of their execution times stalled on LLC misses. This fraction was calculated from the execution times of a base LLC versus a perfect always-hit LLC. A benchmark is considered LLC-performance-constrained if over 25% of its execution time is spent on LLC miss stalls. Those benchmarks are grouped into Group A in the table and are the focus of our evaluation. Since LACS attempts to reduce the aggregate miss cost, its impact will only be evident in benchmarks where LLC miss stalls significantly impact the execution time. The remaining benchmarks are not LLC-performance-constrained and are placed in Group B.

We evaluate LACS under two configurations: a uniprocessor with a private LLC, and a dual/quad-core CMP system with a shared LLC. When running uniprocessor simulations, each benchmark is simulated for five billion instructions after fast forwarding the first five billion instructions.

TABLE III: THE BENCHMARKS USED IN THE EVALUATION WITH THEIR LLC-BASED STATISTICS
(* marks the 8 representative benchmarks used to construct the CMP workloads)

Group A (LLC-performance-constrained):
  astar: 95% miss rate
  bwaves: 98% miss rate
  gcc*: 73% miss rate, 30% stalls
  hmmer*: 33% miss rate, 28% stalls
  lbm*: 100% miss rate
  libquantum: 100% miss rate
  mcf*: 73% miss rate, 72% stalls
  milc: 100% miss rate
  omnetpp*: 97% miss rate
  sphinx3: 93% miss rate
  xalancbmk*: 85% miss rate
  zeusmp: 88% miss rate

Group B:
  bzip2: 39% miss rate
  calculix*: 75% miss rate, <0.1 MPKI, 1% stalls
  dealII*: 5% miss rate, 0.4 MPKI, 9% stalls
  gobmk: 28% miss rate, 0.5 MPKI, 7% stalls
  gromacs: 45% miss rate, 0.6 MPKI, 14% stalls
  h264ref: 4% miss rate, <0.1 MPKI, 1% stalls
  namd: 5% miss rate, 0.1 MPKI, 2% stalls
  perlbench: 88% miss rate, 0.3 MPKI, 3% stalls
  povray: 2% miss rate, <0.1 MPKI
  sjeng: 87% miss rate, 0.3 MPKI, 4% stalls
  soplex: 44% miss rate

When running CMP simulations, the first five billion instructions in each benchmark are fast forwarded, and then the simulation runs until each benchmark has completed two billion instructions. If a benchmark completes two billion instructions, it continues execution until all benchmarks do. However, results are reported for the first two billion instructions of each benchmark. We construct all 36 possible pairs from 8 representative benchmarks (marked with asterisks in Table III; see Footnote 6). We also randomly construct 10 heterogeneous quadruples from the 23 benchmarks. A quadruple contains at least two Group A benchmarks.

Footnote 6: These 8 benchmarks cover different application characteristics in terms of IPC, MPKI, and LLC access frequencies.

VI. EVALUATION

In this section, the performance of LACS is first studied in the context of private LLCs (uniprocessors) (Section VI-A), where it is compared to other cache replacement algorithms. Its sensitivity to different optimizations and cache sizes is also investigated. After that, LACS's performance is studied in the context of shared LLCs (CMPs) (Section VI-B).

A. Evaluation in a Private LLC Environment

Figure 6 shows the percentage of IPC performance improvement (speedup) of LACS and three other cache replacement algorithms over the base LRU. LACS randomly chooses a victim block from among the low-cost blocks in a set, so it is compared against a random replacement algorithm (RND) to show that LACS's speedup is not due to the randomness in selecting a victim.

Fig. 6. IPC Performance Improvement (Speedup) of LACS Compared to Other Cache Replacement Algorithms (RND, MLP-SBAR, SHiP, and LACS over the base LRU, for the Group A benchmarks plus AVG-GrpA and AVG-ALL).

LACS is also compared against a state-of-the-art cost-sensitive cache replacement algorithm: MLP-aware replacement using SBAR (MLP-SBAR) [21]. In addition, LACS is compared to a state-of-the-art locality-based cache replacement algorithm: Signature-based Hit Predictor (SHiP) [9]. SHiP-PC-S-R2 is used in our implementation. MLP-SBAR and SHiP are the best-performing cost-based and locality-based cache replacement algorithms to date, respectively. AVG-GrpA and AVG-ALL show the average speedup achieved for the Group A benchmarks versus all benchmarks (Groups A+B), respectively.

The figure shows that LACS achieves an average speedup of about 11% for the LLC-performance-constrained Group A benchmarks, and up to 51% (in mcf). LACS speeds up 6 benchmarks by 5% or more (gcc, hmmer, mcf, omnetpp, sphinx3, and xalancbmk). Furthermore, LACS does not slow down any benchmark by more than 2%, demonstrating its robustness. The average speedup achieved by RND for the Group A benchmarks is comparably insignificant (2%). Moreover, RND suffers from pathological behavior in some Group B benchmarks, reducing its overall speedup for all benchmarks to less than 1%. MLP-SBAR speeds up the same 6 benchmarks that LACS does, but LACS outperforms MLP-SBAR in all of them except omnetpp. Overall, the average speedup achieved by LACS is larger than that achieved by MLP-SBAR for both the Group A benchmarks (11% vs. 7%) and all benchmarks. SHiP speeds up 5 of the 6 benchmarks that LACS speeds up (gcc, hmmer, mcf, omnetpp, and xalancbmk). While LACS speeds up sphinx3, SHiP slows it down by 11%, indicating that SHiP is less robust compared to LACS. Overall, the average speedup achieved by LACS is larger than that achieved by SHiP for both the Group

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell

More information

Perceptron Learning for Reuse Prediction

Perceptron Learning for Reuse Prediction Perceptron Learning for Reuse Prediction Elvira Teran Zhe Wang Daniel A. Jiménez Texas A&M University Intel Labs {eteran,djimenez}@tamu.edu zhe2.wang@intel.com Abstract The disparity between last-level

More information

Addressing End-to-End Memory Access Latency in NoC-Based Multicores

Addressing End-to-End Memory Access Latency in NoC-Based Multicores Addressing End-to-End Memory Access Latency in NoC-Based Multicores Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das The Pennsylvania State University University Park, PA, 682, USA {akbar,euk39,kandemir,das}@cse.psu.edu

More information

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez Summary Read misses are more cri?cal than write misses

More information

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors Resource-Conscious Scheduling for Energy Efficiency on Andreas Merkel, Jan Stoess, Frank Bellosa System Architecture Group KIT The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe

More information

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard

More information

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India

More information

A Hybrid Adaptive Feedback Based Prefetcher

A Hybrid Adaptive Feedback Based Prefetcher A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,

More information

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables Storage Efficient Hardware Prefetching using Correlating Prediction Tables Marius Grannaes Magnus Jahre Lasse Natvig Norwegian University of Science and Technology HiPEAC European Network of Excellence

More information

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches AB-Aware: Application Behavior Aware Management of Shared Last Level Caches Suhit Pai, Newton Singh and Virendra Singh Computer Architecture and Dependable Systems Laboratory Department of Electrical Engineering

More information

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory Lavanya Subramanian* Vivek Seshadri* Arnab Ghosh* Samira Khan*

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department SIP: Speculative Insertion Policy for High Performance Caching Hongil Yoon Tan Zhang Mikko H. Lipasti Technical Report #1676 June 2010 SIP: Speculative Insertion Policy for

More information

A Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt

A Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas 78712-24 TR-HPS-26-3

More information

Improving Cache Performance using Victim Tag Stores

Improving Cache Performance using Victim Tag Stores Improving Cache Performance using Victim Tag Stores SAFARI Technical Report No. 2011-009 Vivek Seshadri, Onur Mutlu, Todd Mowry, Michael A Kozuch {vseshadr,tcm}@cs.cmu.edu, onur@cmu.edu, michael.a.kozuch@intel.com

More information

Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems

Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems 1 Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems Dimitris Kaseridis, Member, IEEE, Muhammad Faisal Iqbal, Student Member, IEEE and Lizy Kurian John,

More information

Bias Scheduling in Heterogeneous Multi-core Architectures

Bias Scheduling in Heterogeneous Multi-core Architectures Bias Scheduling in Heterogeneous Multi-core Architectures David Koufaty Dheeraj Reddy Scott Hahn Intel Labs {david.a.koufaty, dheeraj.reddy, scott.hahn}@intel.com Abstract Heterogeneous architectures that

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Sandbox Based Optimal Offset Estimation [DPC2]

Sandbox Based Optimal Offset Estimation [DPC2] Sandbox Based Optimal Offset Estimation [DPC2] Nathan T. Brown and Resit Sendag Department of Electrical, Computer, and Biomedical Engineering Outline Motivation Background/Related Work Sequential Offset

More information

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez Summary Read misses are more cri?cal than write misses

More information

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Journal of Instruction-Level Parallelism 13 (11) 1-14 Submitted 3/1; published 1/11 Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Santhosh Verma

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Multiperspective Reuse Prediction

Multiperspective Reuse Prediction ABSTRACT Daniel A. Jiménez Texas A&M University djimenezacm.org The disparity between last-level cache and memory latencies motivates the search for e cient cache management policies. Recent work in predicting

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip

More information

RECENT studies have shown that, in highly associative

RECENT studies have shown that, in highly associative IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 4, APRIL 2008 433 Counter-Based Cache Replacement and Bypassing Algorithms Mazen Kharbutli, Member, IEEE, and Yan Solihin, Member, IEEE Abstract Recent studies

More information

Filtered Runahead Execution with a Runahead Buffer

Filtered Runahead Execution with a Runahead Buffer Filtered Runahead Execution with a Runahead Buffer ABSTRACT Milad Hashemi The University of Texas at Austin miladhashemi@utexas.edu Runahead execution dynamically expands the instruction window of an out

More information

Improving Cache Management Policies Using Dynamic Reuse Distances

Improving Cache Management Policies Using Dynamic Reuse Distances Improving Cache Management Policies Using Dynamic Reuse Distances Nam Duong, Dali Zhao, Taesu Kim, Rosario Cammarota, Mateo Valero and Alexander V. Veidenbaum University of California, Irvine Universitat

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

Portland State University ECE 587/687. Caches and Memory-Level Parallelism

Portland State University ECE 587/687. Caches and Memory-Level Parallelism Portland State University ECE 587/687 Caches and Memory-Level Parallelism Revisiting Processor Performance Program Execution Time = (CPU clock cycles + Memory stall cycles) x clock cycle time For each

More information

A Bandwidth-aware Memory-subsystem Resource Management using. Non-invasive Resource Profilers for Large CMP Systems

A Bandwidth-aware Memory-subsystem Resource Management using. Non-invasive Resource Profilers for Large CMP Systems A Bandwidth-aware Memory-subsystem Resource Management using Non-invasive Resource Profilers for Large CMP Systems Dimitris Kaseridis, Jeffrey Stuecheli, Jian Chen and Lizy K. John Department of Electrical

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems

NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems Rentong Guo 1, Xiaofei Liao 1, Hai Jin 1, Jianhui Yue 2, Guang Tan 3 1 Huazhong University of Science

More information

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era

More information

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories DEMM: a Dynamic Energy-saving mechanism for Multicore Memories Akbar Sharifi, Wei Ding 2, Diana Guttman 3, Hui Zhao 4, Xulong Tang 5, Mahmut Kandemir 5, Chita Das 5 Facebook 2 Qualcomm 3 Intel 4 University

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era

Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era Dimitris Kaseridis Electrical and Computer Engineering The University of Texas at Austin Austin, TX, USA kaseridis@mail.utexas.edu

More information

Spatial Memory Streaming (with rotated patterns)

Spatial Memory Streaming (with rotated patterns) Spatial Memory Streaming (with rotated patterns) Michael Ferdman, Stephen Somogyi, and Babak Falsafi Computer Architecture Lab at 2006 Stephen Somogyi The Memory Wall Memory latency 100 s clock cycles;

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University

Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University Lecture 4: Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee282 Lecture 4-1 Announcements HW1 is out (handout and online) Due on 10/15

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

Prefetch-Aware DRAM Controllers

Prefetch-Aware DRAM Controllers Prefetch-Aware DRAM Controllers Chang Joo Lee Onur Mutlu Veynu Narasiman Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin

More information

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1 Available online at www.sciencedirect.com Physics Procedia 33 (2012 ) 1029 1035 2012 International Conference on Medical Physics and Biomedical Engineering Memory Performance Characterization of SPEC CPU2006

More information

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance

More information

Cache Controller with Enhanced Features using Verilog HDL

Cache Controller with Enhanced Features using Verilog HDL Cache Controller with Enhanced Features using Verilog HDL Prof. V. B. Baru 1, Sweety Pinjani 2 Assistant Professor, Dept. of ECE, Sinhgad College of Engineering, Vadgaon (BK), Pune, India 1 PG Student

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Energy Models for DVFS Processors

Energy Models for DVFS Processors Energy Models for DVFS Processors Thomas Rauber 1 Gudula Rünger 2 Michael Schwind 2 Haibin Xu 2 Simon Melzner 1 1) Universität Bayreuth 2) TU Chemnitz 9th Scheduling for Large Scale Systems Workshop July

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Fazal Hameed and Jeronimo Castrillon Center for Advancing Electronics Dresden (cfaed), Technische Universität Dresden,

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches

Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches Daniel A. Jiménez Department of Computer Science and Engineering Texas A&M University ABSTRACT Last-level caches mitigate the high latency

More information

Footprint-based Locality Analysis

Footprint-based Locality Analysis Footprint-based Locality Analysis Xiaoya Xiang, Bin Bao, Chen Ding University of Rochester 2011-11-10 Memory Performance On modern computer system, memory performance depends on the active data usage.

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Lecture-16 (Cache Replacement Policies) CS422-Spring

Lecture-16 (Cache Replacement Policies) CS422-Spring Lecture-16 (Cache Replacement Policies) CS422-Spring 2018 Biswa@CSE-IITK 1 2 4 8 16 32 64 128 From SPEC92 Miss rate: Still Applicable Today 0.14 0.12 0.1 0.08 0.06 0.04 1-way 2-way 4-way 8-way Capacity

More information

Lightweight Memory Tracing

Lightweight Memory Tracing Lightweight Memory Tracing Mathias Payer*, Enrico Kravina, Thomas Gross Department of Computer Science ETH Zürich, Switzerland * now at UC Berkeley Memory Tracing via Memlets Execute code (memlets) for

More information

A Fast Instruction Set Simulator for RISC-V

A Fast Instruction Set Simulator for RISC-V A Fast Instruction Set Simulator for RISC-V Maxim.Maslov@esperantotech.com Vadim.Gimpelson@esperantotech.com Nikita.Voronov@esperantotech.com Dave.Ditzel@esperantotech.com Esperanto Technologies, Inc.

More information

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor Sarah Bird ϕ, Aashish Phansalkar ϕ, Lizy K. John ϕ, Alex Mericas α and Rajeev Indukuru α ϕ University

More information

Systems Programming and Computer Architecture ( ) Timothy Roscoe

Systems Programming and Computer Architecture ( ) Timothy Roscoe Systems Group Department of Computer Science ETH Zürich Systems Programming and Computer Architecture (252-0061-00) Timothy Roscoe Herbstsemester 2016 AS 2016 Caches 1 16: Caches Computer Architecture

More information

Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads

Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads Milad Hashemi, Onur Mutlu, Yale N. Patt The University of Texas at Austin ETH Zürich ABSTRACT Runahead execution pre-executes

More information

DynRBLA: A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories

DynRBLA: A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories SAFARI Technical Report No. 2-5 (December 6, 2) : A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories HanBin Yoon hanbinyoon@cmu.edu Justin Meza meza@cmu.edu

More information

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Microsoft ssri@microsoft.com Santhosh Srinath Onur Mutlu Hyesoon Kim Yale N. Patt Microsoft Research

More information

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory Cache Memories Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory CPU looks first for data in caches (e.g., L1, L2, and

More information

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( )

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 15: Caches and Optimization Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Last time Program

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2017 Lecture 15 LAST TIME: CACHE ORGANIZATION Caches have several important parameters B = 2 b bytes to store the block in each cache line S = 2 s cache sets

More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

Operating Systems. Operating Systems Sina Meraji U of T

Operating Systems. Operating Systems Sina Meraji U of T Operating Systems Operating Systems Sina Meraji U of T Recap Last time we looked at memory management techniques Fixed partitioning Dynamic partitioning Paging Example Address Translation Suppose addresses

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

OpenPrefetch. (in-progress)

OpenPrefetch. (in-progress) OpenPrefetch Let There Be Industry-Competitive Prefetching in RISC-V Processors (in-progress) Bowen Huang, Zihao Yu, Zhigang Liu, Chuanqi Zhang, Sa Wang, Yungang Bao Institute of Computing Technology(ICT),

More information

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache Classifying Misses: 3C Model (Hill) Divide cache misses into three categories Compulsory (cold): never seen this address before Would miss even in infinite cache Capacity: miss caused because cache is

More information

Multi-Cache Resizing via Greedy Coordinate Descent

Multi-Cache Resizing via Greedy Coordinate Descent Noname manuscript No. (will be inserted by the editor) Multi-Cache Resizing via Greedy Coordinate Descent I. Stephen Choi Donald Yeung Received: date / Accepted: date Abstract To reduce power consumption

More information

Improving Writeback Efficiency with Decoupled Last-Write Prediction

Improving Writeback Efficiency with Decoupled Last-Write Prediction Improving Writeback Efficiency with Decoupled Last-Write Prediction Zhe Wang Samira M. Khan Daniel A. Jiménez The University of Texas at San Antonio {zhew,skhan,dj}@cs.utsa.edu Abstract In modern DDRx

More information

A Front-end Execution Architecture for High Energy Efficiency

A Front-end Execution Architecture for High Energy Efficiency A Front-end Execution Architecture for High Energy Efficiency Ryota Shioya, Masahiro Goshima and Hideki Ando Department of Electrical Engineering and Computer Science, Nagoya University, Aichi, Japan Information

More information

UCB CS61C : Machine Structures

UCB CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 36 Performance 2010-04-23 Lecturer SOE Dan Garcia How fast is your computer? Every 6 months (Nov/June), the fastest supercomputers in

More information

Data Prefetching by Exploiting Global and Local Access Patterns

Data Prefetching by Exploiting Global and Local Access Patterns Journal of Instruction-Level Parallelism 13 (2011) 1-17 Submitted 3/10; published 1/11 Data Prefetching by Exploiting Global and Local Access Patterns Ahmad Sharif Hsien-Hsin S. Lee School of Electrical

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Limiting the Number of Dirty Cache Lines

Limiting the Number of Dirty Cache Lines Limiting the Number of Dirty Cache Lines Pepijn de Langen and Ben Juurlink Computer Engineering Laboratory Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Micro-sector Cache: Improving Space Utilization in Sectored DRAM Caches

Micro-sector Cache: Improving Space Utilization in Sectored DRAM Caches Micro-sector Cache: Improving Space Utilization in Sectored DRAM Caches Mainak Chaudhuri Mukesh Agrawal Jayesh Gaur Sreenivas Subramoney Indian Institute of Technology, Kanpur 286, INDIA Intel Architecture

More information

Combining Local and Global History for High Performance Data Prefetching

Combining Local and Global History for High Performance Data Prefetching Combining Local and Global History for High Performance Data ing Martin Dimitrov Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida {dimitrov,zhou}@eecs.ucf.edu

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Wenun Wang and Wei-Ming Lin Department of Electrical and Computer Engineering, The University

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

An Intelligent Fetching algorithm For Efficient Physical Register File Allocation In Simultaneous Multi-Threading CPUs

An Intelligent Fetching algorithm For Efficient Physical Register File Allocation In Simultaneous Multi-Threading CPUs International Journal of Computer Systems (ISSN: 2394-1065), Volume 04 Issue 04, April, 2017 Available at http://www.ijcsonline.com/ An Intelligent Fetching algorithm For Efficient Physical Register File

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 18-447: Computer Architecture Lecture 25: Main Memory Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 Reminder: Homework 5 (Today) Due April 3 (Wednesday!) Topics: Vector processing,

More information

Prefetch-Aware DRAM Controllers

Prefetch-Aware DRAM Controllers Prefetch-Aware DRAM Controllers Chang Joo Lee Onur Mutlu Veynu Narasiman Yale N. Patt Department of Electrical and Computer Engineering The University of Texas at Austin {cjlee, narasima, patt}@ece.utexas.edu

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems

Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems Reetuparna Das Rachata Ausavarungnirun Onur Mutlu Akhilesh Kumar Mani Azimi University of Michigan Carnegie

More information

Scheduling the Intel Core i7

Scheduling the Intel Core i7 Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach

A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach Asit K. Mishra Onur Mutlu Chita R. Das Executive summary Problem: Current day NoC designs are agnostic to application requirements

More information

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery

More information