LACS: A Locality-Aware Cost-Sensitive Cache Replacement Algorithm


Mazen Kharbutli and Rami Sheikh
(Submitted to IEEE Transactions on Computers)
Mazen Kharbutli is with Jordan University of Science and Technology. Rami Sheikh is with North Carolina State University.

Abstract - The design of an effective last-level cache (LLC) in general, and an effective cache replacement/partitioning algorithm in particular, is critical to the overall system performance. The processor's ability to hide the LLC miss penalty differs widely from one miss to another. The more instructions the processor manages to issue during the miss, the better it is capable of hiding the miss penalty and the lower the cost of that miss. This non-uniformity in the processor's ability to hide LLC miss latencies, and the resultant non-uniformity in the performance impact of LLC misses, opens up an opportunity for a new cost-sensitive cache replacement algorithm. This paper makes two key contributions. First, it proposes a framework for estimating the costs of cache blocks at run-time based on the processor's ability to (partially) hide their miss latencies. Second, it proposes a simple, low-hardware-overhead, yet effective cache replacement algorithm that is Locality-Aware and Cost-Sensitive (LACS). LACS is thoroughly evaluated using a detailed simulation environment. LACS speeds up 12 LLC-performance-constrained SPEC CPU2006 benchmarks by up to 51% and by 11% on average. When evaluated using a dual/quad-core CMP with a shared LLC, LACS significantly outperforms LRU in terms of performance and fairness, achieving improvements of up to 54%.

Index Terms - Cache memories, cache replacement algorithms, cost-sensitive cache replacement, shared caches

I. INTRODUCTION

As the performance gap between the processor and main memory continues to widen, the design of an effective cache hierarchy becomes more critical in order to reduce the average memory access times perceived by the processor. The design of an effective last-level cache (LLC) continues to be the center of substantial research for several reasons. First, while a processor may be able to hide a miss in the higher-level (L1 and L2) caches followed by an LLC (L3 cache) hit (Footnote 1) by exploiting ILP, out-of-order execution, and non-blocking caches, it is almost impossible to fully hide the long LLC miss penalty. Second, as multi-core processors sharing the LLC become the dominant computing platform, new cache design constraints arise with the goal of maximizing performance and throughput while ensuring thread fairness [1], [2].

Footnote 1: Without loss of generality, we assume throughout this paper a 3-level cache hierarchy where the LLC is the L3 cache. The concepts and algorithms developed in this paper are also applicable to a 2-level cache hierarchy.

A crucial design aspect of LLCs continues to be the cache replacement and partitioning algorithms. This is evident in the many papers proposing intelligent LLC replacement and partitioning algorithms found in the recent literature. Examples include dead block predictors [3]-[8], re-reference interval predictors and adaptive insertion algorithms [9]-[14], and CMP cache partitioning algorithms [15], [16], among others. Unfortunately, most of these algorithms only target the cache's miss rate while ignoring the aggregate miss cost. Only a few proposed replacement algorithms attempt to reduce the aggregate miss cost or penalty [17]-[21].

In modern superscalar processors, the processor attempts to hide cache misses by exploiting ILP, issuing and executing independent instructions in parallel and out-of-order. Unfortunately, even with the most aggressive superscalar processors, it is quite impossible to hide the large LLC miss penalty. During this long miss penalty, the reorder buffer (ROB) and the other processor queues fill up. This eventually stalls the whole processor waiting on the LLC miss. Yet, depending on the dependency chain, miss bursts, and other factors, the processor's ability to partially hide the LLC miss penalty differs widely from one miss to another [22].

Figure 1 illustrates this point by showing the histogram of the number of issued instructions during the service of an LLC miss for several SPEC CPU2006 benchmarks [23]. The vertical axis represents the number of misses, while the horizontal axis shows the number of issued instructions during the service of the miss (plotted using intervals of 20 instructions; see Footnote 2). For example, looking at the sub-figure for the mcf benchmark, the leftmost bar indicates that for about 64 million of its LLC misses, the processor managed to issue only 0-19 instructions per miss. The number of issued instructions is counted from the time the instruction suffering the LLC miss is placed in the LLC MSHR until the requested data is received. The figure clearly shows that for most benchmarks, the number of issued instructions during an LLC miss is not uniform and varies widely, asserting the statement: Not All Misses are Created Equal [21]. The more instructions the processor manages to issue during the miss, the better it is capable of hiding the miss penalty and the lower the cost of that miss.

Footnote 2: Although the ROB used in our evaluation has 128 entries, the number of instructions issued during an LLC miss may be larger than 128. There are 255 different instructions that may be in the ROB from the time the instruction suffering the miss is added to the tail of the ROB until it retires at the ROB's head. Some or all of these instructions may issue during the miss.

Fig. 1. Issued Instructions Per LLC Miss Histogram (one panel per benchmark: astar, bwaves, bzip2, dealII, gcc, gobmk, gromacs, hmmer, lbm, libquantum, mcf, milc, omnetpp, sjeng, soplex, sphinx3, xalancbmk, zeusmp; miss counts in millions). Simulation environment details are in Section V.

This non-uniformity in the processor's ability to hide the latencies of LLC misses, and the resultant non-uniformity in the performance impact of LLC misses, opens up an opportunity to develop new cost-sensitive cache replacement algorithms. We define cache blocks where the processor manages to issue a small/large number of instructions during a miss on that block as high-/low-cost blocks, respectively. Substituting high-cost misses with low-cost misses reduces the aggregate miss penalty and thus enhances the overall cache performance.

This paper proposes a novel, simple, yet effective cache replacement algorithm called LACS: Locality-Aware Cost-Sensitive Cache Replacement Algorithm. LACS estimates the cost of a cache block by counting the number of instructions issued during the block's LLC miss, which reflects the processor's ability to (partially) hide the miss penalty. Cache blocks are classified as low-cost or high-cost based on whether the number of issued instructions is larger or smaller than a threshold. On a cache miss, when a victim block needs to be found, LACS chooses a low-cost block, keeping high-cost blocks in the cache. This is referred to as high-cost block reservation [17]-[20]. However, since a block with a high cost cannot be reserved forever, a mechanism must exist to relinquish the reservation once the block is dead (no longer needed). To achieve this, LACS implements a simple locality-based algorithm that ages a block while it is not being accessed, inverting its cost from high to low. As a result, LACS attempts to reserve high-cost blocks in the cache, but only while their locality is still high (i.e., they have been accessed recently).

The underlying locality-based algorithm employed by LACS can also be a dead block predictor. Although not shown in this paper, we integrated LACS with two dead-block predictors [3], [6] and found that, even though the integrated dead block predictors outperform our simple locality-based algorithm as standalone cache replacement algorithms, both approaches perform almost equally when their goal is limited to providing locality hints to LACS. Moreover, our locality-based algorithm has a much smaller storage overhead compared to other dead-block predictors.

The fact that LACS reserves a small subset of high-cost blocks in the cache makes it thrash-resistant. In addition, LACS is scan-resistant, since its locality-aware component increases the costs of frequently accessed blocks while they are in the cache and decreases the costs of blocks that do not get re-accessed, leading to their early eviction. Both thrash-resistance and scan-resistance are key traits of an efficient cache replacement algorithm [9]-[12]. Consequently, while LACS reduces the miss penalty by substituting high-cost misses with low-cost misses, it also reduces the miss count by being both thrash- and scan-resistant. (A short code sketch of this classification-and-aging policy follows the contribution list below.)

This paper has four main contributions:

- Miss Costs: The non-uniformity in the performance impact (cost) of LLC misses, due to the non-uniformity in the processor's ability to hide LLC miss latencies, is asserted.
- Cost Estimation: A novel, simple, yet effective run-time cost estimation method for in-flight misses is presented. The cost is estimated based on the number of instructions the processor manages to issue during the miss, which reflects the miss's performance impact and how well the processor is capable of hiding the miss penalty.
- LACS: A cost-sensitive and locality-aware cache replacement algorithm that utilizes the devised cost estimation method is proposed. LACS is simple, has low hardware overhead, and is effective for private and shared LLCs.
- LACS Optimizations: The performance of LACS is further improved by introducing novel and effective run-time optimizations. These include: 1) a mechanism to dynamically and periodically update the threshold value, which allows LACS to better adapt to different applications and execution phases; and 2) a mechanism to turn the cost-sensitive component of LACS on and off based on the predictability of block costs.
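The cost heuristic and the aging mechanism just described can be made concrete with a minimal C++ sketch. The threshold comparison and the 2-bit per-line counter follow Sections III and IV; all identifiers here are illustrative assumptions rather than the authors' hardware.

```cpp
#include <cstdint>

// Minimal sketch of the LACS cost heuristic and aging policy (names assumed).

enum class MissCost { Low, High };

// A miss during which the processor issued fewer instructions than the
// threshold is considered high-cost (hard to hide); otherwise it is low-cost.
MissCost classifyMiss(uint32_t instrsIssuedDuringMiss, uint32_t thresh) {
  return (instrsIssuedDuringMiss < thresh) ? MissCost::High : MissCost::Low;
}

// Per-line 2-bit cost counter: promoted on hits, aged while the line is not
// re-accessed; a line whose counter reaches 0 becomes an eviction candidate.
uint8_t promote(uint8_t cost) { return cost < 3 ? cost + 1 : 3; }
uint8_t age(uint8_t cost)     { return cost > 0 ? cost - 1 : 0; }
```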

LACS is thoroughly evaluated using a detailed simulation environment. When evaluated using a uniprocessor architecture model, LACS speeds up 12 LLC-performance-constrained SPEC CPU2006 benchmarks by up to 51% and by 11% on average (relative to the base LRU), without slowing down any of the 23 SPEC CPU2006 benchmarks used in the study. This performance improvement is comparable to that achieved with a 5%-10% larger cache using LRU, and yet it is achieved using a simple implementation with low hardware overhead. In addition, LACS's effectiveness is demonstrated over a wide range of LLC sizes. Moreover, LACS is compared to and shown to outperform both a state-of-the-art cost-based replacement algorithm (MLP-SBAR) [21] and a state-of-the-art locality-based algorithm (SHiP) [9]. When evaluated using a dual-core CMP architecture model with a shared LLC, LACS improves 36 SPEC CPU2006 benchmark pairs by up to 54% and by 10% on average (Footnote 3). When evaluated using a quad-core CMP architecture model with a shared LLC, LACS improves 10 SPEC CPU2006 benchmark quadruples by up to 38% and by 10% on average (Footnote 3).

Footnote 3: The metric reported here is the harmonic mean of weighted IPCs normalized to the base LRU. It is a measure of both performance and fairness improvement.

The rest of the paper is organized as follows. Section II presents the related work and compares LACS to other replacement algorithms. Section III develops the foundations for LACS. Section IV discusses LACS and its optimizations in detail for both private and shared caches. Section V describes the evaluation environment, while Section VI discusses the experimental evaluation in detail. Finally, Section VII concludes the paper.

II. RELATED WORK

Traditionally, cache replacement algorithms were developed with the goal of reducing the aggregate miss count and thus assumed that misses were uniform in cost. Belady's optimal (OPT) replacement algorithm [24] victimizes the block in the set with the largest future usage distance. It guarantees a minimal miss count but requires future knowledge, and thus remains theoretical and can only be approximated. The LRU replacement algorithm, and its approximations, rely on the principle of temporal locality by victimizing the least recently used block in the set. However, studies have shown that the performance gap between LRU and OPT is wide for high-associativity L2 caches [7]. One factor that works against LRU is that locality is usually filtered by the L1 cache and thus is inverted in the lower cache levels [3].

To bridge the gap between OPT and LRU, many intelligent replacement algorithms have been proposed for LLCs, including but not limited to dead block predictors [3]-[8] and re-reference interval predictors and adaptive insertion algorithms [9]-[14]. Dead block predictors aim to predict dead blocks (blocks that will no longer be used during their current generation times; see Footnote 4) in the cache and evict them early while preserving live blocks. Re-reference interval predictors and adaptive insertion algorithms aim to predict the time interval between consecutive accesses to a cache block, which determines the insertion position of the block in the LRU or re-use stack. Moreover, cache replacement algorithms in shared CMP caches have been studied in the context of shared cache partitioning among the concurrently-running threads [15], [16].

Footnote 4: A block's generation time starts from when it is placed in the cache after a miss until it is evicted.

Replacement algorithms such as OPT, LRU, dead block predictors, and others only distinguish between cache blocks in terms of liveness and do not distinguish between blocks in terms of miss costs. However, in modern systems, cache misses are not uniform and have different costs [18], [21], [22], [25]. Thus, it is wiser to take the miss costs, in addition to the access locality, into consideration in the replacement algorithm in order to improve the cache's overall performance. This is exactly what LACS is designed to achieve.

In Section VI, we compare LACS against SHiP (Signature-based Hit Predictor) [9], a state-of-the-art locality-based cache replacement algorithm. SHiP associates a cache reference with a unique signature and attempts to predict the re-reference interval for that signature. A Signature History Counter Table (SHCT) of saturating counters is used to learn and predict the re-reference behavior of the signatures. The table is updated on cache hits and evictions. On a cache fill, SHiP indexes into the SHCT with the new block's signature to obtain a prediction of its re-reference interval. SHiP only tracks whether a signature is re-referenced or not, but not the actual re-reference timing. For block promotion and eviction decisions, SHiP utilizes SRRIP [10]. In our evaluation, LACS is found to outperform SHiP in terms of performance improvement in both a private and a shared LLC, while requiring about 20% less storage overhead.

Srinivasan and Lebeck [22] explore load latency tolerance in dynamically scheduled processors and show that load instructions are not equal in terms of processor tolerance for load latencies. They also show that load latency tolerance is a function of the number and types of dependent instructions, especially mispredicted branches.

Moreover, Puzak et al. [25] also assert that misses have variable costs and present a new simulation-based technique for calculating the cost of a miss for different cache levels.

The observation of the non-uniform impact of cache misses led to a new class of replacement algorithms called cost-sensitive cache replacement algorithms. These algorithms assign different costs to cache blocks according to well-defined criteria and rely on these costs to select which block to evict on a cache miss (the least-cost block gets evicted first). The miss cost may be latency, penalty, power consumption, bandwidth consumption, or any other property attached to a miss [17]-[21], [26]. LACS assigns costs based on the processor's ability to (partially) hide the miss latency, measured by counting the number of instructions issued during the miss.

One of the earliest implementations of cost-sensitive cache replacement was proposed by Jeong and Dubois [17]-[19] in the context of CC-NUMA multiprocessors, in which the cost of a miss mapping to a remote memory, as opposed to local memory, is higher in terms of latency, bandwidth, and power consumption. A cost-sensitive optimal replacement algorithm (CSOPT) for CC-NUMA multiprocessors with static miss costs is evaluated and found to outperform a traditional OPT algorithm in terms of overall miss cost savings, although the miss count increases. In addition, several realizable algorithms are evaluated with a cost based on the miss latency. In comparison, LACS estimates a block's cost based on the processor's ability to tolerate and hide the miss, not on the miss latency itself. Moreover, LACS is applicable to both uniprocessors and multiprocessors.

Jeong et al. [20] also proposed a cost-sensitive cache replacement algorithm for uniprocessors. The algorithm assigns cost based on whether a block's next access is predicted to be a load (high-cost) or a store (low-cost), since processors can better tolerate store misses than load misses. In their implementation, all loads are equal and are considered high-cost. In comparison, LACS does not treat load misses equally but distinguishes between load misses in terms of cost based on the processor's ability to tolerate and hide the load miss. Our study and the studies of others [22], [25] show that load miss costs are not uniform and thus should not be treated equally. Moreover, some store misses may be critical and can stall the processor. This, for example, can happen if the LLC MSHR or write buffers fill up after a long sequence of consecutive store misses, such as when initializing or copying an array. Moreover, an increase in the number of store misses can put pressure on the memory bandwidth.

Srinivasan et al. [26] proposed a hardware scheme in which critical blocks are either preserved in a special critical cache or used to initiate prefetching. Criticality is estimated by keeping track of a load's dependence chain and the processor's ability to execute independent instructions following the load. Although they demonstrate the effectiveness of their approach when all critical loads are guaranteed to hit in the cache, no significant improvement is achieved under a realistic configuration, due to the large working set of critical loads and the inefficient way of identifying critical loads. In comparison, LACS does not need to keep track of a load's dependence chain; instead, it uses a simpler, more effective approach for cost estimation. Moreover, LACS achieves considerable performance improvement under a realistic configuration because: (a) high-cost blocks are preserved in the LLC itself instead of a smaller critical cache, and (b) LACS includes a mechanism to relinquish high-cost blocks that may no longer be needed by the processor, making room for other useful blocks.

Qureshi et al. [21] proposed a cost-sensitive cache replacement algorithm based on Memory Level Parallelism (MLP). The MLP-aware cache replacement algorithm relies on the parallelism of miss occurrences: some cache misses occur in isolation (classified as high-cost and thus preserved in the cache) while others occur and get served concurrently (classified as low-cost). Because of its significant performance degradation in pathological cases, it is used in conjunction with a tournament-predictor-like Sampling Based Adaptive Replacement (SBAR) mechanism to choose between the MLP-aware algorithm and traditional LRU, depending on which provides better performance. In comparison, LACS estimates a block's cost based on the processor's ability to tolerate the miss. Moreover, LACS's performance is demonstrated to be more robust, with negligible pathological behavior. In Section VI, LACS is compared against and shown to outperform the MLP-aware algorithm with SBAR.

Finally, other replacement algorithms that utilize cost have been proposed outside the context of processor caches, such as in disk paging [27] and Web proxy caching [28].

III. LACS'S FOUNDATIONS AND UNDERLYING PRINCIPLES

This section lays the foundations and underlying principles for LACS. First, we discuss the impact of LLC misses in modern processors, thus establishing our cost heuristic. Second, we examine the predictability/consistency of block costs, a vital property for LACS's performance.

A. Anatomy of the Impact of LLC Misses in Modern Processors

Modern dynamically-scheduled superscalar processors improve performance by issuing and executing independent instructions in parallel and out-of-order. Multiple instructions are fetched, issued, and executed every clock cycle. After an instruction completes execution, it writes back its result and waits to be retired (by updating the architectural state) in program order. Although instructions get issued and executed out-of-order, program order is preserved at retirement using the ROB [1].

Dynamically-scheduled superscalar processors can tolerate the high latency of some instructions (e.g., loads that suffer L1 cache misses) by issuing and executing independent instructions. However, there is a limit to the delay a processor can tolerate, and it may eventually stall. This happens in particular when a load instruction suffers an LLC miss and has to be serviced by the long-latency main memory. The main reason a processor may stall after an LLC load miss is that the ROB fills up and dependencies clog it [29]. Even if there are not many dependent instructions, the load instruction will reach the head of the ROB and prevent the retirement of completed instructions following it, again filling up the ROB and preventing the dispatch of new instructions as no free ROB entries are available. The clogging of the ROB has a domino effect on other pipeline queues, such as the instruction queue and the issue queues, preventing the fetch and dispatch of new instructions. Moreover, even store instructions can stall the processor, despite the fact that they can be retired under an LLC miss. This can happen after a long sequence of LLC store misses that fills up the cache's MSHR or write buffer, preventing new instructions from being added and thus stalling the processor. Such a scenario could happen when a large array is being initialized or copied. This is why LACS treats store misses similarly to load misses and assigns costs at block granularity.

Yet, the processor's ability to tolerate the miss latency differs from one miss to another [22], [25]. Consider the following two extreme scenarios. In the first, a load instruction is followed by a long chain of dependent instructions that directly or indirectly depend on it. If the load instruction suffers an LLC miss, the processor may stall immediately, since none of the instructions following it can issue. After the load instruction completes, the dependent instructions need to be issued and executed. Their execution times will be added to the miss latency. In the second scenario, a load instruction is followed by independent instructions. If the load instruction suffers an LLC miss, the processor can still remain busy by issuing and executing the independent instructions.

It will only stall once the ROB is clogged by the load instruction and the completed (but not retired) independent instructions following it. However, once the load instruction retires, all the following completed instructions can retire relatively quickly. The execution times of these instructions will be overlapped with the miss latency and thus saved. Most load instructions exhibit scenarios between these two extremes. Figure 1 asserts this observation and demonstrates that the number of instructions that a processor manages to issue during an LLC miss differs widely from one miss to another. On the one hand, the processor only manages to issue 0-19 instructions while servicing some misses (leftmost bar); on the other hand, it manages to issue more than 160 instructions while servicing other misses (rightmost bar). Therefore, the impact and cost of a miss can be effectively estimated from the number of instructions issued during the miss. Cache misses in which the processor fails to issue many instructions are considered high-cost, while cache misses in which the processor manages to remain busy and issue many instructions are considered low-cost, since they can be largely tolerated by the processor.

LACS uses the above heuristic in estimating the cost of a cache block based on whether the miss on the block is a low-cost or a high-cost miss. If the number of issued instructions during the miss is larger than a certain threshold value, the block is considered a low-cost block. Otherwise, it is considered a high-cost block. LACS attempts to reserve high-cost blocks in the cache at the expense of low-cost blocks. When a victim block needs to be found, a low-cost block is chosen. At the same time, LACS is aware of the fact that high-cost blocks should not be reserved forever and must be evicted after they are no longer needed. Therefore, LACS relinquishes high-cost blocks if they have not been accessed for some time.

B. Cost Consistency and Predictability

LACS attempts to reserve blocks that suffered high-cost misses in the past, with the assumption that future misses on the same blocks will also be high-cost misses. These high-cost blocks are reserved at the expense of blocks that suffered low-cost misses in the past, with the assumption that future misses on those same blocks will also be low-cost misses. LACS thus substitutes high-cost misses with low-cost misses. Two factors determine a block's cost: the number of issued instructions during an LLC miss on the block (numissued) and the threshold value (thresh). In order for LACS to perform effectively, the cost of a block must be consistent and repetitive across consecutive generations.

In other words, the numissued values for an individual block must be repetitive and consistent. Fortunately, our studies and profiling assert that this is true.

Fig. 2. Profile of numissued Values (per-benchmark bars for the overall numissued average and for the average absolute difference between the numissued values of the same block over consecutive misses; benchmarks astar through zeusmp, plus the overall average). Simulation environment details are in Section V.

Figure 2 shows some profiling statistics for the values of numissued. During profiling, numissued values are simply recorded but are not used for cache replacement decisions; instead, the base LRU replacement is used. While profiling, the absolute difference between the numissued values for the same cache block over two consecutive misses is recorded. The right bar for each benchmark shows the average of these absolute differences. LACS does not use the exact numissued value to estimate cost; instead, cost is estimated based on whether this numissued value is larger or smaller than a threshold value. The left bar for each benchmark shows the average of all numissued values over all misses (numissued_avg). Comparing the two bars for each benchmark shows that the absolute differences between the values of numissued for the same cache block over consecutive misses (right bar) are much smaller than the overall numissued average (left bar). In other words, because the numissued values for the same block over consecutive misses are close to each other relative to the overall numissued average, the block will most likely have the same cost across its generations. The povray benchmark only suffers from cold misses, and thus its absolute-differences average bar is not shown.

Moreover, numissued_avg values must not change dramatically across periods. Otherwise, a block's cost relative to other blocks would not be consistent across periods. A block that was marked as high-cost in one period could be a low-cost block had its cost been evaluated in a subsequent period, and vice versa. For example, assume that a block has a numissued value of 30.

On one hand, if the value of numissued_avg in a period was 100, then the block would most likely be considered a high-cost block. On the other hand, if the value of numissued_avg drops to 20 in a subsequent period, then the same high-cost block should become a low-cost block. Fortunately, our studies and profiling of numissued_avg values across periods indicate that these averages are consistent and repetitive most of the time for most benchmarks.

Fig. 3. Plot of numissued_avg Values Over 100 (16K-Miss) Periods for bwaves, milc, and gcc. Simulation environment details are in Section V.

Figure 3 shows the plots of numissued_avg values over 100 consecutive periods (of 16K misses each) for three representative benchmarks. The horizontal axis covers the 100 periods, while the vertical axis shows the numissued_avg value in each period. The figure shows three different patterns. In the first plot (bwaves), the average values are equal over all the periods. In the second plot (milc), the average values are equal over long period stretches, with different stretches having different averages. Finally, in the third plot (gcc), the average values experience large fluctuations. For the first two patterns, it is expected that block costs would have high consistency and predictability. However, in the third pattern, it is expected that block costs would suffer low consistency. As we will describe in the next section, LACS employs a simple mechanism to detect large fluctuations in period averages and turn its cost-sensitive (CS) component on and off accordingly.

Overall, the above discussion asserts that block costs are mostly repetitive and predictable, a necessary characteristic for LACS to perform effectively.

IV. LACS'S IMPLEMENTATION

This section explains the implementation of LACS and its optimizations, in addition to its hardware and storage organization, for both private and shared LLCs.

A. LACS's Implementation Details

Table I summarizes the different LACS operations explained in this section. Throughout the discussion, the corresponding steps in the table will be pointed out.

TABLE I: SUMMARY OF LACS'S OPERATIONS

On an LLC miss on block B:
  Step A: misscount++;
  Step B: MSHR[B].IIR = IIC;
  Step C: Find a victim block with cost_current == 0; if none exists, cost_current-- for all blocks in the set and repeat the search.
When the miss on block B returns:
  Step D: numissued = IIC - MSHR[B].IIR;
  Step E: Update the history table:
    Step E1: if (numissued < thresh) { Table[B].cost_stored++; totalhigh++; }
    Step E2: else Table[B].cost_stored--;
  Step F: Initialize block B: B.cost_current = Table[B].cost_stored;
On an LLC hit on block B:
  Step G: Update block B: B.cost_current++;
Threshold calculation:
  Step Op1: if (totalhigh < minhigh) thresh_new = thresh_old + 8;
            else if (totalhigh > maxhigh) thresh_new = thresh_old - 8;
            else thresh_new = thresh_old;

In the previous section, we established that block costs are highly repetitive and predictable. However, there remain some blocks whose costs swing across consecutive misses. This is usually the case when the program is transitioning from one execution phase to another, or when the block's numissued values are around the threshold value. Therefore, a confidence mechanism is needed to distinguish between high- and low-confidence cost values. LACS uses a small history table to store the costs of individual blocks (whether they are in the cache or not) as observed on their previous misses. A 2-bit saturating counter (cost_stored) is stored per cache block, representing the cost of the block and the confidence in its cost. A cost_stored value of 3 or 2 designates a high-cost block, while a cost_stored value of 0 or 1 designates a low-cost block. Moreover, a cost_stored value of 3 or 0 reflects high confidence in the cost, whereas a cost_stored value of 2 or 1 reflects low confidence in the cost.

In order for LACS to calculate the number of issued instructions during a miss, a performance counter that tracks the number of issued instructions in the pipeline is needed. This performance counter is incremented for every issued instruction. Such a performance counter (or a similar one) is readily available in most modern processors. We will call this counter the Issued Instructions Counter (IIC).

In addition, every LLC MSHR entry is augmented with a field, the Issued Instructions Register (IIR), that stores the value of the IIC when the load/store instruction is added to the MSHR (Step B). Once the miss is serviced and the data is returned from main memory, the number of issued instructions during the miss (numissued) can be calculated as the difference between the current IIC value and the stored IIR value corresponding to the missed request (Step D). Once the value of numissued is calculated, it is compared to the threshold value (thresh). If numissued is smaller than thresh, the current miss cost is considered to be high, and the corresponding cost_stored counter in the history table is incremented (Step E1). Otherwise, the current miss cost is considered to be low, and the corresponding cost_stored counter in the history table is decremented (Step E2). In addition, a totalhigh counter is incremented for every high-cost block found. Figure 4 illustrates LACS's cost estimation and assignment logic.

Fig. 4. LACS's Cost Estimation and Assignment Logic (Steps D+E+F): the IIR stored in the MSHR entry is subtracted from the IIC to form numissued, which is compared against thresh to produce the high/low miss cost; this increments/decrements the cost_stored counter in the history table, which is then copied into the cache line's cost_current field.

Each cache block entry in the tag array is augmented with a 2-bit saturating counter (cost_current) that stores the cost of the block while it resides in the cache. This cost_current field is used when replacement decisions need to be made. Once the cost_stored value in the history table is updated, it is copied into the block's cost_current field in the tag array (Step F). When a victim block needs to be found, LACS searches for a block in the set with a cost_current value of 0. If more than one such block is found, a victim is chosen randomly from among them. However, if none exists, the cost_current values of all blocks in the set are decremented and the search is repeated until a victim is found (Step C).

In addition to being cost-sensitive, LACS is also locality-aware. LACS favors frequently-accessed blocks by incrementing their cost_current values on cache hits, thus increasing their costs (Step G).

On the other hand, the cost_current values of blocks that do not get re-accessed while in the cache are periodically decremented, consequently aging them and decreasing their costs (Step C). This locality-aware component of LACS combines features of both the Least Frequently Used (LFU) and the Not Recently Used (NRU) replacement algorithms. Both LFU and NRU have scan-resistance traits [10], which makes LACS also scan-resistant. Note that combining both the cost-sensitive and locality-aware components of LACS in a single cost_current counter reflects the philosophy of LACS of giving equal importance to both components.

B. LACS's Optimizations

LACS is further improved by introducing two novel run-time optimizations. The first dynamically changes the threshold value based on the number of high-cost blocks found in the last period. The second turns the cost-sensitive component of LACS on and off based on the predictability of block costs.

Optimization 1: In order to accommodate variances in inter- and intra-application behavior, LACS uses a dynamic threshold value that is updated periodically during execution. A static value cannot accommodate different applications with different numissued value distributions. In addition, programs normally go through many execution phases with different numissued value distributions. A simple and effective solution is to calculate and update the threshold value periodically at run-time. At the end of each period (16K misses in our implementation), the threshold value is re-calculated based on the number of high-cost blocks found during the period. This threshold value is then used throughout the next period, at the end of which it is re-calculated again, and so on. A misscount counter is incremented on every cache miss (Step A). The end of a period is reached when the 14 least-significant bits of the misscount counter become all zeros.

In order for LACS to be thrash-resistant, it must reserve a small subset of high-cost blocks in the cache. Because of the large working data sets of modern applications and the relatively small LLC sizes, only a small fraction of the application's data set must be assigned high cost and thus reserved. We experimented with different fraction values and found that a fraction value between 1/32 and 1/16 performed best. Unsurprisingly, the findings of others [10], [12] on thrash-resistance coincide with these values. To accomplish this, LACS raises and lowers the thresh value according to the number of high-cost blocks found during the period (totalhigh) (Step Op1 in Table I).

Raising/lowering the thresh value increases the fraction of blocks assigned high or low cost, respectively. If totalhigh is smaller than minhigh (= 16K x 1/32 = 512), the thresh value is incremented by 8. If totalhigh is larger than maxhigh (= 16K x 1/16 = 1024), the thresh value is decremented by 8. totalhigh is reset after being used at the end of a period.

Optimization 2: As pointed out earlier, large fluctuations in numissued_avg values across periods can lead to cost inconsistency and, consequently, pathological behavior in LACS. To prevent this, a simple mechanism is used to turn the cost-sensitive (CS) component of LACS on and off. A register, sumissued, accumulates the numissued values throughout a period. At the end of a period, numissued_avg is calculated by shifting the value in the sumissued register to the right 14 times, equivalent to dividing by the number of misses in a period (16K). sumissued is reset after being used at the end of a period. LACS compares the numissued_avg values in the current and previous periods. If they are within a tolerance level (equal to 1 in our implementation), the CS component of LACS is turned on. Otherwise, it is turned off. However, even while it is turned off, learning never stops: LACS continues to assign costs and calculate the different averages and values it uses. The only change is that all cost_current values are initialized to 1 on cache fills instead of to the corresponding cost_stored values. A numissued_old_avg register stores the previous period's numissued_avg value.

C. LACS's Storage Organization, Hardware Cost, and Overhead

Figure 5 shows the storage organization for LACS. The history table is direct-mapped and tagless, with 16K 2-bit entries. Its total size is thus a modest 4KB, which is less than 0.4% of the size of a 1MB cache. The history table acts as a confidence mechanism and is only accessed on an LLC miss, allowing it to be placed off-chip, if needed, and its access time to be fully overlapped with the LLC miss latency. To index into the table, the block address is divided into 14-bit parts that are XOR-ed together to produce a 14-bit index. We evaluated aliasing effects in the history table and found them to be minimal. All table entries are initialized to a value of 1.

Each LLC cache line in the tag array is augmented with a single 2-bit cost_current field that reflects both the block's cost and locality. Assuming a 1MB LLC with 64-byte lines, the total storage overhead for the cost fields would be only 4KB, which is less than 0.4% of the cache size.
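The per-miss bookkeeping and the two period-level optimizations above can be summarized in code. The following C++ sketch follows Table I (Steps A-G and Op1) and the constants given in the text (2-bit counters, 16K-miss periods, minhigh = 512, maxhigh = 1024, +/-8 threshold steps, a tolerance of 1 on numissued_avg, cost_current initialized to 1 when the CS component is off, and the 14-bit XOR-folded table index). The class layout, the initial threshold value, and the simulator-facing interfaces are illustrative assumptions, not the authors' hardware design.

```cpp
#include <cstdint>
#include <cstdlib>
#include <vector>

constexpr int      kAssoc        = 16;               // LLC associativity
constexpr uint32_t kIndexBits    = 14;               // 16K-entry history table
constexpr uint32_t kPeriodMask   = (1u << 14) - 1;   // 16K-miss periods
constexpr uint32_t kMinHigh      = 16384 / 32;       // 512
constexpr uint32_t kMaxHigh      = 16384 / 16;       // 1024
constexpr uint32_t kAvgTolerance = 1;                // tolerance on numissued_avg

// Fold the block address into a 14-bit history-table index by XOR-ing 14-bit slices.
uint32_t historyIndex(uint64_t blockAddr) {
  const uint32_t mask = (1u << kIndexBits) - 1;
  uint32_t idx = 0;
  for (; blockAddr != 0; blockAddr >>= kIndexBits)
    idx ^= static_cast<uint32_t>(blockAddr) & mask;
  return idx;
}

struct Line { uint8_t costCurrent = 1; /* plus tag, coherence state, ... */ };
struct Set  { Line way[kAssoc]; };

struct LacsEngine {
  std::vector<uint8_t> table = std::vector<uint8_t>(1u << kIndexBits, 1); // 2-bit cost_stored
  uint32_t iic = 0;      // Issued Instructions Counter (incremented per issued instruction)
  uint32_t thresh = 64;  // threshold (initial value is an assumption)
  uint32_t missCount = 0, totalHigh = 0, sumIssued = 0, oldAvg = 0;
  bool     csOn = true;

  static uint8_t inc2(uint8_t c) { return c < 3 ? c + 1 : 3; }  // 2-bit saturating ++
  static uint8_t dec2(uint8_t c) { return c > 0 ? c - 1 : 0; }  // 2-bit saturating --

  // Steps A+B: on an LLC miss, bump misscount and snapshot the IIC into the entry's IIR.
  uint32_t onMiss() { ++missCount; return iic; }

  // Step C: victimize a random cost_current==0 block; age the set until one exists.
  Line& findVictim(Set& set) {
    for (;;) {
      std::vector<Line*> zeros;
      for (Line& w : set.way)
        if (w.costCurrent == 0) zeros.push_back(&w);
      if (!zeros.empty()) return *zeros[std::rand() % zeros.size()];
      for (Line& w : set.way) w.costCurrent = dec2(w.costCurrent);
    }
  }

  // Steps D+E+F: when the miss returns, classify it and initialize cost_current.
  void onFill(uint64_t blockAddr, uint32_t iirSnapshot, Line& filled) {
    uint32_t numIssued = iic - iirSnapshot;                                // Step D
    sumIssued += numIssued;
    uint8_t& costStored = table[historyIndex(blockAddr)];
    if (numIssued < thresh) { costStored = inc2(costStored); ++totalHigh; } // Step E1
    else                    { costStored = dec2(costStored); }             // Step E2
    filled.costCurrent = csOn ? costStored : 1;                            // Step F
    if ((missCount & kPeriodMask) == 0) endOfPeriod();  // 14 LSBs of misscount are zero
  }

  // Step G: promote a block on an LLC hit.
  void onHit(Line& hit) { hit.costCurrent = inc2(hit.costCurrent); }

  // Step Op1 (Optimization 1) and the CS on/off check (Optimization 2).
  void endOfPeriod() {
    if      (totalHigh < kMinHigh)                thresh += 8;
    else if (totalHigh > kMaxHigh && thresh >= 8) thresh -= 8;
    uint32_t avg  = sumIssued >> 14;               // divide by 16K misses per period
    uint32_t diff = (avg > oldAvg) ? avg - oldAvg : oldAvg - avg;
    csOn = (diff <= kAvgTolerance);
    oldAvg = avg; totalHigh = 0; sumIssued = 0;
  }
};
```

In hardware these operations map onto the counters and comparator of Figure 4 and the end-of-period logic; the sketch only makes the sequencing of the table's steps explicit.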

The IIC is a 32-bit counter in the pipeline. Each MSHR entry is augmented with a 32-bit IIR. In addition, five 32-bit registers/counters are used by LACS: thresh, misscount, sumissued, totalhigh, and numissued_old_avg. Assuming a 32-entry MSHR, the total storage overhead for these registers and counters would be 4 x (32 + 1 + 5) = 152 bytes.

Fig. 5. LACS's Storage Organization (sizes are not to scale): the LACS engine and the IIC sit beside the processor core (with its L1 I/D and L2 caches), while the history table, the IIRs in the MSHR, and the per-line cost_current fields are attached to the LLC.

The operations performed by LACS are limited to increment/decrement, addition/subtraction, shifting, and comparison. These operations only require simple logic and a small hardware overhead. Moreover, most of these operations are performed infrequently (only on an LLC miss), allowing the same units to be shared between different components. For example, all MSHR entries can share the same cost estimation logic (Figure 4). Moreover, since most operations are performed on an LLC miss or at the end of a period, they can be performed off the critical path and their latencies can be overlapped with the miss latency.

D. LACS's Implementation in Shared Caches

LACS can be used in both private and shared LLCs. Since applications have different run-time LLC access characteristics, such as different numissued value distributions and different access frequencies, it is wiser to use private LACS meta-data per core (a core could be a physical core, as in CMPs, or a virtual core, as in SMTs). Sharing the meta-data can lead to skewness and bias problems. Therefore, LACS's storage and engine are duplicated for each core.

However, since many hardware units are used infrequently, they can be shared between cores. For example, a core's periodic updates are very infrequent (every 16K misses); thus, the hardware unit that performs these updates can be shared, although it will use and update each core's private meta-data individually.

LACS performs the operations explained earlier for each core independently. However, when a victim block needs to be chosen, all blocks in the set are possible candidates, regardless of the core they belong to. Studies have shown that threads do not put the same pressure on all sets [16], [30]. Allowing LACS to randomly choose a victim from among all the low-cost blocks in the set results in a dynamic and adaptive partitioning of individual cache sets that reflects each core's demand. However, such a solution, if used alone, can lead to cache thrashing and thread starvation, especially when one thread has a much larger cache access frequency compared to the other thread(s). Reserving a subset of each core's blocks in the cache can significantly reduce thrashing [10]-[12]. LACS can reserve blocks in the cache by assigning high cost to them. To prevent thrashing and improve fairness, LACS raises and lowers a core's threshold value based on its number of blocks in the cache. For instance, if a core has very few blocks in the cache, its threshold value is raised, which increases the likelihood of its blocks being classified as high-cost and thus reserved. On the other hand, if a core has too many blocks in the cache, its threshold value is lowered, which increases the likelihood of its blocks being classified as low-cost and thus evicted early. This can be done by utilizing the dynamic thresh calculation optimization discussed earlier (Section IV-B and Step Op1 in Table I).

E. Other Implementation Issues

Prefetching is commonly used and can be easily integrated with LACS. Prefetched blocks should not update the history table, since they do not stall the processor and thus do not have a numissued value. Moreover, LACS can even help by identifying and prioritizing the prefetching of high-cost blocks. This is left as future work.

Although not evaluated explicitly, we believe the impact of LACS on power consumption to be modest. First, the total storage required is small, adding a relatively negligible static power consumption. Second, most LACS operations are performed (infrequently) on an LLC miss, thus increasing dynamic power consumption only slightly. This slight increase in power consumption is offset by the expected power savings due to improved performance.
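The per-core threshold adjustment described above for shared LLCs can be sketched as follows. The direction of the adjustment (raise thresh for under-represented cores, lower it for over-represented ones) and the +/-8 step follow the text and Step Op1; the fair-share occupancy band, the core count, and the structure names are illustrative assumptions of this sketch.

```cpp
#include <array>
#include <cstdint>

constexpr int      kCores      = 4;
constexpr uint32_t kTotalLines = (4u * 1024 * 1024) / 64;   // 4MB shared LLC, 64B lines

struct CoreMeta {
  uint32_t thresh    = 64;   // per-core threshold (initial value assumed)
  uint32_t occupancy = 0;    // number of LLC lines currently owned by this core
};

// Called at the end of each core's period: raise thresh when the core holds far
// fewer lines than its fair share (so more of its blocks are kept as high-cost),
// lower it when it holds far more (so its blocks age out and are evicted sooner).
void adjustSharedThresh(std::array<CoreMeta, kCores>& cores) {
  const uint32_t fairShare = kTotalLines / kCores;
  for (auto& c : cores) {
    if      (c.occupancy < fairShare / 2)           c.thresh += 8;  // under-represented
    else if (c.occupancy > (3 * fairShare) / 2 &&
             c.thresh >= 8)                         c.thresh -= 8;  // over-represented
    // otherwise the regular Step Op1 update governs thresh
  }
}
```

As the text notes, this reuses the existing end-of-period thresh update machinery; the sketch only makes the occupancy-driven direction of the adjustment explicit.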

V. EVALUATION ENVIRONMENT

LACS is evaluated using a detailed, execution-driven, cycle-accurate simulator based on SESC [31]. The modeled processor is a 4-way out-of-order superscalar with a 128-entry ROB and 8 stages (with different execution-stage delays). The modeled cache hierarchy has 3 levels and is based on an Intel Core i7 system. Table II lists the different memory hierarchy parameters used in the evaluation. Latencies in the table correspond to contention-free conditions. We model port/bus contention and queuing delays in detail. LACS parameters follow those from Section IV.

TABLE II: PARAMETERS OF THE SIMULATED MEMORY HIERARCHY
  L1 Inst. Cache:  32 KB, 4-way, 64B lines, private, WB, LRU, 1-cycle, 32-entry MSHR
  L1 Data Cache:   32 KB, 8-way, 64B lines, private, WB, LRU, 2-cycle, 32-entry MSHR
  L2 Cache:        256 KB, 8-way, 64B lines, private, WB, LRU, 10-cycle, 32-entry MSHR
  L3 Cache (LLC):  1 MB per core (1MB/2MB/4MB for uni-/dual-/quad-core, respectively), 16-way, 64B lines, shared, WB, LRU, 30-cycle, 32-entry MSHR
  Main Memory:     250 cycles
  Memory Bus:      Split-transactions, 16B wide, 1/4 processor frequency

In addition, 23 of the 29 SPEC CPU2006 benchmarks [23] are used in the evaluation. The missing 6 benchmarks are excluded because our cross-compiler does not support Fortran 9x. The benchmarks were compiled with gcc using the -O3 optimization level. The reference input sets were used for all benchmarks. Table III lists the 23 benchmarks used in the evaluation along with their LLC miss rates, LLC misses per 1000 instructions (MPKI), and the fraction of their execution times stalled on LLC misses. This fraction was calculated from the execution times of a base LLC versus a perfect always-hit LLC. A benchmark is considered LLC-performance-constrained if over 25% of its execution time is spent on LLC miss stalls. Those benchmarks are grouped into Group A in the table and are the focus of our evaluation. Since LACS attempts to reduce the aggregate miss cost, its impact will only be evident in benchmarks where LLC miss stalls significantly impact the execution time. The remaining benchmarks are not LLC-performance-constrained and are placed in Group B.

We evaluate LACS under two configurations: a uniprocessor with a private LLC, and a dual/quad-core CMP system with a shared LLC. When running uniprocessor simulations, each benchmark is simulated for five billion instructions after fast forwarding the first five billion instructions.

TABLE III: THE BENCHMARKS USED IN THE EVALUATION WITH THEIR LLC-BASED STATISTICS
(* marks the 8 representative benchmarks used to construct the CMP workloads)

Group A (LLC-performance-constrained):
  astar: 95% miss rate
  bwaves: 98% miss rate
  gcc*: 73% miss rate, 30% stalls
  hmmer*: 33% miss rate, 28% stalls
  lbm*: 100% miss rate
  libquantum: 100% miss rate
  mcf*: 73% miss rate, 72% stalls
  milc: 100% miss rate
  omnetpp*: 97% miss rate
  sphinx3: 93% miss rate
  xalancbmk*: 85% miss rate
  zeusmp: 88% miss rate

Group B:
  bzip2: 39% miss rate
  calculix*: 75% miss rate, <0.1 MPKI, 1% stalls
  dealII*: 5% miss rate, 0.4 MPKI, 9% stalls
  gobmk: 28% miss rate, 0.5 MPKI, 7% stalls
  gromacs: 45% miss rate, 0.6 MPKI, 14% stalls
  h264ref: 4% miss rate, <0.1 MPKI, 1% stalls
  namd: 5% miss rate, 0.1 MPKI, 2% stalls
  perlbench: 88% miss rate, 0.3 MPKI, 3% stalls
  povray: 2% miss rate, <0.1 MPKI
  sjeng: 87% miss rate, 0.3 MPKI, 4% stalls
  soplex: 44% miss rate

When running CMP simulations, the first five billion instructions in each benchmark are fast forwarded, and then the simulation runs until each benchmark has completed two billion instructions. If a benchmark completes two billion instructions, it continues execution until all benchmarks do. However, results are reported for the first two billion instructions of each benchmark. We construct all 36 possible pairs from 8 representative benchmarks (marked with asterisks in Table III; see Footnote 6). We also randomly construct 10 heterogeneous quadruples from the 23 benchmarks. A quadruple contains at least two Group A benchmarks.

Footnote 6: These 8 benchmarks cover different application characteristics in terms of IPC, MPKI, and LLC access frequencies.

VI. EVALUATION

In this section, the performance of LACS is first studied in the context of private LLCs (uniprocessors) (Section VI-A), where it is compared to other cache replacement algorithms. Its sensitivity to different optimizations and cache sizes is also investigated. After that, LACS's performance is studied in the context of shared LLCs (CMPs) (Section VI-B).

A. Evaluation in a Private LLC Environment

Figure 6 shows the percentage of IPC performance improvement (speedup) of LACS and three other cache replacement algorithms over the base LRU. LACS randomly chooses a victim block from among the low-cost blocks in a set, so it is compared against a random replacement algorithm (RND) to show that LACS's speedup is not due to the randomness in selecting a victim.

Fig. 6. IPC Performance Improvement (Speedup) of LACS Compared to Other Cache Replacement Algorithms (RND, MLP-SBAR, SHiP, and LACS over the base LRU, for the Group A benchmarks plus AVG-GrpA and AVG-ALL).

LACS is also compared against a state-of-the-art cost-sensitive cache replacement algorithm: MLP-aware replacement using SBAR (MLP-SBAR) [21]. In addition, LACS is compared to a state-of-the-art locality-based cache replacement algorithm: Signature-based Hit Predictor (SHiP) [9]. SHiP-PC-S-R2 is used in our implementation. MLP-SBAR and SHiP are the best-performing cost-based and locality-based cache replacement algorithms to date, respectively. AVG-GrpA and AVG-ALL show the average speedup achieved for the Group A benchmarks versus all benchmarks (Groups A+B), respectively.

The figure shows that LACS achieves an average speedup of about 11% for the LLC-performance-constrained Group A benchmarks, and up to 51% (in mcf). LACS speeds up 6 benchmarks by 5% or more (gcc, hmmer, mcf, omnetpp, sphinx3, and xalancbmk). Furthermore, LACS does not slow down any benchmark by more than 2%, demonstrating its robustness. The average speedup achieved by RND for the Group A benchmarks is comparably insignificant (2%). Moreover, RND suffers from pathological behavior in some Group B benchmarks, reducing its overall speedup for all benchmarks to less than 1%. MLP-SBAR speeds up the same 6 benchmarks that LACS does, but LACS outperforms MLP-SBAR in all of them except omnetpp. Overall, the average speedup achieved by LACS is larger than that achieved by MLP-SBAR for both the Group A benchmarks (11% vs. 7%) and all benchmarks. SHiP speeds up 5 of the 6 benchmarks that LACS speeds up (gcc, hmmer, mcf, omnetpp, and xalancbmk). While LACS speeds up sphinx3, SHiP slows it down by 11%, indicating that SHiP is less robust compared to LACS. Overall, the average speedup achieved by LACS is larger than that achieved by SHiP for both the Group

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell

More information

Perceptron Learning for Reuse Prediction

Perceptron Learning for Reuse Prediction Perceptron Learning for Reuse Prediction Elvira Teran Zhe Wang Daniel A. Jiménez Texas A&M University Intel Labs {eteran,djimenez}@tamu.edu zhe2.wang@intel.com Abstract The disparity between last-level

More information

Addressing End-to-End Memory Access Latency in NoC-Based Multicores

Addressing End-to-End Memory Access Latency in NoC-Based Multicores Addressing End-to-End Memory Access Latency in NoC-Based Multicores Akbar Sharifi, Emre Kultursay, Mahmut Kandemir and Chita R. Das The Pennsylvania State University University Park, PA, 682, USA {akbar,euk39,kandemir,das}@cse.psu.edu

More information

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez Summary Read misses are more cri?cal than write misses

More information

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors

Resource-Conscious Scheduling for Energy Efficiency on Multicore Processors Resource-Conscious Scheduling for Energy Efficiency on Andreas Merkel, Jan Stoess, Frank Bellosa System Architecture Group KIT The cooperation of Forschungszentrum Karlsruhe GmbH and Universität Karlsruhe

More information

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems

Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Balancing DRAM Locality and Parallelism in Shared Memory CMP Systems Min Kyu Jeong, Doe Hyun Yoon^, Dam Sunwoo*, Michael Sullivan, Ikhwan Lee, and Mattan Erez The University of Texas at Austin Hewlett-Packard

More information

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India

More information

A Hybrid Adaptive Feedback Based Prefetcher

A Hybrid Adaptive Feedback Based Prefetcher A Feedback Based Prefetcher Santhosh Verma, David M. Koppelman and Lu Peng Department of Electrical and Computer Engineering Louisiana State University, Baton Rouge, LA 78 sverma@lsu.edu, koppel@ece.lsu.edu,

More information

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables Storage Efficient Hardware Prefetching using Correlating Prediction Tables Marius Grannaes Magnus Jahre Lasse Natvig Norwegian University of Science and Technology HiPEAC European Network of Excellence

More information

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches AB-Aware: Application Behavior Aware Management of Shared Last Level Caches Suhit Pai, Newton Singh and Virendra Singh Computer Architecture and Dependable Systems Laboratory Department of Electrical Engineering

More information

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory

The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory The Application Slowdown Model: Quantifying and Controlling the Impact of Inter-Application Interference at Shared Caches and Main Memory Lavanya Subramanian* Vivek Seshadri* Arnab Ghosh* Samira Khan*

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department SIP: Speculative Insertion Policy for High Performance Caching Hongil Yoon Tan Zhang Mikko H. Lipasti Technical Report #1676 June 2010 SIP: Speculative Insertion Policy for

More information

A Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt

A Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas 78712-24 TR-HPS-26-3

More information

Improving Cache Performance using Victim Tag Stores

Improving Cache Performance using Victim Tag Stores Improving Cache Performance using Victim Tag Stores SAFARI Technical Report No. 2011-009 Vivek Seshadri, Onur Mutlu, Todd Mowry, Michael A Kozuch {vseshadr,tcm}@cs.cmu.edu, onur@cmu.edu, michael.a.kozuch@intel.com

More information

Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems

Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems 1 Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems Dimitris Kaseridis, Member, IEEE, Muhammad Faisal Iqbal, Student Member, IEEE and Lizy Kurian John,

More information

Bias Scheduling in Heterogeneous Multi-core Architectures

Bias Scheduling in Heterogeneous Multi-core Architectures Bias Scheduling in Heterogeneous Multi-core Architectures David Koufaty Dheeraj Reddy Scott Hahn Intel Labs {david.a.koufaty, dheeraj.reddy, scott.hahn}@intel.com Abstract Heterogeneous architectures that

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

Sandbox Based Optimal Offset Estimation [DPC2]

Sandbox Based Optimal Offset Estimation [DPC2] Sandbox Based Optimal Offset Estimation [DPC2] Nathan T. Brown and Resit Sendag Department of Electrical, Computer, and Biomedical Engineering Outline Motivation Background/Related Work Sequential Offset

More information

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez Summary Read misses are more cri?cal than write misses

More information

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Journal of Instruction-Level Parallelism 13 (11) 1-14 Submitted 3/1; published 1/11 Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Santhosh Verma

More information

Microarchitecture Overview. Performance

Microarchitecture Overview. Performance Microarchitecture Overview Prof. Scott Rixner Duncan Hall 3028 rixner@rice.edu January 15, 2007 Performance 4 Make operations faster Process improvements Circuit improvements Use more transistors to make

More information

Multiperspective Reuse Prediction

Multiperspective Reuse Prediction ABSTRACT Daniel A. Jiménez Texas A&M University djimenezacm.org The disparity between last-level cache and memory latencies motivates the search for e cient cache management policies. Recent work in predicting

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip

More information

RECENT studies have shown that, in highly associative

RECENT studies have shown that, in highly associative IEEE TRANSACTIONS ON COMPUTERS, VOL. 57, NO. 4, APRIL 2008 433 Counter-Based Cache Replacement and Bypassing Algorithms Mazen Kharbutli, Member, IEEE, and Yan Solihin, Member, IEEE Abstract Recent studies

More information

Filtered Runahead Execution with a Runahead Buffer

Filtered Runahead Execution with a Runahead Buffer Filtered Runahead Execution with a Runahead Buffer ABSTRACT Milad Hashemi The University of Texas at Austin miladhashemi@utexas.edu Runahead execution dynamically expands the instruction window of an out

More information

Improving Cache Management Policies Using Dynamic Reuse Distances

Improving Cache Management Policies Using Dynamic Reuse Distances Improving Cache Management Policies Using Dynamic Reuse Distances Nam Duong, Dali Zhao, Taesu Kim, Rosario Cammarota, Mateo Valero and Alexander V. Veidenbaum University of California, Irvine Universitat

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

Portland State University ECE 587/687. Caches and Memory-Level Parallelism

Portland State University ECE 587/687. Caches and Memory-Level Parallelism Portland State University ECE 587/687 Caches and Memory-Level Parallelism Revisiting Processor Performance Program Execution Time = (CPU clock cycles + Memory stall cycles) x clock cycle time For each

More information

A Bandwidth-aware Memory-subsystem Resource Management using. Non-invasive Resource Profilers for Large CMP Systems

A Bandwidth-aware Memory-subsystem Resource Management using. Non-invasive Resource Profilers for Large CMP Systems A Bandwidth-aware Memory-subsystem Resource Management using Non-invasive Resource Profilers for Large CMP Systems Dimitris Kaseridis, Jeffrey Stuecheli, Jian Chen and Lizy K. John Department of Electrical

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Understanding The Effects of Wrong-path Memory References on Processor Performance

Understanding The Effects of Wrong-path Memory References on Processor Performance Understanding The Effects of Wrong-path Memory References on Processor Performance Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt The University of Texas at Austin 2 Motivation Processors spend

More information

NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems

NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems Rentong Guo 1, Xiaofei Liao 1, Hai Jin 1, Jianhui Yue 2, Guang Tan 3 1 Huazhong University of Science

More information

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era

More information

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories DEMM: a Dynamic Energy-saving mechanism for Multicore Memories Akbar Sharifi, Wei Ding 2, Diana Guttman 3, Hui Zhao 4, Xulong Tang 5, Mahmut Kandemir 5, Chita Das 5 Facebook 2 Qualcomm 3 Intel 4 University

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era

Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era Dimitris Kaseridis Electrical and Computer Engineering The University of Texas at Austin Austin, TX, USA kaseridis@mail.utexas.edu

More information

Spatial Memory Streaming (with rotated patterns)

Spatial Memory Streaming (with rotated patterns) Spatial Memory Streaming (with rotated patterns) Michael Ferdman, Stephen Somogyi, and Babak Falsafi Computer Architecture Lab at 2006 Stephen Somogyi The Memory Wall Memory latency 100 s clock cycles;

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University

Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University Lecture 4: Advanced Caching Techniques (2) Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee282 Lecture 4-1 Announcements HW1 is out (handout and online) Due on 10/15

More information

Multithreaded Processors. Department of Electrical Engineering Stanford University

Multithreaded Processors. Department of Electrical Engineering Stanford University Lecture 12: Multithreaded Processors Department of Electrical Engineering Stanford University http://eeclass.stanford.edu/ee382a Lecture 12-1 The Big Picture Previous lectures: Core design for single-thread

More information

Prefetch-Aware DRAM Controllers

Prefetch-Aware DRAM Controllers Prefetch-Aware DRAM Controllers Chang Joo Lee Onur Mutlu Veynu Narasiman Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin

More information

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1 Available online at www.sciencedirect.com Physics Procedia 33 (2012 ) 1029 1035 2012 International Conference on Medical Physics and Biomedical Engineering Memory Performance Characterization of SPEC CPU2006

More information

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance

More information

Cache Controller with Enhanced Features using Verilog HDL

Cache Controller with Enhanced Features using Verilog HDL Cache Controller with Enhanced Features using Verilog HDL Prof. V. B. Baru 1, Sweety Pinjani 2 Assistant Professor, Dept. of ECE, Sinhgad College of Engineering, Vadgaon (BK), Pune, India 1 PG Student

More information

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies

Donn Morrison Department of Computer Science. TDT4255 Memory hierarchies TDT4255 Lecture 10: Memory hierarchies Donn Morrison Department of Computer Science 2 Outline Chapter 5 - Memory hierarchies (5.1-5.5) Temporal and spacial locality Hits and misses Direct-mapped, set associative,

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Energy Models for DVFS Processors

Energy Models for DVFS Processors Energy Models for DVFS Processors Thomas Rauber 1 Gudula Rünger 2 Michael Schwind 2 Haibin Xu 2 Simon Melzner 1 1) Universität Bayreuth 2) TU Chemnitz 9th Scheduling for Large Scale Systems Workshop July

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Fazal Hameed and Jeronimo Castrillon Center for Advancing Electronics Dresden (cfaed), Technische Universität Dresden,

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches

Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches Insertion and Promotion for Tree-Based PseudoLRU Last-Level Caches Daniel A. Jiménez Department of Computer Science and Engineering Texas A&M University ABSTRACT Last-level caches mitigate the high latency

More information

Footprint-based Locality Analysis

Footprint-based Locality Analysis Footprint-based Locality Analysis Xiaoya Xiang, Bin Bao, Chen Ding University of Rochester 2011-11-10 Memory Performance On modern computer system, memory performance depends on the active data usage.

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Lecture-16 (Cache Replacement Policies) CS422-Spring

Lecture-16 (Cache Replacement Policies) CS422-Spring Lecture-16 (Cache Replacement Policies) CS422-Spring 2018 Biswa@CSE-IITK 1 2 4 8 16 32 64 128 From SPEC92 Miss rate: Still Applicable Today 0.14 0.12 0.1 0.08 0.06 0.04 1-way 2-way 4-way 8-way Capacity

More information

Lightweight Memory Tracing

Lightweight Memory Tracing Lightweight Memory Tracing Mathias Payer*, Enrico Kravina, Thomas Gross Department of Computer Science ETH Zürich, Switzerland * now at UC Berkeley Memory Tracing via Memlets Execute code (memlets) for

More information

A Fast Instruction Set Simulator for RISC-V

A Fast Instruction Set Simulator for RISC-V A Fast Instruction Set Simulator for RISC-V Maxim.Maslov@esperantotech.com Vadim.Gimpelson@esperantotech.com Nikita.Voronov@esperantotech.com Dave.Ditzel@esperantotech.com Esperanto Technologies, Inc.

More information

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor Sarah Bird ϕ, Aashish Phansalkar ϕ, Lizy K. John ϕ, Alex Mericas α and Rajeev Indukuru α ϕ University

More information

Systems Programming and Computer Architecture ( ) Timothy Roscoe

Systems Programming and Computer Architecture ( ) Timothy Roscoe Systems Group Department of Computer Science ETH Zürich Systems Programming and Computer Architecture (252-0061-00) Timothy Roscoe Herbstsemester 2016 AS 2016 Caches 1 16: Caches Computer Architecture

More information

Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads

Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads Milad Hashemi, Onur Mutlu, Yale N. Patt The University of Texas at Austin ETH Zürich ABSTRACT Runahead execution pre-executes

More information

DynRBLA: A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories

DynRBLA: A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories SAFARI Technical Report No. 2-5 (December 6, 2) : A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories HanBin Yoon hanbinyoon@cmu.edu Justin Meza meza@cmu.edu

More information

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Microsoft ssri@microsoft.com Santhosh Srinath Onur Mutlu Hyesoon Kim Yale N. Patt Microsoft Research

More information

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory

Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory Cache Memories Cache memories are small, fast SRAM-based memories managed automatically in hardware. Hold frequently accessed blocks of main memory CPU looks first for data in caches (e.g., L1, L2, and

More information

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( )

Lecture 15: Caches and Optimization Computer Architecture and Systems Programming ( ) Systems Group Department of Computer Science ETH Zürich Lecture 15: Caches and Optimization Computer Architecture and Systems Programming (252-0061-00) Timothy Roscoe Herbstsemester 2012 Last time Program

More information

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering

Multiprocessors and Thread-Level Parallelism. Department of Electrical & Electronics Engineering, Amrita School of Engineering Multiprocessors and Thread-Level Parallelism Multithreading Increasing performance by ILP has the great advantage that it is reasonable transparent to the programmer, ILP can be quite limited or hard to

More information

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15

CS24: INTRODUCTION TO COMPUTING SYSTEMS. Spring 2017 Lecture 15 CS24: INTRODUCTION TO COMPUTING SYSTEMS Spring 2017 Lecture 15 LAST TIME: CACHE ORGANIZATION Caches have several important parameters B = 2 b bytes to store the block in each cache line S = 2 s cache sets

More information

CS3350B Computer Architecture

CS3350B Computer Architecture CS335B Computer Architecture Winter 25 Lecture 32: Exploiting Memory Hierarchy: How? Marc Moreno Maza wwwcsduwoca/courses/cs335b [Adapted from lectures on Computer Organization and Design, Patterson &

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

Operating Systems. Operating Systems Sina Meraji U of T

Operating Systems. Operating Systems Sina Meraji U of T Operating Systems Operating Systems Sina Meraji U of T Recap Last time we looked at memory management techniques Fixed partitioning Dynamic partitioning Paging Example Address Translation Suppose addresses

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

OpenPrefetch. (in-progress)

OpenPrefetch. (in-progress) OpenPrefetch Let There Be Industry-Competitive Prefetching in RISC-V Processors (in-progress) Bowen Huang, Zihao Yu, Zhigang Liu, Chuanqi Zhang, Sa Wang, Yungang Bao Institute of Computing Technology(ICT),

More information

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache

10/16/2017. Miss Rate: ABC. Classifying Misses: 3C Model (Hill) Reducing Conflict Misses: Victim Buffer. Overlapping Misses: Lockup Free Cache Classifying Misses: 3C Model (Hill) Divide cache misses into three categories Compulsory (cold): never seen this address before Would miss even in infinite cache Capacity: miss caused because cache is

More information

Multi-Cache Resizing via Greedy Coordinate Descent

Multi-Cache Resizing via Greedy Coordinate Descent Noname manuscript No. (will be inserted by the editor) Multi-Cache Resizing via Greedy Coordinate Descent I. Stephen Choi Donald Yeung Received: date / Accepted: date Abstract To reduce power consumption

More information

Improving Writeback Efficiency with Decoupled Last-Write Prediction

Improving Writeback Efficiency with Decoupled Last-Write Prediction Improving Writeback Efficiency with Decoupled Last-Write Prediction Zhe Wang Samira M. Khan Daniel A. Jiménez The University of Texas at San Antonio {zhew,skhan,dj}@cs.utsa.edu Abstract In modern DDRx

More information

A Front-end Execution Architecture for High Energy Efficiency

A Front-end Execution Architecture for High Energy Efficiency A Front-end Execution Architecture for High Energy Efficiency Ryota Shioya, Masahiro Goshima and Hideki Ando Department of Electrical Engineering and Computer Science, Nagoya University, Aichi, Japan Information

More information

UCB CS61C : Machine Structures

UCB CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 36 Performance 2010-04-23 Lecturer SOE Dan Garcia How fast is your computer? Every 6 months (Nov/June), the fastest supercomputers in

More information

Data Prefetching by Exploiting Global and Local Access Patterns

Data Prefetching by Exploiting Global and Local Access Patterns Journal of Instruction-Level Parallelism 13 (2011) 1-17 Submitted 3/10; published 1/11 Data Prefetching by Exploiting Global and Local Access Patterns Ahmad Sharif Hsien-Hsin S. Lee School of Electrical

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Limiting the Number of Dirty Cache Lines

Limiting the Number of Dirty Cache Lines Limiting the Number of Dirty Cache Lines Pepijn de Langen and Ben Juurlink Computer Engineering Laboratory Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology

More information

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University

Multiprocessing and Scalability. A.R. Hurson Computer Science and Engineering The Pennsylvania State University A.R. Hurson Computer Science and Engineering The Pennsylvania State University 1 Large-scale multiprocessor systems have long held the promise of substantially higher performance than traditional uniprocessor

More information

Micro-sector Cache: Improving Space Utilization in Sectored DRAM Caches

Micro-sector Cache: Improving Space Utilization in Sectored DRAM Caches Micro-sector Cache: Improving Space Utilization in Sectored DRAM Caches Mainak Chaudhuri Mukesh Agrawal Jayesh Gaur Sreenivas Subramoney Indian Institute of Technology, Kanpur 286, INDIA Intel Architecture

More information

Combining Local and Global History for High Performance Data Prefetching

Combining Local and Global History for High Performance Data Prefetching Combining Local and Global History for High Performance Data ing Martin Dimitrov Huiyang Zhou School of Electrical Engineering and Computer Science University of Central Florida {dimitrov,zhou}@eecs.ucf.edu

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Wenun Wang and Wei-Ming Lin Department of Electrical and Computer Engineering, The University

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

An Intelligent Fetching algorithm For Efficient Physical Register File Allocation In Simultaneous Multi-Threading CPUs

An Intelligent Fetching algorithm For Efficient Physical Register File Allocation In Simultaneous Multi-Threading CPUs International Journal of Computer Systems (ISSN: 2394-1065), Volume 04 Issue 04, April, 2017 Available at http://www.ijcsonline.com/ An Intelligent Fetching algorithm For Efficient Physical Register File

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013

18-447: Computer Architecture Lecture 25: Main Memory. Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 18-447: Computer Architecture Lecture 25: Main Memory Prof. Onur Mutlu Carnegie Mellon University Spring 2013, 4/3/2013 Reminder: Homework 5 (Today) Due April 3 (Wednesday!) Topics: Vector processing,

More information

Prefetch-Aware DRAM Controllers

Prefetch-Aware DRAM Controllers Prefetch-Aware DRAM Controllers Chang Joo Lee Onur Mutlu Veynu Narasiman Yale N. Patt Department of Electrical and Computer Engineering The University of Texas at Austin {cjlee, narasima, patt}@ece.utexas.edu

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems

Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems Reetuparna Das Rachata Ausavarungnirun Onur Mutlu Akhilesh Kumar Mani Azimi University of Michigan Carnegie

More information

Scheduling the Intel Core i7

Scheduling the Intel Core i7 Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne

More information

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip

Reducing Hit Times. Critical Influence on cycle-time or CPI. small is always faster and can be put on chip Reducing Hit Times Critical Influence on cycle-time or CPI Keep L1 small and simple small is always faster and can be put on chip interesting compromise is to keep the tags on chip and the block data off

More information

A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach

A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach Asit K. Mishra Onur Mutlu Chita R. Das Executive summary Problem: Current day NoC designs are agnostic to application requirements

More information

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy

Beyond ILP II: SMT and variants. 1 Simultaneous MT: D. Tullsen, S. Eggers, and H. Levy EE482: Advanced Computer Organization Lecture #13 Processor Architecture Stanford University Handout Date??? Beyond ILP II: SMT and variants Lecture #13: Wednesday, 10 May 2000 Lecturer: Anamaya Sullery

More information