Multi-Cache Resizing via Greedy Coordinate Descent


I. Stephen Choi · Donald Yeung

Received: date / Accepted: date

Abstract: To reduce power consumption in CPUs, researchers have studied dynamic cache resizing. However, existing techniques only resize a single cache within a uniprocessor or the shared last-level cache (LLC) within a multicore CPU. To maximize benefits, it is necessary to resize all caches, which in today's CPUs includes one or two private caches per core and a shared LLC. Such multi-cache resizing (MCR) is challenging because the multiple resizing decisions are coupled, yielding an enormous configuration space. In this paper, we present a dynamic MCR technique that uses search-based optimization. Our main contribution is a set of heuristics that enable the search to find the best configuration rapidly. In particular, our search moves in a coordinate descent (Manhattan) fashion across the configuration space. At each search step, we select the next cache for resizing greedily based on a power efficiency gain (PEG) metric. To further enhance search speed, we permit parallel greedy selection. Across 60 multiprogrammed workloads, our technique reduces power by 13.9% while sacrificing 1.5% of the performance.

Keywords: Cache Resizing · Multicore CPUs · Search-Based Optimization · Power-Efficient Computing

I. Stephen Choi, Samsung, 3655 N 1st Street, San Jose, CA. E-mail: stephen.ch@samsung.com
Donald Yeung, University of Maryland at College Park, 1323 A. V. Williams, College Park, MD. E-mail: yeung@umd.edu

1 Introduction

Power consumption has been the most critical problem facing computer architects over the past decade [23], and is still the main limiter to achieving high performance in today's CPUs. Unfortunately, this problem will only get worse as process technologies continue to scale to smaller feature sizes. As such, power efficiency will remain an extremely important design goal, requiring hardware designers to continue looking for ways to squeeze wasteful power consumption out of architectures.

A key place to look for power savings is the on-chip cache hierarchy. Caches occupy a large portion of the CPU's die area (upwards of 50% in today's CPUs), so they contribute significantly to a processor's overall power budget. In addition, caches are sized for the worst case. This means most computations cannot effectively utilize all of the cache capacity. Such over-provisioning can result in significant waste that, if eliminated, can potentially yield large power savings.

Several researchers have investigated dynamic cache resizing to target such waste [1,17,3,4,26,27,34,40,46,47]. The idea is to monitor a cache's behavior at runtime (e.g., using way counters [38]) and dynamically reconfigure its capacity by enabling/disabling cache ways or sets to trade off performance for power. In particular, as a cache is downsized, its power consumption reduces. (Dynamic power goes down because less cache is activated per access, while static power also goes down because unused cache can be power gated.) But this comes at the expense of additional cache misses, which can degrade performance and increase power consumption at the next level of the memory hierarchy. Cache resizing tries to pick the capacity that optimizes this tradeoff, typically by using way counters to evaluate, perhaps exhaustively, the performance and power that would have occurred under different configurations.

Although there has been significant work on dynamic cache resizing, existing techniques are very limited in scope. Most only consider resizing a single cache within a uniprocessor cache hierarchy [1,26,27,34,46,47]. Balasubramonian's work [3,4] resizes two levels of cache (also for a uniprocessor), but not independently, as the sum of the two cache sizes is always fixed. So again, there is only one cache that is explicitly resized. More recently, researchers have begun studying resizing for multicore CPUs [17,40,44]. Unfortunately, Wang only offers an off-line solution. While Kedzierski and Sundararajan propose dynamic techniques, they only target the shared last-level cache (LLC). Granted, they perform partitioning, which requires selecting multiple partition sizes within the LLC, but they still only resize a single cache.

This is limiting because modern CPUs employ many caches, typically one or two private caches per core along with a shared LLC. No single cache will ever be responsible for all of the power consumption, and so dynamically resizing only one cache will not address all of the waste. To illustrate, Figure 1 breaks down the power consumed by several SPEC 2006 benchmarks running sequentially on a CPU with a three-level cache hierarchy.

[Figure 1: Cache power consumption breakdown. Y-axis: power consumption (Watt); bar segments: L1D, L1S, L2D, L2S, LLCD, LLCS.]

Bar-stacks are shown for each benchmark, reporting the power consumption by caching level (L1, L2, or LLC) and by type (dynamic, D, versus static, S). Although different types of power consumption tend to be localized (dynamic power is prevalent in the L1 while static power is prevalent in the L2 and LLC), every cache contributes non-trivially to the total power, with the L1, L2, and LLC contributing 36.2%, 27.5%, and 36.3%, respectively. While not explicitly shown in Figure 1, we observe a similar result when many SPEC benchmarks run on a multicore CPU, with power consumption distributed across multiple caching levels (as in Figure 1) and across multiple cores' caches. So, for dynamic cache resizing to be effective, it is crucial to perform resizing at multiple caches simultaneously.

An important question, then, is how such dynamic multi-cache resizing (MCR) should be conducted. If the behavior across different caches were independent, then optimizing each cache locally should result in a good solution globally. If this were the case, one could simply apply existing single-cache resizing techniques to each cache separately. Unfortunately, behaviors across caches in a multicore cache hierarchy are not independent. Instead, resizing decisions at different caches are coupled.

Coupling certainly occurs between different caching levels (i.e., vertically) within the same core. As mentioned above, resizing a cache affects the power consumption at the next caching level. In MCR, though, not only can we resize the upstream cache, we can also resize the downstream cache as well. If the pair of resizing decisions are made in concert, we can save more power. For example, when downsizing the upstream cache, we may be able to simultaneously downsize the downstream cache to reduce its access energy, thus lowering the power penalty for any additional cache misses from the smaller upstream cache. This in turn may enable even more aggressive downsizing of the upstream cache. Such MCR decisions can be applied in succession (i.e., resize the L1-L2 together as well as the L2-LLC together), which means resizing decisions at any pair of caches along the vertical dimension are coupled transitively.

In addition, as mentioned above, cache partitioning is performed as part of LLC resizing. This means the selection of each core's LLC capacity is coupled (i.e., horizontally) since cache partitioning techniques allocate the per-core capacities in a coordinated fashion from the same physical LLC [35]. Horizontal coupling at the LLC, in combination with vertical coupling between caching levels, means that in fact the resizing decisions at all caches in a multicore cache hierarchy are coupled, even the ones for private caches belonging to

different cores. Again, this is due to resizing decisions impacting each other transitively, but now the transitive coupling can occur across different cores by way of the shared LLC.

Hence, in order to achieve the most power savings, all of the caches in a multicore cache hierarchy should be resized in a globally consistent fashion. This means that compared to single-level cache resizing, MCR techniques must consider a much larger configuration space with very high dimensionality (essentially, each cache is an independent optimization variable). In particular, the total number of configurations grows as the cross-product of all per-cache configurations, including the partitioning configurations in the LLC. While single-level resizing techniques can use way counters to exhaustively explore configurations, the configuration space for MCR techniques is so large that it can be intractable to exhaustively explore even off-line, let alone during on-line optimization.

In this paper, we study new dynamic cache resizing techniques that address the complexity of MCR. Rather than try to predict the best configuration outright, we investigate search-based optimization techniques that evolve the cache hierarchy towards the best configuration over time. Existing search heuristics, like the Nelder-Mead (NM) simplex method [31], can already optimize objective functions in complex multi-dimensional spaces. But NM is not fast enough. Because we run the search on-line, very high search speed is needed to minimize runtime overhead and also permit adapting to a workload's dynamic behavior.

Our main contribution is a set of search heuristics for boosting the speed of searching MCR configurations. More specifically, our technique starts from the maximum allocation and downsizes caches one at a time, thus searching the multi-dimensional MCR space in a coordinate descent (i.e., Manhattan) fashion. (In the LLC, we use UCP [35] for the initial partitioning, and while we mainly downsize, we also explore upsizing partitions, which may become desirable as searching progresses.) At each search step, we select the next cache for downsizing greedily based on a power efficiency gain (PEG) metric that captures both the cost in performance and the benefit in power associated with each cache's downsizing. We find greedy coordinate descent (GCD) is effective at efficiently navigating through MCR spaces. To further enhance search speed (especially as core count scales), we permit parallel greedy selection across cores and give higher downsizing priority to caches with larger capacity.

We implemented our GCD MCR technique on a detailed architectural simulator, and conducted an in-depth evaluation of its effectiveness. In addition to GCD MCR, our evaluation also considers off-line algorithms that identify either the static optimal when it is tractable to do so, or a very good (i.e., an aggressive off-line) configuration when the configuration space becomes intractably large. We first show GCD MCR is effective for individual benchmarks running sequentially. Across 22 SPEC CPU2006 benchmarks, we find GCD MCR saves 13.4% of the power while sacrificing only 0.45% of the performance on the baseline system (without cache resizing). Moreover, GCD MCR is very close to the static optimal configuration, which saves 15.4% of

the power. Next, we show GCD MCR is also effective for multiprogrammed workloads running on multicore processors. We created 60 multiprogrammed workloads for this study. (Our current experiments do not consider parallel or multithreaded programs.) For the multiprogrammed workloads, we find GCD MCR saves 13.9% of the power, whereas an aggressive off-line technique saves 15.2% of the power, compared to the baseline. Performance degradation is slightly worse in the multicore case, but still acceptably small: 1.5% on average. These results show GCD MCR effectively navigates through large MCR configuration spaces to rapidly find very good configurations.

The rest of this paper is organized as follows. Section 2 presents the design of our GCD MCR technique, and Section 3 discusses its implementation. After describing our evaluation methodology in Section 4, Section 5 presents our results. Finally, Section 6 discusses related work, and Section 7 concludes the paper.

2 Greedy Coordinate Descent

We treat MCR as a constrained multi-variable optimization problem. More specifically, the size of each cache in the multicore CPU is considered to be a free variable. And, the goal is to determine the best allocation of capacity (i.e., the best variable setting) at every cache such that power is minimized under the constraint that performance degradation never exceeds some user-prescribed limit. (While achieving high power efficiency is the main goal, we also want to maintain a high level of performance, hence the constraint on performance degradation.)

To drive the constrained multi-variable optimization, we employ a search-based approach. We divide workload execution into time intervals, called epochs, and monitor performance and power consumption as different cache allocations are searched across epochs. In particular, we rely on hardware way counters [38] used in previous cache techniques [22,39,35,40] to monitor performance and power consumption. Although the way counters only track cache misses, average memory access time (AMAT) can be computed from their cache miss counts, which we use as a proxy for performance. The cache miss counts can be used to compute power consumption as well. Together, these yield estimates of power efficiency, the objective for our technique.

As discussed earlier, a major challenge for MCR is the complexity of the configuration space, making search speed a crucial design consideration. While search techniques such as Nelder-Mead (NM) exist, they are not fast enough for on-line optimization. Slow search techniques can cause significant overheads, which can also impede adapting to a workload's dynamic behaviors. In this section, we discuss several aspects of our search heuristic's design for boosting search speed.

Coordinate descent. To address the high complexity of MCR, we employ a coordinate descent method [42]. Coordinate descent has been shown to be

extremely effective at optimizing non-differentiable functions of multiple variables, and is efficient at solving large problems [32]. In this approach, each variable (cache) is optimized (resized) one at a time, resulting in a Manhattan movement over the multi-variable solution space. Moreover, the movement is always in the downward direction: we initialize all caches to their maximum allocation, and then run the search heuristic to move monotonically towards a reduced configuration. Together, the Manhattan and monotonic movements tend to effectively search the MCR configurations despite the coupling that occurs between caches, resulting in better optimization decisions. Lastly, to enable adaptation, we periodically reset all caches to their maximum allocation, and re-run the search heuristic from this initial configuration.

One issue is that immediately after downsizing a cache, the cache's hits and misses may take some time to reach their steady-state behavior. Only after steady state has been reached will the way counters accurately reflect the impact that downsizing had on the cache's performance. The problem is that the duration of this transient period depends on the capacity of the resized cache, which varies across caching levels. Hence, it is necessary to resize the caches from different caching levels at different rates. In particular, we resize smaller caches more frequently and larger caches less frequently.

Greedy search order. A crucial design question is in what order the caches should be downsized during coordinate descent. In our work, we downsize caches in a greedy fashion using a novel metric that we propose, called power efficiency gain (PEG). For each potential downsized cache, we define PEG to be the power consumption reduction divided by the performance loss that would result from the downsizing (as predicted by the way counters). Thus, our PEG metric captures both the benefit and the cost of downsizing a particular cache.

Our greedy coordinate descent (GCD) MCR technique uses this PEG metric in two ways. First, for each cache in the hierarchy, GCD MCR assesses the PEG resulting from every possible downsizing of the cache (exhaustive evaluation for a single cache is feasible), and identifies the one that achieves the maximum PEG value. Then, GCD compares the max-PEG values across different caches to identify the cache with the globally maximum PEG value.

Notice, the PEG metric can be used to compare any pair of caches. While all caches contribute to power consumption, they also contribute to overall performance. In particular, our GCD MCR technique uses weighted speedup as the performance metric, which considers all programs within a multiprogrammed workload. So, not only can PEG assess the cost-benefit of downsizing caches from the same core (which affects performance of a single program), it can also assess downsizing caches from different cores (which affects system-level performance through the weighted speedup metric). In this way, PEG enables greedy coordinate descent globally across all caches within a multicore cache hierarchy. However, while GCD MCR uses PEG to order caches for greedy selection, we do not always downsize the cache with the globally maximum PEG value, due to scalability issues, which we will discuss below.
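
To make the metric concrete, the following is a minimal sketch, in Python, of how a PEG value could be computed for one downsizing candidate from way-counter-based estimates. It is not the authors' implementation; every function name, parameter, and constant is a hypothetical placeholder for illustration only.

def amat(accesses, misses, hit_latency, miss_penalty):
    # Average memory access time implied by an access/miss count pair.
    return hit_latency + (misses / accesses) * miss_penalty

def peg_of_downsizing(accesses, misses_now, misses_small,
                      hit_latency, miss_penalty,
                      dyn_energy_saved_per_access, static_power_saved,
                      next_level_energy_per_miss, epoch_seconds,
                      amat_headroom):
    # Power saved per unit of AMAT lost when shrinking one cache, or -1 if
    # the extra misses would exceed the remaining AMAT headroom.
    extra_misses = misses_small - misses_now
    # Benefit: cheaper accesses and power-gated ways; cost: extra misses
    # that now spend energy at the next caching level.
    delta_power = (dyn_energy_saved_per_access * accesses / epoch_seconds
                   + static_power_saved
                   - extra_misses * next_level_energy_per_miss / epoch_seconds)
    delta_amat = (amat(accesses, misses_small, hit_latency, miss_penalty)
                  - amat(accesses, misses_now, hit_latency, miss_penalty))
    if delta_amat >= amat_headroom:
        return -1.0   # downsizing would violate the performance limit
    return delta_power / max(delta_amat, 1e-12)

Under this reading, the candidate with the largest such value, across every cache and every legal downsizing of that cache, is the one greedy selection would favor next.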

PEG-based cache partitioning / resizing. The LLC in a multicore CPU is often shared by all the cores. Hence, in addition to resizing the LLC, GCD MCR must also partition it (if it is shared). Many cache partitioning techniques have been studied in the past [38,22,39,35,43,7,25], but these techniques are focused on performance only. More recently, researchers have studied partitioning techniques that also downsize and power off unused portions of cache to save power [40]. Our approach is similar except we use our GCD MCR technique to control the partitioning and downsizing of the shared cache.

As mentioned above, GCD MCR periodically resets all caches to their maximum allocation and re-runs its search heuristic. For a shared LLC, GCD MCR also resets its capacity, but in addition, uses UCP [35] to determine an initial partitioning of the maximally-sized LLC. Then, during the run of its search heuristic, GCD MCR treats each LLC partition as if it were a separate cache in the hierarchy that can be resized independently. GCD MCR first considers downsizing each LLC partition, using the PEG metric to select the best candidate as normal. However, GCD MCR also considers upsizing each LLC partition as well. While downsizing usually moves towards a more power-efficient configuration, it may become profitable for some partitions to expand as others contract. By considering moving LLC partitions in both directions, our GCD MCR technique is more likely to find the globally best allocation.

Scalability. Our GCD MCR technique is relatively fast. As Section 5 will show, GCD MCR can converge in many fewer epochs than existing techniques for solving constrained multi-variable optimization problems. Nevertheless, one significant issue remains: the serial nature of the search heuristic. As discussed above, the basic coordinate descent method selects different caches (or LLC partitions) to resize one at a time across successive epochs. But this is not scalable. As the number of cores (and hence, the number of caches) increases, so will the time required for the search to converge. Another related scaling problem arises from the multi-rate fashion in which caches are considered for resizing. Because larger caches are resized at lower frequency, delaying their resizing can lengthen the time to convergence. Unfortunately, the probability that a smaller cache exhibits the globally maximum PEG value, and hence is selected for resizing over a given larger cache, grows as the number of cores increases.

To improve scalability, we perform multiple rounds of greedy selection per epoch, allowing many caches / LLC partitions to be resized as a group. One drawback with this enhancement is that we do not get feedback from the way counters after every resizing decision. This may degrade the quality of resizing if the simultaneous resizing decisions are coupled with one another. In general, resizing decisions at all caches are coupled, but we find coupling is greater between caches from the same core as compared to caches across different cores. So, when resizing multiple caches, we require that they be selected from different cores to minimize the negative effects of sampling way counters for groups of resized caches.
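
As a rough illustration of this grouped selection (hypothetical names; the paper's actual mechanism is expressed through Algorithm 1 in Section 3, not this code), one epoch's round for a given caching level could rank each core's best candidate by PEG and commit all profitable ones together, with at most one cache per core:

def select_group_for_level(per_core_candidates):
    # per_core_candidates: {core_id: (peg_value, proposed_way_allocation)}.
    # Returns the resizings to commit this epoch, best PEG first; candidates
    # with negative PEG (no profitable downsizing) are skipped.
    group = []
    ranked = sorted(per_core_candidates.items(),
                    key=lambda kv: kv[1][0], reverse=True)
    for core, (peg_value, new_alloc) in ranked:
        if peg_value < 0:
            continue
        group.append((core, new_alloc))
    return group

# Example: cores 0 and 2 have profitable candidates at this level, core 1 does not.
print(select_group_for_level({0: (3.2, 6), 1: (-1.0, 8), 2: (5.7, 4)}))
# -> [(2, 4), (0, 6)]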

Finally, to prevent large caches from impeding search speed, we give them the chance to resize when their turn to be considered comes up. In particular, for multicore processors, we never let smaller caches prevent larger caches from resizing, even if their PEG values are higher. Effectively, this means we only use PEG to order caches within the caching level whose turn it is to be considered for resizing. For example, when it is time to resize the L2 caches, we order all of the L2s by their PEG values, and downsize the one with maximum PEG; i.e., we ignore the L1 caches when comparing the L2 PEG values. Likewise, when it is time to resize the LLC partitions, we order all of them by their PEG values, ignoring the L1 and L2 PEG values. For uniprocessors, we relax this requirement, as it is less likely for a smaller cache to prevent a larger cache from resizing when there are only 2 or 3 caches total.

3 Implementation

Having discussed the high-level design of our GCD MCR technique, we now describe its implementation. Section 3.1 specifies the technique in detail. Then, Section 3.2 discusses the implementation overheads.

3.1 Detailed Algorithm

Algorithm 1, called gcd_mcr(), presents the pseudocode for our technique, more formally specifying what was discussed in Section 2. In Algorithm 1, the outermost loop in gcd_mcr (line 4) sequences the search across epochs, i.e., every EPOCH_SIZE cycles. At each iteration of this loop, different caching levels are considered for resizing: either the L1 (line 5), L2 (line 11), or LLC (line 16). The L1 is considered for resizing every epoch (as long as the L2 and LLC are not being considered); the L2 is considered for resizing every L2_FREQ epochs (as long as the LLC is not being considered); and the LLC is considered for resizing every LLC_FREQ epochs. Every caching level is given the chance to resize periodically so that all caching levels, especially the L2 and LLC, can make forward progress. To allow for adaptation, we reset each cache to its maximum allocation every RESET_FREQ epochs to start a new search from the maximal configuration. (After RESET_FREQ epochs, lines 8, 13, and 19 reset the L1, L2, and LLC, respectively, the next time it is that cache's turn to resize.)

Algorithm 1: GCD MCR
 1  gcd_mcr(perflimit):
 2  begin
 3    /* main epoch loop */
 4    while not (cycle_counter % EPOCH_SIZE) do
 5      if (epoch % L2_FREQ) and (epoch % LLC_FREQ) then
 6        /* reset ways at a different frequency */
 7        if not ((epoch - 1) % RESET_FREQ) then
 8          foreach cpu i do L1 ways reset
 9        update_max_peg(perflimit, L1)    /* update PEG at different freq. */
10        cmp_mcr_per_level(L1)
11      else if (epoch % LLC_FREQ) then
12        if not ((epoch - L2_FREQ) % RESET_FREQ) then
13          foreach cpu i do L2 ways reset
14        update_max_peg(perflimit, L2)
15        cmp_mcr_per_level(L2)
16      else
17        /* last-level cache reconfiguration */
18        if not ((epoch - LLC_FREQ) % RESET_FREQ) then
19          LLC ways reset
20        update_max_peg(perflimit, LLC)
21        if (ncores == 1) then
22          cmp_mcr_per_level(LLC)
23        else
24          pcp(perflimit)
25      /* send reconfiguration command */
26      if alloc is changed then reconfigure()

27  cmp_mcr_per_level(level):
28  begin
29    /* main loop */
30    while true do
31      winner = find_max_peg_per_level(level)
32      if winner < 0 then
33        break
34      alloc[winner][level] = peg[winner][level].alloc

35  pcp(perflimit):
36  begin
37    alloc = ucp()    /* start from utility-based partitioning */
38    while get_amat_ws() >= perflimit do
39      winner = find_max_peg_per_level(LLC)
40      if winner < 0 then
41        break
42      update_min_peg(perflimit, LLC)
43      up_winner = find_min_peg_per_level(LLC)
44      if peg[winner].delta_power < up_peg[winner].delta_power then
45        alloc[winner][LLC] = up_peg[up_winner].alloc
46      else
47        alloc[winner][LLC] = peg[winner][LLC].alloc

Algorithm 1: GCD MCR (cont'd)
48  update_max_peg(perflimit, curlevel):
49  begin
50    foreach cpu i do
51      /* calculate AMAT headroom */
52      AMAT_Headroom = get_delta_amat(i, perflimit) - AMAT due to cache resizing
53      /* get mpeg */
54      peg[i][curlevel] = get_max_peg(i, curlevel, AMAT_Headroom)

55  find_max_peg_per_level(level):
56  begin
57    result = -1, max_mpeg = 0
58    if ncores == 1 then
59      foreach cache i do
60        if max_mpeg < peg[0][i].mpeg then max_mpeg = peg[0][i].mpeg
61      if max_mpeg == peg[0][level].mpeg then result = 0
62    else
63      foreach cpu i do
64        if max_mpeg < peg[i][level].mpeg then
65          max_mpeg = peg[i][level].mpeg, result = i
66    return result

67  get_max_peg(curcore, curlevel, AMAT_Headroom):
68  begin
69    alloc = current allocation of curcore's curlevel cache
70    foreach available way-off i do
71      peg[i] = get_peg(curcore, curlevel, alloc, alloc - i, AMAT_Headroom)
72    winner = allocation with maximum PEG value
73    return winner

74  get_peg(curcore, curlevel, a, b, AMAT_Headroom):
75  begin
76    deltapower = sum of dynamic and static power changes due to change in misses when assigned ways decrease from a to b
77    deltaamat = increased AMAT due to change in misses when assigned ways decrease from a to b
78    if deltaamat < AMAT_Headroom then
79      result.mpeg = deltapower / deltaamat, result.alloc = b
80      result.delta_power = deltapower, result.delta_amat = deltaamat
81    else
82      result.mpeg = -1, result.alloc = 0
83      result.delta_power = result.delta_amat = 0
84    return result

85  get_delta_amat(curcore, perflimit):
86  begin
87    amat = calculate AMAT based on way counters of current core
88    /* calculate AMAT headroom */
89    return amat * perflimit / (1 - perflimit)

Before selecting the cache(s) to resize for a given caching level, we call update_max_peg() to determine each cache's maximum PEG value locally. In particular, we first compute the permitted AMAT by calling get_delta_amat() (line 52). Because we treat AMAT as a proxy for performance, get_delta_amat simply allows AMAT to change by the user-prescribed maximum performance degradation, perflimit. Since the user specifies perflimit relative to the maximum performance (i.e., assuming the maximum cache allocation), we subtract the AMAT already incurred by the current downsizing, yielding the remaining headroom, called AMAT_Headroom. Then, we invoke get_max_peg() to determine the maximum PEG value from further downsizing; in essence, this is the downsizing that achieves the maximum power savings without exceeding the AMAT headroom. Notice, the power savings calculation in get_peg() takes into consideration both the reduction in dynamic and static power due to having a smaller cache, as well as the increase in power consumption incurred at the next caching level due to additional cache misses from downsizing.

Among the per-cache maximum PEG values, we greedily select the largest for downsizing. Across private caches (i.e., the L1 and L2, and including the LLC for uniprocessors), this is done by calling cmp_mcr_per_level(). We pass into cmp_mcr_per_level the argument level, which specifies the current private caching level being considered for downsizing. Then, to identify the cache with the largest PEG value, we call find_max_peg_per_level(). For multicores, we only consider PEG values from the current caching level (lines 63-65), guaranteeing that a cache from level will be downsized as long as it contributes non-negative PEG. Moreover, this is performed from within a loop (line 30) so that multiple caches across different cores can be downsized per epoch to improve scalability. For uniprocessors, we relax this requirement and consider PEG from all caching levels (lines 59-60), allowing a smaller cache to block the current level from downsizing. But still, only the cache at level can downsize (line 61).

Lastly, besides downsizing private caches, GCD MCR also partitions and resizes the shared LLC, which is done by calling pcp(). As discussed in Section 2, pcp starts from the UCP partitioning of the current LLC capacity (line 37). Then, treating each LLC partition as if it were a separate cache, it uses find_max_peg_per_level to identify the partition with maximum PEG for potential downsizing (line 39), just like cmp_mcr_per_level does for private caches. But in addition, pcp also considers upsizing LLC partitions. Analogous to computing each LLC partition's maximum PEG that would result from downsizing, pcp calls update_min_peg() (line 42) to compute each LLC partition's minimum reciprocal of PEG that would result from upsizing. Then, pcp calls find_min_peg_per_level() (line 43) to select the smallest among these for potential upsizing. (Both update_min_peg and find_min_peg_per_level are very similar to update_max_peg and find_max_peg_per_level, except they compute the min across the reciprocal of PEG. So, we omit their pseudocode from Algorithm 1.) If the power increase incurred by upsizing is smaller than the power savings from downsizing, then GCD MCR commits to the upsizing candidate; otherwise, it commits to the downsizing candidate (lines 44-47). All of this is performed inside a loop (line 38) so that multiple LLC partitions can be downsized / upsized per epoch for scalability.

Low-pass filter. A critical design parameter in Algorithm 1 is the epoch size. For our experiments in Section 5, we employ fairly small epochs (see Section 4). This enables our GCD MCR technique to move towards the best configuration rapidly. However, fine-grain epochs can be susceptible to spurious and/or transient behavior that can be picked up by the way counters and cause GCD MCR to make poor resizing decisions. To address this, we average the new configuration computed by Algorithm 1 with the configuration from the previous epoch. Such averaging acts as a low-pass filter, resulting in a smoother movement towards the best configuration.
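
A minimal sketch of this smoothing step follows; the helper name and the plain arithmetic mean are our own illustration, since the text only states that the new and previous configurations are averaged.

def smooth_allocation(prev_ways, proposed_ways):
    # Average of the previous and newly proposed way counts, rounded; acts
    # as a simple low-pass filter on per-epoch resizing decisions.
    return round((prev_ways + proposed_ways) / 2)

# Example: a transient spike proposes dropping from 8 ways to 2; the
# filtered decision only moves to 5 ways this epoch.
print(smooth_allocation(8, 2))   # -> 5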

[Table 1: Storage overhead of shadow tags; overheads for the LLC are per core. For the L1, L2, and LLC it lists sets, ways, tag-entry size in bits (valid bit + tag bits + LRU bits: 1+29+3, 1+26+3, and 1+23+2, respectively), tag and data array sizes, baseline cache area, shadow tag sets, shadow tag and way counter storage, and the resulting area overhead.]

3.2 Software / Hardware Overheads

We assume the GCD MCR technique can be performed in software. In such an implementation, each core in the CPU would receive a periodic interrupt at every epoch boundary. Upon entering the interrupt handler, all the cores would read their local way counters and execute the GCD MCR pseudocode in Algorithm 1. Much of this algorithm can be executed in parallel (for example, computing the max-PEG value on each core) to mitigate its runtime overhead. After making the cache resizing decisions, the cores modify their local caches' configurations, and return from the interrupt handler. For our experiments in Section 5, we did not simulate the execution of such interrupt handlers. The main bottleneck in GCD MCR lies in searching over the different cache size configurations across different epochs. Comparatively, the overhead of the interrupt handlers for determining the next resizing decision is very small. In Section 5.3, we will present an analysis to estimate this overhead, which we find to be less than 0.3%.

The main source of hardware overhead is the shadow tags [15,35], which permit way counters to track hits and misses for different cache sizes. Shadow tag arrays are similar to regular cache structures except they have no data arrays. They require s × w × t bits, where s is the number of sampled sets, w is the number of cache ways, and t is the number of bits per tag entry. Table 1 reports the overhead numbers for the specific cache hierarchy we study later in Section 5. In particular, we sample one set out of every 32, and we assume a 42-bit physical address. Overall, the shadow tag arrays require 4034 bytes, which represents an increase of 0.15% in the storage requirement compared to the baseline caches. Besides the shadow tags, the way counters themselves require an additional 4 × w bits, and an adder for incrementing them.
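
The storage formula above is easy to sanity-check. The sketch below applies s × w × t directly; the tag-entry widths come from Table 1, while the set and way counts are assumptions we derived from the baseline parameters in Section 4, so the resulting totals are illustrative rather than a reproduction of Table 1's exact figures.

def shadow_tag_bytes(total_sets, ways, tag_entry_bits, sampling_ratio=32):
    # Bytes of shadow-tag storage when one of every `sampling_ratio` sets
    # is sampled (the text samples one set out of every 32).
    sampled_sets = total_sets // sampling_ratio
    return sampled_sets * ways * tag_entry_bits / 8

# Tag entry = valid + tag + LRU bits (1+29+3, 1+26+3, 1+23+2 per Table 1).
l1  = shadow_tag_bytes(total_sets=64,   ways=8, tag_entry_bits=1 + 29 + 3)
l2  = shadow_tag_bytes(total_sets=512,  ways=8, tag_entry_bits=1 + 26 + 3)
llc = shadow_tag_bytes(total_sets=8192, ways=4, tag_entry_bits=1 + 23 + 2)  # per core
print(l1, l2, llc)   # 66.0 480.0 3328.0 bytes under these assumptions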

Another source of hardware overhead is reconfigurable caches. We assume all caches in the hierarchy, except for the L1 I-cache (which we assume is fixed), are reconfigurable. In particular, we employ selective ways [1] in the L2 cache and LLC, and both selective sets and ways [46] in the L1 cache. For selective ways, we assume reconfiguration in increments of a cache way, from 1 way up to the cache's full associativity; and for selective sets, we assume a certain number of power-of-two set configurations. Hence, for the cache hierarchy in Table 1, there are 12, 8, and 32 different configurations for the L1, L2, and LLC, respectively, assuming 8 cores. While each cache's access delay also changes across different configurations, we assume a constant number of CPU cycles to access each cache, chosen to handle that cache's worst-case access delay (i.e., with all ways and sets enabled).

When upsizing caches, we assume 2/4/7 cycles to power up and reset newly added portions of cache for the L1/L2/LLC, respectively. When downsizing, we walk the removed portions of cache to flush their contents. Clean cache blocks are discarded after checking upstream caches to maintain inclusion. Dirty cache blocks not only check upstream caches but also write back to the next-lower level. We assume these operations are pipelined such that flushing takes 1 cycle per walked cache block. Downsized ways are selected in reverse way-ID order. Because we do not physically move cache blocks once they are filled, the flushed cache blocks have an equal probability of being at any position in the LRU stack. Moreover, we do not attempt to reconstruct the per-set LRU stacks after flushing.

In order to estimate power consumption as a function of the cache configuration and the way counter values, our technique requires the per-access and leakage energies (derived from CACTI). We provide configurable registers to store these energy values.

4 Experimental Methodology

Performance simulation. We modified the SimpleScalar v4.0 simulator for the Alpha ISA [6] to conduct our study. Table 2 shows our baseline processor configuration. We use a modest 2-way issue out-of-order core that achieves good power efficiency. The cores are attached to a three-level cache hierarchy with a split 8-way 32KB L1 private cache, a unified 256KB L2 private cache, and a shared last-level cache (LLC). The LLC provides 2MB for a single core, 4MB for two cores, 8MB for 4 cores, and 16MB for 8 cores, and its associativity increases by 4 ways for each additional core. The cache block size is 64 bytes for all caches. The baseline cache hierarchy maintains a non-inclusive policy. For main memory, we model 32GB of low-power DDR3 DRAM with a 140ns access latency [8].

To facilitate our multicore experiments, we created a parallel version of the SimpleScalar simulator.

Table 2: Architectural parameters of the baseline configuration.
  Cores: 2.0 GHz, 2-way issue, out-of-order, 64-entry ROB, 24-entry LSQ; Gshare/bimodal hybrid branch predictor, 2048-entry meta table, 512-entry BTB
  L1 I-Cache: 32 KB, 2-way, 64-byte blocks, 1 cycle
  L1 D-Cache: 32 KB, 2 ports, 8-way, 64-byte blocks, 4 cycles
  L2 Unified Cache: 256 KB, 8-way, 64-byte blocks, 7 cycles
  L3 Shared Cache: 2MB and 4 ways per core (up to 16 MB, 32-way), 64-byte blocks, 19 cycles; non-inclusive, bus-type interconnect
  Memory: 32 GB Low Voltage Quad-rank RDIMM DDR3-800, 140ns load latency

In particular, we execute multiple copies of the simulator as separate UNIX processes, one per core, and synchronize the processes through sockets to simulate inter-core interactions. One important interaction occurs at epoch boundaries to execute our MCR algorithm and resize caches. Hence, we synchronize all of the simulator processes once per epoch to run our MCR technique. In addition to these per-epoch interactions, inter-core interactions potentially occur on LLC accesses as well. Although there are no cache coherence actions (since we run multiprogrammed workloads), there can be contention for the LLC and main memory. For our workloads, we find contention is extremely low. So, our simulator assumes LLC and main memory accesses always incur the contention-free access latencies in Table 2 to avoid process synchronizations on every LLC access. In Section 5.4, we will validate this assumption.

A critical design consideration for our GCD MCR technique is the rate at which caches are resized. Higher resizing frequencies enable greater adaptivity, but as discussed in Section 2, they cannot be so high that resized caches do not reach steady state, which is capacity dependent, before the next resizing decision is made. For the cache capacities in Table 2, we tried several frequencies and found that a baseline epoch size, i.e., the resizing frequency for the L1, of 200K cycles works well (EPOCH_SIZE = 200K in Algorithm 1). Moreover, we resize the L2 every 5 epochs and the LLC every 50 epochs (L2_FREQ = 5 and LLC_FREQ = 50, respectively, in Algorithm 1). Lastly, the reset frequency to start a new search, i.e., RESET_FREQ in Algorithm 1, is 125 epochs.

Finally, as mentioned in Section 3.2, the resizing decisions (i.e., via execution of our GCD MCR code) are performed in our simulator. We do not simulate the interrupt handlers mentioned in Section 3.2, so we don't explicitly account for their runtime overheads. (Our experiments also consider off-line techniques, which will be described in Section 5, but these are not impacted by this issue.) Not only does this affect our on-line GCD MCR technique, but it also affects our baseline multicore CPU, which employs UCP to partition the shared LLC. (The UCP technique would also invoke interrupt handlers to compute new LLC partitionings across different epochs.) Section 5.3 will present an analysis of this code's overhead, which we find to be very small.

Power simulation. We use McPAT [24] and CACTI 6.5 [29] for power modeling.

Table 3: Cache parameters for the baseline multi-level caches.
  L1 Cache: access type (tag array / data array / data array h-tree) Parallel / Parallel / Parallel; read/write energy per access ... nJ; subthreshold/gate leakage 41.5 / 22.3 mW
  L2 Cache: access type Parallel / Parallel / Serial; read/write energy per access ... nJ; subthreshold/gate leakage ... / 78.5 mW
  L3 Cache: access type Serial / Serial / Serial; read/write energy per access 4.38 / 6.6 nJ; (standby) subthreshold/gate leakage ... / 1666 mW

Our baseline model uses the 32nm technology node and ITRS high-performance devices. To manage static power consumption, we employ a state-of-the-art circuit- and device-level static power reduction technique in the shared LLC. Specifically, we assume high-Vt devices throughout [18], but apply reverse body bias (RBB) in standby mode to reduce standby leakage [41]. When an access occurs, we apply a forward body bias (FBB) to restore the threshold voltage for low access delay. We assume that applying FBB does not impact the access delay for the cache [41]. We utilize the stack effect in conjunction with ABB to model way selection [36,41]. We use the Model for Assessment of CMOS Technologies And Roadmaps (MASTAR 2011) from ITRS [28] to derive the parameters required for CACTI according to our assumptions. Table 3 summarizes the cache parameters used, and the resulting energy numbers for power modeling.

Benchmarks. We used the SPEC CPU2006 benchmarks for our study. Because we use a version of SimpleScalar targeted for Alpha/Linux, we compiled the SPEC CPU2006 benchmarks in a native Linux environment. We installed an Alpha CPU emulator [10] on a Windows PC, and then installed the Linux (Debian Lenny) system on the emulator. We compiled the benchmarks inside the emulator using an Alpha gcc compiler. All benchmarks were built with the -O2 option. Among all of the benchmarks in the suite, we were unable to compile 447.dealII. Moreover, one integer benchmark (403.gcc) and five floating point benchmarks (416.gamess, 433.milc, 450.soplex, 465.tonto, and 481.wrf) either did not complete, or completed with incorrect outputs. Otherwise, we simulated the remaining benchmarks, 22 in total (11 integer and 11 floating point), as shown in Table 4.

Using the reference inputs, all of the compiled benchmarks were run to completion on the SimPoint tool [13]. We take the most representative simpoint, consisting of 1B instructions per benchmark. In our experiments, we begin simulation 100M instructions prior to the simpoint to warm up the caches, and then we simulate the next 800M cycles after cache warmup to acquire detailed statistics.

Table 4: Benchmark classification based on misses per kilo-instruction (MPKI) for a 2MB LLC.
  High (h4-h0): Mcf 51, Libquantum 29, Lbm 22, Omnetpp 15, GemsFDTD 14
  Medium (m6-m0): Leslie3d 9.5, Sphinx3 8.2, Xalan 6.8, Bwaves 4.8, Zeusmp 4.1, CactusADM 2.3, Bzip2 1.9
  Low (l9-l0): Astar 0.82, Perlbench 0.69, Hmmer 0.66, H264ref 0.54, Sjeng 0.27, Gobmk 0.2, Calculix 0.2, Gromacs 0.13, Namd 0.07, Povray 0.03

Because benchmarks' IPCs differ, we simulate a fixed cycle count within each 1B-instruction simpoint instead of a fixed instruction count, to ensure all benchmarks from a given multiprogrammed workload are active simultaneously.

The SPEC CPU2006 benchmarks are used for both our uniprocessor and multicore experiments. Because these benchmarks are sequential, we formed multiprogrammed workloads to drive the multicore study (see Section 5.2). Multiprogrammed workloads generate very diverse behaviors across different cores' caches, so they tend to be more challenging for cache resizing techniques. At the same time, however, because we do not run experiments with parallel workloads, our current results do not reflect how our techniques would perform for multithreaded programs. We leave this important topic for future work.

5 Experimental Evaluation

We now conduct an evaluation of our GCD MCR technique. Section 5.1 begins by studying single-benchmark performance. Then, Section 5.2 presents our main multicore results. Finally, Section 5.4 validates our simulator's assumption of low LLC contention.

5.1 Single-Benchmark Results

We first focus on the performance of individual benchmarks, highlighting how effectively GCD MCR coordinates resizing vertically across caching levels. This also provides a baseline for interpreting the main multicore results later in Section 5.2. Not only do we present experiments for GCD MCR, we also compare it against several off-line algorithms to better understand the limits of our technique.

Off-line analysis. One of the off-line algorithms we consider is the static optimal. To facilitate this study, we run each benchmark on every possible cache

hierarchy configuration to find the best static solution (we do not permit reconfigurations during a run). For the parameters in Table 2, there are 384 unique configurations per benchmark and 8,448 configurations across all 22 benchmarks in the uniprocessor case (i.e., with a 4-way 2MB LLC), which is a large but feasible design space to simulate exhaustively. In our work, the goal is to maximize power savings with negligible performance degradation. So, we search for the configuration with the largest power savings, but with no more than 1% performance degradation compared to the baseline, i.e., the configuration in Table 2.

In addition to the static optimal, we also implement off-line versions of the Nelder-Mead (NM) simplex method mentioned earlier, as well as our GCD MCR technique. These off-line techniques use the exhaustive simulations from the static optimal. In particular, we run the NM and GCD MCR algorithms on the exhaustive simulation results, feeding to each technique the information from the simulations that drives their search algorithms (IPC and power for the NM method, and cache way counts for GCD MCR). Each technique is run only once, and the solution found upon convergence is reported. These results represent static versions of the NM method and GCD MCR algorithm; in other words, they do not incur any on-line related overheads, but at the same time, they also cannot adapt to dynamic program behavior.

Figure 2 shows the total system power savings achieved by the static optimal (labeled SO), the off-line NM method (labeled NM-Off), and the off-line GCD MCR algorithm (labeled GCD-Off) for each of our 22 SPEC benchmarks. All results are normalized to the power consumption of the baseline configuration. As Figure 2 shows, SO can save quite a bit of power for many benchmarks, providing an average savings of 15.4% across the entire suite. This demonstrates significant headroom for our techniques to save power within individual benchmarks by resizing multiple caches vertically. NM-Off also achieves good power savings, even matching SO in a few cases. While GCD-Off is not quite as good, it is still very close to NM-Off. On average, NM-Off saves 9.0% of the power consumption whereas GCD-Off saves 8.8%. So, GCD MCR's solution quality is comparable to NM. But a significant advantage of GCD MCR is its fast convergence rate. Figure 3 reports the number of iterations required by NM-Off and GCD-Off to converge. Our GCD MCR technique is superior in every benchmark. On average, it converges in 3.4 iterations whereas NM requires 19.5 iterations.

On-line analysis. Figure 2 also reports the total system power achieved by our on-line GCD MCR algorithm, normalized to the power consumption of the baseline configuration. In Figure 2, we can see that the on-line GCD MCR algorithm beats the off-line version in every benchmark except for libquantum, bwaves, namd, and calculix. Averaged across all 22 benchmarks, on-line GCD MCR saves 13.4% of the total system power compared to 8.8% for the off-line version. In fact, the on-line GCD MCR algorithm comes very close to the 15.4% power savings provided by the static optimal.
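
For reference, the exhaustive off-line selection described above amounts to a brute-force scan of the cross-product of per-cache configurations. The sketch below is a hypothetical harness for that scan (simulate() and the configuration lists are placeholders, not the authors' tooling); reading the counts in the text as 12 L1, 8 L2, and 4 uniprocessor-LLC configurations gives the 384-point space per benchmark.

from itertools import product

def static_optimal(l1_cfgs, l2_cfgs, llc_cfgs, simulate, baseline_cfg,
                   perf_limit=0.01):
    # Return the configuration with the lowest power among those whose
    # performance degradation relative to the baseline is at most perf_limit.
    base_perf, _ = simulate(baseline_cfg)
    best_cfg, best_power = None, float("inf")
    for cfg in product(l1_cfgs, l2_cfgs, llc_cfgs):
        perf, power = simulate(cfg)
        degradation = (base_perf - perf) / base_perf
        if degradation <= perf_limit and power < best_power:
            best_cfg, best_power = cfg, power
    return best_cfg, best_power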

[Figure 2: Power consumption of the static optimal (SO), off-line NM (NM-Off), static GCD MCR (GCD-Off), and on-line GCD MCR for single benchmarks. All results are normalized to the baseline configuration.]

[Figure 3: NM-Off and GCD-Off convergence rates for single benchmarks.]

Although on-line GCD MCR incurs runtime overhead, it can also adapt to dynamic behavior, which gives it an advantage over the off-line techniques. In terms of performance degradation, our on-line GCD MCR algorithm is generally worse than the static techniques, but it still maintains reasonable performance. Figure 4 shows the performance degradation for both the off-line and on-line GCD MCR algorithms relative to the baseline configuration for each of our benchmarks. As Figure 4 shows, GCD-Off is within the 1% performance degradation target for every benchmark. In contrast, the on-line algorithm exhibits worse performance for many benchmarks, and fails to achieve the 1% target for gromacs and namd. Because the on-line technique relies on prediction rather than perfect off-line information, it can make decisions that lead to unanticipated performance loss. But on average, on-line GCD MCR's performance is still very good: only 0.45% worse than the baseline.

Lastly, Figure 5 compares the cache power consumption of our on-line GCD MCR algorithm against the baseline configuration. As we can see in Figure 5, our technique gets power savings from all levels of the cache hierarchy. In particular, on-line GCD MCR significantly reduces the L1 cache power, especially the dynamic part (labeled L1D). On average, L1 dynamic power is reduced by 54.2% compared to the baseline. On-line GCD MCR also significantly reduces the L2 cache power, especially the static part (labeled L2S). On average, L2 static power is reduced by 54.8% compared to the baseline. And, on-line GCD MCR also reduces the LLC cache power, especially the static part (labeled LLCS). On average, the LLC static power is reduced by

19.8% compared to the baseline. (The LLC power savings are more modest because the baseline LLC already employs an aggressive static power reduction technique; see Section 4.) These results show our technique effectively resizes all caching levels (vertically) in order to maximize the overall power savings.

[Figure 4: Percent performance degradation of off-line GCD MCR (GCD-Off) and on-line GCD MCR relative to the baseline configuration for single benchmarks.]

[Figure 5: Cache power consumption breakdown of the baseline configuration and on-line GCD MCR for single benchmarks (perlbench, bzip2, mcf, hmmer, libquantum, omnetpp, xalan, zeusmp, cactusadm, namd, calculix, lbm, gobmk, sjeng, h264ref, astar, bwaves, gromacs, leslie3d, povray, GemsFDTD, sphinx3, and the average), broken down into L1D, L1S, L2D, L2S, LLCD, and LLCS.]

5.2 Multicore Results

We now present our main results: the GCD MCR technique running on multicore CPUs. To facilitate this study, we generated multiprogrammed workloads by randomly combining the benchmarks from Table 4. We created 20 2-core workloads, 20 4-core workloads, and 20 8-core workloads. Table 5 lists the names of the workloads, and specifies their constituent benchmarks using the abbreviations from Table 4.

Off-line algorithms. Similar to Section 5.1, we begin our multicore study by developing off-line algorithms, in particular the static optimal, to determine the limits of our technique. A significant challenge this time, though, is that there are many more configurations. As we scale core count, we also increase


More information

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor Sarah Bird ϕ, Aashish Phansalkar ϕ, Lizy K. John ϕ, Alex Mericas α and Rajeev Indukuru α ϕ University

More information

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell

More information

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Joel Hestness jthestness@uwalumni.com Lenni Kuff lskuff@uwalumni.com Computer Science Department University of

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

Optimizing SMT Processors for High Single-Thread Performance

Optimizing SMT Processors for High Single-Thread Performance University of Maryland Inistitute for Advanced Computer Studies Technical Report UMIACS-TR-2003-07 Optimizing SMT Processors for High Single-Thread Performance Gautham K. Dorai, Donald Yeung, and Seungryul

More information

Memory Mapped ECC Low-Cost Error Protection for Last Level Caches. Doe Hyun Yoon Mattan Erez

Memory Mapped ECC Low-Cost Error Protection for Last Level Caches. Doe Hyun Yoon Mattan Erez Memory Mapped ECC Low-Cost Error Protection for Last Level Caches Doe Hyun Yoon Mattan Erez 1-Slide Summary Reliability issues in caches Increasing soft error rate (SER) Cost increases with error protection

More information

Improving Cache Performance using Victim Tag Stores

Improving Cache Performance using Victim Tag Stores Improving Cache Performance using Victim Tag Stores SAFARI Technical Report No. 2011-009 Vivek Seshadri, Onur Mutlu, Todd Mowry, Michael A Kozuch {vseshadr,tcm}@cs.cmu.edu, onur@cmu.edu, michael.a.kozuch@intel.com

More information

Energy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012

Energy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012 Energy Proportional Datacenter Memory Brian Neel EE6633 Fall 2012 Outline Background Motivation Related work DRAM properties Designs References Background The Datacenter as a Computer Luiz André Barroso

More information

ECE/CS 757: Homework 1

ECE/CS 757: Homework 1 ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip

More information

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India

More information

A Front-end Execution Architecture for High Energy Efficiency

A Front-end Execution Architecture for High Energy Efficiency A Front-end Execution Architecture for High Energy Efficiency Ryota Shioya, Masahiro Goshima and Hideki Ando Department of Electrical Engineering and Computer Science, Nagoya University, Aichi, Japan Information

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture

DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture Junwhan Ahn *, Sungjoo Yoo, and Kiyoung Choi * junwhan@snu.ac.kr, sungjoo.yoo@postech.ac.kr, kchoi@snu.ac.kr * Department of Electrical

More information

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables Storage Efficient Hardware Prefetching using Correlating Prediction Tables Marius Grannaes Magnus Jahre Lasse Natvig Norwegian University of Science and Technology HiPEAC European Network of Excellence

More information

562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016

562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016 562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016 Memory Bandwidth Management for Efficient Performance Isolation in Multi-Core Platforms Heechul Yun, Gang Yao, Rodolfo Pellizzoni, Member,

More information

LACS: A Locality-Aware Cost-Sensitive Cache Replacement Algorithm

LACS: A Locality-Aware Cost-Sensitive Cache Replacement Algorithm 1 LACS: A Locality-Aware Cost-Sensitive Cache Replacement Algorithm Mazen Kharbutli and Rami Sheikh (Submitted to IEEE Transactions on Computers) Mazen Kharbutli is with Jordan University of Science and

More information

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES Shashikiran H. Tadas & Chaitali Chakrabarti Department of Electrical Engineering Arizona State University Tempe, AZ, 85287. tadas@asu.edu, chaitali@asu.edu

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

Scalable Dynamic Task Scheduling on Adaptive Many-Cores

Scalable Dynamic Task Scheduling on Adaptive Many-Cores Introduction: Many- Paradigm [Our Definition] Scalable Dynamic Task Scheduling on Adaptive Many-s Vanchinathan Venkataramani, Anuj Pathania, Muhammad Shafique, Tulika Mitra, Jörg Henkel Bus CES Chair for

More information

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Wenun Wang and Wei-Ming Lin Department of Electrical and Computer Engineering, The University

More information

Filtered Runahead Execution with a Runahead Buffer

Filtered Runahead Execution with a Runahead Buffer Filtered Runahead Execution with a Runahead Buffer ABSTRACT Milad Hashemi The University of Texas at Austin miladhashemi@utexas.edu Runahead execution dynamically expands the instruction window of an out

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department SIP: Speculative Insertion Policy for High Performance Caching Hongil Yoon Tan Zhang Mikko H. Lipasti Technical Report #1676 June 2010 SIP: Speculative Insertion Policy for

More information

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1 Available online at www.sciencedirect.com Physics Procedia 33 (2012 ) 1029 1035 2012 International Conference on Medical Physics and Biomedical Engineering Memory Performance Characterization of SPEC CPU2006

More information

DynRBLA: A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories

DynRBLA: A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories SAFARI Technical Report No. 2-5 (December 6, 2) : A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories HanBin Yoon hanbinyoon@cmu.edu Justin Meza meza@cmu.edu

More information

Energy-centric DVFS Controlling Method for Multi-core Platforms

Energy-centric DVFS Controlling Method for Multi-core Platforms Energy-centric DVFS Controlling Method for Multi-core Platforms Shin-gyu Kim, Chanho Choi, Hyeonsang Eom, Heon Y. Yeom Seoul National University, Korea MuCoCoS 2012 Salt Lake City, Utah Abstract Goal To

More information

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Fazal Hameed and Jeronimo Castrillon Center for Advancing Electronics Dresden (cfaed), Technische Universität Dresden,

More information

WADE: Writeback-Aware Dynamic Cache Management for NVM-Based Main Memory System

WADE: Writeback-Aware Dynamic Cache Management for NVM-Based Main Memory System WADE: Writeback-Aware Dynamic Cache Management for NVM-Based Main Memory System ZHE WANG, Texas A&M University SHUCHANG SHAN, Chinese Institute of Computing Technology TING CAO, Australian National University

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES

JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES NATHAN BECKMANN AND DANIEL SANCHEZ MIT CSAIL PACT 13 - EDINBURGH, SCOTLAND SEP 11, 2013 Summary NUCA is giving us more capacity, but further away 40 Applications

More information

Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era

Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era Dimitris Kaseridis Electrical and Computer Engineering The University of Texas at Austin Austin, TX, USA kaseridis@mail.utexas.edu

More information

Staged Memory Scheduling

Staged Memory Scheduling Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:

More information

Area-Efficient Error Protection for Caches

Area-Efficient Error Protection for Caches Area-Efficient Error Protection for Caches Soontae Kim Department of Computer Science and Engineering University of South Florida, FL 33620 sookim@cse.usf.edu Abstract Due to increasing concern about various

More information

An Application-Oriented Approach for Designing Heterogeneous Network-on-Chip

An Application-Oriented Approach for Designing Heterogeneous Network-on-Chip An Application-Oriented Approach for Designing Heterogeneous Network-on-Chip Technical Report CSE-11-7 Monday, June 13, 211 Asit K. Mishra Department of Computer Science and Engineering The Pennsylvania

More information

Power Measurement Using Performance Counters

Power Measurement Using Performance Counters Power Measurement Using Performance Counters October 2016 1 Introduction CPU s are based on complementary metal oxide semiconductor technology (CMOS). CMOS technology theoretically only dissipates power

More information

Predicting Performance Impact of DVFS for Realistic Memory Systems

Predicting Performance Impact of DVFS for Realistic Memory Systems Predicting Performance Impact of DVFS for Realistic Memory Systems Rustam Miftakhutdinov Eiman Ebrahimi Yale N. Patt The University of Texas at Austin Nvidia Corporation {rustam,patt}@hps.utexas.edu ebrahimi@hps.utexas.edu

More information

Performance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Performance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Performance Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Defining Performance (1) Which airplane has the best performance? Boeing 777 Boeing

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance

More information

A Comprehensive Scheduler for Asymmetric Multicore Systems

A Comprehensive Scheduler for Asymmetric Multicore Systems A Comprehensive Scheduler for Asymmetric Multicore Systems Juan Carlos Saez Manuel Prieto Complutense University, Madrid, Spain {jcsaezal,mpmatias}@pdi.ucm.es Alexandra Fedorova Sergey Blagodurov Simon

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

arxiv: v1 [cs.ar] 1 Feb 2016

arxiv: v1 [cs.ar] 1 Feb 2016 SAFARI Technical Report No. - Enabling Efficient Dynamic Resizing of Large DRAM Caches via A Hardware Consistent Hashing Mechanism Kevin K. Chang, Gabriel H. Loh, Mithuna Thottethodi, Yasuko Eckert, Mike

More information

A Bandwidth-aware Memory-subsystem Resource Management using. Non-invasive Resource Profilers for Large CMP Systems

A Bandwidth-aware Memory-subsystem Resource Management using. Non-invasive Resource Profilers for Large CMP Systems A Bandwidth-aware Memory-subsystem Resource Management using Non-invasive Resource Profilers for Large CMP Systems Dimitris Kaseridis, Jeffrey Stuecheli, Jian Chen and Lizy K. John Department of Electrical

More information

Flexible Cache Error Protection using an ECC FIFO

Flexible Cache Error Protection using an ECC FIFO Flexible Cache Error Protection using an ECC FIFO Doe Hyun Yoon and Mattan Erez Dept Electrical and Computer Engineering The University of Texas at Austin 1 ECC FIFO Goal: to reduce on-chip ECC overhead

More information

ChargeCache. Reducing DRAM Latency by Exploiting Row Access Locality

ChargeCache. Reducing DRAM Latency by Exploiting Row Access Locality ChargeCache Reducing DRAM Latency by Exploiting Row Access Locality Hasan Hassan, Gennady Pekhimenko, Nandita Vijaykumar, Vivek Seshadri, Donghyuk Lee, Oguz Ergin, Onur Mutlu Executive Summary Goal: Reduce

More information

Architecture Tuning Study: the SimpleScalar Experience

Architecture Tuning Study: the SimpleScalar Experience Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.

More information

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases IEEE TRANSACTIONS ON COMPUTERS, VOL. 63, NO. 5, MAY 2014 1061 A Cool Scheduler for Multi-Core Systems Exploiting Program Phases Zhiming Zhang and J. Morris Chang, Senior Member, IEEE Abstract Rapid growth

More information

A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors

A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors , July 4-6, 2018, London, U.K. A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid in 3D chip Multi-processors Lei Wang, Fen Ge, Hao Lu, Ning Wu, Ying Zhang, and Fang Zhou Abstract As

More information

Improving Virtual Machine Scheduling in NUMA Multicore Systems

Improving Virtual Machine Scheduling in NUMA Multicore Systems Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore

More information

Hybrid Cache Architecture (HCA) with Disparate Memory Technologies

Hybrid Cache Architecture (HCA) with Disparate Memory Technologies Hybrid Cache Architecture (HCA) with Disparate Memory Technologies Xiaoxia Wu, Jian Li, Lixin Zhang, Evan Speight, Ram Rajamony, Yuan Xie Pennsylvania State University IBM Austin Research Laboratory Acknowledgement:

More information

Thesis Defense Lavanya Subramanian

Thesis Defense Lavanya Subramanian Providing High and Predictable Performance in Multicore Systems Through Shared Resource Management Thesis Defense Lavanya Subramanian Committee: Advisor: Onur Mutlu Greg Ganger James Hoe Ravi Iyer (Intel)

More information

Micro-sector Cache: Improving Space Utilization in Sectored DRAM Caches

Micro-sector Cache: Improving Space Utilization in Sectored DRAM Caches Micro-sector Cache: Improving Space Utilization in Sectored DRAM Caches Mainak Chaudhuri Mukesh Agrawal Jayesh Gaur Sreenivas Subramoney Indian Institute of Technology, Kanpur 286, INDIA Intel Architecture

More information

Scheduling the Intel Core i7

Scheduling the Intel Core i7 Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne

More information

CloudCache: Expanding and Shrinking Private Caches

CloudCache: Expanding and Shrinking Private Caches CloudCache: Expanding and Shrinking Private Caches Hyunjin Lee, Sangyeun Cho, and Bruce R. Childers Computer Science Department, University of Pittsburgh {abraham,cho,childers}@cs.pitt.edu Abstract The

More information

MLP Aware Heterogeneous Memory System

MLP Aware Heterogeneous Memory System MLP Aware Heterogeneous Memory System Sujay Phadke and Satish Narayanasamy University of Michigan, Ann Arbor {sphadke,nsatish}@umich.edu Abstract Main memory plays a critical role in a computer system

More information

Evaluating Private vs. Shared Last-Level Caches for Energy Efficiency in Asymmetric Multi-Cores

Evaluating Private vs. Shared Last-Level Caches for Energy Efficiency in Asymmetric Multi-Cores Evaluating Private vs. Shared Last-Level Caches for Energy Efficiency in Asymmetric Multi-Cores Anthony Gutierrez Adv. Computer Architecture Lab. University of Michigan EECS Dept. Ann Arbor, MI, USA atgutier@umich.edu

More information

Efficient Memory Shadowing for 64-bit Architectures

Efficient Memory Shadowing for 64-bit Architectures Efficient Memory Shadowing for 64-bit Architectures The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation Qin Zhao, Derek Bruening,

More information

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons,

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2) The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache

More information

Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads

Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads Milad Hashemi, Onur Mutlu, Yale N. Patt The University of Texas at Austin ETH Zürich ABSTRACT Runahead execution pre-executes

More information

Dynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors

Dynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors Dynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors Jie Meng, Tiansheng Zhang, and Ayse K. Coskun Electrical and Computer Engineering Department, Boston University,

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems

Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems Reetuparna Das Rachata Ausavarungnirun Onur Mutlu Akhilesh Kumar Mani Azimi University of Michigan Carnegie

More information

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 341 6.2 Types of Memory 341 6.3 The Memory Hierarchy 343 6.3.1 Locality of Reference 346 6.4 Cache Memory 347 6.4.1 Cache Mapping Schemes 349 6.4.2 Replacement Policies 365

More information

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Chapter 8 & Chapter 9 Main Memory & Virtual Memory Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

PiCL: a Software-Transparent, Persistent Cache Log for Nonvolatile Main Memory

PiCL: a Software-Transparent, Persistent Cache Log for Nonvolatile Main Memory PiCL: a Software-Transparent, Persistent Cache Log for Nonvolatile Main Memory Tri M. Nguyen Department of Electrical Engineering Princeton University Princeton, USA trin@princeton.edu David Wentzlaff

More information

Spatial Locality-Aware Cache Partitioning for Effective Cache Sharing

Spatial Locality-Aware Cache Partitioning for Effective Cache Sharing Spatial Locality-Aware Cache Partitioning for Effective Cache Sharing Saurabh Gupta Oak Ridge National Laboratory Oak Ridge, USA guptas@ornl.gov Abstract In modern multi-core processors, last-level caches

More information

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 02 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 2.1 The levels in a typical memory hierarchy in a server computer shown on top (a) and in

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 0 Consider the following LSQ and when operands are

More information