Multi-Cache Resizing via Greedy Coordinate Descent


I. Stephen Choi · Donald Yeung

Received: date / Accepted: date

Abstract: To reduce power consumption in CPUs, researchers have studied dynamic cache resizing. However, existing techniques only resize a single cache within a uniprocessor or the shared last-level cache (LLC) within a multicore CPU. To maximize benefits, it is necessary to resize all caches, which in today's CPUs includes one or two private caches per core and a shared LLC. Such multi-cache resizing (MCR) is challenging because the multiple resizing decisions are coupled, yielding an enormous configuration space. In this paper, we present a dynamic MCR technique that uses search-based optimization. Our main contribution is a set of heuristics that enable the search to find the best configuration rapidly. In particular, our search moves in a coordinate descent (Manhattan) fashion across the configuration space. At each search step, we select the next cache for resizing greedily based on a power efficiency gain (PEG) metric. To further enhance search speed, we permit parallel greedy selection. Across 60 multiprogrammed workloads, our technique reduces power by 13.9% while sacrificing 1.5% of the performance.

Keywords: Cache Resizing · Multicore CPUs · Search-Based Optimization · Power-Efficient Computing

I. Stephen Choi, Samsung, 3655 N 1st Street, San Jose, CA. E-mail: stephen.ch@samsung.com
Donald Yeung, University of Maryland at College Park, 1323 A. V. Williams, College Park, MD. E-mail: yeung@umd.edu

1 Introduction

Power consumption has been the most critical problem facing computer architects over the past decade [23], and is still the main limiter to achieving high performance in today's CPUs. Unfortunately, this problem will only get worse as process technologies continue to scale to smaller feature sizes. As such, power efficiency will remain an extremely important design goal, requiring hardware designers to continue looking for ways to squeeze wasteful power consumption out of architectures.

A key place to look for power savings is the on-chip cache hierarchy. Caches occupy a large portion of the CPU's die area (upwards of 50% in today's CPUs), so they contribute significantly to a processor's overall power budget. In addition, caches are sized for the worst case. This means most computations cannot effectively utilize all of the cache capacity. Such over-provisioning can result in significant waste that, if eliminated, can potentially yield large power savings.

Several researchers have investigated dynamic cache resizing to target such waste [1,17,3,4,26,27,34,40,46,47]. The idea is to monitor a cache's behavior at runtime (e.g., using way counters [38]) and dynamically reconfigure its capacity by enabling/disabling cache ways or sets to trade off performance for power. In particular, as a cache is downsized, its power consumption reduces. (Dynamic power goes down because less cache is activated per access, while static power also goes down because unused cache can be power gated.) But this comes at the expense of additional cache misses, which can degrade performance and increase power consumption at the next level of the memory hierarchy. Cache resizing tries to pick the capacity that optimizes this tradeoff, typically by using way counters to evaluate, perhaps exhaustively, the performance and power that would have occurred under different configurations.

Although there has been significant work on dynamic cache resizing, existing techniques are very limited in scope. Most only consider resizing a single cache within a uniprocessor cache hierarchy [1,26,27,34,46,47]. Balasubramonian's work [3,4] resizes two levels of cache (also for a uniprocessor), but not independently, as the sum of the two cache sizes is always fixed. So again, there is only one cache that is explicitly resized. More recently, researchers have begun studying resizing for multicore CPUs [17,40,44]. Unfortunately, Wang only offers an off-line solution. While Kedzierski and Sundararajan propose dynamic techniques, they only target the shared last-level cache (LLC). Granted, they perform partitioning, which requires selecting multiple partition sizes within the LLC, but they still only resize a single cache.

This is limiting because modern CPUs employ many caches, typically one or two private caches per core along with a shared LLC. No single cache will ever be responsible for all of the power consumption, and so dynamically resizing only one cache will not address all of the waste. To illustrate, Figure 1 breaks down the power consumed by several SPEC 2006 benchmarks running sequentially on a CPU with a three-level cache hierarchy.

[Figure 1: Cache power consumption breakdown. Y-axis: power consumption (Watt); bar segments: L1D, L1S, L2D, L2S, LLCD, LLCS.]

Bar-stacks are shown for each benchmark, reporting the power consumption by caching level (L1, L2, or LLC) and by type (dynamic, D, versus static, S). Although different types of power consumption tend to be localized (dynamic power is prevalent in the L1 while static power is prevalent in the L2 and LLC), every cache contributes non-trivially to the total power, with the L1, L2, and LLC contributing 36.2%, 27.5%, and 36.3%, respectively. While not explicitly shown in Figure 1, we observe a similar result when many SPEC benchmarks run on a multicore CPU, with power consumption distributed across multiple caching levels (as in Figure 1) and across multiple cores' caches. So, for dynamic cache resizing to be effective, it is crucial to perform resizing at multiple caches simultaneously.

An important question, then, is how such dynamic multi-cache resizing (MCR) should be conducted. If the behavior across different caches were independent, then optimizing each cache locally should result in a good solution globally. If this were the case, one could simply apply existing single-cache resizing techniques to each cache separately. Unfortunately, behaviors across caches in a multicore cache hierarchy are not independent. Instead, resizing decisions at different caches are coupled.

Coupling certainly occurs between different caching levels (i.e., vertically) within the same core. As mentioned above, resizing a cache affects the power consumption at the next caching level. In MCR, though, not only can we resize the upstream cache, we can also resize the downstream cache as well. If the pair of resizing decisions are made in concert, we can save more power. For example, when downsizing the upstream cache, we may be able to simultaneously downsize the downstream cache to reduce its access energy, thus lowering the power penalty for any additional cache misses from the smaller upstream cache. This in turn may enable even more aggressive downsizing of the upstream cache. Such MCR decisions can be applied in succession (i.e., resize the L1-L2 together as well as the L2-LLC together), which means resizing decisions at any pair of caches along the vertical dimension are coupled transitively.

In addition, as mentioned above, cache partitioning is performed as part of LLC resizing. This means the selection of each core's LLC capacity is coupled (i.e., horizontally) since cache partitioning techniques allocate the per-core capacities in a coordinated fashion from the same physical LLC [35]. Horizontal coupling at the LLC, in combination with vertical coupling between caching levels, means that in fact the resizing decisions at all caches in a multicore cache hierarchy are coupled, even the ones for private caches belonging to

different cores. Again, this is due to resizing decisions impacting each other transitively, but now the transitive coupling can occur across different cores by way of the shared LLC.

Hence, in order to achieve the most power savings, all of the caches in a multicore cache hierarchy should be resized in a globally consistent fashion. This means that compared to single-level cache resizing, MCR techniques must consider a much larger configuration space with very high dimensionality (essentially, each cache is an independent optimization variable). In particular, the total number of configurations grows as the cross-product of all per-cache configurations, including the partitioning configurations in the LLC. While single-level resizing techniques can use way counters to exhaustively explore configurations, the configuration space for MCR techniques is so large that it can be intractable to exhaustively explore even off-line, let alone during on-line optimization.

In this paper, we study new dynamic cache resizing techniques that address the complexity of MCR. Rather than try to predict the best configuration outright, we investigate search-based optimization techniques that evolve the cache hierarchy towards the best configuration over time. Existing search heuristics, like the Nelder-Mead (NM) simplex method [31], can already optimize objective functions in complex multi-dimensional spaces. But NM is not fast enough. Because we run the search on-line, very high search speed is needed to minimize runtime overhead and also permit adapting to a workload's dynamic behavior.

Our main contribution is a set of search heuristics for boosting the speed of searching MCR configurations. More specifically, our technique starts from the maximum allocation and downsizes caches one at a time, thus searching the multi-dimensional MCR space in a coordinate descent (i.e., Manhattan) fashion. (In the LLC, we use UCP [35] for the initial partitioning, and while we mainly downsize, we also explore upsizing partitions, which may become desirable as searching progresses.) At each search step, we select the next cache for downsizing greedily based on a power efficiency gain (PEG) metric that captures both the cost in performance and the benefit in power associated with each cache's downsizing. We find greedy coordinate descent (GCD) is effective at efficiently navigating through MCR spaces. To further enhance search speed (especially as core count scales), we permit parallel greedy selection across cores and give higher downsizing priority to caches with larger capacity.

We implemented our GCD MCR technique on a detailed architectural simulator, and conducted an in-depth evaluation of its effectiveness. In addition to GCD MCR, our evaluation also considers off-line algorithms that identify either the static optimal when it is tractable to do so, or a very good (i.e., an aggressive off-line) configuration when the configuration space becomes intractably large. We first show GCD MCR is effective for individual benchmarks running sequentially. Across 22 SPEC CPU2006 benchmarks, we find GCD MCR saves 13.4% of the power while sacrificing only 0.45% of the performance on the baseline system (without cache resizing). Moreover, GCD MCR is very close to the static optimal configuration, which saves 15.4% of

the power. Next, we show GCD MCR is also effective for multiprogrammed workloads running on multicore processors. We created 60 multiprogrammed workloads for this study. (Our current experiments do not consider parallel or multithreaded programs.) For the multiprogrammed workloads, we find GCD MCR saves 13.9% of the power, whereas an aggressive off-line technique saves 15.2% of the power, compared to the baseline. Performance degradation is slightly worse in the multicore case, but still acceptably small: 1.5% on average. These results show GCD MCR effectively navigates through large MCR configuration spaces to rapidly find very good configurations.

The rest of this paper is organized as follows. Section 2 presents the design of our GCD MCR technique, and Section 3 discusses its implementation. After describing our evaluation methodology in Section 4, Section 5 presents our results. Finally, Section 6 discusses related work, and Section 7 concludes the paper.

2 Greedy Coordinate Descent

We treat MCR as a constrained multi-variable optimization problem. More specifically, the size of each cache in the multicore CPU is considered to be a free variable. And, the goal is to determine the best allocation of capacity (i.e., the best variable setting) at every cache such that power is minimized under the constraint that performance degradation never exceeds some user-prescribed limit. (While achieving high power efficiency is the main goal, we also want to maintain a high level of performance, hence the constraint on performance degradation.)

To drive the constrained multi-variable optimization, we employ a search-based approach. We divide workload execution into time intervals, called epochs, and monitor performance and power consumption as different cache allocations are searched across epochs. In particular, we rely on hardware way counters [38] used in previous cache techniques [22,39,35,40] to monitor performance and power consumption. Although the way counters only track cache misses, average memory access time (AMAT) can be computed from their cache miss counts, which we use as a proxy for performance. The cache miss counts can be used to compute power consumption as well. Together, these yield estimates of power efficiency, the objective for our technique.

As discussed earlier, a major challenge for MCR is the complexity of the configuration space, making search speed a crucial design consideration. While search techniques such as Nelder-Mead (NM) exist, they are not fast enough for on-line optimization. Slow search techniques can cause significant overheads, which can also impede adapting to a workload's dynamic behaviors. In this section, we discuss several aspects of our search heuristic's design for boosting search speed.

Coordinate descent. To address the high complexity of MCR, we employ a coordinate descent method [42]. Coordinate descent has been shown to be

extremely effective at optimizing non-differentiable functions of multiple variables, and is efficient at solving large problems [32]. In this approach, each variable (cache) is optimized (resized) one at a time, resulting in a Manhattan movement over the multi-variable solution space. Moreover, the movement is always in the downward direction: we initialize all caches to their maximum allocation, and then run the search heuristic to move monotonically towards a reduced configuration. Together, the Manhattan and monotonic movements tend to effectively search the MCR configurations despite the coupling that occurs between caches, resulting in better optimization decisions. Lastly, to enable adaptation, we periodically reset all caches to their maximum allocation, and re-run the search heuristic from this initial configuration.

One issue is that immediately after downsizing a cache, the cache's hits and misses may take some time to reach their steady-state behavior. Only after steady state has been reached will the way counters accurately reflect the impact that downsizing had on the cache's performance. The problem is that the duration of this transient period depends on the capacity of the resized cache, which varies across caching levels. Hence, it is necessary to resize the caches from different caching levels at different rates. In particular, we resize smaller caches more frequently and larger caches less frequently.

Greedy search order. A crucial design question is in what order the caches should be downsized during coordinate descent. In our work, we downsize caches in a greedy fashion using a novel metric that we propose, called power efficiency gain (PEG). For each potential downsized cache, we define PEG to be the power consumption reduction divided by the performance loss that would result from the downsizing (as predicted by the way counters). Thus, our PEG metric captures both the benefit and the cost of downsizing a particular cache.

Our greedy coordinate descent (GCD) MCR technique uses this PEG metric in two ways. First, for each cache in the hierarchy, GCD MCR assesses the PEG resulting from every possible downsizing of the cache (exhaustive evaluation for a single cache is feasible), and identifies the one that achieves the maximum PEG value. Then, GCD compares the max-PEG values across different caches to identify the cache with the globally maximum PEG value.

Notice, the PEG metric can be used to compare any pair of caches. While all caches contribute to power consumption, they also contribute to overall performance. In particular, our GCD MCR technique uses weighted speedup as the performance metric, which considers all programs within a multiprogrammed workload. So, not only can PEG assess the cost-benefit of downsizing caches from the same core (which affects performance of a single program), it can also assess downsizing caches from different cores (which affects system-level performance through the weighted speedup metric). In this way, PEG enables greedy coordinate descent globally across all caches within a multicore cache hierarchy. However, while GCD MCR uses PEG to order caches for greedy selection, we do not always downsize the cache with the globally maximum PEG value, due to scalability issues, which we will discuss below.
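
To make the metric concrete, the following is a minimal sketch, in Python, of how a PEG value could be computed for one downsizing candidate from way-counter-based estimates. It is not the authors' implementation; every function name, parameter, and constant is a hypothetical placeholder for illustration only.

def amat(accesses, misses, hit_latency, miss_penalty):
    # Average memory access time implied by an access/miss count pair.
    return hit_latency + (misses / accesses) * miss_penalty

def peg_of_downsizing(accesses, misses_now, misses_small,
                      hit_latency, miss_penalty,
                      dyn_energy_saved_per_access, static_power_saved,
                      next_level_energy_per_miss, epoch_seconds,
                      amat_headroom):
    # Power saved per unit of AMAT lost when shrinking one cache, or -1 if
    # the extra misses would exceed the remaining AMAT headroom.
    extra_misses = misses_small - misses_now
    # Benefit: cheaper accesses and power-gated ways; cost: extra misses
    # that now spend energy at the next caching level.
    delta_power = (dyn_energy_saved_per_access * accesses / epoch_seconds
                   + static_power_saved
                   - extra_misses * next_level_energy_per_miss / epoch_seconds)
    delta_amat = (amat(accesses, misses_small, hit_latency, miss_penalty)
                  - amat(accesses, misses_now, hit_latency, miss_penalty))
    if delta_amat >= amat_headroom:
        return -1.0   # downsizing would violate the performance limit
    return delta_power / max(delta_amat, 1e-12)

Under this reading, the candidate with the largest such value, across every cache and every legal downsizing of that cache, is the one greedy selection would favor next.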

PEG-based cache partitioning / resizing. The LLC in a multicore CPU is often shared by all the cores. Hence, in addition to resizing the LLC, GCD MCR must also partition it (if it is shared). Many cache partitioning techniques have been studied in the past [38,22,39,35,43,7,25], but these techniques are focused on performance only. More recently, researchers have studied partitioning techniques that also downsize and power off unused portions of cache to save power [40]. Our approach is similar except we use our GCD MCR technique to control the partitioning and downsizing of the shared cache.

As mentioned above, GCD MCR periodically resets all caches to their maximum allocation and re-runs its search heuristic. For a shared LLC, GCD MCR also resets its capacity, but in addition, uses UCP [35] to determine an initial partitioning of the maximally-sized LLC. Then, during the run of its search heuristic, GCD MCR treats each LLC partition as if it were a separate cache in the hierarchy that can be resized independently. GCD MCR first considers downsizing each LLC partition, using the PEG metric to select the best candidate as normal. However, GCD MCR also considers upsizing each LLC partition as well. While downsizing usually moves towards a more power-efficient configuration, it may become profitable for some partitions to expand as others contract. By considering moving LLC partitions in both directions, our GCD MCR technique is more likely to find the globally best allocation.

Scalability. Our GCD MCR technique is relatively fast. As Section 5 will show, GCD MCR can converge in many fewer epochs than existing techniques for solving constrained multi-variable optimization problems. Nevertheless, one significant issue remains: the serial nature of the search heuristic. As discussed above, the basic coordinate descent method selects different caches (or LLC partitions) to resize one at a time across successive epochs. But this is not scalable. As the number of cores (and hence, the number of caches) increases, so will the time required for the search to converge. Another related scaling problem arises from the multi-rate fashion in which caches are considered for resizing. Because larger caches are resized at lower frequency, delaying their resizing can lengthen the time to convergence. Unfortunately, the probability that a smaller cache exhibits the globally maximum PEG value, and hence is selected for resizing over a given larger cache, grows as the number of cores increases.

To improve scalability, we perform multiple rounds of greedy selection per epoch, allowing many caches / LLC partitions to be resized as a group. One drawback with this enhancement is that we do not get feedback from the way counters after every resizing decision. This may degrade the quality of resizing if the simultaneous resizing decisions are coupled with one another. In general, resizing decisions at all caches are coupled, but we find coupling is greater between caches from the same core as compared to caches across different cores. So, when resizing multiple caches, we require that they be selected from different cores to minimize the negative effects of sampling way counters for groups of resized caches.
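
As a rough illustration of this grouped selection (hypothetical names; the paper's actual mechanism is expressed through Algorithm 1 in Section 3, not this code), one epoch's round for a given caching level could rank each core's best candidate by PEG and commit all profitable ones together, with at most one cache per core:

def select_group_for_level(per_core_candidates):
    # per_core_candidates: {core_id: (peg_value, proposed_way_allocation)}.
    # Returns the resizings to commit this epoch, best PEG first; candidates
    # with negative PEG (no profitable downsizing) are skipped.
    group = []
    ranked = sorted(per_core_candidates.items(),
                    key=lambda kv: kv[1][0], reverse=True)
    for core, (peg_value, new_alloc) in ranked:
        if peg_value < 0:
            continue
        group.append((core, new_alloc))
    return group

# Example: cores 0 and 2 have profitable candidates at this level, core 1 does not.
print(select_group_for_level({0: (3.2, 6), 1: (-1.0, 8), 2: (5.7, 4)}))
# -> [(2, 4), (0, 6)]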

Finally, to prevent large caches from impeding search speed, we give them the chance to resize when their turn to be considered comes up. In particular, for multicore processors, we never let smaller caches prevent larger caches from resizing, even if their PEG values are higher. Effectively, this means we only use PEG to order caches within the caching level whose turn it is to be considered for resizing. For example, when it is time to resize the L2 caches, we order all of the L2s by their PEG values, and downsize the one with maximum PEG; i.e., we ignore the L1 caches when comparing the L2 PEG values. Likewise, when it is time to resize the LLC partitions, we order all of them by their PEG values, ignoring the L1 and L2 PEG values. For uniprocessors, we relax this requirement, as it is less likely for a smaller cache to prevent a larger cache from resizing when there are only 2 or 3 caches total.

3 Implementation

Having discussed the high-level design of our GCD MCR technique, we now describe its implementation. Section 3.1 specifies the technique in detail. Then, Section 3.2 discusses the implementation overheads.

3.1 Detailed Algorithm

Algorithm 1, called gcd_mcr(), presents the pseudocode for our technique, more formally specifying what was discussed in Section 2. In Algorithm 1, the outermost loop in gcd_mcr (line 4) sequences the search across epochs, i.e., every EPOCH_SIZE cycles. At each iteration of this loop, different caching levels are considered for resizing: either the L1 (line 5), L2 (line 11), or LLC (line 16). The L1 is considered for resizing every epoch (as long as the L2 and LLC are not being considered); the L2 is considered for resizing every L2_FREQ epochs (as long as the LLC is not being considered); and the LLC is considered for resizing every LLC_FREQ epochs. Every caching level is given the chance to resize periodically so that all caching levels, especially the L2 and LLC, can make forward progress. To allow for adaptation, we reset each cache to its maximum allocation every RESET_FREQ epochs to start a new search from the maximal configuration. (After RESET_FREQ epochs, lines 8, 13, and 19 reset the L1, L2, and LLC, respectively, the next time it is that cache's turn to resize.)

Algorithm 1: GCD MCR
 1  gcd_mcr(perflimit):
 2  begin
 3    /* main epoch loop */
 4    while not (cycle_counter % EPOCH_SIZE) do
 5      if (epoch % L2_FREQ) and (epoch % LLC_FREQ) then
 6        /* reset ways at a different frequency */
 7        if not ((epoch - 1) % RESET_FREQ) then
 8          foreach cpu i do L1 ways reset
 9        update_max_peg(perflimit, L1)    /* update PEG at different freq. */
10        cmp_mcr_per_level(L1)
11      else if (epoch % LLC_FREQ) then
12        if not ((epoch - L2_FREQ) % RESET_FREQ) then
13          foreach cpu i do L2 ways reset
14        update_max_peg(perflimit, L2)
15        cmp_mcr_per_level(L2)
16      else
17        /* last-level cache reconfiguration */
18        if not ((epoch - LLC_FREQ) % RESET_FREQ) then
19          LLC ways reset
20        update_max_peg(perflimit, LLC)
21        if (ncores == 1) then
22          cmp_mcr_per_level(LLC)
23        else
24          pcp(perflimit)
25      /* send reconfiguration command */
26      if alloc is changed then reconfigure()

27  cmp_mcr_per_level(level):
28  begin
29    /* main loop */
30    while true do
31      winner = find_max_peg_per_level(level)
32      if winner < 0 then
33        break
34      alloc[winner][level] = peg[winner][level].alloc

35  pcp(perflimit):
36  begin
37    alloc = ucp()    /* start from utility-based partitioning */
38    while get_amat_ws() >= perflimit do
39      winner = find_max_peg_per_level(LLC)
40      if winner < 0 then
41        break
42      update_min_peg(perflimit, LLC)
43      up_winner = find_min_peg_per_level(LLC)
44      if peg[winner].delta_power < up_peg[winner].delta_power then
45        alloc[winner][LLC] = up_peg[up_winner].alloc
46      else
47        alloc[winner][LLC] = peg[winner][LLC].alloc

Algorithm 1: GCD MCR (cont'd)
48  update_max_peg(perflimit, curlevel):
49  begin
50    foreach cpu i do
51      /* calculate AMAT headroom */
52      AMAT_Headroom = get_delta_amat(i, perflimit) - AMAT due to cache resizing
53      /* get mpeg */
54      peg[i][curlevel] = get_max_peg(i, curlevel, AMAT_Headroom)

55  find_max_peg_per_level(level):
56  begin
57    result = -1, max_mpeg = 0
58    if ncores == 1 then
59      foreach cache i do
60        if max_mpeg < peg[0][i].mpeg then max_mpeg = peg[0][i].mpeg
61      if max_mpeg == peg[0][level].mpeg then result = 0
62    else
63      foreach cpu i do
64        if max_mpeg < peg[i][level].mpeg then
65          max_mpeg = peg[i][level].mpeg, result = i
66    return result

67  get_max_peg(curcore, curlevel, AMAT_Headroom):
68  begin
69    alloc = current allocation of curcore's curlevel cache
70    foreach available way-off i do
71      peg[i] = get_peg(curcore, curlevel, alloc, alloc - i, AMAT_Headroom)
72    winner = allocation with maximum PEG value
73    return winner

74  get_peg(curcore, curlevel, a, b, AMAT_Headroom):
75  begin
76    deltapower = sum of dynamic and static power changes due to change in misses when assigned ways decrease from a to b
77    deltaamat = increased AMAT due to change in misses when assigned ways decrease from a to b
78    if deltaamat < AMAT_Headroom then
79      result.mpeg = deltapower / deltaamat, result.alloc = b
80      result.delta_power = deltapower, result.delta_amat = deltaamat
81    else
82      result.mpeg = -1, result.alloc = 0
83      result.delta_power = result.delta_amat = 0
84    return result

85  get_delta_amat(curcore, perflimit):
86  begin
87    amat = calculate AMAT based on way counters of current core
88    /* calculate AMAT headroom */
89    return amat * perflimit / (1 - perflimit)

Before selecting the cache(s) to resize for a given caching level, we call update_max_peg() to determine each cache's maximum PEG value locally. In particular, we first compute the permitted AMAT by calling get_delta_amat() (line 52). Because we treat AMAT as a proxy for performance, get_delta_amat simply allows AMAT to change by the user-prescribed maximum performance degradation, perflimit. Since the user specifies perflimit relative to the maximum performance (i.e., assuming the maximum cache allocation), we subtract the AMAT already incurred by the current downsizing, yielding the remaining headroom, called AMAT_Headroom. Then, we invoke get_max_peg() to determine the maximum PEG value from further downsizing; in essence, this is the downsizing that achieves the maximum power savings without exceeding the AMAT headroom. Notice, the power savings calculation in get_peg() takes into consideration both the reduction in dynamic and static power due to having a smaller cache, as well as the increase in power consumption incurred at the next caching level due to additional cache misses from downsizing.

Among the per-cache maximum PEG values, we greedily select the largest for downsizing. Across private caches (i.e., the L1 and L2, and including the LLC for uniprocessors), this is done by calling cmp_mcr_per_level(). We pass into cmp_mcr_per_level the argument level, which specifies the current private caching level being considered for downsizing. Then, to identify the cache with the largest PEG value, we call find_max_peg_per_level(). For multicores, we only consider PEG values from the current caching level (lines 63-65), guaranteeing that a cache from level will be downsized as long as it contributes non-negative PEG. Moreover, this is performed from within a loop (line 30) so that multiple caches across different cores can be downsized per epoch to improve scalability. For uniprocessors, we relax this requirement and consider PEG from all caching levels (lines 59-60), allowing a smaller cache to block the current level from downsizing. But still, only the cache at level can downsize (line 61).

Lastly, besides downsizing private caches, GCD MCR also partitions and resizes the shared LLC, which is done by calling pcp(). As discussed in Section 2, pcp starts from the UCP partitioning of the current LLC capacity (line 37). Then, treating each LLC partition as if it were a separate cache, it uses find_max_peg_per_level to identify the partition with maximum PEG for potential downsizing (line 39), just like cmp_mcr_per_level does for private caches. But in addition, pcp also considers upsizing LLC partitions. Analogous to computing each LLC partition's maximum PEG that would result from downsizing, pcp calls update_min_peg() (line 42) to compute each LLC partition's minimum reciprocal of PEG that would result from upsizing. Then, pcp calls find_min_peg_per_level() (line 43) to select the smallest among these for potential upsizing. (Both update_min_peg and find_min_peg_per_level are very similar to update_max_peg and find_max_peg_per_level, except they compute the min across the reciprocal of PEG. So, we omit their pseudocode from Algorithm 1.) If the power increase incurred by upsizing is smaller than the power savings from downsizing, then GCD MCR commits to the upsizing candidate; otherwise, it commits to the downsizing candidate (lines 44-47). All of this is performed inside a loop (line 38) so that multiple LLC partitions can be downsized / upsized per epoch for scalability.

Low-pass filter. A critical design parameter in Algorithm 1 is the epoch size. For our experiments in Section 5, we employ fairly small epochs (see Section 4). This enables our GCD MCR technique to move towards the best configuration rapidly. However, fine-grain epochs can be susceptible to spurious and/or transient behavior that can be picked up by the way counters and cause GCD MCR to make poor resizing decisions. To address this, we average the new configuration computed by Algorithm 1 with the configuration from the previous epoch. Such averaging acts as a low-pass filter, resulting in a smoother movement towards the best configuration.
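
A minimal sketch of this smoothing step follows; the helper name and the plain arithmetic mean are our own illustration, since the text only states that the new and previous configurations are averaged.

def smooth_allocation(prev_ways, proposed_ways):
    # Average of the previous and newly proposed way counts, rounded; acts
    # as a simple low-pass filter on per-epoch resizing decisions.
    return round((prev_ways + proposed_ways) / 2)

# Example: a transient spike proposes dropping from 8 ways to 2; the
# filtered decision only moves to 5 ways this epoch.
print(smooth_allocation(8, 2))   # -> 5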

[Table 1: Storage overhead of shadow tags; overheads for the LLC are per core. For the L1, L2, and LLC it lists sets, ways, tag-entry size in bits (valid bit + tag bits + LRU bits: 1+29+3, 1+26+3, and 1+23+2, respectively), tag and data array sizes, baseline cache area, shadow tag sets, shadow tag and way counter storage, and the resulting area overhead.]

3.2 Software / Hardware Overheads

We assume the GCD MCR technique can be performed in software. In such an implementation, each core in the CPU would receive a periodic interrupt at every epoch boundary. Upon entering the interrupt handler, all the cores would read their local way counters and execute the GCD MCR pseudocode in Algorithm 1. Much of this algorithm can be executed in parallel (for example, computing the max-PEG value on each core) to mitigate its runtime overhead. After making the cache resizing decisions, the cores modify their local caches' configurations, and return from the interrupt handler. For our experiments in Section 5, we did not simulate the execution of such interrupt handlers. The main bottleneck in GCD MCR lies in searching over the different cache size configurations across different epochs. Comparatively, the overhead of the interrupt handlers for determining the next resizing decision is very small. In Section 5.3, we will present an analysis to estimate this overhead, which we find to be less than 0.3%.

The main source of hardware overhead is the shadow tags [15,35], which permit way counters to track hits and misses for different cache sizes. Shadow tag arrays are similar to regular cache structures except they have no data arrays. They require s × w × t bits, where s is the number of sampled sets, w is the number of cache ways, and t is the number of bits per tag entry. Table 1 reports the overhead numbers for the specific cache hierarchy we study later in Section 5. In particular, we sample one set out of every 32, and we assume a 42-bit physical address. Overall, the shadow tag arrays require 4034 bytes, which represents an increase of 0.15% in the storage requirement compared to the baseline caches. Besides the shadow tags, the way counters themselves require an additional 4 × w bits, and an adder for incrementing them.
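
The storage formula above is easy to sanity-check. The sketch below applies s × w × t directly; the tag-entry widths come from Table 1, while the set and way counts are assumptions we derived from the baseline parameters in Section 4, so the resulting totals are illustrative rather than a reproduction of Table 1's exact figures.

def shadow_tag_bytes(total_sets, ways, tag_entry_bits, sampling_ratio=32):
    # Bytes of shadow-tag storage when one of every `sampling_ratio` sets
    # is sampled (the text samples one set out of every 32).
    sampled_sets = total_sets // sampling_ratio
    return sampled_sets * ways * tag_entry_bits / 8

# Tag entry = valid + tag + LRU bits (1+29+3, 1+26+3, 1+23+2 per Table 1).
l1  = shadow_tag_bytes(total_sets=64,   ways=8, tag_entry_bits=1 + 29 + 3)
l2  = shadow_tag_bytes(total_sets=512,  ways=8, tag_entry_bits=1 + 26 + 3)
llc = shadow_tag_bytes(total_sets=8192, ways=4, tag_entry_bits=1 + 23 + 2)  # per core
print(l1, l2, llc)   # 66.0 480.0 3328.0 bytes under these assumptions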

Another source of hardware overhead is reconfigurable caches. We assume all caches in the hierarchy, except for the L1 I-cache (which we assume is fixed), are reconfigurable. In particular, we employ selective ways [1] in the L2 cache and LLC, and both selective sets and ways [46] in the L1 cache. For selective ways, we assume reconfiguration in increments of a cache way, from 1 way up to the cache's full associativity; and for selective sets, we assume a certain number of power-of-two set configurations. Hence, for the cache hierarchy in Table 1, there are 12, 8, and 32 different configurations for the L1, L2, and LLC, respectively, assuming 8 cores. While each cache's access delay also changes across different configurations, we assume a constant number of CPU cycles to access each cache, chosen to handle that cache's worst-case access delay (i.e., with all ways and sets enabled).

When upsizing caches, we assume 2/4/7 cycles to power up and reset newly added portions of cache for the L1/L2/LLC, respectively. When downsizing, we walk the removed portions of cache to flush their contents. Clean cache blocks are discarded after checking upstream caches to maintain inclusion. Dirty cache blocks not only check upstream caches but also write back to the next-lower level. We assume these operations are pipelined such that flushing takes 1 cycle per walked cache block. Downsized ways are selected in reverse way-ID order. Because we do not physically move cache blocks once they are filled, the flushed cache blocks have an equal probability of being at any position in the LRU stack. Moreover, we do not attempt to reconstruct the per-set LRU stacks after flushing.

In order to estimate power consumption as a function of the cache configuration and the way counter values, our technique requires the per-access and leakage energies (derived from CACTI). We provide configurable registers to store these energy values.

4 Experimental Methodology

Performance simulation. We modified the SimpleScalar v4.0 simulator for the Alpha ISA [6] to conduct our study. Table 2 shows our baseline processor configuration. We use a modest 2-way issue out-of-order core that achieves good power efficiency. The cores are attached to a three-level cache hierarchy with a split 8-way 32KB L1 private cache, a unified 256KB L2 private cache, and a shared last-level cache (LLC). The LLC provides 2MB for a single core, 4MB for two cores, 8MB for 4 cores, and 16MB for 8 cores, and its associativity increases by 4 ways for each additional core. The cache block size is 64 bytes for all caches. The baseline cache hierarchy maintains a non-inclusive policy. For main memory, we model 32GB of low-power DDR3 DRAM with a 140ns access latency [8].

To facilitate our multicore experiments, we created a parallel version of the SimpleScalar simulator.

Table 2: Architectural parameters of the baseline configuration.
  Cores: 2.0 GHz, 2-way issue, out-of-order, 64-entry ROB, 24-entry LSQ; Gshare/bimodal hybrid branch predictor, 2048-entry meta table, 512-entry BTB
  L1 I-Cache: 32 KB, 2-way, 64-byte blocks, 1 cycle
  L1 D-Cache: 32 KB, 2 ports, 8-way, 64-byte blocks, 4 cycles
  L2 Unified Cache: 256 KB, 8-way, 64-byte blocks, 7 cycles
  L3 Shared Cache: 2MB and 4 ways per core (up to 16 MB, 32-way), 64-byte blocks, 19 cycles; non-inclusive, bus-type interconnect
  Memory: 32 GB Low Voltage Quad-rank RDIMM DDR3-800, 140ns load latency

In particular, we execute multiple copies of the simulator as separate UNIX processes, one per core, and synchronize the processes through sockets to simulate inter-core interactions. One important interaction occurs at epoch boundaries to execute our MCR algorithm and resize caches. Hence, we synchronize all of the simulator processes once per epoch to run our MCR technique. In addition to these per-epoch interactions, inter-core interactions potentially occur on LLC accesses as well. Although there are no cache coherence actions (since we run multiprogrammed workloads), there can be contention for the LLC and main memory. For our workloads, we find contention is extremely low. So, our simulator assumes LLC and main memory accesses always incur the contention-free access latencies in Table 2 to avoid process synchronizations on every LLC access. In Section 5.4, we will validate this assumption.

A critical design consideration for our GCD MCR technique is the rate at which caches are resized. Higher resizing frequencies enable greater adaptivity, but as discussed in Section 2, they cannot be so high that resized caches do not reach steady state, which is capacity dependent, before the next resizing decision is made. For the cache capacities in Table 2, we tried several frequencies and found that a baseline epoch size, i.e., the resizing frequency for the L1, of 200K cycles works well (EPOCH_SIZE = 200K in Algorithm 1). Moreover, we resize the L2 every 5 epochs and the LLC every 50 epochs (L2_FREQ = 5 and LLC_FREQ = 50, respectively, in Algorithm 1). Lastly, the reset frequency to start a new search, i.e., RESET_FREQ in Algorithm 1, is 125 epochs.

Finally, as mentioned in Section 3.2, the resizing decisions (i.e., via execution of our GCD MCR code) are performed in our simulator. We do not simulate the interrupt handlers mentioned in Section 3.2, so we don't explicitly account for their runtime overheads. (Our experiments also consider off-line techniques, which will be described in Section 5, but these are not impacted by this issue.) Not only does this affect our on-line GCD MCR technique, but it also affects our baseline multicore CPU, which employs UCP to partition the shared LLC. (The UCP technique would also invoke interrupt handlers to compute new LLC partitionings across different epochs.) Section 5.3 will present an analysis of this code's overhead, which we find to be very small.

Power simulation. We use McPAT [24] and CACTI 6.5 [29] for power modeling.

Table 3: Cache parameters for the baseline multi-level caches.
  L1 Cache: access type (tag array / data array / data array h-tree) Parallel / Parallel / Parallel; read/write energy per access ... nJ; subthreshold/gate leakage 41.5 / 22.3 mW
  L2 Cache: access type Parallel / Parallel / Serial; read/write energy per access ... nJ; subthreshold/gate leakage ... / 78.5 mW
  L3 Cache: access type Serial / Serial / Serial; read/write energy per access 4.38 / 6.6 nJ; (standby) subthreshold/gate leakage ... / 1666 mW

Our baseline model uses the 32nm technology node and ITRS high-performance devices. To manage static power consumption, we employ a state-of-the-art circuit- and device-level static power reduction technique in the shared LLC. Specifically, we assume high-Vt devices throughout [18], but apply reverse body bias (RBB) in standby mode to reduce standby leakage [41]. When an access occurs, we apply a forward body bias (FBB) to restore the threshold voltage for low access delay. We assume that applying FBB does not impact the access delay for the cache [41]. We utilize the stack effect in conjunction with ABB to model way selection [36,41]. We use the Model for Assessment of CMOS Technologies And Roadmaps (MASTAR 2011) from ITRS [28] to derive the parameters required for CACTI according to our assumptions. Table 3 summarizes the cache parameters used, and the resulting energy numbers for power modeling.

Benchmarks. We used the SPEC CPU2006 benchmarks for our study. Because we use a version of SimpleScalar targeted for Alpha/Linux, we compiled the SPEC CPU2006 benchmarks in a native Linux environment. We installed an Alpha CPU emulator [10] on a Windows PC, and then installed the Linux (Debian Lenny) system on the emulator. We compiled the benchmarks inside the emulator using an Alpha gcc compiler. All benchmarks were built with the -O2 option. Among all of the benchmarks in the suite, we were unable to compile 447.dealII. Moreover, one integer benchmark (403.gcc) and five floating point benchmarks (416.gamess, 433.milc, 450.soplex, 465.tonto, and 481.wrf) either did not complete, or completed with incorrect outputs. Otherwise, we simulated the remaining benchmarks, 22 in total (11 integer and 11 floating point), as shown in Table 4.

Using the reference inputs, all of the compiled benchmarks were run to completion on the SimPoint tool [13]. We take the most representative simpoint, consisting of 1B instructions per benchmark. In our experiments, we begin simulation 100M instructions prior to the simpoint to warm up the caches, and then we simulate the next 800M cycles after cache warmup to acquire detailed statistics.

Table 4: Benchmark classification based on misses per kilo-instruction (MPKI) for a 2MB LLC.
  High (h4-h0): Mcf 51, Libquantum 29, Lbm 22, Omnetpp 15, GemsFDTD 14
  Medium (m6-m0): Leslie3d 9.5, Sphinx3 8.2, Xalan 6.8, Bwaves 4.8, Zeusmp 4.1, CactusADM 2.3, Bzip2 1.9
  Low (l9-l0): Astar 0.82, Perlbench 0.69, Hmmer 0.66, H264ref 0.54, Sjeng 0.27, Gobmk 0.2, Calculix 0.2, Gromacs 0.13, Namd 0.07, Povray 0.03

Because benchmarks' IPCs differ, we simulate a fixed cycle count within each 1B-instruction simpoint instead of a fixed instruction count, to ensure all benchmarks from a given multiprogrammed workload are active simultaneously.

The SPEC CPU2006 benchmarks are used for both our uniprocessor and multicore experiments. Because these benchmarks are sequential, we formed multiprogrammed workloads to drive the multicore study (see Section 5.2). Multiprogrammed workloads generate very diverse behaviors across different cores' caches, so they tend to be more challenging for cache resizing techniques. At the same time, however, because we do not run experiments with parallel workloads, our current results do not reflect how our techniques would perform for multithreaded programs. We leave this important topic for future work.

5 Experimental Evaluation

We now conduct an evaluation of our GCD MCR technique. Section 5.1 begins by studying single-benchmark performance. Then, Section 5.2 presents our main multicore results. Finally, Section 5.4 validates our simulator's assumption of low LLC contention.

5.1 Single-Benchmark Results

We first focus on the performance of individual benchmarks, highlighting how effectively GCD MCR coordinates resizing vertically across caching levels. This also provides a baseline for interpreting the main multicore results later in Section 5.2. Not only do we present experiments for GCD MCR, we also compare it against several off-line algorithms to better understand the limits of our technique.

Off-line analysis. One of the off-line algorithms we consider is the static optimal. To facilitate this study, we run each benchmark on every possible cache

hierarchy configuration to find the best static solution (we do not permit reconfigurations during a run). For the parameters in Table 2, there are 384 unique configurations per benchmark and 8,448 configurations across all 22 benchmarks in the uniprocessor case (i.e., with a 4-way 2MB LLC), which is a large but feasible design space to simulate exhaustively. In our work, the goal is to maximize power savings with negligible performance degradation. So, we search for the configuration with the largest power savings, but with no more than 1% performance degradation compared to the baseline, i.e., the configuration in Table 2.

In addition to the static optimal, we also implement off-line versions of the Nelder-Mead (NM) simplex method mentioned earlier, as well as our GCD MCR technique. These off-line techniques use the exhaustive simulations from the static optimal. In particular, we run the NM and GCD MCR algorithms on the exhaustive simulation results, feeding to each technique the information from the simulations that drives their search algorithms (IPC and power for the NM method, and cache way counts for GCD MCR). Each technique is run only once, and the solution found upon convergence is reported. These results represent static versions of the NM method and GCD MCR algorithm; in other words, they do not incur any on-line related overheads, but at the same time, they also cannot adapt to dynamic program behavior.

Figure 2 shows the total system power savings achieved by the static optimal (labeled SO), the off-line NM method (labeled NM-Off), and the off-line GCD MCR algorithm (labeled GCD-Off) for each of our 22 SPEC benchmarks. All results are normalized to the power consumption of the baseline configuration. As Figure 2 shows, SO can save quite a bit of power for many benchmarks, providing an average savings of 15.4% across the entire suite. This demonstrates significant headroom for our techniques to save power within individual benchmarks by resizing multiple caches vertically. NM-Off also achieves good power savings, even matching SO in a few cases. While GCD-Off is not quite as good, it is still very close to NM-Off. On average, NM-Off saves 9.0% of the power consumption whereas GCD-Off saves 8.8%. So, GCD MCR's solution quality is comparable to NM. But a significant advantage of GCD MCR is its fast convergence rate. Figure 3 reports the number of iterations required by NM-Off and GCD-Off to converge. Our GCD MCR technique is superior in every benchmark. On average, it converges in 3.4 iterations whereas NM requires 19.5 iterations.

On-line analysis. Figure 2 also reports the total system power achieved by our on-line GCD MCR algorithm, normalized to the power consumption of the baseline configuration. In Figure 2, we can see that the on-line GCD MCR algorithm beats the off-line version in every benchmark except for libquantum, bwaves, namd, and calculix. Averaged across all 22 benchmarks, on-line GCD MCR saves 13.4% of the total system power compared to 8.8% for the off-line version. In fact, the on-line GCD MCR algorithm comes very close to the 15.4% power savings provided by the static optimal.
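
For reference, the exhaustive off-line selection described above amounts to a brute-force scan of the cross-product of per-cache configurations. The sketch below is a hypothetical harness for that scan (simulate() and the configuration lists are placeholders, not the authors' tooling); reading the counts in the text as 12 L1, 8 L2, and 4 uniprocessor-LLC configurations gives the 384-point space per benchmark.

from itertools import product

def static_optimal(l1_cfgs, l2_cfgs, llc_cfgs, simulate, baseline_cfg,
                   perf_limit=0.01):
    # Return the configuration with the lowest power among those whose
    # performance degradation relative to the baseline is at most perf_limit.
    base_perf, _ = simulate(baseline_cfg)
    best_cfg, best_power = None, float("inf")
    for cfg in product(l1_cfgs, l2_cfgs, llc_cfgs):
        perf, power = simulate(cfg)
        degradation = (base_perf - perf) / base_perf
        if degradation <= perf_limit and power < best_power:
            best_cfg, best_power = cfg, power
    return best_cfg, best_power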

[Figure 2: Power consumption of the static optimal (SO), off-line NM (NM-Off), static GCD MCR (GCD-Off), and on-line GCD MCR for single benchmarks. All results are normalized to the baseline configuration.]

[Figure 3: NM-Off and GCD-Off convergence rates for single benchmarks.]

Although on-line GCD MCR incurs runtime overhead, it can also adapt to dynamic behavior, which gives it an advantage over the off-line techniques. In terms of performance degradation, our on-line GCD MCR algorithm is generally worse than the static techniques, but it still maintains reasonable performance. Figure 4 shows the performance degradation for both the off-line and on-line GCD MCR algorithms relative to the baseline configuration for each of our benchmarks. As Figure 4 shows, GCD-Off is within the 1% performance degradation target for every benchmark. In contrast, the on-line algorithm exhibits worse performance for many benchmarks, and fails to achieve the 1% target for gromacs and namd. Because the on-line technique relies on prediction rather than perfect off-line information, it can make decisions that lead to unanticipated performance loss. But on average, on-line GCD MCR's performance is still very good: only 0.45% worse than the baseline.

Lastly, Figure 5 compares the cache power consumption of our on-line GCD MCR algorithm against the baseline configuration. As we can see in Figure 5, our technique gets power savings from all levels of the cache hierarchy. In particular, on-line GCD MCR significantly reduces the L1 cache power, especially the dynamic part (labeled L1D). On average, L1 dynamic power is reduced by 54.2% compared to the baseline. On-line GCD MCR also significantly reduces the L2 cache power, especially the static part (labeled L2S). On average, L2 static power is reduced by 54.8% compared to the baseline. And, on-line GCD MCR also reduces the LLC cache power, especially the static part (labeled LLCS). On average, the LLC static power is reduced by

19.8% compared to the baseline. (The LLC power savings are more modest because the baseline LLC already employs an aggressive static power reduction technique; see Section 4.) These results show our technique effectively resizes all caching levels (vertically) in order to maximize the overall power savings.

[Figure 4: Percent performance degradation of off-line GCD MCR (GCD-Off) and on-line GCD MCR relative to the baseline configuration for single benchmarks.]

[Figure 5: Cache power consumption breakdown of the baseline configuration and on-line GCD MCR for single benchmarks (perlbench, bzip2, mcf, hmmer, libquantum, omnetpp, xalan, zeusmp, cactusadm, namd, calculix, lbm, gobmk, sjeng, h264ref, astar, bwaves, gromacs, leslie3d, povray, GemsFDTD, sphinx3, and the average), broken down into L1D, L1S, L2D, L2S, LLCD, and LLCS.]

5.2 Multicore Results

We now present our main results: the GCD MCR technique running on multicore CPUs. To facilitate this study, we generated multiprogrammed workloads by randomly combining the benchmarks from Table 4. We created 20 2-core workloads, 20 4-core workloads, and 20 8-core workloads. Table 5 lists the names of the workloads, and specifies their constituent benchmarks using the abbreviations from Table 4.

Off-line algorithms. Similar to Section 5.1, we begin our multicore study by developing off-line algorithms, in particular the static optimal, to determine the limits of our technique. A significant challenge this time, though, is that there are many more configurations. As we scale core count, we also increase


More information

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor

Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor Performance Characterization of SPEC CPU Benchmarks on Intel's Core Microarchitecture based processor Sarah Bird ϕ, Aashish Phansalkar ϕ, Lizy K. John ϕ, Alex Mericas α and Rajeev Indukuru α ϕ University

More information

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell

More information

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Joel Hestness jthestness@uwalumni.com Lenni Kuff lskuff@uwalumni.com Computer Science Department University of

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

Optimizing SMT Processors for High Single-Thread Performance

Optimizing SMT Processors for High Single-Thread Performance University of Maryland Inistitute for Advanced Computer Studies Technical Report UMIACS-TR-2003-07 Optimizing SMT Processors for High Single-Thread Performance Gautham K. Dorai, Donald Yeung, and Seungryul

More information

Memory Mapped ECC Low-Cost Error Protection for Last Level Caches. Doe Hyun Yoon Mattan Erez

Memory Mapped ECC Low-Cost Error Protection for Last Level Caches. Doe Hyun Yoon Mattan Erez Memory Mapped ECC Low-Cost Error Protection for Last Level Caches Doe Hyun Yoon Mattan Erez 1-Slide Summary Reliability issues in caches Increasing soft error rate (SER) Cost increases with error protection

More information

Improving Cache Performance using Victim Tag Stores

Improving Cache Performance using Victim Tag Stores Improving Cache Performance using Victim Tag Stores SAFARI Technical Report No. 2011-009 Vivek Seshadri, Onur Mutlu, Todd Mowry, Michael A Kozuch {vseshadr,tcm}@cs.cmu.edu, onur@cmu.edu, michael.a.kozuch@intel.com

More information

Energy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012

Energy Proportional Datacenter Memory. Brian Neel EE6633 Fall 2012 Energy Proportional Datacenter Memory Brian Neel EE6633 Fall 2012 Outline Background Motivation Related work DRAM properties Designs References Background The Datacenter as a Computer Luiz André Barroso

More information

ECE/CS 757: Homework 1

ECE/CS 757: Homework 1 ECE/CS 757: Homework 1 Cores and Multithreading 1. A CPU designer has to decide whether or not to add a new micoarchitecture enhancement to improve performance (ignoring power costs) of a block (coarse-grain)

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip

More information

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review

Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Relative Performance of a Multi-level Cache with Last-Level Cache Replacement: An Analytic Review Bijay K.Paikaray Debabala Swain Dept. of CSE, CUTM Dept. of CSE, CUTM Bhubaneswer, India Bhubaneswer, India

More information

A Front-end Execution Architecture for High Energy Efficiency

A Front-end Execution Architecture for High Energy Efficiency A Front-end Execution Architecture for High Energy Efficiency Ryota Shioya, Masahiro Goshima and Hideki Ando Department of Electrical Engineering and Computer Science, Nagoya University, Aichi, Japan Information

More information

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction

Chapter 6 Memory 11/3/2015. Chapter 6 Objectives. 6.2 Types of Memory. 6.1 Introduction Chapter 6 Objectives Chapter 6 Memory Master the concepts of hierarchical memory organization. Understand how each level of memory contributes to system performance, and how the performance is measured.

More information

DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture

DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture DASCA: Dead Write Prediction Assisted STT-RAM Cache Architecture Junwhan Ahn *, Sungjoo Yoo, and Kiyoung Choi * junwhan@snu.ac.kr, sungjoo.yoo@postech.ac.kr, kchoi@snu.ac.kr * Department of Electrical

More information

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables Storage Efficient Hardware Prefetching using Correlating Prediction Tables Marius Grannaes Magnus Jahre Lasse Natvig Norwegian University of Science and Technology HiPEAC European Network of Excellence

More information

562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016

562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016 562 IEEE TRANSACTIONS ON COMPUTERS, VOL. 65, NO. 2, FEBRUARY 2016 Memory Bandwidth Management for Efficient Performance Isolation in Multi-Core Platforms Heechul Yun, Gang Yao, Rodolfo Pellizzoni, Member,

More information

LACS: A Locality-Aware Cost-Sensitive Cache Replacement Algorithm

LACS: A Locality-Aware Cost-Sensitive Cache Replacement Algorithm 1 LACS: A Locality-Aware Cost-Sensitive Cache Replacement Algorithm Mazen Kharbutli and Rami Sheikh (Submitted to IEEE Transactions on Computers) Mazen Kharbutli is with Jordan University of Science and

More information

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES Shashikiran H. Tadas & Chaitali Chakrabarti Department of Electrical Engineering Arizona State University Tempe, AZ, 85287. tadas@asu.edu, chaitali@asu.edu

More information

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals

Cache Memory COE 403. Computer Architecture Prof. Muhamed Mudawar. Computer Engineering Department King Fahd University of Petroleum and Minerals Cache Memory COE 403 Computer Architecture Prof. Muhamed Mudawar Computer Engineering Department King Fahd University of Petroleum and Minerals Presentation Outline The Need for Cache Memory The Basics

More information

Scalable Dynamic Task Scheduling on Adaptive Many-Cores

Scalable Dynamic Task Scheduling on Adaptive Many-Cores Introduction: Many- Paradigm [Our Definition] Scalable Dynamic Task Scheduling on Adaptive Many-s Vanchinathan Venkataramani, Anuj Pathania, Muhammad Shafique, Tulika Mitra, Jörg Henkel Bus CES Chair for

More information

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Wenun Wang and Wei-Ming Lin Department of Electrical and Computer Engineering, The University

More information

Filtered Runahead Execution with a Runahead Buffer

Filtered Runahead Execution with a Runahead Buffer Filtered Runahead Execution with a Runahead Buffer ABSTRACT Milad Hashemi The University of Texas at Austin miladhashemi@utexas.edu Runahead execution dynamically expands the instruction window of an out

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department SIP: Speculative Insertion Policy for High Performance Caching Hongil Yoon Tan Zhang Mikko H. Lipasti Technical Report #1676 June 2010 SIP: Speculative Insertion Policy for

More information

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1 Available online at www.sciencedirect.com Physics Procedia 33 (2012 ) 1029 1035 2012 International Conference on Medical Physics and Biomedical Engineering Memory Performance Characterization of SPEC CPU2006

More information

DynRBLA: A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories

DynRBLA: A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories SAFARI Technical Report No. 2-5 (December 6, 2) : A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories HanBin Yoon hanbinyoon@cmu.edu Justin Meza meza@cmu.edu

More information

Energy-centric DVFS Controlling Method for Multi-core Platforms

Energy-centric DVFS Controlling Method for Multi-core Platforms Energy-centric DVFS Controlling Method for Multi-core Platforms Shin-gyu Kim, Chanho Choi, Hyeonsang Eom, Heon Y. Yeom Seoul National University, Korea MuCoCoS 2012 Salt Lake City, Utah Abstract Goal To

More information

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Fazal Hameed and Jeronimo Castrillon Center for Advancing Electronics Dresden (cfaed), Technische Universität Dresden,

More information

WADE: Writeback-Aware Dynamic Cache Management for NVM-Based Main Memory System

WADE: Writeback-Aware Dynamic Cache Management for NVM-Based Main Memory System WADE: Writeback-Aware Dynamic Cache Management for NVM-Based Main Memory System ZHE WANG, Texas A&M University SHUCHANG SHAN, Chinese Institute of Computing Technology TING CAO, Australian National University

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES

JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES JIGSAW: SCALABLE SOFTWARE-DEFINED CACHES NATHAN BECKMANN AND DANIEL SANCHEZ MIT CSAIL PACT 13 - EDINBURGH, SCOTLAND SEP 11, 2013 Summary NUCA is giving us more capacity, but further away 40 Applications

More information

Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era

Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era Minimalist Open-page: A DRAM Page-mode Scheduling Policy for the Many-core Era Dimitris Kaseridis Electrical and Computer Engineering The University of Texas at Austin Austin, TX, USA kaseridis@mail.utexas.edu

More information

Staged Memory Scheduling

Staged Memory Scheduling Staged Memory Scheduling Rachata Ausavarungnirun, Kevin Chang, Lavanya Subramanian, Gabriel H. Loh*, Onur Mutlu Carnegie Mellon University, *AMD Research June 12 th 2012 Executive Summary Observation:

More information

Area-Efficient Error Protection for Caches

Area-Efficient Error Protection for Caches Area-Efficient Error Protection for Caches Soontae Kim Department of Computer Science and Engineering University of South Florida, FL 33620 sookim@cse.usf.edu Abstract Due to increasing concern about various

More information

An Application-Oriented Approach for Designing Heterogeneous Network-on-Chip

An Application-Oriented Approach for Designing Heterogeneous Network-on-Chip An Application-Oriented Approach for Designing Heterogeneous Network-on-Chip Technical Report CSE-11-7 Monday, June 13, 211 Asit K. Mishra Department of Computer Science and Engineering The Pennsylvania

More information

Power Measurement Using Performance Counters

Power Measurement Using Performance Counters Power Measurement Using Performance Counters October 2016 1 Introduction CPU s are based on complementary metal oxide semiconductor technology (CMOS). CMOS technology theoretically only dissipates power

More information

Predicting Performance Impact of DVFS for Realistic Memory Systems

Predicting Performance Impact of DVFS for Realistic Memory Systems Predicting Performance Impact of DVFS for Realistic Memory Systems Rustam Miftakhutdinov Eiman Ebrahimi Yale N. Patt The University of Texas at Austin Nvidia Corporation {rustam,patt}@hps.utexas.edu ebrahimi@hps.utexas.edu

More information

Performance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University

Performance. Jin-Soo Kim Computer Systems Laboratory Sungkyunkwan University Performance Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Defining Performance (1) Which airplane has the best performance? Boeing 777 Boeing

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance

More information

A Comprehensive Scheduler for Asymmetric Multicore Systems

A Comprehensive Scheduler for Asymmetric Multicore Systems A Comprehensive Scheduler for Asymmetric Multicore Systems Juan Carlos Saez Manuel Prieto Complutense University, Madrid, Spain {jcsaezal,mpmatias}@pdi.ucm.es Alexandra Fedorova Sergey Blagodurov Simon

More information

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3.

I, J A[I][J] / /4 8000/ I, J A(J, I) Chapter 5 Solutions S-3. 5 Solutions Chapter 5 Solutions S-3 5.1 5.1.1 4 5.1.2 I, J 5.1.3 A[I][J] 5.1.4 3596 8 800/4 2 8 8/4 8000/4 5.1.5 I, J 5.1.6 A(J, I) 5.2 5.2.1 Word Address Binary Address Tag Index Hit/Miss 5.2.2 3 0000

More information

arxiv: v1 [cs.ar] 1 Feb 2016

arxiv: v1 [cs.ar] 1 Feb 2016 SAFARI Technical Report No. - Enabling Efficient Dynamic Resizing of Large DRAM Caches via A Hardware Consistent Hashing Mechanism Kevin K. Chang, Gabriel H. Loh, Mithuna Thottethodi, Yasuko Eckert, Mike

More information

A Bandwidth-aware Memory-subsystem Resource Management using. Non-invasive Resource Profilers for Large CMP Systems

A Bandwidth-aware Memory-subsystem Resource Management using. Non-invasive Resource Profilers for Large CMP Systems A Bandwidth-aware Memory-subsystem Resource Management using Non-invasive Resource Profilers for Large CMP Systems Dimitris Kaseridis, Jeffrey Stuecheli, Jian Chen and Lizy K. John Department of Electrical

More information

Flexible Cache Error Protection using an ECC FIFO

Flexible Cache Error Protection using an ECC FIFO Flexible Cache Error Protection using an ECC FIFO Doe Hyun Yoon and Mattan Erez Dept Electrical and Computer Engineering The University of Texas at Austin 1 ECC FIFO Goal: to reduce on-chip ECC overhead

More information

ChargeCache. Reducing DRAM Latency by Exploiting Row Access Locality

ChargeCache. Reducing DRAM Latency by Exploiting Row Access Locality ChargeCache Reducing DRAM Latency by Exploiting Row Access Locality Hasan Hassan, Gennady Pekhimenko, Nandita Vijaykumar, Vivek Seshadri, Donghyuk Lee, Oguz Ergin, Onur Mutlu Executive Summary Goal: Reduce

More information

Architecture Tuning Study: the SimpleScalar Experience

Architecture Tuning Study: the SimpleScalar Experience Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.

More information

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases IEEE TRANSACTIONS ON COMPUTERS, VOL. 63, NO. 5, MAY 2014 1061 A Cool Scheduler for Multi-Core Systems Exploiting Program Phases Zhiming Zhang and J. Morris Chang, Senior Member, IEEE Abstract Rapid growth

More information

A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors

A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid Cache in 3D chip Multi-processors , July 4-6, 2018, London, U.K. A Spherical Placement and Migration Scheme for a STT-RAM Based Hybrid in 3D chip Multi-processors Lei Wang, Fen Ge, Hao Lu, Ning Wu, Ying Zhang, and Fang Zhou Abstract As

More information

Improving Virtual Machine Scheduling in NUMA Multicore Systems

Improving Virtual Machine Scheduling in NUMA Multicore Systems Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore

More information

Hybrid Cache Architecture (HCA) with Disparate Memory Technologies

Hybrid Cache Architecture (HCA) with Disparate Memory Technologies Hybrid Cache Architecture (HCA) with Disparate Memory Technologies Xiaoxia Wu, Jian Li, Lixin Zhang, Evan Speight, Ram Rajamony, Yuan Xie Pennsylvania State University IBM Austin Research Laboratory Acknowledgement:

More information

Thesis Defense Lavanya Subramanian

Thesis Defense Lavanya Subramanian Providing High and Predictable Performance in Multicore Systems Through Shared Resource Management Thesis Defense Lavanya Subramanian Committee: Advisor: Onur Mutlu Greg Ganger James Hoe Ravi Iyer (Intel)

More information

Micro-sector Cache: Improving Space Utilization in Sectored DRAM Caches

Micro-sector Cache: Improving Space Utilization in Sectored DRAM Caches Micro-sector Cache: Improving Space Utilization in Sectored DRAM Caches Mainak Chaudhuri Mukesh Agrawal Jayesh Gaur Sreenivas Subramoney Indian Institute of Technology, Kanpur 286, INDIA Intel Architecture

More information

Scheduling the Intel Core i7

Scheduling the Intel Core i7 Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne

More information

CloudCache: Expanding and Shrinking Private Caches

CloudCache: Expanding and Shrinking Private Caches CloudCache: Expanding and Shrinking Private Caches Hyunjin Lee, Sangyeun Cho, and Bruce R. Childers Computer Science Department, University of Pittsburgh {abraham,cho,childers}@cs.pitt.edu Abstract The

More information

MLP Aware Heterogeneous Memory System

MLP Aware Heterogeneous Memory System MLP Aware Heterogeneous Memory System Sujay Phadke and Satish Narayanasamy University of Michigan, Ann Arbor {sphadke,nsatish}@umich.edu Abstract Main memory plays a critical role in a computer system

More information

Evaluating Private vs. Shared Last-Level Caches for Energy Efficiency in Asymmetric Multi-Cores

Evaluating Private vs. Shared Last-Level Caches for Energy Efficiency in Asymmetric Multi-Cores Evaluating Private vs. Shared Last-Level Caches for Energy Efficiency in Asymmetric Multi-Cores Anthony Gutierrez Adv. Computer Architecture Lab. University of Michigan EECS Dept. Ann Arbor, MI, USA atgutier@umich.edu

More information

Efficient Memory Shadowing for 64-bit Architectures

Efficient Memory Shadowing for 64-bit Architectures Efficient Memory Shadowing for 64-bit Architectures The MIT Faculty has made this article openly available. Please share how this access benefits you. Your story matters. Citation Qin Zhao, Derek Bruening,

More information

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency

Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Linearly Compressed Pages: A Main Memory Compression Framework with Low Complexity and Low Latency Gennady Pekhimenko, Vivek Seshadri, Yoongu Kim, Hongyi Xin, Onur Mutlu, Todd C. Mowry Phillip B. Gibbons,

More information

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2)

The Memory Hierarchy. Cache, Main Memory, and Virtual Memory (Part 2) The Memory Hierarchy Cache, Main Memory, and Virtual Memory (Part 2) Lecture for CPSC 5155 Edward Bosworth, Ph.D. Computer Science Department Columbus State University Cache Line Replacement The cache

More information

Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads

Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads Milad Hashemi, Onur Mutlu, Yale N. Patt The University of Texas at Austin ETH Zürich ABSTRACT Runahead execution pre-executes

More information

Dynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors

Dynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors Dynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors Jie Meng, Tiansheng Zhang, and Ayse K. Coskun Electrical and Computer Engineering Department, Boston University,

More information

Adapted from David Patterson s slides on graduate computer architecture

Adapted from David Patterson s slides on graduate computer architecture Mei Yang Adapted from David Patterson s slides on graduate computer architecture Introduction Ten Advanced Optimizations of Cache Performance Memory Technology and Optimizations Virtual Memory and Virtual

More information

Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems

Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems Reetuparna Das Rachata Ausavarungnirun Onur Mutlu Akhilesh Kumar Mani Azimi University of Michigan Carnegie

More information

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang

CHAPTER 6 Memory. CMPS375 Class Notes (Chap06) Page 1 / 20 Dr. Kuo-pao Yang CHAPTER 6 Memory 6.1 Memory 341 6.2 Types of Memory 341 6.3 The Memory Hierarchy 343 6.3.1 Locality of Reference 346 6.4 Cache Memory 347 6.4.1 Cache Mapping Schemes 349 6.4.2 Replacement Policies 365

More information

Chapter 8 & Chapter 9 Main Memory & Virtual Memory

Chapter 8 & Chapter 9 Main Memory & Virtual Memory Chapter 8 & Chapter 9 Main Memory & Virtual Memory 1. Various ways of organizing memory hardware. 2. Memory-management techniques: 1. Paging 2. Segmentation. Introduction Memory consists of a large array

More information

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming

MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM. B649 Parallel Architectures and Programming MULTIPROCESSORS AND THREAD-LEVEL PARALLELISM B649 Parallel Architectures and Programming Motivation behind Multiprocessors Limitations of ILP (as already discussed) Growing interest in servers and server-performance

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

PiCL: a Software-Transparent, Persistent Cache Log for Nonvolatile Main Memory

PiCL: a Software-Transparent, Persistent Cache Log for Nonvolatile Main Memory PiCL: a Software-Transparent, Persistent Cache Log for Nonvolatile Main Memory Tri M. Nguyen Department of Electrical Engineering Princeton University Princeton, USA trin@princeton.edu David Wentzlaff

More information

Spatial Locality-Aware Cache Partitioning for Effective Cache Sharing

Spatial Locality-Aware Cache Partitioning for Effective Cache Sharing Spatial Locality-Aware Cache Partitioning for Effective Cache Sharing Saurabh Gupta Oak Ridge National Laboratory Oak Ridge, USA guptas@ornl.gov Abstract In modern multi-core processors, last-level caches

More information

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 02 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 2.1 The levels in a typical memory hierarchy in a server computer shown on top (a) and in

More information

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.

Lecture: SMT, Cache Hierarchies. Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2. Lecture: SMT, Cache Hierarchies Topics: memory dependence wrap-up, SMT processors, cache access basics and innovations (Sections B.1-B.3, 2.1) 1 Problem 0 Consider the following LSQ and when operands are

More information