PACT: Priority-Aware Phase-based Cache Tuning for Embedded Systems

Sam Gianelli and Tosiron Adegbija
Department of Electrical & Computer Engineering
University of Arizona, Tucson, AZ, USA
{samjgianelli,

Abstract: Due to the cache's significant impact on an embedded system, much research has focused on cache optimizations, such as reduced energy consumption or improved performance. However, throughout an embedded system's lifetime, the system may have different optimization priorities, due to variable operating conditions and requirements. Variable optimization priorities, embedded systems' stringent design constraints, and the fact that applications typically have execution phases with varying runtime resource requirements necessitate new robust optimization techniques that can dynamically adapt to different optimization goals. In this paper, we present priority-aware phase-based cache tuning (PACT), which tunes an embedded system's cache at runtime in order to dynamically adapt the cache configurations to varying optimization goals (specifically EDP, energy, and execution time), application execution phases, and operating conditions, while accruing minimal runtime overheads.

Index Terms: Configurable memory, design space exploration, cache tuning, cache memories, low-power design, low-power embedded systems, adaptable hardware

I. INTRODUCTION AND MOTIVATION

Caches contribute significantly to an embedded system's energy consumption and performance. As a result, much optimization research focuses on optimizations that improve the cache's energy efficiency and performance, while accruing minimal optimization overheads. Configurable caches [20] have been widely studied as a viable architecture for minimizing the cache's energy consumption. Using configurable caches, the cache configurations (i.e., cache size, associativity, and line/block size) can be dynamically tuned or specialized to executing applications' resource requirements, in order to minimize energy overheads from an over-provisioned cache. However, cache optimization, otherwise known as cache tuning, is very challenging due to embedded systems' typical design constraints (e.g., area, energy, real-time constraints), potentially large configurable cache design spaces, and the runtime variability of embedded systems' applications and execution conditions [3]. Depending on different factors, such as resource availability, user requirements, and application characteristics, an embedded system's optimization goals (e.g., energy/execution time minimization) may change throughout its lifetime. Ideally, a runtime cache tuning technique must account for variable application requirements and optimization goals.

Cache optimization challenges are further compounded by the presence of different execution phases in emerging embedded systems applications [17]. A phase is an execution interval during which the application's execution characteristics (e.g., cache miss rates, instructions per cycle (IPC), etc.) are relatively stable. Different phases within the same application, due to their different characteristics, may have different resource requirements. To account for these variable resource requirements, phase-based cache tuning [3], [9] can achieve fine-grained optimization by tuning the cache configurations to determine configurations that satisfy the different application phases' requirements. Prior work has shown that phase-based cache tuning can reduce the cache's energy consumption by up to 62% [].
However, phase-based cache tuning can result in runtime execution time overhead. To mitigate this overhead, prior work [3] proposed phase distance mapping (PDM) as an analytical approach to phase-based cache tuning. PDM calculates the relative distance between a new executing phase and a previously tuned base phase, and approximates the new phase's best cache configuration based on the distance between the new and base phases' execution characteristics. PDM, much like several prior cache optimization techniques [9], [15], [8], has the drawback of only focusing on one optimization goal throughout the system's lifetime. Due to the adversarial nature of optimization goals (improving the energy, for example, may degrade the execution time), effective optimization techniques must be adaptable to changing runtime operating conditions (the current state of the device, e.g., a critically low battery) that may necessitate variable optimization goals.

In this paper, we present Priority-Aware phase-based Cache Tuning (PACT). PACT incorporates the notion of priorities into phase-based cache tuning using PDM, and allows different optimization goals to be dynamically prioritized in order to satisfy variable operating conditions. Specifically, PACT allows the energy delay product (EDP), energy, or execution time to be prioritized, without incurring any additional tuning overhead with respect to prior work. Using experimental results, we show that PACT can trade off non-prioritized optimization goals for prioritized optimization goals in order to satisfy varying runtime resource requirements. PACT determines configurations that are optimal or near-optimal for the specified optimization priority, and improves over PDM for energy and execution time optimization.
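As a rough illustration of the phase distance idea that PACT inherits from PDM, the C++ sketch below computes a distance between two phases' characteristics. The characteristics used (instruction and data cache miss rates) follow Section III; the struct and function names are ours, and PDM's actual distance-to-configuration mapping is defined in [3].

```cpp
#include <cmath>

// Characteristics of a phase; miss rates are the features PACT compares.
struct PhaseCharacteristics {
    double icacheMissRate;   // instruction cache miss rate
    double dcacheMissRate;   // data cache miss rate
};

// The "phase distance": Euclidean distance between two phases' characteristics.
// PDM uses this distance to approximate how far the new phase's best
// configuration lies from the base phase's known-best configuration [3].
double phaseDistance(const PhaseCharacteristics& a, const PhaseCharacteristics& b) {
    double di = a.icacheMissRate - b.icacheMissRate;
    double dd = a.dcacheMissRate - b.dcacheMissRate;
    return std::sqrt(di * di + dd * dd);
}
```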

II. BACKGROUND AND RELATED WORK

Cache tuning uses a configurable cache and a cache tuning mechanism (software or hardware cache tuners) to determine the best cache configurations that satisfy an application's execution requirements. The key challenge in cache tuning is accurately determining the best configuration without incurring significant tuning overheads. To address this challenge, several prior cache tuning techniques [3], [6], [9]-[11], [14], [16], [19] have been proposed. In general, these prior phase-based cache tuning techniques can be broadly categorized as exhaustive search, heuristic, or analytical methods [3].

Chen et al. [6] proposed an algorithm that exhaustively explored the entire cache design space to determine the optimal configuration. While this method was highly accurate, it also incurred significant overheads due to the amount of time required to fully explore the cache design space, thus impeding optimization potentials. To reduce the cache tuning overhead, Rawlins et al. [16] proposed a cache tuning heuristic that significantly pruned the design space during the exploration process, while achieving near-optimal results. To further reduce cache tuning overheads, analytical methods have been proposed to directly determine the best cache configurations without the need to explore inferior configurations. Gordon-Ross et al. [10] used an oracle-based approach to non-intrusively predict the best configuration without incurring runtime tuning overheads with respect to execution time. However, the proposed approach incurred hardware overheads as a result of the oracle hardware. Other analytical techniques have been proposed (e.g., [8], [15]); however, most of these techniques are typically computationally complex, thus impeding optimization potentials in resource-constrained embedded systems.

To address the challenges of prior work, Adegbija et al. [3] proposed phase distance mapping (PDM), which was based on the hypothesis that the more disparate two phases' characteristics are, the more disparate the phases' best configurations are likely to be. Given a known phase's best configuration, PDM predicts a new phase's best configuration using the distance between the two phases' characteristics (the phase distance) to estimate the distance between the two phases' best configurations. However, PDM only targets EDP optimization and cannot satisfy changing runtime optimization goals.

We address challenges in prior work by developing a new priority-aware phase-based cache tuning (PACT) method that allows dynamic cache tuning to specifically prioritize different optimization goals based on the current system operating conditions/device state. We show that PACT achieves these variable optimization goals without incurring any runtime tuning overheads with respect to prior work.

III. PRIORITIZED PHASE-BASED CACHE TUNING

In the discussions herein, we assume that the embedded system is equipped with a highly configurable cache [20] with configurable size, line size, and associativity.

Fig. 1: Flowchart of PACT.

A highly configurable cache utilizes multiple memory banks, each of which can be shut down to configure the cache size, or concatenated to configure the associativity.
Given a physical cache line size (e.g., 16B), multiple lines can be concatenated to logically form larger line sizes. In this work, we use a physical base 32KB configurable cache with 2KB banks. This cache offers configurable cache sizes ranging from 2KB to 32KB, associativities ranging from 1- to 4-way, and line sizes ranging from 16B to 64B, all in power-of-two increments. We direct the reader to [20] for details on the configurable cache's circuitry and design.

To perform phase-based cache tuning, executing applications' characteristics must first be classified into phases. Phase classification [17] breaks an application into execution intervals (intervals can be measured in number of instructions or time) and clusters the intervals with similar execution characteristics to form phases. Since phase classification has been widely studied in prior work, we skip the details in this paper.

In this section, we present the problem statement, motivated by mobile devices (e.g., smartphones, tablets), describe an overview of the proposed approach, and present the PACT algorithm.

A. Problem Statement

The primary challenge of phase-based cache tuning is accurately determining the best configurations at runtime without accruing significant runtime overheads. For example, our configurable cache in this work features a design space with 432 possible configurations (both instruction and data caches). This design space could increase exponentially in more complex caches. Exploring such a large design space at runtime (using exhaustive search, heuristics, or even analytical methods) for multiple optimization goals could be prohibitive for resource-constrained embedded systems. Thus, the objective of PACT is to determine the best cache configurations that satisfy variable runtime optimization goals, while accruing minimal runtime tuning overhead.
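To make the size of this design space concrete, the following sketch enumerates one cache's configuration space described above (2KB-32KB sizes, 1- to 4-way associativities, 16B-64B line sizes, in power-of-two increments). The bank constraint is an assumption based on the banked organization of [20], so the resulting count is illustrative and need not reproduce the paper's 432-configuration joint instruction/data space exactly.

```cpp
#include <cstdio>

int main() {
    int count = 0;
    for (int sizeKB = 2; sizeKB <= 32; sizeKB *= 2) {        // 2, 4, 8, 16, 32
        for (int ways = 1; ways <= 4; ways *= 2) {           // 1, 2, 4
            // Assumed bank constraint: with 2KB banks, a cache of size S
            // offers at most S/2KB ways.
            if (ways > sizeKB / 2) continue;
            for (int lineB = 16; lineB <= 64; lineB *= 2) {  // 16, 32, 64
                ++count;
            }
        }
    }
    // One such space exists per cache; the paper's 432 configurations span
    // the instruction and data caches jointly.
    std::printf("configurations per cache: %d\n", count);
    return 0;
}
```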

Given an embedded system, the optimization goals can be determined by different operating conditions or user requirements. For example, a battery-operated mobile device may have different operating conditions depending on the battery state. When operating on full battery, the EDP may be prioritized to account for both energy and execution time optimization. When the battery is at critical levels, energy efficiency is prioritized over execution time. When the system is being charged, the execution time is prioritized, since energy is no longer a significant resource constraint.

B. Overview of PACT

To ensure low-overhead, high-speed, and accurate tuning, we considered two options for adding priorities to phase-based tuning. Since a phase's best configurations, after they are determined, are traditionally stored in a phase history table [3], the first option uses log2(n) additional bits to store each phase's current optimization priority, where n is the number of device states (corresponding to optimization goals). An alternative option includes a separate transformation lookup table that stores configuration transformations between different priorities for each application phase. Thus, the table only stores the phases' configurations for a single priority and the configuration transformations (i.e., changes in configuration). When a new priority is required, the transformation is applied to the current configuration to determine a new configuration that satisfies the new optimization goal. In this work, we use the first option, due to its simplicity and low-overhead characteristics.

Figure 1 presents an overview of PACT. PACT takes as input the phase characteristics, obtained during phase classification, and the current device state, which indicates the prioritized optimization goal. For example, the device state could be informed by a smartphone's battery state: fully charged, low power, or charging, to indicate prioritization of EDP, energy, or execution time, respectively. When a phase P_i is executed, PACT searches the phase history table for P_i or for phases with similar characteristics (cache miss rates) within a similarity threshold to P_i. The similarity threshold is a designer-specified or runtime-tunable feature of PACT that represents a tradeoff between tuning accuracy and tuning overheads. A larger similarity threshold reduces the tuning overhead at the expense of accuracy, while a smaller similarity threshold increases accuracy at the expense of tuning overhead. In this work, we empirically established our similarity threshold, using the base cache configuration, by normalizing each phase's cache miss rate to the base phase's cache miss rate. We used the first executing phase as the base phase, and used a similarity threshold of 1. Phases with a normalized miss rate within the range 0-1 used the same phase history table entries; similarly for phases with normalized miss rates within 1-2, 2-3, etc. At runtime, the similarity threshold can easily be dynamically determined as shown in prior work [3].

Fig. 2: PACT algorithm.

Fig. 3: PACT algorithm's checkstate subroutine.

If an entry is found, the phase P_i, or a similar phase P_i,s, has been previously executed, and the phase's best configuration C_Pi is retrieved from the phase history table and used to execute P_i.
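A minimal sketch of option 1's storage layout is shown below, assuming n = 3 device states; all type and field names are illustrative rather than taken from the paper.

```cpp
#include <cstdint>

// n = 3 device states; ceil(log2(3)) = 2 bits suffice to track the priority.
enum class Priority : std::uint8_t { EDP = 0, Energy = 1, ExecTime = 2 };

struct CacheConfig {
    std::uint16_t sizeKB;   // 2..32, power of two
    std::uint8_t  ways;     // 1..4, power of two
    std::uint8_t  lineB;    // 16..64, power of two
};

struct PhaseHistoryEntry {
    double      icacheMissRate;  // characteristics used for similarity search
    double      dcacheMissRate;
    CacheConfig best[3];         // best configuration per priority
    Priority    current;         // current optimization priority
};

// Example access: entry.best[static_cast<int>(Priority::Energy)]
```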
A new phase's similarity to an existent phase is a function of the phase distance between the two phases, which can be measured by the Euclidean distance between the two phases' instruction and data cache miss rates. If no entry is found, P_i is a new phase, and PACT executes the tuning algorithm (Section III-C) to determine C_Pi. C_Pi is then added to the phase history table, and the phase P_i is executed using C_Pi.
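The lookup step could be sketched as follows (reusing the PhaseHistoryEntry type from the earlier sketch); the threshold handling is simplified relative to the normalized-miss-rate ranges described above.

```cpp
#include <cmath>
#include <vector>

// Returns the closest stored phase within the similarity threshold, or
// nullptr if P_i is a new phase (in which case the tuning algorithm of
// Section III-C runs and the result is added to the table).
const PhaseHistoryEntry* lookupPhase(const std::vector<PhaseHistoryEntry>& table,
                                     double icacheMissRate, double dcacheMissRate,
                                     double similarityThreshold) {
    const PhaseHistoryEntry* bestMatch = nullptr;
    double bestDist = similarityThreshold;
    for (const auto& e : table) {
        double di = icacheMissRate - e.icacheMissRate;   // phase distance:
        double dd = dcacheMissRate - e.dcacheMissRate;   // Euclidean distance
        double dist = std::sqrt(di * di + dd * dd);      // over I/D miss rates
        if (dist <= bestDist) { bestDist = dist; bestMatch = &e; }
    }
    return bestMatch;
}
```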

C. PACT Algorithm

Figure 2 depicts the PACT algorithm, which determines an executing phase's best cache configuration, given a specified priority. The inputs to the algorithm are the number of iterations n; the most recently used cache configuration ConfigP_MRU, comprised of the most recently used cache size, associativity, and line size (C_MRU, A_MRU, and L_MRU, respectively); and the current priority, R. The number of iterations, n, specifies the number of phase executions required to fine-tune the phase's best configuration; the default value of n is designer-specified, depending on the executing applications, but can be dynamically adjusted at runtime depending on the quality of the configurations being determined by PACT. For example, n may be dynamically increased or decreased for an executing phase depending on how much improvement is achieved with respect to the base configuration, and how often the phase is executed. A larger value of n would benefit a phase with several executions and may yield cache configurations that are closer to the optimal. We analyzed different values of n, and empirically determined that n = 3, which we used for our experiments (Section IV), provided a sufficient tradeoff between the number of iterations and the optimization potential.

When a new phase is executed, the algorithm starts with the most recently used configurations and iteratively cycles through the cache sizes, associativities, and line sizes in power-of-two increments; this process is performed iteratively for all caches in the system (e.g., data and instruction) until the maximum values are reached or the current configuration yields better results than the stored best configuration. At system startup, the most recently used configurations default to the base configurations.

Figure 3 depicts the algorithm for the checkstate subroutine, which PACT uses to monitor the prioritized optimization goal during each iteration. For each phase, checkstate(R) determines if the currently executing configuration ConfigP_i improves over the best configuration ConfigP_best stored in the phase history table for priority R. If ConfigP_i improves over ConfigP_best, ConfigP_best is set to ConfigP_i in the phase history table.
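Since Figs. 2 and 3 are not reproduced here, the following hedged sketch captures one reading of the algorithm and its checkstate subroutine: starting from the most recently used configuration, cycle the size, associativity, and line size upward in power-of-two increments, and keep a trial configuration whenever it improves the metric for the current priority R. runPhase is a stand-in for executing the phase and reading hardware counters; the exact iteration order and early-exit condition are our interpretation of the text.

```cpp
// Reuses CacheConfig and Priority from the earlier table-entry sketch.
struct Metrics { double energy; double execTime; };

// Stand-in for executing the phase under a trial configuration and reading
// energy/time from counters and power models; placeholder values only.
Metrics runPhase(const CacheConfig& c) {
    (void)c;
    return {1.0, 1.0};
}

// checkstate(R): score the metrics for priority R (lower is better).
double score(const Metrics& m, Priority r) {
    switch (r) {
        case Priority::Energy:   return m.energy;
        case Priority::ExecTime: return m.execTime;
        default:                 return m.energy * m.execTime;  // EDP
    }
}

CacheConfig pactTune(CacheConfig mru, Priority r, int n) {
    CacheConfig best = mru;                       // defaults to base at startup
    double bestScore = score(runPhase(best), r);
    for (int i = 0; i < n; ++i) {                 // n fine-tuning iterations (n = 3)
        CacheConfig c = best;
        // Cycle size, then associativity, then line size, in power-of-two
        // increments, stopping early when a trial beats the stored best.
        while (c.sizeKB < 32 || c.ways < 4 || c.lineB < 64) {
            if (c.sizeKB < 32)   c.sizeKB *= 2;
            else if (c.ways < 4) c.ways *= 2;
            else                 c.lineB *= 2;
            double s = score(runPhase(c), r);     // one phase execution
            if (s < bestScore) { bestScore = s; best = c; break; }
        }
    }
    return best;                                  // written back to the table
}
```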
IV. EXPERIMENTS

A. Experimental Setup

To evaluate PACT's effectiveness, we used 12 benchmarks from the SPEC CPU2006 benchmark suite [2]. We used SPEC benchmarks since they feature greater execution complexity and runtime variability, and provide a more rigorous test for PACT. We fast-forwarded each application for 3 million instructions, and ran the reference input sets for 1 billion instructions. We used SimPoint 3.2 [12] to determine the distinct phases in each application. To model a system similar to modern-day embedded systems' microprocessors and gather execution statistics, we implemented the proposed approach using gem5 [5]. We simulated a system comprised of private level one instruction and data caches, with a base configuration featuring a 32 KB size, 4-way set associativity, and a 64-byte line size, similar to the ARM Cortex-A9 [1]. Given this base configuration, the configurable size ranged from 2 KB to 32 KB, associativity ranged from 1-way to 4-way, and the line size ranged from 16 bytes to 64 bytes, all in power-of-two increments. We assumed a system with dedicated tuners within each core; thus, the cores are tuned independently of each other. We used McPAT [13] to calculate the system's total power consumption, which we then used, combined with execution statistics from gem5, to calculate the energy consumption.

B. Results

1) PACT Results and Comparison to the Optimal and Prior Work: We evaluated PACT by comparing the results using configurations determined by PACT to results obtained using the base configuration, the optimal configuration (determined through exhaustive search), and PDM (to represent prior work). Figure 4 depicts the EDP, energy, and execution time achieved by PACT as compared to the optimal and PDM configurations. The PACT, optimal, and PDM results are all normalized to the base configuration in order to evaluate the improvements with respect to the base configuration.

Fig. 4: PACT compared to the optimal and PDM. PACT, optimal, and PDM are all normalized to the base configuration.

Figure 4(a) compares the EDP achieved by PACT to the optimal and PDM when the EDP is prioritized. As compared to the base configuration, PACT reduced the EDP by 6.5% on average across all the applications, with reductions as high as 36% for omnetpp. On average, PACT determined configurations that were within 7.7% of the optimal; for 6 out of the 12 applications, PACT's configurations were within less than 1% of the optimal. In a few of the applications, PACT's configurations were worse than the optimal. For mcf, for example (this was the worst case), PACT's configuration was within 12% of the optimal and increased the EDP by 3% with respect to the base configuration. We attribute this behavior to the fact that PACT's tuning was oblivious to some of the applications' intrinsic memory access behaviors. For instance, mcf has long memory access latencies, which conflicted with PACT's attempt to simultaneously tune both the energy and the execution time (delay). However, as expected, PACT achieved similar EDP savings as PDM, since PDM natively optimizes the EDP.

Figure 4(b) compares the energy consumption achieved by PACT to the optimal and PDM when the energy is prioritized. On average, PACT reduced the energy by 8.2% compared to the base configuration, with reductions as high as 28% for libquantum. Unlike in the case of EDP prioritization, PACT did not degrade any application's energy consumption with respect to the base configuration. PACT determined configurations that were within 4.2% of the optimal, on average, and outperformed PDM by 2.2% across all the applications.

Figure 4(c) compares the execution time achieved by PACT to the optimal and PDM when the execution time is prioritized. On average across all the applications, PACT reduced the execution time by 2.7% compared to the base configuration, with the largest reduction achieved for omnetpp. The execution time reduction was much smaller (compared to the EDP and energy reductions) because the base configuration was near-optimal for most applications, as evidenced by the fact that PACT's configurations were within 0.3% of the optimal. These results show PACT's ability to prioritize optimization goals as required.
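For clarity, the normalization used in Fig. 4 (and the energy/EDP arithmetic behind it) can be expressed as below; treating energy as average power times execution time mirrors the McPAT-plus-gem5 methodology of Section IV-A, though the helper itself is ours.

```cpp
// Energy is modeled as average power times execution time; each metric is
// then reported as a ratio to the base (32 KB, 4-way, 64-byte line)
// configuration, so values below 1.0 are improvements.
struct Measured { double avgPowerW; double execTimeS; };   // from McPAT / gem5
struct Normalized { double edp; double energy; double execTime; };

Normalized normalizeToBase(const Measured& cfg, const Measured& base) {
    double eCfg  = cfg.avgPowerW  * cfg.execTimeS;    // energy (J)
    double eBase = base.avgPowerW * base.execTimeS;
    return { (eCfg * cfg.execTimeS) / (eBase * base.execTimeS),  // EDP ratio
             eCfg / eBase,                                       // energy ratio
             cfg.execTimeS / base.execTimeS };                   // time ratio
}
```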

2) Prioritization Tradeoffs: To further illustrate PACT's ability to trade off non-prioritized optimization goals for the prioritized optimization goals, we compared the non-prioritized EDP, energy, or execution time obtained with PACT (with one of the goals prioritized) to the base configuration. Figure 5 shows the results of all the optimization goals (EDP, energy, and execution time) when the system is running under each of the three different priorities.

Fig. 5: Impact of prioritizing one optimization goal on the non-prioritized optimization goals.

Figure 5(a) compares PACT's energy and execution time, with the EDP prioritized, to the base configuration's energy and execution time. When the EDP is prioritized, PACT reduced the EDP, energy, and execution time by 6.5%, 6.%, and 1%, respectively, on average across all the applications. For two applications, hmmer and mcf, prioritizing the EDP increased the execution time by 6% and 3%, respectively. However, for both applications, we observed energy reductions of 5% and 8%, respectively, as compared to the base configuration.

Figure 5(b) compares PACT's EDP and execution time, with the energy prioritized, to the base configuration's EDP and execution time. On average, PACT reduced the EDP and energy by 7.8% and 8.2%, respectively, while the execution time was reduced by less than 1%. We observed that energy prioritization resulted in significant tradeoffs of execution time for some of the benchmarks. For example, in the largest tradeoff observed, PACT reduced the energy by 3% for leslie3d, while the EDP and execution time increased by 6.4% and 2.4%, respectively. Similarly to mcf (Section IV-B), we attribute this behavior to leslie3d's memory access characteristics, which feature long memory access latencies.

Figure 5(c) compares PACT's EDP and energy, with the execution time prioritized, to the base configuration's EDP and energy. On average, PACT reduced the EDP, execution time, and energy by 2.7%, .7%, and 2.7%, respectively. For several of the applications, PACT determined the base configuration to be the best configuration (since the base configuration had the best execution time for those applications); thus, the energy and EDP for those applications were identical to the base. In general, these results illustrate the adversarial nature of energy and execution time (prioritizing one is usually at the expense of the other) and PACT's ability to trade off the execution time to adhere to more stringent energy constraints, or vice versa, based on the executing conditions and device state (e.g., when the system's battery is in a critical state).

3) PACT Overhead: PACT's overhead comprises the hardware and runtime tuning overheads. The hardware overhead comprises the phase history table and the tuner (Section II), which orchestrates the runtime tuning process. We estimated, using synthesizable VHDL and Synopsys Design Compiler [7] simulations, that PACT incurs minimal hardware area and power overheads with respect to an ARM Cortex-A9 microprocessor.
We quantified the runtime tuning overhead using the total tuning stall cycles [4] as: total tuning stall cycles = (number of configurations explored - 1) * tuning stall cycles per configuration. On average, for each phase history table entry, PACT explored 5% of the design space (Section III-A) and incurred 258 stall cycles for each configuration change. Using these estimates, PACT accrued a runtime tuning overhead of 4799 cycles per benchmark. With a 1.9GHz clock frequency, this overhead translates to 2.526µs across all benchmarks.
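The overhead arithmetic can be checked directly; in the sketch below, the stall-cycle formula and the 4799-cycle/1.9GHz conversion come from the text, while the example count of explored configurations is an assumption.

```cpp
#include <cstdio>

// total tuning stall cycles =
//   (configurations explored - 1) * stall cycles per configuration change
long tuningStallCycles(long configsExplored, long stallPerChange) {
    return (configsExplored - 1) * stallPerChange;
}

int main() {
    // Example: exploring 20 configurations (assumed count) at the reported
    // 258 stall cycles per configuration change.
    std::printf("example: %ld stall cycles\n", tuningStallCycles(20, 258));
    // Reported totals: 4799 cycles at a 1.9 GHz clock, consistent with the
    // quoted 2.526 us overhead.
    std::printf("reported: %.3f us\n", 4799.0 / 1.9e9 * 1e6);
    return 0;
}
```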

V. CONCLUSIONS

In this paper, we presented Priority-Aware phase-based Cache Tuning (PACT), which uses the existing phase distance mapping (PDM) framework to determine the best cache configurations for varying runtime optimization goals. We showed PACT's ability to trade off non-prioritized optimization goals when a specific goal must be prioritized due to changing operating conditions or device states. Our experimental results show that PACT performed similarly to PDM for EDP optimization (since PDM focuses on EDP optimization). Furthermore, PACT improved over PDM for energy and execution time optimizations. For future work, we intend to explore techniques for achieving results that are closer to the optimal, without degrading the prioritization potential. In addition, we intend to extend PACT to complex systems with multilevel cache hierarchies.

REFERENCES

[1] ARM. Accessed: December 2016.
[2] SPEC CPU2006. Accessed: January 2016.
[3] T. Adegbija, A. Gordon-Ross, and A. Munir. Phase distance mapping: a phase-based cache tuning methodology for embedded systems. Design Automation for Embedded Systems, 18(3-4):251-278, 2014.
[4] T. Adegbija, A. Gordon-Ross, and M. Rawlins. Analysis of cache tuner architectural layouts for multicore embedded systems. In 2014 IEEE International Performance Computing and Communications Conference (IPCCC), pages 1-8. IEEE, 2014.
[5] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, et al. The gem5 simulator. ACM SIGARCH Computer Architecture News, 39(2):1-7, 2011.
[6] L. Chen, X. Zou, J. Lei, and Z. Liu. Dynamically reconfigurable cache for low-power embedded system. In Third International Conference on Natural Computation (ICNC 2007), volume 5. IEEE, 2007.
[7] Design Compiler. Synopsys, Inc.
[8] A. Ghosh and T. Givargis. Cache optimization for embedded processor cores: An analytical approach. ACM Transactions on Design Automation of Electronic Systems (TODAES), 9(4):419-440, 2004.
[9] A. Gordon-Ross, J. Lau, and B. Calder. Phase-based cache reconfiguration for a highly-configurable two-level cache hierarchy. In Proceedings of the 18th ACM Great Lakes Symposium on VLSI. ACM, 2008.
[10] A. Gordon-Ross and F. Vahid. A self-tuning configurable cache. In Proceedings of the 44th Annual Design Automation Conference. ACM, 2007.
[11] H. Hajimiri, P. Mishra, and S. Bhunia. Dynamic cache tuning for efficient memory based computing in multicore architectures. In 2013 26th International Conference on VLSI Design and 2013 12th International Conference on Embedded Systems (VLSID). IEEE, 2013.
[12] G. Hamerly, E. Perelman, J. Lau, and B. Calder. SimPoint 3.0: Faster and more flexible program phase analysis. Journal of Instruction Level Parallelism, 7(4):1-28, 2005.
[13] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. The McPAT framework for multicore and manycore architectures: Simultaneously modeling power, area, and timing. ACM Transactions on Architecture and Code Optimization (TACO), 10(1):5, 2013.
[14] O. Navarro, T. Leiding, and M. Hübner. Configurable cache tuning with a victim cache. In 2015 10th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), pages 1-6. IEEE, 2015.
[15] M. Peng, J. Sun, and Y. Wang. A phase-based self-tuning algorithm for reconfigurable cache. In First International Conference on the Digital Society (ICDS'07). IEEE, 2007.
[16] M. Rawlins and A. Gordon-Ross. CPACT: the conditional parameter adjustment cache tuner for dual-core architectures. In 2011 IEEE 29th International Conference on Computer Design (ICCD). IEEE, 2011.
[17] T. Sherwood, E. Perelman, G. Hamerly, S. Sair, and B. Calder. Discovering and exploiting program phases. IEEE Micro, 23(6):84-93, 2003.
[18] T. Sondag and H. Rajan. Phase-based tuning for better utilization of performance-asymmetric multicore processors. In 2011 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2011.
[19] K. Vivekanandarajah, T. Srikanthan, and C. T. Clarke. Profile directed instruction cache tuning for embedded systems. In IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures, 6 pp. IEEE, 2006.
[20] C. Zhang, F. Vahid, and W. Najjar. A highly configurable cache architecture for embedded systems. In Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2003.


An Approach for Adaptive DRAM Temperature and Power Management IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 An Approach for Adaptive DRAM Temperature and Power Management Song Liu, Yu Zhang, Seda Ogrenci Memik, and Gokhan Memik Abstract High-performance

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

IN modern systems, the high latency of accessing largecapacity

IN modern systems, the high latency of accessing largecapacity IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 10, OCTOBER 2016 3071 BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling Lavanya Subramanian, Donghyuk

More information

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Wenun Wang and Wei-Ming Lin Department of Electrical and Computer Engineering, The University

More information

Improving Energy Efficiency of Block-Matching Motion Estimation Using Dynamic Partial Reconfiguration

Improving Energy Efficiency of Block-Matching Motion Estimation Using Dynamic Partial Reconfiguration , pp.517-521 http://dx.doi.org/10.14257/astl.2015.1 Improving Energy Efficiency of Block-Matching Motion Estimation Using Dynamic Partial Reconfiguration Jooheung Lee 1 and Jungwon Cho 2, * 1 Dept. of

More information

Adaptive Spatiotemporal Node Selection in Dynamic Networks

Adaptive Spatiotemporal Node Selection in Dynamic Networks Adaptive Spatiotemporal Node Selection in Dynamic Networks Pradip Hari, John B. P. McCabe, Jonathan Banafato, Marcus Henry, Ulrich Kremer, Dept. of Computer Science, Rutgers University Kevin Ko, Emmanouil

More information

Dynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors

Dynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors Dynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors Jie Meng, Tiansheng Zhang, and Ayse K. Coskun Electrical and Computer Engineering Department, Boston University,

More information

Last Level Cache Size Flexible Heterogeneity in Embedded Systems

Last Level Cache Size Flexible Heterogeneity in Embedded Systems Last Level Cache Size Flexible Heterogeneity in Embedded Systems Mario D. Marino, Kuan-Ching Li Leeds Beckett University, m.d.marino@leedsbeckett.ac.uk Corresponding Author, Providence University, kuancli@gm.pu.edu.tw

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Analyzing and Improving Clustering Based Sampling for Microprocessor Simulation

Analyzing and Improving Clustering Based Sampling for Microprocessor Simulation Analyzing and Improving Clustering Based Sampling for Microprocessor Simulation Yue Luo, Ajay Joshi, Aashish Phansalkar, Lizy John, and Joydeep Ghosh Department of Electrical and Computer Engineering University

More information

http://uu.diva-portal.org This is an author produced version of a paper presented at the 4 th Swedish Workshop on Multi-Core Computing, November 23-25, 2011, Linköping, Sweden. Citation for the published

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department SIP: Speculative Insertion Policy for High Performance Caching Hongil Yoon Tan Zhang Mikko H. Lipasti Technical Report #1676 June 2010 SIP: Speculative Insertion Policy for

More information

An Adaptive Filtering Mechanism for Energy Efficient Data Prefetching

An Adaptive Filtering Mechanism for Energy Efficient Data Prefetching 18th Asia and South Pacific Design Automation Conference January 22-25, 2013 - Yokohama, Japan An Adaptive Filtering Mechanism for Energy Efficient Data Prefetching Xianglei Dang, Xiaoyin Wang, Dong Tong,

More information

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 02 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 2.1 The levels in a typical memory hierarchy in a server computer shown on top (a) and in

More information

Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors. By: Anvesh Polepalli Raj Muchhala

Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors. By: Anvesh Polepalli Raj Muchhala Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors By: Anvesh Polepalli Raj Muchhala Introduction Integrating CPU and GPU into a single chip for performance

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

Improving Cache Performance using Victim Tag Stores

Improving Cache Performance using Victim Tag Stores Improving Cache Performance using Victim Tag Stores SAFARI Technical Report No. 2011-009 Vivek Seshadri, Onur Mutlu, Todd Mowry, Michael A Kozuch {vseshadr,tcm}@cs.cmu.edu, onur@cmu.edu, michael.a.kozuch@intel.com

More information

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT Hardware Assisted Recursive Packet Classification Module for IPv6 etworks Shivvasangari Subramani [shivva1@umbc.edu] Department of Computer Science and Electrical Engineering University of Maryland Baltimore

More information