PACT: Priority-Aware Phase-based Cache Tuning for Embedded Systems

Sam Gianelli and Tosiron Adegbija
Department of Electrical & Computer Engineering
University of Arizona, Tucson, AZ, USA
{samjgianelli,

Abstract: Due to the cache's significant impact on an embedded system, much research has focused on cache optimizations, such as reduced energy consumption or improved performance. However, throughout an embedded system's lifetime, the system may have different optimization priorities, due to variable operating conditions and requirements. Variable optimization priorities, embedded systems' stringent design constraints, and the fact that applications typically have execution phases with varying runtime resource requirements necessitate new robust optimization techniques that can dynamically adapt to different optimization goals. In this paper, we present priority-aware phase-based cache tuning (PACT), which tunes an embedded system's cache at runtime in order to dynamically adapt the cache configurations to varying optimization goals (specifically EDP, energy, and execution time), application execution phases, and operating conditions, while accruing minimal runtime overheads.

Index Terms: Configurable memory, design space exploration, cache tuning, cache memories, low-power design, low-power embedded systems, adaptable hardware

I. INTRODUCTION AND MOTIVATION

Caches contribute significantly to an embedded system's energy consumption and performance. As a result, much optimization research focuses on optimizations that improve the cache's energy efficiency and performance, while accruing minimal optimization overheads. Configurable caches [20] have been widely studied as a viable architecture for minimizing the cache's energy consumption. Using configurable caches, the cache configurations (i.e., cache size, associativity, and line/block size) can be dynamically tuned or specialized to executing applications' resource requirements, in order to minimize energy overheads from an over-provisioned cache. However, cache optimization, otherwise known as cache tuning, is very challenging due to embedded systems' typical design constraints (e.g., area, energy, real-time constraints), potentially large configurable cache design spaces, and the runtime variability of embedded systems' applications and execution conditions [3]. Depending on different factors, such as resource availability, user requirements, and application characteristics, an embedded system's optimization goals (e.g., energy/execution time minimization) may change throughout its lifetime. Ideally, a runtime cache tuning technique must account for variable application requirements and optimization goals.

Cache optimization challenges are further compounded by the presence of different execution phases in emerging embedded systems applications [17]. A phase is an execution interval during which the application's execution characteristics (e.g., cache miss rates, instructions per cycle (IPC), etc.) are relatively stable. Different phases within the same application, due to their different characteristics, may have different resource requirements. To account for these variable resource requirements, phase-based cache tuning [3], [9] can achieve fine-grained optimization by tuning the cache configurations to determine configurations that satisfy the different application phases' requirements. Prior work has shown that phase-based cache tuning can reduce the cache's energy consumption by up to 62% [].
However, phase-based cache tuning can result in runtime execution time overhead. To mitigate this overhead, prior work [3] proposed phase distance mapping (PDM) as an analytical approach to phase-based cache tuning. PDM calculates the relative distance between a new executing phase and a previously tuned base phase, and approximates the new phase's best cache configuration based on the distance between the new and base phases' execution characteristics. PDM, much like several prior cache optimization techniques [9], [15], [8], has the drawback of only focusing on one optimization goal throughout the system's lifetime. Due to the adversarial nature of optimization goals (improving the energy, for example, may degrade the execution time), effective optimization techniques must be adaptable to changing runtime operating conditions (the current state of the device, e.g., a critically low battery) that may necessitate variable optimization goals.

In this paper, we present Priority-Aware phase-based Cache Tuning (PACT). PACT incorporates the notion of priorities into phase-based cache tuning using PDM, and allows different optimization goals to be dynamically prioritized in order to satisfy variable operating conditions. Specifically, PACT allows the energy delay product (EDP), energy, or execution time to be prioritized, without incurring any additional tuning overhead with respect to prior work. Using experimental results, we show that PACT can trade off non-prioritized optimization goals for prioritized optimization goals in order to satisfy varying runtime resource requirements. PACT determines configurations that are optimal or near-optimal for the specified optimization priority, and improves over PDM for energy and execution time optimization.
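As a rough illustration of the phase distance idea that PACT inherits from PDM, the C++ sketch below computes a distance between two phases' characteristics. The characteristics used (instruction and data cache miss rates) follow Section III; the struct and function names are ours, and PDM's actual distance-to-configuration mapping is defined in [3].

```cpp
#include <cmath>

// Characteristics of a phase; miss rates are the features PACT compares.
struct PhaseCharacteristics {
    double icacheMissRate;   // instruction cache miss rate
    double dcacheMissRate;   // data cache miss rate
};

// The "phase distance": Euclidean distance between two phases' characteristics.
// PDM uses this distance to approximate how far the new phase's best
// configuration lies from the base phase's known-best configuration [3].
double phaseDistance(const PhaseCharacteristics& a, const PhaseCharacteristics& b) {
    double di = a.icacheMissRate - b.icacheMissRate;
    double dd = a.dcacheMissRate - b.dcacheMissRate;
    return std::sqrt(di * di + dd * dd);
}
```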

II. BACKGROUND AND RELATED WORK

Cache tuning uses a configurable cache and a cache tuning mechanism (software or hardware cache tuners) to determine the best cache configurations that satisfy an application's execution requirements. The key challenge in cache tuning is accurately determining the best configuration without incurring significant tuning overheads. To address this challenge, several prior cache tuning techniques [3], [6], [9]-[11], [14], [16], [19] have been proposed. In general, these prior phase-based cache tuning techniques can be broadly categorized as exhaustive search, heuristic, or analytical methods [3].

Chen et al. [6] proposed an algorithm that exhaustively explored the entire cache design space to determine the optimal configuration. While this method was highly accurate, it also incurred significant overheads due to the amount of time required to fully explore the cache design space, thus impeding optimization potentials. To reduce the cache tuning overhead, Rawlins et al. [16] proposed a cache tuning heuristic that significantly pruned the design space during the exploration process, while achieving near-optimal results. To further reduce cache tuning overheads, analytical methods have been proposed to directly determine the best cache configurations without the need to explore inferior configurations. Gordon-Ross et al. [10] used an oracle-based approach to non-intrusively predict the best configuration without incurring runtime tuning overheads with respect to execution time. However, the proposed approach incurred hardware overheads as a result of the oracle hardware. Other analytical techniques have been proposed (e.g., [8], [15]); however, most of these techniques are typically computationally complex, thus impeding optimization potentials in resource-constrained embedded systems.

To address the challenges of prior work, Adegbija et al. [3] proposed phase distance mapping (PDM), which was based on the hypothesis that the more disparate two phases' characteristics are, the more disparate the phases' best configurations are likely to be. Given a known phase's best configuration, PDM predicts a new phase's best configuration using the distance between the two phases' characteristics (the phase distance) to estimate the distance between the two phases' best configurations. However, PDM only targets EDP optimization and cannot satisfy changing runtime optimization goals.

We address challenges in prior work by developing a new priority-aware phase-based cache tuning (PACT) method that allows dynamic cache tuning to specifically prioritize different optimization goals based on the current system operating conditions/device state. We show that PACT achieves these variable optimization goals without incurring any runtime tuning overheads with respect to prior work.

III. PRIORITIZED PHASE-BASED CACHE TUNING

In the discussions herein, we assume that the embedded system is equipped with a highly configurable cache [20] with configurable size, line size, and associativity.

Fig. 1: Flowchart of PACT.

A highly configurable cache utilizes multiple memory banks, each of which can be shut down to configure the cache size, or concatenated to configure the associativity.
Given a physical cache line size (e.g., 16B), multiple lines can be concatenated to logically form larger line sizes. In this work, we use a physical base 32KB configurable cache with 2KB banks. This cache offers configurable cache sizes ranging from 2KB to 32KB, associativities ranging from 1- to 4-way, and line sizes ranging from 16B to 64B, all in power-of-two increments. We direct the reader to [20] for details on the configurable cache's circuitry and design.

To perform phase-based cache tuning, executing applications' characteristics must first be classified into phases. Phase classification [17] breaks an application into execution intervals (intervals can be measured in number of instructions or time) and clusters the intervals with similar execution characteristics to form phases. Since phase classification has been widely studied in prior work, we skip the details in this paper.

In this section, we present the problem statement, motivated by mobile devices (e.g., smartphones, tablets), describe an overview of the proposed approach, and present the PACT algorithm.

A. Problem Statement

The primary challenge of phase-based cache tuning is accurately determining the best configurations at runtime without accruing significant runtime overheads. For example, our configurable cache in this work features a design space with 432 possible configurations (both instruction and data caches). This design space could increase exponentially in more complex caches. Exploring such a large design space at runtime (using exhaustive search, heuristics, or even analytical methods) for multiple optimization goals could be prohibitive for resource-constrained embedded systems. Thus, the objective of PACT is to determine the best cache configurations that satisfy variable runtime optimization goals, while accruing minimal runtime tuning overhead.
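To make the size of this design space concrete, the following sketch enumerates one cache's configuration space described above (2KB-32KB sizes, 1- to 4-way associativities, 16B-64B line sizes, in power-of-two increments). The bank constraint is an assumption based on the banked organization of [20], so the resulting count is illustrative and need not reproduce the paper's 432-configuration joint instruction/data space exactly.

```cpp
#include <cstdio>

int main() {
    int count = 0;
    for (int sizeKB = 2; sizeKB <= 32; sizeKB *= 2) {        // 2, 4, 8, 16, 32
        for (int ways = 1; ways <= 4; ways *= 2) {           // 1, 2, 4
            // Assumed bank constraint: with 2KB banks, a cache of size S
            // offers at most S/2KB ways.
            if (ways > sizeKB / 2) continue;
            for (int lineB = 16; lineB <= 64; lineB *= 2) {  // 16, 32, 64
                ++count;
            }
        }
    }
    // One such space exists per cache; the paper's 432 configurations span
    // the instruction and data caches jointly.
    std::printf("configurations per cache: %d\n", count);
    return 0;
}
```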

Given an embedded system, the optimization goals can be determined by different operating conditions or user requirements. For example, a battery-operated mobile device may have different operating conditions depending on the battery state. When operating on full battery, the EDP may be prioritized to account for both energy and execution time optimization. When the battery is at critical levels, energy efficiency is prioritized over execution time. When the system is being charged, the execution time is prioritized, since energy is no longer a significant resource constraint.

B. Overview of PACT

To ensure low-overhead, high-speed, and accurate tuning, we considered two options for adding priorities to phase-based tuning. Since a phase's best configurations, after they are determined, are traditionally stored in a phase history table [3], the first option uses log2(n) additional bits to store each phase's current optimization priority, where n is the number of device states (corresponding to optimization goals). An alternative option includes a separate transformation lookup table that stores configuration transformations between different priorities for each application phase. Thus, the table only stores the phases' configurations for a single priority and the configuration transformations (i.e., changes in configuration). When a new priority is required, the transformation is applied to the current configuration to determine a new configuration that satisfies the new optimization goal. In this work, we use the first option, due to its simplicity and low-overhead characteristics.

Figure 1 presents an overview of PACT. PACT takes as input the phase characteristics, obtained during phase classification, and the current device state, which indicates the prioritized optimization goal. For example, the device state could be informed by a smartphone's battery state: fully charged, low power, or charging, to indicate prioritization of EDP, energy, or execution time, respectively. When a phase P_i is executed, PACT searches the phase history table for P_i or for phases with similar characteristics (cache miss rates) within a similarity threshold to P_i. The similarity threshold is a designer-specified or runtime-tunable feature of PACT that represents a tradeoff between tuning accuracy and tuning overheads. A larger similarity threshold reduces the tuning overhead at the expense of accuracy, while a smaller similarity threshold increases accuracy at the expense of tuning overhead. In this work, we empirically established our similarity threshold, using the base cache configuration, by normalizing each phase's cache miss rate to the base phase's cache miss rate. We used the first executing phase as the base phase, and used a similarity threshold of 1. Phases with a normalized miss rate within the range 0-1 used the same phase history table entries; similarly for phases with normalized miss rates within 1-2, 2-3, etc. At runtime, the similarity threshold can easily be dynamically determined as shown in prior work [3].

Fig. 2: PACT algorithm.

Fig. 3: PACT algorithm's checkstate subroutine.

If an entry is found, the phase P_i, or a similar phase P_i,s, has been previously executed, and the phase's best configuration C_Pi is retrieved from the phase history table and used to execute P_i.
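A minimal sketch of option 1's storage layout is shown below, assuming n = 3 device states; all type and field names are illustrative rather than taken from the paper.

```cpp
#include <cstdint>

// n = 3 device states; ceil(log2(3)) = 2 bits suffice to track the priority.
enum class Priority : std::uint8_t { EDP = 0, Energy = 1, ExecTime = 2 };

struct CacheConfig {
    std::uint16_t sizeKB;   // 2..32, power of two
    std::uint8_t  ways;     // 1..4, power of two
    std::uint8_t  lineB;    // 16..64, power of two
};

struct PhaseHistoryEntry {
    double      icacheMissRate;  // characteristics used for similarity search
    double      dcacheMissRate;
    CacheConfig best[3];         // best configuration per priority
    Priority    current;         // current optimization priority
};

// Example access: entry.best[static_cast<int>(Priority::Energy)]
```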
A new phase's similarity to an existent phase is a function of the phase distance between the two phases, which can be measured by the Euclidean distance between the two phases' instruction and data cache miss rates. If no entry is found, P_i is a new phase, and PACT executes the tuning algorithm (Section III-C) to determine C_Pi. C_Pi is then added to the phase history table, and the phase P_i is executed using C_Pi.
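The lookup step could be sketched as follows (reusing the PhaseHistoryEntry type from the earlier sketch); the threshold handling is simplified relative to the normalized-miss-rate ranges described above.

```cpp
#include <cmath>
#include <vector>

// Returns the closest stored phase within the similarity threshold, or
// nullptr if P_i is a new phase (in which case the tuning algorithm of
// Section III-C runs and the result is added to the table).
const PhaseHistoryEntry* lookupPhase(const std::vector<PhaseHistoryEntry>& table,
                                     double icacheMissRate, double dcacheMissRate,
                                     double similarityThreshold) {
    const PhaseHistoryEntry* bestMatch = nullptr;
    double bestDist = similarityThreshold;
    for (const auto& e : table) {
        double di = icacheMissRate - e.icacheMissRate;   // phase distance:
        double dd = dcacheMissRate - e.dcacheMissRate;   // Euclidean distance
        double dist = std::sqrt(di * di + dd * dd);      // over I/D miss rates
        if (dist <= bestDist) { bestDist = dist; bestMatch = &e; }
    }
    return bestMatch;
}
```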

C. PACT Algorithm

Figure 2 depicts the PACT algorithm, which determines an executing phase's best cache configuration, given a specified priority. The inputs to the algorithm are the number of iterations n; the most recently used cache configuration ConfigP_MRU, comprised of the most recently used cache size, associativity, and line size (C_MRU, A_MRU, and L_MRU, respectively); and the current priority, R. The number of iterations, n, specifies the number of phase executions required to fine-tune the phase's best configuration; the default value of n is designer-specified, depending on the executing applications, but can be dynamically adjusted at runtime depending on the quality of the configurations being determined by PACT. For example, n may be dynamically increased or decreased for an executing phase depending on how much improvement is achieved with respect to the base configuration, and how often the phase is executed. A larger value of n would benefit a phase with several executions and may yield cache configurations that are closer to the optimal. We analyzed different values of n, and empirically determined that n = 3, which we used for our experiments (Section IV), provided a sufficient tradeoff between the number of iterations and the optimization potential.

When a new phase is executed, the algorithm starts with the most recently used configurations and iteratively cycles through the cache sizes, associativities, and line sizes in power-of-two increments; this process is performed iteratively for all caches in the system (e.g., data and instruction) until the maximum values are reached or the current configuration yields better results than the stored best configuration. At system startup, the most recently used configurations default to the base configurations.

Figure 3 depicts the algorithm for the checkstate subroutine, which PACT uses to monitor the prioritized optimization goal during each iteration. For each phase, checkstate(R) determines if the currently executing configuration ConfigP_i improves over the best configuration ConfigP_best stored in the phase history table for priority R. If ConfigP_i improves over ConfigP_best, ConfigP_best is set to ConfigP_i in the phase history table.
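Since Figs. 2 and 3 are not reproduced here, the following hedged sketch captures one reading of the algorithm and its checkstate subroutine: starting from the most recently used configuration, cycle the size, associativity, and line size upward in power-of-two increments, and keep a trial configuration whenever it improves the metric for the current priority R. runPhase is a stand-in for executing the phase and reading hardware counters; the exact iteration order and early-exit condition are our interpretation of the text.

```cpp
// Reuses CacheConfig and Priority from the earlier table-entry sketch.
struct Metrics { double energy; double execTime; };

// Stand-in for executing the phase under a trial configuration and reading
// energy/time from counters and power models; placeholder values only.
Metrics runPhase(const CacheConfig& c) {
    (void)c;
    return {1.0, 1.0};
}

// checkstate(R): score the metrics for priority R (lower is better).
double score(const Metrics& m, Priority r) {
    switch (r) {
        case Priority::Energy:   return m.energy;
        case Priority::ExecTime: return m.execTime;
        default:                 return m.energy * m.execTime;  // EDP
    }
}

CacheConfig pactTune(CacheConfig mru, Priority r, int n) {
    CacheConfig best = mru;                       // defaults to base at startup
    double bestScore = score(runPhase(best), r);
    for (int i = 0; i < n; ++i) {                 // n fine-tuning iterations (n = 3)
        CacheConfig c = best;
        // Cycle size, then associativity, then line size, in power-of-two
        // increments, stopping early when a trial beats the stored best.
        while (c.sizeKB < 32 || c.ways < 4 || c.lineB < 64) {
            if (c.sizeKB < 32)   c.sizeKB *= 2;
            else if (c.ways < 4) c.ways *= 2;
            else                 c.lineB *= 2;
            double s = score(runPhase(c), r);     // one phase execution
            if (s < bestScore) { bestScore = s; best = c; break; }
        }
    }
    return best;                                  // written back to the table
}
```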
IV. EXPERIMENTS

A. Experimental Setup

To evaluate PACT's effectiveness, we used 12 benchmarks from the SPEC CPU2006 benchmark suite [2]. We used SPEC benchmarks since they feature greater execution complexity and runtime variability, and provide a more rigorous test for PACT. We fast-forwarded each application for 3 million instructions, and ran the reference input sets for 1 billion instructions. We used SimPoint 3.2 [12] to determine the distinct phases in each application. To model a system similar to modern-day embedded systems' microprocessors and gather execution statistics, we implemented the proposed approach using gem5 [5]. We simulated a system comprised of private level one instruction and data caches, with a base configuration featuring a 32 KB size, 4-way set associativity, and a 64-byte line size, similar to the ARM Cortex-A9 [1]. Given this base configuration, the configurable size ranged from 2 KB to 32 KB, associativity ranged from 1-way to 4-way, and the line size ranged from 16 bytes to 64 bytes, all in power-of-two increments. We assumed a system with dedicated tuners within each core; thus, the cores are tuned independently of each other. We used McPAT [13] to calculate the system's total power consumption, which we then used, combined with execution statistics from gem5, to calculate the energy consumption.

B. Results

1) PACT Results and Comparison to the Optimal and Prior Work: We evaluated PACT by comparing the results using configurations determined by PACT to results obtained using the base configuration, the optimal configuration (determined through exhaustive search), and PDM (to represent prior work). Figure 4 depicts the EDP, energy, and execution time achieved by PACT as compared to the optimal and PDM configurations. The PACT, optimal, and PDM results are all normalized to the base configuration in order to evaluate the improvements with respect to the base configuration.

Fig. 4: PACT compared to the optimal and PDM. PACT, optimal, and PDM are all normalized to the base configuration.

Figure 4(a) compares the EDP achieved by PACT to the optimal and PDM when the EDP is prioritized. As compared to the base configuration, PACT reduced the EDP by 6.5% on average across all the applications, with reductions as high as 36% for omnetpp. On average, PACT determined configurations that were within 7.7% of the optimal; for 6 out of the 12 applications, PACT's configurations were within less than 1% of the optimal. In a few of the applications, PACT's configurations were worse than the optimal. For mcf, for example (this was the worst case), PACT's configuration was within 12% of the optimal and increased the EDP by 3% with respect to the base configuration. We attribute this behavior to the fact that PACT's tuning was oblivious to some of the applications' intrinsic memory access behaviors. For instance, mcf has long memory access latencies, which conflicted with PACT's attempt to simultaneously tune both the energy and the execution time (delay). However, as expected, PACT achieved similar EDP savings as PDM, since PDM natively optimizes the EDP.

Figure 4(b) compares the energy consumption achieved by PACT to the optimal and PDM when the energy is prioritized. On average, PACT reduced the energy by 8.2% compared to the base configuration, with reductions as high as 28% for libquantum. Unlike in the case of EDP prioritization, PACT did not degrade any application's energy consumption with respect to the base configuration. PACT determined configurations that were within 4.2% of the optimal, on average, and outperformed PDM by 2.2% across all the applications.

Figure 4(c) compares the execution time achieved by PACT to the optimal and PDM when the execution time is prioritized. On average across all the applications, PACT reduced the execution time by 2.7% compared to the base configuration, with the largest reduction achieved for omnetpp. The execution time reduction was much smaller (compared to the EDP and energy reductions) because the base configuration was near-optimal for most applications, as evidenced by the fact that PACT's configurations were within 0.3% of the optimal. These results show PACT's ability to prioritize optimization goals as required.
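For clarity, the normalization used in Fig. 4 (and the energy/EDP arithmetic behind it) can be expressed as below; treating energy as average power times execution time mirrors the McPAT-plus-gem5 methodology of Section IV-A, though the helper itself is ours.

```cpp
// Energy is modeled as average power times execution time; each metric is
// then reported as a ratio to the base (32 KB, 4-way, 64-byte line)
// configuration, so values below 1.0 are improvements.
struct Measured { double avgPowerW; double execTimeS; };   // from McPAT / gem5
struct Normalized { double edp; double energy; double execTime; };

Normalized normalizeToBase(const Measured& cfg, const Measured& base) {
    double eCfg  = cfg.avgPowerW  * cfg.execTimeS;    // energy (J)
    double eBase = base.avgPowerW * base.execTimeS;
    return { (eCfg * cfg.execTimeS) / (eBase * base.execTimeS),  // EDP ratio
             eCfg / eBase,                                       // energy ratio
             cfg.execTimeS / base.execTimeS };                   // time ratio
}
```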

2) Prioritization Tradeoffs: To further illustrate PACT's ability to trade off non-prioritized optimization goals for the prioritized optimization goals, we compared the non-prioritized EDP, energy, or execution time obtained with PACT (with one of the goals prioritized) to the base configuration. Figure 5 shows the results of all the optimization goals (EDP, energy, and execution time) when the system is running under each of the three different priorities.

Fig. 5: Impact of prioritizing one optimization goal on the non-prioritized optimization goals.

Figure 5(a) compares PACT's energy and execution time, with the EDP prioritized, to the base configuration's energy and execution time. When the EDP is prioritized, PACT reduced the EDP, energy, and execution time by 6.5%, 6.%, and 1%, respectively, on average across all the applications. For two applications, hmmer and mcf, prioritizing the EDP increased the execution time by 6% and 3%, respectively. However, for both applications, we observed energy reductions of 5% and 8%, respectively, as compared to the base configuration.

Figure 5(b) compares PACT's EDP and execution time, with the energy prioritized, to the base configuration's EDP and execution time. On average, PACT reduced the EDP and energy by 7.8% and 8.2%, respectively, while the execution time was reduced by less than 1%. We observed that energy prioritization resulted in significant tradeoffs of execution time for some of the benchmarks. For example, in the largest tradeoff observed, PACT reduced the energy by 3% for leslie3d, while the EDP and execution time increased by 6.4% and 2.4%, respectively. Similarly to mcf (Section IV-B), we attribute this behavior to leslie3d's memory access characteristics, which feature long memory access latencies.

Figure 5(c) compares PACT's EDP and energy, with the execution time prioritized, to the base configuration's EDP and energy. On average, PACT reduced the EDP, execution time, and energy by 2.7%, .7%, and 2.7%, respectively. For several of the applications, PACT determined the base configuration to be the best configuration (since the base configuration had the best execution time for those applications); thus, the energy and EDP for those applications were identical to the base. In general, these results illustrate the adversarial nature of energy and execution time (prioritizing one is usually at the expense of the other) and PACT's ability to trade off the execution time to adhere to more stringent energy constraints, or vice versa, based on the executing conditions and device state (e.g., when the system's battery is in a critical state).

3) PACT Overhead: PACT's overhead comprises the hardware and runtime tuning overheads. The hardware overhead comprises the phase history table and the tuner (Section II), which orchestrates the runtime tuning process. We estimated, using synthesizable VHDL and Synopsys Design Compiler [7] simulations, that PACT incurs minimal hardware area and power overheads with respect to an ARM Cortex-A9 microprocessor.
We quantified the runtime tuning overhead using the total tuning stall cycles [4] as: total tuning stall cycles = (number of configurations explored - 1) * tuning stall cycles per configuration. On average, for each phase history table entry, PACT explored 5% of the design space (Section III-A) and incurred 258 stall cycles for each configuration change. Using these estimates, PACT accrued a runtime tuning overhead of 4799 cycles per benchmark. With a 1.9GHz clock frequency, this overhead translates to 2.526µs across all benchmarks.
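The overhead arithmetic can be checked directly; in the sketch below, the stall-cycle formula and the 4799-cycle/1.9GHz conversion come from the text, while the example count of explored configurations is an assumption.

```cpp
#include <cstdio>

// total tuning stall cycles =
//   (configurations explored - 1) * stall cycles per configuration change
long tuningStallCycles(long configsExplored, long stallPerChange) {
    return (configsExplored - 1) * stallPerChange;
}

int main() {
    // Example: exploring 20 configurations (assumed count) at the reported
    // 258 stall cycles per configuration change.
    std::printf("example: %ld stall cycles\n", tuningStallCycles(20, 258));
    // Reported totals: 4799 cycles at a 1.9 GHz clock, consistent with the
    // quoted 2.526 us overhead.
    std::printf("reported: %.3f us\n", 4799.0 / 1.9e9 * 1e6);
    return 0;
}
```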

V. CONCLUSIONS

In this paper, we presented Priority-Aware phase-based Cache Tuning (PACT), which uses the existing phase distance mapping (PDM) framework to determine the best cache configurations for varying runtime optimization goals. We showed PACT's ability to trade off non-prioritized optimization goals when a specific goal must be prioritized due to changing operating conditions or device states. Our experimental results show that PACT performed similarly to PDM for EDP optimization (since PDM focuses on EDP optimization). Furthermore, PACT improved over PDM for energy and execution time optimizations. For future work, we intend to explore techniques for achieving results that are closer to the optimal, without degrading the prioritization potential. In addition, we intend to extend PACT to complex systems with multilevel cache hierarchies.

REFERENCES

[1] ARM. Accessed: December 2016.
[2] SPEC CPU2006. Accessed: January 2016.
[3] T. Adegbija, A. Gordon-Ross, and A. Munir. Phase distance mapping: a phase-based cache tuning methodology for embedded systems. Design Automation for Embedded Systems, 18(3-4):251-278, 2014.
[4] T. Adegbija, A. Gordon-Ross, and M. Rawlins. Analysis of cache tuner architectural layouts for multicore embedded systems. In 2014 IEEE International Performance Computing and Communications Conference (IPCCC), pages 1-8. IEEE, 2014.
[5] N. Binkert, B. Beckmann, G. Black, S. K. Reinhardt, A. Saidi, A. Basu, J. Hestness, D. R. Hower, T. Krishna, S. Sardashti, et al. The gem5 simulator. ACM SIGARCH Computer Architecture News, 39(2):1-7, 2011.
[6] L. Chen, X. Zou, J. Lei, and Z. Liu. Dynamically reconfigurable cache for low-power embedded system. In Third International Conference on Natural Computation (ICNC 2007), volume 5. IEEE, 2007.
[7] Design Compiler. Synopsys, Inc.
[8] A. Ghosh and T. Givargis. Cache optimization for embedded processor cores: An analytical approach. ACM Transactions on Design Automation of Electronic Systems (TODAES), 9(4):419-440, 2004.
[9] A. Gordon-Ross, J. Lau, and B. Calder. Phase-based cache reconfiguration for a highly-configurable two-level cache hierarchy. In Proceedings of the 18th ACM Great Lakes Symposium on VLSI. ACM, 2008.
[10] A. Gordon-Ross and F. Vahid. A self-tuning configurable cache. In Proceedings of the 44th Annual Design Automation Conference. ACM, 2007.
[11] H. Hajimiri, P. Mishra, and S. Bhunia. Dynamic cache tuning for efficient memory based computing in multicore architectures. In 2013 26th International Conference on VLSI Design and 2013 12th International Conference on Embedded Systems (VLSID). IEEE, 2013.
[12] G. Hamerly, E. Perelman, J. Lau, and B. Calder. SimPoint 3.0: Faster and more flexible program phase analysis. Journal of Instruction Level Parallelism, 7(4):1-28, 2005.
[13] S. Li, J. H. Ahn, R. D. Strong, J. B. Brockman, D. M. Tullsen, and N. P. Jouppi. The McPAT framework for multicore and manycore architectures: Simultaneously modeling power, area, and timing. ACM Transactions on Architecture and Code Optimization (TACO), 10(1):5, 2013.
[14] O. Navarro, T. Leiding, and M. Hübner. Configurable cache tuning with a victim cache. In 2015 10th International Symposium on Reconfigurable Communication-centric Systems-on-Chip (ReCoSoC), pages 1-6. IEEE, 2015.
[15] M. Peng, J. Sun, and Y. Wang. A phase-based self-tuning algorithm for reconfigurable cache. In First International Conference on the Digital Society (ICDS'07). IEEE, 2007.
[16] M. Rawlins and A. Gordon-Ross. CPACT: the conditional parameter adjustment cache tuner for dual-core architectures. In 2011 IEEE 29th International Conference on Computer Design (ICCD). IEEE, 2011.
[17] T. Sherwood, E. Perelman, G. Hamerly, S. Sair, and B. Calder. Discovering and exploiting program phases. IEEE Micro, 23(6):84-93, 2003.
[18] T. Sondag and H. Rajan. Phase-based tuning for better utilization of performance-asymmetric multicore processors. In 2011 9th Annual IEEE/ACM International Symposium on Code Generation and Optimization (CGO). IEEE, 2011.
[19] K. Vivekanandarajah, T. Srikanthan, and C. T. Clarke. Profile directed instruction cache tuning for embedded systems. In IEEE Computer Society Annual Symposium on Emerging VLSI Technologies and Architectures, 6 pp. IEEE, 2006.
[20] C. Zhang, F. Vahid, and W. Najjar. A highly configurable cache architecture for embedded systems. In Proceedings of the 30th Annual International Symposium on Computer Architecture (ISCA). IEEE, 2003.


An Approach for Adaptive DRAM Temperature and Power Management IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 An Approach for Adaptive DRAM Temperature and Power Management Song Liu, Yu Zhang, Seda Ogrenci Memik, and Gokhan Memik Abstract High-performance

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

IN modern systems, the high latency of accessing largecapacity

IN modern systems, the high latency of accessing largecapacity IEEE TRANSACTIONS ON PARALLEL AND DISTRIBUTED SYSTEMS, VOL. 27, NO. 10, OCTOBER 2016 3071 BLISS: Balancing Performance, Fairness and Complexity in Memory Access Scheduling Lavanya Subramanian, Donghyuk

More information

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Wenun Wang and Wei-Ming Lin Department of Electrical and Computer Engineering, The University

More information

Improving Energy Efficiency of Block-Matching Motion Estimation Using Dynamic Partial Reconfiguration

Improving Energy Efficiency of Block-Matching Motion Estimation Using Dynamic Partial Reconfiguration , pp.517-521 http://dx.doi.org/10.14257/astl.2015.1 Improving Energy Efficiency of Block-Matching Motion Estimation Using Dynamic Partial Reconfiguration Jooheung Lee 1 and Jungwon Cho 2, * 1 Dept. of

More information

Adaptive Spatiotemporal Node Selection in Dynamic Networks

Adaptive Spatiotemporal Node Selection in Dynamic Networks Adaptive Spatiotemporal Node Selection in Dynamic Networks Pradip Hari, John B. P. McCabe, Jonathan Banafato, Marcus Henry, Ulrich Kremer, Dept. of Computer Science, Rutgers University Kevin Ko, Emmanouil

More information

Dynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors

Dynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors Dynamic Cache Pooling for Improving Energy Efficiency in 3D Stacked Multicore Processors Jie Meng, Tiansheng Zhang, and Ayse K. Coskun Electrical and Computer Engineering Department, Boston University,

More information

Last Level Cache Size Flexible Heterogeneity in Embedded Systems

Last Level Cache Size Flexible Heterogeneity in Embedded Systems Last Level Cache Size Flexible Heterogeneity in Embedded Systems Mario D. Marino, Kuan-Ching Li Leeds Beckett University, m.d.marino@leedsbeckett.ac.uk Corresponding Author, Providence University, kuancli@gm.pu.edu.tw

More information

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading

TDT Coarse-Grained Multithreading. Review on ILP. Multi-threaded execution. Contents. Fine-Grained Multithreading Review on ILP TDT 4260 Chap 5 TLP & Hierarchy What is ILP? Let the compiler find the ILP Advantages? Disadvantages? Let the HW find the ILP Advantages? Disadvantages? Contents Multi-threading Chap 3.5

More information

Analyzing and Improving Clustering Based Sampling for Microprocessor Simulation

Analyzing and Improving Clustering Based Sampling for Microprocessor Simulation Analyzing and Improving Clustering Based Sampling for Microprocessor Simulation Yue Luo, Ajay Joshi, Aashish Phansalkar, Lizy John, and Joydeep Ghosh Department of Electrical and Computer Engineering University

More information

http://uu.diva-portal.org This is an author produced version of a paper presented at the 4 th Swedish Workshop on Multi-Core Computing, November 23-25, 2011, Linköping, Sweden. Citation for the published

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department SIP: Speculative Insertion Policy for High Performance Caching Hongil Yoon Tan Zhang Mikko H. Lipasti Technical Report #1676 June 2010 SIP: Speculative Insertion Policy for

More information

An Adaptive Filtering Mechanism for Energy Efficient Data Prefetching

An Adaptive Filtering Mechanism for Energy Efficient Data Prefetching 18th Asia and South Pacific Design Automation Conference January 22-25, 2013 - Yokohama, Japan An Adaptive Filtering Mechanism for Energy Efficient Data Prefetching Xianglei Dang, Xiaoyin Wang, Dong Tong,

More information

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 02. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 02 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 2.1 The levels in a typical memory hierarchy in a server computer shown on top (a) and in

More information

Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors. By: Anvesh Polepalli Raj Muchhala

Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors. By: Anvesh Polepalli Raj Muchhala Review on ichat: Inter Cache Hardware Assistant Data Transfer for Heterogeneous Chip Multiprocessors By: Anvesh Polepalli Raj Muchhala Introduction Integrating CPU and GPU into a single chip for performance

More information

Memory Hierarchy. Slides contents from:

Memory Hierarchy. Slides contents from: Memory Hierarchy Slides contents from: Hennessy & Patterson, 5ed Appendix B and Chapter 2 David Wentzlaff, ELE 475 Computer Architecture MJT, High Performance Computing, NPTEL Memory Performance Gap Memory

More information

Improving Cache Performance using Victim Tag Stores

Improving Cache Performance using Victim Tag Stores Improving Cache Performance using Victim Tag Stores SAFARI Technical Report No. 2011-009 Vivek Seshadri, Onur Mutlu, Todd Mowry, Michael A Kozuch {vseshadr,tcm}@cs.cmu.edu, onur@cmu.edu, michael.a.kozuch@intel.com

More information

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT

Hardware Assisted Recursive Packet Classification Module for IPv6 etworks ABSTRACT Hardware Assisted Recursive Packet Classification Module for IPv6 etworks Shivvasangari Subramani [shivva1@umbc.edu] Department of Computer Science and Electrical Engineering University of Maryland Baltimore

More information