Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems


Fairness via Source Throttling: A Configurable and High-Performance Fairness Substrate for Multi-Core Memory Systems

Eiman Ebrahimi  Chang Joo Lee  Onur Mutlu  Yale N. Patt
Department of Electrical and Computer Engineering, The University of Texas at Austin  {ebrahimi, cjlee, patt}@ece.utexas.edu
Computer Architecture Laboratory (CALCM), Carnegie Mellon University  onur@cmu.edu

Abstract
Cores in a chip-multiprocessor (CMP) system share multiple hardware resources in the memory subsystem. If resource sharing is unfair, some applications can be delayed significantly while others are unfairly prioritized. Previous research proposed separate fairness mechanisms in each individual resource. Such resource-based fairness mechanisms implemented independently in each resource can make contradictory decisions, leading to low fairness and loss of performance. Therefore, a coordinated mechanism that provides fairness in the entire shared memory system is desirable.

This paper proposes a new approach that provides fairness in the entire shared memory system, thereby eliminating the need for and complexity of developing fairness mechanisms for each individual resource. Our technique, Fairness via Source Throttling (FST), estimates the unfairness in the entire shared memory system. If the estimated unfairness is above a threshold set by system software, FST throttles down cores causing unfairness by limiting the number of requests they can inject into the system and the frequency at which they do. As such, our source-based fairness control ensures fairness decisions are made in tandem in the entire memory system. FST also enforces thread priorities/weights, and enables system software to enforce different fairness objectives and fairness-performance tradeoffs in the memory system. Our evaluations show that FST provides the best system fairness and performance compared to four systems with no fairness control and with state-of-the-art fairness mechanisms implemented in both shared caches and memory controllers.

Categories and Subject Descriptors: C. [Processor Architectures]: General; C.5.3 [Microcomputers]: Microprocessors; C.1.2 [Multiple Data Stream Architectures (Multiprocessors)]
General Terms: Design, Performance.

1. Introduction
Chip-multiprocessor (CMP) systems commonly share a large portion of the memory subsystem between different cores. Main memory and shared caches are two examples of shared resources. Memory requests from different applications executing on different cores of a CMP can interfere with and delay each other in the shared memory subsystem. Compared to a scenario where each application runs alone on the CMP, this inter-core interference causes the execution of simultaneously running applications to slow down. However, sharing memory system resources affects the execution of different applications very differently because the resource management algorithms employed in the shared resources are unfair [22]. As a result, some applications are unfairly slowed down a lot more than others.
Figure 1 shows examples of vastly differing effects of resource sharing on simultaneously executing applications on a 2-core CMP system (Section 4 describes our experimental setup). When bzip2 and art run simultaneously with equal priorities, the inter-core interference caused by the sharing of memory system resources slows down bzip2 by 5.2x compared to when it is run alone, while art slows down by only 1.15x. In order to achieve system-level fairness or quality of service (QoS) objectives, the system software (operating system or virtual machine monitor) expects proportional progress of equal-priority applications when running simultaneously. Clearly, disparities in slowdown like those shown in Figure 1 due to sharing of the memory system resources between simultaneously running equal-priority applications are unacceptable since they would make priority-based thread scheduling policies ineffective [6].

[Figure 1: Disparity in slowdowns due to unfairness — normalized execution times of zeusmp and bzip2.]

To mitigate this problem, previous works [12, 13, 15, 22-25] on fair memory system design for multi-core systems mainly focused on partitioning a particular shared resource (cache space, cache bandwidth, or memory bandwidth) to provide fairness in the use of that shared resource. However, none of these prior works directly target a fair memory system design that provides fair sharing of all resources together. We define a memory system design as fair if the slowdowns of equal-priority applications running simultaneously on the cores sharing that memory system are the same (this definition has been used in several prior works [3, 7, 18, 22, 28]). As shown in previous research [2], employing separate uncoordinated fairness techniques together does not necessarily result in a fair memory system design. This is because fairness mechanisms in different resources can contradict each other. Our goal in this paper is to develop a low-cost architectural technique that allows system software fairness policies to be achieved in CMPs by enabling fair sharing of the entire memory system, without requiring multiple complicated, specialized, and possibly contradictory fairness techniques for different shared resources.

Basic Idea: To achieve this goal, we propose a fundamentally new mechanism that 1) gathers dynamic feedback information about

the unfairness in the system and 2) uses this information to dynamically adapt the rate at which the different cores inject requests into the shared memory subsystem such that system-level fairness objectives are met. To calculate unfairness at run-time, a slowdown value is estimated for each application in hardware. Slowdown is defined as T_shared/T_alone, where T_shared is the number of cycles it takes to run simultaneously with other applications and T_alone is the number of cycles it would have taken the application to run alone. Unfairness is calculated as the ratio of the largest slowdown to the smallest slowdown of the simultaneously running applications. If the unfairness in the system becomes larger than the unfairness threshold set by the system software, the core that interferes most with the core experiencing the largest slowdown is throttled down. This means that the rate at which the most interfering core injects memory requests into the system is reduced, in order to reduce the inter-core interference it generates. If the system software's fairness goal is met, all cores are allowed to throttle up to improve system throughput while system unfairness is continuously monitored. This configurable hardware substrate enables the system software to achieve different QoS/fairness policies: it can determine the balance between fairness and system throughput, dictate different fairness objectives, and enforce thread priorities in the entire memory system.

Summary of Evaluation: We evaluate our technique on both 2-core and 4-core CMP systems in comparison to three previously-proposed state-of-the-art shared hardware resource management mechanisms. Experimental results across ten multi-programmed workloads on a 4-core CMP show that our proposed technique improves average system performance by 25.6%/14.5% while reducing system unfairness by 44.4%/36.2% compared respectively to a system with no fairness techniques employed and a system with state-of-the-art fairness mechanisms implemented for both shared cache capacity [25] and the shared memory controller [23].

Contributions: We make the following contributions:
1. We introduce a low-cost, hardware-based and system-software-configurable mechanism to achieve fairness goals specified by system software in the entire shared memory system.
2. We introduce a mechanism that collects dynamic feedback on the unfairness of the system and adjusts request rates of the different cores to achieve the desired fairness/performance balance. By performing source-based fairness control, this work eliminates the need for complicated individual resource-based fairness mechanisms that are implemented independently in each resource and that require coordination.
3. We qualitatively and quantitatively compare our proposed technique to multiple prior works in fair shared cache partitioning and fair memory scheduling. We find that our proposal, while simpler, provides significantly higher system performance and better system fairness compared to previous proposals.

2. Background and Motivation
We first present brief background on how we model the shared memory system of CMPs. We then motivate our approach to providing fairness in the entire shared memory system by showing how employing resource-based fairness techniques does not necessarily provide better overall fairness.

2.1 Shared CMP Memory Systems
In this paper, we assume that the last-level (L2) cache and off-chip DRAM bandwidth are shared by multiple cores on a chip as in many commercial CMPs [1, 11, 29, 31].
Each core has its own L1 cache. Miss Status Holding/Information Registers (MSHRs) [16] keep track of all requests to the shared L2 cache until they are serviced. When an L1 cache miss occurs, an access request to the L2 cache is created by allocating an MSHR entry. Once the request is serviced by the L2 cache or DRAM system as a result of a cache hit or miss respectively, the corresponding MSHR entry is freed and used for a new request. The number of MSHR entries for a core indicates the total number of outstanding requests allowed to the L2 cache and DRAM system. Therefore, increasing/decreasing the number of MSHR entries for a core can increase/decrease the rate at which memory requests from the core are injected into the shared memory system.

2.2 Motivation
Most prior works on providing fairness in shared resources focus on partitioning of a single shared resource. However, by partitioning a single shared resource, the demands on other shared resources may change such that neither system fairness nor system performance is improved. In the following example, we describe how constraining the rate at which an application's memory requests are injected to the shared resources can result in higher fairness and system performance than employing fair partitioning of a single resource.

Figure 2 shows the memory-related stall time (footnote 1) of equal-priority applications A and B running on different cores of a 2-core CMP. For simplicity of explanation, we assume an application stalls when there is an outstanding memory request, focus on requests going to the same cache set and memory bank, and assume all shown accesses to the shared cache occur before any replacement happens. Application A is very memory-intensive, while application B is much less memory-intensive. As prior work has observed [23], when a memory-intensive application with already high memory-related stall time interferes with a less memory-intensive application with much smaller memory-related stall time, delaying the former improves system fairness because the additional delay causes a smaller slowdown for the memory-intensive application than for the non-intensive one. Doing so can also improve throughput by allowing the less memory-intensive application to quickly return to its compute-intensive portion while the memory-intensive application continues waiting on memory.

Figures 2(a) and (b) show the initial L2 cache state, access order and memory-related stall time when no fairness mechanism is employed in any of the shared resources. Application A's large number of memory requests arrive at the L2 cache earlier, and as a result, the small number of memory requests from application B are significantly delayed. This causes large unfairness because the compute-intensive application B is slowed down significantly more than the already-slow memory-intensive application A.

Figures 2(c) and (d) show that employing a fair cache increases the fairness in utilization of the cache by allocating an equal number of ways from the accessed set to the two equal-priority applications. This increases application A's cache misses compared to the baseline with no fairness control. Even though application B gets more hits as a result of fair sharing of the cache, its memory-related stall time does not reduce due to increased interference in the main memory system from application A's increased misses.

Footnote 1: Stall-time is the amount of execution time in which the application cannot retire instructions.
Memory-related stall time caused by a memory request consists of: 1) time to access the L2 cache, and if the access is a miss, 2) time to wait for the required DRAM bank to become available, and finally 3) time to access DRAM.

[Figure 2: Access pattern and memory-related stall time of requests with (a, b) no fairness control, (c, d) fair cache, and (e, f) fair source throttling (FST).]

Application B's memory requests are still delayed behind the large number of memory requests from application A. Application A's memory-related stall time increases slightly due to its increased cache misses; however, since application A already had a large memory-related stall time, this slight increase does not incur a large slowdown for it. As a result, fairness improves slightly, but system throughput degrades because the system spends more time stalling rather than computing compared to no fair caching.

In Figure 2, if the unfair slowdown of application B due to application A is detected at run-time, system fairness can be improved by limiting A's memory requests and reducing the frequency at which they are issued into the shared memory system. This is shown in the access order and memory-related stall times of Figures 2(e) and (f). If the frequency at which application A's memory requests are injected into the shared memory system is reduced, the memory access pattern can change as shown in Figure 2(e). We use the term throttled requests to refer to those requests from application A that are delayed when accessing the shared L2 cache due to A's reduced injection rate. As a result of the late arrival of these throttled requests, application B's memory-related stall time significantly reduces (because A's requests no longer interfere with B's) while application A's stall time increases slightly. Overall, this ultimately improves both system fairness and throughput compared to both no fairness control and just a fair cache. Fairness improves because the memory-intensive application is delayed such that the less intensive application's memory-related stall time does not increase significantly compared to when running alone. Delaying the memory-intensive application does not slow it down too much compared to when running alone, because even when running alone it has high memory-related stall time. System throughput improves because the total amount of time spent computing rather than stalling in the entire system increases.

The key insight is that both system fairness and throughput can improve by detecting high system unfairness at run-time and dynamically limiting the number of memory requests from the aggressive applications or delaying their issue. In essence, we propose a new approach that performs source-based fairness in the entire memory system rather than individual resource-based fairness that implements complex and possibly contradictory fairness mechanisms in each resource.
Sources (i.e., cores) can collectively achieve fairness by throttling themselves based on dynamic unfairness feedback, eliminating the need for implementing possibly contradictory/conflicting fairness mechanisms and complicated coordination techniques between them.

3. Fairness via Source Throttling
To enable fairness in the entire memory system, we propose Fairness via Source Throttling (FST). The proposed mechanism consists of two major components: 1) runtime unfairness evaluation and 2) dynamic request throttling.

3.1 Runtime Unfairness Evaluation Overview
The goal of this component is to dynamically obtain an estimate of the unfairness in the CMP memory system. We use the following definitions in determining unfairness: 1) We define a memory system design as fair if the slowdowns of equal-priority applications running simultaneously on the cores of a CMP are the same, similarly to previous works [3, 7, 18, 22, 28]. 2) We define slowdown as T_shared/T_alone, where T_shared is the number of cycles it takes to run simultaneously with other applications and T_alone is the number of cycles it would have taken the application to run alone on the same system.

The main challenge in the design of the runtime unfairness evaluation component is obtaining information about the number of cycles it would have taken an application to run alone, while it is running simultaneously with other applications. To do so, we estimate the number of extra cycles it takes an application to execute due to inter-core interference in the shared memory system, called T_excess. Using this estimate, T_alone is calculated as T_shared - T_excess. The following equations show how the Individual Slowdown (IS) of each application and the Unfairness of the system are calculated:

IS_i = T_i^shared / T_i^alone,    Unfairness = MAX{IS_0, IS_1, ..., IS_{N-1}} / MIN{IS_0, IS_1, ..., IS_{N-1}}

Section 3.3 explains in detail how the runtime unfairness evaluation component is implemented and in particular how T_excess is estimated. Assuming for now that this component is in place, we next explain how the information it provides is used to determine how each application is throttled to achieve fairness in the entire shared memory system.

3.2 Dynamic Request Throttling
This component is responsible for dynamically adjusting the rate at which each core/application (footnote 2) makes requests to the shared resources. This is done on an interval basis as shown in Figure 3. An interval ends when each core has executed a certain number of instructions from the beginning of that interval. During each interval (for example Interval 1 in Figure 3) the runtime unfairness evaluation component gathers feedback used to estimate the slowdown of each application. At the beginning of the next interval (Interval 2), the feedback information obtained during the prior interval is used to make a decision about the request rates of each application for that interval.

Footnote 2: Since each core runs a separate application, we use the words core and application interchangeably in this paper. See also Section
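To make the interval-based bookkeeping above concrete, the following is a minimal C++ sketch of the unfairness evaluation step. It assumes hypothetical per-core counters (cycles_this_interval and excess_cycles) that the hardware of Section 3.3 would supply; the structure and function names are illustrative, not the paper's implementation.

```cpp
#include <algorithm>
#include <cstdint>
#include <limits>
#include <vector>

// Per-core feedback gathered by the runtime unfairness evaluation
// component during one interval (Sections 3.1 and 3.3).
struct CoreIntervalStats {
    uint64_t cycles_this_interval; // T_shared for this interval
    uint64_t excess_cycles;        // T_excess: cycles lost to inter-core interference
};

// Individual Slowdown: IS_i = T_shared / T_alone, with T_alone = T_shared - T_excess.
double individual_slowdown(const CoreIntervalStats& s) {
    const uint64_t t_alone = s.cycles_this_interval - s.excess_cycles;
    return t_alone == 0 ? 1.0
                        : static_cast<double>(s.cycles_this_interval) / t_alone;
}

// Unfairness = MAX(IS_i) / MIN(IS_i) over all simultaneously running applications.
double estimate_unfairness(const std::vector<CoreIntervalStats>& cores) {
    double max_is = 1.0;
    double min_is = std::numeric_limits<double>::max();
    for (const auto& c : cores) {
        const double is = individual_slowdown(c);
        max_is = std::max(max_is, is);
        min_is = std::min(min_is, is);
    }
    return max_is / min_is;
}
```

The resulting unfairness value is what Algorithm 1 below compares against the system-software-defined threshold at each interval boundary.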

[Figure 3: FST's interval-based estimation and throttling — slowdowns estimated during Interval 1 are used to calculate unfairness and determine request rates for Interval 2.]

More precisely, slowdown values estimated during Interval 1 are used to estimate unfairness for the system. That unfairness value is used to determine the request rates for the different applications for the duration of Interval 2. During the next interval (Interval 2), those request rates are applied, and unfairness evaluation is performed again. The algorithm used to adjust the request rate of each application using the unfairness estimate calculated in the prior interval is shown in Algorithm 1. To ease explanations, Algorithm 1 is simplified for dual-core configurations. Section 3.5 presents the more general algorithm for more than two cores. We define multiple possible levels of aggressiveness for the request rate of each application. The dynamic request throttling component makes a decision to increase/decrease or keep constant the request rate of each application at interval boundaries. We refer to increasing/decreasing the request rate of an application as throttling the application up/down.

Algorithm 1 Dynamic Request Throttling
if Estimated Unfairness > Unfairness Threshold then
    Throttle down application with the smallest slowdown
    Throttle up application with the largest slowdown
    Reset Successive Fairness Achieved Intervals
else if Successive Fairness Achieved Intervals = threshold then
    Throttle all applications up
    Reset Successive Fairness Achieved Intervals
else
    Increment Successive Fairness Achieved Intervals

At the end of each interval, the algorithm compares the unfairness estimated in the previous interval to the unfairness threshold that is defined by system software. If the fairness goal has not been met in the previous interval, the algorithm reduces the request rate of the application with the smallest individual slowdown value and increases the request rate of the application with the largest individual slowdown value. This reduces the number and frequency of memory requests generated for and inserted into the memory resources by the application with the smallest estimated slowdown, thereby reducing its interference with other cores. The increase in the request rate of the application with the highest slowdown allows it to be more aggressive in exploiting Memory-Level Parallelism (MLP) [8] and as a result reduces its slowdown. If the fairness goal is met for a predetermined number of intervals (tracked by a Successive Fairness Achieved Intervals counter in Algorithm 1), the dynamic request throttling component attempts to increase system throughput by increasing the request rates of all applications by one level. This is done because our proposed mechanism strives to increase throughput while maintaining the fairness goals set by the system software. Increasing the request rate of all applications might result in unfairness. However, the unfairness evaluation during the interval in which this happens detects this occurrence and dynamically adjusts the request rates again.

Throttling Mechanisms: Our mechanism increases/decreases the request rate of each application in multiple ways: 1) Adjusting the number of outstanding misses an application can have at any given time. To do so, an MSHR quota, which determines the maximum number of MSHR entries an application can use at any given time, is enforced for each application.
Reducing MSHR entries for an application reduces the pressure caused by that application's requests on all shared memory system resources by limiting the number of concurrent requests from that application contending for service from the shared resources. This reduces other simultaneously running applications' memory-related stall times and gives them the opportunity to speed up. 2) Adjusting the frequency at which requests in the MSHRs are issued to access the L2 cache. Reducing this frequency for an application reduces the number of memory requests per unit time from that application that contend for shared resources. This allows memory requests from other applications to be prioritized in accessing shared resources even if the application that is throttled down does not have high MLP to begin with and is not sensitive to a reduction in the number of its MSHRs. We refer to this throttling technique as frequency throttling. We use both of these mechanisms to reduce the interference caused by the application that experiences the smallest slowdown on the application that experiences the largest slowdown.

3.3 Unfairness Evaluation Component Design
T_shared is simply the number of cycles it takes to execute an application in an interval. Estimating T_alone is more difficult, and FST achieves this by estimating T_excess for each core, which is the number of cycles the core's execution time is lengthened due to interference from other cores in the shared memory system. To estimate T_excess, the unfairness evaluation component keeps track of the inter-core interference each core incurs.

Tracking Inter-Core Interference: We consider three sources of inter-core interference: 1) cache, 2) DRAM bus and bank conflict, and 3) DRAM row-buffer (footnote 3). Our mechanism uses an InterferencePerCore bit-vector whose purpose is to indicate whether or not a core is delayed due to inter-core interference. In order to track interference from each source separately, a copy of InterferencePerCore is maintained for each interference source. A main copy, which is updated by taking the union of the different InterferencePerCore vectors, is eventually used to update T_excess as described below. When FST detects inter-core interference for core i at any shared resource, it sets bit i of the InterferencePerCore bit-vector, indicating that the core was delayed due to interference. At the same time, it also sets an InterferingCoreId field in the corresponding interfered-with memory request's MSHR entry. This field indicates which core interfered with this request and is later used to reset the corresponding bit in the InterferencePerCore vector when the interfered-with request is scheduled/serviced. We explain this process in more detail for each resource in the subsections below. If a memory request has not been interfered with, its InterferingCoreId will be the same as the core id of the core it was generated by.

Footnote 3: On-chip interconnect can also experience inter-core interference [4]. Feedback information similar to that obtained for the three sources of inter-core interference we account for can be collected for the on-chip interconnect. That information can be incorporated into our technique seamlessly, which we leave as part of future work.
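The bookkeeping just described can be summarized with a small C++ sketch of the state FST would keep per core; Algorithm 2 below gives the cycle-by-cycle policy that drives it. The structure layout, field names, and the example core count are illustrative choices, not the paper's exact hardware organization.

```cpp
#include <array>
#include <bitset>
#include <cstdint>

constexpr int kNumCores = 4; // example CMP size; any core count works

// One InterferencePerCore copy is kept per interference source; the main copy
// is the union of the per-source copies (Section 3.3).
struct InterferenceState {
    std::bitset<kNumCores> cache;         // cache interference
    std::bitset<kNumCores> dram_bus_bank; // DRAM bus/bank conflict interference
    std::bitset<kNumCores> dram_row;      // DRAM row-buffer interference
    std::bitset<kNumCores> combined() const { return cache | dram_bus_bank | dram_row; }
};

// Per-MSHR-entry tag: which core delayed this request. It equals the owning
// core's id when the request has not been interfered with.
struct MshrEntryTag {
    int owner_core;
    int interfering_core_id; // initialized to owner_core
};

// Per-core ExcessCycles counters accumulate T_excess over the current interval.
struct ExcessCycleCounters {
    std::array<uint64_t, kNumCores> excess{};
    // Called once per cycle with the main InterferencePerCore copy.
    void tick(const std::bitset<kNumCores>& interference_per_core) {
        for (int i = 0; i < kNumCores; ++i)
            if (interference_per_core.test(i)) ++excess[i];
    }
};
```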

Updating T_excess: FST stores the number of extra cycles it takes to execute a given interval's instructions due to inter-core interference (T_excess) in an ExcessCycles counter per core. Every cycle, if the InterferencePerCore bit of a core is set, FST increments the corresponding core's ExcessCycles counter. Algorithm 2 shows how FST calculates ExcessCycles for a given core i. The following subsections explain in detail how each source of inter-core interference is taken into account to set InterferencePerCore. Table 5 summarizes the required storage needed to implement the mechanisms explained here.

Algorithm 2 Estimation of T_excess for core i
Every cycle
    if inter-core cache or DRAM bus or DRAM bank or DRAM row-buffer interference then
        set InterferencePerCore bit i
        set InterferingCoreId in delayed memory request
    if InterferencePerCore bit i is set then
        Increment ExcessCycles for core i
Every L2 cache fill for a miss due to interference OR every time a memory request which is a row-buffer miss due to interference is serviced
    reset InterferencePerCore bit of core i
    InterferingCoreId of core i = i (no interference)
Every time a memory request is scheduled to DRAM
    if core i has no requests waiting on any bank which is busy servicing another core j (j != i) then
        reset InterferencePerCore bit of core i

Cache Interference
In order to estimate inter-core cache interference, for each core i we need to track the last-level cache misses that are caused by any other core j. To do so, FST uses a pollution filter for each core to approximate such misses. The pollution filter is a bit-vector that is indexed with the lower-order bits of the accessed cache line's address (footnote 4). In the bit-vector, a set entry indicates that a cache line belonging to the corresponding core was evicted by another core's request. When a request from core j replaces one of core i's cache lines, core i's filter is accessed using the evicted line's address, and the corresponding bit is set. When a memory request from core i misses the cache, its filter is accessed with the missing address. If the corresponding bit is set, the filter predicts that this line was previously evicted due to inter-core interference and the bit in the filter is reset. When such a prediction is made, the InterferencePerCore bit corresponding to core i is set to indicate that core i is stalling due to cache interference. Once the interfered-with memory request is finally serviced from the memory system and the corresponding cache line is filled, core i's filter is accessed and the bit is reset.

DRAM Bus and Bank Conflict Interference
Inter-core DRAM bank conflict interference occurs when core i's memory request cannot access the bank it maps to, because a request from some other core j is being serviced by that memory bank. DRAM bus conflict interference occurs when a core cannot use the DRAM because another core is using the DRAM bus. These situations are easily detectable at the memory controller, as described in [22]. When such interference is detected, the InterferencePerCore bit corresponding to core i is set to indicate that core i is stalling due to a DRAM bus or bank conflict. This bit is reset when no request from core i is being prevented access to DRAM by the other cores' requests.

Footnote 4: We empirically determined the pollution filter for each core to have 2K entries in our evaluations.
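The pollution filter just described can be approximated in a few lines of C++. This is a minimal sketch assuming a 2K-entry filter (the size used in the paper's evaluation) indexed by the low-order bits of the cache-line address; the class and method names are illustrative.

```cpp
#include <bitset>
#include <cstdint>

// Per-core pollution filter (Section 3.3, "Cache Interference"): a bit-vector
// indexed with the lower-order bits of the cache line address. A set bit means
// a line of this core was evicted by another core's request.
class PollutionFilter {
public:
    // Called when another core's request evicts one of this core's lines.
    void on_eviction_by_other_core(uint64_t evicted_line_addr) {
        bits_.set(index(evicted_line_addr));
    }

    // Called when this core misses in the shared L2. Returns true if the miss
    // is predicted to have been caused by inter-core interference; the filter
    // bit is cleared once the prediction is made, as described above.
    bool on_l2_miss(uint64_t miss_line_addr) {
        const size_t idx = index(miss_line_addr);
        const bool interfered = bits_.test(idx);
        if (interfered) bits_.reset(idx);
        return interfered;
    }

    // Called when the interfered-with request is serviced and the line is filled.
    void on_fill(uint64_t line_addr) { bits_.reset(index(line_addr)); }

private:
    static constexpr size_t kEntries = 2048; // 2K entries, as in the evaluation
    static size_t index(uint64_t line_addr) { return line_addr % kEntries; }
    std::bitset<kEntries> bits_;
};
```

When on_l2_miss returns true, FST would set the corresponding InterferencePerCore bit as described above.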
DRAM Row-Buffer Interference
This type of interference occurs when a potential row-buffer hit of core i when running alone is converted to a row-buffer miss/conflict due to a memory request of some core j when running together with others. This can happen if a request from core j closes a DRAM row opened by a prior request from core i that is also accessed by a subsequent request from core i. To track such interference, a Shadow Row-buffer Address Register (SRAR) is maintained for each core for each bank. Whenever core i's memory request accesses some row R, the SRAR of core i is updated to row R. Accesses to the same bank from some other core j do not affect the SRAR of core i. As such, at any point in time, core i's SRAR will contain the last row accessed by the last memory request serviced from that core in that bank. When core i's memory request suffers a row-buffer miss because another core j's row is open in the row-buffer of the accessed bank, the SRAR of core i is consulted. If the SRAR indicates a row-buffer hit would have happened, then inter-core row-buffer interference is detected. As a result, the InterferencePerCore bit corresponding to core i is set. Once the memory request is serviced, the corresponding InterferencePerCore bit is reset (footnote 5).

Slowdown Due to Throttling
When an application is throttled, it experiences some slowdown due to the throttling. This slowdown is different from the inter-core interference induced slowdown estimated by the mechanisms of the preceding subsections. Throttling-induced slowdown is a function of an application's sensitivity to 1) the number of MSHRs that are available to it, and 2) the frequency of injecting requests into the shared resources. Using profiling, we determine for each throttling level l the corresponding slowdown (due to throttling) f of an application A. At runtime, any estimated slowdown for application A when running at throttling level l is multiplied by f. We find that accounting for this slowdown improves the performance of FST by 1.9% and 0.9% on the 2-core and 4-core systems respectively. As such, even though we use such information in our evaluations, it is not fundamental to FST's benefits.

Implementation Details
Section 3.3 describes how separate copies of InterferencePerCore are maintained per interference source. The main copy which is used by FST for updating T_excess is physically located close by the L2 cache. Note that shared resources may be located far away from each other on the chip. Any possible timing constraints on the sending of updates to the InterferencePerCore bit-vector from the shared resources can be eliminated by making these updates periodically.

Footnote 5: To be more precise, the bit is reset row-buffer-hit-latency cycles before the memory request is serviced. The memory request would have taken at least row-buffer-hit-latency cycles had there been no interference.
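A minimal C++ sketch of the per-core, per-bank Shadow Row-buffer Address Register check described above; the class layout and names are illustrative assumptions, and the open-row information it is compared against would come from the memory controller.

```cpp
#include <cstdint>
#include <vector>

// Shadow Row-buffer Address Registers (Section 3.3, "DRAM Row-Buffer
// Interference"): one register per core per bank, holding the last row this
// core's serviced requests accessed in that bank.
class ShadowRowBuffers {
public:
    ShadowRowBuffers(int num_cores, int num_banks)
        : srar_(num_cores, std::vector<int64_t>(num_banks, kNoRow)) {}

    // Called whenever core i's request to 'bank' accesses 'row'; other cores'
    // accesses to the bank do not touch core i's SRAR.
    void on_access(int core, int bank, int64_t row) { srar_[core][bank] = row; }

    // Called when core i's request to 'bank' for 'row' misses in the row
    // buffer while another core's row is open. Returns true if the SRAR says
    // the access would have been a row-buffer hit had core i run alone,
    // i.e., inter-core row-buffer interference is detected.
    bool is_row_interference(int core, int bank, int64_t row) const {
        return srar_[core][bank] == row;
    }

private:
    static constexpr int64_t kNoRow = -1;
    std::vector<std::vector<int64_t>> srar_;
};
```

On detection, FST would set the corresponding InterferencePerCore bit as described above and reset it once the request is serviced.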

3.4 System Software Support
Different Fairness Objectives: System-level fairness objectives and policies are generally decided by the system software (the operating system or virtual machine monitor). FST is intended as architectural support for enforcing such policies in shared memory system resources. The fairness goal to be achieved by FST can be configured by system software. To achieve this, we enable the system software to determine the nature of the condition that triggers Algorithm 1. In the explanations of Section 3.2, the triggering condition is

Estimated Unfairness > Unfairness Threshold

System software might want to enforce different triggering conditions depending on the system's fairness/QoS requirements. To enable this capability, FST implements different triggering conditions from which the system software can choose. For example, the fairness goal the system software wants to achieve could be to keep the maximum slowdown of any application below a threshold value. To enforce such a goal, the system software can configure FST such that the triggering condition in Algorithm 1 is changed to

Estimated Slowdown_i > Max. Slowdown Threshold

Thread Weights: So far, we have assumed all threads are of equal importance. FST can be seamlessly adjusted to distinguish between and provide differentiated services to threads with different priorities. We add the notion of thread weights to FST, which are communicated to it by the system software using special instructions. Higher slowdown values are more tolerable for less important or lower-weight threads. To incorporate thread weights, FST uses weighted slowdown values calculated as:

WeightedSlowdown_i = Measured Slowdown_i x Weight_i

By scaling the real slowdown of a thread with its weight, a thread with a higher weight appears as if it slowed down more than it really did, causing it to be favored by FST. Section 5.4 quantitatively evaluates FST with the above fairness goal and threads with different weights.

Thread Migration and Context Switches: FST can be seamlessly extended to work in the presence of thread migration and context switches. When a context switch happens or a thread is migrated, the interference state related to that thread is cleared. When a thread restarts executing after a context switch or migration, it starts at maximum throttle. The interference caused by the thread and the interference it suffers are dynamically re-estimated and FST adapts to the new set of co-executing applications.

3.5 General Dynamic Request Throttling
Scalability to More Cores: When the number of cores is greater than two, a more general form of Algorithm 1 is used. The design of the unfairness evaluation component for the more general form of Algorithm 1 is also slightly different. For each core i, FST maintains a set of N-1 counters, where N is the number of simultaneously running applications. We refer to these N-1 counters that FST uses to keep track of the amount of the inter-core interference caused by any other core j in the system for core i as ExcessCycles_ij. This information is used to identify which of the other applications in the system generates the most interference for core i. FST maintains the total inter-core interference an application on core i experiences in a TotalExcessCycles_i counter per core. Algorithm 3 shows the generalized form of Algorithm 1 that uses this extra information to make more accurate throttling decisions in a system with more than two cores.
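The configurability described in Section 3.4 amounts to a small amount of policy logic; the following C++ sketch shows one plausible encoding of the two triggering conditions and the weighted-slowdown scaling. The enum, struct, and function names are illustrative, not an interface defined by the paper.

```cpp
#include <algorithm>
#include <vector>

// Fairness objectives the system software can select (Section 3.4).
enum class Objective { kUnfairnessRatio, kMaxSlowdown };

struct Policy {
    Objective objective = Objective::kUnfairnessRatio;
    double unfairness_threshold = 0.0;   // used with kUnfairnessRatio
    double max_slowdown_threshold = 0.0; // used with kMaxSlowdown
};

// Weighted slowdown: WeightedSlowdown_i = MeasuredSlowdown_i * Weight_i, so a
// higher-weight (more important) thread appears more slowed down and is
// favored by the throttling algorithm.
inline double weighted_slowdown(double measured_slowdown, double weight) {
    return measured_slowdown * weight;
}

// Returns true if the throttling algorithm should be triggered this interval.
bool should_trigger(const Policy& p, const std::vector<double>& slowdowns) {
    const double max_s = *std::max_element(slowdowns.begin(), slowdowns.end());
    if (p.objective == Objective::kMaxSlowdown)
        return max_s > p.max_slowdown_threshold;
    const double min_s = *std::min_element(slowdowns.begin(), slowdowns.end());
    return (max_s / min_s) > p.unfairness_threshold;
}
```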
The two most important changes are as follows. First, when the algorithm is triggered due to unfair slowdown of core i, FST compares the ExcessCycles_ij counter values for all cores j != i to determine which other core is interfering most with core i. The core found to be the most interfering is throttled down. We do this in order to reduce the slowdown of the core with the largest slowdown value, and improve system fairness. Second, cores that are neither the core with the largest slowdown (App_slow) nor the core generating the most interference (App_interfering) for the core with the largest slowdown are throttled up every threshold1 intervals. This is a performance optimization that allows cores to be aggressive if they are not the main contributors to the unfairness in the system.

Preventing Bank Service Denial due to FR-FCFS Memory Scheduling: First-ready first-come first-serve (FR-FCFS) [27] is a commonly used memory scheduling policy which we use in our baseline system. This memory scheduling policy has the potential to starve an application with no row-buffer locality in the presence of an application with high row-buffer locality (as discussed in prior work [21-24]). Even when the interfering application is throttled down, the potential for continued DRAM bank interference exists when FR-FCFS memory scheduling is used, due to the greedy row-hit-first nature of the scheduling algorithm: a throttled-down application with high row-buffer locality can deny service to another application continuously. To overcome this, we supplement FST with a heuristic that prevents this denial of service. Once an application has already been throttled down lower than Switch_thr%, if FST detects that this throttled application is generating greater than Interference_thr% of App_slow's total interference, it will temporarily stop prioritizing the interfering application based on row-buffer hit status in the memory controller.

Algorithm 3 Dynamic Request Throttling - General Form
if Estimated Unfairness > Unfairness Threshold then
    Throttle down application that causes most interference (App_interfering) for application with largest slowdown
    Throttle up application with the largest slowdown (App_slow)
    Reset Successive Fairness Achieved Intervals
    Reset Intervals To Wait To Throttle Up for App_interfering
    // Preventing bank service denial
    if App_interfering throttled lower than Switch_thr AND causes greater than Interference_thr amount of App_slow's total interference then
        Temporarily stop prioritizing App_interfering due to row hits in memory controller
    if App_RowHitNotPrioritized has not been App_interfering for SwitchBack_thr intervals then
        Allow it to be prioritized in memory controller based on row-buffer hit status of its requests
    for all applications except App_interfering and App_slow do
        if Intervals To Wait To Throttle Up = threshold1 then
            throttle up
            Reset Intervals To Wait To Throttle Up for this app.
        else
            Increment Intervals To Wait To Throttle Up for this app.
    end for
else if Successive Fairness Achieved Intervals = threshold2 then
    Throttle up application with the smallest slowdown
    Reset Successive Fairness Achieved Intervals
else
    Increment Successive Fairness Achieved Intervals
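As a small illustration of the first change above, the sketch below picks App_slow and App_interfering from the per-pair ExcessCycles counters. It is a minimal sketch with illustrative names, assuming the counters are collected as described in Section 3.3.

```cpp
#include <cstdint>
#include <vector>

// excess[i][j]: ExcessCycles_ij, the interference core j caused core i during
// the last interval (entries with j == i are unused and expected to be zero).
struct ThrottlingTargets {
    int app_slow;        // core with the largest slowdown
    int app_interfering; // core that interfered most with app_slow
};

ThrottlingTargets pick_targets(const std::vector<double>& slowdowns,
                               const std::vector<std::vector<uint64_t>>& excess) {
    // App_slow: the application with the largest estimated slowdown.
    int slow = 0;
    for (int i = 1; i < static_cast<int>(slowdowns.size()); ++i)
        if (slowdowns[i] > slowdowns[slow]) slow = i;

    // App_interfering: the core j whose ExcessCycles_(slow, j) is largest.
    int interfering = (slow == 0) ? 1 : 0;
    for (int j = 0; j < static_cast<int>(excess[slow].size()); ++j) {
        if (j == slow) continue;
        if (excess[slow][j] > excess[slow][interfering]) interfering = j;
    }
    return {slow, interfering};
}
```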

We refer to this application as App_RowHitNotPrioritized. If App_RowHitNotPrioritized has not been the most interfering application for SwitchBack_thr number of intervals, its prioritization over other applications based on row-buffer hit status will be re-allowed in the memory controller. This is done because if an application with high row-buffer locality is not allowed to take advantage of row-buffer hits for a long time, its performance will suffer.

4. Methodology
Processor Model: We use an in-house cycle-accurate x86 CMP simulator for our evaluation. We faithfully model all port contention, queuing effects, bank conflicts, and other major DDR3 DRAM system constraints in the memory subsystem. Table 1 shows the baseline configuration of each core and the shared resource configuration for the 2- and 4-core CMP systems we use.

Table 1: Baseline system configuration
Execution Core: 6.6 GHz out-of-order processor, 15 stages; decode/retire up to 4 instructions; issue/execute up to 8 micro instructions; 256-entry reorder buffer
Front End: fetch up to 2 branches; 4K-entry BTB; 64K-entry hybrid branch predictor
On-chip Caches: L1 I-cache: 32KB, 4-way, 2-cycle, 64B line; L1 D-cache: 32KB, 4-way, 2-cycle, 64B line; shared unified L2: 1MB (2MB for 4-core), 8-way (16-way for 4-core), 16-bank, 15-cycle (20-cycle for 4-core), 1 port, 64B line size
DRAM Controller: on-chip, FR-FCFS scheduling policy [27]; 128-entry MSHR and memory request buffer
DRAM and Bus: 667MHz bus cycle, DDR3 1333MHz [20]; 8B-wide data bus; latency: ns (tRP-tRCD-CL); 8 DRAM banks, 16KB row buffer per bank; round-trip L2 miss latency: row-buffer hit 36ns, conflict 66ns

Workloads: We use the SPEC CPU 2000/2006 benchmarks for our evaluation. Each benchmark was compiled using ICC (Intel C Compiler) or IFORT (Intel Fortran Compiler) with the -O3 option. We ran each benchmark with the reference input set for 200 million x86 instructions selected by Pinpoints [26] as a representative portion for the 2-core experiments. Due to long simulation times, 4-core experiments were done with 50 million instructions per benchmark.

We classify benchmarks as highly memory-intensive, of medium memory intensity, or non-intensive for our analyses and workload selection. We refer to a benchmark as highly memory-intensive if its L2 Cache Misses per 1K Instructions (MPKI) is greater than ten. If the MPKI value is greater than one but less than ten, we say the benchmark has medium memory intensity. If the MPKI value is less than one, we refer to it as non-intensive. This classification is based on measurements made when each benchmark was run alone on the 2-core system. Table 2 shows the characteristics of some (due to space limitations) of the benchmarks that appear in the evaluated workloads when run on the 2-core system.

Workload Selection: We used 18 two-application and 10 four-application multi-programmed workloads for our 2-core and 4-core evaluations respectively. The 2-core workloads were chosen such that at least one of the benchmarks is highly memory-intensive. For this purpose we used either art from SPEC2000 or lbm from SPEC2006.
For the second benchmark of each 2-core workload, applications of different memory intensity were used in order to cover a wide range of different combinations.

[Table 2: Characteristics of 20 SPEC 2000/2006 benchmarks: IPC and MPKI (L2 cache Misses Per 1K Instructions) — art, milc, soplex, leslie3d, lbm, bwaves, GemsFDTD, astar, omnetpp, gcc, zeusmp, cactusADM, bzip2, h264ref, vortex, gromacs, namd, calculix, gamess, povray.]

Of the 18 benchmarks combined with either art or lbm, seven benchmarks have high memory intensity, six have medium intensity, and five have low memory intensity. The ten 4-core workloads were randomly selected with the condition that the evaluated workloads each include at least one benchmark with high memory intensity and at least one benchmark with medium or high memory intensity.

FST parameters used in evaluation: Table 3 shows the values we use in our evaluation unless stated otherwise. There are eight aggressiveness levels used for the request rate of each application: 2%, 3%, 4%, 5%, 10%, 25%, 50% and 100%. These levels denote the scaling of the MSHR quota and the request rate in terms of percentage. For example, when FST throttles an application to 5% of its total request rate on a system with 128 MSHRs, two parameters are adjusted. First, the application is given a 5% quota of the total number of available MSHRs (in this case, 6 MSHRs). Second, the application's memory requests in the MSHRs are issued to access the L2 cache at 5% of the maximum possible frequency (i.e., once every 20 cycles).

[Table 3: FST parameters — Unfairness Threshold, Successive Fairness Achieved Intervals threshold, Intervals To Wait To Throttle Up, Interval Length (Kinsts); Switch_thr = 5%, Interference_thr = 70%, SwitchBack_thr = 3 intervals.]

Metrics: To measure CMP system performance, we use Harmonic mean of Speedups (Hspeedup) [18] and Weighted Speedup (Wspeedup) [28]. These metrics are commonly used in measuring multi-program performance in computer architecture. In order to demonstrate fairness improvements, we report Unfairness (see Section 3.1), as defined in [7, 22]. Since Hspeedup provides a balanced measure between fairness and system throughput as shown in previous work [18], we use it as our primary evaluation metric. In the metric definitions below, N is the number of cores in the CMP system, IPC_alone is the IPC measured when an application runs alone on one core in the CMP system (other cores are idle), and IPC_shared is the IPC measured when an application runs on one core while other applications are running on the other cores:

Hspeedup = N / ( SUM_{i=0}^{N-1} IPC_i^alone / IPC_i^shared ),    Wspeedup = SUM_{i=0}^{N-1} IPC_i^shared / IPC_i^alone

5. Experimental Evaluation
We evaluate our proposed techniques on both 2-core and 4-core systems. We compare FST to four other systems in our evaluations:

1) a baseline system with no fairness techniques employed in the shared memory system, using LRU cache replacement and FR-FCFS memory scheduling [27], both of which have been shown to be unfair [15, 21, 24] — we refer to this baseline as NoFairness; 2) a system with only fair cache capacity management using the virtual private caches technique [25], called FairCache; 3) a system with a network fair queuing (NFQ) fair memory scheduler [24] combined with fair cache capacity management [25], called NFQ+FairCache; 4) a system with a parallelism-aware batch scheduling (PAR-BS) fair memory scheduler [23] combined with fair cache capacity management [25], called PAR-BS+FairCache.

5.1 2-Core System Results
Figure 4 shows system performance and unfairness averaged (using geometric mean) across the 18 workloads evaluated on the 2-core system. Figure 5 shows the Hspeedup performance of FST and other fairness techniques normalized to that of a system without any fairness technique for each of the 18 evaluated 2-core workloads. FST provides the highest system performance (in terms of Hspeedup) and the best unfairness among all evaluated techniques. We make several key observations:

[Figure 4: Average performance of FST on the 2-core system — (a) Unfairness, (b) Hspeedup, (c) Wspeedup for No Fairness Technique, Fair Cache, NFQ + Fair Cache, PAR-BS + Fair Cache, and FST.]

1. Fair caching's unfairness reduction comes at the cost of a large degradation in system performance. This is because fair caching changes the memory access patterns of applications. Since the memory access scheduler is unfair, the fairness benefits of the fair cache itself are reverted by the memory scheduler.

2. NFQ+FairCache together improves system fairness by 3% compared to NoFairness, but degrades Wspeedup by 12.3%. The combination of PAR-BS and fair caching improves both system performance and fairness compared to the combination of NFQ and a fair cache. The main reason is that PAR-BS preserves both DRAM bank parallelism and row-buffer locality of each thread better than NFQ, as shown in previous work [23]. Compared to the baseline with no fairness control, employing PAR-BS and a fair cache reduces unfairness by 41.3% and improves Hspeedup by 11.5%. However, this improvement comes at the expense of a large (7.8%) Wspeedup degradation.

NFQ+FairCache and PAR-BS+FairCache both significantly degrade system throughput (Wspeedup) compared to employing no fairness mechanisms. This is due to two reasons, both of which lead to the delaying of memory non-intensive applications (recall that prioritizing memory non-intensive applications is better for system throughput [23, 24]). First, the fairness mechanisms that are employed separately in each resource interact negatively with each other, leading to one mechanism (e.g., fair caching) increasing the pressure on the other (fair memory scheduling). As a result, even though fair caching might benefit system throughput by giving more resources to a memory non-intensive application, increased misses of the memory-intensive application due to fair caching cause more congestion in the memory system, leading to both the memory-intensive and non-intensive applications being delayed. Second, even though the combination of a fair cache and a fair memory controller can prioritize a memory non-intensive application's requests, this prioritization can be temporary.
The deprioritized memory-intensive application can still fill the shared MSHRs with its requests, thereby denying the non-intensive application entry into the memory system. Hence, the non-intensive application stalls because it cannot inject enough requests into the memory system. As a result, the memory non-intensive application's performance does not improve while the memory-intensive application's performance degrades (due to fair caching), resulting in system throughput degradation.

3. FST reduces system unfairness by 45% while also improving Hspeedup by 16.3%, and degrades Wspeedup by only 4.5% compared to NoFairness. Unlike other fairness mechanisms, FST improves both system performance and fairness, without large degradation to Wspeedup. This is due to two major reasons. First, FST provides a coordinated approach in which both the cache and the memory controller receive less frequent requests from the applications causing unfairness. This reduces the starvation of the applications that are unfairly slowed down as well as interference of requests in the memory system, leading to better system performance for almost all applications. Second, because FST uses MSHR quotas to limit requests injected by memory-intensive applications that cause unfairness, these memory-intensive applications do not deny other applications entry into the memory system. As such, unlike other fairness techniques that do not consider fairness in memory system buffers (e.g., MSHRs), FST ensures that unfairly slowed-down applications are prioritized in the entire memory system, including all the buffers, caches, and schedulers.

Table 4 summarizes our results for the 2-core evaluations. Compared to the previous technique that provides the highest system throughput (i.e., NoFairness), FST provides a significantly better balance between system fairness and performance. Compared to the previous technique that provides the best fairness (PAR-BS+FairCache), FST improves both system performance and fairness. We conclude that FST provides the best system fairness as well as the best balance between system fairness and performance.

Table 4: Summary of results on the 2-core system
                                    Unfairness   Hspeedup   Wspeedup
FST over No Fairness Mechanism        -45%        16.3%      -4.5%
FST over Fair Cache                   -30%        26.2%      12.8%
FST over NFQ + Fair Cache             -21.2%      16%         8.8%
FST over PAR-BS + Fair Cache          -6.3%       4.3%        3.4%

5.2 4-Core System Results
Overall Performance: Figure 6 shows unfairness and system performance averaged across the ten evaluated 4-core workloads. FST provides the best fairness and Hspeedup among all evaluated techniques, while providing Wspeedup that is equivalent to that of the best previous technique. Overall, FST reduces unfairness by 44.4% (footnote 6) and increases system performance by 25.6% (Hspeedup) and 2.6% (Wspeedup) compared to NoFairness. Compared to PAR-BS+FairCache, the best performing previous technique, FST reduces unfairness by 36.2% and in-

Footnote 6: Similarly, FST also reduces the coefficient of variation, an alternative unfairness metric, by 45%.


An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches

AB-Aware: Application Behavior Aware Management of Shared Last Level Caches AB-Aware: Application Behavior Aware Management of Shared Last Level Caches Suhit Pai, Newton Singh and Virendra Singh Computer Architecture and Dependable Systems Laboratory Department of Electrical Engineering

More information

Computer Architecture: Memory Interference and QoS (Part I) Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Memory Interference and QoS (Part I) Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Memory Interference and QoS (Part I) Prof. Onur Mutlu Carnegie Mellon University Memory Interference and QoS Lectures These slides are from a lecture delivered at INRIA (July 8,

More information

Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems

Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems Staged Memory Scheduling: Achieving High Performance and Scalability in Heterogeneous Systems Rachata Ausavarungnirun Kevin Kai-Wei Chang Lavanya Subramanian Gabriel H. Loh Onur Mutlu Carnegie Mellon University

More information

A Model for Application Slowdown Estimation in On-Chip Networks and Its Use for Improving System Fairness and Performance

A Model for Application Slowdown Estimation in On-Chip Networks and Its Use for Improving System Fairness and Performance A Model for Application Slowdown Estimation in On-Chip Networks and Its Use for Improving System Fairness and Performance Xiyue Xiang Saugata Ghose Onur Mutlu Nian-Feng Tzeng University of Louisiana at

More information

A Comparison of Capacity Management Schemes for Shared CMP Caches

A Comparison of Capacity Management Schemes for Shared CMP Caches A Comparison of Capacity Management Schemes for Shared CMP Caches Carole-Jean Wu and Margaret Martonosi Princeton University 7 th Annual WDDD 6/22/28 Motivation P P1 P1 Pn L1 L1 L1 L1 Last Level On-Chip

More information

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun

EECS750: Advanced Operating Systems. 2/24/2014 Heechul Yun EECS750: Advanced Operating Systems 2/24/2014 Heechul Yun 1 Administrative Project Feedback of your proposal will be sent by Wednesday Midterm report due on Apr. 2 3 pages: include intro, related work,

More information

Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative

Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative Evaluating STT-RAM as an Energy-Efficient Main Memory Alternative Emre Kültürsay *, Mahmut Kandemir *, Anand Sivasubramaniam *, and Onur Mutlu * Pennsylvania State University Carnegie Mellon University

More information

Filtered Runahead Execution with a Runahead Buffer

Filtered Runahead Execution with a Runahead Buffer Filtered Runahead Execution with a Runahead Buffer ABSTRACT Milad Hashemi The University of Texas at Austin miladhashemi@utexas.edu Runahead execution dynamically expands the instruction window of an out

More information

Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems

Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems Application-to-Core Mapping Policies to Reduce Memory System Interference in Multi-Core Systems Reetuparna Das Rachata Ausavarungnirun Onur Mutlu Akhilesh Kumar Mani Azimi University of Michigan Carnegie

More information

A Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt

A Case for MLP-Aware Cache Replacement. Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt Moinuddin K. Qureshi Daniel Lynch Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas 78712-24 TR-HPS-26-3

More information

Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems

Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems Techniques for Bandwidth-Efficient Prefetching of Linked Data Structures in Hybrid Prefetching Systems Depment of Electrical and Computer Engineering The University of Texas at Austin {ebrahimi, patt}@ece.utexas.edu

More information

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez Summary Read misses are more cri?cal than write misses

More information

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories

DEMM: a Dynamic Energy-saving mechanism for Multicore Memories DEMM: a Dynamic Energy-saving mechanism for Multicore Memories Akbar Sharifi, Wei Ding 2, Diana Guttman 3, Hui Zhao 4, Xulong Tang 5, Mahmut Kandemir 5, Chita Das 5 Facebook 2 Qualcomm 3 Intel 4 University

More information

Bias Scheduling in Heterogeneous Multi-core Architectures

Bias Scheduling in Heterogeneous Multi-core Architectures Bias Scheduling in Heterogeneous Multi-core Architectures David Koufaty Dheeraj Reddy Scott Hahn Intel Labs {david.a.koufaty, dheeraj.reddy, scott.hahn}@intel.com Abstract Heterogeneous architectures that

More information

Performance-Aware Speculation Control Using Wrong Path Usefulness Prediction. Chang Joo Lee Hyesoon Kim Onur Mutlu Yale N. Patt

Performance-Aware Speculation Control Using Wrong Path Usefulness Prediction. Chang Joo Lee Hyesoon Kim Onur Mutlu Yale N. Patt Performance-Aware Speculation Control Using Wrong Path Usefulness Prediction Chang Joo Lee Hyesoon Kim Onur Mutlu Yale N. Patt High Performance Systems Group Department of Electrical and Computer Engineering

More information

Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems

Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems 1 Cache Friendliness-aware Management of Shared Last-level Caches for High Performance Multi-core Systems Dimitris Kaseridis, Member, IEEE, Muhammad Faisal Iqbal, Student Member, IEEE and Lizy Kurian John,

More information

Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads

Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads Continuous Runahead: Transparent Hardware Acceleration for Memory Intensive Workloads Milad Hashemi, Onur Mutlu, Yale N. Patt The University of Texas at Austin ETH Zürich ABSTRACT Runahead execution pre-executes

More information

High Performance Memory Requests Scheduling Technique for Multicore Processors

High Performance Memory Requests Scheduling Technique for Multicore Processors High Performance Memory Requests Scheduling Technique for Multicore Processors by Walid Ahmed Mohamed El-Reedy A Thesis Submitted to the Faculty of Engineering at Cairo University In Partial Fulfillment

More information

Sandbox Based Optimal Offset Estimation [DPC2]

Sandbox Based Optimal Offset Estimation [DPC2] Sandbox Based Optimal Offset Estimation [DPC2] Nathan T. Brown and Resit Sendag Department of Electrical, Computer, and Biomedical Engineering Outline Motivation Background/Related Work Sequential Offset

More information

Predicting Performance Impact of DVFS for Realistic Memory Systems

Predicting Performance Impact of DVFS for Realistic Memory Systems Predicting Performance Impact of DVFS for Realistic Memory Systems Rustam Miftakhutdinov Eiman Ebrahimi Yale N. Patt The University of Texas at Austin Nvidia Corporation {rustam,patt}@hps.utexas.edu ebrahimi@hps.utexas.edu

More information

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines

CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines CS377P Programming for Performance Single Thread Performance Out-of-order Superscalar Pipelines Sreepathi Pai UTCS September 14, 2015 Outline 1 Introduction 2 Out-of-order Scheduling 3 The Intel Haswell

More information

A Bandwidth-aware Memory-subsystem Resource Management using. Non-invasive Resource Profilers for Large CMP Systems

A Bandwidth-aware Memory-subsystem Resource Management using. Non-invasive Resource Profilers for Large CMP Systems A Bandwidth-aware Memory-subsystem Resource Management using Non-invasive Resource Profilers for Large CMP Systems Dimitris Kaseridis, Jeffrey Stuecheli, Jian Chen and Lizy K. John Department of Electrical

More information

A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b

A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 5th International Conference on Advanced Materials and Computer Science (ICAMCS 2016) A task migration algorithm for power management on heterogeneous multicore Manman Peng1, a, Wen Luo1, b 1 School of

More information

Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems

Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems Memory Performance Attacks: Denial of Memory Service in Multi-Core Systems Thomas Moscibroda Onur Mutlu Microsoft Research {moscitho,onur}@microsoft.com Abstract We are entering the multi-core era in computer

More information

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Journal of Instruction-Level Parallelism 13 (11) 1-14 Submitted 3/1; published 1/11 Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Santhosh Verma

More information

Scheduling the Intel Core i7

Scheduling the Intel Core i7 Third Year Project Report University of Manchester SCHOOL OF COMPUTER SCIENCE Scheduling the Intel Core i7 Ibrahim Alsuheabani Degree Programme: BSc Software Engineering Supervisor: Prof. Alasdair Rawsthorne

More information

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1

Memory Performance Characterization of SPEC CPU2006 Benchmarks Using TSIM1 Available online at www.sciencedirect.com Physics Procedia 33 (2012 ) 1029 1035 2012 International Conference on Medical Physics and Biomedical Engineering Memory Performance Characterization of SPEC CPU2006

More information

1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722

1993. (BP-2) (BP-5, BP-10) (BP-6, BP-10) (BP-7, BP-10) YAGS (BP-10) EECC722 Dynamic Branch Prediction Dynamic branch prediction schemes run-time behavior of branches to make predictions. Usually information about outcomes of previous occurrences of branches are used to predict

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

Memory Systems in the Many-Core Era: Some Challenges and Solution Directions. Onur Mutlu June 5, 2011 ISMM/MSPC

Memory Systems in the Many-Core Era: Some Challenges and Solution Directions. Onur Mutlu   June 5, 2011 ISMM/MSPC Memory Systems in the Many-Core Era: Some Challenges and Solution Directions Onur Mutlu http://www.ece.cmu.edu/~omutlu June 5, 2011 ISMM/MSPC Modern Memory System: A Shared Resource 2 The Memory System

More information

Perceptron Learning for Reuse Prediction

Perceptron Learning for Reuse Prediction Perceptron Learning for Reuse Prediction Elvira Teran Zhe Wang Daniel A. Jiménez Texas A&M University Intel Labs {eteran,djimenez}@tamu.edu zhe2.wang@intel.com Abstract The disparity between last-level

More information

Thomas Moscibroda Microsoft Research. Onur Mutlu CMU

Thomas Moscibroda Microsoft Research. Onur Mutlu CMU Thomas Moscibroda Microsoft Research Onur Mutlu CMU CPU+L1 CPU+L1 CPU+L1 CPU+L1 Multi-core Chip Cache -Bank Cache -Bank Cache -Bank Cache -Bank CPU+L1 CPU+L1 CPU+L1 CPU+L1 Accelerator, etc Cache -Bank

More information

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS

Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Advanced Topics UNIT 2 PERFORMANCE EVALUATIONS Structure Page Nos. 2.0 Introduction 4 2. Objectives 5 2.2 Metrics for Performance Evaluation 5 2.2. Running Time 2.2.2 Speed Up 2.2.3 Efficiency 2.3 Factors

More information

Optimizing Datacenter Power with Memory System Levers for Guaranteed Quality-of-Service

Optimizing Datacenter Power with Memory System Levers for Guaranteed Quality-of-Service Optimizing Datacenter Power with Memory System Levers for Guaranteed Quality-of-Service * Kshitij Sudan* Sadagopan Srinivasan Rajeev Balasubramonian* Ravi Iyer Executive Summary Goal: Co-schedule N applications

More information

DynRBLA: A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories

DynRBLA: A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories SAFARI Technical Report No. 2-5 (December 6, 2) : A High-Performance and Energy-Efficient Row Buffer Locality-Aware Caching Policy for Hybrid Memories HanBin Yoon hanbinyoon@cmu.edu Justin Meza meza@cmu.edu

More information

15-740/ Computer Architecture, Fall 2011 Midterm Exam II

15-740/ Computer Architecture, Fall 2011 Midterm Exam II 15-740/18-740 Computer Architecture, Fall 2011 Midterm Exam II Instructor: Onur Mutlu Teaching Assistants: Justin Meza, Yoongu Kim Date: December 2, 2011 Name: Instructions: Problem I (69 points) : Problem

More information

Improving DRAM Performance by Parallelizing Refreshes with Accesses

Improving DRAM Performance by Parallelizing Refreshes with Accesses Improving DRAM Performance by Parallelizing Refreshes with Accesses Kevin Chang Donghyuk Lee, Zeshan Chishti, Alaa Alameldeen, Chris Wilkerson, Yoongu Kim, Onur Mutlu Executive Summary DRAM refresh interferes

More information

A Fast Instruction Set Simulator for RISC-V

A Fast Instruction Set Simulator for RISC-V A Fast Instruction Set Simulator for RISC-V Maxim.Maslov@esperantotech.com Vadim.Gimpelson@esperantotech.com Nikita.Voronov@esperantotech.com Dave.Ditzel@esperantotech.com Esperanto Technologies, Inc.

More information

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A.

Improving Cache Performance by Exploi7ng Read- Write Disparity. Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Improving Cache Performance by Exploi7ng Read- Write Disparity Samira Khan, Alaa R. Alameldeen, Chris Wilkerson, Onur Mutlu, and Daniel A. Jiménez Summary Read misses are more cri?cal than write misses

More information

Scalable Many-Core Memory Systems Topic 3: Memory Interference and QoS-Aware Memory Systems

Scalable Many-Core Memory Systems Topic 3: Memory Interference and QoS-Aware Memory Systems Scalable Many-Core Memory Systems Topic 3: Memory Interference and QoS-Aware Memory Systems Prof. Onur Mutlu http://www.ece.cmu.edu/~omutlu onur@cmu.edu HiPEAC ACACES Summer School 2013 July 15-19, 2013

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department SIP: Speculative Insertion Policy for High Performance Caching Hongil Yoon Tan Zhang Mikko H. Lipasti Technical Report #1676 June 2010 SIP: Speculative Insertion Policy for

More information

Limiting the Number of Dirty Cache Lines

Limiting the Number of Dirty Cache Lines Limiting the Number of Dirty Cache Lines Pepijn de Langen and Ben Juurlink Computer Engineering Laboratory Faculty of Electrical Engineering, Mathematics and Computer Science Delft University of Technology

More information

Exploiting Core Criticality for Enhanced GPU Performance

Exploiting Core Criticality for Enhanced GPU Performance Exploiting Core Criticality for Enhanced GPU Performance Adwait Jog, Onur Kayıran, Ashutosh Pattnaik, Mahmut T. Kandemir, Onur Mutlu, Ravishankar Iyer, Chita R. Das. SIGMETRICS 16 Era of Throughput Architectures

More information

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1)

Lecture 11: SMT and Caching Basics. Today: SMT, cache access basics (Sections 3.5, 5.1) Lecture 11: SMT and Caching Basics Today: SMT, cache access basics (Sections 3.5, 5.1) 1 Thread-Level Parallelism Motivation: a single thread leaves a processor under-utilized for most of the time by doubling

More information

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004

ABSTRACT STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS. Chungsoo Lim, Master of Science, 2004 ABSTRACT Title of thesis: STRATEGIES FOR ENHANCING THROUGHPUT AND FAIRNESS IN SMT PROCESSORS Chungsoo Lim, Master of Science, 2004 Thesis directed by: Professor Manoj Franklin Department of Electrical

More information

Improving Real-Time Performance on Multicore Platforms Using MemGuard

Improving Real-Time Performance on Multicore Platforms Using MemGuard Improving Real-Time Performance on Multicore Platforms Using MemGuard Heechul Yun University of Kansas 2335 Irving hill Rd, Lawrence, KS heechul@ittc.ku.edu Abstract In this paper, we present a case-study

More information

NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems

NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems NightWatch: Integrating Transparent Cache Pollution Control into Dynamic Memory Allocation Systems Rentong Guo 1, Xiaofei Liao 1, Hai Jin 1, Jianhui Yue 2, Guang Tan 3 1 Huazhong University of Science

More information

A Row Buffer Locality-Aware Caching Policy for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu

A Row Buffer Locality-Aware Caching Policy for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu A Row Buffer Locality-Aware Caching Policy for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu Overview Emerging memories such as PCM offer higher density than

More information

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Hyesoon Kim Onur Mutlu Jared Stark David N. Armstrong Yale N. Patt High Performance Systems Group Department

More information

Performance Oriented Prefetching Enhancements Using Commit Stalls

Performance Oriented Prefetching Enhancements Using Commit Stalls Journal of Instruction-Level Parallelism 13 (2011) 1-28 Submitted 10/10; published 3/11 Performance Oriented Prefetching Enhancements Using Commit Stalls R Manikantan R Govindarajan Indian Institute of

More information

Row Buffer Locality Aware Caching Policies for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu

Row Buffer Locality Aware Caching Policies for Hybrid Memories. HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu Row Buffer Locality Aware Caching Policies for Hybrid Memories HanBin Yoon Justin Meza Rachata Ausavarungnirun Rachael Harding Onur Mutlu Executive Summary Different memory technologies have different

More information

MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect

MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect SAFARI Technical Report No. 211-8 (September 13, 211) MinBD: Minimally-Buffered Deflection Routing for Energy-Efficient Interconnect Chris Fallin Gregory Nazario Xiangyao Yu cfallin@cmu.edu gnazario@cmu.edu

More information

Computer Architecture Lecture 22: Memory Controllers. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 3/25/2015

Computer Architecture Lecture 22: Memory Controllers. Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 3/25/2015 18-447 Computer Architecture Lecture 22: Memory Controllers Prof. Onur Mutlu Carnegie Mellon University Spring 2015, 3/25/2015 Number of Students Lab 4 Grades 20 15 10 5 0 40 50 60 70 80 90 Mean: 89.7

More information

Case Study 1: Optimizing Cache Performance via Advanced Techniques

Case Study 1: Optimizing Cache Performance via Advanced Techniques 6 Solutions to Case Studies and Exercises Chapter 2 Solutions Case Study 1: Optimizing Cache Performance via Advanced Techniques 2.1 a. Each element is 8B. Since a 64B cacheline has 8 elements, and each

More information

Improving Virtual Machine Scheduling in NUMA Multicore Systems

Improving Virtual Machine Scheduling in NUMA Multicore Systems Improving Virtual Machine Scheduling in NUMA Multicore Systems Jia Rao, Xiaobo Zhou University of Colorado, Colorado Springs Kun Wang, Cheng-Zhong Xu Wayne State University http://cs.uccs.edu/~jrao/ Multicore

More information

15-740/ Computer Architecture Lecture 5: Project Example. Jus%n Meza Yoongu Kim Fall 2011, 9/21/2011

15-740/ Computer Architecture Lecture 5: Project Example. Jus%n Meza Yoongu Kim Fall 2011, 9/21/2011 15-740/18-740 Computer Architecture Lecture 5: Project Example Jus%n Meza Yoongu Kim Fall 2011, 9/21/2011 Reminder: Project Proposals Project proposals due NOON on Monday 9/26 Two to three pages consisang

More information

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors

Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Efficient Physical Register File Allocation with Thread Suspension for Simultaneous Multi-Threading Processors Wenun Wang and Wei-Ming Lin Department of Electrical and Computer Engineering, The University

More information

Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling

Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling Quest for High-Performance Bufferless NoCs with Single-Cycle Express Paths and Self-Learning Throttling Bhavya K. Daya, Li-Shiuan Peh, Anantha P. Chandrakasan Dept. of Electrical Engineering and Computer

More information

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1

Chapter 03. Authors: John Hennessy & David Patterson. Copyright 2011, Elsevier Inc. All rights Reserved. 1 Chapter 03 Authors: John Hennessy & David Patterson Copyright 2011, Elsevier Inc. All rights Reserved. 1 Figure 3.3 Comparison of 2-bit predictors. A noncorrelating predictor for 4096 bits is first, followed

More information

Performance-Aware Speculation Control using Wrong Path Usefulness Prediction

Performance-Aware Speculation Control using Wrong Path Usefulness Prediction Performance-Aware Speculation Control using Wrong Path Usefulness Prediction Chang Joo Lee Hyesoon Kim Onur Mutlu Yale N. Patt Department of Electrical and Computer Engineering The University of Texas

More information

Computer Architecture

Computer Architecture Computer Architecture Lecture 1: Introduction and Basics Dr. Ahmed Sallam Suez Canal University Spring 2016 Based on original slides by Prof. Onur Mutlu I Hope You Are Here for This Programming How does

More information

A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach

A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach A Heterogeneous Multiple Network-On-Chip Design: An Application-Aware Approach Asit K. Mishra Onur Mutlu Chita R. Das Executive summary Problem: Current day NoC designs are agnostic to application requirements

More information

Dynamic Branch Prediction

Dynamic Branch Prediction #1 lec # 6 Fall 2002 9-25-2002 Dynamic Branch Prediction Dynamic branch prediction schemes are different from static mechanisms because they use the run-time behavior of branches to make predictions. Usually

More information

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization

Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Rethinking On-chip DRAM Cache for Simultaneous Performance and Energy Optimization Fazal Hameed and Jeronimo Castrillon Center for Advancing Electronics Dresden (cfaed), Technische Universität Dresden,

More information

Improving Writeback Efficiency with Decoupled Last-Write Prediction

Improving Writeback Efficiency with Decoupled Last-Write Prediction Improving Writeback Efficiency with Decoupled Last-Write Prediction Zhe Wang Samira M. Khan Daniel A. Jiménez The University of Texas at San Antonio {zhew,skhan,dj}@cs.utsa.edu Abstract In modern DDRx

More information

Spatial Memory Streaming (with rotated patterns)

Spatial Memory Streaming (with rotated patterns) Spatial Memory Streaming (with rotated patterns) Michael Ferdman, Stephen Somogyi, and Babak Falsafi Computer Architecture Lab at 2006 Stephen Somogyi The Memory Wall Memory latency 100 s clock cycles;

More information

High Performance Memory Access Scheduling Using Compute-Phase Prediction and Writeback-Refresh Overlap

High Performance Memory Access Scheduling Using Compute-Phase Prediction and Writeback-Refresh Overlap High Performance Memory Access Scheduling Using Compute-Phase Prediction and Writeback-Refresh Overlap Yasuo Ishii Kouhei Hosokawa Mary Inaba Kei Hiraki The University of Tokyo, 7-3-1, Hongo Bunkyo-ku,

More information

Investigating the Viability of Bufferless NoCs in Modern Chip Multi-processor Systems

Investigating the Viability of Bufferless NoCs in Modern Chip Multi-processor Systems Investigating the Viability of Bufferless NoCs in Modern Chip Multi-processor Systems Chris Craik craik@cmu.edu Onur Mutlu onur@cmu.edu Computer Architecture Lab (CALCM) Carnegie Mellon University SAFARI

More information

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero

Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero Efficient Runahead Threads Tanausú Ramírez Alex Pajuelo Oliverio J. Santana Onur Mutlu Mateo Valero The Nineteenth International Conference on Parallel Architectures and Compilation Techniques (PACT) 11-15

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

15-740/ Computer Architecture Lecture 28: Prefetching III and Control Flow. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 11/28/11

15-740/ Computer Architecture Lecture 28: Prefetching III and Control Flow. Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 11/28/11 15-740/18-740 Computer Architecture Lecture 28: Prefetching III and Control Flow Prof. Onur Mutlu Carnegie Mellon University Fall 2011, 11/28/11 Announcements for This Week December 2: Midterm II Comprehensive

More information

Footprint-based Locality Analysis

Footprint-based Locality Analysis Footprint-based Locality Analysis Xiaoya Xiang, Bin Bao, Chen Ding University of Rochester 2011-11-10 Memory Performance On modern computer system, memory performance depends on the active data usage.

More information

Simultaneous Multithreading: a Platform for Next Generation Processors

Simultaneous Multithreading: a Platform for Next Generation Processors Simultaneous Multithreading: a Platform for Next Generation Processors Paulo Alexandre Vilarinho Assis Departamento de Informática, Universidade do Minho 4710 057 Braga, Portugal paulo.assis@bragatel.pt

More information

Bottleneck Identification and Scheduling in Multithreaded Applications. José A. Joao M. Aater Suleman Onur Mutlu Yale N. Patt

Bottleneck Identification and Scheduling in Multithreaded Applications. José A. Joao M. Aater Suleman Onur Mutlu Yale N. Patt Bottleneck Identification and Scheduling in Multithreaded Applications José A. Joao M. Aater Suleman Onur Mutlu Yale N. Patt Executive Summary Problem: Performance and scalability of multithreaded applications

More information

18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II. Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012

18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II. Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012 18-447: Computer Architecture Lecture 23: Tolerating Memory Latency II Prof. Onur Mutlu Carnegie Mellon University Spring 2012, 4/18/2012 Reminder: Lab Assignments Lab Assignment 6 Implementing a more

More information

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas

High Performance Systems Group Department of Electrical and Computer Engineering The University of Texas at Austin Austin, Texas Diverge-Merge Processor (DMP): Dynamic Predicated Execution of Complex Control-Flow Graphs Based on Frequently Executed Paths yesoon Kim José. Joao Onur Mutlu Yale N. Patt igh Performance Systems Group

More information