PERFORMANCE AND ENERGY MODELING OF HETEROGENEOUS MANY-CORE ARCHITECTURES


Rui Pedro Gaspar Pinheiro

Abstract: Advances in processor design have recently pushed for the development of heterogeneous processors, in order to tackle the power and memory walls. However, fully exploiting the heterogeneity of many-core processing systems requires efficient scheduling mechanisms, in order to anticipate the performance and energy gains due to the migration of an application from one core to another. Although multiple approaches have been proposed to model such gains, they are generally limited in the supported levels of heterogeneity and can only be applied to a small subset of cores. Hence, a new methodology is herein proposed for performance and energy modeling of highly-heterogeneous many-core processors, specifically developed with the purpose of enabling system schedulers to make near-optimal decisions. The proposed approach relies on a Linear Regression Model and on a set of specifically devised regressors that are highly correlated with several micro-architectural parameters of modern in-order and out-of-order processor architectures, including the cache size, issue width, reorder buffer size and load/store queue sizes. To evaluate the proposed methodology, a set of 500 different core-architectures was simulated and their performance and energy consumption was modeled. Experimental results show that all models exhibit a high prediction accuracy, attaining coefficients of determination as high as 0.95. Moreover, when applied to the scheduling of applications on a simulated big.LITTLE system, the devised models allow the scheduler to correctly identify the optimal core-architecture in the large majority of cases, leading to power-delay product differences of less than 1% compared with an oracle scheduling solution.
Index Terms: Single-ISA heterogeneous systems, many-core processors, performance and energy modeling, linear regression model, thread scheduling

1 INTRODUCTION

Advances in processor design have recently pushed for the development of heterogeneous processors, in order to tackle the power and memory walls. In particular, by relying on appropriate and different core architectures, it is possible to efficiently leverage Memory-Level Parallelism and Instruction-Level Parallelism [1], [2], such as to minimize power and energy consumption with a reduced performance loss. Notably, driven by the introduction of the ARM big.LITTLE heterogeneous processor [3] (although not exclusively), intensive research has recently been put forth in the development and exploitation of heterogeneous processor systems composed of multiple single-ISA (Instruction Set Architecture) small in-order and big out-of-order cores. However, exploiting such heterogeneity often requires the development of efficient scheduling mechanisms, in order to anticipate the performance and power consumption gains or losses due to the migration of an application from one core to another, or to the morphing of a given core, which can be achieved by means of clock/power gating or by relying on reconfigurable technologies. As a result, a considerable amount of the research effort into heterogeneous systems concerns real-time performance and energy consumption models, specifically to predict the behavior of applications over different core architectures. Although multiple such approaches have been developed in the past which are effective for some use cases, they are sub-optimal in the general case. Taking into account the future developments in many-core heterogeneous processors or reconfigurable systems with a large number of different architectures, accurate performance and energy consumption estimation models are highly required.
In addition, the exact core architectures available in a system may vary considerably from system to system, such that an architecture-agnostic method for creating such models would be extremely valuable. The objectives of this work are: 1) to develop an architecture-agnostic method of deriving accurate on-line performance and energy estimation models, without the identified issues and limitations of existing approaches; 2) to derive a set of performance and energy estimation models that covers many common architectural parameters; 3) to show that the resulting models are highly accurate for the considered range of architectural parameters and can be used in the context of scheduling solutions. The rest of this work is organized as follows. Section 2 describes the state of the art in heterogeneous and reconfigurable architectures, as well as in performance and energy prediction models. A detailed description of the proposed modeling methodology is given in Section 3, together with the creation of general-case models covering all previously-mentioned architectural parameters. To thoroughly assess the developed work, Section 4 presents the experimental results and validation of the devised models. Finally, Section 5 concludes the developed work.

2 BACKGROUND

The objective of this work is the development of performance and power models for heterogeneous processors, in

order to allow for a more efficient allocation of running tasks to the available cores. Hence, this section introduces the current state of research into performance and power modeling of heterogeneous or reconfigurable systems.

2.1 Heterogeneous and Reconfigurable Architectures

Heterogeneous processors consist of an aggregate of multiple cores with different architectures and/or instruction sets, each tuned to handle specific workload types. Reconfigurable systems, on the other hand, have the capability to change their architecture (or parts of it) on the fly, adapting in real-time to their current workload. Advances in processor design have pushed for the development of these non-standard types of processors as a means to tackle the power and memory walls by relying on appropriate and different core architectures. In particular, heterogeneous processors with a combination of many small and few large core types have been shown [1], [2] to be able to outperform homogeneous processors in both execution speed (by having a larger number of smaller cores) and power savings (by having the least possible hardware sitting idle at any time). However, leveraging such architectures requires the adoption of intelligent task-to-core allocation strategies, in particular by analyzing the application's Memory-Level Parallelism (MLP) and Instruction-Level Parallelism (ILP). This conclusion can also be extrapolated to reconfigurable architectures, since they are able to dynamically add or remove ILP and MLP extraction hardware as necessary. These results have led to the introduction of the ARM big.LITTLE heterogeneous processor [3], which includes one high-performance big out-of-order core along with one low-power small in-order core, as well as NVidia's Tegra 4 CPU [4], which contains four big out-of-order cores along with one low-power small in-order core, among others.
In addition, in order to efficiently leverage these types of processors, intensive research has recently been put forth in the exploitation of heterogeneous processor systems composed of multiple single-ISA (Instruction Set Architecture) small in-order and big out-of-order cores.

2.2 Performance and Energy Prediction

While the architectural parameters of the system are known, the application characteristics are harder to measure. These can be estimated at compile-time, but this requires recompilation, which might not be feasible in all situations, most commonly in regards to proprietary software. Alternatively, most current processors are equipped with multiple Hardware Counters that can be configured to measure various runtime statistics (e.g., cycle counts, retired instructions, cache misses), which correspond to interactions between the application and the processor subsystems, and can therefore be leveraged as an indirect way of measuring, at runtime, the variables necessary for the prediction models, without requiring any previous off-line analysis. To tackle the aforementioned issues, Kumar et al. [1], [2] proposed a way to avoid off-line analysis, by relying on a two-stage on-line sampling approach. During the first stage (sampling), applications are permuted over all core architectures in order to obtain a set of multiple per-core statistics, retrieved from hardware counters. In a second stage, these gathered statistics are used to predict which core is better suited to each application. This last approach has one very important issue: the need for a slow on-line sampling/training process, during which the system is running sub-optimally, which becomes worse as the number of different core variations increases. Many attempts have been made to remove this requirement. For example, Craeynest et al.
[5] derived simplified runtime-statistics-based models to estimate performance differences between small in-order and big out-of-order cores, which they call Performance Impact Estimation (PIE), by using runtime statistics, specifically the Cycles Per Instruction (CPI) metric, together with the average dependency distance between producer and consumer instructions. Based on this model, the authors developed a system scheduler able to regularly estimate the performance of all running applications on an alternate core architecture and then decide whether a core switch is worthwhile. While a very innovative approach, PIE still has a number of shortcomings which impede it from fulfilling all our objectives. Not only is PIE constrained to only two core types (small in-order and big out-of-order), but it also only takes into account the differences concerning the Re-Order Buffer size and Issue Width between the source and target processor architectures, which means that changes in other parameters (e.g., the cache hierarchy) are not correctly predicted by the model. Additionally, measuring the base and memory CPI components separately, as required by PIE, can be a complex process. Another important contribution has been laid out by Pricopi et al. [6], who developed performance models specifically for the ARM big.LITTLE processor, also taking into account the cache hierarchies and the branch prediction subsystem. In addition, they used a Linear Regression Model (LRM) based on runtime statistics in order to model its energy consumption. While their approach does take into account possible micro-architectural differences other than the ROB size and issue width, it still only considers two possible core variations at once, constraining its use for many-core heterogeneous architectures, and it also requires off-line analysis.
3 PERFORMANCE AND ENERGY MODELING

In order to better leverage heterogeneity, an architecture-agnostic method of deriving accurate on-line performance and energy estimation models is desirable. Hence, it is necessary for the modeling approach to take into account a large number of core variations, specifically constructed by changing several architectural parameters, with as little overhead as possible, such that the resulting models can be used in the context of real-time system schedulers.

3.1 Choice of Architectural Parameters

One of the objectives of this work is to devise an architecture-agnostic method of creating performance and energy prediction models. However, there is a virtually limitless number of architectural parameters that could be changed between two different core variations. For this reason, it is important to choose a representative set of

common architectural parameters with considerable effects on performance and energy. Accordingly, both in-order and out-of-order architectural classes will be considered, with varying parameters, depicted in Table 1 along with a description of their dominant effects.

TABLE 1
Considered architectural parameters and their dominant effects.

- Re-Order Buffer size (ROB_size): when full, generates structural hazards, leading to stalls at instruction issue.
- Core Width (W): affects the peak instruction throughput at the issue, dispatch and commit stages.
- Load Queue size (LQ_size): when full, generates structural hazards for new load instructions, causing pipeline stalls at issue.
- Store Queue size (SQ_size): when full, generates structural hazards for new store instructions, causing pipeline stalls at issue.
- Cache size (L{1,2,3}_size): affects the cache hit rate, significantly impacting the memory access latency.
- Cache associativity (L{1,2,3}_assoc): affects the cache organization, possibly impacting the memory access latency.

(In order to simplify the analysis and reduce the number of architectures to be simulated, it is assumed that W_issue = W_commit = W.)

The cache sizes and associativities, as well as the Load Queue (LQ)/Store Queue (SQ) size parameters, were chosen because these hardware structures are very common ways of adding MLP extraction capability to a processor, while the Core Width is one of the simplest means of ILP extraction and is used in most modern processors. As for the Re-Order Buffer (ROB) size parameter, it represents the main hardware structure required for processors to execute instructions out-of-order, and as such was chosen for its central role in out-of-order processors.
In particular, due to its importance to out-of-order execution, the ROB size is a parameter analyzed by much existing research [5]-[8], mainly as a way to distinguish in-order from out-of-order cores. The cache sizes are also commonly analyzed [6], [8]-[11], by looking at the cache miss rates (mainly Last-Level Cache (LLC) misses) and their effects on performance. Some previous research also considers the direct effects of the Core Width [5], [7] and of the LQ/SQ size [7], although to a lesser degree.

3.2 Architecture Analysis: Methodology and Setup

In order to evaluate the impact of the different architectural parameters on performance, power and energy, both in-order and out-of-order architectures were simulated using the state-of-the-art Sniper Multi-Core Simulator [12], which provides accurate simulations of a broad range of x86 and x86_64 micro-architectures. To ensure the representativeness of the devised model when considering multiple types of workloads, the PARSEC [13] benchmark suite was chosen for analysis, training and validation. Each workload's initialization and shutdown phases (when the input data is read from disk or the results are written back) are uninteresting from the point of view of architectural analysis, since they depend almost solely on systems outside of the processor's control (e.g., hard drive speed). For this purpose, simulator-specific magic instructions were added to each of the eleven PARSEC benchmarks in order to define the appropriate simulation Region of Interest (ROI) for each benchmark and exclude initialization and shutdown. In order to examine each architectural parameter's effects individually, a sweep over a large range was

TABLE 2
Sweep ranges and sample count for all architectural parameters under consideration.
Columns: Parameter | Default | Sweep range | # Samples. Rows: Issue/Commit Width; ROB size; Load Queue size; Store Queue size; L1 size (KB); L1 associativity; L2 size (KB); L2 associativity; L3 size (KB); L3 associativity.

done for each of the parameters of interest, using a single-core processor with Intel's Gainestown micro-architecture as a base. The default values, as well as the parameter sweep ranges and sample counts, are depicted in Table 2. For the Issue/Commit Width, ROB size and Load/Store Queue sizes, the samples are uniformly distributed over the sweep range. However, due to architectural limits imposed by the simulator, both the cache size and associativity are required to be powers of two, meaning that only powers of two within the sweep range were sampled for the cache parameters (L{1,2,3}_size and L{1,2,3}_assoc). For each of the core variations, the PARSEC benchmarks were executed to completion in single-threaded mode using the predefined small input set, with the pre- and post-ROI sections simulated in fast-forward mode in order to reduce the total processing time. All ROI runtime statistics of interest were measured during simulation and stored for later processing. After execution, the McPAT [14] power framework is used to estimate power consumption, assuming a 45nm transistor technology with a 1.2V supply voltage. The results of this architectural analysis will be used during the rest of this section to justify the model development process.

3.3 Dependent Variables and Regressors

In order to use a LRM, it is necessary to choose the dependent variable and the set of regressors carefully. In particular, it is important to note that, although the in-sample fit of the model increases trivially with the introduction of more regression terms, this leads to an increase in model complexity, potentially constraining its real-time application

in scheduling systems. Too many regression terms also lead to over-fitting conditions, constraining the model's validity on unobserved applications. It is therefore important to pick the minimum number of terms that is able to correctly represent the dominant effects of all architectural parameters of interest.

3.3.1 Choice of Dependent Variables

While the choice between Instructions Per Cycle (IPC) and CPI as the performance metric might at first seem indifferent, since they are the reciprocal of each other, there is in fact an important difference that affects model quality considerably. Because most runtime statistics are normalized by the instruction count in order to equalize their scales and make them easier to work with, and because CPI is already normalized by the instruction count, it becomes an obvious choice of performance metric. Therefore, CPI is used as the LRM dependent variable for the performance model. As for energy consumption, the choice is slightly more difficult. Of course, it is possible to use Energy Per Instruction (EPI), since it is already normalized by the instruction count. However, this is not a good choice, because the time taken by each instruction is both core- and application-dependent, which means the same EPI value can correspond to a vastly different energy consumption if calculated using two different cores and/or workloads, depending on the performance.
However, some simple mathematical manipulation leads to the equation

EPI = E_total / I = PDP / I = (P_avg * t_exec) / I = (P_avg * T * C) / I = EPC * CPI,    (1)

where E_total represents the total energy consumed during workload execution, PDP is the Power-Delay Product, P_avg denotes the average power consumption, t_exec the execution time of the workload, T = 1/f the clock period, C the number of elapsed clock cycles, I the total number of instructions, and EPC (Energy Per Cycle) is the total energy consumed in a single cycle, defined as

EPC = E_total / C = P_avg * T.    (2)

This result quickly leads to the conclusion that EPC is a good choice of energy metric, because it is independent of performance and, unlike EPI, denotes the energy consumption over a fixed time period corresponding to one clock cycle (assuming constant frequency). Additionally, when multiplied by the performance metric CPI (either measured during execution or predicted by the performance model), it results in PDP/I, a value which can be used directly by the system scheduler. It should, however, be taken into account that both the CPI and EPC metrics are always positive, i.e., CPI, EPC ∈ R+, something that the LRM is not aware of. In fact, a LRM assumes that the response variable is normally distributed over the whole real domain [15]. Hence, a LRM-based model will predict negative performance or energy consumption for certain input values, results which make no sense in practice and are therefore highly undesirable.

TABLE 3
Coefficient of determination R^2 of the various transformation steps.
Columns: Figure | R^2.

The simplest way to correct this is by applying a transformation g(.) to the dependent variables, also known as the link function of the Generalized Linear Model (GLM), chosen such that g is continuous and provides the mapping g : R+ -> R. The choice of link function is critical, since it will affect the results considerably.
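As a quick sanity check of Eqs. (1) and (2), the identity EPI = EPC * CPI can be verified numerically. The sketch below uses a purely hypothetical workload (cycle count, instruction count, frequency and total energy are all invented for illustration):

```python
# Numerical check of Eqs. (1)-(2): EPI = E_total/I = EPC * CPI.
# All workload numbers below are hypothetical, chosen only to exercise the identity.

def epi(e_total_j, instructions):
    """Energy Per Instruction, in joules."""
    return e_total_j / instructions

def epc(e_total_j, cycles):
    """Energy Per Cycle, in joules (Eq. 2): EPC = E_total / C = P_avg * T."""
    return e_total_j / cycles

def cpi(cycles, instructions):
    """Cycles Per Instruction."""
    return cycles / instructions

# Hypothetical run: 2e9 cycles at 2 GHz, 1.5e9 retired instructions, 20 J total.
C, I, E = 2.0e9, 1.5e9, 20.0
f = 2.0e9                      # clock frequency (Hz); clock period T = 1/f
P_avg = E / (C / f)            # average power = total energy / execution time

assert abs(epi(E, I) - epc(E, C) * cpi(C, I)) < 1e-15   # EPI = EPC * CPI
assert abs(epc(E, C) - P_avg * (1.0 / f)) < 1e-15       # EPC = P_avg * T
```

Multiplying the model outputs EPC and CPI therefore recovers PDP/I directly, which is the quantity the scheduler consumes.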
In particular, experimental validation of the model without any link function showed that the residual distribution was always log-normal. Accordingly, since normally-distributed residuals are preferable in order to ensure that the least-squares estimator matches the maximum-likelihood estimator (as the latter has better statistical properties [16]), and since the natural logarithm provides the desired mapping (i.e., log : R+ -> R), using it as the link function is an obvious choice.

3.3.2 Regressor Format

With the dependent variables chosen, all that is left is to define the regressors. In order to minimize the necessary number of LRM terms considerably, we assume that each source core type is characterized independently, with independent LRM coefficients, such that its micro-architectural parameters become constant as far as the models are concerned, effectively being absorbed by the LRM coefficients. First, the performance model needs to be defined. Let us assume a simple situation where there is only one parameter of interest, the Load Queue size LQ_size on out-of-order cores, and where the only objective is to predict changes in performance as LQ_size varies, using the CPI metric. Figure 1 illustrates the relationship between the parameter LQ_size and log(CPI), where it becomes immediately obvious that there is no linear correlation. In particular, there are three noticeable problems:
P.1 The shape of each application's performance curve is not linear.
P.2 The y-axis is shifted differently for all applications, i.e., the minimum CPI is not the same.
P.3 The speedup (ratio between maximum and minimum CPI) varies depending on the application.
All three issues have one feature in common, namely the differences in how the various applications interact with the architectural parameter in question.
The solution to these issues is to apply a linearizing transformation, with possible help from runtime statistics. First, the change in the parameter, ΔLQ_size = LQ_size − LQ_size,initial, versus the change in the dependent variable, Δlog(CPI) = log(CPI) − log(CPI_initial), is shown in Figure 2 for all simulated initial values. This can be implemented by estimating the change in performance or energy consumption between the source core and the target core, instead of absolute values. While this does not guarantee a common minimum, it is enough to resolve P.2, since all plots share the common data-point (0, 0), which can be used as a starting point for a linearizing transformation. Figure 3 goes further, by applying a linearizing transformation over the change in LQ_size, specifically chosen because the non-linear shape of Figure 1 is very

Fig. 1. log(CPI) vs. Load Queue size LQ_size for each PARSEC [13] benchmark (blackscholes, bodytrack, canneal, dedup, ferret, fluidanimate, raytrace, streamcluster, swaptions, vips, x264).

Fig. 2. Change in log(CPI) vs. change in Load Queue size, for all possible initial values.

Fig. 3. Change in log(CPI) vs. change in 1/LQ_size, for all possible initial values.

Fig. 4. Change in log(CPI) vs. change in 1/LQ_size multiplied by the LD_pi runtime statistic.

similar to the inverse function 1/x. As can be concluded by analyzing the figure, this results in a linear behavior always centered around (0, 0), resolving P.1. However, different applications and initial values still have different slopes, hence not completely resolving the modeling problem. By looking carefully at Figure 3 and comparing it with Figure 1, it can be seen that the slope of the various lines seems to depend on the corresponding application's speedup. For example, streamcluster is the application with the largest speedup, and also has the largest slope, while canneal has the lowest speedup, and the smallest slope. This leads to the conclusion that all that is left is to classify the applications, which can be done by using runtime statistics. In particular, the rate of loads per instruction, LD_pi, was found to correlate highly with the speedup brought by an increase in Load Queue size, exhibiting a high linear correlation coefficient ρ in all examined cases. Hence, Figure 4 shows the same plot, with the added step of multiplying by LD_pi, which achieves a considerable improvement in linear correlation by resolving P.3. Table 3 shows the coefficient of determination R^2 values obtained by performing a single-term plus intercept linear regression ŷ = β_0 + β_1 x on the various transformation steps.
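The transformation chain of Figures 2 to 4 can be sketched on synthetic data. The two applications, their LD_pi rates and the generative slope below are invented; only the pipeline itself (ΔLQ_size, then Δ(1/LQ_size), then multiplication by LD_pi, with a single-term R^2 computed at each step) mirrors the steps described above:

```python
# Sketch of the linearizing steps of Figs. 2-4 on synthetic data. The two
# applications, their LD_pi values and the slope 2.0 are made up; the
# transformation pipeline follows the text.

def r2(xs, ys):
    """Coefficient of determination of the single-term fit yhat = b0 + b1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

lq_init = 8                              # initial Load Queue size (entries)
apps = {"appA": 0.2, "appB": 0.6}        # hypothetical LD_pi per application
xs_raw, xs_inv, xs_full, ys = [], [], [], []
for ld_pi in apps.values():
    for lq in (8, 16, 32, 64):           # swept Load Queue sizes
        d_inv = 1.0 / lq - 1.0 / lq_init
        xs_raw.append(lq - lq_init)      # Fig. 2: plain Delta(LQ_size)
        xs_inv.append(d_inv)             # Fig. 3: Delta(1/LQ_size)
        xs_full.append(d_inv * ld_pi)    # Fig. 4: multiplied by LD_pi
        ys.append(2.0 * d_inv * ld_pi)   # synthetic Delta log(CPI)

# Each transformation step improves the single-term linear fit:
assert r2(xs_inv, ys) >= r2(xs_raw, ys)
assert r2(xs_full, ys) > 0.999
```

On this synthetic data the inverse transformation already raises R^2 noticeably, and the per-application LD_pi factor makes the fit essentially perfect, mirroring the progression reported in Table 3.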
For reference, R^2 = 1 denotes a perfect fit, and R^2 = 0 denotes no correlation at all. In particular, it can be seen that applying the inverse transformation in Figure 3 improves the correlation considerably, and that the multiplication by the runtime statistic LD_pi in Figure 4 achieves an almost-perfect quality of fit. By applying similar principles to the remaining variables, the proposed performance models are built based on the regression

log(ĈPI_tgt) = β_0 + β_1 log(CPI_src) + Σ_{i=1..N} β_{i+1} x_i,    (3)

x_i = Δf_i(p_i) * (S_i / I),    (4)

where ĈPI_tgt represents the estimated performance at the target core, β_i are the model coefficients, CPI_src represents the performance measured at the source core, and x_1, ..., x_N represent the set of N regression terms obtained by coupling the statistics gathered using Hardware Counters (HCs) with the transformed micro-architectural parameter variations. Each regression term x_i is herein considered to express the product of the variation Δf_i(p_i) of a given micro-architectural parameter p_i between the source (p_i,src) and target (p_i,tgt) cores under a transformation f_i : R -> R, such that Δf_i(p_i) = f_i(p_i,tgt) − f_i(p_i,src), with a runtime statistic S_i normalized by the retired instruction count I. This modeling process and linear regression format could also be employed for the energy consumption, resulting in a model with similar accuracy. However, the energy model can be improved considerably: if it is assumed that all cores are implemented using the same technology, EPC can be divided into two separate components, a core-dependent EPC_base component, affected only by the architectural parameters of the processor due to its static power consumption, and an application-dependent EPC_app component, corresponding to the core's dynamic power consumption, affected by the application. The core-dependent energy component EPC_base consists of various interactions between parameters at the target core that define its static power consumption.
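A single regression term of Eq. (4) can be sketched as follows, with f_i(p) = 1/p as suggested by the architectural analysis. The queue sizes, statistic counts and trained coefficients are hypothetical, purely to show how a prediction is assembled:

```python
import math

# Sketch of one regression term of Eq. (4): x_i = Delta f_i(p_i) * S_i/I,
# with f_i(p) = 1/p. All numbers (queue sizes, counts, betas) are invented.

def regressor(p_src, p_tgt, stat_count, instructions):
    """x_i = (f(p_tgt) - f(p_src)) * (S_i / I), with f(p) = 1/p."""
    return (1.0 / p_tgt - 1.0 / p_src) * (stat_count / instructions)

# Example: Load Queue grows from 16 to 48 entries; 3e8 loads in 1e9 instructions.
x_lq = regressor(16, 48, 3e8, 1e9)

beta = (0.05, 1.0, 2.0)                  # assumed trained (b0, b1, b2)
cpi_src = 1.8                            # CPI measured on the source core
log_cpi_tgt = beta[0] + beta[1] * math.log(cpi_src) + beta[2] * x_lq
cpi_tgt = math.exp(log_cpi_tgt)

# A larger Load Queue yields a negative term, lowering the predicted CPI
# relative to the no-change prediction exp(b0) * cpi_src:
assert x_lq < 0
assert cpi_tgt < math.exp(beta[0]) * cpi_src
```

Note that, because the model works on log(CPI), exponentiating the fitted value guarantees a positive CPI prediction, consistent with the choice of link function.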
As for the application-dependent component, the architectural analysis shows that power consumption is typically inversely correlated with CPI for all considered parameter sweeps, which makes sense since higher performance means more transistor switching, resulting in an increased dynamic power consumption. In fact, if one calculates the linear correlation coefficient between log(CPI) and log(EPC) individually for each core variation of the architectural analysis, one obtains an average ρ_avg indicating a strong inverse correlation. Hence, it can be concluded that knowing the performance at a target processor should be enough to predict the application-dependent energy consumption component EPC_app with relatively high accuracy. This leads to the conclusion that an energy model has to predict the performance on the target core in order to predict

its energy consumption. However, the performance model already does that job, so its results can be directly used as a regressor. The proposed energy models are therefore built based on the regression

log(ÊPC_tgt) = β_0 + β_1 log(CPI_tgt) + Σ_{i=1..N} β_{i+1} w_i,    (5)

where ÊPC_tgt represents the estimated energy consumption at the target core, β_i are the model coefficients, CPI_tgt represents the performance at the target core, used to explain EPC_app, and w_1, ..., w_N represent the set of N regression terms used to explain EPC_base. The performance CPI_tgt can be obtained through the performance model or, alternatively, this model can be used to predict the energy consumption at the source core itself, which is useful in case the processor does not support per-core power instrumentation. The individual regressors w_i should have the format

w_i = f_i(p_i,tgt),    (6)

where f_i is a transformation function f_i : R -> R, and p_i,tgt represents an architectural parameter of the target core. It should be noted that certain interaction effects may also need to be modeled, such that w_i may be attained by multiplying two or more architectural parameters,

w_i = w_j * w_k,  i ≠ j ≠ k.    (7)

Moreover, while it was at first assumed that each source core would be trained individually, it can be seen that this is not required for the energy model, since only the CPI_tgt term depends on the application itself, while all remaining terms depend solely on the target processor's architectural parameters.

3.4 Choice of Parameters and Statistics

With the model format defined, all that is left is to choose the values of x_i (4) for the performance models and w_i (6) for the energy models, in particular the architectural parameters p_i, the transformation functions f_i and, for the performance model, the runtime statistics S_i.
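This selection step can be sketched as an exhaustive single-term search: for one swept architectural parameter, try every (transformation, statistic) combination in a single-term regression and keep the pair with the highest R^2. The parameter values, candidate statistics and responses below are synthetic, generated by the same inverse-times-LD_pi mechanism used earlier, so the search should recover that combination:

```python
# Sketch of automatic regressor selection via single-term regressions.
# Candidate transformations/statistics and the synthetic data are illustrative.

def r2_single_term(xs, ys):
    """R^2 of the fit yhat = b0 + b1*x, in closed form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy) if sxx > 0 and syy > 0 else 0.0

transforms = {"identity": lambda p: p, "inverse": lambda p: 1.0 / p}
stats = {"none": [1.0] * 6, "LD_pi": [0.1, 0.1, 0.1, 0.5, 0.5, 0.5]}
params = [8, 16, 32, 8, 16, 32]          # swept parameter values (two apps)

# Synthetic responses generated by the inverse-times-LD_pi mechanism:
ys = [2.0 * (1.0 / p) * s for p, s in zip(params, stats["LD_pi"])]

candidates = [
    (t_name, s_name,
     r2_single_term([t(p) * s for p, s in zip(params, stats[s_name])], ys))
    for t_name, t in transforms.items()
    for s_name in stats
]
best = max(candidates, key=lambda c: c[2])
assert best[:2] == ("inverse", "LD_pi")  # recovers the generating combination
```

With real sweep data, the same loop simply iterates over every measured hardware-counter statistic and every candidate f_i, ranking combinations by R^2 exactly as described in the text.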
This can be done manually, by analyzing the data collected during section 3 and going through the parameters one by one, in order to choose the best combination of f_i and S_i for each parameter. However, as the number of architectural parameters and core variations increases, it may be necessary to automate this procedure. Hence, statistical methods for automatic regressor choice, such as the Lasso [17], become very useful. Nevertheless, because the number of architectural parameters under variation is not too large, such approaches are not strictly necessary; instead, a manual choice of regressors was performed based on the architectural analysis of subsection 3.2, which also helps to illustrate the model-making procedure. In order to validate these choices, regression analysis was used: a single-term plus intercept linear regression was performed for all possible combinations of measured runtime statistics S_i (in the case of the performance model) and a hand-picked list of transformation functions f_i over each of the architectural parameters p_i discussed in subsection 3.2, with the resulting coefficients of determination R^2 serving as a metric for selecting the best regressors for the final models.

3.5 General-case Models

With the general structure of the LRM decided, it is now possible to create general-case models for the parameters of interest for all possible core transitions, i.e., In-Order to In-Order, In-Order to Out-of-Order, Out-of-Order to In-Order, and Out-of-Order to Out-of-Order, for both performance and energy consumption.
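Before detailing the individual models, the energy regression of Eq. (5) can be sketched in isolation. In this sketch, log(EPC_tgt) is explained by the (measured or predicted) target CPI plus target-only architecture terms w_i = f_i(p_tgt); every coefficient, parameter value and transformation choice below is hypothetical:

```python
import math

# Sketch of the energy regression of Eq. (5). The w_i terms, parameter
# values and beta coefficients are all invented for illustration.

def predict_epc(cpi_tgt, w, beta):
    """Eq. (5): log(EPC_tgt) = b0 + b1*log(CPI_tgt) + sum_i b_{i+1}*w_i."""
    log_epc = beta[0] + beta[1] * math.log(cpi_tgt)
    for b_i, w_i in zip(beta[2:], w):
        log_epc += b_i * w_i
    return math.exp(log_epc)

# Two illustrative w_i terms: log of ROB size and log of L2 capacity (bytes).
w = [math.log(128), math.log(256 * 1024)]
beta = [-21.0, -0.3, 0.2, 0.35]          # b1 < 0: EPC inversely tracks CPI

epc_fast = predict_epc(cpi_tgt=0.8, w=w, beta=beta)   # high-performance run
epc_slow = predict_epc(cpi_tgt=2.0, w=w, beta=beta)   # memory-bound run
assert epc_fast > epc_slow > 0.0         # more switching per cycle, more EPC
```

The negative coefficient on log(CPI_tgt) encodes the inverse CPI/EPC correlation observed in the architectural analysis, and exponentiating the fit guarantees a positive EPC prediction.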
The performance models can then be individually trained for each source core, and used to predict the gains in performance if a thread were to switch to a different core type, while the energy models can be trained once for each of the four possible situations, and used to predict the difference in energy consumption.

3.5.1 Performance

From the parameter-sweep analysis discussed in section 3, all parameters share a shape similar to the inverse function 1/x, giving a very strong indication that this transformation will favor model development for all tested parameters. On the other hand, choosing the runtime statistic is more difficult, and it must be discussed case by case. For the Issue/Commit Width, the results suggest that a very similar speedup is attained independently of the application, leading to the hypothesis that no runtime statistic is needed. To confirm this hypothesis, a regression analysis was performed, showing that the best R^2 value is indeed obtained when no runtime statistic is used. Concerning the Load Queue size, Table 3 already showed a strong correlation between the speedup and the number of load instructions LD. The same holds for the Store Queue, except with the number of store instructions ST. Concerning the ROB size, the architectural analysis revealed a strong linear correlation with the Micro-operation (µ-op) rate. As for the cache sizes, the situation is more complex. While there is a clear correlation between the miss rate at a certain cache level and that level's speedup, the number of cache misses varies significantly as the cache size changes, a fact that compromises prediction accuracy. For example, a processor with 2048KB of L2 cache exhibits barely any cache misses when running the streamcluster application, even though that same application suffers a very large performance penalty with very low L2 cache sizes.
It is expected that high traffic at a cache level increases the probability of a performance drop if the size of that level is reduced. In fact, cache misses at level x-1 can effectively be used as a measure of traffic at level x, since a cache access at level x requires a cache miss at level x-1. As a result, the previous cache level's miss rate is a more accurate predictor of a specific cache level's performance impact. This reasoning also implies that the effect of the L1 size can be predicted using the total number of memory accesses, LD + ST. When considering the full set of micro-architecture parameters and statistics, the following simple, but still highly representative, 9-term LRM to predict performance variations between different Out-of-Order cores was obtained:

\log(\widehat{CPI}_{tgt}) = \beta_0 + \beta_1 \log(CPI_{src}) + \beta_2 \frac{1}{W} + \beta_3 \frac{\mu op_{pi}}{ROB_{size}} + \beta_4 \frac{LD_{pi}}{LQ_{size}} + \beta_5 \frac{ST_{pi}}{SQ_{size}} + \beta_6 \frac{LD_{pi} + ST_{pi}}{L1_{size}} + \beta_7 \frac{L1miss_{pi}}{L2_{size}} + \beta_8 \frac{L2miss_{pi}}{L3_{size}} (8)

Out-of-Order to Out-of-Order Performance Model

Repeating the same analysis for the performance variations between different In-Order cores, the same conclusions are reached, except that the ROB size is not a valid architectural parameter of in-order cores and, as such, disappears from the model. Hence, the following 8-term LRM is obtained:

\log(\widehat{CPI}_{tgt}) = \beta_0 + \beta_1 \log(CPI_{src}) + \beta_2 \frac{1}{W} + \beta_3 \frac{LD_{pi}}{LQ_{size}} + \beta_4 \frac{ST_{pi}}{SQ_{size}} + \beta_5 \frac{LD_{pi} + ST_{pi}}{L1_{size}} + \beta_6 \frac{L1miss_{pi}}{L2_{size}} + \beta_7 \frac{L2miss_{pi}}{L3_{size}} (9)

In-Order to In-Order Performance Model

When developing the cross-type performance models (In-Order to Out-of-Order and Out-of-Order to In-Order), most of the same conclusions hold. In particular, \mu op_{pi} should be able to explain the change in performance due to the addition or removal of out-of-order execution capability, resulting in the following 9-term LRM to predict the performance variations from In-Order cores to Out-of-Order cores:

\log(\widehat{CPI}_{tgt}) = \beta_0 + \beta_1 \log(CPI_{src}) + \beta_2 \frac{1}{W} + \beta_3 \frac{\mu op_{pi}}{ROB_{size}} + \beta_4 \frac{LD_{pi}}{LQ_{size}} + \beta_5 \frac{ST_{pi}}{SQ_{size}} + \beta_6 \frac{LD_{pi} + ST_{pi}}{L1_{size}} + \beta_7 \frac{L1miss_{pi}}{L2_{size}} + \beta_8 \frac{L2miss_{pi}}{L3_{size}} (10)

In-Order to Out-of-Order Performance Model

In the opposite direction, from Out-of-Order cores to In-Order, the model is very similar. However, since the model is trained once per source core, the ROB_{size} architectural parameter is absorbed by the \beta_i coefficients.
As a result, the \mu op_{pi}/ROB_{size} term should be replaced by just \mu op_{pi}, leading to the following 9-term LRM:

\log(\widehat{CPI}_{tgt}) = \beta_0 + \beta_1 \log(CPI_{src}) + \beta_2 \frac{1}{W} + \beta_3 \mu op_{pi} + \beta_4 \frac{LD_{pi}}{LQ_{size}} + \beta_5 \frac{ST_{pi}}{SQ_{size}} + \beta_6 \frac{LD_{pi} + ST_{pi}}{L1_{size}} + \beta_7 \frac{L1miss_{pi}}{L2_{size}} + \beta_8 \frac{L2miss_{pi}}{L3_{size}} (11)

Out-of-Order to In-Order Performance Model

Energy

The analysis for the energy models is slightly different from that for the performance models. In particular, because the models include CPI_{tgt} as a term and do not depend on the source core for any of the remaining terms, we are interested solely in the energy consumption component EPC_{base} that is not directly influenced by performance. Hence, as the architectural analysis shows that the change in energy consumption due to a change in the Load and Store Queue sizes is almost solely due to the change in performance, these parameters do not need to be included in the energy models. The cache sizes are also relatively simple, since their behavior is mostly linear, which means no transformation is required. However, things get more complex when looking at the core width and the ROB size. These parameters have a large effect on energy consumption due to a large change in performance. However, both also exhibit a small, linear increase in energy consumption as their size increases, even when the CPI does not change, which means they must be included as part of the model. Experimental evaluations showed that this works only as long as just one of these parameters varies: model accuracy falls sharply as soon as both parameters change simultaneously, with the residual distribution no longer being normally distributed. These are indicators of a strong interaction between both parameters that also needs to be modeled. In fact, if a third term w_i = W_{tgt} \cdot ROB_{size,tgt} is added in order to model this interaction, the model remains accurate even if both parameters change simultaneously.
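This interaction effect can be reproduced on synthetic data: if the baseline energy per cycle contains a genuine W times ROB_size component, an ordinary least-squares fit that includes the product regressor recovers all coefficients exactly. The sketch below uses made-up coefficients and a pure-stdlib normal-equations solver; it illustrates the modeling idea, not the trained model:

```python
# Synthetic illustration of the W x ROB_size interaction: energy per cycle
# is generated with a genuine W*ROB term (made-up coefficients), and an
# ordinary least-squares fit that includes the product regressor recovers
# all four coefficients.

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for A x = b."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Least-squares coefficients via the normal equations X^T X b = X^T y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

widths, robs = [1, 2, 4, 6], [32, 64, 128, 256]
grid = [(w, r) for w in widths for r in robs]
epc = [2.0 + 0.3 * w + 0.01 * r + 0.002 * w * r for w, r in grid]  # synthetic
X = [[1.0, w, r, w * r] for w, r in grid]  # intercept, W, ROB, interaction
beta = ols(X, epc)
```

Dropping the `w * r` column from `X` leaves a model that cannot fit this data exactly, mirroring the accuracy collapse observed when both parameters vary at once.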
As a result, the following 8-term LRM is obtained:

\log(\widehat{EPC}_{tgt}) = \beta_0 + \beta_1 \log(CPI_{tgt}) + \beta_2 W_{tgt} + \beta_3 ROB_{size,tgt} + \beta_4 W_{tgt} \cdot ROB_{size,tgt} + \beta_5 L1_{size,tgt} + \beta_6 L2_{size,tgt} + \beta_7 L3_{size,tgt} (12)

Out-of-Order to Out-of-Order Energy Model
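For illustration, evaluating these log-linear models reduces to a dot product between trained coefficients and a regressor vector built from the source core's per-instruction statistics and the target core's parameters. The dictionary keys and the zeroed coefficients below are hypothetical placeholders, not trained values:

```python
# Sketch of model evaluation for the Out-of-Order to Out-of-Order case:
# build the 9-term performance regressor vector and the 8-term energy
# regressor vector (with the W*ROB interaction), then exponentiate the
# dot product with trained coefficients.
import math

def dot(beta, x):
    return sum(b * v for b, v in zip(beta, x))

def perf_regressors(stats, tgt):
    """Regressors of the OoO-to-OoO performance model (intercept first)."""
    return [
        1.0,                                    # intercept
        math.log(stats["cpi_src"]),             # log(CPI_src)
        1.0 / tgt["W"],                         # inverse issue/commit width
        stats["uop_pi"] / tgt["ROB"],           # ROB pressure
        stats["ld_pi"] / tgt["LQ"],             # load-queue pressure
        stats["st_pi"] / tgt["SQ"],             # store-queue pressure
        (stats["ld_pi"] + stats["st_pi"]) / tgt["L1"],  # L1 traffic
        stats["l1miss_pi"] / tgt["L2"],         # L2 traffic (L1 misses)
        stats["l2miss_pi"] / tgt["L3"],         # L3 traffic (L2 misses)
    ]

def energy_regressors(cpi_tgt, tgt):
    """Regressors of the OoO-to-OoO energy model, with W*ROB interaction."""
    return [1.0, math.log(cpi_tgt), tgt["W"], tgt["ROB"],
            tgt["W"] * tgt["ROB"], tgt["L1"], tgt["L2"], tgt["L3"]]

def predict_cpi(beta, stats, tgt):
    return math.exp(dot(beta, perf_regressors(stats, tgt)))

def predict_epc(beta, cpi_tgt, tgt):
    return math.exp(dot(beta, energy_regressors(cpi_tgt, tgt)))
```

With the coefficient vector for log(CPI_src) set to 1 and all others to 0, the predicted CPI reduces to the source CPI, a convenient sanity check on the regressor ordering.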

TABLE 4
Considered architecture variations.
Issue/Commit Width: 1; 2; 4; 6
ROB Size (Out-of-Order only): 32; 64; 128; 256
Load/Store Queue Size: 1; 5; 10; 15; 20

In-Order cores have a very similar behavior in terms of energy, and as such the regressors are the same, except that the ROB_{size}-dependent terms disappear, as this parameter no longer exists. Accordingly, the following 6-term LRM is obtained:

\log(\widehat{EPC}_{tgt}) = \beta_0 + \beta_1 \log(CPI_{tgt}) + \beta_2 W_{tgt} + \beta_3 L1_{size,tgt} + \beta_4 L2_{size,tgt} + \beta_5 L3_{size,tgt} (13)

In-Order to In-Order Energy Model

The cross-type models (In-Order to Out-of-Order and Out-of-Order to In-Order) are relatively simple as well. The architectural analysis showed that the difference in energy consumption between in-order and out-of-order cores is mostly independent of the applications under test, a fact that is confirmed by automated regression analysis. As a result, the two following LRMs are obtained, with 8 terms for the In-Order to Out-of-Order case, and 6 terms in the opposite direction:

\log(\widehat{EPC}_{tgt}) = \beta_0 + \beta_1 \log(CPI_{tgt}) + \beta_2 W_{tgt} + \beta_3 ROB_{size,tgt} + \beta_4 W_{tgt} \cdot ROB_{size,tgt} + \beta_5 L1_{size,tgt} + \beta_6 L2_{size,tgt} + \beta_7 L3_{size,tgt} (14)

In-Order to Out-of-Order Energy Model

\log(\widehat{EPC}_{tgt}) = \beta_0 + \beta_1 \log(CPI_{tgt}) + \beta_2 W_{tgt} + \beta_3 L1_{size,tgt} + \beta_4 L2_{size,tgt} + \beta_5 L3_{size,tgt} (15)

Out-of-Order to In-Order Energy Model

4 EXPERIMENTAL RESULTS

In the previous section, a set of architectures was analyzed, and an architecture-agnostic method for creating performance and energy prediction models was devised. In what follows, these models are tested and validated, in order to evaluate their predictive accuracy.
4.1 Experimental Methodology and Setup

Based on the methodology used for the architectural analysis in the previous section (see subsection 3.2), several micro-architecture and cache organization parameters were varied for both in-order and out-of-order cores, as depicted in Tables 4 and 5. Accordingly, a total of 500 different core variations were simulated, corresponding to 400 out-of-order and 100 in-order cores, allowing an effective coverage of the interaction between all the considered parameters.

TABLE 5
Considered cache hierarchy variations (associativity and total size in KB). The block size was set fixed and equal to 64 Bytes, as commonly used by Intel processors.
Name: Tiny; Small; Medium; Large; Huge (each defining the associativity and size of the L1-I & L1-D, L2 and L3 caches)

5 MODELS ASSESSMENT

To demonstrate the quality and accuracy of the models proposed in section 3.5, they were first assessed on a full range of architectures, by considering the multiple core parameterizations depicted in Tables 4 and 5. Since the model assumes the representativeness of the training set for all possible applications and cores, it makes sense to use as much information as possible during its testing and validation. Therefore, a leave-one-out cross-validation approach was adopted, such that one application was removed from the training set in each iteration, and subsequently used for model validation. Furthermore, in section 2 it was mentioned that commercial heterogeneous architectures such as the ARM big.LITTLE [3] processor typically have only two types of cores: one small In-Order core, and one large Out-of-Order core. Accordingly, to properly perform model validation, several heterogeneous architectures are herein considered, namely a heterogeneous many-core system composed of the previously mentioned 500 cores, and a more realistic heterogeneous system composed of only a big and a little core, consisting of the largest and smallest core variations within the set of 500 simulated cores, respectively.
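The leave-one-out cross-validation procedure described above can be sketched generically; `train` and `predict_error` below are hypothetical stand-ins for the actual model-fitting and validation steps:

```python
# Leave-one-out cross-validation sketch: each application is held out in
# turn, the model is trained on the remaining ones, and the held-out
# application is used for validation.
def leave_one_out(applications, train, predict_error):
    errors = {}
    for held_out in applications:
        model = train([a for a in applications if a != held_out])
        errors[held_out] = predict_error(model, held_out)
    return errors
```

Because every application serves as the validation set exactly once, the reported accuracy always refers to applications the model was not trained on.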
Additionally, current scheduling systems are more interested in the minimization of the Energy-Delay Product (EDP) or of the PDP than of the EPC, since minimizing the PDP is the same as minimizing the total energy consumed by the processor for a certain workload. To perform such an evaluation, it is possible to combine the results of the proposed performance and energy models and easily calculate an estimate of the PDP, by using the equation

\widehat{PDP}_{tgt} = \widehat{EPC}_{tgt} \cdot \widehat{CPI}_{tgt} \cdot I, (16)

where I is the number of executed instructions and \widehat{EPC}_{tgt} was obtained using the CPI_{tgt} value estimated by the performance model. Figure 5 presents the normalized predictions for the performance (CPI), energy (EPC) and combined PDP models, represented as three Tukey box-plots for each metric. It can be observed that the model provides accurate predictions over a wide range of architectures for all considered core parameters, with all three quartiles (25th percentile, median, and 75th percentile) ranging between 75% and 125% of the observed value in all cases. Furthermore, in the dual-core small in-order plus big out-of-order case, most observations show an estimation error lower than 10%, and all three quartiles range between 95% and 105% of the observed value for all cases.
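The PDP combination above translates directly into code: the estimate is the product of the predicted energy per cycle, the predicted cycles per instruction, and the instruction count. A trivial sketch with illustrative values, including the normalization used for the box-plot style evaluation:

```python
# PDP estimate sketch: predicted energy-per-cycle times predicted
# cycles-per-instruction times instruction count gives the total energy,
# i.e. the power-delay product.
def pdp_estimate(epc_hat, cpi_hat, instructions):
    return epc_hat * cpi_hat * instructions

def normalized(prediction, observed):
    """Prediction relative to the observed value (1.0 = perfect)."""
    return prediction / observed
```

Feeding the energy model with the CPI estimated by the performance model, rather than the observed CPI, is what makes the PDP estimate usable before any migration takes place.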

Fig. 5. Tukey boxplots of the proposed models for CPI, EPC and PDP (normalized by the observed values), for the OoO to OoO, IO to IO, IO to OoO and OoO to IO transitions and for the small IO + big OoO system.

TABLE 6
Scheduler validation of the performance and energy models with N randomly selected core variations to pick from (including the source core), when attempting to minimize CPI, EPC or PDP, compared to a random scheduler (expected value when picking a random core) and an oracle scheduler (always picks the best core).
CPI, Rel. Error vs. Oracle: 2.08%; 4.29%; 4.69%; 3.33%; 1.67%
EPC, Rel. Error vs. Oracle: 0.38%; 0.79%; 1.48%; 2.4%; 2.57%
PDP, Rel. Error vs. Oracle: 1.45%; 2.53%; 4.45%; 4.50%; 5.53%

5.1 Application to System Schedulers

To further evaluate whether the proposed models can effectively predict the most efficient core for each application, a scheduler-specific validation test was performed. For each iteration of the test, a permutation of one source core and N alternative cores was picked at random. A leave-one-out cross-validation approach is adopted, where, for each iteration, one application is removed from the training set, and the models are re-trained for the specific combination of N cores. The resulting models are then used to predict the best core for that application, by minimizing CPI, EPC or PDP, out of the N possible choices. The observed value (CPI, EPC or PDP) in the chosen core was then compared with that of the core selected by a scheduler using either a random policy, where cores are picked at random, or an oracle policy, where the best core is always chosen. The results, presented in Table 6, show that the model manages to estimate the correct core in a large majority of the cases, even when there are many different core variations to pick from.
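The scheduler-specific validation can be sketched as follows: the proposed policy picks the core with the lowest predicted metric, the oracle picks the lowest observed one, and the random policy's expected value is the mean over all candidates. The core names and metric values below are illustrative:

```python
# Scheduler validation sketch: compare the observed metric (CPI, EPC or
# PDP) of the core chosen from model predictions against an oracle policy
# (always the best observed core) and the expectation of a random policy.
from statistics import mean

def schedule(cores, predicted, observed):
    """Return (proposed, oracle, random_expected, rel_error_vs_oracle)."""
    proposed = observed[min(cores, key=lambda c: predicted[c])]
    oracle = min(observed[c] for c in cores)
    random_expected = mean(observed[c] for c in cores)
    rel_error = (proposed - oracle) / oracle
    return proposed, oracle, random_expected, rel_error
```

When the model ranks the best core first, the relative error versus the oracle is zero; a mis-ranking costs only the gap between the chosen and the best core, which is what the Rel. Error rows in Tables 6 and 7 report.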
Furthermore, the results also show that, when the proposed models make an incorrect guess, only a small performance or energy consumption loss is observed when compared to the oracle case. All in all, the proposed models are able to correctly predict the best cores for the various applications, leading to average CPI, EPC and PDP differences lower than 4.69%, 2.57% and 5.53%, respectively, when compared to an oracle scheduling solution, even with over 100 different core variations to choose from. Reduced numbers of available core architectures provide even better results, for example PDP differences compared to the oracle case of 1.45% or 2.53% when considering 2 and 6 core variations, respectively. The same validation step was repeated for the four considered Small + Big dual-core systems, and the results are depicted in Table 7.

TABLE 7
Scheduler validation of the performance and energy models on various dual-core Small + Big systems with In-Order (IO) and Out-of-Order (OoO) cores, when attempting to minimize CPI, EPC or PDP, compared to a random scheduler (expected value when picking a random core) and an oracle scheduler (always picks the best core).
Architectures: Small IO + Big OoO; Big IO + Big OoO; Small OoO + Big OoO; Small IO + Big IO
CPI, Rel. Error vs. Oracle: 0.00% for all four systems
EPC, Rel. Error vs. Oracle: 0.00% for all four systems
PDP, Rel. Error vs. Oracle: 0.6%; 0.00%; 0.55%; 0.2%

Fig. 6. Relative Error vs. Oracle of the proposed performance model (0.9%), compared with PIE [5] (5.32%) and a random scheduler (37.84%).

Fig. 7. Relative Error vs. Oracle of the proposed energy model (0.24%), compared with Pricopi et al. [6] (4.63%) and a random scheduler (48.5%).
From the results, it can be concluded that the correct core was chosen every single time when attempting to minimize CPI or EPC, and that the loss in PDP relative to the oracle scheduler is, in the worst case, as low as 0.6%.

5.2 Comparison with State-of-the-Art Models

To further demonstrate the quality of the results, the proposed models were compared with two state-of-the-art models, namely the Performance Impact Estimation (PIE) performance model [5] and the energy model proposed by Pricopi et al. [6]. For such purposes, the previously described scheduler-specific validation process was used.

To guarantee fairness, only the conditions considered by the corresponding authors of [5], [6] were used in each comparison. Hence, the existence of a dual-core processor is assumed (N = 2), where one core is In-Order and the other is Out-of-Order. In addition, since the PIE model [5] only considers the Core Width (W) and the ROB size (ROB_{size}) architectural parameters, all other architectural parameters were held constant during the comparison. On the other hand, the energy model proposed by Pricopi et al. [6] uses a mixture of off-line and on-line analysis, with the off-line portion being used to predict the cache access profile at the target core. Due to the complexity of implementing the off-line model, the comparison was performed by using the cache access statistics measured directly on the target core, as if the off-line portion were always 100% accurate in the estimation of such statistics. The results of both comparisons are shown in Figures 6 and 7, where it can be seen that the PIE model [5] exhibits a relative error versus the oracle scheduler approximately 7 times larger than that of the proposed models, and the Pricopi et al. [6] energy model approximately 9 times larger, serving as further proof of the quality and predictive power of the proposed models.

6 CONCLUSION

The key contribution of this work is the proposed modeling methodology, able to successfully derive performance and energy prediction models that can accurately predict the most efficient core out of hundreds of different core variations at once, without requiring a high-overhead online sampling procedure or offline static analysis. The presented results show that the obtained models are suitable for use by scheduling agents as a means of predicting the best task-to-core mapping that maximizes performance while minimizing power consumption.
These include, for example, those in an operating system running on a heterogeneous processor, large cluster infrastructures with many different processors, or system hypervisors able to apply core morphing techniques to predict and adapt to the resource demands of each application. A representative set of performance and energy models was developed using the proposed methodology and, over the 500 considered core variations, all the obtained models exhibit a very high coefficient of determination R^2 (as high as 0.95), meaning that there is a very high goodness of fit. Additionally, a scheduler validation step was executed, in which the proposed models were pitted against an oracle and a random scheduler in an attempt to minimize the CPI, the EPC or the Power-Delay Product (PDP). When compared to the oracle scheduler, the models showed a prediction error of only 5.53%, even when over a hundred core variations were considered. Lower numbers of core types present even better results; for example, an error of only 2.53% is attained for a system with 6 different core variations. Additionally, since current commercial heterogeneous processors only contain two different core variations, this specific situation was analyzed in more detail. In particular, it was shown that the proposed models are able to predict CPI and EPC with very high accuracy, with a large majority of the observations showing a CPI, EPC and PDP prediction error considerably smaller than 10%. In addition, the scheduler validation step was repeated specifically for the dual-architecture case, and the proposed models were shown to always correctly minimize CPI and EPC. As for the PDP, a very small error of 0.6% was observed when compared to the oracle case.

REFERENCES

[1] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen, "Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction," in Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 36). Washington, DC, USA: IEEE Computer Society, 2003.
[2] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas, "Single-ISA heterogeneous multi-core architectures for multithreaded workload performance," SIGARCH Computer Architecture News, vol. 32, no. 2, Mar. 2004.
[3] "big.LITTLE Technology: The Future of Mobile," ARM, Tech. Rep., 2011.
[4] "NVIDIA Tegra 4 Family CPU Architecture: 4-PLUS-1 Quad core," NVIDIA, Tech. Rep., 2013.
[5] K. Van Craeynest, A. Jaleel, L. Eeckhout, P. Narvaez, and J. Emer, "Scheduling heterogeneous multi-cores through performance impact estimation (PIE)," in Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA '12). Washington, DC, USA: IEEE Computer Society, 2012.
[6] M. Pricopi, T. S. Muthukaruppan, V. Venkataramani, T. Mitra, and S. Vishin, "Power-performance modeling on asymmetric multi-cores," in 2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), Sept. 2013.
[7] Y. Kora, K. Yamaguchi, and H. Ando, "MLP-aware dynamic instruction window resizing for adaptively exploiting both ILP and MLP," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). New York, NY, USA: ACM, 2013.
[8] J. Cong and B. Yuan, "Energy-efficient scheduling on heterogeneous multi-core architectures," in Proceedings of the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED '12). New York, NY, USA: ACM, 2012.
[9] G. Patsilaras, N. K. Choudhary, and J. Tuck, "Efficiently exploiting memory level parallelism on asymmetric coupled cores in the dark silicon era," ACM Transactions on Architecture and Code Optimization (TACO), vol. 8, no. 4, Jan. 2012.
[10] J. C. Saez, M. Prieto, A. Fedorova, and S. Blagodurov, "A comprehensive scheduler for asymmetric multicore systems," in Proceedings of the 5th European Conference on Computer Systems (EuroSys '10). New York, NY, USA: ACM, 2010.
[11] W. L. Bircher and L. K. John, "Complete system power estimation: A trickle-down approach based on performance events," in IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS 2007), Apr. 2007.
[12] T. E. Carlson, W. Heirman, S. Eyerman, I. Hur, and L. Eeckhout, "An evaluation of high-level mechanistic core models," ACM Transactions on Architecture and Code Optimization (TACO), 2014.
[13] C. Bienia, "Benchmarking modern multiprocessors," Ph.D. dissertation, Princeton University, Princeton, NJ, USA, Jan. 2011.
[14] S. Li, J. H. Ahn, R. Strong, J. Brockman, D. Tullsen, and N. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), Dec. 2009.
[15] A. J. Dobson and A. Barnett, An Introduction to Generalized Linear Models. CRC Press.
[16] G. A. F. Seber and A. J. Lee, Linear Regression Analysis, 2nd ed. John Wiley & Sons, Inc.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B (Methodological), 1996.


Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Who am I? Education Master of Technology, NTNU, 2007 PhD, NTNU, 2010. Title: «Managing Shared Resources in Chip Multiprocessor Memory

More information

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases IEEE TRANSACTIONS ON COMPUTERS, VOL. 63, NO. 5, MAY 2014 1061 A Cool Scheduler for Multi-Core Systems Exploiting Program Phases Zhiming Zhang and J. Morris Chang, Senior Member, IEEE Abstract Rapid growth

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Method-Level Phase Behavior in Java Workloads

Method-Level Phase Behavior in Java Workloads Method-Level Phase Behavior in Java Workloads Andy Georges, Dries Buytaert, Lieven Eeckhout and Koen De Bosschere Ghent University Presented by Bruno Dufour dufour@cs.rutgers.edu Rutgers University DCS

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

PARSEC vs. SPLASH-2: A Quantitative Comparison of Two Multithreaded Benchmark Suites

PARSEC vs. SPLASH-2: A Quantitative Comparison of Two Multithreaded Benchmark Suites PARSEC vs. SPLASH-2: A Quantitative Comparison of Two Multithreaded Benchmark Suites Christian Bienia (Princeton University), Sanjeev Kumar (Intel), Kai Li (Princeton University) Outline Overview What

More information

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200

More information

Applications Classification and Scheduling on Heterogeneous HPC Systems Using Experimental Research

Applications Classification and Scheduling on Heterogeneous HPC Systems Using Experimental Research Applications Classification and Scheduling on Heterogeneous HPC Systems Using Experimental Research Yingjie Xia 1, 2, Mingzhe Zhu 2, 3, Li Kuang 2, Xiaoqiang Ma 3 1 Department of Automation School of Electronic

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined

More information

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200

More information

Meet the Walkers! Accelerating Index Traversals for In-Memory Databases"

Meet the Walkers! Accelerating Index Traversals for In-Memory Databases Meet the Walkers! Accelerating Index Traversals for In-Memory Databases Onur Kocberber Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, Parthasarathy Ranganathan Our World is Data-Driven! Data resides

More information

Demand fetching is commonly employed to bring the data

Demand fetching is commonly employed to bring the data Proceedings of 2nd Annual Conference on Theoretical and Applied Computer Science, November 2010, Stillwater, OK 14 Markov Prediction Scheme for Cache Prefetching Pranav Pathak, Mehedi Sarwar, Sohum Sohoni

More information

CS 2410 Mid term (fall 2018)

CS 2410 Mid term (fall 2018) CS 2410 Mid term (fall 2018) Name: Question 1 (6+6+3=15 points): Consider two machines, the first being a 5-stage operating at 1ns clock and the second is a 12-stage operating at 0.7ns clock. Due to data

More information

CMSC411 Fall 2013 Midterm 2 Solutions

CMSC411 Fall 2013 Midterm 2 Solutions CMSC411 Fall 2013 Midterm 2 Solutions 1. (12 pts) Memory hierarchy a. (6 pts) Suppose we have a virtual memory of size 64 GB, or 2 36 bytes, where pages are 16 KB (2 14 bytes) each, and the machine has

More information

CS / ECE 6810 Midterm Exam - Oct 21st 2008

CS / ECE 6810 Midterm Exam - Oct 21st 2008 Name and ID: CS / ECE 6810 Midterm Exam - Oct 21st 2008 Notes: This is an open notes and open book exam. If necessary, make reasonable assumptions and clearly state them. The only clarifications you may

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

Neural Network based Energy-Efficient Fault Tolerant Architect

Neural Network based Energy-Efficient Fault Tolerant Architect Neural Network based Energy-Efficient Fault Tolerant Architectures and Accelerators University of Rochester February 7, 2013 References Flexible Error Protection for Energy Efficient Reliable Architectures

More information

Utilizing Concurrency: A New Theory for Memory Wall

Utilizing Concurrency: A New Theory for Memory Wall Utilizing Concurrency: A New Theory for Memory Wall Xian-He Sun (&) and Yu-Hang Liu Illinois Institute of Technology, Chicago, USA {sun,yuhang.liu}@iit.edu Abstract. In addition to locality, data access

More information

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores A Configurable Multi-Ported Register File Architecture for Soft Processor Cores Mazen A. R. Saghir and Rawan Naous Department of Electrical and Computer Engineering American University of Beirut P.O. Box

More information

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri Department of Computer and IT Engineering University of Kurdistan Computer Architecture Pipelining By: Dr. Alireza Abdollahpouri Pipelined MIPS processor Any instruction set can be implemented in many

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls

More information

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on

More information

Understanding Outstanding Memory Request Handling Resources in GPGPUs

Understanding Outstanding Memory Request Handling Resources in GPGPUs Understanding Outstanding Memory Request Handling Resources in GPGPUs Ahmad Lashgar ECE Department University of Victoria lashgar@uvic.ca Ebad Salehi ECE Department University of Victoria ebads67@uvic.ca

More information

ARCHITECTS use cycle-accurate simulators to accurately

ARCHITECTS use cycle-accurate simulators to accurately IEEE TRANSACTIONS ON COMPUTERS, VOL. 60, NO. 10, OCTOBER 2011 1445 An Empirical Architecture-Centric Approach to Microarchitectural Design Space Exploration Christophe Dubach, Timothy M. Jones, and Michael

More information

Power Struggles: Revisiting the RISC vs. CISC Debate on Contemporary ARM and x86 Architectures

Power Struggles: Revisiting the RISC vs. CISC Debate on Contemporary ARM and x86 Architectures Power Struggles: Revisiting the RISC vs. CISC Debate on Contemporary ARM and x86 Architectures by Ernily Blern, laikrishnan Menon, and Karthikeyan Sankaralingarn Danilo Dominguez Perez danilo0@iastate.edu

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems

Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems J.C. Sáez, A. Pousa, F. Castro, D. Chaver y M. Prieto Complutense University of Madrid, Universidad Nacional de la Plata-LIDI

More information

High Performance SMIPS Processor

High Performance SMIPS Processor High Performance SMIPS Processor Jonathan Eastep 6.884 Final Project Report May 11, 2005 1 Introduction 1.1 Description This project will focus on producing a high-performance, single-issue, in-order,

More information

Operating Systems Unit 6. Memory Management

Operating Systems Unit 6. Memory Management Unit 6 Memory Management Structure 6.1 Introduction Objectives 6.2 Logical versus Physical Address Space 6.3 Swapping 6.4 Contiguous Allocation Single partition Allocation Multiple Partition Allocation

More information

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE 143 CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE 6.1 INTRODUCTION This chapter mainly focuses on how to handle the inherent unreliability

More information

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables Storage Efficient Hardware Prefetching using Correlating Prediction Tables Marius Grannaes Magnus Jahre Lasse Natvig Norwegian University of Science and Technology HiPEAC European Network of Excellence

More information

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors Murali Jayapala 1, Francisco Barat 1, Pieter Op de Beeck 1, Francky Catthoor 2, Geert Deconinck 1 and Henk Corporaal

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

Evaluation of Branch Prediction Strategies

Evaluation of Branch Prediction Strategies 1 Evaluation of Branch Prediction Strategies Anvita Patel, Parneet Kaur, Saie Saraf Department of Electrical and Computer Engineering Rutgers University 2 CONTENTS I Introduction 4 II Related Work 6 III

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Magnus Ekman Per Stenstrom Department of Computer Engineering, Department of Computer Engineering, Outline Problem statement Assumptions

More information

TECHNOLOGY scaling has enabled greater integration because

TECHNOLOGY scaling has enabled greater integration because Thread Progress Equalization: Dynamically Adaptive Power and Performance Optimization of Multi-threaded Applications Yatish Turakhia, Guangshuo Liu 2, Siddharth Garg 3, and Diana Marculescu 2 arxiv:603.06346v

More information

A Study on Performance Benefits of Core Morphing in an Asymmetric Multicore Processor

A Study on Performance Benefits of Core Morphing in an Asymmetric Multicore Processor A Study on Performance Benefits of Core Morphing in an Asymmetric Multicore Processor Anup Das, Rance Rodrigues, Israel Koren and Sandip Kundu Department of Electrical and Computer Engineering University

More information

A Mechanism for Verifying Data Speculation

A Mechanism for Verifying Data Speculation A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es

More information

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs October 29, 2002 Microprocessor Research Forum Intel s Microarchitecture Research Labs! USA:

More information

Sampled Simulation of Multi-Threaded Applications

Sampled Simulation of Multi-Threaded Applications Sampled Simulation of Multi-Threaded Applications Trevor E. Carlson, Wim Heirman, Lieven Eeckhout Department of Electronics and Information Systems, Ghent University, Belgium Intel ExaScience Lab, Belgium

More information

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Bushra Ahsan and Mohamed Zahran Dept. of Electrical Engineering City University of New York ahsan bushra@yahoo.com mzahran@ccny.cuny.edu

More information

Reconfigurable Multicore Server Processors for Low Power Operation

Reconfigurable Multicore Server Processors for Low Power Operation Reconfigurable Multicore Server Processors for Low Power Operation Ronald G. Dreslinski, David Fick, David Blaauw, Dennis Sylvester, Trevor Mudge University of Michigan, Advanced Computer Architecture

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Walking Four Machines by the Shore

Walking Four Machines by the Shore Walking Four Machines by the Shore Anastassia Ailamaki www.cs.cmu.edu/~natassa with Mark Hill and David DeWitt University of Wisconsin - Madison Workloads on Modern Platforms Cycles per instruction 3.0

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

ANALYSIS OF A PARALLEL LEXICAL-TREE-BASED SPEECH DECODER FOR MULTI-CORE PROCESSORS

ANALYSIS OF A PARALLEL LEXICAL-TREE-BASED SPEECH DECODER FOR MULTI-CORE PROCESSORS 17th European Signal Processing Conference (EUSIPCO 2009) Glasgow, Scotland, August 24-28, 2009 ANALYSIS OF A PARALLEL LEXICAL-TREE-BASED SPEECH DECODER FOR MULTI-CORE PROCESSORS Naveen Parihar Dept. of

More information

Symbiotic Job Scheduling on the IBM POWER8

Symbiotic Job Scheduling on the IBM POWER8 Symbiotic Job Scheduling on the IBM POWER8 Josué Feliu, Stijn Eyerman 2, Julio Sahuquillo, and Salvador Petit Dept. of Computer Engineering (DISCA), Universitat Politècnica de València, València, Spain

More information

Caching Basics. Memory Hierarchies

Caching Basics. Memory Hierarchies Caching Basics CS448 1 Memory Hierarchies Takes advantage of locality of reference principle Most programs do not access all code and data uniformly, but repeat for certain data choices spatial nearby

More information

Pipelining to Superscalar

Pipelining to Superscalar Pipelining to Superscalar ECE/CS 752 Fall 207 Prof. Mikko H. Lipasti University of Wisconsin-Madison Pipelining to Superscalar Forecast Limits of pipelining The case for superscalar Instruction-level parallel

More information

CS 136: Advanced Architecture. Review of Caches

CS 136: Advanced Architecture. Review of Caches 1 / 30 CS 136: Advanced Architecture Review of Caches 2 / 30 Why Caches? Introduction Basic goal: Size of cheapest memory... At speed of most expensive Locality makes it work Temporal locality: If you

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS433 Homework 2 (Chapter 3) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

45-year CPU Evolution: 1 Law -2 Equations

45-year CPU Evolution: 1 Law -2 Equations 4004 8086 PowerPC 601 Pentium 4 Prescott 1971 1978 1992 45-year CPU Evolution: 1 Law -2 Equations Daniel Etiemble LRI Université Paris Sud 2004 Xeon X7560 Power9 Nvidia Pascal 2010 2017 2016 Are there

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information