PERFORMANCE AND ENERGY MODELING OF HETEROGENEOUS MANY-CORE ARCHITECTURES


Rui Pedro Gaspar Pinheiro

Abstract: Advances in processor design have recently pushed for the development of heterogeneous processors, in order to tackle the power and memory walls. However, fully exploiting the heterogeneity of many-core processing systems requires efficient scheduling mechanisms, in order to anticipate the performance and energy gains due to the migration of an application from one core to another. Although multiple approaches have been proposed to model such gains, they are generally limited in the supported levels of heterogeneity and can only be applied to a small subset of cores. Hence, a new methodology is herein proposed for performance and energy modeling of highly-heterogeneous many-core processors, specifically developed with the purpose of enabling system schedulers to make near-optimal decisions. The proposed approach relies on a Linear Regression Model and on a set of specifically devised regressors that are highly correlated with several micro-architectural parameters of modern in-order and out-of-order processor architectures, including the cache size, issue width, reorder buffer size and load/store queue sizes. To evaluate the proposed methodology, a set of 500 different core-architectures was simulated and their performance and energy consumption was modeled. Experimental results show that all models exhibit a high prediction accuracy, attaining coefficients of determination as high as 0.95. Moreover, when applied to the scheduling of applications on a simulated big.LITTLE system, the devised models allow the scheduler to correctly identify the optimal core-architecture in the large majority of cases, leading to power-delay product differences of less than 1% compared with an oracle scheduling solution.
Index Terms: Single-ISA heterogeneous systems, many-core processors, performance and energy modeling, linear regression model, thread scheduling

1 INTRODUCTION

Advances in processor design have recently pushed for the development of heterogeneous processors, in order to tackle the power and memory walls. In particular, by relying on appropriate and different core architectures, it is possible to efficiently leverage Memory-Level Parallelism and Instruction-Level Parallelism [1], [2], such as to minimize power and energy consumption with a reduced performance loss. Notably, driven by the introduction of the ARM big.LITTLE heterogeneous processor [3] (although not exclusively), intensive research has recently been put forth in the development and exploitation of heterogeneous processor systems composed of multiple single-ISA (Instruction Set Architecture) small in-order and big out-of-order cores. However, exploiting such heterogeneity often requires the development of efficient scheduling mechanisms, in order to anticipate the performance and power consumption gains or losses due to the migration of an application from one core to another, or to the morphing of a given core, which can be achieved by means of clock/power gating or by relying on reconfigurable technologies. As a result, a considerable amount of the research effort into heterogeneous systems concerns real-time performance and energy consumption models, specifically to predict the behavior of applications over different core architectures. Although multiple such approaches have been developed in the past which are effective for some use cases, they are sub-optimal in the general case. Taking into account the future developments in many-core heterogeneous processors or reconfigurable systems with a large number of different architectures, accurate performance and energy consumption estimation models are highly required.
In addition, the exact core architectures available in a system may vary considerably from system to system, such that an architecture-agnostic method for creating such models would be extremely valuable. The objectives of this work are: 1) to develop an architecture-agnostic method of deriving accurate on-line performance and energy estimation models, without the identified issues and limitations of existing approaches; 2) to derive a set of performance and energy estimation models that covers many common architectural parameters; 3) to show that the resulting models are highly accurate for the considered range of architectural parameters and can be used in the context of scheduling solutions. The rest of this work is organized as follows. Section 2 describes the state of the art in heterogeneous and reconfigurable architectures, as well as in performance and energy prediction models. A detailed description of the proposed modeling methodology is given in Section 3, together with the creation of general-case models covering all previously-mentioned architectural parameters. To thoroughly assess the developed work, Section 4 presents the experimental results and validation of the devised models. Finally, Section 5 concludes the developed work.

2 BACKGROUND

The objective of this work is the development of performance and power models for heterogeneous processors, in

order to allow for a more efficient allocation of running tasks to the available cores. Hence, this section introduces the current state of research into performance and power modeling of heterogeneous or reconfigurable systems.

2.1 Heterogeneous and Reconfigurable Architectures

Heterogeneous processors consist of an aggregate of multiple cores with different architectures and/or instruction sets, each tuned to handle specific workload types. Reconfigurable systems, on the other hand, have the capability to change their architecture (or parts of it) on the fly, adapting in real-time to their current workload. Advances in processor design have pushed for the development of these non-standard types of processors as a means to tackle the power and memory walls by relying on appropriate and different core architectures. In particular, heterogeneous processors with a combination of many small and few large core types have been shown [1], [2] to be able to outperform homogeneous processors in both execution speed (by having a larger number of smaller cores) and power savings (by having the least possible hardware sitting idle at any time). However, leveraging such architectures requires the adoption of intelligent task-to-core allocation strategies, in particular by analyzing the application's Memory-Level Parallelism (MLP) and Instruction-Level Parallelism (ILP). This conclusion can also be extrapolated to reconfigurable architectures, since they are able to dynamically add or remove ILP and MLP extraction hardware as necessary. These results have led to the introduction of the ARM big.LITTLE heterogeneous processor [3], which includes one high-performance big out-of-order core along with one low-power small in-order core, as well as NVidia's Tegra 4 CPU [4], which contains four big out-of-order cores along with one low-power small in-order core, among others.
In addition, in order to efficiently leverage these types of processors, intensive research has recently been put forth in the exploitation of heterogeneous processor systems composed of multiple single-ISA (Instruction Set Architecture) small in-order and big out-of-order cores.

2.2 Performance and Energy Prediction

While the architectural parameters of the system are known, the application characteristics are harder to measure. These can be estimated at compile-time, but this requires recompilation, which might not be feasible in all situations, most commonly in regards to proprietary software. Alternatively, most current processors are equipped with multiple Hardware Counters that can be configured to measure various runtime statistics (e.g., cycle counts, retired instructions, cache misses), which correspond to interactions between the application and the processor subsystems, and can therefore be leveraged as an indirect way of measuring, at runtime, the variables necessary for the prediction models, without requiring any previous off-line analysis. To tackle the aforementioned issues, Kumar et al. [1], [2] proposed a way to avoid off-line analysis, by relying on a two-stage on-line sampling approach. During the first stage (sampling), applications are permuted over all core architectures in order to obtain a set of multiple per-core statistics, retrieved from hardware counters. In a second stage, these gathered statistics are used to predict which core is better suited to each application. This last approach has one very important issue: the need for a slow on-line sampling/training process, during which the system is running sub-optimally, which becomes worse as the number of different core variations increases. Many attempts have been made to remove this requirement. For example, Craeynest et al.
[5] derived simplified runtime-statistics-based models to estimate performance differences between small in-order and big out-of-order cores, which they call Performance Impact Estimation (PIE), by using runtime statistics, specifically the Cycles Per Instruction (CPI) metric, together with the average dependency distance between producer and consumer instructions. Based on this model, the authors developed a system scheduler able to regularly estimate the performance of all running applications on an alternate core architecture and then decide whether a core switch is worthwhile. While a very innovative approach, PIE still has a number of shortcomings which impede it from fulfilling all our objectives. Not only is PIE constrained to only two core types (small in-order and big out-of-order), but it also only takes into account the differences concerning the Re-Order Buffer size and Issue Width between the source and target processor architectures, which means that changes in other parameters (e.g., the cache hierarchy) are not correctly predicted by the model. Additionally, measuring the base and memory CPI components separately, as required by PIE, can be a complex process. Another important contribution has been laid out by Pricopi et al. [6], who developed performance models specifically for the ARM big.LITTLE processor, also taking into account the cache hierarchies and the branch prediction subsystem. In addition, they used a Linear Regression Model (LRM) based on runtime statistics in order to model its energy consumption. While their approach does take into account possible micro-architectural differences other than the ROB size and issue width, it still only considers two possible core variations at once, constraining its use for many-core heterogeneous architectures, and it also requires off-line analysis.
3 PERFORMANCE AND ENERGY MODELING

In order to better leverage heterogeneity, an architecture-agnostic method of deriving accurate on-line performance and energy estimation models is desirable. Hence, it is necessary for the modeling approach to take into account a large number of core variations, specifically constructed by changing several architectural parameters, with as little overhead as possible, such that the resulting models can be used in the context of real-time system schedulers.

3.1 Choice of Architectural Parameters

One of the objectives of this work is to devise an architecture-agnostic method of creating performance and energy prediction models. However, there is a virtually limitless number of architectural parameters that could be changed between two different core variations. For this reason, it is important to choose a representative set of

common architectural parameters with considerable effects on performance and energy. Accordingly, both in-order and out-of-order architectural classes will be considered, with varying parameters, depicted in Table 1 along with a description of their dominant effects.

TABLE 1
Considered architectural parameters and their dominant effects.

- Re-Order Buffer size (ROB_size): when full, generates structural hazards, leading to stalls at instruction issue.
- Core Width (W): affects the peak instruction throughput at the issue, dispatch and commit stages.
- Load Queue size (LQ_size): when full, generates structural hazards for new load instructions, causing pipeline stalls at issue.
- Store Queue size (SQ_size): when full, generates structural hazards for new store instructions, causing pipeline stalls at issue.
- Cache size (L{1,2,3}_size): affects the cache hit rate, significantly impacting the memory access latency.
- Cache associativity (L{1,2,3}_assoc): affects the cache organization, possibly impacting the memory access latency.

(In order to simplify the analysis and reduce the number of architectures to be simulated, it is assumed that W_issue = W_commit = W.)

The cache sizes and associativities, as well as the Load Queue (LQ)/Store Queue (SQ) size parameters, were chosen because these hardware structures are very common ways of adding MLP extraction capability to a processor, while the Core Width is one of the simplest means of ILP extraction and is used in most modern processors. As for the Re-Order Buffer (ROB) size parameter, it represents the main hardware structure required for processors to execute instructions out-of-order, and as such was chosen for its central role in out-of-order processors.
In particular, due to its importance to out-of-order execution, the ROB size is a parameter analyzed by much existing research [5]-[8], mainly as a way to distinguish in-order from out-of-order cores. The cache sizes are also commonly analyzed [6], [8]-[11], by looking at the cache miss rates (mainly Last-Level Cache (LLC) misses) and their effects on performance. Some previous research also considers the direct effects of the Core Width [5], [7] and of the LQ/SQ size [7], although to a lesser degree.

3.2 Architecture Analysis: Methodology and Setup

In order to evaluate the impact of the different architectural parameters on performance, power and energy, both in-order and out-of-order architectures were simulated using the state-of-the-art Sniper Multi-Core Simulator [12], which provides accurate simulations of a broad range of x86 and x86_64 micro-architectures. To ensure the representativeness of the devised model when considering multiple types of workloads, the PARSEC [13] benchmark suite was chosen for analysis, training and validation. Each workload's initialization and shutdown phases (when the input data is read from disk or the results are written back) are uninteresting from the point of view of architectural analysis, since they depend almost solely on systems outside of the processor's control (e.g., hard drive speed). For this purpose, simulator-specific magic instructions were added to each of the eleven PARSEC benchmarks in order to define the appropriate simulation Region of Interest (ROI) for each benchmark and exclude initialization and shutdown. In order to examine each architectural parameter's effects individually, a sweep over a large range was

TABLE 2
Sweep ranges and sample count for all architectural parameters under consideration.
Columns: Parameter | Default | Sweep range | # Samples. Rows: Issue/Commit Width; ROB size; Load Queue size; Store Queue size; L1 size (KB); L1 associativity; L2 size (KB); L2 associativity; L3 size (KB); L3 associativity.

done for each of the parameters of interest, using a single-core processor with Intel's Gainestown micro-architecture as a base. The default values, as well as the parameter sweep ranges and sample counts, are depicted in Table 2. For the Issue/Commit Width, ROB size and Load/Store Queue sizes, the samples are uniformly distributed over the sweep range. However, due to architectural limits imposed by the simulator, both the cache size and associativity are required to be powers of two, meaning that only powers of two within the sweep range were sampled for the cache parameters (L{1,2,3}_size and L{1,2,3}_assoc). For each of the core variations, the PARSEC benchmarks were executed to completion in single-threaded mode using the predefined small input set, with the pre- and post-ROI sections simulated in fast-forward mode in order to reduce the total processing time. All ROI runtime statistics of interest were measured during simulation and stored for later processing. After execution, the McPAT [14] power framework is used to estimate power consumption, assuming a 45nm transistor technology with a 1.2V supply voltage. The results of this architectural analysis will be used during the rest of this section to justify the model development process.

3.3 Dependent Variables and Regressors

In order to use a LRM, it is necessary to choose the dependent variable and the set of regressors carefully. In particular, it is important to note that, although the in-sample fit of the model increases trivially with the introduction of more regression terms, this leads to an increase in model complexity, potentially constraining its real-time application

in scheduling systems. Too many regression terms also lead to over-fitting conditions, constraining the model's validity on unobserved applications. It is therefore important to pick the minimum number of terms that is able to correctly represent the dominant effects of all architectural parameters of interest.

3.3.1 Choice of Dependent Variables

While the choice between Instructions Per Cycle (IPC) and CPI as the performance metric might at first seem indifferent, since they are the reciprocal of each other, there is in fact an important difference that affects model quality considerably. Because most runtime statistics are normalized by the instruction count in order to equalize their scales and make them easier to work with, and because CPI is already normalized by the instruction count, it becomes an obvious choice of performance metric. Therefore, CPI is used as the LRM dependent variable for the performance model. As for energy consumption, the choice is slightly more difficult. Of course, it is possible to use Energy Per Instruction (EPI), since it is already normalized by the instruction count. However, this is not a good choice, because the time taken by each instruction is both core- and application-dependent, which means the same EPI value can correspond to a vastly different energy consumption if calculated using two different cores and/or workloads, depending on the performance.
However, some simple mathematical manipulation leads to the equation

EPI = E_total / I = PDP / I = (P_avg * t_exec) / I = (P_avg * T * C) / I = EPC * CPI,    (1)

where E_total represents the total energy consumed during workload execution, PDP is the Power-Delay Product, P_avg denotes the average power consumption, t_exec the execution time of the workload, T = 1/f the clock period, C the number of elapsed clock cycles, I the total number of instructions, and EPC (Energy Per Cycle) is the total energy consumed in a single cycle, defined as

EPC = E_total / C = P_avg * T.    (2)

This result quickly leads to the conclusion that EPC is a good choice of energy metric, because it is independent of performance and, unlike EPI, denotes the energy consumption over a fixed time period corresponding to one clock cycle (assuming constant frequency). Additionally, when multiplied by the performance metric CPI (either measured during execution or predicted by the performance model), it results in PDP/I, a value which can be used directly by the system scheduler. It should, however, be taken into account that both the CPI and EPC metrics are always positive, i.e., CPI, EPC ∈ R+, something that the LRM is not aware of. In fact, a LRM assumes that the response variable is normally distributed over the whole real domain [15]. Hence, a LRM-based model will predict negative performance or energy consumption for certain input values, results which make no sense in practice and are therefore highly undesirable.

TABLE 3
Coefficient of determination R^2 of the various transformation steps.
Columns: Figure | R^2.

The simplest way to correct this is by applying a transformation g(.) to the dependent variables, also known as the link function of the Generalized Linear Model (GLM), chosen such that g is continuous and provides the mapping g : R+ -> R. The choice of link function is critical, since it will affect the results considerably.
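As a quick sanity check of Eqs. (1) and (2), the identity EPI = EPC * CPI can be verified numerically. The sketch below uses a purely hypothetical workload (cycle count, instruction count, frequency and total energy are all invented for illustration):

```python
# Numerical check of Eqs. (1)-(2): EPI = E_total/I = EPC * CPI.
# All workload numbers below are hypothetical, chosen only to exercise the identity.

def epi(e_total_j, instructions):
    """Energy Per Instruction, in joules."""
    return e_total_j / instructions

def epc(e_total_j, cycles):
    """Energy Per Cycle, in joules (Eq. 2): EPC = E_total / C = P_avg * T."""
    return e_total_j / cycles

def cpi(cycles, instructions):
    """Cycles Per Instruction."""
    return cycles / instructions

# Hypothetical run: 2e9 cycles at 2 GHz, 1.5e9 retired instructions, 20 J total.
C, I, E = 2.0e9, 1.5e9, 20.0
f = 2.0e9                      # clock frequency (Hz); clock period T = 1/f
P_avg = E / (C / f)            # average power = total energy / execution time

assert abs(epi(E, I) - epc(E, C) * cpi(C, I)) < 1e-15   # EPI = EPC * CPI
assert abs(epc(E, C) - P_avg * (1.0 / f)) < 1e-15       # EPC = P_avg * T
```

Multiplying the model outputs EPC and CPI therefore recovers PDP/I directly, which is the quantity the scheduler consumes.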
In particular, experimental validation of the model without any link function showed that the residual distribution was always log-normal. Accordingly, since normally-distributed residuals are preferable in order to ensure that the least-squares estimator matches the maximum-likelihood estimator (as the latter has better statistical properties [16]), and since the natural logarithm provides the desired mapping (i.e., log : R+ -> R), using it as the link function is an obvious choice.

3.3.2 Regressor Format

With the dependent variables chosen, all that is left is to define the regressors. In order to minimize the necessary number of LRM terms considerably, we assume that each source core type is characterized independently, with independent LRM coefficients, such that its micro-architectural parameters become constant as far as the models are concerned, effectively being absorbed by the LRM coefficients. First, the performance model needs to be defined. Let us assume a simple situation where there is only one parameter of interest, the Load Queue size LQ_size on out-of-order cores, and where the only objective is to predict changes in performance as LQ_size varies, using the CPI metric. Figure 1 illustrates the relationship between the parameter LQ_size and log(CPI), where it becomes immediately obvious that there is no linear correlation. In particular, there are three noticeable problems:
P.1 The shape of each application's performance curve is not linear.
P.2 The y-axis is shifted differently for all applications, i.e., the minimum CPI is not the same.
P.3 The speedup (ratio between maximum and minimum CPI) varies depending on the application.
All three issues have one feature in common, namely the differences in how the various applications interact with the architectural parameter in question.
The solution to these issues is to apply a linearizing transformation, with possible help from runtime statistics. First, the change in the parameter, ΔLQ_size = LQ_size − LQ_size,initial, versus the change in the dependent variable, Δlog(CPI) = log(CPI) − log(CPI_initial), is shown in Figure 2 for all simulated initial values. This can be implemented by estimating the change in performance or energy consumption between the source core and the target core, instead of absolute values. While this does not guarantee a common minimum, it is enough to resolve P.2, since all plots share the common data-point (0, 0), which can be used as a starting point for a linearizing transformation. Figure 3 goes further, by applying a linearizing transformation over the change in LQ_size, specifically chosen because the non-linear shape of Figure 1 is very

Fig. 1. log(CPI) vs. Load Queue size LQ_size for each PARSEC [13] benchmark (blackscholes, bodytrack, canneal, dedup, ferret, fluidanimate, raytrace, streamcluster, swaptions, vips, x264).

Fig. 2. Change in log(CPI) vs. change in Load Queue size, for all possible initial values.

Fig. 3. Change in log(CPI) vs. change in 1/LQ_size, for all possible initial values.

Fig. 4. Change in log(CPI) vs. change in 1/LQ_size multiplied by the LD_pi runtime statistic.

similar to the inverse function 1/x. As can be concluded by analyzing the figure, this results in a linear behavior always centered around (0, 0), resolving P.1. However, different applications and initial values still have different slopes, hence not completely resolving the modeling problem. By looking carefully at Figure 3 and comparing it with Figure 1, it can be seen that the slope of the various lines seems to depend on the corresponding application's speedup. For example, streamcluster is the application with the largest speedup, and also has the largest slope, while canneal has the lowest speedup, and the smallest slope. This leads to the conclusion that all that is left is to classify the applications, which can be done by using runtime statistics. In particular, the rate of loads per instruction, LD_pi, was found to correlate highly with the speedup brought by an increase in Load Queue size, exhibiting a high linear correlation coefficient ρ in all examined cases. Hence, Figure 4 shows the same plot, with the added step of multiplying by LD_pi, which achieves a considerable improvement in linear correlation by resolving P.3. Table 3 shows the coefficient of determination R^2 values obtained by performing a single-term plus intercept linear regression ŷ = β_0 + β_1 x on the various transformation steps.
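The transformation chain of Figures 2 to 4 can be sketched on synthetic data. The two applications, their LD_pi rates and the generative slope below are invented; only the pipeline itself (ΔLQ_size, then Δ(1/LQ_size), then multiplication by LD_pi, with a single-term R^2 computed at each step) mirrors the steps described above:

```python
# Sketch of the linearizing steps of Figs. 2-4 on synthetic data. The two
# applications, their LD_pi values and the slope 2.0 are made up; the
# transformation pipeline follows the text.

def r2(xs, ys):
    """Coefficient of determination of the single-term fit yhat = b0 + b1*x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

lq_init = 8                              # initial Load Queue size (entries)
apps = {"appA": 0.2, "appB": 0.6}        # hypothetical LD_pi per application
xs_raw, xs_inv, xs_full, ys = [], [], [], []
for ld_pi in apps.values():
    for lq in (8, 16, 32, 64):           # swept Load Queue sizes
        d_inv = 1.0 / lq - 1.0 / lq_init
        xs_raw.append(lq - lq_init)      # Fig. 2: plain Delta(LQ_size)
        xs_inv.append(d_inv)             # Fig. 3: Delta(1/LQ_size)
        xs_full.append(d_inv * ld_pi)    # Fig. 4: multiplied by LD_pi
        ys.append(2.0 * d_inv * ld_pi)   # synthetic Delta log(CPI)

# Each transformation step improves the single-term linear fit:
assert r2(xs_inv, ys) >= r2(xs_raw, ys)
assert r2(xs_full, ys) > 0.999
```

On this synthetic data the inverse transformation already raises R^2 noticeably, and the per-application LD_pi factor makes the fit essentially perfect, mirroring the progression reported in Table 3.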
For reference, R^2 = 1 denotes a perfect fit, and R^2 = 0 denotes no correlation at all. In particular, it can be seen that applying the inverse transformation in Figure 3 improves the correlation considerably, and that the multiplication by the runtime statistic LD_pi in Figure 4 achieves an almost-perfect quality of fit. By applying similar principles to the remaining variables, the proposed performance models are built based on the regression

log(ĈPI_tgt) = β_0 + β_1 log(CPI_src) + Σ_{i=1..N} β_{i+1} x_i,    (3)

x_i = Δf_i(p_i) * (S_i / I),    (4)

where ĈPI_tgt represents the estimated performance at the target core, β_i are the model coefficients, CPI_src represents the performance measured at the source core, and x_1, ..., x_N represent the set of N regression terms obtained by coupling the statistics gathered using Hardware Counters (HCs) with the transformed micro-architectural parameter variations. Each regression term x_i is herein considered to express the product of the variation Δf_i(p_i) of a given micro-architectural parameter p_i between the source (p_i,src) and target (p_i,tgt) cores under a transformation f_i : R -> R, such that Δf_i(p_i) = f_i(p_i,tgt) − f_i(p_i,src), with a runtime statistic S_i normalized by the retired instruction count I. This modeling process and linear regression format could also be employed for the energy consumption, resulting in a model with similar accuracy. However, the energy model can be improved considerably: if it is assumed that all cores are implemented using the same technology, EPC can be divided into two separate components, a core-dependent EPC_base component, affected only by the architectural parameters of the processor due to its static power consumption, and an application-dependent EPC_app component, corresponding to the core's dynamic power consumption, affected by the application. The core-dependent energy component EPC_base consists of various interactions between parameters at the target core that define its static power consumption.
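A single regression term of Eq. (4) can be sketched as follows, with f_i(p) = 1/p as suggested by the architectural analysis. The queue sizes, statistic counts and trained coefficients are hypothetical, purely to show how a prediction is assembled:

```python
import math

# Sketch of one regression term of Eq. (4): x_i = Delta f_i(p_i) * S_i/I,
# with f_i(p) = 1/p. All numbers (queue sizes, counts, betas) are invented.

def regressor(p_src, p_tgt, stat_count, instructions):
    """x_i = (f(p_tgt) - f(p_src)) * (S_i / I), with f(p) = 1/p."""
    return (1.0 / p_tgt - 1.0 / p_src) * (stat_count / instructions)

# Example: Load Queue grows from 16 to 48 entries; 3e8 loads in 1e9 instructions.
x_lq = regressor(16, 48, 3e8, 1e9)

beta = (0.05, 1.0, 2.0)                  # assumed trained (b0, b1, b2)
cpi_src = 1.8                            # CPI measured on the source core
log_cpi_tgt = beta[0] + beta[1] * math.log(cpi_src) + beta[2] * x_lq
cpi_tgt = math.exp(log_cpi_tgt)

# A larger Load Queue yields a negative term, lowering the predicted CPI
# relative to the no-change prediction exp(b0) * cpi_src:
assert x_lq < 0
assert cpi_tgt < math.exp(beta[0]) * cpi_src
```

Note that, because the model works on log(CPI), exponentiating the fitted value guarantees a positive CPI prediction, consistent with the choice of link function.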
As for the application-dependent component, the architectural analysis shows that power consumption is typically inversely correlated with CPI for all considered parameter sweeps, which makes sense since higher performance means more transistor switching, resulting in an increased dynamic power consumption. In fact, if one calculates the linear correlation coefficient between log(CPI) and log(EPC) individually for each core variation of the architectural analysis, one obtains an average ρ_avg indicating a strong inverse correlation. Hence, it can be concluded that knowing the performance at a target processor should be enough to predict the application-dependent energy consumption component EPC_app with relatively high accuracy. This leads to the conclusion that an energy model has to predict the performance on the target core in order to predict

its energy consumption. However, the performance model already does that job, so its results can be directly used as a regressor. The proposed energy models are therefore built based on the regression

log(ÊPC_tgt) = β_0 + β_1 log(CPI_tgt) + Σ_{i=1..N} β_{i+1} w_i,    (5)

where ÊPC_tgt represents the estimated energy consumption at the target core, β_i are the model coefficients, CPI_tgt represents the performance at the target core, used to explain EPC_app, and w_1, ..., w_N represent the set of N regression terms used to explain EPC_base. The performance CPI_tgt can be obtained through the performance model or, alternatively, this model can be used to predict the energy consumption at the source core itself, which is useful in case the processor does not support per-core power instrumentation. The individual regressors w_i should have the format

w_i = f_i(p_i,tgt),    (6)

where f_i is a transformation function f_i : R -> R, and p_i,tgt represents an architectural parameter of the target core. It should be noted that certain interaction effects may also need to be modeled, such that w_i may be attained by multiplying two or more architectural parameters,

w_i = w_j * w_k,  i ≠ j ≠ k.    (7)

Moreover, while it was at first assumed that each source core would be trained individually, it can be seen that this is not required for the energy model, since only the CPI_tgt term depends on the application itself, while all remaining terms depend solely on the target processor's architectural parameters.

3.4 Choice of Parameters and Statistics

With the model format defined, all that is left is to choose the values of x_i (4) for the performance models and w_i (6) for the energy models, in particular the architectural parameters p_i, the transformation functions f_i and, for the performance model, the runtime statistics S_i.
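This selection step can be sketched as an exhaustive single-term search: for one swept architectural parameter, try every (transformation, statistic) combination in a single-term regression and keep the pair with the highest R^2. The parameter values, candidate statistics and responses below are synthetic, generated by the same inverse-times-LD_pi mechanism used earlier, so the search should recover that combination:

```python
# Sketch of automatic regressor selection via single-term regressions.
# Candidate transformations/statistics and the synthetic data are illustrative.

def r2_single_term(xs, ys):
    """R^2 of the fit yhat = b0 + b1*x, in closed form."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy) if sxx > 0 and syy > 0 else 0.0

transforms = {"identity": lambda p: p, "inverse": lambda p: 1.0 / p}
stats = {"none": [1.0] * 6, "LD_pi": [0.1, 0.1, 0.1, 0.5, 0.5, 0.5]}
params = [8, 16, 32, 8, 16, 32]          # swept parameter values (two apps)

# Synthetic responses generated by the inverse-times-LD_pi mechanism:
ys = [2.0 * (1.0 / p) * s for p, s in zip(params, stats["LD_pi"])]

candidates = [
    (t_name, s_name,
     r2_single_term([t(p) * s for p, s in zip(params, stats[s_name])], ys))
    for t_name, t in transforms.items()
    for s_name in stats
]
best = max(candidates, key=lambda c: c[2])
assert best[:2] == ("inverse", "LD_pi")  # recovers the generating combination
```

With real sweep data, the same loop simply iterates over every measured hardware-counter statistic and every candidate f_i, ranking combinations by R^2 exactly as described in the text.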
This can be done manually, by analyzing the data collected during section 3 and going through the parameters one by one, in order to choose the best combination of f_i and S_i for each parameter. However, as the number of architectural parameters and core variations increases, it may be necessary to automate this procedure. Hence, statistical methods for automatic regressor choice, such as the Lasso [17], become very useful. Nevertheless, because the number of architectural parameters under variation is not too large, such approaches are not strictly necessary; instead, a manual choice of regressors was performed based on the architectural analysis of subsection 3.2, which also helps to illustrate the model-making procedure. In order to validate these choices, regression analysis was used: a single-term plus intercept linear regression was performed for all possible combinations of measured runtime statistics S_i (in the case of the performance model) and a hand-picked list of transformation functions f_i over each of the architectural parameters p_i discussed in subsection 3.2, with the resulting coefficients of determination R^2 serving as a metric for selecting the best regressors for the final models.

3.5 General-case Models

With the general structure of the LRM decided, it is now possible to create general-case models for the parameters of interest for all possible core transitions, i.e., In-Order to In-Order, In-Order to Out-of-Order, Out-of-Order to In-Order, and Out-of-Order to Out-of-Order, for both performance and energy consumption.
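Before detailing the individual models, the energy regression of Eq. (5) can be sketched in isolation. In this sketch, log(EPC_tgt) is explained by the (measured or predicted) target CPI plus target-only architecture terms w_i = f_i(p_tgt); every coefficient, parameter value and transformation choice below is hypothetical:

```python
import math

# Sketch of the energy regression of Eq. (5). The w_i terms, parameter
# values and beta coefficients are all invented for illustration.

def predict_epc(cpi_tgt, w, beta):
    """Eq. (5): log(EPC_tgt) = b0 + b1*log(CPI_tgt) + sum_i b_{i+1}*w_i."""
    log_epc = beta[0] + beta[1] * math.log(cpi_tgt)
    for b_i, w_i in zip(beta[2:], w):
        log_epc += b_i * w_i
    return math.exp(log_epc)

# Two illustrative w_i terms: log of ROB size and log of L2 capacity (bytes).
w = [math.log(128), math.log(256 * 1024)]
beta = [-21.0, -0.3, 0.2, 0.35]          # b1 < 0: EPC inversely tracks CPI

epc_fast = predict_epc(cpi_tgt=0.8, w=w, beta=beta)   # high-performance run
epc_slow = predict_epc(cpi_tgt=2.0, w=w, beta=beta)   # memory-bound run
assert epc_fast > epc_slow > 0.0         # more switching per cycle, more EPC
```

The negative coefficient on log(CPI_tgt) encodes the inverse CPI/EPC correlation observed in the architectural analysis, and exponentiating the fit guarantees a positive EPC prediction.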
The performance models can then be individually trained for each source core, and used to predict the gains in performance if a thread were to switch to a different core type, while the energy models can be trained once for each of the four possible situations, and used to predict the difference in energy consumption.

3.5.1 Performance

From the parameter-sweep analysis discussed in section 3, all parameters share a shape similar to the inverse function 1/x, giving a very strong indication that this transformation will favor model development for all tested parameters. On the other hand, choosing the runtime statistic is more difficult, and it must be discussed case by case. For the Issue/Commit Width, the results suggest that a very similar speedup is attained independently of the application, leading to the hypothesis that no runtime statistic is needed. To confirm this hypothesis, a regression analysis was performed, showing that the best R^2 value is indeed obtained when no runtime statistic is used. Concerning the Load Queue size, Table 3 already showed a strong correlation between the speedup and the number of load instructions LD. The same holds for the Store Queue, except with the number of store instructions ST. Concerning the ROB size, the architectural analysis revealed a strong linear correlation with the Micro-operation (µ-op) rate. As for the cache sizes, the situation is more complex. While there is a clear correlation between the miss rate at a certain cache level and that level's speedup, the number of cache misses varies significantly as the cache size changes, a fact that compromises prediction accuracy. For example, a processor with 2048KB of L2 cache exhibits barely any cache misses when running the streamcluster application, even though that same application suffers a very large performance penalty with very low L2 cache sizes.
It is expected that high traffic at a cache level increases the probability of a performance drop if the size of that level is reduced. In fact, cache misses at level x-1 can effectively be used as a measure of traffic at level x, since a cache access at level x requires a cache miss at level x-1. As a result, the previous cache level's miss rate is a more accurate predictor of a specific cache level's performance impact. This reasoning also implies that the effect of the L1 size can be predicted using the total number of memory accesses, LD + ST. When considering the full set of micro-architecture parameters and statistics, the following simple, but still highly representative, 9-term LRM to predict performance variations between different Out-of-Order cores was obtained:

\log(\widehat{CPI}_{tgt}) = \beta_0 + \beta_1 \log(CPI_{src}) + \beta_2 \frac{1}{W} + \beta_3 \frac{\mu op_{pi}}{ROB_{size}} + \beta_4 \frac{LD_{pi}}{LQ_{size}} + \beta_5 \frac{ST_{pi}}{SQ_{size}} + \beta_6 \frac{LD_{pi} + ST_{pi}}{L1_{size}} + \beta_7 \frac{L1miss_{pi}}{L2_{size}} + \beta_8 \frac{L2miss_{pi}}{L3_{size}} (8)

Out-of-Order to Out-of-Order Performance Model

Repeating the same analysis for the performance variations between different In-Order cores, the same conclusions are reached, except that the ROB size is not a valid architectural parameter of in-order cores and, as such, disappears from the model. Hence, the following 8-term LRM is obtained:

\log(\widehat{CPI}_{tgt}) = \beta_0 + \beta_1 \log(CPI_{src}) + \beta_2 \frac{1}{W} + \beta_3 \frac{LD_{pi}}{LQ_{size}} + \beta_4 \frac{ST_{pi}}{SQ_{size}} + \beta_5 \frac{LD_{pi} + ST_{pi}}{L1_{size}} + \beta_6 \frac{L1miss_{pi}}{L2_{size}} + \beta_7 \frac{L2miss_{pi}}{L3_{size}} (9)

In-Order to In-Order Performance Model

When developing the cross-type performance models (In-Order to Out-of-Order and Out-of-Order to In-Order), most of the same conclusions hold. In particular, \mu op_{pi} should be able to explain the change in performance due to the addition or removal of out-of-order execution capability, resulting in the following 9-term LRM to predict the performance variations from In-Order cores to Out-of-Order cores:

\log(\widehat{CPI}_{tgt}) = \beta_0 + \beta_1 \log(CPI_{src}) + \beta_2 \frac{1}{W} + \beta_3 \frac{\mu op_{pi}}{ROB_{size}} + \beta_4 \frac{LD_{pi}}{LQ_{size}} + \beta_5 \frac{ST_{pi}}{SQ_{size}} + \beta_6 \frac{LD_{pi} + ST_{pi}}{L1_{size}} + \beta_7 \frac{L1miss_{pi}}{L2_{size}} + \beta_8 \frac{L2miss_{pi}}{L3_{size}} (10)

In-Order to Out-of-Order Performance Model

In the opposite direction, from Out-of-Order cores to In-Order, the model is very similar. However, since the model is trained once per source core, the ROB_{size} architectural parameter is absorbed by the \beta_i coefficients.
As a result, the \mu op_{pi}/ROB_{size} term should be replaced by just \mu op_{pi}, leading to the following 9-term LRM:

\log(\widehat{CPI}_{tgt}) = \beta_0 + \beta_1 \log(CPI_{src}) + \beta_2 \frac{1}{W} + \beta_3 \mu op_{pi} + \beta_4 \frac{LD_{pi}}{LQ_{size}} + \beta_5 \frac{ST_{pi}}{SQ_{size}} + \beta_6 \frac{LD_{pi} + ST_{pi}}{L1_{size}} + \beta_7 \frac{L1miss_{pi}}{L2_{size}} + \beta_8 \frac{L2miss_{pi}}{L3_{size}} (11)

Out-of-Order to In-Order Performance Model

Energy

The analysis for the energy models is slightly different from that for the performance models. In particular, because the models include CPI_{tgt} as a term and do not depend on the source core for any of the remaining terms, we are interested solely in the energy consumption component EPC_{base} that is not directly influenced by performance. Hence, as the architectural analysis shows that the change in energy consumption due to a change in the Load and Store Queue sizes is almost solely due to the change in performance, these parameters do not need to be included in the energy models. The cache sizes are also relatively simple, since their behavior is mostly linear, which means no transformation is required. However, things get more complex when looking at the core width and the ROB size. These parameters have a large effect on energy consumption due to a large change in performance. However, both also exhibit a small, linear increase in energy consumption as their size increases, even when the CPI does not change, which means they must be included as part of the model. Experimental evaluations showed that this works only as long as just one of these parameters varies: model accuracy falls sharply as soon as both parameters change simultaneously, with the residual distribution no longer being normally distributed. These are indicators of a strong interaction between both parameters that also needs to be modeled. In fact, if a third term w_i = W_{tgt} \cdot ROB_{size,tgt} is added in order to model this interaction, the model remains accurate even if both parameters change simultaneously.
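This interaction effect can be reproduced on synthetic data: if the baseline energy per cycle contains a genuine W times ROB_size component, an ordinary least-squares fit that includes the product regressor recovers all coefficients exactly. The sketch below uses made-up coefficients and a pure-stdlib normal-equations solver; it illustrates the modeling idea, not the trained model:

```python
# Synthetic illustration of the W x ROB_size interaction: energy per cycle
# is generated with a genuine W*ROB term (made-up coefficients), and an
# ordinary least-squares fit that includes the product regressor recovers
# all four coefficients.

def solve(A, b):
    """Gauss-Jordan elimination with partial pivoting for A x = b."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(n):
            if r != c:
                f = M[r][c] / M[c][c]
                M[r] = [a - f * v for a, v in zip(M[r], M[c])]
    return [M[i][n] / M[i][i] for i in range(n)]

def ols(X, y):
    """Least-squares coefficients via the normal equations X^T X b = X^T y."""
    k = len(X[0])
    XtX = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    return solve(XtX, Xty)

widths, robs = [1, 2, 4, 6], [32, 64, 128, 256]
grid = [(w, r) for w in widths for r in robs]
epc = [2.0 + 0.3 * w + 0.01 * r + 0.002 * w * r for w, r in grid]  # synthetic
X = [[1.0, w, r, w * r] for w, r in grid]  # intercept, W, ROB, interaction
beta = ols(X, epc)
```

Dropping the `w * r` column from `X` leaves a model that cannot fit this data exactly, mirroring the accuracy collapse observed when both parameters vary at once.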
As a result, the following 8-term LRM is obtained:

\log(\widehat{EPC}_{tgt}) = \beta_0 + \beta_1 \log(CPI_{tgt}) + \beta_2 W_{tgt} + \beta_3 ROB_{size,tgt} + \beta_4 W_{tgt} \cdot ROB_{size,tgt} + \beta_5 L1_{size,tgt} + \beta_6 L2_{size,tgt} + \beta_7 L3_{size,tgt} (12)

Out-of-Order to Out-of-Order Energy Model
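For illustration, evaluating these log-linear models reduces to a dot product between trained coefficients and a regressor vector built from the source core's per-instruction statistics and the target core's parameters. The dictionary keys and the zeroed coefficients below are hypothetical placeholders, not trained values:

```python
# Sketch of model evaluation for the Out-of-Order to Out-of-Order case:
# build the 9-term performance regressor vector and the 8-term energy
# regressor vector (with the W*ROB interaction), then exponentiate the
# dot product with trained coefficients.
import math

def dot(beta, x):
    return sum(b * v for b, v in zip(beta, x))

def perf_regressors(stats, tgt):
    """Regressors of the OoO-to-OoO performance model (intercept first)."""
    return [
        1.0,                                    # intercept
        math.log(stats["cpi_src"]),             # log(CPI_src)
        1.0 / tgt["W"],                         # inverse issue/commit width
        stats["uop_pi"] / tgt["ROB"],           # ROB pressure
        stats["ld_pi"] / tgt["LQ"],             # load-queue pressure
        stats["st_pi"] / tgt["SQ"],             # store-queue pressure
        (stats["ld_pi"] + stats["st_pi"]) / tgt["L1"],  # L1 traffic
        stats["l1miss_pi"] / tgt["L2"],         # L2 traffic (L1 misses)
        stats["l2miss_pi"] / tgt["L3"],         # L3 traffic (L2 misses)
    ]

def energy_regressors(cpi_tgt, tgt):
    """Regressors of the OoO-to-OoO energy model, with W*ROB interaction."""
    return [1.0, math.log(cpi_tgt), tgt["W"], tgt["ROB"],
            tgt["W"] * tgt["ROB"], tgt["L1"], tgt["L2"], tgt["L3"]]

def predict_cpi(beta, stats, tgt):
    return math.exp(dot(beta, perf_regressors(stats, tgt)))

def predict_epc(beta, cpi_tgt, tgt):
    return math.exp(dot(beta, energy_regressors(cpi_tgt, tgt)))
```

With the coefficient vector for log(CPI_src) set to 1 and all others to 0, the predicted CPI reduces to the source CPI, a convenient sanity check on the regressor ordering.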

TABLE 4
Considered architecture variations.
Issue/Commit Width: 1; 2; 4; 6
ROB Size (Out-of-Order only): 32; 64; 128; 256
Load/Store Queue Size: 1; 5; 10; 15; 20

In-Order cores have a very similar behavior in terms of energy, and as such the regressors are the same, except that the ROB_{size}-dependent terms disappear, as this parameter no longer exists. Accordingly, the following 6-term LRM is obtained:

\log(\widehat{EPC}_{tgt}) = \beta_0 + \beta_1 \log(CPI_{tgt}) + \beta_2 W_{tgt} + \beta_3 L1_{size,tgt} + \beta_4 L2_{size,tgt} + \beta_5 L3_{size,tgt} (13)

In-Order to In-Order Energy Model

The cross-type models (In-Order to Out-of-Order and Out-of-Order to In-Order) are relatively simple as well. The architectural analysis showed that the difference in energy consumption between in-order and out-of-order cores is mostly independent of the applications under test, a fact that is confirmed by automated regression analysis. As a result, the two following LRMs are obtained, with 8 terms for the In-Order to Out-of-Order case, and 6 terms in the opposite direction:

\log(\widehat{EPC}_{tgt}) = \beta_0 + \beta_1 \log(CPI_{tgt}) + \beta_2 W_{tgt} + \beta_3 ROB_{size,tgt} + \beta_4 W_{tgt} \cdot ROB_{size,tgt} + \beta_5 L1_{size,tgt} + \beta_6 L2_{size,tgt} + \beta_7 L3_{size,tgt} (14)

In-Order to Out-of-Order Energy Model

\log(\widehat{EPC}_{tgt}) = \beta_0 + \beta_1 \log(CPI_{tgt}) + \beta_2 W_{tgt} + \beta_3 L1_{size,tgt} + \beta_4 L2_{size,tgt} + \beta_5 L3_{size,tgt} (15)

Out-of-Order to In-Order Energy Model

4 EXPERIMENTAL RESULTS

In the previous section, a set of architectures was analyzed, and an architecture-agnostic method for creating performance and energy prediction models was devised. In what follows, these models are tested and validated, in order to evaluate their predictive accuracy.
4.1 Experimental Methodology and Setup

Based on the methodology used for the architectural analysis in the previous section (see subsection 3.2), several micro-architecture and cache organization parameters were varied for both in-order and out-of-order cores, as depicted in Tables 4 and 5. Accordingly, a total of 500 different core variations were simulated, corresponding to 400 out-of-order and 100 in-order cores, allowing an effective coverage of the interaction between all the considered parameters.

TABLE 5
Considered cache hierarchy variations (associativity and total size in KB). The block size was set fixed and equal to 64 Bytes, as commonly used by Intel processors.
Name: Tiny; Small; Medium; Large; Huge (each defining the associativity and size of the L1-I & L1-D, L2 and L3 caches)

5 MODELS ASSESSMENT

To demonstrate the quality and accuracy of the models proposed in section 3.5, they were first assessed on a full range of architectures, by considering the multiple core parameterizations depicted in Tables 4 and 5. Since the model assumes the representativeness of the training set for all possible applications and cores, it makes sense to use as much information as possible during its testing and validation. Therefore, a leave-one-out cross-validation approach was adopted, such that one application was removed from the training set in each iteration, and subsequently used for model validation. Furthermore, in section 2 it was mentioned that commercial heterogeneous architectures such as the ARM big.LITTLE [3] processor typically have only two types of cores: one small In-Order core, and one large Out-of-Order core. Accordingly, to properly perform model validation, several heterogeneous architectures are herein considered, namely a heterogeneous many-core system composed of the previously mentioned 500 cores, and a more realistic heterogeneous system composed of only a big and a little core, consisting of the largest and smallest core variations within the set of 500 simulated cores, respectively.
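The leave-one-out cross-validation procedure described above can be sketched generically; `train` and `predict_error` below are hypothetical stand-ins for the actual model-fitting and validation steps:

```python
# Leave-one-out cross-validation sketch: each application is held out in
# turn, the model is trained on the remaining ones, and the held-out
# application is used for validation.
def leave_one_out(applications, train, predict_error):
    errors = {}
    for held_out in applications:
        model = train([a for a in applications if a != held_out])
        errors[held_out] = predict_error(model, held_out)
    return errors
```

Because every application serves as the validation set exactly once, the reported accuracy always refers to applications the model was not trained on.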
Additionally, current scheduling systems are more interested in the minimization of the Energy-Delay Product (EDP) or of the PDP than of the EPC, since minimizing the PDP is the same as minimizing the total energy consumed by the processor for a certain workload. To perform such an evaluation, it is possible to combine the results of the proposed performance and energy models and easily calculate an estimate of the PDP, by using the equation

\widehat{PDP}_{tgt} = \widehat{EPC}_{tgt} \cdot \widehat{CPI}_{tgt} \cdot I, (16)

where I is the number of executed instructions and \widehat{EPC}_{tgt} was obtained using the CPI_{tgt} value estimated by the performance model. Figure 5 presents the normalized predictions for the performance (CPI), energy (EPC) and combined PDP models, represented as three Tukey box-plots for each metric. It can be observed that the model provides accurate predictions over a wide range of architectures for all considered core parameters, with all three quartiles (25th percentile, median, and 75th percentile) ranging between 75% and 125% of the observed value in all cases. Furthermore, in the dual-core small in-order plus big out-of-order case, most observations show an estimation error lower than 10%, and all three quartiles range between 95% and 105% of the observed value for all cases.
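The PDP combination above translates directly into code: the estimate is the product of the predicted energy per cycle, the predicted cycles per instruction, and the instruction count. A trivial sketch with illustrative values, including the normalization used for the box-plot style evaluation:

```python
# PDP estimate sketch: predicted energy-per-cycle times predicted
# cycles-per-instruction times instruction count gives the total energy,
# i.e. the power-delay product.
def pdp_estimate(epc_hat, cpi_hat, instructions):
    return epc_hat * cpi_hat * instructions

def normalized(prediction, observed):
    """Prediction relative to the observed value (1.0 = perfect)."""
    return prediction / observed
```

Feeding the energy model with the CPI estimated by the performance model, rather than the observed CPI, is what makes the PDP estimate usable before any migration takes place.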

Fig. 5. Tukey boxplots of the proposed models for CPI, EPC and PDP (normalized by the observed values), for the OoO to OoO, IO to IO, IO to OoO and OoO to IO transitions and for the small IO + big OoO system.

TABLE 6
Scheduler validation of the performance and energy models with N randomly selected core variations to pick from (including the source core), when attempting to minimize CPI, EPC or PDP, compared to a random scheduler (expected value when picking a random core) and an oracle scheduler (always picks the best core).
CPI, Rel. Error vs. Oracle: 2.08%; 4.29%; 4.69%; 3.33%; 1.67%
EPC, Rel. Error vs. Oracle: 0.38%; 0.79%; 1.48%; 2.4%; 2.57%
PDP, Rel. Error vs. Oracle: 1.45%; 2.53%; 4.45%; 4.50%; 5.53%

5.1 Application to System Schedulers

To further evaluate whether the proposed models can effectively predict the most efficient core for each application, a scheduler-specific validation test was performed. For each iteration of the test, a permutation of one source core and N alternative cores was picked at random. A leave-one-out cross-validation approach is adopted, where, for each iteration, one application is removed from the training set, and the models are re-trained for the specific combination of N cores. The resulting models are then used to predict the best core for that application, by minimizing CPI, EPC or PDP, out of the N possible choices. The observed value (CPI, EPC or PDP) in the chosen core was then compared with that of the core selected by a scheduler using either a random policy, where cores are picked at random, or an oracle policy, where the best core is always chosen. The results, presented in Table 6, show that the model manages to estimate the correct core in a large majority of the cases, even when there are many different core variations to pick from.
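The scheduler-specific validation can be sketched as follows: the proposed policy picks the core with the lowest predicted metric, the oracle picks the lowest observed one, and the random policy's expected value is the mean over all candidates. The core names and metric values below are illustrative:

```python
# Scheduler validation sketch: compare the observed metric (CPI, EPC or
# PDP) of the core chosen from model predictions against an oracle policy
# (always the best observed core) and the expectation of a random policy.
from statistics import mean

def schedule(cores, predicted, observed):
    """Return (proposed, oracle, random_expected, rel_error_vs_oracle)."""
    proposed = observed[min(cores, key=lambda c: predicted[c])]
    oracle = min(observed[c] for c in cores)
    random_expected = mean(observed[c] for c in cores)
    rel_error = (proposed - oracle) / oracle
    return proposed, oracle, random_expected, rel_error
```

When the model ranks the best core first, the relative error versus the oracle is zero; a mis-ranking costs only the gap between the chosen and the best core, which is what the Rel. Error rows in Tables 6 and 7 report.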
Furthermore, the results also show that, when the proposed models make an incorrect guess, only a small performance or energy consumption loss is observed when compared to the oracle case. All in all, the proposed models are able to correctly predict the best cores for the various applications, leading to average CPI, EPC and PDP differences lower than 4.69%, 2.57% and 5.53%, respectively, when compared to an oracle scheduling solution, even with over 100 different core variations to choose from. Reduced numbers of available core architectures provide even better results, for example PDP differences compared to the oracle case of 1.45% or 2.53% when considering 2 and 6 core variations, respectively. The same validation step was repeated for the four considered Small + Big dual-core systems, and the results are depicted in Table 7.

TABLE 7
Scheduler validation of the performance and energy models on various dual-core Small + Big systems with In-Order (IO) and Out-of-Order (OoO) cores, when attempting to minimize CPI, EPC or PDP, compared to a random scheduler (expected value when picking a random core) and an oracle scheduler (always picks the best core).
Architectures: Small IO + Big OoO; Big IO + Big OoO; Small OoO + Big OoO; Small IO + Big IO
CPI, Rel. Error vs. Oracle: 0.00% for all four systems
EPC, Rel. Error vs. Oracle: 0.00% for all four systems
PDP, Rel. Error vs. Oracle: 0.6%; 0.00%; 0.55%; 0.2%

Fig. 6. Relative Error vs. Oracle of the proposed performance model (0.9%), compared with PIE [5] (5.32%) and a random scheduler (37.84%).

Fig. 7. Relative Error vs. Oracle of the proposed energy model (0.24%), compared with Pricopi et al. [6] (4.63%) and a random scheduler (48.5%).
From the results, it can be concluded that the correct core was chosen every single time when attempting to minimize CPI or EPC, and that the loss in PDP relative to the oracle scheduler is, in the worst case, as low as 0.6%.

5.2 Comparison with State-of-the-Art Models

To further demonstrate the quality of the results, the proposed models were compared with two state-of-the-art models, namely the Performance Impact Estimation (PIE) performance model [5] and the energy model proposed by Pricopi et al. [6]. For such purposes, the previously described scheduler-specific validation process was used.

To guarantee fairness, only the conditions considered by the corresponding authors of [5], [6] were used in each comparison. Hence, the existence of a dual-core processor is assumed (N = 2), where one core is In-Order and the other is Out-of-Order. In addition, since the PIE model [5] only considers the Core Width (W) and the ROB size (ROB_{size}) architectural parameters, all other architectural parameters were held constant during the comparison. On the other hand, the energy model proposed by Pricopi et al. [6] uses a mixture of off-line and on-line analysis, with the off-line portion being used to predict the cache access profile at the target core. Due to the complexity of implementing the off-line model, the comparison was performed by using the cache access statistics measured directly on the target core, as if the off-line portion were always 100% accurate in the estimation of such statistics. The results of both comparisons are shown in Figures 6 and 7, where it can be seen that the PIE model [5] exhibits a relative error versus the oracle scheduler approximately 7 times larger than that of the proposed models, and the Pricopi et al. [6] energy model approximately 9 times larger, serving as further proof of the quality and predictive power of the proposed models.

6 CONCLUSION

The key contribution of this work is the proposed modeling methodology, able to successfully derive performance and energy prediction models that can accurately predict the most efficient core out of hundreds of different core variations at once, without requiring a high-overhead online sampling procedure or offline static analysis. The presented results show that the obtained models are suitable for use by scheduling agents as a means of predicting the best task-to-core mapping that maximizes performance while minimizing power consumption.
These include, for example, those in an operating system running on a heterogeneous processor, large cluster infrastructures with many different processors, or system hypervisors able to apply core morphing techniques to predict and adapt to the resource demands of each application. A representative set of performance and energy models was developed using the proposed methodology and, over the 500 considered core variations, all the obtained models exhibit a very high coefficient of determination R^2 (as high as 0.95), meaning that there is a very high goodness of fit. Additionally, a scheduler validation step was executed, in which the proposed models were pitted against an oracle and a random scheduler in an attempt to minimize the CPI, the EPC or the Power-Delay Product (PDP). When compared to the oracle scheduler, the models showed a prediction error of only 5.53%, even when over a hundred core variations were considered. Lower numbers of core types present even better results; for example, an error of only 2.53% is attained for a system with 6 different core variations. Additionally, since current commercial heterogeneous processors only contain two different core variations, this specific situation was analyzed in more detail. In particular, it was shown that the proposed models are able to predict CPI and EPC with very high accuracy, with a large majority of the observations showing a CPI, EPC and PDP prediction error considerably smaller than 10%. In addition, the scheduler validation step was repeated specifically for the dual-architecture case, and the proposed models were shown to always correctly minimize CPI and EPC. As for the PDP, a very small error of 0.6% was observed when compared to the oracle case.

REFERENCES

[1] R. Kumar, K. I. Farkas, N. P. Jouppi, P. Ranganathan, and D. M. Tullsen, "Single-ISA heterogeneous multi-core architectures: The potential for processor power reduction," in Proceedings of the 36th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO 36). Washington, DC, USA: IEEE Computer Society, 2003.
[2] R. Kumar, D. M. Tullsen, P. Ranganathan, N. P. Jouppi, and K. I. Farkas, "Single-ISA heterogeneous multi-core architectures for multithreaded workload performance," SIGARCH Computer Architecture News, vol. 32, no. 2, Mar. 2004.
[3] "big.LITTLE Technology: The Future of Mobile," ARM, Tech. Rep., 2011.
[4] "NVIDIA Tegra 4 Family CPU Architecture: 4-PLUS-1 Quad core," NVIDIA, Tech. Rep., 2013.
[5] K. Van Craeynest, A. Jaleel, L. Eeckhout, P. Narvaez, and J. Emer, "Scheduling heterogeneous multi-cores through performance impact estimation (PIE)," in Proceedings of the 39th Annual International Symposium on Computer Architecture (ISCA '12). Washington, DC, USA: IEEE Computer Society, 2012.
[6] M. Pricopi, T. S. Muthukaruppan, V. Venkataramani, T. Mitra, and S. Vishin, "Power-performance modeling on asymmetric multi-cores," in 2013 International Conference on Compilers, Architecture and Synthesis for Embedded Systems (CASES), Sept. 2013.
[7] Y. Kora, K. Yamaguchi, and H. Ando, "MLP-aware dynamic instruction window resizing for adaptively exploiting both ILP and MLP," in Proceedings of the 46th Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-46). New York, NY, USA: ACM, 2013.
[8] J. Cong and B. Yuan, "Energy-efficient scheduling on heterogeneous multi-core architectures," in Proceedings of the 2012 ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED '12). New York, NY, USA: ACM, 2012.
[9] G. Patsilaras, N. K. Choudhary, and J. Tuck, "Efficiently exploiting memory level parallelism on asymmetric coupled cores in the dark silicon era," ACM Transactions on Architecture and Code Optimization (TACO), vol. 8, no. 4, Jan. 2012.
[10] J. C. Saez, M. Prieto, A. Fedorova, and S. Blagodurov, "A comprehensive scheduler for asymmetric multicore systems," in Proceedings of the 5th European Conference on Computer Systems (EuroSys '10). New York, NY, USA: ACM, 2010.
[11] W. L. Bircher and L. K. John, "Complete system power estimation: A trickle-down approach based on performance events," in IEEE International Symposium on Performance Analysis of Systems & Software (ISPASS 2007), Apr. 2007.
[12] T. E. Carlson, W. Heirman, S. Eyerman, I. Hur, and L. Eeckhout, "An evaluation of high-level mechanistic core models," ACM Transactions on Architecture and Code Optimization (TACO), 2014.
[13] C. Bienia, "Benchmarking modern multiprocessors," Ph.D. dissertation, Princeton University, Princeton, NJ, USA, Jan. 2011.
[14] S. Li, J. H. Ahn, R. Strong, J. Brockman, D. Tullsen, and N. Jouppi, "McPAT: An integrated power, area, and timing modeling framework for multicore and manycore architectures," in 42nd Annual IEEE/ACM International Symposium on Microarchitecture (MICRO-42), Dec. 2009.
[15] A. J. Dobson and A. Barnett, An Introduction to Generalized Linear Models. CRC Press.
[16] G. A. F. Seber and A. J. Lee, Linear Regression Analysis, 2nd ed. John Wiley & Sons, Inc.
[17] R. Tibshirani, "Regression shrinkage and selection via the lasso," Journal of the Royal Statistical Society, Series B (Methodological), 1996.


Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS

Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Energy Efficient Computing Systems (EECS) Magnus Jahre Coordinator, EECS Who am I? Education Master of Technology, NTNU, 2007 PhD, NTNU, 2010. Title: «Managing Shared Resources in Chip Multiprocessor Memory

More information

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases

A Cool Scheduler for Multi-Core Systems Exploiting Program Phases IEEE TRANSACTIONS ON COMPUTERS, VOL. 63, NO. 5, MAY 2014 1061 A Cool Scheduler for Multi-Core Systems Exploiting Program Phases Zhiming Zhang and J. Morris Chang, Senior Member, IEEE Abstract Rapid growth

More information

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University

Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Computer Architecture: Multi-Core Processors: Why? Prof. Onur Mutlu Carnegie Mellon University Moore s Law Moore, Cramming more components onto integrated circuits, Electronics, 1965. 2 3 Multi-Core Idea:

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13

Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Computer Architecture: Multi-Core Processors: Why? Onur Mutlu & Seth Copen Goldstein Carnegie Mellon University 9/11/13 Moore s Law Moore, Cramming more components onto integrated circuits, Electronics,

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Method-Level Phase Behavior in Java Workloads

Method-Level Phase Behavior in Java Workloads Method-Level Phase Behavior in Java Workloads Andy Georges, Dries Buytaert, Lieven Eeckhout and Koen De Bosschere Ghent University Presented by Bruno Dufour dufour@cs.rutgers.edu Rutgers University DCS

More information

2 TEST: A Tracer for Extracting Speculative Threads

2 TEST: A Tracer for Extracting Speculative Threads EE392C: Advanced Topics in Computer Architecture Lecture #11 Polymorphic Processors Stanford University Handout Date??? On-line Profiling Techniques Lecture #11: Tuesday, 6 May 2003 Lecturer: Shivnath

More information

Hardware-based Speculation

Hardware-based Speculation Hardware-based Speculation Hardware-based Speculation To exploit instruction-level parallelism, maintaining control dependences becomes an increasing burden. For a processor executing multiple instructions

More information

PARSEC vs. SPLASH-2: A Quantitative Comparison of Two Multithreaded Benchmark Suites

PARSEC vs. SPLASH-2: A Quantitative Comparison of Two Multithreaded Benchmark Suites PARSEC vs. SPLASH-2: A Quantitative Comparison of Two Multithreaded Benchmark Suites Christian Bienia (Princeton University), Sanjeev Kumar (Intel), Kai Li (Princeton University) Outline Overview What

More information

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200

More information

Applications Classification and Scheduling on Heterogeneous HPC Systems Using Experimental Research

Applications Classification and Scheduling on Heterogeneous HPC Systems Using Experimental Research Applications Classification and Scheduling on Heterogeneous HPC Systems Using Experimental Research Yingjie Xia 1, 2, Mingzhe Zhu 2, 3, Li Kuang 2, Xiaoqiang Ma 3 1 Department of Automation School of Electronic

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

COMPUTER ORGANIZATION AND DESIGN

COMPUTER ORGANIZATION AND DESIGN COMPUTER ORGANIZATION AND DESIGN 5 Edition th The Hardware/Software Interface Chapter 4 The Processor 4.1 Introduction Introduction CPU performance factors Instruction count CPI and Cycle time Determined

More information

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS

ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS ZSIM: FAST AND ACCURATE MICROARCHITECTURAL SIMULATION OF THOUSAND-CORE SYSTEMS DANIEL SANCHEZ MIT CHRISTOS KOZYRAKIS STANFORD ISCA-40 JUNE 27, 2013 Introduction 2 Current detailed simulators are slow (~200

More information

Meet the Walkers! Accelerating Index Traversals for In-Memory Databases"

Meet the Walkers! Accelerating Index Traversals for In-Memory Databases Meet the Walkers! Accelerating Index Traversals for In-Memory Databases Onur Kocberber Boris Grot, Javier Picorel, Babak Falsafi, Kevin Lim, Parthasarathy Ranganathan Our World is Data-Driven! Data resides

More information

Demand fetching is commonly employed to bring the data

Demand fetching is commonly employed to bring the data Proceedings of 2nd Annual Conference on Theoretical and Applied Computer Science, November 2010, Stillwater, OK 14 Markov Prediction Scheme for Cache Prefetching Pranav Pathak, Mehedi Sarwar, Sohum Sohoni

More information

CS 2410 Mid term (fall 2018)

CS 2410 Mid term (fall 2018) CS 2410 Mid term (fall 2018) Name: Question 1 (6+6+3=15 points): Consider two machines, the first being a 5-stage operating at 1ns clock and the second is a 12-stage operating at 0.7ns clock. Due to data

More information

CMSC411 Fall 2013 Midterm 2 Solutions

CMSC411 Fall 2013 Midterm 2 Solutions CMSC411 Fall 2013 Midterm 2 Solutions 1. (12 pts) Memory hierarchy a. (6 pts) Suppose we have a virtual memory of size 64 GB, or 2 36 bytes, where pages are 16 KB (2 14 bytes) each, and the machine has

More information

CS / ECE 6810 Midterm Exam - Oct 21st 2008

CS / ECE 6810 Midterm Exam - Oct 21st 2008 Name and ID: CS / ECE 6810 Midterm Exam - Oct 21st 2008 Notes: This is an open notes and open book exam. If necessary, make reasonable assumptions and clearly state them. The only clarifications you may

More information

Lecture 1: Introduction

Lecture 1: Introduction Contemporary Computer Architecture Instruction set architecture Lecture 1: Introduction CprE 581 Computer Systems Architecture, Fall 2016 Reading: Textbook, Ch. 1.1-1.7 Microarchitecture; examples: Pipeline

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

Neural Network based Energy-Efficient Fault Tolerant Architect

Neural Network based Energy-Efficient Fault Tolerant Architect Neural Network based Energy-Efficient Fault Tolerant Architectures and Accelerators University of Rochester February 7, 2013 References Flexible Error Protection for Energy Efficient Reliable Architectures

More information

Utilizing Concurrency: A New Theory for Memory Wall

Utilizing Concurrency: A New Theory for Memory Wall Utilizing Concurrency: A New Theory for Memory Wall Xian-He Sun (&) and Yu-Hang Liu Illinois Institute of Technology, Chicago, USA {sun,yuhang.liu}@iit.edu Abstract. In addition to locality, data access

More information

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs

To Use or Not to Use: CPUs Cache Optimization Techniques on GPGPUs To Use or Not to Use: CPUs Optimization Techniques on GPGPUs D.R.V.L.B. Thambawita Department of Computer Science and Technology Uva Wellassa University Badulla, Sri Lanka Email: vlbthambawita@gmail.com

More information

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures

A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures A Comparative Performance Evaluation of Different Application Domains on Server Processor Architectures W.M. Roshan Weerasuriya and D.N. Ranasinghe University of Colombo School of Computing A Comparative

More information

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores A Configurable Multi-Ported Register File Architecture for Soft Processor Cores Mazen A. R. Saghir and Rawan Naous Department of Electrical and Computer Engineering American University of Beirut P.O. Box

More information

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri

Department of Computer and IT Engineering University of Kurdistan. Computer Architecture Pipelining. By: Dr. Alireza Abdollahpouri Department of Computer and IT Engineering University of Kurdistan Computer Architecture Pipelining By: Dr. Alireza Abdollahpouri Pipelined MIPS processor Any instruction set can be implemented in many

More information

Exploitation of instruction level parallelism

Exploitation of instruction level parallelism Exploitation of instruction level parallelism Computer Architecture J. Daniel García Sánchez (coordinator) David Expósito Singh Francisco Javier García Blas ARCOS Group Computer Science and Engineering

More information

CS425 Computer Systems Architecture

CS425 Computer Systems Architecture CS425 Computer Systems Architecture Fall 2017 Thread Level Parallelism (TLP) CS425 - Vassilis Papaefstathiou 1 Multiple Issue CPI = CPI IDEAL + Stalls STRUC + Stalls RAW + Stalls WAR + Stalls WAW + Stalls

More information

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago

CMSC Computer Architecture Lecture 12: Multi-Core. Prof. Yanjing Li University of Chicago CMSC 22200 Computer Architecture Lecture 12: Multi-Core Prof. Yanjing Li University of Chicago Administrative Stuff! Lab 4 " Due: 11:49pm, Saturday " Two late days with penalty! Exam I " Grades out on

More information

Understanding Outstanding Memory Request Handling Resources in GPGPUs

Understanding Outstanding Memory Request Handling Resources in GPGPUs Understanding Outstanding Memory Request Handling Resources in GPGPUs Ahmad Lashgar ECE Department University of Victoria lashgar@uvic.ca Ebad Salehi ECE Department University of Victoria ebads67@uvic.ca

More information

ARCHITECTS use cycle-accurate simulators to accurately

ARCHITECTS use cycle-accurate simulators to accurately IEEE TRANSACTIONS ON COMPUTERS, VOL. 60, NO. 10, OCTOBER 2011 1445 An Empirical Architecture-Centric Approach to Microarchitectural Design Space Exploration Christophe Dubach, Timothy M. Jones, and Michael

More information

Power Struggles: Revisiting the RISC vs. CISC Debate on Contemporary ARM and x86 Architectures

Power Struggles: Revisiting the RISC vs. CISC Debate on Contemporary ARM and x86 Architectures Power Struggles: Revisiting the RISC vs. CISC Debate on Contemporary ARM and x86 Architectures by Ernily Blern, laikrishnan Menon, and Karthikeyan Sankaralingarn Danilo Dominguez Perez danilo0@iastate.edu

More information

Performance of Multicore LUP Decomposition

Performance of Multicore LUP Decomposition Performance of Multicore LUP Decomposition Nathan Beckmann Silas Boyd-Wickizer May 3, 00 ABSTRACT This paper evaluates the performance of four parallel LUP decomposition implementations. The implementations

More information

Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems

Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems Exploring the Throughput-Fairness Trade-off on Asymmetric Multicore Systems J.C. Sáez, A. Pousa, F. Castro, D. Chaver y M. Prieto Complutense University of Madrid, Universidad Nacional de la Plata-LIDI

More information

High Performance SMIPS Processor

High Performance SMIPS Processor High Performance SMIPS Processor Jonathan Eastep 6.884 Final Project Report May 11, 2005 1 Introduction 1.1 Description This project will focus on producing a high-performance, single-issue, in-order,

More information

Operating Systems Unit 6. Memory Management

Operating Systems Unit 6. Memory Management Unit 6 Memory Management Structure 6.1 Introduction Objectives 6.2 Logical versus Physical Address Space 6.3 Swapping 6.4 Contiguous Allocation Single partition Allocation Multiple Partition Allocation

More information

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE

CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE 143 CHAPTER 6 STATISTICAL MODELING OF REAL WORLD CLOUD ENVIRONMENT FOR RELIABILITY AND ITS EFFECT ON ENERGY AND PERFORMANCE 6.1 INTRODUCTION This chapter mainly focuses on how to handle the inherent unreliability

More information

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables

Storage Efficient Hardware Prefetching using Delta Correlating Prediction Tables Storage Efficient Hardware Prefetching using Correlating Prediction Tables Marius Grannaes Magnus Jahre Lasse Natvig Norwegian University of Science and Technology HiPEAC European Network of Excellence

More information

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors Murali Jayapala 1, Francisco Barat 1, Pieter Op de Beeck 1, Francky Catthoor 2, Geert Deconinck 1 and Henk Corporaal

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

Evaluation of Branch Prediction Strategies

Evaluation of Branch Prediction Strategies 1 Evaluation of Branch Prediction Strategies Anvita Patel, Parneet Kaur, Saie Saraf Department of Electrical and Computer Engineering Rutgers University 2 CONTENTS I Introduction 4 II Related Work 6 III

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores

Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Performance and Power Impact of Issuewidth in Chip-Multiprocessor Cores Magnus Ekman Per Stenstrom Department of Computer Engineering, Department of Computer Engineering, Outline Problem statement Assumptions

More information

TECHNOLOGY scaling has enabled greater integration because

TECHNOLOGY scaling has enabled greater integration because Thread Progress Equalization: Dynamically Adaptive Power and Performance Optimization of Multi-threaded Applications Yatish Turakhia, Guangshuo Liu 2, Siddharth Garg 3, and Diana Marculescu 2 arxiv:603.06346v

More information

A Study on Performance Benefits of Core Morphing in an Asymmetric Multicore Processor

A Study on Performance Benefits of Core Morphing in an Asymmetric Multicore Processor A Study on Performance Benefits of Core Morphing in an Asymmetric Multicore Processor Anup Das, Rance Rodrigues, Israel Koren and Sandip Kundu Department of Electrical and Computer Engineering University

More information

A Mechanism for Verifying Data Speculation

A Mechanism for Verifying Data Speculation A Mechanism for Verifying Data Speculation Enric Morancho, José María Llabería, and Àngel Olivé Computer Architecture Department, Universitat Politècnica de Catalunya (Spain), {enricm, llaberia, angel}@ac.upc.es

More information

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs

High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs High-Performance Microarchitecture Techniques John Paul Shen Director of Microarchitecture Research Intel Labs October 29, 2002 Microprocessor Research Forum Intel s Microarchitecture Research Labs! USA:

More information

Sampled Simulation of Multi-Threaded Applications

Sampled Simulation of Multi-Threaded Applications Sampled Simulation of Multi-Threaded Applications Trevor E. Carlson, Wim Heirman, Lieven Eeckhout Department of Electronics and Information Systems, Ghent University, Belgium Intel ExaScience Lab, Belgium

More information

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two

Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Cache Performance, System Performance, and Off-Chip Bandwidth... Pick any Two Bushra Ahsan and Mohamed Zahran Dept. of Electrical Engineering City University of New York ahsan bushra@yahoo.com mzahran@ccny.cuny.edu

More information

Reconfigurable Multicore Server Processors for Low Power Operation

Reconfigurable Multicore Server Processors for Low Power Operation Reconfigurable Multicore Server Processors for Low Power Operation Ronald G. Dreslinski, David Fick, David Blaauw, Dennis Sylvester, Trevor Mudge University of Michigan, Advanced Computer Architecture

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Walking Four Machines by the Shore

Walking Four Machines by the Shore Walking Four Machines by the Shore Anastassia Ailamaki www.cs.cmu.edu/~natassa with Mark Hill and David DeWitt University of Wisconsin - Madison Workloads on Modern Platforms Cycles per instruction 3.0

More information

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design

Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Performance of Multithreaded Chip Multiprocessors and Implications for Operating System Design Based on papers by: A.Fedorova, M.Seltzer, C.Small, and D.Nussbaum Pisa November 6, 2006 Multithreaded Chip

More information

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design

ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design ENGN1640: Design of Computing Systems Topic 06: Advanced Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown University

More information

ANALYSIS OF A PARALLEL LEXICAL-TREE-BASED SPEECH DECODER FOR MULTI-CORE PROCESSORS

ANALYSIS OF A PARALLEL LEXICAL-TREE-BASED SPEECH DECODER FOR MULTI-CORE PROCESSORS 17th European Signal Processing Conference (EUSIPCO 2009) Glasgow, Scotland, August 24-28, 2009 ANALYSIS OF A PARALLEL LEXICAL-TREE-BASED SPEECH DECODER FOR MULTI-CORE PROCESSORS Naveen Parihar Dept. of

More information

Symbiotic Job Scheduling on the IBM POWER8

Symbiotic Job Scheduling on the IBM POWER8 Symbiotic Job Scheduling on the IBM POWER8 Josué Feliu, Stijn Eyerman 2, Julio Sahuquillo, and Salvador Petit Dept. of Computer Engineering (DISCA), Universitat Politècnica de València, València, Spain

More information

Caching Basics. Memory Hierarchies

Caching Basics. Memory Hierarchies Caching Basics CS448 1 Memory Hierarchies Takes advantage of locality of reference principle Most programs do not access all code and data uniformly, but repeat for certain data choices spatial nearby

More information

Pipelining to Superscalar

Pipelining to Superscalar Pipelining to Superscalar ECE/CS 752 Fall 207 Prof. Mikko H. Lipasti University of Wisconsin-Madison Pipelining to Superscalar Forecast Limits of pipelining The case for superscalar Instruction-level parallel

More information

CS 136: Advanced Architecture. Review of Caches

CS 136: Advanced Architecture. Review of Caches 1 / 30 CS 136: Advanced Architecture Review of Caches 2 / 30 Why Caches? Introduction Basic goal: Size of cheapest memory... At speed of most expensive Locality makes it work Temporal locality: If you

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS433 Homework 2 (Chapter 3) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies

More information

Execution-based Prediction Using Speculative Slices

Execution-based Prediction Using Speculative Slices Execution-based Prediction Using Speculative Slices Craig Zilles and Guri Sohi University of Wisconsin - Madison International Symposium on Computer Architecture July, 2001 The Problem Two major barriers

More information

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference

Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference The 2017 IEEE International Symposium on Workload Characterization Performance Characterization, Prediction, and Optimization for Heterogeneous Systems with Multi-Level Memory Interference Shin-Ying Lee

More information

45-year CPU Evolution: 1 Law -2 Equations

45-year CPU Evolution: 1 Law -2 Equations 4004 8086 PowerPC 601 Pentium 4 Prescott 1971 1978 1992 45-year CPU Evolution: 1 Law -2 Equations Daniel Etiemble LRI Université Paris Sud 2004 Xeon X7560 Power9 Nvidia Pascal 2010 2017 2016 Are there

More information

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor

CPI < 1? How? What if dynamic branch prediction is wrong? Multiple issue processors: Speculative Tomasulo Processor 1 CPI < 1? How? From Single-Issue to: AKS Scalar Processors Multiple issue processors: VLIW (Very Long Instruction Word) Superscalar processors No ISA Support Needed ISA Support Needed 2 What if dynamic

More information