Predicting x86 Program Runtime for Intel Processor

Size: px

Start display at page:

Download "Predicting x86 Program Runtime for Intel Processor"

Cecil Lenard Woods
5 years ago
Views:

1 Predicting x86 Progra Runtie for Intel Processor Behra Mistree, Haidreza Haki Javadi, Oid Mashayekhi Stanford University 1 Introduction Progras can be equivalent: given the sae set of inputs, they always produce the sae set of outputs. Superoptiizing copilers leverage this notion of equivalence, substituting one progra (or section of progra) with an equivalent progra that runs faster. Recent work by Schkufza attepts to build a superoptiizing copiler that intelligently searches a progra space coposed of all equivalent sets of instructions [3]. This work relies on being able to predict a progra s runtie given its progra text. This report explores several runtie prediction odels. Specifically, given a set of loop-free x86 input instructions, our project predicts how long a processor takes to execute those instructions. Section 2 discusses predicting runties for fixed length progras and shows that kernelized regression odels can predict progra runtie with 12% average error copared to a gold-standard tool [1]. Following these results, Section 3 proposes and evaluates a generative odel for predicting runties for progras with arbitrary nubers of instructions. Finally, Section 4 provides a perforance analysis of the proposed predicter. 2 Fixed length progras This section builds odels to predict runties for progras coposed of a fixed nubers of instructions. Section 2.1 shows the perforance of a siple linear regression and 2.2 expand this odel using Gaussian and polynoial kernels. 2.1 Linear regression Figure 1 shows the results of applying linear regression on randoly generated progras of length 1, 2, 3, and 4 1. Each progra feature set is a vector containing the aount of tie that each instruction would otherwise take to run by itself. These ties range fro 1 to 47 cycles. Feature weights are learned fro a training set of 1, progras, each, and results are plotted against a validation set of 2, progras each. Noralized error is calculated following: NoralizedError = pred iaca iaca 1 Section 4.1 extends this to real progras disassebled fro objdup. (1) Absolute value noralized error.6 Progra length (instructions) Figure 1: Noralized error for randoly generated progras using linear regression across progras of different lengths. where iaca is the nuber of cycles a gold standard, proprietary tool [1] predicts as progra runtie. By itself, linear regression perfors well: across all progra lengths tested, linear regression produces an average noralized error of only.158. The results in Figure 1 suggest an interesting trend: as progras increase in length, a basic linear regression works better and better. In soe ways, this is surprising. Althgouh the training set for each experient shown in Figure 1 is the sae size, the size of the space of potential progras that each odel tries to learn is very different: the space of progras of length 2 is roughly 15 1 larger than the space of progras of length 1 (there are approxiately 15 separate x86 instructions). This eans that for increasing progra lengths, each regression learns over a training set that is ore likely to be less and less representative of what it is trying to predict, i.e., for increasing progra lengths, we should expect ore error fro variance. Table 1 plots noralized average error for progras of length 1, 2, 3, and 4 as training set size increases. Note that after 1 saples, increases in training set size have alost no ipact on test set error. This suggests that the generalization error shown in Figure 1 is less a atter of under-representativeness of the training set and potentially ore a atter of bias in the underlying regression.

2 Prog S tr = 1 5 1k 2k 5k 1k 15k Len Table 1: Noralized error as training set size increases. 2.2 Kernelized regression The previous subsection argued that error fro linear regression was likely attributable to odel bias. Kernelizing the regression can reduce odel bias: whereas linear regression only captures linear dependencies between the features and the output of the odel, kernelizing can capture ore coplex relationships. We use the Kernel trick [6], replacing each original feature vector, x (i), with a (potentially) high-diensional feature vector, Φ(x (i) ) containing nonlinear functions of the input features, and then we solve linear regression using these high diensional vectors as our new feature vectors. Therefore, we solve: in. θ (θ Φ(x (i) ) y (i) ) 2 One issue in doing regression (especially nonlinear regressions) is that the solution to the regression proble can be a very coplex and non-soothe function, which can cause high variance. In order to overcoe this proble, we use Tikhonov regularization [5]. Specically, we add λ θ 2 2 to our cost function, which penalizes non-soothe predicted odels. (The intuition behind this is that if our predicted odel has saller coefficients then the odel will be ore sooth and we will have less variance). Therefore, we solve the following proble: in. θ (θ Φ(x (i) ) y (i) ) 2 + λ θ 2 2 Using the Kernel trick and the Representer Theore [4], we can show that the where: y predicted = θ Φ(x (i) ) = c = (K + λi ) 1 y c i K(x (i),x (new) ) Using these results, we expanded our linear regression to use: 1. a Gaussian kernel with paraeter σ (bandwidth) in which k(x (i),x ( j) ) = e x(i) x( j) 2 2 2σ 2 and 2. a polynoial kernel with paraeters d(degree) and c in which k(x (i),x ( j) ) = ((x (i) ) T x ( j) + c) d Absolute value noralized error Absolute value noralized error.5 Progra length (instructions) (a) Gaussian kernel regression Progra length (instructions) (b) Polynoial kernel regression Figure 2: Noralized error for Gaussian and polynoial kernels on fixed size progras. Figure 2 presents the results of the Gaussian (σ 2 =1) and polynoial kernelized regresssion (d=2, c=1). Both approaches showed only ild iproveents copared to linear regression: on average, 3% for the Gaussian kernel and 2% for the polynoial. There are several possible reasons that these kernels only show odest iproveents. Section 3.2 explores one of the ost basic of these: changing the apping fro x86 progra text to features. 3 Variable length progras The previous section discussed predicting runties for a set of progras of fixed length. However, progras of interest can have varying nubers of instructions: a superoptiizing copiler is priarily concerned with the question What is the fastest equivalent progra of any nuber of instructions? not What is the fastest equivalent progra of ten instructions? A patchwork approach that trains separate odels over the entire space of progra lengths following the ethodology outlined in Section 2.1 is feasible for our target application. However, fro a achine learning standpoint,

3 sub_progra_vec = []; runtie_estiate = instruction inst; for inst = progra.first_ins until inst = prgra.last_inst: runtie_estiate += increent_tie(sub_progra_vec, inst) sub_progra_vec.append(inst) end Figure 3: Pseudo code for generative runtie prediction algorith. it is relatively uninteresting. This section instead explores a different approach. It attepts to build a generative odel of progra runtie. Instead of predicting the runtie of a fixed length progra, the generative odel estiates the length of a subset of a progra and iteratively updates this estiate as additional instructions are added to the end of the progra. Figure 3 shows high-level pseudocode for this algorith. In essence, a generative odel predicts the increental change a single additional instruction akes to progra runtie. A good generative odel therefore addresses the challenge of progras of different lengths: one can predict the runtie of an arbitrarily-length progra by progressively updating a runtie estiate for each instruction in the progra. Section 3.1 presents a naive generative odel. Section 3.2 analyzes the deficiencies of this naive odel and uses the to otivate ore coplicated feature selection discussed. Section 3.3 uses these features to build ore coplicated generative odels and discusses their perforance. 3.1 Naive generative odel This section describes and shows results on progras fro ipleenting the siplest possible generative odel, which we call the naive generative odel. The naive generative odel decoposes a progra into its individual coponent instructions. It then assigns each instruction a weight that is the (precoputed) average of the tie it takes to run that instruction on all sets of valid inputs (e.g., registers and iediates). Finally, the naive generative odel calculates the overall progra runtie as the su of each of its instructions weights. Figure 4 shows the perforance of this odel for 1 progras coposed of a single instruction, four instructions, eight instructions, and sixteen instructions, each. The vertical axis of each subfigure shows each progra s runtie (in cycles) predicted fro a gold standard proprietary tool [1] and the horizontal axis shows the runtie predicted by the naive generative odel. Each subfigure shows the 45 o line corresponding to optial predictions: any point on this line is a correct prediction by the naive generative odel of progra runtie. Note that although each subfigure presents data over 1 3 progras, fewer than 1 points are distinguishable because any points overlap. This is particularly true for progras coposed of single instructions, which can achieve fewer predicted outcoes than larger progras coposed of any instructions. A least squares line (plotted in green) gives a sense of data spread for overlapping data. Gold standard prediction 2 15 Prog size = 1 Optial predictor slope = Naive linear prediction 35 8 Optial predictor slope = Optial predictor slope = Optial predictor slope =.48 7 Figure 4: Naive generative prediction error. The horizontal axis shows the nuber of cycles that the naive generative odel predicts; the vertical axis shows the nuber of cycles predicted using the gold standard IACA tool [1]. An optial predictor would (predicted cycles equals gold standard cycles) is labeled on each graph as a black line. The green line shows the results of fitting a generative regression through the data. As can be seen in Figure 4, as progra size increases, the naive generative odel perfors poorly: average error for single instruction progras is alost zero (as indicated by how closely the generative regression line atches the optial line); while the naive predictor for sixteen instruction progras, on average, has a isprediction error of alost 1%. This trend is to be expected. Iplicilty, the naive generative odel essentially trains across the space of single instruction progras. Therefore, when it is tested on a single instruction progra, it predicts well. However, when it is tested on progras with ore instructions (that therefore were not in its training set), it perfors poorly. The likely cause for this error is that the naive generative odel ignores processors potential parallelization and pipelining. Training a odel to independently predict runtie of individual instructions and suing the prediction for each does not capture this feature of processors. The following section explores appings that preserve register read-write conflicts. 3.2 Feature apping The degree that a progra can be parallelized partially depends on read-write dependencies on registers between instructions. For instance, consider the two, short progras shown in Figure 5. The seantics of each are siple. In the

4 Progra 1: ovl $x4, %rax addl $x4, %rbx, %rbx Progra 2: ovl $x4, %rax addl $x4, %rax, %rax Figure 5: Read-write dependencies first, the ovl instruction sets the value of register rax to a constant (x4); the addl instruction adds a constant (x4) to the contents of register rbx and puts this result in rbx. The second progra is siilar, except the addl instruction is over rax. In the first progra, the processor can run both the ovl and addl instructions in parallel. However, in the second, the addl instruction explicitly reads fro a register (rax) the ovl instruction writes to, and therefore the processor cannot run both in parallel. Therefore, although coposed of identical instructions, Progra 2 takes longer than Progra 1 to run. Both of the feature vectors described in the experients of Sections 3.1 and 2.1 did not track register dependencies: for both Progras 1 and 2, they produce identical feature vectors. To iprove results for the generative algorith, we updated our feature apping to include read-write dependencies. For each instruction in a progra, we parse the instruction s nae as well as each of the instruction s register arguents. Iportantly, registers with different naes can refer to parts of the sae physical register [2]. For instance, the nae al refers to the lower order bits of the sae physical register that the nae eax refers to. We track this aliasing as well as which portions of a register can be operated on concurrently. Given an instruction s nae, we index into a pre-copiled database that returns 1. A list of iplicit registers registers not specified by the arguents to the instruction that the instruction reads fro and writes to A list of which of its arguents it reads fro and writes to. For instance, the database ay specify that the instruction ovx a,b reads fro its first arguent (a) and writes to its second arguent (b), but does not operate on any iplicit registers. Following parsing and this database lookup, for every line in a progra, we can define its full register read and write set. Fro these read and write sets, we can easily infer register conflicts between a progra s instructions. 2 Instructions can also zero-extend registers, which writes zeros across the higher order bits of a physical register. 4 We have built a feature apping function, φ, paraeterized by a bandwidth paraeter τ, that odels these readwrite dependencies. More concretely, define a progra, P, as the ordered set I 1,I 2,I 3,...,I, where I i is the ith instruction in P. Then the feature vector associated with appending the instruction I +1 to this progra is C(I +1 τ ) 1{I +1 τ I +1 } C(I +2 τ ) 1{I +2 τ I +1 } φ(i +1,τ,P) =.. C(I ) 1{I i I +1 } C(I +1 ) where, C(.) is the nuber of cycles it takes to run a progra, and I j I k is true if the j th instruction of P writes to a register that the k th instruction of P reads or writes to. In this odel, the bandwidth paraeter, τ, specifies how any instructions to look back fro the current instruction for readwrite conflicts. Section 4.2 discusses the perforance of this feature creation approach, as any proposed solution should be fast enough to be useful in practice. 3.3 Generative odel In order to predict the increental runtie that each instruction adds to the total progra runtie, we have devised the following learning process. For progras of length 1, take each instruction, say I, and collect the features as explained in Section 3.2. Then run the progra with and without I, and use linear regression to learn the increental runtie that I adds to the progra fro the collected features. For a given progra with arbitrary length L, we add the predicted increental runtie of each instruction based on the collected features to predict the total runtie. Figure 6 shows the results. As the bandwidth increases the predictions becoes ore accurate. However, the difference for bandwidths greater than 5 is insignificant. Note that as we saw in Section 2.2, the Gaussian or polynoial kernel are alost the sae as linear regression so we only show the result for linear regression, here. 4 Conclusions and extensions This report exained using linear and kernelized regressions to predict progra runtie for x86 progras. For progras of fixed length we were able to predict progra runtie with 12% average error copared to a gold-standard tool. For variable lengthed progras, our generative odel perfors siilarly, and is flexible enough to predict progras of arbitrary size. The following subsections discuss additional experients we ran using our odel.

5 Absolute value noralized error bw=1 bw=2 bw=3 bw=4 bw=5 Absolute value noralized error bw=1 bw=2 bw=3 bw=4 bw=5. Progra length. Progra length Figure 6: Noralized error of generative odel with different bandwidth paraeter. Training set size is coposed of 8 progras with length 1. But, each test set has 2 progras of length 1, 2, 3, and 4. Figure 7: Noralized error of generative odel for disassebly of gcc using linear regression with a training set size 8 of and test set size of Real progras The dataset introduced in Section 2.1 and used throughout this report was coposed of rando sequences of instructions. This rando dataset atched the project sponsor s goals to intelligently search the space of all x86 progras. However, we were curious how well our odel predicted runties of copiler-generated progras, which have ore regular structure. To explore this question, we ran a disassebler (objdup) across progras copiled using gcc. Because the odel produced in the previous sections was for loopfree progras, we filtered jup instructions fro the disassebled code. This filtering liits our ability to draw strong conclusions about the odel s accuracy for real progras, but likely still produces ore realistic code segents than the previous randoly-generated dataset. Figure 7 shows the noralized error fro running the generative odel with linear regression on a disassebly of the gcc binary. Running the sae regressions on other progra types (e.g., progras that ake heavy use of IO, gui-based progras, etc.) produced siilar results. Average noralized error is approxiately 5% higher for this real data. 4.2 Feature collection runtie Our collaborator s target application requires not only predicting runtie accurately, but quickly as well. To ensure that the feature collection described in Section 3.2 does not becoe a prediction bottleneck, we bencharked the perforance of our unoptiized Python feature collector. Table2 shows these nubers collected on a 2.8GHz processor. With a bandwidth paraeter of 5, our unoptiized Python feature collector is able to produce features at a rate of approxiately 1k/second. Coupling these nubers with the tie required to perfor linear regression, the generative 5 bw= Instr/sec 154,52 133, ,97 97,38 Table 2: Feature vector construction tie versus bandwidth. odel described in Section 3.3 produces predictions orders of agnitude faster than issuing a syste call to execute the gold standard tool (65 predictions/s for generative linear regression odel on progras of 1 instructions vs 22 predictions/s for the gold-standard tool). References [1] Intel Architecture Code Analzyer. software.intel.co/en-us/articles/ intel-architecture-code-analyzer. [2] x86 Structure. X86#Structure. [3] E. Schkufza, R. Shara, and A. Aiken. Stochastic superoptiization. In Proceedings of the eighteenth international conference on Architectural support for prograing languages and operating systes, pages ACM, 213. [4] B. Schölkopf, R. Herbrich, and A. J. Sola. A generalized representer theore. In Coputational learning theory, pages Springer, 21. [5] A. N. Tikhonov and V. Y. Arsenin. Solutions of ill-posed probles [6] V. Vapnik. The nature of statistical learning theory. springer, 2.

Solving the Damage Localization Problem in Structural Health Monitoring Using Techniques in Pattern Classification

Solving the Damage Localization Problem in Structural Health Monitoring Using Techniques in Pattern Classification Solving the Daage Localization Proble in Structural Health Monitoring Using Techniques in Pattern Classification CS 9 Final Project Due Dec. 4, 007 Hae Young Noh, Allen Cheung, Daxia Ge Introduction Structural