Portable Compiler Optimisation Across Embedded Programs and Microarchitectures using Machine Learning

Size: px

Start display at page:

Download "Portable Compiler Optimisation Across Embedded Programs and Microarchitectures using Machine Learning"

Rodger Brooks
5 years ago
Views:

1 Portabe Compier Optimisation Across Embedded Programs and Microarchitectures using Machine Learning Christophe Dubach, Timothy M. Jones, Edwin V. Bonia Members of HiPEAC Schoo of Informatics University of Edinburgh, UK Grigori Fursin Member of HiPEAC INRIA Sacay, France Michae F.P. O Boye Member of HiPEAC Schoo of Informatics University of Edinburgh, UK ABSTRACT Buiding an optimising compier is a difficut and time consuming task which must be repeated for each generation of a microprocessor. As the underying microarchitecture changes from one generation to the next, the compier must be retuned to optimise specificay for that new system. It may take severa reeases of the compier to effectivey expoit a processor s performance potentia, by which time a new generation has appeared and the process starts again. We address this chaenge by deveoping a portabe optimising compier. Our approach empoys machine earning to automaticay earn the best optimisations to appy for any new program on a new microarchitectura configuration. It achieves this by earning a mode off-ine which maps a microarchitecture description pus the hardware counters from a singe run of the program to the best compier optimisation passes. Our compier gains 67% of the maximum speedup obtainabe by an iterative compier using 000 evauations. We obtain, on average, a.6x speedup over the highest defaut optimisation eve across an entire microarchitecture configuration space, achieving a 4.3x speedup in the best case. We demonstrate the robustness of this technique by appying it to an extended microarchitectura space where we achieve comparabe performance. Categories and Subject Descriptors D.3.4 [Programming anguages]: Processors Compiers; Optimization; Retargetabe compiers; C.0 [Computer Systems Organization]: Genera Hardware/software interfaces; C.4 [Computer Systems Organization]: Performance of systems Design studies; Modeing techniques; I.2.6 [Artificia inteigence]: Learning. Genera Terms Design, Experimentation, Performance. Permission to make digita or hard copies of a or part of this work for persona or cassroom use is granted without fee provided that copies are not made or distributed for profit or commercia advantage and that copies bear this notice and the fu citation on the first page. To copy otherwise, to repubish, to post on servers or to redistribute to ists, requires prior specific permission and/or a fee. MICRO 09, December 2 6, 2009, New York, NY, USA. Copyright 2009 ACM /09/2...$0.00. Keywords architecture/compier co-design, design-space exporation, machine earning.. INTRODUCTION Creating an optimising compier for a new microprocessor is a time consuming and aborious process. For each new microarchitecture generation the compier has to be retuned and speciaised to the particuar characteristics of the new machine. Severa reeases of a compier might be needed to effectivey expoit the processor s performance potentia, by which time the next microarchitecture generation has been deveoped and the process starts again. This never-ending game of catch-up means that we rarey expoit a shipped processor to the fu and this inevitaby deays the time to market. Athough this is a genera issue for a processor domains, it is particuary acute for embedded systems. Ideay, we woud ike a portabe compier technoogy that provides retargetabe optimisation whie fuy expoiting the characteristics of the new microarchitecture. In other words, given any new processor generation, deiver a compier that automaticay optimises for that new target and achieves high performance. Buiding such a compier is, however, extremey chaenging. This is primariy due to the compexity of the underying machine s behaviour and the varying structure of the programs being compied. Iterative compiation, which tunes each new program on a specific architecture [6, 6, 24, 30], has provided a methodoogy to find good optimisations.techniques such as genetic agorithms [24], hi cimbing [2] or optimisation orchestration [30] have been expored, a showing impressive performance improvements. Athough usefu, these approaches a suffer from the arge number of compiations and executions required to optimise each program. Every time the program or architecture changes, this time-consuming process must be repeated. In order to overcome these chaenges, reers have deveoped compiers that earn optimisation strategies using prior knowedge of other programs behaviour. Stephenson et a. [34] showed that genetic programming can earn good individua compier optimisations on a fixed architecture, eiminating the need for any iterative compiations of the new program. Cavazos et a. [3] showed that this coud be used to earning the best set of compier options on a fixed architecture. Athough these approaches dramaticay reduce or even eiminate the need for extra compiations and executions of the target program, they suffer from the need to entirey retrain the compier whenever the patform changes. In this paper we deveop, to the best of our knowedge, the first compier that can automaticay adapt to underying microarchitectura changes. This enabes portabe performance across different

2 generations of a microprocessor. This represents the first step towards the deveopment of a universa compier that can automaticay optimise appications for any patform without requiring extensive tuning. Given a new microarchitecture, our approach automaticay determines the right optimisation passes for any new program. Our scheme earns a machine earning mode off-ine which maps a microarchitecture description pus the hardware counters from a singe run of the program to the best compier optimisation passes. The earning process is a one-off activity whose cost is amortised across a future users of the compier on subsequent variations of the processor s microarchitecture. Using this approach we can, on average, achieve a.6x speedup over the highest defaut compier optimisation across 200 microarchitectura configurations. In addition, we show that this approach achieves 67% of the maximum performance improvement gained by standard iterative compiation using 000 evauations. Given our approach, a new compier does not need to be tuned whenever the processor microarchitecture changes or a new program needs compiing. This aows compiers to become fuy integrated into the design space exporation of a new processor generation, heping designers to fuy evauate the potentia of any new microarchitecture. Overtime, designers may wish to add new microarchitectura features not originay envisaged. We show that our approach adapts to new microarchitectura configurations and is abe to deiver the same eve of performance. In summary, this paper makes the foowing contributions: We deveop a machine earning mode that can predict the best optimisation passes to use for any new program when compiing for a new microarchitecture configuration; We show how our scheme accuratey deivers the performance improvements avaiabe across the MiBench benchmark suite and an embedded microarchitectura design space; We demonstrate the robustness of our scheme showing that it deivers comparabe performance on an extended microarchitecture space. The next section provides a short exampe demonstrating the difficuty of achieving portabe optimisation. Section 3 provides a description of how our compier is trained and depoyed using machine earning. Section 4 describes the experimenta setup. Section 5 then evauates the technique and anayses the resuts. Section 7 evauates this approach on a new extended space. This is foowed by a description of reated work in section 8. Finay, section 9 concudes the paper. 2. EXAMPLE In this paper we imit our study to seecting the right passes within an existing compier framework for varying programs and microarchitectures. Athough this may seem ike a restricted setting, the best optimisation passes to appy vary significanty between programs and microarchitectura configurations. Finding the best set of optimisation passes across programs and microarchitectures is highy non-trivia. To iustrate this point, consider figure which shows segment diagrams for three programs (rijndae_s, un and madpay) on three microarchitectures from our design space (described in section 4). For these three programs and microarchitectures, we found the best optimisation passes to appy (described in section 4.3). These optimisations ead to significant speedups, ranging from.6x to 2.62x speedup over the highest defaut optimisation eve. In this exampe we show ony five significant optimisation passes: bock reordering, oop unroing, function inining, instruction scheduing and goba common sub-expression eimination, abeed Microarchitecture A B C Program rijndae_e un madpay XScae Xscae with sma insn cache Xscae with sma insn cache sma data cache Bock reordering Loop unroing Function inining Goba common subexpression eimination Instruction scheduing Figure : Segment diagrams representing the optimisation passes to enabe in order to achieve the highest performance for three programs executed on three microarchitectures. A fied segment means that the optimisation shoud be enabed, an empty one means it shoud be disabed. on the right of figure. For each program/microarchitecture pair there is a circe of five segments representing the five passes. If the segment is fied, then the corresponding optimisation shoud be enabed for the given program and microarchitecture. If empty, it shoud be disabed. What is immediatey cear is that the best set of optimisation passes to appy changes across programs and microarchitectures. If we consider madpay for instance, three optimisations shoud be enabed for microarchitecture A, a different set of three for B and four enabed for configuration C. If we now consider microarchitecture B, two optimisations shoud be turned on for rijndae_e, four when compiing un and ony three for madpay. Given the arge number of optimisations avaiabe in a typica compier, providing a portabe optimising compier for programs and microarchitectures is non-trivia. However, there are simiarities between programs and microarchitectures that we can expoit. The best set of optimisations for rijndae_e on configuration C and madpay on microarchitecture A are exacty the same. This is aso true for un on microarchitectures B and C and madpay on configuration C. If we can somehow characterise the program madpay on configuration A and reate it to the characteristics of rijndae_e on microarchitecture C, then we can appy the same optimisation passes to madpay as we did to rijndae_e. This wi aow us to obtain the best speedups on this new program/microarchitecture pair without having ever seen madpay or configuration A before. In the next section we deveop a machine earning mode that automaticay identifies these simiarities. We then use and evauate it in section 5 to provide portabe optimisation across programs and microarchitectures. 3. ENABLING PORTABLE OPTIMISATION As seen in the previous section, the best performance is achieved by appying different optimisations depending on the program and the underying microarchitecture. This means that with current approaches to tuning an optimising compier [, 3, 29, 34] a new compier needs to be deveoped for each new generation or variation of the microarchitecture. To overcome this probem, we deveop a machine earning mode that automaticay adapts the compier s optimisation strategy for any program and any microarchitecture variation. 3. Overview Figure 2 gives an overview of our compier s structure. The too works ike any other compier, taking as an input the source code of a program and producing an optimised binary. However, in addition to the source code our compier has two other inputs which

3 Tabe : Performance counters used as a representation of program/microarchitecture pairs. Performance Counter Instructions per cyce Insn cache access rate ALU usage Decoder access rate Insn cache miss rate Mac usage Register fie access rate Data cache access rate Shifter usage Branch pred. access rate Data cache miss rate Figure 2: Overview of our portabe optimising compier. The compier takes in a program source, some performance counters and a microarchitecture description and outputs an optimised program binary for the microarchitecture. At the heart of the compier is a machine earning mode that predicts the best passes to run, controing the optimisations appied. it uses internay to optimise the program specificay for the microarchitecture it wi run on. Firsty, our compier takes in a description of the microarchitecture to target. This is simiar to standard compiers where this description is hard-coded in a machine description fie; here it is just an input. Secondy, it takes in performance counters derived from a previous run of the program. This is simiar to feedback-directed compiers that typicay use profiing information from a previous run to generate an optimised version of the program. However, unike any existing technique, our compier generates an optimised binary specificay for the target microarchitecture even when it has never seen the program or the microarchitecture before. Therefore, the compier does not have to be modified or regenerated whenever a new program or microarchitecture is encountered. At the heart of our compier is a mode that correates the behaviour of the new input program and microarchitecture with programs and microarchitectures that it has previousy seen. Such a mode is buit using machine earning and our approach can be considered as a three stage process: generating training data, buiding a mode and depoying it. The next three subsections describe each of these activities in detai, aowing us to create the overa compier shown in figure Generating Training Data In order to buid a mode that predicts good optimisation passes, we need exampes of various optimisation passes on different programs and microarchitectures as we as a description of each program and microarchitecture. We generate this training data by evauating N different sets of optimisation passes, y, on a set of training program/microarchitecture pairs, X,..., X M, and recording their execution times, t. We can characterise a program/microarchitecture pair using a vector of features x...,x M. Therefore, for each program/microarchitecture pair X j we have an associated dataset D j = {(y i, t i ) j } N i=, with j =,... M. Our goa is to predict the best set of optimisation passes y whenever a new program/microarchitecture X is encountered. Athough the generated dataset may be arge, it is ony a oneoff cost incurred by our mode. Furthermore, techniques such as custering [3] are abe to reduce this and is the subject of future work. Features We characterise program interaction with the processor using performance counters, c, and with the microarchitectures using 8 descriptors, d. The performance counters are shown in tabe and are simiar to those typicay found in processor anaytic modes [2, 22] To capture the features of the microarchitecture we simpy record its static description shown in tabe 2. The performance counters from a program running on a microarchitecture, c, are concatenated together with the microarchitecture description, d, to form a singe feature vector for the program/microarchitecture pair, x = (c,d). 3.3 Buiding a Mode Our aim is to buid a mode, M(x,y), that provides the mapping from any set of program/microarchitecture features to a set of good optimisation passes: M : x y. We approach this probem by earning the mapping from the features, x, to a probabiity distribution over good optimisation passes, q(y x). Once this distribution has been earnt (see next section), prediction on a new program and on a new microarchitecture is achieved by samping at the mode of the distribution. Thus we obtain the predicted set of optimisations by computing: y = argmax q(y x ). () y In other words, we find the vaue of y that gives the greatest probabiity of being a good optimisation Fitting Individua Distributions In order to earn the mode we need to fit a probabiity distribution over good optimisation passes to each training program/microarchitecture. Let g(y X) be a parametric distribution specific to a program/microarchitecture pair X. Note that whereas g(y X) is specific to the identity of a program/microarchitecture pair, q(y x) aows generaisation across programs and microarchitectures by being conditioned on a set of features x. Let Y e be a set of good optimisation passes and p(y X) be the empirica distribution over these passes for program and microarchitecture pair X. We wish to fit the parametric distribution g(y X) for each program/microarchitecture pair to be as cose as possibe to the empirica distribution p(y X). To do this we can minimise the Kuback-Leiber (KL) divergence: 2 fi KL( p(y), g(y)) = og p(y) f = constant + H( p(y), g(y)), g(y) p(y) (2) where H( p(y), g(y)) is the cross-entropy of p(y) and g(y). Thus, we can maximise the objective function: L = H( p(y), g(y)) = X y ey p(y) og g(y). (3) In our experiments we have chosen the set of good optimisations e Y to be those combinations of passes that are within the top 5% of a training optimizations for the respective program/microarchitecture pair. We have then weighted these optimizations uniformy. 2 For the foowing derivation, to simpify the notation, we wi omit the conditiona dependency of these distributions on a specific program/microarchitecture pair X.

4 In principe, our mode s probabiity distribution g(y) can beong to any parametric famiy. However, we have seected a very simpe IID (independent and identicay distributed) mode, where the impact of each pass (y ) is considered to be independent of a others, i.e. g(y) = Q L = g(y ), where L is the number of avaiabe passes. If g(y ) is a mutinomia distribution we have that: where S = {s () S LY LY Y g(y) = g(y ) = = = j= (θ j )I[y =s (j) ], (4),..., s ( S ) } is the set of possibe vaues that the pass y can take (i.e. on, off or a parameter vaue); I[y = s (j) ] is an indicator function that is ony when the particuar optimisation pass y takes on the vaue s (j) y = s (j) and zero otherwise); and θ j (i.e. I[y = s (j) ] = when is the probabiity of the optimisation pass y taking on the particuar vaue s j (i.e. θj = p(y = s (j) )); and, by definition, we have that P j θj =. By using equation (4) in equation (3) and maximising the atter with respect to each parameter θ j subject to the constraints = we obtain: P j θj θ j = X y ey p(y)i[y = s (j) ]. (5) This resut is known as the maximum ikeihood estimator. Since we have used the uniform distribution as p(y), this means that the estimation of the parameters of the IID distribution θ j is the number of (seected) optimisation passes in which y = s (j) divided by the tota number of (seected) passes. Our assumption of statistica independence between compier optimisations may seem simpistic because it is we known that optimisations interact with each other. However, as we see in section 5, this IID mode generaises we across programs and microarchitectures. Our mode works on the assumption that athough compier optimisations do interact, these interactions are ess important across good sets of optimisations. Additionay, more compicated distributions, e.g. a Markov mode, coud be considered without modifying our approach Learning a Predictive Distribution Across Programs and Microarchitectures Once the individua training distributions for each program/microarchitecture pair g(y X) have been obtained, we can earn a predictive distribution q(y x). This wi predict the probabiity of each optimisation pass vaue being good from the features, x, of a program/microarchitecture pair (performance counters, c, and microarchitecture descriptors, d). One possibe way of earning this distribution is to use memorybased methods such as K-nearest neighbours. In other words, we can set the predictive distribution q(y x) to be a convex combination of the K distributions corresponding to the training programs and microarchitectures that are cosest in the feature space to the new (test) program and microarchitecture. The coefficients of the combination are obtained by using: w k = exp{ βd(x (k),x ( ) )} P K k= exp{ βd(x(k),x ( ) )}, (6) where β is a constant and d(, ) is our evauation function, i.e. the eucidean distance of each corresponding nearest training point to the test point, so that the distributions of the cosest training points are assigned arger weights. We have set β = and K = 7 different neighbour programs, athough we have found experimentay that the technique is not sensitive to simiar vaues of K. Tabe 2: Microarchitectura parameters and their vaues. Each parameter varies as a power of 2, meaning 288,000 tota configurations. Aso shown are the vaues for the XScae processor. Parameter Vaues XScae Parameter Vaues XScae IL size 4K... 28K 32K DL size 4K... 28K 32K IL assoc DL assoc IL bock DL bock BTB entries BTB assoc Depoyment Once the mode is buit, it can be used to predict the best optimisation passes for any new program on any new microarchitecture, as shown in figure 2. It does this using just one run of the new program compied with the defaut optimisation eve, O3, on the new microarchitecture. Thus, given a new program/microarchitecture pair, X, we extract its features by using the microarchitecture description, d, and the performance counters from a run of this program (compied with O3) on this microarchitecture, c, so that we form x = (c,d ). We then use equation () above to give the predicted-best optimisation passes, y, compie and execute the program with this new optimisation. 4. DESIGN SPACE The previous section has deveoped a machine earning mode that automaticay predicts the correct optimisation passes to appy for any new program on any microarchitectura configuration. This section describes our experimenta setup ater used to evauate this mode. It aso provides a brief characterisation of the compier and microarchitectura design spaces. 4. Benchmarks We chose to use MiBench [5], a common embedded benchmark suite, we suited to the microarchitecture space of this paper. It contains a mix of programs, from signa processing agorithms to fu office appications. We used a 35 programs, running each to competion using an input set requiring at east 00 miion executed instructions, wherever possibe. Therefore,,, djpeg, and were run with the arge input set, a others were run with the sma inputs. 4.2 Microarchitecture Space In this paper we consider a typica embedded microarchitectura design space based on the XScae processor shown in tabe 2. We show the microarchitecture parameters of the XScae that we varied aong with the vaues each parameter can take. To generate our design space we varied the cache and branch predictor configurations because they are important components of an embedded processor. Other significant parameters such as pipeine depth or votage scaing were not considered, but coud be easiy added to our experimenta setup. We varied the parameters over a wide range of vaues, beyond those in current systems, to fuy expore the design space of this processor s microarchitecture. In tota there are 288,000 different configurations some of which give increased performance over the origina processor (up to 9%), others give significanty reduced power consumption (up to 2%) which is equay significant for embedded processors. In our experiments we used a sampe space of 200 configurations seected with uniform random samping. Each configuration was impemented in the Xtrem simuator [5] which has been vaidated for performance against the XScae processor. We aso used Cacti [35] to accuratey mode the cache access atencies, ensuring our experiments were as reaistic as possibe.

5 Figure 3: Compier optimisations and their parameters. Each is a pass within gcc and can be varied independenty. In tota there are 642 miion combinations. Speedup gs djpeg out fft susan_s ispe ame madpay un rijndae_d rijndae_e Figure 4: Distribution of the maximum speedup avaiabe across a microarchitectures on a per-program basis. The x- axis represents the program and the y-axis the speedup reative to gcc s defaut optimisation eve O3. The centra ine denotes the median speedup. The box represents the 25 and 75 percentie area whie the outer whiskers denote the extreme points of the distribution. Appication to other spaces We have considered an embedded microarchitectura space and benchmark suite because this represents a rea-word chaenge for portabe compier optimisation, based around an existing processor configuration. However, the machine earning schemes we deveop in this paper are independent of this and can equay be appied to other, more compex spaces. In section 7 we extend the microarchitectura space by varying frequency and issue width and show that our approach adapts to the new space. 4.3 Compier Optimisation Space Finding the best optimisation for a specific program on a specific microarchitecture is intractabe. To do this, a equivaent programs, of which there are infinitey many, must be evauated and the best seected. Therefore, in this paper we imit ourseves to finding the best optimisation within a finite space of optimisations. The space we have considered consists of a combinations of the compier passes and their parameters shown in figure 3. These passes are appied within gcc 4.2, an industry standard for the XScae processor. They were found to have a performance impact on the XScae microarchitecture configuration and other reers have expored a simiar space [38], aowing independent comparisons with existing work. Turning passes on or off eads to a design space of 642 miion different optimisations. Varying the parameters controing some of the optimisations, (e.g. gcse has five further options) eads to a tota of unique optimisation passes. AVERAGE Ceary, it is not feasibe to exhaustivey enumerate this entire space to find the best optimisations for each program on each microarchitecture. However, iterative compiation can be used to quicky find an approximation of the best and has been shown to out-perform other approaches [30]. Hence, to find the best optimisations, we used iterative compiation which evauated a 000 different optimisations. These optimisations were seected using uniform random samping. In our experimenta setup we saw amost no additiona improvement after 000 evauations, showing that it is a usefu indicator of the upper bound on reaistic performance achievabe by a compier. 4.4 Characterising the Compier Space Before trying to buid a compier that optimises across microarchitectures, it is important to examine whether there is any performance to gain. For this purpose, we evauated the impact of the compier optimisations on the 35 MiBench programs compied with the 000 random optimisation passes, each of which was executed on the 200 different architectura configurations, as described earier. This corresponds to a sampe space of 7 miion simuations and shoud provide some evidence of the potentia benefits of optimisation passes seection across microarchitectures and programs. We then record the best performance achieved on a per program per architecture basis. Figure 4 shows the sampe space s distribution of maximum speedups for each program across the microarchitectura configurations when compiing with the best set of optimisations per program per microarchitecture. What is immediatey cear is that there is significant variation across the programs. For some the performance improvement is modest; seecting the best optimisations does not hep the ibrary-bound benchmarks or for instance. For rijnadae_e there is significant performance improvement to be gained, ranging from a.2x speedup to 4.8x in the best case,.8x being the average. In the case of the extremes are much ess but on average seecting the best optimisation gives a 2.2x speedup across a configurations. In programs such as, madpay and un, there are modest speedups to be gained on average but significant improvements avaiabe on certain microarchitectures (up to 2.4x for madpay as the top whisker shows). The right-most entry shows that there is an average speedup of.23x avaiabe across the design space if we were abe to seect the best optimisations per program per microarchitecture. The chaenge is to deveop a compier that can automaticay obtain this speedup without having seen the program or target microarchitectura configuration before. Furthermore, it shoud be abe to capture the high performance avaiabe on certain microarchitectures and avoid the potentia sowdowns found by picking the wrong optimisations. We found that choosing the wrong set of passes can ead to an average speedup of 0.7 across programs and 0.2 in the worst

6 Speedup Speedup Microarchitecture fft susan_s out gs djpeg ame ispe rijndae_e rijndae_d un madpay Program Microarchitecture fft susan_s out gs djpeg ame ispe rijndae_e rijndae_d un madpay Program (a) Best passes (b) Our compier Figure 5: Speedup over O3 for each program/microarchitecture pair. Figure (a) shows the best improvement possibe over the programs and microarchitectures. Figure (b) shows the performance of the passes predicted by our compier scheme. Our portabe optimising compier can accuratey predict the best passes to enabe across this space to achieve the speedups avaiabe across a programs and microarchitectures. case (i.e. 5 times sower). The next section evauates experimentay the performance of the machine-earning compier deveoped in section EXPERIMENTAL METHODOLOGY AND RESULTS We first describe our evauation methodoogy and then evauate our approach across programs and microarchitectures. This is foowed by an anaysis of the resuts. 5. Evauation Methodoogy This section describes how we perform our experiment and determine the best performance achievabe in our space. 5.. Cross-Vaidation To evauate the accuracy of our approach we use eave-one-out cross-vaidation. This means that we remove a program and a microarchitecture from our training set, buid a mode based on the remaining data and then predict the best optimisations for the removed program and microarchitecture. Foowing this, the program is compied and executed with the predicted optimisations on that microarchitecture and its performance recorded. We then repeat this for each program/microarchitecture. This is a standard evauation methodoogy for machine earning techniques and means that we never train using the program or microarchitecture for which we wi optimise. Hence, this is a fair evauation methodoogy Best Performance Achievabe In addition to comparison with O3, we want to evauate our approach by assessing how cose its performance is to the maximum achievabe. Athough it is intractabe to determine the best performance that can be achieved by any set of optimisation passes, as stated in section 4.3, we consider the performance of an iterative compier with 000 evauations as being an appropriate upper bound for a compier using a singe profie run. 5.2 Program/Microarchitecture Optimisation Space We first consider the performance of our compier compared to the maximum speedups avaiabe. Figure 5(a) shows the maximum speedups achievabe, when seecting the best optimisations, reative to the defaut optimisation eve, O3, across the program and microarchitectura spaces. The microarchitectures are ordered so that those with arge speedups avaiabe over O3 are on the eft. The benchmarks are ordered so that those with arge performance increases (such as ) are on the right (as in figure 4). In the back corner, the maximum speedup achievabe with the best compier passes is obtained by rijndae_e. This benchmark achieves a 4.85x speedup on a microarchitecture with a sma instruction cache size. The optimisations eading to this resut do not incude any oop optimisations (apart from moving oop-invariant code out of the oops). In particuar, no oop unroing is performed because there is aready extensive, optimised software oop unroing programmed into the source code. The performance of our compier across the programs and microarchitectures is shown in figure 5(b). As is immediatey apparent, it is amost identica to the performance achieved when using the best optimisations, figure 5(a). Our mode is highy accurate at predicting very good compier passes across the programs and microarchitecture space. The coefficient of correation between the performance of the predicted optimisations and the best ones is 0.93 when evauated across the joint program/microarchitecture space. For a program/microarchitecture pairs with arge performance avaiabe, our approach is abe to achieve significant speedups, as is shown by the peaks for programs ispe, madpay, rijndae_d and rijndae_e. These graphs ceary demonstrate that our mode is abe to capture the variation in speedups avaiabe across the program and microarchitecture spaces. The next two sections conduct further evauations of our mode. 5.3 Evauation Across Programs This section focuses on the performance of our compier on each program rather than examining the microarchitectura space. Figure 6 shows the performance of each program when optimised with our portabe optimising compier, reative to compiing with O3,

7 Our Mode Best.5.4 Best Our Mode Speedup Speedup gs djpeg out fft susan_s ispe ame madpay un rijndae_d rijndae_e AVERAGE Microarchitecture Figure 6: The performance of our approach and the best optimisations achieved by iterative compiation for each program, normaised to O3 and averaged over a microarchitectures. Figure 7: The performance of our approach and the best optimisations achieved by iterative compiation for each microarchitecture, normaised to O3 and averaged across programs. averaged across a microarchitecture configurations. The second bar, abeed Best, is the maximum speedup achievabe for each program. On average, our technique obtains a.6x performance improvement across a programs and microarchitectures with just one profie run, achieving a.94x speedup for on average. For three benchmarks in particuar (, rijndae_e and rijndae_d), our scheme achieves significant speedups, approaching the best performance avaiabe. Figure 6 shows that our mode is abe to correcty identify good optimisations, aowing these programs to expoit the arge performance gains when avaiabe. However, figure 6 aso shows that some programs experience minor sowdowns compared with O3. Considering for exampe, our approach achieves ony a 0.97x speedup. This can be expained if we refer back to figure 4 where we can see that there is negigibe performance improvement to be gained over O3 for this program, even when picking the best optimisations per microarchitecture. Unfortunatey, for this benchmark, the majority of optimisations are detrimenta to performance and being ess than 00% accurate in picking optimizations means that compiing with our scheme causes a sma amount of performance oss. Considering our technique compared to the maximum speedup achievabe, we approach Best in most cases. For some programs, such as, we obtain over 95% of the maximum performance. However, for we achieve ony 30%. The reason for this shortfa is due to a subtety in the source code of. The main oop within this benchmark updates a pointer on every iteration, resuting in a arge number of oads and stores. By performing function inining and aowing a arge growth factor (parameter max-inine-insns-auto), this pointer increment is reduced to a simpe register addition which in turn reduces the number of data cache accesses. The performance counters are not sufficienty informative to enabe our machine earning mode to capture this behaviour. This prevents our mode from seecting the best passes. However, the addition of extra features, in particuar code features [9], woud enabe us to pick this up and wi be considered in future work. By way of comparison, standard iterative compiation woud require approximatey 50 iterations on average to achieve simiar performance. For some programs more than 00 iterations woud even be needed to match our mode. This ceary shows the benefit of using machine-earning modes to buid a portabe optimising compier. 5.4 Evauation Across Microarchitectures We now turn our attention to the performance of our compier across the microarchitecture space rather than across programs. Figure 7 shows the performance of our compier compared the best performance avaiabe for each microarchitecture, abeed Best. The microarchitectura configurations are ordered in terms of increasing speedup avaiabe over O3 (i.e. the Best ine). Those on the eft have itte speedup avaiabe whereas those on the right can gain significanty. For our portabe optimising compier we see that the amount of improvement over O3 varies from.08x to.35x. This gives an average speedup of.6x across a programs and microarchitectures. It is important to see that our scheme cosey foows the trend of the Best optimisations, showing how our approach captures the variation between configurations, expoiting architectura features when performance improvements can be achieved. Looking at figure 7 in more detai we can see that it is divided into roughy three regions. On the eft, up to configuration 32, the first region has itte performance improvement avaiabe. A microarchitectures in this area have a sma data cache. Unfortunatey gcc has very few data access optimisations, meaning the avaiabe speedups are reativey sma. Foowing this is the second region where the Best optimisations gain an average.2x speedup and our scheme manages to capture a respectabe.6x. Finay in the third section, after configuration 85, the avaiabe performance improvement increases dramaticay. These microarchitectures on the right have a sma instruction cache, meaning that it is important to prevent code dupication wherever possibe. This is typica of embedded systems where code size is frequenty an important optimisation goa. The performance counter specifying the instruction cache miss rate enabes our mode to earn this from the training programs. In particuar, our compier earns that instruction scheduing (schedue-insns) and function inining (inine-functions) must be disabed to prevent code size increases. In the case of instruction scheduing, this increase is due to a subsequent register aocation pass which emits more spi code for certain schedues. Here we can see an effect of the compex reationships between passes within the compier. Nonetheess, our mode is abe to cope with these interactions and achieve the majority of the speedups avaiabe in this area. 5.5 Summary We have shown that our portabe optimising compier achieves an average.6x speedup over O3 across the entire microarchitecture space for the MiBench benchmark suite. This is equivaent to 67% of the speedup achieved by the Best optimisation passes and is roughy consistent across the architecture configuration space. In addition, our approach is abe to achieve higher eves of performance whenever they are avaiabe, accuratey foowing the trends in the optimisation space across programs and microarchitectures.

8 fthread_jumps fcrossjumping foptimize_sibing_cas fcse_foow_jumps fcse_skip_bocks fexpensive_optimizations fstrength_reduce fre_run_cse_after_oop frerun_oop_opt fcaer_saves fpeephoe2 fregmove freorder_bocks faign_functions faign_jumps faign_oops faign_abes ftree_vrp ftree_pre funswitch_oops fgcse fno_gcse_m fgcse_sm fgcse_as fgcse_after_reoad param_max_gcse_passes fschedue_insns fno_sched_interbock fno_sched_spec finine_functions param_max_inine_insns_auto param_arge_function_insns param_arge_function_growth param_arge_unit_insns param_inine_unit_growth param_inine_ca_cost funro_oops param_max_unro_times param_max_unroed_insns fthread_jumps fcrossjumping foptimize_sibing_cas fcse_foow_jumps fcse_skip_bocks fexpensive_optimizations fstrength_reduce fre_run_cse_after_oop frerun_oop_opt fcaer_saves fpeephoe2 fregmove freorder_bocks faign_functions faign_jumps faign_oops faign_abes ftree_vrp ftree_pre funswitch_oops fgcse fno_gcse_m fgcse_sm fgcse_as fgcse_after_reoad param_max_gcse_passes fschedue_insns fno_sched_interbock fno_sched_spec finine_functions param_max_inine_insns_auto param_arge_function_insns param_arge_function_growth param_arge_unit_insns param_inine_unit_growth param_inine_ca_cost funro_oops param_max_unro_times param_max_unroed_insns gs djpeg out fft susan_s ispe ame madpay un rijndae_d rijndae_e btb_size btb_assoc i_size i_assoc i_bock d_size d_assoc d_bock IPC dec_acc_rate reg_acc_rate bpred_acc_rate icache_acc_rate icache_miss_rate dcache_acc_rate dcache_miss_rate ALU_usg MAC_usg Shft_usg Figure 8: A Hinton diagram showing the optimisations that our mode considers most ikey to affect performance for each benchmark. The arger the box, the more ikey an optimisation affects the performance of the respective program. Figure 9: A Hinton diagram showing the reationship between optimisations and features. The arger the box, the more informative a feature is in predicting whether to appy the optimisation. The next sections anayses our resuts, describing the passes that are important in our space and how our mode seects good optimisation passes for new programs and microarchitectures. 6. ANALYSIS OF RESULTS This section anayses the resuts of our experiments. It first describes how programs affect the choice of optimisation to appy. Then it shows how microarchitecture infuence the optimisations choice. 6. Program Impact on Optimisations Section 5.2 showed that our compier s performance cosey foows the speedups achieved by the best optimisations for each program/microarchitecture pair. We now consider how it achieves this by focusing on those optimisations that are most ikey to affect performance. Note that this is a post-hoc anaysis and, in genera, we cannot know in advance whether an optimisation wi be ikey to affect performance for a specific program and microarchitecture. Figure 8 shows a Hinton diagram of the normaised mutua information between each optimisation and the speedups obtained on each program. Intuitivey, mutua information gives an indication of the impact (good or bad) of a specific compier pass on each program. The arger the box, the greater the impact of the pass. However, as this is a summary across a architectures an optimisation may be important for just a few microarchitectures but not for the others, eading to a sma box being drawn. It is cear from figure 8 that some optimisations are important across a programs, whereas others are ony important to a few benchmarks. For exampe, instruction scheduing (schedue-insns) is important for amost a benchmarks. As discussed in section 5.4, in some cases this optimisation has a negative impact on microarchitectures with a sma instruction cache. Loop unroing (unrooops) is aso an important optimisation for many programs. For programs such as, which contains oops with a known number of iterations, it is important to consider this optimisation to achieve good performance. However, for others, such as rijndae_e, this optimisation does not pay a crucia roe in achieving good performance because extensive unroing is aready impemented in the source code. The optimisation passes affecting function inining (ininefunctions to param-inine-ca-cost) have itte impact on most programs. However, for four programs, ispe,, and, these are the most important passes. By using the mutua information shown in figure 8 our mode focuses on those optimisations that are most ikey to affect performance on a per program/architecture basis. 6.2 Microarchitecture Impact on Optimisations Having anaysed the optimisations that have most impact on different programs, we now turn our attention to the reationship between the microarchitecture and compier optimisations. Figure 9 presents another Hinton diagram showing the microarchitectura impact on the best optimisations to appy. These resuts are averaged over a programs. The features are separated into two groups: the first contains the eight architectura parameters d whist the second contains the performance counters c. Of a the micro architectura parameters, the size of the instruction cache (denoted i_size) has the biggest impact on compier optimisation. In particuar, it strongy infuences the optimisations that contro function inining (inine_functions) and oop unroing (unro_oops) It is therefore critica to predict these optimisations correcty (based on i_size) to avoid increasing the cache miss rate on sma cache configurations. Furthermore, on arger cache configurations, it is important to perform aggressive inining and unroing to expoit the fu potentia of the cache. Now considering the performance counters, d, we can see that IPC has significant impact. This is used by the mode in conjunc-

9 Speedup Our Mode Best gs djpeg out fft susan_s ispe ame madpay un rijndae_d rijndae_e AVERAGE Figure 0: The performance of our approach and the best optimisations achieved by iterative compiation for each program, normaised to O3 and averaged over a microarchitectures on an extended space. tion with the other features to predict the most important optimisations to appy, such as bock reordering (reorder_bocks), goba common subexpression eimination (gcse), instruction scheduing (schedue_insns), function inining (inine_functions) and oop unroing(unro_oops). The performance counters that record cache and branch predictor access/miss rates aso have significant impact on choosing the best optimisation fags. Surprisingy, knowedge of register and functiona unit usage has itte importance in determining the correct compier optimisations to appy. Whie some of these observations may seem rather intuitive, current production compiers, such as gcc, aways use the same strategy when appying optimisation passes, independenty of the architectura parameters. One immediate recommendation woud be to make gcc s unroing and inining optimisations sensitive to the instruction cache size and to make use of branch predictor and cache access performance counters. However, this is just a post hoc anaysis based on the resuts of this space. The technique deveoped in this paper is micro-architectura space neutra enabing a compier to automaticay adapt to any underying microarchitecture, as shown in the next section. 7. EXTENDING THE MICROARCHITEC- TURAL SPACE Whie our approach works we on a predefined architecture space, it is reasonabe to ask how woud it perform if the architecture space was changed at a ater date. We therefore extended our space by varying two microarchitectura parameters not considered in section 4.3, namey frequency and processor width. Frequency ranges from 200 to 600 MHz whie issue width is either or 2. As a reference, the corresponding XScae vaues are 400 MHz and issue width. Given this extended space, we then appied our approach, predicting the best optimisation passes for it. Figure 0 shows the resuting performance across programs, compared to the best performance avaiabe. In this new space, seecting the correct compier optimisation passes has a simiar impact as before. The Best optimisations give an average.24x improvement over O3 compared to.23x in the previous space. Our approach is abe to achieve an improved average of.4x speedup. This is comparabe to the performance achieved on the previous space without any modification to our approach. If we were to incude new features that capture the behaviour of the additiona architectura parameters, the performance of our mode woud be further improved. 8. RELATED WORK There is a significant voume of prior work reated to this paper which we discuss in the foowing seven sections. Domain-Specific Optimisations Yotov et a. [40] investigated a mode-driven approach for the ATLAS sef-tuning inear agebra ibrary that uses the machine description to compute the optima parameters of the optimisations. SPIRAL [32] is another sef-tuned ibrary. It automaticay generates high-performance code for digita signa processing by expoiting domain-specific knowedge to the parameter space at compie-time. These two systems both required domain-specific knowedge and the use of iterative compiation to optimise themseves on the target system. They have to be retuned for each new patform. This contrasts with our work where the compier is buit ony once and optimises across a range of microarchitectures using just one profie run for any new program. Iterative Compiation Iterative compiation optimises a singe program on a specific microarchitecture by ing the optimisation space. Cooper et a. [7] were amongst the first to use a genetic agorithm to sove the phase ordering probem, achieving impressive code size reductions. Later, an extensive study of this probem was conducted, advocating the use of mutipe hi-cimber runs [2]. Vuduc et a. [39] ooked at the probem of optimising a matrix mutipication ibrary using a statistica criterion to stop. Kukarni et a. [24] used their previousy deveoped VISTA compier infrastructure [42] to for effective optimisation phases at a function eve. They buid a tree of effective transformation sequences and use it to imit the of the optimisation space with a genetic agorithm. Orthogonay to this, other reers have focused on finding the best optimisations settings to appy. Triantafyis et a. [36] concentrated on a sma set of optimisations that perform we on a given set of code segments. These are paced in a tree which is traversed to for good optimisation combinations for a new appication. Finay, Pan and Eigenmann [30] compared these techniques with their own agorithm that iterativey eiminates settings with the most negative effect from the space. Compared to our approach, a these techniques specificay tune each program on a per-program, per-architecture basis by ing its optimisation space. Conversey, our technique avoids and recompiation by directy predicting the correct set of compier optimisations to appy on a new micro-architecture. Anaytic Modes for Compiation The use of anaytic modes has aso been investigated to speedup iterative compiation. Triantafyis et a. [36] used an anaytic mode to reduce the required time to evauate different compier optimisations for different code segments. Zhao et a. [4] deveoped an approach named FPO to estimate the impact of different oop transformations. To overcome the high cost of iterative compiation, Cooper et a. [6] deveoped ACME which uses the concept of virtua execution; a simpe anaytic mode that estimates the execution time of basic bocks. Anaytic modes have proved to be usefu for ing the optimisation space quicky. However, since our mode does not perform any but directy predicts the best optimisation passes to appy, they are not appicabe in this context. Machine-Learning Compiers Some of the first reers to incorporate machine earning into optimising compiers were McGovern and Moss [29] who used reinforcement earning for the scheduing of straight-ine code.

Portable Compiler Optimisation Across Embedded Programs and Microarchitectures using Machine Learning

Portable Compiler Optimisation Across Embedded Programs and Microarchitectures using Machine Learning Portabe Compier Optimisation Across Embedded Programs and Microarchitectures using Machine Learning Christophe Dubach, Timothy M. Jones, Edwin V. Bonia Members of HiPEAC Schoo of Informatics University