Exhaustive Phase Order Search Space Exploration and Evaluation

Size: px

Start display at page:

Download "Exhaustive Phase Order Search Space Exploration and Evaluation"

Kenneth Owen
5 years ago
Views:

1 Exhaustive Phase Order Search Space Exploration and Evaluation by Prasad Kulkarni (Florida State University) / 55

2 Compiler Optimizations To improve efficiency of compiler generated code Optimization phases require enabling conditions need specific patterns in the code many also need available registers Phases interact with each other Applying optimizations in different orders generates different code 2 / 55

3 Phase Ordering Problem To find an ordering of optimization phases that produces optimal code with respect to possible phase orderings Evaluating each sequence involves compiling, assembling, linking, execution and verifying results Best optimization phase ordering depends on source application target platform implementation of optimization phases Long standing problem in compiler optimization!! 3 / 55

4 Phase Ordering Space Current compilers incorporate numerous different optimization phases 15 distinct phases in our compiler backend 15! = 1,307,674,368,000 Phases can enable each other any phase can be active multiple times = 437,893,890,380,859,375 cannot restrict sequence length to = * / 55

5 Addressing Phase Ordering Exhaustive Search universally considered intractable We are now able to exhaustively evaluate the optimization phase order space. 5 / 55

6 Re-stating of Phase Ordering Earlier approach explicitly enumerate all possible optimization phase orderings Our approach explicitly enumerate all function instances that can be produced by any combination of phases 6 / 55

7 Outline Experimental framework Exhaustive phase order space evaluation Faster conventional compilation Conclusions Summary of my other work Future research directions 7 / 55

8 Outline Experimental framework Exhaustive phase order space evaluation Faster conventional compilation Conclusions Summary of my other work Future research directions 8 / 55

9 Experimental Framework We used the VPO compilation system established compiler framework, started development in 1988 comparable performance to gcc O2 VPO performs all transformations on a single representation (RTLs), so it is possible to perform most phases in an arbitrary order Experiments use all the 15 re-orderable optimization phases in VPO Target architecture was the StrongARM SA-100 processor 9 / 55

10 VPO Optimization Phases ID Optimization Phase ID Optimization Phase b branch chaining l loop transformations c common subexpr. elim. n code abstraction d remv. unreachable code o eval. order determin. g loop unrolling q strength reduction h dead assignment elim. r reverse branches i block reordering s instruction selection j minimize loop jumps u remv. useless jumps k register allocation 10 / 55

11 Disclaimers Did not include optimization phases normally associated with compiler front ends no memory hierarchy optimizations no inlining or other interprocedural optimizations Did not vary how phases are applied Did not include optimizations that require profile data 11 / 55

12 Benchmarks 12 MiBench benchmarks; 244 functions Category auto network telecomm consumer security office Program bitcount qsort dijkstra patricia fft adpcm jpeg tiff2bw sha blowfish stringsearch ispell Description test processor bit manipulation abilities sort strings using quicksort sorting algorithm Dijkstra s shortest path algorithm construct patricia trie for IP traffic fast fourier transform compress 16-bit linear PCM samples to 4-bit image compression and decompression convert color.tiff image to b&w image secure hash algorithm symmetric block cipher with variable length key searches for given words in phrases fast spelling checker 12 / 55

13 Outline Experimental framework Exhaustive phase order space evaluation Faster conventional compilation Conclusions Summary of my other work Future research directions 13 / 55

14 Terminology Active phase An optimization phase that modifies the function representation Dormant phase A phase that is unable to find any opportunity to change the function Function instance any semantically, syntactically, and functionally correct representation of the source function (that can be produced by our compiler) 14 / 55

15 Naïve Optimization Phase Order Space All combinations of optimization phase sequences are attempted L0 a b c d L1 a b c d a d a d a d b c b c b c L2 15 / 55

16 Eliminating Consecutively Applied Phases A phase just applied in our compiler cannot be immediately active again L0 a b c d L1 a b c d a b c d a b c d a b c d L2 16 / 55

17 Eliminating Dormant Phases Get feedback from the compiler indicating if any transformations were successfully applied in a phase. L0 a b c d L1 b c d a c d a b d a b c L2 17 / 55

18 Identical Function Instances Some optimization phases are independent example: branch chaining & register allocation Different phase sequences can produce the same code r[2] = 1; r[3] = r[4] + r[2]; instruction selection r[3] = r[4] + 1; r[2] = 1; r[3] = r[4] + r[2]; constant propagation r[2] = 1; r[3] = r[4] + 1; dead assignment elimination r[3] = r[4] + 1; 18 / 55

19 Equivalent Function Instances sum = 0; for (i = 0; i < 1000; i++ ) sum += a [ i ]; r[10]=0; r[12]=hi[a]; r[12]=r[12]+lo[a]; r[1]=r[12]; r[9]=4000+r[12]; L3 r[8]=m[r[1]]; r[10]=r[10]+r[8]; r[1]=r[1]+4; IC=r[1]?r[9]; PC=IC<0,L3; Register Allocation before Code Motion Source Code r[11]=0; r[10]=hi[a]; r[10]=r[10]+lo[a]; r[1]=r[10]; r[9]=4000+r[10]; L5 r[8]=m[r[1]]; r[11]=r[11]+r[8]; r[1]=r[1]+4; IC=r[1]?r[9]; PC=IC<0,L5; Code Motion before Register Allocation r[32]=0; r[33]=hi[a]; r[33]=r[33]+lo[a]; r[34]=r[33]; r[35]=4000+r[33]; L01 r[36]=m[r[34]]; r[32]=r[32]+r[36]; r[34]=r[34]+4; IC=r[34]?r[35]; PC=IC<0,L01; After Mapping Registers 19 / 55

20 Efficient Detection of Unique Function Instances After pruning dormant phases there may be tens or hundreds of thousands of unique instances Use a CRC (cyclic redundancy check) checksum on the bytes of the RTLs representing the instructions Used a hash table to check if an identical or equivalent function instance already exists in the DAG 20 / 55

21 Eliminating Identical/Equivalent Function Instances Resulting search space is a DAG of function instances L0 a b c c d a d a d L1 L2 21 / 55

22 Static Enumeration Results Function Inst. Fn_inst Len CF Batch Vs. optimal start_input_bmp (j) 1, , correct(i) 1,295 1,348, main (t) 1,276 2,882, parse_switches (j) 1, , start_input_gif (j) , start_input_tga (j) , askmode (i) , skiptoword (i) , start_input_ppm (j) 795 8, Average (234) , / 55

23 Exhaustively enumerated the optimization phase order space to find an optimal phase ordering with respect to code-size [Published in CGO 06] 23 / 55

24 Determining Program Performance Almost 175,000 distinct function instances, on average largest enumerated function has 2,882,021 instances Too time consuming to execute each distinct function instance assemble link execute more expensive than compilation Many embedded development environments use simulation simulation orders of magnitude more expensive than execution Use data obtained from a few executions to estimate the performance of all remaining function instances 24 / 55

25 Determining Program Performance (cont...) Function instances having identical control-flow graphs execute each block the same number of times Execute application once for each controlflow structure Statically estimate the number of cycles required to execute each basic block dynamic frequency measure = Σ (static cycles * block frequency) 25 / 55

26 Predicting Relative Performance I cycles 4 cycles cycles cycles 20 cycles 2 2 cycles 22 cycles 2 2 cycles 5 5 cycles cycles 5 10 cycles cycles Total cycles = 744 Total cycles = / 55

27 Dynamic Frequency Results Function Inst. Fn_inst Len CF Leaf % from optimal Batch Worst main (t) 1,276 2,882, , parse_switches (j) 1, , , askmode (i) , skiptoword (i) , , start_input_ppm (j) 795 8, pfx_list_chk (i) 640 1,269, , main (f) 624 2,789, , sha_transform (h) , , main (p) , Average (79) , / 55

28 Correlation Dynamic Frequency Counts Vs. Simulator Cycles Static performance estimation is inaccurate ignored cache/branch misprediction penalties Dynamic frequency counts may be sufficiently accurate simplification of the estimation problem most embedded systems have simpler architectures We show strong correlation between our measure of performance and simulator cycles 28 / 55

29 Complete Function Correlation Example: init_search in stringsearch 29 / 55

30 Leaf Function Correlation Leaf function instances are generated when no additional phases can be successfully applied Leaf instances provide a good sampling represents the only code that can be generated by an aggressive compiler, like VPO at least one leaf instance represents an optimal phase ordering for over 86% of functions significant percent of leaf instances among optimal 30 / 55

31 Leaf Function Correlation Statistics Pearson s correlation coefficient Pcorr = Σxy (ΣxΣy)/n sqrt( (Σx 2 (Σx) 2 /n) * (Σy 2 -(Σy) 2 /n) ) Accuracy of our estimate of optimal perf. cycle count for best leaf Lcorr = cy. cnt for leaf with best dynamic freq count 31 / 55

32 32 / 55 Leaf Function Correlation Statistics (cont ) Leaves Ratio Lcorr 0% Leaves Ratio average dijkstra(d) dequeue(d) ntbl_bit (b) ntbl_bitcnt(b) 23 main(b) bitcount(b) 2 bit_shifter(b) 2 bit_count.(b) 2 BW_btbl...(b) 1 AR_btbl...(b) Lcorr 1% Pcorr Function

33 Exhaustively evaluated the optimization phase order space to find a near-optimal phase ordering with respect to simulator cycles [Published in LCTES 06] 33 / 55

34 Outline Experimental framework Exhaustive phase order space evaluation Faster conventional compilation Conclusions Summary of my other work Future research directions 34 / 55

35 Phase Enabling Interaction benables a along the path a-b-a a b c b c c a b a d 35 / 55

36 36 / 55 Phase Enabling Probabilities g u s r q o n l k j i h d c b u s r q o n l k j i h g d c b St Ph

37 Phase Disabling Interaction b disables a along the path b-c-d a b c b c c a b a d 37 / 55

38 38 / 55 Disabling Probabilities u s r q o n l k j i h g d c b u s r q o n l k j i h g d c b Ph

39 Faster Conventional Compiler Modified VPO to use enabling and disabling phase probabilities to decrease compilation time # p[i] - current probability of phase i being active # e[i][j] - probability of phase j enabling phase i # d[i][j] - probability of phase j disabling phase i For each phase i do p[i] = e[i][st]; While (any p[i] > 0) do Select j as the current phase with highest probability of being active Apply phase j If phase j was active then For each phase i, where i!= j do p[i] += ((1-p[i]) * e[i][j]) - (p[i] * d[i][j]) p[j] = 0 39 / 55

40 40 / 55 Probabilistic Compilation Results average dijkstra(d) main(j) N/A LZWReadByte(j) N/A read_scan...(j) sha_trans...(h) main(f) fft_float(f) start_inp...(j) N/A start_inp...(j) N/A start_inp...(j) parse_swi...(j) N/A start_inp...(j) Speed Size Time Active Attempted Active Attempted Prob. / Old Prob. Compilation Old Compilation Function

41 Outline Experimental framework Exhaustive phase order space evaluation Faster conventional compilation Conclusions Summary of my other work Future research directions 41 / 55

42 Conclusions Phase ordering problem long standing problem in compiler optimization exhaustive evaluation always considered infeasible Exhaustively evaluated the phase order space re-interpretation of the problem novel application of search algorithms fast pruning techniques accurate prediction of relative performance Analyzed properties of the phase order space to speedup conventional compilation published in CGO 06, LCTES 06, submitted to TOPLAS 42 / 55

43 Challenges Exhaustive phase order search is a severe stress test for the compiler isolate analysis required and invalidated by each phase produce correct code for all phase orderings eliminate all memory leaks Search algorithm needs to be highly efficient used CRCs and hashes for function comparisons stored intermediate function instances to reduce disk access maintained logs to restart search after crash 43 / 55

44 Outline Experimental framework Exhaustive phase order space evaluation Faster conventional compilation Conclusions Summary of my other work Future research directions 44 / 55

45 VISTA Provides an interactive code improvement paradigm view low-level program representation apply existing phases and manual changes in any order browse and undo previous changes automatically obtain performance information automatically search for effective phase sequences Useful as a research as well as teaching tool employed in three universities published in LCTES 03, TECS / 55

46 VISTA Main Window 46 / 55

47 Faster Genetic Algorithm Searches Improving performance of genetic algorithms avoid redundant executions of the application over 87% of executions were avoided reduce search time by 62% modify search to obtain comparable results in fewer generations reduced GA generations by 59% reduce search time by 35% published in PLDI 04, TACO / 55

48 Heuristic Search Algorithms Analyzing the phase order space to improve heuristic algorithms detailed performance and cost comparison of different heuristic algorithms demonstrated the importance and difficulty of selecting the correct sequence length illustrated the importance of leaf function instances proposed modifications to existing algorithms, and new search algorithms published in CGO / 55

49 Dynamic Compilation Explored asynchronous dynamic compilation in a virtual machine demonstrated shortcomings of current popular compilation strategy describe importance of minimum compiler utilization discussed new compilation strategies explored the changes needed to current compilation strategies to exploit free cycles Will be published in VEE / 55

50 Outline Experimental framework Exhaustive phase order space evaluation Faster conventional compilation Conclusions Summary of my other work Future research directions 50 / 55

51 Compiler Technology Challenges High Level Language c o m p i l e r Machine Architecture Support for parallelism traditional languages express parallelism dynamic scheduling Virtual machines dynamic code generation and optimization Push compilation decisions further down multi-core heterogeneous cores No great solution performance monitoring software-controlled reconfiguration Can no longer do it alone 51 / 55

52 Iterative Compilation & Machine Learning Improved scope for iterative compilation & machine learning proliferation of new architectures automate tuning compiler heuristics tuning important libraries using performance monitors dynamic JIT compilers How to use machine learning to optimize and schedule more efficiently? 52 / 55

53 Heterogeneous Multi-core Architectures Can provide the best performance, cost, power balance Challenges schedule tasks, allocate resources dynamic core-specific optimization automatic data layout to prevent conflicts 53 / 55

54 Dynamic Compilation Virtual machines likely to grow in importance productivity, portability, interoperability, isolation... Challenges when, what, how to parallelize using hardware performance monitors using static analyses to aid dynamic compilation debugging tools for correctness and performance debugging 54 / 55

55 Questions? 55 / 55

56 Results Code Size Summary Exhaustively enumerated 234 out of 244 functions Sequence length Maximum: 44; Average: Distinct function instances Maximum: 2,882,021; Average: 89,947 Distinct control-flows Maximum: 920; Average: 36.2 Code size improvement over default sequence Maximum: 63%; Average: 6.46% 56 / 55

57 Results Performance Summary Exhaustively evaluated 79 out of 88 functions Sequence length Maximum: 44; Average: 16.1 Distinct function instances Maximum: 2,882,021; Average: 174,574.8 Distinct control-flows Maximum: 920; Average: 47.4 Performance improvement over default sequence Maximum: 15%; Average: 4.8% 57 / 55

58 Leaf Vs. Non-Leaf Performance 58 / 55

59 Phase Order Space Evaluation Summary last phase active? Y identical function instance? N equivalent function instance? N generate next optimization sequence N Y Y Y add node to DAG calculate function performance simulate application N seen control-flow structure? 59 / 55

60 Phase Order Space Evaluation Summary last phase active? Y identical function instance? N equivalent function instance? N generate next optimization sequence N Y Y Y add node to DAG calculate function performance simulate application N seen control-flow structure? 60 / 55

61 Phase Order Space Evaluation Summary last phase active? Y identical function instance? N equivalent function instance? N generate next optimization sequence N Y Y Y add node to DAG calculate function performance simulate application N seen control-flow structure? 61 / 55

62 Phase Order Space Evaluation Summary last phase active? Y identical function instance? N equivalent function instance? N generate next optimization sequence N Y Y Y add node to DAG calculate function performance simulate application N seen control-flow structure? 62 / 55

63 Phase Order Space Evaluation Summary last phase active? Y identical function instance? N equivalent function instance? N generate next optimization sequence N Y Y Y add node to DAG calculate function performance simulate application N seen control-flow structure? 63 / 55

64 Phase Order Space Evaluation Summary last phase active? Y identical function instance? N equivalent function instance? N generate next optimization sequence N Y Y Y add node to DAG calculate function performance simulate application N seen control-flow structure? 64 / 55

65 Phase Order Space Evaluation Summary last phase active? Y identical function instance? N equivalent function instance? N generate next optimization sequence N Y Y Y add node to DAG calculate function performance simulate application N seen control-flow structure? 65 / 55

66 Phase Order Space Evaluation Summary last phase active? Y identical function instance? N equivalent function instance? N generate next optimization sequence N Y Y Y add node to DAG calculate function performance simulate application N seen control-flow structure? 66 / 55

67 Weighted Function Instance DAG Each node is weighted by the number of paths to a leaf node a b 5 [abc] c 2 [bc] 1 [c] 2 [ab] b c c a b 1 [a] 1 [d] 1 a d / 55

68 Predicting Relative Performance II 5 4 cycles 5 10 cycles 5 10 cycles 5 10 cycles? 15 cycles? 15 cycles?? 4 cycles? 10 cycles 26 cycles? 90 cycles? 44 cycles? 2 cycles Total cycles = 170 Total cycles =?? 68 / 55

69 Case when No Leaf is Optimal 69 / 55

In Search of Near-Optimal Optimization Phase Orderings

In Search of Near-Optimal Optimization Phase Orderings Prasad A. Kulkarni David B. Whalley Gary S. Tyson Jack W. Davidson Languages, Compilers, and Tools for Embedded Systems Optimization Phase Ordering