Bachelor Seminar: Complexity Analysis of Register Allocation

Markus Koenig
Embedded Systems Group, University of Kaiserslautern
m koenig13@cs.uni-kl.de

Abstract

Register allocation is the task of assigning the temporary variables of a program to the registers available in the machine. Optimal register allocation (minimizing the number of registers used) was proved NP-complete by Chaitin et al. by a reduction from the graph coloring problem: they showed that for any given graph there exists a program whose interference graph is exactly that graph. In this paper we study two existing analyses of the complexity of register allocation [5, 1]. [5] proves that although optimal register allocation can be done in polynomial time for programs in static single assignment (SSA) form, the problem becomes NP-complete again after classical SSA elimination. [1] shows that although register allocation is NP-complete through the correspondence with graph coloring, the real complexity arises from the further optimizations of spilling and coalescing and from critical edges. Furthermore, we study a technique that solves register allocation and instruction scheduling as one combined problem [4].

1 Introduction

In most programs, variables must be stored for later use. This is what makes register allocation important and why it should be done as fast as possible. Physical memory is split into two or more levels; for our purposes a simple model with registers and main memory is enough. Storing a variable in main memory is, compared to a register, very slow, so it should be avoided whenever possible. Such a memory access is called a spill (a load/store pair). Sometimes it is useful to move a variable to another register: adding such a copy instruction is called splitting, and removing one is called coalescing. So before register allocation is performed, it is a good idea to check whether splitting or coalescing can save a spill. A further problem is to decide which variable is the best candidate for spilling and, more generally, to find the minimum number of variables that have to be spilled. Note that the number of registers is part of the input, yet the algorithm should also find the smallest number of registers that actually need to be allocated. The SSA (static single assignment) form is used in compilers as a first step before a program is translated into executable form; in SSA every variable is defined exactly once and every use refers to exactly one definition. The second topic is the use of instruction scheduling during register allocation; the problems that arise when the two are applied separately will be shown later. The combined method is called CRISP (Combined Register allocation and Instruction Scheduling Problem). To evaluate the combined solution, a cost function is used together with a detailed analysis of the individual steps that reorder the basic blocks. With regard to complexity, an alternative view, beyond plain graph coloring, is presented, and finally an experiment provides numbers for comparison with other algorithms.

The first part of the solution, Section 3.1, introduces the SSA model: some transformations are applied before the coloring starts, and in the best case the graph after the SSA process is chordal and therefore easy to color. Section 3.2 is about the problem structure and takes a closer look at Chaitin's proof. Coloring an arbitrary graph is NP-complete, but not every instance is hard, and there are by now further optimization algorithms; even though the problem is NP-complete, we will see that optimization can save a register, as shown in the two example figures. Section 3.3 presents another way to prepare for graph coloring: a combined method that integrates instruction scheduling and register allocation. An example showing the effect of the combined method is given in Figure 8, and an experiment quantifies the improvement of this approach over the separate use of instruction scheduling and register allocation. Finally, the limitations of the different improvements are discussed. Unfortunately, in some cases it is impossible to obtain an easy-to-color graph, so the expensive spilling and the NP-complete coloring cannot be avoided.

2 Related Work

The first NP-completeness proof for register allocation was given by R. Sethi. He modeled the problem as graph coloring: the variables are the vertices, and two vertices are connected by an edge if the corresponding variables are live at the same time during program execution. The number k of available registers is the number of allowed colors. He used a DAG (directed acyclic graph) and found that the NP-completeness comes from the fact that the order of the instructions in the program code is not fixed. Already in this first approach we see that the problem has two aspects: on the one hand, deciding whether a variable has to be spilled; on the other hand, coloring the graph so that K registers suffice. The first aspect is defined precisely by Pereira and Palsberg [5]:

Core register allocation problem. Instance: a program P and a number K of available registers. Problem: can each of the temporaries of P be mapped to one of the K registers such that temporary variables with interfering live ranges are assigned to different registers?

When analyzing the problem it is important to look at the individual steps of a problem instance. First we are given a program consisting of instructions over variables. Then, before an interference graph is modeled and colored, several optimization steps can be taken. The basic blocks deserve special attention: a basic block, the smallest part of the program that is analyzed, and the final coloring, the biggest step, are closely related. Motwani et al. [4] formulate the combined register allocation and instruction scheduling problem within a basic block as a single optimization problem and show that even a simple instance of the combined problem (a single register, no latencies, a single functional unit) is NP-hard, even though instruction scheduling alone for a basic block with 0/1 latencies on a single pipelined functional unit is not NP-hard.
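To make the core problem concrete, here is a minimal Python sketch (my own illustration, not code from [5]; the live-range representation and the greedy heuristic are assumptions) that builds an interference graph from live ranges and tries to assign one of K registers greedily. Greedy coloring can fail on graphs that are in fact K-colorable; deciding K-colorability exactly is the NP-complete part.

    import itertools

    def interference_graph(live_ranges):
        """live_ranges: dict temporary -> (start, end), end exclusive.
        Two temporaries interfere if their live ranges overlap."""
        graph = {t: set() for t in live_ranges}
        for a, b in itertools.combinations(live_ranges, 2):
            (s1, e1), (s2, e2) = live_ranges[a], live_ranges[b]
            if s1 < e2 and s2 < e1:          # intervals overlap
                graph[a].add(b)
                graph[b].add(a)
        return graph

    def greedy_k_coloring(graph, k):
        """Try to assign one of k registers to every temporary.
        Returns a mapping or None; failure does NOT prove non-colorability."""
        coloring = {}
        for node in sorted(graph, key=lambda n: len(graph[n]), reverse=True):
            used = {coloring[n] for n in graph[node] if n in coloring}
            free = [c for c in range(k) if c not in used]
            if not free:
                return None                   # a spill would be needed here
            coloring[node] = free[0]
        return coloring

    ranges = {"a": (0, 4), "b": (1, 3), "c": (2, 6), "d": (5, 8)}
    print(greedy_k_coloring(interference_graph(ranges), 3))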

Figure 1: (a) shows the matrix equation with the φ-function; (b) and (c) show a matrix together with its semantics. There is one φ-function for every n rows, and each column represents a different execution path in the program. [4]

The improvements will shorten the time needed for most steps: the graph structure can be shaped more favorably, possibly into a chordal graph, and the instructions can be ordered advantageously. However, this only works in special cases, and despite all these improvements register allocation remains an NP-complete problem.

3 The Solution

First, in Section 3.1, the SSA method is introduced, which uses interval graphs, SSA circular graphs, and φ-functions. Second, in Section 3.2, we take a closer look at the problem instances and determine which step contributes which complexity. The last part, Section 3.3, presents another method, in contrast to SSA, for simplifying the program before graph coloring starts.

3.1 The SSA (Static Single Assignment) Approach

3.1.1 Phi functions

For the SSA form the φ-functions are essential. These functions act like a naming system for the variables, choosing the correct name and value for each variable. They are needed because in SSA no two definitions may use the same name: if we assign to a variable a second time, or in another branch of the program, the φ-function retires the old name and introduces a new one. Here the syntax is described as a matrix, as modeled by Hack et al. In Figure 1 the φ-functions are evaluated simultaneously at the beginning of each basic block. Since every column corresponds to a separate path in the control flow graph, the variables within a row are independent of each other and can be allocated to the same register.
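As a toy illustration of the renaming idea (a sketch under my own assumptions; real SSA construction also inserts φ-functions at control-flow joins, which this straight-line version deliberately omits), the following Python function gives every assignment a fresh version and redirects each use to the latest version:

    from collections import defaultdict

    def rename_block(instrs):
        """instrs: list of (target, (operand, ...)) tuples for one basic block.
        Returns the block in SSA form: each target gets a fresh version,
        each operand is replaced by the latest version of that variable."""
        version = defaultdict(int)
        current = {}                     # variable -> latest SSA name
        out = []
        for target, operands in instrs:
            new_ops = tuple(current.get(op, op) for op in operands)
            version[target] += 1
            new_target = f"{target}{version[target]}"
            current[target] = new_target
            out.append((new_target, new_ops))
        return out

    block = [("v", ("i",)), ("i", ("i",)), ("v", ("v", "i"))]
    for t, ops in rename_block(block):
        print(t, "<-", ops)
    # v1 <- ('i',)   i1 <- ('i',)   v2 <- ('v1', 'i1')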

Figure 2: (a) shows a classical program in SSA-form, (b) shows the control flow graph, and (c) the program after SSA-elimination. The three steps illustrate the transformation from a normal program into the post-SSA form. [5]

Referring to Figure 2, in (b) we can see that in the second block the variables v11 and i1 interfere, but v11 and i2 do not, just as the description of SSA predicts. The post-SSA form, also called SSA-elimination, shown in (c), is the executable program. It is used because φ-functions are not supported by every target language. This is why the number of variables over the whole program increases, while maxlive does not. Since a variable in SSA-form may be defined only once and each use refers to that one definition, a new variable is introduced whenever a value has to be assigned a second time: the value is the same, but the name has changed to satisfy the SSA condition. The compiler can perform the transformation into SSA-form in cubic time. The advantage of the SSA transformation is that the interference graph of a program in SSA-form is chordal and can therefore be colored in linear time, and mapping a program into an SSA instance takes polynomial time, so every step along this way runs in polynomial time and the allocation is no longer NP-complete. This sounds good, but a problem appears when we look at all possibilities: the way back cannot be taken so easily, because information is lost in the individual steps, so a solution for the program in SSA-form is not always a solution for the program we started with. The whole problem is therefore still NP-complete, but we have found a way that makes the allocation simple in some special cases.
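The linear-time coloring claim can be made tangible on interval graphs, a subclass of chordal graphs: sweeping the intervals in order of their start points and reusing freed registers produces an optimal coloring with exactly maxlive colors. A minimal Python sketch (my own illustration; the half-open interval representation is an assumption):

    import heapq

    def color_intervals(ranges):
        """ranges: dict temporary -> (start, end), end exclusive.
        Optimal coloring of the interval graph: sweep by start point and
        reuse the register of any interval that has already ended."""
        events = sorted(ranges.items(), key=lambda kv: kv[1][0])
        active = []          # heap of (end, color) for live intervals
        free = []            # heap of released colors
        next_color = 0
        coloring = {}
        for temp, (start, end) in events:
            while active and active[0][0] <= start:
                _, released = heapq.heappop(active)
                heapq.heappush(free, released)
            if free:
                color = heapq.heappop(free)
            else:
                color = next_color
                next_color += 1
            coloring[temp] = color
            heapq.heappush(active, (end, color))
        return coloring      # uses exactly maxlive colors

    print(color_intervals({"a": (0, 4), "b": (1, 3), "c": (2, 6), "d": (5, 8)}))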

3.1.2 SSA Circular Graphs

Figure 3: [5]

First we define the interval representation shown in Figure 3: each vertex is an arc described by its two endpoints d (definition) and u (use), and two vertices are connected by an edge if and only if the corresponding intervals have a non-empty intersection. The outcome is a set V of arcs, of which we can define three subsets:

V_i = {(d,u) ∈ V | d < u}   (1)
V_l = {(d,u) ∈ V | d > u}   (2)
V_z = {(d,y) ∈ V | ∃u : (y,u) ∈ V_l}   (3)

The first set V_i contains the arcs between two points d and u where d is smaller than u. The second set covers the opposite case, where the value of d is bigger than u, that is, the arcs that wrap around the circle. The last set contains the arcs whose endpoint y is itself the definition point of a wrapping arc in V_l. For an SSA circular graph W there are two additional conditions,

∀(y,u) ∈ W_l : ∃d ∈ N : (d,y) ∈ W_z   (4)
∀(d,u) ∈ W_i \ W_z : ∀(d',u') ∈ W_l : u < d'   (5)

which make the difference. A vertex (a,b) consists of the two extreme points a and b. The first condition says that every wrapping interval shares its definition point with an interval of W_z, so two vertices share the same extreme point. The second condition is similar: every interval of W_i that is not in W_z ends before any wrapping interval of W_l begins, so each interval of W_l shares an extreme point only with W_z. These shared extreme points will later serve as the parameters of the φ-functions. In the following, a mapping function F is used that works on pairs (V,K) and splits the intervals of V_l. The results are given in Lemmas 1 to 5 by Pereira and Palsberg [5].

Lemma 1 If V is a circular graph and min(V) > K, then F(V,K) is an SSA circular graph.

Lemma 2 If W = F(V,K) is K-colorable, then two intervals in W_z and W_l that share an extreme point must be assigned the same color by any K-coloring of W.

Lemma 3 Suppose V is a circular graph and min(V) > K. Then V is K-colorable if and only if F(V,K) is K-colorable.

Lemma 4 Graph coloring for SSA circular graphs is NP-complete.
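Before moving on, a small Python sketch (my own reading of definitions (1)-(3); the arc representation is an assumption) that classifies the arcs of a circular graph into the three subsets defined above:

    def classify_arcs(arcs):
        """arcs: set of (d, u) pairs on the circle (d = def point, u = use point).
        Returns the subsets V_i (non-wrapping), V_l (wrapping) and
        V_z (arcs whose use point is the def point of a wrapping arc)."""
        v_i = {(d, u) for (d, u) in arcs if d < u}
        v_l = {(d, u) for (d, u) in arcs if d > u}
        wrap_defs = {d for (d, u) in v_l}
        v_z = {(d, y) for (d, y) in arcs if y in wrap_defs}
        return v_i, v_l, v_z

    print(classify_arcs({(1, 4), (4, 2), (5, 7)}))
    # V_i = {(1,4), (5,7)}, V_l = {(4,2)}, V_z = {(1,4)}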

Figure 4: (a) shows, in the upper section, the interval graph and, below it, the corresponding graph with its edges. (b) shows the program defined by the graphs in (a). [5]

3.1.3 Post SSA Programs

Lemma 5 We can color an SSA circular graph Y with K colors if and only if we can solve the core register allocation problem for H(Y,K) and K + 1 registers.

In this part, in contrast to the previous one, a new representation of circular graphs is used: the circular graph is converted into a finite list I of elements,

I = def(j), use(j), copy(j, j'),   j, j' ∈ N,

where the j's are temporary names; walking through the list I, the letters d and u describe the number of elements in the list of defs or copies and in the list of uses or copies, respectively. In the example of Figure 4 there are two parts to look at. First there is a loop that can be colored independently of the rest of the program, because nearly all the variables inside it are used only locally; the second part is the rest of the program outside the loop. This separation makes the coloring easier for programs less simple than this one. Once the coloring of the loop is done, it can be mapped onto the whole graph in linear time. The solution of the core register allocation problem needs K + 1 registers because of the loop control variables i and i2. A valid coloring of the example graph uses the first color for a, a2, c, the second color for b, d, t, t2, the third for e, e2, and the last one for the loop control variable. Unfortunately this improvement cannot reduce the complexity of the whole allocation problem, because a graph coloring still has to be done, but it shows that the worst case with its bad execution time can be avoided in some cases (see [3]).

Figure 5: This figure shows an example of how to form a program from the interference graph on the left side (from [1]).

3.2 Analyzing the Problem

Chaitin proved by a reduction from k-coloring that the allocation is NP-complete (see [2]). Given an undirected graph G and a natural number k, the question is: can we color G with k colors so that no two nodes linked by an edge have the same color? For k > 2 and an arbitrary graph this problem is NP-complete. For the reduction, Chaitin modeled a program with variable set V and k available registers. Two variables u and v from V are linked by an edge if they have to be live at the same time; the maximum number of variables that are live simultaneously while the program runs is called maxlive. In Figure 5 the edges are realized as basic blocks. When there are |V| variables and one more variable x (the (|V|+1)-st) needs to be colored, in the worst case x has an edge to every vertex and needs a new color. Hence it is NP-complete to decide whether k + 1 registers are needed. But this model pays no attention to improvements that can be made before the graph coloring. Sometimes the program structure allows improvements by splitting or coalescing, after which the graph is no longer complicated to color. In this example we cannot use either optimization because the edges are critical: an edge is critical when it raises maxlive or, as in the example, when it causes a cycle. So the interesting part that makes the problem NP-complete lies in the critical edges, but only if maxlive > k or maxlive = k. The cases with maxlive < k are easy to solve, because no spill is needed, and graph coloring is no problem if we know before the coloring starts that every variable will get a register. Maxlive can thus be read as the minimum number of registers needed for allocation without spilling. If there were no critical edges and the optimizations produced a chordal graph, the coloring would take linear time. Hence Chaitin's proof is only relevant for the worst case with critical edges; otherwise an easier way to allocate the registers can be found. Without critical edges, splitting or coalescing could help against maxlive > k by reducing maxlive. These two steps are tried before spilling because such a shuffle, as it is also called, is more time-efficient than a spill. If the shuffle cannot reduce maxlive, the next problem is to keep the number of spilled variables as small as possible.
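Maxlive itself is cheap to compute from the live ranges with an event sweep; a minimal Python sketch (my own illustration, assuming half-open ranges):

    def maxlive(live_ranges):
        """live_ranges: iterable of (start, end) with end exclusive.
        Maximum number of simultaneously live variables: a lower bound
        on the number of registers needed without spilling."""
        events = []
        for start, end in live_ranges:
            events.append((start, +1))   # variable becomes live
            events.append((end, -1))     # variable dies
        live = best = 0
        for _, delta in sorted(events):  # at equal time, deaths (-1) come first
            live += delta
            best = max(best, live)
        return best

    print(maxlive([(0, 4), (1, 3), (2, 6), (5, 8)]))  # 3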

Figure 6: The same example, but with the critical edges split (compare Figure 5) (from [1]).

Figure 6 shows a structure similar to Figure 5, but this time the critical edges are split up: there are three variables u, x_u, y_u for each vertex u and a variable x_uv for each edge. At first it seems we will need more registers than before for the graph coloring, but by inserting a split before or after every basic block the graph becomes 3-colorable, because the critical edges are gone. Every basic block needs three variables and after execution has produced three variables. For example, the first block on the left side sets a, b, and x_ab. The next block on the left side sets y_a and y_b; the registers originally used for a and b are now overwritten by these two, and the values are not lost. The last block needs all three of these variables but no more, because they were stored in the step between the last two blocks of the first program (Figure 5). The coloring is now easier because the variable that in Figure 5 was linked to all the others is now independent and linked only to the node where it is needed. The coloring goes like this: the vertex u gets one color, and the triangle built with x_u and y_u uses the other two colors. u, v, and x_uv build another triangle, but apart from u it is independent of the previous one, so vertex v and x_uv can get the same colors as x_u and y_u. The same works for u, another vertex w, and x_uw. Doing this for each node colors the graph with three colors. Hence we need one register fewer than for the graph in Figure 5. This coloring (the 3-colorability problem) is also NP-complete, which means the complexity is unchanged.

Figure 7: [4]

3.3 The Combined Allocation

3.3.1 Comparison of Three Algorithms Including the Combined Method

In this section we look at the example in Figure 7 to make a simple but very helpful observation. As in the earlier parts, we can split a program into basic blocks, both for a better overview and for the coloring; now the basic blocks themselves are the target of improvements. The example is a basic block with six instructions, executed on a two-stage pipeline with two registers available. The first picture in Figure 8 shows instruction scheduling followed by register allocation. This method is used very often and tries to execute as many steps as possible in parallel. Parallelism is how hardware works efficiently, but in program code something bad can happen: if more registers have to be allocated than are available, a spill instruction is needed, which takes a lot of time compared to using a register. Hence this method sometimes increases maxlive, but it avoids idle slots and therefore achieves a better cycle count. The second picture in Figure 8 shows register allocation followed by instruction scheduling. It stems from a time when registers were scarce and a spill was not really an option; instead of another register, more time is spent. In contrast to the first picture, maxlive is one less, but the method takes two more cycles until program execution finishes, so here we see the trade-off from the other side. The last picture in Figure 8 shows the combination of the two approaches. Its advantage: maxlive is two, as with the second approach, while the cycle time is one less than the second approach's; it does not reach the cycle count of the first approach, but overall it is better than either. We notice that the first method pushes v5 between v2 and v3, so a register for v5 is needed before v2 is finished, which increases maxlive. The second inserts an idle slot between v2 and v3 and between v5 and v6 to avoid a collision, but this increases the cycle time. The third places v4 in the idle slot between v2 and v3 and also schedules v5 late, which gives the best solution.

After this, a model is introduced for better understanding. A basic block consists of a set of instructions V = v1, ..., vn; let DG be the data dependence graph formed over the set V. If two nodes vi and vj are connected by an edge (vi, vj), it means that vi must start before vj starts.

Figure 8: [4]

For every instruction of V an execution time t is needed, and each edge has an inter-instruction latency stating that the next instruction must start at least a specific number of cycles after the previous one has completed. To simplify the example, every instruction needs one cycle and every latency is zero or one. The start of every instruction, and thus the end of all instructions, is specified by a schedule. For every instruction v_i the schedule defines use(v_i) and def(v_i), two sets recording what is read and what is written. Two more sets are needed in this context: the set of producers prod(v) and the set of consumers cons(v), meaning the variable is initialized by its producer and used by its consumers. Consequently, the dependence graph DG has to capture all data flows. The SSA-form shapes the basic blocks so that every variable is defined once and each use refers to that definition. For each schedule, a value range of a virtual register is then defined by a triple (r, v_i, v_j) with the following properties. Here we have a difference to the usual live range of a variable: a virtual register can have more than one value range, because by definition a range covers only one definition and one use, while the variable can have several uses, so every use contributes one value range. The usual live range extends from the definition to the last use. For a single variable, the value ranges cannot overlap. The virtual registers model the fact that hardware may use the same register for input and output without interference between the uses, which the SSA-form does not allow. Next we define the set of spilled value ranges, SVR, and the set of active value ranges. Spilling a value range over a certain time interval means that the range needs no virtual register during that time, but produces overhead through the store and load operations when the value is pushed to main memory and later reloaded into a register. In this way we buy a register for another value range at the cost of time. Clearly, the following two conditions must hold: the number of available registers is always at least the bandwidth of the active value ranges, and the number of available functional units is always at least the number of instructions executing. This yields the combined register allocation and instruction scheduling problem (CRISP): it can be formulated as a minimization problem in which we have to reduce both the number of spills and the completion time.
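To illustrate the value-range notion (a sketch under my own assumptions about the data representation, not the formalization from [4]), each (producer, consumer) pair of a virtual register contributes one value range under a given schedule:

    def value_ranges(defs, uses, schedule):
        """defs: dict register -> producing instruction,
        uses: dict register -> list of consuming instructions,
        schedule: dict instruction -> cycle.
        One value range per (producer, consumer) pair, as a cycle interval."""
        ranges = []
        for reg, producer in defs.items():
            for consumer in uses.get(reg, []):
                ranges.append((reg, schedule[producer], schedule[consumer]))
        return sorted(ranges, key=lambda r: r[1])

    defs = {"r1": "v1", "r2": "v2"}
    uses = {"r1": ["v3", "v5"], "r2": ["v3"]}
    sched = {"v1": 0, "v2": 1, "v3": 2, "v5": 4}
    print(value_ranges(defs, uses, sched))
    # [('r1', 0, 2), ('r1', 0, 4), ('r2', 1, 2)]  -- r1 has two value ranges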
3.3.2 Analyzing the Complexity of CRISP and Improvements

Now we need a closer look at the complexity of CRISP. Register allocation as a whole is still NP-complete, but under some circumstances it is not: instruction scheduling alone, for a basic block with 0/1 latencies on a single functional unit and a given register allocation, is not NP-complete, and with a given schedule, register allocation alone is not NP-complete either. Moreover, if we are content with a near-optimal rather than an optimal solution, this approach brings a significant improvement. To show that CRISP is NP-complete, a reduction from a closely related problem, Feedback Vertex Set (FVS), is a natural choice. In detail, the two problems can be described as follows:

CRISP. Here it is a good idea to use a restricted version of CRISP, called RCRISP, which is defined like the normal CRISP with one difference: all input edge latencies are zero. The problem is to decide whether there is a permutation of the instructions such that the completion time is not bigger than a maximum t.

Feedback Vertex Set (FVS). Given a directed graph G = (W,E) and a positive integer s, decide whether there is a feedback vertex set S ⊆ W of size at most s, where a feedback vertex set is a set of vertices whose removal (along with incident edges) from G results in an acyclic graph.

The Reduction. Given an instance of FVS as a graph G = (W,E) and the integer s, we reduce it to RCRISP. The basic block is built from the vertices of the set W: for every vertex w there are instructions w1, w2, and w3 and a virtual register r_w with def(w1) = use(w2) = use(w3) = r_w and use(w1) = def(w2) = def(w3) = ∅. The only edges in the dependence graph are the following: for each vertex w we add an edge from w1 to w2 and an edge from w1 to w3, and for an edge from v to w in G we connect v3 with w2. Finally we define a cost bound t = 3n + 2s, where n is the number of vertices and s is the integer from the FVS instance; indeed, the total number of instructions in the RCRISP instance is 3n and the number of allowed spills is s. The following two statements are important.

Lemma 1: The RCRISP instance has a solution σ with zero spills if and only if the FVS instance is a DAG (directed acyclic graph). The advantage of a DAG is that its vertices admit a topological numbering f: for two nodes v and w there is an edge from v to w only if f(v) < f(w). We can use this for a schedule for the RCRISP instance that orders the instructions according to the dependence graph. Here a bijection g : V → {1, ..., 3n} is useful that assigns every instruction its position; every instruction thus obtains a time slot, and we get a linear list of the instructions. This list is constructed to respect the order of the dependence graph: because v before w in the graph implies f(v) < f(w) in the schedule ordering, no instruction can end up out of order. No extra or new instructions can be created, because for every node there is only a bounded set of events (def(v) and use(v)). This leads to the next point: no spills are needed.

Theorem 1: There is a polynomial-time reduction from FVS to RCRISP, and hence RCRISP is NP-hard. The reduction can be done in polynomial time, more precisely in time quadratic in the number of vertices. First, it is important to see that FVS and the schedule are connected by a cost relation: a schedule has cost C = 3n + 2N, where N is the number of spilled value ranges. Thus the FVS instance has a feedback vertex set S of size s if and only if the RCRISP instance has a schedule of total cost at most t = 3n + 2s; equivalently, the feedback vertex set has size at most s if and only if there are at most s spills in the schedule of the RCRISP instance. The subgraph obtained by removing the feedback vertex set, W \ S, where W is the vertex set and S the feedback vertex set of size at most s, is a DAG, so it is the perfect candidate from which to build a schedule without any spills. To insert the remaining variables we have to find good places in the list, and this makes the problem complex again. Nevertheless, before it comes to that, we have reduced the number of variables, which is better than having to place all variables this way. Some of the remaining variables may have to be spilled eventually, but that too is less complex, because we start from a spill-free schedule.
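The gadget of this reduction can be written down mechanically. The following Python sketch builds the RCRISP instance as I read the construction above (the data representation is an assumption):

    def fvs_to_rcrisp(vertices, edges, s):
        """Build the RCRISP instance for an FVS instance G=(W,E), bound s.
        Per vertex w: instructions w1, w2, w3 sharing virtual register r_w
        (w1 defines it, w2 and w3 use it); all edge latencies are zero."""
        instrs, dep_edges, defs, uses = [], [], {}, {}
        for w in vertices:
            w1, w2, w3 = f"{w}1", f"{w}2", f"{w}3"
            instrs += [w1, w2, w3]
            defs[f"r_{w}"] = w1
            uses[f"r_{w}"] = [w2, w3]
            dep_edges += [(w1, w2), (w1, w3)]     # w1 before its two uses
        for v, w in edges:
            dep_edges.append((f"{v}3", f"{w}2"))  # edge v->w links the gadgets
        t = 3 * len(vertices) + 2 * s             # cost bound: 3n + 2s
        return instrs, dep_edges, defs, uses, t

    # A 2-cycle a->b->a has a feedback vertex set of size 1, so t = 3*2 + 2*1.
    print(fvs_to_rcrisp(["a", "b"], [("a", "b"), ("b", "a")], 1))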
3.3.3 Approximation Instead of the Optimal Solution

A good way to handle NP-complete problems is to search not for the best solution but for a nearly optimal one. For an instance I, let Opt(I) denote the value of the best solution and S(I) the value of a suboptimal solution; for a minimization problem we require S(I)/Opt(I) < r, where r is the approximation ratio. In practice, a constant-factor approximation algorithm is often a good average-case algorithm. We will see that CRISP is easier to approximate than to solve exactly with one of the earlier algorithms.

But now we look at some drawbacks of the approximation approach. The standard method is to model the problem and then use graph coloring to check whether the graph is K-colorable: the vertices are the live ranges of the variables, and two nodes share an edge if the live ranges overlap at any execution point. The coloring is NP-hard, and if the graph cannot be colored, we must spill variables until it becomes colorable. If we use an approximation with ratio R/(R-1) to find the maximum R-colorable subgraph, there is a problem with accuracy: when we build the subgraph from the original graph G, we drop vertices and edges until the approximation ratio is met, but the subgraph may still contain an NP-hard coloring instance, because not every graph structure becomes easier when some nodes and edges are left out. A further problem concerns the minimization of the spilled variables: optimizing both together is as hard as finding the optimal solution. Under approximation there is not necessarily an equivalence between the graph coloring and the minimization of the spills, so an approximate coloring can yield a bad set of spilled variables. Finally, the graph coloring itself remains NP-complete, and there is no reliable way to avoid it in all cases, which keeps the problem, whether solved optimally or approximately, NP-complete.

This is where approximating CRISP comes in. It is NP-hard to approximate the problem within a factor of O(log n), and the best method presented has a known ratio of Ω(log n), so it seems impossible to find a satisfying algorithm that beats the complexity of the current one. Under this circumstance it is interesting to take a closer look at the relationship between RCRISP and FVS. In the reduction, the FVS instance has a solution of value s if and only if the RCRISP instance has a solution with t = 3n + 2s, where n is the number of vertices of the FVS instance. Since the FVS is a subset of the vertices of the original graph, the cost of the corresponding RCRISP instance is bounded: it lies in the range 3n <= t <= 5n. Hence even a simple greedy solution for RCRISP is an a-approximation for a <= 5/3. This basic idea can be generalized to arbitrary RCRISP and thus CRISP instances.

A general heuristic for CRISP supplements the previous effort. This heuristic is based on an optimal algorithm for spill-code generation with respect to the dependencies of the dependence graph of a given instruction schedule. Here we face a problem similar to the earlier reduction of a graph by an FVS, in other words, removing vertices one by one to first obtain an R-colorable subgraph without any spills. There is a greedy algorithm that works in linear time and finds an optimal solution using a linear scan method: it deletes every node that would cause a spill, so that a spill-free subgraph remains; therefore it is optimal with respect to spill minimization. To summarize, the heuristic uses a combined rank function that orders the instructions into an increasing list without considering the register bound, as sketched below.
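A minimal Python sketch of the greedy spill step just described (my own simplification: ranges are walked in start order and a range is itself spilled when all k registers are busy; the combined rank function of [4] is not modeled):

    import heapq

    def linear_scan_spill(ranges, k):
        """ranges: list of (name, start, end), end exclusive; k registers.
        Greedy linear scan: walk the ranges in start order and mark a
        range as spilled whenever all k registers are occupied."""
        active = []                       # heap of end points of live ranges
        spilled = []
        for name, start, end in sorted(ranges, key=lambda r: r[1]):
            while active and active[0] <= start:
                heapq.heappop(active)     # that range has ended, register freed
            if len(active) < k:
                heapq.heappush(active, end)
            else:
                spilled.append(name)      # would exceed k live ranges
        return spilled

    ranges = [("a", 0, 4), ("b", 1, 3), ("c", 2, 6), ("d", 5, 8)]
    print(linear_scan_spill(ranges, 2))   # ['c']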
After that, the variables to spill are chosen in another walk through the schedule. The worst-case complexity of these two operations is polynomial.

3.3.4 The Experiment

To conclude the solution part, an experiment is presented. In the two tables of Figure 9, the combined heuristic is compared with, first, instruction scheduling followed by register allocation and, second, register allocation followed by instruction scheduling, and a ratio is reported. For the tests, randomly generated DAGs were used on a two-stage pipeline with four, eight, and 16 available registers. A cost ratio greater than 1 means that the combined heuristic has lower cost than the phase-ordered method. The results of the first table show that the combined solution needed fewer spills in all cases.

Figure 9: [4]

Moreover, the cost ratio is 16% to 21% better than with the phase-ordered method. But the phase-ordered solution shows the better makespan, meaning its program schedule needs 13% to 14% less time. In the second table the results are similar except for the makespan: here the combined method is better with four and eight registers, but with 16 registers the phase-ordered solution wins by 4%. The rest is as above; the phase-ordered method performs 4% to 21% worse in the cost ratio and 19% to 35% worse in spills. A new observation is that instruction scheduling seems to be the more important phase: while running register allocation first performs worse in nearly all measurements, scheduling the instructions first yields a more time-efficient schedule. Consequently, instruction scheduling matters more, and the combined solution is the best of the three.

4 Results

The core register allocation problem with SSA is NP-complete, but the individual steps can, in some special cases, improve the allocation. We analyzed the SSA form and looked at these steps. The φ-functions are an important tool of the SSA transformation: they choose for every variable the correct name, and thereby the corresponding value, and they make the transformation work. They are used when generating the SSA-form, so that every variable is defined and used only once; the resulting copy instructions cannot raise the complexity of the allocation problem. The path to the interval graph and then to the post-SSA form brings more structure and also more overhead, but in the lucky case the outcome is a chordal graph that can be colored in polynomial time, which makes the transformation useful. Nevertheless, a graph coloring problem remains at the end of this transformation, which is NP-complete, so the overall problem is still NP-complete. The step-wise examination of the whole allocation problem then showed that it can be split into phases that contribute different complexities to the problem as a whole. Combined with the last part, we see that two parts of the allocation make it NP-complete: the graph coloring and the scheduling of the basic blocks. Coalescing and splitting are two good instruments to prevent a spill, which is the next improvement; here the graph structure is improved in a way similar to the SSA method. Looking at the problem, we also noticed that the hard case only arises when the number of registers is smaller than the number of variables that have to be allocated. While Chaitin et al. concentrated on the graph coloring, we concentrated on the improvements that can be made before the complex coloring starts, and with some small changes the coloring is not always NP-complete. This model also exhibited the critical edges that cause the register pressure. Critical edges appear in all the models, but they do not always look the same: in SSA they correspond to the set of spills, and in the combined register allocation and instruction scheduling it was first the remainder of the subgraph that was then spilled. Finally, CRISP also plays an important role when the improvements have to be made. Here the first result was that the combined allocation beats the two separate ones. But this method also hits NP-completeness in its last step and cannot reduce the complexity of the problem. A further, more positive point is that it produces a schedule for the instructions, which is not always NP-complete to handle. It is in fact similar to the SSA approach in that it tries to reorder the program and build a better problem structure.

5 Conclusion

We looked at several different methods to improve the speed of register allocation and saw that each method works on a special set of cases and improves the solution there. In the end, however, it is not possible to reduce the complexity for arbitrary input. With the help of these analyses we identified which parts of a problem instance keep register allocation NP-complete.

References

[1] Florent Bouchez, Alain Darte, Christophe Guillon & Fabrice Rastello (2006): Register allocation: what does the NP-completeness proof of Chaitin et al. really prove? Or revisiting register allocation: why and how. In: LCPC 2006, Springer, pp. 283-298.
[2] Gregory J. Chaitin, Marc A. Auslander, Ashok K. Chandra, John Cocke, Martin E. Hopkins & Peter W. Markstein (1981): Register allocation via coloring. Computer Languages 6(1), pp. 47-57.

[3] Sebastian Hack, Daniel Grund & Gerhard Goos (2006): Register allocation for programs in SSA-form. In: CC 2006, Springer, pp. 247-262.

[4] Rajeev Motwani, Krishna V. Palem, Vivek Sarkar & Salem Reyen (1995): Combining register allocation and instruction scheduling. Courant Institute, New York University.

[5] Fernando Magno Quintao Pereira & Jens Palsberg (2006): Register allocation after classical SSA elimination is NP-complete. In: International Conference on Foundations of Software Science and Computation Structures, Springer, pp. 79-93.