Code Compaction Using Post-Increment/Decrement Addressing Modes

Daniel Golovin and Michael De Rosa
{dgolovin, mderosa}@cs.cmu.edu
Department of Computer Science, Carnegie Mellon University. This write-up was prepared for Optimizing Compilers, course 15-745, Spring 2005.

Abstract

During computation, locality of reference is often observed, and can be exploited to achieve performance increases in several ways. This locality is an artifact of the computational abstractions and architectures that we use, such as iterating over arrays or encapsulating tasks into distinct functions with local state. With this in mind, some architectures have been created with features designed specifically to exploit locality. In this paper, we extend the work of [1] to benefit as much as possible from one such feature: post-increment and post-decrement addressing modes. With careful placement of stack variables in memory, we can minimize the amount of necessary address arithmetic. This problem is NP-complete, and we present a 2-approximation algorithm for the case of a single address register, along with experimental results.

1 Introduction

Some architectures have post-increment and post-decrement addressing modes, which allow the following two instructions to be executed as one: (v ← load(r); r ← r + 1), and similarly (v ← load(r); r ← r − 1). To exploit these addressing modes, variables should be laid out in memory so that, as often as possible, (temporally) consecutive accesses correspond to (spatially) consecutive address locations. Proper memory layout will result in code that is both smaller and faster. Following the formulation of Liao et al. [1], we use basic blocks of code to define an access graph on the variables of the block, and then seek a maximum weight path cover of the graph. We first define the access graph, and then the Max Weight Path Cover problem.

Definition 1. The Access Sequence of a basic block B is the sequence of variables accessed in B. It is defined as follows. The access sequence of a = op b (e.g., a = +b) is ab, and that of a = b op c is abc. If B is a sequence of commands c_1; c_2; ...; c_k, and c_i has access sequence s_i, then the access sequence of B is s_1 s_2 ... s_k.

Definition 2. The Access Graph G = (V, E) of a basic block B is an undirected graph with vertex set V equal to the variables of B, with edge (u, v) iff u and v are adjacent somewhere in the access sequence σ of B. Each edge (u, v) is weighted with the number of adjacent occurrences of u and v in σ.
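As a concrete illustration, the two definitions above can be sketched in a few lines of Python (the function and variable names here are our own, not from the paper): the access graph is simply a weighted count of adjacent pairs in the access sequence.

```python
from collections import Counter

def access_graph(access_sequence):
    """Build the access graph of a basic block (Definitions 1 and 2):
    vertices are the variables, and edge {u, v} is weighted by the number
    of adjacent occurrences of u and v in the access sequence."""
    weights = Counter()
    for u, v in zip(access_sequence, access_sequence[1:]):
        if u != v:  # a variable adjacent to itself needs no address arithmetic
            weights[frozenset((u, v))] += 1
    return weights

# Per Definition 1, the block  a = b + c; d = a + b  has access
# sequence  a b c d a b,  giving edges (a,b):2, (b,c):1, (c,d):1, (d,a):1.
g = access_graph(["a", "b", "c", "d", "a", "b"])
```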
Figure 1: An MWPC instance, with optimal solution.

The Max Weight Path Cover problem is to cover the graph G with a set of node-disjoint paths in G of maximum total edge weight. The formal definition is as follows.

Definition 3. The Max Weight Path Cover problem (MWPC) is, given an edge-weighted, undirected graph G = (V, E), to find a partition of V into ordered sets {P_1, P_2, ..., P_k} such that each P_i is a path in G. That is, letting f(i) := |P_i|, we can write each P_i as {v_{i1}, v_{i2}, ..., v_{if(i)}} in a way that (v_{ij}, v_{i(j+1)}) ∈ E(G) for all 1 ≤ j < f(i). The objective is then to maximize the total weight of all edges in the paths, namely Σ_{i=1}^{k} Σ_{j=1}^{f(i)−1} c(v_{ij}, v_{i(j+1)}). A solution is called a path cover.

Once the path cover is found, the paths are extended into a linear ordering to be placed in memory in the obvious way (i.e., v_{11}, v_{12}, ..., v_{1f(1)}, v_{21}, v_{22}, ..., v_{2f(2)}, v_{31}, ...). Notice how MWPC captures exactly the savings we obtain in code size. Unfortunately, MWPC is NP-complete. However, we were able to obtain an approximation algorithm, detailed in Section 4. We have implemented this approach on the C6X architecture. Experimental results appear in Section 5.

2 Relevant Work

Liao et al. [1] introduce the reduction of the single offset assignment (SOA) problem to MWPC. They present a heuristic based on Kruskal's maximum spanning tree algorithm. Their heuristic sorts the edges in non-increasing order of weight, and then in this order inserts each edge that neither increases the degree of any vertex above two nor, as in Kruskal's algorithm, closes a cycle. Liao et al. give no approximation guarantee for their heuristic.

3 Adapting SOA to Hyperblocks

Liao et al. [1] assume that the IR of the input procedure is logically divided into basic blocks. As the Pegasus/CASH IR uses hyperblocks to support predicated execution of multiple simultaneous
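For reference, the Kruskal-style heuristic of Liao et al. can be rendered as follows. This is our own Python sketch, not code from either paper; a small union-find structure rejects cycle-closing edges so that the chosen edges always form node-disjoint paths.

```python
from collections import Counter

def liao_heuristic(weights):
    """Greedy MWPC heuristic: scan edges in non-increasing weight order,
    keeping an edge iff every vertex stays at degree <= 2 and no cycle is
    closed. `weights` maps frozenset({u, v}) -> edge weight."""
    parent = {}

    def find(v):  # union-find with path compression
        parent.setdefault(v, v)
        while parent[v] != v:
            parent[v] = parent[parent[v]]
            v = parent[v]
        return v

    degree, chosen = Counter(), []
    for edge in sorted(weights, key=weights.get, reverse=True):
        u, v = tuple(edge)
        if degree[u] < 2 and degree[v] < 2 and find(u) != find(v):
            parent[find(u)] = find(v)  # merge the two path fragments
            degree[u] += 1
            degree[v] += 1
            chosen.append(edge)
    return chosen
```

On a small instance with edges (a,b):4, (b,c):3, (b,d):5, (c,d):2, the heuristic keeps (b,d), (a,b), then (c,d), rejecting (b,c) because b already has degree two; the result is the single path a-b-d-c of weight 11.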
but mutually exclusive control paths, it was necessary to modify the basic SOA algorithm to account for this. To permit this, we define the access graph of a hyperblock differently than that of a basic block.

Definition 4. The Access Graph of a hyperblock H is the weighted graph G = (V, E), with vertex set V being the set of variables accessed in H, and edge (u, v) with weight n occurring iff there are n distinct control flow paths leading from u to v or from v to u with no intervening variable accesses.

The access graph of the procedure can then be found by merging the access graphs of all hyperblocks, using the same techniques as presented in Liao et al.

4 Finding 2-Approximate Path Covers

We find good path covers using the maximum weight cycle cover. We solve the following problem: given an undirected, edge-weighted graph G = (V, E), find a permutation σ on V that maximizes Σ_{v∈V} w(v, σ(v)), where w(u, v) is the weight of edge (u, v) if it exists in G, and zero otherwise. For each cycle of σ, delete all non-edges in the cycle. If any cycles remain, delete the minimum weight edge of each. Return the resulting edges as the path cover. Note that deleting the minimum weight edge from a length-k cycle reduces its weight by at most a 1/k fraction, and all non-trivial cycles have length at least two, so the output has weight at least half the cycle cover weight. Yet the optimum cycle cover has weight at least that of the maximum weight path cover, and thus we obtain a 2-approximation.

To find the optimum permutation σ, we reduce the problem to max weight matching on the following complete bipartite graph B: given G = (V, E) with weights w : E → ℕ, construct sets X and Y with |X| = |Y| = |V|. Let x and y be bijections from V to X and Y respectively. For each u, v ∈ V, add edges (x(u), y(v)) and (x(v), y(u)) to B of weight w(u, v), where, as before, w(u, v) is the weight of edge (u, v) if it exists in G, and zero otherwise.
If M is a max weight matching in B, then the optimal permutation is defined by σ(u) = v whenever (x(u), y(v)) ∈ M.

5 Experimental Results

Due to a preexisting implementation issue with the provided register allocator's handling of spills, we were unable to benchmark our algorithm on sizable candidate functions. Among the functions we were able to test that used frame variables, we found an average code reduction of 2.0%, corresponding to the conversion of 28.6% of all variable accesses to post-increment/post-decrement instructions. This compares well with the results of Liao et al., who cite figures of 5% and 20% respectively for SOA. In none of our test cases were the final procedures longer than their unoptimized counterparts. The compile-time cost of the optimization was less than 0.01 seconds in all cases, meaning that there was no significant cost associated with running the algorithm.

Why does the Liao et al. heuristic perform comparably to the 2-approximation algorithm? Though Liao et al. give no approximation guarantee for their heuristic, it in fact has an approximation guarantee of exactly two, which we prove in the appendix. However, it does not lend itself to improved algorithms the way maximum cycle cover approaches do, and future work may yield practical improvements based on our algorithm.
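The whole procedure of Section 4 can be sketched as follows. This is our own illustrative Python, with one deliberate simplification: the optimal permutation σ (the max weight cycle cover) is found here by brute force over all permutations, whereas the paper's reduction to max weight bipartite matching finds the same σ in polynomial time, so this sketch is only practical for tiny graphs.

```python
from itertools import permutations

def path_cover_2approx(vertices, weights):
    """2-approximate Max Weight Path Cover via a max-weight cycle cover.
    `weights` maps frozenset({u, v}) -> weight; absent pairs have weight 0.
    NOTE: sigma is found by brute force here for illustration only."""
    def w(u, v):
        return weights.get(frozenset((u, v)), 0) if u != v else 0

    # Step 1: find the permutation sigma maximizing sum_v w(v, sigma(v)).
    best = max(permutations(vertices),
               key=lambda p: sum(w(u, v) for u, v in zip(vertices, p)))
    sigma = dict(zip(vertices, best))

    # Step 2: in each cycle of sigma, drop the non-edges; if a cycle
    # survives intact, drop its minimum-weight edge to break it.
    kept, seen = [], set()
    for start in vertices:
        if start in seen:
            continue
        cycle, v = [], start
        while v not in seen:
            seen.add(v)
            cycle.append((v, sigma[v]))
            v = sigma[v]
        real = [e for e in cycle if w(*e) > 0]
        if len(real) == len(cycle) and len(real) > 1:
            real.remove(min(real, key=lambda e: w(*e)))
        kept.extend(real)
    return kept
```

On the 4-cycle instance with edges (a,b):4, (b,c):3, (c,d):2, (d,a):5, the best cycle cover pairs up {a,d} and {b,c} (weight 16), and breaking each 2-cycle keeps edges (d,a) and (b,c) of total weight 8. The optimal path cover d-a-b-c has weight 12, so the output is within the guaranteed factor of two.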
6 Future Work

There are several directions for future work. Minimizing the amount of address arithmetic in the case of several address registers does not appear to cleanly reduce to a graph-theoretic problem such as MWPC. It remains to find a fast approximation algorithm for it, if possible, to handle the general case. Various improvements to the single address register case are possible. Using the approximation algorithm to find an initial solution and then employing, e.g., local search may significantly improve performance. Further, ideas from sophisticated algorithms for Max Weight TSP can yield improved approximation guarantees, but are likely too slow in practice. This remains to be investigated. Lastly, when dealing with pieces of code above the level of basic blocks, profiling information could be used to weight the probability of consecutive accesses along an edge in the access graph. Giving more weight to hot edges in the access graph should result in faster code, although this may result in longer code than the original approach.

7 Conclusions

We were able to successfully implement a novel extension to the work of Liao et al., allowing their storage assignment scheme to function natively on a hyperblock-based representation. We also proved bounds on both their heuristic allocation scheme and our more principled algorithm. The algorithm provides comparable results to those reported in the original paper, and requires a very small investment of compilation time. While in its current state it provides only a modest improvement in code size, generalization of the algorithm to use multiple address registers or profiling information could easily provide more significant gains.

References

[1] Stan Liao, Srinivas Devadas, Kurt Keutzer, Steven Tjiang, and Albert Wang. Storage assignment to decrease code size. ACM Trans. Program. Lang. Syst., 18(3):235-253, 1996.

A Additional Proofs

Theorem 1. The Liao et al.
heuristic has an approximation guarantee of exactly 2.

Proof. First, we sketch the lower bound. Let G be the following tree: a line on k+1 vertices (the back-bone of G), say v_0, v_1, ..., v_k, with edges (v_i, v_{i+1}) of unit weight, and two leaves hanging off of each v_i for 0 < i < k via edges of weight 1 − ε. The optimal solution, consisting of all edges not in the back-bone, has weight 2(k − 1)(1 − ε), while the heuristic returns the back-bone, of weight k. As k → ∞ and ε → 0, the ratio approaches two. So the approximation factor is no better than two.

Now we prove the upper bound. Fix G, the optimal path cover C*, and the output of the heuristic, L. Consider an edge e ∈ C*, e ∉ L, of weight w(e). Let e = (u, v). Since e ∉ L, by the time we process e in the list of edges ordered in non-increasing weight when running the heuristic, one of u or v already has degree two. WLOG, let it be u. Then the two edges of L incident on u each have weight at least w(e). We pay for such an edge e using a charging scheme. Initially all edges e′ of L have charge c(e′) = 0. To pay for e, place a charge of w(e)/2 on each edge of L incident to u. Next consider e ∈ C* ∩ L. Pay for it by placing a charge of w(e) on e. Let c(L) := Σ_{e∈L} c(e) be the charge on L. Clearly, w(C*) ≤ c(L), since each edge of C* has had its weight paid for. We claim that for each e ∈ L, c(e) ≤ 2w(e), and thus c(L) ≤ 2w(L), and so w(L) ≥ w(C*)/2.
Consider e ∈ L, e ∉ C*. An edge (u, v) of C* charges an edge e ∈ L only if e is incident to u or v, and charges it at most w(e)/2 if e ≠ (u, v). Since the degree of any node in C* is at most two, e can be charged by at most four such edges of C*, for a total charge of c(e) ≤ 4 · (1/2)w(e) = 2w(e). Next consider e ∈ L ∩ C*. This edge is charged w(e) by its copy in C*, but can have at most two edges of C* sharing exactly one vertex with it. Each of these charges at most w(e)/2, for a total charge of c(e) ≤ w(e) + 2 · (1/2)w(e) = 2w(e). So w(C*) ≤ c(L) ≤ 2w(L), and we are done.
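The lower-bound instance in the proof is easy to check numerically. The sketch below (our own Python; no cycle check is needed in the greedy scan, since the instance is a tree) builds the tree from Theorem 1 and confirms that the heuristic collects only the back-bone, so the ratio approaches two.

```python
from collections import Counter

def theorem1_tree(k, eps):
    """Back-bone v0..vk with unit-weight edges, plus two leaves of
    weight 1 - eps hanging off each internal vertex v1..v(k-1)."""
    w = {}
    for i in range(k):
        w[frozenset((f"v{i}", f"v{i+1}"))] = 1.0
    for i in range(1, k):
        w[frozenset((f"v{i}", f"l{i}a"))] = 1.0 - eps
        w[frozenset((f"v{i}", f"l{i}b"))] = 1.0 - eps
    return w

def greedy_weight(w):
    """Weight gathered by the Kruskal-style heuristic under the degree
    cap of two (all back-bone edges are accepted first; every leaf edge
    is then rejected because its internal endpoint has degree two)."""
    deg, total = Counter(), 0.0
    for e in sorted(w, key=w.get, reverse=True):
        u, v = tuple(e)
        if deg[u] < 2 and deg[v] < 2:
            deg[u] += 1
            deg[v] += 1
            total += w[e]
    return total

k, eps = 1000, 1e-6
w = theorem1_tree(k, eps)
leaf_cover = 2 * (k - 1) * (1 - eps)   # the path cover using only leaf edges
ratio = leaf_cover / greedy_weight(w)  # approaches 2 as k grows and eps -> 0
```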