Code Compaction Using Post-Increment/Decrement Addressing Modes


Daniel Golovin and Michael De Rosa
{dgolovin, mderosa}@cs.cmu.edu
Department of Computer Science, Carnegie Mellon University. This write-up was prepared for Optimizing Compilers, course 15-745, Spring 2005.

Abstract

During computation, locality of reference is often observed, and can be exploited to achieve performance increases in several ways. This locality is an artifact of the computational abstractions and architectures that we use, such as iterating over arrays or encapsulating tasks into distinct functions with local state. With this in mind, some architectures have been created with features designed specifically to exploit locality. In this paper, we extend the work of [1] to benefit as much as possible from one such feature: post-increment and post-decrement addressing modes. With careful placement of stack variables in memory, we can minimize the amount of necessary address arithmetic. This problem is NP-complete, and we present a 2-approximation algorithm for the case of a single address register, along with experimental results.

1 Introduction

Some architectures have post-increment and post-decrement addressing modes, which allow the following two instructions to be executed as one: $(v \leftarrow \mathrm{load}(r);\ r \leftarrow r + 1)$, and similarly for $(v \leftarrow \mathrm{load}(r);\ r \leftarrow r - 1)$. To exploit these addressing modes, variables should be laid out in memory so that, as often as possible, (temporally) consecutive accesses correspond to (spatially) consecutive address locations. Proper memory layout will result in code that is both smaller and faster.

Following the formulation of Liao et al. [1], we use basic blocks of code to define an access graph on the variables of the block, and then seek a maximum weight path cover of the graph. We first define the access graph, and then the Max Weight Path Cover problem.

Definition 1. The Access Sequence of a basic block $B$ is the sequence of variables accessed in $B$. It is defined as follows. The access sequence of $a \leftarrow \mathit{op}\ b$ (e.g. $a = +b$) is $ab$, and that of $a \leftarrow b\ \mathit{op}\ c$ is $abc$. If $B$ is a sequence of commands $c_1; c_2; \ldots; c_k$ and $c_i$ has access sequence $s_i$, then the access sequence of $B$ is $s_1 s_2 \cdots s_k$.

Definition 2. The Access Graph $G = (V, E)$ of a basic block $B$ is an undirected graph with vertex set $V$ equal to the variables of $B$, with edge $(u, v)$ iff $u$ and $v$ are adjacent somewhere in the access sequence $\sigma$ of $B$. Each edge $(u, v)$ is weighted with the number of adjacent occurrences of $u$ and $v$ in $\sigma$.
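Definitions 1 and 2 translate almost directly into code. The following is a minimal sketch, assuming each statement is already given as a tuple of variable names in access order; the function names `access_sequence` and `access_graph` are illustrative and not part of the original write-up. Consecutive accesses to the same variable are skipped here, on the assumption that they need no address change.

```python
from collections import Counter


def access_sequence(statements):
    """Concatenate the per-statement access sequences s_1 s_2 ... s_k."""
    sequence = []
    for stmt in statements:
        sequence.extend(stmt)
    return sequence


def access_graph(sequence):
    """Return {frozenset({u, v}): weight} for an access sequence.

    The weight of an edge counts how often u and v appear next to each
    other in the sequence (in either order). Pairs with u == v are
    ignored, which is an assumption of this sketch.
    """
    weights = Counter()
    for u, v in zip(sequence, sequence[1:]):
        if u != v:
            weights[frozenset((u, v))] += 1
    return weights


# Hypothetical block: "a <- b op c; a <- op b", written as access tuples.
block = [("a", "b", "c"), ("a", "b")]
graph = access_graph(access_sequence(block))
for edge, weight in sorted(graph.items(), key=lambda kv: sorted(kv[0])):
    print(sorted(edge), weight)
```

For this block the access sequence is `a b c a b`, so the edge between `a` and `b` gets weight 2 and the other two edges get weight 1.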

Figure 1: An MWPC instance, with optimal solution.

The Max Weight Path Cover problem is to cover the graph $G$ with a set of node-disjoint paths in $G$ of maximum total edge weight. The formal definition is as follows.

Definition 3. The Max Weight Path Cover problem (MWPC) is, given an edge-weighted, undirected graph $G = (V, E)$, find a partition of $V$ into ordered sets $\{P_1, P_2, \ldots, P_k\}$ such that each $P_i$ is a path in $G$. That is, letting $f(i) := |P_i|$, we can write each $P_i$ as $\{v_{i1}, v_{i2}, \ldots, v_{if(i)}\}$ in a way that $(v_{ij}, v_{i(j+1)}) \in E[G]$ for all $1 \le j < f(i)$. The objective is then to maximize the total weight of all edges in the paths, namely $\sum_{i=1}^{k} \sum_{j=1}^{f(i)-1} c(v_{ij}, v_{i(j+1)})$. A solution is called a path cover.

Once the path cover is found, the paths are extended into a linear ordering to be placed in memory in the obvious way (i.e. $v_{11}, v_{12}, \ldots, v_{1f(1)}, v_{21}, v_{22}, \ldots, v_{2f(2)}, v_{31}, \ldots$). Notice how MWPC captures exactly the savings we obtain in code size. Unfortunately, MWPC is NP-complete. However, we were able to obtain an approximation algorithm, detailed in Section 4. We have implemented this approach on the C6X architecture. Experimental results appear in Section 5.

2 Relevant Work

Liao et al. [1] introduce the reduction of the single offset assignment (SOA) problem to MWPC. They present a heuristic based on Kruskal's maximum spanning tree algorithm. Their heuristic sorts the edges in non-increasing order of weight, and then, in this order, inserts each edge that does not increase the degree of any vertex above two. Liao et al. give no approximation guarantee for their heuristic.

3 Adapting SOA to Hyperblocks

Liao et al. [1] assume that the IR of the input procedure is logically divided into basic blocks. As the Pegasus/CASH IR uses hyperblocks to support predicated execution of multiple simultaneous but mutually exclusive control paths, it was necessary to modify the basic SOA algorithm to account for this. To do so, we define the access graph of a hyperblock differently from that of a basic block.

Definition 4. The Access Graph of a hyperblock $H$ is the weighted graph $G = (V, E)$, with vertex set $V$ being the set of variables accessed in $H$, and edge $(u, v)$ with weight $n$ occurring iff there are $n$ distinct control flow paths leading from $u$ to $v$ or from $v$ to $u$, with no intervening variable accesses.

The access graph of the procedure can then be found by merging the access graphs of all hyperblocks, using the same techniques as presented in Liao et al.

4 Finding 2-Approximate Path Covers

We find good path covers using the maximum weight cycle cover. We solve the following problem: given an undirected, edge-weighted graph $G = (V, E)$, find a permutation $\sigma$ on $V$ that maximizes $\sum_{v \in V} w(v, \sigma(v))$, where $w(u, v)$ is the weight of edge $(u, v)$ if it exists in $G$, and zero otherwise. For each cycle of $\sigma$, delete all non-edges in the cycle. If any cycles remain, delete the minimum weight edge of each. Return the resulting edges as the path cover.

Note that deleting the minimum weight edge from a length-$k$ cycle reduces its weight by at most a $1/k$ fraction, and all cycles of non-zero weight have length at least two, so the output has weight at least half the cycle cover weight. Yet the optimum cycle cover has weight at least that of the maximum weight path cover, and thus we obtain a 2-approximation.

To find the optimum permutation $\sigma$, we reduce the problem to the max weight matching problem on the following complete bipartite graph $B$: given $G = (V, E)$ with weights $w : E \to \mathbb{N}$, construct sets $X, Y$ with $|X| = |Y| = |V|$. Let $x, y$ be bijections from $V$ to $X$ and $Y$ respectively. For each $u, v \in V$, add edges $(x(u), y(v))$ and $(x(v), y(u))$ to $B$ of weight $w(u, v)$, where as before, $w(u, v)$ is the weight of edge $(u, v)$ if it exists in $G$, and zero otherwise.
If $M$ is a max weight matching in $B$, then the optimal permutation is defined by $\sigma(u) = v$ whenever $(x(u), y(v)) \in M$.

5 Experimental Results

Due to a preexisting implementation issue with the provided register allocator's handling of spills, we were unable to benchmark our algorithm on sizable candidate functions. Among the functions we were able to test that used frame variables, we found an average code size reduction of 2.0%, corresponding to the conversion of 28.6% of all variable accesses to post-increment/post-decrement instructions. This compares well with the results of Liao et al., who cite figures of 5% and 20% respectively for SOA. In none of our test cases were the final procedures longer than their unoptimized counterparts. The compile-time cost of the optimization was less than 0.01 seconds in all cases, so the optimization adds no significant compile-time cost.

Why does the Liao et al. heuristic perform comparably to the 2-approximation algorithm? Though Liao et al. give no approximation guarantee for their heuristic, it in fact has an approximation guarantee of exactly two, which we prove in the appendix. However, it does not lend itself to improved algorithms the way maximum cycle cover approaches do, and future work may yield practical improvements based on our algorithm.
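The cycle cover rounding of Section 4 can be sketched in a few dozen lines. This is an illustrative sketch, not the implementation evaluated above: the graph is assumed to be a dict mapping `frozenset({u, v})` to a positive weight, the names `max_weight_permutation` and `approx_path_cover` are hypothetical, and the matching step is solved with a textbook O(n^3) Hungarian method on the negated weights, chosen here only to keep the example dependency-free.

```python
def max_weight_permutation(vertices, w):
    """Find sigma maximizing sum of w(v, sigma(v)) via the Hungarian method.

    Rows and columns play the roles of X and Y in the bipartite graph B;
    costs are negated weights, so a min-cost assignment maximizes weight.
    """
    n = len(vertices)
    cost = [[-w(vertices[i], vertices[j]) for j in range(n)] for i in range(n)]
    INF = float("inf")
    u = [0] * (n + 1)      # row potentials (1-indexed)
    v = [0] * (n + 1)      # column potentials (1-indexed)
    p = [0] * (n + 1)      # p[j] = row currently matched to column j
    way = [0] * (n + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (n + 1)
        used = [False] * (n + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, n + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(n + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:                 # flip the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return {vertices[p[j] - 1]: vertices[j - 1] for j in range(1, n + 1)}


def approx_path_cover(vertices, weights):
    """Round a max weight cycle cover into a path cover (set of edges)."""
    def w(a, b):
        return weights.get(frozenset((a, b)), 0)   # zero for non-edges

    sigma = max_weight_permutation(vertices, w)
    chosen, seen = set(), set()
    for start in vertices:
        cycle, x = [], start
        while x not in seen:           # walk one cycle of sigma
            seen.add(x)
            cycle.append((x, sigma[x]))
            x = sigma[x]
        real = [e for e in cycle if w(*e) > 0]      # drop non-edges
        if cycle and len(real) == len(cycle):       # still a closed cycle
            real.remove(min(real, key=lambda e: w(*e)))
        chosen.update(frozenset(e) for e in real)
    return chosen


# Hypothetical triangle instance with weights 5, 4 and 3.
triangle = {frozenset("ab"): 5, frozenset("bc"): 4, frozenset("ac"): 3}
cover = approx_path_cover(["a", "b", "c"], triangle)
print(sorted(sorted(e) for e in cover), sum(triangle[e] for e in cover))
```

On the triangle the best permutation is a 3-cycle of weight 12; removing its lightest edge leaves the path through `a`, `b`, `c` with weight 9. A 2-cycle of the permutation uses one graph edge twice, so removing one copy keeps that edge once, which is where the factor of two in the analysis is tight.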

6 Future Work

There are several directions for future work. Minimizing the amount of address arithmetic in the case of several address registers does not appear to reduce cleanly to a graph-theoretic problem such as MWPC. It remains to find a fast approximation algorithm for it, if possible, to handle the general case.

Various improvements to the single address register case are possible. Using the approximation algorithm to find an initial solution and then employing, e.g., local search may significantly improve performance. Further, ideas from sophisticated algorithms for Max Weight TSP can yield improved approximation guarantees, but are likely too slow in practice. This remains to be investigated.

Lastly, when dealing with pieces of code above the level of basic blocks, profiling information could be used to weight the probability of consecutive accesses along an edge in the access graph. Giving more weight to hot edges in the access graph should result in faster code, although this may result in longer code than the original approach.

7 Conclusions

We successfully implemented a novel extension to the work of Liao et al., allowing their storage assignment scheme to function natively on a hyperblock-based representation. We also proved bounds on both their heuristic allocation scheme and our more principled algorithm. The algorithm provides results comparable to those reported in the original paper, and requires a very small investment of compilation time. While in its current state it provides only a modest improvement in code size, generalization of the algorithm to use multiple address registers or profiling information could provide more significant gains.

References

[1] Stan Liao, Srinivas Devadas, Kurt Keutzer, Steven Tjiang, and Albert Wang. Storage assignment to decrease code size. ACM Trans. Program. Lang. Syst., 18(3):235–253, 1996.

A Additional Proofs

Theorem 1. The Liao et al. heuristic has an approximation guarantee of exactly 2.

Proof. First, we sketch the lower bound. Let $G$ be the following tree: a line on $k+1$ vertices (the backbone of $G$), say $v_0, v_1, \ldots, v_k$, with edges $(v_i, v_{i+1})$ of unit weight, and two leaves hanging off of each $v_i$ for $0 < i < k$ via edges of weight $1 - \epsilon$. The optimal solution, consisting of all edges not in the backbone, has weight $2(k-1)(1-\epsilon)$, while the heuristic returns the backbone, whose $k$ unit-weight edges have total weight $k$. As $k \to \infty$ and $\epsilon \to 0$, the ratio approaches two. So the approximation factor is no better than two.

Now we prove the upper bound. Fix $G$, the optimal path cover $C^*$, and the output of the heuristic, $L$. Consider an edge $e \in C^*$, $e \notin L$, of weight $w(e)$. Let $e = (u, v)$. Since $e \notin L$, by the time we process $e$ in the list of edges ordered by non-increasing weight when running the heuristic, one of $u$ or $v$ already has degree two. WLOG, let it be $u$. Then the two edges of $L$ incident on $u$ each have weight at least $w(e)$. We pay for such an edge $e$ using a charging scheme. Initially all edges $e'$ of $L$ have charge $c(e') = 0$. To pay for $e$, place a charge of $w(e)/2$ on each edge of $L$ incident to $u$. Next consider $e \in C^* \cap L$. Pay for it by placing a charge of $w(e)$ on $e$. Let $c(L) := \sum_{e \in L} c(e)$ be the charge on $L$. Clearly, $w(C^*) \le c(L)$, since each edge of $C^*$ has had its weight paid for. We claim that for each $e \in L$, $c(e) \le 2w(e)$, and thus $c(L) \le 2w(L)$, and so $w(L) \ge w(C^*)/2$.

Consider $e \in L$, $e \notin C^*$. An edge $(u, v)$ of $C^*$ charges an edge $e \in L$ only if $e$ is incident to $u$ or $v$, and charges it at most $w(e)/2$ if $e \ne (u, v)$. Since the degree of any node in $C^*$ is at most two, $e$ can be charged by at most four such edges of $C^*$, for a total charge of $c(e) \le 4 \cdot \frac{1}{2} w(e) = 2w(e)$. Next consider $e \in L \cap C^*$. This edge is charged $w(e)$ by its copy in $C^*$, but can have at most two edges of $C^*$ sharing exactly one vertex with it. Each of these charges at most $w(e)/2$, for a total charge of $c(e) \le w(e) + 2 \cdot \frac{1}{2} w(e) = 2w(e)$. So $w(C^*) = c(L) \le 2w(L)$ and we are done.
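The lower-bound instance from the proof of Theorem 1 is easy to check numerically. The sketch below is illustrative only: `liao_style_greedy` is a hypothetical reading of the heuristic described in Section 2 (edges in non-increasing weight order, degree at most two), and the union-find cycle check is an assumption borrowed from Kruskal's algorithm that the text does not spell out. On a tree it never triggers, so it does not affect this experiment.

```python
def liao_style_greedy(edges):
    """Greedy path cover: edges is a list of (weight, u, v) tuples."""
    parent = {}

    def find(x):
        # Union-find with path halving; used only to reject cycles.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    degree = {}
    chosen = []
    # sorted() is stable, so ties keep their input order.
    for wt, u, v in sorted(edges, key=lambda e: -e[0]):
        if degree.get(u, 0) >= 2 or degree.get(v, 0) >= 2:
            continue                      # would push a degree above two
        ru, rv = find(u), find(v)
        if ru == rv:
            continue                      # would close a cycle (assumed rule)
        parent[ru] = rv
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
        chosen.append((wt, u, v))
    return chosen


def lower_bound_tree(k, eps):
    """Backbone v0..vk with unit edges, plus two leaves per inner vertex."""
    edges = [(1.0, f"v{i}", f"v{i + 1}") for i in range(k)]
    for i in range(1, k):
        edges.append((1.0 - eps, f"v{i}", f"a{i}"))
        edges.append((1.0 - eps, f"v{i}", f"b{i}"))
    return edges


k, eps = 50, 0.01
edges = lower_bound_tree(k, eps)
greedy_weight = sum(wt for wt, _, _ in liao_style_greedy(edges))
leaf_weight = sum(wt for wt, _, _ in edges if wt < 1.0)  # all non-backbone edges
ratio = leaf_weight / greedy_weight
print(greedy_weight, leaf_weight, ratio)
```

The greedy pass accepts exactly the $k$ backbone edges and then rejects every leaf edge, because each inner backbone vertex already has degree two. The ratio between the leaf-edge solution and the greedy output is $2(k-1)(1-\epsilon)/k$, which tends to two as $k$ grows and $\epsilon$ shrinks.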