New Complexity Results on Array Contraction and Related Problems


Laboratoire de l'Informatique du Parallélisme
École Normale Supérieure de Lyon
Unité Mixte de Recherche CNRS-INRIA-ENS LYON n° 5668

New Complexity Results on Array Contraction and Related Problems
(Extension to Research Report 22-7)

Alain Darte and Guillaume Huard

October 2002

Research Report N° 22-4

École Normale Supérieure de Lyon
46 Allée d'Italie, Lyon Cedex 07, France
E-mail: lip@ens-lyon.fr

New Complexity Results on Array Contraction and Related Problems
(Extension to Research Report 22-7)

Alain Darte and Guillaume Huard

October 2002

Abstract

Array contraction is an optimization that transforms array variables into scalar variables within a loop. While the opposite transformation, scalar expansion, is used to enable parallelism (at the price of a larger memory size), array contraction is used to save memory by removing temporary arrays and to increase locality. Several heuristics have already been proposed to perform array contraction through loop fusion and/or loop shifting. But, so far, the complexity of the problem was unknown and no exact approach was available (moreover, only a sufficient condition for array contraction was used). In this report, we focus on the theoretical aspects of the problem. We prove several NP-completeness results that characterize its complexity precisely, and we provide an integer linear programming formulation to solve the problem exactly. Our study also proves the NP-completeness of similar problems whose complexity had not been established so far.

Keywords: Code optimization, array contraction, memory reduction, complexity, NP-completeness, integer linear programming.

Résumé

La contraction de tableaux est une optimisation de code qui transforme des variables de type tableau en variables scalaires au sein d'une boucle. Alors que la transformation inverse, l'expansion de scalaire, est utilisée pour augmenter le parallélisme (avec une pénalité en taille mémoire), la contraction de tableau est utilisée pour économiser de la mémoire en supprimant des tableaux temporaires et pour augmenter la localité. Plusieurs heuristiques ont été proposées dans le passé pour rendre possible la contraction de tableaux par fusion de boucles et décalage d'instructions. Néanmoins, la complexité du problème était jusqu'à présent inconnue et aucune méthode de résolution exacte n'était disponible (qui plus est, seule une condition suffisante de contraction était utilisée). Dans ce rapport, nous démontrons plusieurs résultats de NP-complétude qui caractérisent précisément le problème et nous proposons une méthode de résolution exacte par programmation linéaire en nombres entiers. Notre étude démontre également la NP-complétude de problèmes voisins dont la complexité n'était pas établie jusqu'à présent.

Mots-clés: Optimisation de code, contraction de tableau, réduction mémoire, NP-complétude, programmation linéaire en nombres entiers.

Contents

1 Introduction
2 Program Model and Objectives
  2.1 Dependence Graph
  2.2 Contraction of Arcs, Contraction of Vertices
  2.3 Validity of Loop Transformations
    2.3.1 Loop Fusion
    2.3.2 Loop Shifting
3 Complexity
  3.1 Loop Fusion For Array Contraction
    3.1.1 With the Standard Condition
    3.1.2 With the Extended Condition
  3.2 Loop Shifting For Array Contraction
    3.2.1 With the Standard Condition
    3.2.2 With the Extended Condition
4 An Integer Linear Programming Formulation
  4.1 Loop Fusion For Array Contraction
  4.2 Loop Shifting For Array Contraction
5 Related Work
6 Summary and Future Work

New Complexity Results on Array Contraction and Related Problems

Alain Darte and Guillaume Huard
LIP, ENS-Lyon, 46 Allée d'Italie, 69007 Lyon, France.
{Alain.Darte,Guillaume.Huard}@ens-lyon.fr

30th October 2002

Abstract

Array contraction is an optimization that transforms array variables into scalar variables within a loop. While the opposite transformation, scalar expansion, is used to enable parallelism (at the price of a larger memory size), array contraction is used to save memory by removing temporary arrays and to increase locality. Several heuristics have already been proposed to perform array contraction through loop fusion and/or loop shifting. But, so far, the complexity of the problem was unknown and no exact approach was available (moreover, only a sufficient condition for array contraction was used). In this report, we focus on the theoretical aspects of the problem. We prove several NP-completeness results that characterize its complexity precisely, and we provide an integer linear programming formulation to solve the problem exactly. Our study also proves the NP-completeness of similar problems whose complexity had not been established so far.

1 Introduction

Memory optimizations are becoming more and more important. First, as the gap between the speed of general-purpose single-chip processors and the speed of memory grew, exploiting the memory hierarchy became fundamental to achieving good performance. A large amount of compiler work has therefore focused on loop transformations and optimized data layouts (see [28, 23, 22, 8] to quote but a few) for better cache reuse and data prefetching. Memory optimizations are now even more important in the context of compilation for embedded processor applications. Performance is not necessarily the only issue: power consumption and memory design (size, type, etc.) are new objectives to be considered. Improving locality to reduce memory traffic, deleting temporary arrays, and reducing memory sizes are important optimizations in the compilation process. Thus, architectural changes in the design and in the objectives have pushed compilers to develop sophisticated memory optimizations.

On the other end of the spectrum, the evolution of languages also pushes compilers to be smarter when allocating memory. Indeed, array languages such as Fortran 90, HPF [3, 3], or ZPL [6] require the introduction of many temporary arrays by the compiler, which increases memory usage (compared to what the user's code suggests) if no further optimizations are performed. Today, memory optimizations can be necessary for both reasons (architecture and languages) simultaneously. Indeed, even if most circuits are still developed at the register transfer level, several projects (such as PICO [25]) already target the compilation of circuits from C code, or even from higher-level languages such as Matlab. In this latter case, the compiler will have to introduce many temporary arrays (due to the input language) and perform both high-level and low-level optimizations, while being very careful about the final memory allocation (because of the hardware objectives).

Array contraction [8] is one of these memory transformations. When each element of an array is defined and used within the same iteration of the surrounding loops, the array can be replaced by a scalar variable that holds the values of all elements, consecutively. Typically, such a situation occurs in codes where the contracted array is a temporary array, i.e., either an array that the user introduced him/herself to store some intermediate computations, or an array that the compiler introduced for the same reason (again in array languages for example), and this temporary array is used, in the original code, in several successive loops. Therefore, in most practical cases and most benchmarks, the natural transformation that enables array contraction is loop fusion [3, page 35]. In an array language such as Fortran 90, because of its array constructs that support index shifts (when array sections differ by a constant) and negative strides, loop shifting (which takes the form of loop alignment in this case [3, page 322]) and loop reversal (the loop iterates in the other direction) are two other natural transformations for array contraction.

In [24], Vivek Sarkar and Guang Gao were the first to try to optimize explicitly for array contraction. They focused on finding the most suitable loop reversal to enable array contraction. Then, in [8], with R. Olsen and R. Thekkath, they mainly explored loop fusion for array contraction, developing a heuristic based on a maxflow-mincut algorithm. Since then, several authors (see the Related Work section) have contributed to loop fusion optimizations, but with slightly different objectives, focusing on loop fusion for locality [9], weighted loop fusion [2], maximal fusion (number of loops) [5], loop fusion for memory reduction [27, 7], etc. All these approaches keep array contraction in mind, but they do not optimize directly for it. They target variants of data locality (for example, the number of fused dependences) and, in favorable cases (but not always), they can achieve array contraction as a secondary effect. Nevertheless, since the work of Gao, Sarkar et al., several questions have remained open. What is the complexity of optimizing for array contraction? How costly is an exact approach? Do we have to rely on heuristics? The goal of this report is to answer these theoretical questions. We also show that the way arrays are traditionally contracted (what we call the standard condition) is a bit restrictive. We give a more accurate formulation (we call it the extended condition) that allows us to contract more arrays.

The rest of the report is organized as follows. Section 2 describes the program model we consider and defines the array contraction problems we address, mainly array contraction enabled by loop fusion, and by a combination of loop shifting and loop fusion. Section 3 gives several NP-completeness proofs that characterize their complexity. Our results show at the same time the NP-completeness of three other optimization problems whose complexity was not established so far. In Section 4, we show that both problems (array contraction with loop fusion and array contraction with loop shifting) can be solved thanks to an integer linear programming formulation. More related work is discussed in Section 5. We conclude in Section 6.
2 Program Model and Objectives

To simplify the discussion, we consider a sequence of simple (i.e., not nested) loops, with unitary loop steps, each loop containing one or several simple statements (assignments to an array or scalar variable). Dependences between statements exist that restrict the order in which statements can be executed. We first assume that each statement writes into a different variable and that all dependences are flow dependences (i.e., writes occur before reads). Following the terminology in [8, 2], we also assume that all loops are conformable (or of the same type, with the terminology in [2]), i.e., regardless of dependences, all loops could be fused without code generation or semantics problems (they have similar headers, the same control dependences, etc.).

Remark: In terms of NP-completeness, the simpler the input, the stronger the proof of NP-completeness. Restricting to simple cases is therefore not a restriction, but a strength. However, when solving the problem in practice, we will need to be able to extend the technique to more general cases. We will explain later when we can do so, and when problems remain to be solved.

2.1 Dependence Graph

The program is represented by a dependence graph G = (V,E), a directed graph, where each vertex in V corresponds to a statement and an arc e = (u,v) ∈ E states that the statement v should always be placed in the same loop as the statement u, or in a loop after it. We keep track of some information on dependences: their nature (flow, anti, or output dependences, even if so far we assume that all are flow dependences) and the dependence distances (differences between the loop index of the destination operation and the loop index of the source operation) each arc corresponds to. Dependence distances (or over-approximations) are used to decide whether a code transformation is valid or not.

Figure 1 shows a sample program fragment (this is a modified version of the examples from [2] and [8]), first written in Fortran 90 with array expressions, then written with loops where loop fusion has been greedily applied (from top to bottom). The graph on the right is the corresponding dependence graph (labels on arcs will be explained hereafter).

A(1:N) = E(0:N-1)
B = A*2 + 3
C = B + 99
D(1:N) = A(N:1:-1) + A(1:N)
E = B + C*D
F = E*4 + 2
G = E*8 - 3
H(1:N) = F(1:N) + G(1:N)*E(2:N+1)

DO I=1,N
  A(I) = E(I-1)
  B(I) = A(I)*2 + 3
  C(I) = B(I) + 99
ENDDO
DO I=1,N
  D(I) = A(N-I+1) + A(I)
  E(I) = B(I) + C(I)*D(I)
  F(I) = E(I)*4 + 2
  G(I) = E(I)*8 - 3
ENDDO
DO I=1,N
  H(I) = F(I) + G(I)*E(I+1)
ENDDO

[Dependence graph over the statements A, B, C, D, E, F, G, H; the arc from A to D is labeled *.]

Figure 1: Sample program fragment (array version, loop version, and dependences).
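As an illustration (our own sketch, not part of the report), such a dependence graph can be kept as a simple list of arcs with their distance information; the class name, the Python setting, and the use of None to encode a non-uniform distance such as the arc labeled * are assumptions of the example, and only a few of the arcs of Figure 1 are shown.

```python
# Minimal sketch of a dependence graph with distance information.
# None encodes a distance that is not numerically constant (the label *).
class DependenceGraph:
    def __init__(self):
        self.vertices = set()
        self.arcs = []                  # list of (source, destination, distance)

    def add_arc(self, src, dst, distance):
        self.vertices.update((src, dst))
        self.arcs.append((src, dst, distance))

def classify(distance):
    """Arc classification used for loop fusion alone (see Section 2.3.1)."""
    if distance is None or distance < 0:
        return "fusion-preventing"      # the two statements must stay in different loops
    if distance == 0:
        return "contractable"           # contractable if both ends end up in the same loop
    return "precedence"                 # fusion allowed, but the arc cannot be contracted

# A few arcs of Figure 1 (each vertex is named after the array it writes):
g = DependenceGraph()
g.add_arc("A", "D", None)   # D reads A(N-I+1): non-uniform distance, labeled * in Figure 1
g.add_arc("B", "C", 0)      # C(I) reads B(I): distance 0
g.add_arc("E", "H", -1)     # H(I) reads E(I+1): distance -1

for (u, v, d) in g.arcs:
    print(u, "->", v, classify(d))
```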

2.2 Contraction of Arcs, Contraction of Vertices

A dependence arc is contractable when it corresponds to a dependence weight equal to 0. It can be contracted when it is contractable and when the source and destination of the dependence are both in the same loop. Furthermore, in this case, the source should be placed textually before the destination to preserve the semantics of the program. There is then an immediate reuse of the data involved in the dependence. An array is contractable when all flow dependence arcs relative to this array are contractable. It can be contracted into (replaced by) a scalar variable when all flow dependence arcs relative to this array are contracted, i.e., when every element of the array is defined and used within the same iteration of a loop. The two problems of optimizing the contraction of arcs or the contraction of vertices are very similar, and we will use this similarity in our NP-completeness proofs.

Remark: We point out that this condition for the contraction of an array (we call it the standard condition) is the one used in all previous works on array contraction, but it is only a sufficient condition. Indeed, an array such that each element is read either within the same iteration (i.e., at distance 0) by statements textually after the writing statement, or in the next iteration (i.e., at distance 1) but by statements textually before the writing statement could also be contracted. See Figure 2 for an example. This more general condition (we call it the extended condition) makes the problem harder to formulate since an additional condition on the textual reordering of statements has to be ensured. From a complexity point of view, as we will prove later, the problem is NP-complete with the extended condition too. We also point out that, even if it is not mentioned in previous works, care should be taken when generating code, possibly using additional scalar variables, when several statements write into the same array and the array is involved at the same time in flow dependences and in output or anti dependences.

B(0) = 0
DO I=1,N
  A(I) = B(I-1) + 1
  B(I) = A(I) + 3
ENDDO

B = 0
DO I=1,N
  A = B + 1
  B = A + 3
ENDDO

Figure 2: An example of contraction with the extended condition.

We distinguish between live-in arrays that are defined before the code fragment to be optimized, live-out arrays that should be kept in memory for later use after the code fragment, and temporary arrays that are defined in the code fragment before being read and never used after the code fragment. In the code of Figure 1, E is a live-in array, H is a live-out array, and all other arrays are temporary arrays. If memory reduction is the main goal of array contraction, only temporary arrays are candidates for array contraction. However, if the goal of array contraction is to better use registers and to avoid memory traffic, we can also consider other arrays as candidates for array contraction. In this case, if a live-out array (the situation is similar for a live-in array) is written by several statements in the code fragment, the intermediate writes can be considered for array contraction, but the last writes should be kept in the array.
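To make the two contraction conditions concrete, here is a small sketch (our own illustration, not part of the report) that checks, for a given fusion partition and textual order, which arrays satisfy the standard condition and which satisfy the extended condition; the data layout (flow arcs as tuples labeled with the array they carry) is an assumption made for the example.

```python
# Each flow arc is (source, destination, distance, array): "array" is the
# array written by "source" and read by "destination".
def contractable_arrays(flow_arcs, cluster, position):
    """cluster[s]  : loop (cluster) in which statement s is placed
       position[s] : textual position of s inside its cluster"""
    standard, extended = set(), set()
    arrays = {a for (_, _, _, a) in flow_arcs}
    for a in arrays:
        arcs = [(u, v, d) for (u, v, d, arr) in flow_arcs if arr == a]
        same_loop = all(cluster[u] == cluster[v] for (u, v, _) in arcs)
        # Standard condition: every flow arc of the array has distance 0,
        # both ends in the same loop, and the definition textually before the use.
        if same_loop and all(d == 0 and position[u] < position[v]
                             for (u, v, d) in arcs):
            standard.add(a)
        # Extended condition: distance 0 with the use textually after the
        # definition, or distance 1 with the use textually before it.
        if same_loop and all((d == 0 and position[u] < position[v]) or
                             (d == 1 and position[v] < position[u])
                             for (u, v, d) in arcs):
            extended.add(a)
    return standard, extended

# The example of Figure 2: statement "A" reads B(I-1) (distance 1) and is
# textually before statement "B", which writes B(I).
arcs = [("B", "A", 1, "B"), ("A", "B", 0, "A")]
cluster = {"A": 0, "B": 0}
position = {"A": 0, "B": 1}
print(contractable_arrays(arcs, cluster, position))
# -> A satisfies both conditions; B satisfies only the extended condition.
```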

2.3 Validity of Loop Transformations

We label dependence arcs depending on whether or not they prevent the code transformations that enable the contraction of an array.

2.3.1 Loop Fusion

When considering loop fusion, we need to distinguish between negative dependence distances, positive dependence distances, and dependence distances equal to 0. The conditions for loop fusion are well known (see for example [3]). An arc that corresponds to a negative dependence distance is called a fusion-preventing arc: the source and the destination of the arc should be placed in two different loops, with the source in a loop textually before. In this case, contraction is certainly not possible since some data should be kept in memory for use in a subsequent loop (when the dependence is a flow dependence). An arc with a nonnegative distance (we call it a precedence arc) is not fusion-preventing. The source should be placed in a loop before the destination, or it can be placed in the same loop. If it is not contractable (i.e., if it corresponds to a positive distance), the corresponding array cannot be contracted (at least with the standard condition) since some data will be used in different iterations. And if it is contractable, as we recalled earlier, the source of the arc should be placed textually before its destination.

Following classical definitions on loop fusion, a fusion partition P of G = (V,E) is a partition of V (the vertices) into disjoint subsets (called clusters). A fusion partition is legal for G if and only if: (1) for each fusion-preventing arc, the source vertex and the destination vertex are in different clusters; (2) the fused dependence graph defined by the fusion partition (there is an arc from a cluster A ∈ P to a cluster B ∈ P, B ≠ A, if there is an arc e = (u,v) ∈ E such that u ∈ A and v ∈ B) is a directed graph with no circuit (a DAG). The dotted closed lines in Figure 1 correspond to the greedy fusion partition (each statement is placed in the first possible loop). Given a legal fusion partition, the output code can be obtained as follows: all statements that belong to the same cluster are placed in a single loop, following the partial order defined by the zero-weight arcs (to preserve the semantics of the program), and loops are textually ordered according to some topological sort defined by the arcs of the fused dependence graph. For this to be possible (in particular when the graph corresponds to a valid program), the graph should have no circuit containing a fusion-preventing arc, and no zero-weight circuit.

The problem Loop Fusion for Array Contraction is to find a legal fusion partition for a dependence graph G so that as many arrays as possible can be contracted after fusion. After contraction, the partition depicted in Figure 1 corresponds to the code in Figure 3(a), with 5 contracted arcs and 1 contracted array. The solution obtained in [2] would be the code in Figure 3(b), with 6 contracted arcs and 3 contracted arrays. The optimal solution for array contraction is given in Figure 3(c), with 6 contracted arcs and 5 contracted arrays.

(a)
DO I=1,N
  A(I) = E(I-1)
  B(I) = A(I)*2 + 3
  C(I) = B(I) + 99
ENDDO
DO I=1,N
  d = A(N-I+1) + A(I)
  E(I) = B(I) + C(I)*d
  F(I) = E(I)*4 + 2
  G(I) = E(I)*8 - 3
ENDDO
DO I=1,N
  H(I) = F(I) + G(I)*E(I+1)
ENDDO

(b)
DO I=1,N
  A(I) = E(I-1)
ENDDO
DO I=1,N
  b = A(I)*2 + 3
  c = b + 99
  d = A(N-I+1) + A(I)
  E(I) = b + c*d
  F(I) = E(I)*4 + 2
  G(I) = E(I)*8 - 3
ENDDO
DO I=1,N
  H(I) = F(I) + G(I)*E(I+1)
ENDDO

(c)
DO I=1,N
  A(I) = E(I-1)
ENDDO
DO I=1,N
  b = A(I)*2 + 3
  c = b + 99
  d = A(N-I+1) + A(I)
  E(I) = b + c*d
ENDDO
DO I=1,N
  f = E(I)*4 + 2
  g = E(I)*8 - 3
  H(I) = f + g*E(I+1)
ENDDO

Figure 3: Codes after loop fusion and array contraction with (a) the greedy partition, (b) the partition selected in [2], (c) an optimal partition for array contraction.
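The legality conditions above translate directly into a small check. The sketch below is our own illustration (the representation of arcs and clusters is assumed, and non-uniform distances are encoded as None): it verifies conditions (1) and (2), and additionally checks that the statements of each cluster can be ordered along the zero-weight arcs.

```python
from collections import defaultdict

def has_circuit(vertices, arcs):
    """Detect a circuit in a directed graph by repeatedly removing
    vertices without incoming arcs (Kahn's algorithm)."""
    indeg = {v: 0 for v in vertices}
    succ = defaultdict(list)
    for (u, v) in arcs:
        succ[u].append(v)
        indeg[v] += 1
    stack = [v for v in vertices if indeg[v] == 0]
    removed = 0
    while stack:
        u = stack.pop()
        removed += 1
        for v in succ[u]:
            indeg[v] -= 1
            if indeg[v] == 0:
                stack.append(v)
    return removed != len(vertices)

def is_legal_fusion_partition(arcs, cluster):
    """arcs: list of (source, destination, distance); a distance of None means
    non-uniform (treated here as fusion-preventing).  cluster maps each
    statement to its cluster."""
    # (1) fusion-preventing arcs must have their two ends in different clusters.
    for (u, v, d) in arcs:
        if (d is None or d < 0) and cluster[u] == cluster[v]:
            return False
    # (2) the fused dependence graph must have no circuit.
    fused = {(cluster[u], cluster[v]) for (u, v, _) in arcs
             if cluster[u] != cluster[v]}
    if has_circuit(set(cluster.values()), fused):
        return False
    # Statements inside a cluster must be orderable along the zero-weight arcs.
    zero = [(u, v) for (u, v, d) in arcs if d == 0 and cluster[u] == cluster[v]]
    return not has_circuit(set(cluster.keys()), zero)
```

For instance, the greedy partition of Figure 1 passes this check, while placing the statement writing E and the statement writing H in the same cluster would violate condition (1) because of the arc of distance −1 from E to H.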

2.3.2 Loop Shifting

Loop shifting (also called loop alignment in [3]) consists in defining a map from V to Z (the relative integers) that assigns to each vertex (i.e., statement) of the dependence graph G a shift r(u), so that the operation corresponding to the statement u at iteration i in the original code is performed at iteration i + r(u) in the transformed code. A dependence distance from u to v originally equal to d(e) is equal, after the shift, to d(e) + r(v) − r(u). Given a shift r for each statement, one can define the corresponding dependence graph G_r where the dependence distance d_r(e) of an arc is d(e) + r(v) − r(u).

When considering loop shifting, the weights of arcs change, so more arcs can be considered as contractable and fewer arcs are fusion-preventing, but we need more information on dependence distances to see it. An arc that corresponds to a numerically constant dependence distance is a uniform arc; with an adequate shift, such an arc can be transformed into an arc with dependence distance equal to 0 (i.e., it becomes contractable). A precedence arc is an arc corresponding to dependence distances that are lower-bounded by a numerical constant; loop fusion may be possible with a sufficient shift, but if the arc is not uniform, the corresponding arc will not be contracted. Finally, any other arc is a fusion-preventing arc since, whatever the shift, the arc will still prevent fusion. In terms of direction vectors [29], a precedence arc corresponds to a label z+, where z is a relative integer, and a fusion-preventing arc corresponds to a label that is not of this form.

In the graph of Figure 1, all arcs are uniform except the arc with label *, which is a fusion-preventing arc. In the case of loop fusion alone, the arc with dependence distance −1 (from E to H) is also considered as a fusion-preventing arc, while for loop shifting it is just a precedence arc (it is uniform). With the terminology of [5], a legal shift is a shift r such that all arcs of G_r have a nonnegative weight. A fusion partition is legal with respect to a shift r (legal or not) if the partition is legal for G_r, i.e., if the fused dependence graph has no circuit, if r is a legal shift when considering only the arcs with both ends in the same cluster, and if G_r has no zero-weight circuit. Since the weight of a circuit is unchanged by a shift, a graph G that has a shift r and a corresponding legal fusion partition (in particular, the graph of a valid program) has only circuits of positive weight.

The problem Loop Shifting for Array Contraction is to find a shift r and a legal fusion partition for G_r so that as many arrays as possible can be contracted after shifting by r and fusion.
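As a small illustration of these definitions (our own sketch, not from the report; uniform distances are plain integers and non-uniform arcs are simply omitted), the code below applies a shift, recomputes the distances of G_r, and reports which arcs become contractable and whether the shift is legal.

```python
def shifted_distances(arcs, r):
    """arcs: list of (source, destination, uniform distance);
    r: shift value of each statement.  Returns the arcs of G_r."""
    return [(u, v, d + r[v] - r[u]) for (u, v, d) in arcs]

def report(arcs, r):
    shifted = shifted_distances(arcs, r)
    legal = all(d >= 0 for (_, _, d) in shifted)
    contractable = [(u, v) for (u, v, d) in shifted if d == 0]
    return legal, contractable

# Three uniform arcs of Figure 1: E -> H has distance -1 (H reads E(I+1)),
# F -> H and G -> H have distance 0.  Shifting H by one iteration makes
# the E -> H arc contractable but turns F -> H and G -> H into distance 1.
arcs = [("E", "H", -1), ("F", "H", 0), ("G", "H", 0)]
print(report(arcs, {"E": 0, "F": 0, "G": 0, "H": 0}))   # not legal: E -> H is negative
print(report(arcs, {"E": 0, "F": 0, "G": 0, "H": 1}))   # legal, E -> H becomes contractable
```

Shifting one statement can thus make one arc contractable at the price of others, and this is precisely the trade-off that Section 3.2 shows to be NP-complete to optimize.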
3 Complexity

In this section, we consider the following simplest case: all dependences are uniform (i.e., with constant dependence distances), all vertices in the dependence graph correspond to statements that write into different arrays, and all dependences are flow dependences. In this case, each vertex corresponds to a different contractable array, and there is a gain of one contracted array each time a vertex is in the same cluster as all its successors and all dependence distances from this vertex to these successors are contractable, i.e., with weight 0 for the standard condition, and with weight 0 (if the successor is textually after) or 1 (if the successor is textually before) for the extended condition.

3.1 Loop Fusion For Array Contraction

The first problem, Loop Fusion for Array Contraction, is the easiest to understand. We first focus on arrays contracted with the standard condition.

10 3.. With the Standard Condition If no arcs are fusion-preventing, then the optimal partition is to place all statements within the same loop. All contractable arcs can be contracted and all arrays whose outgoing flow arcs are all contractable can be contracted. When the dependence graph has only one fusion-preventing arc, then the problem can also be solved in polynomial time, as a variant of a maxflow-mincut algorithm (as noticed in [8]), even if some modifications have to be done so that the maxflowmincut algorithm, which can naturally maximize the number of contracted arcs, is able to maximize the number of contracted vertices. However, when the dependence graph has more fusion-preventing arcs, the complexity changes. Theorem Loop Fusion for Array Contraction is strongly NP-complete for directed graphs with no circuit but a chain of k 2 fusion-preventing arcs. Proof: The associated decision problem is obviously in NP. The rest of the proof is, as in [9], by reduction from the problem Multiway Cut [4], whose following instance is NP-complete for k 3 (and even for the fixed value k = 3). Multiway Cut: Instance: An undirected graph G = (V,E), k specified vertices (s i ) i k, and an integer K. Question: Is there a set E E of edges, of size at most K, such that the removal of E from E disconnects all s i from any other s j? (Such a set E is called a cut.) Transformation To transform an instance of Multiway Cut into an instance of Loop Fusion for Array Contraction, there are two minor difficulties: () we are not interested in maximizing the number of contracted arcs, but in maximizing the number of contracted vertices, (2) our dependence graphs are directed, which is not the case in Multiway Cut. From an instance G = (V,E) and (s i ) i k of Multiway Cut, we build a graph G = (V,E ) as follows. We first let V = V and, for i < k, we add to E a fusion-preventing arc from s i to s i+. For each undirected edge e = (u,v) in E, we add to V a new vertex n e and we add to E two contractable arcs, one from n e to u and one from n e to v. The resulting graph G has V + E vertices, 2 E + k arcs, it has no circuit, and all arcs go from a vertex in V \ V to a vertex in V, except the fusion-preventing arcs that form a directed chain of (k ) arcs between the k specified vertices in V. Furthermore, to count the number of contracted vertices after fusion, we can only consider the vertices in V \ V (the vertices n e ) since, whatever the partition, vertices in V are either never contractable (vertices s i for i < k, since they have a fusion-preventing outgoing arc) or always contractable (other vertices since they have no outgoing arc at all). To complete the proof, we now show that there is a valid cut for G of size at most K if and only if there is a legal fusion partition with at least E K contracted vertices in V \ V (resp. 2 E K contracted arcs in G ). Reduction Given a cut E for G, we define the following binary relation on V : for all w V, wrw, and for all e = (u,v) E \ E, urv, vru, n e Ru, urn e, n e Rv, and vrn e. The transitive closure of R is an equivalence relation whose equivalence classes define a partition. By construction, two vertices u V and v V are in the same equivalence class if and only if there is an undirected path in G from u to v with all edges in E \ E. Furthermore, the fused dependence graph has the following properties. If a cluster contains no vertex in V, then it contains a unique vertex in V \ V (i.e., a vertex of type n e ), and the cluster has no incoming 8

11 arc and two outgoing arcs. If a cluster contains a vertex in V, the cluster has no outgoing arc except possibly the fusion-preventing arcs because if a vertex in V \ V is in the cluster, its two successors are also in the cluster. Therefore, only fusion-preventing arcs can be involved in a circuit in the fused dependence graph. From these properties, it is now clear that, if E is a valid cut, the partition is a legal fusion partition. First, all s i are in different clusters since there is no path between any pair (s i,s j ) with edges all in E \ E. Furthermore, if the reduced dependence graph defined by the partition has a circuit, then the circuit corresponds to a circuit of fusion-preventing arcs involving the vertices s i. But this is impossible since the fusion-preventing arcs define a chain, not a circuit. The fusion partition we obtain has already the following property: it has 2( E E ) contracted arcs and E E contracted vertices in V \ V. Now, since the fused dependence graph has no circuit, we can number the clusters according to some topological sort, and for each edge e = (u,v) E, we can place n e in the same cluster as the vertex, between u or v, with smaller cluster number, without creating any circuit. Finally, we end up with a legal fusion partition with 2 E E contracted arcs and E E contracted vertices in V \ V. Conversely, for any legal fusion partition for G, no two s i are in the same cluster and each vertex n e is contracted if and only if it is in the same cluster as its two successors u and v. If a vertex n e is not in the same cluster as both u and v, then with the same technique as above, we can always place it in the same cluster as either u or v. After this transformation, if the number of contracted vertices in V \ V is E K, we get 2( E K) + K = 2 E K contracted arcs. And if we remove in G all edges e = (u,v) such that u and v are not in the same cluster in G, we get a valid cut of size K. We point out that the proof of Theorem proves at the same time that maximizing the number of contracted arcs with loop fusion is strongly NP-complete (if there are at least 2 fusion-preventing arcs), in other words that the problem Weighted Loop Fusion, introduced in [9], is strongly NP-complete. We think this is interesting to mention because the proof in [9], which is so far the main (if not only) NP-completeness result most papers on loop fusion refer to, turns out to be incorrect (the construction proposed in [9] does not guarantee that all valid cuts correspond to partitions without circuit). But the idea to use Multiway Cut in the reduction was the right one as the proof above shows With the Extended Condition The previous proof is also valid if we consider the extended condition for contractability, since we can always restrict, in the proof, to programs such that noncontractable arcs (arcs with nonzero weight) corresponds to dependences with weight. In this case, there is no arc with weight in the graph, and arrays contracted with the extended condition are the same as arrays contracted with the standard condition, and the problem remains NP-complete. However, as we mentioned earlier, when there are no fusion-preventing arcs for example when all weights are equal to or the complete fusion is always possible and contraction with the standard condition is easy. But is that always true for the extended condition? 
Maybe not, since we need to find an adequate ordering of statements such that an arc is contracted if it has a weight and the source of the arc is placed textually before its destination (standard condition), but also if it has a weight and the source is placed textually after its destination. Consider a directed graph G = (V,E,d) where d(u) {,} for all u V. The variant of Loop Fusion for Extended Array Contraction is, for such a simple instance, to determine, given an integer K, a total order of vertices such that e = (u,v) and d(e) = implies u v, and at least K vertices are contracted with the extended condition, where a vertex u is contracted if u v for each arc e = (u,v) with weight and v u for each arc e = (u,v) 9

12 with weight. For this to be possible, the graph obtained by removing all noncontracted arcs and by replacing each contracted arc (u,v) of weight by an arc (v,u) should have no circuit. Theorem 2 Loop Fusion for Extended Array Contraction is strongly NP-complete for directed graphs with no circuit and arcs with weights or. Proof: We use a reduction from Vertex Cover (Problem GT in [9, p. 9]) recalled below. Vertex Cover: Instance: An undirected graph G = (V,E) and an integer K. Question: Is there a subset V of V, of size at most K, such that for each edge (u,v) E at least one of u and v belongs to V? (Such a set V is called a vertex cover.) Loop Fusion for Extended Array Contraction is obviously in NP. Now consider an instance G = (V,E) of Vertex Cover. We use a reduction similar to the reduction for Feedback Arc Set (Problem GT8 in [9, p. 92]). We build a directed graph G = (V,E,d) with no circuit as follows: for each vertex u V, we define two vertices u and u in V with an arc from u to u with weight, and for each edge e = (u,v), we define an arc from u to v and an arc from v to u, both with weight. There are 2 V vertices and V + 2 E arcs in G. We now show that there is a vertex cover of size at most K in G if and only if there are at least 2 V K contracted vertices (resp. 2 E + V K contracted arcs) in G. Consider a vertex cover V for G of size K. Remove all arcs (u,u ) in G when u V and replace each remaining arc of the form (v,v ) by an arc (v,v). Since V is a cover for G, there cannot be any path of the form u u v v in this new graph since either (u,u) or (v,v) is not in the graph. So, there is no circuit in this graph and we can find an ordering of vertices that follows the direction of arcs (in other words, the set of arcs (u,u) that we removed is a feedback arc set for this graph). This leads to a solution of Loop Fusion for Extended Array Contraction with at least 2 V K contracted vertices (all vertices except possibly those that belong to V) and 2 E + V K contracted arcs (all arcs with weight plus the arcs (u,u ) with weight such that u does not belong to the vertex cover). Conversely, consider a valid fusion and ordering of vertices, and define V as the set of vertices u in V such that u has at least one noncontracted outgoing arc in G. By definition, if we remove from G all noncontracted arcs, and if we replace each contracted arc (u,u ) with weight by an arc (u,u), we get a directed graph with no circuit. This implies in particular that there is no circuit u u v v u, therefore at least one of (u,u ) and (v,v ) is not contracted. In other words, V is a vertex cover for G. Now, if there are 2 V K contracted vertices, then only K vertices are in V, so the size of V is exactly K. And if there are 2 E + V K contracted arcs, each noncontracted arc can contribute to at most one different vertex in V, so the size of V is at most K. Theorem 2 shows that the problem is more difficult with the extended condition. Simply finding an adequate statement ordering is difficult while, for the standard condition, the difficulty arises only when some arcs prevent the total fusion (or even a partition with 2 clusters). 3.2 Loop Shifting For Array Contraction When all dependences are uniform, Loop Fusion for Array Contraction is NP-complete because negative dependence distances are considered as fusion-preventing arcs. But the situation may be different when introducing loop shifting. 
Indeed, with a sufficient shift, any uniform dependence can be transformed into a dependence with a nonnegative distance (i.e., a fusion-preventing arc can become a precedence arc). Moreover, the complete fusion of a sequence of

13 loops with uniform dependences is always possible after suitable shifts. It is therefore legitimate to wonder if introducing loop shifting in the case of uniform dependences makes the problem easier. Also, this problem is of practical interest since many code fragments, for example coming from Fortran 9 array expressions, are codes with uniform dependences. Again, we first consider the case of contraction with the standard condition (no smart statement ordering to define), then with the extended condition With the Standard Condition Unlike Loop Fusion for Array Contraction, which is linked to a very close well-known problem (almost all difficulties are therefore pushed into the NP-completeness proof of Multiway Cut, which is quite long and difficult, see [4]), we have almost to start from scratch to establish the NP-completeness of Loop Shifting for Array Contraction. The proof has two parts. We first show that finding a shift that maximizes the number of contracted arcs (arcs with weight after the shift) is strongly NP-complete. Then, as in Theorem, we are able to reduce, from the maximization of contracted arcs, the maximization of contracted vertices we are interested in. We first need the following technical lemma. Lemma Let G = (V,E,d) be a directed graph where each arc e has a weight d(e) Z and such that all circuits have a positive weight. Let r be a shift for G and let P be a legal fusion partition with respect to r. Then there exists a legal shift r such that all arcs, contracted for r and P, are contracted for r and the partition P = {V } (total fusion). Furthermore, u V, r (u) e E d(e). Proof: To build r from r, we define a graph G = (V,E ) in which r will be computed. We first let G = G and for each arc e = (u,v) E such that e is contracted for r and P, we add a new arc e in E from v to u with weight d (e ) = d(e). Note first that all circuits in G r have a nonnegative weight (since each circuit should be part of a given cluster, and in each cluster all weights are nonnegative). The same is true in G r. Since weights of circuits are not modified by a shift, G has the same property and we can compute shortest paths in G. We define π(u) as the minimal weight of a path ending at u (a nonpositive quantity if, by convention, a path with no arc has weight ). For each arc e = (u,v) E, we have π(v) π(u) + d (e) (since the weight of the minimal path to v is less than or equal to the weight of any path that goes to v through u). With r (u) = π(u), we get d(e) + r (v) r (u) for all e = (u,v) E, thus r is legal for G and the partition P = {V } with only one cluster is legal with respect to r. Furthermore, because of the arcs e, we even have π(v) = π(u) + d (e) when e is contracted for r and P. Thus, arcs contracted for r and P are contracted for r and P. Finally, since r (u) is built as the opposite of the weight of an elementary path P(u) in G ending at u, we have r (u) = e P(u) d (e) e E d(e). Lemma shows that we can restrict to solutions that correspond to a total fusion with a legal shift. In practice however, when nonuniform and, in particular, fusion-preventing dependences exist, we will have to be able to take into account fusion partitions with more than one cluster (this will be done in the linear programming formulation presented in Section 4.2). For the NP-completeness proof itself, we now consider the following problem: Maximization of Local Accesses: Instance: A uniform dependence graph G = (V,E,d) and an integer K. 
Question: Is there a shift r of G and a legal fusion partition with respect to r such that at least K arcs are contracted?
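Before stating the hardness result, here is a small sketch (our own illustration, assuming the graph is given as a list of uniform weighted arcs with no circuit of negative weight) of the constructive part of Lemma 1: shortest paths yield a legal shift, computed below with a Bellman-Ford-style relaxation, and the zero-weight arcs it happens to produce can then be counted. Finding the shift that maximizes this count is exactly Maximization of Local Accesses, shown NP-complete in Theorem 3, so this sketch only provides a feasible shift, not an optimal one.

```python
def legal_shift(vertices, arcs):
    """arcs: list of (u, v, weight).  Computes pi(u), the minimal weight of a
    path ending at u (the empty path has weight 0), by Bellman-Ford-style
    relaxation, and returns the shift r(u) = -pi(u) as in Lemma 1."""
    pi = {v: 0 for v in vertices}
    for _ in range(len(vertices) - 1):
        changed = False
        for (u, v, w) in arcs:
            if pi[u] + w < pi[v]:
                pi[v] = pi[u] + w
                changed = True
        if not changed:
            break
    return {v: -pi[v] for v in vertices}

def zero_weight_arcs(arcs, r):
    return [(u, v) for (u, v, w) in arcs if w + r[v] - r[u] == 0]

# Example: a small chain with one negative (fusion-preventing) distance.
vertices = ["S1", "S2", "S3"]
arcs = [("S1", "S2", -2), ("S2", "S3", 1)]
r = legal_shift(vertices, arcs)        # here r = {S1: 0, S2: 2, S3: 1}
print(r, zero_weight_arcs(arcs, r))    # all shifted weights are nonnegative
```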

14 The next theorem characterizes the complexity of Maximization of Local Accesses. Theorem 3 Maximization of Local Accesses is strongly NP-complete. Proof: We first show that the problem belongs to NP. Thanks to Lemma, given a shift r and a corresponding partition, there is always a legal shift r of polynomial size with at least as many contracted arcs. Therefore, given a positive instance of our problem, there is a polynomial certificate (the shift r of polynomial size) for which we can check in polynomial time that at least K arcs are contracted. The rest of the proof is by reduction from the problem Not-All- Equal 3SAT (Problem LO3 in [9, p. 259]) that we recall here. Not-All-Equal 3SAT: Instance: A set U of n boolean variables and a set C of m clauses over U such that each clause c C has c = 3. Question: Is there a truth assignment for U such that each clause in C has at least one true literal and at least one false literal? Transformation Let (U, C) be an instance of Not-All-Equal 3SAT. We define an instance G = (V,E,d) of Maximization of Local Accesses as follows. We start from G = (V,E) with V = and E =. For each variable u U, we add to G two vertices u and ū, and two arcs, with weight, from u to ū and from ū to u (see Figure 4). for each clause c = {x,y,z} C, we add to G six arcs of weight, from x to y, from y to x, from x to z, from z to x, from y to z, and from z to y (see Figure 5). we let K = 2m + n (remember that n = U and m = C ). The graph G has a polynomial size, with 2 U vertices and 2 U + 6 C arcs. u u u {x, y, z} x z y Figure 4: Transformation of a variable. Figure 5: Transformation of a clause. Reduction Let (U,C) be a positive instance of Not-All-Equal 3SAT, and let T be a truth assignment for U such that each clause has at least one true literal and at least one false literal. We define, for each variable u U, r(u) = { if T(u) = true if T(u) = false r(ū) = r(u) First, note that for each arc e = (u,v) E, r(v) r(u) (since r takes values in {,}), thus r(v) r(u)+d(e) since all arcs in G have a weight equal to (by construction). In other 2

15 words, r is a legal shift for G. For each u U, r(u) r(ū) = ±. Thus, either d r ((u,ū)) = and d r ((ū,u)) = 2, or d r ((u,ū)) = 2 and d r ((ū,u)) =. Therefore, each structure associated to a variable generates exactly one zero-weight arc, i.e., n zero-weight arcs for all variables. For each clause c = {x,y,z} C, at least one literal is true and at least one is false. Thus, there are two literals, for example x and y, such that r(x) = r(y) and r(x) r(z) = r(y) r(z) = ±. Therefore, there is no zero-weight arc between x and y, and there is exactly one such arc between x and z, and one between y and z. In other words, two arcs have a zero weight in each structure associated with a clause, i.e., 2m arcs of zero weight for all clauses. In addition to the n zero-weight arcs we obtained for the variables, we get 2m + n = K zero-weight arcs in G r, and (G,K) is a positive instance of Maximization of Local Accesses. Conversely, suppose that (G, K) is a positive instance of Maximization of Local Accesses. Let r be a shift of G such that G r has at least K zero-weight arcs. We define, for each literal u U: { true if r(u) mod 2 = T(u) = false if r(u) mod 2 = Note that a shift does not change the total weight along a circuit, thus at most one of the two arcs associated to a variable can have a zero weight after the shift (otherwise the weight of the circuit would have a weight equal to and not to 2). We now show that at most two arcs in the structure associated to a clause can have a zero weight after the shift. First, the same observation as before shows that only one of the two arcs between two different literals can have a zero weight, therefore at most 3 such arcs for each clause. Suppose that there is a clause c = {x,y,z} with at least two zero-weight arcs after the shift, for example between x and y (thus r(y) = r(x) ± ), and between x and z (thus r(z) = r(x) ± ). Then, either r(y) = r(z), or r(y) = r(z) ± 2, and in both cases, there is no zero-weight arc between y and z. Therefore, each structure associated to a clause has at most 2 zero-weight arcs after the shift (actually, all this is true even if the shift is not legal). To summarize, G r has at most 2m + n zero-weight arcs, and by hypothesis, it has at least K = 2m + n zero-weight arcs. Thus, it has exactly K = 2m + n arcs of zero weight, i.e., one for each structure associated to a variable, and two for each structure associated to a clause. It remains to show that T is a truth assignment with at least one true and one false literal in each clause. Since there is a zero-weight arc between u and ū, r(u) = r(ū) ±. Thus, r(u) mod 2 r(ū) mod 2 and T(u) T(ū); Each clause c = {x,y,z} contains exactly two arcs of zero weight; consider one of them, for example between x and y. We have r(x) = r(y) ± and thus T(x) T(y). Therefore, (U,C) is a positive instance of Not-All-Equal 3SAT. We just proved that (G, K) is a positive instance of Maximization of Local Accesses if and only if (U,T) is a positive instance of Not-All-Equal 3SAT. This proves that Maximization of Local Accesses is strongly NP-complete. Note that the instance built in the previous proof is a graph that has always circuits. Nevertheless, it is possible to show that Maximization of Local Accesses is strongly NPcomplete even for a graph with no circuit (and even if the shift is supposed to be legal or not). The proof is more technical, but we give it here for completeness. 
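Before giving that proof, here is a small sketch of the transformation used in Theorem 3 (our own illustration; the encoding of a literal as +i or -i for variable i is an assumption of the example). It builds the graph from a Not-All-Equal 3SAT instance and, given a not-all-equal truth assignment, derives a shift in the spirit of the proof (values 0 or 1 depending on the truth value) and counts the zero-weight arcs, which reach K = 2m + n exactly when the assignment is valid.

```python
def build_reduction(n, clauses):
    """Vertices 'u1', 'nu1', ... for each variable; every arc has weight 1."""
    pos = lambda i: "u%d" % i     # vertex for the literal  x_i
    neg = lambda i: "nu%d" % i    # vertex for the literal !x_i
    arcs = []
    for i in range(1, n + 1):                      # variable gadget: a 2-circuit
        arcs += [(pos(i), neg(i), 1), (neg(i), pos(i), 1)]
    lit = lambda l: pos(l) if l > 0 else neg(-l)
    for (x, y, z) in clauses:                      # clause gadget: 6 arcs
        for a, b in [(x, y), (y, x), (x, z), (z, x), (y, z), (z, y)]:
            arcs.append((lit(a), lit(b), 1))
    return arcs, 2 * len(clauses) + n              # K = 2m + n

def shift_from_assignment(n, assignment):
    """Shift with values in {0,1}: a literal vertex gets 1 when the literal is true."""
    r = {}
    for i in range(1, n + 1):
        r["u%d" % i] = 1 if assignment[i] else 0
        r["nu%d" % i] = 1 - r["u%d" % i]
    return r

def count_zero_weight(arcs, r):
    return sum(1 for (u, v, w) in arcs if w + r[v] - r[u] == 0)

# (x1 or x2 or !x3) with the not-all-equal assignment x1=True, x2=False, x3=False.
arcs, K = build_reduction(3, [(1, 2, -3)])
r = shift_from_assignment(3, {1: True, 2: False, 3: False})
print(count_zero_weight(arcs, r), K)   # both equal 2*1 + 3 = 5
```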
Theorem 4 Maximization of Local Accesses is strongly NP-complete even for a graph with no circuit and weights equal to 0 or 1.

16 Proof: The same proof as in Theorem 3 shows that the problem belongs to NP. The rest of the proof is by reduction from the problem One-in-Three 3SAT (Problem LO4 in [9, p. 259]) that we recall here. One-In-Three 3SAT: Instance: A set U of n boolean variables and a set C of m clauses over U such that each clause c C has c = 3. Question: Is there a truth assignment for U such that each clause in C has exactly one true literal? Transformation Let (U,C) be an instance of One-in-Three 3SAT. We define an instance G = (V,E,d) of Maximization of local accesses as follows. We start from G = (V,E,d) with two vertices a and b (V = {a,b}) and E is a set of 48mn + 8m arcs with weight, from a to b. We call this initial structure the base structure (see Figure 6). a 48mn+8m b Figure 6: Base structure. For each variable u U, we add to G two vertices u and ū, and 24m arcs with weight, from ū to b, 6m arcs with weight from u to ū, 24m arcs with weight from a to u, 6m arcs of weight from a to ū, and 6m arcs with weight from u to b (see Figure 7). a u u b Figure 7: Transformation of a variable (each arc on the figure is repeated 8m times). For each clause c = {x,y,z} C, we consider what we call derivative clauses c = {x,ȳ, z}, c = { x,y, z}, and c 2 = { x,ȳ,z}. We add to G three vertices c, c, and c 2, and for each derivative clause c i = {x,y,z }, i {,,2}, we add an arc with weight from a to c i, 4

17 an arc with weight from c i to each of the vertices x, y, and z (built by the variables), and an arc with weight from each of the vertices x, y, and z to b (see Figure 8). x a c i y b z Figure 8: Transformation of a clause (for one derivative clause). We let K = 96mn + 22m. The graph G built this way has no circuit. It has 2n + 3m + 2 vertices and 44mn + 29m arcs. Reduction Let (U, C) be a positive instance of One-in-Three 3SAT, and let T be a truth assignment for U such that each clause has exactly one true literal. We define a shift r for G as follows. We let r(a) = r(b) =, for each u U: r(u) = { if T(u) = true if T(u) = false r(ū) = r(u) and for each clause c C, for each derivative clause c i = {x,y,z }, i {,,2}: { if T(x r(c i ) = ) = T(y ) = T(z ) = true otherwise Since r(a) = r(b), the shift is legal for the base structure of G and the 48mn+8m corresponding arcs have a weight equal to in G r. For each variable u U, either r(u) = and r(ū) =, or r(u) = and r(ū) =. In both cases, this gives rise to exactly 6 groups of 8m arcs with weight in each structure associated to a variable (i.e., 6 arcs with weight on Figure 7). Therefore, there is a total of 48mn arcs of weight for all variables (note also that the shift is legal for all arcs associated to variables). For each clause c C, exactly one of the literals of c is true, therefore exactly one derivative clause c i = {x,y,z }, i {,,2} has all its literals true. For this derivative clause, r(x ) = r(y ) = r(z ) = and r(c i ) =, thus the shift is legal for the associated structure and it gives rise to 6 arcs with weight. For any other derivative clause c j, r(c j ) = since at least one literal is false, and whatever the shift of the other vertices ( or ), the shift is legal for the structure associated to c j and gives rise to 4 arcs with weight. Therefore, there is a total of = 4 arcs with weight for each clause and 4m such arcs for all the clauses. For the full graph G r, we get a total of 96mn + 22m arcs with weight : (G,K) is a positive instance of Maximization of local accesses. Conversely, suppose that (G, K) is a positive instance of Maximization of local accesses. Let r be a shift of G such that G r has at least K zero-weight arcs. Without loss of generality, we can assume that r(a) =, otherwise we subtract r(a) to all values of the shift. Before going further, we prove some properties of the shift r. We first prove that r(a) = r(b) =, then that for each variable u U, r(u) and r(ū) are equal to or, and that they are not equal. 5

18 Note that each structure built from a variable contains 2 8m = 96m arcs and that each structure built from a clause contains 3 7 = 2 arcs, thus a total of 96mn + 2m arcs for all structures. Therefore, if G r contains K = 96mn + 22m arcs with weight or more, then some of them belong to the base structure. Since all arcs for the base structure have the same initial weight, the same source a, and the same destination b, they all have a weight in G r. Thus r(a) = r(b) = and there are exactly 48mn + 8m arcs with weight in the base structure. Now consider the structure built from a variable u U. Assume that r(u) < or r(u) > : in both cases, the arcs from a to u and the arcs from u to b have a nonzero weight after the shift (since r(a) = r(b) = ). Furthermore, the arcs from a to ū and from u to ū cannot have a weight simultaneously (otherwise, r(ū) = and r(u) = ). Therefore, at most 4m arcs have a weight (the arcs from a to ū or the arcs from u to ū, plus possibly the arcs from ū to b). Now suppose that r(ū) < or r(ū) > : in both cases, the arcs from a to ū and the arcs from ū to b have a nonzero weight (since r(a) = r(b) = ). Furthermore, the arcs from u to ū and the arcs from u to b cannot have a weight simultaneously since they have the same source u, the same initial weight, but r(ū) r(b). Therefore, again, the structure contains at most 4m arcs of weight (the arcs from u to ū or the arcs from u to b, plus possibly the arcs from a to u). Finally, we can easily check that if r(u) = r(ū) = or if r(u) = r(ū) =, then the structure corresponding to u contains exactly 4m arcs of weight, and if r(u) = and r(ū) =, or if r(u) = and r(ū) =, then it contains 48m arcs of weight. Now, if at least one of the structures built from a variable contains at most 4m arcs with weight, then the total number of arcs with weight in the structures associated to variables, plus the base structure, is at most 48mn + 8m + 48m(n ) + 4m = 96mn. By definition of K, we still need 22m arcs of weight in the structures associated to clauses, but the total number of arcs in these structures is only 2m. Therefore, each structure associated to a variable must contain exactly 48m arcs with weight after the shift, and for each variable u U, either r(u) = and r(ū) =, or r(u) = and r(ū) = (note also that the shift is legal for all arcs in the structure). We define, for each literal u U: { false if r(u) = T(u) = true otherwise { false if r(ū) = and T(ū) = true otherwise The previous study shows that for each variable u U, T(u) T(ū), thus T is a truth assignment. It remains to show that each clause contains exactly one true literal. Consider a derivative clause c i = {x,y,z } obtained from a clause c C: it corresponds to a structure in which r(x ), r(y ), and r(z ) are equal to or (since each such vertex is the u or ū contained in the structure associated to a variable u). Assume that r(c i ) < or r(c i ) >. Then the arc from a to c i and the arcs from c i to each of the vertices x, y, and z have a nonzero weight after the shift. In this case, at most 3 arcs have a weight. Furthermore, if r(c i ) =, then the structure has exactly 4 arcs with weight, and if r(c i ) =, then either r(x ) = r(y ) = r(z ) = and the structure contains exactly 6 arcs of weight, or it is easy to check that it contains at most 4 arcs of weight (and some have a negative weight). Finally, by construction of derivative clauses c, c, and c 2, for each pair of such clauses (c i,c j ), there is a variable u U such that u c i and ū c j. 
Therefore, at most one of the three derivative structures is such that r(x ) = r(y ) = r(z ) = and contains 6 arcs of weight (each other derivative clause contains at most 4 such arcs). If at least one clause has strictly less than arcs with weight in its 3 derivative clauses, then with the arcs of other clauses, we get at most ( ) (m ) + 3 = 4m arcs of weight. With the other structures, we get at most 48mn + 8m + 48mn + 4m = 96mn + 22m arcs of weight, which is not enough. Therefore, for each clause c C, exactly one derivative clause c i contains 6 arcs of zero weight and the two other derivative clauses contains 4 such arcs. This means that 6


More information

CMPSCI611: The SUBSET-SUM Problem Lecture 18

CMPSCI611: The SUBSET-SUM Problem Lecture 18 CMPSCI611: The SUBSET-SUM Problem Lecture 18 We begin today with the problem we didn t get to at the end of last lecture the SUBSET-SUM problem, which we also saw back in Lecture 8. The input to SUBSET-

More information

Bijective Proofs of Two Broken Circuit Theorems

Bijective Proofs of Two Broken Circuit Theorems Bijective Proofs of Two Broken Circuit Theorems Andreas Blass PENNSYLVANIA STATE UNIVERSITY UNIVERSITY PARK, PENNSYLVANIA 16802 Bruce Eli Sagan THE UNIVERSITY OF PENNSYLVANIA PHILADELPHIA, PENNSYLVANIA

More information

CMPSCI 311: Introduction to Algorithms Practice Final Exam

CMPSCI 311: Introduction to Algorithms Practice Final Exam CMPSCI 311: Introduction to Algorithms Practice Final Exam Name: ID: Instructions: Answer the questions directly on the exam pages. Show all your work for each question. Providing more detail including

More information

P and NP CISC4080, Computer Algorithms CIS, Fordham Univ. Instructor: X. Zhang

P and NP CISC4080, Computer Algorithms CIS, Fordham Univ. Instructor: X. Zhang P and NP CISC4080, Computer Algorithms CIS, Fordham Univ. Instructor: X. Zhang Efficient Algorithms So far, we have developed algorithms for finding shortest paths in graphs, minimum spanning trees in

More information

Lecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1

Lecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1 CME 305: Discrete Mathematics and Algorithms Instructor: Professor Aaron Sidford (sidford@stanfordedu) February 6, 2018 Lecture 9 - Matrix Multiplication Equivalences and Spectral Graph Theory 1 In the

More information

LECTURES 3 and 4: Flows and Matchings

LECTURES 3 and 4: Flows and Matchings LECTURES 3 and 4: Flows and Matchings 1 Max Flow MAX FLOW (SP). Instance: Directed graph N = (V,A), two nodes s,t V, and capacities on the arcs c : A R +. A flow is a set of numbers on the arcs such that

More information

Routing Reconfiguration/Process Number: Networks with Shared Bandwidth.

Routing Reconfiguration/Process Number: Networks with Shared Bandwidth. INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE Routing Reconfiguration/Process Number: Networks with Shared Bandwidth. David Coudert Dorian Mazauric Nicolas Nisse N 6790 January 2009

More information

1 Variations of the Traveling Salesman Problem

1 Variations of the Traveling Salesman Problem Stanford University CS26: Optimization Handout 3 Luca Trevisan January, 20 Lecture 3 In which we prove the equivalence of three versions of the Traveling Salesman Problem, we provide a 2-approximate algorithm,

More information

x ji = s i, i N, (1.1)

x ji = s i, i N, (1.1) Dual Ascent Methods. DUAL ASCENT In this chapter we focus on the minimum cost flow problem minimize subject to (i,j) A {j (i,j) A} a ij x ij x ij {j (j,i) A} (MCF) x ji = s i, i N, (.) b ij x ij c ij,

More information

Definition: A graph G = (V, E) is called a tree if G is connected and acyclic. The following theorem captures many important facts about trees.

Definition: A graph G = (V, E) is called a tree if G is connected and acyclic. The following theorem captures many important facts about trees. Tree 1. Trees and their Properties. Spanning trees 3. Minimum Spanning Trees 4. Applications of Minimum Spanning Trees 5. Minimum Spanning Tree Algorithms 1.1 Properties of Trees: Definition: A graph G

More information

Fast algorithms for max independent set

Fast algorithms for max independent set Fast algorithms for max independent set N. Bourgeois 1 B. Escoffier 1 V. Th. Paschos 1 J.M.M. van Rooij 2 1 LAMSADE, CNRS and Université Paris-Dauphine, France {bourgeois,escoffier,paschos}@lamsade.dauphine.fr

More information

ALGORITHMS EXAMINATION Department of Computer Science New York University December 17, 2007

ALGORITHMS EXAMINATION Department of Computer Science New York University December 17, 2007 ALGORITHMS EXAMINATION Department of Computer Science New York University December 17, 2007 This examination is a three hour exam. All questions carry the same weight. Answer all of the following six questions.

More information

1 Definition of Reduction

1 Definition of Reduction 1 Definition of Reduction Problem A is reducible, or more technically Turing reducible, to problem B, denoted A B if there a main program M to solve problem A that lacks only a procedure to solve problem

More information

Lecture 7: Counting classes

Lecture 7: Counting classes princeton university cos 522: computational complexity Lecture 7: Counting classes Lecturer: Sanjeev Arora Scribe:Manoj First we define a few interesting problems: Given a boolean function φ, #SAT is the

More information

11/22/2016. Chapter 9 Graph Algorithms. Introduction. Definitions. Definitions. Definitions. Definitions

11/22/2016. Chapter 9 Graph Algorithms. Introduction. Definitions. Definitions. Definitions. Definitions Introduction Chapter 9 Graph Algorithms graph theory useful in practice represent many real-life problems can be slow if not careful with data structures 2 Definitions an undirected graph G = (V, E) is

More information

The optimal routing of augmented cubes.

The optimal routing of augmented cubes. The optimal routing of augmented cubes. Meirun Chen, Reza Naserasr To cite this version: Meirun Chen, Reza Naserasr. The optimal routing of augmented cubes.. Information Processing Letters, Elsevier, 28.

More information

Chapter 9 Graph Algorithms

Chapter 9 Graph Algorithms Chapter 9 Graph Algorithms 2 Introduction graph theory useful in practice represent many real-life problems can be slow if not careful with data structures 3 Definitions an undirected graph G = (V, E)

More information

NP-complete Reductions

NP-complete Reductions NP-complete Reductions 1. Prove that 3SAT P DOUBLE-SAT, i.e., show DOUBLE-SAT is NP-complete by reduction from 3SAT. The 3-SAT problem consists of a conjunction of clauses over n Boolean variables, where

More information

COMP260 Spring 2014 Notes: February 4th

COMP260 Spring 2014 Notes: February 4th COMP260 Spring 2014 Notes: February 4th Andrew Winslow In these notes, all graphs are undirected. We consider matching, covering, and packing in bipartite graphs, general graphs, and hypergraphs. We also

More information

ABC basics (compilation from different articles)

ABC basics (compilation from different articles) 1. AIG construction 2. AIG optimization 3. Technology mapping ABC basics (compilation from different articles) 1. BACKGROUND An And-Inverter Graph (AIG) is a directed acyclic graph (DAG), in which a node

More information

Feedback Arc Set in Bipartite Tournaments is NP-Complete

Feedback Arc Set in Bipartite Tournaments is NP-Complete Feedback Arc Set in Bipartite Tournaments is NP-Complete Jiong Guo 1 Falk Hüffner 1 Hannes Moser 2 Institut für Informatik, Friedrich-Schiller-Universität Jena, Ernst-Abbe-Platz 2, D-07743 Jena, Germany

More information

Efficient Polynomial-Time Nested Loop Fusion with Full Parallelism

Efficient Polynomial-Time Nested Loop Fusion with Full Parallelism Efficient Polynomial-Time Nested Loop Fusion with Full Parallelism Edwin H.-M. Sha Timothy W. O Neil Nelson L. Passos Dept. of Computer Science Dept. of Computer Science Dept. of Computer Science Erik

More information

Scheduling Tasks Sharing Files from Distributed Repositories (revised version)

Scheduling Tasks Sharing Files from Distributed Repositories (revised version) INSTITUT NATIONAL DE RECHERCHE EN INFORMATIQUE ET EN AUTOMATIQUE Scheduling Tasks Sharing Files from Distributed Repositories (revised version) Arnaud Giersch Yves Robert Frédéric Vivien N February 00

More information

NP versus PSPACE. Frank Vega. To cite this version: HAL Id: hal https://hal.archives-ouvertes.fr/hal

NP versus PSPACE. Frank Vega. To cite this version: HAL Id: hal https://hal.archives-ouvertes.fr/hal NP versus PSPACE Frank Vega To cite this version: Frank Vega. NP versus PSPACE. Preprint submitted to Theoretical Computer Science 2015. 2015. HAL Id: hal-01196489 https://hal.archives-ouvertes.fr/hal-01196489

More information

Computational problems. Lecture 2: Combinatorial search and optimisation problems. Computational problems. Examples. Example

Computational problems. Lecture 2: Combinatorial search and optimisation problems. Computational problems. Examples. Example Lecture 2: Combinatorial search and optimisation problems Different types of computational problems Examples of computational problems Relationships between problems Computational properties of different

More information

Vertex Cover Approximations

Vertex Cover Approximations CS124 Lecture 20 Heuristics can be useful in practice, but sometimes we would like to have guarantees. Approximation algorithms give guarantees. It is worth keeping in mind that sometimes approximation

More information

NP-Hardness. We start by defining types of problem, and then move on to defining the polynomial-time reductions.

NP-Hardness. We start by defining types of problem, and then move on to defining the polynomial-time reductions. CS 787: Advanced Algorithms NP-Hardness Instructor: Dieter van Melkebeek We review the concept of polynomial-time reductions, define various classes of problems including NP-complete, and show that 3-SAT

More information

Increasing Parallelism of Loops with the Loop Distribution Technique

Increasing Parallelism of Loops with the Loop Distribution Technique Increasing Parallelism of Loops with the Loop Distribution Technique Ku-Nien Chang and Chang-Biau Yang Department of pplied Mathematics National Sun Yat-sen University Kaohsiung, Taiwan 804, ROC cbyang@math.nsysu.edu.tw

More information

Computability Theory

Computability Theory CS:4330 Theory of Computation Spring 2018 Computability Theory Other NP-Complete Problems Haniel Barbosa Readings for this lecture Chapter 7 of [Sipser 1996], 3rd edition. Sections 7.4 and 7.5. The 3SAT

More information

Lecture 22 Tuesday, April 10

Lecture 22 Tuesday, April 10 CIS 160 - Spring 2018 (instructor Val Tannen) Lecture 22 Tuesday, April 10 GRAPH THEORY Directed Graphs Directed graphs (a.k.a. digraphs) are an important mathematical modeling tool in Computer Science,

More information

Recitation 4: Elimination algorithm, reconstituted graph, triangulation

Recitation 4: Elimination algorithm, reconstituted graph, triangulation Massachusetts Institute of Technology Department of Electrical Engineering and Computer Science 6.438 Algorithms For Inference Fall 2014 Recitation 4: Elimination algorithm, reconstituted graph, triangulation

More information

Digital Logic Design: a rigorous approach c

Digital Logic Design: a rigorous approach c Digital Logic Design: a rigorous approach c Chapter 4: Directed Graphs Guy Even Moti Medina School of Electrical Engineering Tel-Aviv Univ. October 31, 2017 Book Homepage: http://www.eng.tau.ac.il/~guy/even-medina

More information

8 Matroid Intersection

8 Matroid Intersection 8 Matroid Intersection 8.1 Definition and examples 8.2 Matroid Intersection Algorithm 8.1 Definitions Given two matroids M 1 = (X, I 1 ) and M 2 = (X, I 2 ) on the same set X, their intersection is M 1

More information

Optimizing Latency and Reliability of Pipeline Workflow Applications

Optimizing Latency and Reliability of Pipeline Workflow Applications Laboratoire de l Informatique du Parallélisme École Normale Supérieure de Lyon Unité Mixte de Recherche CNRS-INRIA-ENS LYON-UCBL n o 5668 Optimizing Latency and Reliability of Pipeline Workflow Applications

More information

Problem Set 7 Solutions

Problem Set 7 Solutions Design and Analysis of Algorithms March 0, 2015 Massachusetts Institute of Technology 6.046J/18.410J Profs. Erik Demaine, Srini Devadas, and Nancy Lynch Problem Set 7 Solutions Problem Set 7 Solutions

More information

Example of a Demonstration that a Problem is NP-Complete by reduction from CNF-SAT

Example of a Demonstration that a Problem is NP-Complete by reduction from CNF-SAT 20170926 CNF-SAT: CNF-SAT is a problem in NP, defined as follows: Let E be a Boolean expression with m clauses and n literals (literals = variables, possibly negated), in which - each clause contains only

More information

Eulerian disjoint paths problem in grid graphs is NP-complete

Eulerian disjoint paths problem in grid graphs is NP-complete Discrete Applied Mathematics 143 (2004) 336 341 Notes Eulerian disjoint paths problem in grid graphs is NP-complete Daniel Marx www.elsevier.com/locate/dam Department of Computer Science and Information

More information

Lecture 11: Maximum flow and minimum cut

Lecture 11: Maximum flow and minimum cut Optimisation Part IB - Easter 2018 Lecture 11: Maximum flow and minimum cut Lecturer: Quentin Berthet 4.4. The maximum flow problem. We consider in this lecture a particular kind of flow problem, with

More information

Representation of Finite Games as Network Congestion Games

Representation of Finite Games as Network Congestion Games Representation of Finite Games as Network Congestion Games Igal Milchtaich To cite this version: Igal Milchtaich. Representation of Finite Games as Network Congestion Games. Roberto Cominetti and Sylvain

More information

Complexity Classes and Polynomial-time Reductions

Complexity Classes and Polynomial-time Reductions COMPSCI 330: Design and Analysis of Algorithms April 19, 2016 Complexity Classes and Polynomial-time Reductions Lecturer: Debmalya Panigrahi Scribe: Tianqi Song 1 Overview In this lecture, we introduce

More information

We will focus on data dependencies: when an operand is written at some point and read at a later point. Example:!

We will focus on data dependencies: when an operand is written at some point and read at a later point. Example:! Class Notes 18 June 2014 Tufts COMP 140, Chris Gregg Detecting and Enhancing Loop-Level Parallelism Loops: the reason we can parallelize so many things If the compiler can figure out if a loop is parallel,

More information

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings

On the Relationships between Zero Forcing Numbers and Certain Graph Coverings On the Relationships between Zero Forcing Numbers and Certain Graph Coverings Fatemeh Alinaghipour Taklimi, Shaun Fallat 1,, Karen Meagher 2 Department of Mathematics and Statistics, University of Regina,

More information

Notes for Lecture 24

Notes for Lecture 24 U.C. Berkeley CS170: Intro to CS Theory Handout N24 Professor Luca Trevisan December 4, 2001 Notes for Lecture 24 1 Some NP-complete Numerical Problems 1.1 Subset Sum The Subset Sum problem is defined

More information

CS261: Problem Set #1

CS261: Problem Set #1 CS261: Problem Set #1 Due by 11:59 PM on Tuesday, April 21, 2015 Instructions: (1) Form a group of 1-3 students. You should turn in only one write-up for your entire group. (2) Turn in your solutions by

More information

Best known solution time is Ω(V!) Check every permutation of vertices to see if there is a graph edge between adjacent vertices

Best known solution time is Ω(V!) Check every permutation of vertices to see if there is a graph edge between adjacent vertices Hard Problems Euler-Tour Problem Undirected graph G=(V,E) An Euler Tour is a path where every edge appears exactly once. The Euler-Tour Problem: does graph G have an Euler Path? Answerable in O(E) time.

More information

Disjoint Support Decompositions

Disjoint Support Decompositions Chapter 4 Disjoint Support Decompositions We introduce now a new property of logic functions which will be useful to further improve the quality of parameterizations in symbolic simulation. In informal

More information

Exam problems for the course Combinatorial Optimization I (DM208)

Exam problems for the course Combinatorial Optimization I (DM208) Exam problems for the course Combinatorial Optimization I (DM208) Jørgen Bang-Jensen Department of Mathematics and Computer Science University of Southern Denmark The problems are available form the course

More information

12.1 Formulation of General Perfect Matching

12.1 Formulation of General Perfect Matching CSC5160: Combinatorial Optimization and Approximation Algorithms Topic: Perfect Matching Polytope Date: 22/02/2008 Lecturer: Lap Chi Lau Scribe: Yuk Hei Chan, Ling Ding and Xiaobing Wu In this lecture,

More information

Matching Algorithms. Proof. If a bipartite graph has a perfect matching, then it is easy to see that the right hand side is a necessary condition.

Matching Algorithms. Proof. If a bipartite graph has a perfect matching, then it is easy to see that the right hand side is a necessary condition. 18.433 Combinatorial Optimization Matching Algorithms September 9,14,16 Lecturer: Santosh Vempala Given a graph G = (V, E), a matching M is a set of edges with the property that no two of the edges have

More information

Strongly connected: A directed graph is strongly connected if every pair of vertices are reachable from each other.

Strongly connected: A directed graph is strongly connected if every pair of vertices are reachable from each other. Directed Graph In a directed graph, each edge (u, v) has a direction u v. Thus (u, v) (v, u). Directed graph is useful to model many practical problems (such as one-way road in traffic network, and asymmetric

More information

6.2. Paths and Cycles

6.2. Paths and Cycles 6.2. PATHS AND CYCLES 85 6.2. Paths and Cycles 6.2.1. Paths. A path from v 0 to v n of length n is a sequence of n+1 vertices (v k ) and n edges (e k ) of the form v 0, e 1, v 1, e 2, v 2,..., e n, v n,

More information

In this lecture, we ll look at applications of duality to three problems:

In this lecture, we ll look at applications of duality to three problems: Lecture 7 Duality Applications (Part II) In this lecture, we ll look at applications of duality to three problems: 1. Finding maximum spanning trees (MST). We know that Kruskal s algorithm finds this,

More information

arxiv: v2 [cs.ds] 18 May 2015

arxiv: v2 [cs.ds] 18 May 2015 Optimal Shuffle Code with Permutation Instructions Sebastian Buchwald, Manuel Mohr, and Ignaz Rutter Karlsruhe Institute of Technology {sebastian.buchwald, manuel.mohr, rutter}@kit.edu arxiv:1504.07073v2

More information

A proof-producing CSP solver: A proof supplement

A proof-producing CSP solver: A proof supplement A proof-producing CSP solver: A proof supplement Report IE/IS-2010-02 Michael Veksler Ofer Strichman mveksler@tx.technion.ac.il ofers@ie.technion.ac.il Technion Institute of Technology April 12, 2010 Abstract

More information

CSE 421 Applications of DFS(?) Topological sort

CSE 421 Applications of DFS(?) Topological sort CSE 421 Applications of DFS(?) Topological sort Yin Tat Lee 1 Precedence Constraints In a directed graph, an edge (i, j) means task i must occur before task j. Applications Course prerequisite: course

More information

Small Survey on Perfect Graphs

Small Survey on Perfect Graphs Small Survey on Perfect Graphs Michele Alberti ENS Lyon December 8, 2010 Abstract This is a small survey on the exciting world of Perfect Graphs. We will see when a graph is perfect and which are families

More information

Boolean Functions (Formulas) and Propositional Logic

Boolean Functions (Formulas) and Propositional Logic EECS 219C: Computer-Aided Verification Boolean Satisfiability Solving Part I: Basics Sanjit A. Seshia EECS, UC Berkeley Boolean Functions (Formulas) and Propositional Logic Variables: x 1, x 2, x 3,, x

More information

Maximum flows & Maximum Matchings

Maximum flows & Maximum Matchings Chapter 9 Maximum flows & Maximum Matchings This chapter analyzes flows and matchings. We will define flows and maximum flows and present an algorithm that solves the maximum flow problem. Then matchings

More information

CSC 505, Fall 2000: Week 12

CSC 505, Fall 2000: Week 12 CSC 505, Fall 000: Week Proving the N P-completeness of a decision problem A:. Prove that A is in N P give a simple guess and check algorithm (the certificate being guessed should be something requiring

More information

Greedy Algorithms 1. For large values of d, brute force search is not feasible because there are 2 d

Greedy Algorithms 1. For large values of d, brute force search is not feasible because there are 2 d Greedy Algorithms 1 Simple Knapsack Problem Greedy Algorithms form an important class of algorithmic techniques. We illustrate the idea by applying it to a simplified version of the Knapsack Problem. Informally,

More information

Discrete Optimization. Lecture Notes 2

Discrete Optimization. Lecture Notes 2 Discrete Optimization. Lecture Notes 2 Disjunctive Constraints Defining variables and formulating linear constraints can be straightforward or more sophisticated, depending on the problem structure. The

More information

Strategies for Replica Placement in Tree Networks

Strategies for Replica Placement in Tree Networks Laboratoire de l Informatique du Parallélisme École Normale Supérieure de Lyon Unité Mixte de Recherche CNRS-INRIA-ENS LYON-UCBL n o 5668 Strategies for Replica Placement in Tree Networks Anne Benoit,

More information

Solution for Homework set 3

Solution for Homework set 3 TTIC 300 and CMSC 37000 Algorithms Winter 07 Solution for Homework set 3 Question (0 points) We are given a directed graph G = (V, E), with two special vertices s and t, and non-negative integral capacities

More information

Section 3.1: Nonseparable Graphs Cut vertex of a connected graph G: A vertex x G such that G x is not connected. Theorem 3.1, p. 57: Every connected

Section 3.1: Nonseparable Graphs Cut vertex of a connected graph G: A vertex x G such that G x is not connected. Theorem 3.1, p. 57: Every connected Section 3.1: Nonseparable Graphs Cut vertex of a connected graph G: A vertex x G such that G x is not connected. Theorem 3.1, p. 57: Every connected graph G with at least 2 vertices contains at least 2

More information

Math 443/543 Graph Theory Notes 11: Graph minors and Kuratowski s Theorem

Math 443/543 Graph Theory Notes 11: Graph minors and Kuratowski s Theorem Math 443/543 Graph Theory Notes 11: Graph minors and Kuratowski s Theorem David Glickenstein November 26, 2008 1 Graph minors Let s revisit some de nitions. Let G = (V; E) be a graph. De nition 1 Removing

More information

LID Assignment In InfiniBand Networks

LID Assignment In InfiniBand Networks LID Assignment In InfiniBand Networks Wickus Nienaber, Xin Yuan, Member, IEEE and Zhenhai Duan, Member, IEEE Abstract To realize a path in an InfiniBand network, an address, known as Local IDentifier (LID)

More information

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18

/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 601.433/633 Introduction to Algorithms Lecturer: Michael Dinitz Topic: Approximation algorithms Date: 11/27/18 22.1 Introduction We spent the last two lectures proving that for certain problems, we can

More information

On Universal Cycles of Labeled Graphs

On Universal Cycles of Labeled Graphs On Universal Cycles of Labeled Graphs Greg Brockman Harvard University Cambridge, MA 02138 United States brockman@hcs.harvard.edu Bill Kay University of South Carolina Columbia, SC 29208 United States

More information

Lecture 10 October 7, 2014

Lecture 10 October 7, 2014 6.890: Algorithmic Lower Bounds: Fun With Hardness Proofs Fall 2014 Lecture 10 October 7, 2014 Prof. Erik Demaine Scribes: Fermi Ma, Asa Oines, Mikhail Rudoy, Erik Waingarten Overview This lecture begins

More information

Module 11. Directed Graphs. Contents

Module 11. Directed Graphs. Contents Module 11 Directed Graphs Contents 11.1 Basic concepts......................... 256 Underlying graph of a digraph................ 257 Out-degrees and in-degrees.................. 258 Isomorphism..........................

More information

Diverse Routing with the star property

Diverse Routing with the star property Diverse Routing with the star property Jean-Claude Bermond, David Coudert, Gianlorenzo D Angelo, Fatima Zahra Moataz RESEARCH REPORT N 8071 September 2012 Project-Team MASCOTTE ISSN 0249-6399 ISRN INRIA/RR--8071--FR+ENG

More information

CPSC 536N: Randomized Algorithms Term 2. Lecture 10

CPSC 536N: Randomized Algorithms Term 2. Lecture 10 CPSC 536N: Randomized Algorithms 011-1 Term Prof. Nick Harvey Lecture 10 University of British Columbia In the first lecture we discussed the Max Cut problem, which is NP-complete, and we presented a very

More information

Structure of spaces of rhombus tilings in the lexicograhic case

Structure of spaces of rhombus tilings in the lexicograhic case EuroComb 5 DMTCS proc. AE, 5, 45 5 Structure of spaces of rhombus tilings in the lexicograhic case Eric Rémila, Laboratoire de l Informatique du Parallélisme (umr 5668 CNRS-INRIA-Univ. Lyon -ENS Lyon),

More information

1 Introduction. 1. Prove the problem lies in the class NP. 2. Find an NP-complete problem that reduces to it.

1 Introduction. 1. Prove the problem lies in the class NP. 2. Find an NP-complete problem that reduces to it. 1 Introduction There are hundreds of NP-complete problems. For a recent selection see http://www. csc.liv.ac.uk/ ped/teachadmin/comp202/annotated_np.html Also, see the book M. R. Garey and D. S. Johnson.

More information

Global Register Allocation

Global Register Allocation Global Register Allocation Lecture Outline Memory Hierarchy Management Register Allocation via Graph Coloring Register interference graph Graph coloring heuristics Spilling Cache Management 2 The Memory

More information