Optimization of Memory Usage and Communication Requirements for a Class of Loops Implementing Multi-Dimensional Integrals


Chi-Chung Lam†, Daniel Cociorva‡, Gerald Baumgartner†, P. Sadayappan†

†Department of Computer and Information Science, The Ohio State University, Columbus, OH
{clam,gb,saday}@cis.ohio-state.edu

‡Department of Physics, The Ohio State University, Columbus, OH
cociorva@pacific.mps.ohio-state.edu

Abstract

Multi-dimensional integrals of products of several arrays arise in certain scientific computations. To optimize the performance of such computations on parallel computers, the total amount of communication needs to be minimized while staying within the available memory on each processor. In the context of these integral calculations, this paper addresses a communication minimization problem under a memory constraint. We first investigate the memory usage minimization problem for sequential computers. Based on a framework that models the relationship between loop fusion and memory usage, we propose algorithms for finding loop fusion configurations that minimize memory usage under static and dynamic memory allocation models. We suggest ways to further reduce memory usage, when necessary, at the cost of increased arithmetic operations. A practical example shows the performance improvement obtained by our algorithms on an electronic structure computation. The algorithms are then extended to the context of parallel computers, to determine the loop fusions and array distributions that minimize communication cost without exceeding the memory limit.

Supported in part by the National Science Foundation under grant DMR

Keywords: memory usage optimization, communication optimization, loop fusion, multi-dimensional integral

1 Introduction

This paper addresses the optimization of a class of loop computations that implement multi-dimensional integrals of the product of several arrays. Such integral calculations arise, for example, in the computation of electronic properties of semiconductors and metals [1, 7, 15]. The objective is to minimize the execution time of such computations on a parallel computer while staying within the available memory. In addition to the performance optimization issues pertaining to inter-processor communication and data locality enhancement, there is an opportunity to apply algebraic transformations using the properties of commutativity, associativity, and distributivity to reduce the total number of arithmetic operations. Given a specification of the required computation as a multi-dimensional sum of the product of input arrays, we first determine an equivalent sequence of multiplication and summation formulae that computes the result using a minimum number of arithmetic operations. Each formula computes and stores some intermediate results in an intermediate array. By computing the intermediate results once and reusing them multiple times, the number of arithmetic operations can be reduced. In previous work, this operation minimization problem was proved to be NP-complete and an efficient pruning search strategy was proposed [10].

The simplest way to implement an optimal sequence of multiplication and summation formulae is to compute the formulae one by one, each coded as a set of perfectly-nested loops, and to store the intermediate results produced by each formula in an intermediate array. In practice, however, the input and intermediate arrays could be so large that they cannot fit into the available memory. Hence, there is a need to fuse the loops as a means of reducing memory usage. By fusing loops between the producer loop and the consumer loop of an intermediate array, intermediate results are formed and used in a pipelined fashion, reusing the same reduced array space. The problem of finding a loop fusion configuration that minimizes memory usage without increasing the operation count is not trivial.

In this paper, we develop an optimization framework that appropriately models the relation between loop fusion and memory usage. We present algorithms that find an optimal loop fusion configuration minimizing memory usage, under both static and dynamic memory allocation models. We describe ways to trade off operation count for memory usage. The algorithms are applied to a practical physics computation to demonstrate their effectiveness. We then extend the framework to the parallel context and propose two approaches for determining the distribution of data arrays and computations among processors, and the loop fusions on each processor, that minimize inter-processor communication for load-balanced parallel execution while not exceeding the memory limit.

Reduction of arithmetic operations has traditionally been done by compilers using the technique of common subexpression elimination [4]. Chatterjee et al. consider the optimal alignment of arrays in evaluating array expressions on massively parallel machines [2, 3]. Much work has been done on improving locality and parallelism by loop fusion [8, 14, 16]. However, this paper considers a different use of loop fusion: reducing the array sizes and memory usage of automatically synthesized code containing nested loop structures. Traditional compiler research does not address this use of loop fusion because the problem does not arise with manually-produced programs. The contraction of arrays into scalars through loop fusion is studied in [6], but it is motivated by data locality enhancement rather than memory reduction. Loop fusion in the context of delayed evaluation of array expressions in APL programs is discussed in [5], but it is likewise not aimed at minimizing array sizes. We are unaware of any work on fusion of multi-dimensional loop nests into imperfectly-nested loops as a means of reducing memory usage.

The rest of this paper is organized as follows. Section 2 describes the operation minimization problem. Section 3 studies the use of loop fusion to reduce array sizes and presents algorithms for finding an optimal loop fusion configuration that minimizes memory usage. Section 4 considers the optimal partitioning of data and loops to minimize communication and computational costs for execution on parallel machines under a memory constraint and describes two approaches to the problem. Section 5 provides conclusions.

2 Operation Minimization

In the class of computations considered, the final result to be computed can be expressed as a multi-dimensional integral of the product of many input arrays. Due to commutativity, associativity, and distributivity, there are many different ways to obtain the same final result, and they could differ widely in the number of floating point operations required. The problem of finding an equivalent form that computes the result with the least number of operations is not trivial, so a software tool for doing this is desirable.

Consider, for example, the multi-dimensional integral shown in Figure 1(a). If implemented directly as expressed (i.e., as a single set of perfectly-nested loops), the computation would require roughly 3 × N_i × N_j × N_k × N_l arithmetic operations (two multiplications and one addition for each of the N_i N_j N_k N_l index tuples). However, assuming that associative reordering of the operations and use of the distributive law of multiplication over addition is satisfactory for the floating-point computations, the above computation can be rewritten in various ways. One equivalent form that requires only about N_i N_j + 2 N_j N_k N_l + 2 N_j N_k operations is given in Figure 1(b). It expresses the sequence of steps in computing the multi-dimensional integral as a sequence of formulae. Each formula computes some intermediate result, and the last formula gives the final result. A formula is either a product of two input/intermediate arrays or an integral/summation, over one index, of an input/intermediate array. A sequence of formulae can also be represented as an expression tree. For instance, Figure 1(c) shows the expression tree corresponding to the example formula sequence. The problem of finding a formula sequence that minimizes the number of operations has been proved to be NP-complete [10]. A pruning search algorithm for finding such a formula sequence is given below.

W(k) = Σ_{i,j,l} A(i,j) × B(j,k,l) × C(k,l)

(a) A multi-dimensional integral

f1(j) = Σ_i A(i,j)
f2(j,k,l) = B(j,k,l) × C(k,l)
f3(j,k) = Σ_l f2(j,k,l)
f4(j,k) = f1(j) × f3(j,k)
W(k) = f5(k) = Σ_j f4(j,k)

(b) A formula sequence for computing (a)

(c) An expression tree representation of (b) [tree diagram not reproducible in text: A feeds f1; B and C feed f2; f2 feeds f3; f1 and f3 feed f4; f4 feeds f5]

Figure 1: An example multi-dimensional integral and two representations of a computation.

1. Form a list of the product terms of the multi-dimensional integral. Let T_j denote the j-th product term and dimens(T_j) the set of index variables in T_j. Set r to zero and n to the number of product terms.

2. While there exists a summation index (say i) that appears in exactly one term (say T_j) in the list, increment r and n and create a formula f_r = Σ_i T_j, where dimens(f_r) = dimens(T_j) − {i}. Remove T_j from the list. Append f_r to the list as T_n.

3. Increment r and n and form a formula f_r = T_j × T_k, where T_j and T_k are two terms in the list with j ≠ k, giving priority to terms that have exactly the same set of indices. The indices for f_r are dimens(f_r) = dimens(T_j) ∪ dimens(T_k). Remove T_j and T_k from the list. Append f_r to the list as T_n. Go to step 2.

4. When steps 2 and 3 cannot be performed any more, a valid formula sequence has been obtained. To obtain all valid sequences, exhaust all alternatives in step 3 using depth-first search.
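To make the search concrete, the following is a minimal Python sketch of steps 1-4 (our illustration, not the authors' implementation): terms are frozensets of index names, N maps each index to its range, the cost model simply charges the product of the loop ranges for each formula, and the step-3 priority heuristic is realized only through branching order.

from itertools import combinations
from math import prod

def min_ops(terms, sum_indices, N):
    """Depth-first pruning search over formula sequences. Returns the
    minimum operation count found under the simple cost model above."""
    best = [float("inf")]

    def search(terms, remaining, ops):
        if ops >= best[0]:
            return                                  # prune this branch
        # Step 2: sum out every index that occurs in exactly one term.
        changed = True
        while changed:
            changed = False
            for i in list(remaining):
                holders = [t for t in terms if i in t]
                if len(holders) == 1:
                    t = holders[0]
                    ops += prod(N[x] for x in t)    # additions for the summation
                    terms = [u for u in terms if u is not t] + [t - {i}]
                    remaining = remaining - {i}
                    changed = True
        if ops >= best[0]:
            return
        if len(terms) == 1:
            best[0] = ops                           # step 4: a complete sequence
            return
        # Step 3: branch over all pairs of terms (depth-first search).
        for a, b in combinations(terms, 2):
            rest = [u for u in terms if u is not a and u is not b]
            search(rest + [a | b], remaining,
                   ops + prod(N[x] for x in a | b)) # multiplications for the product
    search([frozenset(t) for t in terms], frozenset(sum_indices), 0)
    return best[0]

For the integral of Figure 1, min_ops([{'i','j'}, {'j','k','l'}, {'k','l'}], {'i','j','l'}, {'i': 500, 'j': 100, 'k': 40, 'l': 15}) exhausts the alternatives of step 3 and returns the cost of the best formula sequence found.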

3 Memory Usage Minimization

In implementing the computation represented by an operation-count-optimal formula sequence (or expression tree), there is a need to perform loop fusion to reduce the sizes of the arrays. Without fusing the loops, the arrays could be too large to fit into the available memory. There are many different ways to fuse the loops, and they can result in different memory usage. This section addresses the problem of finding a loop fusion configuration for a given formula sequence that uses the least amount of memory. Section 3.1 introduces the memory usage minimization problem. Section 3.2 describes some preliminary concepts that we use to formulate our solutions to the problem. Sections 3.3 and 3.4 present algorithms for finding memory-optimal loop fusion configurations under static memory allocation and dynamic memory allocation, respectively. Section 3.5 explores ways to further reduce memory usage at the cost of increasing the number of arithmetic operations. Section 3.6 illustrates how the application of the algorithms to an example physics computation improves its performance.

3.1 Problem Description

Consider again the expression tree shown in Figure 1(c). A naive way to implement the computation is to have a set of perfectly-nested loops for each node in the tree, as shown in Figure 2(a). The brackets indicate the scopes of the loops. Figure 2(b) shows how the sizes of the arrays may be reduced by the use of loop fusions. It shows the resulting loop structure after fusing all the loops between A and f1, all the loops among B, C, f2, and f3, and all the loops between f4 and f5. Here, A, B, C, f2, and f4 are reduced to scalars. After fusing all the loops between a node and its parent, all dimensions of the child array are no longer needed and can be eliminated. The elements in the reduced arrays are now reused to hold different values at different iterations of the fused loops. Each of those values was held by a different array element before the loops were fused (as in Figure 2(a)). Note that some loop nests (such as those for B and C) are reordered, and some loops within loop nests (such as the j-, k-, and l-loops for B, f2, and f3) are permuted, in order to facilitate loop fusions.

For now, we assume the leaf node arrays (i.e., input arrays) can be generated one element at a time (by the genv function for array v), so that loop fusions with their parents are allowed. This assumption holds for arrays in which the value of each element is a function of the array subscripts, as in many arrays in the physics computations that we work on. As will become clear later on, the case where an input array has to be read in or produced in slices or in its entirety can be handled by disabling the fusion of some or all of the loops between the leaf node and its parent.

Figure 2(c) shows another possible loop fusion configuration, obtained by fusing all the j-loops and then all the k-loops and l-loops inside them. The sizes of all arrays except C and f5 are smaller. By fusing the j-, k-, and l-loops between those nodes, the j-, k-, and l-dimensions of the corresponding arrays can be eliminated. Hence, f1, B, f2, f3, and f4 are reduced to scalars, while A becomes a one-dimensional array. In general, fusing a t-loop between a node v and its parent eliminates the t-dimension of the array v and reduces the array size by a factor of N_t. In other words, the size of an array after loop fusions equals the product of the ranges of the loops that are not fused with its parent. We only consider fusions of loops among nodes that are all transitively related by (i.e., form a transitive closure over) parent-child relations.

Figure 2: Three loop fusion configurations for the expression tree in Figure 1.

(a)
for i
  for j
    A[i,j] = genA(i,j)
for j
  for k
    for l
      B[j,k,l] = genB(j,k,l)
for k
  for l
    C[k,l] = genC(k,l)
initialize f1
for i
  for j
    f1[j] += A[i,j]
for j
  for k
    for l
      f2[j,k,l] = B[j,k,l] * C[k,l]
initialize f3
for j
  for k
    for l
      f3[j,k] += f2[j,k,l]
for j
  for k
    f4[j,k] = f1[j] * f3[j,k]
initialize f5
for j
  for k
    f5[k] += f4[j,k]

(b)
initialize f1
for i
  for j
    A = genA(i,j)
    f1[j] += A
initialize f3
for k
  for l
    C = genC(k,l)
    for j
      B = genB(j,k,l)
      f2 = B * C
      f3[j,k] += f2
initialize f5
for j
  for k
    f4 = f1[j] * f3[j,k]
    f5[k] += f4

(c)
for k
  for l
    C[k,l] = genC(k,l)
initialize f5
for j
  for i
    A[i] = genA(i,j)
  initialize f1
  for i
    f1 += A[i]
  for k
    initialize f3
    for l
      B = genB(j,k,l)
      f2 = B * C[k,l]
      f3 += f2
    f4 = f1 * f3
    f5[k] += f4

(d)-(f) show the fusion graphs corresponding to configurations (a)-(c); these node-and-edge diagrams are not reproducible in text form.

Fusing loops between unrelated nodes (such as fusing siblings without fusing their parent) has no effect on array sizes. We also restrict our attention for now to loop fusion configurations that do not increase the operation count. The tradeoff between memory usage and arithmetic operations is considered in Section 3.5.

In the class of loops considered in this paper, the only dependence relations are those between children and parents, and array subscripts are simply loop index variables. (When array subscripts are not simple loop variables, as many researchers have studied, more dependence relations exist, which prevent some loop rearrangements and/or loop fusions. In that case, a restricted set of loop fusion configurations would need to be searched in minimizing memory usage.) So, loop permutations, loop nest reorderings, and loop fusions are always legal as long as child nodes are evaluated before their parents. This freedom allows the loops to be permuted, reordered, and fused in a large number of ways that differ in memory usage. Finding a loop fusion configuration that uses the least memory is not trivial. We believe this problem is NP-complete but have not found a proof yet.

Fusion graphs. Let T be an expression tree. For any given node v ∈ T, let subtree(v) be the set of nodes in the subtree rooted at v, v.parent be the parent of v, and v.indices be the set of loop indices for v (including the summation index v.sumindex if v is a summation node). A loop fusion configuration can be represented by a graph called a fusion graph, which is constructed from T as follows.

1. Replace each node v in T by a set of vertices, one for each index i ∈ v.indices.
2. Remove all tree edges in T for clarity.
3. For each loop fused (say, of index i) between a node and its parent, connect the i-vertices of the two nodes with a fusion edge.
4. For each pair of vertices with matching index between a node and its parent, if they are not already connected with a fusion edge, connect them with a potential fusion edge.

Figure 2 shows the fusion graphs alongside the loop fusion configurations. In the figure, solid lines are fusion edges and dotted lines are potential fusion edges, i.e., fusion opportunities not exploited. As an example, consider the loop fusion configuration in Figure 2(b) and the corresponding fusion graph in Figure 2(e). Since the j-, k-, and l-loops are fused between B and f2, there are three fusion edges, one for each of the three loops, between B and f2 in the fusion graph. Also, since no loops are fused between f3 and f4, the edges between f3 and f4 in the fusion graph remain potential fusion edges.

In a fusion graph, we call each connected component of fusion edges (i.e., a maximal set of connected fusion edges) a fusion chain, which corresponds to a fused loop in the loop structure. The scope of a fusion chain c, denoted scope(c), is defined as the set of nodes it spans. In Figure 2(f), there are three fusion chains, one for each of the j-, k-, and l-loops; the scope of the shortest fusion chain is {B, f2, f3}. The scopes of any two fusion chains in a fusion graph must either be disjoint or be a subset/superset of each other. Scopes of fusion chains do not partially overlap because loops do not (i.e., loops must be either separate or nested). Therefore, any fusion graph containing fusion chains whose scopes partially overlap is illegal and does not correspond to any loop fusion configuration.

Fusion graphs help us visualize the structure of the fused loops and find further fusion opportunities. If we can find a set of potential fusion edges that, when converted to fusion edges, does not lead to partially overlapping scopes of fusion chains, then we can perform the corresponding loop fusions and reduce the size of some arrays. For example, the i-loops between A and f1 in Figure 2(f) can be further fused, and array A would be reduced to a scalar. If converting all potential fusion edges in a fusion graph to fusion edges does not make the fusion graph illegal, then we can completely fuse all the loops and achieve optimal memory usage. But for many fusion graphs of real-life loop configurations (including the ones in Figure 2), this does not hold. Instead, potential fusion edges may be mutually prohibitive: fusing one loop could prevent the fusion of another. In Figure 2(e), fusing the j-loops between f1 and f4 would disallow the fusion of the k-loops between f3 and f4.
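The disjoint-or-nested condition on fusion-chain scopes is easy to check mechanically. The following small Python sketch (our illustration, not code from the paper) tests a set of candidate fusion chains, each given as the set of nodes its fused loop spans:

from itertools import combinations

def legal(chains):
    """chains: list of (index, scope) pairs, where scope is a frozenset
    of node names spanned by one fusion chain. Legal iff every two
    scopes are disjoint or one contains the other."""
    for (_, s), (_, t) in combinations(chains, 2):
        common = s & t
        if common and common != s and common != t:
            return False        # partially overlapping scopes: illegal
    return True

# The mutual prohibition discussed above, for Figure 2(e):
jf = ('j', frozenset({'A', 'f1', 'f4', 'f5'}))              # fuse j between f1 and f4
kf = ('k', frozenset({'B', 'C', 'f2', 'f3', 'f4', 'f5'}))   # fuse k between f3 and f4
assert not legal([jf, kf])      # the two fusions cannot coexist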

Although a fusion graph specifies what loops are fused, it does not fully determine the permutations of the loops and the ordering of the loop nests. As we will see in Section 3.4, under dynamic memory allocation, reordering loop nests can alter memory usage without changing array sizes.

3.2 Preliminaries

So far, we have described the fusion between a node and its parent by the set of fused loops (or their loop indices, such as {j,k}). But in order to compare loop fusion configurations for a subtree, it is desirable to include information about the relative scopes of the fused loops in the subtree.

Scope and fusion scope of a loop. The scope of a loop of index i in a subtree rooted at v, denoted scope_i(v), is defined in the usual sense as the set of nodes in the subtree that the fused loop spans. That is, if the i-loop is fused, scope_i(v) = scope(c) ∩ subtree(v), where c is the fusion chain for the i-loop with v ∈ scope(c). If the i-loop of v is not fused, then scope_i(v) = ∅. We also define the fusion scope of an i-loop in a subtree rooted at v as fscope_i(v) = scope_i(v) if the i-loop is fused between v and its parent, and fscope_i(v) = ∅ otherwise. As an example, for the fusion graph in Figure 2(e), scope_k(f3) = {B, C, f2, f3}, but fscope_k(f3) = ∅ since the k-loop is not fused between f3 and f4.

Indexset sequence. To describe the relative scopes of a set of fused loops, we introduce the notion of an indexset sequence, which is defined as an ordered list of disjoint, non-empty sets of loop indices. For example, ⟨{j},{k,l}⟩ is an indexset sequence. For simplicity, we write each indexset in an indexset sequence as a string; thus, ⟨{j},{k,l}⟩ is written as ⟨j,kl⟩. Let f and f' be indexset sequences. We denote by |f| the number of indexsets in f, by f_r the r-th indexset in f, and by Set(f) the union of all indexsets in f, i.e., Set(f) = ∪_{1≤r≤|f|} f_r. For instance, |⟨j,kl⟩| = 2, ⟨j,kl⟩_1 = {j}, and Set(⟨j,kl⟩) = {j,k,l}.

We say that f' is a prefix of f if |f'| ≤ |f|, f'_{|f'|} ⊆ f_{|f'|}, and f'_r = f_r for all 1 ≤ r < |f'|. We write this relation as prefix(f', f). So ⟨⟩, ⟨j⟩, ⟨j,k⟩, ⟨j,l⟩, and ⟨j,kl⟩ are prefixes of ⟨j,kl⟩, but ⟨k⟩ is not. The concatenation of f and an indexset x, denoted f ⊕ x, is defined as the indexset sequence f'' such that if x ≠ ∅, then |f''| = |f| + 1, f''_{|f''|} = x, and f''_r = f_r for all 1 ≤ r < |f''|; otherwise, f'' = f.

Fusion. We use the notion of an indexset sequence to define a fusion. Intuitively, the loops fused between a node and its parent are ranked by their fusion scopes in the subtree, from largest to smallest; two loops with the same fusion scope have the same rank (i.e., are in the same indexset). For example, in Figure 2(f), the fusion between B and f2 is ⟨jkl⟩, and the fusion between f4 and f5 is ⟨j,k⟩ (because the fused j-loop covers two more nodes, A and f1). Formally, a fusion between a node v and v.parent is an indexset sequence f such that

1. Set(f) ⊆ v.indices ∩ v.parent.indices,
2. for all i ∈ Set(f), the i-loop is fused between v and v.parent, and
3. for all i ∈ f_r and j ∈ f_{r'}: (a) r = r' iff fscope_i(v) = fscope_j(v), and (b) r < r' iff fscope_i(v) ⊃ fscope_j(v).

Nesting. Similarly, a nesting of the loops at a node v can be defined as an indexset sequence. Intuitively, the loops at a node are ranked by their scopes in the subtree; two loops have the same rank (i.e., are in the same indexset) if they have the same scope. For example, in Figure 2(e), the loop nesting at f2 is ⟨kl,j⟩, at f4 it is ⟨jk⟩, and at f3 it is ⟨kl,j⟩. Formally, a nesting of the loops at a node v is an indexset sequence g such that

1. Set(g) = v.indices, and
2. for all i ∈ g_r and j ∈ g_{r'}: (a) r = r' iff scope_i(v) = scope_j(v), and (b) r < r' iff scope_i(v) ⊃ scope_j(v).

By definition, the loop nesting at a leaf node v must be ⟨v.indices⟩, because all loops at v have empty scope.

Legal fusion. A legal fusion graph (corresponding to a loop fusion configuration) for an expression tree T can be built up in a bottom-up manner by extending and merging legal fusion graphs for the subtrees of T. For a given node v, the nesting at v summarizes the fusion graph for the subtree rooted at v and determines what fusions are allowed between v and its parent. A fusion f is legal for a nesting g at v if prefix(f, g) and Set(f) ⊆ v.parent.indices. This is because, to keep the fusion graph legal, loops with larger scopes must be fused before those with smaller scopes, and only loops common to both v and its parent may be fused. For example, consider the fusion graph for the subtree rooted at f2 in Figure 2(e). Since the nesting at f2 is ⟨kl,j⟩ and f3.indices = {j,k,l}, the legal fusions between f2 and f3 are ⟨⟩, ⟨k⟩, ⟨l⟩, ⟨kl⟩, and ⟨kl,j⟩. Notice that all legal fusions for a node v are prefixes of a maximal legal fusion, which can be expressed as MaxFusion(g, v) = max{f : prefix(f, g) and Set(f) ⊆ v.parent.indices}, where g is the nesting at v. In Figure 2(e), the maximal legal fusion for f2 is ⟨kl,j⟩, and for B it is ⟨jkl⟩.

Resulting nesting. Let u be the parent of a node v. If v is the only child of u, then the loop nesting at u resulting from a fusion f between u and v is obtained by the function ExtNesting(f, u) = f ⊕ (u.indices − Set(f)). For example, in Figure 2(e), if the fusion between f2 and f3 is ⟨kl⟩, then the nesting at f3 would be ⟨kl,j⟩.

Compatible nestings. Suppose v has a sibling v', f is the fusion between u and v, and f' is the fusion between u and v'. For the fusion graph for the subtree rooted at u (which is merged from those of v and v') to be legal, g = ExtNesting(f, u) and g' = ExtNesting(f', u) must be compatible,

according to the following condition: for all indices i and j occurring in both g and g', if i ranks strictly before j in one nesting, then i must not rank after j in the other. This requirement ensures that an i-loop having a larger scope than a j-loop in one subtree does not have a smaller scope than the j-loop in the other subtree. If g and g' are compatible, the resulting loop nesting at u (as merged from g and g') is the indexset sequence g'' in which the loops at u are re-ranked by their combined scopes in the two subtrees: two indices have the same rank in g'' iff they have the same rank in both g and g', and index i ranks before index j in g'' iff i ranks no later than j in both g and g' and strictly earlier in at least one. As an example, in Figure 2(e), if the fusion between f1 and f4 is ⟨j⟩ and the fusion between f3 and f4 is ⟨k⟩, then g = ExtNesting(⟨j⟩, f4) = ⟨j,k⟩ and g' = ExtNesting(⟨k⟩, f4) = ⟨k,j⟩ would be incompatible. But if the fusion between f1 and f4 is changed to ⟨⟩, then g = ExtNesting(⟨⟩, f4) = ⟨jk⟩ would be compatible with g', and the resulting nesting at f4 would be ⟨k,j⟩. A procedure for checking whether g and g' are compatible and for forming g'' from them is provided in Section 3.3.

The more-constraining relation on nestings. A nesting g at a node v is said to be more or equally constraining than another nesting g' at the same node, denoted g ≼_v g', if for every legal fusion graph G for T in which the nesting at v is g, there exists a legal fusion graph G' for T in which the nesting at v is g' such that the subgraphs of G and G' induced by T − subtree(v) are identical. In other words, g ≼_v g' means that any loop fusion configuration for the rest of the expression tree that works with g also works with g'. This relation allows us to prune effectively among the large number of loop fusion configurations for a subtree in Section 3.3. It can be proved that the necessary and sufficient condition for g ≼_v g' is that, with m = MaxFusion(g, v) and m' = MaxFusion(g', v), every index of m appears in m' and, for all i ∈ m_r and j ∈ m_s, the ranks r' and s' with i ∈ m'_{r'} and j ∈ m'_{s'} satisfy r' = s' whenever r = s, and r' ≤ s' whenever r < s. Comparing the nesting at f2 between Figure 2(e) and (f), the nesting ⟨kl,j⟩ in (e) is more constraining than the nesting ⟨jkl⟩ in (f). A procedure for determining whether g ≼_v g' holds is given in Section 3.3.
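The indexset-sequence operations above are small set manipulations. The following Python sketch (our rendering, with a sequence encoded as a list of frozensets) shows prefix, concatenation, ExtNesting, and MaxFusion as just defined:

def is_prefix(fp, f):
    """prefix(fp, f): equal indexsets up to the last one of fp, which
    may be a subset of the indexset at the same position in f."""
    if len(fp) > len(f):
        return False
    return all(a == b for a, b in zip(fp[:-1], f)) and \
           (not fp or fp[-1] <= f[len(fp) - 1])

def concat(f, x):
    """f ⊕ x: append indexset x (if non-empty) as a new last indexset."""
    return f + [frozenset(x)] if x else f

def ext_nesting(fusion, u_indices):
    """ExtNesting(f, u): nesting at the parent u after fusion f."""
    used = frozenset().union(*fusion) if fusion else frozenset()
    return concat(list(fusion), frozenset(u_indices) - used)

def max_fusion(nesting, parent_indices):
    """MaxFusion(g, v): the largest prefix of g whose indices all occur
    at v's parent; all legal fusions are prefixes of it."""
    out = []
    for s in nesting:
        kept = s & frozenset(parent_indices)
        if kept:
            out.append(kept)
        if kept != s:
            break               # deeper loops can no longer be fused
    return out

For example, ext_nesting([frozenset('kl')], 'jkl') yields [frozenset('kl'), frozenset('j')], matching the ⟨kl⟩ to ⟨kl,j⟩ example above.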

3.3 Static Memory Allocation

Under the static memory allocation model, all the arrays in a program exist during the entire computation, so the memory usage of a loop fusion configuration is simply the sum of the sizes of all the arrays (including those reduced to scalars). Figures 3 and 4 show a bottom-up, dynamic programming algorithm for finding a memory-optimal loop fusion configuration for a given expression tree T. For each node v in T, it computes a set of solutions v.solns for the subtree rooted at v. Each solution s in v.solns represents a loop fusion configuration for the subtree rooted at v and contains the following information: the loop nesting s.nesting at v, the fusion s.fusion between v and its parent, the memory usage s.mem so far, and the pointers s.src1 and s.src2 to the corresponding solutions for the children of v. The set of solutions v.solns is obtained by the following steps. First, if v is a leaf node, initialize the solution set to contain a single solution using InitSolns. Otherwise, take the solution set from a child v.child1 of v and, if v has two children, merge it (using MergeSolns) with the compatible solutions from the other child v.child2. Then, prune the solution set to remove the inferior solutions using PruneSolns. A solution s is inferior to another unpruned solution s' if s.nesting is more or equally constraining than s'.nesting and s does not use less memory than s'. Next, extend the solution set by considering all possible legal fusions between v and its parent (see ExtendSolns). The size of array v is added to the memory usage by AddMemUsage. Inferior solutions are again removed.

Although the complexity of the algorithm is exponential in the number of index variables, and the number of solutions could in theory grow exponentially with the size of the expression tree, the number of index variables in practical applications is usually small, and there is indication that the pruning is effective in keeping the solution set at each node small.

The algorithm assumes that the leaf nodes may be freely fused with their parents and that the root node array must be available in its entirety at the end of the computation. If these assumptions do not hold, the InitFusible procedure can be easily modified to restrict or expand the allowable fusions for those nodes.

MinMemFusion(T):
  InitFusible(T)
  foreach node v in some bottom-up traversal of T
    if v.nchildren == 0 then
      S1 = InitSolns(v)
    else
      S1 = v.child1.solns
      if v.nchildren == 2 then
        S1 = MergeSolns(S1, v.child2.solns)
      S1 = PruneSolns(S1, v)
    v.solns = ExtendSolns(S1, v)
  T.root.optsoln = the single element in T.root.solns
  foreach node v in some top-down traversal of T
    s = v.optsoln
    v.optfusion = s.fusion
    s1 = s.src1
    if v.nchildren == 1 then
      v.child1.optsoln = s1
    else if v.nchildren == 2 then
      v.child1.optsoln = s1.src1
      v.child2.optsoln = s1.src2

InitFusible(T):
  foreach v ∈ T
    if v == T.root then v.fusible = ∅
    else v.fusible = v.indices ∩ v.parent.indices

InitSolns(v):
  s = new solution; s.nesting = ⟨v.fusible⟩
  InitMemUsage(s)
  return {s}

MergeSolns(S1, S2):
  S = ∅
  foreach s1 ∈ S1
    foreach s2 ∈ S2
      g = MergeNesting(s1.nesting, s2.nesting)
      if g ≠ ⊥ then              // if s1 and s2 are compatible
        s = new solution; s.nesting = g; s.src1 = s1; s.src2 = s2
        MergeMemUsage(s, s1, s2)
        AddSoln(s, S)
  return S

PruneSolns(S1, v):
  S = ∅
  foreach s1 ∈ S1
    s = a copy of s1; s.nesting = MaxFusion(s1.nesting, v)
    AddSoln(s, S)
  return S

ExtendSolns(S1, v):
  S = ∅
  foreach s1 ∈ S1
    foreach prefix f of s1.nesting
      s = new solution; s.fusion = f; s.nesting = ExtNesting(f, v.parent); s.src1 = s1
      AddMemUsage(s, v, FusedSize(f, v), s1)
      AddSoln(s, S)
  return S

Figure 3: Algorithm for finding a memory-optimal loop fusion configuration under static memory allocation.

AddSoln(s, S):
  foreach s' ∈ S
    if Inferior(s, s') then return
    else if Inferior(s', s) then S = S − {s'}
  S = S ∪ {s}

MergeNesting(g, g'):
  g'' = ⟨⟩; r = r' = 1; x = x' = ∅
  while r ≤ |g| or r' ≤ |g'|
    if x = ∅ then x = g_r; r = r + 1
    if x' = ∅ then x' = g'_{r'}; r' = r' + 1
    y = x ∩ x'
    if y = ∅ then return ⊥      // g and g' are incompatible
    g'' = g'' ⊕ y
    x = x − y; x' = x' − y
  end while
  return g''                    // g and g' are compatible

g ≼_v g':                       // test if g is more/equally constraining than g'
  // (applied to m = MaxFusion(g, v) and m' = MaxFusion(g', v))
  r' = 1; x' = ∅
  for r = 1 to |m|
    if x' = ∅ then
      if r' > |m'| then return false
      x' = m'_{r'}; r' = r' + 1
    if m_r ⊄ x' then return false
    x' = x' − m_r
  return true

InitMemUsage(s): s.mem = 0

AddMemUsage(s, v, size, s1): s.mem = s1.mem + size

MergeMemUsage(s, s1, s2): s.mem = s1.mem + s2.mem

Inferior(s, s') = (s.nesting ≼_v s'.nesting) and (s.mem ≥ s'.mem)

FusedSize(f, v) = Π_{i ∈ (v.indices − {v.sumindex}) − Set(f)} N_i

ExtNesting(f, u) = f ⊕ (u.indices − Set(f))

MaxFusion(g, v) = max{f : prefix(f, g) and Set(f) ⊆ v.parent.indices}

Set(f) = ∪_{1≤r≤|f|} f_r

Figure 4: Supporting procedures for the algorithm in Figure 3.
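As an illustration of the pruning machinery (our sketch, not the paper's code), AddSoln and Inferior translate almost directly into Python if solutions are dictionaries and the more-or-equally-constraining test is supplied as a function:

def inferior(s, t, constrains):
    """s is inferior to t: s's nesting is at least as constraining
    and s uses at least as much memory."""
    return constrains(s["nesting"], t["nesting"]) and s["mem"] >= t["mem"]

def add_soln(s, solns, constrains):
    """Insert s into the solution set unless it is inferior to a kept
    solution; drop kept solutions that s renders inferior."""
    for t in solns:
        if inferior(s, t, constrains):
            return solns
    return [t for t in solns if not inferior(t, s, constrains)] + [s]

Because the constraining relation is only a partial order, mutually incomparable solutions survive side by side, which is why the solution sets at inner nodes can hold more than one candidate.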

Figure 5: Solution sets for the subtrees in the example. [The table is not reproducible here; for each node it lists the candidate solutions with their src line numbers, nesting, fusion, resulting nesting for the parent (ext-nest), and memory usage, and marks the solutions on the optimal path in the opt column.]

To illustrate how the algorithm works, consider again the empty fusion graph in Figure 2(d) for the expression tree in Figure 1(c). Let N_i = 500, N_j = 100, N_k = 40, and N_l = 15. There are 4 different fusions between A and f1. Among them, only the full fusion ⟨ij⟩ is in A.solns, because all other fusions result in more constraining nestings and use more memory than the full fusion, and so are pruned. However, this does not happen to the fusions between C and f2, since the resulting nesting ⟨kl,j⟩ of the full fusion ⟨kl⟩ is not less constraining than those of the other 3 possible fusions. Then, solutions from B and C are merged together to form solutions for f2. For example, when the two full-fusion solutions from B and C are merged, the merged nesting for f2 is ⟨kl,j⟩, which can then be extended by full fusion (between f2 and f3) to form a full-fusion solution for f2 that has a memory usage of only 3 scalars. Again, since this solution is not the least constraining one, other solutions cannot be pruned. Although this solution is optimal for the subtree rooted at f2, it turns out to be non-optimal for the entire expression tree. Figure 5 shows the solution sets for all of the nodes.

The src column contains the line numbers of the corresponding solutions for the children. The ext-nest column shows the resulting nesting for the parent. A mark indicates that the solution forms part of an optimal solution for the entire expression tree. The fusion graph for the optimal solution is shown in Figure 6(a).

Figure 6: An optimal solution for the example. [(a) is the fusion graph, not reproducible here; (b) is the fused loop structure:]

initialize f1
for i
  for j
    A = genA(i,j)
    f1[j] += A
initialize f5
for k
  for l
    C[l] = genC(k,l)
  for j
    initialize f3
    for l
      B = genB(j,k,l)
      f2 = B * C[l]
      f3 += f2
    f4 = f1[j] * f3
    f5[k] += f4

Once an optimal solution is obtained, we can generate the corresponding fused loop structure from it. The following procedure determines an evaluation order of the nodes:

1. Initialize set P to contain the single node T.root and list L to an empty list.
2. While P is not empty, remove from P a node v whose v.optfusion is maximal among all nodes in P, insert v at the beginning of L, and add the children of v (if any) to P.
3. When P is empty, L contains the evaluation order.

Putting the loops around the array evaluation statements is then straightforward. The initialization of an array can be placed inside the innermost loop that contains both the evaluation and the use of the array. The optimal loop fusion configuration for the example expression tree is shown in Figure 6(b).
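The ordering procedure is a small worklist loop; here is a Python sketch of it (ours; the "maximal optfusion" comparison is abstracted into a key function, e.g., how deeply the node is fused with its parent):

def evaluation_order(root, children, fusion_rank):
    """Return nodes in evaluation order: repeatedly take from P a node
    whose optfusion is maximal (per fusion_rank), prepend it to L, and
    release its children into P."""
    P, L = [root], []
    while P:
        v = max(P, key=fusion_rank)
        P.remove(v)
        L.insert(0, v)
        P.extend(children(v))
    return L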

3.4 Dynamic Memory Allocation

Under dynamic memory allocation, space for arrays can be allocated and deallocated as needed. We allow an array to be allocated and deallocated multiple times by putting the allocate/deallocate statements inside some loops, but each array must be allocated and deallocated in its entirety. We do not consider sharing of space between arrays. Given a loop fusion configuration, the positions of the allocate/deallocate statements are determined as follows. Let v be an array, e be the statement that evaluates v, u be the statement that uses v, and the t-loop be the innermost loop that contains both e and u. The latest point at which an allocate(v) statement can be placed is inside the t-loop, before e and before any loops that surround e and are inside the t-loop. The earliest point at which a deallocate(v) statement can be placed is inside the t-loop, after u and after any loops that surround u and are inside the t-loop. The memory usage of a loop fusion configuration is no longer the sum of the array sizes, but the maximum total size of the arrays that are allocated at any point in time.

Another difference between dynamic allocation and static allocation is that, with dynamic allocation, the evaluation order of nodes having equal fusion with their parents can affect memory usage, even though the sizes of the individual arrays remain unchanged. Consider the fusion graph shown in Figure 7(a). Since f1 and f3 are equally fused with f4, either of the two subtrees rooted at f1 and f3 can be evaluated first. Figures 7(b) and (c) show the loop fusion configurations for the two evaluation orders. Assume the same index ranges N_i = 500, N_j = 100, N_k = 40, and N_l = 15 as before. In (b), the maximum memory usage is 4141 elements, reached when f1, f3, f4, and f5 are allocated during the evaluation of f4 and f5. But in (c), when f1 is being evaluated, A, f1, and f3 coexist in memory, and their total size is 4600 elements. Therefore, in addition to finding the fusions between children and parents, we need to determine the evaluation order of the nodes that minimizes memory usage. The simpler problem of finding the memory-optimal evaluation order of the nodes in an expression tree without any fusion has been addressed in [9]. Here, we apply its result to each set of nodes having equal fusion to obtain the optimal evaluation order among them.

We use the same bottom-up, dynamic programming algorithm as in Figures 3 and 4 to traverse the expression tree and enumerate the legal fusions for the nodes, but the procedures related to the calculation of memory usage (namely, InitMemUsage, AddMemUsage, MergeMemUsage, and Inferior) are replaced by those in Figures 8 and 9. To correctly calculate memory usage and to determine the optimal evaluation order of nodes with equal fusion, each solution s for a node v now has, instead of a mem field, a seqset field, which is a set of fusion indexsets Set(f) for the fusions f in the subtree rooted at v. Each fusion indexset x in s.seqset is associated with an x.seq, which is a sequence of indivisible units. Each indivisible unit is a (nodelist, hi, lo) triple, where nodelist is an ordered list of nodes, hi is the maximum memory usage during the evaluation of the nodes in nodelist, and lo is the memory usage after those nodes are evaluated. The nodes in a nodelist have the special property that there necessarily exists some globally optimal traversal of the entire tree in which this sequence of nodes appears undivided. Therefore, inserting any node between the nodes of an indivisible unit cannot lower the total memory usage.

The seqsets and seqs are manipulated as follows. When two solutions from two children of the same parent are merged, their seqsets are merged by the MergeMemUsage procedure. A seq for a fusion indexset that appears in only one seqset is simply copied to the resulting seqset. If a fusion indexset x appears in both seqsets, the indivisible units in the two seqs for x are merge-sorted together in order of decreasing hi − lo difference (by the MergeSeq procedure) to form a combined seq, and their hi and lo values are adjusted. In [9], we proved that arranging the indivisible units in this order minimizes memory usage. Moreover, the indivisible units in a seq must appear in the same order in some optimal traversal of the entire expression tree.

When a solution is extended to include a new node v having fusion f with its parent, the seqs in v.seqset for fusion indexsets larger than Set(f) are collapsed into the seq for Set(f) (by the CollapseSeq procedure).

In collapsing a seq Q into another seq Q', all the indivisible units in Q are appended to Q' after their hi and lo values are adjusted. When all collapsings are done, all fusion indexsets in v.seqset are subsets of Set(f). Then, a new indivisible unit for v is appended to the seq for Set(f). Whenever an indivisible unit U is appended to a seq Q by the AppendSeq procedure, it is combined with the indivisible units at the end of Q whose hi is not higher than U.hi or whose lo is not lower than U.lo. The combined indivisible unit has the concatenated nodelist and the highest hi, but the lo of U. An indivisible unit initially contains a single node, but as the algorithm moves up the tree, indivisible units may be combined to form units that contain more nodes.

As the procedures for enumerating legal fusions are the same under static and dynamic memory allocation, we use a particular fusion graph, the one in Figure 7(a), to illustrate how memory usage is maintained by the algorithm under dynamic allocation. For node A, the fusion is ⟨j⟩ and the fused array size is N_i = 500. So, A.seqset has a single fusion indexset {j}, whose associated seq is ⟨(A, 500, 500)⟩. The seqsets for B and C are similarly obtained. Since f1 is not fused with f4, the AddMemUsage procedure collapses the only seq in the seqset from fusion indexset {j} to ∅. When a new indivisible unit (f1, 600, 100) for f1 is appended to the seq for ∅, it is combined with the existing indivisible unit into ([A, f1], 600, 100); now A and f1 have become indivisible. For node f2, the seqsets for B and C (which share no common fusion indexset) are first merged together by MergeMemUsage, and then a new seq for the fusion indexset of f2's fusion with f3 is created; f2.seqset thus has 3 seqs. To form the seqset for f3, which is not fused with f4, the AddMemUsage procedure first collapses the 3 seqs in f2.seqset into one, and then appends to it a new indivisible unit for f3, whose lo value is 4000, the full size of f3[j,k], forming the single seq in f3.seqset. The seqsets for all the nodes are shown in Figure 10. The single seq in the root node's seqset contains the optimal evaluation order of the nodes, and the hi value of the first indivisible unit in that seq is the maximum memory usage (see Figure 7(b)).
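The unit-combining rule of AppendSeq is compact enough to state in code. Below is our Python sketch of it (the hi/lo re-baselining done by MergeSeq and CollapseSeq is omitted):

def append_seq(seq, unit):
    """Append an indivisible (nodelist, hi, lo) unit to seq, combining
    it with trailing units whose hi is not higher or whose lo is not
    lower; the result keeps the concatenated nodelist, the highest hi,
    and the lo of the appended unit."""
    nodes, hi, lo = unit
    while seq and (seq[-1][1] <= hi or seq[-1][2] >= lo):
        pnodes, phi, plo = seq.pop()
        nodes, hi = pnodes + nodes, max(phi, hi)
    seq.append((nodes, hi, lo))
    return seq

# The step discussed above: A alone, then f1 arrives.
seq = [(["A"], 500, 500)]
append_seq(seq, (["f1"], 600, 100))
assert seq == [(["A", "f1"], 600, 100)]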

3.5 Further Reduction in Memory Usage

So far, we have restricted our attention to reducing memory usage without any increase in the number of arithmetic operations. Under this restriction, however, the optimal loop fusion configuration for some formula sequences may still require more memory than is available. It may then be necessary to relax the restriction on the operation count for the implementation of such formula sequences to become feasible in terms of memory usage.

Creating additional loops around some assignment statements may enable more loop fusions and thus reduce array sizes. To do this, we add additional vertices to the fusion graph and connect each new vertex of a node v to the corresponding vertex at the parent of v with an additional potential fusion edge. If an additional potential fusion edge for an i-loop is converted into a fusion edge, the operation count for node v is multiplied by N_i. For example, in Figure 2(b), if we create an l-loop around f4 and f5 and fuse it with the l-loop around C, B, f2, and f3, we can eliminate both dimensions of f3 and make it a scalar. However, this increases the operation counts for f4 and f5 by a factor of N_l, because they are re-computed N_l times. Note that if we place no limit on the operation count, the memory usage minimization problem has a trivial solution: put all the assignment statements inside one perfect loop nest. Doing so reduces all intermediate arrays to scalars, but the operation count returns to its original unoptimized value.

The existence of sparse arrays and common subexpressions and the use of fast Fourier transforms are characteristic of many practical scientific computations. Their presence may prohibit the fusion of some loops, thereby imposing high lower bounds on the sizes of some intermediate arrays. With common subexpressions, a directed acyclic graph (DAG) instead of an expression tree is needed to represent the formula sequence. Readers are referred to [12, 13] for how the optimization algorithms can be extended to handle these three characteristics, and for the transformations to the fusion graph that allow further memory reduction at a cost of increased arithmetic operations.
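The operation-count penalty of this transformation multiplies along the extra fused loops. A short Python sketch of the bookkeeping (ours, with illustrative inputs):

from math import prod

def penalized_ops(base_ops, extra_fused, N):
    """base_ops: node -> operation count of the original configuration;
    extra_fused: node -> indices of additional loops created around the
    node and fused; N: index -> loop range. Each extra fused i-loop
    re-executes the node's statement N[i] times."""
    return {v: ops * prod(N[i] for i in extra_fused.get(v, ()))
            for v, ops in base_ops.items()}

# Figure 2(b) example: an l-loop created around f4 and f5 multiplies
# their operation counts by N_l.
N = dict(i=500, j=100, k=40, l=15)
ops = penalized_ops({"f4": 100 * 40, "f5": 100 * 40},
                    {"f4": ["l"], "f5": ["l"]}, N)
assert ops == {"f4": 4000 * 15, "f5": 4000 * 15}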

The resulting formula sequence and fusion graph are then fed into one of the memory usage minimization algorithms described in Sections 3.3 and 3.4 to determine the optimal loop fusion configuration. The memory usage minimization algorithms can be modified to maintain the count of additional arithmetic operations for each solution: a solution s is inferior to another solution s' if, in addition to the existing criteria, the additional operations for s are greater than or equal to those for s'. Any solution with a cumulative memory usage higher than a user-specified limit can be pruned. At the end, the algorithms return an unpruned solution for the root node that has the fewest additional arithmetic operations.

3.6 An Example

We illustrate the practical application of the memory usage minimization algorithm on the following example formula sequence, which can be used to determine the self-energy in the electronic structure of solids. It is optimal in operation count, with a cost on the order of 10^14 operations. The ranges of the indices are N_k = 10, N_t = 100, N_RL = N_RL1 = N_RL2 = N_G = N_G1 = 1000, and N_r = N_r1 = 100000.

f1[r,RL,RL1,t] = Y[r,RL] * G[RL1,RL,t]
f2[r,RL1,t]    = sum_RL f1[r,RL,RL1,t]
f5[r,RL2,r1,t] = Y[r,RL2] * f2[r1,RL2,t]
f6[r,r1,t]     = sum_RL2 f5[r,RL2,r1,t]
f7[k,r,r1]     = exp[k,r] * exp[k,r1]
f10[r,r1,t]    = f6[r,r1,t] * f6[r1,r,t]
f11[k,r,r1,t]  = f7[k,r,r1] * f10[r,r1,t]
f13[k,r1,t,G]  = fft_r f11[k,r,r1,t] * exp[G,r]
f15[k,t,G,G1]  = fft_r1 f13[k,r1,t,G] * exp[G1,r1]
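To get a feel for the magnitudes involved, the following Python fragment (our back-of-the-envelope estimate, not the paper's accounting) computes dense per-array element counts from the stated index ranges. Y's 10% sparsity, which carries over to f1 and f5, is ignored here, which is why the total of about 1.1 × 10^14 elements cited below is well under the dense figure this prints:

from math import prod

N = dict(k=10, t=100, RL=1000, RL1=1000, RL2=1000, G=1000, G1=1000,
         r=100_000, r1=100_000)
arrays = {
    "f1":  ("r", "RL", "RL1", "t"),
    "f2":  ("r", "RL1", "t"),
    "f5":  ("r", "RL2", "r1", "t"),
    "f6":  ("r", "r1", "t"),
    "f7":  ("k", "r", "r1"),
    "f10": ("r", "r1", "t"),
    "f11": ("k", "r", "r1", "t"),
    "f13": ("k", "r1", "t", "G"),
    "f15": ("k", "t", "G", "G1"),
}
sizes = {name: prod(N[d] for d in dims) for name, dims in arrays.items()}
for name, size in sizes.items():
    print(f"{name}: {size:.1e} elements")     # f5 dominates at 1e15 dense
print(f"total (dense): {sum(sizes.values()):.1e}")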

In this example, the array Y is sparse, having only 10% non-zero elements. Notice that the common subexpressions Y, exp, and f6 appear on the right-hand side of more than one formula. Also, f13 and f15 are fast Fourier transform formulae. Discussions of how to handle sparse arrays, common subexpressions, and fast Fourier transforms can be found in [12, 13].

Without any loop fusion, the total size of the arrays is about 1.1 × 10^14 elements. If each array is to be computed only once, the presence of the common subexpressions and FFTs would prevent the fusion of some loops, such as the r and r1 loops between f6 and f10. Under the operation-count restriction, the optimal loop fusion configuration obtained by the memory usage minimization algorithm for static memory allocation requires memory storage for about 1.1 × 10^11 array elements, which is 1000 times better than without any loop fusion. But this still translates to about 1,000 gigabytes and probably exceeds the amount of memory available on any computer today. Thus, relaxing the operation-count restriction is necessary to further reduce the memory usage to reasonable values.

We perform the following simple transformations on the DAG and the corresponding fusion graph. Two additional vertices are added: one for a k-loop around f10 and the other for a t-loop around f7. These additional vertices are then connected to the corresponding vertices of f11 with additional potential fusion edges, to allow more loop fusion opportunities between f11 and its two children. The common subexpressions Y, exp, and f6 are split into multiple nodes. Two copies of the sub-DAG rooted at f6 are also made. These transformations overcome some constraints on legal fusion graphs for DAGs.

The memory usage minimization algorithm for static memory allocation is then applied to the transformed fusion graph. The loop fusion configuration, the fusion graph, and the memory usage and operation count statistics of the optimal solution found are shown in Figure 11. For clarity, the input arrays are not included in the fusion graph.

The memory usage of the optimal solution after relaxing the operation-count restriction is reduced significantly, by a factor of about 100, to roughly 1.1 × 10^9 array elements. The operation count increases by only around 10%. Compared with the best hand-optimized loop fusion configuration, which also uses some manually-applied transformations to reduce memory usage to the same level, the optimal loop fusion configuration obtained by the algorithm shows a factor of 2.5 improvement in operation count while using the same amount of memory.

4 Communication Minimization on Parallel Computers with Memory Constraint

Given a sequence of formulae, we now address the problem of finding the optimal partitioning of arrays and operations among the processors, and the loop fusions on each processor, so as to minimize inter-processor communication and computational costs while staying within the available memory when implementing the computation on a message-passing parallel computer. Section 4.1 briefly describes a multi-dimensional processor view and discusses the effects of loop fusions and array/operation partitioning on communication cost, computational cost, and memory usage. Two approaches for solving this problem are presented in Section 4.2.

4.1 Preliminaries

A logical view of the processors as a multi-dimensional grid is used, where each array can be distributed or replicated along one or more of the processor dimensions. Let p_d be the number of processors on the d-th dimension of an n-dimensional processor array, so that the number of processors is p_1 × ... × p_n. We use an n-tuple to denote the partitioning or distribution of the elements of a data array on an n-dimensional processor array.

The d-th position in an n-tuple α, denoted α_d, corresponds to the d-th processor dimension. Each position may be one of the following: an index variable distributed along that processor dimension, a '*' denoting replication of data along that processor dimension, or a '1' denoting that only the first processor along that processor dimension is assigned any data. If an index variable appears as an array subscript but not in the n-tuple, then the corresponding dimension of the array is not distributed. Conversely, if an index variable appears in the n-tuple but not in the array, then the data are replicated along the corresponding processor dimension, which is the same as replacing that index variable with a '*'.

As an example, suppose 128 processors form a four-dimensional grid. For the array B[j,k,l] in Figure 1, the 4-tuple ⟨j,k,*,1⟩ specifies that the first and the second dimensions of B are distributed along the first and second processor dimensions, respectively (the third dimension of B is not distributed), and that data are replicated along the third processor dimension and are assigned only to processors whose fourth coordinate equals 1. Thus, a processor with id P(z1,z2,z3,z4) is assigned the portion of B given by myrange(z1, N_j, p_1) × myrange(z2, N_k, p_2) × [1, N_l] if z4 = 1, and no part of B otherwise, where myrange(z, N, p) is the range (z−1) × N/p + 1 to z × N/p.

We assume the SPMD programming model and do not consider distributing the computation of different formulae on different processors. A child array (or a part of it) is redistributed before the evaluation of its parent if their distributions do not match. For instance, suppose the arrays B and C have distributions ⟨k,l,*,1⟩ and ⟨*,k,*,1⟩, respectively. If we want f2 to have distribution ⟨j,k,*,1⟩ when evaluating f2[j,k,l] = B[j,k,l] × C[k,l], B would have to be redistributed from ⟨k,l,*,1⟩ to ⟨j,k,*,1⟩ because the two distributions do not match. But since j does not appear in C, the distribution ⟨j,k,*,1⟩ is, for C, the same as ⟨*,k,*,1⟩, and C is not redistributed.
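A Python sketch of this addressing scheme (ours; 1-based coordinates as in the paper, and a hypothetical 2x4x4x4 grid in the usage line):

def myrange(z, n, p):
    """Block of an index of range n held by the z-th of p processors."""
    return ((z - 1) * n // p + 1, z * n // p)

def local_part(tuple_, grid, proc, dims, N):
    """tuple_: n-tuple entries ('*', 1, or an index name); grid: number
    of processors per grid dimension; proc: this processor's 1-based
    coordinates; dims: the array's subscript indices; N: index ranges.
    Returns index -> (first, last), or None if no data lands here."""
    part = {d: (1, N[d]) for d in dims}            # default: full range
    for g, entry in enumerate(tuple_):
        if entry == 1 and proc[g] != 1:
            return None                            # only the first processor holds data
        if entry in part:
            part[entry] = myrange(proc[g], N[entry], grid[g])
    return part

# B[j,k,l] with distribution <j,k,*,1> on a hypothetical 2x4x4x4 grid:
N = dict(j=100, k=40, l=15)
print(local_part(("j", "k", "*", 1), (2, 4, 4, 4), (1, 2, 3, 1),
                 ("j", "k", "l"), N))   # {'j': (1, 50), 'k': (11, 20), 'l': (1, 15)}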

The partitioning of data arrays among the processors and the allowable fusions of loops on each processor are inter-related. For the fusion of a t-loop between nodes u and v to be possible, that loop must either be undistributed at both u and v, or be distributed onto the same number of processors at u and at v (though not necessarily along the same processor dimension). Otherwise, the range of the t-loop at node u would differ from that at node v. For example, if the array B in Figure 1 has its j-dimension distributed onto 2 processors and has fusion ⟨jl⟩ with f2, then f2 can have a distribution in which j is distributed onto 2 processors, but not one in which l is distributed, because the fusion ⟨jl⟩ forces the j-dimension of f2 to be distributed onto 2 processors and the l-dimension to be undistributed.

Array partitioning and loop fusion also affect memory usage, communication cost, and computational cost. Fusing a t-loop between a node v and its parent eliminates the t-dimension of array v. If the t-loop is not fused but the t-dimension of array v is distributed along the d-th processor dimension, then the range of the t-dimension of array v on each processor is reduced to N_t / p_d. Let DistSize(f, v, α) be the size on each processor of array v, which has fusion f with its parent and distribution α. We have

DistSize(f, v, α) = Π_{x ∈ v.dimens} DistRange(v, α, Set(f), x)

where v.dimens = v.indices − {v.sumindex} is the set of array dimension indices of v before loop fusions, and

DistRange(v, α, F, x) = 1 if x ∈ F;
                        N_x / p_d if x ∉ F and α_d = x for some d;
                        N_x otherwise.

In our example, assume again that N_i = 500, N_j = 100, N_k = 40, and N_l = 15. If the array B has its k-dimension distributed onto 2 processors and fusion ⟨jl⟩ with f2, then the size of B on each processor would be N_k / 2 = 20, since the k-dimension is the only unfused dimension and is distributed onto 2 processors.
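In code form (our sketch), DistSize is a product over the array's unfused dimensions:

def dist_size(dims, fused, alpha, grid, N):
    """Per-processor size of an array: dims: its dimension indices;
    fused: indices fused with the parent (eliminated); alpha: the
    distribution n-tuple; grid: processors per grid dimension."""
    size = 1
    for x in dims:
        if x in fused:
            continue                               # DistRange = 1: dimension eliminated
        if x in alpha:
            size *= N[x] // grid[alpha.index(x)]   # distributed dimension
        else:
            size *= N[x]                           # undistributed dimension
    return size

# B[j,k,l], fusion <jl> with f2, k distributed onto 2 processors:
N = dict(j=100, k=40, l=15)
assert dist_size(("j", "k", "l"), {"j", "l"}, ("k", "*", 1, 1), (2, 4, 4, 4), N) == 20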

Note that if array v undergoes redistribution from α to β, the array size on each processor after redistribution is DistSize(f, v, β), which could be different from DistSize(f, v, α), the size before redistribution.

The initial and final distributions of an array v determine the communication pattern and whether v needs redistribution, while loop fusions change the number of times array v is redistributed and the size of each message. Let v be an array that needs to be redistributed. If node v is not fused with its parent, array v is redistributed only once. Fusing a t-loop between node v and its parent puts the collective communication code for the redistribution inside the t-loop. Thus, the number of redistributions is increased by a factor of N_t / p_d if the t-dimension of v is distributed along the d-th processor dimension, and by a factor of N_t if the t-dimension of v is not distributed. In other words, loop fusions cannot reduce communication cost. Continuing with our example, if the array B has fusion ⟨jl⟩ with f2 and needs to be redistributed, with its j-dimension distributed onto 2 processors, it would be redistributed (N_j / 2) × N_l = 750 times.

Let Mcost(localsize, α, β) be the communication cost of moving the elements of an array, with localsize elements on each processor, from an initial distribution α to a final distribution β. We empirically measure Mcost for each possible non-matching pair of α and β and for several different values of localsize on the target parallel computer. Let MoveCost(f, v, α, β) denote the communication cost of redistributing the elements of array v, which has fusion f with its parent, from an initial distribution α to a final distribution β. It can be expressed as

MoveCost(f, v, α, β) = MsgFactor(v, α, Set(f)) × Mcost(DistSize(f, v, α), α, β)

where MsgFactor(v, α, F) = Π_{x ∈ F ∩ v.dimens} LoopRange(v, α, x), and LoopRange(v, α, x) is the per-processor range of the fused x-loop, i.e., N_x / p_d if α_d = x for some d, and N_x otherwise.


More information

Kyle Gettig. Mentor Benjamin Iriarte Fourth Annual MIT PRIMES Conference May 17, 2014

Kyle Gettig. Mentor Benjamin Iriarte Fourth Annual MIT PRIMES Conference May 17, 2014 Linear Extensions of Directed Acyclic Graphs Kyle Gettig Mentor Benjamin Iriarte Fourth Annual MIT PRIMES Conference May 17, 2014 Motivation Individuals can be modeled as vertices of a graph, with edges

More information

SFU CMPT Lecture: Week 8

SFU CMPT Lecture: Week 8 SFU CMPT-307 2008-2 1 Lecture: Week 8 SFU CMPT-307 2008-2 Lecture: Week 8 Ján Maňuch E-mail: jmanuch@sfu.ca Lecture on June 24, 2008, 5.30pm-8.20pm SFU CMPT-307 2008-2 2 Lecture: Week 8 Universal hashing

More information

SFU CMPT Lecture: Week 9

SFU CMPT Lecture: Week 9 SFU CMPT-307 2008-2 1 Lecture: Week 9 SFU CMPT-307 2008-2 Lecture: Week 9 Ján Maňuch E-mail: jmanuch@sfu.ca Lecture on July 8, 2008, 5.30pm-8.20pm SFU CMPT-307 2008-2 2 Lecture: Week 9 Binary search trees

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

Ulysses: A Robust, Low-Diameter, Low-Latency Peer-to-Peer Network

Ulysses: A Robust, Low-Diameter, Low-Latency Peer-to-Peer Network 1 Ulysses: A Robust, Low-Diameter, Low-Latency Peer-to-Peer Network Abhishek Kumar Shashidhar Merugu Jun (Jim) Xu Xingxing Yu College of Computing, School of Mathematics, Georgia Institute of Technology,

More information

An algorithm for Performance Analysis of Single-Source Acyclic graphs

An algorithm for Performance Analysis of Single-Source Acyclic graphs An algorithm for Performance Analysis of Single-Source Acyclic graphs Gabriele Mencagli September 26, 2011 In this document we face with the problem of exploiting the performance analysis of acyclic graphs

More information

Paths, Flowers and Vertex Cover

Paths, Flowers and Vertex Cover Paths, Flowers and Vertex Cover Venkatesh Raman, M.S. Ramanujan, and Saket Saurabh Presenting: Hen Sender 1 Introduction 2 Abstract. It is well known that in a bipartite (and more generally in a Konig)

More information

Material You Need to Know

Material You Need to Know Review Quiz 2 Material You Need to Know Normalization Storage and Disk File Layout Indexing B-trees and B+ Trees Extensible Hashing Linear Hashing Decomposition Goals: Lossless Joins, Dependency preservation

More information

5. Lecture notes on matroid intersection

5. Lecture notes on matroid intersection Massachusetts Institute of Technology Handout 14 18.433: Combinatorial Optimization April 1st, 2009 Michel X. Goemans 5. Lecture notes on matroid intersection One nice feature about matroids is that a

More information

End-to-end bandwidth guarantees through fair local spectrum share in wireless ad-hoc networks

End-to-end bandwidth guarantees through fair local spectrum share in wireless ad-hoc networks End-to-end bandwidth guarantees through fair local spectrum share in wireless ad-hoc networks Saswati Sarkar and Leandros Tassiulas 1 Abstract Sharing the locally common spectrum among the links of the

More information

Computing optimal linear layouts of trees in linear time

Computing optimal linear layouts of trees in linear time Computing optimal linear layouts of trees in linear time Konstantin Skodinis University of Passau, 94030 Passau, Germany, e-mail: skodinis@fmi.uni-passau.de Abstract. We present a linear time algorithm

More information

3 No-Wait Job Shops with Variable Processing Times

3 No-Wait Job Shops with Variable Processing Times 3 No-Wait Job Shops with Variable Processing Times In this chapter we assume that, on top of the classical no-wait job shop setting, we are given a set of processing times for each operation. We may select

More information

Qualifying Exam in Programming Languages and Compilers

Qualifying Exam in Programming Languages and Compilers Qualifying Exam in Programming Languages and Compilers University of Wisconsin Fall 1991 Instructions This exam contains nine questions, divided into two parts. All students taking the exam should answer

More information

CS6702 GRAPH THEORY AND APPLICATIONS 2 MARKS QUESTIONS AND ANSWERS

CS6702 GRAPH THEORY AND APPLICATIONS 2 MARKS QUESTIONS AND ANSWERS CS6702 GRAPH THEORY AND APPLICATIONS 2 MARKS QUESTIONS AND ANSWERS 1 UNIT I INTRODUCTION CS6702 GRAPH THEORY AND APPLICATIONS 2 MARKS QUESTIONS AND ANSWERS 1. Define Graph. A graph G = (V, E) consists

More information

Interleaving Schemes on Circulant Graphs with Two Offsets

Interleaving Schemes on Circulant Graphs with Two Offsets Interleaving Schemes on Circulant raphs with Two Offsets Aleksandrs Slivkins Department of Computer Science Cornell University Ithaca, NY 14853 slivkins@cs.cornell.edu Jehoshua Bruck Department of Electrical

More information

CMSC 754 Computational Geometry 1

CMSC 754 Computational Geometry 1 CMSC 754 Computational Geometry 1 David M. Mount Department of Computer Science University of Maryland Fall 2005 1 Copyright, David M. Mount, 2005, Dept. of Computer Science, University of Maryland, College

More information

Algorithm classification

Algorithm classification Types of Algorithms Algorithm classification Algorithms that use a similar problem-solving approach can be grouped together We ll talk about a classification scheme for algorithms This classification scheme

More information

6. Lecture notes on matroid intersection

6. Lecture notes on matroid intersection Massachusetts Institute of Technology 18.453: Combinatorial Optimization Michel X. Goemans May 2, 2017 6. Lecture notes on matroid intersection One nice feature about matroids is that a simple greedy algorithm

More information

IEEE LANGUAGE REFERENCE MANUAL Std P1076a /D3

IEEE LANGUAGE REFERENCE MANUAL Std P1076a /D3 LANGUAGE REFERENCE MANUAL Std P1076a-1999 2000/D3 Clause 10 Scope and visibility The rules defining the scope of declarations and the rules defining which identifiers are visible at various points in the

More information

arxiv: v2 [cs.ds] 18 May 2015

arxiv: v2 [cs.ds] 18 May 2015 Optimal Shuffle Code with Permutation Instructions Sebastian Buchwald, Manuel Mohr, and Ignaz Rutter Karlsruhe Institute of Technology {sebastian.buchwald, manuel.mohr, rutter}@kit.edu arxiv:1504.07073v2

More information

Graphs (MTAT , 4 AP / 6 ECTS) Lectures: Fri 12-14, hall 405 Exercises: Mon 14-16, hall 315 või N 12-14, aud. 405

Graphs (MTAT , 4 AP / 6 ECTS) Lectures: Fri 12-14, hall 405 Exercises: Mon 14-16, hall 315 või N 12-14, aud. 405 Graphs (MTAT.05.080, 4 AP / 6 ECTS) Lectures: Fri 12-14, hall 405 Exercises: Mon 14-16, hall 315 või N 12-14, aud. 405 homepage: http://www.ut.ee/~peeter_l/teaching/graafid08s (contains slides) For grade:

More information

Matching Algorithms. Proof. If a bipartite graph has a perfect matching, then it is easy to see that the right hand side is a necessary condition.

Matching Algorithms. Proof. If a bipartite graph has a perfect matching, then it is easy to see that the right hand side is a necessary condition. 18.433 Combinatorial Optimization Matching Algorithms September 9,14,16 Lecturer: Santosh Vempala Given a graph G = (V, E), a matching M is a set of edges with the property that no two of the edges have

More information

Graph Algorithms Using Depth First Search

Graph Algorithms Using Depth First Search Graph Algorithms Using Depth First Search Analysis of Algorithms Week 8, Lecture 1 Prepared by John Reif, Ph.D. Distinguished Professor of Computer Science Duke University Graph Algorithms Using Depth

More information

Laboratory Module X B TREES

Laboratory Module X B TREES Purpose: Purpose 1... Purpose 2 Purpose 3. Laboratory Module X B TREES 1. Preparation Before Lab When working with large sets of data, it is often not possible or desirable to maintain the entire structure

More information

CS301 - Data Structures Glossary By

CS301 - Data Structures Glossary By CS301 - Data Structures Glossary By Abstract Data Type : A set of data values and associated operations that are precisely specified independent of any particular implementation. Also known as ADT Algorithm

More information

MAT 7003 : Mathematical Foundations. (for Software Engineering) J Paul Gibson, A207.

MAT 7003 : Mathematical Foundations. (for Software Engineering) J Paul Gibson, A207. MAT 7003 : Mathematical Foundations (for Software Engineering) J Paul Gibson, A207 paul.gibson@it-sudparis.eu http://www-public.it-sudparis.eu/~gibson/teaching/mat7003/ Graphs and Trees http://www-public.it-sudparis.eu/~gibson/teaching/mat7003/l2-graphsandtrees.pdf

More information

RSA (Rivest Shamir Adleman) public key cryptosystem: Key generation: Pick two large prime Ô Õ ¾ numbers È.

RSA (Rivest Shamir Adleman) public key cryptosystem: Key generation: Pick two large prime Ô Õ ¾ numbers È. RSA (Rivest Shamir Adleman) public key cryptosystem: Key generation: Pick two large prime Ô Õ ¾ numbers È. Let Ò Ô Õ. Pick ¾ ½ ³ Òµ ½ so, that ³ Òµµ ½. Let ½ ÑÓ ³ Òµµ. Public key: Ò µ. Secret key Ò µ.

More information

CSE 431/531: Algorithm Analysis and Design (Spring 2018) Greedy Algorithms. Lecturer: Shi Li

CSE 431/531: Algorithm Analysis and Design (Spring 2018) Greedy Algorithms. Lecturer: Shi Li CSE 431/531: Algorithm Analysis and Design (Spring 2018) Greedy Algorithms Lecturer: Shi Li Department of Computer Science and Engineering University at Buffalo Main Goal of Algorithm Design Design fast

More information

Query Processing & Optimization

Query Processing & Optimization Query Processing & Optimization 1 Roadmap of This Lecture Overview of query processing Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Introduction

More information

Chapter 11: Indexing and Hashing

Chapter 11: Indexing and Hashing Chapter 11: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree Index Files Static Hashing Dynamic Hashing Comparison of Ordered Indexing and Hashing Index Definition in SQL

More information

Graphs and Network Flows ISE 411. Lecture 7. Dr. Ted Ralphs

Graphs and Network Flows ISE 411. Lecture 7. Dr. Ted Ralphs Graphs and Network Flows ISE 411 Lecture 7 Dr. Ted Ralphs ISE 411 Lecture 7 1 References for Today s Lecture Required reading Chapter 20 References AMO Chapter 13 CLRS Chapter 23 ISE 411 Lecture 7 2 Minimum

More information

managing an evolving set of connected components implementing a Union-Find data structure implementing Kruskal s algorithm

managing an evolving set of connected components implementing a Union-Find data structure implementing Kruskal s algorithm Spanning Trees 1 Spanning Trees the minimum spanning tree problem three greedy algorithms analysis of the algorithms 2 The Union-Find Data Structure managing an evolving set of connected components implementing

More information

Throughout the chapter, we will assume that the reader is familiar with the basics of phylogenetic trees.

Throughout the chapter, we will assume that the reader is familiar with the basics of phylogenetic trees. Chapter 7 SUPERTREE ALGORITHMS FOR NESTED TAXA Philip Daniel and Charles Semple Abstract: Keywords: Most supertree algorithms combine collections of rooted phylogenetic trees with overlapping leaf sets

More information

Chapter 12: Indexing and Hashing. Basic Concepts

Chapter 12: Indexing and Hashing. Basic Concepts Chapter 12: Indexing and Hashing! Basic Concepts! Ordered Indices! B+-Tree Index Files! B-Tree Index Files! Static Hashing! Dynamic Hashing! Comparison of Ordered Indexing and Hashing! Index Definition

More information

Mathematical and Algorithmic Foundations Linear Programming and Matchings

Mathematical and Algorithmic Foundations Linear Programming and Matchings Adavnced Algorithms Lectures Mathematical and Algorithmic Foundations Linear Programming and Matchings Paul G. Spirakis Department of Computer Science University of Patras and Liverpool Paul G. Spirakis

More information

Chapter 11: Query Optimization

Chapter 11: Query Optimization Chapter 11: Query Optimization Chapter 11: Query Optimization Introduction Transformation of Relational Expressions Statistical Information for Cost Estimation Cost-based optimization Dynamic Programming

More information

Graph Traversal. 1 Breadth First Search. Correctness. find all nodes reachable from some source node s

Graph Traversal. 1 Breadth First Search. Correctness. find all nodes reachable from some source node s 1 Graph Traversal 1 Breadth First Search visit all nodes and edges in a graph systematically gathering global information find all nodes reachable from some source node s Prove this by giving a minimum

More information

Kernelization Upper Bounds for Parameterized Graph Coloring Problems

Kernelization Upper Bounds for Parameterized Graph Coloring Problems Kernelization Upper Bounds for Parameterized Graph Coloring Problems Pim de Weijer Master Thesis: ICA-3137910 Supervisor: Hans L. Bodlaender Computing Science, Utrecht University 1 Abstract This thesis

More information

Integer Programming ISE 418. Lecture 7. Dr. Ted Ralphs

Integer Programming ISE 418. Lecture 7. Dr. Ted Ralphs Integer Programming ISE 418 Lecture 7 Dr. Ted Ralphs ISE 418 Lecture 7 1 Reading for This Lecture Nemhauser and Wolsey Sections II.3.1, II.3.6, II.4.1, II.4.2, II.5.4 Wolsey Chapter 7 CCZ Chapter 1 Constraint

More information

Matching Theory. Figure 1: Is this graph bipartite?

Matching Theory. Figure 1: Is this graph bipartite? Matching Theory 1 Introduction A matching M of a graph is a subset of E such that no two edges in M share a vertex; edges which have this property are called independent edges. A matching M is said to

More information

Solution for Homework set 3

Solution for Homework set 3 TTIC 300 and CMSC 37000 Algorithms Winter 07 Solution for Homework set 3 Question (0 points) We are given a directed graph G = (V, E), with two special vertices s and t, and non-negative integral capacities

More information

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems

Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Some Applications of Graph Bandwidth to Constraint Satisfaction Problems Ramin Zabih Computer Science Department Stanford University Stanford, California 94305 Abstract Bandwidth is a fundamental concept

More information

SDD Advanced-User Manual Version 1.1

SDD Advanced-User Manual Version 1.1 SDD Advanced-User Manual Version 1.1 Arthur Choi and Adnan Darwiche Automated Reasoning Group Computer Science Department University of California, Los Angeles Email: sdd@cs.ucla.edu Download: http://reasoning.cs.ucla.edu/sdd

More information

Worst-Case Utilization Bound for EDF Scheduling on Real-Time Multiprocessor Systems

Worst-Case Utilization Bound for EDF Scheduling on Real-Time Multiprocessor Systems Worst-Case Utilization Bound for EDF Scheduling on Real-Time Multiprocessor Systems J.M. López, M. García, J.L. Díaz, D.F. García University of Oviedo Department of Computer Science Campus de Viesques,

More information

Reachability in K 3,3 -free and K 5 -free Graphs is in Unambiguous Logspace

Reachability in K 3,3 -free and K 5 -free Graphs is in Unambiguous Logspace CHICAGO JOURNAL OF THEORETICAL COMPUTER SCIENCE 2014, Article 2, pages 1 29 http://cjtcs.cs.uchicago.edu/ Reachability in K 3,3 -free and K 5 -free Graphs is in Unambiguous Logspace Thomas Thierauf Fabian

More information

Optimization I : Brute force and Greedy strategy

Optimization I : Brute force and Greedy strategy Chapter 3 Optimization I : Brute force and Greedy strategy A generic definition of an optimization problem involves a set of constraints that defines a subset in some underlying space (like the Euclidean

More information

CS2223: Algorithms Sorting Algorithms, Heap Sort, Linear-time sort, Median and Order Statistics

CS2223: Algorithms Sorting Algorithms, Heap Sort, Linear-time sort, Median and Order Statistics CS2223: Algorithms Sorting Algorithms, Heap Sort, Linear-time sort, Median and Order Statistics 1 Sorting 1.1 Problem Statement You are given a sequence of n numbers < a 1, a 2,..., a n >. You need to

More information

15.4 Longest common subsequence

15.4 Longest common subsequence 15.4 Longest common subsequence Biological applications often need to compare the DNA of two (or more) different organisms A strand of DNA consists of a string of molecules called bases, where the possible

More information

CS 598: Communication Cost Analysis of Algorithms Lecture 15: Communication-optimal sorting and tree-based algorithms

CS 598: Communication Cost Analysis of Algorithms Lecture 15: Communication-optimal sorting and tree-based algorithms CS 598: Communication Cost Analysis of Algorithms Lecture 15: Communication-optimal sorting and tree-based algorithms Edgar Solomonik University of Illinois at Urbana-Champaign October 12, 2016 Defining

More information

Database Languages and their Compilers

Database Languages and their Compilers Database Languages and their Compilers Prof. Dr. Torsten Grust Database Systems Research Group U Tübingen Winter 2010 2010 T. Grust Database Languages and their Compilers 4 Query Normalization Finally,

More information

looking ahead to see the optimum

looking ahead to see the optimum ! Make choice based on immediate rewards rather than looking ahead to see the optimum! In many cases this is effective as the look ahead variation can require exponential time as the number of possible

More information

Data Structures and Algorithms

Data Structures and Algorithms Data Structures and Algorithms CS245-2008S-19 B-Trees David Galles Department of Computer Science University of San Francisco 19-0: Indexing Operations: Add an element Remove an element Find an element,

More information

Simpler, Linear-time Transitive Orientation via Lexicographic Breadth-First Search

Simpler, Linear-time Transitive Orientation via Lexicographic Breadth-First Search Simpler, Linear-time Transitive Orientation via Lexicographic Breadth-First Search Marc Tedder University of Toronto arxiv:1503.02773v1 [cs.ds] 10 Mar 2015 Abstract Comparability graphs are the undirected

More information

Maximal Monochromatic Geodesics in an Antipodal Coloring of Hypercube

Maximal Monochromatic Geodesics in an Antipodal Coloring of Hypercube Maximal Monochromatic Geodesics in an Antipodal Coloring of Hypercube Kavish Gandhi April 4, 2015 Abstract A geodesic in the hypercube is the shortest possible path between two vertices. Leader and Long

More information

Binary Trees

Binary Trees Binary Trees 4-7-2005 Opening Discussion What did we talk about last class? Do you have any code to show? Do you have any questions about the assignment? What is a Tree? You are all familiar with what

More information

Indexing and Hashing

Indexing and Hashing C H A P T E R 1 Indexing and Hashing This chapter covers indexing techniques ranging from the most basic one to highly specialized ones. Due to the extensive use of indices in database systems, this chapter

More information

16 Greedy Algorithms

16 Greedy Algorithms 16 Greedy Algorithms Optimization algorithms typically go through a sequence of steps, with a set of choices at each For many optimization problems, using dynamic programming to determine the best choices

More information

1 Format. 2 Topics Covered. 2.1 Minimal Spanning Trees. 2.2 Union Find. 2.3 Greedy. CS 124 Quiz 2 Review 3/25/18

1 Format. 2 Topics Covered. 2.1 Minimal Spanning Trees. 2.2 Union Find. 2.3 Greedy. CS 124 Quiz 2 Review 3/25/18 CS 124 Quiz 2 Review 3/25/18 1 Format You will have 83 minutes to complete the exam. The exam may have true/false questions, multiple choice, example/counterexample problems, run-this-algorithm problems,

More information

4 Fractional Dimension of Posets from Trees

4 Fractional Dimension of Posets from Trees 57 4 Fractional Dimension of Posets from Trees In this last chapter, we switch gears a little bit, and fractionalize the dimension of posets We start with a few simple definitions to develop the language

More information

RSA (Rivest Shamir Adleman) public key cryptosystem: Key generation: Pick two large prime Ô Õ ¾ numbers È.

RSA (Rivest Shamir Adleman) public key cryptosystem: Key generation: Pick two large prime Ô Õ ¾ numbers È. RSA (Rivest Shamir Adleman) public key cryptosystem: Key generation: Pick two large prime Ô Õ ¾ numbers È. Let Ò Ô Õ. Pick ¾ ½ ³ Òµ ½ so, that ³ Òµµ ½. Let ½ ÑÓ ³ Òµµ. Public key: Ò µ. Secret key Ò µ.

More information

( ) D. Θ ( ) ( ) Ο f ( n) ( ) Ω. C. T n C. Θ. B. n logn Ο

( ) D. Θ ( ) ( ) Ο f ( n) ( ) Ω. C. T n C. Θ. B. n logn Ο CSE 0 Name Test Fall 0 Multiple Choice. Write your answer to the LEFT of each problem. points each. The expected time for insertion sort for n keys is in which set? (All n! input permutations are equally

More information

Discrete mathematics

Discrete mathematics Discrete mathematics Petr Kovář petr.kovar@vsb.cz VŠB Technical University of Ostrava DiM 470-2301/02, Winter term 2018/2019 About this file This file is meant to be a guideline for the lecturer. Many

More information

On the Performance of Greedy Algorithms in Packet Buffering

On the Performance of Greedy Algorithms in Packet Buffering On the Performance of Greedy Algorithms in Packet Buffering Susanne Albers Ý Markus Schmidt Þ Abstract We study a basic buffer management problem that arises in network switches. Consider input ports,

More information

Distributed Data Structures and Algorithms for Disjoint Sets in Computing Connected Components of Huge Network

Distributed Data Structures and Algorithms for Disjoint Sets in Computing Connected Components of Huge Network Distributed Data Structures and Algorithms for Disjoint Sets in Computing Connected Components of Huge Network Wing Ning Li, CSCE Dept. University of Arkansas, Fayetteville, AR 72701 wingning@uark.edu

More information

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions

Data Partitioning. Figure 1-31: Communication Topologies. Regular Partitions Data In single-program multiple-data (SPMD) parallel programs, global data is partitioned, with a portion of the data assigned to each processing node. Issues relevant to choosing a partitioning strategy

More information

CSE 421 Applications of DFS(?) Topological sort

CSE 421 Applications of DFS(?) Topological sort CSE 421 Applications of DFS(?) Topological sort Yin Tat Lee 1 Precedence Constraints In a directed graph, an edge (i, j) means task i must occur before task j. Applications Course prerequisite: course

More information

Chapter 12: Indexing and Hashing

Chapter 12: Indexing and Hashing Chapter 12: Indexing and Hashing Database System Concepts, 5th Ed. See www.db-book.com for conditions on re-use Chapter 12: Indexing and Hashing Basic Concepts Ordered Indices B + -Tree Index Files B-Tree

More information

Scribe: Virginia Williams, Sam Kim (2016), Mary Wootters (2017) Date: May 22, 2017

Scribe: Virginia Williams, Sam Kim (2016), Mary Wootters (2017) Date: May 22, 2017 CS6 Lecture 4 Greedy Algorithms Scribe: Virginia Williams, Sam Kim (26), Mary Wootters (27) Date: May 22, 27 Greedy Algorithms Suppose we want to solve a problem, and we re able to come up with some recursive

More information

The Encoding Complexity of Network Coding

The Encoding Complexity of Network Coding The Encoding Complexity of Network Coding Michael Langberg Alexander Sprintson Jehoshua Bruck California Institute of Technology Email: mikel,spalex,bruck @caltech.edu Abstract In the multicast network

More information

A Fast Algorithm for Optimal Alignment between Similar Ordered Trees

A Fast Algorithm for Optimal Alignment between Similar Ordered Trees Fundamenta Informaticae 56 (2003) 105 120 105 IOS Press A Fast Algorithm for Optimal Alignment between Similar Ordered Trees Jesper Jansson Department of Computer Science Lund University, Box 118 SE-221

More information

2. Lecture notes on non-bipartite matching

2. Lecture notes on non-bipartite matching Massachusetts Institute of Technology 18.433: Combinatorial Optimization Michel X. Goemans February 15th, 013. Lecture notes on non-bipartite matching Given a graph G = (V, E), we are interested in finding

More information

A Note on Scheduling Parallel Unit Jobs on Hypercubes

A Note on Scheduling Parallel Unit Jobs on Hypercubes A Note on Scheduling Parallel Unit Jobs on Hypercubes Ondřej Zajíček Abstract We study the problem of scheduling independent unit-time parallel jobs on hypercubes. A parallel job has to be scheduled between

More information