
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 9, SEPTEMBER 2009

Signal Assignment to Hierarchical Memory Organizations for Embedded Multidimensional Signal Processing Systems

Florin Balasa, Hongwei Zhu, and Ilie I. Luican

Abstract: The storage requirements of the array-dominated and loop-organized algorithmic specifications running on embedded systems can be significant. Employing a data memory space much larger than needed has negative consequences on the energy consumption, latency, and chip area. Finding an optimized storage of the usually large arrays in these algorithmic specifications is therefore an essential task of memory allocation. This paper proposes an efficient algorithm for mapping multidimensional arrays to the data memory. Similarly to [1], it computes bounding windows for the live elements in the index space of the arrays, but this algorithm is several times faster. More importantly, since this algorithm works not only for entire arrays, but also for parts of arrays, such as array references or, more generally, sets of array elements represented by lattices [2], this signal-to-memory mapping technique can also be applied in hierarchical memory architectures.

Index Terms: Lattice, memory allocation, multidimensional signal processing, polytope, signal-to-memory mapping.

I. INTRODUCTION

MANY signal processing systems, particularly in the multimedia and telecom domains, are synthesized to execute mainly data-dominated applications. Their behavior is described in a high-level programming language, where the code is typically organized in sequences of loop nests having as boundaries (usually affine) functions of the loop iterators, and conditional instructions whose arguments may be data-dependent and/or data-independent (relational and/or logic expressions of affine functions of the loop iterators). The data structures are multidimensional arrays; the indices of the array references are affine functions of the loop iterators.
The specifications with these characteristics are often called affine specifications [3]. In these targeted VLSI systems, data transfer and storage have a significant impact on both the system performance and the major cost parameters, power consumption and chip area. During the system development process, the designer must often focus first on the exploration of the memory subsystem in order to achieve a cost-optimized end-product [3]. The assignment and mapping of the multidimensional signals (arrays) from these behavioral specifications to the data memory is an essential memory management task. The goals of this latter operation are the following: 1) to ascertain that any distinct scalar signals (array elements) that are simultaneously alive are mapped to distinct storage locations;(1) 2) to use mapping functions simple enough to ensure address generation hardware of reasonable complexity; and 3) to map the arrays from the behavioral specification into an amount of data storage as small as possible. Employing a data memory space much larger than needed has several negative consequences: the energy consumption per access increases with the memory size [3], as does the data access latency [4]; in addition, larger memories occupy more chip area.

Manuscript received June 09, 2007; revised January 19, First published March 16, 2009; current version published August 19, F. Balasa is with the Department of Computer Science and Information Systems, Southern Utah University, Cedar City, UT USA (e-mail: fbalasa@cs.uic.edu). H. Zhu is with the Physical IP Division, ARM, Inc., San Jose, CA USA (e-mail: davzhu01@arm.com). I. I. Luican is with the Department of Computer Science, University of Illinois at Chicago, Chicago, IL USA (e-mail: iluica2@uic.edu). Digital Object Identifier /TVLSI

A.
Previous Works

Several mapping models have been proposed in the past, trading off between the last two goals (that is, accepting a certain excess of storage to ensure a less complex address generation hardware) while ascertaining that the first goal is strictly satisfied. De Greef et al. choose one of the canonical linearizations of the array (a permutation of its dimensions), followed by a modulo operation that wraps the set of virtual memory locations into a smaller set of actual physical locations [5]. Since for an n-dimensional array there are n! permutations of the indices, and since, in addition, the elements in a given dimension can be mapped in the increasing or decreasing order of the respective index, the number of analyzed linearizations increases fast with the signal dimension. Lefebvre and Feautrier, addressing the parallelization of static control programs, developed in [6] an intraarray storage approach based on modular mapping as well. They first compute the lexicographically maximal time delay between the write and the last read operations, which is a super-approximation of the distance between conflicting index vectors (i.e., those whose corresponding array elements are simultaneously alive). Then, the modulo operands are computed successively as follows: the modulo operand, applied on the first array index, is set to 1

(1) The lifetime of a scalar signal is the time interval between the clock cycles when the scalar is produced (written) and when it is read for the last time (i.e., consumed) during the code execution. Two scalars are simultaneously alive if their lifetimes overlap. Obviously, in such a case, they must occupy different memory locations; otherwise, they can share the same location.

plus the maximal difference between the first indices over the conflicting index vectors; the modulo operand of the second index is set to 1 plus the maximal difference between the second indices over the conflicting index vectors whose first indices are equal; and so on. Quilleré and Rajopadhye studied the problem of memory reuse for systems of recurrence equations, a computation model used to represent algorithms to be compiled into circuits [7]. In their model, the loop iterators first undergo an affine mapping into a linear space of smallest dimension (what they call a "projection") before modulo operations are applied to the array indices. In order to avoid the inconvenience of analyzing different linearization schemes as in [5], Tronçon et al. proposed [1] to reduce the size of an n-dimensional array mapped to the data memory by computing an n-dimensional bounding window, whose elements can be used as operands in modulo operations that redirect all accesses to the array. Darte et al. proposed a lattice-based mathematical framework for intraarray mapping, establishing a correspondence between valid linear storage allocations and integer lattices called strictly admissible relative to the set of differences of the conflicting indices [8]. They proposed heuristic techniques for building strictly admissible integer lattices, thus building valid storage allocations.

B. Overview of the Proposed Framework and the Contributions of This Work

The assignment algorithm proposed in this paper has the same general goal as Tronçon's mapping model [1]: for each n-dimensional array in the specification code, it computes a maximal bounding window whose dimensions are used to redirect the accesses to the array.
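For illustration, the two flavors of modulo-based mapping discussed above can be sketched in Python. This is a minimal sketch with hypothetical array dimensions and window sizes, not the implementation of any of the cited works: a De Greef-style mapping linearizes the array canonically and wraps the result with a single modulo, while the window-based model reduces each index modulo a window dimension before linearizing.

```python
def de_greef_address(index, dims, window):
    # Canonical (row-major) linearization of the n-dimensional index,
    # wrapped into a smaller set of physical locations by one modulo.
    addr = 0
    for i, size in zip(index, dims):
        addr = addr * size + i
    return addr % window

def window_address(index, window):
    # Window-based mapping: each index is first reduced modulo the
    # corresponding window dimension; the windowed index is then
    # linearized row-major.
    addr = 0
    for i, w in zip(index, window):
        addr = addr * w + (i % w)
    return addr

# Hypothetical 2-D array A[10][12] and a hypothetical window (4, 6):
print(de_greef_address((9, 7), (10, 12), 24))   # (9*12 + 7) % 24 = 19
print(window_address((9, 7), (4, 6)))           # (9%4)*6 + (7%6) = 7
```

In the De Greef model, each permutation and traversal direction of the dimensions gives a different linearization, which is where the fast-growing number of candidate schemes comes from; the window model avoids that search but needs the window dimensions computed beforehand.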
Any array element is mapped into the window; in its turn, the window is mapped into the physical memory (relative to a base address) by a typical canonical linearization, like row or column concatenation for 2-D arrays. Each window dimension is computed as the maximum difference in absolute value between the i-th indices of any two array elements simultaneously alive, plus 1. This ensures that any two array elements simultaneously alive are mapped to distinct memory locations. The amount of data memory required for storing the array (after mapping) is the volume of the window, that is, the product of the window dimensions. In the illustrative example shown in Fig. 1, the window corresponding to the 2-D array A has the dimensions 4 × 6. Indeed, it can be easily verified that, independent of the iteration, the maximum distance between the first (respectively, second) indices of two alive elements cannot exceed 3 (respectively, 5). For instance, the simultaneously alive elements shown in Fig. 1 have both indices the farthest apart from each other. Therefore, a memory access to an element can be safely (that is, without any mapping conflict) redirected through the window.

Fig. 1. Array space of signal A. The points represent the A-elements A[index1][index2] which are produced (and consumed as well) in the loop nest. The points to the left of the dashed line represent the elements produced until the end of the iteration (i = 7, j = 9), the black points being the elements still alive (i.e., produced and still needed in the next iterations), while the circles represent elements already dead (i.e., not needed as operands any more). The light points to the right of the dashed line are A-elements still unborn (to be produced in the next iterations).

The computation algorithm employed in [1] operates in two stages: a liveness analysis establishes a set of program points
in the code where the live sets of array elements are computed; this step is followed by the windows computation proper, based on incremental increases of the window dimensions and checks for the existence of mapping conflicts.(2) In contrast, the approach described in this paper is based on a different computation methodology, which is significantly faster. This efficient methodology, based on projections of lattices [2], is the first contribution of the paper, and it is explained in Sections II and III. More important than the increase in algorithmic efficiency, our computation framework allows significant extensions of the mapping model which are not shared, to the best of our knowledge, by any other previous work in the field. These extensions are explained below. As already mentioned, in the illustrative example from Fig. 1, the window of signal A has the dimensions 4 × 6. It follows that the storage requirement after mapping for the 2-D signal A is 24 memory locations. It can be easily noticed that, actually, only 19 locations would really be needed, since no more than 19 A-elements can be simultaneously alive: see, again, Fig. 1, where the 19 black dots represent the live elements at the end of the iteration. While the previous works compute only the sizes of the array window, they are unable to provide any consistent information on the quality of their models, that is, how large the oversize of the computed mapping windows is in comparison to the minimal array windows and, also, how large the oversize of the storage requirement after mapping is relative to the minimum amount of data memory necessary for the code execution. While a minimum array window is difficult to achieve in practice (in many cases, requiring significantly more complex hardware), a signal-to-memory mapping model actually

(2) More implementation details on this technique will be given in Section VI.

trades off an excess of storage against a less complex address generation hardware. Our methodology allows computing this amount of extra storage after mapping, providing useful information for the system-level design exploration phase. This is the second major contribution of the paper; algorithmic details are provided in Section IV. None of the works addressing the problem of signal-to-memory assignment during the memory allocation design phase [9] discusses the applicability of their proposed mapping model in the context of a multilayer memory organization. On the other hand, the works in the area of memory management that consider memory hierarchy (see Section V for a brief overview) do not address the mapping problem. Since our algorithm works not only for entire arrays, but also for parts of arrays, such as array references or, more generally, sets of array elements represented by lattices [2], this signal-to-memory mapping technique can also be applied in a multilayer memory hierarchy. This is the third major contribution of the paper. The extension of our mapping model to a two-level memory hierarchy (off-chip memory and on-chip scratch-pad memory) is discussed in Section V. Finally, Section VI presents basic implementation aspects and discusses the experimental results, while Section VII summarizes the main conclusions of this work.

II. WINDOW OF AN ARRAY REFERENCE

In this section, a simpler problem will be addressed: given an array reference of an n-dimensional signal, in the scope of a nest of loops having affine functions of the iterators as indices, compute the dimensions of the bounding window for the live elements (as explained in Section I-B), assuming that all the elements of the array reference are simultaneously alive. The technique for solving this problem will be illustrated on the following.
Example 1: The iterators satisfy a set of linear inequalities derived from the loop boundaries and conditional expressions. These inequalities define what is usually called a Z-polytope.(3) The indices of the array reference are given by an affine vector mapping defined on this polytope. Such a set is usually called a (linearly bounded) lattice [2], and it constitutes the array space or index space of the array reference.

(3) In general, this iterator space can be a finite set of Z-polytopes; these can be considered disjoint (since, otherwise, a polytope decomposition can be applied).

More generally, the problem to be solved is the following: given the linearly bounded lattice relating the index vector to the iterator vector, compute the projection spans of the lattice on all the coordinate axes. The general idea for solving this problem is to find a transformation such that the extreme values of the first iterator correspond to the extreme values of, say, the k-th index. In this way, the problem reduces to computing the integer projection of a Z-polytope, this latter problem being well-studied [11], [12].

Algorithm 1: Computation of the bounding window of a lattice

Step 1: The k-th index is given by the k-th row of the mapping matrix in (1). Let a unimodular matrix(4) bring this row to the Hermite Normal Form [2] (if the row is null, the k-th side of the window reduces to one point).

Step 2: Applying the unimodular transformation to the iterators yields a new iterator Z-polytope.

Step 3: Compute the extreme values of the first new iterator by projecting the Z-polytope on the first axis [11]; from them, the extreme values of the k-th index follow, and the k-th window dimension is their difference plus 1.

Performing this 3-step algorithm for every index k, the bounding window of the lattice is obtained.

Example 1 (Continued): The algorithm will be illustrated by projecting the array reference from Example 1 on the first axis and finding the extreme values of the first index. The unimodular matrix (see, e.g., [2] for its construction) brings the first row of the mapping to the Hermite Normal Form.
After the transformation, the initial iterator space (see Example 1) becomes a new system of inequalities. Eliminating the remaining iterators from these inequalities with a Fourier-Motzkin technique [13], the extreme values of the exact shadow [11] of the polytope on the first axis are obtained, and those extreme points are valid projections (i.e., there exist integer values of the other iterators such that the inequalities defining the polytope are satisfied). From these, the extreme values of the first index, and hence the first window dimension, follow immediately. Projecting the lattice on the second axis, with the second row of the affine mapping and its corresponding unimodular matrix, a similar computation yields the second window dimension, and thus the bounding window of the array reference. This algorithm will be used in the signal-to-memory mapping approach, explained in the next section.

(4) A square matrix with integer elements whose determinant is ±1.
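On small instances, the result of Algorithm 1 can be cross-checked by brute force: enumerate the iterator points, apply the affine mapping, and measure the projection span on each index axis. The Python sketch below does exactly that for a hypothetical array reference (Algorithm 1 obtains the same spans analytically, via the Hermite Normal Form and Fourier-Motzkin projection, without any enumeration):

```python
from itertools import product

def bounding_window(T, u, ranges, conditions):
    # Lattice {x = T*i + u | i in the iterator polytope}: enumerate the
    # iterator points, map them through the affine function, and record
    # the per-axis extreme index values.
    lo = hi = None
    for i in product(*[range(a, b + 1) for a, b in ranges]):
        if not all(cond(i) for cond in conditions):
            continue
        x = tuple(sum(T[r][k] * i[k] for k in range(len(i))) + u[r]
                  for r in range(len(T)))
        lo = x if lo is None else tuple(map(min, lo, x))
        hi = x if hi is None else tuple(map(max, hi, x))
    # Each window dimension is the projection span plus 1.
    return tuple(h - l + 1 for l, h in zip(lo, hi))

# Hypothetical array reference A[i + j][2*i], for 0 <= i <= 3, 0 <= j <= 2,
# with the extra affine condition i >= j:
T, u = [[1, 1], [2, 0]], [0, 0]
print(bounding_window(T, u, [(0, 3), (0, 2)], [lambda i: i[0] >= i[1]]))
# (6, 7): the first index spans 0..5, the second 0..6
```

The enumeration grows with the volume of the iterator polytope, which is precisely why the analytical projection of Algorithm 1 is preferable in practice.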

Fig. 2. (a) Example 2: An affine specification derived from a motion detection kernel. (b) Decomposition of the index space of the signal A[x][y] into disjoint lattices; the arcs in the graph show inclusion relations.

III. GLOBAL FLOW OF THE SIGNAL-TO-MEMORY MAPPING ALGORITHM

In the previous section, we computed the bounding window for the live array elements covered by a given array reference. However, the assumption we made that all the elements in the array reference are alive does not always hold during the execution of the algorithmic specification. In the illustrative example from Fig. 2(a), the array reference from the assignment marked with *, covering 10,465 elements, is entirely alive at the beginning of the loop nest, but it is no longer entirely alive when the execution of the loop nest is over: since 161 of its elements were consumed for the last time during the loop nest execution, only 10,304 A-elements are still alive (hence, needing storage) at the end of the loop nest. Even more revealing, the array reference of a 3-D signal in the same starred assignment covers 409,825 elements. If all the elements were simultaneously alive, the necessary bounding window would be 1 × 97 × 4225, since there is only one possible value (32) for the first index, 97 possible values (between 32 and 128) for the second index, and 4225 possible values (between 1 and 4225) for the third index. However, a closer look shows that, actually, only one single element is alive at a time: each element produced in a certain iteration is immediately consumed in the next iteration, being also covered by the other array reference. So, a 1 × 1 × 1 window would suffice: each access to the two array references could be safely redirected to the same memory location. The global flow of the mapping algorithm, computing the bounding windows of the arrays while taking into account the life span of their elements, is described below.
The overall size of these windows yields the (one-layer) data memory requirement. Moreover, the window dimensions are used to map the array elements to the physical memory locations using modulo operations, as explained in Section I-B.

Algorithm 2: Computation of the bounding windows of all the arrays in the algorithmic specification

Step 1: Extract the array references from the given algorithmic specification and decompose the array references of every indexed signal into disjoint lattices.

The decomposition of the array references makes it possible to find out which parts of an array reference stay alive during the execution of a loop nest and which parts are consumed. This decomposition into disjoint lattices, used also in [10], can be performed analytically, by recursively intersecting the array references of every indexed (multidimensional) signal in the code. Consider an indexed signal in the algorithmic specification and the initial set of its lattices (these are the lattices of the array references of the signal). An inclusion graph is gradually built; this is a directed acyclic graph whose nodes are lattices of the signal and whose arcs denote inclusion relations between the respective sets. (For instance, an arc from one node to another shows that the lattice of the former is included in the lattice of the latter.) A high-level pseudo-code of the

decomposition of the signal's array references into disjoint lattices is given below:

initialize the inclusion graph, creating one node for each lattice;
repeat {
    for (each pair of lattices in the set between which no inclusion
         relation is known, i.e., no path exists in the graph) {
        compute the intersection of the two lattices;
        if (the intersection is nonempty) {
            (1) if (the intersection equals one of the two operands)
                    add an arc in the graph from the node of that operand
                    to the node of the other operand;
            (2) elsif (there exists already a lattice equivalent(5) to the
                    intersection in the set)
                    add arcs, if necessary, from the node of the
                    equivalent lattice to the nodes of the operands;
            (3) else {
                    add the new lattice to the set and add a node in the
                    graph, with arcs towards the nodes of the two
                    operands;
            }
            compute the difference between each operand and the
            intersection; // may yield a set of several lattices
            each new lattice in a difference is added to the set, and
            corresponding new nodes, with arcs towards the node of the
            respective operand, are added in the graph;
        }
    }
} until (no new lattice can be added to the set).

While the intersection of two bounded lattices is itself a lattice, the difference may consist of several bounded lattices. These operations with lattices are described in detail and illustrated in [10]. At the beginning, the inclusion graph of the indexed signal contains only the lattices of the corresponding array references as nodes. Every pair of lattices between which no inclusion relation is known so far (i.e., between which there is currently no path in the graph) is intersected.
If the intersection produces a nonempty lattice, there are basically three possibilities: 1) the resulting lattice is one of the intersection operands: in this case, inclusion arcs between the corresponding nodes must be added to the graph; 2) an equivalent(5) linearly bounded lattice already exists in the set: in this case, arcs must be added, only if necessary, between the node corresponding to the equivalent lattice and those of the operands; 3) the resulting lattice is a new element of the set: a new node is appended to the graph, along with arcs towards the two operands of the intersection. The inclusion graph is used, on one hand, to speed up the decomposition (for instance, if the intersection of two lattices turns out to be empty, there is no sense in trying to intersect the first lattice with the lattices included in the second one, since those intersections will be empty as well) and, on the other hand, to determine the structure of each array reference in terms of disjoint lattices: at the end, all the nodes in the graph without incident arcs represent disjoint lattices.

(5) Two linearly bounded lattices of the same indexed signal are equivalent if they represent the same set of indices; e.g., {x = i + j | 0 <= i <= 2, 0 <= j <= 2} and {x = i | 0 <= i <= 4} are equivalent.

Fig. 2(b) shows such an inclusion graph and the result of the decomposition of the six (bold) array references of the 2-D signal A in the illustrative example from Fig. 2(a). The graph displays the inclusion relations between the lattices of A. The six bold nodes represent the six array references of signal A in the code; for instance, one of them represents the lattice of the array reference of A from the first loop nest. The nodes are also labeled with the size of the corresponding lattice, that is, the number of array elements covered by it. In this example, the intersection of two of the array references is also a bounded lattice, denoted accordingly in Fig. 2(b). However, their difference is not a lattice, due to the nonconvexity of this set, so the resulting set had to be decomposed further.
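The effect of this fixpoint decomposition can be sketched in Python on explicit sets of index tuples (the references below are hypothetical; the actual framework keeps the lattices in closed symbolic form and intersects/differences them analytically [10], never enumerating them):

```python
def disjoint_parts(references):
    # Refine the references into disjoint parts by repeatedly replacing an
    # overlapping pair (a, b) with a - b, a & b, b - a, until a fixpoint:
    # afterwards, every original reference is a union of disjoint parts.
    parts = [set(r) for r in references if r]
    changed = True
    while changed:
        changed = False
        for i in range(len(parts)):
            for j in range(i + 1, len(parts)):
                a, b = parts[i], parts[j]
                inter = a & b
                if not inter or a == b:
                    continue
                if inter == a:            # a included in b: split b
                    parts[j] = b - a
                elif inter == b:          # b included in a: split a
                    parts[i] = a - b
                else:                     # proper overlap: three pieces
                    parts[i], parts[j] = a - b, inter
                    parts.append(b - a)
                changed = True
    # drop duplicates (equivalent parts) and empties
    uniq = []
    for p in parts:
        if p and p not in uniq:
            uniq.append(p)
    return uniq

# Two hypothetical overlapping references over a 1-D index space:
parts = disjoint_parts([range(0, 5), range(3, 8)])
print(sorted(sorted(p) for p in parts))  # [[0, 1, 2], [3, 4], [5, 6, 7]]
```

The sketch omits the inclusion graph, which in the paper's algorithm both prunes redundant intersections and records, at the end, which disjoint parts compose each reference.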
At the end of the decomposition, the set contains 17 lattices; each of the 11 nodes without incident arcs represents a disjoint lattice in the partition of the array space of A. Every array reference in the code is now either a disjoint lattice itself or a union of disjoint lattices.

Step 2: Compute preliminary windows (underestimations of the final windows) for each indexed signal, taking into account the live elements at the boundaries between the loop nests.

A high-level pseudo-code of the computation of a signal's preliminary window is given as follows:

for (each disjoint lattice)            // the set of disjoint lattices
                                       // was determined at Step 1
    for (each dimension of the signal)
        compute the minimum and maximum index values of the lattice in
        that dimension, using Algorithm 1;
for (each boundary between two successive loop nests) {
    let S be the set of disjoint lattices of the signal alive at the
    boundary (i.e., produced before the boundary and consumed after it);
    for (each dimension of the signal)
        the corresponding side of the signal's bounding window at the
        boundary is the difference between the maximum and the minimum
        index values over the lattices in S, plus 1;
}
for (each dimension of the signal)
    the corresponding side of the signal's preliminary window is the
    maximum of that side over all boundaries;

First, the extreme values of the projections of every disjoint lattice are computed using Algorithm 1, as exemplified for one of the lattices in Fig. 2(b). After the decomposition of the array references, one can easily find out the loop nests where each disjoint lattice was produced and consumed (i.e., used as an operand for the last time).(6)

(6) The algorithmic specifications are in single-assignment form: each array element is written at most once, but it can be read an arbitrary number of times.

For instance, the lattice

belongs to the input signal, so it is alive when the execution of the code is initiated. It is consumed in the first loop nest, since it is included only in the array reference in the assignment line marked with * in Fig. 2(a). Let us consider, for instance, the boundary between the first and the second loop nests (the start of the code being boundary zero; single instructions outside nested loops can be considered nests of depth zero). Several disjoint lattices are alive across this boundary, being consumed only in the third loop nest; the extreme index values over these lattices yield A's bounding window relative to the chosen boundary. But since all the disjoint lattices are alive before the start of the first loop nest, the bounding window for that boundary is larger. Obviously, the largest window must be selected. Similarly, a second signal will have a very small window, since only one of its elements is alive between the blocks of code. The window of the 3-D signal is not yet constrained at this point, since none of its elements is alive at any boundary between blocks.

Step 3: Update the bounding window (compute the exact window sides) for each indexed signal.

The window sides computed for each index at Step 2 are maximal when every loop nest in the code either produces or consumes (but not both!) the array's elements. This works, e.g., for one signal in Example 2, where every loop nest only consumes its elements [see Fig. 2(a)]. On the other hand, this does not work for the signals whose elements are both produced and consumed in the same loop nest. In this situation, the values computed at Step 2 may be underestimates and, hence, may need to be increased. Let a disjoint lattice of an indexed signal be consumed (read for the last time) in a certain (normalized) loop nest where elements of the same signal are produced as well.
Then:

for (each element covered by the lattice) {
    compute the iteration vector when the element is consumed;(7)
    determine the disjoint lattices of the signal partially produced
    until the element is consumed in that iteration;
    determine the disjoint lattices of the signal partially consumed
    until the element is consumed in that iteration;
    for (each dimension of the signal)
        for (each lattice in the two sets above) {
            compute the extreme index values of the lattice in that
            dimension, using Algorithm 1;
            update the corresponding window side; // it may increase
        }
}

(7) This requires the computation of the maximum iterator vector relative to the lexicographic order [10]. If the lattice is included in several array references, the overall maximum iterator vector is considered.

The guiding idea is that local or global maxima of the index distances between live elements are reached immediately before the consumption of an element, which may entail a shrinkage of some side of the bounding window encompassing the live elements. There are quite frequent situations when array elements are produced and consumed at every iteration. Such cases occur when array references have one-to-one correspondences between the iterators and the indices. For instance, in Example 2, each iterator vector corresponds to a unique (produced) array element and a unique (consumed) element (see the assignment marked with * in the first loop nest). Since each iteration will produce and consume one element, the corresponding bounding window stays very small; the same holds for the other signals with one-to-one references. The storage requirement for Example 2 is obtained by summing up the individual window sizes. It can be shown that this storage requirement is really the minimum one; therefore, the mapping model we adopted, based on the bounding windows of simultaneously alive array elements, was very effective for this code. However, this is not always the case: as already explained in Section I-B, signal A's window in the illustrative example from Fig. 1 is 4 × 6, therefore 24 locations, whereas the minimum storage window of A is only 19 locations.

IV.
MINIMUM STORAGE REQUIREMENT OF AN INDEXED SIGNAL

As emphasized in Section I-B, the other previous works addressing signal-to-memory mapping made only relative evaluations, comparing the performances of different mapping models on some set of benchmark applications. They were unable to provide information on how large the oversize of their resulting array windows is in comparison to the minimum array windows, or how large the oversize of their overall memory allocation solution is versus the minimum data storage requirement of the entire application. In contrast, our model provides such information, thus allowing a better performance evaluation of the mapping model. Our polyhedral framework can compute the optimal window for each multidimensional array in the specification and, therefore, the optimal memory sharing between the elements of the same array (intraarray memory sharing). For instance, one of the signals in the illustrative example in Fig. 3 needs a minimum window of 1,752 memory locations, since there are at most 1,752 of its elements simultaneously alive, as computed by our tool. Similarly, the minimum window (or the optimal memory sharing) of the other signal is 3,104 storage locations. However, since the elements of the two arrays can share the same locations if their lifetimes are disjoint, the minimum storage requirement is, actually, 4,536 locations,(8) less than the sum of the minimum windows of the two signals. The minimum data storage requirement of the entire (procedural) algorithmic specification (therefore, the optimal memory sharing between all the array elements in the code, i.e., interarray memory sharing) is determined as well [10]. While the previous works make only relative evaluations, comparing the performance of one mapping model versus other models or versus the total number of array elements in the specification code, our framework can provide an accurate evaluation of the absolute efficiency (in terms of data storage) of the mapping model employed and, more generally, of any other mapping model.

(8) A detail of the memory trace generated by our tool is displayed in Fig. 3.

Fig. 3. Code example and its memory trace illustrating the optimal intra- and interarray memory sharing.

When computing the minimum array window (the optimal memory sharing between the elements of a same array), an approach similar to Algorithm 2 is used, with one key difference: instead of the lattice windows, the exact lattice sizes are used. While the lattice size is a tight (reachable) lower bound on the data memory needed to store the lattice, the window of the lattice, although typically larger, offers, in addition, a mapping solution (based on modulo computations) for its elements, which is acceptable from the address generation point of view.

Algorithm 3: Computation of the exact size of a lattice

There are two distinct situations to be considered.

Case 1: The affine vector function of the lattice is a one-to-one mapping. A sufficient condition is that the rank of the mapping matrix in (1) be equal to its number of columns. Then, the computation of the number of distinct signal indices (i.e., the amount of memory necessary to store the array elements covered by the lattice) reduces to counting the number of iterator vectors in the iterator polytope in (1). In such a situation, a computation technique based on Barvinok's decomposition of a simplicial cone into unimodular cones [14] is used. Note that counting the points having integer coordinates in Z-polyhedra can be done in several ways: there are methods based on Ehrhart polynomials, like, for instance, [15], or even much simpler ones, adapting the Fourier-Motzkin technique [13]. We decided to employ the current approach because of its scalability, which will become apparent by the end of this section. An example illustrating both the concepts and the technique is given below. Although the theoretical background is not explained in detail (see [14], [16] for more theoretical insight), this example is intended to illustrate the main steps of the computation flow. Moreover, it will clearly show the scalability of the approach, which makes it appropriate for algorithmic specifications typical of multidimensional signal processing applications.

Fig. 4. 2-D polytope (quadrilateral) representing the iterator space in Example 3.

Example 3: Since the rank of the mapping matrix is 2, the vector function is a one-to-one mapping of the iterator space into the index space. It follows that the computation of the number of array elements covered by the lattice is equivalent to counting the number of points of integer coordinates in the iterator polytope shown in Fig. 4. This latter operation is done as explained below.

Step 1: Find the vertices of the iterator polytope (see Fig. 4) and their supporting cones.

The (rational polyhedral) cone generated by a set of linearly independent integer vectors (its rays) is the set of their nonnegative linear combinations. For instance, the set of points inside one of the angles of the quadrilateral (see Fig. 4) is a 2-D cone generated by two rays, with the corresponding vertex taken as origin. There is a supporting cone corresponding to

BALASA et al.: SIGNAL ASSIGNMENT TO HIERARCHICAL MEMORY ORGANIZATIONS 1311

each vertex of a polyhedron. For instance, the supporting cone of the vertex marked in Fig. 4 is the one generated by its two adjacent rays. A cone is called unimodular if the matrix of its rays is unimodular. Given the inequalities defining the iterator space, the vertices and the rays are computed using the reverse search algorithm [17]. The supporting cones corresponding to the vertices of the iterator polytope, as well as their generating rays (shown as column vectors), are listed in (2).

Step 2: Decompose the supporting cones into unimodular cones [14]. The last two cones in our example are not unimodular; their decompositions are given in (3) and (4).9

Step 3: Find the generating function of the iterator polytope.

Given a Z-polytope in a d-dimensional space, a monomial term can be introduced for each point of integer coordinates in it; the sum of all these terms is defined as the generating function of the polytope [14]. For example, the generating function of the triangle with the vertices (0, 0), (1, 2), (2, 0) is a sum of monomial terms, each corresponding to one of the lattice points inside or on the border of the triangle (e.g., one monomial corresponds to the point (1, 2), and so on). By evaluating the generating function at the point whose coordinates are all 1, each monomial contributes 1, so we get the number of points in the polytope [14]; for this triangle, the result is 5, the number of points of integer coordinates in the triangle. Writing the generating function as a sum of monomial terms would be impractical for large polytopes. Fortunately, it can be compactly written as an algebraic sum of rational functions, each term corresponding to one of the unimodular cones in the decomposition of the supporting cones of the polytope vertices (Steps 1 and 2). If the vertex of a cone is v = (v1, v2), then x^v denotes the monomial x1^v1 x2^v2; similarly, x^u for a ray u = (u1, u2). The generating function of any cone is obtained by summing the generating functions of all its unimodular component cones; from (2) and (4), the generating functions of the three cones of interest are obtained in this way.
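The counting identity behind Step 3 (evaluating the generating function at the all-ones point yields the number of lattice points) can be checked by brute force on the small triangle above. The sketch below is purely illustrative: it enumerates the points directly, instead of using Barvinok's rational-function representation.

```python
def lattice_points_in_triangle(vertices):
    """Brute-force enumeration of the integer points inside or on the
    border of a convex polygon given by its vertices in counter-clockwise
    order (a bounding-box scan with half-plane tests)."""
    xs = [x for x, _ in vertices]
    ys = [y for _, y in vertices]
    points = []
    for x in range(min(xs), max(xs) + 1):
        for y in range(min(ys), max(ys) + 1):
            inside = True
            for i in range(len(vertices)):
                x1, y1 = vertices[i]
                x2, y2 = vertices[(i + 1) % len(vertices)]
                # cross product < 0 means (x, y) lies strictly outside
                # the half-plane to the left of edge (x1, y1) -> (x2, y2)
                if (x2 - x1) * (y - y1) - (y2 - y1) * (x - x1) < 0:
                    inside = False
                    break
            if inside:
                points.append((x, y))
    return points

# Triangle from the text, vertices (0, 0), (1, 2), (2, 0), in CCW order:
tri = [(0, 0), (2, 0), (1, 2)]
pts = lattice_points_in_triangle(tri)
print(len(pts))                                  # prints 5
# Evaluating the generating function at x = y = 1 turns every monomial
# x^a * y^b into 1, so it counts exactly the same points:
print(sum(1 ** a * 1 ** b for a, b in pts))      # prints 5
```

For the quadrilateral of Fig. 4, the same enumeration would confirm the count of 21 reported in the text; however, its cost grows with the volume of the polytope, which is precisely the limitation that Barvinok's decomposition avoids.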
Every unimodular cone whose vertex v has integer coordinates has an associated generating function [14] of the form (5): the monomial x^v of the vertex divided by the product of factors (1 - x^u) over all the generating rays u of the cone.

9 The computation is complex: the cones are polarized, then decomposed, then polarized back. This double dualization process eliminates the lower-dimension component cones [16].

When not all the coordinates of the vertex are integers (as is the case for one vertex here), but are still rational numbers (which is the case in practically all the multimedia algorithms), the generating function has the same form (5), but the monomial in the numerator corresponds to the lattice point in the fundamental parallelepiped of the unimodular cone [14]. In this case, this point can be determined as follows [16]. Let B be the matrix whose columns are the rays of the cone, and solve the linear system Bλ = v, where v is the coordinate vector of the vertex. If λ is the solution of the system, then the coordinates of the vertex used in the generating function are the elements of the vector B⌈λ⌉, where ⌈·⌉ denotes the componentwise ceiling. This technique, applied to the two unimodular cones in (3), components of one of the supporting cones, yields the corresponding lattice points, and the generating function of that cone follows. The sum of the rational functions is the generating function of the whole quadrilateral in Fig. 4.

Step 4: Compute the number of points having integer coordinates from the generating function of the whole polytope.

In order to obtain a one-variable generating function, we substitute for each variable a power of a single new variable, the integer exponents being chosen such that no factor in the denominators of the terms becomes zero. With this substitution, the generating function of the iterator polytope in Fig. 4 becomes a sum of one-variable rational terms. After eliminating the negative exponents in the denominators, and after factorizing, we substitute again, obtaining rational terms whose numerators and denominators are polynomials, the relevant exponent being the dimension of the iterator space of

the polytope.

The coefficients of the polynomial quotient in each term can be obtained recursively. It can be proven [16] that the algebraic sum of the coefficients of order equal to the space dimension (here 2), after the polynomial divisions in all the terms, is the number of points in the Z-polytope. In this example, there are six such coefficients (one for each term), and their sum is 21, which is indeed the number of lattice points inside or on the border of the quadrilateral in Fig. 4; it is also the number of memory locations needed to store the array reference since, as already explained, the vector function is a one-to-one mapping from the iterator space to the index space.

Assume now that one edge of the quadrilateral is shifted to the right, keeping its slope unchanged, until the coordinates of one of its vertices become (107, 1) instead of (7, 1); the other affected vertex then gets the coordinates (214/3, 217/3). The computation effort necessary to count the number of points in the new, much larger, Z-polytope is not affected by the significant increase in its size. Indeed, the four supporting cones are generated by the same rays, the decompositions are the same, and the generating functions are almost the same: the only difference appears in the numerators of the terms associated with the two modified vertices. The number of points in the new Z-polytope is 3,921.

Besides its excellent scalability, the technique sketched above, illustrated here for a 2-D polytope (a quadrilateral), works for an arbitrary number of dimensions. Therefore, it is well suited to the large array references typical of telecom and multimedia applications.

Case 2: The affine vector function of the lattice is not a one-to-one mapping.
When the rank of the matrix is smaller than its number of columns, the matrix is brought to the Hermite Normal Form [2] by postmultiplying it with a unimodular matrix. The size of the lattice is afterwards obtained by counting the number of points having integer coordinates in the projection of the Z-polytope along its first coordinates [11], [12].

Example 4: The rank of the matrix is smaller than its number of columns; in this case, the vector function may not be a one-to-one mapping. Indeed, two distinct iterator vectors can be mapped to the same index. The unimodular transformation bringing the matrix to the Hermite Normal Form modifies the initial iterator space into a new one (expressed in terms of the new iterator vector after the transformation), whose exact 1-D projection is obtained by eliminating one of the variables in the inequalities above [13]. The points in the dark shadow [11] correspond to lattice points in the modified iterator space. Checking the values outside the dark shadow, 1 and 2 turn out to be invalid projections: replacing these values in the modified iterator space, no integer solution can be found. Conversely, 0 and 3 are valid projections, since corresponding integer points, e.g., (3, 1), belong to the modified iterator space. Therefore, storing the array reference requires 13 locations: the index can take all the values between 0 and 14, except 1 and 2.

The global flow of the computation of the minimum storage requirement of an indexed signal is similar to the flow of Algorithm 2, which computes the bounding windows (see Section III). Actually, their implementations are merged in the same framework for the sake of efficiency. In this way, the designer simultaneously gets a mapping solution together with information on the solution quality in terms of data storage.

Algorithm 4: Computation of the minimum storage requirement for a given array in the algorithmic specification

Step 1 (similar to Step 1 of Algorithm 2): Extract the array references of the given array from the algorithmic specification and decompose them into disjoint lattices.
Step 2: Compute (an underestimate of) the minimum storage requirement of the signal, taking into account only the live elements at the boundaries between the loop nests.

for (each disjoint lattice)   // the set of disjoint lattices was determined at Step 1
    compute the size of the lattice, using Algorithm 3;
for (each boundary between two consecutive loop nests) {
    let S be the set of disjoint lattices alive at the boundary
        (i.e., produced before the boundary and consumed after it);
    sum the sizes of the lattices in S;   // the minimum storage requirement at this boundary
}
take the maximum of these sums over all loop boundaries;   // (underestimate of the) minimum storage requirement

Step 3: Compute the exact value of the minimum storage requirement. The value obtained after Step 2 is exact when all the elements of the array are either produced or consumed (but not both!) in the loop nests. If this is not the case, a local or global maximum of the storage requirement is reached immediately before the consumption of some element. Step 3 investigates all the local maxima within such a loop nest, adjusting the value upwards. Consider a disjoint lattice which is

consumed in a (normalized) loop nest where elements of the array are produced as well. Then:

for (each element covered by the lattice) {
    compute the iteration vector at which the element is read for the last time;
    determine the disjoint lattices partially produced until the element is consumed at that iteration;
    determine the disjoint lattices partially consumed until the element is consumed at that iteration;
    if the amount of live storage at that iteration exceeds the current value, adjust the value upwards;
}

V. MAPPING SIGNALS INTO A HIERARCHICAL MEMORY SUBSYSTEM

In embedded communication and multimedia processing systems, the on-chip memory is often complemented by an external (off-chip) memory storing the large amounts of data used during the execution of the application. The memory subsystem is typically a major contributor to the overall energy budget of the entire system. The energy cost can be reduced and the system performance enhanced by introducing an optimized custom memory hierarchy that exploits the temporal locality of the data accesses [3]. Savings of dynamic energy (which is expended only when memory accesses occur) at the level of the whole memory subsystem can be mainly obtained by accessing frequently used data from smaller on-chip memories rather than from large background (off-chip) memories. As on-chip storage, scratch-pad memories (SPMs), i.e., software-controlled static or dynamic random-access memories that are more energy-efficient than caches, are widely used in embedded systems, in which the flexibility of caches in terms of workload adaptability is often unnecessary, whereas power consumption and cost play a much more critical role. Different from caches, the SPM occupies one distinct part of the virtual address space, the rest of the address space being occupied by the main memory. The consequence is that there is no need to check for the availability of the data in the SPM; hence, the SPM does not need a comparator and the miss/hit acknowledging circuitry.
This contributes to a significant energy (as well as area) reduction. Another consequence is that in cache memory systems the mapping of data to the cache is done at run time, whereas in SPM-based systems it can be done at compilation time. The problem of energy-efficient assignment of signals to the on- and off-chip memories has been studied since the late nineties. These previous works focused on partitioning the arrays into copy candidates and on the optimal selection and mapping of these into the memory hierarchy [18]-[21]. The general idea was to identify the data (arrays or parts of arrays) that are most frequently accessed in each loop nest. Copying these heavily accessed data from the large off-chip memory to a smaller on-chip memory can potentially save energy (since most accesses will target the smaller copy and not the large, more energy-consuming, original array) and also improve performance. Many different possibilities exist for deciding which parts of the arrays should be copy candidates, for selecting among the candidates those which will be instantiated as copies, and for their assignment to the different memory layers. For instance, Kandemir et al. analyze and exploit the temporal locality by inserting local copies [19]. Their layer assignment builds a separate hierarchy per loop nest and then combines them into a single hierarchy. However, the approach lacks a global view of the lifetimes of (parts of) arrays in applications having imperfectly nested loops. Brockmeyer et al. use the steering heuristic of assigning the arrays having the highest access-number-over-size ratio to the highest memory layer first, followed by incremental reassignments [20]. They take into account the relative lifetime differences between arrays and between the scalars covered by each array.
However, it is not clear whether the copy candidates can also be parts of arrays instead of entire arrays (and if so, how these parts are identified), since the access patterns are, in general, not uniform. Hu et al. can use parts of arrays as copies, but they typically use cuts along the array dimensions [21] (like rows and columns of matrices), using the index space of the array references and also their intersections. Different from these previous works, our framework identifies with precision those parts of arrays that are more intensely accessed. These parts, represented as lattices, are assigned to the on-chip scratch-pad memory, achieving a reduction of the dynamic energy consumption due to memory accesses. Since the bounding window of any lattice can be computed using Algorithm 1, those lattices covering the most accessed elements of the arrays are mapped to the SPM, and Algorithm 2 can find their bounding windows. The same Algorithm 2 computes the bounding windows for the off-chip memory as well.

Algorithm 5: Hierarchical (two-layer) energy-aware memory allocation for an indexed signal A

for (each disjoint lattice)
    compute (with Algorithm 3) the number of memory accesses to the lattice;
assign to the SPM a subset of disjoint lattices having the highest average access values,
    such that the total size of the lattices assigned to the SPM does not exceed its maximum size;
compute with Algorithm 2 the bounding window of the lattices in the subset;   // mapping solution for the SPM
compute with Algorithm 2 the bounding window of the remaining lattices;   // mapping for the off-chip memory

For instance, consider the lattice A15 derived from the illustrative example in Fig. 2(a). As A15 is included in the array references from all three loop nests and also coincides with an operand in the first loop nest (see Fig. 5), the contributions of memory accesses of all these four array references must be computed. Since the lattice of the first array reference is known,

Fig. 5. Computation of memory accesses for the disjoint lattices. A15 is included in four array references: A1, A2, A3, and itself. The numbers of accesses from each array reference are 5,249, 5,249, 5,249, and 409,825. The total numbers of accesses are listed together with the average numbers of accesses per A-element.

the expressions of the iterators mapping the array elements into it are derived. The size of this lattice is 5,249 (computed with Algorithm 3, as explained in Section IV), representing the number of memory accesses to the lattice due to the first array reference. Performing similar computations, the distribution of accesses over the whole array space is obtained. The three lattices covering roughly the center of the array space of the signal are the most accessed array parts (see Fig. 5): 425,572 memory accesses (the total number of accesses being 2,458,950) for each of the three lattices, hence an average of over 4,000 accesses for each A-element covered by them.10 Assuming these three lattices are copied into the scratch-pad memory (the subset assigned to the SPM), their bounding windows computed with Algorithm 1 (since all their elements are simultaneously alive) are identical. Moreover, since all three lattices are simultaneously alive, an overall bounding window covering them is obtained. Obviously, this window is much smaller than the window of the whole array. The A-elements stored in the on-chip SPM should use the corresponding modulo mapping function, whereas the A-elements stored off-chip do not need any mapping function, since their window is as large as the entire array space of A. This allocation solution with two memory layers implies, besides the background memory of 10,787 locations, another memory closer to the processor, in the SPM. The increase of data memory space is compensated here by savings of dynamic energy of about 51.5%, according to the CACTI power model [22].
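The selection step of Algorithm 5 can be sketched as a simple greedy procedure: rank the disjoint lattices by their average number of accesses per element and copy them to the SPM while its capacity allows. The lattice names, sizes, and access counts below are illustrative placeholders, not the figures of the running example.

```python
def assign_to_spm(lattices, spm_capacity):
    """Greedy sketch in the spirit of Algorithm 5: copy to the scratch-pad
    memory (SPM) the disjoint lattices with the highest average number of
    accesses per element, as long as the SPM capacity is not exceeded.
    `lattices` maps a lattice name to (size_in_elements, total_accesses)."""
    ranked = sorted(lattices.items(),
                    key=lambda kv: kv[1][1] / kv[1][0],  # avg accesses/element
                    reverse=True)
    spm, off_chip, used = [], [], 0
    for name, (size, _) in ranked:
        if used + size <= spm_capacity:
            spm.append(name)
            used += size
        else:
            off_chip.append(name)
    return spm, off_chip

# Hypothetical lattices: two small, heavily accessed ones and two large,
# sparsely accessed ones (names and figures are illustrative only).
stats = {"L1": (100, 425_572), "L2": (100, 425_572),
         "L3": (4_830, 500_000), "L4": (4_830, 500_000)}
spm, off = assign_to_spm(stats, spm_capacity=250)
print(spm, off)   # L1 and L2 fit in the SPM; L3 and L4 stay off-chip
# The lattices copied on-chip would then be addressed through their
# bounding window, e.g., address = (i % w1) * w2 + (j % w2).
```

Note that ranking by the access-over-size ratio follows the same intuition as the steering heuristic of [20]; a capacity-constrained optimal selection would instead require solving a knapsack problem.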
Indeed, assuming typical SPM and off-chip memory read-access energies (the off-chip value being 3.57 nJ, for memory sizes as in this illustrative example), the dynamic energy consumption caused by the accesses to the array would decrease from 8.75 mJ to 4.25 mJ, according to the power model [22].

10 Although the number of memory accesses is higher for A11 and A12, these lattices are significantly larger (4,830 A-elements each) and, hence, the average number of accesses per element is much lower.

VI. EXPERIMENTAL RESULTS

A software tool performing the mapping of the multidimensional arrays of a given algorithmic specification (expressed in a subset of the C language) into the data memory has been implemented in C++, incorporating the algorithms described in this paper. Table I summarizes part of the results of our experiments. Columns 2 and 3 display the numbers of array references and array elements in the specification code. Column 4 shows the data memory sizes (i.e., the total window sizes of the arrays11) after the signal-to-memory mapping, computed as explained in Section III. Column 5 displays the sum of the minimum array windows (optimized intra-array memory sharing), computed as explained in Section IV. Column 6 shows the minimum storage requirements of the application codes when the memory sharing between elements of different arrays is optimized as well [10]. For a better understanding of the meaning of the data shown in this table, the first tests are the illustrative examples from this paper. The rest of the benchmark tests are signal processing applications and typical algebraic kernels used in this domain.
The benchmarks used are the following: 1) a motion detection algorithm used in the transmission of real-time video signals on data networks [3]; 2) a real-time regularity detection algorithm used in robot vision; 3) a 2-D Gaussian blur filter from a medical image processing application which extracts contours from tomography images in order to detect brain tumors; 4) a singular value decomposition (SVD) updating algorithm [23] used in spatial division multiplex access (SDMA) modulation in mobile communication receivers, in beamforming, and in Kalman filtering; 5) an autocorrelation algorithm used in GSM; 6) the kernel of a voice coding application, an essential component of a mobile radio terminal.

11 If two entire arrays have disjoint lifetimes, their bounding windows can be mapped starting at the same base address. Then, the maximum of the two window sizes is added to the other window sizes. Such situations, or even more general ones, do not occur in the current tests; so, the sum of the window sizes is a reasonable measure of the total storage requirement after mapping.
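The accounting rule of footnote 11 can be made concrete with a small sketch: window sizes are summed, except that two entire arrays with disjoint lifetimes share a base address and contribute only the larger of their two windows. Array names and sizes below are hypothetical.

```python
def total_window_size(window_sizes, disjoint_pairs=()):
    """Storage accounting sketch (cf. footnote 11): bounding-window sizes
    are summed, except that two entire arrays with disjoint lifetimes can
    start at the same base address, contributing only the larger window.
    Handles non-overlapping pairs only; names and sizes are hypothetical."""
    sizes = dict(window_sizes)
    for a, b in disjoint_pairs:
        shared = max(sizes.pop(a), sizes.pop(b))
        sizes[a + "|" + b] = shared   # the two windows overlap in memory
    return sum(sizes.values())

print(total_window_size({"A": 64, "B": 100, "C": 24}))                # prints 188
print(total_window_size({"A": 64, "B": 100, "C": 24}, [("A", "B")]))  # prints 124
```

As the footnote notes, such lifetime-disjoint situations do not occur in the current tests, so the plain sum is the measure actually reported in column 4.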

TABLE I EXPERIMENTAL RESULTS. COLUMN 4 DISPLAYS THE SIZE OF THE DATA MEMORY AFTER MAPPING (THE TOTAL SIZE OF THE BOUNDING WINDOWS OF THE INDEXED SIGNALS). COLUMNS 5 AND 6 DISPLAY THE TOTAL SIZE OF THE MINIMUM ARRAY WINDOWS (AS EXPLAINED IN SECTION IV) AND, RESPECTIVELY, THE MINIMUM SIZE OF THE DATA MEMORY [10]

Different from all the other signal-to-memory mapping techniques, this computation methodology allows determining the additional storage implied by the mapping model, by computing the minimum window (storage requirement) of each individual array and the minimum data storage for the execution of the whole code (the last two columns in the result table), thus providing useful information for the initial design phase of system-level exploration. The importance of the data in the last two columns is mainly theoretical, serving for the evaluation of the allocation results (column 4). A minimum data storage can indeed be obtained, but there is a price to be paid, since the typical consequence is a very complex control and address generation hardware. As already explained in Section I-B, by means of the signal-to-memory mapping model, an excess of storage in memory allocation is traded off against less complex address generation hardware (practically all the mapping models use modulo computations). This tool has the unique feature of yielding a measure of this compromise: the amount of extra storage.12 Table I shows that there are applications, like the motion detection, where this mapping model gives very good solutions, the bounding windows being equal to the optimal array windows and the amount of memory after mapping being almost equal to the minimum storage. On the other hand, there are also applications where the bounding window sizes turn out to be significantly larger than the optimal size of the data memory (e.g., about two times larger for the SVD updating).
In such a situation, the result only indicates a more difficult case which needs further study and improvement. For instance, performing code transformations (like loop merging) on the regularity detection code, we succeeded in decreasing the memory size after mapping from 4,353 to only 657 locations (see Table I). Functionally equivalent codes can yield very different data storage lower bounds [10] and, also, different mapping solutions. Our tool can help the designer explore (from the storage point of view) different equivalent implementations of the application. Another possibility of improving the mapping results is to try other models as well, when several tools are available (since there is no perfect mapping anyway!). Applying De Greef's mapping model [5] to the SVD updating application, a reduction of the memory size after mapping of about 28% was obtained, in spite of a significant increase of the computation time. There are situations when the linearization-based model [5] yields bounding windows identical to those of this model (e.g., motion and regularity detection). This is caused by the linear or rectangular lattices in the decomposition of the array space (like, for instance, in Example 2 from Fig. 2), whose windows are identical in both models. This approach yields better results than [5] in situations where the size of the arrays can be reduced to windows in any dimension (because any canonical linearization contains a number of unused elements) and, also, in situations where there are signals with holes in their array space. On the other hand, there are cases, like the SVD updating, where the linearization model can yield better results (see columns 2 and 3 in Table II).

12 De Greef et al. display the minimum storage for those benchmarks having a smaller number of array elements; they obtained those data by symbolic simulation, but this method proved infeasible for larger examples [5].
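The notion of a mapping conflict that drives these window computations can be illustrated independently of either tool: for a set of simultaneously alive elements of a 1-D signal, the window side is the smallest modulus under which no two live indices collide. A minimal sketch, with hypothetical live-index sets:

```python
def min_window_side(live_indices):
    """Smallest modulus w such that index -> index % w is injective on a
    set of simultaneously alive indices of a 1-D signal, i.e., no two
    live elements collide in the bounding window."""
    distinct = set(live_indices)
    w = len(distinct)   # w can never be smaller than the number of live elements
    while len({i % w for i in distinct}) < len(distinct):
        w += 1          # a collision is a mapping conflict: grow the window
    return w

# Hypothetical snapshots of live elements:
print(min_window_side([0, 1, 2, 3]))   # prints 4 (contiguous block)
print(min_window_side([0, 3, 5, 9]))   # prints 7 (w = 4, 5, 6 all cause collisions)
```

The brute-force search above mirrors the incremental adjustments of [1]; the lattice-projection approach of this paper obtains the window sides without enumerating the live elements.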
Again, the positive aspect is that this tool is able to measure the quality of the mapping, helping the designer take a decision, whereas the other works in the field do not provide similar information. Although our approach employs a mapping strategy similar to [1], the computation methodology is entirely different. Tronçon et al. first perform a liveness analysis. During this phase, a set of program points is decided (for instance, at the beginning and at the end of each loop body). Every program point is annotated with a set of Z-polyhedra whose integer solutions specify sets of live array elements. In this way, all the live array elements at the chosen program points can be determined. Afterwards, an evaluation of the lower bounds of the window sides is performed.

TABLE II EXPERIMENTAL RESULTS. COLUMNS 2 AND 3 SHOW THE MEMORY SIZES APPLYING DIFFERENT MAPPING ALGORITHMS. COLUMNS 4 AND 5 DISPLAY THE CPU TIMES OBTAINED RUNNING IMPLEMENTATIONS OF DE GREEF'S [5] AND TRONÇON'S [1] ALGORITHMS. COLUMN 6 SHOWS THE CPU TIMES OBTAINED RUNNING OUR TOOL. THE TESTS HAVE BEEN CARRIED OUT ON A PC WITH A 1.85-GHz ATHLON XP PROCESSOR

Starting from initial lower bounds, all the integer solutions of the computed Z-polyhedra differing in one coordinate are tested for equality of their mapped addresses. An equality implies that a potential mapping conflict was found and, consequently, the corresponding window side must be incremented. Subsequent upward adjustments of the window sides are performed for all the chosen program points till no mapping conflict occurs. In contrast, our lattice-oriented methodology requires significantly fewer program points (only between loop nests). In addition, our computation of the window sides is mostly based on lattice projections rather than on incremental adjustments and enumerations of the integer points in Z-polyhedra. These are the reasons why our computation methodology is significantly faster and has a better scalability in terms of the number of array elements. The algorithm of [5] is even slower than [1], since the number of canonical linearizations to be analyzed increases quickly with the dimension of the signal. The experimental results in Table II (the last three columns) confirm our expectation, based on theoretical insight, that our mapping algorithm is regularly several times faster.

VII. CONCLUSION

This paper has addressed the problem of memory allocation for (embedded) multidimensional signal processing systems. It proposes a signal-to-memory mapping algorithm based on the computation of bounding windows for the simultaneously alive array elements. Based on a methodology operating with lattices, the technique can also be applied in multilayer memory hierarchies.
The software tool implementing this algorithm can additionally compute the minimum window for each multidimensional array in the specification, as well as the minimum storage requirements, providing an accurate evaluation of the efficiency of the mapping model. This is a unique feature which is not encountered in any other work addressing similar topics.

ACKNOWLEDGMENT

The authors would like to thank their colleagues from IMEC and K.U. Leuven (Prof. F. Catthoor, Prof. M. Bruynooghe, Dr. E. De Greef, Dr. Q. Fu, and Dr. K. C. Shashidhar) for helpful discussions and for sharing with them some of the benchmark tests used in this paper.

REFERENCES

[1] R. Tronçon, M. Bruynooghe, G. Janssens, and F. Catthoor, Storage size reduction by in-place mapping of arrays, in Verification, Model Checking and Abstract Interpretation, ser. Lecture Notes Comput. Sci., A. Cortesi, Ed. New York: Springer, 2002, pp [2] A. Schrijver, Theory of Linear and Integer Programming. New York: Wiley, [3] F. Catthoor, S. Wuytack, E. De Greef, F. Balasa, L. Nachtergaele, and A. Vandecappelle, Custom Memory Management Methodology: Exploration of Memory Organization for Embedded Multimedia System Design. Boston, MA: Kluwer, [4] J. Ramanujam, J. Hong, M. Kandemir, and A. Narayan, Reducing memory requirements of nested loops for embedded systems, in Proc. 38th ACM/IEEE Design Automation Conf., Jun. 2001, pp [5] E. De Greef, F. Catthoor, and H. De Man, A. Krikelis, Ed., Memory size reduction through storage order optimization for embedded parallel multimedia applications, Special Issue on Parallel Processing and Multimedia, Parallel Comput., vol. 23, no. 12, pp , Dec [6] V. Lefebvre and P. Feautrier, Automatic storage management for parallel programs, Parallel Comput., vol. 24, pp , [7] F. Quilleré and S. Rajopadhye, Optimizing memory usage in the polyhedral model, ACM Trans. Programm. Lang. Syst., vol. 22, no. 5, pp , [8] A. Darte, R. Schreiber, and G. Villard, Lattice-based memory allocation, IEEE Trans.
Comput., vol. 54, no. 10, pp , Oct [9] S. Meftali, F. Garsalli, F. Rousseau, and A. A. Jerraya, An optimal memory allocation for application-specific multiprocessor system-onchip, in Proc. Int. Symp. System Synthesis, Montreal, QC, Canada, Oct. 2001, pp [10] F. Balasa, H. Zhu, and I. I. Luican, Computation of storage requirements for multidimensional signal processing applications, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 15, no. 4, pp , Apr [11] W. Pugh, A practical algorithm for exact array dependence analysis, Commun. ACM, vol. 35, no. 8, pp , Aug [12] S. Verdoolaege, K. Beyls, M. Bruynooghe, and F. Catthoor, Experiences with enumeration of integer projections of parametric polytopes, in Proc. Compiler Construction: 14th Int. Conf., R. Bodik, Ed., 2005, vol. 3443, pp , Springer. [13] G. B. Dantzig and B. C. Eaves, Fourier Motzkin elimination and its dual, J. Combin. Theory A, vol. 14, pp , 1973.

[14] A. I. Barvinok and J. Pommersheim, An algorithmic theory of lattice points in polyhedra, in New Perspectives in Algebraic Combinatorics. Cambridge, U.K.: Cambridge Univ. Press, 1999, pp [15] P. Clauss, Counting solutions to linear and nonlinear constraints through Ehrhart polynomials: Applications to analyze and transform scientific programs, in Proc. ACM Int. Conf. Supercomputing, 1996, pp [16] J. A. De Loera, R. Hemmecke, J. Tauzer, and R. Yoshida, Effective lattice point counting in rational convex polytopes, J. Symbol. Comput., vol. 38, no. 4, pp , [17] D. Avis, LRS: A revised implementation of the reverse search vertex enumeration algorithm, in Polytopes Combinatorics and Computation, G. Kalai and G. Ziegler, Eds. Berlin, Germany: Birkhäuser-Verlag, 2000, pp [18] S. Wuytack, J.-P. Diguet, F. Catthoor, and H. De Man, Formalized methodology for data reuse exploration for low-power hierarchical memory mappings, IEEE Trans. Very Large Scale Integr. (VLSI) Syst., vol. 6, no. 4, pp , Dec [19] M. Kandemir and A. Choudhary, Compiler-directed scratch-pad memory hierarchy design and management, in Proc. 39th ACM/IEEE Design Automation Conf., Las Vegas, NV, Jun. 2002, pp [20] E. Brockmeyer, M. Miranda, H. Corporaal, and F. Catthoor, Layer assignment techniques for low energy in multilayered memory organisations, in Proc. 6th ACM/IEEE Design and Test in Europe Conf., Munich, Germany, Mar. 2003, pp [21] Q. Hu, A. Vandecapelle, M. Palkovic, P. G. Kjeldsberg, E. Brockmeyer, and F. Catthoor, Hierarchical memory size estimation for loop fusion and loop shifting in data-dominated applications, in Proc. Asia-South Pacific Design Automation Conf., Yokohama, Japan, Jan. 2006, pp [22] G. Reinman and N. P. Jouppi, CACTI: An Integrated Cache Timing and Power Model, COMPAQ Western Research Lab, [23] M. Moonen, P. Van Dooren, and J. Vandewalle, An SVD updating algorithm for subspace tracking, SIAM J.
Matrix Anal. Appl., vol. 13, no. 4, pp , Florin Balasa received the Ph.D. degree in computer science from the Polytechnical University of Bucharest, Bucharest, Romania, in 1994, and the Ph.D. degree in electrical engineering from the Katholieke Universiteit Leuven, Leuven, Belgium, in He is currently an Associate Professor of Computer Science at Southern Utah University, Cedar City. His research interests are mainly focused on memory management algorithms for digital signal processing, high-level synthesis and physical design automation, optimization techniques, and applications in CAD VLSI. Dr. Balasa was a recipient of the National Science Foundation CAREER Award. Hongwei Zhu received the B.S. degree in electrical engineering from Xi an Jiaotong University, Xi an, China, in 1996, the M.S. degree in electrical and electronic engineering from Nanyang Technological University, Singapore, in 2001, and the Ph.D. degree in computer science from the University of Illinois at Chicago, in He is currently a software engineer in the Design Automation Group of the Physical IP Department at ARM, Inc., San Jose, CA. His research interests are mainly focused on memory management for real-time multidimensional signal processing and combinatorial optimization in CAD VLSI. Ilie I. Luican received the B.S. and M.S. degrees in computer science from the Polytechnical University of Bucharest (PUB), Bucharest, Romania, in 2002 and 2003, respectively. He is currently pursuing the Ph.D. degree in the Department of Computer Science, University of Illinois at Chicago. His research interests are mainly focused on high-level synthesis and memory management for real-time multidimensional signal processing.


More information