Memory Synthesis for Low Power

Wen-Tsong Shiue
Silicon Metrics Corporation, Research Blvd. Suite 300, Austin, TX 78759, USA


Abstract - In this paper, we show how proper data placement in off-chip memory and loop tiling can be used to enhance cache performance, both in the number of cycles and in the energy consumption. While the reduction in the number of cycles is spectacular (~5X), the reduction in the energy consumption is modest (~2X). The data placement procedure consists of assigning specific cache lines to each array in the application program and then placing the arrays in the off-chip memory such that the amount of unused space is as small as possible. The procedure for assigning specific cache lines is based on an analysis of access patterns and ensures a significant reduction in the number of conflict misses.

1. INTRODUCTION

In the 80s, the microelectronics industry was primarily concerned with performance, area, cost and reliability, with power consumption being of secondary importance. Starting from the early 90s, lowering the power consumption became as important as increasing the throughput and reducing the area. This was because of the remarkable success of portable consumer electronics (camcorders, compact disk players), personal computing devices (notebooks, laptops, and palmtops) and wireless communication systems (cordless phones, cellular phones) that required high-speed computation with low power consumption.

In the design of low power systems, the main focus has been on reducing the power consumption in the data path components. However, in systems that involve multidimensional streams of signals, such as images or video sequences, the majority of the area and power cost is due not to the data path or controllers but to the global communication and memory interactions [Catthoor et al. 1998]. This is due to the fact that the power consumed in memory transfers is significantly larger than the power consumed in data path operations. It implies that, with proper design, the reduction in the memory-related power budget can far exceed the reduction due to voltage scaling and other power-saving transformations.

In this paper, we focus on procedures that help reduce the memory-related energy consumption. Specifically, we study the effect of off-chip data placement and tiling on cache performance and show that while data placement and tiling result in a significant reduction in the number of cycles, the reduction in energy is modest. The main contributions of this paper are:
- We show how to do data placement, which involves assigning specific cache lines to each of the arrays and placing the arrays in the off-chip memory so that the number of conflict misses is reduced and the amount of unused space is as small as possible.
- We show how to find the minimum cache size for large programs. Cache sizes smaller than the minimum cache size have very degraded performance, both in the number of cycles and in energy, and should not be considered.

The work presented in this paper is an extension of our earlier work in [Shiue and Chakrabarti 1999a; 1999b; 1999c]. Pioneering work in the area of memory management for low power applications has been done at IMEC [Catthoor et al. 1994; 1998]. Their procedure is comprehensive and consists of global transformations to increase the locality and regularity of data accesses, a systematic method for data reuse, and memory allocation and assignment that meets the timing constraints with as cheap a memory architecture as possible (both in terms of power and area).
Data memory exploration for embedded systems has been extensively studied by Panda, Dutt and Nicolau [Panda et al. 1997a; 1997b; 1999]; the performance metrics of their system are data cache size and number of processor cycles. A novel approach for designing memory systems based on binding array groups to memory components with different dimensions, access times and numbers of ports has been presented in [Schmit and Thomas 1997]. A memory system design for video processors and a methodology for the analysis of on-chip memory architectures have been developed in [Dutta et al. 1995]; designs are evaluated based on tradeoffs between area, cycle time and utilization. For low power applications, it is not sufficient to consider only area and cycle time. Energy has to be included in the performance metrics, since the variation in energy across different memory configurations is quite different from the variation in the number of cycles. In [Shiue and Chakrabarti 1999a; 1999b], we show how three performance metrics, namely area, number of cycles and energy, are required to efficiently explore the memory design space for low power applications.

The rest of the paper is organized as follows. Section 2 briefly describes the time and energy models used in our procedure and demonstrates the differences in the variation of the number of cycles and the energy for different types of programs. Section 3 describes techniques to improve cache performance, namely data placement in off-chip memory. Section 4 concludes the paper.

2. BACKGROUND

In this section, we present a brief review of the time and energy models used in our procedure as well as the differences between the timing and energy characteristics. For more details, please refer to [Shiue and Chakrabarti 1999c]. Our system has three performance metrics: cache size, number of processor cycles and energy consumption.

Cache size: Given the area constraint, we find the largest possible cache size Cmax that satisfies this constraint. The memory exploration procedure searches for the best cache configuration among cache sizes < Cmax.

Number of processor cycles: The number of processor cycles is a function of the miss rate. We adopt the model used in [Hennessy and Patterson 1996] and assume that the number of cycles per hit is 1, 1.1, 1.12, and 1.14 for 1, 2, 4, and 8-way set associative caches, respectively. We also assume that the number of cycles per miss is 40, 40, 42, 44, 48, 56, and 72 for

line sizes of 4, 8, 16, 32, 64, 128, and 256, respectively. The number of processor cycles is calculated as follows:

Number of processor cycles = hit_rate * trip_count * (number of cycles per hit) + miss_rate * trip_count * (number of cycles per miss)

Energy: Our energy model is derived by combining the energy models in [Kamble and Ghose 1997] and [Su and Despain 1995], resulting in a model that gives fast estimation as in [Su and Despain 1995] and yet matches the energy values in [Kamble and Ghose 1997] quite closely. Furthermore, we consider not only the energy consumed in the host processor but also the energy consumed in the off-chip memory. This is very important in order to accurately calculate the energy consumed during a miss.

Energy = Edec + Ecell + Eio + Emain, where
- Edec = α * Add_bus_bs * (C/L)
- Ecell = β * Word_line_size * (Bit_line_size + 4.8) * (Nhit + Nmiss)
- Eio = γ * (Data_pad_bs * 8L + Add_pad_bs)
- Emain = γ * Data_pad_bs * 8L + Em * 8L * Nmiss

Add_bus_bs = Pr * (Nhit + Nmiss) * Wadd
Add_pad_bs = Pr * Nmiss * Wadd
Data_pad_bs = Pr * Nmiss
Word_line_size = m * (8L + T + St)
Bit_line_size = C / (m * L)
α = 7.89e-17; β = 1.44e-14; γ = 5.45e-11; Em = 4.95e-9 Joule

C = cache size
L = cache line size
m = degree of set associativity (m-way)
T = tag size in bits
St = number of status bits per block frame
Pr = probability of a 0-to-1 transition
Nhit = number of hits
Nmiss = number of misses
Wadd = width of the address bus
Add_bus_bs = number of bit switches on the address bus
Add_pad_bs = number of bit switches on the address pads
Data_pad_bs = number of bit switches on the data pads
Word_line_size = number of memory cells in a word line
Bit_line_size = number of memory cells in a bit line
Edec = energy consumed in the address bus
Ecell = energy consumed by the precharged cache word/bit lines
Eio = energy consumed in the I/O pads of the host processor
Emain = energy consumed in the off-chip memory and in accessing the off-chip memory

(A code sketch of both models is given at the end of this subsection.)

2.1 Energy-Time Tradeoffs

In this section, we briefly review the differences in the timing and energy characteristics of different cache configurations with the help of an example; it is these differences that justify the need to include energy in the performance metrics. Figures 1(a) and 1(b) plot the variation in the number of cycles and in the energy consumption for the example Compress. (Compress is a 2-dimensional filter used in image processing applications.) From these plots we see that:
- The minimum-time cache configuration corresponds to the largest possible cache, while the minimum energy consumption corresponds to the smallest possible cache.
- The number of cycles and the energy consumption increase significantly if the cache size is smaller than a particular value. We refer to the point on the curve at which this transition occurs as the inflection point. The inflection point corresponds to the minimum cache size; in the next section we describe a procedure to calculate it.
- The energy increases significantly with increase in line size for the same cache size. This is because retrieving a large block of data from the off-chip main memory to the on-chip cache consumes more energy than retrieving a small block of data.
- For a specific line size, the energy increases with increase in cache size. This is because a larger cache consumes more energy, and the increase in energy exceeds the reduction in the miss rate (due to the larger cache size).
- For a specific line size, the number of cycles decreases only mildly with increase in cache size, in tune with the decrease in the miss rate.
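To make the two models concrete, here is a minimal Python sketch. The constants and formulas are taken directly from the definitions above; the function names, the dictionary lookups for the per-hit and per-miss cycle counts, and deriving miss_rate as 1 - hit_rate are our own packaging.

```python
def num_cycles(hit_rate, trip_count, assoc, line_size):
    """Cycles = hit_rate*trip_count*cycles_per_hit + miss_rate*trip_count*cycles_per_miss."""
    cycles_per_hit = {1: 1.0, 2: 1.1, 4: 1.12, 8: 1.14}[assoc]
    cycles_per_miss = {4: 40, 8: 40, 16: 42, 32: 44,
                       64: 48, 128: 56, 256: 72}[line_size]
    miss_rate = 1.0 - hit_rate
    return trip_count * (hit_rate * cycles_per_hit + miss_rate * cycles_per_miss)

def energy(C, L, m, T, St, Pr, Nhit, Nmiss, Wadd):
    """Energy = Edec + Ecell + Eio + Emain, in Joules."""
    alpha, beta, gamma, Em = 7.89e-17, 1.44e-14, 5.45e-11, 4.95e-9
    add_bus_bs = Pr * (Nhit + Nmiss) * Wadd      # bit switches on address bus
    add_pad_bs = Pr * Nmiss * Wadd               # bit switches on address pads
    data_pad_bs = Pr * Nmiss                     # bit switches on data pads
    word_line_size = m * (8 * L + T + St)        # cells per word line
    bit_line_size = C / (m * L)                  # cells per bit line
    Edec = alpha * add_bus_bs * (C / L)
    Ecell = beta * word_line_size * (bit_line_size + 4.8) * (Nhit + Nmiss)
    Eio = gamma * (data_pad_bs * 8 * L + add_pad_bs)
    Emain = gamma * data_pad_bs * 8 * L + Em * 8 * L * Nmiss
    return Edec + Ecell + Eio + Emain
```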
Figure 1. Example Compress: (a) cycles variation and (b) energy variation (in nJ) for different cache sizes (C8 through C_8K) and line sizes. The number of lines is >= 2.

3. ENHANCING CACHE PERFORMANCE

There are several techniques to enhance cache performance. Notable among them are placement of data in main memory so that the number of conflict misses is reduced, tiling to improve data locality, set associativity to improve the hit rate, block buffering to save power by optimizing the capacitance of each cache access, and subbanking to save power by eliminating unnecessary accesses [Kamble and Ghose 1997; Su and Despain 1995]. In this paper, we focus on data placement and tiling.

3.1 Data Placement

Figure 2. Example Compress: cycles and energy reduction due to data placement (optimized vs. unoptimized, for the configurations C32L4, C64L8 and C128L16).

Data placement can be used to significantly reduce the miss rate. This translates into a reduction in both the number of cycles and the energy consumption, as illustrated in Figure 2 for the example Compress. Note that the reduction in the number of cycles is much larger than the reduction in the energy consumption. This is because the energy of a miss >> the energy of a hit, and so a drastic reduction (5X) in the miss rate results in only a modest reduction (2X) in the energy. In the rest of this section, we describe a procedure for data placement in the off-chip memory that results in very few conflict misses. Our technique involves (i) finding the minimum cache size that is required to minimize the number of conflict misses, (ii) assigning specific cache lines to each of the arrays, and (iii) determining the placement of the arrays in the off-chip memory.

Finding the minimum cache size: Computing the minimum cache size is very important since, if the cache size is smaller than the minimum cache size, the miss rate increases significantly, thereby degrading the memory performance (cycles and energy). Our procedure consists of finding the minimum number of cache lines from the array access patterns. Let n be the depth of a loop nest and d the number of dimensions of an array A. Two references, A[f(i)] and A[g(i)], where f and g are indexing functions Z^n -> Z^d, are called uniformly generated if f(i) = Hi + c_f and g(i) = Hi + c_g, where H is a linear transformation and c_f and c_g are constant vectors [Wolf and Lam 1991]. We partition the references in a loop nest into equivalence classes of references that have the same H and operate on the same array, as described in [Wolf and Lam 1991]. For each class, we find the minimum number of cache lines. We repeat the procedure for each array in the kernel program and sum the numbers of cache lines. The minimum cache size is the line size times the sum of the minimum numbers of cache lines. The procedure for finding the minimum cache size for a single kernel program is given below; a code sketch follows Example 1.

Algorithm_min_cache_size_kernel_program
1. Find the distance for each class:
   Distance = floor(abs(difference of constant vectors)/stride of loop) + 1
2. For each class:
   if cache line size < Tripcount_inner_loop
     N = Distance mod (cache line size)
     if N == 0 or 1
       # cache lines = floor(Distance/cache line size) + 1
     else
       # cache lines = floor(Distance/cache line size) + 2
     end
   else
     # cache lines = 1
   end
3. Repeat Steps 1 and 2 for each array.
4. Minimum cache lines = Σ # cache lines
5. Minimum cache size = Minimum cache lines * cache line size

We illustrate this procedure with the help of the Compress example.

Example 1. Compress
int a[32,32]
for i=1,31
  for j=1,31
    a[i,j] = a[i,j] - a[i-1,j] - a[i,j-1] - 2*a[i-1,j-1];

Equivalent class | References           | Distance                | # cache lines if line size = 2
Class 1          | a[i-1,j-1], a[i-1,j] | floor(abs(1/1))+1 = 2   | floor(2/2)+1 = 2
Class 2          | a[i,j-1], a[i,j]     | floor(abs(1/1))+1 = 2   | floor(2/2)+1 = 2

In Example 1, there are two equivalence classes, class 1: a[i-1,j-1], a[i-1,j] and class 2: a[i,j-1], a[i,j]. The total number of cache lines for L=2 is 4 (two cache lines for the references in class 1 and two for the references in class 2). The minimum cache size is 4*L, where L is the line size. Thus, if the cache size is smaller than 4L, the miss rate increases significantly. If the cache size is larger than 4L, then the cache line size can be increased to exploit the spatial locality, or the number of cache lines can be increased in proportion to the number of classes.
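The per-class computation in Algorithm_min_cache_size_kernel_program is easy to mechanize. Below is a small Python sketch (the function names and argument conventions are ours) that reproduces the numbers of Example 1.

```python
def distance(c_f, c_g, stride):
    """Step 1: distance between two uniformly generated references whose
    constant vectors differ by c_f - c_g in the inner dimension."""
    return abs(c_f - c_g) // stride + 1

def lines_per_class(dist, line_size, tripcount_inner):
    """Step 2: number of cache lines needed by one equivalence class."""
    if line_size < tripcount_inner:
        n = dist % line_size
        return dist // line_size + (1 if n in (0, 1) else 2)
    return 1

# Example 1 (Compress): both classes have distance floor(abs(1)/1)+1 = 2,
# the inner loop trip count is 31, and the line size is L = 2.
L = 2
d = distance(1, 0, 1)                    # -> 2
per_class = lines_per_class(d, L, 31)    # -> 2 lines per class
print(2 * per_class, 2 * per_class * L)  # -> 4 lines, minimum cache size 8
```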
Finding the minimum number of cache lines per array: A large program consists of several kernel programs, each of which consists of arrays with different access patterns. This makes finding the minimum cache size (MCS) for large programs considerably more involved. The procedure for finding the minimum number of cache lines for each array consists of (i) finding the minimum number of zones by looking at the access pattern of the outer loop index, and (ii) finding the minimum number of cache lines per zone by looking at the access pattern of the inner loop index.

In order to calculate the minimum number of zones, we calculate a difference set for each kernel program, where the elements of the difference set are obtained by computing the differences between the outer loop index values. Next, we compute the union of the difference sets generated by the kernel programs. For instance, if in kernel program 1 rows i, i+2, i+4 of array a get accessed, then the difference set for program 1 is {2,4}. Now if in kernel program 2 rows i, i+4, i+5 of array a get accessed, then the difference set for program 2 is {1,4,5}. For the whole program that includes kernel 1 and kernel 2, the difference set for array a is {1,2,4,5}. The number of zones is the smallest integer that divides no number in the difference set: with z zones, rows r and r+d map to the same zone exactly when d is a multiple of z. In the above example, the minimum number of zones is 3. Thus, if the rows of array a get mapped to three zones as shown in Figure 3, there will not be any conflict. There would also not be any conflict if the number of zones were greater than or equal to 6, because 6 is larger than any value in the difference set. If, however, the number of zones is chosen to be 4 or 5, there would be a conflict, as shown in Figure 3. (A code sketch of this zone calculation is given after Figure 3.)

Figure 3. Example illustrating how rows i through i+5, accessed by kernel 1 {row i, row i+2, row i+4} and kernel 2 {row i, row i+4, row i+5}, map to zones 1-3 without conflict, and how there would be a conflict if the number of zones is 4 instead of 3.
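The zone count follows directly from the definition above. A sketch in Python, assuming the union of the difference sets has already been collected:

```python
def min_zones(diff_set):
    """Smallest zone count z that divides no element of diff_set
    (rows r and r+d map to the same zone exactly when d % z == 0)."""
    z = 1
    while any(d % z == 0 for d in diff_set):
        z += 1
    return z

# Union of the difference sets of kernels 1 and 2 for array a:
print(min_zones({1, 2, 4, 5}))  # -> 3; any z >= 6 would also be conflict-free
```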

Next, the number of cache lines per zone has to be calculated. We illustrate the procedure with the help of Example 2, where i is the outer loop index and j is the inner loop index. The difference set of the outer loop index for array a is {1} in K1 and {3} in K2. Thus we require a minimum of 2 zones for array a. Next, we calculate the number of lines per zone by looking at the access pattern of the inner loop. For instance, for array a, we calculate the minimum number of cache lines required for zone 1 by taking the maximum of the number of cache lines required in class 1 and class 5. Thus zone 1 requires two cache lines if L=2. Similarly, zone 2 requires two cache lines if L=2. Note that if zones 1 and 2 required different numbers of cache lines, we would assign the larger number of cache lines to each of the zones.

Example 2: Kernel programs K1 and K2

K1 references:
  Class 1: a[i,j], a[i,j+1]
  Class 2: a[i+1,j], a[i+1,j+1]
  Class 3: b[i,j], b[i,j+1]
  Class 4: b[i+1,j], b[i+1,j+2]
K2 references:
  Class 5: a[i,j], a[i,j+2]
  Class 6: a[i+3,j], a[i+3,j+1]
  Class 7: b[i,j], b[i,j+1]
  Class 8: b[i+2,j], b[i+2,j+2]

# lines for each class (assume line size = 2): 2 lines for each of classes 1-8.
# lines for each zone: 2 lines each for rows i, i+1 and i+3 of a; 2 lines each for rows i, i+1 and i+2 of b.

We use a similar analysis to find the minimum number of cache lines for array b. Since the difference set of the outer loop index for array b is {1} in K1 and {2} in K2, we require a minimum of 3 zones for b. Since the minimum number of lines per zone for array b is 2, the minimum number of lines for array b is 6. Since arrays a and b have overlapping lifetimes, the minimum cache size is (4+6)*L = 10*2 = 20 bytes for L=2. (This arithmetic is sketched in code below.) The exact assignment of the different rows of arrays a and b is as follows. Note that the 9 rows of array a (9x13) are distributed over zones 1 and 2, while the 8 rows of array b (8x13) are distributed over zones 3, 4 and 5. (Here A0 stands for row 0 of array a, A1 for row 1 of array a, etc.)

Zone 1: A0 A2 A4 A6 A8
Zone 2: A1 A3 A5 A7
Zone 3: B0 B3 B6
Zone 4: B1 B4 B7
Zone 5: B2 B5

Note that the number of lines per zone is a function of the line size. Thus MCS(L) = Σi Lines(i,L), where i is the zone number and Lines(i,L) is the number of lines for zone i if L is the line size. For most programs, as the line size increases, the number of lines per zone decreases.
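The Example 2 arithmetic can be checked mechanically. A sketch, restating min_zones() from the previous sketch and assuming the per-zone line counts (2 lines per zone at L = 2) have already been derived from the classes:

```python
def min_zones(diff_set):
    z = 1
    while any(d % z == 0 for d in diff_set):
        z += 1
    return z

L = 2
diff_sets = {"a": {1, 3}, "b": {1, 2}}  # outer-index differences, K1 union K2
lines_per_zone = {"a": 2, "b": 2}       # max over the classes in each zone

# Arrays a and b have overlapping lifetimes, so their requirements add.
total_lines = sum(min_zones(ds) * lines_per_zone[arr]  # a: 2*2, b: 3*2
                  for arr, ds in diff_sets.items())
print(total_lines, total_lines * L)     # -> 10 lines, MCS = 20 bytes
```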
Finding the minimum number of cache lines for the whole program: Given the minimum number of cache lines per array, the next step is to find the minimum number of cache lines for the whole program. First, we create an array conflict graph where the nodes correspond to arrays and an edge exists between two nodes if the two corresponding arrays have overlapping lifetimes. Each node has a weight associated with it, where the weight is the MCL (minimum number of cache lines) of the corresponding array. The minimum number of cache lines is thus larger than or equal to the cost of the maximal cost clique of this graph. For instance, consider an application program with three kernel programs, K1: [A (3 lines), C (1 line), X (3 lines)], K2: [A (4 lines), B (1 line), C (1 line), D (2 lines)] and K3: [B (1 line), Y (6 lines)]. The array conflict graph of this example is given in Figure 4. ABCD and ACX both form a maximal cost clique with cost 8. Thus the minimum number of cache lines for this application program is 8.

Since there are some arrays that are not part of the maximal cost clique, in the next step we identify whether these arrays can share cache lines with the ones in the maximal cost clique. Let the nodes in the maximal cost clique be assigned to the major set (MS) and the other nodes to the remaining set (RS). Our aim here is to match the nodes in RS with those in MS so that they can share cache lines.

Figure 4. Example illustrating the calculation of the minimum cache size. Node weights: A=4, B=1, C=1, D=2, X=3, Y=6; MS = {A, B, C, D}, RS = {X, Y}.

We first create the dual of the conflict graph. Thus there are no edges between the nodes in MS, and edges between nodes in RS imply that the corresponding arrays can share cache lines. Our algorithm is greedy and works on one node of RS at a time. We choose the node v with the largest cost, where the cost is the number of cache lines required by the corresponding array. If node v has edges to several nodes in MS, we choose a subset of those incident nodes such that the cost of v is equal to the cost of the nodes in the subset. While choosing the subset of incident nodes, we give higher priority to nodes with lower degree. If the cost of v is larger than the cost of the nodes in the subset, the cache size has to be increased by an amount equal to the difference of the two costs. At the end of this step, node v and its edges to nodes in MS are deleted, and the costs of the incident nodes are updated. Nodes with cost = 0 in MS are also deleted. The cost of an incident node is not updated if there exists another node in RS which can share cache lines with both v and the incident node. The algorithm is shown below.

Algorithm_min_cache_size_actual
1. Assign the nodes in the maximal cost clique to MS and the other nodes to RS.
2. Create the dual of the conflict graph G.
3. Remove-and-update scheme:
   - Choose the node v with the largest cost in RS.
   - For node v {
     a. Find the incident nodes of v in MS.
     b. Choose a subset of these incident nodes such that COST(v) is equal to the cost of these nodes.
        (i) Higher priority is given to incident nodes of lower degree.
        (ii) If COST(incident nodes) < COST(v), add x extra lines, where x = COST(v) - COST(incident nodes).
     c. Delete node v and update the costs of the incident nodes.
     }

We illustrate our procedure with the help of an example (see Figure 4). The maximal cost clique consists of nodes A, B, C, D and has a cost of 8. Thus nodes A, B, C, D are assigned to MS and nodes X, Y are assigned to RS. In the dual graph, note that there is an edge between X and Y, implying that X and Y can share cache lines. In the greedy procedure, we first choose node Y in RS since it has the largest cost. Node Y has three incident nodes, A, C and D. Of these, we choose A and C first since they have degree 1. Thus Y shares 4 lines with A and 1 line with C. Since Y has a cost of 6, Y also has to share 1 line with D. This implies that the cache lines for A, C and D have to be contiguous. Next, we update the nodes in MS and RS and their costs. The resulting configuration consists of B and D in MS and X in RS. Note that since X and Y can share cache lines, the cost of D is not updated. In this configuration, X can share 1 cache line with B and 2 cache lines with D. If X and Y could not share cache lines, an additional cache line would have to be added. (A code sketch of the conflict graph construction and the clique search is given below.)

Given that we know the exact assignment of cache lines to the different arrays for MCS(L), the minimum cache size corresponding to a specific line size L, the next step is to determine the assignment when the cache size is larger than MCS(L). A larger cache size can be due to a larger line size and/or a larger number of lines.

Case 1: The cache size increases but the line size L remains the same. In this case, the number of cache lines increases, and the number of lines assigned to each zone increases by the ratio cache_size/MCS(L).

Case 2: The line size increases but the number of lines remains the same. Since the line size increases to (say) L', the number of lines per zone may reduce. Thus MCS(L') may contain fewer lines than MCS(L). In this case, the number of lines assigned to each zone increases by the ratio cache_size/MCS(L').
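For small instances such as Figure 4, the conflict graph construction and the maximal cost clique search can be brute-forced. A Python sketch, with the K1/K2/K3 example encoded by hand; taking a node's weight as its maximum line requirement over all kernels is our reading of the MCL:

```python
from itertools import combinations

kernels = [
    {"A": 3, "C": 1, "X": 3},          # K1
    {"A": 4, "B": 1, "C": 1, "D": 2},  # K2
    {"B": 1, "Y": 6},                  # K3
]

weight, edges = {}, set()
for k in kernels:
    for arr, lines in k.items():       # node weight = MCL of the array
        weight[arr] = max(weight.get(arr, 0), lines)
    for a, b in combinations(sorted(k), 2):
        edges.add((a, b))              # overlapping lifetimes -> edge

def is_clique(nodes):
    return all(p in edges for p in combinations(sorted(nodes), 2))

# Exhaustive search is fine for a handful of arrays.
best = max((c for r in range(1, len(weight) + 1)
            for c in combinations(sorted(weight), r) if is_clique(c)),
           key=lambda c: sum(weight[a] for a in c))
print(best, sum(weight[a] for a in best))  # -> ACX (ABCD ties), cost 8
```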
Placing arrays in the off-chip memory for large programs: Once the placement of the arrays in the cache is known, the placement of the arrays in the main memory is relatively straightforward. Array padding is done between the rows of an array (for a 2D array) as well as between arrays. Array padding between rows is done such that references belonging to different classes do not get mapped to the same cache line. Our placement procedure is a greedy algorithm that places the arrays in the main memory such that the amount of unused space is minimized. The greedy algorithm chooses the array that (i) results in the minimum number of unused locations and (ii) is larger than the other arrays assigned to the same cache line. It maps all the rows of the chosen array before operating on the next array. Since array padding is done between the rows of an array, the number of locations needed to store the array in the main memory will most likely be larger than the size of the array.

For ease of implementation, each cache line has a candidate array list associated with it. The candidate array list is prioritized, with larger arrays having higher priority. After the assignment of one array (start location B, end location B+padded_arraysize-1), we find the cache line α corresponding to the next available location E. We pick the array with the highest priority from the candidate array list corresponding to line α. If the candidate array list of α is NULL, we look at the list of line α+1, and so on. Once an array in the candidate list is identified, the next step is to compute B, the start location of this array in the main memory. Since α = floor((B mod C)/L), B should satisfy the equation B = C*n + α*L. B should also be the smallest such value that is larger than or equal to E. (A code sketch of this computation follows the algorithm below.) Once B is determined, all the rows of the array are assigned in increasing order (row 0 first, row 1 second, etc.) using a similar procedure. Recall that since every row of the array has been assigned a specific cache line, determining the padding between consecutive rows is very similar to determining the padding between arrays. At the end of this step, locations B through B+padded_arraysize-1 are blocked and the candidate list of line α is updated. The procedure is then repeated for the next array. The algorithm is listed below.

Algorithm_memory_assignment
1. The candidate array list associated with cache line i is list(i). Assign priorities to the arrays: the larger the array, the higher its priority.
2. Repeat step 3 until all arrays have been assigned.
3. Procedure:
   - Determine the cache line corresponding to location E: compute α = floor((E mod C)/L).
     repeat until list(α) != NULL { α = (α+1) mod (C/L) }
   - Choose the array with the largest size from list(α).
   - Find the smallest value of B = C*n + α*L that satisfies B >= E. Assign all the rows of the array (in increasing order) using a similar procedure. Block locations B through B+padded_arraysize-1 and update list(α).
   - E = B + padded_arraysize
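The start-address computation in step 3 is a small piece of modular arithmetic: locations that map to cache line α repeat every C bytes. A minimal sketch (the function name is ours):

```python
def next_start(E, alpha, C, L):
    """Smallest B >= E with B = C*n + alpha*L, i.e. the next location that
    maps to cache line alpha = floor((B mod C) / L)."""
    B = (E // C) * C + alpha * L  # candidate in the C-byte period holding E
    return B if B >= E else B + C

# Figure 5 setting: C = 20, L = 2. If the next free location is E = 100 and
# the chosen row is assigned to cache line 3, it is placed at:
print(next_start(100, 3, 20, 2))  # -> 106 (106 mod 20 = 6; 6 // 2 = line 3)
```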

The memory assignment procedure is explained with the help of Figure 5. The cache line assignment of the three arrays is:

Zone 1: A0 A2 A4 A6, C0 C2
Zone 2: A1 A3 A5 A7, C1 C3
Zone 3: B0 B3 B6
Zone 4: B1 B4 B7
Zone 5: B2 B5

Figure 5. Example illustrating the placement of arrays in the main memory. Here C=20, L=2; array a is of size 8x16, array b is of size 8x4 and array c is of size 4x8.

If the first free memory location is 100, then array a (which is larger than array c) is chosen first. The eight rows of array a are mapped before b is assigned. Note that the padding between rows 0 and 1 of a is different from that between rows 1 and 2. As a result of padding, the number of unused locations is 32 for the assignment of array a, 16 for the assignment of array b, and 39 for the assignment of array c. The total number of unused locations is 103 out of 296. If array b occupied zones 1, 2 and 3, and arrays a and c occupied zones 4 and 5, the total number of unused locations would be 92 out of 284. Note that while confining the rows of an array to a few zones reduces the number of conflict misses significantly, it increases the number of unused locations in the main memory. Furthermore, the number of unused locations is a function of (i) the placement of the arrays in the cache and (ii) the address of the first free memory location. Since our procedure for finding the minimum cache size does not pin down the exact placement of the arrays in the cache, one can calculate the memory size for the different configurations and choose the one with the minimum size. The main drawback of this approach is that the number of possible configurations can be very large: n! for n arrays. Clearly, heuristics are needed to choose the configuration that results in the minimum memory size.

4. CONCLUSION

In this paper, we show how data placement techniques can be used to enhance cache performance in low energy memory design. This technique results in a spectacular reduction in the number of cycles and a modest reduction in the energy. This is because the energy associated with a miss is very large, and so a large reduction in the miss rate helps reduce the energy by only a modest amount. The data placement procedure described here assumes that the reduction in the miss rate is more important than the off-chip memory size.

REFERENCES

CATTHOOR, F., FRANSSEN, F., WUYTACK, S., NACHTERGAELE, L., AND DE MAN, H. 1994. Global communication and memory optimizing transformations for low power signal processing systems. Workshop on VLSI Signal Processing (La Jolla, CA, Oct).
CATTHOOR, F., WUYTACK, S., DE GREEF, E., BALASA, F., NACHTERGAELE, L., AND VANDECAPPELLE, A. 1998. Custom memory management methodology: exploration of memory organisation for embedded multimedia system design. Kluwer Academic Publishers (June).
DUTTA, S., WOLF, W., AND WOLFE, A. 1995. Memory system architectures for programmable video signal processors. Proceedings of ICCD, IEEE Computer Society Press.
HENNESSY, J. L. AND PATTERSON, D. A. 1996. Computer architecture: a quantitative approach, 2nd edition. Morgan Kaufmann Publishers.
KAMBLE, M. B. AND GHOSE, K. 1997. Analytical energy dissipation models for low power caches. International Symposium on Low Power Electronics and Design.
PANDA, P. R., DUTT, N. D., AND NICOLAU, A. 1997a. Architectural exploration and optimization of local memory in embedded systems. International Symposium on System Synthesis (Antwerp, Sept).
PANDA, P. R., DUTT, N. D., AND NICOLAU, A. 1997b. Memory data organization for improved cache performance in embedded processor applications. ACM Transactions on Design Automation of Electronic Systems 2, 4 (Oct).
PANDA, P. R., DUTT, N. D., AND NICOLAU, A. 1999. Local memory exploration and optimization in embedded systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 18, 1 (Jan).
SCHMIT, H. AND THOMAS, D. H. 1997. Synthesis of application-specific memory designs. IEEE Transactions on VLSI Systems 5, 1 (March).
SHIUE, W. T. AND CHAKRABARTI, C. 1999a. Memory exploration for low power, embedded systems. 36th Design Automation Conference (New Orleans, LA, June).
SHIUE, W. T. AND CHAKRABARTI, C. 1999b. Memory design and exploration for low power, embedded systems. IEEE Workshop on Signal Processing Systems: Design and Implementation (Taiwan R.O.C., Oct).
SHIUE, W. T. AND CHAKRABARTI, C. 1999c. Memory design and exploration for low power embedded systems. Center for Low Power Electronics Technical Report CLPE-TR (Oct).
SU, C. AND DESPAIN, A. 1995. Cache design trade-offs for power and performance optimization: a case study. International Symposium on Low Power Electronics and Design.
THORDARSON, A. Comparison of manual and automatic behavioral synthesis of MPEG algorithm. Master's thesis, University of California, Irvine.
WOLF, M. E. AND LAM, M. 1991. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation (June).


Lecture 9: Improving Cache Performance: Reduce miss rate Reduce miss penalty Reduce hit time Lecture 9: Improving Cache Performance: Reduce miss rate Reduce miss penalty Reduce hit time Review ABC of Cache: Associativity Block size Capacity Cache organization Direct-mapped cache : A =, S = C/B

More information

Performance and Power Solutions for Caches Using 8T SRAM Cells

Performance and Power Solutions for Caches Using 8T SRAM Cells Performance and Power Solutions for Caches Using 8T SRAM Cells Mostafa Farahani Amirali Baniasadi Department of Electrical and Computer Engineering University of Victoria, BC, Canada {mostafa, amirali}@ece.uvic.ca

More information

Optimization of Task Scheduling and Memory Partitioning for Multiprocessor System on Chip

Optimization of Task Scheduling and Memory Partitioning for Multiprocessor System on Chip Optimization of Task Scheduling and Memory Partitioning for Multiprocessor System on Chip 1 Mythili.R, 2 Mugilan.D 1 PG Student, Department of Electronics and Communication K S Rangasamy College Of Technology,

More information

ECE 485/585 Microprocessor System Design

ECE 485/585 Microprocessor System Design Microprocessor System Design Lecture 8: Principle of Locality Cache Architecture Cache Replacement Policies Zeshan Chishti Electrical and Computer Engineering Dept Maseeh College of Engineering and Computer

More information

Memory and multiprogramming

Memory and multiprogramming Memory and multiprogramming COMP342 27 Week 5 Dr Len Hamey Reading TW: Tanenbaum and Woodhull, Operating Systems, Third Edition, chapter 4. References (computer architecture): HP: Hennessy and Patterson

More information

MODULAR PARTITIONING FOR INCREMENTAL COMPILATION

MODULAR PARTITIONING FOR INCREMENTAL COMPILATION MODULAR PARTITIONING FOR INCREMENTAL COMPILATION Mehrdad Eslami Dehkordi, Stephen D. Brown Dept. of Electrical and Computer Engineering University of Toronto, Toronto, Canada email: {eslami,brown}@eecg.utoronto.ca

More information

Types of Cache Misses: The Three C s

Types of Cache Misses: The Three C s Types of Cache Misses: The Three C s 1 Compulsory: On the first access to a block; the block must be brought into the cache; also called cold start misses, or first reference misses. 2 Capacity: Occur

More information

Memory. Lecture 22 CS301

Memory. Lecture 22 CS301 Memory Lecture 22 CS301 Administrative Daily Review of today s lecture w Due tomorrow (11/13) at 8am HW #8 due today at 5pm Program #2 due Friday, 11/16 at 11:59pm Test #2 Wednesday Pipelined Machine Fetch

More information

Memory Systems and Compiler Support for MPSoC Architectures. Mahmut Kandemir and Nikil Dutt. Cap. 9

Memory Systems and Compiler Support for MPSoC Architectures. Mahmut Kandemir and Nikil Dutt. Cap. 9 Memory Systems and Compiler Support for MPSoC Architectures Mahmut Kandemir and Nikil Dutt Cap. 9 Fernando Moraes 28/maio/2013 1 MPSoC - Vantagens MPSoC architecture has several advantages over a conventional

More information

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory

More information

Analytical Design Space Exploration of Caches for Embedded Systems

Analytical Design Space Exploration of Caches for Embedded Systems Analytical Design Space Exploration of Caches for Embedded Systems Arijit Ghosh and Tony Givargis Department of Information and Computer Science Center for Embedded Computer Systems University of California,

More information

A LITERATURE SURVEY ON CPU CACHE RECONFIGURATION

A LITERATURE SURVEY ON CPU CACHE RECONFIGURATION A LITERATURE SURVEY ON CPU CACHE RECONFIGURATION S. Subha SITE, Vellore Institute of Technology, Vellore, India E-Mail: ssubha@rocketmail.com ABSTRACT CPU caches are designed with fixed number of sets,

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

CS152 Computer Architecture and Engineering Lecture 17: Cache System

CS152 Computer Architecture and Engineering Lecture 17: Cache System CS152 Computer Architecture and Engineering Lecture 17 System March 17, 1995 Dave Patterson (patterson@cs) and Shing Kong (shing.kong@eng.sun.com) Slides available on http//http.cs.berkeley.edu/~patterson

More information

Virtual Memory. Kevin Webb Swarthmore College March 8, 2018

Virtual Memory. Kevin Webb Swarthmore College March 8, 2018 irtual Memory Kevin Webb Swarthmore College March 8, 2018 Today s Goals Describe the mechanisms behind address translation. Analyze the performance of address translation alternatives. Explore page replacement

More information

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 13 Virtual memory and memory management unit In the last class, we had discussed

More information

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide

More information

Classification Steady-State Cache Misses: Techniques To Improve Cache Performance:

Classification Steady-State Cache Misses: Techniques To Improve Cache Performance: #1 Lec # 9 Winter 2003 1-21-2004 Classification Steady-State Cache Misses: The Three C s of cache Misses: Compulsory Misses Capacity Misses Conflict Misses Techniques To Improve Cache Performance: Reduce

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Lossless Compression using Efficient Encoding of Bitmasks

Lossless Compression using Efficient Encoding of Bitmasks Lossless Compression using Efficient Encoding of Bitmasks Chetan Murthy and Prabhat Mishra Department of Computer and Information Science and Engineering University of Florida, Gainesville, FL 326, USA

More information

Optimal Cache Organization using an Allocation Tree

Optimal Cache Organization using an Allocation Tree Optimal Cache Organization using an Allocation Tree Tony Givargis Technical Report CECS-2-22 September 11, 2002 Department of Information and Computer Science Center for Embedded Computer Systems University

More information

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts Memory management Last modified: 26.04.2016 1 Contents Background Logical and physical address spaces; address binding Overlaying, swapping Contiguous Memory Allocation Segmentation Paging Structure of

More information

UCB CS61C : Machine Structures

UCB CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 32 Caches III 2008-04-16 Lecturer SOE Dan Garcia Hi to Chin Han from U Penn! Prem Kumar of Northwestern has created a quantum inverter

More information