Memory Synthesis for Low Power

Wen-Tsong Shiue
Silicon Metrics Corporation, Research Blvd. Suite 300, Austin, TX 78759, USA


Abstract - In this paper, we show how proper data placement in off-chip memory and loop tiling can be used to enhance cache performance, both in the number of cycles and in the energy consumption. While the reduction in the number of cycles is spectacular (~5X), the reduction in the energy consumption is modest (~2X). The data placement procedure consists of assigning specific cache lines to each array in the application program and then placing the arrays in the off-chip memory such that the amount of unused space is as small as possible. The procedure for assigning specific cache lines is based on an analysis of access patterns and ensures a significant reduction in the number of conflict misses.

1. INTRODUCTION

In the 80s, the microelectronics industry was primarily concerned with performance, area, cost and reliability, with power consumption being of secondary importance. Starting from the early 90s, lowering the power consumption became as important as increasing the throughput and reducing the area. This was because of the remarkable success of portable consumer electronics (camcorders, compact disk players), personal computing devices (notebooks, laptops, and palmtops) and wireless communication systems (cordless phones, cellular phones) that required high-speed computation with low power consumption.

In the design of low power systems, the main focus has been on reducing the power consumption in the data path components. However, in systems that involve multidimensional streams of signals, such as images or video sequences, the majority of the area and power cost is due not to the data path or controllers but to the global communication and memory interactions [Catthoor et al. 1998]. This is due to the fact that the power consumed in memory transfers is significantly larger than the power consumed in data path operations. It implies that, with proper design, the reduction in the memory-related power budget can far exceed the reduction due to voltage scaling and other power-saving transformations.

In this paper, we focus on procedures that help reduce the memory-related energy consumption. Specifically, we study the effect of off-chip data placement and tiling on cache performance and show that while data placement and tiling result in a significant reduction in the number of cycles, the reduction in energy is modest. The main contributions of this paper are:
- We show how to do data placement, which involves assigning specific cache lines to each of the arrays and placing the arrays in the off-chip memory so that the number of conflict misses is reduced and the amount of unused space is as small as possible.
- We show how to find the minimum cache size for large programs. Cache sizes smaller than the minimum cache size have very degraded performance, both in the number of cycles and in energy, and should not be considered.

The work presented in this paper is an extension of our earlier work in [Shiue and Chakrabarti 1999a; 1999b; 1999c]. Pioneering work in the area of memory management for low power applications has been done at IMEC [Catthoor et al. 1994; 1998]. Their procedure is comprehensive and consists of global transformations to increase the locality and regularity of data accesses, a systematic method for data reuse, and memory allocation and assignment that meets the timing constraints with as cheap a memory architecture as possible (both in terms of power and area).
Data memory exploration for embedded systems has been extensively studied by Panda, Dutt and Nicolau [Panda et al. 1997a; 1997b; 1999]; the performance metrics of their system are data cache size and number of processor cycles. A novel approach for designing memory systems based on binding array groups to memory components with different dimensions, access times and numbers of ports has been presented in [Schmit and Thomas 1997]. A memory system design for video processors and a methodology for the analysis of on-chip memory architectures have been developed in [Dutta et al. 1995]; designs are evaluated based on tradeoffs between area, cycle time and utilization. For low power applications, it is not sufficient to consider only area and cycle time. Energy has to be included in the performance metrics, since the variation in energy across different memory configurations is quite different from the variation in the number of cycles. In [Shiue and Chakrabarti 1999a; 1999b], we show how three performance metrics, namely area, number of cycles and energy, are required to efficiently explore the memory design space for low power applications.

The rest of the paper is organized as follows. Section 2 briefly describes the time and energy models used in our procedure and demonstrates the differences in the variation of the number of cycles and the energy for different types of programs. Section 3 describes techniques to improve cache performance, namely data placement in off-chip memory. Section 4 concludes the paper.

2. BACKGROUND

In this section, we present a brief review of the time and energy models used in our procedure as well as the differences between the timing and energy characteristics. For more details, please refer to [Shiue and Chakrabarti 1999c]. Our system has three performance metrics: cache size, number of processor cycles and energy consumption.

Cache size: Given the area constraint, we find the largest possible cache size Cmax that satisfies this constraint. The memory exploration procedure searches for the best cache configuration among cache sizes < Cmax.

Number of processor cycles: The number of processor cycles is a function of the miss rate. We adopt the model used in [Hennessy and Patterson 1996] and assume that the number of cycles per hit is 1, 1.1, 1.12, and 1.14 for 1, 2, 4, and 8-way set associative caches, respectively. We also assume that the number of cycles per miss is 40, 40, 42, 44, 48, 56, and 72 for

line sizes of 4, 8, 16, 32, 64, 128, and 256, respectively. The number of processor cycles is calculated as follows:

Number of processor cycles = hit_rate * trip_count * (number of cycles per hit) + miss_rate * trip_count * (number of cycles per miss)

Energy: Our energy model is derived by combining the energy models in [Kamble and Ghose 1997] and [Su and Despain 1995], resulting in a model that gives fast estimation as in [Su and Despain 1995] and yet matches the energy values in [Kamble and Ghose 1997] quite closely. Furthermore, we consider not only the energy consumed in the host processor but also the energy consumed in the off-chip memory. This is very important in order to accurately calculate the energy consumed during a miss.

Energy = Edec + Ecell + Eio + Emain, where
- Edec = α * Add_bus_bs * (C/L)
- Ecell = β * Word_line_size * (Bit_line_size + 4.8) * (Nhit + Nmiss)
- Eio = γ * (Data_pad_bs * 8L + Add_pad_bs)
- Emain = γ * Data_pad_bs * 8L + Em * 8L * Nmiss

Add_bus_bs = Pr * (Nhit + Nmiss) * Wadd
Add_pad_bs = Pr * Nmiss * Wadd
Data_pad_bs = Pr * Nmiss
Word_line_size = m * (8L + T + St)
Bit_line_size = C / (m * L)
α = 7.89e-17; β = 1.44e-14; γ = 5.45e-11; Em = 4.95e-9 Joule

C = cache size
L = cache line size
m = degree of set associativity (m-way)
T = tag size in bits
St = number of status bits per block frame
Pr = probability of a 0-to-1 transition
Nhit = number of hits
Nmiss = number of misses
Wadd = width of the address bus
Add_bus_bs = number of bit switches on the address bus
Add_pad_bs = number of bit switches on the address pads
Data_pad_bs = number of bit switches on the data pads
Word_line_size = number of memory cells in a word line
Bit_line_size = number of memory cells in a bit line
Edec = energy consumed in the address bus
Ecell = energy consumed by the precharged cache word/bit lines
Eio = energy consumed in the I/O pads of the host processor
Emain = energy consumed in the off-chip memory and in accessing the off-chip memory

(A code sketch of both models is given at the end of this subsection.)

2.1 Energy-Time Tradeoffs

In this section, we briefly review the differences in the timing and energy characteristics of different cache configurations with the help of an example; it is these differences that justify the need to include energy in the performance metrics. Figures 1(a) and 1(b) plot the variation in the number of cycles and in the energy consumption for the example Compress. (Compress is a 2-dimensional filter used in image processing applications.) From these plots we see that:
- The minimum-time cache configuration corresponds to the largest possible cache, while the minimum energy consumption corresponds to the smallest possible cache.
- The number of cycles and the energy consumption increase significantly if the cache size is smaller than a particular value. We refer to the point on the curve at which this transition occurs as the inflection point. The inflection point corresponds to the minimum cache size; in the next section we describe a procedure to calculate it.
- The energy increases significantly with increase in line size for the same cache size. This is because retrieving a large block of data from the off-chip main memory to the on-chip cache consumes more energy than retrieving a small block of data.
- For a specific line size, the energy increases with increase in cache size. This is because a larger cache consumes more energy, and the increase in energy exceeds the reduction in the miss rate (due to the larger cache size).
- For a specific line size, the number of cycles decreases only mildly with increase in cache size, in tune with the decrease in the miss rate.
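To make the two models concrete, here is a minimal Python sketch. The constants and formulas are taken directly from the definitions above; the function names, the dictionary lookups for the per-hit and per-miss cycle counts, and deriving miss_rate as 1 - hit_rate are our own packaging.

```python
def num_cycles(hit_rate, trip_count, assoc, line_size):
    """Cycles = hit_rate*trip_count*cycles_per_hit + miss_rate*trip_count*cycles_per_miss."""
    cycles_per_hit = {1: 1.0, 2: 1.1, 4: 1.12, 8: 1.14}[assoc]
    cycles_per_miss = {4: 40, 8: 40, 16: 42, 32: 44,
                       64: 48, 128: 56, 256: 72}[line_size]
    miss_rate = 1.0 - hit_rate
    return trip_count * (hit_rate * cycles_per_hit + miss_rate * cycles_per_miss)

def energy(C, L, m, T, St, Pr, Nhit, Nmiss, Wadd):
    """Energy = Edec + Ecell + Eio + Emain, in Joules."""
    alpha, beta, gamma, Em = 7.89e-17, 1.44e-14, 5.45e-11, 4.95e-9
    add_bus_bs = Pr * (Nhit + Nmiss) * Wadd      # bit switches on address bus
    add_pad_bs = Pr * Nmiss * Wadd               # bit switches on address pads
    data_pad_bs = Pr * Nmiss                     # bit switches on data pads
    word_line_size = m * (8 * L + T + St)        # cells per word line
    bit_line_size = C / (m * L)                  # cells per bit line
    Edec = alpha * add_bus_bs * (C / L)
    Ecell = beta * word_line_size * (bit_line_size + 4.8) * (Nhit + Nmiss)
    Eio = gamma * (data_pad_bs * 8 * L + add_pad_bs)
    Emain = gamma * data_pad_bs * 8 * L + Em * 8 * L * Nmiss
    return Edec + Ecell + Eio + Emain
```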
Figure 1. Example Compress: (a) cycles variation and (b) energy variation (in nJ) for different cache sizes (C8 through C_8K) and line sizes. The number of lines is >= 2.

3. ENHANCING CACHE PERFORMANCE

There are several techniques to enhance cache performance. Notable among them are placement of data in main memory so that the number of conflict misses is reduced, tiling to improve data locality, set associativity to improve the hit rate, block buffering to save power by optimizing the capacitance of each cache access, and subbanking to save power by eliminating unnecessary accesses [Kamble and Ghose 1997; Su and Despain 1995]. In this paper, we focus on data placement and tiling.

3.1 Data Placement

Figure 2. Example Compress: cycles and energy reduction due to data placement (optimized vs. unoptimized, for the configurations C32L4, C64L8 and C128L16).

Data placement can be used to significantly reduce the miss rate. This translates into a reduction in both the number of cycles and the energy consumption, as illustrated in Figure 2 for the example Compress. Note that the reduction in the number of cycles is much larger than the reduction in the energy consumption. This is because the energy of a miss >> the energy of a hit, and so a drastic reduction (5X) in the miss rate results in only a modest reduction (2X) in the energy. In the rest of this section, we describe a procedure for data placement in the off-chip memory that results in very few conflict misses. Our technique involves (i) finding the minimum cache size that is required to minimize the number of conflict misses, (ii) assigning specific cache lines to each of the arrays, and (iii) determining the placement of the arrays in the off-chip memory.

Finding the minimum cache size: Computing the minimum cache size is very important since, if the cache size is smaller than the minimum cache size, the miss rate increases significantly, thereby degrading the memory performance (cycles and energy). Our procedure consists of finding the minimum number of cache lines from the array access patterns. Let n be the depth of a loop nest and d the number of dimensions of an array A. Two references, A[f(i)] and A[g(i)], where f and g are indexing functions Z^n -> Z^d, are called uniformly generated if f(i) = Hi + c_f and g(i) = Hi + c_g, where H is a linear transformation and c_f and c_g are constant vectors [Wolf and Lam 1991]. We partition the references in a loop nest into equivalence classes of references that have the same H and operate on the same array, as described in [Wolf and Lam 1991]. For each class, we find the minimum number of cache lines. We repeat the procedure for each array in the kernel program and sum the numbers of cache lines. The minimum cache size is the line size times the sum of the minimum numbers of cache lines. The procedure for finding the minimum cache size for a single kernel program is given below; a code sketch follows Example 1.

Algorithm_min_cache_size_kernel_program
1. Find the distance for each class:
   Distance = floor(abs(difference of constant vectors)/stride of loop) + 1
2. For each class:
   if cache line size < Tripcount_inner_loop
     N = Distance mod (cache line size)
     if N == 0 or 1
       # cache lines = floor(Distance/cache line size) + 1
     else
       # cache lines = floor(Distance/cache line size) + 2
     end
   else
     # cache lines = 1
   end
3. Repeat Steps 1 and 2 for each array.
4. Minimum cache lines = Σ # cache lines
5. Minimum cache size = Minimum cache lines * cache line size

We illustrate this procedure with the help of the Compress example.

Example 1. Compress
int a[32,32]
for i=1,31
  for j=1,31
    a[i,j] = a[i,j] - a[i-1,j] - a[i,j-1] - 2*a[i-1,j-1];

Equivalent class | References           | Distance                | # cache lines if line size = 2
Class 1          | a[i-1,j-1], a[i-1,j] | floor(abs(1/1))+1 = 2   | floor(2/2)+1 = 2
Class 2          | a[i,j-1], a[i,j]     | floor(abs(1/1))+1 = 2   | floor(2/2)+1 = 2

In Example 1, there are two equivalence classes, class 1: a[i-1,j-1], a[i-1,j] and class 2: a[i,j-1], a[i,j]. The total number of cache lines for L=2 is 4 (two cache lines for the references in class 1 and two for the references in class 2). The minimum cache size is 4*L, where L is the line size. Thus, if the cache size is smaller than 4L, the miss rate increases significantly. If the cache size is larger than 4L, then the cache line size can be increased to exploit the spatial locality, or the number of cache lines can be increased in proportion to the number of classes.
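The per-class computation in Algorithm_min_cache_size_kernel_program is easy to mechanize. Below is a small Python sketch (the function names and argument conventions are ours) that reproduces the numbers of Example 1.

```python
def distance(c_f, c_g, stride):
    """Step 1: distance between two uniformly generated references whose
    constant vectors differ by c_f - c_g in the inner dimension."""
    return abs(c_f - c_g) // stride + 1

def lines_per_class(dist, line_size, tripcount_inner):
    """Step 2: number of cache lines needed by one equivalence class."""
    if line_size < tripcount_inner:
        n = dist % line_size
        return dist // line_size + (1 if n in (0, 1) else 2)
    return 1

# Example 1 (Compress): both classes have distance floor(abs(1)/1)+1 = 2,
# the inner loop trip count is 31, and the line size is L = 2.
L = 2
d = distance(1, 0, 1)                    # -> 2
per_class = lines_per_class(d, L, 31)    # -> 2 lines per class
print(2 * per_class, 2 * per_class * L)  # -> 4 lines, minimum cache size 8
```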
Finding the minimum number of cache lines per array: A large program consists of several kernel programs, each of which consists of arrays with different access patterns. This makes finding the minimum cache size (MCS) for large programs considerably more involved. The procedure for finding the minimum number of cache lines for each array consists of (i) finding the minimum number of zones by looking at the access pattern of the outer loop index, and (ii) finding the minimum number of cache lines per zone by looking at the access pattern of the inner loop index.

In order to calculate the minimum number of zones, we calculate a difference set for each kernel program, where the elements of the difference set are obtained by computing the differences between the outer loop index values. Next, we compute the union of the difference sets generated by the kernel programs. For instance, if in kernel program 1 rows i, i+2, i+4 of array a get accessed, then the difference set for program 1 is {2,4}. Now if in kernel program 2 rows i, i+4, i+5 of array a get accessed, then the difference set for program 2 is {1,4,5}. For the whole program that includes kernel 1 and kernel 2, the difference set for array a is {1,2,4,5}. The number of zones is the smallest integer that divides no number in the difference set: with z zones, rows r and r+d map to the same zone exactly when d is a multiple of z. In the above example, the minimum number of zones is 3. Thus, if the rows of array a get mapped to three zones as shown in Figure 3, there will not be any conflict. There would also not be any conflict if the number of zones were greater than or equal to 6, because 6 is larger than any value in the difference set. If, however, the number of zones is chosen to be 4 or 5, there would be a conflict, as shown in Figure 3. (A code sketch of this zone calculation is given after Figure 3.)

Figure 3. Example illustrating how rows i through i+5, accessed by kernel 1 {row i, row i+2, row i+4} and kernel 2 {row i, row i+4, row i+5}, map to zones 1-3 without conflict, and how there would be a conflict if the number of zones is 4 instead of 3.
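The zone count follows directly from the definition above. A sketch in Python, assuming the union of the difference sets has already been collected:

```python
def min_zones(diff_set):
    """Smallest zone count z that divides no element of diff_set
    (rows r and r+d map to the same zone exactly when d % z == 0)."""
    z = 1
    while any(d % z == 0 for d in diff_set):
        z += 1
    return z

# Union of the difference sets of kernels 1 and 2 for array a:
print(min_zones({1, 2, 4, 5}))  # -> 3; any z >= 6 would also be conflict-free
```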

Next, the number of cache lines per zone has to be calculated. We illustrate the procedure with the help of Example 2, where i is the outer loop index and j is the inner loop index. The difference set of the outer loop index for array a is {1} in K1 and {3} in K2. Thus we require a minimum of 2 zones for array a. Next, we calculate the number of lines per zone by looking at the access pattern of the inner loop. For instance, for array a, we calculate the minimum number of cache lines required for zone 1 by taking the maximum of the number of cache lines required in class 1 and class 5. Thus zone 1 requires two cache lines if L=2. Similarly, zone 2 requires two cache lines if L=2. Note that if zones 1 and 2 required different numbers of cache lines, we would assign the larger number of cache lines to each of the zones.

Example 2: Kernel programs K1 and K2

K1 references:
  Class 1: a[i,j], a[i,j+1]
  Class 2: a[i+1,j], a[i+1,j+1]
  Class 3: b[i,j], b[i,j+1]
  Class 4: b[i+1,j], b[i+1,j+2]
K2 references:
  Class 5: a[i,j], a[i,j+2]
  Class 6: a[i+3,j], a[i+3,j+1]
  Class 7: b[i,j], b[i,j+1]
  Class 8: b[i+2,j], b[i+2,j+2]

# lines for each class (assume line size = 2): 2 lines for each of classes 1-8.
# lines for each zone: 2 lines each for rows i, i+1 and i+3 of a; 2 lines each for rows i, i+1 and i+2 of b.

We use a similar analysis to find the minimum number of cache lines for array b. Since the difference set of the outer loop index for array b is {1} in K1 and {2} in K2, we require a minimum of 3 zones for b. Since the minimum number of lines per zone for array b is 2, the minimum number of lines for array b is 6. Since arrays a and b have overlapping lifetimes, the minimum cache size is (4+6)*L = 10*2 = 20 bytes for L=2. (This arithmetic is sketched in code below.) The exact assignment of the different rows of arrays a and b is as follows. Note that the 9 rows of array a (9x13) are distributed over zones 1 and 2, while the 8 rows of array b (8x13) are distributed over zones 3, 4 and 5. (Here A0 stands for row 0 of array a, A1 for row 1 of array a, etc.)

Zone 1: A0 A2 A4 A6 A8
Zone 2: A1 A3 A5 A7
Zone 3: B0 B3 B6
Zone 4: B1 B4 B7
Zone 5: B2 B5

Note that the number of lines per zone is a function of the line size. Thus MCS(L) = Σi Lines(i,L), where i is the zone number and Lines(i,L) is the number of lines for zone i if L is the line size. For most programs, as the line size increases, the number of lines per zone decreases.
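The Example 2 arithmetic can be checked mechanically. A sketch, restating min_zones() from the previous sketch and assuming the per-zone line counts (2 lines per zone at L = 2) have already been derived from the classes:

```python
def min_zones(diff_set):
    z = 1
    while any(d % z == 0 for d in diff_set):
        z += 1
    return z

L = 2
diff_sets = {"a": {1, 3}, "b": {1, 2}}  # outer-index differences, K1 union K2
lines_per_zone = {"a": 2, "b": 2}       # max over the classes in each zone

# Arrays a and b have overlapping lifetimes, so their requirements add.
total_lines = sum(min_zones(ds) * lines_per_zone[arr]  # a: 2*2, b: 3*2
                  for arr, ds in diff_sets.items())
print(total_lines, total_lines * L)     # -> 10 lines, MCS = 20 bytes
```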
Finding the minimum number of cache lines for the whole program: Given the minimum number of cache lines per array, the next step is to find the minimum number of cache lines for the whole program. First, we create an array conflict graph where the nodes correspond to arrays and an edge exists between two nodes if the two corresponding arrays have overlapping lifetimes. Each node has a weight associated with it, where the weight is the MCL (minimum number of cache lines) of the corresponding array. The minimum number of cache lines is thus larger than or equal to the cost of the maximal cost clique of this graph. For instance, consider an application program with three kernel programs, K1: [A (3 lines), C (1 line), X (3 lines)], K2: [A (4 lines), B (1 line), C (1 line), D (2 lines)] and K3: [B (1 line), Y (6 lines)]. The array conflict graph of this example is given in Figure 4. ABCD and ACX both form a maximal cost clique with cost 8. Thus the minimum number of cache lines for this application program is 8.

Since there are some arrays that are not part of the maximal cost clique, in the next step we identify whether these arrays can share cache lines with the ones in the maximal cost clique. Let the nodes in the maximal cost clique be assigned to the major set (MS) and the other nodes to the remaining set (RS). Our aim here is to match the nodes in RS with those in MS so that they can share cache lines.

Figure 4. Example illustrating the calculation of the minimum cache size. Node weights: A=4, B=1, C=1, D=2, X=3, Y=6; MS = {A, B, C, D}, RS = {X, Y}.

We first create the dual of the conflict graph. Thus there are no edges between the nodes in MS, and edges between nodes in RS imply that the corresponding arrays can share cache lines. Our algorithm is greedy and works on one node of RS at a time. We choose the node v with the largest cost, where the cost is the number of cache lines required by the corresponding array. If node v has edges to several nodes in MS, we choose a subset of those incident nodes such that the cost of v is equal to the cost of the nodes in the subset. While choosing the subset of incident nodes, we give higher priority to nodes with lower degree. If the cost of v is larger than the cost of the nodes in the subset, the cache size has to be increased by an amount equal to the difference of the two costs. At the end of this step, node v and its edges to nodes in MS are deleted, and the costs of the incident nodes are updated. Nodes with cost = 0 in MS are also deleted. The cost of an incident node is not updated if there exists another node in RS which can share cache lines with both v and the incident node. The algorithm is shown below.

Algorithm_min_cache_size_actual
1. Assign the nodes in the maximal cost clique to MS and the other nodes to RS.
2. Create the dual of the conflict graph G.
3. Remove-and-update scheme:
   - Choose the node v with the largest cost in RS.
   - For node v {
     a. Find the incident nodes of v in MS.
     b. Choose a subset of these incident nodes such that COST(v) is equal to the cost of these nodes.
        (i) Higher priority is given to incident nodes of lower degree.
        (ii) If COST(incident nodes) < COST(v), add x extra lines, where x = COST(v) - COST(incident nodes).
     c. Delete node v and update the costs of the incident nodes.
     }

We illustrate our procedure with the help of an example (see Figure 4). The maximal cost clique consists of nodes A, B, C, D and has a cost of 8. Thus nodes A, B, C, D are assigned to MS and nodes X, Y are assigned to RS. In the dual graph, note that there is an edge between X and Y, implying that X and Y can share cache lines. In the greedy procedure, we first choose node Y in RS since it has the largest cost. Node Y has three incident nodes, A, C and D. Of these, we choose A and C first since they have degree 1. Thus Y shares 4 lines with A and 1 line with C. Since Y has a cost of 6, Y also has to share 1 line with D. This implies that the cache lines for A, C and D have to be contiguous. Next, we update the nodes in MS and RS and their costs. The resulting configuration consists of B and D in MS and X in RS. Note that since X and Y can share cache lines, the cost of D is not updated. In this configuration, X can share 1 cache line with B and 2 cache lines with D. If X and Y could not share cache lines, an additional cache line would have to be added. (A code sketch of the conflict graph construction and the clique search is given below.)

Given that we know the exact assignment of cache lines to the different arrays for MCS(L), the minimum cache size corresponding to a specific line size L, the next step is to determine the assignment when the cache size is larger than MCS(L). A larger cache size can be due to a larger line size and/or a larger number of lines.

Case 1: The cache size increases but the line size L remains the same. In this case, the number of cache lines increases, and the number of lines assigned to each zone increases by the ratio cache_size/MCS(L).

Case 2: The line size increases but the number of lines remains the same. Since the line size increases to (say) L', the number of lines per zone may reduce. Thus MCS(L') may contain fewer lines than MCS(L). In this case, the number of lines assigned to each zone increases by the ratio cache_size/MCS(L').
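For small instances such as Figure 4, the conflict graph construction and the maximal cost clique search can be brute-forced. A Python sketch, with the K1/K2/K3 example encoded by hand; taking a node's weight as its maximum line requirement over all kernels is our reading of the MCL:

```python
from itertools import combinations

kernels = [
    {"A": 3, "C": 1, "X": 3},          # K1
    {"A": 4, "B": 1, "C": 1, "D": 2},  # K2
    {"B": 1, "Y": 6},                  # K3
]

weight, edges = {}, set()
for k in kernels:
    for arr, lines in k.items():       # node weight = MCL of the array
        weight[arr] = max(weight.get(arr, 0), lines)
    for a, b in combinations(sorted(k), 2):
        edges.add((a, b))              # overlapping lifetimes -> edge

def is_clique(nodes):
    return all(p in edges for p in combinations(sorted(nodes), 2))

# Exhaustive search is fine for a handful of arrays.
best = max((c for r in range(1, len(weight) + 1)
            for c in combinations(sorted(weight), r) if is_clique(c)),
           key=lambda c: sum(weight[a] for a in c))
print(best, sum(weight[a] for a in best))  # -> ACX (ABCD ties), cost 8
```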
Placing arrays in the off-chip memory for large programs: Once the placement of the arrays in the cache is known, the placement of the arrays in the main memory is relatively straightforward. Array padding is done between the rows of an array (for a 2D array) as well as between arrays. Array padding between rows is done such that references belonging to different classes do not get mapped to the same cache line. Our placement procedure is a greedy algorithm that places the arrays in the main memory such that the amount of unused space is minimized. The greedy algorithm chooses the array that (i) results in the minimum number of unused locations and (ii) is larger than the other arrays assigned to the same cache line. It maps all the rows of the chosen array before operating on the next array. Since array padding is done between the rows of an array, the number of locations needed to store the array in the main memory will most likely be larger than the size of the array.

For ease of implementation, each cache line has a candidate array list associated with it. The candidate array list is prioritized, with larger arrays having higher priority. After the assignment of one array (start location B, end location B+padded_arraysize-1), we find the cache line α corresponding to the next available location E. We pick the array with the highest priority from the candidate array list corresponding to line α. If the candidate array list of α is NULL, we look at the list of line α+1, and so on. Once an array in the candidate list is identified, the next step is to compute B, the start location of this array in the main memory. Since α = floor((B mod C)/L), B should satisfy the equation B = C*n + α*L. B should also be the smallest such value that is larger than or equal to E. (A code sketch of this computation follows the algorithm below.) Once B is determined, all the rows of the array are assigned in increasing order (row 0 first, row 1 second, etc.) using a similar procedure. Recall that since every row of the array has been assigned a specific cache line, determining the padding between consecutive rows is very similar to determining the padding between arrays. At the end of this step, locations B through B+padded_arraysize-1 are blocked and the candidate list of line α is updated. The procedure is then repeated for the next array. The algorithm is listed below.

Algorithm_memory_assignment
1. The candidate array list associated with cache line i is list(i). Assign priorities to the arrays: the larger the array, the higher its priority.
2. Repeat step 3 until all arrays have been assigned.
3. Procedure:
   - Determine the cache line corresponding to location E: compute α = floor((E mod C)/L).
     repeat until list(α) != NULL { α = (α+1) mod (C/L) }
   - Choose the array with the largest size from list(α).
   - Find the smallest value of B = C*n + α*L that satisfies B >= E. Assign all the rows of the array (in increasing order) using a similar procedure. Block locations B through B+padded_arraysize-1 and update list(α).
   - E = B + padded_arraysize
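The start-address computation in step 3 is a small piece of modular arithmetic: locations that map to cache line α repeat every C bytes. A minimal sketch (the function name is ours):

```python
def next_start(E, alpha, C, L):
    """Smallest B >= E with B = C*n + alpha*L, i.e. the next location that
    maps to cache line alpha = floor((B mod C) / L)."""
    B = (E // C) * C + alpha * L  # candidate in the C-byte period holding E
    return B if B >= E else B + C

# Figure 5 setting: C = 20, L = 2. If the next free location is E = 100 and
# the chosen row is assigned to cache line 3, it is placed at:
print(next_start(100, 3, 20, 2))  # -> 106 (106 mod 20 = 6; 6 // 2 = line 3)
```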

The memory assignment procedure is explained with the help of Figure 5. The cache line assignment of the three arrays is:

Zone 1: A0 A2 A4 A6, C0 C2
Zone 2: A1 A3 A5 A7, C1 C3
Zone 3: B0 B3 B6
Zone 4: B1 B4 B7
Zone 5: B2 B5

Figure 5. Example illustrating the placement of arrays in the main memory. Here C=20, L=2; array a is of size 8x16, array b is of size 8x4 and array c is of size 4x8.

If the first free memory location is 100, then array a (which is larger than array c) is chosen first. The eight rows of array a are mapped before b is assigned. Note that the padding between rows 0 and 1 of a is different from that between rows 1 and 2. As a result of padding, the number of unused locations is 32 for the assignment of array a, 16 for the assignment of array b, and 39 for the assignment of array c. The total number of unused locations is 103 out of 296. If array b occupied zones 1, 2 and 3, and arrays a and c occupied zones 4 and 5, the total number of unused locations would be 92 out of 284. Note that while confining the rows of an array to a few zones reduces the number of conflict misses significantly, it increases the number of unused locations in the main memory. Furthermore, the number of unused locations is a function of (i) the placement of the arrays in the cache and (ii) the address of the first free memory location. Since our procedure for finding the minimum cache size does not pin down the exact placement of the arrays in the cache, one can calculate the memory size for the different configurations and choose the one with the minimum size. The main drawback of this approach is that the number of possible configurations can be very large: n! for n arrays. Clearly, heuristics are needed to choose the configuration that results in the minimum memory size.

4. CONCLUSION

In this paper, we show how data placement techniques can be used to enhance cache performance in low energy memory design. This technique results in a spectacular reduction in the number of cycles and a modest reduction in the energy. This is because the energy associated with a miss is very large, and so a large reduction in the miss rate helps reduce the energy by only a modest amount. The data placement procedure described here assumes that the reduction in the miss rate is more important than the off-chip memory size.

REFERENCES

CATTHOOR, F., FRANSSEN, F., WUYTACK, S., NACHTERGAELE, L., AND DE MAN, H. 1994. Global communication and memory optimizing transformations for low power signal processing systems. Workshop on VLSI Signal Processing (La Jolla, CA, Oct).
CATTHOOR, F., WUYTACK, S., DE GREEF, E., BALASA, F., NACHTERGAELE, L., AND VANDECAPPELLE, A. 1998. Custom memory management methodology: exploration of memory organisation for embedded multimedia system design. Kluwer Academic Publishers (June).
DUTTA, S., WOLF, W., AND WOLFE, A. 1995. Memory system architectures for programmable video signal processors. Proceedings of ICCD, IEEE Computer Society Press.
HENNESSY, J. L. AND PATTERSON, D. A. 1996. Computer architecture: a quantitative approach, 2nd edition. Morgan Kaufmann Publishers.
KAMBLE, M. B. AND GHOSE, K. 1997. Analytical energy dissipation models for low power caches. International Symposium on Low Power Electronics and Design.
PANDA, P. R., DUTT, N. D., AND NICOLAU, A. 1997a. Architectural exploration and optimization of local memory in embedded systems. International Symposium on System Synthesis (Antwerp, Sept).
PANDA, P. R., DUTT, N. D., AND NICOLAU, A. 1997b. Memory data organization for improved cache performance in embedded processor applications. ACM Transactions on Design Automation of Electronic Systems 2, 4 (Oct).
PANDA, P. R., DUTT, N. D., AND NICOLAU, A. 1999. Local memory exploration and optimization in embedded systems. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems 18, 1 (Jan).
SCHMIT, H. AND THOMAS, D. H. 1997. Synthesis of application-specific memory designs. IEEE Transactions on VLSI Systems 5, 1 (March).
SHIUE, W. T. AND CHAKRABARTI, C. 1999a. Memory exploration for low power, embedded systems. 36th Design Automation Conference (New Orleans, LA, June).
SHIUE, W. T. AND CHAKRABARTI, C. 1999b. Memory design and exploration for low power, embedded systems. IEEE Workshop on Signal Processing Systems: Design and Implementation (Taiwan R.O.C., Oct).
SHIUE, W. T. AND CHAKRABARTI, C. 1999c. Memory design and exploration for low power embedded systems. Center for Low Power Electronics Technical Report CLPE-TR (Oct).
SU, C. AND DESPAIN, A. 1995. Cache design trade-offs for power and performance optimization: a case study. International Symposium on Low Power Electronics and Design.
THORDARSON, A. Comparison of manual and automatic behavioral synthesis of MPEG algorithm. Master's thesis, University of California, Irvine.
WOLF, M. E. AND LAM, M. 1991. A data locality optimizing algorithm. In Proceedings of the SIGPLAN '91 Conference on Programming Language Design and Implementation (June).


Lecture 9: Improving Cache Performance: Reduce miss rate Reduce miss penalty Reduce hit time Lecture 9: Improving Cache Performance: Reduce miss rate Reduce miss penalty Reduce hit time Review ABC of Cache: Associativity Block size Capacity Cache organization Direct-mapped cache : A =, S = C/B

More information

Performance and Power Solutions for Caches Using 8T SRAM Cells

Performance and Power Solutions for Caches Using 8T SRAM Cells Performance and Power Solutions for Caches Using 8T SRAM Cells Mostafa Farahani Amirali Baniasadi Department of Electrical and Computer Engineering University of Victoria, BC, Canada {mostafa, amirali}@ece.uvic.ca

More information

Optimization of Task Scheduling and Memory Partitioning for Multiprocessor System on Chip

Optimization of Task Scheduling and Memory Partitioning for Multiprocessor System on Chip Optimization of Task Scheduling and Memory Partitioning for Multiprocessor System on Chip 1 Mythili.R, 2 Mugilan.D 1 PG Student, Department of Electronics and Communication K S Rangasamy College Of Technology,

More information

ECE 485/585 Microprocessor System Design

ECE 485/585 Microprocessor System Design Microprocessor System Design Lecture 8: Principle of Locality Cache Architecture Cache Replacement Policies Zeshan Chishti Electrical and Computer Engineering Dept Maseeh College of Engineering and Computer

More information

Memory and multiprogramming

Memory and multiprogramming Memory and multiprogramming COMP342 27 Week 5 Dr Len Hamey Reading TW: Tanenbaum and Woodhull, Operating Systems, Third Edition, chapter 4. References (computer architecture): HP: Hennessy and Patterson

More information

MODULAR PARTITIONING FOR INCREMENTAL COMPILATION

MODULAR PARTITIONING FOR INCREMENTAL COMPILATION MODULAR PARTITIONING FOR INCREMENTAL COMPILATION Mehrdad Eslami Dehkordi, Stephen D. Brown Dept. of Electrical and Computer Engineering University of Toronto, Toronto, Canada email: {eslami,brown}@eecg.utoronto.ca

More information

Types of Cache Misses: The Three C s

Types of Cache Misses: The Three C s Types of Cache Misses: The Three C s 1 Compulsory: On the first access to a block; the block must be brought into the cache; also called cold start misses, or first reference misses. 2 Capacity: Occur

More information

Memory. Lecture 22 CS301

Memory. Lecture 22 CS301 Memory Lecture 22 CS301 Administrative Daily Review of today s lecture w Due tomorrow (11/13) at 8am HW #8 due today at 5pm Program #2 due Friday, 11/16 at 11:59pm Test #2 Wednesday Pipelined Machine Fetch

More information

Memory Systems and Compiler Support for MPSoC Architectures. Mahmut Kandemir and Nikil Dutt. Cap. 9

Memory Systems and Compiler Support for MPSoC Architectures. Mahmut Kandemir and Nikil Dutt. Cap. 9 Memory Systems and Compiler Support for MPSoC Architectures Mahmut Kandemir and Nikil Dutt Cap. 9 Fernando Moraes 28/maio/2013 1 MPSoC - Vantagens MPSoC architecture has several advantages over a conventional

More information

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory

More information

Analytical Design Space Exploration of Caches for Embedded Systems

Analytical Design Space Exploration of Caches for Embedded Systems Analytical Design Space Exploration of Caches for Embedded Systems Arijit Ghosh and Tony Givargis Department of Information and Computer Science Center for Embedded Computer Systems University of California,

More information

A LITERATURE SURVEY ON CPU CACHE RECONFIGURATION

A LITERATURE SURVEY ON CPU CACHE RECONFIGURATION A LITERATURE SURVEY ON CPU CACHE RECONFIGURATION S. Subha SITE, Vellore Institute of Technology, Vellore, India E-Mail: ssubha@rocketmail.com ABSTRACT CPU caches are designed with fixed number of sets,

More information

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

CS152 Computer Architecture and Engineering Lecture 17: Cache System

CS152 Computer Architecture and Engineering Lecture 17: Cache System CS152 Computer Architecture and Engineering Lecture 17 System March 17, 1995 Dave Patterson (patterson@cs) and Shing Kong (shing.kong@eng.sun.com) Slides available on http//http.cs.berkeley.edu/~patterson

More information

Virtual Memory. Kevin Webb Swarthmore College March 8, 2018

Virtual Memory. Kevin Webb Swarthmore College March 8, 2018 irtual Memory Kevin Webb Swarthmore College March 8, 2018 Today s Goals Describe the mechanisms behind address translation. Analyze the performance of address translation alternatives. Explore page replacement

More information

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institute of Technology, Delhi Lecture - 13 Virtual memory and memory management unit In the last class, we had discussed

More information

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES

CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES CHAPTER 8 - MEMORY MANAGEMENT STRATEGIES OBJECTIVES Detailed description of various ways of organizing memory hardware Various memory-management techniques, including paging and segmentation To provide

More information

Classification Steady-State Cache Misses: Techniques To Improve Cache Performance:

Classification Steady-State Cache Misses: Techniques To Improve Cache Performance: #1 Lec # 9 Winter 2003 1-21-2004 Classification Steady-State Cache Misses: The Three C s of cache Misses: Compulsory Misses Capacity Misses Conflict Misses Techniques To Improve Cache Performance: Reduce

More information

Chapter 5. Large and Fast: Exploiting Memory Hierarchy

Chapter 5. Large and Fast: Exploiting Memory Hierarchy Chapter 5 Large and Fast: Exploiting Memory Hierarchy Processor-Memory Performance Gap 10000 µproc 55%/year (2X/1.5yr) Performance 1000 100 10 1 1980 1983 1986 1989 Moore s Law Processor-Memory Performance

More information

Lossless Compression using Efficient Encoding of Bitmasks

Lossless Compression using Efficient Encoding of Bitmasks Lossless Compression using Efficient Encoding of Bitmasks Chetan Murthy and Prabhat Mishra Department of Computer and Information Science and Engineering University of Florida, Gainesville, FL 326, USA

More information

Optimal Cache Organization using an Allocation Tree

Optimal Cache Organization using an Allocation Tree Optimal Cache Organization using an Allocation Tree Tony Givargis Technical Report CECS-2-22 September 11, 2002 Department of Information and Computer Science Center for Embedded Computer Systems University

More information

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts

Memory management. Last modified: Adaptation of Silberschatz, Galvin, Gagne slides for the textbook Applied Operating Systems Concepts Memory management Last modified: 26.04.2016 1 Contents Background Logical and physical address spaces; address binding Overlaying, swapping Contiguous Memory Allocation Segmentation Paging Structure of

More information

UCB CS61C : Machine Structures

UCB CS61C : Machine Structures inst.eecs.berkeley.edu/~cs61c UCB CS61C : Machine Structures Lecture 32 Caches III 2008-04-16 Lecturer SOE Dan Garcia Hi to Chin Han from U Penn! Prem Kumar of Northwestern has created a quantum inverter

More information