Cache Miss Clustering for Banked Memory Systems


O. Ozturk, G. Chen, M. Kandemir
Computer Science and Engineering Department
Pennsylvania State University
University Park, PA 16802, USA
{ozturk, gchen, kandemir}@cse.psu.edu

M. Karakoy
Department of Computing
Imperial College
London SW7 2AZ, UK

This work is supported in part by an NSF CAREER award and a fund from GSRC.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. ICCAD '06, November 5-9, 2006, San Jose, CA. Copyright 2006 ACM.

ABSTRACT

One of the previously-proposed techniques for reducing memory energy consumption is memory banking. The idea is to divide the memory space into multiple banks and place currently unused (idle) banks into a low-power operating mode. Prior studies of memory energy optimization via low-power modes, in both the hardware and software domains, do not take data cache behavior explicitly into account. As a consequence, the energy savings achieved by these techniques can be unpredictable due to dynamic cache behavior at runtime. The main contribution of this paper is a compiler optimization, called bank-aware cache miss clustering, that increases the idle durations of memory banks and, as a result, enables better exploitation of the low-power capabilities supported by the memory system. This is because clustering cache misses helps to cluster cache hits as well, and this in turn increases bank idleness. We implemented our cache miss clustering approach within a compilation framework and tested it using seven array-intensive application codes. Our experiments show that cache miss clustering saves significant memory energy as a result of the increased idle periods of memory banks.

1. INTRODUCTION AND MOTIVATION

One of the energy reduction techniques currently employed by memory chips is low-power operating modes. A low-power operating mode in DRAMs is typically implemented by shutting off certain components within the memory chip. It is typically used in conjunction with a banked memory system, where the memory banks that are not used by the current computation can be put into a low-power mode. Most of the memory energy management schemes via low-power modes studied in the literature, however, do not take data cache behavior explicitly into account; in other words, they operate, in a sense, in a cache-oblivious fashion. As a consequence, the results achieved by these techniques can be unpredictable due to dynamic cache behavior at runtime. In principle, taking data cache behavior into account can allow a more effective management of the available low-power operating modes, and this can in turn boost energy savings. The main goal of the work in this paper is to study a compiler-directed code restructuring scheme for increasing the energy savings coming from low-power modes used in banked memories. The proposed scheme employs cache miss clustering, and tries to increase bank idle times, which allows better use of the low-power operating modes supported by the underlying memory hardware. More specifically, cache miss clustering clusters data cache misses, and this helps cluster accesses to main memory (which is banked).
This clustering of accesses also means clustering of memory idle cycles, which in turn means longer idle periods for the banks and thus higher energy savings. While a variant of cache miss clustering has been used by previous research to enhance memory level parallelism [11], to the best of our knowledge, this paper presents the first study that uses cache miss clustering for increasing energy benefits in a banked memory system. We automated our approach within an optimizing compiler built upon the SUIF infrastructure from Stanford University [6] and performed experiments with a set of embedded applications. Our extensive experimental analysis indicates that cache miss clustering is very effective in increasing the average length of the idle periods of memory banks. As a result of this increase, the memory energy savings achieved through low-power modes increase significantly. Our experimental results also show that cache miss clustering reduces execution cycles.

The rest of this paper is organized as follows. Section 2 reviews the banked memory concept and low-power operating modes. Section 3 discusses the prior work on energy optimization for memory banks. Section 4 presents our approach in detail and gives our compiler algorithm. Section 5 presents an experimental evaluation of our approach. Section 6 concludes the paper.

2. POWER MODES OF MEMORY BANKS

In this work, we focus on an RDRAM-like off-chip memory architecture [13] in which the off-chip memory is partitioned into several banks, each of which can be activated or deactivated independently of the others. In this architecture, when a memory bank is not actively used, it can be placed into a low-power operating mode. While in a low-power mode, a bank typically consumes much less energy than in the active mode. However, a bank in a low-power mode needs to be reactivated before its contents can be accessed, which incurs a reactivation overhead. Typically, a more energy-saving low-power operating mode also has a higher reactivation overhead. Thus, it is important to select the most appropriate low-power mode to switch to when a bank becomes idle. In this study, we assume four different operating modes: an active mode (the mode during which memory read/write activity can occur) and three low-power modes, namely standby, napping, and power-down. Current DRAMs [13] support up to six power modes, with a few of them supporting only two modes. We collapse the read, write, and active-without-read-or-write modes into a single mode, called the active mode, in our experimentation. However, one may choose to vary the number of modes based on the target DRAM architecture. The energy consumptions and reactivation overheads for these operating modes are given in Table 1.

Mode        Energy Consumption  Reactivation Overhead
Active      3.570 nJ            0 cycles
Standby     0.830 nJ            2 cycles
Napping     0.320 nJ            30 cycles
Power-Down  0.005 nJ            9,000 cycles

Table 1: Energy consumptions (per cycle) and reactivation times for the different operating modes. These numbers include both dynamic (switching) and static (leakage) components.

The energy values shown in this table have been obtained from the measured current values associated with memory modules documented in memory data sheets (for a 3.3 V, 2.5 nsec cycle time, 8 MB memory) [14]. The reactivation times (overheads) are also obtained from data sheets. Based on trends gleaned from data sheets, the energy values are increased by 30% when the module size is doubled.

An important parameter that helps us choose the most suitable low-power mode is the bank inter-access time (BIT), i.e., the time between successive accesses (requests) to a given memory bank. Obviously, the larger the BIT, the more aggressive the low-power mode that can be exploited. The problem of effective power mode utilization can then be defined as one of accurately estimating the BIT and using this estimate to select the most suitable low-power mode. This estimation can be done in software, using the compiler [3, 2] or OS support [7]; in hardware, using a prediction mechanism attached to the memory controller [3]; or by a combination of both. While compiler-based techniques have the advantage of predicting the BIT accurately for a specific class of applications, runtime and hardware based techniques are better at capturing runtime variations in access patterns (e.g., those due to cache hits/misses). In this paper, we employ a hardware-based BIT prediction mechanism similar to the mechanisms used in current memory controllers. Specifically, after 10 cycles of idleness, the corresponding bank is put in the standby mode. Subsequently, if the bank is not referenced for another 100 cycles, it is transitioned into the napping mode. Finally, if the bank is not referenced for a further 1,000,000 cycles, it is put into the power-down mode. Whenever the bank is referenced, it is brought back into the active mode, incurring the reactivation overhead of the mode it was in. It needs to be mentioned, however, that our bank-aware cache miss clustering scheme works with software-based BIT prediction schemes as well. We focus on a single-program environment and do not consider the existence of a virtual memory system; exploring the energy impact of our approach in the presence of virtual address translation is part of our planned future research.
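A minimal C sketch of this threshold policy follows, assuming the three thresholds are cumulative (10 cycles, then another 100, then a further 1,000,000) and using the reactivation latencies of Table 1; the structure and names are illustrative, not those of an actual memory controller:

    #include <stdint.h>

    typedef enum { ACTIVE = 0, STANDBY, NAPPING, POWER_DOWN } bank_mode_t;

    typedef struct {
        bank_mode_t mode;
        uint64_t    idle_cycles;   /* cycles since the last access */
    } bank_state_t;

    /* reactivation latencies (cycles) from Table 1, indexed by mode */
    static const uint64_t resync_cycles[4] = { 0, 2, 30, 9000 };

    /* called once per cycle for a bank that received no request */
    static void bank_tick_idle(bank_state_t *b) {
        b->idle_cycles++;
        if (b->idle_cycles >= 10u + 100u + 1000000u) b->mode = POWER_DOWN;
        else if (b->idle_cycles >= 10u + 100u)       b->mode = NAPPING;
        else if (b->idle_cycles >= 10u)              b->mode = STANDBY;
    }

    /* called when the bank is referenced; returns the reactivation
       penalty (cycles) implied by the mode the bank was found in */
    static uint64_t bank_access(bank_state_t *b) {
        uint64_t penalty = resync_cycles[b->mode];
        b->mode = ACTIVE;
        b->idle_cycles = 0;
        return penalty;
    }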
3. RELATED WORK

While there exists a large body of work on memory performance evaluation [18, 15], banked-memory-based studies for energy reduction are relatively new. It has been observed [5, 19, 1] that the memory system is a dominant consumer of overall system energy, making it a ripe candidate for software and hardware optimizations and thus serving as a strong motivation for the research presented in this paper. Lebeck et al. [7] have presented OS-based strategies for optimizing the energy consumption of a banked memory architecture; the idea is to perform page allocation in an energy-efficient manner by taking into account the banked structure of the main memory. Delaluz et al. [3] have discussed architectural and compiler techniques for low-power operating mode management; they have also discussed the bank pre-activation strategy used in this study. Fan et al. [4] have shown how to set memory controller policies for the best memory energy behavior. Delaluz et al. [2] have presented an experimental evaluation of some well-known classical high-level compiler optimizations on memory energy. In contrast to these previous efforts, the work described in this paper takes the data cache into account explicitly and restructures the application code to cluster data cache misses. In this respect, it is different from the previous studies on banked memory optimization. However, the proposed approach can co-exist with many of the previously-proposed memory energy optimization techniques. In fact, our point is that, since our approach increases bank idleness, it will increase the effectiveness of any hardware or software based scheme that makes use of low-power operating modes. The only cache miss clustering related studies that we are aware of are [11] and [12]. The first of these studies uses a variant of cache miss clustering as a means of improving memory parallelism in high-performance computing, and the second one compares it against software prefetching, a well-known latency hiding mechanism. Our cache miss clustering scheme is very different from these prior papers. First, our approach is bank-sensitive, and consequently it is preceded by a code transformation that isolates the loop iterations that access a given set of memory banks; clustering misses without considering how the arrays are mapped to the banks in the memory system would not be very useful in our context. Second, our approach is oriented towards reducing energy consumption rather than improving memory level parallelism in high-end machines. As a result, our code transformation approach is entirely different from the one in [11, 12], which is based on loop strip-mining, a loop-level code restructuring technique.

4. COMPILER APPROACH

Figure 1 shows our two-step approach to restructuring code to cluster data cache misses. The first part, which is explained in Section 4.3, clusters array references (load/store operations) based on the memory banks they access. The idea can be summarized as decomposing a loop nest into multiple smaller loop nests such that each of the resulting smaller nests accesses the same set of banks throughout its execution. The second part, which is explained in Section 4.4, handles the small loop nests generated by the first part one by one, and transforms each loop nest such that the data cache misses are clustered.

Figure 1: Overview of our approach.

Our approach operates under the assumption of cache line alignment, which means that each array starts at the beginning of a new cache line. Many compilers have directives that enforce this type of cache line alignment. Also, we assume that before our approach is invoked, array-to-bank assignment has already been performed. Our approach takes the bank size as a parameter; changing the bank size requires recompilation of the application in order to maximize the benefit that can be achieved by cache miss clustering. This may be a problem for an application that needs to run without modification on multiple systems with different configurations. However, for embedded systems, where the software is usually co-designed with the hardware, this is not a serious drawback.

4.1 Motivational Example

Figure 2 illustrates an example to motivate cache miss clustering. For ease of illustration, we consider a single bank in this example. In (a), we give the original loop nest we want to optimize. Part (b) shows the order of memory accesses.
Each circle indicates an array element and the arrows capture the traversal order of the array elements. We also highlight the cache line boundaries to show how the execution transitions from one cache line to another, assuming a cache line holds three elements. Part (c) of the figure illustrates the idle and active periods caused by the access pattern in part (b). We note that we incur a single cache miss in every three accesses. Assuming a cache hit latency of T_h cycles, the length of each idle period is 2T_h cycles. Our cache miss clustering strategy (to be explained shortly) transforms the original loop nest in (a) to the one shown in part (d). Part (e) illustrates the resulting data access pattern, and part (f) shows the active and idle periods for the transformed loop nest. The important point here is that the new access pattern clusters the two cache misses together, and this in turn increases the length of each idle period from 2T_h cycles to 4T_h cycles. This new pattern is clearly better than the original one from the perspective of energy saving, because we now have longer idle periods and can potentially use a more aggressive low-power operating mode for this bank.

(a) Original loop nest:

    for i = 0 to 3
      for j = 0 to 5
        ...A[i, j]...

(b) Order of memory accesses for the original loop nest; each shaded block represents a cache line. (c) Bank state variation during the execution of the original loop nest; the length of each idle period is 2T_h cycles, where T_h is the cache hit latency. (d) Transformed loop nest:

    for i = 0 to 1
      for j = 0 to 5
        ...A[i, j]...
        ...A[i + 2, j]...

(e) Order of memory accesses for the transformed loop nest; each shaded block represents a cache line. (f) Bank state variation during the execution of the transformed loop nest; the length of each idle period is 4T_h cycles.

Figure 2: An example motivating cache miss clustering.

4.2 Notation

For ease of discussion, let us first define some notation used in the rest of this paper. Note that this paper focuses on array-based, loop-intensive programs, e.g., embedded multimedia applications. Such a program is typically composed of a set of loop nests; each loop nest accesses a set of arrays. We use the abstract form below to represent an array-based loop-intensive program P that consists of l loop nests:

    P = {N_1, N_2, ..., N_l},

where N_i (i = 1, 2, ..., l) is the i-th loop nest of program P. Further, we use X_i to represent the set of arrays accessed by loop nest N_i, and X to represent the set of arrays accessed by program P. Based on these definitions, we have X = X_1 ∪ X_2 ∪ ... ∪ X_l. A loop nest N_i of the following form:

    N_i: for j_1 = l_1 to u_1 step d_1
           for j_2 = l_2 to u_2 step d_2
             ...
               for j_k = l_k to u_k step d_k
                 Body

can be represented in an abstract form as:

    N_i: for I in [L_i, U_i] step d: s_1(I), s_2(I), ..., s_m(I),

where I is the iteration vector (i.e., a vector that contains the loop iterators in the nest from top to bottom), L = (l_1, l_2, ..., l_k)^T and U = (u_1, u_2, ..., u_k)^T are the lower and upper bound vectors, d = (d_1, d_2, ..., d_k)^T is the loop step vector (step d can be omitted if all its elements are 1), and s_j(I) (j = 1, 2, ..., m) is the j-th array reference in the body of loop nest N_i (for ease of discussion, we omit non-array references). When this loop nest is executed, I takes values from L to U in lexicographic order. The array element accessed by s_j(I) can be represented as X[f_j(I)], where X (X ∈ X_i) is the name of the array. Also, function f_j maps iteration vector I to a vector of subscripts for array X. In this paper, we assume that f_j is an affine function of I.

4.3 Clustering Memory Bank Accesses

In this section, we discuss a method for isolating memory references (loads and stores) to a specific set of banks at a given time. We start by making an important definition, namely, the bank-invariant loop nest. Given a loop nest N of the form:

    N: for I in [L, U]: s_1(I), s_2(I), ..., s_m(I),

we say that it is a bank-invariant loop nest if the following constraint is satisfied for all 1 ≤ j ≤ m:

    for all I_1, I_2 in [L, U]: b(X_j[f_j(I_1)]) = b(X_j[f_j(I_2)]),   (1)

where X_j is the array accessed by array reference s_j, f_j is the mapping function that maps an iteration vector of the loop nest to a subscript vector of array X_j, and function b(x) gives the ID of the memory bank that stores array element x. In other words, the memory bank accessed by a given array reference (load/store operation) s_j does not change throughout the execution of N.
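As an illustration of constraint (1), the following is a small C sketch under assumptions of our own (not the paper's compiler code): a one-dimensional array laid out contiguously across banks of B elements each, starting in bank base_bank, and a one-dimensional affine reference X[a*i + c] executed for i in [lo, hi]. Since every touched index lies between the indices touched at the two loop extremes, and banks cover contiguous index ranges, it suffices to check the extremes:

    #include <stdbool.h>

    /* b(x) of Eq. (1): the ID of the bank holding element elem_index */
    static int bank_of(long base_bank, long elem_index, long B) {
        return (int)(base_bank + elem_index / B);
    }

    static bool bank_invariant_1d(long a, long c, long lo, long hi,
                                  long base_bank, long B) {
        /* all touched indices a*i + c lie between the two extremes,
           so the reference is bank-invariant iff the extremes share
           a bank */
        long first = a * lo + c, last = a * hi + c;
        if (first > last) { long t = first; first = last; last = t; }
        return bank_of(base_bank, first, B) == bank_of(base_bank, last, B);
    }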
We cluster memory bank accesses by splitting a given loop nest N, which may not be bank-invariant, into a set of bank-invariant loop nests N_1, N_2, ..., N_m such that [L_i, U_i] ⊆ [L, U] (where [L_i, U_i] is the iteration space of nest N_i) and such that executing these bank-invariant loop nests is equivalent to executing the original loop nest N. Let us assume that the bank size is B, that loop nest N uses a arrays, that the capacity of the data cache is C, that the data cache uses W-way associative mapping and an LRU replacement policy, and that the size of each cache line is equal to that of each array element. One can show that if B >= C/W and a <= W, clustering memory bank accesses does not increase the number of cache misses. Our compiler algorithm for clustering memory bank accesses is given in Figure 3.

    Input:  loop nest N;
    Output: a set S = {N_1, N_2, ..., N_m} of bank-invariant loop nests.

    S = {N};
    for i = 1 to m {
        // assume s_i(I) accesses array element X[f(I)]
        if (array X is stored in k banks and k > 1) {
            S' = {};
            for each loop nest N' in S {
                split N' into k loop nests N'_1, N'_2, ..., N'_k such that
                    s_i accesses only one bank within each N'_j;
                S' = S' ∪ {N'_1, N'_2, ..., N'_k};
            }
            S = S';
        }
    }

Figure 3: Algorithm for splitting loop nest N into a set of bank-invariant loop nests (N_1, N_2, ..., N_m).

Figure 4 illustrates the idea behind forming bank-invariant loop nests using an example. Part (a) of this figure gives the code of a loop nest N that accesses arrays X and Y; the bank layout of arrays X and Y is given in part (b). We can see that, due to the limited capacity of a bank, each array has to be stored in two banks: banks B_1 and B_2 for array X, and banks B_3 and B_4 for array Y. Consequently, during the execution of loop nest N, depending on the current value of the iteration vector (i, j)^T, array reference s_1 may access either bank B_1 or bank B_2, and s_2 may access either bank B_3 or bank B_4. Applying the bank memory access clustering algorithm of Figure 3 to loop nest N, we obtain four bank-invariant loop nests, N_1, N_2, N_3, and N_4, as shown in Figure 4(c). We note that the banks accessed by array references s_1 and s_2 do not change within a bank-invariant loop nest.

(a) Original loop nest N:

    N: for i = 0 to 99
         for j = 0 to 99
           s_1: ...X[i, j]...   // accesses B_1, B_2
           s_2: ...Y[j, i]...   // accesses B_3, B_4

(b) Bank layout of arrays X and Y, assuming that each bank can hold up to 5000 array elements.

(c) Bank-invariant loop nests obtained by applying bank-based memory access clustering to loop nest N:

    N_1: for i = 0 to 49            N_2: for i = 0 to 49
           for j = 0 to 49                 for j = 50 to 99
             s_1: ...X[i, j]... // B_1       s_1: ...X[i, j]... // B_1
             s_2: ...Y[j, i]... // B_3       s_2: ...Y[j, i]... // B_4

    N_3: for i = 50 to 99           N_4: for i = 50 to 99
           for j = 0 to 49                 for j = 50 to 99
             s_1: ...X[i, j]... // B_2       s_1: ...X[i, j]... // B_2
             s_2: ...Y[j, i]... // B_3       s_2: ...Y[j, i]... // B_4

Figure 4: Splitting loop nest N into a set of bank-invariant loop nests: N_1, N_2, N_3, and N_4.

One of the side benefits of clustering memory bank accesses is that it can be expected to improve cache locality as well since, at any given time frame, the memory accesses are localized in a small set of memory banks. We will return to this issue in our discussion of the experimental results.

4.4 Clustering Data Cache Misses

Our compiler algorithm for clustering data cache misses is given in Figure 5. This algorithm takes as input a bank-invariant loop nest and a cache miss clustering factor c, and generates a new loop nest in which the data cache misses are clustered. The clustering factor determines the number of cache misses that we want to cluster together.

    Input:  a bank-invariant loop nest
              N: for I in [L, U]: s_1(I), ..., s_m(I),
            where L = (l_1, l_2, ..., l_n)^T and U = (u_1, u_2, ..., u_n)^T;
            c: the cache miss clustering factor.
    Output: the code for the transformed loop nest.

    r = (u_1 - l_1 + 1) mod c;
    if (r != 0) {
        // peel the first r iterations from the outermost loop
        δ = (r - 1, 0, 0, ..., 0)^T;
        // generate code for the first r iterations
        emit "for I in [L, L + δ]: s_1(I), ..., s_m(I)";
    }
    δ = ((u_1 - l_1 + 1 - r)/c, 0, 0, ..., 0)^T;
    // compute the new loop bounds
    L' = (l_1 + r, l_2, l_3, ..., l_n)^T;
    U' = (l_1 + r + (u_1 - l_1 + 1 - r)/c - 1, u_2, u_3, ..., u_n)^T;
    // generate the loop control structure for the transformed loop nest
    emit "for I in [L', U']";
    // generate the body of the transformed loop nest
    for i = 1 to m
        emit "s_i(I), s_i(I + δ), s_i(I + 2δ), ..., s_i(I + (c - 1)δ)";
    emit "endfor";

Figure 5: Cache miss clustering algorithm.

(a) Original loop nest:

    N: for i = 0 to 59
         s_1: X[f_1(i)];
         s_2: Y[f_2(i)];

(b) Transformed loop nest:

    N': for i = 0 to 19
          s_1: X[f_1(i)]; s_1: X[f_1(i + 20)]; s_1: X[f_1(i + 40)];
          s_2: Y[f_2(i)]; s_2: Y[f_2(i + 20)]; s_2: Y[f_2(i + 40)];

(c) Mapping between the array references in the original and transformed loop nests.

Figure 6: An example showing how our approach clusters cache misses (clustering factor c = 3).

We can cluster more cache misses by using a larger clustering factor. However, increasing the clustering factor also increases the size of the program code. Further, a very large clustering factor can also increase the number of cache misses. Figure 6 illustrates how our cache miss clustering algorithm works, using a loop with 60 iterations and a clustering factor of c = 3. In this figure, we can observe that we split the iteration space of loop N into three sections: [0, 19], [20, 39], and [40, 59]. Since the values of the iteration vectors from different sections are far from each other, and the array element accessed by each array reference is determined by the value of the iteration vector, the addresses of the data elements accessed by iterations belonging to different sections tend to be far from each other. Therefore, data reuses are not likely to happen between two loop iterations from different sections. As an example, array reference s_1(i) executed in the iteration space section [20, 39] is not likely to reuse any data brought into the cache by s_1(i) executed in the iteration space section [0, 19]. Our cache miss clustering algorithm is based on this observation.
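For concreteness, here is a runnable C rendering of the Figure 6 example; the subscript functions f1 and f2 and the dummy reduction into sum are placeholders of our own choosing, standing in for the actual loop bodies:

    #include <stdio.h>

    /* hypothetical affine subscript functions standing in for f_1, f_2 */
    static int f1(int i) { return i; }
    static int f2(int i) { return i; }

    static double X[60], Y[60], sum;

    /* Original loop nest N: references s_1 and s_2 once per iteration. */
    static void original(void) {
        for (int i = 0; i <= 59; i++) {
            sum += X[f1(i)];   /* s_1 */
            sum += Y[f2(i)];   /* s_2 */
        }
    }

    /* Transformed loop nest N' (clustering factor c = 3): each iteration
       performs the work of the three sections [0,19], [20,39], [40,59],
       so the misses those references incur occur back to back. */
    static void clustered(void) {
        for (int i = 0; i <= 19; i++) {
            sum += X[f1(i)]; sum += X[f1(i + 20)]; sum += X[f1(i + 40)];
            sum += Y[f2(i)]; sum += Y[f2(i + 20)]; sum += Y[f2(i + 40)];
        }
    }

    int main(void) { original(); clustered(); printf("%g\n", sum); return 0; }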
As shown in Figure 6, we achieve our goal by clustering the array references that were originally executed in different sections of the iteration space, since these array references are likely to incur cache misses together. This clustering generates long active periods and, consequently, longer idle periods for the memory banks. An important issue regarding our implementation concerns data dependences: both bank-based clustering and cache miss clustering modify the original execution order of the loop iterations, so their legality depends on the data dependences in the loop nest. In our current implementation, we do not apply them to a loop nest if they would violate any data dependence relationship in the nest.

4.5 A Complete Example

In this section, we present a complete example showing how the two steps of our approach, explained in Sections 4.3 and 4.4, work together. Figure 7(a) shows the code for a loop nest N that accesses array A[0..99][0..99]. For the sake of explanation, let us assume that each memory bank can store up to 5,000 array elements; array A is therefore stored in two banks, B_i and B_{i+1} (see Figure 7(b)). We further assume that the size of a cache line is equal to that of an array element, that the cache miss latency is T_m cycles, and that the per-iteration execution time of loop nest N, excluding the delay due to cache misses, is T cycles. Figure 7(c) shows the state variation of bank B_i during the execution of loop nest N. We can observe that loop nest N incurs a data cache miss every T + T_m cycles, and that the length of the bank idle period between two successive cache misses is T cycles. By applying memory bank access clustering to loop nest N, we obtain two bank-invariant loop nests, N_1 (accessing only bank B_i) and N_2 (accessing only bank B_{i+1}), as shown in Figure 7(d). By applying cache miss clustering with a clustering factor of c = 2 to loop nests N_1 and N_2, we obtain loop nests N_1' and N_2' in Figure 7(e). Figure 7(f) shows the state variation of bank B_i during the execution of N_1'. In this figure, we observe that loop nests N_1' and N_2' incur two clustered cache misses every 2(T + T_m) cycles, and that the bank idle period between two successive miss clusters is 2T cycles. To sum up, clustering cache misses increases the length of the idle periods, so that we have more opportunities to put memory banks into a low-power mode; it also reduces the number of times that a bank needs to be reactivated, and thus reduces the overall reactivation overhead.

(a) A loop nest accessing array A:

    N: for i = 0 to 99
         x = A[i][j];
         ... use x ...

(b) Memory layout for array A.

(c) State variation for bank B_i during the execution of the loop nest in (a).

(d) After clustering memory bank accesses:

    N_1: for i = 0 to 49          N_2: for i = 50 to 99
           x = A[i][j];                  x = A[i][j];
           ... use x ...                 ... use x ...

(e) After clustering cache misses:

    N_1': for i = 0 to 24         N_2': for i = 50 to 74
            x_1 = A[i][j];                x_1 = A[i][j];
            x_2 = A[i + 25][j];           x_2 = A[i + 25][j];
            ... use x_1, x_2 ...          ... use x_1, x_2 ...

(f) State variation for bank B_i during the execution of the optimized loop nest in (e).

Figure 7: An example demonstrating the two steps of our approach.

5. EXPERIMENTAL EVALUATION

5.1 Setup

To evaluate our bank-aware cache miss clustering scheme, we automated our approach within an open-source compiler infrastructure called SUIF [6] and performed a series of simulation-based experiments using SimpleScalar [17]. SUIF is built as a series of compiler phases, and our approach has been implemented as a separate phase. The additional increase in compilation time due to our phase was about 27% on average, compared to the case without it. In our experiments, we assume the embedded system consists of a 400MHz processor, an 8KB 4-way data cache, and eight main memory banks. The access latencies of the data cache and the main memory are 2 and 110 cycles, respectively. However, we also performed experiments with different values for the off-chip memory access latency. The important characteristics of the benchmark codes that we used to measure the energy benefits of our cache miss clustering approach are given in Table 2. fourier and flt are a Fourier transform and a digital filtering routine, respectively; adi and cholesky are ADI and Cholesky decomposition codes; hydro2d is an array-dominated code from the SPEC benchmark suite; and tis and tsf are from the Perfect Club benchmarks. The third column in Table 2 gives the total dataset size manipulated by the corresponding code, and the last two columns give the energy consumption and execution cycles with the baseline scheme (explained below).

Benchmark   Number of Lines   Input Size   Baseline Energy   Baseline Cycles
adi         56                78MB         19.3mJ            3.92M
cholesky    34                61MB         61.1mJ            7.10M
hydro2d     52                44MB         76.3mJ            6.59M
flt         85                51MB         328.1mJ           11.57M
fourier     -                 -            411.7mJ           8.90M
tis         -                 -            511.0mJ           12.04M
tsf         -                 -            620.4mJ           16.71M

Table 2: Our benchmarks and their important characteristics.

While our code restructuring increased the executable sizes by about 14% on average, the increase in instruction cache misses was very small (less than 0.4%). In any case, in these benchmark codes, the data memory behavior is much more dominant than the instruction memory behavior. The data cache miss rates of these benchmark codes varied between 8.8% and 21.4%. The default clustering factor (see Section 4.4) used in our experiments is 8. The energy consumption results presented below capture all the energy consumed by data accesses, including the energy consumed in the data cache, the off-chip memory, and the interconnect.
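As a concrete reading of how such energy numbers can be accumulated, here is a minimal C sketch, assuming the per-cycle mode energies and reactivation latencies of Table 1; the bookkeeping structure, and the choice to charge reactivation stall cycles at the active-mode rate, are our assumptions rather than the simulator's actual accounting:

    #include <stdint.h>

    /* mode indices: 0 = active, 1 = standby, 2 = napping, 3 = power-down */
    static const double   mode_energy_nj[4] = { 3.570, 0.830, 0.320, 0.005 };
    static const uint64_t resync_cycles[4]  = { 0, 2, 30, 9000 };

    typedef struct {
        uint64_t cycles_in_mode[4];      /* cycles spent in each mode */
        uint64_t reactivations_from[4];  /* reactivations out of each mode */
    } bank_stats_t;

    static double bank_energy_nj(const bank_stats_t *s) {
        double e = 0.0;
        for (int m = 0; m < 4; m++) {
            /* residency energy for mode m */
            e += mode_energy_nj[m] * (double)s->cycles_in_mode[m];
            /* reactivation stall cycles charged at the active rate */
            e += mode_energy_nj[0] *
                 (double)(s->reactivations_from[m] * resync_cycles[m]);
        }
        return e;
    }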
We conducted experiments with five different schemes:

Baseline: This scheme does not perform any energy or performance optimization and is used as a base case against which the other schemes are normalized and compared. Its memory energy and performance numbers are given in the last two columns of Table 2.

LOOP: This represents a conventional loop restructuring scheme for optimizing cache locality. It employs two types of optimizations: loop permutation (i.e., interchanging the order of two loops to ensure unit-stride accesses in the innermost loop after the transformation) and loop tiling (i.e., generating a blocked version of a loop to improve cache locality). While this scheme does not specifically target energy, the performance improvements it brings can translate into energy savings, since the number of accesses to main memory is reduced. The specific locality optimization techniques used in this scheme are adapted from [9, 8, 10].

CMC: This is the scheme proposed and discussed in this paper. As mentioned earlier, its energy benefits come from both increased bank idleness (as a result of clustering) and improved data locality.

PRI: This is a previously-proposed scheme [16] for energy optimization in banked memories. It tries to cluster memory accesses onto a small set of banks at any given time, which improves what is called bank locality. It needs to be noted that this approach works in a cache-independent fashion; that is, it does not take cache behavior explicitly into account.

The important point to emphasize here is that all the schemes listed above, including the baseline scheme (which does not employ any code/data optimization), make use of the same underlying hardware-based low-power management scheme explained in Section 2. Therefore, the differences between the schemes can be attributed to the durations of the bank idle periods (BIT values) they generate. It is also important to note that we compare our scheme (CMC) to two previously proposed schemes: a data locality oriented one (LOOP, adapted from [9, 8, 10]) and a memory energy oriented one (PRI, adapted from [16]).

5.2 Results

We first present in Figure 8 the CDF (cumulative distribution function) of the bank inter-access times (BIT) for the original applications when all eight banks are considered together. Specifically, an (x, y) point on any curve in this plot indicates that x% of the total bank idle periods are y cycles or less. We see from this plot that there are many short idle periods, which are potential candidates for optimization. The graph in Figure 9 gives the normalized energy consumptions for our benchmarks with the four different schemes described above. Our first observation is that the LOOP scheme generates some energy savings (16.1% on average). This is a direct result of (1) optimizing data cache behavior and (2) increasing memory idleness. When we look at the results of CMC (our approach) and PRI, we see that they are competitive: in some benchmarks PRI outperforms CMC, while in others the latter generates better energy savings than the former. The average energy savings brought by PRI and CMC are 25.4% and 27.3%, respectively. These results clearly indicate that CMC is very effective in reducing energy consumption.

Figure 8: CDF of memory bank inter-access times (accumulated over all banks).
Figure 9: Normalized energy consumptions with respect to the baseline scheme.
Figure 10: CDF of bank inter-access idle times (accumulated over all banks) when our approach is used.
Figure 11: Normalized execution cycles with respect to the baseline scheme.
Figure 12: Distribution of energy savings across banks. Left: cholesky. Right: fourier.

Finally, the last bar for each benchmark code in Figure 9 gives the energy consumption with the CMC+PRI scheme (when we combine CMC and PRI). One can see that this combined approach achieves about 37.7% energy savings when averaged over all seven codes evaluated. These results also imply that, through miss clustering, one can achieve much better results than with conventional code optimizations. To explain where the energy benefits come from, we give in Figure 10 the CDF curves for the codes restructured using the CMC scheme. Comparing these curves to the corresponding ones in Figure 8, we see that our approach increases the length of the bank idle times significantly; these longer idle times in turn translate into memory energy savings. It needs to be mentioned at this point that, while both PRI and CMC are very effective in saving energy, they save energy in different ways, as discussed earlier. The PRI scheme tries to increase the idleness of each bank in the system independently. As a result, while it may obtain significant benefits with some benchmarks, it may be unsuccessful with others. In contrast, our approach works at the data cache level. Therefore, its savings are expected to be distributed uniformly across the different banks, under the assumption that the banks are accessed uniformly. The graphs in Figure 12 give the distribution of energy savings across the eight memory banks for two of our benchmarks, namely cholesky and fourier. One can see from these results that the savings with our scheme are uniform across the banks, whereas the variance with the PRI scheme can be very large in some cases. The remaining benchmarks exhibit similar patterns. We next look at the performance results given in Figure 11. As expected, the LOOP scheme achieves the best execution time improvements across all versions, since it is performance oriented. However, the results with the CMC scheme are also very good. Specifically, the execution cycle improvements brought by LOOP and CMC are 25.6% and 21.9%, respectively. This is because our miss clustering approach captures cache locality as well, in addition to bank idleness: clustering data accesses based on the set of banks they refer to enhances data locality at the cache level. Consequently, the results it generates are not far from those obtained through the LOOP scheme. Our last observation from these results is that the PRI version does not improve execution time significantly (3.6% improvement on average, and it increases original execution times in some cases). The main reason for this poor performance behavior is that this scheme does not consider data locality at all. Therefore, while its results are good from the energy saving perspective, they are not impressive as far as performance is concerned. Overall, when we consider the results given in Figures 9 and 11 together, we can say that considering bank idleness and locality together is a must if we are to improve both energy and performance.
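Since both Figures 8 and 10 summarize bank idleness through CDFs of inter-access times, the following is a hedged sketch of how such a curve can be derived from the sorted cycle timestamps of the accesses observed at one bank; this is our construction, not the paper's measurement tooling:

    #include <stdio.h>
    #include <stdlib.h>

    static int cmp_gap(const void *a, const void *b) {
        unsigned long long x = *(const unsigned long long *)a;
        unsigned long long y = *(const unsigned long long *)b;
        return (x > y) - (x < y);
    }

    static void print_bit_cdf(const unsigned long long *ts, size_t n) {
        if (n < 2) return;
        unsigned long long *gap = malloc((n - 1) * sizeof *gap);
        if (!gap) return;
        for (size_t i = 1; i < n; i++)
            gap[i - 1] = ts[i] - ts[i - 1];   /* one BIT per access pair */
        qsort(gap, n - 1, sizeof *gap, cmp_gap);
        /* a point (x, y): x% of the idle periods are y cycles or less */
        for (size_t i = 0; i < n - 1; i++)
            printf("%6.2f%%  %llu\n",
                   100.0 * (double)(i + 1) / (double)(n - 1), gap[i]);
        free(gap);
    }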
We next present results with varying data cache capacities (all 4-way associative, as in the default configuration). The default cache capacity used in our experiments so far was 16KB. The results with different cache capacities are given in Figure 13 for the benchmark cholesky; the trends observed with the remaining benchmarks were similar, so we do not present them. One can see from this graph that the PRI scheme is not very sensitive to the data cache capacity used. This is because it operates in a cache-oblivious way; however, since a different cache capacity changes the hit/miss behavior, it changes the absolute savings obtained by PRI. In comparison, the effectiveness of CMC and LOOP is reduced as the cache capacity is increased. This is expected because the results are normalized with respect to the baseline approach without any code optimization, and the behavior of this baseline improves significantly as we increase the data cache capacity. In fact, in the extreme case, a very large cache capacity can make any data locality optimization useless. However, we also note that CMC is affected less than LOOP when we increase the cache capacity, mainly because its energy savings do not come entirely from locality improvement but also from increasing bank idle times through miss clustering. We observed similar trends when we fixed the cache capacity at its default value and changed the cache associativity: with increased associativity, the PRI scheme does not exhibit a significant variance, and the effectiveness of CMC and LOOP is reduced. Our next set of experiments studies the impact of the off-chip access latency on our results (obtained using the CMC scheme) and is presented in Figure 14 for cholesky.

Figure 13: Normalized energy consumptions with varying data cache capacities (cholesky).
Figure 14: Normalized energy consumptions with varying off-chip latency values (cholesky).
Figure 15: Influence of the clustering factor on energy consumption and execution cycles.

The default value for the off-chip memory access latency was 110 cycles. We see from Figure 14 that the energy results are not very sensitive to variations in the off-chip access latency, though we observe a declining trend in normalized energy consumption with increased latency values. However, as expected, the execution cycle savings achieved by our approach shrink more than the energy savings do when we reduce the off-chip access latency. We still observe, however, that even with an off-chip access latency of 60 cycles, our approach improves execution cycles by 5.56% for this benchmark (7.27% on average across all the benchmarks).

Recall that an important parameter in our approach is the clustering factor defined in Section 4.4. As mentioned earlier, the default value for this parameter used in our experiments so far was 8. We also performed experiments with different values of this parameter; the results are given in Figure 15. In this plot, we give the normalized energy consumption and the normalized execution cycles averaged over all the benchmarks in our experimental suite. We present only the average results since the results for the different benchmarks are very similar to each other. We observe an interesting trend in these results: both energy consumption and execution cycles first decrease as we increase the value of the clustering factor, but beyond a point they show an increasing trend. This can be explained as follows. As mentioned earlier, increasing the value of the clustering factor is good from the viewpoint of clustering cache misses; in a sense, the larger its value, the better the miss clustering we have. However, this increase also widens the gap (in terms of intervening accesses) between two successive accesses to a given cache line. As a result, when the clustering factor becomes large enough, it can cause extra cache misses, which increase both execution cycles and memory energy consumption.

6. CONCLUDING REMARKS

This paper proposes a cache-conscious memory energy reduction scheme based on the concept of cache miss clustering. The idea is to cluster data cache misses and thus cluster banked memory accesses and idle periods. This increase in idle periods then allows better energy management via low-power operating modes and eventually leads to better energy savings. Our experimental analysis shows that our approach is effective in reducing both energy consumption and execution cycles, and that it is more effective than previously-proposed schemes. Our results also indicate that the energy savings achieved by our approach are consistent across different cache sizes, cache associativities, and off-chip access latencies.

7. REFERENCES

[1] F. Catthoor, S. Wuytack, E. D. Greef, F. Balasa, L. Nachtergaele, and A. Vandecappelle. Custom Memory Management Methodology: Exploration of Memory Organization for Embedded Multimedia System Design. Kluwer Academic Publishers.
[2] V. Delaluz, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin. Energy-oriented compiler optimizations for partitioned memory architectures. In Proc. International Conference on Compilers, Architecture, and Synthesis for Embedded Systems.
[3] V. Delaluz, M. Kandemir, N. Vijaykrishnan, A. Sivasubramaniam, and M. J. Irwin. DRAM energy management using software and hardware directed power mode control. In Proc. 7th International Conference on High Performance Computer Architecture, Monterrey, Mexico.
[4] X. Fan, C. S. Ellis, and A. R. Lebeck. Memory controller policies for DRAM power management. In Proc. International Symposium on Low Power Electronics and Design.
[5] K. I. Farkas, J. Flinn, G. Back, D. Grunwald, and J.-A. M. Anderson. Quantifying the energy consumption of a pocket computer and a Java virtual machine. In Proc. SIGMETRICS.
[6] M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, E. Bugnion, and M. S. Lam. Maximizing multiprocessor performance with the SUIF compiler. IEEE Computer.
[7] A. R. Lebeck, X. Fan, H. Zeng, and C. S. Ellis. Power aware page allocation. In Proc. 9th International Conference on Architectural Support for Programming Languages and Operating Systems.
[8] W. Li. Compiling for NUMA Parallel Machines. Ph.D. Thesis, Computer Science Department, Cornell University, Ithaca, NY.
[9] K. McKinley, S. Carr, and C.-W. Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems.
[10] M. O'Boyle and P. Knijnenburg. Integrating loop and data transformations for global optimisation. In Proc. International Conference on Parallel Architectures and Compilation Techniques.
[11] V. S. Pai and S. V. Adve. Code transformations to improve memory parallelism. In Proc. 32nd International Symposium on Microarchitecture.
[12] V. S. Pai and S. V. Adve. Comparing and combining read miss clustering and software prefetching. In Proc. International Symposium on Parallel Architectures and Compilation Techniques.
[13] 128/144-MBit Direct RDRAM Data Sheet. Rambus Inc.
[14] Rambus Inc.
[15] M. A. R. Saghir, P. Chow, and C. G. Lee. Exploiting dual data-memory banks in digital signal processors. In Proc. International Conference on Architectural Support for Programming Languages and Operating Systems.
[16] U. Sezer, G. Chen, M. Kandemir, H. Saputra, and M. J. Irwin. Exploiting bank locality in multi-bank memories. In Proc. International Conference on Compilers, Architecture, and Synthesis for Embedded Systems.
[17] SimpleScalar LLC.
[18] A. Sudarsanam and S. Malik. Simultaneous reference allocation in code generation for dual data memory bank ASIPs. ACM Transactions on Design Automation of Electronic Systems, 5, 2000.
[19] N. Vijaykrishnan, M. Kandemir, M. J. Irwin, H. Y. Kim, and W. Ye. Energy-driven integrated hardware-software optimizations using SimplePower. In Proc. International Symposium on Computer Architecture, 2000.


Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness

Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Journal of Instruction-Level Parallelism 13 (11) 1-14 Submitted 3/1; published 1/11 Efficient Prefetching with Hybrid Schemes and Use of Program Feedback to Adjust Prefetcher Aggressiveness Santhosh Verma

More information

Computer Sciences Department

Computer Sciences Department Computer Sciences Department SIP: Speculative Insertion Policy for High Performance Caching Hongil Yoon Tan Zhang Mikko H. Lipasti Technical Report #1676 June 2010 SIP: Speculative Insertion Policy for

More information

Compiling for Advanced Architectures

Compiling for Advanced Architectures Compiling for Advanced Architectures In this lecture, we will concentrate on compilation issues for compiling scientific codes Typically, scientific codes Use arrays as their main data structures Have

More information

Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM

Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM Migration Based Page Caching Algorithm for a Hybrid Main Memory of DRAM and PRAM Hyunchul Seok Daejeon, Korea hcseok@core.kaist.ac.kr Youngwoo Park Daejeon, Korea ywpark@core.kaist.ac.kr Kyu Ho Park Deajeon,

More information

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System

A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System A Data Centered Approach for Cache Partitioning in Embedded Real- Time Database System HU WEI, CHEN TIANZHOU, SHI QINGSONG, JIANG NING College of Computer Science Zhejiang University College of Computer

More information

SF-LRU Cache Replacement Algorithm

SF-LRU Cache Replacement Algorithm SF-LRU Cache Replacement Algorithm Jaafar Alghazo, Adil Akaaboune, Nazeih Botros Southern Illinois University at Carbondale Department of Electrical and Computer Engineering Carbondale, IL 6291 alghazo@siu.edu,

More information

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University

15-740/ Computer Architecture Lecture 20: Main Memory II. Prof. Onur Mutlu Carnegie Mellon University 15-740/18-740 Computer Architecture Lecture 20: Main Memory II Prof. Onur Mutlu Carnegie Mellon University Today SRAM vs. DRAM Interleaving/Banking DRAM Microarchitecture Memory controller Memory buses

More information

Outline. Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis

Outline. Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis Memory Optimization Outline Issues with the Memory System Loop Transformations Data Transformations Prefetching Alias Analysis Memory Hierarchy 1-2 ns Registers 32 512 B 3-10 ns 8-30 ns 60-250 ns 5-20

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation

ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation ECE902 Virtual Machine Final Project: MIPS to CRAY-2 Binary Translation Weiping Liao, Saengrawee (Anne) Pratoomtong, and Chuan Zhang Abstract Binary translation is an important component for translating

More information

SAMBA-BUS: A HIGH PERFORMANCE BUS ARCHITECTURE FOR SYSTEM-ON-CHIPS Λ. Ruibing Lu and Cheng-Kok Koh

SAMBA-BUS: A HIGH PERFORMANCE BUS ARCHITECTURE FOR SYSTEM-ON-CHIPS Λ. Ruibing Lu and Cheng-Kok Koh BUS: A HIGH PERFORMANCE BUS ARCHITECTURE FOR SYSTEM-ON-CHIPS Λ Ruibing Lu and Cheng-Kok Koh School of Electrical and Computer Engineering Purdue University, West Lafayette, IN 797- flur,chengkokg@ecn.purdue.edu

More information

A Layout-Conscious Iteration Space Transformation Technique

A Layout-Conscious Iteration Space Transformation Technique IEEE TRANSACTIONS ON COMPUTERS, VOL 50, NO 12, DECEMBER 2001 1321 A Layout-Conscious Iteration Space Transformation Technique Mahmut Kandemir, Member, IEEE, J Ramanujam, Member, IEEE, Alok Choudhary, Senior

More information

Tiling: A Data Locality Optimizing Algorithm

Tiling: A Data Locality Optimizing Algorithm Tiling: A Data Locality Optimizing Algorithm Previously Unroll and Jam Homework PA3 is due Monday November 2nd Today Unroll and Jam is tiling Code generation for fixed-sized tiles Paper writing and critique

More information

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency

Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Profiling-Based L1 Data Cache Bypassing to Improve GPU Performance and Energy Efficiency Yijie Huangfu and Wei Zhang Department of Electrical and Computer Engineering Virginia Commonwealth University {huangfuy2,wzhang4}@vcu.edu

More information

Memory Management! Goals of this Lecture!

Memory Management! Goals of this Lecture! Memory Management! Goals of this Lecture! Help you learn about:" The memory hierarchy" Why it works: locality of reference" Caching, at multiple levels" Virtual memory" and thereby " How the hardware and

More information

CSC D70: Compiler Optimization Prefetching

CSC D70: Compiler Optimization Prefetching CSC D70: Compiler Optimization Prefetching Prof. Gennady Pekhimenko University of Toronto Winter 2018 The content of this lecture is adapted from the lectures of Todd Mowry and Phillip Gibbons DRAM Improvement

More information

L2 cache provides additional on-chip caching space. L2 cache captures misses from L1 cache. Summary

L2 cache provides additional on-chip caching space. L2 cache captures misses from L1 cache. Summary HY425 Lecture 13: Improving Cache Performance Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 25, 2011 Dimitrios S. Nikolopoulos HY425 Lecture 13: Improving Cache Performance 1 / 40

More information

Memory Management! How the hardware and OS give application pgms:" The illusion of a large contiguous address space" Protection against each other"

Memory Management! How the hardware and OS give application pgms: The illusion of a large contiguous address space Protection against each other Memory Management! Goals of this Lecture! Help you learn about:" The memory hierarchy" Spatial and temporal locality of reference" Caching, at multiple levels" Virtual memory" and thereby " How the hardware

More information

Data Parallel Architectures

Data Parallel Architectures EE392C: Advanced Topics in Computer Architecture Lecture #2 Chip Multiprocessors and Polymorphic Processors Thursday, April 3 rd, 2003 Data Parallel Architectures Lecture #2: Thursday, April 3 rd, 2003

More information

Instruction Subsetting: Trading Power for Programmability

Instruction Subsetting: Trading Power for Programmability Instruction Subsetting: Trading Power for Programmability William E. Dougherty, David J. Pursley, Donald E. Thomas [wed,pursley,thomas]@ece.cmu.edu Department of Electrical and Computer Engineering Carnegie

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Guiding the optimization of parallel codes on multicores using an analytical cache model

Guiding the optimization of parallel codes on multicores using an analytical cache model Guiding the optimization of parallel codes on multicores using an analytical cache model Diego Andrade, Basilio B. Fraguela, and Ramón Doallo Universidade da Coruña, Spain {diego.andrade,basilio.fraguela,ramon.doalllo}@udc.es

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

Garbage Collection (2) Advanced Operating Systems Lecture 9

Garbage Collection (2) Advanced Operating Systems Lecture 9 Garbage Collection (2) Advanced Operating Systems Lecture 9 Lecture Outline Garbage collection Generational algorithms Incremental algorithms Real-time garbage collection Practical factors 2 Object Lifetimes

More information

Benchmarking the Memory Hierarchy of Modern GPUs

Benchmarking the Memory Hierarchy of Modern GPUs 1 of 30 Benchmarking the Memory Hierarchy of Modern GPUs In 11th IFIP International Conference on Network and Parallel Computing Xinxin Mei, Kaiyong Zhao, Chengjian Liu, Xiaowen Chu CS Department, Hong

More information

Program Transformations for the Memory Hierarchy

Program Transformations for the Memory Hierarchy Program Transformations for the Memory Hierarchy Locality Analysis and Reuse Copyright 214, Pedro C. Diniz, all rights reserved. Students enrolled in the Compilers class at the University of Southern California

More information

CHAPTER 4 OPTIMIZATION OF WEB CACHING PERFORMANCE BY CLUSTERING-BASED PRE-FETCHING TECHNIQUE USING MODIFIED ART1 (MART1)

CHAPTER 4 OPTIMIZATION OF WEB CACHING PERFORMANCE BY CLUSTERING-BASED PRE-FETCHING TECHNIQUE USING MODIFIED ART1 (MART1) 71 CHAPTER 4 OPTIMIZATION OF WEB CACHING PERFORMANCE BY CLUSTERING-BASED PRE-FETCHING TECHNIQUE USING MODIFIED ART1 (MART1) 4.1 INTRODUCTION One of the prime research objectives of this thesis is to optimize

More information

Compiler Algorithms for Optimizing Locality and Parallelism on Shared and Distributed-Memory Machines

Compiler Algorithms for Optimizing Locality and Parallelism on Shared and Distributed-Memory Machines Journal of Parallel and Distributed Computing 6, 924965 (2) doi:.6jpdc.2.639, available online at http:www.idealibrary.com on Compiler Algorithms for Optimizing Locality and Parallelism on Shared and Distributed-Memory

More information

Memory Management. Goals of this Lecture. Motivation for Memory Hierarchy

Memory Management. Goals of this Lecture. Motivation for Memory Hierarchy Memory Management Goals of this Lecture Help you learn about: The memory hierarchy Spatial and temporal locality of reference Caching, at multiple levels Virtual memory and thereby How the hardware and

More information

PARALLEL MULTI-DELAY SIMULATION

PARALLEL MULTI-DELAY SIMULATION PARALLEL MULTI-DELAY SIMULATION Yun Sik Lee Peter M. Maurer Department of Computer Science and Engineering University of South Florida Tampa, FL 33620 CATEGORY: 7 - Discrete Simulation PARALLEL MULTI-DELAY

More information

ABC basics (compilation from different articles)

ABC basics (compilation from different articles) 1. AIG construction 2. AIG optimization 3. Technology mapping ABC basics (compilation from different articles) 1. BACKGROUND An And-Inverter Graph (AIG) is a directed acyclic graph (DAG), in which a node

More information

Loop Transformations! Part II!

Loop Transformations! Part II! Lecture 9! Loop Transformations! Part II! John Cavazos! Dept of Computer & Information Sciences! University of Delaware! www.cis.udel.edu/~cavazos/cisc879! Loop Unswitching Hoist invariant control-flow

More information

Data Reuse Exploration under Area Constraints for Low Power Reconfigurable Systems

Data Reuse Exploration under Area Constraints for Low Power Reconfigurable Systems Data Reuse Exploration under Area Constraints for Low Power Reconfigurable Systems Qiang Liu, George A. Constantinides, Konstantinos Masselos and Peter Y.K. Cheung Department of Electrical and Electronics

More information

Architecture Tuning Study: the SimpleScalar Experience

Architecture Tuning Study: the SimpleScalar Experience Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

Introduction. Stream processor: high computation to bandwidth ratio To make legacy hardware more like stream processor: We study the bandwidth problem

Introduction. Stream processor: high computation to bandwidth ratio To make legacy hardware more like stream processor: We study the bandwidth problem Introduction Stream processor: high computation to bandwidth ratio To make legacy hardware more like stream processor: Increase computation power Make the best use of available bandwidth We study the bandwidth

More information

AN ASSOCIATIVE TERNARY CACHE FOR IP ROUTING. 1. Introduction. 2. Associative Cache Scheme

AN ASSOCIATIVE TERNARY CACHE FOR IP ROUTING. 1. Introduction. 2. Associative Cache Scheme AN ASSOCIATIVE TERNARY CACHE FOR IP ROUTING James J. Rooney 1 José G. Delgado-Frias 2 Douglas H. Summerville 1 1 Dept. of Electrical and Computer Engineering. 2 School of Electrical Engr. and Computer

More information

A Memory Management Scheme for Hybrid Memory Architecture in Mission Critical Computers

A Memory Management Scheme for Hybrid Memory Architecture in Mission Critical Computers A Memory Management Scheme for Hybrid Memory Architecture in Mission Critical Computers Soohyun Yang and Yeonseung Ryu Department of Computer Engineering, Myongji University Yongin, Gyeonggi-do, Korea

More information

Low Power Set-Associative Cache with Single-Cycle Partial Tag Comparison

Low Power Set-Associative Cache with Single-Cycle Partial Tag Comparison Low Power Set-Associative Cache with Single-Cycle Partial Tag Comparison Jian Chen, Ruihua Peng, Yuzhuo Fu School of Micro-electronics, Shanghai Jiao Tong University, Shanghai 200030, China {chenjian,

More information

Affine and Unimodular Transformations for Non-Uniform Nested Loops

Affine and Unimodular Transformations for Non-Uniform Nested Loops th WSEAS International Conference on COMPUTERS, Heraklion, Greece, July 3-, 008 Affine and Unimodular Transformations for Non-Uniform Nested Loops FAWZY A. TORKEY, AFAF A. SALAH, NAHED M. EL DESOUKY and

More information

Cost Models for Query Processing Strategies in the Active Data Repository

Cost Models for Query Processing Strategies in the Active Data Repository Cost Models for Query rocessing Strategies in the Active Data Repository Chialin Chang Institute for Advanced Computer Studies and Department of Computer Science University of Maryland, College ark 272

More information

1 Motivation for Improving Matrix Multiplication

1 Motivation for Improving Matrix Multiplication CS170 Spring 2007 Lecture 7 Feb 6 1 Motivation for Improving Matrix Multiplication Now we will just consider the best way to implement the usual algorithm for matrix multiplication, the one that take 2n

More information

Reduction in Cache Memory Power Consumption based on Replacement Quantity

Reduction in Cache Memory Power Consumption based on Replacement Quantity Journal of Computer Engineering 1 (9) 3-11 Reduction in Cache Memory Power Consumption based on Replacement Quantity M.Jafarsalehi Islamic Azad University of Iran Tabriz-branch jafarsalehi_m@yahoo.com

More information

Best Practices for SSD Performance Measurement

Best Practices for SSD Performance Measurement Best Practices for SSD Performance Measurement Overview Fast Facts - SSDs require unique performance measurement techniques - SSD performance can change as the drive is written - Accurate, consistent and

More information

University oftoronto. Queens University. Abstract. This paper gives an overview of locality enhancement techniques

University oftoronto. Queens University. Abstract. This paper gives an overview of locality enhancement techniques Locality Enhancement for Large-Scale Shared-Memory Multiprocessors Tarek Abdelrahman 1, Naraig Manjikian 2,GaryLiu 3 and S. Tandri 3 1 Department of Electrical and Computer Engineering University oftoronto

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

OpenMP for next generation heterogeneous clusters

OpenMP for next generation heterogeneous clusters OpenMP for next generation heterogeneous clusters Jens Breitbart Research Group Programming Languages / Methodologies, Universität Kassel, jbreitbart@uni-kassel.de Abstract The last years have seen great

More information

Using Intel Streaming SIMD Extensions for 3D Geometry Processing

Using Intel Streaming SIMD Extensions for 3D Geometry Processing Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,

More information

Dynamic Partitioning of Processing and Memory Resources in Embedded MPSoC Architectures

Dynamic Partitioning of Processing and Memory Resources in Embedded MPSoC Architectures Dynamic Partitioning of Processing and Memory Resources in Embedded MPSoC Architectures Liping Xue, Ozcan ozturk, Feihui Li, Mahmut Kandemir Computer Science and Engineering Department Pennsylvania State

More information

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4. Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that

More information

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage

Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end

More information

CS 31: Intro to Systems Virtual Memory. Kevin Webb Swarthmore College November 15, 2018

CS 31: Intro to Systems Virtual Memory. Kevin Webb Swarthmore College November 15, 2018 CS 31: Intro to Systems Virtual Memory Kevin Webb Swarthmore College November 15, 2018 Reading Quiz Memory Abstraction goal: make every process think it has the same memory layout. MUCH simpler for compiler

More information

CS 433 Homework 4. Assigned on 10/17/2017 Due in class on 11/7/ Please write your name and NetID clearly on the first page.

CS 433 Homework 4. Assigned on 10/17/2017 Due in class on 11/7/ Please write your name and NetID clearly on the first page. CS 433 Homework 4 Assigned on 10/17/2017 Due in class on 11/7/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.

More information

Improving Real-Time Performance on Multicore Platforms Using MemGuard

Improving Real-Time Performance on Multicore Platforms Using MemGuard Improving Real-Time Performance on Multicore Platforms Using MemGuard Heechul Yun University of Kansas 2335 Irving hill Rd, Lawrence, KS heechul@ittc.ku.edu Abstract In this paper, we present a case-study

More information