Cache Miss Clustering for Banked Memory Systems
O. Ozturk, G. Chen, M. Kandemir
Computer Science and Engineering Department
Pennsylvania State University, University Park, PA 16802, USA
{ozturk, gchen,

M. Karakoy
Department of Computing
Imperial College, London SW7 2AZ, UK

ABSTRACT

One of the previously-proposed techniques for reducing memory energy consumption is memory banking. The idea is to divide the memory space into multiple banks and place currently unused (idle) banks into a low-power operating mode. Prior studies in memory energy optimization via low-power modes, in both the hardware and software domains, do not take data cache behavior explicitly into account. As a consequence, the energy savings achieved by these techniques can be unpredictable due to dynamic cache behavior at runtime. The main contribution of this paper is a compiler optimization, called bank-aware cache miss clustering, that increases the idle durations of memory banks and, as a result, enables better exploitation of the low-power capabilities supported by the memory system. This works because clustering cache misses helps to cluster cache hits as well, which in turn increases bank idleness. We implemented our cache miss clustering approach within a compilation framework and tested it using seven array-intensive application codes. Our experiments show that cache miss clustering saves significant memory energy as a result of the increased idle periods of memory banks.

1. INTRODUCTION AND MOTIVATION

One of the energy reduction techniques currently employed by memory chips is low-power operating modes. A low-power operating mode in DRAMs is typically implemented by shutting off certain components within the memory chip. It is typically used in conjunction with a banked memory system, where the memory banks that are not used by the current computation can be put into a low-power mode.
Most of the memory energy management schemes via low-power modes studied in the literature, however, do not take data cache behavior explicitly into account. In other words, they operate, in a sense, in a cache-oblivious fashion. As a consequence, the results achieved by these techniques can be unpredictable due to dynamic cache behavior at runtime. In principle, taking data cache behavior into account can allow a more effective management of the available low-power operating modes, and this can in turn boost energy savings. The main goal of the work in this paper is to study a compiler-directed code restructuring scheme for increasing the energy savings coming from low-power modes used in banked memories. The proposed scheme employs cache miss clustering, and tries to increase bank idle times, which allows better use of the low-power operating modes supported by the underlying memory hardware. More specifically, cache miss clustering clusters data cache misses, and this helps cluster accesses to main memory (which is banked). This clustering of accesses also means clustering of memory idle cycles, which in turn means longer idle periods for banks and thus higher energy savings.

(This work is supported in part by NSF CAREER # and a fund from GSRC. ICCAD 06, November 5-9, 2006, San Jose, CA. Copyright 2006 ACM.)
While a variant of cache miss clustering has been used by previous research for enhancing memory-level parallelism [11], to the best of our knowledge, this paper presents the first study that uses cache miss clustering for increasing energy benefits in a banked memory system. We automated our approach within an optimizing compiler built upon the SUIF infrastructure from Stanford University [6] and performed experiments with a set of embedded applications. Our extensive experimental analysis indicates that cache miss clustering is very effective in increasing the average length of the idle periods of memory banks. As a result of this increase, the memory energy savings achieved through low-power modes increase significantly. Our experimental results also show that cache miss clustering reduces execution cycles.

The rest of this paper is organized as follows. Section 2 revisits the banked memory concept and low-power operating modes. Section 3 discusses prior work on energy optimization for memory banks. Section 4 presents our approach in detail and gives our compiler algorithm. Section 5 presents an experimental evaluation of our approach. Section 6 concludes the paper.

2. POWER MODES OF MEMORY BANKS

In this work, we focus on an RDRAM-like off-chip memory architecture [13] where off-chip memory is partitioned into several banks, each of which can be activated or deactivated independently of the others. In this architecture, when a memory bank is not actively used, it can be placed into a low-power operating mode. While in a low-power mode, a bank typically consumes much less energy than in the active mode. However, a bank in a low-power mode needs to be reactivated before its contents can be accessed, which incurs a reactivation overhead. Typically, a more energy-saving low-power operating mode also has a higher reactivation overhead. Thus, it is important to select the most appropriate low-power mode to switch to when a bank becomes idle.
In this study, we assume four different operating modes: an active mode (the mode during which memory read/write activity can occur) and three low-power modes, namely, standby, napping, and power-down. Current DRAMs [13] support up to six power modes, with a few of them supporting only two modes. We collapse the read, write, and active-without-read-or-write modes into a single mode, called the active mode, in our experimentation. However, one may choose to vary the number of modes based on the target DRAM architecture. The energy consumptions and reactivation overheads for these operating modes are given in Table 1. The energy values shown in this table have been obtained from the measured current values associated with memory modules documented in memory data sheets (for a 3.3 V, 2.5 nsec cycle time, 8 MB memory) [14]. The reactivation times (overheads) are also obtained from data sheets. Based on trends gleaned from data sheets, the energy values are increased by 30% when the module size is doubled.

Mode         Energy Consumption   Reactivation Overhead
Active       3.570 nJ             0 cycles
Standby      0.830 nJ             2 cycles
Napping      0.320 nJ             30 cycles
Power-Down   0.005 nJ             9,000 cycles

Table 1: Energy consumptions (per cycle) and reactivation times for the different operating modes. These numbers include both the dynamic (switching) and static (leakage) components.

An important parameter that helps us choose the most suitable low-power mode is the bank inter-access time (BIT), i.e., the time between successive accesses (requests) to a given memory bank. Obviously, the larger the BIT, the more aggressive the low-power mode that can be exploited. The problem of effective power mode utilization can then be defined as one of accurately estimating the BIT and using this information to select the most suitable low-power mode. This estimation can be done in software using the compiler [3, 2] or OS support [7], in hardware using a prediction mechanism attached to the memory controller [3], or by a combination of both. While compiler-based techniques have the advantage of predicting the BIT accurately for a specific class of applications, runtime and hardware based techniques are better able to capture runtime variations in access patterns (e.g., those due to cache hits/misses). In this paper, we employ a hardware-based BIT prediction mechanism, similar to the mechanisms used in current memory controllers. Specifically, after 10 cycles of idleness, the corresponding bank is put in the standby mode. Subsequently, if the bank is not referenced for another 100 cycles, it is transitioned into the napping mode.
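Table 1 already determines how profitable each mode is for a given idle period. As an illustration, the sketch below (our own code, not the controller's actual policy) picks the mode that minimizes energy over an idle period of a given length; charging the reactivation cycles at the active-mode rate is a modeling assumption on our part:

```python
# Per-cycle energies (nJ) and reactivation overheads (cycles) from Table 1.
MODES = {
    "active":     (3.570, 0),
    "standby":    (0.830, 2),
    "napping":    (0.320, 30),
    "power-down": (0.005, 9000),
}

def idle_energy(mode, bit):
    """Energy (nJ) spent if a bank sits in `mode` for an idle period of
    `bit` cycles; the reactivation cycles paid on the next access are
    charged at the active-mode rate (a modeling assumption)."""
    per_cycle, reactivation = MODES[mode]
    return per_cycle * bit + MODES["active"][0] * reactivation

def best_mode(bit):
    """Mode that minimizes energy for a predicted bank inter-access time."""
    return min(MODES, key=lambda mode: idle_energy(mode, bit))
```

For a very short idle period the reactivation penalty dominates and staying active is cheapest; progressively longer idle periods justify standby, napping, and eventually power-down. This is why accurate BIT prediction matters, and why lengthening idle periods enables more aggressive modes.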
Finally, if the bank is not referenced for a further 1,000,000 cycles, it is put into the power-down mode. Whenever the bank is referenced, it is brought back into the active mode, incurring the corresponding reactivation overhead (based on the mode it was in). It needs to be mentioned, however, that our bank-aware cache miss clustering scheme works with software-based BIT prediction schemes as well. We focus on a single-program environment, and do not consider the existence of a virtual memory system. Exploring the energy impact of our approach in the presence of virtual address translation is part of our planned future research.

3. RELATED WORK

While there exists a large body of work on memory performance evaluation [18, 15], banked-memory-based studies for energy reduction are relatively new. It has been observed [5, 19, 1] that the memory system is a dominant consumer of the overall system energy, making it a ripe candidate for software and hardware optimizations and thus serving as a strong motivation for the research presented in this paper. Lebeck et al [7] have presented OS-based strategies for optimizing the energy consumption of a banked memory architecture. The idea is to perform page allocation in an energy-efficient manner by taking into account the banked structure of the main memory. Delaluz et al [3] have discussed architectural and compiler techniques for low-power operating mode management. They have also discussed the bank pre-activation strategy used in this study. Fan et al [4] have shown how to set memory controller policies for the best memory energy behavior. Delaluz et al [2] have presented an experimental evaluation of some well-known classical high-level compiler optimizations on memory energy. In contrast to these previous efforts, the work described in this paper takes the data cache into account explicitly, and restructures the application code to cluster data cache misses.
In this respect, it is different from the previous studies on banked memory optimization. However, the proposed approach can co-exist with many of the previously-proposed memory energy optimization techniques. In fact, our point is that, since our approach increases bank idleness, it will increase the effectiveness of any hardware or software based scheme that makes use of low-power operating modes.

Figure 1: Overview of our approach.

The only cache-miss-clustering-related studies that we are aware of are [11] and [12]. The first of these studies uses a variant of cache miss clustering as a means of improving memory parallelism in high-performance computing, and the second one compares it against software prefetching, a well-known latency hiding mechanism. Our cache miss clustering scheme is very different from these prior papers. First, our approach is bank sensitive, and consequently it is preceded by a code transformation that isolates the loop iterations that access a given set of memory banks. Clustering misses without considering how the arrays are mapped to the banks in the memory system would not be very useful in our context. Second, our approach is oriented towards reducing energy consumption rather than improving memory-level parallelism in high-end machines. As a result, our code transformation approach is entirely different from the one in [11, 12], which is based on loop strip-mining, a loop-level code restructuring technique.

4. COMPILER APPROACH

Figure 1 shows our two-step approach to restructuring code to cluster data cache misses. The first part, which is explained in Section 4.3, clusters array references (load/store operations) based on the memory banks they access. The idea can be summarized as decomposing a loop nest into multiple smaller loop nests such that each of the resulting smaller nests accesses the same set of banks throughout its execution.
The second part, which is explained in Section 4.4, handles the small loop nests generated by the first part one by one, and transforms each loop nest such that its data cache misses are clustered. Our approach operates under the assumption of cache line alignment, which means that each array starts at the beginning of a new cache line. Many compilers have directives that enforce this type of cache line alignment. Also, we assume that before our approach is invoked, array-to-bank assignment has already been performed. Our approach takes the size of a bank as a parameter; changing the bank size requires recompilation of the application in order to maximize the benefit that can be achieved by cache miss clustering. This may be a problem for an application that needs to run without modification on multiple systems with different configurations. However, for embedded systems, where the software is usually co-designed with the hardware, this is not a serious drawback.

4.1 Motivational Example

Figure 2 illustrates an example to motivate cache miss clustering. For ease of illustration, we consider a single bank in this example. In (a), we give the original loop nest we want to optimize. Part (b) shows the order of memory accesses: each circle indicates an array element, and the arrows capture the traversal order of the array elements. We also highlight the cache line boundaries to show how the execution transitions from one cache line to another, assuming a cache line holds three elements. Part (c) of the figure illustrates the idle and active periods caused by the access pattern in part (b). We note that we incur a single cache miss in every three accesses. Assuming a cache hit latency of T_h cycles, the length of each idle period is 2T_h cycles. Our cache miss clustering strategy (to be explained shortly) transforms the original loop nest in (a) to the one shown in part (d).
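The miss patterns of the original and transformed nests of Figure 2 can be reproduced with a small simulation. The sketch below is illustrative only (all names are ours); it assumes a row-major layout for A, a cache line holding three elements, and a miss on the first touch of each line:

```python
LINE = 3   # elements per cache line, as assumed in Figure 2
COLS = 6   # A has 6 columns (j = 0..5)

def addr(i, j):
    """Row-major linearized address of A[i][j]."""
    return i * COLS + j

def miss_pattern(accesses):
    """1 = cache miss (first touch of a cache line), 0 = hit."""
    seen, pattern = set(), []
    for a in accesses:
        line = a // LINE
        pattern.append(0 if line in seen else 1)
        seen.add(line)
    return pattern

def miss_runs(pattern):
    """Lengths of the maximal runs of consecutive misses."""
    runs, cur = [], 0
    for m in pattern:
        if m:
            cur += 1
        elif cur:
            runs.append(cur)
            cur = 0
    if cur:
        runs.append(cur)
    return runs

# Original nest (Figure 2(a)): for i = 0..3, for j = 0..5, access A[i][j]
original = [addr(i, j) for i in range(4) for j in range(6)]

# Transformed nest (Figure 2(d)): for i = 0..1, for j = 0..5,
# access A[i][j] and A[i+2][j] in the same iteration
transformed = [a for i in range(2) for j in range(6)
               for a in (addr(i, j), addr(i + 2, j))]
```

Both versions incur the same eight misses, but the transformed nest pairs them (runs of two consecutive misses instead of isolated ones), so the hit periods between miss clusters are twice as long, matching the 2T_h-to-4T_h change in the figure.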
Part (e) illustrates the resulting data access pattern, and part (f) shows the active and idle periods for the transformed loop nest. The important point here is that the new access pattern clusters the two cache misses together, and this in turn increases the length of each idle period from 2T_h cycles to 4T_h cycles. This new pattern is clearly better than the original one from the perspective of energy
saving. This is because we now have longer idle periods and can potentially use a more aggressive low-power operating mode for this bank.

(a) Original loop nest:
      for i = 0 to 3
        for j = 0 to 5
          ...A[i, j]...

(b) Order of memory accesses for the original loop nest. Each shaded block represents a cache line.

(c) Bank state variation during the execution of the original loop nest. The length of each idle period is 2T_h cycles, where T_h is the cache hit latency.

(d) Transformed loop nest:
      for i = 0 to 1
        for j = 0 to 5
          ...A[i, j]...
          ...A[i+2, j]...

(e) Order of memory accesses for the transformed loop nest. Each shaded block represents a cache line.

(f) Bank state variation during the execution of the transformed loop nest. The length of each idle period is 4T_h cycles, where T_h is the cache hit latency.

Figure 2: An example motivating cache miss clustering.

4.2 Notation

For ease of discussion, let us first define some notation used in the rest of this paper. Note that this paper focuses on array-based, loop-intensive programs, e.g., embedded multimedia applications. Such a program is typically composed of a set of loop nests; each loop nest accesses a set of arrays. We use the abstract form below to represent an array-based, loop-intensive program P that consists of l loop nests:

    P = N_1, N_2, ..., N_l,

where N_i (i = 1, 2, ..., l) is the i-th loop nest of program P. Further, we use X_i to represent the set of arrays accessed by loop nest N_i, and X to represent the set of arrays accessed by program P. Based on these definitions, we have X = X_1 ∪ X_2 ∪ ... ∪ X_l. A loop nest N_i of the following form:

    N_i: for j_1 = l_1 to u_1 step d_1
           for j_2 = l_2 to u_2 step d_2
             ...
               for j_k = l_k to u_k step d_k
                 Body

can be represented in an abstract form as:

    N_i: for I ∈ [L_i, U_i] step d { s_1(I), s_2(I), ..., s_m(I) },

where I is the iteration vector (i.e., a vector that contains the loop iterators in the nest from top to bottom), L = (l_1, l_2, ..., l_k)^T and U = (u_1, u_2, ..., u_k)^T are the lower and upper bound vectors, d = (d_1, d_2, ..., d_k)^T is the loop step vector (step d can be omitted if all its elements are 1), and s_j(I) (j = 1, 2, ..., m) is the j-th array reference in the body of loop nest N_i (for ease of discussion, we omit non-array references). When this loop nest is executed, I takes values from L to U in lexicographic order. The array element accessed by s_j(I) can be represented as X[f_j(I)], where X (X ∈ X_i) is the name of the array. Also, function f_j maps iteration vector I to a vector of subscripts for array X. In this paper, we assume that f_j is an affine function of I.

4.3 Clustering Memory Bank Accesses

In this section, we discuss a method for isolating memory references (loads and stores) to a specific set of banks at a given time. We start by making an important definition, namely, that of the bank-invariant loop nest. Given a loop nest N of the form

    N: for I ∈ [L, U] { s_1(I), s_2(I), ..., s_m(I) },

we say that it is a bank-invariant loop nest if the following constraint is satisfied for all 1 ≤ j ≤ m:

    ∀ I_1, I_2 ∈ [L, U]: b(X_j[f_j(I_1)]) = b(X_j[f_j(I_2)]),    (1)

where X_j is the array accessed by array reference s_j, f_j is the mapping function that maps an iteration vector of the loop nest to a subscript vector of array X_j, and function b(x) gives the ID of the memory bank that stores array element x. In other words, the memory bank accessed by a given array reference (load/store operation) s_j does not change throughout the execution of N. We cluster memory bank accesses by splitting a given loop nest N, which may not be bank-invariant, into a set of bank-invariant loop nests N_1, N_2, ..., N_m such that [L_i, U_i] ⊆ [L, U] (where [L_i, U_i] is the iteration space of nest N_i) and the result of executing these bank-invariant loop nests is equivalent to that of executing the original loop nest N.
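Condition (1) can be checked exhaustively for small iteration spaces. The following sketch is our own illustrative code: `bank_of` plays the role of b(·), assuming a row-major layout, two 100 x 100 arrays, and a hypothetical capacity of 5,000 elements per bank (the setting of Figure 4):

```python
BANK_SIZE = 5000   # elements per bank (illustrative)
COLS = 100         # X and Y are 100 x 100 arrays

def bank_of(base, i, j):
    """b(.): ID of the bank holding element [i][j] of an array whose
    first element lives at linearized address `base` (row-major)."""
    return (base + i * COLS + j) // BANK_SIZE

# s1 accesses X[i][j] (X at address 0, spanning two banks);
# s2 accesses Y[j][i] (Y at address 10000, spanning the next two banks).
s1 = lambda i, j: bank_of(0, i, j)
s2 = lambda i, j: bank_of(10000, j, i)

def is_bank_invariant(ref, space):
    """Condition (1) for one reference: a single bank over `space`."""
    return len({ref(i, j) for (i, j) in space}) == 1

full     = [(i, j) for i in range(100) for j in range(100)]
quadrant = [(i, j) for i in range(50) for j in range(50)]
```

Over the full iteration space neither reference is bank-invariant (s1's bank changes with i, s2's with j), which is exactly why the iteration space must be split; restricted to one quadrant, both references touch a single bank.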
Let us assume that the bank size is B, that loop nest N uses a arrays, that the capacity of the data cache is C, that the data cache uses W-way associative mapping with an LRU replacement policy, and that the size of each cache line is equal to that of each array element. One can show that if B ≤ C/W and a ≤ W, clustering memory bank accesses does not increase the number of cache misses. Our compiler algorithm for clustering memory bank accesses is given in Figure 3. Figure 4 illustrates the idea behind forming bank-invariant loop nests using an example. Part (a) of this figure gives the code of a loop nest N that accesses arrays X and Y. The bank layout of arrays X and Y is given in part (b). We can see that, due to the limited capacity of a bank, each array has to be stored in two banks: banks B_1 and B_2 for array X, and banks B_3 and B_4 for array Y. Consequently, during the execution of loop nest N, depending on the current value of the iteration vector (i, j)^T, array reference s_1 may access either bank B_1 or bank B_2, and s_2 may access either bank B_3 or bank B_4. Applying our bank memory access clustering algorithm given in Figure 3 to loop nest N, we obtain four bank-invariant loop nests N_1, N_2, N_3, and N_4, as shown in Figure 4(c). We note that the banks accessed by array references s_1 and s_2 do not change within a bank-invariant loop nest. One of the side benefits of clustering bank memory accesses is that it can be expected to improve cache locality as well, since, at any given time frame, the memory accesses are localized in a small set of memory banks. We will return to this issue in our discussion of the experimental results.

4.4 Clustering Data Cache Misses

Our compiler algorithm for clustering data cache misses is given in Figure 5. This algorithm takes a bank-invariant loop nest and a cache miss clustering factor c as input, and generates a new loop nest in which data cache misses are clustered. The clustering factor determines the number of cache misses that we want to cluster together.
Input: loop nest N;
Output: bank-invariant loop nest set S = {N_1, N_2, ..., N_m};

S = {N};
for i = 1 to m {   // assume s_i(I) accesses array element X[f(I)]
  if (array X is stored in k banks and k > 1) {
    S' = {};
    for each loop nest N_j ∈ S {
      split N_j into k loop nests N_{j,1}, N_{j,2}, ..., N_{j,k}
        such that s_i accesses only one bank within each N_{j,l};
      S' = S' ∪ {N_{j,1}, N_{j,2}, ..., N_{j,k}};
    }
    S = S';
  }
}

Figure 3: Algorithm for splitting loop nest N into m bank-invariant loop nests (N_1, N_2, ..., N_m).

N: for i = 0 to 99
     for j = 0 to 99
       s_1: ...X[i, j]...   // accesses B_1, B_2
       s_2: ...Y[j, i]...   // accesses B_3, B_4

(a) Original loop nest N.

(b) Bank layout of arrays X and Y, assuming that each bank can hold up to 5,000 array elements.

N_1: for i = 0 to 49
       for j = 0 to 49
         s_1: ...X[i, j]...   // accesses B_1
         s_2: ...Y[j, i]...   // accesses B_3

N_2: for i = 0 to 49
       for j = 50 to 99
         s_1: ...X[i, j]...   // accesses B_1
         s_2: ...Y[j, i]...   // accesses B_4

N_3: for i = 50 to 99
       for j = 0 to 49
         s_1: ...X[i, j]...   // accesses B_2
         s_2: ...Y[j, i]...   // accesses B_3

N_4: for i = 50 to 99
       for j = 50 to 99
         s_1: ...X[i, j]...   // accesses B_2
         s_2: ...Y[j, i]...   // accesses B_4

(c) Bank-invariant loop nests obtained by applying bank-based memory access clustering to loop nest N.

Figure 4: Splitting loop nest N into a set of bank-invariant loop nests: N_1, N_2, N_3, and N_4.

Input: bank-invariant loop nest N:
         N: for I ∈ [L, U] { s_1(I), ..., s_m(I) },
       where L = (l_1, l_2, ..., l_n)^T and U = (u_1, u_2, ..., u_n)^T;
       c: cache miss clustering factor.
Output: the code for the transformed loop nest.

r = (u_1 - l_1 + 1) mod c;
if (r ≠ 0) {
  // peel the first r iterations from the outermost loop
  δ = (r - 1, 0, 0, ..., 0)^T;
  // generate code for the first r iterations
  emit "for I ∈ [L, L + δ] { s_1(I), ..., s_m(I) }";
}
δ = ((u_1 - l_1 + 1 - r)/c, 0, 0, ..., 0)^T;
// compute the new loop bounds
L' = (l_1 + r, l_2, l_3, ..., l_n)^T;
U' = (l_1 + r + (u_1 - l_1 + 1 - r)/c - 1, u_2, u_3, ..., u_n)^T;
// generate the loop control structure for the transformed loop nest
emit "for I ∈ [L', U']";
// generate the body for the transformed loop nest
for i = 1 to m
  emit "s_i(I), s_i(I + δ), s_i(I + 2δ), ..., s_i(I + (c-1)δ);";

Figure 5: Cache miss clustering algorithm.

We can cluster more cache misses by using a larger clustering factor. However, increasing the clustering factor also increases the size of the program code. Further, a very large clustering factor can also increase the number of cache misses. Figure 6 illustrates how our cache miss clustering algorithm works using an example where we cluster the cache misses in the loop

N: for i = 0 to 59
     s_1: X[f_1(i)];
     s_2: Y[f_2(i)];

(a) Original loop nest.

with a clustering factor of c = 3, obtaining the transformed loop nest

N': for i = 0 to 19
      s_1: X[f_1(i)];  s_1: X[f_1(i+20)];  s_1: X[f_1(i+40)];
      s_2: Y[f_2(i)];  s_2: Y[f_2(i+20)];  s_2: Y[f_2(i+40)];

(b) Transformed loop nest.

In this figure, we can observe that we split the iteration space of loop N into three sections: [0, 19], [20, 39], and [40, 59]. Since the values of the iteration vectors from different sections are far from each other, and the array element accessed by each array reference is determined by the value of the iteration vector, the addresses of the data elements accessed by the iterations that belong to different sections tend to be far from each other. Therefore, data reuse is not likely to happen between two loop iterations from different sections.
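For a singly-nested loop, the algorithm of Figure 5 reduces to simple index arithmetic. The sketch below (our own illustrative code) computes the peeled iterations, the new bounds, and the reference offsets, and reproduces the transformation of Figure 6:

```python
def cluster_misses(l1, u1, c):
    """Cache miss clustering (Figure 5) for a single loop `for i = l1 to u1`
    with clustering factor c. Returns (peeled, (new_l1, new_u1), delta):
    the peeled leading iterations, the bounds of the transformed loop, and
    the stride delta such that each reference s(i) is replicated as
    s(i), s(i + delta), ..., s(i + (c-1)*delta)."""
    n = u1 - l1 + 1          # trip count of the original loop
    r = n % c                # leading iterations peeled off
    peeled = list(range(l1, l1 + r))
    delta = (n - r) // c     # stride between clustered references
    return peeled, (l1 + r, l1 + r + delta - 1), delta

# Figure 6: for i = 0 to 59, clustering factor c = 3
peeled, (lo, hi), delta = cluster_misses(0, 59, 3)
```

The transformed loop runs over i = 0..19 with references s(i), s(i+20), s(i+40), exactly as in Figure 6(b); together with the peeled iterations, it covers each original iteration exactly once.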
As an example, array reference s_1(i) executed in the iteration space section [20, 39] is not likely to reuse any data brought into the cache by s_1(i) executed in the iteration space section [0, 19]. Our cache miss clustering algorithm is based on this observation. As shown in Figure 6, we achieve our goal by clustering the array references that are originally executed in different sections of the iteration space, since these array references are likely to incur cache misses together. This clustering generates long active periods and, consequently, longer idle periods for the memory banks. An important issue regarding our implementation concerns data dependences. Note that both bank-based clustering and cache miss clustering modify the original execution order of loop iterations. Therefore, their legality depends on the data dependences in the loop nest. In our current implementation, we do not apply them to a loop nest if they would violate any data dependence relationship in the nest.

(c) Mapping between the array references in the original and transformed loop nests.

Figure 6: An example showing how our approach clusters cache misses (clustering factor c = 3).

4.5 A Complete Example

In this section, we present a complete example showing how the two steps of our approach explained in Sections 4.3 and 4.4 work together. Figure 7(a) shows the code for a loop nest N that accesses array A[0..99][0..99]. For the sake of explanation, let us assume that each memory bank can store up to 5,000 array elements. Therefore, array A can be stored in two banks, B_i and B_i+1 (see Figure 7(b)). We further assume that the size of a cache line is equal to that of an array element, that the cache miss latency is T_m cycles, and that the per-iteration execution time, excluding the delay due to cache misses, of loop nest N is T cycles. Figure 7(c) shows the state variations for bank B_i during the execution of loop nest N.
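The loop bounds produced by the two steps for this example can be computed mechanically. The sketch below (our own illustrative code) splits the i-loop at bank boundaries (Section 4.3) and then applies cache miss clustering with factor c to each piece (Section 4.4), assuming the trip count of each piece is divisible by c:

```python
def two_step(l, u, rows_per_bank, c):
    """Apply both steps to a loop `for i = l to u` over the rows of one
    array: split at bank boundaries, then cluster each bank-invariant
    piece. Returns, for each piece, the transformed bounds and the
    offsets added to i by the replicated references."""
    nests = []
    for start in range(l, u + 1, rows_per_bank):
        end = min(start + rows_per_bank - 1, u)
        delta = (end - start + 1) // c   # assumes divisibility by c
        bounds = (start, start + delta - 1)
        offsets = [k * delta for k in range(c)]
        nests.append((bounds, offsets))
    return nests

# This example: A[0..99][0..99], 50 rows per bank, clustering factor c = 2
nests = two_step(0, 99, 50, 2)
```

The result matches Figure 7(e): one nest runs over i = 0..24 and the other over i = 50..74, each accessing A[i][j] and A[i+25][j].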
We can observe that loop nest N incurs a data cache miss every T + T_m cycles, and that the length of the bank idle period between two successive cache misses is T cycles. By applying memory bank access clustering to loop nest N, we obtain two bank-invariant loop nests, N_1 (accessing only bank B_i) and N_2 (accessing only bank B_i+1), as shown in Figure 7(d). By applying cache miss clustering with a clustering factor of c = 2 to loop nests N_1 and N_2, we obtain loop nests N'_1 and N'_2 in Figure 7(e). Figure 7(f) shows the state variations for bank B_i during the execution of N'_1. In this figure, we observe that loop nests N'_1 and N'_2 incur two clustered cache misses every 2(T + T_m) cycles, and the bank idle period between two successive cache miss clusters is 2T cycles. To sum up, clustering cache misses increases the length of the idle periods, so that we have more opportunities to put memory
banks into the low-power mode; it also reduces the number of times that a bank needs to be reactivated, and thus reduces the overall reactivation overhead.

N: for i = 0 to 99
     for j = 0 to 99
       x = A[i, j]; ... use x ...

(a) A loop nest accessing array A.

(b) Memory layout for array A.

(c) State variation for bank B_i during the execution of the loop nest in (a).

N_1: for i = 0 to 49
       for j = 0 to 99
         x = A[i][j]; ... use x ...

N_2: for i = 50 to 99
       for j = 0 to 99
         x = A[i][j]; ... use x ...

(d) After clustering memory bank accesses.

N'_1: for i = 0 to 24
        for j = 0 to 99
          x_1 = A[i][j]; x_2 = A[i + 25][j]; ... use x_1, x_2 ...

N'_2: for i = 50 to 74
        for j = 0 to 99
          x_1 = A[i][j]; x_2 = A[i + 25][j]; ... use x_1, x_2 ...

(e) After clustering cache misses.

(f) State variation for bank B_i during the execution of the optimized loop nest in (e).

Figure 7: An example demonstrating the two steps of our approach.

Benchmark   Number     Input   Baseline   Baseline
Name        of Lines   Size    Energy     Cycles
adi         56         78 MB   19.3 mJ    3.92 M
cholesky    34         61 MB   61.1 mJ    7.10 M
hydro2d     52         44 MB   76.3 mJ    6.59 M
flt         85         51 MB   328.1 mJ   11.57 M
fourier                MB      411.7 mJ   8.90 M
tis                    MB      511.0 mJ   12.04 M
tsf                    MB      620.4 mJ   16.71 M

Table 2: Our benchmarks and their important characteristics.

5. EXPERIMENTAL EVALUATION

5.1 Setup

To evaluate our bank-aware cache miss clustering scheme, we automated our approach within an open-source compiler infrastructure called SUIF [6], and performed a series of simulation-based experiments using SimpleScalar [17]. SUIF is built as a series of compiler phases, and our approach has been implemented as a separate phase. The additional increase in compilation time due to the addition of our phase was about 27% on average, as compared to the case without our phase. In our experiments, we assume that the embedded system consists of a 400 MHz processor, an 8 KB 4-way data cache, and eight main memory banks. The access latencies of the data cache and the main memory are 2 and 110 cycles, respectively. However, we also performed experiments with different values of the off-chip memory access latency.
The important characteristics of the benchmark codes that we used to measure the energy benefits of our cache miss clustering approach are given in Table 2. fourier and flt are Fourier transform and digital filtering routines, respectively. adi and cholesky are ADI and Cholesky decomposition codes; hydro2d is an array-dominated code from the SPEC benchmark suite; and tis and tsf are from the Perfect Club benchmarks. The third column in Table 2 gives the total dataset size manipulated by the corresponding code. Finally, the last two columns give the energy consumption and execution cycles with the baseline scheme (explained below). While our code restructuring increased the executable sizes by about 14% on average, the increase in instruction cache misses was very small (less than 0.4%). In any case, in these benchmark codes, the data memory behavior is much more dominant than the instruction memory behavior. The data cache miss rates of these benchmark codes varied between 8.8% and 21.4%. The default clustering factor (see Section 4.4) used in our experiments is 8. The energy consumption results presented below capture all the energy consumption due to data accesses. This includes the energy consumed in the data cache, the off-chip memory, and the interconnects. We conducted experiments with five different schemes:

Baseline: This scheme does not perform any energy or performance optimizations, and is used as a base case against which the other schemes can be normalized and compared. Its memory energy and performance numbers are given in the last two columns of Table 2.

LOOP: This represents a conventional loop restructuring scheme for optimizing cache locality.
It basically employs two types of optimizations: loop permutation (i.e., interchanging the order of two loops to ensure unit-stride accesses in the innermost loop after the transformation) and loop tiling (i.e., generating the blocked version of a loop for improving cache locality). While this scheme does not specifically target energy, the performance improvements it brings can translate to energy savings, since the number of accesses to the main memory is reduced. The specific locality optimization techniques used in this scheme are adapted from [9, 8, 10].

CMC: This is the scheme proposed and discussed in this paper. As mentioned earlier, its energy benefits come from both increased bank idleness (as a result of clustering) and improved data locality.

PRI: This is a previously-proposed scheme [16] for energy optimization in banked memories. It tries to cluster memory accesses to a small set of banks at a given time, and this improves what is called bank locality. It needs to be noted that this approach works in a cache-independent fashion. That is, it does not take cache behavior explicitly into account.

The important point to emphasize here is that all the schemes listed above, including the baseline scheme (which does not employ any code/data optimization), make use of the same underlying hardware-based low-power management scheme explained in Section 2. Therefore, the differences between the schemes can be attributed to the durations of the bank idle periods (BIT values) they generate. It is also important to note that we compare our scheme (CMC) to two previously-proposed schemes: a data locality oriented one (LOOP, adapted from [9, 8, 10]) and a memory energy oriented one (PRI, adapted from [16]).

5.2 Results

We first present in Figure 8 the CDF (cumulative distribution function) of the bank inter-access times (BIT) for the original applications, when all eight banks are considered together.
Specifically, an (x, y) point on any curve in this plot indicates that x% of the bank idle periods are y cycles or less. We see from this plot that there are many short idle periods, which are potential candidates for optimization. The graph in Figure 9 gives the normalized energy consumption for our benchmarks under the schemes described above. Our first observation is that the LOOP scheme generates some energy savings (16.1% on average). This is a direct result of (1) optimizing data cache behavior and (2) increasing memory idleness. When we look at the results of CMC (our approach) and PRI, we see that they are competitive: in some benchmarks PRI outperforms CMC, while in others the latter generates better energy savings than the former. The average energy savings brought by PRI and CMC are 25.4% and 27.3%, respectively. These results clearly indicate
that CMC is very effective in reducing energy consumption.

Figure 8: CDF for memory bank inter-access times (accumulated over all banks).
Figure 9: Normalized energy consumption with respect to the baseline scheme.
Figure 10: CDF for bank inter-access idle times (accumulated over all banks) when our approach is used.
Figure 11: Normalized execution cycles with respect to the baseline scheme.
Figure 12: Distribution of energy savings across banks. Left: cholesky; Right: fourier.

Finally, the last bar for each benchmark code in Figure 9 gives the energy consumption with the CMC+PRI scheme (i.e., when we combine CMC and PRI). One can see that this combined approach achieves energy savings of about 37.7% when averaged over the seven codes evaluated. These results also imply that, through miss clustering, one can achieve much better results than with conventional code optimizations. To explain where the energy benefits come from, we give in Figure 10 the CDF curves for the codes restructured using the CMC scheme. Comparing these curves to the corresponding ones in Figure 8, we see that our approach increases the length of the bank idle periods significantly, which in turn translates to memory energy savings. It should be mentioned at this point that, while both PRI and CMC are very effective in saving energy, they save it in different ways, as discussed earlier. The PRI scheme tries to increase the idleness of each bank in the system independently. As a result, while it may obtain significant benefits with some benchmarks, it may be unsuccessful with others. In contrast, our approach works at the data cache level; therefore, its savings are expected to be distributed uniformly across the different banks, under the assumption that the banks are accessed uniformly. The graphs in Figure 12 give the distribution of energy savings across the eight memory banks for two of our benchmarks, namely cholesky and fourier.
One can see from these results that the savings with our scheme are uniform across the banks, whereas the variance with the PRI scheme can be very large in some cases. The remaining benchmarks exhibit similar patterns. We next look at the performance results given in Figure 11. As expected, the LOOP scheme achieves the best execution time improvements across all versions, since it is performance oriented. However, the results with the CMC scheme are also very good: the execution cycle improvements brought by LOOP and CMC are 25.6% and 21.9%, respectively. This is because our miss clustering approach captures cache locality as well, in addition to bank idleness; that is, clustering data accesses based on the set of banks they refer to also enhances data locality at the cache level. Consequently, the results it generates are not far from those obtained with the LOOP scheme. Our last observation from these results is that the PRI version does not improve execution time significantly (3.6% improvement on average, and it increases the original execution times in some cases). The main reason for this poor performance behavior is that this scheme does not consider data locality at all. Therefore, while its results are good from the energy-saving perspective, they are not impressive as far as performance is concerned. Overall, when we consider the results in Figures 9 and 11 together, considering bank idleness and locality together is a must if we are to improve both energy and performance. We next present the results with varying data cache capacities (all 4-way associative, as in the default configuration). The default cache capacity used in our experiments so far was 16KB. The results with different cache capacities are given in Figure 13 for the benchmark cholesky; the trends observed with the remaining benchmarks were similar, so we do not present them.
One can see from this graph that the PRI scheme is not very sensitive to the data cache capacity used, because it operates in a cache-oblivious way. However, since a different cache capacity changes the hit/miss behavior, it does change the absolute savings obtained by PRI. In comparison, the effectiveness of CMC and LOOP is reduced as the cache capacity increases. This is expected because the results are normalized with respect to the baseline approach (which applies no code optimization), and the behavior of this baseline improves significantly as we increase the data cache capacity. In fact, in the extreme case, a very large cache can make any data locality optimization useless. However, we also note that CMC is affected less than LOOP when we increase the cache capacity, mainly because its energy savings do not come entirely from locality improvement but also from increasing bank idle times through miss clustering. We observed similar trends when we fixed the cache capacity at its default value and varied the cache associativity: with increased associativity, the PRI scheme does not exhibit significant variance, while the effectiveness of CMC and LOOP is reduced. Our next set of experiments studies the impact of the off-chip access latency on our results (obtained using the CMC scheme), presented in Figure 14 for cholesky.

Figure 13: Normalized energy consumption with varying data cache capacities (cholesky).
Figure 14: Normalized energy consumption with varying off-chip latency values (cholesky).
Figure 15: Influence of the clustering factor on energy consumption and execution cycles.

The default value for the off-chip memory access latency was 110 cycles. We see from Figure 14 that the energy results are not very sensitive to variations in the off-chip access latency, though we observe a declining trend in normalized energy consumption as the latency increases. As expected, however, the execution cycle savings achieved by our approach shrink more than the energy savings when we reduce the off-chip access latency. We still observe that, even with an off-chip access latency of 60 cycles, our approach improves execution cycles by 5.56% for this benchmark (7.27% on average over all the benchmarks). Recall that an important parameter in our approach is the clustering factor defined in Section 4.4. As mentioned earlier, the default value for this parameter in our experiments so far was 8. We also performed experiments with different values of this parameter; the results are given in Figure 15. In this plot, we give the normalized energy consumption and the normalized execution cycles averaged over all the benchmarks in our experimental suite. We present only the average results since the results for the different benchmarks are very similar to each other. We observe an interesting trend in these results: both energy consumption and execution cycles first decrease as we increase the value of the clustering factor, but beyond a point they show an increasing trend. This can be explained as follows. As mentioned earlier, increasing the value of the clustering factor is good from the viewpoint of clustering cache misses.
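This tension can be illustrated with a toy fully-associative LRU cache model (the capacities and reference streams below are illustrative only, not our simulated configuration): once the number of distinct cache lines touched between two accesses to the same line exceeds the cache capacity, the second access turns into a miss.

```python
from collections import OrderedDict

def count_misses(line_refs, capacity):
    """Count misses for a sequence of cache-line references under LRU."""
    cache = OrderedDict()
    misses = 0
    for line in line_refs:
        if line in cache:
            cache.move_to_end(line)           # refresh LRU position on a hit
        else:
            misses += 1
            cache[line] = True
            if len(cache) > capacity:
                cache.popitem(last=False)     # evict the least-recently-used line
    return misses

def misses_with_gap(k, capacity=4):
    """Two accesses to line 0 separated by k distinct intervening lines."""
    refs = [0] + list(range(1, k + 1)) + [0]
    return count_misses(refs, capacity)
```

With a capacity of 4 lines, a gap of 3 intervening lines still lets the second access to line 0 hit, while a gap of 4 evicts line 0 and costs an extra miss — the same effect that makes overly aggressive clustering counterproductive.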
In a sense, the larger its value, the better the miss clustering we achieve. However, increasing it also widens the gap (in terms of intervening accesses) between two successive accesses to a given cache line. As a result, when the clustering factor becomes large enough, this can cause extra cache misses, which increase both execution cycles and memory energy consumption.

6. CONCLUDING REMARKS

This paper proposes a cache-conscious memory energy reduction scheme based on the concept of cache miss clustering. The idea is to cluster data cache misses, and thus cluster banked memory accesses and idle periods. This increase in idle periods allows better energy management via low-power operating modes and eventually leads to larger energy savings. Our experimental analysis shows that our approach is effective in reducing both energy consumption and execution cycles, and is more effective than previously-proposed schemes. Our results also indicate that the energy savings achieved by our approach are consistent across different cache sizes, cache associativities, and off-chip access latencies.

7. REFERENCES

[1] F. Catthoor, S. Wuytack, E. D. Greef, F. Balasa, L. Nachtergaele, and A. Vandecappelle. Custom Memory Management Methodology: Exploration of Memory Organization for Embedded Multimedia System Design. Kluwer Academic Publishers.
[2] V. Delaluz, M. Kandemir, N. Vijaykrishnan, and M. J. Irwin. Energy-oriented compiler optimizations for partitioned memory architectures. In Proc. International Conference on Compilers, Architecture, and Synthesis for Embedded Systems.
[3] V. Delaluz, M. Kandemir, N. Vijaykrishnan, A. Sivasubramaniam, and M. J. Irwin. DRAM energy management using software and hardware directed power mode control. In Proc. 7th International Conference on High Performance Computer Architecture, Monterrey, Mexico.
[4] X. Fan, C. S. Ellis, and A. R. Lebeck. Memory controller policies for DRAM power management. In Proc.
the International Symposium on Low Power Electronics and Design.
[5] K. I. Farkas, J. Flinn, G. Back, D. Grunwald, and J.-A. M. Anderson. Quantifying the energy consumption of a pocket computer and a Java virtual machine. In Proc. SIGMETRICS.
[6] M. W. Hall, J. M. Anderson, S. P. Amarasinghe, B. R. Murphy, S.-W. Liao, E. Bugnion, and M. S. Lam. Maximizing multiprocessor performance with the SUIF compiler. IEEE Computer.
[7] A. R. Lebeck, X. Fan, H. Zeng, and C. S. Ellis. Power aware page allocation. In Proc. 9th International Conference on Architectural Support for Programming Languages and Operating Systems.
[8] W. Li. Compiling for NUMA Parallel Machines. Ph.D. thesis, Computer Science Department, Cornell University, Ithaca, NY.
[9] K. McKinley, S. Carr, and C.-W. Tseng. Improving data locality with loop transformations. ACM Transactions on Programming Languages and Systems.
[10] M. O'Boyle and P. Knijnenburg. Integrating loop and data transformations for global optimisation. In Proc. International Conference on Parallel Architectures and Compilation Techniques.
[11] V. S. Pai and S. V. Adve. Code transformations to improve memory parallelism. In Proc. 32nd International Symposium on Microarchitecture.
[12] V. S. Pai and S. V. Adve. Comparing and combining read miss clustering and software prefetching. In Proc. International Symposium on Parallel Architectures and Compilation Techniques.
[13] 128/144-MBit Direct RDRAM Data Sheet, Rambus Inc.
[14] Rambus Inc.
[15] M. A. R. Saghir, P. Chow, and C. G. Lee. Exploiting dual data-memory banks in digital signal processors. In Proc. International Conference on Architectural Support for Programming Languages and Operating Systems.
[16] U. Sezer, G. Chen, M. Kandemir, H. Saputra, and M. J. Irwin. Exploiting bank locality in multi-bank memories. In Proc. International Conference on Compilers, Architecture, and Synthesis for Embedded Systems.
[17] SimpleScalar LLC.
[18] A. Sudarsanam and S. Malik.
Simultaneous reference allocation in code generation for dual data memory bank ASIPs. ACM Transactions on Design Automation of Electronic Systems, 5, 2000.
[19] N. Vijaykrishnan, M. Kandemir, M. J. Irwin, H. Y. Kim, and W. Ye. Energy-driven integrated hardware-software optimizations using SimplePower. In Proc. International Symposium on Computer Architecture, 2000.
Memory Management Goals of this Lecture Help you learn about: The memory hierarchy Spatial and temporal locality of reference Caching, at multiple levels Virtual memory and thereby How the hardware and
More informationPARALLEL MULTI-DELAY SIMULATION
PARALLEL MULTI-DELAY SIMULATION Yun Sik Lee Peter M. Maurer Department of Computer Science and Engineering University of South Florida Tampa, FL 33620 CATEGORY: 7 - Discrete Simulation PARALLEL MULTI-DELAY
More informationABC basics (compilation from different articles)
1. AIG construction 2. AIG optimization 3. Technology mapping ABC basics (compilation from different articles) 1. BACKGROUND An And-Inverter Graph (AIG) is a directed acyclic graph (DAG), in which a node
More informationLoop Transformations! Part II!
Lecture 9! Loop Transformations! Part II! John Cavazos! Dept of Computer & Information Sciences! University of Delaware! www.cis.udel.edu/~cavazos/cisc879! Loop Unswitching Hoist invariant control-flow
More informationData Reuse Exploration under Area Constraints for Low Power Reconfigurable Systems
Data Reuse Exploration under Area Constraints for Low Power Reconfigurable Systems Qiang Liu, George A. Constantinides, Konstantinos Masselos and Peter Y.K. Cheung Department of Electrical and Electronics
More informationArchitecture Tuning Study: the SimpleScalar Experience
Architecture Tuning Study: the SimpleScalar Experience Jianfeng Yang Yiqun Cao December 5, 2005 Abstract SimpleScalar is software toolset designed for modeling and simulation of processor performance.
More informationChapter 5A. Large and Fast: Exploiting Memory Hierarchy
Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM
More informationIntroduction. Stream processor: high computation to bandwidth ratio To make legacy hardware more like stream processor: We study the bandwidth problem
Introduction Stream processor: high computation to bandwidth ratio To make legacy hardware more like stream processor: Increase computation power Make the best use of available bandwidth We study the bandwidth
More informationAN ASSOCIATIVE TERNARY CACHE FOR IP ROUTING. 1. Introduction. 2. Associative Cache Scheme
AN ASSOCIATIVE TERNARY CACHE FOR IP ROUTING James J. Rooney 1 José G. Delgado-Frias 2 Douglas H. Summerville 1 1 Dept. of Electrical and Computer Engineering. 2 School of Electrical Engr. and Computer
More informationA Memory Management Scheme for Hybrid Memory Architecture in Mission Critical Computers
A Memory Management Scheme for Hybrid Memory Architecture in Mission Critical Computers Soohyun Yang and Yeonseung Ryu Department of Computer Engineering, Myongji University Yongin, Gyeonggi-do, Korea
More informationLow Power Set-Associative Cache with Single-Cycle Partial Tag Comparison
Low Power Set-Associative Cache with Single-Cycle Partial Tag Comparison Jian Chen, Ruihua Peng, Yuzhuo Fu School of Micro-electronics, Shanghai Jiao Tong University, Shanghai 200030, China {chenjian,
More informationAffine and Unimodular Transformations for Non-Uniform Nested Loops
th WSEAS International Conference on COMPUTERS, Heraklion, Greece, July 3-, 008 Affine and Unimodular Transformations for Non-Uniform Nested Loops FAWZY A. TORKEY, AFAF A. SALAH, NAHED M. EL DESOUKY and
More informationCost Models for Query Processing Strategies in the Active Data Repository
Cost Models for Query rocessing Strategies in the Active Data Repository Chialin Chang Institute for Advanced Computer Studies and Department of Computer Science University of Maryland, College ark 272
More information1 Motivation for Improving Matrix Multiplication
CS170 Spring 2007 Lecture 7 Feb 6 1 Motivation for Improving Matrix Multiplication Now we will just consider the best way to implement the usual algorithm for matrix multiplication, the one that take 2n
More informationReduction in Cache Memory Power Consumption based on Replacement Quantity
Journal of Computer Engineering 1 (9) 3-11 Reduction in Cache Memory Power Consumption based on Replacement Quantity M.Jafarsalehi Islamic Azad University of Iran Tabriz-branch jafarsalehi_m@yahoo.com
More informationBest Practices for SSD Performance Measurement
Best Practices for SSD Performance Measurement Overview Fast Facts - SSDs require unique performance measurement techniques - SSD performance can change as the drive is written - Accurate, consistent and
More informationUniversity oftoronto. Queens University. Abstract. This paper gives an overview of locality enhancement techniques
Locality Enhancement for Large-Scale Shared-Memory Multiprocessors Tarek Abdelrahman 1, Naraig Manjikian 2,GaryLiu 3 and S. Tandri 3 1 Department of Electrical and Computer Engineering University oftoronto
More informationChapter 2: Memory Hierarchy Design Part 2
Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental
More informationOpenMP for next generation heterogeneous clusters
OpenMP for next generation heterogeneous clusters Jens Breitbart Research Group Programming Languages / Methodologies, Universität Kassel, jbreitbart@uni-kassel.de Abstract The last years have seen great
More informationUsing Intel Streaming SIMD Extensions for 3D Geometry Processing
Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,
More informationDynamic Partitioning of Processing and Memory Resources in Embedded MPSoC Architectures
Dynamic Partitioning of Processing and Memory Resources in Embedded MPSoC Architectures Liping Xue, Ozcan ozturk, Feihui Li, Mahmut Kandemir Computer Science and Engineering Department Pennsylvania State
More information4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.
Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that
More informationTradeoff between coverage of a Markov prefetcher and memory bandwidth usage
Tradeoff between coverage of a Markov prefetcher and memory bandwidth usage Elec525 Spring 2005 Raj Bandyopadhyay, Mandy Liu, Nico Peña Hypothesis Some modern processors use a prefetching unit at the front-end
More informationCS 31: Intro to Systems Virtual Memory. Kevin Webb Swarthmore College November 15, 2018
CS 31: Intro to Systems Virtual Memory Kevin Webb Swarthmore College November 15, 2018 Reading Quiz Memory Abstraction goal: make every process think it has the same memory layout. MUCH simpler for compiler
More informationCS 433 Homework 4. Assigned on 10/17/2017 Due in class on 11/7/ Please write your name and NetID clearly on the first page.
CS 433 Homework 4 Assigned on 10/17/2017 Due in class on 11/7/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies on collaboration.
More informationImproving Real-Time Performance on Multicore Platforms Using MemGuard
Improving Real-Time Performance on Multicore Platforms Using MemGuard Heechul Yun University of Kansas 2335 Irving hill Rd, Lawrence, KS heechul@ittc.ku.edu Abstract In this paper, we present a case-study
More information