
Memory Redundancy Elimination to Improve Application Energy Efficiency

Keith D. Cooper and Li Xu
Department of Computer Science, Rice University, Houston, Texas, USA

Abstract. Application energy consumption has become an increasingly important issue for both high-end microprocessors and mobile and embedded devices. A multitude of circuit- and architecture-level techniques have been developed to improve application energy efficiency, but relatively little work studies the effects of compiler transformations on application energy efficiency. In this paper, we use energy-estimation tools to profile the execution of benchmark applications. The results show that energy consumption due to memory instructions accounts for a large share of total energy. An effective compiler technique to improve energy efficiency is memory redundancy elimination, which reduces both application execution cycles and the number of cache accesses. We evaluate the energy improvement over 12 benchmark applications from SPEC2000 and MediaBench. The results show that memory redundancy elimination can significantly reduce energy in the processor clocking network and in the instruction and data caches. Overall application energy consumption can be reduced by up to 15%, and the reduction in energy-delay product is up to 24%.

1 Introduction

Application energy consumption has become an increasingly important issue for the whole array of microprocessors, spanning from high-end processors used in data centers to those inside mobile and embedded devices. Energy conservation is currently the target of intense research efforts. A multitude of circuit- and architecture-level techniques have been proposed and developed to reduce processor energy consumption [25, 13, 17]. However, many of these research efforts focus on hardware techniques, such as dynamic voltage scaling (DVS) [25, 28, 22] and low-energy cache design [13, 1, 17]. Equally important, application-level techniques are necessary to make program execution more energy efficient, as ultimately it is the applications executed by the processors that determine the total energy consumption.

In this paper, we look at how compiler techniques can be used to improve application energy efficiency. In Section 2, we use energy profiling to identify the top energy-consuming micro-architecture components and motivate the study of memory redundancy elimination as a technique to reduce energy. Section 3 gives an overview of a new algorithm for memory redundancy detection and presents two frameworks to remove redundant memory instructions. Section 4 presents the experimental results. Section 5 summarizes related work, and Section 6 concludes the paper.

2 Energy Profiling

Optimizing compilers have been successful at improving program performance [2]. One key reason is that accurate performance models are used to evaluate various code transformations. Similarly, automatic optimization of program energy efficiency requires accurate modeling of energy dissipation. Unfortunately, most energy-estimation tools work at the circuit and transistor level and require detailed information from the circuit design. Recently, researchers have started to build higher-level energy modeling tools, such as Wattch [6] and SimplePower [30], which can estimate the power and energy dissipation of various micro-architecture components. When combined with instruction-level performance simulators, these tools provide an ideal infrastructure to evaluate compiler optimizations targeting energy efficiency.

In our work, we use the Wattch [6] tool along with the SimpleScalar [7] simulators to study compiler techniques in terms of application energy consumption. Our approach is as follows: we first profile application execution and measure the energy consumption breakdown by major processor components. This step reveals how energy is dissipated. Based on the application energy profile, we can then identify promising code optimizations to improve energy efficiency.

Processor Core:
  RUU Size: 64 instructions
  LSQ Size: 32 instructions
  Fetch Queue Size: 8 instructions
  Fetch Width: 4 instructions/cycle
  Decode Width: 4 instructions/cycle
  Issue Width: 4 instructions/cycle
  Commit Width: 4 instructions/cycle
  Function Units: 4 integer ALUs; 1 integer multiply; 1 integer divide; 1 FP add; 1 FP multiply; 1 FP divide/sqrt

Memory Hierarchy:
  L1 Data Cache: 64KB, 2-way (LRU), 32B block, 1-cycle latency
  L1 Instruction Cache: 64KB, 2-way (LRU), 32B block, 1-cycle latency
  L2 Cache: unified, 2MB, 4-way (LRU), 32B block, 12-cycle latency
  Memory: 100-cycle latency
  TLB: 128-entry, fully associative, 30-cycle miss latency

Branch Prediction:
  Branch Predictor: combined; 4K chooser, bimodal with 4K table, 2-level with 1K table and 10-bit history
  BTB: 1024-entry, 2-way
  Return Address Stack: 32-entry
  Misprediction Penalty: 7 cycles

Table 1. SimpleScalar simulator configuration.

In CMOS circuits, dynamic power consumption accounts for the major share of power dissipation. We use Wattch to get the dynamic power breakdown of a superscalar processor. The processor configuration, shown in Table 1, is similar to that of the Alpha 21264. The Wattch tool is configured with the parameters of a .35µm process at 600 MHz with a supply voltage of 2.5V. Figure 1 shows the percentage of dynamic power dissipated by each micro-architecture component, assuming each component is fully active. The top dynamic power-dissipating components are the global clocking network and the on-chip caches. Combined, they account for more than 70% of the total dynamic power dissipation (clock 28% plus caches 42% in Figure 1).

Figure 1. Active dynamic power consumption by micro-architecture component: clock 28%, L2 cache 18%, L1 D-Cache 16%, L1 I-Cache 8%, instruction window 6%, branch predictor 5%, register file 5%, ALU 5%, result bus 4%, load/store queue 3%, rename 1%, DTLB 1%, ITLB 0%.

Benchmark | Description | Input
adpcm | 16-to-4 bit voice encoding | clinton.pcm
g721 | CCITT G.721 voice encoding | clinton.pcm
gsm | GSM speech encoding | clinton.pcm
epic | Pyramid image encoding | test_img.pgm
pegwit | Elliptic curve public key encryption | news.txt
mpeg2dec | MPEG-2 video decoding | child.mpg
181.mcf | Combinatorial optimization | test input
164.gzip | Compression | test input
256.bzip2 | Compression | test input
175.vpr | FPGA placement and routing | test input
197.parser | Link grammar parser of English | test input
300.twolf | Circuit placement and routing | test input

Table 2. Test benchmarks.

Benchmark | Total (mJ) | Clock (mJ) | Ratio | I-Cache (mJ) | Ratio | D-Cache (mJ) | Ratio
adpcm | 257.33 | 76.90 | 29.9% | 46.45 | 18.1% | 14.09 | 5.5%
g721 | 9,079.94 | 2,739.31 | 30.2% | 1,402.15 | 15.4% | 1,034.61 | 11.4%
gsm | 6,394.06 | 1,917.07 | 30.0% | 975.21 | 15.3% | 1,072.65 | 16.8%
epic | 1,860.85 | 590.95 | 31.8% | 244.62 | 13.2% | 158.69 | 8.5%
pegwit | 1,236.13 | 350.43 | 28.4% | 188.82 | 15.3% | 197.86 | 16.0%
mpeg | 5,227.71 | 1,570.38 | 30.0% | 683.34 | 13.1% | 614.02 | 11.8%
181.mcf | 8,743.05 | 2,823.21 | 32.3% | 1,175.15 | 13.4% | 1,393.48 | 15.9%
164.gzip | 27,590.61 | 8,199.88 | 29.7% | 3,806.98 | 13.8% | 4,491.96 | 16.3%
256.bzip2 | 60,783.20 | 19,197.48 | 31.6% | 8,451.11 | 13.9% | 9,419.49 | 15.5%
175.vpr | 32,053.54 | 9,625.57 | 30.0% | 4,261.32 | 13.3% | 5,267.46 | 16.4%
197.parser | 10,090.21 | 3,101.89 | 30.7% | 1,466.54 | 14.5% | 1,636.30 | 16.2%
300.twolf | 10,620.75 | 3,208.99 | 30.2% | 1,555.77 | 14.7% | 1,694.24 | 16.0%
Geometric Mean | | | 30.4% | | 14.4% | | 13.2%

Table 3. Application energy consumption, and energy consumption in the clocking network, top-level I-Cache, and D-Cache. The energy unit is mJ (10⁻³ Joule).

This suggests that the clocking network and caches should be the primary targets for compiler techniques aiming to improve energy efficiency.

The energy of dynamic switching, E_d, can be defined as:

    E_d = α · C · V_dd²

where C is the load capacitance, V_dd is the supply voltage, and α is the switching activity factor, which indicates how often logic transitions from low to high take place [14]. C and V_dd depend on the particular process technology and circuit design, while the activity factor α is determined by the code being executed [14]. The main leverage the compiler has to minimize E_d is to reduce α.

We ran 12 benchmark applications and profiled their energy consumption. The benchmarks are chosen from the widely used SPEC2000 and MediaBench [20] suites; Table 2 describes them. We compiled the benchmark applications with the GNU GCC compiler at the -O4 optimization level, and ran the compiled executables on the out-of-order superscalar simulator with Wattch to collect run-time and energy statistics.

Table 3 shows the total energy consumption and the energy in the clocking network, top-level I-Cache, and D-Cache. The results show that the energy distribution across the micro-architecture components is very similar to the power distribution in Figure 1. The major difference is that all applications exhibit good cache locality, so the L2 cache is rarely accessed due to the low number of top-level cache misses; therefore, compared to the energy in other components, L2 cache energy is negligible. As Table 3 shows, energy consumption in the clocking network and top-level caches accounts for a large share of total application energy: for the 12 applications, the clocking network and L1 caches account for more than 58% (geometric mean) of total energy.

Table 4 shows the dynamic instruction count and the dynamic load and store counts. Memory instructions account for about 24% (geometric mean) of dynamic instructions, and for the more sophisticated SPEC2000 applications the percentage is even higher, with a geometric mean of 36%. The large number of dynamic memory instructions has the following consequences: first, these instructions must be fetched from the I-Cache before execution, costing energy in the I-Cache; second, instruction execution itself costs energy, including energy in the clocking network; third, the execution of memory instructions requires D-Cache accesses, which are the major cause of D-Cache energy consumption. As both the clocking network and the caches are top power-dissipating components, memory instructions have a significant impact on total energy consumption.

The energy profiling data above indicate that memory instructions are a good target for improving application energy efficiency. Redundant memory instructions represent wasted energy; removing them should reduce energy costs.
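To make the compiler's lever on E_d concrete, the following toy C program plugs illustrative numbers into the dynamic-energy equation. The capacitance, activity factors, and cycle count are made-up values for illustration (this is not Wattch's model); it simply shows that executing fewer memory instructions lowers the switching activity α, and dynamic energy falls linearly with it.

#include <stdio.h>

/* Dynamic energy per cycle for one component: E_d = alpha * C * Vdd^2. */
static double dynamic_energy(double alpha, double cap_farads, double vdd)
{
    return alpha * cap_farads * vdd * vdd;
}

int main(void)
{
    const double vdd = 2.5;      /* supply voltage of the simulated process */
    const double cap = 40e-12;   /* hypothetical switched capacitance (F)   */
    const long cycles = 1000000; /* hypothetical cycle count                */

    double base = cycles * dynamic_energy(0.50, cap, vdd);
    /* Suppose redundancy elimination cuts activity from 0.50 to 0.45. */
    double opt = cycles * dynamic_energy(0.45, cap, vdd);

    printf("baseline:  %.3e J\n", base);
    printf("optimized: %.3e J (%.1f%% saved)\n",
           opt, 100.0 * (base - opt) / base);
    return 0;
}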

Benchmark | Total | Load | Ratio | Store | Ratio | Load+Store
adpcm | 9,136,002 | 460,201 | 5.0% | 117,188 | 1.3% | 6.3%
g721 | 332,838,710 | 43,242,018 | 13.0% | 11,531,406 | 3.5% | 16.5%
gsm | 243,270,088 | 41,358,696 | 17.0% | 11,716,196 | 4.8% | 21.8%
epic | 59,933,404 | 7,505,703 | 12.5% | 1,007,723 | 1.7% | 14.2%
pegwit | 44,934,340 | 9,241,265 | 20.6% | 2,767,074 | 6.2% | 26.7%
mpeg | 174,648,416 | 26,999,461 | 15.5% | 6,320,129 | 3.6% | 19.1%
181.mcf | 263,801,691 | 67,462,222 | 25.6% | 38,954,777 | 14.8% | 40.3%
164.gzip | 872,033,479 | 204,006,573 | 23.4% | 87,834,271 | 10.1% | 33.5%
256.bzip2 | 1,953,921,052 | 425,238,535 | 21.8% | 147,407,675 | 7.5% | 29.3%
175.vpr | 1,044,347,692 | 314,776,025 | 30.1% | 105,206,772 | 10.1% | 40.2%
197.parser | 322,027,162 | 88,673,262 | 27.5% | 26,627,863 | 8.3% | 35.8%
300.twolf | 337,032,127 | 96,499,758 | 28.6% | 32,200,073 | 9.6% | 38.2%
Geometric Mean | | | 18.3% | | 5.5% | 24.0%

Table 4. Memory instruction count and ratio. Benchmarks compiled with GCC -O4.

The dominant modern processor architecture is the load-store architecture, in which most instructions operate on data in the register file and only loads and stores can access memory. Between the processor core and main memory, the I-Cache stores the instructions to be fetched and executed, while the D-Cache serves as a local copy of memory data so that loads and stores can access data faster. When redundant memory instructions are removed, the traffic from memory through the I-Cache to the CPU core is reduced because fewer instructions are fetched; this saves energy in the I-Cache. Data accesses in the D-Cache are also reduced, saving energy there as well. Finally, removing memory instructions speeds up the application and saves energy in the clocking network. As the preceding analysis showed, the clocking network and cache structures are among the top energy-consuming components in the processor, so savings in these components can significantly reduce total energy consumption. The rest of this paper presents compile-time memory redundancy elimination and evaluates its effectiveness at improving energy efficiency.

3 Memory Redundancy Elimination

Memory redundancy elimination is a compile-time technique to remove unnecessary memory instructions. Consider the sample C code in Figure 2. In the functions full_red, par_cond and par_loop, the struct field accesses p->x and p->y are generally compiled into loads. However, the loads in lines 11 and 12 are fully redundant with those in line 10, as they always load the same values at run time. Similarly, the loads in line 19 are partially redundant with those in line 17 when the conditional statement is executed, and the loads in line 25 are partially redundant, as the values need to be loaded only in the first loop iteration; all remaining iterations load the same values. These redundant loads can be detected and removed at compile time. As discussed in Section 2, memory instructions incur significant dynamic energy consumption, so memory redundancy elimination can be an effective energy-saving transformation.

In our prior work [10], we presented a new static analysis algorithm to detect memory redundancy. The algorithm uses value numbering on memory operations and is the basis for the memory redundancy removal techniques described in this paper. This paper extends the work in [10] by providing a more powerful removal framework capable of eliminating a larger set of memory redundancies; furthermore, this paper focuses on energy-efficiency benefits, while the previous work concerned performance improvement. Section 3.1 gives an overview of the memory redundancy detection algorithm; Section 3.2 presents the code transformations that use the analysis results to remove fully and partially redundant memory instructions.

1  struct parm {
2      int x;
3      int y;
4  };
5  struct parm pa = {3, 7};
6  struct parm pb = {2001, 2002};
7
8  void full_red(struct parm *p, int *result)
9  {
10     result[0] = p->x + p->y;
11     result[1] = p->x - p->y;
12     result[2] = p->x + p->y;
13 }
14 void par_cond(struct parm *p, int *result)
15 {
16     if (p->x > 10) {
17         result[0] = p->x + p->y;
18     }
19     result[1] = p->x - p->y;
20 }
21 void par_loop(struct parm *p, int *result)
22 {
23     int i;
24     for (i = 0; i < 100; i++) {
25         result[i] = p->x + p->y;
26     }
27 }
28 void client()
29 {
30     int r[6][100];
31     full_red(&pa, r[0]);
32     full_red(&pb, r[1]);
33     par_cond(&pa, r[2]);
34     par_cond(&pb, r[3]);
35     par_loop(&pa, r[4]);
36     par_cond(&pb, r[5]);
37 }

Fig. 2. Memory redundancy example. The loads in lines 11 and 12 for p->x and p->y are fully redundant; the loads in line 19 for p->x and p->y are partially redundant due to the conditional statement in line 16; the loads for p->x and p->y in line 25 are partially redundant because they are loop invariant.

3.1 Finding Memory Redundancy

In [10], we presented a new static analysis algorithm to detect memory redundancy. We extended Simpson's optimistic global value-numbering algorithm (SCCVN) [27, 12] to value-number memory instructions. SCCVN is a powerful procedure-scope detection algorithm for scalar (i.e., non-memory) redundancies: it discovers value-based identities among scalar instructions (as opposed to lexical identities), performs optimistic constant propagation, and handles a broad set of algebraic identities. To extend SCCVN so that it can also detect identities among loads and stores, we annotated the memory instructions in the compiler's intermediate representation (IR) with M-lists: lists of the names of memory objects that are potentially defined by the instruction (an M-def list) and the names of those that are potentially used by the instruction (an M-use list).

The M-lists are computed by a flow-insensitive, context-insensitive, Andersen-style pointer analysis [3]. Our compiler uses a low-level, RISC-style, three-address IR called Iloc. All memory accesses in Iloc occur through load and store instructions; the other instructions work on an unlimited set of virtual registers. The Iloc load and store code with M-lists for line 10 in Figure 2 is shown below:

    ild r1 => r4 M-use[@pa_x @pb_x]
    ild r2 => r5 M-use[@pa_y @pb_y]
    ist r3 r7 M-use[@r] M-def[@r]

As an example, the M-use in "ild r1 => r4 M-use[@pa_x @pb_x]" means the integer load reads from address r1, puts its result in r4, and may access memory object pa.x or pb.x; this corresponds to p->x in the source code. In the annotated IR, loads carry only an M-use list, since loads do not change the states of the referenced memory objects; during value numbering, the value numbers of the names in M-use indicate both the before- and after-states of the memory objects affected by the load. Stores are annotated with both M-use and M-def, since stores may write new values to memory objects; during value numbering of a store, the value numbers of the names in M-use indicate the states before the store executes, and the value numbers in M-def indicate the states after it executes.

Using M-lists, we can value-number memory instructions along with scalar instructions and detect instruction identities. To value-number a memory instruction, both the normal instruction operands (base address, offset, and result) and the M-list names are used as a combined hash key to look up values in the hash table. If there is a match, the memory instructions access the same address with the same value and leave the affected memory objects in identical states; therefore the matching instructions are redundant. For example, after value numbering, the three loads that correspond to p->x in function full_red in Figure 2 all have the same form, ild r1_vn => r4_vn M-use[@pa_x_vn @pb_x_vn]; therefore the three loads are identical, and the last two are redundant and can be removed to reuse the value in register r4_vn. Likewise, memory value numbering detects that the loads of p->x and p->y in the functions par_cond and par_loop in Figure 2 are redundant.
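To make the combined hash-key idea concrete, here is a minimal C sketch under assumed, simplified data structures (LoadKey, MAX_MUSE, and the function names are inventions for illustration, not the Iloc compiler's actual types): two loads receive the same value number only when the value numbers of their address, offset, and every M-use object state all match.

#include <stddef.h>
#include <stdint.h>

#define MAX_MUSE 8

typedef uint32_t ValNum;

/* Combined hash key for a load: operand value numbers plus the value
 * numbers of the memory-object states on its M-use list. */
typedef struct {
    ValNum addr;           /* value number of the base address  */
    ValNum offset;         /* value number of the offset        */
    size_t n_muse;         /* number of M-use entries           */
    ValNum muse[MAX_MUSE]; /* value numbers of the M-use states */
} LoadKey;

static int load_key_equal(const LoadKey *a, const LoadKey *b)
{
    if (a->addr != b->addr || a->offset != b->offset ||
        a->n_muse != b->n_muse)
        return 0;
    for (size_t i = 0; i < a->n_muse; i++)
        if (a->muse[i] != b->muse[i])
            return 0;
    return 1;
}

/* FNV-1a over the meaningful fields of the key. */
static uint64_t load_key_hash(const LoadKey *k)
{
    uint64_t h = 1469598103934665603ULL;
    ValNum fields[2 + MAX_MUSE] = { k->addr, k->offset };
    for (size_t i = 0; i < k->n_muse; i++)
        fields[2 + i] = k->muse[i];
    const uint8_t *p = (const uint8_t *)fields;
    size_t len = (2 + k->n_muse) * sizeof(ValNum);
    for (size_t i = 0; i < len; i++) {
        h ^= p[i];
        h *= 1099511628211ULL;
    }
    return h;
}

int main(void)
{
    LoadKey a = { .addr = 7, .offset = 0, .n_muse = 2, .muse = {11, 12} };
    LoadKey b = a; /* same operand and M-use value numbers */
    return (load_key_equal(&a, &b) &&
            load_key_hash(&a) == load_key_hash(&b)) ? 0 : 1;
}

A hash-table lookup with this key either finds an earlier equivalent load, in which case the new load is redundant and simply reuses the earlier result's value number, or installs the load as the defining occurrence.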

3.2 Removing Memory Redundancy

After memory redundancies are detected, code transformations eliminate the redundant instructions. We have used two different techniques to perform the elimination: traditional common subexpression elimination (CSE) [9] and partial redundancy elimination (PRE) [24, 18, 11]. Using the memory value-numbering results, we can easily extend scalar CSE and PRE into unified frameworks that remove both scalar and memory-based redundancies. Memory CSE was first described in [10] and is briefly recapitulated here for completeness; memory PRE is a more powerful removal framework that can eliminate a larger set of memory redundancies. This section shows the two frameworks, extended to include memory redundancy removal.

Available Expressions. Traditional common subexpression elimination finds and removes redundant scalar expressions (sometimes called fully redundant expressions). It computes, as a data-flow problem, the set of expressions available on entry to each block. An expression e is available on entry to block b if every control-flow path that reaches b contains a computation of e. Any expression in the block that is also available on entry to the block (in AVIN) is redundant and can be removed. Figure 3 shows the equations used for value-based CSE.

    AVLOC_i = values and memory ids computed locally in block i

    AVIN_i  = ∅ if i is the entry block;
              ⋂_{h ∈ pred(i)} AVOUT_h otherwise.

    AVOUT_i = AVIN_i ∪ AVLOC_i

Fig. 3. CSE data-flow equation system.

To identify fully redundant memory instructions, we assign each set of equivalent memory instructions a unique id number. The AVLOC_i set for block i is computed by adding the scalar values and memory ids defined in i. When the equations in Figure 3 are solved, the AVIN_i set contains the scalar values and memory ids available at the entry of block i. Fully redundant instructions (including redundant memory instructions) are then detected and removed by scanning the instructions of i in execution order: if scalar instruction s computes a value v ∈ AVIN_i, s is redundant and removed; if a memory instruction has id m ∈ AVIN_i, it is redundant and removed. For the example in Figure 2, the new memory CSE removes the four redundant loads in lines 11 and 12, as they are assigned the same ids as the loads in line 10.
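The equations in Figure 3 are solved by a standard iterative fixed-point computation. The following minimal C sketch uses one 64-bit word per bit set (so at most 64 scalar values and memory ids) and an invented four-block diamond CFG; it illustrates the data-flow solution, not the Iloc implementation.

#include <stdint.h>
#include <stdio.h>

#define NBLOCKS 4

typedef struct {
    int npred;
    int pred[NBLOCKS];
    uint64_t avloc;               /* ids computed locally in this block */
} Block;

static void solve_available(const Block *b, uint64_t *avin, uint64_t *avout)
{
    /* Non-entry blocks start "full" because AVIN is an intersection. */
    for (int i = 0; i < NBLOCKS; i++) {
        avin[i] = (i == 0) ? 0 : ~(uint64_t)0;
        avout[i] = avin[i] | b[i].avloc;
    }
    int changed = 1;
    while (changed) {             /* iterate to a fixed point */
        changed = 0;
        for (int i = 1; i < NBLOCKS; i++) {
            uint64_t in = ~(uint64_t)0;
            for (int p = 0; p < b[i].npred; p++)
                in &= avout[b[i].pred[p]];  /* AVIN = meet over preds */
            uint64_t out = in | b[i].avloc; /* AVOUT = AVIN U AVLOC   */
            if (in != avin[i] || out != avout[i]) {
                avin[i] = in;
                avout[i] = out;
                changed = 1;
            }
        }
    }
}

int main(void)
{
    /* Diamond CFG: 0 -> 1, 0 -> 2, {1,2} -> 3; id 0 computed in 1 and 2. */
    Block b[NBLOCKS] = {
        { 0, {0}, 0 },     /* entry: nothing computed */
        { 1, {0}, 1u },    /* then-block computes id 0 */
        { 1, {0}, 1u },    /* else-block computes id 0 */
        { 2, {1, 2}, 0 },  /* join block */
    };
    uint64_t avin[NBLOCKS], avout[NBLOCKS];
    solve_available(b, avin, avout);
    printf("id 0 available on entry to join: %s\n",
           (avin[3] & 1u) ? "yes" : "no");
    return 0;
}

On the diamond CFG in main, id 0 is computed on both branch paths, so it appears in AVIN of the join block, and a re-computation there would be removed as fully redundant.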

Partial Redundancy Elimination. The key idea behind partial redundancy elimination (PRE) and lazy code motion is to find computations that are redundant on some, but not all, paths [24, 18, 11]. Given an expression e at point p that is redundant on some subset of the paths that reach p, the transformation inserts evaluations of e on the paths where it had not been computed, making the evaluation at p redundant on all paths. Our transformation is based on the formulation due to Drechsler and Stadel [11], which computes the sets INSERT and DELETE for scalar expressions in each block. The set INSERT_{i→j} contains the partially redundant expressions that must be duplicated along the edge i→j; the set DELETE_i contains the expressions in block i that are redundant and can be removed. The data-flow equations are shown in Figure 4.

PRE is, essentially, a code-motion transformation, so it must preserve data dependences: the flow, anti, and output dependences of the original program must be preserved [2]. The results from our memory redundancy detection algorithm let us model dependence relations involving memory instructions and remove redundant loads. (We exclude stores from PRE for two reasons. First, loads do not create anti or output dependences, so fixing the positions of stores greatly simplifies construction of the dependence graph. Second, and equally important, our experiments show that opportunities to remove redundant stores are quite limited [29].)

Availability:
    AVIN_i  = ∅ if i is the entry block b0;
              ⋂_{h ∈ pred(i)} AVOUT_h otherwise.
    AVOUT_i = AVIN_i ∪ AVLOC_i

Anticipability:
    ANTOUT_i = ∅ if i is the exit block;
               ⋂_{j ∈ succ(i)} ANTIN_j otherwise.
    ANTIN_i  = (ANTOUT_i − altered_i) ∪ ANTLOC_i

Earliest:
    EARLIEST_{i→j} = ANTIN_j ∩ ¬AVOUT_i if i is b0;
                     ANTIN_j ∩ ¬AVOUT_i ∩ (altered_i ∪ ¬ANTOUT_i) otherwise.

Later:
    LATERIN_i   = ∅ if i is b0;
                  ⋂_{h ∈ pred(i)} LATER_{h→i} otherwise.
    LATER_{h→i} = (LATERIN_h ∩ ¬ANTLOC_h) ∪ EARLIEST_{h→i}

Placement:
    INSERT_{i→j} = LATER_{i→j} ∩ ¬LATERIN_j
    DELETE_i     = ∅ if i is b0;
                   ANTLOC_i ∩ ¬LATERIN_i otherwise.

Fig. 4. PRE data-flow equation system.

To encode the constraints of load motion into the equations for PRE, we must consider both the load address and the states of the memory objects on the load's M-use list. Specifically, a load cannot be moved past the instructions of its addressing computation; in addition, other memory instructions might change the states of the memory objects the load may read from, so a load cannot be moved past any memory instruction that assigns a new value number (i.e., defines a new state) to a memory object on the load's M-use list. Using the memory value-numbering results, we can build a value dependence graph that encodes the dependence relationships among the value numbers of the results of scalar instructions and the value numbers of the M-def and M-use lists of memory instructions. In particular:

1) each scalar instruction becomes the def node that defines the value number of its result, and we add a dependence edge from the def node of each source operand to the instruction's node; 2) a store becomes the def node that defines the value numbers of any objects on its M-def list that are assigned new value numbers; 3) a load becomes the def node for the load result, and we add edges from the def nodes of the load address and of the value numbers of any memory objects on the load's M-use list. Intuitively, the value numbers of scalar operands and M-list objects capture the def-use relations among scalar and memory instructions. Stores can be thought of as def points for the values stored in the memory objects on their M-def lists; the value dependence edges between stores and loads that share common memory objects represent the flow dependences between store and load instructions. Thus, using the value numbers assigned by the memory redundancy detection algorithm, we can build a value dependence graph that represents the dependence relations for both scalar and memory instructions.

Once the value dependence graph has been built, the compiler can build the local set altered_i for each block. The altered_i set contains the instructions whose source operands would change values due to the execution of block i. If e ∈ altered_i, then code motion must not move e backward beyond i, as doing so would violate the dependence rule. We set altered_i to include all instructions in block i other than scalar and load instructions; this prevents the algorithm from moving those instructions. Furthermore, any instructions that depend transitively on these instructions are also included in altered_i. This can be computed by taking the transitive closure in the value dependence graph with respect to the def nodes of the instructions in i.

Another local set, ANTLOC_i, contains the candidate instructions for PRE to remove. In traditional applications of PRE, ANTLOC_i contains only scalar instructions. Using the value numbers for M-lists, we can model memory dependences and put loads into ANTLOC_i as well: we set ANTLOC_i to contain the scalar and load instructions in block i that are not in altered_i; in other words, it contains the scalars and loads whose motion is not restricted by i. The last local set in the PRE framework is AVLOC_i, which contains all the scalars and loads in i. By treating memory instructions in this way, we force the data-flow system to consider them.

When the data-flow system is solved, the INSERT and DELETE sets contain the scalar instructions and loads that are partially redundant and can be removed. In the example in Figure 2, the partially redundant loads of p->x and p->y in line 19 are in the DELETE set, and copies of these loads are in the INSERT set of the edge into the block where the test condition is false. Similarly, the loads in the loop body in line 25 are removed, and copies of these loads are inserted in the loop header. In summary, memory PRE successfully removes the partial memory redundancies in Figure 2.
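To illustrate the computation of altered_i described above, the sketch below seeds the set with every instruction of block i that is neither a scalar operation nor a load, then takes the transitive closure over the value-dependence (def-to-user) edges. The instruction encoding and graph layout are invented for illustration and are not the compiler's actual structures.

#include <stdint.h>
#include <string.h>

#define MAX_INSNS 64

typedef enum { INSN_SCALAR, INSN_LOAD, INSN_STORE, INSN_OTHER } InsnKind;

typedef struct {
    InsnKind kind;
    int nusers;
    int users[MAX_INSNS];     /* value-dependence edges: def -> user */
} Insn;

static void compute_altered(const Insn *g, int n, const int *block_of,
                            int blk, uint8_t *altered)
{
    int stack[MAX_INSNS];
    int top = 0;
    memset(altered, 0, (size_t)n);
    for (int v = 0; v < n; v++)
        if (block_of[v] == blk &&
            g[v].kind != INSN_SCALAR && g[v].kind != INSN_LOAD) {
            altered[v] = 1;   /* seed: stores and other fixed insns */
            stack[top++] = v;
        }
    while (top > 0) {         /* transitive closure over dependents */
        int v = stack[--top];
        for (int i = 0; i < g[v].nusers; i++) {
            int u = g[v].users[i];
            if (!altered[u]) {
                altered[u] = 1;
                stack[top++] = u;
            }
        }
    }
}

int main(void)
{
    /* block 0: a store (0) feeding a load (1) feeding a scalar add (2) */
    Insn g[3] = {
        { INSN_STORE,  1, {1} },
        { INSN_LOAD,   1, {2} },
        { INSN_SCALAR, 0, {0} },
    };
    int block_of[3] = {0, 0, 0};
    uint8_t altered[3];
    compute_altered(g, 3, block_of, 0, altered);
    /* All three land in altered_0: the store directly, the load and the
       add transitively, so none of them may move past block 0. */
    return (altered[0] && altered[1] && altered[2]) ? 0 : 1;
}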

4 Experimental Results

Figure 5 shows the execution model for our compiler. The C front end (c2i) converts the program into Iloc; the compiler applies multiple analysis and optimization passes to the Iloc code; finally, the back end (i2ss) generates SimpleScalar executables.

Fig. 5. Iloc execution model: C source → front end (c2i) → analysis/transformation passes on Iloc → back end (i2ss) → SimpleScalar executable.

To evaluate the energy-efficiency improvement of memory redundancy elimination, we implemented memory CSE and memory PRE as Iloc passes, referred to as M-CSE and M-PRE. Since the memory versions of CSE and PRE subsume scalar CSE and PRE, we also implemented the scalar versions, referred to as S-CSE and S-PRE, to isolate the effects of memory redundancy removal. We use the benchmarks in Table 2. The benchmarks are first translated into Iloc, and multiple passes of traditional compiler optimizations are run on the Iloc code, including constant propagation, dead code elimination, copy coalescing, and control-flow simplification. We then run whole-program pointer analysis to annotate the Iloc code with M-lists. We run the S-CSE, M-CSE, S-PRE and M-PRE passes separately on the Iloc code, followed by the SimpleScalar back end i2ss, to create the SimpleScalar executables. Finally, we run the generated executables on the out-of-order superscalar simulator with the Wattch tool and collect run-time performance and energy statistics.

Benchmark | S-CSE | M-CSE | Ratio | S-PRE | M-PRE | Ratio
adpcm | 445,117 | 445,117 | 100.0% | 464,893 | 464,893 | 100.0%
g721 | 44,066,954 | 43,425,899 | 98.6% | 44,297,744 | 43,618,024 | 98.5%
gsm | 34,723,686 | 33,797,428 | 97.3% | 34,847,422 | 21,949,559 | 63.0%
epic | 8,242,764 | 7,708,474 | 93.5% | 8,232,931 | 7,240,634 | 88.0%
pegwit | 11,683,127 | 8,437,712 | 72.2% | 11,680,913 | 8,554,340 | 73.2%
mpeg | 27,046,064 | 25,134,366 | 92.9% | 26,925,145 | 23,944,703 | 88.9%
Geometric Mean | | | 91.9% | | | 84.2%
181.mcf | 70,190,795 | 65,194,656 | 92.9% | 70,711,865 | 61,807,206 | 87.4%
164.gzip | 765,126,630 | 534,164,375 | 69.8% | 768,743,772 | 526,464,600 | 68.5%
256.bzip2 | 556,298,644 | 373,466,249 | 67.1% | 554,923,128 | 344,316,419 | 62.1%
175.vpr | 396,021,979 | 275,488,845 | 69.6% | 397,356,599 | 263,224,908 | 66.2%
197.parser | 97,672,948 | 82,377,643 | 84.3% | 98,506,932 | 82,497,510 | 83.8%
300.twolf | 99,816,284 | 72,935,505 | 73.1% | 98,696,484 | 72,542,711 | 73.5%
Geometric Mean | | | 75.6% | | | 73.0%
Overall G-Mean | | | 83.4% | | | 78.4%

Table 5. Dynamic load count.

Dynamic Load Count and Cycle Count. Table 5 shows the dynamic load counts for the benchmarks; the ratio columns give the load-count ratio between the memory and scalar versions of CSE and PRE.

Fig. 6. Normalized execution cycles (M-CSE, S-PRE, and M-PRE relative to S-CSE).

For the majority of the benchmark applications, both M-CSE and M-PRE significantly reduce the dynamic load count, with geometric-mean reductions of 16.6% for M-CSE and 21.6% for M-PRE. Because M-PRE removes memory redundancies from conditionals and loops, it removes more memory redundancies than M-CSE. Furthermore, the data show that M-CSE and M-PRE have more opportunities in the SPEC2000 programs than in MediaBench: the dynamic load ratios between M-PRE and S-PRE are 73.0% for SPEC2000 and 84.2% for MediaBench, and the ratios between M-CSE and S-CSE are 75.6% for SPEC2000 and 91.9% for MediaBench. The cause of this difference is that the SPEC2000 applications are generally larger and more complex than those in MediaBench, so more data references are compiled as memory instructions, which provides more opportunities for memory redundancy elimination.

Figure 6 shows the impact of memory redundancy elimination on application execution cycles. As expected, M-PRE achieves the best results, as it is the most powerful redundancy elimination. (The large execution cycle count for S-PRE on 300.twolf is due to abnormally high L1 I-Cache misses; for other cache configurations, the S-PRE cycle count is generally comparable to that of S-CSE.) The reduction in execution cycle count leads to energy savings in the clocking network. Figure 7 shows the normalized clocking-network energy consumption of M-CSE, S-PRE and M-PRE with S-CSE as the base; the curves are mostly identical to those in Figure 6. As with the execution cycle results, the benchmarks 300.twolf, 175.vpr, 256.bzip2 and gsm show the largest energy savings with M-PRE.

Cache Energy. As discussed in Section 2, memory redundancy elimination reduces cache accesses in both the L1 I-Cache and the D-Cache, and thus saves energy in the cache structures. Figure 8 shows the normalized L1 I-Cache energy consumption for M-CSE, S-PRE and M-PRE with S-CSE as the base; Figure 9 shows the normalized L1 D-Cache energy for the four versions. In Figures 8 and 9, the curves for M-PRE are the lowest, as M-PRE generally incurs the fewest I-Cache and D-Cache accesses, thus achieving the largest energy savings.

Fig. 7. Normalized clocking-network energy consumption (M-CSE, S-PRE, and M-PRE relative to S-CSE).

Fig. 8. Normalized L1 I-Cache energy consumption (M-CSE, S-PRE, and M-PRE relative to S-CSE).

Fig. 9. Normalized L1 D-Cache energy consumption (M-CSE, S-PRE, and M-PRE relative to S-CSE).

Fig. 10. Normalized total energy consumption (M-CSE, S-PRE, and M-PRE relative to S-CSE).

The energy results for the I-Cache and D-Cache also show that memory redundancy elimination is more effective at reducing D-Cache energy: both M-CSE and M-PRE achieve more than 10% energy savings in the D-Cache for pegwit, 164.gzip, 256.bzip2, 175.vpr and 300.twolf, while the savings in the I-Cache are relatively smaller.

Total Energy and Energy-Delay Product. Figure 10 shows the normalized total application energy consumption. Among the redundancy elimination techniques, M-PRE produces the best energy efficiency. A useful metric that captures both application performance and energy efficiency is the energy-delay product [14]: the smaller the energy-delay product, the better the application's combined energy efficiency and performance. Figure 11 shows the normalized energy-delay product with S-CSE as the base. Because memory redundancy elimination reduces both application execution cycles and total energy consumption, the energy-delay products for M-CSE and M-PRE are smaller. In contrast to techniques such as dynamic voltage scaling, which trade application execution speed for reduced energy consumption, memory redundancy elimination improves both application performance and energy efficiency, making it a desirable compiler transformation: it saves energy without loss of performance.
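Since the energy-delay product is simply energy × delay, the headline improvements compose multiplicatively. As illustrative arithmetic (the 11% cycle reduction here is an assumed, representative value in line with Figure 6, not a reported statistic): combining a 15% energy reduction with an 11% cycle reduction gives

    (0.85 × E) × (0.89 × D) / (E × D) = 0.85 × 0.89 ≈ 0.76,

i.e., roughly the 24% best-case energy-delay-product reduction reported above.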

Fig. 11. Normalized energy-delay product (M-CSE, S-PRE, and M-PRE relative to S-CSE).

Application Energy Breakdown. We also studied how the micro-architecture components contribute to total application energy. Figures 12 and 13 show the component energy breakdown for 256.bzip2 and 175.vpr, the two applications with the largest energy-efficiency improvements. The major energy savings for these two applications come from the clocking network and the top-level instruction and data caches. In 256.bzip2, the clocking-network energy savings for M-CSE and M-PRE are 12% and 15% respectively, the L1 I-Cache savings are 8% and 10%, and the L1 D-Cache savings are 23% and 24%; the overall energy savings are 12% for M-CSE and 15% for M-PRE. Similarly, in 175.vpr, the clocking-network energy savings for M-CSE and M-PRE are 13% and 15%, the L1 I-Cache savings are 10% and 12%, and the L1 D-Cache savings are 25% and 26%; the overall energy savings are 14% for M-CSE and 15% for M-PRE.

Fig. 12. Energy breakdown of 256.bzip2 (mJ) by component (clock, L1 I-Cache, L1 D-Cache, L2 cache, bpred, rename, window, LSQ, ALU, bus, regfile) for S-CSE, M-CSE, S-PRE and M-PRE.

Fig. 13. Energy breakdown of 175.vpr (mJ) by component, for the same configurations.

5 Related Work

Recently, power and energy issues have become critical design constraints both for high-end processors and for battery-powered embedded digital devices. Researchers have developed many hardware-based techniques to reduce power and energy consumption in these systems. Dynamic voltage scaling (DVS), which dynamically varies processor clock frequency and voltage to save energy, is described in [25, 28, 22]; the work in [13, 1, 17] discusses ways to reduce cache energy consumption. All of these, however, are circuit- and architecture-level techniques; comparatively little attention has been paid to application-level energy-saving techniques. In [16], Kandemir et al. studied the energy effects of loop-level compiler optimizations on array-based scientific codes. In contrast to their work, we first profiled total application energy consumption to identify the top energy-consuming components, and then evaluated one compiler technique, memory redundancy elimination, which can significantly reduce energy consumption in those components; furthermore, our technique targets more complicated general-purpose and multimedia applications. Researchers have also studied compile-time management of hardware-based energy-saving mechanisms such as DVS: Hsu et al. described a compiler algorithm to identify program regions where the CPU can be slowed down with negligible performance loss [15], and Kremer summarized compiler-based energy management methods in [19].

These methods are orthogonal to the techniques in this paper.

Both scalar [9, 24, 18, 5] and memory [23, 21, 26, 4, 8] redundancy detection and removal have been studied in the literature. The redundancy detection algorithm used in our work is described in [10]; compared to other methods, it unifies the detection of scalar and memory redundancies and is able to find more redundancies. Most of the previous work concerns application run-time speed, while our work targets the benefits of energy savings, though the results show that performance is also improved.

6 Conclusion

Most of the recent work on low-power and low-energy systems focuses on circuit- and architecture-level techniques. However, more energy savings are possible by optimizing the behavior of the applications themselves. We profiled the energy consumption of a suite of benchmarks; the energy statistics identify the clocking network and the first-level caches as the top energy-consuming components. With this insight, we investigated the energy savings of a particular compiler technique, memory redundancy elimination. We presented two redundancy elimination frameworks and evaluated their energy improvements. The results indicate that memory redundancy elimination reduces both execution cycles and the number of top-level cache accesses, thus saving energy in the clocking network and in the instruction and data caches. For our benchmarks, memory redundancy elimination achieves up to a 15% reduction in total energy consumption and up to a 24% reduction in energy-delay product.

7 Acknowledgments

We would like to thank Tim Harvey and the anonymous reviewers, whose comments greatly helped improve the presentation of this paper.

References

1. D. H. Albonesi. Selective cache ways: On-demand cache resource allocation. In International Symposium on Microarchitecture, 1999.
2. R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann, 2002.
3. L. O. Andersen. Program Analysis and Specialization for the C Programming Language. PhD thesis, University of Copenhagen, 1994.
4. R. Bodik, R. Gupta, and M. L. Soffa. Load-reuse analysis: Design and evaluation. In 1999 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 64-76.
5. P. Briggs, K. D. Cooper, and L. T. Simpson. Value numbering. Software Practice and Experience, pages 710-724, June 1997.
6. D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In International Symposium on Computer Architecture, pages 83-94, 2000.
7. D. Burger and T. Austin. The SimpleScalar toolset, version 2.0. Computer Architecture News, pages 13-25, June 1997.

8. D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. In 1990 ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), pages 53-65.
9. J. Cocke. Global common subexpression elimination. In Symposium on Compiler Construction, pages 20-24, July 1970.
10. K. D. Cooper and L. Xu. An efficient static analysis algorithm to detect redundant memory operations. In ACM Workshop on Memory Systems Performance, 2002.
11. K.-H. Drechsler and M. P. Stadel. A variation of Knoop, Rüthing, and Steffen's lazy code motion. SIGPLAN Notices, pages 29-38, May 1993.
12. K. Gargi. A sparse algorithm for predicated global value numbering. In 2002 ACM SIGPLAN PLDI, pages 45-56.
13. K. Ghose and M. Kamble. Reducing power in superscalar processor caches using subbanking, multiple line buffers and bit-line segmentation. In International Symposium on Low Power Electronics and Design, pages 70-75, 1999.
14. R. Gonzalez and M. Horowitz. Energy dissipation in general purpose microprocessors. IEEE Journal of Solid-State Circuits, pages 1277-1283, Sept. 1996.
15. C.-H. Hsu and U. Kremer. The design, implementation, and evaluation of a compiler algorithm for CPU energy reduction. In 2003 PLDI, pages 38-48.
16. M. T. Kandemir, N. Vijaykrishnan, et al. Influence of compiler optimizations on system power. In Design Automation Conference, pages 304-307, 2000.
17. J. Kin et al. The filter cache: An energy efficient memory structure. In International Symposium on Microarchitecture, pages 184-193, 1997.
18. J. Knoop, O. Rüthing, and B. Steffen. Lazy code motion. In ACM SIGPLAN 1992 PLDI, pages 224-234.
19. U. Kremer. Compilers for power and energy management. 2003 PLDI Tutorial.
20. C. Lee, M. Potkonjak, and W. H. Mangione-Smith. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In International Symposium on Microarchitecture, pages 330-335, 1997.
21. R. Lo, F. Chow, R. Kennedy, S.-M. Liu, and P. Tu. Register promotion by sparse partial redundancy elimination of loads and stores. In 1998 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 26-37.
22. J. R. Lorch and A. J. Smith. Improving dynamic voltage scaling algorithms with PACE. In SIGMETRICS/Performance, pages 50-61, 2001.
23. J. Lu and K. Cooper. Register promotion in C programs. In 1997 ACM SIGPLAN Conference on Programming Language Design and Implementation, pages 308-319.
24. E. Morel and C. Renvoise. Global optimization by suppression of partial redundancies. Commun. ACM, 22(2):96-103, 1979.
25. T. Pering, T. Burd, and R. Brodersen. The simulation and evaluation of dynamic voltage scaling algorithms. In International Symposium on Low Power Electronics and Design, pages 76-81, 1998.
26. A. Sastry and R. D. Ju. A new algorithm for scalar register promotion based on SSA form. In 1998 ACM SIGPLAN PLDI, pages 15-25.
27. T. Simpson. Value-Driven Redundancy Elimination. PhD thesis, Rice University, 1996.
28. T. Simunic, L. Benini, A. Acquaviva, P. Glynn, and G. De Micheli. Dynamic voltage scaling for portable systems. In Design Automation Conference, 2001.
29. L. Xu. Program Redundancy Analysis and Optimization to Improve Memory Performance. PhD thesis, Rice University, 2003.
30. W. Ye, N. Vijaykrishnan, M. T. Kandemir, and M. J. Irwin. The design and use of SimplePower: A cycle-accurate energy estimation tool. In Design Automation Conference, pages 340-345, 2000.