Memory Redundancy Elimination to Improve Application Energy Efficiency


Keith D. Cooper and Li Xu
Department of Computer Science, Rice University, Houston, Texas, USA

Abstract. Application energy consumption has become an increasingly important issue for both high-end microprocessors and mobile and embedded devices. A multitude of circuit and architecture-level techniques have been developed to improve application energy efficiency. However, relatively little work studies the effects of compiler transformations on application energy efficiency. In this paper, we use energy-estimation tools to profile the execution of benchmark applications. The results show that energy consumption due to memory instructions accounts for a large share of total energy. An effective compiler technique that can improve energy efficiency is memory redundancy elimination, which reduces both application execution cycles and the number of cache accesses. We evaluate the energy improvement over 12 benchmark applications from SPEC2000 and MediaBench. The results show that memory redundancy elimination can significantly reduce energy in the processor clocking network and the instruction and data caches. Overall application energy consumption can be reduced by up to 15%, and the reduction in energy-delay product is up to 24%.

1 Introduction

Application energy consumption has become an increasingly important issue for the whole array of microprocessors, spanning from high-end processors used in data centers to those inside mobile and embedded devices. Energy conservation is currently the target of intense research efforts. A multitude of circuit and architecture-level techniques have been proposed and developed to reduce processor energy consumption [25, 13, 17]. However, many of these research efforts focus on hardware techniques, such as dynamic voltage scaling (DVS) [25, 28, 22] and low-energy cache design [13, 1, 17]. Equally important, application-level techniques are necessary to make program execution more energy efficient: ultimately, it is the applications executed by the processors that determine the total energy consumption.

In this paper, we look at how compiler techniques can be used to improve application energy efficiency. In Section 2, we use energy profiling to identify the top energy-consuming micro-architecture components and motivate the study of memory redundancy elimination as a potential technique to reduce energy. Section 3 overviews a new algorithm for memory redundancy detection and presents two frameworks to remove redundant memory instructions. Section 4 presents the experimental results. Section 5 summarizes related work, and Section 6 concludes the paper.

2 Energy Profiling

Optimizing compilers have been successful at improving program performance [2]. One key reason is that accurate performance models are used to evaluate the various code transformations. Similarly, automatic optimization of program energy efficiency requires accurate modeling of energy dissipation. Unfortunately, most energy estimation tools work at the circuit and transistor level and require detailed information from the circuit design. Recently, researchers have started to build higher-level energy modeling tools, such as Wattch [6] and SimplePower [30], which can estimate the power and energy dissipation of the various micro-architecture components. When combined with instruction-level performance simulators, these tools provide an ideal infrastructure to evaluate compiler optimizations targeting energy efficiency. In our work, we use the Wattch [6] tool along with the SimpleScalar [7] simulators to study compiler techniques in terms of application energy consumption. Our approach is as follows: we first profile application execution and measure the energy consumption breakdown by major processor components. This step reveals how energy is dissipated. Based on the application energy profile, we can then identify promising code optimizations to improve energy efficiency.

Table 1. SimpleScalar Simulator Configuration

  Processor Core
    RUU size:            64 instructions
    LSQ size:            32 instructions
    Fetch queue size:    8 instructions
    Fetch width:         4 instructions/cycle
    Decode width:        4 instructions/cycle
    Issue width:         4 instructions/cycle
    Commit width:        4 instructions/cycle
    Function units:      4 integer ALUs; 1 integer multiply; 1 integer divide;
                         1 FP add; 1 FP multiply; 1 FP divide/sqrt

  Memory Hierarchy
    L1 data cache:        64KB, 2-way (LRU), 32B block, 1-cycle latency
    L1 instruction cache: 64KB, 2-way (LRU), 32B block, 1-cycle latency
    L2 cache:             unified, 2MB, 4-way (LRU), 32B block, 12-cycle latency
    Memory:               100-cycle latency
    TLB:                  128-entry, fully associative, 30-cycle miss latency

  Branch Prediction
    Branch predictor:       combined: 4K chooser; bimodal: 4K table;
                            2-level: 1K table, 10-bit history
    BTB:                    1024-entry, 2-way
    Return address stack:   32-entry
    Misprediction penalty:  7 cycles

In CMOS circuits, dynamic power consumption accounts for the major share of power dissipation. We use Wattch to get the dynamic power breakdown of superscalar processors. The processor configuration is shown in Table 1 and is similar to that of the Alpha 21264. The Wattch tool is configured to use the parameters of a .35µm process at 600 MHz with a supply voltage of 2.5V. Figure 1 shows the percentage of dynamic power dissipated by each micro-architecture component, assuming each component is fully active. The top dynamic power-dissipating components are the global clocking network and the on-chip caches; combined, they account for more than 70% of total dynamic power dissipation.

Fig. 1. Active dynamic power consumption by micro-architecture components: clock 28%, L2 cache 18%, L1 d-cache 16%, L1 i-cache 8%, instruction window 6%, bpred 5%, register file 5%, alu 5%, result bus 4%, load/store queue 3%, rename 1%, DTLB 1%, ITLB 0%.

Table 2. Test Benchmarks

  Benchmark    Description                           Input
  adpcm        16-to-4 bit voice encoding            clinton.pcm
  g721         CCITT G.721 voice encoding            clinton.pcm
  gsm          GSM speech encoding                   clinton.pcm
  epic         Pyramid image encoding                test_img.pgm
  pegwit       Elliptic curve public key encryption  news.txt
  mpeg2dec     MPEG-2 video decoding                 child.mpg
  181.mcf      Combinatorial optimization            test input
  164.gzip     Compression                           test input
  256.bzip2    Compression                           test input
  175.vpr      FPGA placement and routing            test input
  197.parser   Link grammar parser of English        test input
  300.twolf    Circuit placement and routing         test input

Table 3 (summary). Application energy consumption and energy consumption by the clocking network and top-level I-Cache and D-Cache; energy unit is mJ (10^-3 joule). Geometric means of each component's share of total energy across the 12 benchmarks: clock 30.4%, L1 I-Cache 14.4%, L1 D-Cache 13.2%.

This suggests that the clocking network and the caches should be the primary targets for compiler techniques that aim to improve energy efficiency. The processor energy of dynamic switching ($E_d$) can be defined as

$$E_d = \alpha C V_{dd}^2$$

where $C$ is the load capacitance, $V_{dd}$ is the supply voltage, and $\alpha$ is the switching activity factor, which indicates how often logic transitions from low to high take place [14]. $C$ and $V_{dd}$ depend on the particular process technology and circuit design, while the activity factor $\alpha$ is related to the code being executed [14]. The main leverage the compiler has to minimize $E_d$ is to reduce $\alpha$.

We ran 12 benchmark applications and profiled their energy consumption. The benchmarks are chosen from the widely used SPEC2000 and MediaBench [20] suites; Table 2 describes them. We compiled the benchmark applications with the GNU GCC compiler at the -O4 optimization level. The compiled executables were then run on the out-of-order superscalar simulator with Wattch to collect run-time and energy statistics. Table 3 shows the total energy consumption and the energy in the clocking network and the top-level I-Cache and D-Cache. The energy distribution across the micro-architecture components is very similar to the power distribution in Figure 1. The major difference is that all applications exhibit good cache locality, so the L2 cache is rarely accessed thanks to the low number of top-level cache misses; compared to the energy in the other components, energy in the L2 cache is therefore negligible. As Table 3 shows, energy consumption in the clocking network and the top-level caches accounts for a large share of total application energy: for the 12 applications, the clocking network and L1 caches together account for more than 58% (geometric mean) of total energy.

Table 4 shows the dynamic instruction count and the dynamic load and store counts. Memory instructions account for about 24% (geometric mean) of dynamic instructions, and for the more sophisticated SPEC2000 applications the percentage is even higher, with a geometric mean of 36%. The large number of dynamic memory instructions has the following consequences: first, these instructions must be fetched from the I-Cache before execution, costing energy in the I-Cache; second, instruction execution itself costs energy, including energy in the clocking network; and third, executing a memory instruction requires a D-Cache access, which is the major cause of D-Cache energy consumption. As both the clocking network and the caches are top power-dissipating components, memory instructions thus have a significant impact on total energy consumption. These energy profiling data indicate that memory instructions are a good target for improving application energy efficiency: redundant memory instructions represent wasted energy, and removing them should reduce energy costs.
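The switching-energy equation above also bounds what such a transformation can achieve (a worked illustration, not a result from the paper): with the process-dependent terms $C$ and $V_{dd}$ held fixed, dynamic-energy savings track the reduction in switching activity one-for-one,

$$\frac{E_d'}{E_d} \;=\; \frac{\alpha' C V_{dd}^2}{\alpha C V_{dd}^2} \;=\; \frac{\alpha'}{\alpha},$$

so a transformation that eliminates 20% of a component's switching activity eliminates 20% of that component's dynamic energy.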

The dominant modern processor architecture is the load-store architecture, in which most instructions operate on data in the register file and only loads and stores can access memory. Between the processor core and main memory, the I-Cache stores the instructions to be fetched and executed, and the D-Cache serves as a local copy of memory data so that loads and stores can access data faster. When redundant memory instructions are removed, the traffic from memory through the I-Cache to the CPU core shrinks because fewer instructions are fetched, saving energy in the I-Cache. Data accesses in the D-Cache are also reduced, saving energy in the D-Cache. Finally, removing memory instructions speeds up the application and saves energy in the clocking network. As our profiling showed, the clocking network and the cache structures are among the top energy-consuming components in the processor, so energy savings in these components can significantly reduce total energy consumption. The rest of this paper presents compile-time memory redundancy elimination and evaluates its effectiveness at improving energy efficiency.

Table 4 (summary). Memory instruction count and ratio; benchmarks compiled with GCC -O4. Load+Store share of dynamic instructions: adpcm 6.3%, g721 16.5%, gsm 21.8%, epic 14.2%, pegwit 26.7%, mpeg 19.1%, 181.mcf 40.3%, 164.gzip 33.5%, 256.bzip2 29.3%, 175.vpr 40.2%, 197.parser 35.8%, 300.twolf 38.2%. Geometric means: loads 18.3%, stores 5.5%, loads+stores 24.0%.

3 Memory Redundancy Elimination

Memory redundancy elimination is a compile-time technique to remove unnecessary memory instructions. Consider the sample C code in Figure 2. In the functions full_red, par_cond and par_loop, the struct field accesses p->x and p->y are generally compiled into loads. However, the loads in lines 11 and 12 are fully redundant with those in line 10, as they always load the same values at run time; similarly, the loads in line 19 are partially redundant with those in line 17 whenever the conditional statement is executed; and the loads in line 25 are partially redundant, as the values need to be loaded only in the first loop iteration, with all remaining iterations loading the same values. These redundant loads can be detected and removed at compile time. As we discussed in Section 2, memory instructions incur significant dynamic energy consumption, so memory redundancy elimination can be an effective energy-saving transformation.

In our prior work [10], we presented a new static analysis algorithm to detect memory redundancy. The algorithm uses value numbering on memory operations and is the basis for the memory redundancy removal techniques described in this paper. This paper extends the work in [10] by providing a more powerful removal framework capable of eliminating a larger set of memory redundancies; furthermore, this paper focuses on the energy-efficiency benefits, whereas the previous work concerned performance improvements. In Section 3.1, we first give an overview of the memory redundancy detection algorithm; in Section 3.2, we present code transformations that use its results to remove fully and partially redundant memory instructions.

 1  struct parm {
 2      int x;
 3      int y;
 4  };
 5  struct parm pa = {3, 7};
 6  struct parm pb = {2001, 2002};
 7
 8  void full_red(struct parm *p, int *result)
 9  {
10      result[0] = p->x + p->y;
11      result[1] = p->x - p->y;
12      result[2] = p->x + p->y;
13  }
14  void par_cond(struct parm *p, int *result)
15  {
16      if (p->x > 10) {
17          result[0] = p->x + p->y;
18      }
19      result[1] = p->x - p->y;
20  }
21  void par_loop(struct parm *p, int *result)
22  {
23      int i;
24      for (i = 0; i < 100; i++) {
25          result[i] = p->x + p->y;
26      }
27  }
28  void client()
29  {
30      int r[6][100];
31      full_red(&pa, r[0]);
32      full_red(&pb, r[1]);
33      par_cond(&pa, r[2]);
34      par_cond(&pb, r[3]);
35      par_loop(&pa, r[4]);
36      par_cond(&pb, r[5]);
37  }

Fig. 2. Memory Redundancy Example. The loads in lines 11 and 12 for p->x and p->y are fully redundant; the loads in line 19 for p->x and p->y are partially redundant due to the conditional statement in line 16; the loads for p->x and p->y in line 25 are partially redundant because they are loop invariant.

3.1 Finding Memory Redundancy

In [10], we presented a new static analysis algorithm to detect memory redundancy. We extended Simpson's optimistic global value-numbering algorithm (Sccvn) [27, 12] to value-number memory instructions. Sccvn is a powerful procedure-scope detection algorithm for scalar (i.e., non-memory) redundancies: it discovers value-based identities among scalar instructions (as opposed to lexical identities), performs optimistic constant propagation, and handles a broad set of algebraic identities.

To extend Sccvn so that it can also detect identities among loads and stores, we annotated the memory instructions in the compiler's intermediate representation (IR) with M-lists: lists of the names of memory objects that are potentially defined by the instruction (an M-def list) and of those that are potentially used by the instruction (an M-use list). The M-lists are computed by a flow-insensitive, context-insensitive, Andersen-style pointer analysis [3]. Our compiler uses a low-level, RISC-style, three-address IR called Iloc. All memory accesses in Iloc occur through load and store instructions; the other instructions work on an unlimited set of virtual registers. The Iloc load and store code with M-lists for line 10 in Figure 2 is shown in the following:

    ild r1 => r4    M-use[@pa_x @pb_x]
    ild r2 => r5    M-use[@pa_y @pb_y]
    ist r3 r7       M-use[@r] M-def[@r]

The M-use annotation on ild r1 => r4 means the integer load loads from the address in r1, puts its result in r4, and may access memory object pa.x or pb.x; this corresponds to p->x in the source code. In the annotated IR, loads carry only an M-use list, since loads do not change the states of the referenced memory objects; during value numbering, the value numbers of the names in M-use describe the states of the affected objects both before and after the load. Stores are annotated with both M-use and M-def, as stores may write new values to memory objects: the value numbers of the names in M-use indicate the states before the store executes, and the value numbers in M-def indicate the states after it executes.

Using the M-lists, we can value-number memory instructions along with scalar instructions and detect instruction identities. To value-number a memory instruction, both the normal instruction operands (base address, offset, and result) and the M-list names are used as a combined hash key to look up values in the hash table. If there is a match, the memory instructions access the same address with the same value and leave the affected memory objects in identical states; the matching instructions are therefore redundant. For example, after value numbering, the three loads corresponding to p->x in function full_red in Figure 2 all have the same form, ild r1_vn => r4_vn M-use[@pa_x_vn], so the three loads are identities, and the last two are redundant and can be removed to reuse the value in register r4_vn. Likewise, memory value numbering detects that the loads of p->x and p->y in par_cond and par_loop in Figure 2 are redundant.
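As a source-level sketch of the net effect of this analysis (illustrative only; the compiler performs the transformation on Iloc virtual registers, not on C source), full_red from Figure 2 behaves as if it had been rewritten as:

    /* full_red after memory redundancy elimination (sketch):
       p->x and p->y are each loaded once; the four loads that
       lines 11 and 12 would re-issue reuse the loaded values. */
    void full_red(struct parm *p, int *result)
    {
        int x = p->x;        /* the only remaining load of p->x */
        int y = p->y;        /* the only remaining load of p->y */
        result[0] = x + y;
        result[1] = x - y;
        result[2] = x + y;   /* x + y is itself a scalar redundancy,
                                which value numbering also reuses   */
    }

Six loads shrink to two; Section 3.2 describes the frameworks that perform this removal.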

3.2 Removing Memory Redundancy

After memory redundancies are detected, code transformations eliminate the redundant instructions. We have used two different techniques to perform the elimination phase: traditional common subexpression elimination (CSE) [9] and partial redundancy elimination (PRE) [24, 18, 11]. Using the memory value-numbering results, we can easily extend scalar CSE and PRE into unified frameworks that remove both scalar and memory-based redundancies. Memory CSE was first described in [10] and is briefly recapitulated in this paper for completeness; memory PRE is a more powerful removal framework and can eliminate a larger set of memory redundancies. This section shows the two frameworks, extended to include memory redundancy removal.

$$\mathrm{AVIN}_i = \begin{cases} \emptyset & \text{if } i \text{ is the entry block} \\ \bigcap_{h \in pred(i)} \mathrm{AVOUT}_h & \text{otherwise} \end{cases} \qquad \mathrm{AVOUT}_i = \mathrm{AVIN}_i \cup \mathrm{AVLOC}_i$$

Fig. 3. CSE Data Flow Equation System (AVLOC_i is computed locally, as described in Section 3.2).

Available Expressions. Traditional common subexpression elimination (CSE) finds and removes redundant scalar expressions (sometimes called fully redundant expressions). It computes, as a data-flow problem, the set of expressions that are available on entry to each block. An expression e is available on entry to block b if every control-flow path that reaches b contains a computation of e. Any expression computed in the block that is also available on entry to the block (i.e., in AVIN) is redundant and can be removed. Figure 3 shows the equations used for value-based CSE.

To identify fully redundant memory instructions, we assign each set of equivalent memory instructions a unique id number. The AVLOC_i set for block i is computed by adding the scalar values and memory ids defined in i. When the equations in Figure 3 are solved, the AVIN_i set contains the scalar values and memory ids available at the entry of block i. Fully redundant instructions (including redundant memory instructions) are then detected and removed by scanning the instructions in i in execution order: if a scalar instruction s computes a value v in AVIN_i, s is redundant and removed; if a memory instruction has an id m in AVIN_i, it is redundant and removed. For the example in Figure 2, the new memory CSE removes the four redundant loads in lines 11 and 12, as they are assigned the same ids as those in line 10.
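The equations in Figure 3 can be solved with a standard round-robin iterative bit-vector solver. The sketch below (hypothetical data structures; our compiler's actual pass differs in detail) initializes non-entry blocks optimistically to the full set and iterates to a fixed point:

    #include <stdint.h>
    #include <string.h>

    #define MAXPRED 8
    #define NWORDS  4                     /* up to 256 value numbers / memory ids */

    typedef struct {
        int npred, pred[MAXPRED];         /* CFG predecessors            */
        uint64_t avloc[NWORDS];           /* values/ids computed locally */
        uint64_t avin[NWORDS];            /* solution sets               */
        uint64_t avout[NWORDS];
    } Block;

    /* Solve AVIN/AVOUT over the CFG: AVIN(entry) = {};
       AVIN(i) = intersection of AVOUT(h) over predecessors h;
       AVOUT(i) = AVIN(i) union AVLOC(i).                      */
    void solve_available(Block *b, int nblk, int entry)
    {
        for (int i = 0; i < nblk; i++) {
            /* optimistic start: everything available, except at the entry */
            memset(b[i].avin, i == entry ? 0x00 : 0xff, sizeof b[i].avin);
            for (int k = 0; k < NWORDS; k++)
                b[i].avout[k] = b[i].avin[k] | b[i].avloc[k];
        }
        for (int changed = 1; changed; ) {
            changed = 0;
            for (int i = 0; i < nblk; i++) {
                if (i == entry) continue;                 /* AVIN(entry) stays {} */
                for (int k = 0; k < NWORDS; k++) {
                    uint64_t in = ~0ULL;
                    for (int p = 0; p < b[i].npred; p++)
                        in &= b[b[i].pred[p]].avout[k];   /* meet = intersection  */
                    uint64_t out = in | b[i].avloc[k];
                    if (out != b[i].avout[k]) changed = 1;
                    b[i].avin[k]  = in;
                    b[i].avout[k] = out;
                }
            }
        }
    }

A final pass then walks each block in execution order, deleting any scalar instruction whose value number, or any memory instruction whose id, is already in AVIN.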

Partial Redundancy Elimination. The key idea behind partial redundancy elimination (PRE) and lazy code motion is to find computations that are redundant on some, but not all, paths [24, 18, 11]. Given an expression e at point p that is redundant on some subset of the paths reaching p, the transformation inserts evaluations of e on the paths where it was absent, making the evaluation at p redundant on all paths. Our transformation is based on the formulation due to Drechsler and Stadel [11], which computes the sets INSERT and DELETE for the scalar expressions in each block: INSERT_{i->j} contains the partially redundant expressions that must be duplicated along the edge i->j, and DELETE_i contains the expressions in block i that are redundant and can be removed. The data-flow equations are shown in Figure 4.

Availability:
$$\mathrm{AVIN}_i = \begin{cases} \emptyset & \text{if } i \text{ is the entry block } b_0 \\ \bigcap_{h \in pred(i)} \mathrm{AVOUT}_h & \text{otherwise} \end{cases} \qquad \mathrm{AVOUT}_i = \mathrm{AVIN}_i \cup \mathrm{AVLOC}_i$$

Anticipatability:
$$\mathrm{ANTOUT}_i = \begin{cases} \emptyset & \text{if } i \text{ is the exit block} \\ \bigcap_{j \in succ(i)} \mathrm{ANTIN}_j & \text{otherwise} \end{cases} \qquad \mathrm{ANTIN}_i = (\mathrm{ANTOUT}_i - \mathrm{altered}_i) \cup \mathrm{ANTLOC}_i$$

Earliest:
$$\mathrm{EARLIEST}_{i \to j} = \begin{cases} \mathrm{ANTIN}_j \cap \overline{\mathrm{AVOUT}_i} & \text{if } i = b_0 \\ \mathrm{ANTIN}_j \cap \overline{\mathrm{AVOUT}_i} \cap (\mathrm{altered}_i \cup \overline{\mathrm{ANTOUT}_i}) & \text{otherwise} \end{cases}$$

Later:
$$\mathrm{LATERIN}_i = \begin{cases} \emptyset & \text{if } i = b_0 \\ \bigcap_{h \in pred(i)} \mathrm{LATER}_{h \to i} & \text{otherwise} \end{cases} \qquad \mathrm{LATER}_{h \to i} = (\mathrm{LATERIN}_h \cap \overline{\mathrm{ANTLOC}_h}) \cup \mathrm{EARLIEST}_{h \to i}$$

Placement:
$$\mathrm{INSERT}_{i \to j} = \mathrm{LATER}_{i \to j} \cap \overline{\mathrm{LATERIN}_j} \qquad \mathrm{DELETE}_i = \begin{cases} \emptyset & \text{if } i = b_0 \\ \mathrm{ANTLOC}_i \cap \overline{\mathrm{LATERIN}_i} & \text{otherwise} \end{cases}$$

Fig. 4. PRE Data Flow Equation System

PRE is, essentially, a code-motion transformation, so it must preserve data dependences during the transformation: the flow, anti, and output dependences of the original program must be preserved [2]. The results from our memory redundancy detection algorithm let us model the dependence relations involving memory instructions and remove redundant loads.¹

¹ We exclude stores from PRE for two reasons. First, loads do not create anti and output dependences, so fixing the positions of stores greatly simplifies the construction of the dependence graph. Second, and equally important, our experiments show that opportunities to remove redundant stores are quite limited [29].

To encode the constraints of load motion into the equations for PRE, we must consider both the load address and the states of the memory objects on the load's M-use list. Specifically, a load cannot be moved past the instructions of its addressing computation; in addition, other memory instructions might change the states of the memory objects the load may read from, so a load cannot be moved past any memory instruction that assigns a new value number (i.e., defines a new state) to a memory object on the load's M-use list.
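To see the equations in Figure 4 at work, trace them on the par_cond shape from Figure 2 (a worked illustration; the block names are ours). Let $b_0$ hold the test, $b_1$ the then-block (line 17), and $b_3$ the join (line 19), with edges $b_0{\to}b_1$, $b_0{\to}b_3$, $b_1{\to}b_3$; let $e$ be the load of p->y, so $\mathrm{ANTLOC}_{b_1} = \mathrm{ANTLOC}_{b_3} = \{e\}$ and $b_0$ neither computes nor alters $e$. The equations give

$$\mathrm{AVOUT}_{b_0} = \emptyset,\quad \mathrm{ANTIN}_{b_1} = \mathrm{ANTIN}_{b_3} = \{e\},\quad \mathrm{EARLIEST}_{b_0\to b_1} = \mathrm{EARLIEST}_{b_0\to b_3} = \{e\},$$
$$\mathrm{LATERIN}_{b_1} = \{e\},\quad \mathrm{LATER}_{b_1\to b_3} = \emptyset,\quad \mathrm{LATERIN}_{b_3} = \emptyset,$$

hence $\mathrm{INSERT}_{b_0\to b_3} = \{e\}$, $\mathrm{INSERT}_{b_0\to b_1} = \emptyset$, $\mathrm{DELETE}_{b_3} = \{e\}$, and $\mathrm{DELETE}_{b_1} = \emptyset$: a copy of the load lands on the edge taken when the test fails, and the load at the join is deleted, exactly the transformation described above for line 19.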

Using the memory value-numbering results, we can build a value dependence graph that encodes the dependence relationships among the value numbers of the results of scalar instructions and the value numbers on the M-def and M-use lists of memory instructions. In particular: 1) each scalar instruction becomes the def node that defines the value number of its result, and we add a dependence edge from the def node of each source operand to the instruction's node; 2) each store becomes the def node for any objects on its M-def list that are assigned new value numbers; 3) each load becomes the def node for the load result, and we add edges from the def nodes for the load address and for the value numbers of the memory objects on the load's M-use list. Intuitively, the value numbers of scalar operands and M-list objects capture the def-use relations among scalar and memory instructions. Stores can be thought of as def points for the values stored into the memory objects on their M-def lists; the value dependence edges between stores and loads that share common memory objects represent the flow dependences between store and load instructions. Thus, using the value numbers assigned by the memory redundancy detection algorithm, we can build a value dependence graph that represents the dependence relations for both scalar and memory instructions.

Once the value dependence graph has been built, the compiler can build the local set altered_i for each block. The altered_i set contains the instructions whose source operands would change value due to the execution of block i. If e is in altered_i, then code motion must not move e backward beyond i, as doing so would violate the dependence rule. We set altered_i to include all instructions in block i other than scalar and load instructions, which prevents the algorithm from moving those instructions. Furthermore, any instruction that depends transitively on these instructions is also included in altered_i; this can be computed by taking the transitive closure in the value dependence graph with respect to the def nodes of the instructions in i.

Another local set, ANTLOC_i, contains the candidate instructions that PRE may remove. In traditional applications of PRE, ANTLOC_i contains only scalar instructions. Using the value numbers for M-lists, we can model memory dependences and put loads into ANTLOC_i as well: we set ANTLOC_i to contain the scalar and load instructions in block i that are not in altered_i, in other words, the scalars and loads whose motion is not restricted by i. The last local set in the PRE framework is AVLOC_i, which contains all scalars and loads in i. By treating memory instructions in this way, we force the data-flow system to consider them. When the data-flow system is solved, the INSERT and DELETE sets contain the scalar instructions and loads that are partially redundant and can be removed. In the example in Figure 2, the partially redundant loads of p->x and p->y in line 19 are in the DELETE set, and copies of these loads are in the INSERT set of the edge taken when the test condition is false. Similarly, the loads in the loop body in line 25 are removed, and copies of these loads are inserted in the loop header. In summary, memory PRE successfully removes the partial memory redundancies in Figure 2.
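A compact sketch of the altered_i computation just described (hypothetical data structures; the actual pass runs over Iloc and its value dependence graph): seed the set with every instruction in block i that is neither a scalar operation nor a load, then propagate along value-dependence edges.

    #include <string.h>

    #define MAXINSN 4096
    #define MAXDEP  8

    typedef struct {
        int block;                /* basic block containing the instruction    */
        int movable;              /* 1 for scalar ops and loads, 0 otherwise   */
        int nuse, use[MAXDEP];    /* value-dependence successors: instructions */
                                  /* that consume this instruction's value     */
    } Insn;

    /* Mark j and everything that transitively depends on j. */
    static void mark_altered(const Insn *insn, int j, char *altered)
    {
        if (altered[j]) return;
        altered[j] = 1;
        for (int k = 0; k < insn[j].nuse; k++)
            mark_altered(insn, insn[j].use[k], altered);
    }

    /* altered[j] = 1 iff instruction j must not move backward past block i. */
    void compute_altered(const Insn *insn, int ninsn, int i, char altered[MAXINSN])
    {
        memset(altered, 0, MAXINSN);
        for (int j = 0; j < ninsn; j++)
            if (insn[j].block == i && !insn[j].movable)
                mark_altered(insn, j, altered);   /* transitive closure */
    }

ANTLOC_i then collects the scalars and loads of block i that this pass left unmarked.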

4 Experimental Results

Fig. 5. ILOC Execution Model: the front end c2i translates C source into Iloc; analysis and transformation passes run on the Iloc code; the back end i2ss generates a SimpleScalar executable.

Figure 5 shows the execution model for our compiler. The C front end (c2i) converts the program into Iloc. The compiler applies multiple analysis and optimization passes to the Iloc code. Finally, the back end (i2ss) generates SimpleScalar executables.

To evaluate the energy-efficiency improvement of memory redundancy elimination, we implemented memory CSE and memory PRE as Iloc passes, referred to as M-CSE and M-PRE. As the memory versions of CSE and PRE subsume scalar CSE and PRE, we also implemented the scalar versions, referred to as S-CSE and S-PRE, to isolate the effects of memory redundancy removal. We use the same benchmarks as in Table 2. The benchmarks are first translated into Iloc, and multiple passes of traditional compiler optimizations are run on the Iloc code, including constant propagation, dead code elimination, copy coalescing and control-flow simplification. We then run the whole-program pointer analysis to annotate the Iloc code with M-lists. We run the S-CSE, M-CSE, S-PRE and M-PRE passes separately on the Iloc code, followed by the SimpleScalar back end i2ss, to create the SimpleScalar executables. Finally, we run the generated executables on the out-of-order superscalar simulator with the Wattch tool and collect run-time performance and energy statistics.

Table 5 (summary). Dynamic load count under S-CSE, M-CSE, S-PRE and M-PRE. Geometric means of the load-count ratios M-CSE/S-CSE and M-PRE/S-PRE: MediaBench 91.9% and 84.2%; SPEC2000 75.6% and 73.0%; overall 83.4% and 78.4%.

Dynamic Load Count and Cycle Count. Table 5 shows the dynamic load counts for the benchmarks; the ratio columns give the load-count ratio between the memory and scalar versions of CSE and PRE.

Fig. 6. Normalized execution cycles (M-CSE, S-PRE and M-PRE relative to S-CSE; y-axis 80%-110%; S-PRE on 300.twolf is an outlier at 145%).

For the majority of the benchmark applications, both M-CSE and M-PRE significantly reduce the dynamic load count, with geometric-mean reductions of 16.6% for M-CSE and 21.6% for M-PRE. Because M-PRE also removes memory redundancies from conditionals and loops, it removes more memory redundancies than M-CSE. Furthermore, the data show that M-CSE and M-PRE have more opportunities in the SPEC2000 programs than in MediaBench: the dynamic load ratios between M-PRE and S-PRE are 73% for SPEC2000 and 84.2% for MediaBench, and the ratios between M-CSE and S-CSE are 75.6% for SPEC2000 and 91.9% for MediaBench. The cause of this difference is that the SPEC2000 applications are generally larger and more complex than those in MediaBench, so more data references are compiled into memory instructions, which provides more opportunities for memory redundancy elimination.

Figure 6 shows the impact of memory redundancy elimination on application execution cycles. As expected, M-PRE achieves the best results, as it is the most powerful redundancy elimination.² The reduction in execution cycle count leads to energy savings in the clocking network. Figure 7 shows the normalized clocking-network energy consumption of M-CSE, S-PRE and M-PRE with S-CSE as the base; the curves are mostly identical to those in Figure 6. As with the execution-cycle results, 300.twolf, 175.vpr, 256.bzip2 and gsm show the largest energy savings with M-PRE.

Cache Energy. As we discussed in Section 2, memory redundancy elimination reduces cache accesses in both the L1 I-Cache and the D-Cache, so it saves energy in the cache structures. Figure 8 shows the normalized L1 I-Cache energy consumption for M-CSE, S-PRE and M-PRE with S-CSE as the base; Figure 9 shows the normalized L1 D-Cache energy for the four versions. In Figures 8 and 9, the curves for M-PRE are the lowest, as M-PRE generally incurs the fewest I-Cache and D-Cache accesses, thus achieving the largest energy savings.

² The large execution cycle count for S-PRE on 300.twolf is due to abnormally high L1 I-Cache misses. For other cache configurations, the S-PRE cycle count is generally comparable to that of S-CSE.

Fig. 7. Normalized clocking network energy consumption (M-CSE, S-PRE and M-PRE relative to S-CSE; y-axis 80%-110%).

Fig. 8. Normalized L1 I-Cache energy consumption (M-CSE, S-PRE and M-PRE relative to S-CSE; y-axis 80%-105%).

Fig. 9. Normalized L1 D-Cache energy consumption (M-CSE, S-PRE and M-PRE relative to S-CSE; y-axis 70%-130%).

Fig. 10. Normalized total energy consumption (M-CSE, S-PRE and M-PRE relative to S-CSE; y-axis 80%-110%).

The energy-consumption diagrams for the I-Cache and D-Cache also show that memory redundancy elimination is more effective at reducing D-Cache energy: both M-CSE and M-PRE achieve more than 10% energy savings in the D-Cache for pegwit, 164.gzip, 256.bzip2, 175.vpr and 300.twolf, while the savings in the I-Cache are relatively smaller.

Total Energy and Energy-Delay Product. Figure 10 shows the normalized total application energy consumption. Among the redundancy elimination techniques, M-PRE produces the best energy efficiency. A useful metric that captures both application performance and energy efficiency is the energy-delay product [14]: the smaller the energy-delay product, the better the application's combined energy efficiency and performance. Figure 11 shows the normalized energy-delay product with S-CSE as the base. As memory redundancy elimination reduces both execution cycles and total energy consumption, the energy-delay products of M-CSE and M-PRE are smaller. In contrast to techniques such as dynamic voltage scaling, which trade application execution speed for lower energy consumption, memory redundancy elimination improves both application performance and energy efficiency, making it a desirable compiler transformation that saves energy without any loss in performance.
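The two headline results compose multiplicatively (a back-of-the-envelope check, with the cycle figure read approximately from Figure 6): writing the metric as $EDP = E \cdot t$, a best case that pairs the 15% energy reduction with roughly a 10% cycle reduction gives

$$\frac{EDP'}{EDP} \;=\; \frac{E'}{E} \cdot \frac{t'}{t} \;\approx\; 0.85 \times 0.90 \;\approx\; 0.76,$$

consistent with the reported reduction of up to 24% in the energy-delay product.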

Fig. 11. Normalized energy-delay product (M-CSE, S-PRE and M-PRE relative to S-CSE; y-axis 60%-120%, with one S-PRE outlier at 153%).

Application Energy Breakdown. We also studied how the micro-architecture components contribute to total application energy. Figures 12 and 13 show the component energy breakdown for 256.bzip2 and 175.vpr, the two applications with the largest energy-efficiency improvement. The major energy savings for these two applications come from the clocking network and the top-level instruction and data caches. In 256.bzip2, the clocking-network energy savings for M-CSE and M-PRE are 12% and 15% respectively, the L1 I-Cache savings are 8% and 10%, and the L1 D-Cache savings are 23% and 24%; the overall energy savings are 12% for M-CSE and 15% for M-PRE. Similarly, in 175.vpr, the clocking-network energy savings for M-CSE and M-PRE are 13% and 15%, the L1 I-Cache savings are 10% and 12%, and the L1 D-Cache savings are 25% and 26%; the overall energy savings are 14% for M-CSE and 15% for M-PRE.

Fig. 12. Energy breakdown of 256.bzip2 under S-CSE, M-CSE, S-PRE and M-PRE (mJ), stacked by component: clock, L1 icache, L1 dcache, L2 cache, bpred, rename, window, lsq, alu, bus, regfile.

Fig. 13. Energy breakdown of 175.vpr under S-CSE, M-CSE, S-PRE and M-PRE (mJ), stacked by the same components as Figure 12.

5 Related Work

Recently, power and energy issues have become critical design constraints both for high-end processors and for battery-powered embedded digital devices. Researchers have developed many hardware-based techniques to reduce power and energy consumption in these systems. Dynamic voltage scaling (DVS), which dynamically varies the processor clock frequency and voltage to save energy, is described in [25, 28, 22]; the work in [13, 1, 17] discusses ways to reduce cache energy consumption. All of these, however, are circuit and architecture-level techniques; relatively little attention has been paid to application-level energy-saving techniques. In [16], Kandemir et al. studied the energy effects of loop-level compiler optimizations using array-based scientific codes. In contrast to their work, we first profiled total application energy consumption to identify the top energy-consuming components and then evaluated one compiler technique, memory redundancy elimination, which can significantly reduce energy consumption in those components; furthermore, our technique targets more complicated general-purpose and multimedia applications. Researchers have also been studying compile-time management of hardware-based energy-saving mechanisms such as DVS. Hsu et al. described a compiler algorithm to identify program regions where the CPU can be slowed down with negligible performance loss [15], and Kremer summarized compiler-based energy management methods in [19].

These methods are orthogonal to the techniques in this paper. Both scalar [9, 24, 18, 5] and memory [23, 21, 26, 4, 8] redundancy detection and removal have been studied in the literature. The redundancy detection algorithm used in our work is described in [10]; compared to other methods, it unifies the detection of scalar and memory redundancies and is able to find more redundancies. Most of the previous work concerns application run-time speed, while our work targets the benefits of energy savings, though the results show that performance is also improved.

6 Conclusion

Most of the recent work on low-power and low-energy systems focuses on circuit and architecture-level techniques. However, more energy savings are possible by optimizing the behavior of the applications themselves. We profiled the energy consumption of a suite of benchmarks; the energy statistics identify the clocking network and the first-level caches as the top energy-consuming components. With this insight, we investigated the energy savings of a particular compiler technique, memory redundancy elimination. We presented two redundancy elimination frameworks and evaluated their energy improvements. The results indicate that memory redundancy elimination reduces both execution cycles and the number of top-level cache accesses, thus saving energy in the clocking network and the instruction and data caches. For our benchmarks, memory redundancy elimination achieves up to a 15% reduction in total energy consumption and up to a 24% reduction in the energy-delay product.

7 Acknowledgments

We would like to thank Tim Harvey and the anonymous reviewers, whose comments greatly helped improve the presentation of the paper.

References

1. D. H. Albonesi. Selective cache ways: On-demand cache resource allocation. In International Symposium on Microarchitecture, 1999.
2. R. Allen and K. Kennedy. Optimizing Compilers for Modern Architectures: A Dependence-based Approach. Morgan Kaufmann, 2001.
3. L. O. Andersen. Program Analysis and Specialization for the C Programming Language. PhD thesis, University of Copenhagen, 1994.
4. R. Bodik, R. Gupta, and M. L. Soffa. Load-reuse analysis: Design and evaluation. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1999.
5. P. Briggs, K. D. Cooper, and L. T. Simpson. Value numbering. Software Practice and Experience, June 1997.
6. D. Brooks, V. Tiwari, and M. Martonosi. Wattch: A framework for architectural-level power analysis and optimizations. In International Symposium on Computer Architecture, pages 83-94, 2000.
7. D. Burger and T. Austin. The SimpleScalar toolset, version 2.0. Computer Architecture News, pages 13-25, June 1997.

8. D. Callahan, S. Carr, and K. Kennedy. Improving register allocation for subscripted variables. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1990.
9. J. Cocke. Global common subexpression elimination. In Symposium on Compiler Construction, pages 20-24, July 1970.
10. K. D. Cooper and L. Xu. An efficient static analysis algorithm to detect redundant memory operations. In ACM Workshop on Memory Systems Performance (MSP), 2002.
11. K.-H. Drechsler and M. P. Stadel. A variation of Knoop, Rüthing, and Steffen's lazy code motion. SIGPLAN Notices, pages 29-38, May 1993.
12. K. Gargi. A sparse algorithm for predicated global value numbering. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2002.
13. K. Ghose and M. Kamble. Reducing power in superscalar processor caches using subbanking, multiple line buffers and bit-line segmentation. In International Symposium on Low Power Electronics and Design, pages 70-75, 1999.
14. R. Gonzalez and M. Horowitz. Energy dissipation in general purpose microprocessors. IEEE Journal of Solid-State Circuits, Sept. 1996.
15. C.-H. Hsu and U. Kremer. The design, implementation, and evaluation of a compiler algorithm for CPU energy reduction. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 2003.
16. M. T. Kandemir, N. Vijaykrishnan, et al. Influence of compiler optimizations on system power. In Design Automation Conference, 2000.
17. J. Kin et al. The filter cache: An energy efficient memory structure. In International Symposium on Microarchitecture, 1997.
18. J. Knoop, O. Rüthing, and B. Steffen. Lazy code motion. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1992.
19. U. Kremer. Compilers for power and energy management. PLDI tutorial.
20. C. Lee, M. Potkonjak, and W. H. Mangione-Smith. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In International Symposium on Microarchitecture, 1997.
21. R. Lo, F. Chow, R. Kennedy, S.-M. Liu, and P. Tu. Register promotion by sparse partial redundancy elimination of loads and stores. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1998.
22. J. R. Lorch and A. J. Smith. Improving dynamic voltage scaling algorithms with PACE. In SIGMETRICS/Performance, pages 50-61, 2001.
23. J. Lu and K. Cooper. Register promotion in C programs. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1997.
24. E. Morel and C. Renvoise. Global optimization by suppression of partial redundancies. Communications of the ACM, 22(2):96-103, 1979.
25. T. Pering, T. Burd, and R. Brodersen. The simulation and evaluation of dynamic voltage scaling algorithms. In International Symposium on Low Power Electronics and Design, pages 76-81, 1998.
26. A. Sastry and R. D. Ju. A new algorithm for scalar register promotion based on SSA form. In ACM SIGPLAN Conference on Programming Language Design and Implementation (PLDI), 1998.
27. T. Simpson. Value-Driven Redundancy Elimination. PhD thesis, Rice University, 1996.
28. T. Simunic, L. Benini, A. Acquaviva, P. Glynn, and G. D. Micheli. Dynamic voltage scaling for portable systems. In Design Automation Conference, 2001.
29. L. Xu. Program Redundancy Analysis and Optimization to Improve Memory Performance. PhD thesis, Rice University.
30. W. Ye, N. Vijaykrishnan, M. T. Kandemir, and M. J. Irwin. The design and use of SimplePower: A cycle-accurate energy estimation tool. In Design Automation Conference, 2000.


CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science CS 2002 03 A Large, Fast Instruction Window for Tolerating Cache Misses 1 Tong Li Jinson Koppanalil Alvin R. Lebeck Jaidev Patwardhan Eric Rotenberg Department of Computer Science Duke University Durham,

More information

Lazy Code Motion. Comp 512 Spring 2011

Lazy Code Motion. Comp 512 Spring 2011 Comp 512 Spring 2011 Lazy Code Motion Lazy Code Motion, J. Knoop, O. Ruthing, & B. Steffen, in Proceedings of the ACM SIGPLAN 92 Conference on Programming Language Design and Implementation, June 1992.

More information

MCD: A Multiple Clock Domain Microarchitecture

MCD: A Multiple Clock Domain Microarchitecture MCD: A Multiple Clock Domain Microarchitecture Dave Albonesi in collaboration with Greg Semeraro Grigoris Magklis Rajeev Balasubramonian Steve Dropsho Sandhya Dwarkadas Michael Scott Caveats We started

More information

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections )

Lecture 9: More ILP. Today: limits of ILP, case studies, boosting ILP (Sections ) Lecture 9: More ILP Today: limits of ILP, case studies, boosting ILP (Sections 3.8-3.14) 1 ILP Limits The perfect processor: Infinite registers (no WAW or WAR hazards) Perfect branch direction and target

More information

Performance Implications of Single Thread Migration on a Chip Multi-Core

Performance Implications of Single Thread Migration on a Chip Multi-Core Performance Implications of Single Thread Migration on a Chip Multi-Core Theofanis Constantinou, Yiannakis Sazeides, Pierre Michaud +, Damien Fetis +, and Andre Seznec + Department of Computer Science

More information

Microarchitecture-Level Power Management

Microarchitecture-Level Power Management 230 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 10, NO. 3, JUNE 2002 Microarchitecture-Level Power Management Anoop Iyer, Member, IEEE, and Diana Marculescu, Member, IEEE Abstract

More information

PowerPC 740 and 750

PowerPC 740 and 750 368 floating-point registers. A reorder buffer with 16 elements is used as well to support speculative execution. The register file has 12 ports. Although instructions can be executed out-of-order, in-order

More information

A Cache Scheme Based on LRU-Like Algorithm

A Cache Scheme Based on LRU-Like Algorithm Proceedings of the 2010 IEEE International Conference on Information and Automation June 20-23, Harbin, China A Cache Scheme Based on LRU-Like Algorithm Dongxing Bao College of Electronic Engineering Heilongjiang

More information

CS 406/534 Compiler Construction Putting It All Together

CS 406/534 Compiler Construction Putting It All Together CS 406/534 Compiler Construction Putting It All Together Prof. Li Xu Dept. of Computer Science UMass Lowell Fall 2004 Part of the course lecture notes are based on Prof. Keith Cooper, Prof. Ken Kennedy

More information

Multiple Instruction Issue. Superscalars

Multiple Instruction Issue. Superscalars Multiple Instruction Issue Multiple instructions issued each cycle better performance increase instruction throughput decrease in CPI (below 1) greater hardware complexity, potentially longer wire lengths

More information

Low-power Architecture. By: Jonathan Herbst Scott Duntley

Low-power Architecture. By: Jonathan Herbst Scott Duntley Low-power Architecture By: Jonathan Herbst Scott Duntley Why low power? Has become necessary with new-age demands: o Increasing design complexity o Demands of and for portable equipment Communication Media

More information

Multi-Level Cache Hierarchy Evaluation for Programmable Media Processors. Overview

Multi-Level Cache Hierarchy Evaluation for Programmable Media Processors. Overview Multi-Level Cache Hierarchy Evaluation for Programmable Media Processors Jason Fritts Assistant Professor Department of Computer Science Co-Author: Prof. Wayne Wolf Overview Why Programmable Media Processors?

More information

CSC D70: Compiler Optimization LICM: Loop Invariant Code Motion

CSC D70: Compiler Optimization LICM: Loop Invariant Code Motion CSC D70: Compiler Optimization LICM: Loop Invariant Code Motion Prof. Gennady Pekhimenko University of Toronto Winter 2018 The content of this lecture is adapted from the lectures of Todd Mowry and Phillip

More information

Dependability, Power, and Performance Trade-off on a Multicore Processor

Dependability, Power, and Performance Trade-off on a Multicore Processor Dependability, Power, and Performance Trade-off on a Multi Processor Toshinori Sato System LSI Research Center Kyushu University toshinori.sato@computer.org Abstract - As deep submicron technologies are

More information

A Cost Effective Spatial Redundancy with Data-Path Partitioning. Shigeharu Matsusaka and Koji Inoue Fukuoka University Kyushu University/PREST

A Cost Effective Spatial Redundancy with Data-Path Partitioning. Shigeharu Matsusaka and Koji Inoue Fukuoka University Kyushu University/PREST A Cost Effective Spatial Redundancy with Data-Path Partitioning Shigeharu Matsusaka and Koji Inoue Fukuoka University Kyushu University/PREST 1 Outline Introduction Data-path Partitioning for a dependable

More information

Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads

Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads Preliminary Evaluation of the Load Data Re-Computation Method for Delinquent Loads Hideki Miwa, Yasuhiro Dougo, Victor M. Goulart Ferreira, Koji Inoue, and Kazuaki Murakami Dept. of Informatics, Kyushu

More information

Power Protocol: Reducing Power Dissipation on Off-Chip Data Buses

Power Protocol: Reducing Power Dissipation on Off-Chip Data Buses Power Protocol: Reducing Power Dissipation on Off-Chip Data Buses K. Basu, A. Choudhary, J. Pisharath ECE Department Northwestern University Evanston, IL 60208, USA fkohinoor,choudhar,jayg@ece.nwu.edu

More information

Improving energy-efficiency in high-performance. processors by bypassing trivial instructions.

Improving energy-efficiency in high-performance. processors by bypassing trivial instructions. Improving energy-efficiency in high-performance processors by bypassing trivial instructions E. Atoofian and A. Baniasadi Abstract: Energy-efficiency benefits of bypassing trivial computations in high-performance

More information

Threshold-Based Markov Prefetchers

Threshold-Based Markov Prefetchers Threshold-Based Markov Prefetchers Carlos Marchani Tamer Mohamed Lerzan Celikkanat George AbiNader Rice University, Department of Electrical and Computer Engineering ELEC 525, Spring 26 Abstract In this

More information

SF-LRU Cache Replacement Algorithm

SF-LRU Cache Replacement Algorithm SF-LRU Cache Replacement Algorithm Jaafar Alghazo, Adil Akaaboune, Nazeih Botros Southern Illinois University at Carbondale Department of Electrical and Computer Engineering Carbondale, IL 6291 alghazo@siu.edu,

More information

Finding Optimal L1 Cache Configuration for Embedded Systems

Finding Optimal L1 Cache Configuration for Embedded Systems Finding Optimal L Cache Configuration for Embedded Systems Andhi Janapsatya, Aleksandar Ignjatović, Sri Parameswaran School of Computer Science and Engineering, The University of New South Wales Sydney,

More information

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1

CSE 820 Graduate Computer Architecture. week 6 Instruction Level Parallelism. Review from Last Time #1 CSE 820 Graduate Computer Architecture week 6 Instruction Level Parallelism Based on slides by David Patterson Review from Last Time #1 Leverage Implicit Parallelism for Performance: Instruction Level

More information

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building A Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science University of Central Florida New Challenges in Billion-Transistor Processor Era

More information

Drowsy Instruction Caches

Drowsy Instruction Caches Drowsy Instruction Caches Leakage Power Reduction using Dynamic Voltage Scaling and Cache Sub-bank Prediction Nam Sung Kim, Krisztián Flautner, David Blaauw, Trevor Mudge {kimns, blaauw, tnm}@eecs.umich.edu

More information

Simultaneous Multithreading (SMT)

Simultaneous Multithreading (SMT) Simultaneous Multithreading (SMT) An evolutionary processor architecture originally introduced in 1995 by Dean Tullsen at the University of Washington that aims at reducing resource waste in wide issue

More information

A Variation of Knoop, ROthing, and Steffen's Lazy Code Motion

A Variation of Knoop, ROthing, and Steffen's Lazy Code Motion A Variation of Knoop, ROthing, and Steffen's Lazy Code Motion by Karl-Heinz Drechsler and Manfred P. Stadel Siemens Nixdorf Informationssysteme AG, STM SD 2 Otto Hahn Ring 6, 8000 Mtinchen 83, Germany

More information

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors Murali Jayapala 1, Francisco Barat 1, Pieter Op de Beeck 1, Francky Catthoor 2, Geert Deconinck 1 and Henk Corporaal

More information

Hardware Speculation Support

Hardware Speculation Support Hardware Speculation Support Conditional instructions Most common form is conditional move BNEZ R1, L ;if MOV R2, R3 ;then CMOVZ R2,R3, R1 L: ;else Other variants conditional loads and stores nullification

More information

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES Shashikiran H. Tadas & Chaitali Chakrabarti Department of Electrical Engineering Arizona State University Tempe, AZ, 85287. tadas@asu.edu, chaitali@asu.edu

More information

Using Statistical Simulation for Studying Compiler-Microarchitecture Interactions

Using Statistical Simulation for Studying Compiler-Microarchitecture Interactions 1 Using Statistical Simulation for Studying Compiler-Microarchitecture Interactions Lieven Eeckhout John Cavazos ELIS Department, Ghent University, Belgium School of Informatics, University of Edinburgh,

More information

Compiler Passes. Optimization. The Role of the Optimizer. Optimizations. The Optimizer (or Middle End) Traditional Three-pass Compiler

Compiler Passes. Optimization. The Role of the Optimizer. Optimizations. The Optimizer (or Middle End) Traditional Three-pass Compiler Compiler Passes Analysis of input program (front-end) character stream Lexical Analysis Synthesis of output program (back-end) Intermediate Code Generation Optimization Before and after generating machine

More information

EECS 322 Computer Architecture Superpipline and the Cache

EECS 322 Computer Architecture Superpipline and the Cache EECS 322 Computer Architecture Superpipline and the Cache Instructor: Francis G. Wolff wolff@eecs.cwru.edu Case Western Reserve University This presentation uses powerpoint animation: please viewshow Summary:

More information

Partial Redundancy Elimination and SSA Form

Partial Redundancy Elimination and SSA Form Topic 5a Partial Redundancy Elimination and SSA Form 2008-03-26 \course\cpeg421-08s\topic-5a.ppt\course\cpeg421-07s\topic- 7b.ppt 1 References Robert Kennedy, Sun Chan, Shin-ming Liu, Raymond Lo, Pend

More information

Grouped Prefetching: Maximizing Resource Utilization

Grouped Prefetching: Maximizing Resource Utilization Grouped Prefetching: Maximizing Resource Utilization Weston Harper, Justin Mintzer, Adrian Valenzuela Rice University ELEC525 Final Report Abstract Prefetching is a common method to prevent memory stalls,

More information

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research

Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Reducing the SPEC2006 Benchmark Suite for Simulation Based Computer Architecture Research Joel Hestness jthestness@uwalumni.com Lenni Kuff lskuff@uwalumni.com Computer Science Department University of

More information

Reducing Instruction Fetch Energy in Multi-Issue Processors

Reducing Instruction Fetch Energy in Multi-Issue Processors Reducing Instruction Fetch Energy in Multi-Issue Processors PETER GAVIN, DAVID WHALLEY, and MAGNUS SJÄLANDER, Florida State University The need to minimize power while maximizing performance has led to

More information

Augmenting Modern Superscalar Architectures with Configurable Extended Instructions

Augmenting Modern Superscalar Architectures with Configurable Extended Instructions Augmenting Modern Superscalar Architectures with Configurable Extended Instructions Xianfeng Zhou and Margaret Martonosi Dept. of Electrical Engineering Princeton University {xzhou, martonosi}@ee.princeton.edu

More information

Energy-Effective Instruction Fetch Unit for Wide Issue Processors

Energy-Effective Instruction Fetch Unit for Wide Issue Processors Energy-Effective Instruction Fetch Unit for Wide Issue Processors Juan L. Aragón 1 and Alexander V. Veidenbaum 2 1 Dept. Ingen. y Tecnología de Computadores, Universidad de Murcia, 30071 Murcia, Spain

More information

Scheduling Reusable Instructions for Power Reduction

Scheduling Reusable Instructions for Power Reduction Scheduling Reusable Instructions for Power Reduction J. S. Hu, N. Vijaykrishnan, S. Kim, M. Kandemir, and M. J. Irwin Microsystems Design Lab The Pennsylvania State Univeity Univeity Park, PA 1682, USA

More information

Itanium 2 Processor Microarchitecture Overview

Itanium 2 Processor Microarchitecture Overview Itanium 2 Processor Microarchitecture Overview Don Soltis, Mark Gibson Cameron McNairy, August 2002 Block Diagram F 16KB L1 I-cache Instr 2 Instr 1 Instr 0 M/A M/A M/A M/A I/A Template I/A B B 2 FMACs

More information

Lazy Code Motion. Jens Knoop FernUniversität Hagen. Oliver Rüthing University of Dortmund. Bernhard Steffen University of Dortmund

Lazy Code Motion. Jens Knoop FernUniversität Hagen. Oliver Rüthing University of Dortmund. Bernhard Steffen University of Dortmund RETROSPECTIVE: Lazy Code Motion Jens Knoop FernUniversität Hagen Jens.Knoop@fernuni-hagen.de Oliver Rüthing University of Dortmund Oliver.Ruething@udo.edu Bernhard Steffen University of Dortmund Bernhard.Steffen@udo.edu

More information

AccuPower: An Accurate Power Estimation Tool for Superscalar Microprocessors*

AccuPower: An Accurate Power Estimation Tool for Superscalar Microprocessors* Appears in the Proceedings of Design, Automation and Test in Europe Conference, March 2002 AccuPower: An Accurate Power Estimation Tool for Superscalar Microprocessors* Dmitry Ponomarev, Gurhan Kucuk and

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Getting CPI under 1: Outline

Getting CPI under 1: Outline CMSC 411 Computer Systems Architecture Lecture 12 Instruction Level Parallelism 5 (Improving CPI) Getting CPI under 1: Outline More ILP VLIW branch target buffer return address predictor superscalar more

More information

High Performance and Energy Efficient Serial Prefetch Architecture

High Performance and Energy Efficient Serial Prefetch Architecture In Proceedings of the 4th International Symposium on High Performance Computing (ISHPC), May 2002, (c) Springer-Verlag. High Performance and Energy Efficient Serial Prefetch Architecture Glenn Reinman

More information

Processing Unit CS206T

Processing Unit CS206T Processing Unit CS206T Microprocessors The density of elements on processor chips continued to rise More and more elements were placed on each chip so that fewer and fewer chips were needed to construct

More information

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor

Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Accuracy Enhancement by Selective Use of Branch History in Embedded Processor Jong Wook Kwak 1, Seong Tae Jhang 2, and Chu Shik Jhon 1 1 Department of Electrical Engineering and Computer Science, Seoul

More information

COMPUTER ARCHITECTURE SIMULATOR

COMPUTER ARCHITECTURE SIMULATOR International Journal of Electrical and Electronics Engineering Research (IJEEER) ISSN 2250-155X Vol. 3, Issue 1, Mar 2013, 297-302 TJPRC Pvt. Ltd. COMPUTER ARCHITECTURE SIMULATOR P. ANURADHA 1, HEMALATHA

More information