Impact of ILP-improving Code Transformations on Loop Buffer Energy

Impact of ILP-improving Code Transformations on Loop Buffer Energy

Tom Vander Aa, Murali Jayapala, Henk Corporaal, Francky Catthoor, Geert Deconinck
IMEC, Kapeldreef 75, B-3001 Leuven, Belgium
ESAT, KULeuven, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
TU Eindhoven, Den Dolech 2, 5612 AZ Eindhoven, Netherlands

Abstract — For multimedia applications, loop buffering is an efficient mechanism to reduce the power in the instruction memory of embedded processors. In particular, software controlled clustered loop buffers are very energy efficient. However, the code transformations VLIW compilers need to reach higher ILP may have a large negative influence on the energy consumed in the instruction memories (including the loop buffer). This paper shows that such code transformations can also have a positive impact on the instruction memory energy of processors, if the transformations are steered taking into account the presence of the software controlled clustered loop buffer. We propose guidelines to steer the code transformations and show that these guidelines should be applied differently in a system with a clustered loop buffer than in a system with only a normal instruction cache. Results are presented for a mix of three important ILP code transformations: software pipelining, if-conversion and loop unrolling. Results show an energy reduction between 5% and 25% and a delay reduction between 6% and 75% for an MPEG-2 Video Encoder benchmark.

I. INTRODUCTION AND MOTIVATION

Low energy is one of the key design goals of current embedded systems for multimedia applications. Typically the cores of such systems are programmable processors. VLIW ASIPs in particular are known to be very effective in achieving high performance with reasonably low power for our domain of interest [6]. However, power analysis of such processors indicates that a significant amount of power is consumed in the on-chip (instruction) memory hierarchy: 30% of the total power according to [2].
Our experimental analysis shows that, if the appropriate data memory hierarchy mapping techniques are applied first [4], and if all methods to reduce power in the datapath are applied [3], this number may go up to 50%. Hence, reducing this part of the budget is crucial in reducing the overall power consumption of the system. Loop buffering is an effective scheme to reduce energy consumption in the instruction memory hierarchy [10]. In any typical multimedia application, a significant amount of execution time is spent in small program segments. The energy of fetching these segments can be reduced by storing them in a small loop buffer instead of the big instruction cache (IL1). However, as more instruction level parallelism (ILP) is extracted from the application, wider datapaths and wider loop buffers are needed. In the datapath, the register files and the interconnection network to them are the main bottleneck of wide VLIWs. This issue is resolved in recent papers using datapath clustering (see e.g. [15]). Datapath clustering splits up the register file into several partitions and restricts access to each partition to certain functional units (FUs). This reduces the number of ports per register file and the number of wires in the interconnection network. To obtain low power, clustering also needs to be applied to the instruction memory hierarchy. For this a clustered loop buffer architecture [7] has been proposed. The criteria for this type of clustering are, however, very different from those for datapath clustering. In a clustered datapath, the main compiler problem to be solved is how to minimize the inter-cluster communication overhead. Since the instruction streams of the different functional units do not interact, this is not a problem for instruction clusters. To obtain high ILP, ILP-improving code transformations are needed. Code transformations such as loop unrolling, software pipelining and if-conversion form part of every modern compiler. The contributions of this paper in that area are two-fold.
Firstly, we will show that such code transformations can have a large positive impact on the instruction memory energy of processors that have a software controlled clustered loop buffer. We also show that for some transformations, such as loop unrolling, a trade-off exists between performance and energy. Secondly, to exploit this trade-off we propose a set of guidelines to steer the code transformations to optimally exploit the presence of the loop buffer. The rest of this paper is organized as follows. In Section II the software controlled clustered loop buffer organization and the compiler are described. In Section III the different ILP code transformations are discussed, and their effect on performance and energy is studied for several benchmarks. Based on the observations of Section III, guidelines on how to steer these transformations are proposed (Section IV). An account of the related work is presented in Section V. Conclusions are drawn in Section VI.

II. CLUSTERED LOOP BUFFER ORGANIZATION AND COMPILATION

Figure 1(a) illustrates the essentials of the clustered loop buffer under consideration. A more detailed analysis of the

Fig. 1. Clustered loop buffer architecture; (a) overview: functional units are grouped into instruction clusters, a cluster fetches instructions from a loop buffer partition; (b) local controller detail.

architecture can be found in [7]. Instructions are fed to the processing unit either from the level-1 instruction cache (IL1) or from the loop buffer. Initially the loop buffer is disabled and the program executes via IL1. When a loop buffer control instruction is encountered, marking the beginning of the loop that has to be mapped to the loop buffer, the loop buffer will be turned on. The form of this special instruction is lbon <startaddress>, <endaddress> (lbon means loop buffer on). Startaddress is the address of the first instruction of the loop and endaddress that of the last one. These values are stored in the local controller (LC) of each cluster and will be used during the execution of the loop. When the loop buffer is used, a local controller (see Figure 1(b)) translates the program counter (PC) to an index in the loop buffer (called NEW_PC in the figure) by subtracting the stored startaddress from the PC value. This NEW_PC is used as an index in the local controller table. Only if the first column of this table indicates that an operation is stored in the loop buffer partition for that PC is the loop buffer partition accessed. This is an important feature since it saves energy (by not accessing the loop buffer partition) and area (by not storing any operations if that cluster is not used in that cycle). No NOP compression is used inside a cluster, so if only one FU is active in a certain cluster, the other FUs will execute an explicitly coded NOP instruction. This is acceptable, since the clustering is chosen such that only FUs that are used together are clustered together.
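The local controller behaviour described above can be sketched as a minimal software simulation. This is an illustrative model only, not the actual hardware or toolflow; names such as LocalController and fetch are assumptions:

```python
# Minimal sketch of one cluster's local controller (LC): it activates on
# lbon, translates the PC to NEW_PC by subtracting startaddress, and only
# accesses its loop buffer partition when the activation bit for that
# cycle is set (otherwise an explicitly coded NOP is executed).

class LocalController:
    def __init__(self, active_bits, indices, partition):
        self.active_bits = active_bits  # column 1: is an op stored this cycle?
        self.indices = indices          # column 2: entry in the partition
        self.partition = partition      # operations stored for this cluster
        self.start = self.end = None

    def lbon(self, startaddress, endaddress):
        # store the loop bounds; the loop buffer is now enabled
        self.start, self.end = startaddress, endaddress

    def fetch(self, pc):
        if self.start is None or pc > self.end:
            self.start = None            # loop exited: LC turns itself off
            return None                  # fall back to the L1 instruction cache
        new_pc = pc - self.start         # address translation (indirection)
        if not self.active_bits[new_pc]:
            return "NOP"                 # partition not accessed: energy saved
        return self.partition[self.indices[new_pc]]

# A cluster that is only active in cycles 0 and 2 of a 3-cycle loop body
lc = LocalController([1, 0, 1], [0, None, 1], ["MACC", "ADD"])
lc.lbon(100, 102)
print([lc.fetch(pc) for pc in (100, 101, 102)])  # ['MACC', 'NOP', 'ADD']
```

Note how the activation bit lets the partition stay unread in cycles where the cluster does nothing, which is exactly where the energy and area savings come from.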
What entry in the loop buffer partition is accessed depends on the second column in the local controller table. When the LC detects that the program counter is larger than endaddress, the loop buffer will be turned off.

A. Experimental Setup

The experiments in this paper have all been performed using a VLIW processor modeled after the TMS320C62x processor of Texas Instruments [18]. This VLIW has a datapath with two clusters of four functional units (FUs). Benchmarks of the MediaBench suite [9] were compiled with the Trimaran compiler [19]. Since steering of the loop unrolling transformation was not adequately implemented in the compiler, and manual interaction was therefore needed, we have limited ourselves for this transformation to two benchmarks: an MPEG-2 Video Encoder and a 3D image reconstruction algorithm. After compilation to assembly code, loops are automatically selected to be mapped onto the loop buffer using the method presented in [2]. Once loops are mapped, the loop buffer clustering (number of clusters and size of each cluster) is optimized for this mapping. The result of the assignment and clustering is used to estimate the energy in the instruction memory hierarchy using the energy models described in [7]. This includes the energy of the loop buffer, the instruction cache and the external memory. The energy of the other processor components is not considered because this paper focuses on the instruction memory, which, as shown above, is an important part of the total processor energy consumption.

III. ILP CODE TRANSFORMATIONS

ILP transformations combine and enlarge basic blocks to expose more instruction level parallelism (ILP). They are needed if the available parallelism of the VLIW processor is not exploited by the original program and the program consequently does not meet its performance constraints.
ILP code transformations tend to increase the number of operations and increase code size, and thus may also increase the instruction memory energy. Loop unrolling, for example, replicates the body of a loop. Superblock creation, a technique to increase the freedom and scope of the scheduler, uses tail duplication, which duplicates certain basic blocks of the program. By default the compiler steers the transformations to achieve maximum performance and does not consider energy. This causes a clear energy penalty: on average the energy increase in the instruction memories is 5%. However, we will show in the following subsections that if we modify the heuristics that steer the transformations in the compiler, some transformations can clearly have a positive effect on instruction memory energy. This steering is one of the main contributions of this paper. We will see that, using these heuristics, it is sometimes possible to improve both energy and performance, and that sometimes a Pareto trade-off can be explored between energy and performance.

A. How More ILP can Reduce the Clustered Loop Buffer Energy

Since many ILP-enhancing transformations increase either the loop size or the number of operations, it is not obvious that energy in the instruction memory can be reduced this way. The example below shows that, because a clustered loop buffer architecture is used, increasing ILP can indeed reduce energy. Figure 2 shows a simple nested loop (a 2D FIR filter). For this loop, two possible schedules of the inner loop are shown in Figure 3: one generated with normal list scheduling, and one generated with modulo scheduling. The energy consumption (estimated using the method described in Section II-A) for the high ILP version is 9.4µJ, while for the low ILP version it is 12.2µJ (30% more). The largest difference comes

1 for (i = 0; i <= 28; i++) {
2   tmp = 0.0;
3   for (i2 = 0; i2 < 32; i2++)
4     tmp += w[i2] * x[i + i2];
5   y[i] = tmp; }

Fig. 2. 2D FIR filter.

Fig. 3. High and low ILP schedule for the inner loop with optimal loop buffer clustering.

from the energy spent in the address translation (indirection) of the local controller (LC). The address translation table is accessed every cycle, and indicates with a zero or one whether a cluster needs to be activated. The LC of the high ILP version has three main energy advantages, namely: i) fewer local controllers: the optimal clustering of the high ILP version has 3 local controllers (LCs), while the other has 4; ii) smaller local controllers: the depth of the LC is only 3, instead of 7; iii) fewer accesses to the local controllers: each LC is only accessed 3 times per iteration of the high ILP loop (3 cycles), while for the low ILP loop this is 7 cycles. So if a transformation can increase the ILP of a loop that is assigned to the loop buffer, without causing too much overhead (additional operations), it will reduce energy in the instruction memory hierarchy. In the next three sections, more details of three ILP code transformations are provided, namely software pipelining, if-conversion and loop unrolling. The effect of the code transformations on MediaBench is studied and conclusions are drawn on whether and when each transformation should be applied, incorporating their mutual influence.

B. Software Pipelining

Software pipelining [8] is a loop scheduling technique that extracts loop parallelism by overlapping the execution of several consecutive iterations. Software pipelining is now considered the most effective technique to achieve high instruction level parallelism and is implemented in all advanced compilers targeting ILP-based processors [8].
Figure 4 shows that the effect of software pipelining in the compiler on MediaBench is very limited: for both energy and delay the change is less than 5%; for a processor with 8 FUs, the average parallelism for MediaBench is only a few instructions per cycle.

Fig. 4. Effect of software pipelining on energy and cycles. The Y-axis shows the percentage energy and cycle count gain of enabling software pipelining in the compiler.

The reason for this limited effect is two-fold. Firstly, modulo scheduling [14], the software pipelining technique used, can only be applied to inner loops that have no internal control flow. And secondly, very often the hardware has enough resources to execute more operations in parallel, but the loop does not contain enough operations. To solve these two problems, two transformations that enable efficient software pipelining have been added: hyperblock creation for the former problem and loop unrolling for the latter. The effect of these two transformations will be discussed in the next two subsections; only after that can the true benefit of software pipelining be evaluated.

C. If-Conversion or Hyperblock Creation

A common problem all global optimization and scheduling strategies must resolve is the conditional branches in the target application. Predicated execution is an efficient method to handle conditional branches. Predicated or guarded execution refers to the conditional execution of instructions based on the value of a boolean source operand, referred to as the predicate. When the predicate has value true, the instruction is executed normally; when the predicate has value false, the instruction is treated as a NOP. With predicated execution support provided in the architecture, the compiler can eliminate many of the conditional branches in an application.
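Predicated execution as described above can be sketched as follows. This is an interpreter-style illustration, not the actual compiler output; both function names are made up:

```python
# Sketch of predicated (guarded) execution: the branch-based form becomes
# a branch-free form in which each operation carries a predicate, and an
# operation whose predicate is false behaves as a NOP.

def branchy(a, b):
    # original control flow: a conditional branch selects one of two paths
    if a > b:
        return a - b
    else:
        return b - a

def predicated(a, b):
    p = a > b                 # predicate computed with a compare operation
    r1 = a - b                # executed under predicate p
    r2 = b - a                # executed under predicate !p
    return r1 if p else r2    # the nullified result is simply discarded

assert branchy(7, 3) == predicated(7, 3) == 4
assert branchy(3, 7) == predicated(3, 7) == 4
```

Both operations execute every time in the predicated form, which is exactly why predication increases the number of executed operations even as it removes branches.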
The process of eliminating conditional branches from a program using predication is referred to as if-conversion. The newly created block, consisting of the operations of multiple basic blocks, is called a hyperblock. To form hyperblocks, two features of each basic block in a region are examined: execution frequency and size. Execution frequency is used to exclude paths of control that are not executed often. Removing infrequently executed paths reduces scheduling constraints for the frequent paths. The second feature is basic block size. Larger basic blocks should be given less priority for inclusion than smaller blocks, since larger blocks utilize many machine resources and thus may reduce the performance of the control paths through smaller blocks. A heuristic function that considers the two issues is shown below:

BSV_bb = K * (weight_bb / weight_mainpath) * (size_mainpath / size_bb)

Fig. 5. Effect of different predication strategies on energy and delay (cycles). The values shown are an average over all applications of MediaBench.

The Block Selection Value (BSV [12]) is calculated for each basic block considered for inclusion in the hyperblock. The weight and size of each basic block are normalized against those of the main path. The main path is the most likely executed control path through the region of blocks considered for inclusion in the hyperblock. The hyperblock initially contains only blocks along the main path. The variable K is a machine dependent constant that represents the issue rate of the processor; in this case it is 8. Processors with more resources can execute more instructions concurrently, and are therefore likely to take advantage of larger hyperblocks. From an energy perspective, if-conversion should be avoided altogether, since it adds useless (i.e., nullified) operations. In the next subsection we will see that if-conversion can nevertheless be useful to enable modulo scheduling with no, or only a limited, increase in instruction memory energy. In high performance compilation, normally all forward branches are considered for predication, even those that are not in software-pipelined loops. But since predicating these kinds of branches does not have such a big influence on the ILP, and since predication always increases the number of executed operations, the effect on energy is generally negative. When predication as proposed in [12] is applied to the MediaBench suite, the number of cycles is reduced by 3% on average. However, the instruction memory energy increases, with peaks of 25% for some applications. We will refer to this predication strategy as all branches. By restricting the predication to inner loops, where it enables software pipelining, both the delay and the energy can be reduced.
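The block selection heuristic can be sketched as a small function. The exact form of the formula is reconstructed from the description above (weight and size normalized against the main path, scaled by the issue rate K); the numeric inputs are illustrative, not measured:

```python
# Sketch of the Block Selection Value (BSV) heuristic for hyperblock
# formation: frequently executed blocks score high, large blocks are
# penalized, and K models the processor's issue rate (8 here).

def bsv(weight_bb, size_bb, weight_mainpath, size_mainpath, K=8):
    # normalized weight rewards hot blocks; normalized size penalizes big ones
    return K * (weight_bb / weight_mainpath) * (size_mainpath / size_bb)

# A hot, small side block scores far higher than a cold, large one,
# so only the former would be merged into the hyperblock.
hot_small = bsv(weight_bb=90, size_bb=4, weight_mainpath=100, size_mainpath=10)
cold_large = bsv(weight_bb=10, size_bb=20, weight_mainpath=100, size_mainpath=10)
print(hot_small, cold_large)  # 18.0 0.4
```

Blocks whose BSV falls below a threshold are left out of the hyperblock, which keeps infrequently executed or oversized paths from constraining the schedule of the main path.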
Figure 5 shows three different strategies: i) the predication strategy using the BSV, not limited to inner loops (all branches), aimed only at improving performance; ii) predicating selected inner loops based on the same BSV as all branches (selected inner); and iii) predicating all inner loops (all inner). Since predicating all inner loops gives the best performance and the lowest energy consumption, we propose to always perform this type of predication in combination with a compiler that uses this type of software pipelining and a software controlled clustered loop buffer. Comparing Figure 6 with Figure 4 shows that enabling if-conversion for all inner loops helps significantly to increase performance and reduce energy consumption: on average a 55% reduction in energy consumption, and a reduction in cycles as well.

Fig. 6. Effect of software pipelining on energy and cycles (software pipelining disabled versus software pipelining enabled). If-conversion is applied to all inner loops. No loop unrolling is applied yet.

D. Loop Unrolling

Another problem with modulo scheduling is the high overhead of the prologue and epilogue of the software-pipelined loop if the number of iterations is low. Since modulo scheduling, as currently implemented in the compiler that was used, can only be applied to the innermost loop, the number of iterations of this loop will determine the efficiency of the modulo scheduling. For MediaBench, more than 60% of the time spent in loops is spent in loops with an iteration count between 8 and 16. We will see in the next subsection how loop unrolling can help to increase ILP and reduce energy for such loops. Loop unrolling is a technique that reduces the number of iterations by replicating the body of the loop.
A loop that has been unrolled can have more ILP if the extra operations in the body are partly independent of the already existing ones and there are enough resources to place these operations. In the context of modulo scheduling, the minimal initiation interval (MII) determines the performance of the loop. This MII is bounded by resources (ResII) and by cyclic dependencies (RecII). We have examined the distribution of software-pipelined loops in MediaBench with respect to resource limitations and recurrent dependency limitations. The conclusion was that in 87% of the cases, the loop's II is not only limited by the resource constraints, but also by the recurrent dependencies. However, in many of these cases, the RecII limitation can be broken when the loop is unrolled using variable renaming and tree height reduction. The FIR filter of Figure 2 is such a case. By unrolling the loop and renaming the accumulator tmp, lines 4 and 5 of the source code become tmp += w[i2]*x[i+i2]; tmp2 += w[i2+1]*x[i+i2+1]; and y[i] = tmp + tmp2; respectively. 1) Effect of Loop Unrolling, a Simple Example: Unrolling of the inner loop of the FIR filter of Figure 2 is examined here. The inner loop has 32 iterations, so maximally 32 copies of the loop are considered (unrolled 31 times, or an unroll amount equal to 32). It may sound counter-intuitive to apply a technique that reduces the number of iterations if this number is already too low, but it will be shown at the end of the section how this helps.
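The unroll-by-2 plus accumulator-renaming transformation described above can be illustrated directly. The sketch below transcribes the C fragment of Fig. 2 into Python (array sizes are illustrative); the point is that the two accumulators form independent dependence chains, which is what lowers RecII:

```python
# Sketch of unroll-by-2 with variable renaming applied to the FIR inner
# loop of Fig. 2. Renaming the accumulator (tmp, tmp2) breaks the single
# recurrence on tmp into two independent chains that modulo scheduling
# can overlap; the combine step y[i] = tmp + tmp2 is the added overhead.

def fir_original(w, x, n_out):
    y = [0.0] * n_out
    for i in range(n_out):
        tmp = 0.0
        for i2 in range(len(w)):           # 32-iteration inner loop
            tmp += w[i2] * x[i + i2]
        y[i] = tmp
    return y

def fir_unrolled2(w, x, n_out):
    y = [0.0] * n_out
    for i in range(n_out):
        tmp = tmp2 = 0.0                   # renamed accumulators
        for i2 in range(0, len(w), 2):     # body replicated once (unroll amount 2)
            tmp  += w[i2] * x[i + i2]
            tmp2 += w[i2 + 1] * x[i + i2 + 1]
        y[i] = tmp + tmp2                  # combine the partial sums
    return y

# Both versions compute the same filter output
w = [0.5] * 32
x = [float(i) for i in range(64)]
assert fir_original(w, x, 16) == fir_unrolled2(w, x, 16)
```

(Strictly, floating-point reassociation like this can change rounding; compilers only apply it when the semantics allow it or under relaxed floating-point modes.)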

Fig. 7. Effect of unrolling the inner loop of a simple FIR filter.

Figure 7 shows the effect of unrolling on cycle count and energy in the instruction memory hierarchy. Efficiency is defined as Eff_n = S_n / n, with n the unroll amount and S_n the speedup of the unrolled version with unroll amount n versus the original version. So, if a loop gets twice as fast by unrolling it once, the efficiency is 1.0. The efficiency will be lower if the speedup is lower. In this case, cycle count and efficiency are complementary graphs. Several observations can be made from the graphs:
- The energy of the instruction memory follows the curve of the cycle count. This agrees with results from previous studies: loop unrolling is only beneficial for energy consumption if it increases performance [5], [22], [20].
- When unrolled once (unroll amount 2), the cycle count goes up. This is because in this case, for each iteration of the loop, twice the number of results are generated (from the two copies of the loop body). These two sets of results need to be combined, causing extra overhead. For unroll amounts larger than 2, that overhead of combining the results is often negligible, given that there are still enough resources for the unrolled loop.
- If loops are unrolled more than once, the cycle count will decrease, although not linearly with the unroll amount, as can be seen in the efficiency graph.
- Because the loop has exactly 32 iterations (before unrolling), some peculiar behavior appears when unrolling it completely: suddenly the efficiency increases dramatically (an efficiency of 1.4 means a 45x speed-up). This is because of the interaction with modulo scheduling: if the loop is unrolled completely, the second innermost loop can be software-pipelined, adding additional speedup to the overall program.
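The efficiency metric defined above is simple enough to state in a couple of lines; the cycle counts below are made-up illustrative values, not measurements from the paper:

```python
# Sketch of the unrolling-efficiency metric Eff_n = S_n / n, with S_n the
# speedup of the version unrolled with unroll amount n over the original.

def efficiency(cycles_base, cycles_unrolled, n):
    speedup = cycles_base / cycles_unrolled   # S_n
    return speedup / n                        # Eff_n

# A loop that gets twice as fast with unroll amount 2 has efficiency 1.0;
# sub-linear gains give an efficiency below 1.0.
print(efficiency(1000, 500, 2))   # 1.0
print(efficiency(1000, 400, 4))   # 0.625
```

An efficiency of 1.0 thus means the speedup is exactly linear in the unroll amount, and a value above 1.0 signals an extra effect beyond the unrolling itself, such as the full-unroll interaction with modulo scheduling noted above.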
In the next sections it will be shown that, because in multimedia applications many inner loops have small iteration counts, unrolling them completely is usually the best option, both for energy and for performance. 2) Loop Unrolling for Multimedia Applications: The results of the previous section, combined with the low iteration counts of innermost loops, indicate that it would be very beneficial to apply full loop unrolling to a selected number of inner loops of MediaBench. Two applications have been chosen, namely a 3D image reconstruction algorithm (snake [13]) and an MPEG-2 Video Encoder. The most important inner loops of each application have been unrolled. The MPEG-2 benchmark contains two important loops, motion estimation (motion) and the forward discrete cosine transform (FDCT), that together account for 50% of the cycle count before unrolling. The snake benchmark has one inner loop accounting for 72% of the total number of cycles. Both for energy and delay, full unrolling is indeed the best for two of the three loops: the snake loop and the FDCT loop. This is because for these two loops the processor has enough free resources to successfully schedule the unrolled loop. For motion estimation, however, things are different: since this loop is more resource limited, full unrolling has a smaller performance benefit, and thus the negative impact on energy is relatively larger. After full unrolling, the cycle count of the inner loops of the two benchmarks is reduced to 35% and 65% respectively.

E. Combination of Unrolling, If-Conversion and Software Pipelining

For the MPEG-2 Video Encoder and for the snake benchmark, we have examined the mutual effect of full loop unrolling, if-conversion and software pipelining. Each of the three transformations can be enabled separately, resulting in 2^3 = 8 combinations. For all 8 combinations, the effect on energy and delay is shown in Figure 8 for the snake benchmark and for the MPEG-2 Video Encoder benchmark.
Several observations can be made. In all cases the impact on cycles is larger than the impact on energy. In particular, the impact on cycles is always positive; if the impact on energy is positive, it is smaller than the impact on cycles, and sometimes a quite large negative impact on energy can occur. All of these observations can be explained by the fact that the transformations were designed with increasing performance in mind, without any steering to reduce or contain the energy consumption. In future research, heuristics need to be developed that take into account both energy and delay. Sometimes one global optimum is present with respect to energy and delay, for example for the snake benchmark. We expect, however, that (in contrast with classical code transformations) ILP code transformations will often present an energy-delay trade-off. In the case of the MPEG-2 Video Encoder, a Pareto trade-off can be observed in the XY-chart. One of the challenges in efficiently exploring this trade-off is to correctly estimate the impact of the enabling transformations. Loop unrolling can be an enabling transformation: performing loop unrolling alone has a significant negative impact on energy, but combining loop unrolling with software pipelining yields a larger energy gain than software pipelining alone.

F. Loop Buffer versus Normal Instruction Cache

This section investigates whether the existing steering for normal instruction caches is good enough for loop buffers.

TABLE I. Energy due to different if-conversion strategies for an architecture with a cache and with a loop buffer, relative to no if-conversion. Average over all applications in MediaBench.

                  cache          loop buffer
  no if-conv.     100%           100%
  all branches    100%           —
  selected inner  97% (optimal)  93%
  all inner       —              83% (optimal)

TABLE II. Energy due to software pipelining for an architecture with a cache and with a loop buffer, relative to no software pipelining. Average over all applications in MediaBench.

          cache   loop buffer
  no swp  100%    100%
  swp     85%     45%

Fig. 8. Effect of loop unrolling (unroll), software pipelining (swp) and if-conversion (hb) on energy and delay (cycles) for the snake benchmark (top) and the MPEG-2 Video Encoder benchmark (bottom). Each transformation can be enabled (y) or disabled (n). The upper graph shows the effect of the enabled transformations on energy and delay in a bar chart; the lower graph shows the same information in an XY-chart. For snake only one Pareto point exists; for MPEG-2 a Pareto curve appears.

Table I shows the effect of the different if-conversion strategies on the energy consumption of an architecture with and without a loop buffer. The values are relative to no if-conversion. Two important messages can be observed. Firstly, if-conversion is more important for an architecture with a loop buffer, since it induces a larger energy reduction there. Secondly, the best strategy (in terms of energy) for a loop buffer based architecture is to convert all inner loops, while for a cache-based architecture it is to convert only selected inner loops. For software pipelining, the conclusion is similar, yet simpler (Table II): software pipelining reduces energy consumption in an architecture without a loop buffer, but even more so in an architecture with a loop buffer. Steering loop unrolling seems to be more tricky.
For the two examples we have compiled (snake and the MPEG-2 Video Encoder), the results already differ. For the snake application, the optimal unroll amount with respect to energy is higher for a loop buffered architecture than for an architecture with a cache, but for the MPEG-2 application it is lower. It is already

TABLE III. Energy due to loop unrolling for an architecture with a cache and with a loop buffer, for two benchmarks, relative to no unrolling.

  SNAKE
  unroll amount   cache   loop buffer
  2               —       102%
  4               79%     104%

  MPEG-2 Video Encoder
  unroll amount   cache   loop buffer
  4               93%     —
  8               77%     92%

clear that the optimal unrolling for the two architectures is different, but more research is needed to determine what this unroll amount actually should be.

IV. GUIDELINES

From the above observations we propose guidelines on how to apply the three transformations when a loop buffer is present. 1) Software pipelining should always be applied, since it is a relatively safe transformation and since, from Figure 8, it is clear that the other two transformations do not make sense unless software pipelining is applied. 2) We propose to do if-conversion on all inner loops, but only on inner loops. 3) Loop unrolling of relatively small inner loops can definitely be beneficial for the energy and/or performance of the loop. However, we cannot state that it is good in general to do full loop unrolling, since for some loops an energy-delay Pareto trade-off occurs between the different unroll amounts. As stated before, more research is needed to tackle this problem.

V. RELATED WORK

The transformations that are proposed are in themselves not new: they have been applied in the past in high performance compilers [1], but not in the context of optimization for low power. With respect to low energy, other researchers before us came to the conclusion that transformations that improve performance will also improve energy [5], [22], [20]. The main difference is that all those researchers looked at normal instruction caches, while we have clearly shown that the steering for loop buffers is different from the steering for instruction caches. Many researchers have explored the trade-off between performance, code size and instruction memory energy for processors with normal instruction caches. Liveris et al. [11] explore function inlining and code placement. Zambreno et al. [23] provide heuristics to explore this trade-off when performing loop unrolling and function inlining, and Su et al. [17] do the same for software pipelining.
To the best of our knowledge, code transformations for VLIW processors to reduce energy in the loop buffer are only proposed in [16]. In that paper, however, the authors assume a loop buffer without support for branches, and apply transformations to alleviate this handicap. Our loop buffer implementation does support branches, and thus does not need those transformations. Furthermore, the trade-off between energy and performance is not considered in that paper.

VI. CONCLUSIONS

We have shown that ILP code transformations, if applied carefully, reduce the instruction memory energy. The highest energy gain can be achieved if the system contains a software controlled clustered loop buffer as proposed in this paper. Three transformations (loop unrolling, if-conversion and software pipelining) have been examined, and guidelines on how to apply a combination of these three transformations have been proposed. Using these guidelines to steer the transformations, the energy and delay could be reduced by 2% and 78% respectively for an advanced 3D image reconstruction benchmark application. For an MPEG-2 Video Encoder benchmark, a Pareto trade-off between energy and delay has been presented. We have shown that for some transformations (unrolling and if-conversion) different heuristics are needed in the presence of a software controlled loop buffer than with a normal instruction cache. For if-conversion, we have proposed and implemented heuristics that work well with the loop buffer. For unrolling, more research is needed to come up with a cost function to steer this transformation and to implement it in the compiler. This will be done in future work.

REFERENCES

[1] D. F. Bacon, S. L. Graham, and O. J. Sharp. Compiler transformations for high-performance computing. ACM Computing Surveys, 26(4), 1994.
[2] L. Benini, D. Bruni, M. Chinosi, C. Silvano, V. Zaccaria, and R. Zafalon. A power modeling and estimation framework for VLIW-based embedded systems. In Proc. Int.
Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), September 2001.
[3] L. Benini and G. De Micheli. System-level power optimization: Techniques and tools. ACM Transactions on Design Automation of Electronic Systems (TODAES), 5(2):115–192, April 2000.
[4] F. Catthoor, K. Danckaert, C. Kulkarni, E. Brockmeyer, P. G. Kjeldsberg, T. Van Achteren, and T. Omnes. Data Access and Storage Management for Embedded Programmable Processors. Kluwer Academic Publishers, March 2002.
[5] S. V. Gheorghita, H. Corporaal, and T. Basten. Using iterative compilation to reduce energy consumption. In Proc. of ASCI 2004, June 2004.
[6] M. F. Jacome and G. de Veciana. Design challenges for new application-specific processors. IEEE Design & Test of Computers, special issue on Design of Embedded Systems, April–June 2000.
[7] M. Jayapala, F. Barat, T. Vander Aa, F. Catthoor, H. Corporaal, and G. Deconinck. Clustered loop buffer organization for low energy VLIW embedded processors. IEEE Trans. on Computers, 54(6):672–683, June 2005.
[8] M. S. Lam. A retrospective: Software pipelining: An effective scheduling technique for VLIW machines. In 20 Years of PLDI (1979–1999): A Selection.
[9] C. Lee et al. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In Proc. of the International Symposium on Microarchitecture, pages 330–335, 1997.
[10] L. H. Lee, W. Moyer, and J. Arends. Instruction fetch energy reduction using loop caches for embedded applications with small tight loops. In Proc. of the ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), August 1999.
[11] N. Liveris, N. D. Zervas, D. Soudris, and C. E. Goutis. A code transformation-based methodology for improving I-cache performance of DSP applications. In Proc. of Design, Automation and Test in Europe (DATE), March.
[12] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann. Effective compiler support for predicated execution using the hyperblock.
In Proc. of the 25th Annual International Symposium on Microarchitecture, 1992.
[13] M. Proesmans, L. J. V. Gool, and A. J. Oosterlinck. Active acquisition of 3D shape for moving objects. In Proc. of the International Conference on Image Processing (ICIP), 1996.
[14] B. R. Rau. Iterative modulo scheduling: An algorithm for software pipelining loops. In MICRO 27: Proc. of the 27th Annual International Symposium on Microarchitecture, pages 63–74. ACM Press, 1994.
[15] S. Rixner, W. Dally, B. Khailany, P. Mattson, U. Kapasi, and J. Owens. Register organization for media processing. In Proc. of HPCA 2000, January 2000.

[16] J. Sias, H. Hunter, and W. Hwu. Enhancing loop buffering of media and telecommunications applications using low-overhead predication. In Proc. of the 34th Annual International Symposium on Microarchitecture, December 2001.
[17] B. Su, J. Wang, R. Rabipour, E.-W. Hu, and J. Manzano. Loop optimization with tradeoff between cycle count and code size for DSP applications. In Proc. of EUSIPCO 2004, September 2004.
[18] Texas Instruments. C6000 Platform: DSP Selection Guide, No. SSDV004L.
[19] Trimaran group. Trimaran: An Infrastructure for Research in Instruction-Level Parallelism, 1999.
[20] M. Valluri and L. John. Is compiling for performance == compiling for power? In Proc. of the 5th Annual Workshop on Interaction between Compilers and Computer Architectures (INTERACT-5), 2001.
[21] T. Vander Aa, M. Jayapala, F. Barat, G. Deconinck, R. Lauwereins, F. Catthoor, and H. Corporaal. Instruction buffering exploration for low energy VLIWs with instruction clusters. In Proc. of ASPDAC 2004, January 2004.
[22] H. Yang, G. Gao, A. Marquez, G. Cai, and Z. Hu. Power and energy impact by loop transformations. In Proc. of COLP 2001, September 2001.
[23] J. Zambreno, M. Kandemir, and A. Choudhary. Enhancing compiler techniques for memory energy optimizations. In Proc. of EMSOFT 2002, October 2002.


Using Intel Streaming SIMD Extensions for 3D Geometry Processing Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,

More information

Using Cache Line Coloring to Perform Aggressive Procedure Inlining

Using Cache Line Coloring to Perform Aggressive Procedure Inlining Using Cache Line Coloring to Perform Aggressive Procedure Inlining Hakan Aydın David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, MA, 02115 {haydin,kaeli}@ece.neu.edu

More information

Evaluating Compiler Support for Complexity Effective Network Processing

Evaluating Compiler Support for Complexity Effective Network Processing Evaluating Compiler Support for Complexity Effective Network Processing Pradeep Rao and S.K. Nandy Computer Aided Design Laboratory. SERC, Indian Institute of Science. pradeep,nandy@cadl.iisc.ernet.in

More information

Performance Evaluation of VLIW and Superscalar Processors on DSP and Multimedia Workloads

Performance Evaluation of VLIW and Superscalar Processors on DSP and Multimedia Workloads Middle-East Journal of Scientific Research 22 (11): 1612-1617, 2014 ISSN 1990-9233 IDOSI Publications, 2014 DOI: 10.5829/idosi.mejsr.2014.22.11.21523 Performance Evaluation of VLIW and Superscalar Processors

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

SAD implementation and optimization for H.264/AVC encoder on TMS320C64 DSP

SAD implementation and optimization for H.264/AVC encoder on TMS320C64 DSP SETIT 2007 4 th International Conference: Sciences of Electronic, Technologies of Information and Telecommunications March 25-29, 2007 TUNISIA SAD implementation and optimization for H.264/AVC encoder

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Automatic Counterflow Pipeline Synthesis

Automatic Counterflow Pipeline Synthesis Automatic Counterflow Pipeline Synthesis Bruce R. Childers, Jack W. Davidson Computer Science Department University of Virginia Charlottesville, Virginia 22901 {brc2m, jwd}@cs.virginia.edu Abstract The

More information

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Amit Kulkarni, Tom Davidson, Karel Heyse, and Dirk Stroobandt ELIS department, Computer Systems Lab, Ghent

More information

How Much Logic Should Go in an FPGA Logic Block?

How Much Logic Should Go in an FPGA Logic Block? How Much Logic Should Go in an FPGA Logic Block? Vaughn Betz and Jonathan Rose Department of Electrical and Computer Engineering, University of Toronto Toronto, Ontario, Canada M5S 3G4 {vaughn, jayar}@eecgutorontoca

More information

CS553 Lecture Profile-Guided Optimizations 3

CS553 Lecture Profile-Guided Optimizations 3 Profile-Guided Optimizations Last time Instruction scheduling Register renaming alanced Load Scheduling Loop unrolling Software pipelining Today More instruction scheduling Profiling Trace scheduling CS553

More information

Applications written for

Applications written for BY ARVIND KRISHNASWAMY AND RAJIV GUPTA MIXED-WIDTH INSTRUCTION SETS Encoding a program s computations to reduce memory and power consumption without sacrificing performance. Applications written for the

More information

If-Conversion SSA Framework and Transformations SSA 09

If-Conversion SSA Framework and Transformations SSA 09 If-Conversion SSA Framework and Transformations SSA 09 Christian Bruel 29 April 2009 Motivations Embedded VLIW processors have architectural constraints - No out of order support, no full predication,

More information

Design of Embedded DSP Processors Unit 2: Design basics. 9/11/2017 Unit 2 of TSEA H1 1

Design of Embedded DSP Processors Unit 2: Design basics. 9/11/2017 Unit 2 of TSEA H1 1 Design of Embedded DSP Processors Unit 2: Design basics 9/11/2017 Unit 2 of TSEA26-2017 H1 1 ASIP/ASIC design flow We need to have the flow in mind, so that we will know what we are talking about in later

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 1, JANUARY /$ IEEE

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 1, JANUARY /$ IEEE IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 1, JANUARY 2009 151 Transactions Briefs Interconnect Exploration for Energy Versus Performance Tradeoffs for Coarse Grained

More information

An Optimizing Compiler for the TMS320C25 DSP Chip

An Optimizing Compiler for the TMS320C25 DSP Chip An Optimizing Compiler for the TMS320C25 DSP Chip Wen-Yen Lin, Corinna G Lee, and Paul Chow Published in Proceedings of the 5th International Conference on Signal Processing Applications and Technology,

More information