Impact of ILP-improving Code Transformations on Loop Buffer Energy

Impact of ILP-improving Code Transformations on Loop Buffer Energy

Tom Vander Aa, Murali Jayapala, Henk Corporaal, Francky Catthoor, Geert Deconinck
IMEC, Kapeldreef 75, B-3001 Leuven, Belgium
ESAT, KULeuven, Kasteelpark Arenberg 10, B-3001 Leuven, Belgium
TU Eindhoven, Den Dolech 2, 5612 AZ Eindhoven, Netherlands

Abstract — For multimedia applications, loop buffering is an efficient mechanism to reduce the power in the instruction memory of embedded processors. In particular, software controlled clustered loop buffers are very energy efficient. However, the code transformations VLIW compilers need to reach higher ILP may have a large negative influence on the energy consumed in the instruction memories (including the loop buffer). This paper shows that such code transformations can also have a positive impact on the instruction memory energy of processors, if the transformations are steered taking into account the presence of the software controlled clustered loop buffer. We propose guidelines to steer the code transformations and show that these guidelines should be applied differently in a system with a clustered loop buffer than in a system with only a normal instruction cache. Results are presented for a mix of three important ILP code transformations: software pipelining, if-conversion and loop unrolling. Results show an energy reduction between 5% and 25% and a delay reduction between 6% and 75% for an MPEG-2 Video Encoder benchmark.

I. INTRODUCTION AND MOTIVATION

Low energy is one of the key design goals of current embedded systems for multimedia applications. Typically the cores of such systems are programmable processors. VLIW ASIPs in particular are known to be very effective in achieving high performance with reasonably low power for our domain of interest [6]. However, power analysis of such processors indicates that a significant amount of power is consumed in the on-chip (instruction) memory hierarchy: 30% of the total power according to [2].
Our experimental analysis shows that, if the appropriate data memory hierarchy mapping techniques are applied first [4], and if all methods to reduce power in the datapath are applied [3], this number may go up to 50%. Hence, reducing this part of the budget is crucial in reducing the overall power consumption of the system. Loop buffering is an effective scheme to reduce energy consumption in the instruction memory hierarchy [10]. In any typical multimedia application, a significant amount of execution time is spent in small program segments. The energy of fetching these segments can be reduced by storing them in a small loop buffer instead of the big instruction cache (IL1). However, as more instruction level parallelism (ILP) is extracted from the application, wider datapaths and wider loop buffers are needed. In the datapath, the register files and the interconnection network to them are the main bottleneck of wide VLIWs. This issue is resolved in recent papers using datapath clustering (see e.g. [15]). Datapath clustering splits up the register file into several partitions and restricts access to each partition to certain functional units (FUs). This reduces the number of ports per register file and the number of wires in the interconnection network. To obtain low power, clustering also needs to be applied to the instruction memory hierarchy. For this a clustered loop buffer architecture [7] has been proposed. The criteria for this type of clustering are, however, very different from those for datapath clustering. In a clustered datapath, the main compiler problem to be solved is how to minimize the inter-cluster communication overhead. Since the instruction streams of the different functional units do not interact, this is not a problem for instruction clusters. To obtain high ILP, ILP-improving code transformations are needed. Code transformations such as loop unrolling, software pipelining and if-conversion form part of every modern compiler. The contributions of this paper in that area are two-fold.
Firstly, we will show that such code transformations can have a large positive impact on the instruction memory energy of processors that have a software controlled clustered loop buffer. We also show that for some transformations, such as loop unrolling, a trade-off exists between performance and energy. Secondly, to exploit this trade-off we propose a set of guidelines to steer the code transformations to optimally exploit the presence of the loop buffer. The rest of this paper is organized as follows. In Section II the software controlled clustered loop buffer organization and the compiler are described. In Section III the different ILP code transformations are discussed, and their effect on performance and energy is studied for several benchmarks. Based on the observations of Section III, guidelines on how to steer these transformations are proposed (Section IV). An account of the related work is presented in Section V. Conclusions are drawn in Section VI.

II. CLUSTERED LOOP BUFFER ORGANIZATION AND COMPILATION

Figure 1(a) illustrates the essentials of the clustered loop buffer under consideration. A more detailed analysis of the

Fig. 1. Clustered loop buffer architecture; (a) overview: functional units are grouped into instruction clusters, a cluster fetches instructions from a loop buffer partition; (b) local controller detail.

architecture can be found in [7]. Instructions are fed to the processing unit either from the level-1 instruction cache (IL1) or from the loop buffer. Initially the loop buffer is disabled and the program executes via IL1. When a loop buffer control instruction is encountered, marking the beginning of the loop that has to be mapped to the loop buffer, the loop buffer will be turned on. The form of this special instruction is lbon <startaddress>, <endaddress> (lbon means loop buffer on). Startaddress is the address of the first instruction of the loop and endaddress that of the last one. These values are stored in the local controller (LC) of each cluster and will be used during the execution of the loop. When the loop buffer is used, a local controller (see Figure 1(b)) translates the program counter (PC) to an index in the loop buffer (called NEW_PC in the figure) by subtracting the stored startaddress from the PC value. This NEW_PC is used as an index in the local controller table. Only if the first column of this table indicates that an operation is stored in the loop buffer partition for that PC is the loop buffer partition accessed. This is an important feature since it saves energy (by not accessing the loop buffer partition) and area (by not storing any operations if that cluster is not used in that cycle). No NOP compression is used inside a cluster, so if only one FU is active in a certain cluster, the other FUs will execute an explicitly coded NOP instruction. This is acceptable, since the clustering is chosen such that only FUs that are used together are clustered together.
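The local controller behaviour described above can be sketched as a minimal software simulation. This is an illustrative model only, not the actual hardware or toolflow; names such as LocalController and fetch are assumptions:

```python
# Minimal sketch of one cluster's local controller (LC): it activates on
# lbon, translates the PC to NEW_PC by subtracting startaddress, and only
# accesses its loop buffer partition when the activation bit for that
# cycle is set (otherwise an explicitly coded NOP is executed).

class LocalController:
    def __init__(self, active_bits, indices, partition):
        self.active_bits = active_bits  # column 1: is an op stored this cycle?
        self.indices = indices          # column 2: entry in the partition
        self.partition = partition      # operations stored for this cluster
        self.start = self.end = None

    def lbon(self, startaddress, endaddress):
        # store the loop bounds; the loop buffer is now enabled
        self.start, self.end = startaddress, endaddress

    def fetch(self, pc):
        if self.start is None or pc > self.end:
            self.start = None            # loop exited: LC turns itself off
            return None                  # fall back to the L1 instruction cache
        new_pc = pc - self.start         # address translation (indirection)
        if not self.active_bits[new_pc]:
            return "NOP"                 # partition not accessed: energy saved
        return self.partition[self.indices[new_pc]]

# A cluster that is only active in cycles 0 and 2 of a 3-cycle loop body
lc = LocalController([1, 0, 1], [0, None, 1], ["MACC", "ADD"])
lc.lbon(100, 102)
print([lc.fetch(pc) for pc in (100, 101, 102)])  # ['MACC', 'NOP', 'ADD']
```

Note how the activation bit lets the partition stay unread in cycles where the cluster does nothing, which is exactly where the energy and area savings come from.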
What entry in the loop buffer partition is accessed depends on the second column in the local controller table. When the LC detects that the program counter is larger than endaddress, the loop buffer will be turned off.

A. Experimental Setup

The experiments in this paper have all been performed using a VLIW processor modeled after the TMS320C62x processor of Texas Instruments [18]. This VLIW has a datapath with two clusters of four functional units (FUs). Benchmarks of the MediaBench suite [9] were compiled with the Trimaran compiler [19]. Since steering of the loop unrolling transformation was not adequately implemented in the compiler, and manual interaction was therefore needed, we have limited ourselves for this transformation to two benchmarks: an MPEG-2 Video Encoder and a 3D image reconstruction algorithm. After compilation to assembly code, loops are automatically selected to be mapped onto the loop buffer using the method presented in [2]. Once loops are mapped, the loop buffer clustering (number of clusters and size of each cluster) is optimized for this mapping. The result of the assignment and clustering is used to estimate the energy in the instruction memory hierarchy using the energy models described in [7]. This includes the energy of the loop buffer, the instruction cache and the external memory. The energy of the other processor components is not considered because this paper focuses on the instruction memory, which, as shown above, is an important part of the total processor energy consumption.

III. ILP CODE TRANSFORMATIONS

ILP transformations combine and enlarge basic blocks to expose more instruction level parallelism (ILP). They are needed if the available parallelism of the VLIW processor is not exploited by the original program and the program consequently does not meet its performance constraints.
ILP code transformations tend to increase the number of operations and increase code size, and thus may also increase the instruction memory energy. Loop unrolling, for example, replicates the body of a loop. Superblock creation, a technique to increase the freedom and scope of the scheduler, uses tail duplication, which duplicates certain basic blocks of the program. By default the compiler steers the transformations to achieve maximum performance and does not consider energy. This causes a clear energy penalty: on average the energy increase in the instruction memories is 5%. However, we will show in the following subsections that if we modify the heuristics that steer the transformations in the compiler, some transformations can clearly have a positive effect on instruction memory energy. This steering is one of the main contributions of this paper. We will see that, using these heuristics, it is sometimes possible to improve both energy and performance, and that sometimes a Pareto trade-off can be explored between energy and performance.

A. How More ILP can Reduce the Clustered Loop Buffer Energy

Since many ILP-enhancing transformations increase either the loop size or the number of operations, it is not obvious that energy in the instruction memory can be reduced this way. The example below shows that, because a clustered loop buffer architecture is used, increasing ILP can indeed reduce energy. Figure 2 shows a simple nested loop (a 2D FIR filter). For this loop, two possible schedules of the inner loop are shown in Figure 3: one generated with normal list scheduling, and one generated with modulo scheduling. The energy consumption (estimated using the method described in Section II-A) for the high ILP version is 9.4µJ, while for the low ILP version it is 12.2µJ (30% more). The largest difference comes

1 for (i = 0; i <= 28; i++) {
2   tmp = 0.0;
3   for (i2 = 0; i2 < 32; i2++)
4     tmp += w[i2] * x[i + i2];
5   y[i] = tmp; }

Fig. 2. 2D FIR filter.

Fig. 3. High and low ILP schedule for the inner loop with optimal loop buffer clustering.

from the energy spent in the address translation (indirection) of the local controller (LC). The address translation table is accessed every cycle, and indicates with a zero or one whether a cluster needs to be activated. The LC of the high ILP version has three main energy advantages, namely: i) fewer local controllers: the optimal clustering of the high ILP version has 3 local controllers (LCs), while the other has 4; ii) smaller local controllers: the depth of the LC is only 3, instead of 7; iii) fewer accesses to the local controllers: each LC is only accessed 3 times per iteration of the high ILP loop (3 cycles), while for the low ILP loop this is 7 cycles. So if a transformation can increase the ILP of a loop that is assigned to the loop buffer, without causing too much overhead (additional operations), it will reduce energy in the instruction memory hierarchy. In the next three sections, more details of three ILP code transformations are provided, namely software pipelining, if-conversion and loop unrolling. The effect of the code transformations on MediaBench is studied and conclusions are drawn on whether and when each transformation should be applied, incorporating their mutual influence.

B. Software Pipelining

Software pipelining [8] is a loop scheduling technique that extracts loop parallelism by overlapping the execution of several consecutive iterations. Software pipelining is now considered the most effective technique to achieve high instruction level parallelism and is implemented in all advanced compilers targeting ILP-based processors [8].
Figure 4 shows that the effect of software pipelining in the compiler on MediaBench is very limited: for both energy and delay the change is less than 5%; for a processor with 8 FUs, the average parallelism for MediaBench is only a few instructions per cycle.

Fig. 4. Effect of software pipelining on energy and cycles. The Y-axis shows the percentage energy and cycle count gain of enabling software pipelining in the compiler.

The reason for this limited effect is two-fold. Firstly, modulo scheduling [14], the software pipelining technique used, can only be applied to inner loops that have no internal control flow. And secondly, very often the hardware has enough resources to execute more operations in parallel, but the loop does not contain enough operations. To solve these two problems, two transformations that enable efficient software pipelining have been added: hyperblock creation for the former problem and loop unrolling for the latter. The effect of these two transformations will be discussed in the next two subsections; only after that can the true benefit of software pipelining be evaluated.

C. If-Conversion or Hyperblock Creation

A common problem all global optimization and scheduling strategies must resolve is the conditional branches in the target application. Predicated execution is an efficient method to handle conditional branches. Predicated or guarded execution refers to the conditional execution of instructions based on the value of a boolean source operand, referred to as the predicate. When the predicate has value true, the instruction is executed normally; when the predicate has value false, the instruction is treated as a NOP. With predicated execution support provided in the architecture, the compiler can eliminate many of the conditional branches in an application.
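Predicated execution as described above can be sketched as follows. This is an interpreter-style illustration, not the actual compiler output; both function names are made up:

```python
# Sketch of predicated (guarded) execution: the branch-based form becomes
# a branch-free form in which each operation carries a predicate, and an
# operation whose predicate is false behaves as a NOP.

def branchy(a, b):
    # original control flow: a conditional branch selects one of two paths
    if a > b:
        return a - b
    else:
        return b - a

def predicated(a, b):
    p = a > b                 # predicate computed with a compare operation
    r1 = a - b                # executed under predicate p
    r2 = b - a                # executed under predicate !p
    return r1 if p else r2    # the nullified result is simply discarded

assert branchy(7, 3) == predicated(7, 3) == 4
assert branchy(3, 7) == predicated(3, 7) == 4
```

Both operations execute every time in the predicated form, which is exactly why predication increases the number of executed operations even as it removes branches.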
The process of eliminating conditional branches from a program using predication is referred to as if-conversion. The newly created block, consisting of the operations of multiple basic blocks, is called a hyperblock. To form hyperblocks, two features of each basic block in a region are examined: execution frequency and size. Execution frequency is used to exclude paths of control that are not executed often. Removing infrequently executed paths reduces scheduling constraints for the frequent paths. The second feature is basic block size. Larger basic blocks should be given less priority for inclusion than smaller blocks, since larger blocks utilize many machine resources and thus may reduce the performance of the control paths through smaller blocks. A heuristic function that considers the two issues is shown below:

BSV_bb = K * (weight_bb / weight_mainpath) * (size_mainpath / size_bb)

Fig. 5. Effect of different predication strategies on energy and delay (cycles). The values shown are an average over all applications of MediaBench.

The Block Selection Value (BSV [12]) is calculated for each basic block considered for inclusion in the hyperblock. The weight and size of each basic block are normalized against those of the main path. The main path is the most likely executed control path through the region of blocks considered for inclusion in the hyperblock. The hyperblock initially contains only blocks along the main path. The variable K is a machine dependent constant that represents the issue rate of the processor; in this case it is 8. Processors with more resources can execute more instructions concurrently, and are therefore likely to take advantage of larger hyperblocks. From an energy perspective, if-conversion should be avoided altogether, since it adds useless (i.e., nullified) operations. In the next subsection we will see that if-conversion can nevertheless be useful to enable modulo scheduling with no, or only a limited, increase in instruction memory energy. In high performance compilation, normally all forward branches are considered for predication, even those that are not in software-pipelined loops. But since predicating these kinds of branches does not have such a big influence on the ILP, and since predication always increases the number of executed operations, the effect on energy is generally negative. When predication as proposed in [12] is applied to the MediaBench suite, the number of cycles is reduced by 3% on average. However, the instruction memory energy increases, with peaks of 25% for some applications. We will refer to this predication strategy as all branches. By restricting the predication to inner loops, where it enables software pipelining, both the delay and the energy can be reduced.
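The block selection heuristic can be sketched as a small function. The exact form of the formula is reconstructed from the description above (weight and size normalized against the main path, scaled by the issue rate K); the numeric inputs are illustrative, not measured:

```python
# Sketch of the Block Selection Value (BSV) heuristic for hyperblock
# formation: frequently executed blocks score high, large blocks are
# penalized, and K models the processor's issue rate (8 here).

def bsv(weight_bb, size_bb, weight_mainpath, size_mainpath, K=8):
    # normalized weight rewards hot blocks; normalized size penalizes big ones
    return K * (weight_bb / weight_mainpath) * (size_mainpath / size_bb)

# A hot, small side block scores far higher than a cold, large one,
# so only the former would be merged into the hyperblock.
hot_small = bsv(weight_bb=90, size_bb=4, weight_mainpath=100, size_mainpath=10)
cold_large = bsv(weight_bb=10, size_bb=20, weight_mainpath=100, size_mainpath=10)
print(hot_small, cold_large)  # 18.0 0.4
```

Blocks whose BSV falls below a threshold are left out of the hyperblock, which keeps infrequently executed or oversized paths from constraining the schedule of the main path.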
Figure 5 shows three different strategies: i) the predication strategy using the BSV, not limited to inner loops (all branches), aimed only at improving performance; ii) predicating selected inner loops based on the same BSV as all branches (selected inner); and iii) predicating all inner loops (all inner). Since predicating all inner loops gives the best performance and the lowest energy consumption, we propose to always perform this type of predication in combination with a compiler that uses this type of software pipelining and a software controlled clustered loop buffer. Comparing Figure 6 with Figure 4 shows that enabling if-conversion for all inner loops helps significantly to increase performance and reduce energy consumption: on average a 55% reduction in energy consumption, and a reduction in cycles as well.

Fig. 6. Effect of software pipelining on energy and cycles (software pipelining disabled versus software pipelining enabled). If-conversion is applied to all inner loops. No loop unrolling is applied yet.

D. Loop Unrolling

Another problem with modulo scheduling is the high overhead of the prologue and epilogue of the software-pipelined loop if the number of iterations is low. Since modulo scheduling, as currently implemented in the compiler that was used, can only be applied to the innermost loop, the number of iterations of this loop will determine the efficiency of the modulo scheduling. For MediaBench, more than 60% of the time spent in loops is spent in loops with an iteration count between 8 and 16. We will see in the next subsection how loop unrolling can help to increase ILP and reduce energy for such loops. Loop unrolling is a technique that reduces the number of iterations by replicating the body of the loop.
A loop that has been unrolled can have more ILP if the extra operations in the body are partly independent of the already existing ones and there are enough resources to place these operations. In the context of modulo scheduling, the minimal initiation interval (MII) determines the performance of the loop. This MII is bounded by resources (ResII) and by cyclic dependencies (RecII). We have examined the distribution of software-pipelined loops in MediaBench with respect to resource limitations and recurrent dependency limitations. The conclusion was that in 87% of the cases, the loop's II is not only limited by the resource constraints, but also by the recurrent dependencies. However, in many of these cases, the RecII limitation can be broken when the loop is unrolled using variable renaming and tree height reduction. The FIR filter of Figure 2 is such a case. By unrolling the loop and renaming the accumulator tmp, lines 4 and 5 of the source code become tmp += w[i2]*x[i+i2]; tmp2 += w[i2+1]*x[i+i2+1]; and y[i] = tmp + tmp2; respectively. 1) Effect of Loop Unrolling, a Simple Example: Unrolling of the inner loop of the FIR filter of Figure 2 is examined here. The inner loop has 32 iterations, so maximally 32 copies of the loop are considered (unrolled 31 times, or an unroll amount equal to 32). It may sound counter-intuitive to apply a technique that reduces the number of iterations if this number is already too low, but it will be shown at the end of the section how this helps.
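The unroll-by-2 plus accumulator-renaming transformation described above can be illustrated directly. The sketch below transcribes the C fragment of Fig. 2 into Python (array sizes are illustrative); the point is that the two accumulators form independent dependence chains, which is what lowers RecII:

```python
# Sketch of unroll-by-2 with variable renaming applied to the FIR inner
# loop of Fig. 2. Renaming the accumulator (tmp, tmp2) breaks the single
# recurrence on tmp into two independent chains that modulo scheduling
# can overlap; the combine step y[i] = tmp + tmp2 is the added overhead.

def fir_original(w, x, n_out):
    y = [0.0] * n_out
    for i in range(n_out):
        tmp = 0.0
        for i2 in range(len(w)):           # 32-iteration inner loop
            tmp += w[i2] * x[i + i2]
        y[i] = tmp
    return y

def fir_unrolled2(w, x, n_out):
    y = [0.0] * n_out
    for i in range(n_out):
        tmp = tmp2 = 0.0                   # renamed accumulators
        for i2 in range(0, len(w), 2):     # body replicated once (unroll amount 2)
            tmp  += w[i2] * x[i + i2]
            tmp2 += w[i2 + 1] * x[i + i2 + 1]
        y[i] = tmp + tmp2                  # combine the partial sums
    return y

# Both versions compute the same filter output
w = [0.5] * 32
x = [float(i) for i in range(64)]
assert fir_original(w, x, 16) == fir_unrolled2(w, x, 16)
```

(Strictly, floating-point reassociation like this can change rounding; compilers only apply it when the semantics allow it or under relaxed floating-point modes.)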

Fig. 7. Effect of unrolling the inner loop of a simple FIR filter.

Figure 7 shows the effect of unrolling on cycle count and energy in the instruction memory hierarchy. Efficiency is defined as Eff_n = S_n / n, with n the unroll amount and S_n the speedup of the unrolled version with unroll amount n versus the original version. So, if a loop gets twice as fast by unrolling it once, the efficiency is 1.0. The efficiency will be lower if the speedup is lower. In this case, cycle count and efficiency are complementary graphs. Several observations can be made from the graphs:
- The energy of the instruction memory follows the curve of the cycle count. This agrees with results from previous studies: loop unrolling is only beneficial for energy consumption if it increases performance [5], [22], [20].
- When unrolled once (unroll amount 2), the cycle count goes up. This is because in this case, for each iteration of the loop, twice the number of results are generated (from the two copies of the loop body). These two sets of results need to be combined, causing extra overhead. For unroll amounts larger than 2, that overhead of combining the results is often negligible, given that there are still enough resources for the unrolled loop.
- If loops are unrolled more than once, the cycle count will decrease, although not linearly with the unroll amount, as can be seen in the efficiency graph.
- Because the loop has exactly 32 iterations (before unrolling), some peculiar behavior appears when unrolling it completely: suddenly the efficiency increases dramatically (an efficiency of 1.4 means a 45x speed-up). This is because of the interaction with modulo scheduling: if the loop is unrolled completely, the second innermost loop can be software-pipelined, adding additional speedup to the overall program.
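The efficiency metric defined above is simple enough to state in a couple of lines; the cycle counts below are made-up illustrative values, not measurements from the paper:

```python
# Sketch of the unrolling-efficiency metric Eff_n = S_n / n, with S_n the
# speedup of the version unrolled with unroll amount n over the original.

def efficiency(cycles_base, cycles_unrolled, n):
    speedup = cycles_base / cycles_unrolled   # S_n
    return speedup / n                        # Eff_n

# A loop that gets twice as fast with unroll amount 2 has efficiency 1.0;
# sub-linear gains give an efficiency below 1.0.
print(efficiency(1000, 500, 2))   # 1.0
print(efficiency(1000, 400, 4))   # 0.625
```

An efficiency of 1.0 thus means the speedup is exactly linear in the unroll amount, and a value above 1.0 signals an extra effect beyond the unrolling itself, such as the full-unroll interaction with modulo scheduling noted above.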
In the next sections it will be shown that, because in multimedia applications many inner loops have small iteration counts, unrolling them completely is usually the best option, both for energy and for performance. 2) Loop Unrolling for Multimedia Applications: The results of the previous section, combined with the low iteration counts of innermost loops, indicate that it would be very beneficial to apply full loop unrolling to a selected number of inner loops of MediaBench. Two applications have been chosen, namely a 3D image reconstruction algorithm (snake [13]) and an MPEG-2 Video Encoder. The most important inner loops of each application have been unrolled. The MPEG-2 benchmark contains two important loops, motion estimation (motion) and the forward discrete cosine transform (FDCT), that together account for 50% of the cycle count before unrolling. The snake benchmark has one inner loop accounting for 72% of the total number of cycles. Both for energy and delay, full unrolling is indeed the best for two of the three loops: the snake loop and the FDCT loop. This is because for these two loops the processor has enough free resources to successfully schedule the unrolled loop. For motion estimation, however, things are different: since this loop is more resource limited, full unrolling has a smaller performance benefit, and thus the negative impact on energy is relatively larger. After full unrolling, the cycle count of the inner loops of the two benchmarks is reduced to 35% and 65% respectively.

E. Combination of Unrolling, If-Conversion and Software Pipelining

For the MPEG-2 Video Encoder and for the snake benchmark, we have examined the mutual effect of full loop unrolling, if-conversion and software pipelining. Each of the three transformations can be enabled separately, resulting in 2^3 = 8 combinations. For all 8 combinations, the effect on energy and delay is shown in Figure 8 for the snake benchmark and for the MPEG-2 Video Encoder benchmark.
Several observations can be made. In all cases the impact on cycles is larger than the impact on energy. In particular, the impact on cycles is always positive; if the impact on energy is positive, it is smaller than the impact on cycles, and sometimes a quite large negative impact on energy can occur. All of these observations can be explained by the fact that the transformations were designed with increasing performance in mind, without any steering to reduce or contain the energy consumption. In future research, heuristics need to be developed that take into account both energy and delay. Sometimes one global optimum is present with respect to energy and delay, for example for the snake benchmark. We expect, however, that (in contrast with classical code transformations) ILP code transformations will often present an energy-delay trade-off. In the case of the MPEG-2 Video Encoder, a Pareto trade-off can be observed in the XY-chart. One of the challenges in efficiently exploring this trade-off is to correctly estimate the impact of the enabling transformations. Loop unrolling can be an enabling transformation: performing loop unrolling alone has a significant negative impact on energy, but combining loop unrolling with software pipelining yields a larger energy gain than software pipelining alone.

F. Loop Buffer versus Normal Instruction Cache

This section investigates whether the existing steering for normal instruction caches is good enough for loop buffers.

TABLE I. Energy due to different if-conversion strategies for an architecture with a cache and with a loop buffer, relative to no if-conversion. Average over all applications in MediaBench.

                  cache          loop buffer
  no if-conv.     100%           100%
  all branches    100%           —
  selected inner  97% (optimal)  93%
  all inner       —              83% (optimal)

TABLE II. Energy due to software pipelining for an architecture with a cache and with a loop buffer, relative to no software pipelining. Average over all applications in MediaBench.

          cache   loop buffer
  no swp  100%    100%
  swp     85%     45%

Fig. 8. Effect of loop unrolling (unroll), software pipelining (swp) and if-conversion (hb) on energy and delay (cycles) for the snake benchmark (top) and the MPEG-2 Video Encoder benchmark (bottom). Each transformation can be enabled (y) or disabled (n). The upper graph shows the effect of the enabled transformations on energy and delay in a bar chart; the lower graph shows the same information in an XY-chart. For snake only one Pareto point exists; for MPEG-2 a Pareto curve appears.

Table I shows the effect of the different if-conversion strategies on the energy consumption of an architecture with and without a loop buffer. The values are relative to no if-conversion. Two important messages can be observed. Firstly, if-conversion is more important for an architecture with a loop buffer, since it induces a larger energy reduction there. Secondly, the best strategy (in terms of energy) for a loop buffer based architecture is to convert all inner loops, while for a cache-based architecture it is to convert only selected inner loops. For software pipelining, the conclusion is similar, yet simpler (Table II): software pipelining reduces energy consumption in an architecture without a loop buffer, but even more so in an architecture with a loop buffer. Steering loop unrolling seems to be more tricky.
For the two examples we have compiled (snake and the MPEG-2 Video Encoder), the results already differ. For the snake application, the optimal unroll amount with respect to energy is higher for a loop buffered architecture than for an architecture with a cache, but for the MPEG-2 application it is lower. It is already

TABLE III. Energy due to loop unrolling for an architecture with a cache and with a loop buffer, for two benchmarks, relative to no unrolling.

  SNAKE
  unroll amount   cache   loop buffer
  2               —       102%
  4               79%     104%

  MPEG-2 Video Encoder
  unroll amount   cache   loop buffer
  4               93%     —
  8               77%     92%

clear that the optimal unrolling for the two architectures is different, but more research is needed to determine what this unroll amount actually should be.

IV. GUIDELINES

From the above observations we propose guidelines on how to apply the three transformations when a loop buffer is present. 1) Software pipelining should always be applied, since it is a relatively safe transformation and since, from Figure 8, it is clear that the other two transformations do not make sense unless software pipelining is applied. 2) We propose to do if-conversion on all inner loops, but only on inner loops. 3) Loop unrolling of relatively small inner loops can definitely be beneficial for the energy and/or performance of the loop. However, we cannot state that it is good in general to do full loop unrolling, since for some loops an energy-delay Pareto trade-off occurs between the different unroll amounts. As stated before, more research is needed to tackle this problem.

V. RELATED WORK

The transformations that are proposed are in themselves not new: they have been applied in the past in high performance compilers [1], but not in the context of optimization for low power. With respect to low energy, other researchers before us came to the conclusion that transformations that improve performance will also improve energy [5], [22], [20]. The main difference is that all those researchers looked at normal instruction caches, while we have clearly shown that the steering for loop buffers is different from the steering for instruction caches. Many researchers have explored the trade-off between performance, code size and instruction memory energy for processors with normal instruction caches. Liveris et al. [11] explore function inlining and code placement. Zambreno et al. [23] provide heuristics to explore this trade-off when performing loop unrolling and function inlining, and Su et al. [17] do the same for software pipelining.
To the best of our knowledge, code transformations for VLIW processors to reduce energy in the loop buffer are only proposed in [16]. In that paper, however, the authors assume a loop buffer without support for branches, and apply transformations to alleviate this handicap. Our loop buffer implementation does support branches, and thus does not need those transformations. Furthermore, the trade-off between energy and performance is not considered in that paper.

VI. CONCLUSIONS

We have shown that ILP code transformations, if applied carefully, reduce the instruction memory energy. The highest energy gain can be achieved if the system contains a software controlled clustered loop buffer as proposed in this paper. Three transformations (loop unrolling, if-conversion and software pipelining) have been examined, and guidelines on how to apply a combination of these three transformations have been proposed. Using these guidelines to steer the transformations, the energy and delay could be reduced by 2% and 78% respectively for an advanced 3D image reconstruction benchmark application. For an MPEG-2 Video Encoder benchmark, a Pareto trade-off between energy and delay has been presented. We have shown that for some transformations (unrolling and if-conversion) different heuristics are needed in the presence of a software controlled loop buffer than with a normal instruction cache. For if-conversion, we have proposed and implemented heuristics that work well with the loop buffer. For unrolling, more research is needed to come up with a cost function to steer this transformation and to implement it in the compiler. This will be done in future work.

REFERENCES

[1] D. F. Bacon, S. L. Graham, and O. J. Sharp. Compiler transformations for high-performance computing. ACM Computing Surveys, 26(4), 1994.
[2] L. Benini, D. Bruni, M. Chinosi, C. Silvano, V. Zaccaria, and R. Zafalon. A power modeling and estimation framework for VLIW-based embedded systems. In Proc. Int.
Workshop on Power and Timing Modeling, Optimization and Simulation (PATMOS), September 2001.
[3] L. Benini and G. De Micheli. System-level power optimization: Techniques and tools. ACM Transactions on Design Automation of Electronic Systems (TODAES), 5(2):115–192, April 2000.
[4] F. Catthoor, K. Danckaert, C. Kulkarni, E. Brockmeyer, P. G. Kjeldsberg, T. Van Achteren, and T. Omnes. Data Access and Storage Management for Embedded Programmable Processors. Kluwer Academic Publishers, March 2002.
[5] S. V. Gheorghita, H. Corporaal, and T. Basten. Using iterative compilation to reduce energy consumption. In Proc. of ASCI 2004, June 2004.
[6] M. F. Jacome and G. de Veciana. Design challenges for new application-specific processors. IEEE Design & Test of Computers, special issue on Design of Embedded Systems, April–June 2000.
[7] M. Jayapala, F. Barat, T. Vander Aa, F. Catthoor, H. Corporaal, and G. Deconinck. Clustered loop buffer organization for low energy VLIW embedded processors. IEEE Trans. on Computers, 54(6):672–683, June 2005.
[8] M. S. Lam. A retrospective: Software pipelining: An effective scheduling technique for VLIW machines. In 20 Years of PLDI (1979–1999): A Selection.
[9] C. Lee et al. MediaBench: A tool for evaluating and synthesizing multimedia and communications systems. In Proc. of the International Symposium on Microarchitecture, pages 330–335, 1997.
[10] L. H. Lee, W. Moyer, and J. Arends. Instruction fetch energy reduction using loop caches for embedded applications with small tight loops. In Proc. of the ACM/IEEE International Symposium on Low Power Electronics and Design (ISLPED), August 1999.
[11] N. Liveris, N. D. Zervas, D. Soudris, and C. E. Goutis. A code transformation-based methodology for improving I-cache performance of DSP applications. In Proc. of Design, Automation and Test in Europe (DATE), March.
[12] S. A. Mahlke, D. C. Lin, W. Y. Chen, R. E. Hank, and R. A. Bringmann. Effective compiler support for predicated execution using the hyperblock.
In Proc. of the 25th Annual International Symposium on Microarchitecture, 1992.
[13] M. Proesmans, L. J. V. Gool, and A. J. Oosterlinck. Active acquisition of 3D shape for moving objects. In Proc. of the International Conference on Image Processing (ICIP), 1996.
[14] B. R. Rau. Iterative modulo scheduling: An algorithm for software pipelining loops. In MICRO 27: Proc. of the 27th Annual International Symposium on Microarchitecture, pages 63–74. ACM Press, 1994.
[15] S. Rixner, W. Dally, B. Khailany, P. Mattson, U. Kapasi, and J. Owens. Register organization for media processing. In Proc. of HPCA 2000, January 2000.

[16] J. Sias, H. Hunter, and W. Hwu. Enhancing loop buffering of media and telecommunications applications using low-overhead predication. In Proc. of the 34th Annual International Symposium on Microarchitecture, December 2001.
[17] B. Su, J. Wang, R. Rabipour, E.-W. Hu, and J. Manzano. Loop optimization with tradeoff between cycle count and code size for DSP applications. In Proc. of EUSIPCO 2004, September 2004.
[18] Texas Instruments. C6000 Platform: DSP Selection Guide, No. SSDV004L.
[19] Trimaran group. Trimaran: An Infrastructure for Research in Instruction-Level Parallelism, 1999.
[20] M. Valluri and L. John. Is compiling for performance == compiling for power? In Proc. of the 5th Annual Workshop on Interaction between Compilers and Computer Architectures (INTERACT-5), 2001.
[21] T. Vander Aa, M. Jayapala, F. Barat, G. Deconinck, R. Lauwereins, F. Catthoor, and H. Corporaal. Instruction buffering exploration for low energy VLIWs with instruction clusters. In Proc. of ASPDAC 2004, January 2004.
[22] H. Yang, G. Gao, A. Marquez, G. Cai, and Z. Hu. Power and energy impact by loop transformations. In Proc. of COLP 2001, September 2001.
[23] J. Zambreno, M. Kandemir, and A. Choudhary. Enhancing compiler techniques for memory energy optimizations. In Proc. of EMSOFT 2002, October 2002.


Using Intel Streaming SIMD Extensions for 3D Geometry Processing Using Intel Streaming SIMD Extensions for 3D Geometry Processing Wan-Chun Ma, Chia-Lin Yang Dept. of Computer Science and Information Engineering National Taiwan University firebird@cmlab.csie.ntu.edu.tw,

More information

Using Cache Line Coloring to Perform Aggressive Procedure Inlining

Using Cache Line Coloring to Perform Aggressive Procedure Inlining Using Cache Line Coloring to Perform Aggressive Procedure Inlining Hakan Aydın David Kaeli Department of Electrical and Computer Engineering Northeastern University Boston, MA, 02115 {haydin,kaeli}@ece.neu.edu

More information

Evaluating Compiler Support for Complexity Effective Network Processing

Evaluating Compiler Support for Complexity Effective Network Processing Evaluating Compiler Support for Complexity Effective Network Processing Pradeep Rao and S.K. Nandy Computer Aided Design Laboratory. SERC, Indian Institute of Science. pradeep,nandy@cadl.iisc.ernet.in

More information

Performance Evaluation of VLIW and Superscalar Processors on DSP and Multimedia Workloads

Performance Evaluation of VLIW and Superscalar Processors on DSP and Multimedia Workloads Middle-East Journal of Scientific Research 22 (11): 1612-1617, 2014 ISSN 1990-9233 IDOSI Publications, 2014 DOI: 10.5829/idosi.mejsr.2014.22.11.21523 Performance Evaluation of VLIW and Superscalar Processors

More information

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University

Advanced d Instruction Level Parallelism. Computer Systems Laboratory Sungkyunkwan University Advanced d Instruction ti Level Parallelism Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu ILP Instruction-Level Parallelism (ILP) Pipelining:

More information

SAD implementation and optimization for H.264/AVC encoder on TMS320C64 DSP

SAD implementation and optimization for H.264/AVC encoder on TMS320C64 DSP SETIT 2007 4 th International Conference: Sciences of Electronic, Technologies of Information and Telecommunications March 25-29, 2007 TUNISIA SAD implementation and optimization for H.264/AVC encoder

More information

SPECULATIVE MULTITHREADED ARCHITECTURES

SPECULATIVE MULTITHREADED ARCHITECTURES 2 SPECULATIVE MULTITHREADED ARCHITECTURES In this Chapter, the execution model of the speculative multithreading paradigm is presented. This execution model is based on the identification of pairs of instructions

More information

Automatic Counterflow Pipeline Synthesis

Automatic Counterflow Pipeline Synthesis Automatic Counterflow Pipeline Synthesis Bruce R. Childers, Jack W. Davidson Computer Science Department University of Virginia Charlottesville, Virginia 22901 {brc2m, jwd}@cs.virginia.edu Abstract The

More information

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints

Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Improving Reconfiguration Speed for Dynamic Circuit Specialization using Placement Constraints Amit Kulkarni, Tom Davidson, Karel Heyse, and Dirk Stroobandt ELIS department, Computer Systems Lab, Ghent

More information

How Much Logic Should Go in an FPGA Logic Block?

How Much Logic Should Go in an FPGA Logic Block? How Much Logic Should Go in an FPGA Logic Block? Vaughn Betz and Jonathan Rose Department of Electrical and Computer Engineering, University of Toronto Toronto, Ontario, Canada M5S 3G4 {vaughn, jayar}@eecgutorontoca

More information

CS553 Lecture Profile-Guided Optimizations 3

CS553 Lecture Profile-Guided Optimizations 3 Profile-Guided Optimizations Last time Instruction scheduling Register renaming alanced Load Scheduling Loop unrolling Software pipelining Today More instruction scheduling Profiling Trace scheduling CS553

More information

Applications written for

Applications written for BY ARVIND KRISHNASWAMY AND RAJIV GUPTA MIXED-WIDTH INSTRUCTION SETS Encoding a program s computations to reduce memory and power consumption without sacrificing performance. Applications written for the

More information

If-Conversion SSA Framework and Transformations SSA 09

If-Conversion SSA Framework and Transformations SSA 09 If-Conversion SSA Framework and Transformations SSA 09 Christian Bruel 29 April 2009 Motivations Embedded VLIW processors have architectural constraints - No out of order support, no full predication,

More information

Design of Embedded DSP Processors Unit 2: Design basics. 9/11/2017 Unit 2 of TSEA H1 1

Design of Embedded DSP Processors Unit 2: Design basics. 9/11/2017 Unit 2 of TSEA H1 1 Design of Embedded DSP Processors Unit 2: Design basics 9/11/2017 Unit 2 of TSEA26-2017 H1 1 ASIP/ASIC design flow We need to have the flow in mind, so that we will know what we are talking about in later

More information

More on Conjunctive Selection Condition and Branch Prediction

More on Conjunctive Selection Condition and Branch Prediction More on Conjunctive Selection Condition and Branch Prediction CS764 Class Project - Fall Jichuan Chang and Nikhil Gupta {chang,nikhil}@cs.wisc.edu Abstract Traditionally, database applications have focused

More information

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 1, JANUARY /$ IEEE

IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 1, JANUARY /$ IEEE IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 17, NO. 1, JANUARY 2009 151 Transactions Briefs Interconnect Exploration for Energy Versus Performance Tradeoffs for Coarse Grained

More information

An Optimizing Compiler for the TMS320C25 DSP Chip

An Optimizing Compiler for the TMS320C25 DSP Chip An Optimizing Compiler for the TMS320C25 DSP Chip Wen-Yen Lin, Corinna G Lee, and Paul Chow Published in Proceedings of the 5th International Conference on Signal Processing Applications and Technology,

More information