A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors

Murali Jayapala 1, Francisco Barat 1, Pieter Op de Beeck 1, Francky Catthoor 2, Geert Deconinck 1 and Henk Corporaal 2,3

1 ESAT/ACCA, Kasteelpark Arenberg 10, K.U.Leuven, Heverlee, Belgium-3001 {mjayapal, fbaratqu, pieter, gdec}@esat.kuleuven.ac.be
2 IMEC vzw, Kapeldreef 75, Heverlee, Belgium-3001 {catthoor, heco}@imec.be
3 Department of Electrical Engineering, Eindhoven University of Technology (TUE), P.O. Box 513, 5600 MB Eindhoven, The Netherlands

Abstract. In current embedded processors for media applications, up to 30% of the total processor power is consumed in the instruction memory hierarchy. In this context, we present an inherently low energy clustered instruction memory hierarchy template. Small instruction memories are distributed over groups of functional units, and the interconnects are localized in order to minimize energy consumption. Furthermore, we present a simple profile-based algorithm to optimally synthesize the L0 clusters for a given application. Using a few representative multimedia benchmarks, we show that up to 45% of the L0 buffer energy can be reduced using our clustering approach.

1 Introduction

Many current embedded systems for multimedia applications, such as mobile and hand-held devices, are battery operated. Therefore, low energy is one of the key design goals of such systems. Typically, the core of such a system is a programmable processor, in some cases an application specific instruction set processor (ASIP). VLIW ASIPs in particular are known to be very effective in achieving high performance for our domain of interest [4]. However, power analysis of such processors indicates that a significant amount of power is consumed in the on-chip memories. For example, in the TMS320C6000, a VLIW processor from Texas Instruments, up to 30% of the total processor power is consumed in the instruction caches alone [6].
Hence, reducing the power consumption of the instruction memory hierarchy is important in reducing the overall power consumption of the system. To this end, we present a low energy clustered instruction memory hierarchy, as shown in Figure 1. Small instruction memories are distributed over groups of functional units, and the interconnects are localized in order to minimize energy consumption. Furthermore, we present a simple profile-based algorithm to optimally synthesize the L0 clusters for a given application.

(This work is supported in part by MESA under the MEDEA+ program.)

Fig. 1. Clustered instruction memory hierarchy: a VLIW ASIP with L1 clusters (each containing an L1 cache) and L0 clusters (each containing an L0 buffer feeding a group of FUs).

The rest of the paper is organized as follows. Section 2 describes the operation of the clustered architecture and how energy consumption can be reduced by clustering. Section 3 describes the profile-based algorithm to optimally synthesize the L0 clusters. Section 4 positions our work with respect to related work in the literature, and finally in Section 5 we present experimental results and an analysis of the clustering approach.

2 The Architecture Template

The fully clustered instruction memory hierarchy is shown in Figure 1. At level 1, the conventional instruction cache is partitioned to form the L1 clusters. At level 0, a special instruction buffer or cache is partitioned to form the L0 clusters. The level 0 buffers are typically small and are used during loop execution. Different loop buffer schemes, like the decoded instruction buffer scheme [1], the loop cache scheme [3], or the special filter cache scheme [2], can be adopted in the L0 clusters. Of these, for our simulations we specifically consider the decoded instruction buffer scheme [1, 8]. In essence, the loop buffer operation is as follows. During the first iteration of a loop, the instructions are distributed over the loop buffers. For the rest of the loop execution, the instructions are fetched from the L0 buffers instead of the instruction cache. During the execution of the non-loop parts of the code, instructions are fetched from the level 1 instruction caches. Typically, the levels of a memory hierarchy are distinguished by the access latency of the memory at each level. Here, however, we distinguish them by the physical proximity of the buffers: the level 0 buffers are placed closer to the functional units than the level 1 caches, while their access latency could still be the same as that of the level 1 instruction cache.
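The loop buffer operation described above can be sketched in a few lines of Python; the function names and the trace format below are illustrative, not taken from the paper:

```python
# Sketch of the decoded instruction buffer scheme: on the first loop
# iteration instructions are fetched from L1 and copied into the L0
# buffer; all later iterations read from the small L0 buffer instead.
def run_loop(loop_body, iterations, l1_fetch):
    """loop_body: list of instruction addresses; l1_fetch: addr -> decoded instr."""
    l0_buffer = {}                  # filled during iteration 0
    l1_accesses = l0_accesses = 0
    for it in range(iterations):
        for addr in loop_body:
            if it == 0:
                l0_buffer[addr] = l1_fetch(addr)   # fill L0 from L1
                l1_accesses += 1
            else:
                _ = l0_buffer[addr]                # hit in the L0 buffer
                l0_accesses += 1
    return l1_accesses, l0_accesses

# A loop of 8 instructions executed 100 times touches L1 only 8 times.
l1, l0 = run_loop(list(range(8)), 100, lambda a: ("decoded", a))
print(l1, l0)  # -> 8 792
```

This illustrates why loop buffers save energy: for loop-dominated media code, almost all instruction fetches are served by the small, cheap L0 memory.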

2.1 L1 Clusters

At the top level, the level 1 instruction cache is partitioned, and each partition is called an L1 cluster. Unlike a sub-banked cache, each partition is a cache in itself, and each can be direct-mapped, set-associative, or fully associative. The block size of each cache is proportional to the number of functional units in its L1 cluster; specifically, it is assumed to be a multiple of the number of issue slots in the L1 cluster multiplied by the operation width, i.e., block size = n × (#issue slots × operation width).

A VLIW instruction is composed of the operations for the functional units to be executed in one instruction cycle. Here, this instruction is further sub-divided into instruction bundles; each instruction bundle is the group of operations for the functional units in one L1 cluster. Each instruction bundle is in turn sub-divided into operation bundles; each operation bundle is the group of operations for the functional units in one L0 cluster. This categorization is shown in Figure 2. Furthermore, the instruction bundles are assumed to be of variable length (NOP-compressed).

Each L1 cluster has its own fetch, decode, and issue mechanisms. Since the instruction bundles are variable in length, we assume a scheme similar to the fetch, decode, and issue mechanisms of the Texas Instruments TMS320C6000 [7], applied to each L1 cluster separately. A fetch from the L1 cache retrieves an instruction packet, whose size is the number of issue slots in the L1 cluster multiplied by the width of an operation (1) of that L1 cluster. The fetch mechanisms of the L1 clusters operate asynchronously, while the issue mechanisms are synchronized every instruction cycle. A fetch from the L1 cache of one L1 cluster and a fetch from the L1 cache of another L1 cluster might therefore contain operations to be executed in different instruction cycles. However, we assume that the encoding scheme provides enough information to determine this difference and to issue operations in the correct instruction cycle.

Fig. 2. Instruction format describing the categorization: a VLIW instruction is the group of operations for the whole processor; an instruction bundle is the group of operations for an L1 cluster; an operation bundle is the group of operations for an L0 cluster.

(1) Here, the issue width of an L1 cluster is assumed to be equal to the number of functional units in that L1 cluster.
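As a small illustration of the block-size rule in Section 2.1, block size = n × (#issue slots × operation width); all the numbers below are hypothetical, not taken from the paper:

```python
# Block-size rule for an L1 cluster's cache: a multiple (n) of the
# cluster's issue slots times the operation width.
def l1_block_size_bits(n, issue_slots, op_width_bits=32):
    return n * issue_slots * op_width_bits

# e.g. an L1 cluster with 5 issue slots, 32-bit operations, n = 2:
print(l1_block_size_bits(2, 5))  # -> 320 (bits, i.e. 40 bytes)
```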

2.2 L0 Clusters

As shown in Figure 1, each L0 cluster has an L0 buffer. These buffers are used only during the execution of loops. Each L0 cluster has a local controller to generate addresses and to regulate the accesses to its loop buffer. Also, at the end of every iteration of a loop, the local controllers are synchronized through synchronization logic. We have presented the details of the local controller and the synchronization logic in our earlier work; for further details we refer the reader to [8].

Instruction Clusters and Datapath Clusters. An L0 cluster is basically an instruction cluster, and in principle an instruction cluster and a datapath cluster can be different. In a datapath cluster, as in some current VLIW processors like the TMS320C6000 [7], the functional units draw their data from a single register file. In an instruction cluster, on the other hand, the functional units draw their instructions from a single L0 buffer. Even though in both cases the main aim of partitioning is to reduce energy (power) consumption, the principle of partitioning is different [5, 9], and the two decisions can be taken independently. From the energy consumption perspective, an instruction cluster could include one or more datapath clusters. In most datapath organizations, one access to an instruction buffer is followed by three or more accesses to the register files; hence, the access rate to a register file is at least three times that of an instruction buffer. Also, the energy per access of a register file is higher than that of an instruction buffer of the same size, because register files are multi-ported. Hence, to minimize energy consumption, the datapath clusters should be smaller than the instruction clusters. It is, however, still possible for an instruction cluster and a datapath cluster to be equivalent (in terms of functional unit grouping).

2.3 Energy Reduction by Clustering

The architectural template presented in the previous sections is inherently low energy in two respects. Firstly, the energy consumption in the storage can be reduced by employing smaller, distributed memories, which consume less power. Secondly, the energy consumption in the interconnect (communication) can be reduced by localizing the possible instruction transfers. In a conventional approach, long and power-hungry interconnects are needed to deliver the instructions from a centralized storage to the functional units. In a distributed organization like that of Figure 1, such long interconnects can be avoided.

Energy Reduction in Storage by Clustering. Clustering the storage at the architectural level aids in reducing the energy consumption in two ways. Firstly, smaller and distributed memories can be employed. Secondly, at the architectural

level the access patterns to these memories can be analyzed, and the information gathered can be used to restrict the accesses to certain clusters. This principle holds for all levels of the instruction memory hierarchy. Analytically, the energy consumption of a centralized (non-clustered) organization can be written as

E_centralized = N_access × E_per_acc,

where E_centralized represents the energy consumption of a centralized buffer, and that of a clustered organization as

E_clustered = Σ_{i=1}^{N_CLUSTERS} N_access,i × E_per_acc,i,

where E_clustered represents the energy consumption of the clustered organization. By partitioning a centralized memory, E_per_acc,i < E_per_acc, i.e., each partition has a smaller energy per access than the centralized memory. If, in addition, the accesses to each partition can be restricted such that N_access,i < N_access, then clearly E_clustered < E_centralized. However, even in cases where N_access,i > N_access, accessing a smaller memory (E_per_acc,i < E_per_acc) can still pay off in reducing energy. The typical variation of E_clustered with the number of clusters is shown in Figure 3: for a certain combination of N_access,i, E_per_acc,i and N_CLUSTERS, the energy consumption is maximally reduced relative to a centralized organization. The following section describes a clustering scheme which, given an instruction profile of an application, produces a clustering that achieves this maximal reduction in energy.

Fig. 3. Typical variation of energy consumption in storage with the number of clusters: the energy drops below the centralized value, reaches an optimum, and then rises again.

3 Profile Based Clustering of L0 Buffers

The architecture template has many parameters to be explored, both in the L0 and in the L1 clusters: for instance, for the L1 clusters, the type of each L1 cache (direct-mapped or set-associative), the number of sets, and the block size; similarly, for the L0 clusters, the size of each L0 buffer, etc.
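The storage-energy comparison of Section 2.3 can be made concrete with a small numerical sketch; all access counts and per-access energies below are hypothetical, not measured values from the paper:

```python
# E_centralized = N_access * E_per_acc
def e_centralized(n_access, e_per_acc):
    return n_access * e_per_acc

# E_clustered = sum over clusters of N_access_i * E_per_acc_i
def e_clustered(accesses, e_per_accs):
    return sum(n * e for n, e in zip(accesses, e_per_accs))

# Centralized buffer: 1000 accesses at 1.0 (arbitrary energy units).
ec = e_centralized(1000, 1.0)
# Four partitions: each smaller memory costs less per access (0.4 here).
# Even though the total access count (1600) exceeds the centralized one
# (1000), the clustered organization still wins: ~640 vs 1000 units.
el = e_clustered([600, 500, 300, 200], [0.4] * 4)
print(ec, el)
```

This mirrors the observation in the text that a clustered organization can reduce energy even when N_access,i exceeds N_access, provided the per-access energy drops enough.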
Since the L0 buffers are accessed exclusively from the L1 caches (except during the initiation

phase), the synthesis of the L0 and L1 clusters can be decoupled. The generic approach that we follow to form the L0 clusters is shown in Figure 4: given an instruction profile of the loop executions, the functional units and the L0 buffers are grouped to form the L0 clusters.

Fig. 4. Profile based L0 clustering approach: the loop instruction profile (a 0/1 functional unit activation trace) and a parameterized energy model are fed to the L0 clustering step, which produces the L0 clusters.

3.1 Instruction Profile

The basic observation which aids clustering is that typically, in an instruction cycle, not all functional units are active. Furthermore, the functional units active in different instruction cycles may or may not be the same. Since the active functional units need to be supplied with instructions, there is a direct correlation between the access pattern to the instruction memory and the functional unit activation. For a given application, an instruction profile can be generated which records the functional unit activation in every instruction cycle. Because of this correlation, the profile can be used to form clusters. To remove some effects of data dependency, an average profile can be generated over multiple runs of the application with different input data. Since the L0 buffers are accessed only during the execution of loops, the profile information corresponding to loop execution is used for generating the L0 clusters.

3.2 L0 Clustering

Here, given an instruction profile (the number of functional units and their activation trace) and the size of a centralized L0 buffer (number of words and width), the L0 buffer is partitioned and the functional units are grouped into clusters so as to minimize energy consumption. This problem is formulated as a 0-1 assignment optimization problem, as follows.
minimize L0Cost(L0Clust, Profile_loops)

subject to Σ_{i=1}^{N_maxclust} L0Clust_ij = 1, for all j,

where

L0Clust_ij = 1 if the j-th functional unit is assigned to cluster i, and 0 otherwise;
N_FU is the total number of functional units;
N_maxclust is the maximum number of feasible clusters, equal to N_FU (at most, each functional unit can be its own L0 cluster).

L0Cost(L0Clust, Profile_loops) represents the energy consumption in the L0 buffers for any valid clustering. As a result of this optimization, the L0 buffer is partitioned and the functional units are clustered, as represented by the matrix L0Clust_ij.

4 Related Work

In the literature, the problem of reducing power in the instruction memory hierarchy has been addressed at various abstraction levels of the processor; an overview of the different optimizations can be found in [14]. We believe, however, that power can be reduced to a large extent by optimizing at the higher levels of abstraction first, and then optimizing further at the lower levels. The notion of partitioning or sub-banking a memory block at the logic level is well known [13]. In our scheme, however, the partitioning is done at the architectural level, and sub-banking can still be applied within each of the clusters. Other architectural-level clustering schemes, like those in [10, 11], are essentially different from ours. Firstly, those schemes concentrate specifically on the level 1 caches, while we consider clustering the level 0 buffers and the interconnect in addition to the level 1 caches. Secondly, they do not consider grouping the functional units into clusters as we do. Furthermore, our approach to clustering is at a higher level than these schemes, and in principle they can still be applied to the level 1 caches inside the L1 clusters of our architecture (Figure 1).

5 Experimental Results

We performed simulations using some of the benchmarks from Mediabench; their relevant characteristics are shown in Table 1. These benchmarks were compiled onto a non-clustered VLIW processor with ten functional units, using the compiler of the Trimaran tool suite [12].
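A minimal sketch of the profile-based L0 clustering of Section 3.2 is given below, implemented as an exhaustive search over functional-unit partitions (feasible for small FU counts) rather than the paper's 0-1 assignment formulation; the activation profile and the per-access energy model are hypothetical placeholders:

```python
def partitions(items):
    """Yield all set partitions of a list (Bell-number many, so keep items small)."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i, cluster in enumerate(part):
            yield part[:i] + [cluster + [first]] + part[i + 1:]
        yield [[first]] + part

def l0_cost(clustering, profile, words=64, op_width=32):
    """Energy of one clustering: a cluster's buffer is accessed in a cycle
    iff any of its FUs is active; per-access energy grows with buffer size
    (a toy stand-in for the Wattch-style model used in the paper)."""
    cost = 0.0
    for cluster in clustering:
        width = len(cluster) * op_width
        e_per_acc = 0.1 + 0.001 * words * width        # toy energy model
        n_access = sum(1 for cyc in profile if any(cyc[f] for f in cluster))
        cost += n_access * e_per_acc
    return cost

# Toy activation profile (one tuple per cycle, one 0/1 flag per FU):
# FUs 0-1 and FUs 2-3 tend to be active together.
profile = [(1, 1, 0, 0), (1, 1, 0, 0), (0, 0, 1, 1), (1, 1, 1, 1)]
best = min(partitions([0, 1, 2, 3]), key=lambda c: l0_cost(c, profile))
print(sorted(map(sorted, best)))  # -> [[0, 1], [2, 3]]
```

As expected, the search groups the co-active FUs together, since merging FUs that are always active in the same cycles shares the buffer's access cost without adding accesses.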
Commonly used transformations like predication and software pipelining were applied to increase the instruction-level parallelism. The profile described in Section 3.1 was generated using the instruction set simulator of the Trimaran tool suite. Of all the possibilities in the instruction memory hierarchy template, we currently assume a single L1 cluster, since our main goal is to form the L0 clusters. Furthermore, the L0 buffers were assumed to be SRAM-based, and the number of words in each L0 buffer is assumed to be given. We have explored only the grouping of functional units and the partitioning of the L0 buffers into L0 clusters.

Benchmark      avg ILP   %exec time (loops)   ILP (loops)   %exec time (non-loops)   ILP (non-loops)
Adpcm          5.6       95                   5.6           5                        2.7
Jpeg Decode    3.0       30                   2.9           70                       3.0
IDCT           3.9       50                   5.6           50                       2.4
Mpeg2 Decode   3.3       30                   5.6           70                       2.3

Table 1. Benchmark characteristics

The profile corresponding to the loop execution of each benchmark was passed through the L0 clustering stage. Here, we assumed a centralized L0 buffer 64 words (2) deep and 32×10 bits wide (operation width × #issue slots). The energy consumption in the L0 clusters was calculated using the formula described in Section 2.3. The energy per access of these buffers was obtained from the parameterized energy models of Wattch [15] (by modeling them as simple array elements), and the number of accesses to each cluster was calculated using the instruction profile.

Benchmark      N_L0Clust   FU grouping                          L0 buffer sizes (bits)                                        E_redn
Adpcm          5           (1,2,3,4,5) (6,7) (8) (9) (10)       64×(5×32), 64×(2×32), 64×32, 64×32, 64×32                     30%
Jpeg Decode    7           (1,2) (3,4) (5) (6,7) (8) (9) (10)   64×(2×32), 64×(2×32), 64×32, 64×(2×32), 64×32, 64×32, 64×32   44%
IDCT           4           (1,2,3,4,5) (6,7) (8,9) (10)         64×(5×32), 64×(2×32), 64×(2×32), 64×32                        25%
Mpeg2 Decode   4           (1,2,3,4,5,8) (6,7) (9) (10)         64×(6×32), 64×(2×32), 64×32, 64×32                            22%
Centralized    1           (1,2,3,4,5,6,7,8,9,10)               64×(10×32) = 20 Kbit                                          -

Table 2. The L0 clustering results

The variation of the L0 buffer energy with the number of clusters is shown in Figure 5. With an increasing number of clusters, the energy consumption in the L0 buffers first drops to an optimal value and then increases again with a further increase in the number of clusters. The resulting L0 clusters at the optimal energy consumption are shown in Table 2; the resulting clusters are specific to each benchmark. Furthermore, the energy reduction in the L0 buffer energy over a single centralized L0 buffer organization, for each benchmark, is also shown in Table 2.
Clearly, the results indicate that by clustering the L0 buffers, instead of partitioning them arbitrarily, up to 45% of the L0 buffer energy can be reduced, and that the profile-based clustering can be used to synthesize the clusters automatically. Note that we have not considered the interconnect energy in our energy costs. We believe that

(2) It was observed that the number of instructions within the loop bodies was less than 64.

if that were to be considered in the energy equations, the reduction would be even more significant. However, we leave such an effort to our future work.

5.1 Discussion: L1 Clustering

As an interesting exercise, we formulated the L1 clustering similarly to the L0 clustering shown in Figure 4. Here, the problem is to group L0 clusters and partition the level 1 cache into L1 clusters, instead of grouping functional units and partitioning the L0 buffers into L0 clusters. Since the accesses to the level 1 caches occur during the execution of the non-loop parts of the code, the corresponding instruction profile was used. Furthermore, the centralized cache was assumed to be a 20KB direct-mapped cache with 256 words and a block size of 80 bytes (due to the restriction in the architecture, Section 2). The energy cost of any valid clustering was obtained using the formula described in Section 2.3. The energy per access of the caches was obtained from Cacti [16], and the number of accesses to each cache (assuming variable length encoding) was calculated using the instruction profile.

Fig. 5. Variation of estimated energy consumption (normalized) with the number of L0 clusters and with the number of L1 clusters, for ADPCM, JPEG DEC, IDCT and MPEG2 DEC, relative to the centralized L0 buffer and L1 cache energy, respectively.

The variation of the L1 cache energy with the number of clusters is shown in Figure 5. As with the L0 clustering, the energy consumption drops to a certain optimal energy and then increases with the number of clusters. However, unlike the L0 clustering, the maximum reduction in energy with L1 clustering was marginal, about 2%. The reduction is marginal mainly because of the overhead of the tags in each cluster.
In our experiments, the centralized organization was assumed to be a direct-mapped cache. By partitioning this cache, we essentially replace it with several smaller direct-mapped caches: even though the effective storage size stays the same, the total number of tag bits increases. If this tag overhead can be avoided, we believe the energy reduction could be more significant. Some authors have proposed ways to design caches with less tag overhead [17]; we intend to evaluate this in our future work.
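The tag overhead of partitioning can be illustrated with standard direct-mapped cache arithmetic; for simplicity the sketch uses power-of-two sizes (16KB, 64-byte blocks, 32-bit addresses) rather than the paper's 20KB/80-byte configuration:

```python
from math import log2

# For a direct-mapped cache, tag bits per block = addr_bits - log2(cache_bytes)
# (index + offset together address the whole cache).
def total_tag_bits(cache_bytes, block_bytes, addr_bits=32):
    blocks = cache_bytes // block_bytes
    tag = addr_bits - int(log2(cache_bytes))
    return blocks * tag

central = total_tag_bits(16 * 1024, 64)        # one 16KB cache
clustered = 4 * total_tag_bits(4 * 1024, 64)   # four 4KB caches, same total storage
print(central, clustered)  # -> 4608 5120
```

Splitting the cache into four partitions adds log2(4) = 2 tag bits per block, which is exactly the kind of fixed overhead that erodes the energy gains of L1 clustering.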

6 Summary

In summary, we have presented a low energy clustered instruction memory hierarchy for long instruction word processors, in which small instruction memories are distributed over groups of functional units and the interconnects are localized in order to minimize energy consumption. Furthermore, we presented a simple profile-based algorithm to optimally synthesize the L0 clusters for a given application. As shown in our experimental results, by taking into account the access patterns to the instruction memory, visible only at the architectural level, we can achieve a significant reduction in energy.

References

1. R. S. Bajwa, et al., Instruction Buffering to Reduce Power in Processors for Signal Processing, IEEE Transactions on VLSI Systems, vol. 5, no. 4, Dec 1997.
2. N. Bellas, et al., Architectural and Compiler Support for Energy Reduction in the Memory Hierarchy of High Performance Microprocessors, ISLPED 1998.
3. L. H. Lee, et al., Instruction Fetch Energy Reduction Using Loop Caches for Applications with Small Tight Loops, ISLPED 1999.
4. M. Jacome, et al., Design Challenges for New Application Specific Processors, Special Issue on System Design of Embedded Systems, IEEE Design & Test of Computers, April-June 2000.
5. M. Jacome, et al., Exploring Performance Tradeoffs for Clustered VLIW ASIPs, ICCAD, November 2000.
6. Texas Instruments Inc., Technical Report, TMS320C6000 Power Consumption Summary, http://www.ti.com
7. Texas Instruments Inc., TMS320C6000 CPU and Instruction Set Reference Guide, http://www.ti.com
8. M. Jayapala, et al., Loop Cache (Buffer) Organization: Energy Analysis and Partitioning, Technical Report, K.U.Leuven/ESAT, 22 Jan 2002.
9. A. Wolfe, et al., Datapath Design for a VLIW Video Signal Processor, IEEE Symposium on High-Performance Computer Architecture (HPCA '97).
10. S. Kim, et al., Power-aware Partitioned Cache Architectures, ISLPED 2001.
11. M. Huang, et al., L1 Data Cache Decomposition for Energy Efficiency, ISLPED 2001.
12. Trimaran, An Infrastructure for Research in Instruction-Level Parallelism, 1999. http://www.trimaran.org
13. Ching-Long Su, et al., Cache Design Trade-offs for Power and Performance Optimization: A Case Study, ISLPED 1995.
14. L. Nachtergaele, V. Tiwari and N. Dutt, System and Architectural-Level Power Reduction of Microprocessor-based Communication and Multimedia Applications, ICCAD 2000.
15. D. Brooks, et al., Wattch: A Framework for Architectural-Level Power Analysis and Optimizations, ISCA 2000.
16. S. J. E. Wilton and N. P. Jouppi, CACTI: An Enhanced Cache Access and Cycle Time Model, IEEE Journal of Solid-State Circuits, 31(5):677-688, 1996.
17. P. Petrov and A. Orailoglu, Power Efficient Embedded Processor IPs through Application-Specific Tag Compression in Data Caches, Design and Test in Europe Conference (DATE), April 2002.