A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors


Murali Jayapala (1), Francisco Barat (1), Pieter Op de Beeck (1), Francky Catthoor (2), Geert Deconinck (1) and Henk Corporaal (2,3)

(1) ESAT/ACCA, Kasteelpark Arenberg 10, K.U.Leuven, Heverlee, Belgium-3001 {mjayapal, fbaratqu, pieter, gdec}@esat.kuleuven.ac.be
(2) IMEC vzw, Kapeldreef 75, Heverlee, Belgium-3001 {catthoor, heco}@imec.be
(3) Department of Electrical Engineering, Eindhoven University of Technology (TUE), P.O. Box 513, 5600 MB Eindhoven, The Netherlands

Abstract. In current embedded processors for media applications, up to 30% of the total processor power is consumed in the instruction memory hierarchy. In this context, we present an inherently low energy clustered instruction memory hierarchy template, in which small instruction memories are distributed over groups of functional units and the interconnects are localized in order to minimize energy consumption. Furthermore, we present a simple profile based algorithm to optimally synthesize the L0 clusters for a given application. Using a few representative multimedia benchmarks, we show that up to 45% of the L0 buffer energy can be saved using our clustering approach.

1 Introduction

Many of the current embedded systems for multimedia applications, like mobile and hand-held devices, are battery operated; low energy is therefore one of the key design goals of such systems. The core of such a system is typically a programmable processor, in some cases an application specific instruction set processor (ASIP). VLIW ASIPs in particular are known to be very effective in achieving high performance for our domain of interest [4]. However, power analysis of such processors indicates that a significant amount of power is consumed in the on-chip memories. For example, in the TMS320C6000, a VLIW processor from Texas Instruments, up to 30% of the total processor power is consumed in the instruction caches alone [6].
Hence, reducing the power consumption of the instruction memory hierarchy is important in reducing the overall power consumption of the system. To this end we present a low energy clustered instruction memory hierarchy, shown in Figure 1: small instruction memories are distributed over groups of functional units and the interconnects are localized in order to minimize energy consumption. Furthermore, we present a simple profile based algorithm to optimally synthesize the L0 clusters for a given application.

This work is supported in part by MESA under the MEDEA+ program.

Fig. 1. Clustered instruction memory hierarchy (a VLIW ASIP with L1 clusters, each containing an L1 cache and one or more L0 clusters; each L0 cluster pairs an L0 buffer with a group of functional units)

The rest of the paper is organized as follows. Section 2 describes the operation of the clustered architecture and how energy consumption can be reduced by clustering. Section 3 describes the profile based algorithm to optimally synthesize the L0 clusters. Section 4 positions our work with respect to related work in the literature, and finally in Section 5 we present experimental results and an analysis of the clustering approach.

2 The Architecture Template

The fully clustered instruction memory hierarchy is shown in Figure 1. At level 1, the conventional instruction cache is partitioned to form the L1 clusters. At level 0, a special instruction buffer or cache is partitioned to form the L0 clusters. The level 0 buffers are typically small and are used during loop execution. Different loop buffer schemes, like the decoded instruction buffer [1], the loop cache [3], or the special filter cache [2], can be adopted in the L0 clusters. Of these, for our simulations we specifically consider the decoded instruction buffer scheme [1, 8]. In essence, the loop buffer operation is as follows. During the first iteration of a loop, the instructions are distributed over the loop buffers; for the rest of the loop execution the instructions are fetched from the L0 buffers instead of the instruction cache. During the execution of non-loop parts of the code, instructions are fetched from the level 1 instruction caches. Typically, the levels of a memory hierarchy are distinguished by their access latencies. We distinguish them instead by physical proximity: level 0 buffers are placed closer to the functional units than the level 1 caches, while their access latency could still be the same as that of the level 1 instruction cache.
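The loop buffer operation described above can be sketched as a toy simulation. This is an illustration only; the class, its fields and the addresses are invented here and are not taken from the decoded instruction buffer scheme of [1, 8]:

```python
# Toy model of a loop buffer: the first iteration of a loop fills the
# L0 buffer from the L1 cache, later iterations are served from L0,
# and non-loop code always goes to L1.

class LoopBufferSim:
    def __init__(self):
        self.l0 = {}             # pc -> (decoded) instruction
        self.l1_accesses = 0
        self.l0_accesses = 0

    def fetch(self, pc, in_loop):
        if not in_loop:
            self.l1_accesses += 1        # non-loop code: always L1
            return f"insn@{pc}"
        if pc not in self.l0:            # first iteration: fill L0 from L1
            self.l1_accesses += 1
            self.l0[pc] = f"insn@{pc}"
        else:                            # later iterations: hit L0
            self.l0_accesses += 1
        return self.l0[pc]

sim = LoopBufferSim()
loop_body = [0x10, 0x14, 0x18]           # a 3-instruction loop body
for _ in range(10):                      # executed for 10 iterations
    for pc in loop_body:
        sim.fetch(pc, in_loop=True)

print(sim.l1_accesses, sim.l0_accesses)  # prints: 3 27
```

Of the 30 fetches, only the 3 fill accesses reach the L1 cache; the remaining 27 are served by the (cheaper) L0 buffer, which is the source of the energy savings.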

2.1 L1 Clusters

At the top level, the level 1 instruction cache is partitioned and each partition is called an L1 cluster. Unlike a sub-banked cache, each partition is a cache in itself, and can be direct-mapped, set-associative or fully associative. The block size of each cache is proportional to the number of functional units in its L1 cluster; specifically, it is assumed to be a multiple of the number of issue slots in the L1 cluster multiplied by the operation width, i.e., block size = n × (#issue slots × operation width).

A VLIW instruction is composed of the operations to be executed by the functional units in one instruction cycle. This instruction is further sub-divided into instruction bundles, where each instruction bundle is the group of operations for the functional units in an L1 cluster. Each instruction bundle is in turn sub-divided into operation bundles, where each operation bundle is the group of operations for the functional units in an L0 cluster. This categorization is shown in Figure 2. The instruction bundles are assumed to be of variable length (NOP compressed).

Each L1 cluster has separate fetch, decode and issue mechanisms. Since the instruction bundles are variable in length, we assume a scheme similar to the fetch, decode and issue mechanisms of the Texas Instruments TMS320C6000 [7], applied to each L1 cluster separately. A fetch from the L1 cache returns an instruction packet, whose size is the number of issue slots in the L1 cluster multiplied by the operation width (1). The fetch mechanisms of the different L1 clusters operate asynchronously, while the issue mechanisms are synchronized every instruction cycle. Consequently, fetches from the L1 caches of two different L1 clusters might contain operations to be executed in different instruction cycles. However, we assume that the encoding scheme provides enough information to detect this and to issue the operations in the correct instruction cycle.

VLIW Instruction: group of operations for the whole processor
Instruction Bundle: group of operations for an L1 cluster
Operation Bundle: group of operations for an L0 cluster
Fig. 2. Instruction format describing the categorization

(1) Here, the issue width of an L1 cluster is assumed to be equal to the number of functional units in that L1 cluster.
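The block size rule above can be made concrete with a small calculation. The helper name and the example configuration are hypothetical; the 10-slot, 32-bit-operation numbers merely mirror the configuration used later in the experiments:

```python
# Block size of an L1 cluster cache, per the rule
#   block size = n * (#issue slots in the L1 cluster * operation width)

def l1_block_size_bits(n, issue_slots, op_width_bits=32):
    """Block size of an L1 cluster cache, in bits."""
    return n * issue_slots * op_width_bits

# One L1 cluster covering all 10 issue slots, with n = 2:
bits = l1_block_size_bits(2, 10)
print(bits, bits // 8)   # prints: 640 80  (i.e., an 80-byte block)
```

With a single L1 cluster of ten 32-bit issue slots, n = 2 yields an 80-byte block, which matches the cache configuration assumed in Section 5.1.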

2.2 L0 Clusters

As shown in Figure 1, each L0 cluster has an L0 buffer. These buffers are used only during the execution of loops. Each L0 cluster has a local controller to generate addresses and to regulate the accesses to its loop buffer. At the end of every iteration of a loop, the local controllers are synchronized through a synchronization logic. The details of the local controller and the synchronization logic are presented in our earlier work; we refer the reader to [8].

Instruction Clusters and Datapath Clusters. An L0 cluster is basically an instruction cluster, and in principle an instruction cluster and a datapath cluster can be different. In a datapath cluster, as seen in current VLIW processors like the TMS320C6000 [7], the functional units draw their data from a single register file; in an instruction cluster, the functional units draw their instructions from a single L0 buffer. Even though in both cases the main aim of partitioning is to reduce energy (power) consumption, the principle of partitioning is different [5, 9], and the decisions can be taken independently. From the energy consumption perspective, an instruction cluster could include one or more datapath clusters. In most datapath organizations, one access to an instruction buffer is followed by three or more accesses to the register files, so the access rate of a register file is at least three times that of an instruction buffer. Also, the energy per access of a register file is higher than that of an instruction buffer of the same size, because register files are multi-ported. Hence, to minimize energy consumption, the datapath clusters should be smaller than the instruction clusters. However, it is still possible for an instruction cluster and a datapath cluster to be equivalent (in terms of functional unit grouping).
2.3 Energy Reduction by Clustering

The architecture template presented in the previous sections is inherently low energy in two respects. Firstly, the energy consumed in storage can be reduced by employing smaller, distributed memories, which consume less power per access. Secondly, the energy consumed in the interconnect (communication) can be reduced by localizing the instruction transfers. In a conventional organization, long, power-hungry interconnects are needed to deliver the instructions from a centralized storage to the functional units; in a distributed organization like that of Figure 1, such long interconnects can be avoided.

Energy Reduction in Storage by Clustering. Clustering the storage at the architectural level reduces the energy consumption in two ways. Firstly, smaller, distributed memories can be employed. Secondly, at the architectural

level, the access patterns to these memories can be analyzed, and the gathered information can be used to restrict the accesses to certain clusters. This principle holds for all the levels of the instruction memory hierarchy. Analytically, the energy consumption of a centralized (non-clustered) organization can be written as

E_centralized = N_access × E_per_access

and that of a clustered organization as

E_clustered = Σ_{i=1..N_CLUSTERS} N_access_i × E_per_access_i

where N_access_i and E_per_access_i are the number of accesses to, and the energy per access of, partition i. By partitioning a centralized memory we get E_per_access_i < E_per_access, i.e., each partition has a smaller energy per access than the centralized memory. If, in addition, the accesses can be restricted such that N_access_i < N_access, then clearly E_clustered < E_centralized. However, even if N_access_i > N_access, accessing a smaller memory (E_per_access_i < E_per_access) sometimes still pays off in energy. The typical variation of E_clustered with the number of clusters is shown in Figure 3: for a certain combination of N_access_i, E_per_access_i and N_CLUSTERS, the energy consumption is maximally reduced over a centralized organization. The following section describes a scheme that, given the instruction profile of an application, derives a clustering achieving this maximal reduction.

Fig. 3. Typical variation of energy consumption in storage with number of clusters

3 Profile Based Clustering of L0 Buffers

The architecture template has many parameters to be explored, both in the L0 and the L1 clusters: for the L1 clusters, the type of each L1 cache (direct-mapped/set-associative), the number of sets and the block size of each L1 cache, etc.; for the L0 clusters, the size of each L0 buffer, etc.
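As a numeric illustration of the storage energy model of Section 2.3 (the access counts and per-access energies below are made-up values in arbitrary units, not measurements from the paper):

```python
# E_centralized = N_access * E_per_access
# E_clustered   = sum_i N_access_i * E_per_access_i

def e_centralized(n_access, e_per_acc):
    return n_access * e_per_acc

def e_clustered(accesses, e_per_acc):
    # accesses[i]: accesses to cluster i; e_per_acc[i]: its energy/access
    return sum(n * e for n, e in zip(accesses, e_per_acc))

# Centralized buffer: 1000 accesses at 5 energy units each.
ec = e_centralized(1000, 5)

# Two clusters: the accesses overlap (600 + 500 = 1100 > 1000, i.e. some
# instructions go to both partitions), but each smaller partition costs
# only 2 units per access.
el = e_clustered([600, 500], [2, 2])

print(ec, el)   # prints: 5000 2200
```

Even though the clustered organization performs more accesses in total, the lower energy per access of the smaller partitions makes E_clustered < E_centralized, exactly the case argued above.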
Since the L0 buffers are accessed exclusively from the L1 caches (except during the initiation

phase), the synthesis of the L0 and the L1 clusters can be decoupled. The generic approach that we follow to form the L0 clusters is shown in Figure 4: given an instruction profile of the execution of the loops, the functional units and the L0 buffers are grouped to form the L0 clusters.

Fig. 4. Profile based L0 clustering approach (a loop profile and a parameterized energy model drive the L0 clustering step, which produces the L0 clusters)

3.1 Instruction Profile

The basic observation which aids clustering is that, typically, not all the functional units are active in an instruction cycle. Furthermore, the sets of functional units active in different instruction cycles may or may not be the same. Since the active functional units need to be delivered with instructions, there is a direct correlation between functional unit activation and the access pattern to the instruction memory. For a given application, an instruction profile can be generated which records the functional unit activation in every instruction cycle; because of the above correlation, this profile can be used to form the clusters. To remove some effects of data dependency, an average profile can be generated over multiple runs of the application with different input data. Since the L0 buffers are accessed only during the execution of loops, the profile information corresponding to loop execution is used for generating the L0 clusters.

3.2 L0 Clustering

Here, given an instruction profile (the number of functional units and their activation trace) and the centralized L0 buffer's size (number of words and width), the L0 buffer is partitioned and the functional units are grouped into clusters so as to minimize energy consumption. This is formulated as a 0-1 assignment optimization problem:

minimize L0Cost(L0Clust, Profile_loops)

subject to Σ_{i=1..N_maxclust} L0Clust_ij = 1, for all j

where

L0Clust_ij = 1 if the j-th functional unit is assigned to cluster i, and 0 otherwise;
N_FU is the total number of functional units;
N_maxclust = N_FU is the maximum number of feasible clusters (at most, each functional unit forms its own L0 cluster).

L0Cost(L0Clust, Profile_loops) represents the energy consumption in the L0 buffers for any valid clustering. As a result of this optimization, the L0 buffer is partitioned and the functional units are clustered, as represented by the matrix L0Clust_ij.

4 Related Work

In the literature, the problem of reducing power in the instruction memory hierarchy has been addressed at various abstraction levels of the processor; an overview of the different optimizations can be found in [14]. We believe that power can be reduced to a large extent by optimizing at higher levels of abstraction first, and further optimizing at lower levels. The notion of partitioning or sub-banking a memory block at the logic level is well known [13]; in our scheme, however, the partitioning is done at the architectural level, and sub-banking can still be used within each of the clusters. Other architectural level clustering schemes, like those of [10, 11], are essentially different from ours. Firstly, they concentrate specifically on the level 1 caches, while we cluster the level 0 buffers and the interconnect in addition to the level 1 caches. Secondly, they do not group the functional units into clusters as we do. Furthermore, our approach to clustering is at a higher level than these schemes, and in principle they can still be applied to the level 1 caches in the L1 clusters of our architecture (Figure 1).

5 Experimental Results

We performed simulations using some of the benchmarks from Mediabench; their relevant characteristics are shown in Table 1. These benchmarks were compiled onto a non-clustered VLIW processor with ten functional units, using the compiler of the Trimaran tool suite [12].
Commonly used transformations like predication and software pipelining were applied to increase the instruction level parallelism. The profile described in Section 3.1 was generated using the instruction set simulator of the Trimaran tool suite. Of all the possibilities in the instruction memory hierarchy template, we currently assume a single L1 cluster, since our main goal is to form the L0 clusters. Furthermore, the L0 buffers were assumed to be SRAM based, and the number of words in the L0 buffers is assumed to be given. We have explored only the grouping of functional units and the partitioning of the L0 buffers into L0 clusters.
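For small machines, the 0-1 clustering problem of Section 3.2 can be solved by brute force over all set partitions of the functional units. The sketch below is a toy illustration under an invented energy model (energy per access = 1 + cluster width, in arbitrary units) and an invented 4-FU activation profile; it is not the solver or the Wattch-based cost function used in the paper:

```python
# Brute-force L0 clustering: enumerate every partition of the FUs and
# pick the one with the lowest profile-weighted energy cost.

def partitions(items):
    """Yield all ways to partition a list into non-empty clusters."""
    if not items:
        yield []
        return
    first, rest = items[0], items[1:]
    for part in partitions(rest):
        for i in range(len(part)):                   # join an existing cluster
            yield part[:i] + [[first] + part[i]] + part[i + 1:]
        yield part + [[first]]                       # or start a new one

def l0_cost(clusters, profile):
    # A cluster is accessed in a cycle if any of its FUs is active.
    # Invented model: energy per access = 1 (fixed overhead) + cluster width.
    cost = 0
    for cycle in profile:                            # cycle: set of active FUs
        for cl in clusters:
            if any(fu in cycle for fu in cl):
                cost += 1 + len(cl)
    return cost

# Hypothetical loop profile: FUs 0 and 1 are almost always active together,
# FUs 2 and 3 only occasionally.
profile = [{0, 1}] * 90 + [{0, 1, 2, 3}] * 10

best = min(partitions([0, 1, 2, 3]), key=lambda p: l0_cost(p, profile))
print(sorted(sorted(cl) for cl in best))   # prints: [[0, 1], [2, 3]]
```

The co-active FUs end up in the same cluster, so the rarely used buffers of (2,3) are left idle for 90% of the cycles. Exhaustive enumeration grows as the Bell number of N_FU, so this only works for small FU counts; a real solver would use the 0-1 formulation directly.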

Benchmark     | avg ILP | %exec time (loops) | ILP (loops) | %exec time (non-loops) | ILP (non-loops)
Adpcm         |         |                    |             |                        |
Jpeg Decode   |         |                    |             |                        |
IDCT          |         |                    |             |                        |
Mpeg2 Decode  |         |                    |             |                        |
Table 1. Benchmark characteristics

The profile corresponding to the loop execution of each benchmark was passed through the L0 clustering stage. Here, we assumed a centralized L0 buffer of 64 words (2) depth and 32 × 10 bits width (operation width × #issue slots). The energy consumption in the L0 clusters was calculated using the formula described in Section 2.3. The energy per access of these buffers was obtained from the parameterized energy models of Wattch [15] (by modeling them as simple array elements), and the number of accesses to each cluster was calculated using the instruction profile.

Benchmark     | N_L0Clust | FU grouping                | L0 buffer sizes (bits)                      | E_redn
Adpcm         | 5         | (1,2,3,4,5) (6,7) (8) (9)  | 64 × (5 × 32), 64 × (2 × 32), ...           | %
Jpeg Decode   | 7         | (1,2) (3,4) (5) (6,7) ...  | 64 × (2 × 32), 64 × (2 × 32), (2 × 32), ... | %
IDCT          | 4         | (1,2,3,4,5) (6,7) (8,9)    | 64 × (5 × 32), 64 × (2 × 32), 64 × (2 × 32) | %
Mpeg2 Decode  | 4         | (1,2,3,4,5,8) (6,7) ...    | 64 × (6 × 32), 64 × (2 × 32), ...           | %
Centralized   | 1         | (1,2,3,4,5,6,7,8,9,10)     | 64 × (10 × 32) = 20Kbit                     |
Table 2. The L0 clustering results

The variation of the L0 buffer energy with the number of clusters is shown in Figure 5. With an increasing number of clusters, the energy consumption in the L0 buffers drops to a certain optimal value and then rises again. The resulting L0 clusters at the optimal energy consumption are shown in Table 2; the clusters are specific to each benchmark. The energy reduction achieved over a single centralized L0 buffer organization is also shown in Table 2 for each benchmark. The results indicate that by clustering the L0 buffers, instead of arbitrarily partitioning them, up to 45% of the L0 buffer energy can be reduced, and that the profile based clustering can be used to automatically synthesize the clusters.
Furthermore, we have not considered the interconnect energy in our energy costs. We believe that

(2) It was observed that the number of instructions within the loop body was less than 64.

if that were to be considered in the energy equations, the reduction would be even more significant. However, we leave such an effort to our future work.

5.1 Discussion: L1 Clustering

As an interesting exercise, we formulated the L1 clustering similarly to the L0 clustering of Figure 4. Here, the problem was to group the L0 clusters and partition the level 1 cache into L1 clusters, instead of grouping the functional units and partitioning the L0 buffers into L0 clusters. Since the accesses to the level 1 caches occur during the execution of the non-loop parts of the code, the corresponding instruction profile was used. Furthermore, the centralized cache was assumed to be a 20KB, direct mapped cache with 256 blocks of 80 bytes each (due to the block size restriction of Section 2.1). The energy cost of any valid clustering was obtained using the formula described in Section 2.3: the energy per access of the caches was obtained from Cacti [16], and the number of accesses to each cache (assuming variable length encoding) was calculated using the instruction profile.

Fig. 5. Variation of energy consumption with the number of L0 clusters (left) and L1 clusters (right), normalized to the centralized L0 buffer and L1 cache energies, for ADPCM, JPEG DEC, IDCT and MPEG2DEC

The variation of the L1 cache energy with the number of clusters is shown in Figure 5. As in the L0 clustering case, the energy consumption drops to a certain optimal value and then increases with the number of clusters. However, unlike the L0 clustering, the maximum energy reduction with L1 clustering was marginal, about 2%. The reduction is marginal mainly because of the overhead of tags in each cluster. In our experiments the centralized organization was assumed to be a direct mapped cache.
By partitioning this direct mapped cache, we were essentially replacing it with smaller direct mapped caches, which naturally increases the number of tag bits: even though the effective storage size stays the same, the total number of tag bits increases. If this tag overhead can be avoided, we believe the energy reduction could be more significant. Some authors have proposed ways to design caches with less tag overhead [17]; we intend to evaluate this in our future work.
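The tag overhead argument can be illustrated with a back-of-the-envelope calculation. The configuration below is hypothetical (32-bit addresses, power-of-two sizes, a 16KB cache), chosen to keep the arithmetic simple rather than to reproduce the 80-byte-block cache of the experiments:

```python
from math import log2

def total_tag_bits(cache_bytes, block_bytes, n_partitions, addr_bits=32):
    """Total tag storage when a direct-mapped cache is split into equal parts.

    Each smaller partition has fewer index bits, so every block needs a
    wider tag: tag = addr_bits - index_bits - offset_bits.
    """
    blocks_per_part = cache_bytes // block_bytes // n_partitions
    tag = addr_bits - int(log2(blocks_per_part)) - int(log2(block_bytes))
    return n_partitions * blocks_per_part * tag

# A 16KB direct-mapped cache with 64-byte blocks, whole vs. split four ways:
print(total_tag_bits(16384, 64, 1))   # prints: 4608
print(total_tag_bits(16384, 64, 4))   # prints: 5120
```

Splitting the cache four ways raises the tag storage from 4608 to 5120 bits (about 11% more) for the same data capacity, which is the overhead that erodes the L1 clustering gains.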

6 Summary

In summary, we have presented a low energy clustered instruction memory hierarchy for long instruction word processors, in which small instruction memories are distributed over groups of functional units and the interconnects are localized in order to minimize energy consumption. We also presented a simple profile based algorithm to optimally synthesize the L0 clusters for a given application. As shown in the experimental results, by taking into account the access patterns to the instruction memory, visible only at the architectural level, we can achieve a significant reduction in energy.

References

1. R. S. Bajwa, et al., Instruction Buffering to Reduce Power in Processors for Signal Processing, IEEE Transactions on VLSI Systems, vol. 5, no. 4, Dec.
2. N. Bellas, et al., Architectural and Compiler Support for Energy Reduction in the Memory Hierarchy of High Performance Microprocessors, ISLPED.
3. L. H. Lee, et al., Instruction Fetch Energy Reduction Using Loop Caches for Applications with Small Tight Loops, ISLPED.
4. M. Jacome, et al., Design Challenges for New Application Specific Processors, Special Issue on System Design of Embedded Systems, IEEE Design & Test of Computers, April-June.
5. M. Jacome, et al., Exploring Performance Tradeoffs for Clustered VLIW ASIPs, ICCAD, November.
6. Texas Instruments Inc., TMS320C6000 Power Consumption Summary, Technical Report.
7. Texas Instruments Inc., TMS320C6000 CPU and Instruction Set Reference Guide.
8. M. Jayapala, et al., Loop Cache (Buffer) Organization: Energy Analysis and Partitioning, Technical Report, K.U.Leuven/ESAT, 22 Jan.
9. A. Wolfe, et al., Datapath Design for a VLIW Video Signal Processor, IEEE Symposium on High-Performance Computer Architecture (HPCA '97).
10. S. Kim, et al., Power-aware Partitioned Cache Architectures, ISLPED.
11. M. Huang, et al., L1 Data Cache Decomposition for Energy Efficiency, ISLPED.
12. Trimaran: An Infrastructure for Research in Instruction-Level Parallelism.
13. Ching-Long Su, et al., Cache Design Trade-offs for Power and Performance Optimization: A Case Study, ISLPED.
14. L. Nachtergaele, V. Tiwari and N. Dutt, System and Architectural-Level Power Reduction of Microprocessor-based Communication and Multimedia Applications, ICCAD.
15. D. Brooks, et al., Wattch: A Framework for Architectural-Level Power Analysis and Optimizations, ISCA.
16. S. J. E. Wilton and N. P. Jouppi, CACTI: An Enhanced Cache Access and Cycle Time Model, IEEE Journal of Solid-State Circuits, 31(5).
17. P. Petrov and A. Orailoglu, Power Efficient Embedded Processor IPs through Application-Specific Tag Compression in Data Caches, Design and Test in Europe Conf. (DATE), April 2002.


More information

Worst Case Execution Time Analysis for Synthesized Hardware

Worst Case Execution Time Analysis for Synthesized Hardware Worst Case Execution Time Analysis for Synthesized Hardware Jun-hee Yoo ihavnoid@poppy.snu.ac.kr Seoul National University, Seoul, Republic of Korea Xingguang Feng fengxg@poppy.snu.ac.kr Seoul National

More information

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language.

Architectures & instruction sets R_B_T_C_. von Neumann architecture. Computer architecture taxonomy. Assembly language. Architectures & instruction sets Computer architecture taxonomy. Assembly language. R_B_T_C_ 1. E E C E 2. I E U W 3. I S O O 4. E P O I von Neumann architecture Memory holds data and instructions. Central

More information

Analytical Design Space Exploration of Caches for Embedded Systems

Analytical Design Space Exploration of Caches for Embedded Systems Analytical Design Space Exploration of Caches for Embedded Systems Arijit Ghosh and Tony Givargis Department of Information and Computer Science Center for Embedded Computer Systems University of California,

More information

Cache-Aware Scratchpad Allocation Algorithm

Cache-Aware Scratchpad Allocation Algorithm 1530-1591/04 $20.00 (c) 2004 IEEE -Aware Scratchpad Allocation Manish Verma, Lars Wehmeyer, Peter Marwedel Department of Computer Science XII University of Dortmund 44225 Dortmund, Germany {Manish.Verma,

More information

Dynamic Instruction Scheduling For Microprocessors Having Out Of Order Execution

Dynamic Instruction Scheduling For Microprocessors Having Out Of Order Execution Dynamic Instruction Scheduling For Microprocessors Having Out Of Order Execution Suresh Kumar, Vishal Gupta *, Vivek Kumar Tamta Department of Computer Science, G. B. Pant Engineering College, Pauri, Uttarakhand,

More information

Comparing Multiported Cache Schemes

Comparing Multiported Cache Schemes Comparing Multiported Cache Schemes Smaїl Niar University of Valenciennes, France Smail.Niar@univ-valenciennes.fr Lieven Eeckhout Koen De Bosschere Ghent University, Belgium {leeckhou,kdb}@elis.rug.ac.be

More information

Integrating MRPSOC with multigrain parallelism for improvement of performance

Integrating MRPSOC with multigrain parallelism for improvement of performance Integrating MRPSOC with multigrain parallelism for improvement of performance 1 Swathi S T, 2 Kavitha V 1 PG Student [VLSI], Dept. of ECE, CMRIT, Bangalore, Karnataka, India 2 Ph.D Scholar, Jain University,

More information

Cache Designs and Tricks. Kyle Eli, Chun-Lung Lim

Cache Designs and Tricks. Kyle Eli, Chun-Lung Lim Cache Designs and Tricks Kyle Eli, Chun-Lung Lim Why is cache important? CPUs already perform computations on data faster than the data can be retrieved from main memory and microprocessor execution speeds

More information

Designing Area and Performance Constrained SIMD/VLIW Image Processing Architectures

Designing Area and Performance Constrained SIMD/VLIW Image Processing Architectures Designing Area and Performance Constrained SIMD/VLIW Image Processing Architectures Hamed Fatemi 1,2, Henk Corporaal 2, Twan Basten 2, Richard Kleihorst 3,and Pieter Jonker 4 1 h.fatemi@tue.nl 2 Eindhoven

More information

UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation.

UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation. UNIT 8 1. Explain in detail the hardware support for preserving exception behavior during Speculation. July 14) (June 2013) (June 2015)(Jan 2016)(June 2016) H/W Support : Conditional Execution Also known

More information

An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications

An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications An In-order SMT Architecture with Static Resource Partitioning for Consumer Applications Byung In Moon, Hongil Yoon, Ilgu Yun, and Sungho Kang Yonsei University, 134 Shinchon-dong, Seodaemoon-gu, Seoul

More information

SF-LRU Cache Replacement Algorithm

SF-LRU Cache Replacement Algorithm SF-LRU Cache Replacement Algorithm Jaafar Alghazo, Adil Akaaboune, Nazeih Botros Southern Illinois University at Carbondale Department of Electrical and Computer Engineering Carbondale, IL 6291 alghazo@siu.edu,

More information

An Energy Improvement in Cache System by Using Write Through Policy

An Energy Improvement in Cache System by Using Write Through Policy An Energy Improvement in Cache System by Using Write Through Policy Vigneshwari.S 1 PG Scholar, Department of ECE VLSI Design, SNS College of Technology, CBE-641035, India 1 ABSTRACT: This project presents

More information

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES

ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES ARCHITECTURAL APPROACHES TO REDUCE LEAKAGE ENERGY IN CACHES Shashikiran H. Tadas & Chaitali Chakrabarti Department of Electrical Engineering Arizona State University Tempe, AZ, 85287. tadas@asu.edu, chaitali@asu.edu

More information

On Efficiency of Transport Triggered Architectures in DSP Applications

On Efficiency of Transport Triggered Architectures in DSP Applications On Efficiency of Transport Triggered Architectures in DSP Applications JARI HEIKKINEN 1, JARMO TAKALA 1, ANDREA CILIO 2, and HENK CORPORAAL 3 1 Tampere University of Technology, P.O.B. 553, 33101 Tampere,

More information

DYNAMIC CIRCUIT TECHNIQUE FOR LOW- POWER MICROPROCESSORS Kuruva Hanumantha Rao 1 (M.tech)

DYNAMIC CIRCUIT TECHNIQUE FOR LOW- POWER MICROPROCESSORS Kuruva Hanumantha Rao 1 (M.tech) DYNAMIC CIRCUIT TECHNIQUE FOR LOW- POWER MICROPROCESSORS Kuruva Hanumantha Rao 1 (M.tech) K.Prasad Babu 2 M.tech (Ph.d) hanumanthurao19@gmail.com 1 kprasadbabuece433@gmail.com 2 1 PG scholar, VLSI, St.JOHNS

More information

Towards Optimal Custom Instruction Processors

Towards Optimal Custom Instruction Processors Towards Optimal Custom Instruction Processors Wayne Luk Kubilay Atasu, Rob Dimond and Oskar Mencer Department of Computing Imperial College London HOT CHIPS 18 Overview 1. background: extensible processors

More information

Analytical Design Space Exploration of Caches for Embedded Systems

Analytical Design Space Exploration of Caches for Embedded Systems Technical Report CECS-02-27 Analytical Design Space Exploration of s for Embedded Systems Arijit Ghosh and Tony Givargis Technical Report CECS-2-27 September 11, 2002 Department of Information and Computer

More information

Introduction Architecture overview. Multi-cluster architecture Addressing modes. Single-cluster Pipeline. architecture Instruction folding

Introduction Architecture overview. Multi-cluster architecture Addressing modes. Single-cluster Pipeline. architecture Instruction folding ST20 icore and architectures D Albis Tiziano 707766 Architectures for multimedia systems Politecnico di Milano A.A. 2006/2007 Outline ST20-iCore Introduction Introduction Architecture overview Multi-cluster

More information

Parameterized System Design

Parameterized System Design Parameterized System Design Tony D. Givargis, Frank Vahid Department of Computer Science and Engineering University of California, Riverside, CA 92521 {givargis,vahid}@cs.ucr.edu Abstract Continued growth

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Automatic Counterflow Pipeline Synthesis

Automatic Counterflow Pipeline Synthesis Automatic Counterflow Pipeline Synthesis Bruce R. Childers, Jack W. Davidson Computer Science Department University of Virginia Charlottesville, Virginia 22901 {brc2m, jwd}@cs.virginia.edu Abstract The

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013

ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 ENGN 2910A Homework 03 (140 points) Due Date: Oct 3rd 2013 Professor: Sherief Reda School of Engineering, Brown University 1. [from Debois et al. 30 points] Consider the non-pipelined implementation of

More information

Introduction. Introduction. Motivation. Main Contributions. Issue Logic - Motivation. Power- and Performance -Aware Architectures.

Introduction. Introduction. Motivation. Main Contributions. Issue Logic - Motivation. Power- and Performance -Aware Architectures. Introduction Power- and Performance -Aware Architectures PhD. candidate: Ramon Canal Corretger Advisors: Antonio onzález Colás (UPC) James E. Smith (U. Wisconsin-Madison) Departament d Arquitectura de

More information

Lazy BTB: Reduce BTB Energy Consumption Using Dynamic Profiling

Lazy BTB: Reduce BTB Energy Consumption Using Dynamic Profiling Lazy BTB: Reduce BTB Energy Consumption Using Dynamic Profiling en-jen Chang Department of Computer Science ational Chung-Hsing University, Taichung, 402 Taiwan Tel : 886-4-22840497 ext.918 e-mail : ychang@cs.nchu.edu.tw

More information

Novel Multimedia Instruction Capabilities in VLIW Media Processors. Contents

Novel Multimedia Instruction Capabilities in VLIW Media Processors. Contents Novel Multimedia Instruction Capabilities in VLIW Media Processors J. T. J. van Eijndhoven 1,2 F. W. Sijstermans 1 (1) Philips Research Eindhoven (2) Eindhoven University of Technology The Netherlands

More information

Pipelining and Vector Processing

Pipelining and Vector Processing Chapter 8 Pipelining and Vector Processing 8 1 If the pipeline stages are heterogeneous, the slowest stage determines the flow rate of the entire pipeline. This leads to other stages idling. 8 2 Pipeline

More information

C152 Laboratory Exercise 5

C152 Laboratory Exercise 5 C152 Laboratory Exercise 5 Professor: Krste Asanovic GSI: Henry Cook Department of Electrical Engineering & Computer Science University of California, Berkeley April 9, 2008 1 Introduction and goals The

More information

Code Compression for DSP

Code Compression for DSP Code for DSP Charles Lefurgy and Trevor Mudge {lefurgy,tnm}@eecs.umich.edu EECS Department, University of Michigan 1301 Beal Ave., Ann Arbor, MI 48109-2122 http://www.eecs.umich.edu/~tnm/compress Abstract

More information

Lecture notes for CS Chapter 2, part 1 10/23/18

Lecture notes for CS Chapter 2, part 1 10/23/18 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Characterization of Native Signal Processing Extensions

Characterization of Native Signal Processing Extensions Characterization of Native Signal Processing Extensions Jason Law Department of Electrical and Computer Engineering University of Texas at Austin Austin, TX 78712 jlaw@mail.utexas.edu Abstract Soon if

More information

Power Consumption Estimation of a C Program for Data-Intensive Applications

Power Consumption Estimation of a C Program for Data-Intensive Applications Power Consumption Estimation of a C Program for Data-Intensive Applications Eric Senn, Nathalie Julien, Johann Laurent, and Eric Martin L.E.S.T.E.R., University of South-Brittany, BP92116 56321 Lorient

More information

Design methodology for programmable video signal processors. Andrew Wolfe, Wayne Wolf, Santanu Dutta, Jason Fritts

Design methodology for programmable video signal processors. Andrew Wolfe, Wayne Wolf, Santanu Dutta, Jason Fritts Design methodology for programmable video signal processors Andrew Wolfe, Wayne Wolf, Santanu Dutta, Jason Fritts Princeton University, Department of Electrical Engineering Engineering Quadrangle, Princeton,

More information

Embedded Systems: Hardware Components (part I) Todor Stefanov

Embedded Systems: Hardware Components (part I) Todor Stefanov Embedded Systems: Hardware Components (part I) Todor Stefanov Leiden Embedded Research Center Leiden Institute of Advanced Computer Science Leiden University, The Netherlands Outline Generic Embedded System

More information

Improving Memory Repair by Selective Row Partitioning

Improving Memory Repair by Selective Row Partitioning 200 24th IEEE International Symposium on Defect and Fault Tolerance in VLSI Systems Improving Memory Repair by Selective Row Partitioning Muhammad Tauseef Rab, Asad Amin Bawa, and Nur A. Touba Computer

More information

A Low Energy Set-Associative I-Cache with Extended BTB

A Low Energy Set-Associative I-Cache with Extended BTB A Low Energy Set-Associative I-Cache with Extended BTB Koji Inoue, Vasily G. Moshnyaga Dept. of Elec. Eng. and Computer Science Fukuoka University 8-19-1 Nanakuma, Jonan-ku, Fukuoka 814-0180 JAPAN {inoue,

More information

Power Efficient Instruction Caches for Embedded Systems

Power Efficient Instruction Caches for Embedded Systems Power Efficient Instruction Caches for Embedded Systems Dinesh C. Suresh, Walid A. Najjar, and Jun Yang Department of Computer Science and Engineering, University of California, Riverside, CA 92521, USA

More information

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors

Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors Exploiting On-Chip Data Transfers for Improving Performance of Chip-Scale Multiprocessors G. Chen 1, M. Kandemir 1, I. Kolcu 2, and A. Choudhary 3 1 Pennsylvania State University, PA 16802, USA 2 UMIST,

More information

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.

4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4. Chapter 4: CPU 4.1 Introduction 4.3 Datapath 4.4 Control 4.5 Pipeline overview 4.6 Pipeline control * 4.7 Data hazard & forwarding * 4.8 Control hazard 4.14 Concluding Rem marks Hazards Situations that

More information

A First-step Towards an Architecture Tuning Methodology for Low Power

A First-step Towards an Architecture Tuning Methodology for Low Power A First-step Towards an Architecture Tuning Methodology for Low Power Greg Stitt, Frank Vahid*, Tony Givargis Department of Computer Science and Engineering University of California, Riverside {gstitt,

More information

Novel Multimedia Instruction Capabilities in VLIW Media Processors

Novel Multimedia Instruction Capabilities in VLIW Media Processors Novel Multimedia Instruction Capabilities in VLIW Media Processors J. T. J. van Eijndhoven 1,2 F. W. Sijstermans 1 (1) Philips Research Eindhoven (2) Eindhoven University of Technology The Netherlands

More information

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1]

2 Improved Direct-Mapped Cache Performance by the Addition of a Small Fully-Associative Cache and Prefetch Buffers [1] EE482: Advanced Computer Organization Lecture #7 Processor Architecture Stanford University Tuesday, June 6, 2000 Memory Systems and Memory Latency Lecture #7: Wednesday, April 19, 2000 Lecturer: Brian

More information

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores A Configurable Multi-Ported Register File Architecture for Soft Processor Cores Mazen A. R. Saghir and Rawan Naous Department of Electrical and Computer Engineering American University of Beirut P.O. Box

More information

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK

DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK DEPARTMENT OF ELECTRONICS & COMMUNICATION ENGINEERING QUESTION BANK SUBJECT : CS6303 / COMPUTER ARCHITECTURE SEM / YEAR : VI / III year B.E. Unit I OVERVIEW AND INSTRUCTIONS Part A Q.No Questions BT Level

More information

Automated Data Cache Placement for Embedded VLIW ASIPs

Automated Data Cache Placement for Embedded VLIW ASIPs Automated Data Cache Placement for Embedded VLIW ASIPs Paul Morgan 1, Richard Taylor, Japheth Hossell, George Bruce, Barry O Rourke CriticalBlue Ltd 17 Waterloo Place, Edinburgh, UK +44 131 524 0080 {paulm,

More information

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS

CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS CS6303 Computer Architecture Regulation 2013 BE-Computer Science and Engineering III semester 2 MARKS UNIT-I OVERVIEW & INSTRUCTIONS 1. What are the eight great ideas in computer architecture? The eight

More information

One instruction specifies multiple operations All scheduling of execution units is static

One instruction specifies multiple operations All scheduling of execution units is static VLIW Architectures Very Long Instruction Word Architecture One instruction specifies multiple operations All scheduling of execution units is static Done by compiler Static scheduling should mean less

More information

COE 561 Digital System Design & Synthesis Introduction

COE 561 Digital System Design & Synthesis Introduction 1 COE 561 Digital System Design & Synthesis Introduction Dr. Aiman H. El-Maleh Computer Engineering Department King Fahd University of Petroleum & Minerals Outline Course Topics Microelectronics Design

More information

Reducing Instruction Fetch Cost by Packing Instructions into Register Windows

Reducing Instruction Fetch Cost by Packing Instructions into Register Windows Reducing Instruction Fetch Cost by Packing Instructions into Register Windows Stephen Hines, Gary Tyson, David Whalley Computer Science Dept. Florida State University November 14, 2005 ➊ Introduction Reducing

More information

Using a Victim Buffer in an Application-Specific Memory Hierarchy

Using a Victim Buffer in an Application-Specific Memory Hierarchy Using a Victim Buffer in an Application-Specific Memory Hierarchy Chuanjun Zhang Depment of lectrical ngineering University of California, Riverside czhang@ee.ucr.edu Frank Vahid Depment of Computer Science

More information

Advance CPU Design. MMX technology. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. ! Basic concepts

Advance CPU Design. MMX technology. Computer Architectures. Tien-Fu Chen. National Chung Cheng Univ. ! Basic concepts Computer Architectures Advance CPU Design Tien-Fu Chen National Chung Cheng Univ. Adv CPU-0 MMX technology! Basic concepts " small native data types " compute-intensive operations " a lot of inherent parallelism

More information

A Complete Data Scheduler for Multi-Context Reconfigurable Architectures

A Complete Data Scheduler for Multi-Context Reconfigurable Architectures A Complete Data Scheduler for Multi-Context Reconfigurable Architectures M. Sanchez-Elez, M. Fernandez, R. Maestre, R. Hermida, N. Bagherzadeh, F. J. Kurdahi Departamento de Arquitectura de Computadores

More information

Power Protocol: Reducing Power Dissipation on Off-Chip Data Buses

Power Protocol: Reducing Power Dissipation on Off-Chip Data Buses Power Protocol: Reducing Power Dissipation on Off-Chip Data Buses K. Basu, A. Choudhary, J. Pisharath ECE Department Northwestern University Evanston, IL 60208, USA fkohinoor,choudhar,jayg@ece.nwu.edu

More information

A Reconfigurable Multifunction Computing Cache Architecture

A Reconfigurable Multifunction Computing Cache Architecture IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 9, NO. 4, AUGUST 2001 509 A Reconfigurable Multifunction Computing Cache Architecture Huesung Kim, Student Member, IEEE, Arun K. Somani,

More information

Area/Delay Estimation for Digital Signal Processor Cores

Area/Delay Estimation for Digital Signal Processor Cores Area/Delay Estimation for Digital Signal Processor Cores Yuichiro Miyaoka Yoshiharu Kataoka, Nozomu Togawa Masao Yanagisawa Tatsuo Ohtsuki Dept. of Electronics, Information and Communication Engineering,

More information

COSC 6385 Computer Architecture. - Memory Hierarchies (II)

COSC 6385 Computer Architecture. - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Fall 2008 Cache Performance Avg. memory access time = Hit time + Miss rate x Miss penalty with Hit time: time to access a data item which is available

More information

A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING

A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING 2014 IEEE International Conference on Acoustic, Speech and Signal Processing (ICASSP) A LOW-COMPLEXITY AND LOSSLESS REFERENCE FRAME ENCODER ALGORITHM FOR VIDEO CODING Dieison Silveira, Guilherme Povala,

More information

Parallel-computing approach for FFT implementation on digital signal processor (DSP)

Parallel-computing approach for FFT implementation on digital signal processor (DSP) Parallel-computing approach for FFT implementation on digital signal processor (DSP) Yi-Pin Hsu and Shin-Yu Lin Abstract An efficient parallel form in digital signal processor can improve the algorithm

More information

An Approach for Adaptive DRAM Temperature and Power Management

An Approach for Adaptive DRAM Temperature and Power Management IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS 1 An Approach for Adaptive DRAM Temperature and Power Management Song Liu, Yu Zhang, Seda Ogrenci Memik, and Gokhan Memik Abstract High-performance

More information

A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs

A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs Politecnico di Milano & EPFL A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs Vincenzo Rana, Ivan Beretta, Donatella Sciuto Donatella Sciuto sciuto@elet.polimi.it Introduction

More information

C152 Laboratory Exercise 5

C152 Laboratory Exercise 5 C152 Laboratory Exercise 5 Professor: Krste Asanovic TA: Scott Beamer Department of Electrical Engineering & Computer Science University of California, Berkeley April 7, 2009 1 Introduction and goals The

More information

Basic Computer Architecture

Basic Computer Architecture Basic Computer Architecture CSCE 496/896: Embedded Systems Witawas Srisa-an Review of Computer Architecture Credit: Most of the slides are made by Prof. Wayne Wolf who is the author of the textbook. I

More information

Last Level Cache Size Flexible Heterogeneity in Embedded Systems

Last Level Cache Size Flexible Heterogeneity in Embedded Systems Last Level Cache Size Flexible Heterogeneity in Embedded Systems Mario D. Marino, Kuan-Ching Li Leeds Beckett University, m.d.marino@leedsbeckett.ac.uk Corresponding Author, Providence University, kuancli@gm.pu.edu.tw

More information

Chapter 2: Memory Hierarchy Design Part 2

Chapter 2: Memory Hierarchy Design Part 2 Chapter 2: Memory Hierarchy Design Part 2 Introduction (Section 2.1, Appendix B) Caches Review of basics (Section 2.1, Appendix B) Advanced methods (Section 2.3) Main Memory Virtual Memory Fundamental

More information

Vertex Shader Design I

Vertex Shader Design I The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only

More information

Performance Evaluation of XHTML encoding and compression

Performance Evaluation of XHTML encoding and compression Performance Evaluation of XHTML encoding and compression Sathiamoorthy Manoharan Department of Computer Science, University of Auckland, Auckland, New Zealand Abstract. The wireless markup language (WML),

More information

High Speed Special Function Unit for Graphics Processing Unit

High Speed Special Function Unit for Graphics Processing Unit High Speed Special Function Unit for Graphics Processing Unit Abd-Elrahman G. Qoutb 1, Abdullah M. El-Gunidy 1, Mohammed F. Tolba 1, and Magdy A. El-Moursy 2 1 Electrical Engineering Department, Fayoum

More information

Lecture 13 - VLIW Machines and Statically Scheduled ILP

Lecture 13 - VLIW Machines and Statically Scheduled ILP CS 152 Computer Architecture and Engineering Lecture 13 - VLIW Machines and Statically Scheduled ILP John Wawrzynek Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~johnw

More information

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE Reiner W. Hartenstein, Rainer Kress, Helmut Reinig University of Kaiserslautern Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany

More information

PACE: Power-Aware Computing Engines

PACE: Power-Aware Computing Engines PACE: Power-Aware Computing Engines Krste Asanovic Saman Amarasinghe Martin Rinard Computer Architecture Group MIT Laboratory for Computer Science http://www.cag.lcs.mit.edu/ PACE Approach Energy- Conscious

More information

A framework for verification of Program Control Unit of VLIW processors

A framework for verification of Program Control Unit of VLIW processors A framework for verification of Program Control Unit of VLIW processors Santhosh Billava, Saankhya Labs, Bangalore, India (santoshb@saankhyalabs.com) Sharangdhar M Honwadkar, Saankhya Labs, Bangalore,

More information

Simultaneously Improving Code Size, Performance, and Energy in Embedded Processors

Simultaneously Improving Code Size, Performance, and Energy in Embedded Processors Simultaneously Improving Code Size, Performance, and Energy in Embedded Processors Ahmad Zmily and Christos Kozyrakis Electrical Engineering Department, Stanford University zmily@stanford.edu, christos@ee.stanford.edu

More information