A Novel Profile-Driven Technique for Simultaneous Power and Code-size Optimization of Microcoded IPs


Bita Gorjiara, Daniel Gajski
Center for Embedded Computer Systems, University of California, Irvine

Abstract

Microcoded customized IPs have significantly better performance, yet larger code size, compared to similarly-sized instruction-based processors. Storing wide microcodes on-chip requires wide memory blocks that occupy a large area and consume high leakage power. Therefore, addressing the code size of microcoded IPs is very important. In this paper, we introduce compression techniques that, along with careful resolution of don't care values (denoted by X) in microcode, can address the code size issue. We observed that X values can be used to improve either the dynamic power of IPs or their compression. However, achieving both at once is challenging. We propose a profile-guided X-resolution technique that achieves both power and compression efficiency. Using our technique, the code size of microcoded IPs is reduced by 2.7 times, while saving 20% dynamic power, on average.

1. Introduction

Shrinking time-to-market and high demand for productivity have driven traditional hardware designers to use design methodologies that start from high-level languages. However, meeting the timing and physical constraints of automatically generated IPs is often challenging and time-consuming. Moreover, slight changes in the high-level specification require redoing the RTL and physical synthesis phases because the behavioral synthesis tool may generate a new datapath. To avoid repeating these phases, high-level synthesis tools should be extended to recompile the application without modifying the previously generated datapath.
A new breed of design methodologies based on traditional statically-scheduled Horizontal Microcoded Architectures (HMA) [1] has emerged that not only addresses the recompilation issue but also supports more complex architectures than traditional synthesis techniques. PICO [2], TIPI [3] and NISC [4] [5] [6] are examples of such methodologies. In statically-scheduled HMAs, the compiler compiles the program directly to microcodes or nanocodes without using an instruction abstraction. A nanocode is a low-level microcode that directly controls the units of the datapath for one cycle. HMAs can potentially have better performance, lower power, and lower area than their conventional processor counterparts, because the costly decoder and scheduler hardware is replaced with off-line compiler algorithms. As a result, highly parallel customized architectures can be designed as HMAs without any concern about the complexity of instructions, controller and scheduler.

Microcodes may contain don't care values (denoted by X) that can be mapped to 0 or 1 in the final executable binary. In each microcode, the X values correspond to the control signals of the units that are idle in that cycle. To save power, one approach is to use a combination of signal gating and power gating to disable the idle units. This approach requires placing gated registers at the inputs of all units. The additional registers can increase clock power, especially on platforms that do not support clock gating (such as FPGAs). They can also increase the number of cycles and hence the overall energy consumption of the IP. An alternative or complementary approach for reducing the dynamic power of the datapath is to carefully resolve X values so that the overall switching activity of the units is minimized. This requires replacing the X values with the corresponding non-X values in the preceding microcodes. We refer to this technique as power-aware X-resolution, or PX.
Note that, in this approach, seemingly equivalent microcodes may be resolved to different binaries depending on the preceding microcodes in the program. Such power optimization is possible because: (1) there is a one-to-one relationship between microcode bits and control signals of the units; and (2) microcodes can be customized for a given application at no hardware cost. In contrast, conventional processors rely on fixed hardware decoders to convert instructions to microcodes. Such decoders are more complex to customize.

Unfortunately, microcoded IPs have an important limitation: their code size is very large. In this paper, we show that the code size of microcoded IPs can be several times larger than that of traditional instruction-set-based processors. Storing wide microcodes on-chip requires wide memory blocks that occupy a large area and consume more leakage power. Also, loading the microcodes from an off-chip memory increases the I/O power. Therefore, addressing the code size of microcoded IPs is very important. In this paper, we introduce compression techniques that, along with careful resolution of X values, can address the code size issue of microcoded IPs.

We first show that power-aware X-resolution (PX) reduces the dynamic power of the IPs by 26%, on average. We also show that a dictionary-based compression technique combined with compression-aware X-resolution (CX) can reduce the code size of microcoded IPs by 3 times, on average. Next, we show that combining the power and code-size optimizations is challenging: if compression-aware X-resolution is replaced with the power-aware one (PX), the compression efficiency drops significantly. According to our experiments, after combining the two approaches, the code size may even exceed the original uncompressed size. To address this issue, we propose a new profile-guided X-resolution technique that can achieve both power and compression efficiency. Using our technique, the dynamic power is reduced by 20% while the code size is reduced by 2.7 times.

The rest of the paper is organized as follows. Section 2 presents an overview of our design approach. Section 3 explains the power-aware X-resolution. Section 4 compares the code size of a microcoded IP with that of a conventional processor, and then introduces a compression technique that, along with compression-aware X-resolution, can address the code size issue. In Section 5, we show that replacing the compression-aware X-resolution with the power-aware one makes the compression ineffective. Section 6 introduces our approach for resolving X values that maintains both power and compression efficiency.

2. Overview of Our Design Flow (NISC)

Our design flow is based on the No-Instruction-Set Computer (NISC) [4] [5] [6]. In our flow, a custom datapath is generated or selected for a given application, and then the program is compiled onto the datapath. Figure 1 shows an example of our custom IPs. The IP consists of a datapath and a controller. Since it is general enough to run many applications, we refer to it as GNISC.
The datapath contains functional units, a register file, registers, multiplexers, and memory. Our approach relies on a sophisticated compiler [5] to compile a program described in a high-level language to a binary that directly drives the control signals of the components in the datapath. The values of the control signals generated for each cycle are called a nanocode or a Control Word (CW). The CWs are stored in a control memory (CMem). Our toolset is available online at [4].

At each cycle, some of the control signals in the datapath may have a don't care value. This means that either 0 or 1 can be assigned to those control signals without affecting the correctness of the program. The compiler generates don't care values for the components that are not used. Note that not all control signals can be assigned X. For example, the register-file write enable cannot be X, because if X is resolved to 1, incorrect data is written to the RF. In other words, the control signals that can affect the contents of the registers and register files in the IP cannot be assigned X. Control signals such as register-file read and write addresses, multiplexer selection signals, and the ALU operation signal can be X when the units are not used.

Figure 1- Block diagram of GNISC architecture.

Figure 2- A subset of GNISC datapath

3. Power-aware X resolution (PX)

The dynamic power of CMOS circuits is proportional to the average switching activity of the gates. The X values in a control word can be resolved in a way that decreases the switching activity of the datapath. To explain this concept, we use the following example. Figure 2 is a subset of the GNISC architecture shown in Figure 1. In GNISC, the ALU is connected to other components in the datapath through two multiplexers (i.e. MUX1 and MUX2). The multiplexers have two-bit control signals to address all their inputs. The value of these control signals depends on the preceding operations that produce the ALU's inputs.
Suppose that in a given cycle, the ALU must add the output of the multiplier MUL to a constant value (i.e. CONST) that is also stored in the control word. In that case, the control signals of MUX1 and MUX2 should be set to 00 and 11, respectively, to propagate the correct data to the ALU. Now, suppose that in the next clock cycle the ALU is idle, and therefore the compiler generates X values for the control signals of the ALU and the multiplexers, as well as for CONST. To reduce dynamic power, the inputs of the ALU must remain unchanged as much as possible during the idle cycle. This means that the constant value (i.e. CONST), the MUL output, and the control signals of the multiplexers should remain unchanged. To do so, the X values generated for the multiplexers and CONST are replaced by their values in the preceding cycle.
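The MUX1/MUX2 example above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' tool: control words are modeled as strings over 0, 1 and X, the names resolve_px, resolve_zero and toggles are hypothetical, bit toggles between consecutive control words are used as a rough proxy for switching activity, and an X with no preceding value falls back to 0 (the text does not specify this case).

```python
from typing import List

def resolve_px(cws: List[str], default: str = "0") -> List[str]:
    """Replace each X with the value the same bit had in the preceding CW."""
    resolved: List[str] = []
    prev = None
    for cw in cws:
        bits = []
        for i, bit in enumerate(cw):
            if bit != "X":
                bits.append(bit)          # value required by the compiler
            elif prev is not None:
                bits.append(prev[i])      # hold the previous value (PX)
            else:
                bits.append(default)      # assumption: no history, use 0
        prev = "".join(bits)
        resolved.append(prev)
    return resolved

def resolve_zero(cws: List[str]) -> List[str]:
    """Naive baseline: every X becomes 0."""
    return [cw.replace("X", "0") for cw in cws]

def toggles(cws: List[str]) -> int:
    """Count bit flips between consecutive control words."""
    return sum(a != b for p, c in zip(cws, cws[1:]) for a, b in zip(p, c))

# Hypothetical 4-bit CWs: two bits for MUX1, two bits for MUX2.
# Cycle 0 selects MUX1=00, MUX2=11; cycle 1 is idle (all X).
cws = ["0011", "XXXX", "0X11", "1XX0"]
px = resolve_px(cws)
print(px, toggles(px), toggles(resolve_zero(cws)))
```

On this made-up sequence the idle cycle keeps the multiplexer selections at 00 and 11, so the PX result toggles fewer bits than the zero-filled baseline.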

Figure 3 and Figure 4 show how power-aware X-resolution is implemented. Figure 3 shows a set of control words corresponding to a basic block of a program. To reduce dynamic power, the X values are replaced with the non-X values from the preceding control word (the final result is shown in Figure 4). In [7] and [8], a similar activity-reduction approach is proposed for reducing the dynamic power of datapaths generated by high-level synthesis. They first extract the don't care information from the controller's FSM by constructing dependency graphs, and then add extra logic to the controller to decrease its activity. In our approach, the compiler directly generates the don't cares, and there is no need to construct additional data structures during controller generation.

Figure 3- Example CWs generated by compiler

Figure 4- X values are resolved for power optimization

3.1 Experiments: Power efficiency of PX

In this section, we present the power savings achieved by applying PX to the GNISC architecture. The experiments are performed on a set of benchmarks including adpcm_decoder, crc32, dijkstra, and sha, from MiBench (the free version of EEMBC embedded benchmarks). All the designs are synthesized on a Xilinx Virtex4 (90nm) FPGA. For power simulation, the signal activities are collected by post-placement-and-routing simulation using the ModelSim simulator. The activities are then fed into the Xilinx gate-level power simulator, XPower. Table 1 shows the power consumption of the benchmarks with and without the PX optimization. On average, the power consumption is reduced from 30.9 mW to 23 mW. This shows that PX reduces the dynamic power of GNISC by 26% on average. In the next section, we show that the X values may be resolved in a different way to help with code size reduction.
Table 1- Power consumption (mW) with and without PX
Columns: power without PX (mW), power with PX (mW), power savings (%). Rows: adpcm_decoder, CRC, dijkstra, sha, average.

4. Reducing code-size of microcoded IPs

Microcoded and nanocoded IPs have larger code size compared to instruction-based processors. Table 2 compares the code size of GNISC (shown in Figure 1) with that of the Xilinx soft-core RISC processor, MicroBlaze. The code size of MicroBlaze is the size of the instruction section (.text) of the ELF file generated by the compiler. On average, the code size of GNISC is 3.6 times larger than that of MicroBlaze, while its performance is several times better [14]. The goal of our optimization is to simultaneously improve the code size and power consumption of GNISC while maintaining the performance benefits.

Table 2- Comparing code size (KB) of GNISC with MicroBlaze
Columns: MicroBlaze code size (KB), GNISC code size (KB), code size ratio. Rows: adpcm_decoder, CRC, dijkstra, sha, average.

4.1 Overview of code-size reduction techniques

In general-purpose processors, the instruction-set abstraction is used to reduce the code size. In RISC processors, designers define 32-bit or 16-bit [13] instructions to encode wide control words. At runtime, the instructions are decoded back to the control words (nanocodes) using a hardware decoder. In most processors, one or more pipeline stages are added to the datapath for instruction decoding, which affects the performance of the processor. Moreover, designing an instruction set is a very complex and time-consuming task for a typical IP designer, because the compiler, assembler, linker and instruction decoder must be redesigned to handle custom instructions. Therefore, in our approach, we eliminate the need for an instruction set and directly compress the control words. This not only simplifies the design, but also enables us to use the don't care values of the control words to improve the compression ratio or the power consumption.
In the general-purpose domain, dictionary-based compression is used for reducing the code size [9], [10], [11], [12]. In CCRP [9], unique instructions in the program are stored in a dictionary, where the location of the instructions is determined by Huffman coding: the most frequent instructions in the program are placed at low addresses of the dictionary and are coded with fewer bits. Due to Huffman coding, the compressed instructions have variable sizes. Although decompressing variable-size code is more complex, CCRP manages to hide the decompression latency using a cache. IBM CodePack [10], [11], [12] is another compression technique that has the same memory structure as CCRP. In CodePack, each instruction is partitioned into two halves, and two dictionaries are used to store the unique patterns of each half. Nevertheless, none of these approaches consider binary optimization for dynamic power reduction.

4.2 Our compression approach

In this paper, we present our approach assuming that no cache is used in the design and that the codeword size is fixed (i.e. no Huffman coding). These assumptions simplify the discussion. The cache and Huffman coding are orthogonal optimizations and can be implemented later. In our compression approach, we resolve X values in the nanocode so that both power consumption and code size are reduced. Our tool constructs a dictionary of unique control words and, in the executable binary, replaces each control word by its corresponding dictionary line address. Figure 5 shows a one-dictionary (D1) code compression approach. The memory structure consists of a Code lookup table (CodeLUT) and a dictionary. The Program Counter (PC) contains the address into the CodeLUT and is used to read the next codeword. The codeword is then used to read the corresponding control word from the dictionary.

The following example shows how dictionary-based compression can reduce the total size of the CMem in the controller. Suppose that Figure 6 shows the CWs of a sample program. Each CW has 16 bits and the program has nine CWs. Therefore, the code size of the program is 144 bits (16 × 9). Figure 7 shows the compressed implementation of Figure 6, where the dictionary contains five unique CWs and the CodeLUT contains the corresponding addresses of the CWs. Three bits are needed to address the dictionary; thus, the codewords are three bits wide. After compression, the total binary size is reduced to 107 bits (i.e. 5 × 16 + 9 × 3).

Figure 5- One-dictionary code compression (D1)

Figure 6- CWs of a sample program

Figure 7- Single-dictionary compression on CWs of Figure 6

Since CWs can be very wide with many unique patterns, the dictionary may have many entries and the compression efficiency may be low. To increase the chances of finding matching patterns, we can partition the CWs into smaller slices and construct multiple dictionaries.
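The size arithmetic of the single-dictionary example, and the effect of slicing control words across several dictionaries, can be sketched as follows. This is a minimal model under assumptions: the function names are hypothetical, slices are taken as equal-width contiguous bit ranges, and the nine 16-bit control words are made up (the content of Figure 6 is not reproduced here); they are chosen only so that five of them are unique.

```python
from typing import List

def index_bits(n_entries: int) -> int:
    """Bits needed to address a dictionary with n_entries lines (at least 1)."""
    return max(1, (n_entries - 1).bit_length())

def compressed_size_bits(cws: List[str], n_slices: int = 1) -> int:
    """Total size of all dictionaries plus the CodeLUT, in bits."""
    width = len(cws[0])
    bounds = [i * width // n_slices for i in range(n_slices + 1)]
    total = 0
    for lo, hi in zip(bounds, bounds[1:]):
        unique = {cw[lo:hi] for cw in cws}
        total += len(unique) * (hi - lo)              # dictionary storage
        total += len(cws) * index_bits(len(unique))   # one CodeLUT field per CW
    return total

# Hypothetical program: nine 16-bit CWs, five of them unique.
A = "0" * 15 + "1"
B = "0" * 14 + "10"
C = "1111" + "0" * 11 + "1"
D = "1111" + "0" * 10 + "10"
E = "10" * 8
program = [A, B, A, C, D, C, E, A, B]

uncompressed = len(program) * len(program[0])
d1 = compressed_size_bits(program, 1)
d2 = compressed_size_bits(program, 2)
d4 = compressed_size_bits(program, 4)
print(uncompressed, d1, d2, d4)
```

With these made-up words, one dictionary gives 5 × 16 + 9 × 3 = 107 bits against 144 bits uncompressed, two dictionaries give a smaller total, and four dictionaries give a larger total than two because the extra CodeLUT fields start to outweigh the dictionary savings.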
Usually the total size of the partitioned dictionaries is much smaller than that of a single big dictionary. However, for each dictionary, a code field must be added to the codewords. As the number of dictionaries increases, the number of these fields in the codeword increases and eventually cancels out the gain of partitioning. Figure 8 shows two-dictionary (D2) and three-dictionary (D3) code compression approaches. In addition to the number of dictionaries, the way X values are resolved in the binary may affect the efficiency of compression. This concept is discussed in the next section.

Figure 8- (a) Two-dictionary (D2) and (b) three-dictionary (D3) compression

4.3 Compression-aware X resolution (CX)

The following example shows that careful resolution of X values can lead to better compression. For the control words of Figure 3, if don't cares are simply replaced by 0, then the dictionary (shown in Figure 9) will have four entries, because only the second and the last vectors match. However, if the X values are smartly resolved, then seemingly different control words match each other as well. In Figure 10, the X values are resolved so that the first, third, and fourth vectors in Figure 3 are mapped to the first entry of Figure 10, and the other two vectors are mapped to the second entry of Figure 10.

Figure 9- Dictionary content for CWs of Figure 3 (X values are replaced by 0)

Figure 10- Dictionary content for CWs of Figure 3 (using compression-aware X resolution)

To solve this problem in the general case, the X values in the CWs must be resolved so that the total number of unique patterns is minimized. This problem can be converted to a graph coloring problem. For a given list of bit-vectors, a graph G(V, E) is constructed. The vertices in V are the bit-vectors, and the edges in E represent conflicts between the vectors. Two bit-vectors do not conflict if they can be merged into a single bit-vector.
The graph coloring algorithm partitions the vertices (or vectors) into subsets so that there is no edge (i.e. conflict) between any two vertices in the same set, while minimizing the total number of sets (i.e. colors). Solving the graph coloring problem optimally is NP-hard, but there are many well-known heuristics that generate efficient results in polynomial time. After coloring the graph, a new vector is generated for each color, and all the same-color vectors are merged into that vector. The new vectors are used to fill the dictionary. The details of this algorithm are available in [14].

4.4 Experiments: Compression efficiency

Table 3 shows the binary size of the MiBench benchmarks running on GNISC with different code compressions. The second column (D0) shows the baseline code size without any compression. The remaining columns (D1CX-D4CX) show the code size with different numbers of dictionaries (Section 4.3). These compression techniques use the CX approach for X resolution. As the number of dictionaries increases (columns 3, 4, 5, and 6), the code size (i.e. the total size of the dictionaries and the CodeLUT) of all the benchmarks decreases up to a certain point (the highlighted values) and then increases again. These are the points where the increase in CodeLUT size cancels out the benefit of having more dictionaries. The optimum number of dictionaries may vary for different applications.

Table 3- Code size (KB) of benchmarks with CX
Columns: D0, D1CX, D2CX, D3CX, D4CX. Rows: adpcm_decoder, CRC, dijkstra, sha, average, CR.

The last row in the table shows the Compression Ratio (CR), a metric commonly used to evaluate a compression algorithm. CR is the ratio between the compressed size and the original size, so smaller CR values indicate better compression. On average, for all these benchmarks, the three-dictionary compression (i.e. D3CX) outperforms the others. In D3CX, the total code size is one-third of the code size of D0. In other words, it compresses the code by 3 times.
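The graph-coloring formulation of CX from Section 4.3 can be illustrated with a greedy first-fit heuristic. This is a sketch under assumptions and not the algorithm of [14]: the function names are hypothetical, the five 4-bit vectors are made up, and any X left after merging is filled with 0. Each group below plays the role of one color; a vector joins the first group whose merged pattern it does not conflict with, which is equivalent to having no conflict edge with any member of that group.

```python
from typing import List

def compatible(a: str, b: str) -> bool:
    """Two vectors do not conflict if no bit position has 0 in one and 1 in the other."""
    return all(x == y or x == "X" or y == "X" for x, y in zip(a, b))

def merge(a: str, b: str) -> str:
    """Merge two compatible vectors, keeping every specified bit."""
    return "".join(y if x == "X" else x for x, y in zip(a, b))

def resolve_cx(cws: List[str], fill: str = "0") -> List[str]:
    """Greedy first-fit grouping; returns the dictionary entries."""
    groups: List[str] = []
    for cw in cws:
        for i, pattern in enumerate(groups):
            if compatible(pattern, cw):
                groups[i] = merge(pattern, cw)   # same color as this group
                break
        else:
            groups.append(cw)                    # open a new color
    return [g.replace("X", fill) for g in groups]

# Hypothetical control words with don't cares.
cws = ["1XX1", "01X0", "X01X", "10XX", "010X"]
zero_filled = {cw.replace("X", "0") for cw in cws}
dictionary = resolve_cx(cws)
print(len(zero_filled), dictionary)
```

On this made-up input, filling every X with 0 leaves four unique patterns, while the greedy grouping needs only two dictionary entries. First-fit does not guarantee the minimum number of colors; it is one of the polynomial-time heuristics the text alludes to.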
These experiments show that dictionary-based compression combined with CX can result in a very good compression ratio. In the next section, we investigate whether the efficiency of dictionary-based compression is maintained if X values are resolved for power rather than for compression.

5. Combining PX with compression

To reduce dynamic power and code size at the same time, we combine the dictionary-based technique with the PX approach introduced in Section 3, and rerun the experiments. Table 4 shows the code size of the benchmarks after both optimizations. Although PX does not affect the code size of D0PX, it significantly increases that of the compressed controllers (i.e. D1PX-D3PX). That is because PX significantly increases the number of unique binary patterns. This issue can also be observed in the example of Figure 3. After PX optimization (see Figure 4), there are four unique patterns in the code. However, after CX (Figure 10), there are only two unique patterns in the code. In general, compared to CX, PX increases the number of dictionary entries, and hence the width of the codewords. If the number of dictionary entries increases significantly, the size of the compressed code may even exceed the size of the original uncompressed code. This happens for D1PX in our experiments. As shown in the third column, the compression ratio (CR) of D1PX is 1.06, indicating a 6% increase in the code size. While using more dictionaries (D2PX and D3PX) improves the CR to 0.79 and 0.62, respectively, the CR is still two times worse than that of the CX implementation shown in Table 3. This motivates us to develop an X-resolution technique that achieves both power and compression efficiency.

Table 4- Code size (KB) of benchmarks with PX
Columns: D0PX, D1PX, D2PX, D3PX. Rows: adpcm_decoder, CRC, dijkstra, sha, average, CR.

6. Hybrid X resolution (HX)

In most programs, 90% of execution time is spent in 5%-10% of the basic blocks.
We use this property to improve both code size and dynamic power consumption: we limit the PX optimization to the X values in the frequently executed basic blocks. Using the application profiling information, the frequently executed basic blocks, which are the main contributors to the power consumption, are identified and PX is applied to them. For the rest of the basic blocks, CX is used to improve the compression efficiency. Using this hybrid technique, power consumption can be reduced with little loss of compression efficiency.

Figure 11 shows the flow of our controller synthesizer tool. The inputs to our tool are the CW binaries with X information generated by the compiler, and the application profile information (i.e. the execution frequency of basic blocks). In the first step of the synthesis, the binary is partitioned according to the requested dictionary style. Then, the content of each dictionary is compressed using the HX approach: first the X values of the frequently executed basic blocks are resolved using PX, and then the graph coloring algorithm of CX is applied to resolve the rest of the X values and compact the content. Next, the codewords are generated according to the dictionary contents. Finally, the HDL code of the controller is produced.
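The profile-guided split described above can be sketched as follows. Everything here is illustrative and assumed rather than taken from the authors' tool: the function names, the per-block weights, the 90% coverage threshold, and the choice to leave an X untouched when a hot block has no preceding value for it. Hot blocks get PX applied; cold blocks keep their X bits so that a later CX step can still merge them.

```python
from typing import Dict, List, Set

def pick_hot_blocks(weight: Dict[str, int], coverage: float = 0.9) -> Set[str]:
    """Pick the heaviest blocks until they cover the requested share of execution."""
    total = sum(weight.values())
    hot: Set[str] = set()
    covered = 0
    for name in sorted(weight, key=weight.get, reverse=True):
        if covered >= coverage * total:
            break
        hot.add(name)
        covered += weight[name]
    return hot

def px_within_block(cws: List[str]) -> List[str]:
    """PX inside one block: an X copies the previous CW's bit when there is one."""
    out: List[str] = []
    prev = None
    for cw in cws:
        cur = "".join(
            b if b != "X" or prev is None else prev[i] for i, b in enumerate(cw)
        )
        out.append(cur)
        prev = cur
    return out

def resolve_hx(blocks: Dict[str, List[str]], weight: Dict[str, int],
               coverage: float = 0.9) -> Dict[str, List[str]]:
    """Apply PX to hot blocks only; leave X bits in cold blocks for CX."""
    hot = pick_hot_blocks(weight, coverage)
    return {name: px_within_block(cws) if name in hot else list(cws)
            for name, cws in blocks.items()}

# Hypothetical basic blocks and profile weights (e.g. executed cycles).
blocks = {"init": ["10XX", "X1X0"], "loop": ["0011", "XXXX", "1XX0"], "exit": ["XXXX"]}
weight = {"init": 300, "loop": 9500, "exit": 200}
hx = resolve_hx(blocks, weight)
print(hx)
```

The remaining X bits in the output would then go through the graph-coloring step of CX before codewords are generated.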

Figure 11- Profile-guided controller generator (HX)

6.1 Experiments: efficiency of HX

Table 5 shows the code size and power consumption of different controller implementations with the HX power optimization. In Table 5, the second, third, and fourth columns show the code size of the one-, two-, and three-dictionary implementations with HX optimization. Compared to the PX implementation (see the last row of Table 4), the compression ratio (CR) of HX is improved significantly (note that smaller CR values indicate better compression). Comparing the power consumption of HX with that of PX (see Table 1) shows that only a few mW of extra power is consumed to achieve 1.8 times better compression.

Table 5- Power consumption and binary size of benchmarks with profile-guided controller generation
Columns: binary size (KB) for D1HX, D2HX, D3HX; power (mW) for D1HX, D2HX, D3HX. Rows: adpcm_decoder, CRC, dijkstra, sha, average, average CR, average power.

Figure 12 and Figure 13 compare the CX, PX, and HX implementations in terms of binary size and power consumption, respectively. Note that the binary size of HX is within 17% of CX, while its power consumption is within 6% of PX. Therefore, HX has the benefits of both CX and PX without their limitations. It is worth noting that all power values in Table 5, Figure 12 and Figure 13 include the dynamic power consumption of the datapath, controller, memories and decompression logic.

Figure 12- Average binary size (KB) for different dictionary counts (D0-D3) and X resolutions (CX, PX, HX)

Figure 13- Average power (mW) for different dictionary counts (D0-D3) and X resolutions (CX, PX, HX)

7. Conclusion

In this paper, we show that using dictionary-based compression techniques, the code size of microcoded IPs can be reduced by 3 times, on average. We also show that power-aware X-resolution (PX) reduces the dynamic power of the IPs by 26%, on average. However, combining the power and code-size optimizations is challenging. To address this issue, we propose a new profile-guided X-resolution technique that can achieve both power and compression efficiency.

References

[1] A. Agrawala, T. Rauscher, Foundations of Microprogramming: Architecture, Software, and Applications, Academic Press.
[2] R. Schreiber, S. Aditya, S. Mahlke, V. Kathail, B. R. Rau, D. Cronquist and M. Sivaraman, "PICO-NPA: High-Level Synthesis of Nonprogrammable Hardware Accelerators", Journal of VLSI Signal Processing.
[3] S. J. Weber and K. Keutzer, "Using minimal minterms to represent programmability", Proc. CODES+ISSS 2005.
[4] NISC Website.
[5] M. Reshadi, D. Gajski, "A Cycle-Accurate Compilation Algorithm for Custom Pipelined Datapaths", Proc. CODES+ISSS.
[6] M. Reshadi, B. Gorjiara, D. Gajski, "Utilizing Horizontal and Vertical Parallelism Using a No-Instruction-Set Compiler and Custom Datapaths", International Conference on Computer Design (ICCD).
[7] A. Raghunathan et al., "Controller re-specification to minimize switching activity in controller/data path circuits", ISLPED.
[8] A. Raghunathan, S. Dey, N. Jha, and K. Wakabayashi, "Power management techniques for control-flow intensive designs", DAC.
[9] A. Wolfe and A. Chanin, "Executing compressed programs on an embedded RISC architecture", Intl. Symposium on Microarchitecture.
[10] IBM, CodePack PowerPC Code Compression Utility User's Manual Version 3.0, IBM.
[11] T.M. Kemp, R.K. Montoye, D.J. Auerback, J.D. Harper, J.D. Palmer, "A decompression core for PowerPC", IBM Syst. J. 42, 6 (November).
[12] C. Lefurgy, E. Piccininni, T. Mudge, "Evaluation of a high performance code compression method", Intl. Symposium on Microarchitecture.
[13] S. Segars, K. Clarke, and L. Goudge, "Embedded control problems, Thumb, and the ARM7TDMI", IEEE Micro, vol. 15, no. 5, Oct.
[14] B. Gorjiara, D. Gajski, "FPGA-friendly Code Compression for Horizontal Microcoded Custom IPs", Intl. Symposium on FPGA.


More information

A Method To Derive Application-Specific Embedded Processing Cores Olivier Hébert 1, Ivan C. Kraljic 2, Yvon Savaria 1,2 1

A Method To Derive Application-Specific Embedded Processing Cores Olivier Hébert 1, Ivan C. Kraljic 2, Yvon Savaria 1,2 1 A Method To Derive Application-Specific Embedded Processing Cores Olivier Hébert 1, Ivan C. Kraljic 2, Yvon Savaria 1,2 1 Electrical and Computer Engineering Dept. École Polytechnique de Montréal, Montréal,

More information

Logic Optimization Techniques for Multiplexers

Logic Optimization Techniques for Multiplexers Logic Optimiation Techniques for Multiplexers Jennifer Stephenson, Applications Engineering Paul Metgen, Software Engineering Altera Corporation 1 Abstract To drive down the cost of today s highly complex

More information

DESIGN AND PERFORMANCE ANALYSIS OF CARRY SELECT ADDER

DESIGN AND PERFORMANCE ANALYSIS OF CARRY SELECT ADDER DESIGN AND PERFORMANCE ANALYSIS OF CARRY SELECT ADDER Bhuvaneswaran.M 1, Elamathi.K 2 Assistant Professor, Muthayammal Engineering college, Rasipuram, Tamil Nadu, India 1 Assistant Professor, Muthayammal

More information

Design Space Exploration Using Parameterized Cores

Design Space Exploration Using Parameterized Cores RESEARCH CENTRE FOR INTEGRATED MICROSYSTEMS UNIVERSITY OF WINDSOR Design Space Exploration Using Parameterized Cores Ian D. L. Anderson M.A.Sc. Candidate March 31, 2006 Supervisor: Dr. M. Khalid 1 OUTLINE

More information

On the Interplay of Loop Caching, Code Compression, and Cache Configuration

On the Interplay of Loop Caching, Code Compression, and Cache Configuration On the Interplay of Loop Caching, Code Compression, and Cache Configuration Marisha Rawlins and Ann Gordon-Ross* Department of Electrical and Computer Engineering, University of Florida, Gainesville, FL

More information
