A HARDWARE COMPLETE DETECTION MECHANISM FOR AN ENERGY EFFICIENT RECONFIGURABLE ACCELERATOR CMA
Akihito Tsusaka, Mai Izawa, Rie Uno, Nobuyuki Ozaki, Hideharu Amano
Keio University, Yokohama, Japan

ABSTRACT

Cool Mega Array (CMA) is an energy-efficient Coarse-Grained Reconfigurable processor Array (CGRA) built around a large PE (Processing Element) array. To eliminate the power spent on storing intermediate results and on the clock tree, the PE array consists of combinatorial circuits. Until now, the completion time of the PE array has been calculated manually from a delay table and the mapping results, and specified in the micro-code. In this paper, a hardware completion detection mechanism for CMA is proposed, implemented and evaluated. Each PE uses serially connected buffers with selectable taps, and the delay is chosen according to the operation executed in the PE. Since the completion signal travels on exactly the same paths as the computation data, the delays in the switches and wires are accounted for. The mechanism was implemented in CMA with a 65-nm CMOS process, and post-layout simulation showed that the same performance as the design without the mechanism is obtained with only 5.1% area overhead and less than 6% extra power consumption. With the mechanism, a single micro-code can be used for various supply voltages of the PE array. Dynamic changes of the delay caused by temperature and by chip-to-chip variation can also be handled.

1. INTRODUCTION

Recent battery-driven mobile devices require high performance in certain application areas as well as energy efficiency. As a solution, Coarse-Grained Reconfigurable processor Arrays (CGRAs) [1, 2, 3] have received attention as energy-efficient accelerators, and some of them have been used in commercial products [4, 5]. CMA (Cool Mega Array) [6] has been developed as a highly energy-efficient CGRA. It provides a large PE (Processing Element) array consisting of combinatorial logic.
Data-flow graphs of target application programs are mapped directly onto the array, and computation is done without storing intermediate results. Neither the energy for storing intermediate results in registers in each PE nor the energy for clock distribution through a clock tree is required. A small microcontroller manages data distribution and collection between the data memory and the registers, which are provided only at the input and output of the PE array. The supply voltage of the PE array can be scaled so that the computation delay in the PE array is well balanced with the time for data management by the microcontroller. The first prototype, CMA-1, using 65-nm CMOS technology achieved a sustained performance of 2.7 GOPS/11.2 mW, and Cube-1 [7], a multicore system built from a number of CMA chips, is now available.

One of the most difficult problems of the CMA architecture is how to evaluate the computational delay of the PE array. Ozaki et al. proposed a method to compute the largest delay in the PE array from the result of the application mapping [8]. It uses a table that holds the delay of each PE at various supply voltages; with a certain amount of margin added, the programmer decides the timing at which the results from the PE array are stored. However, the method takes neither the ambient temperature nor the chip-to-chip variance of the delay into account. For safe computation, a large margin is required, which degrades both performance and energy efficiency.

To address this problem, a hardware mechanism that detects the completion of execution in the PE array is proposed. It uses a selectable delay line consisting of buffers connected in tandem. The delay is chosen according to the operation executed in the PE. Since the completion signal travels on exactly the same paths as the computation data, the delays in the switches and wires are accounted for.

The rest of the paper is organized as follows. In Section 2, the architecture of CMA is introduced, focusing on the delay estimation method.
A hardware mechanism for detecting the completion of execution is proposed in Section 3. The overhead and efficiency of the mechanism are shown in Section 4. Section 5 concludes the paper with a discussion of future work.

2. CMA ARCHITECTURE AND COMPLETION DETECTION

2.1. The CMA architecture

Like other CGRAs, CMA targets multimedia streaming applications, which have a large degree of parallelism. By executing on many PEs in parallel, it achieves the required performance at a low supply voltage. The important difference from other CGRAs is that it adopts an extreme architecture to save as much energy as possible. Unlike other CGRAs, the large PE array of CMA consists of combinatorial circuits without registers or context memory. Neither the energy for storing intermediate results nor the power for clock distribution inside the PE array is required. Dynamic reconfiguration, which requires a large amount of energy, is not adopted. The configuration data for the PE array is supplied from configuration registers outside the PE array and is fixed during execution. The data-flow graph corresponding to the application is mapped statically onto the PE array.

To keep flexibility, a small microcontroller is provided between the PE array and the data memory. It reads data from the data memory and distributes it to the register attached to the input of the PE array. It also collects the results from the register attached to the output of the PE array and writes them back to the data memory. It manages the data transfer between the memory and the registers flexibly by using mapping registers and vector operations. With this structure, various application programs can be implemented without power-hungry dynamic reconfiguration in the PE array.

Since the computation in the PE array and the data management by the microcontroller are done in a pipelined manner, their execution speeds must be balanced. If the computation delay is shorter than the data management delay, the voltage supplied to the PE array can be reduced. The total power required for computation can thus be reduced without degrading computing performance. On the other hand, if the data management delay is shorter than the computation delay, wave pipelining in the PE array can be used.
The delay time for achieving wave pipelining can also be controlled by changing the voltage supplied to the PE array.

2.2. Prototype chip CMA-1

The first prototype, CMA-1 [6], with an 8×8 PE array, was fabricated in 65-nm CMOS technology and achieved a sustained performance of 2.4 GOPS/11.2 mW. Figure 1 shows the block diagram of CMA-1. It consists of a PE array, a microcontroller, data memory (DMEM) and registers.

Fig. 1. Block diagram of CMA-1.

Here, the computation in the PE array under the control of the microcontroller is described in detail. As shown in Figure 2, the microcontroller consists of a controller, a Fetch register, a Launch register and a Gather register. First, it reads data from DMEM and distributes them to the entries of the Fetch register. Data distribution and collection by the microcontroller were designed to be flexible, enabling arbitrary mapping between data memory addresses and PE array inputs by means of address mapping registers. Stride vector access operations are also supported. When the input data in the Fetch register are ready, they are transferred to the Launch register and the computation starts in the PE array. After a certain time interval, the result of the PE array is stored into the Gather register, and the results in the entries of the Gather register are written back into DMEM.

Fig. 2. Pipelined operation of CMA-1: (1) data distribution, (2) computation in the PE array, (3) data collection. (a) All stages are balanced; (b) computation in the PE array is short -> voltage scaling; (c) data manipulation by the microcontroller is short -> wave pipelining.
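The three cases of Fig. 2 can be summarised as a small decision rule. The following Python sketch is illustrative only: the function name `balance_strategy`, its parameters and the optional tolerance are assumptions introduced here, not part of CMA-1 or its tools.

```python
def balance_strategy(t_distribute_ns, t_compute_ns, t_collect_ns, tolerance_ns=0.0):
    """Classify a pipeline according to the three cases of Fig. 2.

    t_distribute_ns: time of stage (1), data distribution
    t_compute_ns:    time of stage (2), computation in the PE array
    t_collect_ns:    time of stage (3), data collection
    tolerance_ns:    hypothetical slack within which stages count as balanced
    """
    # Assumption: the data-management time is set by the slower of the
    # two microcontroller-side stages.
    t_data_ns = max(t_distribute_ns, t_collect_ns)

    if abs(t_compute_ns - t_data_ns) <= tolerance_ns:
        return "balanced"            # case (a)
    if t_compute_ns < t_data_ns:
        return "voltage scaling"     # case (b): the PE array has slack
    return "wave pipelining"         # case (c): data management has slack


if __name__ == "__main__":
    print(balance_strategy(24.0, 24.0, 24.0))
    print(balance_strategy(24.0, 12.0, 24.0))
    print(balance_strategy(24.0, 48.0, 20.0))
```

The rule only names the technique that applies; how far the supply voltage can be lowered in case (b) depends on the delay characteristics of the actual chip.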
(1) Distribution from DMEM to the Fetch register, (2) computation in the PE array, and (3) write-back of the results from the Gather register to DMEM are performed in a pipelined manner. Supply voltage scaling of the PE array is used to balance the time of stage (2) with that of the other two stages.

2.3. Microcodes of the CMA-1

The programmable controller of CMA-1 has 16 general-purpose registers and uses 14-bit micro-operations stored in a small (128-entry) micro-code memory. Table 1 shows an example of micro-code. Only the body of a loop is extracted.
Table 1. An example of micro-code

      ...
      LD ADD r0,r8        //1: Load data
      LD ADD r1,r8        //2: to Fetch
      LD ADD r2,r8        //3: register
      LD ADD r3,r8        //4:
      LD ADD r4,r8        //5:
      LD ADD r5,r8        //6:
      SCATTER r9          //7: Scatter
loop: LD ADD r0,r8        //8:
      LD ADD r1,r8        //9:
      LD ADD r2,r8        //10:
      LD ADD r3,r8        //11:
      LD ADD r4,r8        //12:
      LD ADD r5,r8        //13:
      NOP 3               //14:
      GATHER r11,r12,0    //15: Gather
      SCATTER r9          //16:
      ADDI r13,#-1        //17:
      BNEZ r13,loop       //18:
      ...

LD ADD reads data from the data memory and transfers it to an entry of the Fetch register. In this code, r8 is used as a base register, and a predefined value is added to it each time LD ADD is executed. r0-r5 are used as mapping registers. When all data in the Fetch register are ready (Line 6), SCATTER is executed to transfer the content of the Fetch register to the Launch register. At that point, the computation in the PE array starts. During the computation, the microcontroller fetches the second data set (Lines 8-13) and waits 3 clock cycles for the end of the computation in the PE array (Line 14). Then the results of the PE array are stored in the Gather register. The GATHER instruction transfers the results into the data memory according to the base register (r11) and the mask register (r12). The base register is incremented by the number of transferred data. Although the GATHER instruction takes multiple clock cycles, it is executed automatically by a dedicated controller, so the microcontroller can immediately execute the next micro-code, SCATTER, to start the computation of the next data set. In this example, the loop is iterated until r13 reaches zero. Note that because the PE array computation and the GATHER instruction proceed independently of the micro-code execution, the three steps are performed in a pipelined manner, as shown in Figure 2.

2.4. Completion detection

The problem for micro-code designers is that they must estimate the completion time of the PE array and specify it in the micro-code. In this example, 6 micro-codes (Lines 8-13) are executed between the first SCATTER and GATHER.
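The hand calculation that this places on the designer can be sketched as follows. This is a minimal illustration, assuming one clock cycle per micro-code and rounding the estimated PE array delay up to whole cycles; the function name `nop_cycles` and its parameters are hypothetical and do not correspond to any CMA tool.

```python
import math


def nop_cycles(pe_delay_ns, overlapped_ops, clock_mhz):
    """Return the NOP count needed between SCATTER and GATHER.

    pe_delay_ns:    estimated delay of the PE array in nanoseconds
    overlapped_ops: micro-codes executed between SCATTER and GATHER
                    (assumed to take one clock cycle each)
    clock_mhz:      microcontroller clock frequency in MHz
    """
    period_ns = 1000.0 / clock_mhz
    # Whole clock cycles needed to cover the estimated PE array delay.
    cycles_needed = math.ceil(pe_delay_ns / period_ns)
    # Cycles already spent on useful micro-codes do not need padding.
    return max(0, cycles_needed - overlapped_ops)


if __name__ == "__main__":
    # Figures used in this section: 250 MHz clock, 6 overlapped
    # micro-codes (Lines 8-13) and an estimated delay of about 36 ns.
    print(nop_cycles(36.0, 6, 250.0))
```

The result depends directly on the delay estimate, which is why the micro-code has to be rewritten whenever the supply voltage, and with it the delay, changes.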
When the microcontroller runs at 250 MHz, these take 24 ns. If the delay of the PE array is estimated at about 36 ns, 3 more clock cycles must be added with the NOP 3 micro-code. However, the execution time of the PE array depends on the application. For simple applications that use a small number of PEs, the results are ready after a short delay, while the delay becomes large for complicated applications.

The data flow on the PE array is designed with the Black-Diamond retargetable compiler [9]. It compiles a program described in a C-like language, performs mapping and routing, and generates the configuration data for the PE array. Ozaki et al. proposed a method to evaluate the total delay time by using the mapping result and a delay table obtained by measuring real chips while changing the supply voltage [8]. Since the delay differs depending on the operation, the table has an entry for each PE instruction. The longest path in the PE array can be computed from the mapping results as the sum of the appropriate delays in the table. Since the largest wire delay is assumed for each operation, the computed total delay includes a certain margin.

3. COMPLETION DETECTION MECHANISM

3.1. Related Work

Since CMA uses a large PE array of combinatorial circuits, its completion detection mechanism somewhat resembles that of asynchronous systems. Although much research has been done on asynchronous FPGA architectures, most of them use micro-synchronization mechanisms to recognize the end of computation [10]. However, since these require a large amount of additional hardware, they are difficult to apply to the PE array of CMA. PCA-1/2 [11] is a reconfigurable architecture that uses delay lines to send results to the next cells, and Xia et al. proposed a hybrid architecture using delay and synchronization mechanisms. Although they are based on fine-grained reconfigurable architectures, using a delay line is a cost-efficient way to recognize the completion of computation.
Techniques for controlling delay mechanisms have been well studied and are available [12]. For coarse-grained architectures, a dataflow-driven execution control mechanism has been proposed [13], but it is for a general PE array with a clock.

3.2. The concept of the proposed method

The current delay estimation method has the following problems:
(1) When the supply voltage is scaled, the micro-code must be changed. Different codes must be provided when the voltage is scaled dynamically.
(2) The temperature and the delay variance of each chip are not taken into account. The delay variance will become large when a low supply voltage is used in future processes. For safe operation, a large margin is required.

To address these problems, we propose a hardware completion detection mechanism with tandem-connected buffers. Figure 3 shows the concept of the mechanism. When the SCATTER instruction is executed, the completion signals attached to all input data are asserted at the input of the PE array. They propagate along exactly the same routes as the input data. In each PE, the completion signal is delayed by serially connected buffers whose delay is set according to the operation executed in the PE. When two completion signals join at a PE, the earlier-asserted input must wait for the later-asserted one. When the completion signals attached to all output data are asserted, the results are stored into the Gather register.

Fig. 3. Hardware completion mechanism.

The key implementation issue is that the completion signal must flow in the same way as the data. For this purpose, we implemented it as an extra bit of the data bus. Thus, it has the same fan-out and almost the same routing path as the other data bits. As shown in Figure 4, all PEs in CMA are naturally aligned in a two-dimensional structure, so the completion signals can be routed in the same manner as the corresponding data wires, and their delay becomes almost the same.

Fig. 4. The layout of the PE array.

3.3. Delay line in the PE

As shown in Figure 5, the completion signals from both inputs are forwarded through an AND gate to the delay line, which is implemented with buffers connected in tandem. When constant data is used or the PE executes a single-operand instruction, the corresponding completion input is set to H beforehand. The delay line has several taps, and the signal with the appropriate delay is selected by the output multiplexer. Since the operation of a PE is defined by its configuration data, the multiplexer selection is also determined by the configuration data of each PE.

3.4. Implementation of the mechanism

3.4.1. Design environment

The proposed completion detection mechanism is implemented in the CMA architecture shown in Table 2. The design tools are shown in the same table.
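The behaviour of the completion signal described above (an AND of the input completion signals followed by an operation-dependent tap of the delay line) can be modelled in software as an arrival-time calculation over the mapped data-flow graph. The sketch below is only an illustration: the tap settings in `TAP_BUFFERS` are hypothetical placeholders rather than the values of the actual design, switch and wire delays are ignored, and only the 69 ps figure, the maximum delay of the buffer cell used for the delay line, is taken from the paper.

```python
BUFFER_DELAY_PS = 69  # maximum delay of one delay-line buffer (from the paper)

# Hypothetical number of buffers selected by the tap for each operation
# class. These are placeholders for illustration, not the design's values.
TAP_BUFFERS = {"ADD/SUB": 12, "MUL": 11, "LOGIC": 4, "SHIFT": 6}


def completion_time_ps(pes, outputs):
    """Return the time at which all output completion signals are asserted.

    pes:     dict mapping a PE name to (operation, input_a, input_b),
             listed in topological order. An input is another PE name,
             "IN" for a PE array input, or None for a constant or unused
             operand (its completion input is tied high beforehand).
    outputs: names of the PEs whose results go to the Gather register.
    """
    done = {"IN": 0}  # completion is asserted at the inputs by SCATTER
    for name, (op, a, b) in pes.items():
        arrivals = [done[src] for src in (a, b) if src is not None]
        # AND gate: the earlier signal waits for the later one.
        start = max(arrivals, default=0)
        # Delay line: the tap is chosen by the PE's configured operation.
        done[name] = start + TAP_BUFFERS[op] * BUFFER_DELAY_PS
    return max(done[name] for name in outputs)


if __name__ == "__main__":
    mapping = {
        "p0": ("MUL", "IN", "IN"),
        "p1": ("SHIFT", "IN", None),  # single-operand instruction
        "p2": ("ADD/SUB", "p0", "p1"),
    }
    print(completion_time_ps(mapping, ["p2"]))
```

In hardware no such calculation is performed: the completion bit simply arrives at the Gather register after the accumulated delay, which is what makes the mechanism insensitive to the supply voltage chosen at run time.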
The target CMA is almost the same as CMA-1, except that it provides dedicated links for transferring constant data. This improvement was shown to reduce the loading time of the configuration data [14].

Table 2. Specifications of the target CMA
Technology        Fujitsu e-shuttle 65-nm 12-metal CMOS
Cell library      CS202SZ low-power standard cell library
Supply voltage    V for PE array (1.2 V for evaluation)
PE                24-bit ALU,64-bit
Network           2-lane island-style, 2 direct links
Microcontroller   14-bit micro-codes, 16 instructions, 128 entries, 8 GPRs, 8 address registers, 4 base registers
Clock frequency   210 MHz
Synthesis         Design Compiler SP5
Layout            IC Compiler 2009
Analysis          PrimeTime SP3

Fig. 5. Delay line in a PE.

3.4.2. Delay in the PE

The positions of the taps are decided from the post-layout simulation results of the PE. According to the analysis, four taps are provided for the corresponding operation classes: ADD/SUB, MUL, LOGIC and SHIFT. The buffer SC23BUFXA1, whose maximum delay is 69 ps, is used to build the delay line. Table 3 shows the number of buffers for each tap. In our PE, the delay of the ADD/SUB operation is slightly larger than that of the MUL operation. This is because a high-speed multiplier is adopted for multiplication, while a ripple-carry adder is used for the ADD/SUB operation.

Table 3. The setting of the delay line
Operation   Number of buffers   Delay time (ps)
ADD/SUB
MULT
LOGICs
SHIFT

3.4.3. Modification of micro-code

The SCATTER and GATHER instructions are modified as follows:

SCATTER: When the SCATTER instruction executes, all completion signals attached to the available inputs are asserted. When all propagated completion signals are asserted at the output and the GATHER instruction has been executed, the input completion signals are negated. Until then, the next SCATTER instruction is suspended.

GATHER: When all completion signals are ready at the output of the PE array and the GATHER instruction is executed, the results are stored into the Gather register. If the GATHER instruction is issued before the completion signals are ready, the execution of the microcontroller is stalled.

With this modification, designers do not have to worry about the timing of issuing the GATHER instruction. The results are stored in the Gather register and then automatically written back to the data memory. Although the hardware completion detection mechanism is inherently suitable for wave pipelining, the current implementation makes wave pipelining impossible. This restriction was imposed just for safety; relaxing the condition on multiple SCATTER instructions to enable wave pipelining is our future work.

4. EVALUATION

4.1. The delay time

We implemented four simple image filter programs on CMA with the completion detection mechanism: an alpha blender, a sepia filter, a gray-scale filter and an edge detection filter. All programs worked correctly in post-layout simulation including wiring and parasitic capacitance delays. Figure 6 shows the maximum delay actually measured in the PE during the simulation. The graph shows that the delay of the tandem-connected buffers is set appropriately, with about a 10% margin. The execution time is exactly the same as without the completion detection mechanism, since the delay in the micro-code is well tuned. However, the same code can be used at any supply voltage or temperature.

4.2. The Overhead

The proposed mechanism requires additional hardware, which introduces area and power overhead. The area is 5.1% larger than that of the design without the mechanism. Since the additional hardware is simple, consisting of buffers, multiplexers and AND gates, the area increase is small. Figure 7 shows the average power consumption in a PE when the applications are executed. The power consumed by the completion detection mechanism is only about 6%. In this implementation, only 1 bit per data word changes during a computation. This is why the power consumption of the mechanism is not large.

5. CONCLUSION

A hardware completion detection mechanism for CMA is proposed, implemented and evaluated. Each PE uses serially connected buffers with selectable taps, and the delay is chosen according to the operation executed in the PE. Since the completion signal travels on exactly the same paths as the computation data, the delays in the switches and
wires are accounted for. The mechanism was implemented in CMA with a 65-nm CMOS process, and post-layout simulation showed that the same performance as the design without the mechanism is obtained with only 5.1% area overhead and less than 6% extra power consumption. With the mechanism, a single micro-code can be used for various supply voltages of the PE array. Dynamic delay variation caused by temperature changes or chip-to-chip variation is also handled. The proposed mechanism is inherently suitable for wave pipelining. At present, wave pipelining is implemented on CMA-1 by careful manual tuning. Execution of wave pipelining with the proposed hardware completion mechanism is our future work.

Fig. 6. The delay in the PE.

Fig. 7. The power consumed in a PE array.

Acknowledgments

Part of this research was performed under Core Research for Evolutional Science and Technology (CREST) of the Japan Science and Technology Agency (JST). This work is also supported by the VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with Cadence Design Systems, Inc.

REFERENCES

[1] F. J. Veradas, M. Scheppler, W. Moffat, B. Mei, "Custom Implementation of the Coarse-Grained Reconfigurable ADRES Architecture for Multimedia Purposes," in Proc. of International Conference on Field Programmable Logic and Applications (FPL05), 2005.
[2] C. Ebeling, D. C. Cronquist and P. Franklin, "RaPiD - Reconfigurable Pipelined Datapath," in Proc. of the FPL 2004.
[3] H. Amano, Y. Hasegawa, S. Tsutsumi, T. Nakamura, T. Nisimura, V. Tunbunheng, A. Parimala, T. Sano and M. Kato, "MuCCRA Chips: Configurable Dynamically-Reconfigurable Processors," in Proc. of ASSCC, Nov. 2007.
[4] M. Motomura, "STP Engine, a C-based Programmable HW Core featuring Massively Parallel and Reconfigurable PE Array: its Architecture, Tool, and System Implications," in Proc. of CoolChips XII.
[5] H-S. Kim, M. Ann, J. A. Sratton, W. Mei, W. Hwu, "ULP-SRP: Ultra Low Power Samsung Reconfigurable Processor for Biomedical Applications," in Proc. of ICFPT 2012, 2012.
[6] N. Ozaki, Y. Yasuda, Y. Saito, D. Ikebuchi, M. Kimura, H. Amano, H. Nakamura, K. Usami, M. Namiki, M. Kondo, "Cool Mega-Arrays: Ultralow-Power Reconfigurable Accelerator Chips," IEEE Micro, Vol. 31, pp. 6-18.
[7] Y. Koizumi, et al., "CMA-Cube: a scalable reconfigurable accelerator with 3-D wireless inductive coupling interconnect," in Proc. of the FPL 2012, Aug.
[8] N. Ozaki, et al., "Cool Mega-Arrays: A highly energy efficient accelerator," in Proc. of ICFPT 2011.
[9] V. Tunbunheng and H. Amano, "Black-Diamond: a Retargetable Compiler Using Graph with Configuration Bits for Dynamically Reconfigurable Architectures," in Proc. of the 14th Workshop on Synthesis And System Integration of Mixed Information technologies (SASIMI), 2007.
[10] J. Teifel, R. Manohar, "An Asynchronous Dataflow FPGA Architecture," IEEE Trans. on Computers, vol. 53, no. 11, November.
[11] R. Konishi, H. Ito, H. Nakada, A. Nagoya, K. Oguri, N. Imlig, T. Shiozawa, M. Inamori, K. Nagami, "PCA-1: A Fully Asynchronous Self-Reconfigurable LSI," Proc. of Int'l Symp. Asynchronous Circuits and Systems.
[12] M. Onouchi, "A low-power wide-range clock synchronizer with predictive-delay-adjustment scheme for continuous voltage scaling in DVFS," IEEE Journal of Solid-State Circuits, vol. 45, no. 380, November.
[13] R. Panda, C. Ebeling, S. Hauck, "Adding dataflow-driven Execution Control to a Coarse-Grained Reconfigurable Array," Proc. of FPL.
[14] R. Uno, N. Ozaki, H. Amano, "A Research of PE Array Connection Network for Cool Mega-Array," in Proc. of Int. Workshop on Renewable Computing Systems, March.
More informationImplimentation of A 16-bit RISC Processor for Convolution Application
Advance in Electronic and Electric Engineering. ISSN 2231-1297, Volume 4, Number 5 (2014), pp. 441-446 Research India Publications http://www.ripublication.com/aeee.htm Implimentation of A 16-bit RISC
More informationHIGH-LEVEL SYNTHESIS
HIGH-LEVEL SYNTHESIS Page 1 HIGH-LEVEL SYNTHESIS High-level synthesis: the automatic addition of structural information to a design described by an algorithm. BEHAVIORAL D. STRUCTURAL D. Systems Algorithms
More informationEvaluation of Space Allocation Circuits
Evaluation of Space Allocation Circuits Shinya Kyusaka 1, Hayato Higuchi 1, Taichi Nagamoto 1, Yuichiro Shibata 2, and Kiyoshi Oguri 2 1 Department of Electrical Engineering and Computer Science, Graduate
More informationA Reconfigurable Multifunction Computing Cache Architecture
IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 9, NO. 4, AUGUST 2001 509 A Reconfigurable Multifunction Computing Cache Architecture Huesung Kim, Student Member, IEEE, Arun K. Somani,
More informationComputer and Hardware Architecture I. Benny Thörnberg Associate Professor in Electronics
Computer and Hardware Architecture I Benny Thörnberg Associate Professor in Electronics Hardware architecture Computer architecture The functionality of a modern computer is so complex that no human can
More informationThe Nios II Family of Configurable Soft-core Processors
The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture
More informationTestability Optimizations for A Time Multiplexed CPLD Implemented on Structured ASIC Technology
ROMANIAN JOURNAL OF INFORMATION SCIENCE AND TECHNOLOGY Volume 14, Number 4, 2011, 392 398 Testability Optimizations for A Time Multiplexed CPLD Implemented on Structured ASIC Technology Traian TULBURE
More informationImplementation of Reduce the Area- Power Efficient Fixed-Point LMS Adaptive Filter with Low Adaptation-Delay
Implementation of Reduce the Area- Power Efficient Fixed-Point LMS Adaptive Filter with Low Adaptation-Delay A.Sakthivel 1, A.Lalithakumar 2, T.Kowsalya 3 PG Scholar [VLSI], Muthayammal Engineering College,
More informationInternational Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN
International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October-2013 1502 Design and Characterization of Koggestone, Sparse Koggestone, Spanning tree and Brentkung Adders V. Krishna
More informationStorage I/O Summary. Lecture 16: Multimedia and DSP Architectures
Storage I/O Summary Storage devices Storage I/O Performance Measures» Throughput» Response time I/O Benchmarks» Scaling to track technological change» Throughput with restricted response time is normal
More informationRECENTLY, researches on gigabit wireless personal area
146 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 55, NO. 2, FEBRUARY 2008 An Indexed-Scaling Pipelined FFT Processor for OFDM-Based WPAN Applications Yuan Chen, Student Member, IEEE,
More informationA 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications
A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications Ju-Ho Sohn, Jeong-Ho Woo, Min-Wuk Lee, Hye-Jung Kim, Ramchan Woo, Hoi-Jun Yoo Semiconductor System
More informationStructure of Computer Systems
288 between this new matrix and the initial collision matrix M A, because the original forbidden latencies for functional unit A still have to be considered in later initiations. Figure 5.37. State diagram
More informationCompact Clock Skew Scheme for FPGA based Wave- Pipelined Circuits
International Journal of Communication Engineering and Technology. ISSN 2277-3150 Volume 3, Number 1 (2013), pp. 13-22 Research India Publications http://www.ripublication.com Compact Clock Skew Scheme
More informationO PT I C Alan N. Willson, Jr. AD-A ppiov' 9!lj" 2' 2 1,3 9. Quarterly Progress Report. (October 1, 1992 through December 31, 1992)
AD-A260 754 Quarterly Progress Report (October 1, 1992 through December 31, 1992) O PT I C on " 041 o 993 VLSI for High-Speed Digital Signal Processing prepared for Accesion For NTIS CRA&I Office of Naval
More informationHigh-performance and Low-power Consumption Vector Processor for LTE Baseband LSI
High-performance and Low-power Consumption Vector Processor for LTE Baseband LSI Yi Ge Mitsuru Tomono Makiko Ito Yoshio Hirose Recently, the transmission rate for handheld devices has been increasing by
More informationChapter 4. The Processor
Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified
More informationIssue Logic for a 600-MHz Out-of-Order Execution Microprocessor
IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 33, NO. 5, MAY 1998 707 Issue Logic for a 600-MHz Out-of-Order Execution Microprocessor James A. Farrell and Timothy C. Fischer Abstract The logic and circuits
More informationController Synthesis for Hardware Accelerator Design
ler Synthesis for Hardware Accelerator Design Jiang, Hongtu; Öwall, Viktor 2002 Link to publication Citation for published version (APA): Jiang, H., & Öwall, V. (2002). ler Synthesis for Hardware Accelerator
More informationEECS Components and Design Techniques for Digital Systems. Lec 20 RTL Design Optimization 11/6/2007
EECS 5 - Components and Design Techniques for Digital Systems Lec 2 RTL Design Optimization /6/27 Shauki Elassaad Electrical Engineering and Computer Sciences University of California, Berkeley Slides
More informationGeneral Purpose Signal Processors
General Purpose Signal Processors First announced in 1978 (AMD) for peripheral computation such as in printers, matured in early 80 s (TMS320 series). General purpose vs. dedicated architectures: Pros:
More informationEmbedded Systems. 7. System Components
Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic
More informationOPTIMIZATION OF FIR FILTER USING MULTIPLE CONSTANT MULTIPLICATION
OPTIMIZATION OF FIR FILTER USING MULTIPLE CONSTANT MULTIPLICATION 1 S.Ateeb Ahmed, 2 Mr.S.Yuvaraj 1 Student, Department of Electronics and Communication/ VLSI Design SRM University, Chennai, India 2 Assistant
More informationA Memory-Based Programmable Logic Device Using Look-Up Table Cascade with Synchronous Static Random Access Memories
Japanese Journal of Applied Physics Vol., No. B, 200, pp. 329 3300 #200 The Japan Society of Applied Physics A Memory-Based Programmable Logic Device Using Look-Up Table Cascade with Synchronous Static
More informationDESIGN AND SIMULATION OF 1 BIT ARITHMETIC LOGIC UNIT DESIGN USING PASS-TRANSISTOR LOGIC FAMILIES
Volume 120 No. 6 2018, 4453-4466 ISSN: 1314-3395 (on-line version) url: http://www.acadpubl.eu/hub/ http://www.acadpubl.eu/hub/ DESIGN AND SIMULATION OF 1 BIT ARITHMETIC LOGIC UNIT DESIGN USING PASS-TRANSISTOR
More informationLow-Power Technology for Image-Processing LSIs
Low- Technology for Image-Processing LSIs Yoshimi Asada The conventional LSI design assumed power would be supplied uniformly to all parts of an LSI. For a design with multiple supply voltages and a power
More informationExploiting Dynamically Changing Parallelism with a Reconfigurable Array of Homogeneous Sub-cores (a.k.a. Field Programmable Core Array or FPCA)
Exploiting Dynamically Changing Parallelism with a Reconfigurable Array of Homogeneous Sub-cores (a.k.a. Field Programmable Core Array or FPCA) Sponsored by SRC and NSF as a Part of Multicore Chip Design
More informationDesign of 8 bit Pipelined Adder using Xilinx ISE
Design of 8 bit Pipelined Adder using Xilinx ISE 1 Jayesh Diwan, 2 Rutul Patel Assistant Professor EEE Department, Indus University, Ahmedabad, India Abstract An asynchronous circuit, or self-timed circuit,
More informationAbbas El Gamal. Joint work with: Mingjie Lin, Yi-Chang Lu, Simon Wong Work partially supported by DARPA 3D-IC program. Stanford University
Abbas El Gamal Joint work with: Mingjie Lin, Yi-Chang Lu, Simon Wong Work partially supported by DARPA 3D-IC program Stanford University Chip stacking Vertical interconnect density < 20/mm Wafer Stacking
More informationOUTLINE Introduction Power Components Dynamic Power Optimization Conclusions
OUTLINE Introduction Power Components Dynamic Power Optimization Conclusions 04/15/14 1 Introduction: Low Power Technology Process Hardware Architecture Software Multi VTH Low-power circuits Parallelism
More informationA 256-Radix Crossbar Switch Using Mux-Matrix-Mux Folded-Clos Topology
http://dx.doi.org/10.5573/jsts.014.14.6.760 JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.14, NO.6, DECEMBER, 014 A 56-Radix Crossbar Switch Using Mux-Matrix-Mux Folded-Clos Topology Sung-Joon Lee
More informationVLSI Design Automation. Maurizio Palesi
VLSI Design Automation 1 Outline Technology trends VLSI Design flow (an overview) 2 Outline Technology trends VLSI Design flow (an overview) 3 IC Products Processors CPU, DSP, Controllers Memory chips
More informationToday. Comments about assignment Max 1/T (skew = 0) Max clock skew? Comments about assignment 3 ASICs and Programmable logic Others courses
Today Comments about assignment 3-43 Comments about assignment 3 ASICs and Programmable logic Others courses octor Per should show up in the end of the lecture Mealy machines can not be coded in a single
More informationManaging Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks
Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Zhining Huang, Sharad Malik Electrical Engineering Department
More informationArea/Delay Estimation for Digital Signal Processor Cores
Area/Delay Estimation for Digital Signal Processor Cores Yuichiro Miyaoka Yoshiharu Kataoka, Nozomu Togawa Masao Yanagisawa Tatsuo Ohtsuki Dept. of Electronics, Information and Communication Engineering,
More informationDigital Design with FPGAs. By Neeraj Kulkarni
Digital Design with FPGAs By Neeraj Kulkarni Some Basic Electronics Basic Elements: Gates: And, Or, Nor, Nand, Xor.. Memory elements: Flip Flops, Registers.. Techniques to design a circuit using basic
More informationHRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing
HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard
More informationINTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016
NEW VLSI ARCHITECTURE FOR EXPLOITING CARRY- SAVE ARITHMETIC USING VERILOG HDL B.Anusha 1 Ch.Ramesh 2 shivajeehul@gmail.com 1 chintala12271@rediffmail.com 2 1 PG Scholar, Dept of ECE, Ganapathy Engineering
More informationEnergy Aware Optimized Resource Allocation Using Buffer Based Data Flow In MPSOC Architecture
ISSN (Online) : 2319-8753 ISSN (Print) : 2347-6710 International Journal of Innovative Research in Science, Engineering and Technology Volume 3, Special Issue 3, March 2014 2014 International Conference
More informationReal-Time Dynamic Voltage Hopping on MPSoCs
Real-Time Dynamic Voltage Hopping on MPSoCs Tohru Ishihara System LSI Research Center, Kyushu University 2009/08/05 The 9 th International Forum on MPSoC and Multicore 1 Background Low Power / Low Energy
More informationSystem-on-Chip Architecture for Mobile Applications. Sabyasachi Dey
System-on-Chip Architecture for Mobile Applications Sabyasachi Dey Email: sabyasachi.dey@gmail.com Agenda What is Mobile Application Platform Challenges Key Architecture Focus Areas Conclusion Mobile Revolution
More informationImplementation of Ripple Carry and Carry Skip Adders with Speed and Area Efficient
ISSN (Online) : 2278-1021 Implementation of Ripple Carry and Carry Skip Adders with Speed and Area Efficient PUSHPALATHA CHOPPA 1, B.N. SRINIVASA RAO 2 PG Scholar (VLSI Design), Department of ECE, Avanthi
More informationCHAPTER 1 INTRODUCTION
CHAPTER 1 INTRODUCTION Rapid advances in integrated circuit technology have made it possible to fabricate digital circuits with large number of devices on a single chip. The advantages of integrated circuits
More informationA 1-GHz Configurable Processor Core MeP-h1
A 1-GHz Configurable Processor Core MeP-h1 Takashi Miyamori, Takanori Tamai, and Masato Uchiyama SoC Research & Development Center, TOSHIBA Corporation Outline Background Pipeline Structure Bus Interface
More informationDesign and Implementation of 5 Stages Pipelined Architecture in 32 Bit RISC Processor
Design and Implementation of 5 Stages Pipelined Architecture in 32 Bit RISC Processor Abstract The proposed work is the design of a 32 bit RISC (Reduced Instruction Set Computer) processor. The design
More informationKiloCore: A 32 nm 1000-Processor Array
KiloCore: A 32 nm 1000-Processor Array Brent Bohnenstiehl, Aaron Stillmaker, Jon Pimentel, Timothy Andreas, Bin Liu, Anh Tran, Emmanuel Adeagbo, Bevan Baas University of California, Davis VLSI Computation
More informationSerial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing
CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.
More informationA VARIETY OF ICS ARE POSSIBLE DESIGNING FPGAS & ASICS. APPLICATIONS MAY USE STANDARD ICs or FPGAs/ASICs FAB FOUNDRIES COST BILLIONS
architecture behavior of control is if left_paddle then n_state
More informationCo-synthesis and Accelerator based Embedded System Design
Co-synthesis and Accelerator based Embedded System Design COE838: Embedded Computer System http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer
More information2D/3D Graphics Accelerator for Mobile Multimedia Applications. Ramchan Woo, Sohn, Seong-Jun Song, Young-Don
RAMP-IV: A Low-Power and High-Performance 2D/3D Graphics Accelerator for Mobile Multimedia Applications Woo, Sungdae Choi, Ju-Ho Sohn, Seong-Jun Song, Young-Don Bae,, and Hoi-Jun Yoo oratory Dept. of EECS,
More informationThe extreme Adaptive DSP Solution to Sensor Data Processing
The extreme Adaptive DSP Solution to Sensor Data Processing Abstract Martin Vorbach PACT XPP Technologies Leo Mirkin Sky Computers, Inc. The new ISR mobile autonomous sensor platforms present a difficult
More informationFPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST
FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST SAKTHIVEL Assistant Professor, Department of ECE, Coimbatore Institute of Engineering and Technology Abstract- FPGA is
More informationHigh Performance Memory Read Using Cross-Coupled Pull-up Circuitry
High Performance Memory Read Using Cross-Coupled Pull-up Circuitry Katie Blomster and José G. Delgado-Frias School of Electrical Engineering and Computer Science Washington State University Pullman, WA
More informationComputer and Hardware Architecture II. Benny Thörnberg Associate Professor in Electronics
Computer and Hardware Architecture II Benny Thörnberg Associate Professor in Electronics Parallelism Microscopic vs Macroscopic Microscopic parallelism hardware solutions inside system components providing
More informationA 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing
A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing Dean Truong, Wayne Cheng, Tinoosh Mohsenin, Zhiyi Yu, Toney Jacobson, Gouri Landge, Michael Meeuwsen, Christine
More informationFour Steps of Speculative Tomasulo cycle 0
HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly
More informationSynthesis of Language Constructs. 5/10/04 & 5/13/04 Hardware Description Languages and Synthesis
Synthesis of Language Constructs 1 Nets Nets declared to be input or output ports are retained Internal nets may be eliminated due to logic optimization User may force a net to exist trireg, tri0, tri1
More informationLogic Verification 13-1
Logic Verification 13-1 Verification The goal of verification To ensure 100% correct in functionality and timing Spend 50 ~ 70% of time to verify a design Functional verification Simulation Formal proof
More informationAn Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki
An Ultra High Performance Scalable DSP Family for Multimedia Hot Chips 17 August 2005 Stanford, CA Erik Machnicki Media Processing Challenges Increasing performance requirements Need for flexibility &
More informationCALCULATION OF POWER CONSUMPTION IN 7 TRANSISTOR SRAM CELL USING CADENCE TOOL
CALCULATION OF POWER CONSUMPTION IN 7 TRANSISTOR SRAM CELL USING CADENCE TOOL Shyam Akashe 1, Ankit Srivastava 2, Sanjay Sharma 3 1 Research Scholar, Deptt. of Electronics & Comm. Engg., Thapar Univ.,
More informationParallel graph traversal for FPGA
LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,
More informationDesign of a Pipelined 32 Bit MIPS Processor with Floating Point Unit
Design of a Pipelined 32 Bit MIPS Processor with Floating Point Unit P Ajith Kumar 1, M Vijaya Lakshmi 2 P.G. Student, Department of Electronics and Communication Engineering, St.Martin s Engineering College,
More informationCOMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor
COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle
More informationA Low-Power Field Programmable VLSI Based on Autonomous Fine-Grain Power Gating Technique
A Low-Power Field Programmable VLSI Based on Autonomous Fine-Grain Power Gating Technique P. Durga Prasad, M. Tech Scholar, C. Ravi Shankar Reddy, Lecturer, V. Sumalatha, Associate Professor Department
More information