A HARDWARE COMPLETE DETECTION MECHANISM FOR AN ENERGY EFFICIENT RECONFIGURABLE ACCELERATOR CMA

Akihito Tsusaka, Mai Izawa, Rie Uno, Nobuyuki Ozaki, Hideharu Amano
Keio University, Yokohama, Japan

ABSTRACT

Cool Mega Array (CMA) is an energy-efficient Coarse-Grained Reconfigurable processor Array (CGRA) built around a large PE (Processing Element) array. To reduce the power spent on storing intermediate results and on the clock tree, the PE array consists of combinatorial circuits. Until now, the completion time of the PE array has been calculated manually from a delay table and the mapping results, and specified in the micro-code. A hardware completion detection mechanism for CMA is proposed, implemented and evaluated. Each PE uses serially connected buffers with selectable taps, and the delay is chosen according to the operation executed in the PE. Since the completion signal travels along exactly the same paths as the computed data, the delay of the switches and wires is accounted for. The mechanism was implemented in CMA with a 65-nm CMOS process, and post-layout simulation showed that the same performance as without the mechanism is obtained with only 5.1% area overhead and less than 6% extra power consumption. With the mechanism, a single micro-code can be used for various PE array supply voltages. Dynamic changes of the delay caused by temperature and chip-to-chip variation can also be handled.

1. INTRODUCTION

Recent battery-driven mobile devices require high performance for a certain class of applications as well as energy efficiency. As a solution, Coarse-Grained Reconfigurable processor Arrays (CGRAs) [1, 2, 3] have received attention as energy-efficient accelerators, and some of them have been used in commercial products[4, 5]. CMA (Cool Mega Array)[6] has been developed as a highly energy-efficient CGRA. It provides a large PE (Processing Element) array consisting of combinatorial logic.
Data-flow graphs of the target application programs are mapped directly onto the array, and computation is done without storing intermediate results. Neither the energy for storing intermediate results in registers in each PE nor the energy for clock distribution through a clock tree is required. A small microcontroller manages data distribution and collection between the data memory and the registers, which are provided only at the input and output of the PE array. The supply voltage of the PE array can be scaled so that the computation delay in the PE array is well balanced with the time for data management by the microcontroller. The first prototype, CMA-1, using 65-nm CMOS technology achieved 2.7 GOPS/11.2 mW sustained performance, and Cube-1[7], a multicore system consisting of a number of CMA chips, is now available.

One of the most difficult problems of the CMA architecture is how to evaluate the computational delay of the PE array. Ozaki proposed a method to compute the largest delay in the PE array from the result of the application mapping[8]. It uses a table that holds the delay of each PE at various supply voltages; after adding a certain amount of margin, the programmer decides when to store the results from the PE array. However, the method takes into account neither the ambient temperature nor the delay variation of each chip. Safe computation therefore requires a large margin, which degrades both performance and energy efficiency.

To address the problem, a hardware mechanism that detects the completion of execution in the PE array is proposed. It uses a selectable delay line consisting of buffers connected in tandem. The delay is chosen according to the operation executed in the PE. Since the completion signal travels along exactly the same paths as the computed data, the delay of the switches and wires is accounted for.

The rest of the paper is organized as follows: in Section 2, the architecture of CMA is introduced, focusing on the delay estimation method.
A hardware mechanism for detecting the completion of execution is proposed in Section 3. The overhead and efficiency of the mechanism are shown in Section 4. Section 5 concludes the paper with a discussion of future work.

2. CMA ARCHITECTURE AND COMPLETION DETECTION

2.1. The CMA architecture

Like other CGRAs, CMA targets multimedia streaming applications with a large degree of parallelism. By executing many PEs in parallel, it achieves the required performance at a low supply voltage. The important difference from other CGRAs is that it adopts an extreme architecture to save as much energy as possible. Unlike other CGRAs, the large PE array of CMA consists of combinatorial circuits without registers or context memory. Neither the energy for storing intermediate results nor the power for clock distribution inside the PE array is required. Dynamic reconfiguration, which requires a large amount of energy, is not adopted. The configuration data for the PE array is supplied from configuration registers provided outside the PE array and is fixed during execution. The data-flow graph corresponding to the application is mapped statically onto the PE array.

To keep flexibility, a small microcontroller is provided between the PE array and the data memory. It reads data from the data memory and distributes them to the register attached to the input of the PE array. It also collects the results from the register attached to the output of the PE array and writes them back to the data memory. It manages the data transfer between the memory and the registers flexibly by using mapping registers and vector operations. This structure makes it possible to implement various application programs without power-hungry dynamic reconfiguration in the PE array.

Since the computation in the PE array and the data management by the microcontroller are done in a pipelined manner, their execution speeds must be balanced. If the computation delay is shorter than the data management delay, the voltage supplied to the PE array can be reduced until the two are balanced. The total power required for computation can thus be reduced without degrading computing performance. On the other hand, if the data management delay is shorter than the computation delay, wave pipelining in the PE array can be used.
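The balancing rule can be illustrated with a minimal Python sketch that follows the three cases of Fig. 2. The function names and the example timings are hypothetical and are not taken from the CMA tool flow.

```python
def balancing_action(compute_ns: float, data_mgmt_ns: float) -> str:
    """Pick a balancing technique for the pipeline, following the cases of Fig. 2.

    compute_ns   : delay of the PE array computation (stage 2), hypothetical input
    data_mgmt_ns : time the microcontroller needs for data distribution/collection
    """
    if compute_ns < data_mgmt_ns:
        # Fig. 2(b): the PE array has slack, so its supply voltage can be lowered.
        return "voltage scaling"
    if compute_ns > data_mgmt_ns:
        # Fig. 2(c): data manipulation is short, so wave pipelining can be used.
        return "wave pipelining"
    # Fig. 2(a): all stages are balanced.
    return "balanced"


def pipeline_period_ns(compute_ns: float, data_mgmt_ns: float) -> float:
    """Without wave pipelining, the slowest stage sets the pipeline period."""
    return max(compute_ns, data_mgmt_ns)


if __name__ == "__main__":
    print(balancing_action(20.0, 30.0))    # PE array is faster than data management
    print(pipeline_period_ns(20.0, 30.0))  # period is bounded by the slower stage
```

The sketch only classifies the situation; how far the voltage can actually be lowered depends on the measured delay of the mapped application.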
The delay required for wave pipelining can also be controlled by changing the voltage supplied to the PE array.

2.2. Prototype chip CMA-1

The first prototype, CMA-1[6], with an 8 × 8 PE array, was fabricated in 65-nm CMOS technology and achieved 2.4-GOPS/11.2-mW sustained performance. Figure 1 shows the block diagram of CMA-1. It consists of a PE array, a microcontroller, data memory (DMEM) and registers.

Fig. 1. Block diagram of CMA-1 (PE array with columns COL_0 to COL_7, CONF_REG, CONST_REG, 25-bit data channel, 17-bit constant value, microcontroller, 25-bit × 1K DMEM, connection to/from the host CPU)

Here, the computation in the PE array under the control of the microcontroller is described in detail. As shown in Figure 2, the microcontroller consists of a controller, a Fetch register, a Launch register and a Gather register. First, it reads data from DMEM and distributes them to the entries of the Fetch register. Data distribution and collection by the microcontroller were designed to be flexible, so that an arbitrary mapping between data memory addresses and PE array inputs can be set up using address mapping registers. Stride vector access operations are also supported. When the input data in the Fetch register are ready, they are transferred to the Launch register and the computation starts in the PE array. After a certain time interval, the result of the PE array is stored into the Gather register, and the results in the entries of the Gather register are written back into DMEM.

Fig. 2. Pipelined operation of CMA-1: (1) data distribution, (2) computation in the PE array, (3) data collection. (a) All stages are balanced; (b) computation in the PE array is short -> voltage scaling; (c) data manipulation by the microcontroller is short -> wave pipelining
(1) Distribution from DMEM to the Fetch register, (2) computation in the PE array and (3) write-back of the results from the Gather register to DMEM are done in a pipelined manner. Supply voltage scaling of the PE array is used to balance the time of stage (2) with that of the other two stages.

2.3. Microcodes of the CMA-1

The programmable controller of CMA-1 has 16 general-purpose registers and uses 14-bit micro-operations stored in a small (128-entry) micro-code memory. Table 1 shows an example of micro-code. Only the body of a loop is extracted.

Table 1. An example of micro-code

    ...
    LD ADD r0,r8        //1: Load data
    LD ADD r1,r8        //2: to Fetch
    LD ADD r2,r8        //3: register
    LD ADD r3,r8        //4:
    LD ADD r4,r8        //5:
    LD ADD r5,r8        //6:
    SCATTER r9          //7: Scatter
    loop:
    LD ADD r0,r8        //8:
    LD ADD r1,r8        //9:
    LD ADD r2,r8        //10:
    LD ADD r3,r8        //11:
    LD ADD r4,r8        //12:
    LD ADD r5,r8        //13:
    NOP 3               //14:
    GATHER r11,r12,0    //15: Gather
    SCATTER r9          //16:
    ADDI r13,#-1        //17:
    BNEZ r13,loop       //18:
    ...

LD ADD reads data from the data memory and transfers it to an entry of the Fetch register. In this code, r8 is used as a base register, and a predefined value is added to it each time LD ADD is executed. r0-r5 are used as mapping registers. When all data in the Fetch register are ready (Line 6), SCATTER is executed to transfer the content of the Fetch register to the Launch register. At that time, the computation in the PE array starts. During the computation, the microcontroller fetches the second data set (Lines 8-13) and then waits 3 clock cycles for the end of the computation in the PE array (Line 14). Then the results of the PE array are stored in the Gather register. The GATHER instruction transfers the results into the data memory according to the base register (r11) and the mask register (r12). The base register is incremented by the number of transferred data. Although the GATHER instruction takes multiple clock cycles, it is executed automatically by a dedicated controller, so the microcontroller can immediately execute the next micro-code, SCATTER, to start the computation of the next data set. In this example, the loop is iterated until r13 reaches zero. Note that because the PE array computation and the GATHER instruction proceed independently of micro-code execution, the three steps are performed in a pipelined manner, as shown in Figure 2.

2.4. Completion detection

The problem for micro-code designers is that they must estimate the completion time of the PE array and specify it in the micro-code. In this case, from the first SCATTER to GATHER, 6 micro-codes (Lines 8-13) are executed.
When the microcontroller runs at 250 MHz, each micro-code takes 4 ns, so these six take 24 ns. If the delay of the PE array is estimated at about 36 ns, the remaining 12 ns must be covered by 3 additional clock cycles, which are inserted with the NOP 3 micro-code. However, the execution time in the PE array depends on the application. For simple applications that use a small number of PEs, the results are ready after a small delay, while the delay becomes large for complicated applications.

The data flow on the PE array is designed with the Black Diamond retargetable compiler[9]. It compiles a program described in a C-like language, then maps, routes and generates the configuration data for the PE array. Ozaki et al. proposed a method to evaluate the total delay by using the result of mapping and a delay table that covers a range of supply voltages and is based on measurements of real chips[8]. Since the delay differs depending on the operation, the table holds an entry for each PE instruction. The longest path in the PE array can be computed from the mapping results as the sum of the appropriate delays in the table. Since the largest wire delay is assumed for each operation, the computed total delay includes a certain margin.

3. COMPLETION DETECTION MECHANISM

3.1. Related Work

Since CMA uses a large PE array of combinatorial circuits, its completion detection mechanism resembles that of asynchronous systems. Although a large amount of research has been done on asynchronous FPGA architectures, most of them use micro-synchronization mechanisms to recognize the end of computation[10]. Because this takes a large amount of additional hardware, it is difficult to apply to the PE array in CMA. PCA-1/2[11] is a reconfigurable architecture that uses delay lines to send the results to the next cells, and Xia et al. proposed a hybrid architecture using delay and synchronization mechanisms. Although they are based on fine-grained reconfigurable architectures, using a delay line is a cost-efficient way to recognize the completion of computation.
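The manual estimate described earlier, in which the longest path is computed from the mapping result and a delay table and the micro-code is then padded with NOPs, can be sketched in Python as follows. This is only an illustration: the delay values, the graph encoding and the function names are hypothetical, and only the 250 MHz, 6 micro-codes and 36 ns figures come from the text.

```python
import math

# Hypothetical per-operation delays (ns) at one supply voltage.
# Real values would come from the measured delay table of [8].
DELAY_TABLE_NS = {"ADD": 3.0, "MUL": 2.5, "LOGIC": 1.0, "SHIFT": 1.5}


def longest_path_ns(preds, ops, table):
    """Largest arrival time over a mapped data-flow graph (a DAG).

    preds : node -> list of predecessor nodes (empty list for array inputs)
    ops   : node -> operation name used to look up the delay table
    """
    memo = {}

    def arrival(node):
        if node not in memo:
            latest_input = max((arrival(p) for p in preds[node]), default=0.0)
            memo[node] = latest_input + table[ops[node]]
        return memo[node]

    return max(arrival(n) for n in preds)


def nop_cycles(pe_delay_ns, clock_mhz, overlapped_microcodes):
    """Extra wait cycles needed between SCATTER and GATHER."""
    period_ns = 1000.0 / clock_mhz
    covered_ns = overlapped_microcodes * period_ns  # time hidden behind LD ADDs
    return max(0, math.ceil((pe_delay_ns - covered_ns) / period_ns))


if __name__ == "__main__":
    preds = {"a": [], "b": [], "c": ["a", "b"], "d": ["c"]}
    ops = {"a": "MUL", "b": "ADD", "c": "ADD", "d": "SHIFT"}
    print(longest_path_ns(preds, ops, DELAY_TABLE_NS))
    # Figures from the text: 250 MHz, 6 micro-codes, about 36 ns PE array delay.
    print(nop_cycles(36.0, 250.0, 6))
```

Because the delay table depends on supply voltage, this calculation has to be repeated, and the NOP count rewritten, for every voltage setting.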
Techniques for controlling delay mechanisms have been well studied and are available[12]. For coarse-grained architectures, a dataflow-driven execution control mechanism has been proposed[13], but it targets a general PE array with a clock.

3.2. The concept of the proposed method

The current delay estimation method has the following problems: (1) When the supply voltage is scaled, the micro-code must be changed. Different codes must be provided when the voltage is scaled dynamically. (2) The temperature and the delay variation of each chip are not taken into account. The delay variation will become larger when a low supply voltage is used in future processes. For safe operation, a large margin is required.

To address these problems, we propose a hardware completion detection mechanism with tandem-connected buffers. Figure 3 shows the concept of the mechanism. When the SCATTER instruction is executed, the completion signals attached to all input data are asserted at the input of the PE array. They propagate along exactly the same route as the input data. In each PE, the completion signal is delayed by serially connected buffers whose delay is set according to the operation executed in the PE. When two completion signals join at a PE, the earlier asserted input must wait for the later one. When all completion signals attached to the output data are asserted, the results are stored into the Gather register.

Fig. 3. Hardware Completion Mechanism (data signals and completion signals routed through the PE array)

The key implementation issue is that the completion signal must flow in the same way as the data. For this purpose, we implemented it as an extra bit of the data bus. Thus, it has the same fan-out and almost the same routing path as the other data bits. As shown in Figure 4, all PEs in CMA are naturally aligned in a two-dimensional structure, so the completion signals can be routed in the same manner as the corresponding data wires and their delay becomes almost the same.

Fig. 4. The layout of the PE array

3.3. Delay line in the PE

As shown in Figure 5, the completion signals from both inputs are forwarded through an AND gate to the delay line, which is implemented with buffers connected in tandem. When constant data is used or the PE executes a single-operand instruction, the corresponding completion signal input is set to H beforehand. The delay line has several taps, and the signal with the appropriate delay is selected by the output multiplexer. Since the operation of a PE is defined by the configuration data, the multiplexer selection is also set by the configuration data of each PE.

3.4. Implementation of the mechanism

3.4.1. Design environment

The proposed completion detection mechanism is implemented in the CMA architecture shown in Table 2. Design tools are shown in the same table.
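The propagation rule described above for the completion signal (an AND of the incoming completion signals followed by an operation-dependent tapped delay) can be modeled with a short Python sketch. The buffer counts per operation class, the PE names and the graph encoding are hypothetical; the sketch also ignores wire and switch delay, which the hardware captures by routing the signal along the data path.

```python
# Hypothetical tap settings: number of unit buffers selected per operation class.
# These are illustrative values only, not the settings used in the chip.
TAP_BUFFERS = {"ADD/SUB": 8, "MUL": 7, "LOGIC": 2, "SHIFT": 3}


def completion_times(pes, unit_delay_ps):
    """Time (ps after SCATTER) at which each PE asserts its completion output.

    pes : PE name -> (operation class, [input sources]); a source is
          "IN"  for an array input asserted at SCATTER (t = 0),
          None  for a constant or unused operand (tied to H beforehand),
          or the name of another PE.
    """
    done = {}

    def asserted_at(src):
        if src is None or src == "IN":
            return 0
        if src not in done:
            op, inputs = pes[src]
            # AND gate: the earlier input waits for the later one.
            and_out = max(asserted_at(i) for i in inputs)
            # Delay line: the tap selected by the configured operation.
            done[src] = and_out + TAP_BUFFERS[op] * unit_delay_ps
        return done[src]

    for name in pes:
        asserted_at(name)
    return done


if __name__ == "__main__":
    pes = {
        "p0": ("MUL", ["IN", "IN"]),
        "p1": ("LOGIC", ["IN", None]),   # second operand is a constant
        "p2": ("ADD/SUB", ["p0", "p1"]),
    }
    times = completion_times(pes, unit_delay_ps=69)
    print(times)
    print(max(times.values()))  # earliest time the results may be gathered
```

In this model the slower of the two operands always determines when a PE reports completion, which mirrors the waiting behavior at the AND gate.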
The target CMA is almost the same as CMA-1, except that it provides dedicated links for transferring constant data. This improvement was shown to reduce the loading time of the configuration data[14].

Fig. 5. Delay Line in a PE (inputs DATA_A, DATA_B with completion inputs DELAY_A, DELAY_B; ALU configured by ALU_CONF; delay-line (DL) buffers; completion output DELAY OUT)

Table 2. Specifications of the target CMA

    Technology         Fujitsu e-shuttle 65-nm 12-metal CMOS
    Cell Library       CS202SZ low-power standard cell library
    Supply Voltage     V for PE array (1.2V for evaluation)
    PE                 24-bit ALU, 64-bit
    Network            2-lane island-style, 2 direct links
    Micro controller   14-bit micro-codes, 16 instructions, 128 entries, 8 GPRs, 8-address register, 4-base register
    Clock frequency    210 MHz
    Synthesis          Design Compiler SP5
    Layout             IC Compiler 2009
    Analysis           Primetime SP3

3.4.2. Delay in the PE

The positions of the taps are decided from the post-layout simulation results of the PE. According to the analysis results, four taps are provided for the corresponding operations: ADD/SUB, MUL, LOGICs and SHIFTs. The buffer SC23BUFXA1, whose maximum delay is 69 ps, is used to build the delay line. Table 3 shows the number of buffers for each tap. In our PE, the delay of the ADD/SUB operation is slightly larger than that of the MUL operation. This is because a high-speed multiplier is adopted for multiplication, while a ripple-carry adder is used for the ADD/SUB operation.

3.4.3. Modification of micro-code

The SCATTER and GATHER instructions are modified as follows:

SCATTER: When the SCATTER instruction executes, all completion signals attached to the available inputs are asserted. When all propagated completion signals are asserted at the output and the GATHER instruction is executed, the input completion signals are negated. Until then, the next SCATTER instruction is suspended.

GATHER: When all completion signals are ready at the output of the PE array and the GATHER instruction is executed, the results are stored into the Gather register. If the GATHER instruction is issued before the completion signals are ready, the execution of the microcontroller is stalled.

With this modification, designers do not have to worry about the timing of issuing the GATHER instruction. The results are stored in the Gather register and then automatically written back to the data memory.

Although the hardware completion detection mechanism is in principle well suited to wave pipelining, the current implementation rules it out, because the next SCATTER instruction is suspended until the input completion signals are negated. The restriction was imposed just for safety; relaxing the condition on multiple SCATTER instructions to enable wave pipelining is our future work.

4. EVALUATION

4.1. The delay time

We implemented four simple image filter programs on CMA with the completion detection mechanism: an alpha blender, a sepia filter, a gray scale filter and an edge detection filter. All programs worked correctly in post-layout simulation including wiring and parasitic capacitance delay. Figure 6 shows the maximum delay actually measured in the PE during the simulation. The graph shows that the delay given by the tandem-connected buffers is appropriate, with a margin of about 10%. The execution time is exactly the same as in the case without the completion detection mechanism, since the delay in the micro-code of that case is well tuned. However, with the mechanism the same code can be used at any supply voltage or temperature.

Table 3.
The setting of the delay line

    Operation    Num of buffers    Delay time (ps)
    ADD/SUB
    MULT
    LOGICs
    SHIFT

4.2. The Overhead

The proposed mechanism requires additional hardware, which introduces area and power overhead. The area increases by 5.1% relative to the design without the mechanism. Since the additional hardware is simple, consisting of buffers, multiplexers and AND gates, the area increase is small. Figure 7 shows the average power consumption in a PE when the applications are executed. The power consumed by the completion detection mechanism is only about 6%. In this implementation, only one bit per data word toggles in each computation. This is why the power consumption of the mechanism is not large.

Fig. 6. The delay in the PE

Fig. 7. The power consumed in a PE ARRAY

5. CONCLUSION

A hardware completion detection mechanism for CMA is proposed, implemented and evaluated. Each PE uses serially connected buffers with selectable taps, and the delay is chosen according to the operation executed in the PE. Since the completion signal travels along exactly the same paths as the computed data, the delay of the switches and wires is accounted for. The mechanism was implemented in CMA with a 65-nm CMOS process, and post-layout simulation showed that the same performance as without the mechanism is obtained with only 5.1% area overhead and less than 6% extra power consumption. With the mechanism, a single code can be used for various PE array supply voltages. Dynamic delay variation caused by temperature changes or chip-to-chip variation is also handled. The proposed mechanism is in principle well suited to wave pipelining. At present, wave pipelining is implemented on CMA-1 by careful manual tuning. Executing wave pipelining with the proposed hardware completion mechanism is our future work.

Acknowledgments

Part of this research was performed under Core Research for Evolutional Science and Technology (CREST) of the Japan Science and Technology Agency (JST). This work is also supported by the VLSI Design and Education Center (VDEC), the University of Tokyo, in collaboration with Cadence Design Systems, Inc.

REFERENCES

[1] F.J.Veradas, M.Scheppler, W.Moffat, B.Mei, Custom Implementation of the Coarse-Grained Reconfigurable ADRES architecture for multimedia Purposes, in Proc. of International Conference on Field Programmable Logic and Applications (FPL05), 2005.
[2] C.Ebeling, D.C.Cronquist and P.Franklin, Rapid - Reconfigurable Pipelined Datapath, in Proc. of the FPL 2004.
[3] H. Amano, Y. Hasegawa, S. Tsutsumi, T. Nakamura, T. Nisimura, V. Tunbunheng, A. Parimala, T. Sano and M. Kato, MuCCRA Chips: Configurable Dynamically-Reconfigurable Processors, in Proc. of ASSCC, Nov. 2007.
[4] M. Motomura, STP Engine, a C-based Programmable HW Core featuring Massively Parallel and Reconfigurable PE Array: its Architecture, Tool, and System Implications, in Proc. of CoolChips XII.
[5] H-S.Kim, M.Ann, J.A.Sratton, W.Mei, W.Hwu, ULP-SRP: Ultra Low Power Samsung Reconfigurable Processor for Biomedical Applications, in Proc. of ICFPT 2012, 2012.
[6] N.Ozaki, Y.Yasuda, Y.Saito, D.Ikebuchi, M.Kimura, H.Amano, H.Nakamura, K.Usami, M.Namiki, M.Kondo, Cool Mega-Arrays: Ultralow-Power Reconfigurable Accelerator Chips, IEEE Micro, vol. 31, pp. 6-18.
[7] Y. Koizumi, et al., CMA-Cube: a scalable reconfigurable accelerator with 3-D wireless inductive coupling interconnect, in Proc. of the FPL 2012, Aug.
[8] N.Ozaki, et al., Cool Mega-Arrays: A highly energy efficient accelerator, in Proc. of ICFPT 2011.
[9] V. Tunbunheng and H. Amano, Black-Diamond: a Retargetable Compiler Using Graph with Configuration Bits for Dynamically Reconfigurable Architectures, in Proc. of The 14th Workshop on Synthesis And System Integration of Mixed Information technologies (SASIMI), 2007.
[10] J. Teifel, R. Manohar, An Asynchronous Dataflow FPGA Architecture, IEEE Trans. on Computers, vol. 53, no. 11, November.
[11] R.Konishi, H.Ito, H.Nakada, A.Nagoya, K.Oguri, N.Imlig, T.Shiozawa, M.Inamori, K.Nagami, PCA-1: A Fully Asynchronous Self-Reconfigurable LSI, in Proc. of Int'l Symp. Asynchronous Circuits and Systems.
[12] M.Onouchi, A low-power wide-range clock synchronizer with predictive-delay-adjustment scheme for continuous voltage scaling in DVFS, IEEE Journal of Solid-State Circuits, vol. 45, no. 380, November.
[13] R.Panda, C.Ebeling, S.Hauck, Adding dataflow-driven Execution Control to a Coarse-Grained Reconfigurable Array, in Proc. of FPL.
[14] R.Uno, N.Ozaki, H.Amano, A Research of PE Array Connection Network for Cool Mega-Array, in Proc. of Int. Workshop on Renewable Computing Systems, March.


More information

High performance, power-efficient DSPs based on the TI C64x

High performance, power-efficient DSPs based on the TI C64x High performance, power-efficient DSPs based on the TI C64x Sridhar Rajagopal, Joseph R. Cavallaro, Scott Rixner Rice University {sridhar,cavallar,rixner}@rice.edu RICE UNIVERSITY Recent (2003) Research

More information

the main limitations of the work is that wiring increases with 1. INTRODUCTION

the main limitations of the work is that wiring increases with 1. INTRODUCTION Design of Low Power Speculative Han-Carlson Adder S.Sangeetha II ME - VLSI Design, Akshaya College of Engineering and Technology, Coimbatore sangeethasoctober@gmail.com S.Kamatchi Assistant Professor,

More information

Hardware-Software Codesign. 1. Introduction

Hardware-Software Codesign. 1. Introduction Hardware-Software Codesign 1. Introduction Lothar Thiele 1-1 Contents What is an Embedded System? Levels of Abstraction in Electronic System Design Typical Design Flow of Hardware-Software Systems 1-2

More information

Vertex Shader Design I

Vertex Shader Design I The following content is extracted from the paper shown in next page. If any wrong citation or reference missing, please contact ldvan@cs.nctu.edu.tw. I will correct the error asap. This course used only

More information

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Éricles Sousa 1, Frank Hannig 1, Jürgen Teich 1, Qingqing Chen 2, and Ulf Schlichtmann

More information

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University

Hardware Design Environments. Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University Hardware Design Environments Dr. Mahdi Abbasi Computer Engineering Department Bu-Ali Sina University Outline Welcome to COE 405 Digital System Design Design Domains and Levels of Abstractions Synthesis

More information

Gated-Demultiplexer Tree Buffer for Low Power Using Clock Tree Based Gated Driver

Gated-Demultiplexer Tree Buffer for Low Power Using Clock Tree Based Gated Driver Gated-Demultiplexer Tree Buffer for Low Power Using Clock Tree Based Gated Driver E.Kanniga 1, N. Imocha Singh 2,K.Selva Rama Rathnam 3 Professor Department of Electronics and Telecommunication, Bharath

More information

Implimentation of A 16-bit RISC Processor for Convolution Application

Implimentation of A 16-bit RISC Processor for Convolution Application Advance in Electronic and Electric Engineering. ISSN 2231-1297, Volume 4, Number 5 (2014), pp. 441-446 Research India Publications http://www.ripublication.com/aeee.htm Implimentation of A 16-bit RISC

More information

HIGH-LEVEL SYNTHESIS

HIGH-LEVEL SYNTHESIS HIGH-LEVEL SYNTHESIS Page 1 HIGH-LEVEL SYNTHESIS High-level synthesis: the automatic addition of structural information to a design described by an algorithm. BEHAVIORAL D. STRUCTURAL D. Systems Algorithms

More information

Evaluation of Space Allocation Circuits

Evaluation of Space Allocation Circuits Evaluation of Space Allocation Circuits Shinya Kyusaka 1, Hayato Higuchi 1, Taichi Nagamoto 1, Yuichiro Shibata 2, and Kiyoshi Oguri 2 1 Department of Electrical Engineering and Computer Science, Graduate

More information

A Reconfigurable Multifunction Computing Cache Architecture

A Reconfigurable Multifunction Computing Cache Architecture IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 9, NO. 4, AUGUST 2001 509 A Reconfigurable Multifunction Computing Cache Architecture Huesung Kim, Student Member, IEEE, Arun K. Somani,

More information

Computer and Hardware Architecture I. Benny Thörnberg Associate Professor in Electronics

Computer and Hardware Architecture I. Benny Thörnberg Associate Professor in Electronics Computer and Hardware Architecture I Benny Thörnberg Associate Professor in Electronics Hardware architecture Computer architecture The functionality of a modern computer is so complex that no human can

More information

The Nios II Family of Configurable Soft-core Processors

The Nios II Family of Configurable Soft-core Processors The Nios II Family of Configurable Soft-core Processors James Ball August 16, 2005 2005 Altera Corporation Agenda Nios II Introduction Configuring your CPU FPGA vs. ASIC CPU Design Instruction Set Architecture

More information

Testability Optimizations for A Time Multiplexed CPLD Implemented on Structured ASIC Technology

Testability Optimizations for A Time Multiplexed CPLD Implemented on Structured ASIC Technology ROMANIAN JOURNAL OF INFORMATION SCIENCE AND TECHNOLOGY Volume 14, Number 4, 2011, 392 398 Testability Optimizations for A Time Multiplexed CPLD Implemented on Structured ASIC Technology Traian TULBURE

More information

Implementation of Reduce the Area- Power Efficient Fixed-Point LMS Adaptive Filter with Low Adaptation-Delay

Implementation of Reduce the Area- Power Efficient Fixed-Point LMS Adaptive Filter with Low Adaptation-Delay Implementation of Reduce the Area- Power Efficient Fixed-Point LMS Adaptive Filter with Low Adaptation-Delay A.Sakthivel 1, A.Lalithakumar 2, T.Kowsalya 3 PG Scholar [VLSI], Muthayammal Engineering College,

More information

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN

International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October ISSN International Journal of Scientific & Engineering Research, Volume 4, Issue 10, October-2013 1502 Design and Characterization of Koggestone, Sparse Koggestone, Spanning tree and Brentkung Adders V. Krishna

More information

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures

Storage I/O Summary. Lecture 16: Multimedia and DSP Architectures Storage I/O Summary Storage devices Storage I/O Performance Measures» Throughput» Response time I/O Benchmarks» Scaling to track technological change» Throughput with restricted response time is normal

More information

RECENTLY, researches on gigabit wireless personal area

RECENTLY, researches on gigabit wireless personal area 146 IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: EXPRESS BRIEFS, VOL. 55, NO. 2, FEBRUARY 2008 An Indexed-Scaling Pipelined FFT Processor for OFDM-Based WPAN Applications Yuan Chen, Student Member, IEEE,

More information

A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications

A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications A 50Mvertices/s Graphics Processor with Fixed-Point Programmable Vertex Shader for Mobile Applications Ju-Ho Sohn, Jeong-Ho Woo, Min-Wuk Lee, Hye-Jung Kim, Ramchan Woo, Hoi-Jun Yoo Semiconductor System

More information

Structure of Computer Systems

Structure of Computer Systems 288 between this new matrix and the initial collision matrix M A, because the original forbidden latencies for functional unit A still have to be considered in later initiations. Figure 5.37. State diagram

More information

Compact Clock Skew Scheme for FPGA based Wave- Pipelined Circuits

Compact Clock Skew Scheme for FPGA based Wave- Pipelined Circuits International Journal of Communication Engineering and Technology. ISSN 2277-3150 Volume 3, Number 1 (2013), pp. 13-22 Research India Publications http://www.ripublication.com Compact Clock Skew Scheme

More information

O PT I C Alan N. Willson, Jr. AD-A ppiov' 9!lj" 2' 2 1,3 9. Quarterly Progress Report. (October 1, 1992 through December 31, 1992)

O PT I C Alan N. Willson, Jr. AD-A ppiov' 9!lj 2' 2 1,3 9. Quarterly Progress Report. (October 1, 1992 through December 31, 1992) AD-A260 754 Quarterly Progress Report (October 1, 1992 through December 31, 1992) O PT I C on " 041 o 993 VLSI for High-Speed Digital Signal Processing prepared for Accesion For NTIS CRA&I Office of Naval

More information

High-performance and Low-power Consumption Vector Processor for LTE Baseband LSI

High-performance and Low-power Consumption Vector Processor for LTE Baseband LSI High-performance and Low-power Consumption Vector Processor for LTE Baseband LSI Yi Ge Mitsuru Tomono Makiko Ito Yoshio Hirose Recently, the transmission rate for handheld devices has been increasing by

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Issue Logic for a 600-MHz Out-of-Order Execution Microprocessor

Issue Logic for a 600-MHz Out-of-Order Execution Microprocessor IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 33, NO. 5, MAY 1998 707 Issue Logic for a 600-MHz Out-of-Order Execution Microprocessor James A. Farrell and Timothy C. Fischer Abstract The logic and circuits

More information

Controller Synthesis for Hardware Accelerator Design

Controller Synthesis for Hardware Accelerator Design ler Synthesis for Hardware Accelerator Design Jiang, Hongtu; Öwall, Viktor 2002 Link to publication Citation for published version (APA): Jiang, H., & Öwall, V. (2002). ler Synthesis for Hardware Accelerator

More information

EECS Components and Design Techniques for Digital Systems. Lec 20 RTL Design Optimization 11/6/2007

EECS Components and Design Techniques for Digital Systems. Lec 20 RTL Design Optimization 11/6/2007 EECS 5 - Components and Design Techniques for Digital Systems Lec 2 RTL Design Optimization /6/27 Shauki Elassaad Electrical Engineering and Computer Sciences University of California, Berkeley Slides

More information

General Purpose Signal Processors

General Purpose Signal Processors General Purpose Signal Processors First announced in 1978 (AMD) for peripheral computation such as in printers, matured in early 80 s (TMS320 series). General purpose vs. dedicated architectures: Pros:

More information

Embedded Systems. 7. System Components

Embedded Systems. 7. System Components Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic

More information

OPTIMIZATION OF FIR FILTER USING MULTIPLE CONSTANT MULTIPLICATION

OPTIMIZATION OF FIR FILTER USING MULTIPLE CONSTANT MULTIPLICATION OPTIMIZATION OF FIR FILTER USING MULTIPLE CONSTANT MULTIPLICATION 1 S.Ateeb Ahmed, 2 Mr.S.Yuvaraj 1 Student, Department of Electronics and Communication/ VLSI Design SRM University, Chennai, India 2 Assistant

More information

A Memory-Based Programmable Logic Device Using Look-Up Table Cascade with Synchronous Static Random Access Memories

A Memory-Based Programmable Logic Device Using Look-Up Table Cascade with Synchronous Static Random Access Memories Japanese Journal of Applied Physics Vol., No. B, 200, pp. 329 3300 #200 The Japan Society of Applied Physics A Memory-Based Programmable Logic Device Using Look-Up Table Cascade with Synchronous Static

More information

DESIGN AND SIMULATION OF 1 BIT ARITHMETIC LOGIC UNIT DESIGN USING PASS-TRANSISTOR LOGIC FAMILIES

DESIGN AND SIMULATION OF 1 BIT ARITHMETIC LOGIC UNIT DESIGN USING PASS-TRANSISTOR LOGIC FAMILIES Volume 120 No. 6 2018, 4453-4466 ISSN: 1314-3395 (on-line version) url: http://www.acadpubl.eu/hub/ http://www.acadpubl.eu/hub/ DESIGN AND SIMULATION OF 1 BIT ARITHMETIC LOGIC UNIT DESIGN USING PASS-TRANSISTOR

More information

Low-Power Technology for Image-Processing LSIs

Low-Power Technology for Image-Processing LSIs Low- Technology for Image-Processing LSIs Yoshimi Asada The conventional LSI design assumed power would be supplied uniformly to all parts of an LSI. For a design with multiple supply voltages and a power

More information

Exploiting Dynamically Changing Parallelism with a Reconfigurable Array of Homogeneous Sub-cores (a.k.a. Field Programmable Core Array or FPCA)

Exploiting Dynamically Changing Parallelism with a Reconfigurable Array of Homogeneous Sub-cores (a.k.a. Field Programmable Core Array or FPCA) Exploiting Dynamically Changing Parallelism with a Reconfigurable Array of Homogeneous Sub-cores (a.k.a. Field Programmable Core Array or FPCA) Sponsored by SRC and NSF as a Part of Multicore Chip Design

More information

Design of 8 bit Pipelined Adder using Xilinx ISE

Design of 8 bit Pipelined Adder using Xilinx ISE Design of 8 bit Pipelined Adder using Xilinx ISE 1 Jayesh Diwan, 2 Rutul Patel Assistant Professor EEE Department, Indus University, Ahmedabad, India Abstract An asynchronous circuit, or self-timed circuit,

More information

Abbas El Gamal. Joint work with: Mingjie Lin, Yi-Chang Lu, Simon Wong Work partially supported by DARPA 3D-IC program. Stanford University

Abbas El Gamal. Joint work with: Mingjie Lin, Yi-Chang Lu, Simon Wong Work partially supported by DARPA 3D-IC program. Stanford University Abbas El Gamal Joint work with: Mingjie Lin, Yi-Chang Lu, Simon Wong Work partially supported by DARPA 3D-IC program Stanford University Chip stacking Vertical interconnect density < 20/mm Wafer Stacking

More information

OUTLINE Introduction Power Components Dynamic Power Optimization Conclusions

OUTLINE Introduction Power Components Dynamic Power Optimization Conclusions OUTLINE Introduction Power Components Dynamic Power Optimization Conclusions 04/15/14 1 Introduction: Low Power Technology Process Hardware Architecture Software Multi VTH Low-power circuits Parallelism

More information

A 256-Radix Crossbar Switch Using Mux-Matrix-Mux Folded-Clos Topology

A 256-Radix Crossbar Switch Using Mux-Matrix-Mux Folded-Clos Topology http://dx.doi.org/10.5573/jsts.014.14.6.760 JOURNAL OF SEMICONDUCTOR TECHNOLOGY AND SCIENCE, VOL.14, NO.6, DECEMBER, 014 A 56-Radix Crossbar Switch Using Mux-Matrix-Mux Folded-Clos Topology Sung-Joon Lee

More information

VLSI Design Automation. Maurizio Palesi

VLSI Design Automation. Maurizio Palesi VLSI Design Automation 1 Outline Technology trends VLSI Design flow (an overview) 2 Outline Technology trends VLSI Design flow (an overview) 3 IC Products Processors CPU, DSP, Controllers Memory chips

More information

Today. Comments about assignment Max 1/T (skew = 0) Max clock skew? Comments about assignment 3 ASICs and Programmable logic Others courses

Today. Comments about assignment Max 1/T (skew = 0) Max clock skew? Comments about assignment 3 ASICs and Programmable logic Others courses Today Comments about assignment 3-43 Comments about assignment 3 ASICs and Programmable logic Others courses octor Per should show up in the end of the lecture Mealy machines can not be coded in a single

More information

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks

Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Managing Dynamic Reconfiguration Overhead in Systems-on-a-Chip Design Using Reconfigurable Datapaths and Optimized Interconnection Networks Zhining Huang, Sharad Malik Electrical Engineering Department

More information

Area/Delay Estimation for Digital Signal Processor Cores

Area/Delay Estimation for Digital Signal Processor Cores Area/Delay Estimation for Digital Signal Processor Cores Yuichiro Miyaoka Yoshiharu Kataoka, Nozomu Togawa Masao Yanagisawa Tatsuo Ohtsuki Dept. of Electronics, Information and Communication Engineering,

More information

Digital Design with FPGAs. By Neeraj Kulkarni

Digital Design with FPGAs. By Neeraj Kulkarni Digital Design with FPGAs By Neeraj Kulkarni Some Basic Electronics Basic Elements: Gates: And, Or, Nor, Nand, Xor.. Memory elements: Flip Flops, Registers.. Techniques to design a circuit using basic

More information

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing

HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing HRL: Efficient and Flexible Reconfigurable Logic for Near-Data Processing Mingyu Gao and Christos Kozyrakis Stanford University http://mast.stanford.edu HPCA March 14, 2016 PIM is Coming Back End of Dennard

More information

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016

INTERNATIONAL JOURNAL OF PROFESSIONAL ENGINEERING STUDIES Volume VII /Issue 2 / OCT 2016 NEW VLSI ARCHITECTURE FOR EXPLOITING CARRY- SAVE ARITHMETIC USING VERILOG HDL B.Anusha 1 Ch.Ramesh 2 shivajeehul@gmail.com 1 chintala12271@rediffmail.com 2 1 PG Scholar, Dept of ECE, Ganapathy Engineering

More information

Energy Aware Optimized Resource Allocation Using Buffer Based Data Flow In MPSOC Architecture

Energy Aware Optimized Resource Allocation Using Buffer Based Data Flow In MPSOC Architecture ISSN (Online) : 2319-8753 ISSN (Print) : 2347-6710 International Journal of Innovative Research in Science, Engineering and Technology Volume 3, Special Issue 3, March 2014 2014 International Conference

More information

Real-Time Dynamic Voltage Hopping on MPSoCs

Real-Time Dynamic Voltage Hopping on MPSoCs Real-Time Dynamic Voltage Hopping on MPSoCs Tohru Ishihara System LSI Research Center, Kyushu University 2009/08/05 The 9 th International Forum on MPSoC and Multicore 1 Background Low Power / Low Energy

More information

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey System-on-Chip Architecture for Mobile Applications Sabyasachi Dey Email: sabyasachi.dey@gmail.com Agenda What is Mobile Application Platform Challenges Key Architecture Focus Areas Conclusion Mobile Revolution

More information

Implementation of Ripple Carry and Carry Skip Adders with Speed and Area Efficient

Implementation of Ripple Carry and Carry Skip Adders with Speed and Area Efficient ISSN (Online) : 2278-1021 Implementation of Ripple Carry and Carry Skip Adders with Speed and Area Efficient PUSHPALATHA CHOPPA 1, B.N. SRINIVASA RAO 2 PG Scholar (VLSI Design), Department of ECE, Avanthi

More information

CHAPTER 1 INTRODUCTION

CHAPTER 1 INTRODUCTION CHAPTER 1 INTRODUCTION Rapid advances in integrated circuit technology have made it possible to fabricate digital circuits with large number of devices on a single chip. The advantages of integrated circuits

More information

A 1-GHz Configurable Processor Core MeP-h1

A 1-GHz Configurable Processor Core MeP-h1 A 1-GHz Configurable Processor Core MeP-h1 Takashi Miyamori, Takanori Tamai, and Masato Uchiyama SoC Research & Development Center, TOSHIBA Corporation Outline Background Pipeline Structure Bus Interface

More information

Design and Implementation of 5 Stages Pipelined Architecture in 32 Bit RISC Processor

Design and Implementation of 5 Stages Pipelined Architecture in 32 Bit RISC Processor Design and Implementation of 5 Stages Pipelined Architecture in 32 Bit RISC Processor Abstract The proposed work is the design of a 32 bit RISC (Reduced Instruction Set Computer) processor. The design

More information

KiloCore: A 32 nm 1000-Processor Array

KiloCore: A 32 nm 1000-Processor Array KiloCore: A 32 nm 1000-Processor Array Brent Bohnenstiehl, Aaron Stillmaker, Jon Pimentel, Timothy Andreas, Bin Liu, Anh Tran, Emmanuel Adeagbo, Bevan Baas University of California, Davis VLSI Computation

More information

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing

Serial. Parallel. CIT 668: System Architecture 2/14/2011. Topics. Serial and Parallel Computation. Parallel Computing CIT 668: System Architecture Parallel Computing Topics 1. What is Parallel Computing? 2. Why use Parallel Computing? 3. Types of Parallelism 4. Amdahl s Law 5. Flynn s Taxonomy of Parallel Computers 6.

More information

Co-synthesis and Accelerator based Embedded System Design

Co-synthesis and Accelerator based Embedded System Design Co-synthesis and Accelerator based Embedded System Design COE838: Embedded Computer System http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer

More information

2D/3D Graphics Accelerator for Mobile Multimedia Applications. Ramchan Woo, Sohn, Seong-Jun Song, Young-Don

2D/3D Graphics Accelerator for Mobile Multimedia Applications. Ramchan Woo, Sohn, Seong-Jun Song, Young-Don RAMP-IV: A Low-Power and High-Performance 2D/3D Graphics Accelerator for Mobile Multimedia Applications Woo, Sungdae Choi, Ju-Ho Sohn, Seong-Jun Song, Young-Don Bae,, and Hoi-Jun Yoo oratory Dept. of EECS,

More information

The extreme Adaptive DSP Solution to Sensor Data Processing

The extreme Adaptive DSP Solution to Sensor Data Processing The extreme Adaptive DSP Solution to Sensor Data Processing Abstract Martin Vorbach PACT XPP Technologies Leo Mirkin Sky Computers, Inc. The new ISR mobile autonomous sensor platforms present a difficult

More information

FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST

FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST FPGA IMPLEMENTATION OF FLOATING POINT ADDER AND MULTIPLIER UNDER ROUND TO NEAREST SAKTHIVEL Assistant Professor, Department of ECE, Coimbatore Institute of Engineering and Technology Abstract- FPGA is

More information

High Performance Memory Read Using Cross-Coupled Pull-up Circuitry

High Performance Memory Read Using Cross-Coupled Pull-up Circuitry High Performance Memory Read Using Cross-Coupled Pull-up Circuitry Katie Blomster and José G. Delgado-Frias School of Electrical Engineering and Computer Science Washington State University Pullman, WA

More information

Computer and Hardware Architecture II. Benny Thörnberg Associate Professor in Electronics

Computer and Hardware Architecture II. Benny Thörnberg Associate Professor in Electronics Computer and Hardware Architecture II Benny Thörnberg Associate Professor in Electronics Parallelism Microscopic vs Macroscopic Microscopic parallelism hardware solutions inside system components providing

More information

A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing

A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing A 167-processor Computational Array for Highly-Efficient DSP and Embedded Application Processing Dean Truong, Wayne Cheng, Tinoosh Mohsenin, Zhiyi Yu, Toney Jacobson, Gouri Landge, Michael Meeuwsen, Christine

More information

Four Steps of Speculative Tomasulo cycle 0

Four Steps of Speculative Tomasulo cycle 0 HW support for More ILP Hardware Speculative Execution Speculation: allow an instruction to issue that is dependent on branch, without any consequences (including exceptions) if branch is predicted incorrectly

More information

Synthesis of Language Constructs. 5/10/04 & 5/13/04 Hardware Description Languages and Synthesis

Synthesis of Language Constructs. 5/10/04 & 5/13/04 Hardware Description Languages and Synthesis Synthesis of Language Constructs 1 Nets Nets declared to be input or output ports are retained Internal nets may be eliminated due to logic optimization User may force a net to exist trireg, tri0, tri1

More information

Logic Verification 13-1

Logic Verification 13-1 Logic Verification 13-1 Verification The goal of verification To ensure 100% correct in functionality and timing Spend 50 ~ 70% of time to verify a design Functional verification Simulation Formal proof

More information

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki

An Ultra High Performance Scalable DSP Family for Multimedia. Hot Chips 17 August 2005 Stanford, CA Erik Machnicki An Ultra High Performance Scalable DSP Family for Multimedia Hot Chips 17 August 2005 Stanford, CA Erik Machnicki Media Processing Challenges Increasing performance requirements Need for flexibility &

More information

CALCULATION OF POWER CONSUMPTION IN 7 TRANSISTOR SRAM CELL USING CADENCE TOOL

CALCULATION OF POWER CONSUMPTION IN 7 TRANSISTOR SRAM CELL USING CADENCE TOOL CALCULATION OF POWER CONSUMPTION IN 7 TRANSISTOR SRAM CELL USING CADENCE TOOL Shyam Akashe 1, Ankit Srivastava 2, Sanjay Sharma 3 1 Research Scholar, Deptt. of Electronics & Comm. Engg., Thapar Univ.,

More information

Parallel graph traversal for FPGA

Parallel graph traversal for FPGA LETTER IEICE Electronics Express, Vol.11, No.7, 1 6 Parallel graph traversal for FPGA Shice Ni a), Yong Dou, Dan Zou, Rongchun Li, and Qiang Wang National Laboratory for Parallel and Distributed Processing,

More information

Design of a Pipelined 32 Bit MIPS Processor with Floating Point Unit

Design of a Pipelined 32 Bit MIPS Processor with Floating Point Unit Design of a Pipelined 32 Bit MIPS Processor with Floating Point Unit P Ajith Kumar 1, M Vijaya Lakshmi 2 P.G. Student, Department of Electronics and Communication Engineering, St.Martin s Engineering College,

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 4. The Processor COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle

More information

A Low-Power Field Programmable VLSI Based on Autonomous Fine-Grain Power Gating Technique

A Low-Power Field Programmable VLSI Based on Autonomous Fine-Grain Power Gating Technique A Low-Power Field Programmable VLSI Based on Autonomous Fine-Grain Power Gating Technique P. Durga Prasad, M. Tech Scholar, C. Ravi Shankar Reddy, Lecturer, V. Sumalatha, Associate Professor Department

More information