Accelerating DSP Applications in Embedded Systems with a Coprocessor Data-Path

Michalis D. Galanis, Gregory Dimitroulakos, and Costas E. Goutis
VLSI Design Laboratory, Electrical and Computer Engineering Department, University of Patras, Patras 26500, Greece
mgalanis@ee.upatras.gr

Abstract

The execution time improvements achieved in a generic microprocessor system by employing a high-performance data-path are presented. The data-path acts as a coprocessor that accelerates computationally intensive kernel regions, thereby increasing overall performance. The data-path was introduced in previous work and is composed of Flexible Computational Components (FCCs) that can realize any two-level template of primitive operations. To evaluate the effectiveness of the coprocessor approach, several real-world DSP applications are mapped to the system. The performance improvements are studied relative to the microprocessor architecture and to the computational resources of the data-path. Overall application speedups ranging from 1.75 to 3.95 are reported, with an average value of 2.72, while the overhead in circuit area is small.

I. INTRODUCTION

Embedded systems have to meet the increasing requirements of contemporary applications, like baseband processing of communication protocols and digital imaging, for high performance and reduced energy consumption. Most Digital Signal Processing (DSP) and multimedia applications spend most of their time executing a small number of time-critical, regular code segments called kernels. The kernels are commonly located in loop structures and exhibit high amounts of operation parallelism. Custom coprocessor hardware is typically used to realize critical kernels that would execute considerably more slowly on a microprocessor.
Furthermore, to reduce the time-to-market of embedded systems, automated synthesis flows are required for constructing application-specific loop coprocessors from high-level specifications.

Research activities in High-Level Synthesis (HLS) [1] and in Application Specific Instruction Processors (ASIPs) [2], [3] have shown that using complex computational structures, called templates or clusters, instead of only primitive ones (like a single ALU) in custom data-paths improves performance. A template may be a specialized hardware unit or a group of chained units. Chaining is the removal of the intermediate registers between primitive units, which reduces the total delay of the combined units.

In previous work [4], we introduced a high-performance data-path composed of Flexible Computational Components (FCCs). The FCC is a combinational circuit consisting of a 2x2 array of Processing Elements (PEs). Each PE contains one ALU and one multiplication unit, and only one of them is activated at each control step of the schedule. Due to the flexible connections inside the FCC, any two-level complex template operation can be easily derived. Fewer FCCs are needed to cover a Data Flow Graph (DFG) than with existing template-based data-paths. This allows the introduction of inter-component connectivity, which enables inter-FCC chaining and results in performance improvements relative to primitive-resource and template-based data-paths. A flow for synthesizing high-level descriptions to the FCC-based data-path was also introduced.

In this paper, we present the integration of the high-performance data-path into a system for improving application performance. The available Instruction Level Parallelism (ILP) of computation-intensive kernels is efficiently exploited by the flexible arithmetic units of the data-path, leading to significant kernel acceleration.
An instruction-set processor executes sequential, irregular code segments and provides software programmability. A design flow is introduced for realizing a complete application on the generic single-chip system. The system's performance is estimated by simulation. Analytical experiments are performed to assess the effectiveness of the FCC-based coprocessor and of the design flow. Eight real-life DSP applications are mapped on six instances of the generic system. Significant application speedups are reported, as the design flow accelerates each application close to its ideal speedup.

The rest of the paper is organized as follows. Section II presents existing research activities in synthesizing kernel coprocessors. Section III overviews the system's architecture and the FCC data-path. The design flow and the synthesis method are given in section IV. The experiments are given in section V, while the conclusions are drawn in section VI.

II. RELATED WORK

Coprocessors are used for accelerating the computation of time-critical procedures, relieving the system's microprocessor of these application parts. The PICO system [5] synthesizes nonprogrammable accelerators, under a given performance constraint, to be used as coprocessors for functions expressed as loop nests in C. The generated coprocessors consist of a synchronous array of one or more customized processor data-paths. A VLIW processor executes non-critical application code. Results are given for synthesizing over thirty loop nests into hardware. However, results from executing complete applications were not provided. Also, PICO targets a specific systolic array architecture, which limits the applicability of the synthesis flow.

Two different coprocessor architectures are presented in [6] and evaluated on a JPEG encoder. An academic CPU is used as the host processor. Two new instructions have to be added to the instruction-set for coupling the customized
data-paths on the microprocessor. This limits the applicability of their approach, since modifying the instruction-set of a microprocessor, like an ARM, requires modifying the compiler infrastructure, which is a rather complicated task.

Coprocessor circuits built from an automated synthesis flow can also be implemented in FPGA logic [7], [8]. However, in this case the performance is limited by the higher delay of the FPGA logic relative to a custom ASIC implementation of the generated coprocessor. Furthermore, FPGAs consume more power and occupy considerably larger area than ASIC circuits. Area and power consumption are important design parameters in embedded systems.

Our work generates coprocessor data-paths for improving kernel execution under a given constraint on the arithmetic units. The proposed data-path is implemented in ASIC logic and is flexible enough to realize any two-level complex computational structure. Furthermore, no modification of the microprocessor's instruction-set is required to execute an application. A thorough study is performed with eight realistic DSP applications, and results are provided from simulating the execution of the complete applications on the system.

III. SYSTEM ARCHITECTURE

The proposed FCC-based data-path is coupled with a host microprocessor for executing complete applications. An outline of a system-on-chip (SoC) architecture employing the coprocessor is shown in Fig. 1. The microprocessor, typically a RISC one, executes sequential control-dominant software parts, while the FCC-based data-path realizes time-critical kernel code. The shared data RAM stores global data for the execution of the application on the system. The processor and the data-path are connected to the data RAM via a global bus.

Figure 1. Generic diagram of the system architecture.

The data communication model between the FCC-based data-path and the processor uses a shared-memory mechanism.
The shared memory is composed of the shared data RAM and a subset of the registers in the register bank of the coprocessor (section III-A). Scalar variables are exchanged via the shared registers, while global variables and data arrays are allocated in the shared data RAM. Both the microprocessor and the data-path have access to the shared memory. The communication process used by the processor and the FCC data-path preserves data coherency by requiring the execution of the processor and of the FCC data-path to be mutually exclusive.

A kernel is replaced with code that enables the data-path using a start signal and performs data communication by transferring live-in scalar variables, produced by the microprocessor, to the shared registers. Then, the data-path executes the kernel. Upon completion, it informs the microprocessor using a done signal, writes the live-out scalar variables needed by the code segments following the kernel to the shared registers, and writes global variables and array data to the shared RAM. Then, the execution of the application continues on the processor. The mutually exclusive execution makes programming the system architecture easier by eliminating complicated analysis and synchronization procedures.

A. Coprocessor data-path

An overview of the FCC data-path, previously introduced in [4], is presented in Fig. 2. The high degree of operation parallelism in DSP kernels is exploited by the Flexible Computational Components (FCCs). In [4] it was shown that the coprocessor accelerates kernels efficiently. The coprocessor's data-path consists of: (a) the FCCs, (b) a register bank, (c) interconnect which enables the inter-FCC connections and connectivity to the register bank, (d) multiplexers for providing the proper inputs to the FCCs, and (e) a control-unit. The register bank stores intermediate values among computations and input/output data located in RAM. The control-unit manages the execution of the data-path every cycle.
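The start/done protocol described above can be illustrated with a small behavioural model. This is only a sketch under stated assumptions: the class names, the `dot_kernel` example and the register names `n` and `acc` are hypothetical and do not come from the paper. The kernel runs to completion inside `start`, which models the mutually exclusive execution of the processor and the data-path.

```python
from dataclasses import dataclass, field


@dataclass
class SharedMemory:
    """Shared registers (scalars) plus shared data RAM (globals and arrays)."""
    regs: dict = field(default_factory=dict)
    ram: dict = field(default_factory=dict)


class Coprocessor:
    """Toy model of the data-path side of the start/done handshake."""

    def __init__(self, shared):
        self.shared = shared
        self.done = False

    def start(self, kernel):
        # The host asserts 'start'. The kernel then runs to completion while
        # the host waits, so the two never execute at the same time.
        self.done = False
        live_out, ram_updates = kernel(self.shared.regs, self.shared.ram)
        self.shared.regs.update(live_out)    # live-out scalars -> shared registers
        self.shared.ram.update(ram_updates)  # globals and arrays -> shared RAM
        self.done = True                     # 'done' signal back to the host


def dot_kernel(regs, ram):
    """Hypothetical kernel: dot product of two arrays held in shared RAM."""
    n = regs["n"]
    acc = sum(ram["x"][i] * ram["y"][i] for i in range(n))
    return {"acc": acc}, {}


def run_on_host():
    """Host-side stub that replaces the kernel in the application code."""
    shared = SharedMemory(ram={"x": [1, 2, 3], "y": [4, 5, 6]})
    cop = Coprocessor(shared)
    shared.regs["n"] = 3        # host writes the live-in scalar
    cop.start(dot_kernel)       # enable the data-path
    assert cop.done             # host resumes only after 'done'
    return shared.regs["acc"]   # host reads the live-out scalar
```

Because the host blocks until `done` is set, no locking or coherency analysis is needed in this model, which mirrors the programming simplification the text attributes to exclusive execution.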
The data-path begins the execution of a kernel when the start signal is asserted in the control-unit by the host microprocessor. When the kernel execution is completed, the control-unit informs the host processor using the done signal.

Figure 2. Overview of the proposed coprocessor.

The FCC's internal architecture is shown in Fig. 3a. The data-width of the FCC is 16 bits, although higher bit-widths are supported. It consists of four Processing Elements (PEs), four primary inputs (in1, in2, in3, in4) connected to the register bank, and two primary outputs (out3, out4) connected to the register bank. Four additional inputs (A, B, C, D) and two outputs (out1, out2) are connected either to the register bank or to another FCC. As each PE performs a two-operand operation, multiplexers are used to select the inputs for the second-level PEs. These multiplexers also create the flexible intra-FCC connections. Each PE contains an ALU and a multiplier unit, both implemented as combinational circuits. At each control-step (c-step) of the schedule, either the multiplier or the ALU is activated. The ALU performs shifting, arithmetic (add/subtract), and logical operations.

The flexible connections among the PEs inside an FCC allow any desired operation combination, such as the ones proposed in [1]-[3], to be realized easily by properly configuring the multiplexers of the FCC. Examples of complex operations realized by an FCC are shown in Fig. 3b. Thus, since an FCC can implement templates by properly setting its internal connections, high performance can be achieved. In [4], it was shown that an average execution time reduction of 17% was
accomplished with FCC-based data-paths relative to existing high-performance data-paths. This improvement is due to the exploitation of chaining of operations inside the FCCs (intra-component chaining) and of inter-FCC chaining owing to the direct connections among the FCCs.

Figure 3. (a) Architecture of the FCC, (b) Examples of complex operations realized by the FCC.

IV. DESIGN FLOW

For implementing a complete application on the generic system of Fig. 1, a design flow is required that integrates the synthesis method of the FCC-based coprocessor. The design flow used to realize applications in this work is shown in Fig. 4. Initially, a kernel identification procedure, based on profiling, outputs the kernels and the non-critical parts of the source code. For profiling, the standard debugger/simulator tools of the development environment of a specific processor can be used. For ARM processors, the instruction-set simulator (ISS) of the ARM RealView Developer Suite (RVDS) can be used. Kernels are those code segments that contribute more than a certain amount to the application's total execution time on the processor. For example, parts of the code that account for 10% or more of the application's time can be characterized as kernels. The non-critical code is compiled using a compiler for the specific processor and the software binary is produced.

The kernels are synthesized using the procedure described in section IV-A. From the data-path architectures of all the kernels we derive a single data-path that the kernels share without degrading the execution time of any kernel. This sharing is feasible since the kernels are not executed concurrently. The control-unit of the multi-kernel coprocessor data-path activates the execution of one specific kernel at a time.
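One simple way to picture the hardware sharing described above is a component-wise maximum over the per-kernel architectures. The paper does not spell out how the shared data-path is derived, so this is an assumption made purely for illustration; the resource names and counts below are hypothetical.

```python
def merge_datapaths(kernel_archs):
    """Derive a shared data-path as the per-resource maximum over all kernels.

    Assumption (not from the paper): since kernels never run concurrently,
    a data-path holding the largest count of each resource requested by any
    kernel can host every kernel without slowing any of them down.
    """
    merged = {}
    for arch in kernel_archs:
        for resource, count in arch.items():
            merged[resource] = max(merged.get(resource, 0), count)
    return merged


# Hypothetical per-kernel architectures produced by the synthesis step.
kernel_archs = [
    {"alus": 6, "multipliers": 2, "registers": 12},
    {"alus": 4, "multipliers": 5, "registers": 9},
]
shared = merge_datapaths(kernel_archs)
```

Here `shared` holds 6 ALUs, 5 multipliers and 12 registers, which is smaller than the sum of the two architectures while still covering each kernel's needs.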
Figure 4. System design flow.

The performance of the system is estimated by cycle-level simulation, which takes as inputs the execution times of the kernels on the coprocessor hardware and the execution times of the rest of the code on the microprocessor. The execution cycles of the kernels are reported by the synthesis method, and the execution cycles of software on the microprocessor are extracted using an instruction-set simulator. For estimating the area of the generated coprocessor, the data-path architecture is described in synthesizable register-transfer level (RTL) VHDL. The produced VHDL is synthesized with a commercial tool, like Synplify ASIC or Synopsys Design Compiler, to estimate the area. The dark grey boxes in Fig. 4 represent the procedures modified or created for this flow, while the light grey boxes represent the external tools used.

A. Synthesis method

The flow for synthesizing a kernel described in C to the proposed coprocessor data-path, so as to minimize its execution time under given resource constraints, is illustrated in Fig. 5. First, the Control Data Flow Graph (CDFG) of the input kernel is created using the SUIF2/MachineSUIF compiler infrastructures [9]. In this work, we use a hierarchical CDFG [10] for modeling data and control-flow dependencies. The control-flow structures, like branches and loops, are modeled through the hierarchy, while the data dependencies are modeled by Data Flow Graphs (DFGs). Existing and custom-made compiler passes are used for the CDFG creation. Afterwards, optimizations are applied to the kernel's CDFG for more efficient synthesis. The optimizations implemented in the synthesis methodology are tree-height reduction, dead code elimination, common sub-expression elimination and constant propagation.
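Of the optimizations listed above, tree-height reduction is the one that most directly exposes parallelism to the scheduler. The following is a minimal sketch of the idea for a sum of operands, not the MachineSUIF pass used by the authors; the function name and the string representation of expressions are illustrative choices. It relies on the operation being associative.

```python
def balanced_sum(operands):
    """Rebalance a sum into a tree and return (expression, depth).

    A left-to-right chain of n operands needs n - 1 dependent additions.
    Pairing operands level by level gives a depth of ceil(log2(n)) instead,
    so more additions can be executed in the same control step.
    """
    if not operands:
        raise ValueError("at least one operand is required")
    level = [(name, 0) for name in operands]  # (expression, depth) pairs
    while len(level) > 1:
        next_level = []
        for i in range(0, len(level) - 1, 2):
            (a, depth_a), (b, depth_b) = level[i], level[i + 1]
            next_level.append((f"({a} + {b})", max(depth_a, depth_b) + 1))
        if len(level) % 2:                    # odd operand moves up unchanged
            next_level.append(level[-1])
        level = next_level
    return level[0]


expr, depth = balanced_sum(list("abcdefgh"))
```

For the eight operands above the depth drops from 7 (chain) to 3 (balanced tree), and the four first-level additions are independent of each other.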
MachineSUIF compiler passes have been developed for the automatic application of the described optimizations on a kernel's CDFG.

Figure 5. Coprocessor synthesis method.

The optimized CDFG is input to the scheduler developed for the FCC data-path. If the kernel's CDFG is composed of more than one DFG, the scheduler hierarchically traverses the CDFG and schedules one DFG at a time. Scheduling is a resource-constrained problem with the goal of minimizing execution cycles, since the number and type of FCCs (e.g. three FCCs) in the data-path is an input to the synthesis script. A list (priority)-based scheduler has been developed. The priority function [10] of the scheduler is derived by properly labeling the DFG nodes (operations). In particular, the nodes are labeled with the weights of their longest path to the sink node of the DFG, and they are
ranked in decreasing order. The most urgent operations are scheduled first. The resource constraints for the scheduler are determined by the total number of PEs in the first rows of all the FCCs in the data-path. If there are p FCCs in the data-path, there are 2p PEs in the first rows, since each row consists of 2 PEs. Thus, 2p primitive operations (ALU operations and/or multiplications) can be executed in parallel at each clock cycle of the schedule. For example, if there are three FCCs in the data-path, six operations can be executed in parallel at every cycle of the schedule.

The input to the binding step is the scheduled CDFG. The binding algorithm maps the CDFG operations to the PEs row by row. Idle units inside FCCs are removed by a procedure called FCC instantiation. In particular, when a unit (ALU or multiplier) in a PE and/or a whole PE is not used at any control-step of the scheduled CDFG, it is not included in the final data-path. A detailed description of the binding algorithm is given in [4]. After the binding, the execution cycles of the kernel are output. The clock period of the synthesized kernel is set to the delay of the instantiated FCC with the longest combinational delay, so that all the instantiated FCCs have unit execution delay. The delay of an FCC-based data-path is largely determined by the critical path of an FCC resource, since the proposed data-path can be considered a resource-dominated circuit, as it targets DSP kernels [10].

After the binding to FCCs, the data-path specification procedure takes place. The size of the register bank is defined by the longest lifetime of all the values produced by the FCCs. The number and type of multiplexers, the interconnections among the FCCs, the interconnection of the FCCs to the registers and the states of the control-unit are also determined. A prototype tool was developed in C for the automation of the scheduling, binding and specification procedures.

V. EXPERIMENTS

A. Experimental set-up

Two 16-bit FCC-based data-paths are considered in the experiments. The first coprocessor data-path (FCC1) is composed of two FCCs, while the second one (FCC2) is composed of three FCCs. Each of these data-paths is a coprocessor to a 32-bit ARM processor. Three ARM processors are considered in turn as the host of the platform: (a) an ARM7 clocked at 133 MHz, (b) an ARM9 clocked at 250 MHz, and (c) an ARM10 clocked at 325 MHz. These clock frequencies were taken from reference designs on the ARM website and are considered typical for these processors in a 130nm process.

In [4], we synthesized and laid out an RTL VHDL description of the FCC unit with the Synplify ASIC tool using a 130nm CMOS process, and the delay of the FCC was found to be 4.03ns. To accommodate the extra delays caused by the register bank, the multiplexers, the interconnection and the control-unit, we set the clock period for both data-paths to 5ns. Thus, a clock frequency of 200MHz is assumed, which gives unit execution delay for the FCCs. The extra delay overheads were estimated by synthesizing and laying out representative benchmarks, such as DFGs from [4] and kernels extracted from the applications of this work.

The eight real-world DSP applications used in the experiments, all described in C, are: a JPEG encoder, an IEEE 802.11a OFDM transmitter, a wavelet-based image compressor, a medical imaging technique called cavity detector, an image edge detector, a JPEG decoder, a GSM speech encoder and a GSM speech decoder. The ARM RVDS (version 2.2) was used for estimating the execution cycles of the software parts. The profiling results showed that each application is composed of at most 4 kernels. The number of kernels for each application is illustrated in Fig. 7. Parts of the code that account for 10% or more of the application's time were characterized as critical. It was observed that a threshold smaller than 10% leads to only marginal additional improvements.
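The 10% criterion can be sketched as a simple filter over a per-segment cycle profile. The function name, the segment names and the cycle counts below are hypothetical; in the paper the profile comes from the ARM instruction-set simulator.

```python
def identify_kernels(profile, threshold=0.10):
    """Return the code segments whose share of total cycles meets the threshold.

    profile maps a segment name to its execution cycles on the processor.
    The result maps each kernel to its fraction of the total, largest first.
    """
    total = sum(profile.values())
    shares = {name: cycles / total for name, cycles in profile.items()}
    kernels = {name: share for name, share in shares.items() if share >= threshold}
    return dict(sorted(kernels.items(), key=lambda item: item[1], reverse=True))


# Hypothetical profile of one application (cycles per code segment).
profile = {
    "seg_a": 520_000,
    "seg_b": 300_000,
    "seg_c": 90_000,
    "seg_d": 90_000,
}
kernels = identify_kernels(profile)
```

With these numbers `seg_a` (52%) and `seg_b` (30%) are kernels, while the two 9% segments stay on the microprocessor. Lowering `threshold` would move them to hardware too, which the text reports brings only marginal gains.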
These kernels are innermost loops and consist of word-level operations (ALU operations, multiplications, shifts) that match the granularity of the ALU and multiplier units inside an FCC.

B. Results

The execution times and the overall application speedups for the eight applications are presented in Table I. The performance of the applications executed on the six systems is estimated via simulation, using the proposed design flow. Time_sw represents the software execution time of the whole application on a specific microprocessor (Proc.). The ideal speedup (Ideal Sp.) is the application speedup that would be achieved, according to Amdahl's Law, if the application's kernels were executed on the FCC data-path in zero time. Time_system corresponds to the execution time of the application when the critical code is executed on the FCC data-path. All execution times are normalized to the software execution times on the ARM7. Sp. is the estimated application speedup, after applying the developed design flow, over the execution of the application on the microprocessor alone. The estimated speedup is calculated as Sp = Time_sw / Time_system. The average values, as well as the geometric means, of the execution times and of the speedups are also given.

From the results in Table I, it is evident that significant overall performance improvements are achieved when critical software parts are synthesized on the FCCs. The largest overall application performance gains are achieved for the ARM7 extended architectures, since the ARM7 exhibits the highest Cycles Per Instruction (CPI) and has the slowest clock of the three ARM processors.
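The two speedup figures defined above can be computed as follows. This is a sketch with hypothetical, normalized times rather than values from Table I. Because processor and coprocessor execute mutually exclusively, the system time is modeled as the plain sum of the two parts.

```python
def ideal_speedup(kernel_fraction):
    """Amdahl bound when the kernels take zero time on the coprocessor.

    kernel_fraction is the share of Time_sw spent in kernels (0 <= f < 1).
    """
    return 1.0 / (1.0 - kernel_fraction)


def estimated_speedup(time_sw, time_noncritical, time_kernels_on_fcc):
    """Sp = Time_sw / Time_system, with Time_system as the sum of both parts."""
    time_system = time_noncritical + time_kernels_on_fcc
    return time_sw / time_system


# Hypothetical application: 75% of the software time is spent in kernels.
time_sw = 1.0
bound = ideal_speedup(0.75)
# The kernels still need some time on the data-path, so Sp stays below the bound.
sp = estimated_speedup(time_sw, time_noncritical=0.25, time_kernels_on_fcc=0.25)
```

In this example the bound is 4.0 and the estimated speedup is 2.0. As the kernel time on the data-path shrinks, Sp approaches the bound but cannot exceed it, because the non-critical code always runs on the microprocessor.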
The average application speedup of the eight DSP benchmarks for the ARM7 extended systems (for both FCC1 and FCC2) is 2.90, for the ARM9 systems it is 2.68, while for the ARM10 systems it is [...]. Thus, even when the FCC-based data-paths are coupled with a modern embedded processor such as the ARM10, which is clocked at a higher frequency, the application speedup over execution on the processor core alone is significant. When the kernels are synthesized on data-paths with three FCCs (FCC2 case), the speedups are somewhat larger than for the FCC1-based data-paths, due to the larger number of PEs available in each control-step of the schedule. However, even though the kernels are executed faster on the FCC2, the application speedup increases only slightly, because the non-critical code segments are still executed on the microprocessor. The average estimated application speedup is 2.67 for the microprocessor architectures coupled with the FCC1 data-paths. When the processor cores are extended with the FCC2-based coprocessors, the average speedup for the eight applications and the three ARM processors is [...].

From Table I, it is inferred that the reported speedups for each application and for each processor type are close to the theoretical speedup bounds, especially for the ARM7 systems. Thus, the proposed design flow uses the processing capabilities of the FCC-based data-paths effectively, bringing the overall performance of the applications close to the ideal speedups.

Experiments with the benchmarks of this paper showed that few parts of each application can be executed in parallel on the processor and on the FCC data-path, and that doing so yields only a trivial performance increase relative to mutually exclusive execution. Such minor improvements do not outweigh the benefit of the simpler programming of the system architecture that exclusive execution provides.

TABLE I. COMPARISON OF EXECUTION TIMES FOR SOFTWARE AND SOFTWARE WITH FCC-BASED COPROCESSOR
Columns: Application, Proc., Time_sw, Ideal Sp., then Time_system and Sp. for Proc./FCC1 and for Proc./FCC2. Rows: JPEG enc., OFDM trans., Compressor, Cavity det., Edge det., JPEG dec., GSM enc., GSM dec. (one row per ARM processor), Average, Geo. mean.

To give an insight into the cost of coupling an FCC data-path with a microprocessor, we note that the area at 130nm is 2.4 mm² for the ARM7, 3.2 mm² for the ARM9 and 6.9 mm² for the ARM10. The maximum area for the FCC data-paths is reported for the OFDM transmitter in the FCC2 case and equals 0.471 mm². Hence, significant speedups have been achieved by using the FCC data-path as a coprocessor, with a relatively small area overhead.

VI. CONCLUSIONS

The speedups from executing eight DSP applications on a SoC that integrates a high-performance coprocessor were presented. The coprocessor uses flexible arithmetic units that can realize complex operations.
The application speedups have an average value of 2.72 for six instances of a generic system. These improvements come with a small increase in the system's area.

ACKNOWLEDGMENTS

This work was partially funded by the Alexander S. Onassis Public Benefit Foundation.

REFERENCES

[1] M. R. Corazao et al., "Performance Optimization Using Template Mapping for Datapath-Intensive High-Level Synthesis," in IEEE Trans. on CAD, vol. 15, no. 2, pp , August
[2] J. Cong et al., "Application-Specific Instruction Generation for Configurable Processor Architectures," in Proc. of the ACM FPGA 04, pp ,
[3] R. Kastner et al., "Instruction Generation for Hybrid Reconfigurable Systems," in ACM TODAES, vol. 7, no. 4, pp , October
[4] M. D. Galanis, G. Theodoridis, S. Tragoudas, C. E. Goutis, "A High Performance Data-Path for Synthesizing DSP Kernels," to appear in IEEE Trans. on CAD.
[5] R. Schreiber et al., "PICO-NPA: High-Level Synthesis of Nonprogrammable Hardware Accelerators," in the Journal of VLSI Processing, Springer, vol. 31, no. 2, pp ,
[6] S. L. Shee et al., "Novel Architecture for Loop Acceleration: A Case Study," in Proc. of CODES+ISSS 05, pp ,
[7] T.J. Callahan et al., "The Garp Architecture and C Compiler," in IEEE Computer, vol. 33, no. 4, pp 62-69, April
[8] G. Stitt et al., "Hardware/Software Partitioning of Software Binaries: A Case Study of H.264 Decode," in Proc. of CODES+ISSS 05, pp ,
[9] SUIF2,
[10] G. De Micheli, Synthesis and Optimization of Digital Circuits, McGraw-Hill, 1994.


More information

A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT

A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT A RISC ARCHITECTURE EXTENDED BY AN EFFICIENT TIGHTLY COUPLED RECONFIGURABLE UNIT N. Vassiliadis, N. Kavvadias, G. Theodoridis, S. Nikolaidis Section of Electronics and Computers, Department of Physics,

More information

Chapter 4. The Processor

Chapter 4. The Processor Chapter 4 The Processor Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations A simplified

More information

Software Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors

Software Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors Software Pipelining for Coarse-Grained Reconfigurable Instruction Set Processors Francisco Barat, Murali Jayapala, Pieter Op de Beeck and Geert Deconinck K.U.Leuven, Belgium. {f-barat, j4murali}@ieee.org,

More information

RECONFIGURABLE SPI DRIVER FOR MIPS SOFT-CORE PROCESSOR USING FPGA

RECONFIGURABLE SPI DRIVER FOR MIPS SOFT-CORE PROCESSOR USING FPGA RECONFIGURABLE SPI DRIVER FOR MIPS SOFT-CORE PROCESSOR USING FPGA 1 HESHAM ALOBAISI, 2 SAIM MOHAMMED, 3 MOHAMMAD AWEDH 1,2,3 Department of Electrical and Computer Engineering, King Abdulaziz University

More information

A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning

A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning A Study of the Speedups and Competitiveness of FPGA Soft Processor Cores using Dynamic Hardware/Software Partitioning By: Roman Lysecky and Frank Vahid Presented By: Anton Kiriwas Disclaimer This specific

More information

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores

A Configurable Multi-Ported Register File Architecture for Soft Processor Cores A Configurable Multi-Ported Register File Architecture for Soft Processor Cores Mazen A. R. Saghir and Rawan Naous Department of Electrical and Computer Engineering American University of Beirut P.O. Box

More information

ECE 587 Hardware/Software Co-Design Lecture 23 Hardware Synthesis III

ECE 587 Hardware/Software Co-Design Lecture 23 Hardware Synthesis III ECE 587 Hardware/Software Co-Design Spring 2018 1/28 ECE 587 Hardware/Software Co-Design Lecture 23 Hardware Synthesis III Professor Jia Wang Department of Electrical and Computer Engineering Illinois

More information

COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design

COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design COPROCESSOR APPROACH TO ACCELERATING MULTIMEDIA APPLICATION [CLAUDIO BRUNELLI, JARI NURMI ] Processor Design Lecture Objectives Background Need for Accelerator Accelerators and different type of parallelizm

More information

High Level Synthesis

High Level Synthesis High Level Synthesis Design Representation Intermediate representation essential for efficient processing. Input HDL behavioral descriptions translated into some canonical intermediate representation.

More information

HW/SW Co-design. Design of Embedded Systems Jaap Hofstede Version 3, September 1999

HW/SW Co-design. Design of Embedded Systems Jaap Hofstede Version 3, September 1999 HW/SW Co-design Design of Embedded Systems Jaap Hofstede Version 3, September 1999 Embedded system Embedded Systems is a computer system (combination of hardware and software) is part of a larger system

More information

Exploring Automatically Generated Platforms in High Performance FPGAs

Exploring Automatically Generated Platforms in High Performance FPGAs Exploring Automatically Generated Platforms in High Performance FPGAs Panagiotis Skrimponis b, Georgios Zindros a, Ioannis Parnassos a, Muhsen Owaida b, Nikolaos Bellas a, and Paolo Ienne b a Electrical

More information

VLSI DESIGN OF REDUCED INSTRUCTION SET COMPUTER PROCESSOR CORE USING VHDL

VLSI DESIGN OF REDUCED INSTRUCTION SET COMPUTER PROCESSOR CORE USING VHDL International Journal of Electronics, Communication & Instrumentation Engineering Research and Development (IJECIERD) ISSN 2249-684X Vol.2, Issue 3 (Spl.) Sep 2012 42-47 TJPRC Pvt. Ltd., VLSI DESIGN OF

More information

Embedded Systems. 7. System Components

Embedded Systems. 7. System Components Embedded Systems 7. System Components Lothar Thiele 7-1 Contents of Course 1. Embedded Systems Introduction 2. Software Introduction 7. System Components 10. Models 3. Real-Time Models 4. Periodic/Aperiodic

More information

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis

A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis A Lost Cycles Analysis for Performance Prediction using High-Level Synthesis Bruno da Silva, Jan Lemeire, An Braeken, and Abdellah Touhafi Vrije Universiteit Brussel (VUB), INDI and ETRO department, Brussels,

More information

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays

Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Runtime Adaptation of Application Execution under Thermal and Power Constraints in Massively Parallel Processor Arrays Éricles Sousa 1, Frank Hannig 1, Jürgen Teich 1, Qingqing Chen 2, and Ulf Schlichtmann

More information

Co-synthesis and Accelerator based Embedded System Design

Co-synthesis and Accelerator based Embedded System Design Co-synthesis and Accelerator based Embedded System Design COE838: Embedded Computer System http://www.ee.ryerson.ca/~courses/coe838/ Dr. Gul N. Khan http://www.ee.ryerson.ca/~gnkhan Electrical and Computer

More information

DESIGN AND IMPLEMENTATION OF VLSI SYSTOLIC ARRAY MULTIPLIER FOR DSP APPLICATIONS

DESIGN AND IMPLEMENTATION OF VLSI SYSTOLIC ARRAY MULTIPLIER FOR DSP APPLICATIONS International Journal of Computing Academic Research (IJCAR) ISSN 2305-9184 Volume 2, Number 4 (August 2013), pp. 140-146 MEACSE Publications http://www.meacse.org/ijcar DESIGN AND IMPLEMENTATION OF VLSI

More information

High Data Rate Fully Flexible SDR Modem

High Data Rate Fully Flexible SDR Modem High Data Rate Fully Flexible SDR Modem Advanced configurable architecture & development methodology KASPERSKI F., PIERRELEE O., DOTTO F., SARLOTTE M. THALES Communication 160 bd de Valmy, 92704 Colombes,

More information

Pilot: A Platform-based HW/SW Synthesis System

Pilot: A Platform-based HW/SW Synthesis System Pilot: A Platform-based HW/SW Synthesis System SOC Group, VLSI CAD Lab, UCLA Led by Jason Cong Zhong Chen, Yiping Fan, Xun Yang, Zhiru Zhang ICSOC Workshop, Beijing August 20, 2002 Outline Overview The

More information

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE

Abstract A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE A SCALABLE, PARALLEL, AND RECONFIGURABLE DATAPATH ARCHITECTURE Reiner W. Hartenstein, Rainer Kress, Helmut Reinig University of Kaiserslautern Erwin-Schrödinger-Straße, D-67663 Kaiserslautern, Germany

More information

HDL. Operations and dependencies. FSMs Logic functions HDL. Interconnected logic blocks HDL BEHAVIORAL VIEW LOGIC LEVEL ARCHITECTURAL LEVEL

HDL. Operations and dependencies. FSMs Logic functions HDL. Interconnected logic blocks HDL BEHAVIORAL VIEW LOGIC LEVEL ARCHITECTURAL LEVEL ARCHITECTURAL-LEVEL SYNTHESIS Motivation. Outline cgiovanni De Micheli Stanford University Compiling language models into abstract models. Behavioral-level optimization and program-level transformations.

More information

An Efficient Flexible Architecture for Error Tolerant Applications

An Efficient Flexible Architecture for Error Tolerant Applications An Efficient Flexible Architecture for Error Tolerant Applications Sheema Mol K.N 1, Rahul M Nair 2 M.Tech Student (VLSI DESIGN), Department of Electronics and Communication Engineering, Nehru College

More information

Addressing Verification Bottlenecks of Fully Synthesized Processor Cores using Equivalence Checkers

Addressing Verification Bottlenecks of Fully Synthesized Processor Cores using Equivalence Checkers Addressing Verification Bottlenecks of Fully Synthesized Processor Cores using Equivalence Checkers Subash Chandar G (g-chandar1@ti.com), Vaideeswaran S (vaidee@ti.com) DSP Design, Texas Instruments India

More information

A hardware operating system kernel for multi-processor systems

A hardware operating system kernel for multi-processor systems A hardware operating system kernel for multi-processor systems Sanggyu Park a), Do-sun Hong, and Soo-Ik Chae School of EECS, Seoul National University, Building 104 1, Seoul National University, Gwanakgu,

More information

EE382V: System-on-a-Chip (SoC) Design

EE382V: System-on-a-Chip (SoC) Design EE382V: System-on-a-Chip (SoC) Design Lecture 8 HW/SW Co-Design Sources: Prof. Margarida Jacome, UT Austin Andreas Gerstlauer Electrical and Computer Engineering University of Texas at Austin gerstl@ece.utexas.edu

More information

PACE: Power-Aware Computing Engines

PACE: Power-Aware Computing Engines PACE: Power-Aware Computing Engines Krste Asanovic Saman Amarasinghe Martin Rinard Computer Architecture Group MIT Laboratory for Computer Science http://www.cag.lcs.mit.edu/ PACE Approach Energy- Conscious

More information

A Reconfigurable Multifunction Computing Cache Architecture

A Reconfigurable Multifunction Computing Cache Architecture IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 9, NO. 4, AUGUST 2001 509 A Reconfigurable Multifunction Computing Cache Architecture Huesung Kim, Student Member, IEEE, Arun K. Somani,

More information

EE382V: System-on-a-Chip (SoC) Design

EE382V: System-on-a-Chip (SoC) Design EE382V: System-on-a-Chip (SoC) Design Lecture 10 Task Partitioning Sources: Prof. Margarida Jacome, UT Austin Prof. Lothar Thiele, ETH Zürich Andreas Gerstlauer Electrical and Computer Engineering University

More information

Flexible DSP Accelerator Architecture Exploiting Carry-Save Arithmetic

Flexible DSP Accelerator Architecture Exploiting Carry-Save Arithmetic 368 IEEE TRANSACTIONS ON VERY LARGE SCALE INTEGRATION (VLSI) SYSTEMS, VOL. 24, NO. 1, JANUARY 2016 Flexible DSP Accelerator Architecture Exploiting Carry-Save Arithmetic Kostas Tsoumanis, Sotirios Xydis,

More information

A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs

A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs Politecnico di Milano & EPFL A Novel Design Framework for the Design of Reconfigurable Systems based on NoCs Vincenzo Rana, Ivan Beretta, Donatella Sciuto Donatella Sciuto sciuto@elet.polimi.it Introduction

More information

Processor (I) - datapath & control. Hwansoo Han

Processor (I) - datapath & control. Hwansoo Han Processor (I) - datapath & control Hwansoo Han Introduction CPU performance factors Instruction count - Determined by ISA and compiler CPI and Cycle time - Determined by CPU hardware We will examine two

More information

ARM Processors for Embedded Applications

ARM Processors for Embedded Applications ARM Processors for Embedded Applications Roadmap for ARM Processors ARM Architecture Basics ARM Families AMBA Architecture 1 Current ARM Core Families ARM7: Hard cores and Soft cores Cache with MPU or

More information

Previous Exam Questions System-on-a-Chip (SoC) Design

Previous Exam Questions System-on-a-Chip (SoC) Design This image cannot currently be displayed. EE382V Problem: System Analysis (20 Points) This is a simple single microprocessor core platform with a video coprocessor, which is configured to process 32 bytes

More information

MOJTABA MAHDAVI Mojtaba Mahdavi DSP Design Course, EIT Department, Lund University, Sweden

MOJTABA MAHDAVI Mojtaba Mahdavi DSP Design Course, EIT Department, Lund University, Sweden High Level Synthesis with Catapult MOJTABA MAHDAVI 1 Outline High Level Synthesis HLS Design Flow in Catapult Data Types Project Creation Design Setup Data Flow Analysis Resource Allocation Scheduling

More information

The Processor (1) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University

The Processor (1) Jinkyu Jeong Computer Systems Laboratory Sungkyunkwan University The Processor (1) Jinkyu Jeong (jinkyu@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu EEE3050: Theory on Computer Architectures, Spring 2017, Jinkyu Jeong (jinkyu@skku.edu)

More information

A High Performance Reconfigurable Data Path Architecture For Flexible Accelerator

A High Performance Reconfigurable Data Path Architecture For Flexible Accelerator IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 7, Issue 4, Ver. II (Jul. - Aug. 2017), PP 07-18 e-issn: 2319 4200, p-issn No. : 2319 4197 www.iosrjournals.org A High Performance Reconfigurable

More information

Implementation of Ripple Carry and Carry Skip Adders with Speed and Area Efficient

Implementation of Ripple Carry and Carry Skip Adders with Speed and Area Efficient ISSN (Online) : 2278-1021 Implementation of Ripple Carry and Carry Skip Adders with Speed and Area Efficient PUSHPALATHA CHOPPA 1, B.N. SRINIVASA RAO 2 PG Scholar (VLSI Design), Department of ECE, Avanthi

More information

Integrating MRPSOC with multigrain parallelism for improvement of performance

Integrating MRPSOC with multigrain parallelism for improvement of performance Integrating MRPSOC with multigrain parallelism for improvement of performance 1 Swathi S T, 2 Kavitha V 1 PG Student [VLSI], Dept. of ECE, CMRIT, Bangalore, Karnataka, India 2 Ph.D Scholar, Jain University,

More information

Chapter 4. The Processor Designing the datapath

Chapter 4. The Processor Designing the datapath Chapter 4 The Processor Designing the datapath Introduction CPU performance determined by Instruction Count Clock Cycles per Instruction (CPI) and Cycle time Determined by Instruction Set Architecure (ISA)

More information

Modeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors

Modeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors Modeling Arbitrator Delay-Area Dependencies in Customizable Instruction Set Processors Siew-Kei Lam Centre for High Performance Embedded Systems, Nanyang Technological University, Singapore (assklam@ntu.edu.sg)

More information

Designing for Performance. Patrick Happ Raul Feitosa

Designing for Performance. Patrick Happ Raul Feitosa Designing for Performance Patrick Happ Raul Feitosa Objective In this section we examine the most common approach to assessing processor and computer system performance W. Stallings Designing for Performance

More information

Area/Delay Estimation for Digital Signal Processor Cores

Area/Delay Estimation for Digital Signal Processor Cores Area/Delay Estimation for Digital Signal Processor Cores Yuichiro Miyaoka Yoshiharu Kataoka, Nozomu Togawa Masao Yanagisawa Tatsuo Ohtsuki Dept. of Electronics, Information and Communication Engineering,

More information

High Performance and Area Efficient DSP Architecture using Dadda Multiplier

High Performance and Area Efficient DSP Architecture using Dadda Multiplier 2017 IJSRST Volume 3 Issue 6 Print ISSN: 2395-6011 Online ISSN: 2395-602X Themed Section: Science and Technology High Performance and Area Efficient DSP Architecture using Dadda Multiplier V.Kiran Kumar

More information

Design and Implementation of CVNS Based Low Power 64-Bit Adder

Design and Implementation of CVNS Based Low Power 64-Bit Adder Design and Implementation of CVNS Based Low Power 64-Bit Adder Ch.Vijay Kumar Department of ECE Embedded Systems & VLSI Design Vishakhapatnam, India Sri.Sagara Pandu Department of ECE Embedded Systems

More information

The MorphoSys Parallel Reconfigurable System

The MorphoSys Parallel Reconfigurable System The MorphoSys Parallel Reconfigurable System Guangming Lu 1, Hartej Singh 1,Ming-hauLee 1, Nader Bagherzadeh 1, Fadi Kurdahi 1, and Eliseu M.C. Filho 2 1 Department of Electrical and Computer Engineering

More information

Introduction to Electronic Design Automation. Model of Computation. Model of Computation. Model of Computation

Introduction to Electronic Design Automation. Model of Computation. Model of Computation. Model of Computation Introduction to Electronic Design Automation Model of Computation Jie-Hong Roland Jiang 江介宏 Department of Electrical Engineering National Taiwan University Spring 03 Model of Computation In system design,

More information

EECS Dept., University of California at Berkeley. Berkeley Wireless Research Center Tel: (510)

EECS Dept., University of California at Berkeley. Berkeley Wireless Research Center Tel: (510) A V Heterogeneous Reconfigurable Processor IC for Baseband Wireless Applications Hui Zhang, Vandana Prabhu, Varghese George, Marlene Wan, Martin Benes, Arthur Abnous, and Jan M. Rabaey EECS Dept., University

More information

Embedded Systems: Hardware Components (part I) Todor Stefanov

Embedded Systems: Hardware Components (part I) Todor Stefanov Embedded Systems: Hardware Components (part I) Todor Stefanov Leiden Embedded Research Center Leiden Institute of Advanced Computer Science Leiden University, The Netherlands Outline Generic Embedded System

More information

Software-Level Power Estimation

Software-Level Power Estimation Chapter 3 Software-Level Power Estimation The main goal of this chapter is to define a methodology for the static and dynamic estimation of the power consumption at the software-level to be integrated

More information

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey

System-on-Chip Architecture for Mobile Applications. Sabyasachi Dey System-on-Chip Architecture for Mobile Applications Sabyasachi Dey Email: sabyasachi.dey@gmail.com Agenda What is Mobile Application Platform Challenges Key Architecture Focus Areas Conclusion Mobile Revolution

More information

Introduction to FPGA Design with Vivado High-Level Synthesis. UG998 (v1.0) July 2, 2013

Introduction to FPGA Design with Vivado High-Level Synthesis. UG998 (v1.0) July 2, 2013 Introduction to FPGA Design with Vivado High-Level Synthesis Notice of Disclaimer The information disclosed to you hereunder (the Materials ) is provided solely for the selection and use of Xilinx products.

More information

Controller Synthesis for Hardware Accelerator Design

Controller Synthesis for Hardware Accelerator Design ler Synthesis for Hardware Accelerator Design Jiang, Hongtu; Öwall, Viktor 2002 Link to publication Citation for published version (APA): Jiang, H., & Öwall, V. (2002). ler Synthesis for Hardware Accelerator

More information

Instruction Encoding Synthesis For Architecture Exploration

Instruction Encoding Synthesis For Architecture Exploration Instruction Encoding Synthesis For Architecture Exploration "Compiler Optimizations for Code Density of Variable Length Instructions", "Heuristics for Greedy Transport Triggered Architecture Interconnect

More information

Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP

Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP Processor Architectures At A Glance: M.I.T. Raw vs. UC Davis AsAP Presenter: Course: EEC 289Q: Reconfigurable Computing Course Instructor: Professor Soheil Ghiasi Outline Overview of M.I.T. Raw processor

More information

Keywords: Soft Core Processor, Arithmetic and Logical Unit, Back End Implementation and Front End Implementation.

Keywords: Soft Core Processor, Arithmetic and Logical Unit, Back End Implementation and Front End Implementation. ISSN 2319-8885 Vol.03,Issue.32 October-2014, Pages:6436-6440 www.ijsetr.com Design and Modeling of Arithmetic and Logical Unit with the Platform of VLSI N. AMRUTHA BINDU 1, M. SAILAJA 2 1 Dept of ECE,

More information

Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing

Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing Towards a Dynamically Reconfigurable System-on-Chip Platform for Video Signal Processing Walter Stechele, Stephan Herrmann, Andreas Herkersdorf Technische Universität München 80290 München Germany Walter.Stechele@ei.tum.de

More information

The S6000 Family of Processors

The S6000 Family of Processors The S6000 Family of Processors Today s Design Challenges The advent of software configurable processors In recent years, the widespread adoption of digital technologies has revolutionized the way in which

More information

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors

A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors A Low Energy Clustered Instruction Memory Hierarchy for Long Instruction Word Processors Murali Jayapala 1, Francisco Barat 1, Pieter Op de Beeck 1, Francky Catthoor 2, Geert Deconinck 1 and Henk Corporaal

More information

Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit

Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit Custom Instruction Generation Using Temporal Partitioning Techniques for a Reconfigurable Functional Unit Farhad Mehdipour, Hamid Noori, Morteza Saheb Zamani, Kazuaki Murakami, Koji Inoue, Mehdi Sedighi

More information

Multi Cycle Implementation Scheme for 8 bit Microprocessor by VHDL

Multi Cycle Implementation Scheme for 8 bit Microprocessor by VHDL Multi Cycle Implementation Scheme for 8 bit Microprocessor by VHDL Sharmin Abdullah, Nusrat Sharmin, Nafisha Alam Department of Electrical & Electronic Engineering Ahsanullah University of Science & Technology

More information

FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA

FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA 1 FCUDA-SoC: Platform Integration for Field-Programmable SoC with the CUDAto-FPGA Compiler Tan Nguyen 1, Swathi Gurumani 1, Kyle Rupnow 1, Deming Chen 2 1 Advanced Digital Sciences Center, Singapore {tan.nguyen,

More information

MLR Institute of Technology

MLR Institute of Technology MLR Institute of Technology Laxma Reddy Avenue, Dundigal, Quthbullapur (M), Hyderabad 500 043 Course Name Course Code Class Branch ELECTRONICS AND COMMUNICATIONS ENGINEERING QUESTION BANK : DIGITAL DESIGN

More information

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institution of Technology, Delhi

Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institution of Technology, Delhi Embedded Systems Dr. Santanu Chaudhury Department of Electrical Engineering Indian Institution of Technology, Delhi Lecture - 34 Compilers for Embedded Systems Today, we shall look at the compilers, which

More information

Design methodology for programmable video signal processors. Andrew Wolfe, Wayne Wolf, Santanu Dutta, Jason Fritts

Design methodology for programmable video signal processors. Andrew Wolfe, Wayne Wolf, Santanu Dutta, Jason Fritts Design methodology for programmable video signal processors Andrew Wolfe, Wayne Wolf, Santanu Dutta, Jason Fritts Princeton University, Department of Electrical Engineering Engineering Quadrangle, Princeton,

More information

Tradeoff Analysis and Architecture Design of a Hybrid Hardware/Software Sorter

Tradeoff Analysis and Architecture Design of a Hybrid Hardware/Software Sorter Tradeoff Analysis and Architecture Design of a Hybrid Hardware/Software Sorter M. Bednara, O. Beyer, J. Teich, R. Wanka Paderborn University D-33095 Paderborn, Germany bednara,beyer,teich @date.upb.de,

More information

Hardware-Software Codesign. 1. Introduction

Hardware-Software Codesign. 1. Introduction Hardware-Software Codesign 1. Introduction Lothar Thiele 1-1 Contents What is an Embedded System? Levels of Abstraction in Electronic System Design Typical Design Flow of Hardware-Software Systems 1-2

More information

Chapter 4. The Processor. Instruction count Determined by ISA and compiler. We will examine two MIPS implementations

Chapter 4. The Processor. Instruction count Determined by ISA and compiler. We will examine two MIPS implementations Chapter 4 The Processor Part I Introduction CPU performance factors Instruction count Determined by ISA and compiler CPI and Cycle time Determined by CPU hardware We will examine two MIPS implementations

More information

Ten Reasons to Optimize a Processor

Ten Reasons to Optimize a Processor By Neil Robinson SoC designs today require application-specific logic that meets exacting design requirements, yet is flexible enough to adjust to evolving industry standards. Optimizing your processor

More information

Data Parallel Architectures

Data Parallel Architectures EE392C: Advanced Topics in Computer Architecture Lecture #2 Chip Multiprocessors and Polymorphic Processors Thursday, April 3 rd, 2003 Data Parallel Architectures Lecture #2: Thursday, April 3 rd, 2003

More information

FPGA-Based Rapid Prototyping of Digital Signal Processing Systems

FPGA-Based Rapid Prototyping of Digital Signal Processing Systems FPGA-Based Rapid Prototyping of Digital Signal Processing Systems Kevin Banovic, Mohammed A. S. Khalid, and Esam Abdel-Raheem Presented By Kevin Banovic July 29, 2005 To be presented at the 48 th Midwest

More information

FPGA for Software Engineers

FPGA for Software Engineers FPGA for Software Engineers Course Description This course closes the gap between hardware and software engineers by providing the software engineer all the necessary FPGA concepts and terms. The course

More information

Optimized Design Platform for High Speed Digital Filter using Folding Technique

Optimized Design Platform for High Speed Digital Filter using Folding Technique Volume-2, Issue-1, January-February, 2014, pp. 19-30, IASTER 2013 www.iaster.com, Online: 2347-6109, Print: 2348-0017 ABSTRACT Optimized Design Platform for High Speed Digital Filter using Folding Technique

More information

Architectural-Level Synthesis. Giovanni De Micheli Integrated Systems Centre EPF Lausanne

Architectural-Level Synthesis. Giovanni De Micheli Integrated Systems Centre EPF Lausanne Architectural-Level Synthesis Giovanni De Micheli Integrated Systems Centre EPF Lausanne This presentation can be used for non-commercial purposes as long as this note and the copyright footers are not

More information

Synthesizable FPGA Fabrics Targetable by the VTR CAD Tool

Synthesizable FPGA Fabrics Targetable by the VTR CAD Tool Synthesizable FPGA Fabrics Targetable by the VTR CAD Tool Jin Hee Kim and Jason Anderson FPL 2015 London, UK September 3, 2015 2 Motivation for Synthesizable FPGA Trend towards ASIC design flow Design

More information

How Much Logic Should Go in an FPGA Logic Block?

How Much Logic Should Go in an FPGA Logic Block? How Much Logic Should Go in an FPGA Logic Block? Vaughn Betz and Jonathan Rose Department of Electrical and Computer Engineering, University of Toronto Toronto, Ontario, Canada M5S 3G4 {vaughn, jayar}@eecgutorontoca

More information

Enhancing a Reconfigurable Instruction Set Processor with Partial Predication and Virtual Opcode Support

Enhancing a Reconfigurable Instruction Set Processor with Partial Predication and Virtual Opcode Support Enhancing a Reconfigurable Instruction Set Processor with Partial Predication and Virtual Opcode Support Nikolaos Vassiliadis, George Theodoridis and Spiridon Nikolaidis Section of Electronics and Computers,

More information

ISSN Vol.05,Issue.09, September-2017, Pages:

ISSN Vol.05,Issue.09, September-2017, Pages: WWW.IJITECH.ORG ISSN 2321-8665 Vol.05,Issue.09, September-2017, Pages:1693-1697 AJJAM PUSHPA 1, C. H. RAMA MOHAN 2 1 PG Scholar, Dept of ECE(DECS), Shirdi Sai Institute of Science and Technology, Anantapuramu,

More information

INSTITUTE OF AERONAUTICAL ENGINEERING Dundigal, Hyderabad ELECTRONICS AND COMMUNICATIONS ENGINEERING

INSTITUTE OF AERONAUTICAL ENGINEERING Dundigal, Hyderabad ELECTRONICS AND COMMUNICATIONS ENGINEERING INSTITUTE OF AERONAUTICAL ENGINEERING Dundigal, Hyderabad - 00 0 ELECTRONICS AND COMMUNICATIONS ENGINEERING QUESTION BANK Course Name : DIGITAL DESIGN USING VERILOG HDL Course Code : A00 Class : II - B.

More information

Design methodology for multi processor systems design on regular platforms

Design methodology for multi processor systems design on regular platforms Design methodology for multi processor systems design on regular platforms Ph.D in Electronics, Computer Science and Telecommunications Ph.D Student: Davide Rossi Ph.D Tutor: Prof. Roberto Guerrieri Outline

More information