Flexible Datapath Synthesis Through Arithmetically Optimized Operation Chaining


2009 NASA/ESA Conference on Adaptive Hardware and Systems

Sotirios Xydis, Ioannis Triantafyllou, George Economakos, Kiamal Pekmestzi
National Technical University of Athens - School of Electrical and Computer Engineering
Microprocessors and Digital Systems Laboratory
{sxydis, ioannis, geconom, pekmes}@microlab.ntua.gr

Abstract

Datapath synthesis incorporating complex operation templates has proven extremely efficient, especially for the Digital Signal Processing (DSP) application domain. However, only architectural-level optimizations have been reported for the specification and implementation of the operation templates. This paper introduces arithmetic-level optimizations into template-based datapath synthesis. A high-performance architecture for the implementation of DSP kernels is presented. It is based on flexible and arithmetically optimized components able to perform a large set of operation templates. A synthesis methodology for optimized mapping of DSP kernels onto the proposed architecture is also presented. Experimental results show significant gains in execution time, active chip area and power dissipation in comparison to previously published flexible template-based datapaths.

1 Introduction

Modern embedded systems are dominated by Digital Signal Processing (DSP) applications, e.g. telecom, wireless communications etc. The main characteristic of these applications is that they are extremely computation intensive with light-weight control structures. Their execution time is dominated by certain code segments, called kernels. In order to deliver high-performance implementations, previous contributions [2], [6], [8], [11], [12], [14], [15], [16] have proposed the hardware mapping of these kernels.
Research activities from the fields of High Level Synthesis (HLS) [11], [14], [16], [18] and Application Specific Instruction Processors (ASIP) [2], [6], [12], [15] have presented promising performance results when complex hardware resources (custom instructions in the ASIP domain) are utilized instead of primitive ones, e.g. single ALUs, MULs etc. These complex hardware resources take advantage of chained operation templates found in the initial dataflow graph (DFG) of the DSP kernel. Operation chaining removes the intermediate registers between primitive hardware units, improving their overall delay.

Figure 1. a) Conventional and Proposed Design Flow, b) a 3:2 C-S adder, c) a 4:2 C-S adder.

Given that the template-based HLS and ASIP methodologies work at the architectural level, most of the aforementioned techniques generate datapaths taking into account only observations on the structure of the DFGs. Thus, little or no awareness of techniques from lower design levels (i.e. arithmetic-level optimization) is encountered (Conventional Flow in Fig. 1a). The complex operation templates are either specified in a predefined behavioral template library [11], [14], [16] or extracted directly from the kernel's DFG [2], [6], [12], [15]. Since flexibility has not been considered, a large number of different operation templates is generated, reducing the physical homogeneity and the opportunities for efficient template sharing at register transfer level (RTL). Recently, in [8], the usage of flexible operation templates has been proposed in order to address this type of inefficiency.
Their basic computational node, called Flexible Computational Component (FCC), comprises four ALU/MUL units. Their architecture is based on fully interconnecting such FCCs with an intermediate crossbar switch in order to exploit inter-template chaining as well. Although significant improvements in execution time are reported by their approach, large area overheads are imposed due to hardware redundancy. Additionally, the crossbar interconnection switch limits the scalability of their architecture.

The previously referenced contributions share an inherent limitation: the critical delay of the generated datapaths is determined by the long carry-propagation chains found in conventional binary arithmetic designs. In order to eliminate the long carry-propagation chains found in datapath components, the Carry-Save (C-S) arithmetic representation is used [4]. In C-S datapaths, chains of 3:2 (Fig. 1b) or 4:2 (Fig. 1c) C-S adders are formed in such a way that multi-input additions (present in multiplication, multiplication/accumulation (MAC) and accumulation/multiplication operations) are reduced significantly faster than by serial addition of the inputs.

Many research activities [1], [3], [10], [17], [21] have addressed datapath optimization using C-S arithmetic. In [1], [3], DFG transformation techniques are presented in order to either maximize the use of C-S arithmetic [3] or perform common subexpression elimination in C-S computations [1]. However, these studies took into account only ASIC-like datapaths, without considering the flexibility of the final architecture. Timing-driven synthesis utilizing C-S components is presented in [10], [17]. These techniques operate at the post-RTL abstraction level, when the whole behavioral specification has been laid out as a single circuit. Thus, no flexibility and no hardware-sharing opportunities are supported. In [21] the problem of combined retiming, module selection and C-S arithmetic representation selection is formulated.
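The carry-save principle referred to above can be illustrated with a minimal Python sketch. It is not taken from the paper: the function names `csa_3to2` and `carry_save_sum` are hypothetical, and arbitrary-precision Python integers stand in for fixed-width hardware words. A 3:2 compressor reduces three operands to a (sum, carry) pair using only bitwise logic, so a carry-propagate addition is needed only once, at the very end.

```python
def csa_3to2(a: int, b: int, c: int) -> tuple[int, int]:
    """3:2 carry-save compressor: three operands in, a (sum, carry) pair out."""
    s = a ^ b ^ c                                # bitwise sum, no carry propagation
    cy = ((a & b) | (a & c) | (b & c)) << 1      # majority bits, weighted by 2
    return s, cy


def carry_save_sum(operands: list[int]) -> int:
    """Accumulate operands in carry-save form, then do one carry-propagate add."""
    s, cy = 0, 0
    for x in operands:
        s, cy = csa_3to2(s, cy, x)               # value s + cy is preserved at every step
    return s + cy                                # the only carry-propagate addition


print(carry_save_sum([13, 7, 22, 5]))            # prints 47
```

Each compressor step preserves the invariant `a + b + c == s + cy`, which is why intermediate results can stay in redundant (sum, carry) form between chained operations.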
They introduced the Mixed representation Flow Graph (MFG) model, which resolves the signal representation mismatch (C-S vs binary). The final datapath comprises a mixture of C-S and binary arithmetic operators. Again, no flexibility issues have been considered.

The main objective of this paper is to enable flexible and high-performance datapath synthesis. In order to achieve this, we improve the state-of-the-art solution on flexible datapath synthesis [8] by eliminating its two disadvantageous features, namely 1) the unawareness of arithmetic-level optimizations during architecture specification and 2) the unscalable and area- and power-hungry crossbar interconnection. Specifically, we introduce a flexible datapath architecture and a comprehensive methodology for synthesizing DSP kernels. The proposed datapath is formed by combining concepts and optimization techniques from both the architectural and the arithmetic level of abstraction (Proposed Flow in Fig. 1a). It comprises uniform and flexible computational units which enable the execution of a large set of operation templates found in DSP kernels. The overall architecture operates in C-S arithmetic, delivering high-speed implementations. A systematic synthesis methodology for optimized mapping of the DSP kernels onto the proposed architecture is also presented.

Experimental comparisons at the circuit level show that the proposed flexible computational units are able to operate over a significantly larger range of operating frequencies and with lower area overheads than the FCC units in [8]. Also, experiments on a representative set of DSP kernels show that the proposed approach delivers an average improvement of 29,5% in execution latency together with 42,1% area reduction and 10,1% power gains, compared to the datapath solution in [8].

The rest of the paper is organized as follows. Section 2 presents the proposed architecture and the detailed design of its flexible computational units.
In Section 3, the synthesis methodology for mapping DSP kernels onto instances of the proposed architecture is presented. Experimental results are reported in Section 4, while Section 5 concludes the paper.

2 Flexible Datapath Architecture

An abstract architectural model of the proposed datapath is illustrated in Fig. 2. It is composed of 1) the Flexible Computational Units (FCUs), 2) the register bank, 3) the data interconnection network, 4) a C-S to binary arithmetic conversion module (CS2Bin) and 5) the control unit, which drives the overall architecture (configuration words and multiplexors' selection signals) in each control step. Each individual component has been designed to operate on 16-bit C-S word operands, since such a word length is considered adequate for the majority of DSP datapaths [8].

The register bank is introduced for the storage of intermediate results and for the sharing and communication of variables among the FCUs. It is implemented by scratch registers according to the register allocation of each kernel. The data interconnection network handles the communication between the register bank and the FCUs. Different DSP kernels (different register allocation and data communication patterns per kernel) can be mapped onto the proposed architecture utilizing post-RTL datapath sharing techniques [13].

Figure 2. Abstract View of the Flexible Datapath.

The arithmetic conversion module (CS2Bin) performs the conversion from the C-S format to conventional binary. It is implemented as a simple carry-propagate adder [4], since its critical path delay is overlapped with the critical path delay of the FCUs. The arithmetic conversion usually takes place at the end of each kernel execution in order to output the computed result in binary format. The number of FCUs is determined at design time according to the available instruction-level parallelism (ILP) of each kernel.

Fig. 3 depicts 1) the abstract model of the FCU, 2) the internal structure of the FCU node (detailed model) and 3) the way the abstract model maps onto the detailed one. The abstract model shows that the proposed FCU enables intra-template operation chaining by merging together the pre- and the post-multiplication addition. The alternative execution paths in each FCU are determined by properly setting the control signals of the two multiplexors. These execution paths define the template library into which each FCU can be configured (Fig. 4). Based on the template library of Fig. 4, the DSP kernels are mapped onto the proposed architecture (Section 3). Two-level operation templates (of type Add-Mul, Mul-Add) dominate most DSP kernels [11]. However, three-level operation templates also occur in many DSP kernels, e.g. in Symmetrical FIR filters. The proposed datapath also supports this type of template (i.e. T1 of Fig. 4). Inter-template (inter-FCU) chaining has not been considered, in order to avoid the unscalable interconnection crossbar among FCUs found in [8].

Figure 3. The Flexible Computational Unit.

The internal structure of the FCU (Fig. 3) consists of 1) a 4:2 C-S adder of 2's complement numbers, for the addition of the input data (X*, Y*), 2) a C-S to Modified Booth [4] recoding scheme, 3) a tree-based adder for the addition of the partial products, 4) the final C-S accumulation unit, also implemented by a 4:2 C-S adder, and 5) a configuration register. The superscript * denotes a C-S redundant representation composed of two numbers, both in 2's complement form. Each of X*, Y*, K*, N*, Z* is in C-S format. The quantities X_C, X_S, Y_C, Y_S, K_C, K_S, N_C, N_S, Z_C, Z_S and A are all 2's complement conventional binary numbers (Fig. 3). In general, the following relation holds for all C-S formatted data: X* = {X_C, X_S} = X_C + X_S. Thus, input data in conventional binary format can be directly processed by an FCU without any conversion overhead.

The entire internal structure of each FCU operates directly on, and produces data in, the C-S arithmetic representation. This feature 1) enables the direct reusability of the unit's output and 2) eliminates the latency-intensive carry chains found in the conversion from the C-S representation to the conventional binary form. The upper 4:2 C-S adder computes N* = X* + Y*, which can be used either in the pre- or the post-multiplication addition. Input A always has to be in 2's complement binary numeric representation. Next, the 2's complement C-S formed data, N* or K*, are driven either to multiplication or to addition.

We analyze the multiplication path in more detail, since it forms the critical path of the FCU. When the FCU is configured to perform multiplication, the recoding unit is activated. The recoding unit enables the multiplication operation to be conducted with one operand (N* or K*) in C-S format. It performs the conversion from C-S format to an intermediate format of Signed Digits (SDs) [4] and subsequently the conversion of the intermediate SDs to the Modified Booth (MB) digit representation [4].
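To make the Modified Booth digit set concrete, the following is a minimal Python sketch of the conventional radix-4 recoding of a binary 2's complement operand. It is an illustration only and not the FCU recoder: the FCU derives the MB digits directly from a C-S pair through signed digits without carry propagation [19], whereas this sketch assumes the operand is already in binary form. The function names and the default 16-bit width are assumptions made for the example.

```python
def modified_booth_digits(x: int, width: int = 16) -> list[int]:
    """Radix-4 Modified Booth digits (each in -2..2) of a 2's complement value."""
    if width % 2 != 0:
        raise ValueError("width must be even")
    if not -(1 << (width - 1)) <= x < (1 << (width - 1)):
        raise ValueError("x does not fit in the given width")
    u = x & ((1 << width) - 1)                   # 2's complement bit pattern of x

    def bit(i: int) -> int:
        return (u >> i) & 1 if i >= 0 else 0     # the bit below the LSB is 0

    # d_j = x[2j-1] + x[2j] - 2*x[2j+1], least significant digit first
    return [bit(2 * j - 1) + bit(2 * j) - 2 * bit(2 * j + 1)
            for j in range(width // 2)]


def booth_multiply(x: int, y: int, width: int = 16) -> int:
    """Multiply by summing one partial product per MB digit of x."""
    return sum(d * y * 4 ** j
               for j, d in enumerate(modified_booth_digits(x, width)))


digits = modified_booth_digits(-3, width=4)
print(digits)                                    # prints [1, -1], i.e. 1 - 1*4 = -3
```

The recoding halves the number of partial products (eight digits for a 16-bit operand), which is what makes it attractive in front of the partial product reduction tree.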
Further details on the implementation of the recoding unit can be found in [19]. It is worth noting that the C-S to SD and the SD to MB conversion modules are carry-propagation-free circuits, contributing only slightly to the critical path delay. The MB digits are passed to the partial product generator and the partial products are reduced within a C-S Wallace tree addition scheme [4]. The multiplier's output is in 2's complement C-S format. Since the paper focuses on the overall architecture description, and due to lack of space, we omit further details about the core multiplier's circuit.

The final accumulation unit is a 4:2 C-S adder with two fixed inputs (the multiplier's C-S output) and two configurable inputs. All the inputs of the accumulation unit are in 2's complement C-S format. The configurable inputs are either the N* derived from the initial 4:2 C-S adder or an independent C-S input number (K*). Finally, the operation mode of the FCU and the signs of the input numbers (determining addition or subtraction operations) are controlled through the configuration register, by driving the multiplexors and the sign selection inside each 4:2 C-S adder module with the proper control logic bits (CL_i). The configuration register is loaded on a cycle-by-cycle basis with configuration words which are generated in the control unit (Fig. 2).

Figure 4. The FCU Template Library.

3 Synthesis Methodology

An HLS synthesis methodology has been developed in order to enable the efficient mapping of DSP kernels onto the FCU-based architecture. The overall flow of the proposed methodology is depicted in Fig. 5. It consists of 3 major phases. Each phase is shown in a different color in Fig. 5, in order to be clearly distinguished.

In phase 1, the DFG of the kernel is extracted from its C code specification. Array address calculations are statically pre-computed and substituted in the original DFG by register-to-register transfers. In this way, the FCUs are concentrated on effective computations and not on simple index calculations. Additionally, the proposed datapath produces data in C-S arithmetic representation, which is an unsuitable format for array indices. The DFG is unrolled according to the number of multiplication operations in the DFG and the allocated FCUs. In this way, the available ILP is balanced against the utilization of the allocated hardware, since the multiplier is the area-dominant component of the FCU.

The next phase of the proposed methodology is the C-S aware reduction of the original DFG. It is similar to aggressively applying the C-S module selection technique in [21].
The DFG reduction exploits the inherent feature of C-S addition/subtraction circuits to merge/compress multiple additions into a single one [17]. Thus, the size of the DFG shrinks, offering opportunities for faster schedules (in number of cycles) than when considering primitive resource operations. Conventional C-S aware DFG reduction techniques [1], [3], [10], [17] assume only 3:2 C-S compressors. The proposed flexible architecture is able to handle 3:2 together with 4:2 C-S compression, since 4:2 C-S compressors are a superset of 3:2 compressors.

Figure 5. The proposed Synthesis Flow.

The pseudocode of the proposed C-S aware DFG reduction is shown in Fig. 6a. We use the notion of Boundary Operations (B_Ops), similar to [17]. Boundary Operations are the DFG nodes which set the boundaries of C-S aware reductions. Thus, C-S reductions are conducted on DFG nodes that lie between B_Ops nodes. In our case, assuming DFGs which comprise only addition/subtraction and multiplication operations, three types of B_Ops are encountered: 1) the DFG's primary inputs, 2) the DFG's primary outputs and 3) the multiplication nodes of the DFG. At first, the original DFG is ASAP scheduled, in order to topologically order the DFG's nodes according to their timing dependencies. Next, the C-S addition trees are formed iteratively, by merging primitive DFG nodes. The B_Ops are excluded from the merging process, while the high-fanout DFG nodes are merged only as roots of the C-S trees. At the end of each iteration, the formed C-S trees are substituted in the original DFG in order to be candidate nodes for extra merging in the next iteration. In each iteration, the 3:2 C-S reductions are considered first and a second pass forms the 4:2 C-S reductions wherever possible.
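The merging rule described above can be sketched in a few lines of Python. This is a deliberately simplified illustration, not the authors' algorithm: it assumes a DFG containing only inputs, multiplications and additions (no subtractions), it skips the ASAP ordering and the separate 3:2 and 4:2 passes, and the dictionary encoding and the name `merge_addition_trees` are hypothetical. It only shows how chains of additions between boundary operations collapse into multi-operand additions, with multi-fanout adders kept as roots.

```python
from collections import Counter


def merge_addition_trees(dfg: dict[str, tuple[str, list[str]]]) -> dict[str, list[str]]:
    """Map each root 'add' node to the flattened operand list of its addition tree.

    dfg maps a node name to (op, input node names), with op in {'in', 'mul', 'add'}.
    """
    fanout = Counter(src for _, srcs in dfg.values() for src in srcs)
    feeds_add = {src for op, srcs in dfg.values() if op == "add" for src in srcs}

    def absorbable(node: str) -> bool:
        # An adder is merged into its consumer only if that single consumer is an adder.
        return dfg[node][0] == "add" and fanout[node] == 1 and node in feeds_add

    def flatten(node: str) -> list[str]:
        operands: list[str] = []
        for src in dfg[node][1]:
            operands.extend(flatten(src) if absorbable(src) else [src])
        return operands

    return {n: flatten(n) for n, (op, _) in dfg.items()
            if op == "add" and not absorbable(n)}


example = {
    "a": ("in", []), "b": ("in", []), "c": ("in", []), "d": ("in", []), "e": ("in", []),
    "t1": ("add", ["a", "b"]),
    "t2": ("add", ["t1", "c"]),
    "t3": ("add", ["t2", "d"]),      # feeds a multiplication, so it stays a root
    "m": ("mul", ["t3", "e"]),       # boundary operation
    "y": ("add", ["m", "a"]),
}
print(merge_addition_trees(example))
# prints {'t3': ['a', 'b', 'c', 'd'], 'y': ['m', 'a']}
```

In this example the three chained additions become one four-operand addition, which corresponds to a single 4:2 compression of binary words, while the two-operand node `y` would be padded with zero inputs as described in the text.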
The 3:2 compression nodes remain until the end of the C-S aware reduction process, in order to allow for a larger merging at later steps. After the completion of the C-S reduction process, the remaining uncompressed addition nodes or 3:2 compression nodes are substituted by 4:2 compressor nodes, by adding one or two zero inputs, respectively, at the unbound ports. Fig. 6b illustrates the C-S aware reduction on a sample DFG.

C-S Aware DFG Reduction
  Input_1: DFG;
  /* Set Boundary Operations (B_Op) */
  Input_2: B_Op {I/O Ops, Mul Ops};
  Output: Reduced DFG;
  ASAP-schedule the DFG;
  for (step = 2; step <= #sched_steps; step++) {
    for each Op_i in step {
      if (Op_i != B_Op) {
        Bottom-Up Generation of 3:2 C-S Trees;
        Bottom-Up Generation of 4:2 C-S Trees;
        Reconstruct the Scheduled DFG;
      } else continue;
    }
  }

Figure 6. a) C-S Aware DFG Reduction Pseudocode, b) Example of C-S Reduction on a sample DFG.

The C-S aware reduction process transforms the original DFG into an intermediate representation which is compliant with the FCU's resource model. A pattern generation procedure is applied onto the reduced DFG with respect to the FCU template library (Fig. 4). The pattern generation is a covering of the reduced DFG according to the FCU's operation templates. It clusters the DFG's C-S reduced nodes with the multiplication nodes, in order to perform maximal operation chaining. Currently, the covering patterns are generated in an exhaustive manner. The designer is responsible for selecting those patterns that optimally cover the reduced DFG with respect to the minimization of the DFG's execution time. Automated library-based pattern generation and selection [11] can also be incorporated; however, these techniques are out of the scope of this paper.

In phase 3, the clustered DFG is scheduled in order to assign each FCU operation to a specific control step. Since the datapath is realized by a fixed number of FCUs, a resource-constrained scheduling problem with the goal of latency minimization is considered. The number of available FCUs and CS2Bin modules determines the resource constraint set. A list-based scheduler [7] has been developed that takes into account the mobility of FCU operations. The ASAP and ALAP time-stamps of the FCU operations are calculated and the mobility of each FCU operation is extracted as the corresponding ALAP - ASAP difference. The FCU operations are prioritized by lowest mobility value, since the lower the operation's mobility, the more critical the operation. Next, the scheduled FCU operations are bound onto specific FCU instances and the proper configuration bits are generated.
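The mobility-driven list scheduling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the scheduler developed by the authors: every FCU operation is assumed to take one control step, only one resource type (the FCUs) is modeled, CS2Bin modules are ignored, the dependency graph is assumed acyclic, and the function name `list_schedule` and the operation names are hypothetical.

```python
def list_schedule(preds: dict[str, list[str]], n_fcu: int) -> dict[str, int]:
    """Resource-constrained list scheduling; lowest mobility (ALAP - ASAP) goes first.

    preds maps each operation to the operations it depends on.
    Returns the control step (starting at 1) assigned to each operation.
    """
    if n_fcu < 1:
        raise ValueError("at least one FCU is required")
    succs: dict[str, list[str]] = {op: [] for op in preds}
    for op, ps in preds.items():
        for p in ps:
            succs[p].append(op)

    asap: dict[str, int] = {}

    def asap_of(op: str) -> int:
        if op not in asap:
            asap[op] = 1 + max((asap_of(p) for p in preds[op]), default=0)
        return asap[op]

    depth = max((asap_of(op) for op in preds), default=0)   # critical path length

    alap: dict[str, int] = {}

    def alap_of(op: str) -> int:
        if op not in alap:
            alap[op] = min((alap_of(s) for s in succs[op]), default=depth + 1) - 1
        return alap[op]

    mobility = {op: alap_of(op) - asap[op] for op in preds}

    schedule: dict[str, int] = {}
    step = 0
    while len(schedule) < len(preds):
        step += 1
        ready = [op for op in preds
                 if op not in schedule and all(p in schedule for p in preds[op])]
        ready.sort(key=lambda op: (mobility[op], op))        # most critical first
        for op in ready[:n_fcu]:                             # at most n_fcu ops per step
            schedule[op] = step
    return schedule


ops = {"f1": [], "f2": [], "f3": [], "f4": ["f1"], "f5": ["f4"]}
print(list_schedule(ops, n_fcu=2))
# prints {'f1': 1, 'f2': 1, 'f4': 2, 'f3': 2, 'f5': 3}
```

With a single FCU the same graph still finishes the critical chain f1, f4, f5 in steps 1 to 3, because those zero-mobility operations always win the priority ordering.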
After the completion of register allocation, an FSM description is extracted in order to implement the control unit of the overall architecture, and an FSMD [7] model of the FCU-based architecture is generated in synthesizable Verilog.

4 Experimental Results

In this section, we provide experimental results showing the effectiveness of our approach. We have compared the proposed datapath with the one presented in [8], which is the most recent and relevant work to ours. A comparative area-timing exploration between the two basic computational components (the proposed FCU and the FCC in [8]) has been conducted. We have also included explorative results for a conventional 16-bit multiplication unit (DW_Mul), in order to provide a straightforward comparison with the area-dominant component (the multiplication unit) of non-flexible implementations. All the components were mapped onto the standard cells of the TSMC 0.13 um technology library, using the Synopsys Design Compiler version 2006 [20]. The arithmetic-optimized pparch implementations from the Synopsys DesignWare library [20] were considered for both the conventional multiplication unit and the multiplication units found in the FCC [8].

The lower limits of the area-timing values for the three implementations were exposed through an iterative synthesis procedure which generated different delay-constrained versions of the designs. The delay constraint was altered in each iteration in steps of 0,1 ns, with an initial value of 1,0 ns and a final value of 5,3 ns. Fig. 7 reports the comparative results.

Figure 7. Area-time Explorative Diagram (area in um^2 versus delay in ns for Synopsys DW_Mul pparch, the proposed FCU and the FCC [8]; annotated differences: FCUvsMUL = 0,8 ns and 19147,1 um^2, FCCvsFCU = 1,7 ns and 66875,4 um^2).

The proposed FCU is able to operate, without timing violations, in a time frame of [2,2 ns, 5,3 ns]. Respectively, the FCC unit [8] operates without timing violations in a time frame of [3,9 ns, 5,3 ns]. Thus, the proposed FCU has an operative range that is larger than that of the FCC [8] by about ΔT_FCCvsFCU = 1,7 ns. As expected, the DW_Mul unit has the largest violation-free timing range, [1,6 ns, 5,3 ns], compared with the two flexible components. The rather small ΔT_FCUvsMUL = 0,8 ns between the proposed flexible computational unit and the optimized non-flexible DW_Mul unit shows the efficiency of our approach. Additionally, comparing the area of the two flexible components at 3,9 ns (the upper operative point of the FCC unit), the proposed FCU delivers approximately 6 times smaller area than the FCC. The comparison of the proposed FCU and the DW_Mul at the upper operative point of the FCU (T = 2,2 ns) shows that the FCU occupies about 3 times larger area than DW_Mul at that specific point. However, Fig. 7 shows that for operative points larger than 4 ns the area of the FCU converges towards that of the optimized non-flexible DW_Mul.

A representative set of computationally intensive DSP kernels was formed in order to demonstrate the efficiency of the proposed solution. The benchmark suite consists of: 1) an 8-tap Symmetrical FIR filter (SymFir8), 2) a 16-tap FIR filter (Fir16), 3) a 6th order Elliptic filter (Elliptic), 4) a Volterra IIR filter, 5) the MESA Matrix Multiplication (Mesa Mat Mul) kernel [5], 6) a straightforward 1-D DCT kernel with the column loop unrolled (U-R 1D-DCT), 7) the 2-D DCT (Jpeg DCT) used in JPEG [5] and 8) the 2-D Inverse DCT (Mpeg IDCT) used in MPEG [5]. These kernels were mapped 1) onto an FCU-based architecture comprising 4 FCUs and 2) onto an FCC-based datapath with 2 FCCs (= 8 ALUs and 8 Muls) [8]. The SymFir8 and Fir16 have been unrolled 4 times, while each loop of the Volterra filter has been fully unrolled, according to the synthesis methodology of Section 3. The Mesa Mat Mul, U-R 1D-DCT and Jpeg DCT kernels were not unrolled, since more than 4 multiplication operations were available in each iteration.
For the mapping onto the FCC-based architecture, the above benchmarks were scheduled using the SPARK HLS tool [9] with the aggressive operation chaining option enabled, and we manually optimized the resultant FCC-based datapath in order to take into consideration the inter-template chaining, which is not supported by SPARK. Due to limited space we omit comparative results between the proposed approach and datapaths composed only of primitive operators. However, such a comparison has been conducted in [8] for the case of the FCC-based datapath. Since we compare against the approach in [8], some straightforward qualitative conclusions about the performance efficiency of our approach in comparison with primitive resource datapaths can be safely inferred.

The FCU-based and FCC-based FSMD models were synthesized with Synopsys Design Compiler and the TSMC 0.13 um technology library. For the proposed FCU-based datapath a timing constraint of T_clkFCU = 3,8 ns was imposed, while for the FCC-based datapath the timing constraint was set to T_clkFCC = 4,8 ns (the middle values between the upper and lower operative points of Fig. 7, for each flexible component). Power analysis of the synthesized netlists was performed with Synopsys PrimePower [20]. Worst-case power analysis was considered by imposing ToggleRate = 1 on all the inputs and the internal nets of the synthesized netlists.

Table 1 reports 1) the actual latency (#cycles x T_clk) in ns, 2) the active area in um^2 and 3) the power dissipation in mWatt for each DSP kernel. The proposed datapath delivers faster implementations with smaller area complexity than the FCC-based datapaths in all cases. Specifically, the proposed datapath delivers average latency and area reductions of 29,5% and 42,1%, respectively. The average power consumption is 10,1% lower for FCU-based datapaths. In some kernels the power consumption of the proposed datapath is larger than that of the FCC-based datapath.
This occurs due to the higher operating frequency of the FCU datapath in comparison to the FCC datapath. However, the small area complexity (small load capacitance) of FCU-based datapaths amortizes the power effect of the high operating frequency in most DSP kernels.

Table 1. Latency, Area and Power Consumption Results: FCU vs FCC [8] Datapaths.
(The absolute latency and area columns are not legible in this transcription; the legible power and gain columns follow.)

DSP Kernel      FCU Power   FCC [8] Power   Latency Gain   Area Gain   Power Gain
                (mWatt)     (mWatt)         (%)            (%)         (%)
SymFir8         7,3         13,9            47,2           60,5        47,5
Fir16           10,0        8,5             40,6           41,8        -17,6
Elliptic        11,2        17,8            20,8           37,2        37,1
Volterra        8,8         13,3            43,4           47,4        33,8
Mesa Mat Mul    13,3        16,4            24,4           45,2        18,9
U-R 1D-DCT      10,6        13,7            17,7           47,9        22,6
Jpeg DCT        47,1        37,4            24,3           27,8        -25,7
Mpeg IDCT       ...         37,1            17,8           28,9        -36,11
Average                                     29,5           42,1        10,1

We considered three design metrics (Fig. 8), namely 1) the Area-Delay (AD) product, 2) the Power-Delay (PD) product and 3) the Energy-Delay (ED) product, in order to evaluate the efficiency of the synthesized datapaths. The AD, PD and ED values (the lower the better) have been normalized according to the respective values of each benchmark for the FCC case [8]. Thus, the top dashed line (value 100%) represents the corresponding AD, PD or ED product of the FCC datapath. The FCU datapath solution outperforms the FCC one in all cases and for all design metrics, except the PD value for the Mpeg IDCT kernel. The AD values of FCU-based datapaths lie between 20,8% and 58,4%. The PD and ED values range between 27,7% and 111% and between 14,6% and 92%, respectively.

Figure 8. Design Metrics: a) Area-Delay, b) Power-Delay, c) Energy-Delay Products (normalized FCU-based over FCC-based values per kernel).

5 Conclusion

This paper presented a methodology for high-performance datapath synthesis based on flexible and arithmetically optimized architectural templates. Experimental results on several DSP kernels have shown average performance, area and power improvements of about 30%, 42% and 10%, respectively, over a previously published high-performance and flexible datapath solution.

References

[1] A. Hosangadi, F. Fallah, R. Kastner. Optimizing High Speed Arithmetic Circuits Using Three-Term Extraction. In Proc. of IEEE/ACM DATE, 2006.
[2] A. Peymandoust, L. Pozzi, P. Ienne, G. De Micheli. Automatic Instruction Set Extension and Utilization for Embedded Processors. In Proc. of IEEE ASAP Conference, 2003.
[3] A. Verma, P. Ienne. Improved Use of the Carry-Save Representation for the Synthesis of Complex Arithmetic Circuits. In Proc. of IEEE/ACM ICCAD, 2004.
[4] B. Parhami. Computer Arithmetic: Algorithms and Hardware Designs. Oxford University Press, 2000.
[5] C. Lee, M. Potkonjak, W. Mangione-Smith. MediaBench: A Tool for Evaluating and Synthesizing Multimedia and Communications Systems. In Proc. of the MICRO-30.
[6] F. Sun, S. Ravi, A. Raghunathan, N. Jha. Synthesis of Custom Processors Based on Extensible Platforms. In Proc. IEEE/ACM ICCAD, 2002.
[7] G. De Micheli. Synthesis and Optimization of Digital Circuits. McGraw-Hill Higher Education.
[8] M. Galanis, G. Theodoridis, S. Tragoudas, and C. Goutis. A High Performance Data-Path for Synthesizing DSP Kernels. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 25(6), June 2006.
[9] S. Gupta, R. Gupta, N. Dutt, and A. Nicolau. SPARK: A Parallelizing Approach to the High-Level Synthesis of Digital Circuits. Springer.
[10] J. Um, T. Kim. An Optimal Allocation of Carry-Save-Adders in Arithmetic Circuits. IEEE Trans. on Computers, 50(3), 2001.
[11] M. Corazao, M. Khalaf, L. Guerra, M. Potkonjak, J. Rabaey. Performance Optimization Using Template Mapping for Datapath-Intensive High-Level Synthesis. IEEE Trans. on Computer-Aided Design of Integrated Circuits and Systems, 15(2), Aug.
[12] N. Clark, H. Zhong, W. Tang, S. Mahlke. Automatic Design of Application Specific Instruction Set Extensions Through Dataflow Graph Exploration. Int. J. Parallel Programming, 31(6), 2003.
[13] N. Moreano, E. Borin, C. de Souza, G. Araujo. Efficient Datapath Merging for Partially Reconfigurable Architectures. IEEE Trans. on CAD of Integrated Circuits and Systems, 24(7):969-980.
[14] P. Marwedel, B. Landwehr, R. Domer. Built-in Chaining: Introducing Complex Components into Architectural Synthesis. In Proc. of the ASP-DAC.
[15] R. Kastner, S. Ogrenci-Memik, E. Bozorgzadeh, M. Sarrafzadeh. Instruction Generation for Hybrid Reconfigurable Systems. ACM Trans. on Design Automation of Electronic Systems, 7(4):605-627, 2002.
[16] S. Note, W. Geurts, F. Catthoor, H. De Man. Cathedral-III: Architecture-driven High-Level Synthesis for High Throughput DSP Applications. In Proc. ACM/IEEE DAC.
[17] T. Kim, W. Jao, S. Tjiang. Circuit Optimization Using Carry-Save-Adder Cells. IEEE Trans. on CAD of Integrated Circuits and Systems, 17(10).
[18] T. Ly, D. Knapp, R. Miller, D. MacMillen. Scheduling Using Behavioral Templates. In Proc. ACM/IEEE DAC, pages 11 16.
[19] W.C. Yeh, C.W. Jen. High Performance Carry-Save to Signed-Digit Recoder for Fused Addition-Multiplication. In Proc. of IEEE ICASSP, 2000.
[20]
[21] Z. Yu, K.Y. Khoo, A. Willson. The Use of Carry-Save Representation in Joint Module Selection and Retiming. In Proc. of IEEE/ACM DAC.


Maximally and Arbitrarily Fast Implementation of Linear and Feedback Linear Computations 30 IEEE TRANSACTIONS ON COMPUTER-AIDED DESIGN OF INTEGRATED CIRCUITS AND SYSTEMS, VOL. 19, NO. 1, JANUARY 2000 Maximally and Arbitrarily Fast Implementation of Linear and Feedback Linear Computations Miodrag

More information

HECTOR: Formal System-Level to RTL Equivalence Checking

HECTOR: Formal System-Level to RTL Equivalence Checking ATG SoC HECTOR: Formal System-Level to RTL Equivalence Checking Alfred Koelbl, Sergey Berezin, Reily Jacoby, Jerry Burch, William Nicholls, Carl Pixley Advanced Technology Group Synopsys, Inc. June 2008

More information

International Journal of Engineering and Techniques - Volume 4 Issue 2, April-2018

International Journal of Engineering and Techniques - Volume 4 Issue 2, April-2018 RESEARCH ARTICLE DESIGN AND ANALYSIS OF RADIX-16 BOOTH PARTIAL PRODUCT GENERATOR FOR 64-BIT BINARY MULTIPLIERS K.Deepthi 1, Dr.T.Lalith Kumar 2 OPEN ACCESS 1 PG Scholar,Dept. Of ECE,Annamacharya Institute

More information

Design of Add-Multiply operator usingmodified Booth Recoder K. Venkata Prasad, Dr.M.N. Giri Prasad

Design of Add-Multiply operator usingmodified Booth Recoder K. Venkata Prasad, Dr.M.N. Giri Prasad Design of Add-Multiply operator usingmodified Booth Recoder K. Venkata Prasad, Dr.M.N. Giri Prasad Abstract: Digital Signal Processing (DSP) applications carry out a large number of complex arithmetic

More information

Design of Delay Efficient Carry Save Adder

Design of Delay Efficient Carry Save Adder Design of Delay Efficient Carry Save Adder K. Deepthi Assistant Professor,M.Tech., Department of ECE MIC College of technology Vijayawada, India M.Jayasree (PG scholar) Department of ECE MIC College of

More information

Advanced Design System 1.5. DSP Synthesis

Advanced Design System 1.5. DSP Synthesis Advanced Design System 1.5 DSP Synthesis December 2000 Notice The information contained in this document is subject to change without notice. Agilent Technologies makes no warranty of any kind with regard

More information

High Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC

High Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC Journal of Computational Information Systems 7: 8 (2011) 2843-2850 Available at http://www.jofcis.com High Performance VLSI Architecture of Fractional Motion Estimation for H.264/AVC Meihua GU 1,2, Ningmei

More information

Design and Analysis of Kogge-Stone and Han-Carlson Adders in 130nm CMOS Technology

Design and Analysis of Kogge-Stone and Han-Carlson Adders in 130nm CMOS Technology Design and Analysis of Kogge-Stone and Han-Carlson Adders in 130nm CMOS Technology Senthil Ganesh R & R. Kalaimathi 1 Assistant Professor, Electronics and Communication Engineering, Info Institute of Engineering,

More information

Digital Computer Arithmetic

Digital Computer Arithmetic Digital Computer Arithmetic Part 6 High-Speed Multiplication Soo-Ik Chae Spring 2010 Koren Chap.6.1 Speeding Up Multiplication Multiplication involves 2 basic operations generation of partial products

More information

VLSI Signal Processing

VLSI Signal Processing VLSI Signal Processing Programmable DSP Architectures Chih-Wei Liu VLSI Signal Processing Lab Department of Electronics Engineering National Chiao Tung University Outline DSP Arithmetic Stream Interface

More information