Split-path Fused Floating Point Multiply Accumulate (FPMAC)


Suresh Srinivasan, Ketan Bhudiya, Rajaraman Ramanarayanan, P Sahit Babu, Tiju Jacob, Sanu K. Mathew*, Ram Krishnamurthy*, Vasantha Erraguntla
Intel India Pvt. Ltd, Bangalore, India
*Circuits Research Lab, Intel Corporation, Hillsboro, Oregon
suresh.srinivasan@intel.com, ketan.m.budhiya@intel.com, rajaraman.ramanarayanan@intel.com, sahit30@gmail.com, tiju.jacob@intel.com, sanu.k.mathew@intel.com, ram.krishnamurthy@intel.com, vasantha.erraguntla@intel.com

Abstract

The floating point multiply-accumulate (FPMAC) unit is the backbone of modern processors and is a key circuit determining the frequency, power and area of microprocessors. The FPMAC unit is used extensively in contemporary client microprocessors, further proliferated by ISA support for instructions like AVX and SSE, and is also used extensively in server processors employed for engineering and scientific applications. Consequently, the design of the FPMAC is of vital consideration, since it dominates the power and performance tradeoff decisions in such systems. In this work we demonstrate a novel FPMAC design that performs only the essential computations in the critical path, making it the fastest FPMAC design in the literature to date. The design is based on the premise of isolating and optimizing the critical-path computation in the FPMAC operation.
In this work we make three key innovations to create a novel double-precision FPMAC with the fewest gate stages reported in the timing-critical path: a) splitting near and far paths based on the exponent difference (d = Exy - Ez in {-2, -1, 0, 1} is the near path and the rest is the far path); b) early injection of the accumulate add for the near path into the Wallace tree, eliminating a 3:2 compressor from the near-path critical logic by exploiting the small alignment shifts in the near path and the sparse Wallace tree for 53-bit mantissa multiplication; c) combined round and accumulate add, eliminating the completion adder from the multiplier and giving both timing and power benefits. By the premise of splitting, our design consumes less power for each operation, since only the logic required for each case is switching. Splitting the paths also provides tremendous opportunities for clock or power gating the unused portion (nearly 15-20%) of the logic gates purely based on the exponent difference signals. We also demonstrate support for all rounding modes to adhere to the IEEE standard for double-precision FPMAC, which is critical for the employment of this design in contemporary processor families. The demonstrated design outperforms the best known silicon implementation, IBM Power6 [6], by 14% in timing while having similar area and giving additional power benefits due to split handling. The design is also compared to the best known timing design from Lang et al. [5], which it outperforms by 7% while being 30% smaller in area.

Keywords: Double precision floating point multiply-accumulate, Normalization, Wallace tree, IEEE Rounding

I. INTRODUCTION

The floating point multiply-accumulate (FPMAC) unit is the backbone of modern processors and is a key circuit determining the frequency, power and area of microprocessors. The FPMAC operation is commonly employed in algorithms like FFT and convolution, and is consequently a cardinal piece of most signal processing and physics applications.
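The defining property of a fused multiply-accumulate is that Mx*My+Mz is rounded only once, after the exact product is added to the accumulate. A minimal Python sketch illustrates why this matters, using exact rational arithmetic as a reference model (this is illustrative only, not the hardware algorithm described in this paper):

```python
from fractions import Fraction

def fma_reference(a: float, b: float, c: float) -> float:
    # Exact a*b + c in rational arithmetic, then a single rounding to
    # the nearest double (float(Fraction) rounds correctly to nearest).
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# a*b is exactly 1 - 2**-104; a separate multiply rounds it to 1.0,
# losing the residual that the fused operation preserves.
a = 1.0 + 2.0**-52
b = 1.0 - 2.0**-52
unfused = (a * b) + (-1.0)          # residual lost in the first rounding
fused = fma_reference(a, b, -1.0)   # single rounding keeps the residual
```

Here `unfused` evaluates to 0.0 while `fused` evaluates to -2**-104, which is why fused hardware must carry the full 2m-bit product into the accumulate add.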
The FPMAC unit is used extensively in servers employed for engineering and scientific applications, and therefore its design is of vital consideration in such systems. Moreover, with the increased focus of most microprocessor companies on floating point vector operations like AVX, SSE, etc., any optimization in the area, timing and power of critical components like the FPMAC is of tremendous value to microprocessors for which frequency and power are key metrics. Also, the use of FPMAC units in AVX engines increases not only logic complexity but also interconnect complexity. With interconnect delay and power increasing with every process generation, their dominance over the critical timing path and total power makes the fused multiply-add architecture not only a design with difficult timing goals, but also one of the heaviest on power consumption. Various FPMAC designs have evolved over many years [4, 5, 6, 7]; however, the timing and power of the FPMAC still remain key issues to be tackled optimally. In this work we provide an FPMAC design that is focused on optimizing the timing of the FPMAC critical path beyond the fastest known silicon implementation and is also gating-friendly by design, potentially allowing 15-20% of the gates to be turned off during each operation. The concept is very much a divide-and-conquer approach, where we split the problem into timing-critical and non-critical portions and tackle the critical portion optimally. Such a split comes at no additional cost due to the nature of the FPMAC pipeline and therefore results only in gains with respect to both timing and power. The FPMAC operation comprises a floating point multiply followed by an add operation (Mx*My+Mz). During the addition of the product with the accumulate, while dealing with signed numbers, there are cases in which the two numbers are so close that their difference leads to a big normalization shift.
We call these the near path operations, while cases with a bigger difference between the product and the accumulate are called the far path. The near path operation is the critical path of the design, and therefore isolating this case based on the exponent difference in the early stages helps eliminate any unwanted logic operations for it. In this work we make three key innovations to create a novel double-precision FPMAC with the fewest gate stages:

a) Splitting near and far paths based on the exponent difference (d = Exy - Ez in {-2, -1, 0, 1} is the near path and the rest is the far path), b) Early injection of the accumulate add for the near path into the Wallace tree, eliminating a 3:2 compressor from the near-path critical logic, possible due to the small alignment shifts in the near path, c) Combined round-add, eliminating the completion adder from the multiplier and giving both timing and power benefits. Splitting the paths also provides tremendous opportunities for gating the unused portion (15-20%) of the logic gates purely based on the exponent difference signals. We also demonstrate support for all rounding modes to adhere to the IEEE standard for double-precision FPMAC, which is critical for the employment of this design in contemporary processor families. The ideas are demonstrated in visuals interchangeably for single- or double-precision blocks, based on simplicity of demonstration; however, the comparison of timing, power and area is demonstrated only for the more complex double-precision FPMAC. Section 2 discusses prior research and how we differ from existing work. Section 3 describes the FPMAC proposed in this work. Section 4 provides the area and timing estimates of this design and compares them with the best known silicon implementation and the best known (theoretically fastest) timing FPMAC.

II. RELATED WORK

Floating point arithmetic in the microprocessor industry is defined to adhere to the IEEE floating point standard [3]. Floating point multiplication has been explored in the past for optimal implementation in [8, 9]. The design in [8] proposed a Wallace tree based implementation followed by an optimal rounding scheme that removes the completion adder at the end of multiplication and performs it together with the rounding. The dual adder in completion can help remove not only the completion adder but also the post-normalization shifter, using the ideas presented in [9, 16, 17].
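The path split in a) depends only on the exponent fields, so it can be decided very early. A small sketch of the classification (a hypothetical helper for illustration; exponents are the biased IEEE fields and Exy = Ex + Ey - bias, as defined later in Section III):

```python
def classify_path(e_x: int, e_y: int, e_z: int, bias: int = 1023) -> tuple[str, int]:
    """Classify an FPMAC operation as near- or far-path.

    e_x, e_y, e_z are biased exponents.  The product exponent is
    Exy = Ex + Ey - bias, and d = Exy - Ez.  Differences in
    {-2, -1, 0, 1} take the near path (possible massive cancellation,
    large normalization left shift); everything else takes the far
    path (large alignment shift, small normalization shift).
    """
    d = (e_x + e_y - bias) - e_z
    return ("near" if -2 <= d <= 1 else "far"), d

# 1.5 * 1.5 - 2.25 cancels completely: biased exponents 1023, 1023, 1024
# give d = -1, so this operation is routed down the near path.
```

Since only one path's logic is needed per operation, this single early signal is also what drives the clock/power gating described above.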
Addition of floating point numbers is slightly more complicated, since it requires alignment of the numbers before performing the FP addition. Apart from this alignment shift, signed addition may also lead to non-normalized results, requiring large normalization shifts at the end of the add operation. Various optimizations have been used to arrive at timing-optimal adders, including splitting the addition operation into near and far paths based on the alignment shifts required [10, 11, 15]. The FPMAC design in this work draws on ideas from [9, 11]: we combine the splitting approach of adders with the combined round-add of multipliers, employing both optimally in the proposed timing-optimal FPMAC. Various FMAs have been proposed in [4, 5, 6, 7]; among the recent ones are the FPMAC designs implemented in Power6 [6, 7] and the timing-optimized FPMAC of Lang [5]. The Power6 FPMAC is the fastest known silicon implementation, while Lang's work presents the fastest theoretical FPMAC implementation. In this work we use the Power6 FMA and the Lang FPMAC designs for timing, area and power comparison.

III. PROPOSED FPMAC DESIGN

Figure 1 shows the FPMAC design proposed in this work. The FPMAC operation involves mantissa multiplication of two input operands (Mx, My), followed by the accumulation or addition of the third operand Mz. All the operands are represented as standard IEEE floating point normalized numbers (S, F, E), where S is the sign (1 bit), F is the fraction (1.F is the m-bit normalized mantissa M) and E is the biased exponent (the actual exponent e plus a bias, to make the representation of E positive). Multiplication of the two operands involves mantissa multiplication (Mx x My), performed using a carry-save reduction compressor tree, and the output exponent of the product is Exy = Ex + Ey - bias.
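The biased-exponent arithmetic above can be checked with a small reference sketch (illustrative only; it reads the 11-bit exponent field of an IEEE 754 binary64 value directly):

```python
import struct

def biased_exponent(x: float) -> int:
    # Extract the 11-bit biased exponent field E from a binary64 value.
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    return (bits >> 52) & 0x7FF

# Product exponent per the text: Exy = Ex + Ey - bias (bias = 1023).
x, y = 1.5, 2.0
e_xy = biased_exponent(x) + biased_exponent(y) - 1023
# When the product mantissa does not overflow out of [1, 2), e_xy
# matches the exponent field of the rounded product x * y.
```

For x = 1.5 and y = 2.0 the formula gives 1024, matching the exponent field of 3.0.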
The accumulation involves alignment of the accumulate mantissa, Mz, and the multiply result, Mxy, by shifting one of them by the exponent difference, d = Exy - Ez. To improve performance, the exponent difference computation and the alignment shift of Mz are performed in parallel with the mantissa multiplication. The operations involved in the various stages of the FPMAC pipeline differ quite significantly based on the exponent difference d.

- Cases with d > 1 or d < -2 (far path) involve a big right or left alignment shift (done in parallel with the mantissa multiplication), followed by a 3:2 compression to reduce the aligned accumulate, Mz, with the carry (C) and sum (S) terms of the final result as output. In the completion addition that adds the (C, S) result from the previous stage, only the MSB m-bit sum is required, while the remaining bits are used for computing the carry (C), guard (G), round (R) and sticky (T) bits. This is followed by a normalization right shift of at worst m+3 (when d = -(m+3)). The rounding unit uses the C, G, R and T bits to compute the rounded result.

- Cases with d in {-2, -1, 0, 1} (near path) involve a smaller alignment shift, followed by a 3:2 compression similar to the earlier case. However, these cases may generate a large number of leading 0s or 1s, depending on whether the result is positive or negative, which requires a worst-case 2m-bit normalization left shift. This also necessitates computing the whole 2m-bit sum, with a parallel Leading Zero Anticipator (LZA) to improve performance. The normalized result is then used for rounding.

The near path clearly forms the critical path and dominates the hardware requirements, due to the presence of the 2m-bit sum and the 2m-bit normalize unit along with the LZA. The proposed FPMAC implementation is based on split handling of the near and far paths, which uses optimal hardware and

logic stages for each of the cases, performing the bare minimum of operations required, particularly in the near path. The delay and area optimizations in the near path include:
1) Early injection of the near-path accumulate operand Mz into the multiplication CSA tree, removing the 3:2 compression stage from the critical path and giving a 3% reduction in logic levels (LL) compared to [5].
2) The accumulation is performed after the normalization shift for both the near and far paths, combined with the rounding unit, eliminating the accumulate adder from the critical path (a 2m-bit adder area saving).
3) To further reduce the near-path delay, the normalization shifting for the near path is performed in parallel with the LZA on the (C, S) outputs of the CSA tree, which masks the shifting delay with the LZA computation. (Items 2 and 3 together give a 13% reduction in logic levels.)
4) Sign detection of the result for conditional 2's complementing is performed using the existing LZA components. This completely eliminates the sum computation and the sign detection unit from the critical path, along with the hardware associated with them. This is a key differentiation and optimization that enables both high performance and low power compared to other existing works [4, 5, 6].

The far path is non-critical and is therefore designed with the minimum required operations. Apart from performing the minimal number of operations, the split-path handling also provides significant power benefits due to the ease of clock/power gating the near or the far path as required.

Figure 1: Proposed FPMAC design

Each of the above innovations is described in detail in the sections below, walking through each stage of the operation. The implementation details are discussed in terms of a single-precision FPMAC unit for ease of representation; however, without loss of generality, all the techniques proposed hold for double-precision FPMAC as well.
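The 3:2 carry-save compression that the CSA tree is built from, and into which the near-path Mz is injected, can be sketched as a small reference model (illustrative only; the hardware operates on 106-bit partial products):

```python
def csa_3to2(a: int, b: int, c: int) -> tuple[int, int]:
    # 3:2 compressor: reduces three addends to a (sum, carry) pair with
    # sum + carry == a + b + c, in constant depth (no carry ripple).
    s = a ^ b ^ c                                # bitwise sum
    carry = ((a & b) | (b & c) | (a & c)) << 1   # majority bit, weighted up
    return s, carry

# Injecting an extra operand (e.g. the aligned near-path Mz) costs one
# 3:2 level somewhere; a sparse tree can absorb it without lengthening
# the critical path, which is the point of optimization 1) above.
s, c = csa_3to2(0b1011, 0b0110, 0b1101)
assert s + c == 0b1011 + 0b0110 + 0b1101
```

Because every level of the tree has this invariant, adding Mz as a fourth input at a stage with a free slot preserves correctness while removing the dedicated 3:2 compressor after the multiplier.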
The results and comparison will also be demonstrated only for double-precision FPMAC. The completion adder is a combined round-and-add block very similar to [5]. The dual adder operates on the MSB m-1 bits, computing sum and sum+1, one of which is selected based on the carry, sticky and round bits. The carry, sticky and round bits are computed on the LSB m+1 bits and are finally used to decide whether to select sum or sum+1 from the completion adder. The only additional change with respect to [5] in this design is the addition of the 2's complement 1 in the LSB in this dual adder stage, which is in the non-critical section of the addition of the LSBs. The key premise of a multiplier performing combined round and add is that the resulting sum of C, S should be at worst of the form 11.XXX and not reach 111.XXX, in which case the round, guard and sticky bits looked at become

incorrect. We ensure this by pre-shifting the multiplication result one bit to the right whenever an overflow is detected in either the near or the far path, along with adding 1 to the final exponent to compensate for the shift.

3.1. Exponent Compute Data-path

The exponent difference d = Ex + Ey - bias - Ez is computed as demonstrated in Figure 2 using a 3:2 compressor. The addition of the bias (-127/-1023 for single/double precision) is accomplished by appending the required two 1s to the LSB of the carry and the MSB of the sum. Separate right and left shifters are used for the large far-path shift values (d < -2 and d > 1), while the small 1- or 2-bit near-path shifts are performed using different shifters. This allows the near and far paths to be handled separately, enabling early availability of the shifted near-path accumulate value, which is inserted into the multiplication CSA tree. The near-path flag is determined by detecting whether the first m-1 significant bits of the exponent difference are all 0s or all 1s, using the zero-detect module shown in Figure 2 for a single-precision 8-bit exponent example.

Figure 2: Exponent compute block supporting early detection for near path

3.2. Early Injection of Near Path Accumulate

The near-path accumulate mantissa, Mz, requires only a small shift to be aligned with the multiplication result, Mxy. The early availability of the aligned mantissa provides an opportunity to compress the near-path mantissa along with the multiplication CSA tree. Figure 3 demonstrates the insertion of the near-path accumulate into the 2nd stage of the CSA tree. Note that for a negative number we inject the complement of Mz along with the 2's complement 1 in the same stage of the sparse tree. The sparse nature of both the double- and single-precision multiplication trees enables the near-path (C, S) results from the CSA tree to be computed without any additional delay penalty in the critical path. Early computation of the near-path (C, S), including the compression of the accumulate mantissa, avoids an additional 3:2 compression stage and enables immediate processing, saving one 3:2 compression stage in the critical near path. This near-path injection also helps split the data-paths for the near and far paths immediately at the end of the multiplication CSA tree. Such a split comes at no additional hardware penalty and allows the optimal hardware to be used for the rest of the pipeline. It also significantly reduces the critical path of the near path, which is the critical path of the FPMAC design, by removing the undue penalties imposed by unified handling of the different cases.

Figure 3: Early Injection of Near Path Accumulate

3.3. Split Handling of Normalization Shift

The completion addition is performed after normalization shifting of the (C, S) terms, combined with the rounding. Normalizing before the completion addition means only the required m-bit sum is computed, making the design performance- and hardware-optimized. The near and far paths are normalized separately, as demonstrated in Figure 1. In the near path, an effective subtraction may at worst lead to 2m leading zeroes (or ones), when the m-bit accumulate value equals the 2m-bit multiplication result. To determine the left-shift amount in such cases we employ a Leading Zero Anticipator (LZA). As demonstrated in Figure 4, the LZA generates the string representing the number of leading zeroes and leading ones (Figure 4(a)). This string is binary-encoded using the Leading Digit Counter (LDC) (Figure 4(b)) and is also used to derive the sign of the output result (Figure 4(c)). The shift amount generated by the LZA is used by the normalization shifter to left-shift the C, S terms and obtain a normalized result.
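The work done by the sign detection, LZA and normalization shifter can be summarized in a behavioral sketch (a reference model only: it computes the full sum first, whereas the hardware anticipates the leading-digit count and sign from the (C, S) pair directly, within one bit position, precisely to avoid this full addition):

```python
def normalize_near_path(c: int, s: int, width: int = 106) -> tuple[int, int, bool]:
    """Behavioral model of near-path sign detection + normalization.

    c, s: carry/sum terms from the CSA tree, as two's-complement
    integers in `width` bits.  Returns (normalized magnitude, left
    shift amount, negative flag).
    """
    mask = (1 << width) - 1
    total = (c + s) & mask
    negative = bool(total >> (width - 1))   # sign bit of the result
    if negative:                            # conditional 2's complement
        total = (-total) & mask
    if total == 0:
        return 0, 0, False
    shift = width - total.bit_length()      # leading-zero count
    return (total << shift) & mask, shift, negative
```

In the proposed design this `shift` comes from the LZA/LDC and `negative` from the sign-detection logic of Figure 4(c), so the expensive `c + s` addition never appears on the critical path.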
There are cases where the multiplication result may overflow when added, which requires detection of the first bit in the carry and sum. Cases with 11.XXX in the sum need a 1-bit right pre-shift before the normalization shift. This is primarily to ensure that the combined round-and-add stage never deals with a 111.XXX result, in which case it would not be looking at the correct L, G, R, T bits. The skewed arrival times of the binary-encoded shift amount, from LSB to MSB, are exploited to mask the normalization shift delay by performing the shifts as those bits arrive, in that order.

Figure 4(a)(b)(c): Leading Zero/One Anticipation, Leading Digit Counter (binary encoding) and Sign Detection Logic

Unlike prior implementations, performing the completion addition along with rounding and sign detection using the LZA completely eliminates the 2m-bit summation and sign-detection unit from our proposed design, which gives significant savings in the hardware requirements. The other parallel path in the normalization unit deals with the far-path cases, where d > 1 or d < -2. The 3:2 compression of the aligned far-path mantissa is performed on Mz and the (C, S) terms from the CSA tree. As in the near path, the summation is postponed to the last stage of combined rounding and addition, where it is handled for both the near and far paths by a single unit. The far-path normalization shifter requires a worst-case right shift of the C, S terms by m+3 bits, corresponding to the case when the accumulate was shifted left for alignment, i.e. d = -(m+3). The bits shifted out of the m-bit range are used to compute the carry, sticky and guard bits used during rounding in the combined sum-and-rounding unit. The normalized (C, S) or (~C, ~S) from either the near path or the far path is passed to the combined sum/rounding unit based on the sign of the result determined earlier (Figure 1). The 1s required to complete the 2's complementing of (C, S) are handled in the combined add/round unit as in existing implementations; these 1s are added to the MSB in the final stage of the combined round-add. Split handling of the near and far paths ensures that only the bare minimal operations are performed on the critical path and is therefore a performance-optimal solution. Another big advantage of such split handling is the ease of clock/power gating half the units based on the near-path or far-path flag.
This enables turning off all the power-consuming normalization shifters and logic blocks, keeping only the required blocks of computation switching, and thereby giving a power-optimal design. The total logic levels in terms of basic gates are estimated to be 15% less than the

best silicon implementation of the FPMAC presented in [6, 7], with significantly reduced hardware complexity.

IV. LOGIC/AREA ANALYSIS AND COMPARISON

The FPMAC presented in this paper is the fastest existing implementation for performing single multiply-accumulates, and the area of the proposed design is very comparable to existing implementations. To compare and highlight the merits of the FPMAC presented in this work, we compare it to the best known silicon implementation, the IBM Power6 architecture [6, 7]. Figure 5 shows two FPMAC architectures proposed in recent literature: the best known silicon implementation, IBM Power6, and the fastest logic-level FPMAC, from Lang et al. [5].

4.1 Logic Level Comparison

A detailed comparison of the logic levels is presented in Table 1, based on our best understanding of the stages in [5] and [6]. The interconnect penalty is not precisely captured in the logic level computation of any of the blocks except the LZA and the shifters, where additional gate stage delays are incurred based on gate level simulations. Precise timing computation is currently underway in a gate level implementation; this work is primarily targeted towards theoretical analysis and comparison closely matching circuit simulations. As demonstrated, the proposed design is faster than the best known timing design [5] by 5.2%, and faster than the best known silicon implementation by 14.2%. This significant speedup is primarily attributed to the complete isolation of the critical path, removing the unnecessary burden that a combined implementation has to pay. For simplicity, the normalization logic gate levels are estimated assuming no overlap between the LZA and the shift logic, in both [5] and the proposed scheme.
However, as proposed in [5], the LZA and the normalization shifter may be overlapped due to the inherent nature of the LZA, which provides the leading digit count from MSB to LSB with sufficient delay in between. A similar optimization applied in our implementation can completely eliminate the normalization shift logic delays from both [5] and our proposed design. A complete silicon implementation of the design is currently ongoing; based on preliminary gate level simulations we observe a saving of ~11% in the timing of our design over Lang [5]. The detailed implementation and delay estimation from gate level simulations include all the finer delays, down to detailed interconnect and routing delays due to the addition of the large interconnect-dominated shifters in our proposed design. Further optimizations in the gate level simulations are also currently underway.

(a) IBM Power6 FPMAC implementation [6] (b) Lang FPMAC implementation [5]
Figure 5: Existing FPMAC implementations

4.2 Area Comparison

One of the important requirements in designing a timing-optimal FPMAC is to make sure that its area does not increase significantly. Given that FPMACs are employed in wide-bit SIMD pipelines, an area increase not only impacts the gate count, power, etc. but also critically dictates the floorplan of the processor. Table 2 demonstrates a detailed comparison of the area of the various blocks in our design with [5] and [6]. The area estimates are based purely on a comparison of the logic blocks, as the interconnect area is mostly similar between the designs due to the overall similar bit-width operations. The interconnect area of the shifters, however, is included in our comparison, since a shifter is interconnect-dominated logic; we therefore incur the additional penalty of enlarging the shifters not only in logic but also in interconnect. Note that these area estimates do not include the additional individual gate area savings due to the relaxed timing provided by our timing-optimal design. We have normalized all the area numbers with respect to the area of a 106-bit shifter on 106 bits, which is assigned the unit value 100. Using these normalized estimates, the total areas of the three designs (IBM, Lang and proposed) are shown in Table 2; the normalization helps in comparing the area with other designs. As we can see from Table 2, our design shows an 8% increase in area compared to the Power6 implementation, while it is 12% smaller than the Lang design. The numbers for our implementation, however, as highlighted earlier, do not include the reduction in area of the non-critical components, which could potentially bring our implementation area very close to the Power6 implementation without losing any of the timing benefits. The reduced area and relaxed timing contribute directly to a reduction in the dynamic power consumption of the design.
Furthermore, the split handling of near and far paths provides a natural opportunity to save idle power in the portions of the logic unused during near- or far-path operations. Just by design, at any instant up to ~20% of the design may be clock/power gated using only the near vs. far path signal.

V. CONCLUSION AND FUTURE WORK

This paper demonstrates a split-path FPMAC design which is 14% faster than the fastest known silicon implementation. The merit of the design is accentuated by the timing gains coming at no additional area cost. The split-path design provides natural gating opportunities and even in the normal case may lead to 15-20% fewer switching gates, based on the near- or far-path operation. We intend to implement this design on silicon to demonstrate its merits and feasibility, and this work is underway. Gate-level timing simulations on individual blocks have adhered to the estimates computed in this work; the final stitching of the blocks and synthesis is ongoing work. This innovation can significantly help microprocessor designs with fast timing, area and power convergence.

Table 1. Logic level comparison of the proposed implementation with IBM Power6 [6, 7] and Lang [5] (logic levels in parentheses):

Block                 | Power6 FPMAC [6, 7]                             | Lang [5]                                  | Proposed                                       | Comments
Multiplier            | 53-bit multiplier (33)                          | 53-bit multiplier (33)                    | 53-bit multiplier with injected near path (33) | Near path injection into sparse compressor tree comes free of cost
Accumulate adder      | 3:2 compressor (4), 106-bit accumulate add (19) | 3:2 compressor (4), accumulate add (0)    | removed (0)                                    | Removed in proposed design
Normalization shifter | LZA (0), 106-bit left shift (8)                 | LZA (9), 161-bit shifter (8)              | LZA (9), 106-bit left shift (8)                | Critical path LZA (XOR assumed 2 LL, 3+log(128)*2+2); 3 stages per log shift (log(128)*3)
Rounding unit         | 54-bit round add+GRT on lower 54 bits (18)      | dual add (20)                             | 52-bit dual add+GRTC on lower 56 bits (20)     | Proposed: added final mux delay in dual add, included the carry from the lower 53-bit stage
Post rounding norm    | 1/2-bit right shift (2)                         | 1/2-bit right shift (2)                   | 1/2-bit right shift (2)                        |
Total Logic Levels    | 84                                              | 76                                        | 72                                             |

Table 2: Area comparison of the proposed approach with the best known silicon implementation (numbers in parentheses are indicative of relative area, normalized to a 106-bit shifter on 106 bits = 100):

Block                        | Power6 FPMAC [6, 7]                               | Lang [5]                                                                              | Proposed
Multiplier (area)            | 53-bit multiplier (400)                           | 53-bit multiplier (400)                                                               | 53-bit multiplier with near-path Mz inserted (400)
Alignment shift (area)       | 56-bit LS on 53 bits, 106-bit RS on 53 bits (100) | 161-bit RS on 53 bits (100)                                                           | 56-bit LS on 53 bits, 106-bit RS on 53 bits (100)
Accumulate adder (area)      | 106-bit add and 3:2 compress (87)                 | 106-bit 3:2 compress                                                                  | 106-bit 3:2 compress
Normalization shifter (area) | 106-bit LZA (113), 106-bit left shifter (100)     | 106-bit LZA + sign detect + HA (251), 106-bit left shift S and C, 53-bit right shift S and C (200) | 106-bit LZA + sign detect (169), 106-bit left shift S and C, 53-bit right shift S and C (200)
Rounding unit (area)         | 53-bit add (63)                                   | 53-bit dual add (70)                                                                  | 53-bit dual add (70)

ACKNOWLEDGMENT

We would like to acknowledge the following for their encouragement and support of this work: Nitin Borkar, Sandip Pandey and Srinivas Lingam.

REFERENCES
[1] Intel's AVX page
[2] AMD SSE5
[3] IEEE 754: Standard for Binary Floating-Point Arithmetic
[4] C. Chen, L.-A. Chen, and J.-R. Cheng, "Architectural Design of a Fast Floating-Point Multiplication-Add Fused Unit Using Signed-Digit Addition," Proceedings Euromicro Symposium on Digital System Design (DSD 2001).
[5] T. Lang and J. D. Bruguera, "Floating-Point Multiply-Add-Fused with Reduced Latency," 17th IEEE Symposium on Computer Arithmetic (ARITH).
[6] B. McCredie et al., "Design of the Power6 Microprocessor," IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers.
[7] S. Dao Trong et al., "P6 Binary Floating-Point Unit," 18th IEEE Symposium on Computer Arithmetic (ARITH).
[8] R. K. Yu, "167 MHz radix-4 floating point multiplier," Proceedings of the 12th Symposium on Computer Arithmetic.
[9] W. Belluomini, H. Ngo, C. McDowell, J. Sawada, T. Nguyen, B. Veraa, J. Wagoner, and M. Lee, "A double precision floating point multiply," IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers.
[10] P.-M. Seidel and G. Even, "Delay-Optimized Implementation of IEEE Floating-Point Addition," IEEE Transactions on Computers, vol. 53, no. 2, February 2004.
[11] A. Beaumont-Smith, N. Burgess, S. Lefrere, and C. C. Lim, "Reduced Latency IEEE Floating-Point Standard Adder Architectures," Proceedings of the 14th IEEE Symposium on Computer Arithmetic, p. 35, April 14-16, 1999.
[12] K. T. Lee and K. J. Nowka, "1 GHz Leading Zero Anticipator Using Independent Sign-Bit Determination Logic," Symp. VLSI Circuits Dig. Tech. Papers, 2000.
[13] E. Antelo, M. Boo, J. D. Bruguera, and E. L. Zapata, "A Novel Design for a Two Operand Normalization Circuit," IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 6, no. 1.
[14] N. Burgess, "The Flagged Prefix Adder for Dual Additions," Proc. SPIE ASPAII7.
[15] R. V. K. Pillai, "A Low Power Approach to Floating Point Adder Design for DSP Applications," VLSI Signal Processing, 2001.
[16] P.-M. Seidel, "On-Line IEEE Floating-Point Multiplication and Division for Reduced Power Dissipation," Proceedings of the 38th Asilomar Conference on Signals, Systems and Computers, 2004.
[17] E. C. Quinnell, "Floating-Point Fused Multiply-Add Architectures," Asilomar Conference on Signals, Systems and Computers, Conference Record.


More information

A New Architecture For Multiple-Precision Floating-Point Multiply-Add Fused Unit Design

A New Architecture For Multiple-Precision Floating-Point Multiply-Add Fused Unit Design A New Architecture For Multiple-Precision Floating-Point Multiply-Add Fused Unit Design Libo Huang, Li Shen, Kui Dai, Zhiying Wang School of Computer National University of Defense Technology Changsha,

More information

IEEE-754 compliant Algorithms for Fast Multiplication of Double Precision Floating Point Numbers

IEEE-754 compliant Algorithms for Fast Multiplication of Double Precision Floating Point Numbers International Journal of Research in Computer Science ISSN 2249-8257 Volume 1 Issue 1 (2011) pp. 1-7 White Globe Publications www.ijorcs.org IEEE-754 compliant Algorithms for Fast Multiplication of Double

More information

Redundant Data Formats for Faster Floating Point Addition. Abstract

Redundant Data Formats for Faster Floating Point Addition. Abstract Redundant Data Formats for Faster Floating Point Addition Abstract This paper presents an approach that minimizes the latency of a floating point adder by leaving outputs in a redundant form. As long as

More information

Fused Floating Point Arithmetic Unit for Radix 2 FFT Implementation

Fused Floating Point Arithmetic Unit for Radix 2 FFT Implementation IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 6, Issue 2, Ver. I (Mar. -Apr. 2016), PP 58-65 e-issn: 2319 4200, p-issn No. : 2319 4197 www.iosrjournals.org Fused Floating Point Arithmetic

More information

A Single/Double Precision Floating-Point Reciprocal Unit Design for Multimedia Applications

A Single/Double Precision Floating-Point Reciprocal Unit Design for Multimedia Applications A Single/Double Precision Floating-Point Reciprocal Unit Design for Multimedia Applications Metin Mete Özbilen 1 and Mustafa Gök 2 1 Mersin University, Engineering Faculty, Department of Computer Science,

More information

AN IMPROVED FUSED FLOATING-POINT THREE-TERM ADDER. Mohyiddin K, Nithin Jose, Mitha Raj, Muhamed Jasim TK, Bijith PS, Mohamed Waseem P

AN IMPROVED FUSED FLOATING-POINT THREE-TERM ADDER. Mohyiddin K, Nithin Jose, Mitha Raj, Muhamed Jasim TK, Bijith PS, Mohamed Waseem P AN IMPROVED FUSED FLOATING-POINT THREE-TERM ADDER Mohyiddin K, Nithin Jose, Mitha Raj, Muhamed Jasim TK, Bijith PS, Mohamed Waseem P ABSTRACT A fused floating-point three term adder performs two additions

More information

Implementation of Floating Point Multiplier Using Dadda Algorithm

Implementation of Floating Point Multiplier Using Dadda Algorithm Implementation of Floating Point Multiplier Using Dadda Algorithm Abstract: Floating point multiplication is the most usefull in all the computation application like in Arithematic operation, DSP application.

More information

2 General issues in multi-operand addition

2 General issues in multi-operand addition 2009 19th IEEE International Symposium on Computer Arithmetic Multi-operand Floating-point Addition Alexandre F. Tenca Synopsys, Inc. tenca@synopsys.com Abstract The design of a component to perform parallel

More information

By, Ajinkya Karande Adarsh Yoga

By, Ajinkya Karande Adarsh Yoga By, Ajinkya Karande Adarsh Yoga Introduction Early computer designers believed saving computer time and memory were more important than programmer time. Bug in the divide algorithm used in Intel chips.

More information

Integer Multiplication. Back to Arithmetic. Integer Multiplication. Example (Fig 4.25)

Integer Multiplication. Back to Arithmetic. Integer Multiplication. Example (Fig 4.25) Back to Arithmetic Before, we did Representation of integers Addition/Subtraction Logical ops Forecast Integer Multiplication Integer Division Floating-point Numbers Floating-point Addition/Multiplication

More information

Design and Implementation of a Quadruple Floating-point Fused Multiply-Add Unit

Design and Implementation of a Quadruple Floating-point Fused Multiply-Add Unit Design and Implementation of a Quadruple Floating-point Fused Multiply-Add Unit He Jun Shanghai Hi-Performance IC Design Centre Shanghai, China e-mail: joyhejun@126.com Zhu Ying Shanghai Hi-Performance

More information

THE INTERNATIONAL JOURNAL OF SCIENCE & TECHNOLEDGE

THE INTERNATIONAL JOURNAL OF SCIENCE & TECHNOLEDGE THE INTERNATIONAL JOURNAL OF SCIENCE & TECHNOLEDGE Design and Implementation of Optimized Floating Point Matrix Multiplier Based on FPGA Maruti L. Doddamani IV Semester, M.Tech (Digital Electronics), Department

More information

Efficient Radix-10 Multiplication Using BCD Codes

Efficient Radix-10 Multiplication Using BCD Codes Efficient Radix-10 Multiplication Using BCD Codes P.Ranjith Kumar Reddy M.Tech VLSI, Department of ECE, CMR Institute of Technology. P.Navitha Assistant Professor, Department of ECE, CMR Institute of Technology.

More information

OPTIMIZATION OF AREA COMPLEXITY AND DELAY USING PRE-ENCODED NR4SD MULTIPLIER.

OPTIMIZATION OF AREA COMPLEXITY AND DELAY USING PRE-ENCODED NR4SD MULTIPLIER. OPTIMIZATION OF AREA COMPLEXITY AND DELAY USING PRE-ENCODED NR4SD MULTIPLIER. A.Anusha 1 R.Basavaraju 2 anusha201093@gmail.com 1 basava430@gmail.com 2 1 PG Scholar, VLSI, Bharath Institute of Engineering

More information

I. Introduction. India; 2 Assistant Professor, Department of Electronics & Communication Engineering, SRIT, Jabalpur (M.P.

I. Introduction. India; 2 Assistant Professor, Department of Electronics & Communication Engineering, SRIT, Jabalpur (M.P. A Decimal / Binary Multi-operand Adder using a Fast Binary to Decimal Converter-A Review Ruchi Bhatt, Divyanshu Rao, Ravi Mohan 1 M. Tech Scholar, Department of Electronics & Communication Engineering,

More information

Design and Optimized Implementation of Six-Operand Single- Precision Floating-Point Addition

Design and Optimized Implementation of Six-Operand Single- Precision Floating-Point Addition 2011 International Conference on Advancements in Information Technology With workshop of ICBMG 2011 IPCSIT vol.20 (2011) (2011) IACSIT Press, Singapore Design and Optimized Implementation of Six-Operand

More information

EE260: Logic Design, Spring n Integer multiplication. n Booth s algorithm. n Integer division. n Restoring, non-restoring

EE260: Logic Design, Spring n Integer multiplication. n Booth s algorithm. n Integer division. n Restoring, non-restoring EE 260: Introduction to Digital Design Arithmetic II Yao Zheng Department of Electrical Engineering University of Hawaiʻi at Mānoa Overview n Integer multiplication n Booth s algorithm n Integer division

More information

Delay Optimised 16 Bit Twin Precision Baugh Wooley Multiplier

Delay Optimised 16 Bit Twin Precision Baugh Wooley Multiplier Delay Optimised 16 Bit Twin Precision Baugh Wooley Multiplier Vivek. V. Babu 1, S. Mary Vijaya Lense 2 1 II ME-VLSI DESIGN & The Rajaas Engineering College Vadakkangulam, Tirunelveli 2 Assistant Professor

More information

An Implementation of Double precision Floating point Adder & Subtractor Using Verilog

An Implementation of Double precision Floating point Adder & Subtractor Using Verilog IOSR Journal of Electrical and Electronics Engineering (IOSR-JEEE) e-issn: 2278-1676,p-ISSN: 2320-3331, Volume 9, Issue 4 Ver. III (Jul Aug. 2014), PP 01-05 An Implementation of Double precision Floating

More information

A Library of Parameterized Floating-point Modules and Their Use

A Library of Parameterized Floating-point Modules and Their Use A Library of Parameterized Floating-point Modules and Their Use Pavle Belanović and Miriam Leeser Department of Electrical and Computer Engineering Northeastern University Boston, MA, 02115, USA {pbelanov,mel}@ece.neu.edu

More information

High Speed Multiplication Using BCD Codes For DSP Applications

High Speed Multiplication Using BCD Codes For DSP Applications High Speed Multiplication Using BCD Codes For DSP Applications Balasundaram 1, Dr. R. Vijayabhasker 2 PG Scholar, Dept. Electronics & Communication Engineering, Anna University Regional Centre, Coimbatore,

More information

II. MOTIVATION AND IMPLEMENTATION

II. MOTIVATION AND IMPLEMENTATION An Efficient Design of Modified Booth Recoder for Fused Add-Multiply operator Dhanalakshmi.G Applied Electronics PSN College of Engineering and Technology Tirunelveli dhanamgovind20@gmail.com Prof.V.Gopi

More information

High Throughput Radix-D Multiplication Using BCD

High Throughput Radix-D Multiplication Using BCD High Throughput Radix-D Multiplication Using BCD Y.Raj Kumar PG Scholar, VLSI&ES, Dept of ECE, Vidya Bharathi Institute of Technology, Janagaon, Warangal, Telangana. Dharavath Jagan, M.Tech Associate Professor,

More information

Improved Design of High Performance Radix-10 Multiplication Using BCD Codes

Improved Design of High Performance Radix-10 Multiplication Using BCD Codes International OPEN ACCESS Journal ISSN: 2249-6645 Of Modern Engineering Research (IJMER) Improved Design of High Performance Radix-10 Multiplication Using BCD Codes 1 A. Anusha, 2 C.Ashok Kumar 1 M.Tech

More information

Module 2: Computer Arithmetic

Module 2: Computer Arithmetic Module 2: Computer Arithmetic 1 B O O K : C O M P U T E R O R G A N I Z A T I O N A N D D E S I G N, 3 E D, D A V I D L. P A T T E R S O N A N D J O H N L. H A N N E S S Y, M O R G A N K A U F M A N N

More information

An Efficient Design of Sum-Modified Booth Recoder for Fused Add-Multiply Operator

An Efficient Design of Sum-Modified Booth Recoder for Fused Add-Multiply Operator An Efficient Design of Sum-Modified Booth Recoder for Fused Add-Multiply Operator M.Chitra Evangelin Christina Associate Professor Department of Electronics and Communication Engineering Francis Xavier

More information

International Journal of Engineering and Techniques - Volume 4 Issue 2, April-2018

International Journal of Engineering and Techniques - Volume 4 Issue 2, April-2018 RESEARCH ARTICLE DESIGN AND ANALYSIS OF RADIX-16 BOOTH PARTIAL PRODUCT GENERATOR FOR 64-BIT BINARY MULTIPLIERS K.Deepthi 1, Dr.T.Lalith Kumar 2 OPEN ACCESS 1 PG Scholar,Dept. Of ECE,Annamacharya Institute

More information

Run-Time Reconfigurable multi-precision floating point multiplier design based on pipelining technique using Karatsuba-Urdhva algorithms

Run-Time Reconfigurable multi-precision floating point multiplier design based on pipelining technique using Karatsuba-Urdhva algorithms Run-Time Reconfigurable multi-precision floating point multiplier design based on pipelining technique using Karatsuba-Urdhva algorithms 1 Shruthi K.H., 2 Rekha M.G. 1M.Tech, VLSI design and embedded system,

More information

Design and Implementation of Signed, Rounded and Truncated Multipliers using Modified Booth Algorithm for Dsp Systems.

Design and Implementation of Signed, Rounded and Truncated Multipliers using Modified Booth Algorithm for Dsp Systems. Design and Implementation of Signed, Rounded and Truncated Multipliers using Modified Booth Algorithm for Dsp Systems. K. Ram Prakash 1, A.V.Sanju 2 1 Professor, 2 PG scholar, Department of Electronics

More information

Design and Implementation of IEEE-754 Decimal Floating Point Adder, Subtractor and Multiplier

Design and Implementation of IEEE-754 Decimal Floating Point Adder, Subtractor and Multiplier International Journal of Engineering and Advanced Technology (IJEAT) ISSN: 2249 8958, Volume-4 Issue 1, October 2014 Design and Implementation of IEEE-754 Decimal Floating Point Adder, Subtractor and Multiplier

More information

VLSI Design Of a Novel Pre Encoding Multiplier Using DADDA Multiplier. Guntur(Dt),Pin:522017

VLSI Design Of a Novel Pre Encoding Multiplier Using DADDA Multiplier. Guntur(Dt),Pin:522017 VLSI Design Of a Novel Pre Encoding Multiplier Using DADDA Multiplier 1 Katakam Hemalatha,(M.Tech),Email Id: hema.spark2011@gmail.com 2 Kundurthi Ravi Kumar, M.Tech,Email Id: kundurthi.ravikumar@gmail.com

More information

A Novel Efficient VLSI Architecture for IEEE 754 Floating point multiplier using Modified CSA

A Novel Efficient VLSI Architecture for IEEE 754 Floating point multiplier using Modified CSA RESEARCH ARTICLE OPEN ACCESS A Novel Efficient VLSI Architecture for IEEE 754 Floating point multiplier using Nishi Pandey, Virendra Singh Sagar Institute of Research & Technology Bhopal Abstract Due to

More information

CHW 261: Logic Design

CHW 261: Logic Design CHW 261: Logic Design Instructors: Prof. Hala Zayed Dr. Ahmed Shalaby http://www.bu.edu.eg/staff/halazayed14 http://bu.edu.eg/staff/ahmedshalaby14# Slide 1 Slide 2 Slide 3 Digital Fundamentals CHAPTER

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 3. Arithmetic for Computers Implementation

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 3. Arithmetic for Computers Implementation COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 3 Arithmetic for Computers Implementation Today Review representations (252/352 recap) Floating point Addition: Ripple

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Arithmetic Unit 10122011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Recap Fixed Point Arithmetic Addition/Subtraction

More information

PROJECT REPORT IMPLEMENTATION OF LOGARITHM COMPUTATION DEVICE AS PART OF VLSI TOOLS COURSE

PROJECT REPORT IMPLEMENTATION OF LOGARITHM COMPUTATION DEVICE AS PART OF VLSI TOOLS COURSE PROJECT REPORT ON IMPLEMENTATION OF LOGARITHM COMPUTATION DEVICE AS PART OF VLSI TOOLS COURSE Project Guide Prof Ravindra Jayanti By Mukund UG3 (ECE) 200630022 Introduction The project was implemented

More information

Fused Floating Point Three Term Adder Using Brent-Kung Adder

Fused Floating Point Three Term Adder Using Brent-Kung Adder P P IJISET - International Journal of Innovative Science, Engineering & Technology, Vol. 2 Issue 9, September 205. Fused Floating Point Three Term Adder Using Brent-Kung Adder 2 Ms. Neena Aniee JohnP P

More information

Floating Point. The World is Not Just Integers. Programming languages support numbers with fraction

Floating Point. The World is Not Just Integers. Programming languages support numbers with fraction 1 Floating Point The World is Not Just Integers Programming languages support numbers with fraction Called floating-point numbers Examples: 3.14159265 (π) 2.71828 (e) 0.000000001 or 1.0 10 9 (seconds in

More information

An FPGA based Implementation of Floating-point Multiplier

An FPGA based Implementation of Floating-point Multiplier An FPGA based Implementation of Floating-point Multiplier L. Rajesh, Prashant.V. Joshi and Dr.S.S. Manvi Abstract In this paper we describe the parameterization, implementation and evaluation of floating-point

More information

A High Speed Binary Floating Point Multiplier Using Dadda Algorithm

A High Speed Binary Floating Point Multiplier Using Dadda Algorithm 455 A High Speed Binary Floating Point Multiplier Using Dadda Algorithm B. Jeevan, Asst. Professor, Dept. of E&IE, KITS, Warangal. jeevanbs776@gmail.com S. Narender, M.Tech (VLSI&ES), KITS, Warangal. narender.s446@gmail.com

More information

Number Systems and Computer Arithmetic

Number Systems and Computer Arithmetic Number Systems and Computer Arithmetic Counting to four billion two fingers at a time What do all those bits mean now? bits (011011011100010...01) instruction R-format I-format... integer data number text

More information

Design of a Floating-Point Fused Add-Subtract Unit Using Verilog

Design of a Floating-Point Fused Add-Subtract Unit Using Verilog International Journal of Electronics and Computer Science Engineering 1007 Available Online at www.ijecse.org ISSN- 2277-1956 Design of a Floating-Point Fused Add-Subtract Unit Using Verilog Mayank Sharma,

More information

Binary Floating Point Fused Multiply Add Unit

Binary Floating Point Fused Multiply Add Unit Binary Floating Point Fused Multiply Add Unit by Eng. Walaa Abd El Aziz Ibrahim A Thesis Submitted to the Faculty of engineering at Cairo University in partial Fulfillment of the Requirement for the Degree

More information

International Journal of Research in Computer and Communication Technology, Vol 4, Issue 11, November- 2015

International Journal of Research in Computer and Communication Technology, Vol 4, Issue 11, November- 2015 Design of Dadda Algorithm based Floating Point Multiplier A. Bhanu Swetha. PG.Scholar: M.Tech(VLSISD), Department of ECE, BVCITS, Batlapalem. E.mail:swetha.appari@gmail.com V.Ramoji, Asst.Professor, Department

More information

Systolic Super Summation with Reduced Hardware

Systolic Super Summation with Reduced Hardware Systolic Super Summation with Reduced Hardware Willard L. Miranker Mathematical Sciences Department IBM T.J. Watson Research Center Route 134 & Kitichwan Road Yorktown Heights, NY 10598 Abstract A principal

More information

32-bit Signed and Unsigned Advanced Modified Booth Multiplication using Radix-4 Encoding Algorithm

32-bit Signed and Unsigned Advanced Modified Booth Multiplication using Radix-4 Encoding Algorithm 2016 IJSRSET Volume 2 Issue 3 Print ISSN : 2395-1990 Online ISSN : 2394-4099 Themed Section: Engineering and Technology 32-bit Signed and Unsigned Advanced Modified Booth Multiplication using Radix-4 Encoding

More information

EE878 Special Topics in VLSI. Computer Arithmetic for Digital Signal Processing

EE878 Special Topics in VLSI. Computer Arithmetic for Digital Signal Processing EE878 Special Topics in VLSI Computer Arithmetic for Digital Signal Processing Part 6b High-Speed Multiplication - II Spring 2017 Koren Part.6b.1 Accumulating the Partial Products After generating partial

More information

Digital Fundamentals

Digital Fundamentals Digital Fundamentals Tenth Edition Floyd Chapter 2 2009 Pearson Education, Upper 2008 Pearson Saddle River, Education NJ 07458. All Rights Reserved Decimal Numbers The position of each digit in a weighted

More information

CS6303 COMPUTER ARCHITECTURE LESSION NOTES UNIT II ARITHMETIC OPERATIONS ALU In computing an arithmetic logic unit (ALU) is a digital circuit that performs arithmetic and logical operations. The ALU is

More information

HIGH SPEED SINGLE PRECISION FLOATING POINT UNIT IMPLEMENTATION USING VERILOG

HIGH SPEED SINGLE PRECISION FLOATING POINT UNIT IMPLEMENTATION USING VERILOG HIGH SPEED SINGLE PRECISION FLOATING POINT UNIT IMPLEMENTATION USING VERILOG 1 C.RAMI REDDY, 2 O.HOMA KESAV, 3 A.MAHESWARA REDDY 1 PG Scholar, Dept of ECE, AITS, Kadapa, AP-INDIA. 2 Asst Prof, Dept of

More information

An FPGA Based Floating Point Arithmetic Unit Using Verilog

An FPGA Based Floating Point Arithmetic Unit Using Verilog An FPGA Based Floating Point Arithmetic Unit Using Verilog T. Ramesh 1 G. Koteshwar Rao 2 1PG Scholar, Vaagdevi College of Engineering, Telangana. 2Assistant Professor, Vaagdevi College of Engineering,

More information

A Review on Optimizing Efficiency of Fixed Point Multiplication using Modified Booth s Algorithm

A Review on Optimizing Efficiency of Fixed Point Multiplication using Modified Booth s Algorithm A Review on Optimizing Efficiency of Fixed Point Multiplication using Modified Booth s Algorithm Mahendra R. Bhongade, Manas M. Ramteke, Vijay G. Roy Author Details Mahendra R. Bhongade, Department of

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK DESIGN AND VERIFICATION OF FAST 32 BIT BINARY FLOATING POINT MULTIPLIER BY INCREASING

More information

Figurel. TEEE-754 double precision floating point format. Keywords- Double precision, Floating point, Multiplier,FPGA,IEEE-754.

Figurel. TEEE-754 double precision floating point format. Keywords- Double precision, Floating point, Multiplier,FPGA,IEEE-754. AN FPGA BASED HIGH SPEED DOUBLE PRECISION FLOATING POINT MULTIPLIER USING VERILOG N.GIRIPRASAD (1), K.MADHAVA RAO (2) VLSI System Design,Tudi Ramireddy Institute of Technology & Sciences (1) Asst.Prof.,

More information

ARCHITECTURAL DESIGN OF 8 BIT FLOATING POINT MULTIPLICATION UNIT

ARCHITECTURAL DESIGN OF 8 BIT FLOATING POINT MULTIPLICATION UNIT ARCHITECTURAL DESIGN OF 8 BIT FLOATING POINT MULTIPLICATION UNIT Usha S. 1 and Vijaya Kumar V. 2 1 VLSI Design, Sathyabama University, Chennai, India 2 Department of Electronics and Communication Engineering,

More information

Implementation of Double Precision Floating Point Multiplier in VHDL

Implementation of Double Precision Floating Point Multiplier in VHDL ISSN (O): 2349-7084 International Journal of Computer Engineering In Research Trends Available online at: www.ijcert.org Implementation of Double Precision Floating Point Multiplier in VHDL 1 SUNKARA YAMUNA

More information

Double Precision IEEE-754 Floating-Point Adder Design Based on FPGA

Double Precision IEEE-754 Floating-Point Adder Design Based on FPGA Double Precision IEEE-754 Floating-Point Adder Design Based on FPGA Adarsha KM 1, Ashwini SS 2, Dr. MZ Kurian 3 PG Student [VLS& ES], Dept. of ECE, Sri Siddhartha Institute of Technology, Tumkur, Karnataka,

More information

EC2303-COMPUTER ARCHITECTURE AND ORGANIZATION

EC2303-COMPUTER ARCHITECTURE AND ORGANIZATION EC2303-COMPUTER ARCHITECTURE AND ORGANIZATION QUESTION BANK UNIT-II 1. What are the disadvantages in using a ripple carry adder? (NOV/DEC 2006) The main disadvantage using ripple carry adder is time delay.

More information

AN EFFICIENT FLOATING-POINT MULTIPLIER DESIGN USING COMBINED BOOTH AND DADDA ALGORITHMS

AN EFFICIENT FLOATING-POINT MULTIPLIER DESIGN USING COMBINED BOOTH AND DADDA ALGORITHMS AN EFFICIENT FLOATING-POINT MULTIPLIER DESIGN USING COMBINED BOOTH AND DADDA ALGORITHMS 1 DHANABAL R, BHARATHI V, 3 NAAMATHEERTHAM R SAMHITHA, 4 PAVITHRA S, 5 PRATHIBA S, 6 JISHIA EUGINE 1 Asst Prof. (Senior

More information

Floating Point Square Root under HUB Format

Floating Point Square Root under HUB Format Floating Point Square Root under HUB Format Julio Villalba-Moreno Dept. of Computer Architecture University of Malaga Malaga, SPAIN jvillalba@uma.es Javier Hormigo Dept. of Computer Architecture University

More information

Assistant Professor, PICT, Pune, Maharashtra, India

Assistant Professor, PICT, Pune, Maharashtra, India Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Low Power High

More information

Architecture and Design of Generic IEEE-754 Based Floating Point Adder, Subtractor and Multiplier

Architecture and Design of Generic IEEE-754 Based Floating Point Adder, Subtractor and Multiplier Architecture and Design of Generic IEEE-754 Based Floating Point Adder, Subtractor and Multiplier Sahdev D. Kanjariya VLSI & Embedded Systems Design Gujarat Technological University PG School Ahmedabad,

More information

An Efficient Implementation of Floating Point Multiplier

An Efficient Implementation of Floating Point Multiplier An Efficient Implementation of Floating Point Multiplier Mohamed Al-Ashrafy Mentor Graphics Mohamed_Samy@Mentor.com Ashraf Salem Mentor Graphics Ashraf_Salem@Mentor.com Wagdy Anis Communications and Electronics

More information

Double Precision Floating-Point Arithmetic on FPGAs

Double Precision Floating-Point Arithmetic on FPGAs MITSUBISHI ELECTRIC ITE VI-Lab Title: Double Precision Floating-Point Arithmetic on FPGAs Internal Reference: Publication Date: VIL04-D098 Author: S. Paschalakis, P. Lee Rev. A Dec. 2003 Reference: Paschalakis,

More information

An Effective Implementation of Dual Path Fused Floating-Point Add-Subtract Unit for Reconfigurable Architectures

An Effective Implementation of Dual Path Fused Floating-Point Add-Subtract Unit for Reconfigurable Architectures Received: December 20, 2016 40 An Effective Implementation of Path Fused Floating-Point Add-Subtract Unit for Reconfigurable Architectures Anitha Arumalla 1 * Madhavi Latha Makkena 2 1 Velagapudi Ramakrishna

More information

THE DESIGN OF AN IC HALF PRECISION FLOATING POINT ARITHMETIC LOGIC UNIT

THE DESIGN OF AN IC HALF PRECISION FLOATING POINT ARITHMETIC LOGIC UNIT Clemson University TigerPrints All Theses Theses 12-2009 THE DESIGN OF AN IC HALF PRECISION FLOATING POINT ARITHMETIC LOGIC UNIT Balaji Kannan Clemson University, balaji.n.kannan@gmail.com Follow this

More information

Low Power Floating-Point Multiplier Based On Vedic Mathematics

Low Power Floating-Point Multiplier Based On Vedic Mathematics Low Power Floating-Point Multiplier Based On Vedic Mathematics K.Prashant Gokul, M.E(VLSI Design), Sri Ramanujar Engineering College, Chennai Prof.S.Murugeswari., Supervisor,Prof.&Head,ECE.,SREC.,Chennai-600

More information

FPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE Standard

FPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE Standard FPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE 754-2008 Standard M. Shyamsi, M. I. Ibrahimy, S. M. A. Motakabber and M. R. Ahsan Dept. of Electrical and Computer Engineering

More information

OPTIMIZING THE POWER USING FUSED ADD MULTIPLIER

OPTIMIZING THE POWER USING FUSED ADD MULTIPLIER Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,

More information

An Efficient Fused Add Multiplier With MWT Multiplier And Spanning Tree Adder

An Efficient Fused Add Multiplier With MWT Multiplier And Spanning Tree Adder An Efficient Fused Add Multiplier With MWT Multiplier And Spanning Tree Adder 1.M.Megha,M.Tech (VLSI&ES),2. Nataraj, M.Tech (VLSI&ES), Assistant Professor, 1,2. ECE Department,ST.MARY S College of Engineering

More information

HIGH PERFORMANCE FUSED ADD MULTIPLY OPERATOR

HIGH PERFORMANCE FUSED ADD MULTIPLY OPERATOR HIGH PERFORMANCE FUSED ADD MULTIPLY OPERATOR R. Alwin [1] S. Anbu Vallal [2] I. Angel [3] B. Benhar Silvan [4] V. Jai Ganesh [5] 1 Assistant Professor, 2,3,4,5 Student Members Department of Electronics

More information

VHDL implementation of 32-bit floating point unit (FPU)

VHDL implementation of 32-bit floating point unit (FPU) VHDL implementation of 32-bit floating point unit (FPU) Nikhil Arora Govindam Sharma Sachin Kumar M.Tech student M.Tech student M.Tech student YMCA, Faridabad YMCA, Faridabad YMCA, Faridabad Abstract The

More information

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Digital Computer Arithmetic ECE 666 Part 6b High-Speed Multiplication - II Israel Koren ECE666/Koren Part.6b.1 Accumulating the Partial

More information

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: Bits and Bytes and Numbers

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: Bits and Bytes and Numbers Computer Science 324 Computer Architecture Mount Holyoke College Fall 2007 Topic Notes: Bits and Bytes and Numbers Number Systems Much of this is review, given the 221 prerequisite Question: how high can

More information

VHDL IMPLEMENTATION OF FLOATING POINT MULTIPLIER USING VEDIC MATHEMATICS
