Split-path Fused Floating Point Multiply Accumulate (FPMAC)


Suresh Srinivasan, Ketan Bhudiya, Rajaraman Ramanarayanan, P Sahit Babu, Tiju Jacob, Sanu K. Mathew*, Ram Krishnamurthy*, Vasantha Erraguntla
Intel India Pvt. Ltd, Bangalore, India
*Circuits Research Lab, Intel Corporation, Hillsboro, Oregon
suresh.srinivasan@intel.com, ketan.m.budhiya@intel.com, rajaraman.ramanarayanan@intel.com, sahit30@gmail.com, tiju.jacob@intel.com, sanu.k.mathew@intel.com, ram.krishnamurthy@intel.com, vasantha.erraguntla@intel.com

Abstract

The floating point multiply-accumulate (FPMAC) unit is the backbone of modern processors and is a key circuit determining the frequency, power and area of microprocessors. The FPMAC unit is used extensively in contemporary client microprocessors, further proliferated by ISA support for instructions like AVX and SSE, and is also used extensively in server processors employed for engineering and scientific applications. Consequently, the design of the FPMAC is of vital consideration, since it dominates the power and performance tradeoff decisions in such systems. In this work we demonstrate a novel FPMAC design that performs only the essential computations in the critical path, making it the fastest FPMAC design in the literature to date. The design is based on the premise of isolating and optimizing the critical-path computation in the FPMAC operation.
In this work we make three key innovations to create a novel double-precision FPMAC with the fewest gate stages reported in the timing-critical path: a) splitting near and far paths based on the exponent difference (d = Exy - Ez in {-2, -1, 0, 1} is the near path and the rest is the far path); b) early injection of the accumulate add for the near path into the Wallace tree, eliminating a 3:2 compressor from the near-path critical logic by exploiting the small alignment shifts in the near path and the sparse Wallace tree for 53-bit mantissa multiplication; c) combined round and accumulate add, eliminating the completion adder from the multiplier and giving both timing and power benefits. By the premise of splitting, our design consumes less power for each operation, since only the logic required for each case is switching. Splitting the paths also provides tremendous opportunities for clock or power gating the unused portion (nearly 15-20%) of the logic gates purely based on the exponent difference signals. We also demonstrate support for all rounding modes to adhere to the IEEE standard for double-precision FPMAC, which is critical for the employment of this design in contemporary processor families. The demonstrated design outperforms the best known silicon implementation, IBM Power6 [6], by 14% in timing while having similar area and giving additional power benefits due to split handling. The design is also compared to the best known timing design from Lang et al. [5], which it outperforms by 7% while being 30% smaller in area.

Keywords: Double precision floating point multiply-accumulate, Normalization, Wallace tree, IEEE Rounding

I. INTRODUCTION

The floating point multiply-accumulate (FPMAC) unit is the backbone of modern processors and is a key circuit determining the frequency, power and area of microprocessors. The FPMAC operation is commonly employed in algorithms like FFT and convolution, and is consequently a cardinal piece of most signal processing and physics applications.
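The defining property of a fused multiply-accumulate is that Mx*My+Mz is rounded only once, after the exact product is added to the accumulate. A minimal Python sketch illustrates why this matters, using exact rational arithmetic as a reference model (this is illustrative only, not the hardware algorithm described in this paper):

```python
from fractions import Fraction

def fma_reference(a: float, b: float, c: float) -> float:
    # Exact a*b + c in rational arithmetic, then a single rounding to
    # the nearest double (float(Fraction) rounds correctly to nearest).
    return float(Fraction(a) * Fraction(b) + Fraction(c))

# a*b is exactly 1 - 2**-104; a separate multiply rounds it to 1.0,
# losing the residual that the fused operation preserves.
a = 1.0 + 2.0**-52
b = 1.0 - 2.0**-52
unfused = (a * b) + (-1.0)          # residual lost in the first rounding
fused = fma_reference(a, b, -1.0)   # single rounding keeps the residual
```

Here `unfused` evaluates to 0.0 while `fused` evaluates to -2**-104, which is why fused hardware must carry the full 2m-bit product into the accumulate add.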
The FPMAC unit is used extensively in servers employed for engineering and scientific applications, and therefore its design is of vital consideration in such systems. Moreover, with the increased focus of most microprocessor companies on floating point vector operations like AVX, SSE, etc., any optimization in the area, timing and power of critical components like the FPMAC is of tremendous value to microprocessors for which frequency and power are key metrics. Also, the use of FPMAC units in AVX engines increases not only logic complexity but also interconnect complexity. With interconnect delay and power increasing with every process generation, their dominance over the critical timing path and total power makes the fused multiply-add architecture not only a design with difficult timing goals, but also one of the heaviest on power consumption. Various FPMAC designs have evolved over many years [4, 5, 6, 7]; however, the timing and power of the FPMAC still remain key issues to be tackled optimally. In this work we provide an FPMAC design that is focused on optimizing the timing of the FPMAC critical path beyond the fastest known silicon implementation and is also gating-friendly by design, potentially allowing 15-20% of the gates to be turned off during each operation. The concept is very much a divide-and-conquer approach, where we split the problem into timing-critical and non-critical portions and tackle the critical portion optimally. Such a split comes at no additional cost due to the nature of the FPMAC pipeline and therefore results only in gains with respect to both timing and power. The FPMAC operation comprises a floating point multiply followed by an add operation (Mx*My+Mz). During the addition of the product with the accumulate, while dealing with signed numbers, there are cases in which the two numbers are so close that their difference leads to a big normalization shift.
We call these the near path operations, while cases with a bigger difference between the product and the accumulate are called the far path. The near path operation is the critical path of the design, and therefore isolating this case based on the exponent difference in the early stages helps eliminate any unwanted logic operations for it. In this work we make three key innovations to create a novel double-precision FPMAC with the fewest gate stages:

a) Splitting near and far paths based on the exponent difference (d = Exy - Ez in {-2, -1, 0, 1} is the near path and the rest is the far path), b) Early injection of the accumulate add for the near path into the Wallace tree, eliminating a 3:2 compressor from the near-path critical logic, possible due to the small alignment shifts in the near path, c) Combined round-add, eliminating the completion adder from the multiplier and giving both timing and power benefits. Splitting the paths also provides tremendous opportunities for gating the unused portion (15-20%) of the logic gates purely based on the exponent difference signals. We also demonstrate support for all rounding modes to adhere to the IEEE standard for double-precision FPMAC, which is critical for the employment of this design in contemporary processor families. The ideas are demonstrated in visuals interchangeably for single- or double-precision blocks, based on simplicity of demonstration; however, the comparison of timing, power and area is demonstrated only for the more complex double-precision FPMAC. Section 2 discusses prior research and how we differ from existing work. Section 3 describes the FPMAC proposed in this work. Section 4 provides the area and timing estimates of this design and compares them with the best known silicon implementation and the best known (theoretically fastest) timing FPMAC.

II. RELATED WORK

Floating point arithmetic in the microprocessor industry is defined to adhere to the IEEE floating point standard [3]. Floating point multiplication has been explored in the past for optimal implementation in [8, 9]. The design in [8] proposed a Wallace tree based implementation followed by an optimal rounding scheme that removes the completion adder at the end of multiplication and performs it together with the rounding. The dual adder in completion can help remove not only the completion adder but also the post-normalization shifter, using the ideas presented in [9, 16, 17].
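The path split in a) depends only on the exponent fields, so it can be decided very early. A small sketch of the classification (a hypothetical helper for illustration; exponents are the biased IEEE fields and Exy = Ex + Ey - bias, as defined later in Section III):

```python
def classify_path(e_x: int, e_y: int, e_z: int, bias: int = 1023) -> tuple[str, int]:
    """Classify an FPMAC operation as near- or far-path.

    e_x, e_y, e_z are biased exponents.  The product exponent is
    Exy = Ex + Ey - bias, and d = Exy - Ez.  Differences in
    {-2, -1, 0, 1} take the near path (possible massive cancellation,
    large normalization left shift); everything else takes the far
    path (large alignment shift, small normalization shift).
    """
    d = (e_x + e_y - bias) - e_z
    return ("near" if -2 <= d <= 1 else "far"), d

# 1.5 * 1.5 - 2.25 cancels completely: biased exponents 1023, 1023, 1024
# give d = -1, so this operation is routed down the near path.
```

Since only one path's logic is needed per operation, this single early signal is also what drives the clock/power gating described above.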
Addition of floating point numbers is slightly more complicated, since it requires alignment of the numbers before performing the FP addition. Apart from this alignment shift, signed addition may also lead to non-normalized results, requiring large normalization shifts at the end of the add operation. Various optimizations have been used to arrive at timing-optimal adders, including splitting the addition operation into near and far paths based on the alignment shifts required [10, 11, 15]. The FPMAC design in this work draws on ideas from [9, 11]: we combine the splitting approach of adders with the combined round-add of multipliers, employing both optimally in the proposed timing-optimal FPMAC. Various FMAs have been proposed in [4, 5, 6, 7]; among the recent ones are the FPMAC designs implemented in Power6 [6, 7] and the timing-optimized FPMAC of Lang [5]. The Power6 FPMAC is the fastest known silicon implementation, while Lang's work presents the fastest theoretical FPMAC implementation. In this work we use the Power6 FMA and the Lang FPMAC designs for timing, area and power comparison.

III. PROPOSED FPMAC DESIGN

Figure 1 shows the FPMAC design proposed in this work. The FPMAC operation involves mantissa multiplication of two input operands (Mx, My), followed by the accumulation or addition of the third operand Mz. All the operands are represented as standard IEEE floating point normalized numbers (S, F, E), where S is the sign (1 bit), F is the fraction (1.F is the m-bit normalized mantissa M) and E is the biased exponent (the actual exponent e plus a bias, to make the representation of E positive). Multiplication of the two operands involves mantissa multiplication (Mx x My), performed using a carry-save reduction compressor tree, and the output exponent of the product is Exy = Ex + Ey - bias.
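The biased-exponent arithmetic above can be checked with a small reference sketch (illustrative only; it reads the 11-bit exponent field of an IEEE 754 binary64 value directly):

```python
import struct

def biased_exponent(x: float) -> int:
    # Extract the 11-bit biased exponent field E from a binary64 value.
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    return (bits >> 52) & 0x7FF

# Product exponent per the text: Exy = Ex + Ey - bias (bias = 1023).
x, y = 1.5, 2.0
e_xy = biased_exponent(x) + biased_exponent(y) - 1023
# When the product mantissa does not overflow out of [1, 2), e_xy
# matches the exponent field of the rounded product x * y.
```

For x = 1.5 and y = 2.0 the formula gives 1024, matching the exponent field of 3.0.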
The accumulation involves alignment of the accumulate mantissa, Mz, and the multiply result, Mxy, by shifting one of them by the exponent difference, d = Exy - Ez. To improve performance, the exponent difference computation and the alignment shift of Mz are performed in parallel with the mantissa multiplication. The operations involved in the various stages of the FPMAC pipeline differ quite significantly based on the exponent difference d.

- Cases with d > 1 or d < -2 (far path) involve a big right or left alignment shift (done in parallel with the mantissa multiplication), followed by a 3:2 compression to reduce the aligned accumulate, Mz, with the carry (C) and sum (S) terms of the final result as output. In the completion addition that adds the (C, S) result from the previous stage, only the MSB m-bit sum is required, while the remaining bits are used for computing the carry (C), guard (G), round (R) and sticky (T) bits. This is followed by a normalization right shift of at worst m+3 (when d = -(m+3)). The rounding unit uses the C, G, R and T bits to compute the rounded result.

- Cases with d in {-2, -1, 0, 1} (near path) involve a smaller alignment shift, followed by a 3:2 compression similar to the earlier case. However, these cases may generate a large number of leading 0s or 1s, depending on whether the result is positive or negative, which requires a worst-case 2m-bit normalization left shift. This also necessitates computing the whole 2m-bit sum, with a parallel Leading Zero Anticipator (LZA) to improve performance. The normalized result is then used for rounding.

The near path clearly forms the critical path and dominates the hardware requirements, due to the presence of the 2m-bit sum and the 2m-bit normalize unit along with the LZA. The proposed FPMAC implementation is based on split handling of the near and far paths, which uses optimal hardware and

logic stages for each of the cases, performing the bare minimum of operations required, particularly in the near path. The delay and area optimizations in the near path include:
1) Early injection of the near-path accumulate operand Mz into the multiplication CSA tree, removing the 3:2 compression stage from the critical path and giving a 3% reduction in logic levels (LL) compared to [5].
2) The accumulation is performed after the normalization shift for both the near and far paths, combined with the rounding unit, eliminating the accumulate adder from the critical path (a 2m-bit adder area saving).
3) To further reduce the near-path delay, the normalization shifting for the near path is performed in parallel with the LZA on the (C, S) outputs of the CSA tree, which masks the shifting delay with the LZA computation. (Items 2 and 3 together give a 13% reduction in logic levels.)
4) Sign detection of the result for conditional 2's complementing is performed using the existing LZA components. This completely eliminates the sum computation and the sign detection unit from the critical path, along with the hardware associated with them. This is a key differentiation and optimization that enables both high performance and low power compared to other existing works [4, 5, 6].

The far path is non-critical and is therefore designed with the minimum required operations. Apart from performing the minimal number of operations, the split-path handling also provides significant power benefits due to the ease of clock/power gating the near or the far path as required.

Figure 1: Proposed FPMAC design

Each of the above innovations is described in detail in the sections below, walking through each stage of the operation. The implementation details are discussed in terms of a single-precision FPMAC unit for ease of representation; however, without loss of generality, all the techniques proposed hold for double-precision FPMAC as well.
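The 3:2 carry-save compression that the CSA tree is built from, and into which the near-path Mz is injected, can be sketched as a small reference model (illustrative only; the hardware operates on 106-bit partial products):

```python
def csa_3to2(a: int, b: int, c: int) -> tuple[int, int]:
    # 3:2 compressor: reduces three addends to a (sum, carry) pair with
    # sum + carry == a + b + c, in constant depth (no carry ripple).
    s = a ^ b ^ c                                # bitwise sum
    carry = ((a & b) | (b & c) | (a & c)) << 1   # majority bit, weighted up
    return s, carry

# Injecting an extra operand (e.g. the aligned near-path Mz) costs one
# 3:2 level somewhere; a sparse tree can absorb it without lengthening
# the critical path, which is the point of optimization 1) above.
s, c = csa_3to2(0b1011, 0b0110, 0b1101)
assert s + c == 0b1011 + 0b0110 + 0b1101
```

Because every level of the tree has this invariant, adding Mz as a fourth input at a stage with a free slot preserves correctness while removing the dedicated 3:2 compressor after the multiplier.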
The results and comparison will also be demonstrated only for double-precision FPMAC. The completion adder is a combined round-and-add block very similar to [5]. The dual adder operates on the MSB m-1 bits, computing sum and sum+1, one of which is selected based on the carry, sticky and round bits. The carry, sticky and round bits are computed on the LSB m+1 bits and are finally used to decide whether to select sum or sum+1 from the completion adder. The only additional change with respect to [5] in this design is the addition of the 2's complement 1 in the LSB in this dual adder stage, which is in the non-critical section of the addition of the LSBs. The key premise of a multiplier performing combined round and add is that the resulting sum of C, S should be at worst of the form 11.XXX and not reach 111.XXX, in which case the round, guard and sticky bits looked at become

incorrect. We ensure this by pre-shifting the multiplication result one bit to the right whenever an overflow is detected in either the near or the far path, along with adding 1 to the final exponent to compensate for the shift.

3.1. Exponent Compute Data-path

The exponent difference d = Ex + Ey - bias - Ez is computed as demonstrated in Figure 2 using a 3:2 compressor. The addition of the bias (-127/-1023 for single/double precision) is accomplished by appending the required two 1s to the LSB of the carry and the MSB of the sum. Separate right and left shifters are used for the large far-path shift values (d < -2 and d > 1), while the small 1- or 2-bit near-path shifts are performed using different shifters. This allows the near and far paths to be handled separately, enabling early availability of the shifted near-path accumulate value, which is inserted into the multiplication CSA tree. The near-path flag is determined by detecting whether the first m-1 significant bits of the exponent difference are all 0s or all 1s, using the zero-detect module shown in Figure 2 for a single-precision 8-bit exponent example.

Figure 2: Exponent compute block supporting early detection for near path

3.2. Early Injection of Near Path Accumulate

The near-path accumulate mantissa, Mz, requires only a small shift to be aligned with the multiplication result, Mxy. The early availability of the aligned mantissa provides an opportunity to compress the near-path mantissa along with the multiplication CSA tree. Figure 3 demonstrates the insertion of the near-path accumulate into the 2nd stage of the CSA tree. Note that for a negative number we inject the complement of Mz along with the 2's complement 1 in the same stage of the sparse tree. The sparse nature of both the double- and single-precision multiplication trees enables the near-path (C, S) results from the CSA tree to be computed without any additional delay penalty in the critical path. Early computation of the near-path (C, S), including the compression of the accumulate mantissa, avoids an additional 3:2 compression stage and enables immediate processing, saving one 3:2 compression stage in the critical near path. This near-path injection also helps split the data-paths for the near and far paths immediately at the end of the multiplication CSA tree. Such a split comes at no additional hardware penalty and allows the optimal hardware to be used for the rest of the pipeline. It also significantly reduces the critical path of the near path, which is the critical path of the FPMAC design, by removing the undue penalties imposed by unified handling of the different cases.

Figure 3: Early Injection of Near Path Accumulate

3.3. Split Handling of Normalization Shift

The completion addition is performed after normalization shifting of the (C, S) terms, combined with the rounding. Normalizing before the completion addition means only the required m-bit sum is computed, making the design performance- and hardware-optimized. The near and far paths are normalized separately, as demonstrated in Figure 1. In the near path, an effective subtraction may at worst lead to 2m leading zeroes (or ones), when the m-bit accumulate value equals the 2m-bit multiplication result. To determine the left-shift amount in such cases we employ a Leading Zero Anticipator (LZA). As demonstrated in Figure 4, the LZA generates the string representing the number of leading zeroes and leading ones (Figure 4(a)). This string is binary-encoded using the Leading Digit Counter (LDC) (Figure 4(b)) and is also used to derive the sign of the output result (Figure 4(c)). The shift amount generated by the LZA is used by the normalization shifter to left-shift the C, S terms and obtain a normalized result.
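The work done by the sign detection, LZA and normalization shifter can be summarized in a behavioral sketch (a reference model only: it computes the full sum first, whereas the hardware anticipates the leading-digit count and sign from the (C, S) pair directly, within one bit position, precisely to avoid this full addition):

```python
def normalize_near_path(c: int, s: int, width: int = 106) -> tuple[int, int, bool]:
    """Behavioral model of near-path sign detection + normalization.

    c, s: carry/sum terms from the CSA tree, as two's-complement
    integers in `width` bits.  Returns (normalized magnitude, left
    shift amount, negative flag).
    """
    mask = (1 << width) - 1
    total = (c + s) & mask
    negative = bool(total >> (width - 1))   # sign bit of the result
    if negative:                            # conditional 2's complement
        total = (-total) & mask
    if total == 0:
        return 0, 0, False
    shift = width - total.bit_length()      # leading-zero count
    return (total << shift) & mask, shift, negative
```

In the proposed design this `shift` comes from the LZA/LDC and `negative` from the sign-detection logic of Figure 4(c), so the expensive `c + s` addition never appears on the critical path.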
There are cases where the multiplication result may overflow when added, which requires detection of the first bit in the carry and sum. Cases with 11.XXX in the sum need a 1-bit right pre-shift before the normalization shift. This is primarily to ensure that the combined round-and-add stage never deals with a 111.XXX result, in which case it would not be looking at the correct L, G, R, T bits. The skewed arrival times of the binary-encoded shift amount, from LSB to MSB, are exploited to mask the normalization shift delay by performing the shifts as those bits arrive, in that order.

Figure 4(a)(b)(c): Leading Zero/One Anticipation, Leading Digit Counter (binary encoding) and Sign Detection Logic

Unlike prior implementations, performing the completion addition along with rounding and sign detection using the LZA completely eliminates the 2m-bit summation and sign-detection unit from our proposed design, which gives significant savings in the hardware requirements. The other parallel path in the normalization unit deals with the far-path cases, where d > 1 or d < -2. The 3:2 compression of the aligned far-path mantissa is performed on Mz and the (C, S) terms from the CSA tree. As in the near path, the summation is postponed to the last stage of combined rounding and addition, where it is handled for both the near and far paths by a single unit. The far-path normalization shifter requires a worst-case right shift of the C, S terms by m+3 bits, corresponding to the case when the accumulate was shifted left for alignment, i.e. d = -(m+3). The bits shifted out of the m-bit range are used to compute the carry, sticky and guard bits used during rounding in the combined sum-and-rounding unit. The normalized (C, S) or (~C, ~S) from either the near path or the far path is passed to the combined sum/rounding unit based on the sign of the result determined earlier (Figure 1). The 1s required to complete the 2's complementing of (C, S) are handled in the combined add/round unit as in existing implementations; these 1s are added to the MSB in the final stage of the combined round-add. Split handling of the near and far paths ensures that only the bare minimal operations are performed on the critical path and is therefore a performance-optimal solution. Another big advantage of such split handling is the ease of clock/power gating half the units based on the near-path or far-path flag.
This enables turning off all the power-consuming normalization shifters and logic blocks, keeping only the required blocks of computation switching, and thereby giving a power-optimal design. The total logic levels in terms of basic gates are estimated to be 15% less than the

best silicon implementation of the FPMAC presented in [6, 7], with significantly reduced hardware complexity.

IV. LOGIC/AREA ANALYSIS AND COMPARISON

The FPMAC presented in this paper is the fastest existing implementation for performing single multiply-accumulates, and the area of the proposed design is very comparable to existing implementations. To compare and highlight the merits of the FPMAC presented in this work, we compare it to the best known silicon implementation, the IBM Power6 architecture [6, 7]. Figure 5 shows two FPMAC architectures proposed in recent literature: the best known silicon implementation, IBM Power6, and the fastest logic-level FPMAC, from Lang et al. [5].

4.1 Logic Level Comparison

A detailed comparison of the logic levels is presented in Table 1, based on our best understanding of the stages in [5] and [6]. The interconnect penalty is not precisely captured in the logic level computation of any of the blocks except the LZA and the shifters, where additional gate stage delays are incurred based on gate level simulations. Precise timing computation is currently underway in a gate level implementation; this work is primarily targeted towards theoretical analysis and comparison closely matching circuit simulations. As demonstrated, the proposed design is faster than the best known timing design [5] by 5.2%, and faster than the best known silicon implementation by 14.2%. This significant speedup is primarily attributed to the complete isolation of the critical path, removing the unnecessary burden that a combined implementation has to pay. For simplicity, the normalization logic gate levels are estimated assuming no overlap between the LZA and the shift logic, in both [5] and the proposed scheme.
However, as proposed in [5], the LZA and the normalization shifter may be overlapped due to the inherent nature of the LZA, which provides the leading digit count from MSB to LSB with sufficient delay in between. A similar optimization applied in our implementation can completely eliminate the normalization shift logic delays from both [5] and our proposed design. A complete silicon implementation of the design is currently ongoing; based on preliminary gate level simulations we observe a saving of ~11% in the timing of our design over Lang [5]. The detailed implementation and delay estimation from gate level simulations include all the finer delays, down to detailed interconnect and routing delays due to the addition of the large interconnect-dominated shifters in our proposed design. Further optimizations in the gate level simulations are also currently underway.

(a) IBM Power6 FPMAC implementation [6] (b) Lang FPMAC implementation [5]
Figure 5: Existing FPMAC implementations

4.2 Area Comparison

One of the important requirements in designing a timing-optimal FPMAC is to make sure that its area does not increase significantly. Given that FPMACs are employed in wide-bit SIMD pipelines, an area increase not only impacts the gate count, power, etc. but also critically dictates the floorplan of the processor. Table 2 demonstrates a detailed comparison of the area of the various blocks in our design with [5] and [6]. The area estimates are based purely on a comparison of the logic blocks, as the interconnect area is mostly similar between the designs due to the overall similar bit-width operations. The interconnect area of the shifters, however, is included in our comparison, since a shifter is interconnect-dominated logic; we therefore incur the additional penalty of enlarging the shifters not only in logic but also in interconnect. Note that these area estimates do not include the additional individual gate area savings due to the relaxed timing provided by our timing-optimal design. We have normalized all the area numbers with respect to the area of a 106-bit shifter on 106 bits, which is assigned the unit value 100. Using these normalized estimates, the total areas of the three designs (IBM, Lang and proposed) are shown in Table 2; the normalization helps in comparing the area with other designs. As we can see from Table 2, our design shows an 8% increase in area compared to the Power6 implementation, while it is 12% smaller than the Lang design. The numbers for our implementation, however, as highlighted earlier, do not include the reduction in area of the non-critical components, which could potentially bring our implementation area very close to the Power6 implementation without losing any of the timing benefits. The reduced area and relaxed timing contribute directly to a reduction in the dynamic power consumption of the design.
Furthermore, the split handling of near and far paths provides a natural opportunity to save idle power in the portions of the logic unused during near- or far-path operations. Just by design, at any instant up to ~20% of the design may be clock/power gated using only the near vs. far path signal.

V. CONCLUSION AND FUTURE WORK

This paper demonstrates a split-path FPMAC design which is 14% faster than the fastest known silicon implementation. The merit of the design is accentuated by the timing gains coming at no additional area cost. The split-path design provides natural gating opportunities and even in the normal case may lead to 15-20% fewer switching gates, based on the near- or far-path operation. We intend to implement this design on silicon to demonstrate its merits and feasibility, and this work is underway. Gate-level timing simulations on individual blocks have adhered to the estimates computed in this work; the final stitching of the blocks and synthesis is ongoing work. This innovation can significantly help microprocessor designs with fast timing, area and power convergence.

Table 1. Logic level comparison of the proposed implementation with IBM Power6 [6, 7] and Lang [5] (logic levels in parentheses):

Block                 | Power6 FPMAC [6, 7]                             | Lang [5]                                  | Proposed                                       | Comments
Multiplier            | 53-bit multiplier (33)                          | 53-bit multiplier (33)                    | 53-bit multiplier with injected near path (33) | Near path injection into sparse compressor tree comes free of cost
Accumulate adder      | 3:2 compressor (4), 106-bit accumulate add (19) | 3:2 compressor (4), accumulate add (0)    | removed (0)                                    | Removed in proposed design
Normalization shifter | LZA (0), 106-bit left shift (8)                 | LZA (9), 161-bit shifter (8)              | LZA (9), 106-bit left shift (8)                | Critical path LZA (XOR assumed 2 LL, 3+log(128)*2+2); 3 stages per log shift (log(128)*3)
Rounding unit         | 54-bit round add+GRT on lower 54 bits (18)      | dual add (20)                             | 52-bit dual add+GRTC on lower 56 bits (20)     | Proposed: added final mux delay in dual add, included the carry from the lower 53-bit stage
Post rounding norm    | 1/2-bit right shift (2)                         | 1/2-bit right shift (2)                   | 1/2-bit right shift (2)                        |
Total Logic Levels    | 84                                              | 76                                        | 72                                             |

Table 2: Area comparison of the proposed approach with the best known silicon implementation (numbers in parentheses are indicative of relative area, normalized to a 106-bit shifter on 106 bits = 100):

Block                        | Power6 FPMAC [6, 7]                               | Lang [5]                                                                              | Proposed
Multiplier (area)            | 53-bit multiplier (400)                           | 53-bit multiplier (400)                                                               | 53-bit multiplier with near-path Mz inserted (400)
Alignment shift (area)       | 56-bit LS on 53 bits, 106-bit RS on 53 bits (100) | 161-bit RS on 53 bits (100)                                                           | 56-bit LS on 53 bits, 106-bit RS on 53 bits (100)
Accumulate adder (area)      | 106-bit add and 3:2 compress (87)                 | 106-bit 3:2 compress                                                                  | 106-bit 3:2 compress
Normalization shifter (area) | 106-bit LZA (113), 106-bit left shifter (100)     | 106-bit LZA + sign detect + HA (251), 106-bit left shift S and C, 53-bit right shift S and C (200) | 106-bit LZA + sign detect (169), 106-bit left shift S and C, 53-bit right shift S and C (200)
Rounding unit (area)         | 53-bit add (63)                                   | 53-bit dual add (70)                                                                  | 53-bit dual add (70)

ACKNOWLEDGMENT

We would like to acknowledge the following for their encouragement and support of this work: Nitin Borkar, Sandip Pandey and Srinivas Lingam.

REFERENCES
[1] Intel's AVX page
[2] AMD SSE5
[3] IEEE 754: Standard for Binary Floating-Point Arithmetic
[4] C. Chen, L.-A. Chen, and J.-R. Cheng, "Architectural Design of a Fast Floating-Point Multiplication-Add Fused Unit Using Signed-Digit Addition," Proceedings Euromicro Symposium on Digital System Design (DSD 2001).
[5] T. Lang and J. D. Bruguera, "Floating-Point Multiply-Add-Fused with Reduced Latency," 17th IEEE Symposium on Computer Arithmetic (ARITH).
[6] B. McCredie et al., "Design of the Power6 Microprocessor," IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers.
[7] S. Dao Trong et al., "P6 Binary Floating-Point Unit," 18th IEEE Symposium on Computer Arithmetic (ARITH).
[8] R. K. Yu, "167 MHz radix-4 floating point multiplier," Proceedings of the 12th Symposium on Computer Arithmetic.
[9] W. Belluomini, H. Ngo, C. McDowell, J. Sawada, T. Nguyen, B. Veraa, J. Wagoner, and M. Lee, "A double precision floating point multiply," IEEE International Solid-State Circuits Conference (ISSCC), Digest of Technical Papers.
[10] P.-M. Seidel and G. Even, "Delay-Optimized Implementation of IEEE Floating-Point Addition," IEEE Transactions on Computers, vol. 53, no. 2, February 2004.
[11] A. Beaumont-Smith, N. Burgess, S. Lefrere, and C. C. Lim, "Reduced Latency IEEE Floating-Point Standard Adder Architectures," Proceedings of the 14th IEEE Symposium on Computer Arithmetic, p. 35, April 14-16, 1999.
[12] K. T. Lee and K. J. Nowka, "1 GHz Leading Zero Anticipator Using Independent Sign-Bit Determination Logic," Symp. VLSI Circuits Dig. Tech. Papers, 2000.
[13] E. Antelo, M. Boo, J. D. Bruguera, and E. L. Zapata, "A Novel Design for a Two Operand Normalization Circuit," IEEE Trans. Very Large Scale Integration (VLSI) Systems, vol. 6, no. 1.
[14] N. Burgess, "The Flagged Prefix Adder for Dual Additions," Proc. SPIE ASPAII7.
[15] R. V. K. Pillai, "A Low Power Approach to Floating Point Adder Design for DSP Applications," VLSI Signal Processing, 2001.
[16] P.-M. Seidel, "On-Line IEEE Floating-Point Multiplication and Division for Reduced Power Dissipation," Proceedings of the 38th Asilomar Conference on Signals, Systems and Computers, 2004.
[17] E. C. Quinnell, "Floating-Point Fused Multiply-Add Architectures," Asilomar Conference on Signals, Systems and Computers, Conference Record.


More information

A New Architecture For Multiple-Precision Floating-Point Multiply-Add Fused Unit Design

A New Architecture For Multiple-Precision Floating-Point Multiply-Add Fused Unit Design A New Architecture For Multiple-Precision Floating-Point Multiply-Add Fused Unit Design Libo Huang, Li Shen, Kui Dai, Zhiying Wang School of Computer National University of Defense Technology Changsha,

More information

IEEE-754 compliant Algorithms for Fast Multiplication of Double Precision Floating Point Numbers

IEEE-754 compliant Algorithms for Fast Multiplication of Double Precision Floating Point Numbers International Journal of Research in Computer Science ISSN 2249-8257 Volume 1 Issue 1 (2011) pp. 1-7 White Globe Publications www.ijorcs.org IEEE-754 compliant Algorithms for Fast Multiplication of Double

More information

Redundant Data Formats for Faster Floating Point Addition. Abstract

Redundant Data Formats for Faster Floating Point Addition. Abstract Redundant Data Formats for Faster Floating Point Addition Abstract This paper presents an approach that minimizes the latency of a floating point adder by leaving outputs in a redundant form. As long as

More information

Fused Floating Point Arithmetic Unit for Radix 2 FFT Implementation

Fused Floating Point Arithmetic Unit for Radix 2 FFT Implementation IOSR Journal of VLSI and Signal Processing (IOSR-JVSP) Volume 6, Issue 2, Ver. I (Mar. -Apr. 2016), PP 58-65 e-issn: 2319 4200, p-issn No. : 2319 4197 www.iosrjournals.org Fused Floating Point Arithmetic

More information

A Single/Double Precision Floating-Point Reciprocal Unit Design for Multimedia Applications

A Single/Double Precision Floating-Point Reciprocal Unit Design for Multimedia Applications A Single/Double Precision Floating-Point Reciprocal Unit Design for Multimedia Applications Metin Mete Özbilen 1 and Mustafa Gök 2 1 Mersin University, Engineering Faculty, Department of Computer Science,

More information

AN IMPROVED FUSED FLOATING-POINT THREE-TERM ADDER. Mohyiddin K, Nithin Jose, Mitha Raj, Muhamed Jasim TK, Bijith PS, Mohamed Waseem P

AN IMPROVED FUSED FLOATING-POINT THREE-TERM ADDER. Mohyiddin K, Nithin Jose, Mitha Raj, Muhamed Jasim TK, Bijith PS, Mohamed Waseem P AN IMPROVED FUSED FLOATING-POINT THREE-TERM ADDER Mohyiddin K, Nithin Jose, Mitha Raj, Muhamed Jasim TK, Bijith PS, Mohamed Waseem P ABSTRACT A fused floating-point three term adder performs two additions

More information

Implementation of Floating Point Multiplier Using Dadda Algorithm

Implementation of Floating Point Multiplier Using Dadda Algorithm Implementation of Floating Point Multiplier Using Dadda Algorithm Abstract: Floating point multiplication is the most usefull in all the computation application like in Arithematic operation, DSP application.

More information

2 General issues in multi-operand addition

2 General issues in multi-operand addition 2009 19th IEEE International Symposium on Computer Arithmetic Multi-operand Floating-point Addition Alexandre F. Tenca Synopsys, Inc. tenca@synopsys.com Abstract The design of a component to perform parallel

More information

By, Ajinkya Karande Adarsh Yoga

By, Ajinkya Karande Adarsh Yoga By, Ajinkya Karande Adarsh Yoga Introduction Early computer designers believed saving computer time and memory were more important than programmer time. Bug in the divide algorithm used in Intel chips.

More information

Integer Multiplication. Back to Arithmetic. Integer Multiplication. Example (Fig 4.25)

Integer Multiplication. Back to Arithmetic. Integer Multiplication. Example (Fig 4.25) Back to Arithmetic Before, we did Representation of integers Addition/Subtraction Logical ops Forecast Integer Multiplication Integer Division Floating-point Numbers Floating-point Addition/Multiplication

More information

Design and Implementation of a Quadruple Floating-point Fused Multiply-Add Unit

Design and Implementation of a Quadruple Floating-point Fused Multiply-Add Unit Design and Implementation of a Quadruple Floating-point Fused Multiply-Add Unit He Jun Shanghai Hi-Performance IC Design Centre Shanghai, China e-mail: joyhejun@126.com Zhu Ying Shanghai Hi-Performance

More information

THE INTERNATIONAL JOURNAL OF SCIENCE & TECHNOLEDGE

THE INTERNATIONAL JOURNAL OF SCIENCE & TECHNOLEDGE THE INTERNATIONAL JOURNAL OF SCIENCE & TECHNOLEDGE Design and Implementation of Optimized Floating Point Matrix Multiplier Based on FPGA Maruti L. Doddamani IV Semester, M.Tech (Digital Electronics), Department

More information

Efficient Radix-10 Multiplication Using BCD Codes

Efficient Radix-10 Multiplication Using BCD Codes Efficient Radix-10 Multiplication Using BCD Codes P.Ranjith Kumar Reddy M.Tech VLSI, Department of ECE, CMR Institute of Technology. P.Navitha Assistant Professor, Department of ECE, CMR Institute of Technology.

More information

OPTIMIZATION OF AREA COMPLEXITY AND DELAY USING PRE-ENCODED NR4SD MULTIPLIER.

OPTIMIZATION OF AREA COMPLEXITY AND DELAY USING PRE-ENCODED NR4SD MULTIPLIER. OPTIMIZATION OF AREA COMPLEXITY AND DELAY USING PRE-ENCODED NR4SD MULTIPLIER. A.Anusha 1 R.Basavaraju 2 anusha201093@gmail.com 1 basava430@gmail.com 2 1 PG Scholar, VLSI, Bharath Institute of Engineering

More information

I. Introduction. India; 2 Assistant Professor, Department of Electronics & Communication Engineering, SRIT, Jabalpur (M.P.

I. Introduction. India; 2 Assistant Professor, Department of Electronics & Communication Engineering, SRIT, Jabalpur (M.P. A Decimal / Binary Multi-operand Adder using a Fast Binary to Decimal Converter-A Review Ruchi Bhatt, Divyanshu Rao, Ravi Mohan 1 M. Tech Scholar, Department of Electronics & Communication Engineering,

More information

Design and Optimized Implementation of Six-Operand Single- Precision Floating-Point Addition

Design and Optimized Implementation of Six-Operand Single- Precision Floating-Point Addition 2011 International Conference on Advancements in Information Technology With workshop of ICBMG 2011 IPCSIT vol.20 (2011) (2011) IACSIT Press, Singapore Design and Optimized Implementation of Six-Operand

More information

EE260: Logic Design, Spring n Integer multiplication. n Booth s algorithm. n Integer division. n Restoring, non-restoring

EE260: Logic Design, Spring n Integer multiplication. n Booth s algorithm. n Integer division. n Restoring, non-restoring EE 260: Introduction to Digital Design Arithmetic II Yao Zheng Department of Electrical Engineering University of Hawaiʻi at Mānoa Overview n Integer multiplication n Booth s algorithm n Integer division

More information

Delay Optimised 16 Bit Twin Precision Baugh Wooley Multiplier

Delay Optimised 16 Bit Twin Precision Baugh Wooley Multiplier Delay Optimised 16 Bit Twin Precision Baugh Wooley Multiplier Vivek. V. Babu 1, S. Mary Vijaya Lense 2 1 II ME-VLSI DESIGN & The Rajaas Engineering College Vadakkangulam, Tirunelveli 2 Assistant Professor

More information

An Implementation of Double precision Floating point Adder & Subtractor Using Verilog

An Implementation of Double precision Floating point Adder & Subtractor Using Verilog IOSR Journal of Electrical and Electronics Engineering (IOSR-JEEE) e-issn: 2278-1676,p-ISSN: 2320-3331, Volume 9, Issue 4 Ver. III (Jul Aug. 2014), PP 01-05 An Implementation of Double precision Floating

More information

A Library of Parameterized Floating-point Modules and Their Use

A Library of Parameterized Floating-point Modules and Their Use A Library of Parameterized Floating-point Modules and Their Use Pavle Belanović and Miriam Leeser Department of Electrical and Computer Engineering Northeastern University Boston, MA, 02115, USA {pbelanov,mel}@ece.neu.edu

More information

High Speed Multiplication Using BCD Codes For DSP Applications

High Speed Multiplication Using BCD Codes For DSP Applications High Speed Multiplication Using BCD Codes For DSP Applications Balasundaram 1, Dr. R. Vijayabhasker 2 PG Scholar, Dept. Electronics & Communication Engineering, Anna University Regional Centre, Coimbatore,

More information

II. MOTIVATION AND IMPLEMENTATION

II. MOTIVATION AND IMPLEMENTATION An Efficient Design of Modified Booth Recoder for Fused Add-Multiply operator Dhanalakshmi.G Applied Electronics PSN College of Engineering and Technology Tirunelveli dhanamgovind20@gmail.com Prof.V.Gopi

More information

High Throughput Radix-D Multiplication Using BCD

High Throughput Radix-D Multiplication Using BCD High Throughput Radix-D Multiplication Using BCD Y.Raj Kumar PG Scholar, VLSI&ES, Dept of ECE, Vidya Bharathi Institute of Technology, Janagaon, Warangal, Telangana. Dharavath Jagan, M.Tech Associate Professor,

More information

Improved Design of High Performance Radix-10 Multiplication Using BCD Codes

Improved Design of High Performance Radix-10 Multiplication Using BCD Codes International OPEN ACCESS Journal ISSN: 2249-6645 Of Modern Engineering Research (IJMER) Improved Design of High Performance Radix-10 Multiplication Using BCD Codes 1 A. Anusha, 2 C.Ashok Kumar 1 M.Tech

More information

Module 2: Computer Arithmetic

Module 2: Computer Arithmetic Module 2: Computer Arithmetic 1 B O O K : C O M P U T E R O R G A N I Z A T I O N A N D D E S I G N, 3 E D, D A V I D L. P A T T E R S O N A N D J O H N L. H A N N E S S Y, M O R G A N K A U F M A N N

More information

An Efficient Design of Sum-Modified Booth Recoder for Fused Add-Multiply Operator

An Efficient Design of Sum-Modified Booth Recoder for Fused Add-Multiply Operator An Efficient Design of Sum-Modified Booth Recoder for Fused Add-Multiply Operator M.Chitra Evangelin Christina Associate Professor Department of Electronics and Communication Engineering Francis Xavier

More information

International Journal of Engineering and Techniques - Volume 4 Issue 2, April-2018

International Journal of Engineering and Techniques - Volume 4 Issue 2, April-2018 RESEARCH ARTICLE DESIGN AND ANALYSIS OF RADIX-16 BOOTH PARTIAL PRODUCT GENERATOR FOR 64-BIT BINARY MULTIPLIERS K.Deepthi 1, Dr.T.Lalith Kumar 2 OPEN ACCESS 1 PG Scholar,Dept. Of ECE,Annamacharya Institute

More information

Run-Time Reconfigurable multi-precision floating point multiplier design based on pipelining technique using Karatsuba-Urdhva algorithms

Run-Time Reconfigurable multi-precision floating point multiplier design based on pipelining technique using Karatsuba-Urdhva algorithms Run-Time Reconfigurable multi-precision floating point multiplier design based on pipelining technique using Karatsuba-Urdhva algorithms 1 Shruthi K.H., 2 Rekha M.G. 1M.Tech, VLSI design and embedded system,

More information

Design and Implementation of Signed, Rounded and Truncated Multipliers using Modified Booth Algorithm for Dsp Systems.

Design and Implementation of Signed, Rounded and Truncated Multipliers using Modified Booth Algorithm for Dsp Systems. Design and Implementation of Signed, Rounded and Truncated Multipliers using Modified Booth Algorithm for Dsp Systems. K. Ram Prakash 1, A.V.Sanju 2 1 Professor, 2 PG scholar, Department of Electronics

More information

Design and Implementation of IEEE-754 Decimal Floating Point Adder, Subtractor and Multiplier

Design and Implementation of IEEE-754 Decimal Floating Point Adder, Subtractor and Multiplier International Journal of Engineering and Advanced Technology (IJEAT) ISSN: 2249 8958, Volume-4 Issue 1, October 2014 Design and Implementation of IEEE-754 Decimal Floating Point Adder, Subtractor and Multiplier

More information

VLSI Design Of a Novel Pre Encoding Multiplier Using DADDA Multiplier. Guntur(Dt),Pin:522017

VLSI Design Of a Novel Pre Encoding Multiplier Using DADDA Multiplier. Guntur(Dt),Pin:522017 VLSI Design Of a Novel Pre Encoding Multiplier Using DADDA Multiplier 1 Katakam Hemalatha,(M.Tech),Email Id: hema.spark2011@gmail.com 2 Kundurthi Ravi Kumar, M.Tech,Email Id: kundurthi.ravikumar@gmail.com

More information

A Novel Efficient VLSI Architecture for IEEE 754 Floating point multiplier using Modified CSA

A Novel Efficient VLSI Architecture for IEEE 754 Floating point multiplier using Modified CSA RESEARCH ARTICLE OPEN ACCESS A Novel Efficient VLSI Architecture for IEEE 754 Floating point multiplier using Nishi Pandey, Virendra Singh Sagar Institute of Research & Technology Bhopal Abstract Due to

More information

CHW 261: Logic Design

CHW 261: Logic Design CHW 261: Logic Design Instructors: Prof. Hala Zayed Dr. Ahmed Shalaby http://www.bu.edu.eg/staff/halazayed14 http://bu.edu.eg/staff/ahmedshalaby14# Slide 1 Slide 2 Slide 3 Digital Fundamentals CHAPTER

More information

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 3. Arithmetic for Computers Implementation

COMPUTER ORGANIZATION AND DESIGN. 5 th Edition. The Hardware/Software Interface. Chapter 3. Arithmetic for Computers Implementation COMPUTER ORGANIZATION AND DESIGN The Hardware/Software Interface 5 th Edition Chapter 3 Arithmetic for Computers Implementation Today Review representations (252/352 recap) Floating point Addition: Ripple

More information

CPE300: Digital System Architecture and Design

CPE300: Digital System Architecture and Design CPE300: Digital System Architecture and Design Fall 2011 MW 17:30-18:45 CBC C316 Arithmetic Unit 10122011 http://www.egr.unlv.edu/~b1morris/cpe300/ 2 Outline Recap Fixed Point Arithmetic Addition/Subtraction

More information

PROJECT REPORT IMPLEMENTATION OF LOGARITHM COMPUTATION DEVICE AS PART OF VLSI TOOLS COURSE

PROJECT REPORT IMPLEMENTATION OF LOGARITHM COMPUTATION DEVICE AS PART OF VLSI TOOLS COURSE PROJECT REPORT ON IMPLEMENTATION OF LOGARITHM COMPUTATION DEVICE AS PART OF VLSI TOOLS COURSE Project Guide Prof Ravindra Jayanti By Mukund UG3 (ECE) 200630022 Introduction The project was implemented

More information

Fused Floating Point Three Term Adder Using Brent-Kung Adder

Fused Floating Point Three Term Adder Using Brent-Kung Adder P P IJISET - International Journal of Innovative Science, Engineering & Technology, Vol. 2 Issue 9, September 205. Fused Floating Point Three Term Adder Using Brent-Kung Adder 2 Ms. Neena Aniee JohnP P

More information

Floating Point. The World is Not Just Integers. Programming languages support numbers with fraction

Floating Point. The World is Not Just Integers. Programming languages support numbers with fraction 1 Floating Point The World is Not Just Integers Programming languages support numbers with fraction Called floating-point numbers Examples: 3.14159265 (π) 2.71828 (e) 0.000000001 or 1.0 10 9 (seconds in

More information

An FPGA based Implementation of Floating-point Multiplier

An FPGA based Implementation of Floating-point Multiplier An FPGA based Implementation of Floating-point Multiplier L. Rajesh, Prashant.V. Joshi and Dr.S.S. Manvi Abstract In this paper we describe the parameterization, implementation and evaluation of floating-point

More information

A High Speed Binary Floating Point Multiplier Using Dadda Algorithm

A High Speed Binary Floating Point Multiplier Using Dadda Algorithm 455 A High Speed Binary Floating Point Multiplier Using Dadda Algorithm B. Jeevan, Asst. Professor, Dept. of E&IE, KITS, Warangal. jeevanbs776@gmail.com S. Narender, M.Tech (VLSI&ES), KITS, Warangal. narender.s446@gmail.com

More information

Number Systems and Computer Arithmetic

Number Systems and Computer Arithmetic Number Systems and Computer Arithmetic Counting to four billion two fingers at a time What do all those bits mean now? bits (011011011100010...01) instruction R-format I-format... integer data number text

More information

Design of a Floating-Point Fused Add-Subtract Unit Using Verilog

Design of a Floating-Point Fused Add-Subtract Unit Using Verilog International Journal of Electronics and Computer Science Engineering 1007 Available Online at www.ijecse.org ISSN- 2277-1956 Design of a Floating-Point Fused Add-Subtract Unit Using Verilog Mayank Sharma,

More information

Binary Floating Point Fused Multiply Add Unit

Binary Floating Point Fused Multiply Add Unit Binary Floating Point Fused Multiply Add Unit by Eng. Walaa Abd El Aziz Ibrahim A Thesis Submitted to the Faculty of engineering at Cairo University in partial Fulfillment of the Requirement for the Degree

More information

International Journal of Research in Computer and Communication Technology, Vol 4, Issue 11, November- 2015

International Journal of Research in Computer and Communication Technology, Vol 4, Issue 11, November- 2015 Design of Dadda Algorithm based Floating Point Multiplier A. Bhanu Swetha. PG.Scholar: M.Tech(VLSISD), Department of ECE, BVCITS, Batlapalem. E.mail:swetha.appari@gmail.com V.Ramoji, Asst.Professor, Department

More information

Systolic Super Summation with Reduced Hardware

Systolic Super Summation with Reduced Hardware Systolic Super Summation with Reduced Hardware Willard L. Miranker Mathematical Sciences Department IBM T.J. Watson Research Center Route 134 & Kitichwan Road Yorktown Heights, NY 10598 Abstract A principal

More information

32-bit Signed and Unsigned Advanced Modified Booth Multiplication using Radix-4 Encoding Algorithm

32-bit Signed and Unsigned Advanced Modified Booth Multiplication using Radix-4 Encoding Algorithm 2016 IJSRSET Volume 2 Issue 3 Print ISSN : 2395-1990 Online ISSN : 2394-4099 Themed Section: Engineering and Technology 32-bit Signed and Unsigned Advanced Modified Booth Multiplication using Radix-4 Encoding

More information

EE878 Special Topics in VLSI. Computer Arithmetic for Digital Signal Processing

EE878 Special Topics in VLSI. Computer Arithmetic for Digital Signal Processing EE878 Special Topics in VLSI Computer Arithmetic for Digital Signal Processing Part 6b High-Speed Multiplication - II Spring 2017 Koren Part.6b.1 Accumulating the Partial Products After generating partial

More information

Digital Fundamentals

Digital Fundamentals Digital Fundamentals Tenth Edition Floyd Chapter 2 2009 Pearson Education, Upper 2008 Pearson Saddle River, Education NJ 07458. All Rights Reserved Decimal Numbers The position of each digit in a weighted

More information

CS6303 COMPUTER ARCHITECTURE LESSION NOTES UNIT II ARITHMETIC OPERATIONS ALU In computing an arithmetic logic unit (ALU) is a digital circuit that performs arithmetic and logical operations. The ALU is

More information

HIGH SPEED SINGLE PRECISION FLOATING POINT UNIT IMPLEMENTATION USING VERILOG

HIGH SPEED SINGLE PRECISION FLOATING POINT UNIT IMPLEMENTATION USING VERILOG HIGH SPEED SINGLE PRECISION FLOATING POINT UNIT IMPLEMENTATION USING VERILOG 1 C.RAMI REDDY, 2 O.HOMA KESAV, 3 A.MAHESWARA REDDY 1 PG Scholar, Dept of ECE, AITS, Kadapa, AP-INDIA. 2 Asst Prof, Dept of

More information

An FPGA Based Floating Point Arithmetic Unit Using Verilog

An FPGA Based Floating Point Arithmetic Unit Using Verilog An FPGA Based Floating Point Arithmetic Unit Using Verilog T. Ramesh 1 G. Koteshwar Rao 2 1PG Scholar, Vaagdevi College of Engineering, Telangana. 2Assistant Professor, Vaagdevi College of Engineering,

More information

A Review on Optimizing Efficiency of Fixed Point Multiplication using Modified Booth s Algorithm

A Review on Optimizing Efficiency of Fixed Point Multiplication using Modified Booth s Algorithm A Review on Optimizing Efficiency of Fixed Point Multiplication using Modified Booth s Algorithm Mahendra R. Bhongade, Manas M. Ramteke, Vijay G. Roy Author Details Mahendra R. Bhongade, Department of

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK DESIGN AND VERIFICATION OF FAST 32 BIT BINARY FLOATING POINT MULTIPLIER BY INCREASING

More information

Figurel. TEEE-754 double precision floating point format. Keywords- Double precision, Floating point, Multiplier,FPGA,IEEE-754.

Figurel. TEEE-754 double precision floating point format. Keywords- Double precision, Floating point, Multiplier,FPGA,IEEE-754. AN FPGA BASED HIGH SPEED DOUBLE PRECISION FLOATING POINT MULTIPLIER USING VERILOG N.GIRIPRASAD (1), K.MADHAVA RAO (2) VLSI System Design,Tudi Ramireddy Institute of Technology & Sciences (1) Asst.Prof.,

More information

ARCHITECTURAL DESIGN OF 8 BIT FLOATING POINT MULTIPLICATION UNIT

ARCHITECTURAL DESIGN OF 8 BIT FLOATING POINT MULTIPLICATION UNIT ARCHITECTURAL DESIGN OF 8 BIT FLOATING POINT MULTIPLICATION UNIT Usha S. 1 and Vijaya Kumar V. 2 1 VLSI Design, Sathyabama University, Chennai, India 2 Department of Electronics and Communication Engineering,

More information

Implementation of Double Precision Floating Point Multiplier in VHDL

Implementation of Double Precision Floating Point Multiplier in VHDL ISSN (O): 2349-7084 International Journal of Computer Engineering In Research Trends Available online at: www.ijcert.org Implementation of Double Precision Floating Point Multiplier in VHDL 1 SUNKARA YAMUNA

More information

Double Precision IEEE-754 Floating-Point Adder Design Based on FPGA

Double Precision IEEE-754 Floating-Point Adder Design Based on FPGA Double Precision IEEE-754 Floating-Point Adder Design Based on FPGA Adarsha KM 1, Ashwini SS 2, Dr. MZ Kurian 3 PG Student [VLS& ES], Dept. of ECE, Sri Siddhartha Institute of Technology, Tumkur, Karnataka,

More information

EC2303-COMPUTER ARCHITECTURE AND ORGANIZATION

EC2303-COMPUTER ARCHITECTURE AND ORGANIZATION EC2303-COMPUTER ARCHITECTURE AND ORGANIZATION QUESTION BANK UNIT-II 1. What are the disadvantages in using a ripple carry adder? (NOV/DEC 2006) The main disadvantage using ripple carry adder is time delay.

More information

AN EFFICIENT FLOATING-POINT MULTIPLIER DESIGN USING COMBINED BOOTH AND DADDA ALGORITHMS

AN EFFICIENT FLOATING-POINT MULTIPLIER DESIGN USING COMBINED BOOTH AND DADDA ALGORITHMS AN EFFICIENT FLOATING-POINT MULTIPLIER DESIGN USING COMBINED BOOTH AND DADDA ALGORITHMS 1 DHANABAL R, BHARATHI V, 3 NAAMATHEERTHAM R SAMHITHA, 4 PAVITHRA S, 5 PRATHIBA S, 6 JISHIA EUGINE 1 Asst Prof. (Senior

More information

Floating Point Square Root under HUB Format

Floating Point Square Root under HUB Format Floating Point Square Root under HUB Format Julio Villalba-Moreno Dept. of Computer Architecture University of Malaga Malaga, SPAIN jvillalba@uma.es Javier Hormigo Dept. of Computer Architecture University

More information

Assistant Professor, PICT, Pune, Maharashtra, India

Assistant Professor, PICT, Pune, Maharashtra, India Volume 6, Issue 1, January 2016 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Low Power High

More information

Architecture and Design of Generic IEEE-754 Based Floating Point Adder, Subtractor and Multiplier

Architecture and Design of Generic IEEE-754 Based Floating Point Adder, Subtractor and Multiplier Architecture and Design of Generic IEEE-754 Based Floating Point Adder, Subtractor and Multiplier Sahdev D. Kanjariya VLSI & Embedded Systems Design Gujarat Technological University PG School Ahmedabad,

More information

An Efficient Implementation of Floating Point Multiplier

An Efficient Implementation of Floating Point Multiplier An Efficient Implementation of Floating Point Multiplier Mohamed Al-Ashrafy Mentor Graphics Mohamed_Samy@Mentor.com Ashraf Salem Mentor Graphics Ashraf_Salem@Mentor.com Wagdy Anis Communications and Electronics

More information

Double Precision Floating-Point Arithmetic on FPGAs

Double Precision Floating-Point Arithmetic on FPGAs MITSUBISHI ELECTRIC ITE VI-Lab Title: Double Precision Floating-Point Arithmetic on FPGAs Internal Reference: Publication Date: VIL04-D098 Author: S. Paschalakis, P. Lee Rev. A Dec. 2003 Reference: Paschalakis,

More information

An Effective Implementation of Dual Path Fused Floating-Point Add-Subtract Unit for Reconfigurable Architectures

An Effective Implementation of Dual Path Fused Floating-Point Add-Subtract Unit for Reconfigurable Architectures Received: December 20, 2016 40 An Effective Implementation of Path Fused Floating-Point Add-Subtract Unit for Reconfigurable Architectures Anitha Arumalla 1 * Madhavi Latha Makkena 2 1 Velagapudi Ramakrishna

More information

THE DESIGN OF AN IC HALF PRECISION FLOATING POINT ARITHMETIC LOGIC UNIT

THE DESIGN OF AN IC HALF PRECISION FLOATING POINT ARITHMETIC LOGIC UNIT Clemson University TigerPrints All Theses Theses 12-2009 THE DESIGN OF AN IC HALF PRECISION FLOATING POINT ARITHMETIC LOGIC UNIT Balaji Kannan Clemson University, balaji.n.kannan@gmail.com Follow this

More information

Low Power Floating-Point Multiplier Based On Vedic Mathematics

Low Power Floating-Point Multiplier Based On Vedic Mathematics Low Power Floating-Point Multiplier Based On Vedic Mathematics K.Prashant Gokul, M.E(VLSI Design), Sri Ramanujar Engineering College, Chennai Prof.S.Murugeswari., Supervisor,Prof.&Head,ECE.,SREC.,Chennai-600

More information

FPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE Standard

FPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE Standard FPGA Implementation of Multiplier for Floating- Point Numbers Based on IEEE 754-2008 Standard M. Shyamsi, M. I. Ibrahimy, S. M. A. Motakabber and M. R. Ahsan Dept. of Electrical and Computer Engineering

More information

OPTIMIZING THE POWER USING FUSED ADD MULTIPLIER

OPTIMIZING THE POWER USING FUSED ADD MULTIPLIER Available Online at www.ijcsmc.com International Journal of Computer Science and Mobile Computing A Monthly Journal of Computer Science and Information Technology IJCSMC, Vol. 3, Issue. 11, November 2014,

More information

An Efficient Fused Add Multiplier With MWT Multiplier And Spanning Tree Adder

An Efficient Fused Add Multiplier With MWT Multiplier And Spanning Tree Adder An Efficient Fused Add Multiplier With MWT Multiplier And Spanning Tree Adder 1.M.Megha,M.Tech (VLSI&ES),2. Nataraj, M.Tech (VLSI&ES), Assistant Professor, 1,2. ECE Department,ST.MARY S College of Engineering

More information

HIGH PERFORMANCE FUSED ADD MULTIPLY OPERATOR

HIGH PERFORMANCE FUSED ADD MULTIPLY OPERATOR HIGH PERFORMANCE FUSED ADD MULTIPLY OPERATOR R. Alwin [1] S. Anbu Vallal [2] I. Angel [3] B. Benhar Silvan [4] V. Jai Ganesh [5] 1 Assistant Professor, 2,3,4,5 Student Members Department of Electronics

More information

VHDL implementation of 32-bit floating point unit (FPU)

VHDL implementation of 32-bit floating point unit (FPU) VHDL implementation of 32-bit floating point unit (FPU) Nikhil Arora Govindam Sharma Sachin Kumar M.Tech student M.Tech student M.Tech student YMCA, Faridabad YMCA, Faridabad YMCA, Faridabad Abstract The

More information

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666

UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering. Digital Computer Arithmetic ECE 666 UNIVERSITY OF MASSACHUSETTS Dept. of Electrical & Computer Engineering Digital Computer Arithmetic ECE 666 Part 6b High-Speed Multiplication - II Israel Koren ECE666/Koren Part.6b.1 Accumulating the Partial

More information

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: Bits and Bytes and Numbers

Computer Science 324 Computer Architecture Mount Holyoke College Fall Topic Notes: Bits and Bytes and Numbers Computer Science 324 Computer Architecture Mount Holyoke College Fall 2007 Topic Notes: Bits and Bytes and Numbers Number Systems Much of this is review, given the 221 prerequisite Question: how high can

More information

VHDL IMPLEMENTATION OF FLOATING POINT MULTIPLIER USING VEDIC MATHEMATICS
