H5H4, H5E7 lecture 5 Fixed point arithmetic. Overview


H5H4, H5E7 lecture 5: Fixed point arithmetic
I. Verbauwhede
Acknowledgements: H. DeMan, V. Öwall, D. Hwang, K.U.Leuven

Overview
- Lecture 1: what is a system-on-chip
- Lecture 2: terminology for the different steps
- Lecture 3: models of computation, SDFG
- Lecture 4: control flow
- Lecture 5 (today): fixed point refinement

H5H4 goal: Skiing down a mountain
[Figure: design flow descending from specification to implementation. Stages: Specification (SPW, Matlab, C++); Algorithm Transformations; Memory Transformations and Optimizations; Floating-point to Fixed-point. Annotations: pipelining, unrolling; loop merging, compaction; 40 bit accumulator. Implementation targets: ASIC, Special Purpose, Retargetable coprocessor, DSP processor, DSP-RISC, RISC.]

References
- P. Lapsley, et al., "DSP Processor Fundamentals: Architectures and Features", IEEE Press, 1997, Chapter 3.
- W. Sung, K. Kum, "Simulation-based Word-Length Optimization Method for Fixed-point Digital Signal Processing Systems", IEEE Trans. on Signal Processing, Vol. 43, No. 12, Dec. 1995.
- Viktor Öwall, Dept. of Electroscience, Lund, Sweden.
- M. Ercegovac, T. Lang, "Digital Arithmetic", Morgan Kaufmann Publishers, 2004.
- Fridge project.

Finite word lengths: a must for DSP
- Floating-point: powerful, but expensive (storage & ops): 3 bytes (mantissa) + 1 byte (exponent)
- DSP applications: high speed, minimum area, low power
- Hence: fixed-point refinement
[Figure: multiplier annotated with operand and result word lengths]

Consequences of Bad Use of Approximations
Example: Failure of Patriot Missile (1991 Feb. 25)
- An American Patriot Missile battery in Dhahran, Saudi Arabia, failed to intercept an incoming Iraqi Scud missile. The Scud struck an American Army barracks, killing 28.
- Cause, per GAO/IMTEC-92-26 report: software problem (inaccurate calculation of the time since boot).
- Specifics of the problem: time in tenths of a second, as measured by the system's internal clock, was multiplied by 1/10 to get the time in seconds. Internal registers were 24 bits wide.
- 1/10 = 0.00011001100110011001100 (chopped to 24 b), an error of about $9.5 \times 10^{-8}$ per tenth of a second.
- Error in 100-hr operation period $\approx 9.5 \times 10^{-8} \times 100 \times 60 \times 60 \times 10 = 0.34$ s
- Distance traveled by Scud = (0.34 s) × (1676 m/s) ≈ 570 m
- This put the Scud outside the Patriot's range gate.
- Ironically, the fact that the bad time calculation had been improved in some (but not all) code parts contributed to the problem, since it meant that the inaccuracies did not cancel out.
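The Patriot numbers can be re-derived with exact rational arithmetic. The following is a minimal sketch, assuming that 1/10 was stored with 23 fractional bits inside the 24-bit register (the usual reading of the incident analysis, not something this lecture states); the names `chop`, `FRAC_BITS` and `miss_distance` are illustrative only.

```python
from fractions import Fraction

FRAC_BITS = 23  # assumption: fractional bits available for 1/10 in the 24-bit register


def chop(value: Fraction, frac_bits: int) -> Fraction:
    """Truncate a non-negative value to frac_bits binary fractional digits."""
    scale = 1 << frac_bits
    return Fraction(int(value * scale), scale)  # int() discards the fraction


tenth = Fraction(1, 10)
error_per_tick = tenth - chop(tenth, FRAC_BITS)   # error made on every clock tick
ticks = 100 * 60 * 60 * 10                        # tenths of a second in 100 hours
time_error = float(error_per_tick * ticks)        # accumulated clock error in seconds
miss_distance = time_error * 1676                 # Scud speed in m/s, from the slide

print(f"error per tick:        {float(error_per_tick):.3e}")
print(f"clock error after 100h: {time_error:.3f} s")
print(f"tracking error:        {miss_distance:.0f} m")
```

The per-tick error is tiny; it only becomes harmful because it is accumulated over 3.6 million ticks without any correction.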

Consequences of Bad Approximations
Example: Explosion of Ariane Rocket (1996 June 4)
- Unmanned Ariane 5 rocket launched by the European Space Agency veered off its flight path, broke up, and exploded only 30 seconds after lift-off (altitude of 3700 m).
- The $500 million rocket (with cargo) was on its 1st voyage after a decade of development costing $7 billion.
- Cause: software error in the inertial reference system (SRI).
- Specifics of the problem: a 64 bit floating point number relating to the horizontal velocity of the rocket was being converted to a 16 bit signed integer.
- An SRI software exception arose during conversion because the 64-bit floating point number had a value greater than what could be represented by a 16-bit signed integer (max 32767).

Outline
- Number representation
- Location of decimal point
- Precision
- Dynamic range
- Truncation, rounding
- Overflow

Binary numbers, unsigned integers
- MSB = Most Significant Bit, LSB = Least Significant Bit
- N bits give $2^N$ levels, e.g. for 3 bits: 000 (0), 001 (1), 010 (2), 011 (3), 100 (4), 101 (5), 110 (6), 111 (7)
[V. Öwall]

Dynamic range and Resolution
[Table: number of bits, number of levels, resolution for a fixed full-scale voltage (V_fs = 0.5 V), and dynamic range for a fixed LSB voltage; the individual entries are garbled in the source.]
How do we use the bits? Depends on the application!
[V. Öwall]

Number Representation
- Unsigned numbers
- Signed digit numbers
  - Sign magnitude
  - One's complement
  - Two's complement
- Notation: <W,L> with W = K + L
  - W = wordlength
  - L = number of bits behind the decimal (or binary) point
  - K = number of integer bits

Signed-Digit Representations
1) Signed-Magnitude: redundant
2) Biased: non-redundant
3) Complement
   A) Radix Complement (r=2: two's complement): non-redundant
   B) Digit Complement or Diminished-Radix Complement (r=2: one's complement): redundant
- Redundant: two representations for the same number
- Non-redundant: each representation is a different number

Sign Magnitude
- Unsigned numbers with a sign bit
- Two zeros
- + Low power?
[Figure: signed-magnitude number circle] [V. Öwall]

One's Complement
- Signed numbers by inverting (complementing) all bits
- Two zeros
- + Easy to convert to negative
[Figure: one's complement number circle] [V. Öwall]

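Two's Complement
- Most widely used fixed point numbering system
- Negation: complement + LSB (invert all bits, then add one LSB)
- One zero
- + Easy addition
- - Not so easy to convert to negative
[Figure: two's complement number circle] [V. Öwall]

[Table: bit patterns compared across signed-magnitude, biased, two's complement and one's complement; entries not preserved in the source.]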
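The three signed representations above can be compared by encoding the same value in each. This is a small sketch with hypothetical helper names; each function returns the raw W-bit pattern as a non-negative Python integer and assumes the value fits the chosen word length.

```python
def sign_magnitude(value: int, w: int) -> int:
    """Sign bit followed by the magnitude in the remaining w-1 bits."""
    sign = 1 << (w - 1) if value < 0 else 0
    return sign | abs(value)


def ones_complement(value: int, w: int) -> int:
    """Negative numbers are formed by inverting every bit of the magnitude."""
    mask = (1 << w) - 1
    return value if value >= 0 else ~(-value) & mask


def twos_complement(value: int, w: int) -> int:
    """Negative numbers are formed by inverting every bit and adding one LSB."""
    return value & ((1 << w) - 1)


for name, encode in [("sign-magnitude", sign_magnitude),
                     ("one's complement", ones_complement),
                     ("two's complement", twos_complement)]:
    print(f"{name:17s} +3 = {encode(3, 4):04b}   -3 = {encode(-3, 4):04b}")
```

Positive numbers are identical in all three formats; only the encoding of negative numbers differs.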

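Position of decimal point
- Format <W,L>: total number of bits W, fractional bits L
[Figure: bit positions from MSB down to LSB for a <W,L> word, with the value representation given as a weighted sum of the bits for two's complement and for unsigned numbers; the formula itself is not recoverable from the source.]

How do you store this decimal point? Fixed point for DSP processors
- Simple binary integer (two's complement): sign bit in the MSB, binary point to the right of the LSB, format <W,0>
- Simple binary fractional representation: sign bit in the MSB, binary point directly after the sign bit, format <W,W-1>, values in [-1,1[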
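Since the binary point is not stored anywhere, the same bit pattern means different values under different <W,L> formats. A minimal sketch of that interpretation step, using a hypothetical helper `fxp_value`:

```python
def fxp_value(raw: int, w: int, l: int, signed: bool = True) -> float:
    """Interpret the w-bit pattern raw as a <w,l> fixed-point number."""
    raw &= (1 << w) - 1                 # keep only the w bits of the word
    if signed and raw >> (w - 1):       # sign bit set: two's complement negative
        raw -= 1 << w
    return raw / (1 << l)               # binary point sits l bits from the LSB


pattern = 0b1100
print(fxp_value(pattern, 4, 0))                # <4,0> integer:     -4.0
print(fxp_value(pattern, 4, 3))                # <4,3> fractional:  -0.5
print(fxp_value(pattern, 4, 0, signed=False))  # unsigned integer:  12.0
```

The hardware performs the same integer operations in all three cases; only the designer's bookkeeping of L changes.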

Mantissa representation (floating point)
- Mantissa: e.g. 24 bit
  - One sign bit
  - Leading mantissa bit = 1 (always!), so the mantissa lies in [-1, -2] and [+1, +2]
- Exponent: e.g. 8 bit
- Value = mantissa × 2^exponent
[Figure: word layout with sign bit, MSB and LSB positions for a <W,L> format]

Precision
- Quantization error = error when a longer numeric format is converted to a shorter one, e.g. when a decimal number is rounded to fewer digits
- Maximum precision (in bits) = $\log_2(\text{maximum value} / \text{maximum quantization error})$
- E.g. 16 bit fractional representation: maximum value ≈ 1, maximum error = $2^{-16}$ (with rounding), so maximum precision = 16 bits
- Importance of scaling!!

Dynamic range
- Dynamic range = largest number / smallest (nonzero) number in a given data format
- E.g. 32 bit fractional value: ratio = $(1 - 2^{-31}) / 2^{-31} = 2^{31} - 1 \approx 2.15 \times 10^{9}$, i.e. about 187 dB
- Telecom: 50 dB; high end audio: 90 dB +
- DSP processors provide a few more bits than the dynamic range requires
- Scaling!!

Rounding
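The dynamic range figure quoted above for the fractional format can be checked numerically. A short sketch, assuming a two's complement fractional format <W,W-1> and the usual 20·log10 amplitude convention for decibels; `dynamic_range_db` is an illustrative name.

```python
import math


def dynamic_range_db(w: int) -> float:
    """Dynamic range in dB of a <w, w-1> two's complement fractional format."""
    largest = 1 - 2.0 ** -(w - 1)    # largest positive value
    smallest = 2.0 ** -(w - 1)       # smallest positive value (one LSB)
    return 20 * math.log10(largest / smallest)


for bits in (16, 24, 32):
    print(f"{bits} bits: {dynamic_range_db(bits):.1f} dB")
```

Each extra bit adds roughly 6 dB, which is why a 16 bit word just about covers the 90 dB quoted for high end audio.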

How do we quantize?
[Figure: four quantization characteristics, fixed-point output (fxp) versus floating-point input (flp): floor (truncation in 2's complement; cheap, nasty), magnitude truncate (truncation in sign-magnitude; unusual), ceil, and round (best, expensive).]

Rounding
- Rounding occurs when we want to approximate a more precise number (i.e. more fractional bits L) with a less precise number (i.e. fewer fractional bits L')
- Example 1 (down): old (K=6, L=8), new (K'=6, L' < L)
- Example 2 (up): old (K=6, L=8), new (K'=6, L'=0)
- The following slides show rounding from L > 0 fractional bits to L' = 0 bits, but the mathematics hold true for any L' < L
- Usually, keep the number of integral bits the same: K' = K

Rounding Equation
- Whole part and fractional part of x are mapped onto a whole number y:
  $x_{k-1} x_{k-2} \ldots x_1 x_0 \,.\, x_{-1} x_{-2} \ldots x_{-l} \;\xrightarrow{\text{round}}\; y_{k-1} y_{k-2} \ldots y_1 y_0$
- $y = \mathrm{round}(x)$

Rounding Techniques
Different rounding techniques:
1) truncation
   - results in round towards zero in signed magnitude
   - results in round towards $-\infty$ in two's complement
2) round to nearest number
3) round to nearest even number (or odd number)
4) round towards $+\infty$
Other rounding techniques:
5) jamming or von Neumann
6) ROM rounding
Each will differ in its error depending on the representation of numbers, i.e. signed magnitude versus two's complement.
Error = round(x) - x

1) Truncation
- The simplest possible rounding scheme: chopping or truncation
  $x_{k-1} x_{k-2} \ldots x_1 x_0 \,.\, x_{-1} x_{-2} \ldots x_{-l} \;\xrightarrow{\text{trunc}}\; x_{k-1} x_{k-2} \ldots x_1 x_0$
- Truncation in signed-magnitude results in a number chop(x) that is never larger in magnitude than x. This is called round towards zero or inward rounding.
  - (3.5) → (3): Error = -0.5
  - (-3.5) → (-3): Error = +0.5
- Truncation in two's complement results in a number chop(x) that is never larger than x. This is called round towards $-\infty$ or downward-directed rounding.
  - (3.5) → (3): Error = -0.5
  - (-3.5) → (-4): Error = -0.5

Truncation Function Graph: chop(x)
[Fig.: Truncation or chopping of a signed-magnitude number (same as round toward 0).]
[Fig.: Truncation or chopping of a 2's-complement number (same as round to $-\infty$).]

Bias in two's complement truncation
[Table: X (binary), X (decimal), chop(x) (binary), chop(x) (decimal), Error (decimal); entries not preserved in the source.]
- Assuming all combinations of positive and negative values of x are equally possible, the average error is negative, i.e. truncation is biased
- In general, average error = $-(2^{-L'} - 2^{-L})/2$, where L' = new number of fractional bits

Implementation of truncation in hardware
- Easy: just ignore (i.e. truncate) the fractional digits from position L'+1 to L
  $x_{k-1} x_{k-2} \ldots x_1 x_0 \,.\, x_{-1} x_{-2} \ldots x_{-L} \;\Rightarrow\; y_{k-1} y_{k-2} \ldots y_1 y_0 \,.\, y_{-1} \ldots y_{-L'}$ (ignore the rest)
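The bias formula for two's complement truncation can be verified by brute force over all bit patterns. A sketch under the slide's assumption that every value is equally likely; numbers are held as integers scaled by 2^L, and the helper name is illustrative.

```python
from fractions import Fraction


def average_truncation_error(l: int, l_new: int) -> Fraction:
    """Mean of chop(x) - x when going from l to l_new fractional bits."""
    drop = l - l_new
    total = Fraction(0)
    values = range(-(1 << (l + 2)), 1 << (l + 2))   # whole periods of the dropped bits
    for raw in values:
        x = Fraction(raw, 1 << l)
        chopped = Fraction(raw >> drop, 1 << l_new)  # arithmetic shift = round to -inf
        total += chopped - x
    return total / len(values)


for l, l_new in [(2, 0), (8, 2)]:
    measured = average_truncation_error(l, l_new)
    predicted = -(Fraction(1, 1 << l_new) - Fraction(1, 1 << l)) / 2
    print(f"L={l}, L'={l_new}: measured {measured}, formula {predicted}")
```

For L=2 and L'=0 both values come out as -3/8: the four possible dropped fractions 0, 1/4, 1/2 and 3/4 average to 3/8, always in the negative direction.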

2) Round to nearest number
- Rounding to the nearest number is what we normally think of when we say "round"
- rtn in two's complement:
  - (2.5) → (3): Error = +0.5
  - (-2.5) → (-2): Error = +0.5

Round to Nearest Function Graph: rtn(x)
[Figure: staircase plot of rtn(x) versus x]

Bias in two's complement round to nearest
[Table: X (binary), X (decimal), rtn(x) (binary), rtn(x) (decimal), Error (decimal); entries not preserved in the source.]
- With all combinations of positive and negative values of x equally possible, the average error is smaller than for truncation, but the error is still not symmetric
- We have a problem with the midway value: exactly at 2.5 or -2.5 the error is always positive, which leads to a positive bias
- Overflow problem if only K' = K integral bits are allocated. Example: rtn(011.10) → overflow
- This overflow only occurs on positive numbers near the maximum positive value, not on negative numbers

Truncation and rounding
- Truncation: cheapest, but introduces bias
  - E.g. use <4,0>: 3.5 truncates to 0011 = 3; -3.5 truncates to 1100 = -4
  - Always a smaller number
- Rounding: round to the nearest
  - Simple hardware trick: add 1/2 of the smallest number (half an LSB of the target format) and truncate
  - E.g. use <4,0>: 3.5 rounds to 4; -3.5 rounds to -3
- How in hardware?
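The "add half and truncate" trick and its overflow hazard can be sketched on integers scaled by 2^L. The helper names `round_to_nearest` and `fits` are hypothetical, and the functions assume at least one bit is dropped.

```python
def round_to_nearest(raw: int, l: int, l_new: int) -> int:
    """Round a value scaled by 2**l to l_new fractional bits: add half an LSB, then truncate."""
    drop = l - l_new                      # number of discarded bits, must be >= 1
    return (raw + (1 << (drop - 1))) >> drop


def fits(value: int, k: int) -> bool:
    """True if value is representable in k two's complement integer bits."""
    return -(1 << (k - 1)) <= value < (1 << (k - 1))


# Values are given with one fractional bit (L=1), so 3.5 is stored as 7.
for raw in (7, -7, 5, -5):
    rounded = round_to_nearest(raw, 1, 0)
    print(f"{raw / 2:+.1f} -> {rounded:+d}  fits in K=3 bits: {fits(rounded, 3)}")
```

Rounding 3.5 produces 4, which no longer fits in K=3 integer bits (range -4 to 3); this is the rtn(011.10) overflow case from the slide.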

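Rounding
- Rounding to the nearest: still bias for numbers exactly half way
- More expensive: convergent rounding
[Figure: 8-bit input word a7 (sign bit) ... a0 rounded to a 4-bit output word b3 (sign bit) ... b0]
- If a3:a0 > 1000: b3:b0 = a7:a4 + a3 (a3 = 1, round up)
- If a3:a0 < 1000: b3:b0 = a7:a4 + a3 (a3 = 0, round down)
- If a3:a0 = 1000: b3:b0 = a7:a4 + a4 (exactly half way: round to the even neighbour)

Overflow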
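The convergent rounding rule can be written directly from the three cases on the slide. This is a sketch on signed Python integers rather than a hardware description; the function name and the generic `drop` parameter (4 on the slide) are assumptions for illustration.

```python
def convergent_round(raw: int, drop: int) -> int:
    """Drop the lowest `drop` bits, rounding ties to the even neighbour."""
    high = raw >> drop                   # kept bits (a7:a4 on the slide)
    low = raw & ((1 << drop) - 1)        # discarded bits (a3:a0 on the slide)
    half = 1 << (drop - 1)               # the pattern 1000...0
    if low > half:
        increment = 1                    # above half way: round up
    elif low < half:
        increment = 0                    # below half way: round down
    else:
        increment = high & 1             # tie: add the LSB of the kept part (a4)
    return high + increment


# Inputs have 4 fractional bits, so 2.5 is stored as 40 and 3.5 as 56.
for raw in (40, 56, -40, -56, 41):
    print(f"{raw / 16:+.4f} -> {convergent_round(raw, 4):+d}")
```

Ties now go up half of the time and down half of the time, so the positive bias of plain round to nearest disappears.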

What happens on an overflow?
[Figure: fixed-point output (fxp) versus floating-point input (flp) for wrap-around and for saturation at the max. value]

Adding Two's Complement Numbers: Ignoring Overflow
- Ignoring overflow, adding a K.L two's complement number to a K.L two's complement number results in a K.L number; the carry out $c_K$ is ignored
- Example with K=4: an addition whose true sum is 13.75 results in -2.25; must add $2^K = 16$ to get the correct result
- Example with K=4: an addition that wraps around to a positive result; must add $-2^K = -16$ to get the correct result
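The two overflow behaviours can be contrasted on raw integers. A minimal sketch with hypothetical helpers `wrap` and `saturate`; the example reuses the 13.75 case from the slide, stored with K=4 integer and L=2 fractional bits (W=6, so 13.75 is the integer 55).

```python
def wrap(value: int, w: int) -> int:
    """Keep the lowest w bits and reinterpret them as two's complement."""
    value &= (1 << w) - 1
    return value - (1 << w) if value >> (w - 1) else value


def saturate(value: int, w: int) -> int:
    """Clip to the most negative or most positive w-bit two's complement value."""
    lowest, highest = -(1 << (w - 1)), (1 << (w - 1)) - 1
    return max(lowest, min(highest, value))


W, L = 6, 2
true_sum = 55                                  # 13.75 scaled by 2**L
print("wrap-around:", wrap(true_sum, W) / 2 ** L)      # off by 2**K = 16
print("saturation: ", saturate(true_sum, W) / 2 ** L)  # stuck at the max. value
```

Wrap-around costs nothing in hardware but produces an error of a full 2^K; saturation needs extra logic but keeps the error as small as the format allows.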

Two's Complement Wraparound Property
- Temporary wraparounds are fine as long as the final value is in the correct dynamic range
- Example: add (-7) + (-7) + 7 with 4 bits
  - 1001 + 1001 = 0010: should be (-14), not (+2): wraparound/overflow
  - 0010 + 0111 = 1001: final result is correct, (-7)
- If the final result is guaranteed to be in the correct dynamic range [-8,+7], then intermediate wraparounds are fine

Adding Two's Complement Numbers: Avoiding or Detecting Overflow
- To avoid overflow, adding a <K+L,L> two's complement number to a <K+L,L> two's complement number results in a <K+L+1,L> number. To compute, sign extend the MSB and ignore $c_{K+1}$.
- If the result is confined to a <K+L,L> number, overflow detection is needed, which is $c_K \oplus c_{K-1}$ (carry out of the sign position XOR carry into it)
[Examples for K=4: bit patterns not preserved in the source.]
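Both the overflow flag and the wraparound property can be reproduced with a small adder model. This is a sketch, assuming plain W-bit words (the position of the binary point does not matter for the carries); `add_with_overflow` is an illustrative name.

```python
def add_with_overflow(a: int, b: int, w: int):
    """Add two w-bit two's complement numbers; return (wrapped sum, overflow flag)."""
    mask = (1 << w) - 1
    ua, ub = a & mask, b & mask
    low = mask >> 1                                         # all bits below the sign bit
    carry_in = ((ua & low) + (ub & low)) >> (w - 1)         # carry into the sign position
    carry_out = (ua + ub) >> w                              # carry out of the sign position
    raw = (ua + ub) & mask
    result = raw - (1 << w) if raw >> (w - 1) else raw
    return result, bool(carry_in ^ carry_out)


partial, ovf1 = add_with_overflow(-7, -7, 4)     # true sum -14 does not fit
final, ovf2 = add_with_overflow(partial, 7, 4)   # second wrap undoes the first
print(partial, ovf1)
print(final, ovf2)
```

The intermediate value is wrong and flagged, yet the final value is the correct -7, because all arithmetic is modulo 2^W and the true result lies inside [-8,+7].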

Subtracting Two's Complement Numbers: Ignoring Overflow
- Ignoring overflow, subtracting a <K+L,L> two's complement number from a <K+L,L> two's complement number results in a <K+L,L> number; $c_K$ is ignored
- Example: 0111 - 1000 = 1111
  - 7 - (-8) resulted in -1: a wraparound/overflow occurred
  - Must add $2^K = 2^4 = 16$ to get the correct value of +15
  - Again we see the modulo effect
- As with addition, temporary wraparounds are okay as long as the final result is in the correct dynamic range

Subtracting Two's Complement Numbers: Avoiding or Detecting Overflow
- To avoid overflow, subtracting a <K+L,L> two's complement number from a <K+L,L> two's complement number results in a <K+L+1,L> number; $c_{K+1}$ is ignored
- If the result is confined to a K.L number, overflow detection is needed, which is $c_K \oplus c_{K-1}$
[Examples: bit patterns not preserved in the source.]

Negating a Two's Complement Number
- Negating a K.L two's complement number usually only requires a K.L digit result
- The only exception is when you negate the largest negative number: then you need one extra integer bit, i.e. a (K+1).L digit result
  - -(0111) = 1001
  - -(1000) = 01000: need an extra bit to negate the largest negative number
- Again overflow detection is needed

Outline
- Number representation
- Location of decimal point
- Precision
- Dynamic range
- Truncation, rounding
- Overflow
- Now: what to do?

The Wordlength, i.e. nr of bits
[Figure: UMTS filter in FIR form: input x(n), delay elements D, coefficients h0, h1, h2, h3, output y(n); annotation "7bits float" as extracted.]
- Every extra bit costs energy/power, delay and area
- The word length has to be reduced
[V. Öwall]

The Wordlength, i.e. nr of bits
[Figure: the same FIR filter x(n), D, h0 ... h3, y(n)]
- The output of an adder needs an extra bit to be sure of no overflow, e.g. 11 + 01 = 100
- A multiplier with M and N bit inputs needs M+N bits for full precision
- Precision has to be limited
[V. Öwall]

During design: specify fixed-point formats for signals
[Figure: floating-point algorithm between the A/D interfaces of the system context; the word lengths at the interfaces (data and coefficients) are given, while every internal signal of the adders and multipliers still needs its W, L and Q (quantization mode).]

Fixed-point refinement: optimization problem
- Minimize overall cost: minimal word lengths, truncate and wrap-around
- MSB determination:
  - goal: avoid unwanted overflows
  - method: find min, max signal values
  - result: MSB position, value representation, overflow behaviour
- LSB determination:
  - goal: keep required precision
  - method: evaluate difference between flp and fxp behavior
  - result: LSB position, quantization
[Figure: safe range and quantization versus cost]

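1. MSB determination: range calculations
[Figure: data flow graph (input x, delay d, coefficient c, multiplier m, adder, output y) annotated with range information by the range calculation]
Analytical method
- Put a range (min, max) on inputs and states
- Propagate the range over the operators
- This gives a safe (pessimistic) estimate

Word length propagation
- Range propagation translates to word length growth
- E.g. two's complement integer addition, A and B represented by <W,0>:
  - A + B needs <W+1, 0>
  - A - B needs <W+1, 0>
- In general: A is represented by <W_A, L_A>, B by <W_B, L_B>; A + B needs <max(W_A, W_B) + 1, max(L_A, L_B)>
  - This holds when L_A = L_B. With different binary point positions the integer parts must be compared separately: with K = W - L, the sum needs max(K_A, K_B) + 1 integer bits and max(L_A, L_B) fractional bits.
- Gets more complicated for multiplication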
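Word length propagation over adders and multipliers can be sketched as simple format arithmetic. The `Fmt` type and both functions are hypothetical; the adder rule tracks integer bits K = W - L and fractional bits separately, and the multiplier rule is the full-precision M+N bit result mentioned earlier in the lecture.

```python
from typing import NamedTuple


class Fmt(NamedTuple):
    """Fixed-point format <w,l>: w total bits, l fractional bits."""
    w: int
    l: int

    @property
    def k(self) -> int:
        return self.w - self.l          # integer bits, including the sign bit


def add_format(a: Fmt, b: Fmt) -> Fmt:
    """Format that holds a + b (or a - b) without overflow or precision loss."""
    l = max(a.l, b.l)
    k = max(a.k, b.k) + 1               # one guard bit for the carry
    return Fmt(k + l, l)


def mul_format(a: Fmt, b: Fmt) -> Fmt:
    """Full-precision product format of two two's complement numbers."""
    return Fmt(a.w + b.w, a.l + b.l)


print(add_format(Fmt(8, 0), Fmt(8, 0)))   # two integers
print(add_format(Fmt(8, 0), Fmt(8, 7)))   # integer plus fractional number
print(mul_format(Fmt(8, 7), Fmt(8, 7)))   # fractional product
```

Applied blindly through a whole data flow graph, these rules grow every word length at every operator, which is why the result is safe but pessimistic.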

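Range calculations grow unbounded?
[Figure: filter with feedback (input X(n), output Y(n), delays z^-1, multipliers and adders); propagating the range through the feedback loop makes the computed min and max of the feedback signal F(n) grow with every iteration n.]

Alternative: collect signal statistics during simulations
[Figure: stimuli applied to the data flow graph (x, d, c, m, *, +, y), with min and max monitored on every signal]
- Perform a simulation with realistic stimuli
- Collect the minimum and maximum value on each signal during the simulation
- This gives an optimistic, stimuli dependent estimate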
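Collecting signal statistics during simulation can be sketched as follows. The system simulated here, y(n) = x(n) + 0.6·y(n-1), is an assumption chosen for illustration, as are the signal names and helper functions.

```python
def collect_ranges(stimuli, coeff=0.6):
    """Simulate y(n) = x(n) + coeff*y(n-1) and record (min, max) per signal."""
    ranges = {}

    def record(name, value):
        lowest, highest = ranges.get(name, (value, value))
        ranges[name] = (min(lowest, value), max(highest, value))

    y_prev = 0.0
    for x in stimuli:
        m = coeff * y_prev          # multiplier output
        y = x + m                   # adder output
        for name, value in (("x", x), ("m", m), ("y", y)):
            record(name, value)
        y_prev = y
    return ranges


def integer_bits(lowest, highest):
    """Smallest number of two's complement integer bits covering [lowest, highest]."""
    k = 1
    while lowest < -(2 ** (k - 1)) or highest >= 2 ** (k - 1):
        k += 1
    return k


ranges = collect_ranges([1.0] * 50)       # worst-case constant input
for name, (lowest, highest) in ranges.items():
    print(name, lowest, highest, "integer bits:", integer_bits(lowest, highest))
```

With a constant input of 1.0 the output approaches 1/(1 - 0.6) = 2.5; milder stimuli would report a smaller range, which is exactly why this estimate is optimistic.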

Combine both methods for accurate MSB determination
[Table: per signal, min, max and resulting MSB1 from the signal statistics, and min, max and resulting MSB2 from the range propagation; entries not preserved in the source.]
- If MSB1 == MSB2: wrap-around(MSB1)
- If MSB1 < MSB2: choose saturate(MSB1) or wrap-around(MSB2)
- If MSB1 << MSB2: range propagation problem (MSB1 + saturation to be used)

Transform DFG for cheaper solution
- Scaling by moving multiplications or shifters over operators
- Use commutativity, associativity, distributivity (check accuracy!)
- Need to verify also LSB behavior

2. Quantization effects can be modeled as additive noise (LSB)
[Figure: a B-bit quantizer Q between input and output is replaced by an adder that injects a noise signal]
Quantization noise is approximated by a statistical model with the following assumptions:
- the noise is uncorrelated to the input
- the noise is white
- the probability distribution is uniform

Each quantization effect is modeled by a mean and variance
- Rounding: $m_n = 0$ and $\sigma_n^2 = \Delta^2 / 12$
- Truncation: $m_n = -\Delta / 2$ and $\sigma_n^2 = \Delta^2 / 12$
- Magnitude truncation: $m_n = 0$ and $\sigma_n^2 = \Delta^2 / 3$
- $\Delta$ is the quantization step; the sign of the truncation mean follows the convention Error = round(x) - x
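The mean and variance of the three noise models can be checked empirically. A sketch assuming uniformly distributed input in [-1, 1), a quantization step of 2^-8 and the error convention quantized minus original; the function names and parameter defaults are illustrative.

```python
import math
import random


def quantize(x: float, step: float, mode: str) -> float:
    if mode == "round":
        return math.floor(x / step + 0.5) * step    # round to nearest
    if mode == "truncate":
        return math.floor(x / step) * step          # two's complement truncation
    if mode == "magnitude":
        return math.trunc(x / step) * step          # truncation towards zero
    raise ValueError(f"unknown mode: {mode}")


def noise_stats(mode: str, step: float = 2.0 ** -8, samples: int = 100_000, seed: int = 1):
    """Return (mean, variance) of the quantization error for random inputs."""
    rng = random.Random(seed)
    errors = []
    for _ in range(samples):
        x = rng.uniform(-1.0, 1.0)
        errors.append(quantize(x, step, mode) - x)
    mean = sum(errors) / samples
    variance = sum((e - mean) ** 2 for e in errors) / samples
    return mean, variance


step = 2.0 ** -8
for mode in ("round", "truncate", "magnitude"):
    mean, variance = noise_stats(mode, step)
    print(f"{mode:9s} mean = {mean / step:+.3f} step, variance = {variance / step ** 2:.4f} step^2")
```

The printed variances should come out close to 1/12 for rounding and truncation and close to 1/3 for magnitude truncation, with a mean near -1/2 step only for truncation.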

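This results in an equivalent linear network
[Figure: first-order recursive filter, X(n) to Y(n) with a feedback path through z^-1 and a multiplier a; on the left the feedback path contains the quantizer Q, on the right Q is replaced by an additive noise source e(n).]
But quantization is a non-linear operation!

Limit cycles are an example of non-linear behavior
[Figure: recursive filter with a B-bit quantizer Q in the feedback path, X(n) to Y(n) through z^-1 and a multiplier; responses shown without rounding and with rounding.]
- Input: X(0) = 14, x(n) = 0 for n > 0
- With rounding: round to nearest integer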
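A limit cycle is easy to reproduce in simulation. The sketch below uses the input from the slide, X(0) = 14 and zero afterwards, but the feedback coefficient a = -0.9 is an assumption, since the coefficient of the lecture's example is not preserved; exact fractions are used so that ties round deterministically.

```python
import math
from fractions import Fraction


def simulate(a: Fraction, x0: int, steps: int, rounding: bool):
    """y(n) = x(n) + Q[a * y(n-1)], with Q = round to nearest integer if enabled."""
    outputs = []
    y_prev = Fraction(0)
    for n in range(steps):
        x = Fraction(x0) if n == 0 else Fraction(0)
        feedback = a * y_prev
        if rounding:
            feedback = Fraction(math.floor(feedback + Fraction(1, 2)))
        y_prev = x + feedback
        outputs.append(y_prev)
    return outputs


a = Fraction(-9, 10)                      # assumed coefficient, |a| < 1 so the filter is stable
exact = simulate(a, 14, 200, rounding=False)
rounded = simulate(a, 14, 200, rounding=True)
print("without rounding, last output:", float(exact[-1]))
print("with rounding, last outputs:  ", [int(y) for y in rounded[-4:]])
```

Without rounding the response decays towards zero, as the linear model predicts. With rounding the output never dies out but keeps oscillating with constant amplitude, which the additive noise model cannot explain.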

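Limit cycle example
[Figure: simulated limit cycle response]

a) LSB determination must be based on simulations
[Figure: the same stimuli x drive an all-fixed-point model (delay z^-1, coefficient 0.6, multiplier m, adder, quantizers Q) and the floating-point reference model; the outputs y are compared. If the output is not ok, the fixed-point model is adjusted and simulated again; if it is ok, the refinement is finished.]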

b) Gradual refinement is necessary to keep the problem manageable
[Figure: refinement loop around the example graph (stimuli x, coefficient 0.6, delay z^-1, multiplier m, adder, quantizers Q, output y)]
- For each signal S: quantize S only, simulate, and compare against the reference simulation
- If the performance is not ok, adjust and repeat; if it is ok, return

Conclusion
- Number representation
- Location of decimal point
- Precision
- Dynamic range
- Truncation, rounding
- Overflow
- Now: what to do?
- Why are we doing this? Area/time/power optimization
- Important design optimization for JPEG project
