EE216B: VLSI Signal Processing
Wordlength Optimization
Prof. Dejan Marković
ee216b@gmail.com

Number Systems: Algebraic (12.2)
- Algebraic number, e.g. a = π + b [1]
- High-level abstraction
- Infinite precision
- Often easier to understand
- Good for theory/algorithm development
- Hard to implement

[1] C. Shi, Floating-point to Fixed-point Conversion, Ph.D. Thesis, University of California, Berkeley, 2004.
Number Systems: Floating Point (12.3)
- Widely used in CPUs
- Floating precision
- Good for algorithm study and validation

$\text{Value} = (-1)^{\text{Sign}} \times \text{Fraction} \times 2^{(\text{Exponent} - \text{Bias})}$

IEEE 754 standard         Sign     Exponent     Fraction    Bias
Single precision [31:0]   1 [31]   8 [30:23]    23 [22:0]   127
Double precision [63:0]   1 [63]   11 [62:52]   52 [51:0]   1023

A short floating-point number (Bias = 3):
π ≈ 0 | 110010 | 101   (Sign | Frac | Exp)
$\pi \approx (-1)^0 \times (1\cdot2^{-1} + 1\cdot2^{-2} + 0\cdot2^{-3} + 0\cdot2^{-4} + 1\cdot2^{-5} + 0\cdot2^{-6}) \times 2^{(1\cdot2^{2} + 0\cdot2^{1} + 1\cdot2^{0} - 3)} = 3.125$

Number Systems: Fixed Point (12.4)
2's complement:     π ≈ 0 | 011 | 001001   (Sign | W_Int | W_Fr, with overflow-mode and quant.-mode)
Unsigned magnitude: π ≈ 0011 | 001001      (W_Int | W_Fr, with overflow-mode and quant.-mode)
$\pi \approx 0\cdot2^{3} + 0\cdot2^{2} + 1\cdot2^{1} + 1\cdot2^{0} + 0\cdot2^{-1} + 0\cdot2^{-2} + 1\cdot2^{-3} + 0\cdot2^{-4} + 0\cdot2^{-5} + 1\cdot2^{-6} = 3.140625$

- Economical implementation
- W_Int and W_Fr suitable for predictable range
- o-mode (saturation, wrap-around)
- q-mode (rounding, truncation)
- Useful built-in MATLAB functions: fix, round, ceil, floor, dec2bin, bin2dec, etc.
- In MATLAB: dec2bin(round(pi*2^6),10), then bin2dec(above)*2^-6
- Simulink: SynDSP and SysGen
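The two worked examples above can be reproduced outside MATLAB. The following is a minimal Python sketch, assuming an unsigned fixed-point format with W_Int = 4 and W_Fr = 6 and the simplified floating-point formula from the slide (no implicit leading one). The helper names `to_fixed`, `from_fixed`, and `decode_short_float` are illustrative and do not come from any tool mentioned in the slides.

```python
import math

def to_fixed(value, w_int, w_fr, q_mode="round", o_mode="saturate"):
    """Quantize a non-negative value to an unsigned (w_int, w_fr) bit string."""
    scaled = value * 2 ** w_fr
    # q-mode: round to nearest, or truncate the bits below the LSB
    code = math.floor(scaled + 0.5) if q_mode == "round" else math.floor(scaled)
    max_code = 2 ** (w_int + w_fr) - 1
    # o-mode: saturation clips to the range, wrap-around keeps the low-order bits
    if o_mode == "saturate":
        code = min(max(code, 0), max_code)
    else:
        code %= max_code + 1
    return format(code, "0{}b".format(w_int + w_fr))

def from_fixed(bits, w_fr):
    """Convert an unsigned fixed-point bit string back to a real value."""
    return int(bits, 2) * 2.0 ** -w_fr

def decode_short_float(sign, frac_bits, exp_bits, bias):
    """Evaluate (-1)^Sign * Fraction * 2^(Exponent - Bias) as written on the slide."""
    fraction = int(frac_bits, 2) / 2 ** len(frac_bits)
    return (-1) ** sign * fraction * 2.0 ** (int(exp_bits, 2) - bias)

bits = to_fixed(math.pi, w_int=4, w_fr=6)
print(bits, from_fixed(bits, 6))                  # 0011001001 3.140625
print(decode_short_float(0, "110010", "101", 3))  # 3.125
```

With the same 10 bits, the fixed-point code lands closer to π than the short floating-point code because all six fractional bits sit just below the binary point, where this value needs them.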
Motivation for Floating-to-Fixed Point Conversion (12.5)
- Algorithms designed in algebraic arithmetic (a = π + b), verified in floating-point or very large fixed-point arithmetic
- VLSI implementation in fixed-point arithmetic (Sign | W_Int | W_Fr, with overflow-mode and quant.-mode)
- Design flow: Idea → Floating-pt algorithm OK? → (Yes) Quantization → Fixed-pt algorithm OK? → (Yes) Hardware mapping; a No at either check loops back
- The quantization loop is time consuming (> 1 month) and error prone

Optimization Techniques: FRIDGE [2] (12.6)
- Set of test vectors for inputs → range detection through simulation → W_Int in all internal nodes
  + Conservative but good for avoiding overflow
- Pre-assigned W_Fr at all inputs → deterministic propagation → W_Fr in all internal nodes
  - Unjustified input W_Fr
  - Overly conservative

[2] H. Keding et al., "FRIDGE: A Fixed-point Design and Simulation Environment," in Proc. Design, Automation and Test in Europe, Feb. 1998, pp. 429-435.
Optimization Techniques: Robust Ad Hoc (12.7)
- Fixed-point system treated as a black box (bit-true simulation)
- Diagram: logic block WLs → fixed-point system → system specifications, hardware cost
- Ad hoc search [3] or procedural [4]
- Long bit-true simulation, large number of iterations [5]
- Impractical for large systems

[3] W. Sung and K.-I. Kum, "Simulation-based Word-length Optimization Method for Fixed-point Digital Signal Processing Systems," IEEE Trans. Sig. Proc., vol. 43, no. 12, pp. 3087-3090, Dec. 1995.
[4] S. Kim, K.-I. Kum, and W. Sung, "Fixed-Point Optimization Utility for C and C++ Based on Digital Signal Processing Programs," IEEE Trans. Circuits and Systems-II, vol. 45, no. 11, pp. 1455-1464, Nov. 1998.
[5] M. Cantin, Y. Savaria, and P. Lavoie, "A Comparison of Automatic Word Length Optimization Procedures," in Proc. Int. Symp. Circuits and Systems, vol. 2, May 2002, pp. 612-615.

Problem Formulation: Optimization (12.8)
- Minimize hardware cost:
  $f(W_{Int,1}, W_{Fr,1};\ W_{Int,2}, W_{Fr,2};\ \ldots;\ \text{o-q-modes})$
- Subject to quantization-error specifications:
  $S_j(W_{Int,1}, W_{Fr,1};\ W_{Int,2}, W_{Fr,2};\ \ldots;\ \text{o-q-modes}) < \text{spec}_j,\ \forall j$
- Feasibility:
  $\exists N \in \mathbb{Z}^{+}$ s.t. $S_j(N, N;\ \ldots;\ \text{any mode}) < \text{spec}_j,\ \forall j$
- Stopping criteria:
  $f < (1 + a) f_{opt}$, where $a > 0$
- From now on, concentrate on W_Fr [1]

[1] C. Shi, Floating-point to Fixed-point Conversion, Ph.D. Thesis, University of California, Berkeley, 2004.
Output MSE Specs: Perturbation Theory on MSE [6] (12.9)

$\text{MSE} = E[(\text{Infinite-precision output} - \text{Fixed-point output})^2] \approx \mu^T B \mu + \sum_{i=1}^{p} c_i\, 2^{-2 W_{Fr,i}}$

for a datapath of $p$ WLs, with $B \in \mathbb{R}^{p \times p}$ and $C \in \mathbb{R}^{p}$.

$\mu_i = \begin{cases} \frac{1}{2}\, q_i\, 2^{-W_{Fr,i}}, & \text{datapath} \\ \text{fix-pt}(c_i,\ 2^{-W_{Fr,i}}) - c_i, & \text{const } c_i \end{cases}$

$q_i = \begin{cases} 0, & \text{round-off} \\ 1, & \text{truncation} \end{cases}$

[6] C. Shi and R.W. Brodersen, "A Perturbation Theory on Statistical Quantization Effects in Fixed-point DSP with Non-stationary Input," in Proc. IEEE Int. Symp. Circuits and Systems, vol. 3, May 2004, pp. 373-376.

Actual vs. Computed MSE (12.10)
[Figure: actual vs. computed MSE for an 11-tap LMS adaptive filter and for an SVD U-Sigma block]
- Further improvement can be made considering correlation:
  $\text{MSE} \approx \mu^T B \mu + \sigma^T C \sigma$, with $B, C \in \mathbb{R}^{p \times p}$ and $\sigma_i = 2^{-W_{Fr,i}}$
- More simulations required
- Usually not necessary
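The single-node case of the MSE model can be checked numerically. The sketch below is an illustration and not the method of [6]: it assumes one quantizer, a uniformly distributed input, and the standard uniform-noise model, so the MSE reduces to μ² + 2^(-2·W_Fr)/12 (that is, B = 1 and c = 1/12 for this single node). The function names are hypothetical.

```python
import math
import random

def quantize(x, w_fr, q_mode):
    """Quantize x to w_fr fractional bits by rounding or truncation."""
    step = 2.0 ** -w_fr
    if q_mode == "round":
        return math.floor(x / step + 0.5) * step
    return math.floor(x / step) * step  # truncation toward minus infinity

def measured_mse(w_fr, q_mode, n=20000, seed=1):
    """Average squared quantization error over n uniform samples in [-1, 1]."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = rng.uniform(-1.0, 1.0)
        total += (x - quantize(x, w_fr, q_mode)) ** 2
    return total / n

def modeled_mse(w_fr, q_mode):
    """Single-node model: squared mean error plus uniform-noise variance."""
    q = 0 if q_mode == "round" else 1        # q = 0 round-off, 1 truncation
    mu = 0.5 * q * 2.0 ** -w_fr              # mean error magnitude
    return mu * mu + 2.0 ** (-2 * w_fr) / 12.0

for mode in ("round", "truncate"):
    for w_fr in (6, 8, 10):
        print(mode, w_fr, measured_mse(w_fr, mode), modeled_mse(w_fr, mode))
```

Under these assumptions truncation costs about four times the MSE of rounding at the same W_Fr, because the bias term μ² equals three times the variance term.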
FPGA Hardware Resource Estimation (12.11)
- Flow: designs in SysGen/SynDSP → Simulink compiler → netlister → VHDL/core generation → synthesis tool → mapper → map report with area info
- Design mapping:
  Accurate
  X Sometimes unnecessarily accurate
  X Slow (minutes to hours)
  X Excessive exposure to low-end tools
  X No direct way to estimate subsystem
  X Hard to realize for incomplete design
- Fast and flexible resource estimation is important for FFC!
- Tool needs to be orders of magnitude faster

Model-based Resource Estimation [*] (12.12)
- Individual MATLAB function created for each type of logic
- MATLAB function estimates each logic-block area based on design parameters (input/output WL, o, q, # of inputs, etc.)
- Area accumulates for each logic block
- Total area accumulated from individual area functions (register_area, accum_area, etc.)
- Xilinx area functions are proprietary, but ASIC area functions can be constructed through synthesis characterizations

[*] by C. Shi and Xilinx Inc. (© Xilinx)
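The accumulation idea can be sketched in a few lines of Python. The area formulas below are placeholders invented for illustration (the real Xilinx functions are proprietary, as noted above); only the structure matters: one area function per block type, summed over the design. `register_area` echoes the name on the slide, while `adder_area`, `mult_area`, and `total_area` are hypothetical.

```python
def register_area(wl):
    """Placeholder model: one unit of area per registered bit."""
    return wl

def adder_area(wl_a, wl_b):
    """Placeholder model: area grows with the wider input plus a carry bit."""
    return max(wl_a, wl_b) + 1

def mult_area(wl_a, wl_b):
    """Placeholder model: array multiplier area scales with the WL product."""
    return wl_a * wl_b

AREA_FUNCTIONS = {
    "register": register_area,
    "adder": adder_area,
    "mult": mult_area,
}

def total_area(blocks):
    """Sum the per-block estimates; blocks is a list of (type, parameters)."""
    return sum(AREA_FUNCTIONS[kind](*params) for kind, params in blocks)

design = [("mult", (12, 10)), ("adder", (16, 16)), ("register", (17,))]
print(total_area(design))  # 154 with the placeholder models above
```

Because each estimate is a plain function call, re-evaluating the area after a wordlength change takes microseconds rather than the minutes to hours of a full mapping run.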
ASIC Area Estimation (12.13)
- ASIC logic block area is a multi-dimensional function of its input/output WL and speed, constructed based on synthesis
- Each WL setting characterized for LP, MP, and HP
- Perform curve-fitting to fit data onto a quadratic function
[Figure: adder area vs. adder output wordlength and max(input wordlength); multiplier area vs. multiplier input 1 WL and input 2 WL]

Analytical Hardware-Cost Function: FPGA (12.14)
- If all design parameters (latency, o, q, etc.) and all W_Int's are fixed, then the FPGA area is roughly quadratic in W_Fr:
  $f(W) = W^T H_1 W + H_2 W + h_3$, where $W = (W_{Fr,1}, W_{Fr,2}, \ldots)^T$
- ASIC area modeled by the same f(W)
[Figure: check of hardware-cost fitting behavior, quadratic-fit hardware cost vs. actual hardware cost, for FPGA and ASIC; legend: quadratic-fit, linear-fit, ideal-fit]
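Evaluating the quadratic cost model is straightforward once the coefficients are known. The Python sketch below assumes two WL groups and made-up coefficients H1, H2, and h3 (in a real flow they would come from the data fit); `hardware_cost` is a hypothetical name.

```python
def hardware_cost(w, h1, h2, h3):
    """Evaluate f(W) = W^T H1 W + H2 W + h3 for a list of fractional WLs."""
    n = len(w)
    quadratic = sum(w[i] * h1[i][j] * w[j] for i in range(n) for j in range(n))
    linear = sum(h2[i] * w[i] for i in range(n))
    return quadratic + linear + h3

# Illustrative coefficients only, not fitted to any real design.
H1 = [[1.0, 0.5],
      [0.5, 0.0]]
H2 = [3.0, 2.0]
H3 = 40.0

cost_now = hardware_cost([8, 6], H1, H2, H3)
cost_less = hardware_cost([7, 6], H1, H2, H3)
print(cost_now, cost_less, cost_now - cost_less)  # 188.0 164.0 24.0
```

The difference between the two evaluations is the area saved by removing one fractional bit from the first group, which is the quantity a greedy wordlength search compares across groups.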
Wordlength Optimization Flow (12.15)
- Simulink design in XSG or SynDSP
- Initial Setup (12.16)
- WL Analysis & Range Detection (12.18) → optimal W_Int
- WL Connectivity & WL Grouping (12.19-20)
- HW Models for ASIC Estimation (12.13)
- Create cost-function for ASIC (12.12)
- Create cost-function for FPGA (12.12)
- Data-fit to Create HW Cost Function (12.21)
- MSE-specification Analysis (12.22); HW-acceleration / parallel simulation under development
- Data-fit to Create MSE Cost Function (12.22)
- Wordlength Optimization
- Optimization Refinement (12.23) → optimal W_Fr

[7] See the book website for tool download.

Initial Setup (12.16)
- Insert an FFC setup block from the library (see notes)
- Insert a Spec Marker for every output requiring MSE analysis
  - Generally every output needs one
Wordlength Reader (12.17)
- Captures the WL information of each block
  - If the user specifies a WL, store the value
  - If no WL is specified, back-trace the source block until a specified WL is found
  - If the source is the input port of a block, find the source of its parent

Wordlength Analyzer (12.18)
- Determines the integer WL of every block
  - Inserts a Range Detector at every active/non-constant node
  - Each detector stores signal range and other statistical info
  - Runs one simulation unless multiple test vectors are specified
[Figure: range detectors in Xilinx and SynDSP designs]
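A range detector and the resulting integer wordlength can be sketched as follows. This is an illustrative Python model and not the tool's implementation: it assumes the signed range is [-2^W_Int, 2^W_Int) with the sign bit counted separately, as in the fixed-point format shown earlier, and the class name `RangeDetector` is borrowed from the slide only as a label.

```python
class RangeDetector:
    """Tracks the signal range seen at one node during simulation."""

    def __init__(self):
        self.lo = float("inf")
        self.hi = float("-inf")
        self.count = 0

    def update(self, sample):
        """Record one simulated sample."""
        self.lo = min(self.lo, sample)
        self.hi = max(self.hi, sample)
        self.count += 1

    def integer_wordlength(self, signed=True):
        """Smallest W_Int whose range covers every recorded sample."""
        if self.count == 0:
            raise ValueError("no samples observed")
        if not signed and self.lo < 0:
            raise ValueError("negative sample on an unsigned node")
        w_int = 0
        # Grow W_Int until max < 2^W_Int and (if signed) min >= -2^W_Int.
        while self.hi >= 2 ** w_int or (signed and self.lo < -(2 ** w_int)):
            w_int += 1
        return w_int

detector = RangeDetector()
for sample in [-3.2, 0.5, 5.9, 1.25]:
    detector.update(sample)
print(detector.lo, detector.hi, detector.integer_wordlength())  # -3.2 5.9 3
```

The result is only as good as the test vectors: a sample outside the simulated range would overflow, which is why the analyzer accepts multiple test vectors.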
Wordlength Connectivity (12.19)
- Connect wordlength information through WL-passive blocks
- Back-trace until a WL-active block is reached
- Essentially flattens the design hierarchy
- First step toward reducing # of independent WLs

Wordlength Grouping (12.20)
- Deterministic
  - Fixed WL (mux select, enable, reset, address, constant, etc.)
  - Same WL as driver (register, shift reg, up/down-sampler, etc.)
- Heuristic (WL rules)
  - Multi-input blocks have the same input WL (adder, mux, etc.)
  - Tradeoff between design optimality and simulation complexity
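Grouping can be viewed as merging nodes that are constrained to share a wordlength. The Python sketch below uses a union-find structure for that purpose; this is one convenient way to model the rules above and is not taken from the tool. The node names and the `group_wordlengths` function are hypothetical.

```python
def group_wordlengths(nodes, same_wl_pairs):
    """Merge nodes that must share a WL and return the resulting groups."""
    parent = {node: node for node in nodes}

    def find(node):
        # Follow parent links to the group representative, halving the path.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for a, b in same_wl_pairs:
        parent[find(a)] = find(b)

    groups = {}
    for node in nodes:
        groups.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in groups.values())

nodes = ["in", "reg1", "add_a", "add_b", "mult_out"]
rules = [
    ("in", "reg1"),       # deterministic: register has the same WL as its driver
    ("add_a", "add_b"),   # heuristic: adder inputs share one WL
    ("reg1", "add_a"),    # connectivity: the register drives adder input a
]
print(group_wordlengths(nodes, rules))
```

Here five nodes collapse into two independent wordlengths, so the optimizer searches over two variables instead of five.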
Resource-Estimation Function, Analyze HW Cost (12.21)
- Creates a function call for each block (Slide 12.12, 12.14)
- HW cost is analyzed as a function of WL
  - One or two WL groups are toggled with the other groups fixed
  - Quadratic iterations for small # of WLs
  - Linear iterations for large # of WLs

Analyze Specifications, Analyze Optimization (12.22)
- Computes the MSE's sensitivity to each WL group (Slide 12.9, 12.10)
  - First simulate with all WLs at maximum precision
  - WL of each group is reduced individually
- Once the MSE function and HW cost function are computed, the user may enter the MSE requirement
  - Specify 1 MSE for each Spec Marker
- Optimization algorithm summary
  1) Find the minimum W_Fr for a given group (others high)
  2) Based on all the minimum W_Fr's, increase all WLs to meet spec
  3) Temporarily decrease each W_Fr separately by one bit; keep only the one with the greatest HW reduction that still meets spec
  4) Repeat 3) until W_Fr cannot be reduced anymore
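The four-step summary can be sketched in Python as follows. This is a minimal reading of the steps under stated assumptions, not the tool's source: `cost` and `mse` are caller-supplied models (in the example, a linear cost and an MSE of the form sum of c_i · 2^(-2·W_Fr,i), both with made-up coefficients), and `optimize_wordlengths` is a hypothetical name.

```python
def optimize_wordlengths(n_groups, cost, mse, spec, w_max=32):
    """Greedy W_Fr search following the four steps in the summary."""
    if mse([w_max] * n_groups) >= spec:
        raise ValueError("spec is infeasible even at maximum precision")

    # Step 1: minimum W_Fr of each group with all other groups at w_max.
    w_min = []
    for i in range(n_groups):
        w = [w_max] * n_groups
        while w[i] > 0:
            w[i] -= 1
            if mse(w) >= spec:
                w[i] += 1  # one bit too few, restore and stop
                break
        w_min.append(w[i])

    # Step 2: start from the minima and raise all WLs together until spec is met.
    w = list(w_min)
    while mse(w) >= spec:
        w = [min(x + 1, w_max) for x in w]

    # Steps 3 and 4: repeatedly drop the single bit that saves the most hardware.
    while True:
        base = cost(w)
        best, best_saving = None, 0.0
        for i in range(n_groups):
            if w[i] == 0:
                continue
            trial = list(w)
            trial[i] -= 1
            saving = base - cost(trial)
            if mse(trial) < spec and saving > best_saving:
                best, best_saving = trial, saving
        if best is None:
            return w
        w = best

# Illustrative models: group 0 is expensive in area, group 1 is noisier.
example_cost = lambda w: 10 * w[0] + 3 * w[1]
example_mse = lambda w: 2.0 ** (-2 * w[0]) + 4 * 2.0 ** (-2 * w[1])

result = optimize_wordlengths(2, example_cost, example_mse, spec=1e-4)
print(result, example_cost(result))  # [7, 9] 97
```

In this example step 1 yields minima of (7, 8), which together miss the spec, so step 2 raises both to (8, 9); step 3 then removes the bit from the expensive group, and no further reduction meets the spec.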
Optimization Refinement and Result (12.23)
- The result is then examined by the user for suitability
- Re-optimize if necessary, only takes seconds

Example: 1/sqrt() on an FPGA
[Figure: 1/sqrt() datapath annotated with wordlength pairs: (16,12) (13,11) (14,9) (8,4) (24,16) (13,8) (24,16) (11,6) (24,16) (10,6) (16,11) (11,7) (12,9) (10,7) (16,11) (16,11) (8,7) (8,7)]
Legend: red = WL optimal (409 slices), black = fixed WL (877 slices)
- About 50% area reduction

ASIC Example: FIR Filter [8] (12.24)
- Original design: area = 48916 μm²
- Optimized for MSE = $10^{-6}$: area = 18356 μm²

[8] C.C. Wang, Design and Optimization of Low-power Logic, M.S. Thesis, UCLA, 2009. (Appendix A)
Example: Jitter Compensation Filter [9] (12.25)
[Figure: block diagram with Derivative, HPF, LPF, adder, and Mult blocks]
[Figure: two plots of SNR (dB), 0 to 40, vs. Time (μs), 0 to 4.5; annotated SNR levels: 29.4 dB and 30.8 dB]

[9] Z. Towfic, S.-K. Ting, A. Sayed, "Sampling Clock Jitter Estimation and Compensation in ADC Circuits," in Proc. IEEE Int. Symp. Circuits and Systems, June 2010, pp. 829-832.

Tradeoff: MSE vs. Hardware-Cost (12.26)
[Figure: HW cost (kLUTs) vs. MSE_cos and MSE_sin, each axis from $10^{-6}$ to $10^{-2}$; annotations: acceptable MSE, WL-optimal design, ACPR (MSE = $6 \times 10^{-3}$), 46 dB, ACPR (MSE = $7 \times 10^{-3}$)]
Summary (12.27)
- Wordlength minimization is important in the implementation of fixed-point systems in order to reduce area and power
  - Integer wordlength can be found simply by range detection, based on input data
  - Fractional wordlengths require more elaborate perturbation theory to minimize hardware cost subject to the MSE error due to quantization
- Design-specific information can be used
  - Wordlength grouping (e.g. in multiplexers)
  - Hierarchical optimization (with fixed input/output WLs)
  - WL optimizer for recursive systems takes longer due to the time required for algorithm convergence
- FPGA/ASIC hardware resource estimation results are used to minimize WLs for FPGA/ASIC implementations

References (1/3) (12.28)
[1] C. Shi, Floating-point to Fixed-point Conversion, Ph.D. Thesis, University of California, Berkeley, 2004.
[2] H. Keding et al., "FRIDGE: A Fixed-point Design and Simulation Environment," in Proc. Design, Automation and Test in Europe, Feb. 1998, pp. 429-435.
[3] W. Sung and K.-I. Kum, "Simulation-based Word-length Optimization Method for Fixed-point Digital Signal Processing Systems," IEEE Trans. Sig. Proc., vol. 43, no. 12, pp. 3087-3090, Dec. 1995.
[4] S. Kim, K.-I. Kum, and W. Sung, "Fixed-Point Optimization Utility for C and C++ Based on Digital Signal Processing Programs," IEEE Trans. Circuits and Systems-II, vol. 45, no. 11, pp. 1455-1464, Nov. 1998.
References (2/3) (12.29)
[5] M. Cantin, Y. Savaria, and P. Lavoie, "A Comparison of Automatic Word Length Optimization Procedures," in Proc. Int. Symp. Circuits and Systems, vol. 2, May 2002, pp. 612-615.
[6] C. Shi and R.W. Brodersen, "A Perturbation Theory on Statistical Quantization Effects in Fixed-point DSP with Non-stationary Input," in Proc. IEEE Int. Symp. Circuits and Systems, vol. 3, May 2004, pp. 373-376.
[7] See book supplement website for tool download. Also see: http://bwrc.eecs.berkeley.edu/people/grad_students/ccshi/research/ffc/documentation.htm

References (3/3) (12.30)
[8] C.C. Wang, Design and Optimization of Low-power Logic, M.S. Thesis, UCLA, 2009. (Appendix A)
[9] Z. Towfic, S.-K. Ting, A.H. Sayed, "Sampling Clock Jitter Estimation and Compensation in ADC Circuits," in Proc. IEEE Int. Symp. Circuits and Systems, June 2010, pp. 829-832.
Course Wiki (12.31)
- CAD Tutorials → WL Optimization
  - Tool source code
  - Tested with MATLAB 2006b and SynDSP 3.6