IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS II: ANALOG AND DIGITAL SIGNAL PROCESSING, VOL. 45, NO. 11, NOVEMBER 1998

Fixed-Point Optimization Utility for C and C++ Based Digital Signal Processing Programs
Seehyun Kim, Member, IEEE, Ki-Il Kum, and Wonyong Sung, Member, IEEE

Abstract: Fixed-point optimization utility software is developed that can aid scaling and wordlength determination of digital signal processing algorithms written in C or C++. This utility consists of two programs: the range estimator and the fixed-point simulator. The former estimates the ranges of floating-point variables for purposes of automatic scaling, and the latter translates floating-point programs into fixed-point equivalents to evaluate the fixed-point performance by simulation. By exploiting the operator overloading characteristics of C++, the range estimation and the fixed-point simulation can be conducted by simply modifying the variable declarations of the original program. This utility is easily applicable to nearly all types of digital signal processing programs, including nonlinear, time-varying, multirate, and multidimensional signal processing algorithms. In addition, this software can be used to compare the fixed-point characteristics of different implementation architectures. An optimization example for an 8 × 8 inverse discrete cosine transform (IDCT) architecture that conforms to the IEEE standard specifications is presented. The optimized results require 8% fewer gates when compared with the previous best implementation.

Index Terms: Finite wordlength effects, fixed-point optimization, fixed-point simulation, range estimation.

I. INTRODUCTION

WHILE most digital signal processing algorithms are developed using floating-point arithmetic, their implementation using very large scale integration (VLSI) or fixed-point digital signal processors usually requires fixed-point arithmetic for the sake of hardware cost and speed. However, fixed-point implementation can suffer from excessive finite wordlength effects, due to overflows and quantization noise, unless all signals are scaled properly and enough wordlengths are assigned [1]. Previously known analytical methods can be used for scaling and wordlength optimization of linear digital filters and some specific algorithms [2]-[4]. However, it is very difficult to apply these analytical methods to general digital signal processing algorithms, and it is usually necessary to simulate digital signal processing algorithms extensively using fixed-point arithmetic before implementation. It has been considered a tedious process to determine scaling information and prepare fixed-point simulation models of complex signal processing algorithms, which may contain nonlinear and time-varying blocks.

Manuscript received August 5, 1996; revised April 8, 1998. This paper was recommended by Associate Editor T. Q. Nguyen. S. Kim was with the School of Electrical Engineering, Seoul National University, Seoul 151-742, Korea. He is now with LG Corporate Institute of Technology, Seoul 137-140, Korea. K.-I. Kum and W. Sung are with the School of Electrical Engineering, Seoul National University, Seoul 151-742, Korea (e-mail: wysung@dsp.snu.ac.kr). Publisher Item Identifier S 1057-7130(98)07526-0.

Fig. 1. Proposed fixed-point algorithm development procedure.
C is still the most popular language for describing digital signal processing algorithms, although there are several programming languages and block-diagram-based CAD tools that support fixed-point data types, such as Silage [5], DSP/C [6], DSP Station [7], and SPW [8]. In particular, C is more flexible for the development of digital signal processing programs containing control-intensive algorithms. Although there are some previous works that introduce different formats in C++ using operator overloading, such as the variable precision floating-point simulator [9], C does not support fixed-point formats. As a result, the conversion of a floating-point C program into a fixed-point version requires much effort. In order to solve this problem, we have developed an automatic scaling and fixed-point simulation utility for digital signal processing programs written in C or C++. This utility consists of the range estimator and the fixed-point simulator. The proposed procedure for developing fixed-point algorithms is shown in Fig. 1. Users develop C or C++ models with floating-point arithmetic and mark the variables whose fixed-point behavior is to be examined with the range estimation directives. The range estimator then finds the statistics of internal signals throughout the floating-point simulation using real inputs and determines scaling parameters. A C++ data class for range estimation is developed for this purpose. The fixed-point simulator converts a floating-point digital signal processing program with fixed-point simulation directives to a fixed-point equivalent by introducing two fixed-point data classes, one for bit-accurate simulation and the other for fast execution. In order to overcome the representational limit of a fractional or integer format, a fixed-point data format [10], [11] is employed that can support an arbitrary representation range by scaling, as shown in (1).

The operations associated with the fixed-point data class, such as addition, subtraction, multiplication, division, and assignment, are also defined at the class declaration. Then, fixed-point arithmetic operations, instead of floating-point arithmetic, are conducted automatically due to the operator overloading capability of C++ [12]. In Section II, the fixed-point data representation method is described. An algorithm for estimating the range is discussed in Section III. In Section IV, the range estimation utility for determining the integer wordlength is explained. Details of the fixed-point simulator are presented in Section V. In Section VI, a fast fixed-point simulator employing a hardware floating-point data-path is explained. As an implementation example, the internal wordlengths of an 8 × 8 inverse discrete cosine transform (IDCT) architecture are optimized in Section VII. Concluding remarks follow in Section VIII.

II. FIXED-POINT DATA REPRESENTATION

For the representation of fixed-point data, a generalized format [10] is employed using the attributes specified in the following:

<wordlength, integer wordlength, sign | overflow | quantization mode>.  (1)

In this fixed-point data format, two numbers can be added or subtracted only when their hypothetical binary points are aligned. Let us consider an addition of two signals x and y to produce a signal z, where x, y, and z have the fixed-point formats <10, 2, tsr>, <9, 3, tsr>, and <10, 3, tsr>, respectively. In order to calculate z, x must gain one more integer bit by sign-extension, while y needs two more fractional bits. Then, a b-bit addition, where b is greater than or equal to 11, is conducted, and the saturation and rounding operation is applied. The above procedure can be implemented in hardware, as shown in Fig. 2.

Fig. 2. A hardware model for adding two fixed-point variables of different integer wordlengths.
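As an illustration of the alignment, wide addition, rounding, and saturation steps just described (the hardware model of Fig. 2), the following is a minimal C++ sketch. The struct name Fx, its fields, and the convention that the integer wordlength includes the sign bit are our assumptions for illustration, not the paper's actual class.

```cpp
#include <algorithm>
#include <cstdint>

// Illustrative fixed-point value in the <wordlength, integer wordlength> format of (1).
struct Fx {
    int32_t mant;   // raw two's-complement mantissa
    int wl;         // total wordlength
    int iwl;        // integer wordlength (sign bit counted here, as an assumption)
    int fwl() const { return wl - iwl; }   // fractional wordlength
};

// Align the binary points, add in a wide accumulator, then round and saturate
// back to the result format, as the hardware model of Fig. 2 does.
Fx fx_add(const Fx& a, const Fx& b, int res_wl, int res_iwl) {
    int fwl = std::max(a.fwl(), b.fwl());                         // keep every fractional bit
    int64_t am = static_cast<int64_t>(a.mant) * (int64_t{1} << (fwl - a.fwl()));
    int64_t bm = static_cast<int64_t>(b.mant) * (int64_t{1} << (fwl - b.fwl()));
    int64_t sum = am + bm;                                        // b-bit internal addition

    int res_fwl = res_wl - res_iwl;
    int shift = fwl - res_fwl;
    if (shift > 0)
        sum = (sum + (int64_t{1} << (shift - 1))) >> shift;       // round to nearest (arithmetic shift)
    else
        sum *= (int64_t{1} << -shift);

    const int64_t maxv = (int64_t{1} << (res_wl - 1)) - 1;        // saturation bounds
    const int64_t minv = -(int64_t{1} << (res_wl - 1));
    sum = std::min(std::max(sum, minv), maxv);
    return Fx{static_cast<int32_t>(sum), res_wl, res_iwl};
}
```

With this sketch, adding x in <10, 2> and y in <9, 3> into a <10, 3> result aligns y by two fractional bits, sign-extends x by one integer bit, performs an 11-bit addition, and then rounds and saturates, matching the example above.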
III. RANGE ESTIMATION USING STATISTICS

The minimum integer wordlength for a variable x is the smallest number of integer bits whose representable magnitude covers the range R(x), i.e.,

IWL(x) = ceil(log2 R(x)).  (2)

Previously known analytical methods try to determine the range by calculating the L1 norm of a transfer function [1]. The range estimated using the L1 norm guarantees no overflow for any signal, but it is a very conservative estimate for most applications. According to our comparison for a fourth-order infinite impulse response (IIR) digital filter applied to speech signal processing, the L1 norm needs four extra bits when compared with the optimum scaling result obtained using a real input speech signal [10]. It is also very difficult to calculate the L1 norm of adaptive or nonlinear systems. In our approach, a simulation-based method is adopted to estimate ranges, where the range of each signal is measured during the floating-point simulation using realistic input signal files. This method is applicable to both nonlinear and linear systems, but it requires an adequate estimation of the range from a finite length of simulation results. It is easy to parameterize simple distributions, such as uniform, Gaussian, or Laplacian, by applying a few statistics, and it is well known that speech signals have a Laplacian distribution. However, it is not possible to model all signals in practical systems using a simple distribution [13]. For example, they may be nonsymmetric or multimodal; note that the estimated range of a multimodal signal could be too small if we employ the rule for unimodal distributions. The scheme for estimating the range should therefore vary according to the distribution. In this section, we describe a method for range estimation based on identifying the distribution of a signal.

A. Statistical Characteristics of a Signal

The skewness of a given set of samples is defined as [14]

gamma = mu3 / sigma^3,  (3)

where mu_k is the kth-order central moment and sigma = sqrt(mu2) is the standard deviation. The skewness vanishes if the distribution of the samples is symmetrical about the mean m; nonzero skewness implies that the distribution spreads more widely to the left or to the right of the mean. On the other hand, the kurtosis is defined as [14]

kappa = mu4 / sigma^4 - 3.  (4)

It indicates how many samples are close to the mean, and it becomes zero if the samples have a Gaussian distribution. Modes represent local maxima of the distribution. While unimodal distributions have a single peak, multimodal distributions have several peaks. The standard deviation of a unimodal distribution can be taken as the width of the peak, as shown in Fig. 3(a). For instance, 99.99% of a Gaussian distribution is included in a range of four times sigma. Thus, we can estimate the range of a unimodal and symmetric signal by means of the mean m and n times sigma, where n is highly dependent on the distribution. For the multimodal situation, however, sigma no longer indicates how widely the distribution spreads.

As an example, let us consider the bimodal distribution shown in Fig. 3(b). Most internal signals of adaptive systems have this kind of distribution, where the two modes indicate the initial and the final states, respectively. In this case, it is not possible to estimate the range simply from m and sigma: not only the mean and the standard deviation but also the shape of the distribution is important for estimating the range.

Fig. 3. (a) Unimodal and (b) bimodal distributions.

B. Estimation of the Range

In order to estimate the range effectively, distributions can be characterized as unimodal/multimodal, symmetric/nonsymmetric, and zero mean/nonzero mean. Although symmetry and zero mean can be verified easily from the skewness and the mean, respectively, it is harder to estimate the number of modes. From the experimental results shown in Table I, we can derive a heuristic method to discriminate unimodal distributions from multimodal ones; that is, if the kurtosis is sufficiently small, the distribution is treated as approximately unimodal. As an example, the LMS error and LMS coeff signals in Table I all have nonzero skewness and a very large kurtosis. Thus, they are estimated to have nonsymmetric and multimodal distributions; in fact, they have two modes, which represent the initial and final steady states, respectively.

TABLE I. STATISTICAL INFORMATION OF A FEW SIGNALS.

For unimodal and symmetric distributions, the range can be estimated effectively by

R(x) = |m(x)| + n(kappa) * sigma(x).  (5)

Note that for two symmetric distributions having an identical variance, the one with the larger kurtosis spreads more widely than the other. Thus, a greater value of n(kappa) is needed in order to estimate the range of the signal having the larger kurtosis. For example, the distribution of IIR speech in Table I can be covered effectively by five times sigma according to its kurtosis. The above rule is not satisfactory for multimodal or nonsymmetric distributions. For such cases, we consider an alternative rule that multiplies a measured submaximum by a guard factor,

R(x) = g * xhat,  (6)

where g is a guard for the range and xhat denotes a submaximum value, i.e., a value that covers a prescribed fraction of the entire sample set. Various submaxima are collected during the simulation. The greater the difference between the absolute maximum and the submaximum is, the larger the guard value must be; a scale factor also controls the guard value and is currently set to two.

Note that the statistical information obtained depends on the input data set. Thus, it is necessary to use several input signal sample files for a more reliable estimation of the range. In order to measure the variation of each distribution parameter across the input signal sample files, four sensitivities are calculated for the mean, variance, skewness, and kurtosis, respectively, in (7)-(10); each sensitivity is derived from the maximum and the minimum value of the corresponding parameter among the multiple simulation results. The statistics are then modified using these sensitivities in (11)-(14), with the scale factor currently chosen as 0.1. When the modified skewness and kurtosis indicate a symmetric and unimodal distribution, (5) is used for estimating ranges; otherwise, (6) is used with the modified statistics.
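To make the procedure concrete, the following C++ sketch gathers the statistics of Section III-A and applies a range rule in the spirit of (5) and (6). The unimodality thresholds, the mapping n(kappa), and the use of the absolute maximum in place of a guarded submaximum are illustrative assumptions, not the paper's exact rules.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

// Sample statistics used for range estimation (Section III-A).
struct SignalStats {
    double mean, sigma, skewness, kurtosis, absmax;
};

SignalStats collect(const std::vector<double>& x) {
    const double n = static_cast<double>(x.size());
    double s1 = 0, s2 = 0, s3 = 0, s4 = 0, amax = 0;
    for (double v : x) {
        s1 += v; s2 += v * v; s3 += v * v * v; s4 += v * v * v * v;
        amax = std::max(amax, std::fabs(v));
    }
    const double m  = s1 / n;
    const double m2 = s2 / n - m * m;                                   // central moments
    const double m3 = s3 / n - 3 * m * s2 / n + 2 * m * m * m;
    const double m4 = s4 / n - 4 * m * s3 / n + 6 * m * m * s2 / n - 3 * m * m * m * m;
    const double sd = std::sqrt(m2);
    return {m, sd, m3 / (sd * sd * sd), m4 / (m2 * m2) - 3.0, amax};    // skewness (3), kurtosis (4)
}

double estimate_range(const SignalStats& s) {
    // Treat the signal as unimodal and symmetric when skewness is small and
    // kurtosis is moderate (assumed thresholds).
    if (std::fabs(s.skewness) < 0.5 && s.kurtosis < 3.0) {
        double nk = 4.0 + s.kurtosis;          // assumed n(kappa): larger kurtosis, wider range
        return std::fabs(s.mean) + nk * s.sigma;
    }
    return 2.0 * s.absmax;                     // guarded maximum for multimodal/nonsymmetric signals
}
```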

Fig. 4. Declaration of the range estimation class.
Fig. 5. C++ programs for a first-order IIR filter. (a) The original C++ program. (b) An automatically translated version for the range estimation.
Fig. 6. The result of the range estimator for the IIR filter.

IV. RANGE ESTIMATION UTILITY

Since we employ the simulation-based approach for estimating the range, a program needs to be generated for collecting the statistical information during the simulation. To avoid changing the original floating-point C or C++ program for range estimation, the operator overloading characteristics of C++ are utilized. A new data class is defined for tracing the possible maximum value of a signal, i.e., the range. In order to prepare a range estimation model of a C or C++ digital signal processing program, it is only necessary to change the type of the variables of interest to this range estimation class, since a class in C++ is also a user-defined type of variable. The class not only computes the current value, but also keeps records of the variable in private members. Thus, when the simulation is completed, the range of a variable declared with the range estimation class is readily available from the records stored in the class.

The class has several private members, as shown in Fig. 4. One member keeps the current value, while others record the summation and the squared summation of past values as well as the third- and fourth-order sums; this information is needed to calculate the statistics of a variable, including its mean, standard deviation, skewness, and kurtosis. Another member stores the absolute maximum value of the variable during the simulation, and the class also counts the number of modifications made during the simulation. The class overloads arithmetic and relational operators. Hence, basic arithmetic operations such as addition, subtraction, multiplication, and division are conducted automatically for range estimation variables, and any instance can be compared with floating-point variables and constants. The private members of a variable declared with the range estimation class are updated whenever the variable is modified by one of the assignment operators. For example, the recorded absolute maximum is updated when the absolute value of the present value is larger than the previously recorded maximum.

After the simulation model for estimating the ranges is prepared by modifying the variable declarations, the range estimator is executed. It compiles the simulation model and links it with the simulation driver and the overloaded operators of the range estimation class. In our work, the GNU C compiler, version 2.6.3, is used throughout the development [15]. Next, the simulation driver executes the simulation model, and the range information is gathered during the simulation. After the simulation is completed, the mean, standard deviation, skewness, and kurtosis are calculated from the recorded information. Then, the statistical range of a variable can be estimated by the procedure shown in the previous section. Finally, the integer wordlengths of all signals are obtained from their ranges, as shown in (2).

As an example, let us consider a first-order digital IIR filter. The overall procedure to estimate the ranges of internal variables can be summarized as follows. 1) Develop a C++ program for the first-order IIR filter. 2) Insert the range estimation directives, as shown in Fig. 5(a).
Since the ranges of the input (Xin) and the coefficient (Coeff) are known already, only the output (Yout) and the state variable (Ydly) are to be examined. 3) Invoke the range estimator. The estimator generates the simulation model [Fig. 5(b)] and runs it. After the simulation, we obtain the ranges and the integer wordlengths of the variables, as shown in Fig. 6.
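The following minimal C++ sketch shows how such a range-tracing class can be built with operator overloading; the class name RangeVar and its member names are illustrative stand-ins for the paper's class of Fig. 4, not its actual declaration.

```cpp
#include <algorithm>
#include <cmath>

// Minimal range-tracing class: statistics are recorded on every assignment.
class RangeVar {
    double val = 0;                       // current value
    double sum = 0, sqsum = 0;            // running first and second sums
    double cub = 0, qua = 0;              // running third and fourth sums
    double absmax = 0;                    // absolute maximum observed
    long   count = 0;                     // number of assignments
    void record() {
        sum += val; sqsum += val * val;
        cub += val * val * val; qua += val * val * val * val;
        absmax = std::max(absmax, std::fabs(val));
        ++count;
    }
public:
    RangeVar() = default;
    RangeVar& operator=(double v) { val = v; record(); return *this; }  // statistics updated on assignment
    operator double() const { return val; }    // lets RangeVar mix freely with double arithmetic
    double range() const { return absmax; }
    int iwl() const {                           // integer wordlength, as in (2)
        return range() > 0 ? static_cast<int>(std::ceil(std::log2(range()))) : 0;
    }
};

// First-order IIR filter: only the variables under examination change type.
void iir1(double xin, double coeff, RangeVar& ydly, RangeVar& yout) {
    yout = coeff * static_cast<double>(ydly) + xin;  // y(n) = c * y(n-1) + x(n)
    ydly = static_cast<double>(yout);
}
```

Declaring Yout and Ydly with such a class while leaving Xin and Coeff as double, and then running the unmodified floating-point simulation over representative input files, leaves the recorded range and the integer wordlength of (2) available afterwards, mirroring the flow of Figs. 5 and 6.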

TABLE II. EXECUTION SPEED OF THE RANGE ESTIMATOR.

The simulation time required for the range estimation is approximately two to four times that of the original floating-point C program. The execution speeds of the original floating-point program, the developed range estimator, and the Autoscaler for a fourth-order IIR digital filtering program with 24 000 input samples on a Pentium 90-MHz based PC are compared in Table II. The Autoscaler is a range estimation and automatic scaling program for fixed-point assembly program development on the TMS320C25 and C50 [10], [16]. The comparison clearly shows that the range estimator developed here is quite fast, because it conducts the simulation using a high-level program. Thus, it is practical to obtain a reliable range estimation by simulating with several input signal files.

V. FIXED-POINT SIMULATION UTILITY

Previously developed analytical methods for evaluating the fixed-point performance of a digital signal processing algorithm are not easily applicable to practical systems containing nonlinear and time-varying blocks [17], [18]. The analysis is more complicated when a specific kind of input signal, such as speech, is required for the evaluation. In order to relieve these problems, we employed a simulation-based method for evaluating the fixed-point characteristics of a digital signal processing algorithm. However, most high-level language compilers do not support fixed-point arithmetic, which needs a variable wordlength for each arithmetic operation. Therefore, a new fixed-point data class and its operators are developed to prepare a bit-accurate fixed-point version of a floating-point program and to evaluate its finite wordlength and scaling effects by simulation.

Fig. 7. Declaration of the fixed-point data class.

The declaration part of the class is shown in Fig. 7. As shown in this figure, the fixed-point data class has several private members: the mantissa, the wordlength, the integer wordlength, and the attributes. In order to represent mantissa values requiring a larger precision than that of the built-in integer type, e.g., 32 bits, the fixed-point class employs the multiple-precision integer data class of the GNU C++ library [19]. The class supports all of the assignment and arithmetic operations supported in C or C++; the list of the supported operations can be found in the operator list in Fig. 7. Brief explanations of them are as follows.

1) The assignment operator converts the input data according to the fixed-point format of the left-side variable and assigns the format-converted data to this variable. The input data, which is the evaluated result of the right side, can be either floating-point or fixed-point data. If the given format of the left-side variable does not provide enough precision for representing the input data, the data is modified according to the fixed-point attributes of the left-side variable, such as saturation, overflow, rounding, or truncation.

2) The operation of the fixed-point add operator is shown in Fig. 8. In order to prevent any loss of accuracy during the operation, it first computes the maximum integer and fractional wordlengths of the two input data. For example, assume that the integer and fractional wordlengths of the first operand are two and eight, and those of the second operand are four and six, respectively.
Then, the internal data has an integer wordlength of five in order to prevent overflows and a fractional wordlength of eight so as not to lose any accuracy. After that, the input data are aligned by using shift operations and added in fixed-point.

3) The fixed-point multiply operator is also described in Fig. 8. For two's complement data, the wordlength of the product is the sum of the wordlengths of the two input data minus one, in order to eliminate the superfluous sign bit. The integer wordlength becomes the sum of the two input integer wordlengths.

4) The arithmetic right shift operator shifts signals right at the bit level. Since the wordlength and the integer wordlength are not affected by this operator, a one-bit right shift halves the real value of a variable.

5) The arithmetic left shift operator shifts signals left at the bit level. In a similar way, a one-bit left shift doubles the real value of a variable.
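The wordlength bookkeeping of the add and multiply operators can be summarized by the following C++ sketch; FxFmt is an illustrative name and a simplification of the paper's class, not its actual interface.

```cpp
#include <algorithm>

// Format of a fixed-point datum: total and integer wordlengths.
struct FxFmt {
    int wl;    // total wordlength
    int iwl;   // integer wordlength
    int fwl() const { return wl - iwl; }
};

// Addition keeps one extra integer bit against overflow and every
// fractional bit of both operands, so no accuracy is lost.
FxFmt add_fmt(const FxFmt& a, const FxFmt& b) {
    int iwl = std::max(a.iwl, b.iwl) + 1;
    int fwl = std::max(a.fwl(), b.fwl());
    return {iwl + fwl, iwl};
}

// Multiplication of two's-complement data: wordlengths add, minus one
// redundant sign bit; integer wordlengths add.
FxFmt mul_fmt(const FxFmt& a, const FxFmt& b) {
    return {a.wl + b.wl - 1, a.iwl + b.iwl};
}
```

For the example above (integer/fractional wordlengths of 2/8 and 4/6), add_fmt yields an integer wordlength of five and a fractional wordlength of eight, as stated in the text.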

Fig. 8. Operators of the fixed-point class. (a) Addition. (b) Multiplication.
Fig. 9. Three-operand addition employing different architectures.
Fig. 10. A fixed-point C++ program for a first-order IIR filter.
TABLE III. COMPARISON OF SIMULATION TIME FOR A FOURTH-ORDER IIR FILTER.

The relational and assignment-based operators are also supported. The relational operators include ==, !=, >, >=, <, and <=. Since objects are compared after being interpreted as real values, the relational operators can be used between fixed-point variables as well as with floating-point variables and constants. Compound assignment operators, such as +=, -=, *=, and /=, are supported as well.

Let us reconsider the simple IIR filter presented in Section IV. The overall procedure to investigate the fixed-point behavior can be summarized as follows. 1) Develop a C++ program for the first-order IIR filter; the same filter developed in Section IV can be used. 2) Insert the fixed-point simulation directives in the same manner as in Fig. 5(a). 3) Invoke the fixed-point simulator. The simulator generates the simulation model shown in Fig. 10 and runs it. Note that only the type of the variables is converted to the fixed-point class, while the other parts of the program are not changed. After the simulation, we obtain the fixed-point output of the filter and compute a performance measure such as the signal-to-quantization-noise ratio (SQNR). The fixed-point version shows an SQNR of 52.15 dB when compared to the floating-point implementation results.

As described above, there is no loss of accuracy during the fixed-point add or multiply operations. However, the arithmetic right or left shift may cause loss of accuracy or overflows, and fixed-point format conversion, or precision degradation, occurs only at the assignment operator. Thus, the two programs shown in Fig. 9(a) and (b) can produce different fixed-point results. In Fig. 9(b), the result of the first addition is format converted to 10-bit data, added to the remaining operand, and then format converted again to the destination format. The fixed-point performance of different implementation architectures that are based on the same algorithm can be compared by exploiting this behavior.

VI. FAST FIXED-POINT SIMULATION LIBRARY

The execution speed of the fixed-point library is very important because the optimization of wordlengths may require iterative simulations, with different fixed-point formats, using long input signals [20]. However, as shown in Table III, the execution speed of the bit-accurate fixed-point operations is quite slow compared to the floating-point case. Most of today's computers, such as Pentium-based PCs and SPARC-based workstations, are equipped with fast floating-point hardware units. In fact, the fixed-point operations shown in Fig. 8 can be conducted using floating-point hardware units when the input and internal wordlengths are not very large. For example, the addition of two fixed-point data in the bit-accurate class consists of an integer wordlength comparison, a fixed-point data shift, and a fixed-point addition, as shown in Fig. 8. These steps correspond, respectively, to the exponent comparison, mantissa alignment, and mantissa addition in a floating-point arithmetic unit [21]. We developed a pfix (pseudo fixed-point) library that implements fast fixed-point operations using the hardware floating-point data-path.

In the pfix library, the assignment operator plays the most important role: it limits the accuracy of a floating-point value according to the fixed-point format of the result operand. The assignment operator converts the input data according to the fixed-point format of the left-side variable and assigns the format-converted data to this variable. If the given format of the left-side variable does not provide enough precision for representing the input data, the data is modified by saturation, overflow, rounding, or truncation, according to the fixed-point attributes of the left-side variable.

To obtain bit-accurate simulation results from the pfix library, the wordlengths of the fixed-point data must be limited: the wordlength of any fixed-point datum, including temporary results, should not exceed the wordlength of the mantissa in the floating-point unit. Note that the IEEE standard double-precision floating-point format assigns 53 bits to the mantissa. When two fixed-point data having wordlengths w1 and w2 are multiplied, the product has a wordlength of w1 + w2 - 1 for the two's complement signed case. A wordlength of 53 bits is sufficient for modeling 16- or 24-bit programmable digital signal processors, but it is not sufficient in general. For example, the pfix-library-based simulation of a digital filter having a signal wordlength of 32 bits and a coefficient wordlength of 24 bits does not produce a bit-exact result.

The accuracy of the pfix library, compared with the bit-exact library, falls into three cases. First, when the wordlengths of all the variables, including temporary ones, are less than 53 bits, the simulation results are bit-accurate. Second, when the input and output signal wordlengths are well below 53 bits but the wordlengths of temporary variables, such as products or accumulated results, exceed 53 bits, the results are not bit-accurate; still, they are accurate enough to be used for computing the SQNR or other quantization effects. For example, the pfix-based simulation of a digital filter having a signal wordlength of 32 bits and a coefficient wordlength of 24 bits does not produce a bit-accurate result when compared to the bit-exact simulation, but the SQNR measured against the double-precision simulation is nearly identical: 141.75059 dB for both the pfix and the bit-exact libraries. Finally, when the wordlengths of data or coefficients are larger than 53 bits, the pfix library is not suitable for fixed-point simulation.

The execution speeds of the various simulation methods are compared quantitatively in Table III. The simulation of the fourth-order IIR filter using the pfix library takes only 7.4 times the execution time of the original floating-point program. Although the bit-exact or VHDL-based simulation can guarantee bit-accurate results, it usually takes 50 to 200 times longer than the floating-point simulation, as shown in Table III.
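A minimal sketch of the pfix idea is shown below: values are carried in hardware doubles and quantized only at assignment to the destination format. The function name and the choice of round-to-nearest with saturation are assumptions for illustration, not the library's actual interface.

```cpp
#include <cmath>

// Quantize a double to a <wl, iwl> fixed-point format at assignment time.
double pfix_assign(double x, int wl, int iwl) {
    const int    fwl  = wl - iwl;                         // fractional bits of the destination
    const double step = std::ldexp(1.0, -fwl);            // quantization step 2^(-fwl)
    const double maxv = std::ldexp(1.0, iwl - 1) - step;  // largest representable value
    const double minv = -std::ldexp(1.0, iwl - 1);        // most negative value

    double q = std::floor(x / step + 0.5) * step;         // round to nearest
    if (q > maxv) return maxv;                            // saturate on overflow
    if (q < minv) return minv;
    return q;
}
```

As noted above, this approach stays bit-accurate only while every intermediate wordlength fits within the 53-bit double-precision mantissa.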
VII. EXAMPLE: WORDLENGTH OPTIMIZATION OF AN 8 × 8 IDCT ARCHITECTURE

The developed utility is very useful for the fixed-point performance evaluation of large C-based digital signal processing programs, such as the FS-CELP vocoder and the MPEG-2 audio decoder [16], [22]. In addition, this program can be used for the wordlength optimization of a digital signal processing algorithm based on a specific architecture. Note that the finite wordlength performance is affected not only by the algorithm but by the architecture as well. In this section, the wordlength optimization of a multiplier-adder based 8 × 8 IDCT architecture conforming to the IEEE standard specification is illustrated.

The two-dimensional discrete cosine transform has been used widely in various image and video processing standards, such as JPEG, H.261 for video telephony, MPEG, and HDTV. Fixed-point implementation of the algorithm may result in a noticeable mismatch between the encoder and the decoder. In particular, this problem can be magnified when the IDCT algorithm is used in a feedback loop for motion compensation, because the quantization error is accumulated. To solve this problem, IEEE specifies the numerical characteristics of the 8 × 8 IDCT for use in visual telephone and similar applications in IEEE Standard 1180-1990 [18]. The test procedure is also described in [23], and the output errors shall meet the following specifications when 10 000 samples of a random input data sequence are applied.

1) For any pixel location, the peak error (ppe) shall not exceed one in magnitude.
2) For any pixel location, the mean square error (pmse) shall not exceed 0.06.
3) Overall, the mean square error (omse) shall not exceed 0.02.
4) For any pixel location, the mean error (pme) shall not exceed 0.015 in magnitude.
5) Overall, the mean error (ome) shall not exceed 0.0015 in magnitude.
6) For all-zero input, the proposed IDCT shall generate all-zero output.

There have been several studies analyzing the finite wordlength effects of various fast DCT/IDCT algorithms [18], [24], [25]. However, these studies, which compared different algorithms, are not readily applicable to hardware optimization, mainly because the implementation architecture was not considered. We optimized the wordlengths of a multiplier-adder based implementation of the 8 × 8 row-column IDCT algorithm. A simulation-based wordlength optimization method is employed, which uses the input sequences specified in the IEEE standard and an accurate hardware model of the architecture to be evaluated. The bit-accurate hardware model is derived from the floating-point model by using the developed fixed-point simulation utility.

A. Multiplier-Adder Based 8 × 8 IDCT Architecture

The row-column decomposition method is the most popular way of implementing the 8 × 8 IDCT algorithm because of its structural and computational regularity. The block diagram of the row-column decomposition-based 2-D IDCT architecture is shown in Fig. 11. In order to reduce the number of arithmetic operations and keep the regularity of the 1-D IDCT unit, Chen's algorithm [26] is employed. Then, the eight-point IDCT can be calculated as two 4 × 4 matrix-vector multiplications followed by butterfly operations, as given in (15) and (16).
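A floating-point reference sketch of this even/odd (Chen) decomposition is given below: the eight-point IDCT is computed as two 4 × 4 matrix-vector products followed by a butterfly stage. The coefficient expressions follow the standard IDCT definition; this is our illustration, not the paper's program of Fig. 14.

```cpp
#include <array>
#include <cmath>

using Vec8 = std::array<double, 8>;

// Eight-point IDCT via Chen's even/odd decomposition: the even-indexed and
// odd-indexed inputs each drive a 4 x 4 matrix-vector product, and the two
// partial results are combined by butterflies.
Vec8 idct8_chen(const Vec8& X) {
    constexpr double PI = 3.14159265358979323846;
    std::array<double, 4> e{}, o{};
    for (int n = 0; n < 4; ++n) {
        for (int j = 0; j < 4; ++j) {
            int ke = 2 * j;          // even-indexed inputs X0, X2, X4, X6
            int ko = 2 * j + 1;      // odd-indexed inputs  X1, X3, X5, X7
            double ce = (ke == 0 ? std::sqrt(0.5) : 1.0) * 0.5;
            e[n] += ce  * X[ke] * std::cos((2 * n + 1) * ke * PI / 16.0);
            o[n] += 0.5 * X[ko] * std::cos((2 * n + 1) * ko * PI / 16.0);
        }
    }
    Vec8 x{};
    for (int n = 0; n < 4; ++n) {    // butterfly stage combines the two halves
        x[n]     = e[n] + o[n];
        x[7 - n] = e[n] - o[n];
    }
    return x;
}
```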

Fig. 11. Block diagram of the 2-D IDCT using the row-column decomposition.
Fig. 12. Block diagram of a multiplier-adder based 2-D IDCT.
Fig. 13. Overall procedure for wordlength optimization.

In order to calculate the matrix-vector products, the multiplier-adder based architecture shown in Fig. 12 is employed. In this figure, there are five quantization error sources: quantization of the coefficients for the row-wise and the column-wise transforms (Coeff1, Coeff2), wordlength reduction of the outputs of the first and the second multipliers (Adder1, Adder2), and the output of the limiter for the row-wise transform (1D_Out). The conventional rounding scheme is used for quantization, except for the output of the multiplier. Since the multiplier-adder chain usually comprises the critical path, we assume that the output of the multiplier is simply truncated in order not to employ an additional adder for rounding.

B. Wordlength Optimization

The overall wordlength determination procedure using the fixed-point optimization utility is shown in Fig. 13 [27]. In order to determine the integer wordlengths, the range estimation utility is used. First, C or C++ based programs modeling the various architectures are developed using floating-point arithmetic; variables and coefficients are declared as double-precision floating-point. Then, the range estimation utility estimates the ranges of internal signals during the floating-point simulation by using the new range estimation data class and the operator overloading characteristics of C++. The minimum integer wordlength that prevents overflows can be determined from the estimated range information.

A set of cost-optimum wordlengths should require the minimum hardware cost while satisfying IEEE Standard 1180-1990. The fixed-point performance is measured using the developed fixed-point simulator [28]. The wordlength optimization method of [20] is employed to find the optimum wordlengths using a small number of simulations. First, the minimum wordlengths of all signals are determined. The minimum wordlength of a signal is the smallest wordlength guaranteeing the specified system performance when all the other signals have enough precision, such as the double-precision floating-point format [20]. From this lower bound on the wordlengths, the set of optimal wordlengths that minimizes the hardware cost while satisfying the given specifications is determined. For modeling the hardware cost, the cell libraries of VLSI Technologies, Inc. are used [29]. A C program for computing the 1-D IDCT using the multiplier-adder chain is shown in Fig. 14. From the simulation results, it was found that the most crucial condition for determining the minimum wordlengths of Coeff1, Coeff2, and 1D_Out is the overall mean square error, omse.
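The accuracy constraint itself can be checked with a short routine such as the following C++ sketch, which accumulates the IEEE Std 1180-1990 error measures over the random 8 × 8 test blocks; the structure and function names are ours, introduced only for illustration.

```cpp
#include <algorithm>
#include <array>
#include <cmath>
#include <vector>

// Error measures compared against the thresholds listed in Section VII.
struct Idct1180Errors {
    double ppe = 0;            // peak pixel error
    double pmse = 0, pme = 0;  // worst per-pixel mean square error and mean error
    double omse = 0, ome = 0;  // overall mean square error and mean error
};

Idct1180Errors measure(const std::vector<std::array<double, 64>>& ref,
                       const std::vector<std::array<double, 64>>& fix) {
    const double n = static_cast<double>(ref.size());
    std::array<double, 64> se{}, e{};
    Idct1180Errors r;
    for (std::size_t b = 0; b < ref.size(); ++b)
        for (int p = 0; p < 64; ++p) {
            const double d = fix[b][p] - ref[b][p];
            r.ppe = std::max(r.ppe, std::fabs(d));
            se[p] += d * d;
            e[p]  += d;
        }
    for (int p = 0; p < 64; ++p) {
        r.pmse = std::max(r.pmse, se[p] / n);
        r.pme  = std::max(r.pme, std::fabs(e[p] / n));
        r.omse += se[p] / (64.0 * n);
        r.ome  += e[p] / (64.0 * n);
    }
    r.ome = std::fabs(r.ome);
    return r;
}
// A candidate wordlength set passes when ppe <= 1, pmse <= 0.06, omse <= 0.02,
// pme <= 0.015, and ome <= 0.0015 (plus the all-zero input requirement).
```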

Fig. 14. A multiplier-adder based 1-D IDCT program using floating-point arithmetic.
TABLE IV. OPTIMIZED WORDLENGTHS FOR THE MULTIPLIER-ADDER BASED ARCHITECTURE.

However, the peak mean error, pme, and the overall mean error, ome, play the key role in determining the minimum wordlengths of Adder1 and Adder2, because the means of the quantization errors are not zero due to truncation. The minimum and optimum wordlengths are shown in Table IV. The numbers inside parentheses show the wordlengths of the previous implementation [30]. As shown in the table, the internal wordlengths can be substantially reduced when compared with the previous work. This fixed-point utility can also be used for the optimization of a bit-serial arithmetic-based implementation of the 8 × 8 IDCT architecture.

VIII. CONCLUDING REMARKS

Fixed-point utility software that aids the scaling and wordlength optimization of algorithms written in C or C++ is developed. The integer wordlength for each fixed-point signal is automatically determined using the developed range estimator, and the finite wordlength effects can be evaluated using the fixed-point simulator. In order to obtain reliable scaling information from a finite length of simulation results, a statistical model of the range, which covers both unimodal and multimodal signals, is developed. The range estimator is very fast when compared with our previously developed Autoscaler because it collects the range information from the simulation of C programs instead of assembly programs. The fixed-point simulator can meet the requirements of bit-accuracy and fast simulation by employing two specific fixed-point classes. The bit-accurate library models the fixed-point arithmetic using software routines, but can be used to obtain bit-exact results without any practical limitation on the wordlength. The pfix library utilizes floating-point hardware for fast fixed-point simulation. This library is useful for simulation-based wordlength optimization of digital signal processing algorithms, which requires iterative fixed-point simulation with different wordlengths assigned [20]. This work can be extended to efficient VLSI or fixed-point digital signal processor-based development tools, because the optimized fixed-point digital signal processing programs can be converted easily to VHDL code or integer programs. This software has been used for the fixed-point performance evaluation of complex digital signal processing algorithms such as a CELP vocoder [16] and MPEG audio [22]. This utility is freely available to academics through our web site, http://www.vspl.snu.ac.kr.

REFERENCES

[1] L. B. Jackson, "On the interaction of roundoff noise and dynamic range in digital filters," Bell Syst. Tech. J., vol. 49, pp. 159-184, Feb. 1970.
[2] B. Liu, "Effect of finite word-length on the accuracy of digital filters: A review," IEEE Trans. Circuit Theory, vol. CT-18, pp. 670-677, Nov. 1971.
[3] H. K. Kwan, "Amplitude scaling of arbitrary linear digital networks," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, pp. 1240-1242, Dec. 1984.
[4] C. Caraiscos and B. Liu, "A roundoff error analysis of the LMS adaptive algorithm," IEEE Trans. Acoust., Speech, Signal Processing, vol. ASSP-32, pp. 34-41, Feb. 1984.
[5] P. Hilfinger, "A high-level language and silicon compiler for digital signal processing," in Proc.
Custom Integrated Circuits Conf., Los Alamitos, CA, 1985, pp. 213-216.
[6] K. W. Leary and W. Waddington, "DSP/C: A standard high level language for DSP and numeric processing," in Proc. Int. Conf. Acoustics, Speech and Signal Processing, 1990, pp. 1065-1068.
[7] Mentor Graphics, DSP Station Design Architect User's Manual, 1993.
[8] Comdisco Systems, Inc., SPW: The DSP Framework Hardware Design System User's Guide, Aug. 1990.
[9] D. M. Samani, J. Ellinger, E. J. Powers, and E. E. Swartzlander, Jr., "Simulation of variable precision IEEE floating point using C++ and its application in digital signal processor design," in Proc. 27th Annual Asilomar Conf. Signals, Systems, and Computers, 1993, pp. 1574-1578.
[10] S. Kim and W. Sung, "A floating-point to fixed-point assembly program translator for the TMS320C25," IEEE Trans. Circuits Syst. II, vol. 41, pp. 730-739, Nov. 1994.
[11] S. Kim and W. Sung, "Fixed-point simulation utility for C and C++ based digital signal processing programs," in Proc. 28th Annual Asilomar Conf. Signals, Systems, and Computers, 1994, pp. 162-166.
[12] B. Stroustrup, The C++ Programming Language, 2nd ed. Reading, MA: Addison-Wesley, 1993.
[13] K. Baudendistel, "Compiler development for fixed-point processors," Ph.D. dissertation, Georgia Inst. Technol., Sept. 1992.
[14] S. M. Kendall and A. Stuart, The Advanced Theory of Statistics. London, U.K.: Griffin, 1987, vol. 1.
[15] R. M. Stallman, Using and Porting GNU CC, Free Software Foundation, Inc., 1990.
[16] W. Sung, J. Sohn, J. Kang, and S. Kim, "Fixed-point implementation of the FS-CELP vocoder using the autoscaler for the TMS320C50," in Proc. Int. Conf. Signal Processing Applications and Technology, 1995, pp. 1883-1891.
[17] P. W. Wong, "Quantization and roundoff noises in fixed-point FIR digital filters," IEEE Trans. Signal Processing, vol. 39, pp. 1552-1563, July 1991.
[18] I. D. Yun and S. U. Lee, "On the fixed-point-error analysis of several fast DCT algorithms," IEEE Trans. Circuits Syst. Video Technol., vol. 3, pp. 27-41, Feb. 1993.

[19] D. Lea, User's Guide to the GNU C++ Library, version 2.0, Apr. 1992.
[20] W. Sung and K.-I. Kum, "Simulation-based word-length optimization method for fixed-point digital signal processing systems," IEEE Trans. Signal Processing, pp. 3087-3090, Dec. 1995.
[21] I. Koren, Computer Arithmetic Algorithms. New York: Wiley, 1993.
[22] M. S. Jeong, S. Kim, J. S. Sohn, J. Y. Kang, and W. Sung, "Finite wordlength effects evaluation of the MPEG audio decoder," in Proc. Int. Conf. Signal Processing Applications and Technology, Oct. 1996, to be published.
[23] IEEE Standard Specifications for the Implementations of 8 × 8 Inverse Discrete Cosine Transform, 1991.
[24] Y. C. Yao and C. Y. Hsu, "Comparative performance of fast cosine transform with fixed-point roundoff error analysis," IEEE Trans. Signal Processing, vol. 42, pp. 1256-1259, May 1994.
[25] I. D. Yun and S. U. Lee, "On the fixed-point-error analysis of several fast IDCT algorithms," IEEE Trans. Circuits Syst., vol. 42, pp. 685-693, Nov. 1995.
[26] W. H. Chen, C. H. Smith, and S. C. Fralick, "A fast computational algorithm for the discrete cosine transform," IEEE Trans. Commun., vol. COM-25, pp. 1004-1009, Sept. 1977.
[27] W. Sung and K.-I. Kum, "Wordlength determination and scaling software for a signal flow block diagram," in Proc. Int. Conf. Acoustics, Speech and Signal Processing, Apr. 1994, vol. 2, pp. 457-460.
[28] S. Kim, K.-I. Kum, and W. Sung, "Fixed-point optimization utility for C and C++ based digital signal processing programs," in Proc. 1995 IEEE Workshop on VLSI Signal Processing, Oct. 1995, pp. 197-206.
[29] VLSI Technologies, Inc., 1-Micron Cell Compiler Library, Nov. 1991.
[30] T. Miyazaki, T. Nishitani, M. Edahiro, and I. Ono, "DCT/IDCT processor for HDTV developed with DSP silicon compiler," J. VLSI Signal Processing, vol. 5, pp. 39-47, 1993.

Seehyun Kim (S'91-M'97) received the B.S., M.S., and Ph.D. degrees in electrical engineering from Seoul National University, Korea, in 1990, 1992, and 1996, respectively. From 1996 to 1997, he was with the University of California at Berkeley as a Postdoctoral Researcher, where he was involved in the Ptolemy project and studied an infrastructure for finite-precision implementation of digital signal processing algorithms. In 1997, he joined LG Corporate Institute of Technology, where he has been involved in developing a high-definition TV (HDTV) decoder LSI. His research interests include VLSI algorithms and architectures for signal processing and concurrent hardware and software design of embedded real-time systems.

Ki-Il Kum received the B.S. and M.S. degrees in control and instrumentation engineering from Seoul National University, Korea, in 1991 and 1994, respectively. He is working toward the Ph.D. degree at the School of Electrical Engineering, Seoul National University. His research interests include VLSI design and multiprocessor implementation of DSP algorithms and computer-aided design for digital signal processing systems, especially wordlength optimization for fixed-point DSP systems.

Wonyong Sung (S'84-M'87) received the B.S. degree in electronic engineering from Seoul National University in 1978, the M.S. degree in electrical engineering from the Korea Advanced Institute of Science and Technology (KAIST) in 1980, and the Ph.D. degree in electrical and computer engineering from the University of California, Santa Barbara, in 1987.
He has been a member of the faculty of Seoul National University since 1989. From 1980 to 1983, he worked in the Central Research Laboratory of Gold Star (currently LG Electronics) in Korea. During his Ph.D. study, he developed parallel processing algorithms, vector and multiprocessor implementations, and low-complexity FIR filter designs. From May 1993 to June 1994, he consulted with the Alta Group on the development of the fixed-point optimizer, an automatic wordlength determination and scaling software. His major research interests are the development of fixed-point optimization tools, VLSI implementation of digital signal processing, and multiprocessor-based implementations. Since January 1998, he has served as Chief of the SEED (system engineering and design center) at Seoul National University.