Implementation of DSP and Communication Systems


EURASIP Journal on Applied Signal Processing

Implementation of DSP and Communication Systems

Guest Editors: Yuke Wang and Yu Hen Hu

Copyright © 2002 Hindawi Publishing Corporation. All rights reserved. This is a special issue published in volume 2002 of EURASIP Journal on Applied Signal Processing. All articles are open access articles distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

Editor-in-Chief: K. J. Ray Liu, University of Maryland, College Park, USA

Associate Editors: Kiyoharu Aizawa, Japan; Jiri Jan, Czech Republic; Antonio Ortega, USA; Gonzalo Arce, USA; Shigeru Katagiri, Japan; Mukund Padmanabhan, USA; Jaakko Astola, Finland; Mos Kaveh, USA; Ioannis Pitas, Greece; Mauro Barni, Italy; Bastiaan Kleijn, Sweden; Raja Rajasekaran, USA; Sankar Basu, USA; Ut-Va Koc, USA; Phillip Regalia, France; Shih-Fu Chang, USA; Aggelos Katsaggelos, USA; Hideaki Sakai, Japan; Jie Chen, USA; C.-C. Jay Kuo, USA; William Sandham, UK; Tsuhan Chen, USA; S. Y. Kung, USA; Wan-Chi Siu, Hong Kong; M. Reha Civanlar, USA; Chin-Hui Lee, USA; Piet Sommen, The Netherlands; Tony Constantinides, UK; Kyoung Mu Lee, Korea; John Sorensen, Denmark; Luciano Costa, Brazil; Y. Geoffrey Li, USA; Michael G. Strintzis, Greece; Irek Defee, Finland; Heinrich Meyr, Germany; Ming-Ting Sun, USA; Ed Deprettere, The Netherlands; Ferran Marques, Spain; Tomohiko Taniguchi, Japan; Zhi Ding, USA; Jerry M. Mendel, USA; Sergios Theodoridis, Greece; Jean-Luc Dugelay, France; Marc Moonen, Belgium; Yuke Wang, USA; Pierre Duhamel, France; José M. F. Moura, USA; Andy Wu, Taiwan; Tariq Durrani, UK; Ryohei Nakatsu, Japan; Xiang-Gen Xia, USA; Sadaoki Furui, Japan; King N. Ngan, Singapore; Zixiang Xiong, USA; Ulrich Heute, Germany; Takao Nishitani, Japan; Kung Yao, USA; Yu Hen Hu, USA; Naohisa Ohta, Japan

Contents

Editorial, Yuke Wang and Yu Hen Hu, Volume 2002 (2002), Issue 9

Lightweight Floating-Point Arithmetic: Case Study of Inverse Discrete Cosine Transform, Fang Fang, Tsuhan Chen, and Rob A. Rutenbar, Volume 2002 (2002), Issue 9

High-Level Synthesis of DSP Applications Using Adaptive Negative Cycle Detection, Nitin Chandrachoodan, Shuvra S. Bhattacharyya, and K. J. Ray Liu, Volume 2002 (2002), Issue 9

Design and DSP Implementation of Fixed-Point Systems, Martin Coors, Holger Keding, Olaf Lüthje, and Heinrich Meyr, Volume 2002 (2002), Issue 9

Partitioning and Scheduling DSP Applications with Maximal Memory Access Hiding, Zhong Wang, Edwin Hsing-Mean Sha, and Yuke Wang, Volume 2002 (2002), Issue 9

P-CORDIC: A Precomputation Based Rotation CORDIC Algorithm, Martin Kuhlmann and Keshab K. Parhi, Volume 2002 (2002), Issue 9

Frequency Spectrum Based Low-Area Low-Power Parallel FIR Filter Design, Jin-Gyun Chung and Keshab K. Parhi, Volume 2002 (2002), Issue 9

Low-Complexity Versatile Finite Field Multiplier in Normal Basis, Hua Li and Chang Nian Zhang, Volume 2002 (2002), Issue 9

A Reduced-Complexity Fast Algorithm for Software Implementation of the IFFT/FFT in DMT Systems, Tsun-Shan Chan, Jen-Chih Kuo, and An-Yeu (Andy) Wu, Volume 2002 (2002), Issue 9

A DSP Based POD Implementation for High Speed Multimedia Communications, Chang Nian Zhang, Hua Li, Nuannuan Zhang, and Jiesheng Xie, Volume 2002 (2002), Issue 9

Wavelet Kernels on a DSP: A Comparison between Lifting and Filter Banks for Image Coding, Stefano Gnavi, Barbara Penna, Marco Grangetto, Enrico Magli, and Gabriella Olmo, Volume 2002 (2002), Issue 9

AVSynDEx: A Rapid Prototyping Process Dedicated to the Implementation of Digital Image Processing Applications on Multi-DSP and FPGA Architectures, Virginie Fresse, Olivier Déforges, and Jean-François Nezan, Volume 2002 (2002), Issue 9

EURASIP Journal on Applied Signal Processing 2002:9, © 2002 Hindawi Publishing Corporation

Editorial

Yuke Wang, Department of Computer Science, MS EC 31, University of Texas at Dallas, Richardson, TX, USA, yuke@utdallas.edu

Yu Hen Hu, Department of Electrical and Computer Engineering, University of Wisconsin-Madison, Madison, WI, USA, hu@engr.wisc.edu

The telecommunications, wireless communications, multimedia, and consumer electronics industries are witnessing a rapid evolution toward integrating complete systems on a single chip. Single-chip systems will increasingly have both a hardware component and a software component, where the hardware component is of a heterogeneous nature and may include a combination of ASICs, ASIPs, digital signal processors, reconfigurable processors, FPGAs, and general-purpose processors. The architecture of digital signal processors has taken many new directions, including VLIW, superscalar, SIMD, and more. The choice of architecture style and of the hardware/software combination is determined by trade-offs among cost, performance, power, time to market, and flexibility. Furthermore, the boundary between hardware and software has been blurred, while system design is characterized by ever-increasing complexity that has to be implemented within reduced time and at minimum cost. Therefore, computer-aided design tools that facilitate an easy design process are of essential importance.

The first paper proposes lightweight floating-point arithmetic, a family of customizable floating-point data formats, which bridges the design gap between software and hardware. The effectiveness of the proposed scheme is demonstrated using the inverse discrete cosine transform in the context of video coding. Such a flexible data format will find applications beyond multimedia, in areas such as wireless communication, where a wide range of precision/power/speed/area trade-offs can be made.

The second paper considers negative cycle detection in a weighted directed graph in the context of high-level synthesis for DSP systems. The paper introduces the concept of adaptive negative cycle detection and demonstrates the application of the technique to problems such as performance analysis and design space exploration in DSP applications.

The third paper introduces a design environment, FRIDGE, which supports the transformation of signal processing algorithms coded in floating-point to a fixed-point representation. FRIDGE also provides a direct link to DSP implementation by processor-specific C-code generation.

The fourth paper presents a technique useful for efficient DSP processor compiler design, which reduces the CPU idle time due to long memory access latency. The technique exploits the instruction-level parallelism among instructions of typical DSP applications.

The next three papers deal with ASIC design of various important components such as the CORDIC algorithm, FIR filters, and multiplication in GF(2^n). The CORDIC algorithm has important applications in the Hartley transform, the FFT, and the DCT. The fifth paper introduces a novel CORDIC algorithm and a novel architecture resulting in the least delay. The sixth paper introduces an efficient parallel FIR filter with a new look-ahead quantization algorithm. Finite fields GF(2^n) are of great interest for cryptosystems, and the seventh paper introduces a low-complexity pipelined multiplier for GF(2^n).

The next three papers discuss efficient implementations on DSP processors for applications in discrete multitone (DMT) communication systems, high-speed multimedia communication systems, and image coding. The 512-point IFFT/FFT is a modulation/demodulation kernel in ADSL systems, and an efficient fast algorithm, together with its DSP-processor-based implementation of the IFFT/FFT, is derived in the eighth paper. The ninth paper introduces an implementation of a point-of-deployment security module on a DSP processor (TMS320C6211). The tenth paper develops wavelet engines implemented on a DSP platform. Finally, our last paper presents a full rapid prototyping process by means of existing academic and commercial CAD tools and platforms, targeting an architecture that combines multi-DSP with an FPGA.

8 878 EURASIP Journal on Applied Signal Processing Overall, we have covered several areas in this special issue: computer-aided design environment, framework, and tools to facilitate the design of complex communication and DSP systems, ASIC based implementation of important components in communication and DSP systems, DSP processor based implementation, and integration of current tools We thank the authors, reviewers, the publisher, the editorial committee, and the EIC, for the tremendous amount of effort they put into this special issue to make it a success We believe the readers will find the results presented in this special issue useful for their own design and implementation problems Yuke Wang Yu Hen Hu Yuke Wang received his BS degree from the University of Science and Technology of China, Hefei, China, in 1989, the MS and the PhD degrees from the University of Saskatchewan, Canada, in 1992 and 1996, respectively He has held faculty positions at Concordia University, Canada, and Florida Atlantic University, Florida, USA Currently he is an Assistant Professor at the Computer Science Department, University of Texas at Dallas He has also held visiting assistant professor positions at the University of Minnesota, the University of Maryland, and the University of California at Berkeley Dr Wang is currently an Editor of IEEE Transactions on Circuits and Systems, Part II, an Editor of IEEE Transactions on VLSI Systems, an Editor of EURASIP Journal on Applied Signal Processing, and a few other journals Dr Wang s research interests include VLSI design of circuits and systems for DSP and communication, computer aided design, and computer architectures During , he has published about 60 papers among which about 20 papers are in IEEE/ACM Transactions Yu Hen Hu is a faculty member at the Department of Electrical and Computer Engineering, University of Wisconsin, Madison He received BSEE from National Taiwan University, and MSEE and PhD degrees from University of Southern California Prior to joining University of Wisconsin, he was a faculty member in the Electrical Engineering Department of Southern Methodist University, Dallas, Texas His research interests include multimedia signal processing, artificial neural networks, fast algorithms and design methodology for application specific micro-architectures, as well as computer-aided design tools He has published more than 180 technical papers in these areas Dr Hu is a fellow of IEEE He is a former Associate Editor ( ) for the IEEE Transaction of Acoustic, Speech, and Signal Processing in the areas of system identification and fast algorithms He served as the secretary of the IEEE signal processing society ( ), a board member at IEEE neural network council, and is currently a steering committee member of the International conference of Multimedia and Expo on behalf of IEEE Signal Processing Society

9 EURASIP Journal on Applied Signal Processing 2002:9, c 2002 Hindawi Publishing Corporation Lightweight Floating-Point Arithmetic: Case Study of Inverse Discrete Cosine Transform Fang Fang Department of Electrical and Computer Engineering, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA ffang@ececmuedu Tsuhan Chen Department of Electrical and Computer Engineering, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA tsuhan@ececmuedu Rob A Rutenbar Department of Electrical and Computer Engineering, Carnegie Mellon University, 5000 Forbes Avenue, Pittsburgh, PA 15213, USA rutenbar@ececmuedu Received 15 May 2001 and in revised form 9 May 2002 To enable floating-point (FP) signal processing applications in low-power mobile devices, we propose lightweight floating-point arithmetic It offers a wider range of precision/power/speed/area trade-offs, but is wrapped in forms that hide the complexity of the underlying implementations from both multimedia software designers and hardware designers Libraries implemented in C and Verilog provide flexible and robust floating-point units with variable bit-width formats, multiple rounding modes and other features This solution bridges the design gap between software and hardware, and accelerates the design cycle from algorithm to chip by avoiding the translation to fixed-point arithmetic We demonstrate the effectiveness of the proposed scheme using the inverse discrete cosine transform (IDCT), in the context of video coding, as an example Further, we implement lightweight floating-point IDCT into hardware and demonstrate the power and area reduction Keywords and phrases: floating-point arithmetic, customizable bit-width, rounding modes, low-power, inverse discrete cosine transform, video coding 1 INTRODUCTION Multimedia processing has been finding more and more applications in mobile devices A lot of effort must be spent to manage the complexity, power consumption, and timeto-market of the modern multimedia system-on-chip (SoC) designs However, multimedia algorithms are computationally intensive, rich in costly FP arithmetic operations rather than simple logic FP arithmetic hardware offers a wide dynamic range and high computation precision, yet occupies large fractions of total chip area and energy budget Therefore, its application on mobile computing chip is highly limited Many embedded microprocessors such as the StrongARM [1] do not include an FP unit due to its unacceptable hardware cost So there is an obvious gap in multimedia system development: software designers prototype these algorithms using high-precision FP operations, to understand how the algorithm behaves, while the silicon designers ultimately implement these algorithms into integer-like hardware, that is, fixed-point units This seemingly minor technical choice actually creates severe consequences: the need to use fixedpoint operations often distorts the natural form of the algorithm, forces awkward design trade-offs, and even introduces perceptible artifacts Error analysis and word length optimization of fixed-point 2D IDCT, inverse discrete cosine transform, algorithm has been studied in [2], and a tool for translating FP algorithms to fixed-point algorithms was presented in [3] However, such optimization and translation are based on human knowledge of the dynamic range, precision requirements, and the relationship between algorithm s architecture and precision This time-consuming and errorprone procedure often becomes the bottleneck of the 
entire system design flow In this paper, we propose an effective solution: lightweight FP arithmetic This is essentially a family of customizable FP data formats that offer a wider range of precision/power/speed/area trade-offs, but wrapped in forms that

hide the complexity of the underlying implementations from both multimedia algorithm designers and silicon designers. Libraries implemented in C and Verilog provide flexible and robust FP units with variable bit-width formats, multiple rounding modes, and other features. This solution bridges the design gap between software and hardware and accelerates the design cycle from algorithm to chip. Algorithm designers can translate FP arithmetic computations transparently to lightweight FP arithmetic and adjust the precision easily to what is needed. Silicon designers can use the standard ASIC or FPGA design flow to implement these algorithms using the arithmetic cores we provide, which consume less power than standard FP units. Manual translation from FP algorithms to fixed-point algorithms can be eliminated from the design cycle.

We test the effectiveness of our lightweight arithmetic library using an H.263 video decoder. Typical multimedia applications working with modest-resolution human sensory data such as audio and video do not need the whole dynamic range and precision that IEEE-standard FP offers. By reducing the complexity of FP arithmetic in many dimensions, such as narrowing the bit-width, simplifying the rounding methods and the exception handling, and even increasing the radix, we explore the impact of such lightweight arithmetic on both the algorithm performance and the hardware cost. Our experiments show that for the H.263 video decoder, an FP representation with less than half of the IEEE standard FP bit-width can produce almost the same perceptual video quality. Specifically, only 5 exponent bits and 8 mantissa bits for a radix-2 FP representation, or 3 exponent bits and 11 mantissa bits for a radix-16 FP representation, are all we need to maintain the video quality. We also demonstrate that a simple rounding mode is sufficient for video decoding and offers an enormous reduction in hardware cost. In addition, we implement a core algorithm of the video codec, the IDCT, in hardware using the lightweight arithmetic unit. Compared to a conventional 32-bit FP IDCT, our approach reduces the power consumption by 89.5%.

The paper is organized as follows. Section 2 briefly introduces the relevant background on FP and fixed-point representations. Section 3 describes our C and Verilog libraries of lightweight FP arithmetic and the usage of the libraries. Section 4 explores the complexity reduction we can achieve for the IDCT built with our customizable library. Based on the results in this section, we present the implementation of lightweight FP arithmetic units and analyze the hardware cost reduction in Section 5. In Section 6, we compare the area/speed/power of a standard FP IDCT, a lightweight FP IDCT, and a fixed-point IDCT. Concluding remarks follow in Section 7.

2 BACKGROUND

2.1 Floating-point representation versus fixed-point representation

There are two common ways to specify real numbers: FP and fixed-point representations. FP can represent numbers on an exponential scale and is reputed for a wide dynamic range. The data format consists of three fields: sign, exponent, and fraction (also called mantissa), as shown in Figure 1. Dynamic range is determined by the exponent bit-width, and resolution is determined by the fraction bit-width.

[Figure 1: FP number representation — fields s | exp | frac; FP value = (−1)^s × 2^(exp − bias) × 1.frac (the leading 1 is implicit), 0 < exp < 255, bias = 127.]

[Figure 2: Fixed-point number representation — fields int | frac with a fixed radix point.]

The widely adopted IEEE single FP standard [4] uses an 8-bit exponent that can reach a dynamic range roughly from 2^−126 to 2^127, and a 23-bit fraction that can provide a resolution of 2^−23 × 2^exp, where exp stands for the value represented by the exponent field. In contrast, the fixed-point representation is on a uniform scale, that is, essentially the same as the integer representation, except for the fixed radix point. For instance (see Figure 2), a 32-bit fixed-point number with a 16-bit integer part and a 16-bit fraction part can provide a dynamic range of 2^−16 to 2^16 and a resolution of 2^−16.

When prototyping algorithms with FP, programmers do not have to be concerned about dynamic range and precision, because IEEE standard FP provides more than necessary for most general applications. Hence, float and double are standard parts of programming languages like C and are supported by most compilers. However, in terms of hardware, the arithmetic operations of FP need to deal with the three parts (sign, exponent, fraction) individually, which adds substantially to the complexity of the hardware, especially in the aspect of power consumption, while fixed-point operations are almost as simple as integer operations. If the system has a stringent power budget, the application of FP units has to be limited; on the other hand, a lot of manual work is spent in implementing and optimizing fixed-point algorithms to provide the necessary dynamic range and precision.

2.2 IEEE-754 floating-point standard

IEEE-754 is a standard for binary FP arithmetic [4]. Since our later discussion of lightweight FP is based on this standard, we give a brief review of its main features in this section.

Data format. The standard defines two primary formats, single precision (32 bits) and double precision (64 bits). The bit-widths of the three fields and the dynamic ranges of single- and double-precision FP are listed in Table 1.
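As a worked illustration of the format in Figure 1 (an example added here for clarity; the specific bit pattern is our own choice, not from the original figure), consider a single-precision word with sign s = 0, stored exponent exp = 124, and fraction frac = 0.25 (binary 0100…0):

$$(-1)^{0}\times 2^{124-127}\times(1+0.25)=2^{-3}\times 1.25=0.15625 .$$

Exactly the same decoding rule applies to the narrower lightweight formats introduced in Section 4; only the field widths and the bias change.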

Table 1: IEEE FP number format and dynamic range.

Format | Sign | Exp | Frac | Bias | Max | Min (normalized)
Single | 1 | 8 | 23 | 127 | ≈3.4 × 10^38 | ≈1.2 × 10^−38
Double | 1 | 11 | 52 | 1023 | ≈1.8 × 10^308 | ≈2.2 × 10^−308

[Figure 3: Denormalized number format — an all-zero exponent field with no implicit leading 1 before the fraction.]

[Figure 4: Infinite and NaN representations — infinity is encoded with an all-ones exponent and frac == 0; NaN with an all-ones exponent and frac != 0. NaN is assigned when some invalid operation occurs, such as ∞ − ∞, 0 × ∞, or 0/0.]

Rounding. The default rounding mode is round-to-nearest. If there is a tie between the two nearest neighbors, the result is rounded to the one whose least significant bit is zero. The three user-selectable rounding modes are round-toward +∞, round-toward −∞, and round-toward-zero (also called truncation).

Denormalization. Denormalization is a way to allow gradual underflow. For normalized numbers, because there is an implicit leading 1, the smallest positive value is 2^−126 for single precision (it is not 2^−127 because the exponent with all zeros is reserved for denormalized numbers). Values below this can be represented by a so-called denormalized format (Figure 3) that does not have the implicit leading 1. The value of a denormalized number is 0.frac × 2^−126, and the smallest representable value is hence scaled down to 2^−23 × 2^−126 (about 1.4 × 10^−45). Denormalization provides graceful degradation of precision for computations on very small numbers. However, it complicates hardware significantly and slows down the more common normalized cases.

Exception handling. There are five types of exceptions defined in the standard: invalid operation, division by zero, overflow, underflow, and inexact. As shown in Figure 4, some bit patterns are reserved for these exceptions. When numbers simply cannot be represented, the format returns a pattern called NaN (not-a-number), with information about the problem. NaNs provide an escape mechanism to prevent system crashes in case of invalid operations. In addition to assigning specific NaN bit patterns, some status flags and trapping signals are used to indicate exceptions.

From the above review, we can see that IEEE standard FP arithmetic has a strong capability to represent real numbers as accurately as possible and is very robust when exceptions occur. However, if the FP arithmetic unit is dedicated to a particular application, the IEEE-mandated 32- or 64-bit width may provide more precision and dynamic range than needed, and many other features may be unnecessary as well.

3 CUSTOMIZABLE LIGHTWEIGHT FLOATING-POINT LIBRARY

The goal of our customizable lightweight FP library is to provide more flexibility than IEEE standard FP in bit-width, rounding, and exception handling. We created matched C and Verilog FP arithmetic libraries that can be used during algorithm/circuit simulation and circuit synthesis. With the C library, software designers can simulate the algorithms with lightweight arithmetic and decide the minimal bit-width, rounding mode, and so forth, according to the numerical performance. Then, with the Verilog library, hardware designers can plug the parameterized FP arithmetic cores into the system and synthesize it to a gate-level circuit. Our libraries provide a way to move the FP design choices (bit-width, rounding, ...) upward to the algorithm design stage and better predict the performance during early algorithm simulation.

3.1 Easy-to-use C++ class Cmufloat for algorithm designers

Our lightweight FP class is called Cmufloat and is implemented by overloading the existing C++ arithmetic operators (+, −, *, /, =). It allows direct operations, including assignment, between Cmufloat and any C data type except char. The bit-width of Cmufloat varies from 1 to 32, including sign, fraction, and exponent bits, and is specified during the variable declaration. Three rounding modes are supported: round-to-nearest, Jamming, and truncation, one of which is chosen by defining a symbol in an appropriate configuration file. An explanation of our rounding modes is presented in detail later. In Figure 5, we summarize the operators of Cmufloat and give some examples of using it.

[Figure 5: Operators and examples of Cmufloat — (a) arithmetic, comparison (==, >=, >, <=, <, !=), and assignment operators between Cmufloat and double, float, int, and short; (b) example declarations and uses, such as Cmufloat<14,5> a = 0.5; (14-bit fraction and 5-bit exponent), Cmufloat<> b = 1.5; (default IEEE-standard float), Cmufloat<18,6> c[2]; (an array), assigning a result to a float, mixed operations between float and Cmufloat, I/O streams, and function calls.]

Our implementation of lightweight FP offers two advantages. First, it provides a transparent mechanism to embed Cmufloat numbers in programs. As shown in the example, designers can use Cmufloat as a standard C data type. Therefore, the overall structure of the source code can be preserved, and a minimal amount of work is spent in translating a standard FP program to a lightweight FP program. Second, the arithmetic operators are implemented by bit-level manipulation, which carefully emulates the hardware implementation. We believe the correspondence between software and hardware is more exact than in previous work [5, 6]. Those other approaches appear to have implemented the operators by simply quantizing the results of standard FP operations into limited bits, which effectively allows more bit-width for the intermediate operations, while our approach guarantees that the results of all operations, including the intermediate results, are consistent with the hardware implementation. Hence, the numerical performance of the system during early algorithm simulation is more trustworthy.
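To make the idea concrete, the following is a minimal, self-contained C++ sketch of how a Cmufloat-style type can be emulated; it is not the authors' Cmufloat library. The class name LightFloat, the double-based emulation, the crude symmetric exponent range, and the truncation-only rounding are illustrative assumptions of ours — the sketch only mirrors the operator-overloading, re-quantize-after-every-operation behavior described above.

```cpp
#include <cmath>
#include <iostream>

// Illustrative lightweight-float wrapper (requires C++11).
// FRAC = number of stored fraction bits, EXP = number of exponent bits.
// Every operation is re-quantized, mimicking hardware of that width.
template <int FRAC, int EXP>
class LightFloat {
public:
    LightFloat(double v = 0.0) : val_(quantize(v)) {}
    operator double() const { return val_; }

    friend LightFloat operator+(LightFloat a, LightFloat b) { return LightFloat(a.val_ + b.val_); }
    friend LightFloat operator-(LightFloat a, LightFloat b) { return LightFloat(a.val_ - b.val_); }
    friend LightFloat operator*(LightFloat a, LightFloat b) { return LightFloat(a.val_ * b.val_); }
    friend LightFloat operator/(LightFloat a, LightFloat b) { return LightFloat(a.val_ / b.val_); }

private:
    static double quantize(double v) {
        if (v == 0.0) return 0.0;
        int e;
        double m = std::frexp(v, &e);            // v = m * 2^e, 0.5 <= |m| < 1
        const int emax = 1 << (EXP - 1);         // crude symmetric exponent range
        if (e > emax)  return std::copysign(std::ldexp(1.0, emax), v);  // saturate
        if (e < -emax) return 0.0;               // flush tiny values to zero (no denormals)
        // Keep FRAC+1 significant bits (implicit 1 + FRAC stored bits), truncate the rest.
        double kept = std::trunc(std::ldexp(m, FRAC + 1));
        return std::ldexp(kept, e - (FRAC + 1));
    }
    double val_;
};

int main() {
    LightFloat<8, 5> a = 0.1, b = 0.2;   // 8 fraction bits, 5 exponent bits
    LightFloat<8, 5> c = a * b + a;
    std::cout << static_cast<double>(c) << "\n";  // close to, but not exactly, 0.12
    return 0;
}
```

The authors' actual Cmufloat manipulates the sign, exponent, and fraction fields at the bit level so that every intermediate result matches the Verilog cores exactly; the double-based emulation above is only a convenient way to experiment with the precision trade-offs explored in Section 4.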

3.2 Parameterized Verilog library for silicon designers

We provide a rich set of lightweight FP arithmetic units (adders, multipliers) in the form of parameterized Verilog. First, designers can choose implementations according to the rounding mode and the exception handling. Then they can specify the bit-widths of the fraction and exponent by parameters. With this library, silicon designers are able to simulate the circuit at the behavioral level and synthesize it into a gate-level netlist. The availability of such cores makes possible a wider set of design trade-offs (power, speed, area, accuracy) for multimedia tasks.

4 REDUCING THE COMPLEXITY OF FLOATING-POINT ARITHMETIC FOR A VIDEO CODEC IN MULTIPLE DIMENSIONS

Most multimedia applications process modest-resolution human sensory data, which allows hardware implementations to use low-precision arithmetic computations. Our work aims to find out how much the precision and the dynamic range can be reduced from the IEEE standard FP without perceptual quality degradation. In addition, the impact of other features of the IEEE standard, such as the rounding mode, denormalization, and the radix choice, is also studied. Specifically, we target the IDCT algorithm in an H.263 video codec, since it is the only module that really uses FP computations in the codec, and it is also common in many other media applications, such as image processing, audio compression, and so forth.

In Figure 6, we give a simplified diagram of a video codec. On the decoder side, the input compressed video is put into the IDCT after inverse quantization. After some FP computations in the IDCT, the DCT coefficients are converted to pixel values that are the differences between the previous frame and the current frame. The last step is to obtain the current frame by adding up the outputs of the IDCT and the previous frame after motion compensation.

Considering the IEEE representation of FP numbers, there are five dimensions that we can explore in order to reduce the hardware complexity (Table 2). An accuracy versus hardware cost trade-off is made in each dimension. In order to measure the accuracy quantitatively, we integrate the Cmufloat IDCT into a complete video codec and measure the PSNR (peak signal-to-noise ratio) of the decoded video, which reflects the decoded video quality:

$$\mathrm{PSNR} = 10 \log_{10} \frac{255^2}{\frac{1}{N}\sum_{i=0}^{N-1}\left(p_i - f_i\right)^2}, \qquad (1)$$

where N is the total number of pixels, p_i stands for the pixel value decoded by the lightweight FP algorithm, and f_i stands for the reference pixel value of the original video.

4.1 Reducing the exponent and fraction bit-width

Reducing the exponent bit-width. The exponent bit-width determines the dynamic range. Using a 5-bit exponent as an example, we derive the dynamic range in Table 3. Complying with the IEEE standard, the exponent with all 1s is reserved for infinity and NaN. With a bias of 15, the dynamic range for a 5-bit exponent is from 2^−14 or 2^−15 to 2^16, depending on the support for denormalization. In order to decide the necessary exponent bit-width for our IDCT, we collected histogram information on the exponents of all the variables in the IDCT algorithm during decoding of a video sequence (see Figure 7). The range of these exponents lies in [−22, 10], which is consistent with the theoretical result in [7]. From the dynamic range analysis above, we know that a 5-bit exponent can almost cover such a range except when numbers are extremely small, while a 6-bit exponent can cover the entire range. However, our experiment shows that a 5-bit exponent is able to produce the same PSNR as an 8-bit exponent.

Reducing the fraction bit-width. Reducing the fraction bit-width is the most practical way to lower the hardware cost, because the complexity of an integer multiplier is reduced quadratically with decreasing bit-width [8]. On the other hand, the accuracy is degraded when narrowing the bit-width. The influence of decreasing bit-width on video quality is shown in Figure 8. As we can see in the curve, PSNR remains almost constant across a rather wide range of fraction bit-widths, which means that the fraction width does not affect the decoded

13 Lightweight Floating-Point Arithmetic: Case Study of Inverse Discrete Cosine Transform 883 DCT Q Transmit IQ IDCT IQ IDCT Cmufloat algorithm Motion compensation Encoder D D Motion compensation Decoder Figure 6: Video codec diagram Q: quantizer, IQ: inverse quantizer, D: delay Table 2: Working dimensions for lightweight FP Dimension Smaller exponent bit-width Smaller fraction bit-width Simpler rounding mode No support for denormalization Higher radix Description Reduce the number of exponent bits at the expense of dynamic range Reduce the number of fraction bits at the expense of precision Choose simpler rounding mode at the expense of precision Do not support denormalization at the expense of precision Increase the implied radix from 2 to 16 for the FP exponent (higher radix FP needs less exponent bits and more fraction bits than radix-2 FP, in order to achieve the comparable dynamic range and precision) Table 3: Dynamic range of a 5-bit exponent Exponent Value of exp-bias Dynamic range Biggest exponent Smallest exponent (with support for denormalization) Smallest exponent (no support for denormalization) Exponent Figure 7: Histogram of exponent value video quality in this range The cutoff point where PSNR starts dropping is as small as 8 bits this is about 1/3 of the fraction-width of an IEEE standard FP The difference of PSNR between an 8-bit fraction FP and 23-bit fraction is only 022 db, which is almost not perceptible to human eyes One frame of the video sequence is compared in Figure 9 The top one is decoded by a full precision FP IDCT, and the bottom one is decoded by a 14-bit FP IDCT From the perceptual quality point of view, it is hard to tell the major difference between these two From the above analysis, we can reduce the total bitwidth from 32 bits to 14 bits (1-bit sign 5-bit exponent 8-bit fraction), while preserving good perceptual quality In order to generalize our result, the same experiment is carried on three other video sequences: Akiyo, Stefan, mobile, all of which have around 02 db degradation in PSNR when 14-bit FP is applied The relationship between video compression ratio and the minimal bit-width For streaming video, the lower the bit rate, the worse is the video quality The difference in bit rate is mainly caused by the quantization step size during encoding A larger step size means coarser quantization and therefore worse video quality Considering the limited wireless network bandwidth and the low display quality of mobile devices, a relatively low bit rate is preferred to transfer video in this situation Hence, we

14 884 EURASIP Journal on Applied Signal Processing PSNR (db) Cutoff point PSNR Fraction width Fraction width Figure 8: PSNR versus fraction width Quantization step size = 4 Quantization step size = 8 Quantization step size = 16 Figure 10: Comparison of different quantization step (a) One frame decoded by 32-bit FP 1-bit sign 8-bit exp 23-bit fraction PSNR (db) Fraction width Intracoding Intercoding Figure 11: Comparison of intracoding and intercoding (b) One frame decoded by 14-bit FP 1-bit sign 5-bit exp 8-bit fraction Figure 9: Video quality comparison want to study the relationship between compression ratio, or quantization step size and the minimal bit-width We compare the PSNR curves obtained by experiments with different quantization step sizes in Figure 10 From the figure, we can see that for a larger quantization step size, the PSNR is lower, but at the same time the curve drops more slowly and the minimal bit-width can be reduced further This is because coarse quantization can hide more computation error under the quantization noise Therefore, less computational precision is needed for the video codec using a larger quantization step size The relationship between inter/intra-coding and the minimal bit-width Intercoding refers to the coding of each video frame with reference to the previous video frame That is, only the difference between the current frame and the previous frame is coded, after motion compensation Since in most videos, adjacent frames are highly correlated, intercoding provides very high efficiency Based on the assumption that video frames have some correlation, for intercoding, the differences coded as the inputs to DCT are typically much smaller than regular pixel values Accordingly, the DCT coefficients, or inputs to IDCT are also smaller than those in intracoding One property of FP numbers is that representation error is smaller when the number to be represented is smaller From Figure 11, the PSNR of intercoding drops more slowly than intracoding, or the minimum bit-width of intercoding can be 1 bit less than intracoding However, if the encoder uses the full precision FP IDCT, while the decoder uses the lightweight FP IDCT, then the error propagation effect of intercoding cannot be eliminated In that case, intercoding does not have the above advantage The results in this section demonstrates that some pro-

grams do not need the extreme precision and dynamic range provided by IEEE standard FP. Applications dealing with modest-resolution human sensory data can tolerate some computation error in the intermediate or even final results, while giving similar human perceptual results. Some experiments on other applications also agree with this assertion. We also applied Cmufloat to an MP3 decoder; the bit-width can be cut down to 14 bits (1-bit sign + 6-bit exponent + 7-bit fraction) and the noise behind the music is still not perceptible. In the CMU Sphinx application, a speech recognizer, 11-bit FP (1-bit sign + 6-bit exponent + 4-bit fraction) can maintain the same recognition accuracy as 32-bit FP. Such dramatic bit-width reduction offers an enormous advantage that may broaden the application of lightweight FP units in mobile devices.

Finally, we note that the numerical analysis and precision optimization in this section can be implemented in a semiautomated way with appropriate compiler support. We can extend an existing C compiler to handle lightweight arithmetic operations and assist the process of exploring the precision trade-offs with less programmer intervention. This will unburden designers from translating code manually into proper limited-precision formats.

4.2 Rounding modes

When an FP number cannot be represented exactly, or the intermediate result is beyond the allowed bit-width during computation, the number is rounded, introducing an error of less than the value of the least significant bit. Among the four rounding modes specified by the IEEE FP standard, round-to-nearest, round-toward +∞, and round-toward −∞ need an extra adder in the critical path, while round-toward-zero is the simplest in hardware but the least accurate in precision. Since round-to-nearest has the most accuracy, we implement it as our baseline of comparison. There is another classical alternative mode that may have potential in both accuracy and hardware cost: von Neumann rounding [9]. We discuss these three rounding modes (round-to-nearest, Jamming, round-toward-zero) in detail below.

Round-to-nearest. In the standard FP arithmetic implementation, there are three bits beyond the significant bits that are kept for intermediate results [10] (see Figure 12). The sticky bit is the logical OR of all bits thereafter. With b denoting the least significant retained bit, these three bits participate in rounding in the following way:

- b000–b011: truncate the tail bits;
- b100: if b is 1, add 1 to b; if b is 0, truncate the tail bits;
- b101–b111: add 1 to b.

[Figure 12: The guard, round, and sticky bits kept beyond the significant bits.]

Round-to-nearest is the most accurate rounding mode, but it needs some comparison logic and a carry-propagate adder in hardware. Further, since the rounding can actually increase the fraction magnitude, it may require extra normalization steps, which cause additional fraction and exponent calculations.

Jamming. The rule for Jamming rounding is as follows: if b is 1, then truncate the 3 tail bits; if b is 0 and there is a 1 among those 3 bits, then add 1 to b; else, if b and those 3 bits are all 0, then truncate those 3 bits. Essentially, it is the function of an OR gate (see Figure 13).

[Figure 13: Jamming rounding — the retained least significant bit b is ORed with the tail bits, and the tail bits are then truncated.]

Jamming is extremely simple as hardware, almost as simple as truncation, but numerically more attractive for one subtle but important reason. The rounding created by truncation is biased: the rounded result is always smaller than the correct value. Jamming, by sometimes forcing a 1 into the least significant bit position, is unbiased. The magnitude of Jamming errors is no different from truncation, but the mean of the errors is zero. This important distinction was recognized by von Neumann almost 50 years ago [9].

Round-toward-zero. The operation of round-toward-zero is just truncation. This mode has no overhead in hardware, and it does not have to keep 3 more bits for the intermediate results, so it is much simpler in hardware than the first two modes.

The PSNR curves for the same video sequence obtained using these three rounding modes are shown in Figure 14. The three rounding modes produce almost the same PSNR when the fraction bit-width is more than 8 bits. At the point of 8 bits, the PSNR of truncation is about 0.2 dB worse than the other two.

[Figure 14: Comparison of rounding modes — PSNR versus fraction width for round-to-nearest, truncation, and Jamming.]

On the other hand, from the hardware point of view, Jamming is much simpler than round-to-nearest, and truncation is the simplest among the three modes. So a trade-off between quality and complexity has to be made between Jamming and truncation. We will finalize the choice of rounding mode in the hardware implementation section.
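The three rounding behaviors are easy to prototype in software. Below is a small, self-contained C++ sketch (our own illustration, not the paper's library) that applies round-to-nearest-even, Jamming (von Neumann) rounding, and truncation to a significand carrying guard, round, and sticky bits, following the rules just described.

```cpp
#include <cstdint>
#include <iostream>

// 'sig' holds a significand with 3 extra tail bits (guard, round, sticky)
// in its least significant positions. Each function returns the rounded
// significand with those 3 tail bits removed.

std::uint32_t round_to_nearest_even(std::uint32_t sig) {
    std::uint32_t tail = sig & 0x7u;   // guard/round/sticky pattern
    std::uint32_t kept = sig >> 3;     // retained bits, LSB is 'b'
    if (tail > 4u)  return kept + 1;                        // b101..b111: add 1 to b
    if (tail == 4u) return (kept & 1u) ? kept + 1 : kept;   // b100: tie, round to even
    return kept;                                            // b000..b011: truncate
}

std::uint32_t round_jamming(std::uint32_t sig) {
    std::uint32_t tail = sig & 0x7u;
    std::uint32_t kept = sig >> 3;
    return kept | (tail != 0u ? 1u : 0u);  // OR a 1 into b if any tail bit is set
}

std::uint32_t round_toward_zero(std::uint32_t sig) {
    return sig >> 3;                       // plain truncation
}

int main() {
    // Significand 1011 followed by tail bits 100 (an exact tie).
    std::uint32_t sig = (11u << 3) | 4u;
    std::cout << "nearest-even: " << round_to_nearest_even(sig)
              << "  jamming: "    << round_jamming(sig)
              << "  truncate: "   << round_toward_zero(sig) << "\n";  // 12, 11, 11
    return 0;
}
```

In this example only round-to-nearest-even generates a carry out of the kept bits (11 becomes 12); that carry-propagate add, and the renormalization it may trigger, is the hardware cost the text refers to, while Jamming reduces to a single OR gate and truncation to nothing at all.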

4.3 Denormalization

The IEEE standard allows for a special set of non-normalized numbers that represent magnitudes very close to zero. We illustrate this in Figure 15 with an example of a 3-bit fraction. Without denormalization there is an implicit 1 before the fraction, so the actual smallest fraction is 1.000, while with denormalization the leading 1 is not enforced, so the smallest fraction is scaled down to 0.001.

[Figure 15: Denormalization — with a 3-bit fraction, the smallest fraction is 1.000 without denormalization and 0.001 with denormalization.]

This mechanism provides more precision for scientific computation with small numbers, but for multimedia applications, and especially for a video codec, do those small numbers during the computation affect the video quality? We experimented on the IDCT with a 5-bit-exponent Cmufloat representation; 5 bits were chosen to ensure that no overflow would happen during the computation. But from the histogram of Figure 7, there are still some numbers below the threshold of normalized numbers. That means that if denormalization is not supported, these numbers will be rounded to zero. However, the experiment shows that the PSNRs with and without denormalization are the same, which means that denormalization does not affect the decoded video quality at all.

4.4 Higher radix for the FP exponent

The exponent of IEEE standard FP is based on radix 2. Historically, there have also been systems based on radix 16, for example, the IBM 390 [11]. The advantage of radix 16 lies mainly in fewer types of shifting during prealignment and normalization, which can reduce the shifter complexity in the FP adder and multiplier. We will return to this issue in Section 5.4 when we discuss the hardware implementation in more detail. The potential advantage of a higher radix such as 16 is that a smaller exponent bit-width is needed for the same dynamic range as radix-2 FP, while the disadvantage is that a larger fraction bit-width has to be chosen to maintain comparable precision. We analyze these features in the following.

Exponent bit-width. The dynamic range represented by an i-bit exponent is approximately from $\beta^{-2^{i-1}}$ to $\beta^{2^{i-1}}$, where $\beta$ is the radix. Assume we use an i-bit exponent for radix-2 FP and a j-bit exponent for radix-16 FP. If they have the same dynamic range, then

$$2^{2^{i-1}} = 16^{2^{j-1}}, \quad\text{or}\quad j = i - 2 .$$

Specifically, if the exponent bit-width for radix 2 is 5, then only 3 bits are needed for the radix-16 FP to reach approximately the same range.

Fraction bit-width. The precision of an FP number is mainly determined by the fraction bit-width, but the radix of the exponent also plays a role, due to the way that normalization works. Normalization ensures that no number can be represented with two or more bit patterns in the FP format, thus maximizing the use of the finite number of bit patterns. Radix-2 numbers are normalized by shifting to ensure a leading bit of 1 in the most significant fraction bit; the IEEE format actually makes this bit implicit, that is, it is not physically stored in the number. For radix 16, however, normalization means that the first digit of the fraction, that is, the most significant 4 bits after the radix point, is never 0000. Hence there are four bit patterns that can appear as the leading digit of the radix-16 fraction (see Table 4). In other words, the radix-16 fraction uses its available bits in a less efficient way, because the leading zeros reduce the number of significant bits of precision. We analyze the loss of significant precision bits in Table 4. The significant bit-width of a radix-2 i-bit fraction is i + 1, while for a radix-16 j-bit fraction the significant bit-width is j, j − 1, j − 2, or j − 3, each with probability 1/4. The minimum fraction bit-width of a radix-16 FP that can guarantee precision not less than radix-2 must satisfy the inequality

$$\min\{j, j-1, j-2, j-3\} \ge i + 1, \qquad (2)$$

so the minimum fraction bit-width is i + 4. Actually, it can provide more precision than radix-2 FP, since j, j − 1, and j − 2 are larger than i + 1. From previous discussions, we know that 14-bit radix-2 FP (1-bit sign + 5-bit exponent + 8-bit fraction) can produce good video quality. Moving to radix 16, two fewer exponent bits (3 bits) and four more fraction bits (12 bits), or 16 bits of total FP width, can guarantee comparable video quality.
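As a quick check of these two relations (a worked instance added for clarity), take the radix-2 format found sufficient above, with 5 exponent bits and an 8-bit fraction. For the exponent, the same dynamic range is reached with $j = 5 - 2 = 3$ radix-16 bits, since

$$2^{2^{5-1}} = 2^{16} = 16^{4} = 16^{2^{3-1}} .$$

For the fraction, inequality (2) with $i = 8$ requires $\min\{j, j-1, j-2, j-3\} = j - 3 \ge 9$, that is, $j \ge 12$ radix-16 fraction bits in the worst case; the experiments reported below show that 11 bits are enough on average.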

Table 4: Comparison of radix-2 and radix-16 FP formats.

Format | Normal form of fraction | Range of fraction | Significant bits
Radix-2 (i-bit fraction) | 1.xx…x | 1 ≤ f < 2 | i + 1
Radix-16 (j-bit fraction) | 1xxx… | 1/2 ≤ f < 1 | j
Radix-16 (j-bit fraction) | 01xx… | 1/4 ≤ f < 1/2 | j − 1
Radix-16 (j-bit fraction) | 001x… | 1/8 ≤ f < 1/4 | j − 2
Radix-16 (j-bit fraction) | 0001… | 1/16 ≤ f < 1/8 | j − 3

[Figure 16: Comparison of radix-2 and radix-16 — PSNR versus FP format, with the cutoff point marked.]

[Table 5: An 11-bit fraction for radix-16 is sufficient — PSNR for radix-2 with an 8-bit fraction and for radix-16 with 12-, 11-, and 10-bit fractions.]

Since the minimum fraction bit-width is derived from a worst-case analysis, it can be reduced further from the perspective of average precision. Applying radix-16 Cmufloat to the video decoder, we can see that the cutoff point is actually at 11 bits, not 12 bits, of fraction width (Figure 16 and Table 5).

After the discussion of each working dimension, we summarize the lightweight FP design choices for the H.263 video codec as follows:

- Data format: 14-bit radix-2 FP (5-bit exponent + 8-bit fraction) or 15-bit radix-16 FP (3-bit exponent + 11-bit fraction).
- Rounding: Jamming or truncation.
- Denormalization: not supported.

The final choice of data format and rounding mode is made in Section 5 according to the hardware cost.

In all discussions in this section, we use PSNR as the measure of algorithm quality. However, we need to mention that there is an IEEE standard specifying the precision requirements for 8 × 8 DCT implementations [12]. The standard contains the following specification:

$$\mathrm{omse} = \frac{1}{64\,K}\sum_{i=0}^{7}\sum_{j=0}^{7}\sum_{k=1}^{K} e_k^2(i, j) \le 0.02, \qquad (3)$$

where omse is the overall mean-square error, e_k is the pixel difference between the reference and the proposed IDCT, i and j are the position of the pixel in the 8 × 8 block, and K is the number of 8 × 8 blocks tested. A PSNR specification can be derived from omse: PSNR = 10 log10(255^2 / omse) ≥ 65.1 dB, which is too tight for videos/images displayed on mobile devices. The other reason we did not choose this standard is that it uses uniformly distributed random numbers as input pixels, which eliminates the correlation between pixels and enlarges the FP computation error. We also did experiments based on the IEEE standard; it turns out that around a 17-bit fraction is required to meet all the constraints. From the PSNR curves in this section, we know that PSNR stays almost constant for fraction widths from 17 bits down to 9 bits. The experimental results support our claim that the IEEE standard specification for the IDCT is too strict for encoding/decoding real video sequences.

5 HARDWARE IMPLEMENTATION OF LIGHTWEIGHT FP ARITHMETIC UNITS

FP addition and multiplication are the most frequent FP operations. A lot of work has been published on IEEE-compliant FP adders and multipliers, focusing on reducing the latency of the computation. In the IBM RISC System/6000, a leading-zero anticipator is introduced in the FP adder [13]. The SNAP project proposed a two-path approach in the FP adder [14]. However, the benefit of these algorithms is not significant, and the penalty in area is not negligible, when the bit-width is very small. In this section, we present the structure of the FP adder and multiplier appropriate for narrow bit-widths and study the impact of different rounding/exception-handling/radix schemes. Our design is based on the Synopsys DesignWare library and the STMicroelectronics 0.18 µm technology library. The area and latency are measured on the gate-level circuit with Synopsys DesignCompiler, and the power consumption is measured with the Cadence Verilog-XL simulator and Synopsys DesignPower.

18 888 EURASIP Journal on Applied Signal Processing 7000 Area Prealignment Fraction addition Normalization Rounding Exception detection Area (µm 2 ) Exception handling (a) Adder Bit width N-to-1 Mux Two stages Logarithmic Fraction multiplication Exponent addition 6 Delay Normalization Rounding Exception handling Delay (ns) (b) Multiplier Figure 17: Diagram of FP, the adder and multiplier 51 Structure As shown in Figure 17, we take the most straightforward toplevel structure for the FP adder and multiplier The tricks reducing the latency are not adopted because firstly the adder and multiplier can be accelerated easily by pipelining, and secondly those tricks increase the area by a large percentage in the case of narrow bit width Shifter The core component in the prealignment and normalization is a shifter There are three common architectures for shifter: N-to-1 Mux shifter is appropriate when N is small Logarithmic shifter [15] uses log(n) stages and each stage handles a single, power-of-2 shifts This architecture has compact area, but the timing path is long when N is big Fast two-stage shifter is used in IBM RISC System/6000 [16] The first stage shifts (0, 4, 8, 12,) bit positions, and the second stage shifts (0, 1, 2, 3) bit position The comparison of these three architectures over different bit-widths is shown in Figure 18 It indicates that for narrow bit-width (8 16), logarithmic shifter is the best choice considering both of area and delay, but when the Bit width N-to-1 Mux Two stages Logarithmic Figure 18: Area and delay comparisons of three shifter structures bit-width is increased to a certain level, two-stage shifter become best 52 Rounding In our lightweight FP arithmetic operations, round-tonearest is not considered because of the heavy hardware overhead Jamming rounding demonstrates similar performance as round-to-nearest in the example of video codec But it still has to keep three more bits in each stage in the FP adder, which becomes significant, especially for narrow bitwidth cases The other candidate is round-toward-zero because its performance is close to Jamming rounding at the cutoff point Table 6 shows the reduction in both of the area and delay when changing rounding mode from Jamming to round-toward-zero Since 15% reduction in the area of FP adder can be obtained, we finally choose truncation as the rounding mode

19 Lightweight Floating-Point Arithmetic: Case Study of Inverse Discrete Cosine Transform 889 Table 6: Comparison of rounding modes Rounding mode Area (um 2 ) Delay(ns) Jamming Truncation 7741 ( 19%) 571 ( 16%) (a) 14-bit FP adder Rounding mode Area (um 2 ) Delay(ns) Jamming Truncation 7123 ( 15%) 943 ( 76%) (b) 14-bit FP multiplier Table 8: Comparison of radix 2 and radix 16 Radix Area (um 2 ) Delay(ns) ( 12%) 848 ( 17%) (a) FP adder Radix Area (um 2 ) Delay(ns) (43%) 662 (14%) (b) FP multiplier Table 7: Comparison of exception handling Exception handling Area (um 2 ) Delay(ns) Full Partial 7545 ( 10%) 926 ( 93%) (a) 14-bit FP adder Exception handling Area (um 2 ) Delay(ns) Full Partial 7508 ( 49%) 462 ( 20%) (b) 14-bit multiplier Table 9 Data format Area (um 2 ) Delay(ns) Power(mW) Lightweight IEEE standard (a) Comparison of lightweight FP and IEEE FP-adder Data format Area (um 2 ) Delay(ns) Power(mW) Lightweight IEEE standard (b) Comparison of lightweight FP and IEEE FP-multiplier (a) Radix-2 FP (8-bit fraction 1 leading bit) (b) Radix-16 FP (11-bit fraction) Figure 19: Shifting positions for radix-2 and radix-16 FP 53 Exception handling For an IEEE compliant FP arithmetic unit, a large portion of hardware in critical timing path is dedicated for rare exceptional cases, for example, overflow, underflow, infinite, NaN and so forth If the exponent bit-width is enough to avoid overflow, then infinite and NaN will not occur during computation Then in the FP adder and multiplier diagram (Figure 17), exception detection is not needed and only underflow is detected in exception handling As a result, the delay of the FP adder and multiplier is reduced by 93% and 20%, respectively, using partial exception handling 54 Radix The advantage of higher radix FP is less complex in the shifter From Section 4, we know that in video codec, the precision of radix-16 FP with 11-bit fraction is close to radix-2 FP with 8-bit fraction We illustrate the difference in shifter in (Figure 19) The step size of shifting for radix-16 is four and only three shifting positions are needed Such a simple shifter can be implemented by the structure of 3-to-1 Mux Although there are 3 more bits in the fraction, FP adder still benefited from higher radix in both area and delay (Table 8) On the other hand, more fraction width increases the complexity of the FP multiplier The size of multiplier is increased about quadratically with the bit-width, so only 3 more bits can increase the multiplier s area by 43% (Table 8) From the table, it is clear that radix-16 is not always better than radix-2 In a certain application, if there are more adders than multipliers, then radix-16 is a better choice than radix-2 In our IDCT structure, there are 29 adders and 11 multipliers Therefore, radix-16 is chosen for the implementation of IDCT Combining all the optimization strategies in this section, the final 15-bit radix-16 FP adder and multiplier are compared with the IEEE standard compliant single-precision FP adder and multiplier in Table 9 Reducing the bit-width and simplifying the rounding/shifting/exception handling, the power consumption of FP arithmetic unit is cut down to around 1/5 of the IEEE standard FP unit Further optimization can be conducted in two directions One is low-power design approach As proposed in [17], triple data path in FP adder structure can reduce the power delay product by 16X The other is transistor level optimization Shifter and multiplexor designed in transistors

20 890 EURASIP Journal on Applied Signal Processing x7 x5 x3 x1 x6 x4 x2 x0 Ci = 1 2cos iπ 16 C4 C6/C4 C5 C2/C4 C7 1/C4 C4 C6/C4 C3 x y x y y x Butterfly C2/C4 x C C x C1 x y y5 y2 y4 y3 y6 y1 y7 y0 y x y fa sign fb Alignment Negation Fraction addition Mux Figure 20: Structure of IDCT Leading-one detection & normalization are much smaller and faster than those implemented as logic gates [15, 18] 6 IMPLEMENTATION OF IDCT IDCT performs linear transform on an 8-input data set The algorithm developed in [19] uses minimum resources, 11 multiplications and 29 additions, to compute a onedimensional IDCT (Figure 20) 61 Optimization in butterfly In the architecture of IDCT, there are 12 butterflies and each one is composed of two adders Because these two adders have the same inputs, some operations such as prealignment and negation can be shared Further, in the butterfly structure, normalization and leading-one detection can be separated into two paths and executed in parallel, which reduce the timing critical path Figure 21a is a diagram of the FP adder, and Figure 21b is a diagram of the butterfly with the function of two FP adders From the figure, we can see that a butterfly is similar to one FP adder in structure except one extra integer adder and some selection logic Table 10 shows that such butterfly structure saves 37% area compared with two simple adders This reduction is due to the property of FP addition For a fixed-point butterfly, no operations can be shared 62 Results comparison We implement the IDCT in 32-bit IEEE FP, 15-bit radix-16 lightweight FP, and fixed-point algorithms (see the comparison in Table 11) In the fixed-point implementation, we preserve 12-bit accuracy for constants, and the widest bit-width is 24 in the whole algorithm (not fine tuned) From the perspective of power, the lightweight FP IDCT consumes only around 1/10 of the power compared to the IEEE FP IDCT, and is comparable with the fixed-point implementation 7 CONCLUSION In this paper, we introduce C and Verilog libraries of lightweight FP arithmetic, focusing on the most critical arithmetic operators (addition, multiplication), and the fa Fraction addition Normalization fa fb (a) FP adder Negation fb Alignment Selection logic fa fb (b) FP butterfly Fraction addition Leading-one detection fa fb Figure 21: Diagrams of a FP adder and a FP butterfly most common parameterizations useful for multimedia tasks (bit-width, rounding modes, exception handling, radix) With these libraries, we can easily translate a standard FP program to a lightweight FP program, and explore the system numerical performance versus hardware complexity tradeoff An H263 video codec is chosen to be our benchmark Such media applications do not need a wide dynamic range and high precision in computations, so the lightweight FP can be applied efficiently By examining the histogram information of FP numbers and relationship between PSNR and bit-width, we demonstrate that our video codec have almost no quality degradation when more than half of the bit-width in standard FP is reduced Other features specified in the standard FP, such as rounding modes, exception handling and the radix choice, are also discussed for this partic-

21 Lightweight Floating-Point Arithmetic: Case Study of Inverse Discrete Cosine Transform 891 Table 10: Comparison of two FP adders and a butterfly Structure Area (um 2 ) Delay(ns) Two FP adders Butterfly Table 11: Comparison of three implementations of IDCT Implementation Area (um 2 ) Delay(ns) Power(mW) IEEE FP Lightweight FP Fixed-point ular application Such optimization offers huge reduction in hardware cost In the hardware implementation of IDCT, we combined two FP adders in a butterfly (basic component of IDCT), which further reduced the hardware cost At last, we show that power consumption of the lightweight FP IDCT is only 105% of the standard FP IDCT, and comparable to the fixed-point IDCT ACKNOWLEDGMENT This work was funded in part by the Pittsburgh Digital Green house, and the Semiconductor Research Corporation REFERENCES [1] D Dobberpuhl, The design of a high performance low power microprocessor, in International Symposium on Low Power Electronics and Design, pp 11 16, Montery, Calif, USA, August 1996 [2] S Kim and W Sung, Fixed-point error analysis and word length optimization of 8 8 IDCT architectures, IEEE Trans Circuits and Systems for Video Technology, vol 8, no 8, pp , 1998 [3]KKum,JKang,andWSung, AUTOSCALERforC:An optimizing floating-point to integer C program converter for fixed-point digital signal processors, IEEE Trans on Circuits and Systems II: Analog and Digital Signal Processing, vol 47, no 9, pp , 2000 [4] The Institute of Electrical and Electronics Engineers, Inc, IEEE-standard for Binary Floating-Point Arithmetic, ANSI/IEEE Std , 1985 [5] DMSamanj,JEllinger,EJPowers,andEESwartzlander, Simulation of variable precision IEEE floating point using C and its application in digital signal processor design, in Circuits and Systems, Proceedings of the 36th Midwest Symposium on, pp , 1993 [6] R Ignatowski and E E Swartzlander, Creating new algorithm and modifying old algorithms to use the variable precision floating point simulator, in Signals, Systems and Computers, 1994 Conference Record of the 28th Asilomar Conference on, vol 1, pp , 1994 [7] X Wan, Y Wang, and W H Chen, Dynamic range analysis for the implementation of fast transform, IEEE Trans Circuits and Systems for Video Technology, vol 5, no 2, pp , 1995 [8] PCHMeier,RARutenbar,andLRCarley, Exploring multiplier architecture and layout for low power, in Proc IEEE 1996 Custom Integrated Circuits Conference, pp , San Diego, Calif, USA, May 1996 [9] A W Burks, H H Goldstine, and J von Neumann, Preliminary Discussion of the Logical Design of an Electronics Computing Instrument, Computer Structures: Reading and Examples McGraw-Hill, 1971 [10] The Institute of Electrical and Electronics Engineers, Inc, A Proposed Standard for Binary Floating-Point Arithmetic, Draft 80 of IEEE Task P754, 1981 [11] S F Anderson, J G Earle, R E Goldschmidt, and D M Powers, The IBM system 390 model 91: floating point execution unit, IBM Journal of Research and Development, vol 11, no 1, pp 34 53, 1967 [12] Institute of Electrical and Electronics Engineers, Inc, IEEEstandard specifications for the implementations of 8 8 inverse discrete cosine transform, IEEE Std , 1990 [13] E Hokenek and R K Montoye, Leading-zero anticipator (LZA) in the IBM RISC system/6000 floating-point execution unit, IBM Journal of Research and Development, vol 34, no 1, pp 71 77, 1990 [14] S F Oberman, H AL-Twaijry, and M J Flynn, The SNAP project: design of floating point arithmetic units, in Proc 13th Symposium on Computer Arithmetic, pp , Asilomar, Calif, USA, 
July 1997 [15] K P Acken, M J Irwin, and R M Owens, Power comparisons for barrel shifters, in International Symposium on Low Power Electronics and Design, pp , Monterey, Calif, USA, August 1996 [16] R K Montoye, E Hokenek, and S L Runyon, Design of the IBM RISC system/6000 floating-point execution unit, IBM Journal of Research and Development, vol 34, no 1, pp 59 70, 1990 [17] R V K Pillai, D Al-Khalili, and A J Al-Khalili, A low power approach to floating point adder design, in Proc IEEE International Conference on Computer Design: VLSI in Computers and Processors, pp , Austin, Tex, USA, October 1997 [18] J M Rabaey, Digital Integrated Circuits: A Design Perspective, Prentice-Hall, New Jersey, USA, 1996 [19] A Artieri and O Colavin, A chip set core for image compression, IEEE Trans on Consumer Electronics, vol 36, pp , August 1990 Fang Fang received the BS degree in electrical engineering from Southeast University, China in 1999 and the MS degree from Carnegie Mellon University, Pittsburgh, Pa in 2001 Currently, she is a PhD candidate in the Center for Silicon System Implementation and the Advanced Multimedia Processing Lab in Electrical and Computer Engineering Department at Carnegie Mellon Univesity Her research interest includes numerical performance of signal processing algorithms and high level hardware synthesis of DSP algorithms Her current work focus on automatic design flow of the lightweight floating-point system, and hardware synthesis of WHT and FFT transforms

22 892 EURASIP Journal on Applied Signal Processing Tsuhan Chen received the BS degree in electrical engineering from the National Taiwan University in 1987, and the MS and PhD degrees in electrical engineering from the California Institute of Technology, Pasadena, California, in 1990 and 1993, respectively Since October 1997, Tsuhan Chen has been with the Department of Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, Pennsylvania, where he is now a Professor He directs the Advanced Multimedia Processing Laboratory His research interests include multimedia signal processing and communication, audiovisual interaction, biometrics, processing of 2D/3D graphics, bioinformatics, and building collaborative virtual environments From August 1993 to October 1997, he worked in the Visual Communications Research Department, AT&T Bell Laboratories, Holmdel, New Jersey, and later at AT&T Labs-Research, Red Bank, New Jersey Tsuhan helped create the Technical Committee on Multimedia Signal Processing, as the founding chair, and the Multimedia Signal Processing Workshop, both in the IEEE Signal Processing Society He has recently been appointed as the Editor-in-Chief for IEEE Transactions on Multimedia for the period He has coedited a book titled Advances in Multimedia: Systems, Standards, and Networks He is a recipient of the National Science Foundation Career Award Rob A Rutenbar received the PhD degree from the University of Michigan in 1984, and subsequently joined the faculty of Carnegie Mellon University He is currently the Stephen J Jatras Professor of Electrical and Computer Engineering, and (by courtesy) of Computer Science He is the founding Director of the MARCO/DARPA Center for Circuits, Systems, Software (C2S2), a consortium of US Universities chartered in 2001 to explore long-term solutions for next-generation circuit challenges His research interests focus on circuit and layout synthesis algorithms for mixed-signal ASICs, and for highspeed digital systems In 1987, Dr Rutenbar received a Presidential Young Investigator Award from the National Science Foundation He was General Chair of the 1996 International Conference on CAD From 1992 through 1996, he chaired the Analog Technical Advisory Board for Cadence Design Systems In 2001 he was cowinner of the Semiconductor Research Corporation s Aristotle Award for contributions to graduate education He is a Fellow of the IEEE and a member of the ACM and Eta Kappa Nu

EURASIP Journal on Applied Signal Processing 2002:9, © 2002 Hindawi Publishing Corporation

High-Level Synthesis of DSP Applications Using Adaptive Negative Cycle Detection

Nitin Chandrachoodan
Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA
nitin@eng.umd.edu

Shuvra S. Bhattacharyya
Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA
ssb@eng.umd.edu

K. J. Ray Liu
Department of Electrical and Computer Engineering, University of Maryland, College Park, MD 20742, USA
kjrliu@eng.umd.edu

Received 31 August 2001 and in revised form 15 May 2002

The problem of detecting negative weight cycles in a graph is examined in the context of the dynamic graph structures that arise in the process of high-level synthesis (HLS). The concept of adaptive negative cycle detection is introduced, in which a graph changes over time and negative cycle detection needs to be done periodically, but not necessarily after every individual change. We present an algorithm for this problem, based on a novel extension of the well-known Bellman-Ford algorithm that allows us to adapt existing cycle information to the modified graph, and show by experiments that our algorithm significantly outperforms previous incremental approaches for dynamic graphs. In terms of applications, the adaptive technique leads to a very fast implementation of Lawler's algorithm for the computation of the maximum cycle mean (MCM) of a graph, especially for a certain form of sparse graph. Such sparseness often occurs in practical circuits and systems, as demonstrated, for example, by the ISCAS 89/93 benchmarks. The application of the adaptive technique to design-space exploration (synthesis) is also demonstrated by developing automated search techniques for scheduling iterative data-flow graphs.

Keywords and phrases: negative cycle detection, dynamic graphs, maximum cycle mean, adaptive performance estimation.

1. INTRODUCTION

High-level synthesis of circuits for digital signal processing (DSP) applications is an area of considerable interest due to the rapid increase in the number of devices requiring multimedia and DSP algorithms. High-level synthesis (HLS) plays an important role in the overall system synthesis process because it speeds up the process of converting an algorithm into an implementation in hardware, software, or a mixture of both. In HLS, the algorithm is represented in an abstract form (usually a dataflow graph), and this representation is transformed and mapped onto architectural elements from a library of resources. These resources could be pure hardware elements like adders or logic gates, or they could be general-purpose processors with the appropriate software to execute the required operations. The architecture could also involve a combination of both of the above, in which case the problem becomes one of hardware-software cosynthesis. HLS involves several stages and requires computation of several parameters. In particular, performance estimation is one very important part of the HLS process, and the actual design space exploration (synthesis of architecture) is another. Performance estimation involves using timing information about the library elements to obtain an estimate of the throughput that can be obtained from the synthesized implementation. A particularly important estimate for iterative dataflow graphs is known as the maximum cycle mean (MCM) [1, 2]. This quantity provides a bound on the maximum throughput attainable by the system. A fast
method for computing the MCM would therefore enable this metric to be computed for a large number of system configurations easily The other important problem in HLS is the problem of design space exploration This requires the selection of an appropriate set of elements from the resource library and mapping of the dataflow graph functions onto these resources This problem is known to be NP-complete, and there exist several heuristic approaches that attempt to provide suffi-

24 894 EURASIP Journal on Applied Signal Processing ciently good solutions An important feature of HLS of DSP applications is that the mapping to hardware needs to be done only once for a given design, which is then produced in large quantities for a consumer market As a result, for these applications (such as modems, wireless phones, multimedia terminals, etc), it makes sense to consider the possibility of investing large amounts of computational power at compile time, so that a more optimal result can be used at run time The problems in HLS described above both have the common feature of requiring a fast solution to the problem of detecting negative cycles in a graph This is because the execution times of the various resources combine with the graph structure to impose a set of constraints on the system, and checking the feasibility of this set of constraints is equivalent to checking for the presence of negative cycles in the corresponding constraint graph DSP applications, more than other embedded applications considered in HLS, have the property that they are cyclic in nature As explained in Section 2, this means that the problem of negative cycle detection in constraint analysisismorerelevanttosuchsystemsinordertomakeuseof the increased computational power that is available, one possibility is to conduct more extensive searches of the design space than is performed by a single heuristic One possible approach to this problem involves an iterative improvement system based on generating modified versions of an existing implementation and verifying their correctness The incremental improvements can then be used to guide a search of the design space that can be tailored to fit in the maximum time allotted to the exploration problem In this process, the most computationally intensive part is the process of verifying correctness of the modified systems, and therefore speeding up this process would have a direct impact on the size of the explored region of the design space In addition to these problems from HLS, several other problems in circuits and systems theory require the solving of constraint equations [2, 3, 4, 5, 6] Examples include very large scale integrated circuit (VLSI) layout compaction, interactive (reactive) systems, graphic layout heuristics, and timing analysis and retiming of circuits for performance or area considerations Though a general system of constraints would require a linear programming (LP) approach to solve it, several problems of interest actually consist of the special case of difference constraints (each constraint expresses the minimum or maximum value that the difference of two variables in the system can take) These problems can be attacked by faster techniques than the general LP, mostly involving the solution of a shortest path problem on a weighted directed graph Detection of negative cycles in the graph is therefore a closely related problem, as it would indicate the infeasibility of the constraint system Because of the above reasons, detecting the presence of negative cycles in a weighted directed graph is a very important problem in systems theory This problem is also important in the computation of network flows Considerable effort has been spent on finding efficient algorithms for this purpose Cherkassky and Goldberg [3] have performed a comprehensive survey of existing techniques Their study shows some interesting features of the available algorithms, such as the fact that for a large class of random graphs, the worst case performance bound is far 
more pessimistic than the observed performance There are also situations in which it is useful or necessary to maintain a feasible solution to a set of difference constraints as a system evolves Typical examples of this would be real-time or interactive systems, where constraints are added or removed one (or several) at a time, and after each such modification it is required to determine whether the resulting system has a feasible solution and if so, to find it In these situations, it is often more efficient to adapt existing information to aid the solution of the constraint system In the example from HLS that was mentioned previously, it is possible to cast the problem of design space exploration in a way that benefits from this approach Several researchers [5, 7, 8] have worked on the area of incremental computation They have presented analyses of algorithms for the shortest path problem and negative cycle detection in dynamic graphs Most of the approaches try to apply the modifications of Dijkstra s algorithm to the problem The obvious reason for this is that this is the fastest known algorithm for the problem when only positive weights are allowed on edges However, the use of Dijkstra s algorithm as the basis for incremental computation requires the changestobehandledoneatatimewhilethismayoften be efficient enough, there are many cases where the ability to handle multiple changes simultaneously would be more advantageous For example, it is possible that in a sequence of changes, one reverses the effect of another: in this case, a normal incremental approach would perform the same computation twice, while a delayed adaptive computation would not waste any effort In this paper, we present an approach that generalizes the adaptive approach beyond single increments: we address multiple changes to the graph simultaneously Our approach can be applied to cases where it is possible to collect several changes to the graph structure before updating the solution to the constraint set As mentioned previously, this can result in increased efficiency in several important problems We present simulation results comparing our method against the single-increment algorithm proposed in [4]Forlargernumbers of changes, our algorithm performs considerably better than this incremental algorithm To illustrate the advantages of our adaptive approach, we present two applications from the area of HLS, requiring the solution of difference constraint problems, which therefore benefit from the application of our technique For the problem of performance estimation, we show how the new technique can be used to derive a fast implementation of Lawler s algorithm [9] for the problem of computing the MCM of a weighted directed graph We present experimental results comparing this against Howard s algorithm [2, 10], which appears to be the fastest algorithm available in practice We find that for graph sizes and node-degrees similar to those of real circuits, our algorithm often outperforms even Howard s algorithm For the problem of design space exploration, we present a

25 High-Level Synthesis of DSP Applications Using Adaptive Negative Cycle Detection 895 search technique for finding schedules for iterative dataflow graphs that uses the adaptive negative cycle detection algorithm as a subroutine We illustrate the use of this local search technique by applying it to the problem of resourceconstrained scheduling for minimum power in the presence of functional units that can operate at multiple voltages The method we develop is quite general and can therefore easily be extended to the case of optimization where we are interested in optimizing other criteria rather than the power consumption This paper is organized as follows Section 2 surveys previous work on shortest path algorithms and incremental algorithms In Section 3, we describe the adaptive algorithm that works on multiple changes to a graph efficiently Section 4 compares our algorithm against the existing approach, as well as against another possible candidate for adaptive operation Section 5 then gives details of the applications mentioned above, and presents some experimental results Finally, we present our conclusions and examine areas that would be suitable for further investigation A preliminary version of the results presented in this paper were published in [11] 2 BACKGROUND AND PROBLEM FORMULATION Cherkassky and Goldberg [3] haveconductedanextensive survey of algorithms for detecting negative cycles in graphs They have also performed a similar study on the problem of shortest path computations They present several problem families that can be used to test the effectiveness of a cycle-detection algorithm One surprising fact is that the best known theoretical bound (O( V E ), where V is the number of vertices and E is the number of edges in the graph) for solving the shortest path problem (with arbitrary weights) is also the best known time bound for the negative cycle problem But examining the experimental results from their work reveals the interesting fact that in almost all of the studied samples, the performance is considerably less costly than would be suggested by the product V E Itappears that the worst case is rarely encountered in random examples, and an average case analysis of the algorithms might be more useful Recently, there has been increased interest in the subject of dynamic or incremental algorithms for solving problems [5, 7, 8] This uses the fact that in several problems where a graph algorithm such as shortest paths or transitive closure needs to be solved, it is often the case that we need to repeatedly solve the problem on variants of the original graph The algorithms therefore store information about the problem that was obtained during a previous iteration and use this as an efficient starting point for the new problem instance corresponding to the slightly altered graph The concept of bounded incremental computation introduced in [7] provides a framework within which the improvement afforded by this approach can be quantified and analyzed In this paper, the problem we are most interested in is that of maintaining a solution to a set of difference constraints This is equivalent to maintaining a shortest path tree in a dynamic graph [4] Frigioni et al [5] present an algorithm for maintaining shortest paths in arbitrary graphs that performs better than starting from scratch, while Ramalingam and Reps [12] present a generalization of the shortest path problem, and show how it can be used to handle the case where there are few negative weight edges In both of these 
cases, they have considered one change at a time (not multiple changes), and the emphasis has been on the theoretical time bound, rather than experimental analysis. In [13], the authors present an experimental study, but only for the case of positive weight edges, which restricts the study to computation of shortest paths and does not consider negative weight cycles. The most significant work along the lines we propose is described in [4]. In this, the authors use the observation that in order to detect negative cycles, it is not necessary to maintain a tree of the shortest paths to each vertex. They suggest an improved algorithm based on Dijkstra's algorithm, which is able to recompute a feasible solution (or detect a negative cycle) in time O(|E||V| log |V|), or, in terms of output complexity (defined and motivated in [4]), a bound stated in terms of the number of variables whose values are changed and the number of constraints involving the variables whose values have changed.

The above problem can be generalized to allow multiple changes to the graph between calls to the negative cycle detection algorithm. In this case, the above algorithms would require the changes to be handled one at a time, and therefore would take time proportional to the total number of changes. On the other hand, it would be preferable if we could obtain a solution whose complexity depends on the number of updates requested, rather than the total number of changes applied to the graph. Multiple changes between updates to the negative cycle computation arise naturally in many interactive environments (e.g., if we prefer to accumulate changes between refreshes of the state, using the idea of lazy evaluation) or in design space exploration, as can be seen, for example, in Section 5.2. By accumulating changes and processing them in large batches, we remove a large overhead from the computation, which may result in considerably faster algorithms.

Note that the work in [4] also considers the addition/deletion of constraints only one at a time. It needs to be emphasized that this limitation is basic to the design of the algorithm: Dijkstra's algorithm can be applied only when the changes are considered one at a time. This is acceptable in many contexts since Dijkstra's algorithm is the fastest algorithm for the case where edge weights are positive. If we try using another shortest paths algorithm we would incur a performance penalty. However, as we show, this loss in performance in the case of unit changes may be offset by improved performance when we consider multiple changes.

The approach we present for the solution is to extend the classical Bellman-Ford algorithm for shortest paths in such a way that the solution obtained in one problem instance can be used to reduce the complexity of the solution in modified versions of the graph. In the incremental case (single change to the graph) this problem is related to the

problem of analyzing the sensitivity of the algorithm [14]. The sensitivity analysis tries to study the performance of an algorithm when its inputs are slightly perturbed. Note that there do not appear to be any average case sensitivity analyses of the Bellman-Ford algorithm, and the approach presented in [14] has a quadratic running time in the size of the graph. This analysis is performed for a general graph without regard to any special properties it may have. But as explained in Section 5.1.1, graphs corresponding to circuits and systems in HLS for DSP are typically very sparse: most benchmark graphs tend to have a ratio of about 2 edges per vertex, and the number of delay elements is also small relative to the total number of vertices. Our experiments have shown that in these cases, the adaptive approach is able to do much better than a quadratic approach. We also provide application examples to show other potential uses of the approach.

In the following sections, we show that our approach performs almost as well as the approach in [4] (experimentally) for changes made one at a time, and significantly outperforms their approach under the general case of multiple changes (this is true even for relatively small batches of changes, as will be seen from the results). Also, when the number of changes between updates is very large, our algorithm reduces to the normal Bellman-Ford algorithm (starting from scratch), so we do not lose in performance. This is important since when a large number of changes are made, the problem can be viewed as one of solving the shortest-path problem for a new graph instance, and we should not perform worse than the standard available technique for that.

Our interest in adaptive negative cycle detection stems primarily from its application in the problems of HLS that we outlined in the introduction. To demonstrate its usefulness in these areas, we have used this technique to obtain improved implementations of the performance estimation problem (computation of the MCM) and to implement an iterative improvement technique for design space exploration. Dasdan et al. [2] present an extensive study of existing algorithms for computing the MCM. They conclude that the most efficient algorithm in practice is Howard's algorithm [10]. We show that the well-known Lawler's algorithm [9], when implemented using an efficient negative cycle detection technique and with the added benefit of our adaptive negative cycle detection approach, actually outperforms this algorithm for several test cases, including several of the ISCAS benchmarks, which represent reasonable sized circuits.

As mentioned previously, the relevance of negative cycle detection to design space exploration is because of the cyclic nature of the graphs for DSP applications. That is, there is often a dependence between the computation in one iteration and the values computed in previous iterations. Such graphs are referred to as iterative dataflow graphs [15]. Traditional scheduling techniques tend to consider only the latency of the system, converting it to an acyclic graph if necessary. This can result in loss of the ability to exploit inter-iteration parallelism effectively. Methods such as optimum unfolding [16] and range-chart guided scheduling [15] are techniques that try to avoid this loss in potential parallelism by working directly on the cyclic graph. However, they suffer from some disadvantages of their own. Optimum unfolding can potentially lead to a large increase in the size of the
resulting graph to be scheduled. Range-chart guided scheduling is a deterministic heuristic that could miss potential solutions. In addition, the process of scanning through all possible time intervals for scheduling an operation can work only when the run times of operations are small integers. This is more suited to a software implementation than a general hardware design. These techniques also work only after a function-to-resource binding is known, as they require timing information for the functions in order to schedule them. For the general architecture synthesis problem, this binding itself needs to be found through a search procedure, so it is reasonable to consider alternate search schemes that combine the search for an architecture with the search for a schedule. If the cyclic dataflow graph is used to construct a constraint graph, then feasibility of the resulting system is determined by the absence of negative cycles in the graph. This can be used to obtain exact schedules capable of attaining the performance bound for a given function-to-resource binding.

For the problem of design space exploration, we treat the problem of scheduling an iterative dataflow graph (IDFG) as a problem of searching for an efficient ordering of function vertices on processors, which can be treated as the addition of several timing constraints to an existing set of constraints. We implement a simple search technique that uses this approach to solve a number of scheduling problems, including scheduling for low power on multiple-voltage resources, and scheduling on homogeneous processors, within a single framework. Since the feasibility analysis forms the core of the search, speeding this up should result in a proportionate increase in the number of designs evaluated (until such a point that this is no longer the bottleneck in the overall computation). The adaptive negative cycle detection technique ensures that we can do such searches efficiently, by restricting the computations required.

3. THE ADAPTIVE BELLMAN-FORD ALGORITHM

We present the basis of the adaptive approach that enables efficient detection of negative cycles in dynamic graphs. We first note that the problem of detecting negative cycles in a weighted directed graph (digraph) is equivalent to finding whether or not a set of difference inequality constraints has a feasible solution. To see this, observe that if we have a set of difference constraints of the form

x_i − x_j ≤ b_ij, (1)

we can construct a digraph with vertices corresponding to the x_i, and an edge e_ij directed from the vertex corresponding to x_i to the vertex for x_j such that weight(e_ij) = b_ij. This procedure is performed for each constraint in the system, and a weighted directed graph is obtained. Solving for shortest paths in this graph would yield a set of distances dist that satisfy the constraints on the x_i. This graph is henceforth referred to as the constraint graph.
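As a concrete illustration of this construction, the following C sketch (ours, not the authors' implementation; all identifiers and the fixed-size storage are illustrative assumptions) records one weighted edge per difference constraint, using the edge-direction convention given above.

/* Minimal sketch of constraint-graph construction for difference
 * constraints of the form x_i - x_j <= b_ij.  Following the convention
 * in the text, each constraint becomes one edge from the vertex for x_i
 * to the vertex for x_j with weight b_ij. */
#include <stdio.h>

#define MAX_EDGES 1024

typedef struct { int from, to; double weight; } Edge;

typedef struct {
    int num_vertices;
    int num_edges;
    Edge edges[MAX_EDGES];
} ConstraintGraph;

/* Record the constraint x_i - x_j <= b as the edge (i -> j) with weight b. */
static void add_constraint(ConstraintGraph *g, int i, int j, double b)
{
    Edge *e = &g->edges[g->num_edges++];
    e->from = i;
    e->to = j;
    e->weight = b;
}

int main(void)
{
    ConstraintGraph g = { .num_vertices = 3, .num_edges = 0 };
    add_constraint(&g, 0, 1, 4.0);   /* x0 - x1 <=  4 */
    add_constraint(&g, 1, 2, -2.0);  /* x1 - x2 <= -2 */
    add_constraint(&g, 2, 0, -1.0);  /* x2 - x0 <= -1 */
    printf("constraint graph: %d vertices, %d edges\n",
           g.num_vertices, g.num_edges);
    return 0;
}

Running a shortest-path computation on such a graph (as described next) then either yields labels that satisfy every constraint or exposes a negative cycle.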

The usual technique used to solve for dist is to introduce an imaginary vertex s0 to act as a source, and introduce edges of zero weight from this vertex to each of the other vertices. The resulting graph is referred to as the augmented graph [4]. In this way, we can use a single-source shortest paths algorithm to find dist from s0, and any negative cycles (infeasible solution) found in the augmented graph must also be present in the original graph, since the new vertex and edges cannot create cycles.

The basic Bellman-Ford algorithm does not provide a standard way of detecting negative cycles in the graph. However, it is obvious from the way the algorithm operates that if changes in the distance labels continue to occur for more than a certain number of iterations, there must be a negative cycle in the graph. This observation has been used to detect negative cycles, and with this straightforward implementation, we obtain an algorithm to detect negative cycles that takes O(|V|³) time, where |V| is the number of vertices in the graph. The study by Cherkassky and Goldberg [3] presents several variants of the negative cycle detection technique. The technique they found to be most efficient in practice is based on the subtree disassembly technique proposed by Tarjan [17]. This algorithm works by constructing a shortest path tree as it proceeds from the source of the problem, and any negative cycle in the graph will first manifest itself as a violation of the tree order in the construction. The experimental evaluation presented in their study found this algorithm to be a robust variant for the negative cycle detection problem. As a result of their findings, we have chosen this algorithm as the basis for the adaptive algorithm. Our modified algorithm is henceforth referred to as the adaptive Bellman-Ford (ABF) algorithm.

The adaptive version of the Bellman-Ford algorithm works on the basis of storing the distance labels that were computed from the source vertex from one iteration to the next. Since the negative cycle detection problem requires that the source vertex is always the same (the augmenting vertex), it is intuitive that as long as most edge weights do not change, the distance labels for most of the vertices will also remain the same. Therefore, by storing this information and using it as a starting point for the negative cycle detection routines, we can save a considerable amount of computation.

One possible objection to this system is that we would need to scan all the edges each time in order to detect vertices that have been affected. But in most applications involving multiple changes to a graph, it is possible to pass information to the algorithm about which vertices have been affected. This information can be generated by the higher-level application-specific process making the modifications. For example, if we consider multiprocessor scheduling, the high-level process would generate a new vertex ordering, and add edges to the graph to represent the new constraints. Since any changes to the graph can only occur at these edges, the application can pass on to the ABF algorithm precise information about what changes have been made to the graph, thus saving the trouble of scanning the graph for changes. Note that in the event where the high-level application cannot pass on this information without adding significant bookkeeping overhead, the additional work required for a scan of the edges is proportional to the number of
edges, and hence does not affect the overall complexity, which is at least as large as this. For example, in the case of the maximum cycle mean computation examined in Section 5.1, for most circuit graphs the number of edges with delays is about 1/10 as many as the total number of edges. With each change in the target iteration period, most of these edges will cause constraint violations. In such a situation, an edge scan provides a way of detecting violations that is very fast and easy to implement, while not increasing the overall complexity of the method.

3.1. Correctness of the method

The use of a shortest path routine to find a solution to a system of difference constraint equations is based on the following two theorems, which are not hard to prove (see [18]).

Theorem 1. A system of difference constraints is consistent if and only if its augmented constraint graph has no negative cycles, and the latter condition holds if and only if the original constraint graph has no negative cycles.

Theorem 2. Let G be the augmented constraint graph of a consistent system of constraints ⟨V, C⟩. Then D is a feasible solution for ⟨V, C⟩, where

D(u) = dist_G(s0, u). (2)

The augmented constraint graph consists of this graph, together with an additional source vertex s0 that has zero-weight edges leading to all the other existing vertices, and consistency means that a set of x_i exists that satisfies all the constraints in the system.

In the adaptive version of the algorithm, we are effectively setting the weights of the augmenting edges to be equal to the labels that were computed in the previous iteration. In this way, the initial scan from the augmenting vertex sets the distance label at each vertex equal to the previously computed weight instead of setting it to 0. So we now need to show that using nonzero weights on the augmenting edges does not change the solution space in any way: that is, all possible solutions for the zero-weight problem are also solutions for the nonzero-weight problem, except possibly for translation by a constant.

The new algorithm with the adaptation enhancements can be seen to be correct if we relax the definition of the augmented graph so that the augmenting edges (from s0) need not have zero weight. We summarize the arguments for this in the following theorems.

Theorem 3. Consider a constraint graph augmented with a source vertex s0, and edges from this vertex to every other vertex v, such that these augmenting edges have arbitrary weight weight(s0 → v). The associated system of constraints is consistent if and only if the augmented graph defined above has no negative cycles, which in turn holds if and only if the original constraint graph has no negative cycles.

Proof. Clearly, since s0 does not have any in-edges, no cycles can pass through it. So any cycles, negative or otherwise, which are detected in the augmented graph, must have come from the original constraint graph, which in turn would happen only if the constraint system was inconsistent (by Theorem 1). Also, any inconsistency in the original system would manifest itself as a negative cycle in the constraint graph, and the above augmentation cannot remove any such cycle.

The following theorem establishes the validity of solutions computed by the ABF algorithm.

Theorem 4. If G is the augmented graph with arbitrary weights as defined above, and D(u) = dist_G(s0, u) (shortest paths from s0), then (1) D is a solution to ⟨V, C⟩; and (2) any solution to ⟨V, C⟩ can be converted into a solution to the constraint system represented by G by adding a constant to D(u) for each u ∈ V.

Proof. The first part is obvious, by the definition of shortest paths. Now we need to show that by augmenting the graph with arbitrary weight edges, we do not prevent certain solutions from being found. To see this, first note that any solution to a difference constraint system remains a solution when translated by a constant. That is, we can add or subtract a constant to all the D(u) without changing the validity of the solution. In our case, if we have a solution to the constraint system that does not satisfy the constraints posed by our augmented graph, it is clear that the constraint violation can only be on one of the augmenting edges (since the underlying constraint graph is the same as in the case where the augmenting edges had zero weight). Therefore, if we define

l_max = max{ weight(e) | e ∈ S_a }, (3)

where S_a is the set of augmenting edges, and

D′(u) = D(u) − l_max, (4)

we ensure that D′ satisfies all the constraints of the original graph, as well as all the constraints on the augmenting edges.

Theorem 4 tells us that an augmented constraint graph with arbitrary weights on the augmenting edges can also be used to find a feasible solution to a constraint system. This means that once we have found a solution dist : V → R (where R is the set of real numbers) to the constraint system, we can change the augmented graph so that the weight on each augmenting edge e : s0 → v is dist(v). Now even if we change the underlying constraint graph in any way, we can use the same augmented graph to test the consistency of the new system.

Figure 1: Constraint graph ((a) augmented graph with zero-weight augmenting edges; (b) augmented graph with nonzero-weight augmenting edges).

Figure 1 helps to illustrate the concepts that are explained in the previous paragraphs. In Figure 1a, there is a change in the weight of one edge. But as we can see from the augmented graph, this will result in only the single update to the affected vertex itself, and all the other vertices will get their constraint-satisfying values directly from the previous iteration. Note that in general, several vertices could be affected by the change in weight of a single edge. For example, in Figure 1, if edge AC had not existed, then changing the weight of AB would have resulted in a new distance label for vertices C and D as well. These would be cascading effects from the change in the distance label for vertex B. Therefore, when we speak of affected vertices, it is not just those vertices incident on an edge whose weight has changed, but could also consist of vertices not directly on an edge
that has undergone a change in constraint weight. The actual number of vertices affected by a single edge-weight change cannot be determined just by examining the graph; we would actually need to run through the Bellman-Ford algorithm to find the complete set of vertices that are affected.
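To make the warm-start idea concrete, the following simplified C sketch (ours; the authors' implementation is based on Tarjan's subtree-disassembly variant rather than this plain relaxation loop, and the identifiers are illustrative) keeps the distance labels from the previous feasibility check and reuses them as the starting labels for the next one, which is equivalent to placing those labels on the augmenting edges.

/* Simplified sketch of the adaptive feasibility check: dist[] holds the
 * labels left over from the previous run (equivalently, the weights on
 * the augmenting edges), so vertices unaffected by the latest changes
 * settle immediately.  Returns 1 if a negative cycle is detected. */
#define NV 5   /* number of vertices in this sketch */

typedef struct { int from, to; double weight; } Edge;

int check_feasible(const Edge edges[], int num_edges, double dist[NV])
{
    /* The augmented graph has NV + 1 vertices, so NV relaxation passes
     * suffice when the constraints are feasible; if the labels are still
     * changing after that, some cycle must be negative. */
    for (int pass = 0; pass <= NV; pass++) {
        int changed = 0;
        for (int k = 0; k < num_edges; k++) {
            double relaxed = dist[edges[k].from] + edges[k].weight;
            if (relaxed < dist[edges[k].to]) {
                dist[edges[k].to] = relaxed;   /* constraint violated: lower the label */
                changed = 1;
            }
        }
        if (!changed)
            return 0;   /* converged: system is consistent, labels are a solution */
    }
    return 1;           /* no convergence: negative cycle present */
}

On a feasible update the final dist[] values are simply kept and seed the next call; this reuse of labels is where the saving over a from-scratch run comes from.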

29 High-Level Synthesis of DSP Applications Using Adaptive Negative Cycle Detection 899 In the example from Figure 1, the change in weight of edge AB means that after an initial scan to determine changes in distance labels, we find that vertex B is affected However, on examining the outgoing edges from vertex B, we find that all other constraints are satisfied, so the Bellman-Ford algorithm can terminate here without proceeding to examine all other edges Therefore, in this case, there is only 1 vertex whose label is affected out of the 5 vertices in the graph Furthermore, the experiments show that even in large sparse graphs, the effectof any single change is usuallylocalized to a small region of the graph, and this is the main reason that the adaptive approach is useful, as opposed to other techniques that are developed for more general graphs Note that, as explained in Section 2, the initial overhead for detecting constraint violations still holds, but the complexity of this operation is significantly less than that of the Bellman-Ford algorithm 4 COMPARISON AGAINST OTHER INCREMENTAL ALGORITHMS We compare the ABF algorithm against (a) the incremental algorithm developed in [4] for maintaining a solution to a set of difference constraints (referred to here as the RSJM algorithm), and (b) a modification of Howard s algorithm [10], since it appears to be the fastest known algorithm to compute the cycle mean, and hence can also be used to check for feasibility of a system Our modification allows us to use some of the properties of adaptation to reduce the computation in this algorithm The main idea of the adaptive algorithm is that it is used as a routine inside a loop corresponding to a larger program As a result, in several applications where this negative cycle detection forms a computation bottleneck, there will be a proportional speedup in the overall application, which would be much larger than the speedup in a single run It is worth making a couple of observations at this point regarding the algorithms we compare against (1) The RSJM algorithm [4] uses Dijkstra s algorithm as the core routine for quickly recomputing the shortest paths Using the Bellman-Ford algorithm here (even with Tarjan s implementation) would result in a loss in performance since it cannot match the performance of Dijkstra s algorithm when edge weights are positive Consequently, no benefit would be derived from the reduced-cost concept used in [4] (2) The code for Howard s algorithm was obtained from the Internet website of the authors of [10] The modifications suggested by Dasdan et al [19] have been taken into account This method of constraints checking uses Howard s algorithm to see if the MCM of the system yields a feasible value, otherwise the system is deemed inconsistent Another important point is the type of graphs on which we have tested the algorithms We have restricted our attention tosparse graphs, or bounded degree graphs In particular, we have tried to keep the vertex-to-edge ratio similar to what we may find in practice, as in, for example, the ISCAS benchmarks To understand why such graphs are relevant, note the following two points about the structural elements usually found in circuits and signal processing blocks: (a) they typically have a small, finite number of inputs and outputs (eg, AND gates, adders, etc are binary elements) and (b) the fanout that is allowed in these systems is usually limited for reasons of signal strength preservation (buffers are used if necessary) For these reasons, 
the graphs representing practical circuits can be well approximated by bounded degree graphs In more general DSP application graphs, constraints such as fanout may be ignored, but the modular nature of these systems (they are built up of simpler, small modules) implies that they normally have small vertex degrees We have implemented all the algorithms under the LEDA [20] framework for uniformity The tests were run on random graphs, with several random variations performed on them thereafter We kept the number of vertices constant and changed only the edges This was done for the following reason: a change to a node (addition/deletion) may result in several edges being affected In general, due to the random nature of the graph, we cannot know in advance the exact number of altered edges Therefore, in order to keep track of the exact number of changes, we applied changes only to the edges Note that when node changes are allowed, the argument for an adaptive algorithm capable of handling multiple changes naturally becomes stronger In the discussion that follows, we use the term batch-size to refer to the number of changes in a multiple change update That is, when we make multiple changes to a graph between updates, the changes are treated as a single batch, and the actual number of changes that was made is referred to as the batch-size This is a useful parameter to understand the performance of the algorithms The changes that were applied to the graph were of 3 types (i) Edge insertion: an edge is inserted into the graph, ensuring that multiple edges between vertices do not occur (ii) Edge deletions: an edge is chosen at random and deleted from the graph Note that, in general, this cannot cause any violations of constraints (iii) Edge weight change: an edge is chosen at random and its weight is changed to another random number Figure 2 shows a comparison of the running time of the 3 algorithms on random graphs The graphs in question were randomly generated, had 1000 vertices and 2000 edges each, and a sequence of edge change operations (as defined above) were applied to them The points in the plot correspond to an average over 10 runs using randomly generated graphs TheX-axis shows the granularity of the changes That is, at one extreme, we apply the changes one at a time, and at the other, we apply all the changes at once and then compute the correctness of the result Note that the delayed update feature is not used by the RSJM algorithm, which uses the fact that only one change occurs per test to look for negative cycles As can be seen, the algorithms that use the adaptive modifications benefit greatly as the batch size is increased, and even among these, the ABF algorithm far outperforms the Howard algorithm, because the latter actually

performs most of the computation required to compute the maximum cycle mean of the graph, which is far more than necessary.

Table 1: Relative speed of adaptive versus incremental approach for a graph of 1000 nodes, 2000 edges.
Batch size    Speedup (RSJM time/ABF time)

Figure 2: Comparison of algorithms as batch size varies (constant total changes; run time in seconds versus batch size; curves: RSJM, ABF, original BF, Howard's).

Figure 3: Constant number of iterations at different batch sizes (run time in seconds versus batch size; curves: RSJM, ABF, original BF, Howard's).

Figure 4: Asymptotic behavior of the algorithms (large batch effect, 1000 nodes, 2000 edges; run time in seconds versus batch size; curves: RSJM, ABF, original BF).

Figure 3 shows a plot of what happens when we apply 1000 batches of changes to the graph, but alter the number of changes per batch, so that the total number of changes actually varies from 1000 upwards. As expected, RSJM takes total time proportional to the number of changes. But the other algorithms take nearly constant time as the batch size varies, which provides the benefit. The reason for the almost constant time seen here is that other bookkeeping operations dominate over the actual computation at this stage. As the batch size increases (asymptotically), we would expect that the adaptive algorithm takes more and more time to operate, finally converging to the same performance as the standard Bellman-Ford algorithm.

As mentioned previously, the adaptive algorithm is better than the incremental algorithm at handling changes in batches. Table 1 shows the relative speedup for different batch sizes on a graph of 1000 nodes and 2000 edges. Although the exact speedup may vary, it is clear that as the number of changes in a batch increases, the benefit of using the adaptive approach is considerable. Figure 4 illustrates this for a graph with 1000 vertices and 2000 edges. We have plotted this on a log scale to capture the effect of a large variation in batch size. Because of this, note that the difference in performance between the incremental algorithm and starting from scratch is actually a factor of 3 or so at the beginning, which is considerable. Also, this figure does not show the performance of Howard's algorithm, because, as can be seen from Figures 2 and 3, the ABF algorithm considerably outperforms Howard's algorithm in this context.
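The batched experiment just described can be summarized by a small driver loop. The sketch below is ours (it reuses the hypothetical check_feasible helper from the sketch in Section 3, and the random reweighting is only one of the three change types used in the experiments); it accumulates a whole batch of edge-weight changes before running a single feasibility check.

#include <stdlib.h>

#define NV 1000          /* graph size used in the experiments above */
#define NE 2000

typedef struct { int from, to; double weight; } Edge;

/* warm-started feasibility check, as sketched in Section 3 */
int check_feasible(const Edge edges[], int num_edges, double dist[]);

void run_batches(Edge edges[], double dist[], int num_batches, int batch_size)
{
    for (int b = 0; b < num_batches; b++) {
        /* accumulate a whole batch of random edge-weight changes ... */
        for (int c = 0; c < batch_size; c++) {
            int k = rand() % NE;
            edges[k].weight = (double)(rand() % 2001 - 1000);
        }
        /* ... and only then run one (adaptive) negative cycle check */
        if (check_feasible(edges, NE, dist)) {
            /* a negative cycle was introduced somewhere in this batch */
        }
    }
}

The point of the structure is visible directly in the loop nesting: the cost of the feasibility check is paid once per batch rather than once per change.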

31 High-Level Synthesis of DSP Applications Using Adaptive Negative Cycle Detection 901 An important feature that can be noted from Figure 4 is the behavior of the algorithms as the number of changes between updates becomes very large The RSJM algorithm is completely unaffected by this increase, since it has to continue processing changes one at a time For very large changes, even when we start from scratch, we find that the total time for update starts to increase, because now the time taken to implement the changes itself becomes a factor that dominates overall performance In between these two extremes, we see that our incremental algorithm provides considerable improvements for small batch sizes, but for large batches of changes, it tends towards the performance of the original Bellman-Ford algorithm for negative cycle detection From Figures 3 and 4, we see, as expected, that the RSJM algorithm takes time proportional to the total number of changes Howard s algorithm also appears to take more time when the number of changes increases Figure 2 allows us to estimate at what batch size each of the other algorithms becomes more efficient than the RSJM algorithm Note that the scale on this figure is also logarithmic Another point to note with regard to these experiments is that they represent the relative behavior for graphs with 1000 vertices and 2000 edges These numbers were chosen to obtain reasonable run times on the experiments Similar results are obtained for other graph sizes, with a slight trend indicating that the break-even point, where our adaptive algorithm starts outperforming the incremental approach, shifts to lower batch-sizes for larger graphs 5 APPLICATIONS We present two applications that make extensive use of algorithms for negative cycle detection In addition, these applications also present situations where we encounter the same graph with slight modifications either in the edge-weights (MCM computation) or in the actual addition and deletion of a small number of edges (scheduling search techniques) As a result, these provide good examples of the type of applications that would benefit from the adaptive solution to the negative cycle detection problem As mentioned in Section 1, these problems are central to the high-level synthesis of DSP systems 51 Maximum cycle mean computation The first application we consider is the computation of the MCM of a weighted digraph This is defined as the maximum over all directed cycles of the sum of the arc weights divided by the number of delay elements on the arcs This metric plays an important role in discrete systems and embedded systems [2, 21], since it represents the greatest throughput that can be extracted from the system Also, as mentioned in [21], there are situations where it may be desirable to recompute this measure several times on closely related graphs, for example, for the purpose of design space exploration As specific examples, [6] proposes an algorithm for dataflow graph partitioning where the repeated computation of the MCM plays a key role, and [22] discusses the utility of frequent MCM computation to synchronization optimization in embedded multiprocessors Therefore, efficient algorithms for this problem can make it reasonable to consider using such solutions instead of the simpler heuristics that are otherwise necessary Although several results such as [23, 24] provide polynomial time algorithms for the problem of MCM computation, the first extensive study of algorithmic alternatives for it has been undertaken 
by Dasdan et al. [2]. They concluded that the best existing algorithm in practice for this problem appears to be Howard's algorithm, which, unfortunately, does not have a known polynomial bound on its running time.

To model this application, the edge weights on our graph are obtained from the equation

weight(u → v) = delay(e) × P − exec_time(u), (5)

where weight(e) refers to the weight of the edge e : u → v, delay(e) refers to the number of delay elements (flip-flops) on the edge, exec_time(u) is the propagation delay of the circuit element at the source vertex of the edge, and P is the desired clock period that we are testing the system for. In other words, if the graph with weights as mentioned above does not have negative cycles, then P is a feasible clock for the system. We can then perform a binary search in order to compute P to any precision we require. This algorithm is attributed to Lawler [9]. Our contribution here is to apply the adaptive negative cycle detection techniques to this algorithm and analyze the improved algorithm that is obtained as a result.

5.1.1. Experimental setup

For an experimental study, we build on the work by Dasdan et al. [2], where the authors have conducted an extensive study of algorithms for this problem. They conclude that Howard's algorithm [10] appears to be the fastest experimentally, even though no theoretical time bounds indicate this. As will be seen, our algorithm performs almost as well as Howard's algorithm on several useful-sized graphs, and especially on the circuits of the ISCAS 89/93 benchmarks, where our algorithm typically performs better. For comparison purposes, we implemented our algorithm in the C programming language, and compared it against the implementation provided by the authors of [10]. Although the authors do not claim their implementation is the fastest possible, it appears to be a very efficient implementation, and we could not find any obvious ways of improving it. As we mentioned in the previous section, the implementation we used incorporates the improvements proposed by Dasdan et al. [2]. The experiments were run on a Sun Ultra SPARC-10 (333 MHz processor, 128 MB memory). This machine would classify as a medium-range workstation under present conditions.

It is clear that the best performance bound that can be placed on the algorithm as it stands is O(|V||E| log T), where T is the maximum value of P that we examine in the search procedure, and |V| and |E| are, respectively, the size of the input graph in number of vertices and edges. However, our experiments show that it performs significantly faster than would be expected from this bound.
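A compact, self-contained sketch of this binary search is given below. It is ours, not the paper's implementation: the identifiers are illustrative, and the cycle test shown is a plain relaxation loop rather than the Tarjan-based adaptive routine actually used, so it should be read only as a statement of the method. A trial period P reweights every edge to delay(e)·P − exec_time(u), and P is lowered whenever the reweighted graph is free of negative cycles.

#define NV 64            /* capacities for this sketch only (num_vertices <= NV) */

typedef struct { int from, to; int delay; } Arc;

/* Returns 1 if the graph reweighted with weight(e) = delay(e)*P - exec_time(from(e))
 * has a negative cycle, i.e., if the trial period P is not achievable. */
static int has_negative_cycle(const Arc arcs[], int num_arcs,
                              const double exec_time[], int num_vertices, double P)
{
    double dist[NV] = { 0.0 };   /* zero labels play the role of the augmenting source */

    for (int pass = 0; pass <= num_vertices; pass++) {
        int changed = 0;
        for (int k = 0; k < num_arcs; k++) {
            double w = arcs[k].delay * P - exec_time[arcs[k].from];
            if (dist[arcs[k].from] + w < dist[arcs[k].to]) {
                dist[arcs[k].to] = dist[arcs[k].from] + w;
                changed = 1;
            }
        }
        if (!changed)
            return 0;            /* all constraints met: P is feasible */
    }
    return 1;                    /* still relaxing: some cycle is negative */
}

/* Binary search for the smallest feasible period, i.e., the MCM.
 * The caller supplies lo below the MCM and hi above it (for example, 0 and
 * the total execution time of all vertices). */
double compute_mcm(const Arc arcs[], int num_arcs,
                   const double exec_time[], int num_vertices,
                   double lo, double hi, double precision)
{
    while (hi - lo > precision) {
        double mid = 0.5 * (lo + hi);
        if (has_negative_cycle(arcs, num_arcs, exec_time, num_vertices, mid))
            lo = mid;            /* period too small */
        else
            hi = mid;            /* period achievable, try a smaller one */
    }
    return hi;
}

In the adaptive variant, dist[] would be retained across successive trial periods instead of being reset to zero, which is precisely where the speedup reported below comes from.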

One point to note is that since we are doing a binary search on T, we are forced to set a limit on the precision to which we compute our answer. This precision in turn depends on the maximum value of the edge weights, as well as the actual precision desired in the application itself. Since these depend on the application, we have had to choose values for these. We have used a random graph generator that generates integer weights for the edges in a fixed range. For this range of weights, it could be argued that integer precision would be sufficient. However, since the maximum cycle mean is a ratio, it is not restricted to integer values. We have therefore conservatively chosen a precision of 0.001 for the binary search (i.e., 10^-7 times the maximum edge weight). Increasing the precision by a factor of 2 requires one more run of the negative cycle detection algorithm, which would imply a proportionate increase in the total time taken for computation of the MCM.

With regard to the ISCAS benchmarks, note that there is a slight ambiguity in translating the net-lists into graphs. This arises because a D-type flip-flop can either be treated as a single edge with a delay, with the fanout proceeding from the sink of this edge, or as k separate edges with unit delay emanating from the source vertex. In the former treatment, it makes more sense to talk about the |D|/|V| ratio (|D| being the number of D flip-flops), as opposed to the |D|/|E| ratio that we use in the experiments with random graphs. However, the difference between the two treatments is not significant and can be safely ignored.

We also conducted experiments where we vary the number of edges with delays on them. For this, we need to exercise care, since we may introduce cycles without delays on them, which are fundamentally infeasible and do not have a maximum cycle mean. To avoid this, we follow the policy of treating edges with delays as back-edges in an otherwise acyclic graph [15]. This view is inspired by the structure of circuits, where a delay element usually figures in the feedback portion of the system. Unfortunately, one effect of this is that when we have a low number of delay edges, the resulting graph tends to have an asymmetric structure: it is almost acyclic with only a few edges in the reverse direction. It is not clear how to get around this problem in a fashion that does not destroy the symmetry of the graph, since this requires solving the feedback arc set problem, which is NP-hard [25]. One effect of this is in the way it impacts the performance of the Bellman-Ford algorithm. When the number of edges with delays is small, there are several negative-weight edges, which means that the standard Bellman-Ford algorithm spends large amounts of time trying to compute shortest paths initially. The incremental approach, however, is able to avoid this excess computation for large values of T, which results in its performance being considerably faster when the number of delays is small.

Intuitively, therefore, for the above situation, we would expect our algorithm to perform better. This is because, for the MCM problem, a change in the value of P for which we are testing the system will cause changes in the weights of those edges which have delays on them. If these are fewer, then we would expect that fewer operations would be required overall when we retain information across iterations.

Figure 5: Comparison of algorithms as the number of feedback edges (with delays) is varied as a proportion of the total number of edges (run time in seconds versus feedback edge ratio; curves: ABF-based MCM computation, MCM computation using the normal BF routine, Howard's algorithm).
This is borne out by the experiments, as discussed in Section 5.1.2.

Our experiments focus more on the kinds of graphs that appear to represent real graphs. By this we mean graphs for which the average out-degree of a vertex (number of edges divided by number of vertices) and the relative number of edges with delays on them are similar to those found in real circuits. We have used the ISCAS benchmarks as a good representative sample of real circuits, and we can see that they show remarkable similarity in the parameters we have described: the average out-degree of a vertex is close to and a little less than 2, while an average of about one-tenth or fewer of the edges have delays on them. An intuitive explanation for the former observation is that most real circuits are usually built up of a collection of simpler systems, which predominantly have small numbers of inputs and outputs. For example, logic gates typically have 2 inputs and 1 output, as do elements such as adders and multipliers. More complex elements like multiplexers and encoders are relatively rare, and even their effect is somewhat offset by single-input single-output units like NOT gates and filters.

5.1.2. Experimental results

We now present the results of the experiments on random graphs with different parameters of the graph being varied. We first consider the behavior of the algorithms for random graphs consisting of a fixed number of vertices and edges, when the feedback-edge ratio (ratio of edges with nonzero delay to total number of edges) is varied from 0 to 1 in increments of 0.1. The resulting plot is shown in Figure 5. As discussed in Section 5.1.1, for small values of this ratio,

the graph is nearly acyclic, and almost all edges have negative weights. As a result, the normal Bellman-Ford algorithm performs a large number of computations that increase its running time. The ABF-based algorithm is able to avoid this overhead due to its property of retaining information across runs, and so it performs significantly better for small values of the feedback edge ratio. The ABF-based algorithm and Howard's algorithm perform almost identically in this experiment. The points on the plot represent an average over 10 random graphs each.
Figure 6: Performance of the algorithms as graph size varies (run time (s) versus number of vertices; MCM using ABF, MCM using original BF, Howard's algorithm): all edges have delays (feedback edges) and number of edges = twice the number of vertices.
Figure 7: Performance of the algorithms as graph size varies (run time (s) versus number of nodes; same three algorithms): proportion of edges with delays = 0.1 and number of edges = twice the number of vertices (Y-axis limited to show detail).
Figure 6 shows the effect of varying the number of vertices. The average degree of the graph is kept constant, so that there is an average of 2 edges per vertex, and the feedback edge ratio is kept constant at 1 (all edges have delays). The reason for the choice of average degree was explained in Section 5.1.1. Figure 7 shows the same experiment, but this time with a feedback edge ratio of 0.1. We have limited the displayed portion of the Y-axis since the values for the MCM computation using the original Bellman-Ford routine rise as high as 10 times that of the others and drown them out otherwise. These plots reveal an interesting point: as the size of the graph increases, Howard's algorithm performs less well than the MCM computation using the ABF algorithm. This indicates that for real circuits, the ABF-based algorithm may actually be a better choice than Howard's algorithm. This is borne out by the results of the ISCAS benchmarks.
Figure 8: Performance of the algorithms as graph edge density varies (run time (s) versus number of vertices; same three algorithms): all edges have delays (feedback edges) and the number of edges = .
Figures 8 and 9 show a study of what happens as the edge-density of the graph is varied: for this, we have kept the number of edges constant at , and the number of vertices varies from 1000 to . This means a variation in edge-density (ratio of the number of edges to the number of vertices) from 1.15 to 20. In both these figures, we see that the MCM computation using ABF performs especially well at low densities (sparse graphs), where it does considerably better than Howard's algorithm and the normal MCM computation using ordinary negative cycle detection. In addition, the point where the ABF-based algorithm starts performing better appears to be at around an edge-density of 2, which is also seen in Figure 5.
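As a concrete illustration of the procedure that these experiments time, the following minimal sketch shows a Lawler-style binary-search MCM computation built on a plain (nonadaptive) Bellman-Ford negative cycle check. The edge record, function names, and the upper bound used for the search are illustrative assumptions, not the paper's implementation; in particular, the ABF algorithm retains the distance estimates between successive tests of T, which this sketch deliberately omits.

#include <vector>

/* Illustrative edge record: weight w(e) and the number of delays d(e). */
struct Edge { int src, dst; double weight; int delays; };

/* Plain Bellman-Ford negative cycle test on the modified weights
   w'(e) = T * d(e) - w(e). A negative cycle means some cycle has
   weight/delays greater than T, i.e., the maximum cycle mean exceeds T.
   (The ABF algorithm instead keeps dist[] alive across successive calls.) */
bool exceedsCandidate(int numVertices, const std::vector<Edge>& edges, double T) {
    std::vector<double> dist(numVertices, 0.0); /* zero start acts as a virtual source to all vertices */
    for (int pass = 0; pass < numVertices; ++pass) {
        bool relaxed = false;
        for (const Edge& e : edges) {
            double w = T * e.delays - e.weight; /* only delay edges change when T changes */
            if (dist[e.src] + w < dist[e.dst]) {
                dist[e.dst] = dist[e.src] + w;
                relaxed = true;
            }
        }
        if (!relaxed) return false; /* settled: no negative cycle, so MCM <= T */
    }
    return true; /* still relaxing after |V| passes: negative cycle, so MCM > T */
}

/* Binary search for the maximum cycle mean, stopped at a fixed precision
   (0.001 in the experiments above); assumes nonnegative weights and at
   least one delay on every cycle. */
double maxCycleMean(int numVertices, const std::vector<Edge>& edges,
                    double maxEdgeWeight, double precision = 0.001) {
    double lo = 0.0;
    double hi = maxEdgeWeight * numVertices; /* crude upper bound on any cycle mean */
    while (hi - lo > precision) {
        double mid = 0.5 * (lo + hi);
        if (exceedsCandidate(numVertices, edges, mid))
            lo = mid; /* some cycle mean lies above mid */
        else
            hi = mid; /* every cycle mean is at or below mid */
    }
    return hi;
}

Only the T * delays term changes from one test of T to the next, so when few edges carry delays most of the work of a fresh Bellman-Ford run is repeated; this is exactly the redundancy that retaining information across runs removes, and the same feasibility test is what the scheduling search of Section 5.2 invokes after every move.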

Table 2: Run time for MCM computation for the 6 largest ISCAS 89/93 benchmarks. Columns: Benchmark; E/V; D/V; Orig. BF MCM (s); ABF MCM (s); Howard's algo (s).
Figure 9: Performance of the algorithms as graph size varies (run time (s) versus number of vertices; MCM using ABF, MCM using original BF, Howard's algorithm): proportion of edges with delays = 0.1 and the number of edges = .
We note the following features from the experiments:
(i) If all edges have unit delay, the MCM algorithm that uses our adaptive negative cycle detection provides some benefit, but less than in the case where few edges have delays.
(ii) When we vary the number of feedback edges (edges with delays), the benefit of the modifications becomes very considerable at low feedback ratios, doing better than Howard's algorithm for low edge densities.
(iii) In the ISCAS benchmarks, we can see that all of the circuits have E/V < 2 and D/V < 0.1 (D is the number of flip-flops, V is the total number of circuit elements, and E is the number of edges). In this range of parameters, our algorithm performs very well, even better than Howard's algorithm in several cases (also see Table 2 for our results on the ISCAS benchmarks).
Table 2 shows the results obtained when we used the different algorithms to compute MCMs for the circuits from the ISCAS 89/93 benchmark set. One point to note here is that the ISCAS circuits are not true HLS benchmarks; they were originally designed with logic circuits in mind, and as such, the normal assumption would be that all registers (flip-flops) in the system are triggered by the same clock. In order to use them for our testing, however, we have relaxed this assumption and allowed each flip-flop to be triggered on any phase; in particular, the phases that are computed by the MCM computation algorithm are such that the overall system speed is maximized. These benchmark circuits are still very important in the area of HLS, because real DSP circuits also show similar structure (sparseness and density of delay elements), and an important observation we can make from the experiments is that the structure of the graph is very relevant to the performance of the various algorithms in the MCM computation. As can be seen in Table 2, Lawler's algorithm does reasonably well at computing the MCM. However, when we use the adaptive negative cycle detection in place of the normal negative cycle detection technique, there is an increase in speed by a factor of 5 to 10 in most cases. This increase in speed is in fact sufficient to make Lawler's algorithm with this implementation up to twice as fast as Howard's algorithm, which was otherwise considered the fastest algorithm in practice for this problem.
5.2 Search techniques for scheduling
We demonstrate another application of our technique, efficient searching of schedules for IDFGs. The basic idea is that for scheduling an IDFG, we need to (a) assign vertices to processors and (b) assign relative positions to the vertices within each processor (for resource sharing). Once these two aspects are done, the schedule for a given throughput constraint is determined by finding a feasible solution to the constraint equations, which we do using the ABF algorithm. This idea that the ordering can be directly used to compute schedule times has been used previously, for example see [26]. Since the search process involves repeatedly checking the feasibility of many similar constraint systems, the advantages of the adaptive negative cycle detection
come into play The approach we have taken for the schedule search is (i) start with each vertex on its own processor, find a feasible solution on the fastest possible processor; (ii) examine each vertex in turn, and try to find a place for it on another processor (resource sharing) In doing so, we are making a small number of changes to the constraint system, and need to recompute a feasible solution; (iii) in choosing the new position, choose one that has minimum power (or area, or whatever cost we want to optimize); (iv) additional moves that can be made include inserting a new processor type and moving as many vertices onto

it as possible, moving vertices in groups from one processor to another, and so forth; (v) the technique also lends itself very well to application in schemes using evolutionary improvement [27]; (vi) in the present implementation, to choose among various equivalent implementations at a given stage, we use a weight based on giving greater importance to implementations that result in lower overall slack on the cycles in the system. (Slack of a cycle here refers to the difference between the total delay afforded by the registers (number of delay elements in the cycle times the clock period) and the sum of the execution times of the vertices on the cycle, and is useful since a large slack could be taken as an indication of under-utilization of resources.)
Each such move or modification that we make to the graph can be treated as a set of edge-changes in the precedence/processor constraint graph, and a feasible schedule would be found if the system does not have negative cycles. In addition, the dist(v) values that are obtained from applying the algorithm directly give us the starting times that will meet the schedule requirements.
We have applied this technique to attack the multiple-voltage scheduling problem addressed in [28]. The problem here is to find a schedule for the given DFG that minimizes the overall power consumption, subject to fixed constraints on the iteration period bound, and on the total number of resources available. For this example, we consider three resources: adders that operate at 5 V, adders that operate at 3.3 V, and multipliers that operate at 5 V. For the elliptic filter, multipliers operate in 2 time units, while for the FIR filter, they operate in 1 time unit. The 5 V adders operate in 1 time unit, while the 3.3 V adders always operate in 2 time units. It is clear that the power savings are obtained through scheduling as many adders as possible on 3.3 V adders instead of 5 V adders. We have used only the basic resource types mentioned in Table 3 to compare our results with those in [28]. However, there is no inherent limit imposed by the algorithm itself on the number of different kinds of resources that we can consider.
In tackling this problem, we have used only the most basic method, namely, moving vertices onto another existing processor. Already, the results match and even outperform that obtained in [28]. In addition, the method has the benefit that it can handle any number of voltages/processors, and can also easily be extended to other problems, such as homogeneous-processor scheduling [15].
Table 3 shows the power savings that were obtained using this technique. S and R power saving indicates the power savings (assuming 25 units for 5 V devices and 10.89 units for 3.3 V devices) obtained by [28], while ABF power savings refers to the results obtained using our algorithm (where the ABF algorithm is used to test the feasibility of the system after each move as per the definition above). The overall timing constraint T is the iteration period we are aiming for.
Table 3: Comparison between the ABF-based search and the algorithm of Sarrafzadeh and Raje [28] ( : failed to schedule, : not available). Columns: Example; Resource {5 V adders, 3.3 V adders, 5 V multipliers}; T; Power saved, S and R; Power saved, ABF.
5th order ellip filt  {2, 2, 2}  %  34.86%
                      {2, 1, 2}  %  16.60%
                      {2, 2, 2}  %  26.56%
                      {2, 1, 2}  %  14.94%
FIR filt              {1, 2, 1}  %
                      {1, 2, 2}  %
                      {1, 2, 1}  %
                      {1, 2, 2}  %  24.54%
Table 3 shows some interesting features: the iterative improvement based on the ABF algorithm (column marked
ABF) produced results with significantly higher power savings than the results presented in [28]. One important reason contributing to this could be that the iterative improvement algorithm makes full use of the iterative nature of the graphs, and produces schedules that make good use of the available inter-iteration parallelism. On the other hand, we find that for one of the configurations, the ABF-based algorithm is not able to find any valid schedule. This is because the simple nature of the algorithm occasionally results in it getting stuck in local minima, with the result that it is unable to find a valid schedule even when one exists. Several variations on this theme are possible; the search scheme could be used for other criteria such as the case where the architecture needs to be chosen (not fixed in advance), and modifications such as small amounts of randomization could be used to prevent the algorithm from getting stuck in local minima. This flexibility, combined with the speed improvements afforded by the improved adaptive negative cycle detection, can allow this method to form the core of a large class of scheduling techniques.
6 CONCLUSIONS
The problem of negative cycle detection is considered in the context of HLS for DSP systems. It was shown that important problems such as performance analysis and design space exploration often result in the construction of dynamic graphs, where it is necessary to repeatedly perform negative cycle detection on variants of the original graph. We have introduced an adaptive approach (the ABF algorithm) to negative cycle detection in dynamically changing graphs. Specifically, we have developed an enhancement to Tarjan's algorithm for detecting negative cycles in static graphs. This enhancement yields a powerful algorithm for dynamic graphs that outperforms previously available methods for addressing the scenario where multiple changes are made to the graph between updates. Our technique explicitly addresses the common, practical scenario in which negative cycle detection must be periodically performed after intervals in which a small number of changes are made to the graph. We have shown by experiments that for reasonable sized graphs ( vertices and edges) our al-

36 906 EURASIP Journal on Applied Signal Processing gorithm outperforms the incremental algorithm (one change processed at a time) described in [4] evenforchangesmade in groups of as little as 4 5 at a time As our original interest in the negative cycle detection problem arose from its application to the problems described above in HLS, we have implemented some schemes that make use of the adaptive approach to solve those problems We have shown how our adaptive approach to negative cycle detection can be exploited to compute the maximum cycle mean of a weighted digraph, which is a relevant metric for determining the throughput of DSP system implementations We have compared our ABF technique, and ABF-based MCM computation technique against the best known related work in the literature, and have observed favorable performance Specifically, the new technique provides better performance than Howard s algorithm for sparse graphs with relatively few edges that have delays Since computing power is cheaply available now, it is increasingly worthwhile to employ extensive search techniques for solving NP-hard analysis and design problems such as scheduling The availability of an efficient adaptive negative cycle detection algorithm can make this process much more efficient in many application contexts We have demonstrated this concretely by employing our ABF algorithm within the framework of a search strategy for multiple voltage scheduling ACKNOWLEDGMENTS This research was supported in part by the US National Science Foundation (NSF) Grant # , NSF NYI Award MIP , and the Advanced Sensors Collaborative Technology Alliance REFERENCES [1] R Reiter, Scheduling parallel computations, Journal of the ACM, vol 15, no 4, pp , 1968 [2] A Dasdan, S S Irani, and R K Gupta, Efficient algorithms for optimum cycle mean and optimum cost to time ratio problems, in 36th Design Automation Conference, pp 37 42, New Orleans, La, USA, ACM/IEEE, June 1999 [3] B Cherkassky and A V Goldberg, Negative cycle detection algorithms, Tech Rep tr , NEC Research Institute, March 1996 [4] G Ramalingam, J Song, L Joskowicz, and R E Miller, Solving systems of difference constraints incrementally, Algorithmica, vol 23, no 3, pp , 1999 [5] D Frigioni, A Marchetti-Spaccamela, and U Nanni, Fully dynamic shortest paths and negative cycle detection on digraphs with arbitrary arc weights, in ESA 98, vol 1461 of Lecture Notes in Computer Science, pp , Springer, Venice, Italy, August 1998 [6] L-T Liu, M Shih, J Lillis, and C-K Cheng, Data-flow partitioning with clock period and latency constraints, IEEE Trans on Circuits and Systems I: Fundamental Theory and Applications, vol 44, no 3, 1997 [7] G Ramalingam, Bounded incremental computation, PhD thesis, University of Wisconsin, Madison, Wis, USA, August 1993, revised version published by Springer-Verlag (1996) as vol 1089 of Lecture Notes in Computer Science [8] B Alpern, R Hoover, B K Rosen, P F Sweeney, and F K Zadeck, Incremental evaluation of computational circuits, in Proc 1st ACM-SIAM Symposium on Discrete Algorithms, pp 32 42, San Francisco, Calif, USA, January 1990 [9] E Lawler, Combinatorial Optimization: Networks and Matroids, Holt, Rhinehart and Winston, New York, NY, USA, 1976 [10] J Cochet-Terrasson, G Cohen, S Gaubert, M McGettrick, and J-P Quadrat, Numerical computation of spectral elementsinmax-plusalgebra, inproc IFAC Conf on Syst Structure and Control, Nantes, France, July 1998 [11] N Chandrachoodan, S S Bhattacharyya, and K J R Liu, Adaptive negative cycle detection in 
dynamic graphs, in Proc International Symposium on Circuits and Systems, vol V, pp , Sydney, Australia, May 2001 [12] G Ramalingam and T Reps, An incremental algorithm for a generalization of the shortest-paths problem, Journal of Algorithms, vol 21, no 2, pp , 1996 [13] D Frigioni, M Ioffreda, U Nanni, and G Pasqualone, Experimental analysis of dynamic algorithms for single source shortest paths problem, in Proc Workshop on Algorithm Engineering, pp 54 63, Ca Dolfin, Venice, Italy, September 1997 [14] R K Ahuja, T L Magnanti, and J B Orlin, Network Flows, Prentice-Hall, Upper Saddle River, NJ, USA, 1993 [15] S M H de Groot, S H Gerez, and O E Herrmann, Rangechart-guided iterative data-flow graph scheduling, IEEE Trans on Circuits and Systems I: Fundamental Theory and Applications, vol 39, no 5, pp , 1992 [16] K K Parhi and D G Messerschmitt, Static rate-optimal scheduling of iterative data-flow programs via optimum unfolding, IEEE Trans on Computers, vol 40, no 2, pp , 1991 [17] R E Tarjan, Shortest paths, Tech Rep, AT&T Bell laboratories, Murray Hill, New Jersey, USA, 1981 [18] THCormen,CELeiserson,andRLRivest, Introduction to Algorithms, MIT Press, Cambridge, Mass, USA, 1990 [19] A Dasdan, S S Irani, and R K Gupta, An experimental study of minimum mean cycle algorithms, Tech Rep UCI- ICS #98-32, University of California, Irvine, 1998 [20] K Mehlhorn and S Näher, LEDA: A platform for combinatorial and geometric computing, Communications of the ACM, vol 38, no 1, pp , 1995 [21] K Ito and K K Parhi, Determining the minimum iteration period of an algorithm, Journal of VLSI Signal Processing, vol 11, no 3, pp , 1995 [22] S S Bhattacharyya, S Sriram, and E A Lee, Resynchronization for multiprocessor DSP systems, IEEE Trans on Circuits and Systems I: Fundamental Theory and Applications, vol 47, no 11, pp , 2000 [23] D Y Chao and D T Wang, Iteration bounds of single rate dataflow graphs for concurrent processing, IEEE Trans on Circuits and Systems I: Fundamental Theory and Applications, vol 40, no 9, pp , 1993 [24] S H Gerez, S M H de Groot, and O E Hermann, A polynomial time algorithm for computation of the iteration period bound in recursive dataflow graphs, IEEE Trans on Circuits and Systems I: Fundamental Theory and Applications, vol 39, no 1, pp 49 52, 1992 [25] M R Garey and D S Johnson, Computers and Intractability A Guide to the Theory of NP-Completeness, W H Freeman and Company, New York, NY, USA, 1979 [26] D J Wang and Y H Hu, Fully static multiprocessor array realizability criteria for real-time recurrent DSP applications, IEEE Trans Signal Processing, vol 42, no 5, pp , 1994

37 High-Level Synthesis of DSP Applications Using Adaptive Negative Cycle Detection 907 [27] T Bäck, U Hammel, and H-P Schwefel, Evolutionary computation: Comments on the history and current state, IEEE Trans Evolutionary Computation, vol 1, no 1, pp 3 17, 1997 [28] M Sarrafzadeh and S Raje, Scheduling with multiple voltages under resource constraints, in Proc 1999 International Symposium on Circuits and Systems, pp , Miami, Fla, USA, 30 May 2 June 1999 Proceedings of the IEEE, a Guest Editor of special issue on Signal Processing for Wireless Communications of IEEE Journal of Selected Areas in Communications, a Guest Editor of special issue on Multimedia Communications over Networks of IEEE Signal Processing Magazine, a Guest Editor of special issue on Multimedia over IP of IEEE Trans on Multimedia, and an editor of Journal of VLSI Signal Processing Systems Nitin Chandrachoodan wasbornonaugust 11, 1975 in Madras, India He received the BTech degree in electronics and communications engineering from the Indian Institute of Technology, Madras, in 1996, and the MS degree in electrical engineering from the University of Maryland at College Park in 1998 He is currently a PhD candidate at the University of Maryland His research concerns analysis and representation techniques for system level synthesis of DSP dataflow graphs Shuvra S Bhattacharyya received the BS degree from the University of Wisconsin at Madison, and the PhD degree from the University of California at Berkeley He is an Associate Professor in the Department of Electrical and Computer Engineering, and the Institute for Advanced Computer Studies (UMIACS) at the University of Maryland, College Park He is also an Affiliate Associate Professor in the Department of Computer Science The coauthor of two books and the author or coauthor of more than 50 refereed technical articles, Dr Bhattacharyya is a recipient of the NSF Career Award His research interests center around architectures and computer-aided design for embedded systems, with emphasis on hardware/software codesign for signal, image, and video processing Dr Bhattacharyya has held industrial positions as a Researcher at Hitachi, and as a Compiler Developer at Kuck & Associates K J Ray Liu received the BS degree from the National Taiwan University, and the PhD degree from UCLA, both in electrical engineering He is Professor at the Electrical and Computer Engineering Department of University of Maryland, College Park His research interests span broad aspects of signal processing architectures; multimedia communications and signal processing; wireless communications and networking; information security; and bioinformatics in which he has published over 230 refereed papers, of which over 70 are in archival journals Dr Liu is the recipient of numerous awards including the 1994 National Science Foundation Young Investigator, the IEEE Signal Processing Society s 1993 Senior Award, IEEE 50th Vehicular Technology Conference Best Paper Award, Amsterdam, 1999 He also received the George Corcoran Award in 1994 for outstanding contributions to electrical engineering education and the Outstanding Systems Engineering Faculty Award in 1996 in the recognition of outstanding contributions in interdisciplinary research, both from the University of Maryland Dr Liu is Editor-in-Chief of EURASIP Journal on Applied Signal Processing, and has been an Associate Editor of IEEE Transactions on Signal Processing, a Guest Editor of special issues on Multimedia Signal Processing of

38 EURASIP Journal on Applied Signal Processing 2002:9, c 2002 Hindawi Publishing Corporation Design and DSP Implementation of Fixed-Point Systems Martin Coors Institute for Integrated Signal Processing Systems, Aachen University of Technology, Aachen, Germany coors@issrwth-aachende Holger Keding Institute for Integrated Signal Processing Systems, Aachen University of Technology, Aachen, Germany keding@issrwth-aachende Olaf Lüthje Institute for Integrated Signal Processing Systems, Aachen University of Technology, Aachen, Germany luethje@issrwth-aachende Heinrich Meyr Institute for Integrated Signal Processing Systems, Aachen University of Technology, Aachen, Germany meyr@issrwth-aachende Received 31 August 2001 This article is an introduction to the FRIDGE design environment which supports the design and DSP implementation of fixedpoint digital signal processing systems We present the tool-supported transformation of signal processing algorithms coded in floating-point ANSI C to a fixed-point representation in SystemC We introduce the novel approach to control and data flow analysis, which is necessary for the transformation The design environment enables fast bit-true simulation by mapping the fixed-point algorithm to integral data types of the host machine A speedup by a factor of 20 to 400 can be achieved compared to C-library-based bit-true simulation FRIDGE also provides a direct link to DSP implementation by processor specific C code generation and advanced code optimization Keywords and phrases: fixed-point design, design methodology, data flow analysis, compiled simulation, code optimization 1 INTRODUCTION Digital system design is characterized by ever-increasing complexity that has to be implemented within reduced time, resulting in minimum costs and short time-to-market This requires a seamless design flow that allows the execution of the design steps at the highest suitable level of abstraction For most digital systems, the design has to result in a fixed-point implementation, either in HW or SW This is due to the fact that these systems are sensitive to power consumption, chip size, throughput, and price-per-device Fixed-point realizations outperform floating-point realizations by far with regard to these criteria A typical fixed-point design flow is depicted in Figure 1 Algorithm design starts from a floating-point description that is analyzed by means of simulation without taking the quantization effects into account This abstraction from all implementation effects allows an exploration of the algorithm space, for example, the evaluation of different digital receiver structures This exploration is well supported by a variety of commercial block-diagram oriented system level design tools [1, 2, 3]The modeling efficiency on the floatingpoint level is high and the floating-point models offer a maximum degree of reusability In a next step towards system implementation, a transformation to a bit-true representation of the system is necessary, that is, assigning a fixed word length and a fixed exponent to every operand This process is quite tedious and error-prone if done manually: often more than 50% of the implementation time is spent on the algorithmic transformation [4] to the fixed-point level for complex designs once the floatingpoint model has been specified The major reasons for this bottleneck are as follows: (1) There is no unique transformation from floatingpoint to fixed-point (a) Different HW and SW targets put different constraints on the fixed-point specification 
(b) Optimization for different design criteria, like throughput, chip size, memory size, or accuracy are in general mutually exclusive goals and result in a complex design

39 Design and DSP Implementation of Fixed-Point Systems 909 Algorithmic transformation Description transformation Fixed-point Floating-point floating Floating floating point point ok? Ok? ok? quantization Quantization quantization fixed Fixed fixed point point ok? Ok? ok? coding Coding coding SW/HW SW / / HW HW Figure 1: Fixed-point design process Throughput Design space exploration Evaluation of the bit-true behavior Implementation Quantization noise implementation needs to be optimized with respect to chip area, memory consumption, throughput, and power consumption Here the bit-true system-level model serves as a golden reference for the target implementation which yields bit-by-bit the same results To increase the designer s efficiency, software tool support for fixed-point design is necessary Ideally the design environment would have the following features: (1) A modeling language supporting generic fixed-point data types to model the fixed-point behavior of the system It will also provide a means of data monitoring of variables and operands during simulation, for example, range, mean, and variance (2) A semiautomatic transformation from floating-point to a bit-true representation The designer can bring in his knowledge about the system and he has full control over the transformation The tool will accept a set of constraints specified by the designer to model the characteristics of the target hardware (3) The ability to perform bit-true simulation with a simulation speed close to floating-point simulation (4) A seamless design flow down to system implementation, generating optimized input for DSP compilers These requirements have been the motivation for the Fixed-point programming and Design Environment (FRIDGE) [6, 7, 8], an interactive design environment for the specification, simulation, and implementation of fixed-point systems In this article we describe the principles and elements of FRIDGE and outline the seamless design flow as it becomes possible with this design environment FRIDGE relies on five main concepts which are briefly introduced in the following Program memory/ chip size Figure 2: Fixed-point design space space as sketched in Figure 2 Furthermore, targets with a given datapath, for example, DSPs put different constraints on the quantization than ASICs where the datapaths are flexible (c) The quantization is generally highly dependent on the application, that is, on the applied stimuli (2) Quantization is a nonlinear process Analytical models based on signal theory are only applicable for systems with a low complexity [5] An exploration of the fixed-point design space with respect to quantization noise, performance, and operand word lengths cannot be done without extensive system simulation (3) Some algorithms are difficult to implement in fixedpoint due to high signal dynamics or sensitivity to quantization noise Thus algorithmic alternatives need to be employed Finally, the quantized system is implemented, either in hardware or in software on a programmable DSP The 11 Fixed-point modeling language DSP system design is frequently done on a PC or a workstation utilizing a C/C-based system-level design environment For efficient modeling of finite word length effects, language extensions implementing generic fixed-point data types are necessary ANSI C does not offer such data types and hence fixed-point modeling using pure ANSI C becomes a very tedious and error-prone task Fixed-point language extensions implemented as libraries in C [9, 10, 11] offer a high modeling 
efficiency They supply generic fixed-point data types and various casting modes for overflow and quantization handling The simulation speed of these libraries on the other hand is rather poor Some of these libraries also offer data monitoring capabilities during simulation time In the FRIDGE design environment, the SystemC fixedpoint data types are used for fixed-point modeling and simulation A more detailed description of the SystemC fixedpoint data types is given in Section 3 12 Interpolative transformation A central component of the FRIDGE design environment is the interpolative transformation from a hybrid description into a fully bit-true representation The interpolative

40 910 EURASIP Journal on Applied Signal Processing transformation, which is presented in detail in Section 4 uses analytical range propagation to determine operand word lengths 13 Data flow analysis During the development of the FRIDGE design environment, we have identified a need for accurate data flow analysis The published approaches for static and dynamic program analysis did not match the requirements of the design environment, thus we have developed a novel approach for control and data flow analysis, which is presented in Section 5 Floating-point ANSI-C code Global annotations Local annotations Fixed-point code Hybrid code Interpolation Hybrid simulation Simulation engine Bit-true simulation 14 Fast bit-true simulation Existing C-based simulation libraries model the fixedpoint operands as objects and make extensive use of operator overloading and container data types Also, for ease of use, many decisions are made during run time These mechanisms increase the execution time of fixed-point simulations by one to two orders of magnitude compared to floatingpoint arithmetic This makes the simulation run time a major bottleneck during the fixed-point design process In Section 7 various approaches for fixed-point simulation are presented and a methodology for fast bit-true simulation by mapping fixed-point algorithms in SystemC to an integer based ANSI C algorithm is introduced 15 DSP target mapping The final step in a float-to-fixed design flow is the implementation of the DSP system, either in hardware or in software As a case study for targeting a high performance DSP, we have developed a FRIDGE back end which addresses the Texas Instruments TMS 320C62x fixed-point DSP processor and its C compiler The back end generates target specific integer C code which exploits the features of the processor and the compiler to achieve a high efficiency of the compiled code In Section 9 the FRIDGE C62x back end and the optimization strategies are presented 2 THE FRIDGE DESIGN FLOW The FRIDGE design flow starts from a floating-point algorithm in ANSI C As illustrated in Figure 3, the designer then annotates single operands with fixed-point attributes Inserting these local annotations results in a hybrid description of the algorithm, that is, some of the operands are specified bit-true, while the rest remain floating-point A comparative simulation of the floating-point and the hybrid code within the same simulation environment shows whether the local annotations are appropriate, or if some annotations have to be modified The integer word length of the local annotations can be derived from operand range monitoring during simulation runs Typically, the designer manually annotates function parameters and key variables, for example, accumulator variables, which account for approximately 5% of all operands Figure 3: Quantization methodology with FRIDGE Once the hybrid program matches the design criteria, the remaining floating-point operands are automatically transferred to fixed-point operands by interpolation Interpolation denotes the process of computing the fixed-point parameters of the nonannotated operands from the information that is inherent to the annotated operands and the operations performed on them Additionally, the interpolator has to observe a set of global annotations, that is, default restrictions for the calculation of fixed-point parameters This can be, for example, a default maximum word length that corresponds to the register length of the target processor The interpolation results in 
a fully annotated program, where each operand and operation is specified bit-true way Cosimulating this algorithm with the original floating-point code will give an accuracy evaluation and for changes now only the set of local and/or global annotations have/has to be modified, while the rest is determined and kept consistent by the interpolator Described above are the algorithmic level transformations as illustrated in Figure 1, that change the behavior or accuracy of an algorithm The resulting completely bit-true algorithm in SystemC is not directly suited for implementation, thus it needs to be mapped to a target, such as, a processor s architecture or to an ASIC This is an implementation level transformation, where the bit-true behavior normally remains unchanged Within the FRIDGE environment, different back ends map the internal bit-true specification to different formats/targets, according to the purpose or goal of the quantization process 3 FIXED-POINT DATA TYPES AND LOCAL ANNOTATIONS Since ANSI C offers no efficient support for fixed-point data types [12, 13], we initially developed the fixed-point language fixed-c [14] that is a superset of the ANSI C language It comprises different generic fixed-point data types, cast operators, and interpolator directives The fixed-c language was licensed to Synopsys, Inc, and Synopsys contributed it as a set of additional fixed-point data types to the Open SystemC

Initiative (OSCI) [11]. Together with additional fixed-point language elements from the A|RT Library by Frontier Design, Inc. [10], fixed-c has been the base for the development of the SystemC fixed-point data types that are now used in the FRIDGE project as well.
The SystemC fixed-point data types are utilized for different purposes in the FRIDGE design flow: Since ANSI C is a subset of SystemC, the additional fixed-point constructs can be used as bit-true annotations to dedicated operands of the original floating-point ANSI C file, resulting in a hybrid specification. This partially fixed-point code can be used for simulation or as input to the interpolator. The bit-true output of the interpolator is represented in SystemC as well. This allows a maximum transparency of the results to the designer, since the changes to the code are reduced to a minimum and the effects of the designer's directives, such as local annotations in the hybrid code, become directly visible. The additional fixed-point types and functions are part of a C++ class library that can be used in any design and simulation environment that is based on or can integrate C or C++ code (see, e.g., [1, 2, 3]).
Figure 4: Fixed-point attributes of a bit-true description (wl: word length; iwl: integer word length; fwl: fractional word length; s: sign encoding/sign bit).
For a bit-true and implementation-independent specification of a fixed-point operand, a three-tuple is necessary: the word length wl, the integer word length iwl, and the sign s, as illustrated in Figure 4. For every fixed-point format, two of the three parameters wl, iwl, and fwl (fractional word length) are independent; the third parameter can always be calculated from the other two, wl = iwl + fwl. With a given sign encoding s, we can also compute the minimum and maximum value that the fixed-point format <wl,iwl> can hold. For example, for a two's complement (tc) signed representation the minimum and maximum compute to
max_{wl,iwl,tc} = 2^{iwl-1} - 2^{-fwl}, min_{wl,iwl,tc} = -2^{iwl-1}. (1)
For an unsigned representation (us), on the other hand, the minimum and maximum are
max_{wl,iwl,us} = 2^{iwl} - 2^{-fwl}, min_{wl,iwl,us} = 0. (2)
Note that an integral data type is merely a special case of a fixed-point data type with an iwl that always equals wl; hence an integral data type can be described by two parameters only, the word length wl and the sign encoding s.
In the following sections, we provide a short overview of the most frequently used fixed-point data types and functions in SystemC. A more detailed description can be found in the SystemC user's manual [11].
3.1 The data types sc_fixed and sc_ufixed
The two's complement data type sc_fixed and the unsigned data type sc_ufixed receive their format when they are declared, that is, the fixed-point attributes must be known at compile time (static arguments):
sc_fixed<wl,iwl> d,*e,g[8];
sc_ufixed<wl,iwl> c;
Thus they behave according to these fixed-point parameters throughout their lifetime. This concept is called declaration time instantiation (DTI). Similar concepts exist in other fixed-point languages as well [9, 10, 15]. Pointers and arrays, as frequently used in ANSI C, are supported as well.
For every assignment to a DTI variable, a data type check is performed. If the left-hand data type does not match the right-hand data type, as illustrated in the code example below, an implicit cast to the left-hand data type becomes necessary:
sc_fixed<6,3> a,b;
sc_ufixed<12,12> c;
a = b; /* correct, both types match
*/ c = b; /* type mismatch -> implicit cast necessary */ The data types sc fixed and sc ufixed are the data types of choice, for example, for interfaces to other functionalities or for lookup tables, since they behave like a memory location of a specific length and a known embedding/scaling 32 The data type sc fxval Additionally to the DTI data type concept, SystemC provides the assignment time instantiation (ATI) data type sc fxval This type may hold fixed-point numbers of arbitrary format and is especially tailored for the float-to-fixed transformation process A declaration of a variable of type sc fxval does not specify any fixed-point attributes and if subsequently in the code a fixed-point value is assigned to a sc fxval variable, the variable is (re-)instantiated with all fixed-point attributes of the assigned value 33 The data types sc fix and sc ufix Along with the static attribute types sc fixed and sc ufixed, SystemC also provides the fixed-point types sc fix and sc ufix that may also take nonstatic fixed-point attributes such as variables The function in the code example below has the word length wl and the integer word length iwl as formal parameters, that is, wl and iwl are not known at compile time

42 912 EURASIP Journal on Applied Signal Processing sc fxval cast func(int wl, int iwl, sc fxval in) { return sc fix(in,wl,iwl); } As shown in this example, the constructor for the types sc fix and sc ufix are often used to cast a value to a different fixed-point format 34 Cast modes For a cast operation to a fixed-point format <wl,iwl, sign>, it is also important to specify the overflow and precision reduction in case the target data type cannot hold the original value: a = sc_fix(input,wl,iwl,q_mode,o_mode); The variable a holds a two s complement fixed-point format <wl,iwl> and the value of input is cast to this fixed-point data type according to the quantization mode q mode 1 and the overflow mode o mode 2 The most important casting modes are listed below SystemC also specifies many additional cast modes to model target specific behavior Quantization modes Truncation (SC TRN) The bits below the specified LSB are cut off This quantization mode is the default for SystemC fixedpoint types and will be used if no other value is specified Rounding (SC RND) Adds LSB/2 first, before cutting off the bits below the LSB Overflow modes Wrap-around (SC WRAP) In case of an overflow the MSB carry bit is ignored This overflow mode is the default for SystemC fixed-point types and will be used if no other value is specified Saturation (SC SAT) In case the minimum or maximum value is exceeded the result is set to the minimum or maximum value, respectively With the sc fxval type, every assignment to a variable overwrites all prior instantiations, that is, one sc fxval variable may have different context-specific bit-true attributes in the same scope This concept of ATI is motivated by the specific design flow: transformation starts from a floating-point program, where the designer abstracts from the fixed-point problems and does not think of a variable as finite length register The concept of local annotations and ATI is also an effective way to assign context specific information without changing structures or variables when exploring the fixedpoint design space 1 The quantization handling specifies the behavior in case of a word length reduction at the LSB side 2 The overflow handling specifies the behavior in case of a word length reduction at the MSB side 4 INTERPOLATION The interpolator with its control and data flow analyzer is the core of the FRIDGE design environment As depicted in Figure 3 it determines the fixed-point formats for all operands of an algorithm, taking as input a user annotated hybrid description of the algorithm and a set of global default rules, the global annotation file Hence interpolation describes the computation of the fixed-point parameters of the nonannotated operands from the information that is inherent to the annotated operands The interpolative concept is based on three key ideas: (1) Attribute propagation The method of using the attributes of the bit-true specified operands in the code to calculate bit-true attributes for the remaining operands and operations in the code (2) Global annotations The description of default rules and restrictions for attribute propagation (3) Designer support The interpolator supplies feedback and reports to assist the designer to debug or improve the interpolation result For a better understanding the first two points are explained more detailed in the following (1) Attribute propagation Given the information of the fixed-point attributes of some operands, the type and the fixed-point format of other operands can be extracted from this 
information. For example, if for the inputs to an operation both the range and the relevant fractional word length are specified, the same attributes can be determined for the result.³ Consider the following lines of code:
c = a + b;
d = 1.5;
e = c * d;
The corresponding data flow graph is depicted in Figure 5. We assume that the ranges and the precision of the variables a and b are known, for example, by user annotations:
a ∈ [-0.25, 0.75] ⇒ R_a = [-0.25, 0.75]; fwl(a) = 2,
b ∈ [-1.25, 0.5] ⇒ R_b = [-1.25, 0.5]; fwl(b) = 2.
To receive the range R_c for the variable c that contains the sum of the variables a and b, we add the ranges R_a and R_b (a detailed description of the range arithmetic used here can be found in [14]):
R_c = R_a + R_b = [min_a + min_b, max_a + max_b] (3)
    = [-1.5, 1.25]. (4)
The precision P_c (fwl) for the sum c computes to the maximum of the precisions P_a and P_b:
P_c = max(P_a, P_b) = 2. (5)
The information on the range and on the precision of the variable c is sufficient to calculate the required word length or integer word length for c.
³ An exception is the division, where the accuracy of the operation must be specified as well.
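To make the propagation rules of this example, and of the integer word length computation that follows, concrete, here is a small self-contained sketch. The struct name, fields, and helper functions are illustrative assumptions and not FRIDGE code, but the rules they implement (ranges add for a sum, corner products bound a product, the finer fwl is kept for a sum while fwls add for a product, and the iwl follows from the resulting range) are the ones described in this section.

#include <algorithm>
#include <cmath>
#include <cstdio>

/* Illustrative value descriptor: range plus fractional word length (precision). */
struct FxRange {
    double lo, hi; /* value range [lo, hi] */
    int    fwl;    /* fractional word length */
};

/* Sum: ranges add, and the larger (finer) fractional word length is kept. */
FxRange propagateAdd(const FxRange& a, const FxRange& b) {
    return { a.lo + b.lo, a.hi + b.hi, std::max(a.fwl, b.fwl) };
}

/* Product: the extreme corner products bound the range, fractional word lengths add. */
FxRange propagateMul(const FxRange& a, const FxRange& b) {
    double c[4] = { a.lo * b.lo, a.lo * b.hi, a.hi * b.lo, a.hi * b.hi };
    return { *std::min_element(c, c + 4), *std::max_element(c, c + 4), a.fwl + b.fwl };
}

/* Integer word length for a two's complement operand covering [lo, hi]:
   iwl = ceil(max(log2|lo|, log2(hi + 2^-fwl))) + 1, the extra bit being the sign. */
int iwlTwosComplement(const FxRange& r) {
    double mag = std::max(std::fabs(r.lo), r.hi + std::ldexp(1.0, -r.fwl));
    return static_cast<int>(std::ceil(std::log2(mag))) + 1;
}

int main() {
    FxRange a = { -0.25, 0.75, 2 }, b = { -1.25, 0.5, 2 }, d = { 1.5, 1.5, 1 };
    FxRange c = propagateAdd(a, b); /* expected: [-1.5, 1.25], fwl = 2 */
    FxRange e = propagateMul(c, d); /* expected: [-2.25, 1.875], fwl = 3 */
    std::printf("c: [%g, %g] fwl=%d iwl=%d\n", c.lo, c.hi, c.fwl, iwlTwosComplement(c));
    std::printf("e: [%g, %g] fwl=%d iwl=%d\n", e.lo, e.hi, e.fwl, iwlTwosComplement(e));
    return 0;
}

Run as written, this sketch reproduces the worked example: c comes out with iwl = 2 (format <4,2,tc>, since wl = iwl + fwl) and e with iwl = 3 (format <6,3,tc>), matching the values derived below.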

Figure 5: Example for interpolation of ranges/word lengths (data flow graph with a ∈ [-0.25, 0.75], b ∈ [-1.25, 0.5], c ∈ [-1.5, 1.25], d = 1.5, and e ∈ [-2.25, 1.875]).
The correlation between fwl, range, and iwl yields the iwl of c:
iwl_c = ⌈max(log_2|min_c|, log_2(max_c + 2^{-fwl_c}))⌉ + 1 = ⌈max(0.58, 0.58)⌉ + 1 = 2. (6)
Thus the resulting format for c is <4,2,tc>, where tc indicates the two's complement representation of c. The next step for the interpolator is to compute the fixed-point format of the constant d. Since the range of d is R_d = [1.5, 1.5] and the precision is P_d = fwl_d = 1, the iwl of d can be calculated as
iwl_d = ⌈log_2(max_d + 2^{-fwl_d})⌉ = ⌈log_2(1.5 + 0.5)⌉ = 1. (7)
After all fixed-point parameters of the input operands to the multiplication e = d * c are known to the interpolator, it continues with the calculation of the bit-true format and parameters for the variable e:
R_e = R_c · R_d = [-1.5, 1.25] · 1.5 = [-2.25, 1.875], P_e = P_c + P_d = 2 + 1 = 3, iwl_e = ⌈max(log_2|min_e|, log_2(max_e + 2^{-fwl_e}))⌉ + 1 = ⌈max(1.17, 1)⌉ + 1 = 3. (8)
Hence we receive a fixed-point format of <6,3,tc> for the variable e.
Note that this is a rather conservative way of interpolation: bits that may contain any information are never discarded. For the MSB side this is called a worst case interpolation, since with the iwl calculated by the interpolator an overflow is impossible, while on the other hand it may lead to iwls much larger than actually needed. In this case the designer may add additional local annotations to cut back the iwl to a more suited value. For the LSB side this is called maximum precision interpolation (MPI), that is, by default every LSB of the operands is kept, maintaining the highest possible accuracy. LSBs are only discarded if the word length exceeds the maximum word length specified in the global annotation file. This can lead to a large increase in the fwl, but with additional local annotations the designer can also keep the fwl shorter. In [6] we also describe a method to have the interpolator calculate a less conservative value for the fwl.
(2) Global annotations. While local annotations express fixed-point information for single operands, the global annotations describe default restrictions to the complete design. For different targets, different global restrictions apply. For SW, the functional units to perform specific operations are already defined by the architecture of the processor. Consider a 16-bit multiplier writing to a 32-bit register. A global annotation can supply the information to the interpolator that the word length of a multiplication operand must not exceed 16 bits, while the result may have a word length of up to 32 bits.
4.1 Implementational issues
In a first step the FRIDGE front end parses the hybrid description into a C++-based intermediate representation (IR). Then range propagation is performed to determine the bit-true format for all the operands. During this process, control and data flow analysis is also carried out. The information gained is stored in the IR. The advanced algorithms used for the analysis will be described in Section 5. After this process the IR holds a bit-true description of the algorithm with additional control and data flow information. These data structures form the basis for additional transformation steps performed in the FRIDGE back ends that target different languages and platforms.
5 ADVANCED DATA FLOW ANALYSIS
During the development of the FRIDGE design environment, we have identified a need for accurate data flow
analysis to cater the needs of the interpolation, the fast simulation code generation and the target specific code optimization The published methods were not capable of matching the requirements, thus we have developed a novel approach for data flow analysis that can provide the necessary data for the FRIDGE back ends Researchers have worked on program analysis techniques since the 1960s and there is, by now, an extensive literature [16] There are two major approaches to program analysis: (a) There are static analysis techniques that analyze the program code at compile time Usually, sets of equations are set up according to the program semantics and solved by finding their fixpoint One of the best known static approaches is Data Flow Analysis Itistreatedindepthin standard compiler books [17, 18] Other techniques such as constraint-based analysis and abstract interpretation are also described in [19] PAG [20] is a tool for generating interprocedural data flow analyzers that implement these techniques (b) On the other hand, there are techniques for dynamic analysis that are used for examining the behavior of program code during execution Typically, these techniques are employed by profiling tools Profiling information can for example be used by programmers to find critical pieces of code or as input to profile-driven optimizers Dynamic program analysis techniques have been implemented in tools like Pixie [21] orqpt [22] By principle, dynamic program analysis relies on input vectors to be

44 914 EURASIP Journal on Applied Signal Processing processed during execution Thus the results are of no general nature Analysis techniques of neither category are suited for the needs of the FRIDGE design environment Static analysis puts tight constraints onto the code to be analyzed The use of pointers is usually not supported or yields too conservative results Implementations of digital signal processing systems usually make extensive use of pointers, even, for example, for iterating over data arrays Furthermore, static analysis is blind for program properties that result from run time effects However, especially these properties have to be taken into account by FRIDGE in order to obtain precise results Dynamic analysis is to some extend capable of detecting these properties Nevertheless, it is not applicable for the FRIDGE design environment for two reasons First, the results are of statistical, numerical nature There is no way to gain information about data flow or control flow properties Second, the results are not generally valid, that is, they only reflect the behavior of the program running on the given input vectors FRIDGE requires analysis results that are valid for all possible executions of the program though The requirements for the analysis employed by FRIDGE are different from those of standard tools like, for example, a general purpose compiler FRIDGE is focused on digital processing systems These systems are typically data flow dominated, that is, their execution is to a great extent independent from the data to be processed Besides, the accuracy and quality of the results are more important than speed (of analysis) This allows for a more comprehensive code analysis than, for example, a general purpose compiler can apply In order to gain precise results including also run time properties and being able to handle pointer operations, the code is interpreted Since there is no concrete data to be processed, we process abstract data instead In the following this methodology is referred to as abstract execution The data flow analysis unit in the FRIDGE design environment is based on three main components: (1) The concept of data abstraction (2) The state controlled memory model (3) The concept of coupled iterators 51 Data abstraction While in concrete execution numeric values are written to and read from memory, we use operations for abstract execution An operation is a collection of information about possible values The two most important elements are (1) the range, that is, the minimum value and the maximum value, and (2) a reference to the expression in the code that corresponds to the operation 4 4 This is for gaining data flow information Furthermore, operations may be ambiguous Consider the code example below 01 int func(int x, int y, int z){ 02 int a, b, c, d; switch(y){ 05 case 1: 06 a = 8; break; 07 case 2: 08 a = 16; break; 09 case 3: 10 a = 32;} if(z>0) 13 b = 0; 14 else 15 b = 1; if(x>0){ 18 c = 5; 19 d = a;} 20 else { 21 c = b; 22 d = 7;} return c d; 25 } The only information available about parameters x, y, and z is that they are integers Hence it cannot be decided which branches of the switch- and if-statements in lines 04, 12, and 17 are executed This results in an ambiguous content, for example, of variable b, namely,values0 5 and 1, referring to the expressions in lines 13 and 15, respectively We combine both operations to an ambiguous operation In addition, ambiguous operations are associated with conditions, under which the alternatives are chosen In the 
example, alternative 0 is chosen if (z > 0) is true, alternative 1 if it is false In general, there may be more than two alternatives and conditions may be combined by a logical AND Operations are arranged in graphs similar to binary decision diagrams introduced by Akers [23], where the nodes embody the ambiguous operations and the leafs the unambiguous operations In general, operations are described by the following rules: (i) an operation is either an unambiguous operation or an ambiguous operation; (ii) an unambiguous operation represents a possible content in memory during concrete execution of a program; 5 When talking about a value, we mean an operation with a range degenerated to a value

45 Design and DSP Implementation of Fixed-Point Systems 915 Messages Interpreter Current state read/write Figure 6: Abstract execution Control State controlled memory model if (x >0) true 5 A 1 false A 2 true 0 false 1 switch (y) A 4 A 3 truefalse 7 case 1: 8 case 2: 16 case 3: 32 (iii) an ambiguous operation is associated with a control flow ambiguity in the code (dashed line in Figure 7) and matches each possible branch to an operation Step Nr if (z >0) Current state Thus these trees do not only contain the alternatives, but also the conditions under which the alternatives are taken The conditions are determined by all the ambiguities along the path from the root to the alternative Each ambiguity contributes to the condition in this way, that the condition for the execution of the control flow branch must be fulfilled, that is associated with the link to the next operation on the path A logical AND is applied to the contributions of each ambiguity For example, the tree in Figure 7 with A 3 as its root shows the ambiguity tree corresponding to variable d in line 24 The path to value 32 (bold line) goes through ambiguities A 3 and A 4 A 3 is associated with the if-statement and the path follows the link that is associated with the true-branch That yields the condition (x > 0) == true Further on, the path passes through A 4 and follows the link to 32 A 4 is associated with theswitch-statement and the link to 32 with case 3 That yields the condition y == 3 Thus the resulting condition for A 3 taking on the value 32 is 6 (1) (2) (3) (4) (5) (6) (x >0) == t&&y == 1 (x >0) == t&&y == 2 (x >0) == t&&y == 3 (x >0) == f &&(z >0) == t (x >0) == f &&(z >0) == f Figure 7: Iterating over ambiguities (x > 0) == true && y == 3 52 The state controlled memory model As illustrated in Figure 6, the state controlled memory Model serves as a regular memory that can be read and written to Besides, it is responsible for building the ambiguity trees described in Section 51 As long as the current state is in initial state, the behavior of the state controlled memory model does not differ from a regular memory Once the current state contains a condition, all changes done to memory contents only occur under that condition and result in appropriate ambiguity trees The state is defined by a set of assumptions about the result of particular expressions in the code A logical AND is performed on these assumptions The initial state makes no assumptions at all Other valid states could for example be (x > 0) == true or (x > 0) == true && y == 3 During abstract execution, the state can be changed by the interpreter 53 Iterating over ambiguities When abstractly executing statements (Section 54) orcom- puting the set of all possible evaluations of an expression, 7 We have to iterate over the alternatives of ambiguities This is basically done by traversing the corresponding tree However, the current state is taken into account, that is, only those alternatives arevisible, whose conditions are not contradictory to the current state Furthermore, when selecting an alternative from an ambiguity, the corresponding conditions are if not yet included added to the current state This way, the following is achieved: All data couplings are taken into account, that is, no impossible cases are considered Alternative executions of statements can be done without further thought about the current state (see Section 54) Selecting an alternative from an ambiguity is done by building a path through the corresponding tree The end of the path is 
5.3 Iterating over ambiguities

When abstractly executing statements (Section 5.4) or computing the set of all possible evaluations of an expression (as is done, for example, when computing the fixed-point parameters of an expression), we have to iterate over the alternatives of ambiguities. This is basically done by traversing the corresponding tree. However, the current state is taken into account, that is, only those alternatives are visible whose conditions are not contradictory to the current state. Furthermore, when selecting an alternative from an ambiguity, the corresponding conditions are added to the current state if they are not yet included. This way, the following is achieved: all data couplings are taken into account, that is, no impossible cases are considered, and alternative executions of statements can be done without further thought about the current state (see Section 5.4). Selecting an alternative from an ambiguity is done by building a path through the corresponding tree. The end of the path is an unambiguous operation.

In principle, iteration is performed on all successors of an ambiguity first, before the alternatives of the ambiguity itself are iterated over (depth first). When establishing a path through an ambiguity, two basic cases have to be considered:

(1) The current state contains a condition respective to the control flow fork that is associated with the ambiguity. In this case, the path must follow the link that corresponds to the condition and may not be altered. The node is considered a slave node.

(2) The current state does not yet contain a condition respective to the control flow fork that is associated with the ambiguity. In this case, a possible branch is selected and the path is extended by the corresponding link. The corresponding condition is added to the current state. The node is considered a master node. During further iteration, the path will switch to all other links successively. When this is done, the respective condition has to be updated accordingly. After that, the condition is removed from the current state.

The trees in Figure 7 show the contents of the variables c (left-hand side) and d (right-hand side) connected to line 24 in the code. Figure 7 also illustrates how to iterate over all possible combinations of the contents of both variables. Note how building a path through an ambiguity affects the current state and how the current state masks the visible alternatives of ambiguities. First of all, value 5 is selected from ambiguity A 1. The corresponding condition ((x > 0) == true) is added to the current state; thus A 1 becomes a master node. When building the path through A 3, A 3 becomes a slave node, because the current state already makes an assumption about the control flow ambiguity that is associated with A 3 ((x > 0)). Therefore, the path must follow the link from A 3 to A 4. Nodes A 2 and A 4 are associated with different control flow forks, respectively. They always become master nodes and never affect any other ambiguities. Steps 2 and 3 iterate over the remaining visible alternatives of the right-hand tree. Step 4 switches to the second alternative of master node A 1 (false). This affects the slave A 3 in this way: as long as the path in the left-hand tree goes from A 1 to A 2 (steps 4 and 5), the only visible alternative of the right-hand tree is 7. In step 6 the iteration has been completed.

5.4 Execution of a program

Figure 8: Abstract executions of sequential statements (solid lines: concrete control flow; dashed lines: alternative executions).

Figure 8 shows how statements are abstractly executed. The solid lines represent the control flow of a concrete execution. Abstract execution also follows that control flow. However, statements that depend on ambiguous data are executed multiple times (dashed lines), once for every possible vector of the involved ambiguities. The vectors are iterated over as described in Section 5.3. Thus every execution is performed in a different current state, such that changes in memory together with their corresponding states are stored in ambiguity trees. This algorithm is applied recursively for nested statements. Any code construct can be executed this way. Although a possibly large number of execution states exists, we found that the run time and the memory consumption of the analysis were remarkably low for typical signal processing algorithms. In most cases the control and data flow analysis was performed in less than one second on an 800 MHz PC. The information gained during abstract execution is stored in the intermediate representation of the algorithm.
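The following self-contained C sketch illustrates this iteration in a deliberately simplified form: every alternative carries the conditions collected along its path, a fork that is already assumed in the current state contributes no new assumption (slave behavior), and a new fork's condition is pushed onto the current state and retracted afterwards (master behavior). The types and the statement callback are hypothetical and only demonstrate the idea described above.

/* Hedged sketch: execute a statement once for every visible combination
 * of the alternatives of its ambiguous operands (depth-first). */
typedef struct cond { int fork_id; int branch; } cond_t;

typedef struct alt {                 /* a leaf of an ambiguity tree          */
    cond_t cond[4];                  /* conditions along its path (ANDed)    */
    int    n_cond;
    long   value;
    struct alt *next;
} alt_t;

typedef struct amb { alt_t *alts; } amb_t;
typedef struct cur_state { cond_t cond[16]; int depth; } cur_state_t;

static int lookup(const cur_state_t *s, int fork_id, int *branch)
{
    for (int i = 0; i < s->depth; i++)
        if (s->cond[i].fork_id == fork_id) { *branch = s->cond[i].branch; return 1; }
    return 0;
}

static void for_each_combination(amb_t **a, int n, int k, long *values,
                                 cur_state_t *s,
                                 void (*stmt)(const long *, const cur_state_t *))
{
    if (k == n) { stmt(values, s); return; }   /* one visible combination    */

    for (alt_t *alt = a[k]->alts; alt != NULL; alt = alt->next) {
        int visible = 1, pushed = 0, branch;
        for (int i = 0; i < alt->n_cond; i++) {
            if (lookup(s, alt->cond[i].fork_id, &branch)) {
                if (branch != alt->cond[i].branch) { visible = 0; break; }
            } else {
                s->cond[s->depth++] = alt->cond[i];   /* assume this branch  */
                pushed++;
            }
        }
        if (visible) {
            values[k] = alt->value;
            for_each_combination(a, n, k + 1, values, s, stmt);
        }
        s->depth -= pushed;                           /* retract assumptions */
    }
}

Applied to the example of Figure 7, with a[0] holding the tree of c and a[1] the tree of d, the callback is invoked once per iteration step and never for impossible combinations such as c = 5 together with d = 7.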
The FRIDGE back ends, which will be introduced in the next sections, access this information to perform several code transformation steps.

6 FAST BIT-TRUE SIMULATION

As pointed out in Section 1, transforming a signal processing algorithm from a floating-point to a fixed-point representation requires extensive simulations due to the nonlinear nature of the quantization process. The available C++-based fixed-point libraries [10, 11] offer a high modeling efficiency, but the simulation speed of these libraries, on the other hand, is rather poor. This makes simulation speed a major bottleneck in the fixed-point design process. Utilizing C-based fixed-point libraries like the ETSI basic arithmetic operations [24] does not overcome this problem, as the simulation speed still has a considerable overhead compared to an equivalent floating-point implementation.

Existing C++-based simulation libraries model the fixed-point operands as objects. In order to offer generic fixed-point data types without word length restrictions, data container types are used as an internal representation. Bit-true operations are performed by operator overloading. Range checking, the choice of cast modes, and many other decisions necessary for correct bit-true behavior are made at simulation time. The price for this flexibility and ease of modeling is slow execution speed, as the generic fixed-point data types modeled by extensive C++ constructs cannot be efficiently mapped to the architecture of the host machine by today's compilers.

A simulation speedup can be achieved by mapping the fixed-point operands to the mantissa of the floating-point hardware of the host machine and using bit-level manipulations to maintain bit-true behavior. This restricts the maximum word length of the fixed-point operands to the word length of the mantissa. This approach has been described by Kim et al. [25], and it is also implemented in the SystemC library [11].
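One way to picture this kind of emulation on the floating-point hardware, without claiming that either library implements it exactly this way, is to hold the operand in a double and force it back onto the fixed-point value grid after every operation; the helper below is a hypothetical illustration using truncation and saturation only.

#include <math.h>

/* Hedged sketch: re-quantize a double to the value grid of a (wl, iwl)
 * two's complement fixed-point format. Valid only while wl does not
 * exceed the 53-bit mantissa of the host's double format. */
static double quantize(double x, int wl, int iwl)
{
    double lsb = ldexp(1.0, iwl - wl);       /* LSB weight: 2^(iwl - wl)     */
    double max = ldexp(1.0, iwl - 1) - lsb;  /* largest representable value  */
    double min = -ldexp(1.0, iwl - 1);       /* smallest representable value */
    double q   = floor(x / lsb) * lsb;       /* truncate onto the grid       */
    return q > max ? max : (q < min ? min : q);
}

A product would then be simulated as quantize(a * b, wl, iwl); compared to container-based generic data types, every operand stays in one hardware register, which is the source of the speedup, but the word length restriction mentioned above is also visible.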

Another means of speeding up fixed-point simulations is the use of a hardware accelerator, for example, an FPGA, to perform computationally expensive operations. The acceleration can be achieved either by utilizing configurable logic or by combining configurable logic with a processor. This approach has been described by De Coster [26]. The mapping of the algorithm to the different hardware units and the data transfer between the units make additional transformation steps necessary.

The work described in this article proposes a mapping of a fixed-point algorithm in SystemC to an integer-based ANSI C algorithm that directly addresses the built-in integer ALU of the host machine. An efficient mapping includes an embedding of all fixed-point operands into the host machine registers, a cast mode optimization, and many other aspects, and requires a detailed control and data flow analysis of the algorithm. Independently of the authors' work, De Coster [26] proposed a similar method, using DFL [27] as input language and directly targeting a Motorola DSP65000. Our work presented here represents a continuation of the research results published by Keding et al. [6] and Willems [14] and introduces improved concepts for the mapping process that result in a considerable simulation acceleration.

For the fast simulation back end we assume that fixed-point attributes are assigned to every operation. The back end also requires the information collected during the control and data flow analysis stored in the IR. After a number of IR refinements, an ANSI C representation of the algorithm using only integral data types can be derived from the IR. It is important to note that the transformation in the back end, in contrast to the float-to-fixed transformation in the IR, does not change the behavior of the algorithm. The fully quantized algorithm coded in SystemC and the integer-only ANSI C algorithm yield bit-by-bit identical results, making the fast simulation back end output ideally suited for fast bit-true simulation on a workstation or PC.

7 TRANSFORMATION TO ANSI C

7.1 The lbp alignment

For the embedding of a fixed-point operand specified by a triple (wl, iwl, sign) into a register of the host machine with the machine word length (mwl), the minimum requirement is

mwl ≥ wl = iwl + fwl. (9)

Figure 9: Embedding a 5-bit word into an 8-bit register (mwl: machine word length; wl: word length; iwl: integer word length; fwl: fractional word length; lbp: location of binary point; s: sign encoding).

Figure 9 illustrates different options for embedding an operand with a word length of 5 bits into a given mwl of 8. Obviously, for mwl > wl, a degree of freedom for choosing the location of the binary point (lbp) exists:

mwl − iwl ≥ lbp ≥ wl − iwl = fwl. (10)

Beside this degree of freedom, there are also a number of constraints for the selection of the lbp:

(i) Interface constraints. For interface elements, such as function parameters or global variables, the lbp must be defined identically for a function and all calls to this function. Otherwise, the data written to or read from these data elements will be misinterpreted.

(ii) Operation constraints. Each operation has an lbp syntax. This lbp syntax may include constraints on the lbp of the operand(s) of the operation and/or rules for the calculation of the lbp of the result. For example, the operands and the result of an addition must have the same lbp.

(iii) Control and data flow constraints. Generally, a read access to a
storage element must use the same lbp as the preceding write access to the storage element. This implies that if a write operation to a memory location occurs in alternative control-flow branches, the lbp must be at the same position in both write operations, as no run time information about the lbp is available in a following read operation. The same applies to ambiguous write operations to arrays and write operations via pointers.

7.1.1 The lbp alignment algorithm

The lbp alignment algorithm implemented in the fast simulation back end is designed to take advantage of the degree of freedom described by (10), while meeting the constraints specified above. Meeting these constraints and maintaining the consistency of the lbps require precise information about the control and data flow of the algorithm. To obtain this information, we used the data flow analysis method described in Section 5. The data flow information is represented basically as define-use (du) chains and use-define (ud) chains [17, 18], with additional and more accurate information about ambiguous control flow.

Initially, lbp = fwl is chosen for all operands; thus all operands are right aligned. In a first step we set the lbps of all interface elements according to the interface constraints. Then, in an iterative process, the data flow information is used to adjust the lbps by insertion of shift operations to meet the operation constraints and the control and data flow constraints. The algorithm terminates when all conditions are fulfilled and the lbps did not change during the last iteration.

The operation constraint lbp alignment algorithm basically consists of an iteration over all operations and an adjustment of the operand and result lbps according to the operation's lbp syntax. The control and data flow constraint lbp alignment algorithm searches, for all read accesses from a data element, the associated previous write accesses to the same data element, that is, it finds all defines for a use of a data element (ud-chains).
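As a minimal sketch of the operation constraint part, assuming the addition rule stated above (operands and result share one lbp), the iterative adjustment could look as follows; the types, the restriction to additions, and the shift insertion hook are assumptions made for the illustration only.

/* Hedged sketch of iterative lbp alignment for additions: the operand
 * with the smaller lbp is scaled up (a left shift would be inserted)
 * until all additions are consistent and no lbp changes anymore. */
typedef struct operand { int iwl, fwl, lbp; } operand_t;
typedef struct add_op  { operand_t *a, *b, *result; } add_op_t;

static int align_add(add_op_t *op)
{
    int lbp = op->a->lbp > op->b->lbp ? op->a->lbp : op->b->lbp;
    int changed = 0;

    if (op->a->lbp != lbp)      { op->a->lbp = lbp;      changed = 1; } /* insert left shift of a */
    if (op->b->lbp != lbp)      { op->b->lbp = lbp;      changed = 1; } /* insert left shift of b */
    if (op->result->lbp != lbp) { op->result->lbp = lbp; changed = 1; }
    return changed;
}

static void align_all(operand_t *vars, int n_vars, add_op_t *ops, int n_ops)
{
    for (int i = 0; i < n_vars; i++)        /* start right aligned           */
        vars[i].lbp = vars[i].fwl;

    int changed;                            /* iterate to a fixed point      */
    do {
        changed = 0;
        for (int i = 0; i < n_ops; i++)
            changed |= align_add(&ops[i]);
    } while (changed);
}

A complete implementation would additionally pin the lbps of interface elements first, propagate lbps along the ud-chains mentioned above, and respect the upper bound lbp ≤ mwl − iwl from (10).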

According to the control and data flow constraints, the lbps of operands linked by such ud-chains are set to the same value.

Finally, constants can be embedded in a way that minimizes the shift operations required when using the constant. Unlike Kum et al. [28], we do not use a shift operation minimizing approach here: using the degree of freedom in choosing a suited lbp (10) and the accurate data flow information, we found that there is not sufficient potential for this optimization to justify the effort.

7.2 Data type selection

The next step in the transformation process is the selection of suitable integral data types for fixed-point variables. The FRIDGE internal bit-true specification of the algorithm features arbitrary word lengths. With the SystemC back end this does not represent a problem, since the SystemC data types are generic and may be of any bit length required. With the fast simulation back end, on the other hand, we only have the limited pool of the built-in data types of the host machine, that is, integral data types like char, short, int, and long.

7.2.1 Basic constraints for any data element

A matching data type has to be chosen for every fixed-point variable. The minimum requirement for the chosen data type is that the operand can be embedded into the host machine data type with word length mwl at the correct location (see Figure 9 for illustration):

iwl + lbp ≤ mwl.

7.2.2 Structural constraints

Additionally, the requirements introduced by data structures that force each of their elements to be of the same data type have to be met. An example of this behavior are arrays. The target data type for the N elements of an array must fulfill the following condition:

max i=0,...,N−1 (iwl_array[i] + lbp_array[i]) ≤ mwl.

7.2.3 Semantical constraints

Another constraint becomes important if aliasing of data elements, for example, by pointers, occurs: a pointer may point to different data elements. For syntax and semantics reasons, all aliased data elements and the base type of the pointer must be identical [13]. This only causes a problem if data types are changed, as is done in fixed-point optimizations or in the floating-point to fixed-point transformation process described in Section 2: initially, most numerical data types are floating-point types, but after the transformation there are various different fixed-point data formats. Hence special care must be taken during the code generation process to ensure that the types are consistent. A detailed description of the data type selection algorithm used can be found in [29].
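For a single scalar, the selection can be as simple as the sketch below, which picks the smallest built-in type whose width satisfies the basic constraint above; the 8/16/32-bit widths are an assumption about the host machine, not a statement about the actual back end.

/* Hedged sketch: choose the smallest host integer type that can embed a
 * fixed-point operand with integer word length iwl at binary point lbp,
 * that is, the smallest type whose width is at least iwl + lbp. */
static const char *select_type(int iwl, int lbp)
{
    int needed = iwl + lbp;
    if (needed <= 8)  return "signed char";
    if (needed <= 16) return "short";
    if (needed <= 32) return "int";
    return 0;   /* no suitable built-in type under the assumed widths */
}

For an array, the same test would be applied to the maximum of iwl + lbp over all elements, as required by the structural constraint; aliased data elements would additionally be forced to one common type.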
7.3 Cast mode transformation

Cast operations can reduce or limit the word length at the MSB side of a word (overflow handling) or at the LSB side of a word (quantization handling). They are used either to prevent indeterministic behavior of fixed-point systems (in many cases, the ANSI C standard [13] does not specify the bit-true behavior of integral data types in case of overflow, quantization, and so forth) or to model a data path that is different from the host machine. This is often the case when algorithms for DSP systems are developed. Fixed-point libraries like the one in SystemC offer various generic overflow and quantization handling modes, which makes SystemC an efficient means of modeling fixed-point systems. For fast fixed-point simulation, on the other hand, the use of these generic casting modes is simply ruled out for performance reasons.

7.3.1 Overflow handling

Overflow handling is required if it is necessary to reduce the wl at the MSB side of the word or if the carry bit is set for the MSB. Examples of frequently used overflow handling modes in digital signal processing algorithms are wrap-around and saturation [30].

Saturation. In SystemC, a cast of an expression expr to a wl-bit tc data type with integer word length iwl applying saturation as overflow mode can be modeled as follows:

result = sc_fix(expr,wl,iwl,,sc_sat);

The fast simulation code generation, on the other hand, translates this into plain C code that first tests whether the range of the data type is exceeded, and if so sets the resulting value to the minimum or maximum of this type, which is

MAX wl,iwl,lbp,tc = 2^(iwl+lbp−1) − 2^(lbp−fwl), MIN wl,iwl,lbp,tc = −2^(iwl+lbp−1). (11)

Thus the fast simulation code construct generated is the following:

int tmp;
result = ((tmp = expr) > MAX) ? MAX : (tmp < MIN) ? MIN : tmp;

Introducing an additional temporary variable avoids multiple evaluations of expr. Note that for the code generation we also take the bit-true properties of the processor and compiler into account.
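Written out as a function, the saturating cast above could take the following form; the helper name and the use of long as the host word are assumptions for the sketch, whereas the generated fast simulation code uses the ternary construct shown above directly.

/* Hedged sketch of a saturating cast to a (wl, iwl) two's complement
 * format embedded at binary point lbp, using MAX and MIN from (11). */
static long sat_cast(long expr, int iwl, int fwl, int lbp)
{
    long max = (1L << (iwl + lbp - 1)) - (1L << (lbp - fwl));
    long min = -(1L << (iwl + lbp - 1));
    return expr > max ? max : (expr < min ? min : expr);
}

For a right-aligned operand (lbp = fwl), the bounds reduce to the familiar 2^(wl−1) − 1 and −2^(wl−1) of a wl-bit two's complement word.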

Wrap-around. The SystemC way of casting an expression expr to a wl-bit tc data type with integer word length iwl applying wrap-around as overflow mode is shown here:

result = sc_fix(expr,wl,iwl,,sc_wrap);

For the bit-true ANSI C equivalent of this operation, several options exist. An example of a code construct for wrap-around, assuming two's complement arithmetic and a machine word length of mwl, is

result = (expr << SHIFT) >> SHIFT;

The amount of shifts computes to SHIFT = mwl − iwl − lbp. The shift left eliminates the MSBs, whereas the arithmetic shift right provides a sign extension for the new MSB.

7.3.2 Quantization handling

If the word length of an operand is reduced at the LSB side, we can apply different quantization handling modes. The most frequently encountered are rounding and truncation.

Rounding. In SystemC, the method for casting an expression expr to a wl-bit two's complement data type with integer word length iwl applying rounding as quantization mode is

result = sc_fix(expr,wl,iwl,sc_rnd,);

Rounding is defined by adding DELTA = LSB/2 to the operand and eliminating the LSBs, for example, by shifting it right by SHIFT = lbp − fwl bits. Thus the rounding operation can be realized in the fast simulation code by

result = ((expr + DELTA) >> SHIFT) << SHIFT;

Truncation. The truncation operation, given in SystemC by

result = sc_fix(expr,wl,iwl,sc_trn,);

can be implemented efficiently by a bit mask operation,

result = expr & (~MASK);

where MASK is given by 2^(lbp−fwl) − 1.

For several combinations of cast modes, for example, wrap-around combined with rounding or truncation, more efficient joint quantization and overflow handling C code constructs are generated. The shift operations introduced by the cast code constructs are also utilized to adjust the lbp of the expression, eliminating the need for additional scaling shifts.

8 EXPERIMENTAL RESULTS

The code generated by the FRIDGE fast simulation back end has been benchmarked against the fixed-point simulation classes, which are part of the C++-based SystemC language. The simulation classes offer two simulation modes: a mode supporting unlimited fixed-point word lengths based on concatenated data containers and a mode supporting limited precision up to 53 bits based on float arithmetic and bit manipulations. The benchmarks have been performed on a SUN Ultra 10 workstation running Solaris using the GCC compiler version 2.95.2 with the -O3 option. The SystemC library version 1.0 was utilized for the bit-true simulations. The benchmark is based on typical signal processing kernels: FIR, a 17-tap FIR filter; DCT, an 8 × 8 JPEG DCT algorithm; Autocorr, a 25-element 5th-order autocorrelation; IIR, a 3rd-order IIR filter; FFT, a complex FFT of length 8; and Matrix, a 4 × 4 matrix multiplication.

Four different versions of the kernel functions have been benchmarked:

(i) Floating-point. The execution speed of the floating-point implementation of the algorithms serves as reference for the benchmarks.

(ii) SystemC. The quantized bit-true version of the algorithms utilizing the SystemC fixed-point data types. The algorithms have been quantized using the FRIDGE design environment.

(iii) SystemC limited precision. The quantized bit-true code has been compiled with the limited precision option to speed up SystemC fixed-point operations.

(iv) Fast simulation code. The fast fixed-point simulation code based on integral data types has been generated by the FRIDGE back end applying the transformation techniques described in the previous sections. The code yields bit-by-bit the same results as the code utilizing the SystemC data types.

The experimental results are presented in Table 1. As the floating-point code has been used as a reference, the experimental data has been scaled relative to the execution speed of the floating-point code.
The bit-true SystemC code consumes by a factor of 325 to 1103 more run time than the original floating-point code, making bit-true simulation a major bottleneck in the fixed-point design flow. Utilizing the limited precision mode of the SystemC library, a considerable further speedup can be achieved, but the fixed-point code is still significantly slower than the floating-point reference (see Table 1). The fast simulation code runs substantially faster than the SystemC fixed-point code utilizing the limited precision option, and faster still compared to the unlimited precision mode. Compared to the floating-point reference code, the fast simulation code is still slower. This is due to the host system's architecture and the additional shift and bit mask operations necessary to perform lbp alignment and cast operations to maintain bit-by-bit consistency with the quantized code. The quantized DCT algorithm contains many cast operations to reduce the fixed-point word lengths introduced by the quantization process. As these operations can be modeled efficiently by bit mask operations in the fast simulation code, the highest speedup was achieved for this kernel function.

9 DSP CODE GENERATION

During recent years, new architectural approaches for DSP processors have been made. The current generation of high performance DSP processors features a pipelined VLIW (very long instruction word) architecture, which offers a very high computing performance if a high degree of software pipelining in combination with instruction level parallelism is used. But programming these processors manually utilizing assembly language is a very tedious task. In awareness of this problem, the modern DSP architectures have been de-

veloped using a processor/compiler codesign methodology, which led to compiler-efficient processor designs.

Table 1: Relative execution speed (rows: FIR, DCT, Autocorr, IIR, FFT, Matrix; columns: floating-point ANSI C, SystemC, SystemC limited precision, fast simulation code).

On the other hand, a significant gap in the system design flow is still evident; there is no direct path from a floating-point system level simulation to an optimized fixed-point implementation. Today, a manual implementation on the DSP and target specific code optimization are necessary, increasing time-to-market and making design changes very tedious, error prone, and costly. Thus we have developed an optimizing FRIDGE back end to generate target optimized DSP C code. The target specific code generation is necessary for two reasons:

(i) The generic fixed-point data types used for fixed-point simulations are not suited for DSP implementation, as the currently available DSP compilers do not support C++ fixed-point data types. The upcoming generation of DSP compilers will support C++ language constructs, but compiling the fixed-point libraries for the DSP is no viable alternative, as the implementation of the generic data types makes extensive use of operator overloading, templates, and dynamic memory management. This will render fixed-point operations rather inefficient compared to integer arithmetic performed on a DSP.

(ii) Compiling the FRIDGE-generated integer ANSI C code on a DSP is also not sufficiently efficient, as the generic C code does not exploit the capabilities of the DSP hardware such as built-in saturation and rounding logic or SIMD processing.

As a case study, we have chosen the TMS320C62x processor and its C compiler as a target for the FRIDGE design environment. This enables a seamless design flow from floating-point to optimized C62x C code utilizing integral data types. Generating a C62x optimized version of a signal processing algorithm using a different set of fixed-point parameters becomes a matter of hours instead of days or weeks using the conventional manual techniques. The C62x integer code generated by the design environment yields bit-by-bit the same results as the fixed-point code utilizing the simulation classes on the host machine. Thus a comparative simulation against the golden reference model gives the designer a high degree of confidence in the generated code.

The first objective of our case study was to find out which C code constructs compile into efficient C62x assembly code. Thus we applied the DSPstone benchmarking methodology to the C62x optimizing C compiler. The DSPstone project [31], conducted in 1994 by ISS, Aachen University of Technology, established a benchmarking methodology for DSP compilers by comparing the performance of compiled C code to hand optimized assembly code in terms of program/data memory consumption and execution time. As a consequence, it allows one to identify a possible mismatch between architecture and compiler. The benchmarking has been done using eleven typical signal processing algorithms (FIR, FFT, DCT, minimum error search, etc.). The benchmarking gives quantitative results for cycle count and program memory consumption.

In a second step, we used C62x specific C language extensions (intrinsics) and compiler directives to restructure the off-the-shelf C code while maintaining functional equivalence to the original code. These optimizations led to a considerable improvement in performance in many cases as the compiler was able to utilize software pipelining and instruction
level parallelism to speed up the code It has turned out that software pipelining is the key to achieving a high performance but, on the other hand, requires careful analysis and code restructuring The evaluation [32] gave quantitative performance data for the C62x compiler and a set of code optimization techniques to generate efficient C62x C code In a third step, we benchmarked various implementations of the fixed-point quantization and overflow handling modes on the C62x This led to a set of optimized implementations for the quantization and overflow handling functionality 91 DSP code transformation The FRIDGE C62x back end performs similar transformation steps as the fast bit-true simulation code generation presented in Section 6: lbp alignment, cast mode transformation, and data type selection Additionally, target specific code optimization is performed The designer has to keep the special requirements of the DSP target in mind to reach a high level of efficiency Through our experiments we found that, for example, the number of cast statements and shift operations has a strong influence on the efficiency of the generated code Thus if the designer chooses settings for the global annotations and the default cast mode during the early stages of the transformation which do not represent the properties of the target architecture properly, the code optimization and the DSP compiler are not able to generate efficient assembly code The optimizations performed in the FRIDGE C62x back end are source level transformations to supply the C62x com-

piler with the best C code possible. The amount of analysis done in an optimizing compiler is usually limited due to constraints on the time used for compilation. In the FRIDGE design environment, control and data flow analysis is performed with the maximum possible accuracy utilizing the techniques presented in Section 5. The information gained during this analysis is available for the back end code transformation as well. Thus we are able to perform code restructuring techniques which are usually beyond the scope of an optimizing compiler.

9.1.1 The lbp alignment

As the TI C6000 processor family has an integer multiplication mode, the right alignment strategy of the lbp alignment algorithm can also be applied in the C62x back end. This algorithm implicitly minimizes the number of scaling shifts. In contrast to the fast bit-true simulation, the number of scaling shifts generated is important for the C62x code generation. For the fast simulation code generation we found the potential of shift minimization limited to a performance improvement of 3% to 13% [29]. This is different for the C62x code generation. As the C62x can perform two scaling shift operations per cycle, a shortage of functional units limits the performance in highly software pipelined loops. Thus shift poisoning of loops must be avoided, for example, by choosing suitable fixed-point data types for function parameters and central data structures.

9.1.2 Data type selection

As the properties of the data paths of the C62x processor and the width of the integral data types supported by the C62x C compiler are known, the design environment can utilize this information during the transformation process. A set of global annotations for the C62x guides the interpolation process, and a set of integral data types with a given bit length is supplied to the C62x back end.

9.1.3 Cast mode transformation

The generic overflow and quantization handling modes offered by SystemC have to be mapped to the target hardware in an efficient manner. The C62x offers built-in saturation hardware which can be used by the back end. This is illustrated by the following example.

Cast mode: saturation. A cast of an expression to a wl-bit two's complement data type with integer word length iwl applying saturation as overflow mode is modeled in SystemC as follows:

result = sc_fix(expr,wl,iwl,,sc_sat);

An implementation of this code construct in generic ANSI C is

int tmp;
result = ((tmp = expr) > MAX) ? MAX : (tmp < MIN) ? MIN : tmp;

On the C62x, the sshl intrinsic (saturating shift left) can be used to perform the saturation operation:

result = (signed)_sshl(expr,SHIFT) >> SHIFT;

where SHIFT is given by mwl − (iwl + lbp). Utilizing the built-in saturation hardware of the C62x via the sshl intrinsic allows the generation of code with linear control flow, in contrast to the forked control flow of the ANSI C implementation. This significantly speeds up the code.

9.1.4 Loop optimizations

The key to high execution speed on the C62x is software pipelining and instruction level parallelism. This is especially important for loops, where most of the execution time is spent for most digital signal processing algorithms. The latest version of the C62x C compiler is able to perform quite sophisticated loop optimizations to achieve high performance. This can be further improved by restructuring the loops at source level, applying techniques like loop unrolling, scalar expansion, and splitting data paths. By introducing SIMD (single instruction multiple data) intrinsics it is
possible to reduce the required number of load/store operations significantly The C62x back end utilizes the data- and control flow information and the code transformation infrastructure to identify possible loop optimizations and to perform the necessary loop restructuring The design environment maintains the consistency of generated code 10 EXPERIMENTAL RESULTS We have benchmarked the cycle count performance of the generated C62x integer C code using two sets of typical signal processing kernel functions: The first set consists of six off-the-shelf kernels which have been initially coded without DSP specific code optimization The second set of kernels has been extracted from TI s C6000 compiler benchmarking suite 101 Off-the-shelf kernels This set of kernels consists of six signal processing functions, which also have been used for the benchmarks in Section 8: FIR, DCT, Autocorr, IIR, Matrix, Dotprod The code has been translated using TI s C6x compiler version 40 [33] and the performance has been compared with three reference codes: (i) C67x floating-point C code The C67x floating-point DSP is code-compatible to the C62x and its C compiler is mostly identical to the C62x C compiler, thus the performance of the generated fixed-point C code can be compared to the original floating-point C code (ii) C62x floating-point emulation The floating-point emulation library which is part of the C62x compiler s run time library allows the user to perform floating-point arithmetic on the C62x processor The floating-point operations are executed as function calls (iii) C62x integer ANSI C code The FRIDGE back end allows the designer to generate ANSI C fixed-point code without C62x specific optimization This code can also be compiled and executed on the C62x processor The efficiency of the target specific code optimization can be benchmarked using this code

Table 2: Cycle count (rows: FIR, DCT, Autocorr, IIR, Matrix, Dotprod; columns: floating-point on the C67x, float emulation, generic ANSI C, and target specific C on the C62x).

Figure 10: Cycle count relative to floating-point code (C67x floating-point scaled to 100%; bars for the generic ANSI C and the optimized C62x code).

Table 2 presents the benchmarking results for the six kernel functions. Figure 10 illustrates the relative cycle count. As the C67x floating-point code has been used as a reference, it was scaled to 100%. For readability, the results of the floating-point emulation have been omitted in the bar graph.

As depicted in Table 2, the C62x floating-point software emulation has a cycle count which is by a factor of 97 to 103 higher than the cycle count of the same code compiled for the floating-point processor.

The generic ANSI C integer code without C62x specific language extensions is by a factor of 1.1 to 14.8 slower than the floating-point code. The integer code performs additional shift and bit-masking operations to ensure the bit-true behavior. Some of the cast operations cannot easily be modeled in generic ANSI C. Thus a significant overhead is introduced for kernel functions where many cast operations are inserted by the interpolation (e.g., the DCT). The performance can be improved by matching the generated code to the target architecture. For example, utilizing the sshl intrinsic is a convenient way to access the C62x saturation hardware directly. This reduces the overhead introduced by the additional shift and cast operations to a factor of 1.1 to 4.3 compared to the floating-point code.

For the floating-point code of the Dotprod kernel function, the compiler was able to generate efficient code using 95 cycles for 64 vector elements. For the fixed-point code, the additional operations needed for cast operations in the inner loop prevent the compiler from achieving similar efficiency. Removing all scaling shifts and overflow protection from the inner loop of the fixed-point code for this kernel yields a cycle count of 83. Introducing a single scaling shift in the inner loop brings the cycle count up to 147; adding overflow protection yields 406 cycles. Similar effects appear in the Matrix kernel benchmark.

10.2 TI compiler benchmarking kernels

This set of kernels consists of six signal processing functions: IIR, a 16-coefficient IIR filter; IIR cas biquads, 10 cascaded biquads; FIR, a 10-tap, 40-sample FIR filter; MAC VSELP, a MAC of two 40-sample vectors; VQ MSE, the MSE between two 256-element vectors; and VEC SUM, the vector sum of two 44-sample vectors. For these kernels, hand-optimized C62x assembly code and C62x integer C code are available on TI's website. It is noteworthy that neither the C code nor the assembly code was coded with overflow protection. For the embedding of input and output operands, implicit assumptions were made which reduced the number of scaling shifts in the kernel functions. Thus the hand-optimized C62x assembly code can serve as an upper bound for the efficiency of the FRIDGE C62x design flow.

We derived the floating-point code from the integer C code. The function interfaces in the floating-point code were manually annotated with fixed-point specifications to get hybrid code. The hybrid code was used as input to generate optimized C62x integer code from the FRIDGE C62x environment. The FRIDGE generated C62x code features full overflow protection and maintains
consistency for the location of binary point for input and output operands The code has been translated using TI s C6x compiler version 40 [33]and the performance has been compared to the reference codes: (i) C67x floating-point C code This is the floating-point code compiled for the C67x processor (ii) C62x hand-optimized integer C code This is the original hand-optimized code from the benchmarking suite

(iii) C62x hand-optimized assembly code. The hand-optimized assembly code served as a reference for the benchmarks.

Table 3: Cycle count (rows: IIR, IIR BIQUAD, FIR, MAC VSELP, VQ MSE, VEC SUM; columns: floating-point on the C67x, hand-optimized assembly, hand-optimized ANSI C, and FRIDGE generated code on the C62x).

Figure 11: Cycle count relative to floating-point code (C67x floating-point scaled to 100%; bars for hand-optimized assembly, hand-optimized C, and FRIDGE generated code).

Table 3 presents the benchmarking results for the six kernel functions. Figure 11 illustrates the relative cycle count. For consistency, the floating-point code has been used as a reference; it was scaled to 100%.

For these kernels, the C6x compiler was obviously able to generate very efficient code. For consistency, we have measured the cycle count including the function call. This causes the hand-optimized C code to be faster than the hand-optimized assembly code for some kernels. The floating-point code is slower than the hand-optimized assembly and C code in all cases, as the floating-point instructions need more execution stages than their integer counterparts.

For this set of kernel functions, the FRIDGE generated code consumes more cycles than the hand-optimized code, as additional shift and cast operations for overflow protection are performed. For some kernels, such as MAC VSELP and VEC SUM, this leads to a significant overhead, as the hand-optimized code uses the processor's functional units in a very efficient manner. Introducing additional shift and bit mask operations in the innermost loop slows down the code, as no unused functional units are available in the very tight loop pipelining schedule. Especially the .S unit, which performs shift operations, is heavily used and becomes the performance bottleneck. Nevertheless, the FRIDGE generated code comes very close in performance to the hand-optimized code while offering full overflow protection and maintaining consistency of input and output data formats.

54 924 EURASIP Journal on Applied Signal Processing 11 SUMMARY The FRIDGE design environment presented in this article allows the designer to concentrate on the critical issues of floating-point to fixed-point design flow Thus he is able to explore the design space more efficiently The interpolative transformation which is based on analytical range propagation enables an accelerated development cycle and in consequence a shorter time-to-market ThefastsimulationcodegenerationaswellastheDSP back end benefits directly from the advanced control and data flow analysis techniques we developed The concept of abstract execution, in combination with a state-driven memory model and coupled iterators, yields results with the precision necessary for the back end transformation steps The verification of the fixed-point algorithm has to be performed by means of simulation Existing C-based fixed-point libraries increase simulation-time by up to two orders of magnitude compared to the corresponding floating-point simulation The FRIDGE fast simulation back end applies advanced compile-time analysis concepts, analyzes necessary casting operations, and selects the appropriate built-in data type on the host machine, thus a speedup by a factor of 20 to 400 compared to the SystemC code while maintaining bit-by-bit equivalence was achieved The target specific C code generation provides a direct link from a floating-point code to C62x C code using integral data types The generated code yields bit-by-bit the same results as the bit-true SystemC code for host simulation, enabling comparative simulation to the reference model As proven by the experimental data, the generated C62x C code comes very close to hand-optimized C- and assembly code These features make FRIDGE a powerful design environment for the specification, evaluation, and implementation of fixed-point algorithms REFERENCES [1] Synopsys Inc, CoCentric System Studio User s Manual, Mountain View, Calif, USA [2] Mathworks Inc, Simulink Reference Manual, March 1996 [3] Cadence Design Systems, 919 E Hillsdale Blvd, SPW User s Manual, Foster City, Calif, USA [4] T Grötker, E Multhaup, and O Mauss, Evaluation of HW/SW tradeoffs using behavioral synthesis, in Proc Int Conf on Signal Processing Application and Technology,Boston, Mass, USA, October 1996 [5] B Liu, Effect of finite word length on the accuracy of digital filters a review, IEEE Trans on Circuit Theory, vol 18, no 6, pp , 1971 [6] H Keding, M Willems, M Coors, and H Meyr, FRIDGE: A fixed-point design and simulation environment, in Proc European Conference on Design, Automation and Test, pp , Paris, France, February 1998 [7] M Willems, V Bürsgens, and H Meyr, FRIDGE: Floatingpoint programming of fixed-point digital signal processors, in Proc Int Conf on Signal Processing Application and Technology, pp , San Diego, Calif, USA, September 1997 [8] M Willems, V Bürsgens, H Keding, T Grötker, and H Meyr, System level fixed-point design based on an interpolative approach, in Proc Design Automation Conference, pp , Anaheim, Calif, USA, June 1997 [9] S Kim, K Kum, and W Sung, Fixed-point optimization utility for C and C based digital signal processing programs, in Workshop on VLSI and Signal Processing 95, pp , Osaka, Japan, November 1995 [10] Frontier Design Inc, A RT Library User s and Reference Documentation, Danville, Calif, USA, 1998 [11] Synopsys Inc, CoWare Inc, Frontier Design Inc, SystemC User s Guide, Version 20, 2001 [12] W Sung and K Kum, Word-length determination and scaling software for a signal 
flow block diagram, in Proc IEEE Int Conf Acoustics, Speech, Signal Processing, pp , Adelaide, Australia, April 1994 [13] B W Kernighan and D M Ritchie, The C Programming Language, Prentice-Hall, Englewood Cliffs, NJ, USA, 2nd edition, 1988 [14] M Willems, A methodology for the efficient design of fixedpoint systems, PhD thesis, Aachen University of Technology, 1998 [15] Mentor Graphics, DSP Station User s Manual, San Jose, Calif, USA [16] C Hankin, Program analysis tools, International Journal on Software Tools for Technology Transfer, vol 2, no 1, pp 6 12, 1998 [17] A Aho, R Sethi, and J Ullman, Compilers, Principles, Techniques and Tools, Addison-Wesley, Reading, Mass, USA, 1986 [18] M J Wolfe, High Performance Compilers for Parallel Computing, Addison-Wesley, Redwood City, Calif, USA, 1996 [19] C Hankin, F Nielson, and H R Nielson, Principles of Program Analysis, Springer, Heidelberg, Germany, 1999 [20] F Martin, PAG an efficient program analyzer generator, International Journal on Software Tools for Technology Transfer, vol 2, no 1, pp 46 67, 1998 [21] MIPS Computer Systems, UMIPS-V Reference Manual (Pixie and Pixstats), Sunnyvale, Calif, USA, 1990 [22] T Ball and J R Larus, Optimally profiling and tracing programs, ACM Transactions on Programming Languages and Systems (TOPLAS), vol 16, no 4, pp , 1994 [23] S B Akers, Binary decision diagrams, IEEE Trans on Computers, vol 27, no 6, pp , 1978 [24] European Telecommunication Standard Institute, GSM full rate speech transcoding, GSM recommendation 0610, February 1992 [25] S Kim, K Kum, and W Sung, Fixed-point optimization utility for C and C based digital signal processing programs, IEEE Trans on Circuits and Systems II: Analog and Digital Signal Processing, vol 45, no 11, pp , 1998 [26] L De Coster, Bit-true simulation of digital signal processing applications, PhD thesis, KU Leuven, 1999 [27] Mentor Graphics, DSP Architect, DFL User s and Reference Manual, 1994 [28] K Kum, J Kang, and W Sung, A floating-point to integer C converter with shift reduction for fixed-point digital signal processors, in Proc IEEE Int Conf Acoustics, Speech, Signal Processing, vol 4, pp , Phoenix, Ariz, USA, March 1999 [29] H Keding, M Coors, O Lüthje, and H Meyr, Fast bit-true simulation, in Proc the Design Automation Conference, pp , Las Vegas, Nev, USA, June 2001 [30] S K Mitra, Digital Signal Processing: A Computer-Based Approach, McGraw-Hill, New York, NY, USA, 1998 [31] V Živojnović, J Martínez, C Schläger, and H Meyr, DSPstone: A DSP-oriented benchmarking methodology, in Proc International Conference on Signal Processing Applications and Technology, Dallas, Tex, USA, October 1994

55 Design and DSP Implementation of Fixed-Point Systems 925 [32] M Coors, O Wahlen, H Keding, O Lüthje, and H Meyr, C62x compiler benchmarking and performance coding techniques, in Proc International Conference on Signal Processing Applications and Technology, Orlando,Fla,USA,November 1999 [33] Texas Instruments, USA, TMS320C6000 Optimizing Compiler User s Guide, March 2000 Martin Coors received the diploma in electrical engineering from Aachen University of Technology (RWTH), Aachen, Germany In 1997, he joined the Institute for Integrated Signal Processing Systems (ISS) at RWTH Aachen as a research assistant His research interests include DSP code optimization techniques, fixed-point design methodologies and code generation for embedded processors Olaf Lüthje received the diploma in electrical engineering from Aachen University of Technology (RWTH), Aachen, Germany, and is currently working towards the PhD degree in electrical engineering at the same institute His research interests focus on fixed-point design methodology and data flow analysis Holger Keding received the diploma in electrical engineering from Aachen University of Technology (RWTH), Aachen, Germany From 1996 to 2001 he was with ISS to work towards his PhD thesis Having finished his PhD, he joined the system level design group of Synopsys as a senior corporate application engineer His research interests include fast bit-true simulation and fixed-point and system-level design methodology Heinrich Meyr received his MS and PhD from ETH Zurich, Switzerland He spent over 12 years in various research and management positions in industry before accepting a professorship in electrical engineering at Aachen University of Technology (RWTH Aachen) in 1977 He has worked extensively in the areas of communication theory, synchronization, and digital signal processing for the last thirty years His research has been applied to the design of many industrial products At RWTH Aachen he heads an institute involved in the analysis and design of complex signal processing systems for communication applications He was a cofounder of CADIS GmbH (acquired 1993 by Synopsys, Mountain View, California), a company which commercialized the tool suite COSSAP extensively worldwide used in industry He is a member of the Board of Directors of two companies in the communications industry Dr Meyr has published numerous IEEE papers He is author together with Dr G Ascheid of the book Synchronization in Digital Communications, Wiley 1990, and of the book Digital Communication Receivers He is also the author of Synchronization, Channel Estimation, and Signal Processing (together with Dr M Moeneclaey and Dr S Fechtel), Wiley, October 1997 He holds many patents He served as a Vice President for International Affairs of the IEEE Communications Society and is a Fellow of the IEEE

56 EURASIP Journal on Applied Signal Processing 2002:9, c 2002 Hindawi Publishing Corporation Partitioning and Scheduling DSP Applications with Maximal Memory Access Hiding Zhong Wang Department of Computer Science and Engineering, University of Notre Dame, Notre Dame, IN 46556, USA zwang1@csendedu Edwin Hsing-Mean Sha Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083, USA edsha@utdallasedu Yuke Wang Department of Computer Science, University of Texas at Dallas, Richardson, TX 75083, USA yuke@utdallasedu Received 2 September 2001 and in revised form 14 May 2002 This paper presents an iteration space partitioning scheme to reduce the CPU idle time due to the long memory access latency We take into consideration both the data accesses of intermediate and initial data An algorithm is proposed to find the largest overlap for initial data to reduce the entire memory traffic In order to efficiently hide the memory latency, another algorithm is developed to balance the ALU and memory schedules The experiments on DSP benchmarks show that the algorithms significantly outperform the known existing methods Keywords and phrases: loop pipelining, initial data, maximal overlap, balanced partition scheduling 1 INTRODUCTION The contemporary DSP and embedded systems always contain the memory hierarchy, which can be categorized as onchip and off-chip memories In general, the on-chip memory have a fast speed and restrictive size, while the off-chip memory have the much slower speed and larger size To do the CPU s computation, the data need to be loaded from the off-chip to on-chip memories Thus, the system performance will be degraded due to this long off-chip access latency How to tolerate the memory latency with memory hierarchy is becoming a more and more important problem [1] The onchip and off-chip memories are abstracted as the first and second level memories, respectively, in this paper Prefetching [1, 2, 3, 4, 5] is a technique to fetch the data from the memory in advance of the corresponding computations It can be used to hide the memory latency On the other hand, software pipelining [6] and modulo scheduling [7, 8] are the scheduling techniques used to explore the parallelism in the loop Both the prefetching and scheduling techniques can be used to accelerate the execution speed However, these traditional techniques have some weaknesses [9] such that they cannot efficiently solve the problem mentioned in the first paragraph This paper combines the software pipelining technique with the data prefetching approach Multiple memory units, attached to the first level memory, will perform operations to prefetch data from the second to the first level memories These memory units are in charge of preparing all data required by the computation in the first level memory in advance of computation Multiple ALU units exist in the processor for doing the computation The ALU schedule is optimized by using the software pipelining technique under the resource constraints The operations in the ALU units and memory units execute simultaneously Therefore, the long memory access latency is tolerated by overlapping the data fetching operations with the ALU operations Although using computation to hide the memory latency has been studied extensively before, trying to balance the computation and memory loading has never been researched thoroughly according to the authors knowledge This paper presents an approach to balance the ALU and memory schedules to achieve an optimal overall schedule 
length The data to be prefetched can be classified into two groups, the intermediate and initial data The intermediate data can serve as both left and right operands in the equa-

57 Partitioning and Scheduling DSP Applications with Maximal Memory Access Hiding 927 tions Their value will vary during the computation On the contrary, the initial data can only serve as right operands in the equations They will maintain their value during the computation Take the following equations as an example, the arrays B, C can be regarded as the intermediate data and A as the initial data B[i 1]= B[i] B[i 1] A[i], C[i 1]= B[i 1] A[i 1]A[i] The influence of both these two kinds of data should be deliberated in order to obtain an optimal overall schedule To take full use of the data locality, the entire iteration space can be divided into small blocks named partitions A lot of works have been done on the partitioning technique Loop tiling [10, 11] is a technique used to group basic computations so as to increase computation granularity and thereby reduce communication time Generally, they have no detailed schedule of ALU and memory operations as our method Moreover, only intermediate data are taken into consideration Agarwal and Kranz [12] make an extensive study of data partition They use an approximation method to find a good partition to minimize the data transfer among the different processors Affine reference index is considered in their work However, they mainly concentrate on the initial data and have few consideration on the intermediate data The approaches in [9, 13] are the few approaches to consider the detailed schedule under memory hierarchy Nevertheless, their memory references consider only the intermediate data, and ignore the initial data, which are an important influence factor of performance From the experimental results in Section 5, we can see that such deficiency will lead to an unbalanced schedule, which means a worse schedule In our approach, both the intermediate and initial data are considered For the intermediate data, we will restrict our study to nested loops with uniform data dependencies The study of uniform loop nests is justified by the fact that most general linear recurrence equations can be transformed into a uniform form This transformation (uniformization [14]) greatly reduces the complexity of the problem On the other hand, it is difficult to implement uniformization for the initial data Therefore, affine reference index is considered The concept footprint [12] is used to denote the initial data needed for the computation of ALU units in one partition Given a partition shape, this paper presents an algorithm to find a partition size which can give rise to the maximum overlap between the adjacent overall footprints such that the number of memory operations is reduced to the largest extent When considering the schedule of the loop, we propose the detailed ALU and memory schedules Each of the memory and ALU operations are assigned to an available hardware unit and time slot Therefore, it is very convenient to apply our technique to a compiler The memory schedule is balanced to the ALU schedule such that the overall schedule is close to the lower bound, which is determined by the ALU schedule Our method gives the algorithm to determine the (1) partition shape and size in order to achieve balanced ALU and memory schedules At last, the memory requirement of our technique for applications is also presented The new algorithm in this paper significantly exceeds the performance of existing algorithms [9, 13] due to the fact that it optimizes both ALU and memory schedules and considers the influence of initial data Taking the wave digital filter as an 
example, in a standard system with 4 ALU units and 4 memory units, assuming 3 initial data references exist in each iteration, our algorithm can obtain an average schedule length of 4018 CPU clock cycles, which is very close to the theoretic lower bound of 4 clock cycles The traditional list scheduling needs 22 clock cycles The hardware prefetching costs 10 clock cycles While the PSP algorithm in [13] can achieve some improvement, it still needs 8 clock cycles Without the memory constraint, the algorithm in [9]hasthe same performance, 8 clock cycles Our algorithm improves all the previous approaches It is worthwhile to mention that some works have been done on data layout technique [15, 16], which is used to maintain the cache coherency and reduce the conflict traffic Our work should be regarded as another different layer which can be built upon the layer of data layout to get a better performance The remainder of this paper is organized as follows Section 2 introduces the terms and basic concepts used in the paper Section 3 presents the theory on initial data Section 4 describes the algorithm to find the detailed schedule Section 5 contains the experimental result of comparison of this technique with a number of existing approaches We conclude in Section 6 2 BACKGROUND We can represent the operations in a loop by a multidimensional data flow graph (MDFG) [6] Each node in the MDFG represents a computation Each edge denotes the data dependence between two computations, with its weight as the distance vector The benefit of using MDFG instead of the general data dependence graph (DDG) or statement dependence graph (SDG) is that MDFG is the finer-grained description of data dependences Each node of MDFG corresponds to one ALU computation On the contrary, a node always corresponds to a statement in DDG or SDG, which will consume uncertain ALU computation time depending on the complexity of the statement It is more convenient to schedule the ALU operations with MDFG Moreover, lots of DSP applications, such as DSP filters, and so forth, can be directly mapped into MDFG [17] The execution of all nodes in an MDFG onetimeisan iteration It corresponds to executing the loop body for one time under a certain loop index Iterations are identified by a vector i, equivalent to a multidimensional index In this paper, we will always illustrate our ideas under two-dimensional loops It is not difficult to extend to loops with more than two dimensions by using the same idea presented in this paper

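A minimal data structure for such a graph could look like the C sketch below; the field names and the fixed array sizes are assumptions made for the illustration, and two-dimensional delay vectors are assumed to match the two-dimensional loops considered in this paper.

/* Hedged sketch of a multidimensional data flow graph (MDFG) for a
 * two-dimensional loop: nodes are ALU computations, edges carry the
 * data dependence together with its delay (distance) vector. */
#define MAX_NODES 64
#define MAX_EDGES 256

typedef struct { int x, y; } delay_t;     /* distance vector of an edge      */

typedef struct { int op; } mdfg_node_t;   /* one ALU computation             */

typedef struct {
    int     from, to;                     /* producer and consumer nodes     */
    delay_t delay;                        /* (0,0) means the same iteration  */
} mdfg_edge_t;

typedef struct {
    mdfg_node_t nodes[MAX_NODES];
    mdfg_edge_t edges[MAX_EDGES];
    int n_nodes, n_edges;
} mdfg_t;

An edge with delay (1, 0), for example, states that the consuming node uses a value produced one iteration earlier along the x direction; this is exactly the information the partitioning and scheduling steps below operate on.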
Figure 1: Architecture model with multiple function units and a memory hierarchy (ALU units ALU1 to ALU3 attached to a small, fast internal memory; memory units mem1 to mem3 connecting it to a large, slow external memory).

Figure 2: The overall schedule (control steps CS 1 to CS 17 with an ALU part and a memory part; the memory part holds prefetch and keep operations for initial and intermediate data).

2.1 Architecture model

The technique in our paper is designed for use in a system which has one or more processors. These processors share a common memory hierarchy, as shown in Figure 1. There are multiple ALU and memory units in the system. The access time of the first level memory is significantly less than that of the second level memory, as in current systems. During a program's execution, if one instruction requires data which are not in the first level memory, the processor will have to fetch the data from the second level memory, which costs much more time. Thus, prefetching data into the first level memory before their explicit use can minimize the overall execution time.

Two types of memory operations, prefetch and keep, are supported by the memory units. The prefetch operation prefetches data from the second level to the first level memory; the keep operation keeps data in the first level memory for the execution of one partition. Both of them are issued to guarantee that the data being referenced in the near future appear in the first level memory before their references. It is important to note that the first level memory in this model cannot be regarded as a pure cache, because we do not consider cache associativity. In other words, it can be thought of as a fully associative cache.

2.2 Partitioning the iteration space

Regular execution of nested loops proceeds in either a row-wise or column-wise manner until the boundary of the iteration space is reached. However, this mode of execution does not take full advantage of either the locality of reference or the available parallelism. The execution of such structures can be made more efficient by dividing the entire iteration space into regions called partitions that better exploit spatial locality.

Provided that the total iteration space is divided into partitions of iterations, the execution sequence will be determined by each partition. Assume that the partition in which the loop is executing is the current partition. Then the next partition is the partition adjacent on the right side of the current partition along the x-axis. The other partitions are all partitions except the above two partitions. Based on this classification, different memory operations will be assigned to different data in a partition. For a delay dependency that goes into the next partition, a keep memory operation is used to keep the data in the first level memory for one partition, since the data will be reused immediately in the next partition. Delay dependencies that go into other partitions result in the use of prefetch memory operations to fetch the data in advance.

A partition is determined by its partition shape and partition size. We use two basic vectors (in a basic vector, each element is an integer and all elements have no common factor except 1), P x and P y, to identify a parallelogram as the partition shape. These two basic vectors will be called partition vectors. Assume, without loss of generality, that the angle between P x and P y is less than 180° and that P x is clockwise of P y.
The partition size is determined by the vector S = (f_x, f_y), where f_x and f_y are the multiples of the partition size over the partition vectors P_x and P_y, respectively. Thus, the partition is delimited by the two vectors f_x · P_x and f_y · P_y. How to find the optimal partition size will be discussed in Section 4.

Due to the dependencies between the iterations, P_x and P_y cannot be chosen arbitrarily. The following property gives the condition of a legal partition shape [9]. Here the cross product p_1 × p_2 is defined as the signed area of the parallelogram formed by the points (0, 0), p_1, p_2, and p_1 + p_2 = (x_1 + x_2, y_1 + y_2); it is computed as p_1 × p_2 = p_1x · p_2y − p_1y · p_2x.

Property 1. A pair of partition vectors that satisfies the following constraints is legal: for each delay vector d_e, the cross-product relations d_e × P_x ≤ 0 and d_e × P_y ≥ 0 hold.

Because nested loops should follow the lexicographical order, we can choose (1, 0) as our P_x vector and use the normalized leftmost vector of all delay dependencies as our P_y. The partition shape is decided by these two vectors.

An overall schedule consists of two parts, an ALU part and a memory part, as seen in Figure 2.
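As a concrete illustration of Property 1, the sketch below computes the cross products of a set of delay vectors against a candidate pair of partition vectors. The inequality directions follow the reconstruction given above (the transcription lost the relation signs), and the delay vectors are hypothetical; P_y is taken as the normalized leftmost delay vector, as suggested in the text.

```python
from math import atan2, gcd

def cross(p, q):
    """Signed area of the parallelogram spanned by p and q: p x q = p_x*q_y - p_y*q_x."""
    return p[0] * q[1] - p[1] * q[0]

def leftmost(delays):
    """Normalized leftmost (most counter-clockwise) delay vector, used as P_y."""
    vx, vy = max(delays, key=lambda d: atan2(d[1], d[0]))
    g = gcd(abs(vx), abs(vy)) or 1
    return (vx // g, vy // g)

def legal_partition(delays, px, py):
    """Check Property 1 under the sign convention assumed above."""
    return all(cross(d, px) <= 0 and cross(d, py) >= 0 for d in delays)

if __name__ == "__main__":
    delays = [(1, 0), (1, 1), (-3, 1)]   # hypothetical delay vectors of an MDFG
    px = (1, 0)
    py = leftmost(delays)                # (-3, 1) for this example
    print(py, legal_partition(delays, px, py))   # (-3, 1) True
```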

The ALU part schedules the ALU computation. We know that the computation in a loop can be represented by an MDFG; the ALU part is a schedule of these MDFG nodes. The memory part schedules the memory operations prefetch and keep, so that the data for the computation can always be found in the first level memory.

3 THE THEORY ABOUT INITIAL DATA

The overall footprint of one partition consists of all the initial data needed by the computation of that partition. Provided the execution follows the partition sequence, the initial data needed by the current partition have already been prefetched into the first level memory during the previous partition. Likewise, the initial data needed by the next partition are prefetched by the memory units during the execution of the current partition. Data in the overlap between the overall footprints of the current and next partitions are already in the first level memory, so their prefetch operations can be spared. Thus, the major concern for the initial data is how to maximize the overlap between the overall footprints of two consecutively executed partitions in order to reduce the memory traffic.

As mentioned in Section 1, we consider affine references for the initial data. Given a loop index vector i, an affine reference index can be expressed as g(i) = i·G + a, where G = [G_1 G_2] is a 2 × 2 matrix and a is the offset vector. The footprint with respect to a reference A[g_1(i)] is the set of all data elements A[g_1(i)] of A, for i an element of the partition. The overall footprint is the union of the footprints with respect to all different references. For example, in Figure 3 the partition is a rectangle of size 3 × 4, the initial data references are B(i + j, i + j) and B(2i + j, i + 2j), their corresponding footprints are the integer points marked in the figure, and the overall footprint is the union of these two footprints.

[Figure 3: The footprint. A 3 × 4 rectangular partition and the footprints of the two references B(i + j, i + j) and B(2i + j, i + 2j).]

In [12], Agarwal et al. present the concept of uniformly generated references: two references A[g_1(i)] and A[g_2(i)] are said to be uniformly generated if

g_1(i) = i·G + a_1,   g_2(i) = i·G + a_2.   (2)

If two references B_1 and B_2 are not uniformly generated, the overlap between the footprint with respect to B_1 of the current partition and that with respect to B_2 of the next partition can be ignored, because this overlap, if it exists, diminishes rapidly. Therefore, we need only consider the overlap between footprints with respect to uniformly generated references of two consecutive partitions. Moreover, the offset vector a should satisfy a = m·G_1 + n·G_2, where m and n are integer constants; otherwise, no overlap between the footprints of consecutive partitions exists even for uniformly generated references.

The memory requirement should be taken into account when trying to maximize the overlap. The partition size cannot be enlarged arbitrarily only for the sake of increasing the overlap: a larger partition means a larger overall footprint, that is, much more memory space is consumed. Therefore, given a partition shape and a set of uniformly generated references, we derive conditions on the partition size which should be met to achieve a reasonable maximal overlap. For convenience of description, we introduce the following notation.

Definition 1. (1) Assuming the partition size is S, f(a, S) is the footprint with respect to the reference with offset vector a of the current partition, and f′(a, S)
is the footprint with respect to the reference with offset a of the next partition. (2) Given a set of uniformly generated references, the set R = {a_1, a_2, ..., a_n} is the set of their offset vectors (the elements of R are listed in lexicographically increasing order). Assuming the partition size is S, F(R, S) is the overall footprint of the current partition and F′(R, S) is the overall footprint of the next partition.

The one-dimensional case can be regarded as a simplification of the two-dimensional problem in which f_y is always set to zero; it provides the theoretic foundation for the two-dimensional problem. In the one-dimensional case, a partition reduces to a line segment and all vectors reduce to integers, so the partition size can be thought of as the length of the line segment. We use an example to demonstrate the problem we are tackling. In Figure 4, there are three different offset vectors: 1, 2, and 7. The solid lines represent the overall footprint of the current partition, and the dotted lines denote that of the next partition. We need to find the condition on the partition size, that is, on the length of the line segment, that achieves a maximal overlap. The figure shows the case where the length equals 5, which is the minimum length that obtains the maximum overlap between the overall footprints.
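Before the formal lemmas on the next page, here is a small numerical sketch (ours, not from the paper) of the one-dimensional problem just described: footprints are modeled as integer intervals of length S placed at each offset, the next partition is shifted by S, and the overlap of the two overall footprints is counted. Sweeping S shows that the overlap grows until S reaches the largest gap between adjacent offsets, which is 5 for the offsets {1, 2, 7} above, and stays constant afterwards.

```python
def footprint(offset, S, shift=0):
    """Integer points covered by a reference with the given offset in a
    1-D partition of size S; the next partition is shifted by S."""
    return set(range(offset + shift, offset + shift + S))

def overlap(offsets, S):
    """Overlap between the overall footprints of the current and next partition."""
    cur = set().union(*(footprint(a, S) for a in offsets))
    nxt = set().union(*(footprint(a, S, shift=S) for a in offsets))
    return len(cur & nxt)

if __name__ == "__main__":
    offsets = [1, 2, 7]
    for S in range(1, 9):
        print(S, overlap(offsets, S))
    # Output: 1, 1, 2, 4, 6, 6, 6, 6 -- the overlap stops growing at S = 5,
    # the largest difference between adjacent offsets.
```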

In order to derive the theorem on the minimum value of S which generates the maximum overlap, we first need the following lemmas. They consider the overlap of two footprints of consecutive partitions, as shown in Figure 5: the solid line is a footprint of the current partition and the dotted line a footprint of the next partition.

[Figure 4: One-dimensional line segments.]

[Figure 5: Two different relations between a_1 and a_2: (a) Case 1, a_1 + S ≤ a_2; (b) Case 2, a_1 + S > a_2.]

Lemma 1. The minimum S which makes the intersection between f′(a_1, S) and f(a_2, S) maximum, where a_2 ≥ a_1, is S = a_2 − a_1.

Proof. According to the relation between a_1 + S and a_2, there are two different cases.

Case 1. As shown in Figure 5a, a_1 + S ≤ a_2, that is, S ≤ a_2 − a_1. The intersection is (a_2, a_1 + 2S − 1); it reaches its maximum size a_2 − a_1 when S = a_2 − a_1.

Case 2. As shown in Figure 5b, a_1 + S > a_2, that is, S > a_2 − a_1. The intersection of the two segments is (a_1 + S, a_2 + S − 1), which does not depend on S. This means the size of the intersection does not increase further as S grows.

Lemma 2. The intersection between f′(a_1, S) and f(a_2, S), where a_2 ≥ a_1, stays constant, irrespective of the value of S, as long as S ≥ a_2 − a_1.

According to Definition 1, F(R, S) and F′(R, S) can be expressed as

F(R, S) = f(a_1, S) ∪ f(a_2, S) ∪ ... ∪ f(a_n, S),
F′(R, S) = f′(a_1, S) ∪ f′(a_2, S) ∪ ... ∪ f′(a_n, S).   (3)

The following lemma gives the expression of their intersection.

Lemma 3. Let C_m be the intersection f(a_m, S) ∩ f′(a_{m−1}, S). Then the intersection of F(R, S) and F′(R, S) is ∪_{m=2}^{n} C_m, where n is the number of integers in R.

Proof. Let A_m denote f(a_m, S) and B_m denote f′(a_m, S).

Basis step. Let n = 2. Then F(R, S) = A_1 ∪ A_2 and F′(R, S) = B_1 ∪ B_2. The ending point of A_1 is less than the starting points of B_1 and B_2, and the starting point of B_2 is greater than the ending points of A_1 and A_2. Thus, the only possible intersection is A_2 ∩ B_1.

Induction hypothesis. Assume that, for some n ≥ 2, F(R, S) ∩ F′(R, S) = ∪_{m=2}^{n} C_m.

Induction step. For n + 1, the added intersection is A_{n+1} ∩ (B_1 ∪ B_2 ∪ ... ∪ B_n). There are two different cases. (1) a_{n+1} ≥ a_n + S: then A_{n+1} can only intersect with B_n. (2) a_{n+1} < a_n + S: then A_{n+1} can be divided into two parts, A′ = (a_{n+1}, a_n + S) and A″ = (a_n + S, a_{n+1} + S − 1), and

A_{n+1} ∩ (B_1 ∪ B_2 ∪ ... ∪ B_n) = [A′ ∩ (B_1 ∪ ... ∪ B_n)] ∪ [A″ ∩ (B_1 ∪ ... ∪ B_n)] ⊆ ∪_{m=2}^{n} C_m ∪ (A_{n+1} ∩ B_n) = ∪_{m=2}^{n+1} C_m.   (4)

Therefore, F(R, S) ∩ F′(R, S) = ∪_{m=2}^{n+1} C_m.

Theorem 1. Given the set R = (a_1, a_2, a_3, ..., a_n), the maximum intersection between F(R, S) and F′(R, S) is achieved when S = max_{m=2}^{n} (a_m − a_{m−1}).

Proof. Consider two adjacent intersections C_m = A_m ∩ B_{m−1} and C_{m−1} = A_{m−1} ∩ B_{m−2}. There is no common element between B_{m−1} and A_{m−1}, and hence none between C_m and C_{m−1}. According to Lemmas 1 and 2, any value S ≥ a_m − a_{m−1} makes the segment C_m largest. Moreover, the segments C_m do not intersect each other. Therefore, the theorem is correct.

From Theorem 1 and Lemma 2, we can directly derive the following theorem.

Theorem 2. For the overall footprints F(R, S) and F′(R, S), their overlap stays constant if the value of S is increased beyond the value obtained by Theorem 1.

To maximize the overlap between F(R, S) and F′(R, S) in the two-dimensional space, we find that the f_y element of the partition size is not as important as the f_x element, since the intersection always increases when f_y is enlarged. We will determine the value of f_y based on other conditions. Therefore, the key question is: what is the minimum value of f_x that makes the intersection maximum, given a certain f_y?

Next, we discuss the situation in which G is a two-dimensional identity matrix. If G is not an identity matrix, the same idea

can be applied as long as a = m·G_1 + n·G_2; the only difference is that the original X-Y space is transformed into a new space by the matrix G.

[Figure 6: The stripe division of a footprint.]

An augmented set R′ can be obtained from the set R and a given partition size S as follows: a′_i = a_i and a′_{i+n} = a_i + (0, f_y·P_yy), where n is the size of the set R and P_y = (P_yx, P_yy). Arranging all the points of R′ in increasing order of their y element, the overall footprint of one partition can be divided into a series of stripes; each stripe is bounded by the two horizontal lines passing through two adjacent points of the sorted R′. For instance, in Figure 6 the set R is {(0, 0), (6, 1), (3, 2), (1, 3)}. Assume the value of f_y·P_yy is 5; then the augmented set R′ is {(0, 0), (0, 5), (6, 1), (6, 6), (3, 2), (3, 7), (1, 3), (1, 8)}. After sorting, it becomes {(0, 0), (6, 1), (3, 2), (1, 3), (0, 5), (6, 6), (3, 7), (1, 8)}, and the overall footprint consists of 7 stripes, as indicated in Figure 6. In each stripe, a horizontal line intersects the left bounds of some footprints f(a, S); thus, the two-dimensional intersection problem within this stripe reduces to the one-dimensional problem, which can be solved using Theorem 1. Applying this idea to each stripe solves the two-dimensional overlap problem, as demonstrated in Algorithm 1. The algorithm is clearly a polynomial-time algorithm, with time complexity O(n²).

Algorithm 1: Calculating the minimum f_x that makes the overlap maximum.
Input: the set R and the shape of the partition.
Output: the f_x that makes the overlap maximum under a certain f_y.
(1) Set f_x to 0.
(2) Based on the set R and the partition shape, choose an f_y such that the product f_y·P_yy is larger than the difference between the largest and smallest y element of all vectors in the set R.
(3) Using the f_y above, generate the augmented set R′.
(4) Sort all the vectors in R′ in increasing order of their y element and keep them in an event list.
(5) Use a horizontal line to sweep the whole iteration space. When an event point is met, insert the corresponding footprint f(a, S) into a visiting list if the event point is the lower bound of that footprint; otherwise delete the corresponding f(a, S) from the list.
(6) Calculate the intersection points of this line with the left bound and right bound of each footprint in the visiting list. Use Theorem 1 to derive an f′_x value that makes the intersection in the current stripe maximal.
(7) Replace f_x with f′_x if f′_x > f_x.

From Lemma 2, the intersection stays constant if f_x is greater than the value chosen by this algorithm, and is reduced for smaller f_x. We demonstrate this behavior with two examples. The set R for the first example is {(0, 1), (5, 3), (−3, 1), (4, 1), (−2, 2)} and the partition shape is (1, 0), (0, 1); this is the partition shape for the two-dimensional filter. The set R for the second example is {(0, 2), (3, 5), (1, 3), (−1, 1)} and the partition shape is (1, 0), (−3, 1); this is the partition shape for the wave digital filter. Figures 7a and 7b show how the footprint intersection varies with f_x and f_y for the two examples, respectively.

4 THE OVERALL SCHEDULE

The overall schedule can be divided into two parts, the ALU and memory schedules. For the ALU schedule, the multidimensional rotation scheduling algorithm [6] is used to generate a static schedule for one iteration.
The entire ALU schedule can then be formed by simply replicating this schedule for each iteration in the partition. The schedule obtained in this way is the most compact schedule, since it only considers the ALU hardware resource constraints; the overall schedule length must be at least as long. Thus, this ALU schedule provides a lower bound for the overall schedule length. This lower bound can be calculated as len_iteration × #nodes, where len_iteration is the schedule length obtained by the multidimensional rotation scheduling algorithm for one iteration, and #nodes is the number of iteration nodes in one partition. Our objective is to find a partition whose overall schedule length is very close to this lower bound.

4.1 Balanced overall schedule

Different from the ALU schedule, the memory schedule is considered as a whole for the entire partition. It consists of two parts: memory operations for initial data and for intermediate data. Each part consists of the prefetch and keep operations for the corresponding data. Because the prefetch operations have no relation to the current computation, they can be arranged from the beginning of the memory schedule; by contrast, the keep operation for an intermediate datum can only be issued after the corresponding computation has finished. The keep operations for initial data can be issued as soon as the data have been prefetched. The memory schedule length is the sum of these two parts' schedule lengths.

[Figure 7: The tendency of the intersection with f_x and f_y: (a) the 2D filter, plotted for f_y = 3, 5, 6, 8; (b) the WDF, plotted for f_y = 3, 5, 7.]

For the intermediate data, the number of prefetch and keep operations can be calculated as in [13]. The initial data can be prefetched in blocks; a block prefetch fetches several data at one time and costs only slightly more time than a general prefetch operation. To calculate the number of such operations, we first make the following observation.

Property 2. As long as f_y·P_y·G_2, the projection of the footprint size along the direction G_2, is larger than the maximum difference of a·G_2 over all a belonging to a uniformly generated offset vector set, the overall footprint increases at a constant rate with the increment of f_y, and so does the number of prefetch operations for initial data.

Note that the requirement in the above property guarantees that the partition is large enough that the footprint with respect to an offset vector intersects the footprints with respect to all other offset vectors belonging to the same uniformly generated set. Suppose a two-dimensional vector is written as a = (a_x, a_y). Given a certain f_x, the number of prefetch operations for initial data, for any f_y which satisfies the condition in the above property, is Pre_Base_ini + (f_y − f_y0)·Pre_incr_ini, where f_y0 = y_0/((P_y·G)_y), y_0 is the maximum difference of (a·G)_y over all offset vectors, Pre_Base_ini denotes the number of such operations for a partition of size f_x × f_y0, and Pre_incr_ini is the increase in the number of prefetch operations when f_y is increased by one. The keep operations for the initial data can be issued after the data have been prefetched; the number of such keep operations is Keep_Base_ini + (f_y − f_y0)·Keep_incr_ini, where y_0 and f_y0 have the same meaning as above, Keep_Base_ini denotes the number of keep operations for a partition of size f_x × f_y0, and Keep_incr_ini is the increase in the number of keep operations when f_y is increased by one.

In order to understand what a good partition size is, we first need the definition of a balanced overall schedule, which also gives the balanced overall schedule requirement.

Definition 2. A balanced overall schedule is a schedule for which the memory schedule is at most one unit time of a keep operation longer than the ALU schedule.

To reduce the computation complexity and simplify the analysis, we add a restriction on the partition size: the partition must be large enough that no data dependence spans more than two partitions. (1) No delay dependency spans more than two partitions along the y coordinate direction, that is, f_y·P_yy ≥ d_y for all d = (d_x, d_y) ∈ D. (2) No delay dependency spans more than two partitions along the x coordinate direction, that is, f_x > max{d_x − d_y·(P_yx/P_yy)}.

As long as these constraints on the minimal partition size are satisfied, the lengths of the prefetch and keep parts for intermediate data in the memory schedule increase more slowly than the ALU schedule length when the partition size is enlarged. If, under these conditions, a partition size cannot be found that meets the balanced overall schedule requirement, it means the length of the block prefetch part for initial data increases too fast. Due to the nature of block prefetch, increasing f_x increases the number of block prefetches only by a small number, while it increases the ALU part by a relatively large length. Therefore, a partition size which satisfies the balanced overall schedule requirement can always be found.
Algorithm 2 determines the partition size that obtains a balanced overall schedule. After the optimal partition size is determined, the operations in the ALU and memory schedules can be easily arranged: the ALU part is the duplication of the schedule for one iteration, and in the memory part the memory operations for initial data are allocated first, followed by the memory operations for intermediate data, as discussed above.

The memory requirement for a partition consists of four parts: the memory required for the calculation of in-partition data, the memory for prefetch operations on intermediate data, the memory for keep operations on intermediate data, and the memory for the operations on initial data. The memory consumption for in-partition data can be calculated as in [9].

The other memory requirements can be computed simply by multiplying the number of operations by the memory requirement of each operation. The memory requirement for a prefetch operation is 2: one location stores the data prefetched by the previous partition and consumed in the current partition, the other stores the data prefetched by the current partition and consumed in the next partition. By the same rule, a keep operation also takes 2 memory locations, and a block prefetch operation takes 2 × block_size memory locations.

Algorithm 2: Find a balanced overall schedule.
Input: the ALU schedule for one iteration, the partition shape P_x, P_y, and the initial data offset vector set R.
Output: a partition size which generates a balanced overall schedule.
(1) Based on the information on initial data, use Algorithm 1 to calculate the minimum partition size f_x and f_y.
(2) Using the two conditions on partition size given above, calculate another pair of minimum values f′_x and f′_y.
(3) Form the new pair f_x = max(f_x, f′_x) and f_y = max(f_y, f′_y).
(4) Using this pair (f_x, f_y), calculate the number of prefetch operations, block prefetch operations, and keep operations.
(5) Calculate the ALU schedule length and check whether the balanced overall schedule requirement is satisfied.
(6) If it is satisfied, this pair (f_x, f_y) is the partition size. Otherwise, increase f_x by one and use the balanced overall schedule requirement to find the minimum f_y; if such an f_y does not exist, continue increasing f_x until a feasible f_y is found, and use them as the partition size.
(7) Based on the partition size, output the corresponding ALU part schedule and memory part schedule.

5 EXPERIMENT

In this section, we use several DSP benchmarks to illustrate the effectiveness of our new algorithm. They are WDF, IIR, DPCM, 2D, and Floyd, as indicated in Tables 1 and 2, which stand for the wave digital filter, infinite impulse response filter, differential pulse-code modulation device, two-dimensional filter, and Floyd-Steinberg algorithm, respectively. These are DSP filters in common use in real DSP applications.

[Table 1: Experimental results with only one initial data reference. For each benchmark (WDF, IIR, DPCM, 2D, Floyd) the table lists the partition vectors P_x and P_y, the partition size, memory requirement m_r, and schedule length len for the new algorithm and for the partition algorithm, and the schedule length len and improvement ratio for list scheduling and hardware prefetching.]

We applied five different algorithms to these benchmarks: list scheduling, the hardware prefetching scheme, the partitioning algorithms in [9, 13], and our new partition algorithm (since it has been shown in [9] that the loop tiling technique cannot outperform partitioning algorithms, we do not compare against loop tiling in this section). In list scheduling, the same architecture model is used, but the ALU part uses the traditional list scheduling algorithm and the iteration space is not partitioned. In hardware prefetching scheduling, we use the model presented in [18], in which, whenever a block is accessed, the next block is also loaded. The partitioning algorithms in [9, 13] assume the same architecture model as ours; they partition the iteration space and execute the entire loop along the partition sequence, but they do not take into account the influence of the initial data. In the experiment, we assume that an ALU computation and a keep operation each take one clock cycle, a prefetch takes 10 CPU clock cycles, and a block prefetch takes 16 CPU clock cycles, which is reasonable given the large performance gap between the CPU and the main memory.
Table 1 presents results with only one initial data reference, with offset vector (1, 1); Table 2 presents results with three initial data references, with offset vector set {(1, 1), (2, 2), (0, 3)}. Note that all three of these initial data references are uniformly generated. From the discussion in Section 4, the overall footprint is then simply the sum of the footprints with respect to the different uniformly generated reference sets. In Tables 1 and 2, the par vector column determines the partition shape. The list column gives the schedule length for list scheduling and the improvement ratio our algorithm obtains compared to list scheduling; the hardware column gives the schedule length for hardware prefetching and our algorithm's relative improvement ratio. Since the algorithm in [13] gives the same result as the algorithm in [9] when there is no memory size constraint, we merge their results into one column, partition algo. In the partition algo and new algo columns, the size column is the size of the partition expressed as multiples of the partition vectors, the m_r column is the corresponding memory requirement, and the len column is the average schedule length of the corresponding algorithm. The ratio column is the improvement our new algorithm achieves relative to the corresponding algorithm.

List scheduling and hardware prefetching schedule the operations iteration by iteration, which results in a much longer memory schedule.

[Table 2: Experimental results with three initial data references. The layout is the same as Table 1; the benchmarks and partition vectors (P_x, P_y) are WDF (1, 0), (−3, 1); IIR (1, 0), (−2, 1); DPCM (1, 0), (−2, 1); 2D (1, 0), (0, 1); Floyd (1, 0), (−3, 1).]

It is this dominant memory schedule that leads to an overall schedule far away from the balanced schedule; many ALU resources are wasted waiting for data. Their much worse performance compared with the partitioning technique can be seen from the tables. Although the traditional partitioning algorithms consider the balance of the ALU and memory schedules for intermediate data, they lack consideration of the initial data. The time consumed in loading the initial data is a rather significant factor for one partition, and the lack of such consideration results in an unbalanced overall schedule in which the memory latency cannot be efficiently hidden. This is why the traditional partitioning algorithms perform worse than our new algorithm, and it also explains why their performance becomes worse as the number of initial data references increases. Our new algorithm considers both data locality and the initial data; therefore, much better performance is achieved by balancing the ALU and memory schedules.

6 CONCLUSION

In this paper, a new scheme that obtains a minimal average schedule length while taking initial data into consideration was proposed. The theory and an algorithm concerning initial data were presented. The algorithm exploits the ILP among instructions by using software pipelining techniques and combines it with data prefetching to produce high-throughput schedules. Experiments on DSP benchmarks show that our scheme always produces a better average schedule length than existing methods.

REFERENCES

[1] T. Mowry, Tolerating latency in multiprocessors through compiler-inserted prefetching, ACM Trans. Computer Systems, vol. 16, no. 1, pp. 55–92, 1998.
[2] T.-F. Chen, Data prefetching for high-performance processors, Ph.D. thesis, Dept. of Comp. Sci. and Engr., University of Washington, Wash, USA.
[3] F. Dahlgren and M. Dubois, Sequential hardware prefetching in shared-memory multiprocessors, IEEE Trans. on Parallel and Distributed Systems, vol. 6, no. 7, pp. , 1995.
[4] N. Manjikian, Combining loop fusion with prefetching on shared-memory multiprocessors, in Proc. International Conference on Parallel Processing, pp. 78–82, Bloomingdale, Ill, USA, August 1997.
[5] M. K. Tcheun, H. Yoon, and S. R. Maeng, An adaptive sequential prefetching scheme in shared-memory multiprocessors, in Proc. International Conference on Parallel Processing, pp. , Bloomington, Ill, USA, August 1997.
[6] N. Passos and E. H.-M. Sha, Scheduling of uniform multidimensional systems under resource constraints, IEEE Trans. on VLSI Systems, vol. 6, no. 4, pp. , 1998.
[7] W. Mangione-Smith, S. G. Abraham, and E. S. Davidson, Register requirements of pipelined processors, in Proc. International Conference on Supercomputing, pp. , Washington, DC, USA, July 1992.
[8] B. R. Rau, Iterative modulo scheduling: an algorithm for software pipelining loops, in Proc. 27th Annual International Symposium on Microarchitecture, pp. 63–74, San Jose, Calif, USA, November 1994.
[9] Z. Wang, T. W. O'Neil, and E. H.-M. Sha, Minimizing average schedule length under memory constraints by optimal partitioning and prefetching, Journal of VLSI Signal Processing, vol. 27, no. 3, pp. , 2001.
[10] P. Boulet, A. Darte, T. Risset, and Y. Robert, (Pen)-ultimate tiling?, in Scalable High-Performance Computing Conference, pp. , Knoxville, Tenn, USA, May 1994.
[11] J. Chame and S. Moon, A tile selection algorithm for data locality and cache interference, in Proc. 13th ACM International Conference on Supercomputing, pp. , Rhodes, Greece, June 1999.
[12] A. Agarwal, D. A. Kranz, and V. Natarajan, Automatic partitioning of parallel loops and data arrays for distributed shared-memory multiprocessors, IEEE Trans. on Parallel and Distributed Systems, vol. 6, no. 9, pp. , 1995.
[13] F. Chen and E. H.-M. Sha, Loop scheduling and partitions for hiding memory latencies, in Proc. IEEE 12th International Symposium on System Synthesis, pp. 64–70, San Jose, Calif, USA, November 1999.
[14] V. Van Dongen and P. Quinton, Uniformization of linear recurrence equations: a step towards the automatic synthesis of systolic arrays, in International Conference on Systolic Arrays, pp. , San Diego, Calif, USA, May 1988.
[15] R. Bixby, K. Kennedy, and U. Kremer, Automatic data layout using 0-1 integer programming, in Proc. International Conference on Parallel Architectures and Compilation Techniques, pp. , Montreal, Canada, August 1994.
[16] G. Rivera and C. W. Tseng, Eliminating conflict misses for high performance architectures, in Proc. 1998 ACM International Conference on Supercomputing, pp. , Melbourne, Australia, July 1998.
[17] N. L. Passos, E. H.-M. Sha, and S. C. Bass, Schedule-based multi-dimensional retiming on data flow graphs, IEEE Trans. Signal Processing, vol. 44, no. 1, pp. , 1996.
[18] J. L. Baer and T. F. Chen, An effective on-chip preloading scheme to reduce data access penalty, in Proc. Supercomputing '91, pp. , Albuquerque, NM, USA, November 1991.

65 Partitioning and Scheduling DSP Applications with Maximal Memory Access Hiding 935 Zhong Wang received a Bachelor s degree in electric engineering in 1994 from Xi an Jiaotong University, China and a Master s degree in information and signal processing in 1998 from Institute of Acoustics, Academia Sinica, China Currently, he is pursuing his PhD in computer science and engineering at University of Notre Dame in Indiana His current research focuses on the loop scheduling and high-level synthesis Edwin Hsing-Mean Sha received his BS degree in computer science and information engineering from National Taiwan University, Taipei, Taiwan, in 1986; he received the MS and PhD degrees from the Department of Computer Science, Princeton University, Princeton, NJ, in 1991 and 1992, respectively From August 1992 to August 2000, he was with the Department of Computer Science and Engineering at University of Notre Dame, Notre Dame, IN He served as Associate Chairman for Graduate Studies since 1995 He is now a tenured full professor in the Department of Computer Science at the University of Texas at Dallas He has published more than 140 research papers in refereed conferences and journals He has been serving as an editor for several journals such as IEEE Transactions on Signal Processing and Journal of VLSI Signal Processing He also served as program committee member in numerous conferences He received Oak Ridge Association Junior Faculty Enhancement Award in 1994, and NSF CAREER Award He was a guest editor for the special issue on Low Power Design of IEEE Transactions on VLSI Systems in 1997 He also served as the program chairs for the International Conference on Parallel and Distributed Computing Systems (PDCS), 2000 and PDCS 2001 He received Teaching award in 1998 Yuke Wang received his BS degree from the University of Science and Technology of China, Hefei, China, in 1989, the MS and PhD degrees from the University of Saskatchewan, Canada, in 1992 and 1996, respectively He has held faculty positions at Concordia University, Canada, and Florida Atlantic University, Florida, USA Currently he is an Assistant Professor at the Computer Science Department, University of Texas at Dallas He has also held visiting assistant professor positions in the University of Minnesota, the University of Maryland, and the University of California at Berkeley Dr Yuke Wang is currently an Editor of IEEE Transactions on Circuits and Systems, Part II, an Editor of IEEE Transactions on VLSI Systems, an Editor of Applied Signal Processing, and a few other journals Dr Wang s research interests include VLSI design of circuits and systems for DSP and communication, computer aided design, and computer architectures During , he has published about 60 papers among which about 20 papers are in IEEE/ACM Transactions

EURASIP Journal on Applied Signal Processing 2002:9, © 2002 Hindawi Publishing Corporation

P-CORDIC: A Precomputation Based Rotation CORDIC Algorithm

Martin Kuhlmann, Broadcom Corporation, Irvine, CA 92619, USA, kuhlmann@broadcom.com
Keshab K. Parhi, Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, USA, parhi@ece.umn.edu

Received 30 August 2001 and in revised form 14 May 2002

This paper presents a CORDIC (coordinate rotation digital computer) algorithm and architecture for the rotation mode in which the directions of all micro-rotations are precomputed while maintaining a constant scale factor. Thus, an examination of the sign of the angle after each iteration is no longer required. The algorithm is capable of performing the CORDIC computation for an operand word-length of 54 bits. Additionally, there is a higher degree of freedom in choosing the pipeline cutsets due to the novel feature of independence of the iterations i and i + 1 in the CORDIC rotation.

Keywords and phrases: CORDIC, computer arithmetic, constant scale factor, precomputation, rotation mode.

1 INTRODUCTION

CORDIC (coordinate rotation digital computer) [1, 2] is an iterative algorithm for the calculation of the rotation of a 2-dimensional vector, in linear, circular, or hyperbolic coordinate systems, using only add and shift operations. It has a wide range of applications including discrete transformations such as the Hartley transform [3], discrete cosine transform [4], fast Fourier transform (FFT) [5], and chirp Z transform (CZT) [6], solving eigenvalue and singular value problems [7], digital filters [8], Toeplitz and linear system solvers [9], and Kalman filters [10]. It can also be used for multiuser detection in code division multiple access (CDMA) wireless systems [11].

The CORDIC algorithm consists of two operating modes, the rotation mode and the vectoring mode. In the rotation mode, a vector (x, y) is rotated by an angle θ to obtain the new vector (x′, y′) (see Figure 1). In every micro-rotation i, fixed angles of the value arctan(2^−i) are added to or subtracted from the angle remainder θ_i, so that the angle remainder approaches zero. In the vectoring mode, the length R and the angle α towards the x-axis of a vector (x, y) are computed; for this purpose, the vector is rotated towards the x-axis so that the y-component approaches zero. The sum of all angle rotations is then equal to the value of α, while the value of the x-component corresponds to the length R of the vector (x, y).

[Figure 1: The rotation and vectoring modes of the CORDIC algorithm.]

The mathematical relations for the CORDIC rotations are as follows:

x_{i+1} = x_i − m·σ_i·2^{−i}·y_i,
y_{i+1} = y_i + σ_i·2^{−i}·x_i,   (1)
z_{i+1} = z_i − (1/√m)·σ_i·arctan(√m·2^{−i}),

where σ_i is the weight of each micro-rotation and m steers the choice of rectangular (m = 0), circular (m = 1), or hyperbolic (m = −1) coordinate systems.

The required micro-rotations are not perfect rotations; they increase the length of the vector. In order to maintain a constant vector length, the obtained results have to be scaled by a scale factor K. Nevertheless, assuming consecutive rotations in positive and/or negative directions, the scale factor is constant and can be precomputed according to

K = ∏_{i=0}^{n−1} k_i = ∏_{i=0}^{n−1} (1 + σ_i²·2^{−2i})^{1/2}.   (2)

The computation of the scale factor can be truncated after n/2 iterations, because the multiplicands in the last n/2 iterations are 1 due to the finite word-length and do not affect the final value of K^{−1}:

K^{−1} = ∏_{i=0}^{n/2} (1 + σ_i²·2^{−2i})^{−1/2}.   (3)

There are two different approaches to the computation of the CORDIC algorithm. The first one uses consecutive rotations in positive and/or negative directions, where the weight of each rotation is 1; hence σ_i is either +1 or −1, depending on the sign of the angle remainder z(i). In every iteration, a significant amount of time is used to examine the most significant bit (in a binary architecture) or the most significant three digits (in a redundant architecture) to predict the sign of z(i) and hence the rotation direction σ_i. In comparison to the CORDIC implementations with constant scale factor, other implementations use a minimally redundant radix-4 or an even higher radix number representation [12, 13, 14]. These architectures make use of a wider range of σ_i; in the case of a minimally redundant radix-4 architecture, σ_i ∈ {−2, −1, 0, 1, 2}. By using this number system, the number of iterations can be reduced. However, the computation time per iteration increases, since it takes more time to differentiate between five different rotation direction values and to generate five different multiples of arctan(2^{−i}). The scale factor also becomes variable and has to be computed every time, due to the absence of consecutive rotations, leading to an increase in area.

To speed up the computation of the CORDIC algorithm, either the number of iterations or the delay of each iteration has to be reduced. The proposed algorithm introduces a novel approach in which the rotation directions can be precomputed by adding to the rotation angle θ a constant and a variable adjustment which is stored in a table. Hence, a significant speedup of the delay per iteration is obtained. Since all rotation directions are known before the actual rotation begins, more than one rotation can also be performed in one iteration, leading to a reduction in latency. The proposed architecture also eliminates the z-datapath and reduces the area of the implementation.

This paper is organized as follows. Section 2 presents the theoretical background for the novel CORDIC algorithm for the rotation mode and Section 3 presents the novel architecture. Section 4 performs an evaluation of different CORDIC architectures, while Section 5 concludes the paper.

2 THE NOVEL CORDIC ALGORITHM

2.1 Mathematical derivation using Taylor series

The summation of all micro-rotations with their corresponding weights σ_i is equivalent to the rotation angle θ,

θ = Σ_{i=0}^{n} σ_i·arctan(2^{−i}),   (4)

where σ_i ∈ {−1, 1}, corresponding to the subtraction and addition of the micro-angles θ_i. Since consecutive rotations are employed, the scale factor is constant. The value of σ can be interpreted as a number in radix-2 representation. The goal of the proposed method is to compute the sequence of the micro-rotations without performing any iteration. To accomplish this, σ_i is recoded as 2d_i − 1, leading to a binary representation in which a zero corresponds to the addition of a micro-angle [15, 16]; this allows the use of simple binary adders. Adding and subtracting 2^{−i} in (4) results in

θ = Σ_{i=0}^{∞} (2d_i − 1)·(2^{−i} − (2^{−i} − arctan(2^{−i})))   (5)
  = Σ_{i=0}^{∞} (2d_i − 1)·2^{−i} − Σ_{i=0}^{∞} (2d_i − 1)·(2^{−i} − arctan(2^{−i}))   (6)
  = 2d − 2 + Σ_{i=0}^{∞} (2^{−i} − arctan(2^{−i})) − Σ_{i=0}^{∞} 2d_i·(2^{−i} − arctan(2^{−i}))   (7)
  = 2d + c_1 − sign(θ)·2·ε_0 − Σ_{i=1}^{∞} 2d_i·(2^{−i} − arctan(2^{−i})),   (8)

where c_1 corresponds to c_1 = −2 + Σ_{i=1}^{∞} (2^{−i} − arctan(2^{−i})) and ε_0 = 0.5·(1 − arctan(1)). Solving (8) for d results in

d = 0.5θ − 0.5c_1 + sign(θ)·0.5·(1 − arctan(1)) + Σ_{i=1}^{∞} d_i·(2^{−i} − arctan(2^{−i}))
  = 0.5θ + c + sign(θ)·ε_0 + Σ_{i=1}^{∞} d_i·ε_i,   (9)

where c corresponds to −0.5c_1 and ε_i = 2^{−i} − arctan(2^{−i}). Table 1 shows the values of the partial offsets ε_i for the first 10 values of i and indicates that the value of ε_i decreases approximately by a factor of 8 as i increases. Hence, the summation of d_i·ε_i can be limited to n/3:

d = 0.5θ + c + sign(θ)·ε_0 + Σ_{i=1}^{n/3} d_i·ε_i,
d = 0.5θ + c + sign(θ)·ε_0 + δ.   (10)
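The closed-form relation (9) can be checked numerically. The short Python sketch below is ours, not from the paper: it generates the rotation directions d_i with the ordinary iterative rotation-mode sign decisions, forms d = Σ d_i·2^{−i}, and compares it with 0.5θ + c + sign(θ)·ε_0 + Σ_{i≥1} d_i·ε_i, using the constants as reconstructed above. The two values agree to within the truncation error of the series.

```python
import math

N = 48                                                  # micro-rotations considered
eps = [2.0**-i - math.atan(2.0**-i) for i in range(N)]  # partial offsets eps_i
c1 = -2.0 + sum(eps[1:])                                # constant c_1 as reconstructed above
c = -0.5 * c1
eps0 = 0.5 * (1.0 - math.atan(1.0))                     # eps_0 = 0.5*(1 - arctan 1)

def directions(theta):
    """Rotation directions d_i from the ordinary iterative sign decisions."""
    z, d = theta, []
    for i in range(N):
        di = 1 if z >= 0 else 0                         # sigma_i = 2*d_i - 1 = sign(z_i)
        z -= (2 * di - 1) * math.atan(2.0**-i)
        d.append(di)
    return d

theta = 0.7                                             # any angle inside the convergence range
d_bits = directions(theta)
d_direct = sum(di * 2.0**-i for i, di in enumerate(d_bits))
d_closed = 0.5 * theta + c + math.copysign(eps0, theta) + sum(
    d_bits[i] * eps[i] for i in range(1, N)
)
print(d_direct, d_closed)                               # the two values agree closely
```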
without performing any iteration To accomplish this, σ i is recoding as 2d i 1 leading to a binary representation in which a zero corresponds to the addition of a micro-angle [15, 16] This allows the use of simple binary adders Adding and subtracting 2 i to (4) results in ( θ = 2di 1 ) (2 i 2 i arctan ( 2 i)) (5) i=0 = i=0 ( 2di 1 ) 2 i ( 2di 1 ) (2 i arctan ( 2 i)) i=0 (6) ( =2d 2 2 i arctan ( 2 i)) ( 2d i 2 i arctan ( 2 i)) i=0 i=0 = 2d c 1 sign(θ) 2 ( 1 arctan(1) ) ( 2d i 2 i arctan ( 2 i)), i=1 where c 1 corresponds to c 1 = 2 i=0 (2 i arctan(2 i )) Solving (8)ford results in d = 05θ 05c 1 sign(θ) (1 arctan(1) ) d i (2 i arctan ( 2 i)) i=1 = 05θ c sign(θ) ɛ 0 d i ɛ i, i=1 where c corresponds to 05c 1 Table 1 shows the values of the partial offsets ɛ i for the first 10 values of i and indicates that the value of ɛ i decreases approximately by a factor of 8 with increasing i Hence, the summation of d i ɛ i can be limited to n/3 d = 05θ c sign(θ) ɛ 0 d i ɛ i, i=1 d = 05θ c sign(θ) ɛ 0 δ (7) (8) (9) (10)

[Table 1: The values of the partial offset ε_i for the first 10 values of i.]

Rather than storing the partial offsets ε_i and computing the sum of the products d_i·ε_i over all i, δ = Σ_{i=1}^{n/3} d_i·ε_i can be precomputed and stored. Hence, the only difficulty consists of determining which offset corresponds to the input θ. This can be achieved by comparing the input θ with a reference angle θ_ref. The reference angles θ_ref correspond to the summation of the first n/3 micro-rotations. To be certain to obtain the correct offset, θ has to be larger than the reference angle θ_ref. All reference angles are stored in a ROM and are accessed by the most significant n/3 bits of θ. In addition to the reference angles, the values of δ are stored. In case of a negative difference θ_ref − θ, the corresponding δ is selected; otherwise the next smaller value of δ is chosen to be subtracted from θ + c + sign(θ)·ε_0.

[Table 2: The reference angles of the rotation mode and their corresponding values of δ for an operand word-length of 16 bits.]

Example 1. Assume we have a word-length of 16 bits and θ = . According to Table 2, θ_ref corresponds to and δ = . Hence, d is computed as

d = 0.5·θ + c + sign(θ)·ε_0 + δ = , σ = .   (11)

2.2 High precision

By using a mantissa of n = 54 bits (corresponding to floating-point precision), the ROM for storing all offsets would require 2^18 entries. This is rather impractical, since the area required to implement the ROM would exceed by far the area of the CORDIC implementation itself. To reduce the area of the ROM, δ can be split into two parts,

δ = δ_ROM + δ_r,   (12)

where δ_ROM is stored in a ROM while δ_r is computed. By examining the Taylor series expansion of arctan(2^{−i}), it becomes obvious that the partial offsets ε for iterations i and i + 1 correspond to

ε_i = 2^{−i} − arctan(2^{−i}) = 2^{−3i}/3 − 2^{−5i}/5 + 2^{−7i}/7 − 2^{−9i}/9 + ···,   (13)
ε_{i+1} = 2^{−(i+1)} − arctan(2^{−(i+1)})   (14)
  = 2^{−3i−3}/3 − 2^{−5i−5}/5 + 2^{−7i−7}/7 − 2^{−9i−9}/9 + ···   (15)
  = 2^{−3}·(2^{−3i}/3 − 2^{−5i−2}/5 + 2^{−7i−4}/7 − 2^{−9i−6}/9 + ···).   (16)

By comparing (13) and (16), it can be seen that (13) is about 2^3 times larger than (16). Assuming a word-length of n bits and i > ⌈n/5⌉ − 2, the factor is exactly 2^3. Hence, the term ε_{⌈n/5⌉−1} = 2^{−3(⌈n/5⌉−1)}/3 − 2^{−5(⌈n/5⌉−1)}/5 can be stored in a ROM

and the remaining offset δ_r is computed as

δ_r = Σ_{j=⌈n/5⌉−1}^{n/3} d_j·ε_{⌈n/5⌉−1}·2^{−3(j−⌈n/5⌉+1)} < 2^{−3(⌈n/5⌉−1)}.   (17)

The largest magnitude of δ_r is smaller than 2^{−3(⌈n/5⌉−1)}.

Example for high precision. Assume that we have a word-length of 50 bits and θ = . Using the most significant 9 bits of θ, δ_ROM = can be obtained. Hence, d is computed according to

d = 0.5·θ + c + sign(θ)·ε_0 + δ_ROM + δ_r = .   (18)

2.3 The rotation mode in hyperbolic coordinate systems

Similar to the circular coordinate system, a simple correlation between the input angle θ and the directions of the micro-rotations can be obtained. Due to the incomplete representation of the hyperbolic rotation angles θ_i, some iterations have to be performed twice; in [2], it was recommended that every 4th, 13th, ..., (3k + 1)th iteration should be repeated to complete the angle representation. Similar to the rotation mode in the circular coordinate system, the rotation angle θ is equivalent to the summation of all micro-rotations with their corresponding weights. This leads to

θ = Σ_{i=1}^{∞} σ_i·arctanh(2^{−i}) + σ_4^extra·arctanh(2^{−4}) + σ_13^extra·arctanh(2^{−13}) + σ_40^extra·arctanh(2^{−40}).   (19)

Performing a Taylor series expansion and applying σ_i = 2d_i − 1 results in

d = 0.5θ + c + Σ_{i=1}^{∞} d_i·(2^{−i} − arctanh(2^{−i})) − d_4^extra·arctanh(2^{−4}) − d_13^extra·arctanh(2^{−13}) − d_40^extra·arctanh(2^{−40}),   (20)

where c corresponds to c = 0.5 − 0.5·Σ_{i=1}^{∞}(2^{−i} − arctanh(2^{−i})) + 0.5·(arctanh(2^{−4}) + arctanh(2^{−13}) + arctanh(2^{−40})). Since these extra rotations are not known in advance, an efficient high-precision VLSI implementation is not possible. However, for signal processing applications using a word-length of less than 13 bits, the ROM size corresponds to only 14 entries.

3 THE NOVEL ROTATION-CORDIC ARCHITECTURE

For an implementation with an operand word-length of n bits, the pre-processing part consists of a ROM of 2^{⌈n/5⌉−2} entries in which the reference angles θ_ref and the corresponding offsets δ are stored (see Figure 2). To avoid a second access to the ROM in case θ_ref > θ, the next smaller offset δ_{k−1} is additionally stored in the kth entry of the ROM. The ROM is accessed by the ⌈n/5⌉ − 2 MSB bits of θ. A binary tree adder computes whether θ is smaller or larger than the chosen reference angle θ_ref and selects the corresponding offset (either δ_k or δ_{k−1}). Using a 3 : 2 compressor and another fast binary tree adder, the two required additions to obtain d_approx = 0.5θ + c_2 + δ_ROM can be performed, where c_2 corresponds to c + sign(θ)·ε_0. Using the bits d_{⌈n/5⌉−1} to d_{n/3}, δ_r can be computed according to (17) and has to be added to d_approx. In the worst case, there is a possible ripple from bit d_{3(⌈n/5⌉−1)} to bit d_{⌈n/5⌉}, which would call for a time-consuming ripple adder. However, by employing an extra rotation for d_{3(⌈n/5⌉−1)+1}, this limitation can be resolved: the extra rotation corresponds to the overflow bit of the addition of the bits d_approx,3(⌈n/5⌉−1)...n and δ_r. The additional rotation also does not affect the scale factor, since 3(⌈n/5⌉−1) > n/2. For a precision of n ≤ 16 bits, there are fewer than 32 offsets, which can be stored in a ROM, and the additional overhead to compute δ_r can be removed.

An alternative architecture can be chosen by realizing that the directions of the micro-rotations are required in a most-significant-bit-first manner (see Figure 2). As in the previous architecture, a fast binary adder is employed to determine which offset has to be selected.
A redundant sign digit adder adds 0.5θ, c, and δ_ROM, and an on-the-fly converter starts converting the result into the corresponding binary representation. Normally, the most significant bit cannot be determined until the least significant digit is converted. However, such worst cases do not exist in the CORDIC implementation, due to the redundant representation of the angles arctan(2^{−i}), where

arctan(2^{−i}) < Σ_{k=i+1}^{∞} arctan(2^{−k}),   (21)

as opposed to the binary representation, where

2^{−i} > Σ_{k=i+1}^{n} 2^{−k}.   (22)

Therefore, it is not possible that there are more than l + 1 consecutive rotations in the same direction. In case there are l + 1 consecutive rotations in the same direction, the lth iteration has to be rotated into the opposite direction; this happens if the angle remainder z_i ≈ 0.

[Figure 2: The novel architecture for the rotation mode. The input angle θ addresses a ROM holding θ_ref and the two offsets δ_k and δ_{k−1} (2 words); a fast adder produces the sign that selects the offset through muxes; a redundant adder combines 0.5θ, c_2, and δ_ROM; the bits d[⌈n/5⌉−2 : n/3] drive the computation of δ_r; and an on-the-fly converter delivers the directions σ to the x/y CORDIC rotations.]

Table 3 shows the maximum number of consecutive unidirectional rotations depending on the iteration number i. This limitation leads to a reduction in the complexity of the on-line converter, and its most significant bits can already be used to start the rotations in the x/y datapath.

[Table 3: The maximal number of consecutive rotations in the same direction, listing the limit l + 1 for each iteration index i.]

Example 2. Assume an angle θ = 0.001. The angle remainders θ_i then correspond to

θ_0 = 0.001, θ_1 = θ_0 − arctan(1) = −0.7844, θ_2 = θ_1 + arctan(0.5) = −0.3208, θ_3 = θ_2 + arctan(0.25) = −0.0758, θ_4 = θ_3 + arctan(0.125) = 0.0486.   (23)

The next rotation has to be performed in the negative direction, since θ_4 > 0. Hence, it is not possible to obtain a rotation sequence like σ_{0...4} = {+, −, −, −, +}; it has to be σ_{0...4} = {+, −, −, −, −}.

3.1 Evaluation of the z-datapath

Delay analysis. In this paper, we assume a delay model similar to the one proposed in [14]. However, in [14] the unit delay is set to a gate delay, while in our evaluation the unit delay is set to a full-adder delay. Hence, the delays for a 2-input (NAND, NOR) gate, XOR, multiplexer, register, and full-adder are 0.25, 0.5, 0.5, 0.5, and 1 t_FA, respectively.

The determination of which offset has to be chosen consists of the delay of the decoder, the ROM, a fast binary n-bit tree adder, and a multiplexer. Assuming a delay of log₂(m) gate delays for the decoder, where m corresponds to the number of rows in the ROM (m < log₂(n) + 1), one gate delay for the word-line driver and another for the ROM, log₂(n)·t_Mux for the fast binary adder, and 0.5·t_FA for the multiplexer, we obtain the correct value of δ_ROM after a delay of (0.5·log₂(n) + 1 + 0.25·log₂(log₂(n)))·t_FA. A 3 : 2 compressor can be employed to reduce the number of partial products to two, and an additional fast binary tree adder computes the final value of d_approx. Hence, the entire delay to obtain d_approx corresponds to

(0.5·log₂(n) + 1 + 0.25·log₂(log₂(n)) + 1 + 0.5·log₂(n))·t_FA ≈ (log₂(n) + 2.25)·t_FA.   (24)

After obtaining the bits d_{⌈n/5⌉−1} to d_{n/3}, δ_r can be computed. Since the value of δ_r is smaller than 2^{−3(⌈n/5⌉−1)} and the value of d_approx + δ_r is not required before 2·3(⌈n/5⌉)·t_FA, the computation of δ_r is not in the critical path.

Alternatively to the 3 : 2 compressor and the tree adder, a minimally redundant radix-4 sign digit adder can be employed, which has a delay of two full-adders; hence, all output digits are available after these two full-adder delays. An additional on-the-fly converter converts the digits into the equivalent binary representation starting with the MSD. It requires the delay of a multiplexer and four NANDs/NORs to convert one digit, which results in 1.5·t_FA per digit (1 digit = 2 bits). The last digit is converted after a delay of (n/2 + 1)·1.5·t_FA. As already described in Table 3, bit n/3 is stable as soon as the last digit (corresponding to bit n) has been converted. Hence, the n/3 rotation can be performed after a delay of (n/2 + 1)·1.5·t_FA, and therefore the iteration i = 0 can already be performed after a delay of (n/2 + 1)·1.5·t_FA − (n/3)·2·t_FA ≈ (1/12·n + 1)·t_FA. Note that the conversion of one redundant digit is performed faster than the addition/subtraction of the x/y datapath. Hence, an initial delay of (1/12·n + 1)·t_FA + (log₂(n) + 2.25)·t_FA = (1/12·n + log₂(n) + 3.25)·t_FA has to be added to the delay of the x/y datapath.
Area analysis. Previously, the area of the z-datapath consisted of n/2 iterations in which (n + log₂n + 2) multiplexers and (n + log₂n + 2) full-adders and registers are employed. Additionally, due to the Booth encoding, in the last n/4 iterations about 2(n + log₂n + 2) multiplexers and (n + log₂n + 2) full-adders are required. Assuming A_FA = 1.93·A_mux and A_FA = 1.61·A_reg (values based on layouts), the hardware complexity of the z-datapath results in A_z = 1.7·n·(n + log₂n + 2)·A_FA. Assuming a word-length of 54 bits and neglecting the required area for

71 P-CORDIC: A Precomputation Based Rotation CORDIC Algorithm 941 the examination of the most significant three digits, about 5700A FA are required The proposed architecture utilizes a ROM of word-length n and 2 n/5 2 entries, requiring an area of n 2 n/5 2 A FA 1/50 resulting in 552A FA for a word-length of 54 bits The implementation of the decoders can be done in multiple ways NOR based decoders with precharge lead to the fastest implementation However, the decoder area becomes larger The decoder size per word-line corresponds to A dec = 083A FA Since 2 n/5 2 decoders are required, the area for all decoder corresponds to A dec,total = n/5 2 = 424A FA,assuming a 54 bit word-length The ROM has to store θ ref, δ k,andδ k 1 This results in a total area for the ROM and the decoder of about 2080A FA The computation of δ r requires n/3 n/52 = 2n/15 2 rows of CSA (carry-saveadders) and Muxes and a final fast binary tree adder Note that each row of CSA adders and Muxes only consists of (n 3n/5 6 = 2n/5 6) bits (the more significant bits are zero) The required area corresponds to 10 27A FA 10 27A mux and 5 27A FA, respectively Hence, the computation of δ r requires 540A FA Moreover, the two redundant sign digit adder require 2n A FA, while the converter consists of about (05n 2 n)a mux This corresponds to 108 and 696A FA for a word-length of 54 bits This makes a total of 3426A FA, which is about 60% of the z-datapath previously employed 32 Evaluation of the x/y datapath In the first n/2 micro-rotations, the critical path of the x/y rotator part consists of a multiplexer and a 4 : 2 compressor, which has a combined critical path of 2 full-adders The last n/2 micro-rotations can be performed only using n/4iterations, since Booth encoding can be employed However, the delay of the selection for the multiple of the shifted x/y components requires slightly more time, resulting in a delay of about one full-adder delay The delay for the 4 : 2 compressor remains 15 full-adder Hence, the critical path of the entire x/y rotator part consists of n/2 2t FA n/4 25 t FA = 1625n t FA Note that the direction of the first iteration is already known; hence, the first iteration is not in the critical path Therefore, the critical path of the entire x/y rotator part consists of (1625n 2)t FA As an example, for a word-length of n = 16 bits, the x/y datapath delay and the entire delay of the CORDIC algorithm corresponds to 24 and 325 full-adder delays, respectively 33 Scale factor compensation Since the scale factor is constant, the x and y values can already be scaled while the rotation direction is being computed The scaling requires an adder of word-length (n log 2 (n)) bits Using a binary tree adder, this results in a delay of log 2 (n log 2 (n)) t Mux For the scale factor, a CSD (canonic signed digit) representation can be used, leading to at most n/3 nonzero digits Applying a Wallace-tree for the partial product reduction, the total delay of the scaling results into (05log 2 (n log 2 (n)) log 15 (n/3)) t FA < (1/12 n log 2 (n)325) t FA = t initial Hence, the scaling of the x and y coordinates does not affect the total latency of the novel algorithm 4 OVERVIEW OF PREVIOUSLY REPORTED CORDIC ALGORITHMS The delay of every iteration can be decomposed into two different time delays, t d,σ and t d,xy,wheret d,σ corresponds to the time delay to predict the new rotation direction while t d,xy corresponds to the time delay of the multiplexer/add structure of the x/y datapath Various implementations have been proposed to 
obtain a speedup of the CORDIC algorithm Improvements have been especially made in the reduction of t d,σ In [17], the angle remainder has been decomposed every k = 3k 1 iteration From the given angle θ, the first four rotation directions can be immediately determined After performing the corresponding addition/subtraction of the terms σ i α i from the input angle θ using CSA arithmetic, a fast binary tree adder computes the nonredundant result z 4 The bits4to13ofz 4 deliver the rotation direction σ 4 to σ 13 which are used to perform the rotation in the x/y datapath and the computation of the next angle remainder z 40 Hence,alow latency CORDIC algorithm is obtained However, a significant reduction in latency is achieved at the cost of an irregular design Furthermore, it is difficult to perform a π/2 initial rotation or the rotation of index i = 0forcircular coordinates, as it would force a conversion from redundant to conventional arithmetic for the z coordinate just after the first micro-rotation which is costly in time and area Hence, this parallel and nonpipelined architecture only converges in the range of [ 1, 1] The overall latency of this architecture corresponds to about 2n log 3 (n) log 2 (n) full-adder delay In [18], a direct correlation between the z remainder after n/3 rotations and the remaining rotation direction have been shown Hence, no more examination of the direction of the micro-rotation has to be performed leading to a considerable reduction in latency However, in the first n/3 iteration a conventional method has to be employed In [19], the directions of the micro-rotation have been recoded using an offset binary coding (OBC) [20] The obtained correlation is approximately piecewise linear since small elementary angles can be approximated by a(i) = arctan(2 i ) s 2 n i 2,wheres is the slope of the linearity This is valid for i m, wherem is an integer which makes the approximation tolerable (normally m = n/3 ) Hence, the following correlation can be obtained: n 1 i=m n 1 σ i 2α i s i=m σ i 2 i 1 (25) By performing some arithmetic computations, the following correlation of the rotation direction can be obtained: n 1 i=0 n 1 σ i 2 i = i=0 n 1 σ i 2 i 1 i=m σ i 2α(i) (26) s

72 942 EURASIP Journal on Applied Signal Processing Hence, a multiplication by the inverse of the slope s is required This multiplication can be simplified to two stages of addition for an operand word-length of 9 bits However, in most digital signal processing application, the operands have a word-length of up to 16 bits Hence, for those applications, the presented method requires more stages of addition to compensate the multiplication resulting in a more complex implementation and an increase in delay In [21], a double rotation method is introduced which compensates for the scale factor while performing the regular x/y rotations However, due to the double rotation nature of this method, t d,xy is increased to about twice its original value To reduce the latency of the CORDIC operation, [22] proposed an algorithm using online arithmetic However, this results in a variable scale factor This drawback is removed in [23] In every iteration a significant amount of time is used to examine the most significant three digits to predict σ i The employed random logic requires a delay of about 15 full-adder delays Since the x/y datapath consists of a 4-2 compressor, it requires also a delay of 2 full-adders Hence, the overall iteration delay corresponds to 35 full-adder delays To maintain a constant scale factor, consecutive rotations are required in the first n/2, where n corresponds to the word-length of the operands For the computation of the last n/2 bits, Booth encoding can be employed reducing the number of iterations by a factor of 2 However, the selection of multiple of the shifted x and y operands requires an additional multiplexer delay and increases the overall iteration delay to 4 full-adder delays Hence, the number of iteration is equivalent to 075n which corresponds to a total latency of 3n full-adders (this does not include the scale operation and the conversion) Other implementations like [24] remove the extra rotations by a branching mechanism in case that the sign of the remainder cannot be determined (most significant three digits are zero) Hence, no extra-rotations are required while the required implementation area is doubled Nevertheless, the most significant three digits (or most significant six bits) still have to be examined for the prediction of the next rotation direction In [25], the double step branching CORDIC algorithm is introduced which performs two rotations in a single step Nevertheless, this method requires an examination of the most significant six digits to detect two rotation directions Since some of the digits can be examined in parallel, the delay increases only to 2t FA The computation time of a double rotation in the x/y datapath is slightly reduced compared to two normal x/y rotations Hence, the total amount of computation time corresponds to 05n(2t FA 3t FA ) = 25t FA In [26], the signs of all micro-rotations are computed serially However, a speed up of the sampling rate is achieved by separating the computation of the sign and the magnitude of every z i or y i remainder The sign of every remainder is computed by a pipelined carry-ripple adder (CRA) leading to an initial latency of n full-adders before the first CORDIC rotation can be performed Nevertheless, after this initial latency, the following signs can be obtained with a delay of only one Table 4: An overview between the proposed algorithm and other CORDIC implementations Approach Delay in t FA proposed 1625n 1/12 n log 2 (n)125 [14] 2n 6 [26] 3n 1 [21] 375n [27] 525n [25] 25n [17] 2n log 3 (n)log 
2 (n) full-adder This leads to an overall latency of 3n full-adders delays In comparison to the CORDIC implementations with constant scale factor, other implementations use a minimally redundant radix-4 or an even higher radix number representation [12, 13, 14] By using this number system, the number of iterations can be reduced However, the prediction of the σ i becomes more complicated, since there are more possible values for σ i In addition, the scale factor becomesvariableandhastobecomputedeverytime,dueto the absence of consecutive rotations An online computation of the scale factor and a parallel scaling of the x and y operands can be achieved Depending of the use of CSA or fast carry-propagate-adders (CCLA), the number of iterations can be reduced to 2 n/3 4andn/2 1, respectively The iteration delay t d,csa of the architecture using CSA adders corresponds to the same delay as already described for the last n/2 iteration in the constant scale factor using Booth-encoding, while the architecture employing the fast CCLA adders requires 15 d,csa [14] Hence, the overall latency of these CORDIC algorithm using a minimally redundant radix-4 digit set corresponds to about 2n full-adder delays Table 4 provides a delay comparison between the proposed algorithm and other CORDIC implementations Some of the delays have been taken from [14, 17, 26] 5 CONCLUSION This paper presented a CORDIC algorithm for the rotation mode which computes the directions of the required microrotation before the actual CORDIC computations start while maintaining a constant scale factor This is obtained by using a linear correlation between the rotation angle θ and the corresponding direction of all micro-rotations for the rotation mode The rotation directions are obtained by adding the rotation angle θ to a constant and a variable offset which is stored in a ROM An implementation for high precision is also provided which reduces the size of the required ROM Hence, neither extra or double rotations nor a variable scale factor are required The implementation is suitable for wordlengths up to 54 bits, while maintaining a reasonable ROM size
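For reference, the latency figures quoted above are counted against the conventional rotation-mode CORDIC recurrence, in which the direction of every micro-rotation is taken from the sign of the residual angle and the constant scale factor K = prod_i cos(arctan 2^-i), about 0.607, is compensated afterwards. The following C fragment is a minimal floating-point sketch of that baseline, not of the precomputation scheme presented in this paper; the function name, data type, and iteration count are illustrative only.

#include <math.h>
#include <stdio.h>

/* Conventional rotation-mode CORDIC (illustrative sketch; double precision
 * is used here instead of a fixed-point datapath).  sigma is taken from the
 * sign of the residual angle z; the constant scale factor K is compensated
 * at the end. */
static void cordic_rotate(double x, double y, double theta, int n,
                          double *xr, double *yr)
{
    double K = 1.0;
    for (int i = 0; i < n; i++)                 /* K = prod cos(arctan 2^-i) */
        K *= 1.0 / sqrt(1.0 + pow(2.0, -2.0 * i));

    double z = theta;
    for (int i = 0; i < n; i++) {
        double sigma = (z >= 0.0) ? 1.0 : -1.0; /* direction of micro-rotation */
        double x_new = x - sigma * y * pow(2.0, -i);
        double y_new = y + sigma * x * pow(2.0, -i);
        z -= sigma * atan(pow(2.0, -i));        /* update residual angle */
        x = x_new;
        y = y_new;
    }
    *xr = K * x;                                /* compensate constant scale factor */
    *yr = K * y;
}

int main(void)
{
    double xr, yr;
    cordic_rotate(1.0, 0.0, 0.5, 16, &xr, &yr); /* approaches (cos 0.5, sin 0.5) */
    printf("%f %f\n", xr, yr);
    return 0;
}

Rotating (1, 0) by 0.5 rad with 16 iterations returns approximately (cos 0.5, sin 0.5); it is this serial dependence between the sign decision and the x/y update that the redundant-arithmetic and precomputation schemes discussed above are designed to shorten.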

73 P-CORDIC: A Precomputation Based Rotation CORDIC Algorithm 943 ACKNOWLEDGMENT This work was supported by the Defense Advanced Research Projects Agency under contract number DA/DABT63-96-C Prof Parhi is on leave from the Department of Electrical and Computer Engineering of the University of Minnesota, Minneapolis, MN, USA REFERENCES [1] J E Volder, The CORDIC trigonometric computing technique, IRE Transactions on Electronic Computers, vol 8, no 3, pp , 1959 [2] J S Walther, A unified algorithm for elementary functions, in Proc Spring Joint Computer Conference, vol 38, pp , Arlington, Va, USA, 1971 [3] L W Chang and S W Lee, Systolic arrays for the discrete Hartley transform, IEEE Trans Signal Processing, vol 39, no 11, pp , 1991 [4] W-H Chen, C H Smith, and S C Fralick, A fast computational algorithm for the discrete cosine transform, IEEE Trans Communications, vol 25, no 9, pp , 1977 [5] A M Despain, Fourier transform computers using CORDIC iterations, IEEE Trans on Computers, vol 23, no 10, pp , 1974 [6] Y H Hu and S Naganathan, A novel implementation of chirp Z-transform using a CORDIC processor, IEEE Transaction on Acoustics, Speech, and Signal Processing, vol 38, no 2, pp , 1990 [7] M Ercegovac and T Lang, Redundant and on-line CORDIC: Application to matrix triangularization and SVD, IEEE Trans on Computers, vol 39, no 6, pp , 1990 [8] P P Vaidyanathan, A unified approach to orthogonal digital filters and wave digital filters, based on LBR two-pair extraction, IEEE Trans Circuits and Systems,vol32,no7,pp , 1985 [9] Y H Hu and H M Chern, VLSI CORDIC array structure implementation of Toeplitz eigensystem solver, in Proc IEEE Int Conf Acoustics, Speech, Signal Processing, pp , Alberquerque, NM, USA, April 1990 [10] T Y Sung and Y H Hu, Parallel VLSI implementation of Kalman filter, IEEE Trans on Aerospace and Electronics Systems, vol 23, pp , March 1987 [11] H V Poor and X Wang, Code-aided interference suppression for DS/CDMA communications Part I: Interference suppression capability, IEEE Trans Communications, vol 45, no 9, pp , 1997 [12] C Li and S G Chen, A radix-4 redundant CORDIC algorithm with fast on-line variable scale factor compensation, in International Symposium on Circuits and systems, pp , Hong Kong, June 1997 [13] ROsorio,EAntelo,JVillalba,JDBruguera,andELZapata, Digit on-line large radix CORDIC rotator, in Proc Int Conf Application-Specific Array Processors, pp , Strasbourg, France, July 1995 [14] J Villalba, J Hidalgo, E L Zapata, E Antelo, and J D Bruguera, CORDIC architectures with parallel compensation of the scale factor, in Proc Int Conf Application Specific Array Processors, pp , Strasbourg, France, July 1995 [15] M Kuhlmann and K K Parhi, A high-speed CORDIC algorithm and architecture for digital signal processing applications, in Proc 1999 IEEE Workshop on Signal Processing Systems: Design and Implementation, pp , Taipei, Taiwan, October 1999 [16] M Kuhlmann and K K Parhi, A new CORDIC rotation method for generalized coordinate systems, in Proc 1999 Asilomar Conf on Signals, Systems and Computers, Pacific Grove, Calif, USA, October 1999 [17] D Timmermann, H Hahn, and B J Hosticka, Low latency time CORDIC algorithms, IEEE Trans on Computers, vol 41, no 8, pp , 1992 [18] S Wang, V Piuri, and E Swartzlander, Hybrid CORDIC algorithms, IEEE Trans on Computers, vol 46, no 11, pp , 1997 [19] S Nahm and W Sung, A fast direction sequence generation method for CORDIC processors, in Proc IEEE Int Conf Acoustics, Speech, Signal Processing, pp , Munich, Germany, April 1997 
[20] N Demassieux and F Jutand, VLSI Implementation for Image Communications, chapter 7, P Pirsch, Ed, Elsevier Science, New York, NY, USA, 8th edition, 1993 [21] N Takagi, T Asada, and S Yajima, Redundant CORDIC methods with a constant scale factor for sine and cosine computation, IEEE Trans on Computers, vol 40, no 9, pp , 1991 [22] H X Lin and H J Sips, On-line CORDIC algorithms, IEEE Trans on Computers, vol 39, no 8, pp , 1990 [23] R Hamill, J McCanny, and R Walke, On-line CORDIC algorithm and VLSI architecture for implementing QR-array processors, to appear in Journal of VLSI Signal Processing, 1999 [24] J Duprat and J-M Muller, The CORDIC algorithm: New results for fast VLSI implementation, IEEE Trans on Computers, vol 42, no 2, pp , 1993 [25] D S Phatak, Double step branching CORDIC: A new algorithm for fast sine and cosine generation, IEEE Trans on Computers, vol 47, no 5, pp , 1998 [26] H Dawid and H Meyr, The differential CORDIC algorithm: Constant scale factor redundant implementation without correcting iterations, IEEE Trans on Computers, vol 45, no 3, pp , 1996 [27] J-A Lee and T Lang, A constant-factor redundant CORDIC for angle calculation and rotation, IEEE Trans on Computers, vol 41, no 8, pp , 1992 Martin Kuhlmann received his Diplome Ingénieur and PhD degrees in electrical engineering from the University of Technology Aachen, Germany in 1997 and from the University of Minnesota in 1999, respectively Currently, he is a staff design engineer at Broadcom Corporation, Irvine, CA, USA His research interests include computer arithmetic, digital communication, VLSI design, and deep-submicron crosstalk Keshab K Parhi is a distinguished McKnight University Professor of Electrical and Computer Engineering at the University of Minnesota, Minneapolis, where he also holds the Edgar F Johnson Professorship He received the BTech, MSEE, and PhD degrees from the Indian Institute of Technology, Kharagpur (India) (1982), the University of Pennsylvania, Philadelphia (1984), and the University of California at Berkeley (1988), respectively His research interests include all aspects of physical layer VLSI implementations of broadband access systems He is currently working on VLSI adaptive digital filters, equalizers and beamformers, error control coders and cryptography architectures, lowpower digital systems, and computer arithmetic

74 EURASIP Journal on Applied Signal Processing 2002:9, c 2002 Hindawi Publishing Corporation Frequency Spectrum Based Low-Area Low-Power Parallel FIR Filter Design Jin-Gyun Chung Division of Electronic and Information Engineering, Chonbuk National University, Chonju , Korea jgchung@moakchonbukackr Keshab K Parhi Department of Electrical and Computer Engineering, University of Minnesota, Minneapolis, MN 55455, USA parhi@eceumnedu Received 11 July 2001 and in revised form 15 May 2002 Parallel (or block) FIR digital filters can be used either for high-speed or low-power (with reduced supply voltage) applications Traditional parallel filter implementations cause linear increase in the hardware cost with respect to the block size Recently, an efficient parallel FIR filter implementation technique requiring a less-than linear increase in the hardware cost was proposed This paper makes two contributions First, the filter spectrum characteristics are exploited to select the best fast filter structures Second, a novel block filter quantization algorithm is introduced Using filter benchmarks, it is shown that the use of the appropriate fast FIR filter structures and the proposed quantization scheme can result in reduction in the number of binary adders up to 20% Keywords and phrases: parallel FIR filter, quantization, fast FIR algorithm, canonic signed digit 1 INTRODUCTION Finite impulse response (FIR) filters are widely used in various DSP applications In some applications, the FIR filter circuit must be able to operate at high sample rates, while in other applications, the FIR filter circuit must be a low-power circuit operating at moderate sample rates The low-power or low-area techniques developed specifically for digital filters can be found in [1, 2, 3, 4, 5, 6, 7] Parallel (or block) processing can be applied to digital FIR filters to either increase the effective throughput or reduce the power consumption of the original filter While sequential FIR filter implementation has been given extensive consideration, very little work has been done that deals directly with reducing the hardware complexity or power consumption of parallel FIR filters Traditionally, the application of parallel processing to an FIR filter involves the replication of the hardware units that exist in the original filter If the area required by the original circuit is A, then the L-parallel circuit requires an area of L A Recently, an efficient parallel FIR filter implementation technique requiring a less-than linear increase in the hardware cost was proposed using FFAs (fast FIR Algorithms) [8] In [9], it was shown that the power consumption of arithmetic units can be reduced if statistical properties of the input signals are exploited In this paper, based on [10], it is shown that the hardware cost can be reduced by exploiting the frequency spectrum characteristics of the given transfer function This is achieved by selecting appropriate FFA structures out of many possible FFA structures all of whom have similar hardware complexity at the word-level However, their complexity can differ significantly at the bit-level For example, in narrowband low-pass filters, the signs of consecutive unit sample response values do not change much and therefore their difference can require fewer number of bits than their sum This favors the use of a parallel structure which requires subfilters which require difference of consecutive unit sample response values as opposed to sum In addition to the appropriate selection of FFA structures, 
proper quantization of subfilters is important for low-power or low hardware cost implementation of parallel FIR filters. It is shown in [5, 6, 7] that if the filter coefficients are first scaled before the quantization process is performed, the resulting filter will have much better frequency-space characteristics. When the quantized filter is implemented, a postprocessing scale factor (PPSF) is used to properly adjust the magnitude of the filter output. In cases where large levels of parallelism are used, the number of required subfilters is large, and consequently the PPSFs can contribute to a significant amount of hardware overhead. In [8], PPSFs are restricted to a set of simple values to reduce the hardware overhead due to PPSFs. Since the original PPSF is replaced with the new simple PPSF

75 Frequency Spectrum Based Low-Area Low-Power Parallel FIR Filter Design 945 that is the nearest in value, the quantized filter coefficients must also be properly modified However, this approach is not guaranteed to give optimal quantized coefficients since already quantized coefficients are modified again To avoid this problem, we propose look-ahead maximum absolute difference (LMAD) quantization algorithm, which gives optimal quantized coefficients for a given simple PPSF value In Section 2, FFAs are briefly reviewed Also, frequency spectrum related hardware complexities for different types of FFAs are discussed Section 3 presents a quantization method suitable for block FIR filters Section4 presents several block filter design examples H 0 y(2k) x(2k) H 1 H 0 y(2k 1) x(2k 1) H 1 D Figure 1: Traditional 2-parallel FIR filter 2 FAST FIR ALGORITHMS Consider the general formulation of a length-n FIR filter, N 1 y n = h i x n i, n = 0, 1, 2,,, (1) i=0 where {x i } is an infinite length input sequence and {h i } are the length-n FIR filter coefficients Then the polyphase representation of a traditional L-parallel FIR filter [11] canbe expressed as L 1 i=0 ( Y i z L ) L 1 z i = j=0 ( H j z L ) L 1 z j ( X k z L ) z k, (2) where Y i (z) = m=0 z m y mli, H i (z) = N/L 1 m=0 z m h mli, X i (z) = m=0 z m x mli,fori = 0, 1,,L 1 This block FIR filtering equation shows that the parallel FIR filter can be realized using L 2 -FIR filters of length N/L This linear complexity can be reduced using various FFA structures (L = 2) FFAs From (2)withL = 2, we have k=0 Y 0 z 1 Y 1 = ( H 0 z 1 H 1 )( X0 z 1 X 1 ), which implies that = H 0 X 0 z 1( H 0 X 1 H 1 X 0 ) z 2 H 1 X 1, Y 0 = H 0 X 0 z 2 H 1 X 1, Y 1 = H 0 X 1 H 1 X 0 Direct implementation of (4) is shown in Figure 1 This structure computes a block of 2 outputs using 4 length N/2 FIR filters and 2 postprocessing additions, which requires 2N multipliers and 2N 2adders If (4) is written in a different form, the (2 2) FFA0 (FFAtype 0) is obtained, Y 0 = H 0 X 0 z 2 H 1 X 1, Y 1 = H 01 X 01 H 0 X 0 H 1 X 1, where H ij = H i H j and X ij = X i X j Implementation of (5) is shown in Figure 2 Thisstructurecomputesablockof (3) (4) (5) x(2k) x(2k 1) x(2k) x(2k 1) H 0 H 0 H 1 H 1 Figure 2: 2-parallel FIR filter using FFA0 H 0 H 0 H 1 H 1 Figure 3: 2-parallel FIR filter using FFA1 D D y(2k) y(2k 1) y(2k) y(2k 1) 2 outputs using 3 length N/2 FIR filters and 4 preprocessing and postprocessing additions, which requires 3N/2 multipliers and 3(N/2 1) 4 adders By a simple modification of (5), the following FFA1 (FFA-type1)isderived[11], Y 0 = H 0 X 0 z 2 H 1 X 1, Y 1 = H 0 1 X 0 1 H 0 X 0 H 1 X 1 In (6), H 0 1 = H 0 H 1 and X 0 1 = X 0 X 1 The structure derived by FFA1 is shown in Figure 3 The structures derived by FFA0 and FFA1 are essentially the same except some sign changes Notice that, in FFA1, H 0 1 is used instead of H 01 When an FIR filter is implemented using a multiplierless approach, the hardware complexity is directly proportional to the number of nonzero bits in the filter coefficients If the signs of the given impulse response sequences do not change frequently as in the narrowband low-pass filter cases, the coefficient magnitudes of H 0 H 1 are likely to be larger than those of H 0 H 1 Then,H 0 H 1 hasmorenonzerobitsin the coefficients than H 0 H 1 (See examples in Section 4) If the signs of the given impulse response sequences change frequently as in the wide-band low-pass filter cases, H 0 H 1 is likely to have more nonzero bits in the coefficients than 

76 946 EURASIP Journal on Applied Signal Processing H 0 H 1 Thus, to achieve minimum hardware cost, it is necessary to select either FFA0 or FFA1 depending upon the frequency spectrum specifications (L = 3) FFAs The (3 3) FFA produces a parallel filtering structure of block size 3 From (2)withL = 3, we have Y 0 = H 0 X 0 z 3( ) H 1 X 2 H 2 X 1, Y 1 = ( ) H 0 X 1 H 1 X 0 z 3 H 2 X 2, Y 2 = H 0 X 2 H 1 X 1 H 2 X 0 Direct implementation of (7)computesablockof3outputs using 9 length N/3 FIR filters and 6 postprocessing additions, whichrequires3n multipliers and 3N 3adders By a similar approach as in (2 2) FFA0, following (3 3) FFA0 is obtained, Y 0 =H 0 X 0 z 3 H 2 X 2 z 3[ ] H 12 X 12 H 1 X 1, Y 1 = [ ] [ H 01 X 01 H 1 X 1 H0 X 0 z 3 ] H 2 X 2, Y 2 =H 012 X 012 [ ] [ ] H 01 X 01 H 1 X 1 H12 X 12 H 1 X 1 (8) Figure 4 shows the filtering structure that results from the (3 3) FFA0 This structure computes a block of 3 outputs using 6 length N/3 FIR filters and 10 preprocessing and postprocessing additions, which requires 6(N/3) multipliers and 6(N/3 1) 10 adders Notice that (3 3) FFA0 structure provides a saving of approximately 33% over the traditional structure The (3 3) FFA1 structure can be obtained by modifying (8) as follows: Y 0 =H 0 X 0 z 3 H 2 X 2 z 3[ ] H 2 1 X 2 1 H 1 X 1, Y 1 = [ ] [ H 0 1 X 0 1 H 1 X 1 H0 X 0 z 3 ] H 2 X 2, Y 2 =H 0 12 X 0 12 [ ] [ ] H 0 1 X 0 1 H 1 X 1 H2 1 X 2 1 H 1 X 1 (9) Figure 5 shows the filtering structure that results from the (3 3) FFA1 We propose the following (3 3) FFA2 structure which is efficient when the coefficient magnitudes of H 0 2 are smaller than those of H 0 12 or H 012, Y 0 = H 0 X 0 z 3( ) H 2 X 2 H 1 X 1 H 2 1 X 2 1, Y 1 = H 0 1 X 0 1 H 1 X 1 H 0 X 0 z 3 H 2 X 2, Y 2 = H 0 2 X 0 2 H 0 X 0 H 1 X 1 H 2 X 2 (7) (10) Figure 6 shows the filtering structure that results from the (3 3) FFA2 23 Cascading FFAs The (2 2) and (3 3) FFAs can be cascaded together to achieve higher levels of parallelism The cascading of FFAs is a straightforward extension of the original FFA application [8] For example, an (m m) FFA can be cascaded with an (n n)ffatoproducean(m n)-parallel filtering structure The set of FIR filters that result from the application of the (m m) FFA are further decomposed, one at a time, by the application of the (n n) FFA The resulting set of filters will be of length N/(m n) For example, the (4 4) FFA can be obtained by first applying the (2 2) FFA0 to (2) and then applying the (2 2) FFA0 or the (2 2) FFA1 to each of the filtering operations that result from the first application of the FFA0 The resulting (4 4) FFA structure is shown in Figure 7 Eachfilter block F 0, F 0 F 1,andF 1 represents a (2 2) FFA structure and can be replaced separately by either (2 2) FFA0 or (2 2) FFA1 Each filter block F 0, F 0 F 1,andF 1 is composed of three subfilters as follows: (i) F 0 : H 0,H 2,H 0 ± H 2, (ii) F 0 F 1 : H 0 H 1,H 2 H 3, (H 0 H 1 ) ± (H 2 H 3 ), (iii) F 1 : H 1,H 3,H 1 ± H 3, where, for FFA0, ±= (11), for FFA1 When the filter block F 0 F 1 is implemented using FFA1 structure, the subfilters are H 01, H 23,andH 01 H 23 Thus, even though FFA1 structure is used for slowly varying impulse response sequences, optimum performance is not guaranteed In this case, better performance can be obtained by using the FFA1 shown in Figure 8 Since the subfilters in FFA1 are H 0 1, H 2 3,andH 0 1 H 2 3, the FFA1 gives smaller number of nonzero bits than FFA1 for the case of slowly varying impulse response sequences Notice that the FFA1 structure can be 
derived by first applying the (2 × 2) FFA1 (instead of the (2 × 2) FFA0) to (2). When the filter block F0 + F1 in Figure 7 is replaced by FFA1′ in Figure 8, it can be shown that the outputs are y(4k), −y(4k + 1), y(4k + 2), and −y(4k + 3).
2.4 Selection of FFA types
For given length-N unit sample response values {h_i} and block size L, the selection of the best FFA type can be roughly determined by comparing the signs of the values in the subfilters H_0, H_1, ..., H_(L-1). For example, in the case of L = 2 and even N, H_0 and H_1 are H_0 = {h_0, h_2, ..., h_(N-2)}, H_1 = {h_1, h_3, ..., h_(N-1)}. (12) From (12), the ith value of H_0 can be paired with the ith value of H_1 as (h_0, h_1), (h_2, h_3), ..., (h_(N-2), h_(N-1)). Comparing the signs of the values in each pair, the number of pairs with opposite signs and the number of pairs with the same signs can be determined. If the number of pairs with opposite signs is larger than the number of pairs with the same signs, H_0 + H_1 is likely to be more efficient than H_0 − H_1. The sign-comparing procedure can be extended to any block size L with appropriate modifications.
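The sign-comparing rule of Section 2.4 is simple enough to state directly in code. The fragment below is a sketch for L = 2 with our own identifiers, not code from the paper: it counts the pairs (h_2i, h_2i+1) with equal and with opposite signs, and predicts FFA1 (difference subfilter H0 − H1) when same-sign pairs dominate, FFA0 (sum subfilter H0 + H1) otherwise.

/* Sign-comparing heuristic of Section 2.4 for block size L = 2 (sketch;
 * identifiers are ours).  h[] holds the length-N unit sample response,
 * N assumed even.  Returns 1 if FFA1 (difference subfilter H0 - H1) is
 * predicted to be cheaper, 0 if FFA0 (sum subfilter H0 + H1) is. */
static int select_ffa_type_L2(const double *h, int N)
{
    int same = 0, opposite = 0;
    for (int i = 0; i + 1 < N; i += 2) {        /* pairs (h[2i], h[2i+1]) */
        double p = h[i] * h[i + 1];
        if (p > 0.0)
            same++;                             /* consecutive samples share a sign */
        else if (p < 0.0)
            opposite++;                         /* consecutive samples differ in sign */
        /* pairs containing a zero-valued sample are not counted */
    }
    return (same > opposite) ? 1 : 0;
}

For the narrowband filter of Example 3 below, this rule counts 16 same-sign pairs against 2 opposite-sign pairs and therefore selects FFA1, in agreement with the nonzero-bit counts reported there.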

77 Frequency Spectrum Based Low-Area Low-Power Parallel FIR Filter Design 947 x(3k) H 0 x(3k 1) H 1 x(3k 2) H 2 D y(3k) H 0 H 1 y(3k 1) H 1 H 2 D H 0 H 1 H 2 y(3k 2) Figure 4: 3-parallel FIR filter using FFA0 x(3k) H 0 x(3k 1) H 1 x(3k 2) H 2 D y(3k) H 0 H 1 y(3k 1) H 2 H 1 D H 0 H 1 H 2 y(3k 2) Figure 5: 3-parallel FIR filter using FFA1 3 LOOK-AHEAD MAD QUANTIZATION It is shown in [5, 6, 7] that if the filter coefficients are first scaled before the quantization process is performed, the resulting filter will have much better frequency-space characteristics The NUS algorithm [6] employs a scalable quantization process To begin the process, the ideal filter is normalized so that the largest coefficient has an absolute value of 1 The normalized ideal filter is then multiplied by a variable scale factor (VSF) The VSF steps through the range of numbers from to 113 with a step size of 2 W,where W is the coefficient word length Signed power-of-two (SPT) terms are then allocated to the quantized filter coefficient that represents the largest absolute difference between the scaled ideal filter and the quantized filter The NUS algorithm iteratively allocates SPT terms until the desired number of SPT terms is allocated or until the desired NPR, normalized peak ripple, specification is met Once the allocation of terms stops, the NPR is calculated The process is then repeated for a new scale factor The quantized filter leading to the minimum NPR is chosen In parallel FIR filters, the NPR cannot be used as a selection criteria for choosing the best quantized filter since passband/stopband ripples cannot be defined for the set of subfilters obtained by the application of FFAs In [8], it is shown that the maximum absolute difference (MAD) between the

78 948 EURASIP Journal on Applied Signal Processing x(3k) H 0 y(3k) x(3k 1) H 1 x(3k 2) H 2 D D H 0 H 1 y(3k 1) H 0 H 2 y(3k 2) H 2 H 1 Figure 6: 3-parallel FIR filter using FFA2 x(4k) x(4k 2) F 0 y(4k) y(4k 2) x(4k)x(4k 1) x(4k 2)x(4k 3) F 0 F 1 y(4k 1) y(4k 3) x(4k 1) x(4k 3) F 1 D Figure 7: 4-parallel FIR filter structure x(4k) x(4k 1) H 0 H 1 y(4k 1) H 0 H 1 H 2 H 3 y(4k 3) x(4k 2) x(4k 3) H 2 H 3 D Figure 8: FFA1 structure frequency responses of the ideal filter and the quantized filter can be used as an efficient selection criteria for parallel filters When the quantized filter is implemented, a postprocessing scale factor (PPSF) is used to properly adjust the magnitude of the filter output The PPSF is calculated as PPSF = Max[Absolute(Ideal Filter Coeffs)] (13) VSF In the cases where large levels of parallelism are used, the PPSFs can contribute to a significant amount of hardware overhead In [8], to reduce this hardware overhead the PPSFs are restricted to the following set of values: {0125, 025, 0375, 05, 0625, 075, 0875, 1} The original PPSF is replaced with the new PPSF that is the nearest in value Since the scale factor of the quantized filter is shifted in value, the quantized coefficients must also be properly shifted
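A small sketch of the scale-factor bookkeeping just described, with our own identifiers: the exact PPSF of (13) and its replacement by the nearest value in the simple set {0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1} used in [8].

#include <math.h>

/* Post-processing scale factor of (13) and its replacement by the nearest
 * "simple" value as in [8] (sketch; identifiers are ours).  coeffs[] are
 * the ideal (unquantized) coefficients of one subfilter and vsf is the
 * variable scale factor applied before quantization. */
static double ppsf_exact(const double *coeffs, int n, double vsf)
{
    double max_abs = 0.0;
    for (int i = 0; i < n; i++)
        if (fabs(coeffs[i]) > max_abs)
            max_abs = fabs(coeffs[i]);
    return max_abs / vsf;                       /* PPSF = max|ideal coeff| / VSF, per (13) */
}

static double ppsf_simple(double ppsf)
{
    /* simple PPSF values used in [8]: multiples of 1/8 up to 1 */
    static const double simple[8] = {0.125, 0.25, 0.375, 0.5, 0.625, 0.75, 0.875, 1.0};
    double best = simple[0];
    for (int i = 1; i < 8; i++)
        if (fabs(simple[i] - ppsf) < fabs(best - ppsf))
            best = simple[i];                   /* nearest simple value */
    return best;
}

In the look-ahead MAD scheme proposed below, the PPSF for each candidate VSF is computed and screened in this way before the coefficients are quantized, which is what avoids re-quantizing coefficients that have already been quantized.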

79 Frequency Spectrum Based Low-Area Low-Power Parallel FIR Filter Design 949 For each filter section in the parallel FIR filter { Normalize the set of filter coefficients so that the magnitude of the largest coefficient is 1; For VSF = Lower Scale:Step Size:Upper Scale, { Compute PPSF by (13); Convert PPSF into Canonic Signed Digit form; If (No of nonzero bits in PPSF) < prespecified value, { Scale normalized coefficients with VSF; Quantize the scaled coefficients using SPT term allocation scheme in NUS algorithm; Calculate MAD between the frequency responses of the ideal and quantized filters; } } Choose scale factor that leads to the minimum MAD; } Magnitude Ideal By [8] Proposed Frequency Algorithm 1: Look-ahead MAD quantization Figure 9: Frequency responses of Example 1 in value This is accomplished using the following three steps: (i) determine effective coefficients, effective coeffs = quantized coeffs PPSF; (ii) determine shifted coefficients with new PPSF, shiftedcoeffs = effectivecoeffs/new PPSF; (iii) quantize the shifted coefficients However, the above steps are not guaranteed to give optimal quantized coefficients for the new PPSF value The reason is that the quantization in (iii) is performed on the already quantized coefficients To avoid this problem, LMAD quantization algorithm is proposed In the proposed algorithm, the PPSF for a given VSF is computed by (13) before the quantization step begins If the number of nonzero bits in the computed PPSF is less than a prespecified value, then the normalized coefficients are scaled by the VSF and the scaled coefficients are quantized Otherwise, the procedure is repeated for the next VSF value In [8], the number of nonzero bits in PPSF is fixed However, in the proposed approach, the number of nonzero bits in PPSF can be varied and the PPSF value giving the best performance can be selected From our simulation experience, increasing the number of nonzero bits in PPSF more than three does not improve the numerical performance significantly Example 1 Consider an ideal filter section with the following coefficients [8]: ideal coeffs ={ } In[8], these coefficients are quantized using word length of 7 bits to the following values by the scalable MAD quantization algorithm: { } with PPSF = 0625 The computed MAD value is For comparison, the ideal coefficients are quantized using the proposed algorithm with PPSF = Table 1: The number of adders by the method used in [8] andby the proposed method The numbers inside parentheses denote the FFA types used for each case 24-Tap FIR 72-Tap FIR By [8] Proposed By[8] Proposed L = 1 56 (0) 49 (0) 125 (0) 96 (0) L = 2 74 (0) 54 (1) 192 (0) 173 (1) L = (0) 99 (1) 293 (0) 272 (1) L = (0-0-0) 123 (0-1-0) 313 (0-0-0) 303 (1-1-1) 0625 The quantized coefficients are { } The computed MAD value is Notice that the MAD value by the proposed method is only 45% of the MAD value in [8] Frequency responses are compared in Figure 9 Table 1 shows that, for the two low-pass FIR filter examples in [8], the proposed method can save up to 24% of adders In [8], only FFA type 0 is used for each value of L However,as can be seen from Table 1, better results are obtained by selecting FFA type(s) properly for each L Example 2 In this example, the hardware saving by the appropriate selection of FFA structures is compared with the hardware saving by the proposed LMAD quantization scheme using a simple low-pass filter with filter order = 7, passband edge = 01π, maximum passband ripple = 002dB, stopband edge = 03π, and minimum 
stopband attenuation = 22 dB. In this example, only block size of 2 (L = 2) is considered. Table 2 shows the filter coefficients obtained by FFA0 without scaling. Table 3 shows the filter coefficients obtained

80 950 EURASIP Journal on Applied Signal Processing Table 2: Filter coefficients (canonic signed digit format) and the number of nonzero bits for FFA0 without scaling (word-length = 8) H 0 H 01 H Coefficients Nonzero bits Table 3: Filter coefficients and the number of nonzero bits for FFA0 with LMAD scaling (word-length = 7) H 0 H 01 H Coefficients Nonzero bits Table 4: Filter coefficients and the number of nonzero bits for FFA1 with LMAD scaling (word-length = 7) H 0 H 0 1 H Coefficients Nonzero bits by FFA0 with LMAD scaling Notice that the filter coefficients by FFA0 with LMAD scaling satisfy the given specifications by word-length of 7 bits while the filter coefficients by FFA0 without scaling require word-length of 8 bits The reduction of the word-length is due to the use of scaling factors The PPSFs for the filter coefficients by FFA0 with LMAD scaling are (H 0 ), (H 01 ), and (H 1 ) Each PPSF contains two nonzero bits, which corresponds to the overhead of one adder Table 4 shows the filter coefficients obtained by FFA1 with LMAD scaling The PPSFs for the filter coefficients by FFA1 with LMAD scaling are 00101(H 0 ), 00001(H 01 ), and 00101(H 1 ) Frequency responses of ideal filter and the filter obtained by FFA1 quantized by LMAD are compared in Figure 10 To compare the hardware savings by the quantization and the proper selection of FFA types, only H 01 or H 0 1 subfilters are considered From Table 2, the number of nonzero bits for H 01 of nonscaled FFA0 filter is 14 while the number of nonzero bits for H 01 of scaled FFA0 filter is 10 (including PPSF) Thus, in addition to the word-length reduction, hardware saving of about 28% can be obtained by LMAD scaling From Table 4, the number of nonzero bits for H 0 1 of scaled FFA1 filter is 7 (including PPSF) Thus, 22% further saving is obtained by the selection of proper filter type Thus, in this example, about half of the saving is due to the LMAD quantization and the other half is due to proper filter type selection 4 DESIGN EXAMPLES In this section, three design examples with various frequency specifications are given Example 3 Consider a narrowband low-pass filter with filter order = 35, passband edge = 02π, maximum passband ripple = 0185 db, stopband edge = 03π, and minimum stopband attenuation = 335dB As can be seen from Figure 11, the signs of the impulse response sequences (designed by the Remez exchange algorithm) change slowly For L = 2, according to the discussions in Section 24, the number of pairs with the same signs is 16, while the number of pairs with the opposite signs is only 2 Thus, FFA1 is more efficient than FFA0 By the LMAD quantization algorithm, the number of nonzero bits required for H 01 is 42 but the number of nonzero bits required for H 0 1 is 24 Thus the hardware cost of H 0 1 is about 57% of the hardware cost of H 01 The frequency responses for L = 2arecomparedin Figure 12 For L = 3, the number of pairs with the same signs in subfilter pairs {H 0,H 1 }, {H 1,H 2 },and{h 02,H 1 } is 28 while the number of pairs with the opposite signs is 8 Also, the number of pairs with the same signs in subfilter pairs {H 0,H 1 }, {H 1,H 2 },and{h 0,H 2 } is 12 Thus, FFA1 is the most efficient For L = 4, the number of pairs with the opposite signs in subfilter pair {H 0,H 2 } is 7 while the number of pairs with

81 Frequency Spectrum Based Low-Area Low-Power Parallel FIR Filter Design Magnitude Magnitude Frequency Frequency Ideal FFA1-LMAD Figure 10: Frequency responses of ideal filter and the filter obtained by FFA1 quantized by LMAD Ideal FFA0 FFA1 Figure 12: Frequency responses of Example Table 5: Total number of nonzero bits for Example 3 with different block size and various structures (word-length = 10) L = 2 L = 3 L = 4 FFA0 FFA1 FFA0 FFA1 FFA2 FFA0-0-0 FFA1-1-1 FFA Figure 11: Ideal impulse response of Example Table 6: Total number of nonzero bits for Example 4 with different block size and various structures (word-length = 9forL = 2and word-length = 10 for L = 3andL = 4) L = 2 L = 3 L = 4 FFA0 FFA1 FFA0 FFA1 FFA2 FFA1-1-1 FFA1-0-1 FFA the same signs is 2 Thus, FFA0 is the most efficient for F 0 The number of pairs with the opposite signs in the subfilter pair {H 1,H 3 } is 7 while the number of pairs with the same signs is 2 Thus, FFA0 is the most efficient for F 1 By a similar procedure, it can be shown that FFA1 is the most efficient choice for F 0 F 1 The design results for L = 2, 3, and 4 are summarized in Table 5ForL = 2andL = 3, about 20% of the hardware can be saved by a proper choice of FFA types However, for L = 4, only 7% of the hardware saving can be achieved by a proper choice of FFA types The main reason is that the correlation of filter coefficients between subfilters is reduced as the block size increases Example 4 Consider a wideband low-pass filter with filter order = 62, passband edge = 08π, maximum passband rip- ple = 027 db, stopband edge = 085π, and minimum stopband attenuation = 325dB As can be seen from Figure 13, the signs of the impulse response sequences change frequently By the sign comparing procedure, the best FFA types are predicted as FFA0 (L = 2), FFA0 (L = 3), and FFA1- FFA1-FFA1 (L = 4) The design results for L = 2, 3, and 4 are summarized in Table 6ForL = 2andL = 3, about 12% 15% of the hardware can be saved by a proper choice of FFA types For L = 4, 4% of the hardware saving can be achieved by a proper choice of FFA types Example 5 Consider a narrow bandpass filter with filter order = 86, passband = 022π 03π, maximum passband ripple = 019 db, stopband = 0 018π, 034π π, and minimum stopband attenuation = 35 db Figure 14 shows

82 952 EURASIP Journal on Applied Signal Processing Figure 13: Ideal impulse response of Example Figure 14: Ideal impulse response of Example 5 the impulse response sequence By the sign-comparing procedure, the best FFA types are predicted as FFA1 (L = 2), FFA2 (L = 3), and FFA0-FFA1 -FFA0 (L = 4) The design results for L = 2, 3, and 4 are summarized in Table 7ForL = 2 and L = 3, about 16% 18% of the hardware can be saved by a proper choice of FFA types For L = 4, 4% of the hardware saving can be achieved by a proper choice of FFA types 5 CONCLUSIONS It has been shown that the hardware cost and power consumption of parallel FIR filters can be reduced significantly by exploiting the frequency spectrum characteristics For example, in narrowband low-pass filters, the signs of consecutive unit sample response values do not change much and therefore their difference (FFA1) can require fewer number of bits than their sum (FFA0) In wideband low-pass filters, the signs of consecutive unit sample response values change frequently and therefore their sum (FFA0) can require fewer number of bits than their difference (FFA1) To determine Table 7: Total number of nonzero bits for Example 5 with different block size and various structures (word-length = 12) L = 2 L = 3 L = 4 FFA0 FFA1 FFA0 FFA1 FFA2 FFA0-0-0 FFA1-1-1 FFA the best FFA type for given impulse response sequence and block size L, a sign-comparing procedure was proposed The usefulness of the proposed sign-comparing procedure was demonstrated by several examples Also, the proposed lookahead MAD quantization algorithm was shown to be very efficient for the implementation of parallel FIR filters Substructure sharing is the process of examining the hardware implementation of the filter coefficients and sharing the hardware units that are common among the filter coefficients Using the substructure sharing techniques in [8], further savings in hardware cost and power consumption can be achieved Developing a similar approach to power reduction of adaptive FIR filters will be an interesting future research Further research needs to be directed towards finite word-length analysis of these low-power parallel FIR filters ACKNOWLEDGMENTS This research was supported in part by Information and Communication Research Institute at Chonbuk National University and by NSF under grant number CCR REFERENCES [1]JWAdamsandANWillsonJr, Someefficient digital prefilter structures, IEEE Trans Circuits and Systems, vol 31, no 3, pp , 1984 [2] J T Ludwig, S H Nawab, and A P Chandrakasan, Lowpower digital filtering using approximate processing, IEEE Journal of Solid-State Circuits, vol 31, no 3, pp , 1996 [3] N Sankarayya, K Roy, and D Bhattacharya, Algorithms for low-power high speed FIR filter realization using differential coefficients, IEEE Trans on Circuits and Systems II: Analog and Digital Signal Processing, vol 44, no 6, pp , 1997 [4] N R Shanbhag and M Goel, Low-power adaptive filter architectures and their application to 5184 Mb/s ATM-LAN, IEEE Trans Signal Processing, vol 45, no 5, pp , 1997 [5] H Samueli, An improved search algorithm for the design of multiplierless FIR filters with powers-of-two coefficients, IEEE Trans Circuits and Systems, vol 36, no 7, pp , 1989 [6] D Li, J Song, and Y C Lim, A polynomial-time algorithm for designing digital filters with power-of-two coefficients, in Proc IEEE International Symposium on Circuits and Systems, vol 1, pp 84 87, Chicago, Ill, USA, May 1993 [7] C-L Chen, K-Y Khoo, and A N Willson Jr, An improved polynomial-time 
algorithm for designing digital filters with power-of-two coefficients, in Proc IEEE International Symposium on Circuits and Systems, vol 1, pp , Seattle, Wash, USA, 30 April 3 May 1995

83 Frequency Spectrum Based Low-Area Low-Power Parallel FIR Filter Design 953 [8] D A Parker and K K Parhi, Low-area/power parallel FIR digital filter implementations, Journal of VLSI Signal Processing, vol 17, no 1, pp 75 92, 1997 [9] M Winzker, Low-power arithmetic for the processing of video signals, IEEE Trans on VLSI Systems, vol 6, no 3, pp , 1998 [10] J-G Chung, Y-B Kim, H-J Jeong, K K Parhi, and Z Wang, Efficient parallel FIR filter implementations using frequency spectrum characteristics, in Proc IEEE International Symposium on Circuits and Systems, vol 5, pp , Monterey, Calif, USA, 31 May 3 June 1998 [11] Z J Mou and P Duhamel, Short-length FIR filters and their use in the fast nonrecursive filtering, IEEE Trans Signal Processing, vol 39, no 6, pp , 1991 Jin-Gyun Chung received his BS degree in electronic engineering from Chonbuk National University, Chonju, South Korea, in 1985 and the MS and PhD degrees in electrical engineering from the University of Minnesota, Minneapolis, Minnesota, in 1991 and 1994, respectively Since 1995, he has been with the Department of Electronic and Information Engineering at Chonbuk National University, where he is currently Associate Professor His research interests are in the area of VLSI architectures and algorithms for signal processing and communication systems, which include the design of high-speed and lowpower algorithms for digital filters, DSL systems, OFDM systems, and ultrasonic NDE systems Keshab K Parhi is a distinguished McKnight University Professor of Electrical and Computer Engineering at the University of Minnesota, Minneapolis, where he also holds the Edgar F Johnson Professorship He received the BTech, MSEE, and PhD degrees from the Indian Institute of Technology, Kharagpur (India, 1982), the University of Pennsylvania, Philadelphia (1984), and the University of California at Berkeley (1988), respectively His research interests include all aspects of physical layer VLSI implementations of broadband access systems He is currently working on VLSI adaptive digital filters, equalizers and beamformers, error control coders and cryptography architectures, low-power digital systems, and computer arithmetic He has published over 330 papers in these areas He has authored the text book VLSI Digital Signal Processing Systems (Wiley, 1999) and coedited the reference book Digital Signal Processing for Multimedia Systems (Dekker, 1999) Dr Parhi has been a Visiting Professor at the Delft University of Technology and at Lund University He has been a Visiting Researcher at the NEC Corporation, Japan, and a Technical Director DSP Systems in the Office of CTO at Broadcom Corporation in Irvine, Calif Dr Parhi has served on editorial boards of the IEEE Transactions on Circuits and Systems, Signal Processing, Circuits and Systems Part II: Analog and Digital Signal Processing, VLSI Systems, and the IEEE Signal Processing Letters He is an editor of the Journal of VLSI Signal Processing He has received numerous best paper awards including the 2001 IEEE W R G Baker prize paper award He has been a Distinguished Lecturer ( ) of and a recipient of a Golden Jubilee medal (1999) from the IEEE Circuits and Systems Society He received a Young Investigator Award from the National Science Foundation in 1992, and was elected a Fellow of IEEE in 1996

84 EURASIP Journal on Applied Signal Processing 2002:9, c 2002 Hindawi Publishing Corporation Low-Complexity Versatile Finite Field Multiplier in Normal Basis Hua Li Department of Mathematics and Computer Science, University of Lethbridge, Lethbridge, Alberta, Canada T1K 3M4 huali@csulethca Chang Nian Zhang Department of Computer Science, TRLabs, University of Regina, Regina, SK, Canada S4S 0A2 zhang@csureginaca Received 6 August 2001 and in revised form 30 August 2002 A low-complexity VLSI array of versatile multiplier in normal basis over GF(2 n ) is presented The finite field parameters can be changed according to the user s requirement and make the multiplier reusable in different applications It increases the flexibility to use the same multiplier for different applications and reduces the user s cost The proposed multiplier has a regular structure and is very suitable for high speed VLSI implementation In addition, the pipeline versatile multiplier can be modified to a low-cost architecture which is feasible in embedded systems and restricted computing environments Keywords and phrases: finite field multiplication, Massey-Omura multiplier, normal basis, VLSI, encryption 1 INTRODUCTION The finite fields GF(2 n )ofcharacteristic2areofgreatinterest for cryptosystems and digital signal processing The additionoperationingf(2 n ) is fast and inexpensive as it can be realized with n bitwise XOR operations The multiplication operation is costly in terms of gate number and time delay There have been three main kinds of basis representations of the field elements in GF(2 n ): standard (canonical, polynomial) basis, dual basis, and normal basis Different basis representation multipliers have their own benefits and tradeoffs The dual basis multiplier [1] needs the least number of gates which leads to the smallest area required for VLSI implementation [2] The normal basis multiplier, for example, Massey-Omura multiplier [3], is very effectivein performing squaring, exponentiation, and inversion operation The standard basis multiplier [4, 5, 6, 7] iseasiertoextendtohighorder finite fields than the dual or normal basis multipliers Most of the proposed finite field multipliers operate over a fixed field In other words, a new multiplier is needed if there is a change in the field parameters such as the irreducible polynomial defining the representation of the field elements This makes the multiplier not reusable There are few versatile multipliers [4, 6, 8, 9] reported and all based on canonical basis In this paper, we present a new VLSI array of versatile pipeline multiplier based on the normal basis representation In normal basis, the squaring is a cost-free cyclic shift operation and the inversion (the most complicated operation among the important finite field arithmetic operations) can be effectively computed by Fermat s theorem which requires recursive squaring and multiplication [10, 11] Three main advantages accrue from the proposed pipelined versatile multiplier First, the finite field parameters can be changed according to the application environments It increases the flexibility to use the same multiplier for different applications Secondly, the structure of the multiplier can be easily extended to higher-order finite fields Thirdly, the basic architecture of the proposed multiplier can be modified to a low-cost multiplier which is very suitable for both embedded systems and wireless devices with restricted hardware resources Moreover, the structure of the multiplier has the properties of 
modularity, simplicity, and regular interconnection, and is easy for VLSI implementation. The proposed versatile multiplier can be used efficiently in public-key cryptosystems, such as elliptic curve cryptography, and in digital signal processing, for example, in the Reed-Solomon encoder/decoder. The outline of the remainder of the paper is as follows. In Section 2, we briefly review the normal basis representation and the Massey-Omura multiplier. Section 3 contains the derivation of the pipeline versatile normal basis multiplier in GF(2^n) and a comparison with previous works. Section 4 concludes with the improved result and a description of areas of applications.
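As an aside on the inversion route mentioned above: because squaring in normal basis is a cost-free cyclic shift, Fermat's theorem gives a^(-1) = a^(2^n - 2) = a^2 · a^4 · ... · a^(2^(n-1)), so an inverter needs only a multiplier plus rotations. The following C sketch assumes some normal-basis multiplier nb_mul(), for instance the one derived in Section 3; that function, the identifiers, and the field size are illustrative assumptions, not definitions from the paper.

#include <string.h>

#define NBITS 5                                  /* field size n, e.g. GF(2^5) */

/* Assumed, not defined here: any GF(2^n) normal-basis multiplier,
 * e.g. the versatile multiplier of Section 3. */
void nb_mul(const int a[NBITS], const int b[NBITS], int c[NBITS]);

/* Squaring in normal basis is a cyclic shift of the coefficient vector:
 * a^2 = (a_{n-1}, a_0, ..., a_{n-2}). */
static void nb_square(const int a[NBITS], int s[NBITS])
{
    s[0] = a[NBITS - 1];
    for (int i = 1; i < NBITS; i++)
        s[i] = a[i - 1];
}

/* Inversion by Fermat's theorem (sketch):
 * a^{-1} = a^{2^n - 2} = a^2 * a^4 * ... * a^{2^{n-1}}. */
static void nb_inverse(const int a[NBITS], int inv[NBITS])
{
    int pow2[NBITS], tmp[NBITS];
    nb_square(a, pow2);                          /* pow2 = a^2 */
    memcpy(inv, pow2, sizeof(pow2));             /* running product = a^2 */
    for (int i = 2; i < NBITS; i++) {
        nb_square(pow2, tmp);                    /* tmp = a^{2^i} (one more rotation) */
        memcpy(pow2, tmp, sizeof(tmp));
        nb_mul(inv, pow2, tmp);                  /* multiply a^{2^i} into the product */
        memcpy(inv, tmp, sizeof(tmp));
    }
}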

85 Low-Complexity Versatile Finite Field Multiplier in Normal Basis MULTIPLICATION ON GF(2 n ) It has been proved that there always exists a normal basis [12] for a given finite field GF(2 n ) which is the form of N = { β, β 2,β 22,,β 2n 1}, (1) where β is a root of the irreducible polynomial P(x)ofdegree n over GF(2) and n elements of the set are linearly independent We say that β generates the normal basis N,orβ is a normal element of GF(2 n ) Every element a GF(2 n )canbe represented by a = n 1 i=0 a i β 2i,wherea i {0, 1} The following properties [10] of a finite field GF(2 n )are useful in the applications (1) Squaring is a linear operation, that is, given any two elements a and b in GF(2 n ), (2) For any element a GF(2 n ), (3) For any element a GF(2 n ), (a b) 2 = a 2 b 2 (2) a 2n = a (3) 1 = a a 2 a 4 a 2n 1 (4) This implies that the normal basis representation of 1 is (1, 1,,1) (4) Squaring an element a in the normal basis representation is a cyclic shift operation, that is, n 1 a 2 = a i β 2i1 i=0 n 1 = i=0 a i 1 β 2i = ( a n 1,a 0,,a n 2 ) with indices reduced modulo n Let a and b be two arbitrary elements in GF(2 n )inanormal basis representation and c = a b be the product of a and bwedenotea = n 1 i=0 a i β 2i as a vector a = (a 0,a 1,,a n 1 ), b = n 1 i=0 b i β 2i as a vector b = (b 0,b 1,,b n 1 ), and c = n 1 i=0 c i β 2i as a vector c = (c 0,c 1,,c n 1 ), then the last term c n 1 of c is a logic function of the components of a and b, that is, (5) c n 1 = f ( a 0,a 1,,a n 1 ; b 0,b 1,,b n 1 ) (6) Since squaring in normal representation is a cyclic shift of the element, we have c 2 = a 2 b 2 or equivalently ( cn 1,c 0,c 1,,c n 2 ) = ( a n 1,a 0,a 1,,a n 2 ) ( bn 1,b 0,b 1,,b n 2 ) (7) Hence, the last component c n 2 of c 2 can be obtained by the same function f operating on the components of a 2 and b 2 That is, c n 2 = f ( a n 1,a 0,a 1,,a n 2 ; b n 1,b 0,b 1,,b n 2 ) (8) By squaring c repeatedly, we get c n 1 = f ( ) a 0,a 1,,a n 1 ; b 0,b 1,,b n 1, c n 2 = f ( ) a n 1,a 0,a 1,,a n 2 ; b n 1,b 0,b 1,,b n 2, c 0 = f ( ) a 1,a 2,,a n 1,a 0 ; b 1,b 2,,b n 1,b 0 Equations 9define the Massey-Omura multiplier in normal basis representation [10] In Massey-Omura multiplier, the same logic function f for computing the last component of c n 1 of the product c can be used to get the remaining components c n 2,c n 3,,c 0 of the product sequentially In parallel architecture, we can use n identical logic function f for calculating all components of the product simultaneously 3 A PIPELINE ARCHITECTURE FOR THE SERIAL VERSATILE NORMAL BASIS MULTIPLIER In this section, we derive a pipeline architecture to implement the versatile normal basis multiplier Let c be the product of a and b, n 1 c = i=0 n 1 j=0 In the normal basis, we have Thus, we can get n 1 β 2i β 2j = λ (k) ij β 2k, n 1 c k = i=0 k=0 n 1 j=0 (9) a i b j β 2i β 2j (10) λ (k) ij GF(2) (11) λ (k) ij a i b j, 0 k n 1 (12) From the above analysis, we see that the important issue for building a versatile normal basis multiplier is to get the value of λ (k) i,j for different irreducible polynomials The n n matrices λ (k) (0 k n 1) whose elements is λ (k) i,j (0 i, j n 1) can be obtained if we know the transformation between the elements of the canonical basis and the elements of the normal basis, that is, the normal basis representation of the elements of the canonical basis In the following, we define the multiplication table of the normal basis and use the basis element transformation formula to get the values of the multiplication 
table, and then obtain the n × n matrices λ^(k). Finally, we illustrate the approach to build the versatile pipeline normal basis multiplier.
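Before the tables are derived, the evaluation order implied by (9) can be summarized in a few lines of C. This is a sketch with our own identifiers: a single boolean function f, realized for example from the λ^(n−1) matrix of (12), produces the last product bit, and every remaining bit is obtained by applying the same f to cyclically shifted operands.

#define N 5                                       /* word length n, e.g. GF(2^5) */

/* Assumed, not defined here: f realizes c_{n-1} = f(a; b) for the chosen
 * irreducible polynomial, e.g. built from the lambda^{(n-1)} matrix of (12). */
int f(const int a[N], const int b[N]);

/* Serial Massey-Omura evaluation order of (9) (sketch; identifiers are ours).
 * The operands are rotated in place; after n steps they are back in their
 * original state. */
static void massey_omura_serial(int a[N], int b[N], int c[N])
{
    for (int k = N - 1; k >= 0; k--) {
        c[k] = f(a, b);                           /* same function for every bit */
        int at = a[N - 1], bt = b[N - 1];
        for (int i = N - 1; i > 0; i--) {         /* rotate both operands: a <- a^2 */
            a[i] = a[i - 1];
            b[i] = b[i - 1];
        }
        a[0] = at;
        b[0] = bt;
    }
}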

86 956 EURASIP Journal on Applied Signal Processing Definition 1 Let N ={β, β 2,,β 2n 1 } be a normal basis in GF(2 n ), then for any i, j (0 i, j n 1), β 2i β 2j is a linear combination of β, β 2,,β 2n 1 with coefficients in GF(2) In particular, β β β 2 β 2 β = T, (13) β 2n 1 β 2n 1 where T is an n n matrix over GF(2) We call T the multiplication table of the normal basis N The number of nonzero entries in T is called the complexity of the normal basis N, denoted by C N There always exists the multiplication table T and the matrix λ (k) for a given irreducible polynomial which defines the normal basis in GF(2 n )[12] After the multiplication table T is obtained, the matrix λ (k) can be calculated according to (12) An example is shown below Example 1 Let the irreducible polynomial be P 1 (x) = x 5 x 4 x 2 x 1andβ be a root of the polynomial, then the canonical basis is {1,β,β 2,β 3,β 4 } and the normal basis is {β, β 2,β 4,β 8,β 16 } We can get the following normal basis representation for the elements of the canonical basis: 1 = β β 2 β 4 β 8 β 16, β = β, β 2 = β 2, β 3 = β β 8, β 4 = β 4 (14) The appendix illustrates how to obtain the normal basis representation of β 3 Thus the element β i (i>5) can be reduced to the representation of canonical basis and converted to the corresponding representation of normal basis by the base element transformation formula (14) For instance, β 17 = 1β 2 β 3 = 1β 2 ( β β 8) = β 16 β 4 (15) Then we can get the multiplication table T for given P 1 (x)whichis T = , β β (16) β 2 β 2 β β 4 = T β β 8 4 β 8 β 16 β 16 The product of a and b is c = ab = c 0 β c 1 β 2 c 2 β 4 c 3 β 8 c 4 β 16 = ( a 0 β a 1 β 2 a 2 β 4 a 3 β 8 a 4 β 16) ( b 0 β b 1 β 2 b 2 β 4 b 3 β 8 b 4 β 16) = a 0 b 0 β 2 a 0 b 1 β 3 a 0 b 2 β 5 a 0 b 3 β 9 a 0 b 4 β 17 a 1 b 0 β 3 a 1 b 1 β 4 a 1 b 2 β 6 a 1 b 3 β 10 a 1 b 4 β 18 a 2 b 0 β 5 a 2 b 1 β 6 a 2 b 2 β 8 a 2 b 3 β 12 a 2 b 4 β 20 a 3 b 0 β 9 a 3 b 1 β 10 a 3 b 2 β 12 a 3 b 3 β 16 a 3 b 4 β 24 a 4 b 0 β 17 a 4 b 1 β 18 a 4 b 2 β 20 a 4 b 3 β 24 a 4 b 4 β 32 (17) As β 6 = (β 3 ) 2, β 10 = (β 5 ) 2, β 18 = (β 9 ) 2, β 12 = (β 6 ) 2, β 20 = (β 5 ) 4, β 24 = (β 3 ) 8, β 32 = β, we can easily obtain these elements normal basis representation by cost-free cyclic shift operation on the row of the multiplication table T and get the matrix λ (4) which leads to the function f to compute the coefficient of c λ (4) = (18) It can be readily seen that the matrices λ (k) (0 k n 1) are symmetric From the matrix λ (4), we can get the following logic function to compute the most significant bit of the product of ab in GF(2 5 ) defined on the irreducible polynomial P 1 (x) 4 4 c 4 = λ (4) ij a i b j i=0 j=0 = a 0 b 2 a 2 b 0 a 0 b 4 a 4 b 0 a 1 b 2 a 2 b 1 a 1 b 3 a 3 b 1 a 3 b 3 (19) In the normal basis representation, the logic function f = (a 0,a 1,,a n 1 ; b 0,b 1,,b n 1 ) which is used to get the most significant bit (c n 1 ) of the product can also be used to get the remaining bits (c n 2,c n 3,,c 0 ) of the product, except we cyclically shift the input of the function [10] Thus, we may choose one matrix from the matrices λ (k) (0 k n 1) and input the values of upper triangle of the symmetric matrix for doing the multiplication A VLSI array architecture to implement the versatile GF(2 n ) normal basis multiplier is proposed and illustrated in Figures 1 and 2 The basic cells in the structure are 3-input

87 Low-Complexity Versatile Finite Field Multiplier in Normal Basis 957 λ 0,0 λ 0,1 λ 0,2 λ 0,n 2 λ 0,n 1 a 0 AND AND AND AND AND XOR XOR XOR a 1 λ 1,0 λ 1,1 λ 1,2 λ 1,n 2 λ 1,n 1 AND AND AND AND AND Buffer XOR XOR XOR λ n 2,0 λ n 2,1 λ n 2,2 λ n 2,n 2 λ n 2,n 1 a n 2 AND AND AND AND AND XOR XOR XOR λ n 1,0 λ n 1,1 λ n 1,2 λ n 1,n 2 λ n 1,n 1 a n 1 AND AND AND AND AND a XOR XOR XOR b 0 b 1 b 2 b n 2 b n 1 b Buffer XOR 2-input XOR gate Figure 1: The logic circuit of the AND gate plane in the versatile multiplier a c i b AND gate array XOR gate tree Figure 2: The architecture of the serial versatile normal basis GF(2 n ) multiplier AND gates and 2-input XOR gates We use the 3-input AND gates to compute a i b j λ (n 1) i,j in the X-Y dimension, and compute the sum of a i b j λ (n 1) i,j by a binary tree structure of 2-input XOR gates in the Z dimension The architecture requires n 2 3-input AND gates and n input XOR gates, the time delay for generating one bit of the product is T AND3 2( log 2 n )T XOR,whereT AND3 is the time delay of a 3-input AND gate and T XOR is the time delay of a 2-input XOR gate We can get all bits of the product by cyclically shifting the input coefficients of a and b As the irreducible polynomial is not changed frequently as the multiplicands, we can store the elements of the matrix λ (n 1) in the registers once the irreducible polynomial has been decided The algorithm for this multiplication can be described as follows

88 958 EURASIP Journal on Applied Signal Processing λ n 1,0 λ 1,0 λ 0,0 λ n 1,1 λ 1,1 λ 0,1 λ n 1,2 λ 1,2 λ 0,2 λ n 1,n 2 λ 1,n 2 λ 0,n 2 λ n 1,n 1 λ 1,n 1 λ 0,n 1 a n 1 a 1 a 0 b 0 AND b 1 AND b 2 AND b n 2 AND b n 1 AND XOR XOR XOR XOR XOR XOR XOR c i Figure 3: A low-cost architecture of the serial versatile normal basis GF(2 n ) multiplier Algorithm 1 (versatile normal basis multiplication in GF(2 n )) Input: Coefficients of a, b, and the matrix of λ (n 1) Output: c = ab Begin load matrix λ (n 1) for k = n 1to0do begin c k = n 1 i=0 n 1 j=0 a i b j λ (n 1) ij ; cyclic shift the coefficients of a and b; end; End The proposed architecture can be implemented by a pipeline structure In the first n clock cycles, the coefficients of a and b are fed sequentially into the buffers In the following n clock cycles, we will get the result of the product by cyclically shifting the registers which store the original coefficients of a and b In the meantime, the next two multiplicands can be fed into the buffers during these clock cycles and we can compute the second product immediately just after we finish the first one In the restricted computing environment, we can iterate using one level components of the proposed multiplier (Figure 2) to obtain a low-cost serial architecture as illustrated in Figure 3 to implement the same computation It can be described by the following algorithm Algorithm 2 (low-cost serial versatile normal basis multiplication in GF(2 n )) Input: Coefficients of a, b, and the matrix of λ (n 1) Output: c = ab Begin for k = n 1to0do begin ck 0 = 0; for i = 0ton 1 ck i1 = ck i n 1 j=0 a i b j λ (n 1) ij ; cyclic shift the coefficients of a and b; end; End The low-cost versatile normal basis multiplier in GF(2 n ) requires n 3-input AND gates and n 2-input XOR gates The time delay for generating one bit of the product is n(t AND3 ( log 2 n 1)T XOR ) The proposed versatile normal basis multipliers have modular structures, regular interconnections which are suitable for high speed or restricted space of VLSI implementations Table 1 lists the comparison of space and time complexity between our new multipliers and previous works The input ports of the proposed versatile multiplier are almost the same as the nonversatile multiplier, since the finite field parameters can be configured into the multiplier by the input ports of multiplicands (a and b) through a one-bit control signal at the configuration time The finite field parameters do not need reconfiguration during the running time of the multiplier, until the application environments are changed Thus the hardware cost can be greatly reduced compared to the nonversatile multiplier where a new multiplier has to be redesigned and implemented when the finite field parametersarerequiredtobechanged
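To make Algorithms 1 and 2 concrete, the λ^(4) matrix for GF(2^5) with P1(x) = x^5 + x^4 + x^2 + x + 1 can be read off from (19), and the row-by-row accumulation of Algorithm 2 can be written directly in C. This is a behavioural sketch with our own identifiers, not the authors' implementation; because the all-ones vector represents 1 in normal basis (property (3) of Section 2), multiplying any element by it must return that element, which provides a quick self-check.

#include <stdio.h>

#define N 5

/* lambda^{(4)} for GF(2^5), P1(x) = x^5 + x^4 + x^2 + x + 1, read off from
 * c4 = a0b2 + a2b0 + a0b4 + a4b0 + a1b2 + a2b1 + a1b3 + a3b1 + a3b3  (19). */
static const int lambda4[N][N] = {
    {0, 0, 1, 0, 1},
    {0, 0, 1, 1, 0},
    {1, 1, 0, 0, 0},
    {0, 1, 0, 1, 0},
    {1, 0, 0, 0, 0},
};

/* Low-cost serial multiplication in the spirit of Algorithm 2: one product
 * bit per outer step, accumulated row by row, both operands cyclically
 * shifted between steps (they are restored after n steps). */
static void nb_mul_serial(int a[N], int b[N], int c[N])
{
    for (int k = N - 1; k >= 0; k--) {
        int acc = 0;
        for (int i = 0; i < N; i++) {            /* accumulate one row at a time */
            int row = 0;
            for (int j = 0; j < N; j++)
                row ^= lambda4[i][j] & a[i] & b[j];
            acc ^= row;
        }
        c[k] = acc;
        int at = a[N - 1], bt = b[N - 1];        /* cyclic shift of both operands */
        for (int i = N - 1; i > 0; i--) { a[i] = a[i - 1]; b[i] = b[i - 1]; }
        a[0] = at; b[0] = bt;
    }
}

int main(void)
{
    int a[N]   = {1, 0, 1, 1, 0};                /* arbitrary field element */
    int one[N] = {1, 1, 1, 1, 1};                /* (1,1,...,1) represents 1, property (3) */
    int c[N];
    nb_mul_serial(a, one, c);
    for (int i = 0; i < N; i++) printf("%d", c[i]);
    printf("\n");                                /* should print 10110, i.e. a * 1 = a */
    return 0;
}

Running the fragment should print 10110, that is, a · 1 = a, confirming that the λ^(4) entries recovered from (19) and the cyclic-shift schedule are consistent.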

89 Low-Complexity Versatile Finite Field Multiplier in Normal Basis 959 Table 1: Comparison of versatile multipliers with nonversatile multipliers in GF(2 n ) Multiplier Type # XOR Gates # AND Gates Time Delay Wang-MOM [10] Nonversatile 2n 2 2n 1 n(t AND ( log 2 n 1)T XOR ) Li-CVM [9] (canonical basis) Versatile 2n 2 2n 2 n(t AND 2T XOR ) Prop multiplier (Figure 2) Versatile n 2 1 n 2 (3-input) n(t AND3 2 log 2 n T XOR ) Prop low-cost multiplier (Figure 3) Versatile n n(3-input) n 2 (T AND3 ( log 2 n 1)T XOR ) Moreover, the proposed architecture in GF(2 n )canbe easily expanded to the finite field of GF(2 2n ) The one solution is to use two basic GF(2 n ) architecture to implement the multiplication in GF(2 2n ) and another alternative solution is to do the GF(2 2n ) multiplication serially by using only one basic GF(2 n )architecture 4 CONCLUSION In this paper, the architectures for finite field multiplication based on normal basis have been proposed The architectures require simple control signals and have regular local interconnections As a consequence, they are very suitable for VLSI implementation The versatile property of this VLSI array modular multiplier increases the application range and the same multiplier can be applied for different application environments, such as elliptic curve cryptosystems and Reed- Solomon encoder/decoder The proposed multiplier can be easily extended to high order of n for more security Moreover, the structures can be modified to make fast exponentiation and inversion Also note that we can make a lowcost and space efficient serial multiplier which is feasible in the restricted computing environments and embedded systems APPENDIX Let the irreducible polynomial be P 1 (x) = x 5 x 4 x 2 x 1andletβ be a root of the polynomial We show the procedures of computing the multiplication table T and the matrix λ (4) As β is a root of the P 1 (x), β 5 = β 4 β 2 β 1, β 6 = β 5 β = β 5 β 3 β 2 β = β 4 β 2 β 1β 3 β 2 β = β 4 β 3 1 We multiply β 2 to both sides of (A2), and get From (A3), β 8 = β 6 β 5 β 2 β 6 = β 8 β 5 β 2 (A1) (A2) (A3) (A4) As 1 = β 16 β 8 β 4 β 2 β (A5) Substitute (A5) into (A1), β 5 = β 4 β 2 β β 16 β 8 β 4 β 2 β = β 16 β 8 Substitute (A6) into (A4), From (A2), we get β 6 = β 8 β 5 β 2 = β 8 β 16 β 8 β 2 = β 16 β 2 β 3 = β 6 β 4 1 Substitute (A7) and(a5) into (A8), β 3 = β 16 β 2 β 4 β 16 β 8 β 4 β 2 β = β 8 β REFERENCES (A6) (A7) (A8) (A9) [1] E R Berlekamp, Bit-serial Reed-Solomon encoders, IEEE Transactions on Information Theory, vol 28, no 6, pp , 1982 [2] ISHsu,TKTruong,LJDeutsch,andISReed, Acomparison of VLSI architecture of finite field multipliers using dual, normal, or standard bases, IEEE Trans on Computers, vol 37, no 6, pp , 1988 [3] J L Massey and J K Omura, Computational method and apparatus for finite field arithmetic, US Patent application, 1981 [4] B A Laws Jr and C K Rushforth, A cellular-array multiplier for GF(2 m ), IEEE Trans on Computers, vol 20, no 12, pp , 1971 [5] PAScott,SETarvares,andLEPeppard, AfastVLSImultiplier for GF(2 m ), IEEE Journal on Selected Areas in Communications, vol 4, pp 62 66, January 1986 [6] L Song and K Parhi, Low-energy digit-serial/parallel finite field multipliers, Journal of VLSI Signal Processing, vol 19, no 2, pp , 1998 [7] S K Jain, L Song, and K K Parhi, Efficient semisystolic architectures for finite-field arithmetic, IEEE Trans on VLSI Systems, vol 6, no 1, pp , 1998 [8] M A Hasan and A G Wassal, VLSI algorithms, architectures and implementation of a versatile GF(2 m ) processor, IEEE Trans 
on Computers, vol 49, no 10, pp , 2000

90 960 EURASIP Journal on Applied Signal Processing [9] H Li and C N Zhang, Efficient cellular automata based versatile modular multiplier for GF(2 m ), to appear in Journal of Information Science and Engineering [10]CCWang,TKTruong,HMShao,LJDeutsch,JK Omura, and I S Reed, VLSI architectures for computing multiplications and inverses in GF(2 m ), IEEE Trans on Computers, vol 34, no 8, pp , 1985 [11] G Feng, A VLSI architecture for fast inversion in GF(2 m ), IEEE Trans on Computers, vol 38, no 10, pp , 1989 [12] A J Menezes, Applications of Finite Fields, KluwerAcademic Publishers, Boston, Mass, USA, 1993 Hua Li received his BE and MS degrees from Beijing Polytechnic University and Peking University He is a PhD candidate in the Department of Computer Science, University of Regina Currently, he works as an assistant professor at Department of Mathematics and Computer Science, University of Lethbridge, Canada His research interests include parallel systems, reconfigurable computing, fault-tolerant, VLSI design, and information and network security He is a member of IEEE Chang Nian Zhang received his BS degree in applied mathematics from University of Science Technology, China, and the PhD degree in computer science and engineering from Southern Methodist University In 1998, he joined Concordia University as a research assistant professor in Department of Computer Science Since 1990, he has been with University of Regina, Canada, in Department of Computer Science Currently he is a full professor and leads a research group in parallel processing, data security, and neural networks
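The relations computed in the appendix above can be checked mechanically with polynomial arithmetic over GF(2). The following is a minimal C sketch, assuming β = x is taken as a root of P1(x) = x^5 + x^4 + x^2 + x + 1; the polynomial-basis representation is used purely as a checking device and is independent of the proposed normal-basis multiplier architecture.

/* Check of the appendix identities in GF(2^5) with P1(x) = x^5 + x^4 + x^2 + x + 1.
 * Elements are held in polynomial basis as 5-bit words (bit i = coefficient of x^i). */
#include <assert.h>
#include <stdio.h>

#define P1 0x37u                      /* x^5 + x^4 + x^2 + x + 1 */

static unsigned gf_mul(unsigned a, unsigned b)
{
    unsigned p = 0;
    for (int i = 0; i < 5; i++)       /* schoolbook multiply over GF(2) */
        if (b & (1u << i))
            p ^= a << i;
    for (int i = 8; i >= 5; i--)      /* reduce modulo P1(x) */
        if (p & (1u << i))
            p ^= P1 << (i - 5);
    return p & 0x1fu;
}

static unsigned gf_pow(unsigned a, unsigned e)
{
    unsigned r = 1;
    while (e--)
        r = gf_mul(r, a);
    return r;
}

int main(void)
{
    unsigned b  = 0x02;                                 /* beta = x, a root of P1 */
    unsigned b2 = gf_pow(b, 2), b4 = gf_pow(b, 4);
    unsigned b8 = gf_pow(b, 8), b16 = gf_pow(b, 16);

    assert((b16 ^ b8 ^ b4 ^ b2 ^ b) == 1);              /* (A5) */
    assert(gf_pow(b, 5) == (b16 ^ b8));                 /* (A6) */
    assert(gf_pow(b, 6) == (b16 ^ b2));                 /* (A7) */
    assert(gf_pow(b, 3) == (b8 ^ b));                   /* (A9) */
    printf("all appendix identities hold\n");
    return 0;
}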

91 EURASIP Journal on Applied Signal Processing 2002:9, c 2002 Hindawi Publishing Corporation A Reduced-Complexity Fast Algorithm for Software Implementation of the IFFT/FFT in DMT Systems Tsun-Shan Chan VXIS Technology Corporation, Hsin-chu, Taiwan, ROC tscan@vxiscom Jen-Chih Kuo Department of Electrical Engineering, Graduate Institute of Electronics Engineering, National Taiwan University, Taipei 106, Taiwan, ROC jj@accesseentuedutw An-Yeu (Andy) Wu Department of Electrical Engineering, Graduate Institute of Electronics Engineering, National Taiwan University, Taipei 106, Taiwan, ROC andywu@cceentuedutw Received 31 August 2001 and in revised form 15 May 2002 The discrete multitone (DMT) modulation/demodulation scheme is the standard transmission technique in the application of asymmetric digital subscriber lines (ADSL) and very-high-speed digital subscriber lines (VDSL) Although the DMT can achieve higher data rate compared with other modulation/demodulation schemes, its computational complexity is too high for costefficient implementations For example, it requires 512-point IFFT/FFT as the modulation/demodulation kernel in the ADSL systems and even higher in the VDSL systems The large block size results in heavy computational load in running programmable digital signal processors (DSPs) In this paper, we derive computationally efficient fast algorithm for the IFFT/FFT The proposed algorithm can avoid complex-domain operations that are inevitable in conventional IFFT/FFT computation The resulting software function requires less computational complexity We show that it acquires only 17% number of multiplications to compute the IFFT and FFT compared with the Cooly-Tukey algorithm Hence, the proposed fast algorithm is very suitable for firmware development in reducing the MIPS count in programmable DSPs Keywords and phrases: FFT, IFFT, DMT, software implementation 1 INTRODUCTION Recent progress of Internet access has a strong demand on high-speed data transmission To overcome the transmission bottleneck over the conventional twisted-pair telephone lines, several sophisticated modulation/demodulation schemes have been proposed, including carrierlessamplitude-phase (CAP) modulation [1], discrete multitone modulation (DMT) [2, 3, 4, 5] andqamtechnology[6] Among these advanced modulation schemes, the DMT can achieve highest transmission rate since it incorporates lots of advanced DSP techniques such as dynamic bit allocation, multidimensional tone encoding, frequency-domain equalization, and so forth As a consequence, the DMT has been chosen as the physical layer transmission standard by the ADSL standardization committee One major disadvantage of the DMT scheme is its high computational complexity In particular, the large block size of the IFFT/FFT consumes lots of computing power in running programmable DSPs [7] In [8], we have considered a cost-efficient lattice VLSI architecture to realize the IFFT/FFT in integrated circuits In this paper, we propose computationally efficient fast algorithms to run the IFFT/FFT function in software implementation such as programmable DSP processors (DSPs) By making use of the symmetric/antisymmetric properties of the Fourier transform, we first decompose the IFFT/FFT into a combination of two new real-domain transform kernels the Modified DCT and Modified DST These two transform functions are used to replace the complex-domain IFFT/FFT Then we employ the divide-and-conquer approach in [9]toderivenovel recursive algorithms and butterfly architectures for the 
modified DCT and modified DST

92 962 EURASIP Journal on Applied Signal Processing Modulator Demodulator X(0) X(1) X(2) X(0) X(1) X(2) X(0) X(1) X(2) X(0) X(1) X(2) Encoded complex symbols (from encoder) Demodulated complex symbols (to decoder) X(N 1) X(N) 2N-point IFFT Parallel/Serial y(n) Channel ỹ(n) Serial/Parallel 2N-point FFT X(N 1) Conjugate Discard X(2N 2) X(2N 1) X(2N 2) X(2N 1) Figure 1: The IFFT/FFT block diagram in the DMT system The new scheme can avoid redundant complex-domain of the IFFT/FFT That is, it involves only real-valued operations to compute the IFFT/FFT Hence, we can avoid the special data structure in software programming to run complexdomain addition/multiplication operations in computing the IFFT/FFT In addition, our analysis shows that we need only 17% and multiplications in computing the IFFT and FFT compared with Cooly-Tukey algorithm [10] The low computational complexity as well as real-domain operations makes it very suitable for firmware coding in DSPs, which helps to save the MIPS counts Also, the DSP program can bewritteninrecursiveformwhichrequireslessrom/ram program storage space to implement the IFFT/FFT The rest of this paper is organized as follows Section 2 shows the derivation of the IFFT algorithm In Section 3, the derivation of the FFT algorithm is discussed The computation complexity comparison is shown in Section 4 The finite precision effect of our algorithm is also discussed Finally, we conclude our work in Section 5 2 REDUCED-COMPLEXITY IFFT ALGORITHM 21 The IFFT derivation The IFFT/FFT block diagram in the DMT system is showed in Figure 1 At the transmitter side, to ensure the IFFT generates only real-valued outputs, the inputs of the IFFT in the DMT standard have the constraint [11], where X(k) = X r (k)j X i (k) areencodedcomplexsymbols As defined in [12, Chapter 9], the IFFT of a finite-length sequence of length 2N is x(n) = 1 [ 2N 1 2N where ( W2N nk = exp k=0 X(k)W nk 2N ] j 2πnk ) = cos 2πnk 2N 2N, for n = 0, 1,,2N 1, (2) 2πnk j sin 2N (3) By decomposing n into the first half and the second half, (2) becomes x(n) = 1 [ N 1 2N k=0 2N 1 X(k)W2N nk k=n X(k)W nk 2N ] (4) Next, by substituting (3) into (4), and using (1), we can simplify (4) as (see Appendix A) x(n) = 1 N [ N 1 k=0 X r (k)cos 2πnk 2N N 1 = 1 N [MDCT(n) MDST(n) ], k=0 X i (k)sin 2πnk ] 2N X(0) = X(N) = 0, X(k) = X (2N k) fork = 1, 2,,N 1, (1) for n = 0, 1,,2N 1 (5)
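The decomposition in (5) can be checked against a direct complex-valued inverse DFT. The following is a minimal C sketch, assuming the exp(+j 2πnk/2N) inverse kernel of (2)-(3); under that convention, together with the constraint (1), the output is purely real and is obtained from the two real-valued sums as (MDCT(n) - MDST(n))/N. The direct O(N^2) sums are used here only for verification; the fast recursive form of Section 2.2 is not exercised.

/* Numerical check of the IFFT decomposition (5) for a 2N-point DMT symbol
 * obeying the Hermitian constraint (1). */
#include <complex.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define N 8                                   /* half block size (2N-point IFFT) */

int main(void)
{
    double Xr[N], Xi[N];
    double complex X[2 * N];

    srand(1);
    Xr[0] = Xi[0] = 0.0;
    for (int k = 1; k < N; k++) {
        Xr[k] = (double)rand() / RAND_MAX - 0.5;
        Xi[k] = (double)rand() / RAND_MAX - 0.5;
    }
    X[0] = 0.0; X[N] = 0.0;                   /* X(0) = X(N) = 0 */
    for (int k = 1; k < N; k++) {
        X[k]         = Xr[k] + I * Xi[k];
        X[2 * N - k] = conj(X[k]);            /* X(k) = X*(2N - k) */
    }

    double err = 0.0;
    for (int n = 0; n < 2 * N; n++) {
        /* reference: direct complex inverse DFT, exp(+j 2 pi n k / 2N) kernel */
        double complex x = 0.0;
        for (int k = 0; k < 2 * N; k++)
            x += X[k] * cexp(I * M_PI * n * k / N);
        x /= 2.0 * N;

        /* real-valued route of (5): MDCT and MDST partial sums */
        double mdct = 0.0, mdst = 0.0;
        for (int k = 0; k < N; k++) {
            mdct += Xr[k] * cos(M_PI * n * k / N);
            mdst += Xi[k] * sin(M_PI * n * k / N);
        }
        double xr = (mdct - mdst) / N;
        err = fmax(err, cabs(x - xr));        /* includes the imaginary residue */
    }
    printf("max deviation = %g\n", err);      /* ~1e-15: the two routes agree */
    return 0;
}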

93 A Reduced-Complexity Fast Algorithm for Software Implementation of the IFFT/FFT in DMT Systems 963 X r (0) X r (1) X r (N 2) X r (N 1) Even-odd index mapping X r (0) X r (2) X r (N 4) X r (N 2) X r (1) X r (3) X r (N 5) X r (N 3) X r (N 1) N/2- point MDCT g(n) N/2- point MDCT h (n) x r (N 1) x r (N 1) x r (N 1) x r (N 1) Γ 0 Γ 1 Γ N/2 2 Γ N/2 1 MDCT (0) MDCT (1) MDCT (N/2 2) MDCT (N/2 1) Special case MDCT (N/2) MDCT (N/21) MDCT (N/22) MDCT (N 1) Odd summation Injected items Γ n = 1/2C n 2N n :0 N/2 1 N-point MDCT X r (k) 1-point MDCT MDCT (n) X r (k) MDCT (n) Figure 2: N-point MDCT(n) butterfly structure, where 1-point MDCT is the minimum-sized processing block From (5), we can see that the computation of the IFFT is decomposed into two real-valued operations One is a discrete cosine transform DCT-like operation with X r (k), k = 0, 1, 2,,N 1, as the inputs The other is a discrete sine transform DST-like operation with X i (k), k = 0, 1, 2,, N 1, as the inputs We will name the first term Modified DCT (MDCT), and the second term Modified DST (MDST) Note that the MDCT and MDST involve only real-valued operators Furthermore, it can be shown that MDCT(n) = MDCT(2N n), for n = 0, 1,,N 1, (6) MDST(n) = MDST(2N n), for n = 0, 1,,N 1 (7) Hence, we can focus on computing MDCT(n) and MDST(n) for n = 0, 1,,N 1 Then, expand the results for n = N 1,N2,,2N 1 For the special cases of n = 0andn = N, the MDCT and MDST can be simplified as N 1 MDCT(0) = X r (k)cos 2π0k 2N k=0 N 1 = k=0 N 1 MDST(0) = X i (k)sin 2π0k 2N = 0, k=0 N 1 MDCT(N) = X r (k)cos 2πNk 2N k=0 N 1 = k=0 N 1 MDST(N) = X i (k)sin 2πNk 2N = 0, k=0 X r (k), X r (k)( 1) k, (8) respectively These simple relationships can help us to save additional computation complexity 22 MDCT/MDST operations of the IFFT From the preceding discussion, we can see that the implementation issue of the IFFT is to realize MDCT and MDST in a cost-efficient way Then, we can just combine the results of the MDCT and MDST to obtain the IFFT results basedon(5) Here, we first consider the implementation

94 964 EURASIP Journal on Applied Signal Processing of the MDCT We follow the derivation in [9] anddefine = cos(2πnk/2n) Then, the MDCT can be written as C nk 2N N 1 MDCT(n) = X r (k)c2n, nk for n = 0, 1,,N 1 (9) k=0 Decompose the MDCT into even and odd indices of k, then (9)canberewrittenas MDCT(n) = g(n)h (n), for n = 0, 1,, N 2 where N/2 1 g(n) = k=0 X r (2k)C n(2k) N/2 1 2N = k=0 N/2 1 h (n) = X r (2k 1)C n(2k1) 2N k=0 X r (2k)C nk N, 1, (10) (11) Define h(n) = 2C n 2Nh (n) Following the derivation in Lee s algorithm [9], we canfind That is, MDCT(n) = g(n)h (n) = g(n) 1 2C2N n h(n) (12) N 1 N/2 1 X r (k)c2n nk = X r (2k)CN nk k=0 k=0 }{{}}{{} N-point MDCT N/2-point MDCT, g(n) 1 N/2 1 [ 2C2N n Xr (2k 1)X r (2k 1) ] CN nk k=0 }{{} N/2-point MDCT, h (n) X r (N 1)( 1) n, }{{} injected item for n = 0, 1,, N 2 1 (13) On the other hand, by replacing index n with (N n)in(12), it can be shown that MDCT(N n) = g(n) h (n) = g(n) 1 2C2N n h(n) (14) The special case MDCT(N/2)needstobecomputedseparately, which can be simplified as ( ) N 1 N MDCT = X r (k)c k(n/2) N 1 2N = X r (k)cos kπ 2 2 (15) k=0 The mapping of (13), (14), and (15) is shown in Figure 2As we can see, the N-point MDCT is decomposed into two N/2- k=0 point MDCT (g(n)andh (n)) plus some pre-processing and post-processing modules Then we can apply the technique of divide-and-conquer to recursively expand the N/2-point MDCT until 1-point MDCT is formed That is, we repeat the decomposition in (10)and(11) until N = 1 Next, we consider the recursive implementation of the MDST We define S nk 2N = sin (2πnk/2N) As with the derivation in (10), (11), (12), (13),and (14),we can find N/2 1 MDST(n) = X i (2k)S nk N k=0 1 2C n 2N N/2 1 k=0 N/2 1 MDST(N n) = X i (2k)S nk N k=0 1 2C n 2N N/2 1 k=0 [ Xi (2k 1)X i (2k 1) ] S nk N, [ Xi (2k 1)X i (2k 1) ] S nk N, for n = 0, 1,, N 2 1 (16) It is worth noting that the injected item is zero in the MDST Besides, the MDST also has a special case for index N/2as ( ) N 1 N MDST = X i (k)s k(n/2) N 1 2N = X i (k)sin kπ 2 2 (17) k=0 The mapping of the MDST structure in Figure 3 is similar to the MDCT structure, except that minimum processing block is2-pointmdst (seefigure 3) andtheinjecteditemsdonot exist in the MDST implementation That is, we repeat the decomposition in (16) until N = 2 Note that the 1-point MDST is always equal to zero 23 Overall IFFT computation procedures The overall IFFT computation flow is shown in Figure 4 It consists of the MDCT/MDST operations and a postprocessing operation The operations in Figure 4 are as follows: (1) set the butterfly operation to MDCT mode; (2) X r (k), k = 0, 1,,N 1, are first fed into the butterfly architecture to obtain the MDCT(n), for n = 0, 1,,N 1; (3) the post-processing operation expands the N-point MDCT outputs to 2N-point MDCT using the symmetric property in (6); (4) set the butterfly operation to MDST mode; (5) repeat the computation in Steps 2 and 3 using X i (k), k = 0, 1,,N 1 as inputs, and obtain the MDST(n), for n = 0, 1,,N 1; (6) the post-processing operation expands the N-point MDST outputs to 2N-point MDST by using the antisymmetric property in (7); k=0

95 A Reduced-Complexity Fast Algorithm for Software Implementation of the IFFT/FFT in DMT Systems 965 X i (0) X i (1) X i (N 2) X i (N 1) Even-odd index mapping X i (0) X r (2) X i (N 4) X i (N 2) X i (1) X i (3) X i (N 5) X i (N 3) X i (N 1) N/2- point MDST g(n) N/2- point MDST h (n) Γ 0 Γ 1 Γ N/2 2 Γ N/2 1 MDST (0) MDST (1) MDST (N/2 2) MDST (N/2 1) Special case MDST (N/2) MDST (N/21) MDST (N/22) MDST (N 1) Odd summation Γ n = 1/2C n 2N n :0 N/2 1 N-point MDST X i (k 0 ) X i (k 1 ) 2-point MDST MDST (n 0 ) MDST (n 1 ) X i (k 0 ) X i (k 1 ) MDST (n 0 ) MDST (n 1 ) Figure 3: N-point MDST(n)butterflystructure,where 2-point MDST is theminimum-sizedprocessing block (7) based on (5), we combine the MDCT and MDST results together with the scaling operation (which is achieved by shifting right by log 2 (N) bits) to obtain the IFFT results This is done in the post-processing operation represented as where Injected = [ O N/2 ] Xr (N 1), (21) 24 Matrix notation of the MDCT/MDST In this section, we present the matrix notation of the proposed fast IFFT algorithm The matrix form can help to see the divide-and-conquer nature of our approach By following the notation in [13], we rewrite X r (k) and MDCT(n)as [ Xr (k) N ] = [ Xr (0) X r (1) X r (N 1) ] T, (18) [ MDCT(n)N ] =[ MDCT(0) MDCT(1) MDCT(N 1) ] T, (19) respectively Then (9)canberepresentedas [ MDCT(n)N ] = [ TN,MDCT ][ Xr (k) N ], (20) where [T N,MDCT ] denotes the transform kernel matrix of the MDCT operation Next, the injected items of (13) can be [ ON/2 ] = [ ] T (22) We define the odd-summation matrix as [ ] LN/2 = and the scaling matrix as (23) { } [ ] 1 ΦN/2 = diag 2C2N n, for n = 0, 1,, N 1 (24) 2 The special case of the MDCT in (15)canberepresentedas ( ) N MDCT = [ ][ ] S N Xr (k) N, (25) 2

96 966 EURASIP Journal on Applied Signal Processing X i (0) X i (1) X r (0) X r (1) N-point MDCT/ MDST R R R x(0) x(1) x(n/2) X i (N 2) X r (N 2) X i (N 1) X r (N 1) Phase II Phase I for for MDST MDCT R R x(n 1) Special case x(n) x(n 1) R x(3n/2) x(2n 1) R Shift left log 2 (N)bits Post-processing (Expanding circuit) Figure 4: The proposed IFFT architecture where [ SN ] = [ ] (26) The block diagram of the MDST in the matrix form is shown in Figure 6 Based on (12), (13), (14), (21), (22), (23), (24), (25), and (26), the [T N,MDCT ] can be expressed in the matrix form as [ ] [ ] [ ] TN/2 ψn,mdct TN,MDCT = [ ][ ] [ ][ ], (27) TN/2 JN/2 ψn,mdct JN/2 where [ψ N,MDCT ] = [Φ N/2 ]([L N/2 ][T N/2 ][O N/2 ]) [S N ]and [J N/2 ] denotes the opposite-diagonal identity matrix We can also represent (20) and(27) in the recursive form as shown in Figure 5 Following the above derivations, the matrix notation of transform kernel of the MDST can be derived as [ [ ] [ ] ] [ ] TN/2 ψn,mdst TN,MDST = [ ][ ] [ ][ ], (28) T N/2 JN/2 ψn,mdst JN/2 where [ψ N,MDST ] = [Φ N/2 ][L N/2 ][T N/2 ][S N ] Note that the MDST is similar to the MDCT except that there is no injected items Also, the special case matrix can be modified as [ SN ] = [ ] (29) 3 REDUCED-COMPLEXITY FFT ALGORITHM 31 The FFT derivation At the receiver side (see Figure 1), the 512-point FFT is used to demodulate the received signals, which is given by where 2N 1 X(k) = x(n)w 2N, nk for k = 0, 1,,2N 1, (30) n=0 ( W2N nk = exp j 2πnk ) = cos 2πnk 2N 2N 2πnk j sin 2N (31) Note that x(n),n= 0, 1,,2N 1, are real-valued numbers Hence, (30) canberewrittenas 2N 1 X(k) = x(n)cos 2πnk 2N n=0 2N 1 j n=0 x(n)sin 2πnk 2N =MDCT(k) j MDST(k), for k = 0, 1,,2N 1 (32) Equation (32) shows that the computation of the FFT is

97 A Reduced-Complexity Fast Algorithm for Software Implementation of the IFFT/FFT in DMT Systems 967 X r (0) X r (1) Even [T N/2 ] X r (N 2) X r (N 1) Even-odd index mapping Odd [L N/2 ] X r (N 1) [T N/2 ] [Q N/2 ] [S N ] [Φ N/2 ] MDCT (n) N [J N/2 ] Figure 5: Block diagram of the MDCT operation in matrix form X i (0) X i (1) Even [T N/2 ] X i (N 2) X i (N 1) Even-odd index mapping Odd [L N/2 ] [T N/2 ] [S N ] [Φ N/2 ] [J N/2 ] MDST (n) N Figure 6: Block diagram of the MDCT operation in matrix form decomposed into a combination of two real-domain kernels MDCT(k)andMDST(k) Both MDCT and MDST use x(n), n = 0, 1,,2N 1, as the inputs Hence, we only employ two real-valued kernels (MDCT and MDST), thus no complex-valued operations are required in computing the FFT In addition, in the DMT system, the lower N-point FFT outputs are conjugate-symmetric to the uppern-point outputs We are only interested in N-point data for k = 0, 1,,N 1 Hence, we can neglect the outputs X(k), for k = N,N 1,,2N 1 32 MDCT/MDST operations of the FFT In (32), the transform kernels are 2N-point MDCT(k) and MDST(k) Here, we propose a novel approach to further reduce the computational complexity Hence, we only need to performn-point MDCT/MDST We first decompose input sequence into a symmetric sequence, x c (n), plus an antisymmetric sequence, x s (n), where x c (n) = 1 2 [ x(n) x(2n n) ], x s (n) = 1 ] [ x(n) x(2n n), for n = 1, 2,,N 1 2 (33) Hence, we have x(n) = x c (n) x s (n), (34) x(2n n) = x c (n) x s (n), for n = 1, 2,,N 1 (35) By substituting (34)and(35) into (30), we can simplify (30) as (seeappendix B) X(k) = { x(0) x(n)( 1) k 2 [ N 1 n=0 x c (n)cos 2πnk 2N N 1 j n=0 x s (n)sin 2πnk ]} 2N = { x(0) x(n)( 1) k 2 [ MDCT(k) j MDST(k) ]}, for k = 0, 1,,N 1, (36) where x c (0) = 0and x s (0) = 0 Since the block size is reduced from 2N-point (see (32)) to N-point (see (36)) Next, following the derivations of the IFFT in Section 2, we can have

98 968 EURASIP Journal on Applied Signal Processing MDCT(k) = g(k) 1 2C2N k h(k) = N/2 1 x c (2n)C nk N n=0 }{{} N/2-point MDCT, g(k) 1 2C k 2N [ N/2 1 [ xc (2n 1) x c (2n 1) ] C nk N n=0 } {{ } N/2-point MDCT, h (k) ] x c (N 1)( 1) k }{{} injected item MDCT(N k) = g(k) 1 2C2N k h(k) N/2 1 = n=0 1 2C k 2N x c (2n)C nk N [ N/2 1 n=0 Similarly, for the MDST(k),we have MDST(k) = g(k) 1 2C2N k h(k) N/2 1 = n=0 1 2C k 2N x s (2n)S nk N N/2 1 n=0 MDST(N k) = g(k) 1 2C2N k h(k) N/2 1 = n=0 1 2C k 2N [ xc (2n 1) x c (2n 1) ] C nk N x c (N 1)( 1) k ], for k = 0, 1,, N 2 1 [ xs (2n 1) x s (2n 1) ] S nk N, x s (2n)S nk N N/2 1 n=0 The two special cases for index N/2are, (37) (38) (39) [ xs (2n 1) x s (2n 1) ] S nk N, for k = 0, 1,, N 2 1 (40) ( ) N 1 N MDCT = x c (n)cos nπ 2 2, n=0 ( ) N 1 MDST N2 = n=0 x s (n)sin nπ 2 (41) The block diagram of the MDCT(k) is shown in Figure 7 The mapping of the MDST structure is similar to the MDCT structure in Figure 7 except that minimum processing block is 2-point MDST and the injected items do not exist in the MDST(k) implementation (see Figure 8) Then we can just combine the MDCT(k) and MDST(k) outputs, followed by adding x(0) and x(n)( 1) k, to obtain the FFT results based on (36) 33 Overall FFT computation procedures The overall computation flow of the FFT is shown in Figure 9 The operations are as follows (1) The received signals x(n), n = 0, 1,,2N 1, are decomposed to x c (n)and x s (n), n = 0, 1,,N 1, through the pre-processing operation (2) In the first phase, the generated x c (n) are fed into recursive butterfly operation to obtain the MDCT(k) outputs (3) In the second phase, we repeat the computation by using the x s (n) as inputs into recursive butterfly operation to obtain the MDST(k) outputs (4) We combine the MDCT(k) andmdst(k) results then add x(0) and x(n)( 1) k together to obtain the FFT results based on (36) This is done in the post-processing operation 34 Matrix notation of the MDCT/MDST Based on (19), (20), (21), (22) (23), (24), (25), and (26), we can represent (37), (38), and (39)as [ [ ] [ ] ] [ ] TN/2 ψn,mdct TN,MDCT = [ ][ ] [ ][ ], (42) TN/2 JN/2 ψn,mdct JN/2 where [ψ N,MDCT ] = [Φ N/2 ]([L N/2 ][T N/2 ][O N/2 ]) [S N ], [ [ ] [ ] ] [ ] TN/2 ψn,mdst TN,MDST = [ ][ ] [ ][ ], (43) T N/2 JN/2 ψn,mdst JN/2 where [ψ N,MDST ] = [Φ N/2 ][L N/2 ][T N/2 ][S N ], of the MDCT(k)/ MDST(k), respectively The block diagrams of the MDCT(k) and MDST(k) are very similar to the MDCT(n) andmdst(n) insection 2 Thedifference is that it requires a pre-processing to compute the x c (n) and x s (n) The block diagrams of the MDCT and MDST are shown in Figures 10 and 11,respectively 4 COMPLEXITY COMPARISON AND FINITE-PRECISION EFFECT 41 Comparison of hardware complexity In this section, we compare the computation complexity of the proposed algorithm with the traditional Cooly-Tukey

99 A Reduced-Complexity Fast Algorithm for Software Implementation of the IFFT/FFT in DMT Systems 969 2N-point N-point x(0) x(1) x(2n 2) x(2n 1) x c (0) x c (1) x c (N 2) x c (N 1) Even-odd index mapping x c (0) x c (2) x c (N 4) x c (N 2) x c (1) x c (3) x c (N 5) x c (N 3) x c (N 1) Preprocessing N/2- point MDCT g(k) x c (N 1) x c (N 1) N/2- point MDCT x c (N 1) h (k) x c (N 1) Γ 0 Γ 1 MDCT (0) MDCT (1) MDCT (N/2 2) MDCT (N/2 1) Special case MDCT (N/2) MDCT (N/21) MDCT (N/22) MDCT (N 1) Odd summation Injected items Γ k = 1/2C k 2N k :0 N/2 1 N-point MDCT x c (n) 1-point MDCT MDCT(k) x c (n) MDCT(k) Figure 7: N-point MDCT(k) butterfly structure, where 1-point MDCT is the minimum-sized processing block of the FFT module 2N-point N-point x(0) x(1) x(2n 2) x(2n 1) x s (0) x s (1) x s (N 2) x s (N 1) Even-odd index mapping x s (0) x s (2) x s (N 4) x s (N 2) x s (1) x s (3) x s (N 5) x s (N 3) x s (N 1) Preprocessing N/2- point MDST g(k) N/2- point MDST h (k) Γ 0 Γ 1 MDST (0) MDST (1) MDST (N/2 2) MDST (N/2 1) Special case MDST (N/2) MDST (N/21) MDST (N/22) MDST (N 1) Odd summation Injected items Γ k = 1/2C k 2N k :0 N/2 1 N-point MDST x s (n 0 ) x s (n 1 ) 2-point MDST MDST(k 0 ) MDST(k 1 ) x s (n 0 ) x s (n 1 ) MDST(k 0 ) MDST(k 1 ) Figure 8: N-point MDST(k)butterfly structure,where the 2-point MDST is theminimum-sizedprocessing block of thefft module
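The demodulator-side decomposition (33)-(36) can be verified in the same way before moving to the recursive form of Figures 7 and 8. The following is a minimal C sketch, assuming the exp(-j 2πnk/2N) forward kernel of (31); it rebuilds X(k), k = 0, 1, ..., N - 1, from the real-valued MDCT(k)/MDST(k) sums and compares against a direct complex DFT.

/* Numerical check of the FFT decomposition (36): a real 2N-point block x(n)
 * is split into the symmetric/antisymmetric parts xc(n), xs(n) of (33), and
 * X(k) is rebuilt from the real-valued MDCT(k)/MDST(k) sums. Direct O(N^2)
 * sums only; the recursive split (37)-(41) is not exercised here. */
#include <complex.h>
#include <math.h>
#include <stdio.h>
#include <stdlib.h>

#define N 8                                  /* half block size (2N-point FFT) */

int main(void)
{
    double x[2 * N], xc[N], xs[N];

    srand(2);
    for (int n = 0; n < 2 * N; n++)
        x[n] = (double)rand() / RAND_MAX - 0.5;

    xc[0] = xs[0] = 0.0;                     /* xc(0) = xs(0) = 0, see (36) */
    for (int n = 1; n < N; n++) {
        xc[n] = 0.5 * (x[n] + x[2 * N - n]); /* symmetric part (33) */
        xs[n] = 0.5 * (x[n] - x[2 * N - n]); /* antisymmetric part (33) */
    }

    double err = 0.0;
    for (int k = 0; k < N; k++) {
        /* reference: direct complex DFT with the exp(-j 2 pi n k / 2N) kernel */
        double complex Xref = 0.0;
        for (int n = 0; n < 2 * N; n++)
            Xref += x[n] * cexp(-I * M_PI * n * k / N);

        /* real-valued route of (36) */
        double mdct = 0.0, mdst = 0.0;
        for (int n = 1; n < N; n++) {
            mdct += xc[n] * cos(M_PI * n * k / N);
            mdst += xs[n] * sin(M_PI * n * k / N);
        }
        double sgn = (k & 1) ? -1.0 : 1.0;
        double complex Xnew = x[0] + x[N] * sgn + 2.0 * (mdct - I * mdst);
        err = fmax(err, cabs(Xref - Xnew));
    }
    printf("max deviation = %g\n", err);     /* ~1e-14: (36) matches the DFT */
    return 0;
}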

100 970 EURASIP Journal on Applied Signal Processing 2N-point N-point N-point x(0) x(n)( 1) k x(0) x(1) x s (0), x c (0) x s (1), x c (1) R X(0) Preprocessing N-point MDCT(1st phase) MDST(2nd phase) x(0) x(n)( 1) k R X(N/2) x(2n 2) x(2n 1) x s (N 2), x c (N 2) x s (N 1), x c (N 1) x(0) x(n)( 1) k R X(N 1) 2nd phase 1st phase Post-processing Figure 9: The proposed FFT architecture x(0) x(1) x s (0) x s (1) Even [T N/2 ] Preprocessing Evenodd index mapping Odd [LN/2 ] x c (N 1) [T N/2 ] [O N/2 ] [Φ N/2 ] MDCT (k) N [J N/2] x(2n 2) x(2n 1) x s (2N 2) x s (2N 1) [S N ] Figure 10: Block diagram of the MDCT in matrix form for the FFT operation x(0) x(1) x s (0) x s (1) Even [T N/2 ] x(2n 2) x(2n 1) x s (N 2) x s (N 1) Preprocessing Evenodd index mapping Odd [LN/2 ] [T N/2 ] [S N ] [Φ N/2 ] MDST (k) N [J N/2] Figure 11: Block diagram of the MDST in matrix form for the FFT operation algorithm The corresponding butterfly architecture requires log 2 (2N) stages in the 2N-point IFFT/FFT Each stage consists of N multiplications and 2N additions Because input sequences are complex data, the IFFT/FFT kernels are complex in nature Hence, it requires 4 real-valued multiplications and 2 real-valued additions for 1 complex

101 A Reduced-Complexity Fast Algorithm for Software Implementation of the IFFT/FFT in DMT Systems 971 Table 1: Comparison of computational complexity for 2N-point IFFT/FFT IFFT FFT Cooly-Tukey [10] Chan et al Cooly-Tukey [10] Chan et al (O 1 ) (O 2 ) CR (O 1 ) (O 2 ) CR N 4N log 2 2N Nlog 2 N 2N 2 4N log 2 2N Nlog 2 N 2N (a) Number of multiplication operations IFFT FFT Cooly-Tukey [10] Chan et al Cooly-Tukey [10] Chan et al (O 1 ) (O 2 ) CR (O 1 ) (O 2 ) CR N 6N log 2 2N (9/2)N log 2 N N 1 6N log 2 2N (9/2)N log 2 N N (b) Number of addition operations multiplication Also, it takes 2 real additions to realize a complex addition As a result, the direct approach requires a total of 4N log 2 (2N) real multiplications and 6N log 2 (2N)real additions The large computation complexity are not suitable for cost-effective realization of the IFFT/FFT modules in the DMT system The complexity comparison for 2N-point IFFT/FFT are listed in Table 1Thecomplexity ratio (CR) is defined as CR = O 2, (44) O 1 where O 1 and O 2 are the number of multiplications (or additions) in other fast algorithms and our approach, respectively We can see that the complexity ratio of the multiplication is only 17% for N = 256 compared with conventional IFFT/FFT Table 1 also shows that our approach can gain more computation savings as N gets larger in the VDSL systems [14] 42 Experiment results There are lots of DSP processors on the market Due to the variety or hardware structure, coding styles, compliers, and so forth, we are not trying to do the detail optimization for specific processors On the other hand, we would like to compare the proposed algorithm with Cooly- Tukey s algorithm, which is a baseline of the FFT realization The implementation platform is TI TMS320C54 evaluation board, Both algorithms are written in C language without any assembly-level programming tricks During compilation, the TI C54X C complier is used without adding special compilation options, neither Table 2 shows the comparison of the proposed algorithm andtheconventionalfftintermsofclockcyclesaswecan see, the proposed algorithm requires only about 30% clock cycles of the Cooly-Tukey s The result is very consistent with our observation in Table 1
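The multiplication counts of Table 1 can be reproduced from the closed-form expressions. The following is a small C sketch; the proposed count is read here as N log2 N - 2N + 2 (an assumption about the intended signs in Table 1), which reproduces the 17% complexity ratio quoted for N = 256 and also shows the absolute saving growing with N, as relevant for VDSL block sizes.

/* Multiplication-count comparison for a 2N-point IFFT/FFT, following Table 1. */
#include <math.h>
#include <stdio.h>

int main(void)
{
    for (int N = 256; N <= 2048; N *= 2) {    /* N = 256 is the ADSL case */
        double lg = log2((double)N);
        double o1 = 4.0 * N * (lg + 1.0);     /* Cooley-Tukey, 2N-point   */
        double o2 = N * lg - 2.0 * N + 2.0;   /* proposed (Table 1 reading) */
        printf("N = %4d:  O1 = %8.0f  O2 = %7.0f  CR = %.1f%%  saved = %.0f\n",
               N, o1, o2, 100.0 * o2 / o1, o1 - o2);
    }
    return 0;
}

For N = 256 this prints O1 = 9216, O2 = 1538, and CR of about 17%, matching the figure quoted above.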

102 972 EURASIP Journal on Applied Signal Processing Table 2: Comparison of clock cycle for Cooley-Tukey FFT and proposed recursive algorithm 128-point 256-point 512-point Cooley-Tukey FFT 16,485 37,118 82,347 Proposed 11,869 25,726 55,435 Clock cycle Ratio 28% 31% 33% 5 CONCLUSIONS In this paper, we develop a computationally efficient fast algorithm for the software implementation of the IFFT/FFT kernel in the DMT system We reformulate the IFFT/FFT functions so as to avoid complex-domain operations The complexity ratio of the multiplications is only 17% compared with the direct butterfly implementation approach The proposed algorithm provides a good solution in reducing MIPS count in programmable DSP implementation for the applications of the DMT transceiver systems Averaged SNR (db) Averaged SNR (db) Wordlength (B) Direct butterfly approach Our approach (a) Wordlength (B) Direct butterfly approach Our approach (b) Figure 12: Averaged SNR versus wordlength for the 512-point (2N value) (a) IFFT (b) FFT 43 Finite-precision effect In fixed-point implementation of the IFFT/FFT kernels, it is important to consider the effects of finite register length in the IFFT/FFT calculations (see [12,Chapter9]and[15]) To compare the butterfly approach and our approach in fixedpoint implementation, we conduct extensive computer simulation by using MATLAB for finite-wordlength IFFT/FFT architecture Figure 12 shows the SNR performance with assigned wordlength B = 8, 16, 32 bits We observe that the SNR performance with B =16 bits is good enough in practical fixed-point implementations From the simulation results, we can see that the SNR performance of our approach is comparable to the traditional butterfly approach under the same wordlength APPENDICES A DERIVATION OF (4) Decomposing (4) into the first half and second half with the fact that X(0) = X(N) = 0, (4)canberepresentedas x(n) = 1 [ N 1 2N X(k)W nk k=1 2N 1 2N k=n1 X(k)W nk 2N ] (A1) Use k = 2N k to replace the variable in the second term Then, we have x(n)= 1 [ N 1 2N X(k)W nk k=1 1 2N k =N 1 ] X(2N k )W (2N k )n 2N (A2) Because k isadummyvariable,wecanrewrite(a2)as x(n) = 1 [ N 1 2N k=1 N 1 X(k)W2N nk = 1 [ N 1 2N X(k)W2N nk k=1 By using the facts that ( W2N nk = exp W nk 2N ( = exp j 2πnk 2N j 2πnk 2N we can rearrange (A3)to x(n) = 1 N [ N 1 k=0 W 2Nn 2N = 1, k=1 N 1 k=1 ) = cos 2πnk 2N ) = cos 2πnk 2N X(0) = X(N) = 0, ( X r (k)cos 2πnk 2N ] X(2N k)w (2N k)n 2N X(2N k)w 2Nn 2N 2πnk j sin 2N, 2πnk j sin 2N, W2N nk ] (A3) (A4) X i(k)sin 2πnk 2N ) ] (A5)

103 A Reduced-Complexity Fast Algorithm for Software Implementation of the IFFT/FFT in DMT Systems 973 B DERIVATION OF (30) Equation (30)canberepresentedas X(k) = x(0) x(n)( 1) k [ N 1 n=1 x(n)w nk 2N 1 2N n=n1 x(n)w nk 2N (B1) Use n = 2N n to replace the variable in the second term Then, we have X(k) = x(0) x(n)( 1) k [ N 1 n=1 x(n)w nk 1 2N n =N 1 ] x(2n n )W k(2n n ) 2N Because n is a dummy variable, we can rewrite (B2)as X(k) = x(0) x(n)( 1) k [ N 1 n=1 N 1 x(n)w 2N nk = x(0) x(n)( 1) k [ N 1 n=1 n=1 N 1 x(n)w 2N nk n=1 ] x(2n n)w k(2n n) 2N x(2n n)w2n 2kN W2N nk ] (B2) ] (B3) By using the fact that W2N 2kN = 1 and applying the assumption of the input data in (35), we can rearrange (B3)as X(k) = x(0) x(n)( 1) k 2 [ N 1 n=1 x c (n)cos 2πnk 2N N 1 j n=1 x s (n)sin 2πnk ] 2N (B4) [4]ILee,JSChou,andJMCioffi, Performance evaluation of a fast computation algorithm for the DMT in high-speed subscriber loop, IEEE Journal on Selected Areas in Communications, vol 13, no 9, pp , 1995 [5] T N Zogakis, J T Aslanis Jr, and J M Cioffi, Acodedand shaped discrete multitone system, IEEE Trans Communications, vol 43, no 12, pp , 1995 [6] B Daneshrad and H Samueli, A 16 Mbps digital-qam system for DSL transmission, IEEE Journal on Selected Areas in Communications, vol 13, no 9, pp , 1995 [7] B R Wiese and J S Chow, Programmable implementations of xdsl transceiver systems, IEEE Communications Magazine, vol 38, no 5, pp , 2000 [8] A-Y Wu and T S Chan, Cost-efficient parallel lattice VLSI architecture for the IFFT/FFT in DMT transceiver technology, in ProcIEEEIntConfAcoustics,Speech,SignalProcessing, pp , Seattle, Wash, USA, May 1998 [9] B G Lee, A new algorithm to compute the discrete cosine transform, IEEE Trans Acoustics, Speech, and Signal Processing, vol 32, no 6, pp , 1984 [10] J W Cooly and J W Tukey, An algorithm for the machine calculation of the complex Fourier series, Math Comp, vol 19, pp , April 1965 [11] ANSI Standard T1413, Network and customer installation interface-asymmetric digital subscriber line (ADSL) metallic interface, 1995 [12] A V Oppenheim and R W Schafer, Discrete-Time Signal Processing, Prentice-Hall, Englewood Cliffs, NJ, USA, 1989 [13] H D Yun and S U Lee, On the fixed-point-error analysis of several fast DCT algorithms, IEEE Trans Circuits and Systems for Video Technology, vol 3, no 1, pp 27 41, 1993 [14] T1E14/ R3, Very-high-speed digital subscriber lines (VDSL) metallic interface, part 3: Technical specification of a multi-carrier modulation transceiver, 2000 [15] K J R Liu, A-Y Wu, A Raghupathy, and J Chen, Algorithm-based low-power and high-performance multimedia signal processing, Proceedings of the IEEE, vol 86, no 6, pp , 1998, Special Issue on Multimedia Signal Processing ACKNOWLEDGMENT T S Chan is with the VXIS Tech Corp Hsin-Chu, Taiwan, ROC This work is supported in part by the National Science Council, ROC, under Grant NSC E REFERENCES [1]GHIm,DDHarman,GHuang,AVMandzik,MH Nguyen, and J J Werner, 5184 Mb/s 16-CAP ATM LAN standard, IEEE Journal on Selected Areas in Communications, vol 13, no 4, pp , 1995 [2]JSChow,JCTu,andJMCioffi, A discrete multitone transceiver system for HDSL applications, IEEE Journal on Selected Areas in Communications, vol 9, no 6, pp , 1991 [3] K Sistanizadeh, P Chow, and J M Cioffi, Multi-tone transmission for asymmetric digital subscriber lines (ADSL), in Proc IEEE International Conf on Communications, vol 2, pp , Geneva, Switzerland, 1993 Tsun-Shan Chan was born in Chang-Hui, Taiwan, ROC, in 1973 He received his MS degree in electrical 
engineering from the National Central University, Taiwan, in 1998 During , he worked on communication applications at the Industrial Technology Research Institute, Hsin-Chu, Taiwan Since 1999, he has been serving as a system engineer for video processing projects at VXIS Technology Corporation Jen-Chih Kuo received his BS degree in electrical engineering from the National Taiwan University, Taiwan, in 2000 He is now in the Graduate Institute of Electronics Engineering of the same school His research interests include VLSI architectures for DSP algorithms, adaptive signal processing, and digital communication systems

104 974 EURASIP Journal on Applied Signal Processing An-Yeu (Andy) Wu received his BS degree from National Taiwan University in 1987, and the MS and PhD degrees from the University of Maryland, College Park in 1992 and 1995, respectively, all in electrical engineering During , he served as a signal officer in the Army, Taipei, Taiwan, for his mandatory military service During , he was a graduate teaching and research assistant with the Department of Electrical Engineering and Institute for Systems Research at the University of Maryland, College Park From August 1995 to July 1996, he was a Member of Technical Staff at AT&T Bell Laboratories, Murray Hill, NJ, working on high-speed transmission IC designs From 1996 to July 2000, he was with the Electrical Engineering Department of National Central University, Taiwan He is currently an Associate Professor with the Department of Electrical Engineering Department and Graduate Institute of Electronics Engineering of National Taiwan University, Taiwan His research interests include low-power/high-performance VLSI architectures for DSP and communication applications, adaptive signal processing, and multirate signal processing

105 EURASIP Journal on Applied Signal Processing 2002:9, c 2002 Hindawi Publishing Corporation A DSP Based POD Implementation for High Speed Multimedia Communications Chang Nian Zhang Department of Computer Science, University of Regina, TRLabs, SK, Canada S4S 0A2 zhang@csureginaca Hua Li Department of Mathematics and Computer Science, University of Lethbridge, Lethbridge, Alberta, Canada T1K 3M4 huali@csulethca Nuannuan Zhang Department of Computer Science, University of Regina, TRLabs, SK, Canada S4S 0A2 Jiesheng Xie Department of Computer, China Agriculture University, Beijing , China Received 20 August 2001 and in revised form 10 August 2002 In the cable network services, the audio/video entertainment contents should be protected from unauthorized copying, intercepting, and tampering Point-of-deployment (POD) security module, proposed by OpenCable TM, allows viewers to receive secure cable services such as premium subscription channels, impulse pay-per-view, video-on-demand as well as other interactive services In this paper, we present a digital signal processor (DSP) (TMS320C6211) based POD implementation for the real-time applications which include elliptic curve digital signature algorithm (ECDSA), elliptic curve Diffie Hellman (ECDH) key exchange, elliptic curve key derivation function (ECKDF), cellular automata (CA) cryptography, communication processes between POD and Host, and Host authentication In order to get different security levels and different rates of encryption/decryption, a CA based symmetric key cryptography algorithm is used whose encryption/decryption rate can be up to 75 Mbps The experiment results indicate that the DSP based POD implementation provides high speed and flexibility, and satisfies the requirements of real-time video data transmission Keywords and phrases: point-of-deployment, DSP, cellular automata, copy protection, ECDSA, DH key exchange 1 INTRODUCTION The next generation of the cable networks requires that a security module should be built and separated from the host devices (set top boxes and integrated digital televisions) in order to facilitate commercial sale of navigational devices The point-of-deployment (POD) security module is being developed to satisfy these separable security requirements and to enable retail availability of Host devices [1, 2, 3] The POD module supports two major functions (1) The POD will provide the cable operator with a secure device at the customer s location (2) The POD will act as a translator so that the Host device will only have to understand a single protocol, regardless of the type of network to which it is connected Since the draft of the specification of the POD module was released in fall 1997, several POD products have been reported All of them are the application specific integrated circuits (ASIC), and use data encryption standard (DES) as the preliminary technique for content encryption/decryption But DES has been proved not secure enough and will be replaced by the new standards Moreover, due to the nature of cable network services, different applications require different security levels It is desirable for POD to provide versatile cryptography schemes On the other hand, since the current specification of the POD module has not been accepted as an international standard, any further modifications of the standard will cause redesigning and rebuilding of the ASIC POD In order to provide a low cost and flexible POD, a DSP based POD implementation is proposed in this paper which satisfies the 
requirements of real-time video data transmission and can be applied in different security levels The

106 976 EURASIP Journal on Applied Signal Processing POD module OCI-C2 Cable headend Host device (Set top or digital TV) Consumer device (HDTV, digital VCR, etc) OCI-N OCI-C1 Figure 1: OpenCable network and consumer interfaces outline of the remainder of the paper is as follows Section 2 introduces the POD security module including the overview of POD, its functionalities, and algorithms used in POD Section 3 presents the POD implementation based on DSP Section4 is the conclusion 2 FUNCTIONS OF POD MODULE The set top box (STB) is a commonly used interface between digital television and the functions accessible via cable network in the architecture of next-generation television and video systems It attaches a point-of-deployment (POD) security plug-in module to provide the security and copy protection of the contents Figure 1 illustrates logically how the POD module interface connects with other OpenCable interfaces In Figure 1, OCI-N (OpenCable interface network) is the interface between a cable network and the Host device OCI-C1 (Open- Cable interface consumer 1) is the interface between a Host device and a digital consumer device OCI-C2 (OpenCable interface consumer 2) is the interface between a Host device and the POD module The primary functions of the OpenCable POD module include: (1) provide conditional access to a Host device; (2) provide communication and control between the headend and the Host device The POD module decrypt the contents under control of the headend and re-encrypt the contents for the purpose of copy protection between the POD module and Host device Typically, the POD is authorized by the conditional access system to decrypt contents, and authorizes the Host by delivering either clear or CP (copy protection) encrypted content The content passing the POD interface can be one of the following three formats: (1) Cleartext, (2) Passing through, (3) Rescramble The copy protection between the POD and the Host works as follows Step 1 (Initialization of the POD and the Host evaluation) When the POD is powered on, it checks if the Host supports OpenCable TM content protection by checking the availability of the CP resource and verifying the authenticity of the device certificate Step 2 (Host authentication) The POD retrieves the Host certificate Data to initiate the authentication procedure and the Host replies to it After this exchange, both the POD and the Host come up with the authentication key Step 3 (Key exchange) The POD sends its DH (Diffie- Hellman) public key to the Host and requests the Host s DH public key and then the Host sends its DH public key to the POD After this exchange, both the POD and the Host come up with a common secret value By using a method covered by intellectual property, they establish the shared secret keys derived from the Host authentication process Step 4 (Interface encryption) The POD uses the secret key to encrypt the content The cryptography schemes used in POD include: (1) Elliptic curve digital signature algorithm (ECDSA), which is used in the Host authentication process for signing and verification (2) Diffie-Hellman (DH) public key agreement algorithm, which provides a method for POD and Host to compute a shared secret value, that is, used in the content encryption/decryption key generation (3) SHA-1 (secure hash algorithm) [4], which is used in the digital signature algorithm to generate a message digest of length 160 bits For the POD, the SHA-1 algorithm is used for Host certificate signature verification, authentication key 
generation, and copy protection key generation (4) Elliptic curve key derivation function (ECKDF) algorithm, which is used to generate the key for the content protection Moreover, a random number generator is included to generate DH private keys, which is compliant with the SHA-1 based algorithm Each OpenCable device has a unique seed value which is set by the manufacturer Figure 2 illustrates the cryptographic functions used in the POD copy protection 3 A DSP BASED POD IMPLEMENTATION 31 Introduction of DSP C6211 Texas Instruments (TI) TMS320C6000 generation [5] is based on the VelociTI TM architecture, an advanced architecture

107 A DSP Based POD Implementation for High Speed Multimedia Communications 977 Diffie-Hellman key exchange Key generation Q = dp POD Host Signature generation (r, s) SHA-1 SHA-1 Verification procedure v Key derive function Key derive function Check if r = v? Figure 3: ECDSA algorithm Plaintext DES, CA encryption on MPEG transport Copy portected DES, CA encryption on MPEG transport System parameters (elliptic curve) Figure 2: Cryptographic functions used in POD copy protection POD Host for DSPs with very long instruction word (VLIW) The VLIW architecture makes it very suitable for the multichannel and multifunction applications TMS320C6211 (C6211 for short) provides 1200 MIPS (million instructions per second) at 150 MHz, and the TMS32062xx devices are the fixedpoint DSP family The cache architecture in C6211 provides low cost and high performance capabilities C6211 has 32 general purpose registers of 32 bit word length and eight highly independent functional units The eight functional units provide six arithmetic logic units (ALUs) for a high degree of parallelism and two 16-bit multipliers The development tools of C6211 include: C compiler, assembly optimizer to simplify programming and scheduling, and Windows TM debugger interface for visibility into source code execution [6] The DSP based POD can greatly reduce the hardware design period, since it can easily reprogram when the specifications of POD are to be modified or new components are added 32 Cryptography algorithms used in the DSP based POD In order to make the POD more efficient, we use ECKDF whichisbasedontheellipticcurvecryptography[7, 8] for the key derive function; and use cellular automata (CA) based symmetric-key cryptographic algorithm for media content protection ECDSA algorithm is applied in POD to authenticate the Host, which includes three parts: key schedule which is to set up the key, signature procedure, and verification process as illustrated in Figure 3 Elliptic curve Diffie Hellman (ECDH) primitives is the Private key d P Public key Q P POD shared value z P z P = z H Private key d H Public key Q H Host shared value z H Figure 4: ECDH protocol between POD and Host basis for the operation of elliptic curve encryption scheme For the POD, we use this algorithm to exchange the key between the POD and the Host Figure 4 illustrates the flow chart of ECDH algorithm Suppose POD (P) and Host (H) will communicate with each other, and require the key exchange Here we use d P and Q P to represent the P s private key and public key which are obtained from the key schedule d H and Q H denote H s private key and public key, respectively P performs the following steps: /* setup the scheme */ create the elliptic curve; /* compute the elliptic curve point */ V P = (x p,y p ) = d P Q H ;

108 978 EURASIP Journal on Applied Signal Processing return the x component of V P shared secret key (z P ) as the Similarly, H uses the same primitive to get the shared secret key /* setup the scheme */ create the elliptic curve; /* compute the elliptic curve point */ V H = (x h,y h ) = d H Q P ; return the x component of V H as the shared secret key (z H ) By running ECDH algorithm, we have P V = P U That is, two parties get the same secret key In the POD implementation, ECKDF key derivation function is used to generate the common key for content encryption and decryption The following is the description of ECKDF key derivation function: check the length of input data (z); initiate a Counter = 1; for (i = 0; i<n; i ) { /* compute the hash value */ k i = h(z Counter); increment Counter; } set the key K = k 1 k 2 k n, where means concatenation, and h stands for hash function SHA-1 By applying this function, we can generate different key sizes as required In the following, we introduce the cellular automata based symmetric-key cryptography algorithm and how it is applied in POD Cellular automata (CA) is an array of cells where each cell is in any of the permissible states For example, in a 2-state CA, each cell s state can be zero or one In a k-neighborhood CA, at each clock cycle, the evolution of a cell value depends on its rule and the present states of its neighbors The following three CA rules have special characteristics which can be applied in message encryption: Rule 51: x i (t) = x i (t 1), Rule 195: x i (t) = x i 1 (t 1) x i (t 1), Rule 153: x i (t) = x i (t 1) x i1 (t 1) (1) Theorem 1 Applying complemented rules of 195, 153, and 51 toacaformsacagroup[9] Theorem 2 If a CA configures with rules 51, 153 and 195, then its state transition diagram consists of equal cycles of even length Thus, if we choose rules of 51, 153, 195 as a group CA, then the fundamental transformations are self-inverse, that is, the decryption is carried out in the same way as encryption Assuming the rule matrix is T, then we have T 2n = T n T n = I (the identity matrix) (2) Controlbit Different rule Figure 5: Overview of rule applied to message The CA-based block cipher scheme is as follows Encryption Decryption E = T n1 1 T n2 2 T nq q, C = EM (3) M = E 1 C = ( ) T1 n1 T2 n2 Tq nq 1C ( (T n q ) 1 = q ( T n 2 ) 1 ( 2 T n 1 ) 1 ) 1 C = ( T nq q ) T2 n2 T1 n1 C, where T 1,T 2,,T q aresecretcarules,whichcanbereviewed as the subkeys of the block cipher The flexibility of CA based cryptosystem is that by choosing different values of n and q, wecanachievedifferent security levels and data encryption/decryption rates according to the application requirements In Figure 5, the first bit is the rule control bit where 0 stands for rule 51, and 1 stands for rule 195 or 153 which will be selected by the corresponding bit The core procedures of the CA algorithm is described as follows: temp51 = (~Message) & (~Rule); /* Implement the rule 51 */ switch(rule sign) { case 0: temp1 = Message 1; temp195 = (~(Message ^ temp1)) & Rule; temp C Block = temp195; break; case 1: temp2 = Message 1; temp153 = (~(Message ^ temp2)) & Rule; temp C Block = temp153; break; } C Block = temp51 temp C Block Note that cycles used for encryption and decryption can be variable as well in CA based cryptography For example, if we set 2n = 8, that is, the message should be processed by applying 8 times of CA rule during the procedure of encryption and decryption, then we can choose the first four cycles for encryption and the other four cycles for 
decryption, or we can use the first three cycles for encryption and the other five cycles for decryption (4)
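The cyclic-group behaviour behind (2), (3), and (4) can be demonstrated with the word-parallel rule application shown in the code fragment above. The following is a minimal C sketch for a single 32-bit block and a single rule vector (q = 1); the shift direction associated with rules 195/153 and the null boundary supplied by the shifts are assumptions, since only the rule-control convention of Figure 5 is fixed above.

/* Round-trip demonstration of the CA block cipher: cells whose Rule bit is 0
 * evolve under rule 51 (complement), cells whose Rule bit is 1 under rule 195
 * or 153 (complemented XOR with one neighbour, selected by rule_sign).  The
 * transform is a bijection, so iterating it returns to the plaintext after
 * some cycle length L; encryption uses p steps and decryption the remaining
 * L - p steps, mirroring the split described above. */
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

static uint32_t ca_step(uint32_t m, uint32_t rule, int rule_sign)
{
    uint32_t t51  = (~m) & (~rule);                  /* rule 51 cells            */
    uint32_t nb   = rule_sign ? (m >> 1) : (m << 1); /* one neighbour, null edge */
    uint32_t t15x = (~(m ^ nb)) & rule;              /* rule 195 or 153 cells    */
    return t51 | t15x;
}

int main(void)
{
    const uint32_t plaintext = 0x6A09E667u;          /* arbitrary 32-bit block   */
    const uint32_t rule      = 0xB5A3C96Du;          /* secret rule vector (key) */
    const int      rule_sign = 1;

    /* find the cycle length L of this rule vector (the map is a bijection) */
    uint32_t s = plaintext;
    int L = 0;
    do { s = ca_step(s, rule, rule_sign); L++; } while (s != plaintext);

    int p = L / 2;                                   /* encryption steps         */
    uint32_t c = plaintext;
    for (int i = 0; i < p; i++) c = ca_step(c, rule, rule_sign);
    uint32_t d = c;
    for (int i = 0; i < L - p; i++) d = ca_step(d, rule, rule_sign);

    assert(d == plaintext);
    printf("cycle length L = %d, ciphertext = 0x%08X, decrypted OK\n", L, c);
    return 0;
}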

109 A DSP Based POD Implementation for High Speed Multimedia Communications 979 Table 1: Test data for the encryption speed for CA (cycle < 5) Test No rule No cycle No clk En speed (Mbps) The CA based cryptography is used for video content protection in the POD implementation The experiment indicates that the encryption/decryption rate can be up to 75 Mbps, which satisfies the requirement of real-time data transmission in the cable network 33 Implementation The algorithms used in POD are programmed by C language and compiled with C6211 development tools, where the code composer studio compiles and converts the C programs into assembly language Finally, an executable file in out format is produced and loaded into the DSP Tables 1 and 2 list all the data tested from different rules and cycles In these tables, No rule stands for the number of rules, No cycle means the number of process cycles for encryption, No clk is the DSP cycles running this program on DSP, and En speed is the speed of encryption which can be calculated by the following equation: /No clk (Mbps) (5) 4 CONCLUSION POD is the security module to be used in the cable network and digital TV services Its main function is to provide the cryptographic protocol in the interface between the POD and the Host, and to protect the content passing through the interface In this paper, a DSP based POD implementation is proposed by using TMS320C6211 The experiment indicates that the proposed POD implementation provides high data speed and flexibility to real-time applications In order to get the different degree of security and different speed of encryption/decryption, we use a simple symmetric key encryption Table 2: Test data for the encryption speed for CA (cycle 5) Test No rule No cycle No clk En speed (Mbps) algorithm cellular automata cryptography for the content protection, whose encryption/decryption rate can be up to 75 Mbps REFERENCES [1] OpenCable TM, OpenCable TM POD Copy Protection System, IS-POD-CP-INT , January 2000 [2] OpenCable TM, OpenCable TM Host-POD Interface Specification, IS-POD-131-INT , October 1999 [3] Hitachi, Intel, MEI, Sony and Toshiba companies, Digital Transmission Content Protection Specification (Informational Version) Revision 10, vol 1, April 1999 [4] National Institute of Standards and Technology(NIST), Secure Hash Standard (SHS), FIPS Publication 180-1, April 1995 [5] Texas Instruments, How to Begin Development Today with the TMS320C6211 DSP, Application report, SPRA474, September 1998 [6] Texas Instruments, TMS320C6000 Optimizing C Compiler User s Guide, Digital signal processing solutions, 1999 [7] Certicom research, Standards for Efficient Cryptography, SEC 1: Elliptic Curve Cryptography, Working Draft, Version 05, Certicom Corp, 1999 [8] M Rosing, Implementing Elliptic Curve Cryptography, Manning Publications, Greenwich, Conn, USA, 1999 [9] S Nandi, B K Kar, and P Pal Chaudhuri, Theory and applications of cellular automata in cryptography, IEEE Trans on Computers, vol 43, no 12, pp , 1994 Chang Nian Zhang received his BS degree in applied mathematics from University of Science Technology, China, and the PhD degree in computer science and engineering from Southern Methodist University In 1998, he joined Concordia University as a research assistant professor in Department of Computer Science Since 1990, he has been with University of Regina, Canada, in Department of Computer Science Currently he is a full professor and leads a research group in parallel processing, data security, and neural networks

110 980 EURASIP Journal on Applied Signal Processing Hua Li received his BE and MS degrees from Beijing Polytechnic University and Peking University He is a PhD candidate in the Department of Computer Science, University of Regina Currently, he works as an assistant professor at Department of Mathematics and Computer Science, University of Lethbridge, Canada His research interests include parallel systems, reconfigurable computing, fault-tolerant, VLSI design, and information and network security He is a member of IEEE Nuannuan Zhang was a graduate student in Department of Computer Science, University of Regina from September 1998 to September 2000 Jiesheng Xie is a professor in the Department of Computer, China Agriculture University, Beijing He was a visiting professor in the Department of Computer Science, University of Regina from September 1999 to August 2000
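For prototyping on a host, the ECKDF loop of Section 3.2 above (k_i = h(z || Counter), K = k_1 || k_2 || ... || k_n) can be written directly on top of any SHA-1 implementation. The following is a minimal C sketch using OpenSSL's SHA1(); the 4-byte big-endian counter encoding and the placeholder shared secret are assumptions, not values taken from the specification. Compile with -lcrypto.

/* Host-side prototype of the ECKDF key derivation: the ECDH shared secret z
 * is hashed together with an incrementing counter, and the SHA-1 outputs are
 * concatenated until the requested key length is reached. */
#include <stdio.h>
#include <string.h>
#include <stdint.h>
#include <openssl/sha.h>

static void eckdf(const unsigned char *z, size_t zlen,
                  unsigned char *key, size_t keylen)
{
    unsigned char buf[256], digest[SHA_DIGEST_LENGTH];  /* assumes zlen + 4 <= 256 */
    uint32_t counter = 1;
    size_t off = 0;

    while (off < keylen) {
        memcpy(buf, z, zlen);                      /* z || Counter (big endian) */
        buf[zlen + 0] = (unsigned char)(counter >> 24);
        buf[zlen + 1] = (unsigned char)(counter >> 16);
        buf[zlen + 2] = (unsigned char)(counter >> 8);
        buf[zlen + 3] = (unsigned char)(counter);
        SHA1(buf, zlen + 4, digest);               /* k_i = h(z || Counter)     */

        size_t take = keylen - off;
        if (take > SHA_DIGEST_LENGTH) take = SHA_DIGEST_LENGTH;
        memcpy(key + off, digest, take);           /* K = k_1 || k_2 || ...     */
        off += take;
        counter++;
    }
}

int main(void)
{
    unsigned char z[20] = { 0x01, 0x23, 0x45 };    /* placeholder ECDH secret   */
    unsigned char key[32];                         /* e.g. a 256-bit content key */

    eckdf(z, sizeof z, key, sizeof key);
    for (size_t i = 0; i < sizeof key; i++) printf("%02x", key[i]);
    printf("\n");
    return 0;
}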

111 EURASIP Journal on Applied Signal Processing 2002:9, c 2002 Hindawi Publishing Corporation Wavelet Kernels on a DSP: A Comparison Between Lifting and Filter Banks for Image Coding Stefano Gnavi CERCOM, Center for Multimedia Radio Communications, Dipartimento di Elettronica, Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino, Italy gnavi@mailtlcpolitoit Barbara Penna CERCOM, Center for Multimedia Radio Communications, Dipartimento di Elettronica, Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino, Italy penna@mailtlcpolitoit Marco Grangetto CERCOM, Center for Multimedia Radio Communications, Dipartimento di Elettronica, Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino, Italy grangetto@politoit Enrico Magli CERCOM, Center for Multimedia Radio Communications, Dipartimento di Elettronica, Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino, Italy magli@politoit Gabriella Olmo CERCOM, Center for Multimedia Radio Communications, Dipartimento di Elettronica, Politecnico di Torino, Corso Duca degli Abruzzi 24, Torino, Italy olmo@politoit Received 30 August 2001 and in revised form 30 April 2002 We develop wavelet engines on a digital signal processors (DSP) platform, the target application being image and intraframe video compression by means of the forthcoming JPEG2000 and Motion-JPEG2000 standards We describe two implementations, based on the lifting scheme and the filter bank scheme, respectively, and we present experimental results on code profiling In particular, we address the following problems: (1) evaluating the execution speed of a wavelet engine on a modern DSP; (2) comparing the actual execution speed of the lifting scheme and the filter bank scheme with the theoretical results; (3) using the on-board direct memory access (DMA) to possibly optimize the execution speed The results allow to assess the performance of a modern DSP in the image coding task, as well as to compare the lifting and filter bank performance in a realistic application scenario Finally, guidelines for optimizing the code efficiency are provided by investigating the possibleuseof the on-board DMA Keywords and phrases: wavelet, lifting scheme, filter bank, JPEG2000, DSP 1 INTRODUCTION A huge number of applications use the discrete wavelet transform (DWT) [1] as a means to extract relevant featuresfrom signals Examples are reported in the fields of mathematics, physics, numerical computing, and engineering, including image classification, feature detection, image denoising, image registration, and image compression, just to mention a few Especially in the engineering field, there has been a considerable interest in using wavelet transforms for image and video coding applications [2, 3] As a result, the ISO/ITU-T has selected the DWT as the transform coding kernel for the new image compression standard, namely JPEG2000 [4], which will be released during 2001 Consequently, fast and cost-effective implementations of DWT kernels, compliant with JPEG2000 specifications, are called for in order to make its diffusion as widespread as possible While the wavelet transform of an image can be fairly

112 982 EURASIP Journal on Applied Signal Processing easily computed by means of a general-purpose personal computer, there obviously exist contexts where more compact, light-weight and less power-demanding computing devices are required A recent trend [5] fosters the design of reconfigurable systems that make use of digital signal processors (DSPs) and field-programmable gate array (FPGA) An example is given by the transmission of images from scientific space missions, where the images collected by the onboard sensors may undergo wavelet-based compression (eg, Rosetta Osiris [6]), with a DSP-based system being used as computational core DSPs are also very often used to handle image and video processing tasks in consumer electronics [7] The importance of wavelets on a DSP is witnessed by the number of implementations proposed in the literature (cf [8, 9]) In this paper, we focus on the study of the DSPbased implementation of a wavelet kernel; the target application is image coding with JPEG2000, with its extension to intraframe video coding (Motion-JPEG2000) Until recently, DWT implementations were based on the so-called filter bank scheme [1], which computes the DWT of a signal by iterating a sequence of highpass and lowpass filtering steps, followed by downsampling In 1997 Sweldens proposed a new scheme, called lifting scheme (LS), as an alternative way to compute the DWT [10]TheLShasimmediately obtained a noteworthy success, as it provides several advantages with respect to the filter bank scheme The most interesting ones from the implementation standpoint are that (i) the LS requires less operations than the filter bank scheme, with a saving of up to one half for very long filters; (ii) the LS allows to compute an integer wavelet transform (IWT), that is, a wavelet transform that maps integers to integers [11], thus enabling the design of embedded lossless and lossy image encoders [12, 13] This paper is focused on the development of a wavelet kernel based on the LS, and using a DSP as the computational core The interest of this work is manifold Firstly, from a pure implementation perspective, the performance evaluation of an optimized implementation of such a kernel on a modern DSP indicates the maximum sustainable processing rate This can be used to estimate the number of images per second that can be processed by, for example, a compression engine such as JPEG2000, or the video frame rate that can be sustained by a Motion-JPEG2000 encoder/decoder in DSPbased applications, for example, videoconferencing by means of a PC card Secondly, since the LS can be used to design a progressive lossy-to-lossless compression algorithm [13], it is important to evaluate the execution speed of the IWT with respect to the DWT Thirdly, and distinctively novel in this paper, from a signal processing point of view, there is a strong interest in finding out to which degree the theoretically lower complexity of the LS translates into reduced execution speed; in fact, it is likely that the DSP architecture affects the performance of lifting and filter bank wavelet cores in a different fashion In this work all aspects are considered, that is, an optimized DSP implementation of the LS is presented, and its performance is then compared with that of the fil G H 2 2 G H 2 2 HP HP Figure 1: Block diagram of the filter bank scheme ter bank scheme; both DWT and IWT are considered The results allow to assess the impact of the DSP architecture on the performance of both algorithms, thus providing useful guidelines 
for the architectural design and implementation of a wavelet-based processing system.

This paper is organized as follows. In Section 2, we briefly review the wavelet transform, focusing on the filter bank scheme and the LS in Sections 2.1 and 2.2, respectively. The DSP implementations of both algorithms are described in Section 3. In Section 4, a performance evaluation of such implementations is proposed; in particular, results related to the execution speed are reported in Section 4.1, and a comparison between the LS and the filter bank scheme is presented in Section 4.2. The possibility of improving performance by means of direct memory access (DMA) is discussed in Section 4.3. Finally, in Section 5 conclusions are drawn.

2 WAVELET TRANSFORM

As already stated, the two main algorithms used to compute the DWT are the filter bank scheme and the LS, which are briefly reviewed in Sections 2.1 and 2.2, respectively.

2.1 Filter bank scheme

The filter bank scheme (see [1]) is sketched in Figure 1, where the operations needed to compute the DWT of a one-dimensional signal are depicted. One level of decomposition requires that the input sequence be highpass and lowpass filtered by the analysis filters H(z) and G(z); the two resulting sequences are then downsampled by a factor of two. More decomposition levels can be obtained by iterating this procedure on the lowpass branch, as shown in Figure 1. The two-dimensional extension is achieved by filtering and downsampling first along the rows, and then along the columns. The inverse transform is achieved by performing a similar sequence of filtering and upsampling operations (see [1]).

Figure 1: Block diagram of the filter bank scheme.
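For illustration, the following C sketch computes one analysis level of the filter bank scheme described above: the input vector is convolved with the lowpass and highpass kernels, and only every second output sample is computed, which implements the downsampling by two. The function and variable names are illustrative and do not come from the paper's code; zero padding is used at the borders for brevity, whereas the implementation discussed in Section 3 also supports symmetric extension.

#include <stddef.h>

/* One level of 1D analysis with the filter bank scheme: x[] (length n, n even)
   is filtered by the lowpass kernel h[] (nh taps) and the highpass kernel g[]
   (ng taps); only the even-indexed outputs are evaluated, so the downsampled
   lowpass and highpass bands lp[] and hp[] (each of length n/2) are obtained
   directly, without computing samples that would be discarded. */
static void fb_analysis_1d(const float *x, size_t n,
                           const float *h, size_t nh,
                           const float *g, size_t ng,
                           float *lp, float *hp)
{
    for (size_t k = 0; k < n / 2; k++) {
        float acc_l = 0.0f, acc_h = 0.0f;
        for (size_t j = 0; j < nh; j++) {                 /* lowpass convolution  */
            long idx = (long)(2 * k) + (long)(nh / 2) - (long)j;
            if (idx >= 0 && idx < (long)n)                /* zero padding outside */
                acc_l += h[j] * x[idx];
        }
        for (size_t j = 0; j < ng; j++) {                 /* highpass convolution */
            long idx = (long)(2 * k) + (long)(ng / 2) - (long)j;
            if (idx >= 0 && idx < (long)n)
                acc_h += g[j] * x[idx];
        }
        lp[k] = acc_l;
        hp[k] = acc_h;
    }
}

Further decomposition levels are obtained by calling the same routine again on lp[], and the two-dimensional transform applies it first to every row and then to every column, as described above.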

Figure 2: Block diagram of the LS.

2.2 Lifting scheme

As is well known [1], a discrete-time filter can be represented by its polyphase matrix, which is built from the transforms of the even and odd samples of its impulse response. The LS stems from the observation [14] that the polyphase matrix can be factorized, leading to the implementation of one step of the filter bank scheme as a cascade of shorter filters, which act on the even and odd signal samples, followed by a normalization. In particular, the LS performs a sequence of primal and dual lifting steps, as described in the following and reported in the block diagram of Figure 2. The inverse transform is achieved by performing the same steps in reversed order [14].

The polyphase representation of a discrete-time filter H(z) is defined as

H(z) = H_e(z^2) + z^{-1} H_o(z^2),   (1)

where H_e(z) and H_o(z) are obtained from the even and odd coefficients, respectively, of h[n] = Z^{-1}{H(z)}, where Z denotes the z-transform. The synthesis filters H(z) and G(z) (lowpass and highpass, respectively) can thus be expressed in terms of their polyphase matrix (matrix rows separated by semicolons)

P(z) = [ H_e(z)  G_e(z) ; H_o(z)  G_o(z) ],   (2)

and the analysis polyphase matrix P̃(z) can be defined analogously for the analysis filters. The Euclidean algorithm [14] can be used to decompose P(z) and P̃(z) as

P(z) = \prod_{i=1}^{m} [ 1  s_i(z) ; 0  1 ] [ 1  0 ; t_i(z)  1 ] [ K  0 ; 0  1/K ],
P̃(z) = \prod_{i=1}^{m} [ 1  -s_i(z^{-1}) ; 0  1 ] [ 1  0 ; -t_i(z^{-1})  1 ] [ 1/K  0 ; 0  K ].   (3)

This factorization leads to the sequence of primal and dual lifting steps shown in Figure 2. The filters H_e(z), H_o(z), G_e(z), and G_o(z), along with their analysis counterparts, are Laurent polynomials [14]. Since the set of all Laurent polynomials exhibits a commutative ring structure, within which polynomial division with remainder is possible, long division between two Laurent polynomials is not a unique operation [14]. Therefore, several different factorizations (i.e., pairs of {s_i(z)} and {t_i(z)} filters) may exist for each wavelet. However, in the case of the DWT implementation, all possible choices are equivalent. An IWT, mapping integers onto integers, can be obtained very simply by rounding off the output of the s_i(z) and t_i(z) filters right before adding or subtracting [11]; the rounding operation introduces a nonlinearity in each filter operation. As a consequence, in the IWT the choice of the factorization has an impact on both lossless and lossy compression, making the transition from the DWT to the IWT not straightforward [15].

Table 1: Computational cost (number of multiplications plus additions) of lifting versus filter banks.
Filter       Standard algorithm   Lifting scheme
LEG(5,3)     4(N + M) + 2         2(N + M + 2)
DB(9,7)      4(N + M) + 2         2(N + M + 2)
SWE(13,7)    3(N + Ñ) + 2         3/2(N + Ñ)

As already stated, the LS requires fewer operations than the filter bank scheme. The latter algorithm corresponds to merely applying the polyphase matrix: only the samples that are not discarded by the subsequent downsampling operation are actually filtered. In order to compare the two algorithms, one can use the number of multiplications and additions required to output a pair of samples, one on the lowpass and one on the highpass branch. As shown in [14], the cost of lifting tends, asymptotically for long filters, to one half of the cost of the standard algorithm. Table 1 reports the formulas, presented in [14], used to compute the cost of the two algorithms for the filters used in this work (see also Section 3). Here h and g denote the degrees of the highpass and lowpass filters (i.e., the number of coefficients minus one); in the case that h and g are even, we set h = 2N and g = 2M. Note that the filter SWE(13,7), being an interpolating filter, has a different formula, which also involves the number of vanishing moments Ñ.
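As a concrete illustration of a lifting factorization, the C sketch below implements the two lifting steps of the LEG(5,3) filter in its reversible (IWT) form, that is, the well-known predict/update pair used by JPEG2000 for the 5/3 transform; the floor operations are what make the transform map integers to integers. The function name and the boundary handling are illustrative choices, not taken from the paper's code.

/* Forward LEG(5,3) integer lifting on a vector x[] of even length n:
   d[] receives the n/2 highpass (detail) samples, s[] the n/2 lowpass
   (smooth) samples. Borders are handled by mirroring. The right shifts are
   used as floor divisions, assuming the compiler implements arithmetic
   shifts on signed operands (true for most DSP compilers). Dropping the
   rounding terms (and using the factors -1/2 and 1/4) gives the non-integer
   DWT version of the same factorization. */
static void ls53_forward_int(const int *x, int n, int *s, int *d)
{
    int half = n / 2;
    /* dual lifting (predict): d[i] = x_odd[i] - floor((x_even[i] + x_even[i+1]) / 2) */
    for (int i = 0; i < half; i++) {
        int right = (2 * i + 2 < n) ? x[2 * i + 2] : x[2 * i];   /* mirror */
        d[i] = x[2 * i + 1] - ((x[2 * i] + right) >> 1);
    }
    /* primal lifting (update): s[i] = x_even[i] + floor((d[i-1] + d[i] + 2) / 4) */
    for (int i = 0; i < half; i++) {
        int left = (i > 0) ? d[i - 1] : d[i];                    /* mirror */
        s[i] = x[2 * i] + ((left + d[i] + 2) >> 2);
    }
}

The inverse transform simply runs the same two steps in reverse order with the signs flipped, which is why the scheme is reversible regardless of the rounding.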
3 IMPLEMENTATION

The LS and the filter bank scheme have been implemented on the floating-point Texas Instruments TMS320C6711 DSP board. The board comprises a 150 MHz floating-point processor, two memory regions, namely on-chip and off-chip, a direct memory access (DMA) controller, and some peripheral interfaces. The CPU core includes two sets of 16 registers (register file A and register file B), the on-chip memory, divided into two cache memories (L1 and L2), and the arithmetic and logic units (see [16]). Figure 3 shows the block diagram of the DSP architecture.

In the following, we outline some features of our implementation of the two algorithms, including the filters and the types of boundary extension used. The LS implementation is compliant with the specifications of the Final Committee Draft of JPEG2000 Part I (core coding system) [4], which is, at the time of this writing, the latest publicly available document describing the standard.

Figure 3: Block diagram of the DSP architecture (CPU with register files A and B and the ALUs, L1 and L2 caches, DMA controller, external memory interface, and peripheral devices).

Table 2: Factorization of LEG(5,3); s_i(z), t_i(z) = a_0 z^{d_M} + a_1 z^{d_M-1} + a_2 z^{d_M-2} + a_3 z^{d_M-3}. Rows: s_1(z), t_1(z), s_2(z); columns: d_M, a_0, a_1, a_2, a_3, K.

Table 3: Factorization of DB(9,7); same layout as Table 2. Rows: s_1(z), t_1(z), s_2(z), t_2(z), s_3(z).

Table 4: Factorization of SWE(13,7); same layout as Table 2. Rows: s_1(z), t_1(z), s_2(z).

The code profiling results for the two algorithms, reported in Section 4, have been obtained by leaving the optimization of the assembly code to the C compiler, which is known to nearly achieve the same efficiency as an expert programmer. For this reason, the code has been written in a simple and plain style, so as to facilitate compiler optimization. Therefore, in this section we only give an overview of the implementations of the two algorithms, whereas we rather concentrate on the profiling results (Section 4), which represent the main contribution of this article. Of course, one could achieve some performance improvement by constraining the implementations, for example, to support a limited number of filters (even only one); nevertheless, this approach would negatively impact the generality of the application, which in this work has been preserved as far as possible.

As for boundary extension at the borders of the input signal, which is necessary because the wavelet filters are noncausal, two possible extensions are considered. (i) Symmetric extension: it performs mirroring of the signal samples outside the signal support. If used with biorthogonal symmetric filters, it makes it possible to achieve perfect reconstruction also at the image borders. (ii) Zero padding: it consists of adding zeros before and after the signal. This extension is not supported by the JPEG2000 standard, but it is very simple, and hence sometimes used.

The filters supported by this implementation have been selected according to the JPEG2000 standard: LeGall I(5,3) (LEG(5,3) in the following); Daubechies (9,7) (DB(9,7) in the following); Sweldens (13,7) (SWE(13,7) in the following). The first two filters are explicitly embodied in JPEG2000, for the reversible and nonreversible transform, respectively. Note that the last filter is not supported by JPEG2000 Part I. However, it has been considered because, being a long filter, it makes it possible to verify the asymptotic complexity of the LS. The selection of the factorization of the wavelet filters to be used in the LS has been made following the directives of the JPEG2000 standard, and is reported in Tables 2, 3, and 4. The filter length, deducible from the acronym, makes it easy to identify the N and M parameters previously defined. As for the filter bank scheme, the input signal is filtered by the same kernels listed above, but using the expanded rather than the factorized representation. Notice that, while performing the convolution between the signal and the filter impulse response, the samples that would be discarded by downsampling are not computed at all. For completeness, Tables 5, 6, and 7 report the coefficients of the filters employed, up to the fourth decimal digit.

Table 5: LEG(5,3) filter.
Table 6: DB(9,7) filter.
Table 7: SWE(13,7) filter.
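As a small illustration of the symmetric extension mentioned above, the helper below maps an arbitrary sample index onto the valid range by whole-sample mirroring about the first and last samples; it is an illustrative sketch, not code from the paper, and it could replace the zero-padding check used in the earlier convolution sketch.

/* Whole-sample symmetric (mirror) extension for a signal of length n:
   indices outside [0, n-1] are reflected about the borders, with period
   2*(n-1), so that x[mirror_index(i, n)] is always a valid access. */
static int mirror_index(int i, int n)
{
    if (n == 1) return 0;
    int period = 2 * (n - 1);
    i %= period;
    if (i < 0) i += period;
    return (i < n) ? i : period - i;
}

With biorthogonal symmetric filters such as LEG(5,3) and DB(9,7), this extension preserves perfect reconstruction at the image borders, which is why it is the one mandated by JPEG2000.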
4 EXPERIMENTAL RESULTS

As stated in Section 1, the objective of this work is manifold; in particular, experimental tests have been carried out with the following goals. (1) To evaluate the absolute running time of an LS-based wavelet kernel on a modern DSP; in particular, in view of

the implementation of an embedded lossy-to-lossless image compression system, to understand to what degree embodying an IWT capability may penalize the execution speed. This matter is discussed in Section 4.1. (2) To find out how close to the theoretical value is the actual performance gain of the LS with respect to the filter bank scheme, in terms of execution speed. This matter is discussed in Section 4.2. (3) To study the possibility of exploiting an available on-board DMA in order to speed up code execution. This matter is discussed in Section 4.3.

The results shown in the following, and the comparison between the LS and the filter bank scheme, are reported (see Sections 4.1 and 4.2) in terms of the time needed to perform one level of transform on one image row; this has been done so as to facilitate the interpretation of the results. The results have been parameterized on the length of the input data vector, and execution times for dyadic lengths are reported. It has been found that the sum of such dyadic values yields a very accurate estimate of the multilevel transform. Of course, computing the wavelet transform of an image requires both rowwise and columnwise filtering. However, it has been found that the time needed to compute a columnwise filtering is the same as for rowwise filtering. Even though this behavior might seem surprising at first glance, it can be reasonably justified by the efficient management of memory accesses performed by the cache memory; a more detailed explanation is given in Section 4.3.

4.1 Absolute running times

The graphs in Figures 4, 5, and 6 report the absolute running times achieved by the LS (in both the DWT and the IWT mode) and by the filter bank scheme in order to compute the one-level wavelet transform of a one-dimensional data vector contiguously stored in the external memory. The boundary extension used is the symmetric one. The results reported on the graphs can be used to estimate the number of images per second that can be processed by these algorithms. Employing the LEG(5,3) filter and the symmetric extension for a complete 2D one-level decomposition of a grey-scale image, the LS is able to process between 7 and 8 images per second, whereas the filter bank scheme only sustains between 4 and 5 images per second. Note that computing the IWT, which rounds off the filtered coefficients in the LS, leads to slower operation: the running times of the IWT are from 10% to 25% larger than those of the DWT using the LS.

If the wavelet kernel is thought of as the core of a JPEG2000 encoder, it is worth recalling that the wavelet transform is responsible for a significant part of the total encoder and decoder running time. Some figures have been obtained by profiling the JasPer reference JPEG2000 implementation, and have been reported in [17]. It turns out that, for progressive lossless coding, the wavelet transform is responsible for about 30% of the overall encoder and decoder running time. In the progressive lossy case this percentage increases to about 50% at the encoder, and 70% at the decoder. This implies that it should be possible to encode/decode, with a single DSP, about images per second in the integer lossless mode (using the LEG(5,3) filter), and to encode and decode about 2 images per second in the lossy mode using the DB(9,7) filter. While these figures are suitable for an image coding application, it turns out that more powerful hardware, such as a multi-DSP system or an FPGA, is required to sustain real-time Motion-JPEG2000 video.
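The mapping from the per-row times of Figures 4-6 to a whole-image figure can be sketched as follows; the function below simply combines the measured dyadic-length row times in the way described above (one row transform per row plus one column transform per column at each level, summed over the levels of the shrinking lowpass band). The table of measured times and the function name are placeholders, not values taken from the paper.

/* Estimate the time of a `levels`-level 2D transform of a 2^lr x 2^lc image
   from per-row measurements: t_row[k] is the measured time (in seconds) of a
   one-level 1D transform on a vector of length 2^k. Column and row transforms
   of equal length were found to take the same time, and level l works on the
   2^(lr-l) x 2^(lc-l) lowpass band produced by the previous level.
   The throughput in images per second is roughly 1.0 / estimate_2d_time(...). */
static double estimate_2d_time(const double *t_row, int lr, int lc, int levels)
{
    double total = 0.0;
    for (int l = 0; l < levels; l++) {
        double rows = (double)(1 << (lr - l));
        double cols = (double)(1 << (lc - l));
        total += rows * t_row[lc - l]    /* rowwise pass: `rows` vectors of length 2^(lc-l) */
               + cols * t_row[lr - l];   /* columnwise pass: `cols` vectors of length 2^(lr-l) */
    }
    return total;
}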
4.2 Comparison between lifting and filter bank

As stated, in [14] it is claimed that the LS asymptotically requires half the number of operations with respect to the filter bank scheme. We have compared the running times of our LS and filter bank implementations in order to understand how the DSP architecture affects the performance gain. In particular, Table 8 reports the ratios between the running time of the filter bank scheme and that of the LS. Comparing these figures with the theoretical results, it can be noticed that the ratios differ from the theoretical values. This behavior can be explained by considering the architectural features of the processor employed.

The DSP used in this work has an efficient pipeline, which can dispatch 8 parallel instructions per cycle. Parallel instructions proceed simultaneously through each pipeline phase, whereas serial instructions proceed through the pipeline with a fixed relative phase difference between instructions. Every time a jump to an instruction not belonging to the pipeline occurs, the pipeline must be emptied and reloaded. Thus, in this case, the filtering operations that frequently update the pipeline contents turn out to be disadvantaged.

Figure 4: LEG(5,3): absolute running times (seconds versus number of samples, for the LS, the filter bank, and the IWT).
Figure 5: DB(9,7): absolute running times.
Figure 6: SWE(13,7): absolute running times.

Table 8: Ratios between the running times of the filter bank scheme and the LS (filter bank running time / LS running time as a function of the number of samples, for LEG(5,3), DB(9,7), and SWE(13,7), together with the theoretical value).

The effect on the computation of the wavelet transform is that, in general, the convolution with a long kernel can be optimized more efficiently than several convolutions with short kernels. Therefore there is a trade-off, in that the filter bank must perform twice as many operations as the LS with long filters; on the other hand, the use of the pipeline hampers the LS operation, since the factorizations of long filters may consist of numerous short filters. The best results, with regard to the gain, are obtained with the SWE(13,7) filter: even though the filter is long, its factorization consists of only 2 filters, with 4 coefficients each.¹ The opposite occurs with the DB(9,7) filter, whose factorization consists of 4 filters with 2 coefficients each. The gain that comes from the smaller number of operations in the LS is thus lost in emptying and reloading the pipeline. The LEG(5,3) filter has an intermediate behavior. Note that, for an increasing number of samples, the ratio between the running times of the two algorithms is neither constant nor linearly increasing. This behavior is due to the way the processor manages the cache memory and the data transfer from external to internal memory.

4.3 Optimization with DMA

The results in Section 4.2 have shown that the LS is faster than the filter bank scheme for the computation of the wavelet transform. These results have been obtained using implementations that leave to the CPU the data transfer from the external memory to the CPU itself for performing the convolutions. In the following, we focus on the architectural features of the DSP employed, investigating the possibility of improving the LS performance by exploiting the properties of the DMA, typically available on a DSP board.

The LS program previously described filters a vector of coefficients allocated in a region of external (off-chip) memory. On the other hand, the DSP has a two-level internal (on-chip) cache, with significantly lower access time than the external memory. The second-level cache (L2) can be configured as internal memory, and can be used to store and filter the image pixel values, with an expected speedup due to the reduced memory access time. Since the size of an image is usually larger than the L2 size, it is necessary to transfer the data in small blocks from the external to the internal memory.

¹ It is worth noticing that even higher gains (nearly 3) have been found with the SWE(13,7) filter, using a fixed-point implementation that is not addressed in this paper. This is not surprising, since the upper bound of 2 on the LS gain [14] is computed in the case of the worst-case factorization, while the SWE(13,7) filter also admits shorter factorizations.

The device used for this purpose is the DMA controller.

Figure 7: Ping-pong buffering (the DMA transfers data to one buffer while the CPU filters the other; the two buffers are swapped until the end of the image is reached).

Table 9: Comparison between the running times of the LS without and with the DMA (absolute running times in seconds, standard LS versus LS with DMA).

Table 10: Comparison between the running times of the LS without and with the DMA, transferring several rows simultaneously (absolute running times in seconds, standard LS versus LS with DMA, for 2 and 4 rows).

In this work, the DMA has been configured so as to transfer a row (or column) of the image into the on-chip memory while, at the same time, the CPU filters the data transferred at the previous step. In this way, the CPU never accesses the off-chip memory, since both the stack and the temporary variables are allocated in L2. The use of L2 as a shared resource between the DMA and the CPU involves the need to synchronize these devices. The data can be corrupted if the accesses to L2 do not take place in the correct order. To avoid this problem, we have employed four software interrupts to regulate the sequence of operations. Moreover, the two concurrent devices are set to operate on two different buffers, which are swapped at each filtering cycle with a ping-pong buffering mechanism (see Figure 7).

We have run this version of the LS on a grey-scale image, performing one complete 2D decomposition level with the DB(9,7) filter. This has led to the results shown in Table 9, where the running time of the standard LS algorithm is also reported for comparison. It can be noticed that the synchronization of the devices and the reconfiguration of the DMA after each transfer lead to a higher running time. In order to make the employment of the DMA and L2 advantageous, it is necessary to reduce the number of DMA reconfigurations. This can be done by transferring more than one row or column at a time. Table 10 shows that transferring, for example, 2 or 4 rows simultaneously yields an improvement of the LS performance. However, the gain is not as high as expected, and hardly pays back the additional complexity.

The reason for such a low gain using the DMA lies in the efficiency of the DSP cache memory. In fact, every time the CPU needs a datum stored in the external memory, 32 consecutive bytes are transferred from the memory to L1. If the offset between two data processed sequentially by the CPU is less than 32 bytes (i.e., 8 floating-point coefficients), the CPU accesses the memory only once, since the second datum will already be cached in L1. The advantage of accessing a faster memory is apparent only when the overall weight of the memory accesses is high. We assume that the image pixel values are stored in the external memory as floating-point values in row-major order. As for rowwise filtering, one access to the external memory is sufficient to retrieve eight samples of the to-be-filtered data. As far as columnwise filtering is concerned, once a complete image column has been retrieved from the external memory, the subsequent seven columns are also placed in the cache.² Moreover, in the specific case of the wavelet transform, most of the time is spent by the processor in computing the convolution, that is, sums and products between the filter coefficients and the image pixel values; the filtering routine is computationally heavy, so that the weight of the access operations is not very high. Therefore, the actual number of accesses to the external memory turns out to be quite limited, and their weight on the program running time is accordingly low. In summary, the performance improvement in the wavelet transform computation that can be obtained by employing the DMA is limited because of the efficiency of the on-chip cache.

² This holds provided that the cache memory is large enough to store eight columns, as usually happens in practice.
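For reference, the ping-pong mechanism of Figure 7 can be sketched in C as follows. The DMA and filtering routines are placeholder names (not the actual TI chip-support-library calls or the paper's code); the sketch only shows the buffer-swapping logic, and it omits the write-back of the filtered rows and the software-interrupt synchronization described above.

/* Placeholder declarations for the board-specific services used below. */
extern void dma_start_copy(float *dst, const float *src, int n);
extern void dma_wait_for_completion(void);
extern void filter_row_ls(float *row, int n);

/* Ping-pong buffering: while the CPU filters the row held in one L2 buffer,
   the DMA fills the other one with the next row from external memory; the
   two buffers are swapped at every iteration. */
static void transform_rows_pingpong(float *ext_image, int rows, int cols,
                                    float *buf0, float *buf1)  /* both in L2 */
{
    float *cur = buf0, *next = buf1;
    dma_start_copy(cur, ext_image, cols);                 /* prime the pipeline */
    for (int r = 0; r < rows; r++) {
        dma_wait_for_completion();                        /* row r is now in cur */
        if (r + 1 < rows)                                 /* prefetch row r + 1  */
            dma_start_copy(next, ext_image + (long)(r + 1) * cols, cols);
        filter_row_ls(cur, cols);                         /* CPU works on cur    */
        float *tmp = cur; cur = next; next = tmp;         /* ping-pong swap      */
    }
}

With one row per transfer the DMA reconfiguration overhead dominates, which is why Table 10 considers transfers of 2 or 4 rows at a time.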
5 CONCLUSIONS

In this paper, we have addressed the development of wavelet cores on a DSP, compatible with the JPEG2000 specifications. The wavelet transform has been implemented according to the filter bank scheme and the lifting scheme; in the latter case the integer-transform option has also been considered. The code has been profiled so as to evaluate the efficiency of the implementation and, more interestingly, to allow a comparison between the LS and the filter bank scheme. Moreover, the use of the DMA has also been considered as a possible way to improve data throughput. The results have highlighted some aspects of DSP-based implementations of the wavelet transform, which are discussed in the following.

(1) The DSP considered in this work is able to compute up to 8 complete 2D one-level wavelet transforms per second on a grey-scale image. This figure can be used to evaluate the number of JPEG2000 frames that a single DSP is able to code or decode, for example, using the JPEG2000 profiling results reported in [17].

(2) A performance comparison between lifting and filter banks has been carried out. We have found that the LS always runs faster than the filter bank scheme. However, the performance gain differs from the theoretical results in [14], because the DSP architecture has a different impact on code optimization for the two algorithms. In particular, convolutions with long filters, which are typical of the filter bank scheme, tend to benefit from the DSP pipelined architecture. On the other hand, the LS gain is higher for long filters. In the end, the actual gain heavily depends on the number and length of the factorized filters used in the LS.

(3) It has turned out that employing the DMA to transfer data from the external to the internal memory (and vice versa), while the CPU concurrently filters the previously transferred data, may provide very little advantage, if any at all, in terms of execution speed. This is due to the fact that the on-chip cache memory is able to manage the data transfer operations very efficiently, for both rowwise and columnwise filtering.

ACKNOWLEDGMENT

This work was partially developed under the Texas Instruments Elite program.

REFERENCES

[1] M. Vetterli and J. Kovačević, Wavelets and Subband Coding, Prentice-Hall, Englewood Cliffs, NJ, USA, 1995.
[2] M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies, "Image coding using wavelet transform," IEEE Trans. Image Processing, vol. 1, no. 2, pp. , 1992.
[3] D. Lazar and A. Averbuch, "Wavelet-based video coder via bit allocation," IEEE Trans. Circuits and Systems for Video Technology, vol. 11, no. 7, pp. , 2001.
[4] D. S. Taubman and M. W. Marcellin, JPEG2000: Image Compression Fundamentals, Standards, and Practice, Kluwer Academic Publishers, Dordrecht, The Netherlands, 2001.
[5] J. Eyre and J. Bier, "The evolution of DSP processors," IEEE Signal Processing Magazine, vol. 17, no. 2, pp. 43–51, 2000.
[6] B. Fiethe, P. Rüffer, and F. Gliem, "Image processing for Rosetta Osiris," in 6th International Workshop on Digital Signal Processing Techniques for Space Applications, vol. 144, ESTEC, Noordwijk, The Netherlands, September 1998.
[7] J. Eyre, "The digital signal processor derby," IEEE Spectrum, vol. 38, no. 6, pp. 62–68, 2001.
[8] K. Haapala, P. Kolinummi, T. Hamalainen, and J. Saarinen, "Parallel DSP implementation of wavelet transform in image compression," in Proc. IEEE International Symposium on Circuits and Systems, pp. 89–92, Geneva, Switzerland, May 2000.
[9] B. Yiliang, W. Houng-Jyh, C.-C. J. Kuo, and R. Chung, "Design of a memory-scalable wavelet-based image codec," in Proc. IEEE International Conference on Image Processing, Chicago, Ill, USA, October 1998.
[10] W. Sweldens, "The lifting scheme: A construction of second generation wavelets," SIAM J. Math. Anal., vol. 29, no. 2, pp. , 1997.
[11] R. C. Calderbank, I. Daubechies, W. Sweldens, and B. Yeo, "Wavelet transforms that map integers to integers," Applied and Computational Harmonic Analysis, vol. 5, no. 3, pp. , 1998.
[12] A. Bilgin, P. Sementilli, F. Sheng, and M. Marcellin, "Scalable image coding using reversible integer wavelet transforms," IEEE Trans. Image Processing, vol. 9, no. 11, pp. , 2000.
[13] M. Grangetto, E. Magli, and G. Olmo, "Efficient common-core lossless and lossy image coder based on integer wavelets," Signal Processing, vol. 81, no. 2, pp. , 2001.
[14] I. Daubechies and W. Sweldens, "Factoring wavelet transforms into lifting steps," J. Fourier Anal. Appl., vol. 4, no. 3, pp. , 1998.
[15] M. Grangetto, E. Magli, and G. Olmo, "Minimally nonlinear integer wavelets for image coding," in Proc. IEEE Int. Conf. Acoustics, Speech, Signal Processing, Istanbul, Turkey, June 2000.
[16] Document SPRU189F, "TMS320C6000 CPU and instruction set reference guide," October 2000, www.ti.com.
[17] M. D. Adams and F. Kossentini, "JasPer: A software-based JPEG-2000 codec implementation," in Proc. IEEE International Conference on Image Processing, vol. 2, pp. 53–56, Vancouver, BC, Canada, October 2000.

Stefano Gnavi was born in Biella, Italy, in March 1976. He received the degree in electrical engineering from Politecnico di Torino, Italy, in July 2001. Since March 2002 he has been a researcher under grant with the Center for Wireless Multimedia Communications (CERCOM), at the Department of Electronics, Politecnico di Torino. His research interests are in the field of image communication, video processing and compression, as well as hardware implementation. Currently he is working on very low bit rate video coding techniques.

Barbara Penna was born in Castellamonte, Italy, in May 1976. She received the degree in electrical engineering from Politecnico di Torino, Italy, in July 2001. Since September 2001 she has been a researcher under grant with the Signal Analysis and Simulation (SAS) group, at the Department of Electronics, Politecnico di Torino. Her research interests are in the field of data compression. Currently she is working on novel SAR raw data compression algorithms based on wavelet transforms.

Marco Grangetto received the summa cum laude degree in electrical engineering from Politecnico di Torino in 1999, where he is currently pursuing a PhD degree. His research interests are in the field of digital signal processing and multimedia communications. In particular, he is working on the development of efficient and low-complexity lossy and lossless image encoders based on wavelet transforms. Moreover, he is addressing the design of reliable multimedia delivery systems for tetherless lossy packet networking. He was awarded the Premio Optime by the Unione Industriale di Torino in September 2000, and a Fulbright grant in 2001 for a research period at the Center for Wireless Communications (CWC) at UCSD.

Enrico Magli received the degree in electronics engineering in 1997, and the PhD degree in electrical and communications engineering in 2001, from Politecnico di Torino, Turin, Italy. He is currently a postdoctoral researcher at the same university. His research interests are in the field of robust wireless communications, compression of remote sensing images, superresolution imaging, and pattern detection and recognition. In particular, he is involved in the study of compression and detection algorithms for aerial and satellite images, and of signal processing techniques for environmental surveillance from unmanned aerial vehicles. From March to August 2000 he was a visiting researcher at the Signal Processing Laboratory of the Swiss Federal Institute of Technology (EPFL), Lausanne, Switzerland.

Gabriella Olmo received the Laurea degree (cum laude) and the PhD in electronic engineering from Politecnico di Torino in 1986 and 1992, respectively. From 1986 to 1988 she was a researcher with CSELT (Centro Studi e Laboratori in Telecomunicazioni), Turin, working on network management, non-hierarchical models, and dynamic routing. Since 1991, she has been an Assistant Professor at Politecnico di Torino, where she is a member of the Telecommunications group and the Image Processing Lab. Her main recent interests are in the field of wavelets, remote sensing, image and video coding, resilient multimedia transmission, joint source-channel coding, and stratospheric platforms. She has joined several national and international research programs under contracts by Inmarsat, ESA (European Space Agency), ASI (Italian Space Agency), and the European Community. She has coauthored more than 80 papers in international scientific journals and conference proceedings.

EURASIP Journal on Applied Signal Processing 2002:9, © 2002 Hindawi Publishing Corporation

AVSynDEx: A Rapid Prototyping Process Dedicated to the Implementation of Digital Image Processing Applications on Multi-DSP and FPGA Architectures

Virginie Fresse, CNRS UMR IETR (Institut en Electronique et Télécommunications de Rennes), INSA Rennes, 20 avenue des Buttes de Coësmes, CS 14315, Rennes Cedex, France, vfresse@insa-rennes.fr
Olivier Déforges, CNRS UMR IETR (Institut en Electronique et Télécommunications de Rennes), INSA Rennes, 20 avenue des Buttes de Coësmes, CS 14315, Rennes Cedex, France, odeforge@insa-rennes.fr
Jean-François Nezan, CNRS UMR IETR (Institut en Electronique et Télécommunications de Rennes), INSA Rennes, 20 avenue des Buttes de Coësmes, CS 14315, Rennes Cedex, France, jnezan@insa-rennes.fr

Received 31 August 2001 and in revised form 12 May 2002

We present AVSynDEx (concatenation of AVS + SynDEx), a rapid prototyping process aiming at the implementation of digital signal processing applications on mixed architectures (multi-DSP + FPGA). This process is based on the use of widely available and efficient CAD tools established along the design process, so that most of the implementation tasks become automatic. These tools and architectures are judiciously selected and integrated during the implementation process to help a signal processing specialist without relevant hardware experience. We have automated the translation between the different levels of the process in order to speed it up and make it more reliable. One main advantage is that only a signal processing designer is needed, all the other specialized manual tasks being transparent in this prototyping methodology, thereby reducing the implementation time.

Keywords and phrases: rapid prototyping process, multi-DSP + FPGA architecture, CAD environment, image processing applications.

1 INTRODUCTION

The prolific evolution of telecommunication, wireless, and multimedia technologies has sustained the requirement for the development of increasingly complex integrated systems. Indeed, digital signal processing applications, including image processing, have become more and more complex, thereby demanding much greater computational performance. This aspect is especially crucial for certain real-time applications. To validate a new technique, functionality alone is not sufficient: the algorithm has to be executed within a limited time. Until recently, the first approach to meet this requirement was to optimize the algorithm, a task that a digital signal or image designer could carry out. Nevertheless, this solution quickly became inadequate and, in parallel with the algorithm development, the implementation aspect must be taken into account. The use of parallel architectures distributes and thereby accelerates the execution of the application. Currently, one of the best solutions is mixed platforms integrating a combination of standard programmable processors and a hardware part containing components like FPGAs, ASICs, or ASIPs. It has been demonstrated in [1, 2, 3] that such architectures are well suited for complex digital image processing applications: the distribution between both parts is generally done by implementing the elementary and regular operations in the hardware part, the other processing steps being processed by the software part. These platforms can deliver higher performance, but this heterogeneous aspect requires both software and hardware engineering skills. The result is the rapid execution of an application on such architectures, but the implementation process becomes long and is quite complex:
several different specialized engineers are needed for each part of the platform, and this separate parallel implementation poses the risk that the software and hardware designs diverge at the end of the process and

121 Rapid Prototyping Process for Image Processing Application Implementations on Mixed Architectures 991 lose their initial correlation Moreover, it is difficult to manage all the tasks, especially the shared resources and the associated synchronizations [4, 5] The negative side of such implementations on complex and parallel architecture is the long development time emerging from the number of specialized people involved in the process The signal processing designer does not have any more the sufficient skills to supervise the complete development, the error detection at each level becoming more difficult The intervention of several engineers involves a task partitioning between the different parts at the beginning of the process, and a partitioning modification is very difficult as another complete implementation is often required Most of computer-aided design (CAD) environments dedicated to the implementation on parallel and mixed architectures are called codesign tools [6, 7, 8] and they integrate a manual and arbitrary partitioning based on the designer experience The present codesign tools can address the problem of implementation either on multistandard processors, or one processor and a dedicated hardware, but not both of them, which is the topic of this paper An example of codesign tool is POLIS (which is presented in [9]) This tool is dedicated to embedded systems, which support the control flow description The representation is CSFM, codesign finite state machine and the advantage of this one is its independence with the target architecture There is also Chinook, which is dedicated to reactive real-time systems (as explained in [10]) The objective of this work is to propose a full rapid prototyping process (AVSynDEx) by means of existing academic, commercial CAD tools and platforms A translator between the CAD environments allows going automatically through the process The prototyping methodology enables a digital signal or image-processing designer to create the application with a usual development environment (advanced visual system) and then to supervise the implementation on a mixed architecture without any other necessary skills AVSynDEx is open-ended and can realize the partitioning between software/hardware target at the highest level of the application description The approach consists of starting with a customary environment used by the digital signal processing developer Then the integration of a distributed executive generator SynDEx (synchronised distributed executive) leads to an optimized implementation on a parallel and mixed platform The target architecture combines a multi-dsp (digital signal processor) part with an FPGA (field programmable gate array) platform A main characteristic is the presence of checking points at each level of the implementation process accelerating the development time: the designer can check and correct his design immediately without waiting for the implementation This aspect gives the designer the possibility to easily and quickly modify the algorithm or to change its implementation This prototyping process leads to a low production cost: SynDEx is a free academic CAD tool and the multi-dsp and FPGA board is profitable compared to the development, time and cost (including raw material, specialized engineers and specific equipment) for a new platform Moreover, this prototyping process can integrate the new versions of the CAD tools and also can use their new performances The remainder of this paper is organized into 5 sections Section 
2 gives an overview of the prototyping process by briefly introducing the CAD environments as well as the target architecturesection 3 details all these elements The prototyping methodology is fully described by explaining the compatibility requirements between the levels of the process, and introducing the automatic translator By way of process illustration, the implementation of an image compression algorithm LAR is given in Section 5 Section 6 concludes the paper 2 OVERVIEW OF THE PROTOTYPING PROCESS The prototyping process (Figure 1) aims to a quasi-automatic implementation of digital signal or image applications on parallel and mixed platform The target architecture can be homogeneous (multiprocessor part) or heterogeneous (multi-dsp FPGA) A real-time distributed and optimized executive is generated according to the target platform The digital image designer creates the data flow graph by means of the graphical development tool AVS This CAD software enables the user to achieve a functional validation of the application Then, an automatic translator converts this information into a new data flow graph directly compatible with the second CAD tool, SynDEx This last tool schedules and distributes the data flow graph according to the parallel architecture and generates an optimized and distributed executive This executive is loaded onto the platform by using GODSP, a loader and debugger tool These tools are quite simple and the links between them are automatic, accelerating the prototyping process The target applications are complex digital image processing algorithms, whose functions possess different granularity levels The partitioning consists generally of implementing the regular and elementary operations on the FPGA part and the higher-level operations on the multi-dsp part The prototyping process has the advantage to take this partitioning aspect into account and to ensure an adjustable and quickly modifiable decision The image-processing designer is not restricted to one implementation and the partitioning modifications are quickly realized Three elements are necessary for this prototyping process: the CAD tools, the target architecture, and the links for an automatic process 3 PRESENTATION OF THE INTEGRATED CAD TOOLS AND THE MIXED PLATFORM Several computer-aided design environments are used and judiciously integrated in the prototyping process Two main CAD tools are necessary: AVS for the functional description and validation, and SynDEx for the generation of

122 992 EURASIP Journal on Applied Signal Processing Data flow graph AVS Automatic translator SynDEx Sequential executive Workstation PC Sequential executive distributed executive GODSP Multi-DSP FPGA board Figure 1: The prototyping process It consists of one graphical image development tool AVS, SynDEx, which is dedicated to the generation of parallel and optimized executive and a loader-debugger GODSP The links between these CAD environments are automatic The data flow graph is implemented on a multi-dsp FPGA board a distributed and optimized executive A third tool, a translator between AVS and SynDEx realized the automatic link The target architecture is a mixed platform with a multi- DSP part and an FPGA one 31 AVS: advanced visual system AVS (advanced visual system) is a high-level environment for the development and the functional validation of graphical applications [11] It provides powerful visualization methods, such as color, shape, and size for accurate information about data, as shown in Figure 2 The AVS environment (Figure 3) contains several module libraries located on top and a workspace dedicated to the application developments These algorithms are constructed by inserting existing modules or user modules into the workspace A module is linked to its input and output images and their corresponding types Each module calls a C, C, or Fortran function and the associated library files During a modification of an existing function, the module is immediately updated and the algorithm as well All these modules are connected by input and output ports to constitute the global application in the form of a static data flow graph In the following, we consider that traded data are mainly images represented as one-dimensional array AVS includes a subset of visualization modules for data visualization, image processing, and user-interface design A main advantage is the automatic visualization of intermediate and resulting images at the input and output of each module This characteristic enables the image-processing designer to check and validate the functionality of the application before the implementation step 32 SynDEx SynDEx is an academic system-level CAD tool [13, 14] This free tool is an academic environment designed and developed at INRIA, Rocquencourt France and several national laboratories take part in this project as we does SynDEx is an efficient environment, which uses the AAA methodology to generate a distributed and optimized executive dedicated to parallel architectures AAA stands for algorithm architecture adéquation, adéquation is a French word meaning an efficient matching(notethatitisdifferent from the English word adequacy, Figure 2: Examples of AVS applications Above, the Tracking Money Launderers and below, Bright Forecast at the National Weather Service [12] These examples use the color, size, and shape for data visualization which involves a sufficient matching) [15] The purpose of this methodology is to find the best matching between one algorithm and a specific architecture while satisfying constraints This methodology is based on graph models to exhibit both the potential parallelism of the algorithm and the available parallelism of the hardware architecture This is formalized in term of graph transformations Heuristics take into account execution times, durations of computations, and intercomponent communications are used to optimize real-time performances and resources allocation of embedded real-time applications The result of graph transformations is an 
optimized executive built from a library of architecture-dependent executive primitives composing the executive kernel. There is one executive kernel for each

123 Rapid Prototyping Process for Image Processing Application Implementations on Mixed Architectures 993 Figure 3: The AVS environment Above, the libraries containing the available modules A rectangle is a defined module The designer can create and insert the modules into those libraries The red and pink marks represent the input and output; the color indicates the type of the ports Below, the workspace for the algorithm creation The visualization of an image is done by inserting the Uviewer2D module and connecting to the target module (here, it is the output module OUT) supported processor These primitives support boot loading, memory allocation, intercomponent communications, sequentialisation of user supplied computation functions, of intercomponent communications and intersequences synchronization SynDEx ensures the following tasks [16] (i) Specification of an application algorithm as a conditioned data flow graph (or interface with the compiler of one of the Synchronous languages ESTEREL, LUSTRE, SIG- NAL through the common format DC) The algorithm is described as a software graph (ii) Specification of the multicomponent architecture as a hardware graph (iii) Heuristic for distributing and scheduling the algorithm on the architecture with response time optimization (iv) Visualization of predicted real-time performances for the multicomponent sizing (v) Generation of deadlock-free executives for real-time execution on the multicomponent with optional real-time performance measurement These executives are built from a processor-dependent executive kernel [17] SynDEx comes currently with executives kernels for digital signal processor and microcontroler: SHARC-ADSP21060, TMS320C4x, Transputer-T80X, i80386, i8051, i80c96, MC68332, and for workstations: UNIX/C/TCP/IP (SUN, DEC, SGI, HP, PC Linux) Executive kernels for other processors can be easily ported from the existing ones The shared resources and the synchronizations are taken into account SynDEx transfers images by using static memory whose allocation is optimized The current development work with SynDEx is to refine the communication media and to target the use of this tool to mixed architectures (including FPGA and other ASIC) So this evolution will be still coherent for future complex products The SynDEx environment is shown in Figure 4 The edition view contains two graphs: the hardware architecture above and the software graph below The hardware graph represents the target architecture with the hardware components and the physical links whereas the software graph is the data flow graph of the application: the vertex is a task (compiled sequence of instructions), and each edge is datadependent between the output of an operation and the input of another task The vertex is defined by means of information such as the input and output images, the size and type of these images, the name of the corresponding C-function and the time execution If the execution time is not known, a first random value must be affected for every function Any random value can be used but a by default value is chosen to obtain an automatic translation and generate the executive by means of SynDEx Then, SynDEx generates a first sequential executive on a monoprocessor implementation in order to determine the real task time The granularity level of the graph has an impact on the final implementation: many vertices lead to more parallelism but also increase the data communication cost SynDEx generates a timing diagram Figure 5, according to the hardware and 
software graphs. The schedule view includes one column for each processor and one line for each

124 994 EURASIP Journal on Applied Signal Processing current C-development tools Once the functional validation is done, it has to be split into several functions for the Syn- DEx data flow graph; each function represents an edge in the SynDEx data flow graph The resulting data flow graph can be checked only after the implementation on the target platform The manually transformation of the SynDEx data flow graph can generate some mistakes Moreover, optimization is quite long and not automatic: the image processing designer has to work with the initial algorithm, the SynDEx tool and the transformations between these two data flow graphs Figure 4: SynDEx CAD software: the workspace Above, a target architecture with 4 DSP, C4, C3, C2, and root Root is the DSP, which is dedicated to the video (grab and display image): the input and output functions are processed by this processor The physical connections are represented Below, the software graph (algorithm) contains processing tasks (edges) and the data dependencies (vertices) Ee and Se are respectively the input and output images Figure 5: The SynDEx timing diagram Each column represents the task allocation for one DSP (Moy1, Ngr1, and Eros1 are treated by the DSP C2 and the video modules Ee and Se are effectively implemented on the video DSP root) The size of each rectangle is the execution time for every task and the lines between the columns indicate the communication transfers This diagram shows the parallelism of the algorithm and the timing estimation communication medium The timing diagram describes the distribution (spatial allocation) and the scheduling (temporal allocation) of tasks on processors, and of interprocessor data transfers on communication media Time is flowing from top at bottom and the height of each box is proportional to the execution duration of the corresponding operation SynDEx is an efficient tool for the implementation on parallel architectures but is not an application development tool environment as it is not able to simulate the software graph Without any front-end tool, the signal processing designer has to create a sequential C-application by means of 33 The multi-dsp-fpga platform The development of this own parallel architecture is a very complex task while suppliers offer powerful solutions It is the reason why we have opted for a commercial product The choice of the target architecture has been directed by different motivations Firstly, the platform had to be generic enough in order to integrate most of the possible image applications Secondly, it had to be modular to be able to evolve with time Thirdly, the architecture programming had to be at sufficient level in order to be interfaced with SynDEx Fourthly, the cost had to be reasonable to represent a realistic solution for processing speedup The experimental target architecture is a multi-dsp and FPGA platform based on a Sundance PCI motherboard, Sundance Multiprocessor technology Ltd, Chiltern House, Werside, Chesham, UK, whose characteristics enable the user to obtain a coherent and flexible architecture Two Texas Instrument Modules (TIM) constitute the multi-dsp part [18, 19] The first one integrates two TMS320C44 processors to carry out the processing The second module is a frame-grabber containing one TMS320C40 DSP When it is not used for a video processing, this DSP can run image as well An additional FPGA part (Figure 6) is designed as a reconfigurable coprocessor for TMS320C4x based-systems and is associated to this multi-dsp platform This is 
a MIROTECH X-C436 board [19] integrating a XC4036 FPGA, fully compatible with the TIM specifications: this module can be directly integrated onto the motherboard This FPGA is used as two virtual processing elements called VPE Each VPE is considered as one FPGA XC4013, the rest of the full FPGA being used for the communication port management: all external transfers between the modules and the multi-dsp architecture are resynchronized by a communication port manager (CPM) The designers of this board propose this solution, which will be used for the prototyping process and they give the existing cores (a core is a processing task for one VPE) for this configuration Nevertheless, it can be extended to different size of VPE and different number of VPE on the sole condition that the imageprocessing designer or a hardware engineer creates the specific cores Each VPE is connected to a C4x processor via direct links and DMA The target topology of the platform is shown in Figure 7 This host processor does the FPGA module configuration and the data transfers Thus, the use of the dedicated module is straightforward as it consists only in functions calls

125 Rapid Prototyping Process for Image Processing Application Implementations on Mixed Architectures 995 CPM Comms port CSU JTAG ILINK VPE1 VPE2 Figure 6: The X-CIM architecture The FPGA includes 2 VPE dedicated to achieve the image processing ILINK is a direct link between both The CPM is responsible for the communications between DSP and FPGA A supervisor, Configuration and Shutdown Unit, controls the clock and the reset VPE1 VPE2 FPGA Root Figure 7: The proposed multi-dsp FPGA topology The multi-dsp part contains 3 DSP called root, P1, and P2, root being the video processor The black squares are the physical links between DSP The FPGA part is the coprocessor for the root DSP Each VPE have one input and one output, which are connected to the root DSP The FPGA part (grayed) is fully transparent to the user and the functions are managed by the root processor inside the DSP code A configuration step is necessary before the processing step The configuration step includes the initialization of the module (link specifications between the VPE and DSP, license affectation), the parameters for each core (data size, image size, ), core assignations for each VPE and the module configuration (implementation of all previous configurations on the board) Afterwards, the processing can be achieved in 3 possible ways (i) Transparent communication mode The first solution consists of using one instruction corresponding to the target processing The input and output images are the only necessary parameters: this only instruction ensures to send the input images to FPGA and then to receive the output images at the end of the execution This unique instruction is easy to use but prevents the processor running a parallel function (ii) Low-level communication mode In a second approach, the user gives the input and output image by using some pointers and sends the images pixel per pixel, which are immediately processed and then sent back With this method, a function is time-consuming and the processor cannot run another function in the same time SRAM P1 P2 (iii) DMA communication mode The last way consists of sending the input image via the DMA Specific instructions enable the designer to associate the image, to read and write in the DMA, and to wait for the end of the processing The advantage is that the processor can execute another process at the same time For all these processing possibilities, specific libraries contain cores and specific instructions for the FPGA configuration and use The library declaration is inserted in these functions The configuration and processing tasks can be separated and included in different functions The configuration time of the dedicated module is long (about 26 seconds), limiting to a static use of the coprocessor to two type of operation 4 PROTOTYPING METHODOLOGY FOR MIXED AND PARALLEL ARCHITECTURES AVSynDEx is a prototyping process aiming to go automatically from the functional AVS description to the distributed execution over the multi-dsp or mixed architecture (see Figure 1) It implies first to guaranty a full compatibility between the elements throw the process, and then to generate automatic links The general requirements for a multi-dsp implementation are listed first, before the specific ones linked to the dedicated target 41 Compatibility for multi-dsp architectures 411 SynDEx-multi-DSP platform SynDEx can handle the multi-dsp architecture once the synchronization primitives, memory management, and communication schemes have been realized for the type of processors 
involved Architecture configurations such as frame grabber initialization are gathered in an INIT file executed once at the beginning of the application running 412 AVS-SynDEx SynDEx and AVS present a similar semantic in terms of application description with the use of static data flow graph Nevertheless, some particularities have to be dealt with Restrictions in SynDEx description Only a few data types are defined in SynDEx (Boolean, integer,real,),andthedimensionofthearraysmustbefixed The same rules have to be applied for the AVS graphs C-functions associated to processing vertices For both graphs, each vertex can be associated to a C- function A skeleton of the function is generally created by AVS when editing a new module, containing specific AVS instructions to interface the user code to the global application To be compiled in the SynDEx environments, all these instructions have to be removed Specific vertices In and Out functions are of course dependent on the platform In the AVS environment, IN and OUT correspond to

Specific vertices. In and Out functions are of course dependent on the platform. In the AVS environment, IN and OUT correspond to read and write image files, whereas they are linked to video capture and display for SynDEx. Besides the processing vertices, SynDEx also defines three specific ones:

(i) Memory: a storage element acting as a FIFO whose depth is variable.
(ii) When: allows building a conditional graph (the following part of the graph is executed if an input condition is asserted).
(iii) Default: selects, between two inputs, the one to be transmitted depending on an input condition.

The corresponding AVS primitives have been designed in V (the low-level language of AVS) to keep the whole SynDEx potential of graph management.

4.2 Compatibility specific to the FPGA module

4.2.1 SynDEx-mixed platform

As the FPGA module management is carried out by a host DSP, the adopted solution consists of representing the use of a VPE in the data flow graph by a dedicated vertex linked to the host. The associated function contains only the reference to the core, according to the transparent communication mode. The essential advantage is that the data flow graph remains unchanged compared to the multi-DSP case: whatever the target architecture is (software or hardware), a task is specified by its inputs, outputs, and executive time. The FPGA module configuration is also stored in the global INIT file.

4.2.2 AVS-SynDEx

In order to get an equivalent functional graph, a functionally equivalent C-function has to be developed for each available core, gathered in a library. It is an easy task as the low-level treatments correspond generally to simple algorithms. For a multi-DSP architecture only, the function can be directly reused and implemented on a DSP. When using the FPGA module, it has to be replaced by the call to the core.

4.3 The automatic translator

Figure 8: Presentation of the automatic translator. Inputs: the AVS data flow graph with its c-files and h-files; outputs: the SynDEx software and hardware graphs, the DSP configuration file, and the associated c-files, h-files, and cores.

The fulfillment of the compatibility between the prototyping process stages allows going from the functional description to the parallel implementation. By designing a translator between AVS and SynDEx, the process is performed automatically. The automatic translator is designed with the Lex and Yacc tools: the first one filters the necessary parts in a sequence, whereas the second one transforms an input chain into another one. The translator realizes the following tasks, as shown in Figure 8:

(i) Transforms the AVS data flow graph syntax into a SynDEx one.
(ii) Looks for the user c-files and h-files associated to each module and cleans them of specific AVS instructions.
(iii) Transforms Memory, When, and Default primitives into SynDEx ones.
(iv) Generates the constraints (e.g., IN and OUT linked to the host DSP).
(v) Adds automatically the target architecture (hardware graph).
(vi) Generates the INIT file for the multi-DSP configuration.

Moreover, the translator is a key element in the codesign process. A flag is associated to each core-equivalent AVS module that indicates whether the target is a DSP or the hardware module. In the first case, the c-file and the h-files are fetched and reused for the generation of the executive. In the second case, these files are replaced with the corresponding core and the FPGA configuration is added to the INIT file. Thus, the allocation/partitioning tasks are easily done in the functional environment. Another field of the AVS modules contains the execution time of the operation if it is known (otherwise a random value is assigned).
The timing information is not needed in the AVS description and is inserted in the module as a comment. This information is not used to determine the overall functionality of the AVS description: AVS makes no difference between two similar C-functions whose timing information differs. Nevertheless, the image processing designer can decide which partitioning (software/hardware module) is more efficient thanks to this timing information. Another reason is the use of this information in the SynDEx data flow graph: the automatic translator needs it to generate the SynDEx data flow graph. This feature has generally already been determined for the cores. This time is also copied out into the SynDEx description.
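The flag and timing fields could, for instance, take the following form in the C-function attached to a core-equivalent module. This is only an illustrative sketch: the exact syntax of the AVS module fields and of the comment read by the translator is not given in the paper, so the annotations below are hypothetical.

```c
/* Hypothetical annotations on a core-equivalent module, as the
   automatic translator might read them:
   TARGET  : DSP | FPGA  (flag selecting the software or hardware core)
   EXECTIME: measured execution time in microseconds, copied into the
             SynDEx graph (a random value is assumed when unknown)    */

/* TARGET:   FPGA */
/* EXECTIME: 1200 us  (placeholder value) */
void Er3x3(const unsigned char *in, unsigned char *out);
```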

Figure 9: The AVSynDEx prototyping methodology for mixed and parallel architectures. The starting point is the specification for a data flow graph creation. Two runs of the implementation process are necessary: the first one is the chronometrical report by means of a sequential executive (left part); it is necessary only for new C-functions and is removed in the other case. The second one is the implementation on the mixed platform. The links between the CAD tools are automatic and the designer supervises all the implementation steps.

4.4 Prototyping process

The implementation process is simple, requiring a maximum of two development runs, presented in Figure 9. For new user C-modules, their executive time has first to be determined in order to eventually get an optimized parallel implementation. It is done by first considering a mono-DSP target, where the user constrains all the tasks of the software graph to be associated to the root processor. The executive generated by SynDEx is at this step only sequential. The loader GODSP ensures the implementation of the application and the report of the chronological information. Then, the designer only has to copy out these times into the AVS modules. If the application is made of already valued C-modules, this first run of the process is of course useless.

Once the algorithm is functionally validated and the partitioning is decided by the designer, the automatic translator generates the new SynDEx description associating the C-functions and the cores. From now on, the hardware graph is the multi-DSP architecture. SynDEx schedules and distributes the algorithm and gives the resulting timing diagram. The user can choose to modify the partitioning in AVS or to run the application on the mixed platform.

The main advantage of this prototyping process is its simplicity, as most of the tasks realized by the users concern the application description within their conventional environment. The required knowledge of SynDEx and of the loader is limited to simple operations. Other front-end development tools can be used in the process instead of AVS as far as they present a similar semantic for the application description; Ptolemy-related works can be found in [20, 21]. AVSynDEx can be adapted to other architectures as well.

5 IMPLEMENTATION OF AN IMAGE COMPRESSION ALGORITHM

A new image compression algorithm has been developed in our laboratory: its implementation on a mixed architecture provides a validation of our fast prototyping methodology. This algorithm, called LAR (locally adaptive resolution) [20], is an efficient technique well suited for image transmission via the Internet or for embedded systems. Basically, the LAR method was dedicated to gray-level still image compression, but extensions have also been proposed for colour images and videos [22].

5.1 Principle of the compression

The basic idea of the LAR method is that the local resolution (pixel size) can depend on the activity: when the luminance is locally uniform, the resolution can be low (large pixel size); when the activity is high, the resolution has to be finer (smaller pixel size). A first coder is an original spatial technique and achieves a high compression ratio.
It can be used as a stand-alone technique, or complemented with a second coder allowing to encode the error image from the first coder topology description; this second one is based on an optimal block-size DCT transform. This study concerns only the first spatial coder. Figure 10a presents its global process. The image is first subsampled by squares representing local trees. Then, each one is split according to a quadtree scheme depending on the local activity (edge presence). The finest resolution is typically 2 × 2 squares. The image can be reconstructed by associating to each square the corresponding average luminance in the source image. The image content information given through the square size is used advantageously for the luminance quantization: large squares require a fine quantization, as they are located in uniform areas (strong sensitivity of the human eye to brightness variations); small ones support a coarse quantization, as they lie upon edges (low sensitivity). Size and luminance are both encoded by an adaptive arithmetic entropic encoder. The average cost is less than 4 bits per square.

Figure 10: (a) Global scheme of the spatial LAR coder: nonuniform subsampling, blocks average, blocks quantization, differential entropic coding of the gray-level blocks, and entropic coding of the grid. (b) Decomposition of the nonuniform function: successive 3 × 3 erosions and dilations, each followed by a threshold, give the stationarity within blocks of growing size (3 × 3, 5 × 5, ...).

5.2 Functional description of the application by means of AVS

In order to obtain the best implementation of the data flow graph on the mixed architecture, the image processing designer has to exhibit both the elementary operations available in the core library and the additional data parallelism allowed by some tasks. All the decisions and modifications are achieved only at the functional level (AVS data flow representation).

In the LAR method, the block stationarity property is evaluated by a morphological gradient followed by a threshold. A morphological gradient is defined as the difference between the dilated value (maximal value in a predefined neighbourhood) and the eroded value (minimal value in the same neighbourhood). A low resulting value indicates a flat region; a high one shows off an edge in the neighbourhood. By computing this stationarity estimation using growing neighbourhood surfaces (2 × 2, 4 × 4, 8 × 8, and 16 × 16), it is possible to choose the maximal block size to represent the region while keeping the stationarity property. The major drawback of this approach is that the morphological operators complexity is proportional to the neighbourhood size: an erosion/dilation upon a 16 × 16 block then requires 256 operations per pixel. To reduce the complexity, one generally uses the Minkowski addition, performing an operation upon a large neighbourhood as successive operations upon smaller neighbourhoods [23]. As 3 × 3 erosion and dilation operators are available in the core library, the graph modifications at this stage have consisted of decomposing the global morphological operations into iterative elementary ones (see Figure 10b). Data parallelism has also been pointed out for multiprocessing purposes, as most of the other functions are localised into blocks.

The algorithm development and optimizations are achieved using the AVS tool. The designer can develop the application and can easily refine the granularity of several functions. The data flow graph algorithm of the LAR application is developed and the resulting AVS data flow graph is shown in Figure 11. AVS enables the image processing designer to check the functionality of the new algorithm, as shown in Figure 12.
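The following C sketch illustrates the decomposition discussed above: a single 3 × 3 erosion and dilation pass, and the idea of iterating the 3 × 3 operators so that the morphological gradient can be thresholded over growing neighbourhoods. It is a minimal illustration written for this description, not the LAR code itself; the image size, border handling, and threshold value are assumptions.

```c
/* Minimal sketch of the LAR stationarity test, assuming an 8-bit image
   of fixed size and ignoring refinements of the real coder. */
#define W 256
#define H 256

static unsigned char min3x3(const unsigned char *p, int x, int y)
{
    unsigned char m = 255;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            int xx = x + dx, yy = y + dy;
            if (xx >= 0 && xx < W && yy >= 0 && yy < H && p[yy * W + xx] < m)
                m = p[yy * W + xx];
        }
    return m;
}

static unsigned char max3x3(const unsigned char *p, int x, int y)
{
    unsigned char m = 0;
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            int xx = x + dx, yy = y + dy;
            if (xx >= 0 && xx < W && yy >= 0 && yy < H && p[yy * W + xx] > m)
                m = p[yy * W + xx];
        }
    return m;
}

/* One elementary 3x3 erosion (or dilation) pass over the whole image. */
static void erode3x3(const unsigned char *in, unsigned char *out)
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            out[y * W + x] = min3x3(in, x, y);
}

static void dilate3x3(const unsigned char *in, unsigned char *out)
{
    for (int y = 0; y < H; y++)
        for (int x = 0; x < W; x++)
            out[y * W + x] = max3x3(in, x, y);
}

/* Iterating the 3x3 operators enlarges the neighbourhood step by step
   (Minkowski decomposition); the gradient dilated - eroded is thresholded
   after each pass to decide the largest block size still stationary. */
void stationarity_maps(const unsigned char *src,
                       unsigned char ero[4][W * H],
                       unsigned char dil[4][W * H],
                       unsigned char stationary[4][W * H],
                       unsigned char threshold)
{
    erode3x3(src, ero[0]);
    dilate3x3(src, dil[0]);
    for (int s = 1; s < 4; s++) {            /* growing neighbourhoods */
        erode3x3(ero[s - 1], ero[s]);
        dilate3x3(dil[s - 1], dil[s]);
    }
    for (int s = 0; s < 4; s++)
        for (int i = 0; i < W * H; i++)
            stationary[s][i] =
                (unsigned char)((dil[s][i] - ero[s][i]) < threshold);
}
```

In the application graph, these elementary passes correspond to the Er3×3 and Dil3×3 vertices, which can then be allocated either to a DSP C-function or to the corresponding FPGA core.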
5.3 Implementation on the multi-C4x-FPGA platform

According to the presented prototyping process, the automatic translator generates the corresponding SynDEx data flow graph and the associated files. A first monoprocessor implementation is required for the chronological measurements, and the modules are specified to be software modules. SynDEx generates a sequential executive for the implementation on the root processor. The designer makes the chronometrical reports (Table 1) and inserts these new times in the AVS data flow. The time corresponding to a C4x-processor implementation is 3.21 seconds and represents the reference for the parallel one.
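The chronometrical report amounts to timing each module once on the root DSP. A generic way to obtain such per-module times is sketched below; the timer routine name and its resolution are hypothetical placeholders for whatever the GODSP loader and the board support library actually provide.

```c
/* Hypothetical timing wrapper for the chronometrical report.
   read_timer_us() stands for a board-specific microsecond counter;
   it is not an actual GODSP or SynDEx routine. */
extern unsigned long read_timer_us(void);

unsigned long time_module(void (*module)(const unsigned char *, unsigned char *),
                          const unsigned char *in, unsigned char *out)
{
    unsigned long t0 = read_timer_us();
    module(in, out);                 /* run the vertex once */
    return read_timer_us() - t0;     /* elapsed time in microseconds */
}
```

The measured value is then copied by hand into the timing field of the corresponding AVS module, as described in Section 4.3.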

Table 1: Modules execution time, in microseconds, of each function (IN, Er 3 × 3 (× 4), Dil 3 × 3 (× 4), the two GrStep functions, BlAver, DPCM, SiZcod, GrayCod, FilGray, FilSiz, and OUT) on the DSP C4x, the FPGA, and the DSP C6x.

Figure 11: Presentation of the LAR algorithm under the AVS environment. On top, the new modules are inserted into libraries (right). Below, the data flow graph of the LAR application. The output image is visualized by means of the Uviewer2D module: it is the Lena image.

Figure 12: Visualization of images in the AVS environment: (a) original image (8 bits per pixel), (b) nonuniform grid, (c) reconstructed image, (d) reconstructed image after postprocessing (0.18 bits per pixel, PSNR 28.5 dB). Image display is available via an Uviewer2D module.

The second choice of architecture is to use only three C4x DSPs, as represented in Figure 13. The modification consists only of removing the constraint allocating all tasks to only one DSP in SynDEx. The best distribution according to SynDEx uses only 2 C4x processors (root and P1); the timing diagram shows that the global time would be longer, or at least not more efficient, in the case of a 3-DSP implementation. The resulting time for this implementation is 1.74 seconds.

As the 3 × 3 erosion and dilation cores are available, the last solution consists in the use of the FPGA to perform these operations. A comparison with the software implementation shows that the hardware one is approximately 100 times faster. Note that the architecture limits the number of cores in an application, but these cores can be used several times (several identical vertices in the graph). Changing the target flag of the equivalent AVS modules and running the translator again leads to a new SynDEx input file. The Er3×3 and Dil3×3 C-functions are replaced by the calls to the corresponding cores. The SynDEx data flow graph remains unchanged except for new constraints that allocate the FPGA tasks to the host (root) processor. Then, SynDEx can schedule and distribute the application according to the new software and material graphs. The timing diagram is almost the same as in Figure 14 except for the erosion and dilation tasks, which are much smaller. The resulting executive time is 245 milliseconds.

5.4 Implementation on a multi-C6x platform: AVSynDEx version 2

The new Sundance multiprocessor architecture is now available, and an upgrade of AVSynDEx (version 2) to this new architecture is in progress. The platform consists of a Sundance SMT320 motherboard with two TIM SMT335 modules. Each module contains a TMS320C6201 processor (the clock frequency being 200 MHz) and one FPGA dedicated to the communication management between both processors.

Table 2: Comparison of the improvements given by the prototyping process. The proposed prototyping process improves the development time and is friendlier. The optimizations and functional validation ensure to improve the application description and to obtain a rapid implementation. The partitioning is easier, the SynDEx tool not being friendly for such modifications.

                          Without    With
Full development          3-4 days   1 day
Translation AVS-SynDEx    60 min     5 min
Functional validation     No         Immediate
Partitioning              SynDEx     AVS
Error detection           No         Immediate

Figure 13: The generated SynDEx description. The architecture is added (3 DSPs: root, P1, and P2). The software graph is similar to the AVS data flow graph.

Actually, SynDEx does not yet completely ensure the generation of an optimized and distributed executive of the algorithm for this new architecture. Indeed, current works are performed in our laboratory on the generation of the executive integrating conventional features, that is, the description of primitives for DSP synchronization and data exchange via the use of DMA and communication buses. In parallel, new features such as shared memory and conditional nodes are added. Nevertheless, the prototyping process remains similar, and the development stages as well. The translation between AVS and SynDEx is identical: the modification lies only in the generation of the distributed executive by SynDEx for the target platform.

The LAR application has been reimplemented on a one-DSP architecture, and the chronometrical reports are presented in the right-hand column of Table 1. The resulting sequential execution time on a C6x DSP corresponds to a rough accelerating factor of 13, obtained only by integrating new and faster processors. Several observations can be made:

(i) Most of the software functions are faster with the use of the C6x DSP. So, without changing the rapid prototyping process, the implementation time will be improved only by integrating new and efficient components.

(ii) The execution time of the initial hardware implementations (i.e., Er3×3 and Dil3×3) is not improved in the case of a software implementation, and the hardware integration remains the best solution. A mixed DSP-FPGA architecture will always be an efficient platform for the implementation of digital image processing with real-time constraints.

Figure 14: The timing diagram generated by SynDEx. The best implementation only uses 2 DSPs (root and P1): the Er3×3 operations are treated by the P1 processor and the Dil3×3 ones by the root processor; these functions are executed at the same time. The root processor executes most of the other functions. SynDEx indicates (on top) that the efficiency is 1.8 compared to a sequential executive.

5.5 Results

For this application, the executive time is 3.21 seconds on a one-DSP implementation and 245 milliseconds for the multi-DSP architecture (leading to an accelerating factor of about 13). Our methodology ensures a fast and controlled prototyping process, and a final optimized implementation. The development time of such applications (Table 2) is valued at one day (when different scenarios are tested) with AVSynDEx and its automatic translator, and 3-4 days without it. This estimation is based on the hypothesis that there is at least one mistake in the development process, and the times integrate the detection and correction of this mistake. The estimations are the result of personal implementations of complex image processing applications combined with the experience of designers working in the same laboratory.
All of them have a huge experience in the AVS environment. The main work consists of describing the application under the AVS environment and creating the new C-modules. The implementation process is very fast and secured: the time for the chronometrical stage is about 20 minutes, starting from the automatic generation of the sequential executive to the final timing results.

Without the automatic translator, the generation of the SynDEx data flow graph lasts 1 hour; it is an average time as it depends on the application size (number of vertices). The remainder of the implementation process is very fast (15 minutes).

6 CONCLUSION AND PERSPECTIVES

We have presented AVSynDEx, a rapid prototyping process able to implement complex signal/image applications on a multi-DSP/FPGA platform. AVSynDEx is currently the only environment able to target this kind of architecture from a high-level functional description. The methodology integrates two CAD tools: AVS for the functional development of the application described as a static data flow graph, and SynDEx as generator of an optimized distributed executive. SynDEx is a powerful tool to find the best matching between an application and a specific architecture, but it does not constitute a development environment for algorithms. Adding a front-end tool and developing an automatic link between them introduces a higher level of abstraction in the process. Moreover, SynDEx can only handle processors and not dedicated hardware (even if some works in this sense are in progress). By selecting a suitable FPGA-based target and adapting its management to the SynDEx description type, we have removed this limitation.

The result is a fast and easy-to-use process. The image designer can develop and supervise the whole implementation process without any prerequisite, as the different complex and specific stages become transparent. A main characteristic is the openness of this process: it becomes very easy to use other CAD tools or to update the environments used. The structure of the methodology makes it possible to replace the AVS environment with other graphical application development tools such as Ptolemy; the application description of such a new tool should just present a similar semantic to SynDEx (static data flow graph). The target platform itself can integrate other components, and the number of FPGAs and DSPs is not fixed. We offer a low-cost solution considering that the front-end environment is necessary for a high-level perfecting of image applications, and that SynDEx is a free academic tool. Moreover, the prototyping targets relatively cheap platforms among existing multicomponent architectures.

Works in progress concern the integration of the new versions of AVS and SynDEx (introducing the notion of dynamic data flow graph) as well as the interface to a new architecture based on several TI C6x processors and Virtex FPGAs. In particular, we are developing a new SynDEx executive kernel for these DSPs. At the same time, we are developing an MPEG-4 coder with AVS, which should be integrated into the new platform thanks to AVSynDEx, in order to reach real-time performances. An MPEG-2 coder has already been developed and implemented on the multi-C4x-FPGA platform [24]. Another perspective is the integration of new tools such as ArtBuilder [25] or the DK1 Design Suite [26] to facilitate the creation of new cores for the FPGA by nonspecialists in hardware. These tools can generate VHDL code or the target core starting from a C description, or a similar C-like description such as Handel-C for the DK1 Design Suite.

REFERENCES

[1] A Downton and D Crookes, Parallel architecture for image processing, Electronics & communications Engineering Journal, vol 10, no 3, pp , June 1998 [2] N M Allinson, N J
Howard, A R Kolcz, et al, Image processing applications using a novel parallel computing machine based on reconfigurable logic, in IEE Colloquium on Parallel Architectures for Image Processing, pp 2/1 2/7, 1994 [3] G Quénot, C Coutelle, J Sérot, and B Zavidovique, Implementing image processing applications on a real-time architecture, in Proc Computer Architectures for Machine Perception, pp 34 42, New Orleans, La, USA, December 1993 [4] Q Wang and S G Ziavras, Powerful and feasible processor interconnections with an evaluation of their communications capabilities, in Proc 4th International Symposium on Algorithms and Networks, pp , Freemantle, Australia, June 1999 [5] M Makhaniok and R Manner, Hardware synchronization of massively parallel processes in distributed systems, in Proc 3rd International Symposium on Parallel Architectures, Algorithms and Networks, pp , Taipei, Taiwan, December 1997 [6] G Koch, U Kebschull, and W Rosenstiel, A prototyping environment for hardware/software codesign in the COBRA project, in Proc 3rd International Workshop on Hardware/Software Codesign, pp 10 16, Grenoble, France, September 1994 [7] B K Seljak, Hardware-software co-design for a real-time executive, in Proc IEEE International Symposium on Industrial Electronics, vol 1, pp 55 58, Bled, Slovenia, 1999 [8] R K Gupta, Hardware-software co-design: Tools for architecting systems-on-a-chip, in Proc Design Automation Conference, pp , Makuhari, Japan, January 1997 [9] F Balarin, D Chiodo, M and Engels, et al, POLIS a Design Environment for Controldominated Embedded Systems, version 30, User s manual, December 1997 [10] Department of Computer Science and Engineering, The Chinook project, Tech Rep, University of Washington, Seattle, Wash, USA, May 1998, chinook/ [11] Advanced Visual Systems Inc, Introduction to AVS/Express, Official site [12] R O Cleaver and S F Midkiff, Visualization of network performance using the AVS visualization system, in Proc 2nd International Workshop on Modeling, Analysis, and Simulation of Computer and Telecommunication Systems, pp , Durham, NC, USA, 31 January 2 February 1994 [13] C Lavarenne, O Seghrouchni, Y Sorel, and M Sorine, The SynDEx software environment for real-time distributed systems design and implementation, in Proc European Control Conference, pp , Grenoble, France, July 1991 [14] C Lavarenne and Y Sorel, Specification, performance optimization and executive generation for real-time embedded multiprocessor applications with SynDEx, in CNES Symposium on Real-Time Embedded Processing for Space Applications, Les Saintes Maries de la Mer, France, November 1992 [15] C Lavarenne and Y Sorel, Real time embedded image processing applications using the A3 methodology, in Proc IEEE International Conference on Image Processing, pp , Lausanne, Switzerland, November 1996 [16] T Grandpierre, C Lavarenne, and Y Sorel, Optimized rapid prototyping for real-time embedded heterogeneous multi-

processors, in Proc 7th International Workshop on Hardware/Software Co-Design, pp 74 78, Rome, Italy, May 1999 [17] A Vicard and Y Sorel, Formalization and static optimization of parallel implementations, in Workshop on Distributed and Parallel Systems, Budapest, Hungary, September 1998 [18] Sundance Inc, SMT320 4 slots TIM, com/s320htm, 2000 [19] Sundance Inc, SMT314 video grab and display TMS320C40 TIM, [20] J Lienard and G Lejeune, Mustig: a simulation tool in front of the SynDEx software, in Thematically Days University-Industry, GRAISyHM-AAA-99, pp 34 39, Lille, France, March 1999 [21] V Fresse, R Berbain, and O Déforges, Ptolemy as front end tool for fast prototyping into parallel and mixed architecture, in International Conference on Signal Processing Applications Technology, Dallas, Tex, USA, October 2000 [22] O Déforges and J Ronsin, Nonuniform sub-sampling using square elements: a fast still image coding at low bit rate, in International Picture Coding Symposium, Portland, Ore, USA, April 1999 [23] H Minkowski, Volumen und Oberfläche, Math Ann, vol 57, pp , 1903 [24] J F Nezan, V Fresse, and O Déforges, Fast prototyping of parallel architectures: an Mpeg-2 coding application, in The 2001 International Conference on Imaging Science, Systems, and Technology, Las Vegas, Nev, USA, June 2001 [25] M Fleury, R P Self, and A C Downton, Hardware compilation for software engineers: an ATM example, IEE Proceedings Software, vol 148, no 1, pp 31 42, 2001 [26] T Stockein and J Basig, Handel-C: an effective method for designing FPGA (and ASIC), Academic paper, University of Applied Science, Nuremberg, 2001, products/technical papers/indexhtm

Jean-François Nezan received his postgraduate certificate in Signal, Telecommunications, Images, and Radar Sciences from Rennes University in 1999, and his MSI in electronic and computer engineering from the INSA-Rennes Scientific and Technical University in 1999, where he is currently working toward a PhD. His research interests include image compression algorithms and rapid prototyping.

Virginie Fresse received the PhD degree in electronics from the Institute of Applied Sciences of Rennes (INSA), France, in 2001. She is currently a postdoctoral researcher in the Department of Electrical Engineering at the University of Strathclyde, Glasgow, Scotland. Her research interests include the implementation of real-time image-processing applications on parallel and mixed architectures, the development of rapid prototyping processes, and codesign methodologies.

Olivier Déforges graduated in electronic engineering in 1992 from the Polytechnique University of Nantes, France, where he also received in 1995 a PhD degree in image processing. Since September 1996, he has been a lecturer in the Department of Electronic Engineering at the INSA Rennes Scientific and Technical University. He is a member of the UMR CNRS 6164 IETR laboratory in Rennes. His principal research interests are parallel architectures, image understanding, and compression.


More information

2 General issues in multi-operand addition

2 General issues in multi-operand addition 2009 19th IEEE International Symposium on Computer Arithmetic Multi-operand Floating-point Addition Alexandre F. Tenca Synopsys, Inc. tenca@synopsys.com Abstract The design of a component to perform parallel

More information

EE260: Logic Design, Spring n Integer multiplication. n Booth s algorithm. n Integer division. n Restoring, non-restoring

EE260: Logic Design, Spring n Integer multiplication. n Booth s algorithm. n Integer division. n Restoring, non-restoring EE 260: Introduction to Digital Design Arithmetic II Yao Zheng Department of Electrical Engineering University of Hawaiʻi at Mānoa Overview n Integer multiplication n Booth s algorithm n Integer division

More information

Data Hiding on Text Using Big-5 Code

Data Hiding on Text Using Big-5 Code Data Hiding on Text Using Big-5 Code Jun-Chou Chuang 1 and Yu-Chen Hu 2 1 Department of Computer Science and Communication Engineering Providence University 200 Chung-Chi Rd., Shalu, Taichung 43301, Republic

More information

LogiCORE IP Floating-Point Operator v6.2

LogiCORE IP Floating-Point Operator v6.2 LogiCORE IP Floating-Point Operator v6.2 Product Guide Table of Contents SECTION I: SUMMARY IP Facts Chapter 1: Overview Unsupported Features..............................................................

More information

IEEE Standard 754 Floating Point Numbers

IEEE Standard 754 Floating Point Numbers IEEE Standard 754 Floating Point Numbers Steve Hollasch / Last update 2005-Feb-24 IEEE Standard 754 floating point is the most common representation today for real numbers on computers, including Intel-based

More information

2 Computation with Floating-Point Numbers

2 Computation with Floating-Point Numbers 2 Computation with Floating-Point Numbers 2.1 Floating-Point Representation The notion of real numbers in mathematics is convenient for hand computations and formula manipulations. However, real numbers

More information

02 - Numerical Representations

02 - Numerical Representations September 3, 2014 Todays lecture Finite length effects, continued from Lecture 1 Floating point (continued from Lecture 1) Rounding Overflow handling Example: Floating Point Audio Processing Example: MPEG-1

More information

Aiyar, Mani Laxman. Keywords: MPEG4, H.264, HEVC, HDTV, DVB, FIR.

Aiyar, Mani Laxman. Keywords: MPEG4, H.264, HEVC, HDTV, DVB, FIR. 2015; 2(2): 201-209 IJMRD 2015; 2(2): 201-209 www.allsubjectjournal.com Received: 07-01-2015 Accepted: 10-02-2015 E-ISSN: 2349-4182 P-ISSN: 2349-5979 Impact factor: 3.762 Aiyar, Mani Laxman Dept. Of ECE,

More information

CS 33. Data Representation (Part 3) CS33 Intro to Computer Systems VIII 1 Copyright 2018 Thomas W. Doeppner. All rights reserved.

CS 33. Data Representation (Part 3) CS33 Intro to Computer Systems VIII 1 Copyright 2018 Thomas W. Doeppner. All rights reserved. CS 33 Data Representation (Part 3) CS33 Intro to Computer Systems VIII 1 Copyright 2018 Thomas W. Doeppner. All rights reserved. Byte-Oriented Memory Organization 00 0 FF F Programs refer to data by address

More information

Pipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications

Pipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications , Vol 7(4S), 34 39, April 204 ISSN (Print): 0974-6846 ISSN (Online) : 0974-5645 Pipelined Quadratic Equation based Novel Multiplication Method for Cryptographic Applications B. Vignesh *, K. P. Sridhar

More information

Floating Point Puzzles. Lecture 3B Floating Point. IEEE Floating Point. Fractional Binary Numbers. Topics. IEEE Standard 754

Floating Point Puzzles. Lecture 3B Floating Point. IEEE Floating Point. Fractional Binary Numbers. Topics. IEEE Standard 754 Floating Point Puzzles Topics Lecture 3B Floating Point IEEE Floating Point Standard Rounding Floating Point Operations Mathematical properties For each of the following C expressions, either: Argue that

More information

Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication

Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication Laboratory Exercise 3 Comparative Analysis of Hardware and Emulation Forms of Signed 32-Bit Multiplication Introduction All processors offer some form of instructions to add, subtract, and manipulate data.

More information