Double-Precision Floating Point Emulation Acceleration


Application Note

Tensilica, Inc.
Scott Blvd.
Santa Clara, CA
(408) Fax (408)
December 2007
Doc Number: AN

© 2007 Tensilica, Inc.
Printed in the United States of America
All Rights Reserved

This publication is provided "AS IS." Tensilica, Inc. (hereafter "Tensilica") does not make any warranty of any kind, either expressed or implied, including, but not limited to, the implied warranties of merchantability and fitness for a particular purpose. Information in this document is provided solely to enable system and software developers to use Tensilica processors. Unless specifically set forth herein, there are no express or implied patent, copyright or any other intellectual property rights or licenses granted hereunder to design or fabricate Tensilica integrated circuits or integrated circuits based on the information in this document. Tensilica does not warrant that the contents of this publication, whether individually or as one or more groups, meets your requirements or that the publication is error-free. This publication could include technical inaccuracies or typographical errors. Changes may be made to the information herein, and these changes may be incorporated in new editions of this publication.

Tensilica is a registered trademark of Tensilica, Inc. The following terms are trademarks of Tensilica, Inc.: FLIX, OSKit, Sea of Processors, TurboXim, Vectra, Xenergy, Xplorer, and XPRES. All other trademarks and registered trademarks are the property of their respective companies.

Notice

Tensilica, Inc. reserves the right to make changes to its products or discontinue any of its products or offerings without notice. Tensilica warrants the performance of its products to the specifications applicable at the time of sale in accordance with Tensilica's standard warranty.

Document Change History: Published December 2007

Contents

1 Introduction
2 Accelerating Basic Double-Precision Emulation Functions
    Double-precision Emulation Package Features
    Comparison
    Conformance to IEEE 754 Specification
3 Double-precision Emulation Routine Performance
4 Code Size for the Double-Precision Emulation Functions
5 The TIE Extensions
6 Building and Using the Double-Precision Acceleration Library
    Using Xplorer
    Using Command Line Tools
    Using the Library with an RTOS

Tables

Table 1: Double-precision Floating Point Emulation Library Features
Table 2: Cycle Count Comparison for Double-precision Emulation
Table 3: Cycle Count Comparison for Double-precision Multiply Emulation
Table 4: Cycle Count Comparison for Integer Divide and Modulus Emulation
Table 5: Double-precision Emulation Code Size Comparison
Table 6: Double-precision Emulation Code Size Comparison
Table 7: Integer Divide and Modulus Code Size Comparison
Table 8: Description of Added Instructions

Abstract

Double-precision floating point is used in applications that require more precision than single-precision floating point provides. In Xtensa 7, LX, LX2 and Diamond products, double-precision floating point operations are implemented with a software emulation library. This application note presents a small set of TIE instructions and states that speed up the existing double-precision software emulation. With 4K-7K additional gates, an Xtensa processor can perform double-precision adds and subtracts in an average of 19 cycles. Multiplies take an average of 26 cycles for configurations with the Multiply High option and 60 cycles for configurations with 16-bit or 32-bit multipliers.

This application note also includes a software library, designed for easy integration into an existing project, that uses these instructions to implement basic double-precision floating point operations. Because the library provides the routines that the compiler invokes when it encounters floating point operations, it is easy to drop into an existing project that needs double-precision floating point. The library directly speeds up double-precision floating point addition, subtraction, multiplication, square root, divide and comparison operations. Other routines that invoke these basic operations are sped up indirectly.

This application note characterizes the IEEE compliance of the implemented emulation routines and provides estimated gate counts for the hardware and code sizes for the software. It also presents average and maximum cycle counts for the emulation routines. Finally, it gives step-by-step instructions for integrating the TIE and software library into an existing project to speed up double-precision floating point operations.

The instructions added to speed up the floating point divide can also be used to speed up 32-bit integer divide and modulus operations. The package provided with this application note includes software routines for signed and unsigned 32-bit divide and modulus operations, in addition to the double-precision floating point ones.
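As a rough illustration of how ordinary C code ends up in the emulation routines, the sketch below uses only operations that this package accelerates. The function names weighted_mean and weighted_mean_selfcheck are hypothetical. The helper names in the comments are the standard libgcc soft-float entry points, and the exact helper a compiler chooses for a given comparison may differ between toolchains.

```c
#include <assert.h>

/* Weighted mean of two samples using only basic double-precision operations.
 * On a core without floating point hardware, each operation below becomes a
 * call into the emulation library (the names in the comments are the usual
 * libgcc helpers; the compiler's exact choice is an assumption here). */
double weighted_mean(double x0, double w0, double x1, double w1)
{
    double total = w0 + w1;                 /* __adddf3 */
    if (total == 0.0) {                     /* __eqdf2 or __nedf2 */
        return 0.0;                         /* avoid dividing by zero */
    }
    double weighted = x0 * w0 + x1 * w1;    /* __muldf3 twice, then __adddf3 */
    return weighted / total;                /* __divdf3 */
}

/* Small self-check using values that are exact in binary floating point. */
void weighted_mean_selfcheck(void)
{
    assert(weighted_mean(2.0, 1.0, 4.0, 3.0) == 3.5);
    assert(weighted_mean(1.0, 0.0, 2.0, 0.0) == 0.0);
}
```

No source changes are needed to benefit from the library: code like this is simply relinked so that the accelerated routines replace the standard ones.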

1 Introduction

This document describes TIE extensions and a software library that accelerate software emulation of basic double-precision floating point functions on Xtensa processors. Customers who need low-energy, moderate-performance double-precision floating point operations should consider using this package. The package adds an estimated 4K gates to a standard Xtensa processor when synthesizing for low area and fewer than 7K gates when synthesizing for high speed. In addition to speeding up double-precision functionality, the instruction extensions used to speed up the floating point divide operation can also speed up integer divide and modulus operations for configurations without the Divide Option.

This application note is divided into the following sections:

- Presentation of the accelerated double-precision and integer operations, including a description of the IEEE compliance of the accelerated double-precision functions.
- Cycle count comparison of the accelerated double-precision emulation functions with the existing software emulation library.
- Cycle count comparison of the accelerated integer divide and modulus functions with the existing software emulation library.
- Gate count estimates for the added TIE instructions and states.
- Step-by-step instructions on using the double-precision acceleration libraries with Xplorer or command line tools.
- A methodology for modifying an existing application to take advantage of the hardware instructions using intrinsics.

2 Accelerating Basic Double-Precision Emulation Functions

This package implements IEEE 754 compliant 64-bit double-precision add, subtract, multiply, square root and divide operations with the round-to-nearest rounding mode. It also implements a complete set of comparison operations that allow for IEEE-compliant comparisons of two double-precision numbers.
The package correctly handles IEEE denormalized numbers. To use this library, build it and include it on the link line before other libraries. In Xtensa Xplorer, adding the library to your project's dependencies is sufficient.

Double-precision Emulation Package Features

The double-precision emulation package adds a 32-bit state and a 64-bit state to a processor, along with instructions that speed up common double-precision operations. Optimized for area in 90lp technology, the package synthesizes to an extra 4093 gates. Optimized for speed in 90g technology, it synthesizes to an extra 6721 gates. The package is designed to work with any Xtensa configuration; however, it is recommended that it be used with the Sign Extension option, the Zero-overhead Loop option and at least one of the multiply options.

The iterative divide instruction and normalization instructions used to accelerate double-precision operations can also be used to implement signed and unsigned integer division and modulo operations. The library includes emulation routines for these operations for configurations without the Divide Option. These routines use the extra states that the package provides. If these emulation functions can be invoked from within an interrupt routine, they should either be removed from the library or the interrupt routine must correctly save and restore the extra states.

Comparison

TABLE 1: DOUBLE-PRECISION FLOATING POINT EMULATION LIBRARY FEATURES

Feature                                        Support
Double-precision Operations                    Add, Subtract, Multiply, Divide,
                                               Comparisons (==, !=, <, <=, >, >=),
                                               Square Root
32-bit Integer Operations                      Divide, modulus (signed and unsigned)
Additional Architectural State                 32-bit status state (F64S),
                                               64-bit value state (F64R)
Recommended Processor Configuration Options    Sign Extend, Zero-overhead loop,
                                               MAC16, MUL16, MUL32 or MUL32 High
Post-Synthesis Additional Gates
  (Optimized for Area in 90lp)                 4093 gates
Post-Synthesis Additional Gates
  (Optimized for Speed in 90g)                 6721 gates
Rounding Modes                                 Round-to-nearest
Signaling NaNs                                 No
Overflow/Underflow exceptions                  No

Conformance to IEEE 754 Specification

All of the implemented functions correctly implement the Round-To-Nearest rounding mode. Truncate, round up and round down modes are not implemented. The library does not generate underflow, overflow, inexact, invalid or divide-by-zero flags or exceptions.

3 Double-precision Emulation Routine Performance

This section provides individual timing data for each of the functions. The timing data is derived with the cycle-accurate instruction simulator by counting the cycles spent in the emulation functions. The simulation assumes a single-cycle latency for each data memory access. The optimized functions do not access data memory, so this assumption only benefits the un-optimized functions, which use a few PC-relative loads to instantiate constant literal values.

The cycle counts in this application note all assume that the double-precision functions are invoked with a windowed call instruction, that call and return sequences do not overflow or underflow the register file, and that all instructions hit in the cache or local memory. Cycle counts for emulation routines are measured from the commit cycle of the first instruction in the routine to the cycle before the commit of the instruction following the return. The add and subtract routines share code. This can confuse the standard profiler, so the data is measured from execution traces.

Table 2 gives average and maximum cycle counts for the standard emulation library (Base Cycles) and the optimized library (Cycles) for the implemented double-precision emulation routines. The average and maximum cycle data for add, subtract, multiply, divide, and square root was taken from a simulation of the timesoftfloat test that is included in the source package. That program does not include all of the comparison functions, so the average and maximum cycle data for the comparison functions was taken from a separate directed random comparison test.

With the optimized functions and the data mix generated by timesoftfloat, a floating point add or subtract takes fewer than 20 cycles on average. Emulation of the divide takes about 72 cycles and the square root function takes about 78 cycles. If the zero-overhead loop instructions are not available, the performance of the square root and divide routines is significantly degraded.

The accelerated comparison functions (*) take 6 to 8 cycles in the emulation routines. If the TIE instructions that implement them are inlined into the routines that use them, only 2 instructions are needed to produce the binary comparison result. The table reports the number of cycles required when they are invoked through the emulation library functions that the compiler calls for C code with double-precision comparisons. Note that because of compiler expectations, the functions for == and != are the same.
TABLE 2: CYCLE COUNT COMPARISON FOR DOUBLE-PRECISION EMULATION

Operation    Library Name          Base Cycles     Cycles        Avg Base /
                                   Avg    Max      Avg    Max
Add          __adddf3                                            x
Sub          __subdf3                                            x
Mul          __muldf3                                            x
Div          __divdf3                                            x
Sqrt         sqrt                                                x
==, != *     __eqdf2, __nedf2                                    x
< *          __ltdf2                                             x
<= *         __ledf2                                             x
> *          __gtdf2                                             x
>= *         __gedf2                                             x

Multiply emulation performance depends significantly on the base Xtensa processor's multiply hardware. Table 2 presents the performance for a configuration that has Mul32High. Table 3 shows the performance for a variety of multiply configurations. With a base processor that includes the 32-bit Integer Multiply Option with Mul32High, the multiply takes about 26 cycles on average. For processors with the 32-bit Integer Multiply Option that do not include Mul32High and for processors with the 16-bit Multiply Option, the multiply takes about 60 cycles on average.

For processors without the 32-bit or 16-bit multiply options that include the MAC16 Option, the multiply takes about 70 instructions. Without any multiply or MAC option, a double-precision multiply takes 706 cycles. If the Sign Extension option is not available, the multiply and divide emulation can take an extra cycle with some inputs.

TABLE 3: CYCLE COUNT COMPARISON FOR DOUBLE-PRECISION MULTIPLY EMULATION

Multiply Configuration Option                Base Cycles     Cycles        Avg Base /
                                             Avg    Max      Avg    Max
32-Bit Multiply Option with Mul32High                                      x
32-Bit Multiply Option without Mul32High                                   x
16-Bit Multiply Option                                                     x
MAC16 Option                                                               x
No Multiply                                                                x

Table 4 gives a cycle count comparison for the 32-bit divide and modulus emulation routines. If the zero-overhead loop option is not available, the performance of these routines is degraded significantly. When the Divide Option is available, these emulation routines are not included.

TABLE 4: CYCLE COUNT COMPARISON FOR INTEGER DIVIDE AND MODULUS EMULATION

Operation                   Library Name    Base Cycles     Cycles        Avg Base /
                                            Avg    Max      Avg    Max
Unsigned Integer Divide     __udivsi3                                     x
Unsigned Integer Modulus    __umodsi3                                     x
Integer Divide              __divsi3                                      x
Integer Modulus             __modsi3                                      x

4 Code Size for the Double-Precision Emulation Functions

The enhanced double-precision library reduces the code size footprint for all of the emulated functions. Table 5 presents the code size reductions for the double-precision emulation routines.

TABLE 5: DOUBLE-PRECISION EMULATION CODE SIZE COMPARISON

Operation                 Library Name                      Base Size (Bytes)    Code Size Reduction
Add, Sub                  __adddf3, __subdf3                                     %
Mul (w Mul32 High)        __muldf3                                               %
Mul (w Mul16/Mul32)       __muldf3                                               %
Div                       __divdf3                                               %
Sqrt                      sqrt                                                   %
==, !=, <, <=, >, >=      __eqdf2, __nedf2, __ltdf2,                             %
                          __ledf2, __gtdf2, __gedf2

Table 6 presents the code size reductions for the double-precision multiply with various multiply configuration options.

TABLE 6: DOUBLE-PRECISION EMULATION CODE SIZE COMPARISON

Multiply Configuration Option                Base Size (Bytes)    Code Size Reduction
32-Bit Multiply Option with Mul32High                             %
32-Bit Multiply Option without Mul32High                          %
16-Bit Multiply Option                                            %
MAC16 Option                                                      %
No Multiply                                                       %

Table 7 presents the code size reductions for the integer divide and modulus emulation routines. The code size for these routines is reduced by about 50-60%.

TABLE 7: INTEGER DIVIDE AND MODULUS CODE SIZE COMPARISON

Operation                   Library Name    Base Size (Bytes)    Code Size Reduction
Unsigned Integer Divide     __udivsi3                            %
Unsigned Integer Modulus    __umodsi3                            %
Integer Divide              __divsi3                             %
Integer Modulus             __modsi3                             %

5 The TIE Extensions

The TIE package implements two new states: a 32-bit F64S status state and a 64-bit F64R state. In addition, it implements a number of operations, which are described briefly in Table 8.

TABLE 8: DESCRIPTION OF ADDED INSTRUCTIONS

Instruction             Description
F64CMPL, F64CMPH        First 2 instructions for each emulation routine.
                        Zeros F64R state and sets F64S status state.
F64ITER                 Iterative step for divide and square root
F64RND                  Rounding assist
F64NORM                 Count leading zeros of a mantissa
F64SIG                  Extract the upper part of a mantissa
F64SEXP                 Set an exponent
F64ADDC, F64SUBC        Addition and subtraction with carry operations
RF64R, WF64R            Move data in and out of the F64R state
RUR.F64S, WUR.F64S      Move data in and out of the F64S status state
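The description of F64ITER as an iterative step for divide and square root suggests a bit-serial algorithm. The sketch below shows a textbook restoring (shift-and-subtract) divider for unsigned 32-bit operands. It only illustrates what one iteration of such a scheme does and is not the documented semantics of F64ITER. The names udiv_state_t, udiv_step and udiv32_iterative are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Working state of the divider. The partial remainder can briefly need
 * 33 bits after the shift, so it is held in a 64-bit variable. */
typedef struct {
    uint32_t quotient;
    uint64_t remainder;
} udiv_state_t;

/* One iteration: shift the next dividend bit into the partial remainder,
 * then subtract the divisor if it fits and record the quotient bit. */
void udiv_step(udiv_state_t *s, uint32_t next_bit, uint32_t divisor)
{
    s->remainder = (s->remainder << 1) | (next_bit & 1u);
    s->quotient <<= 1;
    if (s->remainder >= divisor) {
        s->remainder -= divisor;
        s->quotient |= 1u;
    }
}

/* Full 32-bit unsigned divide: 32 iterations, most significant bit first.
 * The caller is expected to reject divisor == 0 beforehand. */
udiv_state_t udiv32_iterative(uint32_t dividend, uint32_t divisor)
{
    udiv_state_t s = { 0u, 0u };
    for (int bit = 31; bit >= 0; --bit) {
        udiv_step(&s, (dividend >> bit) & 1u, divisor);
    }
    return s;
}
```

A fixed-trip-count loop around a single step is the kind of code that benefits from the Zero-overhead Loop option, which is consistent with the note that divide and square root performance degrades when that option is absent.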

6 Building and Using the Double-Precision Acceleration Library

The library is easy to use from Xplorer or command line tools. In either case, you must first compile the TIE file with your configuration, compile the library, and then make your project dependent on the compiled library.

Using Xplorer

To build the libdfpemu double-precision acceleration library, import the source workspace dfpemu_library.xws into Xtensa Xplorer, attach the TIE file to your configuration and build the test program. The dependent library will be built before the test program. To build a different project, create a Library Dependency on the libdfpemu library before building your project and add some link flags to your program.

1. Import the TIE file and projects into your Xplorer Workspace.
   a. Choose File, Import...
   b. Choose Import Xtensa Xplorer Workspace.
   c. Browse to the dfpemu_library.xws file in the Application Note directory and click Next.
   d. Select all of the projects (libdfpemu and timesoftfloat) in the projects dialog and the dfpemu_lib.tie file in the TIE files dialog. On the last dialog, click Finish to import the projects and TIE file into the workspace.
2. Attach the TIE file to your configuration.
   a. Select the C/C++ perspective if it is not the active perspective. Select the menu Window, Open Perspective, C/C++.
   b. In the System Overview pane, right-click on your configuration and choose Attach TIE and TDB files.
   c. Select the dfpemu_lib.tie file and click Finish.
3. Compile the TDK for your configuration.
   a. In the System Overview pane, right-click on your configuration and choose Compile TDK for Configuration.
4. If building your own project, add the libdfpemu dependency to the project. This step is unnecessary when building the timesoftfloat project because the dependency has been added already.
   a. In the C/C++ Projects pane, right-click your project and choose Properties.
   b. Click Library Dependencies. Choose libdfpemu from the Available Libraries and click Add.
   c. Click Ok to dismiss the project properties.
5. When building your own project, add link options to force the dfpemu library routines to be included if your project does not reference them directly.

   a. In the active project area, select your project as the active project, select your configuration as the active configuration, and select the Release target (or the Debug target) as the active target.
   b. Click on the triangle to the right of the active target and select Modify.
   c. Click on the Linker tab to change linker flags for the timesoftfloat program Release target.
   d. In the Linker Flags box, add:
      -Wl,-u,__adddf3,-u,__subdf3,-u,__muldf3,-u,__divdf3,-u,sqrt
      -Wl,-u,__eqdf2,-u,__gedf2,-u,__gtdf2,-u,__ledf2,-u,__ltdf2,-u,__nedf2
   e. For configurations without the Divide Option, add:
      -Wl,-u,__divsi3,-u,__modsi3,-u,__udivsi3,-u,__umodsi3
6. Build the main project.
   a. In the active project area, select the timesoftfloat project (or your own project) as the active project, select your configuration as the active configuration, and select the Release target (or the Debug target) as the active target.
   b. Click Build Active. This builds the dependent libdfpemu library and the main project.
7. Now that the timesoftfloat project has been built, you can run or profile it once you have set up the command line arguments.
   a. Click the triangle to the right of the Run button in the active project area. Choose Run.
   b. In the Configurations area, choose the Auto timesoftfloat launch.
   c. In the Create, Manage, and Run dialog box, choose the Arguments tab. Add all nearesteven tininessafter in the C/C++ Program Arguments box.
   d. Click Apply to save the arguments, then Run to start the simulation.
   e. For a profile, click Profile in the active project area. If the runtime arguments have not been set yet, set them as described in the preceding steps. Because the floating point add and subtract share code, some of the cycles from one can be misattributed to the other in the profile.

Using Command Line Tools

From a command prompt on a Linux host with the csh or tcsh shell, unpack the library sources, TIE files and test program:

1. unzip dfpemu_library.zip
2. cd dfpemu_library
3. setenv PATH <path_to_your_xtensa_tools>:$PATH
4. setenv XTENSA_CORE <your_config_name>
5. xt-make clean all test profile

This compiles the TIE file, builds the libdfpemu.a library, builds timesoftfloat, executes it and profiles it. The text profile is left in timesoftfloat/prof.gmon.txt.

The emulation routines in this package redefine routines found in the standard libgcc and libm libraries. The object and library order on the link command line determines whether the libdfpemu routine or the standard emulation routine is included in the application. To ensure that the libdfpemu library is included with your own project, add the full pathname of the libdfpemu.a library to the link command before any other libraries and add the following to the link flags:

   -Wl,-u,__adddf3,-u,__subdf3,-u,__muldf3,-u,__divdf3,-u,sqrt
   -Wl,-u,__eqdf2,-u,__gedf2,-u,__gtdf2,-u,__ledf2,-u,__ltdf2,-u,__nedf2

For configurations without the Divide Option, add the following as well:

   -Wl,-u,__divsi3,-u,__modsi3,-u,__udivsi3,-u,__umodsi3

Using the Library with an RTOS

The library routines use two additional states, F64R and F64S. If routines in this library can be invoked during interrupt handling, they should be removed from the library, or the interrupt handlers must save the states before invoking these routines and restore them before returning from the interrupt. It is uncommon for an interrupt handler to invoke double-precision floating point emulation routines. Integer divide and modulus are more likely to be invoked. If they can be invoked from an interrupt handler on a configuration without the Divide Option, either remove the integer divide and modulus emulation routines from the library or save and restore the F64R and F64S states in the interrupt handling code.
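The save-and-restore requirement can be sketched as follows. This is a host-side model, not target code: the two states are simulated with ordinary variables, and the accessor names read_f64r, write_f64r, read_f64s and write_f64s are hypothetical stand-ins for whatever the TIE compiler exposes for RF64R, WF64R, RUR.F64S and WUR.F64S. The handler body is likewise invented for illustration.

```c
#include <assert.h>
#include <stdint.h>

/* Simulated TIE states (assumption: plain variables stand in for F64R/F64S). */
static uint64_t sim_f64r;
static uint32_t sim_f64s;

/* Hypothetical accessors; on a real core these would map to the
 * RF64R/WF64R and RUR.F64S/WUR.F64S instructions. */
uint64_t read_f64r(void)        { return sim_f64r; }
void     write_f64r(uint64_t v) { sim_f64r = v; }
uint32_t read_f64s(void)        { return sim_f64s; }
void     write_f64s(uint32_t v) { sim_f64s = v; }

/* Saved copy of both states, kept on the handler's stack. */
typedef struct {
    uint64_t f64r;
    uint32_t f64s;
} dfpemu_state_t;

void dfpemu_save(dfpemu_state_t *ctx)
{
    ctx->f64r = read_f64r();
    ctx->f64s = read_f64s();
}

void dfpemu_restore(const dfpemu_state_t *ctx)
{
    write_f64r(ctx->f64r);
    write_f64s(ctx->f64s);
}

/* Invented handler work that clobbers both states, as an emulated integer
 * divide or modulus would on a configuration without the Divide Option. */
void handler_body(void)
{
    write_f64r(0xDEADBEEFull);
    write_f64s(1u);
}

/* Interrupt handler: save before any emulation routine can run,
 * restore before returning to the interrupted code. */
void interrupt_handler(void)
{
    dfpemu_state_t ctx;
    dfpemu_save(&ctx);
    handler_body();
    dfpemu_restore(&ctx);
}
```

Keeping the saved copy on the handler's own stack lets nested interrupts each preserve the state they found, at the cost of a few extra instructions per interrupt.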
