

Keys to Writing Efficient Embedded Code

by BILL TRUDELL

EMBEDDED SYSTEMS PROGRAMMING, OCTOBER 1997

A key to writing efficient real-time embedded software is to understand clearly your processor's architecture, the programming language, the compiler's features, and the object model used by the compiler. With this understanding, you can identify potentially slow code, make the code faster, and thus write more efficient applications.

I recently had an assignment in which I was responsible for identifying ways to write efficient embedded code. What I discovered isn't rocket science: common mistakes, misunderstandings, or assumptions about the demands made on the compiler, along with overestimating the power of the microprocessor, can adversely impact the execution time of an application. Most of my effort focused on code for a target on which hardware floating-point operations are not enabled, and which instead relies on the math libraries supplied by the compiler vendor.

Examples are presented primarily in C, but compiled in C++. I will leave the analysis of virtual tables and the like to the C++ experts. I hope a good compiler vendor does a reasonable job of implementing such things. Most of what I learned, though, can be applied to any programming language. Inefficient code seems to be more closely related to the human condition than to the chosen programming language. Slow code is probably slow because that's the way it was written, however unintentionally.

I do believe that it's better to first write code that is correct and then to optimize it. There will always be another compiler switch, faster clock chip, or newer processor around the bend. Well-written test drivers can prove equality between two implementations.

My general assumption is that to implement efficient embedded software, a developer must be familiar with the code that the compiler generates, as well as with the microprocessor architecture. Writing efficient code, though, can sometimes make it less portable. Code written in Assembler is usually processor-specific and not portable. The code you write today will very likely need to run on a different processor in a year or two. Using Assembler to make the code fast, then, might not be prudent, except for interrupt service routines or frequently used functions.

Analysis of any improvements made to working code is very important. Validate changes to ensure that errors have not been introduced. Make sure that the desired precision and accuracy of the calculations are maintained, or are at least adequate. You can easily overlook rounding and truncation errors.

DATA TYPE SPECIFICATIONS

One common oversight is specifying the wrong data type and then allowing the compiler to convert the type automatically. Automatic type conversion is generally taken for granted, but it does chew up valuable processor time. Unless you look at the generated Assembler, the code might compile, link, run, and produce the right output, yet be very inefficient.

Table 1 contrasts two code segments generated for a processor without an FPU. The segment on the left omits the single-precision floating-point specifier f, an easily overlooked mistake. The value 10.0 defaults to double precision and forces the numerator to be converted to double precision before the double-precision divide.
The result of the division, a double-precision value, is then converted back to single precision. Without an FPU, these conversion operations, as implemented in a software math library, are very expensive compared with integer operations. Correctly specifying the divisor as a single-precision value produces the code shown in the right-hand segment of Table 1. A single-precision division is used and the type conversions are avoided.

If the numerator were an integer, a conversion from long to float would be required. This time-consuming operation could be avoided if the data type were specified as float. (See section A.6 of The C Programming Language by Kernighan and Ritchie [1].)

Single-precision accuracy is probably used more frequently than double precision. Therefore, if double precision is required, a simple comment in the code would remove all doubt as to the developer's intention and design.

TABLE 1 Floating-point specifiers.

C Code omitting float specifier f:      C Code using float specifier f:
float res = val / 10.0;                 float res = val / 10.0f;

move.l -4(a6),-(sp)                     move.l -4(a6),-(sp)
jsr    ftod                             move.l # ,-(sp)
clr.l  -(sp)                            jsr    fdiv
move.l # ,-(sp)                         move.l (sp)+,-8(a6)
jsr    ddiv
jsr    dtof                             (Less is Better.)
move.l (sp)+,-8(a6)

AUTOMATIC PROMOTIONS

Implied promotions can easily be taken for granted. In some cases you'll find it desirable or necessary to write and debug the code first in a PC environment, and later port or recompile it for the embedded processor. The clock speed on the PC will usually be much faster than that of the embedded hardware, and the PC will almost certainly have a floating-point unit. The implied promotions performed by the compiler for the embedded code might not execute as fast as they did on the workstation.

Standard math routines usually take double-precision inputs and return double-precision outputs. If only single precision is required, the return value should immediately be cast back to single precision, provided that accuracy and overflow conditions are satisfied. If this isn't done, further promotions can be precipitated, causing slower execution. Table 2 contrasts automatic promotion, using the sqrt() function as is, with casting of the sqrt() function's return value.

TABLE 2 Using math functions and the effect of automatic promotions. (val is of type float)

C Code, Automatic Type Promotion:                C Code, Casting to Avoid Excessive Promotions:
float res =                                      float res =
  ( 17.0f * sqrt( val ) ) / 10.0f;                 ( 17.0f * (float)(sqrt( val )) ) / 10.0f;

move.l -4(a6),-(sp)  // Load val on stack        move.l -4(a6),-(sp)
jsr    ftod          // Convert it to dbl        jsr    ftod
jsr    _sqrt         // Dbl prec. sqrt()         jsr    _sqrt
addq.l #4,sp         // Adjust stack             addq.l #4,sp
move.l d1,(sp)       // Load sqrt() result,      move.l d1,(sp)       // Load stack with
move.l d0,-(sp)      // d1 and d0, on stack      move.l d0,-(sp)      // sqrt() result and
clr.l  -(sp)         // Load stack with          jsr    dtof          // convert to single
move.l # ,-(sp)      // dbl for 17.0             move.l # ,-(sp)      // 17.0f
jsr    dmul          // Dbl prec. mult.          jsr    fmul          // Single prec.
clr.l  -(sp)         // Load stack with          move.l # ,-(sp)      // 10.0f
move.l # ,-(sp)      // dbl for 10.0             jsr    fdiv          // Single prec.
jsr    ddiv          // Dbl prec. divide         move.l (sp)+,-8(a6)  // Save result in res
jsr    dtof          // Double to float
move.l (sp)+,-8(a6)  // Save in res
Using the sqrt() function as is forces the other variables to be promoted. Casting the return value of sqrt() replaces the double-precision multiply and divide with single-precision versions, which should execute faster because, in this case, they're implemented in software. If the input to sqrt() were of type double instead of float, the costly call to convert the float to double could be avoided.

REWRITING AND REARRANGING EXPRESSIONS

Rearranging operands and operators in an equation can give the compiler a better chance of pre-evaluating expressions at compile time instead of run time, saving clock cycles for other important operations. The equation used in Table 2 can be rearranged for faster execution without losing readability, as shown in Table 3. Significant savings result because a single-precision division is no longer necessary: 17.0f / 10.0f is equivalent to 1.7f.

In general, for both native instruction sets and floating-point emulation, divides take much longer to execute than multiplies. Therefore, provided that accuracy requirements are met and overflow and underflow conditions are considered, trading a divide for a multiply usually saves time. For example, an algebraic equation can be rewritten for faster execution, as shown in Table 4. Here, the segment on the left takes two divides, whereas the rewrite takes one divide and one multiply, and will be much faster.

Coding algorithms and procedures for the most frequently executed path also contributes to faster overall execution. Assembler branches taken are usually faster than those not taken, so put the evaluation of the most frequently occurring conditions first.
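The points made so far about literal suffixes, compile-time folding of constants, and trading a divide for a multiply can be sketched in C as follows. This is only an illustration: the function names are hypothetical and do not come from the article, and the parameter root stands in for a square root that has already been cast to float.

```c
/* Hypothetical helper names, chosen for illustration only. */

/* 10.0 is a double literal: val is promoted, the divide is done in
 * double precision, and the quotient is converted back to float. */
float tenth_promoted(float val)
{
    return (float)(val / 10.0);
}

/* 10.0f keeps the whole expression in single precision. */
float tenth_single(float val)
{
    return val / 10.0f;
}

/* One multiply and one divide at run time. */
float scale_divide(float root)
{
    return (17.0f * root) / 10.0f;
}

/* 17.0f / 10.0f is a constant expression the compiler can fold,
 * leaving a single multiply at run time. */
float scale_folded(float root)
{
    return (17.0f / 10.0f) * root;
}

/* Two divides ... */
float div_twice(float a, float b, float c)
{
    return a / b / c;
}

/* ... versus one multiply and one divide. */
float div_once(float a, float b, float c)
{
    return a / (b * c);
}
```

The paired functions are mathematically equivalent, but they are not guaranteed to round identically, so a test driver should compare them within a tolerance that suits the application. Whether the constant is actually folded depends on the compiler and its options, which is one more reason to inspect the generated code.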

TABLE 3 Rearranging equations for efficient preprocessing.

C Code Example:
float num = ( 17.0f / 10.0f ) * (float)(sqrt( val ));

move.l -4(a6),-(sp)  // Load stack with float val
jsr    ftod          // Convert val from single to double
jsr    _sqrt         // Double-precision square root
addq.l #4,sp
move.l d1,(sp)       // Load stack with double result
move.l d0,-(sp)
jsr    dtof          // Convert sqrt() double result to single
move.l # ,-(sp)      // Load pre-evaluated 17.0f/10.0f
jsr    fmul          // Single-precision multiply
move.l (sp)+,-8(a6)  // Pop and store result in variable num

TABLE 4 Rewriting algebraic expressions for better efficiency.

C Code, Successive Divides:      C Code, Multiply instead of Divide:
D = A / B / C;                   D = A / ( B * C );

move.l -4(a6),-(sp)              move.l -4(a6),-(sp)
move.l -8(a6),-(sp)              move.l -8(a6),-(sp)
jsr    fdiv                      move.l -12(a6),-(sp)
move.l -12(a6),-(sp)             jsr    fmul
jsr    fdiv                      jsr    fdiv
move.l (sp)+,-16(a6)             move.l (sp)+,-16(a6)

Compilers generally have their own optimizations for switch and case statements, which use jump tables and take note of large gaps in the values used in the various cases.

Table 5 shows some different ways to optimize algebraic equations. Look for the repeated use or evaluation of the same expression. In Equation 1, the product B * C is evaluated twice. Defining another variable to hold the product increases code size but avoids the extra multiply. Depending on the processor and data type, this can result in significant time savings. Note the following example:

    D = A / (B * C)
    E = 1 / (1 + (B * C))                    (1)

    evaluate B * C once: bc = B * C

LITERAL DEFINITIONS

Specifying common values with #defines or const terms might be pragmatic, but it is also prone to error. Several significant observations can be made about the following example. The value 3.14 is double precision, forcing a double-precision multiply and later a double- to single-precision type conversion call, all of which is time-consuming.
    #define TWO_PI 2 * 3.14
    float c, r;
    c = 2 * 3.14 * r;

    move.l -8(a6),-(sp)  // Load r
    jsr    ftod          // Make dbl
    move.l # ,-(sp)      // Load
    move.l # ,-(sp)      // 2 * PI
    jsr    dmul          // Multiply 2PI * r
    jsr    dtof          // Convert to single
    move.l (sp)+,-4(a6)  // Save in c

The next example shows that the defined value, combined with operator precedence, produces an incorrect result for a circle's radius: the circumference variable c is first divided by 2, and not by the product (2 * PI). The code also shows that the multiplication by PI occurs at run time everywhere the literal TWO_PI is used:

    #define TWO_PI 2 * 3.14f
    float c, r;
    r = c / 2 * 3.14f;

    move.l -4(a6),-(sp)
    move.l # ,-(sp)
    jsr    fdiv
    move.l # ,-(sp)
    jsr    fmul
    move.l (sp)+,-8(a6)

Two correct implementations follow, one using a #define and the other a const variable. The substitution for the literal definition avoids a multiply because the expression inside the parentheses is evaluated at compile time. Similarly, the const variable is also evaluated at compile time, but it requires more memory and some overhead for referencing the variable:

    #define TWO_PI (float)(2 * 3.14)
    // const float TWO_PI = 2 * 3.14f;
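The precedence problem described above is easy to reproduce. The sketch below is a minimal illustration; the macro names TWO_PI_BAD and TWO_PI_OK and both function names are hypothetical, introduced here only so that the two definitions can sit side by side.

```c
/* Unparenthesized: "c / TWO_PI_BAD" expands to "c / 2 * 3.14f",
 * which is evaluated as (c / 2) * 3.14f. */
#define TWO_PI_BAD 2 * 3.14f

/* Parenthesized and single precision throughout. */
#define TWO_PI_OK  (2.0f * 3.14f)

/* Computes (c / 2) * 3.14f, which is not the radius. */
float radius_wrong(float c)
{
    return c / TWO_PI_BAD;
}

/* Computes c / (2 * 3.14f), the intended radius. */
float radius_right(float c)
{
    return c / TWO_PI_OK;
}
```

For a circumference of 6.28f, radius_right() yields a value very close to 1.0f, while radius_wrong() yields roughly 3.14 squared. Wrapping every macro body that contains an operator in parentheses is a cheap habit that avoids this class of error.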

TABLE 5 Algebraic simplifications and the laws of exponents.

Original Expression:             Optimized Expression:

a^2 - 3a + 2                     (a - 1) * (a - 2)
  1 multiply (square term)         2 subtractions
  1 subtraction                    1 multiplication
  1 multiply (3a)
  1 addition

(a - 1) * (a + 1)                a^2 - 1
  1 subtraction                    1 multiply (square term)
  1 multiplication                 1 subtraction
  1 addition

1 / (1 + a / b)                  b / (b + a)
  2 divides                        1 divide
  1 addition                       1 addition

a^m * a^n                        a^(m + n)
  2 power functions                1 addition
  1 multiply                       1 power function

(a^m)^n                          a^(m * n)
  2 power functions                1 multiply
                                   1 power function

INTEGER MATH VS. FLOATING POINT

If inputs are bounded by definition, convention, or data type, a chance exists that integer math and appropriate scaling might be substituted for floating-point computations, as shown in Table 6. The savings in this case may seem unimpressive, but replacing a floating-point subtraction with a left shift and an integer subtraction represents a significant savings in execution time. Don't forget that pushing arguments on the stack, jumping to the subroutine, returning, and adjusting the stack are all overhead in solving the problem.

THE STANDARD MATH LIBRARY

As I've already mentioned, the standard math library generally expects double-precision values. Massive penalties result from converting from single precision to double precision and back again when using floating-point emulation software. However, using double precision as the default data type isn't the solution either, because it consumes more space and time (unless you're using a Pentium Pro, which does everything in double precision but isn't really an embedded processor).

Table 7 shows how a single-precision absolute value can be written. While the alternative generates more code, it's much faster than the type conversion function calls. This alternative can be encapsulated in a macro or inline function. It would be even better if the function abs() were overloaded for all relevant data types.
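One way to encapsulate the single-precision absolute value in an inline function is sketched below. The name abs_f32 is hypothetical, and the sketch assumes a compiler that accepts the C99 inline keyword.

```c
/* Single-precision absolute value with no promotion to double:
 * every operand and literal here is a float. */
static inline float abs_f32(float x)
{
    return (x < 0.0f) ? -x : x;
}
```

Like the test against zero it wraps, this version returns a negative zero unchanged, because -0.0f does not compare less than zero. Where a C99 library is available, fabsf() from <math.h> takes and returns float and may serve the same purpose; whether it is faster on a given target is something to confirm in the generated code.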
COMPILER OPTIONS AND ISSUES

Compilers offer many degrees of optimization. Some of these features are related to the programming language or object model, while others use specific knowledge of the processor to make the code execute faster. If developers are expected to write efficient code, they'll need to know which options are enabled or disabled, which default options are being used, and the effect of these options on their application. Some general options supplied by compilers for optimizing code include:

- stack processing
- global flow
- inlining
- local optimization
- instruction scheduling
- code produced for a specific processor
- run-time library function inlining
- allocation of frequently used variables in registers
- code space optimization
- speed optimization
- floating-point emulation

Some processors have separate calculation and execution stages. Because of this separation, instructions can be reordered to take advantage of known latencies in specific instructions, so that any stalling in the instruction pipeline is avoided or minimized. Good compilers have options for enabling these decisions. The code generated for one processor might not be optimal for another processor because of architectural differences, even if they're in the same family. Therefore, be sure to specify the correct processor in the compile switches.

TABLE 6 Integer math instead of floating point.

C Code:
unsigned short int input = 12345;  // Range 0 to
float output;
output = ( (float)input f ) / f;   // (Slow Code Ahead)

move.l #12345,d4     // Load into register d4
move.l d4,d0         // Load register d0, input to ultof
jsr    ultof         // Convert unsigned long to float
move.l # ,-(sp)      // Push f
jsr    fsub          // input f
move.l # ,-(sp)      // Push f
jsr    fdiv          // (input f) / f
move.l (sp)+,-4(a6)  // Save result in output

C Code using scaling:
// Using integer data types where possible and scaling numbers by 2^15
output = (float)( ((int)input<<15) ) / ;

move.l d4,d0         // Load reg. d4 into d0 (input = 12345)
moveq  #15,d1        // Load register d1 with shift amount
lsl.l  d1,d0         // Left shift register d0 by d1
subi.l # ,d0         // input*2^ * 2^15
jsr    ltof          // Convert numerator from long to float
move.l # ,-(sp)      // Load float of int
jsr    fdiv          // Single-precision division
move.l (sp)+,-4(a6)  // Save result in output variable
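The idea of doing part of the work in integer math before converting to floating point can be sketched as follows. The constants are assumptions chosen for illustration rather than the values used in Table 6: the input is taken to be a 16-bit reading centered on 32768 that should map onto the range -1.0 to just under 1.0. The function names are hypothetical, and the sketch is a simplified variant of the technique, not the table's exact code.

```c
#include <stdint.h>

/* Assumed midpoint of the 16-bit input range (illustrative only). */
#define MIDSCALE 32768

/* All-floating-point version: convert, subtract, divide. */
float normalize_float(uint16_t input)
{
    return ((float)input - 32768.0f) / 32768.0f;
}

/* The subtraction is done on integers, so only one conversion and
 * one floating-point operation remain at run time. */
float normalize_scaled(uint16_t input)
{
    int32_t centered = (int32_t)input - MIDSCALE;
    return (float)centered / 32768.0f;
}
```

With these particular constants every intermediate value is exactly representable in single precision, so a test driver can compare the two versions for equality over the whole input range. Other ranges or scale factors may introduce rounding differences that need to be checked against the accuracy the application requires.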

TABLE 7 Floating-point absolute value.

C Code:
output = fabs(input);  // (Slow Code Ahead)

move.l -4(a6),-(sp)  // Load input on stack
jsr    ftod          // Convert it from single to double precision
jsr    _fabs         // Double-precision ABS
addq.l #4,sp
move.l d1,(sp)       // Load result
move.l d0,-(sp)
jsr    dtof          // Convert result from double to single
move.l (sp)+,-8(a6)  // Save result in output

C Code, Test against zero instead of abs() function:
if ( input < 0 )
    output = - input;
else
    output = input;

     move.l -4(a6),-(sp)
     clr.l  -(sp)
     jsr    fcmp
     bge.s  L38
     move.l -4(a6),-(sp)
     eori.b #128,(sp)     // XOR the sign bit
     move.l (sp)+,-8(a6)  // Save output
     bra.s  L39
L38: move.l -4(a6),-8(a6) // Save output
L39:

The inlining of run-time libraries can also help reduce execution time by avoiding function call overhead or by supporting optimized operations. For example, a memcpy routine can be optimized for a small number of bytes and result in an inline expansion. If the size is very large, or the data type is user-specified instead of a primitive type, a function call to the memcpy routine might be generated. I found in one particular case that using memset resulted in a function call. While the library function handled the various data organization schemes that could be selected by the user, a quicker inline version might be a better choice if the size is known to be small. Even though the internals of the memset function were implemented in Assembler, the overhead of setting a long word to zero by using memset was excessive.

For portability, among other reasons, you should give special care to the alignment of data. These choices can affect efficiency and can vary with the processor used.

If your processor has an FPU, make sure the compiler has this switch turned on and that you're not running software to do floating-point calculations. Default options should be well understood because they may have an impact on performance.
For example, Microsoft Visual C++ supports a workaround for Pentium processors with flawed floating-point instructions. The Help Index states: "By default, the workaround is disabled (/QIfdiv-), and the code generator emits code that is unsafe on a flawed Pentium. If the workaround is enabled (/QIfdiv), the code generator emits fatter, safe code that tests for the processor bug and calls run-time routines instead of using the native instructions of the processor to generate correct floating-point results." [2] So a trade-off exists between accuracy and speed. Therefore, be very careful when generating benchmarks and comparing the accuracy of generated values across specific versions of programs, such as a debug build versus a release build. It's important to understand the implications of the switches chosen. Obviously, with a flawed Pentium, the run-time routines will run much slower than native instructions, but they'll be more accurate. A Pentium would not be a typical embedded processor selection, but your processor or compiler may have its own set of quirks.

Some high-end processors have both instruction and data caches. The compiler provides switches for enabling these features. The instruction cache should be enabled; the data cache should be enabled only if sufficient consideration has been given to data synchronization. Multiprocessing will require extra hardware for bus snooping to be sure that the data cache is synchronized between two or more processors or processes.

MEMORY ALLOCATION

For time-critical sections of code, the use of the memory manager should be scrutinized. If the size of an object or data type is small and the scope is sufficiently restricted, the stack might be a better choice for a storage area. This analysis might become obvious only after a design has been implemented. Some assumptions can then be made regarding the future growth of an object: if the object is small, maybe it's better left on the stack, rather than incurring the overhead of calling memory allocation functions. Another benefit is that keeping the smaller segments out of the heap can lessen fragmentation. In C and C++, using malloc, free, new, and delete doesn't come without some time penalty, yet it can be more flexible than using the stack.

C++

Some mistakes I've frequently made in C++ code have made my programs run more slowly than I would have liked. I found it especially important to know when a constructor is being executed. Trace statements or reference counting can help monitor these events. Most often, it's the copy constructor or assignment operator that is easily overlooked. It was only after stepping through the code that I noticed my design and implementation required the use of a copy constructor. The copy constructor was being called in a tight loop, which cost significant execution time. The design had to be redone.

Whenever possible, use a profiler or incorporate some crude timing elements to judge how efficiently the program is executing. This tactic is handy for benchmarking and for making performance changes whose benefit is difficult to prove except by empirical methods. Some of the better development environments have a built-in profiler or an add-on component that monitors entries to and returns from methods.

Another common mistake is passing an object back on the stack instead of passing its reference. The copy constructor is executed for the temporary object.

After a design has been implemented, tested, and used, a second pass can be made to improve the application. Classes that were originally necessary may become superfluous and can be eliminated or encapsulated in a superclass. This may remove unnecessary layers of inheritance, which add to the overhead and processing time of an application.

MISCELLANEOUS SETBACKS

A common source of performance degradation is the cutting and pasting of working code as a base for newer features. One bad example, copied and promulgated through the system, significantly degrades performance. Bad examples should always be corrected, or at least commented. Developers will appreciate your honesty and humility.

We often write code with the assumption that it might be used in the future. When the future arrives, the requirements may have changed, or may be better understood, such that the old design is insufficient or wrong. Thus, when in doubt, leave it out. Don't code it if it isn't needed. If you're using archival tools, the questionable code can be saved there. It's frustrating to browse through code that is no longer built while looking for code that needs to be optimized, or for common defects in relic code.

RECOGNIZE THE SIGNS

I've discussed some common implementation oversights and mistakes and offered some recommendations regarding implementation details that will optimize code for time. A top-down approach should generally be used to find where the time is being spent; hopefully, you'll find a smoking gun. However, this approach will address only specific instances of code inefficiency. A parallel path can also be taken, in which the cross-reference section of the link map is inspected for unusual references. Double-precision math operations are an example. Most importantly, optimize something that already works: the first step is to get the correct solution.

Some fundamental concepts must first be understood in order to write efficient real-time embedded software. These concepts include knowing your processor's architecture, the programming language used, the features of the compiler, and even the object model used by the compiler.
Inspecting the code generated by the compiler closes the loop and allows you to see how well the compiler has interpreted your coding requests and converted them into executable code. Be ready for some surprises. With this understanding and mindset, you will learn to recognize those warning signs that read "slow code ahead." Then you can speed up your code, and write more efficient applications in the process.

Bill Trudell is a software engineer currently employed by Fisher Rosemount Systems, Inc., a solutions provider for the process control industry. He has extensive experience in the design of real-time multitasking embedded and PC applications in various domains. Bill can be reached via e-mail at billtrudell@msn.com.

REFERENCES

1. Kernighan, Brian W. and Dennis M. Ritchie. The C Programming Language, Second Edition. Englewood Cliffs, NJ: Prentice-Hall.
2. Microsoft Visual C++, Version 4.2, Books On-line. Microsoft Corp., Redmond, WA.
