Wednesday, October 4, 2017: Optimizing compilers - source modification; Optimizing compilers - code generation; Your program - miscellaneous


Wednesday, October 4, 2017

Topics for today

    Code improvement
        Optimizing compilers - source modification
        Optimizing compilers - code generation
        Your program - miscellaneous

Optimization

Michael Jackson's rules of optimization:
    Rule 1: Don't do it.
    Rule 2 (experts only): Don't do it yet.

Donald Knuth: "Premature optimization is the root of all evil (or at least most of it) in programming."

Optimizing compilers (p. 297)

There are many possible translations of a particular high-level language program into assembly code. Two measures of the assembly code are its size (in bytes) and the time it takes to run. The "default" translation (the one that is least work for the compiler?) is not likely to have the shortest run time, nor to use the least space. Typically there is a trade-off between the time it takes an object program to run and its size.

[Diagram: a plot with run time on one axis and program size on the other. Each dot represents a possible translation of the source program; D represents the program the compiler produces by default.]

An optimizing compiler is one that does extra work during translation to arrive at a translation that is better than the default (by some measure).

Many compilers permit users to specify the types of optimization that they wish the compiler to perform. [From the manual for the gcc compiler, here is the section listing the optimization options. The default is no optimization.]

Optimization Options

    -falign-functions=n -falign-jumps=n -falign-labels=n -falign-loops=n
    -fbranch-probabilities -fcaller-saves -fcprop-registers -fcse-follow-jumps
    -fcse-skip-blocks -fdata-sections -fdelayed-branch
    -fdelete-null-pointer-checks -fexpensive-optimizations -ffast-math
    -ffloat-store -fforce-addr -fforce-mem -ffunction-sections -fgcse -fgcse-lm
    -fgcse-sm -floop-optimize -fcrossjumping -fif-conversion -fif-conversion2
    -finline-functions -finline-limit=n -fkeep-inline-functions
    -fkeep-static-consts -fmerge-constants -fmerge-all-constants
    -fmove-all-movables -fnew-ra -fno-branch-count-reg -fno-default-inline
    -fno-defer-pop -fno-function-cse -fno-guess-branch-probability -fno-inline
    -fno-math-errno -fno-peephole -fno-peephole2 -funsafe-math-optimizations
    -ffinite-math-only -fno-trapping-math -fno-zero-initialized-in-bss
    -fomit-frame-pointer -foptimize-register-move -foptimize-sibling-calls
    -fprefetch-loop-arrays -freduce-all-givs -fregmove -frename-registers
    -freorder-blocks -freorder-functions -frerun-cse-after-loop -frerun-loop-opt
    -fschedule-insns -fschedule-insns2 -fno-sched-interblock -fno-sched-spec
    -fsched-spec-load -fsched-spec-load-dangerous -fsignaling-nans
    -fsingle-precision-constant -fssa -fssa-ccp -fssa-dce -fvrp
    -fstrength-reduce -fstrict-aliasing -ftracer -fthread-jumps -ftsp-ordering
    -funroll-all-loops -funroll-loops --param name=value
    -O -O0 -O1 -O2 -O3 -Os

A compilation that uses optimization techniques takes more time because the compiler is doing more work. It is thus most appropriate to select optimization options when producing the final, production version of a program rather than while still at the debugging stage.

Here are five examples of the kinds of optimizations a compiler might perform. Think of them as transformations on the source program, performed before translation into assembly code, even though that is probably not how the compiler implements them.

    1. Detecting common sub-expressions / performing arithmetic at compile time
    2. Detecting common code in branches
    3. Loop unrolling with constant
    4. Strength reduction
    5. Loop unrolling with variable
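As a concrete illustration (the file name opt_demo.c and the function name below are invented for this sketch), here is a small C program containing a common sub-expression, some compile-time arithmetic and a constant-count loop. Compiling it once with -O0 and once with -O2 (both options appear in the list above) and comparing the generated assembly is an easy way to see what the compiler does with such opportunities.

    /* opt_demo.c -- a made-up test case with several optimization opportunities */
    #include <stdio.h>

    int opt_demo(int a, int b)
    {
        int secondsInAWeek = 7 * 24 * 60 * 60;   /* compile-time arithmetic */
        int t = (a + b - 49) * (b + a - 49);     /* common sub-expression   */
        int sum = 0;
        int i;
        for (i = 0; i < 3; i++)                  /* constant-count loop     */
            sum += t;
        return sum + secondsInAWeek;
    }

    int main()
    {
        printf("%d\n", opt_demo(10, 20));
        return 0;
    }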

Example 1: detection of common sub-expressions, or performing compile-time arithmetic

In each of the following assignment statements there is a duplicated expression. A compiler can detect that and generate a single instance of the code that evaluates the expression:

    t = (a + b - 49) * (b + a - 49);

    AR[y * t + 19] = AR[y * t + 19] + 1;

For instance, the second example is transformed to

    temp = y * t + 19;
    AR[temp] = AR[temp] + 1;

probably saving both time and space.

If there is arithmetic that can be carried out at compile time, it makes sense for the compiler to do it. Here is an (extreme) example:

    W = T + ( * 7) * (43 - 6);

The whole of the second term is a constant expression, so the compiler can evaluate it once at compile time and replace the statement by

    W = T + constant;

A more reasonable example is where you have included an expression for readability:

    int secondsInAWeek = 7 * 24 * 60 * 60;
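A small, self-contained sketch of the second transformation done by hand (the array size and the values of y and t are invented for the example); it checks that the original and rewritten forms update the array identically.

    #include <assert.h>
    #include <stdio.h>

    int main()
    {
        int AR1[100] = {0}, AR2[100] = {0};
        int y = 3, t = 7;
        int temp;

        /* Original form: the index expression appears twice. */
        AR1[y * t + 19] = AR1[y * t + 19] + 1;

        /* Transformed form: the common sub-expression is evaluated once. */
        temp = y * t + 19;
        AR2[temp] = AR2[temp] + 1;

        assert(AR1[y * t + 19] == AR2[temp]);
        printf("element %d is %d in both versions\n", temp, AR1[temp]);
        return 0;
    }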

Example 2: common code in branches

Sometimes, typically after multiple edits, a programmer may not realize that both branches of a conditional contain the same action. A compiler can check for this:

    if (t < 0)
    {
        b = 99;
        a = 0;
    }
    else
    {
        a = 0;
        b = 40;
    }

The compiler can in effect transform this into

    if (t < 0)
        b = 99;
    else
        b = 40;
    a = 0;

saving space in the object program.

Example 3: Loop unrolling with constant

Depending on the size of the loop action, the loop overhead (initializing a loop variable, testing it, incrementing or decrementing it) may account for a significant fraction of the loop's run time. A compiler can unroll a loop to reduce or eliminate this overhead. If the compiler can determine the number of iterations, the loop overhead can be eliminated completely; for example,

    for (i = 0; i < 3; i++)
    {
        read(n);
        sum += n;
    }

can be treated as if the user had written

    read(n); sum += n;
    read(n); sum += n;
    read(n); sum += n;

We get rid of the overhead of the loop counter. The new code will run faster but it may take up more space. This is why the user is given so many choices about which optimizations to perform. See Example 5 later for cases where the number of iterations is not a constant.

Example 4: strength reduction

In strength reduction we try to replace an operation by a faster one. In the following example we replace multiplication (a slow operation on most systems) by addition. The compiler treats

    for (i = 1; i < n; i++)
    {
        T = 6 * i;
        output(T);
    }

as if the programmer had written

    T = 0;
    for (i = 1; i < n; i++)
    {
        T = T + 6;
        output(T);
    }
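Here is a minimal, runnable sketch of the two forms in Example 4 (the function names strength_before and strength_after are made up, and printf stands in for the pseudocode output); both print the same sequence, which is easy to confirm by running the program.

    #include <stdio.h>

    /* Original form of Example 4: one multiplication per iteration. */
    void strength_before(int n)
    {
        int i;
        for (i = 1; i < n; i++)
            printf("%d ", 6 * i);
        printf("\n");
    }

    /* Reduced form: the multiplication is replaced by a running addition. */
    void strength_after(int n)
    {
        int i, t = 0;
        for (i = 1; i < n; i++)
        {
            t = t + 6;
            printf("%d ", t);
        }
        printf("\n");
    }

    int main()
    {
        strength_before(10);     /* prints 6 12 18 ... 54    */
        strength_after(10);      /* prints the same sequence */
        return 0;
    }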

If we know the number of iterations of the Example 4 loop, as in

    for (i = 1; i < 10; i++)
    {
        T = 6 * i;
        output(T);
    }

the compiler might be able to reduce the program to

    output(6); output(12); output(18); etc.

Example 5: Loop unrolling with variable

What if the loop count is not a constant, for example for (i = 0; i < N; i++)? The following transformation reduces the time spent on loop overhead by approximately 50%, but the space requirement is increased because we now have three instances of the action:

    for (i = 0; i < N/2; i++)
    {
        action;
        action;
    }
    for (i = 0; i < N%2; i++)
    {
        action;
    }

For example, if N contains 37 then the first loop will iterate 18 times (performing 2 actions each time) and the second loop once. Note that the compiler will typically generate code that avoids calculating N/2 or N%2 more than once.
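A runnable sketch of this two-way unrolling, with the abstract action replaced by a counter increment so that the total number of actions can be checked against the plain loop (the variable names are invented for the example):

    #include <assert.h>
    #include <stdio.h>

    int main()
    {
        int N = 37;
        int i;
        long plain = 0, unrolled = 0;

        /* Plain loop: one action per iteration. */
        for (i = 0; i < N; i++)
            plain++;

        /* Unrolled by 2: N/2 iterations doing two actions, plus the remainder. */
        for (i = 0; i < N / 2; i++)
        {
            unrolled++;
            unrolled++;
        }
        for (i = 0; i < N % 2; i++)
            unrolled++;

        assert(plain == unrolled);
        printf("both loops performed %ld actions\n", plain);
        return 0;
    }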

We can take this idea further with the following, which uses even more space but decreases the run time still further:

    M = N/8;
    for (i = 0; i < M; i++)
    {
        action; action; action; action;
        action; action; action; action;
    }
    for (i = 0; i < N%8; i++)
    {
        action;
    }

If N is 101, for example, then the main loop iterates 12 times (accounting for 96 actions) and then the second loop iterates 5 times, accounting for the remainder.

Duff's Device

Tom Duff proposed combining the two loops into a single loop. Surprisingly, the following is legal C, even though jumping into the middle of a loop is normally frowned on. The first time through the loop only part of the body may be executed (the remainder part of the example above); thereafter come the full passes.

    int t = (N + 7) / 8;
    switch (N % 8)
    {
        case 0: do { action;
        case 7:      action;
        case 6:      action;
        case 5:      action;
        case 4:      action;
        case 3:      action;
        case 2:      action;
        case 1:      action;
                } while (--t > 0);
    }

For example, if N is 85, then t is 11 and N%8 is 5, so we jump to case 5 and do 5 actions as we fall through the cases. Thereafter we perform full passes. There are a total of t (11) passes through the loop: one of 5 actions and 10 of 8 actions, giving us a total of 85.
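This arithmetic can be checked with a self-contained sketch that instantiates the device with a counter as the action (the variable names are invented); running it for N = 85 confirms one partial pass of 5 actions plus 10 full passes of 8.

    #include <stdio.h>

    int main()
    {
        int N = 85;
        long count = 0;              /* the "action" is counting */
        int t = (N + 7) / 8;

        switch (N % 8)
        {
            case 0: do { count++;
            case 7:      count++;
            case 6:      count++;
            case 5:      count++;
            case 4:      count++;
            case 3:      count++;
            case 2:      count++;
            case 1:      count++;
                    } while (--t > 0);
        }

        printf("performed %ld actions for N = %d\n", count, N);  /* prints 85 */
        return 0;
    }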

An article in the August 2005 issue of Dr. Dobb's Journal shows how this code can be wrapped in a macro with (1) a dummy outer loop to provide a local namespace for variables, and (2) logical operations to speed up the divide and mod. Here is a macro definition that you can use in a C program:

    #define DUFF_DEVICE_8(aCount, aAction)       \
    do                                           \
    {                                            \
        int count_ = (aCount);                   \
        int times_ = (count_ + 7) >> 3;          \
        switch (count_ & 7)                      \
        {                                        \
            case 0: do { aAction;                \
            case 7:      aAction;                \
            case 6:      aAction;                \
            case 5:      aAction;                \
            case 4:      aAction;                \
            case 3:      aAction;                \
            case 2:      aAction;                \
            case 1:      aAction;                \
                    } while (--times_ > 0);      \
        }                                        \
    } while (0)

Now the user can simply write something like

    DUFF_DEVICE_8(N, printf("\n"));

Here is the log of a test of Duff's Device (the file duff.h contains the text of the macro above):

    sh-3.00$ cat test2b.c
    #include <stdio.h>
    #include "duff.h"

    int main()
    {
        int N = 13;
        DUFF_DEVICE_8(N, printf("\n"));
    }
    sh-3.00$ gcc test2b.c
    sh-3.00$ a.out

Tweaking source code

In an embedded-system environment there is often tweaking of the source code to cause the compiler to generate the output you want (small and/or fast). This may result in source programs that look inelegant (from the point of view of an instructor in a high-level language programming course).

Example 1: writing code inline instead of using function calls, with their overheads. The good code defines a function (call it f) whose body is the statements A; B; C; and calls it where needed:

    void f()
    {
        A; B; C;
    }
    ...
    f();
    f();

Code that might give you the object program you want repeats the body inline:

    A; B; C;
    A; B; C;

Example 2: avoiding parameter passing.

Good code:

    void pstars(int N)
    {
        for (int i = 0; i < N; i++)
            output('*');
    }
    ...
    pstars(5);
    pstars(17);
    pstars(5);

Code that should give you a faster object program (no parameter passing):

    void p5stars()
    {
        output("*****");
    }
    void p17stars()
    {
        output("*****************");
    }
    ...
    p5stars();
    p17stars();
    p5stars();

Even better (no calls to user functions at all):

    output("*****");
    output("*****************");
    output("*****");
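Here is a compilable sketch of Example 2, with printf/putchar standing in for the pseudocode output and the function names taken from the notes; all three versions print the same three lines of stars.

    #include <stdio.h>

    /* General version: the star count is passed as a parameter. */
    void pstars(int N)
    {
        int i;
        for (i = 0; i < N; i++)
            putchar('*');
        putchar('\n');
    }

    /* Specialised versions: no parameter passing. */
    void p5stars()  { printf("*****\n"); }
    void p17stars() { printf("*****************\n"); }

    int main()
    {
        pstars(5);  pstars(17);  pstars(5);               /* general      */
        p5stars();  p17stars();  p5stars();               /* specialised  */
        printf("*****\n*****************\n*****\n");      /* fully inline */
        return 0;
    }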

Optimizing Compilers: Code generation

The optimizations we have seen so far can be thought of as being applied to the source program, transforming it in some way before translation. What choices might the compiler make when it comes to generating the assembly code? We look at a couple.

(1) Memory vs. registers

It takes longer for the CPU to access memory than to access registers. An optimization is to have the compiler use registers instead of memory where possible. In general, the algorithm for determining which variable is mapped to which register at any time might be complex. However, Pep/9 has only two general-purpose registers, which simplifies the issue. Here is an example of how a C program might be translated into Pep/9 assembly code, maximizing the use of registers:

    sum = 0;
    for (i = 0; i < n; i++)
    {
        read(m);
        sum += m;
    }
    output(sum);

We can use register A to hold sum and register X to hold i, leading to

             ldwa    0,i         ; sum = 0
             ldwx    0,i         ; i = 0
    top:     cpwx    N,d
             brge    done
             deci    M,d
             adda    M,d         ; sum is updated
             addx    1,i         ; i is updated
             br      top
    done:    stwa    sum,d
             deco    sum,d

This is faster than using memory variables i and sum, but it needs comments to help the reader follow the mapping.
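In C the programmer can hint at such a mapping with the register storage class, as in the sketch below (scanf and printf stand in for the pseudocode read and output). This is only a hint; a modern compiler normally makes the allocation decision itself, so treat the example as illustrative rather than as a guaranteed optimization.

    #include <stdio.h>

    int main()
    {
        register int sum = 0;   /* hint: keep sum in a register */
        register int i;         /* hint: keep i in a register   */
        int m, N = 5;

        for (i = 0; i < N; i++)
        {
            if (scanf("%d", &m) != 1)    /* read(m) */
                break;
            sum += m;
        }
        printf("%d\n", sum);             /* output(sum) */
        return 0;
    }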

In general, a compiler determines the scope of variables in a program and determines whether any of them can map to the same register. Consider the following, where a table on the left of the code (it marks, for each of a, b, c and d, the lines where that variable is in use) accompanies the listing:

    int complexfunction()
    {
        int a, b = 0, c, d;
        for (a = 0; a < 1000; a++)
        {
            read(vector[a]);
            b += vector[a];
        }
        for (c = 1; c < 1000; c++)
        {
            vector[c] = vector[c] / b;
            vector[c] = vector[c-1] + vector[c];
        }
        d = 0;
        for (c = 0; c < 1000; c++)
        {
            d += vector[c];
            vector[c] /= d;
        }
        for (a = 0; a < 1000; a++)
            output(vector[a]);
    }

Because they are never in use at the same time, variables a and c can be mapped to the same register. Similarly, variables b and d can be mapped to the same register.

(2) Basic blocks

A basic block is a sequence of statements with only one way in (at the top) and one way out (at the end). In other words, there is no way to jump into the middle of the block and no way to leave it from the middle.
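As a rough companion illustration, the function below (invented for this purpose) has its basic blocks marked by hand in comments: a block ends at each branch and a new one starts at each branch target.

    /* Illustrative only: block boundaries marked by hand. */
    int classify(int x)
    {
        int r;                 /* ---- block 1: entry, runs straight ---- */
        int twice = 2 * x;

        if (twice > 100)       /* block 1 ends at the conditional branch  */
        {
            r = 1;             /* ---- block 2: the "then" branch -------- */
        }
        else
        {
            r = -1;            /* ---- block 3: the "else" branch -------- */
        }

        return r + twice;      /* ---- block 4: the join point ----------- */
    }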

Consider the C fragment

    total  = sum + 5;
    result = sum + 2 * total;
    sum    = 2 * sum + 2 * total;

If the compiler looks at each statement in isolation, it will produce something like the following 12-instruction sequence:

             ldwa    sum,d
             adda    5,i
             stwa    total,d
             ;
             ldwa    sum,d
             adda    total,d
             adda    total,d
             stwa    result,d
             ;
             ldwa    sum,d
             adda    sum,d
             adda    total,d
             adda    total,d
             stwa    sum,d

However, if the compiler takes into account the fact that the three high-level statements constitute a basic block and must be executed in the sequence given, it can take advantage of previous calculations and save 4 instructions, as in

             ldwa    sum,d
             adda    5,i
             stwa    total,d
             ;
             asla
             adda    sum,d
             stwa    result,d
             ;
             adda    sum,d
             stwa    sum,d

Miscellaneous ideas for improving your programs

(1) In looking at the programs you write, note that cpwa 0,i is almost always redundant (the load or arithmetic instruction that precedes it has already set the N and Z status bits).

(2) In the sequence

             stwa    t,d
             ldwa    t,d

the load is redundant because the value of t is still in register A.

(3) Branches to the next line are redundant:

             brge    label
    label:   ...

because the program goes to the labeled line whether or not the condition is met.

Reading

Warford has some remarks on optimization on pages 297 and 298.

We will begin section 6.3 next, looking at subroutines in Pep/9 and how they can be used to implement functions in a high-level language.

Review Questions

1. There are redundancies in the following program, which inputs a number and outputs one of two messages. Identify the instructions that could be removed.

             deci    N,d
             ldwa    N,d
             cpwa    40,i
             brlt    Y
             stro    ONE,d
             br      end
    end:     stop
    Y:       ldwa    N,d
             cpwa    40,i
             brlt
             stro    TWO,d
             br      end
             stop
    N:       .block  2
    ONE:     .ascii  "low\n\x00"
    TWO:     .ascii  "high\n\x00"
             .end

2. Consider the for loop

    for (i = 1; i < limit; i++)
        action

A translation is

             ldwa    1,i
             stwa    i,d
    top:     action
             ldwa    i,d
             adda    1,i
             stwa    i,d
             cpwa    limit,d
             brlt    top

(a) If limit has the value 3, what is the largest number of bytes that action can be so that the unrolled loop is no bigger than the original code?

(b) Suppose that limit has the value 5; what is the largest number of bytes now?

3. The following code inputs N and assigns 15 * N to M.

             deci    N,d
             ldwa    N,d
             ldwx    14,i
    L:       adda    N,d
             subx    1,i
             brgt    L
             stwa    M,d

(a) How much space does it occupy (in bytes), and how many instruction executions are there when it runs?

(b) What are the space and run-time figures if we unroll the loop? Is the unrolled version smaller or larger? Is it faster or slower?

(c) Is there an implementation of the calculation that is both faster and smaller than the original? If so, show how it can be done.

Review Answers

1. The instructions marked *** can be removed:

             deci    N,d
             ldwa    N,d
             cpwa    40,i
             brlt    Y
             stro    ONE,d
             br      end       ***
    end:     stop
    Y:       ldwa    N,d       ***
             cpwa    40,i      ***
             brlt              ***
             stro    TWO,d
             br      end       ***
             stop
    N:       .block  2
    ONE:     .ascii  "low\n\x00"
    TWO:     .ascii  "high\n\x00"
             .end

2. (a) The loop is 21 bytes plus the action. Unrolled, there are 3 copies of the action, so the action can be no more than 10 bytes.

   (b) The action can be no more than 5 bytes.

3. (a) 21 bytes, and 46 instruction executions at run time.

   (b) 51 bytes and 17 instruction executions, so the unrolled version is larger but faster.

   (c) Yes. The following is 16 bytes and 8 instructions:

             deci    N,d
             ldwa    N,d
             asla              ; 2N
             asla              ; 4N
             asla              ; 8N
             asla              ; 16N
             suba    N,d       ; 15N
             stwa    M,d
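The shift-and-subtract idea in answer 3(c) can be written directly in C; here is a minimal sketch (the function name times15 and the check against plain multiplication are just for illustration).

    #include <assert.h>
    #include <stdio.h>

    /* Compute 15 * n as (n << 4) - n: four left shifts give 16n, then subtract n. */
    int times15(int n)
    {
        return (n << 4) - n;
    }

    int main()
    {
        int n = 7;
        assert(times15(n) == 15 * n);
        printf("15 * %d = %d\n", n, times15(n));
        return 0;
    }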
