Program Optimization and Analysis. Chenyang Lu CSE 467S


1 Program Optimization and Analysis
Chenyang Lu, CSE 467S

2 Program Transformation
[Figure: program transformation flow: HLL → (compile) → assembly → (assemble) → object → (link) → executable → (load). Annotations: optimize, analyze; relative address, absolute address, physical address.]

3 What do we need to do?
- Understand optimization levels (-O1, -O2, etc.).
- Optimize HLL code.
- Analyze and optimize assembly code.
- Modifying compiler output requires care:
  - correctness;
  - loss of hand-tweaked code.

4 Goals
- Optimizing for execution time.
- Optimizing for energy/power.
- Optimizing for program size.
- They may conflict with each other!

5 Expression Simplification
- Constant folding:
  - 8+1 = 9
- Algebraic:
  - a*b + a*c = a*(b+c)
- Strength reduction:
  - a*2 = a<<1
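
The three rewrites on this slide can be sketched in C as before/after pairs. This is a minimal illustration, not compiler output: the function names are hypothetical, and unsigned 32-bit operands are an assumption chosen so that each pair stays equal even when the arithmetic wraps around.

```c
#include <stdint.h>

/* Constant folding: 8 + 1 can be evaluated at compile time. */
uint32_t folded_before(void) { return 8 + 1; }
uint32_t folded_after(void)  { return 9; }

/* Algebraic simplification: two multiplications become one. */
uint32_t algebraic_before(uint32_t a, uint32_t b, uint32_t c) {
    return a * b + a * c;
}
uint32_t algebraic_after(uint32_t a, uint32_t b, uint32_t c) {
    return a * (b + c);
}

/* Strength reduction: a multiply by 2 becomes a cheaper left shift. */
uint32_t strength_before(uint32_t a) { return a * 2; }
uint32_t strength_after(uint32_t a)  { return a << 1; }
```

With signed operands the shift form needs more care, because left-shifting a negative value is not well defined in older C standards.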

6 Dead Code Elimination
- Dead code:
    #define DEBUG 0
    if (DEBUG) dbg(p1);
- Eliminate by control flow analysis, constant folding.
[Figure: control flow graph containing the dbg(p1); node.]
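
A compilable version of the slide's fragment is sketched below. The counter dbg_calls and the wrapper function process are hypothetical additions that make the effect observable: because DEBUG folds to 0, the call can never execute, so a compiler is free to delete it.

```c
#define DEBUG 0

static int dbg_calls = 0;   /* hypothetical counter: how often dbg() runs */

static void dbg(int p1) {
    (void)p1;
    dbg_calls++;
}

int process(int p1) {
    if (DEBUG)        /* folds to if (0): the branch is never taken */
        dbg(p1);      /* dead code: the compiler may drop this call */
    return p1 + 1;
}
```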

7 Function Call

8 Instructions (ARM7)
- Branch and link instruction:
    BL foo
  is equivalent to
    MOV r14, r15
    B foo
  - r15 contains the current PC.
  - Copies current PC to r14.
- To return from subroutine:
    MOV r15, r14

9 Stack
- Use a stack to keep track of
  - parameters,
  - return value,
  - return address.
- Caller and callee access the stack in a consistent order.
  - Different compilers/programmers may follow different orders.
- Access the stack (ARM7)
  - r13 always points to the top of stack
  - Push: STR r0, [r13, #4]!
  - Pop: SUB r13, #4

10 Stack Operations
- Caller: call a function
  - Push parameters to stack
  - BL (r15 → r14; jump)
- Callee: receive a call
  - Read parameters from stack
  - Overwrite top of stack with return address (r14)
- Callee: return
  - Load PC with return address (on top of stack)
- Caller: receive a return
  - Pop callee's return address from stack

11 Nested function calls (ARM7)
    main() { f1(x); }
    void f1(int a) { f2(a); }

    ; f1 is called by main()
    LDR r0, [r13]        ; load parameter into r0 from stack
    STR r14, [r13]       ; store f1's return addr.
    ; f1 calls f2()
    STR r0, [r13, #4]!   ; push parameter for f2 to stack
    BL f2                ; branch and link to f2
    ; return from f2()
    SUB r13, #4          ; pop f2's parameter off stack
    ; f1 returns to main()
    LDR r15, [r13]       ; restore register and return

12 Function Inlining
    int foo(int a, int b, int c) { return a + b - c; }
    z = foo(w,x,y);
    ⇒
    z = w + x - y;
- Improves performance by eliminating function call overhead.
- May increase code size, but not always.
- Affects instruction cache behavior.
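
The slide's transformation can be written as two equivalent C functions. This is only a sketch: z_with_call and z_inlined are hypothetical wrapper names, and the inline keyword is a hint that a compiler may or may not follow.

```c
/* The callee from the slide. */
static inline int foo(int a, int b, int c) { return a + b - c; }

/* Before inlining: computing z pays the call overhead
   (argument passing, branch and link, return). */
int z_with_call(int w, int x, int y) { return foo(w, x, y); }

/* After inlining: the body of foo replaces the call site. */
int z_inlined(int w, int x, int y) { return w + x - y; }
```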

13 Optimization: Inlining
[Table: columns App, Code size (inlined), Code size (noninlined), Code reduction (%), Data size, CPU reduction (%); rows Surge, Maté, TinyDB; values not recoverable.]
Inlining improves performance and reduces code size. Why?

14 Loops

15 Loop Unrolling
- Reduces loop overhead
    for (i=0; i<4; i++)
        a[i] = b[i] * c[i];
    ⇒
    for (i=0; i<4; i+=2) {
        a[i] = b[i] * c[i];
        a[i+1] = b[i+1] * c[i+1];
    }
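
Both forms of the loop are sketched below as complete C functions so they can be compared. The function names and the LEN constant are hypothetical; unrolling by two as shown assumes the trip count is even.

```c
enum { LEN = 4 };   /* trip count from the slide; must be even here */

/* Original loop: one test and one index update per element. */
void multiply_rolled(int a[LEN], const int b[LEN], const int c[LEN]) {
    for (int i = 0; i < LEN; i++)
        a[i] = b[i] * c[i];
}

/* Unrolled by two: half as many tests and index updates. */
void multiply_unrolled(int a[LEN], const int b[LEN], const int c[LEN]) {
    for (int i = 0; i < LEN; i += 2) {
        a[i]     = b[i]     * c[i];
        a[i + 1] = b[i + 1] * c[i + 1];
    }
}
```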

16 Loop Overhead on ARM7
    ; loop initiation code
    MOV r0, #0          ; use r0 for loop counter
    MOV r8, #0          ; use separate index for arrays
    MOV r1, #4          ; buffer size
    MOV r2, #0          ; use r2 for f
    ADR r3, c           ; load r3 with base of c[ ]
    ADR r5, x           ; load r5 with base of x[ ]
    ; loop body
L:  LDR r4, [r3, r8]    ; get c[i]
    LDR r6, [r5, r8]    ; get x[i]
    MUL r4, r4, r6      ; compute c[i]*x[i]
    ADD r2, r2, r4      ; add into sum
    ADD r8, r8, #4      ; add one word to array index
    ADD r0, r0, #1      ; add 1 to i
    CMP r0, r1          ; exit?
    BLT L               ; if i < 4, continue

17 Loop Fusion
Combines multiple loops:
    for (i=0; i<n; i++)
        a[i] = b[i] * 5;
    for (j=0; j<n; j++)
        w[j] = c[j] * d[j];
    ⇒
    for (i=0; i<n; i++) {
        a[i] = b[i] * 5;
        w[i] = c[i] * d[i];
    }
Necessary conditions
- Loops share the same index.
- No dependencies between the two loops.
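
A runnable sketch of the two forms follows. The function names are hypothetical, and the code assumes that the output arrays a and w do not overlap each other or any input array, which is what makes the loops independent.

```c
/* Two separate loops over the same range 0..n-1. */
void separate_loops(int *a, int *w, const int *b, const int *c,
                    const int *d, int n) {
    for (int i = 0; i < n; i++)
        a[i] = b[i] * 5;
    for (int j = 0; j < n; j++)
        w[j] = c[j] * d[j];
}

/* Fused: one loop, so the test and index update run n times, not 2n. */
void fused_loop(int *a, int *w, const int *b, const int *c,
                const int *d, int n) {
    for (int i = 0; i < n; i++) {
        a[i] = b[i] * 5;
        w[i] = c[i] * d[i];
    }
}
```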

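18 Code Motion
    for (i=0; i<n*m; i++)
        z[i] = a[i] + b[i];
[Figure: flowcharts before and after code motion. Before: i=0; test i<n*m; body z[i] = a[i] + b[i]; i = i+1. After: i=0; X = n*m; test i<X; same body, so n*m is computed once outside the loop.]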
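
The effect of moving the loop-invariant product out of the loop test can be sketched as follows. The function names are hypothetical, and the code assumes n*m does not overflow an int.

```c
/* Before: n*m appears in the loop test, which runs on every iteration. */
void add_unhoisted(int *z, const int *a, const int *b, int n, int m) {
    for (int i = 0; i < n * m; i++)
        z[i] = a[i] + b[i];
}

/* After code motion: the invariant product is computed once. */
void add_hoisted(int *z, const int *a, const int *b, int n, int m) {
    const int x = n * m;
    for (int i = 0; i < x; i++)
        z[i] = a[i] + b[i];
}
```

An optimizing compiler will often perform this motion itself when it can prove that n and m do not change inside the loop.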

19 Array

20 One-Dimensional Array
- C array name points to 0th element:
    a[i] = *(a + i)
[Figure: a points to consecutive elements a[0], a[1], a[2], ...]

21 Two-Dimensional Array
- Row-major layout (N rows, M columns):
    a[i][j] = *(a + i*M + j)
[Figure: memory layout a[0,0], a[0,1], ..., a[1,0], a[1,1], ...]
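
The row-major formula can be checked against the layout a C compiler actually uses. In this sketch the dimensions N_ROWS and M_COLS and both function names are hypothetical; the comparison is done on byte addresses so that the check itself stays within well-defined pointer arithmetic.

```c
#include <assert.h>
#include <stddef.h>

enum { N_ROWS = 3, M_COLS = 4 };   /* hypothetical dimensions */

/* Element offset of a[i][j] in a row-major array with m columns. */
size_t row_major_offset(size_t i, size_t j, size_t m) {
    return i * m + j;
}

/* Returns 1 if &a[i][j] sits exactly i*M_COLS + j elements past the start. */
int matches_compiler_layout(int (*a)[M_COLS], size_t i, size_t j) {
    assert(i < N_ROWS && j < M_COLS);
    const char *base = (const char *)a;
    const char *elem = (const char *)&a[i][j];
    return elem == base + row_major_offset(i, j, M_COLS) * sizeof(int);
}
```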

22
    for (i=0; i<n; i++)
        for (j=0; j<m; j++)
            z[i][j] = b[i][j];

    zptr = z; bptr = b;
    for (i=0; i<n; i++)
        for (j=0; j<m; j++) {
            zind = i*m+j; bind = i*m+j;
            *(zptr+zind) = *(bptr+bind);
        }

    zptr = z; bptr = b;
    for (i=0; i<n; i++)
        for (j=0; j<m; j++) {
            zbind = i*m+j;
            *(zptr+zbind) = *(bptr+zbind);
        }

    zptr = z; bptr = b; zbind = 0;
    for (i=0; i<n; i++)
        for (j=0; j<m; j++) {
            *(zptr+zbind) = *(bptr+zbind);
            zbind++;
        }
Transformations shown: induction variable elimination, strength reduction.

23 Cache Analysis
- Loops use large quantities of data (arrays) → cache conflicts

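24 Direct-Mapped Cache
[Figure: direct-mapped cache. The address is split into tag, index, and offset fields. The index selects a cache block holding a valid bit (1), a tag (0xabcd), and data bytes. Comparing the stored tag with the address tag (=) produces the hit signal, and the offset selects the byte value.]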

25 Array Conflicts in Cache
    for (i=0; i<n; i++)
        for (j=0; j<m; j++)
            a[i][j] = a[i][j] + b[i][j];
[Figure: a[0,0] and b[0,0] in main memory mapping into the cache.]

26 Array Conflicts
- Array elements conflict because they map to the same cache line.
- Solution: move one array.
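
The conflict and its fix can be sketched with the address split used by a direct-mapped cache. The cache geometry below (32-byte lines, 256 lines, 8 KB total) and the function names are assumptions chosen for illustration, not values from the slides.

```c
#include <stdint.h>

enum { LINE_BYTES = 32, NUM_LINES = 256 };   /* hypothetical 8 KB cache */

/* Index field: which cache line an address maps to. */
uint32_t cache_index(uint32_t addr) {
    return (addr / LINE_BYTES) % NUM_LINES;
}

/* Tag field: distinguishes addresses that share an index. */
uint32_t cache_tag(uint32_t addr) {
    return addr / (LINE_BYTES * NUM_LINES);
}

/* Two addresses conflict when they need the same line for different blocks. */
int conflicts(uint32_t x, uint32_t y) {
    return cache_index(x) == cache_index(y) && cache_tag(x) != cache_tag(y);
}
```

Under these assumptions, arrays based at 0x10000 and 0x12000 lie exactly one cache size apart, so a[0][0] and b[0][0] evict each other; moving b by one line (to 0x12020) gives the two elements different indices.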

27 Static Cache Locking
- Lock instructions in cache before execution.
- Predictable execution time.
- Similarly, lock code and data in memory to avoid paging.

28 Register Allocation
- Fit current variables in registers.
- Load once, use many times.
  - Reduce number of cache/memory accesses.
  - Improve performance.
  - Reduce energy consumption.

29 Register Lifetime Graph
    1. w = a + b;
    2. x = c + w;
    3. y = c + d;
    4. z = a - b;
[Figure: lifetime graph over variables a, b, c, d, w, x, y, z, annotated with the number of registers needed.]

30 After Rescheduling
    1. w = a + b;
    2. z = a - b;
    3. x = c + w;
    4. y = c + d;
[Figure: lifetime graph over variables a, b, c, d, w, x, y, z, annotated with the number of registers needed.]
Cannot change dependencies between instructions!

31 Summary: Performance Optimization
- Use registers efficiently.
- Optimize loops.
- Optimize function calls.
- Optimize cache behavior:
  - Avoid instruction conflicts by rewriting code, rescheduling;
  - Conflicting scalar/array data can be moved.

32 Execution Time Analysis
- Real-time systems must meet deadlines.
- Need to analyze execution time.

33 Execution Time
- Affected by program path and instruction timing.
- Program path depends on input data.
  - Sensor readings
  - User input
- Instruction timing depends on
  - pipelining
  - cache behavior: memory can be 10x slower than cache!

34 Program Path
    for (i=0, f=0; i<n; i++)
        f = f + c[i]*x[i];
- Loop initiation executed once.
- Loop test executed N+1 times.
- Loop body and index update executed N times.
[Figure: flowchart: i=0; f=0; → test i<N → (Y) f = f + c[i]*x[i]; i = i+1; → back to test; (N) exit.]
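
The execution counts stated on this slide can be confirmed by instrumenting the loop. The struct path_counts and the function count_path are hypothetical names; the comma operator lets each counter be bumped exactly where the corresponding part of the loop runs.

```c
/* Hypothetical counters for the three parts of the loop. */
struct path_counts {
    int init;   /* loop initiation */
    int test;   /* loop test i < n */
    int body;   /* loop body and index update */
};

struct path_counts count_path(const int *c, const int *x, int n, int *f_out) {
    struct path_counts p = {0, 0, 0};
    int i, f;
    /* (p.test++, i < n) counts the test, then yields its usual result. */
    for (i = 0, f = 0, p.init++; (p.test++, i < n); i++) {
        f = f + c[i] * x[i];
        p.body++;
    }
    *f_out = f;
    return p;
}
```

For n = 4 the counters come out as 1 initiation, 5 tests, and 4 body executions, matching the once, N+1, and N pattern.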

35 Execution Time Metrics
- Difficult to predict execution time accurately.
- Average case
  - For typical data values
  - Soft real-time
- Worst case
  - For any possible input set
  - Hard real-time
  - Longest program path may NOT lead to worst-case execution time

36 Approaches
- Compile-time analysis: pessimistic
- Measurement: optimistic

37 Analysis
- Analyze optimized assembly/binary code, not high-level language (HLL) code
  - HLL statement → many assembly/binary instructions
  - Example: function calls
- Challenges
  - Program path depends on input data
  - Pipelining, cache effects are hard to predict
  - Analysis tends to be pessimistic

38 Measurement
- CPU simulator
  - I/O may be hard to measure.
  - May not be totally accurate.
- Time stamping
  - Requires instrumenting program.
  - Timer granularity
    - gettimeofday on Linux: µs
    - gethrtime on Intel processors: reads a 64-bit clock cycle counter and returns the number of clock cycles since the CPU was powered up or reset.
- Logic analyzer: limited logic analyzer memory depth.
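
Time stamping can be sketched by reading a clock before and after the code of interest. For portability this sketch swaps in the standard C clock() function, which reports processor time at CLOCKS_PER_SEC granularity, in place of the gettimeofday and gethrtime calls named on the slide. The names time_region_seconds, busy_work, and sink are hypothetical.

```c
#include <time.h>

/* Returns the CPU time, in seconds, consumed by one call to region(). */
double time_region_seconds(void (*region)(void)) {
    clock_t start = clock();    /* time stamp before */
    region();
    clock_t end = clock();      /* time stamp after */
    return (double)(end - start) / CLOCKS_PER_SEC;
}

/* volatile keeps the compiler from optimizing the work away. */
static volatile unsigned long sink;

/* Hypothetical workload to be measured. */
void busy_work(void) {
    for (unsigned long i = 0; i < 1000000UL; i++)
        sink += i;
}
```

A call such as time_region_seconds(busy_work) yields one sample. The instrumentation itself takes time, and a region shorter than the timer granularity reads as zero, which is one reason the slide lists timer granularity as a concern.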

39 Output from a Logic Analyzer
[Figure: timing diagram of event propagation on a mote. Granularity: 50 microseconds.]

40 Trace-driven Analysis
- A trace is a record of the execution path of a program.
- Helps study cache behavior and power management policies.
- A useful trace
  - requires proper input values;
  - is large.

41 Trace Generation
- Hardware capture
  - Logic analyzer
    - Limited buffer space
    - Cannot observe on-chip cache
  - Hardware assist in CPU
    - Pentium supports automatic tracing of branches
- Software
  - PC sampling
  - Instrumentation instructions
  - Simulation

42 Goals
- Optimizing for execution time.
- Optimizing for energy/power.
- Optimizing for program size.

43 Optimizing for Program Size
- Goals
  - Reduce memory cost;
  - Reduce power consumption.
- Two opportunities:
  - Data;
  - Instructions.

44 Reduce Data Size
- Reuse constants, variables, buffers in different parts of code.
  - Single-buffer in TinyOS.
  - Pack multiple flags in one byte.
  - Use shortest data type needed.
  - Requires careful verification of correctness.
    uint8_t i;
    for (i = 0; i < 1000; i++) {... }  // This loop will never terminate
- Generate data using instructions.
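
Two of the points above can be sketched in C: why the uint8_t loop never terminates, and how several flags share one byte. All names here (wrap_after_255, the FLAG_ constants, and the flag helpers) are hypothetical.

```c
#include <stdint.h>

/* A uint8_t holds 0..255, so incrementing 255 wraps to 0. A counter of
   this type can never reach 1000, hence i < 1000 is always true. */
uint8_t wrap_after_255(void) {
    uint8_t i = 255;
    i++;
    return i;
}

/* Packing flags: each flag takes one bit of a single byte. */
enum { FLAG_READY = 1u << 0, FLAG_ERROR = 1u << 1, FLAG_BUSY = 1u << 2 };

uint8_t set_flag(uint8_t flags, uint8_t f) {
    return (uint8_t)(flags | f);
}

uint8_t clear_flag(uint8_t flags, uint8_t f) {
    return (uint8_t)(flags & (uint8_t)~f);
}

int has_flag(uint8_t flags, uint8_t f) {
    return (flags & f) != 0;
}
```

Eight such flags fit in the space of one byte, instead of eight separate variables.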

45 Reduce Code Size
- Avoid loop unrolling.
- Inlining?
  - Size of function
  - Number of calls
- Choose CPU with compact instructions.
  - Digital Signal Processors (DSP) tend to have smaller code.
- Some CPUs support a dense instruction set
  - ARM Thumb, MIPS-16

46 Code Compression
- Use statistical compression to reduce code size.
- Decompress on-the-fly.
- Need to handle jump addresses.
[Figure: main memory → decompressor (with table) → cache → CPU; example instruction LDR r0,[r4].]

47 Reading
- Textbook 5.5, 5.6, 5.7, 5.8, 5.9.
