Loops. Announcements. Loop fusion. Loop unrolling. Code motion. Array. Good targets for optimization. Basic loop optimizations:

Bernadette Shepherd
5 years ago

Chenyang Lu, CSE 467S

Announcements
  HW1 is available online.
  Next class: Liang will give a tutorial on TinyOS/motes. Very useful!
  Classroom: EADS Hall 116, this Wed ONLY.
  Proposal is due at 5pm, Wed. Email me your proposal.

Loops
  Good targets for optimization.
  Basic loop optimizations:
    unrolling;
    fusion;
    code motion;
    induction-variable elimination;
    strength reduction.

Loop unrolling
  Reduces loop overhead.
  Original:
    for (i=0; i<4; i++)
        a[i] = b[i] * c[i];
  Unrolled by 2:
    for (i=0; i<4; i+=2) {
        a[i] = b[i] * c[i];
        a[i+1] = b[i+1] * c[i+1];
    }
  Unnecessary on SHARC, whose hardware loops have no per-iteration overhead.

Loop fusion
  Combines multiple loops into one.
  Before:
    for (i=0; i<N; i++)
        a[i] = b[i] * 5;
    for (j=0; j<N; j++)
        w[j] = c[j] * d[j];
  After:
    for (i=0; i<N; i++) {
        a[i] = b[i] * 5;
        w[i] = c[i] * d[i];
    }
  Necessary conditions:
    the loops share the same index range;
    there are no dependencies between the two loops.

Code motion
  The product N*M does not change inside the loop, yet it is recomputed in every loop test:
    for (i=0; i<N*M; i++)
        z[i] = a[i] + b[i];
  [Flowchart: before the loop, i=0 and X = N*M; the loop test i<N*M is replaced by i<X; on Y the body z[i] = a[i] + b[i]; i = i+1; runs, on N the loop exits.]

Array
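The three transformations above can be tried out in plain C. The following is a minimal sketch, not the course's reference code: the function names are made up for illustration, the arrays are assumed not to overlap, and the trailing `if` in the unrolled version is an addition that handles odd lengths, which the slide's fixed length of 4 does not need.

```c
#include <stddef.h>

/* Baseline: one loop test and one index update per element. */
void mul_rolled(int *a, const int *b, const int *c, size_t n) {
    for (size_t i = 0; i < n; i++)
        a[i] = b[i] * c[i];
}

/* Unrolled by 2: half as many loop tests and index updates.
   The final `if` covers an odd n (an addition to the slide's example). */
void mul_unrolled2(int *a, const int *b, const int *c, size_t n) {
    size_t i = 0;
    for (; i + 1 < n; i += 2) {
        a[i]     = b[i]     * c[i];
        a[i + 1] = b[i + 1] * c[i + 1];
    }
    if (i < n)
        a[i] = b[i] * c[i];
}

/* Fusion: two loops over the same range merged into one.
   Assumes the output arrays a and w do not overlap any input array. */
void scale_and_mul_fused(int *a, const int *b, int *w,
                         const int *c, const int *d, size_t n) {
    for (size_t i = 0; i < n; i++) {
        a[i] = b[i] * 5;
        w[i] = c[i] * d[i];
    }
}

/* Code motion: the loop-invariant product n*m is computed once. */
void add_hoisted(int *z, const int *a, const int *b, size_t n, size_t m) {
    const size_t limit = n * m;
    for (size_t i = 0; i < limit; i++)
        z[i] = a[i] + b[i];
}
```

An optimizing compiler will often perform code motion and unrolling on its own, so hand-written versions like these are mainly useful on simple embedded toolchains or for studying the generated assembly.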

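One-dimensional arrays
  The C array name points to the 0th element:
    a -> a[0], a[1], a[2], ...
    a[i] = *(a + i)

Two-dimensional arrays
  Row-major layout (N rows of M elements each):
    a[0,0] a[0,1] ... a[1,0] a[1,1] ...
    a[i][j] = *(a + i*M + j)

  Simplifying the array addressing in a nested loop:
  Original loop:
    for (i=0; i<N; i++)
        for (j=0; j<M; j++)
            z[i][j] = b[i][j];
  With explicit pointer arithmetic:
    zptr = z; bptr = b;
    for (i=0; i<N; i++)
        for (j=0; j<M; j++) {
            zind = i*M+j; bind = i*M+j;
            *(zptr+zind) = *(bptr+bind);
        }
  After induction-variable elimination (zind and bind are always equal, so a single zbind replaces both):
    zptr = z; bptr = b;
    for (i=0; i<N; i++)
        for (j=0; j<M; j++) {
            zbind = i*M+j;
            *(zptr+zbind) = *(bptr+zbind);
        }
  After strength reduction (the multiplication is replaced by an increment; zbind is incremented after the copy so that element 0 is not skipped):
    zptr = z; bptr = b; zbind = 0;
    for (i=0; i<N; i++)
        for (j=0; j<M; j++) {
            *(zptr+zbind) = *(bptr+zbind);
            zbind++;
        }

Cache analysis
  Because loops use large quantities of data (arrays), cache conflicts are common.

Direct-mapped cache
  [Figure: the address is split into tag, index, and offset fields. The index selects a cache block holding a valid bit, a tag, and data bytes (example entry: valid = 1, tag = 0xabcd). The stored tag is compared with the address tag to produce the hit signal, and the offset selects the byte returned as the value.]

Array conflicts in cache
    for (i=0; i<N; i++)
        for (j=0; j<M; j++)
            a[i][j] = a[i][j] + b[i][j];
  [Figure: a[0,0] and b[0,0] shown in main memory and in the cache.]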
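The addressing steps above can be checked by writing the first and last forms as C functions and comparing their results. This is a sketch under stated assumptions: the function names are hypothetical, and the 2-D arrays are passed as flat row-major buffers of n*m elements so that the index arithmetic is explicit.

```c
#include <stddef.h>

/* Explicit row-major addressing: one multiply and one add per element. */
void copy_indexed(int *z, const int *b, size_t n, size_t m) {
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++) {
            size_t zbind = i * m + j;   /* same offset for both arrays */
            *(z + zbind) = *(b + zbind);
        }
}

/* After strength reduction: the offset is carried from one iteration
   to the next and advanced by an increment instead of a multiply. */
void copy_reduced(int *z, const int *b, size_t n, size_t m) {
    size_t zbind = 0;
    for (size_t i = 0; i < n; i++)
        for (size_t j = 0; j < m; j++) {
            *(z + zbind) = *(b + zbind);
            zbind++;                    /* increment after use: element 0 is copied */
        }
}
```

Both functions visit the elements in the same order, so they should produce identical output for any n and m; the second simply does less arithmetic per element.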

Array conflicts, cont'd.
  Array elements conflict because they are in the same cache line, even if they are not mapped to the same location.
  Solutions:
    move one array;
    pad array.

Static cache locking
  Lock instructions in the cache before execution.
  Predictable execution time.
  Similarly, you may lock code and data in main memory to avoid paging.

Register allocation
  Reduce the number of registers used.
  Fit all frequently used variables in registers.
  Load once, use many times.
  Reduce the number of cache/memory accesses.

Register lifetime graph
    1. w = a + b;
    2. x = c + w;
    3. y = c + d;
    4. z = a - b;
  [Figure: lifetime graph of a, b, c, d, w, x, y, z. No. of needed registers = ]

After rescheduling
    1. w = a + b;
    2. z = a - b;
    3. x = c + w;
    4. y = c + d;
  [Figure: lifetime graph of a, b, c, d, w, x, y, z. No. of needed registers = ]
  Note: rescheduling must not change any dependencies among instructions.

Performance optimization hints
  Use registers efficiently.
  Optimize loops.
  Optimize function calls.
  Optimize cache behavior:
    instruction conflicts can be handled by rewriting code or rescheduling;
    conflicting scalar data can easily be moved;
    conflicting array data can be moved or padded.
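The effect of rescheduling on register pressure can be estimated with a small lifetime calculation. The sketch below uses a deliberately simple model that is an assumption, not necessarily the convention behind the slide's figure: every variable occupies a register from the first statement that mentions it through the last statement that mentions it, inclusive. The `Stmt` type and `max_live` function are hypothetical names.

```c
#include <stddef.h>

/* One three-address statement "dst = src1 op src2"; variables are
   single characters and the operator does not matter for lifetimes. */
typedef struct { char dst, src1, src2; } Stmt;

/* Largest number of variables alive at the same statement, where a
   variable is alive from its first mention to its last mention. */
int max_live(const Stmt *s, size_t n) {
    int first[256], last[256];
    for (int v = 0; v < 256; v++)
        first[v] = last[v] = -1;

    /* Record the first and last statement index for every variable. */
    for (size_t t = 0; t < n; t++) {
        const char names[3] = { s[t].dst, s[t].src1, s[t].src2 };
        for (int k = 0; k < 3; k++) {
            unsigned char v = (unsigned char)names[k];
            if (first[v] < 0)
                first[v] = (int)t;
            last[v] = (int)t;
        }
    }

    /* Count the live variables at each statement and keep the maximum. */
    int best = 0;
    for (size_t t = 0; t < n; t++) {
        int live = 0;
        for (int v = 0; v < 256; v++)
            if (first[v] >= 0 && first[v] <= (int)t && (int)t <= last[v])
                live++;
        if (live > best)
            best = live;
    }
    return best;
}
```

Under this model the original order (w, x, y, z) peaks at 5 live variables and the rescheduled order (w, z, x, y) peaks at 4, because a and b die after the second statement instead of staying alive until the fourth. A model that lets a result reuse the register of an operand that dies in the same statement would give smaller absolute numbers.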

Execution Time Analysis

Motivation
  Embedded systems must meet deadlines.
  Need to analyze execution time.

Performance analysis
  Execution time is affected by both program path and instruction timing.
  Path depends on input data values.
  Instruction timing depends on:
    pipelining;
    cache behavior: a memory access can be 10 times slower than a cache access!
  Accurate execution time is unknown a priori.

Program paths
    for (i=0, f=0; i<N; i++)
        f = f + c[i]*x[i];
  Loop initiation block executed once.
  Loop test executed N+1 times.
  Loop body and variable update executed N times.
  [Flowchart: i=0; f=0; then the test i<N; on Y, f = f + c[i]*x[i]; i = i+1; and back to the test; on N, exit.]

Execution time metrics
  Average-case:
    for typical data values, whatever they are;
    soft real-time.
  Worst-case:
    for any possible input set;
    hard real-time;
    the longest program path may NOT lead to the longest execution time.
  Best-case:
    for any possible input set.

Approaches
  Analysis: compile-time tools.
  Measurement.
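The block counts on the program paths slide can be confirmed by instrumenting the loop with counters, and the counts can then be turned into a rough cycle estimate. This is an illustrative sketch: `PathCounts`, `dot_counted`, `path_cycles`, and the per-block cycle costs are all hypothetical, and a fixed cost per block ignores the pipelining and cache effects the slides warn about.

```c
/* How many times each block of the loop's flowchart was executed. */
typedef struct { unsigned init, test, body, update; } PathCounts;

/* The slide's loop, written out block by block with a counter per block. */
int dot_counted(const int *c, const int *x, int n, PathCounts *pc) {
    int i, f;
    *pc = (PathCounts){0, 0, 0, 0};

    i = 0; f = 0;                    /* loop initiation block */
    pc->init++;
    for (;;) {
        pc->test++;                  /* loop test: i < n */
        if (!(i < n))
            break;
        f = f + c[i] * x[i];         /* loop body */
        pc->body++;
        i = i + 1;                   /* variable update */
        pc->update++;
    }
    return f;
}

/* Path-based estimate: count times cost, summed over the blocks.
   The costs are assumed constants supplied by the caller. */
unsigned long path_cycles(const PathCounts *pc, unsigned t_init,
                          unsigned t_test, unsigned t_body, unsigned t_update) {
    return (unsigned long)pc->init   * t_init
         + (unsigned long)pc->test   * t_test
         + (unsigned long)pc->body   * t_body
         + (unsigned long)pc->update * t_update;
}
```

For n = 3 the counters come out as 1 initiation, 4 tests, 3 bodies, and 3 updates, which matches the "once, N+1 times, N times" rule on the slide.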

Analyze execution time
  Analyze optimized assembly/binary code, not high-level language code:
    non-obvious translations of HLL statements into instructions;
    e.g., heap operations: new(obj);
  Challenges:
    program path depends on input data;
    modern processors: pipelining and cache effects are hard to predict;
    analysis tends to be pessimistic.

Measure execution time
  CPU simulator:
    I/O may be hard;
    may not be totally accurate.
  Time stamping:
    requires an instrumented program;
    timer granularity:
      gettimeofday on UNIX/Linux: 10 ms;
      gethrtime on Pentium: reads a 64-bit clock cycle counter and returns the number of clock cycles since the CPU was powered up or reset.
  Logic analyzer:
    limited logic analyzer memory depth.

Example: output from a logic analyzer
  [Figure: timing diagram of event propagation on a mote. Granularity: 50 microseconds.]

Trace-driven analysis
  Trace: a record of the program path of a program.
  Helps study cache behavior.
  A useful trace:
    requires proper input values;
    is large (gigabytes).

Trace generation
  Hardware capture:
    logic analyzer:
      limited buffer space;
      cannot observe on-chip cache;
    hardware assist in CPU:
      Pentium supports automatic tracing of branches.
  Software:
    PC sampling;
    instrumentation instructions;
    simulation.

Optimizing for program size
  Goal:
    reduce hardware cost of memory;
    reduce power consumption of memory units.
  Two opportunities:
    data;
    instructions.
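Time stamping, as listed under measuring execution time, amounts to reading a clock before and after the code of interest. The sketch below assumes a POSIX system and uses `clock_gettime` with `CLOCK_MONOTONIC` rather than the `gettimeofday` or `gethrtime` calls named on the slide; `elapsed_ns`, `time_region_ns`, and `busy_work` are hypothetical names, and `busy_work` is only a stand-in workload.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 199309L   /* request clock_gettime from <time.h> */
#endif
#include <stdint.h>
#include <time.h>

/* Difference between two time stamps in nanoseconds. */
int64_t elapsed_ns(struct timespec start, struct timespec end) {
    return (int64_t)(end.tv_sec - start.tv_sec) * 1000000000LL
         + (int64_t)(end.tv_nsec - start.tv_nsec);
}

/* Time stamp before and after the instrumented region. */
int64_t time_region_ns(void (*region)(void)) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    region();
    clock_gettime(CLOCK_MONOTONIC, &t1);
    return elapsed_ns(t0, t1);
}

/* Stand-in workload; volatile keeps the loop from being optimized away. */
volatile unsigned long sink;
void busy_work(void) {
    for (unsigned long k = 0; k < 100000UL; k++)
        sink += k;
}
```

A call such as `time_region_ns(busy_work)` returns one sample. Because the result depends on the timer's granularity, on cache state, and on other activity in the system, a single sample says little; repeated measurements give a distribution, and even then the observed maximum is not a guaranteed worst case.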

Data size minimization
  Reuse constants, variables, and data buffers in different parts of the code.
    E.g., buffering in TinyOS.
    E.g., pack multiple flags in one byte.
    Requires careful verification of correctness.
  Generate data using instructions.

Reducing code size
  Avoid loop unrolling.
  Inlining?
  Choose a CPU with compact instructions.
    Some CPUs support a dense instruction set: ARM Thumb, MIPS-16.

Effects of inlining on TinyOS
  Inlining reduces code size AND improves performance! Can you guess why?

Code compression
  Use statistical compression to reduce code size, decompress on-the-fly.
  [Figure: main memory -> decompressor (with table) -> cache -> CPU; example instruction LDR r0,[r4].]

Reading
  Textbook 5.6, 5.7, 5.8.
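The "pack multiple flags in one byte" example from the data size minimization slide is commonly done with bit masks. The following is a minimal sketch; the flag names are invented for illustration and do not come from TinyOS or from the slides.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical flags, one bit each, so up to eight fit in one byte. */
enum {
    FLAG_RADIO_ON     = 1u << 0,
    FLAG_SENSOR_READY = 1u << 1,
    FLAG_TX_PENDING   = 1u << 2
};

/* Set the bits in mask, leaving the other flags unchanged. */
static inline void flag_set(uint8_t *flags, uint8_t mask) {
    *flags = (uint8_t)(*flags | mask);
}

/* Clear the bits in mask, leaving the other flags unchanged. */
static inline void flag_clear(uint8_t *flags, uint8_t mask) {
    *flags = (uint8_t)(*flags & (uint8_t)~mask);
}

/* True if any bit in mask is set. */
static inline bool flag_test(uint8_t flags, uint8_t mask) {
    return (flags & mask) != 0;
}
```

Three separate `bool` variables would typically take three bytes; here they share one. The saving comes at the cost of extra mask instructions on each access, and a read-modify-write on a shared byte is not atomic, which is one reason the slide asks for careful verification of correctness.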
