What Compilers Can and Cannot Do. Saman Amarasinghe Fall 2009
1 What Compilers Can and Cannot Do. Saman Amarasinghe, Fall 2009
2 Optimization Continuum
Many examples across the compilation pipeline
Static → Dynamic: Program → Compiler → Linker → Loader → Runtime System
3 Optimization Continuum: Compiler
Pros: Full source code is available
Pros: Easy to intercept in the high-level to low-level transformations
Pros: Compile time is not much of an issue
Cons: Doesn't see the whole program
Cons: Doesn't know the runtime conditions
Cons: Doesn't know (too much about) the architecture
4 Optimization Continuum: Linker
Pros: Full program available
Cons: May not have the full program
Cons: Doesn't have access to the source
Cons: Doesn't know (too much about) the architecture
5 Optimization Continuum: Loader
Pros: Full program available (Cons: may not have the full program)
Cons: Doesn't have access to the source
Cons: Doesn't know the runtime conditions
Cons: Time pressure to get the loading done fast
6 Optimization Continuum: Runtime
Pros: Full program available
Pros: Knows the runtime behavior
Cons: Doesn't have access to the source
Cons: Time in the optimizer is time away from running the program
7 Dataflow Analysis
Compile-time reasoning about run-time values of variables or expressions at different program points:
Which assignment statements produced the value of a variable at this point?
Which variables contain values that are no longer used after this program point?
What is the range of possible values of a variable at this program point?
8 Example
int sumcalc(int a, int b, int N) {
  int i;
  int x, y;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
    x = x + b*y;
  }
  return x;
}
9
test:  subu  $fp, 16
       sw    zero, 0($fp)    # x = 0
       sw    zero, 4($fp)    # y = 0
       sw    zero, 8($fp)    # i = 0
lab:                         # for(i=0; i<=N; i++)
       mul   $t0, $a0, 4     # a*4
       div   $t1, $t0, $a1   # a*4/b
       lw    $t2, 8($fp)     # i
       mul   $t3, $t1, $t2   # a*4/b*i
       lw    $t4, 8($fp)     # i
       addui $t4, $t4, 1     # i+1
       lw    $t5, 8($fp)     # i
       addui $t5, $t5, 1     # i+1
       mul   $t6, $t4, $t5   # (i+1)*(i+1)
       addu  $t7, $t3, $t6   # a*4/b*i + (i+1)*(i+1)
       lw    $t8, 0($fp)     # x
       add   $t8, $t7, $t8   # x = x + a*4/b*i + (i+1)*(i+1)
       sw    $t8, 0($fp)
       lw    $t0, 4($fp)     # y
       mul   $t1, $t0, $a1   # b*y
       lw    $t2, 0($fp)     # x
       add   $t2, $t1, $t2   # x = x + b*y
       sw    $t2, 0($fp)
       lw    $t0, 8($fp)     # i
       addui $t0, $t0, 1     # i+1
       sw    $t0, 8($fp)
       ble   $t0, $a3, lab
       lw    $v0, 0($fp)
       addu  $fp, 16
       b     $ra
10 Constant Propagation
If, on all possible execution paths, the value of a variable at a given use point is a known constant, replace the variable with the constant.
Pros: No need to keep that value in a variable (freeing storage/registers)
Pros: Can lead to further optimization
11 Constant Propagation
int sumcalc(int a, int b, int N) {
  int i;
  int x, y;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
    x = x + b*y;
  }
  return x;
}
15 Constant Propagation (y = 0 propagated into b*y)
int sumcalc(int a, int b, int N) {
  int i;
  int x, y;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
    x = x + b*0;
  }
  return x;
}
16 Algebraic Simplification
If an expression can be calculated/simplified at compile time, then do it. Or use faster instructions to do the operation.
Examples:
  X * 0     →  0
  X * 1     →  X
  X * 1024  →  X << 10
  X + 0     →  X
  X + X     →  X << 1
Pros: Less work at runtime
Pros: Leads to more optimizations
Cons: The machine that runs the code may behave differently than the machine used to compile (e.g., overflow, underflow)
Cons: Using commutativity and transitivity can slightly change the results
17 Algebraic Simplification (b*0 → 0)
int sumcalc(int a, int b, int N) {
  int i;
  int x, y;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
    x = x + 0;
  }
  return x;
}

22 Algebraic Simplification (x + 0 → x)
int sumcalc(int a, int b, int N) {
  int i;
  int x, y;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
    x = x;
  }
  return x;
}
23 Copy Propagation
If you are just making a copy of a variable, try to use the original and eliminate the copy.
Pros: Fewer instructions
Pros: Less memory/fewer registers
Con: May make an interference graph no longer colorable, leading to register spills
24 Copy Propagation
int sumcalc(int a, int b, int N) {
  int i;
  int x, y;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
    x = x;
  }
  return x;
}
25 Copy Propagation
int sumcalc(int a, int b, int N) {
  int i;
  int x, y;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
  }
  return x;
}
26 Common Subexpression Elimination
The same subexpression is calculated multiple times: calculate it once and reuse the result.
Pros: Less computation
Cons: Needs additional storage/registers to keep the results; may lead to register spills
Cons: May hinder parallelization by adding dependences
27 Common Subexpression Elimination
int sumcalc(int a, int b, int N) {
  int i;
  int x, y;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
  }
  return x;
}
29 Common Subexpression Elimination
int sumcalc(int a, int b, int N) {
  int i;
  int x, y, t;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + (4*a/b)*i + (i+1)*(i+1);
  }
  return x;
}
30 Common Subexpression Elimination
int sumcalc(int a, int b, int N) {
  int i;
  int x, y, t;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + (4*a/b)*i + t*t;
  }
  return x;
}
31 Dead Code Elimination
If the result of a calculation is not used, don't do the calculation.
Pros: Less computation
Pros: May be able to release the storage earlier
Pros: No need to store the results
32 Dead Code Elimination
int sumcalc(int a, int b, int N) {
  int i;
  int x, y, t;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + (4*a/b)*i + t*t;
  }
  return x;
}
34 Dead Code Elimination
int sumcalc(int a, int b, int N) {
  int i;
  int x, y, t;
  x = 0;
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + (4*a/b)*i + t*t;
  }
  return x;
}
35 Dead Code Elimination
int sumcalc(int a, int b, int N) {
  int i;
  int x, t;
  x = 0;
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + (4*a/b)*i + t*t;
  }
  return x;
}
36 Loop Invariant Removal
If an expression always calculates to the same value in all the loop iterations, move the calculation outside the loop.
Pros: A lot less work within the loop
Cons: Need to store the result from before the loop starts and throughout its execution; more live ranges may lead to register spills
37 Loop Invariant Removal
int sumcalc(int a, int b, int N) {
  int i;
  int x, t;
  x = 0;
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + (4*a/b)*i + t*t;
  }
  return x;
}
39 Loop Invariant Removal
int sumcalc(int a, int b, int N) {
  int i;
  int x, t, u;
  x = 0;
  u = (4*a/b);
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + u*i + t*t;
  }
  return x;
}
40 Strength Reduction
In a loop, instead of recalculating an expression, update the previous value of the expression if that requires less computation.
Example:
  for(i = 0; ...) { t = a*i; ... }
becomes
  t = 0; for(i = 0; ...) { ...; t = t + a; }
Pros: Less computation
Cons: More values to keep, so more live variables and possible register spills
Cons: Introduces a loop-carried dependence: a parallel loop becomes sequential
Strength increase transformation: eliminate the loop-carried dependence (at the cost of more instructions) by the inverse of strength reduction.
41 Strength Reduction
int sumcalc(int a, int b, int N) {
  int i;
  int x, t, u;
  x = 0;
  u = (4*a/b);
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + u*i + t*t;
  }
  return x;
}
u*0, u*1, u*2, u*3, u*4, ...
v=0, v=v+u, v=v+u, v=v+u, v=v+u, ...
43 Strength Reduction
int sumcalc(int a, int b, int N) {
  int i;
  int x, t, u, v;
  x = 0;
  u = (4*a/b);
  v = 0;
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + u*i + t*t;
    v = v + u;
  }
  return x;
}
44 Strength Reduction
int sumcalc(int a, int b, int N) {
  int i;
  int x, t, u, v;
  x = 0;
  u = (4*a/b);
  v = 0;
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + v + t*t;
    v = v + u;
  }
  return x;
}
46 Register Allocation
Use the limited registers effectively to keep some of the live values in registers instead of storing them in memory.
In the x86 architecture this is very important (that is why x86-64 has more registers).
However, this is critical in RISC architectures.
47 Example
(figure, built up over several slides: a control-flow diagram with the events def y, def x, def y, def x, use y, use x, use y, use x, def x, use x)
52 Example
(figure: the same def/use diagram with its live ranges labeled s1, s2, s3, s4)
54 Graph Coloring
55 Register Allocation
If the graph can be colored with # of colors ≤ # of registers, allocate a register for each variable.
If too many colors are needed: eliminate an edge (spill that register), then recolor the graph.
56 Register Allocation
(figure: the stack frame holds local variables x, y, i at offsets from fp)
Register assignments: $t9 = x, $t8 = t, $t7 = u, $t6 = v, $t5 = i
57 Optimized Example
int sumcalc(int a, int b, int N) {
  int i;
  int x, t, u, v;
  x = 0;
  u = ((a<<2)/b);
  v = 0;
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + v + t*t;
    v = v + u;
  }
  return x;
}
58
test:  subu  $fp, 16
       add   $t9, zero, zero  # x = 0
       sll   $t0, $a0, 2      # a<<2
       div   $t7, $t0, $a1    # u = (a<<2)/b
       add   $t6, zero, zero  # v = 0
       add   $t5, zero, zero  # i = 0
lab:                          # for(i=0; i<=N; i++)
       addui $t8, $t5, 1      # t = i+1
       mul   $t0, $t8, $t8    # t*t
       addu  $t1, $t0, $t6    # v + t*t
       addu  $t9, $t9, $t1    # x = x + v + t*t
       addu  $t6, $t6, $t7    # v = v + u
       addui $t5, $t5, 1      # i = i+1
       ble   $t5, $a3, lab
       addu  $v0, $t9, zero
       addu  $fp, 16
       b     $ra
59 Unoptimized Code vs. Optimized Code
(the two listings from slides 9 and 58, shown side by side)
Unoptimized: 4*ld/st + 2*add/sub + br + N*(9*ld/st + 6*add/sub + 4*mul + div + br); execution time = 43 sec
Optimized: 6*add/sub + shift + div + br + N*(5*add/sub + mul + br); execution time = 7 sec
60 What Stops Optimizations?
The optimizer has to guarantee program equivalence:
For all the valid inputs
For all the valid executions
For all the valid architectures
In order to optimize the program, the compiler needs to understand:
The control flow
The data accessors
Most of the time the full information is not available; then the compiler has to:
Reduce the scope of the region that the transformations can apply to
Reduce the aggressiveness of the transformations
Leave computations with unanalyzable data alone
61 Control-Flow
All the possible paths through the program.
Representations within the compiler: call graphs, control-flow graphs
What hinders the compiler analysis?
Function pointers
Indirect branches
Computed gotos
Large switch statements
Loops with exits and breaks
Loops where the bounds are not known
Conditionals where the branch condition is not analyzable
62 Data-Accessors
Who else can read or write the data?
Representations within the compiler: def-use chains, dependence vectors
What hinders the compiler analysis?
Address-taken variables
Global variables
Parameters
Arrays
Pointer accesses
Volatile types
63 Instruction Scheduling
Find the correct sequence to issue the instructions that will make the processor run fast.
x86 is an out-of-order superscalar: hardware tries to schedule the instructions.
The compiler can still be useful:
Hardware cannot look too far ahead
The compiler can transform the program to eliminate hazards
Hardware doesn't have the time or the information to do most of these transformations
For in-order processors, instruction scheduling is CRITICAL.
64 Constraints on Scheduling
Data dependencies
Control dependencies
Resource constraints
65 Data Dependency between Instructions
If two instructions access the same variable, they can be dependent.
Kinds of dependencies:
True: write then read
Anti: read then write
Output: write then write
What to do if two instructions are dependent:
The order of execution cannot be reversed
This reduces the possibilities for scheduling
66 Computing Dependencies
For basic blocks, compute dependencies by walking through the instructions.
Identifying register dependencies is simple: is it the same register?
For memory accesses:
simple: base + offset ?= base + offset
data dependence analysis: a[i] ?= a[i+1]
interprocedural analysis: global ?= parameter
pointer alias analysis: p1->foo ?= p2->foo
67 Representing Dependencies
Using a dependence DAG, one per basic block.
Nodes are instructions; edges represent dependencies.
  1: r2 = *(r1 + 4)
  2: r3 = *(r1 + 8)
  3: r4 = r2 + r3
  4: r5 = r2 - 1
(edges: 1 → 3, 2 → 3, 1 → 4)
An edge is labeled with its latency: v(i → j) = the delay required between the initiation times of i and j minus the execution time required by i.
71 Resource Constraints
Modern machines have many resource constraints.
Superscalar architectures can run a few parallel operations, but have constraints.
72 Resource Constraints of a Superscalar Processor
Example:
One fully pipelined reg-to-reg unit: all integer operations take one cycle
In parallel with
One fully pipelined memory-to/from-reg unit: data loads take two cycles, data stores take one cycle
73 List Scheduling Algorithm
Idea:
Do a topological sort of the dependence DAG
Consider when an instruction can be scheduled without causing a stall
Schedule the instruction if it causes no stall and all its predecessors are already scheduled
Optimal list scheduling is NP-complete: use heuristics when necessary.
74 List Scheduling Algorithm with Resource Constraints
Represent the superscalar architecture as multiple pipelines; each pipeline represents some resource.
Example:
One single-cycle reg-to-reg ALU unit
One two-cycle pipelined reg-to/from-memory unit
Pipelines: ALU | MEM1 | MEM2
76 List Scheduling Algorithm with Resource Constraints
Create a dependence DAG of a basic block
Topological sort
READY = nodes with no predecessors
Loop until READY is empty:
  Let n ∈ READY be the node with the highest priority
  Schedule n in the earliest slot that satisfies precedence + resource constraints
  Update READY
77 Example
  1: lea  var_a, %rax
  2: add  4(%rsp), %rax
  3: inc  %r11
  4: mov  4(%rsp), %r10
  5: mov  %r10, 8(%rsp)
  6: and  $0x00ff, %rbx
  7: imul %rax, %rbx
  8: lea  var_b, %rax
  9: mov  %rbx, 16(%rsp)
(slides 78-112, built up step by step: the dependence DAG of this block is constructed, each node is annotated with its critical-path distance d and slack f, and the list scheduler then fills the ALUop / MEM1 / MEM2 pipelines one instruction at a time; READY starts as { 1, 6, 4, 3 } and shrinks as the instructions are placed, ending with READY = { } once all nine are scheduled)
113 Control Dependencies
Constraints in moving instructions across basic blocks:
  if (...) a = b op c   (a plain ALU operation may be hoisted above the branch)
  if (c != 0) a = b / c   (NO!!! the divide may fault if hoisted)
  if (...) d = *(a)   (the load may be hoisted only if the address is known to be valid)
119 Trace Scheduling
Find the most common trace of basic blocks (use profile information)
Combine the basic blocks in the trace and schedule them as one block
Create clean-up code if the execution goes off-trace
120 Trace Scheduling
(figure, built up over several slides: a CFG with blocks A through H; the profiled hot trace A, B, D, E, G, H is selected)
124 Large Basic Blocks via Code Duplication
Creating large extended basic blocks by duplication; schedule the larger blocks.
(figure: in a CFG where A branches to B and C, which merge at D then E, blocks D and E are duplicated along each path so that each path becomes one extended basic block)
126 Trace Scheduling
(figure: the trace A, B, D, E, G, H is scheduled as one block; compensation copies of blocks are created for the off-trace paths through C and F)
128 Loop Example
Assembly code:
loop: mov  (%rdi,%rax), %r10
      imul %r11, %r10
      mov  %r10, (%rdi,%rax)
      sub  $4, %rax
      bge  loop
(dependence DAG: mov d=7, imul d=5, mov d=2, sub d=1, bge d=0)
129 Loop Example
(same loop) Schedule (9 cycles per iteration):
(figure: the pipeline diagram; the mov/imul/mov/sub/bge chain of one iteration barely overlaps the next, leaving most issue slots empty)
130 Loop Unrolling
Unroll the loop body a few times.
Pros: Creates a much larger basic block for the body
Pros: Eliminates a few loop-bounds checks
Cons: Much larger program
Cons: Setup code needed (# of iterations < unroll factor)
Cons: The beginning and end of the schedule can still have unused slots
131 Loop Example
loop: mov  (%rdi,%rax), %r10
      imul %r11, %r10
      mov  %r10, (%rdi,%rax)
      sub  $4, %rax
      mov  (%rdi,%rax), %r10
      imul %r11, %r10
      mov  %r10, (%rdi,%rax)
      sub  $4, %rax
      bge  loop
132 Loop Example
(the unrolled loop and its dependence DAG) Schedule (8 cycles per iteration):
(figure: the two unrolled iterations still form one long dependence chain through %r10 and %rax, so the schedule improves only slightly)
133 Loop Unrolling
Rename registers: use different registers in different iterations.
135 Loop Example (second unrolled iteration renamed to use %rcx)
loop: mov  (%rdi,%rax), %r10
      imul %r11, %r10
      mov  %r10, (%rdi,%rax)
      sub  $4, %rax
      mov  (%rdi,%rax), %rcx
      imul %r11, %rcx
      mov  %rcx, (%rdi,%rax)
      sub  $4, %rax
      bge  loop
136 Loop Unrolling
Rename registers: use different registers in different iterations.
Eliminate unnecessary dependencies:
again, use more registers to eliminate true, anti and output dependencies
eliminate dependent chains of calculations when possible
137 Loop Example

loop: mov  (%rdi,%rax), %r10
      imul %r11, %r10
      mov  %r10, (%rdi,%rax)
      sub  $4, %rax
      mov  (%rdi,%rax), %rcx
      imul %r11, %rcx
      mov  %rcx, (%rdi,%rax)
      sub  $4, %rax
      bge  loop

[dependence diagram omitted]
138 Loop Example

loop: mov  (%rdi,%rax), %r10
      imul %r11, %r10
      mov  %r10, (%rdi,%rax)
      sub  $8, %rax
      mov  (%rdi,%rbx), %rcx
      imul %r11, %rcx
      mov  %rcx, (%rdi,%rbx)
      sub  $8, %rbx
      bge  loop

[dependence diagram omitted]
139 Loop Example

loop: mov  (%rdi,%rax), %r10
      imul %r11, %r10
      mov  %r10, (%rdi,%rax)
      sub  $8, %rax
      mov  (%rdi,%rbx), %rcx
      imul %r11, %rcx
      mov  %rcx, (%rdi,%rbx)
      sub  $8, %rbx
      bge  loop

Schedule (4.5 cycles per iteration)
[schedule diagram omitted]
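Slides 138–139 split the single induction variable into two (%rax and %rbx), each stepping by 8, so the second half of the unrolled body no longer waits on the first half's sub. A C sketch of the same idea, with two index variables each covering every other element (names are assumptions):

```c
/* Two index variables so the two halves of the unrolled body form
 * independent chains: one uses only i, the other only j. */
void scale2(int *a, long n, int k) {   /* n assumed even here */
    for (long i = 0, j = 1; i < n; i += 2, j += 2) {
        a[i] *= k;                     /* chain 1 */
        a[j] *= k;                     /* chain 2 */
    }
}
```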
140 Software Pipelining

Try to overlap multiple iterations so that the slots will be filled
Find the steady-state window so that:
  all the instructions of the loop body are executed
  but from different iterations
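Structurally, a software-pipelined loop has a prologue that starts the first iterations, a steady-state kernel in which each trip executes stages of several different iterations, and an epilogue that drains the last ones. A minimal two-stage sketch in C, splitting the body into load-and-multiply vs. store (names are illustrative, not from the slides):

```c
/* Two-stage software pipeline for a[i] = a[i] * k:
 * in the kernel, each trip does stage 1 (load+multiply) of
 * iteration i and stage 2 (store) of iteration i-1. */
void scale_pipelined(int *a, long n, int k) {
    if (n <= 0) return;
    int t = a[0] * k;            /* prologue: stage 1 of iter 0  */
    for (long i = 1; i < n; i++) {
        int next = a[i] * k;     /* kernel: stage 1 of iter i    */
        a[i - 1] = t;            /*         stage 2 of iter i-1  */
        t = next;
    }
    a[n - 1] = t;                /* epilogue: drain the last store */
}
```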
141–148 Loop Example

Assembly Code
loop: mov  (%rdi,%rax), %r10
      imul %r11, %r10
      mov  %r10, (%rdi,%rax)
      sub  $4, %rax
      bge  loop

Schedule
[schedule diagrams omitted: successive slides overlap instructions from iterations 1 through 5 of the loop body until every slot in the steady-state window is filled]
149 Loop Example

4 iterations are overlapped
  the value of %r11 doesn't change
  4 registers for (%rdi,%rax); each address is incremented by 4*4
  4 registers to keep the value of %r10
Same registers can be reused after 4 of these blocks
  generate code for 4 blocks, otherwise need to move

[steady-state schedule window omitted]

loop: mov  (%rdi,%rax), %r10
      imul %r11, %r10
      mov  %r10, (%rdi,%rax)
      sub  $4, %rax
      bge  loop
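The register-reuse point above is modulo variable expansion: when iterations k and k+N are in flight with the same register names, the kernel is unrolled N times so each in-flight iteration owns its own copies, and the names recycle without extra moves. A small C sketch with 2 iterations in flight (the slide uses 4; names are illustrative):

```c
/* Two-stage pipeline with 2 iterations in flight and the kernel
 * unrolled 2x: each in-flight iteration owns its temporary
 * (t0 or t1), so the same "registers" recycle every kernel trip. */
void scale_mve(int *a, long n, int k) {  /* n assumed even, n >= 2 */
    int t0 = a[0] * k;                   /* prologue */
    int t1 = a[1] * k;
    long i = 2;
    for (; i < n; i += 2) {
        a[i - 2] = t0;  t0 = a[i] * k;      /* iteration stream 0 */
        a[i - 1] = t1;  t1 = a[i + 1] * k;  /* iteration stream 1 */
    }
    a[n - 2] = t0;                       /* epilogue */
    a[n - 1] = t1;
}
```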
150 MIT OpenCourseWare Performance Engineering of Software Systems Fall 2009 For information about citing these materials or our Terms of Use, visit:
More information