What Compilers Can and Cannot Do. Saman Amarasinghe Fall 2009


1 What Compilers Can and Cannot Do Saman Amarasinghe Fall 2009

2 Optimization Continuum
Many examples across the compilation pipeline, from static to dynamic:
Program → Compiler → Linker → Loader → Runtime System

3 Optimization Continuum: Compiler
Pros: Full source code is available. Easy to intercept in the high-level to low-level transformations. Compile time is not much of an issue.
Cons: Doesn't see the whole program. Doesn't know the runtime conditions. Doesn't know (too much about) the architecture.

4 Optimization Continuum: Linker
Pros: Full program available.
Cons: May not have the full program. Doesn't have access to the source. Doesn't know (too much about) the architecture.

5 Optimization Continuum: Loader
Pros: Full program available (Cons: may not have the full program).
Cons: Doesn't have access to the source. Doesn't know the runtime conditions. Time pressure to get the loading done fast.

6 Optimization Continuum: Runtime
Pros: Full program available. Knows the runtime behavior.
Cons: Doesn't have access to the source. Time in the optimizer is time away from running the program.

7 Dataflow Analysis
Compile-time reasoning about the run-time values of variables or expressions at different program points:
Which assignment statements produced the value of a variable at this point?
Which variables contain values that are no longer used after this program point?
What is the range of possible values of a variable at this program point?
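
A minimal sketch of what these three questions mean on a concrete fragment (the code and variable names are illustrative, not from the slides):

    #include <stdio.h>

    int main(void) {
        int a = 1;            /* def of a   */
        int b = 2;            /* def 1 of b */
        if (a > 0)
            b = a + 3;        /* def 2 of b */
        /* At this point:
           - reaching definitions: both defs of b may reach here
           - liveness: a is dead (no later use); b is live
           - value range: b is either 2 or 4                     */
        printf("%d\n", b);
        return 0;
    }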

8 Example
int sumcalc(int a, int b, int N) {
  int i;
  int x, y;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
    x = x + b*y;
  }
  return x;
}

9 test:
  subu  $fp, 16
  sw    zero, 0($fp)    # x = 0
  sw    zero, 4($fp)    # y = 0
  sw    zero, 8($fp)    # i = 0
lab1:                   # for(i=0; i<=N; i++)
  mul   $t0, $a0, 4     # a*4
  div   $t1, $t0, $a1   # a*4/b
  lw    $t2, 8($fp)     # i
  mul   $t3, $t1, $t2   # a*4/b*i
  lw    $t4, 8($fp)     # i
  addui $t4, $t4, 1     # i+1
  lw    $t5, 8($fp)     # i
  addui $t5, $t5, 1     # i+1
  mul   $t6, $t4, $t5   # (i+1)*(i+1)
  addu  $t7, $t3, $t6   # a*4/b*i + (i+1)*(i+1)
  lw    $t8, 0($fp)     # x
  add   $t8, $t7, $t8   # x = x + a*4/b*i + (i+1)*(i+1)
  sw    $t8, 0($fp)
  lw    $t0, 4($fp)     # y
  mul   $t1, $t0, $a1   # b*y
  lw    $t2, 0($fp)     # x
  add   $t2, $t1, $t2   # x = x + b*y
  sw    $t2, 0($fp)
  lw    $t0, 8($fp)     # i
  addui $t0, $t0, 1     # i+1
  sw    $t0, 8($fp)
  ble   $t0, $a3, lab1
  lw    $v0, 0($fp)
  addu  $fp, 16
  b     $ra

10 Constant Propagation
If, along all possible execution paths, the value of a variable at a given use point is the same known constant, replace the variable with the constant.
Pros: No need to keep that value in a variable (freeing storage/a register). Can lead to further optimization.

11 Constant Propagation
int sumcalc(int a, int b, int N) {
  int i;
  int x, y;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
    x = x + b*y;
  }
  return x;
}

15 Constant Propagation
int sumcalc(int a, int b, int N) {
  int i;
  int x, y;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
    x = x + b*0;
  }
  return x;
}

16 Algebraic Simplification
If an expression can be calculated/simplified at compile time, then do it, or use faster instructions to do the operation.
Examples:
  X * 0    →  0
  X * 1    →  X
  X * 1024 →  X << 10
  X + 0    →  X
  X + X    →  X << 1
Pros: Less work at runtime. Leads to more optimizations.
Cons: The machine that runs the code may behave differently than the machine that is used to compile (e.g., overflow, underflow). Using commutativity and associativity can slightly change the results.
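
A small illustration of the cons: reassociating floating-point arithmetic, which looks like a harmless simplification, can change the result (an illustrative example, not from the slides):

    #include <stdio.h>

    int main(void) {
        /* Mathematically (x + y) - y == x, but not in floating point:
           adding 1.0 absorbs the tiny x, so "simplifying" the
           expression to x would change the program's output.        */
        float x = 1.0e-8f, y = 1.0f;
        float simplified = x;
        float computed   = (x + y) - y;
        printf("computed = %g, simplified = %g\n",
               (double)computed, (double)simplified);   /* 0 vs 1e-08 */
        return 0;
    }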

17 Algebraic Simplification
int sumcalc(int a, int b, int N) {
  int i;
  int x, y;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
    x = x + b*0;
  }
  return x;
}

19 Algebraic Simplification
int sumcalc(int a, int b, int N) {
  int i;
  int x, y;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
    x = x + 0;
  }
  return x;
}

22 Algebraic Simplification
int sumcalc(int a, int b, int N) {
  int i;
  int x, y;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
    x = x;
  }
  return x;
}

23 Copy Propagation
If you are just making a copy of a variable, try to use the original and eliminate the copy.
Pros: Fewer instructions. Less memory/fewer registers.
Con: May make an interference graph no longer colorable, leading to register spills.
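
In isolation the transformation looks like this (an illustrative fragment, not from the slides):

    /* Before copy propagation: y is only a copy of x. */
    int before(int x) {
        int y = x;
        return y + 1;       /* use of the copy */
    }

    /* After: the use reads x directly, and y becomes dead code. */
    int after(int x) {
        return x + 1;
    }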

24 Copy Propagation
int sumcalc(int a, int b, int N) {
  int i;
  int x, y;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
    x = x;
  }
  return x;
}

25 Copy Propagation
int sumcalc(int a, int b, int N) {
  int i;
  int x, y;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
  }
  return x;
}

26 Common Subexpression Elimination
If the same subexpression is calculated multiple times, calculate it once and reuse the result.
Pros: Less computation.
Cons: Needs additional storage/a register to keep the result, which may lead to register spills. May hinder parallelization by adding dependences.

27 Common Subexpression Elimination
int sumcalc(int a, int b, int N) {
  int i;
  int x, y;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    x = x + (4*a/b)*i + (i+1)*(i+1);
  }
  return x;
}

29 Common Subexpression Elimination
int sumcalc(int a, int b, int N) {
  int i;
  int x, y, t;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + (4*a/b)*i + (i+1)*(i+1);
  }
  return x;
}

30 Common Subexpression Elimination
int sumcalc(int a, int b, int N) {
  int i;
  int x, y, t;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + (4*a/b)*i + t*t;
  }
  return x;
}

31 Dead Code Elimination
If the result of a calculation is not used, don't do the calculation.
Pros: Less computation. No need to store the results; may be able to release the storage earlier.

32 Dead Code Elimination
int sumcalc(int a, int b, int N) {
  int i;
  int x, y, t;
  x = 0;
  y = 0;
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + (4*a/b)*i + t*t;
  }
  return x;
}

34 Dead Code Elimination
int sumcalc(int a, int b, int N) {
  int i;
  int x, y, t;
  x = 0;
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + (4*a/b)*i + t*t;
  }
  return x;
}

35 Dead Code Elimination
int sumcalc(int a, int b, int N) {
  int i;
  int x, t;
  x = 0;
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + (4*a/b)*i + t*t;
  }
  return x;
}

36 Loop Invariant Removal
If an expression always calculates to the same value in all loop iterations, move the calculation outside the loop.
Pros: A lot less work within the loop.
Cons: The result must be stored from before the loop starts and kept throughout the execution of the loop; more live ranges may lead to register spills.

37 Loop Invariant Removal
int sumcalc(int a, int b, int N) {
  int i;
  int x, t;
  x = 0;
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + (4*a/b)*i + t*t;
  }
  return x;
}

39 Loop Invariant Removal
int sumcalc(int a, int b, int N) {
  int i;
  int x, t, u;
  x = 0;
  u = (4*a/b);
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + u*i + t*t;
  }
  return x;
}

40 Strength Reduction
In a loop, instead of recalculating an expression, update the previous value of the expression if that requires less computation.
Example:
  for(i = 0; ...) { t = a*i; ... }
becomes
  t = 0; for(i = 0; ...) { t = t + a; ... }
Pros: Less computation.
Cons: More values to keep — the increased number of live variables may cause register spills. Introduces a loop-carried dependence, so a parallel loop becomes sequential.
Strength increase transformation: eliminate the loop-carried dependence (at the cost of more instructions) by the inverse of the strength reduction transformation.
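
The same idea at the source level, applied to the implicit multiply inside an array index (an illustrative sketch, not from the slides; in the slides' example the reduced expression is u*i):

    /* Before: each iteration recomputes a + i*sizeof(long) inside a[i]. */
    long sum_indexed(const long *a, int n) {
        long s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* After strength reduction: the multiply in the address computation
       becomes a pointer increment carried from the previous iteration. */
    long sum_reduced(const long *a, int n) {
        long s = 0;
        const long *p = a;
        for (int i = 0; i < n; i++, p++)
            s += *p;
        return s;
    }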

42 Strength Reduction
int sumcalc(int a, int b, int N) {
  int i;
  int x, t, u;
  x = 0;
  u = (4*a/b);
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + u*i + t*t;
  }
  return x;
}
u*0, u*1, u*2, u*3, u*4, ...
v=0, v=v+u, v=v+u, v=v+u, v=v+u, ...

43 Strength Reduction
int sumcalc(int a, int b, int N) {
  int i;
  int x, t, u, v;
  x = 0;
  u = (4*a/b);
  v = 0;
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + u*i + t*t;
    v = v + u;
  }
  return x;
}

44 Strength Reduction
int sumcalc(int a, int b, int N) {
  int i;
  int x, t, u, v;
  x = 0;
  u = (4*a/b);
  v = 0;
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + v + t*t;
    v = v + u;
  }
  return x;
}

46 Register Allocation
Use the limited registers effectively to keep some of the live values in registers instead of storing them in memory.
In the x86 architecture this is very important (that is why x86-64 has more registers). However, this is also critical in RISC architectures.

47 Example
[Figure, built up over slides 47–53: a straight-line sequence of defs and uses — def y, def x, def y, def x, use y, use x, use y, use x, def x, use x — whose def-use chains are grouped into four webs s1, s2, s3, s4; the webs, rather than the variables, become the units of register allocation and the nodes of the interference graph.]

54 Graph Coloring
[Figure: the interference graph of the webs above, colored so that webs that interfere receive different colors/registers.]

55 Register Allocation
If the graph can be colored with a number of colors ≤ the number of registers, allocate a register for each variable.
If too many colors are needed: pick a node and spill that register, eliminating its edges, then recolor the graph.
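
A minimal sketch of the coloring step described above, assuming the interference graph is given as an adjacency matrix and webs are colored greedily in a fixed order (real allocators pick the order via simplification heuristics; all names here are hypothetical):

    #include <stdio.h>

    #define NWEBS 4
    #define NREGS 2

    /* interfere[i][j] != 0 means webs i and j are live at the same time. */
    static const int interfere[NWEBS][NWEBS] = {
        {0, 1, 1, 0},
        {1, 0, 1, 0},
        {1, 1, 0, 1},
        {0, 0, 1, 0},
    };

    int main(void) {
        int color[NWEBS];
        for (int i = 0; i < NWEBS; i++) {
            int used[NREGS] = {0};
            for (int j = 0; j < i; j++)              /* registers taken   */
                if (interfere[i][j] && color[j] >= 0) /* by colored       */
                    used[color[j]] = 1;               /* neighbors        */
            color[i] = -1;                            /* -1 means spill   */
            for (int r = 0; r < NREGS; r++)
                if (!used[r]) { color[i] = r; break; }
            if (color[i] < 0)
                printf("web %d -> spill\n", i);
            else
                printf("web %d -> register %d\n", i, color[i]);
        }
        return 0;
    }

With two registers, webs 0 and 1 get registers 0 and 1, web 2 must spill, and web 3 can then reuse register 0.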

56 Register Allocation
[Figure: the stack frame (local variables x, y, i relative to fp) is replaced by a register assignment: $t9 = x, $t8 = t, $t7 = u, $t6 = v, $t5 = i.]

57 Optimized Example
int sumcalc(int a, int b, int N) {
  int i;
  int x, t, u, v;
  x = 0;
  u = ((a<<2)/b);
  v = 0;
  for(i = 0; i <= N; i++) {
    t = i+1;
    x = x + v + t*t;
    v = v + u;
  }
  return x;
}

58 test:
  subu  $fp, 16
  add   $t9, zero, zero  # x = 0
  sll   $t0, $a0, 2      # a<<2
  div   $t7, $t0, $a1    # u = (a<<2)/b
  add   $t6, zero, zero  # v = 0
  add   $t5, zero, zero  # i = 0
lab1:                    # for(i=0; i<=N; i++)
  addui $t8, $t5, 1      # t = i+1
  mul   $t0, $t8, $t8    # t*t
  addu  $t1, $t0, $t6    # v + t*t
  addu  $t9, $t9, $t1    # x = x + v + t*t
  addu  $t6, $t6, $t7    # v = v + u
  addui $t5, $t5, 1      # i = i+1
  ble   $t5, $a3, lab1
  addu  $v0, $t9, zero
  addu  $fp, 16
  b     $ra

59 Unoptimized Code vs. Optimized Code

Unoptimized Code:
test:
  subu $fp, 16
  sw zero, 0($fp)
  sw zero, 4($fp)
  sw zero, 8($fp)
lab1:
  mul $t0, $a0, 4
  div $t1, $t0, $a1
  lw $t2, 8($fp)
  mul $t3, $t1, $t2
  lw $t4, 8($fp)
  addui $t4, $t4, 1
  lw $t5, 8($fp)
  addui $t5, $t5, 1
  mul $t6, $t4, $t5
  addu $t7, $t3, $t6
  lw $t8, 0($fp)
  add $t8, $t7, $t8
  sw $t8, 0($fp)
  lw $t0, 4($fp)
  mul $t1, $t0, $a1
  lw $t2, 0($fp)
  add $t2, $t1, $t2
  sw $t2, 0($fp)
  lw $t0, 8($fp)
  addui $t0, $t0, 1
  sw $t0, 8($fp)
  ble $t0, $a3, lab1
  lw $v0, 0($fp)
  addu $fp, 16
  b $ra

Cost: 4*ld/st + 2*add/sub + br + N*(9*ld/st + 6*add/sub + 4*mul + div + br) = 7 + N*21
Execution time = 43 sec

Optimized Code:
test:
  subu $fp, 16
  add $t9, zero, zero
  sll $t0, $a0, 2
  div $t7, $t0, $a1
  add $t6, zero, zero
  add $t5, zero, zero
lab1:
  addui $t8, $t5, 1
  mul $t0, $t8, $t8
  addu $t1, $t0, $t6
  addu $t9, $t9, $t1
  addu $t6, $t6, $t7
  addui $t5, $t5, 1
  ble $t5, $a3, lab1
  addu $v0, $t9, zero
  addu $fp, 16
  b $ra

Cost: 6*add/sub + shift + div + br + N*(5*add/sub + mul + br) = 9 + N*7
Execution time = 7 sec

60 What Stops Optimizations?
The optimizer has to guarantee program equivalence: for all the valid inputs, for all the valid executions, for all the valid architectures.
In order to optimize the program, the compiler needs to understand the control flow and the data accessors.
Most of the time the full information is not available; then the compiler has to: reduce the scope of the region that the transformations apply to, reduce the aggressiveness of the transformations, and leave computations with unanalyzable data alone.

61 Control-Flow
All the possible paths through the program.
Representations within the compiler: call graphs, control-flow graphs.
What hinders the compiler analysis? Function pointers, indirect branches, computed gotos, large switch statements, loops with exits and breaks, loops where the bounds are not known, conditionals where the branch condition is not analyzable.
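
For instance, a single function pointer can defeat the call-graph analysis (an illustrative fragment, not from the slides):

    /* The compiler cannot, in general, know which function f names, so
       it must assume the call may read or write anything reachable
       from &x.                                                        */
    typedef void (*callback_t)(int *);

    int apply(callback_t f) {
        int x = 42;
        f(&x);          /* indirect call: callee unknown at compile time */
        return x;       /* x == 42 can no longer be assumed              */
    }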

62 Data Accessors
Who else can read or write the data?
Representations within the compiler: def-use chains, dependence vectors.
What hinders the compiler analysis? Address-taken variables, global variables, parameters, arrays, pointer accesses, volatile types.
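
For instance, globals and pointer parameters together block register promotion (an illustrative fragment, not from the slides):

    int g = 1;              /* global: visible to every function */

    int accessors(int *p, int *q) {
        *p = g + 1;         /* p may point to g...                    */
        *q = g + 2;         /* ...and q may point to g or to *p,      */
        return g;           /* so g must be reloaded from memory here */
    }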

63 Instruction Scheduling
Choosing the sequence in which to issue instructions so that the processor runs fast.
x86 is an out-of-order superscalar: the hardware tries to schedule the instructions. The compiler can still be useful: hardware cannot look too far forward, the compiler can transform the program to eliminate hazards, and hardware doesn't have the time or the information to do most of these transformations.
For in-order processors, instruction scheduling is CRITICAL.

64 Constraints On Scheduling
Data dependencies, control dependencies, resource constraints.

65 Data Dependency between Instructions
If two instructions access the same variable, they can be dependent.
Kinds of dependencies: true (write → read), anti (read → write), output (write → write).
What to do if two instructions are dependent: the order of execution cannot be reversed, which reduces the possibilities for scheduling.
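
The three kinds on a straight-line fragment (illustrative; S1–S4 are statement labels, not from the slides):

    void deps(int *a) {
        int t, b;
        t = *a + 1;     /* S1: writes t                                 */
        b = t * 2;      /* S2: true dependence on S1 (write t, read t)  */
        t = b - 3;      /* S3: anti dependence on S2 (read t, write t),
                           output dependence on S1 (write t, write t)   */
        *a = t;         /* S4: true dependence on S3; anti dependence
                           on S1's read of *a                           */
    }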

66 Computing Dependencies
For basic blocks, compute dependencies by walking through the instructions.
Identifying register dependencies is simple: is it the same register?
For memory accesses:
  simple: base1 + offset1 ?= base2 + offset2
  data dependence analysis: a[i] ?= a[i+1]
  interprocedural analysis: global ?= parameter
  pointer alias analysis: p1->foo ?= p2->foo
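
The same questions at the source level (an illustrative fragment, not from the slides):

    void stores(int a[], int i, int *p, int *q) {
        a[i]     = 1;
        a[i + 1] = 2;   /* data dependence analysis: i != i+1, so these
                           two stores are independent                   */
        *p = 3;
        *q = 4;         /* alias analysis: p and q may name the same
                           location, so these stores must stay ordered  */
    }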

67 Representing Dependencies
Using a dependence DAG, one per basic block. Nodes are instructions; edges represent dependencies.
  1: r2 = *(r1 + 4)
  2: r3 = *(r1 + 8)
  3: r4 = r2 + r3
  4: r5 = r2 - 1
Here instruction 3 depends on instructions 1 and 2, and instruction 4 depends on instruction 1.
Each edge is labeled with a latency: v(i → j) = the delay required between the initiation times of i and j, minus the execution time required by i.

71 Resource Constraints
Modern machines have many resource constraints.
Superscalar architectures can run a few operations in parallel, but have constraints.

72 Resource Constraints of a Superscalar Processor
Example: one fully pipelined reg-to-reg unit, with all integer operations taking one cycle, in parallel with one fully pipelined memory-to/from-reg unit, where data loads take two cycles and data stores take one cycle.

73 List Scheduling Algorithm
Idea: do a topological sort of the dependence DAG; consider when an instruction can be scheduled without causing a stall; schedule an instruction if it causes no stall and all its predecessors are already scheduled.
Optimal list scheduling is NP-complete, so use heuristics when necessary.

74 List Scheduling Algorithm with resource constraints
Represent the superscalar architecture as multiple pipelines; each pipeline represents some resource.
Example: one single-cycle reg-to-reg ALU unit, plus one two-cycle pipelined reg-to/from-memory unit, giving the issue slots ALU | MEM1 | MEM2.

76 List Scheduling Algorithm with resource constraints
Create a dependence DAG of a basic block.
Topological sort: READY = nodes with no predecessors.
Loop until READY is empty:
  Let n ∈ READY be the node with the highest priority.
  Schedule n in the earliest slot that satisfies precedence + resource constraints.
  Update READY.
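
A minimal sketch of this loop on the four-instruction DAG from the earlier slide, assuming one ALU pipeline and one memory pipeline that each accept one instruction per cycle, and using program order as the priority (a real scheduler would use critical-path heuristics; all names here are hypothetical):

    #include <stdio.h>

    #define N 4
    #define CYCLES 16

    typedef struct { const char *name; int is_mem; int latency; } Instr;

    /* 1: r2 = *(r1+4)  2: r3 = *(r1+8)  3: r4 = r2+r3  4: r5 = r2-1 */
    static const Instr ins[N] = {
        {"r2 = *(r1+4)", 1, 2},
        {"r3 = *(r1+8)", 1, 2},
        {"r4 = r2 + r3", 0, 1},
        {"r5 = r2 - 1",  0, 1},
    };
    static const int dep[N][N] = {   /* dep[i][j]: j must wait for i */
        {0, 0, 1, 1},
        {0, 0, 1, 0},
        {0, 0, 0, 0},
        {0, 0, 0, 0},
    };

    int main(void) {
        int start[N] = {0};
        int busy[2][CYCLES] = {{0}};       /* [0] = ALU, [1] = MEM */
        for (int j = 0; j < N; j++) {      /* indices form a topological order */
            int t = 0;
            for (int i = 0; i < j; i++)    /* precedence constraints */
                if (dep[i][j] && start[i] + ins[i].latency > t)
                    t = start[i] + ins[i].latency;
            while (busy[ins[j].is_mem][t]) /* resource constraint */
                t++;
            busy[ins[j].is_mem][t] = 1;
            start[j] = t;
            printf("cycle %d: %s\n", t, ins[j].name);
        }
        return 0;
    }

Note how instruction 4 slips into cycle 2, ahead of instruction 3, because its operand is ready earlier.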

77 Example
  1: lea var_a, %rax
  2: add 4(%rsp), %rax
  3: inc %r11
  4: mov 4(%rsp), %r10
  5: mov %r10, 8(%rsp)
  6: and $0x00ff, %rbx
  7: imul %rax, %rbx
  8: lea var_b, %rax
  9: mov %rbx, 16(%rsp)
[Slides 78–112 animate list scheduling on this block: the dependence DAG is built, each node is labeled with its critical-path distance d and slack f, and instructions are picked from the READY set (initially { 1, 6, 4, 3 }) in priority order and placed into the ALUop / MEM1 / MEM2 issue slots, until READY = { } and the whole block is scheduled.]

113 Control Dependencies
Constraints in moving instructions across basic blocks.

114 Control Dependencies
  if ( ... )
    a = b op c
Can the assignment be hoisted above the branch?

116 Control Dependencies
  if ( c != 0 )
    a = b / c
NO!!! Hoisting the division above the test may divide by zero.

117 Control Dependencies
  if ( ... )
    d = *(a)

118 Control Dependencies
  if ( valid address? )
    d = *(a)
The load can only be hoisted if the address is known to be valid.

119 Trace Scheduling
Find the most common trace of basic blocks, using profile information.
Combine the basic blocks in the trace and schedule them as one block.
Create clean-up code if the execution goes off-trace.

120 Trace Scheduling
[Figure, slides 120–123: a control-flow graph with blocks A–H; the most frequently executed trace A → B → D → E → G → H is identified and scheduled as a single block, with the off-trace blocks C and F left outside the trace.]

124 Large Basic Blocks via Code Duplication
Create large extended basic blocks by duplication, then schedule the larger blocks.
[Figure: a diamond A → {B, C} → D → E; duplicating D and E along both paths produces two straight-line blocks, A-B-D-E and A-C-D-E.]

126 Trace Scheduling
[Figure, slides 126–127: the trace A → B → D → E → G → H is scheduled as one block; compensation copies of the duplicated blocks (D, E, H, ...) are generated on the off-trace paths through C and F.]

128 Loop Example
Assembly Code:
  loop: mov  (%rdi,%rax), %r10
        imul %r11, %r10
        mov  %r10, (%rdi,%rax)
        sub  $4, %rax
        bge  loop
[Figure: the dependence DAG of the loop body, with critical-path distances mov d=7, imul d=5, mov d=2, sub d=1, bge d=0.]

129 Loop Example
Assembly Code: as above.
Schedule (9 cycles per iteration):
[Figure: the issue-slot table for one iteration — the mov/imul/mov/sub/bge sequence with stall cycles while waiting for the load and multiply results.]

130 Loop Unrolling
Unroll the loop body a few times.
Pros: Creates a much larger basic block for the body. Eliminates a few loop bounds checks.
Cons: Much larger program. Setup code is needed (when the # of iterations < unroll factor). The beginning and end of the schedule can still have unused slots.
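
At the source level, unrolling by a factor of 4 looks like this (an illustrative sketch; the slides unroll the assembly loop by 2):

    long sum(const long *a, int n) {
        long s = 0;
        int i = 0;
        for (; i + 4 <= n; i += 4) {   /* one bounds check per 4 elements */
            s += a[i];
            s += a[i + 1];
            s += a[i + 2];
            s += a[i + 3];
        }
        for (; i < n; i++)             /* cleanup when n % 4 != 0 */
            s += a[i];
        return s;
    }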

131 Loop Example
  loop: mov  (%rdi,%rax), %r10
        imul %r11, %r10
        mov  %r10, (%rdi,%rax)
        sub  $4, %rax
        mov  (%rdi,%rax), %r10
        imul %r11, %r10
        mov  %r10, (%rdi,%rax)
        sub  $4, %rax
        bge  loop

132 Loop Example
(Same unrolled loop as above.)
Schedule (8 cycles per iteration):
[Figure: the dependence DAG of the unrolled body and its issue-slot table; the two iterations still execute largely serially because both use %r10 and %rax.]

133 Loop Unrolling
Rename registers: use different registers in different iterations.

134 Loop Example
(The unrolled loop before renaming — both iterations use %r10.)
  loop: mov  (%rdi,%rax), %r10
        imul %r11, %r10
        mov  %r10, (%rdi,%rax)
        sub  $4, %rax
        mov  (%rdi,%rax), %r10
        imul %r11, %r10
        mov  %r10, (%rdi,%rax)
        sub  $4, %rax
        bge  loop
[Figure: dependence DAG with critical-path distances.]

135 Loop Example
(After renaming — the second iteration uses %rcx.)
  loop: mov  (%rdi,%rax), %r10
        imul %r11, %r10
        mov  %r10, (%rdi,%rax)
        sub  $4, %rax
        mov  (%rdi,%rax), %rcx
        imul %r11, %rcx
        mov  %rcx, (%rdi,%rax)
        sub  $4, %rax
        bge  loop
[Figure: dependence DAG with critical-path distances.]

136 Loop Unrolling
Rename registers: use different registers in different iterations.
Eliminate unnecessary dependencies: again, use more registers to eliminate true, anti and output dependencies; eliminate dependent chains of calculations when possible.
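
Eliminating a dependent chain at the source level (an illustrative sketch, not from the slides): the single accumulator below is a loop-carried chain; splitting it into two partial sums gives the scheduler two independent chains:

    /* Before: every add waits for the previous add to s. */
    long sum_chain(const long *a, int n) {
        long s = 0;
        for (int i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    /* After (n assumed even for brevity): the two chains can overlap. */
    long sum_split(const long *a, int n) {
        long s0 = 0, s1 = 0;
        for (int i = 0; i < n; i += 2) {
            s0 += a[i];
            s1 += a[i + 1];
        }
        return s0 + s1;
    }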

137 Loop Example
(Both iterations still share %rax, creating a dependent chain through the two subs.)
  loop: mov  (%rdi,%rax), %r10
        imul %r11, %r10
        mov  %r10, (%rdi,%rax)
        sub  $4, %rax
        mov  (%rdi,%rax), %rcx
        imul %r11, %rcx
        mov  %rcx, (%rdi,%rax)
        sub  $4, %rax
        bge  loop
[Figure: dependence DAG with critical-path distances.]

138 Loop Example
(After eliminating the chain — each iteration gets its own induction register, stepped by $8.)
  loop: mov  (%rdi,%rax), %r10
        imul %r11, %r10
        mov  %r10, (%rdi,%rax)
        sub  $8, %rax
        mov  (%rdi,%rbx), %rcx
        imul %r11, %rcx
        mov  %rcx, (%rdi,%rbx)
        sub  $8, %rbx
        bge  loop
[Figure: dependence DAG with critical-path distances.]

139 Loop Example
(Same loop as above.)
Schedule (4.5 cycles per iteration):
[Figure: issue-slot table — the two independent iterations now overlap in the ALU and MEM slots.]

140 Software Pipelining
Try to overlap multiple iterations so that the slots will be filled.
Find the steady-state window so that all the instructions of the loop body are executed, but from different iterations.
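
A source-level picture of the idea (an illustrative sketch, not from the slides): in the steady state, the load for iteration i+1 is issued while iteration i's multiply and store complete; the prologue fills the pipeline and the epilogue drains it:

    void scale(long *a, int n, long r) {
        if (n <= 0) return;
        long t = a[0];                 /* prologue: first load            */
        for (int i = 0; i < n - 1; i++) {
            long next = a[i + 1];      /* load belongs to iteration i+1   */
            a[i] = t * r;              /* multiply+store from iteration i */
            t = next;
        }
        a[n - 1] = t * r;              /* epilogue: drain the last value  */
    }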

141 Loop Example
Assembly Code:
  loop: mov  (%rdi,%rax), %r10
        imul %r11, %r10
        mov  %r10, (%rdi,%rax)
        sub  $4, %rax
        bge  loop
[Figure, slides 141–148: the software-pipelined schedule is built up step by step — the mov, imul, store, sub and bge of successive iterations (mov1, mul1, st1, mov2, mul2, ...) are overlapped in the issue slots until a steady-state window emerges in which every slot is filled with instructions drawn from different iterations.]

149 Loop Example
4 iterations are overlapped; the value of %r11 doesn't change.
4 registers for (%rdi,%rax), each address incremented by 4*4.
4 registers to keep the value of %r10.
The same registers can be reused after 4 of these blocks; generate code for 4 blocks, otherwise values need to be moved between registers.
  loop: mov  (%rdi,%rax), %r10
        imul %r11, %r10
        mov  %r10, (%rdi,%rax)
        sub  $4, %rax
        bge  loop
[Figure: the steady-state window of the pipelined schedule.]

150 MIT OpenCourseWare http://ocw.mit.edu
Performance Engineering of Software Systems, Fall 2009
For information about citing these materials or our Terms of Use, visit: http://ocw.mit.edu/terms


More information

Last time: forwarding/stalls. CS 6354: Branch Prediction (con t) / Multiple Issue. Why bimodal: loops. Last time: scheduling to avoid stalls

Last time: forwarding/stalls. CS 6354: Branch Prediction (con t) / Multiple Issue. Why bimodal: loops. Last time: scheduling to avoid stalls CS 6354: Branch Prediction (con t) / Multiple Issue 14 September 2016 Last time: scheduling to avoid stalls 1 Last time: forwarding/stalls add $a0, $a2, $a3 ; zero or more instructions sub $t0, $a0, $a1

More information

ECE/CS 552: Introduction to Computer Architecture ASSIGNMENT #1 Due Date: At the beginning of lecture, September 22 nd, 2010

ECE/CS 552: Introduction to Computer Architecture ASSIGNMENT #1 Due Date: At the beginning of lecture, September 22 nd, 2010 ECE/CS 552: Introduction to Computer Architecture ASSIGNMENT #1 Due Date: At the beginning of lecture, September 22 nd, 2010 This homework is to be done individually. Total 9 Questions, 100 points 1. (8

More information

CMSC 611: Advanced Computer Architecture

CMSC 611: Advanced Computer Architecture CMSC 611: Advanced Computer Architecture Compilers Some material adapted from Mohamed Younis, UMBC CMSC 611 Spr 2003 course slides Some material adapted from Hennessy & Patterson / 2003 Elsevier Science

More information

Code generation and optimization

Code generation and optimization Code generation and timization Generating assembly How do we convert from three-address code to assembly? Seems easy! But easy solutions may not be the best tion What we will cover: Peephole timizations

More information

CS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines

CS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines CS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines Assigned April 7 Problem Set #5 Due April 21 http://inst.eecs.berkeley.edu/~cs152/sp09 The problem sets are intended

More information

CISC 662 Graduate Computer Architecture. Lecture 4 - ISA MIPS ISA. In a CPU. (vonneumann) Processor Organization

CISC 662 Graduate Computer Architecture. Lecture 4 - ISA MIPS ISA. In a CPU. (vonneumann) Processor Organization CISC 662 Graduate Computer Architecture Lecture 4 - ISA MIPS ISA Michela Taufer http://www.cis.udel.edu/~taufer/courses Powerpoint Lecture Notes from John Hennessy and David Patterson s: Computer Architecture,

More information

6.035 Project 3: Unoptimized Code Generation. Jason Ansel MIT - CSAIL

6.035 Project 3: Unoptimized Code Generation. Jason Ansel MIT - CSAIL 6.035 Project 3: Unoptimized Code Generation Jason Ansel MIT - CSAIL Quiz Monday 50 minute quiz Monday Covers everything up to yesterdays lecture Lexical Analysis (REs, DFAs, NFAs) Syntax Analysis (CFGs,

More information

LECTURE 10. Pipelining: Advanced ILP

LECTURE 10. Pipelining: Advanced ILP LECTURE 10 Pipelining: Advanced ILP EXCEPTIONS An exception, or interrupt, is an event other than regular transfers of control (branches, jumps, calls, returns) that changes the normal flow of instruction

More information

CODE GENERATION Monday, May 31, 2010

CODE GENERATION Monday, May 31, 2010 CODE GENERATION memory management returned value actual parameters commonly placed in registers (when possible) optional control link optional access link saved machine status local data temporaries A.R.

More information

HW 2 is out! Due 9/25!

HW 2 is out! Due 9/25! HW 2 is out! Due 9/25! CS 6290 Static Exploitation of ILP Data-Dependence Stalls w/o OOO Single-Issue Pipeline When no bypassing exists Load-to-use Long-latency instructions Multiple-Issue (Superscalar),

More information

Chapter 4 The Processor (Part 4)

Chapter 4 The Processor (Part 4) Department of Electr rical Eng ineering, Chapter 4 The Processor (Part 4) 王振傑 (Chen-Chieh Wang) ccwang@mail.ee.ncku.edu.tw ncku edu Depar rtment of Electr rical Engineering, Feng-Chia Unive ersity Outline

More information

EE 4683/5683: COMPUTER ARCHITECTURE

EE 4683/5683: COMPUTER ARCHITECTURE EE 4683/5683: COMPUTER ARCHITECTURE Lecture 4A: Instruction Level Parallelism - Static Scheduling Avinash Kodi, kodi@ohio.edu Agenda 2 Dependences RAW, WAR, WAW Static Scheduling Loop-carried Dependence

More information

CS356 Unit 12a. Logic Circuits. Combinational Logic Gates BASIC HW. Processor Hardware Organization Pipelining

CS356 Unit 12a. Logic Circuits. Combinational Logic Gates BASIC HW. Processor Hardware Organization Pipelining 2a. 2a.2 CS356 Unit 2a Processor Hardware Organization Pipelining BASIC HW Logic Circuits 2a.3 Combinational Logic Gates 2a.4 logic Performs a specific function (mapping of input combinations to desired

More information

Computer Science 246 Computer Architecture

Computer Science 246 Computer Architecture Computer Architecture Spring 2009 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Compiler ILP Static ILP Overview Have discussed methods to extract ILP from hardware Why can t some of these

More information

CPE 631 Lecture 09: Instruction Level Parallelism and Its Dynamic Exploitation

CPE 631 Lecture 09: Instruction Level Parallelism and Its Dynamic Exploitation Lecture 09: Instruction Level Parallelism and Its Dynamic Exploitation Aleksandar Milenkovic, milenka@ece.uah.edu Electrical and Computer Engineering University of Alabama in Huntsville Outline Instruction

More information

2. dead code elimination (declaration and initialization of z) 3. common subexpression elimination (temp1 = j + g + h)

2. dead code elimination (declaration and initialization of z) 3. common subexpression elimination (temp1 = j + g + h) Problem 1 (20 points) Compilation Perform at least five standard compiler optimizations on the following C code fragment by writing the optimized version (in C) to the right. Assume f is a pure function

More information

Architecture II. Computer Systems Laboratory Sungkyunkwan University

Architecture II. Computer Systems Laboratory Sungkyunkwan University MIPS Instruction ti Set Architecture II Jin-Soo Kim (jinsookim@skku.edu) Computer Systems Laboratory Sungkyunkwan University http://csl.skku.edu Making Decisions (1) Conditional operations Branch to a

More information

CS146 Computer Architecture. Fall Midterm Exam

CS146 Computer Architecture. Fall Midterm Exam CS146 Computer Architecture Fall 2002 Midterm Exam This exam is worth a total of 100 points. Note the point breakdown below and budget your time wisely. To maximize partial credit, show your work and state

More information

Rui Wang, Assistant professor Dept. of Information and Communication Tongji University.

Rui Wang, Assistant professor Dept. of Information and Communication Tongji University. Instructions: ti Language of the Computer Rui Wang, Assistant professor Dept. of Information and Communication Tongji University it Email: ruiwang@tongji.edu.cn Computer Hierarchy Levels Language understood

More information

In-order vs. Out-of-order Execution. In-order vs. Out-of-order Execution

In-order vs. Out-of-order Execution. In-order vs. Out-of-order Execution In-order vs. Out-of-order Execution In-order instruction execution instructions are fetched, executed & committed in compilergenerated order if one instruction stalls, all instructions behind it stall

More information

CS 316: Procedure Calls/Pipelining

CS 316: Procedure Calls/Pipelining CS 316: Procedure Calls/Pipelining Kavita Bala Fall 2007 Computer Science Cornell University Announcements PA 3 IS out today Lectures on it this Fri and next Tue/Thu Due on the Friday after Fall break

More information

CSE 378 Midterm 2/12/10 Sample Solution

CSE 378 Midterm 2/12/10 Sample Solution Question 1. (6 points) (a) Rewrite the instruction sub $v0,$t8,$a2 using absolute register numbers instead of symbolic names (i.e., if the instruction contained $at, you would rewrite that as $1.) sub

More information

CS 3330 Exam 3 Fall 2017 Computing ID:

CS 3330 Exam 3 Fall 2017 Computing ID: S 3330 Fall 2017 Exam 3 Variant E page 1 of 16 Email I: S 3330 Exam 3 Fall 2017 Name: omputing I: Letters go in the boxes unless otherwise specified (e.g., for 8 write not 8 ). Write Letters clearly: if

More information

CSE Lecture In Class Example Handout

CSE Lecture In Class Example Handout CSE 30321 Lecture 07-09 In Class Example Handout Part A: A Simple, MIPS-based Procedure: Swap Procedure Example: Let s write the MIPS code for the following statement (and function call): if (A[i] > A

More information

Please state clearly any assumptions you make in solving the following problems.

Please state clearly any assumptions you make in solving the following problems. Computer Architecture Homework 3 2012-2013 Please state clearly any assumptions you make in solving the following problems. 1 Processors Write a short report on at least five processors from at least three

More information

Instruction Scheduling

Instruction Scheduling Instruction Scheduling Michael O Boyle February, 2014 1 Course Structure Introduction and Recap Course Work Scalar optimisation and dataflow L5 Code generation L6 Instruction scheduling Next register allocation

More information

Topic Notes: MIPS Instruction Set Architecture

Topic Notes: MIPS Instruction Set Architecture Computer Science 220 Assembly Language & Comp. Architecture Siena College Fall 2011 Topic Notes: MIPS Instruction Set Architecture vonneumann Architecture Modern computers use the vonneumann architecture.

More information

Outline. Lecture 17: Putting it all together. Example (input program) How to make the computer understand? Example (Output assembly code) Fall 2002

Outline. Lecture 17: Putting it all together. Example (input program) How to make the computer understand? Example (Output assembly code) Fall 2002 Outline 5 Fall 2002 Lecture 17: Putting it all together From parsing to code generation Saman Amarasinghe 2 6.035 MIT Fall 1998 How to make the computer understand? Write a program using a programming

More information

Advanced Computer Architecture

Advanced Computer Architecture ECE 563 Advanced Computer Architecture Fall 2010 Lecture 6: VLIW 563 L06.1 Fall 2010 Little s Law Number of Instructions in the pipeline (parallelism) = Throughput * Latency or N T L Throughput per Cycle

More information

Instruction scheduling

Instruction scheduling Instruction scheduling iaokang Qiu Purdue University ECE 468 October 12, 2018 What is instruction scheduling? Code generation has created a sequence of assembly instructions But that is not the only valid

More information

Optimization. ASU Textbook Chapter 9. Tsan-sheng Hsu.

Optimization. ASU Textbook Chapter 9. Tsan-sheng Hsu. Optimization ASU Textbook Chapter 9 Tsan-sheng Hsu tshsu@iis.sinica.edu.tw http://www.iis.sinica.edu.tw/~tshsu 1 Introduction For some compiler, the intermediate code is a pseudo code of a virtual machine.

More information

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 15 Very Long Instruction Word Machines

ECE 552 / CPS 550 Advanced Computer Architecture I. Lecture 15 Very Long Instruction Word Machines ECE 552 / CPS 550 Advanced Computer Architecture I Lecture 15 Very Long Instruction Word Machines Benjamin Lee Electrical and Computer Engineering Duke University www.duke.edu/~bcl15 www.duke.edu/~bcl15/class/class_ece252fall11.html

More information

Complex Pipelines and Branch Prediction

Complex Pipelines and Branch Prediction Complex Pipelines and Branch Prediction Daniel Sanchez Computer Science & Artificial Intelligence Lab M.I.T. L22-1 Processor Performance Time Program Instructions Program Cycles Instruction CPI Time Cycle

More information

Lecture Compiler Backend

Lecture Compiler Backend Lecture 19-23 Compiler Backend Jianwen Zhu Electrical and Computer Engineering University of Toronto Jianwen Zhu 2009 - P. 1 Backend Tasks Instruction selection Map virtual instructions To machine instructions

More information

Compiler construction in4303 lecture 9

Compiler construction in4303 lecture 9 Compiler construction in4303 lecture 9 Code generation Chapter 4.2.5, 4.2.7, 4.2.11 4.3 Overview Code generation for basic blocks instruction selection:[burs] register allocation: graph coloring instruction

More information