Just-In-Time Software Pipelining

Size: px

Start display at page:

Download "Just-In-Time Software Pipelining"

Randolph Warren
6 years ago
Views:

1 Just-In-Time Software Pipelining Hongbo Rong Hyunchul Park Youfeng Wu Cheng Wang Programming Systems Lab Intel Labs, Santa Clara

2 What is software pipelining? A loop optimization exposing instruction-level parallelism (ILP) for (i = 0; i<n; i++){ } a: x = y + 1 b: y = A[i] + x c: B[i+2] = B[i]*x : A[i+2] = B[i+2] 2

3 What is software pipelining? A loop optimization exposing instruction-level parallelism (ILP) for (i = 0; i<n; i++){ a <1> a: x = y + 1 b: y = A[i] + x <1> c b c: B[i+2] = B[i]*x : A[i+2] = B[i+2] <2> } Local epenence Loop-carrie epenence 2

4 <1> c a <1> b <2> 3

5 Iteration <1> c a <1> b <2> 3

6 a b c Iteration a <1> <1> c b <2> 3

7 a Iteration b c a b c a <1> <1> c b <2> 3

8 a b c a b c Iteration Initiation interval(ii)=2 a <1> <1> c b <2> 3

9 a b c a b c Iteration Initiation interval(ii)=2 a <1> a b c <1> c b <2> 3

10 a b c a b c Iteration Initiation interval(ii)=2 a <1> a b c a <1> c b <2> b c 3

11 a b c a b c Iteration Initiation interval(ii)=2 a <1> a b c a <1> c b <2> b c 3

a 0 1 2 3 b c a b c a b c a b c Iteration Initiation interval(ii)=2

12 a b c a b c a b c a b c Iteration Initiation interval(ii)=2 Kernel 2 stages Different iterations work on ifferent stages in parallel 3

13 a b c a b c a b c a b c Iteration Initiation interval(ii)=2 Kernel 2 stages Different iterations work on ifferent stages in parallel 3

14 a b c a b c a b c a b c Iteration Initiation interval(ii)=2 Kernel 2 stages Different iterations work on ifferent stages in parallel 3

a 0 1 2 3 b c a b c a b c a b c Iteration Initiation interval(ii)=2 Kernel 2 stages

15 a b c a b c a b c a b c Iteration Initiation interval(ii)=2 Kernel 2 stages Different iterations work on ifferent stages in parallel II is the performance inicator 3

16 Software pipelining has been static Extensively stuie in 3 ecaes, an efficient for wie-issue architectures VLIW [Lam 1988] superscalar [Ruttenberg et al. 1996] 4

17 Software pipelining has been static Extensively stuie in 3 ecaes, an efficient for wie-issue architectures VLIW [Lam 1988] superscalar [Ruttenberg et al. 1996] It is seen only in static compilers 4

18 Software pipelining has been static Extensively stuie in 3 ecaes, an efficient for wie-issue architectures VLIW [Lam 1988] superscalar [Ruttenberg et al. 1996] It is seen only in static compilers Most works aim to minimize II but not compile overhea 4

19 It is time now to exten software pipelining to ynamic compilers! 5

20 It is time now to exten software pipelining to ynamic compilers! Dynamic languages are increasingly popular JavaScript an PhP 88.9% an 81.5% in client an server websites (W3Techs) 5

21 It is time now to exten software pipelining to ynamic compilers! Dynamic languages are increasingly popular JavaScript an PhP 88.9% an 81.5% in client an server websites (W3Techs) Huge amount of legacy coe Small optimization scope: a loop iteration Software pipelining enlarges the scope to many iterations 5

22 It is time now to exten software pipelining to ynamic compilers! Dynamic languages are increasingly popular JavaScript an PhP 88.9% an 81.5% in client an server websites (W3Techs) Huge amount of legacy coe Small optimization scope: a loop iteration Software pipelining enlarges the scope to many iterations Minimizing compile overhea must be the 1 st objective Only simple/fast algorithms can be use 5

23 It is time now to exten software pipelining to ynamic compilers! Dynamic languages are increasingly popular JavaScript an PhP 88.9% an 81.5% in client an server websites (W3Techs) Huge amount of legacy coe Small optimization scope: a loop iteration Software pipelining enlarges the scope to many iterations Minimizing compile overhea must be the 1 st objective Only simple/fast algorithms can be use linear-time algorithms are preferre 5

24 Challenges Memory aliases kill parallelism Harware: Atomic region + rotating alias registers [MICRO-46] 6

25 Challenges Memory aliases kill parallelism Harware: Atomic region + rotating alias registers [MICRO-46] for (i = 0; i<n; i++){ a b c } 6

26 Challenges Memory aliases kill parallelism Harware: Atomic region + rotating alias registers [MICRO-46] for (i = 0; i<n; i++){ Original a optimizat b ion scope c } 6

27 Challenges Memory aliases kill parallelism Harware: Atomic region + rotating alias registers [MICRO-46] for (i = 0; i<n; i++){ Original a optimizat b ion scope c } 6 for (j = 0; j<n; j+=m){ for (i = j; i<j+m; i++){ a b c } }

28 Challenges Memory aliases kill parallelism Harware: Atomic region + rotating alias registers [MICRO-46] for (i = 0; i<n; i++){ Original a optimizat b ion scope c } Atomic region 6 for (j = 0; j<n; j+=m){ for (i = j; i<j+m; i++){ a } } b c

29 Challenges Memory aliases kill parallelism Harware: Atomic region + rotating alias registers [MICRO-46] Costly rollback Software: Light-weight checkpointing for (i = 0; i<n; i++){ Original a optimizat b ion scope c } Atomic region 6 for (j = 0; j<n; j+=m){ for (i = j; i<j+m; i++){ a } } b c

30 Challenges Memory aliases kill parallelism Harware: Atomic region + rotating alias registers [MICRO-46] Costly rollback Software: Light-weight checkpointing Scheuling is expensive for (i = 0; i<n; i++){ Original a optimizat b ion scope c } Atomic region 6 for (j = 0; j<n; j+=m){ for (i = j; i<j+m; i++){ a } } b c

31 7 Framework (on Transmeta CMS)

32 x86 binary Framework (on Transmeta CMS) 7

33 x86 binary Framework (on Transmeta CMS) Interpreter & profiler 7

34 x86 binary Interpreter & profiler Framework (on Transmeta CMS) Hot region optimizations 7

35 x86 binary Interpreter & profiler Framework (on Transmeta CMS) Hot region optimizations Acyclic scheuler 7

36 x86 binary Interpreter & profiler Framework (on Transmeta CMS) Hot region optimizations Acyclic scheuler Assembler 7

37 x86 binary Interpreter & profiler Framework (on Transmeta CMS) Hot region optimizations Acyclic scheuler Coe cache Assembler 7

38 x86 binary Interpreter & profiler Framework (on Transmeta CMS) Hot region optimizations Acyclic scheuler Coe cache Assembler 8

39 Framework (on Transmeta CMS) Hot region optimizations Acyclic scheuler Assembler 8

40 Framework (on Transmeta CMS) Hot region optimizations Loop selection Acyclic scheuler Acyclic scheuler Assembler 8

41 Framework (on Transmeta CMS) Hot region optimizations Initialization Loop selection Acyclic scheuler Acyclic scheuler Assembler 8

42 Framework (on Transmeta CMS) Hot region optimizations Initialization Loop selection Scheuling Acyclic scheuler Acyclic scheuler Assembler 8

43 Framework (on Transmeta CMS) Hot region optimizations Initialization Loop selection Acyclic scheuler Acyclic scheuler Scheuling Rotating alias reg alloc Assembler 8

44 Framework (on Transmeta CMS) Hot region optimizations Initialization Loop selection Acyclic scheuler Acyclic scheuler Scheuling Rotating alias reg alloc Assembler Coe generation 8

45 Framework (on Transmeta CMS) Hot region optimizations Initialization Loop selection Acyclic scheuler Acyclic scheuler Scheuling Rotating alias reg alloc Assembler Coe generation 8

46 Framework (on Transmeta CMS) Hot region optimizations Initialization Loop selection Acyclic scheuler Acyclic scheuler Scheuling Rotating alias reg alloc Assembler Atomic region Coe generation Rotating alias register file 8

47 Scheuling is expensive NP-complete problem to fin an optimal scheule [Collan et al. 1996] 9

48 Scheuling is expensive NP-complete problem to fin an optimal scheule [Collan et al. 1996] O(V 3 ) at least, exponential at worst [Rau et al. 1992] V: number of operations 9

49 Scheuling is expensive NP-complete problem to fin an optimal scheule [Collan et al. 1996] O(V 3 ) at least, exponential at worst [Rau et al. 1992] V: number of operations Can we linearize software pipelining? 9

50 Keep scheuling time uner control 10

51 Keep scheuling time uner control Scheule in linear time Use either simple or fast sub-algorithms Avoi cubic or exponential complexity 10

52 Keep scheuling time uner control Scheule in linear time Use either simple or fast sub-algorithms Avoi cubic or exponential complexity For iterative sub-algorithms, have a threshol: the maximum #iterations allowe 10

53 Keep scheuling time uner control Scheule in linear time Use either simple or fast sub-algorithms Avoi cubic or exponential complexity For iterative sub-algorithms, have a threshol: the maximum #iterations allowe The smaller the threshol, the less the compile overhea Once exceee, abort software pipelining 10

54 Keep scheuling time uner control Scheule in linear time Use either simple or fast sub-algorithms Avoi cubic or exponential complexity For iterative sub-algorithms, have a threshol: the maximum #iterations allowe The smaller the threshol, the less the compile overhea Once exceee, abort software pipelining Key question: How small can the threshol be? 10

55 Keep scheuling time uner control Scheule in linear time Use either simple or fast sub-algorithms Avoi cubic or exponential complexity For iterative sub-algorithms, have a threshol: the maximum #iterations allowe The smaller the threshol, the less the compile overhea Once exceee, abort software pipelining Key question: How small can the threshol be? Fin a goo enough scheule No backtracking Priority function: approximate an never upate Separate epenence an resource constraints Separate local an loop-carrie epenences 10

56 Keep scheuling time uner control Scheule in linear time Use either simple or fast sub-algorithms Avoi cubic or exponential complexity For iterative sub-algorithms, have a threshol: the maximum #iterations allowe The smaller the threshol, the less the compile overhea Once exceee, abort software pipelining Key question: How small can the threshol be? Fin a goo enough scheule No backtracking Priority function: approximate an never upate Separate epenence an resource constraints Separate local an loop-carrie epenences Iteratively improve a scheule 10

57 Just-In-Time Software Pipelining 11

58 Just-In-Time Software Pipelining Prepartition H, B Quickly creates an initial scheule to start with 11

59 Just-In-Time Software Pipelining Prepartition H, B Local scheuling Quickly creates an initial scheule to start with Hanles local epenences an resources 11

60 Just-In-Time Software Pipelining Prepartition H, B Local scheuling Kernel expansion Quickly creates an initial scheule to start with Hanles local epenences an resources Ajusts the scheule for loop-carrie epenences 11

61 Just-In-Time Software Pipelining Prepartition Local scheuling Kernel expansion Exit H, B Quickly creates an initial scheule to start with Hanles local epenences an resources Ajusts the scheule for loop-carrie epenences 11

62 Just-In-Time Software Pipelining Prepartition Local scheuling Kernel expansion Exit H, B Rotation 3 Quickly creates an initial scheule to start with Hanles local epenences an resources Ajusts the scheule for loop-carrie epenences Iteratively improves the scheule 11

63 Just-In-Time Software Pipelining Prepartition Local scheuling Kernel expansion Exit H, B Rotation 3 Time complexity: O(V+E) V: #operations E: #epenences 11 Quickly creates an initial scheule to start with Hanles local epenences an resources Ajusts the scheule for loop-carrie epenences Iteratively improves the scheule

64 Prepartition Iteration Local scheuling Kernel expansion Exit Rotation Local epenences Resources Loop-carrie epenences Illustration 12

65 Prepartition Local scheuling Kernel expansion abc RecMII Iteration Exit Rotation Local epenences Resources Loop-carrie epenences Illustration 12

66 Prepartition Local scheuling abc RecMII Iteration Kernel expansion abc Kernel Exit Rotation abc Local epenences Resources Loop-carrie epenences abc Illustration 12

67 Calculating RecMII Recurrence Minimum II II etermine by the biggest epenence cycles Neee by almost every software pipelining metho 13

68 Conventional Calculating RecMII Recurrence Minimum II II etermine by the biggest epenence cycles Neee by almost every software pipelining metho O(V 3 ) 13

69 Calculating RecMII Recurrence Minimum II II etermine by the biggest epenence cycles Neee by almost every software pipelining metho Conventional O(V 3 ) Turn the problem into a Markov ecision process 1 st time Howar algorithm is applie to pipelining 13

70 Calculating RecMII Recurrence Minimum II II etermine by the biggest epenence cycles Neee by almost every software pipelining metho Conventional O(V 3 ) Howar policy iteration algo. O(exponential*E) Turn the problem into a Markov ecision process 1 st time Howar algorithm is applie to pipelining 13

71 Calculating RecMII Recurrence Minimum II II etermine by the biggest epenence cycles Neee by almost every software pipelining metho Conventional Howar policy iteration algo. Linearize Howar O(V 3 ) O(exponential*E) O( H* E) Turn the problem into a Markov ecision process 1 st time Howar algorithm is applie to pipelining Linearize Howar with a small constant H 13

72 Calculating RecMII Recurrence Minimum II II etermine by the biggest epenence cycles Neee by almost every software pipelining metho Conventional O(V 3 ) Howar policy iteration algo. O(exponential*E) Linearize Howar O( E) Turn the problem into a Markov ecision process 1 st time Howar algorithm is applie to pipelining Linearize Howar with a small constant H 13

73 Calculating RecMII Recurrence Minimum II II etermine by the biggest epenence cycles Neee by almost every software pipelining metho Conventional O(V 3 ) Howar policy iteration algo. O(exponential*E) Linearize Howar O( E) Turn the problem into a Markov ecision process 1 st time Howar algorithm is applie to pipelining Linearize Howar with a small constant H A by-prouct: critical 13 operations

74 Divie operations into stages Bellman-For O(V*E) 14

75 Divie operations into stages Bellman-For O(V*E) Also an iterative sub-algorithm 14

76 Divie operations into stages Bellman-For Linearize Bellman-For O(V*E) O( E) B* Also an iterative sub-algorithm Linearize it with a constant B 14

77 Divie operations into stages Bellman-For O(V*E) Linearize Bellman-For O( E) B* Also an iterative sub-algorithm Linearize it with a constant B Specific orer in visiting eges Scan noes in sequential orer, an visit their incoming eges Values are propagate along local eges in the 1 st itr. Values are propagate along loop-carrie eges in the 2 n itr. 14

78 Divie operations into stages Bellman-For O(V*E) Linearize Bellman-For O( E) Also an iterative sub-algorithm Linearize it with a constant B Specific orer in visiting eges Scan noes in sequential orer, an visit their incoming eges Values are propagate along local eges in the 1 st itr. Values are propagate along loop-carrie eges in the 2 n itr. 14

79 Prepartition Local scheuling abc RecMII Iteration Kernel expansion abc Kernel Exit Rotation abc Local epenences Resources Loop-carrie epenences abc Illustration 15

80 Prepartition Local scheuling abc RecMII Iteration Kernel expansion abc Kernel Exit Rotation abc Local epenences Resources Loop-carrie epenences abc Illustration 15

81 Prepartition Local scheuling abc II Iteration Kernel expansion abc Kernel Exit Rotation Local epenences Resources Loop-carrie epenences Illustration (Cont.) 16 abc abc

82 X X Prepartition Local scheuling Kernel expansion Exit Rotation Local epenences Resources a Illustration (Cont.) b c a Loop-carrie epenences 16 b c II a b c Kernel a b c Iteration

83 Local scheuling Any local scheuling algorithm can be use E.g. list scheuling 17

84 Local scheuling Any local scheuling algorithm can be use E.g. list scheuling Weakness: Loop-carrie epenences may be violate 17

85 Local scheuling Any local scheuling algorithm can be use E.g. list scheuling Weakness: Loop-carrie epenences may be violate To reuce the chance of violation: Before scheuling, priority function consiers loopcarrie epenences in avance Prioritize critical operations 17

86 Prepartition Local scheuling Kernel expansion Exit Rotation Iteration a b c a b c II a Kernel X X Local epenences Resources Loop-carrie epenences Illustration (Cont.) 18 b c a b c

87 Prepartition Local scheuling Kernel expansion Exit Rotation Iteration a b c a b c II a Kernel X X Local epenences Resources Loop-carrie epenences Illustration (Cont.) 18 b c a b c

88 x x Prepartition Local scheuling Kernel expansion Exit Rotation Local epenences Resources a b c Illustration (Cont.) 19 Loop-carrie epenences a b c II a b c a b c Iteration

2 3 Iteration a b c Loop-carrie epenences

89 Prepartition Local scheuling Kernel expansion x x x Exit Rotation Local epenences Resources Iteration a b c Loop-carrie epenences Illustration (Cont.) 19 II a b c Kernel a b c a b c

90 Prepartition Local scheuling Kernel expansion x x x Exit Rotation Local epenences Resources Loop-carrie epenences Illustration (Cont.) Iteration a b c a b c a b c 20 a b c

Resources a0 1 2 3 b c Loop-carrie epenences

91 Prepartition Local scheuling Kernel expansion Exit Rotation Local epenences Resources a b c Loop-carrie epenences Illustration (Cont.) 20 Iteration a b c a Kernel b c a b c

92 Experiments Transmeta CMS on SPEC2k traces 21

93 Experiments Transmeta CMS on SPEC2k traces Functional simulator to comprehensively Explore threshols H an B Evaluate compile overhea an scheules quality 21

94 Experiments Transmeta CMS on SPEC2k traces Functional simulator to comprehensively Explore threshols H an B Evaluate compile overhea an scheules quality Cycle-accurate simulator Simulates cache misses, latencies, Initial performance stuy 21

95 Linearize Howar vs. exponential backoff + binary search [Rau et al. 1992] 22

96 1000 Linearize Howar vs. exponential backoff + binary search [Rau et al. 1992] #epenences

97 1000 Linearize Howar vs. exponential backoff + binary search [Rau et al. 1992] #epenences

98 Linearize Howar vs. exponential backoff + binary 1000 search [Rau et al. 1992] Average 9X faster Peak 812X faster H 3 3 for 96% of 11,992 loops #epenences

99 1000 Linearize Howar vs. exponential backoff + binary search [Rau et al. 1992] Average 9X faster Peak 812X faster H 3 3 for 96% of 11,992 loops H 14 for all loops 22 #epenences

100 Linearize Bellman-For 23

101 Linearize Bellman-For 100% % Total loops 50% 0% B 23

102 Linearize Bellman-For 100% % Total loops 50% 0% B B 3 for 98.8% of 11,992 loops 23

103 Linearize Bellman-For 100% % Total loops 50% 0% B B 3 for 98.8% of 11,992 loops 23

104 Linearize Bellman-For 100% % Total loops 50% 0% B B 3 for 98.8% of 11,992 loops B 5 for all the loops 23

105 Linearize Bellman-For 100% % Total loops 50% 0% B B 3 for 98.8% of 11,992 loops B 5 for all the loops From now on, we set H=10, B=5 11,910 loops scheule 23

106 Scheuling overhea & scheules quality Overhea % loops with optimal scheules 95% 1X RS2 DESP JITSP RS2 DESP JITSP 24

107 Scheuling overhea & scheules quality Overhea % loops with optimal scheules 95% 1X RS2 DESP JITSP RS2 DESP JITSP JITSP achieves optimal scheules for 95% loops 24

108 Scheuling overhea & scheules quality Overhea % loops with optimal scheules 12X 95% 1X RS2 DESP JITSP RS2 DESP JITSP JITSP achieves optimal scheules for 95% loops 24

109 Scheuling overhea & scheules quality Overhea % loops with optimal scheules 12X 13% 95% 1X RS2 DESP JITSP RS2 DESP JITSP JITSP achieves optimal scheules for 95% loops 24

110 Scheuling overhea & scheules quality Overhea % loops with optimal scheules 12X 13% 95% 25% 1X RS2 DESP JITSP RS2 DESP JITSP JITSP achieves optimal scheules for 95% loops 24

111 Scheuling overhea & scheules quality Overhea % loops with optimal scheules 12X 13% 23% 95% 25% 1X RS2 DESP JITSP RS2 DESP JITSP JITSP achieves optimal scheules for 95% loops 24

optimizations, 34% Note: acyclic scheuler hanles

112 Compile overhea istribution Acyclic scheuler, 27% Software pipelining, 33% Assembler, 6% Hot region optimizations, 34% Note: acyclic scheuler hanles acyclic coe, or loops NOT selecte for software pipelining 25

113 Preliminary performance 40 hot loops 26

114 Preliminary performance too many unrolls 1% Successfully generate coe 25% too many registers 55% filtere 19% 40 hot loops 26

115 Preliminary performance too many unrolls 1% Successfully generate coe 25% too many registers 55% filtere 19% 40 hot loops 26

116 Preliminary performance too many unrolls 1% Successfully generate coe 25% too many registers 55% filtere 19% 40 hot loops The architecture has bottleneck in registers 2/7/24/32 preicate/static alias/integer/floating point available for pipelining 26

117 Preliminary performance too many unrolls 1% Successfully generate coe 25% too many registers 55% filtere 19% 40 hot loops The architecture has bottleneck in registers 2/7/24/32 preicate/static alias/integer/floating point available for pipelining 26

118 Preliminary performance too many unrolls 1% Successfully generate coe 25% too many registers 55% filtere 19% 40 hot loops The architecture has bottleneck in registers 2/7/24/32 preicate/static alias/integer/floating point available for pipelining 26

119 Preliminary performance (Cont.) 2.00 II/MII (The lower, the better) 1.50 Optimal RS2 DESP JITSP 27

120 Preliminary performance (Cont.) 2.00 II/MII (The lower, the better) 1.50 Optimal RS2 DESP JITSP JITSP achieves optimal scheules for all but 1 loops 27

121 Preliminary performance (Cont.) 40% 20% 0% -20% Speeup RS2 DESP JITSP -40% JITSP 10~36% speeup. Better than the others 28

122 Preliminary performance (Cont.) 40% 20% 0% -20% Speeup RS2 DESP JITSP -40% JITSP 10~36% speeup. Better than the others Exception: loop 2. Optimal scheule but slowown ue to memory aliases retranslation neee 28

123 Preliminary performance (Cont.) 40% Speeup 20% 0% -20% RS2 DESP JITSP -40% JITSP 10~36% speeup. Better than the others Exception: loop 2. Optimal scheule but slowown ue to memory aliases retranslation neee Speeup swim(5.3%), ammp(4.4%), 28 mcf(3.6%)

124 Conclusion 29

125 Conclusion 1st linear software pipelining algorithm implemente for ynamic compilers 29

126 Conclusion 1st linear software pipelining algorithm implemente for ynamic compilers Turns a traitionally-expensive optimization into linear time O(V+E) 29

127 Conclusion 1st linear software pipelining algorithm implemente for ynamic compilers Turns a traitionally-expensive optimization into linear time O(V+E) Taking avantages of results from various omains: harware circuit esign (Retiming) stochastic control (Howar algorithm) Graph (Bellman-For) software pipelining (Rotation scheuling an DESP) 29

128 Conclusion 1st linear software pipelining algorithm implemente for ynamic compilers Turns a traitionally-expensive optimization into linear time O(V+E) Taking avantages of results from various omains: harware circuit esign (Retiming) stochastic control (Howar algorithm) Graph (Bellman-For) software pipelining (Rotation scheuling an DESP) Generates optimal or near-optimal scheules with reasonable compile overhea 29

129 Future work Register availability A more architecture registers Algorithm: Register pressure-aware Implementation Loop selection Re-translation Evaluation Benchmarks 30

130 Backup slies 31

131 Howar algorithm Each operation is reware to reach a cycle via 1 policy ege The bigger the cycle, the more the rewar 32

132 Howar algorithm Each operation is reware to reach a cycle via 1 policy ege The bigger the cycle, the more the rewar a c b 32

133 Howar algorithm Each operation is reware to reach a cycle via 1 policy ege The bigger the cycle, the more the rewar a c b 32

134 Howar algorithm Each operation is reware to reach a cycle via 1 policy ege The bigger the cycle, the more the rewar a c b 32

135 Distribution statistics of JITSP 33

Chapter 9 Memory Management

Chapter 9 Memory Management Contents 1. Introuction 2. Computer-System Structures 3. Operating-System Structures 4. Processes 5. Threas 6. CPU Scheuling 7. Process Synchronization 8. Dealocks 9. Memory Management 10.Virtual Memory