Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota

Size: px

Start display at page:

Download "Shengyue Wang, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota"

Ashlie Tyler
5 years ago
Views:

1 Loop Selection for Thread-Level Speculation, Xiaoru Dai, Kiran S. Yellajyosula, Antonia Zhai, Pen-Chung Yew Department of Computer Science & Engineering University of Minnesota

2 Chip Multiprocessors (CMPs) IBM Power5 CMPs: IBM Power5 Sun Niagara Intel dual-core Xeon AMD dual-core Opteron Proc Proc Proc Cache Improve program performance with parallel threads

3 Thread-Level Speculation (TLS) Automatic parallelization is difficult Ambiguous data dependences Complex control flow TLS facilitates automatic parallelization by: Executing potentially dependent threads in parallel Preserving data dependences via runtime checking Where do we find speculative parallel threads?

4 Parallelizing Loops under TLS Loops are good candidates for parallelism Regular structure Significant coverage on dynamic execution time General purpose applications are complicated Facts about SPECINT 2000 Average number of loops: 714 Average dynamic loop nesting: 8 Loop selection: which loops should be parallelized?

5 Potential of Loop Selection Outer loop Inner loop Best 100% program speedup 80% 60% 40% 20% 0% -20% -40% mcf crafty twolf gzip bzip2 vortex vpr parser gap gcc perlbmk Carefully selected loops can improve performance significantly!

6 Outline Loop selection Algorithm Parallel performance prediction Dynamic loop behavior Conclusions

7 Loop Nesting main( ) { while ( condition1 ) { while ( condition2 ) { foo( ); goo( ); foo( ) { while ( condition3 ) { goo( ); goo( ) { while ( condition4 ) { Source code foo_loop1 main_loop1 main_loop2 goo_loop1 Loop graph : static loop : nesting relationship

8 Benefit of Parallelizing a Single Loop 13% main_loop1 Coverage Loop Speedup Benefit 5% foo_loop1 main_loop2 goo_loop1 20% 18% 80% % 70% % 30% 1.2 5% 50% % benefit = % program execution time saved = coverage (1 1 / loop speedup) Program speedup = 1 / (1 - benefit) = 1.25

9 Loop Selection: Problem Definition Goal: Select the set of loops that maximizes the overall program performance when parallelized Constraint: The set cannot contain loops with nesting relationship Loop selection is NP-complete! Weighted maximum independent set

10 Loop Selection: Algorithm Exhaustive search ( 50 nodes) Try all possible combinations of loops Greedy algorithm (> 50 nodes) In descending order of benefit Check for nesting relation Add the loop to the set if no nesting Average number of loops for SPECINT 2000: 714

11 Loop Pruning loop1 loop2 loop3 loop4 loop5 loop6 loop7 loop8 Only resort to greedy algorithm for gcc and parser

12 Benefit of Parallelizing a Single Loop 13% main_loop1 Coverage Speedup Benefit 5% foo_loop1 main_loop2 20% 80% % 70% % 30% 1.2 5% 50% % 18% goo_loop1 Loop graph How can we estimate the speedup?

13 Outline Loop selection Algorithm Parallel performance prediction Dynamic loop behavior Conclusions

14 Estimating Parallel Performance Communicating value between speculative threads adds significant overhead to parallel execution Synchronization: Resolves frequently occurring data dependences Speculation: Resolves infrequently occurring data dependences Estimating communication costs with the compiler

15 Cost of Mis-speculation T1 T2 load Amount of work wasted store Cost of mis-speculation = amount of work wasted prob. of mis-speculation

16 Cost of Mis-speculation T1 T2 Sequential part store load Amount of work wasted Cost of mis-speculation = amount of work wasted prob. of mis-speculation

17 Synchronization T1 T2 store load Synchronization serializes parallel execution

18 Cost of Synchronization Est. I Est. II Est. III T1 T2 T1 T2 T1 T2 store2 load1 store1 load2 store1 load1 store1 load1 Synchronization Cost = # of dependent instructions Synchronization Cost = longest stall Synchronization Cost = longest stall based on dependent instructions

19 Experimental Framework Machine model 4 single-issue in-order processors Private L1 data cache (32K, 2-way, 1 cycle) Shared L2 data cache (2M, 4-way, 10 cycles) Speculation support (write buffer, address buffer) Synchronization support (comm. buffer, 10 cycles) Compiler optimizations using ORC 2.1 Instruction scheduling to improve parallelism

20 Comparison: Speedup Estimation Techniques program speedup 100% 80% 60% 40% 20% 0% -20% mcf crafty twolf gzip bzip2 vortex vpr parser gap gcc perlbmk Est. I Est. II Est. III Perfect coverage -40% 100% Average program speedup: 20%, coverage: 70% 80% 60% 40% 40% 20% 20% 0% 0% Shengyue mcf Wangcrafty twolf gzip bzip2 vortex vpr parser gap gcc perlbmk mcf crafty twolf gzip bzip2 vortex vpr parser gap gcc perlbmk Est. I I Est. Est. II II Est. III Est. III Perf ect Perf ect

21 Outline Loop selection Algorithm Parallel performance prediction Dynamic loop behavior Conclusions

22 Loop Behavior May Change main( ) { while ( condition1 ) { while ( condition2 ) { foo( ); goo( ); foo( ) { while ( condition3 ) { goo( ); goo( ) { while ( condition4 ) { Source code foo_loop1 goo_loop1_a main_loop1 main_loop2 Loop tree goo_loop1_b Calling context of a loop: the path from the root to that loop

23 Loop Selection in a Tree 13% main_loop1 20% main_loop2 5% foo_loop1 goo_loop1 is parallelized only when it is reached from main_loop2-2% 18% goo_loop1_a goo_loop1_b Loop cloning can be used

24 Loop Behavior May Change main_loop1 foo_loop1 foo_loop1 only parallelize the loop in these invocations goo_loop1_a goo_loop1_b Exploit loop behavior dynamically

25 Potential of Exploiting Dynamic Behavior 100% 80% 60% No Context Calling Context Oracle 40% 20% mcf mcf crafty crafty 0% twolf twolf gzip gzip bzip2 bzip2 vortex vortex vpr vpr parser parser gap gap gcc gcc perlbmk perlbmk 100% 80% 60% 5 out of 11 benchmarks show performance potential No Context Calling Context Oracle 40% 20% 0% mcf mcf crafty crafty twolf twolf gzip gzip bzip2 bzip2 vortex vortex vpr vpr parser parser gap gap gcc gcc perlbmk perlbmk coverage program speedup

26 Conclusions Loop selection is important for TLS Compiler-based loop selection Speedup 20%, Coverage 70% Exploiting dynamic behavior offers performance potential

27 Thank You!

POSH: A TLS Compiler that Exploits Program Structure

POSH: A TLS Compiler that Exploits Program Structure Wei Liu, James Tuck, Luis Ceze, Wonsun Ahn, Karin Strauss, Jose Renau and Josep Torrellas Department of Computer Science University of Illinois at Urbana-Champaign