Dynamic Performance Tuning for Speculative Threads

Size: px

Start display at page:

Download "Dynamic Performance Tuning for Speculative Threads"

Myles Parks
5 years ago
Views:

1 Dynamic Performance Tuning for Speculative Threads Yangchun Luo, Venkatesan Packirisamy, Nikhil Mungre, Ankit Tarkas, Wei-Chung Hsu, and Antonia Zhai Dept. of Computer Science and Engineering Dept. of Electronic Computer Engineering University of Minnesota Twin Cities

2 Motivation Multicore processors brought forth large potentials of computing power Intel Core i7 die photo Exploiting these potentials demands thread-level parallelism 2

3 Time Time Time Exploiting Thread-Level Parallelism Sequential Traditional Parallelization Store *p Load *q Store *p p!= q?? Load *q Compiler gives up Thread-Level Speculation (TLS) p!= q p == q Load 20 Load 88 3 Store 88 Parallel execution Store 88 Speculation Failure Potentially more parallelism with speculation But Unpredictable Load 88

4 Addressing Unpredictability State-of-the-art approach: Profiling-directed optimization Compiler analysis A typical compilation framework Profiling run Source code Region selection Profiling info Parallel code generation Speculative code optimization Machine code 4

5 Time Problem #1 0 spawning thread allow commit signal wait load imbalance inter-thread synchronization 3 3 speculation failure Hard to model TLS overhead statically, even with profiling 5

6 Problem #2 Training Input Actual Input Profiling Behavior Actual Behavior Profile-based decision may not be appropriate for actual input 6

7 Metric Problem #3 phase1 phase2 phase3 Time Behavior of the same code changes over time 7

8 Load 0x22 Extra misses Problem #4 Sequential Parallel P P P L1$ L1$ L1$ miss Load 0x22 miss Load 0x22 miss Load 0x22 hit Prefetching Pollution P L1$ P L1$ P L1$ P L1$ Load *p=0x88 Load *p =0x88 Load *p=0x80 Load *p =0x08 How important are cache impacts? 8

8 Cache Squash Other load *p=0x88 0.6 0.4 2.5X load *p =0x88 0.

9 Normalized Execution Time Impact on Cache Performance A major loop from benchmark art (scanner.c:594) Cache Squash Other load *p=0x X load *p =0x % Cache: p == p Prefetched (+) 0 SEQ TLS-4core Cache impact is difficult to model statically 9

10 Our proposal Proposed Performance Tuning Framework Compiler Analysis Source code Region selection Accurate Timing & Cache Impacts Runtime Information Runtime Analysis Performance Evaluation Adaptations for inputs and phase changes Profiling info Parallel code generation Speculative code optimization Dynamic Execution Performance Profile Table Machine code Compiler Annotation Tuning Policy 10

11 Parallelization Decisions at Runtime Thread spawning End of a parallel execution Performance Profile Table parallelize serialize spawn spawn spawn spawn Parallel Execution Parallel execution Sequential execution Evaluate performance New execution model facilitates dynamic tuning 11

12 Normalized Execution Time Evaluating Performance Impact Runtime Analysis Runtime Information Performance Evaluation Dynamic Execution Performance Profile Table Cache Squash Other 0.4 Tuning Policy 0.2 Simple metric: cost of speculative failure 0 SEQ TLS-4core Comprehensive evaluation: sequential performance prediction 12

13 Sequential Performance Prediction Hardware Counters Busy ifetch ExeStall Cache Performance Impact Cache Programmable Cycle Classifier P Information from ROB P Speculative Parallel TLS overhead Predict ExeStall ifetch Busy Predicted sequential 13

14 Cache Behavior: Scenario #1 TLS T0 load p (unmodified) miss Squash SEQ T0 load p miss Predicted T0 load p hit T0 load p hit Count the miss in squashed execution 14

15 Cache Behavior: Scenario #2 TLS SEQ Predicted load p store p invalidate p T0 load p T0 T0 T0 miss miss load p store p miss load p store p miss*2 Not count a miss if tag matched invalid status 15

16 Tune the performance Runtime Analysis Runtime Information Performance Evaluation Dynamic Execution Performance Profile Table Tuning Policy Exhaustive search Prioritized search 16

suggestion Outside-In Performance summary metric: Speedup or not?

17 Performance Summary metric Speedup or not? Improve. in cycles How to prioritize the search Search order: Inside-Out Compiler suggestion Outside-In Performance summary metric: Speedup or not? Improvement in cycles Suboptimal Quantitative Quantitative +StaticHint Simple Inside-Out Search order Compiler suggestion 17

18 Evaluation Infrastructure Benchmarks SPEC2000 written in C, -O3 optimization Underlying architecture 4-core, chip-multiprocessor (CMP) P P P P speculation supported by coherence C C C C Simulator Superscalar with detailed memory model simulates communication latency models bandwidth and contention Interconnect C Detailed, cycle-accurate simulation 18

19 Speedup wrt. SEQ Comparing Tuning Policies Simple Quantitative Quantitative+StaticHint 1.37x 1.23x x Parallel Code Overhead 19

(Quant+StaticHint)/dynamicSEQ 1.27x 1.25x 1.46x 1.5 1 0.

20 Speedup wrt. SEQ Profile-Based vs. Dynamic ProfileBased ProfileBased/ProfileBasedSEQ Quant+StaticHint (Quant+StaticHint)/dynamicSEQ 1.27x 1.25x 1.46x Dynamic outperformed static by 10% (1.37x/1.25x) Potentially go up to 15% (1.46x/1.27x) 20

Related works: Region Selection Statically Compiler Heuristics [Vijaykumar

06] (POSH) Extensive Profiling [Wang LCPC 05] Dynamically Runtime Loop

21 Related works: Region Selection Statically Compiler Heuristics [Vijaykumar Micro 98] Balanced Min-Cut [Johnson PLDI 04] Simple Profiling Pass [Liu PPoPP 06] (POSH) Extensive Profiling [Wang LCPC 05] Dynamically Runtime Loop Detection [Tubella HPCA 98] [Marcuello ICS 98] Saving Power from Useless Respawning [Renau ICS 05] Profile-Based Search [Johnson PPoPP 07] Dynamic Performance Tuning [Luo ISCA 09] (this work) Lack of adaptability Lack of high-level information 21

22 Conclusions Deciding how to extract speculative threads at runtime Quantitatively summarize performance profile Compiler hints avoid suboptimal decisions Compiler and hardware work together. 37% speedup over sequential execution 10% speedup over profile-based decisions Configurable HPMs enable efficient dynamic optimizations Dynamic performance tuning can improve TLS efficiency 22

Supporting Speculative Multithreading on Simultaneous Multithreaded Processors

Supporting Speculative Multithreading on Simultaneous Multithreaded Processors Venkatesan Packirisamy, Shengyue Wang, Antonia Zhai, Wei-Chung Hsu, and Pen-Chung Yew Department of Computer Science, University