Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology)


1 Exploiting Incorrectly Speculated Memory Operations in a Concurrent Multithreaded Architecture (Plus a Few Thoughts on Simulation Methodology) David J. Lilja, lilja@ece.umn.edu

2 Acknowledgements
- Graduate students (who did the real work): Ying Chen, Resit Sendag, Joshua Yi
- Faculty collaborator: Douglas Hawkins (School of Statistics)
- Funders: National Science Foundation, IBM, HP/Compaq, Minnesota Supercomputing Institute

3 Problem #1
- Speculative execution is becoming more popular
  - Branch prediction
  - Value prediction
  - Speculative multithreading
- Potentially higher performance
- What about the impact on the memory system?
  - Pollute the cache/memory hierarchy?
  - Lead to more misses?

4 Problem #2
- Computer architecture research relies on simulation
- Simulation is slow
  - Years to simulate the SPEC CPU2000 benchmarks
- Simulation can be wildly inaccurate
  - Did I really mean to build that system?
- Results are difficult to reproduce
- Need statistical rigor

5 Outline (Part 1)
- The Superthreaded Architecture
- The Wrong Execution Cache (WEC)
- Experimental Methodology
- Performance of the WEC [Chen, Sendag, Lilja, IPDPS, 2003]

6 Hard-to-Parallelize Applications
- Early exit loops
- Pointers and aliases
- Complex branching behaviors
- Small basic blocks
- Small loop counts
Hard to parallelize with conventional techniques

7 Introduce "Maybe" Dependences
- Data dependence? Pointer aliasing?
  - Yes
  - No
  - Maybe
- Maybe allows aggressive compiler optimizations
  - When in doubt, parallelize
- Run-time check to correct wrong assumptions

8 Thread Pipelining Execution Model
Each thread unit executes four pipelined stages:
- CONTINUATION: values needed to fork the next thread
- TARGET STORE: forward addresses of maybe dependences
- COMPUTATION: forward addresses and computed data as needed
- WRITE-BACK
Successive threads i, i+1, and i+2 overlap: each thread forks its successor after its CONTINUATION stage and syncs with it as addresses and data are forwarded.

9 The Superthreaded Architecture
[Figure: multiple thread units, each a superscalar core with its own registers, PC, execution unit, and communication/dependence buffer; the units share an instruction cache and a data cache, with the dependence buffers linked by a communication ring]

10 Wrong Path Execution Within a Superscalar Core
[Figure: loads Ld A and Ld B execute speculatively along the predicted path; when the prediction result is wrong, execution resumes on the correct path (CP), while loads Ld C, Ld D, and Ld E issued down the wrong path (WP) constitute wrong path execution; some wrong-path loads are never ready to be executed]

11 Wrong Thread Execution
[Figure: sequential regions alternate with parallel regions of thread units TU0-TU3; on an ABORT, the successor threads are marked as wrong threads (WTH); when the next parallel region begins, all wrong threads from the previous parallel region are killed, and a wrong thread can also kill itself]

12 How Could Wrong Thread Execution Help Improve Performance?

for (i = 0; i < 10; i++) {
    for (j = 0; j < i; j++) {
        x = y[j];
    }
}

With the inner loop parallelized across four thread units (TU1-TU4):
- When i = 4, the correct iterations reference y[0]..y[3], while the wrong threads run ahead and touch y[4], y[5]
- When i = 5, the iterations reference y[0]..y[4], data the wrong threads have already brought close to the processor, while the new wrong threads touch y[5], y[6]

13 Operation of the WEC
- Wrong execution (wrong thread or wrong path), L1 data cache miss:
  - WEC hit: update the LRU info for the WEC
  - WEC miss: bring the block from the next level of memory into the WEC
- Correct execution, L1 data cache miss:
  - WEC hit: swap the victim block and the WEC block
  - WEC miss: bring the block from the next level of memory into the L1 data cache, put the victim block into the WEC, and prefetch the next line into the WEC
- L1 data cache hit: update the LRU info for the L1 data cache
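The decision flow above can be sketched as a small procedure. This is a minimal Python model, not the hardware: it assumes unit-sized blocks, ignores LRU bookkeeping and the next-line prefetch, and `l1`, `wec`, and `next_level_fetch` are hypothetical stand-ins for the real structures.

```python
def handle_load(addr, wrong_execution, l1, wec, next_level_fetch):
    """Simplified WEC decision flow: wrong-execution misses fill the WEC,
    never the L1 data cache, so mis-speculated loads cannot pollute it."""
    if addr in l1:                          # L1 data cache hit
        return "l1_hit"
    if wrong_execution:                     # wrong-path or wrong-thread load
        if addr in wec:
            return "wec_hit"                # (real WEC would update LRU info)
        wec.add(next_level_fetch(addr))     # bring the block into the WEC only
        return "wec_fill"
    if addr in wec:                         # correct execution, WEC hit:
        wec.discard(addr)                   # swap the block into the L1
        l1.add(addr)
        return "wec_swap"
    l1.add(next_level_fetch(addr))          # normal fill; a real WEC would also
    return "l1_fill"                        # take the L1 victim and prefetch
```

The point of the structure is visible in the two branches: a wrong-execution fill later turns into a cheap `wec_swap` when correct execution touches the same block.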

14 Processor Configurations for Simulations
SIMCA (the SIMulator for the superthreaded Architecture). Simulated configurations:
- orig: baseline
- vc: baseline + victim cache
- wth-wp-vc: wrong thread + wrong path + victim cache
- wth-wp-wec: wrong thread + wrong path + wrong execution cache, with prefetch into the WEC
- nlp: baseline + next-line prefetch

15 Parameters for Each Thread Unit
- Issue rate: 8 instructions/cycle per thread unit
- Branch target buffer: 4-way associative, 1024 entries
- Speculative memory buffer: fully associative, 128 entries
- Round-trip memory latency: 200 cycles
- Fork delay: 4 cycles
- Unidirectional communication ring: 2 requests/cycle bandwidth
- Load/store queue: 64 entries
- Reorder buffer: 64 entries
- INT ALUs, INT multiply/divide units: 8, 4
- FP adders, FP multiply/divide units: 4, 4
- WEC: 8 entries (same block size as the L1 cache), fully associative
- L1 data cache: distributed, 8 KB, 2-way associative, 64-byte blocks
- L1 instruction caches: distributed, 32 KB, 2-way associative, 64-byte blocks
- L2 cache: unified, 512 KB, 4-way associative, 128-byte blocks

16 Characteristics of the Parallelized SPEC 2000 Benchmarks
- 175.vpr (INT, SPEC test input): 86% parallelized
- 164.gzip (INT, MinneSPEC large): 57% parallelized
- 181.mcf (INT, MinneSPEC large): 36% parallelized
- 197.parser (INT, MinneSPEC medium): 7% parallelized
- 183.equake (FP, MinneSPEC large): 3% parallelized
- 177.mesa (FP, SPEC test): 73% parallelized
Transformations applied per benchmark: loop coalescing, loop unrolling, and statement reordering to increase overlap

17 Performance of the Superthreaded Architecture for the Parallelized Portions of the Benchmarks
[Chart and table: speedup vs. number of thread units (1, 2, 4, 8, 16) for 175.vpr, 164.gzip, 181.mcf, 183.equake, 197.parser, and 177.mesa, plus the average; the table lists the issue rate, reorder buffer size, INT/FP function units, and L1 data cache size (KB) of the baseline configuration at each TU count]

18 Performance of the wth-wp-wec Configuration on Top of the Parallel Execution
[Chart: relative speedup (%) of the 2TU, 4TU, 8TU, and 16TU wth-wp-wec configurations over the corresponding orig configurations, for each benchmark and the average]

19 Performance Improvements Due to the WEC
[Chart: relative speedup (%) of the vc, wth-wp-vc, wth-wp-wec, and nlp configurations for 175.vpr, 164.gzip, 181.mcf, 197.parser, 183.equake, 177.mesa, and the average]

20 Sensitivity to L1 Data Cache Size
[Chart: relative speedup (%) of wth-wp-wec over orig across L1 data cache sizes, for each benchmark and the average]

21 Sensitivity to WEC Size Compared to a Victim Cache
[Chart: relative speedup (%) of wth-wp-vc and wth-wp-wec at two cache sizes, for each benchmark and the average]

22 Sensitivity to WEC Size Compared to Next-Line Prefetching (NLP)
[Chart: relative speedup (%) of nlp and wth-wp-wec at two buffer sizes, for each benchmark and the average]

23 Additional Loads and Reduction of Misses
[Chart: percentage of additional loads and percentage reduction of misses for each benchmark and the average]

24 Conclusions for the WEC
- Allow loads to continue executing even after they are known to be incorrectly issued
  - Do not let them change state
- 45.5% average reduction in the number of misses
  - 9.7% average improvement on top of the parallel execution
  - 4% average improvement over a victim cache
  - 5.6% average improvement over next-line prefetching
- Cost
  - 4% additional loads
  - Minor hardware complexity

25 Typical Computer Architecture Study
1. Find an interesting problem/performance bottleneck
   - E.g., memory delays
2. Invent a clever idea for solving it
   - This is the hard part
3. Implement the idea in a processor/system simulator
   - This is the part grad students usually like best
4. Run simulations on n standard benchmark programs
   - This is time-consuming and boring
5. Compare performance with and without your change
   - Execution time, clocks per instruction (CPI), etc.

26 Problem #2 Simulation in Computer Architecture Research
- Simulators are an important tool for computer architecture research and design
  - Low cost
  - Faster than building a new system
  - Very flexible

27 Performance Evaluation Techniques Used in ISCA Papers
[Chart: fraction of papers (0-100%) using each technique: measurement, modeling, SimpleScalar simulation, other simulators, other. Some papers used more than one evaluation technique]

28 Simulation is Very Popular, But...
- Current simulation methodology is not
  - Formal
  - Rigorous
  - Statistically based
- Never enough simulations
  - Design a new processor based on a few seconds of actual execution time
- What are benchmark programs really exercising?

29 An Example: Sensitivity Analysis
- Which parameters should be varied? Which held fixed?
- What range of values should be used for each variable parameter?
- What values should be used for the constant parameters?
- Are there interactions between the variable and fixed parameters?
- What is the magnitude of those interactions?

30 Let's Introduce Some Statistical Rigor
- Decreases the number of errors
  - Modeling
  - Implementation
  - Set-up
  - Analysis
- Helps find errors more quickly
- Provides greater insight
  - Into the processor
  - Into the effects of an enhancement
- Provides objective confidence in results
- Provides statistical support for conclusions

31 Outline (Part 2)
- A statistical technique for
  - Examining the overall impact of an architectural change
  - Classifying benchmark programs
  - Ranking the importance of processor/simulation parameters
  - Reducing the total number of simulation runs
[Yi, Lilja, Hawkins, HPCA, 2003]

32 A Technique to Limit the Number of Simulations
- Plackett and Burman designs (1946)
  - Multifactorial designs
  - Originally proposed for mechanical assemblies
- Effects of main factors only
  - Logically minimal number of experiments to estimate the effects of m input parameters (factors)
  - Ignores interactions
- Requires O(m) experiments
  - Instead of O(2^m) or O(v^m)

33 Plackett and Burman Designs
- PB designs exist only in sizes that are multiples of 4
- Requires X experiments for m parameters
  - X = next multiple of 4 > m
- PB design matrix
  - Rows = configurations; columns = parameters' values in each configuration
  - High/low = +1/-1
  - First row = from the P&B paper
  - Subsequent rows = circular right shift of the preceding row
  - Last row = all (-1)

34-39 PB Design Matrix
[Worked example, built up over several slides: an 8-configuration design matrix for seven input parameters (factors) A-G, with +1/-1 entries, a measured response for each configuration, and the computed effect of each parameter (e.g., 65 for one parameter, -45 for another)]

40 PB Design
- Only the magnitude of an effect is important
  - The sign is meaningless
- In the example, the most to least important effects:
  - [C, D, E], F, G, A, B
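The matrix construction and effect computation above can be made concrete. A short Python sketch: the generator row is the standard one usually cited for the N = 8 Plackett-Burman design, and the response values here are synthetic, chosen so that parameter A dominates and B matters a little.

```python
def pb_design():
    """N=8 Plackett-Burman design: circular right shifts of the generator
    row for the first seven configurations, plus a final all-low row."""
    gen = [+1, +1, +1, -1, +1, -1, -1]          # first row, per P&B (1946)
    rows = [[gen[(j - k) % 7] for j in range(7)] for k in range(7)]
    rows.append([-1] * 7)                       # last row = all (-1)
    return rows

def effects(design, response):
    """Effect of parameter j = sum over configs of (+/-1 entry) * response."""
    m = len(design[0])
    return [sum(row[j] * y for row, y in zip(design, response))
            for j in range(m)]

X = pb_design()
# synthetic responses: parameter A (column 0) dominates, B (column 1) is minor
y = [50 + 10 * row[0] + 2 * row[1] for row in X]
print(effects(X, y))   # only |effect| matters when ranking parameters
```

Because the eight rows form an orthogonal design (columns are mutually orthogonal and sum to zero), the computed effects isolate A and B exactly: the A and B columns pick up 80 and 16, and every other effect is 0.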

41 Case Study #1: Determine the most significant parameters in a processor simulator

42 Determine the Most Significant Processor Parameters
- Problem
  - So many parameters in a simulator
  - How to choose parameter values?
  - How to decide which parameters are most important?
- Approach
  - Choose reasonable upper/lower bounds
  - Rank parameters by impact on total execution time

43 Simulation Environment
- SimpleScalar simulator
  - sim-outorder 3.0
- Selected SPEC 2000 benchmarks
  - gzip, vpr, gcc, mesa, art, mcf, equake, parser, vortex, bzip2, twolf
- MinneSPEC reduced input sets
- Compiled with gcc (PISA) at -O3

44 Functional Unit Values
[Table: low and high values for each functional-unit parameter: the numbers of INT and FP ALUs and multiply/divide units, and the latency and throughput of each operation class (INT ALU, INT multiply, INT divide, FP ALU, FP multiply, FP divide, FP square root); the INT divide and FP multiply/divide/sqrt throughputs are set equal to the corresponding latencies]

45 Memory System Values, Part I
- L1 I-cache: size 4 KB to 128 KB, 1-way to 8-way associative, 16- to 64-byte blocks, LRU replacement, latency 4 cycles down to 1 cycle
- L1 D-cache: size 4 KB to 128 KB, 1-way to 8-way associative, 16- to 64-byte blocks, LRU replacement, latency 4 cycles down to 1 cycle
- L2 cache: size 256 KB to 8192 KB, 1-way to 8-way associative, 64- to 256-byte blocks

46 Memory System Values, Part II
- L2 cache: LRU replacement, latency 20 cycles down to 5 cycles
- Memory latency (first): 200 cycles down to 50 cycles; memory latency (next) = 0.02 * memory latency (first)
- Memory bandwidth: 4 to 32 bytes
- I-TLB: 32 to 256 entries, 4 KB to 4096 KB pages, 2-way to fully associative, latency 80 cycles down to 30 cycles
- D-TLB: 32 to 256 entries, page size same as the I-TLB, 2-way to fully associative, latency same as the I-TLB

47 Processor Core Values
- Fetch queue: 4 to 32 entries
- Branch predictor: 2-level to perfect
- Branch mispredict penalty: 10 cycles down to 2 cycles
- RAS: 4 to 64 entries
- BTB: 16 to 512 entries, 2-way to fully associative
- Speculative branch update: in commit to in decode
- Decode/issue width: 4-way
- ROB: 8 to 64 entries
- LSQ: 0.25 * ROB to 1.0 * ROB entries
- Memory ports: 1 to 4

48 Determining the Most Significant Parameters
1. Run simulations to find the response, with the input parameters set to their high/low (on/off) values according to the PB design matrix

49 Determining the Most Significant Parameters
2. Calculate the effect of each parameter across the configurations

50 Determining the Most Significant Parameters
3. For each benchmark, rank the parameters in descending order of effect (1 = most important)
[Table: ranks of parameters A, B, C, ... for each benchmark]

51 Determining the Most Significant Parameters
4. For each parameter, average its ranks across the benchmarks
[Table: per-benchmark ranks and the average rank for each parameter]

52 Most Significant Parameters
[Table: per-benchmark (gcc, gzip, art, ...) and average ranks for the top parameters, in order: ROB entries, L2 cache latency, branch predictor accuracy, number of integer ALUs, L1 D-cache latency, L1 I-cache size, L2 cache size, L1 I-cache block size, memory latency (first), LSQ entries, speculative branch update]

53 General Procedure
- Determine upper/lower bounds for the parameters
- Simulate the configurations to find each response
- Compute the effect of each parameter across the configurations
- Rank the parameters for each benchmark based on the effects
- Average the ranks across benchmarks
- Focus on the top-ranked parameters for subsequent analysis
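The ranking and averaging steps of this procedure reduce to a few lines of code. A minimal Python sketch with made-up effect values for three hypothetical benchmarks:

```python
def rank_parameters(effect_by_param):
    """Rank parameters by |effect|, descending (1 = most important);
    the sign of an effect is meaningless in a PB analysis."""
    order = sorted(effect_by_param, key=lambda p: -abs(effect_by_param[p]))
    return {p: i + 1 for i, p in enumerate(order)}

def average_ranks(per_benchmark_effects):
    """Average each parameter's rank across benchmarks (low = important)."""
    ranks = [rank_parameters(e) for e in per_benchmark_effects.values()]
    return {p: sum(r[p] for r in ranks) / len(ranks) for p in ranks[0]}

# hypothetical PB effects for three benchmarks
effects_data = {
    "bench1": {"A": 65, "B": -45, "C": 3},
    "bench2": {"A": -70, "B": 20, "C": 10},
    "bench3": {"A": 40, "B": -50, "C": 30},
}
avg = average_ranks(effects_data)   # A ends up most important overall
```

Averaging ranks rather than raw effects keeps one benchmark with huge absolute effects from drowning out the others.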

54 Case Study #2: Determine the big-picture impact of a system enhancement

55 Determining the Overall Effect of an Enhancement
- Problem:
  - Performance analysis is typically limited to single metrics
    - Speedup, power consumption, miss rate, etc.
  - Simple analysis discards a lot of good information

56 Determining the Overall Effect of an Enhancement
- Find the most important parameters without the enhancement
  - Using Plackett and Burman
- Find the most important parameters with the enhancement
  - Again using Plackett and Burman
- Compare the parameter ranks

57 Example: Instruction Precomputation
- Profile to find the most common operations (opcode plus operand values)
- Insert the results of those common operations into a table when the program is loaded into memory
- Query the table when an instruction is issued
- Don't execute the instruction if its result is already in the table
- Reduces contention for function units
[Yi, Sendag, Lilja, Euro-Par, 2002]
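The mechanism can be sketched in software. This is a Python model of the idea, not the hardware design: `build_table` stands in for the profiling pass, the dictionary stands in for the precomputation table queried at issue, and the trace and table size are hypothetical.

```python
from collections import Counter

OPS = {"add": lambda a, b: a + b, "mul": lambda a, b: a * b}

def build_table(profile_trace, size):
    """Preload the results of the most frequent (op, a, b) combinations."""
    common = Counter(profile_trace).most_common(size)
    return {(op, a, b): OPS[op](a, b) for (op, a, b), _ in common}

def execute(trace, table):
    """Skip the functional unit whenever the result is already tabled."""
    hits, results = 0, []
    for op, a, b in trace:
        if (op, a, b) in table:
            results.append(table[(op, a, b)])   # precomputed: no FU needed
            hits += 1
        else:
            results.append(OPS[op](a, b))       # execute normally
    return results, hits

trace = [("add", 0, 1)] * 5 + [("mul", 2, 2)] * 3 + [("add", 3, 4)]
table = build_table(trace, size=2)              # hypothetical 2-entry table
results, hits = execute(trace, table)           # 8 of the 9 ops hit the table
```

The hit count is what relieves function-unit contention: every hit is an instruction that never occupies an ALU.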

58 The Effect of Instruction Precomputation
[Table: the average rank of each top parameter (ROB entries, L2 cache latency, branch predictor accuracy, number of integer ALUs, L1 D-cache latency, L1 I-cache size, L2 cache size, L1 I-cache block size, memory latency (first), LSQ entries) before and after the enhancement, and the difference]


61 Case Study #3: Benchmark program classification

62 Benchmark Classification
- By application type
  - Scientific and engineering applications
  - Transaction processing applications
  - Multimedia applications
- By use of processor function units
  - Floating-point code
  - Integer code
  - Memory-intensive code
- Etc., etc.

63 Another Point of View
- Classify by overall impact on the processor
- Define: two benchmark programs are similar if they stress the same components of a system to similar degrees
- How to measure this similarity?
  - Use a Plackett and Burman design to find the ranks
  - Then compare the ranks

64 Similarity Metric
- Use the rank of each parameter as the elements of a vector
- For benchmark program X, let
  - X = (x1, x2, ..., x(n-1), xn)
  - x1 = rank of parameter 1
  - x2 = rank of parameter 2
  - ...

65 Vector Defines a Point in n-Space
[Figure: two points (x1, x2, x3) and (y1, y2, y3) plotted against axes Param #1, Param #2, Param #3, separated by distance D]

66 Similarity Metric
- Euclidean distance between points:
  D = [(x1 - y1)^2 + (x2 - y2)^2 + ... + (xn - yn)^2]^(1/2)

67 Most Significant Parameters
[Table repeated from slide 52: per-benchmark (gcc, gzip, art) and average ranks for ROB entries, L2 cache latency, branch predictor accuracy, number of integer ALUs, L1 D-cache latency, L1 I-cache size, L2 cache size, L1 I-cache block size, memory latency (first), LSQ entries, speculative branch update]

68 Distance Computation
- Rank vectors (one rank per parameter), e.g.:
  - gcc = (4, ..., 5, 8, ...)
  - gzip = (..., 4, ..., 3, ...)
  - art = (..., 4, 7, 9, ...)
- Euclidean distances, element by element:
  - D(gcc, gzip) = [(4 - ...)^2 + (... - 4)^2 + (5 - ...)^2 + ...]^(1/2)
  - D(gcc, art) and D(gzip, art) computed the same way
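The computation in code, using hypothetical rank vectors (the names match the slide's benchmarks, but these five-element ranks are invented for illustration):

```python
import math

def distance(x, y):
    """Euclidean distance between two parameter-rank vectors."""
    return math.sqrt(sum((xi - yi) ** 2 for xi, yi in zip(x, y)))

# hypothetical top-5 parameter ranks for three benchmarks
ranks = {
    "gcc":  (4, 1, 5, 8, 2),
    "gzip": (1, 4, 2, 3, 5),
    "art":  (2, 4, 7, 9, 1),
}
d_gcc_gzip = distance(ranks["gcc"], ranks["gzip"])
d_gcc_art = distance(ranks["gcc"], ranks["art"])
# the pair with the smaller distance stresses the system more similarly,
# so they belong in the same benchmark group
```

A dendrogram or grouping then falls out of clustering on this distance matrix.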

69 Euclidean Distances for Selected Benchmarks
[Table: pairwise Euclidean distances among gcc, gzip, art, and mcf; the diagonal entries are 0]

70 Dendrogram of Distances Showing (Dis-)Similarity

71 Final Benchmark Groupings
- Group I: gzip, mesa
- Group II: vpr-Place, twolf
- Group III: vpr-Route, parser, bzip2
- Group IV: gcc, vortex
- Group V: art
- Group VI: mcf
- Group VII: equake
- Group VIII: ammp

72 Conclusion
- Multifactorial (Plackett and Burman) design
  - Requires only O(m) experiments
  - Determines the effects of the main factors only
  - Ignores interactions
- Logically minimal number of experiments to estimate the effects of m input parameters
- Powerful technique for obtaining a big-picture view of a lot of simulation data

73 Conclusion
- Demonstrated for
  - Ranking the importance of simulation parameters
  - Finding the overall impact of a processor enhancement
  - Classifying benchmark programs
- Current work: comparing simulation strategies
  - Reduced input sets (e.g., MinneSPEC)
  - Sampling (e.g., SimPoint, statistical sampling)

74 Goals
- Develop/understand tools for interpreting large quantities of data
- Increase insights into processor design
- Improve rigor in computer architecture research


More information

Chip-Multithreading Systems Need A New Operating Systems Scheduler

Chip-Multithreading Systems Need A New Operating Systems Scheduler Chip-Multithreading Systems Need A New Operating Systems Scheduler Alexandra Fedorova Christopher Small Daniel Nussbaum Margo Seltzer Harvard University, Sun Microsystems Sun Microsystems Sun Microsystems

More information

EECC551 Exam Review 4 questions out of 6 questions

EECC551 Exam Review 4 questions out of 6 questions EECC551 Exam Review 4 questions out of 6 questions (Must answer first 2 questions and 2 from remaining 4) Instruction Dependencies and graphs In-order Floating Point/Multicycle Pipelining (quiz 2) Improving

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti

ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti ECE/CS 552: Pipelining to Superscalar Prof. Mikko Lipasti Lecture notes based in part on slides created by Mark Hill, David Wood, Guri Sohi, John Shen and Jim Smith Pipelining to Superscalar Forecast Real

More information

Techniques for Efficient Processing in Runahead Execution Engines

Techniques for Efficient Processing in Runahead Execution Engines Techniques for Efficient Processing in Runahead Execution Engines Onur Mutlu Hyesoon Kim Yale N. Patt Depment of Electrical and Computer Engineering University of Texas at Austin {onur,hyesoon,patt}@ece.utexas.edu

More information

CS 2410 Mid term (fall 2018)

CS 2410 Mid term (fall 2018) CS 2410 Mid term (fall 2018) Name: Question 1 (6+6+3=15 points): Consider two machines, the first being a 5-stage operating at 1ns clock and the second is a 12-stage operating at 0.7ns clock. Due to data

More information

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor.

As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Hiroaki Kobayashi // As the amount of ILP to exploit grows, control dependences rapidly become the limiting factor. Branches will arrive up to n times faster in an n-issue processor, and providing an instruction

More information

Keywords and Review Questions

Keywords and Review Questions Keywords and Review Questions lec1: Keywords: ISA, Moore s Law Q1. Who are the people credited for inventing transistor? Q2. In which year IC was invented and who was the inventor? Q3. What is ISA? Explain

More information

Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window

Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window Dual-Core Execution: Building a Highly Scalable Single-Thread Instruction Window Huiyang Zhou School of Computer Science, University of Central Florida zhou@cs.ucf.edu Abstract Current integration trends

More information

CS 654 Computer Architecture Summary. Peter Kemper

CS 654 Computer Architecture Summary. Peter Kemper CS 654 Computer Architecture Summary Peter Kemper Chapters in Hennessy & Patterson Ch 1: Fundamentals Ch 2: Instruction Level Parallelism Ch 3: Limits on ILP Ch 4: Multiprocessors & TLP Ap A: Pipelining

More information

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat

Security-Aware Processor Architecture Design. CS 6501 Fall 2018 Ashish Venkat Security-Aware Processor Architecture Design CS 6501 Fall 2018 Ashish Venkat Agenda Common Processor Performance Metrics Identifying and Analyzing Bottlenecks Benchmarking and Workload Selection Performance

More information

A Quick SimpleScalar Tutorial. [Prepared from SimpleScalar Tutorial v2, Todd Austin and Doug Burger.]

A Quick SimpleScalar Tutorial. [Prepared from SimpleScalar Tutorial v2, Todd Austin and Doug Burger.] A Quick SimpleScalar Tutorial [Prepared from SimpleScalar Tutorial v2, Todd Austin and Doug Burger.] What is SimpleScalar? SimpleScalar is a tool set for high-performance simulation and research of modern

More information

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors

An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors An Analysis of the Performance Impact of Wrong-Path Memory References on Out-of-Order and Runahead Execution Processors Onur Mutlu Hyesoon Kim David N. Armstrong Yale N. Patt High Performance Systems Group

More information

Instruction Recirculation: Eliminating Counting Logic in Wakeup-Free Schedulers

Instruction Recirculation: Eliminating Counting Logic in Wakeup-Free Schedulers Instruction Recirculation: Eliminating Counting Logic in Wakeup-Free Schedulers Joseph J. Sharkey, Dmitry V. Ponomarev Department of Computer Science State University of New York Binghamton, NY 13902 USA

More information

CS 614 COMPUTER ARCHITECTURE II FALL 2005

CS 614 COMPUTER ARCHITECTURE II FALL 2005 CS 614 COMPUTER ARCHITECTURE II FALL 2005 DUE : November 9, 2005 HOMEWORK III READ : - Portions of Chapters 5, 6, 7, 8, 9 and 14 of the Sima book and - Portions of Chapters 3, 4, Appendix A and Appendix

More information

Selective Fill Data Cache

Selective Fill Data Cache Selective Fill Data Cache Rice University ELEC525 Final Report Anuj Dharia, Paul Rodriguez, Ryan Verret Abstract Here we present an architecture for improving data cache miss rate. Our enhancement seeks

More information

Dynamically Controlled Resource Allocation in SMT Processors

Dynamically Controlled Resource Allocation in SMT Processors Dynamically Controlled Resource Allocation in SMT Processors Francisco J. Cazorla, Alex Ramirez, Mateo Valero Departament d Arquitectura de Computadors Universitat Politècnica de Catalunya Jordi Girona

More information

A Cost Effective Spatial Redundancy with Data-Path Partitioning. Shigeharu Matsusaka and Koji Inoue Fukuoka University Kyushu University/PREST

A Cost Effective Spatial Redundancy with Data-Path Partitioning. Shigeharu Matsusaka and Koji Inoue Fukuoka University Kyushu University/PREST A Cost Effective Spatial Redundancy with Data-Path Partitioning Shigeharu Matsusaka and Koji Inoue Fukuoka University Kyushu University/PREST 1 Outline Introduction Data-path Partitioning for a dependable

More information

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution

Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Wish Branch: A New Control Flow Instruction Combining Conditional Branching and Predicated Execution Hyesoon Kim Onur Mutlu Jared Stark David N. Armstrong Yale N. Patt High Performance Systems Group Department

More information

Reducing Miss Penalty: Read Priority over Write on Miss. Improving Cache Performance. Non-blocking Caches to reduce stalls on misses

Reducing Miss Penalty: Read Priority over Write on Miss. Improving Cache Performance. Non-blocking Caches to reduce stalls on misses Improving Cache Performance 1. Reduce the miss rate, 2. Reduce the miss penalty, or 3. Reduce the time to hit in the. Reducing Miss Penalty: Read Priority over Write on Miss Write buffers may offer RAW

More information

CS 426 Parallel Computing. Parallel Computing Platforms

CS 426 Parallel Computing. Parallel Computing Platforms CS 426 Parallel Computing Parallel Computing Platforms Ozcan Ozturk http://www.cs.bilkent.edu.tr/~ozturk/cs426/ Slides are adapted from ``Introduction to Parallel Computing'' Topic Overview Implicit Parallelism:

More information

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST

Chapter 4. Advanced Pipelining and Instruction-Level Parallelism. In-Cheol Park Dept. of EE, KAIST Chapter 4. Advanced Pipelining and Instruction-Level Parallelism In-Cheol Park Dept. of EE, KAIST Instruction-level parallelism Loop unrolling Dependence Data/ name / control dependence Loop level parallelism

More information

Dynamic Memory Dependence Predication

Dynamic Memory Dependence Predication Dynamic Memory Dependence Predication Zhaoxiang Jin and Soner Önder ISCA-2018, Los Angeles Background 1. Store instructions do not update the cache until they are retired (too late). 2. Store queue is

More information

CS433 Homework 2 (Chapter 3)

CS433 Homework 2 (Chapter 3) CS433 Homework 2 (Chapter 3) Assigned on 9/19/2017 Due in class on 10/5/2017 Instructions: 1. Please write your name and NetID clearly on the first page. 2. Refer to the course fact sheet for policies

More information

Evaluating Benchmark Subsetting Approaches

Evaluating Benchmark Subsetting Approaches Evaluating Benchmark Subsetting Approaches Joshua J. Yi, Resit Sendag, Lieven Eeckhout, Ajay Joshi, David J. Lilja, and Lizy K. John Networking and omputing Systems Group Freescale Semiconductor, Inc.

More information

Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures

Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Cluster Prefetch: Tolerating On-Chip Wire Delays in Clustered Microarchitectures Rajeev Balasubramonian School of Computing, University of Utah ABSTRACT The growing dominance of wire delays at future technology

More information

CS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines

CS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines CS152 Computer Architecture and Engineering VLIW, Vector, and Multithreaded Machines Assigned April 7 Problem Set #5 Due April 21 http://inst.eecs.berkeley.edu/~cs152/sp09 The problem sets are intended

More information

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved.

LRU. Pseudo LRU A B C D E F G H A B C D E F G H H H C. Copyright 2012, Elsevier Inc. All rights reserved. LRU A list to keep track of the order of access to every block in the set. The least recently used block is replaced (if needed). How many bits we need for that? 27 Pseudo LRU A B C D E F G H A B C D E

More information

Superscalar Organization

Superscalar Organization Superscalar Organization Nima Honarmand Instruction-Level Parallelism (ILP) Recall: Parallelism is the number of independent tasks available ILP is a measure of inter-dependencies between insns. Average

More information

Pipelining and Exploiting Instruction-Level Parallelism (ILP)

Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining and Exploiting Instruction-Level Parallelism (ILP) Pipelining and Instruction-Level Parallelism (ILP). Definition of basic instruction block Increasing Instruction-Level Parallelism (ILP) &

More information

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need??

Outline EEL 5764 Graduate Computer Architecture. Chapter 3 Limits to ILP and Simultaneous Multithreading. Overcoming Limits - What do we need?? Outline EEL 7 Graduate Computer Architecture Chapter 3 Limits to ILP and Simultaneous Multithreading! Limits to ILP! Thread Level Parallelism! Multithreading! Simultaneous Multithreading Ann Gordon-Ross

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

Pipelining to Superscalar

Pipelining to Superscalar Pipelining to Superscalar ECE/CS 752 Fall 207 Prof. Mikko H. Lipasti University of Wisconsin-Madison Pipelining to Superscalar Forecast Limits of pipelining The case for superscalar Instruction-level parallel

More information

The Predictability of Computations that Produce Unpredictable Outcomes

The Predictability of Computations that Produce Unpredictable Outcomes This is an update of the paper that appears in the Proceedings of the 5th Workshop on Multithreaded Execution, Architecture, and Compilation, pages 23-34, Austin TX, December, 2001. It includes minor text

More information

Hardware-Based Speculation

Hardware-Based Speculation Hardware-Based Speculation Execute instructions along predicted execution paths but only commit the results if prediction was correct Instruction commit: allowing an instruction to update the register

More information

ROB: head/tail. exercise: result of processing rest? 2. rename map (for next rename) log. phys. free list: X11, X3. PC log. reg prev.

ROB: head/tail. exercise: result of processing rest? 2. rename map (for next rename) log. phys. free list: X11, X3. PC log. reg prev. Exam Review 2 1 ROB: head/tail PC log. reg prev. phys. store? except? ready? A R3 X3 no none yes old tail B R1 X1 no none yes tail C R1 X6 no none yes D R4 X4 no none yes E --- --- yes none yes F --- ---

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

TDT 4260 lecture 7 spring semester 2015

TDT 4260 lecture 7 spring semester 2015 1 TDT 4260 lecture 7 spring semester 2015 Lasse Natvig, The CARD group Dept. of computer & information science NTNU 2 Lecture overview Repetition Superscalar processor (out-of-order) Dependencies/forwarding

More information

Floating Point/Multicycle Pipelining in DLX

Floating Point/Multicycle Pipelining in DLX Floating Point/Multicycle Pipelining in DLX Completion of DLX EX stage floating point arithmetic operations in one or two cycles is impractical since it requires: A much longer CPU clock cycle, and/or

More information

COSC 6385 Computer Architecture. - Memory Hierarchies (II)

COSC 6385 Computer Architecture. - Memory Hierarchies (II) COSC 6385 Computer Architecture - Memory Hierarchies (II) Fall 2008 Cache Performance Avg. memory access time = Hit time + Miss rate x Miss penalty with Hit time: time to access a data item which is available

More information

ASSEMBLY LANGUAGE MACHINE ORGANIZATION

ASSEMBLY LANGUAGE MACHINE ORGANIZATION ASSEMBLY LANGUAGE MACHINE ORGANIZATION CHAPTER 3 1 Sub-topics The topic will cover: Microprocessor architecture CPU processing methods Pipelining Superscalar RISC Multiprocessing Instruction Cycle Instruction

More information

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer

Page 1. CISC 662 Graduate Computer Architecture. Lecture 8 - ILP 1. Pipeline CPI. Pipeline CPI (I) Pipeline CPI (II) Michela Taufer CISC 662 Graduate Computer Architecture Lecture 8 - ILP 1 Michela Taufer Pipeline CPI http://www.cis.udel.edu/~taufer/teaching/cis662f07 Powerpoint Lecture Notes from John Hennessy and David Patterson

More information

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science

CS A Large, Fast Instruction Window for Tolerating. Cache Misses 1. Tong Li Jinson Koppanalil Alvin R. Lebeck. Department of Computer Science CS 2002 03 A Large, Fast Instruction Window for Tolerating Cache Misses 1 Tong Li Jinson Koppanalil Alvin R. Lebeck Jaidev Patwardhan Eric Rotenberg Department of Computer Science Duke University Durham,

More information

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5)

ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Instruction-Level Parallelism and its Exploitation: PART 1 ILP concepts (2.1) Basic compiler techniques (2.2) Reducing branch costs with prediction (2.3) Dynamic scheduling (2.4 and 2.5) Project and Case

More information

Super Scalar. Kalyan Basu March 21,

Super Scalar. Kalyan Basu March 21, Super Scalar Kalyan Basu basu@cse.uta.edu March 21, 2007 1 Super scalar Pipelines A pipeline that can complete more than 1 instruction per cycle is called a super scalar pipeline. We know how to build

More information

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading)

CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) CMSC 411 Computer Systems Architecture Lecture 13 Instruction Level Parallelism 6 (Limits to ILP & Threading) Limits to ILP Conflicting studies of amount of ILP Benchmarks» vectorized Fortran FP vs. integer

More information

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士

Computer Architecture 计算机体系结构. Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II. Chao Li, PhD. 李超博士 Computer Architecture 计算机体系结构 Lecture 4. Instruction-Level Parallelism II 第四讲 指令级并行 II Chao Li, PhD. 李超博士 SJTU-SE346, Spring 2018 Review Hazards (data/name/control) RAW, WAR, WAW hazards Different types

More information

SOLUTION. Midterm #1 February 26th, 2018 Professor Krste Asanovic Name:

SOLUTION. Midterm #1 February 26th, 2018 Professor Krste Asanovic Name: SOLUTION Notes: CS 152 Computer Architecture and Engineering CS 252 Graduate Computer Architecture Midterm #1 February 26th, 2018 Professor Krste Asanovic Name: I am taking CS152 / CS252 This is a closed

More information

CS252 Graduate Computer Architecture Midterm 1 Solutions

CS252 Graduate Computer Architecture Midterm 1 Solutions CS252 Graduate Computer Architecture Midterm 1 Solutions Part A: Branch Prediction (22 Points) Consider a fetch pipeline based on the UltraSparc-III processor (as seen in Lecture 5). In this part, we evaluate

More information

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov

Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Instruction-Level Parallelism and Its Exploitation (Part III) ECE 154B Dmitri Strukov Dealing With Control Hazards Simplest solution to stall pipeline until branch is resolved and target address is calculated

More information

Lecture-13 (ROB and Multi-threading) CS422-Spring

Lecture-13 (ROB and Multi-threading) CS422-Spring Lecture-13 (ROB and Multi-threading) CS422-Spring 2018 Biswa@CSE-IITK Cycle 62 (Scoreboard) vs 57 in Tomasulo Instruction status: Read Exec Write Exec Write Instruction j k Issue Oper Comp Result Issue

More information

Characterizing and Comparing Prevailing Simulation Techniques

Characterizing and Comparing Prevailing Simulation Techniques Characterizing and Comparing Prevailing Simulation Techniques Joshua J. Yi 1, Sreekumar V. Kodakara 2, Resit Sendag 3, David J. Lilja 2, Douglas M. Hawkins 4 1 - Networking and Computing Systems Group

More information

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading

CSE 502 Graduate Computer Architecture. Lec 11 Simultaneous Multithreading CSE 502 Graduate Computer Architecture Lec 11 Simultaneous Multithreading Larry Wittie Computer Science, StonyBrook University http://www.cs.sunysb.edu/~cse502 and ~lw Slides adapted from David Patterson,

More information

THE SPEC CPU2000 benchmark suite [12] is a commonly

THE SPEC CPU2000 benchmark suite [12] is a commonly IEEE TRANSACTIONS ON COMPUTERS, VOL. 56, NO. 11, NOVEMBER 2007 1549 Speed versus Accuracy Trade-Offs in Microarchitectural Simulations Joshua J. Yi, Member, IEEE, Resit Sendag, Member, IEEE, David J. Lilja,

More information

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors

Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Integrated CPU and Cache Power Management in Multiple Clock Domain Processors Nevine AbouGhazaleh, Bruce Childers, Daniel Mossé & Rami Melhem Department of Computer Science University of Pittsburgh HiPEAC

More information

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design

EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design EN164: Design of Computing Systems Topic 06.b: Superscalar Processor Design Professor Sherief Reda http://scale.engin.brown.edu Electrical Sciences and Computer Engineering School of Engineering Brown

More information

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers

Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Feedback Directed Prefetching: Improving the Performance and Bandwidth-Efficiency of Hardware Prefetchers Microsoft ssri@microsoft.com Santhosh Srinath Onur Mutlu Hyesoon Kim Yale N. Patt Microsoft Research

More information