Detecting Global Stride Locality in Value Streams

Size: px

Start display at page:

Download "Detecting Global Stride Locality in Value Streams"

Simon Simpson
6 years ago
Views:

1 Detecting Global Stride Locality in Value Streams Huiyang Zhou, Jill Flanagan, Thomas M. Conte TINKER Research Group Department of Electrical & Computer Engineering North Carolina State University 1

2 Introduction This paper is about predicting values A code example (from the benchmark parser, function is_con_word) lw$v0[2],0($v1[3]) sw$v0[2],0($s8[30]) lw$v0[2],0($s8[30]) //the insn to be predicted bne$v0[2],$zero[0], lw$v0[2],0($s8[30]) a0 lw$v1[3],12($v0[2]) a8 sw$v1[3],0($s8[30]) b0 j local value history from value profile: xx528, xx840, 0, xx792, 0, xx720, 0, xx816, xx768, xx744, xx696, xx624, xx672, 2

3 Hard-to to-predict Values 900 Value sequence Value Occurance Hard to predict with (local) value predictors: DFCM [Goeman et.al.]: 2% accuracy; (local) stride [Lipastiet. al.]: 4% accuracy. 3

4 Another look at the code Global Value Locality lw$v0[2],0($v1[3]) //the correlated load sw$v0[2],0($s8[30]) lw$v0[2],0($s8[30]) //the instruction to be predicted bne$v0[2],$zero[0], lw$v0[2],0($s8[30]) a0 lw$v1[3],12($v0[2]) //the correlated load a8 sw$v1[3],0($s8[30]) b0 j

5 Global Stride Locality Control flow graph lw$v0[2],0($v1[3]) // the correlated load sw$v0[2],0($s8[30]) a0 lw$v1[3],12($v0[2]) //the correlated load a8 sw$v1[3],0($s8[30]) b0 j lw$v0[2],0($s8[30]) //the instruction to be predicted bne$v0[2],$zero[0], % prediction accuracy is possible if we can use the results of correlated instructions. 5

6 Global value history: Global Value History If the values produced by the dynamic instruction stream are labeled x 1, x 2,..., x N, then the ordered data sequence (x 1, x 2,..., x N ) is the global value history of order (i.e., length) N. Local value history The value sequence produced by the same instruction to be predicted. 6

7 Global Value Locality Similar to local value predictability [Sazeidesand Smith] Global context locality Previous instruction (PI) predictor [Nakra et.al.] Dynamic dataflow-inherited speculative context (DDISC) predictor [Thomas & Franklin] Global computational locality x N = a N 1x N 1 + a N 2 x N 2 + L + a1x1 + a 0. Similar formulation to Perceptron branch predictor [Jimenez & Lin] 7

8 Global stride locality Global Stride Locality xn = xn k + a 0. k and a 0 are run-time variables The previous example: x N = x N lw$v0[2],0($v1[3]) // the correlated load sw$v0[2],0($s8[30]) a0 lw$v1[3],12($v0[2]) //the correlated load a8 sw$v1[3],0($s8[30]) b0 j lw$v0[2],0($s8[30]) //the instruction to be predicted bne$v0[2],$zero[0],

9 Global Stride Locality In A Program Global stride locality in the code Define (e.g., load ra, rb, rc) // load value is hard to predict Explicit Use (e.g., add rx, ra, #constant) // the destof add can be //predicted using the value of the define instruction results (ra) Explicit Use (e.g., sub rx, ra, rd) // the destof sub can be predicted //if rd has strong repeating patterns Implicit Use (e.g., load rx, ry, rz) //Implicit use through the //memory (eg., spilling and filling; reloading) 9

10 Global Stride Locality In A Program (cont.) Global stride locality embedded in the data structure struct string_list { struct string_list * next; char * string; } Near constant stride between the two addresses (*next, *string) if the two fields (next, string) are allocated and referenced in the same order. [Serrano & Wu] 10

11 Outline Introduction of global stride locality GDiff predictor Value Delay Speculative Value Queue Hybrid Value Queue Potential uses of gdiff predictor Summary 11

12 The GDiff Value Predictor Two major structures PC Stored differences and distance Global value queue (gvq) completing instructions results Prediction table 12

13 Example A code example a: load r1, r2, #20 b: c: d: add r3, r1, #4 loop Instruction a produces (1, 8, 3, 2, ) Instruction d generates (5, 12, 7, 6, ) 13

14 1 b1 Global value queue 0 0 Stored differences How GDiff Works c1 0 5 Update 4 x Calculated Differences x 8 b2 c x x Global value queue Update Calculated differences 4 x x match Distance = 2 Stored differences 3 b3 Global value queue 4 x c3 x Prediction is 3+4 = 7 Stored differences Instruction a produces (1, 8, 3, 2, ) Instruction d generates (5, 12, 7, 6, ) 14

15 The GDiff Value Predictor prediction update PC detect match =? =? =? =? =? diff1 diff2 diffk diffn distance global value queue mux completing instructions results Prediction table mux + prediction x x N = N k + a 0. 15

16 Results: Exploiting Global Stride Value Locality Value prediction accuracy 90% stride DFCM gdiff(queue size = 8) Prediction accuracy 80% 70% 60% 50% 40% 30% bzip2 gap gcc gzip mcf parser perl twolf vortex vpr average Based on trace simulation (sim-profile) and predicting all value producing insns Stride, gdiff, DFCM L1: 8k entry pc-indexed tagless table; DFCM L2: 64K entry table 16

17 Outline Introduction of global stride locality GDiff predictor Value Delay Speculative Value Queue Hybrid Value Queue Potential uses of gdiff predictor Summary 17

18 Value Delay When the prediction of an instruction is being made, the correlated instruction has not finished... A. lw$v0[2],0($v1[3]) //the correlated load B. sw$v0[2],0($s8[30]) C. lw$v0[2],0($s8[30]) //the instruction to be predicted D. bne$v0[2],$zero[0], A dispatch A issue A execute A complete C dispatch time A s value needed A s value available 18

19 Value Delay Impact Model value delay as T values: a prediction can not use the previous T insns results. prediction accuracy 90% 80% 70% 60% 50% 40% 30% Prediction accuracy for different value delays (queue size = 8) T = 0 T = 2 T = 4 T = 8 T = 16 bzip2 gap gcc gzip mcf parser perl twolf vortex vpr average Trace simulation using sim-profile 19

20 Outline Introduction of global stride locality GDiff predictor Value Delay Speculative Value Queue Hybrid Value Queue Potential uses of gdiff predictor Summary 20

21 Reducing the Value Delay Fetch Dispatch Issue RegRead Execution Write Back Retire Execution pipeline PC GDiff prediction Table Global value queue Update (out-of-order) GDiff predictor prediction Gdiff with speculative value queue (SGVQ) 21

22 Results of SGVQ 100% 90% Prediction accuracy and coverage of the gdiff predictor with SGVQ and the local stride predictor gdiff accuracy L_stride accuracy gdiff coverage L_stride coverage 80% 70% 60% 50% 40% 30% bzip2 gap gcc gzip mcf parser perl tw olf vortex vpr average Worse than the simple local stride predictor. Why? 22

23 Problems with Speculative GDiff Though value delay is reduced. The results are out-of-order Completion order a: add r3, r4, r5 b: load r5, r6, 28 c: sub r2, r3, 4 Cache miss b: load r5, r6, 28 a: add r3, r4, r5 c: sub r2, r3, 4 Cache hit 23

24 Outline Introduction of global stride locality GDiff predictor Value Delay Speculative Value Queue Hybrid Value Queue Potential uses of gdiff predictor Summary 24

25 The Hybrid GDiff Value Predictor Fetch Dispatch Issue RegRead Execution Write Back Retire Execution pipeline Update (out-of-order) PC GDiff prediction Table Global value queue GDiffpredictor prediction PC Local stride predictor local stride prediction Key: construct the global value history in-order (fetch/dispatch order); combine with a different value predictor. 25

26 Results of Hybrid GDiff 100% Prediction accuracy and coverage of gdiff and local predictors gdiff accuracy L_stride accuracy L_context accuracy gdiff coverage L_stride coverage L_context coverage 90% 80% 70% 60% 50% 40% 30% bzip2 gap gcc gzip mcf parser perl twolf vortex vpr average (Local context uses DFCM) 26

27 Predicting load addresses. Potential Uses of GDiff Forming the global value history only with load addresses. Predict next address based on previous addresses. Predicting addresses of missing loads only Forming the global value history only with addresses of missing loads. Predict next address based on previous addresses. Predicting values to break true data dependence chain. And 27

28 Predicting Addresses Using Hybrid GDiff 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Predictability of load address stream ls_accu gs_accu markov_accu ls_cov gs_cov markov_cov bzip2 gap gcc gzip mcf parser perl twolf vortex vpr average 4-way 256k entry Markov address predictor uses tag for confidence 28

29 Predicting Address of Missing Loads 100% 90% 80% 70% 60% 50% 40% 30% 20% 10% 0% Predictability of addresses of missing loads ls_m_accu gs_m_accu markov_m_accu ls_m_cov gs_m_cov markov_m_cov bzip2 gap gcc gzip mcf parser perl twolf vortex vpr average 29

30 Using GDiff to Break Data Dependencies 60% 50% Speedups of gdiff and local value prediction local_stride local_context gdiff(hgvq) 40% 30% 20% 10% 0% bzip2 gap gcc gzip mcf parser perl twolf vortex vpr H_mean 4/64 issue out-of-order model 30

31 Summary Global stride locality The Gdiff value predictor Value delay issue Reduce the value delay In-order global value history Potential Uses: What to put into the global value queue (i.e., global value history) How to utilize the prediction 31

32 Thank you and Questions? 32

Detecting Global Stride Locality in Value Streams

Detecting Global Stride Locality in Value Streams Huiyang Zhou, Jill Flanagan, Thomas M. Conte Department of Electrical and Computer Engineering North Carolina State University {hzhou,jtbodine,conte}@eos.ncsu.edu