Store-Load Forwarding: Conventional. NoSQ: Store-Load Communication without a Store Queue. Store-Load Forwarding: Hmmm

Size: px

Start display at page:

Download "Store-Load Forwarding: Conventional. NoSQ: Store-Load Communication without a Store Queue. Store-Load Forwarding: Hmmm"

Elaine Watts
6 years ago
Views:

Store-Load Forwarding: Conventional No: Store-Load Communication without a Store Tingting Sha, Milo M.K. Martin, Amir Roth University of Pennsylvania {shatingt, milom, amir}@cis.upenn.edu http://www.

1 Store-Load Forwarding: Conventional No: Store-Load Communication without a Store Tingting Sha, Milo M.K. Martin, Amir Roth University of Pennsylvania {shatingt, milom, amir}@cis.upenn.edu addr predictor addr addr (CAM) data addr data (RAM) (RAM) Associatively-searched queue () Complex On execution path Doesn t scale well Exists for benefit of 10% of s Pentium III (1999) MICRO-06 :: Dec 12, 2006 [ 2 ] Store-Load Forwarding: Proposals Reduce range and frequency of search Bloom-filtered [Sethumadhavan 03] Pipelined/chained [Park 03] Hierarchical/filtered [Srinivasan 04, Ghandi 05, Torres 05] Decomposed [Roth 05; Baugh 04] Replace full associativity with set associativity Address-indexed forwarding cache [Stone 05] Replace search with speculative indexed access Speculative indexed [Sha 05], Fire-and-Forget [Subramaniam 06] Some forwarding structure still there in the middle of the datapath [ 3 ] Store-Load Forwarding: Hmmm Observe I: can predict the exact forwarding accurately Memory dependence prediction [Moshovos 97, Chrysos 98, ] No need to search to find it Observe II: data already exists in register file No need to copy input register!! output register Should be able to eliminate if Can connect s input data register to s consumers Can verify in some way that doesn t require we have the technology Speculative memory bypassing (SMB) [Moshovos 97] Filtered in-order re-execution [Cain 04, Roth 05] [ 4 ]

2 Store-Load Forwarding : No distinguishes bypassing/non-bypassing s Non-bypassing s just access the data cache s skip out-of-order execution (SMB) Store vulnerability window (SVW) verifies and trains Stores skip out-of-order execution too Don t participate in forwarding anymore Commit pipeline extended to execute s Road Overview The road to No (prior work) Conventional - forwarding Load verification with filtered re-execution Speculative memory bypassing No Eliminating out-of-order s and the queue Eliminating the queue Store- bypassing predictor Evaluation No Store (or any forwarding structure)! No Load! [ 5 ] [ 6 ] Conventional Design Conventional Design Loads Execute: search, write address into Loads Execute: search, write address into Stores Execute: write address/data to, search for early s [ 7 ] [ 8 ]

3 Conventional Design In-Order Load Re-Execution [Cain 04] Loads Execute: search, write address into Stores Execute: write address/data to, search for early s Commit: use data/address from to write [ 9 ] [ 10 ] In-Order Load Re-Execution [Cain 04] Store Vulnerability Window (SVW) [Roth 05] flush? Replace search with re-execution prior to commit Squash if in-order value! out-of-order value Moves queue out of core datapath Can verify any form of speculation Consumes a lot of cache bandwidth if all s are speculative [ 11 ] Don t re-execute if no to s address in long time Store Sequence Numbers (Ns): formalize time N Bloom Filter (): N of youngest to address Store commit: update Load commit: read, skip re-execution if entry is older than Forwarding? forwarding Non-forwarding? youngest committed at time of execute [ 12 ]

4 Speculative Memory [Moshovos 97] Speculative Memory Raw insns: add R1, 4! R2 R2! A A! R3 sub R3, 4! R4 Renamed insns: add P1, 4! P2 P2! A A! P3 sub P3, 4! P4 SMB renamed insns: add P1, 4! P2 P2! A A! P2 Convert DEF---USE to DEF-USE Extend register renaming -table[.input] : -table[def.output] Predict - dependence -table[.output] : -table[.input] Store- removed from dataflow graph sub P2, 4! P4 [ 13 ] ( register queue): maps to input data register Originally: verify bypassing s by executing out-of-order Modification: verify using SVW-filtered re-execution [Petric 05] s skip out-of-order execution s (~10%) never access data cache [ 14 ] No: A New Use of SMB No 1: Remove Store from Load Path Load re-execution / SVW filtering: target design simplification Eliminate associative queue search Traditional SMB: targets performance improvement SMB as opportunistic complement to queue Only 10% of s forward! only 4% gains Not worth the effort [Loh 02] No SMB: targets design simplification SMB as exclusive replacement for queue Loads don t need to access queue during execution s: skip out-of-order execution (SMB) Non-bypassing s: get values from the data cache [ 15 ] [ 16 ]

5 No 1: Remove Store from Load Path No 2: Remove Store Loads don t need to access queue during execution s: skip out-of-order execution (SMB) Non-bypassing s: get values from the data cache Remove queue from the the path [ 17 ] [ 18 ] No 2: Remove Store No 3: Remove Load Move execution from out-of-order to commit Extend ROB to remember registers, offsets and data sizes Elongate commit pipeline to read register file, calculate address No additional regfile ports, adders: out-of-order ports! in-order ports Don t dispatch s to Eliminate queue [ 19 ] [ 20 ]

6 No 3: Remove Load No 3: Remove Load Generate addresses for bypassed s at commit (to verify) ROB is already extended, pipeline already elongated Generate addresses for bypassed s at commit (to verify) ROB is already extended, pipeline already elongated Re-generate addresses for non-bypassed s at commit May need additional register read port [ 21 ] [ 22 ] No 3: Remove Load Once Again: Conventional Design Generate addresses for bypassed s at commit (to verify) ROB is already extended, pipeline already elongated Re-generate addresses for non-bypassed s at commit May need additional register read port Eliminate queue [ 23 ] [ 24 ]

7 No Road Overview From conventional to No Conventional - forwarding Load verification with filtered re-execution Speculative memory bypassing No Eliminating out-of-order s and the queue Eliminating the queue No s - bypassing predictor Evaluation [ 25 ] [ 26 ] No s Store-Load Prediction Similar to previous - prediction, but more difficult Load scheduling prediction: [Chrysos 98, ] Can predict conservatively, predicts only violating s Speculative forwarding prediction [Sha 05, ] Must predict all s, but benefits from - address check Traditional bypassing prediction [Moshovos 97, ] No address check, but can decline to predict difficult s No s bypassing prediction Must predict all s, precise, no address check Distance-Based Dependence Prediction interface: PC! dynamic Load PC! PC(s)! dynamic E.g., Store Sets [Chrysos 98], Speculative indexed [Sha 05] Store PC! dynamic requires table Can only (easily) represent most recent instance of each PC Load PC! distance (in s) to! dynamic E.g., [Lipasti 97, Yoaz 98] Can represent any instance Dovetails with SVW: just compare/subtract distances/ns Predict:.N bypass N rename.distance bypass Verify:.N bypass [.address] Train:.distance bypass N commit [.address] [ 27 ] [ 28 ]

PC branchcall history Design xor tag tag dist dist Each entry includes tag, distance Explicitly path-sensitive design Two tables: path-insensitive path-sensitive Both set-associative Predict: prefer

8 PC branchcall history Design xor tag tag dist dist Each entry includes tag, distance Explicitly path-sensitive design Two tables: path-insensitive path-sensitive Both set-associative Predict: prefer path-sensitive prediction Train: update both tables on every commit distance to forwarding Evaluation Goal!No queue or queue Same (or better) IPC as conventional design mis-predictions Deeper commit pipeline Latency benefit of SMB Reduced consumption of issue bandwidth and queue slots Simulation environment SPECint2000, SPECfp2000, Mediabench (only show 9) Dynamically scheduled 4-way superscalar 128-ROB, 40-entry issue queue, 11-stage front-end/core Base: 24/48-entry /, 2K-entry Store Sets, 6 stage commit No: No /, 2K-entry predictor, 8 stage commit [ 29 ] [ 30 ] Relative execution time No Performance No Avoid Bypass Mis-Predictions with Delay Two kinds of bypassing mis-predictions Difficult to predict Long signature, data dependent, predictor conflicts, etc. Simply cannot be bypassed Narrow- to wide- No can do wide-/narrow- and narrow-/narrow- See paper On average: slightly outperforms conventional design Prediction accuracy is ~99.8% (15 mis-predictions per 10,000 s) In a few cases: slowdowns 10 of 47 benchmarks >1% slowdown (worst case 7%) Catch-all: convert bypassing to delayed non-bypassing Inject it to the But delay it until the predicted bypassing commits Load gets value from data cache Attach a confidence counter to each predictor entry [ 31 ] [ 32 ]

No with Delay No No delay No with Perfect No No delay No perfect predictor Relative execution time Relative execution time Robust performance: only 1 of 47

See paper for scalability, data cache accesses, etc.

mechanisms Re-execution with SVW filter, speculative memory bypassing New highly-accurate bypassing predictor Simple, clean data path with no queue/ queue More

9 No with Delay No No delay No with Perfect No No delay No perfect predictor Relative execution time Relative execution time Robust performance: only 1 of 47 benchmarks > 1% Overdelay hurts sometimes, but not too bad As fast or faster than conventional in almost all cases Only 2% between realistic and perfect predictor See paper for scalability, data cache accesses, etc. [ 33 ] [ 34 ] Conclusions Conventional queue / queue Complex, non scalable Exist for the benefit of 10% forwarding s No Exploits synergy of previously proposed mechanisms Re-execution with SVW filter, speculative memory bypassing New highly-accurate bypassing predictor Simple, clean data path with no queue/ queue More scalable Fits well with distributed, partitioned cores (e.g., SMT) Outperforms conventional design Thank you. Tingting Sha, Milo M.K. Martin, Amir Roth University of Pennsylvania {shatingt, milom, amir}@cis.upenn.edu Store-Load Forwarding: Just Say No! [ 35 ]

HANDLING MEMORY OPS. Dynamically Scheduling Memory Ops. Loads and Stores. Loads and Stores. Loads and Stores. Memory Forwarding

HANDLING MEMORY OPS. Dynamically Scheduling Memory Ops. Loads and Stores. Loads and Stores. Loads and Stores. Memory Forwarding HANDLING MEMORY OPS 9 Dynamically Scheduling Memory Ops Compilers must schedule memory ops conservatively Options for hardware: Hold loads until all prior stores execute (conservative) Execute loads as