Spatial Memory Streaming (with rotated patterns)

1 Spatial Memory Streaming (with rotated patterns)
Michael Ferdman, Stephen Somogyi, and Babak Falsafi
Computer Architecture Lab at Carnegie Mellon University
2006 Stephen Somogyi

2 The Memory Wall
Memory latency: 100s of clock cycles; improving slowly
[Figure: timeline of execution time vs. memory stall time]
Reduce time stalled on memory
Raise memory-level parallelism (MLP)
Capture all access patterns: strides, pointers (linked lists, trees), complex layouts (sparse structs)
Stephen Somogyi, Michael Ferdman

3 Our Observation: Spatial Correlation
[Figure: a database page (8kB) in memory, showing the page header, tuple data, and tuple slot index]
Large-scale spatial access patterns
Irregular layout: non-strided
Sparse: can't capture with cache blocks
But repetitive: predict to improve MLP

4 DPC Submission
Code-correlated spatial patterns
  Pattern storage independent of dataset size
  Compulsory misses predictable
Spatial Memory Streaming
  Observes and records spatial patterns
  Upon first access, streams remaining blocks
  Fetch in parallel: increase MLP
  Sparse patterns: fetch directly into L1

5 Outline
Introduction
Spatial Correlation
Spatial Memory Streaming
Pattern Rotation

6 Spatial Regions
Logically divide memory into regions
Identify region by base address
Fixed-size regions: simplifies hardware
Can represent spatial patterns as bit vectors
[Figure: Region A and Region B, two fixed-size regions with their spatial patterns]
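A minimal Python sketch of the region bookkeeping described on slide 6: splitting an address into a region base and a block offset, and recording the touched blocks as a bit vector. The 64-byte block size and the 32-block (2 kB) region size are assumptions chosen for illustration, not values given in the slides, and the helper names `region_base`, `block_offset`, and `spatial_pattern` are hypothetical. The `A+4` style offsets are assumed to count cache blocks.

```python
BLOCK_BYTES = 64      # assumed cache block size
REGION_BLOCKS = 32    # assumed number of blocks per region (2 kB regions)
REGION_BYTES = BLOCK_BYTES * REGION_BLOCKS


def region_base(addr):
    """Base address of the fixed-size region containing addr."""
    return addr - (addr % REGION_BYTES)


def block_offset(addr):
    """Index of addr's cache block within its region."""
    return (addr % REGION_BYTES) // BLOCK_BYTES


def spatial_pattern(addrs):
    """Bit vector (list of 0/1, index = block offset) of the blocks touched.

    All addresses must fall in the same region.
    """
    addrs = list(addrs)
    base = region_base(addrs[0])
    bits = [0] * REGION_BLOCKS
    for addr in addrs:
        if region_base(addr) != base:
            raise ValueError("address outside the region")
        bits[block_offset(addr)] = 1
    return base, bits


# Region A at 0x1000; blocks A+4, A, and A+3 are accessed (offsets in blocks).
A = 0x1000
base, bits = spatial_pattern([A + 4 * BLOCK_BYTES, A, A + 3 * BLOCK_BYTES])
touched = [i for i, b in enumerate(bits) if b]
```

Because regions are fixed-size and aligned, both helpers reduce to a modulo and a division, which is the hardware simplification the slide points to.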

7 Why Exploit Spatial Correlation?
Perfect predictor = one miss per spatial pattern
[Figure: normalized miss rate of large blocks vs. a perfect predictor at 64B, 512B, and 8kB, for OLTP, DSS, Web, and Sci. workloads; panels: L1 (64kB) and L2 (8MB)]
Large blocks prohibitive: miss rate at L1, bandwidth inefficient
Spatial correlation: opportunity to eliminate misses

8 How to Exploit Spatial Correlation?
Patterns are code-correlated
  Use PC to predict patterns
  Storage independent of dataset size
  Can predict compulsory misses
But data layout may not be aligned to region
  PC is not enough [Kumar 98] [Chen 04]
  Offset within region identifies alignment
Practical hardware can predict spatial correlation

9 Outline
Introduction
Spatial Correlation
Spatial Memory Streaming
Rotated Patterns

10 Spatial Memory Streaming (SMS)
1. Observe pattern during generation
2. Store pattern at end of generation
3. Predict pattern at subsequent generation
[Figure: timeline. PC 1: ld A+4, PC 2: ld A, and PC 3: ld A+3 are observed (1); evict A+3 ends the generation and the pattern is stored (2); PC 1: ld B+4 triggers a prediction (3), so PC 2: ld B and PC 3: ld B+3 are cache hits]

11 SMS Hardware Overview
Active Generation Table: tracks current patterns
Pattern History Table: stores observed patterns
[Figure: core accesses to L1d are observed (1) by the Active Generation Table, which stores (2) finished patterns in the Pattern History Table; predictions (3) stream blocks from L2 / memory into the hierarchy]

12 Learning Patterns
Access sequence: PC 1: ld A+4; PC 2: ld A; PC 3: ld A+3; evict A+3; PC 1: ld B+4; PC 2: ld B; PC 3: ld B+3
Active Generation Table (columns: Region, PC / off, Pattern)
  Accumulates patterns
  32 to 64 entries sufficient

13 Learning Patterns
Active Generation Table entry: Region A, PC / off = PC 1 / 4
First access creates new entry

14 Learning Patterns
Active Generation Table entry: Region A, PC / off = PC 1 / 4
Further accesses accumulate bits in pattern

15 Learning Patterns
Active Generation Table entry: Region A, PC / off = PC 1 / 4
Further accesses accumulate bits in pattern

16 Learning Patterns
Eviction ends pattern
Entry PC 1 / 4 moves to Pattern History Table

17 Learning Patterns
Pattern History Table (columns: PC / off, Pattern), entry: PC 1 / 4
  Stores previously-observed patterns
  Set-associative: 8-way, 2k entries
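The learning steps on slides 12 to 17 can be sketched as a toy Python model. This is an illustration under stated assumptions, not the authors' hardware design: plain dictionaries stand in for the bounded, set-associative tables, the 32-block region size is assumed, and `PatternLearner` with its methods is a hypothetical name.

```python
REGION_BLOCKS = 32  # assumed blocks per region; not given in the slides


class PatternLearner:
    """Toy model of the Active Generation Table (AGT) and Pattern History Table (PHT)."""

    def __init__(self):
        self.agt = {}  # region -> ((pc, trigger offset), pattern bits)
        self.pht = {}  # (pc, trigger offset) -> pattern bits

    def access(self, pc, region, offset):
        entry = self.agt.get(region)
        if entry is None:
            # First access to the region creates a new AGT entry.
            bits = [0] * REGION_BLOCKS
            bits[offset] = 1
            self.agt[region] = ((pc, offset), bits)
        else:
            # Further accesses accumulate bits in the pattern.
            entry[1][offset] = 1

    def evict(self, region):
        # Eviction ends the generation; the pattern moves to the PHT.
        entry = self.agt.pop(region, None)
        if entry is not None:
            key, bits = entry
            self.pht[key] = tuple(bits)


learner = PatternLearner()
learner.access("PC1", "A", 4)   # PC 1: ld A+4
learner.access("PC2", "A", 0)   # PC 2: ld A
learner.access("PC3", "A", 3)   # PC 3: ld A+3
learner.evict("A")              # evict A+3
learned = [i for i, b in enumerate(learner.pht[("PC1", 4)]) if b]
```

After the eviction, the AGT no longer tracks region A and the PHT holds one entry keyed by PC 1 / 4 with blocks 0, 3, and 4 set.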

18 Predicting Patterns
Access sequence: PC 1: ld A+4; PC 2: ld A; PC 3: ld A+3; evict A+3; PC 1: ld B+4; PC 2: ld B; PC 3: ld B+3
Pattern History Table (columns: PC / off, Pattern), entry: PC 1 / 4
First access (PC 1: ld B+4) looks in Pattern History Table
Stream predicted blocks into L1 cache: stream B, B+3 into cache

19 Predicting Patterns
Subsequent accesses (PC 2: ld B; PC 3: ld B+3) hit in L1 cache
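A small Python sketch of the prediction step on slides 18 and 19, assuming the Pattern History Table is modelled as a dictionary keyed by (PC, trigger offset). The function name `predict`, the 32-block region size, and the choice to skip the trigger block (the demand access already fetches it) are illustrative assumptions.

```python
def predict(pht, pc, region, offset):
    """Return the (region, block offset) pairs to stream on a trigger access.

    pht maps (pc, trigger offset) to a pattern bit vector.
    """
    pattern = pht.get((pc, offset))
    if pattern is None:
        return []  # no history for this PC / offset: nothing to stream
    # The trigger block itself is already being fetched by the demand access.
    return [(region, i) for i, bit in enumerate(pattern) if bit and i != offset]


# Pattern learned from region A: blocks 0, 3, and 4, triggered by PC 1 at offset 4.
pattern = [0] * 32  # 32 blocks per region is an assumption
for i in (0, 3, 4):
    pattern[i] = 1
pht = {("PC1", 4): tuple(pattern)}

streamed = predict(pht, "PC1", "B", 4)   # PC 1: ld B+4
missed = predict(pht, "PC1", "B", 2)     # same PC, different offset
```

Here `streamed` names blocks B and B+3, matching the slide. The second lookup finds nothing because the offset is part of the key, which is the limitation the rotated-pattern slides address.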

20 SMS Results (SPEC CPU 2006)
[Figure: normalized execution time per benchmark: astar, bwaves, bzip2, cactusADM, dealII, gcc, GemsFDTD, gromacs, h264ref, hmmer, lbm, leslie3d, libquantum, mcf, milc, omnetpp, soplex, xalancbmk, zeusmp]

21 Outline
Introduction
Spatial Correlation
Spatial Memory Streaming
Rotated Patterns

22 Our Observation: Rotated Patterns
PC is insufficient to predict pattern
  Offset of first access highly variable
But: access pattern almost always the same
  Can store rotated patterns in PHT
  Rotate as needed before prediction

23 Learning Patterns
Access sequence: PC 1: ld A+4; PC 2: ld A+8; PC 3: ld A+7; evict A+7; PC 1: ld B+2; PC 2: ld B+6; PC 3: ld B+5
Active Generation Table (columns: Region, PC / off, Pattern)
  Accumulates patterns
  32 to 64 entries sufficient

24 Learning Patterns
Active Generation Table entry: Region A, PC / off = PC 1 / 4
First access creates new entry
Bits are recorded rotated left by initial offset

25 Learning Patterns
Active Generation Table entry: Region A, PC / off = PC 1 / 4
Further accesses accumulate bits in pattern
Bits are recorded rotated left by initial offset

26 Learning Patterns
Active Generation Table entry: Region A, PC / off = PC 1 / 4
Further accesses accumulate bits in pattern
Bits are recorded rotated left by initial offset

27 Learning Patterns
Eviction ends pattern
Entry PC 1 moves to Pattern History Table (PC only, no offset)

28 Learning Patterns
Pattern History Table (columns: PC, Pattern), entry: PC 1 (PC only, no offset)
  Stores previously-observed patterns
  Set-associative: 8-way, 2k entries
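A hedged Python sketch of the rotated learning step on slides 23 to 28. The bit vector is a list whose index 0 is block 0 of the region, so rotating left by the trigger offset moves the trigger block to index 0. The 32-block region size and the names `rotate_left` and `learn_rotated` are assumptions. The slides record each bit already rotated as it arrives; this sketch rotates once when the pattern is stored, which yields the same stored pattern.

```python
REGION_BLOCKS = 32  # assumed blocks per region; not given in the slides


def rotate_left(bits, k):
    """Rotate a bit vector left by k positions (index 0 is block 0)."""
    k %= len(bits)
    return bits[k:] + bits[:k]


def learn_rotated(pht, pc, trigger_offset, accessed_offsets):
    """Store the pattern rotated so the trigger block lands at index 0, keyed by PC only."""
    bits = [0] * REGION_BLOCKS
    for off in accessed_offsets:
        bits[off] = 1
    pht[pc] = tuple(rotate_left(bits, trigger_offset))


pht = {}
# Region A: trigger PC 1 at A+4, then A+8 and A+7.
learn_rotated(pht, "PC1", 4, [4, 8, 7])
stored = [i for i, b in enumerate(pht["PC1"]) if b]

# Region B has the same shape at a different alignment: B+2, B+6, B+5.
other = {}
learn_rotated(other, "PC1", 2, [2, 6, 5])
```

Both generations produce the same stored pattern (bits 0, 3, and 4), so a single PHT entry keyed by PC 1 covers both alignments.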

29 Predicting Patterns
Access sequence: PC 1: ld A+4; PC 2: ld A+8; PC 3: ld A+7; evict A+7; PC 1: ld B+2; PC 2: ld B+6; PC 3: ld B+5
Pattern History Table (columns: PC, Pattern), entry: PC 1
First access (PC 1: ld B+2) looks in Pattern History Table
Stored pattern is rotated by the offset of the first access (+2)
Stream predicted rotated blocks into L1 cache: stream B+6, B+5 into cache

30 Predicting Patterns
Subsequent accesses (PC 2: ld B+6; PC 3: ld B+5) hit in L1 cache
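Prediction with rotated patterns, as on slides 29 and 30, can be sketched as the inverse rotation. This is an illustrative Python sketch under assumptions: the PHT is a dictionary keyed by PC only, regions hold 32 blocks, the stored pattern has the trigger block at index 0, and the names `rotate_right` and `predict_rotated` are hypothetical.

```python
def rotate_right(bits, k):
    """Rotate a bit vector right by k positions (index 0 is block 0)."""
    k %= len(bits)
    return bits[-k:] + bits[:-k]


def predict_rotated(pht, pc, region, trigger_offset):
    """Re-align the stored pattern to the trigger offset and list blocks to stream."""
    pattern = pht.get(pc)
    if pattern is None:
        return []  # no history for this PC
    aligned = rotate_right(list(pattern), trigger_offset)
    # Skip the trigger block: the demand access is already fetching it.
    return [(region, i) for i, bit in enumerate(aligned)
            if bit and i != trigger_offset]


# Stored rotated pattern for PC 1: trigger at index 0, plus relative blocks +3 and +4.
stored = [0] * 32  # 32 blocks per region is an assumption
for i in (0, 3, 4):
    stored[i] = 1
pht = {"PC1": tuple(stored)}

streamed = predict_rotated(pht, "PC1", "B", 2)   # PC 1: ld B+2
```

With a trigger at B+2, the sketch selects B+5 and B+6, the two blocks the slide streams into the cache. Rotation wraps around the region boundary, so relative blocks that run past the end of the region map back to its start.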

31 Rotation: Theoretical Benefit
Before: Pattern History Table (columns: PC / off, Pattern) holds a separate PC 1 / off entry for each offset
After: Pattern History Table (columns: PC, Pattern) holds a single PC 1 entry
Rotated patterns save PHT storage
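To make the storage argument on slide 31 concrete, the following Python sketch counts Pattern History Table entries with and without rotation for one PC whose access shape repeats at four different alignments. The trigger offsets (4, 2, 10, 0), the 32-block region size, and the function `build_pht` are illustrative assumptions, not data from the slides.

```python
REGION_BLOCKS = 32  # assumed blocks per region


def build_pht(observations, rotate):
    """Build a dictionary PHT from (pc, trigger offset, accessed offsets) tuples."""
    pht = {}
    for pc, trigger, accessed in observations:
        if rotate:
            # Key by PC only; store offsets relative to the trigger block.
            pht[pc] = frozenset((off - trigger) % REGION_BLOCKS for off in accessed)
        else:
            # Key by PC and trigger offset; store absolute offsets.
            pht[(pc, trigger)] = frozenset(accessed)
    return pht


# Same shape (trigger, trigger+3, trigger+4) seen at four alignments.
observations = [
    ("PC1", t, [t, (t + 3) % REGION_BLOCKS, (t + 4) % REGION_BLOCKS])
    for t in (4, 2, 10, 0)
]

entries_before = len(build_pht(observations, rotate=False))
entries_after = len(build_pht(observations, rotate=True))
```

In this constructed case the unrotated table needs four entries and the rotated table needs one. The saving in practice depends on how many distinct alignments each PC sees.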

32 Rotation: Practical Benefit
[Figure: predictor coverage (0% to 140%), broken into covered, uncovered, and overpredicted, comparing sms and rot at several PHT sizes for OLTP and Web]
Rotated patterns save 2x PHT storage

33 Rotation: Applicability
Commercial workloads (e.g., OLTP, web, DSS)
  Large instruction footprints (>1MB [cidr 07])
  Benefit from rotation
Desktop/engineering (e.g., SPEC CPU 2000)
  Small instruction footprints (fit in L1-I)
  Unlikely to benefit from rotation [hpca 04]
  SPEC CPU 2006 very similar to CPU 2000
Need broad range of workloads to observe benefit of rotated patterns

34 Conclusion
Spatial Memory Streaming
  Learns large-scale spatial access patterns
  Streams remaining blocks upon first access in pattern
  Accurate predictor with small hardware cost
Rotated Patterns
  Stores one rotated version of spatial pattern per PC
  Significant reduction in number of patterns
  Needed in PHT-capacity constrained environment

35 Questions?
STeMS Project: Spatio-Temporal Memory Streaming
cmu edu/~stems
Computer Architecture Laboratory, Carnegie Mellon University
