ECE 5730 Memory Systems

Rafe Henderson
6 years ago
1 ECE 5730 Memory Systems Spring 2009 Off-line Cache Content Management Lecture 7: 1

2 Announcements
Quiz 4 is on Tuesday and covers only today's lecture.
No office hours today.

3 Where We're Headed
Off-line content management (today): partitioning heuristics, prefetching heuristics, locality optimizations, combined approaches
Cache power management
Cache case studies

4 Off-line Partitioning Heuristics
The programmer or compiler partitions code and data between the scratchpad memory (SPM) and main memory.

5 Example Embedded Processor
SPM + d-cache backed by main memory.
[Figure: example embedded processor, a correction of the paper's figure; Panda00]

6 Why Not Just a Big Data Cache?
Consider a digital camera histogram application: cache conflicts between arrays cannot be removed via data layout techniques.
Handle them by placing the arrays in different memories (scratchpad and main memory).

7 Data Partitioning Factors
Where to map scalar variables and constants: they usually take up little space, so map them to the SPM to avoid d-cache conflicts with arrays.
Size of arrays: arrays larger than the SPM require book-keeping code to determine which region of the array is addressed. If accesses are uniform, prefetching into the d-cache may work well, so assign these large arrays to the d-cache.

8 Data Partitioning Factors
Lifetimes of variables: use lifetime analysis to store variables/arrays with disjoint lifetimes in the same storage location.
There are many partitioning choices. Which one is best? [Panda00]
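As a minimal sketch of the lifetime idea, the Python below groups variables into shared storage slots when their lifetimes do not overlap. The (first use, last use) interval representation and the greedy first-fit policy are assumptions made for illustration; they are not taken from [Panda00].

```python
def lifetimes_disjoint(a, b):
    """a and b are (first_use, last_use) pairs, both ends inclusive."""
    return a[1] < b[0] or b[1] < a[0]


def share_storage(lifetimes):
    """Greedy first-fit: put each variable in the first slot whose
    current occupants all have lifetimes disjoint from its own.

    lifetimes maps a variable name to its (first_use, last_use) pair.
    Returns a list of slots, each a list of variable names.
    """
    slots = []
    # Visit variables in order of first use (hypothetical policy).
    for name, life in sorted(lifetimes.items(), key=lambda kv: kv[1][0]):
        for slot in slots:
            if all(lifetimes_disjoint(life, lifetimes[other]) for other in slot):
                slot.append(name)
                break
        else:
            slots.append([name])
    return slots
```

For example, with lifetimes a = (0, 10), b = (5, 20), and c = (11, 30), a and c can share one location while b needs its own.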

9 Data Partitioning Factors
Access frequency of variables: use it to estimate the degree of conflicts with other variables.
Rough metric: interference factor IF(u) = VAC(u) + IAC(u), where VAC(u), the variable access count, is the number of accesses to elements of u during its lifetime, and IAC(u), the interference access count, is the number of accesses to other variables during the lifetime of u.
A high value of IF(u) indicates that u is likely to have a large number of d-cache conflicts if mapped to DRAM, so map it to the SPM instead.
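The interference factor can be illustrated on a small access trace. This sketch assumes the trace is a list of variable names in access order and that the lifetime of u runs from its first access to its last access; both choices are simplifications for illustration.

```python
def interference_factor(trace, u):
    """Return (IF, VAC, IAC) for variable u over an access trace.

    Assumes the lifetime of u spans its first through last access.
    """
    first = trace.index(u)
    last = len(trace) - 1 - trace[::-1].index(u)
    window = trace[first:last + 1]   # accesses during u's lifetime
    vac = window.count(u)            # accesses to u itself
    iac = len(window) - vac          # accesses to other variables
    return vac + iac, vac, iac
```

On the trace a, b, a, c, b, b, a, c, variable a has VAC = 3 and IAC = 4, so IF(a) = 7, which makes it a stronger SPM candidate than b (IF = 5) or c (IF = 5) under this metric.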

10 Data Partitioning Factors
Conflicts in loops: identify array d-cache conflicts in loops that cannot be avoided by memory address assignment (the arrays have different access patterns).
In the example, map a and b to DRAM and c to the SPM.
[Figure: example loop; Panda00]

11 Performance Comparison
[Figure: performance comparison; Panda00]

12 Off-line Prefetching
Prefetch instructions provide a hint to the hardware to bring data into the cache (L1 or L2).
The programmer or compiler embeds these instructions into the code.
Caches must be non-blocking.
Exceptions (e.g., protection violations) typically cause the prefetch request to be dropped.

13 Common Prefetch Instructions
Normal prefetch: the block is brought into the cache.
Prefetch with modify intent: the block is brought into the cache in the Dirty state.
Prefetch with block modify intent: write access is obtained without reading the old block.

14 Common Prefetch Instructions
Non-temporal prefetch: the block has no temporal reuse, so prevent other data from being displaced as much as possible.
Example 1: HW brings the block into the MRU position.
Example 2: HW brings the block into a particular way of the cache.

15 Prefetching Arrays
Array accesses whose indices are affine (linear) functions of the loop indices have addresses that can be calculated ahead of time.
Goal: overlap the computation of the current iteration(s) with the SW prefetch of a future iteration. [Intel08]
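To see why affine indices make addresses predictable, the sketch below computes the byte address of A[i][j] and of the element that will be touched a fixed number of inner-loop iterations later. It assumes a row-major array with 8-byte elements; the function names and parameters are hypothetical.

```python
def element_address(base, i, j, ncols, elem_size=8):
    """Byte address of A[i][j] for a row-major array with ncols columns."""
    return base + (i * ncols + j) * elem_size


def prefetch_address(base, i, j, ncols, distance, elem_size=8):
    """Address that the inner loop will touch `distance` iterations from now.

    Because the index is affine in j, this needs no knowledge of the data,
    only of the loop indices, so it can be computed ahead of time.
    """
    return element_address(base, i, j + distance, ncols, elem_size)
```

A software prefetch issued in iteration (i, j) would target prefetch_address(...), while the computation in the same iteration uses element_address(...).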

16 Prefetching Example
Assumptions: 8KB WB cache, 2 array elements per line, 100-cycle miss penalty.
[Figure: original code and its cache misses/hits; Mowry98]

17 Prefetching Example
[Figure: code with prefetching, divided into prologue, steady state, and epilogue; continued on the next slide; Mowry98]

18 Prefetching Example
[Figure: code with prefetching, continued; Mowry98]

19 Prefetching Steps
Locality analysis: determine which accesses are likely to miss and therefore should be prefetched.
Loop splitting: separate the predicted miss instances to avoid the overhead of conditional statements in the loop bodies.
Scheduling prefetches: schedule prefetches the proper amount of time in advance and overlap them with computation (via software pipelining).

20 Locality Analysis
Identify references likely to cause a cache miss, in a two-step process.
Step 1: determine the intrinsic data reuses within a loop nest.
Step 2: determine the reuses that can be exploited by a cache of a particular size, i.e., which reuses can be translated into locality.

21 Reuse Analysis
Attempts to find those instances of array accesses that refer to the same line.
Spatial reuse: a reference accesses data in the same line in different iterations.
Temporal reuse: a reference accesses the same data location in different iterations.
Group reuse: an array access is to the same line or data location as a previous array access.

22 Reuse Analysis
A[i][j] has spatial reuse in the inner loop.
B[j][0] has temporal reuse in the outer loop.
B[j][0] and B[j+1][0] have group reuse.
[Figure: cache misses/hits for each reference; Mowry98]
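The three kinds of reuse can be checked with a small simulation. The sketch assumes the loop body is A[i][j] = B[j][0] + B[j+1][0] over 3 outer and 100 inner iterations, that B has 3 columns, that elements are 8 bytes (2 per 16-byte line), and that the cache never evicts anything, so only first-touch misses are counted. The loop bounds, array shapes, and base addresses are assumptions, since the figure with the original code is not reproduced here.

```python
LINE_BYTES = 16  # 2 array elements per line, assuming 8-byte elements
ELEM = 8


def simulate(n_i=3, n_j=100, b_cols=3):
    """Count first-touch misses per reference (no evictions modeled)."""
    a_base = 0
    b_base = 1 << 20  # hypothetical line-aligned base, far from A
    seen = set()
    misses = {"A[i][j]": 0, "B[j][0]": 0, "B[j+1][0]": 0}

    def touch(name, addr):
        line = addr // LINE_BYTES
        if line not in seen:
            seen.add(line)
            misses[name] += 1

    for i in range(n_i):
        for j in range(n_j):
            touch("B[j][0]", b_base + (j * b_cols) * ELEM)
            touch("B[j+1][0]", b_base + ((j + 1) * b_cols) * ELEM)
            touch("A[i][j]", a_base + (i * n_j + j) * ELEM)
    return misses
```

Under these assumptions A[i][j] misses on every other inner iteration (spatial reuse), B[j+1][0] misses only during the first outer iteration (temporal reuse), and B[j][0] misses only once because B[j+1][0] fetched its line one iteration earlier (group reuse).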

23 Identifying Prefetches
Reuses translate into locality only if the reuse of data occurs before the data is displaced. This depends on the loop iteration count (which determines how much data is brought into the cache between reuses) and on the cache characteristics.
Localized iteration space (LIS): the set of innermost loops whose volume of data accessed in a single iteration does not exceed the cache size.
A reuse can be exploited only if it lies within the LIS.
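A minimal sketch of the LIS test follows. It assumes the analysis has already estimated, for each loop from innermost to outermost, the number of bytes accessed in a single iteration of that loop; the input format and the byte counts in the example are hypothetical.

```python
def localized_iteration_space(loops, cache_bytes):
    """Return the names of the innermost loops that belong to the LIS.

    loops: list of (loop_name, bytes_accessed_per_iteration),
           ordered from innermost to outermost.
    """
    lis = []
    for name, bytes_per_iteration in loops:
        if bytes_per_iteration > cache_bytes:
            break  # this loop and every loop outside it are not localized
        lis.append(name)
    return lis
```

For instance, if one iteration of an inner loop j touches 48 bytes and one iteration of the enclosing loop i touches 2416 bytes, both loops are in the LIS for an 8KB cache, but only j is for a 1KB cache.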

24 Identifying Prefetches
With no locality, all references are prefetched.
With temporal locality, only need to prefetch at the beginning of the loop (e.g., i = 0).
With spatial locality, need to prefetch only those references for which (i mod l) = 0, where i is the loop index and l is the number of array elements in each line.
Prefetch predicate: the predicate that determines whether a particular iteration needs to be prefetched.
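The three cases map directly onto predicates over the loop index. The sketch below is only an illustration; the function name and the string labels for the locality kinds are hypothetical.

```python
def prefetch_predicate(locality, elems_per_line=2):
    """Return a function of the loop index i that is True when
    iteration i needs a prefetch."""
    if locality == "none":
        return lambda i: True                      # prefetch every iteration
    if locality == "temporal":
        return lambda i: i == 0                    # only the first iteration
    if locality == "spatial":
        return lambda i: i % elems_per_line == 0   # first element of each line
    raise ValueError(f"unknown locality kind: {locality}")
```

With 2 elements per line, the spatial predicate selects iterations 0, 2, 4, and so on, while the temporal predicate selects only iteration 0.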

25 Result of Locality Analysis
[Figure: cache misses/hits; Mowry98]

26 Loop Splitting
Loops are decomposed into different sections so that all the predicates for a section evaluate to the same value.
Predicate i = 0 requires peeling the first loop iteration.
Predicate (i mod l) = 0 requires unrolling the loop by a factor of l.
Need to worry about code expansion!
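The effect of peeling and unrolling can be checked by comparing the prefetches issued by a predicated loop nest with those issued by its split form. This sketch records prefetch events instead of issuing real prefetches, ignores the prefetch distance, and assumes a two-level nest in which B is prefetched only when i = 0 and A only when (j mod l) = 0; the event encoding is hypothetical.

```python
def predicated(n_i, n_j, l):
    """Original form: predicates evaluated inside the loop body."""
    events = []
    for i in range(n_i):
        for j in range(n_j):
            if i == 0:
                events.append(("B", j + 1))
            if j % l == 0:
                events.append(("A", i, j))
    return events


def split(n_i, n_j, l):
    """Split form: i = 0 peeled, inner loop unrolled by l, no conditionals."""
    if n_j % l != 0:
        raise ValueError("sketch assumes n_j is a multiple of l")
    events = []
    # Peeled first outer iteration: B is prefetched in every copy.
    for j in range(0, n_j, l):
        events.append(("B", j + 1))
        events.append(("A", 0, j))           # copy 0 of the unrolled body
        for k in range(1, l):                # stands in for copies 1..l-1
            events.append(("B", j + k + 1))
    # Remaining outer iterations: only A, once per cache line.
    for i in range(1, n_i):
        for j in range(0, n_j, l):
            events.append(("A", i, j))
    return events
```

Both forms produce the same sequence of prefetches, but the split form has roughly l + 1 copies of the loop body, which is the code expansion the slide warns about.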

27 Loop Splitting
[Figure: example loop with the first iteration peeled and the loop unrolled twice; Mowry98]

28 Loop Splitting
[Figure: loop splitting example, continued; no prefetches of B[j+1][0]; Mowry98]

29 Scheduling Prefetches
Prefetches should be issued early enough to hide the memory latency, but not so early that the data is flushed before it is used.
Prefetches are scheduled ceiling(m/s) iterations in advance, where m is the prefetch latency in cycles and s is the length in cycles of the shortest path through the loop body.
For our example, ceiling(100/36) = 3 iterations.
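The prefetch distance is a one-line computation. The function name below is hypothetical; the formula is the one given on the slide.

```python
import math


def prefetch_distance(latency_cycles, shortest_path_cycles):
    """Iterations of lead time needed to hide the prefetch latency:
    ceiling(m / s)."""
    if shortest_path_cycles <= 0:
        raise ValueError("loop body must take at least one cycle")
    return math.ceil(latency_cycles / shortest_path_cycles)
```

With m = 100 and s = 36 this gives 3 iterations, matching the example. Using the shortest path through the loop body is the conservative choice: a faster loop body needs more iterations of lead time.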

30 Scheduling Prefetches
Prologue: start prefetching 3 iterations ahead.
Steady state: continue prefetching 3 iterations ahead.
Epilogue: no prefetching.
[Figure: Mowry98]

31 Scheduling Prefetches
[Figure: continue prefetching A[i][j]; Mowry98]
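The prologue, steady-state, and epilogue structure can be sketched for a single loop over a one-dimensional array. The code records an ordered list of prefetch and use events rather than touching memory; the distance is expressed in array elements, and the event encoding is an assumption made for illustration.

```python
def pipelined_schedule(n, distance, elems_per_line):
    """Software-pipelined prefetch schedule for `for i in range(n): use(i)`.

    Returns ("prefetch", element) and ("use", element) events in issue order.
    """
    events = []
    # Prologue: prefetch the lines needed by the first `distance` iterations.
    for i in range(0, min(distance, n), elems_per_line):
        events.append(("prefetch", i))
    # Steady state: prefetch `distance` elements ahead, then compute.
    steady_end = max(n - distance, 0)
    for i in range(steady_end):
        if (i + distance) % elems_per_line == 0:
            events.append(("prefetch", i + distance))
        events.append(("use", i))
    # Epilogue: the remaining data was already prefetched; compute only.
    for i in range(steady_end, n):
        events.append(("use", i))
    return events
```

Every cache line is prefetched exactly once, always before its first use, and no prefetch targets an element past the end of the array.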

32 HW Versus SW Prefetching
"Optimizing data access patterns to suit the hardware prefetcher should be a higher-priority consideration than using software prefetch instructions." [Intel08]
In other words, organize your code to help the HW prefetcher do its job, and if you can't, use SW prefetching.
Remember: SW prefetches are instructions that consume I-cache and pipeline resources.

33 Next Time
Cache Power Management
