Queue. Processor. Processor. + L1 Cache. Prefetch

Size: px

Start display at page:

Download "Queue. Processor. Processor. + L1 Cache. Prefetch"

Allen Stokes
5 years ago
Views:

1 computing for Prefetching Introspective project for CS252, Spring 2000) (Class Andrew Y. Ng and Eric Xing Berkeley UC

2 Introspective computing Secondary processor (may be DSP or GP) to \help" the primary processor. Feedback-driven execution to optimize performance. i.e. Prefetching Allows far more ambitious control algorithms than hardware prefetchers.

3 Advantages: Introspective computing Gives speed ups without relying on extracting more parallelism. Also Binary compatible. Scales up well: Can in principle be applied to multiprocessors, etc. Multiple secondary processors also possible. Disadvantages: Silicon can also be used for other things (e.g. multiprocessors, larger cache)

4 Prefetching: The problem Fetch data into cache to reduce processor stall time. Focus on L1 data cache. Architecture: Primary processor runs program. Secondary processor monitors execution, sends prefetch addresses.

5 Architecture What is the mechanism by which secondary processor monitors execution? We chose stream of miss addresses. Primary Processor + L1 Cache Queue Secondary Processor Prefetch

6 Algorithms Given stream of miss addresses, need to predict what the next miss will be. Many standard adaptive methods/machine learning algorithms applicable! e.g. Reinforcement learning in MDPs: C(s; a) = T (s) + E h min a C(s0 ;a) s is the current state of the processor+cache, a is the where we take (e.g. prefetch or not), T (s) is the time cost of action the current state, and s 0 is the sucessor state. i (1)

7 Real-time constraints Crucial issue: Secondary processor is receiving a stream of instruction misses in real time. Must respond to them quite quickly (e.g. at most about 100 cycles of processing) We found reinforcement learning and other statistical to be about an order of magnitude too slow. algorithms

8 Simple predictive algorithm trying several more alternatives, we settled on the following After algorithm (and minor variants) { simple each cache miss that the secondary processor gets o the queue, On does two things: it 1. Update cache-miss statistics 2. (Maybe) do a prefetch.

9 Keeping track of statistics Have a huge hash table hashing from cache tags into a table of the cache tags of the next few \likely" misses. On a cache miss on tag x, keep track of the number of of of misses on a few (5) other cache tags within a occurences short time following the miss on x.

10 Issuing prefetches Suppose that, after a cache miss on tag x, we had observed at least n times a cache miss on y shortly after. Then the next time we see a miss on x again, we also prefetch y.

11 Parameters Queue type: Fifo or Lifo? (Both have advantages.) What to do if queue is full? Speed of secondary processor. Size of hash table for miss statistics. What to do in event of collision in hash table. What to do in event of collision in table keeping track of next-miss tag.

12 Default parameters: Experimental details Used simplescalar simulator. SPEC95 Benchmarks gcc, go, compress, hydro2d. Hashsize = Used Queue of size 10. Assumed secondary processor as fast as primary processor. In event of hash etc. conicts, information on the new cache tag is thrown away.

13 Queue type: Fifo or Lifo? processes data in order. Lifo more urgently gets to recent Fifo rst. misses % reduction in cache misses Benchmark Fifo Lifo

14 What to do when queue full? 60 Policy on push when queue is full Drop push Delete oldest 50 % reduction in cache misses Benchmark

15 Queue size queue size % reduction in cache misses

16 Hashtable size hash size % reduction in cache misses

17 #misses observed before we prefetch % reduction in cache misses prefetch threshold

18 Conicts in hash table (Similar results over fairly large range of hash table sizes.) % reduction in cache misses tag replacement chance

19 in table storing next-miss Conicts address % reduction in cache misses address replacement chance

20 Speed of secondary processor % reduction in cache misses secondary processor speed Student Version of MATLAB

21 Summary and Related issues Up to about about 60% reduction in cache misses possible. If the queue is full (i.e. the secondary processor is busy), it is advantageous to (stochastically) skip some of the sometimes of the statistics, to focus on issuing appropiate updating instructions. prefetching Introspective computing can similarly be used to improved branch prediction (but hard!) Parallelizes easily: Can have multiple secondary processors. Specialized instructions for secondary processor?

Demand fetching is commonly employed to bring the data

Demand fetching is commonly employed to bring the data Proceedings of 2nd Annual Conference on Theoretical and Applied Computer Science, November 2010, Stillwater, OK 14 Markov Prediction Scheme for Cache Prefetching Pranav Pathak, Mehedi Sarwar, Sohum Sohoni