MEMORY PREFETCHING WITH NEURAL-NETWORKS: CHALLENGES AND INSIGHTS. Leeor Peled, Shie Mannor, Uri Weiser, Yoav Etsion



Introduction
- CPUs speculate in multiple domains: branch prediction, prefetching, cache replacement, memory disambiguation
- Most mechanisms try to predict the future based on a history of decisions and outcomes
- Simple recurring patterns can be learned, but complex pattern recognition is a challenge
- Some decisions cannot be classified based on simple history alone

But why can we even make predictions?
- What makes workloads so predictable?
- We claim:
  - Locality is a property of program semantics
  - Program + machine attributes can successfully represent a semantic execution context
  - Machine learning can approximate semantic locality

Next level in memory prefetching
- Our goal is to approximate semantic locality
- Higher abstraction level: locality between program objects
- Accesses are semantically local if they are related through a sequence of actions
  - Dictated by program semantics (e.g., cur = cur->next)
  - Correlates actions that are consequential, not necessarily consecutive
  - Not just temporal or spatial adjacency

Example: Memory accesses on MCF
[Figure: memory accesses plotted over the memory address space, with two marked address ranges, Range A and Range B]
Various interleaved streams, each with its own traits and semantics:

    for( ; arc < stop_arcs; arc += nr_group ) {
        if( arc->ident > BASIC ) {
            red_cost = arc->cost - arc->tail->potential + arc->head->potential;
            if (bea_is_dual_infeasible(arc, red_cost)) {
                basket_sz++;
                perm[basket_sz]->a = arc;
                perm[basket_sz]->cost = red_cost;
                perm[basket_sz]->abs_cost = ABS(red_cost);
            }
        }
    }

    do {
        while (perm[l]->abs_cost > cut) l++;
        while (cut > perm[r]->abs_cost) r--;
        if( l < r ) {
            xchange = perm[l];
            perm[l] = perm[r];
            perm[r] = xchange;
        }
        if( l <= r ) { l++; r--; }
    } while (l <= r );
    if( min < r ) sort_basket (min, r);
    if( l < max && l <= B ) sort_basket (l, max);

Memory accesses on MCF
- Zoom into the quicksort range
- The distinctive pattern represents the repeated pivot partitioning and recursion steps
- Could this pattern be recognized by NNs?

Prefetching with Neural-Networks
- Premise: machine + program state approximates semantic information; this state can be correlated with memory addresses
- Problem #1: we must find a useful context
  - Too many (or too few) attributes will result in overfitting (or underfitting)
  - Each workload may have its own useful subset of attributes
  - Our attempt to solve this with automated feature selection (contextual bandits, ISCA '15) had many shortcomings (size, collisions, adaptability)
- Problem #2: we need an efficient way of learning
  - Associative: learn global rules from local examples
  - Represent algorithmic complexity

Neural Network structure
- We use a NN (3-5 layers of varying sizes) to train the associations (the predictive context and the resulting address)
- Inputs (context): IP, address hint, access type, branch history, access history
- Outputs: Address 1, Address 2, additional predictions (confidence, reuse, ...)

Can a NN imitate existing prefetchers?
- As a first step, we trained over sequences representing the patterns targeted by common prefetchers

  Prefetcher        Pattern                               Additional state
  ----------        -------                               ----------------
  Streamer          A, A+1, A+2, A+3, A+4, A+5
  Strider           A, A+n, A+2n, A+3n, A+4n, A+5n
  SMS               A, A+n, A+m, B, B+n, B+m
  Markov            A, B, A, C, A, B                      A->B (33%), A->C (66%)
  VLDP, GHB */DC    A, A+2, A+3, A+5, A+6, A+8            +2, +1, +2, +1, ...
  GHB PC/*          A_1, B_1, A_2, B_2, A_3, B_3          PC_A: {A}_n, PC_B: {B}_n

- Placebo: learn non-trivial functions: sin(x), LFSR, (x+1)^2
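The synthetic training sequences listed for the common prefetchers can be produced with a few small generators. The sketch below is illustrative only: the function names, the parameters, and the particular 16-bit LFSR polynomial are assumptions and do not come from the slides, which only name the pattern families.

```python
def streamer(base, count):
    """Consecutive addresses: A, A+1, A+2, ..."""
    return [base + i for i in range(count)]


def strider(base, stride, count):
    """Constant stride: A, A+n, A+2n, ..."""
    return [base + i * stride for i in range(count)]


def sms(bases, offsets):
    """Same offsets replayed in each region: A, A+n, A+m, B, B+n, B+m."""
    return [b + off for b in bases for off in offsets]


def delta_cycle(base, deltas, count):
    """Repeating delta pattern, e.g. +2, +1, +2, +1 (VLDP, GHB */DC)."""
    seq = [base]
    for i in range(count - 1):
        seq.append(seq[-1] + deltas[i % len(deltas)])
    return seq


def lfsr16(seed, count):
    """16-bit Fibonacci LFSR with taps 16, 14, 13, 11 (assumed polynomial)."""
    state, out = seed, []
    for _ in range(count):
        out.append(state)
        bit = (state ^ (state >> 2) ^ (state >> 3) ^ (state >> 5)) & 1
        state = (state >> 1) | (bit << 15)
    return out


if __name__ == "__main__":
    print(streamer(100, 6))
    print(delta_cycle(100, [2, 1], 6))
```

With a base of A, `delta_cycle(A, [2, 1], 6)` yields A, A+2, A+3, A+5, A+6, A+8, the delta-correlation pattern from the table.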

Simple function: Sin

Convergence: Sin
[Figure: convergence speed per NN size (sin)]

Convergence: Prefetchers + Placebo
[Figure: convergence speed per sequence]

How to imitate a prefetcher?
- Prefetchers are not mathematical functions
  - Next elements are computed as deltas over previous elements
  - Need to examine how best to learn each function
- Learning modes include:
  - Function learning: x -> f(x) [described above]
  - Next element: f(x_1) -> f(x_2)
  - Next with history: {f(x_1), f(x_2)} -> f(x_3)
  - Deltas: (f(x_2) - f(x_1)) -> (f(x_3) - f(x_2))
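The four learning modes differ only in how a sequence is cut into (input, target) training pairs. A minimal sketch, assuming the sequence is given as a list of values f(x_1), f(x_2), ... and that x is simply the element index in function-learning mode; the function name and mode strings are hypothetical:

```python
def training_pairs(values, mode):
    """Build (input, target) pairs from a sequence f(x_1), f(x_2), ..."""
    if mode == "function":
        # x -> f(x); x is assumed to be the position in the sequence
        return [(i, v) for i, v in enumerate(values)]
    if mode == "next":
        # f(x_1) -> f(x_2)
        return list(zip(values, values[1:]))
    if mode == "history":
        # {f(x_1), f(x_2)} -> f(x_3)
        return [((values[i], values[i + 1]), values[i + 2])
                for i in range(len(values) - 2)]
    if mode == "deltas":
        # (f(x_2) - f(x_1)) -> (f(x_3) - f(x_2))
        deltas = [b - a for a, b in zip(values, values[1:])]
        return list(zip(deltas, deltas[1:]))
    raise ValueError(f"unknown mode: {mode}")


if __name__ == "__main__":
    stride_sequence = [100, 104, 108, 112]
    for mode in ("function", "next", "history", "deltas"):
        print(mode, training_pairs(stride_sequence, mode))
```

For a constant-stride sequence, the deltas mode reduces to mapping one constant to the same constant, while the other modes must learn a mapping over a growing address range.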

Learning modes: MSE
[Figure: error rate per trained relation (mode)]

Training on real memory streams
- Problem: we don't know the correct address to train on
  - Unlike the bandits case, we must select a single address (per NN)
- Train the context state from depth n (denoted S_n), and propagate it forward through the network
- Select the associated address A_i and do back-propagation
  - Selection starts from a default depth and scans for the next miss
  - Can take into account the current network output: the address with the minimal square error will train fastest
[Diagram: context states S_n, S_n-1, ..., S_2, S_1, S_0 shown above the addresses A_n, A_n-1, ..., A_d, ..., A_0]
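One way to read the address-selection step is sketched below. It is an assumption-laden illustration, not the authors' implementation: addresses are treated as plain integers, the later accesses are given as (address, was_miss) pairs, and the two heuristics from the slide (scan for the next miss from a default depth, or prefer the miss closest to the current network output) are combined in a single hypothetical function.

```python
def select_training_address(predicted, later_accesses, default_depth=0):
    """Choose which later address to associate with a context state.

    predicted:      current network output for the context, or None
    later_accesses: list of (address, was_miss) following the context,
                    in program order
    Returns (depth, address) or None if no miss is found.
    """
    misses = [(depth, addr)
              for depth, (addr, was_miss) in enumerate(later_accesses)
              if was_miss and depth >= default_depth]
    if not misses:
        return None
    if predicted is None:
        # No usable prediction yet: take the next miss after default_depth
        return misses[0]
    # Otherwise prefer the miss with the minimal squared error
    return min(misses, key=lambda m: (m[1] - predicted) ** 2)


if __name__ == "__main__":
    accesses = [(500, False), (900, True), (520, True), (2000, True)]
    print(select_training_address(None, accesses))
    print(select_training_address(510, accesses))
```

Choosing the candidate nearest to the current output keeps the training error small per step, which matches the slide's remark that such an address trains fastest.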

NN prefetching: Recap
- NN benefits
  - NN can imitate various prefetchers for MCF
  - No need to select attributes: BP will train the weights to activate the most useful inputs
- Questions
  - What are the limits on the complexity of patterns we can learn (i.e., different workloads)?
  - How does the prediction accuracy change with the NN width, depth, and structure?

Design space and enhancements
- Network depth and size of hidden layers
  - 3-6 layers, 256 input nodes (context), 64 output (addr/delta), per hidden layer
- Network connectivity
  - Full, CNN (various reduction ratios), LSTM (dedicated nodes)
- Number of output prefetches
  - Up to 4, either as independent networks or with a shared network
- Confidence prediction
  - Predicting usefulness internally through the network, or in an external Bloom filter
- Association heuristics
  - Minimal MSE, best match, or fixed distance (with multiple NNs, we can assign different policies)
  - Dynamic selection of max delta and min association depth

NN prefetch speedup (SPEC06)

NN prefetch speedup (kernels): linked data structures

NN prefetching: Challenges
- Back-propagation is computationally heavy
  - In this work we did a full *magic* back-prop for every memory access
  - But a real back-prop is too slow for that
- Possible alternatives:
  - Train only a subset of useful predictions
  - Reduced FP precision
  - Quantized / binarized NNs

NN prefetching: Challenges
- Technology for an in-core NN is not yet practical
  - Memristors? Off-core accelerator?
- Convergence time of online learning
  - We can control NN parameters (learning rate, momentum)
  - Unlike tabular updates (bandits), each BP affects all NN weights and may break previous updates
- Offline learning is possible, but more difficult
  - Offline traces may not represent real-time workloads
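To make the learning-rate and momentum knobs concrete, here is a minimal sketch of a textbook SGD-with-momentum step next to a single-entry tabular update. Both functions, their names, and the default values are generic illustrations and assumptions; they are not the training code used in this work.

```python
def momentum_step(weights, velocity, grads, lr=0.1, momentum=0.9):
    """One SGD-with-momentum step; every weight in the list is updated."""
    new_velocity = [momentum * v - lr * g for v, g in zip(velocity, grads)]
    new_weights = [w + v for w, v in zip(weights, new_velocity)]
    return new_weights, new_velocity


def tabular_step(table, key, target, lr=0.1):
    """Tabular update: only the entry selected by `key` changes."""
    updated = dict(table)
    old = updated.get(key, 0.0)
    updated[key] = old + lr * (target - old)
    return updated


if __name__ == "__main__":
    w, v = momentum_step([1.0, 2.0], [0.0, 0.0], [0.5, -1.0])
    print(w, v)
    print(tabular_step({"a": 1.0, "b": 2.0}, "a", 2.0))
```

The contrast is the point: the network step moves all weights at once, so a new example can disturb what earlier examples taught, whereas the table entry for "b" is untouched by an update to "a".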

Conclusions
- Semantic locality traces the origins of temporal and spatial locality to program semantics
- NNs do a good job at approximating semantic locality
- NNs can support a generalized prefetcher that targets diverse access patterns, which are today targeted by dedicated prefetchers
- An NN-based prefetcher is, however, theoretical; existing technology does not support a feasible (in terms of timing, area, and power) in-core prefetcher

Thank You

Quantized Neural Networks
Hubara, Courbariaux, Soudry, El-Yaniv, Bengio
https://arxiv.org/pdf/1609.07061
- A hot topic in ML/DL: train using quantized / discrete weights
  - In the extreme case: binary (+/- 1)
- Greatly simplifies inference
  - No need for MACs (multiply-accumulate operations), only simple add/sub operations
- Training accuracy is somewhat harmed, but not more than with stochastic gradient descent (SGD)
- Overall accuracy impact is shown to be small
- Much easier to implement and sustain in HW!
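The claim that binary weights remove the need for multiplies can be seen in a few lines. This is a minimal sketch of inference for a single neuron with weights constrained to +1/-1; the sign-based binarization rule and the function names are assumptions for illustration and are not taken from the cited paper's code.

```python
def binarize(weights):
    """Map real-valued weights to +1 / -1 by sign (zero maps to +1)."""
    return [1 if w >= 0 else -1 for w in weights]


def binary_dot(inputs, signs):
    """Dot product with +1/-1 weights using only additions and subtractions."""
    acc = 0
    for x, s in zip(inputs, signs):
        if s > 0:
            acc += x
        else:
            acc -= x
    return acc


if __name__ == "__main__":
    signs = binarize([0.3, -1.2, 0.0, -0.01])
    print(signs)
    print(binary_dot([2, 3, 5, 7], signs))
```

The result equals the ordinary multiply-accumulate dot product with the binarized weights, but each term costs one add or subtract.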
