Graph Prefetching Using Data Structure Knowledge SAM AINSWORTH, TIMOTHY M. JONES

Size: px

Start display at page:

Download "Graph Prefetching Using Data Structure Knowledge SAM AINSWORTH, TIMOTHY M. JONES"

Horatio Hart
5 years ago
Views:

1 Graph Prefetching Using Data Structure Knowledge SAM AINSWORTH, TIMOTHY M. JONES

2 Background and Motivation Graph applications are memory latency bound Caches & Prefetching are existing solution for memory latency However, irregular access patterns hinder their usefulness Key insight: accesses seem irregular at individual load/store level, but have predictable structure when we consider the high-level algorithm e.g. Breadth-first search (BFS) 2

3 Background and Motivation Existing SW/HW prefetching is insufficient Speedup Both L2 L DCU L DCU IP Both L All L + L2 (a) Hardware prefetchers Speedup HW SW SW+HW (b) Hardware vs software Figure 3: Hardware and software prefetching on Graph 500 search with scale 2, edge factor 0. 3

4 Background and Motivation - BFS Unvisited A Visited C B Will be visited after visiting C So we can prefetch after visiting B... 4

5 Prefetching with Algorithmic Knowledge Design a hardware prefetcher that relies on access patterns specific to algorithms Target BFS, but can support a wider range of algorithms/access patterns Specific to Compressed Sparse Row (CSR) format Prefetcher snoop reads/writes from L cache Achieve an average of 2.3x speedup 5

6 Review: Compressed Sparse Row (CSR) Sparse representation, with a vertex list indirecting to an edge list Authors add a visited list and work list specifically for BFS (a) Graph 7 4 Work List Vertex List Edge List (b) CSR breadth-first search Visited Figure : A compressed sparse row format graph and breadth-first search on it. False True True True True True False True 6

7 Poor Locality of Accesses in Graphs Stall Rate (%) Edge factor 5 Edge factor 0 Edge factor Scale (a) Stall rate Miss Rate (%) Edge factor 5 Edge factor 0 Edge factor Scale (b) L miss rate (c) Source of misses 7

8 Overview of Approach Prefetch all relevant data of o-distance away from the current worklist entry: visited[edgelist[vertexlist[worklist[n+o]]]] Prefetcher snoops the core-to-l mem. accesses to determine which data to prefetch Observation Load from worklist[n] Prefetch vid = worklist[n] Prefetch from vertexlist[vid] Prefetch vid = edgelist[eid] Vertex-O set Mode Action Prefetch worklist[n+o] Prefetch vertexlist[vid] Prefetch edgelist[vertexlist[vid]] to edgelist[vertexlist[vid+]] (2 lines max) Prefetch visited[vid] 8

9 System Architecture Main Memory L2 Cache Work List Vertex List Edge List Visited List Dcache (a) System overview To / From L2 Cache Snoops Prefetch Reqs Prefetched Data DTLB Core EWMA Calculator Address Generator Request Queue Prefetcher Config Figure 5: A graph prefetcher, configured with in 9

10 Prefetcher Microarchitecture Snoops & Prefetched Data From L Cache Address Filter Address Bounds Registers Work List Start Vertex List Start Edge List Start Visited List Start Work List End Vertex List End Edge List End Visited List End Programmer must specify these bounds To DTLB & L Cache Prefetch Request Queue EWMA Unit Prefetch Address Generator Work List Time EWMA Data Time EWMA Ratio Register (b) Prefetcher microarchitecture detail 0

11 Determining Prefetch Distance Easy Case: Time to process a vertex (work_list_time) is less than time to pre-fetch the next vertex (data_time) o work list time = data time work_list_time and data_time vary wildly => use exponentially weighted moving averages (EWMA) Use a safe bound because EWMA often underestimates data_time: o =+ k data time work list time

12 Determining Prefetch Distance Problem: work_list_time > data_time Pre-fetched data is not used timely, might get kicked out of cache before it is used Happens with high-degree vertices Solution: Large vertex mode Base prefetch on how far along we have processed the high-degree vertex»possible because we know the range of the edge indices Prefetch within edgelist for larger vertex Fetch need vertex in worklist when almost done with current vertex s edges 2

13 Extensions Technique can be extended to other algorithms: Parallel BFS Sequentially scanning vertex and edge data (e.g. PageRank) 3

14 Methodology gem5 simulator Set of algorithms from Graph500 and the Boost Graph Library: BFS-like traversal: Connected components, BFS, betweenness-centrality, ST connectivity Sequential access: PageRank, sequential coloring 4

15 Evaluation - BFS-like traversal Speedup s6e0 s9e5s9e0 Stride Stride-Indirect CC s9e5 s2e0s6e0 (a) Graph 500 Graph Search s9e5 s9e0s9e5 s2e0 5

16 Evaluation - BFS-like traversal Stride Stride-Indirect Graph 3 BFS BC ST 2.5 Speedup 2.5 web road web (b) BGL road web road 6

17 Significantly improved L hit-rate L Cache Read Hit Rate s6e0 s9e5 s9e0 s9e5 s2e0 No Prefetching CC Search BFS BC ST s6e0 s9e5 s9e0 s9e5 s2e0 webroad Graph Prefetching webroad webroad Figure 7: Hit rates in the L cache with and without prefetching. 7

18 Prefetching has low overheads Extra Memory Accesses (%) s6e0 s9e5 s9e CC s9e5 s2e0 Search web road BFS BC ST Figure 8: Percentage of additional memory accesses as a result of using our prefetcher. s6e0 s9e5 s9e0 s9e5 s2e0 web road L Utilisation Rate CC Search BFS BC ST Figure 9: Rates of prefetched cache lines that are used before leaving the L cache. 8

19 Prefetching Analysis Most of the benefit comes from prefetching visited & edge lists -> as expected Speedup Proportion s6e0 s9e5 s9e0 Visited Edge Vertex Work CC Search BFS BC ST s9e5 s2e0 s6e0 s9e5 s9e0 s9e5 s2e0 webroad webroad webroad Figure 0: The proportion of speedup from prefetching each data structure within the breadth first search. 9

20 Prefetching works for other traversal types Example: parallel BFS Speedup Graph PF No PF 2 4 Number of Cores Figure : Speedup relative to core with a parallel implementation of Graph500 search with scale 2, edge factor 0 using OpenMP. 20

21 Prefetching works for other traversal types Stride Graph PR SC Speedup web road web road Figure 2: Speedup for di erent types of prefetching when running PageRank and Sequential Colouring. 2

22 Conclusion Prefetching with knowledge of the graph traversal order significantly improves its performance Works for different traversal types (BFS, sequential scan,...) 22

Graph Prefetching Using Data Structure Knowledge

Graph Prefetching Using Data Structure Knowledge Sam Ainsworth University of Cambridge sam.ainsworth@cl.cam.ac.uk Timothy M. Jones University of Cambridge timothy.jones@cl.cam.ac.uk ABSTRACT Searches on