Optimizing Cache Performance for Graph Analytics. Yunming Zhang Presentation


1 Optimizing Cache Performance for Graph Analytics Yunming Zhang Presentation

2 Goals How to optimize in-memory graph applications How to approach performance engineering for almost any workload

3 In-memory Graph Processing Compared to the Disk / DRAM boundary (GraphChi, BigSparse, LLAMA..), the Cache / DRAM boundary has a much smaller latency gap (a cache hit costs a few nanoseconds and a DRAM access roughly a hundred, while Flash is orders of magnitude slower), much larger memory bandwidth (DRAM delivers far more bandwidth than Flash), and much smaller granularity (64-byte cache lines vs. 4 KB or larger pages)

4 Outline Performance Analysis for Graph Applications Milk / Propagation Blocking Frequency based Clustering CSR Segmenting Summary

5 Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server Scott Beamer, Krste Asanović, David Patterson Mostly borrowed from the authors' IISWC presentation

6 Motivation What is the performance bottleneck for graph applications running in memory? How much performance can we gain? How can we achieve the performance gains?

7-11 Graph Algorithms Are Random? "Thus, the low speedup of OOO execution is due solely to a lack of memory bandwidth required to service the repeated last level cache misses caused by the random access memory pattern of the algorithm." [PPoPP] "First, the memory bandwidth of the system seems to limit performance." [SPAA]

12-18 Current Graph Architecture Wisdom? Current Wisdom: Random memory access pattern; Limited by memory bandwidth; Will be plagued by low core utilization. Cray XMT Design: No caches; Heavy multithreading.

19 Are graph applications really memory-bandwidth bound? Is cache really completely useless in graph computations?

20-26 Results from Characterization No single representative workload (need a suite). Out-of-order core not limited by memory bandwidth for most graph workloads (can improve by changing only the processor). Many graph workloads have good locality (caches help! try to avoid thrashing).

27 Target Graph Algorithms Most popular based on 45-paper literature survey: Breadth-First Search (BFS) Single-Source Shortest Paths (SSSP) PageRank (PR) Connected Components (CC) Betweenness Centrality (BC)

28-31 Target Graph Frameworks Galois (custom parallel runtime) - UT Austin: specialized for irregular fine-grain tasks. Ligra (Cilk) - CMU: applies the algorithm in push or pull directions. GAP Benchmark Suite (OpenMP) - UCB: written directly in the most natural way for each algorithm, not constrained by a framework.

32 Target Input Graphs UC Berkeley

Graph                        # Vertices   # Edges     Degree   Diameter   Degree Dist.
Roads of USA                 23.9M        58.3M       2.4      High       constant
Twitter Follow Links         61.6M        1,468.4M    23.8     Low        power
Web Crawl of .sk Domain      50.6M        1,949.4M    38.5     Medium     power
Kronecker Synthetic Graph    128.0M       2,048.0M    16.0     Low        power
Uniform Random Graph         128.0M       2,048.0M    16.0     Low        normal

Graphs can have very different degree distributions, diameters, and other structural characteristics.

33 How does the memory system work? 4

34-40 How does the memory system work? Executing a load that accesses DRAM requires: (1) execution reaches the load instruction (fetch), (2) space in the instruction window, (3) register operands are available (dataflow), (4) memory bandwidth is available (bandwidth ~ # outstanding requests). Memory bandwidth (#4) matters only if (#1-#3) are satisfied.

41-44 Little's Law (MLP = memory-level parallelism): average memory bandwidth = effective MLP / average memory latency. The effective MLP the hardware sustains can be lower than the application's inherent MLP.
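To make the bound concrete, Little's Law can be written out for the memory system as below; the latency and line size are illustrative assumptions for a typical server, not measurements from the talk.

```latex
% Little's Law: outstanding requests = throughput x latency.
% Solving for the bandwidth one core can sustain:
\[
  \text{bandwidth} \;=\; \frac{\text{effective MLP} \times \text{cache line size}}{\text{average memory latency}}
\]
% Example with illustrative numbers: 10 outstanding misses,
% 64-byte cache lines, 80 ns average DRAM latency:
\[
  \frac{10 \times 64\,\text{B}}{80\,\text{ns}} \;=\; 8\ \text{GB/s}
\]
```

A core that can keep only a handful of misses in flight is therefore capped far below the DRAM bandwidth of the socket.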

45-53 Single-Core Memory Bandwidth [chart: achieved memory bandwidth of a pointer-chasing microbenchmark on 1 core with a varying number of parallel pointer chases, annotated with the fetch, window, dataflow, and bandwidth limits]

54 Outline UC Berkeley Methodology Platform Memory Bandwidth Availability Single-core Results Parallel Results GAP Benchmark Suite Conclusion 7

55-64 Importance of Window Size [chart: 1 core with 1 thread, annotated with the fetch, window, dataflow, and bandwidth limits] MLP_max ≈ window size / (IPM + 1), so the instruction window size limits memory bandwidth if misses are rare (high IPM).
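The slide's relation turns into a quick back-of-the-envelope calculation; the window size, instructions per miss, and latency below are illustrative assumptions, and the formula is only the approximate bound sketched above.

```cpp
#include <cstdio>

// Back-of-the-envelope bound implied by the slide: with IPM instructions per
// last-level-cache miss, an instruction window of W entries holds at most
// about W / (IPM + 1) misses at once, which in turn caps memory bandwidth.
int main() {
    const double window_entries = 168.0; // ROB size (assumption, Ivy Bridge-class core)
    const double ipm            = 20.0;  // instructions per LLC miss (assumption)
    const double latency_ns     = 80.0;  // average DRAM latency (assumption)
    const double line_bytes     = 64.0;  // cache line size

    const double mlp_max = window_entries / (ipm + 1.0);       // misses in flight
    const double bw_gb_s = mlp_max * line_bytes / latency_ns;  // bytes per ns == GB/s
    std::printf("MLP_max ~ %.1f, bandwidth bound ~ %.1f GB/s\n", mlp_max, bw_gb_s);
    return 0;
}
```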

65-70 Biggest Influence on Single-Core? [chart: 1 core with 1 thread] Need a suite, there is no single representative workload. Only a few workloads are near the memory bandwidth limit.

71-75 Instruction Window Limits BW [chart: 1 core with 1 thread, window bound annotated] The instruction window limits memory bandwidth.

76 Outline UC Berkeley Methodology Platform Memory Bandwidth Availability Single-core Results Parallel Results GAP Benchmark Suite Conclusion

77-78 Memory Bandwidth ~ Performance [chart: full thread count vs. 1 thread] Increasing memory bandwidth utilization increases performance.

79-80 Parallel Utilization [chart: 16 cores with 32 threads] At the system level, memory bandwidth is under-utilized.

81-86 Multithreading Opportunity fetch: + fewer instructions in flight; window: same hardware (shared); dataflow: ++ more application MLP; bandwidth: same hardware (shared).

87-88 Multithreading Increases Bandwidth [chart: 1 core with 1-2 threads] Speedups imply the workloads are probably bottlenecked by dataflow.

89-94 Conclusions Most graph workloads do not utilize a large fraction of memory bandwidth. Many graph workloads have decent locality. Changing the processor alone could help (cache misses are too infrequent to fit enough of them in the instruction window). Sub-linear parallel speedups cast doubt on gains from multithreading on an OoO core.

95 Outline Performance Analysis for Graph Applications Milk / Propagation Blocking Frequency based Clustering CSR Segmenting Summary

96 Milk / Propagation Blocking Is changing the architecture really the only way to improve the performance of graph applications running in memory? Other approaches: boosting MLP, or reducing the amount of communication.

97 Optimizing Indirect Memory References with milk Vladimir Kiriansky, Yunming Zhang, Saman Amarasinghe MIT. PACT 2016, Haifa, Israel

98 Indirect Accesses

99 Indirect Accesses with OpenMP

100 Indirect Accesses with OpenMP [chart: speedup of OpenMP vs. +Milk, uniform random indices, 8 threads, 8 MB L3]

101 Indirect Accesses with milk (the code adds a milk clause, with if(!milk) guarding the baseline path) [chart: speedup of OpenMP vs. +Milk, uniform random indices, 8 threads, 8 MB L3]
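For context, here is a minimal sketch of the kind of indirect-update loop these slides benchmark; the array names and the atomic update are assumptions, not the talk's exact code.

```cpp
#include <cstdint>
#include <vector>
#include <omp.h>

// Baseline indirect accesses with OpenMP: ind[] and val[] are read
// sequentially, but d[ind[i]] is a random write that may touch a new cache
// line, TLB entry, and DRAM row on every iteration.
void indirect_update(std::vector<double>& d,
                     const std::vector<std::int64_t>& ind,
                     const std::vector<double>& val) {
    #pragma omp parallel for
    for (std::int64_t i = 0; i < (std::int64_t)ind.size(); ++i) {
        #pragma omp atomic            // two iterations may share the same ind[i]
        d[ind[i]] += val[i];
    }
}
```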

102-105 No Locality? Cache miss, TLB miss, DRAM row miss, no prefetching (animation over time).

106-108 Milk Clustering Cache hit, TLB hit, DRAM row hit, effective prefetching. No need for atomics! (animation over time)

109 Outline Milk programming model milk syntax MILK compiler and runtime 4

110 Foundations Milk programming model extending BSP milk syntax OpenMP for C/C++ MILK compiler and runtime LLVM/Clang 4

111 Big (sparse) Data Terabyte working sets (e.g., multi-terabyte AWS VMs): in-memory databases, key-value stores, machine learning, graph analytics.

112 Infinite Cache Locality in Graph Applications [chart: ideal cache hit %, split into temporal and spatial locality, for Road (d=2.4), Twitter (d=24), Web (d=39) on BC (Betweenness Centrality), BFS (Breadth-First Search), CC (Connected Components), PR (PageRank), SSSP (Single-Source Shortest Paths)] [GAPBS]

113 Milk Execution Model Collection Distribution Delivery Propagation Blocking: Binning (Collection + Distribution), Deliver (Accumulation) 44
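A minimal single-threaded sketch of the Collection / Distribution / Delivery idea (propagation blocking) follows; the bin sizing and all names are illustrative assumptions, and the real milk runtime fuses and parallelizes these phases.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Single-threaded sketch of propagation blocking / the milk execution model:
// instead of applying d[ind[i]] += val[i] immediately (random writes over all
// of d), Collection + Distribution bin the (index, value) pairs by destination
// range, and Delivery applies each bin while its slice of d stays cache-resident.
// Bin sizing and names are illustrative assumptions.
void propagation_blocking(std::vector<double>& d,
                          const std::vector<std::uint32_t>& ind,
                          const std::vector<double>& val,
                          std::size_t bin_range) {  // elements of d per cache-sized bin
    const std::size_t nbins = (d.size() + bin_range - 1) / bin_range;
    std::vector<std::vector<std::pair<std::uint32_t, double>>> bins(nbins);

    // Collection + Distribution: sequential reads, mostly-sequential appends.
    for (std::size_t i = 0; i < ind.size(); ++i)
        bins[ind[i] / bin_range].push_back({ind[i], val[i]});

    // Delivery: each bin's updates touch only a cache-sized slice of d.
    for (const auto& bin : bins)
        for (const auto& [idx, v] : bin)
            d[idx] += v;
}
```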

114-119 Collection, Distribution, Delivery (animation of d[...] += f(i): Collection gathers each update's target index and value f(i), Distribution groups the updates by index, and Delivery applies each group in turn).

120 milk syntax milk clause in parallel loop milk directive per indirect access tag address to group by pack additional state f() 48
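A rough sketch of what milk-annotated code looks like, based only on this slide's description (a milk clause on the parallel loop, plus a per-access milk directive that tags the grouping address and packs extra state); the exact directive grammar is an assumption and may differ from the real milk compiler.

```cpp
#include <cstdint>
#include <vector>

// Approximate shape of milk-annotated code (see the caveat above): the milk
// clause on the parallel loop defers the indirect writes, and the per-access
// milk directive packs the extra state the deferred update needs, grouped by
// the address of the indirect access on the following statement.
void indirect_update_milk(std::vector<double>& d,
                          const std::vector<std::int64_t>& ind,
                          const std::vector<double>& val) {
    #pragma omp parallel for milk
    for (std::int64_t i = 0; i < (std::int64_t)ind.size(); ++i) {
        double v = val[i];        // state needed when the update is delivered
        #pragma milk pack(v)
        d[ind[i]] += v;           // clustered by &d[ind[i]], no atomics needed
    }
}
```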

121 pack Combiners 49

122 PageRank 7.5 5

123 PageRank with OpenMP 5

124 PageRank with milk 5

125 PageRank with milk 5

126 MILK compiler and runtime Collection loop transformation Distribution runtime library Delivery continuation 54

127 PageRank with milk

128 PageRank: Collection

129 Tag Distribution L pails 9-bit radix partition 57

130 Tag Distribution L.5 pails

131 Tag Distribution L.5 pails

132 Distribution: Pail Overflow. L pails DRAM tubs

133 Milk Delivery DRAM tubs L

134 Milk Delivery DRAM tubs L 6

135 Related Work Database JOIN optimizations [Shatdal94] cache partitioning [Manegold, Kim9, Albutiu, Balkesen5] TLB, SIMD, NUMA, non-temporal writes, software write buffers 6

136 Overall Speedup with milk [chart: speedup for BC (Betweenness Centrality), BFS (Breadth-First Search), CC (Connected Components), PR (PageRank), SSSP (Single-Source Shortest Paths); 8 MB L3]

137 Overall Speedup with milk [chart: speedup for BC, BFSd, BFSp, CC, PR, SSSP at several graph sizes; 8 MB L3]

138 Stall Cycle Reduction [chart: % of total cycles spent in L2 miss stalls (256 KB L2) and L3 miss stalls (8 MB L3), baseline vs. milk; PageRank on a uniform graph]

139 Indirect Access Cache Hit % [chart: cache hit % for BC, BFS, CC, PR, SSSP, baseline vs. milk; 8 MB L3, 256 KB L2]

140 Higher Degree, Higher Locality [chart: speedup vs. average degree for two vertex counts, as the edge count grows from millions to billions]

141 Related Works How is Milk different from BigSparse and Propagation Blocking? 69

142 Related Work BigSparse: can work on graphs with both good and bad locality (Milk and Propagation Blocking both target low-locality graphs), and can afford a global sort instead of bucketing. Propagation Blocking: Milk doesn't have two separate phases for Binning and Accumulate (Collection, Distribution, and Delivery are all fused together using coroutines); PB reuses the tags to save memory bandwidth, assuming the application is iterative; Milk has a more general programming model for various applications.

143 Outline Performance Analysis for Graph Applications Milk / Propagation Blocking Frequency based Clustering CSR Segmenting Summary

144 Making Caches Work for Graph Analytics Yunming Zhang, Vladimir Kiriansky, Charith Mendis, Matei Zaharia*, Saman Amarasinghe MIT CSAIL and *Stanford InfoLab 7

145 Outline PageRank Frequency based Vertex Reordering Cache-aware Segmenting Evaluation 7

146 PageRank while for node : graph.vertices for ngh : graph.getInNeighbors(node) newranks[node] += ranks[ngh]/outdegree[ngh]; for node : graph.vertices newranks[node] = basescore + damping*newranks[node]; swap ranks and newranks (slides 147-151 step through this code)
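For reference, here is the same pull-direction PageRank step written out as runnable C++ over a CSR graph; the array names and the CSR layout are illustrative assumptions, not the talk's exact code.

```cpp
#include <vector>

// Pull-direction PageRank step matching the slide's pseudocode, over a CSR
// graph given by in_offsets/in_neigh (incoming edges).
void pagerank_step(const std::vector<int>& in_offsets,   // size n+1
                   const std::vector<int>& in_neigh,     // size m
                   const std::vector<int>& out_degree,   // size n
                   const std::vector<double>& ranks,
                   std::vector<double>& new_ranks,
                   double base_score, double damping) {
    const int n = (int)out_degree.size();
    for (int node = 0; node < n; ++node) {
        double sum = 0.0;
        for (int e = in_offsets[node]; e < in_offsets[node + 1]; ++e) {
            int ngh = in_neigh[e];
            sum += ranks[ngh] / out_degree[ngh];   // random access into ranks[]
        }
        new_ranks[node] = base_score + damping * sum;
    }
}
```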

152 Cache PageRank (cache simulation of the ranks[] accesses) Focus on the random memory accesses to the ranks array; in this example the cache holds one cache line and the ranks array is stored in two cache lines. Slides 152-172 step through the inner loop, counting hits and misses as it bounces between the two lines.

173 Cache PageRank A very high miss rate (6 misses in this toy example)

174 Cache PageRank Working set larger than cache

175 PageRank Often only use part of the cache line

176-179 Performance Bottleneck Working set much larger than cache size: real-world graphs often have working sets many times larger than the cache. Access pattern is random. Often uses only part of the cache line: often only 1/16 - 1/8 of a cache line on modern hardware (one 4-8 byte value out of a 64-byte line). Cannot benefit from hardware prefetching. TLB misses and DRAM row misses (hundreds of cycles).

180 PageRank while for node : graph.vertices for ngh : graph.getInNeighbors(node) newranks[node] += ranks[ngh]/outdegree[ngh]; for node : graph.vertices newranks[node] = basescore + damping*newranks[node]; swap ranks and newranks [chart: normalized total cycles on the RMAT27 graph, BaseLine]

181 PageRank Up to 80% of the cycles are spent on slow random memory accesses [same chart]

182 PageRank, removing the random accesses (incorrect): newranks[node] += ranks[0]/outdegree[0]; (all random reads replaced by a fixed index) [chart adds a No Random bar]

183 PageRank, removing the random accesses (incorrect): a several-fold speedup if we can eliminate random memory accesses [chart: BaseLine vs. No Random]

184 PageRank The cache-optimized version is within a small factor of the no-random-access version [chart: BaseLine, Cache Optimized, No Random]

185 Outline PageRank Frequency based Vertex Reordering Cache-aware Segmenting Evaluation 6

186-187 Frequency based Vertex Reordering Key Observations: Cache lines are underutilized; certain vertices are much more likely to be accessed than other vertices. Design: group together the frequently accessed nodes, and keep the ordering of the average-degree nodes.
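A short sketch of this design in code; the threshold-based notion of "frequently accessed" (high out-degree) and the function names are illustrative assumptions.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Frequency-based vertex reordering as described on the slide: vertices whose
// data is accessed often (high out-degree) get the lowest new IDs so their
// rank entries share cache lines; all remaining vertices keep their original
// relative order. The threshold choice is an assumption.
std::vector<int> frequency_reorder(const std::vector<int>& out_degree,
                                   int hot_threshold) {
    const int n = (int)out_degree.size();
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    // Stable partition: hot vertices first, everyone else in original order.
    std::stable_partition(order.begin(), order.end(),
                          [&](int v) { return out_degree[v] >= hot_threshold; });
    // new_id[old_id] gives the relabeling applied to the edge arrays and data.
    std::vector<int> new_id(n);
    for (int pos = 0; pos < n; ++pos) new_id[order[pos]] = pos;
    return new_id;
}
```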

188-194 Frequency based Vertex Reordering (animation: each node is labeled with its outdegree; the high-outdegree nodes are grouped together, the affected nodes are reordered, and the corresponding edges are relabeled).

195-197 Frequency based Vertex Reordering Reorganize the nodes' data: this groups the data of frequently accessed nodes together in one cache line.

198 Cache PageRank (cache simulation after reordering) Focus on the random memory accesses to the ranks array; slides 198-214 replay the same inner loop on the reordered graph, counting hits and misses.

215 Cache PageRank Much better than before (#hits: 4 now, vs. 6 misses in the baseline)

216 Cache PageRank Better cache line utilization

217 Outline PageRank Frequency based Vertex Reordering Cache-aware Segmenting Evaluation

218 Cache-aware Segmenting Design Partition the graph into subgraphs where the random accesses are limited to the LLC. Process each partition sequentially and accumulate rank contributions for each partition. Merge the rank contributions from all subgraphs.
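A compact sketch of this design for the pull-style PageRank step; the segment layout, the per-segment partial arrays, and the chunked merge below are illustrative assumptions (a real implementation would store the partial contributions more compactly).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Each segment holds the edges whose source vertices fall in one cache-sized
// range, so the random reads of ranks[] inside a segment stay in the LLC.
struct Segment {
    std::vector<int> src;           // source vertex of each edge (within the segment's range)
    std::vector<int> dst;           // destination vertex of each edge
    std::vector<double> partial;    // per-destination partial sums (size n, simplified)
};

// Phase 1: process each segment sequentially; the ranks[] reads are cache-resident.
void process_segments(std::vector<Segment>& segments,
                      const std::vector<double>& ranks,
                      const std::vector<int>& out_degree) {
    for (auto& seg : segments)
        for (std::size_t e = 0; e < seg.dst.size(); ++e) {
            const int s = seg.src[e];
            seg.partial[seg.dst[e]] += ranks[s] / out_degree[s];
        }
}

// Phase 2: cache-aware merge. Sum the partials chunk by chunk so that each
// chunk of new_ranks (and the matching slice of every partial array) fits in cache.
void cache_aware_merge(const std::vector<Segment>& segments,
                       std::vector<double>& new_ranks, std::size_t chunk) {
    for (std::size_t lo = 0; lo < new_ranks.size(); lo += chunk) {
        const std::size_t hi = std::min(lo + chunk, new_ranks.size());
        for (const auto& seg : segments)
            for (std::size_t v = lo; v < hi; ++v)
                new_ranks[v] += seg.partial[v];
    }
}
```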

219-225 Graph Partitioning Partition the original graph into subgraphs that each access only a subset of the nodes' data (animation).

226-240 Graph Processing (cache simulation of processing the partitioned graph: within each subgraph, the random accesses stay inside the cached subset of nodes) Only a few misses (#hits: 5).

241 Graph Processing Better than Frequency based Reordering (#hits: 5 here vs. #hits: 4 with reordering).

242-252 Cache-aware Merge (animation: naively merging the subgraphs' intermediate updates into the final ranks).

253 Cache-aware Merge The naive approach incurs random DRAM accesses 54

254-258 Cache-aware Merge Break down the merge into chunks that fit in cache.

259-265 Cache-aware Merge Sum up the intermediate updates from the two subgraphs, one chunk at a time.

266 PageRank while for node : graph.vertices for ngh : graph.getInNeighbors(node) newranks[node] += ranks[ngh]/outdegree[ngh]; for node : graph.vertices newranks[node] = basescore + damping*newranks[node]; swap ranks and newranks [chart: normalized total cycles on the RMAT27 graph for BaseLine, Reordering, Segmenting, and Reordering + Segmenting]

267-268 PageRank Cycle reduction from the individual optimizations [same chart]

269 PageRank Cycle reduction from Reordering + Segmenting [same chart]

270 Related Work Distributed Graph Systems: shared-memory efficiency is a key component of distributed graph processing systems (PowerGraph, GraphLab, Pregel..). Shared-memory Graph Systems: frameworks (Ligra, Galois, GraphMat..) did not focus on cache optimizations; Milk [PACT 2016], Propagation Blocking [IPDPS 2017]. Out-of-core Systems (GraphChi, X-Stream).

271 Outline Motivation Frequency based Vertex Reordering Cache-aware Segmenting Evaluation 65

272 Evaluation Absolute running times on Intel Xeon E5 servers:

Graph        PageRank (20 iter)   Label Propagation (per iter)   Betweenness Centrality (per start node)
Twitter      5.8s                 .7s                            .s
RMAT27       .6s                  .5s                            .85s
Web Graph    8.6s                 .4s                            .875s

273 Evaluation On a single machine, we can complete 20 iterations of PageRank on the 41-million-node Twitter graph within 6s.

274 Evaluation The best previously published result is .7s (Gemini, OSDI 16).

275 Evaluation Very fast execution on label propagation, which is used in Connected Components and SSSP (Bellman-Ford).

276 PageRank [chart: slowdown relative to Ours for HandOptC++, GraphMat, Ligra, GridGraph on Twitter, RMAT25, RMAT27, SD]

277 PageRank The Intel expert hand-optimized version and state-of-the-art graph frameworks are consistently slower than our version [same chart]

278 Label Propagation [chart: slowdown relative to Ours for HandOptC++ and Ligra on Twitter, RMAT25, RMAT27, SD]

279 Label Propagation The Intel expert hand-optimized version and state-of-the-art graph frameworks are up to 6.7x slower than our version [same chart]

280 Evaluation Cycles stalled on memory per edge: Hand Optimized C++ vs. Ours [chart: LiveJournal, RMAT25, Twitter, SD, RMAT27]

281 Evaluation For the hand-optimized baseline, cycles stalled on memory per edge increase as the graph gets larger [same chart]

282 Evaluation For our version, cycles stalled on memory per edge stay roughly constant as the graph gets larger [same chart]

283 Summary Performance Bottleneck of Graph Applications Frequency based Vertex Reordering Cache-aware Segmenting 77

284 Outline Performance Analysis for Graph Applications Milk / Propagation Blocking Frequency based Clustering CSR Segmenting Summary

285 Improving Cache Performance for Graph Computations Reordering the Graph Partitioning the Graph for Locality Runtime Reordering the Memory Accesses

286 Improving Cache Performance for Graph Computations Reordering the Graph Partitioning the Graph for Locality What are the tradeoffs? Runtime Reordering the Memory Accesses

287 Improving Cache Performance for Graph Computations Reordering the Graph: small preprocessing cost, modest performance improvement, dependent on graph structure. Partitioning the Graph for Locality: bigger preprocessing cost, small runtime overhead, bigger performance gains, suitable for applications with lots of random accesses. Runtime Reordering of the Memory Accesses: no preprocessing cost, bigger runtime overhead.

288 Outside of Graph Computing? Sparse Linear Algebra: matrix reordering and preconditioning (graph reordering), cache blocking (CSR segmenting), inspector-executor (runtime access reordering). These are fundamental communication-reduction techniques, used in many other domains (sparse linear algebra, join optimization in databases).

289 Performance Engineering Understand your application's performance characteristics (many papers have explored abandoning caches and improving MLP with a large number of threads). Understand the tradeoff space of the optimizations. Pick the technique that best suits your hardware, application, and data.


More information

Computer Caches. Lab 1. Caching

Computer Caches. Lab 1. Caching Lab 1 Computer Caches Lab Objective: Caches play an important role in computational performance. Computers store memory in various caches, each with its advantages and drawbacks. We discuss the three main

More information

arxiv: v1 [cs.db] 21 Oct 2017

arxiv: v1 [cs.db] 21 Oct 2017 : High-performance external graph analytics Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology

More information

Memory hierarchy review. ECE 154B Dmitri Strukov

Memory hierarchy review. ECE 154B Dmitri Strukov Memory hierarchy review ECE 154B Dmitri Strukov Outline Cache motivation Cache basics Six basic optimizations Virtual memory Cache performance Opteron example Processor-DRAM gap in latency Q1. How to deal

More information

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823

More information

Architecture-Conscious Database Systems

Architecture-Conscious Database Systems Architecture-Conscious Database Systems 2009 VLDB Summer School Shanghai Peter Boncz (CWI) Sources Thank You! l l l l Database Architectures for New Hardware VLDB 2004 tutorial, Anastassia Ailamaki Query

More information

On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs

On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs Sungpack Hong 2, Nicole C. Rodia 1, and Kunle Olukotun 1 1 Pervasive Parallelism Laboratory, Stanford University

More information

The GraphGrind Framework: Fast Graph Analytics on Large. Shared-Memory Systems

The GraphGrind Framework: Fast Graph Analytics on Large. Shared-Memory Systems Memory-Centric System Level Design of Heterogeneous Embedded DSP Systems The GraphGrind Framework: Scott Johan Fischaber, BSc (Hon) Fast Graph Analytics on Large Shared-Memory Systems Jiawen Sun, PhD candidate

More information

Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot

Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot Sort vs. Hash Join Revisited for Near-Memory Execution Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot 1 Near-Memory Processing (NMP) Emerging technology Stacked memory: A logic die w/ a stack

More information

Why do we need graph processing?

Why do we need graph processing? Why do we need graph processing? Community detection: suggest followers? Determine what products people will like Count how many people are in different communities (polling?) Graphs are Everywhere Group

More information

Memory Systems and Performance Engineering

Memory Systems and Performance Engineering SPEED LIMIT PER ORDER OF 6.172 Memory Systems and Performance Engineering Fall 2010 Basic Caching Idea A. Smaller memory faster to access B. Use smaller memory to cache contents of larger memory C. Provide

More information

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] CSF Improving Cache Performance [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user

More information

A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination

A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination 1 1 A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination David M. Koppelman Department of Electrical & Computer Engineering Louisiana State University, Baton Rouge koppel@ee.lsu.edu

More information

Lecture-14 (Memory Hierarchy) CS422-Spring

Lecture-14 (Memory Hierarchy) CS422-Spring Lecture-14 (Memory Hierarchy) CS422-Spring 2018 Biswa@CSE-IITK The Ideal World Instruction Supply Pipeline (Instruction execution) Data Supply - Zero-cycle latency - Infinite capacity - Zero cost - Perfect

More information

Data Modeling and Databases Ch 10: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Data Modeling and Databases Ch 10: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Data Modeling and Databases Ch 10: Query Processing - Algorithms Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Transactions (Locking, Logging) Metadata Mgmt (Schema, Stats) Application

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary

More information

Handout 4 Memory Hierarchy

Handout 4 Memory Hierarchy Handout 4 Memory Hierarchy Outline Memory hierarchy Locality Cache design Virtual address spaces Page table layout TLB design options (MMU Sub-system) Conclusion 2012/11/7 2 Since 1980, CPU has outpaced

More information

CACHE-GUIDED SCHEDULING

CACHE-GUIDED SCHEDULING CACHE-GUIDED SCHEDULING EXPLOITING CACHES TO MAXIMIZE LOCALITY IN GRAPH PROCESSING Anurag Mukkara, Nathan Beckmann, Daniel Sanchez 1 st AGP Toronto, Ontario 24 June 2017 Graph processing is memory-bound

More information

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader A New Parallel Algorithm for Connected Components in Dynamic Graphs Robert McColl Oded Green David Bader Overview The Problem Target Datasets Prior Work Parent-Neighbor Subgraph Results Conclusions Problem

More information

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading CS 152 Computer Architecture and Engineering Lecture 18: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

Cache Optimisation. sometime he thought that there must be a better way

Cache Optimisation. sometime he thought that there must be a better way Cache sometime he thought that there must be a better way 2 Cache 1. Reduce miss rate a) Increase block size b) Increase cache size c) Higher associativity d) compiler optimisation e) Parallelism f) prefetching

More information

Performance! (1/latency)! 1000! 100! 10! Capacity Access Time Cost. CPU Registers 100s Bytes <10s ns. Cache K Bytes ns 1-0.

Performance! (1/latency)! 1000! 100! 10! Capacity Access Time Cost. CPU Registers 100s Bytes <10s ns. Cache K Bytes ns 1-0. Since 1980, CPU has outpaced DRAM... EEL 5764: Graduate Computer Architecture Appendix C Hierarchy Review Ann Gordon-Ross Electrical and Computer Engineering University of Florida http://www.ann.ece.ufl.edu/

More information

CS152 Computer Architecture and Engineering Lecture 17: Cache System

CS152 Computer Architecture and Engineering Lecture 17: Cache System CS152 Computer Architecture and Engineering Lecture 17 System March 17, 1995 Dave Patterson (patterson@cs) and Shing Kong (shing.kong@eng.sun.com) Slides available on http//http.cs.berkeley.edu/~patterson

More information

X-Stream: A Case Study in Building a Graph Processing System. Amitabha Roy (LABOS)

X-Stream: A Case Study in Building a Graph Processing System. Amitabha Roy (LABOS) X-Stream: A Case Study in Building a Graph Processing System Amitabha Roy (LABOS) 1 X-Stream Graph processing system Single Machine Works on graphs stored Entirely in RAM Entirely in SSD Entirely on Magnetic

More information

1. Creates the illusion of an address space much larger than the physical memory

1. Creates the illusion of an address space much larger than the physical memory Virtual memory Main Memory Disk I P D L1 L2 M Goals Physical address space Virtual address space 1. Creates the illusion of an address space much larger than the physical memory 2. Make provisions for

More information

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory

More information

The Processor Memory Hierarchy

The Processor Memory Hierarchy Corrected COMP 506 Rice University Spring 2018 The Processor Memory Hierarchy source code IR Front End Optimizer Back End IR target code Copyright 2018, Keith D. Cooper & Linda Torczon, all rights reserved.

More information

CS 152 Computer Architecture and Engineering. Lecture 9 - Address Translation

CS 152 Computer Architecture and Engineering. Lecture 9 - Address Translation CS 152 Computer Architecture and Engineering Lecture 9 - Address Translation Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste

More information

Introduction to OpenMP. Lecture 10: Caches

Introduction to OpenMP. Lecture 10: Caches Introduction to OpenMP Lecture 10: Caches Overview Why caches are needed How caches work Cache design and performance. The memory speed gap Moore s Law: processors speed doubles every 18 months. True for

More information

Course Administration

Course Administration Spring 207 EE 363: Computer Organization Chapter 5: Large and Fast: Exploiting Memory Hierarchy - Avinash Kodi Department of Electrical Engineering & Computer Science Ohio University, Athens, Ohio 4570

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Cray XE6 Performance Workshop

Cray XE6 Performance Workshop Cray XE6 Performance Workshop Mark Bull David Henty EPCC, University of Edinburgh Overview Why caches are needed How caches work Cache design and performance. 2 1 The memory speed gap Moore s Law: processors

More information

and data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed

and data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed 5.3 By convention, a cache is named according to the amount of data it contains (i.e., a 4 KiB cache can hold 4 KiB of data); however, caches also require SRAM to store metadata such as tags and valid

More information

Computer Science 146. Computer Architecture

Computer Science 146. Computer Architecture Computer Architecture Spring 2004 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 18: Virtual Memory Lecture Outline Review of Main Memory Virtual Memory Simple Interleaving Cycle

More information

Memory. From Chapter 3 of High Performance Computing. c R. Leduc

Memory. From Chapter 3 of High Performance Computing. c R. Leduc Memory From Chapter 3 of High Performance Computing c 2002-2004 R. Leduc Memory Even if CPU is infinitely fast, still need to read/write data to memory. Speed of memory increasing much slower than processor

More information

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13 Bigtable A Distributed Storage System for Structured Data Presenter: Yunming Zhang Conglong Li References SOCC 2010 Key Note Slides Jeff Dean Google Introduction to Distributed Computing, Winter 2008 University

More information

Fine-Grained Task Migration for Graph Algorithms using Processing in Memory

Fine-Grained Task Migration for Graph Algorithms using Processing in Memory Fine-Grained Task Migration for Graph Algorithms using Processing in Memory Paula Aguilera 1, Dong Ping Zhang 2, Nam Sung Kim 3 and Nuwan Jayasena 2 1 Dept. of Electrical and Computer Engineering University

More information

Benchmarking the Memory Hierarchy of Modern GPUs

Benchmarking the Memory Hierarchy of Modern GPUs 1 of 30 Benchmarking the Memory Hierarchy of Modern GPUs In 11th IFIP International Conference on Network and Parallel Computing Xinxin Mei, Kaiyong Zhao, Chengjian Liu, Xiaowen Chu CS Department, Hong

More information

Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing

Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing Laxman Dhulipala Joint work with Guy Blelloch and Julian Shun SPAA 07 Giant graph datasets Graph V E (symmetrized) com-orkut

More information

Page 1. Memory Hierarchies (Part 2)

Page 1. Memory Hierarchies (Part 2) Memory Hierarchies (Part ) Outline of Lectures on Memory Systems Memory Hierarchies Cache Memory 3 Virtual Memory 4 The future Increasing distance from the processor in access time Review: The Memory Hierarchy

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

Lecture 22 : Distributed Systems for ML

Lecture 22 : Distributed Systems for ML 10-708: Probabilistic Graphical Models, Spring 2017 Lecture 22 : Distributed Systems for ML Lecturer: Qirong Ho Scribes: Zihang Dai, Fan Yang 1 Introduction Big data has been very popular in recent years.

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

CuSha: Vertex-Centric Graph Processing on GPUs

CuSha: Vertex-Centric Graph Processing on GPUs CuSha: Vertex-Centric Graph Processing on GPUs Farzad Khorasani, Keval Vora, Rajiv Gupta, Laxmi N. Bhuyan HPDC Vancouver, Canada June, Motivation Graph processing Real world graphs are large & sparse Power

More information

Memory Hierarchy, Fully Associative Caches. Instructor: Nick Riasanovsky

Memory Hierarchy, Fully Associative Caches. Instructor: Nick Riasanovsky Memory Hierarchy, Fully Associative Caches Instructor: Nick Riasanovsky Review Hazards reduce effectiveness of pipelining Cause stalls/bubbles Structural Hazards Conflict in use of datapath component Data

More information

ECE232: Hardware Organization and Design

ECE232: Hardware Organization and Design ECE232: Hardware Organization and Design Lecture 21: Memory Hierarchy Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Overview Ideally, computer memory would be large and fast

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

L2 cache provides additional on-chip caching space. L2 cache captures misses from L1 cache. Summary

L2 cache provides additional on-chip caching space. L2 cache captures misses from L1 cache. Summary HY425 Lecture 13: Improving Cache Performance Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 25, 2011 Dimitrios S. Nikolopoulos HY425 Lecture 13: Improving Cache Performance 1 / 40

More information

Key Point. What are Cache lines

Key Point. What are Cache lines Caching 1 Key Point What are Cache lines Tags Index offset How do we find data in the cache? How do we tell if it s the right data? What decisions do we need to make in designing a cache? What are possible

More information

EEC 170 Computer Architecture Fall Improving Cache Performance. Administrative. Review: The Memory Hierarchy. Review: Principle of Locality

EEC 170 Computer Architecture Fall Improving Cache Performance. Administrative. Review: The Memory Hierarchy. Review: Principle of Locality Administrative EEC 7 Computer Architecture Fall 5 Improving Cache Performance Problem #6 is posted Last set of homework You should be able to answer each of them in -5 min Quiz on Wednesday (/7) Chapter

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

Memory Management! How the hardware and OS give application pgms:" The illusion of a large contiguous address space" Protection against each other"

Memory Management! How the hardware and OS give application pgms: The illusion of a large contiguous address space Protection against each other Memory Management! Goals of this Lecture! Help you learn about:" The memory hierarchy" Spatial and temporal locality of reference" Caching, at multiple levels" Virtual memory" and thereby " How the hardware

More information