Optimizing Cache Performance for Graph Analytics. Yunming Zhang Presentation


1 Optimizing Cache Performance for Graph Analytics Yunming Zhang Presentation

2 Goals How to optimize in-memory graph applications How to approach performance engineering for almost any workload

3 In-memory Graph Processing Compared to the Disk / DRAM boundary (GraphChi, BigSparse, LLAMA..), the Cache / DRAM boundary has a much smaller latency gap (a cache hit costs a few nanoseconds and a DRAM access roughly a hundred, while Flash is orders of magnitude slower), much larger memory bandwidth (DRAM delivers far more bandwidth than Flash), and much smaller granularity (64-byte cache lines vs. 4 KB or larger pages)

4 Outline Performance Analysis for Graph Applications Milk / Propagation Blocking Frequency based Clustering CSR Segmenting Summary

5 Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server Scott Beamer, Krste Asanović, David Patterson Mostly borrowed from the authors' IISWC presentation

6 Motivation What is the performance bottleneck for graph applications running in memory? How much performance can we gain? How can we achieve the performance gains?

7-11 Graph Algorithms Are Random? "Thus, the low speedup of OOO execution is due solely to a lack of memory bandwidth required to service the repeated last level cache misses caused by the random access memory pattern of the algorithm." [PPoPP] "First, the memory bandwidth of the system seems to limit performance." [SPAA]

12-18 Current Graph Architecture Wisdom? Current Wisdom: Random memory access pattern; Limited by memory bandwidth; Will be plagued by low core utilization. Cray XMT Design: No caches; Heavy multithreading.

19 Are graph applications really memory-bandwidth bound? Is cache really completely useless in graph computations?

20-26 Results from Characterization No single representative workload (need a suite). Out-of-order core not limited by memory bandwidth for most graph workloads (can improve by changing only the processor). Many graph workloads have good locality (caches help! try to avoid thrashing).

27 Target Graph Algorithms Most popular based on 45-paper literature survey: Breadth-First Search (BFS) Single-Source Shortest Paths (SSSP) PageRank (PR) Connected Components (CC) Betweenness Centrality (BC)

28-31 Target Graph Frameworks Galois (custom parallel runtime) - UT Austin: specialized for irregular fine-grain tasks. Ligra (Cilk) - CMU: applies the algorithm in push or pull directions. GAP Benchmark Suite (OpenMP) - UCB: written directly in the most natural way for each algorithm, not constrained by a framework.

32 Target Input Graphs UC Berkeley

Graph                        # Vertices   # Edges     Degree   Diameter   Degree Dist.
Roads of USA                 23.9M        58.3M       2.4      High       constant
Twitter Follow Links         61.6M        1,468.4M    23.8     Low        power
Web Crawl of .sk Domain      50.6M        1,949.4M    38.5     Medium     power
Kronecker Synthetic Graph    128.0M       2,048.0M    16.0     Low        power
Uniform Random Graph         128.0M       2,048.0M    16.0     Low        normal

Graphs can have very different degree distributions, diameters, and other structural characteristics.

33 How does the memory system work? 4

34-40 How does the memory system work? Executing a load that accesses DRAM requires: (1) execution reaches the load instruction (fetch), (2) space in the instruction window, (3) register operands are available (dataflow), (4) memory bandwidth is available (bandwidth ~ # outstanding requests). Memory bandwidth (#4) matters only if (#1-#3) are satisfied.

41-44 Little's Law (MLP = memory-level parallelism): average memory bandwidth = effective MLP / average memory latency. The effective MLP the hardware sustains can be lower than the application's inherent MLP.
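To make the bound concrete, Little's Law can be written out for the memory system as below; the latency and line size are illustrative assumptions for a typical server, not measurements from the talk.

```latex
% Little's Law: outstanding requests = throughput x latency.
% Solving for the bandwidth one core can sustain:
\[
  \text{bandwidth} \;=\; \frac{\text{effective MLP} \times \text{cache line size}}{\text{average memory latency}}
\]
% Example with illustrative numbers: 10 outstanding misses,
% 64-byte cache lines, 80 ns average DRAM latency:
\[
  \frac{10 \times 64\,\text{B}}{80\,\text{ns}} \;=\; 8\ \text{GB/s}
\]
```

A core that can keep only a handful of misses in flight is therefore capped far below the DRAM bandwidth of the socket.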

45-53 Single-Core Memory Bandwidth [chart: achieved memory bandwidth of a pointer-chasing microbenchmark on 1 core with a varying number of parallel pointer chases, annotated with the fetch, window, dataflow, and bandwidth limits]

54 Outline UC Berkeley Methodology Platform Memory Bandwidth Availability Single-core Results Parallel Results GAP Benchmark Suite Conclusion 7

55-64 Importance of Window Size [chart: 1 core with 1 thread, annotated with the fetch, window, dataflow, and bandwidth limits] MLP_max ≈ window size / (IPM + 1), so the instruction window size limits memory bandwidth if misses are rare (high IPM).
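The slide's relation turns into a quick back-of-the-envelope calculation; the window size, instructions per miss, and latency below are illustrative assumptions, and the formula is only the approximate bound sketched above.

```cpp
#include <cstdio>

// Back-of-the-envelope bound implied by the slide: with IPM instructions per
// last-level-cache miss, an instruction window of W entries holds at most
// about W / (IPM + 1) misses at once, which in turn caps memory bandwidth.
int main() {
    const double window_entries = 168.0; // ROB size (assumption, Ivy Bridge-class core)
    const double ipm            = 20.0;  // instructions per LLC miss (assumption)
    const double latency_ns     = 80.0;  // average DRAM latency (assumption)
    const double line_bytes     = 64.0;  // cache line size

    const double mlp_max = window_entries / (ipm + 1.0);       // misses in flight
    const double bw_gb_s = mlp_max * line_bytes / latency_ns;  // bytes per ns == GB/s
    std::printf("MLP_max ~ %.1f, bandwidth bound ~ %.1f GB/s\n", mlp_max, bw_gb_s);
    return 0;
}
```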

65-70 Biggest Influence on Single-Core? [chart: 1 core with 1 thread] Need a suite, there is no single representative workload. Only a few workloads are near the memory bandwidth limit.

71-75 Instruction Window Limits BW [chart: 1 core with 1 thread, window bound annotated] The instruction window limits memory bandwidth.

76 Outline UC Berkeley Methodology Platform Memory Bandwidth Availability Single-core Results Parallel Results GAP Benchmark Suite Conclusion

77-78 Memory Bandwidth ~ Performance [chart: full thread count vs. 1 thread] Increasing memory bandwidth utilization increases performance.

79-80 Parallel Utilization [chart: 16 cores with 32 threads] At the system level, memory bandwidth is under-utilized.

81-86 Multithreading Opportunity fetch: + fewer instructions in flight; window: same hardware (shared); dataflow: ++ more application MLP; bandwidth: same hardware (shared).

87-88 Multithreading Increases Bandwidth [chart: 1 core with 1-2 threads] Speedups imply the workloads are probably bottlenecked by dataflow.

89-94 Conclusions Most graph workloads do not utilize a large fraction of memory bandwidth. Many graph workloads have decent locality. Changing the processor alone could help (cache misses are too infrequent to fit enough of them in the instruction window). Sub-linear parallel speedups cast doubt on gains from multithreading on an OoO core.

95 Outline Performance Analysis for Graph Applications Milk / Propagation Blocking Frequency based Clustering CSR Segmenting Summary

96 Milk / Propagation Blocking Is changing the architecture really the only way to improve the performance of graph applications running in memory? Other approaches: boosting MLP, or reducing the amount of communication.

97 Optimizing Indirect Memory References with milk Vladimir Kiriansky, Yunming Zhang, Saman Amarasinghe MIT. PACT 2016, Haifa, Israel

98 Indirect Accesses

99 Indirect Accesses with OpenMP

100 Indirect Accesses with OpenMP [chart: speedup of OpenMP vs. +Milk, uniform random indices, 8 threads, 8 MB L3]

101 Indirect Accesses with milk (the code adds a milk clause, with if(!milk) guarding the baseline path) [chart: speedup of OpenMP vs. +Milk, uniform random indices, 8 threads, 8 MB L3]
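For context, here is a minimal sketch of the kind of indirect-update loop these slides benchmark; the array names and the atomic update are assumptions, not the talk's exact code.

```cpp
#include <cstdint>
#include <vector>
#include <omp.h>

// Baseline indirect accesses with OpenMP: ind[] and val[] are read
// sequentially, but d[ind[i]] is a random write that may touch a new cache
// line, TLB entry, and DRAM row on every iteration.
void indirect_update(std::vector<double>& d,
                     const std::vector<std::int64_t>& ind,
                     const std::vector<double>& val) {
    #pragma omp parallel for
    for (std::int64_t i = 0; i < (std::int64_t)ind.size(); ++i) {
        #pragma omp atomic            // two iterations may share the same ind[i]
        d[ind[i]] += val[i];
    }
}
```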

102-105 No Locality? Cache miss, TLB miss, DRAM row miss, no prefetching (animation over time).

106-108 Milk Clustering Cache hit, TLB hit, DRAM row hit, effective prefetching. No need for atomics! (animation over time)

109 Outline Milk programming model milk syntax MILK compiler and runtime 4

110 Foundations Milk programming model extending BSP milk syntax OpenMP for C/C++ MILK compiler and runtime LLVM/Clang 4

111 Big (sparse) Data Terabyte working sets (e.g., multi-terabyte AWS VMs): in-memory databases, key-value stores, machine learning, graph analytics.

112 Infinite Cache Locality in Graph Applications [chart: ideal cache hit %, split into temporal and spatial locality, for Road (d=2.4), Twitter (d=24), Web (d=39) on BC (Betweenness Centrality), BFS (Breadth-First Search), CC (Connected Components), PR (PageRank), SSSP (Single-Source Shortest Paths)] [GAPBS]

113 Milk Execution Model Collection Distribution Delivery Propagation Blocking: Binning (Collection + Distribution), Deliver (Accumulation) 44
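A minimal single-threaded sketch of the Collection / Distribution / Delivery idea (propagation blocking) follows; the bin sizing and all names are illustrative assumptions, and the real milk runtime fuses and parallelizes these phases.

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Single-threaded sketch of propagation blocking / the milk execution model:
// instead of applying d[ind[i]] += val[i] immediately (random writes over all
// of d), Collection + Distribution bin the (index, value) pairs by destination
// range, and Delivery applies each bin while its slice of d stays cache-resident.
// Bin sizing and names are illustrative assumptions.
void propagation_blocking(std::vector<double>& d,
                          const std::vector<std::uint32_t>& ind,
                          const std::vector<double>& val,
                          std::size_t bin_range) {  // elements of d per cache-sized bin
    const std::size_t nbins = (d.size() + bin_range - 1) / bin_range;
    std::vector<std::vector<std::pair<std::uint32_t, double>>> bins(nbins);

    // Collection + Distribution: sequential reads, mostly-sequential appends.
    for (std::size_t i = 0; i < ind.size(); ++i)
        bins[ind[i] / bin_range].push_back({ind[i], val[i]});

    // Delivery: each bin's updates touch only a cache-sized slice of d.
    for (const auto& bin : bins)
        for (const auto& [idx, v] : bin)
            d[idx] += v;
}
```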

114-119 Collection, Distribution, Delivery (animation of d[...] += f(i): Collection gathers each update's target index and value f(i), Distribution groups the updates by index, and Delivery applies each group in turn).

120 milk syntax milk clause in parallel loop milk directive per indirect access tag address to group by pack additional state f() 48
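A rough sketch of what milk-annotated code looks like, based only on this slide's description (a milk clause on the parallel loop, plus a per-access milk directive that tags the grouping address and packs extra state); the exact directive grammar is an assumption and may differ from the real milk compiler.

```cpp
#include <cstdint>
#include <vector>

// Approximate shape of milk-annotated code (see the caveat above): the milk
// clause on the parallel loop defers the indirect writes, and the per-access
// milk directive packs the extra state the deferred update needs, grouped by
// the address of the indirect access on the following statement.
void indirect_update_milk(std::vector<double>& d,
                          const std::vector<std::int64_t>& ind,
                          const std::vector<double>& val) {
    #pragma omp parallel for milk
    for (std::int64_t i = 0; i < (std::int64_t)ind.size(); ++i) {
        double v = val[i];        // state needed when the update is delivered
        #pragma milk pack(v)
        d[ind[i]] += v;           // clustered by &d[ind[i]], no atomics needed
    }
}
```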

121 pack Combiners 49

122 PageRank 7.5 5

123 PageRank with OpenMP 5

124 PageRank with milk 5

125 PageRank with milk 5

126 MILK compiler and runtime Collection loop transformation Distribution runtime library Delivery continuation 54

127 PageRank with milk

128 PageRank: Collection

129 Tag Distribution L pails 9-bit radix partition 57

130 Tag Distribution L.5 pails

131 Tag Distribution L.5 pails

132 Distribution: Pail Overflow. L pails DRAM tubs

133 Milk Delivery DRAM tubs L

134 Milk Delivery DRAM tubs L 6

135 Related Work Database JOIN optimizations [Shatdal94] cache partitioning [Manegold, Kim9, Albutiu, Balkesen5] TLB, SIMD, NUMA, non-temporal writes, software write buffers 6

136 Overall Speedup with milk [chart: speedup for BC (Betweenness Centrality), BFS (Breadth-First Search), CC (Connected Components), PR (PageRank), SSSP (Single-Source Shortest Paths); 8 MB L3]

137 Overall Speedup with milk [chart: speedup for BC, BFSd, BFSp, CC, PR, SSSP at several graph sizes; 8 MB L3]

138 Stall Cycle Reduction [chart: % of total cycles spent in L2 miss stalls (256 KB L2) and L3 miss stalls (8 MB L3), baseline vs. milk; PageRank on a uniform graph]

139 Indirect Access Cache Hit % [chart: cache hit % for BC, BFS, CC, PR, SSSP, baseline vs. milk; 8 MB L3, 256 KB L2]

140 Higher Degree, Higher Locality [chart: speedup vs. average degree for two vertex counts, as the edge count grows from millions to billions]

141 Related Works How is Milk different from BigSparse and Propagation Blocking? 69

142 Related Work BigSparse: can work on graphs with both good and bad locality (Milk and Propagation Blocking both target low-locality graphs), and can afford a global sort instead of bucketing. Propagation Blocking: Milk doesn't have two separate phases for Binning and Accumulate (Collection, Distribution, and Delivery are all fused together using coroutines); PB reuses the tags to save memory bandwidth, assuming the application is iterative; Milk has a more general programming model for various applications.

143 Outline Performance Analysis for Graph Applications Milk / Propagation Blocking Frequency based Clustering CSR Segmenting Summary

144 Making Caches Work for Graph Analytics Yunming Zhang, Vladimir Kiriansky, Charith Mendis, Matei Zaharia*, Saman Amarasinghe MIT CSAIL and *Stanford InfoLab 7

145 Outline PageRank Frequency based Vertex Reordering Cache-aware Segmenting Evaluation 7

146 PageRank while for node : graph.vertices for ngh : graph.getInNeighbors(node) newranks[node] += ranks[ngh]/outdegree[ngh]; for node : graph.vertices newranks[node] = basescore + damping*newranks[node]; swap ranks and newranks (slides 147-151 step through this code)
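For reference, here is the same pull-direction PageRank step written out as runnable C++ over a CSR graph; the array names and the CSR layout are illustrative assumptions, not the talk's exact code.

```cpp
#include <vector>

// Pull-direction PageRank step matching the slide's pseudocode, over a CSR
// graph given by in_offsets/in_neigh (incoming edges).
void pagerank_step(const std::vector<int>& in_offsets,   // size n+1
                   const std::vector<int>& in_neigh,     // size m
                   const std::vector<int>& out_degree,   // size n
                   const std::vector<double>& ranks,
                   std::vector<double>& new_ranks,
                   double base_score, double damping) {
    const int n = (int)out_degree.size();
    for (int node = 0; node < n; ++node) {
        double sum = 0.0;
        for (int e = in_offsets[node]; e < in_offsets[node + 1]; ++e) {
            int ngh = in_neigh[e];
            sum += ranks[ngh] / out_degree[ngh];   // random access into ranks[]
        }
        new_ranks[node] = base_score + damping * sum;
    }
}
```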

152 Cache PageRank (cache simulation of the ranks[] accesses) Focus on the random memory accesses to the ranks array; in this example the cache holds one cache line and the ranks array is stored in two cache lines. Slides 152-172 step through the inner loop, counting hits and misses as it bounces between the two lines.

173 Cache PageRank A very high miss rate (6 misses in this toy example)

174 Cache PageRank Working set larger than cache

175 PageRank Often only use part of the cache line

176-179 Performance Bottleneck Working set much larger than cache size: real-world graphs often have working sets many times larger than the cache. Access pattern is random. Often uses only part of the cache line: often only 1/16 - 1/8 of a cache line on modern hardware (one 4-8 byte value out of a 64-byte line). Cannot benefit from hardware prefetching. TLB misses and DRAM row misses (hundreds of cycles).

180 PageRank while for node : graph.vertices for ngh : graph.getInNeighbors(node) newranks[node] += ranks[ngh]/outdegree[ngh]; for node : graph.vertices newranks[node] = basescore + damping*newranks[node]; swap ranks and newranks [chart: normalized total cycles on the RMAT27 graph, BaseLine]

181 PageRank Up to 80% of the cycles are spent on slow random memory accesses [same chart]

182 PageRank, removing the random accesses (incorrect): newranks[node] += ranks[0]/outdegree[0]; (all random reads replaced by a fixed index) [chart adds a No Random bar]

183 PageRank, removing the random accesses (incorrect): a several-fold speedup if we can eliminate random memory accesses [chart: BaseLine vs. No Random]

184 PageRank The cache-optimized version is within a small factor of the no-random-access version [chart: BaseLine, Cache Optimized, No Random]

185 Outline PageRank Frequency based Vertex Reordering Cache-aware Segmenting Evaluation 6

186-187 Frequency based Vertex Reordering Key Observations: Cache lines are underutilized; certain vertices are much more likely to be accessed than other vertices. Design: group together the frequently accessed nodes, and keep the ordering of the average-degree nodes.
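A short sketch of this design in code; the threshold-based notion of "frequently accessed" (high out-degree) and the function names are illustrative assumptions.

```cpp
#include <algorithm>
#include <numeric>
#include <vector>

// Frequency-based vertex reordering as described on the slide: vertices whose
// data is accessed often (high out-degree) get the lowest new IDs so their
// rank entries share cache lines; all remaining vertices keep their original
// relative order. The threshold choice is an assumption.
std::vector<int> frequency_reorder(const std::vector<int>& out_degree,
                                   int hot_threshold) {
    const int n = (int)out_degree.size();
    std::vector<int> order(n);
    std::iota(order.begin(), order.end(), 0);
    // Stable partition: hot vertices first, everyone else in original order.
    std::stable_partition(order.begin(), order.end(),
                          [&](int v) { return out_degree[v] >= hot_threshold; });
    // new_id[old_id] gives the relabeling applied to the edge arrays and data.
    std::vector<int> new_id(n);
    for (int pos = 0; pos < n; ++pos) new_id[order[pos]] = pos;
    return new_id;
}
```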

188-194 Frequency based Vertex Reordering (animation: each node is labeled with its outdegree; the high-outdegree nodes are grouped together, the affected nodes are reordered, and the corresponding edges are relabeled).

195-197 Frequency based Vertex Reordering Reorganize the nodes' data: this groups the data of frequently accessed nodes together in one cache line.

198 Cache PageRank (cache simulation after reordering) Focus on the random memory accesses to the ranks array; slides 198-214 replay the same inner loop on the reordered graph, counting hits and misses.

215 Cache PageRank Much better than before (#hits: 4 now, vs. 6 misses in the baseline)

216 Cache PageRank Better cache line utilization

217 Outline PageRank Frequency based Vertex Reordering Cache-aware Segmenting Evaluation

218 Cache-aware Segmenting Design Partition the graph into subgraphs where the random accesses are limited to the LLC. Process each partition sequentially and accumulate rank contributions for each partition. Merge the rank contributions from all subgraphs.
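A compact sketch of this design for the pull-style PageRank step; the segment layout, the per-segment partial arrays, and the chunked merge below are illustrative assumptions (a real implementation would store the partial contributions more compactly).

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Each segment holds the edges whose source vertices fall in one cache-sized
// range, so the random reads of ranks[] inside a segment stay in the LLC.
struct Segment {
    std::vector<int> src;           // source vertex of each edge (within the segment's range)
    std::vector<int> dst;           // destination vertex of each edge
    std::vector<double> partial;    // per-destination partial sums (size n, simplified)
};

// Phase 1: process each segment sequentially; the ranks[] reads are cache-resident.
void process_segments(std::vector<Segment>& segments,
                      const std::vector<double>& ranks,
                      const std::vector<int>& out_degree) {
    for (auto& seg : segments)
        for (std::size_t e = 0; e < seg.dst.size(); ++e) {
            const int s = seg.src[e];
            seg.partial[seg.dst[e]] += ranks[s] / out_degree[s];
        }
}

// Phase 2: cache-aware merge. Sum the partials chunk by chunk so that each
// chunk of new_ranks (and the matching slice of every partial array) fits in cache.
void cache_aware_merge(const std::vector<Segment>& segments,
                       std::vector<double>& new_ranks, std::size_t chunk) {
    for (std::size_t lo = 0; lo < new_ranks.size(); lo += chunk) {
        const std::size_t hi = std::min(lo + chunk, new_ranks.size());
        for (const auto& seg : segments)
            for (std::size_t v = lo; v < hi; ++v)
                new_ranks[v] += seg.partial[v];
    }
}
```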

219-225 Graph Partitioning Partition the original graph into subgraphs that each access only a subset of the nodes' data (animation).

226-240 Graph Processing (cache simulation of processing the partitioned graph: within each subgraph, the random accesses stay inside the cached subset of nodes) Only a few misses (#hits: 5).

241 Graph Processing Better than Frequency based Reordering (#hits: 5 here vs. #hits: 4 with reordering).

242-252 Cache-aware Merge (animation: naively merging the subgraphs' intermediate updates into the final ranks).

253 Cache-aware Merge The naive approach incurs random DRAM accesses 54

254-258 Cache-aware Merge Break down the merge into chunks that fit in cache.

259-265 Cache-aware Merge Sum up the intermediate updates from the two subgraphs, one chunk at a time.

266 PageRank while for node : graph.vertices for ngh : graph.getInNeighbors(node) newranks[node] += ranks[ngh]/outdegree[ngh]; for node : graph.vertices newranks[node] = basescore + damping*newranks[node]; swap ranks and newranks [chart: normalized total cycles on the RMAT27 graph for BaseLine, Reordering, Segmenting, and Reordering + Segmenting]

267-268 PageRank Cycle reduction from the individual optimizations [same chart]

269 PageRank Cycle reduction from Reordering + Segmenting [same chart]

270 Related Work Distributed Graph Systems: shared-memory efficiency is a key component of distributed graph processing systems (PowerGraph, GraphLab, Pregel..). Shared-memory Graph Systems: frameworks (Ligra, Galois, GraphMat..) did not focus on cache optimizations; Milk [PACT 2016], Propagation Blocking [IPDPS 2017]. Out-of-core Systems (GraphChi, X-Stream).

271 Outline Motivation Frequency based Vertex Reordering Cache-aware Segmenting Evaluation 65

272 Evaluation Absolute running times on Intel Xeon E5 servers:

Graph        PageRank (20 iter)   Label Propagation (per iter)   Betweenness Centrality (per start node)
Twitter      5.8s                 .7s                            .s
RMAT27       .6s                  .5s                            .85s
Web Graph    8.6s                 .4s                            .875s

273 Evaluation On a single machine, we can complete 20 iterations of PageRank on the 41-million-node Twitter graph within 6s.

274 Evaluation The best previously published result is .7s (Gemini, OSDI 16).

275 Evaluation Very fast execution on label propagation, which is used in Connected Components and SSSP (Bellman-Ford).

276 PageRank [chart: slowdown relative to Ours for HandOptC++, GraphMat, Ligra, GridGraph on Twitter, RMAT25, RMAT27, SD]

277 PageRank The Intel expert hand-optimized version and state-of-the-art graph frameworks are consistently slower than our version [same chart]

278 Label Propagation [chart: slowdown relative to Ours for HandOptC++ and Ligra on Twitter, RMAT25, RMAT27, SD]

279 Label Propagation The Intel expert hand-optimized version and state-of-the-art graph frameworks are up to 6.7x slower than our version [same chart]

280 Evaluation Cycles stalled on memory per edge: Hand Optimized C++ vs. Ours [chart: LiveJournal, RMAT25, Twitter, SD, RMAT27]

281 Evaluation For the hand-optimized baseline, cycles stalled on memory per edge increase as the graph gets larger [same chart]

282 Evaluation For our version, cycles stalled on memory per edge stay roughly constant as the graph gets larger [same chart]

283 Summary Performance Bottleneck of Graph Applications Frequency based Vertex Reordering Cache-aware Segmenting 77

284 Outline Performance Analysis for Graph Applications Milk / Propagation Blocking Frequency based Clustering CSR Segmenting Summary

285 Improving Cache Performance for Graph Computations Reordering the Graph Partitioning the Graph for Locality Runtime Reordering the Memory Accesses

286 Improving Cache Performance for Graph Computations Reordering the Graph Partitioning the Graph for Locality What are the tradeoffs? Runtime Reordering the Memory Accesses

287 Improving Cache Performance for Graph Computations Reordering the Graph: small preprocessing cost, modest performance improvement, dependent on graph structure. Partitioning the Graph for Locality: bigger preprocessing cost, small runtime overhead, bigger performance gains, suitable for applications with lots of random accesses. Runtime Reordering of the Memory Accesses: no preprocessing cost, bigger runtime overhead.

288 Outside of Graph Computing? Sparse Linear Algebra: matrix reordering and preconditioning (graph reordering), cache blocking (CSR segmenting), inspector-executor (runtime access reordering). These are fundamental communication-reduction techniques, used in many other domains (sparse linear algebra, join optimization in databases).

289 Performance Engineering Understand your application's performance characteristics (many papers have explored abandoning caches and improving MLP with a large number of threads). Understand the tradeoff space of the optimizations. Pick the technique that best suits your hardware, application, and data.


More information

Computer Caches. Lab 1. Caching

Computer Caches. Lab 1. Caching Lab 1 Computer Caches Lab Objective: Caches play an important role in computational performance. Computers store memory in various caches, each with its advantages and drawbacks. We discuss the three main

More information

arxiv: v1 [cs.db] 21 Oct 2017

arxiv: v1 [cs.db] 21 Oct 2017 : High-performance external graph analytics Sang-Woo Jun, Andy Wright, Sizhuo Zhang, Shuotao Xu and Arvind Department of Electrical Engineering and Computer Science Massachusetts Institute of Technology

More information

Memory hierarchy review. ECE 154B Dmitri Strukov

Memory hierarchy review. ECE 154B Dmitri Strukov Memory hierarchy review ECE 154B Dmitri Strukov Outline Cache motivation Cache basics Six basic optimizations Virtual memory Cache performance Opteron example Processor-DRAM gap in latency Q1. How to deal

More information

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology

Multilevel Memories. Joel Emer Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology 1 Multilevel Memories Computer Science and Artificial Intelligence Laboratory Massachusetts Institute of Technology Based on the material prepared by Krste Asanovic and Arvind CPU-Memory Bottleneck 6.823

More information

Architecture-Conscious Database Systems

Architecture-Conscious Database Systems Architecture-Conscious Database Systems 2009 VLDB Summer School Shanghai Peter Boncz (CWI) Sources Thank You! l l l l Database Architectures for New Hardware VLDB 2004 tutorial, Anastassia Ailamaki Query

More information

On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs

On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs Sungpack Hong 2, Nicole C. Rodia 1, and Kunle Olukotun 1 1 Pervasive Parallelism Laboratory, Stanford University

More information

The GraphGrind Framework: Fast Graph Analytics on Large. Shared-Memory Systems

The GraphGrind Framework: Fast Graph Analytics on Large. Shared-Memory Systems Memory-Centric System Level Design of Heterogeneous Embedded DSP Systems The GraphGrind Framework: Scott Johan Fischaber, BSc (Hon) Fast Graph Analytics on Large Shared-Memory Systems Jiawen Sun, PhD candidate

More information

Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot

Sort vs. Hash Join Revisited for Near-Memory Execution. Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot Sort vs. Hash Join Revisited for Near-Memory Execution Nooshin Mirzadeh, Onur Kocberber, Babak Falsafi, Boris Grot 1 Near-Memory Processing (NMP) Emerging technology Stacked memory: A logic die w/ a stack

More information

Why do we need graph processing?

Why do we need graph processing? Why do we need graph processing? Community detection: suggest followers? Determine what products people will like Count how many people are in different communities (polling?) Graphs are Everywhere Group

More information

Memory Systems and Performance Engineering

Memory Systems and Performance Engineering SPEED LIMIT PER ORDER OF 6.172 Memory Systems and Performance Engineering Fall 2010 Basic Caching Idea A. Smaller memory faster to access B. Use smaller memory to cache contents of larger memory C. Provide

More information

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005]

CSF Improving Cache Performance. [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] CSF Improving Cache Performance [Adapted from Computer Organization and Design, Patterson & Hennessy, 2005] Review: The Memory Hierarchy Take advantage of the principle of locality to present the user

More information

A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination

A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination 1 1 A Multiprocessor Memory Processor for Efficient Sharing And Access Coordination David M. Koppelman Department of Electrical & Computer Engineering Louisiana State University, Baton Rouge koppel@ee.lsu.edu

More information

Lecture-14 (Memory Hierarchy) CS422-Spring

Lecture-14 (Memory Hierarchy) CS422-Spring Lecture-14 (Memory Hierarchy) CS422-Spring 2018 Biswa@CSE-IITK The Ideal World Instruction Supply Pipeline (Instruction execution) Data Supply - Zero-cycle latency - Infinite capacity - Zero cost - Perfect

More information

Data Modeling and Databases Ch 10: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich

Data Modeling and Databases Ch 10: Query Processing - Algorithms. Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Data Modeling and Databases Ch 10: Query Processing - Algorithms Gustavo Alonso Systems Group Department of Computer Science ETH Zürich Transactions (Locking, Logging) Metadata Mgmt (Schema, Stats) Application

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology

More information

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved.

Computer Architecture. A Quantitative Approach, Fifth Edition. Chapter 2. Memory Hierarchy Design. Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive per

More information

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh

Accelerating Pointer Chasing in 3D-Stacked Memory: Challenges, Mechanisms, Evaluation Kevin Hsieh Accelerating Pointer Chasing in 3D-Stacked : Challenges, Mechanisms, Evaluation Kevin Hsieh Samira Khan, Nandita Vijaykumar, Kevin K. Chang, Amirali Boroumand, Saugata Ghose, Onur Mutlu Executive Summary

More information

Handout 4 Memory Hierarchy

Handout 4 Memory Hierarchy Handout 4 Memory Hierarchy Outline Memory hierarchy Locality Cache design Virtual address spaces Page table layout TLB design options (MMU Sub-system) Conclusion 2012/11/7 2 Since 1980, CPU has outpaced

More information

CACHE-GUIDED SCHEDULING

CACHE-GUIDED SCHEDULING CACHE-GUIDED SCHEDULING EXPLOITING CACHES TO MAXIMIZE LOCALITY IN GRAPH PROCESSING Anurag Mukkara, Nathan Beckmann, Daniel Sanchez 1 st AGP Toronto, Ontario 24 June 2017 Graph processing is memory-bound

More information

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader A New Parallel Algorithm for Connected Components in Dynamic Graphs Robert McColl Oded Green David Bader Overview The Problem Target Datasets Prior Work Parent-Neighbor Subgraph Results Conclusions Problem

More information

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading

CS 152 Computer Architecture and Engineering. Lecture 18: Multithreading CS 152 Computer Architecture and Engineering Lecture 18: Multithreading Krste Asanovic Electrical Engineering and Computer Sciences University of California, Berkeley http://www.eecs.berkeley.edu/~krste

More information

Cache Optimisation. sometime he thought that there must be a better way

Cache Optimisation. sometime he thought that there must be a better way Cache sometime he thought that there must be a better way 2 Cache 1. Reduce miss rate a) Increase block size b) Increase cache size c) Higher associativity d) compiler optimisation e) Parallelism f) prefetching

More information

Performance! (1/latency)! 1000! 100! 10! Capacity Access Time Cost. CPU Registers 100s Bytes <10s ns. Cache K Bytes ns 1-0.

Performance! (1/latency)! 1000! 100! 10! Capacity Access Time Cost. CPU Registers 100s Bytes <10s ns. Cache K Bytes ns 1-0. Since 1980, CPU has outpaced DRAM... EEL 5764: Graduate Computer Architecture Appendix C Hierarchy Review Ann Gordon-Ross Electrical and Computer Engineering University of Florida http://www.ann.ece.ufl.edu/

More information

CS152 Computer Architecture and Engineering Lecture 17: Cache System

CS152 Computer Architecture and Engineering Lecture 17: Cache System CS152 Computer Architecture and Engineering Lecture 17 System March 17, 1995 Dave Patterson (patterson@cs) and Shing Kong (shing.kong@eng.sun.com) Slides available on http//http.cs.berkeley.edu/~patterson

More information

X-Stream: A Case Study in Building a Graph Processing System. Amitabha Roy (LABOS)

X-Stream: A Case Study in Building a Graph Processing System. Amitabha Roy (LABOS) X-Stream: A Case Study in Building a Graph Processing System Amitabha Roy (LABOS) 1 X-Stream Graph processing system Single Machine Works on graphs stored Entirely in RAM Entirely in SSD Entirely on Magnetic

More information

1. Creates the illusion of an address space much larger than the physical memory

1. Creates the illusion of an address space much larger than the physical memory Virtual memory Main Memory Disk I P D L1 L2 M Goals Physical address space Virtual address space 1. Creates the illusion of an address space much larger than the physical memory 2. Make provisions for

More information

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative

Chapter 6 Caches. Computer System. Alpha Chip Photo. Topics. Memory Hierarchy Locality of Reference SRAM Caches Direct Mapped Associative Chapter 6 s Topics Memory Hierarchy Locality of Reference SRAM s Direct Mapped Associative Computer System Processor interrupt On-chip cache s s Memory-I/O bus bus Net cache Row cache Disk cache Memory

More information

The Processor Memory Hierarchy

The Processor Memory Hierarchy Corrected COMP 506 Rice University Spring 2018 The Processor Memory Hierarchy source code IR Front End Optimizer Back End IR target code Copyright 2018, Keith D. Cooper & Linda Torczon, all rights reserved.

More information

CS 152 Computer Architecture and Engineering. Lecture 9 - Address Translation

CS 152 Computer Architecture and Engineering. Lecture 9 - Address Translation CS 152 Computer Architecture and Engineering Lecture 9 - Address Translation Krste Asanovic Electrical Engineering and Computer Sciences University of California at Berkeley http://www.eecs.berkeley.edu/~krste

More information

Introduction to OpenMP. Lecture 10: Caches

Introduction to OpenMP. Lecture 10: Caches Introduction to OpenMP Lecture 10: Caches Overview Why caches are needed How caches work Cache design and performance. The memory speed gap Moore s Law: processors speed doubles every 18 months. True for

More information

Course Administration

Course Administration Spring 207 EE 363: Computer Organization Chapter 5: Large and Fast: Exploiting Memory Hierarchy - Avinash Kodi Department of Electrical Engineering & Computer Science Ohio University, Athens, Ohio 4570

More information

Copyright 2012, Elsevier Inc. All rights reserved.

Copyright 2012, Elsevier Inc. All rights reserved. Computer Architecture A Quantitative Approach, Fifth Edition Chapter 2 Memory Hierarchy Design 1 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more

More information

Cray XE6 Performance Workshop

Cray XE6 Performance Workshop Cray XE6 Performance Workshop Mark Bull David Henty EPCC, University of Edinburgh Overview Why caches are needed How caches work Cache design and performance. 2 1 The memory speed gap Moore s Law: processors

More information

and data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed

and data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed 5.3 By convention, a cache is named according to the amount of data it contains (i.e., a 4 KiB cache can hold 4 KiB of data); however, caches also require SRAM to store metadata such as tags and valid

More information

Computer Science 146. Computer Architecture

Computer Science 146. Computer Architecture Computer Architecture Spring 2004 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 18: Virtual Memory Lecture Outline Review of Main Memory Virtual Memory Simple Interleaving Cycle

More information

Memory. From Chapter 3 of High Performance Computing. c R. Leduc

Memory. From Chapter 3 of High Performance Computing. c R. Leduc Memory From Chapter 3 of High Performance Computing c 2002-2004 R. Leduc Memory Even if CPU is infinitely fast, still need to read/write data to memory. Speed of memory increasing much slower than processor

More information

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13

Bigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13 Bigtable A Distributed Storage System for Structured Data Presenter: Yunming Zhang Conglong Li References SOCC 2010 Key Note Slides Jeff Dean Google Introduction to Distributed Computing, Winter 2008 University

More information

Fine-Grained Task Migration for Graph Algorithms using Processing in Memory

Fine-Grained Task Migration for Graph Algorithms using Processing in Memory Fine-Grained Task Migration for Graph Algorithms using Processing in Memory Paula Aguilera 1, Dong Ping Zhang 2, Nam Sung Kim 3 and Nuwan Jayasena 2 1 Dept. of Electrical and Computer Engineering University

More information

Benchmarking the Memory Hierarchy of Modern GPUs

Benchmarking the Memory Hierarchy of Modern GPUs 1 of 30 Benchmarking the Memory Hierarchy of Modern GPUs In 11th IFIP International Conference on Network and Parallel Computing Xinxin Mei, Kaiyong Zhao, Chengjian Liu, Xiaowen Chu CS Department, Hong

More information

Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing

Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing Laxman Dhulipala Joint work with Guy Blelloch and Julian Shun SPAA 07 Giant graph datasets Graph V E (symmetrized) com-orkut

More information

Page 1. Memory Hierarchies (Part 2)

Page 1. Memory Hierarchies (Part 2) Memory Hierarchies (Part ) Outline of Lectures on Memory Systems Memory Hierarchies Cache Memory 3 Virtual Memory 4 The future Increasing distance from the processor in access time Review: The Memory Hierarchy

More information

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY

LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal

More information

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy

Chapter 5A. Large and Fast: Exploiting Memory Hierarchy Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM

More information

Lecture 22 : Distributed Systems for ML

Lecture 22 : Distributed Systems for ML 10-708: Probabilistic Graphical Models, Spring 2017 Lecture 22 : Distributed Systems for ML Lecturer: Qirong Ho Scribes: Zihang Dai, Fan Yang 1 Introduction Big data has been very popular in recent years.

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2

More information

CuSha: Vertex-Centric Graph Processing on GPUs

CuSha: Vertex-Centric Graph Processing on GPUs CuSha: Vertex-Centric Graph Processing on GPUs Farzad Khorasani, Keval Vora, Rajiv Gupta, Laxmi N. Bhuyan HPDC Vancouver, Canada June, Motivation Graph processing Real world graphs are large & sparse Power

More information

Memory Hierarchy, Fully Associative Caches. Instructor: Nick Riasanovsky

Memory Hierarchy, Fully Associative Caches. Instructor: Nick Riasanovsky Memory Hierarchy, Fully Associative Caches Instructor: Nick Riasanovsky Review Hazards reduce effectiveness of pipelining Cause stalls/bubbles Structural Hazards Conflict in use of datapath component Data

More information

ECE232: Hardware Organization and Design

ECE232: Hardware Organization and Design ECE232: Hardware Organization and Design Lecture 21: Memory Hierarchy Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Overview Ideally, computer memory would be large and fast

More information

LECTURE 5: MEMORY HIERARCHY DESIGN

LECTURE 5: MEMORY HIERARCHY DESIGN LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive

More information

L2 cache provides additional on-chip caching space. L2 cache captures misses from L1 cache. Summary

L2 cache provides additional on-chip caching space. L2 cache captures misses from L1 cache. Summary HY425 Lecture 13: Improving Cache Performance Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 25, 2011 Dimitrios S. Nikolopoulos HY425 Lecture 13: Improving Cache Performance 1 / 40

More information

Key Point. What are Cache lines

Key Point. What are Cache lines Caching 1 Key Point What are Cache lines Tags Index offset How do we find data in the cache? How do we tell if it s the right data? What decisions do we need to make in designing a cache? What are possible

More information

EEC 170 Computer Architecture Fall Improving Cache Performance. Administrative. Review: The Memory Hierarchy. Review: Principle of Locality

EEC 170 Computer Architecture Fall Improving Cache Performance. Administrative. Review: The Memory Hierarchy. Review: Principle of Locality Administrative EEC 7 Computer Architecture Fall 5 Improving Cache Performance Problem #6 is posted Last set of homework You should be able to answer each of them in -5 min Quiz on Wednesday (/7) Chapter

More information

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)

Memory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed) Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2

More information

Memory Management! How the hardware and OS give application pgms:" The illusion of a large contiguous address space" Protection against each other"

Memory Management! How the hardware and OS give application pgms: The illusion of a large contiguous address space Protection against each other Memory Management! Goals of this Lecture! Help you learn about:" The memory hierarchy" Spatial and temporal locality of reference" Caching, at multiple levels" Virtual memory" and thereby " How the hardware

More information