Optimizing Cache Performance for Graph Analytics. Yunming Zhang Presentation
1 Optimizing Cache Performance for Graph Analytics Yunming Zhang Presentation
2 Goals How to optimize in-memory graph applications; how to go about performance engineering just about anything
3 In-memory Graph Processing Compared to the Disk/DRAM boundary (GraphChi, BigSparse, LLAMA, ...), the Cache/DRAM boundary has: a much smaller latency gap (L1 cache a few ns, DRAM on the order of 100 ns, Flash on the order of microseconds to ms); much larger bandwidth (DRAM tens of GB/s vs Flash ~6 GB/s); much smaller granularity (64-byte cache lines vs 4 KB or larger pages)
4 Outline Performance Analysis for Graph Applications Milk / Propagation Blocking Frequency based Clustering CSR Segmenting Summary
5 Locality Exists in Graph Processing: Workload Characterization on an Ivy Bridge Server Scott Beamer, Krste Asanović, David Patterson Mostly borrowed from the authors' IISWC presentation
6 Motivation What is the performance bottleneck for graph applications running in memory? How much performance can we gain? How can we achieve the performance gains?
7 Graph Algorithms Are Random? "Thus, the low speedup of OOO execution is due solely to a lack of memory bandwidth required to service the repeated last level cache misses caused by the random access memory pattern of the algorithm." (PPoPP) "First, the memory bandwidth of the system seems to limit performance." (SPAA)
12 Current Graph Architecture Wisdom? Current wisdom: random memory access pattern; limited by memory bandwidth; will be plagued by low core utilization. Cray XMT design: no caches; heavy multithreading.
19 Are graph applications really memory bandwidth bound? Are caches really completely useless in graph computations?
20 Results from Characterization No single representative workload (need a suite). Out-of-order core is not limited by memory bandwidth for most graph workloads (can improve by changing only the processor). Many graph workloads have good locality (caches help; try to avoid thrashing).
27 Target Graph Algorithms Most popular based on a 45-paper literature survey: Breadth-First Search (BFS), Single-Source Shortest Paths (SSSP), PageRank (PR), Connected Components (CC), Betweenness Centrality (BC)
28 Target Graph Frameworks Galois (custom parallel runtime, UT Austin): specialized for irregular fine-grain tasks. Ligra (Cilk, CMU): applies the algorithm in push or pull directions. GAP Benchmark Suite (OpenMP, UCB): written directly in the most natural way for each algorithm, not constrained by a framework.
32 Target Input Graphs UC Berkeley
Graph                      # Vertices  # Edges    Degree  Diameter  Degree Dist.
Roads of USA               23.9M       58.3M      2.4     High      const
Twitter Follow Links       61.6M       1,468.4M   23.8    Low       power
Web Crawl of .sk Domain    50.6M       1,949.4M   38.5    Medium    power
Kronecker Synthetic Graph  128.0M      2,048.0M   16.0    Low       power
Uniform Random Graph       128.0M      2,048.0M   16.0    Low       normal
Graphs can have very different degree distributions, diameters, and other structural characteristics.
33 How does the memory system work? Executing a load that accesses DRAM requires: (1) execution reaches the load instruction (fetch); (2) space in the instruction window; (3) register operands are available (dataflow); (4) memory bandwidth is available (bandwidth ~ # outstanding requests). Memory bandwidth (#4) matters only if (#1-#3) are satisfied.
41 Little's Law effective MLP = average memory bandwidth x average memory latency (MLP = memory level parallelism); effective MLP is capped by the application MLP.
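As a sanity check on Little's Law, a minimal sketch in C++ (the numbers in the comments and the test are illustrative assumptions, not measurements from the talk): with 64-byte cache lines, an 80 ns average miss latency, and 10 misses in flight, sustained bandwidth is 10 x 64 B / 80 ns = 8 GB/s.

```cpp
#include <cassert>

// Little's Law for the memory system: sustained bandwidth equals the number
// of outstanding misses (effective MLP) times the cache-line size, divided
// by the average miss latency. All inputs are illustrative.
double bandwidth_gbs(double mlp, double line_bytes, double latency_ns) {
    return mlp * line_bytes / latency_ns;  // bytes per ns == GB/s
}
```

Under these assumed numbers, a core limited to ~10 outstanding misses cannot exceed ~8 GB/s regardless of how much bandwidth DRAM offers.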
45 Single-Core Memory Bandwidth [Figure: pointer-chasing microbenchmark on one core with a varying number of parallel pointer chases; achieved bandwidth passes through fetch-, window-, dataflow-, and bandwidth-limited regimes]
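A sketch of the kind of pointer-chasing microbenchmark described above (names, sizes, and structure are my own assumptions; the original benchmark code is not shown in the slides): k independent random cycles are walked in lockstep, so up to k cache misses can be in flight at once, which is how the benchmark varies application MLP.

```cpp
#include <cstddef>
#include <vector>
#include <numeric>
#include <random>
#include <algorithm>

// Build one random cycle over n slots: next[v] gives the successor of v,
// and following next from any start visits all n slots before returning.
std::vector<size_t> make_cycle(size_t n, unsigned seed) {
    std::vector<size_t> perm(n);
    std::iota(perm.begin(), perm.end(), 0);
    std::shuffle(perm.begin(), perm.end(), std::mt19937(seed));
    std::vector<size_t> next(n);
    for (size_t i = 0; i < n; ++i)          // link the permutation into one cycle
        next[perm[i]] = perm[(i + 1) % n];
    return next;
}

// Walk all chains in lockstep: each step issues one dependent load per chain,
// so the number of chains sets the number of misses that can be outstanding.
size_t chase(const std::vector<std::vector<size_t>>& chains, size_t steps) {
    std::vector<size_t> cur(chains.size(), 0);
    size_t sink = 0;                        // accumulate so the loads are not dead code
    for (size_t s = 0; s < steps; ++s)
        for (size_t c = 0; c < chains.size(); ++c)
            sink += (cur[c] = chains[c][cur[c]]);
    return sink;
}
```

Timing `chase` for growing chain counts (with arrays sized far beyond the last-level cache) reproduces the bandwidth-vs-MLP curve the slide describes.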
54 Outline Methodology: Platform, Memory Bandwidth Availability; Single-core Results; Parallel Results; GAP Benchmark Suite; Conclusion
55 Importance of Window Size [Figure: one core with one thread] MLPmax = window size / (IPM + 1): if misses are rare, the instruction window size limits memory bandwidth.
65 Biggest Influence on Single-Core? [Figure: one core with one thread] Need a suite: no single representative workload. Only a few workloads are near the memory bandwidth limit.
71 Instruction Window Limits BW [Figure: one core with one thread] The instruction window limits memory bandwidth.
76 Outline Methodology: Platform, Memory Bandwidth Availability; Single-core Results; Parallel Results; GAP Benchmark Suite; Conclusion
77 Memory Bandwidth ~ Performance [Figure: many threads vs. one thread] Increasing memory bandwidth utilization increases performance.
79 Parallel Utilization [Figure: all cores, multithreaded] At the system level, memory bandwidth is under-utilized.
81 Multithreading Opportunity Fetch: fewer instructions in flight per thread (+). Window: same hardware, shared. Dataflow: more application MLP (++). Bandwidth: same hardware, shared.
87 Multithreading Increases Bandwidth [Figure: one core with increasing thread count] Speedups imply workloads are probably bottlenecked by dataflow.
89 Conclusions Most graph workloads do not utilize a large fraction of memory bandwidth. Many graph workloads have decent locality. Changing the processor alone could help: cache misses are too infrequent to fit in the window. Sub-linear parallel speedups cast doubt on gains from multithreading on an OoO core.
95 Outline Performance Analysis for Graph Applications Milk / Propagation Blocking Frequency based Clustering CSR Segmenting Summary
96 Milk / Propagation Blocking Is changing the architecture really the only way to improve the performance of graph applications running in memory? Other approaches: boosting MLP; reducing the amount of communication.
97 Optimizing Indirect Memory References with milk Vladimir Kiriansky, Yunming Zhang, Saman Amarasinghe MIT PACT 2016, Haifa, Israel
98 Indirect Accesses
99 Indirect Accesses with OpenMP vs. with milk [Figure: speedup of the milk version (milk clause, with an if(!milk) fallback) over plain OpenMP on uniform random indices; 8 threads, 8 MB LLC]
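The indirect-access pattern milk targets can be sketched as below. This is the plain OpenMP baseline only; the milk clause itself requires the MILK compiler and is only described in the comment. Function and variable names are illustrative, not from the milk paper.

```cpp
#include <vector>
#include <cstddef>

// Baseline indirect-access kernel of the kind milk optimizes: each iteration
// reads an index and updates count[d], touching an effectively random cache
// line. Per the talk, a `milk` clause on the loop would instead buffer the
// (d, value) updates and deliver them clustered by address, also removing the
// need for atomics; the plain OpenMP version needs the atomic to be safe.
void indirect_sum(const std::vector<size_t>& ind, std::vector<long>& count) {
    #pragma omp parallel for
    for (size_t i = 0; i < ind.size(); ++i) {
        size_t d = ind[i];      // random index into a large array
        #pragma omp atomic
        count[d] += 1;          // the cache-unfriendly indirect write
    }
}
```

Without OpenMP enabled the pragmas are ignored and the loop runs serially, which keeps the sketch runnable anywhere.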
102 No Locality? Each indirect access: cache miss, TLB miss, DRAM row miss, no prefetching. [Animation over time]
106 Milk Clustering Cache hit, TLB hit, DRAM row hit, effective prefetching. No need for atomics! [Animation over time]
109 Outline Milk programming model milk syntax MILK compiler and runtime 4
110 Foundations Milk programming model: extends BSP. milk syntax: OpenMP for C/C++. MILK compiler and runtime: LLVM/Clang.
111 Big (sparse) Data Terabyte working sets (terabyte-scale AWS VMs): in-memory databases, key-value stores, machine learning, graph analytics.
112 Infinite Cache Locality in Graph Applications [Figure: temporal and spatial locality, ideal cache hit % for the road, Twitter, and web graphs across BC (Betweenness Centrality), BFS (Breadth-First Search), CC (Connected Components), PR (PageRank), SSSP (Single-Source Shortest Paths)] [GAPBS]
113 Milk Execution Model Collection, Distribution, Delivery. Propagation Blocking: Binning (Collection + Distribution), Delivery (Accumulation).
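A minimal sketch of the Collection / Distribution / Delivery idea for an update of the form count[d] += f(i), in the style of propagation blocking (bin count, types, and names are illustrative assumptions, not the milk runtime):

```cpp
#include <vector>
#include <cstddef>
#include <utility>

// Collection: materialize (d, f(i)) pairs instead of updating count[d]
// immediately. Distribution: partition the pairs by destination range so
// each bin's targets fit in cache. Delivery: apply one bin at a time, so
// the random writes hit cached lines.
void binned_update(const std::vector<size_t>& d, const std::vector<long>& f,
                   std::vector<long>& count, size_t nbins) {
    size_t range = (count.size() + nbins - 1) / nbins;   // destinations per bin
    std::vector<std::vector<std::pair<size_t, long>>> bins(nbins);
    for (size_t i = 0; i < d.size(); ++i)                // Collection + Distribution
        bins[d[i] / range].push_back({d[i], f[i]});
    for (auto& bin : bins)                               // Delivery, bin by bin
        for (auto& p : bin)
            count[p.first] += p.second;
}
```

The sequential writes during Collection are cheap streaming stores; the expensive random writes are confined to a cache-sized range during Delivery.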
114 Collection: for count[d] += f(i), tuples (d, f(i)) are buffered as they are produced. 116 Distribution: the buffered tuples are partitioned by destination index d. 118 Delivery: each partition's updates are applied to count[] with cache locality. [Animation; the per-tuple values shown on the slides are not reproduced]
120 milk syntax milk clause on the parallel loop; milk directive per indirect access; tag: address to group by; pack: additional state (e.g. f()).
121 pack Combiners 49
122 PageRank with OpenMP vs. with milk [Code slides: the PageRank update loop written with OpenMP and with the milk clause; the code itself was not captured in this transcription]
126 MILK compiler and runtime Collection: loop transformation. Distribution: runtime library. Delivery: continuation.
127 PageRank with milk: Collection [Code slides; the code was not captured in this transcription]
129 Tag Distribution: a 9-bit radix partition groups tags into L2-resident pails. 132 Distribution: on pail overflow, pails are flushed to DRAM tubs. 133 Milk Delivery: tubs are streamed from DRAM back through L2 for delivery.
135 Related Work Database JOIN optimizations [Shatdal94] cache partitioning [Manegold, Kim9, Albutiu, Balkesen5] TLB, SIMD, NUMA, non-temporal writes, software write buffers 6
136 Overall Speedup with milk [Figure: speedup across BC (Betweenness Centrality), BFS (Breadth-First Search), CC (Connected Components), PR (PageRank), SSSP (Single-Source Shortest Paths); 8 MB LLC]
137 Overall Speedup with milk [Figure: speedup across BC, BFSd, BFSp, CC, PR, SSSP at several vertex counts; 8 MB LLC]
138 Stall Cycle Reduction [Figure: % of total cycles spent in L2-miss and LLC-miss stalls, baseline vs. milk; PageRank on a uniform graph]
139 Indirect Access Cache Hit% [Figure: cache hit % of indirect accesses, baseline vs. milk, across BC, BFS, CC, PR, SSSP]
140 Higher Degree Higher Locality [Figure: milk speedup increases with average degree at a fixed edge count]
141 Related Works How is Milk different from BigSparse and Propagation Blocking?
142 Related Work BigSparse: works on graphs with both good and bad locality (Milk and Propagation Blocking both target low-locality graphs); can afford a global sort instead of bucketing. Propagation Blocking: Milk doesn't have two separate phases for Binning and Accumulate (Collection, Distribution, and Delivery are fused using coroutines); PB reuses the tags to save memory bandwidth, assuming the application is iterative; Milk has a more general programming model for various applications.
143 Outline Performance Analysis for Graph Applications Milk / Propagation Blocking Frequency based Clustering CSR Segmenting Summary
144 Making Caches Work for Graph Analytics Yunming Zhang, Vladimir Kiriansky, Charith Mendis, Matei Zaharia*, Saman Amarasinghe MIT CSAIL and *Stanford InfoLab 7
145 Outline PageRank Frequency based Vertex Reordering Cache-aware Segmenting Evaluation 7
146 PageRank
while
    for node : graph.vertices
        for ngh : graph.getInNeighbors(node)
            newranks[node] += ranks[ngh] / outdegree[ngh];
    for node : graph.vertices
        newranks[node] = basescore + damping * newranks[node];
    swap ranks and newranks
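The pseudocode above can be sketched as runnable C++. The graph representation, the default damping constant, and the convergence handling are illustrative assumptions; this is one pull-direction iteration, not the authors' implementation.

```cpp
#include <vector>
#include <utility>
#include <cstddef>

// One pull-direction PageRank iteration: each vertex accumulates
// ranks[ngh] / outdegree[ngh] over its in-neighbors (the random reads
// the talk focuses on), then applies the base score and damping, and
// the rank arrays are swapped for the next iteration.
void pagerank_iter(const std::vector<std::vector<int>>& in_neighbors,
                   const std::vector<int>& outdegree,
                   std::vector<double>& ranks, std::vector<double>& newranks,
                   double damping = 0.85) {
    size_t n = ranks.size();
    double basescore = (1.0 - damping) / n;
    for (size_t node = 0; node < n; ++node) {
        newranks[node] = 0.0;
        for (int ngh : in_neighbors[node])      // random access into ranks[]
            newranks[node] += ranks[ngh] / outdegree[ngh];
        newranks[node] = basescore + damping * newranks[node];
    }
    std::swap(ranks, newranks);
}
```

On a two-vertex cycle the uniform distribution {0.5, 0.5} is already the fixed point, which makes a convenient check.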
152 PageRank Cache Behavior [Animation: focusing on the random reads of the ranks array, with a cache that holds one cache line while ranks spans two cache lines, the hit/miss counters are stepped access by access; the run ends with 6 misses and only a few hits: a very high miss rate. The working set is larger than the cache, and often only part of each cache line is used.]
176 Performance Bottleneck Working set much larger than cache size: real-world graphs often have working sets many times larger than the cache. Access pattern is random: cannot benefit from hardware prefetching; TLB misses and DRAM row misses cost hundreds of cycles. Often only part of a cache line is used: often just 1/16 to 1/8 of a cache line on modern hardware.
180 PageRank [Figure: normalized total cycles, baseline, on an RMAT graph] Most of the cycles are spent on slow random memory accesses.
182 Removing Random Accesses (incorrect: ranks[ngh] replaced by a constant index, for measurement only) [Figure: baseline vs. no-random] A large speedup is possible if random memory accesses can be eliminated.
184 [Figure: baseline vs. cache-optimized vs. no-random] The cache-optimized version comes within a small factor of the no-random bound.
185 Outline PageRank Frequency based Vertex Reordering Cache-aware Segmenting Evaluation 6
186 Frequency based Vertex Reordering: Key Observations Cache lines are underutilized; certain vertices are much more likely to be accessed than others. Design: group together the frequently accessed nodes; keep the ordering of average-degree nodes.
188 Frequency based Vertex Reordering [Animation: nodes' out-degrees are inspected; high-out-degree nodes are grouped together by reordering node IDs, the corresponding edges are relabeled, and the nodes' data is reorganized so that the data of frequently accessed nodes shares one cache line.]
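The reordering above can be sketched as follows. The threshold-based choice of "hot" vertices and all names are illustrative assumptions; the paper's actual heuristic may differ.

```cpp
#include <vector>
#include <cstddef>
#include <numeric>
#include <algorithm>

// Frequency-based vertex reordering sketch: vertices whose out-degree exceeds
// a threshold get the lowest new IDs, packing the hot entries of the ranks
// array into few cache lines; all other vertices keep their relative order.
// Returns new_id[old_id], to be used for relabeling edges and node data.
std::vector<int> reorder_by_frequency(const std::vector<int>& outdegree,
                                      int threshold) {
    std::vector<int> order(outdegree.size());
    std::iota(order.begin(), order.end(), 0);
    // stable: hot vertices move to the front, relative order preserved
    std::stable_partition(order.begin(), order.end(),
                          [&](int v) { return outdegree[v] > threshold; });
    std::vector<int> new_id(outdegree.size());
    for (size_t pos = 0; pos < order.size(); ++pos)
        new_id[order[pos]] = (int)pos;
    return new_id;
}
```

Using stable_partition (rather than a full sort by degree) matches the design point on the slide: only the frequently accessed nodes are grouped, and the ordering of the remaining nodes is kept.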
198 PageRank Cache Behavior After Reordering [Animation: the same cache simulation rerun on the reordered graph; the random reads of ranks now mostly hit, ending with 4 hits: much better than the 6 misses of the original ordering.]
216 Cache PageRank while for node : graph.ver,ces for ngh : graph.getinneighbors(node) newranks[node] += ranks[ngh]/outdegree[ngh]; for node : graph.ver,ces newranks[node] = basescore + damping*newranks[node]; swap ranks and newranks #hits: 4 #misses: #hits: #misses: 6 Better cache line utilization 9
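The PageRank kernel stepped through above can be written out as runnable code. A minimal pull-direction sketch, assuming a simple adjacency-list representation; the function and variable names are illustrative, not from the slides:

```python
def pagerank_pull(in_neighbors, out_degree, ranks, damping=0.85):
    """One pull-direction PageRank iteration: each node gathers
    rank contributions from its in-neighbors."""
    n = len(ranks)
    base_score = (1.0 - damping) / n
    new_ranks = [0.0] * n
    for node in range(n):
        for ngh in in_neighbors[node]:
            # random read into ranks[] -- the source of the cache misses
            new_ranks[node] += ranks[ngh] / out_degree[ngh]
    for node in range(n):
        new_ranks[node] = base_score + damping * new_ranks[node]
    return new_ranks
```

The sequential writes to `new_ranks` are cheap; it is the irregular reads of `ranks[ngh]` that miss in cache when the graph is larger than the LLC.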
217 Outline PageRank Frequency based Vertex Reordering Cache-aware Segmenting Evaluation
218 Cache-aware Segmenting Design Partition the graph into subgraphs where the random accesses are limited to the LLC. Process each partition sequentially, accumulating rank contributions for each partition. Merge the rank contributions from all subgraphs.
219 Graph Partitioning Partitions the original graph into subgraphs that each only access a subset of the nodes' data.
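The partitioning step can be sketched as code. A minimal illustration, assuming an edge-list input and a hypothetical `seg_size` chosen so that one segment's vertex data fits in the LLC:

```python
def segment_graph(edges, num_nodes, seg_size):
    """Partition edges by source-vertex range so that, when one subgraph
    is processed, the random reads (to source-vertex data) stay within a
    single segment that fits in the last-level cache."""
    num_segments = (num_nodes + seg_size - 1) // seg_size
    subgraphs = [[] for _ in range(num_segments)]
    for src, dst in edges:
        # each edge goes to the subgraph owning its source vertex
        subgraphs[src // seg_size].append((src, dst))
    return subgraphs
```

This is a one-time preprocessing pass over the edge list; every edge lands in exactly one subgraph, so no work is duplicated.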
226 Graph Processing
(Animation: processing the segmented graph while counting cache hits and misses. Takeaway: the segmented layout incurs far fewer misses, and is better than frequency-based reordering.)
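Processing one partition can then be sketched as follows. This is a hedged illustration, assuming the segment's rank data has been packed into a small cache-resident array; `process_segment` and its argument names are hypothetical:

```python
def process_segment(sub_edges, ranks_segment, seg_start, out_degree):
    """Process one subgraph: gather contributions from sources inside the
    segment (whose ranks are cache-resident) into an intermediate
    per-destination buffer, to be merged with other segments later."""
    contrib = {}
    for src, dst in sub_edges:
        # ranks_segment is indexed locally: src - seg_start
        contrib[dst] = contrib.get(dst, 0.0) + \
            ranks_segment[src - seg_start] / out_degree[src]
    return contrib
```

Because every `src` in `sub_edges` falls in one vertex range, the reads of `ranks_segment` stay inside a cache-sized working set; only the writes to the intermediate buffer remain to be reconciled across segments.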
242 Cache-aware Merge
The naive approach incurs random DRAM accesses. Instead, break the intermediate results down into chunks that fit in cache, then sum up the intermediate updates from the two subgraphs chunk by chunk.
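The chunked merge described above can be sketched as code. An illustrative version, assuming each subgraph produced a full-length intermediate update array and that `chunk_size` is chosen to fit in cache:

```python
def cache_aware_merge(partials, chunk_size):
    """Merge per-subgraph intermediate update arrays.
    Rather than walking each full-length array in turn (which evicts the
    result between passes), process one cache-sized chunk of every array
    before moving on, so the chunk stays resident while it is summed."""
    n = len(partials[0])
    result = [0.0] * n
    for start in range(0, n, chunk_size):
        end = min(start + chunk_size, n)
        for arr in partials:            # all contributions to this chunk
            for i in range(start, end):
                result[i] += arr[i]
    return result
```

The chunk loop is the outer loop by design: every partial array's slice `[start:end]` is visited while `result[start:end]` is still hot.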
266 PageRank
while not converged:
    for node : graph.vertices
        for ngh : graph.getInNeighbors(node)
            newRanks[node] += ranks[ngh] / outDegree[ngh]
    for node : graph.vertices
        newRanks[node] = baseScore + damping * newRanks[node]
    swap ranks and newRanks
Normalized total cycles on the RMAT7 graph, comparing BaseLine, Reordering, Segmenting, and Reordering + Segmenting: each optimization reduces total cycles, and the combination gives the largest reduction.
270 Related Work
Distributed Graph Systems: shared-memory efficiency is a key component of distributed graph processing systems (PowerGraph, GraphLab, Pregel, ...).
Shared-memory Graph Systems: frameworks such as Ligra, Galois, and GraphMat did not focus on cache optimizations.
Milk [PACT 16], Propagation Blocking [IPDPS 17].
Out-of-core Systems (GraphChi, X-Stream).
271 Outline Motivation Frequency based Vertex Reordering Cache-aware Segmenting Evaluation 65
272 Evaluation
            PageRank ( iter)   Label Propagation (per iter)   Betweenness Centrality (per start node)
Twitter     5.8s               .7s                            .s
RMAT7       .6s                .5s                            .85s
Web Graph   8.6s               .4s                            .875s
Absolute running times on 4 core Intel Xeon E5 servers.
On a single machine, we can complete iterations of PageRank on the 4 million node Twitter graph within 6s; the best previously published result is .7s (Gemini, OSDI 16).
Label propagation, as used in Connected Components and SSSP (Bellman-Ford), also executes very fast.
276 PageRank
Slowdown relative to Ours, comparing Ours, HandOptC++, GraphMat, Ligra, and GridGraph on Twitter, RMAT5, RMAT7, and SD. Intel's expert hand-optimized version and state-of-the-art graph frameworks are all slower than our version.
278 Label Propagation
Slowdown relative to Ours, comparing Ours, HandOptC++, and Ligra on Twitter, RMAT5, RMAT7, and SD. Intel's expert hand-optimized version and state-of-the-art graph frameworks are .7-6.7x slower than our version.
280 Evaluation
Cycles stalled on memory per edge, Hand-Optimized C++ vs. Ours, on LiveJournal, RMAT5, Twitter, SD, and RMAT7. For the hand-optimized version, cycles stalled on memory per edge increase as the size of the graph increases; for our version, they stay constant.
283 Summary Performance Bottleneck of Graph Applications Frequency based Vertex Reordering Cache-aware Segmenting 77
284 Outline Performance Analysis for Graph Applications Milk / Propagation Blocking Frequency based Clustering CSR Segmenting Summary
285 Improving Cache Performance for Graph Computations: what are the tradeoffs?
Reordering the graph: small preprocessing cost, modest performance improvement, dependent on graph structure.
Partitioning the graph for locality: bigger preprocessing cost, small runtime overhead, bigger performance gains; suitable for applications with lots of random accesses.
Runtime reordering of the memory accesses: no preprocessing cost, bigger runtime overhead.
288 Outside of Graph Computing? Sparse Linear Algebra: matrix reordering and preconditioning (graph reordering), cache blocking (CSR segmenting), inspector-executor (runtime access reordering). These are fundamental communication reduction techniques, used in many other domains (sparse linear algebra, join optimization in databases).
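As a concrete instance of the analogy, cache blocking for sparse matrix-vector multiply can be sketched in the same inspector-executor style. A minimal illustration with hypothetical names, not from the slides:

```python
def split_col_blocks(coo, num_cols, block_width):
    """Inspector step: bucket nonzeros by column block so that each
    block's reads of the input vector x fit in cache (the sparse linear
    algebra analogue of CSR segmenting)."""
    nb = (num_cols + block_width - 1) // block_width
    blocks = [[] for _ in range(nb)]
    for row, col, val in coo:
        blocks[col // block_width].append((row, col, val))
    return blocks

def spmv_blocked(blocks, x, n_rows):
    """Executor step: y = A @ x, one column block at a time, so the
    random reads x[col] stay within a cache-sized slice of x."""
    y = [0.0] * n_rows
    for block in blocks:
        for row, col, val in block:
            y[row] += val * x[col]
    return y
```

The inspector cost is paid once per matrix; the executor then reuses the blocked layout across repeated multiplies, exactly as segmenting amortizes partitioning over PageRank iterations.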
289 Performance Engineering Understand your application's performance characteristics. Many papers worked on different ways to abandon the cache and improve MLP with a large number of threads. Understand the tradeoff space of the optimizations. Pick the technique that best suits your hardware, application, and data.
290 GAIL: The Graph Algorithm Iron Law Scott Beamer, Krste Asanović, David Patterson GAP Berkeley Electrical Engineering & Computer Sciences gap.cs.berkeley.edu Graph Applications: Social Network Analysis, Recommendations
Cray XE6 Performance Workshop Mark Bull David Henty EPCC, University of Edinburgh Overview Why caches are needed How caches work Cache design and performance. 2 1 The memory speed gap Moore s Law: processors
More informationand data combined) is equal to 7% of the number of instructions. Miss Rate with Second- Level Cache, Direct- Mapped Speed
5.3 By convention, a cache is named according to the amount of data it contains (i.e., a 4 KiB cache can hold 4 KiB of data); however, caches also require SRAM to store metadata such as tags and valid
More informationComputer Science 146. Computer Architecture
Computer Architecture Spring 2004 Harvard University Instructor: Prof. dbrooks@eecs.harvard.edu Lecture 18: Virtual Memory Lecture Outline Review of Main Memory Virtual Memory Simple Interleaving Cycle
More informationMemory. From Chapter 3 of High Performance Computing. c R. Leduc
Memory From Chapter 3 of High Performance Computing c 2002-2004 R. Leduc Memory Even if CPU is infinitely fast, still need to read/write data to memory. Speed of memory increasing much slower than processor
More informationBigtable. A Distributed Storage System for Structured Data. Presenter: Yunming Zhang Conglong Li. Saturday, September 21, 13
Bigtable A Distributed Storage System for Structured Data Presenter: Yunming Zhang Conglong Li References SOCC 2010 Key Note Slides Jeff Dean Google Introduction to Distributed Computing, Winter 2008 University
More informationFine-Grained Task Migration for Graph Algorithms using Processing in Memory
Fine-Grained Task Migration for Graph Algorithms using Processing in Memory Paula Aguilera 1, Dong Ping Zhang 2, Nam Sung Kim 3 and Nuwan Jayasena 2 1 Dept. of Electrical and Computer Engineering University
More informationBenchmarking the Memory Hierarchy of Modern GPUs
1 of 30 Benchmarking the Memory Hierarchy of Modern GPUs In 11th IFIP International Conference on Network and Parallel Computing Xinxin Mei, Kaiyong Zhao, Chengjian Liu, Xiaowen Chu CS Department, Hong
More informationJulienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing
Julienne: A Framework for Parallel Graph Algorithms using Work-efficient Bucketing Laxman Dhulipala Joint work with Guy Blelloch and Julian Shun SPAA 07 Giant graph datasets Graph V E (symmetrized) com-orkut
More informationPage 1. Memory Hierarchies (Part 2)
Memory Hierarchies (Part ) Outline of Lectures on Memory Systems Memory Hierarchies Cache Memory 3 Virtual Memory 4 The future Increasing distance from the processor in access time Review: The Memory Hierarchy
More informationLECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY
LECTURE 4: LARGE AND FAST: EXPLOITING MEMORY HIERARCHY Abridged version of Patterson & Hennessy (2013):Ch.5 Principle of Locality Programs access a small proportion of their address space at any time Temporal
More informationChapter 5A. Large and Fast: Exploiting Memory Hierarchy
Chapter 5A Large and Fast: Exploiting Memory Hierarchy Memory Technology Static RAM (SRAM) Fast, expensive Dynamic RAM (DRAM) In between Magnetic disk Slow, inexpensive Ideal memory Access time of SRAM
More informationLecture 22 : Distributed Systems for ML
10-708: Probabilistic Graphical Models, Spring 2017 Lecture 22 : Distributed Systems for ML Lecturer: Qirong Ho Scribes: Zihang Dai, Fan Yang 1 Introduction Big data has been very popular in recent years.
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2011/12 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2011/12 1 2
More informationCuSha: Vertex-Centric Graph Processing on GPUs
CuSha: Vertex-Centric Graph Processing on GPUs Farzad Khorasani, Keval Vora, Rajiv Gupta, Laxmi N. Bhuyan HPDC Vancouver, Canada June, Motivation Graph processing Real world graphs are large & sparse Power
More informationMemory Hierarchy, Fully Associative Caches. Instructor: Nick Riasanovsky
Memory Hierarchy, Fully Associative Caches Instructor: Nick Riasanovsky Review Hazards reduce effectiveness of pipelining Cause stalls/bubbles Structural Hazards Conflict in use of datapath component Data
More informationECE232: Hardware Organization and Design
ECE232: Hardware Organization and Design Lecture 21: Memory Hierarchy Adapted from Computer Organization and Design, Patterson & Hennessy, UCB Overview Ideally, computer memory would be large and fast
More informationLECTURE 5: MEMORY HIERARCHY DESIGN
LECTURE 5: MEMORY HIERARCHY DESIGN Abridged version of Hennessy & Patterson (2012):Ch.2 Introduction Programmers want unlimited amounts of memory with low latency Fast memory technology is more expensive
More informationL2 cache provides additional on-chip caching space. L2 cache captures misses from L1 cache. Summary
HY425 Lecture 13: Improving Cache Performance Dimitrios S. Nikolopoulos University of Crete and FORTH-ICS November 25, 2011 Dimitrios S. Nikolopoulos HY425 Lecture 13: Improving Cache Performance 1 / 40
More informationKey Point. What are Cache lines
Caching 1 Key Point What are Cache lines Tags Index offset How do we find data in the cache? How do we tell if it s the right data? What decisions do we need to make in designing a cache? What are possible
More informationEEC 170 Computer Architecture Fall Improving Cache Performance. Administrative. Review: The Memory Hierarchy. Review: Principle of Locality
Administrative EEC 7 Computer Architecture Fall 5 Improving Cache Performance Problem #6 is posted Last set of homework You should be able to answer each of them in -5 min Quiz on Wednesday (/7) Chapter
More informationMemory Hierarchy Computing Systems & Performance MSc Informatics Eng. Memory Hierarchy (most slides are borrowed)
Computing Systems & Performance Memory Hierarchy MSc Informatics Eng. 2012/13 A.J.Proença Memory Hierarchy (most slides are borrowed) AJProença, Computer Systems & Performance, MEI, UMinho, 2012/13 1 2
More informationMemory Management! How the hardware and OS give application pgms:" The illusion of a large contiguous address space" Protection against each other"
Memory Management! Goals of this Lecture! Help you learn about:" The memory hierarchy" Spatial and temporal locality of reference" Caching, at multiple levels" Virtual memory" and thereby " How the hardware
More information