Mosaic: Processing a Trillion-Edge Graph on a Single Machine

Size: px

Start display at page:

Download "Mosaic: Processing a Trillion-Edge Graph on a Single Machine"

Amie Charles
6 years ago
Views:

1 Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student EuroSys 7 June 8, 07 Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

2 Large-scale graph processing is ubiquitous Social networks Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

3 Large-scale graph processing is ubiquitous Social networks Genome analysis Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

4 Large-scale graph processing is ubiquitous Social networks Genome analysis Graphs enable Machine Learning Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

5 Powerful, heterogeneous machines Terabytes of RAM on multiple sockets Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

6 Powerful, heterogeneous machines Terabytes of RAM on multiple sockets Powerful many-core coprocessors Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

7 Powerful, heterogeneous machines Terabytes of RAM on multiple sockets Powerful many-core coprocessors Fast, large-capacity Non-volatile Memory Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

8 Powerful, heterogeneous machines Take advantage of heterogeneous machine to process tera-scale graphs Terabytes of RAM on multiple sockets Powerful many-core coprocessors Fast, large-capacity Non-volatile Memory Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

9 Table of contents Graph Processing: Sample Application Design Mosaic Architecture Graph Encoding API Evaluation Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

10 Graph Processing: Applications Community Detection Find Common Friends Find Shortest Paths Estimate Impact of Vertices (webpages, users,... )... Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

11 Example: Pagerank Calculate impact of each vertex Pagerank v = α u Neighborhood(v) Pagerank u degree u + ( α) Simple Algorithm: In each iteration, distribute current impact along out-edges, weighted by degree Sum up all incoming impacts new impact for next iteration Weight new impact with regularization factor α = 0.8 Repeat until no changes Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 6 / 8

12 Pagerank: Graphical example ) Initialize Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 7 / 8

13 Pagerank: Graphical example ) Propagate along outgoing edges Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 7 / 8

14 Pagerank: Graphical example ) Sum up incoming contributions Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 7 / 8

15 Pagerank: Graphical example ) Apply regularization: x Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 7 / 8

16 Pagerank: Graphical example ) Update outgoing edges for second iteration Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 7 / 8

17 Pagerank: Graphical example 6) Repeat until stabilized Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 7 / 8

18 Mosaic: Design space Graph Processing has many faces: Single Machine Out-of-core In memory Cluster Out-of-core In memory Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 8 / 8

19 Mosaic: Design space Graph Processing has many faces: Single Machine Out-of-core Cheap, but potentially slow In memory Fast, but limited graph size Cluster Out-of-core Large graphs, but expensive & slow In memory Large graphs & fast, but very expensive Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 8 / 8

20 Mosaic: Design space Graph Processing has many faces: Single Machine Out-of-core Cheap, but potentially slow In memory Fast, but limited graph size Cluster Out-of-core Large graphs, but expensive & slow In memory Large graphs & fast, but very expensive Single machine, out-of-core is most cost-effective Goal: Good performance and large graphs! Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 8 / 8

21 Mosaic: Design goals Goal Run algorithms on very large graphs on a single machine using coprocessors Enabled by: Common, familiar API (vertex/edge-centric) Encoding: Lossless compression Cache locality Processing on isolated subgraphs Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 9 / 8

22 Architecture of Mosaic Usage of Xeon Phi & NVMe Involvement of Host Global vertex state <current state>... <next state>... stripped Host Processors (Xeon) ( 6) I I... T T edge processing NVMe... Meta transfer Tile transfer fetch Xeon Phi receive ( 6 cores)... per Xeon Phi ( ) PCIe Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 0 / 8

23 Graph encoding: Idea Compression Split graph into subgraphs, use local (short) identifiers Cache locality Inside subgraphs: Sort by access order Between subgraphs: Overlap vertex sets Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

24 Background: Column first Locality for write Multiple sequential reads Target vertex Source vertex P P P P P P P P P P P P P P P P Global adjacency matrix Partition (S = ) Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

25 Background: Row first Locality for read Multiple sequential writes Target vertex Source vertex P P P P P P P P P P P P P P P P Global adjacency matrix Partition (S = ) Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

26 Background: Hilbert order Space-filling curve Provides locality between adjacent data points Target vertex Source vertex P P P P P P P P P P P P P P P P Global adjacency matrix Partition (S = ) Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

27 From global to local: Tiles Convert graph to set of tiles ) Start with adjacency Matrix: 6 ➏ ➑ ➒ ➋ ➊ ➎ ➌ ➐ ➍ Target vertex Source vertex ➊ ➐ ➒ ➑ ➋ ➍ P P P P ➌ ➎ ➏ P P P P P P P P P P P P Global adjacency matrix Partition (S = ) Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

28 From global to local: Tiles Convert graph to set of tiles ) Use first edge in tile T : 6 ➏ ➑ ➒ ➋ ➊ ➎ ➌ ➐ ➍ Target vertex (global) Source vertex (global) ➊ ➐ ➒ ➑ ➋ ➍ P P P P ➌ ➎ ➏ P P P P P P P P P P P P Global adjacency matrix Partition (S = ) Tile- (T ) ➊ meta (I ) (,) (local) (,) : local vertex id (,) : local global id ➊ : local edge store order Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

29 From global to local: Tiles Convert graph to set of tiles ) Use more edges in tile T : 6 ➏ ➑ ➒ ➋ ➊ ➎ ➌ ➐ ➍ Target vertex (global) Source vertex (global) ➊ ➐ ➒ ➑ ➋ ➍ P P P P ➌ ➎ ➏ P P P P P P P P P P P P Global adjacency matrix Partition (S = ) Tile- (T ) ➊ ➋ meta (I ) (,) (,) (,) (local) : local vertex id (,) : local global id ➊ : local edge store order Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

30 From global to local: Tiles Convert graph to set of tiles ) Use more edges in tile T : 6 ➏ ➑ ➒ ➋ ➊ ➎ ➌ ➐ ➍ Target vertex (global) Source vertex (global) ➊ ➐ ➒ ➑ ➋ ➍ P P P P ➌ ➎ ➏ P P P P P P P P P P P P Global adjacency matrix Partition (S = ) Tile- (T ) ➊ ➋ ➍ meta (I (,) (,) ( (,) ) (local),) : local vertex id (,) : local global id ➊ : local edge store order Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

31 From global to local: Tiles Convert graph to set of tiles ) Use more edges in tile T : 6 ➏ ➑ ➒ ➋ ➊ ➎ ➌ ➐ ➍ Target vertex (global) Source vertex (global) ➊ ➐ ➒ ➑ ➋ ➍ P P P P ➌ ➎ ➏ P P P P P P P P P P P P Global adjacency matrix Partition (S = ) Tile- (T ) ➊ ➋ ➍ ➌ meta (I (,) (,) ( (,) ) (local),) : local vertex id (,) : local global id ➊ : local edge store order Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

32 From global to local: Tiles Convert graph to set of tiles 6) Next edges do not fit in T anymore, construct T : 6 ➏ ➑ ➒ ➋ ➊ ➎ ➌ ➐ ➍ Target vertex (global) Source vertex (global) ➊ ➐ ➒ ➑ ➋ ➍ P P P P ➌ ➎ ➏ P P P P P P P P P P P P Global adjacency matrix Partition (S = ) Tile- (T ) ➊ ➋ ➍ ➌ meta (I (,) (,) ( (,) ) (local),) Tile- (T ) ➎ ➐ ➑ ➒ ➏ meta (I (,) (local) (,6) ( (,) ),) : local vertex id (,) : local global id ➊ : local edge store order Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

33 Locality with Hilbert-ordered tiles Overlapping sets of sources and targets Target vertex Source vertex ➊ ➐ ➒ ➑ ➋ ➍ P P P P ➌ ➎ ➏ P P P P P P P P P P P P Global adjacency matrix Partition (S = ) Better locality than row-first or column-first Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 6 / 8

34 From global to local: Data structure ) Split original graphs into two subgraphs: ➊ ➋ ➌ ➍ ➎ ➏ ➐ ➑ ➒ 6 ➊ ➋ ➌ ➍ ➎ ➏ ➐ ➑ ➒ Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 7 / 8

35 From global to local: Data structure ) Internal data structure of T : ➊ ➋ 6 ➑ ➏ ➒ ➎ ➌➐ ➍... ➎ ➐ ➑ ➒ ➏ Index local global 6 ➎ ➏ ➐ ➑ ➒ src tgt Edges B B sorted by Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 7 / 8

36 From global to local: Data structure ) Compress edges: Compressed sparse rows ➊ ➋ 6 ➑ ➏ ➒ ➎ ➌➐ ➍... ➎ ➐ ➑ ➒ ➏ Index local global 6 ➎ ➏ ➐ ➑ ➒ src tgt src (tgt,#tgt) Edges B B sorted by ➎ ➏ ➐ ➑ ➒ (, ) (, ) B B #tgt B compressed Efficient, local encoding, sequentially accessed Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 7 / 8

37 From global to local: Summary Better locality Efficient encoding of local graphs Effect: up to 68% reduction in data size: Graph #vertices #edges Raw data Mosaic size (red.) rmat 6.8 M 0. B.0 GB. GB (.0%) twitter.6 M. B 0.9 GB 7.7 GB ( 9.%) rmat7. M. B 6.0 GB. GB ( 0.6%) uk M.7 B 7.9 GB 8.7 GB ( 68.8%) hyperlink,7.6 M 6. B 80.0 GB. GB ( 68.%) rmat-trillion,9.9 M,000.0 B 8,000.0 GB,86.7 GB ( 9.8%) Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 8 / 8

38 API: Pagerank example Pull: Gather per edge information Reduce: Combine results from multiple subgraphs Apply: Calculate non-associative regularization Edge-centric operation Local graph processing on Tile // On edge processor (co-processor) // Edge e = (Vertex src, Vertex tgt) def Pull(Vertex src, Vertex tgt): return src.val / src.out_degree // On edge processor/global reducers (both) def Reduce(Vertex v, Vertex v ): return v.val + v.val // On global reducers (host) def Apply(Vertex v): v.val = ( - α) + α v.val Formula: Pagerank v = α ( u Neighborhood(v) Global graph processing Vertex-centric operation Pagerank u degree u ) + ( α) Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 9 / 8

39 API: Pagerank example ) Start with T Host current state next state T meta (I (,) (,) ( (,) ),) Xeon Phi Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 0 / 8

40 API: Pagerank example ) Execute Pull along all edges in T Host current state next state T Pull(e) meta (I (,) (,) ( (,) ),) Xeon Phi Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 0 / 8

41 API: Pagerank example ) Reduce all updates from T onto next state Host current state next state T meta (I (,) (,) ( (,) ),) Reduce(v) Xeon Phi Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 0 / 8

42 API: Pagerank example ) Switch to T Host current state next state T meta (I (,) (,6) ( (,) ),) Xeon Phi Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 0 / 8

43 API: Pagerank example ) Pull all updates Host current state next state T Pull(e) meta (I (,) (,6) ( (,) ),) Xeon Phi Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 0 / 8

44 API: Pagerank example 6) Reduce updates from T Host current state next state T meta (I (,) (,6) ( (,) ),) Reduce(v) Xeon Phi Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 0 / 8

45 API: Pagerank example 7) All tiles processed, Apply processed updates Host current state Apply(v) next state T meta (I (,) (,6) ( (,) ),) Xeon Phi Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 0 / 8

46 API: Pagerank example 8) Switch current and next state, clear next state for second iteration Host current state next state T meta (I (,) (,) ( (,) ),) Xeon Phi Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 0 / 8

47 Evaluation Questions: Preprocessing Cost Performance (in comparison) Impact of Design Decisions Scalability Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

48 Evaluation - Setup Single Server: sockets, cores each 768GB RAM Xeon Phi (KNC, 6 cores) 6 NVMe (.TB each) 7 Algorithms 6 Datasets ( real world, synthetic) Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

49 Preprocessing Mosaic needs explicit preprocessing step - min for small datasets, minutes for webgraph, hours for trillion edges But: Can be amortized during execution: GridGraph: Mosaic faster after twitter: 0 iterations uk007: 8 iterations X-Stream: Mosaic faster after twitter: 8 iterations uk007: iterations Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

50 Performance comparison Comparison to other single machine engines with Pagerank: Runtime (seconds) Mosaic GridGraph X-Stream GraphChi 0 rmat twitter rmat7 uk007-0 Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

51 Performance comparison Comparison to other single machine engines with Pagerank: log Runtime (seconds) 00 0 Mosaic GridGraph X-Stream GraphChi 0. rmat twitter rmat7 uk007-0 Mosaic outperforms other system by.7 to 8.6 Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

52 Hilbert-ordered tiles: Cache locality Cache misses and execution times for three different strategies Cache Misses (%) Hilbert Row-First Column-First Runtime (s) Pagerank BFS WCC 0 Pagerank BFS WCC Hilbert-ordered tiles have up to % better cache locality, up to % reduction in runtime Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

53 Evaluation - Scalability Dimensions Add Xeon Phis/NVMes Add threads on each Xeon Phi # iterations / minute # pair of Phi+NVMe Xeon Phi threads Mosaic scales well when adding threads/xeon Phis Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 6 / 8

54 Conclusion Mosaic, a graph processing engine for trillion edge graphs on a single machine Hilbert-ordered tiles allow: Enable localized processing on coprocessors Optimizes cache locality Enables compression Code is open-source: Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 7 / 8

55 Thank you! Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 8 / 8

56 Evaluation - Selective scheduling Skip subgraphs without active vertices Especially useful for traversal algorithms: BFS, Connected Components,... 0k 8k # active tiles 6k k k Tiles Time w/o Selective Scheduling Time Selective Scheduling. 0. Time (sec) 0k # iteration. speedup for BFS on twitter graph Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 9 / 8

57 Evaluation - Dynamic load balancing Runtime (seconds) Effect of choosing the wrong load balancing scheme Mosaic has light-weight balancing scheme optimal one optimal*0 0 Pagerank BFS WCC SpMV TC Up to.8 improvement in running time by choosing correct load balancing scheme Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 0 / 8

58 Mosaic architecture Host and Xeon Phi component Streaming-based design Global vertex state <current state>... <next state>... stripped Host Processors (Xeon) ( ) Meta transfer... (T, ) LF LR (T, ) (T, ) I... local... T EP ( 6 cores) I I... Tile core transfer (cache locality) T T... T t+... T T... Tile Prefetching Active tiles NVMe processing (concurrent I/O) Xeon Phi I (local, own memory) per socket (T, ) GR GR... per Xeon Phi ( ) T I / Global reducers (memory locality) PCIe messaging via ring buffer data transfer tile meta local data per tile process Steffen Maass Mosaic: Trillion Edges on a Single Machine June 8, 07 / 8

Solros: A Data-Centric Operating System Architecture for Heterogeneous Computing

Solros: A Data-Centric Operating System Architecture for Heterogeneous Computing Changwoo Min, Woonhak Kang, Mohan Kumar, Sanidhya Kashyap, Steffen Maass, Heeseung Jo, Taesoo Kim Virginia Tech, ebay, Georgia