Graph Analytics using Vertica Relational Database

1 Graph Analytics using Vertica Relational Database
Alekh Jindal* (Microsoft), Samuel Madden (MIT), Malú Castellanos (Vertica), Meichun Hsu (Vertica)
* Work done while at MIT

2 Motivation for graphs on DB
Data is already in a DB anyway
- avoid expensive copying
- end-to-end data analysis
- leverage other DB features
Processing involves full scans and joins
- relational engines can run them efficiently
- particularly well suited to column stores
Relational algebra/SQL offers a powerful declarative syntax
- in fact, we can express Giraph as an operator DAG
- can even express more complex graph analytics

3 5-point Agenda
1. From graph queries to SQL: how do we make the translation?
2. Graph query optimization: can we leverage decades of relational wisdom?
3. Column store backends: why are they a good choice?
4. Comparison with specialized graph systems: how do the numbers look?
5. Extending column stores: can we do better?

4 1. From Graph to SQL

5 Vertex-centric Graph Queries
- Popular language for graph analytics
- Vertex programs that run in supersteps and communicate via message passing

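10 Vertex-centric Graph Queries
- Popular language for graph analytics
- Vertex programs that run in supersteps and communicate via message passing
- Programmer only specifies a vertex program
- System takes care of running it in parallel
[Figure: example graph with per-vertex distance labels, starting at 0 for the source and inf elsewhere, updated across supersteps]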
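A minimal sketch of the vertex-centric model for single-source shortest path, assuming unit edge weights. The names `sssp_compute` and `run_supersteps` are hypothetical and are not Giraph's API; the driver only mimics the synchronous superstep loop that the system would normally run in parallel.

```python
import math
from collections import defaultdict


def sssp_compute(value, messages):
    """Vertex program: keep the smallest distance seen so far."""
    return min([value] + messages)


def run_supersteps(edges, source, max_supersteps=50):
    """Hypothetical synchronous driver: one loop pass is one superstep."""
    vertices = {v for edge in edges for v in edge}
    dist = {v: math.inf for v in vertices}
    out = defaultdict(list)
    for src, dst in edges:
        out[src].append(dst)

    inbox = {source: [0]}  # the initial message activates the source vertex
    for _ in range(max_supersteps):
        if not inbox:  # no messages in flight: every vertex has halted
            break
        outbox = defaultdict(list)
        for v, msgs in inbox.items():
            new_value = sssp_compute(dist[v], msgs)
            if new_value < dist[v]:
                dist[v] = new_value
                # Message passing: tell each out-neighbor its candidate distance.
                for neighbor in out[v]:
                    outbox[neighbor].append(new_value + 1)
        inbox = outbox
    return dist


edges = [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)]
distances = run_supersteps(edges, source=1)
```

Only `sssp_compute` is algorithm-specific; the loop, the message buffers, and the halting test are the parts a vertex-centric system provides.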

11 The Giraph Plan Giraph: a popular, open-source graph analytics system on Hadoop

12 The Giraph Plan
- Giraph: a popular, open-source graph analytics system on Hadoop
- The Giraph physical plan: hard-coded physical execution pipeline
[Figure: Giraph pipeline from input through supersteps to output. Input: HDFS scan, record read, split of G=(V,E), and shuffle to workers W1-W4. Each superstep: workers W1-W4 with server data (partition store, edge store, message store), shuffle, and synchronization by the master. Output: cleanup and store to HDFS.]

13 The Giraph Plan
- Giraph: a popular, open-source graph analytics system on Hadoop
- The Giraph physical plan: hard-coded physical execution pipeline
- Giraph logical query plan using relational operators
[Figure: logical plan over Vertices (V), Edges (E), and Messages (M), with join condition $V.id = E.from$ and grouping on $V.id = M.to$; outputs are Modified Vertices and New Messages]

14 Rewriting Logical Giraph Plan
1. Giraph logical query plan
2. Pushing down the UDF
3. Replacing M by a join of vertices and edges, with $V_1.id = E.to$ and $V_2.id = E.from$
[Figure: the three plans side by side; plans 1 and 2 join on $V.id = E.from$ and group on $V.id = M.to$]

15 Rewriting Logical Giraph Plan
1. Giraph logical query plan
2. Pushing down the UDF
3. Replacing M by a join of vertices and edges
Single Source Shortest Path:
- join: $V_1.id = E.to$, $V_2.id = E.from$
- aggregate: $\Gamma_{d' = \min(V_2.d + 1)}$
- filter: $\sigma_{d' < V_1.d}$
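The join, aggregate, and filter for shortest path can be sketched as a single SQL statement. This is an illustrative sketch only: it runs on SQLite through Python's standard `sqlite3` module rather than on Vertica, the table and column names (`vertex`, `edge`, `from_node`, `to_node`, `d`) are assumptions, edges have unit weight, and a large integer stands in for inf.

```python
import sqlite3

INF = 1_000_000_000  # assumed stand-in for "inf"

# One iteration: join V1-E-V2, aggregate min(V2.d + 1), keep only improvements.
SSSP_STEP = """
SELECT v1.id, MIN(v2.d + 1) AS new_d
FROM vertex v1
JOIN edge e    ON v1.id = e.to_node
JOIN vertex v2 ON v2.id = e.from_node
GROUP BY v1.id, v1.d
HAVING MIN(v2.d + 1) < v1.d
"""


def sssp_sql(edges, source):
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE vertex(id INTEGER PRIMARY KEY, d INTEGER);
        CREATE TABLE edge(from_node INTEGER, to_node INTEGER);
    """)
    conn.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    ids = sorted({v for e in edges for v in e})
    conn.executemany(
        "INSERT INTO vertex VALUES (?, ?)",
        [(i, 0 if i == source else INF) for i in ids],
    )
    while True:
        improved = conn.execute(SSSP_STEP).fetchall()
        if not improved:  # fixpoint: no vertex found a shorter path
            break
        conn.executemany(
            "UPDATE vertex SET d = ? WHERE id = ?",
            [(new_d, vid) for vid, new_d in improved],
        )
    return dict(conn.execute("SELECT id, d FROM vertex"))


sql_distances = sssp_sql([(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)], source=1)
```

The `HAVING` clause plays the role of the filter, so vertices whose distance does not improve never reach the update.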

16 Rewriting Logical Giraph Plan
1. Giraph logical query plan
2. Pushing down the UDF
3. Replacing M by a join of vertices and edges
Connected Components:
- join: $V_1.id = E.to$, $V_2.id = E.from$
- aggregate: $\Gamma_{cc' = \min(V_2.id)}$
- filter: $\sigma_{cc' < V_1.cc}$

17 Rewriting Logical Giraph Plan
1. Giraph logical query plan
2. Pushing down the UDF
3. Replacing M by a join of vertices and edges
PageRank:
- join: $V_1.id = E.to$, $V_2.id = E.from$
- aggregate: $\Gamma_{V_1.r = 0.15/n + 0.85 \cdot \mathrm{sum}(V_2.r / V_2.outd)}$
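A hedged sketch of the PageRank aggregate as SQL, again on SQLite via `sqlite3` rather than Vertica. The schema names are assumptions, and so is the conventional 0.15/0.85 damping split used here. The left joins are an addition of this sketch so that vertices without incoming edges still receive the base rank.

```python
import sqlite3

# One iteration: r' = 0.15/n + 0.85 * sum(V2.r / V2.outd) over in-neighbors V2.
PAGERANK_STEP = """
SELECT v1.id, 0.15 / :n + 0.85 * COALESCE(SUM(v2.r / v2.outd), 0.0)
FROM vertex v1
LEFT JOIN edge e    ON v1.id = e.to_node
LEFT JOIN vertex v2 ON v2.id = e.from_node
GROUP BY v1.id
"""


def pagerank_sql(edges, iterations=20):
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE edge(from_node INTEGER, to_node INTEGER);
        CREATE TABLE vertex(id INTEGER PRIMARY KEY, r REAL, outd INTEGER);
    """)
    conn.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    ids = sorted({v for e in edges for v in e})
    n = len(ids)
    conn.executemany(
        "INSERT INTO vertex VALUES (?, ?, 0)", [(i, 1.0 / n) for i in ids]
    )
    # Out-degree is precomputed once; it does not change across iterations.
    conn.execute(
        "UPDATE vertex SET outd = "
        "(SELECT COUNT(*) FROM edge WHERE from_node = vertex.id)"
    )
    for _ in range(iterations):
        # Read all new ranks first, then write, so each iteration is synchronous.
        new_ranks = conn.execute(PAGERANK_STEP, {"n": n}).fetchall()
        conn.executemany(
            "UPDATE vertex SET r = ? WHERE id = ?",
            [(r, vid) for vid, r in new_ranks],
        )
    return dict(conn.execute("SELECT id, r FROM vertex"))


cycle_ranks = pagerank_sql([(1, 2), (2, 3), (3, 1)])
small_ranks = pagerank_sql([(1, 2), (1, 3), (2, 3), (3, 1)])
```

On a directed cycle every vertex keeps rank 1/n, which makes a convenient sanity check for the formula.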

18 Rewriting Logical Giraph Plan
1. Giraph logical query plan
2. Pushing down the UDF as a table UDF
3. Replacing join with union
Hash-based shuffling
- no sorting, as opposed to Hadoop
- intermediate data is not persisted
Unmodified vertex compute program, e.g. SGD
[Figure: plans built from a table UDF over the union of vertices, edges, and messages, with sort and group by partition id (pid); labelled 1. Disk-based iterations, 2. In-memory iterations, and "Optimized unmodified vertex compute program". The PageRank aggregate $0.15/n + 0.85 \cdot \mathrm{sum}(V_2.r / V_2.outd)$ from the previous slide is shown for comparison.]

19 2. Graph Query Optimization

20 Leveraging Relational Query Optimizers
- Multiple rule-based or cost-based query rewritings are possible; pick the best one using an optimizer
- No hard-coded physical execution plan
- Several new optimizations proposed:
  - update vs replace
  - incremental evaluation
  - join elimination
  - alternate direction graph exploration

21 Inner Join Update
[Figure: SSSP example. The output table (Node, Value) holds only the changed vertices and is inner-joined with the input table (Node, Value) to produce the updated input.]
Good for a small number of updates!

22 Outer Join Replace
[Figure: SSSP example. The output table (Node, Value) is outer-joined with the input table (Node, Value) to build a new input table that replaces the old one.]
Good for bulk updates!
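The two strategies can be contrasted in a small sketch, assuming a `node(id, value)` table holding the current state and a `delta(id, value)` table holding the rows an iteration changed (both names are illustrative). It uses SQLite through `sqlite3`; a correlated subquery stands in for the inner join because that form works on any SQLite version.

```python
import sqlite3

INF = 1_000_000_000  # assumed stand-in for "inf"


def make_db():
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE node(id INTEGER PRIMARY KEY, value INTEGER);
        CREATE TABLE delta(id INTEGER PRIMARY KEY, value INTEGER);
        INSERT INTO node VALUES (1, 0), (2, 1000000000), (3, 1000000000),
                                (4, 1000000000);
        INSERT INTO delta VALUES (2, 1), (3, 1);
    """)
    return conn


def update_in_place(conn):
    """Update: touch only the rows that match a changed vertex."""
    conn.execute("""
        UPDATE node
        SET value = (SELECT d.value FROM delta d WHERE d.id = node.id)
        WHERE id IN (SELECT id FROM delta)
    """)


def replace_table(conn):
    """Replace: rebuild the whole table with a left outer join, then swap."""
    conn.executescript("""
        CREATE TABLE node_new AS
        SELECT n.id AS id, COALESCE(d.value, n.value) AS value
        FROM node n LEFT OUTER JOIN delta d ON n.id = d.id;
        DROP TABLE node;
        ALTER TABLE node_new RENAME TO node;
    """)


def snapshot(conn):
    return dict(conn.execute("SELECT id, value FROM node"))


updated, replaced = make_db(), make_db()
update_in_place(updated)
replace_table(replaced)
```

Both paths end in the same state. The difference is cost: the update writes a number of rows proportional to the delta, while the replace rewrites every row but does so as one sequential bulk write.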

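23 Incremental Computation
[Figure: SSSP example. Alongside the full input (Node, Value), an incremental input holds only the vertices changed in the previous iteration. SSSP runs on the incremental input; its output becomes the new incremental input and is outer-joined with the input to form the new input.]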

24 Incremental Computation
[Figure: next iteration of the same SSSP example, with a smaller incremental input feeding SSSP and the outer join producing the new input]
Faster iteration runtime!
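A minimal sketch of incremental evaluation for shortest path, assuming unit edge weights and illustrative table names (`node`, `edge`, `inc`). Each iteration joins only the `inc` table, which holds the vertices whose value changed in the previous iteration, instead of the full vertex table. For brevity the sketch applies changes with an in-place update rather than the outer-join replace shown on the slide.

```python
import sqlite3

INF = 1_000_000_000  # assumed stand-in for "inf"

# Only vertices in the incremental input propagate distances this iteration.
INC_STEP = """
SELECT n.id, MIN(i.value + 1)
FROM inc i
JOIN edge e ON i.id = e.from_node
JOIN node n ON n.id = e.to_node
GROUP BY n.id, n.value
HAVING MIN(i.value + 1) < n.value
"""


def incremental_sssp(edges, source):
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE edge(from_node INTEGER, to_node INTEGER);
        CREATE TABLE node(id INTEGER PRIMARY KEY, value INTEGER);
        CREATE TABLE inc(id INTEGER PRIMARY KEY, value INTEGER);
    """)
    conn.executemany("INSERT INTO edge VALUES (?, ?)", edges)
    ids = sorted({v for e in edges for v in e})
    conn.executemany(
        "INSERT INTO node VALUES (?, ?)",
        [(i, 0 if i == source else INF) for i in ids],
    )
    conn.execute("INSERT INTO inc VALUES (?, 0)", (source,))

    inc_sizes = []  # rows changed per iteration, to show how the work shrinks
    while True:
        changed = conn.execute(INC_STEP).fetchall()
        if not changed:
            break
        inc_sizes.append(len(changed))
        conn.executemany(
            "UPDATE node SET value = ? WHERE id = ?",
            [(value, vid) for vid, value in changed],
        )
        # The changed rows become the next incremental input.
        conn.execute("DELETE FROM inc")
        conn.executemany("INSERT INTO inc VALUES (?, ?)", changed)
    return dict(conn.execute("SELECT id, value FROM node")), inc_sizes


inc_distances, inc_sizes = incremental_sssp(
    [(1, 2), (1, 3), (2, 4), (3, 4), (4, 5)], source=1
)
```

The join input per iteration is bounded by the number of recently changed vertices, which is where the faster iteration runtime comes from.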

25 3. Column Store Backends

26 Why could column stores be a good choice?
Modern column stores provide several features:
- physical design
- join optimizations
- query pipelining
- intra-query parallelism
For more details, pick your favorite column store papers:
- MonetDB [Database Architecture Evolution: Mammals Flourished long before Dinosaurs became Extinct, Peter A. Boncz et al., PVLDB 2009.]
- C-Store [C-Store: A Column-oriented DBMS, Mike Stonebraker et al., VLDB 2005.]
- Vertica [The Vertica Analytic Database: C-Store 7 Years Later, Andrew Lamb et al., VLDB 2012.]

27 Illustration: Vertica Query Plan for SSSP
- Early filtering using sideways information passing
- Fully pipelined query execution
- Picks the right join strategies, e.g. broadcast
[Figure: Vertica query plan for SSSP. Operator labels are listed below in extraction order; the tree layout is not recoverable.]
Root OutBlk=[UncTuple(2)]
NewEENode OutBlk=[UncTuple(2)]
ExprEval: e.to_node, <SVAR>
Recv from: node0,node1,node2,node3; Send to: node0
Join: Hash-Join: using twitter_edge and twitter_node_b0
ScanStep: twitter_edge; SIP2(HashJoin): e.from_node; SIP1(MergeJoin): e.to_node; columns: to_node (not emitted), from_node
FilterStep: (<SVAR> < <SVAR>)
GroupByPipe: keys; Aggs: min((n1.value + 1)), min(n2.value)
StorageMergeStep: twitter_edge; sorted
GroupByPipe: keys; Aggs: min((n1.value + 1)), min(n2.value)
ExprEval: e.to_node, (n1.value + 1), n2.value
Join: Merge-Join: using previous join and twitter_node_b0
Recv from: node0,node1,node2,node3; Send to: node0,node1,node2,node3
StorageMergeStep: twitter_node_b0; sorted
ScanStep: twitter_node_b0; columns: id, value
StorageUnionStep: twitter_node_b0
ScanStep: twitter_node_b0; columns: id, value

28 4. Comparison with Specialized Graph Systems

29 Setup
Systems:
- Vertica
- Giraph
- GraphLab
Datasets:
- directed (Twitter, LiveJournal)
- undirected (Youtube, LiveJournal)
Machines:
- 4 machines (2 cores, 48GB memory, 1.4TB disk)
Data preparation:
- upload time [Vertica: 96s; GraphLab: 472s; Giraph: 268s]
- disk usage [Vertica: 0GB; GraphLab/Giraph: 73GB]

30 Typical Graph Analytics
Twitter graph: 1.4 billion edges, 41.6 million nodes
[Figure: bar chart of time (seconds) for PR, SSSP, and CC on GraphLab, Giraph, and Vertica]

31 Advanced Graph Analytics
- Mixing graph and relational queries
- Multi-hop neighborhood queries
[Figure residue from the accompanying paper: (a) comparing different implementations on Vertica (LiveJournal graph), with series Vertica (SQL + Disk) and Vertica (UDF + Shared Memory) for PageRank and Shortest Path; (b) comparing in-memory computation in GraphLab and Vertica (LiveJournal graph); caption "Improving I/O Performance in Vertica with In-memory Graph Analysis"; y-axis Time (seconds).]
[Table III: Combining graph and relational analysis. Twitter graph with synthetic metadata. Columns: Query Type, Vertica, Giraph, SpeedUp. Rows: Sub-graph Projection & Selection (PR, SSSP); Graph Analysis Aggregation (PR, SSSP); Graph Joins (PR+SSSP). Values not recoverable.]
[Table: multi-hop neighborhood queries. Columns: Query, Dataset, Vertica, Giraph. Queries: Strong Overlap, Weak Ties. Datasets: Youtube, LiveJournal-undir. Several Giraph entries read "out of memory"; other values not recoverable.]
Paper excerpt: "... those nodes which are either very near (path distance less than a given threshold) or are relatively very important (PageRank greater than a given threshold). We compare against Giraph, which allows users to provide custom input/output formats that could be used to perform the projection and selection. We write additional MapReduce jobs for the aggregation and join. Table III shows the result on the Twitter dataset over 4 nodes. We can see that the performance difference between Vertica ..."

32 Detailed Analysis: Cost Breakdown
Twitter graph: 1.4 billion edges, 41.6 million nodes
[Figure: (c) Cost breakdown (Twitter graph). Time (seconds) split into Iterations and Load/Store for PR, SSSP, and CC on GraphLab, Giraph, and Vertica.]

33 Detailed Analysis: Memory Footprint
Twitter graph: 1.4 billion edges, 41.6 million nodes
[Figure: memory footprint, size (GB), for GraphLab, Giraph, and Vertica]

34 Detailed Analysis: I/O Footprint
Twitter graph: 1.4 billion edges, 41.6 million nodes
[Figure: read, write, and total I/O for GraphLab, Giraph, and Vertica, with read and write time (ms)]

35 Problem: significantly high I/O. Can we do better?

36 5. Extending Column Stores

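37 Rewriting Graph Query Plan (Yet again!)
1. Disk-based iterations
2. In-memory iterations
[Figure: both plans run the vertex compute program as a table UDF over the union of vertices, edges, and messages, with sort and group by partition id (pid); the in-memory plan adds Updates and Synchronization steps around the vertex compute]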

38 Trading Memory for I/O
- Loading and keeping data in main memory: no additional I/Os for each iteration
- All iterations run as a single transaction: reduces overheads such as logging, locking, and buffer lookups
- Unmodified vertex program run via table UDFs
- Communication (message passing) via shared memory
[Figure: in-memory iterations plan with table UDF, Updates, vertex compute, and Synchronization]

39 Comparing Different Implementations in Vertica
LiveJournal graph: 69 million edges, 4.8 million nodes
[Figure: time (seconds) for PageRank and Shortest Path with Vertica (UDF + Disk), Vertica (SQL + Disk), and Vertica (UDF + Shared Memory); one surviving bar label under Shortest Path reads 6.50]

40 Comparison with GraphLab
LiveJournal graph: 69 million edges, 4.8 million nodes
[Figure: time (seconds), split into Algorithm Time and Load/Store Time, for PR, SSSP, and CC on GraphLab and Vertica]

41 Scaling to larger graphs
Twitter graph: 1.4 billion edges, 41.6 million nodes
[Figure: time (seconds), split into Algorithm Time and Load/Store Time, for GraphLab (4 nodes), Giraph (4 nodes), Vertica (SQL) (4 nodes), and Vertica (Mem) (1 node)]

42 Conclusion
Efficient graph analytics is possible within column stores such as Vertica
- graph queries can be mapped to SQL
- several query optimizations can be applied
- column stores serve as efficient backends
- column stores could be extended to trade memory for I/O
The curious case of relational database re-discovery
- relational databases have repeatedly emerged as the backend for new data applications, e.g., XML, RDF, Spatial, Array, etc.
- cycles of branch-innovate-merge-commit
Next time you have a big data problem, try relational databases!
