Evaluating Use of Data Flow Systems for Large Graph Analysis

Andy Yoo and Ian Kaplan, P. O. Box 808, Livermore, CA 94551

This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.

Graph mining techniques have been widely used in many important applications in recent years

Graph mining extracts information by analyzing relations and structures in graphs (such as ER graphs). So-called scale-free graphs can carry rich information.

Graph Mining Applications: Web Search

Google's PageRank uses a web graph to rank web pages for given queries. Related applications: personalized web search, people search. Related techniques: eigenvalue/eigenvector computation, random walk with restart.

Graph Mining Applications: Social Network Analysis

Zachary's Karate Club, 1977: divided into two groups centered around two individuals, 1 and 34. Community detection algorithms can identify the two communities (e.g., Girvan and Newman, 2002). Further analysis reveals detailed community structures in the graph (e.g., van Dongen, 2000 and Palla, 2005).

Graph Mining Applications: Protein Clustering

Proteins with similar functions can be discovered by clustering protein modules in protein-protein interaction graphs.

[Figure: Protein-protein interaction network of yeast. Adamcsek et al., Bioinformatica, 1021, 2006]

Graph Mining Applications: National Security

Apply subgraph pattern matching algorithms to intelligence analysis (e.g., J. Ullman, 1976). Other related applications: exact and inexact pattern discovery, fraud detection, cyber security, behavioral prediction.

T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies for intelligence analysis, ACM, 2004

Challenges

High complexity of graph mining algorithms. Common graph mining algorithms have high-order computational complexity. High-order algorithms (O(N^2) or higher): Page Rank, community finding, path traversal. NP-complete algorithms: maximal cliques, subgraph pattern matching.

Large data size requires out-of-core approaches. Graphs with 10^9 or more nodes and edges are increasingly common. Intermediate results grow exponentially in many cases.

Traditional relational databases have been used in large graph analysis

Because of their prevalence and ease of use, conventional database systems have been used in graph analysis. They are designed for transaction processing, however, and show poor performance and scalability.

Distribution of Response Time for 100 Bi-directional Searches

    Response time    Share of searches
    10+ minutes       2%
    5-10 minutes      5%
    2-5 minutes      26%
    1-2 minutes      30%
    < 1 minute       37%

300B node graph search on Netezza on 700-node NPS (SC'06). 120B node graph search on 60-node MSSG (Cluster'06).

Many-tasks paradigm is currently used for analyzing large data sets: Map/Reduce

Map/Reduce is a popular many-tasks model used for a wide range of applications. In the Map/Reduce model, a program consists of many map and reduce tasks. Each task works independently and processes a list of (key, value) pairs. Data passes between mappers and reducers via intermediate files.

Is Map/Reduce for everything?

[Figure: Map/Reduce model]
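As a rough illustration of the model described on this slide, the following Python sketch counts vertex degrees from an edge list with a map step, a grouping step, and a reduce step. It is a single-process toy, not Hadoop or any real Map/Reduce runtime; the function names and the sample edge list are hypothetical, and the in-memory `shuffle` merely stands in for the intermediate files.

```python
from collections import defaultdict

def map_edge(edge):
    """Map task: emit a (vertex, 1) pair for each endpoint of one edge."""
    src, dst = edge
    return [(src, 1), (dst, 1)]

def shuffle(pairs):
    """Group intermediate values by key (stand-in for intermediate files)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_degree(key, values):
    """Reduce task: sum the counts emitted for one vertex."""
    return key, sum(values)

def vertex_degrees(edges):
    # Each map call and each reduce call is independent of the others,
    # which is what lets a real system run them as separate tasks.
    intermediate = [pair for edge in edges for pair in map_edge(edge)]
    grouped = shuffle(intermediate)
    return dict(reduce_degree(key, values) for key, values in grouped.items())

# Hypothetical toy graph
edges = [(1, 2), (1, 3), (2, 3), (3, 4)]
degrees = vertex_degrees(edges)
```

Every intermediate pair passes through the grouping step, so the volume of intermediate data grows with the input even for a computation this simple.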

Map/Reduce model is too limited for large complex graph analysis

Map/Reduce has been used successfully for some applications: inverted index construction, distributed sort, term-vector calculation, and Page Rank.

Drawbacks: the model is limited to embarrassingly parallel applications, and it shows poor performance and scalability (due to poor handling of intermediate results).

BFS Search Results

    System        Platform                 Time (Sec)
    Map/Reduce    20-node Fenix Cluster    1068
    SGRACE        64-node Tuson Cluster    221

Scaled to 64 nodes, the Map/Reduce time is 333.75 sec (1068 x 20 / 64, which assumes linear scaling). The full PubMed graph with 30 million vertices and 500 million edges was used, except for SGRACE, for which a synthetic graph with 25 million vertices and 125 million edges was used.

Dataflow model is a promising alternative to address these issues

More flexible and complex than Map/Reduce (Map/Reduce on steroids!!). Many independent tasks access external data in parallel, realizing data parallelism. Tasks are triggered by the availability of data. There is no flow of control. Tasks are data parallel and independent.

We evaluated the use of the dataflow model for large graph analysis in this work.

[Figure: Dryad dataflow diagram]
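The idea that tasks fire when their data becomes available, with no explicit flow of control, can be sketched in a few lines of Python. This is a toy single-process scheduler written for illustration only; it is not how Dryad or DAS is implemented, and `run_dataflow`, the task names, and the sample edges are all hypothetical.

```python
def run_dataflow(tasks, sources):
    """Run tasks in data-driven order.

    tasks maps an output name to (function, list of input names).
    sources maps the names of the initial data sets to their values.
    """
    values = dict(sources)
    pending = dict(tasks)
    firing_order = []
    while pending:
        # A task is ready as soon as all of its inputs exist.
        ready = [name for name, (_, inputs) in pending.items()
                 if all(i in values for i in inputs)]
        if not ready:
            raise ValueError("no task can fire: an input is missing or the graph has a cycle")
        for name in ready:  # tasks in one round are independent of each other
            func, inputs = pending.pop(name)
            values[name] = func(*(values[i] for i in inputs))
            firing_order.append(name)
    return values, firing_order

# Tasks are declared in an arbitrary order; data availability decides when each runs.
tasks = {
    "edge_count": (len, ["unique_edges"]),
    "unique_edges": (lambda rows: sorted(set(rows)), ["edges"]),
    "sources": (lambda rows: sorted({src for src, _ in rows}), ["edges"]),
}
values, firing_order = run_dataflow(tasks, {"edges": [(2, 1), (1, 2), (2, 1)]})
```

Here `edge_count` is declared first but fires last, because its input does not exist until `unique_edges` has run.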

We measured the performance of graph algorithms on an actual dataflow machine: Data Analytic Supercomputer (DAS)

DAS vs. RDBMS

DAS: parallel dataflow engine on commodity clusters; specialized high-performance library; streaming data pipelined for maximum in-memory processing; sequentialized disk accesses; optimized for SORT and JOIN operations; offers great flexibility for optimization.

RDBMS: sequential or parallel relational database systems on commodity HW; optimized for transaction processing; ubiquitous; relatively easy to use; relies on SQL compiler for optimization.

DAS programming and execution environment

DAS uses ECL, a proprietary dataflow language. Built-in ECL data manipulation constructs (JOIN, SORT, MERGE, etc.) are implemented in a highly optimized library. Unlike SQL, these low-level constructs are suitable for complex graph operations.

[Figure: compilation flow with boxes labeled ECL Code, ECL Compiler, ECL Library, C++ Code, Executable, and three CE boxes]

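An example ECL code

    // Vertex record: a single 8-byte integer ID
    vertex_rec := RECORD
        INTEGER8 gid;
    END;

    // Read the bidirectional link file
    adjacent_raw := DATASET('pubmed::datasets::full::bi_links_split_ds',
                            PubMed_Definitions_Full.Links_Bidirectional, THOR);
    // Hash-distribute rows across nodes by source vertex ID
    adjacent_distr := DISTRIBUTE(adjacent_raw, HASH32(src_gid));
    // Sort each node's rows locally by source vertex ID
    adjacent_sort := SORT(adjacent_distr, src_gid, LOCAL);
    // Keep one row per source vertex ID on each node
    adjacent := DEDUP(adjacent_sort, src_gid, LOCAL);
    OUTPUT(adjacent);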
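For readers unfamiliar with ECL, the DISTRIBUTE, SORT, and DEDUP steps in the example can be mimicked in plain Python. This is a sketch under assumptions: lists stand in for cluster nodes, Python's built-in `hash` stands in for HASH32, the behavior of the LOCAL option is modeled as "operate on each partition separately", and the sample links are made up. It is not the ECL library's implementation.

```python
def distribute(rows, num_nodes, key):
    """Send each row to the partition chosen by hashing its key."""
    partitions = [[] for _ in range(num_nodes)]
    for row in rows:
        partitions[hash(key(row)) % num_nodes].append(row)
    return partitions

def local_sort(partitions, key):
    """Sort every partition on its own; no data moves between partitions."""
    return [sorted(part, key=key) for part in partitions]

def local_dedup(partitions, key):
    """Within each sorted partition, keep the first row of every run of equal keys."""
    deduped = []
    for part in partitions:
        kept = []
        for row in part:
            if not kept or key(kept[-1]) != key(row):
                kept.append(row)
        deduped.append(kept)
    return deduped

def src_gid(row):
    """Key function: the source vertex ID is the first field of a link row."""
    return row[0]

# Hypothetical (src_gid, dst_gid) link rows
links = [(1, 2), (1, 3), (2, 1), (3, 1), (2, 3)]
adjacent = local_dedup(local_sort(distribute(links, 2, src_gid), src_gid), src_gid)
```

Because rows with the same key always land in the same partition, the local sort and local dedup give the same set of keys as a global dedup would, without any cross-partition communication after the distribute step.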

We evaluated some of the most commonly used applications in our experiments

Applications evaluated on DAS System

    Application           Description
    Path Traversal        Uni- and bi-directional BFS
    Pattern Matching      Find subgraphs that match a given template
    TeraByte (TB) Sort    Jim Gray's SORT Benchmark
    Page Rank             Eigenvector using power method
    Disambiguation        Binning-based coreference resolution

Real-world graphs are used in our performance experiments

                     PubMed Sm    PubMed Lg
    V                1M           29M
    E                2M           270M
    Raw data size    400 MB       127 GB

[Figure: PubMed graph schema. Node labels: Author, Article, Grant, Grant Agency, Journal, Journal Issue, Chemical, Keyword, MeshHeading, ContactInfo. Edge labels: IsAuthorof, HasMeshHeading, FundedByGrant, IssuedGrant, PublishedIn, HasChemical, HasKeyword, HasContactinfo, IsIssueofJournal.]

Path Traversal: Breadth-first search (BFS) on DAS

Performance improved by constructing an adjacency list via denormalization, which reduces the number of rows to join. The large PubMed data set was used.

BFS time from source to destination (seconds)

                      Edge List    Adjacency List (Denormalized)
    Unidirectional    287.926      120.359
    Bidirectional     204.90       56.431
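The denormalization step can be pictured as folding an edge list into one row per source vertex before searching. The Python sketch below shows a unidirectional BFS over such an adjacency list. It is an in-memory illustration only, not the ECL join-based implementation that was measured; the function names and the sample edges are hypothetical.

```python
from collections import defaultdict, deque

def build_adjacency(edges):
    """Denormalize an edge list into one neighbor list per source vertex."""
    adjacency = defaultdict(list)
    for src, dst in edges:
        adjacency[src].append(dst)
    return adjacency

def bfs_path_length(adjacency, source, destination):
    """Return the number of hops from source to destination, or None if unreachable."""
    if source == destination:
        return 0
    visited = {source}
    frontier = deque([(source, 0)])
    while frontier:
        vertex, depth = frontier.popleft()
        # One lookup fetches all neighbors, instead of one edge row at a time.
        for neighbor in adjacency.get(vertex, ()):
            if neighbor == destination:
                return depth + 1
            if neighbor not in visited:
                visited.add(neighbor)
                frontier.append((neighbor, depth + 1))
    return None

# Hypothetical directed edges
edges = [(1, 2), (2, 3), (3, 4), (1, 5)]
adjacency = build_adjacency(edges)
hops = bfs_path_length(adjacency, 1, 4)
```

A bidirectional search would run the same loop from both ends (using a reversed adjacency list for the destination side) and stop when the two visited sets meet.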

DAS system is ideal for handling complex subgraph pattern queries on large data sets

Query 1: Find authors who published four articles in specific dates.
Query 2: Find authors who published two articles in the same journal.
Query 3: Find authors who published four articles in the journal Physical Review Letters.

DAS system is ideal for handling complex subgraph pattern queries on large data sets (Cont'd)

Query 4: Find two authors who have coauthored two papers.
Query 5: Find an article that has an associated grant and an article that does not have an associated grant and their corresponding authors.
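A pattern query such as Query 4 amounts to joining the authorship edges with themselves on the article. The Python sketch below shows that idea on a toy edge list. It is not the ECL query used in the experiments; the function name and sample data are hypothetical, and it assumes "two papers" means at least two distinct shared articles.

```python
from collections import defaultdict
from itertools import combinations

def coauthor_pairs(is_author_of, min_papers=2):
    """Find author pairs that share at least min_papers articles.

    is_author_of is a list of (author, article) edges.
    """
    # Group authors by article (the join key).
    authors_by_article = defaultdict(set)
    for author, article in is_author_of:
        authors_by_article[article].add(author)

    # Self-join: every pair of authors on the same article shares that article.
    shared = defaultdict(set)
    for article, authors in authors_by_article.items():
        for pair in combinations(sorted(authors), 2):
            shared[pair].add(article)

    return {pair: articles for pair, articles in shared.items()
            if len(articles) >= min_papers}

# Hypothetical authorship edges
edges = [("a", "p1"), ("b", "p1"),
         ("a", "p2"), ("b", "p2"), ("c", "p2"),
         ("c", "p3")]
matches = coauthor_pairs(edges)
```

The pairwise expansion inside each article is where intermediate results can grow quickly, which is why handling of intermediate data matters for this class of query.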

Query Performance for Large PubMed (30M nodes)

Times in seconds.

               DAS           Netezza       YADM
               (20 nodes)    (54 nodes)    (4 nodes)
    Query 1    9.422         27.3          120.00
    Query 2    142.099       834.47        930.00
    Query 3    469.511       15392.96      10188.00
    Query 4    37.803        741.42        667.00
    Query 5    44.600        496.48        N/A

~250 - ~300X Speedup

DAS system still outperforms other SQL machines in price/performance

Metric = Time/(#Spindles * Cost)

               DAS            Netezza        YADM
    Query 1    2.51253E-05    6.66144E-05    0.0004
    Query 2    0.000378931    0.00203618     0.0031
    Query 3    0.001252029    0.037560164    0.03396
    Query 4    0.000100808    0.001809129    0.002223333
    Query 5    0.000118933    0.001211454    N/A

LNSSI/LLNL measured Terabyte Sort (TB Sort) performance

TB Sort is one of the sort benchmarks; it measures the elapsed time to sort 10^12 bytes of data. Yahoo holds the current record (as of March 2009): 3.48 minutes on 910 nodes (4 dual-core processors, 4 disks, 8 GB memory) running Hadoop Map/Reduce.

We performed TB sort on a 20-node DAS system.

    System           Elapsed time
    Apache Hadoop    03:20:44
    DAS              01:39:26

DAS achieved a 2X speedup through radix-based distribution and local sort, which makes the tasks independent, and through the optimized SORT operation.
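The combination of radix-based distribution and local sort can be sketched as follows: route each record to a node by the leading byte of its key, sort each node's records independently, and read the nodes back in order. This Python sketch is a single-process illustration under assumptions, not the DAS implementation; lists stand in for nodes, and the even split of the byte range assumes keys are roughly uniformly distributed.

```python
def radix_distributed_sort(records, num_nodes):
    """Sort byte-string records by range-partitioning on the first byte.

    Node i receives the records whose first byte falls in the i-th slice of
    0..255, so every record on node i sorts before every record on node i+1.
    """
    buckets = [[] for _ in range(num_nodes)]
    for record in records:
        first_byte = record[0] if record else 0
        buckets[first_byte * num_nodes // 256].append(record)

    # Each bucket is sorted on its own; no merge step is needed afterwards.
    for bucket in buckets:
        bucket.sort()

    return [record for bucket in buckets for record in bucket]

# Hypothetical records
records = [b"pear", b"apple", b"zebra", b"mango", b"\x00low", b"\xffhigh"]
result = radix_distributed_sort(records, 4)
```

Because the partition boundaries already impose a global order, the only communication is the initial distribution; skewed keys would unbalance the buckets, which is the main caveat of this scheme.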

Found some key people from large Enron email graph by running Page Rank algorithm

The data has 4022 Enron employees and 51078 emails. The top 30 high scorers include some notable names: Jeff Dasovich, Louise Kitchen, Tana Jones, John Lavorato. The run took 7 seconds on DAS.
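Page Rank by the power method repeatedly redistributes each vertex's score along its outgoing edges until the scores settle. The following pure-Python sketch shows the iteration on a toy email graph. It is not the ECL code run on DAS; the damping factor of 0.85, the fixed iteration count, the even spreading of dangling-vertex scores, and the sample graph are all assumptions made for illustration.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Compute Page Rank scores with the power method.

    edges is a list of (sender, receiver) pairs. Scores always sum to 1.
    """
    nodes = sorted({v for edge in edges for v in edge})
    out_links = {v: [] for v in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Score held by vertices with no outgoing edges is spread evenly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for src in nodes:
            for dst in out_links[src]:
                new_rank[dst] += damping * rank[src] / len(out_links[src])
        rank = new_rank
    return rank

# Hypothetical email graph: a, b, and d all write to c; c writes to a.
emails = [("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
scores = pagerank(emails)
top = max(scores, key=scores.get)
```

In this toy graph `c` receives mail from three senders and ends up with the highest score, with `a` second because it is the only recipient of the highest-ranked vertex.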

Developed scalable algorithm for author disambiguation in many-tasks paradigm

LLNL has developed an entity resolution algorithm based on a binning (or blocking) algorithm. The original algorithm did not complete on the full PubMed data set: only 67% completed (in 2 months), and it could not resolve bins of more than 300 names.

The new algorithm uses DAS as an active-disk system: it brings computation to where the data is, instead of moving data from the data store. It achieved orders of magnitude performance improvement.

Distributed disambiguation algorithm: DAS as an Active Disk

The binning algorithm works only on local data, run by many tasks in parallel. It was able to process 45529883 bins in 20 hours!

[Figure: Original (Sequential) Algorithm versus Many-tasks Disambiguation Algorithm. Labels: Author Info, Coauthors, Keywords, Abstracts, Titles, MySQL RDBMS, JDBC, Reader (performance bottleneck), Binner, Bin, Binning Algorithm (BA), Resolver, Disambiguated Authors.]
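Binning (blocking) limits comparisons to records that share a cheap key, and each bin can then be resolved independently of the others. The Python sketch below illustrates that structure only. The blocking key (last name plus first initial), the shared-coauthor matching rule, and the sample records are hypothetical simplifications, not LLNL's algorithm, which the slides do not describe in detail.

```python
from collections import defaultdict

def bin_key(name):
    """Hypothetical blocking key: lowercased last name plus first initial."""
    parts = name.split()
    return (parts[-1].lower(), parts[0][0].lower())

def build_bins(records):
    """Place each author record in the bin named by its blocking key."""
    bins = defaultdict(list)
    for record in records:
        bins[bin_key(record["name"])].append(record)
    return bins

def resolve_bin(records):
    """Toy resolver: greedily merge records in one bin that share a coauthor."""
    clusters = []
    for record in records:
        for cluster in clusters:
            if cluster["coauthors"] & record["coauthors"]:
                cluster["ids"].append(record["id"])
                cluster["coauthors"] |= record["coauthors"]
                break
        else:
            clusters.append({"ids": [record["id"]],
                             "coauthors": set(record["coauthors"])})
    return [cluster["ids"] for cluster in clusters]

# Hypothetical author records
records = [
    {"id": 1, "name": "John Smith", "coauthors": {"A. Lee"}},
    {"id": 2, "name": "J. Smith", "coauthors": {"A. Lee", "B. Kim"}},
    {"id": 3, "name": "Jane Smith", "coauthors": {"C. Wu"}},
    {"id": 4, "name": "Mary Jones", "coauthors": {"A. Lee"}},
]
# Each bin touches only its own records, so bins could be resolved in parallel.
resolved = {key: resolve_bin(members) for key, members in build_bins(records).items()}
```

Since no bin reads another bin's records, the per-bin work maps naturally onto independent tasks that run where the data already resides.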

Many-tasks model has enabled efficient large graph analysis

High performance and scalability are feasible with the many-tasks approach.

Benefits: enables data parallelism on large scale data; reduces communication via independent localized tasks; enables optimization of tasks for built-in constructs; combines complexity and flexibility.

Conclusions

Studied the use of the many-tasks model for large complex graph analysis. Evaluated the performance of a comprehensive set of graph applications, including subgraph pattern queries, on an actual dataflow system. The many-tasks paradigm is a very promising approach for graph mining applications and offers many advantages over contemporary methods like RDBMS and Map/Reduce.

Thank you