Evaluating Use of Data Flow Systems for Large Graph Analysis


Andy Yoo and Ian Kaplan
P. O. Box 808, Livermore, CA 94551
This work was performed under the auspices of the U.S. Department of Energy under Contract DE-AC52-07NA27344.

Graph mining techniques have been widely used in many important applications in recent years
- Graph mining extracts information by analyzing relations and structures in graphs (such as ER graphs)
- So-called scale-free graphs can carry rich information

Graph Mining Applications: Web Search
- Google's PageRank uses a web graph to rank web pages for given queries
- Related applications
  - Personalized web search
  - People search
- Eigenvalue/eigenvector
- Random walk with restart

Graph Mining Applications: Social Network Analysis
- Zachary's Karate Club, 1977: divided into two groups centered around two individuals, 1 and 34
- Community detection algorithms can identify the two communities (e.g., Girvan and Newman, 2002)
- Further analysis reveals detailed community structures in the graph (e.g., van Dongen, 2000 and Palla, 2005)

Graph Mining Applications: Protein Clustering
- Proteins with similar functions can be discovered by clustering protein modules in protein-protein interaction graphs.
- Figure: protein-protein interaction network of yeast (Adamcsek et al., Bioinformatics, 1021, 2006)

Graph Mining Applications: National Security
- Apply subgraph pattern matching algorithms to intelligence analysis (e.g., J. R. Ullmann, 1976)
- Other related applications
  - Exact and inexact pattern discovery
  - Fraud detection
  - Cyber security
  - Behavioral prediction
- Reference: T. Coffman, S. Greenblatt, S. Marcus, Graph-based technologies for intelligence analysis, ACM, 2004

Challenges
- High complexity of graph mining algorithms
  - Common graph mining algorithms have high-order computational complexity
  - High-order algorithms (O(N^2) or higher): PageRank, community finding, path traversal
  - NP-complete problems: maximal cliques, subgraph pattern matching
- Large data size requires out-of-core approaches
  - Graphs with 10^9 or more nodes and edges are increasingly common
  - Intermediate results grow exponentially in many cases

Traditional relational databases have been used in large graph analysis
- Due to their prevalence and ease of use, conventional database systems have been used in graph analysis
- Designed for transaction processing
- Poor performance and scalability
- Figure: distribution of response time for 100 bi-directional searches (bins: 10+ minutes, 5-10 minutes, 2-5 minutes, 1-2 minutes, < 1 minute; values: 2%, 5%, 26%, 30%, 37%)
- 300B node graph search on Netezza on 700-node NPS (SC '06)
- 120B node graph search on 60-node MSSG (Cluster '06)

Many-tasks paradigm is currently used for analyzing large data sets: Map/Reduce
- Map/Reduce is a popular many-tasks model used for a wide range of applications
- Map/Reduce model
  - An M/R program consists of many map and reduce tasks
  - Each task works independently
  - Data passes between mappers and reducers via intermediate files
  - Processes lists of (key, value) pairs
- Is Map/Reduce for everything?
- Figure: Map/Reduce model
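The map, group-by-key, and reduce steps listed above can be sketched in a few lines of Python. This is a minimal single-process illustration, not the implementation of any particular Map/Reduce system: the function names (`map_reduce`, `degree_mapper`, `degree_reducer`) and the vertex-degree example are assumptions made for illustration, and an in-memory dictionary stands in for the intermediate files.

```python
from collections import defaultdict
from itertools import chain


def map_reduce(records, mapper, reducer):
    """Run mapper over every record, group by key, then reduce each group."""
    # Map phase: each record is processed independently of the others.
    pairs = chain.from_iterable(mapper(record) for record in records)
    # Shuffle phase: group values by key (stands in for intermediate files).
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    # Reduce phase: each key is reduced independently of the others.
    return {key: reducer(key, values) for key, values in groups.items()}


def degree_mapper(edge):
    """Emit (vertex, 1) for both endpoints of an undirected edge."""
    src, dst = edge
    yield src, 1
    yield dst, 1


def degree_reducer(vertex, counts):
    """Sum the per-edge counts to obtain the vertex degree."""
    return sum(counts)


# Hypothetical toy edge list.
edges = [("a", "b"), ("a", "c"), ("b", "c"), ("c", "d")]
degrees = map_reduce(edges, degree_mapper, degree_reducer)
print(degrees)
```

Vertex degree fits the model well because each edge can be mapped without looking at any other edge.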

Map/Reduce model is too limited for large complex graph analysis
- Map/Reduce has been used successfully for some applications:
  - Inverted index construction
  - Distributed sort
  - Term-vector calculation
  - PageRank
- Drawbacks
  - Model limited to embarrassingly parallel applications
  - Poor performance and scalability (due to poor handling of intermediate results)

BFS search results:

| System | Platform | Time (sec) |
|---|---|---|
| Map/Reduce | 20-node Fenix Cluster | 1068 |
| SGRACE | 64-node Tuson Cluster | 221 |

Scaled to 64 nodes, the Map/Reduce time is 333.75 sec (1068 × 20 / 64).

The full PubMed graph with 30 million vertices and 500 million edges was used, except for SGRACE, for which a synthetic graph with 25 million vertices and 125 million edges was used.

Dataflow model is a promising alternative to address these issues
- More flexible and complex than Map/Reduce (Map/Reduce on steroids!!)
- Many independent tasks accessing external data in parallel, realizing data parallelism
- Tasks triggered by the availability of data
  - No flow of control
  - Data parallel and independent
- We evaluated the use of the dataflow model for large graph analysis in this work
- Figure: Dryad dataflow diagram
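The idea that tasks fire when their input data becomes available, rather than in a programmed order, can be illustrated with a toy scheduler. This is a sketch under stated assumptions, not how Dryad or DAS work internally: `run_dataflow`, the node names, and the three-step pipeline are all hypothetical.

```python
def run_dataflow(nodes, inputs):
    """Fire each node once all of its inputs are available.

    nodes:  {name: (function, [names of inputs])}
    inputs: {name: value} for externally supplied data
    Returns the computed values and the order in which nodes fired.
    """
    values = dict(inputs)
    pending = dict(nodes)
    fired = []
    while pending:
        # A node is ready as soon as every one of its inputs has a value.
        ready = [name for name, (_, deps) in pending.items()
                 if all(dep in values for dep in deps)]
        if not ready:
            raise ValueError("some nodes can never receive their inputs")
        for name in ready:
            func, deps = pending.pop(name)
            values[name] = func(*(values[dep] for dep in deps))
            fired.append(name)
    return values, fired


def drop_adjacent_duplicates(items):
    """Keep the first element of every run of equal elements."""
    return [x for i, x in enumerate(items) if i == 0 or x != items[i - 1]]


# Nodes are deliberately declared out of order: execution order is
# decided by data availability, not by declaration order.
nodes = {
    "count": (len, ["deduped"]),
    "deduped": (drop_adjacent_duplicates, ["sorted"]),
    "sorted": (sorted, ["edges"]),
}
values, fired = run_dataflow(nodes, {"edges": [(2, 3), (1, 2), (1, 2)]})
print(fired, values["count"])
```

Nodes that become ready in the same round have no dependency on one another, so a real engine could run them in parallel.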

We measured the performance of graph algorithms on an actual dataflow machine: Data Analytic Supercomputer (DAS)

DAS vs. RDBMS

DAS:
- Parallel dataflow engine on commodity clusters
- Specialized high-performance library
- Streaming data pipelined for maximum in-memory processing
- Sequentialized disk accesses
- Optimized for SORT and JOIN operations
- Offers great flexibility for optimization

RDBMS:
- Sequential or parallel relational database systems on commodity HW
- Optimized for transaction processing
- Ubiquitous
- Relatively easy to use
- Relies on SQL compiler for optimization

DAS programming and execution environment
- Uses ECL, a proprietary dataflow language
- Built-in ECL data manipulation constructs (JOIN, SORT, MERGE, etc.) are implemented in a highly optimized library
- Unlike SQL, these low-level constructs are suitable for complex graph operations
- Figure: compilation flow (labels: ECL Code, ECL Compiler, ECL Library, C++ Code, Executable, CE, CE, CE)

Example ECL code

    vertex_rec := RECORD
      INTEGER8 gid;
    END;

    adjacent_raw := DATASET('pubmed::datasets::full::bi_links_split_ds',
                            PubMed_Definitions_Full.Links_Bidirectional, THOR);
    adjacent_distr := DISTRIBUTE(adjacent_raw, HASH32(src_gid));
    adjacent_sort := SORT(adjacent_distr, src_gid, LOCAL);
    adjacent := DEDUP(adjacent_sort, src_gid, LOCAL);
    OUTPUT(adjacent);
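For readers unfamiliar with ECL, the distribute, local sort, and local dedup steps of the example can be mimicked in plain Python. This sketch rests on assumed semantics: that DISTRIBUTE places records on nodes by a hash of `src_gid`, and that the LOCAL forms of SORT and DEDUP operate within each node independently, with DEDUP keeping the first record of each run of equal keys. The helper names, the `dst_gid` field, and the toy link records are hypothetical.

```python
def distribute(records, num_nodes, key):
    """Assign each record to a partition by hashing its key."""
    partitions = [[] for _ in range(num_nodes)]
    for record in records:
        partitions[hash(key(record)) % num_nodes].append(record)
    return partitions


def local_sort(partitions, key):
    """Sort every partition on its own; no data moves between partitions."""
    return [sorted(part, key=key) for part in partitions]


def local_dedup(partitions, key):
    """Within each partition, keep the first record of each run of equal keys."""
    result = []
    for part in partitions:
        kept = []
        for record in part:
            if not kept or key(kept[-1]) != key(record):
                kept.append(record)
        result.append(kept)
    return result


def by_src(record):
    return record["src_gid"]


# Hypothetical link records standing in for the PubMed link data set.
links = [
    {"src_gid": 3, "dst_gid": 1},
    {"src_gid": 1, "dst_gid": 2},
    {"src_gid": 3, "dst_gid": 2},
    {"src_gid": 2, "dst_gid": 3},
    {"src_gid": 1, "dst_gid": 3},
]
adjacent = local_dedup(local_sort(distribute(links, 2, by_src), by_src), by_src)
print(adjacent)
```

Because all records with the same `src_gid` hash to the same partition, the sort and dedup steps never need data from another node.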

We evaluated some of the most commonly used applications in our experiments

Applications evaluated on the DAS system:

| Application | Description |
|---|---|
| Path Traversal | Uni- and bi-directional BFS |
| Pattern Matching | Find subgraphs that match a given template |
| TeraByte (TB) Sort | Jim Gray's SORT benchmark |
| PageRank | Eigenvector using power method |
| Disambiguation | Binning-based coreference resolution |

Real-world graphs are used in our performance experiments

| | PubMed Sm | PubMed Lg |
|---|---|---|
| V | 1M | 29M |
| E | 2M | 270M |
| Raw data size | 400 MB | 127 GB |

Figure: PubMed graph schema. Node types: Author, Article, Grant, Grant Agency, Journal, Journal Issue, Chemical, Keyword, MeshHeading, ContactInfo. Edge types: IsAuthorOf, HasMeshHeading, FundedBy, IssuedGrant, PublishedIn, HasChemical, HasKeyword, HasContactInfo, IsIssueOf.

Path Traversal: Breadth-first search (BFS) on DAS
- Figure: search between a source and a destination vertex
- Improved performance by constructing an adjacency list via denormalization, which reduces the number of rows to join
- Used the large PubMed data

BFS time (seconds):

| Search | Edge List | Adjacency List (Denormalized) |
|---|---|---|
| Unidirectional | 287.926 | 120.359 |
| Bidirectional | 204.90 | 56.431 |
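The two ideas on this slide, denormalizing an edge list into an adjacency list and searching from both ends, can be sketched as follows. This is an in-memory illustration, not the join-based DAS implementation; the function names and the toy graph are hypothetical, and the graph is treated as undirected.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Denormalize an edge list: one entry per vertex holding all neighbors."""
    adjacency = defaultdict(set)
    for u, v in edges:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return adjacency


def bidirectional_bfs(adjacency, source, target):
    """Return the hop count of a shortest path, or None if unreachable."""
    if source == target:
        return 0
    dist_s, dist_t = {source: 0}, {target: 0}
    frontier_s, frontier_t = {source}, {target}
    while frontier_s and frontier_t:
        # Expand the smaller frontier by one full level.
        from_source = len(frontier_s) <= len(frontier_t)
        if from_source:
            frontier, dist, other = frontier_s, dist_s, dist_t
        else:
            frontier, dist, other = frontier_t, dist_t, dist_s
        next_frontier = set()
        best = None
        for u in frontier:
            for v in adjacency.get(u, ()):
                if v in other:
                    # The two searches meet on edge (u, v).
                    candidate = dist[u] + 1 + other[v]
                    best = candidate if best is None else min(best, candidate)
                if v not in dist:
                    dist[v] = dist[u] + 1
                    next_frontier.add(v)
        if best is not None:
            return best
        if from_source:
            frontier_s = next_frontier
        else:
            frontier_t = next_frontier
    return None


# Hypothetical toy graph: a path 1-2-3-4-5 plus a shortcut 1-6-5.
edges = [(1, 2), (2, 3), (3, 4), (4, 5), (1, 6), (6, 5)]
adjacency = build_adjacency(edges)
print(bidirectional_bfs(adjacency, 1, 5))
```

Searching from both ends keeps each frontier smaller than a single search of the full depth would, which is one plausible reason the bidirectional rows in the table are faster.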

DAS system is ideal for handling complex subgraph pattern queries on large data sets
- Query 1: Find authors who published four articles on specific dates
- Query 2: Find authors who published two articles in the same journal
- Query 3: Find authors who published four articles in the journal Physical Review Letters

DAS system is ideal for handling complex subgraph pattern queries on large data sets (cont'd)
- Query 4: Find two authors who have coauthored two papers
- Query 5: Find an article that has an associated grant and an article that does not have an associated grant, and their corresponding authors
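A pattern query such as Query 4 amounts to a self-join on the author-article edges. The sketch below is one possible in-memory formulation, not the ECL query used in the experiments; the function name, the reading of "two papers" as "at least two", and the toy data are assumptions.

```python
from collections import defaultdict
from itertools import combinations


def coauthor_pairs(is_author_of, min_shared=2):
    """Find author pairs that share at least min_shared articles.

    is_author_of: iterable of (author, article) edges.
    Returns {(author_a, author_b): set of shared articles}.
    """
    # Group authors by article (the join key).
    authors_by_article = defaultdict(set)
    for author, article in is_author_of:
        authors_by_article[article].add(author)
    # Every pair of authors on the same article is one self-join match.
    shared = defaultdict(set)
    for article, authors in authors_by_article.items():
        for a, b in combinations(sorted(authors), 2):
            shared[(a, b)].add(article)
    return {pair: arts for pair, arts in shared.items() if len(arts) >= min_shared}


# Hypothetical author-article edges.
is_author_of = [
    ("ann", "p1"), ("bob", "p1"),
    ("ann", "p2"), ("bob", "p2"), ("cat", "p2"),
    ("cat", "p3"),
]
matches = coauthor_pairs(is_author_of)
print(matches)
```

The number of candidate pairs grows quadratically with the number of authors per article, which illustrates why intermediate results can become large for pattern queries.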

Query Performance for Large PubMed (30M nodes)

Time (seconds):

| Query | DAS (20 nodes) | Netezza (54 nodes) | YADM (4 nodes) |
|---|---|---|---|
| Query 1 | 9.422 | 27.3 | 120.00 |
| Query 2 | 142.099 | 834.47 | 930.00 |
| Query 3 | 469.511 | 15392.96 | 10188.00 |
| Query 4 | 37.803 | 741.42 | 667.00 |
| Query 5 | 44.600 | 496.48 | N/A |

~250 - ~300X speedup

DAS system still outperforms other SQL machines in price/performance

Metric = Time/(#Spindles * Cost)

| Query | DAS | Netezza | YADM |
|---|---|---|---|
| Query 1 | 2.51253E-05 | 6.66144E-05 | 0.0004 |
| Query 2 | 0.000378931 | 0.00203618 | 0.0031 |
| Query 3 | 0.001252029 | 0.037560164 | 0.03396 |
| Query 4 | 0.000100808 | 0.001809129 | 0.002223333 |
| Query 5 | 0.000118933 | 0.001211454 | N/A |

LNSSI/LLNL measured Terabyte Sort (TB Sort) performance
- One of the sort benchmarks; it measures the elapsed time to sort 10^12 bytes of data
- Yahoo holds the current record (as of March 2009)
  - 3.48 minutes on 910 nodes (4 dual-core processors, 4 disks, 8 GB memory)
  - Hadoop Map/Reduce
- Performed TB sort on a 20-node DAS system

| System | Elapsed time |
|---|---|
| Apache Hadoop | 03:20:44 |
| DAS | 01:39:26 |

- Achieved 2X speedup by radix-based distribution and local sort
  - Makes tasks independent
  - Optimized SORT operation
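Radix-based distribution followed by a local sort can be sketched as below. It is a toy illustration under assumptions not stated on the slide: keys are non-negative integers of a fixed bit width, the number of nodes is a power of two, and each node receives the keys that share the same leading bits. The function names are hypothetical. The last lines recompute the speedup from the two elapsed times reported above.

```python
def radix_distribute(keys, node_bits, key_bits):
    """Send each key to the node selected by its most significant bits."""
    shift = key_bits - node_bits
    partitions = [[] for _ in range(1 << node_bits)]
    for key in keys:
        partitions[key >> shift].append(key)
    return partitions


def distributed_sort(keys, node_bits=2, key_bits=16):
    """Sort by radix distribution plus an independent sort on every node."""
    partitions = radix_distribute(keys, node_bits, key_bits)
    # Node i only holds keys smaller than those on node i + 1, so
    # concatenating the locally sorted partitions yields a global order.
    return [key for part in partitions for key in sorted(part)]


def to_seconds(hms):
    """Convert an HH:MM:SS string to seconds."""
    hours, minutes, seconds = (int(x) for x in hms.split(":"))
    return hours * 3600 + minutes * 60 + seconds


# Hypothetical 16-bit keys.
keys = [513, 7, 65535, 32768, 300, 16384, 7]
result = distributed_sort(keys)
print(result)

# Speedup from the elapsed times reported on the slide.
speedup = to_seconds("03:20:44") / to_seconds("01:39:26")
print(round(speedup, 2))
```

No merge step across nodes is needed after the local sorts, which is what makes the per-node tasks independent.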

Found some key people from the large Enron email graph by running the PageRank algorithm
- Data has 4022 Enron employees and 51078 emails
- Top 30 high scorers found, with some notable names
  - Jeff Dasovich
  - Louise Kitchen
  - Tana Jones
  - John Lavorato
- Took 7 seconds to run on DAS
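PageRank computed by the power method can be sketched in pure Python as follows. The damping factor of 0.85, the uniform redistribution of rank from vertices without outgoing edges, the convergence tolerance, and the four-person toy graph are assumptions; none of them are taken from the Enron experiment.

```python
def pagerank(edges, damping=0.85, tol=1e-10, max_iter=200):
    """Power iteration for PageRank on a directed edge list."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(max_iter):
        # Rank held by vertices without out-links is spread uniformly.
        dangling = sum(rank[v] for v in nodes if not out_links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            if out_links[v]:
                share = damping * rank[v] / len(out_links[v])
                for w in out_links[v]:
                    new_rank[w] += share
        delta = sum(abs(new_rank[v] - rank[v]) for v in nodes)
        rank = new_rank
        if delta < tol:
            break
    return rank


# Hypothetical email graph: an edge (x, y) means x sent mail to y.
emails = [("a", "c"), ("b", "c"), ("c", "a"), ("d", "c")]
ranks = pagerank(emails)
top = max(ranks, key=ranks.get)
print(top, ranks)
```

Each iteration is one pass over the edge list, which maps naturally onto a join between the edge table and the current rank table in a dataflow system.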

Developed a scalable algorithm for author disambiguation in the many-tasks paradigm
- LLNL has developed an entity resolution algorithm based on a binning (or blocking) algorithm
- The original algorithm did not complete for the full PubMed data set
  - Only 67% completed (in 2 months)
  - Could not resolve bins with 300+ names
- Uses DAS as an active-disk system
  - Bring computation to where the data is, instead of moving data from the data store
- Achieved orders of magnitude performance improvement

Distributed disambiguation algorithm: DAS as an Active Disk
- Figure: original (sequential) algorithm vs. many-tasks disambiguation algorithm (labels: Author Info, Coauthors, Keywords, Abstracts, Titles, MySQL RDBMS, JDBC, Reader, performance bottleneck, Binner, Bin, Binning Algorithm (BA), Resolver, Disambiguated Authors)
- The binning algorithm works only on local data, run as many tasks in parallel.
- Able to process 45529883 bins in 20 hours!
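The structure of a binning (blocking) approach, where records are first grouped into bins and each bin is then resolved independently, can be sketched as below. The blocking key (last name plus first initial), the shared-coauthor matching rule, and the record fields are hypothetical stand-ins; the slides do not describe the actual features or matching logic used by the LLNL algorithm.

```python
from collections import defaultdict


def bin_key(record):
    """Hypothetical blocking key: last name plus first initial."""
    return (record["last"].lower(), record["first"][:1].lower())


def resolve_bin(records):
    """Cluster the records of one bin; records sharing a coauthor are merged."""
    parent = list(range(len(records)))

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Pairwise comparison is confined to a single bin.
    for i in range(len(records)):
        for j in range(i + 1, len(records)):
            if records[i]["coauthors"] & records[j]["coauthors"]:
                parent[find(i)] = find(j)
    clusters = defaultdict(list)
    for i, record in enumerate(records):
        clusters[find(i)].append(record["id"])
    return sorted(clusters.values())


def disambiguate(records):
    """Bin the records, then resolve every bin as an independent task."""
    bins = defaultdict(list)
    for record in records:
        bins[bin_key(record)].append(record)
    # No bin needs data from another bin, so these calls could run in parallel.
    return {key: resolve_bin(members) for key, members in bins.items()}


# Hypothetical author records.
records = [
    {"id": 1, "first": "Jane", "last": "Smith", "coauthors": {"lee", "wu"}},
    {"id": 2, "first": "J.", "last": "Smith", "coauthors": {"wu"}},
    {"id": 3, "first": "John", "last": "Smith", "coauthors": {"kim"}},
    {"id": 4, "first": "Ana", "last": "Diaz", "coauthors": {"kim"}},
]
resolved = disambiguate(records)
print(resolved)
```

Because comparisons never cross bin boundaries, each bin can be processed on the node that already stores its records, which matches the active-disk idea of bringing computation to the data.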

Many-tasks model has enabled efficient large graph analysis
- High performance and scalability are feasible with the many-tasks approach
- Benefits
  - Enables data parallelism on large-scale data
  - Reduces communication via independent localized tasks
  - Enables optimization of tasks for built-in constructs
  - Combines complexity and flexibility

Conclusions
- Studied the use of the many-tasks model for large complex graph analysis
- Evaluated the performance of a comprehensive set of graph applications, including subgraph pattern queries, on an actual dataflow system
- The many-tasks paradigm is a very promising approach for graph mining applications and offers many advantages over contemporary methods like RDBMS and Map/Reduce

Thank you