G(B)enchmark GraphBench: Towards a Universal Graph Benchmark. Khaled Ammar M. Tamer Özsu

Similar documents
Graph Algorithms using Map-Reduce. Graphs are ubiquitous in modern society. Some examples: The hyperlink structure of the web

WGB: Towards a Universal Graph Benchmark

Distributed computing: index building and use

Link Analysis in the Cloud

Wedge A New Frontier for Pull-based Graph Processing. Samuel Grossman and Christos Kozyrakis Platform Lab Retreat June 8, 2018

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah

Towards Performance and Scalability Analysis of Distributed Memory Programs on Large-Scale Clusters

Fall 2018: Introduction to Data Science GIRI NARASIMHAN, SCIS, FIU

Challenges for Data Driven Systems

Managing and Mining Billion Node Graphs. Haixun Wang Microsoft Research Asia

Distributed computing: index building and use

B490 Mining the Big Data. 5. Models for Big Data

Data-Intensive Distributed Computing

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

Introduction To Graphs and Networks. Fall 2013 Carola Wenk

A New Parallel Algorithm for Connected Components in Dynamic Graphs. Robert McColl Oded Green David Bader

Reduce and Aggregate: Similarity Ranking in Multi-Categorical Bipartite Graphs

Data-Intensive Computing with MapReduce

Kartik Lakhotia, Rajgopal Kannan, Viktor Prasanna USENIX ATC 18

Experimental Analysis of Distributed Graph Systems

Resource and Performance Distribution Prediction for Large Scale Analytics Queries

modern database systems lecture 10 : large-scale graph processing

Social-Network Graphs

Graph Mining and Social Network Analysis

Analysis of Biological Networks. 1. Clustering 2. Random Walks 3. Finding paths

Graph Data Management

Data Analytics Framework and Methodology for WhatsApp Chats

University of Maryland. Tuesday, March 2, 2010

DCBench: a Data Center Benchmark Suite

Similarity Ranking in Large- Scale Bipartite Graphs

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011

Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets

FastForward I/O and Storage: ACG 5.8 Demonstration

Introduction to Data Mining and Data Analytics

TI2736-B Big Data Processing. Claudia Hauff

Graph Algorithms. Revised based on the slides by Ruoming Kent State

Understanding Networks through Physical Metaphors. Daniel A. Spielman

Distributed Graph Storage. Veronika Molnár, UZH

High-Performance Graph Primitives on the GPU: Design and Implementation of Gunrock

Big Data with Hadoop Ecosystem

Efficient and Scalable Friend Recommendations

Big Data Management and NoSQL Databases

Harp-DAAL for High Performance Big Data Computing

Graph Analytics in the Big Data Era

Big Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Big Data com Hadoop. VIII Sessão - SQL Bahia. Impala, Hive e Spark. Diógenes Pires 03/03/2018

Apache Giraph. for applications in Machine Learning & Recommendation Systems. Maria Novartis

PREGEL: A SYSTEM FOR LARGE- SCALE GRAPH PROCESSING

One Trillion Edges. Graph processing at Facebook scale

Compressed representations for web and social graphs. Cecilia Hernandez and Gonzalo Navarro Presented by Helen Xu 6.

MapReduce Patterns. MCSN - N. Tonellotto - Distributed Enabling Platforms

NUMA-aware Graph-structured Analytics

Antonio Fernández Anta

Exploring the Structure of Data at Scale. Rudy Agovic, PhD CEO & Chief Data Scientist at Reliancy January 16, 2019

Lecture 8, 4/22/2015. Scribed by Orren Karniol-Tambour, Hao Yi Ong, Swaroop Ramaswamy, and William Song.

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Graph Analytics and Machine Learning A Great Combination Mark Hornick

PegasusN. User guide. August 29, Overview General Information License... 1

Hierarchy of knowledge BIG DATA 9/7/2017. Architecture

Lecture 11: Graph algorithms! Claudia Hauff (Web Information Systems)!

Clustering in Data Mining

Label propagation with dams on large graphs using Apache Hadoop and Apache Spark

PEGASUS: A peta-scale graph mining system Implementation. and observations. U. Kang, C. E. Tsourakakis, C. Faloutsos

Databases 2 (VU) ( / )

Extracting Information from Complex Networks

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

On Smart Query Routing: For Distributed Graph Querying with Decoupled Storage

Dynamic Resource Allocation for Distributed Dataflows. Lauritz Thamsen Technische Universität Berlin

Replication on Affinity Propagation: Clustering by Passing Messages Between Data Points

Graph Theory for Network Science

Embedded Technosolutions

A Novel Parallel Hierarchical Community Detection Method for Large Networks

CS224W: Social and Information Network Analysis Jure Leskovec, Stanford University

Why do we need graph processing?

Lecture Map-Reduce. Algorithms. By Marina Barsky Winter 2017, University of Toronto

Matrix-Vector Multiplication by MapReduce. From Rajaraman / Ullman- Ch.2 Part 1

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments

Social Network Analysis

Structure of Social Networks

An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization

Naiad. a timely dataflow model

Graph Data Processing with MapReduce

Summary: What We Have Learned So Far

9/29/13. Outline Data mining tasks. Clustering algorithms. Applications of clustering in biology

Easier than Excel: Social Network Analysis of DocGraph with Gephi

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

KNOWLEDGE GRAPHS. Lecture 1: Introduction and Motivation. TU Dresden, 16th Oct Markus Krötzsch Knowledge-Based Systems

Lecture 1: Introduction and Motivation Markus Kr otzsch Knowledge-Based Systems

Graph-Parallel Problems. ML in the Context of Parallel Architectures

Supervised Random Walks

CSE 190 Lecture 16. Data Mining and Predictive Analytics. Small-world phenomena

Challenges in large-scale graph processing on HPC platforms and the Graph500 benchmark. by Nkemdirim Dockery

E6885 Network Science Lecture 10: Graph Database (II)

PULP: Fast and Simple Complex Network Partitioning

Social Network Mining An Introduction

LazyBase: Trading freshness and performance in a scalable database

Oracle NoSQL Database and Cisco- Collaboration that produces results. 1 Copyright 2011, Oracle and/or its affiliates. All rights reserved.

CS6200 Information Retreival. The WebGraph. July 13, 2015

Transcription:

G(B)enchmark GraphBench: Towards a Universal Graph Benchmark Khaled Ammar M. Tamer Özsu

Bioinformatics Software Engineering Social Network Gene Co-expression Protein Structure Program Flow

Big Graphs o Twitter breaks 400 million tweets per day o 2.4 billion internet users as of June-2012 o 200+ million new Facebook users annually since 2009 o 40+ billion web pages indexed by Google o 300 million new websites in 2011 o RedHat 7.1 has 30+ million line of code (60% more than Redhat 6.2)

Graph Processing Systems

Problems! Systems report their performance with comparison to Hadoop MapReduce only! Systems usually use PageRank algorithm only!

A solution A Benchmark with the following conditions: Represent all query types Generate graphs with real-life properties Easy to implement by existing systems

Workload Online Queries Update Queries Iterative Queries

Workload - Online Queries Find matching nodes/edges Find k-hop neighbours Reachability and shortest path query Pattern matching query

Workload - Update Queries Insert / Update / Delete an edge or a node The Caveat: Update queries might seem straightforward but in reality they are challenging because they may require an update to a precomputed answer or an index.

Workload - Iterative Queries These are Data Mining Algorithms Examples include PageRank, Clustering etc The main theme: "iterate on graph data until convergence".

Data Generator Based on RTG generator which extended Millar s work in power-law distribution generation.

Data Generation: Real-life Graph properties Very Large Exponential growth in size Sparse Small diameters Community structure One large connected component Power Law degree distribution

PowerLaw generation Random process to generate PowerLaw distribution: A Monkey press random keys on a keyboard One character is selected to separate words The distribution of the generated words is PowerLaw.

Data Generator : RTG 1. Initiate a symmetric probability matrix for all words characters and the escape character ϕ. 2. Characters in the main row makes the source vertex. A B C ϕ A B C ϕ 3. Characters in the main column makes the destination vertex.

Data Generator : RTG RTG parameters: Number of characters B Probability of ϕ (escape char) C Structure factor (0-1) Graph Weight (# edges) ϕ Bipartite flag Number of time stamps (Dynamic graphs) A A B C ϕ

Data Generator : RTG RTG parameters: Number of characters B Probability of ϕ (escape char) C Structure factor (0-1) Graph Weight (# edges) Bi-Partite flag ϕ Number of time stamps (Dynamic graphs) A A B C ϕ It is challenging to fit these parameters!

Data Generator Caveat: Some nodes and edges are going to be identical. N = Number of unique nodes k = Number of characters q = Probability of ϕ (q = 1 k * p) p = Probability of a character W = Graph Weight (Total number of edges) E = Number of unique edges

Data Generator N = Number of unique nodes k = Number of characters q = Probability of ϕ (q = 1 k * p) p = Probability of a character W = Graph Weight (Total number of edges) E = Number of unique edges Graph Density :

Data Generator N = Number of unique nodes k = Number of characters q = Probability of ϕ (q = 1 k * p) p = Probability of a character W = Graph Weight (Total number of edges) E = Number of unique edges Graph Density D = Fn (E, N) = Fn (p, k, W, q) = Fn (q, k, W)

Data Generator Target: Generate graphs matching specific graph properties How: Replace two parameters (k, q) by other graph properties

Data Generator Target: Generate graphs with properties How: Replace two parameters (k, q) by graph properties Observation: for a reasonable k, number of characters in the language does not impact graph properties.

Data Generator Observation: for a reasonable k, number of characters in the language does not impact graph properties. Density 1.0000000 0.5000000 0.2500000 0.1250000 0.0625000 0.0312500 0.0156250 0.0078125 0.0039063 0.0019531 0.0009766 0.0004883 0.0002441 0.0001221 0.0000610 0.0000305 1 3 4 5 6 7 8 9 10 0.0000485 0.5482760 Graph Density D = Fn (q, k, W) = Fn (q, W) 0.0000307 0.0000307

Data Generator Given a reasonable k (=10), Find q D = Fn (q, W) q = Fn (D, W) This is a Complex and approximate equation

Data Generator Find q using supervised learning techniques. We generated a large number of graphs with different settings and used a Gaussian model to predict q. Based on our experiments, the average prediction error was 10%

Conclusion Graphs are emerging in many applications Graph processing systems need a benchmark Benchmark should be complete, efficient, and usable. Genchmark has graph queries that represents all workload types. Genchmark has a usable real-graph generator

What is next? Build an extensive quantitative comparison between existing Graph systems Integrate our graph generators with business driven data generators like ones in TPC-H or BigBench.

Thanks

Existing BenchMarks HPC Scalable Graph Analysis Benchmark Graph Traversal Benchmark Graph500 BigBench RDF Benchmarks