Graph Analytics in the Big Data Era

Similar documents
Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah

Graph Data Management

Distributed Graph Storage. Veronika Molnár, UZH

One Trillion Edges. Graph processing at Facebook scale

A Fast and High Throughput SQL Query System for Big Data

Managing and Mining Billion Node Graphs. Haixun Wang Microsoft Research Asia

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Sempala. Interactive SPARQL Query Processing on Hadoop

CISC 7610 Lecture 4 Approaches to multimedia databases. Topics: Document databases Graph databases Metadata Column databases

Epilog: Further Topics

Sublinear Models for Streaming and/or Distributed Data

G(B)enchmark GraphBench: Towards a Universal Graph Benchmark. Khaled Ammar M. Tamer Özsu

Analyzing Flight Data

FedX: A Federation Layer for Distributed Query Processing on Linked Open Data

Introduction to Graph Data Management

An overview of RDB2RDF techniques and tools

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

Unifying Big Data Workloads in Apache Spark

Navigating the Maze of Graph Analytics Frameworks using Massive Graph Datasets

KNOWLEDGE GRAPHS. Lecture 1: Introduction and Motivation. TU Dresden, 16th Oct Markus Krötzsch Knowledge-Based Systems

modern database systems lecture 10 : large-scale graph processing

RDFPath. Path Query Processing on Large RDF Graphs with MapReduce. 29 May 2011

Data Warehousing and Decision Support (mostly using Relational Databases) CS634 Class 20

Big Data Management and NoSQL Databases

E6885 Network Science Lecture 10: Graph Database (II)

Lecture 1: Introduction and Motivation Markus Kr otzsch Knowledge-Based Systems

MANAGING DATA(BASES) USING SQL (NON-PROCEDURAL SQL, X401.9)

Information Workbench

Big Data Hadoop Stack

Apache Giraph: Facebook-scale graph processing infrastructure. 3/31/2014 Avery Ching, Facebook GDM

Compare Two Identical Tables Data In Different Oracle Databases

Challenges for Data Driven Systems

Database Instance And Relational Schema Design A Fact Oriented Approach

COMP9321 Web Application Engineering

An Efficient Approach to Triple Search and Join of HDT Processing Using GPU

Introduction to NoSQL Databases

SociaLite: A Datalog-based Language for

TriAD: A Distributed Shared-Nothing RDF Engine based on Asynchronous Message Passing

Apache Giraph. for applications in Machine Learning & Recommendation Systems. Maria Novartis

Course Modules for MCSA: SQL Server 2016 Database Development Training & Certification Course:

CSE 190D Spring 2017 Final Exam Answers

Introduction to Data Management. Lecture #1 (Course Trailer )

Porting Social Media Contributions with SIOC

CISC 7610 Lecture 4 Approaches to multimedia databases. Topics: Graph databases Neo4j syntax and examples Document databases

Accelerating BI on Hadoop: Full-Scan, Cubes or Indexes?

CSE 344 Final Review. August 16 th

Big Data Analytics. Rasoul Karimi

Event Stores (I) [Source: DB-Engines.com, accessed on August 28, 2016]

New Developments in Spark

A Glimpse of the Hadoop Echosystem

Graph Databases. Graph Databases. May 2015 Alberto Abelló & Oscar Romero

CS 5220: Parallel Graph Algorithms. David Bindel

C-SPARQL: A Continuous Extension of SPARQL Marco Balduini

Programming the Semantic Web

An Entity Based RDF Indexing Schema Using Hadoop And HBase

Introduction to Data Management. Lecture #1 (Course Trailer )

Efficient and Scalable Friend Recommendations

Impala. A Modern, Open Source SQL Engine for Hadoop. Yogesh Chockalingam

Elementary IR: Scalable Boolean Text Search. (Compare with R & G )

Distributed Databases: SQL vs NoSQL

CSE 344 MAY 7 TH EXAM REVIEW

Large Scale Graph Solutions: Use-cases And Lessons Learnt

Databases 2 (VU) ( / )

NoSQL systems: introduction and data models. Riccardo Torlone Università Roma Tre

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Research Works to Cope with Big Data Volume and Variety. Jiaheng Lu University of Helsinki, Finland

Shortest paths on large graphs: Systems, Algorithms, Applications

SAP IQ - Business Intelligence and vertical data processing with 8 GB RAM or less

An Introduction to Big Data Formats

Antonio Fernández Anta

Graph Algorithms using Map-Reduce. Graphs are ubiquitous in modern society. Some examples: The hyperlink structure of the web

Introduction to Hive Cloudera, Inc.

Big Data Analytics: What is Big Data? Stony Brook University CSE545, Fall 2016 the inaugural edition

CISC 7610 Lecture 2b The beginnings of NoSQL

Querying Data with Transact SQL

2.3 Algorithms Using Map-Reduce

Large-Scale Graph Processing 1: Pregel & Apache Hama Shiow-yang Wu ( 吳秀陽 ) CSIE, NDHU, Taiwan, ROC

We focus on the backend semantic web database architecture and offer support and other services around that.

Linked Data in the Clouds : a Sindice.com perspective

Introduction to Graph Databases

Prof. Dr. Christian Bizer

Part 1: Indexes for Big Data

SPAR-Key: Processing SPARQL-Fulltext Queries to Solve Jeopardy! Clues

REVIEW ON BIG DATA ANALYTICS AND HADOOP FRAMEWORK

Massive Online Analysis - Storm,Spark

7. Query Processing and Optimization

Graph Analytics using Vertica Relational Database

Map Reduce. Yerevan.

1 Copyright 2011, Oracle and/or its affiliates. All rights reserved.

DIVING IN: INSIDE THE DATA CENTER

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

From Single Purpose to Multi Purpose Data Lakes. Thomas Niewel Technical Sales Director DACH Denodo Technologies March, 2019

Goal of the presentation is to give an introduction of NoSQL databases, why they are there.

Understanding Billions of Triples with Usage Summaries

Man vs. Machine Dierences in SPARQL Queries

Big Data Technology Ecosystem. Mark Burnette Pentaho Director Sales Engineering, Hitachi Vantara

MapReduce review. Spark and distributed data processing. Who am I? Today s Talk. Reynold Xin

Database Systems CSE 414

RAMCloud. Scalable High-Performance Storage Entirely in DRAM. by John Ousterhout et al. Stanford University. presented by Slavik Derevyanko

NOSQL Databases and Neo4j

Transcription:

Graph Analytics in the Big Data Era Yongming Luo, dr. George H.L. Fletcher Web Engineering Group

What is really hot? 19-11-2013 PAGE 1

An old/new data model graph data Model entities and relations between entities Trending application space Social network analysis (facebook, linkedin, ) Bioinformatics (protein networks) Recommendation (web graph, netflix, ) Semantic Web (RDF data, Google knowledge graph) / mathematics and computer science 19-11-2013 PAGE 2

RDF as a data model for graph data RDF Model (Resource Description Framework) Describe things (resources) on the web Part of the linked data vision, connect information on the web Distributed way of managing things Schema-less feature We will skip the strict format description for now / mathematics and computer science 19-11-2013 PAGE 3

RDF Graph Model and Format Directed graph Triple format (subject, predicate, object) Subjects and objects are nodes/things in graph Predicates are edges (node, edge, node) format, nothing special No distinction between data and metadata Predicates can also be resources / mathematics and computer science 19-11-2013 PAGE 4

RDF Graph Example TU/e wine beer like graduatefrom graduatefrom like Alice isfriendof Bob / mathematics and computer science 19-11-2013 PAGE 5

Example of Triples TU/e wine beer like graduatefrom graduatefrom like Alice isfriendof Bob <Alice> <isfriendof> <Bob> <Alice> <like> <wine> <Alice> <graduatefrom> <TU/e> <Bob> <like> <beer> <Bob> <graduatefrom> <TU/e> / mathematics and computer science 19-11-2013 PAGE 6

Example of Triples, expanded TU/e wine beer like graduatefrom graduatefrom like Alice isfriendof Bob <Alice> <isfriendof> <Bob> <Alice> <like> <wine> <Alice> <graduatefrom> <TU/e> <Bob> <like> <beer> <Bob> <graduatefrom> <TU/e> <like> <isa> <feeling> <isfriendof> <isa> <relation> / mathematics and computer science 19-11-2013 PAGE 7

Data Explosion in the (RDF) world How much data are we talking about? Simply type filetype:rdf in google, 28M documents found Billion Triple Challenge at ISWC Easily reach a Trillion triples in commercial systems More data join the open data project to connect datasets, and the famous graph (next page) / mathematics and computer science 19-11-2013 PAGE 8

http://richard.cyganiak.de/2007/10/lod/ / mathematics and computer science 19-11-2013 PAGE 9

Facebook graph / mathematics and computer science 19-11-2013 PAGE 10

Graph analytics What to do with graphs? Pattern matching graph query languages Graph algorithms Classical algorithms PageRank-style algorithms / mathematics and computer science 19-11-2013 PAGE 11

Pattern Matching - SPARQL Query for RDF The query language for RDF data Similar to SQL for relational database Similar grammar, select, where, filter clauses and more Example select?a?b where {?a <isfriendof>?b.?a <graduatefrom>?c.?b <graduatefrom>?c. } / mathematics and computer science 19-11-2013 PAGE 12

SPARQL Example select?a?b where {?a <isfriendof>?b.?a <graduatefrom>?c.?b <graduatefrom>?c. } wine like Alice TU/e graduatefrom graduatefrom isfriendof Bob like beer Two people who are friends and graduate from the same university / mathematics and computer science 19-11-2013 PAGE 13

SPARQL Query can be Complex SELECT?gn?fn WHERE {?gn <givennameof>?p.?fn <familynameof>?p.?p <type> "scientist"; <borninlocation>?city; <hasdoctoraladvisor>?a.?a <borninlocation>?city2.?city <locatedin> "Switzerland".?city2 <locatedin> "Germany". } rdf3x / Yago / A1.sparql / mathematics and computer science 19-11-2013 PAGE 14

A few more words on query language Essentially ad-hoc graph algorithm execution Pattern matching as the backbone, with possibly many features added E.g., regular expression, keyword search, aggregation Some special cases get special treatment Triangle counting community detection, graph measurement / mathematics and computer science 19-11-2013 PAGE 15

Graph Algorithms Breadth First Search Single Source Shortest Path Bipartite Matching PageRank, SimRank and more So called diffusion based techniques / mathematics and computer science 19-11-2013 PAGE 16

Where are we Graph model Operations on graph Pattern matching (queries) -- indexes Algorithms -- platforms 19-11-2013 PAGE 17

Indexes to accelerate query processing Value-based indexes Relational approaches Document-oriented approaches Structural indexes Seeqr Frequent patterns, Storing and Indexing Massive RDF Datasets. Yongming Luo, Francois Picalausa, George H. L. Fletcher, Jan Hidders and Stijn Vansummeren. In: De Virgilio, R., et al. (eds.) Semantic Search over the Web, Data-Centric Systems and Applications, pp. 31 60. Springer, Heidelberg (2012). / mathematics and computer science 19-11-2013 PAGE 18

Relational approaches, RDF as an example Treat triples as a three-column table Relational DB, row store, column store Subject Predicate Object <Alice> <isfriendof> <Bob> <Alice> <like> <wine> <Alice> <graduatefrom> <TU/e> <Bob> <like> <beer> <Bob> <graduatefrom> <TU/e> One entity one row / mathematics and computer science 19-11-2013 PAGE 19

Relational Approaches Query Processing For relational DB SPARQL -> SQL select?a?b where {?a <isfriendof>?b.?a <graduatefrom>?c.?b <graduatefrom>?c. } select T1.subject,T2.subject from example T1 inner join example T2 inner join example T3 on T1.object = T2.object AND T1.subject = T3.subject AND T2.subject = T3.object where T1.predicate = 'graduatefrom' AND T2.predicate = 'graduatefrom' AND T3.predicate = 'isfriendof'; / mathematics and computer science 19-11-2013 PAGE 20

Execution Plan Projection Join T1.s = T3.s AND T2.s=T3.o Join T1.o = T2.o T3 T1 Filter / mathematics and computer science T2 Filter Filter select T1.subject,T2.subject from example T1 inner join example T2 inner join example T3 on T1.object=T2.object AND T1.subject=T3.subject AND T2.subject=T3.object where T1.predicate='graduateFrom' AND T2.predicate='graduateFrom' AND T3.predicate='isFriendOf'; 19-11-2013 PAGE 21

Relational DB is not enough Problems Reduce data size -> ID mapping String ID <Alice> 1 <Bob> 2 <like> 3 How to build indexes? Too many self-joins Several alternative SQLs, which one to choose? / mathematics and computer science 19-11-2013 PAGE 22

Native Approaches - RDF-3X B-Tree index Index several (if not all) permutations of triples (s, p, o) (p, s, o) (o, s, p) (s, o, p) prefer merge join rather than hash join Predicate Subject Object select?a?b where {?a <isfriendof>?b.?a <graduatefrom>?c.?b <graduatefrom>?c. } <isfriendof> <Alice> <Bob> <like> <Alice> <wine> <like> <Bob> <beer> <graduatefrom> <Alice> <TU/e> <graduatefrom> <Bob> <TU/e> / mathematics and computer science 19-11-2013 PAGE 23

One possible plan from RDF-3X Filter (v4=v6) HashJoin (v1=v5) MergeJoin (v2=v3) IndexScan (v5, graduatefrom, v6) IndexScan (v1, isfriendof, v2) IndexScan (v3, graduatefrom, v4) / mathematics and computer science 19-11-2013 PAGE 24

More techniques Query optimization Statistics, query history, cache, all information helps Compression Standard compression techniques Works better for column store Update Save changes first, merge them later / mathematics and computer science 19-11-2013 PAGE 25

One Entity One Row In the spirit of E-R model, or adjacency list of graphs Reduce self-joins Entity isfriendof like graduatefrom <Alice> <Bob> <wine> <TU/e> <Bob> <beer> <TU/e> <wine> <beer> <TU/e> / mathematics and computer science 19-11-2013 PAGE 26

One Entity One Row - Cons Null values Multi-value property Too many properties Schema required Schema update Entity isfriendof like graduatefrom <Alice> <Bob> <wine> <TU/e> <Bob> <beer> <TU/e> <wine> <beer> <TU/e> / mathematics and computer science 19-11-2013 PAGE 27

Structural Index - Motivation Value indexes are good, but there is more regularity/structure we can get from data Query/index mismatch Looking for structures The structures inside data are ignored by index / mathematics and computer science 19-11-2013 PAGE 28

Structural Index Preserve structures in index Give the queries what they are looking for, not less, not more Looking for structures Indexing on structures / mathematics and computer science 19-11-2013 PAGE 29

It will be better if we have something like wine TU/e beer TU/e Beer wine like graduatefrom graduatefrom like graduatefrom like Alice isfriendof Bob Alice Bob isfriendof / mathematics and computer science 19-11-2013 PAGE 30

Structural Index - Example select?a?b where {?a <isfriendof>?b.?a <graduatefrom>?c.?b <graduatefrom>?c. } If two triples share a subject Draw an SS edge between them Same for SO, OS, SP, PS,?a <isfriendof>?b SS?a <graduatefrom>?c PP,OO OS?b <graduatefrom>?c / mathematics and computer science 19-11-2013 PAGE 31

Ideal case?a <isfriendof>?b SS?a <graduatefrom>?c PP,OO OS?b <graduatefrom>?c SS partition1 OS partition2 PP,OO partition3 / mathematics and computer science 19-11-2013 PAGE 32

Workflow 3, query real graph 2, return related partition blocks 1, query abstract graph 0, construct index A Structural Approach to Indexing Triples. Francois Picalausa, Yongming Luo, George H. L. Fletcher, Jan Hidders and Stijn Vansummeren. ESWC 2012, Heraklion, GR. LNCS 7295, pp. 406 421, 2012, Springer-Verlag Berlin Heidelberg. / mathematics and computer science 19-11-2013 PAGE 33

How to build partition blocks (structural index)? E.g., according to k-bisimulation, denoted k x k y when the following holds: x 0 y if the node labels of x and y are the same If x x, then there is some y y, such that x k-1 y If y y, then there is some x x, such that x k-1 y / mathematics and computer science 19-11-2013 PAGE 34

Algorithm for k-bisimulation computation Create a signature for each node in the graph Nodes are partitioned by their signature values / mathematics and computer science 19-11-2013 PAGE 35

Where are we Graph model Operations on graph Pattern matching (queries) -- indexes Algorithms -- platforms 19-11-2013 PAGE 36

Platforms for (graph) algorithms Single machine, out-of-core I/O-efficient algorithms GraphChi Shared-nothing architecture MapReduce (Hadoop) Pregel/GraphLab/Giraph More to come and play / mathematics and computer science 19-11-2013 PAGE 37

Project Ideas 1. Graph Generator 2. Graph Consumer 19-11-2013 PAGE 38

Project Idea: Bisimulation-friendly Big Graph Generator In recent research, we see that power-law distribution in bisimulation results are not preserved in current synthetic graph generators. We want to change that. The task includes: 0. Pick a distributed programming framework, map-reduce (or other distributed framework as you like, spark, hyracks), get comfortable with the programming environment. 1. Design an algorithm that generates big graphs (billions of edges) that are 1.1. power-law 1.2. bisimulation friendly *1.3 other properties, such as small diameter 2. Test, compare with other approaches, both in efficiency and in quality (e.g., socialbench, graph500) / name of department 19-11-2013 PAGE 39

Project Idea: External-Memory Giraph Giraph is an Apache copy of Pregel, a BSP-like computation framework for distributed environment. A very new and hot platform to play with. It is proved that BSP algorithms can be simulated in an external memory environment in an efficient way. In this research, we want to use external memory environment as a backend for Giraph, enabling its efficiency on single machine. The task includes: 1. try out Giraph 2. write a few classical algorithms in Giraph 3. write the external memory backend, API-compatible 4. compare the result on medium to large graph datasets (~1billion edges) / name of department 19-11-2013 PAGE 40

Other related topics If you have some related ideas in mind, just come and talk to me. / name of department 19-11-2013 PAGE 41

Thank you! Q&A y.luo@tue.nl / mathematics and computer science 19-11-2013 PAGE 42