Distributed Graph Storage. Veronika Molnár, UZH


Overview
- Graphs and Social Networks
- Criteria for Graph Processing Systems
- Current Systems
  - Storage
  - Computation
  - Large scale systems
- Comparison / Best systems
- Questions

Graphs and Social Networks 1
- Graph = collection of nodes + edges connecting nodes to each other
- Social Network = collection of individuals and social relations
- A Social Network is also a Graph! (node = person, edge = relation)
[Figure: Social Network graph (image source: thenextweb.com)]
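The node/edge definition above can be sketched in a few lines of Python. This is only an illustration: the adjacency-set representation, the function name `build_graph`, and the toy names are assumptions, not something taken from the slides.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected graph as {person: set of connected persons}."""
    adjacency = defaultdict(set)
    for a, b in edges:
        # A social relation is stored in both directions.
        adjacency[a].add(b)
        adjacency[b].add(a)
    return dict(adjacency)


# Hypothetical toy network: node = person, edge = relation.
friendships = [("ann", "bob"), ("bob", "cat"), ("cat", "dan"), ("ann", "cat")]
graph = build_graph(friendships)
```

Here `graph["cat"]` holds every person directly related to "cat".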

Graphs and Social Networks 2
Social Network graph properties (SNA = Social Network Analysis)
- Limited number of connections at each node (person), e.g. Facebook: max 5000
- Distribution not uniform
  - Most people: an average number of connections
  - But: a few people have a lot of connections (power law distribution)
- Small degree of separation = Small World (length of shortest paths)
- Centrality
- Constantly changing, but very large graph! (7 billion people = 7 billion nodes)
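Two of the properties above, the uneven degree distribution and the degree of separation, can be measured on a small example. The sketch below assumes an undirected graph stored as a dict of neighbour sets; the function names and the toy "hub" network are hypothetical.

```python
from collections import Counter, deque


def degree_histogram(graph):
    """Map each degree to the number of nodes that have it."""
    return Counter(len(neighbours) for neighbours in graph.values())


def separation(graph, source, target):
    """Degree of separation: shortest path length found by breadth-first search."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None  # target is not reachable from source


# Hypothetical toy network: one well-connected hub, several low-degree nodes.
toy = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"}, "b": {"hub"}, "c": {"hub"},
    "d": {"hub", "e"}, "e": {"d"},
}
histogram = degree_histogram(toy)
hops = separation(toy, "a", "e")
```

On real social networks the histogram is heavily skewed towards small degrees, and `separation` stays small even for distant-looking pairs.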

Graphs and Social Networks 3
- Shortest Path
- Centrality: Betweenness, Closeness, PageRank, Degree
[Figure: example graph with nodes labelled VM and BP]
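Two of the centrality measures named above, degree and closeness, are simple enough to sketch directly. The code assumes an undirected, connected graph stored as a dict of neighbour sets, and uses the common normalisations (degree divided by n - 1; number of reachable nodes divided by the sum of shortest-path distances). The function names are illustrative.

```python
from collections import deque


def bfs_distances(graph, source):
    """Shortest-path length (in hops) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def degree_centrality(graph):
    """Fraction of the other nodes each node is directly connected to."""
    n = len(graph)
    return {v: len(neighbours) / (n - 1) for v, neighbours in graph.items()}


def closeness_centrality(graph):
    """Reachable nodes divided by the total distance to them."""
    result = {}
    for v in graph:
        dist = bfs_distances(graph, v)
        total = sum(dist.values())
        result[v] = (len(dist) - 1) / total if total else 0.0
    return result


# Hypothetical path graph a - b - c: the middle node is the most central.
path = {"a": {"b"}, "b": {"a", "c"}, "c": {"b"}}
degree = degree_centrality(path)
closeness = closeness_centrality(path)
```

Betweenness and PageRank need more machinery; libraries such as igraph ship optimised versions of all four.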

Graphs and Social Networks 4
A Social Network can be
- Facebook
- Emails
- Mailing lists
- Academic networks

Criteria for Graph Processing Systems 1
Modes:
- Distributed processing
- Research and industry use
- Interactive and non-interactive modes
- Storage of static and dynamic information
[Figure: Email connectivity graph (image source: research.microsoft.com)]

Criteria for Graph Processing Systems 2
Properties:
- Scalability (social networks are large!)
- Speed
Features:
- SNA (Social Network Analysis) metrics: PageRank, Centrality, Shortest paths, ...
- Extensibility
[Figure: Email connectivity graph (image source: research.microsoft.com)]

Current Systems 1
Storage:
- Apache Hive (and Hadoop)
- Titan Graph Database
- Neo4j

Current Systems Storage 2
Apache Hive (and Hadoop)
- Hadoop: Map/Reduce architecture
- Hive: high-level operations on large data sets
  - HiveQL (similar to SQL)
  - Converted to MapReduce jobs
- Not graph-specific
- Supports custom data formats
- Can be used as a backend for other systems
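To give a feel for what "converted to MapReduce jobs" means, the sketch below runs the map, shuffle, and reduce phases of an out-degree count in plain Python. It is not Hive or Hadoop code; the edge-list layout and all function names are assumptions made for illustration. A grouped count over a hypothetical edge table is the kind of query that would map onto such a job.

```python
from collections import defaultdict


def map_phase(edge):
    """Map: emit (source vertex, 1) for every edge record."""
    src, _dst = edge
    yield (src, 1)


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: sum the ones to get the out-degree of one vertex."""
    return key, sum(values)


def out_degree(edges):
    mapped = (pair for edge in edges for pair in map_phase(edge))
    return dict(reduce_phase(k, vs) for k, vs in shuffle(mapped).items())


# Hypothetical edge list (source, destination).
edges = [("a", "b"), ("a", "c"), ("b", "c")]
degrees = out_degree(edges)
```

Because nothing here is graph-specific, every graph metric has to be phrased as one or more such jobs, which is the trade-off noted on the slide.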

Current Systems Storage 3
Titan Graph Database
- Store and query large graphs
- Graph schemas
- Gremlin query language
  - edge and vertex labels
  - transactional query model
  - high-level operations
- Two backends: Cassandra and HBase

Current Systems Storage 4
Neo4j
- Cost: 12K for startups (more for large companies), free for personal use
- Graph Database Management
- ACID compliant (Atomicity, Consistency, Isolation, Durability)
- Graphs are stored as Edges, Nodes, Attributes
- Focus on finding and querying data
- Graph analytics with igraph or GraphX
- Community!
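The "Edges, Nodes, Attributes" storage model can be pictured with a small in-memory structure. This is a rough sketch of the general property-graph idea, not Neo4j's storage format or API; every class, method, and field name below is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    node_id: int
    labels: set = field(default_factory=set)
    properties: dict = field(default_factory=dict)   # the node's attributes


@dataclass
class Edge:
    source: int
    target: int
    rel_type: str
    properties: dict = field(default_factory=dict)   # the edge's attributes


class PropertyGraph:
    """Toy container holding nodes and directed, typed edges."""

    def __init__(self):
        self.nodes = {}
        self.edges = []

    def add_node(self, node_id, labels=(), **properties):
        self.nodes[node_id] = Node(node_id, set(labels), properties)

    def add_edge(self, source, target, rel_type, **properties):
        self.edges.append(Edge(source, target, rel_type, properties))

    def neighbours(self, node_id, rel_type):
        """Nodes reached from node_id over edges of the given type."""
        return [self.nodes[e.target] for e in self.edges
                if e.source == node_id and e.rel_type == rel_type]


# Hypothetical data: two people and one relation with an attribute.
g = PropertyGraph()
g.add_node(1, labels=["Person"], name="Ann")
g.add_node(2, labels=["Person"], name="Bob")
g.add_edge(1, 2, "KNOWS", since=2014)
known = [n.properties["name"] for n in g.neighbours(1, "KNOWS")]
```

A real graph database adds indexing, transactions, and a query language on top of this model.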

[Figure: Neo4j]

Current Systems 5
Computation:
- igraph
- Spark GraphX
- GraphLab

Current Systems Computation 6
igraph
- Network analysis / network research
- Portable and efficient
- Python, R, C, C++
- Built-in, optimized SNA metrics (centrality, diameter, connected components)
- Standalone or Grid
- Extensible, 3-layer API

Current Systems Computation 7
Spark GraphX
- Graphs and parallel graph computations
- User-defined parallel operations
- Stored in-memory for faster processing
- Very good end-to-end performance
- Graphs are immutable; all operations create a new graph
- Prebuilt graph algorithms, e.g. PageRank
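PageRank, the prebuilt algorithm mentioned above, can be sketched as a power iteration in plain Python. This is not the GraphX API; the damping factor of 0.85, the fixed iteration count, and the even redistribution of rank from vertices without outgoing links are common choices assumed here. Each iteration builds a fresh rank table rather than updating in place, loosely echoing the immutable style.

```python
def pagerank(out_links, damping=0.85, iterations=50):
    """out_links maps every vertex to the list of vertices it links to."""
    n = len(out_links)
    ranks = {v: 1.0 / n for v in out_links}
    for _ in range(iterations):
        # Rank held by vertices with no outgoing links is shared by everyone.
        dangling = sum(ranks[v] for v, outs in out_links.items() if not outs)
        new_ranks = {v: (1 - damping) / n + damping * dangling / n
                     for v in out_links}
        for v, outs in out_links.items():
            for w in outs:
                # Each vertex splits its rank evenly among its out-links.
                new_ranks[w] += damping * ranks[v] / len(outs)
        ranks = new_ranks  # the previous table is discarded, not mutated
    return ranks


# Hypothetical link graph: both a and b point at c.
links = {"a": ["c"], "b": ["c"], "c": []}
ranks = pagerank(links)
```

The ranks always sum to 1, and the vertex with the most incoming links ("c") ends up with the highest score.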

Current Systems Computation 8
GraphLab
- Cost: $4,000/machine/year, or free 1-year student subscription
- Graph computations: processing & analytics
- Visualization (GraphLab Canvas)
- Machine learning
- Common graph algorithms + API

[Figure: GraphLab]

Current Systems 9
Used by Facebook/Google:
- Pregel/Pregelix
- Apache Giraph

Current Systems Large Scale 10
Pregel/Pregelix
- Pregel: Google-only, Pregelix: open-source
- BSP (bulk synchronous parallel) model
- Extremely large graphs
- User-defined edge, vertex, message types
- Supersteps
- In-memory/out-of-core operation models
- Vertex-based API, libraries with graph algorithms
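The superstep idea behind the BSP model can be simulated on a single machine. The sketch below computes single-source shortest paths in a vertex-centric style: in each superstep a vertex reads its incoming messages, updates its value, and sends messages that are only delivered in the next superstep. It is a simplified illustration, not the Pregel or Pregelix API; the function name and data layout are assumptions, and edge weights are assumed to be non-negative.

```python
import math


def bsp_shortest_paths(edges, source):
    """edges maps every vertex to a list of (neighbour, weight) pairs."""
    distance = {v: math.inf for v in edges}
    inbox = {source: [0]}          # messages waiting for the first superstep
    supersteps = 0
    while inbox:                   # halt once no messages are in flight
        outbox = {}
        # Conceptually every vertex with messages runs in parallel here.
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:
                distance[vertex] = best
                for neighbour, weight in edges[vertex]:
                    outbox.setdefault(neighbour, []).append(best + weight)
        inbox = outbox             # barrier: delivery happens next superstep
        supersteps += 1
    return distance, supersteps


# Hypothetical weighted graph.
roads = {
    "a": [("b", 1), ("c", 4)],
    "b": [("c", 2)],
    "c": [("d", 1)],
    "d": [],
}
dist, steps = bsp_shortest_paths(roads, "a")
```

Vertices that receive no messages stay idle, so the work per superstep shrinks as the distances settle.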

Current Systems Large Scale 11
Apache Giraph
- BSP model
- Graph-wide metrics via global operations
- Built on Hadoop, 526 times faster than Hive
- Highly parallel, keeps all data in memory
- Scales linearly with number of edges, can make efficient use of large clusters
- Used for PageRank, popularity rank, shortest paths
- No built-in graph metrics

Comparison

System      Focus                    Scalability  SNA  Extensibility    Used for
Hive        parallel computations    any size     no   Java             generic
Titan       storage                  ~100 B       no   Python, Java     graph queries
Neo4j       transactional DB         ~1 B         yes  Java, Python, R  recommender systems
igraph      efficiency, portability  ~1 M         yes  R, Python, C++   research
GraphX      parallel computations    ~1 B         yes  Java, Python, R  graph processing
GraphLab    processing, analytics    ~1 B         yes  C++              recommender systems
Giraph      large scale, BSP         any size     no   Java, Python     Facebook
Pregel(ix)  large scale, BSP         any size     yes  Java             Google

Which is the best?
Depends on the network and intended use.
- Very large Social Networks: high-performance, customizable systems, such as Pregelix
- Research: igraph and GraphX support R and Python integration
- Analysis and Visualisation of Social Networks:
  - GraphLab with built-in interactive analysis and plotting features
  - Neo4j has vast amounts of community resources for these tasks
- Custom use cases: existing systems might not support these. Instead: use Hadoop/Hive and write the rest yourself!

Thank You! Aaaaaand stay for some questions

Questions 1
- Why do we analyse social data?
- What are the possible uses of analysing social data?

Questions 2
- Can visualisation help to understand graphs? (connections can be viewed, a subset of the graph can be analysed, ...)

Questions 3
- Have you ever used such a system? Which one?

Questions 4
- What are the advantages and disadvantages of distributed graph processing?
- What is the value of graph processing?

Questions 5
- How can social metric calculations deal with fake accounts?

The End...