Distributed Graph Storage. Veronika Molnár, UZH

1 Distributed Graph Storage Veronika Molnár, UZH

2 Overview
- Graphs and Social Networks
- Criteria for Graph Processing Systems
- Current Systems
  - Storage
  - Computation
  - Large scale systems
- Comparison / Best systems
- Questions

3 Graphs and Social Networks (1)
- Graph = collection of nodes + edges connecting nodes to each other
- Social network = collection of individuals and social relations
- A social network is also a graph! (node = person, edge = relation)
- Figure: social network graph (image source: thenextweb.com)
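
The node/edge definition above can be sketched as an adjacency structure. This is a minimal illustration, not taken from the slides: the function name `build_graph` and the sample friendships are made up for the example.

```python
from collections import defaultdict


def build_graph(edges):
    """Build an undirected graph as {person: set of neighbours} from (person, person) pairs."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)  # an edge is a relation in both directions
        adjacency[b].add(a)
    return dict(adjacency)


# Hypothetical friendships, purely for illustration.
friendships = [("Ann", "Bob"), ("Bob", "Cem"), ("Cem", "Ann"), ("Cem", "Dan")]
graph = build_graph(friendships)
print(sorted(graph["Cem"]))  # neighbours of Cem
```

Sets are used for the neighbour lists so that a relation reported twice is stored only once.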

4 Graphs and Social Networks (2)
Social network graph properties (SNA = Social Network Analysis):
- Limited number of connections at each node (person), e.g. Facebook: max 5000
- Distribution of connections is not uniform
  - Most people have an average number of connections
  - But a few people have a lot of connections (power law distribution)
- Small degree of separation = small world (length of shortest paths)
- Centrality
- Constantly changing and very large graph! (7 billion people = 7 billion nodes)
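
Two of the properties above, the degree distribution and the degree of separation, can be computed on a small graph as follows. This is a sketch under assumptions: the toy network and the function names are hypothetical, and breadth-first search is one standard way to get shortest path lengths in an unweighted graph.

```python
from collections import Counter, deque


def degree_distribution(graph):
    """Map each degree to the number of nodes that have that degree."""
    return Counter(len(neighbours) for neighbours in graph.values())


def separation(graph, source, target):
    """Shortest path length between two nodes (breadth-first search), or None if unreachable."""
    if source == target:
        return 0
    seen = {source}
    queue = deque([(source, 0)])
    while queue:
        node, dist = queue.popleft()
        for neighbour in graph[node]:
            if neighbour == target:
                return dist + 1
            if neighbour not in seen:
                seen.add(neighbour)
                queue.append((neighbour, dist + 1))
    return None


# Hypothetical toy network: "hub" has many connections, everyone else has few.
toy = {
    "hub": {"a", "b", "c", "d"},
    "a": {"hub"}, "b": {"hub"}, "c": {"hub"},
    "d": {"hub", "e"}, "e": {"d"},
}
print(degree_distribution(toy))
print(separation(toy, "a", "e"))
```

Even in this tiny example most nodes have degree 1 while a single hub has degree 4, which is the skew the power law bullet refers to.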

5 Graphs and Social Networks (3)
- Shortest path
- Centrality: betweenness, closeness, PageRank, degree
- Figure node labels: VM, BP
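
Of the centrality measures listed above, PageRank is the easiest to sketch in a few lines. The version below is a plain power iteration under assumptions not stated in the slides: a damping factor of 0.85, a fixed number of iterations, and a made-up link graph in which every link target also appears as a key.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank on {node: list of nodes it links to}."""
    nodes = list(links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(iterations):
        # Rank held by nodes without out-links is spread evenly over all nodes.
        dangling = sum(rank[v] for v in nodes if not links[v])
        new_rank = {v: (1.0 - damping) / n + damping * dangling / n for v in nodes}
        for v in nodes:
            for target in links[v]:
                new_rank[target] += damping * rank[v] / len(links[v])
        rank = new_rank
    return rank


# Hypothetical link graph, purely for illustration.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
print(max(ranks, key=ranks.get))  # node with the highest rank
```

Degree centrality, by contrast, is just the neighbour count of each node, while betweenness and closeness both build on shortest paths.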

6 Graphs and Social Networks 4 Social Network can be Facebook s Mailing lists Academic networks 6

7 Criteria for Graph Processing Systems (1)
Modes:
- Distributed processing
- Research and industry use
- Interactive and non-interactive modes
- Storage of static and dynamic information
Figure: connectivity graph (image source: research.microsoft.com)

8 Criteria for Graph Processing Systems (2)
Properties:
- Scalability (social networks are large!)
- Speed
Features:
- SNA (Social Network Analysis) metrics: PageRank, centrality, shortest paths, ...
- Extensibility
Figure: connectivity graph (image source: research.microsoft.com)

9 Current Systems (1)
Storage:
- Apache Hive (and Hadoop)
- Titan Graph Database
- Neo4j

10 Current Systems: Storage (2)
Apache Hive (and Hadoop)
- Hadoop: Map/Reduce architecture
- Hive: high-level operations on large data sets
  - HiveQL (similar to SQL)
  - Queries are converted to MapReduce jobs
- Not graph-specific
- Supports custom data formats
- Can be used as a backend for other systems
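
To make the Map/Reduce idea concrete, the sketch below counts node degrees from an edge list in three phases: map, shuffle, reduce. It is a single-process imitation written for illustration only. It does not use the Hadoop or Hive APIs, and all function names are hypothetical.

```python
from collections import defaultdict


def map_phase(edge):
    """Emit a (node, 1) pair for each endpoint of an edge."""
    a, b = edge
    yield a, 1
    yield b, 1


def shuffle(pairs):
    """Group emitted values by key, as the framework would do between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Sum the counts collected for one node."""
    return key, sum(values)


def degree_count(edges):
    pairs = (pair for edge in edges for pair in map_phase(edge))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


print(degree_count([("a", "b"), ("b", "c")]))
```

Because each edge is mapped independently and each key is reduced independently, both phases can run in parallel on different machines, which is what makes the model attractive for large data sets even though it is not graph-specific.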

11 Current Systems: Storage (3)
Titan Graph Database
- Store and query large graphs
- Graph schemas
- Gremlin query language
- Edge and vertex labels
- Transactional query model
- High-level operations
- Two backends: Cassandra and HBase

12 Current Systems: Storage (4)
Neo4j
- Cost: 12K for startups (more for large companies), free for personal use
- Graph database management
- ACID compliant (Atomicity, Consistency, Isolation, Durability)
- Graphs are stored as edges, nodes and attributes
- Focus on finding and querying data
- Graph analytics with igraph or GraphX
- Community!

13 Neo4j

14 Current Systems (5)
Computation:
- igraph
- Spark GraphX
- GraphLab

15 Current Systems: Computation (6)
igraph
- Network analysis / network research
- Portable and efficient
- Python, R, C, C++
- Built-in, optimized SNA metrics (centrality, diameter, connected components)
- Standalone or grid
- Extensible, 3-layer API

16 Current Systems: Computation (7)
Spark GraphX
- Graphs and parallel graph computations
- User-defined parallel operations
- Data is stored in memory for faster processing
- Very good end-to-end performance
- Graphs are immutable; all operations create a new graph
- Prebuilt graph algorithms, e.g. PageRank
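
The immutability point above ("all operations create a new graph") can be illustrated with a small frozen data class. This is not the GraphX API: the `Graph` class and its methods are hypothetical and only show the design idea.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Graph:
    """Immutable graph: every operation returns a new Graph and leaves the old one untouched."""
    vertices: frozenset
    edges: frozenset  # set of (source, destination) tuples

    def add_edge(self, src, dst):
        return Graph(self.vertices | {src, dst}, self.edges | {(src, dst)})

    def subgraph(self, keep):
        """Keep only vertices for which keep(vertex) is true, and edges between them."""
        kept = frozenset(v for v in self.vertices if keep(v))
        kept_edges = frozenset((s, d) for s, d in self.edges if s in kept and d in kept)
        return Graph(kept, kept_edges)


g1 = Graph(frozenset({"a", "b"}), frozenset({("a", "b")}))
g2 = g1.add_edge("b", "c")        # g1 is unchanged
g3 = g2.subgraph(lambda v: v != "a")
print(len(g1.edges), len(g2.edges), len(g3.edges))
```

Since no graph is ever modified in place, earlier versions can be shared safely between parallel workers and recomputed after a failure.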

17 Current Systems: Computation (8)
GraphLab
- Cost: $4,000/machine/year, or free 1-year student subscription
- Graph computations: processing & analytics
- Visualization (GraphLab Canvas)
- Machine learning
- Common graph algorithms + API

18 GraphLab

19 Current Systems (9)
Used by Facebook/Google:
- Pregel/Pregelix
- Apache Giraph

20 Current Systems: Large Scale (10)
Pregel/Pregelix
- Pregel: Google-only; Pregelix: open-source
- BSP (bulk synchronous processing) model
- Extremely large graphs
- User-defined edge, vertex and message types
- Supersteps
- In-memory and out-of-core operation models
- Vertex-based API, libraries with graph algorithms
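
The BSP model with supersteps and vertex-to-vertex messages can be simulated in a single process. The sketch below computes single-source shortest paths in that style. It is an illustration of the idea only, not the Pregel or Pregelix API: the function name, the input format and the example graph are assumptions, and every vertex must appear as a key.

```python
import math


def bsp_shortest_paths(edges, source):
    """edges: {vertex: list of (neighbour, weight)}. Returns (distances, number of supersteps)."""
    distance = {v: math.inf for v in edges}
    inbox = {source: [0]}  # messages delivered at the start of a superstep
    supersteps = 0
    while inbox:  # the computation halts when no messages are in flight
        outbox = {}
        for vertex, messages in inbox.items():
            best = min(messages)
            if best < distance[vertex]:
                # The vertex improves its value and notifies its neighbours.
                distance[vertex] = best
                for neighbour, weight in edges[vertex]:
                    outbox.setdefault(neighbour, []).append(best + weight)
        inbox = outbox  # synchronisation barrier between supersteps
        supersteps += 1
    return distance, supersteps


# Hypothetical weighted graph, purely for illustration.
example = {"a": [("b", 1), ("c", 4)], "b": [("c", 1)], "c": [("d", 1)], "d": []}
distances, steps = bsp_shortest_paths(example, "a")
print(distances, steps)
```

Within a superstep each vertex only reads its own inbox and writes to the outbox, so vertices can be processed in parallel and partitioned across machines.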

21 Current Systems: Large Scale (11)
Apache Giraph
- BSP model
- Graph-wide metrics via global operations
- Built on Hadoop, 526 times faster than Hive
- Highly parallel, keeps all data in memory
- Scales linearly with number of edges, can make efficient use of large clusters
- Used for PageRank, popularity rank, shortest paths
- No built-in graph metrics

22 Comparison

System     | Focus                   | Scalability | SNA | Extensibility   | Used for
-----------+-------------------------+-------------+-----+-----------------+--------------------
Hive       | parallel computations   | any size    | no  | Java            | generic
Titan      | storage                 | ~100 B      | no  | Python, Java    | graph queries
Neo4j      | transactional DB        | ~1 B        | yes | Java, Python, R | recommender systems
igraph     | efficiency, portability | ~1 M        | yes | R, Python, C++  | research
GraphX     | parallel computations   | ~1 B        | yes | Java, Python, R | graph processing
GraphLab   | processing, analytics   | ~1 B        | yes | C++             | recommender systems
Giraph     | large scale, BSP        | any size    | no  | Java, Python    | Facebook
Pregel(ix) | large scale, BSP        | any size    | yes | Java            | Google

23 Which is the best?
Depends on the network and intended use.
- Very large social networks:
  - High-performance, customizable systems, such as Pregelix
- Research:
  - igraph and GraphX support R and Python integration
- Analysis and visualisation of social networks:
  - GraphLab with built-in interactive analysis and plotting features
  - Neo4j contains vast amounts of community resources for these tasks
- Custom use cases:
  - Existing systems might not support these
  - Instead: use Hadoop/Hive and write the rest yourself!

24 Thank You! aaaaaand stay for some questions

25 Questions (1)
- Why do we analyse social data?
- What are the possible uses of analysing social data?

26 Questions (2)
- Can visualisation help to understand graphs? (connections can be viewed, a subset of the graph can be analysed, ...)

27 Questions (3)
- Have you ever used such a system? Which one?

28 Questions (4)
- What are the advantages and disadvantages of distributed graph processing?
- What is the value of graph processing?

29 Questions (5)
- How can social metric calculations deal with fake accounts?

30 The End...
