Distributed Graph Algorithms

Similar documents
modern database systems lecture 10 : large-scale graph processing

King Abdullah University of Science and Technology. CS348: Cloud Computing. Large-Scale Graph Processing

Distributed Systems. 21. Graph Computing Frameworks. Paul Krzyzanowski. Rutgers University. Fall 2016

COSC 6339 Big Data Analytics. Graph Algorithms and Apache Giraph

TI2736-B Big Data Processing. Claudia Hauff

Graph Processing. Connor Gramazio Spiros Boosalis

Apache Giraph. for applications in Machine Learning & Recommendation Systems. Maria Novartis

GPS: A Graph Processing System

Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing

Distributed Algorithms Models

Large Scale Graph Processing Pregel, GraphLab and GraphX

PREGEL: A SYSTEM FOR LARGE- SCALE GRAPH PROCESSING

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Big Graph Processing. Fenggang Wu Nov. 6, 2016

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011

Clustering Documents. Case Study 2: Document Retrieval

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Pregel. Ali Shah

Data-Intensive Distributed Computing

CS 5220: Parallel Graph Algorithms. David Bindel

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Experimental Analysis of Distributed Graph Systems

More Graph Algorithms

DFA-G: A Unified Programming Model for Vertex-centric Parallel Graph Processing

PREGEL: A SYSTEM FOR LARGE-SCALE GRAPH PROCESSING

Extreme-scale Graph Analysis on Blue Waters

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

Distributed Graph Storage. Veronika Molnár, UZH

DFEP: Distributed Funding-based Edge Partitioning

CS /21/2016. Paul Krzyzanowski 1. Can we make MapReduce easier? Distributed Systems. Apache Pig. Apache Pig. Pig: Loading Data.

Graph Data Management

PREGEL AND GIRAPH. Why Pregel? Processing large graph problems is challenging Options

Apache Giraph: Facebook-scale graph processing infrastructure. 3/31/2014 Avery Ching, Facebook GDM

Batch & Stream Graph Processing with Apache Flink. Vasia

Outline. Graphs. Divide and Conquer.

SociaLite: A Datalog-based Language for

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

Giraph: Large-scale graph processing infrastructure on Hadoop. Qu Zhi

L22-23: Graph Algorithms

Programming Models and Tools for Distributed Graph Processing

Extreme-scale Graph Analysis on Blue Waters

ECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective

15-388/688 - Practical Data Science: Big data and MapReduce. J. Zico Kolter Carnegie Mellon University Spring 2018

Data-Intensive Computing with MapReduce

One Trillion Edges. Graph processing at Facebook scale

BSP, Pregel and the need for Graph Processing

Early Experience with Intergrating Charm++ Support to Green-Marl DSL

Graph-Parallel Problems. ML in the Context of Parallel Architectures

On Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs

High Dimensional Indexing by Clustering

Lecture 11: Graph algorithms! Claudia Hauff (Web Information Systems)!

Randomized Algorithms

Lesson 2 7 Graph Partitioning

Large-Scale Graph Processing 1: Pregel & Apache Hama Shiow-yang Wu ( 吳秀陽 ) CSIE, NDHU, Taiwan, ROC

Cloud Computing CS

Authors: Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, L., Leiser, N., Czjkowski, G.

Why do we need graph processing?

GraphHP: A Hybrid Platform for Iterative Graph Processing

Preclass Warmup. ESE535: Electronic Design Automation. Motivation (1) Today. Bisection Width. Motivation (2)

Fast Parallel Detection of Strongly Connected Components (SCC) in Small-World Graphs

Case Study 4: Collaborative Filtering. GraphLab

Graphs (Part II) Shannon Quinn

Management and Analysis of Big Graph Data: Current Systems and Open Challenges

CS 161 Lecture 11 BFS, Dijkstra s algorithm Jessica Su (some parts copied from CLRS) 1 Review

Graph Connectivity in MapReduce...How Hard Could it Be?

Graph-Processing Systems. (focusing on GraphChi)

Algorithms Activity 6: Applications of BFS

Graphs. Part I: Basic algorithms. Laura Toma Algorithms (csci2200), Bowdoin College

Report. X-Stream Edge-centric Graph processing

Lecture 22 : Distributed Systems for ML

Social-Network Graphs

Green-Marl. A DSL for Easy and Efficient Graph Analysis. LSDPO (2017/2018) Paper Presentation Tudor Tiplea (tpt26)

Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21

Pregel: A System for Large-Scale Graph Proces sing

Lecture 8, 4/22/2015. Scribed by Orren Karniol-Tambour, Hao Yi Ong, Swaroop Ramaswamy, and William Song.

Quegel: A General-Purpose Query-Centric Framework for Querying Big Graphs

1 Connected components in undirected graphs

Automatic Scaling Iterative Computations. Aug. 7 th, 2012

From Think Like a Vertex to Think Like a Graph. Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John McPherson

Graph Representations and Traversal

CS 206 Introduction to Computer Science II

hyperx: scalable hypergraph processing

An Introduction to Apache Spark

Coflow. Recent Advances and What s Next? Mosharaf Chowdhury. University of Michigan

Graph Algorithms: Part 2. Dr. Baldassano Yu s Elite Education

CS140 Final Project. Nathan Crandall, Dane Pitkin, Introduction:

Addressed Issue. P2P What are we looking at? What is Peer-to-Peer? What can databases do for P2P? What can databases do for P2P?

Analyzing Flight Data

FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs

(Refer Slide Time: 05:25)

DISTRIBUTED COMPUTER SYSTEMS ARCHITECTURES

Resilient Distributed Datasets

HPCGraph: Benchmarking Massive Graph Analytics on Supercomputers

CMPSCI611: The Simplex Algorithm Lecture 24

Naiad. a timely dataflow model

Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)

Transcription:

Distributed Graph Algorithms Alessio Guerrieri University of Trento, Italy 2016/04/26 This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License.

Contents 1 Introduction P2P Distance computation 2 Thinking distributively Big Data Pregel and Vertex-Centric frameworks Graph partitioning 3 Graph algorithms Triangle counting Connectedness Strongly Connected Components Graph clustering

Introduction Who am I? My name is Alessio Guerrieri Postdoctoral researcher with prof. Montresor Specialization on distributed large-scale graph processing Graph G = (V, E) V set of nodes E set of edges (links, connections) between pair of nodes Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 1 / 42

Introduction Today s topic Vertex-centric model for graph computation Possible applications of vertex-centric model Vertex-centric graph algorithms Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 2 / 42

Introduction P2P Peer-to-Peer network Collection of peer (nodes) Have physical connections Compute an overlay graph Nodes know existence of neighbors Can discover other nodes via peer-sampling techniques Overlay TCP/IP Network Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 3 / 42

Introduction P2P Computing in Peer-to-Peer systems Peers might want to run protocols on the network: Finding centrality of peer Test properties of network Compute distances... Can we simply reuse distributed algorithms for Computer Networks? Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 4 / 42

Introduction Distance computation Example: decentralized distance computation Problem For each peer x in overlay G, compute its distance from target peer T arget. Vertex-centric model Initially each peer only knows its direct neighbors in the overlay Each peer can send messages to each peer of which it knows the existence Amount of memory used by each peer should be sub-linear No control on rate of activity of peers and speed of messages (asynchronous execution) Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 5 / 42

Introduction Distance computation Distributed Breadth-First-Search D-BFS protocol executed by p: upon initialization do if p = T arget then D p 0 send D p + 1 to neighbors else D p upon receive d do if d < D p then D p d send D p + 1 to neighbors Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 6 / 42

Introduction Distance computation Distributed Breadth-First-Search D-BFS protocol executed by p: upon initialization do if p = T arget then D p 0 send D p + 1 to neighbors else D p upon receive d do if d < D p then D p d send D p + 1 to neighbors Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 6 / 42

Introduction Distance computation Distributed Breadth-First-Search D-BFS protocol executed by p: upon initialization do if p = T arget then D p 0 send D p + 1 to neighbors else D p upon receive d do if d < D p then D p d send D p + 1 to neighbors Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 6 / 42

Introduction Distance computation Distributed Breadth-First-Search D-BFS protocol executed by p: upon initialization do if p = T arget then D p 0 send D p + 1 to neighbors else D p upon receive d do if d < D p then D p d send D p + 1 to neighbors Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 6 / 42

Introduction Distance computation Distributed Breadth-First-Search D-BFS protocol executed by p: upon initialization do if p = T arget then D p 0 send D p + 1 to neighbors else D p upon receive d do if d < D p then D p d send D p + 1 to neighbors Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 6 / 42

Introduction Distance computation Distributed Breadth-First-Search D-BFS protocol executed by p: upon initialization do if p = T arget then D p 0 send D p + 1 to neighbors else D p upon receive d do if d < D p then D p d send D p + 1 to neighbors Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 6 / 42

Introduction Distance computation Distributed Breadth-First-Search D-BFS protocol executed by p: upon initialization do if p = T arget then D p 0 send D p + 1 to neighbors else D p upon receive d do if d < D p then D p d send D p + 1 to neighbors Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 6 / 42

Introduction Distance computation Distributed Breadth-First-Search D-BFS protocol executed by p: upon initialization do if p = T arget then D p 0 send D p + 1 to neighbors else D p upon receive d do if d < D p then D p d send D p + 1 to neighbors repeat every time units send D p + 1 to neighbors Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 7 / 42

Introduction Distance computation Notes Works if: Nodes execute in any order Messages arrive in any order Nodes or edges are added Messages are lost Fails if: Nodes fail Edges are removed Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 8 / 42

Thinking distributively Big Data New Graph Applications Many settings in which there is need to analyze graph: Web data Computer networks Biological networks Social networks... Size can be extremely large Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 9 / 42

Thinking distributively Big Data Approaches Centralized/Sequential Collect data in one place Load it in memory Analyze it with traditional techniques Decentralized One machine for each node Construct overlay network Run vertex-centric protocol Both unfeasible! These are better: Centralized/Parallel Parallel processes and threads in shared memory Distributed system Multiple (not N) machines, each hosting part of the graph Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 10 / 42

Thinking distributively Big Data Why distributed? Disadvantages Slower than parallel systems Younger field of research Needs distribution of data Advantages Fault tolerant Cheaper Extendible Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 11 / 42

Thinking distributively Big Data What do we want? Nice interface for distributed graph algorithms and fast framework to run those algorithms Graph analyst are not distributed systems experts! Issues with: Data replication Fault tolerance Message deliverance... should be dealt by the system! Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 12 / 42

Thinking distributively Big Data Vertex centric Why not reuse vertex-centric interface? User Defines data associated to vertex Defines type of messages Defines code executed by single vertex when it receives messages System Divides input graph between the available machines Each machine simulates each vertex independently Takes care of fault tolerance Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 13 / 42

Thinking distributively Big Data Vertex centric distributed system Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 14 / 42

Thinking distributively Pregel and Vertex-Centric frameworks Pregel Developed by Google First large-scale graph processing system with vertex-centric interface Created for PageRank Code automatically deployed on thousands of machines Programming API Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 15 / 42

Thinking distributively Pregel and Vertex-Centric frameworks Pregel-like frameworks Many: Giraph: on top of Hadoop GraphX: on top of Spark Gelly: on top of Flink Graphlab: standalone All frameworks allow user to run vertex-centric programs on distributed systems Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 16 / 42

Thinking distributively Pregel and Vertex-Centric frameworks Example (Giraph) public void compute ( I t e r a b l e <IntWritable > messages ) { int mindist = I n t e g e r.max VALUE; for ( IntWritable message : messages ) { mindist = Math. min ( mindist, message. get ( ) ) ; } i f ( mindist < getvalue ( ). get ( ) ) { setvalue (new IntWritable ( mindist ) ) ; IntWritable d i s t a n c e = new IntWritable ( mindist + 1) for ( Edge <... > edge : getedges ( ) ) { sendmessage ( edge. gettargetvertexid ( ), d i s t a n c e ) ; } } votetohalt ( ) ; } Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 17 / 42

Thinking distributively Pregel and Vertex-Centric frameworks Common characteristics of Vertex-centric models Independence from problem size Code stays the same regardless of size of graph number of machines used Embedded fault-tolerance Periodic benchmarking Machines can steal nodes from slower machines Distributed File System Usually HDFS Allow all machines to read and write in common space Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 18 / 42

Thinking distributively Pregel and Vertex-Centric frameworks Synchronicity of Vertex-centric models Synchronous Execution in rounds Each vertex receives in round i all messages sent to it in round i 1 Better guarantees easiest model Asynchronous No rounds Messages processed without guarantees More difficult, more efficient Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 19 / 42

Thinking distributively Graph partitioning Graph partitioning First step of analysis Which machine hosts which subset of nodes/edges? Balance Size of partitions should be balanced Quality Messages between nodes hosted in same machine are cheap Messages between machines are costly Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 20 / 42

Thinking distributively Graph partitioning Heuristic solutions Graph partitioning is NP-Hard Hash-based heuristics Node n ends in machine H(n)%K Iterative algorithms Start from random and incrementally improve Heuristics can be devised for specific data-sets Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 21 / 42

Graph Algorithms Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 22 / 42

What algorithms are available This area is not completely new: there are distributed algorithms for computer networks But: Computer networks are small Many interesting graph problems are not interesting on computer networks Some ideas can be taken from research in PRAM ( Parallel Random Access Machine) Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 23 / 42

Problems We ll look at: Triangle counting Connected components Strongly connected components Clustering We won t look at: Pagerank Single-Value decomposition Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 24 / 42

Triangle counting Triangle Counting Definition In an undirected graph Compute, for each node n, how many pairs of nodes (v, w) exists such that (n, v), (v, w), (w, n) are edges. Application on social networks Seems easier than it is... Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 25 / 42

Triangle counting Triangle counting protocol Send neighbor-list to neighbors and see which neighbors we have in common Executed by node p upon initialization do T p 0 send neighbor-list to neighbors upon receive M do Common neighbor-list M T p T p + Common Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 26 / 42

Triangle counting Issues Multi-counting Each triangle is counted 6 times! Should make sure that every triangle is counted only once! Hubs Hubs are nodes with disproportionally large degree Exists in almost all real graphs Will send large neighbor-lists to large number of neighbors Will receive large quantity of messages Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 27 / 42

Triangle counting Triangle counting protocol (2) Send list of neighbors with smaller id Executed by node p upon initialization do T p 0 M {n : n neighbor-list, id n < id p } send M to neighbors upon receive M do Common neighbor-list M T p T p + Common Cut messages by half Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 28 / 42

Triangle counting Triangle counting protocol (3) Executed by node p upon initialization do T p 0 foreach n neighbor-list do if id n < id p then M {q : q neighbor-list, id q < id n } send LIST (M) to n upon receive LIST (M) from n do Common neighbor-list M T p T p + Common send T R( Common ) to n send T R(1) to v Common upon receive T R(t) do T p T p + t Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 29 / 42

Connectedness Connected components (problem) Problem definition Two nodes v, w of an undirected graph are in the same weakly connected component (WCC) if exists a path from v to w. For each node n compute value C n such that: v, w V, C v = C w W CC v = W CC w Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 30 / 42

Connectedness Connected components (solution) Executed by node p: upon initialization do C p p.id send C p to neighbors upon receive c do if c < C p then C p c send c to neighbors Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 31 / 42

Connectedness Connected components (solution) Executed by node p: upon initialization do C p p.id send C p to neighbors upon receive c do if c < C p then C p c send c to neighbors Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 31 / 42

Connectedness Connected components (solution) Executed by node p: upon initialization do C p p.id send C p to neighbors upon receive c do if c < C p then C p c send c to neighbors Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 31 / 42

Strongly Connected Components Strongly connected components Problem definition Two nodes v, w of an directed graph are in the same strongly connected component (SCC) if exist paths from v to w and from w to v (both directions). Centralized solution Single depth-first search visit using Tarjan s algorithm Two depth-first search visits using Kosaraju s algorithm Distributed solution much more difficult! Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 32 / 42

Strongly Connected Components Strongly Connected components (solution) SCC: Executed by node p: upon initialization do F p p.id // Lowest id of nodes that reaches p B p p.id // Lowest id of nodes reachable from p send F OR(F p ) to out-neighbors send BACK(B p ) to in-neighbors upon receive F OR(c) do if c < F p then F p c send F OR(c) to out-neighbors upon receive BACK(c) do if c < B p then B p c send BACK(c) to in-neighbors Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 33 / 42

Strongly Connected Components Properties B p = minimum id of nodes reachable from p F p = minimum id of nodes that can reach p Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 34 / 42

Strongly Connected Components Properties B p = minimum id of nodes reachable from p F p = minimum id of nodes that can reach p Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 34 / 42

Strongly Connected Components Properties Lemma1 (F p = B p = i) p and i are in the same SCC Lemma2 (F p = F q and B P = B 1 p and q are in the same SCC Are the two lemmas correct? Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 35 / 42

Strongly Connected Components Properties B p = minimum id of nodes reachable from p F p = minimum id of nodes that can reach p Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 36 / 42

Strongly Connected Components Complete algorithm Complete algorithm While graph not empty: Run SCC algorithm remove each node p such that F p = B p Can improve number of SCC found through heuristics Quicker convergence for largest SCC Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 37 / 42

Graph clustering Graph clustering Given a graph, divide its vertices in clusters such that: Most edges connect nodes in same cluster Few edges go across cluster Clusters are significant Precise definition depends from application scenario All versions are NP-Complete Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 38 / 42

Graph clustering Note:ties resolved randomly Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 39 / 42 Label-propagation algorithm Executed by node p: upon initialization do C p p.id N p {} send C p to neighbors // starting label equal to id // dictionary containing labels of neighbors upon receive c from q do N p [q] = c // update dictionary with new label of neighbor repeat every time units c most common label in neighbors if C p c then send c to neighbors C p c

Graph clustering Sample run Initialization: Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 40 / 42

Graph clustering Sample run Active node chooses label 1 (randomly) Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 40 / 42

Graph clustering Sample run Active node chooses label 3(randomly) Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 40 / 42

Graph clustering Sample run Active node chooses label 6(randomly) Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 40 / 42

Graph clustering Sample run Active node chooses label 6 Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 40 / 42

Graph clustering Sample run Active node chooses label 1 Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 40 / 42

Graph clustering Sample run Active node chooses label 1 Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 40 / 42

Graph clustering Sample run Active node chooses label 6 Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 40 / 42

Graph clustering Sample run Converged state: no node change its mind Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 40 / 42

Graph clustering Disadvantages Non-deterministic Non-null probability of collapsing to single cluster Low quality In practice, it s not that good Better options Slow, iterative algorithms Fast, heuristics My algorithm! Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 41 / 42

Graph clustering Conclusions Vertex-centric algorithms can be used in: P2P systems Computer networks Large-scale graph analysis Field still in flux New frameworks New programming models New algorithms Many applications Everything is a graph Could be interesting for a project Alessio Guerrieri (UniTN) DS - Graphs 2016/04/26 42 / 42

Reading Material Pregel: A System for Large-Scale Graph Processing by Malewicz et al. Pregel Algorithms for Graph Connectivity Problems with Performance Guarantees by Yan et al. Giraph http://giraph.apache.org/ Graphx http://spark.apache.org/graphx/