GraphX. Graph Analytics in Spark. Ankur Dave Graduate Student, UC Berkeley AMPLab. Joint work with Joseph Gonzalez, Reynold Xin, Daniel

Size: px
Start display at page:

Download "GraphX. Graph Analytics in Spark. Ankur Dave Graduate Student, UC Berkeley AMPLab. Joint work with Joseph Gonzalez, Reynold Xin, Daniel"

Transcription

1 GraphX Graph nalytics in Spark nkur ave Graduate Student, U Berkeley MPLab Joint work with Joseph Gonzalez, Reynold Xin, aniel U BRKLY rankshaw, Michael Franklin, and Ion Stoica

2 Graphs

3 Social Networks

4 Web Graphs

5 User-Item Graphs

6 Graph lgorithms

7 PageRank

8 Triangle ounting

9 ollaborative Filtering Users Ratings Users x Products f(j) f(i) Products

10 ollaborative Filtering f(3) r 13 f(1) User Factors f(2) r 14 r 24 f(4) Product Factors r 25 f(5) X f[i] = arg min w2r d r ij w T f[j] 2 + w 2 2 j2nbrs(i)

11 The Graph-Parallel Pattern

12 The Graph-Parallel Pattern

13 The Graph-Parallel Pattern

14 Many Graph-Parallel lgorithms ollaborative Filtering» lternating Least Squares» Stochastic Gradient escent» Tensor Factorization ommunity etection» Triangle-ounting» K-core ecomposition» K-Truss Structured Prediction» Loopy Belief Propagation» Max-Product Linear Programs» Gibbs Sampling Graph nalytics» PageRank» Personalized PageRank» Shortest Path» Graph oloring Semi-supervised ML» Graph SSL lassification» Neural Networks» om

15 High-egree Vertices hallenges: 1. Storage: How to store a graph where one vertex s edges don t fit on a machine? 2. PI: How to expose parallelism within vertex neighborhoods?

16 omplex Pipelines Link Table Hyperlinks PageRank Top 20 Pages Title Link Title PR Raw Wikipedia < / > < / > < / > Top ommunities om. PR.. XML ditor ommunity User Table ditor Graph etection ommunity ditor Title User om.

17 omplex Pipelines Solution: mbed graph processing within a tableoriented system (Spark) hallenges: 1. Storage: How to store graphs as tables? 2. omputation: How to express graph ops as table ops (map, reduce, join, etc.)? 3. PI: How to present the two views to the user?

18 The GraphX PI

19 Property Graphs Vertex Property: User Profile urrent PageRank Value dge Property: Weights Relationships Timestamps

20 reating a Graph (Scala) type VertexId = Long Graph val vertices: R[(VertexId, String)] = sc.parallelize(list( (1L, lice ), (2L, Bob ), (3L, harlie ))) 1 coworker lice class dge[]( val srcid: VertexId, val dstid: VertexId, val attr: ) val edges: R[dge[String]] = sc.parallelize(list( dge(1l, 2L, coworker ), dge(2l, 3L, friend ))) 2 friend 3 Bob harlie val graph = Graph(vertices, edges)

21 Graph Operations (Scala) class Graph[V, ] { // Table Views def vertices: R[(VertexId, V)] def edges: R[dge[]] def triplets: R[dgeTriplet[V, ]] // Transformations def mapvertices[v2](f: (VertexId, V) => V2): Graph[V2, ] def mapdges[2](f: dge[] => 2): Graph[V2, ] def reverse: Graph[V, ] def subgraph(epred: dgetriplet[v, ] => Boolean, vpred: (VertexId, V) => Boolean): Graph[V, ] // Joins def outerjoinvertices[u, V2] (tbl: R[(VertexId, U)]) (f: (VertexId, V, Option[U]) => V2): Graph[V2, ] // omputation def aggregatemessages[]( sendmsg: dgeontext[v,, ] => Unit, mergemsg: (, ) => ): R[(VertexId, )]

22 Built-in lgorithms (Scala) } // ontinued from previous slide def pagerank(tol: ouble): Graph[ouble, ouble] def triangleount(): Graph[Int, ] def connectedomponents(): Graph[VertexId, ] //...and more: org.apache.spark.graphx.lib PageRank Triangle ount onnected omponents

23 The triplets view class Graph[V, ] { } def triplets: R[dgeTriplet[V, ]] class dgetriplet[v, ]( val srcid: VertexId, val dstid: VertexId, val attr:, val srcttr: V, val dstttr: V) Graph 1 lice R coworker 2 friend Bob triplets srcttr dstttr attr lice coworker Bob Bob friend harlie 3 harlie

24 The subgraph transformation class Graph[V, ] { def subgraph(epred: dgetriplet[v, ] => Boolean, vpred: (VertexId, V) => Boolean): Graph[V, ] } graph.subgraph(epred = (edge) => edge.attr!= relative ) Graph Graph lice coworker Bob lice coworker Bob relative friend subgraph friend harlie relative avid harlie avid

25 The subgraph transformation class Graph[V, ] { def subgraph(epred: dgetriplet[v, ] => Boolean, vpred: (VertexId, V) => Boolean): Graph[V, ] } graph.subgraph(vpred = (id, name) => name!= Bob ) Graph Graph lice coworker Bob lice relative friend subgraph relative harlie relative avid harlie relative avid

26 omputation with aggregatemessages class Graph[V, ] { def aggregatemessages[]( sendmsg: dgeontext[v,, ] => Unit, mergemsg: (, ) => ): R[(VertexId, )] } class dgeontext[v,, ]( val srcid: VertexId, val dstid: VertexId, val attr:, val srcttr: V, val dstttr: V) { def sendtosrc(msg: ) def sendtost(msg: ) } graph.aggregatemessages( ctx => { ctx.sendtosrc(1) ctx.sendtost(1) }, _ + _)

27 omputation with aggregatemessages R Graph vertex id degree lice relative harlie coworker friend relative Bob avid aggregatemessages lice 2 Bob 2 harlie 3 avid 1

28 xample: Graph oarsening Web Pages Intra-omain Links Pages by omain subgraph onnected omponents Graph constructor omains omain Graph

29 How GraphX Works

30 Storing Graphs as Tables Property Graph Vertex Table dge Table (R) (R) B B F Machine 1 Machine 2 B B F F F

31 Simple Operations Reuse vertices or edges across multiple graphs Input Graph Transformed Graph Vertex dge Vertex dge Table Table Table Table Transform Vertex Properties

32 Implementing triplets Vertex Table Routing Table dge Table (R) (R) (R) B 1 2 Machine 1 Machine 2 B F B F B F F

33 Implementing triplets dge Table (R) B B F F Mirror ache B Mirror ache F Vertex Table (R) B F B F Routing Table (R) B F

34 Reduction in ommunication ue to ached Updates onnected omponents on Twitter Graph (1.4B edges) Network omm. (MB) Most vertices are within 8 hops of all vertices in their component Iteration

35 Implementing aggregatemessages dge Table (R) B B F F Mirror ache B Mirror ache F Scan Result Vertex Table B F B F

36 Speedup ue To ccess Method Selection 30 onnected omponents on Twitter (1.4B edges) Runtime (Seconds) Scan ll dges Scan Indexed 10 5 Index of ctive dges Iteration

37 PageRank Benchmark 2 luster of 16 x m2.4xlarge (8 cores) + 1Gig Twitter Graph (42M Vertices,1.5B dges) UK-Graph (106M Vertices, 3.7B dges) Runtime (Seconds) x x GraphX GraphLab Giraph Naïve GraphX GraphLab Giraph Naïve Spark Spark GraphX performs comparably to state-of-the-art graph processing systems.

38 Open Source lpha release with Spark in Feb 2014 Stable release with Spark in ec 2014

39 Future of GraphX 1. Language support a) Java PI: PR #3234 b) Python PI: collaborating with Intel, SPRK More algorithms a) L (topic modeling): PR #2388 b) orrelation clustering 3. Research a) Local graphs b) Streaming/time-varying graphs c) Graph database like queries

40 IndexedR Support efficient updates to immutable Rs using purely functional data structures

41 Thanks!

GraphX: Unifying Data-Parallel and Graph-Parallel Analytics

GraphX: Unifying Data-Parallel and Graph-Parallel Analytics GraphX: Unifying ata-parallel and Graph-Parallel nalytics Presented by aniel rankshaw Joint work with Reynold Xin, Joseph Gonzalez, nkur ave, Michael Franklin, and Ion Stoica Spark Meetup 3/25/2014 U BERKELEY

More information

GraphX: Graph Processing in a Distributed Dataflow Framework

GraphX: Graph Processing in a Distributed Dataflow Framework GraphX: Graph Processing in a Distributed Dataflow Framework Joseph Gonzalez Postdoc, UC-Berkeley MPLab Co-founder, GraphLab Inc. Joint work with Reynold Xin, nkur Dave, Daniel Crankshaw, Michael Franklin,

More information

Why do we need graph processing?

Why do we need graph processing? Why do we need graph processing? Community detection: suggest followers? Determine what products people will like Count how many people are in different communities (polling?) Graphs are Everywhere Group

More information

Analyzing Flight Data

Analyzing Flight Data IBM Analytics Analyzing Flight Data Jeff Carlson Rich Tarro July 21, 2016 2016 IBM Corporation Agenda Spark Overview a quick review Introduction to Graph Processing and Spark GraphX GraphX Overview Demo

More information

Tegra Time-evolving Graph Processing on Commodity Clusters

Tegra Time-evolving Graph Processing on Commodity Clusters Tegra Time-evolving Graph Processing on ommodity lusters SIM SE 17 2 March 2017 nand Iyer Qifan Pu Joseph Gonzalez Ion Stoica 1 Graphs are everywhere Social Networks 2 Graphs are everywhere Gnutella network

More information

GraphX: Graph Processing in a Distributed Dataflow Framework

GraphX: Graph Processing in a Distributed Dataflow Framework GraphX: Graph Processing in a Distributed Dataflow Framework Joseph E. Gonzalez, University of California, Berkeley; Reynold S. Xin, University of California, Berkeley, and Databricks; Ankur Dave, Daniel

More information

Spark Streaming and GraphX

Spark Streaming and GraphX Spark Streaming and GraphX Amir H. Payberah amir@sics.se SICS Swedish ICT Amir H. Payberah (SICS) Spark Streaming and GraphX June 30, 2016 1 / 1 Spark Streaming Amir H. Payberah (SICS) Spark Streaming

More information

The GraphX Graph Processing System

The GraphX Graph Processing System The GraphX Graph Processing System Daniel Crankshaw Ankur Dave Reynold S. Xin Joseph E. Gonzalez Michael J. Franklin Ion Stoica UC Berkeley AMPLab {crankshaw, ankurd, rxin, jegonzal, franklin, istoica@cs.berkeley.edu

More information

Lecture 8, 4/22/2015. Scribed by Orren Karniol-Tambour, Hao Yi Ong, Swaroop Ramaswamy, and William Song.

Lecture 8, 4/22/2015. Scribed by Orren Karniol-Tambour, Hao Yi Ong, Swaroop Ramaswamy, and William Song. CME 323: Distributed Algorithms and Optimization, Spring 2015 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Databricks and Stanford. Lecture 8, 4/22/2015. Scribed by Orren Karniol-Tambour, Hao

More information

Cloud, Big Data & Linear Algebra

Cloud, Big Data & Linear Algebra Cloud, Big Data & Linear Algebra Shelly Garion IBM Research -- Haifa 2014 IBM Corporation What is Big Data? 2 Global Data Volume in Exabytes What is Big Data? 2005 2012 2017 3 Global Data Volume in Exabytes

More information

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation

Analytic Cloud with. Shelly Garion. IBM Research -- Haifa IBM Corporation Analytic Cloud with Shelly Garion IBM Research -- Haifa 2014 IBM Corporation Why Spark? Apache Spark is a fast and general open-source cluster computing engine for big data processing Speed: Spark is capable

More information

SociaLite: A Datalog-based Language for

SociaLite: A Datalog-based Language for SociaLite: A Datalog-based Language for Large-Scale Graph Analysis Jiwon Seo M OBIS OCIAL RESEARCH GROUP Overview Overview! SociaLite: language for large-scale graph analysis! Extensions to Datalog! Compiler

More information

Fast, Interactive, Language-Integrated Cluster Computing

Fast, Interactive, Language-Integrated Cluster Computing Spark Fast, Interactive, Language-Integrated Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker, Ion Stoica www.spark-project.org

More information

Big Graph Processing. Fenggang Wu Nov. 6, 2016

Big Graph Processing. Fenggang Wu Nov. 6, 2016 Big Graph Processing Fenggang Wu Nov. 6, 2016 Agenda Project Publication Organization Pregel SIGMOD 10 Google PowerGraph OSDI 12 CMU GraphX OSDI 14 UC Berkeley AMPLab PowerLyra EuroSys 15 Shanghai Jiao

More information

Large Scale Graph Processing Pregel, GraphLab and GraphX

Large Scale Graph Processing Pregel, GraphLab and GraphX Large Scale Graph Processing Pregel, GraphLab and GraphX Amir H. Payberah amir@sics.se KTH Royal Institute of Technology Amir H. Payberah (KTH) Large Scale Graph Processing 2016/10/03 1 / 76 Amir H. Payberah

More information

Analysis of Algorithms Prof. Karen Daniels

Analysis of Algorithms Prof. Karen Daniels UMass Lowell omputer Science 91.404 nalysis of lgorithms Prof. Karen aniels Spring, 2013 hapter 22: raph lgorithms & rief Introduction to Shortest Paths [Source: ormen et al. textbook except where noted]

More information

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications Spark In- Memory Cluster Computing for Iterative and Interactive Applications Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker,

More information

An Introduction to Apache Spark

An Introduction to Apache Spark An Introduction to Apache Spark 1 History Developed in 2009 at UC Berkeley AMPLab. Open sourced in 2010. Spark becomes one of the largest big-data projects with more 400 contributors in 50+ organizations

More information

What is Routing? EE 122: Shortest Path Routing. Example. Internet Routing. Ion Stoica TAs: Junda Liu, DK Moon, David Zats

What is Routing? EE 122: Shortest Path Routing. Example. Internet Routing. Ion Stoica TAs: Junda Liu, DK Moon, David Zats What is Routing? Routing implements the core function of a network: : Shortest Path Routing Ion Stoica Ts: Junda Liu, K Moon, avid Zats http://inst.eecs.berkeley.edu/~ee/fa9 (Materials with thanks to Vern

More information

Graph-Based Parallel Computing. William Cohen

Graph-Based Parallel Computing. William Cohen Graph-Based Parallel Computing William Cohen 1 Announcements Next Tuesday 12/8: Presentations for 10-805 projects. 15 minutes per project. Final written reports due Tues 12/15 2 Graph-Based Parallel Computing

More information

Apache Giraph. for applications in Machine Learning & Recommendation Systems. Maria Novartis

Apache Giraph. for applications in Machine Learning & Recommendation Systems. Maria Novartis Apache Giraph for applications in Machine Learning & Recommendation Systems Maria Stylianou @marsty5 Novartis Züri Machine Learning Meetup #5 June 16, 2014 Apache Giraph for applications in Machine Learning

More information

GraphX: Unifying Data-Parallel and Graph-Parallel Analytics

GraphX: Unifying Data-Parallel and Graph-Parallel Analytics GraphX: Unifying Data-Parallel and Graph-Parallel Analytics Reynold S. Xin Daniel Crankshaw Ankur Dave Joseph E. Gonzalez Michael J. Franklin Ion Stoica UC Berkeley AMPLab {rxin, crankshaw, ankurd, jegonzal,

More information

Graph Data Management

Graph Data Management Graph Data Management Analysis and Optimization of Graph Data Frameworks presented by Fynn Leitow Overview 1) Introduction a) Motivation b) Application for big data 2) Choice of algorithms 3) Choice of

More information

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah Jure Leskovec (@jure) Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah 2 My research group at Stanford: Mining and modeling large social and information networks

More information

Batch & Stream Graph Processing with Apache Flink. Vasia

Batch & Stream Graph Processing with Apache Flink. Vasia Batch & Stream Graph Processing with Apache Flink Vasia Kalavri vasia@apache.org @vkalavri Outline Distributed Graph Processing Gelly: Batch Graph Processing with Flink Gelly-Stream: Continuous Graph

More information

Programming Models and Tools for Distributed Graph Processing

Programming Models and Tools for Distributed Graph Processing Programming Models and Tools for Distributed Graph Processing Vasia Kalavri kalavriv@inf.ethz.ch 31st British International Conference on Databases 10 July 2017, London, UK ABOUT ME Postdoctoral Fellow

More information

IP Forwarding Computer Networking. Routes from Node A. Graph Model. Lecture 10: Intra-Domain Routing

IP Forwarding Computer Networking. Routes from Node A. Graph Model. Lecture 10: Intra-Domain Routing IP orwarding - omputer Networking Lecture : Intra-omain Routing RIP (Routing Information Protocol) & OSP (Open Shortest Path irst) The Story So ar IP addresses are structure to reflect Internet structure

More information

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING

RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING RESILIENT DISTRIBUTED DATASETS: A FAULT-TOLERANT ABSTRACTION FOR IN-MEMORY CLUSTER COMPUTING Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael J. Franklin,

More information

Apache Spark 2.0. Matei

Apache Spark 2.0. Matei Apache Spark 2.0 Matei Zaharia @matei_zaharia What is Apache Spark? Open source data processing engine for clusters Generalizes MapReduce model Rich set of APIs and libraries In Scala, Java, Python and

More information

Unifying Big Data Workloads in Apache Spark

Unifying Big Data Workloads in Apache Spark Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

Matrix Computations and " Neural Networks in Spark

Matrix Computations and  Neural Networks in Spark Matrix Computations and " Neural Networks in Spark Reza Zadeh Paper: http://arxiv.org/abs/1509.02256 Joint work with many folks on paper. @Reza_Zadeh http://reza-zadeh.com Training Neural Networks Datasets

More information

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications Spark In- Memory Cluster Computing for Iterative and Interactive Applications Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker,

More information

GraphLab: A New Framework for Parallel Machine Learning

GraphLab: A New Framework for Parallel Machine Learning GraphLab: A New Framework for Parallel Machine Learning Yucheng Low, Aapo Kyrola, Carlos Guestrin, Joseph Gonzalez, Danny Bickson, Joe Hellerstein Presented by Guozhang Wang DB Lunch, Nov.8, 2010 Overview

More information

Spark GraphX INTRODUCTION AND OVERVIEW WITH FOLLOW-ALONG GRAPH ANALYSIS EXAMPLES

Spark GraphX INTRODUCTION AND OVERVIEW WITH FOLLOW-ALONG GRAPH ANALYSIS EXAMPLES Spark GraphX INTRODUCTION AND OVERVIEW WITH FOLLOW-ALONG GRAPH ANALYSIS EXAMPLES Notes Example data and scripts available at: All examples will be executed with the Spark-Shell A basic knowledge of Spark

More information

Distributed Machine Learning" on Spark

Distributed Machine Learning on Spark Distributed Machine Learning" on Spark Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Outline Data flow vs. traditional network programming Spark computing engine Optimization Example Matrix Computations

More information

GraphFrames: An Integrated API for Mixing Graph and Relational Queries

GraphFrames: An Integrated API for Mixing Graph and Relational Queries GraphFrames: An Integrated API for Mixing Graph and Relational Queries Ankur ave, Alekh Jindal, Li rran Li, Reynold Xin, Joseph Gonzalez, Matei Zaharia University of California, erkeley MIT Uber Technologies

More information

MLlib. Distributed Machine Learning on. Evan Sparks. UC Berkeley

MLlib. Distributed Machine Learning on. Evan Sparks.  UC Berkeley MLlib & ML base Distributed Machine Learning on Evan Sparks UC Berkeley January 31st, 2014 Collaborators: Ameet Talwalkar, Xiangrui Meng, Virginia Smith, Xinghao Pan, Shivaram Venkataraman, Matei Zaharia,

More information

Graph-Parallel Problems. ML in the Context of Parallel Architectures

Graph-Parallel Problems. ML in the Context of Parallel Architectures Case Study 4: Collaborative Filtering Graph-Parallel Problems Synchronous v. Asynchronous Computation Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 20 th, 2014

More information

Thrill : High-Performance Algorithmic

Thrill : High-Performance Algorithmic Thrill : High-Performance Algorithmic Distributed Batch Data Processing in C++ Timo Bingmann, Michael Axtmann, Peter Sanders, Sebastian Schlag, and 6 Students 2016-12-06 INSTITUTE OF THEORETICAL INFORMATICS

More information

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha

More information

Mosaic: Processing a Trillion-Edge Graph on a Single Machine

Mosaic: Processing a Trillion-Edge Graph on a Single Machine Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student Paper @ EuroSys

More information

Third Generation Routers

Third Generation Routers IP orwarding 5-5- omputer Networking 5- Lecture : Routing Peter Steenkiste all www.cs.cmu.edu/~prs/5-- The Story So ar IP addresses are structured to reflect Internet structure IP packet headers carry

More information

Distributed Computing with Spark

Distributed Computing with Spark Distributed Computing with Spark Reza Zadeh Thanks to Matei Zaharia Outline Data flow vs. traditional network programming Limitations of MapReduce Spark computing engine Numerical computing on Spark Ongoing

More information

Big Data Infrastructures & Technologies

Big Data Infrastructures & Technologies Big Data Infrastructures & Technologies Spark and MLLIB OVERVIEW OF SPARK What is Spark? Fast and expressive cluster computing system interoperable with Apache Hadoop Improves efficiency through: In-memory

More information

Distributed Graph Storage. Veronika Molnár, UZH

Distributed Graph Storage. Veronika Molnár, UZH Distributed Graph Storage Veronika Molnár, UZH Overview Graphs and Social Networks Criteria for Graph Processing Systems Current Systems Storage Computation Large scale systems Comparison / Best systems

More information

L10 Graphs. Alice E. Fischer. April Alice E. Fischer L10 Graphs... 1/37 April / 37

L10 Graphs. Alice E. Fischer. April Alice E. Fischer L10 Graphs... 1/37 April / 37 L10 Graphs lice. Fischer pril 2016 lice. Fischer L10 Graphs... 1/37 pril 2016 1 / 37 Outline 1 Graphs efinition Graph pplications Graph Representations 2 Graph Implementation 3 Graph lgorithms Sorting

More information

Big Data Infrastructures & Technologies Hadoop Streaming Revisit.

Big Data Infrastructures & Technologies Hadoop Streaming Revisit. Big Data Infrastructures & Technologies Hadoop Streaming Revisit ENRON Mapper ENRON Mapper Output (Excerpt) acomnes@enron.com blake.walker@enron.com edward.snowden@cia.gov alex.berenson@nyt.com ENRON Reducer

More information

Case Study 4: Collaborative Filtering. GraphLab

Case Study 4: Collaborative Filtering. GraphLab Case Study 4: Collaborative Filtering GraphLab Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin March 14 th, 2013 Carlos Guestrin 2013 1 Social Media

More information

Cluster Computing Architecture. Intel Labs

Cluster Computing Architecture. Intel Labs Intel Labs Legal Notices INFORMATION IN THIS DOCUMENT IS PROVIDED IN CONNECTION WITH INTEL PRODUCTS. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY INTELLECTUAL PROPERTY RIGHTS IS GRANTED

More information

Resilient Distributed Datasets

Resilient Distributed Datasets Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,

More information

3.1 Basic Definitions and Applications

3.1 Basic Definitions and Applications Graphs hapter hapter Graphs. Basic efinitions and Applications Graph. G = (V, ) n V = nodes. n = edges between pairs of nodes. n aptures pairwise relationship between objects: Undirected graph represents

More information

Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations

Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations George M. Slota 1 Sivasankaran Rajamanickam 2 Kamesh Madduri 3 1 Rensselaer Polytechnic Institute, 2 Sandia National

More information

GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Xiaowei ZHU Tsinghua University

GridGraph: Large-Scale Graph Processing on a Single Machine Using 2-Level Hierarchical Partitioning. Xiaowei ZHU Tsinghua University GridGraph: Large-Scale Graph Processing on a Single Machine Using -Level Hierarchical Partitioning Xiaowei ZHU Tsinghua University Widely-Used Graph Processing Shared memory Single-node & in-memory Ligra,

More information

Integration of Machine Learning Library in Apache Apex

Integration of Machine Learning Library in Apache Apex Integration of Machine Learning Library in Apache Apex Anurag Wagh, Krushika Tapedia, Harsh Pathak Vishwakarma Institute of Information Technology, Pune, India Abstract- Machine Learning is a type of artificial

More information

Lightning Fast Cluster Computing. Michael Armbrust Reflections Projections 2015 Michast

Lightning Fast Cluster Computing. Michael Armbrust Reflections Projections 2015 Michast Lightning Fast Cluster Computing Michael Armbrust - @michaelarmbrust Reflections Projections 2015 Michast What is Apache? 2 What is Apache? Fast and general computing engine for clusters created by students

More information

Routing. Effect of Routing in Flow Control. Relevant Graph Terms. Effect of Routing Path on Flow Control. Effect of Routing Path on Flow Control

Routing. Effect of Routing in Flow Control. Relevant Graph Terms. Effect of Routing Path on Flow Control. Effect of Routing Path on Flow Control Routing Third Topic of the course Read chapter of the text Read chapter of the reference Main functions of routing system Selection of routes between the origin/source-destination pairs nsure that the

More information

Execution Tracing and Profiling

Execution Tracing and Profiling lass 8 Questions/comments iscussion of academic honesty, GT Honor ode fficient path profiling inal project presentations: ec, 3; 4:35-6:45 ssign (see Schedule for links) Problem Set 7 discuss Readings

More information

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Research challenges in data-intensive computing The Stratosphere Project Apache Flink Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive

More information

Distributed Computing with Spark and MapReduce

Distributed Computing with Spark and MapReduce Distributed Computing with Spark and MapReduce Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Traditional Network Programming Message-passing between nodes (e.g. MPI) Very difficult to do at scale:» How

More information

Twitter data Analytics using Distributed Computing

Twitter data Analytics using Distributed Computing Twitter data Analytics using Distributed Computing Uma Narayanan Athrira Unnikrishnan Dr. Varghese Paul Dr. Shelbi Joseph Research Scholar M.tech Student Professor Assistant Professor Dept. of IT, SOE

More information

A Hierarchical Synchronous Parallel Model for Wide-Area Graph Analytics

A Hierarchical Synchronous Parallel Model for Wide-Area Graph Analytics A Hierarchical Synchronous Parallel Model for Wide-Area Graph Analytics Shuhao Liu*, Li Chen, Baochun Li, Aiden Carnegie University of Toronto April 17, 2018 Graph Analytics What is Graph Analytics? 2

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 14: Distributed Graph Processing Motivation Many applications require graph processing E.g., PageRank Some graph data sets are very large

More information

Graph Theory. Network Science: Graph theory. Graph theory Terminology and notation. Graph theory Graph visualization

Graph Theory. Network Science: Graph theory. Graph theory Terminology and notation. Graph theory Graph visualization Network Science: Graph Theory Ozalp abaoglu ipartimento di Informatica Scienza e Ingegneria Università di ologna www.cs.unibo.it/babaoglu/ ranch of mathematics for the study of structures called graphs

More information

CS 347 Parallel and Distributed Data Processing

CS 347 Parallel and Distributed Data Processing CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 14: Distributed Graph Processing Motivation Many applications require graph processing E.g., PageRank Some graph data sets are very large

More information

Graphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Graphs / Networks. CSE 6242/ CX 4242 Feb 18, Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech CSE 6242/ CX 4242 Feb 18, 2014 Graphs / Networks Centrality measures, algorithms, interactive applications Duen Horng (Polo) Chau Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey

More information

ASAP: Fast, Approximate Graph Pattern Mining at Scale

ASAP: Fast, Approximate Graph Pattern Mining at Scale ASAP: Fast, Approximate Graph Pattern Mining at Scale Anand Padmanabha Iyer, UC Berkeley; Zaoxing Liu and Xin Jin, Johns Hopkins University; Shivaram Venkataraman, Microsoft Research / University of Wisconsin;

More information

CS435 Introduction to Big Data FALL 2018 Colorado State University. 9/24/2018 Week 6-A Sangmi Lee Pallickara. Topics. This material is built based on,

CS435 Introduction to Big Data FALL 2018 Colorado State University. 9/24/2018 Week 6-A Sangmi Lee Pallickara. Topics. This material is built based on, FLL 218 olorado State University 9/24/218 Week 6-9/24/218 S435 Introduction to ig ata - FLL 218 W6... PRT 1. LRGE SLE T NLYTIS WE-SLE LINK N SOIL NETWORK NLYSIS omputer Science, olorado State University

More information

Stream Processing on IoT Devices using Calvin Framework

Stream Processing on IoT Devices using Calvin Framework Stream Processing on IoT Devices using Calvin Framework by Ameya Nayak A Project Report Submitted in Partial Fulfillment of the Requirements for the Degree of Master of Science in Computer Science Supervised

More information

Big data systems 12/8/17

Big data systems 12/8/17 Big data systems 12/8/17 Today Basic architecture Two levels of scheduling Spark overview Basic architecture Cluster Manager Cluster Cluster Manager 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores

More information

CS Spark. Slides from Matei Zaharia and Databricks

CS Spark. Slides from Matei Zaharia and Databricks CS 5450 Spark Slides from Matei Zaharia and Databricks Goals uextend the MapReduce model to better support two common classes of analytics apps Iterative algorithms (machine learning, graphs) Interactive

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

Joseph hgonzalez. A New Parallel Framework for Machine Learning. Joint work with. Yucheng Low. Aapo Kyrola. Carlos Guestrin.

Joseph hgonzalez. A New Parallel Framework for Machine Learning. Joint work with. Yucheng Low. Aapo Kyrola. Carlos Guestrin. A New Parallel Framework for Machine Learning Joseph hgonzalez Joint work with Yucheng Low Aapo Kyrola Danny Bickson Carlos Guestrin Alex Smola Guy Blelloch Joe Hellerstein David O Hallaron Carnegie Mellon

More information

Today s content. Resilient Distributed Datasets(RDDs) Spark and its data model

Today s content. Resilient Distributed Datasets(RDDs) Spark and its data model Today s content Resilient Distributed Datasets(RDDs) ------ Spark and its data model Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing -- Spark By Matei Zaharia,

More information

Let s focus on clarifying questions. More Routing. Logic Refresher. Warning. Short Summary of Course. 10 Years from Now.

Let s focus on clarifying questions. More Routing. Logic Refresher. Warning. Short Summary of Course. 10 Years from Now. Let s focus on clarifying questions I love the degree of interaction in this year s class More Routing all Scott Shenker http://inst.eecs.berkeley.edu/~ee/ Materials with thanks to Jennifer Rexford, Ion

More information

Harp-DAAL for High Performance Big Data Computing

Harp-DAAL for High Performance Big Data Computing Harp-DAAL for High Performance Big Data Computing Large-scale data analytics is revolutionizing many business and scientific domains. Easy-touse scalable parallel techniques are necessary to process big

More information

Graph Analytics using Vertica Relational Database

Graph Analytics using Vertica Relational Database Graph Analytics using ertica Relational Database Alekh Jindal* Samuel Madden Malú Castellanos Meichun Hsu Microsoft MIT ertica ertica * work done while at MIT Motivation for graphs on DB Data anyways in

More information

Introduction. Example

Introduction. Example OMS0 Introduction Priority queues and ijkstra s algorithm In this lecture we will discuss ijkstra s algorithm, a more efficient way of solving the single-source shortest path problem. epartment of omputer

More information

15-388/688 - Practical Data Science: Big data and MapReduce. J. Zico Kolter Carnegie Mellon University Spring 2018

15-388/688 - Practical Data Science: Big data and MapReduce. J. Zico Kolter Carnegie Mellon University Spring 2018 15-388/688 - Practical Data Science: Big data and MapReduce J. Zico Kolter Carnegie Mellon University Spring 2018 1 Outline Big data Some context in distributed computing map + reduce MapReduce MapReduce

More information

Undirected graph is a special case of a directed graph, with symmetric edges

Undirected graph is a special case of a directed graph, with symmetric edges S-6S- ijkstra s lgorithm -: omputing iven a directed weighted graph (all weights non-negative) and two vertices x and y, find the least-cost path from x to y in. Undirected graph is a special case of a

More information

CS535 Big Data Fall 2017 Colorado State University 9/5/2017. Week 3 - A. FAQs. This material is built based on,

CS535 Big Data Fall 2017 Colorado State University  9/5/2017. Week 3 - A. FAQs. This material is built based on, S535 ig ata Fall 217 olorado State University 9/5/217 Week 3-9/5/217 S535 ig ata - Fall 217 Week 3--1 S535 IG T FQs Programming ssignment 1 We will discuss link analysis in week3 Installation/configuration

More information

Homework 4 Solutions

Homework 4 Solutions CS3510 Design & Analysis of Algorithms Section A Homework 4 Solutions Uploaded 4:00pm on Dec 6, 2017 Due: Monday Dec 4, 2017 This homework has a total of 3 problems on 4 pages. Solutions should be submitted

More information

Distributed Graph Algorithms

Distributed Graph Algorithms Distributed Graph Algorithms Alessio Guerrieri University of Trento, Italy 2016/04/26 This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Contents 1 Introduction

More information

ECE 158A: Lecture 5. Fall 2015

ECE 158A: Lecture 5. Fall 2015 8: Lecture Fall 0 Routing ()! Location-ased ddressing Recall from Lecture that routers maintain routing tables to forward packets based on their IP addresses To allow scalability, IP addresses are assigned

More information

Query Optimization of Distributed Pattern Matching

Query Optimization of Distributed Pattern Matching Query Optimization of istributed Pattern Matching Jiewen Huang, Kartik Venkatraman, aniel J. badi Yale University jiewen.huang@yale.edu, kartik.venkatraman@yale.edu, dna@cs.yale.edu bstract Greedy algorithms

More information

CSE 373: Data Structures and Algorithms. Graphs. Autumn Shrirang (Shri) Mare

CSE 373: Data Structures and Algorithms. Graphs. Autumn Shrirang (Shri) Mare SE 373: Data Structures and lgorithms Graphs utumn 2018 Shrirang (Shri) Mare shri@cs.washington.edu Thanks to Kasey hampion, en Jones, dam lank, Michael Lee, Evan Mcarty, Robbie Weber, Whitaker rand, Zora

More information

Week 9 Student Responsibilities. Mat Example: Minimal Spanning Tree. 3.3 Spanning Trees. Prim s Minimal Spanning Tree.

Week 9 Student Responsibilities. Mat Example: Minimal Spanning Tree. 3.3 Spanning Trees. Prim s Minimal Spanning Tree. Week 9 Student Responsibilities Reading: hapter 3.3 3. (Tucker),..5 (Rosen) Mat 3770 Spring 01 Homework Due date Tucker Rosen 3/1 3..3 3/1 DS & S Worksheets 3/6 3.3.,.5 3/8 Heapify worksheet ttendance

More information

DATA SCIENCE USING SPARK: AN INTRODUCTION

DATA SCIENCE USING SPARK: AN INTRODUCTION DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data

More information

Dell In-Memory Appliance for Cloudera Enterprise

Dell In-Memory Appliance for Cloudera Enterprise Dell In-Memory Appliance for Cloudera Enterprise Spark Technology Overview and Streaming Workload Use Cases Author: Armando Acosta Hadoop Product Manager/Subject Matter Expert Armando_Acosta@Dell.com/

More information

Adjacency Matrix Undirected Graph. » Matrix [ p, q ] is 1 (True) if < p, q > is an edge in the graph. Graphs Representation & Paths

Adjacency Matrix Undirected Graph. » Matrix [ p, q ] is 1 (True) if < p, q > is an edge in the graph. Graphs Representation & Paths djacency Matrix Undirected Graph Row and column indices are vertex numbers Graphs Representation & Paths Mixture of Chapters 9 & 0» Matrix [ p, q ] is (True) if < p, q > is an edge in the graph if < p,

More information

BSP, Pregel and the need for Graph Processing

BSP, Pregel and the need for Graph Processing BSP, Pregel and the need for Graph Processing Patrizio Dazzi, HPC Lab ISTI - CNR mail: patrizio.dazzi@isti.cnr.it web: http://hpc.isti.cnr.it/~dazzi/ National Research Council of Italy A need for Graph

More information

CPL 2016, week 10. Clojure functional core. Oleg Batrashev. April 11, Institute of Computer Science, Tartu, Estonia

CPL 2016, week 10. Clojure functional core. Oleg Batrashev. April 11, Institute of Computer Science, Tartu, Estonia CPL 2016, week 10 Clojure functional core Oleg Batrashev Institute of Computer Science, Tartu, Estonia April 11, 2016 Overview Today Clojure language core Next weeks Immutable data structures Clojure simple

More information

IP Forwarding Computer Networking. Graph Model. Routes from Node A. Lecture 11: Intra-Domain Routing

IP Forwarding Computer Networking. Graph Model. Routes from Node A. Lecture 11: Intra-Domain Routing IP Forwarding 5-44 omputer Networking Lecture : Intra-omain Routing RIP (Routing Information Protocol) & OSPF (Open Shortest Path First) The Story So Far IP addresses are structured to reflect Internet

More information

Scaled Machine Learning at Matroid

Scaled Machine Learning at Matroid Scaled Machine Learning at Matroid Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Machine Learning Pipeline Learning Algorithm Replicate model Data Trained Model Serve Model Repeat entire pipeline Scaling

More information

ALGORITHMS FOR DECISION SUPPORT

ALGORITHMS FOR DECISION SUPPORT GORIHM FOR IION UOR ayesian networks Guest ecturer: ilja Renooij hanks to hor Whalen or kindly contributing to these slides robabilistic Independence onditional Independence onditional Independence hain

More information

MapReduce review. Spark and distributed data processing. Who am I? Today s Talk. Reynold Xin

MapReduce review. Spark and distributed data processing. Who am I? Today s Talk. Reynold Xin Who am I? Reynold Xin Stanford CS347 Guest Lecture Spark and distributed data processing PMC member, Apache Spark Cofounder & Chief Architect, Databricks PhD on leave (ABD), UC Berkeley AMPLab Reynold

More information

Digraphs ( 12.4) Directed Graphs. Digraph Application. Digraph Properties. A digraph is a graph whose edges are all directed.

Digraphs ( 12.4) Directed Graphs. Digraph Application. Digraph Properties. A digraph is a graph whose edges are all directed. igraphs ( 12.4) irected Graphs OR OS digraph is a graph whose edges are all directed Short for directed graph pplications one- way streets flights task scheduling irected Graphs 1 irected Graphs 2 igraph

More information

arxiv: v1 [cs.db] 11 Jan 2017

arxiv: v1 [cs.db] 11 Jan 2017 SPARQL over GraphX Besat Kassaie Cheriton School of Computer Science, University of Waterloo bkassaie@uwaterloo.ca arxiv:1701.03091v1 [cs.db] 11 Jan 2017 ABSTRACT The ability of the RDF data model to link

More information

SociaLite: A Python-Integrated Query Language for

SociaLite: A Python-Integrated Query Language for SociaLite: A Python-Integrated Query Language for Big Data Analysis Jiwon Seo * Jongsoo Park Jaeho Shin Stephen Guo Monica Lam STANFORD UNIVERSITY M OBIS OCIAL RESEARCH GROUP * Intel Parallel Research

More information

MLI - An API for Distributed Machine Learning. Sarang Dev

MLI - An API for Distributed Machine Learning. Sarang Dev MLI - An API for Distributed Machine Learning Sarang Dev MLI - API Simplify the development of high-performance, scalable, distributed algorithms. Targets common ML problems related to data loading, feature

More information