Why do we need graph processing?


Graphs are Everywhere
- Community detection: suggest followers?
- Determine what products people will like
- Count how many people are in different communities (polling?)
- Group similar articles

Why do we need graph processing?
- Collaborative Filtering: Alternating Least Squares, Stochastic Gradient Descent, Tensor Factorization
- Structured Prediction: Loopy Belief Propagation, Max-Product Linear Programs, Gibbs Sampling
- Semi-supervised ML: Graph SSL, CoEM
- Community Detection: Triangle-Counting, K-core Decomposition, K-Truss
- Graph Analytics: PageRank, Personalized PageRank, Shortest Path, Graph Coloring

Many of these can be expressed as matrix problems.
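
To make one of these concrete, here is a minimal sketch of PageRank as GraphX exposes it, assuming a spark-shell style SparkContext `sc`; the edge-list path is a placeholder:

    import org.apache.spark.graphx.GraphLoader

    // Load an unweighted graph from an edge-list file ("srcId dstId" per line).
    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")

    // Iterate until the ranks change by less than the tolerance.
    val ranks = graph.pageRank(tol = 0.001).vertices

    // Print the ten highest-ranked vertices.
    ranks.sortBy(_._2, ascending = false).take(10).foreach(println)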

Graphs are Everywhere
- Community detection: suggest followers?
- Collaborative filtering: determine what products people will like
- Count how many people are in different communities (polling?)
- Clustering: group similar articles?

Comments from a Spark JIRA ticket on GraphX:
- "Given the lack of activity on GraphX and its defunct predecessor Bagel, I doubt anything significant will be added here. I'd almost just close this."
- "The entire world of ecommerce on the internet is driven by graph analytics (page rank, suggestions, also viewed, etc.). While arcane to some, it is a very important and growing field of computer science and web analytics, ESPECIALLY where big data is concerned."
- "My personal bias is that most real-world problems that look like they'd be cool to solve as graph problems aren't graph problems, or aren't great to actually solve that way. FWIW virtually none of our customers use GraphX, and we interact with a pretty good cross section of Big Companies. Many of the useful functions you identify are not solved as graph problems in my experience, even if they could be (e.g. recommenders, also viewed)."
- "Can we infer from the comments on this ticket that GraphX will be discontinued and no longer supported by the Spark Community? I see the makings of rumors already..."

Which of the following Spark features or modules are most likely to solve your big data challenges? (Typesafe + Databricks + Dzone surveyed 2136 people)

Some MLlib algorithms use GraphX: Power Iteration Clustering, LDA
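
A minimal sketch of one of those, MLlib's Power Iteration Clustering, which runs on GraphX under the hood (assumes an existing SparkContext `sc`; the similarity triples here are made-up toy data):

    import org.apache.spark.mllib.clustering.PowerIterationClustering

    // (srcId, dstId, similarity) triples; toy values for illustration only.
    val similarities = sc.parallelize(Seq(
      (0L, 1L, 0.9), (1L, 2L, 0.9), (2L, 3L, 0.1), (3L, 4L, 0.9)
    ))

    val model = new PowerIterationClustering()
      .setK(2)               // number of clusters
      .setMaxIterations(20)
      .run(similarities)

    model.assignments.collect().foreach { a =>
      println(s"vertex ${a.id} -> cluster ${a.cluster}")
    }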

Trends

Why do we need distributed graph processing?
- Graphs used in the GraphX paper have billions of edges
- Twitter: 40M users, 1.4 billion links
- Frank McSherry's laptop can process them faster than GraphX

Big: billions of edges, rich metadata

Why do we need distributed graph processing?
- To process graphs with lots of metadata (is the metadata needed for the graph problem?)
- Because it's convenient to incorporate in a single system
- Include a fast single-machine implementation (as a fallback) in Spark?

What's hard about distributed graph processing?
- How do you represent the graph?
- How do you distribute the graph over the machines? Some parts of the graph need to be duplicated.
- Existing systems use a specialized representation.
(Slide shows an example graph with vertices A-F.)
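
For reference, this is roughly how a property graph is expressed in GraphX: a vertex RDD plus an edge RDD, with GraphX handling the vertex-cut duplication internally. A minimal sketch (assumes an existing SparkContext `sc`; the vertex names and edge labels are made up):

    import org.apache.spark.graphx.{Edge, Graph, VertexId}

    // Vertex properties: (id, name).
    val vertices = sc.parallelize(Seq[(VertexId, String)](
      (1L, "A"), (2L, "B"), (3L, "C"), (4L, "D"), (5L, "E"), (6L, "F")
    ))

    // Edge properties: a label per edge.
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), Edge(2L, 3L, "follows"),
      Edge(3L, 4L, "follows"), Edge(5L, 6L, "follows")
    ))

    // GraphX partitions edges and replicates (mirrors) vertices as needed.
    val graph = Graph(vertices, edges)
    println(s"${graph.numVertices} vertices, ${graph.numEdges} edges")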

GraphX: can we use a general-purpose system?
- Graph computation is often part of a bigger pipeline
Why now?
- Frameworks support in-memory processing (kind of)
- Frameworks allow fine-grained control over data partitioning

Fundamental challenge: data representation
- Dataflow systems expect a single, partitioned dataset
(Slide shows the same example graph with vertices A-F.)

Encoding Property Graphs as Tables
- The property graph (vertices A-F) is split with a vertex cut into two edge partitions, spread over two machines.
- It is stored as three RDDs: a Vertex Table, a Routing Table (which edge partitions each vertex appears in), and an Edge Table.
- A join is needed to bring vertex properties and edge properties together.
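
The join described on this slide is what GraphX's triplets view materializes. A minimal sketch, reusing the `graph` of named vertices from the earlier sketch:

    // Each triplet carries the source property, edge property, and
    // destination property, i.e. the result of joining the vertex and
    // edge tables via the routing table.
    graph.triplets
      .map(t => s"${t.srcAttr} -[${t.attr}]-> ${t.dstAttr}")
      .collect()
      .foreach(println)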

What changed?

Graph Processing Systems
- Pregel: synchronous steps
- GraphLab: asynchronous steps
- PowerGraph: better placement / representation
- GraphX: built on a general-purpose system, synchronous
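
GraphX's synchronous model is exposed through a Pregel-style API. A minimal sketch of single-source shortest paths counting hops only (assumes an existing SparkContext `sc`; the edge-list path is a placeholder):

    import org.apache.spark.graphx.{GraphLoader, VertexId}

    val sourceId: VertexId = 1L
    val raw = GraphLoader.edgeListFile(sc, "hdfs:///data/edges.txt")

    // Distances: 0 at the source, infinity elsewhere; each edge counts as one hop.
    val initial = raw.mapVertices((id, _) =>
      if (id == sourceId) 0.0 else Double.PositiveInfinity)

    val sssp = initial.pregel(Double.PositiveInfinity)(
      (id, dist, newDist) => math.min(dist, newDist),      // vertex program
      triplet => {                                         // send messages
        if (triplet.srcAttr + 1.0 < triplet.dstAttr)
          Iterator((triplet.dstId, triplet.srcAttr + 1.0))
        else Iterator.empty
      },
      (a, b) => math.min(a, b)                             // merge messages
    )

    sssp.vertices.take(5).foreach(println)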

Takeaway: Spark as a building block
- Gave control over data storage (memory / disk)
- Gave control over data partitioning
- Doesn't support asynchrony
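
A minimal sketch of those two controls as they surface in the GraphX API (placeholder path, existing SparkContext `sc` assumed):

    import org.apache.spark.graphx.{GraphLoader, PartitionStrategy}
    import org.apache.spark.storage.StorageLevel

    val graph = GraphLoader
      .edgeListFile(sc, "hdfs:///data/edges.txt",
        edgeStorageLevel = StorageLevel.MEMORY_AND_DISK,    // storage: memory, spill to disk
        vertexStorageLevel = StorageLevel.MEMORY_AND_DISK)
      .partitionBy(PartitionStrategy.EdgePartition2D)       // partitioning: 2D vertex cut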

Existing paper says: the network doesn't matter for GraphX
- I optimized the CPU time of PageRank
- Now the network matters more

Why are Frank McSherry's things so much faster?
- His laptop was faster than GraphX? Timely dataflow was much faster.
- Cost of generality? With simple types, GraphX spends much of its time boxing/unboxing primitive types.
- Serialization is not optimized; Spark is in the process of improving this.
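
One commonly suggested mitigation for the serialization cost is switching Spark to Kryo and registering the GraphX classes; a minimal sketch (this reduces serialization overhead but does not remove the boxing of primitive vertex and edge attributes):

    import org.apache.spark.SparkConf
    import org.apache.spark.graphx.GraphXUtils

    // Use Kryo and register GraphX's internal classes before creating the context.
    val conf = new SparkConf()
      .setAppName("GraphXWithKryo")
      .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
    GraphXUtils.registerKryoClasses(conf)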