Tegra: Time-evolving Graph Processing on Commodity Clusters


Tegra: Time-evolving Graph Processing on Commodity Clusters. SIM SE 17, 2 March 2017. Anand Iyer, Qifan Pu, Joseph Gonzalez, Ion Stoica

1 Graphs are everywhere: Social Networks

2 Graphs are everywhere: Gnutella network subgraph

3 Graphs are everywhere

4 Graphs are everywhere: Metabolic network of a single-cell organism (Tuberculosis)

5 Plenty of interest in processing them. Graph DBMSs: in 25% of all enterprises by end of 2017 (Forrester Research). Many open-source and research prototypes of distributed graph processing frameworks: Giraph, Pregel, GraphLab, GraphX, ...

6 Real-world Graphs are Dynamic: Earthquake Occurrence Density

7 Real-world Graphs are Dynamic

8 Real-world Graphs are Dynamic

9 Processing Time-evolving Graphs. Many interesting business and research insights are possible by processing such dynamic graphs, yet there is little or no support for such workloads in existing big-data graph-processing frameworks.

10 Challenge #1: Storage. Across snapshots G1, G2, G3 over time, graph entities are stored redundantly.

11 Challenge #2: Computation. Computing results R1, R2, R3 on snapshots G1, G2, G3 repeats work: computation is wasted across snapshots.

12 Challenge #3: Communication. Duplicate messages are sent over the network when processing snapshots G1, G2, G3.

13 How do we process time-evolving, dynamically changing graphs efficiently? Share Storage, Communication, Computation. Tegra

14 How do we process time-evolving, dynamically changing graphs efficiently? Share Storage, Communication, Computation. Tegra

15 Sharing Storage. Storing deltas (δG1, δG2, δG3) between snapshots G1, G2, G3 gives the most compact storage, but creating a snapshot from deltas can be expensive!

16 Better Storage Solution: use a persistent data structure. Store snapshots (Snapshot 1 at t1, Snapshot 2 at t2) in Persistent Adaptive Radix Trees (PART).
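Structural sharing is what makes a persistent data structure beat raw deltas here: creating a new snapshot copies only the path it modifies and shares everything else with older snapshots. A minimal sketch of this idea, using a toy path-copying search tree rather than the actual PART implementation:

```python
# Toy persistent map via path copying. Illustrative only: PART applies the
# same structural-sharing idea to an adaptive radix trie, not to this tree.

class Node:
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value, left=None, right=None):
        self.key, self.value, self.left, self.right = key, value, left, right

def insert(root, key, value):
    """Return a NEW root; only the path to `key` is copied, the rest is shared."""
    if root is None:
        return Node(key, value)
    if key == root.key:
        return Node(key, value, root.left, root.right)
    if key < root.key:
        return Node(root.key, root.value, insert(root.left, key, value), root.right)
    return Node(root.key, root.value, root.left, insert(root.right, key, value))

def lookup(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None

# Two "snapshots": snapshot2 updates one vertex property (vertex 3's value)
# without copying snapshot1.
snapshot1 = None
for vertex, degree in [(2, 10), (1, 3), (3, 7)]:
    snapshot1 = insert(snapshot1, vertex, degree)
snapshot2 = insert(snapshot1, 3, 8)

assert lookup(snapshot1, 3) == 7          # old snapshot is unchanged
assert lookup(snapshot2, 3) == 8          # new snapshot sees the update
assert snapshot1.left is snapshot2.left   # untouched subtree is shared
```

This is why snapshot creation stays cheap while lookups remain direct, avoiding the delta-replay cost from the previous slide.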

17 Graph Snapshot Index: snapshot ID management over vertex and edge partitions (Snapshot 1 at t1, Snapshot 2 at t2). Shares structure between snapshots and enables efficient operations.

18 How do we process time-evolving, dynamically changing graphs efficiently? Share Storage, Communication, Computation. Tegra

19 Graph-Parallel Abstraction: GAS. Gather: accumulate information from the neighborhood. Apply: apply the accumulated value. Scatter: update adjacent edges & vertices with the updated value.
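As a concrete single-machine illustration of the GAS phases above, here is one degree-normalized PageRank-style superstep; the function and variable names are illustrative, not Tegra's or GraphX's actual API:

```python
# One GAS (Gather-Apply-Scatter) superstep, sketched for a PageRank-like
# computation on a tiny in-memory graph. Illustrative only.

def gas_iteration(edges, out_degree, rank, damping=0.85):
    # Gather: each vertex accumulates rank mass from its in-neighbors.
    acc = {v: 0.0 for v in rank}
    for src, dst in edges:
        acc[dst] += rank[src] / out_degree[src]
    # Apply: combine the accumulated value into the vertex state.
    new_rank = {v: (1 - damping) + damping * acc[v] for v in rank}
    # Scatter: in a distributed system, changed vertices would notify their
    # adjacent edges; here we just record which vertices changed noticeably.
    changed = {v for v in rank if abs(new_rank[v] - rank[v]) > 1e-6}
    return new_rank, changed

edges = [(1, 2), (2, 3), (3, 1), (1, 3)]
out_degree = {1: 2, 2: 1, 3: 1}
rank = {1: 1.0, 2: 1.0, 3: 1.0}
for _ in range(10):
    rank, changed = gas_iteration(edges, out_degree, rank)
```

The `changed` set produced by the Scatter phase is exactly the hook the later incremental-computation slides rely on: vertices outside it need no further work.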

20 Processing Multiple Snapshots (G1, G2, G3 over time): for (snapshot in snapshots) { for (stage in graph-parallel-computation) { ... } }

21 Reducing Redundant Messages: invert the loop order so each step runs across all snapshots: for (step in graph-parallel-computation) { for (snapshot in snapshots) { ... } }. Can potentially avoid a large number of redundant messages.
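A toy model of the savings (assumed semantics, not Tegra's actual message format): when the outer loop is over computation steps, one step's messages can be generated across all snapshots at once and deduplicated on edges the snapshots share:

```python
# Counting messages per superstep, per snapshot vs. deduplicated across
# snapshots that share most of their edges. Illustrative only.

def messages_for(snapshot_edges, rank):
    """One superstep's messages as (src, dst, payload) triples."""
    return {(src, dst, rank[src]) for (src, dst) in snapshot_edges}

g1 = {(1, 2), (2, 3)}
g2 = g1 | {(3, 4)}            # snapshot 2 adds one edge
g3 = g2 | {(4, 1)}            # snapshot 3 adds another
rank = {1: 1.0, 2: 1.0, 3: 1.0, 4: 1.0}

# Snapshot-by-snapshot: every snapshot re-sends messages on shared edges.
naive = sum(len(messages_for(g, rank)) for g in (g1, g2, g3))
# Step-by-step across snapshots: identical messages are sent once.
shared = len(messages_for(g1, rank) | messages_for(g2, rank)
             | messages_for(g3, rank))

# naive = 2 + 3 + 4 = 9 messages; shared = 4 distinct messages.
```

The gap widens with more snapshots, since each extra snapshot adds only its delta's worth of new messages.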

22 How do we process time-evolving, dynamically changing graphs efficiently? Share Storage, Communication, Computation. Tegra

23 Updating Results. If results from a previous snapshot are available, how can we reuse them? Three approaches in the past: restart the algorithm (redundant computations); memoization (GraphInc 1) (too much state); operator-wise state (Naiad 2,3) (too much overhead; fault tolerance issues). 1 Facilitating real-time graph mining, CloudDB '12. 2 Naiad: a timely dataflow system, SOSP '13. 3 Differential dataflow, CIDR '13.

24 Key Idea: leverage how the GAS model executes computation. Each iteration of GAS modifies the graph a little, so the sequence of iterations can itself be seen as another time-evolving graph! Upon a change to the graph: mark the parts of the graph that changed; in every iteration, expand the marked parts to cover the regions needing recomputation; borrow results from the parts that did not change.
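The mark-and-expand step can be sketched as a frontier that grows outward from the changed vertices, one hop per iteration (illustrative only; Tegra's actual mechanism works through its snapshot index):

```python
# Which vertices must be recomputed at each iteration after a change?
# Everything outside the frontier borrows the previous snapshot's result.

def neighbors(edges, vertices):
    return {dst for (src, dst) in edges if src in vertices}

def incremental_iterations(edges, changed, num_iters):
    """Return the set of vertices recomputed at each iteration."""
    recompute_per_iter = []
    frontier = set(changed)                               # 1. mark changed parts
    for _ in range(num_iters):
        recompute_per_iter.append(frontier)
        frontier = frontier | neighbors(edges, frontier)  # 2. expand by one hop
    return recompute_per_iter                             # 3. rest is borrowed

edges = [(1, 2), (2, 3), (3, 4), (4, 5)]
plan = incremental_iterations(edges, changed={1}, num_iters=3)
# Iteration 1 touches {1}, iteration 2 touches {1, 2}, iteration 3 {1, 2, 3};
# vertices 4 and 5 reuse results from the previous snapshot throughout.
```

When only a small fraction of the graph changes, the frontier stays far smaller than the full vertex set for many iterations, which is where the savings on the following slides come from.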

25 Incremental Computation. Snapshots evolve along the time axis while iterations evolve each snapshot (G1^0, G1^1, G1^2; G2^0, G2^1, G2^2). Larger graphs and more iterations can yield significant improvements.

26 Implementation & Evaluation. Implemented on Spark 2.0: extended DataFrames with versioning information and an iterate operator; extended the GraphX API to allow computation on multiple snapshots. Preliminary evaluation on two real-world graphs: Twitter (41,652,230 vertices, 1,468,365,182 edges) and uk-2007 (105,896,555 vertices, 3,738,733,648 edges).

27 Benefits of Storage Sharing. [Chart: storage reduction vs. number of snapshots (1-20); datastructure overheads at few snapshots, significant improvements with more snapshots.]

28 Benefits of Sharing Communication. [Chart: time (s) vs. number of snapshots (1-20), GraphX vs. Tegra.]

29 Benefits of Incremental Computing. [Chart: computation time (s) vs. snapshot ID (0-20), incremental vs. full computation.] Only 5% of the graph is modified in every snapshot; processing only the modified part gives a 50x reduction.

30 Summary & Future Work. Processing time-evolving graphs efficiently can be useful. Sharing storage, computation, and communication is key to efficient time-evolving graph analysis. Future work: code release, incremental pattern matching, approximate graph analytics, geo-distributed graph analytics. api@cs.berkeley.edu, www.cs.berkeley.edu/~api