Jure Leskovec (@jure) Including joint work with Y. Perez, R. Sosič, A. Banerjee, M. Raison, R. Puttagunta, P. Shah

2 My research group at Stanford: mining and modeling large social and information networks; problems motivated by the Web and online media, at large scale. We work with networks from Facebook, Yahoo, Twitter, LinkedIn, Wikimedia, and StackOverflow. This talk: feedback appreciated!

3 StackOverflow workflow: from the Posts table, select Python questions and select answers; join them into Python Q&A pairs; construct a graph; run the HITS algorithm; join the resulting scores with the Users table to identify experts. A provenance script records the workflow.

5 Graphs are almost never given; they have to be constructed from input data! (Graph construction is part of the modeling/discovery process.) Examples: Facebook graphs (Friend, Communication, Poke, Co-Tag, Co-location, Co-Event); cellphone/email graphs (how many calls?); biology (P2P, gene interaction networks).

6 [Diagram: big data storage, Hadoop MapReduce, relational tables, graph construction operations, graphs and networks, graph analytics.]

7 Design goals: an easy-to-use front-end in a common high-level programming language; fast execution times and interactive (as opposed to batch) use; ability to process large networks with tens of billions of edges; support for several representations and transformations between tables and graphs; a large number of straightforward-to-use graph algorithms; workflow management and reproducibility via provenance.

8 Two observations: (1) most graphs are not that large; (2) big-memory machines are here (4x Intel CPU, 64 cores, 1TB RAM, $35K).
Stanford Large Network Collection (71 graphs):
Number of edges    Number of graphs
<0.1M              16
0.1M-1M            25
1M-10M             17
10M-100M           7
100M-1B            5
>1B                1

9 Design space, two options:
Option 1                                      Option 2
Standard SQL database                         Custom representations
Separate systems for tables and graphs        Integrated system for tables and graphs
Single representation for tables and graphs   Separate table and graph representations
Distributed system                            Single-machine system
Disk-based structures                         In-memory structures

10 The same design space, with Ringo taking Option 2: custom representations; an integrated system for tables and graphs, with separate table and graph representations; a single-machine system; in-memory structures.

12 Ringo workflow: unstructured data → (a) specify entities → relational tables → (b) specify relationships → tabular networks → (c) optimize representation → network representation → (d) perform analytic reasoning and inference → results → (e) integrate the results.

13 Ringo (Python) code for the StackOverflow expert-finding example (the code itself did not survive transcription; an illustrative sketch follows).
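
As a purely illustrative sketch, the slide-3 workflow might look roughly like this in a Ringo-style Python front-end. The ringo module and every method name below (LoadTableTSV, Select, Join, ToGraph, GetHits) are hypothetical stand-ins, not the real Ringo API:

    # Hypothetical sketch only: all module and method names below are
    # illustrative stand-ins, not the actual Ringo API.
    import ringo

    posts = ringo.LoadTableTSV("posts.tsv")  # StackOverflow Posts table

    # Select Python questions and all answers.
    questions = posts.Select("Tag = 'python' AND PostType = 'question'")
    answers = posts.Select("PostType = 'answer'")

    # Join answers to the questions they respond to: Python Q&A pairs.
    qa = questions.Join(answers, LeftOn="Id", RightOn="ParentId")

    # Construct a graph with asker -> answerer edges.
    graph = qa.ToGraph(SrcCol="AskerId", DstCol="AnswererId")

    # Run the HITS algorithm; authority scores indicate expertise.
    hubs, authorities = graph.GetHits()

    # Join scores back to the Users table to rank Python experts.
    users = ringo.LoadTableTSV("users.tsv")
    experts = users.Join(authorities, LeftOn="Id", RightOn="NodeId")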

14 System architecture: a high-level-language user front-end with an interface to the graph processing engine and metadata (provenance) captured as a provenance script; an in-memory graph processing engine containing filters, graph methods, graph containers, graph/table conversions, and table objects; secondary storage underneath.

16 Input data must be manipulated and transformed into graphs. Example edge table (table data structure) and the corresponding graph data structure:
Src  Dst
v1   v2
v2   v3
v3   v4
v1   v3
v1   v4
v2   v1
v3   v4

17 Four ways to create a graph: (1) the data already contains edges as source and destination pairs; (2) nodes are connected based on their similarity; (3) nodes are connected based on a temporal order; (4) nodes are connected based on grouping and aggregation.

18 Use case: in a forum, connect users that post to similar topics. Distance metrics: Euclidean, Haversine, Jaccard. Connect similar nodes via SimJoin: connect two nodes if they are closer than a threshold. Naively this has quadratic complexity; locality-sensitive hashing avoids the all-pairs comparison.
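
To make the idea concrete, here is a minimal plain-Python sketch of a brute-force similarity join (an illustration, not Ringo's SimJoin): users whose topic sets are within a Jaccard-distance threshold get connected, and the all-pairs loop is exactly the quadratic step that locality-sensitive hashing would prune:

    from itertools import combinations

    def jaccard_distance(a, b):
        # 1 - |A intersect B| / |A union B|
        return 1.0 - len(a & b) / len(a | b)

    # Hypothetical input: user -> set of topics they post to.
    user_topics = {
        "alice": {"python", "pandas", "numpy"},
        "bob": {"python", "numpy", "scipy"},
        "carol": {"cooking", "baking"},
    }

    THRESHOLD = 0.6
    edges = []
    # Brute-force SimJoin: O(n^2) pairs; LSH would prune this loop.
    for u, v in combinations(user_topics, 2):
        if jaccard_distance(user_topics[u], user_topics[v]) < THRESHOLD:
            edges.append((u, v))

    print(edges)  # [('alice', 'bob')]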

19 Use case: in a Web log, connect pages in the temporal order in which users clicked them. Connect a node with its successors: events are selected per user and ordered by timestamp, and NextK connects each event to its K successors.
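
A minimal plain-Python sketch of the NextK construction, with assumed (user, timestamp, page) input records (illustrative, not Ringo's implementation):

    from collections import defaultdict

    def next_k_edges(events, k):
        """events: list of (user, timestamp, page); returns (src, dst) edges."""
        by_user = defaultdict(list)
        for user, ts, page in events:
            by_user[user].append((ts, page))
        edges = []
        for user, clicks in by_user.items():
            clicks.sort()  # order each user's events by timestamp
            for i, (_, page) in enumerate(clicks):
                # Connect this page to its next K successors.
                for _, succ in clicks[i + 1 : i + 1 + k]:
                    edges.append((page, succ))
        return edges

    events = [("u1", 1, "home"), ("u1", 2, "search"), ("u1", 3, "cart")]
    print(next_k_edges(events, k=1))  # [('home', 'search'), ('search', 'cart')]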

20 Use case: in a Web log, measure the activity level of different user groups. Edge creation by grouping and aggregation: partition users into groups, identify interactions within each group, compute a score for each group based on its interactions, and treat groups as super-nodes in a graph.
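
A minimal plain-Python sketch of grouping and aggregation with assumed inputs (illustrative, not Ringo's implementation): user-level interactions are collapsed into weighted edges between group super-nodes, and within-group edge counts serve as activity scores:

    from collections import Counter

    # Hypothetical inputs: user -> group, and user-to-user interactions.
    group_of = {"u1": "A", "u2": "A", "u3": "B", "u4": "B"}
    interactions = [("u1", "u2"), ("u1", "u3"), ("u3", "u4"), ("u3", "u4")]

    # Aggregate interactions into weighted edges between group super-nodes.
    super_edges = Counter(
        (group_of[src], group_of[dst]) for src, dst in interactions
    )
    print(super_edges)
    # Counter({('B', 'B'): 2, ('A', 'A'): 1, ('A', 'B'): 1})

    # Activity score per group: count of interactions within the group.
    activity = {g: super_edges[(g, g)] for g in sorted(set(group_of.values()))}
    print(activity)  # {'A': 1, 'B': 2}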

21 Methods (generation, manipulation, analytics) operate on containers (graphs, networks). Several graph containers are supported, along with over 200 graph algorithms (provided via SNAP).

22 Requirements: fast processing (efficient traversal of nodes and edges) and a dynamic structure (quickly add/remove nodes and edges; create subgraphs, dynamic graphs, and so on). How to achieve a good balance between the two?

23 Representation for undirected graphs: a nodes table in which each node stores a sorted vector of its neighbors.

24 Representation for directed graphs: a nodes table in which each node stores sorted vectors of its in-neighbors and out-neighbors.

25 Representation for networks: a nodes table in which each node stores sorted vectors of its in-edges and out-edges, plus an edges table.
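
To make these representations concrete, here is a minimal plain-Python sketch of the directed case (an illustration of the idea, not SNAP's actual C++ layout): keeping each node's neighbor vectors sorted makes edge lookup a binary search and traversal a linear scan:

    import bisect

    class DirectedGraph:
        """Nodes table: node id -> sorted in/out neighbor vectors."""

        def __init__(self):
            self.nodes = {}  # id -> {"in": [...], "out": [...]}

        def add_node(self, nid):
            self.nodes.setdefault(nid, {"in": [], "out": []})

        def add_edge(self, src, dst):
            # Keep vectors sorted so membership tests stay O(log d).
            bisect.insort(self.nodes[src]["out"], dst)
            bisect.insort(self.nodes[dst]["in"], src)

        def has_edge(self, src, dst):
            out = self.nodes[src]["out"]
            i = bisect.bisect_left(out, dst)
            return i < len(out) and out[i] == dst

    g = DirectedGraph()
    for n in (1, 3, 4, 6):
        g.add_node(n)
    g.add_edge(1, 3)
    g.add_edge(1, 6)
    g.add_edge(4, 1)
    print(g.has_edge(1, 6), g.has_edge(6, 1))  # True False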

26 Table objects: a column-based store with persistent row identifiers and lots of tricks (implicit select, string pools). Standard table operations: select, join, project, group, aggregate. Graph construction operations: explicit pairs; distance-based edges (SimJoin); time-based edges (NextK).

27 [Figure: the (Src, Dst) edge table from slide 16 shown as two copies, one sorted by Src and one sorted by Dst, with the merged node-identifier column used by the conversion algorithm on the next slide.]

28 Table-to-graph conversion: make two copies of the (Src, Dst) column pair (parallel); sort the first copy by Src and the second copy by Dst (parallel); merge column Src from the first copy with column Dst from the second copy (sequential); count the number of unique values in the merged column to get the number of nodes in the graph (sequential); allocate the node hash table; for each node, build the vectors of neighbors (parallel).
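
A sequential plain-Python sketch of these steps (illustrative; the real engine runs the copy, sort, and neighbor-building steps in parallel with OpenMP):

    import heapq

    def table_to_graph(src_col, dst_col):
        # Steps 1-2: two copies of the (Src, Dst) pairs, one sorted by Src,
        # one sorted by Dst.
        by_src = sorted(zip(src_col, dst_col))                            # (src, dst) order
        by_dst = sorted(zip(src_col, dst_col), key=lambda e: (e[1], e[0]))  # (dst, src) order

        # Step 3: merge the sorted Src column of the first copy with the
        # sorted Dst column of the second copy.
        merged = heapq.merge((s for s, _ in by_src), (d for _, d in by_dst))

        # Step 4: unique values in the merged column = the node set.
        node_ids = sorted(set(merged))

        # Step 5: allocate the node table.
        nodes = {nid: {"in": [], "out": []} for nid in node_ids}

        # Step 6: build sorted neighbor vectors for each node.
        for s, d in by_src:
            nodes[s]["out"].append(d)  # sorted: by_src is (src, dst)-ordered
        for s, d in by_dst:
            nodes[d]["in"].append(s)   # sorted: by_dst is (dst, src)-ordered
        return nodes

    print(table_to_graph(["v1", "v2"], ["v2", "v3"]))
    # {'v1': {'in': [], 'out': ['v2']}, 'v2': {'in': ['v1'], 'out': ['v3']},
    #  'v3': {'in': ['v2'], 'out': []}}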

31 Implementation: a high-level front end as a Python module, based on Snap.py and using SWIG for the C++ interface; a high-performance graph engine in C++ based on SNAP; multi-core support via OpenMP to parallelize loops, with fast concurrent hash table and vector operations.

32 Benchmark datasets:
Dataset     LiveJournal  Twitter2010
Nodes       4.8M         42M
Edges       69M          1.5B
Text size   1.1GB        26.2GB
Graph size  0.7GB        13.2GB
Table size  1.1GB        23.5GB

33 Runtime comparison (hardware: 4x Intel CPU, 64 cores, 1TB RAM, $35K):
System      PageRank LiveJournal  PageRank Twitter2010  Triangles LiveJournal  Triangles Twitter2010
Giraph      45.6s                 439.3s                N/A                    N/A
GraphX      56.0s                 -                     67.6s                  -
GraphChi    54.0s                 595.3s                66.5s                  -
PowerGraph  27.5s                 251.7s                5.4s                   706.8s
Ringo       2.6s                  72.0s                 13.7s                  284.1s

34 Twitter2010, one iteration of PageRank:
System      Hosts  CPUs/host  Host configuration                     Time
GraphChi    1      4          8-core AMD, 64GB RAM                   158s
TurboGraph  1      1          6-core Intel, 12GB RAM                 30s
Spark       50     2          -                                      97s
GraphX      16     1          8-core Intel, 68GB RAM                 15s
PowerGraph  64     2          8-core hyperthreaded Intel, 23GB RAM   3.6s
Ringo       1      4          20-core hyperthreaded Intel, 1TB RAM   6.0s

35 Table/graph conversions (hardware: 4x Intel CPU, 80 cores, 1TB RAM, $35K):
Conversion      LiveJournal           Twitter2010
Table to graph  8.5s (13.0 MEdges/s)  81.0s (18.0 MEdges/s)
Graph to table  1.5s (46.0 MEdges/s)  29.2s (50.4 MEdges/s)

36 Table operations:
Dataset      Select 10K             Select All-10K         Join 10K              Join All-10K
LiveJournal  <0.2s (405.9 MRows/s)  <0.1s (575.0 MRows/s)  0.6s (109.5 MRows/s)  3.1s (44.5 MRows/s)
Twitter2010  1.6s (935.3 MRows/s)   1.6s (917.7 MRows/s)   4.2s (348.8 MRows/s)  29.7s (98.8 MRows/s)

37 Graph algorithms (LiveJournal, 1 core):
Algorithm                      Runtime
3-core                         31.0s
Single-source shortest path    7.4s
Strongly connected components  18.0s

38 Load and save times:
Operation   LiveJournal  Twitter2010
Load graph  5.2s         76.6s
Save graph  3.5s         69.0s
Load table  6.0s         114.3s
Save table  4.3s         100.2s

39 Big-memory machines are here: 1TB RAM and 100 cores rival a small cluster, with none of the overheads of distributed systems, easy programming, and most useful datasets fitting in memory. Big-memory machines present a viable solution for analysis of all but the largest networks.

40 Summary: Ringo connects relational tables and graphs/networks through graph construction operations and graph analytics. Ringo offers network science and exploration, in-memory graph analytics, and processing of tables and graphs; it is fast and scalable.

41 Get your own 1TB RAM server! And download Ringo/SNAP: http://snap.stanford.edu/snap
