hyperx: scalable hypergraph processing

1 hyperx: scalable hypergraph processing
Jin Huang
November 15, 2015
The University of Melbourne

2 overview
- Research Outline
- Scalable Hypergraph Processing
- Problem and Challenge
- Idea
- Solution
- Implementation
- Empirical Results
- Conclusion

3 research outline

4 scalable hypergraph processing

5 problem context
Any (high-order) relationship with more than 2 participants.
Figure 1: A few high-order relationships

6 representative existing hypergraph studies
Table 1: Various hypergraph learning studies in literature

Application       Study                     Vertex           Hyperedge
Recommendation    [TMCCA 13]                Songs and users  Listening histories
Text retrieval    [SIGIR 08]                Documents        Semantic similarities
Image retrieval   [Pattern Recognition 13]  Images           Descriptor similarities
Multimedia        [Multimedia 08]           Videos           Hyperlinks
Bioinformatics    [ICDM 13]                 Proteins         Interactions
Social mining     [AAAI 14]                 Users            Communities
Machine learning  [Signal Processing 14]    Data records     Labels

7 existing solution
Converting to a graph!
- Option I: a bipartite graph
- Option II: a clique
Figure 2: Graph conversion inflates the problem size
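To make the inflation concrete, here is a minimal plain-Python sketch (not HyperX or Spark code) that counts vertices and edges after each conversion. The function names and the toy hypergraph are illustrative assumptions, not taken from the slides.

```python
from itertools import combinations

def star_expansion(hyperedges):
    """Bipartite conversion: every hyperedge becomes an extra vertex,
    linked by one edge to each of its member vertices."""
    vertices = {v for members in hyperedges.values() for v in members}
    edges = [(h, v) for h, members in hyperedges.items() for v in members]
    return len(vertices) + len(hyperedges), len(edges)

def clique_expansion(hyperedges):
    """Clique conversion: every pair of vertices sharing a hyperedge gets an edge."""
    vertices = {v for members in hyperedges.values() for v in members}
    edges = set()
    for members in hyperedges.values():
        edges.update(combinations(sorted(members), 2))
    return len(vertices), len(edges)

# Hypothetical toy hypergraph: hyperedge id -> member vertices.
toy = {"h1": {1, 2, 3, 4}, "h2": {3, 4, 5}, "h3": {1, 5}}
star_size = star_expansion(toy)      # (vertices, edges) of the bipartite graph
clique_size = clique_expansion(toy)  # (vertices, edges) of the clique graph
```

Even on 5 vertices and 3 hyperedges both conversions need 9 edges; the clique edge count grows quadratically with hyperedge arity.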

8-10 challenges i
Scalable graph frameworks: GraphLab, Giraph, GraphX, etc.
- synchronous BSP (Pregel)
- vertex-centric style
- vertex replication and aggregation
Figure 3: Vertex replicas to reduce network communication
Inflated size: 2M V and 15M H -> 17M V and 1B E
Excessive replication: replicating both V and H

11 challenges ii
Difficulty in load balance: two causes
1. V and H not active simultaneously
2. double overhead in each iteration
Figure 3: Two issues in balancing the loads

12 idea
- To support (API): random walks, label propagation, spectral
- Inflated size (Representation): a distributed hypergraph
- Excessive replication (Representation): replicate only V
- Difficulty in load balance (Partitioning): an optimization that
  - minimizes the communication cost
  - minimizes the replication cost
  - balances both V and H loads

13 proposed solution: hyperx
Figure 4: An overview of HyperX implemented over Spark

14 details: apis
Algorithms are expressed as
- vprog: updates vertex values given incident hyperedges
- hprog: updates hyperedge values given incident vertices

Table 2: HyperX Main APIs

Name         Usage
joinV        vprog as distributed joins
mrTuples     hprog on hyperedges and reduce vertices
mapV         update vertices independently (locally)
mapH         update hyperedges independently (locally)
subH         restrict computation over a sub-hypergraph
HyperPregel  iteratively execute mrTuples and joinV

15 details: hyperpregel implementation
Algorithm 1: HyperPregel
  input : G: Hypergraph[V, H], vprog: (Id, V, M) → V, hprog: Tuple → M, combine: (M, M) → M, initial: M
  output: RDD[(Id, V)]
  G ← G.mapV((id, v) ⇒ vprog(id, v, initial))
  msg ← G.mrTuples(hprog, combine)
  while |msg| > 0 do
      G ← G.joinV(msg)(vprog).subH(v, t)
      msg ← G.mrTuples(hprog, combine)
  return G.vertices
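The HyperPregel loop above can be sketched on a single machine in plain Python. This is only an illustration under stated assumptions: dictionaries stand in for RDDs, `hprog` receives `(src, dst, vertices)` rather than HyperX's tuple type, the `subH` restriction is omitted, and `max_iters` is a safety guard added here. The max-propagation program used as the example is hypothetical.

```python
def mr_tuples(vertices, hyperedges, hprog, combine):
    """Run hprog on every hyperedge and combine messages per destination vertex."""
    inbox = {}
    for src, dst in hyperedges:
        for vid, m in hprog(src, dst, vertices).items():
            inbox[vid] = combine(inbox[vid], m) if vid in inbox else m
    return inbox

def hyper_pregel(vertices, hyperedges, vprog, hprog, combine, initial, max_iters=100):
    # Initial step: apply vprog to every vertex with the initial message.
    vertices = {vid: vprog(vid, v, initial) for vid, v in vertices.items()}
    msg = mr_tuples(vertices, hyperedges, hprog, combine)
    iters = 0
    while msg and iters < max_iters:
        # Only vertices that received a message are updated (the join step).
        vertices = {vid: vprog(vid, v, msg[vid]) if vid in msg else v
                    for vid, v in vertices.items()}
        msg = mr_tuples(vertices, hyperedges, hprog, combine)
        iters += 1
    return vertices

# Example program: spread the largest value through each hyperedge.
def max_vprog(vid, value, message):
    return max(value, message)

def max_hprog(src, dst, values):
    best = max(values[s] for s in src)
    # Send only to vertices that would actually change, so the loop terminates.
    return {d: best for d in dst if values[d] < best}

toy_vertices = {1: 1, 2: 2, 3: 3, 4: 4}
toy_hyperedges = [((1, 2, 3), (1, 2, 3)), ((3, 4), (3, 4))]  # (sources, destinations)
result = hyper_pregel(toy_vertices, toy_hyperedges, max_vprog, max_hprog,
                      max, float("-inf"))
```

The loop stops once no hyperedge produces a message, which mirrors the `|msg| > 0` condition.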

16 details: random walks with apis
Algorithm 2: Random Walks (RW) with restart
  input : G, label vertex set L, restart probability rp
  output: RDD[(Id, Double)]
  vprog(id, (v, d), msg) = ((1 − rp) · msg + rp · v, d)
  hprog(S, D, Sd, Dd, h) = $\sum_{i \in S} \frac{S_i}{Sd_i \cdot |D|}$
  combine(a, b) = a + b
  G ← G.joinV(G.outDeg, (id, v, d) ⇒ d)
  G ← G.mapV((id, v) ⇒ if id ∈ L (1.0, v) else (0.0, v))
  G.HyperPregel(G, vprog, hprog, combine, 0)
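A small plain-Python sketch of the random walk with restart may help in reading the three functions. It rests on several assumptions that the slide does not state: hyperedges are undirected (sources and destinations are the same member set), a vertex's degree is its number of incident hyperedges, `hprog` is read as the sum of value/degree over the members divided by the member count, and a fixed iteration count replaces the message-based stopping rule. `vprog` is transcribed literally from the slide.

```python
def random_walk_with_restart(hyperedges, labeled, rp, iterations):
    # Degree = number of incident hyperedges (stands in for G.outDeg).
    degree = {}
    for members in hyperedges:
        for v in members:
            degree[v] = degree.get(v, 0) + 1

    # Seed labeled vertices with 1.0, all others with 0.0; keep the degree alongside.
    state = {v: (1.0 if v in labeled else 0.0, d) for v, d in degree.items()}

    def vprog(vid, value, msg):
        v, d = value
        return ((1 - rp) * msg + rp * v, d)

    def hprog(members):
        # Each member contributes value / degree; the total is split over the members.
        total = sum(state[i][0] / state[i][1] for i in members)
        return total / len(members)

    # Initial application of vprog with message 0.
    state = {vid: vprog(vid, val, 0.0) for vid, val in state.items()}
    for _ in range(iterations):
        inbox = {}
        for members in hyperedges:
            m = hprog(members)
            for v in members:
                inbox[v] = inbox.get(v, 0.0) + m  # combine(a, b) = a + b
        state = {vid: vprog(vid, val, inbox.get(vid, 0.0))
                 for vid, val in state.items()}
    return {vid: v for vid, (v, d) in state.items()}

# Hypothetical toy input: two hyperedges, vertex 1 is the only labeled vertex.
scores = random_walk_with_restart([[1, 2], [2, 3]], labeled={1}, rp=0.5, iterations=1)
```

After one iteration the labeled vertex keeps the highest score and its direct neighbour in the shared hyperedge receives part of it.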

17-18 details: representation
Built on Spark's RDD, how to represent a hypergraph?
- Vertices: vrdd
- Hyperedges: hrdd
  - multiple vertices per hyperedge: instead of a list or set, flattened (vid, hid, issrc) in columnar arrays
  - saves 41% to 88% memory consumption
To do mrTuples locally, replicate vertices
- One replica is adequate
- Cost in distributed vprog
- Cost in updating replicas
- Cost in storing replicas
How to partition vrdd and hrdd to minimize the cost?
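The flattened layout can be illustrated with a short plain-Python sketch. It assumes three parallel typed arrays, one entry per (vertex, hyperedge) incidence; the function names and the toy hyperedges are hypothetical, and the actual HyperX storage inside an RDD partition may differ.

```python
from array import array

def flatten(hyperedges):
    """Turn {hid: (sources, destinations)} into three parallel columns."""
    vid, hid, is_src = array("q"), array("q"), array("b")
    for h, (src, dst) in hyperedges.items():
        for v in src:
            vid.append(v); hid.append(h); is_src.append(1)
        for v in dst:
            vid.append(v); hid.append(h); is_src.append(0)
    return vid, hid, is_src

def members(columns, h):
    """Recover the source and destination vertices of hyperedge h by scanning."""
    vid, hid, is_src = columns
    src = [v for v, hh, s in zip(vid, hid, is_src) if hh == h and s]
    dst = [v for v, hh, s in zip(vid, hid, is_src) if hh == h and not s]
    return src, dst

# Hypothetical toy input: hyperedge id -> (source vertices, destination vertices).
toy = {10: ([1, 2], [3]), 11: ([3], [1, 4])}
columns = flatten(toy)
```

Storing primitives in flat columns avoids one collection object per hyperedge, which is the kind of overhead the memory saving on the slide refers to.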

19 details: partitioning introduction
- Different from vertex-cut or edge-cut in graph literature
- Cuts both vertices and hyperedges simultaneously
- Minimizes the vertex replicas (with local aggregation)
- With separate load constraints on vprog and hprog

20-21 details: partitioning objective formulation
n vertices, m hyperedges, k workers, $a_h$ the arity of $h$

Number of replicas for vertex $u$:
$$R(x_u, y) = \sum_{i=1}^{k} \max\Big(1 - x_{u,i} - \prod_{h \in N(u)} (1 - y_{h,i}),\ 0\Big)$$

To optimize:
$$\text{minimize} \quad \sum_{u \in V} R(x_u, y)$$
$$\text{subject to} \quad \sum_{h \in H} y_{h,i}\, a_h \le (1 + \alpha)\, \frac{\sum_{h \in H} a_h}{k}, \quad \forall i \in \{1, 2, \ldots, k\}$$
$$\sum_{u \in V} x_{u,i}\, R(x_u, y) \le (1 + \beta)\, \frac{\sum_{u \in V} R(x_u, y)}{k}, \quad \forall i \in \{1, 2, \ldots, k\}$$
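The replica count can be evaluated directly for a given assignment. The sketch below is plain Python and assumes that $x_{u,i}$ and $y_{h,i}$ are 0/1 indicators of vertex $u$ and hyperedge $h$ being placed on worker $i$, which is how the formula reads but is not spelled out on the slide; the function names and toy assignment are hypothetical.

```python
def replicas(u, vertex_part, hyperedge_part, incident, k):
    """R(x_u, y): number of workers that need a replica of vertex u."""
    total = 0
    for i in range(k):
        x_ui = 1 if vertex_part[u] == i else 0
        prod = 1
        for h in incident[u]:
            y_hi = 1 if hyperedge_part[h] == i else 0
            prod *= 1 - y_hi  # stays 1 only if no incident hyperedge is on worker i
        total += max(1 - x_ui - prod, 0)
    return total

def hyperedge_loads_ok(hyperedge_part, arity, k, alpha):
    """Check the hyperedge load constraint for every worker."""
    loads = [0] * k
    for h, i in hyperedge_part.items():
        loads[i] += arity[h]
    bound = (1 + alpha) * sum(arity.values()) / k
    return all(load <= bound for load in loads)

# Hypothetical toy assignment on k = 2 workers.
vertex_part = {1: 0, 2: 0, 3: 1}
hyperedge_part = {"a": 0, "b": 1}
incident = {1: ["a"], 2: ["a", "b"], 3: ["b"]}  # N(u)
arity = {"a": 2, "b": 2}

total_replicas = sum(replicas(u, vertex_part, hyperedge_part, incident, 2)
                     for u in vertex_part)
balanced = hyperedge_loads_ok(hyperedge_part, arity, 2, alpha=0.0)
```

Only vertex 2 needs a replica here: it lives on worker 0 but one of its hyperedges is on worker 1.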

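22 details: partitioning theoretic analysis
How hard? A special case where $\alpha = 0$ and $\beta = +\infty$:
$$\text{minimize} \quad \sum_{u \in V} \sum_{i=1}^{k} \Big(1 - \prod_{h \in N(u)} (1 - y_{h,i})\Big)$$
$$\text{subject to} \quad \sum_{h \in H} y_{h,i}\, a_h \le \frac{\sum_{h \in H} a_h}{k}, \quad \forall i \in \{1, 2, \ldots, k\}$$
- reduction from the strongly NP-complete 3-Partition
- no polynomial-time solution with a finite approximation factor
- in plain words, it is extremely hard!
- how about $\alpha > 0$?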
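One way to see why the $\alpha = 0$ case is so rigid is the short derivation below. It assumes that every hyperedge is placed on exactly one worker, that is $\sum_{i} y_{h,i} = 1$, which the slide does not state explicitly.

```latex
\sum_{i=1}^{k} \sum_{h \in H} y_{h,i}\, a_h
  = \sum_{h \in H} a_h \sum_{i=1}^{k} y_{h,i}
  = \sum_{h \in H} a_h
\quad\Longrightarrow\quad
\sum_{h \in H} y_{h,i}\, a_h = \frac{1}{k} \sum_{h \in H} a_h
\quad \forall i \in \{1, 2, \ldots, k\}
```

The k worker loads add up to the total arity while each is bounded by the average, so every bound must hold with equality. The hyperedges therefore have to be split into k groups of exactly equal total arity, which is the kind of exact-sum requirement that 3-Partition encodes.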

23-24 details: partitioning practical solutions
Label propagation partitioning (LPP)
- labels are partitions
- label both vertices and hyperedges
- iteratively update labels

Specifically,
$$L(h) = \arg\max_{i \in K} \big|\{v \mid v \in N(h) \wedge L(v) = i\}\big|$$
$$L(v) = \arg\max_{i \in K} \Big( \big|\{h \mid h \in N(v) \wedge L(h) = i\}\big| \cdot e^{\frac{\bar{A}^2 - A_i^2}{\bar{A}^2}} \Big)$$
where $A_i = \sum_{L(h) = i} a_h$.
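The two update rules can be sketched in plain Python as follows. Several choices here are assumptions rather than the HyperX implementation: $\bar{A}$ is taken to be the mean of the $A_i$ over the k partitions, ties go to the smallest partition index, vertex labels are updated synchronously, and the initial labels and iteration count are supplied by the caller.

```python
import math

def lpp(hyperedges, vertex_labels, k, iterations=5):
    """hyperedges: {hid: [vids]}; vertex_labels: initial {vid: partition index}."""
    incident = {}
    for h, members in hyperedges.items():
        for v in members:
            incident.setdefault(v, []).append(h)

    v_label = dict(vertex_labels)
    h_label = {}
    for _ in range(iterations):
        # Hyperedge step: adopt the most common label among member vertices.
        for h, members in hyperedges.items():
            counts = [0] * k
            for v in members:
                counts[v_label[v]] += 1
            h_label[h] = max(range(k), key=lambda i: counts[i])

        # A_i: total arity of the hyperedges currently labeled i.
        A = [0] * k
        for h, members in hyperedges.items():
            A[h_label[h]] += len(members)
        mean_A = sum(A) / k  # assumed meaning of A-bar

        # Vertex step: most common incident label, damped for overloaded partitions.
        updated = {}
        for v, hs in incident.items():
            counts = [0] * k
            for h in hs:
                counts[h_label[h]] += 1
            updated[v] = max(
                range(k),
                key=lambda i: counts[i] * math.exp((mean_A ** 2 - A[i] ** 2) / mean_A ** 2),
            )
        v_label.update(updated)
    return v_label, h_label

# Hypothetical toy input with k = 2 partitions.
toy_hyperedges = {"a": [1, 2, 3], "b": [3, 4, 5]}
initial = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
v_label, h_label = lpp(toy_hyperedges, initial, k=2)
```

On this input the shared vertex 3 moves to partition 0 in the first round and the labels stay fixed afterwards.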

25 experimental settings
Metrics
- data RDD size
- data shuffled
- elapsed time
Comparisons
- HyperX (hx), Bipartite (star), Clique (clique)
- random, greedy, aweto, hmetis, LPP
- random walk (RW), label propagation (LP), spectral (SP)
Environment
- 8 nodes, 28 workers, network 600Mbps
- Hadoop 2.4.0, YARN enabled, Spark
- HyperX implemented in Scala

26 datasets
Table 3: Datasets presented in the empirical study
Columns: Dataset, n, m, d_min, d_max, d̄, σ_d, cv_d, a_min, a_max, ā, σ_a, cv_a (only n and m are legible in this transcription; the additional synthetic rows are not recoverable)

Dataset                       n     m
Medline Coauthor (Med)        3.2m  8m
Orkut Communities (Ork)       2.3m  15m
Friendster Communities (Fri)  7.9m  1.6m
Synthetic (Zipfian s = 2)     2m    8m

27 evaluating hypergraph representation: space
Figure 5: Memory consumption of data RDDs (data RDD size in GB for vertices and hyperedges; hx, clique, and star on MedRW, MedLP, OrkRW, OrkLP, FriRW, FriLP)
HyperX consumes 44% to 77% less memory than Bipartite.

28 evaluating hypergraph representation: communication
Figure 6: Data shuffled on the network (GB read and written at the 5th iteration; star and hx on MedRW, MedLP, OrkRW, OrkLP, FriRW, FriLP)
HyperX shuffles 19% to 98% less data than Bipartite.

29 evaluating hypergraph representation: time
Figure 7: Elapsed time (seconds per 10 iterations; hx and star on MedRW, MedLP, OrkRW, OrkLP, FriRW, FriLP)
HyperX is up to 49.1 times faster than Bipartite.

30 evaluating partitioning effectiveness: replica factor
Figure 8: Different partitioning algorithms, replication factor (replica factor on Med, Ork, Fri for random, aweto, greedy, hmetis5, hmetis1, lpp)
HyperX produces 1.1 to 1.9 times more replicas than hmetis.

31 evaluating partitioning effectiveness: load balance
Figure 9: Different partitioning algorithms, load balance (workload CoV for arity and replica loads on Med, Ork, Fri; random, aweto, greedy, hmetis5, hmetis1, lpp)
LPP produces 1.1 to 37.7 times more balanced loads than hmetis.
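The balance metric can be computed with a few lines of plain Python. This sketch assumes that "CoV" means the coefficient of variation (population standard deviation divided by the mean) of per-worker loads; the slide does not define it, and the sample loads are made up for illustration.

```python
from statistics import mean, pstdev

def workload_cov(loads):
    """Coefficient of variation of per-worker loads; 0.0 means perfectly balanced."""
    avg = mean(loads)
    if avg == 0:
        return 0.0
    return pstdev(loads) / avg

# Hypothetical per-worker loads (for example, summed hyperedge arities).
even = workload_cov([10, 10, 10, 10])
skewed = workload_cov([5, 15])
```

A lower value means the workers carry more similar loads, which is the sense in which the figure compares partitioners.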

32 evaluating partitioning effectiveness: space
Figure 10: Different partitioning algorithms on Orkut, space (data RDD size in GB for vertices and hyperedges; lpp, hmetis1, hmetis5, greedy, aweto, random on RW, LP, SP)
LPP and hmetis both outperform simplistic methods.

33 evaluating partitioning effectiveness: communication
Figure 11: Different partitioning algorithms on Orkut, communication (MB read and written at the 5th iteration; greedy, aweto, random, lpp, hmetis1, hmetis5 on RW, LP, SP)
LPP and hmetis both significantly outperform simplistic methods.

34 evaluating partitioning effectiveness: time
Figure 12: Different partitioning algorithms, time (elapsed seconds per 10 iterations; random, aweto, greedy, hmetis5, hmetis1, lpp on MedRW, MedLP, MedSP, OrkRW, OrkLP, OrkSP)
LPP achieves up to a 2.6 times speedup over hmetis.

35 evaluating partitioning efficiency
LPP in Scala, run on JVM; hmetis in C
Table 4: Partitioning time of different algorithms

Dataset  Algorithm  Time t (s)  w.r.t. LPP
Med      LPP
Med      hmetis5    14,
Ork      LPP
Ork      hmetis5    88,
Fri      LPP
Fri      hmetis5    6,

36 evaluating learning algorithms: dataset cardinality
Figure 13: Elapsed time running algorithms on varying dataset cardinality, synthetic (elapsed seconds per 5 iterations against number of hyperedges, up to 24M; RW, LP, SP)

37 evaluating learning algorithms: number of workers
Figure 14: Elapsed time running algorithms on varying number of workers, Orkut (elapsed seconds per 10 iterations against number of workers; RW, LP, SP)

38 optional evaluating lpp: time and replicas
Figure 15: Elapsed time and replication factor (elapsed seconds and replica factor per iteration; Med and Ork)
It only takes LPP a few iterations to achieve a reasonable replication ratio.

39 optional evaluating lpp: load balance
Figure 16: Elapsed time and replication factor (workload CoV per iteration; MedReplica, MedArity, OrkReplica, OrkArity)
It only takes LPP a few iterations to achieve reasonable load balance.

40 conclusion
Problem
- Scalable hypergraph learning
Challenges
1. Inflated problem size
2. Excessive replication
3. Great difficulty in balancing the loads
Solutions
1. Operate on a distributed hypergraph
2. Replicate only vertices
3. Partitioning optimization
Contribution
- Efficient and scalable hypergraph framework
- Effective and efficient partitioning algorithm

41 Thanks! Any questions or comments?


More information

Introduction to Spark

Introduction to Spark Introduction to Spark Outlines A brief history of Spark Programming with RDDs Transformations Actions A brief history Limitations of MapReduce MapReduce use cases showed two major limitations: Difficulty

More information

2/4/2019 Week 3- A Sangmi Lee Pallickara

2/4/2019 Week 3- A Sangmi Lee Pallickara Week 3-A-0 2/4/2019 Colorado State University, Spring 2019 Week 3-A-1 CS535 BIG DATA FAQs PART A. BIG DATA TECHNOLOGY 3. DISTRIBUTED COMPUTING MODELS FOR SCALABLE BATCH COMPUTING SECTION 1: MAPREDUCE PA1

More information

GraphHP: A Hybrid Platform for Iterative Graph Processing

GraphHP: A Hybrid Platform for Iterative Graph Processing GraphHP: A Hybrid Platform for Iterative Graph Processing Qun Chen, Song Bai, Zhanhuai Li, Zhiying Gou, Bo Suo and Wei Pan Northwestern Polytechnical University Xi an, China {chenbenben, baisong, lizhh,

More information

SCALABLE LOCAL COMMUNITY DETECTION WITH MAPREDUCE FOR LARGE NETWORKS

SCALABLE LOCAL COMMUNITY DETECTION WITH MAPREDUCE FOR LARGE NETWORKS SCALABLE LOCAL COMMUNITY DETECTION WITH MAPREDUCE FOR LARGE NETWORKS Ren Wang, Andong Wang, Talat Iqbal Syed and Osmar R. Zaïane Department of Computing Science, University of Alberta, Canada ABSTRACT

More information

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics

IBM Data Science Experience White paper. SparkR. Transforming R into a tool for big data analytics IBM Data Science Experience White paper R Transforming R into a tool for big data analytics 2 R Executive summary This white paper introduces R, a package for the R statistical programming language that

More information

Giraph: Large-scale graph processing infrastructure on Hadoop. Qu Zhi

Giraph: Large-scale graph processing infrastructure on Hadoop. Qu Zhi Giraph: Large-scale graph processing infrastructure on Hadoop Qu Zhi Why scalable graph processing? Web and social graphs are at immense scale and continuing to grow In 2008, Google estimated the number

More information

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018

Cloud Computing 2. CSCI 4850/5850 High-Performance Computing Spring 2018 Cloud Computing 2 CSCI 4850/5850 High-Performance Computing Spring 2018 Tae-Hyuk (Ted) Ahn Department of Computer Science Program of Bioinformatics and Computational Biology Saint Louis University Learning

More information

Apache Spark 2.0. Matei

Apache Spark 2.0. Matei Apache Spark 2.0 Matei Zaharia @matei_zaharia What is Apache Spark? Open source data processing engine for clusters Generalizes MapReduce model Rich set of APIs and libraries In Scala, Java, Python and

More information

AN EXPERIMENTAL COMPARISON OF PARTITIONING STRATEGIES IN DISTRIBUTED GRAPH PROCESSING SHIV VERMA THESIS

AN EXPERIMENTAL COMPARISON OF PARTITIONING STRATEGIES IN DISTRIBUTED GRAPH PROCESSING SHIV VERMA THESIS c 2017 Shiv Verma AN EXPERIMENTAL COMPARISON OF PARTITIONING STRATEGIES IN DISTRIBUTED GRAPH PROCESSING BY SHIV VERMA THESIS Submitted in partial fulfillment of the requirements for the degree of Master

More information

Adaptive Cluster Computing using JavaSpaces

Adaptive Cluster Computing using JavaSpaces Adaptive Cluster Computing using JavaSpaces Jyoti Batheja and Manish Parashar The Applied Software Systems Lab. ECE Department, Rutgers University Outline Background Introduction Related Work Summary of

More information

Experimental Analysis of Distributed Graph Systems

Experimental Analysis of Distributed Graph Systems Experimental Analysis of Distributed Graph ystems Khaled Ammar, M. Tamer Özsu David R. Cheriton chool of Computer cience University of Waterloo, Waterloo, Ontario, Canada {khaled.ammar, tamer.ozsu}@uwaterloo.ca

More information

Partitioning. Course contents: Readings. Kernighang-Lin partitioning heuristic Fiduccia-Mattheyses heuristic. Chapter 7.5.

Partitioning. Course contents: Readings. Kernighang-Lin partitioning heuristic Fiduccia-Mattheyses heuristic. Chapter 7.5. Course contents: Partitioning Kernighang-Lin partitioning heuristic Fiduccia-Mattheyses heuristic Readings Chapter 7.5 Partitioning 1 Basic Definitions Cell: a logic block used to build larger circuits.

More information

Bring x3 Spark Performance Improvement with PCIe SSD. Yucai, Yu BDT/STO/SSG January, 2016

Bring x3 Spark Performance Improvement with PCIe SSD. Yucai, Yu BDT/STO/SSG January, 2016 Bring x3 Spark Performance Improvement with PCIe SSD Yucai, Yu (yucai.yu@intel.com) BDT/STO/SSG January, 2016 About me/us Me: Spark contributor, previous on virtualization, storage, mobile/iot OS. Intel

More information

Graph Partitioning for Scalable Distributed Graph Computations

Graph Partitioning for Scalable Distributed Graph Computations Graph Partitioning for Scalable Distributed Graph Computations Aydın Buluç ABuluc@lbl.gov Kamesh Madduri madduri@cse.psu.edu 10 th DIMACS Implementation Challenge, Graph Partitioning and Graph Clustering

More information

Apache Spark Graph Performance with Memory1. February Page 1 of 13

Apache Spark Graph Performance with Memory1. February Page 1 of 13 Apache Spark Graph Performance with Memory1 February 2017 Page 1 of 13 Abstract Apache Spark is a powerful open source distributed computing platform focused on high speed, large scale data processing

More information

The Stratosphere Platform for Big Data Analytics

The Stratosphere Platform for Big Data Analytics The Stratosphere Platform for Big Data Analytics Hongyao Ma Franco Solleza April 20, 2015 Stratosphere Stratosphere Stratosphere Big Data Analytics BIG Data Heterogeneous datasets: structured / unstructured

More information

NXgraph: An Efficient Graph Processing System on a Single Machine

NXgraph: An Efficient Graph Processing System on a Single Machine NXgraph: An Efficient Graph Processing System on a Single Machine arxiv:151.91v1 [cs.db] 23 Oct 215 Yuze Chi, Guohao Dai, Yu Wang, Guangyu Sun, Guoliang Li and Huazhong Yang Tsinghua National Laboratory

More information

Interactive Graph Analytics with Spark

Interactive Graph Analytics with Spark Interactive Graph Analytics with Spark A talk by Daniel Darabos from Lynx Analytics about the design and implementation of the LynxKite analytics application About LynxKite Analytics web application AngularJS

More information

Chapter 4: Apache Spark

Chapter 4: Apache Spark Chapter 4: Apache Spark Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich PD Dr. Matthias Renz 2015, Based on lectures by Donald Kossmann (ETH Zürich), as well as Jure Leskovec,

More information

Tegra Time-evolving Graph Processing on Commodity Clusters

Tegra Time-evolving Graph Processing on Commodity Clusters Tegra Time-evolving Graph Processing on ommodity lusters SIM SE 17 2 March 2017 nand Iyer Qifan Pu Joseph Gonzalez Ion Stoica 1 Graphs are everywhere Social Networks 2 Graphs are everywhere Gnutella network

More information

PREGEL AND GIRAPH. Why Pregel? Processing large graph problems is challenging Options

PREGEL AND GIRAPH. Why Pregel? Processing large graph problems is challenging Options Data Management in the Cloud PREGEL AND GIRAPH Thanks to Kristin Tufte 1 Why Pregel? Processing large graph problems is challenging Options Custom distributed infrastructure Existing distributed computing

More information

Understanding Graph Computa3on Behavior to Enable Robust Benchmarking

Understanding Graph Computa3on Behavior to Enable Robust Benchmarking Understanding Graph Computa3on Behavior to Enable Robust Benchmarking Fan Yang* and Andrew A. Chien* *University of Chicago, Argonne Na3onal Laboratory {fanyang, achien}@cs.uchicago.edu HPDC, June 18,

More information

Workload Characterization and Optimization of TPC-H Queries on Apache Spark

Workload Characterization and Optimization of TPC-H Queries on Apache Spark Workload Characterization and Optimization of TPC-H Queries on Apache Spark Tatsuhiro Chiba and Tamiya Onodera IBM Research - Tokyo April. 17-19, 216 IEEE ISPASS 216 @ Uppsala, Sweden Overview IBM Research

More information

Billion node graph inference: iterative processing on The Machine

Billion node graph inference: iterative processing on The Machine Billion node graph inference: iterative processing on The Machine Chen,Fei; Gonzalez, Maria Teresa; Viswanathan, Krishnamurthy; Cai, Qiong; Laffite, Hernan; Rivera, Janneth; Mitchell, April; Singhal, Sharad

More information

Pregel: A System for Large- Scale Graph Processing. Written by G. Malewicz et al. at SIGMOD 2010 Presented by Chris Bunch Tuesday, October 12, 2010

Pregel: A System for Large- Scale Graph Processing. Written by G. Malewicz et al. at SIGMOD 2010 Presented by Chris Bunch Tuesday, October 12, 2010 Pregel: A System for Large- Scale Graph Processing Written by G. Malewicz et al. at SIGMOD 2010 Presented by Chris Bunch Tuesday, October 12, 2010 1 Graphs are hard Poor locality of memory access Very

More information

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah Jure Leskovec (@jure) Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah 2 My research group at Stanford: Mining and modeling large social and information networks

More information

Acyclic Network. Tree Based Clustering. Tree Decomposition Methods

Acyclic Network. Tree Based Clustering. Tree Decomposition Methods Summary s Join Tree Importance of s Solving Topological structure defines key features for a wide class of problems CSP: Inference in acyclic network is extremely efficient (polynomial) Idea: remove cycles

More information

Constraint Satisfaction Problems

Constraint Satisfaction Problems Constraint Satisfaction Problems Search and Lookahead Bernhard Nebel, Julien Hué, and Stefan Wölfl Albert-Ludwigs-Universität Freiburg June 4/6, 2012 Nebel, Hué and Wölfl (Universität Freiburg) Constraint

More information

PTAS for Matroid Matching

PTAS for Matroid Matching PTAS for Matroid Matching Jon Lee 1 Maxim Sviridenko 1 Jan Vondrák 2 1 IBM Watson Research Center Yorktown Heights, NY 2 IBM Almaden Research Center San Jose, CA May 6, 2010 Jan Vondrák (IBM Almaden) PTAS

More information

Big Data Infrastructures & Technologies Hadoop Streaming Revisit.

Big Data Infrastructures & Technologies Hadoop Streaming Revisit. Big Data Infrastructures & Technologies Hadoop Streaming Revisit ENRON Mapper ENRON Mapper Output (Excerpt) acomnes@enron.com blake.walker@enron.com edward.snowden@cia.gov alex.berenson@nyt.com ENRON Reducer

More information

Think Sequential, Run Parallel

Think Sequential, Run Parallel Think Sequential, Run Parallel Wenfei Fan 12, Muyang Liu 2, Ruiqi Xu 1, Lei Hou 2, Dongze Li 2, Zizhong Meng 2 1 University of Edinburgh, UK 2 Beihang University, China Abstract. Parallel computation is

More information

Multi-Objective Hypergraph Partitioning Algorithms for Cut and Maximum Subdomain Degree Minimization

Multi-Objective Hypergraph Partitioning Algorithms for Cut and Maximum Subdomain Degree Minimization IEEE TRANSACTIONS ON COMPUTER AIDED DESIGN, VOL XX, NO. XX, 2005 1 Multi-Objective Hypergraph Partitioning Algorithms for Cut and Maximum Subdomain Degree Minimization Navaratnasothie Selvakkumaran and

More information