HyperX: Scalable Hypergraph Processing
1 HyperX: Scalable Hypergraph Processing. Jin Huang, The University of Melbourne, November 15, 2015
2 overview. Research Outline, Scalable Hypergraph Processing, Problem and Challenge, Idea, Solution, Implementation, Empirical Results, Conclusion
3 research outline
4 scalable hypergraph processing
5 problem context. Any (high-order) relationship with more than 2 participants. Figure 1: A few high-order relationships
6 representative existing hypergraph studies. Table 1: Various hypergraph learning studies in literature
  Application       Study                     Vertex            Hyperedge
  Recommendation    [TMCCA 13]                Songs and users   Listening histories
  Text retrieval    [SIGIR 08]                Documents         Semantic similarities
  Image retrieval   [Pattern Recognition 13]  Images            Descriptor similarities
  Multimedia        [Multimedia 08]           Videos            Hyperlinks
  Bioinformatics    [ICDM 13]                 Proteins          Interactions
  Social mining     [AAAI 14]                 Users             Communities
  Machine learning  [Signal Processing 14]    Data records      Labels
7 existing solution. Converting to a graph! Option I: a bipartite graph. Option II: a clique. Figure 2: Graph conversion inflates the problem size
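The two conversions named on this slide can be sketched in a few lines. This is only an illustration of why conversion inflates the problem size; the function names `star_expansion` and `clique_expansion` and the toy hypergraph are assumptions for this sketch, not part of HyperX.

```python
from itertools import combinations

def star_expansion(hyperedges):
    """Bipartite conversion: one extra node per hyperedge, one edge per membership."""
    nodes, edges = set(), set()
    for hid, members in hyperedges.items():
        nodes.add(("h", hid))
        for v in members:
            nodes.add(("v", v))
            edges.add((("v", v), ("h", hid)))
    return nodes, edges

def clique_expansion(hyperedges):
    """Clique conversion: connect every pair of vertices that share a hyperedge."""
    nodes, edges = set(), set()
    for members in hyperedges.values():
        nodes.update(members)
        edges.update(combinations(sorted(members), 2))
    return nodes, edges

# Toy hypergraph (hypothetical data): 8 vertices, 3 hyperedges.
hyperedges = {"h1": [1, 2, 3, 4], "h2": [3, 4, 5], "h3": [1, 5, 6, 7, 8]}
star_nodes, star_edges = star_expansion(hyperedges)
clique_nodes, clique_edges = clique_expansion(hyperedges)
print(len(star_nodes), len(star_edges))      # nodes and edges after bipartite conversion
print(len(clique_nodes), len(clique_edges))  # nodes and edges after clique conversion
```

In the bipartite form the node count grows by the number of hyperedges and the edge count equals the total arity; in the clique form the edge count grows quadratically with hyperedge arity.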
8 challenges i. Scalable graph frameworks: GraphLab, Giraph, GraphX, etc. They use synchronous BSP (Pregel), a vertex-centric style, and vertex replication and aggregation
9 challenges i (continued). Figure 3: Vertex replicas to reduce network communication
10 challenges i (continued). Inflated size: 2M vertices (V) and 15M hyperedges (H) become 17M vertices and 1B edges (E) after conversion. Excessive replication: both V and H are replicated
11 challenges ii. Difficulty in load balance, with two causes: 1. V and H are not active simultaneously; 2. double overhead in each iteration. Figure 3: Two issues in balancing the loads
12 idea. To support (API): random walks, label propagation, spectral. Inflated size (Representation): a distributed hypergraph. Excessive replication (Representation): replicate only V. Difficulty in load balance (Partitioning): an optimization that minimizes the communication cost, minimizes the replication cost, and balances both V and H loads
13 proposed solution: HyperX. Figure 4: An overview of HyperX implemented over Spark
14 details: APIs. Algorithms are expressed as two functions: vprog updates vertex values given incident hyperedges; hprog updates hyperedge values given incident vertices. Table 2: HyperX Main APIs
  Name         Usage
  joinv        vprog as distributed joins
  mrtuples     hprog on hyperedges and reduce vertices
  mapv         update vertices independently (locally)
  maph         update hyperedges independently (locally)
  subh         restrict computation over a sub-hypergraph
  HyperPregel  iteratively execute mrtuples and joinv
15 details: HyperPregel implementation. Algorithm 1: HyperPregel
  input : G: Hypergraph[V, H], vprog: (Id, V) → V, hprog: Tuple → M, combine: (M, M) → M, initial: M
  output: RDD[(Id, V)]
  G ← G.mapV((id, v) → vprog(id, v, initial))
  msg ← G.mrTuples(hProg, combine)
  while |msg| > 0 do
    G ← G.joinV(msg)(vprog).subh(v, t)
    msg ← G.mrTuples(hProg, combine)
  return G.vertices
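The control flow of Algorithm 1 can be mimicked on a single machine. The following is a minimal in-memory sketch, not the Spark implementation: `mr_tuples` and `hyper_pregel` are hypothetical helper names, hyperedges are plain sets of vertex ids, the `subh` restriction is omitted, and a `max_iters` cap is added as a safety assumption. The example program propagates the maximum value through shared hyperedges.

```python
def mr_tuples(vertices, hyperedges, hprog, combine):
    """Run hprog on every hyperedge and combine the messages per target vertex."""
    inbox = {}
    for members in hyperedges:
        values = {v: vertices[v] for v in members}
        for target, msg in hprog(values):
            inbox[target] = combine(inbox[target], msg) if target in inbox else msg
    return inbox

def hyper_pregel(vertices, hyperedges, vprog, hprog, combine, initial, max_iters=50):
    """Iterate message generation and vertex updates until no messages remain."""
    vertices = {vid: vprog(vid, v, initial) for vid, v in vertices.items()}
    msgs = mr_tuples(vertices, hyperedges, hprog, combine)
    iters = 0
    while msgs and iters < max_iters:
        # Counterpart of joinV: only vertices that received a message are updated.
        for vid, msg in msgs.items():
            vertices[vid] = vprog(vid, vertices[vid], msg)
        msgs = mr_tuples(vertices, hyperedges, hprog, combine)
        iters += 1
    return vertices

def vprog(vid, value, msg):
    return max(value, msg)

def hprog(values):
    # Send the hyperedge maximum only to members that are still below it.
    top = max(values.values())
    return [(vid, top) for vid, val in values.items() if val < top]

vertices = {1: 5, 2: 1, 3: 9, 4: 2, 5: 7}
hyperedges = [{1, 2}, {2, 3, 4}, {5}]
result = hyper_pregel(vertices, hyperedges, vprog, hprog, max, initial=0)
print(result)
```

The loop stops on its own once every hyperedge is internally consistent, because `hprog` then emits no messages.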
16 details: random walks with APIs. Algorithm 2: Random Walks (RW) with restart
  input : G, label vertex set L, restart probability rp
  output: RDD[(Id, Double)]
  vprog(id, (v, d), msg) = ((1 − rp) · msg + rp · v, d)
  hprog(S, D, Sd, Dd, h) = Σ_{i ∈ S} S_i / (Sd_i · |D|)
  combine(a, b) = a + b
  G ← G.joinV(G.outDeg, (id, v, d) → d)
  G ← G.mapV((id, v) → if id ∈ L (1.0, v) else (0.0, v))
  G.HyperPregel(G, vprog, hprog, combine, 0)
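One superstep of this random walk can be written out directly. The sketch below assumes an undirected hypergraph (every member of a hyperedge is both source and destination), reads hprog as "each source contributes its value divided by its degree and by the number of destinations", and uses the hypothetical name `rwr_step`; it is an illustration under those assumptions rather than the HyperX code.

```python
def rwr_step(values, degree, hyperedges, rp):
    """One superstep: hprog per hyperedge, combine by addition, then vprog per vertex."""
    inbox = {}
    for members in hyperedges:
        # hprog: each member's value, split over its degree and the hyperedge size.
        share = sum(values[i] / degree[i] for i in members) / len(members)
        for j in members:
            inbox[j] = inbox.get(j, 0.0) + share  # combine(a, b) = a + b
    updated = dict(values)
    for vid, msg in inbox.items():
        # vprog: mix the incoming mass with the vertex's current value.
        updated[vid] = (1 - rp) * msg + rp * values[vid]
    return updated

hyperedges = [[1, 2], [2, 3]]
degree = {1: 1, 2: 2, 3: 1}            # number of incident hyperedges per vertex
initial = {1: 1.0, 2: 0.0, 3: 0.0}     # vertex 1 is the only labelled vertex
rp = 0.2

scores = dict(initial)
for _ in range(10):
    scores = rwr_step(scores, degree, hyperedges, rp)
print(scores)
```

Under this reading the total score is conserved across supersteps, which is a quick sanity check when experimenting with other hprog definitions.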
17 details: representation. Built on Spark's RDD, how to represent a hypergraph? Vertices: vrdd. Hyperedges: hrdd. A hyperedge holds multiple vertices: instead of a list or set, store flattened (vid, hid, issrc) tuples in columnar arrays, which saves 41% to 88% memory consumption
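The flattened layout can be pictured as three parallel arrays, one entry per (vertex, hyperedge) incidence. The sketch below uses Python's `array` module purely to illustrate the idea; `flatten` and `sources_of` are hypothetical names, and the memory savings quoted on the slide refer to the Spark implementation, not to this toy.

```python
from array import array

def flatten(hyperedges):
    """Turn {hid: (sources, destinations)} into parallel vid / hid / issrc columns."""
    vids, hids, is_src = array("q"), array("q"), array("b")
    for hid, (srcs, dsts) in hyperedges.items():
        for v in srcs:
            vids.append(v); hids.append(hid); is_src.append(1)
        for v in dsts:
            vids.append(v); hids.append(hid); is_src.append(0)
    return vids, hids, is_src

def sources_of(vids, hids, is_src, hid):
    """Scan the columns and collect the source vertices of one hyperedge."""
    return [v for v, h, s in zip(vids, hids, is_src) if h == hid and s == 1]

hyperedges = {10: ([1, 2], [3]), 11: ([3], [1, 4, 5])}
vids, hids, is_src = flatten(hyperedges)
print(list(vids), list(hids), list(is_src))
print(sources_of(vids, hids, is_src, 10))
```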
18 details: representation (continued). To run mrtuples locally, replicate vertices; one replica is adequate. Replication incurs three costs: the distributed vprog, updating replicas, and storing replicas. How to partition vrdd and hrdd to minimize the cost?
19 details: partitioning introduction. Different from vertex-cut or edge-cut in the graph literature: cuts both vertices and hyperedges simultaneously; minimizes the vertex replicas (with local aggregation); applies separate load constraints on vprog and hprog
20 details: partitioning objective formulation. n vertices, m hyperedges, k workers, a_h the arity of h. Number of replicas for vertex u:
\[ R(x_u, y) = \sum_{i=1}^{k} \max\Bigl(1 - x_{u,i} - \prod_{h \in N(u)} (1 - y_{h,i}),\; 0\Bigr) \]
21 details: partitioning objective formulation (continued). To optimize:
\[ \text{minimize} \quad \sum_{u \in V} R(x_u, y) \]
\[ \text{subject to} \quad \sum_{h \in H} y_{h,i}\, a_h \le (1 + \alpha)\, \frac{\sum_{h \in H} a_h}{k}, \quad \forall i \in \{1, 2, \ldots, k\} \]
\[ \sum_{u \in V} x_{u,i}\, R(x_u, y) \le (1 + \beta)\, \frac{\sum_{u \in V} R(x_u, y)}{k}, \quad \forall i \in \{1, 2, \ldots, k\} \]
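To make the objective concrete, the sketch below evaluates the replica count and both load constraints for a given assignment. It assumes that x and y are 0/1 indicators placing each vertex and each hyperedge on exactly one worker, so they are passed as plain `vertex_part` and `hyperedge_part` dictionaries; all function names and the toy data are hypothetical.

```python
def replicas(u, vertex_part, hyperedge_part, incident):
    """R for vertex u: workers that hold an incident hyperedge but not u itself."""
    remote = {hyperedge_part[h] for h in incident[u]}
    remote.discard(vertex_part[u])
    return len(remote)

def check_partition(members, vertex_part, hyperedge_part, k, alpha, beta):
    """Return (total replicas, hyperedge-load constraint ok, vertex-load constraint ok)."""
    incident = {}
    for h, vs in members.items():
        for v in vs:
            incident.setdefault(v, []).append(h)
    r = {u: replicas(u, vertex_part, hyperedge_part, incident) for u in vertex_part}
    total = sum(r.values())
    arity_load = [0] * k
    for h, vs in members.items():
        arity_load[hyperedge_part[h]] += len(vs)     # sum of a_h on each worker
    replica_load = [0] * k
    for u, i in vertex_part.items():
        replica_load[i] += r[u]                      # sum of x_{u,i} * R(x_u, y)
    h_ok = all(a <= (1 + alpha) * sum(arity_load) / k for a in arity_load)
    v_ok = all(x <= (1 + beta) * total / k for x in replica_load)
    return total, h_ok, v_ok

members = {"h1": [1, 2, 3], "h2": [3, 4], "h3": [1, 4]}
vertex_part = {1: 0, 2: 0, 3: 1, 4: 1}
hyperedge_part = {"h1": 0, "h2": 1, "h3": 1}
print(check_partition(members, vertex_part, hyperedge_part, k=2, alpha=0.2, beta=0.0))
print(check_partition(members, vertex_part, hyperedge_part, k=2, alpha=0.0, beta=0.0))
```

With alpha = 0 the second call reports a violated hyperedge-load constraint, since the arities (3 and 4) cannot both stay at or below the average of 3.5.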
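22 details: partitioning theoretic analysis. How hard? Consider the special case where α = 0 and β = +∞:
\[ \text{minimize} \quad \sum_{u \in V} \sum_{i=1}^{k} \Bigl(1 - \prod_{h \in N(u)} (1 - y_{h,i})\Bigr) \]
\[ \text{subject to} \quad \sum_{h \in H} y_{h,i}\, a_h \le \frac{\sum_{h \in H} a_h}{k}, \quad \forall i \in \{1, 2, \ldots, k\} \]
By reduction from the strongly NP-complete 3-Partition problem, there is no polynomial-time solution with a finite approximation factor (unless P = NP). In plain words, it is extremely hard! How about α > 0?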
23 details: partitioning practical solutions. Label propagation partitioning (LPP): labels are partitions; label both vertices and hyperedges; iteratively update labels
24 details: partitioning practical solutions (continued). Specifically,
\[ L(h) = \arg\max_{i \in K} \bigl|\{ v \mid v \in N(h) \wedge L(v) = i \}\bigr| \]
\[ L(v) = \arg\max_{i \in K} \Bigl( \bigl|\{ h \mid h \in N(v) \wedge L(h) = i \}\bigr| \cdot e^{\frac{\bar{A}^2 - A_i^2}{\bar{A}^2}} \Bigr) \]
where \( A_i = \sum_{L(h) = i} a_h \).
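The two update rules alternate: hyperedges adopt the majority label of their vertices, then vertices adopt the label favoured by their hyperedges, discounted for partitions that already carry a large arity load. The single-machine sketch below follows that reading; it assumes that Ā is the mean of the A_i, that ties go to the lowest partition index, and that a fixed number of rounds is run. The name `lpp` and the toy input are hypothetical.

```python
import math

def lpp(incidence, vertex_labels, k, iterations=5):
    """incidence: {hid: [vertex ids]}; vertex_labels: initial {vid: partition}."""
    neighbors = {}
    for hid, members in incidence.items():
        for v in members:
            neighbors.setdefault(v, []).append(hid)
    v_label = dict(vertex_labels)
    h_label = {}
    for _ in range(iterations):
        # Hyperedge step: take the most common label among incident vertices.
        for hid, members in incidence.items():
            h_label[hid] = max(range(k),
                               key=lambda i: sum(v_label[v] == i for v in members))
        # Arity load A_i per partition and its mean (assumed meaning of A-bar).
        load = [0] * k
        for hid, members in incidence.items():
            load[h_label[hid]] += len(members)
        mean = sum(load) / k
        weight = [math.exp((mean ** 2 - a ** 2) / mean ** 2) if mean else 1.0
                  for a in load]
        # Vertex step: weighted vote of incident hyperedge labels, applied synchronously.
        new_labels = dict(v_label)
        for v, hs in neighbors.items():
            new_labels[v] = max(range(k),
                                key=lambda i: weight[i] * sum(h_label[h] == i for h in hs))
        v_label = new_labels
    return v_label, h_label

incidence = {"a": [0, 1, 2], "b": [3, 4, 5]}
start = {0: 0, 1: 0, 2: 1, 3: 1, 4: 1, 5: 0}
v_label, h_label = lpp(incidence, start, k=2)
print(v_label, h_label)
```

On this toy input the two disjoint hyperedges end up on different partitions together with their vertices, so no replicas are needed.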
25 experimental settings. Metrics: data RDD size, data shuffled, elapsed time. Comparisons: HyperX (hx), Bipartite (star), Clique (clique); random, greedy, aweto, hmetis, LPP; random walk (RW), label propagation (LP), spectral (SP). Environment: 8 nodes, 28 workers, 600 Mbps network; Hadoop 2.4.0, YARN enabled, Spark; HyperX implemented in Scala
26 datasets. Table 3: Datasets presented in the empirical study. Columns: Dataset, n, m, d_min, d_max, d̄, σ_d, cv_d, a_min, a_max, ā, σ_a, cv_a. Dataset sizes (n, m): Medline Coauthor (Med) 3.2m, 8m; Orkut Communities (Ork) 2.3m, 15m; Friendster Communities (Fri) 7.9m, 1.6m; Synthetic (Zipfian s = 2) 2m, 8m
27 evaluating hypergraph representation: space. Figure 5: Memory consumption of data RDDs (data RDD size in GB for vertices and hyperedges; hx, clique, and star on MedRW, MedLP, OrkRW, OrkLP, FriRW, FriLP). HyperX consumes 44% to 77% less memory than Bipartite.
28 evaluating hypergraph representation: communication. Figure 6: Data shuffled on the network (GB written and read at the 5th iteration; star and hx on MedRW, MedLP, OrkRW, OrkLP, FriRW, FriLP). HyperX shuffles 19% to 98% less data than Bipartite.
29 evaluating hypergraph representation: time. Figure 7: Elapsed time (seconds per 10 iterations; hx and star on MedRW, MedLP, OrkRW, OrkLP, FriRW, FriLP). HyperX is up to 49.1 times faster than Bipartite.
30 evaluating partitioning effectiveness: replica factor. Figure 8: Different partitioning algorithms, replication factor (replica factor on Med, Ork, Fri; random, aweto, greedy, hmetis5, hmetis1, lpp). HyperX produces 1.1 to 1.9 times more replicas than hmetis.
31 evaluating partitioning effectiveness: load balance. Figure 9: Different partitioning algorithms, load balance (workload CoV for MedArity, MedReplica, OrkArity, OrkReplica, FriArity, FriReplica; random, aweto, greedy, hmetis5, hmetis1, lpp). LPP produces 1.1 to 37.7 times more balanced loads than hmetis.
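The balance metric on this slide, workload CoV, is presumably the coefficient of variation of the per-worker loads: standard deviation divided by mean, with lower values meaning more even loads. A minimal sketch under that assumption follows; the use of the population standard deviation and the name `workload_cov` are choices made for this example.

```python
import statistics

def workload_cov(loads):
    """Coefficient of variation of per-worker loads (population std / mean)."""
    return statistics.pstdev(loads) / statistics.mean(loads)

balanced = [4, 4, 4, 4]                 # hypothetical per-worker loads
skewed = [2, 4, 4, 4, 5, 5, 7, 9]
print(workload_cov(balanced), workload_cov(skewed))
```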
32 evaluating partitioning effectiveness: space. Figure 10: Different partitioning algorithms on Orkut, space (data RDD size in GB for hyperedges and vertices under RW, LP, SP; lpp, hmetis1, hmetis5, greedy, aweto, random). LPP and hmetis both outperform simplistic methods.
33 evaluating partitioning effectiveness: communication. Figure 11: Different partitioning algorithms on Orkut, communication (MB read and written at the 5th iteration under RW, LP, SP; greedy, aweto, random, lpp, hmetis1, hmetis5). LPP and hmetis both significantly outperform simplistic methods.
34 evaluating partitioning effectiveness: time. Figure 12: Different partitioning algorithms, time (seconds per 10 iterations on MedRW, MedLP, MedSP, OrkRW, OrkLP, OrkSP; random, aweto, greedy, hmetis5, hmetis1, lpp). LPP yields up to a 2.6 times speedup over hmetis.
35 evaluating partitioning efficiency. LPP is written in Scala and runs on the JVM; hmetis is written in C. Table 4: Partitioning time of different algorithms (columns: Dataset, Algorithm, Time t (s), t w.r.t. LPP). Med LPP hmetis5 14, Ork LPP hmetis5 88, Fri LPP hmetis5 6,
36 evaluating learning algorithms: dataset cardinality. Figure 13: Elapsed time running algorithms on varying dataset cardinality, synthetic (seconds per 5 iterations for RW, LP, SP against the number of hyperedges)
37 evaluating learning algorithms: number of workers. Figure 14: Elapsed time running algorithms on varying number of workers, Orkut (seconds per 10 iterations for RW, LP, SP against the number of workers)
38 optional evaluating lpp: time and replicas. Figure 15: Elapsed time and replication factor (elapsed time in seconds and replica factor per iteration; MedReplica, OrkReplica, MedTime, OrkTime). It takes LPP only a few iterations to achieve a reasonable replication ratio.
39 optional evaluating lpp: load balance. Figure 16: Elapsed time and replication factor (workload CoV per iteration; MedReplica, MedArity, OrkReplica, OrkArity). It takes LPP only a few iterations to achieve a reasonable load balance.
40 conclusion. Problem: scalable hypergraph learning. Challenges: 1. Inflated problem size; 2. Excessive replication; 3. Great difficulty in balancing the loads. Solutions: 1. Operate on a distributed hypergraph; 2. Replicate only vertices; 3. Partitioning optimization. Contribution: an efficient and scalable hypergraph framework; an effective and efficient partitioning algorithm
41 Thanks! Any Questions or Comments?
More informationNXgraph: An Efficient Graph Processing System on a Single Machine
NXgraph: An Efficient Graph Processing System on a Single Machine arxiv:151.91v1 [cs.db] 23 Oct 215 Yuze Chi, Guohao Dai, Yu Wang, Guangyu Sun, Guoliang Li and Huazhong Yang Tsinghua National Laboratory
More informationInteractive Graph Analytics with Spark
Interactive Graph Analytics with Spark A talk by Daniel Darabos from Lynx Analytics about the design and implementation of the LynxKite analytics application About LynxKite Analytics web application AngularJS
More informationChapter 4: Apache Spark
Chapter 4: Apache Spark Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich PD Dr. Matthias Renz 2015, Based on lectures by Donald Kossmann (ETH Zürich), as well as Jure Leskovec,
More informationTegra Time-evolving Graph Processing on Commodity Clusters
Tegra Time-evolving Graph Processing on ommodity lusters SIM SE 17 2 March 2017 nand Iyer Qifan Pu Joseph Gonzalez Ion Stoica 1 Graphs are everywhere Social Networks 2 Graphs are everywhere Gnutella network
More informationPREGEL AND GIRAPH. Why Pregel? Processing large graph problems is challenging Options
Data Management in the Cloud PREGEL AND GIRAPH Thanks to Kristin Tufte 1 Why Pregel? Processing large graph problems is challenging Options Custom distributed infrastructure Existing distributed computing
More informationUnderstanding Graph Computa3on Behavior to Enable Robust Benchmarking
Understanding Graph Computa3on Behavior to Enable Robust Benchmarking Fan Yang* and Andrew A. Chien* *University of Chicago, Argonne Na3onal Laboratory {fanyang, achien}@cs.uchicago.edu HPDC, June 18,
More informationWorkload Characterization and Optimization of TPC-H Queries on Apache Spark
Workload Characterization and Optimization of TPC-H Queries on Apache Spark Tatsuhiro Chiba and Tamiya Onodera IBM Research - Tokyo April. 17-19, 216 IEEE ISPASS 216 @ Uppsala, Sweden Overview IBM Research
More informationBillion node graph inference: iterative processing on The Machine
Billion node graph inference: iterative processing on The Machine Chen,Fei; Gonzalez, Maria Teresa; Viswanathan, Krishnamurthy; Cai, Qiong; Laffite, Hernan; Rivera, Janneth; Mitchell, April; Singhal, Sharad
More informationPregel: A System for Large- Scale Graph Processing. Written by G. Malewicz et al. at SIGMOD 2010 Presented by Chris Bunch Tuesday, October 12, 2010
Pregel: A System for Large- Scale Graph Processing Written by G. Malewicz et al. at SIGMOD 2010 Presented by Chris Bunch Tuesday, October 12, 2010 1 Graphs are hard Poor locality of memory access Very
More informationJure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah
Jure Leskovec (@jure) Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah 2 My research group at Stanford: Mining and modeling large social and information networks
More informationAcyclic Network. Tree Based Clustering. Tree Decomposition Methods
Summary s Join Tree Importance of s Solving Topological structure defines key features for a wide class of problems CSP: Inference in acyclic network is extremely efficient (polynomial) Idea: remove cycles
More informationConstraint Satisfaction Problems
Constraint Satisfaction Problems Search and Lookahead Bernhard Nebel, Julien Hué, and Stefan Wölfl Albert-Ludwigs-Universität Freiburg June 4/6, 2012 Nebel, Hué and Wölfl (Universität Freiburg) Constraint
More informationPTAS for Matroid Matching
PTAS for Matroid Matching Jon Lee 1 Maxim Sviridenko 1 Jan Vondrák 2 1 IBM Watson Research Center Yorktown Heights, NY 2 IBM Almaden Research Center San Jose, CA May 6, 2010 Jan Vondrák (IBM Almaden) PTAS
More informationBig Data Infrastructures & Technologies Hadoop Streaming Revisit.
Big Data Infrastructures & Technologies Hadoop Streaming Revisit ENRON Mapper ENRON Mapper Output (Excerpt) acomnes@enron.com blake.walker@enron.com edward.snowden@cia.gov alex.berenson@nyt.com ENRON Reducer
More informationThink Sequential, Run Parallel
Think Sequential, Run Parallel Wenfei Fan 12, Muyang Liu 2, Ruiqi Xu 1, Lei Hou 2, Dongze Li 2, Zizhong Meng 2 1 University of Edinburgh, UK 2 Beihang University, China Abstract. Parallel computation is
More informationMulti-Objective Hypergraph Partitioning Algorithms for Cut and Maximum Subdomain Degree Minimization
IEEE TRANSACTIONS ON COMPUTER AIDED DESIGN, VOL XX, NO. XX, 2005 1 Multi-Objective Hypergraph Partitioning Algorithms for Cut and Maximum Subdomain Degree Minimization Navaratnasothie Selvakkumaran and
More information