Arabesque. A system for distributed graph mining. Mohammed Zaki, RPI
1 Arabesque: a system for distributed graph mining. Mohammed Zaki, RPI; Carlos Teixeira, Alexandre Fonseca, Marco Serafini, Georgos Siganos, Ashraf Aboulnaga, Qatar Computing Research Institute (QCRI)
2 Big Data. Why has data analytics become so hot? The physical and digital worlds are increasingly intertwined: more and more digital breadcrumbs, more and more applications. Hadoop has made data analytics accessible. Key drivers in systems research: define abstractions that ease development, and build systems that implement them efficiently.
3 Graphs are Ubiquitous
4 Graph Mining Algorithms. Finding subgraphs of interest in (labeled) input graphs. Examples: clique finding; others include frequent subgraph mining and motifs.
5 Applications. Web: community detection, link spam detection. Semantic data: attributed patterns in RDF. Biology: protein-protein or gene interactions.
6 Some Terminology. Input graph, pattern, embeddings.
7 Example: Frequent Graph Mining
8 Frequent Subgraph Discovery. Mining frequent subgraphs from a database of many graphs (G1, G2, G3): find the maximal frequent subgraphs with minimum support (minsup) = 2.
9 Frequent Subgraph Discovery. Mining frequent subgraphs from a single large graph: find subgraphs that have a minimum embedding count. The count can be total (6 in the example), edge-disjoint (3), or vertex-disjoint (2); computing the edge- or vertex-disjoint counts from the total is NP-hard.
10 Subgraph Mining: Complete Level-wise Search. Candidate generation: add one more edge and enumerate all extensions. Support counting: check which candidates are frequent (minimum support = 2) and retain them for the next iteration.
11 Taming of the Morphisms. The challenge of isomorphisms: how to detect duplicates? (graph isomorphism) How to count occurrences? (subgraph isomorphism)
12 Candidate Generation. Can be very expensive: potentially millions of graph-isomorphism checks.
13 Support Counting. Each candidate graph is tested by subgraph isomorphism against every graph in the database (G1, G2, ..., Gn); if frequent, it is kept. Costly for large datasets, large graphs, and small support thresholds: potentially millions of subgraph-isomorphism checks.
14 Arabesque for Graph Mining
15 Challenge. Exponential number of subgraphs/embeddings: the number of unique subgraphs grows rapidly with subgraph size, reaching 1.7 billion (log-scale chart with ticks at 4K, 22K, 335K, 7.8M, 117M, 1.7B).
16 State of the Art: Custom Algorithms. Criteria: easy to code, high performance, transparent distribution.
17 State of the Art: Think Like a Vertex. Criteria: easy to code, efficient implementation, transparent distribution.
18 Arabesque. A new system and execution model, purpose-built for graph mining, based on a new Think Like an Embedding model. Contributions: a simple and generic API, high performance, and a distributed, scalable-by-design architecture.
19 Arabesque. Easy to code, high performance, transparent distribution (compared with custom algorithms and Think Like a Vertex).
20 Arabesque API - Clique Finding. State of the art (Mace, centralized): 4,621 LOC.
boolean filter(Embedding e) { return isClique(e); }
void process(Embedding e) { output(e); }
boolean isClique(Embedding e) { return e.getNumEdgesAdded() == e.getNumberOfVertices() - 1; }
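The incremental test on this slide can be made concrete with a toy stand-alone version. The class below is an illustrative sketch, not Arabesque's real API: the adjacency storage, `isClique` signature, and the example graph are all assumptions made for demonstration.

```java
import java.util.*;

// Toy stand-in for the slide's clique check; names and adjacency
// storage are illustrative, not Arabesque's actual classes.
public class CliqueCheck {
    static Map<Integer, Set<Integer>> adj = new HashMap<>();
    static { // triangle 1-2-3 plus a pendant vertex 4
        int[][] edges = {{1, 2}, {2, 3}, {1, 3}, {3, 4}};
        for (int[] e : edges) {
            adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
            adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
        }
    }

    // Mirrors the slide's test: the vertex added last must be adjacent to
    // every vertex already in the embedding, i.e. it adds exactly
    // |embedding| - 1 edges. Applied at every expansion step (the filter
    // is anti-monotone), this guarantees the embedding is a clique.
    static boolean isClique(List<Integer> embedding) {
        int last = embedding.get(embedding.size() - 1);
        int edgesAdded = 0;
        for (int i = 0; i < embedding.size() - 1; i++)
            if (adj.getOrDefault(last, Set.of()).contains(embedding.get(i)))
                edgesAdded++;
        return edgesAdded == embedding.size() - 1;
    }

    public static void main(String[] args) {
        System.out.println(isClique(List.of(1, 2, 3))); // true: a triangle
        System.out.println(isClique(List.of(2, 3, 4))); // false: edge 2-4 missing
    }
}
```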
21 Arabesque API - Motif Counting. State of the art (GTrieScanner, centralized): 3,145 LOC.
boolean filter(Embedding e) { return e.getNumVertices() <= MAX_SIZE; }
void process(Embedding e) { mapOutput(e.getPattern(), 1); }
Pair<Pattern, Integer> reduceOutput(Pattern p, List<Integer> counts) { return new Pair(p, sum(counts)); }
22 Arabesque API - Frequent Subgraph Mining. Ours was the first distributed implementation: 280 lines of Java code, of which 212 compute the frequency metric. Baseline (GraMi): 5,443 lines of Java.
23 Arabesque: An Efficient System. COST: as efficient as the centralized state of the art.
Application - Graph | Centralized Baseline | Arabesque (1 thread)
Motifs - MiCo (MS=3): 50s | 37s
Cliques - MiCo (MS=4): 281s | 385s
FSM - CiteSeer (S=300): 4.8s | 5s
24 Arabesque: A Scalable System. Scalable to thousands of workers: hours or days become minutes.
Application - Graph | Centralized Baseline | Arabesque (640 cores)
Motifs - MiCo: 2 hours 24 minutes | 25 seconds
Cliques - MiCo: 4 hours 8 minutes | 1 minute 10 seconds
FSM - Patents: > 1 day | 1 minute 28 seconds
Can process graphs with almost 1 billion edges.
25 Alternative Paradigms?
26 Think Like a Vertex. Application = stateful vertex object; vertices send messages to their neighbors. Easy to scale to large graphs: partition by vertex. Bulk Synchronous Parallel (BSP) execution: 1. receive from all neighbors; 2. compute new state; 3. send to all neighbors.
27 Example: Shortest Path. Input: a graph with weighted edges and a source vertex. Output: the minimum distance from the source to each vertex. In superstep 0 the source sends initial distances; in each later superstep, vertices update their minimum distance from incoming messages and propagate improved values, until a superstep produces no new messages and the computation completes. Example taken from [McCune et al., arXiv (2015)].
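The superstep structure of this example can be simulated in one process. The sketch below is illustrative, not Pregel's or Giraph's actual API: the graph, edge weights, and message-passing loop are assumptions chosen to show how the "receive, compute, send, barrier" cycle converges.

```java
import java.util.*;

// Toy single-process simulation of BSP shortest paths: one loop
// iteration is one superstep; messages sent in superstep i are
// delivered at the start of superstep i+1.
public class BspSssp {
    // directed weighted edges: {src, dst, weight} (illustrative graph)
    static int[][] edges = {{0, 1, 2}, {0, 2, 4}, {1, 2, 1}, {2, 3, 3}, {1, 3, 6}};

    static Map<Integer, Integer> run(int source, int numVertices) {
        Map<Integer, Integer> dist = new HashMap<>();
        for (int v = 0; v < numVertices; v++) dist.put(v, Integer.MAX_VALUE);
        // superstep 0: the source activates itself with distance 0
        Map<Integer, Integer> inbox = Map.of(source, 0);
        while (!inbox.isEmpty()) {
            Map<Integer, Integer> outbox = new HashMap<>();
            for (Map.Entry<Integer, Integer> msg : inbox.entrySet()) {
                int v = msg.getKey(), d = msg.getValue();
                if (d < dist.get(v)) {            // compute new state
                    dist.put(v, d);
                    for (int[] e : edges)         // send to all out-neighbors
                        if (e[0] == v)
                            outbox.merge(e[1], d + e[2], Math::min);
                }
            }
            inbox = outbox;                        // barrier between supersteps
        }
        return dist;
    }

    public static void main(String[] args) {
        System.out.println(run(0, 4)); // distances from vertex 0
    }
}
```

The computation halts exactly as on the slide: once no vertex improves its distance, no messages are sent and the superstep loop ends.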
28 Matrix-Vector Multiplication. E.g., a PageRank-style computation: in each superstep, a vertex sums the weighted importance values received from its in-neighbors (e.g., a12 * i2 + a13 * i3), computes its new state, and sends it to its neighbors. One superstep thus corresponds to multiplying the transposed adjacency matrix by the importance vector to obtain the new importance vector.
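A minimal sketch of one such superstep as a plain matrix-vector product. The matrix values here are illustrative (a small column-stochastic link matrix), not from the slide:

```java
// One PageRank-style superstep: new importance = A^T * importance,
// where row i of the transposed adjacency matrix holds the weights
// of vertex i's incoming links.
public class MatVecStep {
    static double[] step(double[][] at, double[] imp) {
        double[] next = new double[imp.length];
        for (int i = 0; i < at.length; i++)       // vertex i sums its inputs
            for (int j = 0; j < imp.length; j++)
                next[i] += at[i][j] * imp[j];     // a_ij * importance(j)
        return next;
    }

    public static void main(String[] args) {
        // illustrative 3-vertex link structure (columns sum to 1)
        double[][] at = {{0, 0.5, 1}, {0.5, 0, 0}, {0.5, 0.5, 0}};
        double[] imp = {1.0 / 3, 1.0 / 3, 1.0 / 3};
        double[] next = step(at, imp);
        System.out.printf("%.3f %.3f %.3f%n", next[0], next[1], next[2]);
    }
}
```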
29 Graph Exploration with TLV. 1. Receive embeddings. 2. Expand them by adding neighboring vertices. 3. Send the canonical embeddings to their constituting vertices. (Example: superstep 2, from the perspective of vertex 4.)
30 Think Like a Pattern. Many existing algorithms keep state by pattern. Advantages: embeddings are rebuilt from scratch, so there is no need to materialize the full intermediate state. The idea of TLP: assign different patterns to different machines and avoid storing materialized embeddings.
31 Arabesque Details
32 How: Arabesque Optimizations. Avoid redundant work: efficient canonicality checking. Embedding compression: Overapproximating Directed Acyclic Graphs (ODAGs). Efficient aggregation: 2-level pattern aggregation.
33 Arabesque: Fundamentals. Subgraphs as first-class citizens: embedding == subgraph, in a Think Like an Embedding model. Arabesque's responsibilities: graph exploration, aggregation (isomorphism), load balancing, and avoiding redundant work (automorphism). User's responsibilities: filter and process.
34 Graph Exploration. Iterative expansion: from subgraphs of order n to subgraphs of order n + 1, connecting to neighbours one vertex at a time.
35 Graph Exploration (continued: the exploration tree over the input graph, level by level).
36 Model - Think Like an Embedding. Each exploration step: 1. start from a set of initial embeddings; 2. expand: add one vertex or edge; 3. filter uninteresting candidates (user-defined filter; if false, discard); 4. produce outputs (user-defined process; save results).
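The expand/filter/process loop above can be sketched sequentially on a toy graph. Everything here is an illustrative assumption, not Arabesque's implementation: the graph, the clique-style filter, and the simplified duplicate-avoidance rule (only extend with larger-ID vertices) stand in for the real canonicality machinery.

```java
import java.util.*;

// Toy sequential version of the Think Like an Embedding loop: find all
// cliques of size <= 3 in a small graph. Duplicates are avoided by only
// adding vertices with a larger ID than any vertex already present.
public class TleLoop {
    static Map<Integer, Set<Integer>> adj = new HashMap<>();
    static { int[][] es = {{1, 2}, {2, 3}, {1, 3}, {3, 4}};
             for (int[] e : es) {
                 adj.computeIfAbsent(e[0], k -> new TreeSet<>()).add(e[1]);
                 adj.computeIfAbsent(e[1], k -> new TreeSet<>()).add(e[0]);
             } }

    static boolean filter(List<Integer> emb) {   // user-defined: keep cliques
        int last = emb.get(emb.size() - 1);
        for (int i = 0; i < emb.size() - 1; i++)
            if (!adj.get(last).contains(emb.get(i))) return false;
        return true;
    }

    static List<List<Integer>> explore(int maxSize) {
        List<List<Integer>> current = new ArrayList<>(), results = new ArrayList<>();
        for (int v : adj.keySet()) current.add(List.of(v)); // initial embeddings
        for (int step = 1; step <= maxSize; step++) {
            List<List<Integer>> next = new ArrayList<>();
            for (List<Integer> emb : current) {
                if (!filter(emb)) continue;                  // prune (anti-monotone)
                results.add(emb);                            // process: save output
                for (int u : emb)                            // expand by one vertex
                    for (int n : adj.get(u))
                        if (n > Collections.max(emb)) {
                            List<Integer> bigger = new ArrayList<>(emb);
                            bigger.add(n);
                            if (!next.contains(bigger)) next.add(bigger);
                        }
            }
            current = next;
        }
        return results;
    }

    public static void main(String[] args) {
        // 4 vertices + 4 edges + the triangle {1,2,3} = 9 cliques in total
        System.out.println(explore(3).size());
    }
}
```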
37 Guarantee: Completeness. For each embedding e, if filter(e) == true then process(e) is executed. Requirement: anti-monotonicity — once filter returns false for an embedding, it returns false for all of its extensions. This lets us prune an embedding and keep expanding only where filter is still true, certain that we won't miss any desired embeddings.
38 Aggregation. Some applications must aggregate across embeddings, e.g., frequent subgraph mining: count embeddings with the same pattern. Aggregation runs in parallel with the exploration step.
39 Aggregation. Aggregating across multiple embeddings: during exploration step i, map(k, v) emits values; in exploration step i+1, readAggregate(k) makes the aggregated result available. The aggregation filter and aggregation process complement the ordinary filter and process, which work as before. (All are user-defined functions.)
40 System Architecture. The input embeddings of size n are divided into splits (split 1 ... split 9) distributed across workers (Worker 1-3). Each worker expands the splits it received from the previous step and produces output embeddings of size n + 1, whose splits feed the next step.
41 Arabesque API.
App-defined functions:
boolean filter(Embedding e)
void process(Embedding e)
boolean aggregationFilter(Embedding e)
void aggregationProcess(Embedding e)
Pair<K,V> reduce(K key, List<V> values)
Pair<K,V> reduceOutput(K key, List<V> values)
Functions provided by Arabesque:
void map(K key, V value)
V readAggregate(K key)
void output(Object value)
void mapOutput(K key, V value)
42 Technical Challenges
43 Avoiding Redundant Work. Problem: automorphic embeddings. Automorphisms are subgraph equivalences, so the same embedding can be discovered by multiple workers — redundant work.
44 Avoiding Redundant Work. Solution: decentralized embedding canonicality — no coordination, and efficient. Each worker checks isCanonical(e) locally: exactly one of the equivalent copies passes (true) and the rest are discarded (false).
45 Embedding Canonicality. isCanonical(e) holds iff, at every step, the neighbor with the smallest ID was added. Expanding an initial embedding e therefore yields some canonical expansions and some non-canonical ones, which are dropped.
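A hedged sketch of the rule stated above, under the assumption that "smallest ID at every step" means: the first vertex is the smallest ID in the embedding, and each later vertex is the smallest-ID remaining vertex adjacent to the prefix. The graph and class are illustrative, not Arabesque's actual check.

```java
import java.util.*;

// Illustrative canonicality check: an embedding, listed in the order its
// vertices were added, is canonical iff every position holds the
// smallest-ID remaining vertex that is connected to the prefix.
public class Canonicality {
    static Map<Integer, Set<Integer>> adj = new HashMap<>();
    static { int[][] es = {{1, 2}, {1, 3}, {2, 3}, {3, 4}};
             for (int[] e : es) {
                 adj.computeIfAbsent(e[0], k -> new HashSet<>()).add(e[1]);
                 adj.computeIfAbsent(e[1], k -> new HashSet<>()).add(e[0]);
             } }

    static boolean isCanonical(List<Integer> emb) {
        Set<Integer> remaining = new TreeSet<>(emb);
        Set<Integer> prefix = new HashSet<>();
        for (int v : emb) {
            // smallest remaining vertex connected to the prefix
            // (any vertex qualifies when the prefix is empty)
            int best = Integer.MAX_VALUE;
            for (int r : remaining)
                if (prefix.isEmpty() || adj.get(r).stream().anyMatch(prefix::contains))
                    best = Math.min(best, r);
            if (v != best) return false;   // a smaller-ID choice existed
            remaining.remove(v);
            prefix.add(v);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isCanonical(List.of(1, 2, 3))); // true
        System.out.println(isCanonical(List.of(2, 1, 3))); // false: 1 must come first
        System.out.println(isCanonical(List.of(1, 4, 3))); // false: 3 precedes 4
    }
}
```

Because the check only inspects the embedding itself and the shared input graph, every worker reaches the same verdict with no coordination — the point of the slide.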
46 Handling Exponential Growth. Goal: handle trillions of distinct embeddings. Solution: Overapproximating DAGs (ODAGs) — compress the embedding list of the canonical embeddings into a less restrictive superset, and deal with the spurious embeddings this introduces when reading.
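The compression idea can be illustrated with a toy structure. This is an assumption-laden sketch, not the paper's data structure: it stores, per position, the set of vertices seen there plus the allowed position-to-position transitions, and enumerating paths through it yields a superset of the original embeddings.

```java
import java.util.*;

// Toy ODAG-style compression: a set per position plus pairwise
// transitions between consecutive positions. Enumeration recovers a
// superset of the stored embeddings, so spurious ones must be filtered.
public class OdagSketch {
    List<Set<Integer>> levels = new ArrayList<>();
    Set<List<Integer>> transitions = new HashSet<>();

    void add(List<Integer> embedding) {
        for (int i = 0; i < embedding.size(); i++) {
            if (levels.size() <= i) levels.add(new TreeSet<>());
            levels.get(i).add(embedding.get(i));
            if (i > 0) transitions.add(List.of(embedding.get(i - 1), embedding.get(i)));
        }
    }

    // enumerate everything the compressed form encodes
    List<List<Integer>> enumerate() {
        List<List<Integer>> out = new ArrayList<>();
        for (int v : levels.get(0)) walk(List.of(v), out);
        return out;
    }

    private void walk(List<Integer> prefix, List<List<Integer>> out) {
        if (prefix.size() == levels.size()) { out.add(prefix); return; }
        for (int v : levels.get(prefix.size()))
            if (transitions.contains(List.of(prefix.get(prefix.size() - 1), v))) {
                List<Integer> next = new ArrayList<>(prefix);
                next.add(v);
                walk(next, out);
            }
    }

    public static void main(String[] args) {
        OdagSketch odag = new OdagSketch();
        odag.add(List.of(1, 2, 4));
        odag.add(List.of(1, 3, 4));
        odag.add(List.of(2, 3, 5));
        // the 3 embeddings are stored as 3 small sets plus transitions;
        // enumeration also yields spurious embeddings such as (1, 3, 5)
        System.out.println(odag.enumerate());
    }
}
```

The space saving comes from sharing: many embeddings collapse into a few per-position sets, mirroring the 44 GB to 60 MB reduction reported later in the deck.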
47 Aggregation by Pattern. Label: a distinguishable property of a vertex (e.g., color). Pattern: a meta-subgraph, or template, that captures subgraph structure and labelling. Embedding: an instance of a pattern — actual vertices and edges in the input graph.
48 Efficient Pattern Aggregation. Goal: aggregate automorphic patterns under a single key by finding a canonical pattern. No polynomial-time solution is known, and graph canonization is expensive (in the example, 3 canonizations for 3 embeddings).
49 Efficient Pattern Aggregation. Solution: 2-level pattern aggregation. 1. Embeddings → quick patterns (cheap, linear matching). 2. Quick patterns → canonical pattern (expensive graph canonization, but performed only once per quick pattern: e.g., 3 linear matchings and 2 canonizations instead of 3 canonizations).
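The two-level scheme can be illustrated with labels standing in for full patterns. This is a deliberately simplified sketch: the "quick pattern" is the label sequence in embedding order (cheap and order-sensitive), and a sorted label sequence stands in for the expensive canonization step, which runs only once per distinct quick pattern.

```java
import java.util.*;
import java.util.stream.*;

// Toy two-level aggregation: group by cheap quick pattern first, then
// canonize one representative per quick pattern and merge the counts.
public class TwoLevelAgg {
    static int canonizations = 0;

    static String canonical(String quick) {
        canonizations++;                 // count the expensive step
        char[] c = quick.toCharArray();
        Arrays.sort(c);                  // stand-in for graph canonization
        return new String(c);
    }

    static Map<String, Long> aggregate(List<String> embeddingLabels) {
        // level 1: linear-time grouping by quick pattern
        Map<String, Long> byQuick = embeddingLabels.stream()
            .collect(Collectors.groupingBy(q -> q, Collectors.counting()));
        // level 2: one canonization per quick pattern, then merge counts
        Map<String, Long> byCanonical = new HashMap<>();
        byQuick.forEach((quick, count) ->
            byCanonical.merge(canonical(quick), count, Long::sum));
        return byCanonical;
    }

    public static void main(String[] args) {
        // 5 embeddings, 3 distinct quick patterns, 2 canonical patterns:
        // only 3 canonizations instead of 5
        Map<String, Long> agg = aggregate(List.of("AB", "BA", "AB", "AC", "AB"));
        System.out.println(agg);
        System.out.println(canonizations);
    }
}
```

The benefit grows with the reduction factor: when billions of embeddings map to a handful of quick patterns, almost all of the canonization cost disappears, as the evaluation slides later quantify.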
50 Evaluation
51 Evaluation - Setup. 20 servers (256 GB RAM each), 10 Gbps network. 3 algorithms: frequent subgraph mining, counting motifs, and clique finding. Datasets (# vertices / # edges): CiteSeer 3,312 / 4,…; MiCo 100,000 / 1,080,…; Patents 2,745,761 / 13,965,…; Youtube 4,589,876 / 43,968,…; SN 5,022,… / …,613,…; Instagram 179,527,… / …,390,….
52 Evaluation - TLP & TLV. Use case: frequent subgraph mining. Neither approach scales (speedup vs. number of 32-thread nodes, against the ideal). Bottlenecks — TLV: replication of embeddings and hotspots; TLP: very few patterns do all the work.
53 Evaluation - Arabesque Scalability
54 Evaluation - Arabesque Scalability.
Application - Graph | Centralized Baseline | Arabesque with increasing numbers of servers (32 threads each)
Motifs - MiCo: 8,664s | 328s, 74s, 41s, 31s, 25s
FSM - CiteSeer: 1,813s | 431s, 105s, 65s, 52s, 41s
Cliques - MiCo: 14,901s | 1,185s, 272s, 140s, 91s, 70s
Motifs - Youtube: fail | 8,995s, 2,218s, 1,167s, 900s, 709s
FSM - Patents: > 19h | 548s, 186s, 132s, 102s, 88s
55 Evaluation - ODAG Compression. A graph of 4,000 vertices with 1.7 billion subgraphs: 44 GB of embeddings compressed down to 60 MB.
56 Evaluation - Speedup with ODAGs
57 Efficient Pattern Aggregation (recap). Solution: 2-level pattern aggregation. 1. Embeddings → quick patterns. 2. Quick patterns → canonical pattern: 3 linear matchings to quick patterns and 2 expensive graph canonizations instead of 3.
58 Evaluation - Two-level Aggregation.
Motifs MiCo (MS=4): 10,957,439,… embeddings, reduction factor 521,782,810x.
Motifs Youtube (MS=4): …909,854,429 embeddings, reduction factor 10,424,278,782x.
FSM CiteSeer (S=220, MS=7): 1,680,983,703 embeddings, reduction factor 1,173,052x.
FSM Patents (S=24k): 1,910,611,704 embeddings, reduction factor 1,061,451x.
59 Evaluation - Two-level Aggregation
60 CPU Utilization Breakdown. Advantages of a simple API: Arabesque does all the work (unlike a TLV system), which creates great opportunities for system-level optimizations. PA: pattern aggregation; C: canonicality checks; G: generating new candidates; R/W: reading/writing embeddings.
61 Large Graphs. Datasets (# vertices / # edges): SN 5,022,… / …,613,…; Instagram 179,527,… / …,390,….
Application | Time | Embeddings
Motifs-SN (MS=4): 6h 18m | 8.4 trillion
Cliques-SN (MS=5): 29m | 30 billion
Motifs-Instagram (MS=3): 10h 45m | 5 trillion
62 What's Next?
63 Future Work. Better ways to organize intermediate state; scaling to larger intermediate states; support for approximate exploration; out-of-core execution; support for real-time graphs; verticals and new applications.
64 Conclusions. Fundamental trend: democratizing data analytics. Arabesque is a graph mining system that is straightforward to code, distributes transparently and scalably, and delivers high performance. It is only a first step: many opportunities for improvement remain.
65 Download It, Play with It, Hack It. Open-source (Apache 2.0); pre-compiled jar; user guide available.
66 Thank you. arabesque.io
Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of
More informationEnterprise. Breadth-First Graph Traversal on GPUs. November 19th, 2015
Enterprise Breadth-First Graph Traversal on GPUs Hang Liu H. Howie Huang November 9th, 5 Graph is Ubiquitous Breadth-First Search (BFS) is Important Wide Range of Applications Single Source Shortest Path
More informationBig Data Analytics. Izabela Moise, Evangelos Pournaras, Dirk Helbing
Big Data Analytics Izabela Moise, Evangelos Pournaras, Dirk Helbing Izabela Moise, Evangelos Pournaras, Dirk Helbing 1 Big Data "The world is crazy. But at least it s getting regular analysis." Izabela
More informationCS 61C: Great Ideas in Computer Architecture. MapReduce
CS 61C: Great Ideas in Computer Architecture MapReduce Guest Lecturer: Justin Hsia 3/06/2013 Spring 2013 Lecture #18 1 Review of Last Lecture Performance latency and throughput Warehouse Scale Computing
More informationGraph Theory S 1 I 2 I 1 S 2 I 1 I 2
Graph Theory S I I S S I I S Graphs Definition A graph G is a pair consisting of a vertex set V (G), and an edge set E(G) ( ) V (G). x and y are the endpoints of edge e = {x, y}. They are called adjacent
More informationEvolving To The Big Data Warehouse
Evolving To The Big Data Warehouse Kevin Lancaster 1 Copyright Director, 2012, Oracle and/or its Engineered affiliates. All rights Insert Systems, Information Protection Policy Oracle Classification from
More informationComparative Study of Subspace Clustering Algorithms
Comparative Study of Subspace Clustering Algorithms S.Chitra Nayagam, Asst Prof., Dept of Computer Applications, Don Bosco College, Panjim, Goa. Abstract-A cluster is a collection of data objects that
More informationParallel Graph Algorithms
Parallel Graph Algorithms Design and Analysis of Parallel Algorithms 5DV050/VT3 Part I Introduction Overview Graphs definitions & representations Minimal Spanning Tree (MST) Prim s algorithm Single Source
More informationMath 776 Graph Theory Lecture Note 1 Basic concepts
Math 776 Graph Theory Lecture Note 1 Basic concepts Lectured by Lincoln Lu Transcribed by Lincoln Lu Graph theory was founded by the great Swiss mathematician Leonhard Euler (1707-178) after he solved
More informationCSE 417 Branch & Bound (pt 4) Branch & Bound
CSE 417 Branch & Bound (pt 4) Branch & Bound Reminders > HW8 due today > HW9 will be posted tomorrow start early program will be slow, so debugging will be slow... Review of previous lectures > Complexity
More informationGraph Similarity for Data Mining Lot 6 of Data Analytics FY17/18
Graph Similarity for Data Mining Lot 6 of Data Analytics FY17/18 Meeting Presentation Peter Rodgers Algorithms for the Comparison of Graphs Graph Isomorphism Graph Edit Distance Various Others Used in
More informationLinear Regression Optimization
Gradient Descent Linear Regression Optimization Goal: Find w that minimizes f(w) f(w) = Xw y 2 2 Closed form solution exists Gradient Descent is iterative (Intuition: go downhill!) n w * w Scalar objective:
More informationLecture 24: More Reductions (1997) Steven Skiena. skiena
Lecture 24: More Reductions (1997) Steven Skiena Department of Computer Science State University of New York Stony Brook, NY 11794 4400 http://www.cs.sunysb.edu/ skiena Prove that subgraph isomorphism
More informationPractice Problems for the Final
ECE-250 Algorithms and Data Structures (Winter 2012) Practice Problems for the Final Disclaimer: Please do keep in mind that this problem set does not reflect the exact topics or the fractions of each
More informationIntroduction to MapReduce Algorithms and Analysis
Introduction to MapReduce Algorithms and Analysis Jeff M. Phillips October 25, 2013 Trade-Offs Massive parallelism that is very easy to program. Cheaper than HPC style (uses top of the line everything)
More informationOrder or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations
Order or Shuffle: Empirically Evaluating Vertex Order Impact on Parallel Graph Computations George M. Slota 1 Sivasankaran Rajamanickam 2 Kamesh Madduri 3 1 Rensselaer Polytechnic Institute, 2 Sandia National
More informationExploring the Hidden Dimension in Graph Processing
Exploring the Hidden Dimension in Graph Processing Mingxing Zhang, Yongwei Wu, Kang Chen, *Xuehai Qian, Xue Li, and Weimin Zheng Tsinghua University *University of Shouthern California Graph is Ubiquitous
More informationPERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES
PERFORMANCE ANALYSIS AND OPTIMIZATION OF SKIP LISTS FOR MODERN MULTI-CORE ARCHITECTURES Anish Athalye and Patrick Long Mentors: Austin Clements and Stephen Tu 3 rd annual MIT PRIMES Conference Sequential
More informationChapter 26 Cluster Heuristic Search
Chapter 26 Cluster Heuristic Search Part I. Preliminaries Part II. Tightly Coupled Multicore Part III. Loosely Coupled Cluster Chapter 18. Massively Parallel Chapter 19. Hybrid Parallel Chapter 20. Tuple
More informationCS521 \ Notes for the Final Exam
CS521 \ Notes for final exam 1 Ariel Stolerman Asymptotic Notations: CS521 \ Notes for the Final Exam Notation Definition Limit Big-O ( ) Small-o ( ) Big- ( ) Small- ( ) Big- ( ) Notes: ( ) ( ) ( ) ( )
More informationManaging and Mining Billion Node Graphs. Haixun Wang Microsoft Research Asia
Managing and Mining Billion Node Graphs Haixun Wang Microsoft Research Asia Outline Overview Storage Online query processing Offline graph analytics Advanced applications Is it hard to manage graphs? Good
More informationEfficiency. Efficiency: Indexing. Indexing. Efficiency Techniques. Inverted Index. Inverted Index (COSC 488)
Efficiency Efficiency: Indexing (COSC 488) Nazli Goharian nazli@cs.georgetown.edu Difficult to analyze sequential IR algorithms: data and query dependency (query selectivity). O(q(cf max )) -- high estimate-
More informationProject Genesis. Cafepress.com Product Catalog Hundreds of Millions of Products Millions of new products every week Accelerating growth
Scaling with HiveDB Project Genesis Cafepress.com Product Catalog Hundreds of Millions of Products Millions of new products every week Accelerating growth Enter Jeremy and HiveDB Our Requirements OLTP
More informationChapter 4: Association analysis:
Chapter 4: Association analysis: 4.1 Introduction: Many business enterprises accumulate large quantities of data from their day-to-day operations, huge amounts of customer purchase data are collected daily
More information