Batch & Stream Graph Processing with Apache Flink. Vasia
|
|
- Pierce Edwards
- 5 years ago
- Views:
Transcription
1 Batch & Stream Graph Processing with Apache Flink Vasia
2
3 Outline Distributed Graph Processing Gelly: Batch Graph Processing with Flink Gelly-Stream: Continuous Graph Processing with Flink
4 WHEN DO YOU NEED DISTRIBUTED GRAPH PROCESSING?
5 MISCONCEPTION # MY GRAPH IS SO BIG, IT DOESN T FIT IN A SINGLE MACHINE Big Data Ninja
6 A SOCIAL NETWORK
7 YOUR INPUT DATASET SIZE IS _OFTEN_ IRRELEVANT
8 INTERMEDIATE DATA: THE OFTEN Naive Who(m) to Follow: compute a friends-of-friends list per user exclude existing friends rank by common connections
9 MISCONCEPTION # DISTRIBUTED PROCESSING IS ALWAYS FASTER THAN SINGLE-NODE Data Science Rockstar
10
11 GRAPHS DON T APPEAR OUT OF THIN AIR Expectation
12 GRAPHS DON T APPEAR OUT OF THIN AIR Reality!
13 HOW DO WE EXPRESS A DISTRIBUTED GRAPH ANALYSIS TASK?
14 GRAPH APPLICATIONS ARE DIVERSE Iterative value propagation PageRank, Connected Components, Label Propagation Traversals and path exploration Shortest paths, centrality measures Ego-network analysis Personalized recommendations Pattern mining Finding frequent subgraphs
15 LINEAR ALGEBRA 3 5 Adjacency Matrix Partition by rows, columns, blocks - Efficient representation of non-zero elements - Algorithms expressed as vector-matrix multiplications
16 BREADTH-FIRST SEARCH
17 BREADTH-FIRST SEARCH
18 BREADTH-FIRST SEARCH X =
19 BREADTH-FIRST SEARCH X =
20 BREADTH-FIRST SEARCH X =
21 RECENT DISTRIBUTED GRAPH PROCESSING HISTORY Graph Traversals Pattern Matching MapReduce Pregel Giraph++ Arabesque Pegasus Signal-Collect PowerGraph NScale Tinkerpop Iterative value propagation Ego-network analysis
22 PREGEL: THINK LIKE A VERTEX 3,,
23 PREGEL: SUPERSTEPS Superstep i Superstep i+ 3, 3,,., (V i+, outbox) < compute(v i, inbox)
24 PAGERANK: THE WORD COUNT OF GRAPH PROCESSING VertexID Out-degree Transition Probability / / /3 5
25 PAGERANK: THE WORD COUNT OF GRAPH PROCESSING VertexID Out-degree Transition Probability / / /3 5 PR(3) = 0.5*PR() *PR() + PR(5)
26 PREGEL EXAMPLE: PAGERANK void compute(messages): sum = 0.0 sum up received messages for (m <- messages) do sum = sum + m end for update vertex rank setvalue(0.5/numvertices() *sum) for (edge <- getoutedges()) do sendmessageto( edge.target(), getvalue()/numedges) end for distribute rank to neighbors
27 SIGNAL-COLLECT Signal Superstep i Collect Superstep i+ 3, 3, 3,,,, outbox < signal(v i ) V i+ < collect(inbox)
28 SIGNAL-COLLECT EXAMPLE: PAGERANK void signal(): for (edge <- getoutedges()) do sendmessageto( edge.target(), getvalue()/numedges) end for distribute rank to neighbors void collect(messages): sum = 0.0 for (m <- messages) do sum = sum + m end for sum up received messages update vertex rank setvalue(0.5/numvertices() *sum)
29 GATHER-SUM-APPLY (POWERGRAPH) Superstep i Superstep i+ Gather Sum Apply Gather
30 GSA EXAMPLE: PAGERANK double gather(source, edge, target): return target.value() / target.numedges() double sum(rank, rank): return rank + rank combine partial ranks compute partial rank double apply(sum, currentrank): return *sum update rank
31 PROBLEMS WITH VERTEX-CENTRIC MODELS Excessive communication Worker load imbalance Global Synchronization High memory requirements inbox /outbox can grow too large overhead for low-degree vertices in GSA
32 Vertex-Centric Connected Components Propagate the minimum value through the graph In each superstep, the value propagates one hop Requires diameter + supersets to converge
33 THINK LIKE A (SUB)GRAPH compute() on the entire partition - Information flows freely inside each partition - Network communication between partitions, not vertices 3 5
34 Subgraph-Centric Connected Components In each superstep, the value propagates throughout each subgraph Communication between partitions only Requires less (possibly) supersteps to converge
35 RECENT DISTRIBUTED GRAPH PROCESSING HISTORY Graph Traversals Pattern Matching MapReduce Pregel Giraph++ Arabesque Pegasus Signal-Collect PowerGraph NScale Tinkerpop Iterative value propagation Ego-network analysis
36 CAN WE HAVE IT ALL? Data pipeline integration: built on top of an efficient distributed processing engine Graph ETL: high-level API with abstractions and methods to transform graphs Familiar programming model: support popular programming abstractions
37 Gelly the Apache Flink Graph API
38 Flink Stack Hadoop M/R Table Gelly ML Dataflow Cascading Table CEP SAMOA Dataflow (WiP) DataSet (Java/Scala) DataStream (Java/Scala) Streaming dataflow runtime Local Remote Yarn Embedded
39 Why Graph Processing with Apache Flink? Native Iteration Operators DataSet Optimizations Ecosystem Integration
40 Flink Iteration Operators Result Result Iterative Update Function Replace Iterative Update Function State Input Workset Solution Set
41 Optimization the runtime is aware of the iterative execution no scheduling overhead between iterations caching and state maintenance are handled automatically Push work out of the loop Cache Loop-invariant Data Maintain state as index
42 Beyond Iterations Performance & Scalability Memory management Efficient serialization framework Operations on binary data Automatic Optimizations Choose best execution strategy Cache invariant data
43 Meet Gelly Java & Scala Graph APIs on top of Flink graph transformations and utilities iterative graph processing library of graph algorithms Can be seamlessly mixed with the DataSet Flink API to easily implement applications that use both record-based and graph-based analysis
44 Hello, Gelly! Java ExecutionEnvironment env = ExecutionEnvironment.getExecutionEnvironment(); DataSet<Edge<Long, NullValue>> edges = getedgesdataset(env); Graph<Long, Long, NullValue> graph = Graph.fromDataSet(edges, env); DataSet<Vertex<Long, Long>> verticeswithminids = graph.run( new ConnectedComponents(maxIterations)); Scala val env = ExecutionEnvironment.getExecutionEnvironment val edges: DataSet[Edge[Long, NullValue]] = getedgesdataset(env) val graph = Graph.fromDataSet(edges, env) val components = graph.run(new ConnectedComponents(maxIterations))
45 Graph Methods Graph Properties getvertexids getedgeids numberofvertices numberofedges getdegrees... Transformations map, filter, join subgraph, union, difference reverse, undirected gettriplets Mutations add vertex/edge remove vertex/edge
46 Example: mapvertices // increment each vertex value by one val graph = Graph.fromDataSet(...) // increment each vertex value by one val updatedgraph = graph.mapvertices(v => v.getvalue + )
47 Example: subgraph val graph: Graph[Long, Long, Long] =... // keep only vertices with positive values // and only edges with negative values val subgraph = graph.subgraph( ) vertex => vertex.getvalue > 0, edge => edge.getvalue < 0
48 Neighborhood Methods Apply a reduce function to the st-hop neighborhood of each vertex in parallel graph.reduceonneighbors( new MinValue, EdgeDirection.OUT)
49 Iterative Graph Processing Gelly offers iterative graph processing abstractions on top of Flink s Delta iterations vertex-centric scatter-gather gather-sum-apply partition-centric*
50 Vertex-Centric SSSP final class SSSPComputeFunction extends ComputeFunction { override def compute(vertex: Vertex, messages: MessageIterator) = { var mindistance = if (vertex.getid == srcid) 0 else Double.MaxValue while (messages.hasnext) { val msg = messages.next if (msg < mindistance) mindistance = msg } } if (vertex.getvalue > mindistance) { setnewvertexvalue(mindistance) for (edge: Edge <- getedges) sendmessageto(edge.gettarget, vertex.getvalue + edge.getvalue)
51 Library of Algorithms PageRank* Single Source Shortest Paths* Label Propagation Weakly Connected Components* Community Detection Triangle Count & Enumeration Clustering Coefficient Jaccard & Adamic-Adar Similarity Graph Summarization val ranks = inputgraph.run(new PageRank(0.85, 0)) *: both scatter-gather and GSA implementations
52 Gelly-Stream single-pass stream graph processing with Flink
53 Real Graphs are dynamic Graphs are created from events happening in real-time
54
55 How we ve done graph processing so far. Load: read the graph from disk and partition it in memory
56 How we ve done graph processing so far. Load: read the graph from disk and partition it in memory. Compute: read and mutate the graph state
57 How we ve done graph processing so far. Load: read the graph from disk and partition it in memory. Compute: read and mutate the graph state 3. Store: write the final graph state back to disk
58 What s wrong with this model? It is slow wait until the computation is over before you see any result pre-processing and partitioning It is expensive lots of memory and CPU required in order to scale It requires re-computation for graph changes no efficient way to deal with updates
59 Graph Streaming Challenges Maintain the dynamic graph structure Provide up-to-date results with low latency Compute on fresh state only
60 Single-Pass Graph Streaming Each event is an edge addition Maintain only a graph summary Recent events are grouped in graph windows
61
62 Graph Summaries spanners for distance estimation sparsifiers for cut estimation sketches for homomorphic properties graph summary algorithm R ~ R algorithm
63 Batch Connected Components i=
64 Batch Connected Components 3 i=
65 Batch Connected Components i= 6 6 6
66 Batch Connected Components i=
67 Batch Connected Components i= 6 6 6
68 6 8 Stream Connected Components 5 3 Graph Summary: Disjoint Set (Union-Find) Only store component IDs and vertex IDs
69 3 6 8 Cid =
70 7 8 3 Cid = Cid =
71 7 8 3 Cid = Cid =
72 7 8 Cid = Cid = Cid = 6 7 3
73 Cid = Cid = Cid =
74 Cid = Cid = Cid =
75 Cid = Cid = Cid =
76 Cid = Cid =
77 Distributed Stream Connected Components
78 Stream Connected Components with Flink DataStream<DisjointSet> cc = edgestream.keyby(0).timewindow(time.of(00, TimeUnit.MILLISECONDS)).fold(new DisjointSet(), new UpdateCC()).flatMap(new Merger()).setParallelism();
79 Stream Connected Components with Flink DataStream<DisjointSet> cc = edgestream.keyby(0).timewindow(time.of(00, TimeUnit.MILLISECONDS)).fold(new DisjointSet(), new UpdateCC()).flatMap(new Merger()).setParallelism(); Partition the edge stream
80 Stream Connected Components with Flink DataStream<DisjointSet> cc = edgestream.keyby(0).timewindow(time.of(00, TimeUnit.MILLISECONDS)).fold(new DisjointSet(), new UpdateCC()).flatMap(new Merger()).setParallelism(); Define the merging frequency
81 Stream Connected Components with Flink DataStream<DisjointSet> cc = edgestream.keyby(0).timewindow(time.of(00, TimeUnit.MILLISECONDS)).fold(new DisjointSet(), new UpdateCC()).flatMap(new Merger()).setParallelism(); merge locally
82 Stream Connected Components with Flink DataStream<DisjointSet> cc = edgestream.keyby(0).timewindow(time.of(00, TimeUnit.MILLISECONDS)).fold(new DisjointSet(), new UpdateCC()).flatMap(new Merger()).setParallelism(); merge globally
83 Gelly on Streams Static Graphs Multi-Pass Algorithms Full Computations Dynamic Graphs Single-Pass Algorithms Approximate Computations Gelly Gelly-Stream DataSet DataStream Distributed Dataflow Deployment
84 Introducing Gelly-Stream Gelly-Stream enriches the DataStream API with two new additional ADTs: GraphStream: A representation of a data stream of edges. Edges can have state (e.g. weights). Supports property streams, transformations and aggregations. GraphWindow: A time-slice of a graph stream. It enables neighborhood aggregations
85 GraphStream Operations Property Streams Transformations GraphStream -> DataStream.getEdges().getVertices().numberOfVertices().numberOfEdges().getDegrees().inDegrees().outDegrees() GraphStream -> GraphStream.mapEdges();.distinct();.filterVertices();.filterEdges();.reverse();.undirected();.union();
86 Graph Stream Aggregations graphstream.aggregate( new MyGraphAggregation(window, fold, combine, transform)) graph stream edges fold (window) fold agg global aggregates can be persistent or transient reduce combine result aggregate property stream local summaries global summary
87 Slicing Graph Streams graphstream.slice(time.of(, MINUTE)); :0 : : :3
88 Aggregating Slices source graphstream.slice(time.of(, MINUTE), direction) target Aggregations.reduceOnEdges();.foldNeighbors();.applyOnNeighbors(); Slicing collocates edges by vertex information Neighborhood aggregations on sliced graphs
89 Finding Matches Nearby graphstream.filtervertices(graphgeeks()).slice(time.of(5, MINUTE), EdgeDirection.IN).applyOnNeighbors(FindPairs()) GraphWindow :: user-place GraphStream :: graph geek check-ins wendy wendy checked_in soap_bar steve checked_in soap_bar tom checked_in joe s_grill sandra checked_in soap_bar rafa checked_in joe s_grill slice steve soap bar tom rafa joe s grill sandra FindPairs {wendy, steve} {steve, sandra} {wendy, sandra} {tom, rafa}
90 Feeling Gelly? Gelly Guide gelly_guide.html Gelly-Stream Repository Gelly-Stream Related Papers
Programming Models and Tools for Distributed Graph Processing
Programming Models and Tools for Distributed Graph Processing Vasia Kalavri kalavriv@inf.ethz.ch 31st British International Conference on Databases 10 July 2017, London, UK ABOUT ME Postdoctoral Fellow
More informationLarge-Scale Graph Processing with Apache Flink GraphDevroom FOSDEM 15. Vasia Kalavri Flink committer & PhD
Large-Scale Graph Processing with Apache Flink GraphDevroom FOSDEM 15 Vasia Kalavri Flink committer & PhD student @KTH vasia@apache.org @vkalavri Overview What is Apache Flink? Why Graph Processing with
More informationResearch challenges in data-intensive computing The Stratosphere Project Apache Flink
Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive
More informationArchitecture of Flink's Streaming Runtime. Robert
Architecture of Flink's Streaming Runtime Robert Metzger @rmetzger_ rmetzger@apache.org What is stream processing Real-world data is unbounded and is pushed to systems Right now: people are using the batch
More informationECE 7650 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective
ECE 60 Scalable and Secure Internet Services and Architecture ---- A Systems Perspective Part II: Data Center Software Architecture: Topic 3: Programming Models Pregel: A System for Large-Scale Graph Processing
More informationPractical Big Data Processing An Overview of Apache Flink
Practical Big Data Processing An Overview of Apache Flink Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de With slides from Volker Markl and data artisans 1 2013
More informationReal-time data processing with Apache Flink
Real-time data processing with Apache Flink Gyula Fóra gyfora@apache.org Flink committer Swedish ICT Stream processing Data stream: Infinite sequence of data arriving in a continuous fashion. Stream processing:
More informationApache Flink. Fuchkina Ekaterina with Material from Andreas Kunft -TU Berlin / DIMA; dataartisans slides
Apache Flink Fuchkina Ekaterina with Material from Andreas Kunft -TU Berlin / DIMA; dataartisans slides What is Apache Flink Massive parallel data flow engine with unified batch-and streamprocessing CEP
More informationPregel. Ali Shah
Pregel Ali Shah s9alshah@stud.uni-saarland.de 2 Outline Introduction Model of Computation Fundamentals of Pregel Program Implementation Applications Experiments Issues with Pregel 3 Outline Costs of Computation
More informationmodern database systems lecture 10 : large-scale graph processing
modern database systems lecture 1 : large-scale graph processing Aristides Gionis spring 18 timeline today : homework is due march 6 : homework out april 5, 9-1 : final exam april : homework due graphs
More informationApache Giraph. for applications in Machine Learning & Recommendation Systems. Maria Novartis
Apache Giraph for applications in Machine Learning & Recommendation Systems Maria Stylianou @marsty5 Novartis Züri Machine Learning Meetup #5 June 16, 2014 Apache Giraph for applications in Machine Learning
More informationDistributed Systems. 21. Graph Computing Frameworks. Paul Krzyzanowski. Rutgers University. Fall 2016
Distributed Systems 21. Graph Computing Frameworks Paul Krzyzanowski Rutgers University Fall 2016 November 21, 2016 2014-2016 Paul Krzyzanowski 1 Can we make MapReduce easier? November 21, 2016 2014-2016
More informationApache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context
1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes
More informationDistributed Graph Algorithms
Distributed Graph Algorithms Alessio Guerrieri University of Trento, Italy 2016/04/26 This work is licensed under a Creative Commons Attribution-ShareAlike 4.0 International License. Contents 1 Introduction
More informationGraph-Processing Systems. (focusing on GraphChi)
Graph-Processing Systems (focusing on GraphChi) Recall: PageRank in MapReduce (Hadoop) Input: adjacency matrix H D F S (a,[c]) (b,[a]) (c,[a,b]) (c,pr(a) / out (a)), (a,[c]) (a,pr(b) / out (b)), (b,[a])
More informationManagement and Analysis of Big Graph Data: Current Systems and Open Challenges
Management and Analysis of Big Graph Data: Current Systems and Open Challenges Martin Junghanns 1, André Petermann 1, Martin Neumann 2 and Erhard Rahm 1 1 Leipzig University, Database Research Group 2
More information2/26/2017. Originally developed at the University of California - Berkeley's AMPLab
Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second
More informationApache Flink. Alessandro Margara
Apache Flink Alessandro Margara alessandro.margara@polimi.it http://home.deib.polimi.it/margara Recap: scenario Big Data Volume and velocity Process large volumes of data possibly produced at high rate
More informationDatabases 2 (VU) ( / )
Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:
More informationGraph Data Management
Graph Data Management Analysis and Optimization of Graph Data Frameworks presented by Fynn Leitow Overview 1) Introduction a) Motivation b) Application for big data 2) Choice of algorithms 3) Choice of
More informationApache Flink Big Data Stream Processing
Apache Flink Big Data Stream Processing Tilmann Rabl Berlin Big Data Center www.dima.tu-berlin.de bbdc.berlin rabl@tu-berlin.de XLDB 11.10.2017 1 2013 Berlin Big Data Center All Rights Reserved DIMA 2017
More informationPregel: A System for Large-Scale Graph Proces sing
Pregel: A System for Large-Scale Graph Proces sing Grzegorz Malewicz, Matthew H. Austern, Aart J. C. Bik, James C. Dehnert, Ilan Horn, Naty Leiser, and Grzegorz Czajkwoski Google, Inc. SIGMOD July 20 Taewhi
More informationOne Trillion Edges. Graph processing at Facebook scale
One Trillion Edges Graph processing at Facebook scale Introduction Platform improvements Compute model extensions Experimental results Operational experience How Facebook improved Apache Giraph Facebook's
More informationBig Data Infrastructures & Technologies Hadoop Streaming Revisit.
Big Data Infrastructures & Technologies Hadoop Streaming Revisit ENRON Mapper ENRON Mapper Output (Excerpt) acomnes@enron.com blake.walker@enron.com edward.snowden@cia.gov alex.berenson@nyt.com ENRON Reducer
More informationApache Flink- A System for Batch and Realtime Stream Processing
Apache Flink- A System for Batch and Realtime Stream Processing Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich Prof Dr. Matthias Schubert 2016 Introduction to Apache Flink
More informationJure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah
Jure Leskovec (@jure) Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah 2 My research group at Stanford: Mining and modeling large social and information networks
More informationAnalyzing Flight Data
IBM Analytics Analyzing Flight Data Jeff Carlson Rich Tarro July 21, 2016 2016 IBM Corporation Agenda Spark Overview a quick review Introduction to Graph Processing and Spark GraphX GraphX Overview Demo
More informationPREGEL: A SYSTEM FOR LARGE-SCALE GRAPH PROCESSING
PREGEL: A SYSTEM FOR LARGE-SCALE GRAPH PROCESSING Grzegorz Malewicz, Matthew Austern, Aart Bik, James Dehnert, Ilan Horn, Naty Leiser, Grzegorz Czajkowski (Google, Inc.) SIGMOD 2010 Presented by : Xiu
More informationGraph Processing. Connor Gramazio Spiros Boosalis
Graph Processing Connor Gramazio Spiros Boosalis Pregel why not MapReduce? semantics: awkward to write graph algorithms efficiency: mapreduces serializes state (e.g. all nodes and edges) while pregel keeps
More informationCOSC 6339 Big Data Analytics. Graph Algorithms and Apache Giraph
COSC 6339 Big Data Analytics Graph Algorithms and Apache Giraph Parts of this lecture are adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share
More informationUnifying Big Data Workloads in Apache Spark
Unifying Big Data Workloads in Apache Spark Hossein Falaki @mhfalaki Outline What s Apache Spark Why Unification Evolution of Unification Apache Spark + Databricks Q & A What s Apache Spark What is Apache
More informationThe Stratosphere Platform for Big Data Analytics
The Stratosphere Platform for Big Data Analytics Hongyao Ma Franco Solleza April 20, 2015 Stratosphere Stratosphere Stratosphere Big Data Analytics BIG Data Heterogeneous datasets: structured / unstructured
More informationLarge Scale Graph Processing Pregel, GraphLab and GraphX
Large Scale Graph Processing Pregel, GraphLab and GraphX Amir H. Payberah amir@sics.se KTH Royal Institute of Technology Amir H. Payberah (KTH) Large Scale Graph Processing 2016/10/03 1 / 76 Amir H. Payberah
More informationRStream:Marrying Relational Algebra with Streaming for Efficient Graph Mining on A Single Machine
RStream:Marrying Relational Algebra with Streaming for Efficient Graph Mining on A Single Machine Guoqing Harry Xu Kai Wang, Zhiqiang Zuo, John Thorpe, Tien Quang Nguyen, UCLA Nanjing University Facebook
More informationPregel: A System for Large- Scale Graph Processing. Written by G. Malewicz et al. at SIGMOD 2010 Presented by Chris Bunch Tuesday, October 12, 2010
Pregel: A System for Large- Scale Graph Processing Written by G. Malewicz et al. at SIGMOD 2010 Presented by Chris Bunch Tuesday, October 12, 2010 1 Graphs are hard Poor locality of memory access Very
More informationCS /21/2016. Paul Krzyzanowski 1. Can we make MapReduce easier? Distributed Systems. Apache Pig. Apache Pig. Pig: Loading Data.
Distributed Systems 1. Graph Computing Frameworks Can we make MapReduce easier? Paul Krzyzanowski Rutgers University Fall 016 1 Apache Pig Apache Pig Why? Make it easy to use MapReduce via scripting instead
More informationAuthors: Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, L., Leiser, N., Czjkowski, G.
Authors: Malewicz, G., Austern, M. H., Bik, A. J., Dehnert, J. C., Horn, L., Leiser, N., Czjkowski, G. Speaker: Chong Li Department: Applied Health Science Program: Master of Health Informatics 1 Term
More informationLecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Apache Flink
Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Apache Flink Matthias Schubert, Matthias Renz, Felix Borutta, Evgeniy Faerman, Christian Frey, Klaus Arthur Schmid, Daniyal Kazempour,
More informationProcessing of big data with Apache Spark
Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT
More informationBlended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a)
Blended Learning Outline: Developer Training for Apache Spark and Hadoop (180404a) Cloudera s Developer Training for Apache Spark and Hadoop delivers the key concepts and expertise need to develop high-performance
More informationBig Graph Processing. Fenggang Wu Nov. 6, 2016
Big Graph Processing Fenggang Wu Nov. 6, 2016 Agenda Project Publication Organization Pregel SIGMOD 10 Google PowerGraph OSDI 12 CMU GraphX OSDI 14 UC Berkeley AMPLab PowerLyra EuroSys 15 Shanghai Jiao
More informationPutting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21
Big Processing -Parallel Computation COS 418: Distributed Systems Lecture 21 Michael Freedman 2 Ex: Word count using partial aggregation Putting it together 1. Compute word counts from individual files
More informationApache Giraph: Facebook-scale graph processing infrastructure. 3/31/2014 Avery Ching, Facebook GDM
Apache Giraph: Facebook-scale graph processing infrastructure 3/31/2014 Avery Ching, Facebook GDM Motivation Apache Giraph Inspired by Google s Pregel but runs on Hadoop Think like a vertex Maximum value
More informationCompile-Time Code Generation for Embedded Data-Intensive Query Languages
Compile-Time Code Generation for Embedded Data-Intensive Query Languages Leonidas Fegaras University of Texas at Arlington http://lambda.uta.edu/ Outline Emerging DISC (Data-Intensive Scalable Computing)
More informationGiraph: Large-scale graph processing infrastructure on Hadoop. Qu Zhi
Giraph: Large-scale graph processing infrastructure on Hadoop Qu Zhi Why scalable graph processing? Web and social graphs are at immense scale and continuing to grow In 2008, Google estimated the number
More informationGPS: A Graph Processing System
GPS: A Graph Processing System Semih Salihoglu and Jennifer Widom Stanford University {semih,widom}@cs.stanford.edu Abstract GPS (for Graph Processing System) is a complete open-source system we developed
More informationLarge-Scale Graph Processing 1: Pregel & Apache Hama Shiow-yang Wu ( 吳秀陽 ) CSIE, NDHU, Taiwan, ROC
Large-Scale Graph Processing 1: Pregel & Apache Hama Shiow-yang Wu ( 吳秀陽 ) CSIE, NDHU, Taiwan, ROC Lecture material is mostly home-grown, partly taken with permission and courtesy from Professor Shih-Wei
More informationLecture 8, 4/22/2015. Scribed by Orren Karniol-Tambour, Hao Yi Ong, Swaroop Ramaswamy, and William Song.
CME 323: Distributed Algorithms and Optimization, Spring 2015 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Databricks and Stanford. Lecture 8, 4/22/2015. Scribed by Orren Karniol-Tambour, Hao
More informationData-Intensive Computing with MapReduce
Data-Intensive Computing with MapReduce Session 5: Graph Processing Jimmy Lin University of Maryland Thursday, February 21, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share
More informationWhy do we need graph processing?
Why do we need graph processing? Community detection: suggest followers? Determine what products people will like Count how many people are in different communities (polling?) Graphs are Everywhere Group
More informationTI2736-B Big Data Processing. Claudia Hauff
TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Ctd. Graphs Pig Design Patterns Hadoop Ctd. Giraph Zoo Keeper Spark Spark Ctd. Learning objectives
More informationGraph Analytics using Vertica Relational Database
Graph Analytics using ertica Relational Database Alekh Jindal* Samuel Madden Malú Castellanos Meichun Hsu Microsoft MIT ertica ertica * work done while at MIT Motivation for graphs on DB Data anyways in
More informationCSE 444: Database Internals. Lecture 23 Spark
CSE 444: Database Internals Lecture 23 Spark References Spark is an open source system from Berkeley Resilient Distributed Datasets: A Fault-Tolerant Abstraction for In-Memory Cluster Computing. Matei
More informationIntroduction to Spark
Introduction to Spark Outlines A brief history of Spark Programming with RDDs Transformations Actions A brief history Limitations of MapReduce MapReduce use cases showed two major limitations: Difficulty
More informationKing Abdullah University of Science and Technology. CS348: Cloud Computing. Large-Scale Graph Processing
King Abdullah University of Science and Technology CS348: Cloud Computing Large-Scale Graph Processing Zuhair Khayyat 10/March/2013 The Importance of Graphs A graph is a mathematical structure that represents
More informationMosaic: Processing a Trillion-Edge Graph on a Single Machine
Mosaic: Processing a Trillion-Edge Graph on a Single Machine Steffen Maass, Changwoo Min, Sanidhya Kashyap, Woonhak Kang, Mohan Kumar, Taesoo Kim Georgia Institute of Technology Best Student Paper @ EuroSys
More informationBenchmarking Apache Flink and Apache Spark DataFlow Systems on Large-Scale Distributed Machine Learning Algorithms
Benchmarking Apache Flink and Apache Spark DataFlow Systems on Large-Scale Distributed Machine Learning Algorithms Candidate Andrea Spina Advisor Prof. Sonia Bergamaschi Co-Advisor Dr. Tilmann Rabl Co-Advisor
More informationBig data systems 12/8/17
Big data systems 12/8/17 Today Basic architecture Two levels of scheduling Spark overview Basic architecture Cluster Manager Cluster Cluster Manager 64GB RAM 32 cores 64GB RAM 32 cores 64GB RAM 32 cores
More informationExperimental Analysis of Distributed Graph Systems
Experimental Analysis of Distributed Graph ystems Khaled Ammar, M. Tamer Özsu David R. Cheriton chool of Computer cience University of Waterloo, Waterloo, Ontario, Canada {khaled.ammar, tamer.ozsu}@uwaterloo.ca
More informationMizan: A System for Dynamic Load Balancing in Large-scale Graph Processing
/34 Mizan: A System for Dynamic Load Balancing in Large-scale Graph Processing Zuhair Khayyat 1 Karim Awara 1 Amani Alonazi 1 Hani Jamjoom 2 Dan Williams 2 Panos Kalnis 1 1 King Abdullah University of
More informationDATA SCIENCE USING SPARK: AN INTRODUCTION
DATA SCIENCE USING SPARK: AN INTRODUCTION TOPICS COVERED Introduction to Spark Getting Started with Spark Programming in Spark Data Science with Spark What next? 2 DATA SCIENCE PROCESS Exploratory Data
More informationCS246: Mining Massive Datasets Jure Leskovec, Stanford University
CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu HITS (Hypertext Induced Topic Selection) Is a measure of importance of pages or documents, similar to PageRank
More informationSpark 2. Alexey Zinovyev, Java/BigData Trainer in EPAM
Spark 2 Alexey Zinovyev, Java/BigData Trainer in EPAM With IT since 2007 With Java since 2009 With Hadoop since 2012 With EPAM since 2015 About Secret Word from EPAM itsubbotnik Big Data Training 3 Contacts
More informationBSP, Pregel and the need for Graph Processing
BSP, Pregel and the need for Graph Processing Patrizio Dazzi, HPC Lab ISTI - CNR mail: patrizio.dazzi@isti.cnr.it web: http://hpc.isti.cnr.it/~dazzi/ National Research Council of Italy A need for Graph
More informationGraphs (Part II) Shannon Quinn
Graphs (Part II) Shannon Quinn (with thanks to William Cohen and Aapo Kyrola of CMU, and J. Leskovec, A. Rajaraman, and J. Ullman of Stanford University) Parallel Graph Computation Distributed computation
More informationLecture 22 : Distributed Systems for ML
10-708: Probabilistic Graphical Models, Spring 2017 Lecture 22 : Distributed Systems for ML Lecturer: Qirong Ho Scribes: Zihang Dai, Fan Yang 1 Introduction Big data has been very popular in recent years.
More informationApril Copyright 2013 Cloudera Inc. All rights reserved.
Hadoop Beyond Batch: Real-time Workloads, SQL-on- Hadoop, and the Virtual EDW Headline Goes Here Marcel Kornacker marcel@cloudera.com Speaker Name or Subhead Goes Here April 2014 Analytic Workloads on
More informationTurning NoSQL data into Graph Playing with Apache Giraph and Apache Gora
Turning NoSQL data into Graph Playing with Apache Giraph and Apache Gora Team Renato Marroquín! PhD student: Interested in: Information retrieval. Distributed and scalable data management. Apache Gora:
More informationFlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs
FlashGraph: Processing Billion-Node Graphs on an Array of Commodity SSDs Da Zheng, Disa Mhembere, Randal Burns Department of Computer Science Johns Hopkins University Carey E. Priebe Department of Applied
More informationMapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia
MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,
More informationHigh-Level Data Models on RAMCloud
High-Level Data Models on RAMCloud An early status report Jonathan Ellithorpe, Mendel Rosenblum EE & CS Departments, Stanford University Talk Outline The Idea Data models today Graph databases Experience
More informationData-Intensive Distributed Computing
Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 8: Analyzing Graphs, Redux (1/2) March 20, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More informationPREGEL: A SYSTEM FOR LARGE- SCALE GRAPH PROCESSING
PREGEL: A SYSTEM FOR LARGE- SCALE GRAPH PROCESSING G. Malewicz, M. Austern, A. Bik, J. Dehnert, I. Horn, N. Leiser, G. Czajkowski Google, Inc. SIGMOD 2010 Presented by Ke Hong (some figures borrowed from
More informationShark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker
Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha
More informationSociaLite: A Datalog-based Language for
SociaLite: A Datalog-based Language for Large-Scale Graph Analysis Jiwon Seo M OBIS OCIAL RESEARCH GROUP Overview Overview! SociaLite: language for large-scale graph analysis! Extensions to Datalog! Compiler
More informationReport. X-Stream Edge-centric Graph processing
Report X-Stream Edge-centric Graph processing Yassin Hassan hassany@student.ethz.ch Abstract. X-Stream is an edge-centric graph processing system, which provides an API for scatter gather algorithms. The
More informationSpark Overview. Professor Sasu Tarkoma.
Spark Overview 2015 Professor Sasu Tarkoma www.cs.helsinki.fi Apache Spark Spark is a general-purpose computing framework for iterative tasks API is provided for Java, Scala and Python The model is based
More informationRAMCloud. Scalable High-Performance Storage Entirely in DRAM. by John Ousterhout et al. Stanford University. presented by Slavik Derevyanko
RAMCloud Scalable High-Performance Storage Entirely in DRAM 2009 by John Ousterhout et al. Stanford University presented by Slavik Derevyanko Outline RAMCloud project overview Motivation for RAMCloud storage:
More informationApache Spark 2.0. Matei
Apache Spark 2.0 Matei Zaharia @matei_zaharia What is Apache Spark? Open source data processing engine for clusters Generalizes MapReduce model Rich set of APIs and libraries In Scala, Java, Python and
More informationMODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS
MODERN BIG DATA DESIGN PATTERNS CASE DRIVEN DESINGS SUJEE MANIYAM FOUNDER / PRINCIPAL @ ELEPHANT SCALE www.elephantscale.com sujee@elephantscale.com HI, I M SUJEE MANIYAM Founder / Principal @ ElephantScale
More informationiihadoop: an asynchronous distributed framework for incremental iterative computations
DOI 10.1186/s40537-017-0086-3 RESEARCH Open Access iihadoop: an asynchronous distributed framework for incremental iterative computations Afaf G. Bin Saadon * and Hoda M. O. Mokhtar *Correspondence: eng.afaf.fci@gmail.com
More informationClash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics
Clash of the Titans: MapReduce vs. Spark for Large Scale Data Analytics Presented by: Dishant Mittal Authors: Juwei Shi, Yunjie Qiu, Umar Firooq Minhas, Lemei Jiao, Chen Wang, Berthold Reinwald and Fatma
More informationAn Introduction to Apache Spark
An Introduction to Apache Spark 1 History Developed in 2009 at UC Berkeley AMPLab. Open sourced in 2010. Spark becomes one of the largest big-data projects with more 400 contributors in 50+ organizations
More informationFrom Think Like a Vertex to Think Like a Graph. Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John McPherson
From Think Like a Vertex to Think Like a Graph Yuanyuan Tian, Andrey Balmin, Severin Andreas Corsten, Shirish Tatikonda, John McPherson Large Scale Graph Processing Graph data is everywhere and growing
More informationDatabricks, an Introduction
Databricks, an Introduction Chuck Connell, Insight Digital Innovation Insight Presentation Speaker Bio Senior Data Architect at Insight Digital Innovation Focus on Azure big data services HDInsight/Hadoop,
More informationDistributed computing: index building and use
Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput
More informationMap-Reduce Applications: Counting, Graph Shortest Paths
Map-Reduce Applications: Counting, Graph Shortest Paths Adapted from UMD Jimmy Lin s slides, which is licensed under a Creative Commons Attribution-Noncommercial-Share Alike 3.0 United States. See http://creativecommons.org/licenses/by-nc-sa/3.0/us/
More informationNUMA-aware Graph-structured Analytics
NUMA-aware Graph-structured Analytics Kaiyuan Zhang, Rong Chen, Haibo Chen Institute of Parallel and Distributed Systems Shanghai Jiao Tong University, China Big Data Everywhere 00 Million Tweets/day 1.11
More informationWebinar Series TMIP VISION
Webinar Series TMIP VISION TMIP provides technical support and promotes knowledge and information exchange in the transportation planning and modeling community. Today s Goals To Consider: Parallel Processing
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 14: Distributed Graph Processing Motivation Many applications require graph processing E.g., PageRank Some graph data sets are very large
More informationCS 347 Parallel and Distributed Data Processing
CS 347 Parallel and Distributed Data Processing Spring 2016 Notes 14: Distributed Graph Processing Motivation Many applications require graph processing E.g., PageRank Some graph data sets are very large
More informationDistributed Computing with Spark
Distributed Computing with Spark Reza Zadeh Thanks to Matei Zaharia Outline Data flow vs. traditional network programming Limitations of MapReduce Spark computing engine Numerical computing on Spark Ongoing
More information15-388/688 - Practical Data Science: Big data and MapReduce. J. Zico Kolter Carnegie Mellon University Spring 2018
15-388/688 - Practical Data Science: Big data and MapReduce J. Zico Kolter Carnegie Mellon University Spring 2018 1 Outline Big data Some context in distributed computing map + reduce MapReduce MapReduce
More informationBig Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017)
Big Data Infrastructure CS 489/698 Big Data Infrastructure (Winter 2017) Week 5: Analyzing Graphs (2/2) February 2, 2017 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo These
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationGraph Partitioning for Scalable Distributed Graph Computations
Graph Partitioning for Scalable Distributed Graph Computations Aydın Buluç ABuluc@lbl.gov Kamesh Madduri madduri@cse.psu.edu 10 th DIMACS Implementation Challenge, Graph Partitioning and Graph Clustering
More informationAlways start off with a joke, to lighten the mood. Workflows 3: Graphs, PageRank, Loops, Spark
Always start off with a joke, to lighten the mood Workflows 3: Graphs, PageRank, Loops, Spark 1 The course so far Cost of operations Streaming learning algorithms Parallel streaming with map-reduce Building
More informationDistributed Systems. 21. Other parallel frameworks. Paul Krzyzanowski. Rutgers University. Fall 2018
Distributed Systems 21. Other parallel frameworks Paul Krzyzanowski Rutgers University Fall 2018 1 Can we make MapReduce easier? 2 Apache Pig Why? Make it easy to use MapReduce via scripting instead of
More informationThe GraphX Graph Processing System
The GraphX Graph Processing System Daniel Crankshaw Ankur Dave Reynold S. Xin Joseph E. Gonzalez Michael J. Franklin Ion Stoica UC Berkeley AMPLab {crankshaw, ankurd, rxin, jegonzal, franklin, istoica@cs.berkeley.edu
More informationDistributed Computing with Spark and MapReduce
Distributed Computing with Spark and MapReduce Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Traditional Network Programming Message-passing between nodes (e.g. MPI) Very difficult to do at scale:» How
More information