modern database systems lecture 10 : large-scale graph processing
1 modern database systems lecture 10 : large-scale graph processing Aristides Gionis spring 2018
2 timeline : today : homework is due; march 6 : homework out; april 5, 9-1 : final exam; april : homework due
3 graphs used to model items and their relations : nodes store information, links associate information. examples : biological networks (protein-protein interaction, gene regulation, gene co-expression, metabolic pathways, the food web, neural networks), social networks, knowledge and information networks (the internet map, citation networks (directed acyclic), the web (directed), word networks, home page/blog networks), technology networks (peer-to-peer networks, bluetooth networks, software graphs), networks of trust
4 graph properties graph mining is a heavily researched topic many properties are pervasive in all types of graphs some properties relevant to this lecture : degree distribution is heavy-tailed (hubs exist) graphs are highly interconnected and distances are short (six degrees of separation, or rather 4.74 degrees of separation)
5 challenges in processing graphs (I) graphs can be very large, although size is not always the main bottleneck : e.g., the Facebook graph is in the order of a billion nodes and can fit in the memory of a large computer. many graph tasks are inherently computationally expensive : a seemingly simple task is counting triangles in a graph; complex tasks include all-pairs shortest paths (matrix multiplication), graph cuts, graph partitioning, etc. many graph tasks are inherently sequential : e.g., finding shortest paths is difficult to parallelize
6 real-world graphs are difficult to partition/parallelize because of lack of locality (figures : an idealized graph vs. a real-world graph)
7 challenges in processing graphs (II) poor locality of memory access by graph algorithms I/O intensive waits for memory fetches difficult to parallelize by data partitioning varying degree of parallelism over the course of execution varying degree of parallelism due to non-uniform graph topology (e.g., existence of hubs)
8 graph processing in large-scale platforms approach : use existing systems, e.g., hadoop, Spark example PageRank is a standard graph algorithm already discussed PageRank in map-reduce and Spark problem reduces to eigenvector computation solved via iterative matrix-vector multiplication
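For concreteness, the iterative matrix-vector formulation can be sketched in plain Python; the 3-node example graph and α = 0.15 are illustrative assumptions, not values from the lecture:

```python
# Sketch: PageRank as repeated matrix-vector multiplication r <- A r,
# with A = alpha*(1/N)*J + (1-alpha)*M, M the column-stochastic link matrix.
def matvec(A, x):
    return [sum(A[i][j] * x[j] for j in range(len(x))) for i in range(len(A))]

alpha, N = 0.15, 3
# illustrative graph: edges 0->1, 1->2, 2->0 (a 3-cycle), so M is a permutation
M = [[0, 0, 1],
     [1, 0, 0],
     [0, 1, 0]]
A = [[alpha / N + (1 - alpha) * M[i][j] for j in range(N)] for i in range(N)]

r = [1.0, 0.0, 0.0]               # start all rank mass on vertex 0
for _ in range(50):               # one matrix-vector product per iteration
    r = matvec(A, r)
print(r)                          # converges toward the uniform vector [1/3, 1/3, 1/3]
```

Each map-reduce or Spark iteration corresponds to one such matrix-vector product, which is why the graph structure has to be re-read on every iteration.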
9 graph processing in large-scale platforms drawbacks of using off-the-shelf systems (such as hadoop, Spark, and others) for graph processing : not intuitive (no graph-specific semantics), e.g., a graph problem maps to an eigenvector computation; hard to implement; unnecessarily slow : each iteration is a single job with lots of overhead, the graph structure is read from disk, and the intermediary result is written to disk. hence the need for systems developed especially for graph processing
10 pregel introduced by Google researchers in 2010; a distributed system developed especially for large-scale graph processing; intuitive API following the principle think like a vertex; bulk synchronous parallel (BSP) execution model; fault tolerance by checkpointing
11 bulk synchronous parallel (BSP) : each superstep consists of local computation at the processors, communication, and barrier synchronization
12 vertex-centric BSP each vertex has an id, a value, a list of neighbor ids, and corresponding edge values. each vertex is invoked in each superstep : it can recompute its value and send messages to other vertices, which are delivered over superstep barriers. advanced features : termination votes, combiners, aggregators
13 computation model superstep : the vertices compute in parallel; each vertex receives the messages sent in the previous superstep, executes the same user-defined function, modifies its value or that of its outgoing edges, sends messages to other vertices (to be received in the next superstep), and votes to halt if it has no further work to do. termination condition : all vertices are simultaneously inactive and there are no messages in transit. vertex state machine : an active vertex becomes inactive by voting to halt; an inactive vertex becomes active again when it receives a message
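The computation model above can be sketched as a small simulator in plain Python. The runner and the compute() signature are hypothetical illustrations, not the actual Pregel API:

```python
# Minimal sketch of a BSP run: supersteps, a message barrier, halt votes.
def run_bsp(values, nbrs, compute, max_supersteps=30):
    """compute(v, value, messages, neighbours, step)
       -> (new_value, [(target, message), ...], vote_to_halt)"""
    inbox = {v: [] for v in values}
    active = set(values)
    for step in range(max_supersteps):
        if not active:
            break                               # all halted, nothing in transit
        outbox = {v: [] for v in values}
        nxt = set()
        for v in values:
            if v not in active and not inbox[v]:
                continue                        # halted and received nothing
            val, msgs, halt = compute(v, values[v], inbox[v], nbrs[v], step)
            values[v] = val
            for target, m in msgs:
                outbox[target].append(m)
            if not halt:
                nxt.add(v)
        inbox = outbox                          # barrier: delivered next superstep
        active = nxt | {v for v, ms in inbox.items() if ms}  # messages reactivate
    return values

# demo: propagate the maximum value through a chain 0-1-2-3
def max_value(v, val, msgs, nbrs, step):
    new = max([val] + msgs)
    if step == 0 or new > val:
        return new, [(u, new) for u in nbrs], False
    return new, [], True                        # unchanged: vote to halt

result = run_bsp({0: 3, 1: 6, 2: 2, 3: 1},
                 {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}, max_value)
print(result)  # {0: 6, 1: 6, 2: 6, 3: 6}
```

The run terminates exactly under the slide's condition: every vertex has voted to halt and no messages are in transit.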
14 example 1 : PageRank in pregel recall PageRank : initially all vertices have rank score 1/N. in each iteration, each vertex sends a contribution of rank/n to its neighbors (rank : own rank of the page; n : number of neighbors of the vertex), then the rank of each vertex is updated to α/N + (1-α) Σᵢ cᵢ, where cᵢ is the contribution received from neighbor i, i=1..n. an ideal setting for the pregel paradigm
15 example 1 : PageRank in pregel

class PageRankVertex {
  void compute(Iterator messages) {
    if (getSuperstep() > 0) {
      // recompute own PageRank from the neighbours' messages
      pagerank = sum(messages);
      setVertexValue(pagerank);
    }
    if (getSuperstep() < k) {
      // send updated PageRank to each neighbour
      sendMessageToAllNeighbors(pagerank / getNumOutEdges());
    } else {
      voteToHalt(); // terminate
    }
  }
}
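The vertex program above can be simulated in plain Python. The graph, α = 0.15, and k = 40 supersteps are illustrative assumptions, and the sketch assumes every vertex has at least one out-edge (no dangling vertices):

```python
# Plain-Python simulation of the vertex-centric PageRank program.
def pregel_pagerank(out_nbrs, alpha=0.15, k=40):
    n = len(out_nbrs)                       # total number of vertices N
    rank = {v: 1.0 / n for v in out_nbrs}   # initial rank 1/N
    inbox = {v: [] for v in out_nbrs}
    for step in range(k):
        if step > 0:                        # recompute rank from messages
            rank = {v: alpha / n + (1 - alpha) * sum(inbox[v])
                    for v in out_nbrs}
        outbox = {v: [] for v in out_nbrs}
        for v, nbrs in out_nbrs.items():    # send rank / out-degree to neighbours
            for u in nbrs:
                outbox[u].append(rank[v] / len(nbrs))
        inbox = outbox                      # delivered in the next superstep
    return rank

g = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
r = pregel_pagerank(g)
print(round(sum(r.values()), 6))  # total rank mass stays 1.0
```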
16 example 2 : finding connected components in a graph how would you implement it in the sequential model? how would you implement it in pregel?
17 example 2 : finding connected components in a graph assume each vertex has a unique label. algorithm : propagate the vertex label to the neighbors; update the vertex label to the smallest label received; repeat until convergence. in the end, all vertices of a component have the same label
18-21 example 2 : finding connected components in a graph (figures : step-by-step label propagation on a small example graph)
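A synchronous, plain-Python sketch of this label-propagation algorithm; the example graph with two components is a hypothetical illustration:

```python
# Sketch: connected components by min-label propagation (synchronous rounds).
def connected_components(nbrs):
    label = {v: v for v in nbrs}          # unique initial label = vertex id
    changed = True
    while changed:                        # one "superstep" per iteration
        changed = False
        new = {}
        for v in nbrs:
            best = min([label[v]] + [label[u] for u in nbrs[v]])
            new[v] = best
            if best != label[v]:
                changed = True
        label = new                       # barrier: all updates applied at once
    return label

g = {0: [1], 1: [0, 2], 2: [1], 3: [4], 4: [3]}
print(connected_components(g))  # {0: 0, 1: 0, 2: 0, 3: 3, 4: 3}
```

Because the minimum is monotone, the labels converge after at most diameter-many rounds.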
22 example 3 : shortest path how would you compute it in the sequential model? how would you compute it in pregel?
23 example 3 : shortest path vertex value : distance to the source. initially the source has value 0 and all other vertices have value ∞. in each iteration, each vertex propagates its value to its neighbors; each vertex updates its value based on the values received and the edge distances
24-28 example 3 : shortest path (figures : step-by-step run on a small weighted graph)
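A plain-Python sketch of this vertex-centric shortest-path computation (Bellman-Ford style); the small weighted graph is an assumption, not the one in the figures:

```python
# Sketch: vertex-centric single-source shortest paths.
import math

def sssp(edges, source):
    """edges: dict v -> list of (neighbour, weight)."""
    dist = {v: math.inf for v in edges}
    dist[source] = 0                       # source starts at 0, rest at infinity
    active = {source}
    while active:                          # one superstep per iteration
        msgs = {}
        for v in active:                   # propagate dist + edge weight
            for u, w in edges[v]:
                msgs.setdefault(u, []).append(dist[v] + w)
        active = set()
        for u, received in msgs.items():   # keep the smallest distance seen
            best = min(received)
            if best < dist[u]:
                dist[u] = best
                active.add(u)              # improved vertices stay active
    return dist

g = {0: [(1, 7), (2, 1)], 1: [(3, 1)], 2: [(1, 3), (3, 6)], 3: []}
print(sssp(g, 0))  # {0: 0, 1: 4, 2: 1, 3: 5}
```

Only vertices whose distance improved vote to stay active, so the computation halts once no message improves any distance.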
29 example 4 : distance profile estimate N(u,t), the number of vertices that are at distance t from u, simultaneously for all u and t (up to the diameter). how would you implement it in the sequential model? how would you implement it in pregel?
30 example 4 : distance profile attempt 0 value of vertex u : number of nodes N(u,t) within distance t. initially vertex u has value N(u,0) = 1. iteration (at round t) : vertex u propagates N(u,t); vertex u updates N(u,t+1) = Σ_v N(v,t) + 1 over all N(v,t) received
31-34 example 4 : distance profile attempt 0, worked example : a 4-cycle with vertices u, v, w, z (edges u-v, u-w, v-z, w-z). initially N(u,0) = N(v,0) = N(w,0) = N(z,0) = 1; after one round N(·,1) = 3 for every vertex; after two rounds the update gives N(·,2) = 3 + 3 + 1 = 7 for every vertex, although the graph has only 4 vertices
35 example 4 : distance profile attempt 0 is incorrect because of double-counting : the same vertex is reached along several paths and counted once per path
36 example 4 : distance profile attempt 1 value of vertex u : the set of nodes N(u,t) within distance t. initially vertex u has value N(u,0) = {u}. iteration (at round t) : vertex u propagates N(u,t); vertex u updates N(u,t+1) = ∪_v N(v,t) ∪ {u} over all N(v,t) received
37-40 example 4 : distance profile attempt 1, worked example on the same 4-cycle : initially N(u,0)={u}, N(v,0)={v}, N(w,0)={w}, N(z,0)={z}; after one round N(u,1)={u,v,w}, N(v,1)={u,v,z}, N(w,1)={u,w,z}, N(z,1)={v,w,z}; after two rounds N(·,2)={u,v,w,z} for every vertex
41 example 4 : distance profile attempt 1 has correct logic, but the space per vertex grows linearly with the graph size, i.e., quadratically in total
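A plain-Python sketch of attempt 1, run on a 4-cycle like the one in the slides; it produces the correct ball sizes (1, 3, 4), avoiding the double-counted value of 7 produced by the earlier counting attempt:

```python
# Sketch of attempt 1: every vertex keeps the SET of vertices within distance t.
def neighborhood_sets(nbrs, t_max):
    N = {v: {v} for v in nbrs}                    # N(u,0) = {u}
    sizes = [{v: len(s) for v, s in N.items()}]
    for _ in range(t_max):
        new = {}
        for v in nbrs:
            s = set(N[v])
            for u in nbrs[v]:
                s |= N[u]                         # union of the neighbours' sets
            new[v] = s
        N = new
        sizes.append({v: len(s) for v, s in N.items()})
    return sizes

# a 4-cycle: u-v, u-w, v-z, w-z
g = {'u': ['v', 'w'], 'v': ['u', 'z'], 'w': ['u', 'z'], 'z': ['v', 'w']}
print(neighborhood_sets(g, 2))
# every vertex: size 1 at t=0, 3 at t=1, 4 at t=2 (sets absorb duplicates)
```

The price is the quadratic space noted on the slide: each vertex eventually stores a set as large as its component.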
42 example 4 : distance profile attempt 2 (final) requirements : need to be able to estimate the size of a set; need to make updates under union operations; need to allocate only a small amount of space at each vertex; approximate answers are OK. solution : a counter of distinct elements in a data stream provides approximate answers and uses logarithmic space; method proposed by Flajolet and Martin, 1985. replace N(u,t) in the previous solution with a distinct-element counter
43 estimating the number of distinct elements [Flajolet and Martin 1985; Alon, Matias, Szegedy 1996] consider a bit vector of length O(log n). upon seeing element i, set : the 1st bit with probability 1/2, the 2nd bit with probability 1/4, the j-th bit with probability 1/2^j. important : the bits are set deterministically for each element (via a hash of the element). let R be the index of the largest bit set; return Y = 2^R
44 estimating the number of distinct elements Theorem (Alon, Matias, Szegedy 1996) : for every c > 2, the previous algorithm computes a number Y using O(log n) memory bits, such that the probability that the ratio between Y and the true number of distinct elements is not between 1/c and c is at most 2/c
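A plain-Python sketch of a Flajolet-Martin style counter. The hash choice (SHA-1) and the single-estimator form (no averaging over independent hash functions, so only constant-factor accuracy) are implementation assumptions:

```python
# Sketch of a Flajolet-Martin style distinct-element counter.
import hashlib

def trailing_zeros(x):
    return (x & -x).bit_length() - 1 if x else 64   # index of lowest set bit

def fm_estimate(stream):
    R = -1
    for item in stream:
        # deterministic "random" bits: hash the element
        h = int(hashlib.sha1(str(item).encode()).hexdigest(), 16)
        # h has at least j trailing zero bits with probability 1/2^j
        R = max(R, trailing_zeros(h))
    return 2 ** (R + 1) if R >= 0 else 0            # estimate Y = 2^R

# duplicates do not change the estimate (same elements, same hashes)
assert fm_estimate([1, 2, 3]) == fm_estimate([3, 2, 1, 1, 2, 3])
est = fm_estimate(i % 1000 for i in range(100_000))  # 1000 distinct elements
print(est)
```

Only R needs to be stored per counter, i.e., O(log n) bits, which is what makes the distance-profile attempt feasible; practical versions average many independent counters to tighten the constant factor.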
45 example 5 : counting the number of triangles in a graph useful for social-network analysis : social networks tend to have many triangles (large clustering coefficient : the friend of a friend is likely to be a friend). how would you compute it in the sequential model? how would you compute it in pregel?
46 example 5 : counting the number of triangles in a graph, sequential model. method 1 : check all triples of vertices; cubic, too expensive. method 2 : for all pairs of neighbours of a vertex, check if the third edge is present. method 3 (approximate) : sample connected triples (paths of length 2) and check if the third edge is present
47 example 5 : counting the number of triangles in a graph, in pregel. assume that each vertex has a unique id; value of vertex u : a subset of neighbor ids. first round : a vertex propagates its id to its neighbors; a vertex keeps the list of received ids that are smaller than its own. second round : a vertex sends its list of ids to its neighbors; if a vertex receives a list that contains the id of a neighbour larger than its own, it increments a global triangle counter
48-50 example 5 : counting the number of triangles, worked example : vertices 0, 1, 2, 3 with edges 0-1, 0-2, 1-2, 0-3, 1-3 (two triangles sharing the edge 0-1). first round : vertex 0 keeps {}, vertex 1 keeps {0}, vertices 2 and 3 each keep {0,1}
51 example 5 : second round : vertices 2 and 3 send their lists {0,1} to their neighbors; vertex 0 sees the id of its neighbor 1 in two received lists, so 2 triangles are counted
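The two-round algorithm can be sketched in plain Python (simulated sequentially); the example graph with two triangles sharing an edge is an illustrative assumption:

```python
# Sketch of the two-round vertex-centric triangle count.
def count_triangles(nbrs):
    nbr_sets = {v: set(nbrs[v]) for v in nbrs}
    # round 1: each vertex keeps the neighbour ids smaller than its own
    smaller = {v: {u for u in nbrs[v] if u < v} for v in nbrs}
    # round 2: each vertex v sends its list to every neighbour u; the receiver
    # counts ids in the list that are its own neighbours and larger than itself
    count = 0
    for v in nbrs:
        for u in nbrs[v]:                  # u receives smaller[v]
            count += sum(1 for w in smaller[v]
                         if w > u and w in nbr_sets[u])
    return count

# two triangles sharing the edge 0-1
g = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1], 3: [0, 1]}
print(count_triangles(g))  # 2
```

Each triangle a < b < c is counted exactly once: c sends {a, b, ...} to a, and a sees its neighbour b with b > a.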
52 implementation of pregel system
53 master-slave architecture vertices are partitioned and assigned to workers (default : hash-partitioning; custom partitioning possible). the master maintains the status of the workers, recovers from worker faults, and provides a web-ui tool for monitoring job progress. each worker processes its task and communicates with the other workers
54 execution of a pregel program 1. many copies of the program are executed on a cluster of machines 2. the master assigns a partition of the input to each worker; each worker loads its vertices and marks them as active 3. the master instructs each worker to perform a superstep; each worker loops through its active vertices and performs the computation for each vertex; messages are sent asynchronously, but are delivered before the end of the superstep; this step is repeated as long as any vertices are active or any messages are in transit 4. after the computation halts, the master may instruct each worker to save its portion of the graph
55 fault tolerance checkpointing : the master periodically instructs the workers to save the state of their partitions to persistent storage (e.g., vertex values, edge values, incoming messages). failure detection : using regular ping messages. recovery : the master reassigns graph partitions to the currently available workers; the workers reload their partition state from the most recent available checkpoint
56 great! where can I download pregel? pregel is proprietary, but many other systems are available : apache giraph, an open-source implementation of pregel (runs on standard hadoop infrastructure); graphx (Spark); gps; graph-lab / power-graph (asynchronous)
57 map-reduce vs. pregel map-reduce : requires passing the entire graph topology from one iteration to the next; intermediate results after every iteration are written to disk and then read back from disk; the programmer needs to write a driver program to support iterations, and another map-reduce program to check for convergence. pregel : each vertex sends its state only to its neighbors; graph topology information is not passed across iterations; main-memory based (leads to more efficient programs); the use of supersteps and the master-worker architecture makes programming easy
58 drawbacks of pregel
59 drawbacks of pregel 1. in the bulk synchronous parallel (BSP) model, performance is limited by the slowest machine; real-world graphs have power-law degree distributions, which may lead to a few highly-loaded servers 2. does not utilize already-computed partial results from the same iteration; several machine learning algorithms (e.g., belief propagation, expectation maximization, stochastic optimization) have higher accuracy and efficiency with asynchronous updates
60 drawbacks of pregel to address problem 1, partition the graph so as to (1) balance server workloads and (2) minimize communication across servers
61 synchronous vs. asynchronous pregel (synchronous system) : no worries about consistency; easy fault-tolerance, checkpoint at each barrier; bad performance when waiting for stragglers or when there is load imbalance. graph-lab (asynchronous system) : consistency of updates is harder (edge, vertex, sequential); fault-tolerance is harder (needs a snapshot with consistency); the asynchronous model can make faster progress and can balance load in scheduling to deal with load skew
62 summary
63 summary graph processing in large-scale distributed systems : dedicated systems offer better performance and better abstraction, and are easier to program. many open-source implementations are available : giraph, graphx, gps, graph-lab, x-stream, grace. synchronous vs. asynchronous systems
Why do we need graph processing? Community detection: suggest followers? Determine what products people will like Count how many people are in different communities (polling?) Graphs are Everywhere Group
More informationLecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Apache Flink
Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Apache Flink Matthias Schubert, Matthias Renz, Felix Borutta, Evgeniy Faerman, Christian Frey, Klaus Arthur Schmid, Daniyal Kazempour,
More informationAnnouncements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm
Announcements HW5 is due tomorrow 11pm Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create your AWS account before
More informationBSP, Pregel and the need for Graph Processing
BSP, Pregel and the need for Graph Processing Patrizio Dazzi, HPC Lab ISTI - CNR mail: patrizio.dazzi@isti.cnr.it web: http://hpc.isti.cnr.it/~dazzi/ National Research Council of Italy A need for Graph
More informationAn Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization
An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization Pedro Ribeiro (DCC/FCUP & CRACS/INESC-TEC) Part 1 Motivation and emergence of Network Science
More informationMI-PDB, MIE-PDB: Advanced Database Systems
MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:
More informationPREGEL. A System for Large-Scale Graph Processing
PREGEL A System for Large-Scale Graph Processing The Problem Large Graphs are often part of computations required in modern systems (Social networks and Web graphs etc.) There are many graph computing
More informationDatabase Systems CSE 414
Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) CSE 414 - Fall 2017 1 Announcements HW5 is due tomorrow 11pm HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create
More information15-388/688 - Practical Data Science: Big data and MapReduce. J. Zico Kolter Carnegie Mellon University Spring 2018
15-388/688 - Practical Data Science: Big data and MapReduce J. Zico Kolter Carnegie Mellon University Spring 2018 1 Outline Big data Some context in distributed computing map + reduce MapReduce MapReduce
More informationBig Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model
More informationMapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia
MapReduce & Resilient Distributed Datasets Yiqing Hua, Mengqi(Mandy) Xia Outline - MapReduce: - - Resilient Distributed Datasets (RDD) - - Motivation Examples The Design and How it Works Performance Motivation
More information15.1 Data flow vs. traditional network programming
CME 323: Distributed Algorithms and Optimization, Spring 2017 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Matroid and Stanford. Lecture 15, 5/22/2017. Scribed by D. Penner, A. Shoemaker, and
More informationResearch challenges in data-intensive computing The Stratosphere Project Apache Flink
Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive
More informationJure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah
Jure Leskovec (@jure) Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah 2 My research group at Stanford: Mining and modeling large social and information networks
More informationData-Intensive Distributed Computing
Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 8: Analyzing Graphs, Redux (1/2) March 20, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo
More informationToday s content. Resilient Distributed Datasets(RDDs) Spark and its data model
Today s content Resilient Distributed Datasets(RDDs) ------ Spark and its data model Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing -- Spark By Matei Zaharia,
More informationApache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context
1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes
More informationChisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique
Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique Prateek Dhawalia Sriram Kailasam D. Janakiram Distributed and Object Systems Lab Dept. of Comp.
More informationDistributed Computations MapReduce. adapted from Jeff Dean s slides
Distributed Computations MapReduce adapted from Jeff Dean s slides What we ve learnt so far Basic distributed systems concepts Consistency (sequential, eventual) Fault tolerance (recoverability, availability)
More informationHigh Performance Data Analytics: Experiences Porting the Apache Hama Graph Analytics Framework to an HPC InfiniBand Connected Cluster
High Performance Data Analytics: Experiences Porting the Apache Hama Graph Analytics Framework to an HPC InfiniBand Connected Cluster Summary Open source analytic frameworks, such as those in the Apache
More informationAnnouncements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems
Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic
More informationAutomatic Scaling Iterative Computations. Aug. 7 th, 2012
Automatic Scaling Iterative Computations Guozhang Wang Cornell University Aug. 7 th, 2012 1 What are Non-Iterative Computations? Non-iterative computation flow Directed Acyclic Examples Batch style analytics
More informationSurvey Paper on Traditional Hadoop and Pipelined Map Reduce
International Journal of Computational Engineering Research Vol, 03 Issue, 12 Survey Paper on Traditional Hadoop and Pipelined Map Reduce Dhole Poonam B 1, Gunjal Baisa L 2 1 M.E.ComputerAVCOE, Sangamner,
More informationCS 5220: Parallel Graph Algorithms. David Bindel
CS 5220: Parallel Graph Algorithms David Bindel 2017-11-14 1 Graphs Mathematically: G = (V, E) where E V V Convention: V = n and E = m May be directed or undirected May have weights w V : V R or w E :
More informationCS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following:
CS 47 Spring 27 Mike Lam, Professor Fault Tolerance Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapter 8) Various online
More informationLecture 4. Distributed sketching. Graph problems
1 / 21 Lecture 4. Distributed sketching. Graph problems Ricard Gavaldà MIRI Seminar on Data Streams, Spring 2015 Contents 2 / 21 1 Distributed sketching 2 An application: distance distributions in large
More informationLecture 11 Hadoop & Spark
Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem
More informationCSE 120 Principles of Operating Systems
CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number
More informationHIGH PERFORMANCE DATA ANALYTICS:
www.gdmissionsystems.com/hpc HIGH PERFORMANCE DATA ANALYTICS: Experiences Porting the Apache Hama Graph Analytics Framework to an HPC InfiniBand Connected Cluster 1. Summary Open source analytic frameworks,
More informationBatch & Stream Graph Processing with Apache Flink. Vasia
Batch & Stream Graph Processing with Apache Flink Vasia Kalavri vasia@apache.org @vkalavri Outline Distributed Graph Processing Gelly: Batch Graph Processing with Flink Gelly-Stream: Continuous Graph
More informationPopularity of Twitter Accounts: PageRank on a Social Network
Popularity of Twitter Accounts: PageRank on a Social Network A.D-A December 8, 2017 1 Problem Statement Twitter is a social networking service, where users can create and interact with 140 character messages,
More informationFLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568
FLAT DATACENTER STORAGE Paper-3 Presenter-Pratik Bhatt fx6568 FDS Main discussion points A cluster storage system Stores giant "blobs" - 128-bit ID, multi-megabyte content Clients and servers connected
More informationMapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1
MapReduce-II September 2013 Alberto Abelló & Oscar Romero 1 Knowledge objectives 1. Enumerate the different kind of processes in the MapReduce framework 2. Explain the information kept in the master 3.
More informationGraph Algorithms using Map-Reduce. Graphs are ubiquitous in modern society. Some examples: The hyperlink structure of the web
Graph Algorithms using Map-Reduce Graphs are ubiquitous in modern society. Some examples: The hyperlink structure of the web Graph Algorithms using Map-Reduce Graphs are ubiquitous in modern society. Some
More informationLecture 9: MIMD Architectures
Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.
More informationWrite a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical
Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or
More informationParallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)
Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication
More informationGraph Data Management
Graph Data Management Analysis and Optimization of Graph Data Frameworks presented by Fynn Leitow Overview 1) Introduction a) Motivation b) Application for big data 2) Choice of algorithms 3) Choice of
More informationThe Stratosphere Platform for Big Data Analytics
The Stratosphere Platform for Big Data Analytics Hongyao Ma Franco Solleza April 20, 2015 Stratosphere Stratosphere Stratosphere Big Data Analytics BIG Data Heterogeneous datasets: structured / unstructured
More informationCase Study 4: Collaborative Filtering. GraphLab
Case Study 4: Collaborative Filtering GraphLab Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin March 14 th, 2013 Carlos Guestrin 2013 1 Social Media
More informationLecture #3: PageRank Algorithm The Mathematics of Google Search
Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,
More information