modern database systems lecture 10 : large-scale graph processing


1 modern database systems lecture 10 : large-scale graph processing Aristides Gionis spring 18

2 timeline today : homework is due march 6 : homework out april 5, 9-1 : final exam april : homework due

3 graphs
- model used to represent items and their relations
- social networks
- knowledge and information networks: nodes store information, links associate information
  - citation network (directed acyclic), the web (directed), peer-to-peer networks, word networks, networks of trust, software graphs, bluetooth networks, home page/blog networks
- technology networks, e.g., the internet map
- biological networks
  - protein-protein interactions, gene regulation networks, gene co-expression, metabolic pathways, the food web, neural networks

4 graph properties
- graph mining is a heavily researched topic
- many properties are pervasive in all types of graphs
- some properties relevant to this lecture:
  - the degree distribution is heavy-tailed (hubs exist)
  - graphs are highly interconnected and distances are short (six degrees of separation, or rather 4.74 degrees of separation)

5 challenges in processing graphs (I)
- graphs can be very large, although size is not always the main bottleneck
  - e.g., the Facebook graph is on the order of a billion nodes and can fit in the memory of a large computer
- many graph tasks are inherently computationally expensive
  - a seemingly simple task: counting triangles in a graph
  - complex tasks: all-pairs shortest paths (matrix multiplication), graph cuts, graph partitioning, etc.
- many graph tasks are inherently sequential and therefore difficult to parallelize, e.g., finding shortest paths

6 real-world graphs are difficult to partition and parallelize because they lack locality
[figure: an idealized graph next to a real-world graph]

7 challenges in processing graphs (II)
- poor locality of memory access by graph algorithms
- I/O intensive: waits for memory fetches
- difficult to parallelize by data partitioning
- varying degree of parallelism over the course of execution
- varying degree of parallelism due to non-uniform graph topology (e.g., existence of hubs)

8 graph processing in large-scale platforms
- approach: use existing systems, e.g., hadoop, Spark
- example: PageRank is a standard graph algorithm
  - we already discussed PageRank in map-reduce and Spark
  - the problem reduces to an eigenvector computation
  - solved via iterative matrix-vector multiplication
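The iterative matrix-vector view of PageRank can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the function name `pagerank_power_iteration`, the adjacency-dict input, and the default `alpha = 0.15` are choices made here, and every vertex is assumed to have at least one out-link.

```python
def pagerank_power_iteration(out_links, alpha=0.15, iterations=100):
    """PageRank by repeated sparse matrix-vector multiplication.

    out_links maps each vertex to the list of vertices it links to.
    Assumption: every vertex has at least one out-link (no dangling nodes).
    """
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}           # start from the uniform vector
    for _ in range(iterations):
        new_rank = {v: alpha / n for v in out_links}  # teleportation term
        for v, targets in out_links.items():
            share = (1 - alpha) * rank[v] / len(targets)
            for t in targets:                         # one row of the multiplication
                new_rank[t] += share
        rank = new_rank
    return rank


# Hypothetical three-page web graph used only for illustration.
web = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = pagerank_power_iteration(web)
```

In a map-reduce setting each pass of the outer loop would be a separate job, which is where the per-iteration overhead comes from.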

9 graph processing in large-scale platforms
- drawbacks of using off-the-shelf systems for graph processing (such as hadoop, Spark, and others):
  - not intuitive (no graph-specific semantics), e.g., a graph problem maps to an eigenvector computation
  - hard to implement
  - unnecessarily slow
    - each iteration is a single job with lots of overhead
    - the graph structure is read from disk
    - the intermediate result is written to disk
- need for systems developed especially for graph processing

10 pregel
- introduced by Google researchers in 2009
- distributed system especially developed for large-scale graph processing
- intuitive API following the principle "think like a vertex"
- bulk synchronous parallel (BSP) execution model
- fault tolerance by checkpointing

11 bulk synchronous parallel (BSP)
[figure: processors alternate between local computation and communication; each superstep ends with a barrier synchronization]

12 vertex-centric BSP
- each vertex has an id, a value, a list of neighbor ids, and corresponding edge values
- each vertex is invoked in each superstep
- it can recompute its value and send messages to other vertices, which are delivered over superstep barriers
- advanced features: termination votes, combiners, aggregators
[figure: vertices 1, 2, 3 exchanging messages across supersteps i, i+1, i+2]

13 computation model
- superstep: the vertices compute in parallel; each vertex
  - receives messages sent in the previous superstep
  - executes the same user-defined function
  - modifies its value or that of its outgoing edges
  - sends messages to other vertices (to be received in the next superstep)
  - votes to halt if it has no further work to do
- termination condition
  - all vertices are simultaneously inactive
  - there are no messages in transit
[figure: vertex state machine: an active vertex becomes inactive when it votes to halt; an inactive vertex becomes active again when it receives a message]
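The computation model above can be imitated by a small single-machine loop. The sketch below is not Pregel's API: `run_bsp`, the shape of the `compute` callback, and the `max_supersteps` guard are all assumptions made for illustration. It shows the two ingredients from the slide, messages that only become visible after the barrier and termination once every vertex has voted to halt and no messages are in transit.

```python
from collections import defaultdict


def run_bsp(graph, init_value, compute, max_supersteps=50):
    """Toy vertex-centric BSP loop (hypothetical helper, not a real API).

    graph:   dict vertex -> list of out-neighbours
    compute: callback (vertex, value, messages, neighbours, superstep)
             returning (new_value, [(target, payload), ...], vote_to_halt)
    """
    values = {v: init_value(v) for v in graph}
    active = set(graph)            # every vertex starts active
    inbox = defaultdict(list)      # messages delivered in the current superstep
    superstep = 0
    while (active or inbox) and superstep < max_supersteps:
        outbox = defaultdict(list)
        # An incoming message reactivates a vertex that had voted to halt.
        for v in set(active) | set(inbox):
            new_value, messages, halt = compute(
                v, values[v], inbox.get(v, []), graph[v], superstep)
            values[v] = new_value
            for target, payload in messages:
                outbox[target].append(payload)
            if halt:
                active.discard(v)
            else:
                active.add(v)
        inbox = outbox             # barrier: messages become visible only now
        superstep += 1
    return values


def propagate_max(v, value, messages, neighbours, superstep):
    """Each vertex adopts the largest value it has seen and forwards changes."""
    new_value = max([value] + messages)
    changed = superstep == 0 or new_value != value
    out = [(n, new_value) for n in neighbours] if changed else []
    return new_value, out, True    # always vote to halt; messages wake vertices up


# Illustrative path graph a - b - c - d with made-up starting values.
path = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
start = {"a": 3, "b": 6, "c": 2, "d": 1}
result = run_bsp(path, lambda v: start[v], propagate_max)
# every vertex ends with the value 6
```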

14 example 1 : PageRank in pregel
- recall PageRank:
  - initially all vertices have rank score 1/N
  - iteration: each vertex sends a contribution of rank/n to its neighbors
    - rank: own rank of the page
    - n: number of neighbors of the vertex
  - then update the rank of each vertex to α/N + (1-α) Σ_i c_i
    - c_i: contribution received from vertex i, i = 1..n
- ideal setting for the pregel paradigm

15 example 1 : PageRank in pregel

    class PageRankVertex {
      void compute(iterator messages) {
        if (getsuperstep() > 0) {
          // recompute own PageRank from the neighbours' messages
          pagerank = sum(messages);
          setvertexvalue(pagerank);
        }
        if (getsuperstep() < k) {
          // send updated PageRank to each neighbour
          sendmessagetoallneighbors(pagerank / getnumoutedges());
        } else {
          votetohalt(); // terminate after k supersteps
        }
      }
    }
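For readers who want to run the idea, here is a minimal single-machine simulation of the same vertex program in Python. It is a sketch, not the Pregel API: the function name `pregel_pagerank` and the dictionary-based inbox and outbox are assumptions, and the update uses the damped formula α/N + (1-α) Σ c_i from the previous slide rather than the plain sum shown in the pseudocode.

```python
def pregel_pagerank(out_links, alpha=0.15, k=30):
    """Simulate k supersteps of vertex-centric PageRank.

    out_links maps each vertex to its out-neighbours; every vertex is
    assumed to have at least one out-link.
    """
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    inbox = {v: [] for v in out_links}
    for superstep in range(k + 1):
        outbox = {v: [] for v in out_links}
        for v, targets in out_links.items():    # conceptually runs in parallel
            if superstep > 0:
                # recompute own rank from the contributions received
                rank[v] = alpha / n + (1 - alpha) * sum(inbox[v])
            if superstep < k:
                # send an equal share of the rank to every out-neighbour
                for t in targets:
                    outbox[t].append(rank[v] / len(targets))
            # otherwise the vertex would vote to halt
        inbox = outbox                           # superstep barrier
    return rank
```

Updating `rank[v]` in place is safe here because a vertex only ever reads its own rank; everything it learns about other vertices arrives through the inbox.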

16 example 2 : finding connected components in a graph
- how would you implement it in the sequential model?
- how would you implement it in pregel?

17 example 2 : finding connected components in a graph
- assume each vertex has a unique label
- algorithm: propagate the vertex label to the neighbors, update the vertex label to the smallest label received, repeat until convergence
- in the end, all vertices of a component will have the same label
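The label-propagation algorithm can be sketched as a superstep loop in Python. The function name `connected_components` and the adjacency-dict input are assumptions for illustration; the graph is taken to be undirected, with each edge listed in both directions.

```python
def connected_components(adj):
    """Min-label propagation: each vertex ends with the smallest id in its component."""
    label = {v: v for v in adj}          # unique starting labels
    inbox = {v: [] for v in adj}
    active = set(adj)
    superstep = 0
    while active:
        outbox = {v: [] for v in adj}
        for v in active:
            smallest = min([label[v]] + inbox[v])
            # In superstep 0 everyone announces its label; later only on improvement.
            if superstep == 0 or smallest < label[v]:
                label[v] = smallest
                for u in adj[v]:
                    outbox[u].append(smallest)
        inbox = outbox
        # A vertex is active in the next superstep only if it received a message.
        active = {v for v in adj if inbox[v]}
        superstep += 1
    return label


# Illustrative graph: a path 1-2-3, an edge 4-5, and an isolated vertex 6.
example = {1: [2], 2: [1, 3], 3: [2], 4: [5], 5: [4], 6: []}
labels = connected_components(example)
# labels == {1: 1, 2: 1, 3: 1, 4: 4, 5: 4, 6: 6}
```

The number of supersteps grows with the diameter of the largest component, since a label travels one hop per superstep.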


22 example 3 : shortest path
- how would you compute it in the sequential model?
- how would you compute it in pregel?

23 example 3 : shortest path
- vertex value: distance to the source
- initially the source has value 0 and all other vertices have value ∞
- iteration
  - each vertex propagates its value to its neighbors
  - each vertex updates its value based on the values received and the edge distances
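A small Python simulation of this vertex program is given below. The name `pregel_sssp` and the input format are assumptions; edge weights are assumed to be non-negative so that the loop terminates. The initial value 0 for the source is modelled as a message to the source in the first superstep.

```python
import math


def pregel_sssp(edges, source):
    """Single-source shortest paths by message passing.

    edges maps each vertex u to a list of (v, weight) pairs.
    """
    dist = {v: math.inf for v in edges}
    inbox = {v: [] for v in edges}
    inbox[source] = [0]                       # the source learns distance 0
    while any(inbox.values()):                # stop when no messages are in transit
        outbox = {v: [] for v in edges}
        for v, messages in inbox.items():
            if messages and min(messages) < dist[v]:
                dist[v] = min(messages)       # found a shorter path
                for u, weight in edges[v]:
                    outbox[u].append(dist[v] + weight)
        inbox = outbox                        # superstep barrier
    return dist


# Illustrative weighted graph; vertex "d" is unreachable from "s".
roads = {
    "s": [("a", 1), ("b", 4)],
    "a": [("b", 2), ("c", 6)],
    "b": [("c", 3)],
    "c": [],
    "d": [],
}
distances = pregel_sssp(roads, "s")
# distances == {"s": 0, "a": 1, "b": 3, "c": 6, "d": inf}
```

Unlike Dijkstra's algorithm, a vertex may be updated several times (here `c` first gets 7, then 6), which is the price of processing all vertices in parallel.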


29 example 4 : distance profile
- estimate N(u,t), the number of vertices that are at distance at most t from u, simultaneously for all u and t (up to the diameter)
- how would you implement it in the sequential model?
- how would you implement it in pregel?

30 example 4 : distance profile, attempt 0
- value of vertex u: number of nodes N(u,t) at distance at most t
- initially vertex u has value N(u,0) = 1
- iteration (at round t)
  - vertex u propagates N(u,t)
  - vertex u updates N(u,t+1) = Σ_v N(v,t) + 1, summing over all N(v,t) received
- example on a graph with vertices u, v, z, w: every vertex x has N(x,0) = 1, N(x,1) = 3, N(x,2) = 7
- incorrect because of double-counting: the graph has only 4 vertices

36 example 4 : distance profile, attempt 1
- value of vertex u: set of nodes N(u,t) at distance at most t
- initially vertex u has value N(u,0) = {u}
- iteration (at round t)
  - vertex u propagates N(u,t)
  - vertex u updates N(u,t+1) = ∪_v N(v,t) ∪ {u}, taking the union over all N(v,t) received
- example on a graph with vertices u, v, z, w
  - N(u,0) = {u}, N(u,1) = {u,v,w}, N(u,2) = {u,v,w,z}
  - N(v,0) = {v}, N(v,1) = {u,v,z}, N(v,2) = {u,v,w,z}
  - N(z,0) = {z}, N(z,1) = {v,w,z}, N(z,2) = {u,v,w,z}
  - N(w,0) = {w}, N(w,1) = {u,w,z}, N(w,2) = {u,v,w,z}
- correct logic, but space grows quadratically
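The difference between propagating counts and propagating sets can be checked on the four-vertex example. The sketch below assumes the graph is the cycle u-v-z-w-u, which is what the neighbor sets N(·,1) listed on the slides imply; `distance_profile` and its `use_sets` flag are names invented for this illustration.

```python
def distance_profile(adj, rounds, use_sets):
    """Return {vertex: [size after round 0, 1, ..., rounds]}.

    use_sets=False propagates plain counts (double-counts shared neighbours);
    use_sets=True propagates vertex sets (exact, but needs more space).
    """
    state = {u: ({u} if use_sets else 1) for u in adj}
    size = len if use_sets else (lambda count: count)
    profile = {u: [size(state[u])] for u in adj}
    for _ in range(rounds):
        new_state = {}
        for u in adj:
            received = [state[v] for v in adj[u]]      # messages from neighbours
            if use_sets:
                new_state[u] = set().union(*received) | {u}
            else:
                new_state[u] = sum(received) + 1
        state = new_state
        for u in adj:
            profile[u].append(size(state[u]))
    return profile


# Assumed example graph: the 4-cycle u - v - z - w - u.
cycle = {"u": ["v", "w"], "v": ["u", "z"], "z": ["v", "w"], "w": ["u", "z"]}
by_count = distance_profile(cycle, 2, use_sets=False)   # u: [1, 3, 7]
by_set = distance_profile(cycle, 2, use_sets=True)      # u: [1, 3, 4]
```

With counts, vertex z is reached through both v and w and is therefore counted twice; with sets the union removes the duplicate, at the cost of storing up to one entry per vertex at every vertex.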

42 example 4 : distance profile, attempt 2 (final)
- requirements:
  - need to be able to estimate the size of a set
  - need to make updates under union operations
  - need to allocate a small amount of space at each vertex
  - approximate answers are OK
- solution: counter of distinct elements in a data stream
  - provides approximate answers and uses logarithmic space
  - method proposed by Flajolet and Martin, 1985
  - replace N(u,t) in the previous solution with a counter for distinct elements

43 estimating the number of distinct elements [Flajolet and Martin 1985, Alon Matias Szegedy 1]
- consider a bit vector of length O(log n)
- upon seeing element i, set:
  - the 1st bit with probability 1/2
  - the 2nd bit with probability 1/4
  - the j-th bit with probability 1/2^j
- important: bits are set deterministically for each element, so repeated occurrences of an element set the same bit
- let R be the index of the largest bit set
- return Y = 2^R

44 estimating the number of distinct elements
Theorem (Alon, Matias, Szegedy 1): for every c > 2, the previous algorithm computes a number Y using O(log n) memory bits, such that the probability that the ratio between Y and the true number of distinct elements is not between 1/c and c is at most 2/c
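A minimal Python sketch of such a counter follows. Several choices here are assumptions rather than part of the slides: the hash is SHA-256 truncated to 32 bits, the bit to set is the number of trailing zero bits of the hash (so bit j, counted from 0, is set with probability 1/2^(j+1)), and the names `fm_add`, `fm_union`, and `fm_estimate` are invented. The sketch is stored in a single integer, and the union of two sketches is a bitwise OR, which is exactly what the distance-profile algorithm needs.

```python
import hashlib


def trailing_zeros(x, width=32):
    """Number of trailing zero bits of x (width if x is 0)."""
    if x == 0:
        return width
    return (x & -x).bit_length() - 1


def fm_add(sketch, element):
    """Return the sketch with the bit chosen by hash(element) set."""
    digest = hashlib.sha256(str(element).encode()).digest()
    h = int.from_bytes(digest[:4], "big")      # deterministic 32-bit hash
    j = min(trailing_zeros(h), 31)             # bit j has probability 1/2^(j+1)
    return sketch | (1 << j)


def fm_union(a, b):
    """The sketch of a union of sets is the OR of the sketches."""
    return a | b


def fm_estimate(sketch):
    """Y = 2^R, where R is the index of the largest bit set (0 for an empty sketch)."""
    if sketch == 0:
        return 0
    return 2 ** (sketch.bit_length() - 1)


# Usage: two vertices merge their neighbourhood sketches.
left = 0
for name in ["u", "v", "w"]:
    left = fm_add(left, name)
right = 0
for name in ["w", "z"]:
    right = fm_add(right, name)
estimate = fm_estimate(fm_union(left, right))  # rough estimate of |{u, v, w, z}|
```

A single sketch is only accurate up to a constant factor, as the theorem states; practical implementations typically average many independent sketches, which is beyond this illustration.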

45 example 5 : counting the number of triangles in a graph
- useful for social-network analysis
- social networks tend to have many triangles
  - large clustering coefficient
  - the friend of a friend is likely to be a friend
- how would you compute it in the sequential model?
- how would you compute it in pregel?

46 example 5 : counting the number of triangles in a graph
- sequential model
  - method 1: check all triples of vertices; cubic, too expensive
  - method 2: for all pairs of neighbours of a vertex, check if the third edge is present
  - method 3 (approximate): sample connected triples (paths of length 2) and check if the third edge is present

47 example 5 : counting the number of triangles in a graph
- in pregel
  - assume that each vertex has a unique id
  - value of vertex u: a subset of neighbor ids
- first round
  - a vertex propagates its id to its neighbors
  - a vertex keeps a list of the received ids that are smaller than its own
- second round
  - a vertex sends its list of ids to its neighbors
  - if a vertex receives a list that contains the id of one of its neighbours that is larger than its own id, then it increments a global counter

48 example 5 : counting the number of triangles in a graph
[figure: example graph on vertices 0, 1, 2, 3]
- first round: vertex 1 keeps {0}; vertices 2 and 3 each keep {0,1}
- second round: the lists 1:{0}, 2:{0,1}, 3:{0,1} are sent to the neighbors
- vertex 0 sees id 1 in two lists, so there are 2 triangles
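The two rounds can be simulated directly in Python. The function name `pregel_triangle_count` is invented, and the example edges (0-1, 0-2, 0-3, 1-2, 1-3) are an assumption chosen to be consistent with the lists shown in the worked example. Each triangle a < b < c is counted exactly once: at vertex a, when it receives the list of c and finds b in it.

```python
def pregel_triangle_count(adj):
    """Count triangles with the two-round scheme; adj maps vertex -> set of neighbours."""
    # Round 1: every vertex receives its neighbours' ids and keeps the smaller ones.
    smaller = {v: {u for u in adj[v] if u < v} for v in adj}

    # Round 2: every vertex sends its list to all neighbours; a receiver counts
    # each listed id that is one of its own neighbours and larger than itself.
    triangles = 0                       # plays the role of the global counter
    for v in adj:
        for sender in adj[v]:
            for w in smaller[sender]:
                if w > v and w in adj[v]:
                    triangles += 1
    return triangles


# Assumed example graph with triangles (0, 1, 2) and (0, 1, 3).
graph = {0: {1, 2, 3}, 1: {0, 2, 3}, 2: {0, 1}, 3: {0, 1}}
count = pregel_triangle_count(graph)    # 2
```

In a real deployment the counter would be an aggregator, and the lists sent by high-degree vertices dominate the communication cost.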

52 implementation of pregel system

53 master-slave architecture
- vertices are partitioned and assigned to workers
  - default: hash-partitioning
  - custom partitioning possible
- master
  - maintains the status of the workers
  - recovers from faults of workers
  - provides a web-UI monitoring tool of job progress
- worker
  - processes its task
  - communicates with the other workers
[figure: a master coordinating workers 1, 2, 3]

54 execution of a pregel program
1. many copies of the program are executed on a cluster of machines
2. the master assigns a partition of the input to each worker
   - each worker loads the vertices and marks them as active
3. the master instructs each worker to perform a superstep
   - each worker loops through its active vertices and performs the computation for each vertex
   - messages are sent asynchronously, but are delivered before the end of the superstep
   - this step is repeated as long as any vertices are active or any messages are in transit
4. after the computation halts, the master may instruct each worker to save its portion of the graph

55 fault tolerance
- checkpointing
  - the master periodically instructs the workers to save the state of their partitions to persistent storage
  - e.g., vertex values, edge values, incoming messages
- failure detection
  - using regular ping messages
- recovery
  - the master reassigns graph partitions to the currently available workers
  - the workers reload their partition state from the most recent available checkpoint

56 great! where can I download pregel?
- pregel is proprietary, but many other systems are available
  - apache giraph is an open-source implementation of pregel (runs on standard hadoop infrastructure)
  - graphx (Spark)
  - gps
  - graph-lab / power-graph (asynchronous)

57 map-reduce vs. pregel
- map-reduce
  - requires passing the entire graph topology from one iteration to the next
  - intermediate results after every iteration are stored on disk and then read again from disk
  - the programmer needs to write a driver program to support iterations, and another map-reduce program to check for convergence
- pregel
  - each node sends its state only to its neighbors
  - graph topology information is not passed across iterations
  - main-memory based (leads to more efficient programs)
  - the use of supersteps and the master-client architecture makes programming easy

58 drawbacks of pregel
1. in the bulk synchronous parallel (BSP) model, performance is limited by the slowest machine
   - real-world graphs have a power-law degree distribution, which may lead to a few highly loaded servers
2. pregel does not utilize the already computed partial results from the same iteration
   - several machine learning algorithms (e.g., belief propagation, expectation maximization, stochastic optimization) have higher accuracy and efficiency with asynchronous updates
- to address problem 1, partition the graph so as to (1) balance server workloads and (2) minimize communication across servers
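The two partitioning goals can be measured on a toy graph. The sketch below is an assumption-laden illustration: `partition_stats` is an invented helper, vertex ids are taken to be non-negative integers, and hash partitioning is modelled as `id % num_workers`. It reports, per worker, the number of out-edges it must process (workload) and the number of edges whose endpoints live on different workers (communication).

```python
def partition_stats(adj, num_workers):
    """Hash-partition integer vertex ids and report load and cut edges.

    adj maps each vertex id to a list of out-neighbours.
    Returns (owner, load, cut) where load[w] is the number of out-edges
    handled by worker w and cut is the number of cross-worker edges.
    """
    owner = {v: v % num_workers for v in adj}     # stand-in for hash(id) mod N
    load = [0] * num_workers
    cut = 0
    for v, neighbours in adj.items():
        load[owner[v]] += len(neighbours)
        cut += sum(1 for u in neighbours if owner[u] != owner[v])
    return owner, load, cut


# A star graph: hub 0 connected to leaves 1..8, edges listed in both directions.
star = {0: list(range(1, 9))}
star.update({leaf: [0] for leaf in range(1, 9)})
owner, load, cut = partition_stats(star, 2)
# load == [12, 4]: the worker that owns the hub does three times the work
# cut == 8: half of the 16 directed edges cross between the two workers
```

Even though the vertices are split evenly (five and four), the edge workload is not, which is the effect of hubs described in the first drawback.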

61 synchronous vs. asynchronous
- pregel: synchronous system
  - no worries about consistency
  - easy fault tolerance: checkpoint at each barrier
  - bad performance when waiting for stragglers or when there is load imbalance
- graph-lab: asynchronous system
  - consistency of updates is harder (edge, vertex, sequential)
  - fault tolerance is harder (needs a snapshot with consistency)
  - the asynchronous model can make faster progress
  - can balance load in scheduling to deal with load skew

62 summary
- graph processing in large-scale distributed systems
- dedicated systems offer better performance, better abstraction, and are easier to program
- many open-source implementations are available: giraph, graphx, gps, graph-lab, x-stream, grace
- synchronous vs. asynchronous systems


More information

Map-Reduce. Marco Mura 2010 March, 31th

Map-Reduce. Marco Mura 2010 March, 31th Map-Reduce Marco Mura (mura@di.unipi.it) 2010 March, 31th This paper is a note from the 2009-2010 course Strumenti di programmazione per sistemi paralleli e distribuiti and it s based by the lessons of

More information

Course : Data mining

Course : Data mining Course : Data mining Lecture : Mining data streams Aristides Gionis Department of Computer Science Aalto University visiting in Sapienza University of Rome fall 2016 reading assignment LRU book: chapter

More information

Graph Processing & Bulk Synchronous Parallel Model

Graph Processing & Bulk Synchronous Parallel Model Graph Processing & Bulk Synchronous Parallel Model CompSci 590.03 Instructor: Ashwin Machanavajjhala Lecture 14 : 590.02 Spring 13 1 Recap: Graph Algorithms Many graph algorithms need iterafve computafon

More information

Distributed Computation Models

Distributed Computation Models Distributed Computation Models SWE 622, Spring 2017 Distributed Software Engineering Some slides ack: Jeff Dean HW4 Recap https://b.socrative.com/ Class: SWE622 2 Review Replicating state machines Case

More information

Apache Flink- A System for Batch and Realtime Stream Processing

Apache Flink- A System for Batch and Realtime Stream Processing Apache Flink- A System for Batch and Realtime Stream Processing Lecture Notes Winter semester 2016 / 2017 Ludwig-Maximilians-University Munich Prof Dr. Matthias Schubert 2016 Introduction to Apache Flink

More information

The Future of High Performance Computing

The Future of High Performance Computing The Future of High Performance Computing Randal E. Bryant Carnegie Mellon University http://www.cs.cmu.edu/~bryant Comparing Two Large-Scale Systems Oakridge Titan Google Data Center 2 Monolithic supercomputer

More information

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS

CMU SCS CMU SCS Who: What: When: Where: Why: CMU SCS Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB s C. Faloutsos A. Pavlo Lecture#23: Distributed Database Systems (R&G ch. 22) Administrivia Final Exam Who: You What: R&G Chapters 15-22

More information

Webinar Series TMIP VISION

Webinar Series TMIP VISION Webinar Series TMIP VISION TMIP provides technical support and promotes knowledge and information exchange in the transportation planning and modeling community. Today s Goals To Consider: Parallel Processing

More information

Graphs / Networks CSE 6242/ CX Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech

Graphs / Networks CSE 6242/ CX Centrality measures, algorithms, interactive applications. Duen Horng (Polo) Chau Georgia Tech CSE 6242/ CX 4242 Graphs / Networks Centrality measures, algorithms, interactive applications Duen Horng (Polo) Chau Georgia Tech Partly based on materials by Professors Guy Lebanon, Jeffrey Heer, John

More information

Graphs! December 1, 2014

Graphs! December 1, 2014 Graphs! December 1, 2014 Announcements This is our last technical lecture! Thank you for all your great ques@ons and interes@ng interac@ons Next lecture is our final review Send ques@ons!!! All exam logis@cs

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

Why do we need graph processing?

Why do we need graph processing? Why do we need graph processing? Community detection: suggest followers? Determine what products people will like Count how many people are in different communities (polling?) Graphs are Everywhere Group

More information

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Apache Flink

Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Apache Flink Lecture Notes to Big Data Management and Analytics Winter Term 2017/2018 Apache Flink Matthias Schubert, Matthias Renz, Felix Borutta, Evgeniy Faerman, Christian Frey, Klaus Arthur Schmid, Daniyal Kazempour,

More information

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm Announcements HW5 is due tomorrow 11pm Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create your AWS account before

More information

BSP, Pregel and the need for Graph Processing

BSP, Pregel and the need for Graph Processing BSP, Pregel and the need for Graph Processing Patrizio Dazzi, HPC Lab ISTI - CNR mail: patrizio.dazzi@isti.cnr.it web: http://hpc.isti.cnr.it/~dazzi/ National Research Council of Italy A need for Graph

More information

An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization

An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization An Exploratory Journey Into Network Analysis A Gentle Introduction to Network Science and Graph Visualization Pedro Ribeiro (DCC/FCUP & CRACS/INESC-TEC) Part 1 Motivation and emergence of Network Science

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

PREGEL. A System for Large-Scale Graph Processing

PREGEL. A System for Large-Scale Graph Processing PREGEL A System for Large-Scale Graph Processing The Problem Large Graphs are often part of computations required in modern systems (Social networks and Web graphs etc.) There are many graph computing

More information

Database Systems CSE 414

Database Systems CSE 414 Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) CSE 414 - Fall 2017 1 Announcements HW5 is due tomorrow 11pm HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create

More information

15-388/688 - Practical Data Science: Big data and MapReduce. J. Zico Kolter Carnegie Mellon University Spring 2018

15-388/688 - Practical Data Science: Big data and MapReduce. J. Zico Kolter Carnegie Mellon University Spring 2018 15-388/688 - Practical Data Science: Big data and MapReduce J. Zico Kolter Carnegie Mellon University Spring 2018 1 Outline Big data Some context in distributed computing map + reduce MapReduce MapReduce

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia MapReduce & Resilient Distributed Datasets Yiqing Hua, Mengqi(Mandy) Xia Outline - MapReduce: - - Resilient Distributed Datasets (RDD) - - Motivation Examples The Design and How it Works Performance Motivation

More information

15.1 Data flow vs. traditional network programming

15.1 Data flow vs. traditional network programming CME 323: Distributed Algorithms and Optimization, Spring 2017 http://stanford.edu/~rezab/dao. Instructor: Reza Zadeh, Matroid and Stanford. Lecture 15, 5/22/2017. Scribed by D. Penner, A. Shoemaker, and

More information

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Research challenges in data-intensive computing The Stratosphere Project Apache Flink Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive

More information

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah

Jure Leskovec Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah Jure Leskovec (@jure) Including joint work with Y. Perez, R. Sosič, A. Banarjee, M. Raison, R. Puttagunta, P. Shah 2 My research group at Stanford: Mining and modeling large social and information networks

More information

Data-Intensive Distributed Computing

Data-Intensive Distributed Computing Data-Intensive Distributed Computing CS 451/651 431/631 (Winter 2018) Part 8: Analyzing Graphs, Redux (1/2) March 20, 2018 Jimmy Lin David R. Cheriton School of Computer Science University of Waterloo

More information

Today s content. Resilient Distributed Datasets(RDDs) Spark and its data model

Today s content. Resilient Distributed Datasets(RDDs) Spark and its data model Today s content Resilient Distributed Datasets(RDDs) ------ Spark and its data model Resilient Distributed Datasets: A Fault- Tolerant Abstraction for In-Memory Cluster Computing -- Spark By Matei Zaharia,

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique

Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique Chisel++: Handling Partitioning Skew in MapReduce Framework Using Efficient Range Partitioning Technique Prateek Dhawalia Sriram Kailasam D. Janakiram Distributed and Object Systems Lab Dept. of Comp.

More information

Distributed Computations MapReduce. adapted from Jeff Dean s slides

Distributed Computations MapReduce. adapted from Jeff Dean s slides Distributed Computations MapReduce adapted from Jeff Dean s slides What we ve learnt so far Basic distributed systems concepts Consistency (sequential, eventual) Fault tolerance (recoverability, availability)

More information

High Performance Data Analytics: Experiences Porting the Apache Hama Graph Analytics Framework to an HPC InfiniBand Connected Cluster

High Performance Data Analytics: Experiences Porting the Apache Hama Graph Analytics Framework to an HPC InfiniBand Connected Cluster High Performance Data Analytics: Experiences Porting the Apache Hama Graph Analytics Framework to an HPC InfiniBand Connected Cluster Summary Open source analytic frameworks, such as those in the Apache

More information

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic

More information

Automatic Scaling Iterative Computations. Aug. 7 th, 2012

Automatic Scaling Iterative Computations. Aug. 7 th, 2012 Automatic Scaling Iterative Computations Guozhang Wang Cornell University Aug. 7 th, 2012 1 What are Non-Iterative Computations? Non-iterative computation flow Directed Acyclic Examples Batch style analytics

More information

Survey Paper on Traditional Hadoop and Pipelined Map Reduce

Survey Paper on Traditional Hadoop and Pipelined Map Reduce International Journal of Computational Engineering Research Vol, 03 Issue, 12 Survey Paper on Traditional Hadoop and Pipelined Map Reduce Dhole Poonam B 1, Gunjal Baisa L 2 1 M.E.ComputerAVCOE, Sangamner,

More information

CS 5220: Parallel Graph Algorithms. David Bindel

CS 5220: Parallel Graph Algorithms. David Bindel CS 5220: Parallel Graph Algorithms David Bindel 2017-11-14 1 Graphs Mathematically: G = (V, E) where E V V Convention: V = n and E = m May be directed or undirected May have weights w V : V R or w E :

More information

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following:

CS 470 Spring Fault Tolerance. Mike Lam, Professor. Content taken from the following: CS 47 Spring 27 Mike Lam, Professor Fault Tolerance Content taken from the following: "Distributed Systems: Principles and Paradigms" by Andrew S. Tanenbaum and Maarten Van Steen (Chapter 8) Various online

More information

Lecture 4. Distributed sketching. Graph problems

Lecture 4. Distributed sketching. Graph problems 1 / 21 Lecture 4. Distributed sketching. Graph problems Ricard Gavaldà MIRI Seminar on Data Streams, Spring 2015 Contents 2 / 21 1 Distributed sketching 2 An application: distance distributions in large

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

CSE 120 Principles of Operating Systems

CSE 120 Principles of Operating Systems CSE 120 Principles of Operating Systems Spring 2018 Lecture 15: Multicore Geoffrey M. Voelker Multicore Operating Systems We have generally discussed operating systems concepts independent of the number

More information

HIGH PERFORMANCE DATA ANALYTICS:

HIGH PERFORMANCE DATA ANALYTICS: www.gdmissionsystems.com/hpc HIGH PERFORMANCE DATA ANALYTICS: Experiences Porting the Apache Hama Graph Analytics Framework to an HPC InfiniBand Connected Cluster 1. Summary Open source analytic frameworks,

More information

Batch & Stream Graph Processing with Apache Flink. Vasia

Batch & Stream Graph Processing with Apache Flink. Vasia Batch & Stream Graph Processing with Apache Flink Vasia Kalavri vasia@apache.org @vkalavri Outline Distributed Graph Processing Gelly: Batch Graph Processing with Flink Gelly-Stream: Continuous Graph

More information

Popularity of Twitter Accounts: PageRank on a Social Network

Popularity of Twitter Accounts: PageRank on a Social Network Popularity of Twitter Accounts: PageRank on a Social Network A.D-A December 8, 2017 1 Problem Statement Twitter is a social networking service, where users can create and interact with 140 character messages,

More information

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568

FLAT DATACENTER STORAGE. Paper-3 Presenter-Pratik Bhatt fx6568 FLAT DATACENTER STORAGE Paper-3 Presenter-Pratik Bhatt fx6568 FDS Main discussion points A cluster storage system Stores giant "blobs" - 128-bit ID, multi-megabyte content Clients and servers connected

More information

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1 MapReduce-II September 2013 Alberto Abelló & Oscar Romero 1 Knowledge objectives 1. Enumerate the different kind of processes in the MapReduce framework 2. Explain the information kept in the master 3.

More information

Graph Algorithms using Map-Reduce. Graphs are ubiquitous in modern society. Some examples: The hyperlink structure of the web

Graph Algorithms using Map-Reduce. Graphs are ubiquitous in modern society. Some examples: The hyperlink structure of the web Graph Algorithms using Map-Reduce Graphs are ubiquitous in modern society. Some examples: The hyperlink structure of the web Graph Algorithms using Map-Reduce Graphs are ubiquitous in modern society. Some

More information

Lecture 9: MIMD Architectures

Lecture 9: MIMD Architectures Lecture 9: MIMD Architectures Introduction and classification Symmetric multiprocessors NUMA architecture Clusters Zebo Peng, IDA, LiTH 1 Introduction A set of general purpose processors is connected together.

More information

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical

Write a technical report Present your results Write a workshop/conference paper (optional) Could be a real system, simulation and/or theoretical Identify a problem Review approaches to the problem Propose a novel approach to the problem Define, design, prototype an implementation to evaluate your approach Could be a real system, simulation and/or

More information

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides)

Parallel Computing. Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Computing 2012 Slides credit: M. Quinn book (chapter 3 slides), A Grama book (chapter 3 slides) Parallel Algorithm Design Outline Computational Model Design Methodology Partitioning Communication

More information

Graph Data Management

Graph Data Management Graph Data Management Analysis and Optimization of Graph Data Frameworks presented by Fynn Leitow Overview 1) Introduction a) Motivation b) Application for big data 2) Choice of algorithms 3) Choice of

More information

The Stratosphere Platform for Big Data Analytics

The Stratosphere Platform for Big Data Analytics The Stratosphere Platform for Big Data Analytics Hongyao Ma Franco Solleza April 20, 2015 Stratosphere Stratosphere Stratosphere Big Data Analytics BIG Data Heterogeneous datasets: structured / unstructured

More information

Case Study 4: Collaborative Filtering. GraphLab

Case Study 4: Collaborative Filtering. GraphLab Case Study 4: Collaborative Filtering GraphLab Machine Learning/Statistics for Big Data CSE599C1/STAT592, University of Washington Carlos Guestrin March 14 th, 2013 Carlos Guestrin 2013 1 Social Media

More information

Lecture #3: PageRank Algorithm The Mathematics of Google Search

Lecture #3: PageRank Algorithm The Mathematics of Google Search Lecture #3: PageRank Algorithm The Mathematics of Google Search We live in a computer era. Internet is part of our everyday lives and information is only a click away. Just open your favorite search engine,

More information