Sublinear Models for Streaming and/or Distributed Data

Size: px
Start display at page:

Download "Sublinear Models for Streaming and/or Distributed Data"

Transcription

1 Sublinear Models for Streaming and/or Distributed Data Qin Zhang Guest lecture in B649 Feb. 3,

2 Now about the Big Data Big data is everywhere : over 2.5 petabytes of sales transactions : an index of over 19 billion web pages : over 40 billion of pictures

3 Now about the Big Data Big data is everywhere : over 2.5 petabytes of sales transactions : an index of over 19 billion web pages : over 40 billion of pictures... Magazine covers Nature 06 Nature 08 CACM 08 Economist

4 Source and Challenge Source Retailer databases Logistics, financial & health data Social network Pictures by mobile devices Internet of Things New forms of scientific data 3-1

5 Source and Challenge Source Retailer databases Logistics, financial & health data Social network Pictures by mobile devices Internet of Things New forms of scientific data Challenge Volume Velocity Variety (Documents, Stock records, Personal profiles, Photographs, Audio & Video, 3D models, Location data,... ) 3-2

6 Source and Challenge Source Retailer databases Logistics, financial & health data Social network Pictures by mobile devices Internet of Things New forms of scientific data Challenge Volume Velocity } The focus of algorithm design Variety (Documents, Stock records, Personal profiles, Photographs, Audio & Video, 3D models, Location data,... ) 3-3

7 What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,

8 What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? 4-2

9 What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? Read some of them. Sublinear in time 4-3

10 What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? Read some of them. The data is too big to fit in main memory. What can we do? Sublinear in time 4-4

11 What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? Read some of them. The data is too big to fit in main memory. What can we do? Sublinear in time Store on the disk (page/block access) Sublinear in I/O 4-5

12 What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? Read some of them. The data is too big to fit in main memory. What can we do? Store on the disk (page/block access) Throw some of them away. Sublinear in time Sublinear in I/O Sublinear in space 4-6

13 What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? Read some of them. The data is too big to fit in main memory. What can we do? Store on the disk (page/block access) Throw some of them away. Sublinear in time Sublinear in I/O Sublinear in space The data is too big to be stored in a single machine. What can we do if we do not want to throw them away? 4-7

14 What is the meaning of Big Data IN THEORY? We don t define Big Data in terms of TB, PB, EB,... The data is stored there, but no time to read them all. What can we do? Read some of them. The data is too big to fit in main memory. What can we do? Sublinear in time Store on the disk (page/block access) Throw some of them away. Sublinear in I/O Sublinear in space The data is too big to be stored in a single machine. What can we do if we do not want to throw them away? Store in multiple machines, which collaborate via communication Sublinear in communication 4-8

15 What do we mean by sublinear? Time/space/communication spent is o(input size) 5-1

16 Conceretly, theory folks talk about the following... Sublinear time algorithms Sublinear time approximation algorithms Property testing 6-1

17 Conceretly, theory folks talk about the following... Sublinear time algorithms Sublinear time approximation algorithms Property testing Sublinear space algorithms Data stream algorithms 6-2

18 Conceretly, theory folks talk about the following... Sublinear time algorithms Sublinear time approximation algorithms Property testing Sublinear space algorithms Data stream algorithms Sublinear communication algorithms Multiparty communication protocols/algorithms (special case: MapReduce, BSP, MPI,... ) 6-3

19 Conceretly, theory folks talk about the following... Sublinear time algorithms Sublinear time approximation algorithms Property testing Sublinear space algorithms Data stream algorithms Sublinear communication algorithms Multiparty communication protocols/algorithms (special case: MapReduce, BSP, MPI,... ) Sublinear I/O algorithms External memory data structures/algorithms 6-4

20 Conceretly, theory folks talk about the following... Sublinear time algorithms Sublinear time approximation algorithms Property testing Sublinear space algorithms Data stream algorithms (today) Sublinear communication algorithms (today) Multiparty communication protocols/algorithms (special case: MapReduce, BSP, MPI,... ) Sublinear I/O algorithms External memory data structures/algorithms 6-5

21 Sublinear in space The data stream model (Alon, Matias and Szegedy 1996) RAM CPU 7-1

22 Sublinear in space The data stream model (Alon, Matias and Szegedy 1996) RAM Applications Internet Router. Packets limited space CPU Router The router wants to maintain some statistics on data. E.g., want to detect anomalies for security. 7-2 Stock data, ad auction, flight logs on tapes, etc.

23 Why hard? You do see everything but then forget! Game 1: A sequence of numbers 8-1

24 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

25 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

26 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

27 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

28 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

29 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

30 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

31 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

32 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

33 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

34 Why hard? You do see everything but then forget! Game 1: A sequence of numbers

35 Why hard? You do see everything but then forget! Game 1: A sequence of numbers Q: What s the median? 8-13

36 Why hard? You do see everything but then forget! Game 1: A sequence of numbers Q: What s the median? A:

37 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul 9-1

38 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Alice and Bob become friends 9-2

39 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Carol and Eva become friends 9-3

40 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Eva and Bob become friends 9-4

41 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Dave and Paul become friends 9-5

42 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Alice and Paul become friends 9-6

43 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Eva and Bob unfriends 9-7

44 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Alice and Dave become friends 9-8

45 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Bob and Paul become friends 9-9

46 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Dave and Paul unfriends 9-10

47 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Dave and Carol become friends 9-11

48 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Q: Are Eva and Bob connected by friends? 9-12

49 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Q: Are Eva and Bob connected by friends? A: YES. Eva Carol Dave Alice Bob 9-13

50 Why hard? Cannot store everything. Game 1: A sequence of numbers Q: What s the median? A: 33 Game 2: Relationships between Alice, Bob, Carol, Dave, Eva and Paul Q: Are Eva and Bob connected by friends? A: YES. Eva Carol Dave Alice Bob Have to allow approx/randomization given a small memory. 9-14

51 A toy example: Reservoir Sampling Tasks: Find a uniform sample from a stream of unknown length, can we do it in O(1) space? 10-1

52 A toy example: Reservoir Sampling Tasks: Find a uniform sample from a stream of unknown length, can we do it in O(1) space? Algorithm: Store 1-st item. When the i-th (i > 1) item arrives With probability 1/i, replace the current sample; With probability 1 1/i, throw it away. 10-2

53 A toy example: Reservoir Sampling Tasks: Find a uniform sample from a stream of unknown length, can we do it in O(1) space? Algorithm: Store 1-st item. When the i-th (i > 1) item arrives With probability 1/i, replace the current sample; With probability 1 1/i, throw it away. Correctness: each item is included in the final sample w.p. 1 i (1 1 i+1 )... (1 1 n ) = 1 n (n: total # items) Space: O(1) 10-3

54 Sublinear in communication Communication time, energy, bandwidth,... Input Output Sensor networks Map Shuffle Reduce The MapReduce model. E.g., Hadoop, Google MapReduce. The BSP model. E.g., Apache Giraph, Pregel. Data in the cloud 11-1

55 Sublinear in communication Communication time, energy, bandwidth,... Input Output Sensor networks Map Shuffle Reduce The MapReduce model. E.g., Hadoop, Google MapReduce. The BSP model. E.g., Apache Giraph, Pregel. Abstraction Data in the cloud 11-2

56 Sublinear in communication Communication time, energy, bandwidth,... Input Output Sensor networks Map Shuffle Reduce The MapReduce model. E.g., Hadoop, Google MapReduce. The BSP model. E.g., Apache Giraph, Pregel. Data in the cloud Abstraction = C S 1 S 2 S 3 S k 11-3

57 Why hard? You do not have a global view of the data. Let s think about the graph connectivity problem: k machine each holds a set of edges of a graph. Goal: compute whether the graph is connected. 12-1

58 Why hard? You do not have a global view of the data. Let s think about the graph connectivity problem: k machine each holds a set of edges of a graph. Goal: compute whether the graph is connected. 12-2

59 Why hard? You do not have a global view of the data. Let s think about the graph connectivity problem: k machine each holds a set of edges of a graph. Goal: compute whether the graph is connected. C S 1 S 2 S 3 S k 12-3

60 Why hard? You do not have a global view of the data. Let s think about the graph connectivity problem: k machine each holds a set of edges of a graph. Goal: compute whether the graph is connected. C S 1 S 2 S 3 S k A trivial solution: each machine sends a local spanning tree to the first machine. Cost O(kn log n) bits. 12-4

61 Why hard? You do not have a global view of the data. Let s think about the graph connectivity problem: k machine each holds a set of edges of a graph. Goal: compute whether the graph is connected. C S 1 S 2 S 3 S k A trivial solution: each machine sends a local spanning tree to the first machine. Cost O(kn log n) bits. Can we do better, e.g., o(kn) bits of communication? 12-5

62 Why hard? You do not have a global view of the data. Let s think about the graph connectivity problem: k machine each holds a set of edges of a graph. Goal: compute whether the graph is connected. C S 1 S 2 S 3 S k A trivial solution: each machine sends a local spanning tree to the first machine. Cost O(kn log n) bits Can we do better, e.g., o(kn) bits of communication? What if the graph is node partitioned among the k machines? That is, each node is stored in 1 machine with all adjancent edges.

63 Problems Statistical problems Frequency moments F p F 0 : #distinct elements F 2 : size of self-join Heavy hitters Quantile Entropy

64 Problems Statistical problems Frequency moments F p F 0 : #distinct elements F 2 : size of self-join Heavy hitters Quantile Entropy... Graph problems Connectivity Bipartiteness Counting triangles Matching Minimum spanning tree

65 Problems Statistical problems Frequency moments F p F 0 : #distinct elements F 2 : size of self-join Heavy hitters Quantile Entropy... Graph problems Connectivity Bipartiteness Counting triangles Matching Minimum spanning tree... L p regression Numerical linear algebra Low-rank approximation

66 Problems Statistical problems 13-4 Frequency moments F p F 0 : #distinct elements F 2 : size of self-join Heavy hitters Quantile Entropy... Numerical linear algebra L p regression Low-rank approximation... Graph problems Connectivity Bipartiteness Counting triangles Matching Minimum spanning tree... DB queries Strings Geometry problems Clustering Conjuntive queries Edit distance Earth-Mover Distance... Longest increasing sequence

67 Do the streaming and the multi-party computation model capture all? 14-1

68 Do the streaming and the multi-party computation model capture all? Sensor network: each is monitoring a stream of data 14-2

69 Do the streaming and the multi-party computation model capture all? Sensor network: each is monitoring a stream of data Data can be noisy Music, Docs,... After compressions 14-3

70 Distributed monitoring: sublinear space and comm. C S 1 S 2 S 3 S k The coordinator needs to maintain some function defined on the union of k streams at any time 15-1

71 Distributed monitoring: sublinear space and comm. C S 1 S 2 S 3 S k The coordinator needs to maintain some function defined on the union of k streams at any time e.g., total # items 15-2

72 Distributed monitoring: sublinear space and comm. C S 1 S 2 S 3 S k The coordinator needs to maintain some function defined on the union of k streams at any time e.g., total # items Goal: minimize communication (main), space and processing time per item at sites. 15-3

73 Distributed monitoring: sublinear space and comm. C S 1 S 2 S 3 S k The coordinator needs to maintain some function defined on the union of k streams at any time e.g., total # items Almost always allow approximation Goal: minimize communication (main), space and processing time per item at sites. 15-4

74 Reservoir Sampling doesn t work Tasks: Find a uniform sample from a stream of unknown length, can we do it in O(1) space? Algorithm: Store 1-st item. When the i-th (i > 1) item arrives With probability 1/i, replace the current sample; With probability 1 1/i, throw it away. C Why this doesn t work? S 1 S 2 S 3 S k 16-1

75 Random sampling Initialize i = 0 In epoch i: Sites send every item w.pr. 2 i Coordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. upper lower C S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) 17-1

76 Random sampling Initialize i = 0 In epoch i: upper lower C Sites send every item w.pr. 2 i Coordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the coordinator broadcasts to k sites advance to epoch i i + 1 Discards the upper sample Randomly splits the lower sample into a new lower and an upper sample 17-2

77 Random sampling Initialize i = 0 In epoch i: upper lower C Sites send every item w.pr. 2 i Coordinator maintains a lower sample and an upper sample: each received item goes to either with equal prob. S 1 S 2 S 3 S k (Each item is included in lower sample w.pr. 2 (i+1) ) When the lower sample reaches size s, the coordinator broadcasts to k sites advance to epoch i i + 1 Discards the upper sample Randomly splits the lower sample into a new lower and an upper sample Correctness: (1): In epoch i, each item is maintained in C w. pr. 2 i (2): Always s items are maintained in C 17-3

78 Noisy data Datasets may potentially be noisy If we consider similar elements as one element, then what is the number of distinct elements? C S 1 S 2 S 3 S k John Smith, 800 Mountain Av springfield Joe Smith, 100 Mount Av Springfield Joseph Smith, 800 Mt. Road Springfield Joe Smith, 800 Mt. Road Springfield 18-1

79 Noisy data (cont.) Goal: compute some good estimation while minimize communication and #rounds C S 1 S 2 S 3 S k In the coordinator model, how can we perform noise-resilient computation, given that each site employs a comparison metric which can tell whether two elements are the same or not? 19-1

80 Noisy data (cont.) Goal: compute some good estimation while minimize communication and #rounds C S 1 S 2 S 3 S k In the coordinator model, how can we perform noise-resilient computation, given that each site employs a comparison metric which can tell whether two elements are the same or not? The design of comparison metric is a separate issue. Can use some distance functions. 19-2

81 Noisy data (cont.) Goal: compute some good estimation while minimize communication and #rounds C S 1 S 2 S 3 S k In the coordinator model, how can we perform noise-resilient computation, given that each site employs a comparison metric which can tell whether two elements are the same or not? The design of comparison metric is a separate issue. Can use some distance functions Many techniques for noise-free datasets cannot be used. Often needs very different approaches.

82 Summary for today We have introduced a few models for big data : the streaming model, multiparty computation, distributed monitoring, etc. 20-1

83 Summary for today We have introduced a few models for big data : the streaming model, multiparty computation, distributed monitoring, etc. We have talked about why algorithmic design in these models/paradigms are difficult. 20-2

84 Summary for today We have introduced a few models for big data : the streaming model, multiparty computation, distributed monitoring, etc. We have talked about why algorithmic design in these models/paradigms are difficult. We have given a few examples. 20-3

85 Thank you! Good luck with B649! 21-1

86 Attribution Some examples are borrowed from Andrew McGregor s course CS711S12/index.html 22-1

87 Maintain a sample for Sliding Windows Tasks: Find a uniform sample from the last w items. 23-1

88 Maintain a sample for Sliding Windows Tasks: Find a uniform sample from the last w items. Algorithm: For each x i, we pick a random value v i (0, 1). In a window < x j w+1,..., x j >, return value x i with smallest v i. To do this, maintain the set of all x i in sliding window whose v i value is minimal among subsequent values. 23-2

89 Maintain a sample for Sliding Windows Tasks: Find a uniform sample from the last w items. Algorithm: For each x i, we pick a random value v i (0, 1). In a window < x j w+1,..., x j >, return value x i with smallest v i. To do this, maintain the set of all x i in sliding window whose v i value is minimal among subsequent values. Correctness: Obvious. Space (expected): 1/w + 1/(w 1) /1 = log w. 23-3

B561 Advanced Database Concepts Streaming Model. Qin Zhang 1-1

B561 Advanced Database Concepts Streaming Model. Qin Zhang 1-1 B561 Advanced Database Concepts 2.2. Streaming Model Qin Zhang 1-1 Data Streams Continuous streams of data elements (massive possibly unbounded, rapid, time-varying) Some examples: 1. network monitoring

More information

B561 Advanced Database Concepts. 6. Streaming Algorithms. Qin Zhang 1-1

B561 Advanced Database Concepts. 6. Streaming Algorithms. Qin Zhang 1-1 B561 Advanced Database Concepts 6. Streaming Algorithms Qin Zhang 1-1 The model and challenge The data stream model (Alon, Matias and Szegedy 1996) a n a 2 a 1 RAM CPU Why hard? Cannot store everything.

More information

CMPSCI 711: More Advanced Algorithms

CMPSCI 711: More Advanced Algorithms CMPSCI 711: More Advanced Algorithms Section 0: Overview Andrew McGregor Last Compiled: April 29, 2012 1/11 Goal and Focus of Class Goal: Learn general techniques and gain experience designing and analyzing

More information

Mining data streams. Irene Finocchi. finocchi/ Intro Puzzles

Mining data streams. Irene Finocchi.   finocchi/ Intro Puzzles Intro Puzzles Mining data streams Irene Finocchi finocchi@di.uniroma1.it http://www.dsi.uniroma1.it/ finocchi/ 1 / 33 Irene Finocchi Mining data streams Intro Puzzles Stream sources Data stream model Storing

More information

Course : Data mining

Course : Data mining Course : Data mining Lecture : Mining data streams Aristides Gionis Department of Computer Science Aalto University visiting in Sapienza University of Rome fall 2016 reading assignment LRU book: chapter

More information

Optimal Sampling from Distributed Streams

Optimal Sampling from Distributed Streams Optimal Sampling from Distributed Streams Graham Cormode AT&T Labs-Research Joint work with S. Muthukrishnan (Rutgers) Ke Yi (HKUST) Qin Zhang (HKUST) 1-1 Reservoir sampling [Waterman??; Vitter 85] Maintain

More information

Mining for Patterns and Anomalies in Data Streams. Sampath Kannan University of Pennsylvania

Mining for Patterns and Anomalies in Data Streams. Sampath Kannan University of Pennsylvania Mining for Patterns and Anomalies in Data Streams Sampath Kannan University of Pennsylvania The Problem Data sizes too large to fit in primary memory Devices with small memory Access times to secondary

More information

Parallel DBs. April 25, 2017

Parallel DBs. April 25, 2017 Parallel DBs April 25, 2017 1 Why Scale Up? Scan of 1 PB at 300MB/s (SATA r2 Limit) (x1000) ~1 Hour ~3.5 Seconds 2 Data Parallelism Replication Partitioning A A A A B C 3 Operator Parallelism Pipeline

More information

Algorithms for data streams

Algorithms for data streams Intro Puzzles Sampling Sketches Graphs Lower bounds Summing up Thanks! Algorithms for data streams Irene Finocchi finocchi@di.uniroma1.it http://www.dsi.uniroma1.it/ finocchi/ May 9, 2012 1 / 99 Irene

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information

Massive Online Analysis - Storm,Spark

Massive Online Analysis - Storm,Spark Massive Online Analysis - Storm,Spark presentation by R. Kishore Kumar Research Scholar Department of Computer Science & Engineering Indian Institute of Technology, Kharagpur Kharagpur-721302, India (R

More information

Data Mining. Jeff M. Phillips. January 8, 2014

Data Mining. Jeff M. Phillips. January 8, 2014 Data Mining Jeff M. Phillips January 8, 2014 Data Mining What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational statistics? Data

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 3/6/2012 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 In many data mining

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/25/2013 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 3 In many data mining

More information

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21

Putting it together. Data-Parallel Computation. Ex: Word count using partial aggregation. Big Data Processing. COS 418: Distributed Systems Lecture 21 Big Processing -Parallel Computation COS 418: Distributed Systems Lecture 21 Michael Freedman 2 Ex: Word count using partial aggregation Putting it together 1. Compute word counts from individual files

More information

CS246: Mining Massive Datasets Jure Leskovec, Stanford University

CS246: Mining Massive Datasets Jure Leskovec, Stanford University CS246: Mining Massive Datasets Jure Leskovec, Stanford University http://cs246.stanford.edu 2/24/2014 Jure Leskovec, Stanford CS246: Mining Massive Datasets, http://cs246.stanford.edu 2 High dim. data

More information

Page 1. Goals for Today" Background of Cloud Computing" Sources Driving Big Data" CS162 Operating Systems and Systems Programming Lecture 24

Page 1. Goals for Today Background of Cloud Computing Sources Driving Big Data CS162 Operating Systems and Systems Programming Lecture 24 Goals for Today" CS162 Operating Systems and Systems Programming Lecture 24 Capstone: Cloud Computing" Distributed systems Cloud Computing programming paradigms Cloud Computing OS December 2, 2013 Anthony

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ.

Big Data Programming: an Introduction. Spring 2015, X. Zhang Fordham Univ. Big Data Programming: an Introduction Spring 2015, X. Zhang Fordham Univ. Outline What the course is about? scope Introduction to big data programming Opportunity and challenge of big data Origin of Hadoop

More information

The Internet. History & Current Applications

The Internet. History & Current Applications The Internet History & Current Applications Popescu 2012 1 Connecting computers to other computers Share data Join computing forces Ensure resiliency 2 Types of Communication Synchronous: sender and receiver

More information

Clustering Documents. Case Study 2: Document Retrieval

Clustering Documents. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April 21 th, 2015 Sham Kakade 2016 1 Document Retrieval Goal: Retrieve

More information

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344 Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)

More information

Graph-Parallel Problems. ML in the Context of Parallel Architectures

Graph-Parallel Problems. ML in the Context of Parallel Architectures Case Study 4: Collaborative Filtering Graph-Parallel Problems Synchronous v. Asynchronous Computation Machine Learning for Big Data CSE547/STAT548, University of Washington Emily Fox February 20 th, 2014

More information

Sublinear Algorithms for Big Data Analysis

Sublinear Algorithms for Big Data Analysis Sublinear Algorithms for Big Data Analysis Michael Kapralov Theory of Computation Lab 4 EPFL 7 September 2017 The age of big data: massive amounts of data collected in various areas of science and technology

More information

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval

Clustering Documents. Document Retrieval. Case Study 2: Document Retrieval Case Study 2: Document Retrieval Clustering Documents Machine Learning for Big Data CSE547/STAT548, University of Washington Sham Kakade April, 2017 Sham Kakade 2017 1 Document Retrieval n Goal: Retrieve

More information

modern database systems lecture 10 : large-scale graph processing

modern database systems lecture 10 : large-scale graph processing modern database systems lecture 1 : large-scale graph processing Aristides Gionis spring 18 timeline today : homework is due march 6 : homework out april 5, 9-1 : final exam april : homework due graphs

More information

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype?

Nowcasting. D B M G Data Base and Data Mining Group of Politecnico di Torino. Big Data: Hype or Hallelujah? Big data hype? Big data hype? Big Data: Hype or Hallelujah? Data Base and Data Mining Group of 2 Google Flu trends On the Internet February 2010 detected flu outbreak two weeks ahead of CDC data Nowcasting http://www.internetlivestats.com/

More information

Introduction to Data Mining

Introduction to Data Mining Introduction to Data Mining Lecture #6: Mining Data Streams Seoul National University 1 Outline Overview Sampling From Data Stream Queries Over Sliding Window 2 Data Streams In many data mining situations,

More information

Distributed computing: index building and use

Distributed computing: index building and use Distributed computing: index building and use Distributed computing Goals Distributing computation across several machines to Do one computation faster - latency Do more computations in given time - throughput

More information

Principal Software Engineer Red Hat Emerging Technology June 24, 2015

Principal Software Engineer Red Hat Emerging Technology June 24, 2015 USING APACHE SPARK FOR ANALYTICS IN THE CLOUD William C. Benton Principal Software Engineer Red Hat Emerging Technology June 24, 2015 ABOUT ME Distributed systems and data science in Red Hat's Emerging

More information

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context

Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context 1 Apache Spark is a fast and general-purpose engine for large-scale data processing Spark aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes

More information

Data Mining. Jeff M. Phillips. January 7, 2019 CS 5140 / CS 6140

Data Mining. Jeff M. Phillips. January 7, 2019 CS 5140 / CS 6140 Data Mining CS 5140 / CS 6140 Jeff M. Phillips January 7, 2019 What is Data Mining? What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational

More information

Sketching Asynchronous Streams Over a Sliding Window

Sketching Asynchronous Streams Over a Sliding Window Sketching Asynchronous Streams Over a Sliding Window Srikanta Tirthapura (Iowa State University) Bojian Xu (Iowa State University) Costas Busch (Rensselaer Polytechnic Institute) 1/32 Data Stream Processing

More information

Graph Analytics in the Big Data Era

Graph Analytics in the Big Data Era Graph Analytics in the Big Data Era Yongming Luo, dr. George H.L. Fletcher Web Engineering Group What is really hot? 19-11-2013 PAGE 1 An old/new data model graph data Model entities and relations between

More information

Data Mining. Jeff M. Phillips. January 9, 2013

Data Mining. Jeff M. Phillips. January 9, 2013 Data Mining Jeff M. Phillips January 9, 2013 Data Mining What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational statistics? Data

More information

Streaming Algorithms. Stony Brook University CSE545, Fall 2016

Streaming Algorithms. Stony Brook University CSE545, Fall 2016 Streaming Algorithms Stony Brook University CSE545, Fall 2016 Big Data Analytics -- The Class We will learn: to analyze different types of data: high dimensional graphs infinite/never-ending labeled to

More information

5 Fundamental Strategies for Building a Data-centered Data Center

5 Fundamental Strategies for Building a Data-centered Data Center 5 Fundamental Strategies for Building a Data-centered Data Center June 3, 2014 Ken Krupa, Chief Field Architect Gary Vidal, Solutions Specialist Last generation Reference Data Unstructured OLTP Warehouse

More information

Optimal Tracking of Distributed Heavy Hitters and Quantiles

Optimal Tracking of Distributed Heavy Hitters and Quantiles Optimal Tracking of Distributed Heavy Hitters and Quantiles Qin Zhang Hong Kong University of Science & Technology Joint work with Ke Yi PODS 2009 June 30, 2009 - 2- Models and Problems Distributed streaming

More information

B490 Mining the Big Data. 5. Models for Big Data

B490 Mining the Big Data. 5. Models for Big Data B490 Mining the Big Data 5. Models for Big Data Qin Zhang 1-1 2-1 MapReduce MapReduce The MapReduce model (Dean & Ghemawat 2004) Input Output Goal Map Shuffle Reduce Standard model in industry for massive

More information

Lecture Map-Reduce. Algorithms. By Marina Barsky Winter 2017, University of Toronto

Lecture Map-Reduce. Algorithms. By Marina Barsky Winter 2017, University of Toronto Lecture 04.02 Map-Reduce Algorithms By Marina Barsky Winter 2017, University of Toronto Example 1: Language Model Statistical machine translation: Need to count number of times every 5-word sequence occurs

More information

Processing of big data with Apache Spark

Processing of big data with Apache Spark Processing of big data with Apache Spark JavaSkop 18 Aleksandar Donevski AGENDA What is Apache Spark? Spark vs Hadoop MapReduce Application Requirements Example Architecture Application Challenges 2 WHAT

More information

Data-Intensive Computing with MapReduce

Data-Intensive Computing with MapReduce Data-Intensive Computing with MapReduce Session 5: Graph Processing Jimmy Lin University of Maryland Thursday, February 21, 2013 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm Announcements HW5 is due tomorrow 11pm Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create your AWS account before

More information

BIG DATA TESTING: A UNIFIED VIEW

BIG DATA TESTING: A UNIFIED VIEW http://core.ecu.edu/strg BIG DATA TESTING: A UNIFIED VIEW BY NAM THAI ECU, Computer Science Department, March 16, 2016 2/30 PRESENTATION CONTENT 1. Overview of Big Data A. 5 V s of Big Data B. Data generation

More information

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab

2/26/2017. Originally developed at the University of California - Berkeley's AMPLab Apache is a fast and general engine for large-scale data processing aims at achieving the following goals in the Big data context Generality: diverse workloads, operators, job sizes Low latency: sub-second

More information

745: Advanced Database Systems

745: Advanced Database Systems 745: Advanced Database Systems Yanlei Diao University of Massachusetts Amherst Outline Overview of course topics Course requirements Database Management Systems 1. Online Analytical Processing (OLAP) vs.

More information

15-388/688 - Practical Data Science: Big data and MapReduce. J. Zico Kolter Carnegie Mellon University Spring 2018

15-388/688 - Practical Data Science: Big data and MapReduce. J. Zico Kolter Carnegie Mellon University Spring 2018 15-388/688 - Practical Data Science: Big data and MapReduce J. Zico Kolter Carnegie Mellon University Spring 2018 1 Outline Big data Some context in distributed computing map + reduce MapReduce MapReduce

More information

Data Mining. Jeff M. Phillips. January 12, 2015 CS 5140 / CS 6140

Data Mining. Jeff M. Phillips. January 12, 2015 CS 5140 / CS 6140 Data Mining CS 5140 / CS 6140 Jeff M. Phillips January 12, 2015 Data Mining What is Data Mining? Finding structure in data? Machine learning on large data? Unsupervised learning? Large scale computational

More information

Database Systems CSE 414

Database Systems CSE 414 Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) CSE 414 - Fall 2017 1 Announcements HW5 is due tomorrow 11pm HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create

More information

Epilog: Further Topics

Epilog: Further Topics Ludwig-Maximilians-Universität München Institut für Informatik Lehr- und Forschungseinheit für Datenbanksysteme Knowledge Discovery in Databases SS 2016 Epilog: Further Topics Lecture: Prof. Dr. Thomas

More information

Apache Giraph. for applications in Machine Learning & Recommendation Systems. Maria Novartis

Apache Giraph. for applications in Machine Learning & Recommendation Systems. Maria Novartis Apache Giraph for applications in Machine Learning & Recommendation Systems Maria Stylianou @marsty5 Novartis Züri Machine Learning Meetup #5 June 16, 2014 Apache Giraph for applications in Machine Learning

More information

Big Data Analytics CSCI 4030

Big Data Analytics CSCI 4030 High dim. data Graph data Infinite data Machine learning Apps Locality sensitive hashing PageRank, SimRank Filtering data streams SVM Recommen der systems Clustering Community Detection Queries on streams

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Resilient Distributed Datasets

Resilient Distributed Datasets Resilient Distributed Datasets A Fault- Tolerant Abstraction for In- Memory Cluster Computing Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin,

More information

Scalable Tools - Part I Introduction to Scalable Tools

Scalable Tools - Part I Introduction to Scalable Tools Scalable Tools - Part I Introduction to Scalable Tools Adisak Sukul, Ph.D., Lecturer, Department of Computer Science, adisak@iastate.edu http://web.cs.iastate.edu/~adisak/mbds2018/ Scalable Tools session

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 26: Parallel Databases and MapReduce CSE 344 - Winter 2013 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Cluster will run in Amazon s cloud (AWS)

More information

Page 1. Goals for Today" What is a Database " Key Concept: Structured Data" CS162 Operating Systems and Systems Programming Lecture 13.

Page 1. Goals for Today What is a Database  Key Concept: Structured Data CS162 Operating Systems and Systems Programming Lecture 13. Goals for Today" CS162 Operating Systems and Systems Programming Lecture 13 Transactions" What is a database? Transactions Conflict serializability October 12, 2011 Anthony D. Joseph and Ion Stoica http://inst.eecs.berkeley.edu/~cs162

More information

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed?

What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? Simple to start What is the maximum file size you have dealt so far? Movies/Files/Streaming video that you have used? What have you observed? What is the maximum download speed you get? Simple computation

More information

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi.

Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi. Data Structures and Algorithms Dr. Naveen Garg Department of Computer Science and Engineering Indian Institute of Technology, Delhi Lecture 18 Tries Today we are going to be talking about another data

More information

CS4700/CS5700 Fundamentals of Computer Networks

CS4700/CS5700 Fundamentals of Computer Networks CS4700/CS5700 Fundamentals of Computer Networks Lecture 19: Multicast Routing Slides used with permissions from Edward W. Knightly, T. S. Eugene Ng, Ion Stoica, Hui Zhang Alan Mislove amislove at ccs.neu.edu

More information

Apache Giraph: Facebook-scale graph processing infrastructure. 3/31/2014 Avery Ching, Facebook GDM

Apache Giraph: Facebook-scale graph processing infrastructure. 3/31/2014 Avery Ching, Facebook GDM Apache Giraph: Facebook-scale graph processing infrastructure 3/31/2014 Avery Ching, Facebook GDM Motivation Apache Giraph Inspired by Google s Pregel but runs on Hadoop Think like a vertex Maximum value

More information

Optimizing Apache Spark with Memory1. July Page 1 of 14

Optimizing Apache Spark with Memory1. July Page 1 of 14 Optimizing Apache Spark with Memory1 July 2016 Page 1 of 14 Abstract The prevalence of Big Data is driving increasing demand for real -time analysis and insight. Big data processing platforms, like Apache

More information

Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor

Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack. Chief Architect RainStor Making the Most of Hadoop with Optimized Data Compression (and Boost Performance) Mark Cusack Chief Architect RainStor Agenda Importance of Hadoop + data compression Data compression techniques Compression,

More information

Webinar Series TMIP VISION

Webinar Series TMIP VISION Webinar Series TMIP VISION TMIP provides technical support and promotes knowledge and information exchange in the transportation planning and modeling community. Today s Goals To Consider: Parallel Processing

More information

CSE 414: Section 7 Parallel Databases. November 8th, 2018

CSE 414: Section 7 Parallel Databases. November 8th, 2018 CSE 414: Section 7 Parallel Databases November 8th, 2018 Agenda for Today This section: Quick touch up on parallel databases Distributed Query Processing In this class, only shared-nothing architecture

More information

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling

Key aspects of cloud computing. Towards fuller utilization. Two main sources of resource demand. Cluster Scheduling Key aspects of cloud computing Cluster Scheduling 1. Illusion of infinite computing resources available on demand, eliminating need for up-front provisioning. The elimination of an up-front commitment

More information

Introduction to the Mathematics of Big Data. Philippe B. Laval

Introduction to the Mathematics of Big Data. Philippe B. Laval Introduction to the Mathematics of Big Data Philippe B. Laval Fall 2017 Introduction In recent years, Big Data has become more than just a buzz word. Every major field of science, engineering, business,

More information

From Internet Data Centers to Data Centers in the Cloud

From Internet Data Centers to Data Centers in the Cloud From Internet Data Centers to Data Centers in the Cloud This case study is a short extract from a keynote address given to the Doctoral Symposium at Middleware 2009 by Lucy Cherkasova of HP Research Labs

More information

Machine Learning for Large-Scale Data Analysis and Decision Making A. Distributed Machine Learning Week #9

Machine Learning for Large-Scale Data Analysis and Decision Making A. Distributed Machine Learning Week #9 Machine Learning for Large-Scale Data Analysis and Decision Making 80-629-17A Distributed Machine Learning Week #9 Today Distributed computing for machine learning Background MapReduce/Hadoop & Spark Theory

More information

Graph Connectivity in MapReduce...How Hard Could it Be?

Graph Connectivity in MapReduce...How Hard Could it Be? Graph Connectivity in MapReduce......How Hard Could it Be? Sergei Vassilvitskii +Karloff, Kumar, Lattanzi, Moseley, Roughgarden, Suri, Vattani, Wang August 28, 2015 Google NYC Maybe Easy...Maybe Hard...

More information

An Introduction to Big Data Analysis using Spark

An Introduction to Big Data Analysis using Spark An Introduction to Big Data Analysis using Spark Mohamad Jaber American University of Beirut - Faculty of Arts & Sciences - Department of Computer Science May 17, 2017 Mohamad Jaber (AUB) Spark May 17,

More information

Introduction to MapReduce Algorithms and Analysis

Introduction to MapReduce Algorithms and Analysis Introduction to MapReduce Algorithms and Analysis Jeff M. Phillips October 25, 2013 Trade-Offs Massive parallelism that is very easy to program. Cheaper than HPC style (uses top of the line everything)

More information

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem

Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem I J C T A, 9(41) 2016, pp. 1235-1239 International Science Press Parallel HITS Algorithm Implemented Using HADOOP GIRAPH Framework to resolve Big Data Problem Hema Dubey *, Nilay Khare *, Alind Khare **

More information

Gynoii Smart Baby Monitor. User Guide

Gynoii Smart Baby Monitor. User Guide Gynoii Smart Baby Monitor User Guide 1. Overview of the camera 1. Light sensor 2. Infrared LEDs 3. Lens assembly 4. Built-in microphone 5. Built-in speaker 6. DC 5V power input 7. Reset button 8. LED indicator

More information

Research challenges in data-intensive computing The Stratosphere Project Apache Flink

Research challenges in data-intensive computing The Stratosphere Project Apache Flink Research challenges in data-intensive computing The Stratosphere Project Apache Flink Seif Haridi KTH/SICS haridi@kth.se e2e-clouds.org Presented by: Seif Haridi May 2014 Research Areas Data-intensive

More information

Lecture 11. Lecture 11: External Sorting

Lecture 11. Lecture 11: External Sorting Lecture 11 Lecture 11: External Sorting Lecture 11 Announcements 1. Midterm Review: This Friday! 2. Project Part #2 is out. Implement CLOCK! 3. Midterm Material: Everything up to Buffer management. 1.

More information

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications

Spark. In- Memory Cluster Computing for Iterative and Interactive Applications Spark In- Memory Cluster Computing for Iterative and Interactive Applications Matei Zaharia, Mosharaf Chowdhury, Tathagata Das, Ankur Dave, Justin Ma, Murphy McCauley, Michael Franklin, Scott Shenker,

More information

Streaming Auto-Scaling in Google Cloud Dataflow

Streaming Auto-Scaling in Google Cloud Dataflow Streaming Auto-Scaling in Google Cloud Dataflow Manuel Fahndrich Software Engineer Google Addictive Mobile Game https://commons.wikimedia.org/wiki/file:globe_centered_in_the_atlantic_ocean_(green_and_grey_globe_scheme).svg

More information

Netezza The Analytics Appliance

Netezza The Analytics Appliance Software 2011 Netezza The Analytics Appliance Michael Eden Information Management Brand Executive Central & Eastern Europe Vilnius 18 October 2011 Information Management 2011IBM Corporation Thought for

More information

Application-Aware SDN Routing for Big-Data Processing

Application-Aware SDN Routing for Big-Data Processing Application-Aware SDN Routing for Big-Data Processing Evaluation by EstiNet OpenFlow Network Emulator Director/Prof. Shie-Yuan Wang Institute of Network Engineering National ChiaoTung University Taiwan

More information

CS 161 Computer Security

CS 161 Computer Security Raluca Popa Spring 2018 CS 161 Computer Security Discussion 3 Week of February 5, 2018: Cryptography I Question 1 Activity: Cryptographic security levels (20 min) Say Alice has a randomly-chosen symmetric

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lectures 23 and 24 Parallel Databases 1 Why compute in parallel? Most processors have multiple cores Can run multiple jobs simultaneously Natural extension of txn

More information

The amount of data increases every day Some numbers ( 2012):

The amount of data increases every day Some numbers ( 2012): 1 The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect

More information

2.3 Algorithms Using Map-Reduce

2.3 Algorithms Using Map-Reduce 28 CHAPTER 2. MAP-REDUCE AND THE NEW SOFTWARE STACK one becomes available. The Master must also inform each Reduce task that the location of its input from that Map task has changed. Dealing with a failure

More information

2/26/2017. The amount of data increases every day Some numbers ( 2012):

2/26/2017. The amount of data increases every day Some numbers ( 2012): The amount of data increases every day Some numbers ( 2012): Data processed by Google every day: 100+ PB Data processed by Facebook every day: 10+ PB To analyze them, systems that scale with respect to

More information

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #13: Mining Streams 1

CS 5614: (Big) Data Management Systems. B. Aditya Prakash Lecture #13: Mining Streams 1 CS 5614: (Big) Data Management Systems B. Aditya Prakash Lecture #13: Mining Streams 1 Data Streams In many data mining situa;ons, we do not know the en;re data set in advance Stream Management is important

More information

Distributed Machine Learning" on Spark

Distributed Machine Learning on Spark Distributed Machine Learning" on Spark Reza Zadeh @Reza_Zadeh http://reza-zadeh.com Outline Data flow vs. traditional network programming Spark computing engine Optimization Example Matrix Computations

More information

Clustering Large scale data using MapReduce

Clustering Large scale data using MapReduce Clustering Large scale data using MapReduce Topics Clustering Very Large Multi-dimensional Datasets with MapReduce by Robson L. F. Cordeiro et al. Fast Clustering using MapReduce by Alina Ene et al. Background

More information

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia

MapReduce Spark. Some slides are adapted from those of Jeff Dean and Matei Zaharia MapReduce Spark Some slides are adapted from those of Jeff Dean and Matei Zaharia What have we learnt so far? Distributed storage systems consistency semantics protocols for fault tolerance Paxos, Raft,

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part

More information

Big Data and Cloud Computing

Big Data and Cloud Computing Big Data and Cloud Computing Presented at Faculty of Computer Science University of Murcia Presenter: Muhammad Fahim, PhD Department of Computer Eng. Istanbul S. Zaim University, Istanbul, Turkey About

More information

Worksheet - Storing Data

Worksheet - Storing Data Unit 1 Lesson 12 Name(s) Period Date Worksheet - Storing Data At the smallest scale in the computer, information is stored as bits and bytes. In this section, we'll look at how that works. Bit Bit, like

More information

RAMCloud. Scalable High-Performance Storage Entirely in DRAM. by John Ousterhout et al. Stanford University. presented by Slavik Derevyanko

RAMCloud. Scalable High-Performance Storage Entirely in DRAM. by John Ousterhout et al. Stanford University. presented by Slavik Derevyanko RAMCloud Scalable High-Performance Storage Entirely in DRAM 2009 by John Ousterhout et al. Stanford University presented by Slavik Derevyanko Outline RAMCloud project overview Motivation for RAMCloud storage:

More information

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

More information

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011

Jordan Boyd-Graber University of Maryland. Thursday, March 3, 2011 Data-Intensive Information Processing Applications! Session #5 Graph Algorithms Jordan Boyd-Graber University of Maryland Thursday, March 3, 2011 This work is licensed under a Creative Commons Attribution-Noncommercial-Share

More information

Computational Social Choice in the Cloud

Computational Social Choice in the Cloud Computational Social Choice in the Cloud Theresa Csar, Martin Lackner, Emanuel Sallinger, Reinhard Pichler Technische Universität Wien Oxford University PPI, Stuttgart, March 2017 By Sam Johnston [CC BY-SA

More information

A Review Paper on Big data & Hadoop

A Review Paper on Big data & Hadoop A Review Paper on Big data & Hadoop Rupali Jagadale MCA Department, Modern College of Engg. Modern College of Engginering Pune,India rupalijagadale02@gmail.com Pratibha Adkar MCA Department, Modern College

More information

High Performance Computing on MapReduce Programming Framework

High Performance Computing on MapReduce Programming Framework International Journal of Private Cloud Computing Environment and Management Vol. 2, No. 1, (2015), pp. 27-32 http://dx.doi.org/10.21742/ijpccem.2015.2.1.04 High Performance Computing on MapReduce Programming

More information

"Big Data... and Related Topics" John S. Erickson, Ph.D The Rensselaer IDEA Rensselaer Polytechnic Institute

Big Data... and Related Topics John S. Erickson, Ph.D The Rensselaer IDEA Rensselaer Polytechnic Institute "Big Data... and Related Topics" John S. Erickson, Ph.D The Rensselaer IDEA Rensselaer Polytechnic Institute erickj4@rpi.edu @olyerickson Director of Operations, The Rensselaer IDEA Deputy Director, Rensselaer

More information

EECS 122: Introduction to Communication Networks Final Exam Solutions

EECS 122: Introduction to Communication Networks Final Exam Solutions EECS 22: Introduction to Communication Networks Final Exam Solutions Problem. (6 points) How long does it take for a 3000-byte IP packet to go from host A to host B in the figure below. Assume the overhead

More information