INTRODUCTION TO DATA SCIENCE. MapReduce and the New Software Stacks (MMDS2)


1 INTRODUCTION TO DATA SCIENCE MapReduce and the New Software Stacks(MMDS2)

2 Big-Data Hardware Computer Clusters Computation: large number of computers/CPUs Network: Ethernet switching Storage: large collection of distributed disks Commodity hardware: cheap but relatively unreliable Computer node: network-connected CPU(s), disk(s) and RAM How to manage distributed computation tasks and storage? How to protect against frequent failures?

3 New Software Stack Distributed File System Data is duplicated and distributed across multiple locations MapReduce programming paradigm A computational model for performing parallel computations A software infrastructure that manages all the boring tasks (failures, scheduling etc.)

4 2.1 Distributed File Systems

5 Computer Nodes Cluster computing Nodes are placed on racks (8-64 nodes) connected by Ethernet Racks are connected by switch

6 Failures Typical failures Loss of a node Loss of a rack Some computations take hours Can't restart the whole process for each failure (it would never complete) Solution Redundant storage of files Continue to work on the same file chunk but on another node Computation divided into tasks. Restart a failed task on another node without affecting other tasks

7 Large-Scale File-System Organization Assumptions Enormous files (TB) Files are rarely updated Data is read or appended, not updated in place

8 DFS: Master Node Files are divided into large chunks (e.g. 64MB) Chunks are replicated on different nodes/racks To find required chunk Locate a name node or master node for a file Name node itself and file index are replicated All participants know how to find the directory

9 2.2 MapReduce

10 MapReduce systems A model and a software system for large-scale fault-resilient computation Powerful and simple Multiple implementations Google MapReduce Open-source Hadoop and HDFS Code all the logic/algorithm in two functions only A Map function and a Reduce function Let the system handle the rest: failure, duplication, restart, scheduling, resources, monitoring etc.

11 MapReduce Computation Map: A number of Map tasks on multiple nodes are given file chunks from a DFS Produce key-value pairs according to logic coded in a map function Each data element might produce zero or more key-value pairs Custom logic, coded by user Group: Master controller collects key-value pairs and sorts them by key. Divides the result into chunks to submit to a number of Reduce tasks. All key-value pairs with the same key go to the same chunk Standard logic, implemented by the system Reduce: A number of Reduce tasks, each working on one key at a time Combine all values for a single key Custom logic, coded by user

12 MapReduce Computation

13 Function, Task, Node Map or Reduce function Logic coded by a user Mapper or reducer Map or Reduce function with a single input Example: the reducer for key w Map or Reduce Task Map/Reduce function applied to a chunk (list of key-value pairs) A Reduce task runs a number of reducers Map or Reduce Node A computer that currently runs one or more tasks (Map or Reduce) Tasks might be scheduled to different nodes (more tasks than nodes)

14 The Map Function Input file Elements: a tuple, a line, a document A chunk is a collection of elements Each Map task works on one chunk at a time Technically, inputs to Map are key-value pairs Allows composition of several MapReduce processes Usually the key is not relevant for the Map task (e.g. line number in the input file) Output Each element is converted to zero or more key-value pairs Keys are not unique; several identical key-value pairs from the same element are possible

15 Example: Word Count Input: repository of documents Output: number of appearances of each word Each document is an input element The Map function Read file and break into a sequence of words Produce key-value pairs

16 Grouping by key Collect outputs from all maps into a single list Combine same key values into key-value list The system divides all keys into buckets The number of buckets as the number of Reduce tasks Use an appropriate hash function Send each bucket to a Reduce task Input to each Reduce task: list of key-value pairs

17 Example: Word Count Input to the Reduce function Key is a word Value is a list of ones Sum all the ones to get the count Output of all Reduce tasks is a sequence of pairs (w, m) w is a word (key) m is its number of appearances
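
The word-count pipeline above can be sketched in a few lines of Python. This is a single-process illustration, not a real MapReduce framework: the function names and the in-memory grouping with a dict stand in for the DFS chunks, the shuffle, and the Reduce tasks.

```python
from collections import defaultdict

def map_word_count(doc_id, text):
    # Each element (a document) yields one (word, 1) pair per occurrence.
    for word in text.split():
        yield (word, 1)

def reduce_word_count(word, counts):
    # Combine all values for a single key.
    yield (word, sum(counts))

def run_mapreduce(inputs, map_fn, reduce_fn):
    # Group step: collect all key-value pairs and bucket them by key.
    groups = defaultdict(list)
    for key, value in inputs:
        for k, v in map_fn(key, value):
            groups[k].append(v)
    # Reduce step: one reducer call per key.
    out = []
    for k, vs in sorted(groups.items()):
        out.extend(reduce_fn(k, vs))
    return out

docs = [(0, "to be or not to be"), (1, "to do is to be")]
print(run_mapreduce(docs, map_word_count, reduce_word_count))
```

Here `sorted(groups.items())` plays the role of the system's group-by-key step; on a real cluster each key's list of ones would be shipped to a single reducer.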

18 Combiners Optimize the MapReduce process, provided the reduce operation is associative and commutative Values can be combined in any order Push some of the Reducer logic into the Map tasks Word count example Apply the reduce step in each Map task Reduce key-value pairs with the same key to a single (w, m) Grouping and reduce steps are still necessary Call a Combiner function after each Map task Works on local files produced by a Map task Before all Map outputs are collected and shuffled
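
A combiner for word count can be sketched as running the summation locally inside each Map task, which is valid because addition is associative and commutative. The function name and the use of `Counter` are illustrative:

```python
from collections import Counter

def map_with_combiner(doc_id, text):
    # Combiner folded into the map step: emit one (word, local_count)
    # pair per distinct word in this chunk, instead of one pair per
    # occurrence. Grouping and the final reduce are still needed, but
    # the reducer now sums partial counts rather than bare ones.
    local = Counter(text.split())
    for word, count in local.items():
        yield (word, count)

print(sorted(map_with_combiner(0, "to be or not to be")))
```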

19 MapReduce with Combiners

20 Parallelism Could execute each reducer in a dedicated Reduce task Single key per process But the overhead of creating tasks is too high Skew: differences in computing time among reducers Different number of values for each key Different computation time, so some nodes become idle early Control the number of Reduce tasks Run several reducers per Reduce task to average out the load More tasks than nodes to balance node load

21 Details of MapReduce Execution Fork a Master controller process and a number of Worker processes A Worker handles Map or Reduce tasks but not both Create a number of Reduce and Map tasks Usually a Map task for each input chunk Select the number of Reduce tasks carefully More tasks means more communication More tasks means more parallelism Master keeps track of tasks

22 MapReduce Execution

23 Coping With Node Failures Failure of the master node Restart the whole MapReduce job Worst case Failure of a worker node Master monitors workers and detects a failure Master restarts only the tasks that ran on that worker

24 2.3 Algorithms Using MapReduce

25 Usage Original usage Google uses it for very large vector-matrix multiplication (to compute PageRank) The matrix represents links between web pages The vector represents the importance of each web page Makes sense Very large files Not updated in place Batch processing

26 Matrix-Vector Multiplication n-by-n matrix M with elements m_i,j n is ~10B for Web pages A vector v of length n with elements v_j The matrix-vector product is a vector x of length n with x_i = Σ_j m_i,j · v_j Assume the matrix and vector are stored in DFS Easily discoverable row-column coordinates For instance, stored as triples (i, j, m_i,j)

27 Case I: Vector fits into RAM The whole vector is available to each mapper The Map Function For each element (i, j, m_i,j) output the key-value pair (i, m_i,j · v_j) The Reduce Function Sum all values associated with key i to produce x_i
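
Case I can be sketched as follows, with the whole vector v held in memory and a dict standing in for the group-by-key step (names and data are illustrative):

```python
from collections import defaultdict

def matvec_mapreduce(triples, v):
    # triples: sparse matrix entries (i, j, m_ij); v: whole vector in RAM.
    # Map: each (i, j, m_ij) emits key i with value m_ij * v[j].
    groups = defaultdict(list)
    for i, j, m in triples:
        groups[i].append(m * v[j])
    # Reduce: sum all values for each row index i.
    return {i: sum(vals) for i, vals in groups.items()}

M = [(0, 0, 2.0), (0, 1, 1.0), (1, 1, 3.0)]   # the matrix [[2,1],[0,3]]
v = [1.0, 2.0]
print(matvec_mapreduce(M, v))   # the product x = Mv
```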

28 Case II: Vector cannot fit in RAM Partition the matrix and the vector into stripes Each Map task gets a chunk from one stripe of the matrix and the entire corresponding stripe of the vector

29 Relation-Algebra Operations Data is frequently stored in tables Relational Database Management Systems (RDBMS) Query language: SQL Underlying theory: relations and operations over them In MapReduce, data is stored in files Frequently files contain tables and key-value pairs Need to perform SQL-like queries using MapReduce

30 Relations A Relation is a table (a set) with column headers Attribute: a column header Tuple: a row in the table; no duplicates(!) Schema: the set of all attributes of a particular relation Relation Example

31 Relational Algebra Relational Algebra: a set of standard operations Operations usually produce new relations from one or more input relations Operations (queries) are often written in SQL and executed by an RDBMS Need some formalism to describe/define similar operations to be executed by MapReduce

32 Relational Algebra Selection R' = σ_C(R) Apply a Boolean condition C to every tuple Produce a relation with the tuples that satisfy C Projection R' = π_S(R) Select the attributes that are in a given subset S The new relation contains only the selected attributes Union, Intersection, Difference Easy to define for same-schema relations

33 Relational Algebra Natural Join R' = R ⋈ S Join two relations into a single relation (table) Merge tuples which agree on the intersecting attributes Grouping and Aggregation R' = γ_X(R) Partition tuples according to the values of the grouping attributes in X Compute an aggregation per group (MAX, SUM, ...) for the other attributes X is a list of elements that are either A grouping attribute An aggregation θ(A), for A not a grouping attribute

34 Examples Web Links Find paths of length two using the relation Links Result: triples (u,v,w) Natural join of Links with itself, using two copies L1(U1,U2) ⋈ L2(U2,U3) Social Network Friends Tuples (u,v) Number of friends for each user (grouping and aggregation)

35 MapReduce Selection Projection

36 Union, Intersection, Difference Union Intersection Difference

37 MapReduce: Natural Join Start with a simple form: R(A,B) and S(B,C) A similar approach works for joining on a group of attributes
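
The join of R(A,B) and S(B,C) can be sketched in Python as one map-by-key pass followed by a per-key cross product (an in-memory stand-in for the Map, group, and Reduce steps; relation contents are illustrative):

```python
from collections import defaultdict

def natural_join(R, S):
    # R: tuples (a, b); S: tuples (b, c). The join key is the shared B.
    # Map: tag each tuple with its relation of origin, keyed by b.
    groups = defaultdict(lambda: {"R": [], "S": []})
    for a, b in R:
        groups[b]["R"].append(a)
    for b, c in S:
        groups[b]["S"].append(c)
    # Reduce: for each key b, output every (a, b, c) combination.
    out = []
    for b, sides in groups.items():
        for a in sides["R"]:
            for c in sides["S"]:
                out.append((a, b, c))
    return sorted(out)

R = [(1, "x"), (2, "x"), (3, "y")]
S = [("x", 10), ("y", 20), ("z", 30)]
print(natural_join(R, S))
```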

38 MapReduce: Grouping And Aggregations Simple case R(A,B,C). Group by A, aggregate by B

39 Matrix Multiplication using Relational Algebra M, N are matrices with elements m_i,j and n_j,k Multiplication P = MN Represent a matrix by a relation: M(I,J,V) and N(J,K,W) Especially efficient if the matrices are sparse (omit zeroes) Product Natural join gives (i, j, k, v, w), representing the pair m_i,j, n_j,k Transform to (i, k, v·w) Group by I and K with SUM aggregation over J

40 MapReduce: Matrix Multiplication 1st phase (create all products m_i,j · n_j,k) Map for each m_i,j produce (j, (M, i, m_i,j)) for each n_j,k produce (j, (N, k, n_j,k)) Reduce For each key j, output all possible combinations of M and N values, with key (i, k) and value m_i,j · n_j,k 2nd phase (combine all products for (i, k)) Map: identity Reduce: sum the values for each key (i, k)
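
The two-phase algorithm can be sketched for sparse matrices stored as coordinate dictionaries (an in-memory illustration; the dicts stand in for the key grouping done by the system):

```python
from collections import defaultdict

def matmul_two_phase(M, N):
    # M, N: sparse matrices as dicts {(i, j): value} / {(j, k): value}.
    # Phase 1 map: key by j, tagging each entry with its matrix of origin.
    by_j = defaultdict(lambda: {"M": [], "N": []})
    for (i, j), v in M.items():
        by_j[j]["M"].append((i, v))
    for (j, k), w in N.items():
        by_j[j]["N"].append((k, w))
    # Phase 1 reduce: emit ((i, k), m_ij * n_jk) for every combination.
    products = []
    for j, sides in by_j.items():
        for i, v in sides["M"]:
            for k, w in sides["N"]:
                products.append(((i, k), v * w))
    # Phase 2: identity map, then sum all products per key (i, k).
    P = defaultdict(float)
    for (i, k), p in products:
        P[(i, k)] += p
    return dict(P)

M = {(0, 0): 1.0, (0, 1): 2.0}
N = {(0, 0): 3.0, (1, 0): 4.0}
print(matmul_two_phase(M, N))
```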

41 Single Step MapReduce Matrix Multiplication Map function Create multiple copies of each input element ((i, k), (M, j, m_i,j)) for k = 1, ..., n ((i, k), (N, j, n_j,k)) for i = 1, ..., n Reduce Input: for each key (i, k), pairs (M, j, m_i,j) and (N, j, n_j,k) for all j Output Multiply matching pairs and sum over j for each key

42 2.4 Map-Reduce Extensions

43 Workflow systems Will discuss later when we talk about streams

44 Recursive Extensions Recursive tasks are difficult to compute using MapReduce MapReduce relies on independent restart of failed tasks What if a parent in the recursion chain fails? Need some other mechanism for implementing recursive workflows Represented by flow graphs with cycles Convert recursion to iteration

45 Example: Path relation in a graph Assume a directed graph is represented by the relation E(X, Y) Compute the path relation P(X,Y): there is a path from node X to node Y P_n(X,Y) = π_X,Y (P_n-1(X,Z) ⋈ E(Z,Y)) Iterative algorithm for computing P(X,Y) Start from P(X,Y) = E(X,Y) Update/add new pairs till there is no change
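
The iteration can be sketched directly in Python as repeated join-and-union until a fixed point (a sequential stand-in for the distributed workflow; the edge set is illustrative):

```python
def transitive_closure(E):
    # E: set of directed edges (x, y). Start from P = E and repeatedly
    # add pi_{X,Y}(P join E) until no new pairs appear, converting the
    # recursion into an iteration.
    P = set(E)
    while True:
        new = {(x, y2) for (x, z) in P for (z2, y2) in E if z == z2}
        if new <= P:
            return P
        P |= new

E = {(1, 2), (2, 3), (3, 4)}
print(sorted(transitive_closure(E)))
```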

46 Algorithm

47 Workflow Implementation Two types of tasks: Join and Dup-elim n Join tasks create new candidate pairs m Dup-elim tasks remove duplicates and resend to Joins Route/partition data by hash functions Each Join task handles pairs (a,b), (b,c) according to the hash value h(b) Each Dup-elim task handles pairs (a,c) according to the hash value g(a,c)

48 Join Tasks Join task #i receives all pairs (a,b) such that h(a) = i or h(b) = i Each pair can go to two tasks, h(a) = i and h(b) = j Each Join task Stores P(a,b) locally till the end of the computation If h(a) = i, tries to match the new P(a,b) against locally stored P(x,a) to produce output P(x,b) If h(b) = i, tries to match the new P(a,b) against locally stored P(b,y) to produce output P(a,y) Sends each resulting pair (c,d) to a Dup-elim task according to the hash value g(c,d)

49 Dup-elim task Dup-elim task #j stores all pairs (c,d) with hash value g(c,d) = j On receiving a new pair, it checks it against the locally stored pairs If it is a new pair, it is stored and sent to Join tasks according to h(c) and h(d)

50 Workflow

51 Details Every Join task writes to m output files A single file for each Dup-elim task Every Dup-elim task writes to n output files A single file for each Join task Start by sending E(a,b) pairs to the appropriate Dup-elim tasks According to g(a,b) Wait till all Join tasks finish before starting the Dup-elim phase So all Dup-elim tasks have their complete input files

52 Failures It is not necessary to have two types of tasks Whenever a Join task produces a new candidate (a,c), transmit it to two other Join tasks according to h(a), h(c) Before a Join task uses a new pair to search for candidates, check it against the locally stored pairs and discard it if it already exists Failure Single-task design: everything with this hash value is lost Two types of tasks can handle single failures A failed Join task recreates its data from the relevant Dup-elim tasks A failed Dup-elim task recreates its data from the relevant Join tasks No problem if a restarted task produces duplicate input for other tasks

53 Graph processing systems Handle computations where the input is a very large graph Google's Pregel and Apache's Giraph Facebook: 200 machines running Giraph processed a graph with 1 trillion edges in 4 minutes

54 Example Given a graph, compute the shortest distance between each pair of nodes Assign a task per node Group several tasks on a single compute node Each task receives messages Processes them and sends out other messages Computation proceeds in supersteps All nodes process their messages All nodes issue their messages

55 Algorithm Initially Node #a stores every edge from a An edge from a to b of weight w is stored as (b,w) Node #a sends messages to all other nodes Message: (a,b,w) When message (c,d,w) arrives at node #a Consider the new paths (a,d) via a→c→d and (c,a) via c→d→a Update the weight if a shorter path is discovered Send out a message with the newly discovered path to all other nodes

56 Handling Failure By checkpoints Every node saves its entire state every few supersteps Failure All nodes are restarted from the last checkpoint

57 2.5 The Communication Cost Model

58 Measuring Quality of Algorithms For many algorithms the performance bottleneck is moving data between tasks (nodes) Each task is usually simple, linear in its data Transmitting data and reading it into memory is slow Communication cost dominates To measure/estimate communication cost Describe the algorithm as an acyclic workflow A graph of tasks and the communication between them Measure/estimate the data transmitted along each edge

59 Communication Cost The communication cost of a task is the size of the input to the task Measured in bytes or tuples The communication cost of an algorithm is the sum of the communication costs of all its tasks. Why not outputs? Counted as input to other tasks Unless it is the output of the entire algorithm If the output of the entire algorithm is large, it is most probably an input to a next stage Count it as input to that next stage

60 Example: Natural Join Algorithm Reminder Map For each (a,b) of R create the pair (b, (R,a)) For each (b,c) of S create the pair (b, (S,c)) Reduce For a key b, combine all pairs (R,a), (S,c) into (a,c) Communication Cost Assume R and S are of size r and s Input to all Maps: r+s Output of all Maps and input to all Reduces: r+s Total 2r+2s tuples, or O(r+s) bytes Computation time is small We don't count the output from Reduce (potentially r·s)

61 Wall-Clock Time Could assign all work to a single task to minimize the communication cost But the running time of the algorithm, its wall-clock time, matters too Need to divide the work fairly among the tasks while minimizing the communication cost More about this later

62 Multiway Joins: Cascade Example: 3-way join Cascade two MapReduce jobs Do the first join, then the second, or Do the second join, then the first Communication cost p is the proportion (probability) of a match 1st then 2nd: O((r+s) + (prs+t)) bytes 2nd then 1st: O((s+t) + (pst+r)) bytes

63 Multiway Join: Single Step A single MapReduce job that joins all three relations at once The key for a Reducer is a pair (i, j) It receives R(u,v), S(v,w), T(w,x) such that h(v) = i and g(w) = j Total number of reducers k = b·c, where b and c are the numbers of buckets for h and g Send S(v,w) to only a single Reducer Send R(u,v) to c Reducers and T(w,x) to b Reducers

64 Single Step: Communication Cost To Reduce tasks s tuples to move (1 copy of each S tuple) cr tuples to move (c copies of each R tuple) bt tuples to move (b copies of each T tuple) To Map tasks r + s + t input tuples to all Map tasks How do you select c and b subject to cb = k? The Map communication cost is the same for all choices

65 Optimization Problem Minimize s + cr + bt under the constraint cb = k Lagrange multipliers: minimize s + cr + bt − λ(cb − k) Set the derivatives w.r.t. c and b to zero: r − λb = 0, t − λc = 0 Rearrange and multiply: rt = λ²cb = λ²k, so λ = √(rt/k) Substitute: b = r/λ = √(kr/t), c = t/λ = √(kt/r) Substitute into s + cr + bt = s + 2√(krt) Add the Map cost: r + 2s + t + 2√(krt)
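
The result of the Lagrangian can be checked numerically: with illustrative sizes (the values of r, s, t, k below are made-up numbers, not from the slides), the analytic b and c match the best split found by brute force over b·c = k:

```python
import math

def reduce_cost(r, s, t, b, c):
    # Reduce-side communication of the single-step 3-way join:
    # each S tuple once, each R tuple c times, each T tuple b times.
    return s + c * r + b * t

# Illustrative sizes (assumed, not from the slides).
r, s, t, k = 1000.0, 2000.0, 4000.0, 100

# Analytic optimum from the Lagrangian: b = sqrt(kr/t), c = sqrt(kt/r),
# with minimum cost s + 2*sqrt(krt).
b_opt, c_opt = math.sqrt(k * r / t), math.sqrt(k * t / r)
analytic = s + 2 * math.sqrt(k * r * t)

# Brute force over integer splits b*c = k finds nothing better.
best = min(reduce_cost(r, s, t, b, k / b) for b in range(1, k + 1))
print(b_opt, c_opt, analytic, best)
```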

66 Example: Facebook 1B users, 300 friends each on average Size of relation r is 10⁹ × 300 = 3·10¹¹ tuples Friends-of-friends relationship R ⋈ R Maximum size r², but assuming friends form cliques it is about 30r Want to compute the friends-of-friends-of-friends relationship R ⋈ R ⋈ R Start marketing to those who have a large number of friends of friends of friends What is the best way to compute this with MapReduce? Cascade a pair of two-way joins A single 3-way join workflow

67 Example: Communication cost Cascade computation First join Map: 2r Reduce: 2r Total 4r Second join Map: r+30r Reduce: r+30r Total 62r Total: 4r + 62r = 66r 3-way join: r + 2s + t + 2√(krt) with r = s = t gives 4r + 2r√k Compare: 66r > 4r + 2r√k when √k < 31, i.e. k < 961 Result If the number of reducers is less than 961, use the 3-way join
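
The comparison can be reproduced with a couple of lines of arithmetic; since both workflow costs scale linearly with r, it is enough to compare the coefficients 66 (cascade) and 4 + 2√k (single 3-way join):

```python
import math

# Cascade of two 2-way joins costs 66r; the single 3-way join costs
# (4 + 2*sqrt(k))r, where k is the number of reducers. Find every k
# for which the 3-way join is the cheaper choice.
cheaper = [k for k in range(1, 2000) if 4 + 2 * math.sqrt(k) < 66]
print(max(cheaper))   # the largest k for which the 3-way join wins
```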

68 Star join: Walmart Example Fact table Business facts Each sale is kept in the fact table Relation F(A1, A2, ...) Attributes are the important components of the sale Item, Store, Branch, Customer Id A very large table Dimension tables For each key attribute, D(Ai, Bi,1, Bi,2, ...) Descriptive attributes/fields For Customer Id: phone number, address, age Many small tables Analytical queries Join the fact table with one or more dimension tables (star join) Aggregate the results into a useful form Example: aggregated sales by region and color for each month

69 Star Joins: MapReduce Computation Don't wait for a specific query; prepare the Reduce nodes for all possibilities Send dimension tables to the Reduce nodes and store them locally Use the same hash values as would be used for a multiway join of the fact table with every dimension table Store the fact table on Map nodes Run the multiway join and aggregate

70 2.6 Complexity Theory for MapReduce

71 Parameters of a MapReduce Algorithm Reducer size q: an upper bound on the number of values for a single key Want it to be small The input for a single reducer fits into memory High degree of parallelism Replication rate r: the average number of key-value pairs per input element The average communication from Map to Reduce per input element Usually there is a tradeoff between the communication cost factor (r) and the computational efficiency factor (q)

72 Example: 1-Pass Matrix Multiplication Recall the 1-pass matrix multiplication A reducer per output (matrix element) Map sends a copy to each relevant reducer Replication rate is r = n Each element is duplicated exactly n times Reducer size is q = 2n (n values from each matrix) Can design a family of algorithms with qr ≥ 2n²

73 Example: Similarity Joins Assume a collection of images X Size: 1M images Input key-value pairs: (i, P_i), an index plus an image Some measure of similarity s(x, y) Symmetric: s(x, y) = s(y, x) Find all pairs (x, y) such that s(x, y) > t

74 Obvious algorithm A reducer per pair Evaluate s on the pair of images with key (i, j) If greater than the threshold, produce output Map Duplicate each input (i, P_i) for each j: ((i, j), P_i) Replication rate is r = 999999, reducer size q = 2 Communication (Map + Reduce): for 1MB images the total size is about 10¹⁸ bytes, an exabyte Takes ~300 years over gigabit Ethernet

75 Similarity Joins Using Groups Select g, the number of groups; each has 10⁶/g images Use a hash function with g values to define the groups The Map Function For each input element (i, P_i) generate g−1 key-value pairs Each key is an unordered set {u, v}, where u is the group of this image and v ranges over all other groups The Reduce Function (key is {u, v}) Compare between the two groups For each key {u, v} there are 2·10⁶/g elements Compare instances from different groups Need to choose one reducer to also compare within each group For instance, compare within group u at the reducer for {u, u+1}
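
The key generation of the Map function can be sketched as follows, using i mod g as a stand-in for the g-valued hash function (function name and parameters are illustrative):

```python
def group_keys(i, g):
    # Map step of the group-based similarity join: image i belongs to
    # group u = i % g (a stand-in for the g-valued hash function) and is
    # replicated to the g-1 reducers whose key is the unordered pair {u, v}.
    u = i % g
    return [frozenset({u, v}) for v in range(g) if v != u]

g = 4
keys = group_keys(5, g)   # image 5 falls in group 1
print(sorted(tuple(sorted(k)) for k in keys))
```

The length of the returned list is g − 1, which is exactly the replication rate of the scheme.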

76 Analysis Replication rate r = g − 1 ≈ g for large g Reducer size q = 2·10⁶/g For 1MB images, the total bytes for a reducer is q·10⁶ Number of reducers k ≈ g²/2 For instance g = 1000 Input for a single reducer is 2GB Communication cost is 10⁹ MB = 10¹⁵ bytes 1000 times less (4 months vs. 300 years) 500K reducers, can be balanced well

77 Graph model for a MapReduce problem MapReduce problem A set of inputs A set of outputs A many-many relationship between input and output Which inputs are needed to produce which outputs Example: Similarity Join for 4 pictures

78 Example: Matrix Multiplication Multiply two n-by-n matrices 2n² inputs m_i,j and n_j,k and n² outputs p_i,k Each output p_i,k is related to 2n inputs: m_i,1, m_i,2, ... and n_1,k, n_2,k, ... Each input m_i,j or n_j,k is related to n outputs: p_i,1, p_i,2, ... for m_i,j and p_1,k, p_2,k, ... for n_j,k

79 Implicit inputs/outputs Example: Natural join of R(A,B) and S(B,C) Assume A, B, C have finite domains A finite number of possible inputs and outputs Not all inputs are present (not all possible tuples) Not all outputs are produced For the purposes of analysis, consider the complete graph for the problem It is a model for the problem, not for a specific input instance

80 Mapping Schemas Each algorithm is defined by a mapping schema How outputs are produced from inputs (by reducers) Given a reducer size q, a mapping schema is an assignment of inputs to one or more reducers such that No reducer is assigned more than q inputs For every output, there is at least one reducer assigned all the inputs related to that output

81 Example: Similarity Join Number of inputs is p, number of outputs is (p² − p)/2 ≈ p²/2 Assign g²/2 reducers A reducer gets the inputs from 2 groups: q = 2p/g It is a mapping schema Reducer size is q Each output is covered Replication rate is r = g − 1 ≈ g In this case r = 2p/q An inverse relation between r and q

82 When Not All Inputs Are Present Not all inputs are present There are always 1M images, but how many tuples are in a relation? How to assign? Example: assume only 5% of the possible data is present For reducer size q, only q/20 inputs will actually arrive After the analysis, estimate the 5% and replace q by 20q Design the algorithm for 20q; in reality only q inputs will arrive

83 Lower Bound On Replication Rate Similarity join Select the reducer size to trade communication vs. parallelism Or the ability to execute in RAM How do we know that we got the best tradeoff between q and r? What is the minimum possible communication (r) for a given q? Prove a matching lower bound Find a lower bound for the problem (for a given q) Show that the tradeoff (achieved by the assignment) matches it

84 Steps to Prove Lower Bounds Bound the outputs of a single reducer Given an input of size q, it can cover only g(q) outputs, for any mapping Calculate the total number of outputs for the problem Does not depend on a particular mapping Bound the number of outputs covered by all reducers: Σ_i g(q_i) ≥ total outputs Manipulate the inequality to get the total communication Σ_i q_i Use the reducer size bound q_i ≤ q The replication rate is Σ_i q_i divided by the number of inputs

85 Example: Similarity Join A reducer with q inputs can't cover more than q²/2 outputs Total number of outputs is p²/2 Total coverable outputs: Σ_i q_i²/2 ≥ p²/2 Since q_i ≤ q: (q/2)·Σ_i q_i ≥ Σ_i q_i²/2 ≥ p²/2, so Σ_i q_i ≥ p²/q The replication rate is r = Σ_i q_i / p ≥ p/q

86 Case Study: Matrix Multiplication Consider an improvement to the 1-pass matrix multiplication Consider the tradeoff of the algorithm and show a matching lower bound Idea: group rows and columns into bands Each reducer gets a band of rows from the 1st matrix and a band of columns from the 2nd matrix The reducer produces a square of elements of the output matrix

87 Matrix Multiplications

88 Matrix Multiplication Compute P = MN; all matrices are n-by-n Group the rows of M into g bands of n/g rows each, and the columns of N likewise Keys correspond to two groups (a band from M and a band from N) Map function g² keys: pairs of (row band, column band) numbers Duplicate each M input for all g column bands of N, and each N input for all g row bands of M Reduce function Compute a square of output elements

89 Analysis A reducer gets n·(n/g) elements from each matrix Total q = 2n²/g Replication rate r = g Combine to obtain the tradeoff r = 2n²/q Can't get a better tradeoff No lower r for the same q with a 1-pass mapping
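
A quick numeric check that the banded algorithm sits exactly on the r = 2n²/q curve for every choice of g (the value of n and the band counts below are illustrative):

```python
n = 12                            # illustrative matrix dimension
pairs = []
for g in (1, 2, 3, 4, 6):         # band counts g that divide n
    q = 2 * n * n // g            # reducer input: an n x (n/g) band from each matrix
    r = g                         # each input element is sent to g reducers
    pairs.append((r, q))
# Every choice of g satisfies the tradeoff r * q = 2n^2.
print(pairs, all(r * q == 2 * n * n for r, q in pairs))
```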

90 Analysis A reducer has to receive a full row and a full column to produce a single output If its input consists of a rows and b columns, the total is q = n(a + b) It produces s = ab outputs; coverage is maximized when a = b Then q = 2na and g(q) = a² = q²/(4n²) The total number of outputs is n², therefore Σ_i q_i²/(4n²) ≥ n²

91 Analysis From the previous slide: Σ_i q_i² ≥ 4n⁴ Since q_i ≤ q: q·Σ_i q_i ≥ Σ_i q_i² ≥ 4n⁴ The total input Σ_i q_i equals r·2n² (2n² inputs, each replicated r times) So q·r·2n² ≥ 4n⁴, giving r ≥ 2n²/q Total communication r·2n² = 4n⁴/q at the lower bound

92 Matrix Multiplication II Recall the two-pass algorithm 1st step: combine (i, j) and (j, k) with reducers keyed by j 2nd step: send all (i, k) values to be summed up by key (i, k) Generalization Partition the rows and columns of both matrices into g groups Total g² squares, with n²/g² elements in each square Squares (I,J) in M and (J,K) in N are needed to compute square (I,K), where I, J, K are sets of indices (groups)

93 Matrix Multiplication

94 Matrix Multiplication: 1st step The Map Function Keys are (I,J,K) triples of group numbers Duplicate M inputs for all K group numbers and N inputs for all I group numbers Replication rate is g The Reduce Function: key (I,J,K) Compute the partial products needed for each P_i,k x_i,J,k = Σ_{j in J} m_i,j · n_j,k, for all i in I and k in K

95 Matrix Multiplication: 2nd step The Map Function Input x_i,J,k Key (i, k) The Reduce Function Sum all the x_i,J,k to obtain P_i,k

96 Analysis 1st step Replication rate g Total communication 2gn² Reducer size q = 2n²/g² Substituting g = √(2n²/q), total communication 2gn² = 2√2·n³/√q 2nd step Communication: g values for each (i, k), total gn² = √2·n³/√q Total Communication 2-pass algorithm (1st + 2nd): 3√2·n³/√q 1-pass algorithm (see slide 91): 4n⁴/q

97 Summary Cluster Computing Disk, CPU, memory in racks of nodes Distributed File System Large duplicated chunks Complexity Communication cost Reducer size and replication rate A problem as an input-output graph MapReduce Parallelize, manage failures; logic in two custom functions Hadoop, Workflows Relational Operations Natural join, multiway and star joins Mapping schemas Matrix Multiplication 1-pass vs 2-pass Generalization for bands and squares Analysis


Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Cloud Computing CS

Cloud Computing CS Cloud Computing CS 15-319 Programming Models- Part III Lecture 6, Feb 1, 2012 Majd F. Sakr and Mohammad Hammoud 1 Today Last session Programming Models- Part II Today s session Programming Models Part

More information

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm Announcements HW5 is due tomorrow 11pm Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create your AWS account before

More information

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

More information

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel

Indexing. Week 14, Spring Edited by M. Naci Akkøk, , Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Indexing Week 14, Spring 2005 Edited by M. Naci Akkøk, 5.3.2004, 3.3.2005 Contains slides from 8-9. April 2002 by Hector Garcia-Molina, Vera Goebel Overview Conventional indexes B-trees Hashing schemes

More information

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for

! A relational algebra expression may have many equivalent. ! Cost is generally measured as total elapsed time for Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

Chapter 13: Query Processing Basic Steps in Query Processing

Chapter 13: Query Processing Basic Steps in Query Processing Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and

More information

CS 345A Data Mining. MapReduce

CS 345A Data Mining. MapReduce CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very large Tens to hundreds of terabytes

More information

Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix

Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix Carlos Ordonez, Yiqun Zhang Department of Computer Science, University of Houston, USA Abstract. We study the serial and parallel

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join

More information

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline

Faloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline Carnegie Mellon Univ. Dept. of Computer Science 15-415 - Database Applications Lecture #14: Implementation of Relational Operations (R&G ch. 12 and 14) 15-415 Faloutsos 1 introduction selection projection

More information

CSE 344 MAY 2 ND MAP/REDUCE

CSE 344 MAY 2 ND MAP/REDUCE CSE 344 MAY 2 ND MAP/REDUCE ADMINISTRIVIA HW5 Due Tonight Practice midterm Section tomorrow Exam review PERFORMANCE METRICS FOR PARALLEL DBMSS Nodes = processors, computers Speedup: More nodes, same data

More information

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344

Where We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344 Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)

More information

Programming Systems for Big Data

Programming Systems for Big Data Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There

More information

Final Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23

Final Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23 Final Exam Review 2 Kathleen Durant CS 3200 Northeastern University Lecture 23 QUERY EVALUATION PLAN Representation of a SQL Command SELECT {DISTINCT} FROM {WHERE

More information

CSE 190D Spring 2017 Final Exam Answers

CSE 190D Spring 2017 Final Exam Answers CSE 190D Spring 2017 Final Exam Answers Q 1. [20pts] For the following questions, clearly circle True or False. 1. The hash join algorithm always has fewer page I/Os compared to the block nested loop join

More information

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17 Announcement CompSci 516 Database Systems Lecture 10 Query Evaluation and Join Algorithms Project proposal pdf due on sakai by 5 pm, tomorrow, Thursday 09/27 One per group by any member Instructor: Sudeepa

More information

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014

CS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014 CS15-319 / 15-619 Cloud Computing Recitation 3 September 9 th & 11 th, 2014 Overview Last Week s Reflection --Project 1.1, Quiz 1, Unit 1 This Week s Schedule --Unit2 (module 3 & 4), Project 1.2 Questions

More information

MapReduce and Hadoop. Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata. November 10, 2014

MapReduce and Hadoop. Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata. November 10, 2014 MapReduce ad Hadoop Debapriyo Majumdar Data Miig Fall 2014 Idia Statistical Istitute Kolkata November 10, 2014 Let s keep the itro short Moder data miig: process immese amout of data quickly Exploit parallelism

More information

Distributed File Systems II

Distributed File Systems II Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation

More information

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe

Introduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe Introduction to Query Processing and Query Optimization Techniques Outline Translating SQL Queries into Relational Algebra Algorithms for External Sorting Algorithms for SELECT and JOIN Operations Algorithms

More information

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414

Announcements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414 Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s

More information

MI-PDB, MIE-PDB: Advanced Database Systems

MI-PDB, MIE-PDB: Advanced Database Systems MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:

More information

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic

More information

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples

Topics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?

More information

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce

Parallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The

More information

Hadoop/MapReduce Computing Paradigm

Hadoop/MapReduce Computing Paradigm Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications

More information

Evaluation of relational operations

Evaluation of relational operations Evaluation of relational operations Iztok Savnik, FAMNIT Slides & Textbook Textbook: Raghu Ramakrishnan, Johannes Gehrke, Database Management Systems, McGraw-Hill, 3 rd ed., 2007. Slides: From Cow Book

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Winter 215 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Due next Thursday evening Will send out reimbursement codes later

More information

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018

Cloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018 Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

Clustering Lecture 8: MapReduce

Clustering Lecture 8: MapReduce Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

One Trillion Edges. Graph processing at Facebook scale

One Trillion Edges. Graph processing at Facebook scale One Trillion Edges Graph processing at Facebook scale Introduction Platform improvements Compute model extensions Experimental results Operational experience How Facebook improved Apache Giraph Facebook's

More information

Introduction to MapReduce

Introduction to MapReduce Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed

More information

CISC 7610 Lecture 2b The beginnings of NoSQL

CISC 7610 Lecture 2b The beginnings of NoSQL CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone

More information

Principles of Data Management. Lecture #16 (MapReduce & DFS for Big Data)

Principles of Data Management. Lecture #16 (MapReduce & DFS for Big Data) Principles of Data Management Lecture #16 (MapReduce & DFS for Big Data) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Today s News Bulletin

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate

More information

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing

CS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing CS 4604: Introduction to Database Management Systems B. Aditya Prakash Lecture #10: Query Processing Outline introduction selection projection join set & aggregate operations Prakash 2018 VT CS 4604 2

More information

CSE 190D Spring 2017 Final Exam

CSE 190D Spring 2017 Final Exam CSE 190D Spring 2017 Final Exam Full Name : Student ID : Major : INSTRUCTIONS 1. You have up to 2 hours and 59 minutes to complete this exam. 2. You can have up to one letter/a4-sized sheet of notes, formulae,

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016

Query Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,

More information

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)

Hive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic) Hive and Shark Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 1 / 45 Motivation MapReduce is hard to

More information

Query Processing. Introduction to Databases CompSci 316 Fall 2017

Query Processing. Introduction to Databases CompSci 316 Fall 2017 Query Processing Introduction to Databases CompSci 316 Fall 2017 2 Announcements (Tue., Nov. 14) Homework #3 sample solution posted in Sakai Homework #4 assigned today; due on 12/05 Project milestone #2

More information

Chapter 12: Query Processing. Chapter 12: Query Processing

Chapter 12: Query Processing. Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join

More information

Query Processing & Optimization

Query Processing & Optimization Query Processing & Optimization 1 Roadmap of This Lecture Overview of query processing Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Introduction

More information

Map- Reduce. Everything Data CompSci Spring 2014

Map- Reduce. Everything Data CompSci Spring 2014 Map- Reduce Everything Data CompSci 290.01 Spring 2014 2 Announcements (Thu. Feb 27) Homework #8 will be posted by noon tomorrow. Project deadlines: 2/25: Project team formation 3/4: Project Proposal is

More information

Evaluation of Relational Operations

Evaluation of Relational Operations Evaluation of Relational Operations Yanlei Diao UMass Amherst March 13 and 15, 2006 Slides Courtesy of R. Ramakrishnan and J. Gehrke 1 Relational Operations We will consider how to implement: Selection

More information

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team

Introduction to Hadoop. Owen O Malley Yahoo!, Grid Team Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since

More information

Evaluation of Relational Operations

Evaluation of Relational Operations Evaluation of Relational Operations Chapter 14 Comp 521 Files and Databases Fall 2010 1 Relational Operations We will consider in more detail how to implement: Selection ( ) Selects a subset of rows from

More information

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA) Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction

More information

TI2736-B Big Data Processing. Claudia Hauff

TI2736-B Big Data Processing. Claudia Hauff TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Patterns Hadoop Ctd. Graphs Giraph Spark Zoo Keeper Spark Learning objectives Implement

More information

B490 Mining the Big Data. 5. Models for Big Data

B490 Mining the Big Data. 5. Models for Big Data B490 Mining the Big Data 5. Models for Big Data Qin Zhang 1-1 2-1 MapReduce MapReduce The MapReduce model (Dean & Ghemawat 2004) Input Output Goal Map Shuffle Reduce Standard model in industry for massive

More information

CSE 344 Final Review. August 16 th

CSE 344 Final Review. August 16 th CSE 344 Final Review August 16 th Final In class on Friday One sheet of notes, front and back cost formulas also provided Practice exam on web site Good luck! Primary Topics Parallel DBs parallel join

More information

MapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec

MapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec MapReduce: Recap Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec MapReduce: Recap Sequentially read a lot of data Why? Map: extract something we care about map (k, v)

More information

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!

Lecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)! Big Data Processing, 2014/15 Lecture 7: MapReduce design patterns!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm

More information

15-415/615 Faloutsos 1

15-415/615 Faloutsos 1 Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications Lecture #14: Implementation of Relational Operations (R&G ch. 12 and 14) 15-415/615 Faloutsos 1 Outline introduction selection

More information

Distributed Filesystem

Distributed Filesystem Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the

More information

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing

Finding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing Finding Similar Sets Applications Shingling Minhashing Locality-Sensitive Hashing Goals Many Web-mining problems can be expressed as finding similar sets:. Pages with similar words, e.g., for classification

More information

Storage hierarchy. Textbook: chapters 11, 12, and 13

Storage hierarchy. Textbook: chapters 11, 12, and 13 Storage hierarchy Cache Main memory Disk Tape Very fast Fast Slower Slow Very small Small Bigger Very big (KB) (MB) (GB) (TB) Built-in Expensive Cheap Dirt cheap Disks: data is stored on concentric circular

More information

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu

CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin Presented by: Suhua Wei Yong Yu Papers: MapReduce: Simplified Data Processing on Large Clusters 1 --Jeffrey Dean

More information

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha

More information

Parallel Nested Loops

Parallel Nested Loops Parallel Nested Loops For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on (S 1,T 1 ), (S 1,T 2 ),

More information

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011 Parallel Nested Loops Parallel Partition-Based For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Implementation of Relational Operations

Implementation of Relational Operations Implementation of Relational Operations Module 4, Lecture 1 Database Management Systems, R. Ramakrishnan 1 Relational Operations We will consider how to implement: Selection ( ) Selects a subset of rows

More information

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:

Something to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact: Query Evaluation Techniques for large DB Part 1 Fact: While data base management systems are standard tools in business data processing they are slowly being introduced to all the other emerging data base

More information

Programming Models MapReduce

Programming Models MapReduce Programming Models MapReduce Majd Sakr, Garth Gibson, Greg Ganger, Raja Sambasivan 15-719/18-847b Advanced Cloud Computing Fall 2013 Sep 23, 2013 1 MapReduce In a Nutshell MapReduce incorporates two phases

More information

Apache Flink. Alessandro Margara

Apache Flink. Alessandro Margara Apache Flink Alessandro Margara alessandro.margara@polimi.it http://home.deib.polimi.it/margara Recap: scenario Big Data Volume and velocity Process large volumes of data possibly produced at high rate

More information

Lecture Query evaluation. Combining operators. Logical query optimization. By Marina Barsky Winter 2016, University of Toronto

Lecture Query evaluation. Combining operators. Logical query optimization. By Marina Barsky Winter 2016, University of Toronto Lecture 02.03. Query evaluation Combining operators. Logical query optimization By Marina Barsky Winter 2016, University of Toronto Quick recap: Relational Algebra Operators Core operators: Selection σ

More information

Improving the MapReduce Big Data Processing Framework

Improving the MapReduce Big Data Processing Framework Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM

More information

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA

CSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA CSE 3302 Lecture 11: Map/Reduce 7 October 2010 Nate Nystrom UTA 378,000 results in 0.17 seconds including images and video communicates with 1000s of machines web server index servers document servers

More information

From SQL-query to result Have a look under the hood

From SQL-query to result Have a look under the hood From SQL-query to result Have a look under the hood Classical view on RA: sets Theory of relational databases: table is a set Practice (SQL): a relation is a bag of tuples R π B (R) π B (R) A B 1 1 2

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

CSE 344 Final Examination

CSE 344 Final Examination CSE 344 Final Examination March 15, 2016, 2:30pm - 4:20pm Name: Question Points Score 1 47 2 17 3 36 4 54 5 46 Total: 200 This exam is CLOSED book and CLOSED devices. You are allowed TWO letter-size pages

More information

Evaluation of Relational Operations. Relational Operations

Evaluation of Relational Operations. Relational Operations Evaluation of Relational Operations Chapter 14, Part A (Joins) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Relational Operations v We will consider how to implement: Selection ( )

More information