INTRODUCTION TO DATA SCIENCE. MapReduce and the New Software Stacks(MMDS2)
- Allyson Warren
2 Big-Data Hardware Computer Clusters Computation: large number of computers/CPUs Network: Ethernet switching Storage: large collection of distributed disks Commodity hardware: cheap but relatively unreliable Compute node: network-connected CPU(s), disk(s) and RAM How to manage distributed computation tasks and storage? How to protect against frequent failures?
3 New Software Stack Distributed File System Data is duplicated and distributed across multiple locations MapReduce programming paradigm A computational model for performing parallel computations A software infrastructure that manages all the boring tasks (failures, scheduling etc.)
4 2.1 Distributed File Systems
5 Computer Nodes Cluster computing Nodes are placed on racks (8-64 nodes per rack) connected by Ethernet Racks are connected by a switch
6 Failures Typical failures Loss of a node Loss of a rack Some computations take hours Can't restart the whole process for each failure (it would never complete) Solution Redundant storage of files Continue to work on the same file chunk but on another node Computation divided into tasks. Restart a task on another node without affecting other tasks
7 Large-Scale File-System Organization Assumptions Enormous files (TB) Files are rarely updated Data is read, or new data is appended
8 DFS: Master Node Files are divided into large chunks (e.g. 64MB) Chunks are replicated on different nodes/racks To find a required chunk Locate the name node or master node for the file The name node itself and the file index are replicated All participants know how to find the directory
9 2.2 MapReduce
10 MapReduce systems A model and a software system for large-scale fault-resilient computation Powerful and simple Multiple implementations Google MapReduce Open-source Hadoop and HDFS Code all the logic/algorithm in two functions only: a Map function and a Reduce function Let the system handle the rest: failures, duplication, restarts, scheduling, resources, monitoring etc.
11 MapReduce Computation Map: A number of Map tasks on multiple nodes are given file chunks from a DFS Produce key-value pairs according to logic coded in a map function Each data element might produce zero or more key-value pairs Custom logic, coded by the user Group: The master controller collects the key-value pairs and sorts them by key. It divides the result into chunks to submit to a number of Reduce tasks. All key-value pairs with the same key go to the same chunk Standard logic, implemented by the system Reduce: A number of Reduce tasks, each working on one key at a time Combine all values for a single key Custom logic, coded by the user
12 MapReduce Computation
13 Function, Task, Node Map or Reduce function Logic coded by a user Mapper or reducer Map or Reduce function with a single input Example: Reducer for key w Map or Reduce Task Map/Reduce function applied to a chunk (list of key-value pairs) A Reduce task runs a number of reducers Map or Reduce Node A computer that currently runs one or more tasks (Map or Reduce) Tasks might be scheduled to different nodes (there can be more tasks than nodes)
14 The Map Function Input file Elements: a tuple, a line, a document A chunk is a collection of elements Each map task works on one chunk at a time Technically, inputs to Map are key-value pairs Allows composition of several MapReduce processes Usually the key is not relevant for the Map task (e.g. line number in the input file) Output Each element is converted to zero or more key-value pairs Keys are not unique; several identical key-value pairs from the same element are possible
15 Example: Word Count Input: a repository of documents Output: the number of appearances of each word Each document is an input element The Map function Reads a document and breaks it into a sequence of words Produces a key-value pair (w, 1) for each word occurrence
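The Map step above can be sketched in Python. This is a minimal single-machine illustration, not a distributed implementation; the function name and the word-splitting regex are my own choices:

```python
import re

def map_word_count(doc_id, text):
    """Map function: emit a (word, 1) pair for every word occurrence.
    The input key (doc_id) is ignored, as is typical for this task."""
    for word in re.findall(r"[a-z]+", text.lower()):
        yield (word, 1)

pairs = list(map_word_count(1, "to be or not to be"))
```

Note that identical pairs such as ('to', 1) appear once per occurrence; de-duplication is deliberately not done here, since grouping and reducing happen later.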
16 Grouping by key Collect the outputs of all maps into a single list Combine the values with the same key into a key-value-list pair The system divides all keys into buckets The number of buckets equals the number of Reduce tasks Use an appropriate hash function Send each bucket to a Reduce task Input to each Reduce task: a list of key-(value list) pairs
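The system-side grouping step can be sketched as follows. This is an in-memory illustration of the hash-bucketing idea; the function name is hypothetical:

```python
from collections import defaultdict

def group_by_key(pairs, num_reduce_tasks):
    """Hash each key into a bucket (one bucket per Reduce task) and
    collect all values for the same key into a list. All pairs with
    the same key necessarily land in the same bucket."""
    buckets = [defaultdict(list) for _ in range(num_reduce_tasks)]
    for key, value in pairs:
        buckets[hash(key) % num_reduce_tasks][key].append(value)
    return buckets

buckets = group_by_key([("a", 1), ("b", 1), ("a", 1)], 2)
```

The important invariant is that each distinct key appears in exactly one bucket, so one Reduce task sees the complete value list for that key.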
17 Example: Word Count Input to the Reduce function Key is a word Value is a list of ones Sum the ones to get the counter Output of all Reduce tasks is a sequence of (w, m) pairs w is a word (key) m is its number of appearances
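Putting the three stages together, the whole word-count computation can be simulated on one machine. A sketch under the assumption that everything fits in memory; in a real run the grouping is done by the system across nodes:

```python
import re
from collections import defaultdict

def wordcount(documents):
    # Map: one (word, 1) pair per word occurrence
    pairs = [(w, 1) for doc in documents
             for w in re.findall(r"[a-z]+", doc.lower())]
    # Group by key (done by the system in a real MapReduce run)
    groups = defaultdict(list)
    for w, one in pairs:
        groups[w].append(one)
    # Reduce: sum the list of ones for each word
    return {w: sum(ones) for w, ones in groups.items()}

counts = wordcount(["to be or not to be", "to do"])
```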
18 Combiners Optimize the MapReduce process, provided the reduce operation is associative and commutative Values can be combined in any order Push some of the Reducer logic to the Map tasks Word count example Apply the reduce step in each map: reduce key-value pairs with the same key to a single (w,m) Grouping and reduce steps are still necessary Call a Combiner function after each map task Works on the local files produced by a map task Before all map outputs are collected and shuffled
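The combiner idea for word count can be sketched like this: the mapper's (word, 1) pairs are pre-summed locally before the shuffle, which is valid only because addition is associative and commutative. The function name is illustrative:

```python
from collections import Counter

def map_with_combiner(text):
    """Mapper followed by a local combiner: (word, 1) pairs from the
    map step are collapsed to (word, m) before being shuffled."""
    pairs = [(w, 1) for w in text.lower().split()]
    combined = Counter()
    for w, one in pairs:
        combined[w] += one
    return list(combined.items())

out = map_with_combiner("to be or not to be")
```

Each map task now ships one pair per distinct word instead of one pair per occurrence, cutting shuffle traffic; the final grouping and reduce are still needed to merge partial counts across map tasks.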
19 MapReduce with Combiners
20 Parallelism Could execute each reducer by a dedicated Reduce task Single key per process The overhead of creating tasks is too much Skew: difference in computing time between reducers Different number of values for each key Different computation times; some nodes become idle early Control the number of Reduce tasks Run several reducers per Reduce task to average the load More tasks than nodes to balance node load
21 Details of MapReduce Execution Fork a Master controller process and a number of Worker processes A Worker handles Map or Reduce tasks but not both Create a number of Reduce and Map tasks Usually a Map task for each input chunk Select the number of Reduce tasks carefully More tasks mean more communication More tasks also mean more parallelism The Master keeps track of the tasks
22 MapReduce Execution
23 Coping With Node Failures Failure of the Master node Restart the whole MapReduce job The worst case Failure of a Worker node The Master monitors Workers and detects a failure The Master restarts only the tasks that ran on that Worker
24 2.3 Algorithms Using MapReduce
25 Usage Original usage Google uses it for very large matrix-vector multiplication (computing PageRank) The matrix represents links between web pages The vector represents the importance of web pages Makes sense Very large files Not updated in place Batch processing
26 Matrix-Vector Multiplication n-by-n matrix M with elements m_ij n is ~10B for Web pages A vector v of length n with elements v_j The matrix-vector product is a vector x of length n with x_i = Σ_j m_ij·v_j Assume matrix and vector are stored in DFS Easily discoverable row-column coordinates For instance, stored as triples (i, j, m_ij)
27 Case I: Vector fits into RAM The whole vector is available to each mapper The Map Function Element (i, j, m_ij) Output key-value pair (i, m_ij·v_j) The Reduce Function Sum all values associated with the key i
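Case I can be sketched in Python as follows, assuming the matrix arrives as (i, j, m_ij) triples and the vector v fits in memory at every mapper. The function name and the toy 2x2 example are mine:

```python
from collections import defaultdict

def matvec_mapreduce(triples, v):
    """Map: (i, j, m_ij) -> key-value pair (i, m_ij * v[j]).
    Group and Reduce: sum all values for each row index i."""
    pairs = [(i, m * v[j]) for i, j, m in triples]
    x = defaultdict(float)
    for i, prod in pairs:
        x[i] += prod
    return dict(x)

# 2x2 example: M = [[1, 2], [3, 4]], v = [5, 6]
x = matvec_mapreduce([(0, 0, 1), (0, 1, 2), (1, 0, 3), (1, 1, 4)], [5, 6])
```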
28 Case II: Vector cannot fit in RAM Partition the matrix and vector into stripes Each Map task gets a chunk from a matrix stripe together with the entire corresponding stripe of the vector
29 Relational-Algebra Operations Data is frequently stored in tables Relational Database Management Systems (RDBMS) Query language: SQL Underlying theory: relations and operations over them In MapReduce, data is stored in files Frequently the files contain tables or key-value pairs Need to perform SQL-like queries using MapReduce
30 Relations A relation is a table (set) with column headers Attribute: a column header Tuple: a row in the table; no duplicates(!) Schema: the set of all attributes of a particular relation Relation Example
31 Relational Algebra Relational Algebra: a set of standard operations Operations usually produce new relations from one or more input relations Operations (queries) are often written in SQL and executed by an RDBMS Need some formalism to describe/define similar operations to be executed by MapReduce
32 Relational Algebra Selection R' = σ_C(R) Apply Boolean condition C on every tuple Produce a relation with the tuples that satisfy C Projection R' = π_S(R) Select the attributes that are in a given subset S The new relation contains only the selected attributes Union, Intersection, Difference Easy to define for same-schema relations
33 Relational Algebra Natural Join R' = R ⋈ S Join two relations into a single relation (table) Merge tuples which agree on the intersecting attributes Grouping and Aggregation R' = γ_X(R) Partition tuples according to the values of the grouping attributes in X Compute an aggregation (MAX, SUM, ..) per group for the other attributes X is a list of elements that are either A grouping attribute An aggregation function θ(A), for A not a grouping attribute
34 Examples Web Links Find paths of length two, triples (u,v,w), using the relation Links Natural join of Links to itself Two copies: L1(U1,U2) ⋈ L2(U2,U3) Social Network A relation Friends of tuples Count the number of friends for each user
35 MapReduce Selection Projection
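Selection and projection as MapReduce operations can be sketched as below: for selection, Map emits a qualifying tuple as both key and value and Reduce is the identity; for projection, Reduce collapses the duplicates that projection can create. Function names are illustrative:

```python
def selection_map(tuple_, condition):
    """Selection sigma_C: emit (t, t) when t satisfies C, else nothing."""
    return [(tuple_, tuple_)] if condition(tuple_) else []

def projection_map(tuple_, attr_indices):
    """Projection pi_S: emit the projected tuple as key and value;
    the Reduce step keeps one copy per distinct projected tuple."""
    projected = tuple(tuple_[i] for i in attr_indices)
    return [(projected, projected)]

sel = selection_map((1, "a"), lambda t: t[0] > 0)
proj = projection_map((1, "a", "x"), [0, 1])
```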
36 Union, Intersection, Difference Union Intersection Difference
37 MapReduce: Natural Join Start with the simple form R(A,B) and S(B,C) A similar approach works for joining on a group of attributes
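The natural join of R(A,B) and S(B,C) can be sketched as a single Map-group-Reduce round: Map keys each tuple by its B value and tags it with its relation of origin; Reduce pairs every R value with every S value sharing the key. An in-memory sketch with an illustrative function name:

```python
from collections import defaultdict

def natural_join(R, S):
    # Map: tag each tuple with its relation and key it by b
    pairs = [(b, ("R", a)) for a, b in R] + [(b, ("S", c)) for b, c in S]
    # Group by key b
    groups = defaultdict(list)
    for b, tagged in pairs:
        groups[b].append(tagged)
    # Reduce: emit (a, b, c) for every R/S combination with this b
    out = []
    for b, tagged in groups.items():
        rs = [x for tag, x in tagged if tag == "R"]
        ss = [x for tag, x in tagged if tag == "S"]
        out.extend((a, b, c) for a in rs for c in ss)
    return out

joined = natural_join([(1, "x"), (2, "y")], [("x", 10), ("x", 20)])
```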
38 MapReduce: Grouping And Aggregations Simple case R(A,B,C). Group by A, aggregate by B
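The simple grouping-and-aggregation case can be sketched as follows: Map emits (a, b) and drops c; Reduce applies the aggregation to the list of B values of each group. The function name and default aggregation are illustrative:

```python
from collections import defaultdict

def group_aggregate(R, agg=sum):
    """gamma_{A, theta(B)} for R(A,B,C): group tuples by a, then apply
    the aggregation theta (here sum by default) to the B values."""
    groups = defaultdict(list)
    for a, b, c in R:   # Map: keep (a, b), drop c
        groups[a].append(b)
    return {a: agg(bs) for a, bs in groups.items()}  # Reduce

out = group_aggregate([("g1", 1, "x"), ("g1", 2, "y"), ("g2", 5, "z")])
```

Passing `agg=max` instead of the default would compute the MAX aggregation the same way.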
39 Matrix Multiplication using Relational Algebra M, N are matrices with elements m_ij and n_jk Multiplication P = MN Represent a matrix by a relation: M(I,J,V) and N(J,K,W) Especially efficient if the matrices are sparse (omit zeroes) Product Natural join: (i, j, k, v, w) represents the pair m_ij, n_jk Transform to (i, j, k, v·w) Group by i and k with sum aggregation over j
40 MapReduce: Matrix Multiplication 1st phase (create all products m_ij·n_jk) Map for each m_ij produce (j, (M, i, m_ij)) for each n_jk produce (j, (N, k, n_jk)) Reduce For each key j, output all possible combinations of M and N values, with key (i, k) and product m_ij·n_jk 2nd phase (combine all products for (i, k)) Map: identity Reduce: sum the values for each key (i, k)
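The two-phase algorithm can be sketched in Python, representing each matrix as a dict from (row, column) to value. A single-machine sketch; the function name and the 1x2 times 2x1 example are mine:

```python
from collections import defaultdict

def matmul_two_pass(M, N):
    """Phase 1 joins M and N on j to form the products m_ij * n_jk;
    phase 2 groups the products by (i, k) and sums them."""
    # Phase 1 Map: key by j, tag with the matrix of origin
    by_j = defaultdict(list)
    for (i, j), v in M.items():
        by_j[j].append(("M", i, v))
    for (j, k), w in N.items():
        by_j[j].append(("N", k, w))
    # Phase 1 Reduce: all M x N combinations for each j
    products = []
    for j, entries in by_j.items():
        ms = [(i, v) for tag, i, v in entries if tag == "M"]
        ns = [(k, w) for tag, k, w in entries if tag == "N"]
        products.extend(((i, k), v * w) for i, v in ms for k, w in ns)
    # Phase 2: identity Map, then sum per key (i, k)
    P = defaultdict(float)
    for key, prod in products:
        P[key] += prod
    return dict(P)

P = matmul_two_pass({(0, 0): 1, (0, 1): 2}, {(0, 0): 3, (1, 0): 4})
```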
41 Single Step MapReduce Matrix Multiplication Map function Create multiple copies of each input element ((i, k), (M, j, m_ij)) for k = 1, .., n ((i, k), (N, j, n_jk)) for i = 1, .., n Reduce Input: for each key (i, k), the pairs (M, j, m_ij) and (N, j, n_jk) for all j Output Multiply the matching pairs and sum over j for each key
42 2.4 Map-Reduce Extensions
43 Workflow systems Will discuss later when we talk about streams
44 Recursive Extensions Recursive tasks are difficult to compute using MapReduce MapReduce relies on independent restart of failed tasks What if a parent in the recursion chain has failed? Need some other mechanism for implementing recursive workflows Represented by flow graphs with cycles Convert recursion to iteration
45 Example: Path relation in a graph Assume a directed graph is represented by the relation E(X, Y) Compute the path relation P(X, Y): there is a path from node X to node Y P_n(X, Y) = π_{X,Y}(P_{n−1}(X, Z) ⋈ E(Z, Y)) Iterative algorithm for computing P(X, Y) Start from P(X, Y) = E(X, Y) Update/add new pairs till there is no change
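The iteration can be sketched as follows: start with P = E and repeatedly add pairs from the join of P with E until nothing new appears. Each round of the loop would be one MapReduce job in a distributed run; the function name is illustrative:

```python
def transitive_closure(E):
    """Iterative path computation over an edge set E of (x, y) pairs."""
    P = set(E)
    while True:
        # Join P(X, Z) with E(Z, Y), project to (X, Y), keep only new pairs
        new = {(x, y) for x, z in P for z2, y in E if z == z2} - P
        if not new:
            return P
        P |= new

paths = transitive_closure({(1, 2), (2, 3), (3, 4)})
```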
46 Algorithm
47 Workflow Implementation Two types of tasks: Join and Dup-elim n Join tasks create new candidate pairs m Dup-elim tasks remove duplicates and resend to the Joins Route/partition data by hash functions Each Join task handles pairs (a,b), (b,c) according to the hash value h(b) Each Dup-elim task handles pairs (a,c) according to the hash value g(a,c)
48 Join Tasks Join task #i receives all pairs P(a,b) such that h(a) = i or h(b) = i Each pair can go to two tasks, h(a) = i and h(b) = j Each Join task Stores P(a,b) locally till the end of the computation If h(a) = i, tries to match the new P(a,b) against locally stored P(x,a) and produce output P(x,b) If h(b) = i, tries to match the new P(a,b) against locally stored P(b,y) and produce output P(a,y) Sends each resulting pair (c,d) to a Dup-elim task according to the hash value g(c,d)
49 Dup-elim task Dup-elim task #j stores all pairs (c,d) with hash value g(c,d) = j On receiving a new pair, it checks it against the locally stored pairs If it's a new pair, it is stored and sent to Join tasks according to h(c) and h(d)
50 Workflow
51 Details Every Join task writes to m output files A single file for each Dup-elim task Every Dup-elim task writes to n output files A single file for each Join task Start by sending the E(a,b) pairs to the appropriate Dup-elim tasks According to g(a,b) Wait till all Join tasks finish before starting the Dup-elim phase Then all Dup-elim tasks have their input files
52 Failures It is not necessary to have two types of tasks Whenever a Join produces a new candidate (a,c), transmit it to two other Join tasks according to h(a), h(c) Before a Join task uses a new pair to search for candidates, check it against the locally stored pairs and discard it if it already exists Failure Single type of task: everything with this hash value is lost Two types of tasks can handle single failures A failed Join task recreates its data from the relevant Dup-elim tasks A failed Dup-elim task recreates its data from the relevant Join tasks No problem if a restarted task produces duplicate input for other tasks
53 Graph processing systems Handle computations where the input is a very large graph Google's Pregel and Apache's Giraph Facebook: 200 machines running Giraph, a graph of 1 trillion edges, 4 minutes of processing time
54 Example Given a graph, compute the shortest distance between each pair of nodes Assign a task per node Group several tasks on a single compute node Each task receives messages Processes them and sends out other messages Computation by supersteps All nodes process their messages All nodes issue their messages
55 Algorithm Initially Node #a stores every edge from a An edge from a to b of weight w is stored as (b,w) Node #a sends messages to all other nodes Message: (a,b,w) When a message (c,d,w) arrives at node #a Consider the new paths (a,d) via a→c→d and (c,a) via c→d→a Update the stored weight if a shorter path is discovered Send out a message with any newly discovered path to all other nodes
56 Handling Failure By checkpoints Every node saves its entire state every few supersteps On failure All nodes are restarted from the last checkpoint
57 2.5 The Communication Cost Model
58 Measuring Quality of Algorithms For many algorithms the performance bottleneck is moving data between tasks (nodes) Each task is usually simple, linear in its data Transmitting data and reading it into memory is slow Communication cost dominates To measure/estimate the communication cost Describe the algorithm as an acyclic workflow A graph of tasks and the communication between them Measure/estimate the data transmitted along each edge
59 Communication Cost The communication cost of a task is the size of the input to the task Measured in bytes or tuples The communication cost of an algorithm is the sum of the communication costs of all the tasks Why not outputs? Counted as input to other tasks Unless it's the output of the entire algorithm If the output of the entire algorithm is large, then most probably it's an input to a next stage Count it as input to that next stage
60 Example: Natural Join Algorithm Reminder Map For each (a,b) of R create the pair (b, (R,a)) For each (b,c) of S create the pair (b, (S,c)) Reduce For a key b, combine all pairs (R,a), (S,c) into tuples (a,b,c) Communication Cost Assume R and S are of size r and s Input to all Maps: r+s Output of all Maps and input to all Reduces: r+s Total: 2r+2s tuples, or O(r+s) bytes Computation time is small We don't count the output from Reduce (potentially of size r·s)
61 Wall-Clock Time Could assign all work to a single task to minimize the communication cost But the running time of the algorithm, the wall-clock time, also matters Need to divide the work fairly among the tasks while minimizing the communication cost More about this later
62 Multiway Joins: Cascade Example: 3-way join Cascade two MapReduce jobs Do the first join, then the second, or Do the second join, then the first Communication cost p is the probability (proportion) that two tuples match 1st, then 2nd: O((r+s) + prs + t) bytes 2nd, then 1st: O((s+t) + pst + r) bytes
63 Multiway Join: Single Step A single MapReduce job that joins all three relations at once The key to a Reducer is a pair (i, j) It receives R(u, v), S(v, w), T(w, x) such that h(v) = i and g(w) = j The total number of reducers is k = bc, where b and c are the numbers of buckets for h and g Send S(v, w) only to a single Reducer Send R(u, v) to c Reducers and T(w, x) to b Reducers
64 Single Step: Communication Cost To the Reduce tasks s tuples to move (1 copy of each S tuple) cr tuples to move (c copies of each R tuple) bt tuples to move (b copies of each T tuple) To the Map tasks r + s + t input tuples to all Map tasks How do you select c and b subject to cb = k? The Map communication cost is the same for all choices
65 Optimization Problem Minimize s + cr + bt under the constraint cb = k Lagrange multipliers: minimize s + cr + bt − λ(cb − k) Set the derivatives with respect to c and b to zero: r − λb = 0, t − λc = 0 Multiply the two equations: rt = λ²bc = λ²k, so λ = √(rt/k) Substitute: b = r/λ = √(kr/t), c = t/λ = √(kt/r) Substitute into s + cr + bt = s + 2√(krt) Add the Map cost: r + 2s + t + 2√(krt)
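The optimum can be checked numerically: at b = √(kr/t) and c = √(kt/r) the constraint cb = k holds and the reduce-side cost s + cr + bt equals the closed form s + 2√(krt). A small verification sketch with arbitrary example sizes:

```python
from math import sqrt, isclose

def three_way_join_cost(r, s, t, k):
    """Reduce-side communication s + c*r + b*t at the Lagrange optimum
    b = sqrt(k*r/t), c = sqrt(k*t/r)."""
    b = sqrt(k * r / t)
    c = sqrt(k * t / r)
    return b, c, s + c * r + b * t

b, c, cost = three_way_join_cost(r=1000.0, s=2000.0, t=4000.0, k=100)
```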
66 Example: Facebook 1B users, 300 friends each on average Size of the relation: r = 3·10^11 Friends-of-friends relationship R ⋈ R Maximum size r², but assume friends form cliques, so about 30r Want to compute the friends-of-friends-of-friends relationship R ⋈ R ⋈ R Start marketing to those who have a large number of friends of friends of friends What is the best way to compute this with MapReduce? Cascade a pair of two-way joins A single 3-way join workflow
67 Example: Communication cost Cascade computation First join Map: 2r, Reduce: 2r, total 4r Second join Map: r+30r, Reduce: r+30r, total 62r Total: 4r + 62r = 66r ≈ 2·10^13 3-way join: r + 2r + r + 2r√k = 4r + 2r√k Compare: 66r > 4r + 2r√k when √k < 31, i.e. k < 961 Result If the number of reducers is less than 961, use the 3-way join
68 Star join: Walmart Example Fact table Business facts Each sale is kept in the fact table Relation F(A_1, A_2, ..) Attributes are the important components of a sale Item, Store, Branch, Customer Id Very large table Dimension tables For each key attribute A_i, a relation D(A_i, B_{i,1}, B_{i,2}, ..) Descriptive attributes/fields For Customer Id: phone number, address, age Many small tables Analytical queries Join the fact table with one or more dimension tables (star join) Aggregate results into a useful form Example: aggregated sales by region and color for each month
69 Star Joins: MapReduce Computation Don't wait for a specific query; prepare the Reduce nodes for all possibilities Send the dimension tables to the Reduce nodes and store them locally Use the same hash values as would be used for a multiway join of the fact table with every dimension table Store the fact table on the Map nodes Run the multiway join and aggregate
70 2.6 Complexity Theory for MapReduce
71 Parameters of a MapReduce Algorithm Reducer size q: upper bound on the number of values for a single key Want it to be small Input for a single reducer fits into memory High degree of parallelism Replication rate r: average number of key-value pairs per input element Average communication from Map to Reduce per input element Usually there is a tradeoff between the communication cost factor (r) and the computational efficiency factor (q)
72 Example: 1-Pass Matrix Multiplication Recall the 1-pass matrix multiplication A reducer per output (matrix element) Map sends a copy to each relevant reducer Replication rate is r = n Each element is duplicated exactly n times Reducer size is q = 2n (n values from each matrix) Can design a family of algorithms trading q for r with qr = 2n²
73 Example: Similarity Joins Assume a collection X of 1M images Input key-value pairs: index + image, (i, P_i) Some measure of similarity s(x, y) Symmetric: s(x, y) = s(y, x) Find all pairs (x, y) such that s(x, y) > t
74 Obvious algorithm A reducer per pair Evaluate s on key (i, j) If greater than the threshold, produce output Map Duplicate each input (i, P_i) for each j: ((i, j), P_i) Replication rate is r = 999,999; reducer size is q = 2 Communication (Map + Reduce): for 1MB images the total is about 10^18 bytes, an exabyte Takes about 300 years over gigabit Ethernet
75 Similarity Joins Using Groups Select g, the number of groups, each of 10^6/g images Use a hash function with g values to define the groups The Map Function For each input element (i, P_i) generate g − 1 key-value pairs Each key is an unordered set {u, v}, where u is the group of this image and v is any other group The Reduce Function (key is {u, v}) Compare between the two groups For each key {u, v} there are 2·10^6/g elements Compare instances from the different groups Need to choose a reducer to compare within a group For instance, compare within group u at the reducer for {u, u+1}
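The group-based Map function can be sketched as follows. The modulo stand-in for the hash h and the function name are my own; a real run would use a proper hash of the image index:

```python
def similarity_join_map(i, picture, g):
    """Image i falls in group u = h(i); emit one copy of the image per
    unordered key {u, v} for every other group v (g - 1 keys total).
    `picture` stands for the image bytes and is just passed through."""
    u = i % g                      # simple stand-in for the hash h(i)
    return [(frozenset((u, v)), (i, picture))
            for v in range(g) if v != u]

out = similarity_join_map(7, b"img", g=4)
```

Using a frozenset as the key makes {u, v} and {v, u} the same key, so the two groups meet at a single reducer.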
76 Analysis Replication rate r = g − 1 ≈ g for large g Reducer size q = 2·10^6/g (images from two groups) For g = 1000, the input for a single reducer is 2000 images of 1MB, i.e. 2GB Number of reducers k ≈ g²/2 Communication cost is about 10^3 · 10^6 · 10^6 = 10^15 bytes 1000 times less (about 4 months vs. 300 years over gigabit Ethernet) 500K reducers, can be balanced well
77 Graph model for a MapReduce problem MapReduce problem A set of inputs A set of outputs A many-many relationship between input and output Which inputs are needed to produce which outputs Example: Similarity Join for 4 pictures
78 Example: Matrix Multiplication Multiply two n-by-n matrices 2n² inputs m_ij and n_jk, and n² outputs p_ik Each output p_ik is related to 2n inputs: m_i1, m_i2, .. and n_1k, n_2k, .. Each input m_ij or n_jk is related to n outputs: p_i1, p_i2, .. for m_ij, and p_1k, p_2k, .. for n_jk
79 Implicit inputs/outputs Example: Natural join of R(A,B) and S(B,C) Assume A, B, C have finite domains Finite number of possible inputs and outputs Not all inputs are present (not all possible tuples) Not all outputs are produced For analysis purposes, consider the complete graph for the problem It's a model for the problem, not for a specific input instance
80 Mapping Schemas Each algorithm is defined by a mapping schema How outputs are produced from inputs (by reducers) Given a reducer size q, a mapping schema is an assignment of inputs to one or more reducers such that No reducer is assigned more than q inputs For every output, there is at least one reducer assigned all the inputs related to this output
81 Example: Similarity Join Number of inputs is p, number of outputs is p(p−1)/2 ≈ p²/2 Assign g(g−1)/2 ≈ g²/2 reducers A reducer gets the inputs from 2 groups: q = 2p/g It's a mapping schema Reducer size is q Each output is covered Replication rate is r = g − 1 ≈ g In this case r ≈ 2p/q Inverse relation between r and q
82 When Not All Inputs Are Present Not all inputs are present There are always 1M images, but how many tuples are in a relation? How to assign? Example: Assume only 5% of the possible data is present For reducer size q, only q/20 inputs will actually arrive After the analysis, account for the 5% by replacing q with 20q Design the algorithm for 20q; in reality only about q inputs will arrive
83 Lower Bound On Replication Rate Similarity join Select the reducer size to trade communication vs. parallelism Or the ability to execute in RAM How do we know that we got the best tradeoff between q and r? What is the minimum possible communication (r) for a given q? Prove a matching lower bound Find a lower bound for the problem (for a given q) Show that the tradeoff achieved by the assignment matches it
84 Steps to Prove Lower Bounds Bound the outputs of a single reducer Given input size q, it can cover only g(q) outputs For any mapping Calculate the total number of outputs for the problem Does not depend on a particular mapping Bound the number of outputs covered by all reducers Note the inequality Σ_{i=1..k} g(q_i) ≥ total number of outputs Manipulate the inequality to get the total communication Σ_{i=1..k} q_i Use the reducer size bound q_i ≤ q The replication rate is Σ_{i=1..k} q_i divided by the number of inputs
85 Example: Similarity Join A reducer with q inputs can't cover more than q²/2 outputs Total number of outputs: p²/2 Total coverable outputs: Σ_{i=1..k} q_i²/2 ≥ p²/2 Since q_i ≤ q: q·Σ_{i=1..k} q_i ≥ Σ_{i=1..k} q_i² ≥ p², so Σ_{i=1..k} q_i ≥ p²/q Replication rate r = (Σ_{i=1..k} q_i)/p ≥ p/q
86 Case Study: Matrix Multiplication Consider an improvement to the 1-pass matrix multiplication Consider the tradeoff of the algorithm and show a matching lower bound Idea: group rows and columns into bands Each reducer gets a band of rows from the 1st matrix and a band of columns from the 2nd matrix The reducer produces a square of elements of the output matrix
87 Matrix Multiplications
88 Matrix Multiplication Compute P = MN; all matrices are n-by-n Group the rows of M into g groups/bands, n/g rows each, and similarly the columns of N Keys correspond to two groups (a row band of M and a column band of N) Map function g² keys: pairs of row/column band numbers Duplicate each M/N input for all possible column/row bands of N/M Reduce function Compute a square of output elements
89 Analysis A reducer gets n·n/g elements from each matrix Total q = 2n²/g Replication rate r = g Combine to obtain the tradeoff r = 2n²/q Can't get a better tradeoff No 1-pass mapping achieves lower r for the same q
90 Analysis A reducer has to receive a full row and a full column to produce a single output If its input consists of a rows and b columns, the input size is q = n(a + b) It produces s = ab outputs; coverage is maximal when a = b Then q = 2na and g(q) = a² = q²/(4n²) The total number of outputs is n², therefore Σ_{i=1..k} q_i²/(4n²) ≥ n²
91 Analysis From the previous slide: Σ_{i=1..k} q_i² ≥ 4n⁴ Since q_i ≤ q: q·Σ_{i=1..k} q_i ≥ 4n⁴ With Σ_{i=1..k} q_i = r·2n² (replication rate times number of inputs): q·(r·2n²) ≥ 4n⁴, so r ≥ 2n²/q Total communication r·2n² ≥ 4n⁴/q
92 Matrix Multiplication II Recall the two-pass algorithm 1st step: combine (i, j) and (j, k) with reducers keyed by j 2nd step: send all (i, k) products to be summed up by key (i, k) Generalization Partition the rows and columns of both matrices into g groups Total g² squares of n²/g² elements each Square (I,J) of M and square (J,K) of N are needed to compute square (I,K), where I, J, K are sets of indices (groups)
93 Matrix Multiplication
94 Matrix Multiplication: 1st step The Map Function Keys are (I, J, K) triples of group numbers Duplicate M inputs for all K (group numbers) and N inputs for all I (group numbers) Replication rate is g The Reduce Function: key (I, J, K) Compute the partial products needed for each P_ik: x_{iJk} = Σ_{j in J} m_ij·n_jk, for all i in I and k in K
95 Matrix Multiplication: 2nd step The Map Function Input x_{iJk} Key (i, k) The Reduce Function Sum all x_{iJk} to obtain P_ik
96 Analysis 1st step Replication rate g Total communication 2gn² Reducer size q = 2n²/g² Substituting g = n·√(2/q): total communication 2gn² = 2√2·n³/√q 2nd step Communication: g values x for each (i, k), total gn² = √2·n³/√q Total Communication 2-pass algorithm (1st + 2nd): 3√2·n³/√q 1-pass algorithm (see slide 91): 4n⁴/q
97 Summary Cluster Computing Disk, CPU, memory in racks of nodes Distributed File System Large duplicated chunks Complexity Communication cost Reducer size and replication rate Problem as an input-output graph MapReduce Parallelize, manage failures, logic in two custom functions Hadoop, workflows Relational Operations Natural join, multiway and star joins Mapping schema Matrix Multiplication 1-pass vs 2-pass Generalization for bands and squares Analysis
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationChapter 13: Query Processing Basic Steps in Query Processing
Chapter 13: Query Processing Basic Steps in Query Processing! Overview! Measures of Query Cost! Selection Operation! Sorting! Join Operation! Other Operations! Evaluation of Expressions 1. Parsing and
More informationCS 345A Data Mining. MapReduce
CS 345A Data Mining MapReduce Single-node architecture CPU Machine Learning, Statistics Memory Classical Data Mining Disk Commodity Clusters Web data sets can be very large Tens to hundreds of terabytes
More informationTime Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix
Time Complexity and Parallel Speedup to Compute the Gamma Summarization Matrix Carlos Ordonez, Yiqun Zhang Department of Computer Science, University of Houston, USA Abstract. We study the serial and parallel
More informationChapter 12: Query Processing
Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join
More informationFaloutsos 1. Carnegie Mellon Univ. Dept. of Computer Science Database Applications. Outline
Carnegie Mellon Univ. Dept. of Computer Science 15-415 - Database Applications Lecture #14: Implementation of Relational Operations (R&G ch. 12 and 14) 15-415 Faloutsos 1 introduction selection projection
More informationCSE 344 MAY 2 ND MAP/REDUCE
CSE 344 MAY 2 ND MAP/REDUCE ADMINISTRIVIA HW5 Due Tonight Practice midterm Section tomorrow Exam review PERFORMANCE METRICS FOR PARALLEL DBMSS Nodes = processors, computers Speedup: More nodes, same data
More informationWhere We Are. Review: Parallel DBMS. Parallel DBMS. Introduction to Data Management CSE 344
Where We Are Introduction to Data Management CSE 344 Lecture 22: MapReduce We are talking about parallel query processing There exist two main types of engines: Parallel DBMSs (last lecture + quick review)
More informationProgramming Systems for Big Data
Programming Systems for Big Data CS315B Lecture 17 Including material from Kunle Olukotun Prof. Aiken CS 315B Lecture 17 1 Big Data We ve focused on parallel programming for computational science There
More informationFinal Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23
Final Exam Review 2 Kathleen Durant CS 3200 Northeastern University Lecture 23 QUERY EVALUATION PLAN Representation of a SQL Command SELECT {DISTINCT} FROM {WHERE
More informationCSE 190D Spring 2017 Final Exam Answers
CSE 190D Spring 2017 Final Exam Answers Q 1. [20pts] For the following questions, clearly circle True or False. 1. The hash join algorithm always has fewer page I/Os compared to the block nested loop join
More informationAnnouncement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17
Announcement CompSci 516 Database Systems Lecture 10 Query Evaluation and Join Algorithms Project proposal pdf due on sakai by 5 pm, tomorrow, Thursday 09/27 One per group by any member Instructor: Sudeepa
More informationCS / Cloud Computing. Recitation 3 September 9 th & 11 th, 2014
CS15-319 / 15-619 Cloud Computing Recitation 3 September 9 th & 11 th, 2014 Overview Last Week s Reflection --Project 1.1, Quiz 1, Unit 1 This Week s Schedule --Unit2 (module 3 & 4), Project 1.2 Questions
More informationMapReduce and Hadoop. Debapriyo Majumdar Data Mining Fall 2014 Indian Statistical Institute Kolkata. November 10, 2014
MapReduce ad Hadoop Debapriyo Majumdar Data Miig Fall 2014 Idia Statistical Istitute Kolkata November 10, 2014 Let s keep the itro short Moder data miig: process immese amout of data quickly Exploit parallelism
More informationDistributed File Systems II
Distributed File Systems II To do q Very-large scale: Google FS, Hadoop FS, BigTable q Next time: Naming things GFS A radically new environment NFS, etc. Independence Small Scale Variety of workloads Cooperation
More informationIntroduction to Query Processing and Query Optimization Techniques. Copyright 2011 Ramez Elmasri and Shamkant Navathe
Introduction to Query Processing and Query Optimization Techniques Outline Translating SQL Queries into Relational Algebra Algorithms for External Sorting Algorithms for SELECT and JOIN Operations Algorithms
More informationAnnouncements. Parallel Data Processing in the 20 th Century. Parallel Join Illustration. Introduction to Database Systems CSE 414
Introduction to Database Systems CSE 414 Lecture 17: MapReduce and Spark Announcements Midterm this Friday in class! Review session tonight See course website for OHs Includes everything up to Monday s
More informationMI-PDB, MIE-PDB: Advanced Database Systems
MI-PDB, MIE-PDB: Advanced Database Systems http://www.ksi.mff.cuni.cz/~svoboda/courses/2015-2-mie-pdb/ Lecture 10: MapReduce, Hadoop 26. 4. 2016 Lecturer: Martin Svoboda svoboda@ksi.mff.cuni.cz Author:
More informationAnnouncements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems
Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic
More informationTopics. Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples
Hadoop Introduction 1 Topics Big Data Analytics What is and Why Hadoop? Comparison to other technologies Hadoop architecture Hadoop ecosystem Hadoop usage examples 2 Big Data Analytics What is Big Data?
More informationParallel Programming Principle and Practice. Lecture 10 Big Data Processing with MapReduce
Parallel Programming Principle and Practice Lecture 10 Big Data Processing with MapReduce Outline MapReduce Programming Model MapReduce Examples Hadoop 2 Incredible Things That Happen Every Minute On The
More informationHadoop/MapReduce Computing Paradigm
Hadoop/Reduce Computing Paradigm 1 Large-Scale Data Analytics Reduce computing paradigm (E.g., Hadoop) vs. Traditional database systems vs. Database Many enterprises are turning to Hadoop Especially applications
More informationEvaluation of relational operations
Evaluation of relational operations Iztok Savnik, FAMNIT Slides & Textbook Textbook: Raghu Ramakrishnan, Johannes Gehrke, Database Management Systems, McGraw-Hill, 3 rd ed., 2007. Slides: From Cow Book
More informationIntroduction to Data Management CSE 344
Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Winter 215 1 HW8 MapReduce (Hadoop) w/ declarative language (Pig) Due next Thursday evening Will send out reimbursement codes later
More informationCloud Computing and Hadoop Distributed File System. UCSB CS170, Spring 2018
Cloud Computing and Hadoop Distributed File System UCSB CS70, Spring 08 Cluster Computing Motivations Large-scale data processing on clusters Scan 000 TB on node @ 00 MB/s = days Scan on 000-node cluster
More informationTITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP
TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop
More informationClustering Lecture 8: MapReduce
Clustering Lecture 8: MapReduce Jing Gao SUNY Buffalo 1 Divide and Conquer Work Partition w 1 w 2 w 3 worker worker worker r 1 r 2 r 3 Result Combine 4 Distributed Grep Very big data Split data Split data
More informationBig Data Management and NoSQL Databases
NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model
More informationOne Trillion Edges. Graph processing at Facebook scale
One Trillion Edges Graph processing at Facebook scale Introduction Platform improvements Compute model extensions Experimental results Operational experience How Facebook improved Apache Giraph Facebook's
More informationIntroduction to MapReduce
Basics of Cloud Computing Lecture 4 Introduction to MapReduce Satish Srirama Some material adapted from slides by Jimmy Lin, Christophe Bisciglia, Aaron Kimball, & Sierra Michels-Slettvet, Google Distributed
More informationCISC 7610 Lecture 2b The beginnings of NoSQL
CISC 7610 Lecture 2b The beginnings of NoSQL Topics: Big Data Google s infrastructure Hadoop: open google infrastructure Scaling through sharding CAP theorem Amazon s Dynamo 5 V s of big data Everyone
More informationPrinciples of Data Management. Lecture #16 (MapReduce & DFS for Big Data)
Principles of Data Management Lecture #16 (MapReduce & DFS for Big Data) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Today s News Bulletin
More informationData Informatics. Seon Ho Kim, Ph.D.
Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate
More informationCS 4604: Introduction to Database Management Systems. B. Aditya Prakash Lecture #10: Query Processing
CS 4604: Introduction to Database Management Systems B. Aditya Prakash Lecture #10: Query Processing Outline introduction selection projection join set & aggregate operations Prakash 2018 VT CS 4604 2
More informationCSE 190D Spring 2017 Final Exam
CSE 190D Spring 2017 Final Exam Full Name : Student ID : Major : INSTRUCTIONS 1. You have up to 2 hours and 59 minutes to complete this exam. 2. You can have up to one letter/a4-sized sheet of notes, formulae,
More informationDeveloping MapReduce Programs
Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes
More informationQuery Processing. Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016
Query Processing Debapriyo Majumdar Indian Sta4s4cal Ins4tute Kolkata DBMS PGDBA 2016 Slides re-used with some modification from www.db-book.com Reference: Database System Concepts, 6 th Ed. By Silberschatz,
More informationHive and Shark. Amir H. Payberah. Amirkabir University of Technology (Tehran Polytechnic)
Hive and Shark Amir H. Payberah amir@sics.se Amirkabir University of Technology (Tehran Polytechnic) Amir H. Payberah (Tehran Polytechnic) Hive and Shark 1393/8/19 1 / 45 Motivation MapReduce is hard to
More informationQuery Processing. Introduction to Databases CompSci 316 Fall 2017
Query Processing Introduction to Databases CompSci 316 Fall 2017 2 Announcements (Tue., Nov. 14) Homework #3 sample solution posted in Sakai Homework #4 assigned today; due on 12/05 Project milestone #2
More informationChapter 12: Query Processing. Chapter 12: Query Processing
Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join
More informationQuery Processing & Optimization
Query Processing & Optimization 1 Roadmap of This Lecture Overview of query processing Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Introduction
More informationMap- Reduce. Everything Data CompSci Spring 2014
Map- Reduce Everything Data CompSci 290.01 Spring 2014 2 Announcements (Thu. Feb 27) Homework #8 will be posted by noon tomorrow. Project deadlines: 2/25: Project team formation 3/4: Project Proposal is
More informationEvaluation of Relational Operations
Evaluation of Relational Operations Yanlei Diao UMass Amherst March 13 and 15, 2006 Slides Courtesy of R. Ramakrishnan and J. Gehrke 1 Relational Operations We will consider how to implement: Selection
More informationIntroduction to Hadoop. Owen O Malley Yahoo!, Grid Team
Introduction to Hadoop Owen O Malley Yahoo!, Grid Team owen@yahoo-inc.com Who Am I? Yahoo! Architect on Hadoop Map/Reduce Design, review, and implement features in Hadoop Working on Hadoop full time since
More informationEvaluation of Relational Operations
Evaluation of Relational Operations Chapter 14 Comp 521 Files and Databases Fall 2010 1 Relational Operations We will consider in more detail how to implement: Selection ( ) Selects a subset of rows from
More informationIntroduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)
Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction
More informationTI2736-B Big Data Processing. Claudia Hauff
TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Patterns Hadoop Ctd. Graphs Giraph Spark Zoo Keeper Spark Learning objectives Implement
More informationB490 Mining the Big Data. 5. Models for Big Data
B490 Mining the Big Data 5. Models for Big Data Qin Zhang 1-1 2-1 MapReduce MapReduce The MapReduce model (Dean & Ghemawat 2004) Input Output Goal Map Shuffle Reduce Standard model in industry for massive
More informationCSE 344 Final Review. August 16 th
CSE 344 Final Review August 16 th Final In class on Friday One sheet of notes, front and back cost formulas also provided Practice exam on web site Good luck! Primary Topics Parallel DBs parallel join
More informationMapReduce: Recap. Juliana Freire & Cláudio Silva. Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec
MapReduce: Recap Some slides borrowed from Jimmy Lin, Jeff Ullman, Jerome Simeon, and Jure Leskovec MapReduce: Recap Sequentially read a lot of data Why? Map: extract something we care about map (k, v)
More informationLecture 7: MapReduce design patterns! Claudia Hauff (Web Information Systems)!
Big Data Processing, 2014/15 Lecture 7: MapReduce design patterns!! Claudia Hauff (Web Information Systems)! ti2736b-ewi@tudelft.nl 1 Course content Introduction Data streams 1 & 2 The MapReduce paradigm
More information15-415/615 Faloutsos 1
Carnegie Mellon Univ. Dept. of Computer Science 15-415/615 - DB Applications Lecture #14: Implementation of Relational Operations (R&G ch. 12 and 14) 15-415/615 Faloutsos 1 Outline introduction selection
More informationDistributed Filesystem
Distributed Filesystem 1 How do we get data to the workers? NAS Compute Nodes SAN 2 Distributing Code! Don t move data to workers move workers to the data! - Store data on the local disks of nodes in the
More informationFinding Similar Sets. Applications Shingling Minhashing Locality-Sensitive Hashing
Finding Similar Sets Applications Shingling Minhashing Locality-Sensitive Hashing Goals Many Web-mining problems can be expressed as finding similar sets:. Pages with similar words, e.g., for classification
More informationStorage hierarchy. Textbook: chapters 11, 12, and 13
Storage hierarchy Cache Main memory Disk Tape Very fast Fast Slower Slow Very small Small Bigger Very big (KB) (MB) (GB) (TB) Built-in Expensive Cheap Dirt cheap Disks: data is stored on concentric circular
More informationCIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin. Presented by: Suhua Wei Yong Yu
CIS 601 Graduate Seminar Presentation Introduction to MapReduce --Mechanism and Applicatoin Presented by: Suhua Wei Yong Yu Papers: MapReduce: Simplified Data Processing on Large Clusters 1 --Jeffrey Dean
More informationShark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker
Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha
More informationParallel Nested Loops
Parallel Nested Loops For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on (S 1,T 1 ), (S 1,T 2 ),
More informationParallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011
Parallel Nested Loops Parallel Partition-Based For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on
More informationData Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros
Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on
More informationImplementation of Relational Operations
Implementation of Relational Operations Module 4, Lecture 1 Database Management Systems, R. Ramakrishnan 1 Relational Operations We will consider how to implement: Selection ( ) Selects a subset of rows
More informationSomething to think about. Problems. Purpose. Vocabulary. Query Evaluation Techniques for large DB. Part 1. Fact:
Query Evaluation Techniques for large DB Part 1 Fact: While data base management systems are standard tools in business data processing they are slowly being introduced to all the other emerging data base
More informationProgramming Models MapReduce
Programming Models MapReduce Majd Sakr, Garth Gibson, Greg Ganger, Raja Sambasivan 15-719/18-847b Advanced Cloud Computing Fall 2013 Sep 23, 2013 1 MapReduce In a Nutshell MapReduce incorporates two phases
More informationApache Flink. Alessandro Margara
Apache Flink Alessandro Margara alessandro.margara@polimi.it http://home.deib.polimi.it/margara Recap: scenario Big Data Volume and velocity Process large volumes of data possibly produced at high rate
More informationLecture Query evaluation. Combining operators. Logical query optimization. By Marina Barsky Winter 2016, University of Toronto
Lecture 02.03. Query evaluation Combining operators. Logical query optimization By Marina Barsky Winter 2016, University of Toronto Quick recap: Relational Algebra Operators Core operators: Selection σ
More informationImproving the MapReduce Big Data Processing Framework
Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM
More informationCSE Lecture 11: Map/Reduce 7 October Nate Nystrom UTA
CSE 3302 Lecture 11: Map/Reduce 7 October 2010 Nate Nystrom UTA 378,000 results in 0.17 seconds including images and video communicates with 1000s of machines web server index servers document servers
More informationFrom SQL-query to result Have a look under the hood
From SQL-query to result Have a look under the hood Classical view on RA: sets Theory of relational databases: table is a set Practice (SQL): a relation is a bag of tuples R π B (R) π B (R) A B 1 1 2
More informationCSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark
CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel
More informationCSE 344 Final Examination
CSE 344 Final Examination March 15, 2016, 2:30pm - 4:20pm Name: Question Points Score 1 47 2 17 3 36 4 54 5 46 Total: 200 This exam is CLOSED book and CLOSED devices. You are allowed TWO letter-size pages
More informationEvaluation of Relational Operations. Relational Operations
Evaluation of Relational Operations Chapter 14, Part A (Joins) Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Relational Operations v We will consider how to implement: Selection ( )
More information