Anurag Sharma (IIT Bombay) 1 / 13

Map Reduce Algorithm Design

Relational Joins
Anurag Sharma, Fundamental Research Group, IIT Bombay

Secondary Sorting
Required when the entries must be sorted by value in addition to the key.
Google's MapReduce provides a built-in secondary sort.

Example: in the weather dataset there is a set of sensors. Every reading is a tuple (t, m, r), where t is the time, m the sensor id, and r the reading of the sensor.
If we want the values grouped by sensor, the map output is m -> (t, r).
But suppose we wish to reconstruct the activity at each individual sensor over time.
Approach 1: buffer all the readings in memory in the reducer, then sort them by timestamp before any further processing. This is faster, but it creates a scalability bottleneck if there is a lot of sensor data or if each sensor reading is a complex object.
Approach 2: the value-to-key conversion design pattern. The idea is to move part of the value into the intermediate key to form a composite key, and let the MapReduce execution framework handle the sorting. The new mapper output is (m, t) -> r.
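The value-to-key conversion in Approach 2 can be sketched in plain Python, without any MapReduce framework. This is only an illustration under stated assumptions: `sorted` stands in for the framework's shuffle-and-sort phase, and the function names and sample readings are hypothetical.

```python
from itertools import groupby

def map_reading(t, m, r):
    # Emit a composite key (sensor id, timestamp); the value is only the reading.
    return ((m, t), r)

def shuffle_and_sort(pairs):
    # Stand-in for the framework's sort: order by the full composite key.
    return sorted(pairs, key=lambda kv: kv[0])

def reduce_by_sensor(sorted_pairs):
    # Group on the sensor id only; within a group the readings are
    # already in timestamp order, so no in-memory sort is needed.
    result = {}
    for sensor, group in groupby(sorted_pairs, key=lambda kv: kv[0][0]):
        result[sensor] = [(key[1], reading) for key, reading in group]
    return result

# Hypothetical readings (t, m, r), deliberately out of order.
readings = [(3, "m2", 7.1), (1, "m1", 5.0), (2, "m2", 6.4),
            (2, "m1", 5.5), (1, "m2", 6.0)]
pairs = [map_reading(t, m, r) for (t, m, r) in readings]
timeline = reduce_by_sensor(shuffle_and_sort(pairs))
```

In a real job the grouping by sensor id would also require a partitioner that looks only at m, so that all composite keys for one sensor reach the same reducer.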

Relational Joins
A popular application of Hadoop is data warehousing.
Typically the data is relational in nature, but increasingly data warehouses are used to store semi-structured data (e.g., query logs) as well as unstructured data.
Traditionally, data warehouses were implemented using relational databases. Some vendors provide parallel databases, but these often don't scale cost-effectively to the volumes of data an organization needs to deal with today.
Facebook abandoned its Oracle-based business intelligence applications in favor of Hive, a Hadoop-based solution developed in-house.

Example: there are two datasets (relations), generically named S and T.
S consists of tuples (k1, s1, S1), (k2, s2, S2), and so on, where k is the key we would like to join on, sn is a unique id for the tuple, and Sn denotes the other attributes in the tuple.
T consists of tuples (k1, t1, T1), (k2, t2, T2), and so on. The fields play the same roles as in S: k is the join key, tn is the unique id of the tuple, and Tn denotes the other attributes in the tuple.
S might represent a collection of user profiles, in which case k could be interpreted as the primary key (i.e., user id). The tuples might contain demographic information such as age, gender, income, etc.
T might represent logs of online activity. Joining these two datasets would allow an analyst, for example, to break down online activity in terms of demographics.

Reduce-Side Join
Map over both datasets and emit the join key as the intermediate key and the tuple itself as the intermediate value.
Because MapReduce guarantees that all values with the same key are brought together, all tuples are grouped by the join key.
This is known as a parallel sort-merge join in the database community.

Case I: one-to-one join
At most one tuple from S and one tuple from T share the same join key (but it may be the case that no tuple from S shares the join key with a tuple from T, or vice versa).
The reducer is presented with entries like k23 -> [(s64, S64), (t84, T84)], where one key has two values associated with it.
If only one value is associated with a key, no tuple in the other dataset shares the join key, so the reducer does nothing.
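A minimal sketch of the one-to-one reduce-side join, simulated in plain Python. The slides distinguish S tuples from T tuples by their ids; here each value is instead tagged with its source relation, which is an assumption made for clarity. The function names and sample tuples are hypothetical.

```python
from collections import defaultdict

def map_tuple(source, record):
    # record = (join key, tuple id, other attributes).
    # Emit the join key as the key and the tagged tuple as the value.
    k, tuple_id, attrs = record
    return (k, (source, tuple_id, attrs))

def shuffle(pairs):
    # Stand-in for the framework grouping all values by key.
    groups = defaultdict(list)
    for k, v in pairs:
        groups[k].append(v)
    return groups

def reduce_one_to_one(key, values):
    # Values arrive in arbitrary order, so use the source tag to tell them apart.
    by_source = {src: (tuple_id, attrs) for src, tuple_id, attrs in values}
    if set(by_source) != {"S", "T"}:
        return None  # no partner tuple in the other dataset: do nothing
    return (key, by_source["S"], by_source["T"])

S = [("k23", "s64", "S64"), ("k7", "s1", "S1")]
T = [("k23", "t84", "T84"), ("k9", "t2", "T2")]
pairs = [map_tuple("S", rec) for rec in S] + [map_tuple("T", rec) for rec in T]
joined = [row for k, vals in shuffle(pairs).items()
          if (row := reduce_one_to_one(k, vals)) is not None]
```

Only k23 appears in both relations, so it is the only key that produces a joined row.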

Case II: one-to-many join
Tuples in S have unique join keys, so S is the "one" and T is the "many".
The algorithm above still works, but when processing each key in the reducer we have no idea when the value corresponding to the tuple from S will be encountered, since values are arbitrarily ordered.
The easiest solution is to buffer all values in memory, pick out the tuple from S, and then cross it with every tuple from T. This might create a scalability bottleneck.
A better solution requires a secondary sort. The mapper emits a composite key, for example (k82, s105) -> [S105].
Two additional changes are required:
We must define the sort order of the keys to sort first by the join key, and then to sort all tuple ids from S before all tuple ids from T.
We must define the partitioner to pay attention only to the join key, so that all composite keys with the same join key arrive at the same reducer.

Whenever the reducer encounters a new join key, it is guaranteed that the associated value will be the relevant tuple from S. The reducer can hold this tuple in memory and then proceed to cross it with tuples from T in subsequent steps (until a new join key is encountered).
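The one-to-many join with a secondary sort can be sketched in plain Python as follows. This is an illustration, not a Hadoop implementation: the composite key carries a hypothetical `rank` field (0 for S, 1 for T) to force S tuples ahead of T tuples, `zlib.crc32` is an arbitrary choice of hash for the partitioner, and all names and sample tuples are assumptions.

```python
import zlib

def map_tuple(source, record):
    # Composite key: (join key, rank, tuple id). Rank 0 sorts S before T.
    k, tuple_id, attrs = record
    rank = 0 if source == "S" else 1
    return ((k, rank, tuple_id), attrs)

def partition(composite_key, num_reducers):
    # Look only at the join key, so one join key never spans two reducers.
    return zlib.crc32(composite_key[0].encode()) % num_reducers

def reduce_stream(sorted_pairs):
    # Hold only the single S tuple for the current join key in memory.
    current_key, s_tuple, out = None, None, []
    for (k, rank, tuple_id), attrs in sorted_pairs:
        if k != current_key:
            # New join key: the first value is the S tuple, if one exists.
            current_key = k
            s_tuple = (tuple_id, attrs) if rank == 0 else None
            if rank == 0:
                continue
        if s_tuple is not None:
            out.append((k, s_tuple, (tuple_id, attrs)))
    return out

def run_join(S, T, num_reducers=2):
    buckets = [[] for _ in range(num_reducers)]
    for source, relation in (("S", S), ("T", T)):
        for record in relation:
            key, value = map_tuple(source, record)
            buckets[partition(key, num_reducers)].append((key, value))
    joined = []
    for bucket in buckets:
        # sorted() stands in for the framework's sort on the composite key.
        joined.extend(reduce_stream(sorted(bucket)))
    return joined

S = [("k82", "s105", "S105"), ("k90", "s7", "S7")]
T = [("k82", "t98", "T98"), ("k82", "t101", "T101"), ("k77", "t5", "T5")]
joined = run_join(S, T)
```

T tuples whose join key has no S tuple (k77 here) are skipped, and an S tuple with no matching T tuples (k90) produces no output.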

Case III: many-to-many join
Assuming that S is the smaller dataset, the above algorithm works as well.
Consider what happens at the reducer:
(k82, s105) -> [(S105)]
(k82, s124) -> [(S124)]
...
(k82, t98) -> [(T98)]
(k82, t101) -> [(T101)]
(k82, t137) -> [(T137)]
...
All the tuples from S with the same join key are encountered first, and the reducer can buffer them in memory. As the reducer processes each tuple from T, it is crossed with all the buffered tuples from S.
Of course, we are assuming that the tuples from S (with the same join key) will fit into memory.
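The reducer logic for the many-to-many case can be sketched as below. The input is assumed to be already sorted so that, for each join key, all S tuples precede all T tuples; the `rank` field in the composite key (0 for S, 1 for T) is a hypothetical device for expressing that order, and the sample stream mirrors the sequence shown above.

```python
def reduce_many_to_many(sorted_pairs):
    # Buffer the S tuples of the current join key, then cross each
    # incoming T tuple with everything in the buffer.
    current_key, s_buffer = None, []
    for (k, rank, tuple_id), attrs in sorted_pairs:
        if k != current_key:
            current_key, s_buffer = k, []  # new join key: reset the buffer
        if rank == 0:
            s_buffer.append((tuple_id, attrs))
        else:
            for s in s_buffer:
                yield (k, s, (tuple_id, attrs))

# Hypothetical reducer input, in the order the framework would deliver it.
stream = [
    (("k82", 0, "s105"), "S105"),
    (("k82", 0, "s124"), "S124"),
    (("k82", 1, "t98"), "T98"),
    (("k82", 1, "t101"), "T101"),
    (("k82", 1, "t137"), "T137"),
]
joined = list(reduce_many_to_many(stream))
```

Two S tuples crossed with three T tuples give six joined rows. Memory use is bounded by the number of S tuples per join key, which is why S should be the smaller dataset.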

Map-Side Joins
Suppose we have two datasets that are both sorted by the join key.
We can perform a join by scanning through both datasets simultaneously. This is known as a merge join in the database community.
Suppose S and T were both divided into ten files, partitioned in the same manner by the join key. Further suppose that in each file the tuples were sorted by the join key. In this case, we simply need to merge join the first file of S with the first file of T, and so on.
This can be accomplished in parallel, in the map phase of a MapReduce job; hence, a map-side join.
We can map over one of the datasets (the larger one) and, inside the mapper, read the corresponding part of the other dataset to perform the merge join. No reducer is required.
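A minimal sketch of the merge join that each mapper would perform on one pair of co-partitioned, sorted files. In-memory lists stand in for the files, and the function name and sample data are hypothetical.

```python
def merge_join(s_sorted, t_sorted):
    # Both inputs are lists of (join key, tuple id, attributes), sorted by join key.
    out = []
    i = j = 0
    while i < len(s_sorted) and j < len(t_sorted):
        ks, kt = s_sorted[i][0], t_sorted[j][0]
        if ks < kt:
            i += 1
        elif ks > kt:
            j += 1
        else:
            # Find the run of equal keys on each side, then cross the two runs.
            i_end = i
            while i_end < len(s_sorted) and s_sorted[i_end][0] == ks:
                i_end += 1
            j_end = j
            while j_end < len(t_sorted) and t_sorted[j_end][0] == ks:
                j_end += 1
            for s in s_sorted[i:i_end]:
                for t in t_sorted[j:j_end]:
                    out.append((ks, s[1:], t[1:]))
            i, j = i_end, j_end
    return out

# Two partitions per relation, split by the same key ranges and sorted by key.
s_parts = [[(1, "s1", "S1"), (3, "s2", "S2")], [(10, "s3", "S3")]]
t_parts = [[(1, "t1", "T1"), (1, "t2", "T2"), (2, "t3", "T3")],
           [(10, "t4", "T4"), (11, "t5", "T5")]]

# Each (S file, T file) pair is independent, so a mapper could handle each pair.
joined = [row for s_part, t_part in zip(s_parts, t_parts)
          for row in merge_join(s_part, t_part)]
```

The approach depends on both datasets being partitioned identically and sorted within each partition; if either condition fails, matching tuples may sit in different file pairs and be missed.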

References
Data-Intensive Text Processing with MapReduce.

