Designing a Parallel Query Engine over Map/Reduce


Designing a Parallel Query Engine over Map/Reduce

Chatzistergiou Andreas

Master of Science
School of Informatics
University of Edinburgh
2010

Abstract

Map/Reduce is a parallel programming model introduced by Google Inc., which enables the easy parallelization of tasks while hiding the details and complexity of parallel computation. This report presents the design of a parallel query engine over Map/Reduce. This is achieved in two parts. First, we examine algorithms for performing equi-joins between datasets over Map/Reduce and we provide a comparative analysis. Second, we design a cost model for estimating the performance of each algorithm. This is considered one of the keystones for building an optimizer capable of choosing the appropriate algorithm for each case. Our results indicate that all join algorithms are significantly affected by certain properties of the input datasets (size, selectivity factor, etc.) and that each algorithm performs better under certain circumstances. Our cost model manages to capture these factors and estimates the performance of each algorithm fairly accurately.

Acknowledgements

Well, a tough year (the least I can say) has come to its end. I would like to express my appreciation and gratitude to the following people:

To my supervisor S. Viglas for his guidance and support that defined my work, and also for his effort and remarks (admittedly always with a great sense of humor) during the writing process.

To Evgenia for her dedication and defining support in turning this MSc into a reality.

To my friend D. Kartsaklis for sharing my enthusiasm, thoughts and concerns throughout this year and for his support during the hard times.

Last but not least, I would like to thank my family and friends who supported and encouraged my controversial decision to leave my job and return to academia.

Declaration

I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.

(Chatzistergiou Andreas)

To all the people who made this possible.

Table of Contents

1 Introduction
   1.1 Motivation and scope
   1.2 Contributions
   1.3 Structure
2 Related work
   2.1 Parallel DBMS
      Overview
      Design considerations
      Architecture
      Parallelism
   2.2 Map/Reduce
      Overview
      A closer look
   2.3 Extending Map/Reduce
      Map-Reduce-Merge
      HadoopDB
      Pig Latin
3 Joining datasets over Map/Reduce
   3.1 Problem Definition
   3.2 Reduce-side merge-join
      3.2.1 Performance analysis
      3.2.2 Tuning task granularity
   3.3 Map-side replication-join
      Performance analysis
   3.4 Semi-join
      3.4.1 Performance analysis
      3.4.2 Bloom Filters
   3.5 Join algorithms comparison
   3.6 Joining datasets with outer-join
      Improving semi-join
      Performance analysis
      Minimizing the cost of the final map phase
   3.7 Applying projections and selections
   3.8 JVM initialization
4 Designing a cost model
   4.1 Overview
   4.2 Framework costs
   Computational costs
   The assumptions
   Merge-join cost model
   Replication-join cost model
   Semi-join cost model
   Improved semi-join cost model
5 Evaluating the join algorithms
   Evaluating inner-join operations
   Increasing the size of both tables
   Keeping the selectivity factor constant
   Evaluating outer-join operations
6 Evaluating the cost model
   Parameter calibration
   Merge-join estimation
   Replication-join estimation
   Semi-join estimation
      Bloom filter construction estimation
      Merge-join estimation
   Putting everything together
7 Conclusions and future work
   7.1 Conclusions
   Future work
Bibliography

List of Figures

2.1 MR Overview [10]
2.2 MR execution in detail [7]
3.1 Reduce-side merge-join
3.2 Map-side replication-join
3.3 The bloom filter [13]
3.4 Bloom filter construction
3.5 Union semi-join
3.6 Improved semi-join
5.1 Join between two tables with constant size left table
5.2 Comparing the reduce phase of MJ and SJ algorithms
5.3 Increasing the size of both tables
5.4 Inner-join with high selectivity factor
5.5 Comparison of MJ, RJ and USJ for outer-join operations
5.6 Comparison of MJ, RJ and ISJ for outer-join operations
5.7 Reduce phase comparison between MJ, USJ and ISJ
5.8 Comparison of MJ, RJ and ISJ for outer-join operations without the disk sharing factor
6.1 MJ map phase
6.2 MJ reduce phase
6.3 MJ estimates comparison with real costs
6.4 RJ estimates comparison with real costs
6.5 Bloom filter construction - map phase estimates
6.6 Bloom filter construction - reduce phase estimates
6.7 Bloom filter construction estimates
6.8 SJ job 2 - map phase estimates
6.9 SJ job 2 - reduce phase estimates
6.10 SJ job 2 estimates
6.11 SJ job estimates
6.12 Inner join estimated costs
6.13 Inner join actual costs

List of Tables

4.1 MJ map phase variables
MJ reduce phase variables
Loading time variables
MJ map phase model parameters
MJ reduce phase model parameters
RJ model parameters
SJ job 2 map model parameters
MJ reduce phase model parameters

Chapter 1

Introduction

When you are stuck in a traffic jam with a Porsche, all you do is burn more gas in idle. Scalability is about building wider roads, not about building faster cars.
Steve Swartz

Over the past years, the growth in the amount of data completely overtook the growth in the computational power of uniprocessor systems. The shift towards parallel computing focused on overcoming this computational barrier. In addition, the CPU-I/O gap [11] proved to be a challenging problem as well: processor speeds increased several orders of magnitude faster than those of hard disks, turning I/O into a bottleneck. Early attempts at dealing with the problem were directed towards building specialized database machines, but they did not deliver what they promised.

Parallel database systems [11] emerged as an extension of traditional DBMSs to overcome these limitations. By exploiting the shared-nothing architecture, jobs could be parallelized across a cluster of nodes with minimal interference between the processes and superior scalability potential. The computational power could now be increased by just adding more nodes to the system. Moreover, the I/O bottleneck was significantly mitigated since every node in the cluster used a dedicated hard disk drive.

In 2004, Google Inc. introduced a new paradigm for handling large-scale datasets, termed Map-Reduce [10]. MR is a shared-nothing programming model that gives developers the ability to distribute data and parallelize computation while hiding the details and complexity of parallel programming.

Although it was initially developed for search engine tasks, it is applicable to general-purpose operations. The model relies on distributing jobs over low-cost, unreliable commodity hardware rather than expensive, high-performance hardware.

The success of the MR model attracted the attention of the broader research community. Several papers were published comparing the performance of MR with that of parallel DBMSs. In addition, extensive work has been done on creating hybrid systems combining MR and parallel DBMSs.

1.1 Motivation and scope

Our work is set at the sweet spot between parallel DBMSs and MR. We examine how we can design a new paradigm of a parallel query engine based on MR. Our purpose is to exploit the advantages of both worlds by combining the superior scalability, simplicity and fault tolerance of the recent MR model with ideas from the "mature" parallel/distributed database literature.

We focus on the two most fundamental aspects of a query engine, the operators and the optimizer. For the former, we chose to examine the join operator, as it is one of the most challenging and interesting operators in the database literature. Our implementation is based on Hadoop [2], an open source implementation of MR, and includes a set of different algorithms for performing equi-joins between datasets over MR. For the second part of our work, the optimizer, we design and evaluate a cost model for estimating the performance of each algorithm. A typical query optimizer has two main responsibilities: it enumerates a set of possible plans (search space exploration) and it evaluates each plan. We focus on the evaluation of the plans, which is the most challenging part, since for the search space exploration we can simply adopt one of the approaches currently used in traditional DBMSs.

1.2 Contributions

Overall, the contributions of this paper are the following:

- We implement, compare and experimentally evaluate three of the most popular algorithms for performing inner-joins over MR.

- We discuss the suitability of the implemented algorithms for outer-joins and experimentally evaluate it.

- We propose an algorithm for performing outer-joins over MR and experimentally evaluate it.

- We propose a cost model for estimating the performance of each join algorithm and experimentally evaluate it.

1.3 Structure

The remainder of this paper is structured as follows. In chapter 2 we review the literature. This is followed by chapter 3 and chapter 4, where we present our work. In more detail, in chapter 3 we describe the implementation of the join algorithms and in chapter 4 we present the cost model. Next, in chapter 5 and chapter 6 we present the experimental evaluation of the join algorithms and the cost model respectively. Finally, in chapter 7 we draw our conclusions and propose future work.

Chapter 2

Related work

2.1 Parallel DBMS

Overview

The parallel DBMS approach [11] is based heavily on the already mature and successful field of DBMSs and uses well-tested ideas and algorithms that were developed over the years. The main idea is to apply techniques used in traditional DBMSs in a distributed environment. Early examples of parallel DBMSs include GAMMA [12] and GRACE [14], which are quite representative of the field since most of the recent work is influenced by them. Major concerns in the context of parallel DBMSs include the partitioning of data among the nodes as well as ways to exploit parallelism within database operators.

Design considerations

The main aim of every parallel system is to achieve linear speedup and scaleup. With speedup we measure the rate at which performance increases as we add resources: in a system with linear speedup the execution time decreases in proportion to the resources we add. Scaleup, on the other hand, measures the behavior of a system when we increase both the resources and the problem size: in a system with linear scaleup, the magnitude of the task that can be processed in the same amount of time increases proportionally with the resources.

As a consequence, the main design considerations around a parallel database system focus on dealing with the fundamental factors that affect its scalability:

- Startup costs. The initialization cost of the parallel operation. In operations with a large number of processes, the startup costs can easily overwhelm the execution cost.

- Interference. As we increase the number of processes, the competition for shared resources also increases. Interference describes the additional overhead that every new process imposes.

- Skew. Parallel algorithms usually consist of a series of parallel steps. The elapsed time of the job is equal to the elapsed time of the slowest step. When the time needed by the slowest step deviates significantly from the average time, the benefit from increasing parallelism is substantially reduced.

Architecture

The hardware architecture of a parallel system is one of the most important topics, since it affects the degree of interference among the processes. According to the taxonomy proposed by [21] there are three basic categories: shared-memory, shared-disk and shared-nothing. The latter category was found to be the most appropriate for database systems because it minimizes the interference among the processes and reduces the network requirements.

Parallelism

The types of parallelism found in database systems are mainly two: pipelined parallelism and partitioned parallelism. The former is achieved by pipelining the operators that comprise the plan, and the latter by partitioning the inputs and outputs so that each operator solves a smaller part of the problem. Partitioned parallelism offers far more opportunities for linear scaleup and speedup, since pipelined parallelism is limited by the length of the pipeline and the imbalance in the time needed by each operator. Moreover, some operators cannot be pipelined at all.

The general idea with partitioned parallelism is to take advantage of the implementation of the algorithms used in traditional databases and parallelize them. The first step is to consider the various partitioning strategies. There are three main strategies, each with different trade-offs: round-robin, hash partitioning and range partitioning. These strategies are applied at the relation level, according to the types of queries that are mainly executed over the relation. The execution of the plan begins with each operator receiving as input a partition of the total input. The outputs of all the operators are then aggregated by a merge function into one file. In cases where we want to pipeline many operators, we use a split function to partition the output of an operator and prepare it as input to the next operator.

2.2 Map/Reduce

Overview

The main idea was inspired by functional programming languages and is based on the map and reduce primitives. The map function takes the input data and converts it to key/value pairs, and the reduce function then merges them. A very common example, drawn from the area of information retrieval, is creating inverted lists from a set of documents. The map function takes a given document, tokenizes it into words and produces word/document id pairs. Then, the reduce function gets all pairs and merges them so that every word is linked with a sorted document list.

The execution flow starts by dividing the input data into M sets and distributing them over multiple machines. Then, the program is initialized on a cluster of machines and a master instance of the program assigns map or reduce tasks to worker instances and keeps the status (idle, in progress, completed) and identity of each worker. The map workers, after they process their input, buffer the key/value pairs in memory and eventually store them on local disk. The local disk is partitioned into R regions whose locations are forwarded back to the master, which in turn signals the reduce workers. The reduce workers use remote procedure calls to read the data and then apply the reduce function. The procedure finishes when all map and reduce operations finish, and the output consists of R output files, one for each reduce operation.
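To make the inverted-list example concrete, the following is a minimal sketch of how the two functions might be written against Hadoop's Java MapReduce API. It is our own illustration, not code from the thesis; the class names, the tab-separated "docId<TAB>text" input layout and the comma-separated posting list are assumptions made for the example.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: tokenize a document line and emit (word, documentId) pairs.
public class InvertedIndexMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        // Assume each input line carries the document id as "docId<TAB>text".
        String[] parts = line.toString().split("\t", 2);
        if (parts.length < 2) return;
        String docId = parts[0];
        for (String word : parts[1].split("\\s+")) {
            if (!word.isEmpty()) {
                context.write(new Text(word), new Text(docId));
            }
        }
    }
}

// Reduce: merge all document ids of a word into a single posting list.
class InvertedIndexReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text word, Iterable<Text> docIds, Context context)
            throws IOException, InterruptedException {
        StringBuilder postings = new StringBuilder();
        for (Text docId : docIds) {
            if (postings.length() > 0) postings.append(',');
            postings.append(docId.toString());
        }
        context.write(word, new Text(postings.toString()));
    }
}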

Apart from hiding the distribution and parallelism details from the developer, the model also offers automatic fault tolerance. The master instance is responsible for periodically checking the status of the workers. If a worker is not responding, it resets the status of all completed or in-progress tasks of that worker so they can be rescheduled on other workers. Furthermore, the model incorporates a mechanism for handling slow workers so that they do not significantly affect the running time of the whole procedure. This effect is achieved by scheduling backup executions of the remaining tasks when the map-reduce operation is near completion. Experimental results indicate that if this mechanism is tuned correctly it can significantly improve the running time of the operation. In figure 2.1 we give an overview of MR.

Figure 2.1: MR Overview [10].

The success of Map-Reduce lies mainly in its simplicity, scalability and automated parallelization, while managing to remain abstract. The model successfully hides the details of memory management, threads, file allocation and network programming. The use of low-cost commodity hardware greatly simplifies the parallelization process and makes the architecture extremely scalable and high-performing, since it can easily be deployed to thousands of nodes. Moreover, instead of using centralized storage systems, each node uses its local hard drive, achieving even greater scalability.

A closer look

In later chapters we often provide arguments that require a deeper insight into MR. In this section we go a level deeper and, with the help of a small example, we explain a few key points. An MR job is illustrated in figure 2.2 with three mappers and two reducers. Initially, the DFS (Distributed File System) contains the job input divided into input splits. MR assigns to each mapper an equal input share, trying to exploit data locality as much as possible to reduce network traffic. Each mapper processes the input and outputs the result into a number of partitions defined by the number of reducers. The partitions are created by using a hash function that ensures each partition contains all the keys of a certain range. Afterwards, each partition is sorted and sent to the appropriate reducer.

Figure 2.2: MR execution in detail [7].

Each reducer receives the corresponding partitions and merges them into a unique sorted file. The reduce function processes the input and outputs the result into DFS. After the reduce phase, the result is spread across one file per reducer.
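For reference, the routing step that decides which reducer receives a key is essentially a hash of the key modulo the number of reducers, which is what Hadoop's default HashPartitioner does; a minimal equivalent (our own sketch, not code from the thesis) looks like this:

import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

// Route each key to a reducer by hashing it; all pairs with the same key
// end up in the same partition and hence at the same reducer.
public class SimpleHashPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text key, Text value, int numReducers) {
        // Mask the sign bit so the result is a valid partition index.
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}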

2.3 Extending Map/Reduce

After the success of the MR model, many researchers focused on developing models that share its benefits but deal with its limitations. As a result, a variety of new models were introduced and utilized by organizations such as Google, Yahoo, Microsoft, Amazon, ASK.com, etc. Below we outline the models that are most related to our work.

Map-Reduce-Merge

While the Map-Reduce model proved to be very effective when dealing with homogeneous datasets, according to [19] it was found not to be as effective in manipulating and joining heterogeneous datasets. The Map-Reduce-Merge [22] model tries to solve this inadequacy by incorporating relational algebra principles into the Map-Reduce model without sacrificing its simplicity. The new model tries to achieve that by introducing the concept of data lineages and adding a new primitive, the merge primitive.

To explain how the model processes heterogeneous datasets, let us assume we have two lineages α and β, where k stands for keys and v for values. The map function processes each lineage producing intermediate key/value pairs: for a key/value pair (k1, v1) of lineage α it produces a list [(k2, v2)] of lineage α, and for (k4, v4) of lineage β a list [(k5, v5)] of lineage β. Then the reduce function aggregates each result, producing a new value list for each lineage, [(k2, v3)] and [(k5, v6)] respectively. Finally, the merge function joins the two newly formed lineages into a third lineage γ, [(k6, v7)].

The execution environment was implemented by inheriting the map and reduce functions from the classical Map-Reduce framework and by adding four new functions (merge, processor, partition selector, configurable iterator) in order to implement the merge primitive. The collaboration of these functions enables heterogeneous dataset merging in a highly customizable and flexible way.

HadoopDB

HadoopDB [4] is a hybrid system that tries to combine the advantages of both the parallel DBMS and the MR worlds. It uses MR as the communication layer between nodes, where each node is a separate DBMS. The execution begins with SQL queries that are translated into MR jobs, which in turn forward the majority of the work to single-node DBMSs. The main idea is for the DBMSs to handle the computation, in order to take advantage of their superior performance, and for MR to handle communication, in order to benefit from its scalability and fault tolerance. The implementation is based on Hadoop [2], which is an open source implementation of MR, and utilizes PostgreSQL for the database layer. The main contribution of HadoopDB is comprised of the following components.

- Database Connector. It acts as the intermediary between Hadoop and the DBMS that resides on each node.

- Catalog. It includes the required metadata for handling the connection with the DBMS.

- Data Loader. It reads data from the Hadoop file system (HDFS), repartitions it and then loads it into the corresponding DBMS of each node.

- SQL to MR to SQL (SMS) Planner. It is a mechanism for creating the query plan from SQL queries, converting it to MR jobs and then pushing the query processing logic to the DBMSs.

Pig Latin

Pig [18] is a processing environment developed by Yahoo which, along with its associated language Pig Latin, tries to fill the gap between low-level Map/Reduce and declarative SQL. From a SQL point of view, we can consider the Map-Reduce model to be a SQL GROUP BY clause, where the map function defines the grouping while reduce performs the aggregation. In that way, Pig Latin extends Map-Reduce by defining additional SQL-like clauses which are ultimately translated into map-reduce jobs. Pig is also implemented using Hadoop [2].

In Pig Latin we can define a series of steps where each step represents a single high-level data manipulation. This differs from the traditional SQL approach but enforces a clearer and more concise implementation. The supported primitives are very carefully chosen so that they can be easily parallelized; the most important among them are LOAD, FILTER, FOREACH, COGROUP, JOIN, STORE, UNION, CROSS, ORDER, DISTINCT. Moreover, the framework provides even more flexibility by supporting user defined functions (UDFs) so that developers can add custom data processing. Finally, unlike traditional database systems, Pig Latin incorporates a nested data model in order to avoid flat-table limitations. This adds greater flexibility and comes more naturally to the way programmers think and the way data is usually stored in data files. The basic supported types are atom, tuple, bag and map.

Chapter 3

Joining datasets over Map/Reduce

The join operation is one of the most extensively studied areas in the database literature. With the emergence of parallel/distributed databases, a number of join algorithms were redesigned to exploit parallelism [6], [11]. Since Map-Reduce is a parallel programming model, it can greatly benefit from this research. The Hadoop community studied and ported the best-suited algorithms. Although many different versions exist [1], [3], [18], [7], [16], there are three basic algorithm classes. In this chapter we focus our discussion on our implementation of each class. In more detail, our contribution in this chapter is the following:

- The performance analysis, comparison and discussion of the presented inner-join algorithms.

- Various improvements of the inner-join algorithms, as described in section 3.2.2 and the sections that follow it.

- The discussion of the suitability of the presented join algorithms for performing outer-join operations.

- The implementation, discussion and evaluation of a novel outer-join algorithm.

- The discussion of applying selections and projections to the join algorithms.

- The implications of the JVM initialization, described in section 3.8.

3.1 Problem Definition

To test our hypothesis we primarily focus on the most commonly used join type, the inner-join, but we also consider outer-join algorithms. For the sake of simplicity we assume there are no projections and selections; as a result, we project all columns from both tables and we do not apply any selection to the data. Nevertheless, we discuss the implications of relaxing these assumptions and how we can achieve it.

3.2 Reduce-side merge-join

This is the most straightforward way to join two datasets over the Hadoop framework. It can be considered the Hadoop version of the parallel sort-merge join algorithm [11]. The main idea is to sort the input splits on the join column, forward them to the appropriate reducer and then merge them during the reduce phase.

In more detail, the execution begins with each node reading an input split. The map function is then called for every tuple of the input split (which can contain records from either dataset) and it performs two main operations. First, it extracts the value of the join column from the input row and submits a key-value pair where the key is the join column and the value is the whole row. Second, it tags the pair with the source table. The tag value should take the minimum required space in order to restrain the size increase of the output.

Afterwards, the framework sorts the output of the map function according to the submitted key and forwards it to the appropriate reducer. This results in all the rows with the same key, from all mappers and both tables, being forwarded to the same reducer. The reducer reads the input, separates the records of each table and loads them in memory in two distinct lists. Then, it merges the lists and writes the result to the output.

The choice of the merge algorithm depends on the granularity of the input. For example, if the reduce function is called for each key value, we just need to combine the tuples of both tables. In a more coarse-grained approach, where the reduce function runs for a block of values, we need to use the merge-sort algorithm [15]. Furthermore, if the join attribute is not a key, we need to take into account that the values are not unique; this requires computing the cross product of the groups with the same value.

In our implementation, in order to decouple the reduce function from the granularity strategy, we use the merge-sort algorithm from the database literature, which works in all cases. In figure 3.1 we illustrate the merge-join algorithm.

Figure 3.1: Reduce-side merge-join.

Performance analysis

The performance of the algorithm is dominated by two main factors. The first is the communication overhead required to shuffle the datasets through the network from mappers to reducers. The second is the time required to sort and write the datasets to disk before forwarding them to the reducers. In general, these are standard costs involved in every Map/Reduce job; however, a typical map function would filter the input, reducing its size and thus reducing the performance overhead. The problem with the reduce-side merge-join is that the map function does not apply any filter, so the output remains the same size as the input.

Another potential drawback is that the reducer loads in memory all the tuples of each split. In the case of highly skewed values, the split size could exceed the available memory of the node. We would then be forced to break the split into blocks that fit in memory and perform a nested-loops join. Nevertheless, this would dramatically affect performance.
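Before moving on, here is a stripped-down sketch of the tagging mapper and merging reducer described above. It is our own simplification, not the thesis implementation: it assumes single-character table tags, tab-separated rows, the join column in the first field, and that the source table can be detected from the file name of the input split.

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: extract the join column and tag the row with its source table ("L" or "R").
public class MergeJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text row, Context context)
            throws IOException, InterruptedException {
        String[] cols = row.toString().split("\t");
        String joinKey = cols[0];                       // assume the join column is the first one
        String tag = isLeftTable(context) ? "L" : "R";  // a one-character tag keeps the output small
        context.write(new Text(joinKey), new Text(tag + "\t" + row));
    }

    private boolean isLeftTable(Context context) {
        // Decide the source table from the file name of this input split (an assumption).
        return context.getInputSplit().toString().contains("left");
    }
}

// Reduce: separate the rows of each table and output their cross product per key.
class MergeJoinReducer extends Reducer<Text, Text, Text, Text> {
    @Override
    protected void reduce(Text joinKey, Iterable<Text> taggedRows, Context context)
            throws IOException, InterruptedException {
        List<String> left = new ArrayList<>(), right = new ArrayList<>();
        for (Text t : taggedRows) {
            String v = t.toString();
            (v.startsWith("L") ? left : right).add(v.substring(2));   // strip the tag and separator
        }
        for (String l : left) {
            for (String r : right) {
                context.write(joinKey, new Text(l + "\t" + r));
            }
        }
    }
}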

Tuning task granularity

Hadoop calls the reduce function for each unique submitted key. In the common case where the join attribute is the key of the dataset, the reducer processes only one record per call. This imposes significant communication overhead. To mitigate the problem we control the granularity of the reduce tasks: we replace the key-level granularity with a coarser group-level granularity. This is achieved by using a compound key¹ for each pair. The compound key is comprised of two keys, the group key and the record key. The group key controls the destination reducer and how often the reducer is called; the record key is used for sorting the pairs. All pairs are split equally among the reducers, and all pairs that belong to the same reducer share the same group key.

¹The sub-keys that comprise the compound key are not concatenated into a single value but are stored and loaded as two separate entities.

We built our own customized partitioner to control exactly how the pairs are split. Assuming a uniform hash distribution, the group key can be created as the key hash modulo the number of reducers. Furthermore, using a custom RawComparator we ensure that the reducer is called once per group key. A similar effect could be achieved by just modifying the hash function that controls the destination reducer; with our approach, however, we can also control how often the reduce function is called. This is achieved by adjusting the reduce function to be called only once for every group key, whereas by default it is called for every unique key. In future versions, we should also take into account the available memory and split a key group into smaller groups if it does not fit in memory. The two pieces involved are sketched below.
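The sketch assumes, purely for illustration, that the compound key is serialized as a Text of the form "groupKey:recordKey"; the thesis implementation uses a two-field key and a custom RawComparator instead, so this is a simplification rather than the actual code.

import org.apache.hadoop.io.Text;
import org.apache.hadoop.io.WritableComparable;
import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

// Partition only on the group key, so every pair of the same group
// reaches the same reducer regardless of its record key.
public class GroupKeyPartitioner extends Partitioner<Text, Text> {
    @Override
    public int getPartition(Text compoundKey, Text value, int numReducers) {
        String groupKey = compoundKey.toString().split(":", 2)[0];
        return (groupKey.hashCode() & Integer.MAX_VALUE) % numReducers;
    }
}

// Group on the group key alone, so reduce() is called once per group
// while the framework still sorts on the full compound key.
class GroupKeyComparator extends WritableComparator {
    protected GroupKeyComparator() {
        super(Text.class, true);
    }

    @Override
    public int compare(WritableComparable a, WritableComparable b) {
        String ga = a.toString().split(":", 2)[0];
        String gb = b.toString().split(":", 2)[0];
        return ga.compareTo(gb);
    }
}

The driver would register these with job.setPartitionerClass(GroupKeyPartitioner.class) and job.setGroupingComparatorClass(GroupKeyComparator.class).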

3.3 Map-side replication-join

The map-side replication-join tries to address the drawbacks of the previous approach. The concept was initially conceived in the database literature [5]. The idea comes from the observation of a relatively common case where a small table is joined with a large table. This case is common because, when we normalize a database schema, we usually end up with a few central tables surrounded by many small supporting tables; an ordinary example is a customers table supported by tables for cities, educational degrees, etc. Replication-join takes advantage of this case: if the small table fits in memory, it distributes it to all nodes, loads it in memory and performs the join directly in the map function.

The implementation is much simpler compared to the previous algorithm. We start by replicating the small table to all nodes using the distributed cache facility. Then, during the setup² of the mapper, we load the table into a hash table. Under each key of the hash table we nest an array list for storing multiple rows with the same join attribute. Hence, for each row of the bigger table we search over only the unique keys of the small table; in the case where we have many rows per join attribute, this results in a substantial performance gain. The hash table provides constant-time search for a key value.

²The setup() function is called by the framework once, during the initialization of the mapper. We override it to customize the initialization.

During the execution of the mapper, for each key-value pair of the input split we extract the join attribute and probe the hash table. If the value exists, we combine the tuples of the matching keys and submit the new tuple. The algorithm is illustrated in figure 3.2.

Figure 3.2: Map-side replication-join.
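A sketch of the setup and map steps could look as follows. It is our own illustration and assumes the replicated table is available as a local tab-separated file whose path is passed through a made-up configuration property; the thesis implementation obtains the file through the distributed cache instead.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ReplicationJoinMapper extends Mapper<LongWritable, Text, Text, Text> {

    // join key -> all small-table rows sharing that key
    private final Map<String, List<String>> smallTable = new HashMap<>();

    @Override
    protected void setup(Context context) throws IOException {
        // "small.table.path" is an illustrative property; in practice the file
        // would be fetched through the distributed cache.
        String path = context.getConfiguration().get("small.table.path");
        try (BufferedReader in = new BufferedReader(new FileReader(path))) {
            String row;
            while ((row = in.readLine()) != null) {
                String joinKey = row.split("\t", 2)[0];
                smallTable.computeIfAbsent(joinKey, k -> new ArrayList<>()).add(row);
            }
        }
    }

    @Override
    protected void map(LongWritable offset, Text bigRow, Context context)
            throws IOException, InterruptedException {
        String joinKey = bigRow.toString().split("\t", 2)[0];
        List<String> matches = smallTable.get(joinKey);   // constant-time probe
        if (matches == null) return;                      // no match: an inner join drops the row
        for (String smallRow : matches) {
            context.write(new Text(joinKey), new Text(bigRow + "\t" + smallRow));
        }
    }
}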

Performance analysis

This algorithm manages to overcome the drawbacks of the reduce-side merge-join at the cost of the initial distribution. Nevertheless, the initial cost is amortized by omitting the reduce phase. The performance gains are significant because we need neither to sort the output of the map function nor to shuffle data through the network.

Despite its advantages, the algorithm is restricted by the memory size of the nodes. If the small table does not fit in memory, we cannot use the algorithm at all. A potential workaround of splitting the small table into chunks that fit in memory would introduce prohibitive I/O overhead, since for every pair of the large table we would have to load and search more than one chunk. In general, there is a trade-off between performance and how often we can use the algorithm. As the size of the available memory increases, the algorithm is less restricted, yet the distribution overhead increases, as does the time we need to search for a row within the hash table. In the case where the nodes have large memory and the two datasets are nearly equal in size, the performance gain from omitting the reduce stage may not be able to compensate for the distribution cost.

Finally, another notable characteristic is that it produces non-sorted output. In more complicated queries, where the optimizer has to build a tree of operators, it should take into account whether the output of each operator is sorted or not. Even when replication-join could perform better for a given input, the optimizer could instead choose the merge-join algorithm, and the performance loss would be compensated by the next operator.

3.4 Semi-join

Like merge-join, this is also a reduce-side join, with the main difference that it is preceded by map-side filtering. Again, the idea was first developed in the distributed database literature [6] and tries to increase performance by reducing the data transfer overhead between the nodes.

As discussed for the merge-join algorithm, the most significant bottleneck is the cost of sorting and shuffling the data through the network from mappers to reducers.

The semi-join algorithm filters the input records in the map function so that only records whose join attribute value belongs to both datasets are emitted. As a result, the costs related to the map output are substantially reduced. This effect is achieved by chaining two different Map/Reduce jobs: the first job creates the list of the common keys and the second one performs the join.

In more detail, the mapper of the first Map/Reduce job extracts the join attribute from the record and emits a key-value pair where the key is the join attribute of the record and the value is the source table. A combiner³ then eliminates the duplicate keys so that the output from mapper to reducer is reduced. The reducer receives the distinct keys from both datasets and merges them with the merge-join algorithm. It is also possible to eliminate the duplicate keys directly in the mapper, without calling a combiner and with less computational work; in this case, however, we have to assume that the key list fits in memory. By using the combiner we do not have to worry about the memory size, since the data is sent in batches.

³The combiner() function is called by the framework right after the map function to aggregate the output and reduce the communication cost. The call takes place while the key-value pairs are still in memory, before they are written to disk.

The first Map/Reduce job is followed by a distribution of the list to all nodes, so that each mapper has local access to the whole list. The mapper of the second job loads the list into a hash table and, for every incoming pair, checks the list and submits only the pairs whose key appears in it. After this point the join is handled by the merge-join algorithm.

Theoretically, the semi-join algorithm can be combined with either of the previous two algorithms. For example, after the creation of the key list we can estimate the size of each dataset after the filtering and, if one of the two datasets fits in memory, we could use the replication-join algorithm instead of the merge-join. This provides greater flexibility to a good optimizer, which can combine different algorithms in each case to achieve better performance for a given input. Our implementation and evaluation are based on the combination of the semi-join with the merge-join algorithm.
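To make the first job concrete, a stripped-down sketch of its mapper and reducer is given below. It is our own simplification: source tags are detected from the file name of the split, the join column is assumed to be the first tab-separated field, and the thesis additionally plugs in a combiner to eliminate duplicate pairs before the shuffle.

import java.io.IOException;
import java.util.HashSet;
import java.util.Set;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

// Map: emit (joinKey, sourceTag) for every row of either table.
public class CommonKeyMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text row, Context context)
            throws IOException, InterruptedException {
        String joinKey = row.toString().split("\t", 2)[0];
        String tag = context.getInputSplit().toString().contains("left") ? "L" : "R";
        context.write(new Text(joinKey), new Text(tag));
    }
}

// Reduce: keep a key only if it was seen in both tables.
class CommonKeyReducer extends Reducer<Text, Text, Text, NullWritable> {
    @Override
    protected void reduce(Text joinKey, Iterable<Text> tags, Context context)
            throws IOException, InterruptedException {
        Set<String> sources = new HashSet<>();
        for (Text tag : tags) {
            sources.add(tag.toString());
        }
        if (sources.size() == 2) {                 // the key appears in both datasets
            context.write(joinKey, NullWritable.get());
        }
    }
}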

Performance analysis

As discussed above, the main advantage of this algorithm is that it reduces the sorting and shuffling costs between the mappers and reducers. However, its performance depends on several input properties and it can underperform under certain circumstances.

Probably the most essential input property is the number of common keys between the two datasets. As the number of common keys decreases, the number of filtered records increases and thus the performance gain also increases. On the contrary, in the extreme case where the majority of keys are common to the two datasets, the algorithm underperforms compared to merge-join. This is quite reasonable, because the extra MR job does not pay off: during the second MR job the majority of records are still sorted and shuffled through the network.

The number of columns of the datasets can also be an important factor. In case the datasets have very few columns, the overhead of the first MR job, when we send the join attribute column to the reducers, can be comparable to the overhead we try to avoid by filtering the records in the second MR job. However, with bloom filters we substantially mitigate this problem.

Bloom Filters

Definition

The Bloom filter was invented by Burton Howard Bloom [8] and is a compact data structure used for testing the membership of an element in a set. It provides simple and space-efficient data compression at the cost of a small false positive rate. The data is stored in a bit array of size m by a set of k hash functions with a uniform distribution. The trade-off between space and accuracy is controlled by tuning the size of the bit array and the number of hash functions used.

All bits of the bit array are initialized to zero. To add an element, we feed it to all hash functions, each one producing a position in the bit array that has to be set to 1. If the specified position is already equal to one, we leave it as it is. When a bit is set to one it cannot become zero again, and thus elements cannot be removed from the array. Testing the membership of an element is achieved in the same way: we apply the hash functions to the element and test whether the resulting positions are all equal to one.

With a set of equations [9] we can calculate the required size m of the bit array and the number k of hash functions for a given number of elements n and an allowable false positive rate p. The optimal number k of hash functions is defined as

k = (m / n) ln 2        (3.1)

and the required size as

m = -(n ln p) / (ln 2)^2        (3.2)

For a 1% false positive rate and the optimal number of hash functions, we need approximately 9.6 bits per element. In our implementation we use 10 bits per element; thus, for 1,000,000 unique keys we need 1.19 MB of memory. The size of the bloom filter increases linearly with the number of elements.
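As a quick sanity check of these numbers (our own arithmetic, not part of the original text), plugging p = 0.01 into the formulas gives

m / n = -ln(0.01) / (ln 2)^2 ≈ 4.61 / 0.48 ≈ 9.6 bits per element
k = (m / n) ln 2 ≈ 9.6 × 0.69 ≈ 7 hash functions

and with the 10 bits per element used in the implementation, 1,000,000 keys occupy 10^7 bits ≈ 1.19 MB, matching the figures quoted above.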

Figure 3.3: The bloom filter [13].

In figure 3.3 we illustrate an example of a bloom filter with m = 18, k = 3 and a set of member elements {x, y, z}. The arrows show how each element is sprayed across the bit array by the hash functions. In addition, there is an element w that is not part of the set, since it hashes to a position in the array that equals zero.

Incorporating the Bloom filter into the semi-join algorithm

With a bloom filter we can improve the semi-join algorithm by reducing the size of the common key list. The Map/Reduce job used to create the key list in the previous approach is replaced by a Map/Reduce job that creates the bloom filter. We use the implementation of the bloom filter data structure provided by Hadoop. There has been an extensive amount of work in the distributed database literature [17] on improving join efficiency by exploiting the bloom filter, and the Hadoop community applied many of these techniques to the MR model. In our implementation we adopt the idea from the Hadoop literature [16] and extend it by creating a bloom filter for both tables. This allows us to filter the records of both tables and increase the performance gain compared to the initial approach, which creates a bloom filter for just one table.

The main idea is to create two separate bloom filters, one for each dataset, and then compute their intersection. This is achieved by creating a local bloom filter from the input split in each mapper and then emitting a key/value pair where the key is the source table and the value is the bloom filter of the split. All the bloom filters are then forwarded to a single reducer, sorted by the source table, where the aggregation takes place. The reducer first aggregates the bloom filters of each table by performing an OR operation and then performs an AND operation between the two resulting filters to keep only the intersection of the keys. The final bloom filter is then distributed to and loaded by every mapper, and we proceed as in the previous approach. To be able to perform operations between the bloom filters, we have to initialize them with the same size and the same number of hash functions; therefore, when computing the required size and number of hash functions, we use the total number of unique keys of both datasets. In figure 3.4 we illustrate the bloom filter construction process.

Figure 3.4: Bloom filter construction.

An interesting question is how the false positives affect the algorithm. The final bloom filter may contain keys that do not belong to the intersection of the datasets. This does not pose a problem, since the reduce function skips keys that are not matched. The only side effect is a small increase in the size of the bloom filter, but its overall size is still much smaller than the size of the original list.

Finally, like the original version of the semi-join, the drawback of this approach is that the reducer should have enough memory to keep both bloom filters. Nevertheless, it requires less memory than keeping two key lists as in the previous approach.
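A sketch of the aggregation step in the single reducer, written against Hadoop's built-in bloom filter class, is shown below. The vector size, hash count and "L"/"R" table tags are placeholders chosen for the example (in practice they would come from equations (3.1) and (3.2) and from the job configuration), and the output would go to a format able to store Writable values, such as a sequence file.

import java.io.IOException;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.hash.Hash;

// Single reducer: OR the per-split filters of each table, then AND the two
// per-table filters to keep only the intersection of the join keys.
public class FilterIntersectionReducer
        extends Reducer<Text, BloomFilter, NullWritable, BloomFilter> {

    private static final int VECTOR_SIZE = 10_000_000;  // placeholder: 10 bits per expected key
    private static final int NUM_HASH = 7;              // placeholder: from equation (3.1)

    // Every mapper must build its local filter with exactly the same parameters.
    private final BloomFilter leftFilter = new BloomFilter(VECTOR_SIZE, NUM_HASH, Hash.MURMUR_HASH);
    private final BloomFilter rightFilter = new BloomFilter(VECTOR_SIZE, NUM_HASH, Hash.MURMUR_HASH);

    @Override
    protected void reduce(Text sourceTable, Iterable<BloomFilter> splitFilters, Context context) {
        BloomFilter target = sourceTable.toString().equals("L") ? leftFilter : rightFilter;
        for (BloomFilter f : splitFilters) {
            target.or(f);                                // union of the table's split filters
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        leftFilter.and(rightFilter);                     // keep only keys present in both tables
        context.write(NullWritable.get(), leftFilter);
    }
}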

Final thoughts

It may seem that we create a bottleneck by using a single reducer, but this also has some advantages in our case. First, Hadoop assigns the unused slots to additional mappers, and thus the reducer starts collecting results much faster. Furthermore, the job of the reducer is relatively quick, because its input is many times smaller than the initial input.

However, in a cluster with thousands of nodes a single reducer will have a greater impact. It is important, though, that all bloom filters are aggregated into a single bloom filter for the later stages. In such cases, it may be appropriate to chain multiple reduce jobs where each subsequent job uses half the number of reducers of its predecessor. This allows us to create a tree-shaped (turned upside down) aggregation of depth lg n + 1 (n is the number of reducers used at the first level), where each level is a different job and the number of branches on that level is the number of reducers. In that way, the last job consists of a single reducer, the preceding job consists of two reducers, and so on. Nevertheless, we leave that for future study.

3.5 Join algorithms comparison

We have already discussed the characteristics of each algorithm and analyzed their performance under different input sizes. In this section, we put all the pieces together and make a clearer comparison of the different approaches by outlining the basic points.

To begin with, replication-join is the simplest join algorithm, with the most straightforward trade-offs. When one table is small enough to fit in memory, it should perform significantly better than the other two algorithms. Nevertheless, as the size of the small table increases, the distribution and loading costs overwhelm the computational costs. Consequently, having an adequate amount of available memory does not directly imply that replication-join will perform better.

On the other hand, the semi-join is the most complicated join algorithm and consists of two separate MR jobs. The performance gain comes from the reduced shuffling costs, at the cost of an extra MR job. In general its performance is affected by the selectivity factor of the two tables: as the selectivity factor increases, the performance gain decreases.

Finally, the reduce-side merge-join is a straightforward adaptation of the merge-join algorithm used in traditional DBMSs. It fits more naturally into MR than the other algorithms. Probably its most notable property is that most of the input properties (excluding the total input size) have minimal impact on its performance. As a result, although in many cases it is not the fastest solution, we know that in no case can it be a very bad choice.

3.6 Joining datasets with outer-join

While the Map/Reduce literature is quite rich regarding inner-join algorithms, there is not much about outer-join algorithms. In this section, we consider the suitability of the algorithms discussed so far for performing an outer-join between two datasets, and we end up proposing a new algorithm.

Replication-join seems to be the most suitable and efficient choice for performing an outer join. If the inner table fits in memory, we replicate it across all nodes and perform the join directly during the map phase. The mapper loads the inner dataset into a hash table and probes it for every row of the outer dataset. If the join attribute is found in the inner table, it emits the concatenation of both tuples; otherwise it emits only the tuple of the outer dataset.

But what can we do if the inner dataset does not fit in memory?

Both merge-join and semi-join can be used with a small modification: during the merge phase they should submit not only the joined records but also the records of the outer dataset that have no match.

The merge-join algorithm can be used for outer-joins with the same advantages and disadvantages. Again, its main drawback is that it requires sorting and transferring both datasets through the network. On the contrary, the semi-join algorithm seems to lose much of its advantage when performing an outer join. The creation of the common key list proceeds in the same way, but during the filtering at the map phase it can filter only the records of the inner dataset, because the outer dataset has to appear in its entirety in the results. Hence, the main advantage of the algorithm is substantially reduced. In the common case where the outer table is substantially larger than the inner table, the performance of the algorithm is reduced even more.

Improving semi-join

The proposed algorithm comes from the observation of the drawbacks of the semi-join algorithm when used for outer-joins. The main idea is to avoid transferring the whole outer table through the network. This effect can be achieved by filtering the inner and outer tables with the common keys of both tables and merging only the filtered records. Then, an additional map job performs a union⁴ between the joined results and the initial outer table. From now on we refer to this algorithm as union semi-join, a name which derives from the addition of the last mapper.

⁴The union is the set of the distinct elements of the two sets. Hence, the duplicates between the joined results and the outer table are eliminated.

The algorithm is comprised of three MR jobs. The first MR job creates a bloom filter with the intersection of the unique keys of both tables and distributes it to all nodes. The second job filters both tables during the map phase and merges the records during the reduce phase. Finally, we employ a third job with only a map phase, which takes as input the output of the previous job along with the initial outer table. This last job selects and outputs the outer-table tuples that do not appear in the joined results; in a sense, it performs a union between the output of the previous jobs and the remaining outer-table tuples.

The last mapper receives pairs either from the joined output or from the outer table. For each input pair it checks the source: if it comes from the joined results it emits it directly; otherwise it checks the bloom filter for membership.

If the record appears in the filter, we skip it, since it was already included in the joined results; otherwise we emit it. In figure 3.5 we illustrate the algorithm.

Figure 3.5: Union semi-join.

Unlike in the semi-join, the false positives of the bloom filter can affect the validity of the results of this algorithm. The final mapper can only identify whether a record was included in the joined results by checking the bloom filter that contains the key intersection. If a key appears to belong to the joined results, it is skipped; hence, all false-positive keys will be missing from the final output. There are two possible solutions to the problem. The first is to use a common hash table instead of a bloom filter, with a loss in performance.

A second, wiser solution is to modify the reducer that performs the merge: when it finds a record that cannot be merged, it means it is a false positive, and instead of skipping it the reducer should emit it as well. Therefore, the records that are lost by the final mapper are already included in the joined results by the reducer that performs the merge.

Performance analysis

Like the semi-join, the main advantage of this algorithm comes from the filtering during the map phase, with the extra cost of the bloom filter construction and the final map phase. In the common case of a join between a large outer table and a small inner table, it should perform better than the merge-join and semi-join algorithms. Nevertheless, when the inner table fits in memory, the replication-join seems to be the best choice. Normally, as the size of the outer table increases, the number of filtered records also increases and thus the gain from using this algorithm increases as well. In cases where the majority of the outer-table records are joined with the records of the inner table, the algorithm underperforms.

Minimizing the cost of the final map phase

An additional improvement comes from the realization that we can omit the last map phase by writing the outer-table tuples that do not participate in the join directly to disk during the map phase of the merge-join job. As we discussed earlier, the mapper of the merge-join job gets the key of every input pair and probes the bloom filter: if the key exists, the pair is emitted, otherwise it is discarded. The idea is, instead of discarding that pair, to output it directly to disk. This is achieved by bypassing the framework and writing these tuples directly to DFS using the DFS API. This allows us to output the entire outer table while only shuffling the tuples that join with the inner-table tuples. As a result, the last mapper can be completely omitted. The final output is comprised of the files we created during the map phase and the output files of the reduce phase. The improved algorithm is illustrated in figure 3.6.

This can be implemented in two different ways. The most straightforward is to write the tuple to disk during the execution of the map function, right after the hash table lookup. However, if we have enough memory (the split size in the worst case), we can keep those tuples in memory and write them to disk all at once when the map task is finished⁵.

Figure 3.6: Improved semi-join.

⁵During the close() function of the mapper.

This solution decouples the overhead of writing the outer-table tuples from emitting the pairs. As a result, the reducers can have the output of the map task available before the mapper starts writing the outer table.

Overall, the main advantage of the algorithm compared to union semi-join is that it minimizes I/O. The last mapper of the previous algorithm had to read and write the entire input; this approach completely omits that read phase and places the write phase in the mapper of the merge-join job. Compared to merge-join, it adds the additional overhead of constructing the bloom filter, but it minimizes the shuffling costs as well as the number of I/Os. The merge-join algorithm reads the input, shuffles it through the network, reads it again during the reduce phase and writes the output to disk. In this approach we read the input, write the outer-table tuples that are not joined, and we only shuffle, read and write the tuples that belong to the intersection of the tables.
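A sketch of this direct-write variant of the outer-table mapper is given below. The property name, output path and file layout are illustrative assumptions, error handling is omitted, and the shuffled values are untagged for brevity; it is not the thesis code.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.util.bloom.BloomFilter;
import org.apache.hadoop.util.bloom.Key;

public class ImprovedSemiJoinOuterMapper extends Mapper<LongWritable, Text, Text, Text> {

    private BloomFilter intersection;     // the key-intersection filter built by the first job
    private FSDataOutputStream sideFile;  // direct DFS output for non-matching outer tuples

    @Override
    protected void setup(Context context) throws IOException {
        FileSystem fs = FileSystem.get(context.getConfiguration());
        // "bloom.filter.path" is an illustrative property set by the driver.
        intersection = new BloomFilter();  // parameters are restored by readFields()
        try (FSDataInputStream in =
                fs.open(new Path(context.getConfiguration().get("bloom.filter.path")))) {
            intersection.readFields(in);
        }
        // One side file per map task; the task attempt id keeps the names unique.
        sideFile = fs.create(new Path("/join/unmatched/" + context.getTaskAttemptID()));
    }

    @Override
    protected void map(LongWritable offset, Text outerRow, Context context)
            throws IOException, InterruptedException {
        String joinKey = outerRow.toString().split("\t", 2)[0];
        if (intersection.membershipTest(new Key(joinKey.getBytes(StandardCharsets.UTF_8)))) {
            context.write(new Text(joinKey), outerRow);           // shuffled and merged in the reducer
        } else {
            sideFile.write(outerRow.getBytes(), 0, outerRow.getLength());  // bypass the framework
            sideFile.write('\n');
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException {
        sideFile.close();
    }
}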

3.7 Applying projections and selections

In traditional databases, a query plan is constructed as a tree of operators [15]. Each operator applies some computation to the data and propagates the results to the next operator. A good query optimizer often pushes selections and projections down to the lower operators. This is very reasonable, since operators higher in the tree then receive smaller inputs from the lower operators and the total computation needed is reduced.

Following this approach in MR, it is straightforward to apply projections and selections at the beginning of the execution, during the map stage. The map function will have the added responsibility of filtering the input by applying local predicates (selections) and projections. This results in a significant performance gain, since less data has to be sorted, shuffled through the network and aggregated in the reduce function. This rule should apply independently of the join algorithm.

Selections and projections can also significantly affect the choice of the join algorithm. For example, in the case where we have a join between a small and a big table but the small one does not fit in memory, the optimizer would be forced to exclude the map-side replication-join. However, if we applied a selection to the small table, the resulting dataset could fit in memory. The optimizer should be able to estimate the output size after the selection to decide whether it is still able to use the map-side replication-join. In another example, projections could affect the choice between merge-join and semi-join. While for the initial input semi-join might seem the ideal choice, after the application of projections the output size could be too small to compensate for the extra Map/Reduce job used by the semi-join algorithm; hence, merge-join could be considered the better choice.

An interesting improvement is also possible when we want to join a small dataset with a bigger one but only need to project columns from the larger dataset. As we discussed, replication-join works very well when joining a small with a large dataset. In this case, however, we can additionally take advantage of the fact that the small dataset is used only for filtering and apply a bloom filter to its join attribute. As a result, we significantly reduce the distribution cost and at the same time are able to apply replication-join to bigger tables.

Overall, the choice of the join algorithm is highly dependent on the properties of the input. Since selections and projections alter the input, they can also significantly affect the choice of the join algorithm.
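As a small illustration of pushing a selection and a projection into the map function (the column positions and the predicate below are made up for the example):

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Apply "WHERE col2 > 100" and project (col0, col3) before anything is
// sorted or shuffled; only the surviving, narrower rows reach the join.
public class SelectProjectMapper extends Mapper<LongWritable, Text, Text, Text> {
    @Override
    protected void map(LongWritable offset, Text row, Context context)
            throws IOException, InterruptedException {
        String[] cols = row.toString().split("\t");
        if (Integer.parseInt(cols[2]) <= 100) {
            return;                                  // selection: drop non-qualifying rows early
        }
        context.write(new Text(cols[0]), new Text(cols[3]));  // projection: keep only two columns
    }
}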

3.8 JVM initialization

The performance of the join algorithms is significantly affected by the way Hadoop runs the tasks on each node, because it determines how the startup costs are amortized. Hadoop spawns a new map task for each input split and runs each map task in a separate JVM. The number of input splits depends on the size of the input and the size of the DFS block: as the input size increases, the number of map tasks also increases. This means that, in a cluster with n nodes and a job with m map tasks, each node runs m/n tasks. As a result, the JVM initialization, as well as any other process that takes place during the initialization of a map task, runs m/n times on each node.

A solution to this problem, the option to reuse the same JVM for tasks that run on the same node, was introduced in a later Hadoop version. Unfortunately, our implementation and evaluation were based on an earlier version due to the configuration of the School's cluster. If we could reuse the JVM, the loading time of replication-join and semi-join would depend on the number of nodes in the cluster and not on the size of the input. As the size of the input increased, the loading costs of replication-join and semi-join would be amortized, and thus their performance should be at least an order of magnitude better compared to merge-join.

In the current version, the loading time of the small table in replication-join increases linearly with the size of the right table. Even if we keep the size of the small table constant, as the size of the right table increases the loading time also increases, because the number of map tasks increases. Semi-join is also affected. Again, the loading time of the bloom filter increases, but this does not pose a significant problem because the size of the filter is small. Nevertheless, significant performance degradation occurs during the construction of the bloom filter. As we discussed above, during the map phase each mapper builds a local filter from the input split and then all filters are aggregated by a single reducer. The local bloom filters must be the same size as the final bloom filter to allow logical operations. Consequently, if the size of the bloom filter is s and the number of map tasks is m, we have to write to disk and shuffle through the network ms bytes. With JVM reuse we would only need to write and shuffle ns bytes, where n is the number of nodes.
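For reference, in the Hadoop releases that support it, JVM reuse can be requested per job; a driver-side sketch using the old JobConf API is shown below (we did not use this in our evaluation, since our cluster's version predates the feature, so treat it as an assumption about later releases):

import org.apache.hadoop.mapred.JobConf;

public class JvmReuseExample {
    public static void main(String[] args) {
        JobConf conf = new JobConf();
        // Let an unbounded number of tasks share one JVM per node, so per-task
        // initialization (e.g. loading the replicated table or the bloom filter)
        // happens once per node instead of once per input split. -1 means "no limit".
        conf.setNumTasksToExecutePerJvm(-1);   // same effect as mapred.job.reuse.jvm.num.tasks = -1
    }
}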

Chapter 4

Designing a cost model

We implemented four different join algorithms and discussed how different inputs affect their performance. We now go a step further and consider how to design an optimizer capable of choosing a suitable algorithm for a given input. A typical cost-based optimizer has two main responsibilities: it enumerates the possible plans and it provides a cost estimate for each plan. Exploring the search space is a very challenging problem and there has been an extensive amount of work on it by the database community over the years. Our work focuses on the design of the cost model, which is the most novel part, since for exploring the search space we can adopt the System-R [20] dynamic programming approach used in many database systems today.

4.1 Overview

Our goal is not to measure precisely the cost of each join algorithm but to identify an appropriate algorithm for a given input. As with traditional databases, the point is not always to choose the best algorithm but to avoid a bad one; the accuracy of the decision is far more important than the accuracy of the estimate. Our main intention is to provide a simple but effective cost model.

The measurement unit is probably the most essential decision regarding the cost model. It should reflect as closely as possible the real performance of the algorithm. Traditional database systems measure I/O, since it is the most important factor affecting their performance. Map/Reduce, on the other hand, as a parallel programming model involves communication overhead between the tasks, the degree of parallelization, network overhead, I/O and so on.

Therefore, measuring only the I/O would be misleading. Instead, we measure time, in an attempt to incorporate all these factors into a single measurement unit.

Having defined the measurement unit we now need to extract cost expressions. The computational costs of the map and reduce functions seem to be the most straightforward answer. Nevertheless, taking into account only the computational costs would result in a significant deviation from the real costs. The Hadoop framework does a considerable amount of work between the calls to the map and reduce functions, and in many cases this work dominates the total execution time. It is therefore essential to incorporate these costs into our model. Furthermore, as we discussed earlier, some algorithms completely omit the reduce phase while others add an extra MR job to reduce the amount of data shuffled through the network. To be able to compare these algorithms we have to know how much we gain by omitting the reduce phase, or by reducing the amount of data shuffled through the network, relative to the local computational costs. As a result, the cost model should incorporate both framework costs and computational costs.

4.2 Framework costs

In reality, the framework costs depend on factors that are hard to compute precisely. A detailed cost analysis would require analyzing how the framework works and taking into consideration factors such as network bandwidth, hard disk speed, the available memory of each node, the Hadoop configuration and so on. Instead of modeling the framework's cost in detail, we treat it as a black box and measure its performance for a given input and a given cluster setup. The main idea is to measure the correlation between the input size and the time Hadoop needs to process that input for a given hardware and software configuration. We can then use these values as parameters in the cost models. More specifically, we measure the map and reduce costs separately. Using a simple mapper with no computational cost, we feed Hadoop with input rows and measure the elapsed time, and we repeat the same procedure for the reduce phase. With a similar procedure we can also measure the time needed to distribute a file using the distributed cache. A sketch of such a calibration mapper is shown below.
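A minimal sketch of the kind of no-op mapper used for this calibration, again written against the org.apache.hadoop.mapred API; the class name is ours and the snippet only illustrates the idea of measuring pure framework overhead.

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/**
 * Identity mapper with no computational cost. Running it (with and without a
 * trivial reducer) over inputs of increasing size isolates the per-row framework
 * overhead c_map (and c_reduce) from any join-specific computation.
 */
public class CalibrationMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, LongWritable, Text> {

    public void map(LongWritable offset, Text row,
                    OutputCollector<LongWritable, Text> output, Reporter reporter)
            throws IOException {
        // Pass the row straight through: all measured time is framework time.
        output.collect(offset, row);
    }
}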

An important point is that instead of measuring the elapsed time we measure the accumulated elapsed time. This means that for a job that lasted s seconds and consisted of n map tasks of s_i seconds each, we record the sum s_1 + s_2 + ... + s_n rather than s. In this way the performance of the framework appears to be more linear and thus easier to estimate. Hadoop behaves differently up to the saturation point: in a cluster with s execution slots, the elapsed time of a job remains the same until the input size is sufficient to fill all execution slots. Until this point, as we increase the input, Hadoop creates extra map tasks that run in parallel, keeping the elapsed time at the same level. By measuring the accumulated elapsed time we eliminate this problem and make the cost models independent of the degree of parallelization. Moreover, by measuring the accumulated elapsed time we make our estimations more resistant to slow nodes, since the impact of a slow node on the accumulated elapsed time is insignificant compared to its impact on the elapsed time.

Eliminating the degree of parallelization from the cost models can be a controversial decision, because we design cost models for parallel algorithms without taking the parallelizability of the algorithms into account. Nevertheless, we support our decision for two main reasons. First, Map/Reduce naturally enforces parallelization and guarantees near-linear scalability; thus, all algorithms should have approximately a similar or, in the worst case, a minimum degree of parallelization. Second, our intention is to provide a comparative model and not to estimate accurately the performance of each algorithm. Therefore, since Map/Reduce guarantees a similar degree of parallelization, we can factor it out.

A final note is that these parameters should be recalibrated for each different hardware and software configuration. The measured values incorporate the hardware characteristics of the nodes, the network, as well as the Hadoop configuration. Even a single modification in the Hadoop configuration, such as the size of the DFS block or the sorting factor, can dramatically alter performance. Although this may seem restrictive, it is what makes the cost model feasible, because we can provide an estimate without examining all the different factors that affect the performance of the framework.

In summary, we incorporate the following framework costs in the cost models:

Map framework cost. It captures the time required by the framework to process data before and after the call to the map function. Such steps are reading the input split, partitioning, sorting and writing the output to disk. It is expressed as a factor multiplied by the total number of input rows and is denoted as c_map · n.

Reduce framework cost. It captures the time required by the framework to process the data before and after the execution of the reduce function. It includes shuffling and copying the data to the reduce nodes, merging the partitions and writing the output to disk. Like the map framework cost, it is expressed as a factor multiplied by the total number of input rows and is denoted as c_reduce · n.

File distribution cost. It is the time needed by the framework to distribute a file to all nodes using the distributed cache. This factor is always multiplied by the size of the file and is denoted as c_distribution · s.

4.3 Computational costs

Computational costs are much easier to estimate, since they depend on fewer factors than the framework costs. We compute the local costs as a function of the number of input rows n. Since the framework costs are based on the accumulated elapsed time, we feed the local cost models with the total number of input rows and not the rows of a single input split; the estimation therefore reflects the accumulated computational costs. Finally, we convert the resulting function to time by multiplying it by the average time needed to process a single row. Moreover, we put additional effort into simplifying the local computational cost models without significantly affecting the accuracy of the estimation. We achieve this by estimating the cost in asymptotic terms. For example, for a mapper that performs an operation with cost n·c + n, where n is the number of rows and c the number of columns, we simplify it by keeping only the dominating term n·c.

4.4 The assumptions

Certainly, estimating the framework costs by taking into account only the number of input rows is overly simplistic. However, it simplifies the task enough to make it feasible within the available time. In this section, we identify the most important input properties that affect the performance of the framework and thereby enumerate the assumptions of the current approach.

To begin with, the framework cost does not depend only on the number of input rows but also on the size of each row. The size of each row affects the number of I/Os the framework needs to read it and write it to disk, as well as the time to shuffle it through the network. To eliminate this factor, in our experiments we keep the row size constant. In cases where the row size changes we use another factor to estimate the framework costs. Furthermore, apart from the input size, the cost is also affected by the output size. The output size of the mapper affects the time spent sorting, grouping and shuffling the data through the network, while the output size of the reducer affects the time needed to write it to disk. In our experiments the output size usually matches the input size; we used a different factor in cases where the input is filtered. Finally, other more minor factors include the number of reducers as well as the ratio of unique keys to the total number of emitted key/value pairs. The number of reducers affects the way the output is partitioned during the map phase, while the key grouping determines how often the reduce function is called. Again, in our experiments we keep these factors constant.

4.5 Merge-join cost model

As we discussed earlier, the merge-join algorithm consists of a single MR job with a map and a reduce phase. The mapper extracts the join attribute from each tuple, tags the tuple with the name of the input file and emits the pair. The computational cost of the mapper is mainly dominated by the extraction of the join attribute. To extract the join attribute we scan the tuple until we find the column delimiter; the deeper the join attribute is in the tuple, the more time we need to extract it. Hence, the computational cost of the mapper is expressed as

map computational cost = n · p · t    (4.1)

The map framework cost is expressed as

map framework cost = c_map · n    (4.2)

giving a total of

map total cost = map framework cost + map computational cost    (4.3)

Table 4.1 lists the variables used in the equations above.

Table 4.1: MJ map phase variables
- n: the number of input tuples
- t: the average time needed to scan a tuple until the first delimiter is found
- p: the position of the join attribute within the tuple

Afterwards, the reducer takes the output of the map phase, splits it into two datasets according to the tag and then merges the datasets. The merge operation asymptotically costs n·t time (where t is the time needed to process a single row), but it also depends on the selectivity factor, i.e. the number of tuples that every outer tuple joins with on average. It is important to take the selectivity factor into account because the merge algorithm performs a nested loop over key groupings, which in an extreme case can cost nearly n²·t time. The computational cost is given below and the variables in table 4.2.

reduce computational cost = (n_outer + n_outer · f) · t    (4.4)

Table 4.2: MJ reduce phase variables
- n_outer: the number of tuples of the outer table
- n: the total number of tuples of both tables
- t: the average computation time needed per tuple
- f: the selectivity factor

In the case where the join attribute is unique and every tuple of the outer table joins with one tuple of the inner table, f is equal to one. If the inner table is smaller than the outer table, f should be less than one, so that n_outer + n_outer · f is approximately equal to n. The computational cost is added to the reduce framework cost:

reduce framework cost = c_reduce · n    (4.5)

giving a total cost for the reduce phase:

reduce total cost = reduce framework cost + reduce computational cost    (4.6)

The total cost of the merge-join algorithm is the sum of the cost of the map phase and the cost of the reduce phase. Hence,

MJ cost = map total cost + reduce total cost    (4.7)

A small sketch that puts equations (4.1)-(4.7) together in code is shown below.
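As a concrete illustration, the following sketch evaluates equations (4.1)-(4.7); the class, method and parameter names are ours, and the calibrated constants (c_map, c_reduce, t) are assumed to be supplied by the measurement procedure of section 4.2.

/**
 * Sketch of the merge-join (MJ) cost model of equations (4.1)-(4.7).
 * All times are in seconds; n is the total number of input tuples of both tables.
 */
public class MergeJoinCostModel {

    public static double mjCost(long n, long nOuter, double f, int p,
                                double t, double cMap, double cReduce) {
        double mapComputational = n * p * t;                // (4.1) join-attribute extraction
        double mapFramework = cMap * n;                     // (4.2) read, partition, sort, spill
        double mapTotal = mapFramework + mapComputational;  // (4.3)

        double reduceComputational = (nOuter + nOuter * f) * t; // (4.4) merge with selectivity f
        double reduceFramework = cReduce * n;               // (4.5) shuffle, merge, write output
        double reduceTotal = reduceFramework + reduceComputational; // (4.6)

        return mapTotal + reduceTotal;                      // (4.7)
    }
}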

4.6 Replication-join cost model

The execution time of the replication-join is mainly dominated by the loading and computation times of the map function. The algorithm consists of a single MR job with only a map phase: the join takes place directly in the map phase, completely eliminating the need for a reduce phase. The algorithm begins by distributing the small table to all nodes. The cost of this operation is computed by c_distribution · s_small. Then, for every map task the table is loaded into a hash table. To compute the total loading time we have to compute the number of times this loading takes place, which in general depends on the number of input splits. The number of input splits is computed as n_split = input size / DFS block size (rounded up). The loading time is therefore computed by n_split · n_small · t. Table 4.3 lists the variables.

Table 4.3: Loading time variables
- n_split: the number of input splits
- n_small: the number of tuples of the small table
- t: the time needed to load a single tuple
- s_small: the size of the small table

After loading the small table, the join takes place: for every tuple of the outer table we extract the join attribute and probe the hash table, and if the join attribute is found we emit the key/value pair. Probing the hash table is a constant-time operation, so it asymptotically costs n_big · t, since the hash table is checked for every tuple of the big table. This time is added to the time needed to extract the join attribute, which is computed by n_big · p · t as in merge-join. In general, the extraction cost dominates the probe time, so we omit the latter for simplicity. Overall, the replication-join cost is computed as follows:

RJ cost = map framework cost + c_distribution · s_small + n_split · n_small · t + n_big · p · t    (4.8)

A sketch of this model in code follows.
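Again purely as an illustration, with our own naming, with the calibrated constants assumed to come from section 4.2, and with the per-tuple loading and extraction times kept as separate parameters for clarity even though the text denotes both by t:

/**
 * Sketch of the replication-join (RJ) cost model of equation (4.8).
 * There is no reduce phase; the cost is map framework work, distributing the
 * small table, loading it once per map task, and scanning the big table.
 */
public class ReplicationJoinCostModel {

    public static double rjCost(long nBig, long nSmall, double sSmallMb,
                                long inputSizeBytes, long dfsBlockSizeBytes,
                                int p, double t, double tLoad,
                                double cMap, double cDistribution) {
        long nSplit = (inputSizeBytes + dfsBlockSizeBytes - 1) / dfsBlockSizeBytes; // ceiling
        double mapFramework = cMap * nBig;              // framework cost of the single map phase
        double distribution = cDistribution * sSmallMb; // push the small table to all nodes
        double loading = nSplit * nSmall * tLoad;       // hash table rebuilt in every map task
        double extraction = nBig * p * t;               // extract join attribute (probe cost omitted, as in the text)
        return mapFramework + distribution + loading + extraction; // (4.8)
    }
}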

4.7 Semi-join cost model

This is the most complicated cost model. The algorithm consists of two chained MR jobs which we need to estimate separately. The first job, the construction of the bloom filter, creates local bloom filters during the map phase and then aggregates them in the reduce phase. The mapper extracts the join attribute from each tuple and adds it to the local bloom filter. The local computation time is again dominated by the extraction of the join attribute, which is computed as before by n · p · t; for simplicity we omit the time needed to add the value to the bloom filter. The total cost of the map phase is given by:

bloom filter map cost = map framework cost + n · p · t    (4.9)

The reducer gets the local bloom filters from the mappers and aggregates them into a final bloom filter. To estimate the number of its input rows we compute the number of mappers. As in the replication-join cost model, the number of mappers is computed by n_split = input size / DFS block size (rounded up). The total reduce cost is therefore computed by adding the framework cost to the local computation cost, which is equal to the number of input rows multiplied by the average time needed to process each row:

bloom filter reduce cost = c_reduce · n_split + n_split · t    (4.10)

In more detail, t here denotes the time needed to perform a logical operation between two bloom filters. In reality, to perform a logical OR between two filters we have to apply the operation to every bit of the byte array, so the time needed is proportional to the size of the bloom filter. Here we simplify the equation by taking an average time t.

Overall, the total time needed to complete the first job is:

bloom filter cost = bloom filter map cost + bloom filter reduce cost    (4.11)

The construction of the bloom filter is followed by a second MR job where we merge-join the two datasets. The cost model here is borrowed from the merge-join algorithm, with a small addition to the map phase to account for the time needed to distribute and load the bloom filter. Strictly speaking, in our implementation the distribution of the bloom filter takes place before the map phase, during job setup, although in the cost equation it appears as part of the map phase; algebraically it makes no difference. To compute the distribution cost we have to compute the size of the bloom filter. This is achieved by using the equation m = -n · ln(p) / (ln 2)², as presented in the section on bloom filters. We substitute n with the total number of input rows and p with 0.01, which is the chosen allowable error rate. Having computed the size of the bloom filter, we can compute the distribution cost as before by

distribution cost = c_distribution · s_bloomfilter    (4.12)

The loading time is computed by multiplying the number of mappers by the average time each mapper needs to load the bloom filter. Hence,

loading cost = n_split · t_loading    (4.13)

The map computational cost is given by

map computational cost = distribution cost + loading cost + n · p · t    (4.14)

and the total map cost is given by

map total cost = map framework cost + map computational cost    (4.15)

The reduce cost is computed as in merge-join, with the only difference being that we multiply c_reduce by the number of output rows of the map phase and not by the total number of input rows (in MJ the number of map output rows equals the number of input rows).

The problem here is that at estimation time the number of output rows is not known. In traditional database systems this is addressed by estimating the output size using table statistics, usually in the form of histograms. Here we would need to follow a similar approach; however, this is an issue in its own right, so for the sake of simplicity we omit this phase and provide "perfect" statistics. The total cost of the second (merge-join) job is given by

MJ cost = map total cost + reduce total cost    (4.16)

Finally, the total semi-join cost is given by

SJ cost = bloom filter cost + MJ cost    (4.17)

A code sketch combining equations (4.9)-(4.17) is given at the end of this chapter.

4.8 Improved semi-join cost model

This model is almost identical to the cost model of the semi-join algorithm, since the flow of the two algorithms is exactly the same. The only addition is to incorporate the cost of writing the outer table tuples during the map phase of the merge-join job. This is achieved by adjusting the value of the c_map parameter. This solution, however, has the side-effect of conceptually including this cost in the map framework costs. Strictly speaking, this cost is part of the computational costs since it is incurred in the map function. Nevertheless, we feel it fits better with the map framework costs, along with the rest of the I/O work.
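To tie the semi-join model together, here is a sketch in the same spirit as the previous two. The bloom filter size follows m = -n · ln(p) / (ln 2)², the error rate of 0.01 is the one chosen above, and the naming, the separation of the various t parameters, and the treatment of the second job's reduce computation over the filtered rows are our own simplifications rather than a definitive implementation.

/**
 * Sketch of the semi-join (SJ) cost model of equations (4.9)-(4.17).
 * Job 1 builds and aggregates the bloom filter; job 2 is a merge-join whose map
 * phase additionally distributes and loads the filter.
 */
public class SemiJoinCostModel {

    /** Bloom filter size in bytes for n keys: m = -n ln(errorRate) / (ln 2)^2 bits. */
    public static double bloomFilterBytes(long n, double errorRate) {
        double bits = -n * Math.log(errorRate) / Math.pow(Math.log(2), 2);
        return bits / 8.0;
    }

    public static double sjCost(long n, long nSplit, long nMapOutputRows,
                                int p, double t, double tLoading,
                                double cMap, double cReduce, double cDistribution) {
        // Job 1: bloom filter construction.
        double bfMap = cMap * n + n * p * t;                 // (4.9)
        double bfReduce = cReduce * nSplit + nSplit * t;     // (4.10) one local filter per mapper
        double bloomFilterCost = bfMap + bfReduce;           // (4.11)

        // Job 2: merge-join with filter distribution and loading.
        double sBloomMb = bloomFilterBytes(n, 0.01) / (1024.0 * 1024.0);
        double distribution = cDistribution * sBloomMb;      // (4.12)
        double loading = nSplit * tLoading;                  // (4.13)
        double mapTotal = cMap * n + distribution + loading + n * p * t;       // (4.14) + (4.15)
        double reduceTotal = cReduce * nMapOutputRows + nMapOutputRows * t;    // reduce over filtered rows
        double mjCost = mapTotal + reduceTotal;              // (4.16)

        return bloomFilterCost + mjCost;                     // (4.17)
    }
}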

Chapter 5

Evaluating the join algorithms

The experimental evaluation consists of two different parts: the evaluation of the join algorithms discussed in chapter 3 and the evaluation of the cost models. This chapter covers the first part. We present a series of experiments designed to illustrate some of the most fundamental factors that affect the algorithms' performance, as discussed in chapter 3. For each experiment we provide an initial description and the intuition behind it, we then present the results, and finally we draw our conclusions. We report the average values of three consecutive runs of each experiment. In cases where the results deviated significantly we conducted an additional run, eliminated the extreme value and reported the new average. It is also important to note that although parallel systems are usually evaluated by their speedup and scaleup, we use the accumulated elapsed time as the metric. Since MR guarantees near-linear scalability we take this as given and compare the algorithms by their accumulated elapsed time, for consistency with the cost model.

The experimental evaluation was conducted on a cluster with eight Dell PowerEdge SC1425 nodes. Each node contained two Intel Xeon 3.2GHz CPUs, 2GB DDR SDRAM and an 80GB hard disk running at 7200 rpm. One node served as the name node, a second as the task tracker and six as data nodes, giving a total of twelve execution slots, two in each data node. The nodes are spread across two racks, each with an HP ProCurve 2650 switch at 100BaseTX-FD, and the racks are connected by a 1-gigabit link. For both implementation and evaluation we used Hadoop. Although each data

51 Chapter 5. Evaluating the join algorithms 40 node contains two CPUs with two execution slots each (two threads per processor) giving a total of four execution slots per node, we decided to use only two execution slots per node to reduce hard disk sharing between the tasks. The number of reducers was set to six, one for each node, in order to parallelize as much as possible the reduce phase. Furthermore, we increased the java heap size to 1GB. 5.1 Evaluating inner-join operations In the first experiment we compare the performance of the three algorithms when conducting inner joins. We start with both a left and a right table of one million records. Then, in each consecutive experiment we increase only the size of the right table by one million rows. As we increase the size of the right table, the selectivity factor between the tables is decreasing. As a result, the performance of semi-join should improve over time compared to merge-join. On the other hand, since we keep the size of the left table constant the performance of replication-join should be the most optimal. The size of the left table was chosen to keep at low levels the time needed to distribute it and load it to all nodes by distributed cache. Below in figure 5.1 we depict the result of the first experiment. Figure 5.1: Join between two tables with constant size left table The first thing to notice here, is that the performance of each algorithm remains linear

52 Chapter 5. Evaluating the join algorithms 41 as we increase the input size with only exception the merge-join that presents a steeper increase after seven million records. This is the point where Hadoop schedules map tasks at the second execution slot in each node and the disk is shared between the tasks. The other two algorithms seem to be more resistant to this problem because they are less disk intensive. Another important observation, is that the time of each algorithm increases linearly with the input size. What differs from algorithm to algorithm is the degree of the growth. As expected, merge-join is the most expensive algorithm in this case. Semi-join amortizes the cost of the bloom filter construction and overtakes merge-join. The performance of the semi-join has more fluctuations compared to the other two algorithms probably because is the most complex algorithm and its execution consists of two MR jobs. Since the size of the left table remains the same in all experiments, the intersection of the two tables remains also constant despite the increase in size of the right table. As a result, the reduce phase of semi-join should have a constant cost since the output of the map phase remains constant. In figure 5.2 we illustrate the reduce phases of the merge-join and semi-join algorithms. It can be clearly seen that semi-join despite the fluctuations has a steady cost during the reduce phase while the cost of merge-join keeps increasing as we increase the input. Figure 5.2: Comparing the reduce phase of MJ and SJ algorithms On the other hand, replication-join presents superior performance compared to the other two algorithms. Since we keep the size of the left table constant, the distribution

53 Chapter 5. Evaluating the join algorithms 42 costs remain constant in all experiments. However, as we explained in the JVM initialization section the loading costs keep increasing. Overall, these costs are amortized by the really low framework costs since we join the input directly to the map phase. 5.2 Increasing the size of both tables The intuition behind this experiment is to observe the performance of replication-join compared to the other two algorithms as we increase the size of the left table. We still keep the left table small enough to fit in memory, thus, it is still a join between a small and a large table. We start with a left table with 1.2 million records and a right table with 1 million records. In each consecutive experiment we increase the size of the left table by one hundred thousand records and the right table by one million records. In figure 5.3 we illustrate the results of the experiment. Figure 5.3: Increasing the size of both tables As we can observe, the distribution and loading time of the left table dominates the cost of the replication-join algorithm. This makes replication-join appropriate when the size of the left table is relatively small although the available memory can be sufficient for loading a larger left table. On the contrary, the merge-join and semi-join algorithms seem to run unaffected from the increase of the left table although the cost of semijoin is closer to the cost of merge-join compared to the first experiment. This is very

54 Chapter 5. Evaluating the join algorithms 43 reasonable since the intersection of the two tables is increasing in each trial and thus the cost of its reduce phase is increasing proportionally. 5.3 Keeping the selectivity factor constant In this experiment, we illustrate a case where semi-join underperforms compared to merge-join. This is achieved by increasing the selectivity factor between the tables so that the mapper of the semi-join filters less records. Initially, the left table contains five hundred thousand records and the right table one million. In each trial we always increase the left table by five hundred thousand records and the right by one million records. This allows us to keep the selectivity factor constant to 50% while we increase the input. In figure 5.4 we depict the results. Figure 5.4: Inner-join with high selectivity factor Although with a selectivity factor of 50% where the mapper filters half of the records before the reduce phase, the gain from the filtering does not outweigh the cost of the extra MR job. Interestingly, semi-join follows a similar shape with merge-join although its execution consists of two separate jobs. This allows us to see that the cost of the first job (bloom filter construction) is insignificant to the total cost since the performance is mainly affected by the second job.

55 Chapter 5. Evaluating the join algorithms Evaluating outer-join operations In this section we evaluate the proposed algorithms for performing outer-joins. We repeat the initial experiment where we keep the size of the small table constant and only increase the size of the large table by one million rows each time. Consequently, the size of the output increases in each trial by one million rows while the intersection of the tables remains constant to two millions rows. In figure 5.5 we compare the performance of replication-join, merge-join and union semi-join. Figure 5.5: Comparison of MJ, RJ and USJ for outer-join operations As expected, RJ performs best for both inner-joins and outer-joins when the size of the left table is small. On the other hand, both MJ and union SJ are more expensive in this case and interestingly they cost about the same. As it seems, the performance gain from the filtering phase in USJ is lost by the I/O cost of the additional mapper at the end. Probably an improvement in this case can only be seen when we reach the bandwidth of the network so the shuffle phase gets more expensive. In general, the size of the input in our experiments was too low to reach that limit. In figure 5.6 we illustrate the same experiment but we replace USJ with ISJ.

56 Chapter 5. Evaluating the join algorithms 45 Figure 5.6: Comparison of MJ, RJ and ISJ for outer-join operations Surprisingly, as we can observe ISJ presents the same cost with MJ (and consequently with USJ). At a first look this does not seem very reasonable since ISJ performs less I/O operations from both algorithms and shuffles only the required tuples. After a thorough investigation the source of the problem seems to be the reduce phase of the MJ job. In figure 5.7 we depict the reduce phases of the three algorithms. Figure 5.7: Reduce phase comparison between MJ, USJ and ISJ

It can be clearly seen that the reduce phase of ISJ presents an unexpected behavior. While the reduce phases of USJ and ISJ operate in exactly the same way, the cost of USJ remains stable whereas the cost of ISJ increases as we increase the input. The cost of ISJ should remain stable as well, since the size of its reduce input remains constant in all experiments. MJ, on the other hand, behaves as expected, since the input of its reduce phase increases as we increase the input size. An explanation for this phenomenon lies in the disk sharing between the mappers and reducers. As we explained previously, each node runs one map task and one reduce task concurrently on a shared disk. Although this affects all algorithms, ISJ is affected substantially more since it has the most disk-intensive map phase. In many cases the write operation at the end of the map phase occurs at the same time as the copy and merge operations of the reduce phase, which are also disk-intensive. As we increase the input size, the write operation of the map phase takes longer and thus the reducer shares the disk for more time. As a result, the reducer appears to run for a longer time, but in reality the additional cost is waiting time for the I/O operations.

Figure 5.8: Comparison of MJ, RJ and ISJ for outer-join operations without the disk sharing factor

In conclusion, the problem is confined to the specific cluster configuration. Nevertheless, to illustrate the performance of ISJ correctly we have to eliminate the disk-sharing factor. A solution to that problem is to configure Hadoop to run only one task per node. Unfortunately, the version we used only allows us to set a minimum of one reducer

58 Chapter 5. Evaluating the join algorithms 47 and one mapper per node. Since we cannot completely eliminate the factor, we mitigate it by replacing the experimental results of the ISJ reduce phase with the results of the USJ reduce phase which is substantially less affected. In figure 5.8 we depict the new results. Finally, as we can observe ISJ can have improved performance compared to MJ for performing outer-joins under a correct configuration.

Chapter 6

Evaluating the cost model

In this chapter we present our evaluation of the cost model. The evaluation is based on the input of the first experiment presented in chapter 5, and all experiments were conducted with the configuration described in the previous chapter. We follow a step-by-step process in which we present the estimates for each phase of each algorithm. In each step we explain in detail the process we followed to produce the estimates and we list the values of the parameters we used. In many cases we also compare the values of the same parameter across different algorithms and justify the differences. At the end, we plot all estimates together and compare them with the real costs.

What is important for the proof of concept is not the accuracy of the estimation but how the estimates affect the decisions of the optimizer. Taking the first experiment as an example, we observed that semi-join begins with a higher cost compared to merge-join, but after a point its performance is better, while replication-join performs best at all times. The estimates, irrespective of their actual values, should demonstrate the same behavior as the real costs. As a result, the optimizer will know that semi-join underperforms when the input size is low but is better than merge-join beyond a specific input size (and selectivity factor), even though this point can deviate from the actual value.

6.1 Parameter calibration

Parameter calibration can be a very time-consuming process if done exhaustively. Even when we repeat the same experiment, the parameters show variations from trial to trial and from node to node, so tuning them to correctly reflect the real costs may require several experiments. During the evaluation, for every erroneous estimate we had to judge whether the problem was due to wrong parameter values or due to errors in the cost model, which was not always easy to see. In general, the accuracy of the estimation depends on the validity of the cost model and on the tuning of the parameters. In our experiments we devoted a fair amount of time to setting the values as realistically as possible; however, our intention is a proof of concept and not to estimate the real costs as accurately as possible. Below we give the details of the measurement process for each parameter.

Map and reduce framework costs. These parameters were calibrated when the cluster was fully utilized, with all execution slots occupied. We sampled the accumulated elapsed time of different tasks at different input sizes and used an average value. Afterwards, during the model evaluation, we performed additional tuning to improve the accuracy of the estimations.

Distribution costs. This was a much easier task since the results of each experiment fluctuate less. We distributed files of varying sizes and computed the average distribution time per MB.

Average row computation time. We obtain the average time by measuring the elapsed time for a large number of rows and dividing it by the number of rows.

Loading time. It was calibrated similarly to the average computation time: we measured the elapsed time for a large table and divided it by the number of rows.

Waiting time. When Hadoop reports the elapsed time of each reducer it always includes the waiting time, that is, the elapsed time from the initialization of the reducer until the reducer starts working. The waiting time mainly depends on how quickly the map tasks start sending data to the reducers. During the design of the cost models we decided not to include the waiting time, but since it was included in the reported values of the experiments in chapter 5, we have to add it to the estimates as well. We measured the average waiting time for each algorithm separately.

6.2 Merge-join estimation

We begin with the evaluation of the merge-join cost model. We use the framework cost equation presented in chapter 4, map framework cost = c_map · n, and substitute the parameters with the measured values shown in table 6.1. We perform the same number of estimations as there are trials in the first experiment, and in each estimation we increase the number of rows by one million records.

Table 6.1: MJ map phase model parameters (Parameter — Value)
c_map — sec
p — 1
t — sec
n — 2-17 million tuples

Figure 6.1: MJ map phase

For the computational cost we use map computational cost = n · p · t. We set p equal to one since in all experiments we used the first column of each table as the join attribute. The

62 Chapter 6. Evaluating the cost model 51 average extraction time of the join attribute from each tuple is set equal to seconds. The estimated computational cost is added to the framework cost computed above to take the total cost of the map phase. In figure 6.1, we illustrate the estimated costs for the map phase of each trial and we compare them with the actual costs from the first experiment. As it can be seen the estimated cost is quite close to the real cost. There is a small deviation below seven million records due to the map framework costs. As we discussed earlier when the cluster utilization is low the map tasks do not share the disk in each node and thus the framework costs are less. In general, the accuracy of the estimation is more important for higher number of rows since a wrong decision by the optimizer has greater impact. This is the main reason we measured the framework costs when the cluster was fully utilized. We could solve the inaccuracy at the low input sizes by using a different map framework parameter for different input threshold. However, for the reason discussed above we decided it is not necessary and we kept simple the calibration process. Similarly, for the reduce phase we compute the framework costs and the computation costs. We use the reducecost f ramework = c reduce n equation to compute the framework cost. The computation cost is given by reduce computational cost = (n outer + n outer f )t but since the join attribute is unique it is simplified to reduce computational cost = nt where n is the total number of input rows. Finally, we add the waiting time which is measured approximately equal to 170 seconds (28.3 seconds approximately for each reducer). The values of the parameters we used are presented in table 6.2 and the estimates in figure 6.2. Parameter c reduce t waitingtime n Value sec sec 170 sec 2-17 million tuples Table 6.2: MJ reduce phase model parameters The estimated costs of the reduce phase are again fairly close to the real costs although the estimate is completely linear while the real cost present some fluctuations. After seven million records the framework presents a steeper increase that the estimation fails to capture but after that point the actual cost slowly converges with the estimated

63 Chapter 6. Evaluating the cost model 52 cost. Figure 6.2: MJ reduce phase In general, the map phase can be more easily estimated compared to the reduce phase. This is very reasonable since the map factor is affected by factors that are easier to compute. It is mainly I/O and computational work with little network involved although it depends on the data locality. In our experiments we observed that the performance map phase is much more linear compared to the reduce phase. Overall, as we will see later the deviations of the estimates from the real costs are mainly due to the reduce phase. Finally, in figure 6.3 we depict the total merge-join estimates by adding the map estimated costs with the reduce estimated costs. As expected, the total merge-join estimates are fairly accurate since map and reduce estimates were pretty close to the real costs. 6.3 Replication-join estimation As we discussed in chapter 3, the replication-join cost is mostly dominated by the computation and loading costs. The framework only reads the input and writes the output with no sorting or partitioning involved since there is no reduce phase. This has two implications on the cost model, the map factor is set low ( seconds where

Figure 6.3: MJ estimates comparison with real costs

MJ used a value of seconds) and the cost is much easier to estimate, since fewer factors affect it. Initially, we compute the distribution cost by c_distribution · s_small. Since we keep the size of the small table constant, the distribution cost remains the same in all trials. The left table contains one million records and its size is approximately 65 MB. The distribution factor was measured at 0.12 seconds per MB, which means that for a 65 MB file we need approximately 8 seconds to distribute it to all nodes in the cluster. The loading cost depends on the number of map tasks, which is computed by n_split = large table size / DFS block size (rounded up). The default DFS block size is set by the dfs.block.size configuration property and is equal to 67,108,864 bytes. The size of the right table for the initial 1 million rows is equal to 68,966,670 bytes, which is split into two map tasks. In each consecutive experiment the size of the right table increases by approximately 68,966,670 bytes, which spawns an additional map task. The total loading time is given by n_split · n_small · t, where n_small is equal to 1 million records and t is the measured per-tuple loading time. The local computation cost is given by n_big · p · t, where t is the measured per-tuple computation time. Again, we use the first column as the join attribute, so p is equal to one. A short worked example of the split-count and distribution-cost arithmetic is sketched below.
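For instance, a minimal sketch of the arithmetic with the numbers quoted above (the class and variable names are ours and serve only as an illustration):

/** Worked example of the RJ estimation arithmetic quoted in the text. */
public class RjEstimationExample {
    public static void main(String[] args) {
        long dfsBlockSize = 67108864L;      // dfs.block.size
        long rightTableBytes = 68966670L;   // 1 million rows of the right table
        double sSmallMb = 68966670L / (1024.0 * 1024.0); // left table, about 65.8 MB
        double cDistribution = 0.12;        // seconds per MB

        // ceil(68,966,670 / 67,108,864) = 2 map tasks for the first trial
        long nSplit = (rightTableBytes + dfsBlockSize - 1) / dfsBlockSize;

        // 0.12 s/MB * ~65.8 MB = ~7.9 s, the "approximately 8 seconds" in the text
        double distributionCost = cDistribution * sSmallMb;

        System.out.println("map tasks: " + nSplit + ", distribution: " + distributionCost + " s");
    }
}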

A summary of the parameters used in this cost model is given in table 6.3 and a comparison of the estimates with the real costs in figure 6.4.

Table 6.3: RJ model parameters (Parameter — Value)
c_map — sec
c_distribution — 0.12 sec
DFS block size — 67,108,864 bytes
t_maploading — sec
n_small — 1 million tuples
s_small — 68,966,670 bytes
t — sec
p — 1
n — 2-17 million tuples

Figure 6.4: RJ estimates comparison with real costs

Overall, despite some small fluctuations in the real costs, the RJ estimates are quite accurate and in fact more accurate than the MJ estimates.

6.4 Semi-join estimation

The semi-join estimation is split into two phases: the bloom filter construction estimation and the merge-join estimation.

6.4.1 Bloom filter construction estimation

We estimate the cost of the map phase using bloom filter map cost = map framework cost + n · p · t. All parameters are substituted with the same values as in the previous experiments, with the only exception being c_map, whose value is almost three times smaller than the one we used in the MJ map phase. This is very reasonable: although the input size is exactly the same as in merge-join, the output is limited to one row per map task, which eliminates the sorting and partitioning costs. In addition, while the size of that row is substantially larger than the size of each MJ output row, overall the size of the output is significantly smaller. Moreover, this value is also a bit smaller than the value we used in RJ, which again can be justified by the size of the output. The cost estimates along with the actual costs are presented in figure 6.5.

Figure 6.5: Bloom filter construction - map phase estimates

Like the map phase estimations of the other algorithms, this one is also quite accurate. It can be seen that the estimated cost below five million records is slightly higher than the real cost, but this does not pose a significant problem.

To compute the cost of the reduce phase we use the equation bloom filter reduce cost = c_reduce · n_split + n_split · t. Recall that n_split represents the number of input splits and consequently the number of map tasks. Each map task produces one row, thus n_split also represents the number of input rows of the reduce phase. The parameters c_reduce and t are set equal to 1.5 and 0.41 seconds respectively.

Furthermore, we add to the estimate the waiting time, which is set equal to 7 seconds. It is important to note here that the value of c_reduce is many times larger than the value we used in MJ. This is due to the row size of the input: in merge-join the size of each input row is approximately 70 bytes, while in this case the size of each row can vary from 1 MB to 15 MB depending on the size of the bloom filter. Another important point is that the row size is not kept constant, as it is in all other algorithms. This happens because the size of the bloom filter increases as we increase the input size, and since each input row of the reduce phase is a bloom filter, the row size increases with it. This violates one of our basic assumptions, namely keeping the row size constant, and as a result it can lower the accuracy of the estimation. In figure 6.6 we illustrate the estimates of the reduce phase.

Figure 6.6: Bloom filter construction - reduce phase estimates

Although the actual cost presents intense fluctuations, the estimated costs manage to capture the slope of the growth. Another important point is that the total cost is relatively low compared to the cost of the map phase, since the reduce phase consists of a single reducer. Finally, in figure 6.7 we illustrate the estimates of the bloom filter construction.

Figure 6.7: Bloom filter construction estimates

It is evident that the shape of the estimates is almost identical to the estimated costs of the map phase, since the map cost dominates the total cost. Luckily, the cost of the map phase "masks" potential inaccuracies of the reduce phase caused by the increasing input row size. Overall, the total estimate is quite accurate.

6.4.2 Merge-join estimation

We begin with the cost model of the map phase by computing the distribution costs. We compute the size of the bloom filter for each consecutive experiment. As an example, the first experiment, with a total input size of 2 million records, creates a bloom filter with a size of m = -n · ln(p) / (ln 2)² ≈ 2.29 MB. The size of the bloom filter is then multiplied by the framework distribution cost, with c_distribution equal to 0.12 seconds per MB as in the previous experiments. The loading costs are computed similarly to RJ. To compute the number of map tasks we again use n_split = input size / DFS block size (rounded up), where the default DFS block size is equal to 67,108,864 bytes and the input size of the initial experiment (two million records) is equal to 137,933,340 bytes, which results in three map tasks. The only difference compared to RJ is that we compute the number of map tasks from the total number of rows instead of

69 Chapter 6. Evaluating the cost model 58 the number of rows of the right table. Again here, in each consecutive experiment the size of the right table is increased by approximately 68,966,670 bytes which spawns an additional map task. In each experiment we multiply the number of map tasks with the average loading time of each bloom filter t which is set equal to 0.1 seconds. The computation and framework costs of the map phase are computed exactly as in MJ with the same parameter values with the only exception being the map factor which is set equal to seconds. This value is approximately three times less compared to the value we used in MJ. This can be justified by the reduced output size. While, the input size after the MJ map phase remains unchanged, the SJ mapper filters the input records reducing the time needed to partition, sort and write the data. In table 6.4 we depict the parameter values used in the map phase cost model and figure 6.8 the results. Parameter Value c map c distribution DFSblocksize t maploading n s bloom f ilter t sec 0.12 sec 67,108,864 bytes 0.1 sec 2-17 million tuples 2.24 MB MB sec p 1 Table 6.4: SJ job 2 map model parameters The estimated costs tend to be lower compared to the real costs when the input size is still low. However, after eight million records the accuracy of the estimates is improved. Finally, for the estimation of the reduce phase we use exactly the same parameters as in the MJ reduce phase with a slightly increased waiting time. Unlike MJ, here the input size remains constant in all experiments since the intersection of the two tables remains constant. As a result, the estimate remains the same for all consecutive experiments. The values of the parameters are depicted in table 6.5 and the results in figure 6.9.

70 Chapter 6. Evaluating the cost model 59 Figure 6.8: SJ job 2 - map phase estimates Parameter c reduce t waitingtime n Value sec sec 190 sec 2 million tuples Table 6.5: MJ reduce phase model parameters Figure 6.9: SJ job 2 - reduce phase estimates

71 Chapter 6. Evaluating the cost model 60 Like the previous reduce phase costs of the algorithms presented so far, this one presents also intense fluctuations. Although the estimate remains steady, it approximates the average cost. Probably it should be slightly higher to be more precise but it can be further improved with better tuning of the parameters. The total estimated cost of the second phase is given by adding the estimates of the map and reduce phases and is illustrated in figure Figure 6.10: SJ job 2 estimates Figure 6.11: SJ job estimates

72 Chapter 6. Evaluating the cost model 61 Since the estimates of the map and reduce phases are slightly lower than the real costs when the input size is low, this is also evident in the total estimate which is depicted in Nevertheless, as discussed above, this is not a significant problem for two reasons. First, with a more thorough tuning of the parameters the results can be improved and second, the impact of a wrong decision for small size input is very low. Overall, despite some slight miss estimations on the parts that compose the algorithm, the total estimate is quite satisfactory. It can be seen in here too a slight tendency to underestimate the cost when the input size is low but it is less evident compared to previous estimates.

73 Chapter 6. Evaluating the cost model Putting everything together Having estimated the cost of each algorithm, we can now plot together all estimates and compare with the real costs. In figure 6.12 we plot all estimates together and in figure 6.13 we depict again the actual costs from chapter 4. Figure 6.12: Inner join estimated costs Figure 6.13: Inner join actual costs In figure 6.12 we can observe how the cost model affects the decision of the optimizer


More information

Tutorial Outline. Map/Reduce vs. DBMS. MR vs. DBMS [DeWitt and Stonebraker 2008] Acknowledgements. MR is a step backwards in database access

Tutorial Outline. Map/Reduce vs. DBMS. MR vs. DBMS [DeWitt and Stonebraker 2008] Acknowledgements. MR is a step backwards in database access Map/Reduce vs. DBMS Sharma Chakravarthy Information Technology Laboratory Computer Science and Engineering Department The University of Texas at Arlington, Arlington, TX 76009 Email: sharma@cse.uta.edu

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Overview Catalog Information for Cost Estimation $ Measures of Query Cost Selection Operation Sorting Join Operation Other Operations Evaluation of Expressions Transformation

More information

Advanced Databases: Parallel Databases A.Poulovassilis

Advanced Databases: Parallel Databases A.Poulovassilis 1 Advanced Databases: Parallel Databases A.Poulovassilis 1 Parallel Database Architectures Parallel database systems use parallel processing techniques to achieve faster DBMS performance and handle larger

More information

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15

Systems Infrastructure for Data Science. Web Science Group Uni Freiburg WS 2014/15 Systems Infrastructure for Data Science Web Science Group Uni Freiburg WS 2014/15 Lecture X: Parallel Databases Topics Motivation and Goals Architectures Data placement Query processing Load balancing

More information

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP

TITLE: PRE-REQUISITE THEORY. 1. Introduction to Hadoop. 2. Cluster. Implement sort algorithm and run it using HADOOP TITLE: Implement sort algorithm and run it using HADOOP PRE-REQUISITE Preliminary knowledge of clusters and overview of Hadoop and its basic functionality. THEORY 1. Introduction to Hadoop The Apache Hadoop

More information

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases Chapter 18: Parallel Databases Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery

More information

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction

Chapter 18: Parallel Databases. Chapter 18: Parallel Databases. Parallelism in Databases. Introduction Chapter 18: Parallel Databases Chapter 18: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of

More information

Big Data Management and NoSQL Databases

Big Data Management and NoSQL Databases NDBI040 Big Data Management and NoSQL Databases Lecture 2. MapReduce Doc. RNDr. Irena Holubova, Ph.D. holubova@ksi.mff.cuni.cz http://www.ksi.mff.cuni.cz/~holubova/ndbi040/ Framework A programming model

More information

CSE 344 MAY 2 ND MAP/REDUCE

CSE 344 MAY 2 ND MAP/REDUCE CSE 344 MAY 2 ND MAP/REDUCE ADMINISTRIVIA HW5 Due Tonight Practice midterm Section tomorrow Exam review PERFORMANCE METRICS FOR PARALLEL DBMSS Nodes = processors, computers Speedup: More nodes, same data

More information

Principles of Data Management. Lecture #16 (MapReduce & DFS for Big Data)

Principles of Data Management. Lecture #16 (MapReduce & DFS for Big Data) Principles of Data Management Lecture #16 (MapReduce & DFS for Big Data) Instructor: Mike Carey mjcarey@ics.uci.edu Database Management Systems 3ed, R. Ramakrishnan and J. Gehrke 1 Today s News Bulletin

More information

Parallel Nested Loops

Parallel Nested Loops Parallel Nested Loops For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on (S 1,T 1 ), (S 1,T 2 ),

More information

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011

Parallel Partition-Based. Parallel Nested Loops. Median. More Join Thoughts. Parallel Office Tools 9/15/2011 Parallel Nested Loops Parallel Partition-Based For each tuple s i in S For each tuple t j in T If s i =t j, then add (s i,t j ) to output Create partitions S 1, S 2, T 1, and T 2 Have processors work on

More information

Efficient, Scalable, and Provenance-Aware Management of Linked Data

Efficient, Scalable, and Provenance-Aware Management of Linked Data Efficient, Scalable, and Provenance-Aware Management of Linked Data Marcin Wylot 1 Motivation and objectives of the research The proliferation of heterogeneous Linked Data on the Web requires data management

More information

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters. Hung- chih Yang, Ali Dasdan Yahoo! Ruey- Lung Hsiao, D.

Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters. Hung- chih Yang, Ali Dasdan Yahoo! Ruey- Lung Hsiao, D. Map-Reduce-Merge: Simplified Relational Data Processing on Large Clusters Hung- chih Yang, Ali Dasdan Yahoo! Ruey- Lung Hsiao, D. Sto; Parker UCLA Outline 1. IntroducCon 2. Map- Reduce 3. Map- Reduce-

More information

Lecture 23 Database System Architectures

Lecture 23 Database System Architectures CMSC 461, Database Management Systems Spring 2018 Lecture 23 Database System Architectures These slides are based on Database System Concepts 6 th edition book (whereas some quotes and figures are used

More information

Introduction to MapReduce (cont.)

Introduction to MapReduce (cont.) Introduction to MapReduce (cont.) Rafael Ferreira da Silva rafsilva@isi.edu http://rafaelsilva.com USC INF 553 Foundations and Applications of Data Mining (Fall 2018) 2 MapReduce: Summary USC INF 553 Foundations

More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments

Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Real-time Scheduling of Skewed MapReduce Jobs in Heterogeneous Environments Nikos Zacheilas, Vana Kalogeraki Department of Informatics Athens University of Economics and Business 1 Big Data era has arrived!

More information

CSE 190D Spring 2017 Final Exam Answers

CSE 190D Spring 2017 Final Exam Answers CSE 190D Spring 2017 Final Exam Answers Q 1. [20pts] For the following questions, clearly circle True or False. 1. The hash join algorithm always has fewer page I/Os compared to the block nested loop join

More information

Introduction to Data Management CSE 344

Introduction to Data Management CSE 344 Introduction to Data Management CSE 344 Lecture 24: MapReduce CSE 344 - Fall 2016 1 HW8 is out Last assignment! Get Amazon credits now (see instructions) Spark with Hadoop Due next wed CSE 344 - Fall 2016

More information

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE

RAID SEMINAR REPORT /09/2004 Asha.P.M NO: 612 S7 ECE RAID SEMINAR REPORT 2004 Submitted on: Submitted by: 24/09/2004 Asha.P.M NO: 612 S7 ECE CONTENTS 1. Introduction 1 2. The array and RAID controller concept 2 2.1. Mirroring 3 2.2. Parity 5 2.3. Error correcting

More information

2.3 Algorithms Using Map-Reduce

2.3 Algorithms Using Map-Reduce 28 CHAPTER 2. MAP-REDUCE AND THE NEW SOFTWARE STACK one becomes available. The Master must also inform each Reduce task that the location of its input from that Map task has changed. Dealing with a failure

More information

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University

CS555: Distributed Systems [Fall 2017] Dept. Of Computer Science, Colorado State University CS 555: DISTRIBUTED SYSTEMS [MAPREDUCE] Shrideep Pallickara Computer Science Colorado State University Frequently asked questions from the previous class survey Bit Torrent What is the right chunk/piece

More information

Pig Latin Reference Manual 1

Pig Latin Reference Manual 1 Table of contents 1 Overview.2 2 Pig Latin Statements. 2 3 Multi-Query Execution 5 4 Specialized Joins..10 5 Optimization Rules. 13 6 Memory Management15 7 Zebra Integration..15 1. Overview Use this manual

More information

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1 MapReduce-II September 2013 Alberto Abelló & Oscar Romero 1 Knowledge objectives 1. Enumerate the different kind of processes in the MapReduce framework 2. Explain the information kept in the master 3.

More information

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia

MapReduce & Resilient Distributed Datasets. Yiqing Hua, Mengqi(Mandy) Xia MapReduce & Resilient Distributed Datasets Yiqing Hua, Mengqi(Mandy) Xia Outline - MapReduce: - - Resilient Distributed Datasets (RDD) - - Motivation Examples The Design and How it Works Performance Motivation

More information

Improving the MapReduce Big Data Processing Framework

Improving the MapReduce Big Data Processing Framework Improving the MapReduce Big Data Processing Framework Gistau, Reza Akbarinia, Patrick Valduriez INRIA & LIRMM, Montpellier, France In collaboration with Divyakant Agrawal, UCSB Esther Pacitti, UM2, LIRMM

More information

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker

Shark. Hive on Spark. Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Shark Hive on Spark Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael Franklin, Ion Stoica, Scott Shenker Agenda Intro to Spark Apache Hive Shark Shark s Improvements over Hive Demo Alpha

More information

Architecture and Implementation of Database Systems (Winter 2014/15)

Architecture and Implementation of Database Systems (Winter 2014/15) Jens Teubner Architecture & Implementation of DBMS Winter 2014/15 1 Architecture and Implementation of Database Systems (Winter 2014/15) Jens Teubner, DBIS Group jens.teubner@cs.tu-dortmund.de Winter 2014/15

More information

MapReduce & HyperDex. Kathleen Durant PhD Lecture 21 CS 3200 Northeastern University

MapReduce & HyperDex. Kathleen Durant PhD Lecture 21 CS 3200 Northeastern University MapReduce & HyperDex Kathleen Durant PhD Lecture 21 CS 3200 Northeastern University 1 Distributing Processing Mantra Scale out, not up. Assume failures are common. Move processing to the data. Process

More information

Chapter 12: Query Processing. Chapter 12: Query Processing

Chapter 12: Query Processing. Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Chapter 12: Query Processing Overview Measures of Query Cost Selection Operation Sorting Join

More information

Parallel Computing: MapReduce Jin, Hai

Parallel Computing: MapReduce Jin, Hai Parallel Computing: MapReduce Jin, Hai School of Computer Science and Technology Huazhong University of Science and Technology ! MapReduce is a distributed/parallel computing framework introduced by Google

More information

Lecture 11 Hadoop & Spark

Lecture 11 Hadoop & Spark Lecture 11 Hadoop & Spark Dr. Wilson Rivera ICOM 6025: High Performance Computing Electrical and Computer Engineering Department University of Puerto Rico Outline Distributed File Systems Hadoop Ecosystem

More information

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe

Cloud Programming. Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Programming Programming Environment Oct 29, 2015 Osamu Tatebe Cloud Computing Only required amount of CPU and storage can be used anytime from anywhere via network Availability, throughput, reliability

More information

Data Informatics. Seon Ho Kim, Ph.D.

Data Informatics. Seon Ho Kim, Ph.D. Data Informatics Seon Ho Kim, Ph.D. seonkim@usc.edu HBase HBase is.. A distributed data store that can scale horizontally to 1,000s of commodity servers and petabytes of indexed storage. Designed to operate

More information

MapReduce and Friends

MapReduce and Friends MapReduce and Friends Craig C. Douglas University of Wyoming with thanks to Mookwon Seo Why was it invented? MapReduce is a mergesort for large distributed memory computers. It was the basis for a web

More information

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm

Announcements. Optional Reading. Distributed File System (DFS) MapReduce Process. MapReduce. Database Systems CSE 414. HW5 is due tomorrow 11pm Announcements HW5 is due tomorrow 11pm Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create your AWS account before

More information

Programming Models MapReduce

Programming Models MapReduce Programming Models MapReduce Majd Sakr, Garth Gibson, Greg Ganger, Raja Sambasivan 15-719/18-847b Advanced Cloud Computing Fall 2013 Sep 23, 2013 1 MapReduce In a Nutshell MapReduce incorporates two phases

More information

Introduction to Hadoop and MapReduce

Introduction to Hadoop and MapReduce Introduction to Hadoop and MapReduce Antonino Virgillito THE CONTRACTOR IS ACTING UNDER A FRAMEWORK CONTRACT CONCLUDED WITH THE COMMISSION Large-scale Computation Traditional solutions for computing large

More information

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets

PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets 2011 Fourth International Symposium on Parallel Architectures, Algorithms and Programming PSON: A Parallelized SON Algorithm with MapReduce for Mining Frequent Sets Tao Xiao Chunfeng Yuan Yihua Huang Department

More information

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang

Department of Computer Science San Marcos, TX Report Number TXSTATE-CS-TR Clustering in the Cloud. Xuan Wang Department of Computer Science San Marcos, TX 78666 Report Number TXSTATE-CS-TR-2010-24 Clustering in the Cloud Xuan Wang 2010-05-05 !"#$%&'()*+()+%,&+!"-#. + /+!"#$%&'()*+0"*-'(%,1$+0.23%(-)+%-+42.--3+52367&.#8&+9'21&:-';

More information

6.830 Lecture Spark 11/15/2017

6.830 Lecture Spark 11/15/2017 6.830 Lecture 19 -- Spark 11/15/2017 Recap / finish dynamo Sloppy Quorum (healthy N) Dynamo authors don't think quorums are sufficient, for 2 reasons: - Decreased durability (want to write all data at

More information

Chapter 18: Parallel Databases

Chapter 18: Parallel Databases Chapter 18: Parallel Databases Introduction Parallel machines are becoming quite common and affordable Prices of microprocessors, memory and disks have dropped sharply Recent desktop computers feature

More information

CSE 344 Final Review. August 16 th

CSE 344 Final Review. August 16 th CSE 344 Final Review August 16 th Final In class on Friday One sheet of notes, front and back cost formulas also provided Practice exam on web site Good luck! Primary Topics Parallel DBs parallel join

More information

Database Systems CSE 414

Database Systems CSE 414 Database Systems CSE 414 Lecture 19: MapReduce (Ch. 20.2) CSE 414 - Fall 2017 1 Announcements HW5 is due tomorrow 11pm HW6 is posted and due Nov. 27 11pm Section Thursday on setting up Spark on AWS Create

More information

CompSci 516: Database Systems

CompSci 516: Database Systems CompSci 516 Database Systems Lecture 12 Map-Reduce and Spark Instructor: Sudeepa Roy Duke CS, Fall 2017 CompSci 516: Database Systems 1 Announcements Practice midterm posted on sakai First prepare and

More information

Database Architectures

Database Architectures Database Architectures CPS352: Database Systems Simon Miner Gordon College Last Revised: 4/15/15 Agenda Check-in Parallelism and Distributed Databases Technology Research Project Introduction to NoSQL

More information

MapReduce. Stony Brook University CSE545, Fall 2016

MapReduce. Stony Brook University CSE545, Fall 2016 MapReduce Stony Brook University CSE545, Fall 2016 Classical Data Mining CPU Memory Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining CPU Memory (64 GB) Disk Classical Data Mining

More information

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기

MapReduce: Simplified Data Processing on Large Clusters 유연일민철기 MapReduce: Simplified Data Processing on Large Clusters 유연일민철기 Introduction MapReduce is a programming model and an associated implementation for processing and generating large data set with parallel,

More information

Databases 2 (VU) ( / )

Databases 2 (VU) ( / ) Databases 2 (VU) (706.711 / 707.030) MapReduce (Part 3) Mark Kröll ISDS, TU Graz Nov. 27, 2017 Mark Kröll (ISDS, TU Graz) MapReduce Nov. 27, 2017 1 / 42 Outline 1 Problems Suited for Map-Reduce 2 MapReduce:

More information

Real-time grid computing for financial applications

Real-time grid computing for financial applications CNR-INFM Democritos and EGRID project E-mail: cozzini@democritos.it Riccardo di Meo, Ezio Corso EGRID project ICTP E-mail: {dimeo,ecorso}@egrid.it We describe the porting of a test case financial application

More information

Final Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23

Final Exam Review 2. Kathleen Durant CS 3200 Northeastern University Lecture 23 Final Exam Review 2 Kathleen Durant CS 3200 Northeastern University Lecture 23 QUERY EVALUATION PLAN Representation of a SQL Command SELECT {DISTINCT} FROM {WHERE

More information

Introduction to Map Reduce

Introduction to Map Reduce Introduction to Map Reduce 1 Map Reduce: Motivation We realized that most of our computations involved applying a map operation to each logical record in our input in order to compute a set of intermediate

More information

Chapter 17: Parallel Databases

Chapter 17: Parallel Databases Chapter 17: Parallel Databases Introduction I/O Parallelism Interquery Parallelism Intraquery Parallelism Intraoperation Parallelism Interoperation Parallelism Design of Parallel Systems Database Systems

More information

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems

Announcements. Reading Material. Map Reduce. The Map-Reduce Framework 10/3/17. Big Data. CompSci 516: Database Systems Announcements CompSci 516 Database Systems Lecture 12 - and Spark Practice midterm posted on sakai First prepare and then attempt! Midterm next Wednesday 10/11 in class Closed book/notes, no electronic

More information

Performance Optimization for Informatica Data Services ( Hotfix 3)

Performance Optimization for Informatica Data Services ( Hotfix 3) Performance Optimization for Informatica Data Services (9.5.0-9.6.1 Hotfix 3) 1993-2015 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic,

More information

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation

Voldemort. Smruti R. Sarangi. Department of Computer Science Indian Institute of Technology New Delhi, India. Overview Design Evaluation Voldemort Smruti R. Sarangi Department of Computer Science Indian Institute of Technology New Delhi, India Smruti R. Sarangi Leader Election 1/29 Outline 1 2 3 Smruti R. Sarangi Leader Election 2/29 Data

More information

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2

B.H.GARDI COLLEGE OF ENGINEERING & TECHNOLOGY (MCA Dept.) Parallel Database Database Management System - 2 Introduction :- Today single CPU based architecture is not capable enough for the modern database that are required to handle more demanding and complex requirements of the users, for example, high performance,

More information

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09 Presented by: Daniel Isaacs It all starts with cluster computing. MapReduce Why

More information

Chapter 12: Query Processing

Chapter 12: Query Processing Chapter 12: Query Processing Database System Concepts, 6 th Ed. See www.db-book.com for conditions on re-use Overview Chapter 12: Query Processing Measures of Query Cost Selection Operation Sorting Join

More information

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti

MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti International Journal of Computer Engineering and Applications, ICCSTAR-2016, Special Issue, May.16 MAPREDUCE FOR BIG DATA PROCESSING BASED ON NETWORK TRAFFIC PERFORMANCE Rajeshwari Adrakatti 1 Department

More information

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros

Data Clustering on the Parallel Hadoop MapReduce Model. Dimitrios Verraros Data Clustering on the Parallel Hadoop MapReduce Model Dimitrios Verraros Overview The purpose of this thesis is to implement and benchmark the performance of a parallel K- means clustering algorithm on

More information

Hadoop. copyright 2011 Trainologic LTD

Hadoop. copyright 2011 Trainologic LTD Hadoop Hadoop is a framework for processing large amounts of data in a distributed manner. It can scale up to thousands of machines. It provides high-availability. Provides map-reduce functionality. Hides

More information

TI2736-B Big Data Processing. Claudia Hauff

TI2736-B Big Data Processing. Claudia Hauff TI2736-B Big Data Processing Claudia Hauff ti2736b-ewi@tudelft.nl Intro Streams Streams Map Reduce HDFS Pig Pig Design Patterns Hadoop Ctd. Graphs Giraph Spark Zoo Keeper Spark Learning objectives Implement

More information

Announcements. Database Systems CSE 414. Why compute in parallel? Big Data 10/11/2017. Two Kinds of Parallel Data Processing

Announcements. Database Systems CSE 414. Why compute in parallel? Big Data 10/11/2017. Two Kinds of Parallel Data Processing Announcements Database Systems CSE 414 HW4 is due tomorrow 11pm Lectures 18: Parallel Databases (Ch. 20.1) 1 2 Why compute in parallel? Multi-cores: Most processors have multiple cores This trend will

More information

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins MapReduce 1 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins 2 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce

More information

L3: Spark & RDD. CDS Department of Computational and Data Sciences. Department of Computational and Data Sciences

L3: Spark & RDD. CDS Department of Computational and Data Sciences. Department of Computational and Data Sciences Indian Institute of Science Bangalore, India भ रत य व ज ञ न स स थ न ब गल र, भ रत Department of Computational and Data Sciences L3: Spark & RDD Department of Computational and Data Science, IISc, 2016 This

More information

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM

MULTIPROCESSORS AND THREAD LEVEL PARALLELISM UNIT III MULTIPROCESSORS AND THREAD LEVEL PARALLELISM 1. Symmetric Shared Memory Architectures: The Symmetric Shared Memory Architecture consists of several processors with a single physical memory shared

More information

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17

Announcement. Reading Material. Overview of Query Evaluation. Overview of Query Evaluation. Overview of Query Evaluation 9/26/17 Announcement CompSci 516 Database Systems Lecture 10 Query Evaluation and Join Algorithms Project proposal pdf due on sakai by 5 pm, tomorrow, Thursday 09/27 One per group by any member Instructor: Sudeepa

More information