Adaptive Join Plan Generation in Hadoop


CPS296.1 Course Project

Gang Luo, Duke University, Durham, NC
Liang Dong, Duke University, Durham, NC

ABSTRACT
Joins in Hadoop have always been a problem for its users: the Map/Reduce framework seems to be designed for group-by aggregation tasks rather than across-table operations; on the other hand, join operations in distributed database systems have never been easy, because data location and skew make join strategies hard to optimize. Fragment-replicate join (map join) can be a clever step towards good performance in some cases, but a dangerous move under other circumstances. This paper introduces new techniques used in map join to tackle these issues, and proposes a plan generator for the join types that we currently have.

Categories and Subject Descriptors
H.2 [Database Management]: Plan Generation

General Terms
Theory

Keywords
Hadoop, join operation

1. INTRODUCTION
The amount of data that industry and academia face is already large and keeps increasing, which makes large-scale data processing a pressing issue. Map-Reduce [5] is a popular parallel data processing framework. Its simple programming model allows users to write simple programs that run on hundreds of machines simultaneously, and its fault tolerance makes the system robust even on commodity machines. These features let a Map-Reduce cluster expand to a very large scale by adding commodity machines at low cost, greatly reducing the time consumed by jobs. The open-source implementation of the Map/Reduce framework, Hadoop [2], has attracted much attention ever since it was born.

Even though it seems promising to improve data-processing efficiency by brute-force enlargement of the cluster and running jobs on more nodes, it is a better idea to design sophisticated plans that make good use of the Map-Reduce paradigm while avoiding its side effects as much as possible. As one of the most critical operations in data processing, the join is usually more time-consuming than other kinds of work and thus has a greater impact on overall performance. Joining two datasets in Map-Reduce is basically quite simple, as we will introduce later. But the most obvious join method reads both tables from disk and shuffles all the data over the network to the reducers, so its performance is limited by network speed. When the datasets are too large, network transfer time becomes the bottleneck, lowering the utilization of the computing resources.

With tools built on top of Hadoop, for example Pig [6] or Hive [8], the tedious work of programming in Java can be saved: users write declarative (Hive) or procedural (Pig) queries to perform tasks that would take many lines of code in pure Hadoop. However, neither of these tools has fully addressed the join problem. Pig implements the fragment-replicate join (known as map join in this paper) and a skew join that can handle skewed tables, but the user has to give hints to the compiler indicating which join method the system should use. This is not a good way of handling the problem, since the user may not know what a map join is, or what the underlying data looks like; furthermore, a wrong hint can hurt performance. Building a plan generator that decides smartly which plan to use will make those tools more attractive.
Through our early experience with map join, we have learned that it may need more memory than a map task possesses; the consequence is either extremely slow execution or thrashing map tasks. The advanced join is another approach that can be beneficial, but its overhead should not be neglected. Using the Distributed Cache [1] to copy files to each node could be a potential performance improvement, but our preliminary experiments suggested otherwise, which is one issue this paper tries to resolve. Beyond that, our work focuses on extending the original map join implementation to handle more cases. Most importantly, we propose a cost-based plan generator for efficient joins in Hadoop, which determines what join plan should be used for better performance.

2. CLASSIC JOIN IN HADOOP
In this section, we briefly introduce the implementation of the typical join approach, named the default join for the rest of the paper, and then move on to map join.

2.1 Default/Reduce Join
The default join is a 2-way join widely used in Map-Reduce that complies with the Map-Reduce spirit fairly well. It is the most intuitive strategy and the default one in tools such as Pig and Hive. We also call it the reduce join later in the paper, because the join is performed in the reduce phase, in contrast with the map join.

Given two tables, the default join first reads different parts of the two tables into different Mappers. For each record, the join attribute is extracted as the key of an intermediate record, with a file tag (indicating which table the record comes from) and the other necessary attributes as the value. The intermediate records are shuffled and sent to the Reducers, so each Reducer only receives tuples with the same join key. The records in each Reducer are grouped by tag, so that records from the two tables are separated; the Cartesian product of the two groups is the final result.

The default join works well in most situations. One exception is when both tables are huge: a lot of data is transferred over the network from Mappers to Reducers. Without considering early projection or any selection based on user predicates, the amount of data transferred over the network is the combined size of R and S. Since network transfer is the bottleneck in this case, one potential solution is to filter the source datasets in the map phase and drop those records that cannot possibly join in the reduce phase. This is exactly the advanced join we propose, introduced later in this section.

2.2 Map Join
The map join improves on the default join by eliminating the reduce phase, and with it the transfer of data over the network between the map and reduce phases. The gain becomes obvious when one of the tables is small. The input to the Hadoop job is only one table (the fragment side), and one map task is initialized for each split of that table, which accomplishes the fragmenting step. Each map task then reads the other table (the duplicate side) as a whole and performs the join locally. The name fragment-replicate join comes from fragmenting one table to be processed in a distributed fashion while replicating the other table for each fragment; the name map join indicates that the join runs in the map phase only, eliminating the cost of sorting and of shuffling over the network.

Because the map join has to read the whole duplicate table S into every Mapper, it is very efficient for joining a huge table with a small table, but disastrous when both tables are huge: reading the duplicate side into every map task costs far more than is saved by not transferring the other table over the network. The original implementation of map join loads the duplicate table into a hash table built in memory, and then probes this hash table to do the join. As can be observed, this implementation will not work if the duplicate table cannot be loaded into memory.
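To make the mechanism concrete, the following is a minimal map join sketch (our illustration, not the original implementation), written against the org.apache.hadoop.mapreduce API. The configuration key join.duplicate.path, the tab-delimited row layout, and the one-value-per-key assumption are ours.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map join sketch: setup() loads the duplicate side into an in-memory hash
    // table; map() streams the fragment split and probes it. Map-only job.
    public class MapJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> duplicate = new HashMap<>();
        private final Text outKey = new Text();  // reused across records (see Section 3.4)
        private final Text outVal = new Text();

        @Override
        protected void setup(Context ctx) throws IOException {
            // Hypothetical configuration key naming the duplicate-side file on HDFS.
            Path dup = new Path(ctx.getConfiguration().get("join.duplicate.path"));
            FileSystem fs = dup.getFileSystem(ctx.getConfiguration());
            try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(dup)))) {
                String line;
                while ((line = in.readLine()) != null) {  // assumes "key \t rest" rows
                    String[] f = line.split("\t", 2);
                    duplicate.put(f[0], f[1]);            // fails once this exceeds the heap
                }
            }
        }

        @Override
        protected void map(LongWritable offset, Text record, Context ctx)
                throws IOException, InterruptedException {
            String[] f = record.toString().split("\t", 2);
            String match = duplicate.get(f[0]);           // probe the hash table
            if (match != null) {
                outKey.set(f[0]);
                outVal.set(f[1] + "\t" + match);
                ctx.write(outKey, outVal);                // joined tuple; no reduce phase
            }
        }
    }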
This memory limit is explored and addressed later in this paper.

2.3 Distributed Cache
The Distributed Cache is a way of distributing common data that is shared and accessed by all the map tasks of a Hadoop job. Before launching the job, we assign the duplicate side to the Distributed Cache, so that it is copied to the local file system of every node that has task(s) to run. In the map phase, each Mapper then just reads from the local file system, rather than using an HDFS call to read the data (most likely from another node). The merit of the Distributed Cache is collapsing multiple accesses to the duplicate table into one per node: the Distributed Cache copies the duplicate table only ONCE per node, whereas with several map tasks running on each node, reading the duplicate table directly through HDFS clearly causes more network communication instead of local I/O. Using the Distributed Cache is therefore one way of improving map join performance, and it can also be handy for the advanced join discussed next. However, the real benefit of the Distributed Cache remains to be analyzed with controlled experiments.

2.4 Advanced Join
The advanced join is based on the idea of filtering the tables before transferring them over the network. It relies on one prerequisite operation, the semi join, which performs a normal join on the join key only. In other words, the semi join extracts only the join key in the map phase, and its output is a table of keys. The semi join aims to find all the distinct join key values shared by both tables.

The semi join proceeds as follows. First, the two tables are read by map tasks, where the join key value of each record is extracted as the key of the intermediate result, with a tag (indicating which table the record comes from) as the value. After shuffling, all records from both tables with the same join key value are sent to the same Reducer. Each Reducer scans its input looking for different tags. If it finds different tags, both tables contain records with this join key value, and the Reducer outputs the key. If all the tags are the same after scanning the input for a key, only one table contains records with that join key value, and the Reducer outputs nothing, since the key will not be used in the join. When all the Reducers finish, we have all the distinct join key values that can contribute to the result, which is exactly what is needed to filter out useless records for this join.
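As an illustration of this procedure, here is a minimal semi join sketch (ours, not the paper's code; each class would live in its own source file). Deriving the tag from the input file name, the hypothetical "R"/"S" naming convention, and the tab-delimited layout are assumptions.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // SemiJoinMapper.java: emit (joinKey, tableTag) for every record.
    public class SemiJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text outKey = new Text();
        private final Text tag = new Text();

        @Override
        protected void setup(Context ctx) {
            // Tag records by source file; assumes one table per input path.
            String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
            tag.set(file.startsWith("R") ? "R" : "S");
        }

        @Override
        protected void map(LongWritable offset, Text record, Context ctx)
                throws IOException, InterruptedException {
            outKey.set(record.toString().split("\t", 2)[0]);  // join key only
            ctx.write(outKey, tag);
        }
    }

    // SemiJoinReducer.java: output a key only if it was seen with both tags.
    class SemiJoinReducer extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> tags, Context ctx)
                throws IOException, InterruptedException {
            boolean sawR = false, sawS = false;
            for (Text t : tags) {
                if ("R".equals(t.toString())) sawR = true; else sawS = true;
                if (sawR && sawS) break;              // both tables contain this key
            }
            if (sawR && sawS) ctx.write(key, NullWritable.get());
        }
    }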

The semi join makes sense even for a primary key/foreign key join. Consider the following scenario: the join key is a primary key in one table, and in the other table it is a foreign key referencing the first. Then we cannot gain much performance through the semi join, since most tuples will likely find join partners, and the semi join will not purge a significant number of tuples. Even if there is a selection over the key, we should simply apply the selection in the map phase; the semi join seems inapplicable to this scenario. However, if we consider a selection on attributes that are not the join key, then after applying the selection we can use a semi join to purge now-useless tuples from the other table. Compared with transferring the whole second table over the network, the semi join still makes sense.

The advanced join is then straightforward: using this list of useful keys, it filters the tables before doing the actual join. This strategy can be applied to both the default join and the map join, getting rid of useless tuples ahead of the actual join. One simple way to filter an original table would be to map-join it against the list of useful join keys, but performing the filtering on the fly while doing the actual join is more efficient. The advanced join can also be one solution to insufficient memory in the map join: filtering may shrink the duplicate side enough for the map join to succeed. From a rough comparison, the advanced join saves some network communication (more when the selectivity is low); however, the overhead of the semi join and the filtering cannot be neglected.

3. IMPROVING MAP JOIN
The map join is fast: it transfers only the duplicate table, chosen to be the smaller one, while the large table need not be moved; it has no reduce phase, saving Hadoop a lot of work; and it does not sort the join output (Hadoop does not sort map output when there is no reduce task). However, it fails when the duplicate table becomes too large to fit into the memory allocated to a map task. One remedy would be to increase the heap size, but obviously that cannot be done every time the duplicate table grows. One of our proposed solutions sidesteps the memory limit altogether: it employs JDBM to store the large duplicate table on disk, so that memory is no longer an issue. Our three approaches to a map join that can handle a larger duplicate table are described below.

3.1 Multi-phase Map Join
The mechanism of the multi-phase map join is straightforward. Since the original map join can only take a duplicate table as large as memory can hold, it makes sense to read as much data as fits, stop when memory becomes full, and perform the map join on that part of the data; by repeating this process, the map join is performed multiple times, hence the name multi-phase map join. Although the multi-phase map join solves the problem, it is a brute-force strategy rather than an elegant one: when the size of the duplicate table doubles, the processing time of the multi-phase map join doubles.
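A driver for this strategy might look like the following sketch (our reading, under the assumption that the duplicate table has already been split into memory-sized chunk files); it reuses the hypothetical MapJoinMapper sketched in Section 2.2.

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Multi-phase driver sketch: one map-only join job per chunk of the duplicate table.
    public class MultiPhaseMapJoin {
        public static void run(Configuration conf, Path fragmentTable,
                               List<Path> duplicateChunks, Path outDir) throws Exception {
            int phase = 0;
            for (Path chunk : duplicateChunks) {           // each chunk fits in task memory
                Job job = Job.getInstance(conf, "map-join phase " + phase);
                job.setJarByClass(MapJoinMapper.class);
                job.setMapperClass(MapJoinMapper.class);
                job.setNumReduceTasks(0);                  // map only: no shuffle, no sort
                job.getConfiguration().set("join.duplicate.path", chunk.toString());
                FileInputFormat.addInputPath(job, fragmentTable);   // fragment side re-read every phase
                FileOutputFormat.setOutputPath(job, new Path(outDir, "phase-" + phase));
                job.waitForCompletion(true);
                phase++;
            }
        }
    }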
This doubling seems reasonable at first glance; however, it is a real performance problem. Suppose the fragment side is 100 GB and the duplicate side is 1 GB. Doubling the duplicate table doubles the time of the multi-phase map join, whereas in the default join, where both tables are transferred over the network, the same increase adds very little processing time. Another observation is that the multi-phase map join reads the fragment side multiple times in each map task; essentially this is an equivalent statement to the one made above. But it raises the following question: since the fragment side is read multiple times, why not store each fragment in memory? This leads to the next new type of map join.

3.2 Reversed Map Join
Continuing the thought above: after storing the fragment in memory, the duplicate table can be read sequentially, while probing the in-memory hash table of the fragment and performing the join. This strategy is the opposite of building a hash table of the duplicate table and probing it for each fragment tuple, hence the name reversed map join.

One doubt about the reversed map join is how large the fragment is, and whether it can be loaded into memory. Hadoop typically uses one chunk of an HDFS file as the input of a map task, so the fragment here is one HDFS block/chunk. The default chunk size of Hadoop is 64 MB, and normally a Hadoop administrator will allocate far more memory than that to a map/reduce task (note that this is per task, not per node; each node usually has several map/reduce slots running tasks concurrently). In other words, the fragment will most likely fit into memory, but the implementation of the reversed map join should be careful not to run into undesired states.

From the performance perspective, the reversed map join has the same power as the default map join: the network communication remains exactly the same, and local processing behaves the same. The fragment table is not read twice when the size of the duplicate table doubles. However, implementing the reversed map join complicates the story: the fragment hash table is built while executing the map method during the map phase, and once map finishes (the building of the hash table completes), the map task ends. So when should the duplicate table be read and the join performed? We found that before ending a map task, Hadoop calls a close method that can be overridden in user code, so all the remaining operations are placed in close. This bypasses the conveniences provided by Hadoop and should not be considered an elegant way of using it.
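A minimal sketch of this pattern follows (our illustration; in the newer mapreduce API, the close() hook is named cleanup()). As before, the configuration key join.duplicate.path and the tab-delimited layout are assumptions.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Reversed map join sketch: map() only builds the hash table of the fragment
    // split; cleanup() (close() in the old API) streams the duplicate side and probes it.
    public class ReversedMapJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, List<String>> fragment = new HashMap<>();
        private final Text outKey = new Text();
        private final Text outVal = new Text();

        @Override
        protected void map(LongWritable offset, Text record, Context ctx) {
            String[] f = record.toString().split("\t", 2);
            fragment.computeIfAbsent(f[0], k -> new ArrayList<>()).add(f[1]);
        }

        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            Path dup = new Path(ctx.getConfiguration().get("join.duplicate.path"));
            FileSystem fs = dup.getFileSystem(ctx.getConfiguration());
            try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(dup)))) {
                String line;
                while ((line = in.readLine()) != null) {       // one sequential pass
                    String[] d = line.split("\t", 2);
                    List<String> matches = fragment.get(d[0]); // probe the fragment table
                    if (matches == null) continue;
                    outKey.set(d[0]);
                    for (String m : matches) {
                        outVal.set(m + "\t" + d[1]);
                        ctx.write(outKey, outVal);             // output emitted from cleanup()
                    }
                }
            }
        }
    }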

3.3 JDBM-based Map Join
JDBM [4] is a transactional persistence engine for Java. It aims to be a fast and simple persistence engine that can store a mix of objects and BLOBs, with all updates done in a transactionally safe manner. As the name suggests, the JDBM-based map join uses JDBM to store the hash table, so that memory is no longer an issue. The question to ask about the JDBM-based map join is the efficiency of look-ups in a hash table that may reside on disk. Furthermore, introducing an intermediate layer reduces our control over the performance of such a join plan.

3.4 Other Techniques
One reason the default map join runs out of memory is that Java's hash table implementation takes much more space than the actual data. For example, according to our experiments, a 100 MB table stored as a hash table in Java may require around 300 MB of memory. Storing the data in an ArrayList would save much space, but compromises performance; an in-memory structure that balances storage consumption against performance would be desirable.

Java's lazy memory management is another cause of the issue. However, this can be avoided to some extent. For example, the following code exhibits a bad habit of programming in Java:

    output.collect(key, new Text("Hello World"));

A better way of doing this:

    Text outText = new Text();
    ...
    outText.set("Hello World");
    output.collect(key, outText);

Instead of creating a new object for every output record, the second version recycles the same object. The benefit is more obvious when the output is large, which is usually the case in Hadoop jobs.

4. EXPERIMENTS
We run our experiments on the Illinois Cloud Computing Testbed [3]. The cluster used for Hadoop consists of 64 HP DL160 nodes connected by a high speed network; each node is equipped with two quad-core CPUs, 16 GB of RAM and 2 TB of storage. We use TPC-DS [7] datasets scaling from 10 GB to 1 TB. The input split size for map tasks is 128 MB in all experiments. For the semi join, the number of reduce tasks is 10; for the other reduce-side joins, the number of reduce tasks is …

4.1 Comparison between Advanced Joins and Default Joins
Figure 1 shows the performance of the advanced map-side/reduce-side joins and the default map-side/reduce-side joins. The total time of an advanced join consists of the time consumed by the semi join, the filtering (for the map-side join only) and the actual join. As can be seen, in all scenarios the advanced joins perform worse than their default versions, which strongly goes against our expectation. The philosophy behind the filtering is to reduce the amount of data transferred over the network, since network I/O is usually the bottleneck of a MapReduce workload. However, the cluster we experiment on has an extremely fast network, so we see no large benefit from the filtering even when it eliminates as much as 50% of the data. Instead, the overhead caused by the extra MapReduce jobs (semi join and table filtering) degrades performance significantly. Consequently, the semi join and filtering pay off only when the (semi-)selectivity is extremely low, e.g. when 90% of the data would be eliminated, a situation that may not occur frequently. Furthermore, figuring out the selectivity beforehand is difficult and itself introduces overhead (in fact, the semi join is precisely a way to compute the selectivity). Thus, in a cluster with such a fast network, the advanced joins are not options for our adaptive joins.

4.2 Performance of the Distributed Cache
The Distributed Cache reduces the total amount of data fed to the tasks by ensuring that only one copy of the data is sent to each node, shared by all the tasks running on that node.
However, this benefit comes at the cost of the overhead of copying the data from HDFS to local disk. As can be seen in Figure 1, only in rare scenarios does a join with the Distributed Cache outperform its counterpart without it. This poor performance has two causes. First, the number of map tasks launched for a join job is not large enough to show the power of the Distributed Cache. The largest number of map tasks in our experiments is 288, meaning the replicated table for a map join without the Distributed Cache has to be duplicated 288 times, one copy per map task. In a cluster with a high speed network, the benefit of reducing the amount of data transferred over the network is insignificant, while the overhead of the Distributed Cache is dominant. Second, the size of the cluster is 64 nodes, so a job using the Distributed Cache distributes as many as 64 copies of the data. Obviously, the smaller the cluster, the fewer copies need to be distributed and the better the Distributed Cache performs. Still, a trend can be observed: as the large table grows, which also increases the number of map tasks, the Distributed Cache versions draw closer to the versions without it. We therefore still argue that the Distributed Cache is of great help when more map tasks are involved. Distributing the large table to a smaller number of nodes could also show its power.

4.3 Comparison between Default Map Joins and Default Reduce Joins
Figure 2 shows the performance of the map join series and the reduce join series. Due to the drawbacks of the Distributed Cache discussed in Section 4.2, we skip the map joins with the Distributed Cache. As can be seen, the default map join performs better than the default reduce-side join. Even the advanced map join, which launches one more MapReduce job for table filtering, outperforms the advanced reduce-side join. This result is straightforward, since map joins eliminate the data transfer at the shuffle stage and the sorting at the reduce stage. However, no upper bound is guaranteed on the amount of data shipped to each map task: a map join without the Distributed Cache may launch enough map tasks to ship a tremendous amount of data to each of them, degrading performance significantly.

Figure 1: Comparison between Advanced Joins and Default Joins

The limitation of map joins is also shown in Figure 2. When the replicated table is large enough, the default map join fails with an out-of-memory exception. Even though filtering with the semi join result helps to extend the usability of the map join, we get the same exception when still larger tables come.

4.4 Performance of the New Map Joins
Figure 2 also shows the performance of the three new map joins we implemented; the default map join and the default reduce-side join are shown for comparison. The Multi-phase Map Join (MMJ) launches one phase when the scale factor of the replicated table is 50 or less. In this case, MMJ is exactly the same as the default map join (DMJ), and its performance is close to DMJ's. When the scale factor increases to 70, MMJ splits the replicated table into two parts and launches two MapReduce jobs, each joining one part with the fragmented table. The performance still ties with DMJ, as the overhead of one more MapReduce job has no significant impact. However, as the scale factor keeps growing, performance decreases sharply, for two reasons. First, the increased amount of data shipped to more map tasks places a great burden on the network and degrades performance even when the number of phases stays the same. Second, the number of phases keeps increasing as the scale factor of the replicated table grows: there are 4 and 6 phases for the 300*200 and 1000*200 joins, respectively. In those cases, the summed overhead of the MapReduce jobs makes the performance extremely bad.

The Reversed Map Join (RMJ) and the JDBM Map Join (JMJ) perform badly when joining tables at small scale; as discussed before, the adoption of the Distributed Cache is the main factor in this scenario. However, as the tables grow, the performance of these two approaches draws closer to DMJ's, and RMJ outperforms all the other approaches at 300*200 and 1000*200. Note that JMJ fails at 300*200 and 1000*200, mainly due to an improper configuration of JDBM. We believe JMJ should perform well if configured properly, since it shows better performance than RMJ on small joins.

4.5 Result Analysis Summary
Our experiments show some important properties of the various join approaches in a MapReduce environment. One of the most critical is the performance of the Distributed Cache, which can introduce great overhead if not used properly. Some of our experiments show that the delay of distributing the files to each node is proportional to the number of files. Based on our results, the performance of the Distributed Cache depends heavily on the network environment and the scale of the join task: in a scenario where the network is slow and the join task is large in terms of number of map tasks and table size, the Distributed Cache can bring more benefit. Our results also show that the advanced joins, which use a semi join to filter the tables, only work as expected when the semi-selectivity is extremely low and the network is slow. The other observation is that map joins are usually better than reduce-side joins, since they eliminate the extra stages for data shuffling and sorting; their performance under various conditions depends on the proper use of the Distributed Cache.

Figure 2: Performance of new map joins

As the results show, the three new map joins we implemented extend the usability of the default map join. However, MMJ does not scale well due to the overhead of multiple MapReduce jobs, and JMJ may sometimes fail due to an improper configuration. RMJ seems a promising substitute for the default map join at large scale. The last but not least conclusion is that the default reduce-side join achieves acceptable performance, close to the default map join in some situations, especially in a fast network environment.

5. ADAPTIVE JOIN PLAN GENERATION
5.1 Plan Space
Although we implemented nine join approaches, not all of them are included in the plan space. Based on the experimental results and the discussion in Section 4, the default map join is the best option in most cases, and its Distributed Cache version may perform well for large join tasks; both are join candidates in the plan space. We also include the default reduce-side join, since it scales well and achieves acceptable performance. The advanced joins are eliminated from the plan space, since in no case does any of them outperform its default counterpart. We also eliminate MMJ and JMJ: MMJ does not scale well, while JMJ still has unsolved configuration problems. RMJ is kept in the plan space, since it scales well and performs best in certain conditions.

5.2 Cost Model

Table 1: Parameters adopted by the cost model

    Parameter                    Symbol
    Number of distributed files  d
    Network speed                v
    Number of map tasks          m
    Number of reduce tasks       r
    Number of working nodes      n
    Small table size             s
    Large table size             l

Table 1 lists the parameters adopted by the cost models.

Distributed Cache cost model: for joins with and without the Distributed Cache, the cost model focuses on two aspects, the amount of data transferred over the network and the overhead. The costs of a map join with and without the Distributed Cache are:

    (n * s) / v + α * d    (with Distributed Cache)
    (m * s) / v            (without Distributed Cache)

Note that in the first cost model, the coefficient α is the average overhead of distributing one file. Based on our knowledge and the experimental results, this overhead is about 20 seconds per file.
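Read as code, the decision this model implies can be sketched as follows; the parameter names mirror Table 1 (s in MB, v in MB/s), and α = 20 seconds per file is the paper's empirical value.

    // Sketch of the Distributed Cache decision derived from the cost model above.
    class JoinCostModel {
        static final double ALPHA = 20.0;  // seconds per distributed file (Section 5.2)

        // d: distributed files, n: nodes, m: map tasks (Table 1)
        static double mapJoinCost(boolean withCache, double s, double v, int n, int m, int d) {
            return withCache ? (n * s) / v + ALPHA * d  // one copy per node + setup overhead
                             : (m * s) / v;             // one copy per map task over the network
        }

        static boolean useDistributedCache(double s, double v, int n, int m, int d) {
            return mapJoinCost(true, s, v, n, m, d) < mapJoinCost(false, s, v, n, m, d);
        }
    }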

Default map join / reversed map join cost:

    (n * s) / v + α * d    (with Distributed Cache)
    (m * s) / v            (without Distributed Cache)

Both the default map join and the reversed map join share the same cost model; the choice between them is based on a rule, as discussed later.

Default reduce-side join cost:

    (s + l) * β,  where β = f(v, r)

Since the reduce-side join needs to shuffle and sort the data, we use f(v, r) to describe the cost of these two operations per unit of data, as a function of the network speed and the number of reducers.

5.3 Join Plan Generation
We implemented join plan generation in both a cost-based and a rule-based manner. Essentially, the choice among all the join approaches in the plan space can be simplified to the choice between a map join and a reduce-side join. Since the default reduce join is the only reduce-side option, the work here is to choose a map join plan. Based on the previous discussion, whether to use the Distributed Cache has a great impact on performance, so the first step in plan generation is to determine whether or not to use the Distributed Cache, based on the cost model proposed above. Next, the desirable map join variant is determined. As a substitute for DMJ, RMJ performs well on large join tasks where DMJ fails; but in no case in our experiments does RMJ outperform DMJ when DMJ works, so we give DMJ our preference. If the replicated table is small enough to fit into memory, we choose DMJ as the representative map join; otherwise RMJ is the choice. The last step is to compare the chosen map join with the reduce-side join and select the option with the lower cost for execution.

6. SUMMARY
Put briefly, our contributions are:

- verifying our guess about the performance of map join
- showing that the advanced join suffers from its overhead
- examining the performance of the Distributed Cache
- extending the default map join to larger duplicate tables
- comparing the performance of the new map join types
- providing a cost-based plan generator that picks from the join plan space we have so far

The advantage of map join over reduce join is self-evident, and we have confirmed it through our experiments; our plan generator therefore prefers map join, as long as it actually works. The Distributed Cache, on the other hand, did not prove its value in our experiments, since almost all Distributed Cache versions of map join perform worse than the default versions. However, the argument made in favor of the Distributed Cache is still valid, and its use should always be taken into consideration, even though it might not actually improve performance.

The memory problem of map join is well resolved by the three new join types. With the reversed map join outperforming MMJ and JMJ in most cases, we believe RMJ should replace DMJ for handling larger duplicate tables. The advanced join types were shown to suffer from the overhead imposed by the semi join and the filtering; one reason they fail to compete with the default join types is the tables we used: there was no case where the selectivity of joining two tables was low enough to show the value of the advanced join. A perfect plan generator would take this into consideration and pick the advanced join when the selectivity becomes low enough. But knowing the selectivity before performing the join is difficult, and it is not the focus of this paper.
The observation established here is that the advanced join cannot rival the default join in performance under normal circumstances (of join selectivity). The cost-based plan generator rests on some assumptions made while analyzing the experimental results, and we have strong reason to believe those assumptions about actual performance are valid. At the same time, the overhead of picking a good join plan is minor, since the whole process is based only on basic information about the tables and the configuration of Hadoop; no actual data is read from the tables. However, not reading (or at least sampling) the data should be considered a disadvantage for a perfect plan generator: with knowledge of join selectivity and data skew, further optimization would be possible. But again, this is beyond the scope of this paper.

7. FUTURE WORK
Joining tables is the most time-consuming operation in data analysis, and optimizing the join operation will bring significant benefits to the data processing dataflow. In this section, we discuss some challenging issues we did not tackle in this paper.

1. Integration of join into SPJA workflows. Our current work focuses on a join task individually. However, a join is usually embedded in a complete selection-projection-join-aggregation workflow, where the other operations may greatly influence the join planning, or vice versa. To achieve this, we need to consider the whole picture instead of an individual step.

2. Better plan generators. The cost model we currently propose is based on rules extracted from our experimental results. However, the cluster we use for experiments is not representative enough, due to its unusually fast network; this property eliminates some join candidates, e.g. the advanced join, from our plan space that may perform well on other clusters. Furthermore, figuring out the selectivity beforehand is difficult and will certainly introduce some

overhead. How to efficiently sample and collect statistics for the plan generator is critical to choosing the optimum plan. For a more general plan space and generator, we need to understand better how the cluster and the data properties impact join performance, and develop a true cost model.

3. Adaptive parameter tuning. The parameter configuration of a MapReduce job is also critical to performance. For example, a large number of reducers can be beneficial in terms of parallelism, but may also introduce more overhead when the Distributed Cache is applied to the results generated by those reducers. Our current implementations of the various joins are not properly tuned, which introduces inaccuracy into the choice of join candidates for the plan space and the cost model.

8. ACKNOWLEDGMENTS
The authors would like to thank Prof. Jun Yang for the opportunity provided by this outstanding course, as well as for his guidance and comments on our work. We would also like to thank Prof. Shivnath Babu for helpful discussions and for access to the Illinois cluster.

9. REFERENCES
[1] Distributed Cache. tutorial.html#distributedcache.
[2] Hadoop. http://hadoop.apache.org/.
[3] Illinois Cloud Computing Testbed.
[4] JDBM.
[5] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113, 2008.
[6] A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of Map-Reduce: the Pig experience. Proc. VLDB Endow., 2(2):1414-1425, 2009.
[7] M. Poess, R. O. Nambiar, and D. Walrath. Why you should run TPC-DS: a workload analysis. In VLDB '07: Proceedings of the 33rd International Conference on Very Large Data Bases, pages 1138-1149. VLDB Endowment, 2007.
[8] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: a warehousing solution over a Map-Reduce framework. Proc. VLDB Endow., 2(2):1626-1629, 2009.


More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Advanced Databases. Lecture 15- Parallel Databases (continued) Masood Niazi Torshiz Islamic Azad University- Mashhad Branch

Advanced Databases. Lecture 15- Parallel Databases (continued) Masood Niazi Torshiz Islamic Azad University- Mashhad Branch Advanced Databases Lecture 15- Parallel Databases (continued) Masood Niazi Torshiz Islamic Azad University- Mashhad Branch www.mniazi.ir Parallel Join The join operation requires pairs of tuples to be

More information

Correlation based File Prefetching Approach for Hadoop

Correlation based File Prefetching Approach for Hadoop IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie

More information

CHAPTER 4 ROUND ROBIN PARTITIONING

CHAPTER 4 ROUND ROBIN PARTITIONING 79 CHAPTER 4 ROUND ROBIN PARTITIONING 4.1 INTRODUCTION The Hadoop Distributed File System (HDFS) is constructed to store immensely colossal data sets accurately and to send those data sets at huge bandwidth

More information

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism Parallel DBMS Parallel Database Systems CS5225 Parallel DB 1 Uniprocessor technology has reached its limit Difficult to build machines powerful enough to meet the CPU and I/O demands of DBMS serving large

More information

data parallelism Chris Olston Yahoo! Research

data parallelism Chris Olston Yahoo! Research data parallelism Chris Olston Yahoo! Research set-oriented computation data management operations tend to be set-oriented, e.g.: apply f() to each member of a set compute intersection of two sets easy

More information

IV Statistical Modelling of MapReduce Joins

IV Statistical Modelling of MapReduce Joins IV Statistical Modelling of MapReduce Joins In this chapter, we will also explain each component used while constructing our statistical model such as: The construction of the dataset used. The use of

More information

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1 MapReduce-II September 2013 Alberto Abelló & Oscar Romero 1 Knowledge objectives 1. Enumerate the different kind of processes in the MapReduce framework 2. Explain the information kept in the master 3.

More information

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0.

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0. IBM Optim Performance Manager Extended Edition V4.1.0.1 Best Practices Deploying Optim Performance Manager in large scale environments Ute Baumbach (bmb@de.ibm.com) Optim Performance Manager Development

More information

Introduction to Big-Data

Introduction to Big-Data Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,

More information

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09 Presented by: Daniel Isaacs It all starts with cluster computing. MapReduce Why

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA) Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction

More information

Survey on MapReduce Scheduling Algorithms

Survey on MapReduce Scheduling Algorithms Survey on MapReduce Scheduling Algorithms Liya Thomas, Mtech Student, Department of CSE, SCTCE,TVM Syama R, Assistant Professor Department of CSE, SCTCE,TVM ABSTRACT MapReduce is a programming model used

More information

Map Reduce. Yerevan.

Map Reduce. Yerevan. Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

Introduction to MapReduce

Introduction to MapReduce 732A54 Big Data Analytics Introduction to MapReduce Christoph Kessler IDA, Linköping University Towards Parallel Processing of Big-Data Big Data too large to be read+processed in reasonable time by 1 server

More information

File System Performance (and Abstractions) Kevin Webb Swarthmore College April 5, 2018

File System Performance (and Abstractions) Kevin Webb Swarthmore College April 5, 2018 File System Performance (and Abstractions) Kevin Webb Swarthmore College April 5, 2018 Today s Goals Supporting multiple file systems in one name space. Schedulers not just for CPUs, but disks too! Caching

More information

SQL-to-MapReduce Translation for Efficient OLAP Query Processing

SQL-to-MapReduce Translation for Efficient OLAP Query Processing , pp.61-70 http://dx.doi.org/10.14257/ijdta.2017.10.6.05 SQL-to-MapReduce Translation for Efficient OLAP Query Processing with MapReduce Hyeon Gyu Kim Department of Computer Engineering, Sahmyook University,

More information

SQL Tuning Reading Recent Data Fast

SQL Tuning Reading Recent Data Fast SQL Tuning Reading Recent Data Fast Dan Tow singingsql.com Introduction Time is the key to SQL tuning, in two respects: Query execution time is the key measure of a tuned query, the only measure that matters

More information

Parallel DBMS. Prof. Yanlei Diao. University of Massachusetts Amherst. Slides Courtesy of R. Ramakrishnan and J. Gehrke

Parallel DBMS. Prof. Yanlei Diao. University of Massachusetts Amherst. Slides Courtesy of R. Ramakrishnan and J. Gehrke Parallel DBMS Prof. Yanlei Diao University of Massachusetts Amherst Slides Courtesy of R. Ramakrishnan and J. Gehrke I. Parallel Databases 101 Rise of parallel databases: late 80 s Architecture: shared-nothing

More information

Map-Reduce (PFP Lecture 12) John Hughes

Map-Reduce (PFP Lecture 12) John Hughes Map-Reduce (PFP Lecture 12) John Hughes The Problem 850TB in 2006 The Solution? Thousands of commodity computers networked together 1,000 computers 850GB each How to make them work together? Early Days

More information

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,

More information

Performance Optimization for Informatica Data Services ( Hotfix 3)

Performance Optimization for Informatica Data Services ( Hotfix 3) Performance Optimization for Informatica Data Services (9.5.0-9.6.1 Hotfix 3) 1993-2015 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic,

More information

CMSC424: Database Design. Instructor: Amol Deshpande

CMSC424: Database Design. Instructor: Amol Deshpande CMSC424: Database Design Instructor: Amol Deshpande amol@cs.umd.edu Databases Data Models Conceptual representa1on of the data Data Retrieval How to ask ques1ons of the database How to answer those ques1ons

More information

Large Scale OLAP. Yifu Huang. 2014/11/4 MAST Scientific English Writing Report

Large Scale OLAP. Yifu Huang. 2014/11/4 MAST Scientific English Writing Report Large Scale OLAP Yifu Huang 2014/11/4 MAST612117 Scientific English Writing Report 2014 1 Preliminaries OLAP On-Line Analytical Processing Traditional solutions: data warehouses built by parallel databases

More information

CA485 Ray Walshe Google File System

CA485 Ray Walshe Google File System Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage

More information

Dept. Of Computer Science, Colorado State University

Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins MapReduce 1 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins 2 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce

More information