Adaptive Join Plan Generation in Hadoop


CPS296.1 Course Project

Gang Luo, Duke University, Durham, NC
Liang Dong, Duke University, Durham, NC

ABSTRACT
Joins in Hadoop have always been a problem for its users: the Map/Reduce framework seems to be designed for group-by aggregation tasks rather than across-table operations; on the other hand, join operations in distributed database systems have never been easy, because data location and skew make join strategies hard to optimize. Fragment-replicate join (map join) can be a clever step towards good performance in some cases, but a dangerous move under other circumstances. This paper introduces new techniques used in map join to tackle these issues, and proposes a plan generator for the join types that we currently have.

Categories and Subject Descriptors
H.2 [Database Management]: Plan Generation

General Terms
Theory

Keywords
Hadoop, join operation

1. INTRODUCTION
The amount of data that industry and academia face is already large and keeps increasing, which makes large-scale data processing a pressing issue. Map-Reduce [5] is a popular parallel data processing framework. Its simple programming model allows users to write simple programs that run on hundreds of machines simultaneously, and its fault tolerance makes the system robust even on commodity machines. These features let a Map-Reduce cluster expand to a very large scale by adding commodity machines at low cost, greatly reducing the time consumed by jobs. The open-source implementation of the Map/Reduce framework, Hadoop [2], has attracted much attention ever since it was born.

Even though it seems promising to improve data-processing efficiency by brute-force enlargement of the cluster and running jobs on more nodes, it is a better idea to design sophisticated plans that make good use of the Map-Reduce paradigm while avoiding its side effects as much as possible. As one of the most critical operations in data processing, the join is usually more time-consuming than other kinds of work and thus has a greater impact on overall performance. Joining two datasets in Map-Reduce is basically quite simple, as we will introduce later. But the most obvious join method reads both tables from disk and shuffles all the data over the network to the reducers, so its performance is limited by network speed. When the datasets are too large, network transfer time becomes the bottleneck, lowering the utilization of the computing resources.

With tools built on top of Hadoop, for example Pig [6] or Hive [8], the tedious work of programming in Java can be saved: users write declarative (Hive) or procedural (Pig) queries to perform tasks that would take many lines of code in pure Hadoop. However, neither of these tools has fully addressed the join problem. Pig implements the fragment-replicate join (known as map join in this paper) and a skew join that can handle skewed tables, but the user has to give hints to the compiler indicating which join method the system should use. This is not a good way of handling the problem, since the user may not know what a map join is, or what the underlying data looks like; furthermore, a wrong hint can hurt performance. Building a plan generator that decides smartly which plan to use will make those tools more attractive.
Through our early experience with map join, we have learned that it may need more memory than a map task possesses; the consequence is either extremely slow execution or thrashing map tasks. The advanced join is another approach that can be beneficial, but its overhead should not be neglected. Using the Distributed Cache [1] to copy files to each node could be a potential performance improvement, but our preliminary experiments suggested otherwise, which is one issue this paper tries to resolve. Beyond that, our work focuses on extending the original map join implementation to handle more cases. Most importantly, we propose a cost-based plan generator for efficient joins in Hadoop, which determines what join plan should be used for better performance.

2. CLASSIC JOIN IN HADOOP
In this section, we briefly introduce the implementation of the typical join approach, named the default join for the rest of the paper, and then move on to map join.

2.1 Default/Reduce Join
The default join is a 2-way join widely used in Map-Reduce that complies with the Map-Reduce spirit fairly well. It is the most intuitive strategy and the default one in tools such as Pig and Hive. We also call it the reduce join later in the paper, because the join is performed in the reduce phase, in contrast with the map join.

Given two tables, the default join first reads different parts of the two tables into different Mappers. For each record, the join attribute is extracted as the key of an intermediate record, with a file tag (indicating which table the record comes from) and the other necessary attributes as the value. The intermediate records are shuffled and sent to the Reducers, so each Reducer only receives tuples with the same join key. The records in each Reducer are grouped by tag, so that records from the two tables are separated; the Cartesian product of the two groups is the final result.

The default join works well in most situations. One exception is when both tables are huge: a lot of data is transferred over the network from Mappers to Reducers. Without considering early projection or any selection based on user predicates, the amount of data transferred over the network is the combined size of R and S. Since network transfer is the bottleneck in this case, one potential solution is to filter the source datasets in the map phase and drop those records that cannot possibly join in the reduce phase. This is exactly the advanced join we propose, introduced later in this section.

2.2 Map Join
The map join improves on the default join by eliminating the reduce phase, and with it the transfer of data over the network between the map and reduce phases. The gain becomes obvious when one of the tables is small. The input to the Hadoop job is only one table (the fragment side), and one map task is initialized for each split of that table, which accomplishes the fragmenting step. Each map task then reads the other table (the duplicate side) as a whole and performs the join locally. The name fragment-replicate join comes from fragmenting one table to be processed in a distributed fashion while replicating the other table for each fragment; the name map join indicates that the join runs in the map phase only, eliminating the cost of sorting and of shuffling over the network.

Because the map join has to read the whole duplicate table S into every Mapper, it is very efficient for joining a huge table with a small table, but disastrous when both tables are huge: reading the duplicate side into every map task costs far more than is saved by not transferring the other table over the network. The original implementation of map join loads the duplicate table into a hash table built in memory, and then probes this hash table to do the join. As can be observed, this implementation will not work if the duplicate table cannot be loaded into memory.
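To make the mechanism concrete, the following is a minimal map join sketch (our illustration, not the original implementation), written against the org.apache.hadoop.mapreduce API. The configuration key join.duplicate.path, the tab-delimited row layout, and the one-value-per-key assumption are ours.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map join sketch: setup() loads the duplicate side into an in-memory hash
    // table; map() streams the fragment split and probes it. Map-only job.
    public class MapJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, String> duplicate = new HashMap<>();
        private final Text outKey = new Text();  // reused across records (see Section 3.4)
        private final Text outVal = new Text();

        @Override
        protected void setup(Context ctx) throws IOException {
            // Hypothetical configuration key naming the duplicate-side file on HDFS.
            Path dup = new Path(ctx.getConfiguration().get("join.duplicate.path"));
            FileSystem fs = dup.getFileSystem(ctx.getConfiguration());
            try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(dup)))) {
                String line;
                while ((line = in.readLine()) != null) {  // assumes "key \t rest" rows
                    String[] f = line.split("\t", 2);
                    duplicate.put(f[0], f[1]);            // fails once this exceeds the heap
                }
            }
        }

        @Override
        protected void map(LongWritable offset, Text record, Context ctx)
                throws IOException, InterruptedException {
            String[] f = record.toString().split("\t", 2);
            String match = duplicate.get(f[0]);           // probe the hash table
            if (match != null) {
                outKey.set(f[0]);
                outVal.set(f[1] + "\t" + match);
                ctx.write(outKey, outVal);                // joined tuple; no reduce phase
            }
        }
    }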
This memory limit is explored and addressed later in this paper.

2.3 Distributed Cache
The Distributed Cache is a way of distributing common data that is shared and accessed by all the map tasks of a Hadoop job. Before launching the job, we assign the duplicate side to the Distributed Cache, so that it is copied to the local file system of every node that has task(s) to run. In the map phase, each Mapper then just reads from the local file system, rather than using an HDFS call to read the data (most likely from another node). The merit of the Distributed Cache is collapsing multiple accesses to the duplicate table into one per node: the Distributed Cache copies the duplicate table only ONCE per node, whereas with several map tasks running on each node, reading the duplicate table directly through HDFS clearly causes more network communication instead of local I/O. Using the Distributed Cache is therefore one way of improving map join performance, and it can also be handy for the advanced join discussed next. However, the real benefit of the Distributed Cache remains to be analyzed with controlled experiments.

2.4 Advanced Join
The advanced join is based on the idea of filtering the tables before transferring them over the network. It relies on one prerequisite operation, the semi join, which performs a normal join on the join key only. In other words, the semi join extracts only the join key in the map phase, and its output is a table of keys. The semi join aims to find all the distinct join key values shared by both tables.

The semi join proceeds as follows. First, the two tables are read by map tasks, where the join key value of each record is extracted as the key of the intermediate result, with a tag (indicating which table the record comes from) as the value. After shuffling, all records from both tables with the same join key value are sent to the same Reducer. Each Reducer scans its input looking for different tags. If it finds different tags, both tables contain records with this join key value, and the Reducer outputs the key. If all the tags are the same after scanning the input for a key, only one table contains records with that join key value, and the Reducer outputs nothing, since the key will not be used in the join. When all the Reducers finish, we have all the distinct join key values that can contribute to the result, which is exactly what is needed to filter out useless records for this join.
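As an illustration of this procedure, here is a minimal semi join sketch (ours, not the paper's code; each class would live in its own source file). Deriving the tag from the input file name, the hypothetical "R"/"S" naming convention, and the tab-delimited layout are assumptions.

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // SemiJoinMapper.java: emit (joinKey, tableTag) for every record.
    public class SemiJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Text outKey = new Text();
        private final Text tag = new Text();

        @Override
        protected void setup(Context ctx) {
            // Tag records by source file; assumes one table per input path.
            String file = ((FileSplit) ctx.getInputSplit()).getPath().getName();
            tag.set(file.startsWith("R") ? "R" : "S");
        }

        @Override
        protected void map(LongWritable offset, Text record, Context ctx)
                throws IOException, InterruptedException {
            outKey.set(record.toString().split("\t", 2)[0]);  // join key only
            ctx.write(outKey, tag);
        }
    }

    // SemiJoinReducer.java: output a key only if it was seen with both tags.
    class SemiJoinReducer extends Reducer<Text, Text, Text, NullWritable> {
        @Override
        protected void reduce(Text key, Iterable<Text> tags, Context ctx)
                throws IOException, InterruptedException {
            boolean sawR = false, sawS = false;
            for (Text t : tags) {
                if ("R".equals(t.toString())) sawR = true; else sawS = true;
                if (sawR && sawS) break;              // both tables contain this key
            }
            if (sawR && sawS) ctx.write(key, NullWritable.get());
        }
    }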

The semi join makes sense even for a primary key/foreign key join. Consider the following scenario: the join key is a primary key in one table, and in the other table it is a foreign key referencing the first. Then we cannot gain much performance through the semi join, since most tuples will likely find join partners, and the semi join will not purge a significant number of tuples. Even if there is a selection over the key, we should simply apply the selection in the map phase; the semi join seems inapplicable to this scenario. However, if we consider a selection on attributes that are not the join key, then after applying the selection we can use a semi join to purge now-useless tuples from the other table. Compared with transferring the whole second table over the network, the semi join still makes sense.

The advanced join is then straightforward: using this list of useful keys, it filters the tables before doing the actual join. This strategy can be applied to both the default join and the map join, getting rid of useless tuples ahead of the actual join. One simple way to filter an original table would be to map-join it against the list of useful join keys, but performing the filtering on the fly while doing the actual join is more efficient. The advanced join can also be one solution to insufficient memory in the map join: filtering may shrink the duplicate side enough for the map join to succeed. From a rough comparison, the advanced join saves some network communication (more when the selectivity is low); however, the overhead of the semi join and the filtering cannot be neglected.

3. IMPROVING MAP JOIN
The map join is fast: it transfers only the duplicate table, chosen to be the smaller one, while the large table need not be moved; it has no reduce phase, saving Hadoop a lot of work; and it does not sort the join output (Hadoop does not sort map output when there is no reduce task). However, it fails when the duplicate table becomes too large to fit into the memory allocated to a map task. One remedy would be to increase the heap size, but obviously that cannot be done every time the duplicate table grows. One of our proposed solutions sidesteps the memory limit altogether: it employs JDBM to store the large duplicate table on disk, so that memory is no longer an issue. Our three approaches to a map join that can handle a larger duplicate table are described below.

3.1 Multi-phase Map Join
The mechanism of the multi-phase map join is straightforward. Since the original map join can only take a duplicate table as large as memory can hold, it makes sense to read as much data as fits, stop when memory becomes full, and perform the map join on that part of the data; by repeating this process, the map join is performed multiple times, hence the name multi-phase map join. Although the multi-phase map join solves the problem, it is a brute-force strategy rather than an elegant one: when the size of the duplicate table doubles, the processing time of the multi-phase map join doubles.
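A driver for this strategy might look like the following sketch (our reading, under the assumption that the duplicate table has already been split into memory-sized chunk files); it reuses the hypothetical MapJoinMapper sketched in Section 2.2.

    import java.util.List;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    // Multi-phase driver sketch: one map-only join job per chunk of the duplicate table.
    public class MultiPhaseMapJoin {
        public static void run(Configuration conf, Path fragmentTable,
                               List<Path> duplicateChunks, Path outDir) throws Exception {
            int phase = 0;
            for (Path chunk : duplicateChunks) {           // each chunk fits in task memory
                Job job = Job.getInstance(conf, "map-join phase " + phase);
                job.setJarByClass(MapJoinMapper.class);
                job.setMapperClass(MapJoinMapper.class);
                job.setNumReduceTasks(0);                  // map only: no shuffle, no sort
                job.getConfiguration().set("join.duplicate.path", chunk.toString());
                FileInputFormat.addInputPath(job, fragmentTable);   // fragment side re-read every phase
                FileOutputFormat.setOutputPath(job, new Path(outDir, "phase-" + phase));
                job.waitForCompletion(true);
                phase++;
            }
        }
    }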
This doubling seems reasonable at first glance; however, it is a real performance problem. Suppose the fragment side is 100 GB and the duplicate side is 1 GB. Doubling the duplicate table doubles the time of the multi-phase map join, whereas in the default join, where both tables are transferred over the network, the same increase adds very little processing time. Another observation is that the multi-phase map join reads the fragment side multiple times in each map task; essentially this is an equivalent statement to the one made above. But it raises the following question: since the fragment side is read multiple times, why not store each fragment in memory? This leads to the next new type of map join.

3.2 Reversed Map Join
Continuing the thought above: after storing the fragment in memory, the duplicate table can be read sequentially, while probing the in-memory hash table of the fragment and performing the join. This strategy is the opposite of building a hash table of the duplicate table and probing it for each fragment tuple, hence the name reversed map join.

One doubt about the reversed map join is how large the fragment is, and whether it can be loaded into memory. Hadoop typically uses one chunk of an HDFS file as the input of a map task, so the fragment here is one HDFS block/chunk. The default chunk size of Hadoop is 64 MB, and normally a Hadoop administrator will allocate far more memory than that to a map/reduce task (note that this is per task, not per node; each node usually has several map/reduce slots running tasks concurrently). In other words, the fragment will most likely fit into memory, but the implementation of the reversed map join should be careful not to run into undesired states.

From the performance perspective, the reversed map join has the same power as the default map join: the network communication remains exactly the same, and local processing behaves the same. The fragment table is not read twice when the size of the duplicate table doubles. However, implementing the reversed map join complicates the story: the fragment hash table is built while executing the map method during the map phase, and once map finishes (the building of the hash table completes), the map task ends. So when should the duplicate table be read and the join performed? We found that before ending a map task, Hadoop calls a close method that can be overridden in user code, so all the remaining operations are placed in close. This bypasses the conveniences provided by Hadoop and should not be considered an elegant way of using it.
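A minimal sketch of this pattern follows (our illustration; in the newer mapreduce API, the close() hook is named cleanup()). As before, the configuration key join.duplicate.path and the tab-delimited layout are assumptions.

    import java.io.BufferedReader;
    import java.io.IOException;
    import java.io.InputStreamReader;
    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Reversed map join sketch: map() only builds the hash table of the fragment
    // split; cleanup() (close() in the old API) streams the duplicate side and probes it.
    public class ReversedMapJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
        private final Map<String, List<String>> fragment = new HashMap<>();
        private final Text outKey = new Text();
        private final Text outVal = new Text();

        @Override
        protected void map(LongWritable offset, Text record, Context ctx) {
            String[] f = record.toString().split("\t", 2);
            fragment.computeIfAbsent(f[0], k -> new ArrayList<>()).add(f[1]);
        }

        @Override
        protected void cleanup(Context ctx) throws IOException, InterruptedException {
            Path dup = new Path(ctx.getConfiguration().get("join.duplicate.path"));
            FileSystem fs = dup.getFileSystem(ctx.getConfiguration());
            try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(dup)))) {
                String line;
                while ((line = in.readLine()) != null) {       // one sequential pass
                    String[] d = line.split("\t", 2);
                    List<String> matches = fragment.get(d[0]); // probe the fragment table
                    if (matches == null) continue;
                    outKey.set(d[0]);
                    for (String m : matches) {
                        outVal.set(m + "\t" + d[1]);
                        ctx.write(outKey, outVal);             // output emitted from cleanup()
                    }
                }
            }
        }
    }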

3.3 JDBM-based Map Join
JDBM [4] is a transactional persistence engine for Java. It aims to be a fast and simple persistence engine that can store a mix of objects and BLOBs, with all updates done in a transactionally safe manner. As the name suggests, the JDBM-based map join uses JDBM to store the hash table, so that memory is no longer an issue. The question to ask about the JDBM-based map join is the efficiency of look-ups in a hash table that may reside on disk. Furthermore, introducing an intermediate layer reduces our control over the performance of such a join plan.

3.4 Other Techniques
One reason the default map join runs out of memory is that Java's hash table implementation takes much more space than the actual data. For example, according to our experiments, a 100 MB table stored as a hash table in Java may require around 300 MB of memory. Storing the data in an ArrayList would save much space, but compromises performance; an in-memory structure that balances storage consumption against performance would be desirable.

Java's lazy memory management is another cause of the issue. However, this can be avoided to some extent. For example, the following code exhibits a bad habit of programming in Java:

    output.collect(key, new Text("Hello World"));

A better way of doing this:

    Text outText = new Text();
    ...
    outText.set("Hello World");
    output.collect(key, outText);

Instead of creating a new object for every output record, the second version recycles the same object. The benefit is more obvious when the output is large, which is usually the case in Hadoop jobs.

4. EXPERIMENTS
We run our experiments on the Illinois Cloud Computing Testbed [3]. The cluster used for Hadoop consists of 64 HP DL160 nodes connected by a high speed network; each node is equipped with two quad-core CPUs, 16 GB of RAM and 2 TB of storage. We use TPC-DS [7] datasets scaling from 10 GB to 1 TB. The input split size for map tasks is 128 MB in all experiments. For the semi join, the number of reduce tasks is 10; for the other reduce-side joins, the number of reduce tasks is …

4.1 Comparison between Advanced Joins and Default Joins
Figure 1 shows the performance of the advanced map-side/reduce-side joins and the default map-side/reduce-side joins. The total time of an advanced join consists of the time consumed by the semi join, the filtering (for the map-side join only) and the actual join. As can be seen, in all scenarios the advanced joins perform worse than their default versions, which strongly goes against our expectation. The philosophy behind the filtering is to reduce the amount of data transferred over the network, since network I/O is usually the bottleneck of a MapReduce workload. However, the cluster we experiment on has an extremely fast network, so we see no large benefit from the filtering even when it eliminates as much as 50% of the data. Instead, the overhead caused by the extra MapReduce jobs (semi join and table filtering) degrades performance significantly. Consequently, the semi join and filtering pay off only when the (semi-)selectivity is extremely low, e.g. when 90% of the data would be eliminated, a situation that may not occur frequently. Furthermore, figuring out the selectivity beforehand is difficult and itself introduces overhead (in fact, the semi join is precisely a way to compute the selectivity). Thus, in a cluster with such a fast network, the advanced joins are not options for our adaptive joins.

4.2 Performance of the Distributed Cache
The Distributed Cache reduces the total amount of data fed to the tasks by ensuring that only one copy of the data is sent to each node, shared by all the tasks running on that node.
However, this benefit comes at the cost of the overhead of copying the data from HDFS to local disk. As can be seen in Figure 1, only in rare scenarios does a join with the Distributed Cache outperform its counterpart without it. This poor performance has two causes. First, the number of map tasks launched for a join job is not large enough to show the power of the Distributed Cache. The largest number of map tasks in our experiments is 288, meaning the replicated table for a map join without the Distributed Cache has to be duplicated 288 times, one copy per map task. In a cluster with a high speed network, the benefit of reducing the amount of data transferred over the network is insignificant, while the overhead of the Distributed Cache is dominant. Second, the size of the cluster is 64 nodes, so a job using the Distributed Cache distributes as many as 64 copies of the data. Obviously, the smaller the cluster, the fewer copies need to be distributed and the better the Distributed Cache performs. Still, a trend can be observed: as the large table grows, which also increases the number of map tasks, the Distributed Cache versions draw closer to the versions without it. We therefore still argue that the Distributed Cache is of great help when more map tasks are involved. Distributing the large table to a smaller number of nodes could also show its power.

4.3 Comparison between Default Map Joins and Default Reduce Joins
Figure 2 shows the performance of the map join series and the reduce join series. Due to the drawbacks of the Distributed Cache discussed in Section 4.2, we skip the map joins with the Distributed Cache. As can be seen, the default map join performs better than the default reduce-side join. Even the advanced map join, which launches one more MapReduce job for table filtering, outperforms the advanced reduce-side join. This result is straightforward, since map joins eliminate the data transfer at the shuffle stage and the sorting at the reduce stage. However, no upper bound is guaranteed on the amount of data shipped to each map task: a map join without the Distributed Cache may launch enough map tasks to ship a tremendous amount of data to each of them, degrading performance significantly.

Figure 1: Comparison between Advanced Joins and Default Joins

The limitation of map joins is also shown in Figure 2. When the replicated table is large enough, the default map join fails with an out-of-memory exception. Even though filtering with the semi join result helps to extend the usability of the map join, we get the same exception when still larger tables come.

4.4 Performance of the New Map Joins
Figure 2 also shows the performance of the three new map joins we implemented; the default map join and the default reduce-side join are shown for comparison. The Multi-phase Map Join (MMJ) launches one phase when the scale factor of the replicated table is 50 or less. In this case, MMJ is exactly the same as the default map join (DMJ), and its performance is close to DMJ's. When the scale factor increases to 70, MMJ splits the replicated table into two parts and launches two MapReduce jobs, each joining one part with the fragmented table. The performance still ties with DMJ, as the overhead of one more MapReduce job has no significant impact. However, as the scale factor keeps growing, performance decreases sharply, for two reasons. First, the increased amount of data shipped to more map tasks places a great burden on the network and degrades performance even when the number of phases stays the same. Second, the number of phases keeps increasing as the scale factor of the replicated table grows: there are 4 and 6 phases for the 300*200 and 1000*200 joins, respectively. In those cases, the summed overhead of the MapReduce jobs makes the performance extremely bad.

The Reversed Map Join (RMJ) and the JDBM Map Join (JMJ) perform badly when joining tables at small scale; as discussed before, the adoption of the Distributed Cache is the main factor in this scenario. However, as the tables grow, the performance of these two approaches draws closer to DMJ's, and RMJ outperforms all the other approaches at 300*200 and 1000*200. Note that JMJ fails at 300*200 and 1000*200, mainly due to an improper configuration of JDBM. We believe JMJ should perform well if configured properly, since it shows better performance than RMJ on small joins.

4.5 Result Analysis Summary
Our experiments show some important properties of the various join approaches in a MapReduce environment. One of the most critical is the performance of the Distributed Cache, which can introduce great overhead if not used properly. Some of our experiments show that the delay of distributing the files to each node is proportional to the number of files. Based on our results, the performance of the Distributed Cache depends heavily on the network environment and the scale of the join task: in a scenario where the network is slow and the join task is large in terms of number of map tasks and table size, the Distributed Cache can bring more benefit. Our results also show that the advanced joins, which use a semi join to filter the tables, only work as expected when the semi-selectivity is extremely low and the network is slow. The other observation is that map joins are usually better than reduce-side joins, since they eliminate the extra stages for data shuffling and sorting; their performance under various conditions depends on the proper use of the Distributed Cache.

Figure 2: Performance of new map joins

As the results show, the three new map joins we implemented extend the usability of the default map join. However, MMJ does not scale well due to the overhead of multiple MapReduce jobs, and JMJ may sometimes fail due to an improper configuration. RMJ seems a promising substitute for the default map join at large scale. The last but not least conclusion is that the default reduce-side join achieves acceptable performance, close to the default map join in some situations, especially in a fast network environment.

5. ADAPTIVE JOIN PLAN GENERATION
5.1 Plan Space
Although we implemented nine join approaches, not all of them are included in the plan space. Based on the experimental results and the discussion in Section 4, the default map join is the best option in most cases, and its Distributed Cache version may perform well for large join tasks; both are join candidates in the plan space. We also include the default reduce-side join, since it scales well and achieves acceptable performance. The advanced joins are eliminated from the plan space, since in no case does any of them outperform its default counterpart. We also eliminate MMJ and JMJ: MMJ does not scale well, while JMJ still has unsolved configuration problems. RMJ is kept in the plan space, since it scales well and performs best in certain conditions.

5.2 Cost Model

Table 1: Parameters adopted by the cost model

    Parameter                    Symbol
    Number of distributed files  d
    Network speed                v
    Number of map tasks          m
    Number of reduce tasks       r
    Number of working nodes      n
    Small table size             s
    Large table size             l

Table 1 lists the parameters adopted by the cost models.

Distributed Cache cost model: for joins with and without the Distributed Cache, the cost model focuses on two aspects, the amount of data transferred over the network and the overhead. The costs of a map join with and without the Distributed Cache are:

    (n * s) / v + α * d    (with Distributed Cache)
    (m * s) / v            (without Distributed Cache)

Note that in the first cost model, the coefficient α is the average overhead of distributing one file. Based on our knowledge and the experimental results, this overhead is about 20 seconds per file.
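Read as code, the decision this model implies can be sketched as follows; the parameter names mirror Table 1 (s in MB, v in MB/s), and α = 20 seconds per file is the paper's empirical value.

    // Sketch of the Distributed Cache decision derived from the cost model above.
    class JoinCostModel {
        static final double ALPHA = 20.0;  // seconds per distributed file (Section 5.2)

        // d: distributed files, n: nodes, m: map tasks (Table 1)
        static double mapJoinCost(boolean withCache, double s, double v, int n, int m, int d) {
            return withCache ? (n * s) / v + ALPHA * d  // one copy per node + setup overhead
                             : (m * s) / v;             // one copy per map task over the network
        }

        static boolean useDistributedCache(double s, double v, int n, int m, int d) {
            return mapJoinCost(true, s, v, n, m, d) < mapJoinCost(false, s, v, n, m, d);
        }
    }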

Default map join / reversed map join cost:

    (n * s) / v + α * d    (with Distributed Cache)
    (m * s) / v            (without Distributed Cache)

Both the default map join and the reversed map join share the same cost model; the choice between them is based on a rule, as discussed later.

Default reduce-side join cost:

    (s + l) * β,  where β = f(v, r)

Since the reduce-side join needs to shuffle and sort the data, we use f(v, r) to describe the cost of these two operations per unit of data, as a function of the network speed and the number of reducers.

5.3 Join Plan Generation
We implemented join plan generation in both a cost-based and a rule-based manner. Essentially, the choice among all the join approaches in the plan space can be simplified to the choice between a map join and a reduce-side join. Since the default reduce join is the only reduce-side option, the work here is to choose a map join plan. Based on the previous discussion, whether to use the Distributed Cache has a great impact on performance, so the first step in plan generation is to determine whether or not to use the Distributed Cache, based on the cost model proposed above. Next, the desirable map join variant is determined. As a substitute for DMJ, RMJ performs well on large join tasks where DMJ fails; but in no case in our experiments does RMJ outperform DMJ when DMJ works, so we give DMJ our preference. If the replicated table is small enough to fit into memory, we choose DMJ as the representative map join; otherwise RMJ is the choice. The last step is to compare the chosen map join with the reduce-side join and select the option with the lower cost for execution.

6. SUMMARY
Put briefly, our contributions are:

- verifying our guess about the performance of map join
- showing that the advanced join suffers from its overhead
- examining the performance of the Distributed Cache
- extending the default map join to larger duplicate tables
- comparing the performance of the new map join types
- providing a cost-based plan generator that picks from the join plan space we have so far

The advantage of map join over reduce join is self-evident, and we have confirmed it through our experiments; our plan generator therefore prefers map join, as long as it actually works. The Distributed Cache, on the other hand, did not prove its value in our experiments, since almost all Distributed Cache versions of map join perform worse than the default versions. However, the argument made in favor of the Distributed Cache is still valid, and its use should always be taken into consideration, even though it might not actually improve performance.

The memory problem of map join is well resolved by the three new join types. With the reversed map join outperforming MMJ and JMJ in most cases, we believe RMJ should replace DMJ for handling larger duplicate tables. The advanced join types were shown to suffer from the overhead imposed by the semi join and the filtering; one reason they fail to compete with the default join types is the tables we used: there was no case where the selectivity of joining two tables was low enough to show the value of the advanced join. A perfect plan generator would take this into consideration and pick the advanced join when the selectivity becomes low enough. But knowing the selectivity before performing the join is difficult, and it is not the focus of this paper.
The observation established here is that the advanced join cannot rival the default join in performance under normal circumstances (of join selectivity). The cost-based plan generator rests on some assumptions made while analyzing the experimental results, and we have strong reason to believe those assumptions about actual performance are valid. At the same time, the overhead of picking a good join plan is minor, since the whole process is based only on basic information about the tables and the configuration of Hadoop; no actual data is read from the tables. However, not reading (or at least sampling) the data should be considered a disadvantage for a perfect plan generator: with knowledge of join selectivity and data skew, further optimization would be possible. But again, this is beyond the scope of this paper.

7. FUTURE WORK
Joining tables is the most time-consuming operation in data analysis, and optimizing the join operation will bring significant benefits to the data processing dataflow. In this section, we discuss some challenging issues we did not tackle in this paper.

1. Integration of join into SPJA workflows. Our current work focuses on a join task individually. However, a join is usually embedded in a complete selection-projection-join-aggregation workflow, where the other operations may greatly influence the join planning, or vice versa. To achieve this, we need to consider the whole picture instead of an individual step.

2. Better plan generators. The cost model we currently propose is based on rules extracted from our experimental results. However, the cluster we use for experiments is not representative enough, due to its unusually fast network; this property eliminates some join candidates, e.g. the advanced join, from our plan space that may perform well on other clusters. Furthermore, figuring out the selectivity beforehand is difficult and will certainly introduce some

overhead. How to efficiently sample and collect statistics for the plan generator is critical to choosing the optimum plan. For a more general plan space and generator, we need to understand better how the cluster and the data properties impact join performance, and develop a true cost model.

3. Adaptive parameter tuning. The parameter configuration of a MapReduce job is also critical to performance. For example, a large number of reducers can be beneficial in terms of parallelism, but may also introduce more overhead when the Distributed Cache is applied to the results generated by those reducers. Our current implementations of the various joins are not properly tuned, which introduces inaccuracy into the choice of join candidates for the plan space and the cost model.

8. ACKNOWLEDGMENTS
The authors would like to thank Prof. Jun Yang for the opportunity provided by this outstanding course, as well as for his guidance and comments on our work. We would also like to thank Prof. Shivnath Babu for helpful discussions and for access to the Illinois cluster.

9. REFERENCES
[1] Distributed Cache. tutorial.html#distributedcache.
[2] Hadoop. http://hadoop.apache.org/.
[3] Illinois Cloud Computing Testbed.
[4] JDBM.
[5] J. Dean and S. Ghemawat. MapReduce: simplified data processing on large clusters. Commun. ACM, 51(1):107-113, 2008.
[6] A. F. Gates, O. Natkovich, S. Chopra, P. Kamath, S. M. Narayanamurthy, C. Olston, B. Reed, S. Srinivasan, and U. Srivastava. Building a high-level dataflow system on top of Map-Reduce: the Pig experience. Proc. VLDB Endow., 2(2):1414-1425, 2009.
[7] M. Poess, R. O. Nambiar, and D. Walrath. Why you should run TPC-DS: a workload analysis. In VLDB '07: Proceedings of the 33rd International Conference on Very Large Data Bases, pages 1138-1149. VLDB Endowment, 2007.
[8] A. Thusoo, J. S. Sarma, N. Jain, Z. Shao, P. Chakka, S. Anthony, H. Liu, P. Wyckoff, and R. Murthy. Hive: a warehousing solution over a Map-Reduce framework. Proc. VLDB Endow., 2(2):1626-1629, 2009.


More information

Advanced Database Systems

Advanced Database Systems Lecture IV Query Processing Kyumars Sheykh Esmaili Basic Steps in Query Processing 2 Query Optimization Many equivalent execution plans Choosing the best one Based on Heuristics, Cost Will be discussed

More information

Advanced Databases. Lecture 15- Parallel Databases (continued) Masood Niazi Torshiz Islamic Azad University- Mashhad Branch

Advanced Databases. Lecture 15- Parallel Databases (continued) Masood Niazi Torshiz Islamic Azad University- Mashhad Branch Advanced Databases Lecture 15- Parallel Databases (continued) Masood Niazi Torshiz Islamic Azad University- Mashhad Branch www.mniazi.ir Parallel Join The join operation requires pairs of tuples to be

More information

Correlation based File Prefetching Approach for Hadoop

Correlation based File Prefetching Approach for Hadoop IEEE 2nd International Conference on Cloud Computing Technology and Science Correlation based File Prefetching Approach for Hadoop Bo Dong 1, Xiao Zhong 2, Qinghua Zheng 1, Lirong Jian 2, Jian Liu 1, Jie

More information

CHAPTER 4 ROUND ROBIN PARTITIONING

CHAPTER 4 ROUND ROBIN PARTITIONING 79 CHAPTER 4 ROUND ROBIN PARTITIONING 4.1 INTRODUCTION The Hadoop Distributed File System (HDFS) is constructed to store immensely colossal data sets accurately and to send those data sets at huge bandwidth

More information

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism

Parallel DBMS. Parallel Database Systems. PDBS vs Distributed DBS. Types of Parallelism. Goals and Metrics Speedup. Types of Parallelism Parallel DBMS Parallel Database Systems CS5225 Parallel DB 1 Uniprocessor technology has reached its limit Difficult to build machines powerful enough to meet the CPU and I/O demands of DBMS serving large

More information

data parallelism Chris Olston Yahoo! Research

data parallelism Chris Olston Yahoo! Research data parallelism Chris Olston Yahoo! Research set-oriented computation data management operations tend to be set-oriented, e.g.: apply f() to each member of a set compute intersection of two sets easy

More information

IV Statistical Modelling of MapReduce Joins

IV Statistical Modelling of MapReduce Joins IV Statistical Modelling of MapReduce Joins In this chapter, we will also explain each component used while constructing our statistical model such as: The construction of the dataset used. The use of

More information

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1

MapReduce-II. September 2013 Alberto Abelló & Oscar Romero 1 MapReduce-II September 2013 Alberto Abelló & Oscar Romero 1 Knowledge objectives 1. Enumerate the different kind of processes in the MapReduce framework 2. Explain the information kept in the master 3.

More information

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0.

Best Practices. Deploying Optim Performance Manager in large scale environments. IBM Optim Performance Manager Extended Edition V4.1.0. IBM Optim Performance Manager Extended Edition V4.1.0.1 Best Practices Deploying Optim Performance Manager in large scale environments Ute Baumbach (bmb@de.ibm.com) Optim Performance Manager Development

More information

Introduction to Big-Data

Introduction to Big-Data Introduction to Big-Data Ms.N.D.Sonwane 1, Mr.S.P.Taley 2 1 Assistant Professor, Computer Science & Engineering, DBACER, Maharashtra, India 2 Assistant Professor, Information Technology, DBACER, Maharashtra,

More information

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs

Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09. Presented by: Daniel Isaacs Andrew Pavlo, Erik Paulson, Alexander Rasin, Daniel Abadi, David DeWitt, Samuel Madden, and Michael Stonebraker SIGMOD'09 Presented by: Daniel Isaacs It all starts with cluster computing. MapReduce Why

More information

Developing MapReduce Programs

Developing MapReduce Programs Cloud Computing Developing MapReduce Programs Dell Zhang Birkbeck, University of London 2017/18 MapReduce Algorithm Design MapReduce: Recap Programmers must specify two functions: map (k, v) * Takes

More information

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA)

Introduction to MapReduce. Adapted from Jimmy Lin (U. Maryland, USA) Introduction to MapReduce Adapted from Jimmy Lin (U. Maryland, USA) Motivation Overview Need for handling big data New programming paradigm Review of functional programming mapreduce uses this abstraction

More information

Survey on MapReduce Scheduling Algorithms

Survey on MapReduce Scheduling Algorithms Survey on MapReduce Scheduling Algorithms Liya Thomas, Mtech Student, Department of CSE, SCTCE,TVM Syama R, Assistant Professor Department of CSE, SCTCE,TVM ABSTRACT MapReduce is a programming model used

More information

Map Reduce. Yerevan.

Map Reduce. Yerevan. Map Reduce Erasmus+ @ Yerevan dacosta@irit.fr Divide and conquer at PaaS 100 % // Typical problem Iterate over a large number of records Extract something of interest from each Shuffle and sort intermediate

More information

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures*

Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Optimizing the use of the Hard Disk in MapReduce Frameworks for Multi-core Architectures* Tharso Ferreira 1, Antonio Espinosa 1, Juan Carlos Moure 2 and Porfidio Hernández 2 Computer Architecture and Operating

More information

Introduction to MapReduce

Introduction to MapReduce 732A54 Big Data Analytics Introduction to MapReduce Christoph Kessler IDA, Linköping University Towards Parallel Processing of Big-Data Big Data too large to be read+processed in reasonable time by 1 server

More information

File System Performance (and Abstractions) Kevin Webb Swarthmore College April 5, 2018

File System Performance (and Abstractions) Kevin Webb Swarthmore College April 5, 2018 File System Performance (and Abstractions) Kevin Webb Swarthmore College April 5, 2018 Today s Goals Supporting multiple file systems in one name space. Schedulers not just for CPUs, but disks too! Caching

More information

SQL-to-MapReduce Translation for Efficient OLAP Query Processing

SQL-to-MapReduce Translation for Efficient OLAP Query Processing , pp.61-70 http://dx.doi.org/10.14257/ijdta.2017.10.6.05 SQL-to-MapReduce Translation for Efficient OLAP Query Processing with MapReduce Hyeon Gyu Kim Department of Computer Engineering, Sahmyook University,

More information

SQL Tuning Reading Recent Data Fast

SQL Tuning Reading Recent Data Fast SQL Tuning Reading Recent Data Fast Dan Tow singingsql.com Introduction Time is the key to SQL tuning, in two respects: Query execution time is the key measure of a tuned query, the only measure that matters

More information

Parallel DBMS. Prof. Yanlei Diao. University of Massachusetts Amherst. Slides Courtesy of R. Ramakrishnan and J. Gehrke

Parallel DBMS. Prof. Yanlei Diao. University of Massachusetts Amherst. Slides Courtesy of R. Ramakrishnan and J. Gehrke Parallel DBMS Prof. Yanlei Diao University of Massachusetts Amherst Slides Courtesy of R. Ramakrishnan and J. Gehrke I. Parallel Databases 101 Rise of parallel databases: late 80 s Architecture: shared-nothing

More information

Map-Reduce (PFP Lecture 12) John Hughes

Map-Reduce (PFP Lecture 12) John Hughes Map-Reduce (PFP Lecture 12) John Hughes The Problem 850TB in 2006 The Solution? Thousands of commodity computers networked together 1,000 computers 850GB each How to make them work together? Early Days

More information

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc

FuxiSort. Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc Fuxi Jiamang Wang, Yongjun Wu, Hua Cai, Zhipeng Tang, Zhiqiang Lv, Bin Lu, Yangyu Tao, Chao Li, Jingren Zhou, Hong Tang Alibaba Group Inc {jiamang.wang, yongjun.wyj, hua.caihua, zhipeng.tzp, zhiqiang.lv,

More information

Performance Optimization for Informatica Data Services ( Hotfix 3)

Performance Optimization for Informatica Data Services ( Hotfix 3) Performance Optimization for Informatica Data Services (9.5.0-9.6.1 Hotfix 3) 1993-2015 Informatica Corporation. No part of this document may be reproduced or transmitted in any form, by any means (electronic,

More information

CMSC424: Database Design. Instructor: Amol Deshpande

CMSC424: Database Design. Instructor: Amol Deshpande CMSC424: Database Design Instructor: Amol Deshpande amol@cs.umd.edu Databases Data Models Conceptual representa1on of the data Data Retrieval How to ask ques1ons of the database How to answer those ques1ons

More information

Large Scale OLAP. Yifu Huang. 2014/11/4 MAST Scientific English Writing Report

Large Scale OLAP. Yifu Huang. 2014/11/4 MAST Scientific English Writing Report Large Scale OLAP Yifu Huang 2014/11/4 MAST612117 Scientific English Writing Report 2014 1 Preliminaries OLAP On-Line Analytical Processing Traditional solutions: data warehouses built by parallel databases

More information

CA485 Ray Walshe Google File System

CA485 Ray Walshe Google File System Google File System Overview Google File System is scalable, distributed file system on inexpensive commodity hardware that provides: Fault Tolerance File system runs on hundreds or thousands of storage

More information

Dept. Of Computer Science, Colorado State University

Dept. Of Computer Science, Colorado State University CS 455: INTRODUCTION TO DISTRIBUTED SYSTEMS [HADOOP/HDFS] Trying to have your cake and eat it too Each phase pines for tasks with locality and their numbers on a tether Alas within a phase, you get one,

More information

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark

CSE 544 Principles of Database Management Systems. Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark CSE 544 Principles of Database Management Systems Alvin Cheung Fall 2015 Lecture 10 Parallel Programming Models: Map Reduce and Spark Announcements HW2 due this Thursday AWS accounts Any success? Feel

More information

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins

Outline. Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins MapReduce 1 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce Algorithm Evaluation Computing Joins 2 Outline Distributed File System Map-Reduce The Computational Model Map-Reduce

More information