Evaluating Data Storage Structures of Map Reduce

The 8th International Conference on Computer Science & Education (ICCSE 2013), April 26-28, Colombo, Sri Lanka. MoB3.2

Evaluating Data Storage Structures of MapReduce

Haiming Lai, Ming Xu, Jian Xu, Yizhi Ren, Ning Zheng
College of Computer, Hangzhou Dianzi University, Hangzhou, China
{mxu,

Abstract-The MapReduce framework and its open-source implementation Hadoop, a scalable and fault-tolerant infrastructure for big data analysis on large clusters, can achieve different performance with different data storage structures. This paper evaluates the performance of three kinds of data storage structures of MapReduce, namely row-store, column-store, and RCFile. The experiments are designed to test the three data storage structures in terms of data loading time, data storage space, and query execution time. The experimental results show that the RCFile data storage structure achieves better performance in most cases.

Keywords-MapReduce; data storage structure; row-store; column-store; RCFile

I. INTRODUCTION

We have entered an era of data explosion, where many of the data sets being processed and analyzed are called "big data". Big data not only requires a huge amount of storage, but also demands new data management on large distributed systems, because conventional database systems have difficulty managing big data. The popular MapReduce framework [2] and its open-source implementation Hadoop [3] provide a scalable and fault-tolerant infrastructure for big data analysis on large clusters. However, the performance of MapReduce is still far from ideal in the database context. According to a recent study [4], Hadoop is slower than two state-of-the-art parallel database systems in performing a variety of analytical tasks by a factor of 3.1 to 6.5. In order to achieve better performance, more compute nodes can be allocated to speed up computation; however, this approach is not really cost effective. Three data storage structures are used with MapReduce, namely row-store, column-store, and RCFile.
MapReduce can achieve different performance with different data storage structures. Row-store is the most common data storage structure and stores data by rows. The data storage structures commonly used in Hadoop, TextFile and SequenceFile, both belong to row-store. Column-store is a data storage structure that stores data by columns; it is integrated in the Pig data analysis system. RCFile is a data storage structure that combines many advantages of row-store and column-store. It has been adopted by Hive [5] and Pig [6]. This paper briefly summarizes the major features of these structures for big data and provides a detailed evaluation of the three data storage structures.

The rest of this paper is organized as follows: Section 2 reviews related work. Section 3 presents the MapReduce programming model, the execution flow of a MapReduce job, and several compression formats in Hadoop. Section 4 presents a detailed analysis of the three existing data storage structures. Section 5 describes the evaluating method. Section 6 presents and discusses our benchmark results. Section 7 concludes the paper.

II. RELATED WORKS

MapReduce has become a popular tool for processing large-scale data analytical tasks. However, the performance of MapReduce is still far from ideal in the database context. In [4] [7-8], the authors compared the performance of MapReduce with two parallel database systems. The authors noted that while loading data into the DBMSs and tuning them took much longer than for a MapReduce system, the performance of the parallel DBMSs was significantly better.

So far, three main approaches have been proposed to improve the performance of MapReduce, based on configuration parameters [9], the scheduling algorithm [10], and the data storage structure [12-13]. First, in [9], an extensive experiment was performed to study how the job configuration parameters affect the observed performance of Hadoop.
Second, in [10], the authors investigated the scheduling algorithm of Hadoop and proposed the LATE scheduling algorithm, which improves Hadoop response times by a factor of two. Third, in [11], the authors extended C-Store (a column-oriented DBMS) with a compression sub-system and showed how compression schemes not traditionally used in row-oriented DBMSs can be applied to column-oriented systems. In [12], the authors describe how the column-oriented storage techniques found in many parallel DBMSs can be used to improve Hadoop's performance. In [13], in order to improve the performance of MapReduce, the authors present a data storage structure called RCFile and its implementation in Hadoop.

Furthermore, in [8], the authors presented some techniques to improve the performance of MapReduce, including using a binary record format, indexing, and merging the results. In [14], the authors conducted an in-depth study of MapReduce and identified several design factors that affect the performance of Hadoop. In [15], the authors described the Manimal system for optimizing MapReduce programs.

This paper analyzes and compares three kinds of data storage structures (row-store, column-store, and RCFile) of
MapReduce, and evaluates the performance of the three data storage structures with different compression formats in terms of data loading time, data storage space, and query execution time. The aim is to investigate which data storage structure and which compression format are more suitable for MapReduce.

III. BRIEF INTRODUCTION ABOUT MAPREDUCE

This section briefly introduces MapReduce and the common compression formats used in Hadoop.

A. Map and Reduce Operation

According to [2], MapReduce is a programming model for processing large-scale datasets in computer clusters. The MapReduce programming model consists of two functions, map() and reduce(). The map() function takes an input key/value pair and produces a list of intermediate key/value pairs. The MapReduce runtime system groups together all intermediate pairs based on the intermediate keys and passes them to the reduce() function to produce the final results.

A MapReduce cluster employs a master-slave architecture in which one master node manages a number of slave nodes. In Hadoop, the master node is called the JobTracker and a slave node is called a TaskTracker. Hadoop launches a MapReduce job by first splitting the input dataset into even-sized data blocks. Each data block is then scheduled to one TaskTracker node and is processed by a map task. The task assignment process is implemented as a heartbeat protocol: the TaskTracker node notifies the JobTracker when it is idle, and the scheduler then assigns new tasks to it. The scheduler takes data locality into account when it disseminates data blocks. It always tries to assign a local data block to a TaskTracker first. If the attempt fails, the scheduler assigns a rack-local or random data block to the TaskTracker instead. When the map() functions complete, the runtime system groups all intermediate pairs and launches a set of reduce tasks to produce the final results.

B. Compression Format

Hadoop uses compression to reduce the space needed to store files and to speed up data transfer across the network, or to or from disk. There are many different compression formats, tools and algorithms, each with different characteristics. Table I lists some common compression formats used in Hadoop.

TABLE I. SOME COMMON COMPRESSION FORMATS USED IN HADOOP

Compression format | Tool  | Algorithm | Filename extension | Multiple files | Splittable
DEFLATE            | N/A   | DEFLATE   | .deflate           | No             | No
gzip               | gzip  | DEFLATE   | .gz                | No             | No
bzip2              | bzip2 | bzip2     | .bz2               | No             | Yes
LZO                | lzop  | LZO       | .lzo               | No             | No

All compression algorithms exhibit a space/time trade-off: faster compression and decompression speeds usually come at the expense of smaller space savings. The different tools have very different compression characteristics. DEFLATE is a compression algorithm whose standard implementation is zlib. (Note that the gzip file format is DEFLATE with extra headers and a footer.) The ".deflate" filename extension is a Hadoop convention. Gzip is a general-purpose compressor and sits in the middle of the space/time trade-off. Bzip2 compresses more effectively than gzip, but is slower. Bzip2's decompression speed is faster than its compression speed, but it is still slower than the other formats.

IV. DATA STORE STRUCTURES OF MAPREDUCE

MapReduce has three data storage structures, namely row-store, column-store, and RCFile. This section compares these three data storage structures.

A. Row-store

Row-store is the most common data storage structure of MapReduce and stores data by rows. Records are placed contiguously in a disk page, and all fields of one record are placed one after another in the order of their occurrence. Figure 1 gives an example of how a table is placed by row-store in an HDFS (Hadoop Distributed File System) block. The major advantage of row-store is that it has fast data loading and a strong ability to adapt to dynamic workloads.
This is because row-store guarantees that all fields in the same record are located in the same cluster node, since they are in the same HDFS block. However, row-store has two major weaknesses. First, row-store cannot provide fast query processing, because it reads unnecessary columns when only a subset of the columns in a table is needed by a query. Second, it is not easy for row-store to achieve a high data compression ratio, because columns with different data domains are mixed [16].

[Fig. 1. An Example about Row-store in an HDFS Block. A table with columns C1, C2, C3 is stored as the records (01, 11, 21), (02, 12, 22), (03, 13, 23), (04, 14, 24). Block layout labels: 16 Bytes Sync, Record Number, Compressed Keys Lengths, Compressed Keys Data, Compressed Values Lengths, Compressed Values Data.]

B. Column-store

Column-store is a column-oriented storage model, which stores data tables as sections of columns of data rather than as rows of data. Figure 2 shows an example of how a table is stored by column-store on HDFS. In this example, column C1, column C2 and column C3 are stored as three independent columns. The major advantage of column-store is that it can avoid reading unnecessary columns during query execution, and can easily achieve a high compression ratio by compressing each column within the same data domain. However, the major weakness of column-store is that it cannot provide fast query processing when tuples must be reconstructed, because of the high overhead of tuple reconstruction. Column-store cannot guarantee that all fields in the same record are located in the same cluster node. For instance, in the example in Figure 2, the three fields of a record are stored in three HDFS blocks that can be located in different nodes. Therefore, a record reconstruction will cause a large
amount of data transfers via networks among multiple cluster nodes. As introduced in the original MapReduce paper [2], excessive network transfers in a MapReduce cluster can always be a source of bottlenecks and should be avoided if possible.

[Fig. 2. An Example about Column-store in an HDFS Block. Column 1 (01, 02, 03, 04), Column 2 (11, 12, 13, 14), and Column 3 (21, 22, 23, 24) are stored separately.]

C. RCFile

RCFile [13] (Record Columnar File) is a data storage structure that determines how to store data tables on computer clusters. It is designed for data warehouse systems using the MapReduce framework. The RCFile structure is a systematic combination of multiple components, including a data storage format, a data compression approach, and optimization techniques for data reading. Figure 3 shows an example of how a table is stored by RCFile on HDFS.

RCFile applies the concept of "first horizontally-partition, then vertically-partition" from PAX [17]. It combines the advantages of both row-store and column-store. First, like row-store, RCFile guarantees that data in the same row are located in the same node, so it has a low cost of tuple reconstruction. Second, like column-store, RCFile can exploit column-wise data compression and skip unnecessary column reads. In addition, RCFile provides a lazy decompression technique to avoid unnecessary column decompression during query execution. RCFile allows a flexible row group size. A default size is given considering both data compression performance and query execution performance.

[Fig. 3. An Example about RCFile in an HDFS Block. A table with columns C1, C2, C3 is split into row groups, each preceded by a 16 Bytes Sync marker; within a row group the values are stored column by column (C1: 01, 02, 03, 04; C2: 11, 12, 13, 14; C3: 21, 22, 23, 24).]

V. EVALUATING METHOD

This section presents the method for evaluating the data storage structures. The performance of MapReduce is determined by its data storage structure and data compression format.
So, in order to comprehensively evaluate the effectiveness of different data storage structures, the evaluation should be performed on a dataset stored with different data storage structures and compression formats. The effectiveness of different data storage structures with different data compression formats is evaluated in three aspects: data storage space, data loading time, and query execution time.

1) Data Storage Space

Datasets are becoming bigger and bigger, and how to store such big datasets is a critical problem. Different data storage structures require different amounts of storage space for the same data, so the data storage space is an important factor in evaluating performance.

2) Data Loading Time

Data loading time is an important factor in evaluating a data storage structure. In an era of data explosion, big data often has to be loaded, so data loading time becomes more and more important. A data storage structure with a shorter data loading time is more attractive.

3) Query Execution Time

Query execution time is an important factor for a data storage structure. In order to evaluate the lazy decompression of data storage structures, it is better to execute several queries with different selectivities according to their WHERE conditions.

Hive is a data warehouse system for Hadoop that facilitates easy data summarization, ad-hoc queries, and the analysis of large datasets stored in Hadoop-compatible file systems. Hive provides a mechanism to project structure onto this data and to query the data using a SQL-like language called HiveQL. At present, Hive supports three data storage structures: TextFile, SequenceFile and RCFile. We compared these three data storage structures with different compression formats. At first, the dataset was loaded into Hive using different data storage structures with different compression formats.
The loading times of the different data storage structures were recorded during the loading period. Then, the data storage space of the different data storage structures was read from the HDFS management page of Hadoop. At last, several queries with different selectivities were executed in Hive. The query execution times of the different data storage structures were recorded to evaluate the query execution effectiveness of different data storage structures with different data compression formats.

VI. EXPERIMENT AND RESULTS

This section presents an experiment that evaluates different data storage structures with different compression formats.

A. Experimental Setup

The experiments are run on a cluster with 5 nodes connected by a 1 Gbit Ethernet switch. One node is reserved to run the Hadoop JobTracker and the NameNode. The remaining
nodes are used for HDFS and MapReduce. The NameNode has two 2.20GHz Intel Core processors, 4GB of main memory, and one locally attached 500GB IDE disk. Each of the other nodes has one 2.66GHz Intel Pentium processor, 1GB of main memory, and one locally attached 40GB IDE disk. Hadoop version and Hive version have been used in the experiment.

B. Dataset

The industry standard TPC-H benchmark is used as our dataset. We selected the LINEITEM table from the benchmark and generated an 8.7GB dataset. The generated data is in plain text. Each task has been executed three times and the average of the trials is reported. Each system executes the benchmark tasks separately to ensure exclusive access to the cluster's resources. This paper only reports results from trials where all nodes are available and the system's software operates correctly during the benchmark execution.

C. Evaluation

In the experiment, we examined the loading time and storage space for the LINEITEM table, and the query execution times. The aim is to demonstrate the effectiveness of different data storage structures in the three aspects that follow.

1) Data Storage Space

The generated data are loaded into Hive using different data storage structures. During loading, the data is compressed with a different format for each data storage structure. Figure 4 shows the data storage space required by the different data storage structures. Data compression can significantly reduce the data storage space, and different data storage structures show different compression efficiencies. Several major observations can be made. Row-store has the worst compression efficiency. This is expected, because column-wise data compression is better than row-wise data compression with mixed data domains. Bzip2 has a higher data compression ratio than the other compression formats. The data storage space of compressed data is dramatically less than that of uncompressed data.

[Fig. 4. Data Storage Space. Horizontal axis: Compression Format (Uncompressed, Gzip, Bzip2, DEFLATE). Legend: Textfile, RCFile.]

2) Data Loading Time

Data loading time is an important factor in evaluating a data storage structure. The data loading times have been recorded for the LINEITEM table in the experiment. The results are shown in Figure 5. Several major observations can be made. Among all cases, row-store always has the smallest data loading time. This is because it has the minimum overhead to re-organize records in the raw text file. RCFile is slightly slower than row-store, with comparable performance in practice. This reflects the small overhead of RCFile, since it only needs to re-organize records inside each row group. Uncompressed data has the smallest data loading time, while bzip2 has the largest data loading time among all cases.

[Fig. 5. Data Loading Time. Horizontal axis: Compression Format (Uncompressed, Gzip, Bzip2, DEFLATE).]

3) Query Execution Time

In this section, two queries are executed on the largest table, LINEITEM. LINEITEM has 16 columns, and the two queries only use a small number of them.

Query 1:
    SELECT sum(l_extendedprice * l_discount) as revenue
    FROM lineitem
    WHERE l_shipdate >= ' '
      and l_discount >= 0.02
      and l_quantity > 4;

Query 2:
    SELECT l_returnflag, l_linestatus,
           sum(l_quantity) as sum_qty,
           sum(l_extendedprice) as sum_base_price,
           sum(l_extendedprice * (1 - l_discount)) as sum_disc_price,
           sum(l_extendedprice * (1 - l_discount) * (1 + l_tax)) as sum_charge,
           avg(l_quantity) as avg_qty,
           avg(l_extendedprice) as avg_price,
           avg(l_discount) as avg_disc,
           count(*) as count_order
    FROM lineitem
    WHERE l_shipdate <= ' '
    GROUP BY l_returnflag, l_linestatus
    ORDER BY l_returnflag, l_linestatus;

In order to evaluate the performance of lazy decompression, the two queries were designed to have different selectivities according to their WHERE conditions. It is less than 1% for
Query 1, and about 72% for Query 2. Figures 6 and 7 show the execution times of the two queries with different data storage structures. Several major observations can be made. For Query 1, RCFile outperforms the other data storage structures. This is because the lazy decompression technique in RCFile accelerates query execution when the query selectivity is low. For Query 2, as the selectivity becomes higher, lazy decompression no longer helps. In this case, RCFile still outperforms the other data storage structures, but its advantage becomes smaller. Among all cases, gzip has the fastest query speed, while bzip2 has the slowest query speed.

[Fig. 6. Query Execution Time of Query 1. Horizontal axis: Compression Format (Uncompressed, Gzip, Bzip2, DEFLATE). Legend: Textfile, RCFile.]

[Fig. 7. Query Execution Time of Query 2. Horizontal axis: Compression Format (Uncompressed, Gzip, Bzip2, DEFLATE). Legend: Textfile, RCFile.]

D. Summary

Different data storage structures are compared in three aspects: data loading time, storage space, and query execution time. The results show that each structure has its own merits. Among all cases, RCFile, which adopts advantages of the other structures, outperforms the other data storage structures. Although bzip2 has a higher data compression ratio, it takes too much time to load data and execute queries. Gzip is a general-purpose compressor and sits in the middle of the space/time trade-off.

VII. CONCLUSION

In this paper, we describe and compare three data storage structures, namely row-store, column-store, and RCFile, in the context of large data analysis using MapReduce. We also present some common compression formats in Hadoop. Finally, a benchmark was conducted to evaluate different data storage structures with different compression formats. The results show that each data storage structure has its own merits, and that the RCFile storage structure with the gzip compression format outperforms the other structures in most cases.
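As an illustration of the three structures compared in this paper, the following Python sketch lays out the small example table of Figures 1 to 3 (columns C1, C2, C3) in each structure and shows how a row-group layout lets a reader skip unneeded columns. It is a simplified in-memory model under stated assumptions, not the Hadoop or Hive implementation: the function names and the `rows_per_group` parameter are hypothetical, the row group size is counted in rows for simplicity, and compression, sync markers, and HDFS blocks are left out.

```python
from typing import Dict, List, Sequence, Tuple

Row = Tuple[str, ...]
ColumnData = Dict[str, List[str]]

# Example table taken from the figures: four records, three columns.
COLUMNS = ("C1", "C2", "C3")
TABLE: List[Row] = [
    ("01", "11", "21"),
    ("02", "12", "22"),
    ("03", "13", "23"),
    ("04", "14", "24"),
]


def row_store(rows: Sequence[Row]) -> List[str]:
    """Row-store: the fields of each record are placed one after another."""
    return [field for row in rows for field in row]


def column_store(rows: Sequence[Row], columns: Sequence[str]) -> ColumnData:
    """Column-store: every column is kept as its own independent sequence."""
    return {name: [row[i] for row in rows] for i, name in enumerate(columns)}


def rcfile(rows: Sequence[Row], columns: Sequence[str],
           rows_per_group: int) -> List[ColumnData]:
    """RCFile-style layout: partition horizontally into row groups first,
    then store each row group column by column.

    rows_per_group is a hypothetical knob used only in this sketch.
    """
    return [
        column_store(rows[start:start + rows_per_group], columns)
        for start in range(0, len(rows), rows_per_group)
    ]


def read_columns(groups: Sequence[ColumnData],
                 wanted: Sequence[str]) -> List[Row]:
    """Rebuild records from the wanted columns only; other columns of each
    row group are never touched, which models skipping unnecessary reads."""
    result: List[Row] = []
    for group in groups:
        # All columns of one record live in the same row group, so a record
        # is reconstructed without looking at any other group.
        result.extend(zip(*(group[name] for name in wanted)))
    return result


if __name__ == "__main__":
    print(row_store(TABLE))
    print(column_store(TABLE, COLUMNS))
    groups = rcfile(TABLE, COLUMNS, rows_per_group=2)
    print(groups)
    print(read_columns(groups, ["C1", "C3"]))
```

In this model a query that needs only C1 and C3 reads two of the three column sequences in every row group, while each record is still rebuilt from a single row group; this is the combination of properties that the paper attributes to RCFile.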
ACKNOWLEDGMENT

This paper is supported by NSFC ( and ), the Natural Science Foundation of Zhejiang Province, China (No. Y and LY2F02006), the State Key Program of Major Science and Technology (Priority Topics) of Zhejiang Province, China (No. 20OC050), and the science and technology search planned projects of Zhejiang Province (No. 2012C21040). Corresponding author:

REFERENCES

[1] D. A. Patterson. Technical Perspective: The Data Center is the Computer. Commun. ACM, 2008, 51(1).
[2] J. Dean and S. Ghemawat. MapReduce: Simplified Data Processing on Large Clusters. In OSDI '04, 2004.
[3] Hadoop.
[4] A. Pavlo, E. Paulson, A. Rasin, D. J. Abadi, D. J. DeWitt, S. Madden, and M. Stonebraker. A comparison of approaches to large-scale data analysis. In SIGMOD, ACM, 2009.
[5] Hive.
[6] Pig.
[7] M. Stonebraker, D. Abadi, D. J. DeWitt, S. Madden, E. Paulson, A. Pavlo, and A. Rasin. MapReduce and parallel DBMSs: friends or foes? Commun. ACM, 2010, 53(1).
[8] J. Dean and S. Ghemawat. MapReduce: a flexible data processing tool. Commun. ACM, 2010, 53(1).
[9] S. Babu. Towards automatic optimization of MapReduce programs. In SoCC, ACM, 2010.
[10] M. Zaharia, A. Konwinski, A. Joseph, R. Katz, and I. Stoica. Improving MapReduce performance in heterogeneous environments. In OSDI, 2008.
[11] D. J. Abadi, S. Madden, and M. Ferreira. Integrating Compression and Execution in Column-Oriented Database Systems. In SIGMOD, 2006.
[12] A. Floratou et al. Column-Oriented Storage Techniques for MapReduce. PVLDB, 2011, 4(7).
[13] Y. He, R. Lee, Y. Huai, Z. Shao, N. Jain, X. Zhang, and Z. Xu. RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems. In ICDE, 2011.
[14] D. Jiang, B. C. Ooi, L. Shi, and S. Wu. The Performance of MapReduce: An In-depth Study. PVLDB, 2010, 3(1).
[15] E. Jahani, M. J. Cafarella, and C. Ré. Automatic Optimization for MapReduce Programs. PVLDB, 2011, 4(6).
[16] D. Abadi, S. R. Madden, and N. Hachem. Column-Stores vs. Row-Stores: How Different Are They Really? In SIGMOD, 2008.
[17] A. Ailamaki, D. J. DeWitt, M. D. Hill, and M. Skounakis. Weaving relations for cache performance. In VLDB, 2001.
