ECE 7650 Scalable and Secure Internet Services and Architecture -- A Systems Perspective
Part II: Data Center Software Architecture, Topic 3: Programming Models
RCFile: A Fast and Space-efficient Data Placement Structure in MapReduce-based Warehouse Systems
Presented by Fahad Mirza
Agenda
Part 1: Background, Objectives, Previous Techniques
Part 2: RCFile Implementation
Part 3: Results, Conclusion
Part 1
Background
MapReduce-based data warehouse systems:
- Support big data analytics
- Quickly adjust to the dynamics of user behavior trends
- Needed by typical web-service providers and social network sites (e.g., Facebook)
Data placement structure is a crucial factor in warehouse performance.
The Facebook warehouse characterized four requirements for the data placement structure:
- Fast data loading
- Fast query processing
- Highly efficient storage space utilization
- Strong adaptivity to highly dynamic workload patterns
Background
Why is data placement so critical for a warehouse?
- The MapReduce framework and Hadoop provide a scalable and fault-tolerant infrastructure for big data analysis on large clusters.
- MapReduce-based warehouse systems cannot directly control storage disks in clusters.
- They utilize a cluster-level distributed file system (e.g., HDFS, the Hadoop Distributed File System) to store huge amounts of table data.
- The serious challenge is to find an efficient data placement method to organize table data in the underlying HDFS.
Objectives
Fast data loading
- Quick data loading is important for the Facebook data warehouse: on average, more than 20 TB of data are pushed into it daily.
Fast query processing
- Response-time-critical queries, such as decision support queries, comprise the major workload in such applications.
- The data placement structure should support a large volume of query processing as the number of queries rapidly increases.
Highly efficient storage space utilization
- High space utilization on disks avoids storage waste; rapidly growing user activities demand scalable storage capacity.
Strong adaptivity to highly dynamic workload patterns
- The underlying system must be highly adaptive to meet unexpected dynamics in data processing, with limited storage, for both expected and unexpected queries.
Previous Techniques
Common data placement structures in conventional DBs:
- Row-stores
- Column-stores
- Hybrid-stores
In the context of large data analysis using MapReduce, these are not very suitable for big data processing in distributed systems.
Data placement techniques for MapReduce: conventional database placement structures applied to a MapReduce data warehouse:
- Horizontal row-store structure
- Vertical column-store structure
- Hybrid PAX store structure
Importing DB structures into a MapReduce data warehouse system cannot meet all four objectives.
Previous Techniques
Horizontal row-store structure
Advantages
- Fast data loading and strong adaptive ability to dynamic workloads
Disadvantages
- Cannot support fast query processing. Reason: it cannot skip unnecessary column reads when a query requires only a few columns from a wide table with many columns.
- Compression ratio is low, and hence storage space utilization is poor. Reason: columns with different data domains are mixed together.
Previous Techniques
Vertical column-store structure
Advantages
- High compression ratios. Reason: data fields of similar length are stored together.
Disadvantages
- Column-store can often cause high record reconstruction overhead, with expensive network transfers in a cluster, leading to slower query processing. Reason: HDFS cannot guarantee that all fields of the same record are stored on the same cluster node, so tuple reconstruction incurs high overhead.
Alternative
- Pre-grouping multiple columns together can reduce the reconstruction overhead.
- Disadvantage: it does not have strong adaptivity to respond to highly dynamic workload patterns.
Previous Techniques
Two schemes of vertical stores:
- Put each column in its own sub-relation, as in the Decomposition Storage Model (DSM), i.e., a pure column-store.
- Organize all the columns of a relation into different column groups, usually allowing column overlapping among multiple column groups.
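The two vertical-store schemes above can be sketched on a tiny in-memory table. This is an illustrative sketch only; the table and the column groups are made-up examples, not anything from the paper.

```python
# A toy table of records (each record is a dict of column -> value).
records = [
    {"id": 1, "name": "a", "age": 30, "city": "x"},
    {"id": 2, "name": "b", "age": 25, "city": "y"},
]

# Scheme 1 (DSM / pure column-store): every column becomes its own sub-relation.
dsm = {col: [r[col] for r in records] for col in records[0]}

# Scheme 2: columns organized into column groups; note "id" appears in both
# groups, illustrating the overlap that column-group schemes usually allow.
column_groups = [("id", "name"), ("id", "age", "city")]
grouped = {g: {col: [r[col] for r in records] for col in g}
           for g in column_groups}

print(dsm["age"])                        # [30, 25]
print(grouped[("id", "name")]["name"])   # ['a', 'b']
```

The overlap in scheme 2 speeds up queries that match a pre-defined group, at the cost of storing overlapped columns twice, which is one reason it adapts poorly to dynamic workloads.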
Previous Techniques
Hybrid PAX store structure
- A hybrid placement structure aiming to improve CPU cache performance.
- Multiple fields of a record, from different columns, are put in a single disk page to save additional operations for record reconstruction.
- Within each disk page, PAX uses a mini-page to store all fields belonging to each column, and uses a page header to store pointers to the mini-pages.
Advantages
- Strong adaptive ability for various dynamic query workloads
- CPU performance improved by better cache utilization
Disadvantages
- Cannot satisfy high storage utilization and fast query processing speed.
Previous Techniques
Drawbacks of the PAX store structure:
- Not associated with data compression, which is unnecessary for its goal of cache optimization but limits storage efficiency.
- Cannot improve I/O performance, because it does not change the actual content of a page; hence query processing is not faster.
- Limited by the page-level data manipulation inside a traditional DBMS engine, PAX uses a fixed page as the basic unit of data record organization.
Part 2
New Technique: RCFile
- A big data placement structure: RCFile (Record Columnar File)
- Satisfies all four requirements; adopted by Hive and Pig
- RCFile applies the concept of "first horizontally partition, then vertically partition" from PAX, combining the advantages of both row-store and column-store.
- As a row-store, RCFile guarantees that data in the same row are located on the same node, so tuple reconstruction cost is low.
- As a column-store, RCFile can exploit column-wise data compression and skip unnecessary column reads.
- Utilizes column-wise data compression within each row group.
- Provides a lazy decompression technique to avoid unnecessary column decompression during query execution.
New Technique: RCFile
Data layout of an RCFile in the HDFS structure:
- A table can span multiple HDFS blocks.
- All stored records are partitioned into row groups: a table stored in RCFile is first horizontally partitioned into multiple row groups, then each row group is vertically partitioned so that each column is stored independently.
- For a table, all row groups have the same size. Depending on the row group size and the HDFS block size, an HDFS block can hold one or multiple row groups.
- RCFile allows a flexible row group size. A default size is chosen considering both data compression performance and query execution performance, and users may also select the row group size for a given table.
New Technique: RCFile
Data layout of a row group:
- The first section: a sync marker placed at the beginning of the row group, mainly used to separate two contiguous row groups in an HDFS block.
- The second section: a metadata header for the row group. The header stores how many records are in the row group, how many bytes are in each column, and how many bytes are in each field of a column.
- The third section: the table data section, which is actually a column-store. In this section, all fields of the same column are stored contiguously.
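The three-section row-group layout can be sketched as follows. This is a minimal, hypothetical serialization (the sync-marker bytes and header representation are made up for illustration); the real RCFile on-disk format differs in detail.

```python
# Minimal sketch of a row group: sync marker, metadata header, table data.
SYNC = b"\x00SYNC\x00"  # hypothetical marker separating contiguous row groups

def build_row_group(rows):
    """rows: list of records; each record is a list of equal-arity str fields."""
    ncols = len(rows[0])
    # Vertically partition the row group: gather each column's fields.
    columns = [[row[c].encode() for row in rows] for c in range(ncols)]
    # Section 2: metadata header -- record count, bytes per column,
    # and the length of every field in every column.
    header = {
        "num_records": len(rows),
        "column_bytes": [sum(len(f) for f in col) for col in columns],
        "field_lengths": [[len(f) for f in col] for col in columns],
    }
    # Section 3: table data -- all fields of the same column stored contiguously.
    data = b"".join(b"".join(col) for col in columns)
    return SYNC, header, data

sync, header, data = build_row_group([["r1c1", "r1c2"], ["r2c1", "r2c2"]])
print(data)  # b'r1c1r2c1r1c2r2c2' -- column 1's fields, then column 2's
```

Because the header records per-column and per-field byte counts, a reader can later locate any single column inside the data section without scanning the others.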
New Technique: RCFile
(Figure slide: illustration of the RCFile data layout inside an HDFS block.)
New Technique: RCFile
Data compression in RCFile
- In each row group, the metadata header section and the table data section are compressed independently, as follows.
- The metadata header section is compressed using RLE (Run-Length Encoding). Advantage of RLE: the field-length values of the same column are stored contiguously, so the RLE algorithm can find long runs of repeated values, especially for fixed field lengths.
- RLE is not used for the column data because it is not sorted.
- The table data section is not compressed as a whole unit. Instead, each column is independently compressed with the heavyweight Gzip compression algorithm.
- Due to the lazy decompression technique, RCFile does not need to decompress all the columns when processing a row group.
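A short sketch of why these two choices fit their sections: field lengths of a fixed-width column form long runs that RLE collapses, while each data column is Gzip-compressed on its own so a reader can decompress only the columns it needs. The `rle_encode` helper is illustrative, not RCFile's actual codec.

```python
import gzip

def rle_encode(values):
    """Collapse a sequence into [value, run_length] pairs."""
    runs = []
    for v in values:
        if runs and runs[-1][0] == v:
            runs[-1][1] += 1
        else:
            runs.append([v, 1])
    return runs

# Header compression: field lengths of a fixed-length column across six
# records collapse into a single run.
assert rle_encode([4, 4, 4, 4, 4, 4]) == [[4, 6]]

# Table data compression: each column compressed independently with Gzip,
# so decompressing column A never touches column B.
col_a = gzip.compress(b"aaaaaaaa")
col_b = gzip.compress(b"12345678")
assert gzip.decompress(col_a) == b"aaaaaaaa"
```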
New Technique: RCFile
- Per-column compression allows different algorithms to compress different columns; in the future, multiple types of compression schemes can be adopted.
Data appending
- RCFile does not allow arbitrary data writing operations; only appends are allowed, because HDFS only supports data writes at the end of a file.
The method of data appending in RCFile is summarized as follows:
- RCFile creates and maintains an in-memory column holder for each column.
- When a record is appended, all its fields are scattered, and each field is appended into its corresponding column holder.
New Technique: RCFile
- In addition, RCFile records the corresponding metadata of each field in the metadata header.
- RCFile provides two parameters to control how many records are buffered in memory before they are flushed to disk: (i) the number of records, and (ii) the limit on the size of the memory buffer.
- RCFile first compresses the metadata header and stores it on disk. Then it compresses each column holder separately, and flushes the compressed column holders into one row group in the underlying file system.
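The append path described above can be sketched as a small writer class. All names and thresholds here are hypothetical; the sketch only shows the pattern of scattering fields into column holders and flushing a row group when either limit is crossed.

```python
import gzip

class RowGroupWriter:
    """Illustrative append-only writer: per-column holders + two flush limits."""
    def __init__(self, max_records=4, max_bytes=1024):
        self.max_records, self.max_bytes = max_records, max_bytes
        self.holders = None          # one in-memory holder per column
        self.num_records = 0
        self.flushed = []            # stand-in for row groups written to disk

    def append(self, record):
        """record: list of bytes fields; scatter each field to its holder."""
        if self.holders is None:
            self.holders = [[] for _ in record]
        for holder, field in zip(self.holders, record):
            holder.append(field)
        self.num_records += 1
        buffered = sum(len(f) for h in self.holders for f in h)
        # Flush when either the record count or the buffer size limit is hit.
        if self.num_records >= self.max_records or buffered >= self.max_bytes:
            self.flush()

    def flush(self):
        # Compress each column holder separately; together they form one
        # row group in the underlying file system.
        group = [gzip.compress(b"".join(h)) for h in self.holders]
        self.flushed.append(group)
        self.holders, self.num_records = None, 0

w = RowGroupWriter(max_records=2)
w.append([b"r1c1", b"r1c2"])
w.append([b"r2c1", b"r2c2"])   # hits the record-count limit -> flush
```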
New Technique: RCFile
Data reads
- When processing a row group, RCFile does not need to fully read its whole content into memory. It only reads the metadata header and the columns needed by a given query.
- Advantage: it can skip unnecessary columns and gain the I/O advantages of a column-store.
- For instance, suppose we have a table with four columns tbl(c1, c2, c3, c4), and the query SELECT c1 FROM tbl WHERE c4 = 1. Then, in each row group, RCFile only reads the contents of columns c1 and c4.
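The column-skipping read can be sketched with the per-column byte counts that the metadata header provides: the reader seeks past any column the query does not need. Function and variable names are illustrative.

```python
def read_columns(data, column_bytes, wanted):
    """data: concatenated column bytes of one row group;
    column_bytes: size of each column (from the metadata header);
    wanted: set of column indices the query needs (e.g. {0, 3} for c1, c4)."""
    out, offset = {}, 0
    for i, size in enumerate(column_bytes):
        if i in wanted:
            out[i] = data[offset:offset + size]   # read this column
        offset += size                             # otherwise just skip it
    return out

# Row group with four equal-sized columns; SELECT c1 ... WHERE c4 = 1
# needs only columns 0 and 3.
data = b"AAAABBBBCCCCDDDD"
cols = read_columns(data, [4, 4, 4, 4], {0, 3})
print(cols)  # {0: b'AAAA', 3: b'DDDD'}
```

In a real deployment the "skip" is a file seek, so the bytes of c2 and c3 never leave the disk, which is where the I/O saving comes from.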
New Technique: RCFile
- Once the metadata header and the data of the required columns are loaded, they need to be decompressed. The metadata header is always decompressed and held in memory until RCFile processes the next row group.
- RCFile does not decompress all the loaded columns. Instead, it uses a lazy decompression technique.
Lazy decompression
- A column is not decompressed in memory until RCFile has determined that its data will really be useful for query execution.
- If the WHERE condition is not satisfied by any record in a row group, RCFile does not decompress the columns that do not occur in the WHERE condition.
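Lazy decompression can be sketched with a small wrapper that keeps a loaded column compressed until its values are actually requested. The class and query below are hypothetical illustrations of the idea, not RCFile's implementation.

```python
import gzip

class LazyColumn:
    """A loaded column that stays compressed until first use."""
    def __init__(self, compressed):
        self.compressed = compressed
        self._plain = None
        self.decompressed = False   # exposed only so the sketch can show laziness

    def values(self):
        if self._plain is None:     # decompress on first real use
            self._plain = gzip.decompress(self.compressed).split(b",")
            self.decompressed = True
        return self._plain

# SELECT c1 FROM tbl WHERE c4 = 1, over one row group:
c1 = LazyColumn(gzip.compress(b"a,b,c"))
c4 = LazyColumn(gzip.compress(b"2,3,4"))

# The predicate column must be decompressed to evaluate the condition...
hits = [i for i, v in enumerate(c4.values()) if v == b"1"]
# ...but no record satisfies c4 = 1, so c1 is never decompressed at all.
result = [c1.values()[i] for i in hits] if hits else []
```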
Part 3
Results
Determining the effect of row group size on compression ratio
- I/O performance is a major concern of RCFile, so RCFile needs to use a large and flexible row group size. The current size adopted by Facebook is 4 MB.
Larger row group size
- Advantage: better data compression efficiency than a small one.
- Disadvantages: may have lower read performance than a small one; can undermine the benefits of lazy decompression; higher memory usage.
Results
(Figure slide: measured effect of row group size on compression ratio.)
Results
Performance evaluation: effectiveness of RCFile versus the other structures (row-store, column-store, PAX) in three aspects:
i. Data storage space
ii. Data loading time
iii. Query execution time
Data storage space performance evaluation
- Used the USERVISITS table from the benchmark, generating a data set of about 120 GB, all in plain text.
- Loaded it into Hive using the different data placement structures; the data is compressed by the Gzip algorithm for each structure.
Results
Data storage space: interpretation of results
- RCFile stores data in two sections, the data and the metadata, each with its own compression; this yields better compression efficiency and lower storage usage.
Results
Data loading time performance evaluation
- Data loading time: the time required to load the raw data into the data warehouse.
Interpretation of results
- Row-store: smallest data loading time, due to the minimal overhead of reorganizing records from the raw text file.
- Column-store and column-group: slower, because the raw data file must be written to multiple HDFS blocks for the different columns (or column groups).
- RCFile is slightly slower than row-store due to the small overhead of reorganizing records inside each row group.
Results
Query execution time performance evaluation
- Executed two queries on the RANKING table (three columns) from the benchmark. For column-store, all three columns are stored independently.
- Q1: SELECT pagerank, pageurl FROM RANKING WHERE pagerank > 400;
- Q2: SELECT pagerank, pageurl FROM RANKING WHERE pagerank < 400;
- Q1: RCFile outperforms the others by utilizing lazy decompression.
- Q2: Column-store performs slightly better due to the high selectivity of the predicate.
- Note that the performance advantage of column-group is not free: it relies on column combinations pre-defined before query execution.
Results
Effect of different row group sizes on RCFile's performance
Workloads
- The industry-standard TPC-H benchmark for warehousing system evaluations
- A workload generated by daily operations of the advertisement business at Facebook
Factors examined
- Data storage space
- Query execution time
Results
TPC-H workload performance evaluation
- RCFile can significantly decrease storage space compared with row-store.
- Increasing the row group size beyond a threshold does not significantly improve data compression efficiency further.
- A large row group can also decrease the advantage of lazy decompression and cause unnecessary data decompression.
Results
Facebook workload performance evaluation
- Query A: SELECT adid, userid FROM adclicks;
- Query B: SELECT adid, userid FROM adclicks WHERE userid = "x";
- For row-store, the average mapper time of Query B is longer than that of Query A, due to the extra computation caused by the WHERE clause.
- For RCFile, the average mapper time of Query B is significantly shorter than that of Query A, reflecting the performance benefit of RCFile's lazy decompression.
Conclusion
Competitive systems
- Cheetah: RCFile outperforms it, because Cheetah applies Gzip heavily on both the metadata and the column data.
- Bigtable from Google: a low-level key-value store for both read- and write-intensive applications, whereas RCFile serves an almost read-only data warehouse.
- Facebook is transforming its existing data to the RCFile format.
- An integration of RCFile into Pig is being developed by Yahoo!.
Questions
(1) What is the row-store data placement? What are the disadvantages of this data placement? (Section II-A)