Unique Data Organization

Joleen Boone
5 years ago
INTRODUCTION

Apache CarbonData stores data in a columnar format in which each data block is sorted independently of the others, allowing faster filtering and better compression.

DESCRIPTION

Although CarbonData stores data in a columnar format, it differs from traditional columnar formats in that the columns in each row group (data block) are sorted independently of the other columns. This arrangement requires CarbonData to store a row-number mapping against each column value, but it makes binary search feasible for faster filtering. Because the values are sorted, identical or similar values sit together, which yields better compression and helps offset the storage overhead of the row-number mapping.

BRIEF INTRO ABOUT COLUMNAR STORAGE

In a columnar database, all the column 1 values are stored physically together, followed by all the column 2 values, and so on. The data is stored in record order, so the 100th entry for column 1 and the 100th entry for column 2 belong to the same input record. This allows individual data elements, for instance customer name, to be accessed in columns as a group, rather than individually row by row. Here is an example of a simple database table with 4 columns and 3 rows.

Table 1: Database table with 4 columns and 3 rows

ID  Last   First  Bonus
1   Doe    John   8000
2   Smith  Jane   4000
3   Beck   Sam    1000

Row-oriented storage: 1,Doe,John,8000;2,Smith,Jane,4000;3,Beck,Sam,1000;
Column-oriented storage: 1,2,3;Doe,Smith,Beck;John,Jane,Sam;8000,4000,1000;

One of the main benefits of a columnar database is that data can be highly compressed. The compression permits columnar operations such as MIN, MAX, SUM, COUNT and AVG to be performed very rapidly. Another benefit is that, because column-based storage is self-indexing, it uses less disk space than a relational database management system (RDBMS) containing the same data.

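
The two layouts of Table 1 can be reproduced with a short Python sketch. It is only an illustration of the row-versus-column idea; the function names are hypothetical and the text serialization is not how CarbonData writes bytes to disk.

```python
# Table 1 from the text, held as a list of records in record order.
rows = [
    (1, "Doe", "John", 8000),
    (2, "Smith", "Jane", 4000),
    (3, "Beck", "Sam", 1000),
]

def row_oriented(records):
    """Serialize record by record: all fields of row 1, then row 2, ..."""
    return "".join(",".join(str(v) for v in rec) + ";" for rec in records)

def column_oriented(records):
    """Serialize column by column: all IDs, then all last names, ..."""
    columns = zip(*records)  # transpose rows into columns
    return "".join(",".join(str(v) for v in col) + ";" for col in columns)

row_text = row_oriented(rows)
col_text = column_oriented(rows)

# A columnar aggregate only has to touch one column (here, Bonus).
bonus_column = [rec[3] for rec in rows]
bonus_total = sum(bonus_column)
```

Because both layouts keep record order, the n-th entry of every column still belongs to the n-th input record.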
CARBONDATA FILE FORMAT

An Apache CarbonData file contains groups of data called blocklets, along with all required information such as the schema, offsets and indices in a file footer. The file footer can be read once to build the indices in memory, which can then be used to optimise the scans and processing of all subsequent queries. Each blocklet in the file is further divided into chunks of data called Data Chunks. Each Data Chunk is organised in either a columnar format or a row format, and stores the data of either a single column or a set of columns. All blocklets in one file contain the same number and type of Data Chunks.
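
The nesting described above (file, blocklets, Data Chunks, footer) can be pictured with a minimal Python model. All class and field names here are hypothetical, chosen only to mirror the prose; they are not CarbonData's real classes or on-disk structures.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class DataChunk:
    columns: List[str]  # a single column, or a set of columns
    layout: str         # "columnar" or "row"

@dataclass
class Blocklet:
    chunks: List[DataChunk]

@dataclass
class Footer:
    schema: List[str]            # column names (assumed representation)
    blocklet_offsets: List[int]  # where each blocklet starts in the file

@dataclass
class CarbonFileSketch:
    blocklets: List[Blocklet]
    footer: Footer

def chunks_are_uniform(f: CarbonFileSketch) -> bool:
    """Check that every blocklet has the same number and type of Data Chunks."""
    shapes = [[(tuple(c.columns), c.layout) for c in b.chunks] for b in f.blocklets]
    return all(shape == shapes[0] for shape in shapes)

def make_blocklet() -> Blocklet:
    # One columnar chunk for "id" and one row-format chunk for a column group.
    return Blocklet([DataChunk(["id"], "columnar"),
                     DataChunk(["last", "first"], "row")])

example = CarbonFileSketch(
    blocklets=[make_blocklet(), make_blocklet()],
    footer=Footer(schema=["id", "last", "first"], blocklet_offsets=[0, 1024]),
)
uniform = chunks_are_uniform(example)
```

Reading the footer first corresponds to loading `footer` once and using `blocklet_offsets` to jump straight to the blocklets a query needs.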

Figure 1: CarbonData File

Figure 2: Detailed Description of CarbonData File Format

I) File Header: Contains information about
   - CarbonData file version number
   - List of column schema
   - Schema update timestamp

II) Blocklet: A set of rows in columnar format
   - Balance between efficient scan and compression
   - Data are sorted along MDK (multi-dimensional keys)
   - Default blocklet size: 64MB (but the size is configurable)
   - Minimum size for predicate filtering
   - Large size for efficient reading and compression
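
Sorting along a multi-dimensional key and cutting the result into blocklets can be sketched as follows. This is an illustration under stated assumptions: the column names and data are made up, and blocklets are cut by a row count (`rows_per_blocklet`, a hypothetical parameter) rather than by the byte size the format actually uses.

```python
def build_blocklets(rows, key_columns, rows_per_blocklet):
    """Sort rows on the multi-dimensional key, then cut them into blocklets."""
    def mdk(row):
        # The key is the tuple of dimension values, compared left to right.
        return tuple(row[c] for c in key_columns)

    ordered = sorted(rows, key=mdk)
    return [ordered[i:i + rows_per_blocklet]
            for i in range(0, len(ordered), rows_per_blocklet)]

rows = [
    {"country": "US", "city": "NYC", "sales": 5},
    {"country": "CN", "city": "SZ", "sales": 7},
    {"country": "US", "city": "LA", "sales": 3},
    {"country": "CN", "city": "BJ", "sales": 9},
]
blocklets = build_blocklets(rows, ["country", "city"], rows_per_blocklet=2)
```

Since the rows are sorted before being split, each blocklet covers a contiguous key range, which is what makes it possible to skip whole blocklets during predicate filtering.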

Figure 3: Pictorial representation of Columnar encoding

Further, the Blocklet contains Column Page groups for each column, also known as Column Chunks. A Column Chunk is the data for one column in a Blocklet.
   - Column data can be stored as a sorted index
   - It is guaranteed to be contiguous in the file
   - Multiple columns may form a column group, stored as a single Column Chunk in row-based format
   - A column group is suitable for a set of columns frequently fetched together, saving the stitching cost of reconstructing rows

Each Data Chunk contains multiple groups of data called Pages. A Page holds the data of one column, and the number of rows per Page is fixed. There are three types of pages.

Data Page: Contains the encoded data of a column or group of columns.

Row ID Page (optional): Contains the row ID mappings used when the Data Page is stored as an inverted index.
   - Suitable for low-cardinality columns
   - Better compression and fast predicate filtering

Figure 4: Representation of Sort Columns within Column Chunks

The inverted index gives the actual position of each column value in the column (i.e., the row number). Example: value 1 in column 2 is present in rows 1-8, so the rest of the rows need not be considered, which allows fast filtering. The inverted index also stores the values in sorted order, so binary search can be used to find the filter value quickly. It also helps to reconstruct the row: because the data is stored column-wise, the values may be reordered when each column is sorted and stored.
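
The inverted index idea can be sketched in a few lines of Python: sort one column, keep the original row number of each value, filter with binary search, and use the row numbers to restore record order. The function names and sample data are hypothetical; this is not CarbonData's actual page encoding.

```python
from bisect import bisect_left, bisect_right

def build_sorted_column(values):
    """Return (sorted_values, row_ids); row_ids[i] is the original row of sorted_values[i]."""
    order = sorted(range(len(values)), key=lambda r: values[r])
    return [values[r] for r in order], order

def filter_equals(sorted_values, row_ids, target):
    """Binary-search the sorted values and return the matching original row numbers."""
    lo = bisect_left(sorted_values, target)
    hi = bisect_right(sorted_values, target)
    return sorted(row_ids[lo:hi])

def restore_row_order(sorted_values, row_ids):
    """Use the row ID mapping to put values back in record order."""
    original = [None] * len(row_ids)
    for value, row in zip(sorted_values, row_ids):
        original[row] = value
    return original

column = [3, 1, 2, 1, 3, 1]  # values in record order (rows 0..5)
sorted_values, row_ids = build_sorted_column(column)
matches = filter_equals(sorted_values, row_ids, 1)
restored = restore_row_order(sorted_values, row_ids)
```

Only the slice of the sorted column that holds the filter value is touched; every other row is skipped without being read.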

RLE Page (optional): Contains additional metadata used when the Data Page is RLE coded.

Figure 5: Run Length Encoding

III) Footer: Metadata information
   - File-level metadata (number of rows, segment info, list of blocklets info and index) and statistics
   - Schema
   - Blocklet Index and Metadata

Figure 6: CarbonData File Footer
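
The run-length encoding behind the RLE Page can be illustrated with a generic sketch. It shows why sorted column values compress well, under the assumption of a simple (value, run length) scheme; the real RLE Page layout is not specified in this text.

```python
from itertools import groupby

def rle_encode(values):
    """Collapse each run of equal adjacent values into a (value, run_length) pair."""
    return [(value, len(list(group))) for value, group in groupby(values)]

def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original sequence."""
    return [value for value, length in runs for _ in range(length)]

column = [3, 1, 2, 1, 3, 1]                # values in record order
unsorted_runs = rle_encode(column)          # no adjacent repeats, so one run per value
sorted_runs = rle_encode(sorted(column))    # equal values are adjacent after sorting
decoded = rle_decode(sorted_runs)
```

On the unsorted column every value forms its own run, while the sorted column collapses to three runs, which is the compression gain that sorting each column is meant to provide.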
