Components for writing Parquet Format Files

Manas Rathi 1, Pratik Jagtap 2, Pranali Jain 3, Anisha Jain 4, Prof. Subhash Tatale 5
1, 2, 3, 4, 5 Department of Computer Engineering, Vishwakarma Institute of Information Technology, Pune
{mnsrathi@gmail.com, jagtap.pratik1@yahoo.com, pranalijain1995@gmail.com, anishajain1995.aj@gmail.com, subhash.tatle@viit.ac.in}

Abstract

Modern applications make extensive use of Business Intelligence, data warehousing and similar technologies, which produce enormous amounts of data. These volumes of unstructured data are referred to as Big Data. Processing data at this scale is done in a distributed environment and requires an efficient tool. Hadoop is one such tool: it provides a framework that runs in a distributed environment and executes tasks in parallel, which makes it possible to process this kind of complex data efficiently with respect to time, performance and resources. Query performance and execution speed are important factors in Big Data processing. Data can be stored in HDFS in various file formats such as Avro, Thrift and Parquet. We analyzed the impact of these file formats on query processing and found that the Parquet file format provides better query performance for our application. Parquet supports very efficient compression and encoding schemes. It is designed around complex nested data structures and uses the record shredding and assembly algorithm. Parquet is a columnar data storage format, which helps in the efficient analysis of data. This paper describes a system that takes input files stored in a row-oriented format, converts them into Parquet format on the fly and stores the converted data on a Hadoop cluster, using Apache Drill in the back end.
The system avoids the extra storage space needed to keep the original file on the Hadoop cluster by converting the data on the fly. It also shortens the overall process, because the file no longer has to be copied onto the cluster and then explicitly converted into the Parquet file format.

General Terms: Big Data, File Format

Keywords: Hadoop, Apache Drill, Apache Parquet, Row-oriented format, Column-oriented format, ZooKeeper

1. INTRODUCTION

With the advance of technology, the amount of digital data has grown exponentially. Much of this data is unstructured, which makes it difficult to process with traditional data management systems. A system is needed that can manage such enormous data efficiently. Hadoop provides an efficient way to store and process Big Data in a distributed manner. Since the amount of data is so large, it should be stored compactly. Columnar storage reduces the size of the stored data and allows better query optimization. Apache Parquet offers an efficient way to store data in columnar format. We are developing a component that converts row-oriented data into the Parquet file format with the help of Apache Drill.

1.1 Columnar Data Storage Format

A columnar format stores table records as a sequence of columns, i.e. the entries of a column are stored in contiguous locations. When data is read from row-oriented storage, unneeded attributes are also accessed, because each record is stored together as a whole. A column store can access only the attributes that are required, which improves the read performance of the system. Because of this fundamental difference between the two types of store, inserting, deleting and updating rows is optimized in row stores (modifying a tuple is easy, since its attribute values are stored contiguously), while selecting data is optimized in column stores (reading only the required data is easy). Column stores are therefore read-optimized, and the column-oriented approach is chosen for the analysis of large amounts of data.

Advantages

1. Access to stored data: Data access queries can run faster. For example, to find the average marks of students, we do not need to look through all records row by row; we can access only the column in which the marks are stored and compute the result, which avoids unnecessary processing.
2. Data compression: Since all values in a column have the same data type, compression algorithms can be applied per column to obtain better storage efficiency.
3. Parallel processing of data: Data stored in columnar format is partitioned vertically, so operations can run on different columns at the same time, improving performance through parallelism. Parallelism is also helped by accessing only the required columns at any instant.

1.2 Why Parquet?

The traditional row-oriented file format stores data in rows, while the Parquet file format stores data in column-oriented format. Suppose a table has 321 columns, some of them long text or varchar fields, stored one after another within each record, and that it holds more than 10K records.
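The difference between the two layouts can be illustrated with a small Python sketch. The student table, its field names and the helper functions below are hypothetical and only model the access pattern; they do not reproduce how Parquet lays out data on disk. The sketch counts how many field values each layout has to touch to compute the average of a single column, as in the average-marks example above.

```python
# Hypothetical student table; names and values are illustrative only.
rows = [
    {"roll_no": 1, "name": "Asha", "city": "Pune", "marks": 72},
    {"roll_no": 2, "name": "Ravi", "city": "Mumbai", "marks": 85},
    {"roll_no": 3, "name": "Meera", "city": "Nashik", "marks": 64},
]


def to_columns(records):
    """Pivot a list of row dicts into one list of values per column."""
    columns = {}
    for record in records:
        for field, value in record.items():
            columns.setdefault(field, []).append(value)
    return columns


def average_from_rows(records, field):
    """Row layout: each record is read as a whole, so all its fields are touched."""
    values_touched = 0
    total = 0
    for record in records:
        values_touched += len(record)  # the entire record is read
        total += record[field]
    return total / len(records), values_touched


def average_from_columns(columns, field):
    """Column layout: only the requested column is read."""
    column = columns[field]
    return sum(column) / len(column), len(column)


columns = to_columns(rows)
row_avg, row_touched = average_from_rows(rows, "marks")
col_avg, col_touched = average_from_columns(columns, "marks")

print("row layout:   ", row_avg, "values touched:", row_touched)
print("column layout:", col_avg, "values touched:", col_touched)
```

Both layouts give the same average, but the row layout touches every field of every record while the column layout touches one value per record. With hundreds of columns, as in the scenario described here, the gap grows in proportion to the number of columns.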
When this data is queried in a row-oriented format, the query has to scan every record of the dataset: it reads the first row, parses each field of the record, and returns a result if the record satisfies a condition on, say, the "sales" column of a product-based company. If that company has 10 years of history, every single record is read in full just to look at one of its columns. In a column-oriented format, the query can jump directly to the sales column and get the required results, without going through all the records and their unnecessary fields. A further advantage is that the data is spread out: to assemble a single record, the number of workers can equal the number of columns, i.e. the data can be accessed in parallel. The Parquet file format works best when the input is large and the output is a filtered subset.

1.3 Unit of Parallelization:


1. MapReduce - File/Row Group
2. IO - Column chunk
3. Encoding/Compression - Page

The following diagram explains the storage representation of row- and column-oriented data:

Fig. 1: Storage Representation

2. CURRENT SCENARIO

2.1 Current Process

The current workflow is as follows:

Step 1: Input is taken from various sources in row-oriented file formats such as ASCII (CSV, JSON), EBCDIC and delimiter-separated formats (with delimiters such as \t or ,).
Step 2: The input is given to an ETL tool (Talend), which performs Extraction, Transformation and Loading and loads the input into the Hadoop cluster. Extraction pulls the data from the source system and makes it accessible for further processing. Transformation converts the data into a form that meets the specific requirements and checks whether the data can be used for its intended purpose. Loading includes the loading of dimensions and facts.
Step 3: A CTAS (CREATE TABLE AS SELECT) operation is performed on the data present in the cluster to convert it into the Parquet file format.
Step 4: The converted data, along with the original data, is stored in the cluster as output that can be used for further processing.

Fig. 2 shows the flow of the current system.
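Step 3 above can be sketched as follows. This is a minimal illustration, assuming the query engine is Apache Drill, that the source is a header-less delimited file exposed through Drill's `columns` array, and that `dfs.tmp` is a writable workspace. The file path, table name, column names and both helper functions are hypothetical.

```python
def quote_identifier(name):
    """Wrap a name in backticks, which Drill uses to quote identifiers."""
    if "`" in name:
        raise ValueError(f"backtick not allowed in identifier: {name!r}")
    return f"`{name}`"


def build_ctas(target_workspace, target_table, source_path, column_names):
    """Build a CTAS statement that rewrites a delimited file as a new table.

    Each field of the header-less file is read through Drill's `columns`
    array and given a name. Values stay as text here; casts are omitted
    to keep the sketch short.
    """
    if not column_names:
        raise ValueError("at least one column name is required")
    select_list = ",\n       ".join(
        f"columns[{index}] AS {quote_identifier(name)}"
        for index, name in enumerate(column_names)
    )
    return (
        f"CREATE TABLE {target_workspace}.{quote_identifier(target_table)} AS\n"
        f"SELECT {select_list}\n"
        f"FROM dfs.{quote_identifier(source_path)}"
    )


# The session option selects Parquet as the output format of CTAS.
set_format = "ALTER SESSION SET `store.format` = 'parquet'"

# Hypothetical input: a CSV file with two fields per line.
ctas = build_ctas("dfs.tmp", "sales_parquet", "/data/sales.csv", ["product", "sales"])

for statement in (set_format, ctas):
    print(statement + ";")
```

The sketch only generates the statements. Submitting them to Drill (for example through its JDBC driver or web console) is left out, since the source does not say which client the authors used.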

Fig. 2: Current Process Flow Diagram

2.2 Limitations

Redundant Data: Two copies of the data are stored on the cluster, i.e. the original copy and the converted copy. This is a redundant use of storage.
Time Required: The time for the complete process is the time to load the data in ASCII format plus the time to convert it from ASCII to Parquet, which is an overhead.

3. PREVIOUSLY STUDIED APPROACHES:

3.1 Approach 1: Creating a Talend Component

Create a component in Talend (the ETL tool) that converts the data and then stores it in Hadoop.
Limitation: Talend doesn't offer a batch implementation for data conversion, so this idea was discarded.

3.2 Approach 2: Changing Magic Number of File

A magic number is a short, fixed sequence of bytes used to distinguish one file format from another; Parquet files, for example, are marked with the 4-byte sequence PAR1. The idea was that if we could change the magic number of a file into the magic number of a Parquet file, the file would be converted.
Limitation: The magic number is only an identifier; changing it does not change how the data in the file is stored.

3.3 Approach 3: Converting Column-Index into Row-Index

This approach tries to convert columns into rows by changing their index values. We expected that switching the index from column-based to row-based would carry out the conversion. However, this approach did not work out.

4. PROPOSED SYSTEM

4.1 Idea

Instead of loading data in ASCII format into Hadoop and then converting it into Parquet format, we can reduce the extra storage space and the time for loading and conversion by converting ASCII files into Parquet files on the fly while loading. To implement this idea, we can use an open-source query engine that offers on-the-fly conversion of data.

Fig. 3: Proposed System Flow Diagram

The workflow of the proposed system is as follows:

Step 1: The user provides the input file name and schema.
Step 2: Apache Drill collects the information provided by the user.
Step 3: The component generates the query to be executed by Apache Drill.
Step 4: Apache Drill executes the query.
Step 5: Apache Drill passes the information to ZooKeeper for storing the converted data.

To convert the data and load it onto the Hadoop cluster in Parquet format in a single step, we use Apache Drill. The system consists of three modules. The front end is a simple web page that accepts the file name, its storage path and the column names required for converting the file into Parquet format. The input provided in this first stage is checked for valid column names, file path and other related information. In the second stage, the software component establishes the connection with Apache Drill and executes the CTAS operation. Apache Drill, which performs the actual conversion, converts the content of the given file into Parquet format and loads it directly onto the Hadoop cluster without generating any file on the local system.

4.2 Apache Drill

In recent years, data has been generated in large amounts, which has driven the development of systems such as Hadoop, NoSQL stores and cloud storage that hold this data efficiently. Apache Drill enables its users to explore and analyze this data without losing the flexibility and agility offered by these data stores. Traditional query engines (e.g. relational databases, Hive, Impala, Spark SQL) need to know the structure of the data before query execution. Drill, on the other hand, features a fundamentally different architecture, which enables execution to begin without knowing the structure of the data. The query is automatically compiled and re-compiled during the execution phase, based on the actual data flowing through the system. As a result, Drill can handle data with a dynamic schema or even no schema at all (e.g. JSON files, MongoDB collections, HBase tables).

Drill is primarily focused on non-relational data stores. The following data stores are currently supported:

Hadoop: All Hadoop distributions (HDFS API 2.3+), including Apache Hadoop, MapR, CDH and Amazon EMR
NoSQL: MongoDB, HBase
Cloud storage: Amazon S3, Google Cloud Storage, Azure Blob Storage, Swift

A Drillbit is responsible for accepting requests from the user, processing them and returning the results to the user. The Drillbit service can be installed and run on all of the required nodes in a Hadoop cluster to form a distributed cluster environment. When a Drillbit runs on each data node in the cluster, Drill can maximize data locality during query execution without moving data over the network or between nodes. Drill uses ZooKeeper to maintain cluster membership and health-check information.

Apache Drill uses storage plugins to read from and write to data stores. A storage plugin can be adjusted according to need. The Drill installation registers the cp, dfs, hbase, hive and mongo default storage plugin configurations. A new plugin is registered under a name, and all of its configuration details are provided in JSON format. Once the storage plugins have been updated, they can be used as required.
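A storage plugin configuration of the kind described above can be sketched as a Python dictionary serialized to JSON. The field names follow the layout of Drill's default dfs configuration as we understand it and should be checked against the installed Drill version. The connection URL, workspace name, location and the helper function are hypothetical.

```python
import json


def make_dfs_plugin(connection, workspace_name, location):
    """Return a file-system storage plugin configuration as a Python dict.

    The workspace is marked writable so that query results can be
    written into it.
    """
    return {
        "type": "file",
        "enabled": True,
        "connection": connection,
        "workspaces": {
            workspace_name: {
                "location": location,
                "writable": True,
                "defaultInputFormat": None,
            }
        },
        "formats": {
            "csv": {"type": "text", "extensions": ["csv"], "delimiter": ","},
            "parquet": {"type": "parquet"},
        },
    }


# Hypothetical cluster address and target directory.
plugin = make_dfs_plugin("hdfs://namenode:8020", "converted", "/warehouse/parquet")
plugin_json = json.dumps(plugin, indent=2)
print(plugin_json)
```

The printed JSON is the kind of document that is supplied when registering or updating a plugin. A writable workspace such as the hypothetical `converted` one is what a conversion query would name as its target.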
Although Drill works in a Hadoop cluster environment, it is not tied to Hadoop and can run in any distributed cluster environment. The only prerequisite for Drill is ZooKeeper. A new data store can be added by developing a storage plugin. Drill's schema-free JSON data model enables it to query non-relational data stores, many of which hold complex or schema-free data.

Features

Drill is a distributed SQL query processing engine designed to enable data processing and analytics on non-relational data stores. Users can query the data using standard SQL and BI tools without having to create and manage schemas.

Agility: Drill gives results faster because there is no overhead of loading the data, creating and maintaining schemas, or transforming the data before it can be processed.
Flexibility: Drill does not require non-relational data to be transformed or restricted before it is queried.
Can be used with existing BI tools: Existing SQL knowledge and BI tools, including Tableau, QlikView, MicroStrategy, Spotfire, Excel and more, can be used with Drill.
Scalable: Drill has a simple symmetrical architecture, which reduces the complexity of adding or removing nodes when scaling up.

