4th National Conference on Electrical, Electronics and Computer Engineering (NCEECE 2015)

Benchmark Testing for Transwarp Inceptor: A Big Data Analysis System Based on In-Memory Computing

Mingang Chen 1,2,a, Zhenqiang Chen 3,b, Wanggen Liu 3,c, Zhengyu Liu 1,2,d

1 Shanghai Key Laboratory of Computer Software Testing and Evaluating, Shanghai 201112, China
2 Shanghai Development Center of Computer Software Technology, Shanghai 201112, China
3 Transwarp Technologies (Shanghai) Co., Ltd., Shanghai 200233, China
a cmg@ssc.stn.sh.cn, b zhenqiang.chen@transwarp.io, c wayne.liu@transwarp.io, d lzy@ssc.stn.sh.cn

Keywords: Benchmark testing; Transwarp Inceptor; Big data

Abstract. The high demand for data analysis and the rapid development of big data technology and applications have led industry and academia to launch a variety of commercial and open-source big data processing systems. How to test and evaluate these systems objectively has therefore become an important research topic. In this paper, we propose and develop an automated benchmark testing solution based on TPC-DS for Transwarp Inceptor, a big data analysis system. The test includes generating and loading a 500 GB data set, creating 24 tables, and evaluating the performance and SQL compatibility of the system by executing the 99 standard SQL queries. The results can serve as a reference for enterprises comparing and choosing big data analysis systems.

1. Introduction

With the rapid development of the traditional Internet, the mobile Internet, the Internet of Things, and cloud computing, and with the continuous improvement of IT infrastructure, our world has entered the big data era. Big data brings challenges in data storage and in the methods and efficiency of data processing, and several big data management and analysis systems have appeared to tackle these challenges. In recent years, industry and academia have launched a variety of commercial and open-source big data management and analysis systems, such as Apache Hadoop, Spark, Storm, Oracle Exadata, and IBM Big SQL. Big data benchmark testing is a practice for evaluating such systems. Generally, a benchmark first abstracts a representative workload from concrete applications, then generates a scalable test data set based on the characteristics and distributions of the data, and finally analyses the system's performance according to evaluation metrics. How to use benchmark tests to objectively evaluate big data management and analysis systems has become an important research topic [1-2]. HiBench, developed by Intel, evaluates the running time of MapReduce jobs, HDFS bandwidth, and throughput [3]. TPC-DS, proposed by the Transaction Processing Performance Council, is a decision-support-oriented benchmark widely used to evaluate the SQL query throughput of structured data stores such as data warehouses [4-5]. BigBench is an end-to-end big data benchmark whose underlying business model is a product retailer; it is enriched with semi-structured and unstructured data models, and the structured part of its data model is adopted from TPC-DS [6]. BigFrame is a benchmark generator that users can employ to produce benchmarks fitted to their specific needs [7]. In this paper, we propose a benchmark testing solution for Transwarp Inceptor v4.0, a SQL-on-Hadoop system for big data analysis based on in-memory computing [8], and develop automated test scripts for the solution.
To verify its scalability and big data processing capability, TPC-DS at scale factor 500 is adopted as the benchmark. The benchmark testing solution includes generating and loading 500 GB of test data, creating 24 tables, and testing the performance and SQL compatibility of the system by executing the 99 standard SQL queries.

2. The Architecture of Transwarp Inceptor

Transwarp Inceptor is a big data analysis system based on Apache Spark that provides high-speed SQL analysis. It helps users build scalable decision support systems (such as data warehouses) and perform interactive analysis and real-time reporting. Transwarp Inceptor has a three-tier structure, from bottom to top: the storage layer, the distributed computing engine layer, and the interface layer, as shown in Fig. 1.

[Fig. 1. Architecture of Transwarp Inceptor. Interface layer: JDBC, ODBC, shell. Distributed computing engine layer: SQL 2003 compiler (SQL parser, cost-based optimizer, code generator), batch and interactive SQL engine, transaction manager (distributed CRUD, concurrency controller), PL/SQL compiler (procedure parser, inter-procedure optimizer, parallel job generator), and the Transwarp distributed execution engine. Storage layer: Transwarp HDFS2, Transwarp HBase, and Transwarp Holodesk (a heterogeneous distributed columnar store).]

Transwarp Inceptor can handle data stored in HDFS, HBase, or the Transwarp Holodesk distributed cache (designed by Transwarp), and the amount of data it can process ranges from gigabytes to petabytes. Transwarp Holodesk is a heterogeneous distributed columnar store used to cache data for Spark to access; it can be placed on RAM, SSD, or HDD. The Transwarp Inceptor distributed computing engine is an in-memory execution engine based on Spark that contains the SQL 2003 compiler, the PL/SQL compiler, and the transaction manager, supporting data warehouse applications with complex analysis. The top layer of Transwarp Inceptor provides JDBC, ODBC, and shell access interfaces to ease migration from existing traditional database systems to the big data platform.
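As a concrete illustration of the engine layer, consider the following small analytic query. It is our own example, not taken from the paper or from the TPC-DS query set, and it assumes the standard TPC-DS store_sales and store tables; window functions such as rank() belong to the SQL:2003 feature set that the compiler layer is described as supporting.

-- Illustrative example only (not one of the 99 benchmark queries): a
-- windowed analytic query of the kind the SQL 2003 compiler targets,
-- written against the standard TPC-DS store_sales and store tables.
select s.s_store_name,
       sum(ss.ss_net_paid) as revenue,
       rank() over (order by sum(ss.ss_net_paid) desc) as revenue_rank
from store_sales ss
join store s on ss.ss_store_sk = s.s_store_sk
group by s.s_store_name;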

3. Testing design for Transwarp Inceptor

The overall testing process for Transwarp Inceptor consists of five stages: the generation of test data and query SQL, data compression and loading, the creation of tables and data partitions, SQL query execution, and the analysis of the results, as shown in Fig. 2. The process is executed automatically by test scripts.

[Fig. 2. Testing process of Transwarp Inceptor: generation of test data and SQL → data compression and loading → creation of tables and data partitions → SQL query execution → testing analysis (SQL compatibility analysis, SQL query performance analysis, and checks of system functions).]

3.1 The generation of testing data and SQL

TPC-DS is a decision support benchmark that models several generally applicable aspects of a decision support system. The TPC-DS schema contains vital business information, such as customer, order, and product data. The benchmark models the user queries that are the most important component of any mature decision support system; user queries convert operational facts into business intelligence. In this paper, the DSTools v1.3.0 kit provided with the TPC-DS benchmark is used to generate the 500 GB of test data and the 99 query SQLs. The script fragments are as follows.

# Generate 500 GB of data in HDFS
hadoop dfs -mkdir -p LOCATION_HDFS
dbgen2 -scale 500 -dir LOCATION_HDFS
# Generate the 99 SQL queries
qgen2 query99.tpl -directory QUERY_TEMPLATE -dialect oracle -scale 500

The 500 GB of test data embodies the business model of the TPC-DS benchmark and is stored in 7 fact tables and 17 dimension tables. The tables are organized in a mixed star and snowflake schema in order to build a data warehouse for decision support. The business model of the TPC-DS benchmark simulates sales, distribution, and product returns in three primary channels: stores, catalogs, and the web. As a whole, the TPC-DS benchmark has the following characteristics: 1) The large amount of business data and the test cases (SQL queries) can answer real business problems. 2) The 99 SQL queries follow the SQL-99 and SQL:2003 core syntax standards, and the queries are complex. 3) The test cases cover a variety of workload types, such as interactive queries, statistical analysis, iterative OLAP, and data mining. 4) Almost all of the test cases impose heavy I/O and CPU load.

3.2 The data compression and loading

For optimization, the test for Transwarp Inceptor adopts the ORC (Optimized Row Columnar) file format to store the test data. The original 500 GB of text-format test data is compressed to 150 GB in the ORC file format.

3.3 The creation of tables and data partitions

Next, we create the 7 fact tables and 17 dimension tables with Transwarp Inceptor. The schemas of the tables are provided by the TPC-DS benchmark. We build a data warehouse for testing by loading the test data into the tables. Taking the store table as an example, the following script illustrates how to create a table and load data into it.

# Create table store and load data into the table
create database if not exists ${DB};
use ${DB};
drop table if exists store;

create table store
row format serde '${SERDE}'
stored as ${FILE}
as select * from ${SOURCE}.store;

Since the time span of the test data is more than ten years, Transwarp Inceptor range-partitions all fact tables on date-related columns to accelerate the queries; the dimension tables are left unpartitioned. An illustrative sketch of a partitioned fact table is given below.
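The following is a minimal, hypothetical sketch of what Sections 3.2 and 3.3 describe together: a date-partitioned fact table stored as ORC. It uses generic Hive-style DDL with an abridged column list from the TPC-DS store_sales schema; the paper does not show Inceptor's actual range-partition syntax, so plain Hive-style value partitioning stands in for it here.

-- Hypothetical sketch (generic Hive-style DDL; Inceptor's actual
-- range-partition syntax is not shown in the paper). The column list
-- is abridged from the full TPC-DS store_sales schema.
create table store_sales (
  ss_item_sk     bigint,
  ss_customer_sk bigint,
  ss_store_sk    bigint,
  ss_quantity    int,
  ss_net_paid    decimal(7,2)
)
partitioned by (ss_sold_date_sk bigint)  -- fact tables are partitioned on a date-related column
stored as orc;                           -- ORC encoding is what shrinks the 500 GB text data to about 150 GB

-- Load from the raw text-format source tables with dynamic partitioning.
set hive.exec.dynamic.partition.mode=nonstrict;
insert overwrite table store_sales partition (ss_sold_date_sk)
select ss_item_sk, ss_customer_sk, ss_store_sk, ss_quantity, ss_net_paid,
       ss_sold_date_sk
from ${SOURCE}.store_sales;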

3.4 The SQL query execution

The key part of the test is the execution of the 99 standard queries provided by the TPC-DS benchmark; we test functional correctness and query efficiency by executing them. According to the structure and semantics of the SQL statements, the 99 standard queries can be classified into four classes: 9 interactive queries, 69 statistical analysis queries, 10 iterative OLAP queries, and 11 data mining queries. To minimize the impact of the system cache on query performance, the 99 queries are executed sequentially for three rounds by automated scripts, as shown below. The average execution time of each query over the three rounds is taken as its result, from which we calculate the average execution time of each class according to the above classification.

# Execute all 99 SQL queries sequentially
for ($i = 1; $i <= 99; $i++) {
    $sql = "query$i.sql";
    system("transwarp -t -h localhost -f ./sql/$sql > ./logs/$sql.log 2>&1");
}
# Record the execution time of each query by parsing the execution logs

4. Analysis of test results

The test environment is a cluster of four identically configured servers. The network topology of the cluster is shown in Fig. 3 and the configuration of each server in Table 1.

Table 1. The configuration of the servers
Model:            Dell PowerEdge R720
CPUs:             2 x Intel Xeon E5-2620 v2 @ 2.10 GHz, 6 cores each
Memory:           256 GB
Hard disks:       6 x 1 TB HDD, 7200 rpm
Operating system: Red Hat Enterprise Linux 6.5

[Fig. 3. Network topology of the Transwarp Inceptor cluster under test: four nodes on a gigabit full-duplex network. A manager node runs the primary NameNode, the Inceptor server, and Transwarp Data Hub; a standby manager/computing node runs the secondary NameNode and the Inceptor metastore; the remaining two computing nodes host the Ganglia server, Transwarp Manager, and ZooKeeper.]

For functional testing, we focus on the application scenarios of Transwarp Inceptor. The results show that Transwarp Inceptor correctly implements data loading, table creation and partitioning, and complex SQL queries. For performance testing, we mainly measure the execution time for Transwarp Inceptor to perform the 99 standard SQL queries. With the 500 GB data set, in which the fact tables hold between 18 million and 1,440 million records, the execution times of the four classes of SQL are shown in Table 2.

Table 2. The execution time of the four classes of SQL
Number of SQLs   SQL category            Total execution time (s)   Average execution time (s)
9                Interactive query       197                        21.9
69               Statistical analysis    7705                       111.7
10               Iterative OLAP          4232                       423.2
11               Data mining             3502                       318.4

As for SQL compatibility, only 3 of the 99 queries required minor modification; the other 96 ran as written, in accordance with the SQL-99 and SQL:2003 core standards. According to the TPC-DS benchmark specification [5], such minor modifications are acceptable.

5. Conclusion

In this paper, an automated benchmark testing solution based on TPC-DS is designed and developed for Transwarp Inceptor, a SQL-on-Hadoop big data analysis system. The solution tests the functionality, performance, and complex-SQL compatibility of Transwarp Inceptor. It can also be applied to other SQL-on-Hadoop systems, such as Apache Hive and Cloudera Impala. In further studies, we plan to test multiple SQL-on-Hadoop systems and compare their functionality and performance fairly. These results will provide a reference for enterprises choosing an appropriate big data analysis system.

Acknowledgement

This work was funded by the Science and Technology Commission of Shanghai Municipality (programs 15511107003 and 14511106804).

References

[1] Han R, Lu X, Xu J. On big data benchmarking. Lecture Notes in Computer Science, 2014, 8807: 3-18.
[2] Barata M, Bernardino J, Furtado P. Survey on big data and decision support benchmarks. Database and Expert Systems Applications, Springer International Publishing, 2014: 174-182.
[3] Huang S, Huang J, Dai J, et al. The HiBench benchmark suite: characterization of the MapReduce-based data analysis. ICDE Workshops, 2010: 41-51.
[4] Poess M, Nambiar R O, Walrath D. Why you should run TPC-DS: a workload analysis. Proceedings of the 33rd International Conference on Very Large Data Bases, VLDB Endowment, 2007: 1138-1149.
[5] TPC-DS specification, 2015. http://www.tpc.org/tpcds/
[6] Ghazal A, Rabl T, Hu M, et al. BigBench: towards an industry standard benchmark for big data analytics. Proceedings of the 2013 ACM SIGMOD International Conference on Management of Data, ACM, 2013: 1197-1208.
[7] BigFrame user guide. https://github.com/bigframeteam/bigframe/wiki/
[8] Transwarp Inceptor. http://www.transwarp.cn/product/inceptor?lang=en