A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture

By Gaurav Sheoran
9-Dec-08

G. Sheoran is an MS candidate in computer science at California State University, Chico, CA 95929-040 USA (e-mail: gaurav.sheoran@gmail.com).

Abstract

Most current enterprise data warehouses are implemented on traditional relational databases. In a typical data warehouse, batch processes write data on predefined schedules, while data is read far more often and at random intervals. Traditional relational databases are write-optimized and do not perform optimally in analytical and decision-making processes, where data is read more often than it is written. A new database architecture addresses the requirements of read-oriented applications. In this paper we compare the memory and CPU usage of typical ad hoc queries run in a data warehouse. We gauge the performance difference between the two radically different database architectures and determine whether the new architecture is a promising solution for read-oriented processes.

I. INTRODUCTION

In today's economy, data plays a central role in policy and decision-making processes in enterprises. The amount of data stored in enterprise databases is increasing exponentially with time. However, suboptimal database performance can hinder decision making, where fast retrieval of data matters most. Most traditional databases implement row-based storage, which places all attributes of a record contiguously on disk. Traditional databases achieve high write speeds because a single disk write moves all the attributes of a record to the storage device. These databases work well where data is written frequently.

Column-based databases use a different architecture for storing data on storage devices: they store the values of each attribute contiguously. In this architecture, each attribute can be read separately, which reduces the amount of memory used because irrelevant attributes are not brought into memory. Column-based databases should perform better in typical data warehouse scenarios, where data is read more often than it is written.

The rest of this paper is organized as follows. In Section II, we discuss previous research and studies related to the performance of row-based and column-based databases. In Section III, we give details of the experimental setup. Note that the queries used in our experiment represent typical queries in real-life data warehouses. In Section IV, we review the results of our experiments. Finally, in Section V, we present the conclusion of our research.

II. RELATED WORK

Since the 1980s, the relational model has been implemented by most DBMS vendors. In the relational model, a set of attributes makes up a record, and records are stored in tables. Online Transaction Processing (OLTP) applications rely on the relational model because they frequently update a small number of records, usually one record at a time [1]. In contrast, Online Analytical Processing (OLAP) applications rarely execute data manipulation statements such as insert or update. OLAP applications execute medium- or long-running queries that scan full tables but return only a small number of attributes. In such cases, current row-based databases do not perform optimally even on modern processors [9]. For OLAP applications, a column-based database architecture, in which each individual attribute is stored contiguously, should be more efficient [1].

Several studies aim to improve the performance of OLAP queries. One line of work examines the efficient building and maintenance of pre-computed aggregates. If a predefined set of queries is executed regularly, then pre-populating aggregates and materialized views can be very efficient [1].
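To make the two storage layouts discussed above concrete, the following minimal sketch contrasts how many attribute values a single-column aggregate has to touch under each layout. It is an in-memory illustration only, not how MyISAM or Brighthouse are implemented; the table contents and the function names `sum_price_row_store` and `sum_price_column_store` are hypothetical.

```python
# Hypothetical three-attribute table (id, name, price), for illustration only.
rows = [
    (1, "widget", 9.50),
    (2, "gadget", 12.00),
    (3, "gizmo", 7.25),
]

def sum_price_row_store(records):
    """Row-based layout: each record keeps all of its attributes together,
    so summing one attribute still reads every attribute of every record."""
    touched = 0
    total = 0.0
    for record in records:
        touched += len(record)  # the whole record is brought into memory
        total += record[2]
    return total, touched

# Column-based layout: the values of each attribute are stored contiguously.
columns = {
    "id": [r[0] for r in rows],
    "name": [r[1] for r in rows],
    "price": [r[2] for r in rows],
}

def sum_price_column_store(cols):
    """Column-based layout: only the requested attribute is read."""
    price = cols["price"]
    return sum(price), len(price)

row_total, row_touched = sum_price_row_store(rows)
col_total, col_touched = sum_price_column_store(columns)
print(row_total, row_touched)
print(col_total, col_touched)
```

Both functions return the same sum, but the row layout touches nine values while the column layout touches three. This is the effect the paper expects to show up as lower memory usage for queries that return few attributes.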

Compression techniques are also being studied as a way to improve performance in traditional databases. Compression reduces the size of the dataset, which reduces transfer times. Because the data is compressed, more data fits into the same amount of memory, which increases the buffer hit rate [6]. Column-based databases offer more opportunities for data compression than row-based databases. Compression ratios are also higher in column-based databases because adjacent values in a column are often similar, whereas adjacent attributes in a record are not [4]. However, if the cost of compressing and decompressing the data exceeds the savings achieved through compression, the overall result is not optimal [6].

Several studies have shown that column-based databases perform better than row-based databases for simple queries that fetch a few attributes from tables or perform aggregation functions. However, column-based databases are not optimized for frequent data manipulation statements.

III. EXPERIMENTS

A. Experimental Setup

We ran our experiments on two versions of the MySQL database. The first used Brighthouse, a column-based storage engine, while the second used MyISAM, a row-based storage engine. Our experimental system is a hyper-threaded 3.2 GHz Pentium IV running Fedora 9 Linux (kernel 2.6.25-4.fc9.i686), with 2.5 GB of memory and 60 GB of disk space.

B. Data

We used TPC-H data generated at scale factor 1 for our experiments [8]. The toolkit for generating TPC-H data is available at http://www.tpc.org/tpch/spec/tpch_2_8_0.tar.gz. The toolkit has two components: DBGEN for generating data and QGEN for generating queries. After downloading and uncompressing the toolkit file, we copied makefile.suite to makefile. We edited makefile and supplied values for the CC, DATABASE, MACHINE, and WORKLOAD parameters. We then ran the make command, which generates two executable files: dbgen and qgen. When executed, dbgen creates 8 data files that can be loaded into the tables. The TPC-H schema [8] has 8 tables: PART, SUPPLIER, PARTSUPP, CUSTOMER, LINEITEM, ORDERS, REGION, and NATION. The biggest table in this schema, at scale factor 1, contains 6 million records, and the total size of all tables is around 1 GB.

C. Procedure

We ran our experiments on out-of-the-box installations of MySQL to eliminate the impact of optimization on performance. Because the Brighthouse engine does not support indexes, we also did not create any indexes on the database using the MyISAM storage engine. We used the same data files created by DBGEN to load data into both instances of the MySQL database. We created a few queries that mimic the common queries run in a typical data warehouse. These queries performed aggregation, sequential access, random access, and so on. We ran the queries with different numbers of output columns, ranging from a single column to all columns in a table, to see the effect on performance. We ran the following queries in both instances of the MySQL database:

1: SELECT <column 1>, ..., <column N> FROM <table>;
2: SELECT SUM(<column>) FROM <table>;
3: SELECT <column 1>, ..., <column N> FROM <table> WHERE <key> = <value1> OR <key> = <value2>;
4: SELECT SUM(<column>*(1-<column2>)) FROM <table>;
5: SELECT <column>, SUM(<column>*<column2>) FROM <table> GROUP BY <column> HAVING <column> = <value>;
6: SELECT <column>, SUM(<column>) FROM (SELECT <column>, <column> FROM <table> WHERE <column> = <value>) AS <alias> GROUP BY <column> ORDER BY <column>;
7: SELECT <column>, SUM(<column>*<column2>) FROM <table> GROUP BY <column> HAVING SUM(<column>*<column2>) > (SELECT SUM(<column>*<column2>*<variable>) FROM <table> WHERE <column> = <value>) ORDER BY <column>;

These queries were run multiple times to examine the effect of caching on database performance. After each set of executions of an individual query, the cache was cleared so that cached result sets did not affect the results of other queries.

IV. RESULTS

We used the same dataset to populate the tables in both MySQL instances. As shown in Table I, the column-based database, which used the Brighthouse storage engine, compressed the data, whereas the row-based database did not compress the data in its tables. The compression ratio for our test dataset was as high as 7.437. Note that the compression ratio is higher in tables that have more rows. This gives the column-based database an advantage over the row-based database in terms of required disk space.

| Table Name | Column-based: Row Count | Column-based: Compression | Row-based: Row Count | Row-based: Compression | Compression Ratio |
|---|---|---|---|---|---|
| CUSTOMER | 150000 | Yes | 150000 | No | 2.999 |
| LINEITEM | 6001215 | Yes | 6001215 | No | 6.086 |
| NATION | 25 | Yes | 25 | No | .24 |
| ORDERS | 1500000 | Yes | 1500000 | No | 5.590 |
| PART | 200000 | Yes | 200000 | No | 7.437 |
| PARTSUPP | 800000 | Yes | 800000 | No | 5.29 |
| REGION | 5 | Yes | 5 | No | 0.429 |
| SUPPLIER | 00000 | Yes | 00000 | No | 2.672 |

Table I: Data compression in column-based database vs. row-based database for individual tables.
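The observation in Section II that adjacent values in a column compress better than adjacent attributes in a record can be illustrated with a small sketch. The paper does not say which compression algorithm Brighthouse uses; run-length encoding is used here only as a simple stand-in, and the sample data and the item-count definition of the ratio are assumptions made for illustration.

```python
from itertools import groupby

def rle_encode(values):
    """Run-length encode a sequence as (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]

def compression_ratio(values):
    """Stored items before encoding divided by stored items after encoding.

    Each run costs two items (the value and its length). Counting items
    rather than bytes is a simplification made for this sketch.
    """
    return len(values) / (2 * len(rle_encode(values)))

# Hypothetical column: one attribute across 12 records, with similar neighbors.
order_status_column = ["F"] * 5 + ["O"] * 4 + ["P"] * 3

# Hypothetical record: the different attributes of a single row.
single_record = [42, "F", 173.25, "1996-01-02"]

print(rle_encode(order_status_column))
print(compression_ratio(order_status_column))
print(compression_ratio(single_record))
```

Under these assumptions the column shrinks to half its size, while the record grows, because none of its adjacent attributes repeat. A ratio below 1 means the encoding costs more than it saves.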

We expected the execution time of a query in the column-based database to approach the execution time in the row-based database as the number of output columns approached the total number of columns in the table. As shown in Fig. 1, execution time increased more rapidly with the number of output columns in the column-based database than in the row-based database. The column-based database performed better for a single non-aggregated output column. But as the number of output columns increased, the execution time grew as a multiple of the execution time required for a single output column.

Fig. 1: Execution times in seconds (log scale) against the number of output columns, for the column-based DB and the row-based DB.

When an individual column was selected with no aggregation, the column-based database required less CPU time to process the query than the row-based database. The same difference in CPU usage was seen when single-column aggregation and random-access queries were executed. These results support the experimental results of M. Stonebraker et al. [1]. However, when a query performed a calculation using multiple columns, the column-based database took more time to return the results. After the results were cached, the column-based database performed better on subsequent executions of the query. The CPU times for our test queries are given in Table II.

| Query Type | Column-based: First Run | Column-based: Subsequent Run | Row-based: First Run | Row-based: Subsequent Run | Performance Factor: First Run | Performance Factor: Subsequent Run |
|---|---|---|---|---|---|---|
| 1 | 14.07 | 13.60 | 43.75 | 40.13 | 0.3216 | 0.3389 |
| 2 | 0.04 | 0.01 | 10.93 | 9.40 | 0.0037 | 0.0011 |
| 3 | 4.73 | 4.68 | 11.43 | 11.25 | 0.4138 | 0.416 |
| 4 | 16.74 | 0.01 | 9.68 | 9.59 | 1.7293 | 0.001 |
| 5 | 33.68 | 0.02 | 15.06 | 14.98 | 2.2364 | 0.0013 |
| 6 | 0.68 | 0.59 | 21.82 | 20.89 | 0.0312 | 0.0282 |
| 7 | 259.22 | 241.13 | 265.00 | 251.44 | 0.9782 | 0.9590 |

Table II: CPU usage in seconds, by query execution on column-based database vs. row-based database.

The results in Fig. 2 show the memory usage factor of the row-based database relative to the column-based database. In all test runs, queries used more memory on the row-based database than on the column-based database. The biggest difference in memory usage was found for queries performing aggregation on columns read directly from a physical table. The memory usage factor dropped for queries using in-line views or subqueries. This drop may be due to the temporary storage of results from in-line views or subqueries.

Fig. 2: Memory Usage Factor of Row-Based Database vs. Column-Based Database (log scale), by query.

The results of the memory usage tests, shown in Fig. 2, are in line with our expectations. Because the data is compressed in the column-based database, we expected queries running on it to use less memory. Another factor that may have helped the column-based database run queries with less memory is that only the relevant columns are read from the storage device, whereas the row-based database reads the full record.

V. CONCLUSION

In this paper, we showed that the column-based database stored data in compressed form and therefore achieved substantial disk space savings compared with the row-based database. We also found that memory usage in the column-based database was lower than in the row-based database. CPU utilization was lower in the column-based database, except for queries performing calculations on multiple columns retrieved from a physical table. However, in the column-based database, CPU time increased more than we expected as the number of output columns approached the total number of columns in a table. Based on the results of our experiments, we can reasonably conclude that column-based databases are a promising solution for analytical and read-oriented data warehouses.
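As a reading aid for the CPU comparison summarized above, the performance factor in Table II appears to be the column-based CPU time divided by the row-based CPU time, so values below 1 favor the column-based database. The paper does not state this definition explicitly; it is an assumption inferred from the table values, and the function name `performance_factor` is hypothetical. A minimal sketch using two rows of Table II:

```python
def performance_factor(column_cpu_seconds, row_cpu_seconds):
    """Assumed definition: column-based CPU time over row-based CPU time,
    rounded to four decimal places as in Table II."""
    return round(column_cpu_seconds / row_cpu_seconds, 4)

# Query 7, first run: 259.22 s (column-based) vs. 265.00 s (row-based).
q7_first = performance_factor(259.22, 265.00)

# Query 6, subsequent run: 0.59 s (column-based) vs. 20.89 s (row-based).
q6_subsequent = performance_factor(0.59, 20.89)

print(q7_first, q6_subsequent)
```

Under this reading, Query 7 is close to parity between the two engines, while Query 6 strongly favors the column-based database.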

REFERENCES

[1] M. Stonebraker et al., "C-Store: A Column-Oriented DBMS," Proc. 31st International Conference on Very Large Data Bases (VLDB), August-September 2005, pp. 553-564.
[2] R. A. Hankins and J. M. Patel, "Data Morphing: An Adaptive, Cache-Conscious Storage Technique," Proc. 29th International Conference on Very Large Data Bases (VLDB), September 2003, pp. 417-428.
[3] B. He and Q. Luo, "Cache-oblivious databases: Limitations and opportunities," ACM Transactions on Database Systems (TODS), vol. 33, no. 2, Article 8, June 2008.
[4] R. MacNicol and B. French, "Sybase IQ Multiplex - Designed For Analytics," Proc. 30th International Conference on Very Large Data Bases (VLDB), August-September 2004, pp. 1227-1230.
[5] P. Boncz, S. Manegold, and M. Kersten, "Database architecture optimized for the new bottleneck: Memory access," Proc. 25th International Conference on Very Large Data Bases (VLDB), 1999, pp. 54-65.
[6] D. Abadi, S. Madden, and M. Ferreira, "Integrating Compression and Execution in Column-Oriented Database Systems," Proc. 2006 ACM SIGMOD International Conference on Management of Data, 2006, pp. 671-682.
[7] D. Abadi, A. Marcus, S. Madden, and K. Hollenbach, "Scalable Semantic Web Data Management Using Vertical Partitioning," Proc. 33rd International Conference on Very Large Data Bases (VLDB), September 2007, pp. 411-422.
[8] TPC Benchmark H Standard Specification Revision 2.8.0, http://www.tpc.org/tpch/spec/tpch2.8.0.pdf
[9] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood, "DBMSs on a Modern Processor: Where Does Time Go?," Proc. 25th International Conference on Very Large Data Bases (VLDB), 1999, pp. 266-277.
[10] TPC Benchmark H Dataset Generator, http://www.tpc.org/tpch/spec/tpch_2_8_0.tar.gz