A Comparison of Memory Usage and CPU Utilization in Column-Based Database Architecture vs. Row-Based Database Architecture

By Gaurav Sheoran
9-Dec-08

G. Sheoran is an MS candidate in computer science at California State University, Chico, CA 95929-040 USA (e-mail: gaurav.sheoran@gmail.com).

Abstract

Most current enterprise data warehouses are implemented on traditional relational databases. In a typical data warehouse, batch processes write data on pre-defined schedules, while data is read far more often and at unpredictable times. Traditional relational databases are write-optimized and do not perform optimally in analytical and decision-making workloads, where data is read more often than it is written. A newer, column-based database architecture addresses the requirements of these read-oriented applications. In this paper we compare the memory and CPU usage of typical ad-hoc data-warehouse queries on both architectures. We gauge the performance difference between the two radically different architectures and assess whether the new architecture is a promising solution for read-oriented workloads.

I. INTRODUCTION

In today's economy, data has taken a central role in enterprise policy and decision making. The amount of data stored in enterprise databases is increasing exponentially with time. However, less than optimal database performance can hinder decision making, where fast retrieval of data matters most.

Most traditional databases implement row-based storage, which places all attributes of a record contiguously on disk. Traditional databases achieve high write speeds because a single disk write moves all attributes of a record to the storage device. These databases work well where data is written frequently.

Column-based databases use a different architecture for storing data on storage devices: they store the values of each attribute contiguously. In this architecture each attribute can be read separately, which reduces the amount of memory used because irrelevant attributes are not brought into memory. Column-based databases should therefore perform better in typical data-warehouse scenarios, where data is read more often than it is written.

The rest of this paper is organized as follows. Section II discusses previous research on the performance of row-based and column-based databases. Section III describes the experimental setup; the queries used in our experiments represent typical queries in real-life data warehouses. Section IV reviews the results of our experiments. Finally, Section V presents our conclusions.

II. RELATED WORK

Since the 1980s, the relational model has been implemented by most DBMS vendors. In the relational model, a set of attributes makes up a record, and records are stored in tables. Online Transaction Processing (OLTP) applications rely on the relational model because they frequently update a small number of records, usually one record at a time [1]. In contrast, Online Analytical Processing (OLAP) applications rarely execute data manipulation statements such as insert or update. OLAP applications execute medium- or long-running queries that scan full tables but return only a small number of attributes. In such cases, current row-based databases do not perform optimally even on modern processors [9]. For OLAP applications, a column-based database architecture, in which each individual attribute is stored contiguously, should be more efficient [1].

Several lines of research aim to improve the performance of OLAP queries. One is the efficient building and maintenance of pre-computed aggregates. If a predefined set of queries is executed regularly, pre-populating aggregates and materialized views can be very efficient [1].
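The difference between the two layouts can be made concrete with a small sketch. It is illustrative only: the three-field record, the `struct` format string, and all function names are assumptions introduced here, not part of either storage engine evaluated in this paper. The sketch sums one attribute under each layout and reports how many bytes had to be scanned to do so.

```python
import struct

# Hypothetical fixed-width record: (order_id: int32, quantity: int32, price: float64).
RECORD_FORMAT = "<iid"
RECORD_SIZE = struct.calcsize(RECORD_FORMAT)  # 16 bytes with no padding


def build_row_store(records):
    """Row layout: all attributes of a record are stored contiguously."""
    return b"".join(struct.pack(RECORD_FORMAT, *record) for record in records)


def build_column_store(records):
    """Column layout: all values of one attribute are stored contiguously."""
    ids, quantities, prices = zip(*records)
    return {
        "order_id": struct.pack(f"<{len(ids)}i", *ids),
        "quantity": struct.pack(f"<{len(quantities)}i", *quantities),
        "price": struct.pack(f"<{len(prices)}d", *prices),
    }


def sum_quantity_row_store(row_bytes):
    """Sum one attribute; every full record must be scanned to reach it."""
    total = 0
    for offset in range(0, len(row_bytes), RECORD_SIZE):
        _, quantity, _ = struct.unpack_from(RECORD_FORMAT, row_bytes, offset)
        total += quantity
    return total, len(row_bytes)  # (sum, bytes scanned)


def sum_quantity_column_store(columns):
    """Sum one attribute; only that attribute's column is scanned."""
    column = columns["quantity"]
    values = struct.unpack(f"<{len(column) // 4}i", column)
    return sum(values), len(column)  # (sum, bytes scanned)
```

For 1,000 such records, the row layout scans 16,000 bytes to sum `quantity`, while the column layout scans only the 4,000 bytes of that one column. This is the effect the column-based architecture relies on for queries that return few attributes.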

Compression techniques are also being studied as a way to improve performance in traditional databases. Compression reduces the size of the dataset, which reduces transfer times. Because more compressed data fits into the same amount of memory, compression also increases the buffer hit rate [6]. Column-based databases offer more opportunities for data compression than row-based databases. Compression ratios are also higher in column-based databases, because adjacent values in a column are often similar, whereas adjacent attributes in a record are not [4]. However, if the cost of compressing and decompressing the data exceeds the savings achieved through compression, the overall result is not optimal [6].

Several studies have shown that column-based databases perform better than row-based databases for simple queries that fetch a few attributes from tables or perform aggregation functions. However, column-based databases are not optimized for frequent data manipulation statements.

III. EXPERIMENTS

A. Experimental Setup

We ran our experiments on two versions of the MySQL database. The first version used brighthouse, a column-based storage engine, while the second used MyISAM, a row-based storage engine. Our experimental system is a hyper-threaded 3.2 GHz Pentium IV running Fedora 9 Linux, kernel 2.6.25-4.fc9.i686, with 2.5 GB of memory and 60 GB of disk space.

B. Data

We used the TPC-H data generated at scale factor 1 for our experiments [8]. The toolkit for generating TPC-H data is available at http://www.tpc.org/tpch/spec/tpch_2_8_0.tar.gz. The toolkit has two components: DBGEN for generating data and QGEN for generating queries. After downloading and uncompressing the toolkit file, we copied makefile.suite to makefile. We edited makefile and supplied values for the CC, DATABASE, MACHINE, and WORKLOAD parameters. We then ran the make command, which generates two executable files: dbgen and qgen. When executed, dbgen creates 8 data files that can be loaded into the tables.

The TPC-H schema [8] has 8 tables: PART, SUPPLIER, PARTSUPP, CUSTOMER, LINEITEM, ORDERS, REGION, and NATION. The biggest table in this schema, at scale factor 1, contains 6 million records, and the total size of all tables is around 1 GB.

C. Procedure

We ran our experiments on out-of-the-box installations of MySQL to eliminate the impact of tuning on performance. Because the brighthouse engine does not support indexes, we also did not create any indexes in the database using the MyISAM storage engine. We used the same data files created by DBGEN to load data into both MySQL instances.

We created a few queries that mimic the common queries run in a typical data warehouse. These queries performed aggregation, sequential access, random access, and so on. We ran the queries with different numbers of output columns, ranging from a single column to all columns in a table, to see the effect on performance. We ran the following queries in both MySQL instances:

1: SELECT <column 1>, ..., <column N> FROM <table>;
2: SELECT SUM(<column>) FROM <table>;
3: SELECT <column 1>, ..., <column N> FROM <table> WHERE <key> = <value1> OR <key> = <value2>;
4: SELECT SUM(<column1>*(1-<column2>)) FROM <table>;
5: SELECT <column>, SUM(<column1>*<column2>) FROM <table> GROUP BY <column> HAVING <column> = <value>;
6: SELECT <column>, SUM(<column>) FROM (SELECT <column>, <column> FROM <table> WHERE <column> = <value>) GROUP BY <column> ORDER BY <column>;
7: SELECT <column>, SUM(<column1>*<column2>) FROM <table> GROUP BY <column> HAVING SUM(<column1>*<column2>) > (SELECT SUM(<column1>*<column2>*<variable>) FROM <table> WHERE <column> = <value>) ORDER BY <column>;

Each query was run multiple times to examine the effect of caching on database performance. After each set of executions of an individual query, the cache was cleared so that cached result sets would not affect the results of other queries.

IV. RESULTS

We used the same dataset to populate the tables in both MySQL instances. As shown in Table I, the column-based database, which used the brighthouse storage engine, compressed the data, while the row-based database did not compress the data in its tables. The compression ratio for our test dataset was as high as 7.437. Note that the compression ratio is higher for tables with a higher number of rows. This gives the column-based database an advantage over the row-based database in terms of required disk space.

Table Name   Column-based Database     Row-based Database        Compression Ratio
             Row Count  Compression    Row Count  Compression
CUSTOMER     50000      Yes            50000      No             2.999
LINEITEM     60025      Yes            60025      No             6.086
NATION       25         Yes            25         No             .24
ORDERS       500000     Yes            500000     No             5.590
PART         200000     Yes            200000     No             7.437
PARTSUPP     800000     Yes            800000     No             5.29
REGION       5          Yes            5          No             0.429
SUPPLIER     00000      Yes            00000      No             2.672

Table I: Data compression in column-based database vs. row-based database for individual tables.
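One way to see why a column layout tends to compress better than a row layout is run-length encoding (RLE). The sketch below is an assumption made for illustration: this paper does not state which compression scheme the brighthouse engine uses, and the sample records and the `run_length_encode` helper are hypothetical. It encodes the same values once in row order and once in column order, then compares the number of runs.

```python
from itertools import groupby


def run_length_encode(values):
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


# Hypothetical table with two attributes per record: (status, region).
records = (
    [("OPEN", "EAST")] * 4
    + [("OPEN", "WEST")] * 3
    + [("CLOSED", "WEST")] * 5
)

# Row layout: the attributes of each record are interleaved,
# so neighbouring values come from different attributes.
row_order = [value for record in records for value in record]

# Column layout: each attribute's values are stored contiguously,
# so neighbouring values come from the same attribute.
status_column = [status for status, _ in records]
region_column = [region for _, region in records]

row_runs = run_length_encode(row_order)
column_runs = run_length_encode(status_column) + run_length_encode(region_column)

print(f"runs in row order:    {len(row_runs)}")
print(f"runs in column order: {len(column_runs)}")
```

In row order no two neighbouring values are equal, so RLE stores one run per value. In column order the same 24 values collapse into four runs, which matches the observation that adjacent values within a column are often similar.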

We expected the execution time of a query in the column-based database to approach the execution time in the row-based database as the number of output columns reached the maximum number of columns in a table. As shown in Fig. 1, the execution time increased more rapidly with the number of output columns in the column-based database than in the row-based database. The column-based database performed better for a single non-aggregated output column. But as the number of output columns increased, the execution time grew as a multiple of the execution time required for a single output column.

Fig. 1: Execution times in seconds (log scale) against the number of output columns, for the column-based and row-based databases.

When an individual column was selected with no aggregation, the column-based database required less CPU time to process the query than the row-based database. The same difference in CPU usage was seen for single-column aggregation and random access queries. These results support the experimental results of M. Stonebraker et al. [1]. However, when a query performed a calculation using multiple columns, the column-based database took more time to return the results. After results were cached, the column-based database performed better on subsequent executions of the query. The details of CPU time for our test queries can be found in Table II.

Query   Column-based Database     Row-based Database        Performance Factor
Type    First Run  Subsequent Run First Run  Subsequent Run First Run  Subsequent Run
1       4.07       3.60           43.75      40.3           0.326      0.3389
2       0.04       0.0            0.93       9.40           0.0037     0.00
3       4.73       4.68           .43        .25            0.438      0.46
4       6.74       0.0            9.68       9.59           .7293      0.00
5       33.68      0.02           5.06       4.98           2.2364     0.003
6       0.68       0.59           2.82       20.89          0.032      0.0282
7       259.22     24.3           265.00     25.44          0.9782     0.9590

Table II: CPU usage in seconds, by query execution on column-based database vs. row-based database.

The results in Fig. 2 show the memory usage factor of the row-based database compared to the column-based database. In all test runs, queries used more memory on the row-based database than on the column-based database. The biggest difference in memory usage was found for queries performing aggregation on columns read directly from a physical table. The memory usage factor dropped for queries using in-line views or sub-queries. This drop may be due to the temporary storage of results from in-line views or sub-queries.

Fig. 2: Memory usage factor of row-based database vs. column-based database (log scale).

The results of the memory usage tests, shown in Fig. 2, are in line with our expectations. Because the data is compressed in the column-based database, we expected queries running on it to use less memory. Another factor that may have reduced memory usage in the column-based database is that only the relevant columns are read from the storage device, whereas the row-based database reads the full record.

V. CONCLUSION

In this paper, we showed that the column-based database stored data in compressed form and therefore achieved substantial disk space savings compared to the row-based database. We also found that memory usage was lower in the column-based database than in the row-based database. CPU utilization was lower in the column-based database, except for queries performing calculations on multiple columns retrieved from a physical table. However, in the column-based database, CPU time increased more than we expected as the number of output columns approached the total number of columns in a table. Based on the results of our experiments, we conclude that column-based databases are a promising solution for analytical and read-oriented data warehouses.

REFERENCES

[1] M. Stonebraker et al., C-Store: A Column-Oriented DBMS, Proc. 31st International Conference on Very Large Data Bases (VLDB), August-September 2005, pp. 553-564.
[2] R. A. Hankins and J. M. Patel, Data Morphing: An Adaptive, Cache-Conscious Storage Technique, Proc. 29th International Conference on Very Large Data Bases (VLDB), September 2003, pp. 417-428.
[3] B. He and Q. Luo, Cache-oblivious databases: Limitations and opportunities, ACM Transactions on Database Systems (TODS), vol. 33, no. 2, Article 8, June 2008.
[4] R. MacNicol and B. French, Sybase IQ Multiplex Designed For Analytics, Proc. 30th International Conference on Very Large Data Bases (VLDB), August-September 2004, pp. 1227-1230.
[5] P. Boncz, S. Manegold, and M. Kersten, Database architecture optimized for the new bottleneck: Memory access, Proc. 25th International Conference on Very Large Data Bases (VLDB), 1999, pp. 54-65.
[6] D. Abadi, S. Madden, and M. Ferreira, Integrating Compression and Execution in Column-Oriented Database Systems, Proc. 2006 ACM SIGMOD International Conference on Management of Data, 2006, pp. 671-682.
[7] D. Abadi, A. Marcus, S. Madden, and K. Hollenbach, Scalable Semantic Web Data Management Using Vertical Partitioning, Proc. 33rd International Conference on Very Large Data Bases (VLDB), September 2007, pp. 411-422.
[8] TPC Benchmark H Standard Specification Revision 2.8.0, http://www.tpc.org/tpch/spec/tpch2.8.0.pdf
[9] A. Ailamaki, D. J. DeWitt, M. D. Hill, and D. A. Wood, DBMSs on a Modern Processor: Where Does Time Go?, Proc. 25th International Conference on Very Large Data Bases (VLDB), 1999, pp. 266-277.
[10] TPC Benchmark H Dataset Generator, http://www.tpc.org/tpch/spec/tpch_2_8_0.tar.gz.