Performance of popular open source databases for HEP related computing problems

Similar documents
THE ATLAS DISTRIBUTED DATA MANAGEMENT SYSTEM & DATABASES

An SQL-based approach to physics analysis

An update on the scalability limits of the Condor batch system

Accelerate Database Performance and Reduce Response Times in MongoDB Humongous Environments with the LSI Nytro MegaRAID Flash Accelerator Card

Use of containerisation as an alternative to full virtualisation in grid environments.

MySQL Performance Optimization and Troubleshooting with PMM. Peter Zaitsev, CEO, Percona Percona Technical Webinars 9 May 2018

WLCG Transfers Dashboard: a Unified Monitoring Tool for Heterogeneous Data Transfers.

System upgrade and future perspective for the operation of Tokyo Tier2 center. T. Nakamura, T. Mashimo, N. Matsui, H. Sakamoto and I.

Evaluation Report: Improving SQL Server Database Performance with Dot Hill AssuredSAN 4824 Flash Upgrades

Monitoring of large-scale federated data storage: XRootD and beyond.

CMS users data management service integration and first experiences with its NoSQL data storage

Experience with ATLAS MySQL PanDA database service

NVMe SSDs Future-proof Apache Cassandra

Experience with PROOF-Lite in ATLAS data analysis

Lenovo Database Configuration

The DMLite Rucio Plugin: ATLAS data in a filesystem

Considering the 2.5-inch SSD-based RAID Solution:

Practical MySQL Performance Optimization. Peter Zaitsev, CEO, Percona July 20 th, 2016 Percona Technical Webinars

RACKSPACE ONMETAL I/O V2 OUTPERFORMS AMAZON EC2 BY UP TO 2X IN BENCHMARK TESTING

MySQL Performance Optimization and Troubleshooting with PMM. Peter Zaitsev, CEO, Percona

Abstract. The Challenges. ESG Lab Review InterSystems IRIS Data Platform: A Unified, Efficient Data Platform for Fast Business Insight

Evolution of Database Replication Technologies for WLCG

Accelerating Big Data: Using SanDisk SSDs for Apache HBase Workloads

Striped Data Server for Scalable Parallel Data Analysis

Identifying Performance Bottlenecks with Real- World Applications and Flash-Based Storage

DEDICATED SERVERS WITH WEB HOSTING PRICED RIGHT

IBM Emulex 16Gb Fibre Channel HBA Evaluation

The SHARED hosting plan is designed to meet the advanced hosting needs of businesses who are not yet ready to move on to a server solution.

Big Data is Better on Bare Metal

Catalogic DPX TM 4.3. ECX 2.0 Best Practices for Deployment and Cataloging

CMS High Level Trigger Timing Measurements

The Design and Optimization of Database

Dell PowerEdge R730xd Servers with Samsung SM1715 NVMe Drives Powers the Aerospike Fraud Prevention Benchmark

MongoDB on Kaminario K2

CIS 601 Graduate Seminar. Dr. Sunnie S. Chung Dhruv Patel ( ) Kalpesh Sharma ( )

Module - 17 Lecture - 23 SQL and NoSQL systems. (Refer Slide Time: 00:04)

NoSQL Databases MongoDB vs Cassandra. Kenny Huynh, Andre Chik, Kevin Vu

Evaluation Report: HP StoreFabric SN1000E 16Gb Fibre Channel HBA

Practical MySQL Performance Optimization. Peter Zaitsev, CEO, Percona July 02, 2015 Percona Technical Webinars

2009. October. Semiconductor Business SAMSUNG Electronics

Firebird performance degradation: tests, myths and truth IBSurgeon, 2014

Accelerate Applications Using EqualLogic Arrays with directcache

RIGHTNOW A C E

Hardware & System Requirements

On BigFix Performance: Disk is King. How to get your infrastructure right the first time! Case Study: IBM Cloud Development - WW IT Services

Database Solutions Engineering. Best Practices for Deploying SSDs in an Oracle OLTP Environment using Dell TM EqualLogic TM PS Series

BrainPad Accelerates Multiple Web Analytics Systems with Fusion iomemory Solutions

BENEFITS AND BEST PRACTICES FOR DEPLOYING SSDS IN AN OLTP ENVIRONMENT USING DELL EQUALLOGIC PS SERIES

The High-Level Dataset-based Data Transfer System in BESDIRAC

ATLAS Data Management Accounting with Hadoop Pig and HBase

Reliability Engineering Analysis of ATLAS Data Reprocessing Campaigns

Performance Comparison of NOSQL Database Cassandra and SQL Server for Large Databases

NoSQL Performance Test

Virtualization of the MS Exchange Server Environment

Lenovo Database Configuration for Microsoft SQL Server TB

WEBSITE & CLOUD PERFORMANCE ANALYSIS. Evaluating Cloud Performance for Web Site Hosting Requirements

IBM InfoSphere Streams v4.0 Performance Best Practices

Exchange Server 2007 Performance Comparison of the Dell PowerEdge 2950 and HP Proliant DL385 G2 Servers

Popularity Prediction Tool for ATLAS Distributed Data Management

NVMe SSDs A New Benchmark for OLTP Performance

A Fast and High Throughput SQL Query System for Big Data

DEDICATED SERVERS WITH EBS

Development of DKB ETL module in case of data conversion

Accelerating Enterprise Search with Fusion iomemory PCIe Application Accelerators

ATLAS software configuration and build tool optimisation

Firebird Tour 2017: Performance. Vlad Khorsun, Firebird Project

Benchmarking Cloud Serving Systems with YCSB 詹剑锋 2012 年 6 月 27 日

Sizing & Quotation for Sangfor HCI Technical Training

MySQL Database Scalability

Aerospike Scales with Google Cloud Platform

Kdb+ Transitive Comparisons

Choosing Hardware and Operating Systems for MySQL. Apr 15, 2009 O'Reilly MySQL Conference and Expo Santa Clara,CA by Peter Zaitsev, Percona Inc

Webinar Series: Triangulate your Storage Architecture with SvSAN Caching. Luke Pruen Technical Services Director

A Performance Characterization of Microsoft SQL Server 2005 Virtual Machines on Dell PowerEdge Servers Running VMware ESX Server 3.

PoS(EGICF12-EMITC2)106

NAV 2009 Scalability. Locking Management Solution for Dynamics NAV SQL Server Option. Stress Test Results White Paper

High Throughput WAN Data Transfer with Hadoop-based Storage

Testing the limits of an LVS - GridFTP cluster as a replacement for BeSTMan

A Case Study: Performance Evaluation of a DRAM-Based Solid State Disk

EMC VFCache. Performance. Intelligence. Protection. #VFCache. Copyright 2012 EMC Corporation. All rights reserved.

Seagate Enterprise SATA SSD with DuraWrite Technology Competitive Evaluation

Low Latency Evaluation of Fibre Channel, iscsi and SAS Host Interfaces

Middle East Technical University. Jeren AKHOUNDI ( ) Ipek Deniz Demirtel ( ) Derya Nur Ulus ( ) CENG553 Database Management Systems

Milestone Systems CERTIFICATION TEST REPORT Version /08/17

Data transfer over the wide area network with a large round trip time

Forensic Toolkit System Specifications Guide

An Intelligent & Optimized Way to Access Flash Storage Increase Performance & Scalability of Your Applications

Virtuozzo Hyperconverged Platform Uses Intel Optane SSDs to Accelerate Performance for Containers and VMs

Deep Learning Performance and Cost Evaluation

Solving the I/O bottleneck with Flash

Ingo Brenckmann Jochen Kirsten Storage Technology Strategists SAS EMEA Copyright 2003, SAS Institute Inc. All rights reserved.

Security in the CernVM File System and the Frontier Distributed Database Caching System

Microsoft SQL Server in a VMware Environment on Dell PowerEdge R810 Servers and Dell EqualLogic Storage

Virtualizing SQL Server 2008 Using EMC VNX Series and VMware vsphere 4.1. Reference Architecture

Benchmark of a Cubieboard cluster

Compression in Open Source Databases. Peter Zaitsev April 20, 2016

Fusion iomemory PCIe Solutions from SanDisk and Sqrll make Accumulo Hypersonic

EMC XTREMCACHE ACCELERATES VIRTUALIZED ORACLE

Copyright 2012 EMC Corporation. All rights reserved.

MySQL Performance Improvements

Transcription:

Journal of Physics: Conference Series OPEN ACCESS Performance of popular open source databases for HEP related computing problems To cite this article: D Kovalskyi et al 2014 J. Phys.: Conf. Ser. 513 042027 View the article online for updates and enhancements. Related content - Processing of the WLCG monitoring data using NoSQL J Andreeva, A Beche, S Belov et al. - Structured storage in ATLAS Distributed Data Management: use cases and experiences Mario Lassnig, Vincent Garonne, Angelos Molfetas et al. - The CMS Data Management System M Giffels, Y Guo, V Kuznetsov et al. This content was downloaded from IP address 46.3.198.24 on 26/02/2018 at 07:25

Performance of popular open source databases for HEP related computing problems D Kovalskyi, I Sfiligoi, F Wuerthwein, A Yagil University of California, San Diego, USA E-mail: dmytro@cern.ch Abstract. Databases are used in many software components of HEP computing, from monitoring and job scheduling to data storage and processing. It is not always clear at the beginning of a project if a problem can be handled by a single server, or if one needs to plan for a multi-server solution. Before a scalable solution is adopted, it helps to know how well it performs in a single server case to avoid situations when a multi-server solution is adopted mostly due to sub-optimal performance per node. This paper presents comparison benchmarks of popular open source database management systems. As a test application we use a user job monitoring system based on the Glidein workflow management system used in the CMS Collaboration. 1. Introduction Databases have been actively used in HEP computing for many years. In the past a choice of a proper solution was pretty straightforward. Relational database management systems (RDMS) dominated the market with the two main open source projects: MySQL [1] and PostgreSQL [2]. High demand for scalable database solutions driven by a rapid growth of social networks led to a creation of new non-relational database management systems that allow for easy deployment of large databases across multiple servers. New generation of HEP experiments may have thousands of users with petabyte data volumes to process. In this environment it is not always clear if a single server database management system (DMS) is sufficient. A migration from one DMS to another is always a time consuming non-trivial problem that should be avoided if possible. It is tempting to solve this problem by using a highly scalable non-relational DMS for any application. Unfortunately, if such a DMS performs badly for a single server, we may need to use more hardware than it is really necessary for a given task. In other words we need to make sure that scalability is not achieved at the cost of a poor performance of a single node. In this paper we report performance tests of a few of the most popular relational and nonrelational database management systems for a single server solution. As a test application we use a beta-version of a new CMS user job monitoring system built on top of GlideinWMS (Glidein based Workload Management System) [5]. We have tested the following DMS: MySQL, PostgreSQL, MongoDB [3], Apache Cassandra [4]. Content from this work may be used under the terms of the Creative Commons Attribution 3.0 licence. Any further distribution of this work must maintain attribution to the author(s) and the title of the work, journal citation and DOI. Published under licence by IOP Publishing Ltd 1

2. Test procedure The test procedure is based on a fully functional user job monitoring system called GlideMon. It is based on a single MySQL server used in parallel with the CMS dashboard [6]. In the system the information content is reduced to a minimum that allows users to understand the status of their jobs. For failed jobs a clear summary of causes is provided as well as log files. The log files are stored outside of the monitoring system. For the test we took a snap-shot of the GlideMon database. It has about 14 million individual jobs total. Most queries that we use are only concerned with the last two weeks of data processing. The last two weeks contain roughly 2.2 million jobs and 5 thousand tasks for 251 individual users. The total database size including indexes is about 10-20 GB depending on the DMS. The tests were performed on the following computers: Modern Server - Intel Xeon X5650 @ 2.67 GHz, 12 physical cores, 48GB RAM, SSD Intel 320 MLC, CentOS (6.4) MySQL - 5.1.69-1.el6 4 PostgreSQL - 8.4.13-1.el6 3 MongoDB - 2.4.7-mongodb 1 Multi-core Server - Intel Xeon E5-2660 @ 2.20GHz, 16 physical cores, 128GB RAM, RAID on 15K RPM SAS disks, CentOS (6.4) MySQL - 5.1.69-1.el6 4 PostgreSQL - 8.4.13-1.el6 3 MongoDB - 2.4.7-mongodb 1 Workstation - Intel Core 2 Duo E8500 @ 3.17 GHz, 2 physical cores, 8GB RAM, HDD + SSD Samsung 840 Pro, Fedora 18 MySQL - 5.5.32-1.fc18 PostgreSQL - 9.2.4-1.fc18 MongoDB - 2.2.4-2.fc18 Apache Cassandra - 2.0.0 We used two typical queries that put significant load on the production system. The first one is a complex aggregation query that counts the number of analysis jobs in each state for each user within the last N days. It is normally executed in a single thread to create materialized views (a set of temporary tables), which are queried directly by a web server in the GlideMon system. The second query is a fairly simple select request to list all active tasks for a given user with a number of jobs summary. It is a dynamic query in the production system that is executed for each user request. The test procedure did not cover simultaneous read/write access issues since it was not relevant in our monitoring system. For the same reason, we also did not push the database size to the limit of the available space. All relational database implementations use an identical schema with the same list of indexes. In the case of non-relational databases, we stored all information in a single table that effectively represents an SQL join of the relevant tables that are used for the two test queries. In the case of MongoDB a few additional indexes were added for the documents that corresponds to primary keys in the relational database schema. The difference in the disk space utilization plays a role in the overall performance, since the OS can be more efficient in caching data in the memory for smaller databases. MongoDB takes significantly more disk space (24GB) compared with other DMS: MySQL(4GB), PostgreSQL(5GB) and Cassandra(5GB). The difference is not fully understood. It can be partially explained by the denormalization of data in MongoDB and Cassandra cases and the need for extra indexes in MongoDB (Cassandra was not indexed). 2

We do not show any results for Apache Cassandra since we were not able to implement needed functionally. Unfortunately we realized that this DMS does not fit our task only after creating a functional database and starting to query it. The main problem is that Cassandra is not designed for complex filtering or aggregation. The fact that the query language used in Cassandra (CLQ) is effectively SQL adds to the confusion that Cassandra can be used as a general purpose DMS. To illustrate the limitations of the system we can mention that a simple query to count the total number of elements (columns in Cassandra terminology) may take hours on a simple database like the one that we used. In general this DMS is designed for queries that return a very small number of elements and all the complex aggregation and filtering operations should be performed at the time when the data is inserted into the database. 3. Results 3.1. Complex aggregation query tests The test consists of a set of queries with an increasing range of aggregation. Table 1 shows the total execution time of the test and Figure 1 shows the query execution time as a function of the number of days that we aggregate. The results in the table indicate that MySQL and PostgreSQL showed similar performance on all hardware with slightly better results demonstrated by MySQL ( 30% faster). MongoDB showed much worse performance on the Workstation with just 12GB of RAM (almost an order of magnitude slower). Using more advanced hardware we managed to reduce the gap between MongoDB and RDMS, but even in this case MongoDB was a factor of 3 slower. The results presented in Figure 1 effectively show how the query execution time changes with the increase of the number of elements that have to be counted. MongoDB shows a sharp increase in the execution time above a certain threshold. Replacing the hard drive with a solid state disk, we observed a factor of two drop in the execution time for the queries with a large number of elements. The SSD had no impact for queries accessing a small number of elements. Moving to more advanced hardware we observed an overall reduction in the execution time by about a factor of two for all queries. These results indicate that MongoDB can be more demanding to hardware and available memory for complex queries. 3.2. Simple query tests While a poor performance of a complex query can be easily noticeable, in practice such queries are heavily optimized and if it is necessary a data model used in the database is modified to avoid the problem. For a database that has a sizable number of users, performance of simpler queries can become an issue that is more important and harder to handle. The simple query test results are shown in Table 2. Comparing relational database management systems we found a significant performance gain for PostgreSQL with respect to MySQL. While the median execution time for a single query is identical for the two system, total execution time of 1000 queries is a factor of two smaller for PostgreSQL. This means that PostgreSQL can perform significantly better than MySQL for the type of query that we used mostly by delivering more consistent query execution time. MongoDB showed results comparable with MySQL using advanced hardware. With 10 simultaneous queries MongoDB managed to outperform MySQL, but its performance was significantly degraded on a weaker machine. Nevertheless the performance of MongoDB was never worse than a factor of two compared with MySQL. While our tests were all done in a single server mode, it is interesting to observe how well different DMS results scale with the number of physical cores in a server. Table 3 shows scalability for MySQL, PostgreSQL and MongoDB comparing the total execution time for a test with one thread with respect to the 12 and 16 simultaneous threads that corresponds to the 3

number of physical cores in the two servers that we tested. We found that MySQL has a major issue with the scalability, while PostgreSQL and MongoDB scaled very efficiently. 3.3. Experience deploying different database management systems MySQL and PostgreSQL are well established relational DMS. The systems are widely known and there is a lot of good literature available. Database administration tools and programing interfaces are well developed. Despite their popularity, deploying both MySQL and PostgreSQL can be challenging for non-experts mostly due to strict access rules especially in the case of PostgreSQL. In general this is not an issue, just something that needs to be planned for. MongoDB and Cassandra are very popular solutions in a new generation of non-relational database management systems. They are quite easy to deploy and fairly easy to use. Both systems are under active development and it can be a challenge to find good documentation, but this is a minor issue. Hardware MySQL PostgreSQL MongoDB unbuffered / buffered execution time (secs) Modern Server (ssd, 48GB) 35 / 27 38 / 34 114 / 91 Workstation (ssd, 12GB) 37 / 26 41 / 37 186 / 173 Workstation (hdd, 12GB) 85 / 27 61 / 37 277 / 242 Table 1. The complex aggregation query test results. Hardware Threads MySQL PostgreSQL MongoDB one query median / 95% / whole test time (secs) Modern Server (ssd, 48GB) 2 0.02 / 0.3 / 33 0.02 / 0.1 / 12 0.02 / 0.4 / 36 Workstation (ssd, 12GB) 2 0.02 / 0.3 / 33 0.02 / 0.1 / 15 0.03 / 0.4 / 43 Workstation (hdd, 12GB) 2 0.02 / 0.3 / 33 0.02 / 0.1 / 15 0.05 / 0.8 / 78 Modern Server (ssd, 48GB) 10 0.03 / 0.5 / 12 0.02 / 0.1 / 3 0.02 / 0.3 / 7 Workstation (ssd, 12GB) 10 0.08 / 1.2 / 26 0.10 / 0.4 / 14 0.12 / 2.0 / 43 Workstation (hdd, 12GB) 10 0.07 / 1.2 / 27 0.10 / 0.4 / 14 0.12 / 2.5 / 59 Table 2. The simple query test results. The total number of queries per test is 1000. Memory based caching of disk I/O is in use. 4. Conclusions For the type of queries that we tested, relational database management systems showed the best performance in a single server mode. Out of the two RDMS, PostgreSQL showed better consistency of the query execution time and a better scalability with a number of physical cores in a server. Non-relational systems showed mixed results. Apache Cassandra turned out to be not suitable for our application, while MongoDB showed comparable with RDMS performance on advanced hardware for simple queries and weak performance for complex aggregation queries. MongoDB and PostgreSQL showed close to 100% scalability with the increased number of physical cores. 4

Query execution time 60 50 40 30 MySQL Workstation HDD Buffered MongoDB Workstation HDD Buffered MongoDB Workstation SSD Buffered MongoDB Server SSD Buffered 20 10 0 2 4 6 8 10 12 14 days Figure 1. The complex aggregation query results for different reporting periods. Number of jobs considered increases from about 70 thousand jobs for 1 day to about 2 million jobs for 14 days. Database Management System Scalability (12-cores) Scalability (16-cores) MongoDB 83% 96% PostgreSQL 78% 100% MySQL 32% 13% Table 3. Scalability of various database management systems for the simple query tests. The total number of queries per test is 1000. Scalability is defined as 1 t 1 N t N, where N is number of simultanous threads, t 1 is time to finish the test using one thread and t N is time to finish the test using N threads. Two servers where used - one with 12 physical cores and the other with 16 cores. Memory based caching of disk I/O is in use. Given that a single core performance of modern CPUs is not likely to increase, this property of MongoDB can be very useful even in a single server mode. If it is likely that a database management system may outgrow a single server, using MongoDB from the start can be a wise choice, but it is also likely that it will use more resources (both CPU and disk space) than would be needed with either MySQL or PostgreSQL. 5. Acknowledgments We would like to thank Nikolay Savvinov and Jacob Ribnik for all the fruitful discussions. 5

References [1] http://dev.mysql.com/doc/ [2] http://www.postgresql.org [3] http://www.mongodb.org/ [4] http://cassandra.apache.org/ [5] I Sfiligoi, glideinwmsa generic pilot-based workload management system, 2008, J. Phys.: Conf. Ser. 119, 062044, doi:10.1088/1742-6596/119/6/062044 [6] E Karavakis et al, CMS Dashboard Task Monitoring: A user-centric monitoring view, 2010, J. Phys.: Conf. Ser. 219 072038, doi:10.1088/1742-6596/219/7/072038 6