Genome Data Management using RDBMSs

Steffen Janetzki, University of Magdeburg
Magnús Rafn Tiedemann, University of Magdeburg
Hardik Balar, University of Magdeburg

Abstract

The goal of genomics is to analyze genomic sequences in order to elucidate how DNA mutations and gene expression changes relate to an organism's development, behavior or predisposition to diseases. The current state of the art for data storage in genome analysis is flat-file-based, which makes it complex to access and manage genome data. As the amount of available genome data grows exponentially due to the reduced cost of genome sequencing, storing and querying it efficiently poses a significant challenge. Using a DBMS for genome data management has recently gained traction in the scientific community. This paper presents an approach to querying variant calling data by integrating it into a relational database system. We have used MonetDB and PostgreSQL to evaluate our approach. We expected MonetDB to perform better due to its main-memory-based execution model. As our data size does not exceed the main memory size, we found this to be the case. Moreover, we find that loading the entire VCF file is not useful for implementing a specific use case on a database system.

Keywords: genome data, variant calling, phenomics, VCF, RDBMS, MonetDB, PostgreSQL, SQL

I. INTRODUCTION

Since the completion of the first sequencing of the human genome by the Human Genome Project in 2003 (started in 1990), genome sequencing has become much more efficient and cheaper [1]. This genetic information has become more valuable since it makes it possible to draw conclusions about an individual's traits, such as vulnerability to certain diseases or drug addictions. Additionally, the Human Genome Project provides a foundation for several other projects. One of these is the 1000 Genomes Project [2]. Large amounts of data are gathered in this project and are free to access for everyone who is in need of genome data.
It is primarily used for analyzing human genetic diversity and for disease prediction.

Even though large-scale DNA sequencing produces huge amounts of data, very little attention has been paid to managing such data using databases. A possible reason is that genome data is very complex and multi-layered. Genome data can be of different types, ranging from structural and functional to population and clinical data, which makes it difficult for biologists and clinicians to access [3]. Current approaches use different file formats, such as text files or proprietary binary files, for handling these different layers of data. While flat files are extremely useful for storing large amounts of data, they are not particularly efficient when it comes to querying this data to gain useful insights, which is the main use of genome data. The current file-centric approach cannot be reliably used when consistency of data is expected.

In order to overcome the shortcomings of flat-file-based approaches, one could use a database management system instead. This approach has recently gained traction in the scientific community. Currently, the use of DBMSs is limited to storing workflow-related metadata in flat-file-based approaches [4]. Using a full-fledged DBMS for genome data management is an area of ongoing research.

In this paper, we focus on effectively storing and querying variant calling data, a special kind of genome data that describes possible mutations within a human genome that could lead to diseases. To this end, we designed a relational database schema. We have used two different relational DBMSs: MonetDB and PostgreSQL. Both systems are open source. MonetDB is widely used in academia, while PostgreSQL can also be found in commercial applications. The systems differ in their architecture: MonetDB is a main-memory-based column store and PostgreSQL is a disk-based row store [5].
Using these systems for our evaluation gives us insights into the capabilities of RDBMSs for genome data management for our specific use case. Our main contributions are:

- Designing specific SQL queries that can be used to query variant calling data
- Designing a specific database schema so that the queries can be executed on a database system
- Converting the VCF file into CSV files according to the designed schema in order to integrate it into the databases
- Evaluating database performance with regard to query execution time and data size under different workloads

The remainder of this paper is organized as follows: Section II provides the necessary domain background on genomics and the relational database systems under consideration. Section III provides an overview of related research. In Section IV, we present our overall approach for designing a schema from a VCF file and formulating the queries, which we evaluate in Section V. Section VI discusses possible threats to validity. Finally, Section VII provides conclusions and directions for future work.

II. BACKGROUND

A. Biological Basics

Each living organism has its genetic information coded as DNA (deoxyribonucleic acid). Human DNA consists of three billion base pairs, where each pair is a combination of two nucleotide bases (adenine, thymine, guanine, cytosine). Even though the DNA of each human being is unique, around 99% of it is the same among all humans. It is therefore important to understand the variable 1% of the genome. The purpose of genome sequencing is to unravel the order of these base pairs. It is not possible to sequence an entire human genome at once, because current sequencing technology cannot handle long stretches of DNA at a time. Scientists therefore have to break the genome into pieces, sequence the pieces and assemble them in the correct order [6]. This may lead to variations in sequenced genomes due to sequencing and assembly errors.

The other kind of genomic variation consists of mutations. A mutation is a change in the DNA sequence of an organism. DNA sequencing can be useful for finding genetic variations that may cause a disease. A disease-causing change can be very small, like the substitution, deletion or addition of a single base pair, but larger mutations can affect thousands of bases at once [7]. Many useful scientific insights can also be gained by studying how a given organism mutates over time. For instance, on average about 70 mutations occur in the whole human genome per generation (from a parent to its child).

B. Variant calling

In this project, we have decided to use variant calling data. Variant calling (Single Nucleotide Polymorphism calling) is the process of determining the positions at which at least one base differs from a reference sequence. It is used to identify the genotype of a human subject, which essentially determines the genetic makeup of an organism.
Many different tools are available for analyzing sequence alignments to determine differences between a sample and a reference sequence, such as SAMtools [8], vcflib [9], GATK ([10], [11]), and Atlas2 [12]. These tools use different variant calling algorithms, and their output is stored in a file format called Variant Calling Format. The next subsection introduces this format in detail.

C. Variant Calling Format (VCF)

VCF is one of many formats for storing genome data. It is used by several projects, e.g., the 1000 Genomes Project. A VCF file consists of several lines of metadata describing the actual data, followed by a header line. Figure 1 shows sample metadata of a VCF file. It describes VCF-file-related metadata using standard tags and annotations. Additionally, it includes a description of the actual data fields stored in the particular VCF file. For instance, lines 7 to 15 provide meta information about the INFO, FILTER and FORMAT fields of the VCF file.

Fig. 1: Meta information of a VCF file

The header line consists of:

#CHROM - the chromosome under consideration
POS - the position within this chromosome
ID - unique identifier
REF - non-mutated (reference) alignment
ALT - mutated alternative alignment(s)
QUAL - quality of the measurement
FILTER - filter that was applied
INFO - additional information about the data
FORMAT - additional information about the data, e.g., the type of data
HG IndividualXXX - index of the appropriate alternative for this individual; the index is 0 if there has been no mutation

The remaining content of a VCF file is the data itself.

#CHROM  POS  ID         REF   ALT      QUAL  FILTER  INFO
             rs         G     A        29    PASS    DP=
                        T     A        3     q10     DP=
             rs         A     G,T      67    PASS    DP=
                        T     .        47    PASS    DP=
             microsat1  GTCT  G,GTACT  50    PASS    DP=9

TABLE I: Mandatory columns of VCF

Table I shows the mandatory fields of a sample VCF file that describes variant calling data for chromosome 20. Optionally, if genotype data of individual samples is available, it can also be included. Table II shows such individual data.
The mandatory columns must then be followed by a FORMAT field, which describes how the data of the individuals is represented.

FORMAT       NA00001         NA00002         NA00003
GT:GQ:DP:HQ  0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.
GT:GQ:DP:HQ  0|0:49:3:58,50  0|1:3:5:65,3    0/0:41:3
GT:GQ:DP:HQ  1|2:21:6:23,27  2|1:2:0:18,2    2/2:35:4
GT:GQ:DP:HQ  0|0:54:7:56,60  0|0:48:4:51,51  0/0:61:2
GT:GQ:DP     0/1:35:4        0/2:17:2        1/1:40:3

TABLE II: Variant calling data of individuals
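The column layout described in this subsection can be illustrated with a small parser. The following is only a minimal sketch, not the converter used in this work: it assumes tab-separated lines, reads only the columns that matter later (chromosome, position, REF, ALT and the GT entry of each individual), and uses made-up input values. The names `VcfRecord`, `parse_header` and `parse_record` are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class VcfRecord:
    chrom: str
    pos: int
    ref: str
    alts: List[str]
    # sample id -> allele indices (0 = reference, None = missing call)
    genotypes: Dict[str, List[Optional[int]]]


def parse_header(line: str) -> List[str]:
    """Return the sample ids, i.e. the columns after FORMAT in the #CHROM line."""
    return line.rstrip("\n").split("\t")[9:]


def parse_record(line: str, samples: List[str]) -> VcfRecord:
    """Parse one tab-separated data line into the fields listed above."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, _id, ref, alt = fields[:5]
    gt_index = fields[8].split(":").index("GT")  # position of GT inside FORMAT
    genotypes = {}
    for sample, cell in zip(samples, fields[9:]):
        gt = cell.split(":")[gt_index]
        # "|" marks phased and "/" unphased genotypes; both separate allele indices.
        alleles = gt.replace("|", "/").split("/")
        genotypes[sample] = [None if a == "." else int(a) for a in alleles]
    return VcfRecord(chrom, int(pos), ref, alt.split(","), genotypes)


# Illustrative input; the values are made up for this sketch.
HEADER = "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tNA00001\tNA00002\tNA00003"
LINE = "20\t1000\trs1\tA\tG,T\t67\tPASS\tDP=10\tGT:GQ:DP\t1|2:21:6\t0|0:48:4\t./.:0:0"

record = parse_record(LINE, parse_header(HEADER))
print(record.pos, record.alts, record.genotypes)
```

An allele index of 0 refers to REF, and an index i > 0 refers to the i-th entry of ALT, which is why the ALT column is split on commas.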

D. Relational Database Systems

We have decided to use relational database systems for this project. The relational database system was the first mainstream type of database system; it is based on the relational model invented by E. F. Codd and is still widely used. The main advantage of relational databases over file-based storage is that a single database can be spread across several tables which are related via constraints. This makes viewing and handling related data much easier than with flat files. Here, we describe the main differences between disk-based and main-memory-based systems and then list the main features of MonetDB and PostgreSQL.

1) Main-memory vs. disk-based systems: In a conventional database system, data resides on secondary storage (disk) and is only cached in memory when needed. In main-memory-based systems, data resides in primary physical memory and has a backup copy on disk. This has led to many different design considerations and performance variations in database systems [13]. When the entire database fits into main memory, a main-memory system typically outperforms a disk-based one. However, when the amount of data is too large to fit into main memory, or data security is of paramount concern, disk-based systems may be preferable. Another important aspect to consider is the index structures of both kinds of system. Disk-oriented index structures are designed to minimize disk accesses and disk space, whereas main-memory index structures aim to reduce overall computation time while using as little memory as possible [14].

2) MonetDB: MonetDB is an open-source relational DBMS designed to exploit the large main memories of modern computer systems for efficient query processing. It employs a main-memory-based architecture, as described in the previous subsection, with a column-store storage model.
It does not require manual tuning to improve performance in most cases, as it already provides adaptive indexing, a query execution architecture tuned for modern CPUs, run-time query optimization, and a modular software architecture [5].

3) PostgreSQL: PostgreSQL is an open-source, SQL-compliant object-relational database system. It employs a disk-oriented architecture with a row-based storage model. PostgreSQL provides complete support for reliable transactions (ACID properties). Like other advanced DBMSs, it can also be extended through stored procedures [15].

III. RELATED WORK

In this section, we document related work that we identified during our literature review. For genome analysis, there are already well-developed database systems like GEMINI [16], which provides a flexible framework for exploring all forms of human genetic variation. The GEMINI (Genome Mining) framework loads VCF files into a database and automatically annotates each variant. The back-end database is the portable SQLite. It has very sophisticated querying and filtering mechanisms in place, which we used as an inspiration for designing some of our queries [APPENDIX, Listing 1]. Like GEMINI, we load a VCF file into a database so that we can query it later, but we only load a part of the VCF file in order to keep our schema minimal. GEMINI also stores pre-calculated information, such as a genetic diversity count, in the database itself, which makes it easier to query.

We have also looked into an approach by Röhm et al. that integrates genotyping into a DBMS using user-defined functions [4]. The work of Dorok et al. presents an approach for variant calling inside a main-memory DBMS using SQL [17]. Our approach uses two different database systems, so we can make a statement about the performance of the executed queries and the scalability of each system. We have also combined phenotype data with variant calling data, which allows us to execute a phenomics-related query [APPENDIX, Listing 2].

IV. METHODOLOGY

This section describes the queries that we designed. Furthermore, it presents the necessary schema adaptations and the process of converting the VCF files into the new format.

A. Queries

1) Query 1: Count the mutations in one chromosome of one individual. This query answers the question: how many mutations occur in a certain chromosome of a certain individual? It is inspired by a query designed for the GEMINI (Genome Mining) framework [18]. The original query computes the nucleotide diversity of variants, a measure used to express the degree of polymorphism in a population at the nucleotide level [19]. Since the intention of this work is to evaluate querying variant calling data with different DBMSs, we did not implement the further mathematical steps needed to calculate genomic diversity.

2) Query 2: Compare the genotypes of phenotypically similar individuals. The purpose of this query is to show how individuals with phenotypically similar traits may or may not differ genotypically. It was inspired by a paper on a complex analytics genomics benchmark, which evaluates genomic data of patients and relates it to their drug response [20]. Instead of considering diseased patients and their drug response, this query only considers simple phenotypical traits like eye color, hair color and height.

3) Query 3: Display all individuals with a certain set of mutations. This query displays occurrences of sets of mutations and makes it possible to track possible correlations between these mutations. Since the DNA of living organisms exhibits more than a single mutation, checking samples for a set of mutations could provide information about possible co-occurrence or linkage.

B. Schema

According to the queries described in the previous subsection, a part of the VCF file is used to design a corresponding schema. Here, we describe the extraction of data from the VCF file in more detail.

1) Restrictions: The introduced queries do not require the VCF columns ID, QUAL, FILTER, INFO and FORMAT, which are therefore not part of the new schema. The remaining columns #CHROM, POS, REF, ALT and the columns containing the ids of the considered individuals are remodeled as shown in Figure 2.

2) Variant Calling: The information concerning the chromosome and the concrete position of a variant is extracted into its own table called VariantCalling as chrom and pos. Additionally, an identifier, vcid, is generated and assigned to make each element distinguishable from the others.

3) Mutations: The actual information about the change caused by the mutation, coded as reference REF and alternative(s) ALT, is stored in a separate table called Mutations as ref and alt. Since some combinations of REF and ALT may occur more than once, extracting this information into a separate table makes it possible to remove duplicates and to refer to each combination by a newly assigned identifier, mid.

4) Mapping: Finally, all columns concerning the considered individuals are extracted into a new table called Mapping.
This table contains the ids of the individuals, referenced as hid, and two foreign keys referring to the new VariantCalling and Mutations tables. Each row of this table states which individual has which mutation at which chromosome and position. Combinations where the reference and the alternative of a mutation are the same (i.e., no mutation) are not stored in this table. The Mapping table therefore only keeps track of actual mutations, which saves a large amount of storage compared to the VCF files, where equality of reference and alternative is marked as a 0.

C. Phenotype

For Query 2, phenotype data is needed for every individual, but it is not contained in the VCF files. Therefore, we generated artificial phenotype data. The corresponding table is called Phenotype and consists of four columns: the id of the individual as hid, which is the primary key, and three phenotype traits, namely eye color, hair color and height.

D. Queries

This subsection describes the semantics of the queries we have designed and their usefulness to researchers.

1) Converting into several files: The conversion of the VCF files into CSV files according to the schema shown in Figure 2 is performed by an application that was implemented as part of this work. CSV was chosen as the destination format because the considered DBMSs support importing data in this format.

V. EVALUATION

This section describes the experimental setup, followed by the evaluation results, which are visualized and summarized.

A. Experimental Setup

Here we present our setup, the workload details and the considerations needed to make a fair comparison between the two database systems. For our evaluation, we use a system with the properties listed in Table III.
Operating System: OS X Yosemite (Version )
Processor: Intel(R) Core(TM) i5-5257U (3M Cache, up to 3.10 GHz)
Main Memory: 16 GB 1867 MHz DDR3
HDD: 250 GB SSD

TABLE III: Experimental Setup

The general measurements taken into consideration are:

- Data size in the database
- Query execution time

Data sets of different sizes (1M, 5M, 10M, 50M, 100M) were queried to evaluate the scalability of the DBMSs. Besides varying the workload size, we also discarded some of the initial execution time results, since on first execution the database system has to build indexes or cache data. We executed each query 10 times and averaged the results to make the comparison as fair as possible.
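The measurement procedure described above (discard the initial executions, then average 10 runs) can be sketched as follows. This is a minimal illustration under stated assumptions, not the harness used for the reported results: the number of discarded warm-up runs (3) is an assumption, the helper name `mean_runtime_ms` is hypothetical, and an in-memory SQLite table with made-up rows stands in for MonetDB and PostgreSQL.

```python
import sqlite3
import statistics
import time
from typing import Callable


def mean_runtime_ms(run_query: Callable[[], object], warmup: int = 3, runs: int = 10) -> float:
    """Average wall-clock time of `run_query` in milliseconds.

    The first `warmup` executions are run but not timed, so that index
    building and caching on first use do not bias the average.
    """
    for _ in range(warmup):
        run_query()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        run_query()
        timings.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(timings)


# Stand-in database with made-up rows, only to have something to time.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mapping (hid TEXT, vcid INTEGER, mid INTEGER)")
conn.executemany(
    "INSERT INTO mapping VALUES (?, ?, ?)",
    [("HG00096", i, i % 7) for i in range(10_000)],
)

elapsed = mean_runtime_ms(
    lambda: conn.execute(
        "SELECT COUNT(hid) FROM mapping WHERE hid = ?", ("HG00096",)
    ).fetchall()
)
print(f"{elapsed:.3f} ms")
```

Passing the query as a callable keeps the timing logic independent of the database driver, so the same helper could wrap a client for either system.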

Fig. 2: Schema design

B. Storage consumption

Table IV lists the storage size of the tables in MonetDB and PostgreSQL compared with the original size of the VCF file.

File name        Original  MonetDB  PostgreSQL  Ratio
Variant Calling
Mutations
Phenotype
Mapping (1M)
Mapping (5M)
Mapping (10M)
Mapping (50M)
Mapping (100M)

TABLE IV: Data size in databases

C. Execution time & Scalability

The following paragraphs describe and visualize the measured execution times and scalability. As MonetDB is a self-optimizing DBMS working in main memory, two adaptations to the measurement procedure became necessary:

- MonetDB optimizes itself during a short start-up phase, so the first few queries take additional time and may bias the results.
- Since PostgreSQL is not self-optimizing, it has to be optimized manually by creating indexes.

Listing 4 shows the indexes created on the mapping table in PostgreSQL. MonetDB does not require explicitly created indexes for our purpose. For the visualization, a logarithmic scale was chosen, since the differences between the measurements of the two DBMSs are so large that it would otherwise hardly be possible to draw conclusions about their scalability.

D. Query 1

The purpose of this query is to count the mutations of a certain individual in a certain chromosome. Figure 3 shows the execution time of MonetDB and PostgreSQL in milliseconds for different workloads. For smaller workloads the execution time of both systems is almost the same, but as the workload increases, it has a greater impact on the performance of PostgreSQL than on that of MonetDB.

Fig. 3: Query 1 Benchmark

E. Query 2

This query aims to display the genotypical similarity or difference of individuals with similar phenotypical traits, e.g., the same eye color. The graph in Figure 4 shows the execution time on both systems. Both DBMSs show similar scalability for different data set sizes. Considering the execution time, MonetDB requires only a small amount of time, whereas PostgreSQL takes on average about 7 times longer to execute the query.

Fig. 4: Query 2 Benchmark

F. Query 3

The intention of this query is to determine possible connections between different mutations and their occurrences. For the execution of this query, PostgreSQL requires on average about 70 times the time required by MonetDB. For the first two data set sizes, MonetDB already outperforms PostgreSQL in execution time with a ratio of 1:18.6, and further increasing the data size leads to worse scaling of PostgreSQL, up to a ratio of 1:134 at 50M records. This large increase of the ratio between MonetDB and PostgreSQL is easily observable in Figure 5 between the data sizes of 10M and 50M records.

Fig. 5: Query 3 Benchmark

G. Conclusion of evaluation

Except for the first two data points of the second query, MonetDB outperforms PostgreSQL at every point of measurement. In all the other queries MonetDB achieves better results than PostgreSQL. It becomes visible that MonetDB scales better than PostgreSQL, at least for the data sizes chosen in this experiment. Even with increased data size, MonetDB still computes quickly, whereas the execution time of PostgreSQL increases rapidly. This can be explained by MonetDB's internal self-optimizing algorithms. Another reason for MonetDB's low execution times is its use of main memory for computing its queries. Since the size of the queried data sets did not exceed the main memory of the system used, MonetDB was able to benefit from this throughout the experiment.

VI. THREATS TO VALIDITY

In this section, we list some of the factors that can influence the findings of this project. In general, we have a very simple setup without any custom configuration. Our data resides on a local system and does not use any distributed storage server on the web.

Database configuration: PostgreSQL is a highly configurable database that can be adapted to the user's needs [21], but for the evaluation in this project we did not apply any custom configuration to the system. It may also be possible to get better performance with a different choice of indexes in PostgreSQL. Both MonetDB and PostgreSQL were set up in development mode, and hence no claim about the validity of the results can be made for production or distributed environments.

Type of the data: Since our converter omits many parts of the VCF file, the findings may differ when compared with other systems like GEMINI, which can load an entire VCF file into a database. Moreover, we have also ignored annotations and other information provided in a VCF file.

Synchronous access of data: Real-world systems often employ a setup with multiple clients and a single server, whereas our evaluation was carried out on a single system with different processes serving as a single client and server.

VII. CONCLUSION & FUTURE WORK

This section summarizes the overall findings of our project and lists further work in this area that can be done in future research. In the course of this project, several measurements were taken for both RDBMSs by loading variant calling data and then querying it to measure execution time and scalability. We used a VCF file for our project. Since we were not able to work with the VCF file directly, we converted it to CSV files and loaded these into the database systems, which required schema-specific changes. Consequently, we did not load the entire VCF file but only some of its fields. Furthermore, we chose execution time, storage consumption and scalability as measurements and designed queries that deliver results for them, allowing a comparison between the two RDBMSs. Our evaluation metrics were query performance, database size after loading the data, and scalability. In our evaluation section, we have shown that MonetDB has some advantage over PostgreSQL when the size of the data does not exceed the size of main memory, which is the case for our project. We were also able to improve the performance of PostgreSQL by adding indexes, while MonetDB does not require such an external index structure.

In future work, MonetDB could be tested on data set sizes exceeding the system's main memory to observe how it handles this situation. Furthermore, instead of variant calling data, other types of genome data could be considered. NoSQL systems, which claim to offer better performance and scalability, could also be an interesting area of future work.

REFERENCES

[1] R. D. L. Almasy, S. Williams-Blangero et al., Genome Mapping and Genomics in Human and Non-Human Primates. Berlin, Heidelberg: Springer-Verlag Berlin Heidelberg,
[2] 1000 Genomes - A Deep Catalog of Human Genetic Variation, accessed:
[3] A. M. Kogelnik, S. B. Navathe, and D. C. Wallace, GENOME: A Networked Database Environment for Human Genome Data,
[4] U. Röhm and J. A. Blakeley, Data Management for High-Throughput Genomics,
[5] S. Idreos, F. Groffen, N. Nes et al., MonetDB: Two decades of research in column-oriented database architectures,
[6] T. Hubbard, D. Barker, E. Birney et al., The Ensembl genome database project,
[7] R. Nielsen, J. S. Paul, A. Albrechtsen et al., Genotype and SNP calling from next-generation sequencing data,
[8] H. Li, B. Handsaker, A. Wysoker et al., The sequence alignment/map format and SAMtools, Bioinformatics, vol. 25, no. 16, p. 2078,
[9] E. Garrison, vcflib - a C++ library for parsing and manipulating VCF files, accessed:
[10] M. A. DePristo, E. Banks, R. Poplin et al., A framework for variation discovery and genotyping using next-generation DNA sequencing data, Nat Genet, vol. 43, no. 5, pp ,
[11] A. McKenna, M. Hanna, E. Banks et al., The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data, Genome Research, no. 9, pp , Sep
[12] D. Challis, J. Yu, U. S. Evani et al., An integrative variant analysis suite for whole exome next-generation sequencing data, BMC Bioinformatics, vol. 13, p. 8,
[13] H. Garcia-Molina and K. Salem, Main Memory Database Systems: An Overview,
[14] T. J. Lehman and M. J. Carey, A Study of Index Structures for Main Memory Database Management Systems,
[15] O. TeZer, SQLite vs MySQL vs PostgreSQL: A Comparison Of Relational Database Management Systems, accessed:
[16] U. Paila, B. A. Chapman, R. Kirchner et al., GEMINI: Integrative Exploration of Genetic Variation and Genome Annotations, PLoS Computational Biology, vol. 9, no. 7,
[17] S. Dorok, S. Breß, and G. Saake, Toward Efficient Variant Calling inside Main-Memory Database Systems,
[18] Querying the GEMINI database, latest/content/querying.html#basic-queries, accessed:
[19] M. Nei and W.-H. Li, Mathematical model for studying genetic variation in terms of restriction endonucleases,
[20] R. Taft, M. Vartak, N. Rajagopalan et al., GenBase: A Complex Analytics Genomics Benchmark,
[21] PostgreSQL Documentation, accessed:

APPENDIX

    SELECT COUNT(mapt.hid)
    FROM mapping mapt
    WHERE mapt.hid = 'HG00096';

Listing 1: Query 1

    SELECT mapt.hid, vart.pos, mut.alt,
           phenot.eye_color
    FROM (((variant_calling vart
        NATURAL JOIN mapping mapt)
        NATURAL JOIN mutations mut)
        NATURAL JOIN phenotype phenot)
    WHERE phenot.eye_color = 'brown'
      AND vart.vcid = mapt.vcid
      AND mapt.mid = mut.mid
    ORDER BY vart.pos, mapt.hid;

Listing 2: Query 2

    SELECT mapt.hid, vart.pos, mut.alt,
           phenot.eye_color
    FROM (((variant_calling vart
        NATURAL JOIN mapping mapt)
        NATURAL JOIN mutations mut)
        NATURAL JOIN phenotype phenot)
    WHERE phenot.eye_color = 'brown'
      AND vart.vcid = mapt.vcid
      AND mapt.mid = mut.mid
    ORDER BY vart.pos, mapt.hid;

Listing 3: Query 3

    CREATE INDEX idx_mapping ON mapping(hid);
    CREATE INDEX idx_mapping_mid ON mapping(mid);

Listing 4: Index on mapping table in PostgreSQL
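To make the schema of Section IV and the listings above easier to try out, the following sketch builds the four tables, normalizes a few records into them, and runs Query 1 and Query 2. It is an illustration under several assumptions, not the implementation evaluated in this paper: SQLite (via Python's standard library) stands in for MonetDB and PostgreSQL, the column types are guessed, one Mutations row is stored per (ref, alt) pair selected by a genotype's allele index, Query 2 is written with explicit JOIN ... USING instead of NATURAL JOIN, and all sample data is made up. The helper `load` is hypothetical.

```python
import sqlite3

# Column types are assumptions; the text only names the tables and columns.
SCHEMA = """
CREATE TABLE variant_calling (vcid INTEGER PRIMARY KEY, chrom TEXT, pos INTEGER);
CREATE TABLE mutations (mid INTEGER PRIMARY KEY, ref TEXT, alt TEXT, UNIQUE (ref, alt));
CREATE TABLE phenotype (hid TEXT PRIMARY KEY, eye_color TEXT, hair_color TEXT, height INTEGER);
CREATE TABLE mapping (
    hid  TEXT NOT NULL,
    vcid INTEGER NOT NULL REFERENCES variant_calling(vcid),
    mid  INTEGER NOT NULL REFERENCES mutations(mid)
);
CREATE INDEX idx_mapping ON mapping(hid);
CREATE INDEX idx_mapping_mid ON mapping(mid);
"""


def load(conn, records):
    """Normalize (chrom, pos, ref, alts, genotypes) tuples into the three tables."""
    mids = {}  # (ref, alt) -> mid, so each combination is stored only once
    for vcid, (chrom, pos, ref, alts, genotypes) in enumerate(records, start=1):
        conn.execute("INSERT INTO variant_calling VALUES (?, ?, ?)", (vcid, chrom, pos))
        for hid, alleles in genotypes.items():
            # Allele index 0 means "same as reference" and is not stored.
            for index in sorted({a for a in alleles if a}):
                key = (ref, alts[index - 1])
                if key not in mids:
                    mids[key] = len(mids) + 1
                    conn.execute("INSERT INTO mutations VALUES (?, ?, ?)", (mids[key], *key))
                conn.execute("INSERT INTO mapping VALUES (?, ?, ?)", (hid, vcid, mids[key]))


# Made-up sample data for illustration only.
RECORDS = [
    ("20", 1000, "G", ["A"], {"HG00096": [1, 0], "HG00097": [0, 0]}),
    ("20", 2000, "T", ["A"], {"HG00096": [0, 1], "HG00097": [1, 1]}),
    ("20", 3000, "A", ["G", "T"], {"HG00096": [1, 2], "HG00097": [0, 0]}),
]

conn = sqlite3.connect(":memory:")
conn.executescript(SCHEMA)
load(conn, RECORDS)
conn.executemany(
    "INSERT INTO phenotype VALUES (?, ?, ?, ?)",
    [("HG00096", "brown", "black", 180), ("HG00097", "blue", "blond", 170)],
)

# Listing 1: number of stored mutations for one individual.
(count,) = conn.execute(
    "SELECT COUNT(mapt.hid) FROM mapping mapt WHERE mapt.hid = ?", ("HG00096",)
).fetchone()

# Listing 2, rewritten with explicit join columns.
rows = conn.execute(
    """
    SELECT hid, pos, alt, eye_color
    FROM variant_calling
    JOIN mapping USING (vcid)
    JOIN mutations USING (mid)
    JOIN phenotype USING (hid)
    WHERE eye_color = ?
    ORDER BY pos, hid
    """,
    ("brown",),
).fetchall()
print(count, rows)
```

Naming the join columns explicitly avoids a pitfall of NATURAL JOIN, which silently joins on every column name the tables share; the additional join conditions in the WHERE clause of Listing 2 then become unnecessary.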
