Genome Data Management using RDBMSs

Steffen Janetzki, University of Magdeburg
Magnús Rafn Tiedemann, University of Magdeburg
Hardik Balar, University of Magdeburg

Abstract

The goal of genomics is to analyze genomic sequences in order to elucidate how DNA mutations and gene expression changes relate to an organism's development, behavior, or predisposition to diseases. The current state of the art for data storage in genome analysis is flat-file-based, which makes genome data complex to access and manage. As the amount of available genome data grows exponentially due to the reduced cost of genome sequencing, storing and querying it efficiently poses a significant challenge. Using a DBMS for genome data management has recently gained traction in the scientific community. This paper presents an approach to querying variant calling data by integrating it into a relational database system. We used MonetDB and PostgreSQL to evaluate our approach. We expected MonetDB to perform better because of its main-memory-based execution model, and since our data size does not exceed the main memory size, we found this to be the case. Moreover, we find that loading an entire VCF file is not useful for implementing a specific use case on database systems.

Keywords: genome data, variant calling, phenomics, VCF, RDBMS, MonetDB, PostgreSQL, SQL

I. INTRODUCTION

Since the Human Genome Project (started in 1990) completed the first sequencing of the human genome in 2003, genome sequencing has become much more efficient and cheaper [1]. Genetic information has become more valuable because it makes it possible to draw conclusions about an individual's traits, such as vulnerability to certain diseases or drug addictions. Additionally, the Human Genome Project provides a foundation for several other projects. One of these is the 1000 Genomes Project [2], which gathers large amounts of data and makes them freely accessible to everyone who needs genome data.

This data is primarily used for analyzing human genetic diversity and for disease prediction. Even though large-scale DNA sequencing produces huge amounts of data, very little attention has been paid to managing such data with databases. A possible reason is that genome data is very complex and multi-layered. Genome data can be of different types, ranging from structural, functional, and population data to clinical data, which makes it difficult for biologists and clinicians to access [3]. Current approaches use different file formats, such as text files or proprietary binary files, to handle these different layers of data. While flat files are extremely useful for storing large amounts of data, they are not particularly efficient for querying this data to gain useful insights, which is the main use of genome data. The current file-centric approach also cannot be used reliably when data consistency is expected.

To overcome the shortcomings of flat-file-based approaches, one could use a database management system instead. This approach has recently gained traction in the scientific community. Currently, the use of DBMSs is limited to storing workflow-related metadata in flat-file-based approaches [4]. Using a full-fledged DBMS for genome data management is an area of ongoing research.

In this paper, we focus on effectively storing and querying variant calling data, a special kind of genome data that describes possible mutations within a human genome that could lead to diseases. For this purpose, we designed a relational database schema. We used two different relational DBMSs: MonetDB and PostgreSQL. Both systems are open source. MonetDB is widely used in academia, while PostgreSQL can also be found in commercial applications. The systems differ in their architecture: MonetDB is a main-memory-based column store and PostgreSQL is a disk-based row store [5]. Using these systems for our evaluation gives us insights into the capabilities of RDBMSs for genome data management for our specific use case.

To this end, our main contributions are:
- Designing specific SQL queries that can be used to query variant calling data
- Designing a specific database schema so that the queries can be executed on a database system
- Converting a VCF file into CSV files according to the designed schema in order to integrate it into the databases
- Evaluating database performance with regard to query execution time and data size under different workloads

The remainder of this paper is organized as follows: Section II provides the necessary domain background on genomics and the relational database systems under consideration. Section III gives an overview of related research. In Section IV, we present our overall approach for designing a schema from a VCF file and formulating the queries, which we evaluate in Section V. Section VI discusses possible threats to the validity of this evaluation. Finally, Section VII provides conclusions and directions for future work.

II. BACKGROUND

A. Biological Basics

Each living organism has its genetic information coded as DNA (deoxyribonucleic acid). Human DNA consists of three billion base pairs, where each pair is a combination of two nucleotide bases (adenine, thymine, guanine, cytosine). Even though the DNA of each human being is unique, around 99% of it is the same among all humans. It is therefore important to understand the variable 1% of the genome. The purpose of genome sequencing is to unravel the order of these base pairs. It is not possible to sequence an entire human genome at once because current sequencing technology can't handle long stretches of DNA at a time, so scientists have to break the genome into pieces, sequence the pieces, and assemble them in the correct order [6]. This may lead to variations in sequenced genomes due to sequencing and assembly errors.

The other kind of genomic variation consists of mutations. A mutation is a change in the DNA sequence of an organism. DNA sequencing can be useful for finding genetic variations that may cause a disease. A disease-causing change can be very small, such as the substitution, deletion, or addition of a single base pair, but larger mutations can affect thousands of bases at once [7]. Many useful scientific insights can also be gained by studying how a given organism mutates over time. For instance, on average about 70 mutations occur in the whole human genome per generation (from a parent to its child).

B. Variant calling

In this project, we decided to use variant calling data. Variant calling (Single Nucleotide Polymorphism calling) is the process of determining the positions where at least one base differs from a reference sequence. It is used to identify the genotype of a human subject, which essentially describes the genetic makeup of an organism. Many different tools are available for analyzing sequence alignments to determine differences between a sample and a reference sequence, such as SAMtools [8], vcflib [9], GATK ([10], [11]), and Atlas2 [12]. These tools use different variant calling algorithms, and their output is stored in a file format called Variant Calling Format. The next subsection introduces this format in detail.

C. Variant Calling Format (VCF)

VCF is one of many formats for storing genome data. It is used by several projects, e.g., the 1000 Genomes Project. A VCF file consists of several lines of metadata describing the actual data, followed by a header line. Figure 1 shows sample metadata of a VCF file. It describes VCF-file-related metadata using standard tags and annotations. Additionally, it includes a description of the actual data fields stored in the particular VCF file. For instance, lines 7 to 15 provide meta-information about the INFO, FILTER, and FORMAT fields of the VCF file.

Fig. 1: Meta information of a VCF file

The header line consists of:
- #CHROM - which chromosome is taken into consideration
- POS - which position in this chromosome is considered
- ID - unique identifier
- REF - non-mutated alignment
- ALT - mutated alternative alignment(s)
- QUAL - quality of measurements
- FILTER - used filter
- INFO - additional information about the data
- FORMAT - additional information about the data, what type of data, etc.
- HG IndividualXXX - index of the appropriate alternative; the index is 0 if there has been no mutation

The remaining content of a VCF file is the data itself.

#CHROM  POS  ID         REF   ALT      QUAL  FILTER  INFO
             rs         G     A        29    PASS    DP=
                        T     A        3     q10     DP=
             rs         A     G,T      67    PASS    DP=
                        T     .        47    PASS    DP=
             microsat1  GTCT  G,GTACT  50    PASS    DP=9

TABLE I: Mandatory columns of VCF

Table I shows the mandatory fields of a sample VCF file that describes variant calling data for chromosome 20. Optionally, if genotype data of individual samples is available, it can also be included. Table II shows such individual data. In that case, the mandatory columns are followed by a FORMAT field, which describes how the data for the individuals is presented.

FORMAT       NA00001         NA00002         NA00003
GT:GQ:DP:HQ  0|0:48:1:51,51  1|0:48:8:51,51  1/1:43:5:.,.
GT:GQ:DP:HQ  0|0:49:3:58,50  0|1:3:5:65,3    0/0:41:3
GT:GQ:DP:HQ  1|2:21:6:23,27  2|1:2:0:18,2    2/2:35:4
GT:GQ:DP:HQ  0|0:54:7:56,60  0|0:48:4:51,51  0/0:61:2
GT:GQ:DP     0/1:35:4        0/2:17:2        1/1:40:3

TABLE II: Variant calling data of individuals
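To make the field layout concrete, the following is a minimal Python sketch, not taken from the original work, that splits one tab-separated VCF data line into the mandatory columns and decodes the per-individual cells using the keys of the FORMAT field. The function name parse_vcf_line and the example line are hypothetical; the position and INFO value in the example are made up for illustration, while the genotype cells follow the style of Table II.

```python
from typing import Any, Dict, List


def parse_vcf_line(line: str, sample_names: List[str]) -> Dict[str, Any]:
    """Split one tab-separated VCF data line into fixed and per-sample fields."""
    fields = line.rstrip("\n").split("\t")
    chrom, pos, vid, ref, alt, qual, flt, info = fields[:8]
    record: Dict[str, Any] = {
        "chrom": chrom,
        "pos": int(pos),
        "id": vid,
        "ref": ref,
        "alt": alt.split(","),  # ALT may list several alternatives
        "qual": qual,
        "filter": flt,
        "info": info,
        "samples": {},
    }
    if len(fields) > 9:
        keys = fields[8].split(":")  # FORMAT column, e.g. GT:GQ:DP
        for name, cell in zip(sample_names, fields[9:]):
            values: Dict[str, Any] = dict(zip(keys, cell.split(":")))
            # GT holds allele indices separated by '|' (phased) or '/' (unphased);
            # index 0 means the reference allele, i.e. no mutation.
            gt = values.get("GT", "").replace("|", "/").split("/")
            values["alleles"] = [int(a) if a.isdigit() else None for a in gt]
            record["samples"][name] = values
    return record


# Hypothetical example line (position and INFO are illustrative only).
example = "20\t1000\t.\tG\tA\t29\tPASS\tDP=14\tGT:GQ:DP\t0|0:48:1\t1|0:48:8\t1/1:43:5"
rec = parse_vcf_line(example, ["NA00001", "NA00002", "NA00003"])
print(rec["samples"]["NA00002"]["alleles"])
```

An allele index greater than 0 selects the corresponding entry of the ALT list, which is the information a converter would need in order to decide whether an individual carries a mutation at that position.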

D. Relational Database Systems

We decided to use relational database systems for this project. Relational database systems are based on the relational model invented by E. F. Codd and are still widely used. The main advantage of relational databases over file-based storage is that a single database can be spread across several tables that are related via constraints. This makes viewing and handling related data much easier than with flat files. Here, we describe the main differences between disk-based and main-memory-based systems and then list the main features of MonetDB and PostgreSQL.

1) Main-memory vs. Disk-based systems: In a conventional database system, data resides on secondary storage and is only cached in memory when needed. In main-memory-based systems, data resides in primary physical memory and has a backup copy on disk. This has led to many different design considerations and performance variations in database systems [13]. When an entire database can reside in main memory, a main-memory system typically outperforms a disk-based one. However, when the amount of data is too large to fit into main memory or data security is of paramount concern, disk-based systems may be preferable. Another important aspect is the index structures of both kinds of systems. Disk-oriented index structures are designed to minimize disk accesses and disk space, whereas main-memory index structures aim to reduce overall computation time while using as little memory as possible [14].

2) MonetDB: MonetDB is an open source relational DBMS designed to exploit the large main memories of modern computer systems for efficient query processing. It employs a main-memory-based architecture, as described in the previous subsection, with a column-store storage model.
In most cases it doesn't require any manual tuning to improve performance, as it already has adaptive indexing, a modern CPU-tuned query execution architecture, run-time query optimization, and a modular software architecture [5].

3) PostgreSQL: PostgreSQL is an open source, SQL-compliant object-relational database system. It employs a disk-oriented architecture with a row-based data storage model. PostgreSQL provides complete support for reliable transactions (ACID properties). It can also be extended through stored procedures like any advanced DBMS [15].

III. RELATED WORK

In this section, we document related work that we identified during our literature review. For genome analysis, there are already well-developed database systems like GEMINI [16], which provides a flexible framework for exploring all forms of human genetic variation. The GEMINI (Genome Mining) framework loads VCF files into a database and automatically annotates each variant. Its back-end database is the portable SQLite. It has very sophisticated querying and filtering mechanisms, which we used as an inspiration for designing some of our queries [APPENDIX, Listing 1]. Like GEMINI, we load a VCF file into a database so that we can query it later, but we load only a part of the VCF file to keep our schema minimal. GEMINI also stores pre-calculated information, such as the genetic diversity count, in the database itself, which makes it easier to query.

We have also looked into an approach by Röhm et al. that integrates genotyping into a DBMS using user-defined functions [4]. The work of Dorok et al. presents an approach for variant calling inside a main-memory DBMS using SQL [17]. Our approach uses two different database systems, so we can make a statement about the performance of the executed queries and the scalability of each system. We have also combined phenotype data with variant calling data, which allows us to execute a phenomics-related query [APPENDIX, Listing 2].

IV. METHODOLOGY

This section describes the queries we designed. Furthermore, it presents the necessary schema adaptations and the process of converting the VCF files into the new format.

A. Queries

1) Query 1: Count the mutations in one chromosome of one individual

This query answers the question: how many mutations occur in a certain chromosome of a certain individual? It is inspired by a query designed for the GEMINI framework [18]. The original query computes the nucleotide diversity of variants, a measure used to express the degree of polymorphism in a population at the nucleotide level [19]. Since the intention of this work is to evaluate querying variant calling data with different DBMSs, we did not implement further mathematical means to calculate genomic diversity.

2) Query 2: Compare the genotypes of phenotypically similar individuals

The purpose of this query is to show how individuals with phenotypically similar traits may or may not differ genotypically. This query was inspired by a paper on a complex analytics genomics benchmark that evaluates genomic data of patients and relates it to their drug response regarding their disease [20]. Instead of considering diseased patients and their drug response, this query only considers simple phenotypical traits like eye color, hair color, and height.

3) Query 3: Display all individuals with a certain set of mutations

This query displays occurrences of mutations in ensembles and makes it possible to track possible correlations between these mutations. Since the DNA of living organisms exhibits more than a single mutation, checking samples for a set of mutations could provide information about possible co-occurrence or linkage.

B. Schema

Based on the queries described in the previous section, a part of the VCF file is adapted to design the corresponding schema. Here, we describe the extraction of data from VCF in more detail.

1) Restrictions: The introduced queries don't require the VCF columns ID, QUAL, FILTER, INFO, and FORMAT, which are therefore not part of the new schema. The remaining columns #CHROM, POS, REF, ALT and the columns containing the ids of the considered individuals are remodeled as shown in Figure 2.

2) Variant Calling: The information about the chromosome and the concrete position of a variant is extracted into its own table called VariantCalling as chrom and pos. Additionally, an identifier, vcid, is generated and assigned to make each element distinguishable from the others.

3) Mutations: The actual information about the change caused by the mutation, coded as reference REF and alternative(s) ALT, is integrated into a separate table called Mutations as ref and alt. Since some combinations of REF and ALT may occur more than once, extracting this information into a separate table allows us to remove duplicates and to refer to them by a newly assigned identifier, mid.

4) Mapping: Finally, all columns concerning the considered individuals are extracted into a new table called Mapping.
This table contains the ids of the individuals, referenced as hid, and two foreign keys mapping to the new VariantCalling and Mutations tables. Each row of this table states which individual has which mutation at which chromosome and position. Combinations where the reference and the alternative of a mutation are the same (i.e., no mutation) are not stored in this table. This means that the Mapping table is the only table keeping track of actual mutations, which saves a huge amount of storage compared to the VCF files, where equality of reference and alternative is marked as a 0.

C. Phenotype

Query 2 needs phenotype data for every individual, which is not contained in the VCF files. Therefore, we generated artificial phenotype data. This table is called Phenotype and consists of four columns: the individual's id as hid, which is the primary key, and three phenotype traits, namely eye color, hair color, and height.

D. Queries

This subsection describes the semantics of the queries we designed and their usefulness to researchers.

1) Converting into several files: The VCF files are converted into CSV files according to the schema shown in Figure 2 by an application that was implemented as part of this work. CSV was chosen as the destination format because the considered DBMSs support importing data in this format.

V. EVALUATION

This section presents the experimental setup, followed by the evaluation results, which are visualized and summarized.

A. Experimental Setup

Here we present our setup, the workload details, and the considerations needed for a fair comparison between the two database systems. For our evaluation, we used a system with the properties listed in Table III.

Operating System:  OS X Yosemite (Version )
Processor:         Intel(R) Core(TM) i5-5257U (3M Cache, up to 3.10 GHz)
Main Memory:       16 GB 1867 MHz DDR3
Storage:           250 GB SSD

TABLE III: Experimental Setup

The general measurements taken into consideration are:
- Data size in the database
- Query execution time

Data sets of different sizes (1M, 5M, 10M, 50M, and 100M records) were queried to evaluate the scalability of the DBMSs. Besides varying the workload size, we also ignored some of the initial query execution times, since a database system needs to build indexes or cache data when a query is executed for the first time. We executed each query 10 times and averaged the results to make the comparison as fair as possible.
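The measurement procedure described above (discard the first executions, then average 10 runs) can be sketched in Python as follows. This is an illustration under stated assumptions, not the harness used for the evaluation: SQLite from the Python standard library stands in for MonetDB and PostgreSQL, the helper time_query and the number of warm-up runs (2) are hypothetical, and the tiny mapping table is filled with made-up rows.

```python
import sqlite3
import statistics
import time


def time_query(conn, sql, params=(), warmup=2, runs=10):
    """Mean wall-clock time in milliseconds over `runs` executions.

    The first `warmup` executions are run but not measured, so that
    one-time costs such as caching do not bias the average.
    """
    for _ in range(warmup):
        conn.execute(sql, params).fetchall()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        conn.execute(sql, params).fetchall()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.mean(samples)


# Stand-in database with made-up rows (assumption: SQLite instead of the
# systems evaluated in the paper).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE mapping (hid TEXT, vcid INTEGER, mid INTEGER)")
conn.executemany(
    "INSERT INTO mapping VALUES (?, ?, ?)",
    [("HG00096", i, i % 7) for i in range(1000)],
)

mean_ms = time_query(
    conn, "SELECT COUNT(hid) FROM mapping WHERE hid = ?", ("HG00096",)
)
print(f"mean over 10 runs: {mean_ms:.3f} ms")
```

Fetching all result rows inside the timed region keeps the measurement comparable across systems that materialize results lazily.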

Fig. 2: Schema design

B. Storage consumption

Table IV lists the storage size of the tables in MonetDB and PostgreSQL compared with the original size of the VCF file.

File name        Original  MonetDB  PostgreSQL  Ratio
Variant Calling
Mutations
Phenotype
Mapping (1M)
Mapping (5M)
Mapping (10M)
Mapping (50M)
Mapping (100M)

TABLE IV: Data size in databases

C. Execution time & Scalability

The following paragraphs describe and visualize the measured execution times and scalability. As MonetDB is a self-optimizing DBMS working in main memory, two adaptations to the measurement procedure became necessary:
- MonetDB optimizes itself during a short start-up phase, so the first few queries take additional time and may bias the results.
- Since PostgreSQL is not self-optimizing, it is necessary to optimize it manually by creating an index.

Listing 4 shows the indexes created on the mapping table in PostgreSQL. MonetDB does not require an index for our purpose. For the visualization, a logarithmic graph was chosen, since the differences between the measurements of the DBMSs are of such magnitude that it would otherwise hardly be possible to draw conclusions about the scalability of the DBMSs.

D. Query 1

The purpose of this query is to count the mutations of one individual in one chromosome. Figure 3 shows the execution time of MonetDB and PostgreSQL in milliseconds for different workloads. For smaller workloads the execution time of both systems is almost the same, but as the workload increases, it has a greater impact on the performance of PostgreSQL than on that of MonetDB.

Fig. 3: Query 1 Benchmark

E. Query 2

This query aims to display the genotypical similarity or difference of individuals that possess similar phenotypical traits, e.g., the same eye color. The graph in Figure 4 shows the execution time on both systems. Both DBMSs show similar scalability for different data set sizes. Regarding execution time, MonetDB requires only a small amount of time, whereas PostgreSQL takes on average about 7 times longer than MonetDB to execute the query.

Fig. 4: Query 2 Benchmark

F. Query 3

The intention of this query is to determine possible connections between different mutations and their occurrences. For the execution of this query, PostgreSQL requires on average about 70 times the time required by MonetDB. For the first two data set sizes, MonetDB already outperforms PostgreSQL in execution time with a ratio of 1:18.6, and increasing the data size further worsens the scaling of PostgreSQL to a ratio of 1:134 at 50M records. This huge increase of the ratio between MonetDB and PostgreSQL is easily observable in Figure 5 between the data sizes of 10M and 50M records.

Fig. 5: Query 3 Benchmark

G. Conclusion of evaluation

Except for the first two data points of the second query, MonetDB outperforms PostgreSQL at every point of measurement. In all the other queries MonetDB achieves better results than PostgreSQL. It becomes visible that MonetDB scales better than PostgreSQL, at least for the data sizes chosen in this experiment. Even with increased data size, MonetDB still computes very fast, whereas the execution time of PostgreSQL increases rapidly. This can be explained by MonetDB's internal self-optimizing algorithms. Another reason for MonetDB's low execution times is its use of main memory for computing its queries. Since the size of the queried data sets did not exceed the main memory size of the system used, MonetDB was able to benefit from this at every point during the experiment.

VI. THREATS TO VALIDITY

In this section, we list some of the factors that can influence the findings of this project. In general, we have a very simple setup without any custom configuration. Our data resides on a local system and does not use any distributed storage server on the web.

Database configuration: PostgreSQL is a database that is highly configurable according to the user's needs [21], but for the evaluation in this project we did not apply any custom configuration to the system. It may also be possible to get better performance with a different choice of indexes in PostgreSQL. Both MonetDB and PostgreSQL were set up in a development mode, and hence no claim can be made about the validity of the results in a production or distributed environment.

Type of the data: Since our converter omits many parts of the VCF file, the findings may differ when compared with other systems like GEMINI, which can load an entire VCF file into a database. Moreover, we also ignored annotations and other information provided in a VCF file.

Synchronous access of data: Real-world systems often employ a multiple-client, single-server setup, whereas our evaluation was carried out on a single system with different processes serving as a single client and server.

VII. CONCLUSION & FUTURE WORK

This section concludes the overall findings of our project and lists further work in this area for future research. In the course of this project, several measurements were taken for both RDBMSs by loading variant calling data and then querying it to measure execution time and scalability. We used a VCF file for our project, and since we were not able to work with the VCF file directly, we converted it to CSV files and loaded these into the database systems, which required schema-specific changes. We therefore did not load the entire VCF file but used only some of its fields. Furthermore, execution time, storage consumption, and scalability were chosen as measures, and the queries delivered results for these measures that allow a comparison between the two RDBMSs. Our evaluation metrics were query performance, database size after loading the data, and scalability. In our evaluation section, we have shown that MonetDB has some advantage over PostgreSQL when the size of the data doesn't exceed the size of the main memory, which is the case for our project. We were also able to improve the performance of PostgreSQL by adding indexes, while MonetDB does not require such an external index structure.

In future work, MonetDB could be tested on data set sizes exceeding the system's main memory to observe how it handles this different circumstance. Furthermore, instead of variant calling data, other types of genome data could be considered. NoSQL systems, which claim to have better performance and scalability, are another interesting area for future work.

REFERENCES

[1] R. D. L. Almasy, S. Williams-Blangero et al., Genome Mapping and Genomics in Human and Non-Human Primates. Berlin, Heidelberg: Springer-Verlag Berlin Heidelberg.
[2] 1000 Genomes - A Deep Catalog of Human Genetic Variation.
[3] A. M. Kogelnik, S. B. Navathe, and D. C. Wallace, GENOME: A Networked Database Environment for Human Genome Data.
[4] U. Röhm and J. A. Blakeley, Data Management for High-Throughput Genomics.
[5] S. Idreos, F. Groffen, N. Nes et al., MonetDB: Two decades of research in column-oriented database architectures.
[6] T. Hubbard, D. Barker, E. Birney et al., The Ensembl genome database project.
[7] R. Nielsen, J. S. Paul, A. Albrechtsen et al., Genotype and SNP calling from next-generation sequencing data.
[8] H. Li, B. Handsaker, A. Wysoker et al., The sequence alignment/map format and SAMtools, Bioinformatics, vol. 25, no. 16, p. 2078.
[9] E. Garrison, vcflib - a C++ library for parsing and manipulating VCF files.
[10] M. A. DePristo, E. Banks, R. Poplin et al., A framework for variation discovery and genotyping using next-generation DNA sequencing data, Nat Genet, vol. 43, no. 5.
[11] A. McKenna, M. Hanna, E. Banks et al., The Genome Analysis Toolkit: a MapReduce framework for analyzing next-generation DNA sequencing data, Genome Research, no. 9, Sep.
[12] D. Challis, J. Yu, U. S. Evani et al., An integrative variant analysis suite for whole exome next-generation sequencing data, BMC Bioinformatics, vol. 13, p. 8.
[13] H. Garcia-Molina and K. Salem, Main Memory Database Systems: An Overview.
[14] T. J. Lehman and M. J. Carey, A Study of Index Structures for Main Memory Database Management Systems.
[15] O. TeZer, SQLite vs MySQL vs PostgreSQL: A Comparison Of Relational Database Management Systems.
[16] U. Paila, B. A. Chapman, R. Kirchner et al., GEMINI: Integrative Exploration of Genetic Variation and Genome Annotations, PLoS Computational Biology, vol. 9, no. 7.
[17] S. Dorok, S. Breß, and G. Saake, Toward Efficient Variant Calling inside Main-Memory Database Systems.
[18] Querying the GEMINI database, latest/content/querying.html#basic-queries.
[19] M. Nei and W.-H. Li, Mathematical model for studying genetic variation in terms of restriction endonucleases.
[20] R. Taft, M. Vartak, N. Rajagopalan et al., GenBase: A Complex Analytics Genomics Benchmark.
[21] PostgreSQL Documentation.

APPENDIX

    SELECT COUNT(mapt.hid)
    FROM mapping mapt
    WHERE mapt.hid = 'HG00096';

Listing 1: Query 1

    SELECT mapt.hid, vart.pos, mut.alt,
           phenot.eye_color
    FROM (((variant_calling vart
        NATURAL JOIN mapping mapt)
        NATURAL JOIN mutations mut)
        NATURAL JOIN phenotype phenot)
    WHERE phenot.eye_color = 'brown'
      AND vart.vcid = mapt.vcid
      AND mapt.mid = mut.mid
    ORDER BY vart.pos, mapt.hid;

Listing 2: Query 2

    SELECT mapt.hid, vart.pos, mut.alt,
           phenot.eye_color
    FROM (((variant_calling vart
        NATURAL JOIN mapping mapt)
        NATURAL JOIN mutations mut)
        NATURAL JOIN phenotype phenot)
    WHERE phenot.eye_color = 'brown'
      AND vart.vcid = mapt.vcid
      AND mapt.mid = mut.mid
    ORDER BY vart.pos, mapt.hid;

Listing 3: Query 3

    CREATE INDEX idx_mapping ON mapping(hid);
    CREATE INDEX idx_mapping_mid ON mapping(mid);

Listing 4: Index on mapping table in PostgreSQL
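The schema from Section IV-B and the first two queries can be tried end to end with a small Python sketch. Several things here are assumptions rather than the authors' setup: SQLite (Python standard library) is used instead of MonetDB or PostgreSQL, the column types are guessed from the schema description, all rows are made up, and Query 2 is written with explicit join conditions rather than NATURAL JOIN, which should select the same rows on this schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Tables as described in Section IV-B/C; column types are assumptions.
conn.executescript("""
CREATE TABLE variant_calling (vcid INTEGER PRIMARY KEY, chrom TEXT, pos INTEGER);
CREATE TABLE mutations (mid INTEGER PRIMARY KEY, ref TEXT, alt TEXT);
CREATE TABLE phenotype (hid TEXT PRIMARY KEY, eye_color TEXT,
                        hair_color TEXT, height INTEGER);
CREATE TABLE mapping (
    hid  TEXT,
    vcid INTEGER REFERENCES variant_calling(vcid),
    mid  INTEGER REFERENCES mutations(mid)
);
CREATE INDEX idx_mapping ON mapping(hid);
CREATE INDEX idx_mapping_mid ON mapping(mid);
""")

# Made-up sample rows, for illustration only.
conn.executemany("INSERT INTO variant_calling VALUES (?, ?, ?)",
                 [(1, "20", 100), (2, "20", 200)])
conn.executemany("INSERT INTO mutations VALUES (?, ?, ?)",
                 [(1, "G", "A"), (2, "T", "C")])
conn.executemany("INSERT INTO phenotype VALUES (?, ?, ?, ?)",
                 [("HG00096", "brown", "black", 180),
                  ("HG00097", "blue", "blond", 170)])
# Only actual mutations are stored in mapping (no rows for "no mutation").
conn.executemany("INSERT INTO mapping VALUES (?, ?, ?)",
                 [("HG00096", 1, 1), ("HG00096", 2, 2), ("HG00097", 1, 1)])

# Query 1 (Listing 1): count the mutations of one individual.
count = conn.execute(
    "SELECT COUNT(mapt.hid) FROM mapping mapt WHERE mapt.hid = ?",
    ("HG00096",),
).fetchone()[0]

# Query 2 (Listing 2), rewritten with explicit joins.
rows = conn.execute("""
    SELECT mapt.hid, vart.pos, mut.alt, phenot.eye_color
    FROM variant_calling vart
    JOIN mapping mapt ON vart.vcid = mapt.vcid
    JOIN mutations mut ON mapt.mid = mut.mid
    JOIN phenotype phenot ON mapt.hid = phenot.hid
    WHERE phenot.eye_color = ?
    ORDER BY vart.pos, mapt.hid
""", ("brown",)).fetchall()

print(count)
print(rows)
```

Passing the individual id and the eye color as bound parameters instead of string literals is a choice of this sketch; it makes the same statements reusable for other individuals and traits.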


More information
