12. Key features involved in building biological databases
- Christiana Haynes
- 5 years ago
Central to the discipline of bioinformatics is the need to store biological information systematically in structured databases. The first databases were really just simple formatted text files. These were organised so that particular records within them could be easily identified and linked. However, as the complexity and types of available biological data grew, and biologists wanted to ask more complex questions, database architectures also became more sophisticated. This topic guide looks at a range of aspects involved in creating biological databases, and at how these may have to evolve to meet future needs as new technologies give rise to new types of data.

On successful completion of this topic you will:
- understand key features involved in building biological databases (LO3).

To achieve a Pass in this unit you need to show that you can:
- summarise the design and management of biological databases (3.1)
- identify a range of records within a file (3.2)
- discuss the need for biological databases to store, organise and index basic biological processes (3.3)
- discuss the nature of new data available and the types of database and resources that might be used (3.4).

Key terms
Coding region: The portion of an mRNA sequence that is translatable into a polypeptide.
Flat-file database: A plain-text file containing a number of data entries that lack structured interrelationships.

1 Building biological databases

Complete genome sequencing was a major achievement. However, just amassing more data does not instantly make us more knowledgeable or provide miraculous understanding of the information we are collecting. Gaining biological and biomedical insights from raw genomic data is a huge undertaking. For example, as part of the process:
- genes must be located and their structures properly assembled
- coding regions must be translated
- functions must be assigned to genes and their products
- disease associations must be discovered, etc.

These tasks depend on using the right computational tools and finding the right balance between human and machine input.

Online access to databases gave scientists the ability to use public data in their own private research projects. Databases thus became invaluable as repositories of biological information. In addition, they are important because their annotations allow logical connections to be made to related information in different resources. Annotations are the intelligence or clues we attach to raw data to make them meaningful to, and reusable by, other researchers. For example, linear strings of nucleotide bases or amino acid residues are virtually useless on their own, but allied with information about their evolutionary relationships, biological functions, roles, interactions, disease associations, etc., they become elements of knowledge.

The evolution of biological flat-file databases

Database annotations add value to raw data, in principle allowing them to be reused quickly and conveniently. The more annotations there are, the richer the database content.
The problem is that the more information is added, the greater the need for disciplined approaches to data archiving: if computers are to be able to access particular annotations reliably, they must be stored in a structured way. This raises two questions: (i) what kinds of annotation are crucial, and (ii) how should they be organised?

We already saw that, for sequence data, adding notes about potential biological relationships, functions, roles, etc., is useful. To add further value, it is also helpful to add database-specific details, such as when the sequence was submitted and when the database entry was last updated. Links to relevant scientific literature (for example, to an article that describes the biological function of a sequence) and cross-references to information in related databases are also informative.

Structuring such information sensibly and systematically to facilitate computer access is challenging. Plain text, like the page you are reading now, is accessible to humans, but means nothing to computers. Nevertheless, the earliest biological databases were created as plain-text files, or flat-files. This meant that particular pieces of information had to be pinpointed with specific tags to help computers identify the types of data being stored in those parts of the file. Inevitably, several different flat-file formats evolved to store different types of biological data. Of these, one particular format became popular (because its structure was relatively simple) and was adapted for a variety of different data-types by a number of databases that are still in use today (e.g., Swiss-Prot, TrEMBL, PROSITE). This simple flat-file format was the one originally devised to store nucleotide sequences in the EMBL database. The figure below gives a flavour of how a flat-file database is constructed.

Figure: Creating a flat-file database. Numerous plain-text files, or flat-files, are appended to create a flat-file database. Each file, or database record, contains different data fields, each of which is identified by a specific tag. The zoomed-in section shows a variety of tags found in the EMBL flat-file format, exemplified with a range of fields typical of UniProtKB entries:

ID   A human-readable identifier
AC   A computer-readable code
DT   Date of creation of database record
DE   A descriptive title for the entry
GN   Gene name
OS   Organism source details
OC   Organism classification
RN   Cross-references to publications
CC   Description of the function, etc.
DR   Cross-links to related databases
KW   Keywords
FT   Table listing sequence features
SQ   Sequence details

Key terms
Rhodopsin: A light-sensitive biological pigment, found in the rod-shaped photoreceptor cells of the retinas of most vertebrates, that mediates vision in dim light; rhodopsin belongs to the superfamily of G protein-coupled receptors to which it gives its name.
PubMed: An online interface to millions of biomedical literature citations from the MEDLINE database, from life science journals, online books, etc.; PubMed is a service of the National Center for Biotechnology Information (NCBI).

Database fields and tags

The EMBL flat-file format uses a series of two-letter tags to describe the data stored on each line of the file, as shown in the example entry figure below.
The file begins with an identifying (ID) code (here, OPSD_HUMAN) and an accession (AC) number (here, P08100). The AC number is designed for computers to read; the ID code is more meaningful to humans (in this case, OPSD_HUMAN denotes human rhodopsin). The AC and ID codes specify a given database entry. In principle, the AC number is invariant, so that this sequence can always be tracked in any version of the database.

Other important pieces of information within the flat-file include:
- DT, the date a sequence entered the database and when changes were last made to its entry
- DE, the description or title of the stored entity (here, the protein rhodopsin)
- GN, the source gene name (here, rho)
- OS, a more precise specification of the organism species (here, Homo sapiens)
- OC, a more precise specification of the organism classification (Eukaryota, Metazoa, Chordata, etc.).

In addition, the file includes bibliographic citations:
- RN is the reference number
- RP gives the subject
- RM the literature database (PubMed) cross-reference
- RA the authors
- RL the place of publication.

Figure: Illustration of the flat-file format of a UniProtKB/Swiss-Prot entry.

ID   OPSD_HUMAN STANDARD; PRT; 348 AA.
AC   P08100;
DT   01-AUG-1988 (REL. 08, CREATED)
DT   01-AUG-1988 (REL. 08, LAST SEQUENCE UPDATE)
DT   01-MAR-1992 (REL. 21, LAST ANNOTATION UPDATE)
DE   RHODOPSIN.
GN   RHO.
OS   HOMO SAPIENS (HUMAN).
OC   EUKARYOTA; METAZOA; CHORDATA; VERTEBRATA; TETRAPODA; MAMMALIA;
OC   EUTHERIA; PRIMATES.
RN   [1]
RP   SEQUENCE FROM N.A.
RM
RA   NATHANS J., HOGNESS D.S.;
RL   PROC. NATL. ACAD. SCI. U.S.A. 81: (1984).
RN   [3]
RP   VARIANTS RETINITIS PIGMENTOSA.
RM
RA   DRYJA T.P., HAHN L.B., COWLEY G.S., MCGEE T.L., BERSON E.L.;
RL   PROC. NATL. ACAD. SCI. U.S.A. 88: (1991).
CC   -!- FUNCTION: VISUAL PIGMENTS ARE THE LIGHT-ABSORBING MOLECULES THAT
CC       MEDIATE VISION. THEY CONSIST OF AN APOPROTEIN, OPSIN, COVALENTLY
CC       LINKED TO CIS-RETINAL.
CC   -!- TISSUE SPECIFICITY: ROD SHAPED PHOTORECEPTOR CELLS WHICH MEDIATES
CC       VISION IN DIM LIGHT.
CC   -!- DISEASE: AUTOSOMAL DOMINANT RETINITIS PIGMENTOSA CAN BE DUE TO A
CC       DEFECT IN RHO. PATIENTS TYPICALLY HAVE NIGHT VISION BLINDNESS AND
CC       LOSS OF MIDPERIPHERAL VISUAL FIELD; AS THEIR CONDITION PROGRESSES,
CC       THEY LOSE THEIR FAR PERIPHERAL VISUAL FIELD AND EVENTUALLY CENTRAL
CC       VISION AS WELL.
CC   -!- SIMILARITY: TO ALL OTHER G-PROTEIN COUPLED RECEPTORS. STRONGEST TO
CC       ALL OTHER OPSINS.
DR   EMBL; K02281; HSOPS.
DR   MIM; ; NINTH EDITION.
DR   PROSITE; PS00237; G_PROTEIN_RECEPTOR.
DR   PROSITE; PS00238; OPSIN.
KW   PHOTORECEPTOR; RETINAL PROTEIN; TRANSMEMBRANE; GLYCOPROTEIN; VISION;
KW   PHOSPHORYLATION; LIPOPROTEIN; G-PROTEIN COUPLED RECEPTOR; ACETYLATION;
KW   RETINITIS PIGMENTOSA.
FT   DOMAIN 1 36 EXTRACELLULAR.
FT   TRANSMEM
FT   DOMAIN CYTOPLASMIC.
FT   DOMAIN EXTRACELLULAR.
FT   TRANSMEM
FT   DOMAIN CYTOPLASMIC.
FT   MOD_RES 1 1 ACETYLATION (BY SIMILARITY).
FT   CARBOHYD 2 2 BY SIMILARITY.
FT   BINDING RETINAL CHROMOPHORE.
FT   LIPID PALMITATE (BY SIMILARITY).
FT   DISULFID BY SIMILARITY.
FT   VARIANT T -> M (IN RETINITIS PIGMENTOSA).
FT   VARIANT P -> S (IN RETINITIS PIGMENTOSA).
SQ   SEQUENCE 348 AA; MW; CN;
     MNGTEGPNFY VPFSNATGVV RSPFEYPQYY LAEPWQFSML AAYMFLLIVL GFPINFLTLY
     VTVQHKKLRT PLNYILLNLA VADLFMVLGG FTSTLYTSLH GYFVFGPTGC NLEGFFATLG
     GEIALWSLVV LAIERYVVVC KPMSNFRFGE NHAIMGVAFT WVMALACAAP PLAGWSRYIP
     EGLQCSCGID YYTLKPEVNN ESFVIYMFVV HFTIPMIIIF FCYGQLVFTV KEAAAQQQES
     ATTQKAEKEV TRMVIIMVIA FLICWVPYAS VAFYIFTHQG SNFGPIFMTI PAFFAKSAAI
     YNPVIYIMMN KQFRNCMLTT ICCGKNPLGD DEASATVSKT ETSQVAPA
//
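
Because every line of such an entry starts with a two-letter tag and every record ends with //, software can split a flat-file into fields with very little logic. The following is a minimal Python sketch of that idea, not the parser used by any real database: the function name parse_records, the "SEQ" key used for untagged sequence lines, and the abbreviated SAMPLE record (a shortened excerpt of the entry above, with the sequence cut to its first two blocks) are all assumptions made for illustration.

```python
from collections import defaultdict

# Abbreviated, illustrative excerpt of the OPSD_HUMAN entry (not the full record).
SAMPLE = """\
ID   OPSD_HUMAN STANDARD; PRT; 348 AA.
AC   P08100;
DE   RHODOPSIN.
GN   RHO.
OS   HOMO SAPIENS (HUMAN).
KW   PHOTORECEPTOR; RETINAL PROTEIN; TRANSMEMBRANE; GLYCOPROTEIN; VISION;
KW   RETINITIS PIGMENTOSA.
SQ   SEQUENCE 348 AA;
     MNGTEGPNFY VPFSNATGVV
//
"""


def parse_records(text):
    """Yield one dict per record, mapping each two-letter tag to a list of line values."""
    fields = defaultdict(list)
    for line in text.splitlines():
        if line.startswith("//"):          # record terminator
            if fields:
                yield dict(fields)
            fields = defaultdict(list)
            continue
        if not line.strip():               # skip blank lines
            continue
        tag, value = line[:2], line[2:].strip()
        if tag.strip():                    # ordinary tagged line (ID, AC, DE, ...)
            fields[tag].append(value)
        else:                              # untagged line: sequence data after SQ
            # "SEQ" is a hypothetical key chosen for this sketch.
            fields["SEQ"].append(value.replace(" ", ""))
    if fields:                             # tolerate a missing final terminator
        yield dict(fields)


records = list(parse_records(SAMPLE))
entry = records[0]
print(entry["AC"])    # accession line(s)
print(entry["KW"])    # keyword lines, one list item per KW line
print(entry["SEQ"])   # sequence blocks with spaces removed
```

Keeping each tag's values as a list reflects the fact that tags such as KW, CC and DR can span several lines in one record.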

Key terms
MIM or OMIM (Online Mendelian Inheritance in Man): A comprehensive database of human genes and genetic disorders.
Transmembrane domain: A hydrophobic segment of amino acids within a protein that crosses a membrane.
Single-letter code: Letters of the alphabet used to denote the amino acids (A for alanine, P for proline, V for valine, etc.).

A rich seam of annotation is stored in the comment (CC) field. In this example, we learn about the protein's function, tissue-specificity, disease associations, family relationships, etc. To facilitate swift computational processing of the file, many of the terms used here are also included as keywords (KW). Links to related information in other databases are made in the DR lines (such as here to EMBL, MIM and PROSITE). Further enriching the entry, various characteristics of the sequence itself are documented in the Feature Table (FT): for example, here we learn about the locations of the protein's transmembrane (TM) domains and functional sites (lipid and carbohydrate attachment sites, other binding sites, etc.), and about its sequence variants. Finally, the sequence is stored in the SQ field using the single-letter code, together with attributes such as its length and molecular weight. The entry terminates with the // symbol.

Activity
The link shows the history of the sequence of human insulin from its earliest Swiss-Prot entry. Scroll down and click on the fifth version (5.txt).
- What is the function of the protein?
- With what disease is the protein associated?
- How many amino acid residues are there in the active hormone?

Database interoperability: indexing biological data

As we have already seen, storing information in databases is not very useful unless computers can access the data and help humans to interrogate and analyse the knowledge they contain. Achieving this requires adherence to standard data formats. For example, for protein and nucleotide sequences using the EMBL format, use of a common Feature Table format helps to improve data consistency and reliability, and facilitates database interoperation. Regulating the content, and the vocabulary and syntax used to describe the documented features, helps to ensure that the data can be readily accessed and manipulated by computer software.

The principal means by which computers access database information is via their entries' unique AC numbers and ID codes. This allows data from very different resources to be connected, whether from a nucleotide or protein sequence database, a protein family or structure database, a literature database, and so on. The more internal cross-references a database stores, the greater the web of connectivity that is possible from it.
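
The idea of reaching an entry through its AC number, and then hopping to related entries through its DR lines, can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the helper names build_index, fetch and cross_references are hypothetical, the two tiny "databases" are cut-down stand-ins (the PROSITE-style records in particular are mock entries, with only the accession numbers and names taken from the DR lines of the OPSD_HUMAN entry shown earlier), and a real indexing tool would work on files rather than in-memory strings.

```python
# Cut-down, illustrative flat-file "databases" held in memory.
SWISSPROT = """\
ID   OPSD_HUMAN
AC   P08100;
DR   PROSITE; PS00237; G_PROTEIN_RECEPTOR.
DR   PROSITE; PS00238; OPSIN.
//
"""

PROSITE = """\
ID   G_PROTEIN_RECEPTOR
AC   PS00237;
DE   Illustrative stand-in entry.
//
ID   OPSIN
AC   PS00238;
DE   Illustrative stand-in entry.
//
"""


def build_index(flatfile_text):
    """Map each accession number to the character offset where its record starts."""
    index = {}
    record_start = 0
    offset = 0
    for line in flatfile_text.splitlines(keepends=True):
        if line.startswith("AC"):
            for accession in line[2:].split(";"):
                accession = accession.strip()
                if accession:
                    index[accession] = record_start
        offset += len(line)
        if line.startswith("//"):          # the next record begins after the terminator
            record_start = offset
    return index


def fetch(flatfile_text, index, accession):
    """Return the text of one record, located directly via the index."""
    start = index[accession]
    end = flatfile_text.find("\n//", start)
    if end == -1:                          # tolerate a missing terminator
        end = len(flatfile_text)
    return flatfile_text[start:end]


def cross_references(record, database):
    """List the accession numbers on DR lines that point at the named database."""
    refs = []
    for line in record.splitlines():
        if line.startswith("DR"):
            parts = [p.strip() for p in line[2:].split(";")]
            if parts[0] == database and len(parts) > 1:
                refs.append(parts[1])
    return refs


sp_index = build_index(SWISSPROT)
ps_index = build_index(PROSITE)

record = fetch(SWISSPROT, sp_index, "P08100")
linked = [fetch(PROSITE, ps_index, ac) for ac in cross_references(record, "PROSITE")]
for text in linked:
    print(text.splitlines()[0])            # first line (ID) of each linked record
```

The index lets a query jump straight to the start of the wanted record instead of scanning the whole file, which is what makes tag-based cross-linking practical on large flat-files.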

Figure: Illustration of flat-file indexing. Data fields in flat-file databases can be linked via their two-letter tags. The main points of connectivity are the accession number (AC) or identifier (ID) tags. Here, literature (RP) and database cross-references (DR) in UniProtKB/Swiss-Prot are linked to MEDLINE via a PubMed ID (PMID), to the PDB via the PDB ID, to EMBL via the EMBL AC tag and to PROSITE via the PROSITE AC tag. Reciprocal links from EMBL and PROSITE link back to UniProtKB/Swiss-Prot via the Swiss-Prot AC tag.

Key terms
Flat-file index: An address or set of coordinates that allows query software to access specific parts of a flat-file database by means of designated tags.
Interoperability: The ability of software systems or databases to communicate or exchange information seamlessly (to interoperate) without restriction.
Relational database: A database in which data and their attributes are structured and stored in non-redundant tables in such a way as to facilitate information retrieval.
Next-generation sequencing: Low-cost, parallelised, high-throughput technology capable of producing thousands or millions of sequences simultaneously (for example, 454 pyrosequencing, Illumina (Solexa) and SOLiD sequencing).
Third-generation sequencing: Low-cost, single-molecule sequencing technology that aims to reduce the cost of sequencing a single human genome to US $1000 or less.

The first tool to exploit this fact was SRS, the Sequence Retrieval System.
SRS is an information-indexing tool that allows any flat-file database to be indexed to any other, permitting highly specific queries across different databases via a single interface, irrespective of their underlying data-types. The figure above illustrates how integrated access to diverse information across different flat-file databases is made possible via links to and from their entries' respective AC numbers and ID codes.

The need for relational database management systems

We have seen that flat-file databases can interoperate if they have been indexed or cross-referenced. This makes database queries fast and efficient, because they can be directed to specific parts of the file, rather than to the whole database(s). However, although this is effective for data integration, the approach is very brittle. Consider the role of AC numbers. If an AC number changes, its associated database entry suddenly becomes invisible to all resources to which it was formerly connected; to remain visible, all connected databases must incorporate the new AC number.

Owing to its ease of use, the flat-file format was popular for many years. In time, as the pace of data acquisition increased and the accompanying body of scientific literature grew, keeping database data and annotations up to date became more time-consuming (and error-prone, because much of the work was manual). This prompted the use of relational database systems to help structure data more formally: here, data are managed in tables in such a way that changes in one table can be readily propagated to others, easing data-management burdens. For more complex resources too, like data warehouses, removing redundancy between databases and ensuring data consistency are easier to achieve using relational systems.

The challenge now is not so much what we want to do with such systems today, but how they will need to adapt to future needs. The quantity of data that next- and third-generation sequencing technologies will produce is unprecedented, and will likely have a major impact on future database design.
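
To make the contrast with flat-files concrete, the sketch below uses Python's built-in sqlite3 module to store an entry and its cross-references in two related tables. It is a minimal illustration under stated assumptions, not the schema of any real resource: the table and column names (entry, cross_ref, entry_id, db_name, target), the helper links_for and the replacement accession 'HYPO0001' are all hypothetical; only P08100, OPSD_HUMAN and the EMBL and PROSITE identifiers come from the example entry shown earlier. Because the cross-references point at an internal key rather than at the accession text itself, changing the accession in one table leaves every link intact.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE entry (
    entry_id  INTEGER PRIMARY KEY,      -- internal key, never shown to users
    accession TEXT UNIQUE NOT NULL,
    name      TEXT NOT NULL
);
CREATE TABLE cross_ref (
    entry_id  INTEGER NOT NULL REFERENCES entry(entry_id),
    db_name   TEXT NOT NULL,
    target    TEXT NOT NULL
);
""")

conn.execute("INSERT INTO entry VALUES (1, 'P08100', 'OPSD_HUMAN')")
conn.executemany(
    "INSERT INTO cross_ref VALUES (?, ?, ?)",
    [(1, "EMBL", "K02281"), (1, "PROSITE", "PS00237"), (1, "PROSITE", "PS00238")],
)


def links_for(accession):
    """Return (database, target) pairs for the entry with this accession."""
    rows = conn.execute(
        """SELECT c.db_name, c.target
           FROM entry AS e JOIN cross_ref AS c ON c.entry_id = e.entry_id
           WHERE e.accession = ?
           ORDER BY c.db_name, c.target""",
        (accession,),
    )
    return rows.fetchall()


before = links_for("P08100")

# Change the accession in ONE place; 'HYPO0001' is a made-up value for this sketch.
conn.execute("UPDATE entry SET accession = 'HYPO0001' WHERE entry_id = 1")

after = links_for("HYPO0001")
print(before == after)   # the links are reachable under the new accession
```

In a flat-file setting the same change would require editing every file that mentions the old accession; here a single UPDATE is enough because the accession is stored once.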

Take it further
More detailed information about flat-file database formats can be found in Chapter 3 of Introduction to Bioinformatics (Attwood and Parry-Smith, 1999), Prentice Hall. More detailed information on building biological databases, and the use of MySQL, can be found in Building Bioinformatics Solutions: with Perl, R and MySQL (Bessant, Shadforth and Oakley, 2008), OUP.

Link
Find out more about DNA and RNA structure and coding regions in Unit 7: Molecular biology and genetics.

Activity
Read the news article describing the road to the US $1000 human genome on the National Human Genome Research Institute website. What is one type of innovation that is being explored in order to allow revolutionary third-generation technologies to deliver the $1000 genome?

Further reading
Attwood, T. and Parry-Smith, D. (1999) Introduction to Bioinformatics, Prentice Hall. Chapter 3 contains more information about flat-file database formats.
Higgs, P. and Attwood, T. (2005) Bioinformatics and Molecular Evolution, Wiley-Blackwell. Refer to Chapter 5 for further details of biological databases.
Bessant, C., Shadforth, I. and Oakley, D. (2008) Building Bioinformatics Solutions: with Perl, R and MySQL, OUP. Contains more detailed information on building biological databases, and the use of MySQL.
Find out more about the road to the $1000 genome in the news article from the National Human Genome Research Institute website.

Checklist
At the end of this topic guide you should be familiar with the following ideas about bioinformatics:
- the flat-file database (essentially a plain-text file) was the original means of managing raw sequence data
- the EMBL flat-file format was adopted by different databases because its structure made it easy to use and to adapt
- the EMBL format uses a series of two-letter tags to denote different database fields (ID and AC tags for the entry identifier and accession number, DE and GN tags for the descriptive title and gene name, CC for comments and FT for the Feature Table, etc.)
- tags allow the data in different parts of flat-file databases to be indexed, which allows them to be cross-linked to information in related databases
- flat-file databases are simple to understand but are brittle to changes in the data structure
- more sophisticated relational database management systems were devised to store and manage bioinformatics data more efficiently and more robustly.

Acknowledgements
The publisher would like to thank the following for their kind permission to reproduce their photographs: PhotoDisc: Lawrence Lawry. All other images: Pearson Education.
We are grateful to the following for permission to reproduce copyright material: Realia showing coding of a flat-file format of a UniProtKB/Swiss-Prot entry. Produced by UniProt Consortium. Used by permission.
In some instances we have been unable to trace the owners of copyright material, and we would appreciate any information that would enable us to do so.
Similarity searches in biological sequence databases
Similarity searches in biological sequence databases Volker Flegel september 2004 Page 1 Outline Keyword search in databases General concept Examples SRS Entrez Expasy Similarity searches in databases
More informationBioinformatics Database Worksheet
Bioinformatics Database Worksheet (based on http://www.usm.maine.edu/~rhodes/goodies/matics.html) Where are the opsin genes in the human genome? Point your browser to the NCBI Map Viewer at http://www.ncbi.nlm.nih.gov/mapview/.
More informationThe Use of WWW in Biological Research
The Use of WWW in Biological Research Introduction R.Doelz, Biocomputing Basel T.Etzold, EMBL Heidelberg Information in Biology grows rapidly. Initially, biological retrieval systems used conventional
More informationBioinformatics resources for data management. Etienne de Villiers KEMRI-Wellcome Trust, Kilifi
Bioinformatics resources for data management Etienne de Villiers KEMRI-Wellcome Trust, Kilifi Typical Bioinformatic Project Pose Hypothesis Store data in local database Read Relevant Papers Retrieve data
More informationmpmorfsdb: A database of Molecular Recognition Features (MoRFs) in membrane proteins. Introduction
mpmorfsdb: A database of Molecular Recognition Features (MoRFs) in membrane proteins. Introduction Molecular Recognition Features (MoRFs) are short, intrinsically disordered regions in proteins that undergo
More informationIn the sense of the definition above, a system is both a generalization of one gene s function and a recipe for including and excluding components.
1 In the sense of the definition above, a system is both a generalization of one gene s function and a recipe for including and excluding components. 2 Starting from a biological motivation to annotate
More informationLinkDB: A Database of Cross Links between Molecular Biology Databases
LinkDB: A Database of Cross Links between Molecular Biology Databases Susumu Goto, Yutaka Akiyama, Minoru Kanehisa Institute for Chemical Research, Kyoto University Introduction We have developed a molecular
More information2. Take a few minutes to look around the site. The goal is to familiarize yourself with a few key components of the NCBI.
2 Navigating the NCBI Instructions Aim: To become familiar with the resources available at the National Center for Bioinformatics (NCBI) and the search engine Entrez. Instructions: Write the answers to
More informationEBI patent related services
EBI patent related services 4 th Annual Forum for SMEs October 18-19 th 2010 Jennifer McDowall Senior Scientist, EMBL-EBI EBI is an Outstation of the European Molecular Biology Laboratory. Overview Patent
More information2) NCBI BLAST tutorial This is a users guide written by the education department at NCBI.
Web resources -- Tour. page 1 of 8 This is a guided tour. Any homework is separate. In fact, this exercise is used for multiple classes and is publicly available to everyone. The entire tour will take
More informationIntegrated Access to Biological Data. A use case
Integrated Access to Biological Data. A use case Marta González Fundación ROBOTIKER, Parque Tecnológico Edif 202 48970 Zamudio, Vizcaya Spain marta@robotiker.es Abstract. This use case reflects the research
More informationLiterature Databases
Literature Databases Introduction to Bioinformatics Dortmund, 16.-20.07.2007 Lectures: Sven Rahmann Exercises: Udo Feldkamp, Michael Wurst 1 Overview 1. Databases 2. Publications in Science 3. PubMed and
More informationData Mining Technologies for Bioinformatics Sequences
Data Mining Technologies for Bioinformatics Sequences Deepak Garg Computer Science and Engineering Department Thapar Institute of Engineering & Tecnology, Patiala Abstract Main tool used for sequence alignment
More informationOntology-Based Mediation in the. Pisa June 2007
http://asp.uma.es Ontology-Based Mediation in the Amine System Project Pisa June 2007 Prof. Dr. José F. Aldana Montes (jfam@lcc.uma.es) Prof. Dr. Francisca Sánchez-Jiménez Ismael Navas Delgado Raúl Montañez
More informationRLIMS-P Website Help Document
RLIMS-P Website Help Document Table of Contents Introduction... 1 RLIMS-P architecture... 2 RLIMS-P interface... 2 Login...2 Input page...3 Results Page...4 Text Evidence/Curation Page...9 URL: http://annotation.dbi.udel.edu/text_mining/rlimsp2/
More informationNCBI News, November 2009
Peter Cooper, Ph.D. NCBI cooper@ncbi.nlm.nh.gov Dawn Lipshultz, M.S. NCBI lipshult@ncbi.nlm.nih.gov Featured Resource: New Discovery-oriented PubMed and NCBI Homepage The NCBI Site Guide A new and improved
More information) I R L Press Limited, Oxford, England. The protein identification resource (PIR)
Volume 14 Number 1 Volume 1986 Nucleic Acids Research 14 Number 1986 Nucleic Acids Research The protein identification resource (PIR) David G.George, Winona C.Barker and Lois T.Hunt National Biomedical
More informationINTRODUCTION TO BIOINFORMATICS
Molecular Biology-2017 1 INTRODUCTION TO BIOINFORMATICS In this section, we want to provide a simple introduction to using the web site of the National Center for Biotechnology Information NCBI) to obtain
More informationGoal-oriented Schema in Biological Database Design
Goal-oriented Schema in Biological Database Design Ping Chen Department of Computer Science University of Helsinki Helsinki, Finland 00014 EMAIL: pchen@cs.helsinki.fi Abstract In this paper, I reviewed
More informationINTRODUCTION TO BIOINFORMATICS
Molecular Biology-2019 1 INTRODUCTION TO BIOINFORMATICS In this section, we want to provide a simple introduction to using the web site of the National Center for Biotechnology Information NCBI) to obtain
More informationBiostatistics and Bioinformatics Molecular Sequence Databases
. 1 Description of Module Subject Name Paper Name Module Name/Title 13 03 Dr. Vijaya Khader Dr. MC Varadaraj 2 1. Objectives: In the present module, the students will learn about 1. Encoding linear sequences
More informationComplex Query Formulation Over Diverse Information Sources Using an Ontology
Complex Query Formulation Over Diverse Information Sources Using an Ontology Robert Stevens, Carole Goble, Norman Paton, Sean Bechhofer, Gary Ng, Patricia Baker and Andy Brass Department of Computer Science,
More informationData Curation Profile Human Genomics
Data Curation Profile Human Genomics Profile Author Profile Author Institution Name Contact J. Carlson N. Brown Purdue University J. Carlson, jrcarlso@purdue.edu Date of Creation October 27, 2009 Date
More informationApproaches to Efficient Multiple Sequence Alignment and Protein Search
Approaches to Efficient Multiple Sequence Alignment and Protein Search Thesis statements of the PhD dissertation Adrienn Szabó Supervisor: István Miklós Eötvös Loránd University Faculty of Informatics
More informationInformation Resources in Molecular Biology Marcela Davila-Lopez How many and where
Information Resources in Molecular Biology Marcela Davila-Lopez (marcela.davila@medkem.gu.se) How many and where Data growth DB: What and Why A Database is a shared collection of logically related data,
More informationComputational Genomics and Molecular Biology, Fall
Computational Genomics and Molecular Biology, Fall 2015 1 Sequence Alignment Dannie Durand Pairwise Sequence Alignment The goal of pairwise sequence alignment is to establish a correspondence between the
More informationTopics of the talk. Biodatabases. Data types. Some sequence terminology...
Topics of the talk Biodatabases Jarno Tuimala / Eija Korpelainen CSC What data are stored in biological databases? What constitutes a good database? Nucleic acid sequence databases Amino acid sequence
More informationGenome Browsers - The UCSC Genome Browser
Genome Browsers - The UCSC Genome Browser Background The UCSC Genome Browser is a well-curated site that provides users with a view of gene or sequence information in genomic context for a specific species,
More informationExploring and Exploiting the Biological Maze. Presented By Vidyadhari Edupuganti Advisor Dr. Zoe Lacroix
Exploring and Exploiting the Biological Maze Presented By Vidyadhari Edupuganti Advisor Dr. Zoe Lacroix Motivation An abundance of biological data sources contain data about scientific entities, such as
More informationGenome Browsers Guide
Genome Browsers Guide Take a Class This guide supports the Galter Library class called Genome Browsers. See our Classes schedule for the next available offering. If this class is not on our upcoming schedule,
More informationWhen we search a nucleic acid databases, there is no need for you to carry out your own six frame translation. Mascot always performs a 6 frame
1 When we search a nucleic acid databases, there is no need for you to carry out your own six frame translation. Mascot always performs a 6 frame translation on the fly. That is, 3 reading frames from
More informationHumboldt-University of Berlin
Humboldt-University of Berlin Exploiting Link Structure to Discover Meaningful Associations between Controlled Vocabulary Terms exposé of diploma thesis of Andrej Masula 13th October 2008 supervisor: Louiqa
More informationEBI services. Jennifer McDowall EMBL-EBI
EBI services Jennifer McDowall EMBL-EBI The SLING project is funded by the European Commission within Research Infrastructures of the FP7 Capacities Specific Programme, grant agreement number 226073 (Integrating
More informationBioinformatics Hubs on the Web
Bioinformatics Hubs on the Web Take a class The Galter Library teaches a related class called Bioinformatics Hubs on the Web. See our Classes schedule for the next available offering. If this class is
More informationProtein Sequence Database
Protein Sequence Database A protein is a large molecule manufactured in the cell of a living organism to carry out essential functions within the cell. The primary structure of a protein is a sequence
More informationRetrieving factual data and documents using IMGT-ML in the IMGT information system
Retrieving factual data and documents using IMGT-ML in the IMGT information system Authors : Chaume D. *, Combres K. *, Giudicelli V. *, Lefranc M.-P. * * Laboratoire d'immunogénétique Moléculaire, LIGM,
More informationAbstract. of biological data of high variety, heterogeneity, and semi-structured nature, and the increasing
Paper ID# SACBIO-129 HAVING A BLAST: ANALYZING GENE SEQUENCE DATA WITH BLASTQUEST WHERE DO WE GO FROM HERE? Abstract In this paper, we pursue two main goals. First, we describe a new tool called BlastQuest,
More informationThe GenAlg Project: Developing a New Integrating Data Model, Language, and Tool for Managing and Querying Genomic Information
The GenAlg Project: Developing a New Integrating Data Model, Language, and Tool for Managing and Querying Genomic Information Joachim Hammer and Markus Schneider Department of Computer and Information
More informationBIOSPIDA: A Relational Database Translator for NCBI
BIOSPIDA: A Relational Database Translator for NCBI Matthew S. Hagen, MSE 1,2,3,5, Eva K. Lee, PhD *,1,2,3,4 1 Center for Operations Research in Medicine and HealthCare, 2 NSF I/UCRC Center for Health
More informationAn Introduction to PubMed Searching: A Reference Guide
An Introduction to PubMed Searching: A Reference Guide Created by the Ontario Public Health Libraries Association (OPHLA) ACCESSING PubMed PubMed, the National Library of Medicine s free version of MEDLINE,
More informationCustomisable Curation Workflows in Argo
Customisable Curation Workflows in Argo Rafal Rak*, Riza Batista-Navarro, Andrew Rowley, Jacob Carter and Sophia Ananiadou National Centre for Text Mining, University of Manchester, UK *Corresponding author:
More informationA First Introduction to Scientific Visualization Geoffrey Gray
Visual Molecular Dynamics A First Introduction to Scientific Visualization Geoffrey Gray VMD on CIRCE: On the lower bottom left of your screen, click on the window start-up menu. In the search box type
More informationEBP. Accessing the Biomedical Literature for the Best Evidence
Accessing the Biomedical Literature for the Best Evidence Structuring the search for information and evidence Basic search resources Starting the search EBP Lab / Practice: Simple searches Using PubMed
More informationFinding homologous sequences in databases
Finding homologous sequences in databases There are multiple algorithms to search sequences databases BLAST (EMBL, NCBI, DDBJ, local) FASTA (EMBL, local) For protein only databases scan via Smith-Waterman
Taxonomically Clustering Organisms Based on the Profiles of Gene Sequences Using PCA
Journal of Computer Science 2 (3): 292-296, 2006 ISSN 1549-3636 2006 Science Publications Taxonomically Clustering Organisms Based on the Profiles of Gene Sequences Using PCA 1 E.Ramaraj and 2 M.Punithavalli
Enabling Open Science: Data Discoverability, Access and Use. Jo McEntyre Head of Literature Services
Enabling Open Science: Data Discoverability, Access and Use Jo McEntyre Head of Literature Services www.ebi.ac.uk About EMBL-EBI Part of the European Molecular Biology Laboratory International, non-profit
Automatic annotation in UniProtKB using UniRule, and Complete Proteomes. Wei Mun Chan
Automatic annotation in UniProtKB using UniRule, and Complete Proteomes Wei Mun Chan Talk outline Introduction to UniProt UniProtKB annotation and propagation Data increase and the need for Automatic Annotation
NGS NEXT GENERATION SEQUENCING
NGS NEXT GENERATION SEQUENCING Paestum (Sa), 15-17 May 2014 Speaker Dr Cataldo Senatore Dr Emilia Vaccaro Sanger Sequencing Reactions For given template DNA, it's like PCR except: Uses only
Geneious 5.6 Quickstart Manual. Biomatters Ltd
Geneious 5.6 Quickstart Manual Biomatters Ltd October 15, 2012 2 Introduction This quickstart manual will guide you through the features of Geneious 5.6's interface and help you orient yourself. You should
PFstats User Guide. Aspartate/ornithine carbamoyltransferase Case Study. Neli Fonseca
PFstats User Guide Aspartate/ornithine carbamoyltransferase Case Study Contents: Overview; Obtaining An Alignment; Methods; Alignment Filtering; Reference
AMNH Gerstner Scholars in Bioinformatics & Computational Biology Application Instructions
PURPOSE AMNH Gerstner Scholars in Bioinformatics & Computational Biology Application Instructions The seeks highly qualified applicants for its Gerstner postdoctoral fellowship program in Bioinformatics
Information Retrieval, Information Extraction, and Text Mining Applications for Biology. Slides by Suleyman Cetintas & Luo Si
Information Retrieval, Information Extraction, and Text Mining Applications for Biology Slides by Suleyman Cetintas & Luo Si 1 Outline Introduction Overview of Literature Data Sources PubMed, HighWire
Structural Bioinformatics
Structural Bioinformatics Elucidation of the 3D structures of biomolecules. Analysis and comparison of biomolecular structures. Prediction of biomolecular recognition. Handles three-dimensional (3-D) structures.
Proceedings of the Postgraduate Annual Research Seminar
Proceedings of the Postgraduate Annual Research Seminar 2006 202 Database Integration Approaches for Heterogeneous Biological Data Sources: An overview Iskandar Ishak, Naomie Salim Faculty of Computer
CAP BIOINFORMATICS Su-Shing Chen CISE. 8/19/2005 Su-Shing Chen, CISE 1
CAP 5510-2 BIOINFORMATICS Su-Shing Chen CISE 8/19/2005 Su-Shing Chen, CISE 1 Building Local Genomic Databases Genomic research integrates sequence data with gene function knowledge. Gene ontology to represent
Deliverable D4.3 Release of pilot version of data warehouse
Deliverable D4.3 Release of pilot version of data warehouse Date: 10.05.17 HORIZON 2020 - INFRADEV Implementation and operation of cross-cutting services and solutions for clusters of ESFRI Grant Agreement
Supplementary Note 1: Considerations About Data Integration
Supplementary Note 1: Considerations About Data Integration Considerations about curated data integration and inferred data integration mentha integrates high confidence interaction information curated
The beginning of this guide offers a brief introduction to the Protein Data Bank, where users can download structure files.
Structure Viewers Take a Class This guide supports the Galter Library class called Structure Viewers. See our Classes schedule for the next available offering. If this class is not on our upcoming schedule,
Text mining tools for semantically enriching the scientific literature
Text mining tools for semantically enriching the scientific literature Sophia Ananiadou Director National Centre for Text Mining School of Computer Science University of Manchester Need for enriching the
Semi-Supervised Abstraction-Augmented String Kernel for bio-relationship Extraction
Semi-Supervised Abstraction-Augmented String Kernel for bio-relationship Extraction Pavel P. Kuksa, Rutgers University Yanjun Qi, Bing Bai, Ronan Collobert, NEC Labs Jason Weston, Google Research NY Vladimir
Finding and Exporting Data. BioMart
September 2017 Finding and Exporting Data Not sure what tool to use to find and export data? BioMart is used to retrieve data for complex queries, involving a few or many genes or even complete genomes.
Protein Sequence Database
Protein Sequence Database A protein is a large molecule manufactured in the cell of a living organism to carry out essential functions within the cell. The primary structure of a protein is a sequence
HsAgilentDesign db
HsAgilentDesign026652.db January 16, 2019 HsAgilentDesign026652ACCNUM Map Manufacturer identifiers to Accession Numbers HsAgilentDesign026652ACCNUM is an R object that contains mappings between a manufacturer
New generation of patent sequence databases Information Sources in Biotechnology Japan
New generation of patent sequence databases Information Sources in Biotechnology Japan EBI is an Outstation of the European Molecular Biology Laboratory. Patent-related resources Patents Patent Resources
Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014
Dynamic Programming User Manual v1.0 Anton E. Weisstein, Truman State University Aug. 19, 2014 Dynamic programming is a group of mathematical methods used to sequentially split a complicated problem into
Human Disease Models Tutorial
Mouse Genome Informatics www.informatics.jax.org The fundamental mission of the Mouse Genome Informatics resource is to facilitate the use of mouse as a model system for understanding human biology and
What is Internet COMPUTER NETWORKS AND NETWORK-BASED BIOINFORMATICS RESOURCES
What is Internet COMPUTER NETWORKS AND NETWORK-BASED BIOINFORMATICS RESOURCES Global Internet DNS Internet IP Internet Domain Name System Domain Name System The Domain Name System (DNS) is a hierarchical,
Introduction to the Protein Data Bank Master Chimie Info Roland Stote Page #
Introduction to the Protein Data Bank Master Chimie Info - 2009 Roland Stote The purpose of the Protein Data Bank is to collect and organize 3D structures of proteins, nucleic acids, protein-nucleic acid
hgu133plus2.db December 11, 2017
hgu133plus2.db December 11, 2017 hgu133plus2accnum Map Manufacturer identifiers to Accession Numbers hgu133plus2accnum is an R object that contains mappings between a manufacturer's identifiers and manufacturers
Managing Your Biological Data with Python
Chapman & Hall/CRC Mathematical and Computational Biology Series Managing Your Biological Data with Python Allegra Via Kristian Rother Anna Tramontano CRC Press Taylor & Francis Group Boca Raton London
Medical Informatics Databases Databases Databases Databases
Medical Informatics Prof. Dr. Nizamettin AYDIN naydin@yildiz.edu.tr http://www.yildiz.edu.tr/~naydin 1 2 Computers serve four interdependent functions in biomedical informatics: communications, computation,
Maximizing the Value of STM Content through Semantic Enrichment. Frank Stumpf December 1, 2009
Maximizing the Value of STM Content through Semantic Enrichment Frank Stumpf December 1, 2009 What is Semantics and Semantic Processing? Content Knowledge Framework Technology Framework Search Text Images
Yutaka Ueno Neuroscience, AIST Tsukuba, Japan
Yutaka Ueno Neuroscience, AIST Tsukuba, Japan Lua is good in Molecular biology for: 1. programming tasks 2. database management tasks 3. development of algorithms Current Projects 1. sequence annotation
An Algebra for Protein Structure Data
An Algebra for Protein Structure Data Yanchao Wang, and Rajshekhar Sunderraman Abstract This paper presents an algebraic approach to optimize queries in domain-specific database management system for protein
efip online Help Document
efip online Help Document University of Delaware Computer and Information Sciences & Center for Bioinformatics and Computational Biology Newark, DE, USA December 2013 Table of Contents INTRODUCTION...
Software review. Biomolecular Interaction Network Database
Biomolecular Interaction Network Database Keywords: protein interactions, visualisation, biology data integration, web access Abstract This software review looks at the utility of the Biomolecular Interaction
Medical Center Library & Archives
Medical Center Library & Archives October 1, 2016 The Medical Center Library welcomes you to the Duke community! We would like to take a moment to tell you about some of the tremendous number of services
Wilson Leung 01/03/2018 An Introduction to NCBI BLAST. Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment
An Introduction to NCBI BLAST Prerequisites: Detecting and Interpreting Genetic Homology: Lecture Notes on Alignment Resources: The BLAST web server is available at https://blast.ncbi.nlm.nih.gov/blast.cgi
BIFS 617 Dr. Alkharouf. Topics. Parsing GenBank Files. More regular expression modifiers. /m /s
Parsing GenBank Files BIFS 617 Dr. Alkharouf 1 Parsing GenBank Files Topics More regular expression modifiers /m /s 2 1 Parsing GenBank Libraries Parsing = systematically taking apart some unstructured
XML in the biopharmaceutical
XML in the biopharmaceutical sector XML holds out the opportunity to integrate data across both the enterprise and the network of biopharmaceutical alliances - with little technological dislocation and
This document contains information about the annotation workflow for the Full BioCreative interactive task.
BioCreative IV-User Interactive Task RLIMS-P Annotation Task This document contains information about the annotation workflow for the Full BioCreative interactive task. Annotation Workflow using RLIMS-P
Patterns / Regular expressions
Sequence bioinformatics http://bio.lundberg.gu.se/courses/ht07/bio2/ Perl programming (GK) Hidden Markov Models (MO) Methods and applications - Algorithms of sequence alignment, BLAST, multiple alignments
Important Example: Gene Sequence Matching. Corrigiendum. Central Dogma of Modern Biology. Genetics. How Nucleotides code for Amino Acids
Important Example: Gene Sequence Matching Century of Biology Two views of computer science's relationship to biology: Bioinformatics: computational methods to help discover new biology from lots of data
SEBI: An Architecture for Biomedical Image Discovery, Interoperability and Reusability based on Semantic Enrichment
SEBI: An Architecture for Biomedical Image Discovery, Interoperability and Reusability based on Semantic Enrichment Ahmad C. Bukhari 1, Michael Krauthammer 2, Christopher J.O. Baker 1 1 Department of Computer
Sequence Variation Database Project at the European Bioinformatics Institute
LEHVÄSLAIHO ET AL. HUMAN MUTATION 15:52-56 (2000) MDI SPECIAL ARTICLE Sequence Variation Database Project at the European Bioinformatics Institute Heikki Lehväslaiho,* Elia Stupka, and Michael Ashburner
HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT
HIDDEN MARKOV MODELS AND SEQUENCE ALIGNMENT - Swarbhanu Chatterjee. Hidden Markov models are a sophisticated and flexible statistical tool for the study of protein models. Using HMMs to analyze proteins
Querying Multiple Bioinformatics Information Sources: Can Semantic Web Research Help?
Querying Multiple Bioinformatics Information Sources: Can Semantic Web Research Help? David Buttler, Matthew Coleman 1, Terence Critchlow 1, Renato Fileto, Wei Han, Ling Liu, Calton Pu, Daniel Rocco, Li
Measuring inter-annotator agreement in GO annotations
Measuring inter-annotator agreement in GO annotations Camon EB, Barrell DG, Dimmer EC, Lee V, Magrane M, Maslen J, Binns D, Apweiler R. An evaluation of GO annotation retrieval for BioCreAtIvE and GOA.
Protein Data Bank Japan
Protein Data Bank Japan http://www.pdbj.org/ PDBj Today gene information for many species is just at the point of being revealed. To make use of this information, it is necessary to look at the proteins
Turning Text into Insight: Text Mining in the Life Sciences WHITEPAPER
Turning Text into Insight: Text Mining in the Life Sciences WHITEPAPER According to The STM Report (2015), 2.5 million peer-reviewed articles are published in scholarly journals each year. 1 PubMed contains
Introduction to Genome Browsers
Introduction to Genome Browsers Rolando Garcia-Milian, MLS, AHIP (Rolando.milian@ufl.edu) Department of Biomedical and Health Information Services Health Sciences Center Libraries, University of Florida
EMBL-EBI Patent Services
EMBL-EBI Patent Services 5 th Annual Forum for SMEs October 6-7 th 2011 Jennifer McDowall EBI is an Outstation of the European Molecular Biology Laboratory. Patent resources at EBI 2 http://www.ebi.ac.uk/patentdata/
VirusPKT: A Search Tool For Assimilating Assorted Acquaintance For Viruses
VirusPKT: A Search Tool For Assimilating Assorted Acquaintance For Viruses Jayanthi Manicassamy Department of Computer Science Pondicherry University Pondicherry, India. jmanic2@yahoo.com P. Dhavachelvan
1. HPC & I/O 2. BioPerl
1. HPC & I/O 2. BioPerl A simplified picture of the system User machines Login server(s) jhpce01.jhsph.edu jhpce02.jhsph.edu 72 nodes ~3000 cores compute farm direct attached storage Research network
Exercises. Biological Data Analysis Using InterMine workshop exercises with answers
Exercises Biological Data Analysis Using InterMine workshop exercises with answers Exercise1: Faceted Search Use HumanMine for this exercise 1. Search for one or more of the following using the keyword
What is Text Mining? Sophia Ananiadou National Centre for Text Mining University of Manchester
National Centre for Text Mining www.nactem.ac.uk University of Manchester Outline Aims of text mining Text Mining steps Text Mining uses Applications 2 Aims Extract and discover knowledge hidden in text
PubMed Assistant: A Biologist-Friendly Interface for Enhanced PubMed Search
Bioinformatics (2006), accepted. PubMed Assistant: A Biologist-Friendly Interface for Enhanced PubMed Search Jing Ding Department of Electrical and Computer Engineering, Iowa State University, Ames, IA
Exploring the Generation and Integration of Publishable Scientific Facts Using the Concept of Nano-publications
Exploring the Generation and Integration of Publishable Scientific Facts Using the Concept of Nano-publications Amanda Clare 1,3, Samuel Croset 2,3 (croset@ebi.ac.uk), Christoph Grabmueller 2,3, Senay
Lane Medical Library Stanford University Medical Center
Lane Medical Library Stanford University Medical Center http://lane.stanford.edu LaneAskUs@Stanford.edu 650.723.6831 PubMed: A Quick Guide PubMed: (connect from Lane Library s webpage, http://lane.stanford.edu/
SciVerse Scopus. 1. Scopus introduction and content coverage. 2. Scopus in comparison with Web of Science. 3. Basic functionalities of Scopus
Prepared by: Jawad Sayadi Account Manager, United Kingdom Elsevier BV Radarweg 29 1043 NX Amsterdam The Netherlands J.Sayadi@elsevier.com SciVerse Scopus SciVerse Scopus 1. Scopus introduction and content