Overview of BioCreative VI Precision Medicine Track

1 Overview of BioCreative VI Precision Medicine Track
Mining scientific literature for protein interactions affected by mutations
Organizers: Rezarta Islamaj Dogan (NCBI), Andrew Chatr-aryamontri (BioGrid), Sun Kim (NCBI), Don Comeau (NCBI), Zhiyong Lu (NCBI)
Data Curators: Andrew Chatr-aryamontri (BioGrid), Jennifer Rust (BioGrid), Christie Chang (BioGrid), Rose W. Oughtred (BioGrid), Lorrie Boucher (BioGrid)

2 Precision Medicine
Prevention and treatment of disease, taking into account variability in the environment, lifestyle and genetic profile of each individual.

3 BioCreative Challenges Series
Workshop Location Year GM GN GO PPI IAT BC I Granada, Spain 2004 x x x BC II Madrid, Spain 2007 x x x BC II.5 Madrid, Spain 2009 x BC III Bethesda, USA 2010 x x x CTD / CDR Curation Workflow BC 2012 DC, USA 2012 x x x BC IV Bethesda, USA 2013 X x x x x BioC CHEM DNER BC V Sevilla, Spain 2015 X X X x x x x x BEL
Organization Committee of BioCreative 2017:
BioGrid: Andrew Chatr-aryamontri
CNIO: Martin Krallinger, Alfonso Valencia
Colorado: Kevin Cohen
MITRE: Lynette Hirschman
NCBI: Sun Kim, Rezarta Dogan, Don Comeau, Zhiyong Lu
PIR: Cecilia Arighi, Cathy Wu
SIB: Fabio Rinaldi, Julien Gobeill, Pascale Gaudet, Patrick Ruch
SCAI: Juliane Fluck, Sumit Madan
Reference: Chung-Chi Huang and Zhiyong Lu. Community challenges in biomedical text mining over 10 years: success, failure and the future. Briefings in Bioinformatics; 2015

4 Objectives of the Precision Medicine Track
Input: Unstructured data in biomedical literature
Identify precision medicine relevant information in scientific literature
Support database curators in selecting articles describing molecular interactions that depend on genetic variability
Foster development of tools that can triage scientific literature for relevant studies
Foster development of tools that can extract specific PPI relations
Knowledgebase: Structured and normalized information

5 Precision Medicine Track in BioCreative VI
Task 1: Document Triage. Identifying relevant PubMed citations describing genetic mutations affecting protein-protein interactions
Task 2: Relation Extraction. Extracting experimentally verified PPI affected by the presence of a genetic mutation

9 What information do curators look for?
The goal of the Precision Medicine Task was to annotate mutations that affect the stability of protein-protein interactions. The PSI-MI community standard includes this type of information in the schema, but BioGRID doesn't routinely annotate such information.

10 Data: from the curators' point of view
Mutations: Naturally occurring mutations; Synthetic mutations, routinely used in lab practice to study gene function
Protein-protein interactions: Physical interactions; Biochemical reactions; Self-interactions; Aggregations

11 The Precision Medicine track training corpus was generated as a result of two data selection and validation methods:
Data Repurposing: 2,852 IntAct articles containing in-the-abstract information about binding interfaces and mutations influencing the interactions were reviewed
Text Mining Triage: All PubMed articles were scored with PIE the search and filtered with tmVar, selecting 1,200 for manual review

12 Triage Annotation Curated database selected articles (PPI set) Text mining tools selected articles (TM set) Complete Training Set Positives ,730 42% Negatives % Total % Methods Avg. Prec. Precision Recall F1 Positive Negative Ratio 10-fold CV (PPI set) % Validation (TM set) % 10-fold CV (all data) %

13 Training data: relation extraction task
597 PubMed abstracts with 760 in-abstract PPI relations affected by mutations. These relations were curated in IntAct and were reviewed and verified for the purposes of this task
Number of unique genes: 1,053
Common species: Human, house mouse, thale cress, yeast, Norway rat, E. coli

14 PM Track: Testing Data
1,500 PubMed articles were extracted via state-of-the-art PPI and mutation detecting text mining methods
These articles have not been previously curated for PPI, and are not in IntAct or other databases
Each article was reviewed by at least two data curators who consistently met and discussed discrepancies
Each article is curated for triage as relevant for curation or not
Relevant for curation articles are curated for PPI relations

15 Phases of curation
1. Five curators work on 20 PubMed articles: discuss all positive and negative selections; discuss the annotation tool and its functionality
2. Two sets of 100 articles are annotated by three curators each: discuss all positive selections and resolve all discrepancies; finalize annotation guidelines and agreements on relation extraction
3. All articles are annotated by a pair of curators: detailed reports are prepared, and all inconsistencies and discrepancies are resolved

16 Bioconcepts of interest to curators for this task List of curated relations between two identifiable bioentities Save annotation Curation categories helping curators classify any given article Space for curators to enter optional comments regarding the article Title and abstract of selected articles with bioconcepts of interest highlighted List of identified bioconcepts, that can be edited by curators. Related mentions of the same concept are grouped together.

17 Inter-annotator agreement Annotator agreements and disagreements Curatable NonCuratable LabelReview RelationReview Typically, for 100 articles: 41 are labelled positive 41 are labelled negative 18 are reviewed for label 23 are reviewed for relations Total articles reviewed: 253 for label 328 for relations

18 Annotation Review Cases
Gene organism assignment is difficult: it is not clear which organism the gene belongs to
Gene mentioned could be linked to a family of genes
Not all Curatable-labelled articles have explicit relations mentioned in the title or abstract: full text curation is necessary
Curators have annotated different relations and there is more than one interaction described in the article
Curators had marked the article for further discussion

19 Complete dataset Dataset Articles Positive Negative Articles with relations Number of relations Training 4,082 1,729 2, Testing 1,

20 Precision Medicine Track: Timeline
January 2017: Sample annotation of 250 PubMed articles and proof of concept
March 2017: Training data annotation for Triage complete
April 2017: Repurposing of IntAct PPI curations for the relation extraction task complete
May 2017: Training dataset formatted in BioC (XML/JSON) and made available online; 27 text mining teams registered to participate in the challenge
June 2017: Phase 1 and 2 of test data annotation
August 2017: Test data annotation complete
September 2017: Test data available to challenge participants and evaluation

21 Evaluation Evaluation script was made available to all participants Dual purpose (evaluation + format check) Precision/recall/average precision For Relation Extraction task: Exact match HomoloGene match
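The measures named on this slide can be sketched in a few lines. This is a minimal illustration, not the track's evaluation script: the function names, the treatment of a relation as an unordered gene pair, and the `homolog_group` dictionary (a stand-in for a HomoloGene lookup) are all assumptions made for the example.

```python
def precision_recall_f1(gold, predicted):
    """Set-based precision, recall and F1 over hashable items."""
    gold, predicted = set(gold), set(predicted)
    true_positives = len(gold & predicted)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def average_precision(ranked_labels):
    """Average precision for a list of booleans sorted by descending confidence.

    Assumes every candidate document appears in the ranking, so the number
    of relevant items seen equals the total number of relevant items.
    """
    hits = 0
    total = 0.0
    for rank, relevant in enumerate(ranked_labels, start=1):
        if relevant:
            hits += 1
            total += hits / rank
    return total / hits if hits else 0.0


def normalize_pair(pair, homolog_group=None):
    """Return an unordered gene pair, optionally mapped to homolog groups.

    homolog_group is a hypothetical dict from gene ID to homolog group ID.
    """
    a, b = pair
    if homolog_group is not None:
        a = homolog_group.get(a, a)
        b = homolog_group.get(b, b)
    return frozenset((a, b))


# Hypothetical gene IDs and homolog groups, for illustration only.
gold_pairs = [("101", "202"), ("505", "606")]
predicted_pairs = [("202", "101"), ("303", "606")]
homologs = {"303": "H1", "505": "H1"}

exact_scores = precision_recall_f1(
    [normalize_pair(p) for p in gold_pairs],
    [normalize_pair(p) for p in predicted_pairs],
)
homolog_scores = precision_recall_f1(
    [normalize_pair(p, homologs) for p in gold_pairs],
    [normalize_pair(p, homologs) for p in predicted_pairs],
)
```

In this toy input the second predicted pair names a homolog of the gold gene, so it counts as an error under exact matching and as a hit under the homolog-level matching.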

22 Submission format
Triage
<infon key="relevant">YES/NO</infon>
<infon key="confidence">Real value between 0 and 1</infon>
Relations
<relation id="r1">
  <infon key="gene1">geneid-1</infon>
  <infon key="gene2">geneid-2</infon>
  <infon key="relation">ppim</infon>
  <infon key="confidence">0.XY</infon>
</relation>
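As a rough sketch, the infon and relation elements shown on this slide can be produced with Python's standard `xml.etree.ElementTree`. The helper names are hypothetical, and attaching the elements directly to a `document` element is an assumption made for illustration; the slide does not say where in the BioC file they belong.

```python
import xml.etree.ElementTree as ET


def add_infon(parent, key, value):
    """Append <infon key="...">value</infon> to parent."""
    infon = ET.SubElement(parent, "infon", {"key": key})
    infon.text = value
    return infon


def format_confidence(confidence):
    """Render a confidence score, rejecting values outside [0, 1]."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError("confidence must be between 0 and 1")
    return f"{confidence:.2f}"


def add_triage(document, relevant, confidence):
    """Add the two triage infons (relevant, confidence)."""
    add_infon(document, "relevant", "YES" if relevant else "NO")
    add_infon(document, "confidence", format_confidence(confidence))


def add_relation(document, rel_id, gene1, gene2, confidence):
    """Add one relation element with the four infons shown on the slide."""
    relation = ET.SubElement(document, "relation", {"id": rel_id})
    add_infon(relation, "gene1", gene1)
    add_infon(relation, "gene2", gene2)
    add_infon(relation, "relation", "ppim")
    add_infon(relation, "confidence", format_confidence(confidence))
    return relation


# Placeholder identifiers taken from the slide; the parent element is assumed.
doc = ET.Element("document")
add_triage(doc, relevant=True, confidence=0.87)
add_relation(doc, "r1", "geneid-1", "geneid-2", 0.75)
xml_text = ET.tostring(doc, encoding="unicode")
```

Building the elements through an XML library rather than string concatenation keeps attribute quoting and escaping correct, which matters because the evaluation script also served as a format check.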

23 Baseline systems Triage Task SVM classifier using unigram and bigram features from titles and abstracts Relations Task Co-occurrence method Gene names were predicted and normalized via GeneNormPlus Mutation and sequence variation prediction were not used If two genes are predicted in the same sentence, a relation is predicted
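The co-occurrence rule for the relations baseline can be sketched as follows. This is an illustration under stated assumptions, not the organizers' code: the input is assumed to be a list of sentences, each already reduced to the normalized gene IDs predicted in it, and the function name and gene IDs are hypothetical. Self-interactions are not emitted in this sketch.

```python
from itertools import combinations


def cooccurrence_relations(sentences):
    """Predict a relation for every pair of distinct genes sharing a sentence.

    sentences: iterable of lists of normalized gene IDs, one list per sentence.
    Returns a set of (gene_a, gene_b) tuples with gene_a < gene_b, so each
    unordered pair is reported once per document.
    """
    relations = set()
    for gene_ids in sentences:
        unique_ids = sorted(set(gene_ids))
        for gene_a, gene_b in combinations(unique_ids, 2):
            relations.add((gene_a, gene_b))
    return relations


# Hypothetical gene IDs for three sentences of one abstract.
abstract_sentences = [["g1", "g2"], ["g3"], ["g2", "g1", "g3"]]
predicted = cooccurrence_relations(abstract_sentences)
```

Because no mutation information is used, a rule like this tends to over-predict within sentences and cannot recover relations whose two genes appear in different sentences.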

24 Participation Team Number Triage Task Relation Task Total 10 teams/22 runs 6 teams/14 runs

25 Team Number Submission Avg Prec Precision Recall F1 Data Format Run JSON 374 Run JSON Run JSON Run JSON 375 Run JSON Run JSON 379 Run XML 405 Run JSON Run XML 414 Run XML Run XML Run XML 418 Run XML Run XML Run XML 419 Run XML Run XML 420 Run JSON Run XML 421 Run XML Run XML 433 Run JSON BASELINE

26 System Submission Precision Recall F1 Data Format Run XML 375 Run XML Run XML 379 Run XML Run XML Run XML 391 Run XML Run XML 405 Run JSON Run JSON Run JSON 420 Run JSON Run JSON 433 Run JSON BASELINE

27 [Figure: F1 and Avg Prec of the submitted runs, compared with the BASELINE]

28 [Figure: Micro F1 and Macro F1 of the submitted runs, compared with the BASELINE]

29 Summary
Precision Medicine Track brought together 11 teams worldwide
Produced a high-quality, manually curated corpus of 5,546 PubMed articles containing 2,459 curatable articles for PPI affected by mutations
1,285 articles are curated for relations, with a total of 1,682 relations
22 text mining systems were submitted for the triage task, and 14 for relation extraction
As curators are interested in capturing more specialized information such as molecular interactions affected by genetic variations, they will benefit from this work.

30 Summary
For Triage: 16 systems outperformed the baseline based on F1-score, 9 of which showed a statistically significant result
For the relations task: 7 systems outperformed the baseline, and all of these results were statistically significant
The relations defined in this task are not generally described in a single sentence
The corpus is beneficial both for training systems that extract information of practical value to the precision medicine initiative and for training systems that extract abstract-level relations, which requires paragraph-level understanding.

31 Thank you


More information

A Feature Selection Method to Handle Imbalanced Data in Text Classification

A Feature Selection Method to Handle Imbalanced Data in Text Classification A Feature Selection Method to Handle Imbalanced Data in Text Classification Fengxiang Chang 1*, Jun Guo 1, Weiran Xu 1, Kejun Yao 2 1 School of Information and Communication Engineering Beijing University

More information

BovineMine Documentation

BovineMine Documentation BovineMine Documentation Release 1.0 Deepak Unni, Aditi Tayal, Colin Diesh, Christine Elsik, Darren Hag Oct 06, 2017 Contents 1 Tutorial 3 1.1 Overview.................................................

More information

Discovery Net : A UK e-science Pilot Project for Grid-based Knowledge Discovery Services. Patrick Wendel Imperial College, London

Discovery Net : A UK e-science Pilot Project for Grid-based Knowledge Discovery Services. Patrick Wendel Imperial College, London Discovery Net : A UK e-science Pilot Project for Grid-based Knowledge Discovery Services Patrick Wendel Imperial College, London Data Mining and Exploration Middleware for Distributed and Grid Computing,

More information

Recommending MeSH terms for annotating biomedical articles

Recommending MeSH terms for annotating biomedical articles Recommending MeSH terms for annotating biomedical articles Minlie Huang, 1,2 Aurélie Névéol, 2 Zhiyong Lu 2 1 State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory for

More information

Chemical name recognition with harmonized feature-rich conditional random fields

Chemical name recognition with harmonized feature-rich conditional random fields Chemical name recognition with harmonized feature-rich conditional random fields David Campos, Sérgio Matos, and José Luís Oliveira IEETA/DETI, University of Aveiro, Campus Universitrio de Santiago, 3810-193

More information

Classification of Protein Crystallization Imagery

Classification of Protein Crystallization Imagery Classification of Protein Crystallization Imagery Xiaoqing Zhu, Shaohua Sun, Samuel Cheng Stanford University Marshall Bern Palo Alto Research Center September 2004, EMBC 04 Outline Background X-ray crystallography

More information

Meter Trouble Report PUBLIC. A Guide for Market Participants. Issue 6.0 IMP_GDE_0098

Meter Trouble Report PUBLIC. A Guide for Market Participants. Issue 6.0 IMP_GDE_0098 PUBLIC IMP_GDE_0098 + Meter Trouble Report A Guide for Market Participants Issue 6.0 This document is a guide for market participants to the use of the Meter Trouble Report workflow application. Public

More information

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA- MEM).

The software comes with 2 installers: (1) SureCall installer (2) GenAligners (contains BWA, BWA- MEM). Release Notes Agilent SureCall 4.0 Product Number G4980AA SureCall Client 6-month named license supports installation of one client and server (to host the SureCall database) on one machine. For additional

More information

Digital The Harold B. Lee Library

Digital The Harold B. Lee Library Digital Preservation @ The Harold B. Lee Library CIMA 23 May 2013 How we got here? 1. Understanding Digital Preservation 2. Search for Content 3. Maintain Optical Disc Storage 4. In House Preservation

More information

Core Technology Development Team Meeting

Core Technology Development Team Meeting Core Technology Development Team Meeting To hear the meeting, you must call in Toll-free phone number: 1-866-740-1260 Access Code: 2201876 For international call in numbers, please visit: https://www.readytalk.com/account-administration/international-numbers

More information

Automatic Domain Partitioning for Multi-Domain Learning

Automatic Domain Partitioning for Multi-Domain Learning Automatic Domain Partitioning for Multi-Domain Learning Di Wang diwang@cs.cmu.edu Chenyan Xiong cx@cs.cmu.edu William Yang Wang ww@cmu.edu Abstract Multi-Domain learning (MDL) assumes that the domain labels

More information

Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014

Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014 Classification and retrieval of biomedical literatures: SNUMedinfo at CLEF QA track BioASQ 2014 Sungbin Choi, Jinwook Choi Medical Informatics Laboratory, Seoul National University, Seoul, Republic of

More information

The GENIA corpus Linguistic and Semantic Annotation of Biomedical Literature. Jin-Dong Kim Tsujii Laboratory, University of Tokyo

The GENIA corpus Linguistic and Semantic Annotation of Biomedical Literature. Jin-Dong Kim Tsujii Laboratory, University of Tokyo The GENIA corpus Linguistic and Semantic Annotation of Biomedical Literature Jin-Dong Kim Tsujii Laboratory, University of Tokyo Contents Ontology, Corpus and Annotation for IE Annotation and Information

More information

A quick review. Which molecular processes/functions are involved in a certain phenotype (e.g., disease, stress response, etc.)

A quick review. Which molecular processes/functions are involved in a certain phenotype (e.g., disease, stress response, etc.) Gene expression profiling A quick review Which molecular processes/functions are involved in a certain phenotype (e.g., disease, stress response, etc.) The Gene Ontology (GO) Project Provides shared vocabulary/annotation

More information

Bioinformatics Hubs on the Web

Bioinformatics Hubs on the Web Bioinformatics Hubs on the Web Take a class The Galter Library teaches a related class called Bioinformatics Hubs on the Web. See our Classes schedule for the next available offering. If this class is

More information

A simple method to extract abbreviations within a document using regular expressions

A simple method to extract abbreviations within a document using regular expressions A simple method to extract abbreviations within a document using regular expressions Christian Sánchez, Paloma Martínez Computer Science Department, Universidad Carlos III of Madrid Avd. Universidad, 30,

More information

Automatically Constructing a Directory of Molecular Biology Databases

Automatically Constructing a Directory of Molecular Biology Databases Automatically Constructing a Directory of Molecular Biology Databases Luciano Barbosa, Sumit Tandon, and Juliana Freire School of Computing, University of Utah Abstract. There has been an explosion in

More information

Christian Sánchez, Paloma Martínez. Computer Science Department, Universidad Carlos III of Madrid Avd. Universidad, 30, Leganés, 28911, Madrid, Spain

Christian Sánchez, Paloma Martínez. Computer Science Department, Universidad Carlos III of Madrid Avd. Universidad, 30, Leganés, 28911, Madrid, Spain A proposed system to identify and extract abbreviation definitions in Spanish biomedical texts for the Biomedical Abbreviation Recognition and Resolution (BARR) 2017 Christian Sánchez, Paloma Martínez

More information

MDA Blast2GO Exercises

MDA Blast2GO Exercises MDA 2011 - Blast2GO Exercises Ana Conesa and Stefan Götz March 2011 Bioinformatics and Genomics Department Prince Felipe Research Center Valencia, Spain Contents 1 Annotate 10 sequences with Blast2GO 2

More information

Extracting patient data from tables in clinical literature Case study on extraction of BMI, weight and number of patients

Extracting patient data from tables in clinical literature Case study on extraction of BMI, weight and number of patients Extracting patient data from tables in clinical literature Case study on extraction of BMI, weight and number of patients Nikola Milosevic 1, Cassie Gregson 2, Robert Hernandez 2 and Goran Nenadic 1,3

More information

Interoperability and Semantics in Use- Application of UML, XMI and MDA to Precision Medicine and Cancer Research

Interoperability and Semantics in Use- Application of UML, XMI and MDA to Precision Medicine and Cancer Research Interoperability and Semantics in Use- Application of UML, XMI and MDA to Precision Medicine and Cancer Research Ian Fore, D.Phil. Associate Director, Biorepository and Pathology Informatics Senior Program

More information

CIS UDEL Working Notes on ImageCLEF 2015: Compound figure detection task

CIS UDEL Working Notes on ImageCLEF 2015: Compound figure detection task CIS UDEL Working Notes on ImageCLEF 2015: Compound figure detection task Xiaolong Wang, Xiangying Jiang, Abhishek Kolagunda, Hagit Shatkay and Chandra Kambhamettu Department of Computer and Information

More information

Supplementary Note 1: Considerations About Data Integration

Supplementary Note 1: Considerations About Data Integration Supplementary Note 1: Considerations About Data Integration Considerations about curated data integration and inferred data integration mentha integrates high confidence interaction information curated

More information

GIDMP: GOOD PROTEIN-PROTEIN INTERACTION DATA METAMINING PRACTICE

GIDMP: GOOD PROTEIN-PROTEIN INTERACTION DATA METAMINING PRACTICE CELLULAR & MOLECULAR BIOLOGY LETTERS http://www.cmbl.org.pl Received: 06 October 2010 Volume 16 (2011) pp 258-263 Final form accepted: 28 February 2011 DOI: 10.2478/s11658-011-0004-1 Published online:

More information

Maximizing Public Data Sources for Sequencing and GWAS

Maximizing Public Data Sources for Sequencing and GWAS Maximizing Public Data Sources for Sequencing and GWAS February 4, 2014 G Bryce Christensen Director of Services Questions during the presentation Use the Questions pane in your GoToWebinar window Agenda

More information

NERD workshop. Luca ALMAnaCH - Inria Paris. Berlin, 18/09/2017

NERD workshop. Luca ALMAnaCH - Inria Paris. Berlin, 18/09/2017 NERD workshop Luca Foppiano @ ALMAnaCH - Inria Paris Berlin, 18/09/2017 Agenda Introducing the (N)ERD service NERD REST API Usages and use cases Entities Rigid textual expressions corresponding to certain

More information

Humboldt-University of Berlin

Humboldt-University of Berlin Humboldt-University of Berlin Exploiting Link Structure to Discover Meaningful Associations between Controlled Vocabulary Terms exposé of diploma thesis of Andrej Masula 13th October 2008 supervisor: Louiqa

More information

Minimal Metadata Standards and MIIDI Reports

Minimal Metadata Standards and MIIDI Reports Dryad-UK Workshop Wolfson College, Oxford 12 September 2011 Minimal Metadata Standards and MIIDI Reports David Shotton, Silvio Peroni and Tanya Gray Image BioInformatics Research Group Department of Zoology

More information

Genome Browsers - The UCSC Genome Browser

Genome Browsers - The UCSC Genome Browser Genome Browsers - The UCSC Genome Browser Background The UCSC Genome Browser is a well-curated site that provides users with a view of gene or sequence information in genomic context for a specific species,

More information

G-TACT User Guide Ensuring Accurate Classification of BRCA Variants Module: RUN 1

G-TACT User Guide Ensuring Accurate Classification of BRCA Variants Module: RUN 1 G-TACT User Guide Ensuring Accurate Classification of BRCA Variants Module: RUN 1 Description The G-TACT User Guide provides assistance to users in all aspects of using the Ensuring Accurate Classification

More information

BioNav: An Ontology-Based Framework to Discover Semantic Links in the Cloud of Linked Data

BioNav: An Ontology-Based Framework to Discover Semantic Links in the Cloud of Linked Data BioNav: An Ontology-Based Framework to Discover Semantic Links in the Cloud of Linked Data María-Esther Vidal 1, Louiqa Raschid 2, Natalia Márquez 1, Jean Carlo Rivera 1, and Edna Ruckhaus 1 1 Universidad

More information

Genome Browsers Guide

Genome Browsers Guide Genome Browsers Guide Take a Class This guide supports the Galter Library class called Genome Browsers. See our Classes schedule for the next available offering. If this class is not on our upcoming schedule,

More information

CRFVoter: Chemical Entity Mention, Gene and Protein Related Object recognition using a conglomerate of CRF based tools

CRFVoter: Chemical Entity Mention, Gene and Protein Related Object recognition using a conglomerate of CRF based tools CRFVoter: Chemical Entity Mention, Gene and Protein Related Object recognition using a conglomerate of CRF based tools Wahed Hemati, Alexander Mehler, and Tolga Uslu Text Technology Lab, Goethe Universitt

More information

Toward an interactive article: integrating journals and biological databases

Toward an interactive article: integrating journals and biological databases BMC Bioinformatics This Provisional PDF corresponds to the article as it appeared upon acceptance. Fully formatted PDF and full text (HTML) versions will be made available soon. Toward an interactive article:

More information