A Framework for BioCuration (part II)

1 A Framework for BioCuration (part II)
Text Mining for the BioCuration Workflow Workshop, 3rd International Biocuration Conference
Friday, April 17, 2009 (Berlin)
Martin Krallinger, Spanish National Cancer Research Centre (CNIO)
Joint talk with Gully A.P.C. Burns, USC Information Sciences Institute, USA

2 LITERATURE BIOCURATION WORKFLOW TASKS

3 FROM WORKFLOWS TO TEXT MINING

4 EXAMPLE TYPES: ARTICLE IDENTIFICATION (TRIAGE TASK - 1)
  Expert feedback
  External reference
  Literature mining
  Exhaustive journal

5 EXAMPLE TYPES: ARTICLE IDENTIFICATION (TRIAGE TASK - 2)
  Taxonomy centric
  Bio-entity centric
  Thematic or topic centric

6 ARTICLE IDENTIFICATION: TRIAGE TASK
General aspects
  Usually done with keyword/pattern searches against PubMed
  Depends on annotation type/criteria, organism & literature volume
Bottleneck/problems
  Limited information in abstracts
  Implicit limitations of keyword searches
  Complex information demand: gene names/IDs + organism source + annotation relevance + evidential support
Text mining
  Sophisticated information retrieval strategies
  Rules, regular expressions and pattern mining
  Document similarity
  Machine learning and text categorization approaches
  Full-text articles -> text passages
  Combine with the bio-entity identification task
  Examples: BCMS, Genomics TREC, PreBIND,
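
The "document similarity" strategy named on the triage slide can be sketched as follows. This is only an illustrative sketch, not the method used by any of the listed systems: it assumes a small seed set of abstracts a curator has already accepted, and ranks new abstracts by bag-of-words cosine similarity to that seed. The function names and the sample texts are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase alphanumeric tokens; a deliberately crude tokenizer."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def triage(candidates, curated_seed):
    """Rank candidate abstracts by similarity to already-curated abstracts.

    candidates: dict mapping a document id to its abstract text.
    curated_seed: list of abstracts previously accepted for curation.
    Returns document ids, most similar first.
    """
    profile = Counter()
    for text in curated_seed:
        profile.update(tokenize(text))
    scored = [
        (cosine(Counter(tokenize(text)), profile), doc_id)
        for doc_id, text in candidates.items()
    ]
    return [doc_id for _, doc_id in sorted(scored, reverse=True)]


# Hypothetical sample data, for illustration only.
seed = ["Protein A interacts with protein B in yeast two-hybrid assays."]
candidates = {
    "doc1": "We show that kinase X binds and interacts with protein Y in yeast.",
    "doc2": "A survey of hospital staffing levels in rural regions.",
}
ranking = triage(candidates, seed)
```

Raw term counts over-reward common words such as "in" or "a"; a realistic triage system would at least add stop-word filtering or TF-IDF weighting, or replace the similarity score with a trained text classifier as the slide suggests.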

7 BIO-ENTITY IDENTIFICATION TASK
General aspects
  Linking literature to bio-entity database identifiers
  From text to database / from database to text
  Search database with bio-entity mentions (symbols, names)
  Organism & annotation type specific problem
Bottleneck/problems
  Time-consuming step
  Disambiguation: level of bio-entity, level of organism source, attributes
  Missing/incomplete information: in database & in article
Text mining
  Used for automatically tagging gene and protein mentions
  Gene dictionary extension, filtering and look-up approaches
  Disambiguation based on context of the bio-entity mention
  Currently limited interpretability of automatic results
  Examples: iHOP, BCMS, ProMiner, Whatizit,
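
A minimal sketch of the dictionary look-up and context-based disambiguation listed on the slide above. Everything in it is an assumption made for illustration: the dictionary entries, the `DB:` identifiers and the organism cue words are invented, and real systems use far larger lexicons and more careful matching.

```python
import re

# Hypothetical gene dictionary: lowercase symbol -> [(database id, organism)].
GENE_DICT = {
    "cdc2": [("DB:0001", "yeast"), ("DB:0002", "human")],
    "p53": [("DB:0003", "human")],
}

# Hypothetical context words that hint at the organism source.
ORGANISM_CUES = {
    "yeast": {"yeast", "pombe", "cerevisiae"},
    "human": {"human", "patient", "hela"},
}


def tag_entities(text):
    """Return (mention, database id, organism) for each dictionary hit.

    If the text contains organism cue words, only candidates from those
    organisms are kept; otherwise every candidate is returned and the
    ambiguity is left for the curator to resolve.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    lowered = {t.lower() for t in tokens}
    organisms = {org for org, cues in ORGANISM_CUES.items() if cues & lowered}
    hits = []
    for tok in tokens:
        for db_id, org in GENE_DICT.get(tok.lower(), []):
            if not organisms or org in organisms:
                hits.append((tok, db_id, org))
    return hits


resolved = tag_entities("Cdc2 is required for mitosis in fission yeast.")
ambiguous = tag_entities("cdc2 phosphorylates p53")
```

Returning all candidates when no organism cue is found keeps the ambiguity visible, which matches the slide's point that automatic results are hard to interpret without curator review.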

8 ANNOTATION EVENT IDENTIFICATION TASK
General aspects
  Extraction of some kind of biological relation
  Identification of some evidential text passage
  Complex process requiring inference from domain expert knowledge
  Interpretation of author descriptions by curator
  Mapping to controlled vocabularies (CV)
Bottleneck/problems
  Granularity of CV within ontology
  Variability in describing a given annotation event
  Formalizing the context/conditions of the annotation event
Text mining
  Often sentence co-occurrence assumption
  Article, passage, sentence classifier
  Patterns (trigger words), regular expressions & syntactic relations
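
The sentence co-occurrence assumption combined with trigger words can be illustrated with the sketch below. It assumes the entity mentions are already known (for example from a bio-entity identification step) and that a naive punctuation-based sentence splitter is good enough; the trigger list and the sample text are hypothetical.

```python
import re
from itertools import combinations

# Hypothetical trigger words signalling an interaction-type relation.
TRIGGERS = {"interacts", "binds", "phosphorylates"}


def cooccurrence_relations(text, entities):
    """Propose a relation for every entity pair that shares a sentence
    with at least one trigger word.

    Returns tuples of (entity_a, entity_b, trigger, evidence sentence).
    """
    relations = []
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        words = set(re.findall(r"[A-Za-z0-9]+", sentence.lower()))
        present = sorted(e for e in entities if e.lower() in words)
        triggers = sorted(TRIGGERS & words)
        if len(present) >= 2 and triggers:
            for a, b in combinations(present, 2):
                relations.append((a, b, triggers[0], sentence))
    return relations


sample = "Rad51 binds Brca2 in vitro. Rad51 was cloned. Brca2 is large."
found = cooccurrence_relations(sample, ["Rad51", "Brca2"])
```

The returned sentence doubles as the evidential text passage for the curator. Co-occurrence says nothing about direction or negation, which is why the slide lists syntactic relations as a further option.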

9 EVIDENTIAL QUALIFIER IDENTIFICATION TASK
General aspects
  Statement vs. experimentally supported discovery
  Indicative of reliability and interpretation of annotation
  Relevant for bioinformatics & systems biology analysis
  Examples: GO evidence codes, PSI-MI interaction detection methods,
Bottleneck/problems
  Limited lexical resources for experimental techniques
  Variability in describing experimental methods
  Multiple experimental evidence for a single annotation event
Text mining
  Often method sentence co-occurrence assumption
  Only a few approaches!
  More general linguistic approaches: negation and uncertainty
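
As a rough sketch of the method sentence co-occurrence assumption, the code below looks for an experimental-method phrase in any sentence that mentions both entities, and flags simple negation cues. The lexicon, its method labels and the cue list are assumptions chosen for illustration; they are not taken from PSI-MI or from any published system.

```python
import re

# Hypothetical lexicon: surface phrase -> normalized method label.
METHOD_LEXICON = {
    "two-hybrid": "two hybrid",
    "co-immunoprecipitation": "coimmunoprecipitation",
    "pull-down": "pull down",
}

# Hypothetical negation cues; real negation handling needs scope detection.
NEGATION_CUES = {"not", "no", "failed", "unable"}


def evidence_for_pair(text, entity_a, entity_b):
    """Return (method label, negated?) for each method phrase found in a
    sentence that mentions both entities (case-insensitive substring match).
    """
    found = []
    for sentence in re.split(r"(?<=[.!?])\s+", text.strip()):
        lowered = sentence.lower()
        if entity_a.lower() not in lowered or entity_b.lower() not in lowered:
            continue
        negated = bool(NEGATION_CUES & set(re.findall(r"[a-z]+", lowered)))
        for phrase, method in METHOD_LEXICON.items():
            if phrase in lowered:
                found.append((method, negated))
    return found


sample = (
    "Rad51 and Brca2 interacted in a yeast two-hybrid assay. "
    "Rad51 did not bind Xrcc3 in pull-down experiments."
)
supported = evidence_for_pair(sample, "Rad51", "Brca2")
refuted = evidence_for_pair(sample, "Rad51", "Xrcc3")
```

The fixed phrase list shows the bottleneck the slide names: authors describe the same technique in many ways, so a small lexicon misses most variants.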

10 PPI ANNOTATION OF BIOGRID (1) Many thanks to Andrew Winter

11 PPI ANNOTATION OF BIOGRID (2) Many thanks to Andrew Winter

12 PPI ANNOTATION OF BIOGRID (3) Many thanks to Andrew Winter

13 PPI ANNOTATION OF BIOGRID (4) Many thanks to Andrew Winter

14 ACKNOWLEDGEMENTS
  Andrew Winter (BioGRID database)
  Andrew Chatr-aryamontri (MINT database)
  Steven Montgomery (ORegAnno database)
  Gully Burns
  Lynette Hirschman
  BioCreative:
  BioCreative MetaServer:
