Overview of BioCreative VI Precision Medicine Track


1 Overview of BioCreative VI Precision Medicine Track
Mining scientific literature for protein interactions affected by mutations
Organizers: Rezarta Islamaj Dogan (NCBI), Andrew Chatr-aryamontri (BioGrid), Sun Kim (NCBI), Don Comeau (NCBI), Zhiyong Lu (NCBI)
Data Curators: Andrew Chatr-aryamontri (BioGrid), Jennifer Rust (BioGrid), Christie Chang (BioGrid), Rose W. Oughtred (BioGrid), Lorrie Boucher (BioGrid)

2 Precision Medicine
Prevention and treatment of disease, taking into account variability in the environment, lifestyle and genetic profile of each individual.

3 BioCreative Challenges Series
Columns: Workshop, Location, Year, then one column per task (GM, GN, GO, PPI, IAT, CTD / CDR, Curation Workflow, BioC, CHEMDNER, BEL); an x marks a task held at that workshop
BC I, Granada, Spain, 2004: x x x
BC II, Madrid, Spain, 2007: x x x
BC II.5, Madrid, Spain, 2009: x
BC III, Bethesda, USA, 2010: x x x
BC 2012, DC, USA, 2012: x x x
BC IV, Bethesda, USA, 2013: X x x x x
BC V, Sevilla, Spain, 2015: X X X x x x x x
Organization Committee of BioCreative 2017:
BioGrid: Andrew Chatr-aryamontri
CNIO: Martin Krallinger, Alfonso Valencia
Colorado: Kevin Cohen
MITRE: Lynette Hirschman
NCBI: Sun Kim, Rezarta Dogan, Don Comeau, Zhiyong Lu
PIR: Cecilia Arighi, Cathy Wu
SBI: Fabio Finaldi, Julien Gobeill, Pascale Gaudet, Patrick Ruch
SCAI: Juliane Fluck, Sumit Madan
Reference: Chung-Chi Huang and Zhiyong Lu. Community challenges in biomedical text mining over 10 years: success, failure and the future. Briefings in Bioinformatics; 2015.

4 Objectives of the Precision Medicine Track
Input: unstructured data in biomedical literature
Identify precision-medicine-relevant information in scientific literature
Support database curators in selecting articles that describe molecular interactions dependent on genetic variability
Foster development of tools that can triage scientific literature for relevant studies
Foster development of tools that can extract specific PPI relations
Knowledgebase: structured and normalized information

5 Precision Medicine Track in BioCreative VI
Task 1: Document Triage. Identifying relevant PubMed citations describing genetic mutations affecting protein-protein interactions
Task 2: Relation Extraction. Extracting experimentally verified PPI affected by the presence of a genetic mutation

9 What information do curators look for?
The goal of the Precision Medicine Task was to annotate mutations that affect the stability of protein-protein interactions. The PSI-MI community standard includes this type of information in its schema, but BioGRID doesn't routinely annotate such information.

10 Data: from the curators' point of view
Mutations: naturally occurring mutations; synthetic mutations, routinely used in lab practice to study gene function
Protein-protein interactions: physical interactions; biochemical reactions; self-interactions; aggregations

11 The Precision Medicine track training corpus was generated as a result of two data selection and validation methods:
Data Repurposing: 2,852 IntAct articles containing in-the-abstract information about binding interfaces and mutations influencing the interactions were reviewed
Text Mining Triage: all PubMed articles were scored with PIE the Search and filtered with tmVar, selecting 1,200 for manual review

12 Triage Annotation
Sources: curated-database-selected articles (PPI set), text-mining-tool-selected articles (TM set), and the complete training set
[Table: Positives, Negatives and Total per set, with percentages; surviving values: Positives ,730 (42%)]
[Table: Methods, Avg. Prec., Precision, Recall, F1, Positive/Negative Ratio for 10-fold CV (PPI set), Validation (TM set) and 10-fold CV (all data); values not recoverable]

13 Training data: relation extraction task
597 PubMed abstracts with 760 in-abstract PPI relations affected by mutations. These relations were curated in IntAct and were reviewed and verified for the purposes of this task.
Number of unique genes: 1,053
Common species: human, house mouse, thale cress, yeast, Norway rat, E. coli

14 PM Track: Testing Data
1,500 PubMed articles were extracted via state-of-the-art PPI and mutation-detecting text mining methods
These articles have not been previously curated for PPI, and are not in IntAct or other databases
Each article was reviewed by at least two data curators, who consistently met and discussed discrepancies
Each article is curated for triage as relevant for curation or not
Articles relevant for curation are curated for PPI relations

15 Phases of curation
1. Five curators work on 20 PubMed articles: discuss all positive and negative selections; discuss the annotation tool and its functionality
2. Two sets of 100 articles are annotated by three curators each: discuss all positive selections and resolve all discrepancies; finalize annotation guidelines and agreements on relation extraction
3. All articles are annotated by a pair of curators: detailed reports are prepared, and all inconsistencies and discrepancies are resolved

16 Bioconcepts of interest to curators for this task
List of curated relations between two identifiable bioentities
Save annotation
Curation categories helping curators classify any given article
Space for curators to enter optional comments regarding the article
Title and abstract of selected articles, with bioconcepts of interest highlighted
List of identified bioconcepts that can be edited by curators. Related mentions of the same concept are grouped together.

17 Inter-annotator agreement
Annotator agreements and disagreements, by category: Curatable, NonCuratable, LabelReview, RelationReview
Typically, for 100 articles: 41 are labelled positive, 41 are labelled negative, 18 are reviewed for label, and 23 are reviewed for relations
Total articles reviewed: 253 for label, 328 for relations

18 Annotation Review Cases
Gene organism assignment is difficult: it is not clear which organism the gene belongs to, or the gene mentioned could be linked to a family of genes
Not all Curatable-labelled articles have explicit relations mentioned in the title or abstract: full text curation is necessary
Curators have annotated different relations, and more than one interaction is described in the article
Curators had marked the article for further discussion

19 Complete dataset
Columns: Dataset, Articles, Positive, Negative, Articles with relations, Number of relations
Training: 4,082 1,729 2,
Testing: 1,

20 Precision Medicine Track: Timeline
January 2017: Sample annotation of 250 PubMed articles and proof of concept
March 2017: Training data annotation for triage complete
April 2017: Repurposing of IntAct PPI curations for the relation extraction task complete
May 2017: Training dataset formatted in BioC (XML/JSON) and made available online; 27 text mining teams registered to participate in the challenge
June 2017: Phases 1 and 2 of test data annotation
August 2017: Test data annotation complete
September 2017: Test data made available to challenge participants, followed by evaluation

21 Evaluation
The evaluation script was made available to all participants
Dual purpose: evaluation and format check
Measures: precision, recall, average precision
For the Relation Extraction task: exact match and HomoloGene match
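
The measures named on this slide can be sketched in a few lines of Python. This is only an illustration of the standard definitions, not the official evaluation script: the function names are hypothetical, and details such as tie handling in the ranking are assumptions.

```python
from typing import List, Tuple


def precision_recall_f1(gold: List[bool], predicted: List[bool]) -> Tuple[float, float, float]:
    """Precision, recall and F1 for binary relevant / not-relevant labels."""
    tp = sum(1 for g, p in zip(gold, predicted) if g and p)
    fp = sum(1 for g, p in zip(gold, predicted) if not g and p)
    fn = sum(1 for g, p in zip(gold, predicted) if g and not p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def average_precision(gold: List[bool], confidence: List[float]) -> float:
    """Mean of the precision values measured at the rank of each relevant document."""
    # Rank documents by descending confidence (ties keep input order; an assumption).
    ranked = sorted(zip(confidence, gold), key=lambda pair: pair[0], reverse=True)
    hits = 0
    precision_sum = 0.0
    for rank, (_, is_relevant) in enumerate(ranked, start=1):
        if is_relevant:
            hits += 1
            precision_sum += hits / rank
    return precision_sum / hits if hits else 0.0
```

Average precision uses the per-document confidence to rank the articles, which is presumably why the triage submission format asks for a confidence value alongside the YES/NO label.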

22 Submission format
Triage:
<infon key="relevant">YES/NO</infon>
<infon key="confidence">Real value between 0 and 1</infon>
Relations:
<relation id="r1">
  <infon key="gene1">geneid-1</infon>
  <infon key="gene2">geneid-2</infon>
  <infon key="relation">ppim</infon>
  <infon key="confidence">0.XY</infon>
</relation>
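
A minimal sketch of producing these elements with Python's standard xml.etree.ElementTree module follows. The helper names and the gene identifiers in the usage note are illustrative assumptions; the element names, keys and the relation value are copied from the slide, and the surrounding BioC document structure is not shown.

```python
import xml.etree.ElementTree as ET
from typing import List


def triage_infons(relevant: bool, confidence: float) -> List[ET.Element]:
    """Build the two triage infon elements for one document."""
    relevant_el = ET.Element("infon", {"key": "relevant"})
    relevant_el.text = "YES" if relevant else "NO"
    confidence_el = ET.Element("infon", {"key": "confidence"})
    confidence_el.text = f"{confidence:.2f}"
    return [relevant_el, confidence_el]


def relation_element(rel_id: str, gene1: str, gene2: str, confidence: float) -> ET.Element:
    """Build one relation element with its four infon children."""
    relation = ET.Element("relation", {"id": rel_id})
    fields = (
        ("gene1", gene1),
        ("gene2", gene2),
        ("relation", "ppim"),  # value as shown on the slide
        ("confidence", f"{confidence:.2f}"),
    )
    for key, value in fields:
        infon = ET.SubElement(relation, "infon", {"key": key})
        infon.text = value
    return relation
```

For example, ET.tostring(relation_element("r1", "7157", "4193", 0.87), encoding="unicode") serializes one relation between two example gene identifiers.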

23 Baseline systems
Triage Task: SVM classifier using unigram and bigram features from titles and abstracts
Relations Task: co-occurrence method
Gene names were predicted and normalized via GeneNormPlus
Mutation and sequence variation prediction were not used
If two genes are predicted in the same sentence, a relation is predicted
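
The co-occurrence rule for the relations baseline can be sketched as follows. This assumes gene mentions have already been detected and normalized to identifiers per sentence (the slide attributes that step to GeneNormPlus, which is not called here); the function name and input layout are hypothetical, and skipping self-pairs is an assumption the slide does not address.

```python
from itertools import combinations
from typing import List, Set, Tuple


def cooccurrence_relations(sentences: List[List[str]]) -> Set[Tuple[str, str]]:
    """Predict a relation for every pair of distinct genes found in the same sentence.

    `sentences` holds, for each sentence of an abstract, the normalized gene
    identifiers predicted in that sentence.
    """
    relations: Set[Tuple[str, str]] = set()
    for gene_ids in sentences:
        # Deduplicate and sort so each unordered pair is stored once, in a fixed order.
        unique_ids = sorted(set(gene_ids))
        for gene1, gene2 in combinations(unique_ids, 2):
            relations.add((gene1, gene2))
    return relations
```

Because the rule ignores mutation mentions and only looks within sentences, it cannot recover relations that are spread across several sentences of an abstract.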

24 Participation
Columns: Team Number, Triage Task, Relation Task
Total: 10 teams/22 runs for the Triage Task; 6 teams/14 runs for the Relation Task

25 [Table: triage task results. Columns: Team Number, Submission, Avg Prec, Precision, Recall, F1, Data Format. Rows: 22 runs (JSON or XML) from teams 374, 375, 379, 405, 414, 418, 419, 420, 421 and 433, plus the BASELINE; scores not recoverable]

26 [Table: relation extraction task results. Columns: System, Submission, Precision, Recall, F1, Data Format. Rows: 14 runs (JSON or XML) from teams 375, 379, 391, 405, 420 and 433, plus the BASELINE; scores not recoverable]

27 [Figure: F1 and Avg Prec of the submitted runs and the BASELINE]

28 [Figure: Micro F1 and Macro F1 of the submitted runs and the BASELINE]
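
Micro and macro F1 differ in how per-document results are combined. The sketch below assumes that relations are compared as unordered gene pairs with exact identifier match, that micro F1 pools the counts over all documents, and that macro F1 averages the per-document F1 values; the official script may define these differently, and all names here are hypothetical.

```python
from typing import Dict, FrozenSet, Set, Tuple

Pair = FrozenSet[str]


def pair(gene1: str, gene2: str) -> Pair:
    """Unordered gene pair, so (a, b) and (b, a) count as the same relation."""
    return frozenset((gene1, gene2))


def f1_from_counts(tp: int, fp: int, fn: int) -> float:
    """F1 expressed directly in terms of true/false positives and false negatives."""
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


def micro_macro_f1(gold: Dict[str, Set[Pair]], predicted: Dict[str, Set[Pair]]) -> Tuple[float, float]:
    """Return (micro F1, macro F1) over all documents seen in either input."""
    total_tp = total_fp = total_fn = 0
    per_document = []
    for doc_id in sorted(set(gold) | set(predicted)):
        gold_pairs = gold.get(doc_id, set())
        predicted_pairs = predicted.get(doc_id, set())
        tp = len(gold_pairs & predicted_pairs)
        fp = len(predicted_pairs - gold_pairs)
        fn = len(gold_pairs - predicted_pairs)
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
        per_document.append(f1_from_counts(tp, fp, fn))
    micro = f1_from_counts(total_tp, total_fp, total_fn)
    macro = sum(per_document) / len(per_document) if per_document else 0.0
    return micro, macro
```

Under the HomoloGene match setting, one plausible variant would map each gene identifier to its homology group before building the pairs; that mapping is not shown here.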

29 Summary
The Precision Medicine Track brought together 11 teams worldwide
Produced a high-quality, manually curated corpus of 5,546 PubMed articles, containing 2,459 articles curatable for PPI affected by mutations
1,285 articles are curated for relations, with a total of 1,682 relations
22 text mining systems were submitted for the triage task, and 14 for relation extraction
As curators are interested in capturing more specialized information, such as molecular interactions affected by genetic variations, they will benefit from this work.

30 Summary
For the triage task, 16 systems outperformed the baseline on F1-score, 9 of them by a statistically significant margin
For the relations task, 7 systems outperformed the baseline, and all of these results were statistically significant
The relations defined in this task are not generally described in a single sentence
The corpus is beneficial both for training systems that extract information of practical value to precision medicine initiatives, and for training systems that extract abstract-level relations, which requires paragraph-level understanding.

31 Thank you
