The Text Analytics Challenge BioCreative V - Extraction of causal network information in BEL

1 The Text Analytics Challenge BioCreative V - Extraction of causal network information in BEL Fabio Rinaldi

2 Outline: Biomedical text mining, motivation; Competitive evaluations: BioNLP, BioCreative; The BEL track at BioCreative V; Outlook

3 Motivation: "The purpose of biomedical curation activities is to help the Life Sciences community to make sense of all the data that is accumulating." A. Bairoch, The future of annotation/biocuration. Nature Precedings 2009.

4 Growth of PubMed citations from 1986 to Lu Z Database 2011;2011:baq036 The Author(s) Published by Oxford University Press.

5 Motivation: "The purpose of biomedical curation activities is to help the Life Sciences community to make sense of all the data that is accumulating. Nobody will ever be able to manually annotate all the macromolecular biological entities that exist on this planet, and consequently automatization is the only solution." A. Bairoch, The future of annotation/biocuration. Nature Precedings 2009.

6 Why text mining? Massive amount of published material: complete human curation is impossible. Text mining can assist: database curation; targeted searches by scientists; identification of research targets by industry; building systemic networks. Text mining technologies are regularly evaluated through community assessments.

7 Goals of community assessments: Determine the state of the art; Monitor improvements; Investigation of different approaches; Evaluation of new strategies; Identification of positive / negative features; Scientific forum: stimulate progress in research

8 Competitive evaluations: BioCreative, BioNLP, BioASQ, i2b2 (medical), CALBC, CLEF-ER, QA4MRE, SemEval

9 BioASQ: Three editions so far: 2013, 2014, 2015. Two tasks: (a) Large-scale online biomedical semantic indexing: annotate PubMed abstracts with classes from the MeSH hierarchy. (b) Introductory biomedical semantic QA: questions to be answered with relevant concepts (from designated terminologies and ontologies), relevant articles (in English, from designated article repositories), relevant snippets (from the relevant articles), and relevant RDF triples (from designated ontologies).

10 BioNLP shared task

11 BioNLP 2009. Task 1: Core event extraction (mandatory). Unary events: 70%; binding and regulation: 40%. Task 2: Event enrichment (optional). Examples: "phosphorylation of TRAF2" gives (Type:Phosphorylation, Theme:TRAF2); "localization of beta-catenin into nucleus" gives (Type:Localization, Theme:beta-catenin, ToLoc:nucleus). Task 3: Negation and speculation recognition (optional). Example: "TRADD did not interact with TES2" gives (Negation (Type:Binding, Theme:TRADD, Theme:TES2)). Total number of participants: 24
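The event notation on this slide can be mirrored by a small data structure. The following is a minimal sketch, not the shared task's official format: the `Event` class, its field names, and the `render` method are hypothetical and only reproduce the bracketed notation shown above.

```python
from dataclasses import dataclass, field


@dataclass
class Event:
    """Hypothetical container for one BioNLP-style event (illustration only)."""
    type: str
    args: list = field(default_factory=list)  # ordered (role, value) pairs
    negated: bool = False

    def render(self) -> str:
        # Reproduce the slide's notation: (Type:X, Role:value, ...)
        parts = [f"Type:{self.type}"] + [f"{role}:{value}" for role, value in self.args]
        body = "(" + ", ".join(parts) + ")"
        # Task 3 wraps the event in a Negation marker
        return f"(Negation {body})" if self.negated else body


phospho = Event("Phosphorylation", [("Theme", "TRAF2")])
binding = Event("Binding", [("Theme", "TRADD"), ("Theme", "TES2")], negated=True)
print(phospho.render())
print(binding.render())
```

A list of pairs is used instead of a dictionary because a binding event can carry several arguments with the same role (two Themes in the example).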

12 BioNLP 2011 [GE] GENIA; p: 15, F: 53% (full text), F: 57% (abstracts) [EPI] Epigenetics and Post-translational Modifications; p: 7, F: 53% [ID] Infectious Diseases; p: 7, F: 56% Bacteria Track: [BB] Biotopes (p:3, F: 45%), [BI] Gene Interactions (p: 1, F: 77.0%), [REN] Bacteria Gene Renaming (p:3, F: 87.0%) [CO] Protein/Gene Coreference Task, p:6, F: 34.1% [REL] Entity Relations Supporting Task, p: 4, F: 57.7%

13 BioNLP 2013 [GE] Genia Event Extraction; p: 10, F: 51% [CG] Cancer Genetics; p: 2, F: 55.4% [PC] Pathway Curation; p: 2, F: 52.8% [GRO] Corpus Annotation with Gene Regulation Ontology; p: 1, F: 22% [GRN] Gene Regulation Network in Bacteria; p: 5, SER: 73% [BB] Bacteria Biotopes (semantic annotation by an ontology); p: 5, entities SER: 46%, relations: 40%, events: 14%

14 BioCreative

15 BioCreative I (2004): 27 teams, Granada, Spain. Hirschman et al. Overview of BioCreative: critical assessment of information extraction for biology. BMC Bioinformatics (2005), 6:S1. Tracks: identification of gene mentions in text and linking protein database entries to abstracts; extraction of human gene product annotations with GO terms.

16 BioCreative II (2006): 44 teams, Madrid, Spain. Krallinger et al. Evaluation of text-mining systems in Biology: overview of the Second BioCreative community challenge, Genome Biology (2008), 9:S1. Tracks: Gene mention tagging [GM]; Gene normalization [GN]; Extraction of protein-protein interactions from text [PPI]: IAS (article), IPS (pair), IMS (methods), ISS (evidence).

17 BioCreative II: GM, GN

18 BioCreative II: PPI

19 Species

20 BioCreative II.5 (2009): Challenge run through web services. Participants: 16 teams. Corpus: FEBS Letters 2007. Goal: reproduce the Structured Digital Abstract. Subtasks: [ACT] Article classification; AUC: 67.8%. [INT] Interactor Normalization; AUC: 43.5%. [IPT] Interaction Pair; AUC: 22.2%.

22 Structured Digital Abstracts

23 BioCreative III (2010). Tasks: gene mentions (GM); p: 14, TAP-10: 34.6%. Protein-protein interactions: article classification (ACT); p: 10, AUC: 68%; experimental methods (IMT); p: 9, AUC: 53%. Interactive task (IAT); p: 6.

24 PPI-IMT

25 BioCreative IV (2013). Task 1: BioC (PyBioC); p: 9. Task 2: CHEMDNER (chemicals); p: 27; best F-score: 87.39% CEM, 88.20% CDI. Task 3: CTD web service; p: 7; gene: 61%, chemical: 74%, disease: 51%, act: 54%. Task 4: GO annotations; p: 8; Task A (evidence text), best F: 0.27 (exact) / 0.38; Task B (predict GO terms), best F: 0.13 / 0.34 (hier.). Task 5: Interactive task; p: 9.

26 BioCreative V (2015): Collaborative Biocurator Assistant Task (BioC); CHEMDNER patents; Chemical-disease relation (CDR) task; Extraction of causal network information in Biological Expression Language (BEL); Interactive Curation (IAT)

27 The BEL track at BioCreative V

28 BEL Track: Timeline. Oct 2014: preparation of proposal for Task; Nov 2014: approval of proposal; Dec 2014: administrative and contractual arrangements; Jan 2015: official start of supported activity; Feb 2015: release sample set; Mar 2015: release training set; Mar-May 2015: preparation of evaluation framework and supporting data; Jun 15-18: release of test set and official evaluation

29 Datasets and supporting material. Sample set Training set 295 BEL statements with evidence BEL statements with evidence. Supporting material: BioC version; structural graphs; fragments (tsv representation of BEL statements); entities (list of entities contained in BEL statement)

30 Tasks. Task 1: Given textual evidence for a BEL statement, generate the corresponding BEL statement. Data: 100 sentences. Accept 3 runs per participant. Accept up to 10 BEL statements or fragments per sentence. Task 2: Given a BEL statement, provide at most 10 additional evidence sentences. Data: 100 BEL statements, verified to have evidence in PubMed. Accept 1 run per participant, each with 10 sentences ranked by relevance.

31 Simplifications. Selection of statements: non-nested. Selection of relationships: decreases/directlyDecreases, increases/directlyIncreases. Namespaces: six namespaces considered (HGNC, MGI, EGID, GOBP, MESHD, CHEBI); equivalence between HGNC, MGI and EGID. Simplification of abundance functions for gene/protein: p() can be used instead of g(), m() and r(). Restrictions and equivalences of functions. Simplification of abundance modifier functions: cellular locations are not requested; P is used as default argument to pmod(), i.e. pmod(P). Simplification of functions: act() is used instead of cat(), tscript(), kin(), gtp(), sec(), surf() etc.
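The function-level simplifications listed on this slide could be applied mechanically to a BEL statement. The sketch below is an illustration under stated assumptions, not the organizers' code: the `simplify` function, the two lookup sets, and the regular expressions are hypothetical, only the short-form function names quoted on the slide are handled, and values containing parentheses are not supported.

```python
import re

# Short-form function names taken from the slide; the grouping into two
# sets is an assumption made for this illustration.
ABUNDANCE_TO_P = {"g", "m", "r"}
ACTIVITY_TO_ACT = {"cat", "tscript", "kin", "gtp", "sec", "surf"}

_FUNCTION = re.compile(r"\b([A-Za-z]+)\(")   # a function name followed by "("
_PMOD = re.compile(r"pmod\([^()]*\)")        # a pmod(...) call without nesting


def simplify(statement: str) -> str:
    """Rewrite a BEL statement using the simplified function vocabulary."""
    def rename(match):
        name = match.group(1)
        if name in ABUNDANCE_TO_P:
            name = "p"
        elif name in ACTIVITY_TO_ACT:
            name = "act"
        return name + "("

    statement = _FUNCTION.sub(rename, statement)
    # Collapse every protein modification to the default argument P
    return _PMOD.sub("pmod(P)", statement)


print(simplify("kin(p(HGNC:AKT1)) increases r(MGI:Fos)"))
print(simplify("p(HGNC:AKT1, pmod(P, S, 473))"))
```

Normalizing both the gold statements and the system output in this way before comparison means that a system is not penalized for choosing a more specific function than the task requires.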

32 Documentation. BEL track initial pages at biocreative.org. BEL track extended description at openbel.org: wiki.openbel.org/display/bioc/biocreative+home (setup of the task; sample and training data; evaluation details)

33 Information to participants. Broadcast calls to several mailing lists. Announcements on the BioCreative mailing list. Set up a dedicated Google group: 13 registered users; used to deliver targeted information about the challenge setup. Evaluation web SCAI

34 Evaluation metrics Primary: Term (T), Function (F), Relationship (R), Full BEL statement (F). Secondary: Function (Fs), Relationship (Rs) 2nd stage includes gold standard entities
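Scoring at several levels can be pictured as comparing sets of items extracted from the predicted and the gold statement. The sketch below is a generic set-based precision/recall/F-measure computation, shown at the term level and the full-statement level only; the slide does not give the official scoring rules, so the `terms` and `prf` helpers and the term pattern are assumptions for illustration.

```python
import re

# Assumed term shape: function(NAMESPACE:value), e.g. p(HGNC:AKT1)
_TERM = re.compile(r"[A-Za-z]+\([A-Za-z]+:[^(),]+\)")


def terms(statement: str) -> set:
    """Collect the innermost namespace-qualified terms of a statement."""
    return set(_TERM.findall(statement))


def prf(predicted: set, gold: set):
    """Return (precision, recall, F1) for two sets of items."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


gold = "p(HGNC:AKT1) increases p(HGNC:FOS)"
predicted = "p(HGNC:AKT1) increases p(HGNC:JUN)"

print("term level:", prf(terms(predicted), terms(gold)))
print("statement level:", prf({predicted}, {gold}))
```

In this example one of the two predicted terms is correct, so the term level gives partial credit while the full statement counts as a miss, which is the motivation for reporting several levels.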

35 Format conversions. Definition of BEL/BioC format (collaboration with NLM/NCBI). Parsing of BEL statements via Ruby; conversion of BEL into BioC (and RDF). Visualization of BEL structure via graph.
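The slide mentions that the track's parsing was done in Ruby; that code is not shown. As a rough illustration of what parsing a non-nested BEL statement involves, here is a minimal Python sketch. The `Statement` type and the `parse_statement` and `parse_term` functions are hypothetical, cover only the four relationships selected for the task, and do not handle quoted values that contain commas or parentheses.

```python
from typing import NamedTuple

# The four relationships selected for the task (longer names first is not
# required here because each is matched with surrounding spaces).
RELATIONS = ("directlyIncreases", "directlyDecreases", "increases", "decreases")


class Statement(NamedTuple):
    subject: str
    relation: str
    object: str


def parse_statement(text: str) -> Statement:
    """Split 'subject relation object' at the first known relationship."""
    for relation in RELATIONS:
        left, separator, right = text.partition(f" {relation} ")
        if separator:
            return Statement(left.strip(), relation, right.strip())
    raise ValueError(f"no supported relationship in: {text!r}")


def parse_term(term: str):
    """Return (function, [arguments]) for 'f(...)', or the bare string."""
    term = term.strip()
    if "(" not in term or not term.endswith(")"):
        return term
    name, _, inner = term.partition("(")
    inner = inner[:-1]  # drop the closing parenthesis of this call
    arguments, depth, start = [], 0, 0
    for index, char in enumerate(inner):
        if char == "(":
            depth += 1
        elif char == ")":
            depth -= 1
        elif char == "," and depth == 0:  # split only on top-level commas
            arguments.append(inner[start:index])
            start = index + 1
    arguments.append(inner[start:])
    return (name, [parse_term(argument) for argument in arguments])


statement = parse_statement("act(p(HGNC:AKT1)) directlyIncreases p(HGNC:FOS, pmod(P))")
print(statement.relation)
print(parse_term(statement.subject))
print(parse_term(statement.object))
```

The nested tuples returned by `parse_term` form a tree, which is the kind of structure that can then be serialized to another format or drawn as a graph.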

37 Timeline: next steps. Jul 2015: Feedback to participants on evaluation results; Preparation of proceedings; Evaluation kaggle. Aug 2015: Arrangements for workshop; Possible revisions of overview paper and participants' papers. Sep 2015: overview paper; collection of participants' papers; Workshop, September 9-11, Sevilla, Spain. Sep-Dec: Writing of journal paper (DATABASE)

38 BEL task: challenges. One sentence is a very limited context. Disambiguation of named entities is context dependent. One sentence often does not offer sufficient context, in particular for species identification. No negative set. Several levels of analysis: entities, functions, relations. Multiple / large namespaces.

39 Outlook. A considerable investment in terms of time and resources. Yet few participants expected, why? Novelty of the task (it takes time to adapt tools or create new ones); short time available for development, due to late start and early evaluation. Investment made so far will pay off in future editions: participants will become gradually more familiar with the nature of the task; evaluation framework can be reused; documentation can be partially reused (extended and adapted); workshop raises attention to BEL in the text mining community.

40 BEL task kaggle.com? How well can a general-purpose machine learning community solve a biomedical text mining task? Challenges and benefits: Provide the evidence in a support data format which lowers the NLP requirements for participants (NER output, chunking, dependency parsing). Model the task as a multi-label, multi-class prediction problem. Representing structured outcomes in this way seems to be interesting for the ML community (e.g. Meka). Get more ML approaches tested on the BEL task data set. Kaggle has an active and innovative community.
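One way to picture the multi-label framing is to flatten each BEL statement into a set of labels (its relationship, its functions, its terms) and encode those as a binary indicator vector that a generic classifier could be trained to predict from a sentence. The sketch below is only an assumption about how such an encoding might look; the `labels` and `binarize` functions and the label naming scheme are hypothetical and are not part of the task definition.

```python
import re

RELATIONS = ("directlyIncreases", "directlyDecreases", "increases", "decreases")
_FUNCTION = re.compile(r"\b([A-Za-z]+)\(")
_TERM = re.compile(r"[A-Za-z]+\([A-Za-z]+:[^(),]+\)")


def labels(statement: str) -> set:
    """Flatten one BEL statement into a set of prediction labels."""
    found = {f"rel={r}" for r in RELATIONS if f" {r} " in statement}
    found |= {f"func={name}" for name in _FUNCTION.findall(statement)}
    found |= {f"term={term}" for term in _TERM.findall(statement)}
    return found


def binarize(statements):
    """Return (vocabulary, matrix) with one 0/1 row per statement."""
    label_sets = [labels(statement) for statement in statements]
    vocabulary = sorted(set().union(*label_sets))
    matrix = [[int(label in label_set) for label in vocabulary]
              for label_set in label_sets]
    return vocabulary, matrix


vocabulary, matrix = binarize([
    "act(p(HGNC:AKT1)) increases p(HGNC:FOS)",
    "p(HGNC:TP53) decreases p(HGNC:MDM2)",
])
for label, column in zip(vocabulary, zip(*matrix)):
    print(label, column)
```

This flat encoding discards which term is the subject and which is the object, so it lowers the entry barrier at the cost of the statement structure; a realistic setup would need extra labels or a post-processing step to recover it.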

41 Summary. Text mining in biology: essential for coping with the information deluge. Competitive evaluations: provide rigorous evaluation in a controlled environment. BEL challenge: a novel task with great potential. However, more time is needed to allow participants to become familiar with such a complex framework and develop useful systems.

42 Acknowledgments. BEL task: Juliane Fluck, Sumit Madan, Tilia Ellendorff, Simon Clematide, Adrian van der Lek, Sam Ansari, Julia Hoeng, Manuel Peitsch. BioCreative: Martin Krallinger (CNIO), Florian Leitner (CNIO), Alfonso Valencia (CNIO), Lynette Hirschman (MITRE).
