Text mining tools for semantically enriching the scientific literature

1 Text mining tools for semantically enriching the scientific literature Sophia Ananiadou Director National Centre for Text Mining School of Computer Science University of Manchester

2 Need for enriching the literature Need for semantic search, i.e. beyond keywords Need for technologies enabling focused semantic search via the creation of semantic metadata from literature "The current scientific literature, were it to be presented in semantically accessible form, contains huge amounts of undiscovered science." Peter Murray-Rust, Data-driven science: A Scientist's view. NSF/JISC Repositories Workshop, 2007

3 Impact of text mining Extraction of named entities (genes, proteins, metabolites, etc) Discovery of concepts allows semantic annotation of documents Improves information access by going beyond index terms, enabling semantic querying Improves clustering, classification of documents Visualisation based on semantic metadata derived from text mining results

4 Beyond named entities: facts Extraction of relationships, events (facts) for knowledge discovery Information extraction, more sophisticated annotation of texts (fact annotation) Enables even more advanced semantic querying

5 Enriched annotation Text mining provides enriched annotation layers. The user can run an easily expressed semantic query that returns facts matching the query, rather than just sets of documents to read through. Information Extraction and not just Information Retrieval. Fact extraction and not just sentence extraction.

6 Annotations derived from Text Mining Figure: raw (unstructured) text passes through text processing (part-of-speech tagging, named entity recognition, deep syntactic parsing), supported by a lexicon and an ontology, to give annotated (structured) text with multi-layered annotations. Example sentence: Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells. Part-of-speech layer: Secretion/NN of/IN TNF/NN was/VBZ abolished/VBN by/IN BHA/NN in/IN PMA-stimulated/JJ U937/NN cells/NNS ./. Syntactic layer: S, NP, VP and PP constituents. Semantic layer: protein_molecule, organic_compound, cell_line, negative regulation.
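The multi-layered annotation idea can be sketched as standoff annotation: each layer stores labelled character spans over the unchanged sentence. This is only an illustrative sketch, not NaCTeM's data model. The `Span` class and the helper names are hypothetical, and the assignment of each semantic label to a particular phrase is an assumption based on the order in which the labels appear on the slide.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Span:
    layer: str   # annotation layer, e.g. "entity" or "event"
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str   # label attached to the span


TEXT = "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells."


def find_span(text, phrase, layer, label):
    """Build a Span for the first occurrence of phrase in text."""
    start = text.index(phrase)
    return Span(layer, start, start + len(phrase), label)


# Assumed label-to-phrase mapping (not stated explicitly in the slide text).
annotations = [
    find_span(TEXT, "TNF", "entity", "protein_molecule"),
    find_span(TEXT, "BHA", "entity", "organic_compound"),
    find_span(TEXT, "U937 cells", "entity", "cell_line"),
    find_span(TEXT, "abolished", "event", "negative regulation"),
]


def layer_view(spans, layer):
    """Return (covered text, label) pairs for one annotation layer."""
    return [(TEXT[s.start:s.end], s.label) for s in spans if s.layer == layer]


print(layer_view(annotations, "entity"))
print(layer_view(annotations, "event"))
```

Because the spans only point into the text, further layers (part-of-speech tags, parse constituents) could be added without touching the sentence or the existing layers.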

7 Mining associations from MEDLINE FACTA: Finding Associated Concepts with Text Analysis What diseases are related to a particular chemical? What proteins are related to a particular disease? etc. Related systems: EBIMed, PubMatrix. FACTA: quick and interactive
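One simple way to answer "what concepts are related to X?" is to count how often other concepts co-occur with X in the same abstract. The sketch below shows that idea only; the slides do not describe FACTA's actual ranking, so the scoring by raw co-occurrence count is an assumption, and the concept names and documents are invented placeholders.

```python
from collections import Counter

# Hypothetical toy "abstracts", each reduced to the set of concepts it mentions.
docs = [
    {"chemical_A", "disease_X", "protein_P"},
    {"chemical_A", "disease_Y"},
    {"chemical_A", "disease_X"},
    {"chemical_B", "disease_Y"},
]


def associated_concepts(docs, query):
    """Rank concepts by the number of documents they share with the query."""
    counts = Counter()
    for concepts in docs:
        if query in concepts:
            counts.update(concepts - {query})
    return counts.most_common()


ranking = associated_concepts(docs, "chemical_A")
print(ranking[0])
```

A real system would normally replace the raw count with an association statistic and restrict results by concept type (disease, protein, and so on).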

8 Query

9 Click!

11 Innovative Technologies applied to: Term recognition Named entity recognition Fact extraction semantic mark-up improves search classifying, linking documents Semantic Mark-up knowledge discovery, hidden links, associations, hypothesis generation

12 Natural Language Processing technologies Part-of-speech tagging: GENIA Tuned to biomedical text: 97-99% precision Dictionary-based named-entity recognition Deep parsing Predicate argument relations (90%) Protein-protein interaction extraction Event / fact extraction

13 Automatic Term Recognition

16 Recognising and Disambiguating Acronyms in Biomedical Literature
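As a rough illustration of acronym recognition, the sketch below looks for a parenthesised short form and checks whether its letters match the initials of the words just before it. This is a deliberately simple first-letter heuristic chosen for illustration, not the method used by AcroMine; the function name and example sentences are hypothetical.

```python
import re


def find_acronyms(text):
    """Return (short form, long form) pairs found with a first-letter heuristic."""
    pairs = []
    for match in re.finditer(r"\(([A-Z][A-Za-z0-9]{1,9})\)", text):
        short = match.group(1)
        words = text[:match.start()].split()
        candidate = words[-len(short):]
        # Accept only if each acronym letter matches a word initial, in order.
        if len(candidate) == len(short) and all(
            word[0].lower() == letter.lower()
            for word, letter in zip(candidate, short)
        ):
            pairs.append((short, " ".join(candidate)))
    return pairs


print(find_acronyms("The tumor necrosis factor (TNF) binds its receptor."))
print(find_acronyms("Proteins fold in the endoplasmic reticulum (ER)."))
print(find_acronyms("Binding to the estrogen receptor (ER) was measured."))
```

The last two calls show why disambiguation is needed: the same short form, ER, is paired with two different long forms depending on context.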

17 Named-entity recognition Example: The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes (labels shown on the slide: DNA, virus, cell_type) Entity types (defined by ontologies): genes/protein names; enzymes, substances, metabolites, etc. Resources: Gene Ontology (GO), KEGG, ChEBI, etc.
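Dictionary-based named-entity recognition, mentioned earlier in the deck, can be sketched as a longest-match lookup of token sequences against a term dictionary. The dictionary below is a hypothetical three-entry toy, and pairing each label with a specific phrase is an assumption; real systems use large lexicons and handle spelling variants, which this sketch does not.

```python
def tag_entities(text, dictionary):
    """Tag the longest dictionary terms found in text, scanning left to right."""
    tokens = text.rstrip(".").split()
    lookup = {
        tuple(term.lower().split()): label for term, label in dictionary.items()
    }
    max_len = max(len(key) for key in lookup)
    found = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so longer terms win over their parts.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            key = tuple(token.lower() for token in tokens[i:i + n])
            if key in lookup:
                found.append((" ".join(tokens[i:i + n]), lookup[key]))
                i += n
                break
        else:
            i += 1
    return found


# Hypothetical toy dictionary; the label assignments are assumed.
toy_dictionary = {
    "peri-kappa B site": "DNA",
    "human immunodeficiency virus type 2": "virus",
    "monocytes": "cell_type",
}

sentence = (
    "The peri-kappa B site mediates human immunodeficiency virus type 2 "
    "enhancer activation in monocytes."
)
print(tag_entities(sentence, toy_dictionary))
```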

19 Leveraging resources Annotated texts (GENIA corpus, GENIA event corpus) Resources for bio-text mining resource-building NLP tools for text-based knowledge harvesting (NaCTeM) BioLexicon Over 1.5M lexical entries for bio-text mining and growing. Containing rich linguistic information for bio-text mining

20 BioLexicon population process (diagram) Existing repositories supply chemical, disease, enzyme and species names, and gene/protein names. Named entity recognition over Medline abstracts supplies new gene/protein names. Term mapping by normalization and subclustering of term variants. Manual curation. Verb subcategorization (on-going): terminological verbs, verb subcategorization frames.

21 Semantic search based on facts MEDIE: an interactive advanced IR system retrieving facts Performs a semantic search Core technology annotates texts with syntactic structures and facts: GENIA tagger, Enju (deep parser), dictionary-based named entity recognition (J. Tsujii)

22 MEDIE system overview Off-line: input textbase, deep parser, entity recognizer, semantically annotated textbase On-line: query, region algebra search engine, search results
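The on-line step can be pictured as a region algebra query over the annotations produced off-line: every annotation is a region (start, end, tag), and a query combines regions with operators such as containment. The sketch below implements a single containment operator over hand-made regions. It is an assumption about the general technique, not a description of MEDIE's engine, and all names and offsets are hypothetical.

```python
from typing import NamedTuple


class Region(NamedTuple):
    start: int  # character offset, inclusive
    end: int    # character offset, exclusive
    tag: str    # annotation type, e.g. "sentence" or "protein"


def by_tag(regions, tag):
    """Select the regions carrying a given tag."""
    return [r for r in regions if r.tag == tag]


def containing(outer, inner):
    """Return the outer regions that contain at least one inner region."""
    return [
        o for o in outer
        if any(o.start <= i.start and i.end <= o.end for i in inner)
    ]


# Hypothetical annotated textbase: two sentences, one protein, one disease.
regions = [
    Region(0, 40, "sentence"),
    Region(13, 16, "protein"),
    Region(41, 80, "sentence"),
    Region(50, 58, "disease"),
]

hits = containing(by_tag(regions, "sentence"), by_tag(regions, "protein"))
print(hits)
```

Since the expensive parsing and entity recognition happen off-line, a query like this only has to compare offsets, which is one reason such a design can answer interactively.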

23 Sentence Retrieval System Using Semantic Representation MEDIE

25 InfoPubMed An interactive Information Extraction system and an efficient PubMed search tool, helping users to find information about biomedical entities such as genes, proteins, and the interactions between them. System components Deep parsing technology Extraction of protein-protein interactions Multi-window interface on a browser

26 InfoPubMed Interactions and not just co-occurrences. Calculated using ML and deep semantics.

27 Semantic Information Retrieval KLEIO: a semantically enriched information retrieval system for biology Offers textual and metadata searches across MEDLINE Leverages terminology technologies Named entity recognition: gene, protein, metabolite, organ, disease, symptom

28 KLEIO architecture

30 Fewer documents with more precise query

31 Linking and enriching pathways with text REFINE (BBSRC) MCISB and NaCTeM (Kell, Ananiadou, Tsujii) to integrate text mining techniques with visualisation technologies for better understanding of the evidence for biochemical and signalling pathways to enrich pathway models encoded in the Systems Biology Markup Language (SBML) with evidence derived from text mining

32 Two steps for linking text with pathways Event extraction: from literature to biological events (e.g. "IkappaB is phosphorylated", "Ikappa B ubiquitination", "degradation of IkB") Pathway construction: from biological events to pathways (IkB, IkB-P, IkB-U) (Tsujii-lab, Tokyo)

33 Event Annotation - Example

34 Statistics & References Statistics 36,114 events have been identified from and annotated to 1,000 Medline abstracts, which contain 9,372 sentences Kim, Jin-Dong, Tomoko Ohta and Jun'ichi Tsujii (2008) Corpus annotation for mining biomedical events from literature. BMC Bioinformatics

35 Acknowledgements Junichi Tsujii and his lab (University of Tokyo) MEDIE, InfoPubMed, event annotation Yoshimasa Tsuruoka (NER, FACTA, KLEIO, REFINE) Naoaki Okazaki (TerMine, AcroMine) Yutaka Sasaki (BioLexicon, NER, KLEIO) John McNaught (BioLexicon, BOOTStrep project) Chikashi Nobata (KLEIO) Douglas Kell (REFINE)

What is Text Mining? Sophia Ananiadou, National Centre for Text Mining (www.nactem.ac.uk), University of Manchester Outline: Aims of text mining; Text Mining steps; Text Mining uses; Applications Aims: Extract and discover knowledge hidden in text