Text mining tools for semantically enriching the scientific literature


Sophia Ananiadou, Director, National Centre for Text Mining, School of Computer Science, University of Manchester

Need for enriching the literature
- Need for semantic search, i.e. beyond keywords
- Need for technologies enabling focused semantic search via the creation of semantic metadata from literature
- "The current scientific literature, were it to be presented in semantically accessible form, contains huge amounts of undiscovered science" (Peter Murray-Rust, Data-driven science: A Scientist's view. NSF/JISC Repositories Workshop, 2007)

Impact of text mining
- Extraction of named entities (genes, proteins, metabolites, etc.)
- Discovery of concepts allows semantic annotation of documents
- Improves information access by going beyond index terms, enabling semantic querying
- Improves clustering and classification of documents
- Visualisation based on semantic metadata derived from text mining results

Beyond named entities: facts
- Extraction of relationships and events (facts) for knowledge discovery
- Information extraction: more sophisticated annotation of texts (fact annotation)
- Enables even more advanced semantic querying

Enriched annotation
- Text mining provides enriched annotation layers
- The user will be able to carry out an easily expressed semantic query, which will deliver facts matching that query rather than just sets of documents that still have to be read
- Information Extraction and not just Information Retrieval
- Fact extraction and not just sentence extraction

Annotations derived from Text Mining
[Figure: raw (unstructured) text goes through text processing (part-of-speech tagging, named entity recognition, deep syntactic parsing), supported by a lexicon and an ontology, and comes out as annotated (structured) text.]
Multi-layered annotations for the example sentence "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells.":
- Part-of-speech tags: NN IN NN VBZ VBN IN NN IN JJ NN NNS .
- Syntactic constituents: S, NP, VP, PP
- Semantic labels: protein_molecule, organic_compound, cell_line, negative regulation
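The multi-layered annotation idea can be illustrated with a small stand-off representation, in which each layer points back into the same raw sentence. This is a minimal sketch only: the sentence, the tags and the labels come from the slide, but the `Span` class, the `span_for` helper and the pairing of each semantic label with a particular phrase are assumptions made for illustration.

```python
from dataclasses import dataclass

SENTENCE = "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells."

# Part-of-speech layer: tags copied from the slide, one per token.
TOKENS = ["Secretion", "of", "TNF", "was", "abolished", "by", "BHA",
          "in", "PMA-stimulated", "U937", "cells", "."]
POS_TAGS = ["NN", "IN", "NN", "VBZ", "VBN", "IN", "NN",
            "IN", "JJ", "NN", "NNS", "."]


@dataclass(frozen=True)
class Span:
    """A stand-off annotation: character offsets into SENTENCE plus a label."""
    start: int
    end: int
    label: str

    def text(self) -> str:
        return SENTENCE[self.start:self.end]


def span_for(phrase: str, label: str) -> Span:
    """Hypothetical helper: wrap the first occurrence of phrase as a Span."""
    start = SENTENCE.index(phrase)
    return Span(start, start + len(phrase), label)


# Semantic layer. Assumption: the slide lists the labels but the pairing
# with phrases below is inferred, not stated.
SEMANTIC_LAYER = [
    span_for("TNF", "protein_molecule"),
    span_for("BHA", "organic_compound"),
    span_for("U937 cells", "cell_line"),
    span_for("abolished", "negative regulation"),
]

# Each layer can be read independently of the others.
pos_layer = list(zip(TOKENS, POS_TAGS))
for span in SEMANTIC_LAYER:
    print(f"{span.start:>2}-{span.end:<2} {span.label:<20} {span.text()}")
```

Keeping the layers separate from the text means a new layer (for example, parse constituents) can be added later without touching the existing ones.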

Mining associations from MEDLINE
FACTA: Finding Associated Concepts with Text Analysis
- What diseases are related to a particular chemical?
- What proteins are related to a particular disease? etc.
Tools:
- EBIMed http://www.ebi.ac.uk/rebholz-srv/ebimed/index.jsp
- PubMatrix http://pubmatrix.grc.nia.nih.gov/
- FACTA http://text0.mib.man.ac.uk/software/facta/ (quick and interactive)
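Questions such as "what diseases are related to a particular chemical?" are commonly approached by counting how often concepts co-occur in the same abstract. The sketch below ranks co-occurring concepts by count and also reports pointwise mutual information. It is not a description of how FACTA is implemented: the toy documents, concept names and function names are all hypothetical.

```python
import math
from collections import Counter
from itertools import combinations

# Hypothetical toy data: each abstract reduced to the set of concepts found in it.
DOCS = [
    {"nicotine", "lung cancer"},
    {"nicotine", "lung cancer", "addiction"},
    {"nicotine", "addiction"},
    {"aspirin", "headache"},
]


def cooccurrence_counts(docs):
    """Count documents per concept and per unordered concept pair."""
    single = Counter()
    pair = Counter()
    for concepts in docs:
        single.update(concepts)
        pair.update(frozenset(p) for p in combinations(sorted(concepts), 2))
    return single, pair


def associated(query, docs):
    """Return (concept, co-occurrence count, PMI) for concepts seen with query."""
    single, pair = cooccurrence_counts(docs)
    n = len(docs)
    results = []
    for key, joint in pair.items():
        if query not in key:
            continue
        (other,) = key - {query}
        # PMI = log2( P(query, other) / (P(query) * P(other)) )
        pmi = math.log2((joint / n) / ((single[query] / n) * (single[other] / n)))
        results.append((other, joint, pmi))
    # Most frequent first; ties broken alphabetically for a stable order.
    return sorted(results, key=lambda r: (-r[1], r[0]))


for concept, count, pmi in associated("nicotine", DOCS):
    print(f"{concept:<12} count={count} pmi={pmi:.3f}")
```

Raw counts favour frequent concepts, while PMI favours pairs that co-occur more often than chance would predict, so an interactive tool may want to expose both.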


Innovative technologies applied to:
- Term recognition
- Named entity recognition
- Fact extraction
Semantic mark-up:
- improves search
- classifying, linking documents
- knowledge discovery, hidden links, associations, hypothesis generation

Natural Language Processing technologies
- Part-of-speech tagging: GENIA, tuned to biomedical text (97-99% precision)
- Dictionary-based named-entity recognition
- Deep parsing: predicate argument relations (90%)
- Protein-protein interaction extraction
- Event / fact extraction

Automatic Term Recognition http://www.nactem.ac.uk/software/termine/

Recognising and Disambiguating Acronyms in Biomedical Literature http://www.nactem.ac.uk/software/acromine

Named-entity recognition
Example: "The peri-kappa B site mediates human immunodeficiency virus type 2 enhancer activation in monocytes" (entity labels shown on the slide: DNA, virus, cell_type)
Entity types (defined by ontologies):
- Gene/protein names
- Enzymes, substances, metabolites, etc.
- GO, KEGG, ChEBI, etc.
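As an illustration of the dictionary-based flavour of named-entity recognition, the following sketch tags the example sentence by greedy longest-match lookup. It is a simplified stand-in, not any particular NaCTeM tool: the three-entry dictionary is hypothetical, and the pairing of each label (DNA, virus, cell_type) with a phrase is an assumption inferred from the slide.

```python
# Hypothetical mini-dictionary keyed by lower-cased token tuples.
# Assumption: which phrase carries which label is inferred, not stated.
DICTIONARY = {
    ("peri-kappa", "b", "site"): "DNA",
    ("human", "immunodeficiency", "virus", "type", "2"): "virus",
    ("monocytes",): "cell_type",
}
MAX_LEN = max(len(key) for key in DICTIONARY)


def tag_entities(sentence):
    """Greedy longest-match lookup of token n-grams in DICTIONARY."""
    tokens = sentence.split()
    lowered = [t.lower() for t in tokens]
    entities = []
    i = 0
    while i < len(tokens):
        # Try the longest candidate first so longer names win over shorter ones.
        for n in range(min(MAX_LEN, len(tokens) - i), 0, -1):
            key = tuple(lowered[i:i + n])
            if key in DICTIONARY:
                entities.append((" ".join(tokens[i:i + n]), DICTIONARY[key]))
                i += n
                break
        else:
            i += 1  # no entry starts at this token
    return entities


SENTENCE = ("The peri-kappa B site mediates human immunodeficiency virus "
            "type 2 enhancer activation in monocytes")
for mention, label in tag_entities(SENTENCE):
    print(f"{label:<10} {mention}")
```

Whitespace tokenisation and exact matching are deliberate simplifications; spelling variation in real biomedical text is one reason dictionary lookup is usually combined with term normalisation.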

Leveraging resources
- Annotated texts (GENIA corpus, GENIA event corpus)
- Resources for bio-text mining
- Resource-building NLP tools for text-based knowledge harvesting (NaCTeM)
- BioLexicon: over 1.5M lexical entries for bio-text mining and growing, containing rich linguistic information for bio-text mining

Population process
[Figure: BioLexicon population process. Labels recoverable from the diagram: existing repositories (chemical, disease, enzyme, species names; gene/protein names); Medline abstracts (new gene/protein names, via named entity recognition); subclustering of term variants; term mapping by normalization; manual curation; BioLexicon; verb subcategorization (terminological verbs, verb subcategorization frames), on-going.]
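The "term mapping by normalization" and "subclustering of term variants" steps can be pictured with a very small sketch: spelling variants that collapse to the same normalised key end up in the same cluster. The normalisation rules and the example terms below are assumptions chosen for illustration, not the rules used to build the BioLexicon.

```python
import re
from collections import defaultdict


def normalize(term: str) -> str:
    """Hypothetical normalisation: lower-case, then drop hyphens and whitespace."""
    return re.sub(r"[-\s]+", "", term.lower())


def cluster_variants(terms):
    """Group surface forms that share the same normalised key."""
    clusters = defaultdict(list)
    for term in terms:
        clusters[normalize(term)].append(term)
    return dict(clusters)


# Illustrative surface forms only.
VARIANTS = ["NF-kappa B", "NF kappaB", "nf-kappab", "IL-2", "IL2", "interleukin 2"]

for key, members in cluster_variants(VARIANTS).items():
    print(f"{key:<14} {members}")
```

Note that "interleukin 2" stays in its own cluster: orthographic normalisation does not capture synonymy, which is one reason a step such as manual curation remains useful.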

Semantic search based on facts
- MEDIE: an interactive advanced IR system retrieving facts
- Performs a semantic search
- Core technology annotates texts with syntactic structures and facts: GENIA tagger, Enju (deep parser), dictionary-based named entity recognition
- J. Tsujii

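MEDIE system overview
- Off-line: input textbase, processed by the deep parser and the entity recognizer, yields a semantically annotated textbase
- On-line: a query is run by the region algebra search engine over the annotated textbase and returns search results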
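The off-line/on-line split can be sketched as follows: annotation happens once, ahead of time, and queries then run against the stored facts rather than the raw text. This is a toy illustration only. The hand-written subject-verb-object triples stand in for parser output, and the simple field matching stands in for the region algebra search engine, whose actual query model is not reproduced here.

```python
from typing import List, NamedTuple, Optional


class Fact(NamedTuple):
    subject: str
    verb: str
    obj: str
    sentence: str


# Off-line stage (stand-in): triples written by hand for two sentences that
# appear elsewhere in this talk. A real system would derive them with a deep
# parser and an entity recognizer.
TEXTBASE: List[Fact] = [
    Fact("BHA", "abolish", "secretion of TNF",
         "Secretion of TNF was abolished by BHA in PMA-stimulated U937 cells."),
    Fact("peri-kappa B site", "mediate",
         "human immunodeficiency virus type 2 enhancer activation",
         "The peri-kappa B site mediates human immunodeficiency virus type 2 "
         "enhancer activation in monocytes"),
]


def search(subject: Optional[str] = None,
           verb: Optional[str] = None,
           obj: Optional[str] = None) -> List[Fact]:
    """On-line stage: return facts matching every field that was specified."""
    def matches(fact: Fact) -> bool:
        return ((subject is None or fact.subject == subject)
                and (verb is None or fact.verb == verb)
                and (obj is None or fact.obj == obj))
    return [fact for fact in TEXTBASE if matches(fact)]


# "What abolishes something?" finds the agent even though the source
# sentence is in the passive voice.
for fact in search(verb="abolish"):
    print(fact.subject, "->", fact.obj)
```

Because the passive sentence is stored with BHA as its logical subject, the query matches on predicate-argument structure rather than on word order, which is the point of annotating with a deep parser.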

Sentence Retrieval System Using Semantic Representation MEDIE

InfoPubMed
An interactive Information Extraction system and an efficient PubMed search tool, helping users to find information about biomedical entities such as genes, proteins, and the interactions between them.
System components:
- Deep parsing technology
- Extraction of protein-protein interactions
- Multi-window interface on a browser

InfoPubMed Interactions and not just co-occurrences. Calculated using ML and deep semantics.

Semantic Information Retrieval
http://nactem4.mc.man.ac.uk:8080/kleio/
KLEIO: a semantically enriched information retrieval system for biology
- Offers textual and metadata searches across MEDLINE
- Leverages terminology technologies
- Named entity recognition: gene, protein, metabolite, organ, disease, symptom

KLEIO architecture

Fewer documents with more precise query

Linking and enriching pathways with text
REFINE (BBSRC): MCISB and NaCTeM (Kell, Ananiadou, Tsujii)
- to integrate text mining techniques with visualisation technologies for better understanding of the evidence for biochemical and signalling pathways
- to enrich pathway models encoded in the Systems Biology Markup Language (SBML) with evidence derived from text mining

2 steps for linking text with pathways
[Figure: Literature -> Event Extraction -> Biological events -> Pathway Construction -> Pathways (Tsujii-lab, Tokyo). The pathway shows IkB, phosphorylated IkB (P) and ubiquitinated IkB (U). Example text fragments: "IkappaB is phosphorylated", "Ikappa B ubiquitination", "degradation of IkB".]
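The two steps (event extraction, then pathway construction) can be sketched on the three text fragments from the slide. This is a deliberately naive illustration: the regular expression for IkB spelling variants, the keyword stems and the pathway step list are all assumptions, and real event extraction relies on parsing and trained models rather than substring matching.

```python
import re
from typing import Dict, List, Optional, Tuple

Event = Tuple[str, str]  # (canonical entity, event type)

# Assumption: spelling variants of the entity seen in the example fragments.
ENTITY_PATTERN = re.compile(r"Ikappa\s?B|IkB", re.IGNORECASE)

# Assumption: keyword stems standing in for a real event classifier.
EVENT_STEMS = {
    "phosphorylat": "phosphorylation",
    "ubiquitin": "ubiquitination",
    "degrad": "degradation",
}


def extract_event(fragment: str) -> Optional[Event]:
    """Step 1: map a text fragment to an (entity, event type) pair, if any."""
    if not ENTITY_PATTERN.search(fragment):
        return None
    lowered = fragment.lower()
    for stem, event_type in EVENT_STEMS.items():
        if stem in lowered:
            return ("IkB", event_type)
    return None


def link_to_pathway(fragments: List[str],
                    pathway_steps: List[Event]) -> Dict[Event, List[str]]:
    """Step 2: attach each fragment as evidence to the matching pathway step."""
    evidence: Dict[Event, List[str]] = {step: [] for step in pathway_steps}
    for fragment in fragments:
        event = extract_event(fragment)
        if event in evidence:
            evidence[event].append(fragment)
    return evidence


FRAGMENTS = ["IkappaB is phosphorylated", "Ikappa B ubiquitination",
             "degradation of IkB"]
PATHWAY = [("IkB", "phosphorylation"), ("IkB", "ubiquitination"),
           ("IkB", "degradation")]

for step, sources in link_to_pathway(FRAGMENTS, PATHWAY).items():
    print(step, "<-", sources)
```

The output is a mapping from pathway steps to supporting text, which is the kind of evidence that could be attached to a pathway model as annotation.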

Event Annotation - Example

Statistics & References
- Statistics: 36,114 events have been identified and annotated in 1,000 Medline abstracts, which contain 9,372 sentences
- Kim, Jin-Dong, Tomoko Ohta and Jun'ichi Tsujii (2008) Corpus annotation for mining biomedical events from literature. BMC Bioinformatics
- http://www-tsujii.is.s.u-tokyo.ac.jp/genia

Acknowledgements
- Junichi Tsujii and his lab (University of Tokyo): MEDIE, InfoPubMed, event annotation
- Yoshimasa Tsuruoka (NER, FACTA, KLEIO, REFINE)
- Naoaki Okazaki (TerMine, AcroMine)
- Yutaka Sasaki (BioLexicon, NER, KLEIO)
- John McNaught (BioLexicon, BOOTStrep project)
- Chikashi Nobata (KLEIO)
- Douglas Kell (REFINE)