Medical Event Extraction using the Swedish FrameNet, a pilot study


Medical Event Extraction using the Swedish FrameNet, a pilot study
DIMITRIOS KOKKINAKIS
Centre for Language Technology, University of Gothenburg, Sweden
dimitrios.kokkinakis@svenska.gu.se

Overview
- From entities to relations & events
- Motivation
- Events
- Resources
- Frame selection
- Methodology
- Results
- Summary & Future Plans

Motivation
- Information extraction (IE) is a technology with a direct correlation to frame-like structures as described in FrameNet (FN).
- Templates in the context of IE are frame-like structures with slots representing event information.
- Most event-based IE approaches are designed to identify role fillers that appear as arguments to event verbs or nouns, either explicitly via e.g. syntactic relations

Motivation
- Event-based IE: identifying all entities that play specific roles within an event described in free text (event types are specified beforehand).
- E.g., given sentences or documents containing descriptions of disease-treatment events, the goal of an IE system could be to extract the event role fillers for such events.
- Thus, the specification of events includes fine-grained details about the circumstances under which a textual discourse is said to contain an event description.
- Frame semantics: a framework that can facilitate the development of text understanding and as such can be used as a backbone for natural language understanding (NLU) systems.

From entities to relations & events
- Past: recognizing entities in text and mapping them to unique identifiers in curated databases has helped increase the specificity of document searches in text mining (TM) systems.
- Textual co-occurrence of entities, however, does not necessarily indicate meaningful relationships!
- For semantic and focused queries to retrieve not just articles but also facts from the literature, it is essential to recognize entities (e.g. protein) as well as relations or events, which are characterized by e.g. verbs (regulate) or nominalized verbs (regulation).
- Entity names are continuous spans in text, whereas relations and events generally appear as discontinuous spans and have internal structure: a relation or an event generally involves more than one entity, and the entities involved play distinct roles.

Events
- The definition and detection of events have roots in philosophy and linguistics (Davidson 1969; Quine 1985).
- There is no full consensus on the treatment of events in NLP, in spite of their importance to several areas: topic detection and tracking, information extraction, question answering (Q&A).
- Note: comprehensive event detection must encompass the detection of events and their subevents, bridging references ("X was murdered yesterday; the knife lay nearby": definite descriptions based on previous discourse, which require some reasoning to identify their textual antecedent), etc.

Events
- Semantic annotation of text with event-level information for mining complex relations and events has gained growing attention in e.g. biomedicine, a valuable source of evidence-based research and text mining (e.g. semantic search; intelligent question answering).
- Event extraction: the task of extracting, from e.g. biomedical literature, structured representations of information that is not contained in structured databases. These representations describe complex combinations of relations among one or more entities and allow associations of arbitrary numbers of participants in specific roles (e.g. Patient) to be captured.

Events
- BioNLP Shared Task 2009: events are n-ary associations of participants (entities or other events), each marked as playing a specific role such as Theme or Cause in the event.
- Each event is assigned a type from a fixed, task-specific set.
- Events can further be marked with modifiers identifying additional features (meta-knowledge) such as polarity, certainty, knowledge type, manner, ... (Thompson P. et al. (2011). Enriching a biomedical event corpus with meta-knowledge. BMC Bioinf, 12:393)
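The event representation described on this slide (typed events, participants in named roles, participants that may themselves be events, and meta-knowledge modifiers) can be sketched as a small data structure. This is only an illustration: the class names, field names, labels and the example event are assumptions made here, not the shared task's file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Union


@dataclass
class Entity:
    text: str
    label: str  # illustrative label, e.g. "Protein"


@dataclass
class Event:
    event_type: str  # drawn from a fixed, task-specific set
    trigger: str     # word or phrase that expresses the event
    # role name -> participant; a participant is an entity or another event
    arguments: Dict[str, Union[Entity, "Event"]] = field(default_factory=dict)
    # meta-knowledge such as polarity or certainty
    modifiers: Dict[str, str] = field(default_factory=dict)


def participants(event: Event) -> List[Entity]:
    """Collect all entities of an event, descending into nested events."""
    found: List[Entity] = []
    for participant in event.arguments.values():
        if isinstance(participant, Event):
            found.extend(participants(participant))
        else:
            found.append(participant)
    return found


# Hypothetical example: an event whose Theme is itself an event.
inner = Event("Expression", "expression", {"Theme": Entity("protein A", "Protein")})
outer = Event(
    "Regulation",
    "regulates",
    {"Theme": inner, "Cause": Entity("protein B", "Protein")},
    modifiers={"certainty": "speculated"},
)
entity_texts = [e.text for e in participants(outer)]
```

Keeping arguments as a role-to-participant mapping is what lets a relation hold an arbitrary number of participants while each keeps its distinct role.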

Resources
- All our resources are available on-line: <http:///eng/swefn>
- Number of frames: 860
- Core elements: 2257
- Non-core elements: 5476
- Lexical units: 24718

Methodology: Frame Selection
Identify domain-specific medical frames (in parentheses: number of manually annotated sentences, extracted from a Swedish biomedical corpus):
- Administration_of_medication, with core frame elements e.g. Drug, Patient and Medic (112)
- Medical_Treatment, with core frame elements e.g. Treatment, Affliction and Patient (102)
- Cure, with core frame elements e.g. Healer, Affliction and Body_Part (115)
- Falling_Ill, with core frame elements e.g. Patient, Symptom and Ailment (116)

Methodology: Samples & Resources
- Sentences were selected using trigger words and manual inspection from the MEDLEX corpus (a variety of text documents covering various medical text genres).
- Domain resources: FASS, the Swedish national formulary; SNOMED CT's hierarchies; drug and disease lexicon extensions, e.g. generic drug expressions and misspellings.
- List of abbreviations: iv, i.v., im, i.m., sc, ...
- List of drug forms: pill, tablet, capsule, cream, gel, ...
- List of drug administration paths: intranasal, intravenous, ...
- Named entities: recognition of time, frequency and other relevant numerical information, ...

Methodology
Rule-based approach: all steps are applied at the sentence level, i.e. no coherent, larger text fragments are used.
1. Pre-processing: selecting a relevant sample of sentences for the medical frames using trigger words (constraint: relevant LUs and medical NEs), for manual annotation, pattern development and evaluation.
2. Main processing: includes terminology, named entity and key word/text segment identification.
3. Post-processing: modeling observed frame element patterns as rules (regular expressions over annotations and text fragments).
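The pre-processing step can be sketched as a simple filter that keeps a sentence only if it contains both a trigger (a lexical unit of the frame) and a medical term. This is a minimal sketch, not the study's implementation: the function names, the tiny trigger and term sets, and the second and third sample sentences are made up for illustration.

```python
import re
from typing import Iterable, List, Set


def contains_phrase(sentence: str, phrases: Iterable[str]) -> bool:
    """True if any phrase occurs in the sentence as a whole word or word sequence."""
    return any(
        re.search(r"\b" + re.escape(phrase) + r"\b", sentence, flags=re.IGNORECASE)
        for phrase in phrases
    )


def select_sentences(
    sentences: Iterable[str], triggers: Set[str], medical_terms: Set[str]
) -> List[str]:
    """Keep sentences that contain at least one trigger and one medical term."""
    return [
        s
        for s in sentences
        if contains_phrase(s, triggers) and contains_phrase(s, medical_terms)
    ]


# Hypothetical mini-lexicons; real ones would come from the frame's lexical
# units and from resources such as FASS and SNOMED CT.
triggers = {"läkte ut"}
medical_terms = {"lungtuberkulos", "streptomycin"}

sample = [
    "21-årig man vars lungtuberkulos läkte ut med streptomycin.",
    "Patienten behandlades med streptomycin.",  # term but no trigger
    "Såret läkte ut på två veckor.",            # trigger but no term
]
selected = select_sentences(sample, triggers, medical_terms)
```

In the study the selected sentences were additionally inspected manually; the filter only narrows down the candidates.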

Event Recognition Workflow
- Trigger: 21-årig man vars lungtuberkulos <TRIGGER>läkte ut</TRIGGER> med streptomycin. (lit. '21-year-old man whose pulmonary tuberculosis healed out by streptomycin.')
- TERM: 21-årig man vars <SNOMED>lungtuberkulos</SNOMED> läkte ut med <FASS>streptomycin</FASS>.
- NER: <PERSON>21-årig man</PERSON> vars lungtuberkulos läkte ut med streptomycin.
- Normalization, merging and modeling of observed patterns in rules:
  <PERSON> <TERM> [trigger-word] <FASS>
  <PATIENT> <AFFLICTION> <CURE> <MEDICATION>
- Manual analysis of the annotated examples gave an approximation of how the examined medical events are expressed.
- Created rules (regular expressions) for the task, and also some annotated data for future supervised ML.
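A rule of this kind can be sketched as a regular expression over the merged annotation layers, with named groups standing in for the frame elements. The merged example string, the rule name and the group names below are assumptions for illustration; the actual rules of the study are not shown on the slides.

```python
import re
from typing import Dict, Optional

# Hypothetical rule: PERSON, then a SNOMED term, then the trigger, then a FASS drug.
CURE_RULE = re.compile(
    r"<PERSON>(?P<Patient>.+?)</PERSON>.*?"
    r"<SNOMED>(?P<Affliction>.+?)</SNOMED>\s*"
    r"<TRIGGER>(?P<Cure>.+?)</TRIGGER>.*?"
    r"<FASS>(?P<Medication>.+?)</FASS>"
)


def extract_cure_event(annotated: str) -> Optional[Dict[str, str]]:
    """Return the frame elements matched by the rule, or None if it does not apply."""
    match = CURE_RULE.search(annotated)
    return match.groupdict() if match else None


# The example sentence with the trigger, term and NER layers merged into one string.
merged = (
    "<PERSON>21-årig man</PERSON> vars <SNOMED>lungtuberkulos</SNOMED> "
    "<TRIGGER>läkte ut</TRIGGER> med <FASS>streptomycin</FASS>."
)
event = extract_cure_event(merged)
# event == {"Patient": "21-årig man", "Affliction": "lungtuberkulos",
#           "Cure": "läkte ut", "Medication": "streptomycin"}
```

Matching over annotation tags rather than raw words is what lets a single rule cover any person, disorder and drug that the earlier layers recognize.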

Results
- Evaluation sample: 30 x 4 sentences.
- P (precision): number of elements correctly labeled, out of the total number of elements labeled.
- R (recall): number of elements correctly labeled, out of all possible elements.
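The two measures as defined on this slide can be computed as follows. The counts in the usage lines are invented purely to show the arithmetic; they are not results from the study.

```python
from typing import Tuple


def precision_recall(correct: int, labeled: int, possible: int) -> Tuple[float, float]:
    """Precision = correct / labeled; recall = correct / possible."""
    precision = correct / labeled if labeled else 0.0
    recall = correct / possible if possible else 0.0
    return precision, recall


# Made-up counts: 8 elements labeled correctly, 10 labeled in total, 16 possible.
p, r = precision_recall(correct=8, labeled=10, possible=16)
# p == 0.8, r == 0.5
```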

Problems
- Certain elements are difficult to capture using regular expressions: <Purpose>, <Outcome> and <Circumstance>. They show great variability and are expressed by lengthy language patterns, so syntactic parsing (chunking) needs to be exploited for complex NPs or PPs and clauses.
- E.g., a complex prepositional phrase with four prepositions: <Circumstance>Vid klart skyldig blindtarmsinflammation av varierande grad upp till kraftigare inflammation med tecken på vävnadsdöd i blindtarmen</Circumstance> administreras antibiotika Tienam 0,5 g x 3 (lit. 'In clear-cut case appendicitis of varying degree up to stronger inflammation with signs of necrosis in the cecum antibiotic Tienam 0.5 g x 3 administered').

Problems
- Ellipsis: clauses where an overt trigger word is missing (often a predicate belonging to the frame).
- For instance, the following retrieved example lacks an overt trigger, a verb, in its last clause ("samt med kinidin tabletter"), which we would like to have an annotation for: Av journalblad framgår att han behandlats med digitalis, såväl i injektion som per os, samt med kinidin tabletter. (lit. 'Of the record sheet it is shown that he has been treated with digitalis, both as injection and per os, and with quinidine tablets')

More Issues
- Some frame elements could not be found in the annotated samples; some had very few occurrences and were not formally evaluated, e.g. the element Place in the Falling_Ill frame.
- The manual annotation resulted in the revision of some frames: some domain frames were divided in two, giving more accurate and precise semantics and more specialization.
- E.g., Administration_of_medication was divided into Administration_of_medication_conveyance (where the focus is on the procedures that describe the administration of medicine; e.g. Normalt ska en salva eller kräm strykas på tunt; lit. "Normally, an ointment or cream will be thinly applied") and Administration_of_medication_specification (where the focus is on the specifications concerning the administration of medicines; e.g. Tegretol 20 mg/ml, 30 ml x 1).

Summary
- Simple rule-based, lexical knowledge oriented SRL.
- Findings from a study into how a semantic resource, FrameNet (FN), can be applied for event extraction in the Swedish biomedical domain.
- The Swedish FN (2011-13) is based on the Berkeley FN, which provides the appropriate conceptual structure that describes various events along with their participants and properties.
- Combining the lexical semantic content of FN with domain-specific knowledge provides a modeling mechanism that can be utilized for event extraction and other text mining activities.

Future Plans
- Use the annotated sentences as a gold standard: look for new sentences in corpora using the triggers and match them against the gold standard, e.g. with sequence alignment (identifying regions of similarity), in order to augment the data with new sentences.
- More annotated samples for all medical frames.
- Syntactic analysis of the sentences; is chunking sufficient?
- Automatic pattern learning.
- Rely less on the presence of trigger words: event detection as a classification task.

Future Plans
- Compare which technique is most appropriate for which type of frame: machine learning vs. rule-based/knowledge-based.
- For some frames, e.g. Adm_of_Medication, simple means (regular expressions) are enough for accurate identification of frame elements (numerical info, domain-specific abbreviations and acronyms).
- In other cases, such as the Cure frame, other means, such as parsing, seem more appropriate.
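As an illustration of why regular expressions can suffice for medication specifications, the sketch below parses the two dosage expressions quoted in the slides ("Tegretol 20 mg/ml, 30 ml x 1" and "Tienam 0,5 g x 3"). The pattern, the group names and the reading of "x N" as a number of administrations are assumptions made here, not the rules used in the study.

```python
import re
from typing import Dict, Optional

# Hypothetical pattern: capitalized drug name, strength, optional volume, "x N".
DOSE_RULE = re.compile(
    r"(?P<Drug>[A-ZÅÄÖ][\w-]+)\s+"
    r"(?P<Strength>\d+(?:[.,]\d+)?\s*(?:mg/ml|mg|g|ml))"
    r"(?:,\s*(?P<Volume>\d+(?:[.,]\d+)?\s*ml))?"
    r"\s*x\s*(?P<Times>\d+)"
)


def parse_dose(text: str) -> Optional[Dict[str, Optional[str]]]:
    """Return the matched parts of a dosage expression, or None if nothing matches."""
    match = DOSE_RULE.search(text)
    return match.groupdict() if match else None


tegretol = parse_dose("Tegretol 20 mg/ml, 30 ml x 1")
tienam = parse_dose("administreras antibiotika Tienam 0,5 g x 3")
```

Expressions like these are short and highly regular, which is the property the slide points to; long, variable elements such as Circumstance do not share it.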