Medical Event Extraction using the Swedish FrameNet, a pilot study

DIMITRIOS KOKKINAKIS
Centre for Language Technology, University of Gothenburg, Sweden
dimitrios.kokkinakis@svenska.gu.se

Overview
From entities to relations & events
Motivation
Events
Resources
Frame selection
Methodology
Results
Summary & Future Plans

Motivation
Information extraction (IE) is a technology with a direct correlation to frame-like structures as described in FrameNet (FN).
Templates in the context of IE are frame-like structures with slots representing event information.
Most event-based IE approaches are designed to identify role fillers that appear as arguments to event verbs or nouns, either explicitly via e.g. syntactic relations

Motivation
Event-based IE: identifying all entities that play specific roles within an event described in free text (event types are specified beforehand).
E.g., given sentences or documents containing descriptions of disease-treatment events, the goal of an IE system could be to extract the event role fillers for such events.
Thus, the specification of events includes fine-grained details about the circumstances under which a textual discourse is said to contain an event description.
Frame semantics: a framework that can facilitate the development of text understanding and, as such, can be used as a backbone for NLU systems.

From entities to relations & events
Past: recognizing entities in text and mapping them to unique identifiers in curated databases has been helpful for increasing the specificity of document searches in text mining (TM) systems.
Textual co-occurrence of entities, however, does not necessarily indicate meaningful relationships!
For semantic and focused queries that retrieve not just articles but also facts from the literature, it is essential to recognize entities (e.g. protein) as well as relations or events, characterized by e.g. verbs (regulate) or nominalized verbs (regulation).
But whereas entity names are continuous spans in text, relations and events generally appear as discontinuous spans and have internal structure: a relation or an event generally involves more than one entity, and the entities involved play distinct roles.

Events
The definition and detection of events have roots in philosophy and linguistics (Davidson 1969; Quine 1985).
There is no full consensus on the treatment of events in NLP, in spite of their importance to several areas: topic detection and tracking, information extraction, Q&A.
Note: comprehensive event detection must encompass the detection of events and their subevents, bridging references (e.g. "X was murdered yesterday. The knife lay nearby."; definite descriptions based on previous discourse, which require some reasoning to identify their textual antecedent), etc.

Events
Semantic annotation of text with event-level information for mining complex relations and events has gained growing attention in e.g. biomedicine, a valuable source for evidence-based research and text mining (e.g. semantic search; intelligent question answering).
Event extraction: the task of extracting structured representations (descriptions of complex combinations of relations among one or more entities) from e.g. biomedical literature, not contained in structured databases, that allow associations of arbitrary numbers of participants in specific roles (e.g. Patient) to be captured.

Events
BioNLP Shared Task 2009: events are n-ary associations of participants (entities or other events), each marked as playing a specific role such as Theme or Cause in the event.
Each event is assigned a type from a fixed, task-specific set.
Events can further be marked with modifiers identifying additional features (meta-knowledge) such as polarity, certainty, knowledge type, manner (Thompson, P. et al. (2011). Enriching a biomedical event corpus with meta-knowledge. BMC Bioinformatics, 12:393).

Resources
All our resources are available on-line: <http:///eng/swefn>
Number of frames: 860
Core elements: 2257
Non-Core elements: 5476
Lexical units: 24718

Methodology: Frame Selection
Identify domain-specific medical frames (in parentheses: the number of manually annotated sentences, extracted from a Swedish biomedical corpus):
Administration_of_medication, with core frame elements e.g. Drug, Patient and Medic (112)
Medical_Treatment, with core frame elements e.g. Treatment, Affliction and Patient (102)
Cure, with core frame elements e.g. Healer, Affliction and Body_Part (115)
Falling_Ill, with core frame elements e.g. Patient, Symptom and Ailment (116)

Methodology: Samples & Resources
Sentences were selected using trigger words and manual inspection from the MEDLEX corpus (a variety of text documents from various medical text genres).
Domain resources:
FASS, the Swedish national formulary
SNOMED CT's hierarchies
Drug and disease lexicon extensions, e.g. generic drug expressions, misspellings
List of abbreviations: iv, i.v., im, i.m., sc, ...
List of drug forms: pill, tablet, capsule, cream, gel, ...
List of drug administration paths: intranasal, intravenous, ...
Named entities: for recognition of time, frequency and other relevant numerical information
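Short closed lists like the ones above lend themselves to direct pattern matching. The following is a minimal sketch in Python, not the system's actual implementation: it uses only the list entries shown on this slide (the real lists are longer), and the labels ROUTE_ABBR, DRUG_FORM and ADMIN_PATH as well as the function names are hypothetical.

```python
import re

# Entries copied from the slide; the full lists are longer (illustration only).
LEXICONS = {
    "ROUTE_ABBR": ["iv", "i.v.", "im", "i.m.", "sc"],
    "DRUG_FORM": ["pill", "tablet", "capsule", "cream", "gel"],
    "ADMIN_PATH": ["intranasal", "intravenous"],
}

def compile_lexicon(entries):
    # Try longer entries first and escape the dots in the abbreviations.
    ordered = sorted(entries, key=len, reverse=True)
    alternatives = "|".join(re.escape(e) for e in ordered)
    # (?<!\w) and (?!\w) act as word boundaries that also work after a final dot.
    return re.compile(r"(?<!\w)(?:%s)(?!\w)" % alternatives, re.IGNORECASE)

PATTERNS = {label: compile_lexicon(entries) for label, entries in LEXICONS.items()}

def tag_domain_terms(sentence):
    """Return (label, matched text, start offset) triples sorted by offset."""
    hits = []
    for label, pattern in PATTERNS.items():
        for m in pattern.finditer(sentence):
            hits.append((label, m.group(0), m.start()))
    return sorted(hits, key=lambda h: h[2])
```

The boundary checks keep an abbreviation such as "im" from matching inside a longer word, which matters for lists that mix two-letter abbreviations with ordinary vocabulary.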

Methodology
Rule-based approach; all steps are applied at the sentence level, i.e. no coherent, larger text fragments are used.
1. Pre-processing: selecting a relevant sample of sentences for the medical frames using trigger words (constraint: relevant LUs and medical NEs), for manual annotation, pattern development and evaluation.
2. Main processing: terminology, named entity and key word/text segment identification.
3. Post-processing: modeling the observed frame element patterns as rules (regular expressions over annotations and text fragments).
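The pre-processing constraint in step 1 (a sentence must contain both a relevant LU and a medical NE) can be sketched as a simple filter. This is an illustration under stated assumptions: the dictionaries below are tiny stand-ins for the real trigger lists and for FASS/SNOMED CT lookup, and the function name select_candidates is hypothetical.

```python
# Hypothetical trigger LUs per frame; only "läkte ut" is taken from the slides.
TRIGGERS = {"Cure": ["läkte ut"]}
# Stand-in for terminology/NE lookup (FASS, SNOMED CT); both terms are from the slides.
MEDICAL_TERMS = {"lungtuberkulos", "streptomycin"}

def select_candidates(sentences, frame):
    """Keep sentences that contain a trigger LU for `frame` and a medical term."""
    selected = []
    for sentence in sentences:
        lowered = sentence.lower()
        has_trigger = any(t in lowered for t in TRIGGERS.get(frame, []))
        # Crude tokenization: split on whitespace and strip trailing punctuation.
        tokens = {tok.strip(".,;:") for tok in lowered.split()}
        has_medical_term = bool(tokens & MEDICAL_TERMS)
        if has_trigger and has_medical_term:
            selected.append(sentence)
    return selected
```

Requiring both conditions trades recall for a cleaner sample: a sentence with a trigger but no recognized medical entity is dropped before manual inspection.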

Event Recognition Workflow
21-årig man vars lungtuberkulos <TRIGGER>läkte ut</TRIGGER> med streptomycin.
(lit. '21-year old man whose pulmonary tuberculosis healed out by streptomycin.')
TERM: 21-årig man vars <SNOMED>lungtuberkulos</SNOMED> läkte ut med <FASS>streptomycin</FASS>.
NER: <PERSON>21-årig man</PERSON> vars lungtuberkulos läkte ut med streptomycin.
Normalization, merging and modeling of the observed patterns in rules (annotation sequence, then the corresponding frame elements):
<PERSON> <TERM> [trigger-word] <FASS>
<PATIENT> <AFFLICTION> <CURE> <MEDICATION>
Manual analysis of the annotated examples gave an approximation of how the examined medical events are expressed.
We created rules (regular expressions) for the task and also some annotated data for future supervised ML.
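The workflow on this slide can be sketched end to end in a few lines of Python. This is a toy reconstruction, not the rules used in the study: the lexicons hold one entry each, the PERSON pattern is a hypothetical stand-in for the named entity recognizer, and CURE_RULE is a single illustrative rule built from the pattern shown above.

```python
import re

# One-entry stand-ins for SNOMED CT, FASS and the trigger list.
SNOMED = ["lungtuberkulos"]
FASS = ["streptomycin"]
TRIGGERS = ["läkte ut"]
# Hypothetical NER pattern for age-plus-person mentions such as "21-årig man".
PERSON = re.compile(r"\b\d+-årig (?:man|kvinna)\b")

def annotate(sentence):
    """Wrap persons, terms, drugs and triggers in inline tags (steps merged)."""
    out = PERSON.sub(lambda m: "<PERSON>%s</PERSON>" % m.group(0), sentence)
    for tag, lexicon in (("TERM", SNOMED), ("FASS", FASS), ("TRIGGER", TRIGGERS)):
        for entry in lexicon:
            out = re.sub(
                r"\b%s\b" % re.escape(entry),
                lambda m, t=tag: "<%s>%s</%s>" % (t, m.group(0), t),
                out,
            )
    return out

# A rule is a regular expression over annotations and the text between them.
CURE_RULE = re.compile(
    r"<PERSON>(?P<Patient>.+?)</PERSON>.*?"
    r"<TERM>(?P<Affliction>.+?)</TERM>\s*"
    r"<TRIGGER>(?P<Cure>.+?)</TRIGGER>.*?"
    r"<FASS>(?P<Medication>.+?)</FASS>"
)

def extract_cure(sentence):
    """Return a dict of frame elements, or None when the rule does not match."""
    match = CURE_RULE.search(annotate(sentence))
    return match.groupdict() if match else None
```

Applied to the example sentence, extract_cure returns Patient "21-årig man", Affliction "lungtuberkulos", Cure "läkte ut" and Medication "streptomycin". Writing the rule over tags rather than over words is what lets one pattern cover any person, disease and drug that the lexicons recognize.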

Results
Evaluation sample: 30x4 sentences (30 per frame).
P (precision): the number of elements correctly labeled out of the total number of elements labeled.
R (recall): the number of elements correctly labeled out of all elements that could have been labeled.
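The two measures defined above can be computed as follows. This is a minimal sketch, assuming each labeled element is represented as a (sentence id, element label, text span) triple; that representation and the function name are assumptions made for illustration, not the evaluation script used in the study.

```python
def precision_recall(predicted, gold):
    """Compute (P, R) over sets of (sentence_id, element_label, text_span) triples."""
    correct = len(predicted & gold)  # elements labeled correctly
    precision = correct / len(predicted) if predicted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

With exact-match triples, an element counts as correct only if both its label and its span agree with the manual annotation; a partial-overlap criterion would need a different comparison.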

Problems
Certain elements are difficult to capture using regular expressions: <Purpose>, <Outcome> and <Circumstance>, which show great variability and are expressed by lengthy language patterns. Syntactic parsing (chunking) needs to be exploited to capture complex NPs or PPs and clauses.
E.g., a complex prepositional phrase with four prepositions:
<Circumstance>Vid klart skyldig blindtarmsinflammation av varierande grad upp till kraftigare inflammation med tecken på vävnadsdöd i blindtarmen</Circumstance> administreras antibiotika Tienam 0,5 g x 3
(lit. 'In clear-cut case appendicitis of varying degree up to stronger inflammation with signs of necrosis in the cecum antibiotic Tienam 0.5 g x 3 administered').

Problems
Ellipsis: clauses where an overt trigger word is missing (often a predicate belonging to the frame).
For instance, the following retrieved example lacks an overt trigger, a verb, in its last clause (samt med kinidin tabletter), which we would like to have an annotation for:
Av journalblad framgår att han behandlats med digitalis, såväl i injektion som per os, samt med kinidin tabletter.
(lit. 'Of the record sheet it is shown that he has been treated with digitalis, both as injection and per os, and with quinidine tablets')

More Issues
Some frame elements could not be found in the annotated samples; some had very few occurrences and were not formally evaluated, e.g. the element Place in the Falling_Ill frame.
The manual annotation resulted in the revision of some frames: some domain frames were divided in two, yielding more accurate and precise semantics and more specialization. E.g., Administration_of_medication was divided into Administration_of_medication_conveyance (where the focus is on the procedures that describe the administration of medicine; e.g. Normalt ska en salva eller kräm strykas på tunt; lit. "Normally, an ointment or cream will be thinly applied") and Administration_of_medication_specification (where the focus is on the specifications concerning the administration of medicines; e.g. Tegretol 20 mg/ml, 30 ml x 1).

Summary
A simple rule-based, lexical-knowledge-oriented approach to semantic role labeling (SRL).
Findings from a study into how a semantic resource, FrameNet (FN), can be applied to event extraction in the Swedish biomedical domain.
The Swedish FN (2011-13) is based on the Berkeley FN, which provides the appropriate conceptual structure for describing various events along with their participants and properties.
Combining the lexical semantic content of FN with domain-specific knowledge provides a modeling mechanism that can be utilized for event extraction and other text mining activities.

Future Plans
Use the annotated sentences as a gold standard: look for new sentences in corpora using the triggers and match them against the gold standard, e.g. by sequence alignment (identifying regions of similarity), in order to augment it with new sentences.
More annotated samples for all medical frames.
Syntactic analysis of the sentences; is chunking sufficient?
Automatic pattern learning.
Rely less on the presence of trigger words: event detection as a classification task.
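One way the gold-standard matching idea could look in practice is sketched below. It is only an illustration of the plan, not an implemented component: it uses Python's difflib.SequenceMatcher (a matching-blocks heuristic) as a stand-in for a proper sequence alignment algorithm, compares annotation labels rather than words, and the 0.8 threshold and all function names are arbitrary assumptions.

```python
from difflib import SequenceMatcher

def label_sequence(annotated_tokens):
    """Replace annotated tokens by their label so patterns, not words, are compared."""
    return [label if label else token.lower() for token, label in annotated_tokens]

def similarity(candidate, gold):
    """Similarity ratio in [0, 1] between two label sequences."""
    matcher = SequenceMatcher(
        None, label_sequence(candidate), label_sequence(gold), autojunk=False
    )
    return matcher.ratio()

def best_gold_match(candidate, gold_sentences, threshold=0.8):
    """Return (index, score) of the most similar gold sentence, or None."""
    scored = [(similarity(candidate, g), i) for i, g in enumerate(gold_sentences)]
    if not scored:
        return None
    score, index = max(scored)
    return (index, score) if score >= threshold else None
```

Each sentence is given as a list of (token, label) pairs, with None as the label for unannotated tokens. A new sentence with a different patient, disease and drug but the same annotation pattern as a gold sentence then scores 1.0 and could be proposed as a new annotated sample.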

Future Plans
Compare which technique is most appropriate for which type of frame: machine learning vs. rule-based/knowledge-based.
For some frames, e.g. Administration_of_medication, simple means (regular expressions) are enough for accurate identification of frame elements (numerical information, domain-specific abbreviations and acronyms). In other cases, such as the Cure frame, other means, such as parsing, seem more appropriate.