David McClosky, Mihai Surdeanu, Chris Manning and many, many others 4/22/2011. * Previously known as BaselineNLProcessor


Part I: KBP task overview Part II: Stanford CoreNLP Part III: NFL Information Extraction

KBP is a bake-off (shared task) held yearly.
Task: given an entity, fill in values for various slots.
Entities can be people or organizations.
Slots are prespecified.
Provenance (textual sources) must be provided.

per:city_of_birth: Riverside per:stateorprovince_of_birth: Iowa per:date_of_birth: 2233

per:parents: George and Winona Kirk

per:schools_attended: Starfleet Academy

Inputs:
  Knowledge base: entities, slot names and fillers
  Source collection: Wikipedia, newswire text, broadcast news
  Evaluation data from 2009, 2010: queries (entities), slot names and fillers
Testing: queries
Output: slot name, filler, document ID (provenance)

per:parents: George and Winona Kirk (type: list of people)
per:city_of_birth: Riverside (type: city)
per:stateorprovince_of_birth: Iowa (type: state)
per:date_of_birth: 2233 (type: date)
per:title: Captain (type: title?)

per:parents: George and Winona Kirk
  =? per:parents: George Kirk
  =? per:parents: Winona Kirk

per:city_of_birth: Shi Kahr per:stateorprovince_of_birth: Iowa

New: temporal slot filling for per:spouse, per:title, org:top_employees/members, ...
Simple representation: [T1, T2, T3, T4], meaning T1 <= start <= T2 and T3 <= end <= T4.
Any value can be null to indicate a lack of constraint.
Day resolution (YYYYMMDD).

per:title: Captain [22650101, 22651231, 22700101, 22701231]
Start is sometime during the year 2265; end is sometime during 2270.
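The 4-tuple constraint above can be sketched in Java. This is a hypothetical TemporalSlot class written for illustration (it is not part of CoreNLP or the KBP system); it only encodes the semantics described on this slide: nulls mean "unconstrained", dates are YYYYMMDD integers.

```java
// Hypothetical sketch of the [T1, T2, T3, T4] temporal constraint:
// T1 <= start <= T2 and T3 <= end <= T4; any bound may be null.
public class TemporalSlot {
    private final Integer t1, t2, t3, t4;   // any may be null (no constraint)

    public TemporalSlot(Integer t1, Integer t2, Integer t3, Integer t4) {
        this.t1 = t1; this.t2 = t2; this.t3 = t3; this.t4 = t4;
    }

    /** True if the given start/end dates (YYYYMMDD) satisfy the constraint. */
    public boolean admits(int start, int end) {
        return (t1 == null || t1 <= start) && (t2 == null || start <= t2)
            && (t3 == null || t3 <= end)   && (t4 == null || end <= t4);
    }

    public static void main(String[] args) {
        // per:title "Captain": start during 2265, end during 2270
        TemporalSlot captain =
            new TemporalSlot(22650101, 22651231, 22700101, 22701231);
        System.out.println(captain.admits(22650615, 22700630)); // true
        System.out.println(captain.admits(22640101, 22700630)); // false: start before T1
    }
}
```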

Part I: KBP task overview Part II: Stanford CoreNLP Part III: NFL Information Extraction

We've been working on a whole bunch of stuff:
- Joint NLP models
- Coreference (now in Stanford CoreNLP)
- Supervised relation extraction (NFL)
- Supervised event extraction (BioNLP)
- Distantly supervised relation extraction (KBP)
- Scenario templates and graph models in IE
This section describes our NLP pipeline, common to many of these components.
http://nlp.stanford.edu/software/corenlp.shtml

Approach
How to use:
  Command-line (shell, batch)
  Java interface

Quickly and painlessly get linguistic annotations for a text.
Hides variations across components behind a common API.
Simple Java objects passed around (no XML, UIMA, etc.), but results can easily be written to XML, etc.

Store the input text, as well as the output of each Annotator, as values in an Annotation map.
[Diagram: an AnnotationPipeline passes the Annotation object through a sequence of Annotators.]
There are dependencies between Annotators, so the pipeline ordering is important!

Execution flow (raw text in, annotated text out, via the Annotation object):
1. Tokenization (tokenize)
2. Sentence splitting (ssplit)
3. Part-of-speech tagging (pos)
4. Morphological analysis (lemma)
5. Named entity recognition (ner)
6. Syntactic parsing (parse)
7. Coreference resolution (dcoref)
8. NFL relation extraction (nfl)
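The ordering constraint can be made concrete with a small sketch. This is illustrative code, not part of CoreNLP, and the dependency map below is a simplified guess at which annotators need which predecessors; it just shows why "pos before tokenize" is an invalid pipeline.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Illustrative sketch: validate that a proposed annotator ordering
// respects inter-annotator dependencies (dependency map is assumed).
public class PipelineOrderCheck {
    static final Map<String, List<String>> DEPS = Map.of(
        "tokenize", List.of(),
        "ssplit",   List.of("tokenize"),
        "pos",      List.of("tokenize", "ssplit"),
        "lemma",    List.of("pos"),
        "ner",      List.of("lemma"),
        "parse",    List.of("ssplit"),
        "dcoref",   List.of("ner", "parse"));

    public static boolean isValidOrder(List<String> annotators) {
        Set<String> done = new HashSet<>();
        for (String a : annotators) {
            // every prerequisite must already have run
            if (!done.containsAll(DEPS.getOrDefault(a, List.of()))) return false;
            done.add(a);
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isValidOrder(
            List.of("tokenize", "ssplit", "pos", "lemma", "ner", "parse", "dcoref"))); // true
        System.out.println(isValidOrder(List.of("pos", "tokenize"))); // false
    }
}
```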

Approach
How to use:
  Command-line (shell, batch)
  Java interface

java -cp classes:lib/xom.jar:lib/jgrapht.jar -Xmx6g edu.stanford.nlp.pipeline.StanfordCoreNLP -props src/edu/stanford/nlp/pipeline/StanfordCoreNLP.properties

Example sentence: Stanford is located in California.

Sentence #1 (6 tokens):
[Word=Stanford Current=Stanford Tag=NNP Lemma=Stanford NER=ORGANIZATION]
[Word=is Current=is Tag=VBZ Lemma=be NER=O]
[Word=located Current=located Tag=VBN Lemma=locate NER=O]
[Word=in Current=in Tag=IN Lemma=in NER=O]
[Word=California Current=California Tag=NNP Lemma=California NER=LOCATION]
[Word=. Current=. Tag=. Lemma=. NER=O]

(ROOT
  (S (NP (NNP Stanford))
     (VP (VBZ is)
         (VP (VBN located)
             (PP (IN in) (NP (NNP California)))))
     (. .)))

nsubjpass(located-3, Stanford-1)
auxpass(located-3, is-2)
prep_in(located-3, California-5)

java -cp classes:lib/xom.jar:lib/jgrapht.jar -Xmx6g edu.stanford.nlp.pipeline.StanfordCoreNLP -props src/edu/stanford/nlp/pipeline/StanfordCoreNLP.properties -file input.txt

java -cp classes:lib/xom.jar:lib/jgrapht.jar -Xmx6g edu.stanford.nlp.pipeline.StanfordCoreNLP -props src/edu/stanford/nlp/pipeline/StanfordCoreNLP.properties -file inputDirectory -extension .xml

java -cp classes:lib/xom.jar:lib/jgrapht.jar -Xmx6g edu.stanford.nlp.pipeline.StanfordCoreNLP -props src/edu/stanford/nlp/pipeline/StanfordCoreNLP.properties -file inputDirectory -outputDirectory somewhereElse -outputExtension .annotated -replaceExtension true -noClobber

Annotator pipeline = new StanfordCoreNLP(properties);
Annotation annotation = new Annotation(text);
pipeline.annotate(annotation);

tokenize  - split text into tokens, PTB-style
cleanxml  - remove specific XML tags
truecase  - restore case (e.g. if the text is all lowercase, etc.)
ssplit    - split text into sentences
pos       - add POS tags to tokens
lemma     - add lemmas to tokens

ner       - add named entity tags to tokens
regexner  - add rule-based NER tags from regular expressions
parse     - add parse trees (Stanford Parser); berkeleyparse and charniakparse coming soon, to add parse trees from other parsers as well
dcoref    - add coreference links
nfl       - add NFL entity and relation mentions (Machine Reading distribution only)
time      - add temporal annotations (coming later!)

An Annotation is a hash map with class objects as keys and custom value types.
Each sentence is a CoreMap; each token is a CoreLabel, a CoreMap with additional properties (HasWord, HasTag, etc.).
A word is uniquely identified by <sentence position, token position> (both offsets start at 0). [Note: this annotation will change soon.]

List<CoreMap> sentences = annotation.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
  // traversing the words in the current sentence
  for (CoreLabel token : sentence.get(TokensAnnotation.class)) {
    String word = token.get(TextAnnotation.class);
    String pos = token.get(PartOfSpeechAnnotation.class);
    String ne = token.get(NamedEntityTagAnnotation.class);
  }
  // this is the parse tree of the current sentence
  Tree tree = sentence.get(TreeAnnotation.class);
}
// this is the coreference link graph
List<Pair<IntTuple, IntTuple>> graph = annotation.get(CorefGraphAnnotation.class);

/** Simple annotator that recognizes locations stored in a gazetteer */
public class GazetteerLocationAnnotator implements Annotator {
  // this is the only method that must be implemented by an annotator
  public void annotate(Annotation annotation) {
    // traverse all sentences in this document (assumes the text is already tokenized)
    for (CoreMap sentence : annotation.get(SentencesAnnotation.class)) {
      // loop over all tokens in the sentence
      List<CoreLabel> tokens = sentence.get(TokensAnnotation.class);
      for (int start = 0; start < tokens.size(); start++) {
        // assumes that the gazetteer returns the token index
        // after the match, or -1 otherwise
        int end = Gazetteer.isLocation(tokens, start);
        if (end > start) {
          for (int i = start; i < end; i++) {
            tokens.get(i).set(NamedEntityTagAnnotation.class, "LOCATION");
          }
        }
      }
    }
  }
}

Part I: KBP task overview Part II: Stanford CoreNLP Part III: NFL Information Extraction

Add nfl to the annotators property; construct and call the pipeline the same way.
Interpreting the output (entity and relation mentions, each with a span and a type):

List<CoreMap> sentences = annotation.get(SentencesAnnotation.class);
for (CoreMap sentence : sentences) {
  List<EntityMention> entities =
    sentence.get(MachineReadingAnnotations.EntityMentionsAnnotation.class);
  List<RelationMention> relations =
    sentence.get(MachineReadingAnnotations.RelationMentionsAnnotation.class);
}

ExtractionObject:
  CoreMap getSentence()
  Span getExtent()
  String getType()
  Counter<String> getTypeProbabilities()
EntityMention:
  int getSyntacticHeadTokenPosition()
  CoreLabel getSyntacticHeadToken()
RelationMention:
  List<ExtractionObject> getArgs()
  List<String> getArgNames()

Named entity recognition Relation extraction (Throughout: Lessons from adapting our IE system to NFL domain)


Use the NER system to classify each token as one of the NFL entity types or O (other).
Contiguous tokens of the same type are combined into EntityMentions.
Marginal probabilities on each token form the results of getTypeProbabilities().
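The merging step can be sketched as follows. The classes below (Mention, the label strings) are hypothetical stand-ins for illustration, not CoreNLP code: contiguous tokens with the same non-O label collapse into one mention span.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of merging contiguous same-type token labels
// into entity mentions (hypothetical classes, not CoreNLP code).
public class MentionMerger {

    /** An entity mention as a type plus a [start, end) token span. */
    public record Mention(String type, int start, int end) {}

    public static List<Mention> merge(List<String> tokenLabels) {
        List<Mention> mentions = new ArrayList<>();
        int i = 0;
        while (i < tokenLabels.size()) {
            String type = tokenLabels.get(i);
            int start = i;
            // extend the span while the label stays the same
            while (i < tokenLabels.size() && tokenLabels.get(i).equals(type)) {
                i++;
            }
            if (!type.equals("O")) {
                mentions.add(new Mention(type, start, i));
            }
        }
        return mentions;
    }

    public static void main(String[] args) {
        // token labels for: "Green Bay Packers beat Dallas 37 - 7"
        List<String> labels = List.of("NFLTeam", "NFLTeam", "NFLTeam", "O",
                                      "NFLTeam", "Score", "Score", "Score");
        // yields two NFLTeam mentions and one Score mention
        System.out.println(merge(labels));
    }
}
```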

Extended the NFL team gazetteer with NFLGame entities extracted from Dekang Lin's distributional similarity dictionary:
  seeds: win, loss, game
  added: victory, triumph, shutout, defeat, lead, match, rout, strikeout, ...
Rules:
  If a word sequence (partially) matches a gazetteer entry and it includes the head of an NP -> gazetteer label
  If the generic NER labels a sequence as DATE -> Date
  If the generic NER labels a sequence as NUMBER, and it is a valid score not followed by a measurement unit -> Score
Goal: maximize recall!

Harvested 1400+ sentences on NFL games from sports.yahoo.com:
  "It was the third quarter of the Philadelphia Eagles' 38-10 rout of the Carolina Panthers on Sunday and both franchises suddenly had big worries about their veteran quarterbacks."
Tagged the corpus with the rule-based NER, which maximized recall.
Generated MTurk HITs from this data, using all possible relations between the identified NEs:
  Is it true that the Philadelphia Eagles scored 38 points in this game? yes
  Is it true that the Philadelphia Eagles scored 10 points in this game? no
Averaged annotations from four annotators for each HIT.

MTurk helped only up to a point. Why? There was a bug in the rule-based NER used to generate candidates, and Turkers could not identify subtle mistakes, so errors propagated into the final MTurk corpus:
  "the victory game against Dallas"
  Is "victory" the best word to describe the game? yes
  Is "game" the best word to describe the game? yes

Gazetteer used for team names and game entities: "Packers" should match "Green Bay Packers", but "Bay" shouldn't.
The tokenizer wasn't splitting scores ("37-7").
The head finder needed adjustments: heads of entities are critical features for both extraction tasks.
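The score-splitting fix above can be sketched as a small post-processing step. This is a hypothetical illustration (the actual fix lived inside the tokenizer): a score token like "37-7" is split into "37", "-", "7" so each number can be tagged separately.

```java
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch: split score tokens such as "37-7" into three tokens.
public class ScoreSplitter {
    private static final Pattern SCORE = Pattern.compile("(\\d+)-(\\d+)");

    public static List<String> split(String token) {
        Matcher m = SCORE.matcher(token);
        if (m.matches()) {
            return List.of(m.group(1), "-", m.group(2));
        }
        return List.of(token);  // leave non-score tokens untouched
    }

    public static void main(String[] args) {
        System.out.println(split("37-7"));  // [37, -, 7]
        System.out.println(split("rout")); // [rout]
    }
}
```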

Named entity recognition Relation extraction (Throughout: Lessons from adapting our IE system to NFL domain)

Logistic regression classifier.
Positive datums: annotated relations in the corpus.
Negative datums: all other possible combinations between existing entities.
Example: "It was the third quarter of the Philadelphia Eagles' 38-10 rout of the Carolina Panthers on Sunday and both franchises suddenly had big worries about their veteran quarterbacks."
  Positive: teamScoringAll(Philadelphia Eagles, 38)
  Negative: teamScoringAll(Philadelphia Eagles, 10)
Features:
  info on the entities in the relation
  syntactic path between entities (both dependencies and constituents)
  surface path between entities
  entities between the relation elements

The relation classifier is one-against-many, so it can only predict one relation per pair of entities. The NFL domain often violates this, e.g. gameWinner(team, game) and teamInGame(team, game) can hold simultaneously. The system also doesn't understand domain semantics, e.g.: games have exactly one winner and one loser; teams with higher scores win. Simple logical rules fill in some of these cases.
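One such rule can be sketched as follows. This is a hypothetical simplification for illustration, not the actual system's inference code: when both teams' scores are known, the higher-scoring team must be the game's winner.

```java
// Illustrative sketch of a domain-semantics rule: the higher-scoring
// team wins the game (hypothetical helper, not the actual system).
public class ScoreInference {

    /** Returns the inferred winner, or null when tied (no winner inferable). */
    public static String inferWinner(String teamA, int scoreA,
                                     String teamB, int scoreB) {
        if (scoreA > scoreB) return teamA;
        if (scoreB > scoreA) return teamB;
        return null;
    }

    public static void main(String[] args) {
        System.out.println(
            inferWinner("Philadelphia Eagles", 38, "Carolina Panthers", 10));
        // Philadelphia Eagles
    }
}
```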

System                           Entity Mentions   Relation Mentions
Baseline                         73.7              49.7
+ gazetteer features             74.0              50.2
+ rule-based model for NFLTeam   75.5              53.2
+ improved head finding          76.1              57.9
+ basic inference                76.1              59.5

Work on TAC-KBP and MR-KBP? Use Stanford CoreNLP!
http://nlp.stanford.edu/software/corenlp.shtml
The NFL system builds on top of CoreNLP.

Questions?