Text, Knowledge, and Information Extraction. Lizhen Qu


Text, Knowledge, and Information Extraction Lizhen Qu

A bit about myself
PhD: Databases and Information Systems Group (MPII). Advisors: Prof. Gerhard Weikum and Prof. Rainer Gemulla. Thesis: Sentiment Analysis with Limited Training Data.
Now: machine learning group at NICTA; adjunct research fellow at ANU.

Macquarie

News about Macquarie Bank

Negative News about Macquarie Bank

Simple Math Problem
Bob has 15 apples. He gives 9 to Sarah. How many apples does Bob have now?

Information Extraction
Named entity recognition
Named entity disambiguation
Relation extraction

Knowledge Bases (Open Linked Data)
Example facts: (Bob_Dylan, compose, Like_a_rolling_stone), (The_Dark_Knight, directedby, Christopher_Nolan)
Related resources: Entity Graph, OpenIE (Ollie, Reverb), Economic Graph
YAGO: #classes: 350,000; #entities: 10 million; #facts: 120 million; #languages: 10
DBpedia: #classes: 735; #entities: 38.3 million; #triples: 6.9 billion; #languages: 128
Freebase: #entities: 50 million; #facts: 3 billion; #languages: almost 70
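The (subject, predicate, object) facts on the knowledge base slides can be held in a very small in-memory store. The sketch below is only an illustration of the triple data model; the class name `TripleStore` and its methods are hypothetical and do not correspond to the API of YAGO, DBpedia, or Freebase.

```python
from collections import defaultdict


class TripleStore:
    """Toy store for (subject, predicate, object) facts; hypothetical helper."""

    def __init__(self):
        self._by_subject = defaultdict(set)
        self._by_object = defaultdict(set)

    def add(self, subject, predicate, obj):
        triple = (subject, predicate, obj)
        self._by_subject[subject].add(triple)
        self._by_object[obj].add(triple)

    def objects(self, subject, predicate):
        """All objects o such that (subject, predicate, o) is stored."""
        triples = self._by_subject.get(subject, set())
        return sorted(o for (_, p, o) in triples if p == predicate)

    def subjects(self, predicate, obj):
        """All subjects s such that (s, predicate, obj) is stored."""
        triples = self._by_object.get(obj, set())
        return sorted(s for (s, p, _) in triples if p == predicate)


kb = TripleStore()
kb.add("Bob_Dylan", "compose", "Like_a_rolling_stone")
kb.add("The_Dark_Knight", "directedby", "Christopher_Nolan")

composed = kb.objects("Bob_Dylan", "compose")
directed = kb.subjects("directedby", "Christopher_Nolan")
```

Indexing by both subject and object keeps lookups in either direction cheap, which matters once a store holds millions of facts rather than two.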

Construct YAGO from (Semi) Structured Data

IE Challenge: Ambiguity of Natural Language
"I made her duck." can mean:
i. I cooked waterfowl for her.
ii. I cooked waterfowl belonging to her.
iii. I created the duck she owns.
iv. I caused her to quickly lower her head or body.
v. I waved my magic wand and turned her into a waterfowl.

Named Entity Recognition
Task:
Research at [Stanford]ORG led to a search engine company, founded by [Page]PER and [Brin]PER.
Machine learning problem: assign one label to each token.
Research/O at/O Stanford/ORG led/O to/O search/O engine/O company/O ,/O founded/O by/O Page/PER and/O Brin/PER ./O

Learning and Prediction
Sentences -> Feature Extraction
Labeled sentences (has labels) -> train models
Sentences with no labels -> prediction

Feature Extraction
Use features to represent each word.
Features of "Stanford": w-2 = Research; w-1 = at; w0 = Stanford; w+1 = led; w+2 = to; POS = noun; capitalized? = true
Features of "search": w-2 = to; w-1 = a; w0 = search; w+1 = engine; w+2 = company; POS = noun; capitalized? = false
Vectorise feature representations. With the indicator features [w-2 = research, capitalized, w0 = stanford, w0 = search], "Stanford" becomes [1, 1, 1, 0].
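The window features and their vectorised form can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names `window_features` and `vectorise` are hypothetical, the POS feature is left out because it would need a tagger, and the four-entry feature index is the one shown on the slide rather than a full vocabulary.

```python
def window_features(tokens, i, size=2):
    """Binary features for tokens[i]: surrounding words plus capitalisation."""
    feats = {}
    for offset in range(-size, size + 1):
        j = i + offset
        word = tokens[j].lower() if 0 <= j < len(tokens) else "<pad>"
        name = "w0" if offset == 0 else f"w{offset:+d}"
        feats[f"{name}={word}"] = 1
    if tokens[i][:1].isupper():
        feats["capitalized"] = 1
    return feats


def vectorise(feats, feature_index):
    """Map a feature dict onto a fixed, ordered list of feature names."""
    return [1 if name in feats else 0 for name in feature_index]


sentence = "Research at Stanford led to a search engine company , founded by Page and Brin ."
tokens = sentence.split()
feature_index = ["w-2=research", "capitalized", "w0=stanford", "w0=search"]

stanford_vec = vectorise(window_features(tokens, 2), feature_index)  # [1, 1, 1, 0]
search_vec = vectorise(window_features(tokens, 6), feature_index)    # [0, 0, 0, 1]
```

In a real system the feature index would be built from all features seen in the training data, so the vectors are long and very sparse.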

Standard Model: Conditional Random Fields
Assigns local scores to different (word, label) pairs.
Joint inference to find the best label sequence.
CRF: $p(y \mid x) = \frac{1}{Z} \exp\left( \sum_{t=1}^{T} \sum_{i} \lambda_i f_i(y_{t-1}, y_t, x_t) \right)$
Stanford NER [1]: 86%. Best system [8]: 89%.
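Joint inference in a linear-chain model is commonly done with the Viterbi algorithm. The sketch below is not the Stanford NER implementation; it assumes the weighted feature sums have already been collapsed into per-token label scores (`emissions`) and label-to-label scores (`transitions`), and all numbers are made up for illustration.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence.

    emissions: one dict per token, label -> local score (missing = 0.0).
    transitions: dict (previous_label, label) -> score (missing = 0.0).
    """
    best = [{y: emissions[0].get(y, 0.0) for y in labels}]
    back = []
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for y in labels:
            # Best previous label for ending in y at position t.
            prev = max(labels, key=lambda p: best[-1][p] + transitions.get((p, y), 0.0))
            scores[y] = (best[-1][prev] + transitions.get((prev, y), 0.0)
                         + emissions[t].get(y, 0.0))
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return path[::-1]


labels = ["O", "ORG", "PER"]
# Hypothetical scores for the tokens "at Stanford led".
emissions = [{"O": 2.0}, {"ORG": 1.0, "PER": 1.1}, {"O": 2.0}]
transitions = {("O", "PER"): -0.5}

decoded = viterbi(emissions, transitions, labels)
```

Here the local scores slightly prefer PER for "Stanford", but the transition penalty makes the sequence O, ORG, O the overall winner, which is the point of scoring label sequences jointly rather than token by token.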

Named Entity Disambiguation
Task: link each recognized mention to its knowledge base entity.
Research at [Stanford]ORG led to a search engine company, founded by [Page]PER and [Brin]PER.
Stanford -> Stanford University
Page -> Larry Page
Brin -> Sergey Brin

AIDA-light [2]

First Stage

Second Stage
AIDA-light [2]: 84.8%. DBpedia Spotlight: 75%.

Relation Extraction
Relation mention extraction: which relations hold between the linked mentions?
Research at [Stanford] (ORG: Stanford_University) led to a search engine company, founded by [Page] (PER: Larry_Page) and [Brin] (PER: Sergey_Brin).
Expand knowledge bases:
Larry Page -?-> Stanford University
The Dark Knight -?-> Christopher Nolan

Relation Mention Extraction
Multi-class classification.
Example features of a pair of entity mentions [3], for (Stanford, Page) in "Research at Stanford led to a search engine company, founded by Page and Brin.":
Words between (Stanford, Page): led, to, a, search, engine, company, founded, by
Named entity types: (ORG, PER)
Number of mentions between (Stanford, Page): 0
F-measure on ACE: 71.2% [3]
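The three example features can be computed directly from a tokenised sentence with marked mentions. This is a rough sketch, not the feature extractor of [3]: the function name `pair_features` is hypothetical, mentions are assumed to be single tokens given as (token index, entity type), and punctuation is dropped from the words in between.

```python
def pair_features(tokens, mentions, first, second):
    """Features for the mention pair (mentions[first], mentions[second])."""
    (i, type_i), (j, type_j) = mentions[first], mentions[second]
    # Words strictly between the two mentions, without punctuation.
    between = [w.lower() for w in tokens[i + 1:j] if w.isalpha()]
    # Other mentions located strictly between the two.
    n_between = sum(1 for (k, _) in mentions if i < k < j)
    return {
        "words_between": between,
        "entity_types": (type_i, type_j),
        "mentions_between": n_between,
    }


sentence = "Research at Stanford led to a search engine company , founded by Page and Brin ."
tokens = sentence.split()
mentions = [(2, "ORG"), (12, "PER"), (14, "PER")]  # Stanford, Page, Brin

stanford_page = pair_features(tokens, mentions, 0, 1)
stanford_brin = pair_features(tokens, mentions, 0, 2)
```

A classifier would then receive these features, typically after the same kind of vectorisation used for NER, and predict one relation class (or "no relation") per mention pair.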

Expand Knowledge Base
Multi-instance, multi-label learning [4,5].
Distant supervision: Freebase provides a relation-level label for the pair (Larry Page, Sergey Brin); the mention-level label of each sentence containing the pair is unknown.
Research at Stanford led to a search engine company, founded by Page and Brin. (mention-level label: ?)
Larry Page and Sergey Brin explained why they just created Alphabet. (mention-level label: ?)
MAP [3]: 56%. MAP [4]: 66%.
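Distant supervision groups every sentence that mentions a known entity pair into one bag and attaches the knowledge base relations to the bag as a whole. The sketch below illustrates only this bag construction, not the learning methods of [4,5]; the function name `build_bags` is hypothetical, the sentences are assumed to be already linked to entity identifiers, and the relation name "co-founders" is used as an illustrative Freebase label.

```python
def build_bags(kb_facts, sentences):
    """Group sentences by entity pair and attach relation-level labels.

    kb_facts: dict (entity1, entity2) -> set of relation names.
    sentences: list of (text, set of linked entity ids).
    """
    bags = {}
    for text, entities in sentences:
        for pair, relations in kb_facts.items():
            if pair[0] in entities and pair[1] in entities:
                bag = bags.setdefault(
                    pair, {"relations": sorted(relations), "mentions": []}
                )
                # The mention-level label stays unknown; only the bag is labeled.
                bag["mentions"].append(text)
    return bags


kb_facts = {("Larry_Page", "Sergey_Brin"): {"co-founders"}}
sentences = [
    ("Research at Stanford led to a search engine company, founded by Page and Brin.",
     {"Stanford_University", "Larry_Page", "Sergey_Brin"}),
    ("Larry Page and Sergey Brin explained why they just created Alphabet.",
     {"Larry_Page", "Sergey_Brin"}),
    ("Bob Dylan recorded Like a Rolling Stone.", {"Bob_Dylan"}),
]

bags = build_bags(kb_facts, sentences)
page_brin = bags[("Larry_Page", "Sergey_Brin")]
```

Both sentences end up in the same bag even though only some of them may actually express the relation, which is why multi-instance learning treats the mention-level labels as latent.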

Open Information Extraction
Extract triples of any relations from the web [6].
It was exactly 50 years ago today that Bob Dylan walked into Studio A at Columbia Records in New York and recorded "Like a Rolling Stone."
( Bob Dylan, record, Like a rolling stone )
Optional: link triples to knowledge bases.
( Bob Dylan, record, Like a rolling stone )
Knowledge base items: The_Dark_Knight, record, Like_a_Rolling_Stone
F1 [6]: 19.6%. F1 [9]: 28.3%.

Harvest Domain-Specific Knowledge
Deep learning: learn cross-domain features; minimize training data.
Transfer learning: source domain (newswire) -> target domain (nurse handovers).

Word Representation
One-hot representation:
stanford [ 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
university [ 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
oxford [ 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0 ]
conference [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1 ]
talk [ 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0 ]
Distributed representation: each word (stanford, university, oxford, conference, talk) is a dense, low-dimensional vector, e.g. oxford = [0.01, 0.3, -0.5, 0.6].
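The practical difference between the two representations shows up when comparing words. In the sketch below, only the vector for "oxford" comes from the slide; the vectors for "stanford" and "talk" are made-up values chosen for illustration, and the helper names are hypothetical.

```python
import math


def one_hot(word, vocab):
    """Vector with a single 1 at the word's position in the vocabulary."""
    return [1 if w == word else 0 for w in vocab]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


vocab = ["oxford", "stanford", "university", "talk", "conference"]

# One-hot vectors of different words are always orthogonal.
sim_one_hot = cosine(one_hot("stanford", vocab), one_hot("oxford", vocab))

dense = {
    "oxford": [0.01, 0.3, -0.5, 0.6],      # from the slide
    "stanford": [0.05, 0.25, -0.45, 0.55],  # hypothetical
    "talk": [-0.4, 0.1, 0.6, -0.2],         # hypothetical
}
sim_universities = cosine(dense["stanford"], dense["oxford"])
sim_unrelated = cosine(dense["stanford"], dense["talk"])
```

With one-hot vectors every pair of distinct words has similarity 0, so a model cannot generalise from "stanford" to "oxford"; dense vectors can place related words close together.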

Distributed Representation

Apply Distributed Representations for NER
Represent words based on positions rather than IDs.
Feature matrix columns: 2nd word to the left, first word to the left, current word, first word to the right, 2nd word to the right; each row also has a label.
Example rows: compare/O stanford/UNI university/UNI and/O oxford/UNI
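One way to read the feature matrix is that each row concatenates the embeddings of the five window positions around the current word. The sketch below follows that reading as an assumption; the function name `window_matrix` is hypothetical and the two-dimensional embeddings are made-up values, far smaller than real ones.

```python
def window_matrix(tokens, embeddings, size=2, dim=2):
    """One row per token: embeddings of the window positions, concatenated."""
    pad = [0.0] * dim  # used outside the sentence and for unknown words
    rows = []
    for i in range(len(tokens)):
        row = []
        for offset in range(-size, size + 1):
            j = i + offset
            inside = 0 <= j < len(tokens)
            row.extend(embeddings.get(tokens[j], pad) if inside else pad)
        rows.append(row)
    return rows


tokens = ["compare", "stanford", "university", "and", "oxford"]
labels = ["O", "UNI", "UNI", "O", "UNI"]
embeddings = {  # hypothetical 2-dimensional vectors
    "compare": [0.2, -0.1],
    "stanford": [0.9, 0.4],
    "university": [0.8, 0.5],
    "and": [0.0, 0.1],
    "oxford": [0.9, 0.3],
}

matrix = window_matrix(tokens, embeddings)
training_pairs = list(zip(matrix, labels))
```

Because the position in the row, not a word ID, decides what a column means, unseen words with similar embeddings produce similar rows.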

Results of Named Entity Recognition [7]
Reduce the amount of training data.
Tiny differences between word embeddings.

NER for Novel Named Entity Types
Goals: minimize labeled training data.
Leverage existing resources: labeled corpora, unlabeled text, existing knowledge bases.
Source domain types -> target domain types:
person -> doctor, patient
location -> country, city, hotel
organization -> corporation

Experimental Results on I2B2

Learn Text Representations for Relations
Unsupervised pre-training.
Distant supervision: Freebase gives the relation "co-founders" for (Larry Page, Sergey Brin); the mention-level labels are inferred.
Research at Stanford led to a search engine company, founded by Page and Brin. (inferred mention-level label)
Larry Page and Sergey Brin explained why they just created Alphabet. (inferred mention-level label)

NICTA Deep Learning for IE Toolkit
A fully integrated deep learning toolkit for NLP.
Pipelines include both NLP preprocessing and DL components.
Written in Scala/Java.
Easy to write new ML components.
Reuse UIMA NLP components.
Scalable.
Easy switch between GPUs and CPUs: learning on GPUs; make use of UIMA for prediction.

References
[1] Jenny Rose Finkel, Trond Grenager, and Christopher Manning. 2005. Incorporating Non-local Information into Information Extraction Systems by Gibbs Sampling. Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005).
[2] Nguyen, Dat Ba, et al. "AIDA-light: High-throughput named-entity disambiguation." Linked Data on the Web at WWW2014 (2014).
[3] Chan, Yee Seng, and Dan Roth. "Exploiting background knowledge for relation extraction." Proceedings of the 23rd International Conference on Computational Linguistics. Association for Computational Linguistics, 2010.
[4] Mihai Surdeanu, Julie Tibshirani, Ramesh Nallapati, Christopher D. Manning. Multi-instance Multi-label Learning for Relation Extraction. Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning, 2012.
[5] Riedel, Sebastian, et al. "Relation extraction with matrix factorization and universal schemas." (2013).
[6] Schmitz, Michael, et al. "Open language learning for information extraction." Proceedings of the 2012 Joint Conference on Empirical Methods in Natural Language Processing and Computational Natural Language Learning. Association for Computational Linguistics, 2012.
[7] Qu, Lizhen, et al. "Big Data Small Data, In Domain Out-of Domain, Known Word Unknown Word: The Impact of Word Representation on Sequence Labelling Tasks." arXiv preprint arXiv:1504.05319 (2015).
[8] Rie Kubota Ando and Tong Zhang. 2005. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6:1817-1853.
[9] Angeli, Gabor, Melvin Johnson Premkumar, and Christopher D. Manning. "Leveraging Linguistic Structure For Open Domain Information Extraction."

Resources
YAGO: http://www.mpi-inf.mpg.de/departments/databases-and-information-systems/research/yago-naga/yago
DBPedia: http://wiki.dbpedia.org/
Alchemy: http://querybuilder.alchemyapi.com/builder
Deep learning: http://www.deeplearning.net/
Word2vec: https://code.google.com/p/word2vec/
Mallet (Java): http://mallet.cs.umass.edu/
Factorie (Scala): http://factorie.cs.umass.edu/
Stanford CoreNLP: http://nlp.stanford.edu:8080/corenlp/
NLP conferences: ACL, EMNLP, COLING, NAACL, EACL
NLP online courses:
https://www.coursera.org/course/nlangp
https://www.youtube.com/playlist?list=pl6397e4b26d00a269
ML online courses:
https://www.coursera.org/course/ml
https://www.coursera.org/course/neuralnets
http://www.socher.org/index.php/deeplearningtutorial/deeplearningtutorial