OPEN INFORMATION EXTRACTION FROM THE WEB. Michele Banko, Michael J Cafarella, Stephen Soderland, Matt Broadhead and Oren Etzioni


Call for a Shake-Up in Search! Question answering rather than indexed keyword search Keyword search strains under massive, heterogeneous data and returns documents rather than knowledge assertions Call for general-purpose question-answering systems

Watson, Siri

Motivation Traditional Information Extraction (IE): Requires hand-crafted extraction rules or hand-tagged training examples Relations of interest must be specified in advance and re-specified for each new task Usually domain specific Does not scale well to large and heterogeneous corpora

Overview Preliminaries Key components and design of the Open IE system Evaluation Related work Demo

About this paper High-level description of the system components and framework design; technical details are largely descriptive rather than rigorous Builds on prior work on maximum entropy methods (part-of-speech tagging, noun phrase identification) and on the KnowItAll paper

Several terminologies Tuple: t = (e_i, r_ij, e_j), where r_ij is a relation Relation: a general rule connecting entities, e.g. "cities such as New York, Tokyo, London, Beijing" Relation arguments: for a tuple (e_i, r_ij, e_j), e_i and e_j are the arguments of relation r_ij
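A throwaway Python sketch of how such a tuple might be represented (the field names are mine, not the paper's):

```python
from typing import NamedTuple

class Extraction(NamedTuple):
    """One extraction tuple t = (e_i, r_ij, e_j)."""
    e_i: str    # first entity (relation argument)
    r_ij: str   # relation string connecting the two entities
    e_j: str    # second entity (relation argument)

t = Extraction("New York", "is a city in", "the United States")
print(t.r_ij)   # -> "is a city in"
```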

Design Goals Automation Corpus heterogeneity Efficiency

TextRunner -- Open IE Key components: Self-supervised learner Single-pass extractor Redundancy-based assessor Query processing

TextRunner -- Open IE Key components: Self-supervised learner Single-pass extractor Redundancy-based assessor Query processing

Self-supervised Learner Step 1: Label training data as positive or negative (a parser is used only at this training stage) Step 2: Use the labeled data (extracted features) to train a Naïve Bayes classifier

Self-supervised Learner Step 1: Label training data as positive or negative (a parser is used only at this training stage) Step 2: Use the labeled data (extracted features) to train a Naïve Bayes classifier

Self-supervised Learner 1.1 The trainer parses the text. For each sentence, find all base noun phrases e_i; for each pair (e_i, e_j), identify the potential relation r_ij (the sequence of words between them), giving a candidate tuple t = (e_i, r_ij, e_j) 1.2 Use constraints to label t as positive or negative, e.g.: the length of the dependency chain connecting e_i and e_j; the path from e_i to e_j does not cross a sentence-like boundary
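A rough sketch of this labeling heuristic, assuming spaCy as a stand-in parser, a made-up path-length threshold, and the en_core_web_sm model; the paper's trainer uses its own parser and a longer list of constraints:

```python
import spacy
from itertools import combinations

# one-time setup: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

MAX_PATH_LEN = 4  # assumed threshold; the paper bounds the length of the dependency chain

def dependency_path_length(a, b):
    """Number of dependency arcs on the path between two tokens (inf if none)."""
    chain_a = [a] + list(a.ancestors)          # a, head(a), ..., root
    chain_b = [b] + list(b.ancestors)
    depth_b = {t.i: d for d, t in enumerate(chain_b)}
    for d_a, t in enumerate(chain_a):
        if t.i in depth_b:                     # lowest common ancestor found
            return d_a + depth_b[t.i]
    return float("inf")                        # no common ancestor (different parse trees)

def label_candidates(text):
    """Yield (e_i, r_ij, e_j, label) candidates with heuristic positive/negative labels."""
    doc = nlp(text)
    for np1, np2 in combinations(list(doc.noun_chunks), 2):
        relation = doc[np1.end:np2.start].text              # words between the two noun phrases
        same_sentence = np1.sent.start == np2.sent.start    # path must not cross a sentence boundary
        short_path = dependency_path_length(np1.root, np2.root) <= MAX_PATH_LEN
        yield np1.text, relation, np2.text, (same_sentence and short_path)

for cand in label_candidates("Edison invented the phonograph in 1877."):
    print(cand)
```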

Self-supervised Learner Step 1: Label training data as positive or negative (a parser is used only at this training stage) Step 2: Use the labeled data (extracted features) to train a Naïve Bayes classifier

Self-supervised Learner 2.1 Map each tuple to a feature vector, e.g. the number of tokens in r_ij, the presence of a particular POS tag sequence in r_ij, the POS tag to the left of e_i 2.2 The labeled feature vectors are used as input to a Naïve Bayes classifier The classifier is language specific
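A toy version of step 2, assuming a hand-picked subset of features and scikit-learn's Naïve Bayes; the paper's feature set is larger but, like this one, contains nothing relation-specific:

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

def featurize(pos_left_of_e1, rel_tokens, rel_pos_tags):
    """Map one candidate tuple to a small feature vector."""
    return [
        len(rel_tokens),                                   # number of tokens in r_ij
        int("VBD" in rel_pos_tags),                        # past-tense verb inside the relation
        int(pos_left_of_e1 == "IN"),                       # POS tag to the left of e_i is a preposition
        int(any(t.lower() == "of" for t in rel_tokens)),   # stopword inside the relation
    ]

# hypothetical candidates labeled by the step-1 heuristics
X = np.array([
    featurize("DT", ["invented"], ["VBD"]),
    featurize("IN", ["and", "also", "the"], ["CC", "RB", "DT"]),
])
y = np.array([1, 0])   # 1 = trustworthy extraction, 0 = not

clf = MultinomialNB().fit(X, y)
print(clf.predict([featurize("DT", ["developed"], ["VBD"])]))
```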

TextRunner -- Open IE Key components: Self-supervised learner Single-pass extractor Redundancy-based assessor Query processing

Single-pass Extractor Makes a single pass over the corpus Tags each word in each sentence with its POS label Uses the tags and a noun phrase chunker to identify entities Relations are extracted by analyzing the text between noun phrases The classifier labels the candidate tuples; TextRunner stores only the trustworthy ones
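A sketch of one extraction pass, with NLTK's tagger and a regex chunker standing in for the lightweight, parser-free tools TextRunner actually uses; the classifier trained above would then decide which candidates to keep:

```python
import nltk
# one-time setup: nltk.download("punkt"); nltk.download("averaged_perceptron_tagger")

NP_GRAMMAR = "NP: {<DT>?<JJ>*<NN.*>+}"    # assumed base-NP pattern, not the paper's chunker
chunker = nltk.RegexpParser(NP_GRAMMAR)

def extract_candidates(sentence):
    """POS-tag the sentence, chunk noun phrases, and take the words between
    consecutive noun phrases as the candidate relation."""
    tagged = nltk.pos_tag(nltk.word_tokenize(sentence))
    tokens = [w for w, _ in tagged]
    spans = []                                     # (noun phrase text, start token index)
    i = 0
    for node in chunker.parse(tagged):
        if isinstance(node, nltk.Tree):            # an NP chunk
            spans.append((" ".join(w for w, _ in node.leaves()), i))
            i += len(node.leaves())
        else:
            i += 1
    for (np1, s1), (np2, s2) in zip(spans, spans[1:]):
        rel = " ".join(tokens[s1 + len(np1.split()):s2])
        yield np1, rel, np2                        # candidate tuple (e_i, r_ij, e_j)

print(list(extract_candidates("Edison invented the phonograph in his laboratory.")))
```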

Single-pass Extractor Relation normalization: non-essential phrases are eliminated to keep the relation text succinct (e.g. "definitely developed" is reduced to "developed") Entity normalization: the chunker assigns a probability to each entity; tuples containing low-confidence entities are dropped
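A toy illustration of relation normalization under an assumed rule (drop adverbs and modals from the relation string); the paper does not spell out its exact rules:

```python
import nltk

def normalize_relation(rel):
    """Remove adverb and modal tokens so the relation text stays succinct."""
    tagged = nltk.pos_tag(rel.split())
    kept = [w for w, t in tagged if t not in {"RB", "RBR", "RBS", "MD"}]
    return " ".join(kept)

print(normalize_relation("definitely developed"))   # -> "developed"
```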

TextRunner -- Open IE Key components: Self-supervised learner Single-pass extractor Redundancy-based assessor Query processing

Redundancy-based Assessor Merges identical tuples Counts the distinct sentences each tuple was extracted from The count is used to assign a probability to each tuple (as in KnowItAll) Intuition: a tuple t = (e_i, r_ij, e_j) is likely a correct instance of relation r_ij if it is extracted from many different sentences
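A sketch of the assessor's bookkeeping: merge identical tuples and count distinct supporting sentences. The probability below is a placeholder, not the paper's model (KnowItAll relies on the urns model of Downey et al.); it only captures the intuition that more distinct sentences mean higher confidence:

```python
from collections import defaultdict

def assess(extractions):
    """extractions: iterable of ((e_i, r_ij, e_j), sentence_id) pairs."""
    support = defaultdict(set)
    for tup, sent_id in extractions:
        support[tup].add(sent_id)              # merge identical tuples, keep distinct sentence ids
    # placeholder monotone mapping from support count to probability
    return {tup: 1.0 - 0.5 ** len(sents) for tup, sents in support.items()}

print(assess([
    (("Edison", "invented", "the phonograph"), 1),
    (("Edison", "invented", "the phonograph"), 7),
    (("Edison", "liked", "tea"), 3),
]))
```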

TextRunner -- Open IE Key components: Self-supervised learner Single-pass extractor Redundancy-based assessor Query processing

Query Processing Uses an inverted index distributed over a pool of machines Each relation is assigned to a single machine Each machine then stores a reference to all tuples that are instances of any relation assigned to it Much like a distributed hash table
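A minimal sketch of this relation-centric layout; the hashing scheme is an assumption, since the paper only states that each relation is assigned to one machine, DHT-style:

```python
from collections import defaultdict
from hashlib import md5

NUM_MACHINES = 4
index = [defaultdict(list) for _ in range(NUM_MACHINES)]   # one inverted index per machine

def machine_for(relation):
    """Deterministically map a relation string to one machine, like a DHT key."""
    return int(md5(relation.encode()).hexdigest(), 16) % NUM_MACHINES

def add_tuple(tuple_id, e1, rel, e2):
    index[machine_for(rel)][rel].append(tuple_id)   # the owning machine stores references to its tuples

def query(rel):
    return index[machine_for(rel)][rel]             # a single machine answers the whole relation

add_tuple(42, "Edison", "invented", "the phonograph")
print(query("invented"))   # -> [42]
```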

Query Processing The relation-centric index can be used for advanced natural-language-like search and question answering The distributed pool of machines supports interactive search speeds

Experimental Results Comparison with Traditional IE Global Statistics on Facts Learned

Comparison with Traditional IE TextRunner vs. KnowItAll (Open IE vs. closed IE) 10 relations are pre-selected for the comparison

Comparison with Traditional IE Speed: TextRunner takes 85 CPU hours to extract all relations in the corpus at once; KnowItAll takes 6.3 CPU hours per relation

Global Statistics on Facts Learned Evaluation goals: How many of the extracted tuples represent actual relationships with plausible arguments? What subset of these tuples is correct? How many of these tuples are distinct?

Global Statistics on Facts Learned Data set used: 9 million Web pages 133 million sentences 60.5 million tuples extracted (roughly one tuple per 2.2 sentences)

Filtering Criteria Keep tuples with probability > 0.8 The tuple's relation must be supported by at least 10 distinct sentences The relation must not be overly general (not in the top 0.1% of relation strings), e.g. (NP1, has, NP2) Result: 11.3 million tuples containing 278,085 distinct relation strings
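A sketch of the three filtering criteria as code; the thresholds come from this slide, while the field names and the rank bookkeeping are assumptions for illustration:

```python
def passes_filter(prob, relation, support_count, relation_rank, total_relations):
    """Apply the three slide criteria to one extracted tuple."""
    if prob <= 0.8:                                          # keep only tuples with probability > 0.8
        return False
    if support_count[relation] < 10:                         # relation backed by >= 10 distinct sentences
        return False
    if relation_rank[relation] < 0.001 * total_relations:    # drop the most general 0.1% of relations,
        return False                                         # e.g. (NP1, has, NP2)
    return True
```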

Estimating the Correctness of Facts

Estimating the Number of Distinct Facts Only relation synonymy is addressed Relations are merged using linguistic/syntactic cues (punctuation, auxiliary verbs, leading stopwords, active vs. passive voice) This reduces the number of distinct relations to 91% of the number before merging

Estimating the Number of Distinct Facts Difficulty: it is rare to find two distinct relations that are truly synonymous in all senses of each phrase, e.g. a person develops a disease vs. a scientist develops a technology Instead, use synonymy clusters and human assessment at the tuple level

Estimating the Number of Distinct Facts For each tuple in the filtered set of 11.3 million, find clusters of concrete tuples (e_1, r, e_2), (e_1, q, e_2) with r != q, i.e. the entities match but the relation strings differ; only one third of the tuples belong to such synonymy clusters In a random sample of 100 clusters, roughly one quarter of the tuples were reformulations of other tuples in the filtered set Example: a cluster of 4 tuples describing 2 distinct relations R_1 and R_2 between Bletchley Park and Station X:
R_1 (Bletchley Park, was location of, Station X)
R_2 (Bletchley Park, being called, Station X)
R_2 (Bletchley Park, known as, Station X)
R_2 (Bletchley Park, codenamed, Station X)
Hence 2/3 + (1/3 × 3/4), or roughly 92%, of the tuples found by TextRunner express distinct assertions (an overestimate, since multiple names for the same entity are not factored in)

Estimating the Number of Distinct Facts Challenge: find methods for detecting synonyms and resolving multiple mentions of entities

Related Work KnowItAll Project (umbrella project) IBM Watson TextRunner Demo

Questions?