OPEN INFORMATION EXTRACTION FROM THE WEB Michele Banko, Michael J Cafarella, Stephen Soderland, Matt Broadhead and Oren Etzioni
Call for a Shake-Up in Search! Question answering rather than indexed keyword search. Limits of keyword search over massive, heterogeneous data: it retrieves documents but asserts no knowledge. This calls for a general-purpose question-answering system.
Watson, Siri
Motivation Traditional Information Extraction (IE): requires hand-crafted extraction rules or training examples; relations of interest must be re-specified for each new task; usually domain-specific; does not scale well to large, heterogeneous corpora.
Overview Preliminaries; Key components and design of the Open IE system; Evaluation; Related work; Demo
About this paper High-level description of system components and framework design; technical details are largely descriptive rather than rigorous. Builds on work on maximum-entropy methods (part-of-speech tagging, noun-phrase identification) and on the KnowItAll paper.
Terminology Tuple: t = (e_i, r_ij, e_j), where r_ij is the relation. Relation: a general rule connecting entities, e.g. "cities such as New York, Tokyo, London, Beijing". Relation arguments: for tuple (e_i, r_ij, e_j), e_i and e_j are the arguments of relation r_ij.
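To make the notation concrete, a tuple can be represented as a simple record; this is a minimal sketch, and the field names are illustrative, not from the paper:

```python
from collections import namedtuple

# A tuple t = (e_i, r_ij, e_j): two entity strings joined by a relation string.
Extraction = namedtuple("Extraction", ["e1", "relation", "e2"])

t = Extraction("New York", "is a city in", "the United States")
print(t.relation)  # is a city in
```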
Design Goals Automation Corpus heterogeneity Efficiency
TextRunner -- Open IE Key components: Self-supervised learner; Single-pass extractor; Redundancy-based assessor; Query processing
Self-supervised Learner Step 1: Label training data as positive or negative (a parser is used to label the data that trains the extractor). Step 2: Use the labeled data (as feature vectors) to train a Naïve Bayes classifier.
Self-supervised Learner 1.1 The learner parses the text; for each sentence it finds all base noun phrases, and for each pair (e_i, e_j) it identifies a potential relation r_ij (the word sequence between them), forming a candidate tuple t = (e_i, r_ij, e_j). 1.2 Syntactic constraints label t as positive or negative, e.g.: the dependency chain connecting e_i and e_j must be short, and the path from e_i to e_j must not cross a sentence-like boundary.
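The labeling heuristics can be sketched as a small function; the chain-length threshold and the restriction to just these two constraints are assumptions for illustration (the paper lists several parser-based constraints):

```python
def label_candidate(dep_chain_length, crosses_sentence_boundary, max_chain_length=4):
    """Heuristically label a candidate tuple t = (e_i, r_ij, e_j) as a positive
    or negative training example: the dependency chain between e_i and e_j
    must be short, and the path must not cross a sentence-like boundary."""
    if crosses_sentence_boundary:
        return "negative"
    if dep_chain_length > max_chain_length:
        return "negative"
    return "positive"

print(label_candidate(3, False))  # positive
print(label_candidate(9, False))  # negative
```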
Self-supervised Learner Step 1: Label training data as positive or negative (using parser to train extractor) Step 2: Use labeled data (extract features) to train a Naïve Bayes classifier
Self-supervised Learner 2.1 Map each tuple to a feature vector, e.g. the number of tokens in r_ij, the presence of a POS tag sequence in r_ij, the POS tag to the left of e_i. 2.2 The labeled feature vectors are used as input to train a Naïve Bayes classifier. The classifier is language-specific but contains no relation-specific or lexical features.
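Step 2 can be sketched with a generic Naïve Bayes classifier over the parser-labeled feature vectors; the feature names below are illustrative stand-ins for the unlexicalized features the slides describe, not the paper's actual feature set:

```python
from collections import defaultdict
import math

class NaiveBayes:
    """Minimal Naive Bayes over sets of binary features, trained on tuples
    the parser labeled trustworthy/untrustworthy (add-one smoothing)."""
    def fit(self, X, y):
        self.classes = set(y)
        self.prior = {c: y.count(c) / len(y) for c in self.classes}
        self.counts = {c: defaultdict(int) for c in self.classes}
        self.totals = {c: 0 for c in self.classes}
        for feats, c in zip(X, y):
            for f in feats:
                self.counts[c][f] += 1
                self.totals[c] += 1
        self.vocab = {f for feats in X for f in feats}
        return self

    def predict(self, feats):
        def score(c):
            s = math.log(self.prior[c])
            for f in feats:
                s += math.log((self.counts[c][f] + 1) /
                              (self.totals[c] + len(self.vocab)))
            return s
        return max(self.classes, key=score)

# Illustrative features: relation length, POS patterns around the entities.
X = [{"rel_len<=3", "rel_has_verb"}, {"rel_len>10", "crosses_np"},
     {"rel_len<=3", "rel_has_verb"}, {"rel_len>10"}]
y = ["trustworthy", "untrustworthy", "trustworthy", "untrustworthy"]
clf = NaiveBayes().fit(X, y)
print(clf.predict({"rel_len<=3", "rel_has_verb"}))  # trustworthy
```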
TextRunner -- Open IE Key components: Self-supervised learner; Single-pass extractor; Redundancy-based assessor; Query processing
Single-pass Extractor Makes a single pass over the corpus: tag each word in a sentence with its POS label; use the tags and a noun-phrase chunker to identify entities; extract relations by analyzing the text between noun phrases; the classifier classifies candidate tuples, and TextRunner stores the trustworthy ones.
Single-pass Extractor Relation normalization: non-essential phrases are eliminated to yield succinct relation text (e.g. "definitely developed" is reduced to "developed"). Entity normalization: the chunker assigns a probability to each entity; tuples containing low-confidence entities are dropped.
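A highly simplified sketch of the pass: real systems use a trained POS tagger and NP chunker, which are stubbed here by passing noun phrases in directly, and the modifier list is an illustrative assumption:

```python
import re

def normalize_relation(text, drop_words=("definitely", "probably", "really")):
    """Relation normalization: drop non-essential modifiers so that e.g.
    'definitely developed' reduces to 'developed'."""
    tokens = [t for t in text.split() if t.lower() not in drop_words]
    return " ".join(tokens)

def extract_tuples(sentence, noun_phrases):
    """Single pass over one sentence: take adjacent noun-phrase pairs and
    treat the text between them as the candidate relation string."""
    tuples = []
    for np1, np2 in zip(noun_phrases, noun_phrases[1:]):
        m = re.search(re.escape(np1) + r"\s+(.*?)\s+" + re.escape(np2), sentence)
        if m and m.group(1):
            tuples.append((np1, normalize_relation(m.group(1)), np2))
    return tuples

s = "Edison definitely developed the light bulb"
print(extract_tuples(s, ["Edison", "the light bulb"]))
```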
TextRunner -- Open IE Key components: Self-supervised learner; Single-pass extractor; Redundancy-based assessor; Query processing
Redundancy-based Assessor Merge identical tuples; count the distinct sentences each tuple was extracted from; use the count to assign a probability to each tuple (as in KnowItAll). Intuition: a tuple t = (e_i, r_ij, e_j) is more likely a correct instance of relation r_ij if it is extracted from many different sentences.
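A sketch of the assessor: merge identical tuples, count supporting extractions, and map the count to a probability. The paper uses KnowItAll's probabilistic model; the noisy-or style formula below is a stand-in assumption, not the paper's exact model, and it treats each list element as coming from a distinct sentence:

```python
from collections import Counter

def assess(extractions, p_single=0.5):
    """Merge identical tuples and assign each a probability that grows with
    the number of distinct sentences it was extracted from.
    P(correct | k extractions) = 1 - (1 - p_single)^k  (noisy-or stand-in)."""
    counts = Counter(extractions)          # merge identical tuples
    return {t: 1 - (1 - p_single) ** k for t, k in counts.items()}

ex = [("Edison", "invented", "the phonograph")] * 3 + \
     [("Paris", "is capital of", "Texas")]
probs = assess(ex)
print(round(probs[("Edison", "invented", "the phonograph")], 3))  # 0.875
```

A tuple seen three times scores 0.875, while a one-off extraction stays at the single-extraction baseline.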
TextRunner -- Open IE Key components: Self-supervised learner; Single-pass extractor; Redundancy-based assessor; Query processing
Query Processing An inverted index distributed over a pool of machines: each relation is assigned to one machine, and each machine stores references to all tuples that are instances of the relations assigned to it, much like a distributed hash table.
Query Processing The relation-centric index supports advanced, natural-language-like search and question answering; the distributed pool of machines supports interactive search speeds.
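The relation-to-machine assignment can be sketched as hashing the relation string to pick a node, DHT-style; the pool size and the in-memory "index" are illustrative assumptions, not the paper's implementation:

```python
import hashlib

N_MACHINES = 4  # illustrative pool size

def machine_for(relation):
    """Assign each relation string to one machine by hashing it, so all
    tuples that are instances of that relation live on the same node."""
    h = int(hashlib.sha1(relation.encode()).hexdigest(), 16)
    return h % N_MACHINES

# Toy distributed inverted index: machine -> relation -> list of tuples
index = [dict() for _ in range(N_MACHINES)]

def add_tuple(t):
    e1, rel, e2 = t
    index[machine_for(rel)].setdefault(rel, []).append(t)

def query(relation):
    """Route a relation query to the single machine that owns it."""
    return index[machine_for(relation)].get(relation, [])

add_tuple(("Edison", "invented", "the phonograph"))
add_tuple(("Bell", "invented", "the telephone"))
print(len(query("invented")))  # 2
```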
Experimental Results Comparison with Traditional IE Global Statistics on Facts Learned
Comparison with Traditional IE TextRunner vs. KnowItAll: Open IE vs. closed IE; ten relations are pre-selected.
Comparison with Traditional IE Speed: TextRunner takes 85 CPU hours for all relations in the corpus at once; KnowItAll takes 6.3 hours per relation.
Global Statistics on Facts Learned Evaluation goals: How many of the extracted tuples represent actual relationships with plausible arguments? What subset of these tuples is correct? How many of these tuples are distinct?
Global Statistics on Facts Learned Data Set used: 9 million Web pages 133 million sentences 60.5 million tuples extracted (2.2 tuples per sentence)
Filtering Criteria Keep tuples with probability ≥ 0.8, whose relation is supported by at least 10 distinct sentences, and whose relation is not over-general (the top 0.1% of relations), e.g. (NP1, has, NP2). Result: 11.3 million tuples containing 278,085 distinct relation strings.
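The three filters can be sketched directly; the thresholds come from the slide, while the data structures and example values are assumptions:

```python
def filter_tuples(tuples, relation_support, general_relations,
                  min_prob=0.8, min_support=10):
    """Keep a tuple only if: its probability is high enough, its relation is
    supported by at least `min_support` distinct sentences, and its relation
    is not an over-general one like (NP1, has, NP2)."""
    kept = []
    for (e1, rel, e2), prob in tuples:
        if prob < min_prob:
            continue
        if relation_support.get(rel, 0) < min_support:
            continue
        if rel in general_relations:
            continue
        kept.append((e1, rel, e2))
    return kept

tuples = [(("Edison", "invented", "the phonograph"), 0.95),
          (("X", "has", "Y"), 0.99),          # dropped: over-general relation
          (("A", "met", "B"), 0.5)]           # dropped: low probability
support = {"invented": 120, "has": 9000, "met": 40}
print(filter_tuples(tuples, support, general_relations={"has"}))
```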
Estimating the Correctness of Facts
Estimating the Number of Distinct Facts Only relation synonymy is addressed. Relations are merged on linguistic/syntactic grounds (punctuation, auxiliary verbs, leading stopwords, active vs. passive voice). Merging reduces the number of distinct relations to 91% of the pre-merge count.
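The syntactic merge step can be sketched as normalizing each relation string to a canonical form; the stopword and auxiliary lists are illustrative assumptions, and real passive-to-active folding would also have to swap the argument order, which this sketch ignores:

```python
STOPWORDS = {"the", "a", "an", "that", "who", "which"}   # leading stopwords
AUXILIARIES = {"is", "was", "are", "were", "be", "been", "being"}

def canonical_relation(rel):
    """Normalize a relation string so near-identical variants merge:
    strip punctuation, leading stopwords, and auxiliary verbs, and fold a
    simple passive 'X-ed by' form onto the bare verb."""
    tokens = [t.strip(".,;:!?") for t in rel.lower().split()]
    while tokens and tokens[0] in STOPWORDS:
        tokens.pop(0)
    tokens = [t for t in tokens if t not in AUXILIARIES]
    if len(tokens) == 2 and tokens[1] == "by" and tokens[0].endswith("ed"):
        tokens = tokens[:1]                  # 'invented by' -> 'invented'
    return " ".join(tokens)

print(canonical_relation("was invented by"))  # invented
print(canonical_relation("who invented"))     # invented
```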
Estimating the Number of Distinct Facts Difficulty: it is rare to find two distinct relations that are truly synonymous in all senses of each phrase, e.g. a person develops a disease vs. a scientist develops a technology. Use synonymy clusters, with human assessment at the tuple level.
Estimating the Number of Distinct Facts
Without domain-specific type checking on the arguments, near-synonyms diverge: if the first argument is the name of a scientist, "developed" is synonymous with "invented" and "created", and closely related to "patented"; without type checking, these relations pick out overlapping but quite distinct sets of tuples
It is easier for a human to assess similarity at the tuple level, where the entities ground the relationship
Starting from the filtered 11.3 million tuples, find clusters of concrete tuples (e_1, r, e_2), (e_1, q, e_2) with r ≠ q: the entities match but the relation strings are distinct; only one third of tuples belong to such synonymy clusters
Randomly sampled 100 synonymy clusters; one author judged how many distinct facts each contains
Example: a cluster of 4 tuples describes 2 distinct relations between Bletchley Park and Station X:
R_1: (Bletchley Park, was location of, Station X)
R_2: (Bletchley Park, being called, Station X), (Bletchley Park, known as, Station X), (Bletchley Park, codenamed, Station X)
Overall, roughly one quarter of the sampled tuples were reformulations of other tuples contained in the filtered set
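Finding synonymy clusters is essentially a group-by on the argument pair; a minimal sketch using the Bletchley Park example:

```python
from collections import defaultdict

def synonymy_clusters(tuples):
    """Group tuples (e1, r, e2), (e1, q, e2) with r != q: same argument
    pair, distinct relation strings. Returns only clusters of size >= 2."""
    by_args = defaultdict(set)
    for e1, rel, e2 in tuples:
        by_args[(e1, e2)].add(rel)
    return {args: rels for args, rels in by_args.items() if len(rels) >= 2}

tuples = [("Bletchley Park", "was location of", "Station X"),
          ("Bletchley Park", "being called", "Station X"),
          ("Bletchley Park", "known as", "Station X"),
          ("Bletchley Park", "codenamed", "Station X"),
          ("Edison", "invented", "the phonograph")]
clusters = synonymy_clusters(tuples)
print(len(clusters[("Bletchley Park", "Station X")]))  # 4
```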
Given the previous measurement that two thirds of the concrete fact tuples do not belong to synonymy clusters, we can compute that 2/3 + (1/3 × 3/4), or roughly 92%, of the tuples found by TextRunner express distinct assertions
This overestimates the number of unique facts, since the impact of multiple names for the same entity has not been factored in (a topic for future work)
Related work (from the paper): traditional closed IE was discussed earlier; recent efforts at large-scale extraction [Pasca et al., 2006] indicate growing interest in the problem
Sekine [2006] proposed on-demand information extraction to eliminate the customization involved in adapting IE systems to new topics: unsupervised learning automatically creates patterns and performs extraction for a user-specified topic, but it does not scale to the Web
Shinyama and Sekine's system first clusters the entire document collection into sets of articles believed to discuss similar topics, then runs named-entity recognition, coreference resolution, and deep linguistic parsing within each cluster to identify relations between entity pairs; this heavy linguistic machinery would be problematic at Web scale
Its pairwise vector-space clustering initially requires O(D²) effort for D documents, with further linguistic processing per cluster, far more expensive on large collections than TextRunner's O(D + T log T) runtime
From a collection of 28,000 newswire articles, Shinyama and Sekine discovered 101 relations
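The 92% figure follows from the two measurements: two thirds of tuples lie outside synonymy clusters (all counted as distinct), and of the one third inside clusters, roughly three quarters are distinct since one quarter are reformulations:

```python
from fractions import Fraction

outside = Fraction(2, 3)                           # tuples not in any synonymy cluster
inside_distinct = Fraction(1, 3) * Fraction(3, 4)  # in a cluster, but still distinct
distinct = outside + inside_distinct
print(distinct, float(distinct))  # 11/12 0.9166666666666666
```

11/12 ≈ 0.917, which the paper rounds to "roughly 92%".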
While it is difficult to measure the exact number of relations found by TextRunner on its 9,000,000-page corpus, it is at least two or three orders of magnitude greater than 101
Conclusions (from the paper): Open IE is an unsupervised extraction paradigm that eschews relation-specific extraction in favor of a single pass over the corpus, during which relations of interest are automatically discovered and efficiently stored
Unlike traditional IE systems, which repeatedly incur the cost of corpus analysis as each new relation is named, Open IE's one-time relation discovery lets a user name and explore relationships at interactive speeds
TextRunner, a fully implemented Open IE system, extracts massive amounts of high-quality information from a nine-million-page corpus, matching the recall of the state-of-the-art KnowItAll Web IE system while achieving higher precision
Future work: scalable methods for detecting synonyms and resolving multiple mentions of entities; learning the types of entities commonly taken by relations, to distinguish relation senses and better locate entity boundaries; unifying tuples into a graph-based structure to enable complex relational queries
Slide summary: clusters are found by (e_1, p, e_2), (e_1, r, e_2) with p ≠ r; 92% of the tuples found by TextRunner express distinct assertions (an overestimation)
Estimating the Number of Distinct Facts Challenge: find methods for detecting synonyms and resolving multiple mentions of entities
Related Work KnowItAll Project (umbrella project) IBM Watson TextRunner Demo
Questions?