Lightly-Supervised Attribute Extraction

Abstract

We introduce lightly-supervised methods for extracting entity attributes from natural language text. Using these methods, we are able to extract a large number of attributes of different entities at fairly high precision from a large natural language corpus. We compare our methods against a standard pattern-based relation extractor, showing that our methods improve very considerably on that baseline extractor.

1 Introduction

Minimally supervised extraction of tuples of entities involved in a particular relation has been the subject of considerable recent investigation (Brin, 1998; Agichtein and Gravano, 2000; Pantel and Pennacchiotti, 2006). These methods start from a small set of known tuples in the relation of interest and bootstrap extractors that can find more tuples in the relation from text. For example, if the relation of interest is (corporate) acquisition, such a learned extractor may include patterns that extract the pair (Acme Corp., XYZ Inc.) from the sentence "XYZ Inc. was acquired by Acme Corp. for 10 million in cash." Such extractors for specific relations are useful in question answering and text mining applications, but they do not provide an inventory of the relations that may involve entities of a given type and that would help us characterize specific entities of that type. In this study, we investigate methods that, from a small seed set, learn how to extract a wide range of entity attributes. These attributes are expressed in natural language with, for instance, relational nouns ("the revenue of X") or reduced clauses ("a company headquartered in ..."). In contrast to previous methods that seek specific relationships like headquarters-of or instance-of and their values (Agichtein and Gravano, 2000; Pantel and Ravichandran, 2004), we seek to identify all the commonly used attributes of an entity type, such as the capital, population, and GDP attributes of countries.
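The kind of fixed surface pattern such a bootstrapped extractor might apply can be illustrated with a minimal sketch. This is not the paper's implementation; the regex, the capitalized-token heuristic for company names, and the function name are invented for illustration:

```python
import re

# A single hand-written surface pattern for the (corporate) acquisition
# relation, matching sentences like "XYZ Inc. was acquired by Acme Corp."
# Treating runs of capitalized tokens as company names is a simplification.
ACQUISITION = re.compile(
    r"(?P<acquiree>[A-Z][\w.]*(?: [A-Z][\w.]*)*) was acquired by "
    r"(?P<acquirer>[A-Z][\w.]*(?: [A-Z][\w.]*)*)"
)

def extract_pairs(sentence):
    """Return (acquirer, acquiree) pairs found by the fixed pattern."""
    return [(m.group("acquirer"), m.group("acquiree"))
            for m in ACQUISITION.finditer(sentence)]

pairs = extract_pairs("XYZ Inc. was acquired by Acme Corp. for 10 million in cash.")
print(pairs)  # [('Acme Corp.', 'XYZ Inc.')]
```

Relation-specific extractors in the cited bootstrapping systems learn many such patterns from seed tuples rather than writing them by hand.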
Attribute extraction is of potential value in search and text mining. Pasca and Van Durme (2007) show the prevalence of attribute mentions in Web queries, discuss how recognizing attributes can improve search results, and consider attribute extraction as a building block for the automatic creation of knowledge bases from text. Probst et al. (2007) state that applications in the business domain can greatly benefit from knowledge of product attributes and their values.

We adopt a semi-supervised bootstrapping approach to avoid the cost of annotating a large training set for supervised learning. Earlier systems that use bootstrapping for extraction differ in the types of resources used and the types of decisions made during the extraction process. Large unlabeled collections, including (portions of) the Web, are used either for estimating the confidence of candidate extractions or as a source of further instantiations of the relations of interest. KnowItAll (Etzioni et al., 2005) crawls the Web looking for fixed patterns corresponding to a desired relation. Since the Web provides redundant evidence for many relations of interest, the system can afford to favor precision over recall. However, when operating on smaller corpora (for example, product reviews), such methods can fail to produce useful results because of their low recall. Other systems, such as DIPRE (Brin, 1998) and Snowball (Agichtein and Gravano, 2000), assume functional relationships, for instance that each organization has a single location. Since some attributes express one-to-many relationships, we cannot rely on functionality in the present work.

Our methods build on the work of Pantel and Pennacchiotti (2006) and use co-training to recognize attributes of a new entity type. Like Collins and Singer (1999), we split features into two sets: content features that depend only on the candidate tuple, and context features that depend only on the context in which the candidate tuple occurs. Given these two sets and an initial set of seed features, the method iterates between finding relevant tuples and finding relevant patterns. By expressing the problem in terms of features rather than fixed expressions, we are able to improve recall while maintaining high precision. After the iterative steps, the list of resulting candidate tuples is re-ranked using additional features. A mutual-information-based measure is used to induce patterns and tuples that are highly correlated.

The rest of the paper is organized as follows. Previous work is discussed in Section 2. The methods are presented in Section 3, and experimental results in Section 4. We conclude with a summary and discussion of future directions.

2 Related Work

Pasca and Van Durme (2007) extract entity-attribute pairs from Web query logs, while Probst et al. (2007) extract entity-attribute-value triples from product descriptions. As far as we know, our work is the first to consider unrestricted attribute extraction from a very large text corpus.
Moreover, Pasca and Van Durme (2007) use only a small set of linguistically motivated extraction patterns, while our methods also learn the extraction patterns from text, similarly to some of the earlier semi-supervised relation extraction work (Brin, 1998; Agichtein and Gravano, 2000; Pantel and Pennacchiotti, 2006). Probst et al. (2007) extract both attributes and their values from product descriptions. However, their methods do not seem feasible for a wider range of texts, entity types, and attributes, because in general text many attributes are referred to repeatedly but their actual values are specified only rarely, and in complex linguistic contexts that require deeper analysis than these methods provide. Therefore, we focus in this work on the subproblem of identifying just the attributes.

As mentioned earlier, our methods rely on co-training (Blum and Mitchell, 1998) to exploit the availability of virtually unlimited unlabeled data. However, previous applications of co-training to information extraction (Collins and Singer, 1999) rely on a closed set of class labels to link the two views (feature sets) in the co-training algorithm: if one view assigns one of the labels to an instance, that assignment also serves as a negative instance for candidate assignments of any of the other labels by the other view. For attribute extraction, however, we do not have a closed class of possible choices. Instead, we effectively have a one-class problem, for which we use mutual information (Pantel and Pennacchiotti, 2006) to link the views.

The work presented in this paper fits within the minimally supervised pattern-based relation extraction approach previously developed by Brin (1998), Agichtein and Gravano (2000), and Pantel and Pennacchiotti (2006), and originally pioneered by Hearst (1992). In particular, our methods build on those of Espresso (Pantel and Pennacchiotti, 2006). However, there are significant differences between our methods and the previous work.
First, unlike Espresso, we do not use generic patterns. The use of generic patterns in Espresso involves a Web expansion phase during which new extraction instances are retrieved from the Web by issuing queries to a search engine; our methods work instead on a fixed corpus without auxiliary outside sources (although outside resources could be incorporated into our methods). Second, we use both a generic co-training-based ranking method and a final re-ranking step. The ranking scheme used by Pantel and Pennacchiotti (2006) can be seen as a particular case of ours. Third, our task does not meet the constraints of previous work. In particular, some of that work assumes a functional dependency in the pairs being extracted (Brin, 1998; Agichtein and Gravano, 2000), which does not hold for the general entity-attribute relations we consider.

3 Methods

We adapted the co-training algorithm (DL-CoTrain) presented in Collins and Singer (1999) to learn a decision list (DL) given examples from only one class. We start with a review of DL-CoTrain and then describe PMI-CoTrain, our mutual-information-based modification for one-class problems.

3.1 Multiple-Class Co-training (DL-CoTrain)

For this algorithm, we represent an instance x by the set of its binary features x = {x_1, ..., x_k} ⊆ X, where X is the set of possible features; class labels are drawn from a finite set Y. A decision list is then a function d : X × Y → [0, 1] that maps a feature and a label to a confidence value d(x, y). In other terms, d represents a set of decision rules x → y with weights d(x, y). The decision list d can be used to label an instance as follows:

    ŷ = arg max_{y ∈ Y} max_{x ∈ x} d(x, y)

That is, the instance gets the label of the decision rule in d with the highest weight. The algorithm learns by inducing two decision lists, one for features based on the content of an entity mention (typically spelling features) and the other for features based on the context in which the entity mention occurs. Content features might include contains(bush) or lowercase-starts-with(b); context features might include lexical-context("the president") for the instance "George Bush, the president of the United States". Starting from a seed set of content decision rules for each class, the entire corpus is labeled. In this process, examples with no matching decision rule may be left unlabeled. Using the current set of labeled examples, candidate context features x are induced, where the weight for a candidate context rule with that feature is:

    d(x, y) = (Count(x, y) + α) / Σ_{y' ∈ Y} (Count(x, y') + α)    (1)

At each iteration, a maximum of n = 5 rules with confidence value greater than p_min are added to the corresponding decision list for each class.
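A minimal sketch of this decision-list machinery may help. The training instances, feature strings, and α value below are invented for illustration; the smoothing follows equation (1):

```python
from collections import Counter

def rule_confidences(labeled, alpha=0.1):
    """Smoothed rule weights following equation (1):
    d(x, y) = (Count(x, y) + alpha) / sum over y' of (Count(x, y') + alpha)."""
    counts, labels, feats = Counter(), set(), set()
    for features, label in labeled:
        labels.add(label)
        for x in features:
            feats.add(x)
            counts[x, label] += 1
    d = {}
    for x in feats:
        z = sum(counts[x, y] + alpha for y in labels)
        for y in labels:
            d[x, y] = (counts[x, y] + alpha) / z
    return d

def classify(features, d):
    """Label an instance with the single highest-weight matching rule."""
    cands = [(w, y) for (x, y), w in d.items() if x in features]
    return max(cands)[1] if cands else None  # None = left unlabeled

data = [({"contains(bush)", "ctx(the president)"}, "person"),
        ({"contains(ibm)", "ctx(shares of)"}, "company")]
d = rule_confidences(data)
print(classify({"contains(bush)"}, d))  # person
```

In DL-CoTrain proper, `rule_confidences` would be applied alternately to the content view and the context view, relabeling the corpus between passes.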
Once context rules have been induced, the corpus is relabeled using the context rules to induce new content rules. This process is repeated until a maximum of N_max = 2500 rules has been gathered. In this manner, co-training exploits the redundancy in natural language to produce a largely unsupervised classifier.

3.2 Single-Class Co-training (PMI-CoTrain)

In practice, the DL-CoTrain method performs well when there is a fixed, small number of class labels. However, for attribute extraction there are many different attributes applicable to different entity types. Providing seeds for each attribute would be impractical, especially since we want our methods to discover attributes with minimal supervision. Instead, we propose the use of relation-specific positive seeds to bootstrap the co-training algorithm. For the DL-CoTrain algorithm with just one label, equation (1) is uniformly one. To alleviate this problem, we split our feature set into two redundant views (spelling and context), X_1 ∪ X_2, and score features in the two sets based on pointwise mutual information. The confidence estimation scheme is similar to the one outlined in (Pantel and Pennacchiotti, 2006):

    Conf(x_1) = [ Σ_{x_2 ∈ X_2} (pmi(x_1, x_2) / max_pmi) · Conf(x_2) ] / |X_2|    (2)

    Conf(x_2) = [ Σ_{x_1 ∈ X_1} (pmi(x_1, x_2) / max_pmi) · Conf(x_1) ] / |X_1|    (3)

where Conf(x_1) is the confidence of feature x_1, Conf(x_2) is the confidence of feature x_2, and pmi(x_1, x_2) is the pointwise mutual information (Cover and Thomas, 1991) between features x_1 and x_2. max_pmi is the maximum pmi over all features x_1 ∈ X_1 and x_2 ∈ X_2. It is worth noting that Conf(x_1) and Conf(x_2) are recursively defined. Conf(x_1) = 1.0 for the manually supplied seed features. The pointwise mutual information for two features x_1 ∈ X_1 and x_2 ∈ X_2 is calculated as follows:

    pmi(x_1, x_2) = DF · log [ P(x_1, x_2) / (P(x_1) P(x_2)) ]
                  = DF · log [ (|x_1, x_2| · |*, *|) / (|x_1, *| · |*, x_2|) ]

where |x_1, x_2|, |x_1, *|, |*, x_2|, and |*, *| are the co-occurrence counts of x_1 with x_2, of x_1 with any x_2 ∈ X_2, of any x_1 ∈ X_1 with x_2, and of any x_1 ∈ X_1 with any x_2 ∈ X_2, respectively. A discount factor (DF), as suggested by Pantel and Ravichandran (2004), may be used to prevent low-count features from getting a higher rank. To score a particular feature pair (x_1, x_2) we use:

    Conf_PT(x_1, x_2) = Conf_PT(x_1) · Conf_PT(x_2)    (4)

The PMI-CoTrain method proceeds similarly to the DL-CoTrain algorithm, going back and forth between content and context features; the confidence values in equations (2) and (3) are used in place of the probability estimate in equation (1). The method allows us to use arbitrary features in the redundant views, unlike earlier methods (Pantel and Pennacchiotti, 2006) where the entire context or spelling has to be used. However, care needs to be taken that the generated features are neither too specific (lower recall) nor too general (lower precision). Our experiments below suggest that this is possible. In such a scheme, either a fixed number of iterations is used or the iterations continue until confidence estimates drop below a certain threshold.

3.3 Extraction Algorithm

Our extraction algorithm proceeds in two stages, both utilizing the PMI-CoTrain method described above:

1. Patterns-Tuples extraction: First, a set of instances is labeled using the seeds, with the confidence of the extracted tuple features initialized to 1.0 as for seed features. The PMI-CoTrain algorithm is then run on the feature set, which is divided into features based on patterns and features based on tuples. The confidence values of these features are updated based on equations (2) and (3). The algorithm is run for a fixed number of iterations N, and finally the tuples and instances with the highest confidence scores are output.
2. Keys-Values re-ranking (KV-ranking): This stage takes as input the set of tuples extracted by stage 1 and outputs a list of tuples sorted by scores computed using equation (5). Here the two feature sets are the keys and the values of the (entity, attribute) tuples, respectively. For example, (IBM, employees) is an instance of the relation (Company, Attribute); IBM is the key and employees is the value. The confidence of a particular entity-attribute pair is modified from equation (4) to be:

    Conf_KV(x_1, x_2) = Conf_PT(x_1, x_2) · Conf_KV(x_1) · Conf_KV(x_2)    (5)

We use equations (2) and (3) to estimate Conf_KV(x_1) and Conf_KV(x_2), respectively. Analogously to the Patterns-Tuples view, the intuition here is to have increased confidence in a value that is associated with many reliable keys and whose association with those keys is strong. In the experiments reported in this paper, we do not use bootstrapping during KV-ranking and instead use a fixed number of iterations. Since KV-ranking generates separate confidences for keys and values, ranked key and value lists are important by-products of KV-ranking. The final output of the algorithm is the set of tuples extracted and re-ranked using PMI-CoTrain.

4 Experimental Results

In this section we present an empirical evaluation of the attribute extraction architecture presented in this paper. The architecture is tested on two relations: (Company, Attribute) and (Country, Attribute).

4.1 Baseline

Our baseline relation extractor, BL, is our own limited implementation of the Espresso system described in (Pantel and Pennacchiotti, 2006). First, as mentioned in Section 2, we do not exploit generic patterns and we skip the Web expansion phase of Espresso, as the objective here is to measure the effectiveness of the algorithms on a fixed corpus. Second, we differ from Espresso in the heuristic used to select the set of reliable patterns in the first iteration.
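The confidence propagation of equations (2) and (3) can be sketched as follows. The toy co-occurrence counts and feature strings are invented, and the discount factor DF is omitted for brevity:

```python
import math
from collections import Counter

def pmi_table(cooc):
    """pmi(x1, x2) = log(|x1,x2| * |*,*| / (|x1,*| * |*,x2|)) from co-occurrence counts."""
    n = sum(cooc.values())
    c1, c2 = Counter(), Counter()
    for (x1, x2), c in cooc.items():
        c1[x1] += c
        c2[x2] += c
    return {(x1, x2): math.log(c * n / (c1[x1] * c2[x2]))
            for (x1, x2), c in cooc.items()}

def cotrain_conf(cooc, seeds, iters=3):
    """Alternate equations (2) and (3): a feature in one view gains confidence
    from high-PMI links to confident features in the other view."""
    pmi = pmi_table(cooc)
    max_pmi = max(pmi.values())
    max_pmi = max_pmi if max_pmi > 0 else 1.0
    X1 = {x1 for x1, _ in cooc}
    X2 = {x2 for _, x2 in cooc}
    conf1 = {x: (1.0 if x in seeds else 0.0) for x in X1}
    conf2 = {x: 0.0 for x in X2}
    for _ in range(iters):
        for x2 in X2:                      # equation (3)
            conf2[x2] = sum(pmi.get((x1, x2), 0.0) / max_pmi * conf1[x1]
                            for x1 in X1) / len(X1)
        for x1 in X1 - seeds:              # equation (2); seeds stay at 1.0
            conf1[x1] = sum(pmi.get((x1, x2), 0.0) / max_pmi * conf2[x2]
                            for x2 in X2) / len(X2)
    return conf1, conf2

# Toy example: the seed tuple shares a pattern with one candidate tuple.
cooc = {("syria, population", "ATTR of ENT"): 4,
        ("france, capital", "ATTR of ENT"): 3,
        ("ibm, sued", "ENT was sued"): 5}
conf1, conf2 = cotrain_conf(cooc, seeds={"syria, population"})
print(conf1["france, capital"] > conf1["ibm, sued"])  # True
```

The candidate that shares a pattern with the seed accumulates confidence, while the unrelated one stays at zero; equations (4) and (5) then combine per-view confidences into pair scores.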
Apart from these differences, we tried our level best to match Espresso. We generalized sentences by replacing terminological expressions¹ with the token TR. The BL method ranks all patterns using Espresso's pattern reliability measure (Section 3.2 of (Pantel and Pennacchiotti, 2006)). All patterns except the top k are discarded, where k is set to the number of reliable patterns from the previous iteration plus one. Relation tuples extracted by the selected patterns are scored using Espresso's instance reliability measure (Section 3.2 of (Pantel and Pennacchiotti, 2006)), and the highest-scoring 200 tuples are selected as reliable tuples, which are used to rank patterns in the next iteration. As in Espresso, the iterative extraction and ranking continues until the average pattern reliability score decreases by more than 50% from the previous iteration.

The systems used in the empirical evaluation are:

BL: The baseline as described in Section 4.1. A tuple is assigned a valid score only if it is extracted by at least one of the reliable patterns; all other tuples are excluded from evaluation.

PT: The PMI-CoTraining-based extractor as described in Section 3.2, using only pattern and tuple identity features.

PT+: PT with additional surrounding-context and lexical features.

PT++KV: KV-ranking applied to the extractions generated by PT+.

SE+KV: The initial seeds are used to extract patterns from the corpus. These seed-extracting patterns are further used to generate other, non-seed tuples, and KV-ranking is performed on these extractions.

With the flexibility given by the co-training framework, examples of the types of features are:

Patterns-Tuples: In this view, entire pattern strings and tuples can be used as features.
An example of a spelling feature is tuple="syria, population", and an example of a context feature is pattern="LEFT has a RIGHT", where LEFT and RIGHT are placeholders for syria and population respectively. In PT+, additional features indicating surrounding context, like surrctxt="percent of && LEFT's RIGHT", are also used.

Keys-Values: Spelling features consist of the entire key string, but additional features like tokens and non-alphabetic characters in the key can also be used, similarly to (Collins and Singer, 1999). For example, given the key "Goldman & Sachs Co." we have the spelling feature key-string="Goldman & Sachs Co."; other possible features are contains=goldman and nonalpha=&. For the context, we used the entire value string as a feature (e.g., value-string=employees).

¹ Definition given in (Pantel and Pennacchiotti, 2006).

For evaluation, we manually judged 50 randomly selected tuples at various ranks from the set of non-seed extracted tuples. The final precision score is the average of two random runs of this nature. Unless otherwise specified, precision in Tables 8 and 9 refers to the percentage of correct tuples among these 100 (50 × 2) evaluated tuples.

4.2 Company-Attribute Experiment

For this experiment, we use a newswire corpus of 122 million tokens with articles collected from the Wall Street Journal, AFP, and Xinhua News. The first 100 company names from the Fortune-500 list are used as company seeds. In addition, 12 seed attributes are manually selected, shown in Table 5. The final set of seed tuples is obtained by taking the cross product of these two sets. Rather than concentrating on the size of the seed set, we concentrate on the practicality of the seed generation process: we feel that it is acceptable to have a non-tiny seed set as long as it can be cheaply obtained. Results for the Company-Attribute experiment are shown in Tables 6 and 8. The variation of precision with increasing size of the ranked attribute set is shown in Table 6.
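The seed generation by cross product is straightforward; a sketch with small made-up seed lists (the experiment itself uses 100 company names and the 12 attributes of Table 5):

```python
from itertools import product

# Illustrative subsets only; the paper pairs the first 100 Fortune-500
# company names with the 12 attribute seeds listed in Table 5.
company_seeds = ["exxon mobil", "wal-mart stores", "general motors"]
attribute_seeds = ["headquarters", "chairman", "ceo", "revenue", "employees"]

# Every (company, attribute) combination becomes a seed tuple.
seed_tuples = set(product(company_seeds, attribute_seeds))
print(len(seed_tuples))  # 3 companies x 5 attributes = 15 seed tuples
```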
Table 8 shows the precision for the top 1000 (1k), 5000 (5k), and 10000 (10k) extracted tuples. The bold entries in both tables highlight the algorithm that performs best in terms of precision at a particular rank. PT+, PT++KV, and SE+KV generally outperform the baseline BL and the PT method. PT+ gives higher precision on top-ranking extractions, while SE+KV performs better as the size of the ranked set increases. Some of the co-training features discovered after running PT+ are shown in Table 1.

4.3 Country-Attribute Experiment

For this experiment, we use the same corpus as the one described in Section 4.2. We use the list of 191 United Nations member countries as country seeds and manually select 8 seed attributes (shown in Table 5). Once again, the final set of seed tuples is obtained by taking the cross product of these two sets. Experimental results are shown in Tables 7 and 9. For this entity-attribute relation, we again find that PT++KV and SE+KV outperform the other methods. Ranked attribute precision at varying ranks is shown in Table 7; precision of ranked tuples at varying ranks is shown in Table 9.

In Table 7, the inferior ranking produced by PT++KV (as compared to SE+KV) needs further explanation. Since there is no initial ranking and pruning, the SE phase in SE+KV extracted around 2.8 million candidate extractions, the majority of which are incorrect. On the other hand, we restricted the number of PT+ extractions to 299k. Because of the ranking in PT+ and the limited number of candidate extractions used, most of the noisy extractions are already removed from the input to the KV phase in PT++KV, but that is not the case for the tuples fed to the KV phase of SE+KV. Hence, KV (in SE+KV) shows less confidence in an attribute occurring with so many other non-country entities, thereby settling on a more accurate ranking of attributes, as shown in Table 7. The KV phase in PT++KV does not have such incorrect entities to exploit and thereby produces an inferior ranking of attributes.
However, PT++KV ranks the keys (countries) well, and since the final confidence in a tuple depends on the confidence of PT+ in the tuple, the confidence of PT++KV in the key, and the confidence of PT++KV in the value (see equation (5)), the tuple confidence generated by PT++KV is unaffected by this phenomenon (see Table 9).

4.4 Discussion

In general, we find that in both the Company-Attribute and Country-Attribute experiments, additional context features that include surrounding words help improve precision significantly. For example, for the Company-Attribute relation, "ATTR of COMPANY" is a low-precision generic pattern, as it can extract a myriad of other relations (e.g., "height of John") in addition to the desired Company-Attribute instances. However, if we add the surrounding context "chairman and" to this pattern and use the concatenated feature (surrctxt="chairman and && ATTR of COMPANY") as an extractor, it becomes a high-precision Company-Attribute extractor. The improvements of PT+ over BL and PT in Tables 8 and 9 reflect this fact. Hence, by using such features we are able to exploit generic patterns even without using any external resource (e.g., the Web search results in (Pantel and Pennacchiotti, 2006)). Some useful features discovered by PT+ are shown in Tables 1 and 2. The PT++KV and SE+KV methods are the best algorithms on both data sets, although for the Company-Attribute relation PT+ performs well on top-ranking extractions.

Table 6: Non-seed attribute precision at various ranks for the (Company, Attr) relation. Bold entries indicate superior performance at a particular rank.

    SE+KV     90%   75%   56%   40%
    PT++KV   100%   85%   70%   70%

Table 7: Non-seed attribute precision at various ranks for the (Country, Attr) relation.

    SE+KV     80%   90%   88%   82%
    PT++KV    40%   65%   64%   58%

Table 8: Ranked non-seed tuple precision at various ranks for the (Company, Attr) relation.

             @1k   @5k   @10k
    BL       91%   42%   23%
    PT       83%   33%   30%
    PT+      93%   58%   37%
    PT++KV   87%   70%   53%
    SE+KV    52%   69%   74%

Table 9: Ranked non-seed tuple precision at various ranks for the (Country, Attr) relation. BL extracted only 3114 non-seed tuples, hence only its top-ranking 1000 non-seed extractions are evaluated.

             @1k   @5k   @10k
    BL       29%    -     -
    PT       11%   15%   17%
    PT+      56%   44%   42%
    PT++KV   88%   70%   70%
    SE+KV    84%   81%   74%

Table 1: Sample high-ranking spelling and context features automatically discovered by PT+ during co-training of the (Company, Attribute) relation.

    Spelling: tuple="bell atlantic, chairman"; tuple="air canada, president"; tuple="cub foods, president"; tuple="cognizant corp., executive vice president"
    Context:  pattern="ATTR and ceo of COMPANY"; surrctxt="chairman and && ATTR of COMPANY"; surrctxt="executive at && COMPANY, to be its ATTR"

Table 2: Sample high-ranking spelling and context features automatically discovered by PT+ during co-training of the (Country, Attribute) relation.

    Spelling: tuple="uruguay, people"; tuple="argentina, citizens"; tuple="india, muslims"; tuple="ireland, voters"; tuple="china, farmers"
    Context:  surrctxt="percent of && COUNTRY's ATTR"; pattern="ATTR of the republic of COUNTRY"; surrctxt="meeting in the && ATTR of COUNTRY"; surrctxt="city, && ATTR of COUNTRY"
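How such a concatenated surrounding-context feature might be assembled can be sketched as follows. The whitespace tokenization, window size, and function name are simplifications invented for illustration:

```python
def surrctxt_feature(tokens, attr_idx, ent_idx, window=2):
    """Generalize the attribute...entity span and prepend the preceding tokens,
    yielding features like 'surrctxt=chairman and && ATTR of COMPANY'."""
    span = tokens[attr_idx:ent_idx + 1]      # e.g. ['ceo', 'of', 'ibm']
    span[0], span[-1] = "ATTR", "COMPANY"    # generalize the slot fillers
    left = " ".join(tokens[max(0, attr_idx - window):attr_idx])
    return f"surrctxt={left} && {' '.join(span)}"

toks = "he was named chairman and ceo of ibm yesterday".split()
print(surrctxt_feature(toks, toks.index("ceo"), toks.index("ibm")))
# surrctxt=chairman and && ATTR of COMPANY
```

The extra left-context tokens are what turn the generic "ATTR of COMPANY" pattern into a feature specific enough to extract the relation at high precision.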
It is clear that the methods presented in this paper perform better than previous relation extraction techniques such as BL and PT, without use of the Web. Thus, a feature-based approach can be applied to extraction problems where the relation is many-to-one and training seeds are provided for one relation only. Summarizing, SE+KV and PT++KV are reliable attribute extraction methods; using additional features (e.g., surrounding context) helps improve performance over previous methods (Pantel and Pennacchiotti, 2006); and KV-ranking boosts precision further by re-ranking the output generated by a separate extraction algorithm.

5 Conclusion

This paper presents one of the earliest attempts at extracting attributes of entities from a large natural language corpus. We have presented novel co-training-based extraction and ranking mechanisms for attribute extraction, which open up the possibility of using rich features for pattern and tuple reliability estimation. We also present another co-training-based ranking algorithm, KV-ranking, which has been shown to be very effective in the experiments reported in the paper. Items for future work include the exploration of richer features for improved ranking, more comparisons with other relation extraction algorithms, and demonstrating the utility of extracted attributes in appropriate NLP tasks.

References

Eugene Agichtein and Luis Gravano. 2000. Snowball: Extracting Relations from Large Plain-Text Collections. In Proceedings of the 5th ACM International Conference on Digital Libraries.

Table 3: Top ten non-seed Country attributes extracted by SE+KV.

    Relation: (Country, Attribute)
    Top non-seed attributes: presidents, border, governments, foreign ministers, government, relations, num-num victory, countries, people, ties

Table 4: Top ten non-seed Company attributes extracted by PT++KV.

    Relation: (Company, Attribute)
    Top non-seed attributes: chief executive officer, suit, chief executive, unit, lawsuit, part, joint venture, announcement, news, executive vice president

Table 5: (Company, Attribute) and (Country, Attribute) seeds used in the experiments.

    (Company, Attribute) — Key seeds: top 100 Fortune-500 companies. Value seeds: type, headquarters, chairman, ceo, products, revenue, operating income, net income, employees, subsidiaries, website, headquarter.
    (Country, Attribute) — Key seeds: 191 UN member countries. Value seeds: capital, largest city, official language, president, area, population, gdp, currency.

Avrim Blum and Tom Mitchell. 1998. Combining Labeled and Unlabeled Data with Co-Training. In Proceedings of the 11th Annual Conference on Computational Learning Theory.

Sergey Brin. 1998. Extracting Patterns and Relations from the World Wide Web. In International Workshop on The World Wide Web and Databases.

Michael Collins and Yoram Singer. 1999. Unsupervised Models for Named Entity Classification. In Proceedings of EMNLP 1999.

T. M. Cover and J. A. Thomas. 1991. Elements of Information Theory. John Wiley & Sons.

Oren Etzioni, M. Cafarella, D. Downey, A.-M. Popescu, T. Shaked, S. Soderland, D. S. Weld, and A. Yates. 2005. Unsupervised Named-Entity Extraction from the Web: An Experimental Study. Artificial Intelligence, 165(1).

Marti Hearst. 1992. Automatic Acquisition of Hyponyms from Large Text Corpora. In Proceedings of the Fourteenth International Conference on Computational Linguistics, Nantes, France.

Patrick Pantel and Marco Pennacchiotti. 2006. Espresso: Leveraging Generic Patterns for Automatically Harvesting Semantic Relations. In Proceedings of COLING/ACL-06, Sydney, Australia.

Patrick Pantel and Deepak Ravichandran. 2004. Automatically Labeling Semantic Classes. In Proceedings of HLT/NAACL-04, Boston, MA.

Marius Pasca and Benjamin Van Durme. 2007. What You Seek is What You Get: Extraction of Class Attributes from Query Logs. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-07).

Katharina Probst, Rayid Ghani, Marko Krema, Andy Fano, and Yan Liu. 2007. Semi-supervised Learning of Attribute-Value Pairs from Product Descriptions. In Proceedings of the International Joint Conference on Artificial Intelligence (IJCAI-07).


More information

PRIS at TAC2012 KBP Track

PRIS at TAC2012 KBP Track PRIS at TAC2012 KBP Track Yan Li, Sijia Chen, Zhihua Zhou, Jie Yin, Hao Luo, Liyin Hong, Weiran Xu, Guang Chen, Jun Guo School of Information and Communication Engineering Beijing University of Posts and

More information

Semantic Annotation using Horizontal and Vertical Contexts

Semantic Annotation using Horizontal and Vertical Contexts Semantic Annotation using Horizontal and Vertical Contexts Mingcai Hong, Jie Tang, and Juanzi Li Department of Computer Science & Technology, Tsinghua University, 100084. China. {hmc, tj, ljz}@keg.cs.tsinghua.edu.cn

More information

Question Answering Approach Using a WordNet-based Answer Type Taxonomy

Question Answering Approach Using a WordNet-based Answer Type Taxonomy Question Answering Approach Using a WordNet-based Answer Type Taxonomy Seung-Hoon Na, In-Su Kang, Sang-Yool Lee, Jong-Hyeok Lee Department of Computer Science and Engineering, Electrical and Computer Engineering

More information

Multi-View Clustering with Web and Linguistic Features for Relation Extraction

Multi-View Clustering with Web and Linguistic Features for Relation Extraction 2010 12th International Asia-Pacific Web Conference Multi-View Clustering with Web and Linguistic Features for Relation Extraction Yulan Yan, Haibo Li, Yutaka Matsuo, Zhenglu Yang, Mitsuru Ishizuka The

More information

Automatic Detection of Outdated Information in Wikipedia Infoboxes. Thong Tran 1 and Tru H. Cao 2

Automatic Detection of Outdated Information in Wikipedia Infoboxes. Thong Tran 1 and Tru H. Cao 2 Automatic Detection of Outdated Information in Wikipedia Infoboxes Thong Tran 1 and Tru H. Cao 2 1 Da Lat University and John von Neumann Institute - VNUHCM thongt@dlu.edu.vn 2 Ho Chi Minh City University

More information

Entity Extraction from the Web with WebKnox

Entity Extraction from the Web with WebKnox Entity Extraction from the Web with WebKnox David Urbansky, Marius Feldmann, James A. Thom and Alexander Schill Abstract This paper describes a system for entity extraction from the web. The system uses

More information

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets

CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets CIRGDISCO at RepLab2012 Filtering Task: A Two-Pass Approach for Company Name Disambiguation in Tweets Arjumand Younus 1,2, Colm O Riordan 1, and Gabriella Pasi 2 1 Computational Intelligence Research Group,

More information

Building Query Optimizers for Information Extraction: The SQoUT Project

Building Query Optimizers for Information Extraction: The SQoUT Project Building Query Optimizers for Information Extraction: The SQoUT Project Alpa Jain 1, Panagiotis Ipeirotis 2, Luis Gravano 1 1 Columbia University, 2 New York University ABSTRACT Text documents often embed

More information

Ghent University-IBCN Participation in TAC-KBP 2015 Cold Start Slot Filling task

Ghent University-IBCN Participation in TAC-KBP 2015 Cold Start Slot Filling task Ghent University-IBCN Participation in TAC-KBP 2015 Cold Start Slot Filling task Lucas Sterckx, Thomas Demeester, Johannes Deleu, Chris Develder Ghent University - iminds Gaston Crommenlaan 8 Ghent, Belgium

More information

Bootstrapping via Graph Propagation

Bootstrapping via Graph Propagation Bootstrapping via Graph Propagation Max Whitney and Anoop Sarkar Simon Fraser University, School of Computing Science Burnaby, BC V5A 1S6, Canada {mwhitney,anoop}@sfu.ca Abstract Bootstrapping a classifier

More information

Evaluating the Usefulness of Sentiment Information for Focused Crawlers

Evaluating the Usefulness of Sentiment Information for Focused Crawlers Evaluating the Usefulness of Sentiment Information for Focused Crawlers Tianjun Fu 1, Ahmed Abbasi 2, Daniel Zeng 1, Hsinchun Chen 1 University of Arizona 1, University of Wisconsin-Milwaukee 2 futj@email.arizona.edu,

More information

CS 6093 Lecture 5. Basic Information Extraction. Cong Yu

CS 6093 Lecture 5. Basic Information Extraction. Cong Yu CS 6093 Lecture 5 Basic Information Extraction Cong Yu Groups and Projects Primarily supervised by Fernando P03: Detecting Trends in Facebook/Twitter Feeds Maggie, Nitin, Quentin P12: Learning to Rank

More information

CS 224W Final Report Group 37

CS 224W Final Report Group 37 1 Introduction CS 224W Final Report Group 37 Aaron B. Adcock Milinda Lakkam Justin Meyer Much of the current research is being done on social networks, where the cost of an edge is almost nothing; the

More information

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li

Learning to Match. Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li Learning to Match Jun Xu, Zhengdong Lu, Tianqi Chen, Hang Li 1. Introduction The main tasks in many applications can be formalized as matching between heterogeneous objects, including search, recommendation,

More information

Query Difficulty Prediction for Contextual Image Retrieval

Query Difficulty Prediction for Contextual Image Retrieval Query Difficulty Prediction for Contextual Image Retrieval Xing Xing 1, Yi Zhang 1, and Mei Han 2 1 School of Engineering, UC Santa Cruz, Santa Cruz, CA 95064 2 Google Inc., Mountain View, CA 94043 Abstract.

More information

Locating Complex Named Entities in Web Text

Locating Complex Named Entities in Web Text Locating Complex Named Entities in Web Text Doug Downey, Matthew Broadhead, and Oren Etzioni Turing Center Department of Computer Science and Engineering University of Washington Box 352350 Seattle, WA

More information

Using Search-Logs to Improve Query Tagging

Using Search-Logs to Improve Query Tagging Using Search-Logs to Improve Query Tagging Kuzman Ganchev Keith Hall Ryan McDonald Slav Petrov Google, Inc. {kuzman kbhall ryanmcd slav}@google.com Abstract Syntactic analysis of search queries is important

More information

Presented by: Dimitri Galmanovich. Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu

Presented by: Dimitri Galmanovich. Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu Presented by: Dimitri Galmanovich Petros Venetis, Alon Halevy, Jayant Madhavan, Marius Paşca, Warren Shen, Gengxin Miao, Chung Wu 1 When looking for Unstructured data 2 Millions of such queries every day

More information

Outline. NLP Applications: An Example. Pointwise Mutual Information Information Retrieval (PMI-IR) Search engine inefficiencies

Outline. NLP Applications: An Example. Pointwise Mutual Information Information Retrieval (PMI-IR) Search engine inefficiencies A Search Engine for Natural Language Applications AND Relational Web Search: A Preview Michael J. Cafarella (joint work with Michele Banko, Doug Downey, Oren Etzioni, Stephen Soderland) CSE454 University

More information

A Context-Aware Relation Extraction Method for Relation Completion

A Context-Aware Relation Extraction Method for Relation Completion A Context-Aware Relation Extraction Method for Relation Completion B.Sivaranjani, Meena Selvaraj Assistant Professor, Dept. of Computer Science, Dr. N.G.P Arts and Science College, Coimbatore, India M.Phil

More information

Improving Suffix Tree Clustering Algorithm for Web Documents

Improving Suffix Tree Clustering Algorithm for Web Documents International Conference on Logistics Engineering, Management and Computer Science (LEMCS 2015) Improving Suffix Tree Clustering Algorithm for Web Documents Yan Zhuang Computer Center East China Normal

More information

Papers for comprehensive viva-voce

Papers for comprehensive viva-voce Papers for comprehensive viva-voce Priya Radhakrishnan Advisor : Dr. Vasudeva Varma Search and Information Extraction Lab, International Institute of Information Technology, Gachibowli, Hyderabad, India

More information

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN:

IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, ISSN: IJREAT International Journal of Research in Engineering & Advanced Technology, Volume 1, Issue 5, Oct-Nov, 20131 Improve Search Engine Relevance with Filter session Addlin Shinney R 1, Saravana Kumar T

More information

Leveraging Set Relations in Exact Set Similarity Join

Leveraging Set Relations in Exact Set Similarity Join Leveraging Set Relations in Exact Set Similarity Join Xubo Wang, Lu Qin, Xuemin Lin, Ying Zhang, and Lijun Chang University of New South Wales, Australia University of Technology Sydney, Australia {xwang,lxue,ljchang}@cse.unsw.edu.au,

More information

Handling Missing Values via Decomposition of the Conditioned Set

Handling Missing Values via Decomposition of the Conditioned Set Handling Missing Values via Decomposition of the Conditioned Set Mei-Ling Shyu, Indika Priyantha Kuruppu-Appuhamilage Department of Electrical and Computer Engineering, University of Miami Coral Gables,

More information

Extracting Relation Descriptors with Conditional Random Fields

Extracting Relation Descriptors with Conditional Random Fields Extracting Relation Descriptors with Conditional Random Fields Yaliang Li, Jing Jiang, Hai Leong Chieu, Kian Ming A. Chai School of Information Systems, Singapore Management University, Singapore DSO National

More information

A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining

A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining A Roadmap to an Enhanced Graph Based Data mining Approach for Multi-Relational Data mining D.Kavinya 1 Student, Department of CSE, K.S.Rangasamy College of Technology, Tiruchengode, Tamil Nadu, India 1

More information

Learning to Extract Attribute Values from a Search Engine with Few Examples *

Learning to Extract Attribute Values from a Search Engine with Few Examples * Learning to Extract Attribute Values from a Search Engine with Few Examples * Xingxing Zhang, Tao Ge, and Zhifang Sui Key Laboratory of Computational Linguistics (Peking University), Ministry of Education

More information

Ontology Based Prediction of Difficult Keyword Queries

Ontology Based Prediction of Difficult Keyword Queries Ontology Based Prediction of Difficult Keyword Queries Lubna.C*, Kasim K Pursuing M.Tech (CSE)*, Associate Professor (CSE) MEA Engineering College, Perinthalmanna Kerala, India lubna9990@gmail.com, kasim_mlp@gmail.com

More information

Joint Entity Resolution

Joint Entity Resolution Joint Entity Resolution Steven Euijong Whang, Hector Garcia-Molina Computer Science Department, Stanford University 353 Serra Mall, Stanford, CA 94305, USA {swhang, hector}@cs.stanford.edu No Institute

More information

CS345 Data Mining. Mining the Web for Structured Data

CS345 Data Mining. Mining the Web for Structured Data CS345 Data Mining Mining the Web for Structured Data Our view of the web so far Web pages as atomic units Great for some applications e.g., Conventional web search But not always the right model Going

More information

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching

Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Effect of log-based Query Term Expansion on Retrieval Effectiveness in Patent Searching Wolfgang Tannebaum, Parvaz Madabi and Andreas Rauber Institute of Software Technology and Interactive Systems, Vienna

More information

Web-Scale Extraction of Structured Data

Web-Scale Extraction of Structured Data Web-Scale Extraction of Structured Data Michael J. Cafarella University of Washington mjc@cs.washington.edu Jayant Madhavan Google Inc. jayant@google.com Alon Halevy Google Inc. halevy@google.com ABSTRACT

More information

Using Query History to Prune Query Results

Using Query History to Prune Query Results Using Query History to Prune Query Results Daniel Waegel Ursinus College Department of Computer Science dawaegel@gmail.com April Kontostathis Ursinus College Department of Computer Science akontostathis@ursinus.edu

More information

Keywords Data alignment, Data annotation, Web database, Search Result Record

Keywords Data alignment, Data annotation, Web database, Search Result Record Volume 5, Issue 8, August 2015 ISSN: 2277 128X International Journal of Advanced Research in Computer Science and Software Engineering Research Paper Available online at: www.ijarcsse.com Annotating Web

More information

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES

TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES TERM BASED WEIGHT MEASURE FOR INFORMATION FILTERING IN SEARCH ENGINES Mu. Annalakshmi Research Scholar, Department of Computer Science, Alagappa University, Karaikudi. annalakshmi_mu@yahoo.co.in Dr. A.

More information

Semi-Supervised Clustering with Partial Background Information

Semi-Supervised Clustering with Partial Background Information Semi-Supervised Clustering with Partial Background Information Jing Gao Pang-Ning Tan Haibin Cheng Abstract Incorporating background knowledge into unsupervised clustering algorithms has been the subject

More information

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees

Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Estimating Missing Attribute Values Using Dynamically-Ordered Attribute Trees Jing Wang Computer Science Department, The University of Iowa jing-wang-1@uiowa.edu W. Nick Street Management Sciences Department,

More information

Semi supervised clustering for Text Clustering

Semi supervised clustering for Text Clustering Semi supervised clustering for Text Clustering N.Saranya 1 Assistant Professor, Department of Computer Science and Engineering, Sri Eshwar College of Engineering, Coimbatore 1 ABSTRACT: Based on clustering

More information

Question Answering Using XML-Tagged Documents

Question Answering Using XML-Tagged Documents Question Answering Using XML-Tagged Documents Ken Litkowski ken@clres.com http://www.clres.com http://www.clres.com/trec11/index.html XML QA System P Full text processing of TREC top 20 documents Sentence

More information

MACHINE LEARNING FOR SOFTWARE MAINTAINABILITY

MACHINE LEARNING FOR SOFTWARE MAINTAINABILITY MACHINE LEARNING FOR SOFTWARE MAINTAINABILITY Anna Corazza, Sergio Di Martino, Valerio Maggio Alessandro Moschitti, Andrea Passerini, Giuseppe Scanniello, Fabrizio Silverstri JIMSE 2012 August 28, 2012

More information

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY

INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY INTERNATIONAL JOURNAL OF PURE AND APPLIED RESEARCH IN ENGINEERING AND TECHNOLOGY A PATH FOR HORIZING YOUR INNOVATIVE WORK REVIEW PAPER ON IMPLEMENTATION OF DOCUMENT ANNOTATION USING CONTENT AND QUERYING

More information

Semi-supervised learning and active learning

Semi-supervised learning and active learning Semi-supervised learning and active learning Le Song Machine Learning II: Advanced Topics CSE 8803ML, Spring 2012 Combining classifiers Ensemble learning: a machine learning paradigm where multiple learners

More information

Wikipedia and the Web of Confusable Entities: Experience from Entity Linking Query Creation for TAC 2009 Knowledge Base Population

Wikipedia and the Web of Confusable Entities: Experience from Entity Linking Query Creation for TAC 2009 Knowledge Base Population Wikipedia and the Web of Confusable Entities: Experience from Entity Linking Query Creation for TAC 2009 Knowledge Base Population Heather Simpson 1, Stephanie Strassel 1, Robert Parker 1, Paul McNamee

More information

Scalable Trigram Backoff Language Models

Scalable Trigram Backoff Language Models Scalable Trigram Backoff Language Models Kristie Seymore Ronald Rosenfeld May 1996 CMU-CS-96-139 School of Computer Science Carnegie Mellon University Pittsburgh, PA 15213 This material is based upon work

More information

Using the Web to Reduce Data Sparseness in Pattern-based Information Extraction

Using the Web to Reduce Data Sparseness in Pattern-based Information Extraction Using the Web to Reduce Data Sparseness in Pattern-based Information Extraction Sebastian Blohm and Philipp Cimiano Institute AIFB, University of Karlsruhe, Germany {blohm,cimiano}@aifb.uni-karlsruhe.de

More information

Revealing the Modern History of Japanese Philosophy Using Digitization, Natural Language Processing, and Visualization

Revealing the Modern History of Japanese Philosophy Using Digitization, Natural Language Processing, and Visualization Revealing the Modern History of Japanese Philosophy Using Digitization, Natural Language Katsuya Masuda *, Makoto Tanji **, and Hideki Mima *** Abstract This study proposes a framework to access to the

More information

Corroborating Answers from Multiple Web Sources

Corroborating Answers from Multiple Web Sources Corroborating Answers from Multiple Web Sources Minji Wu Rutgers University minji-wu@cs.rutgers.edu Amélie Marian Rutgers University amelie@cs.rutgers.edu ABSTRACT The Internet has changed the way people

More information

ImgSeek: Capturing User s Intent For Internet Image Search

ImgSeek: Capturing User s Intent For Internet Image Search ImgSeek: Capturing User s Intent For Internet Image Search Abstract - Internet image search engines (e.g. Bing Image Search) frequently lean on adjacent text features. It is difficult for them to illustrate

More information

LANGUAGE MODEL SIZE REDUCTION BY PRUNING AND CLUSTERING

LANGUAGE MODEL SIZE REDUCTION BY PRUNING AND CLUSTERING LANGUAGE MODEL SIZE REDUCTION BY PRUNING AND CLUSTERING Joshua Goodman Speech Technology Group Microsoft Research Redmond, Washington 98052, USA joshuago@microsoft.com http://research.microsoft.com/~joshuago

More information

Sense-based Information Retrieval System by using Jaccard Coefficient Based WSD Algorithm

Sense-based Information Retrieval System by using Jaccard Coefficient Based WSD Algorithm ISBN 978-93-84468-0-0 Proceedings of 015 International Conference on Future Computational Technologies (ICFCT'015 Singapore, March 9-30, 015, pp. 197-03 Sense-based Information Retrieval System by using

More information

Ranking Clustered Data with Pairwise Comparisons

Ranking Clustered Data with Pairwise Comparisons Ranking Clustered Data with Pairwise Comparisons Alisa Maas ajmaas@cs.wisc.edu 1. INTRODUCTION 1.1 Background Machine learning often relies heavily on being able to rank the relative fitness of instances

More information

News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages

News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages Bonfring International Journal of Data Mining, Vol. 7, No. 2, May 2017 11 News Filtering and Summarization System Architecture for Recognition and Summarization of News Pages Bamber and Micah Jason Abstract---

More information

Hybrid Feature Selection for Modeling Intrusion Detection Systems

Hybrid Feature Selection for Modeling Intrusion Detection Systems Hybrid Feature Selection for Modeling Intrusion Detection Systems Srilatha Chebrolu, Ajith Abraham and Johnson P Thomas Department of Computer Science, Oklahoma State University, USA ajith.abraham@ieee.org,

More information

Motivating Ontology-Driven Information Extraction

Motivating Ontology-Driven Information Extraction Motivating Ontology-Driven Information Extraction Burcu Yildiz 1 and Silvia Miksch 1, 2 1 Institute for Software Engineering and Interactive Systems, Vienna University of Technology, Vienna, Austria {yildiz,silvia}@

More information

CS 6093 Lecture 6. Survey of Information Extraction Systems. Cong Yu

CS 6093 Lecture 6. Survey of Information Extraction Systems. Cong Yu CS 6093 Lecture 6 Survey of Information Extraction Systems Cong Yu Reminders Next week is spring break, no class Office hour will be by appointment only Midterm report due on 5p ET March 21 st An hour

More information

Information Discovery, Extraction and Integration for the Hidden Web

Information Discovery, Extraction and Integration for the Hidden Web Information Discovery, Extraction and Integration for the Hidden Web Jiying Wang Department of Computer Science University of Science and Technology Clear Water Bay, Kowloon Hong Kong cswangjy@cs.ust.hk

More information

Semantics Isn t Easy Thoughts on the Way Forward

Semantics Isn t Easy Thoughts on the Way Forward Semantics Isn t Easy Thoughts on the Way Forward NANCY IDE, VASSAR COLLEGE REBECCA PASSONNEAU, COLUMBIA UNIVERSITY COLLIN BAKER, ICSI/UC BERKELEY CHRISTIANE FELLBAUM, PRINCETON UNIVERSITY New York University

More information

Latent Relation Representations for Universal Schemas

Latent Relation Representations for Universal Schemas University of Massachusetts Amherst From the SelectedWorks of Andrew McCallum 2013 Latent Relation Representations for Universal Schemas Sebastian Riedel Limin Yao Andrew McCallum, University of Massachusetts

More information

AT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands

AT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands AT&T: The Tag&Parse Approach to Semantic Parsing of Robot Spatial Commands Svetlana Stoyanchev, Hyuckchul Jung, John Chen, Srinivas Bangalore AT&T Labs Research 1 AT&T Way Bedminster NJ 07921 {sveta,hjung,jchen,srini}@research.att.com

More information

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY IJCSNS International Journal of Computer Science and Network Security, VOL.9 No.4, April 2009 349 WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY Mohammed M. Sakre Mohammed M. Kouta Ali M. N. Allam Al Shorouk

More information

A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet

A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet A Method for Semi-Automatic Ontology Acquisition from a Corporate Intranet Joerg-Uwe Kietz, Alexander Maedche, Raphael Volz Swisslife Information Systems Research Lab, Zuerich, Switzerland fkietz, volzg@swisslife.ch

More information

Web-based Question Answering: Revisiting AskMSR

Web-based Question Answering: Revisiting AskMSR Web-based Question Answering: Revisiting AskMSR Chen-Tse Tsai University of Illinois Urbana, IL 61801, USA ctsai12@illinois.edu Wen-tau Yih Christopher J.C. Burges Microsoft Research Redmond, WA 98052,

More information

Domain Adaptation Using Domain Similarity- and Domain Complexity-based Instance Selection for Cross-domain Sentiment Analysis

Domain Adaptation Using Domain Similarity- and Domain Complexity-based Instance Selection for Cross-domain Sentiment Analysis Domain Adaptation Using Domain Similarity- and Domain Complexity-based Instance Selection for Cross-domain Sentiment Analysis Robert Remus rremus@informatik.uni-leipzig.de Natural Language Processing Group

More information

MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI

MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI MEASURING SEMANTIC SIMILARITY BETWEEN WORDS AND IMPROVING WORD SIMILARITY BY AUGUMENTING PMI 1 KAMATCHI.M, 2 SUNDARAM.N 1 M.E, CSE, MahaBarathi Engineering College Chinnasalem-606201, 2 Assistant Professor,

More information

CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM. Please purchase PDF Split-Merge on to remove this watermark.

CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM. Please purchase PDF Split-Merge on   to remove this watermark. 119 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 120 CHAPTER V ADAPTIVE ASSOCIATION RULE MINING ALGORITHM 5.1. INTRODUCTION Association rule mining, one of the most important and well researched

More information

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems

Salford Systems Predictive Modeler Unsupervised Learning. Salford Systems Salford Systems Predictive Modeler Unsupervised Learning Salford Systems http://www.salford-systems.com Unsupervised Learning In mainstream statistics this is typically known as cluster analysis The term

More information

Using PageRank in Feature Selection

Using PageRank in Feature Selection Using PageRank in Feature Selection Dino Ienco, Rosa Meo, and Marco Botta Dipartimento di Informatica, Università di Torino, Italy fienco,meo,bottag@di.unito.it Abstract. Feature selection is an important

More information

Exploring a Few Good Tuples From Text Databases

Exploring a Few Good Tuples From Text Databases Exploring a Few Good Tuples From Text Databases Alpa Jain, Divesh Srivastava Columbia University, AT&T Labs-Research Abstract Information extraction from text databases is a useful paradigm to populate

More information

Exploring the Query Expansion Methods for Concept Based Representation

Exploring the Query Expansion Methods for Concept Based Representation Exploring the Query Expansion Methods for Concept Based Representation Yue Wang and Hui Fang Department of Electrical and Computer Engineering University of Delaware 140 Evans Hall, Newark, Delaware, 19716,

More information

Using Support Vector Machines to Eliminate False Minutiae Matches during Fingerprint Verification

Using Support Vector Machines to Eliminate False Minutiae Matches during Fingerprint Verification Using Support Vector Machines to Eliminate False Minutiae Matches during Fingerprint Verification Abstract Praveer Mansukhani, Sergey Tulyakov, Venu Govindaraju Center for Unified Biometrics and Sensors

More information

Comparison of FP tree and Apriori Algorithm

Comparison of FP tree and Apriori Algorithm International Journal of Engineering Research and Development e-issn: 2278-067X, p-issn: 2278-800X, www.ijerd.com Volume 10, Issue 6 (June 2014), PP.78-82 Comparison of FP tree and Apriori Algorithm Prashasti

More information

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation

Data Mining. Part 2. Data Understanding and Preparation. 2.4 Data Transformation. Spring Instructor: Dr. Masoud Yaghini. Data Transformation Data Mining Part 2. Data Understanding and Preparation 2.4 Spring 2010 Instructor: Dr. Masoud Yaghini Outline Introduction Normalization Attribute Construction Aggregation Attribute Subset Selection Discretization

More information

Pixel-Pair Features Selection for Vehicle Tracking

Pixel-Pair Features Selection for Vehicle Tracking 2013 Second IAPR Asian Conference on Pattern Recognition Pixel-Pair Features Selection for Vehicle Tracking Zhibin Zhang, Xuezhen Li, Takio Kurita Graduate School of Engineering Hiroshima University Higashihiroshima,

More information

2 Approaches to worldwide web information retrieval

2 Approaches to worldwide web information retrieval The WEBFIND tool for finding scientific papers over the worldwide web. Alvaro E. Monge and Charles P. Elkan Department of Computer Science and Engineering University of California, San Diego La Jolla,

More information

Encoding Words into String Vectors for Word Categorization

Encoding Words into String Vectors for Word Categorization Int'l Conf. Artificial Intelligence ICAI'16 271 Encoding Words into String Vectors for Word Categorization Taeho Jo Department of Computer and Information Communication Engineering, Hongik University,

More information

Text, Knowledge, and Information Extraction. Lizhen Qu

Text, Knowledge, and Information Extraction. Lizhen Qu Text, Knowledge, and Information Extraction Lizhen Qu A bit about Myself PhD: Databases and Information Systems Group (MPII) Advisors: Prof. Gerhard Weikum and Prof. Rainer Gemulla Thesis: Sentiment Analysis

More information

Extracting Algorithms by Indexing and Mining Large Data Sets

Extracting Algorithms by Indexing and Mining Large Data Sets Extracting Algorithms by Indexing and Mining Large Data Sets Vinod Jadhav 1, Dr.Rekha Rathore 2 P.G. Student, Department of Computer Engineering, RKDF SOE Indore, University of RGPV, Bhopal, India Associate

More information

Closing the Loop in Webpage Understanding

Closing the Loop in Webpage Understanding IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING 1 Closing the Loop in Webpage Understanding Chunyu Yang, Student Member, IEEE, Yong Cao, Zaiqing Nie, Jie Zhou, Senior Member, IEEE, and Ji-Rong Wen

More information

TextJoiner: On-demand Information Extraction with Multi-Pattern Queries

TextJoiner: On-demand Information Extraction with Multi-Pattern Queries TextJoiner: On-demand Information Extraction with Multi-Pattern Queries Chandra Sekhar Bhagavatula, Thanapon Noraset, Doug Downey Electrical Engineering and Computer Science Northwestern University {csb,nor.thanapon}@u.northwestern.edu,ddowney@eecs.northwestern.edu

More information

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web

Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Better Contextual Suggestions in ClueWeb12 Using Domain Knowledge Inferred from The Open Web Thaer Samar 1, Alejandro Bellogín 2, and Arjen P. de Vries 1 1 Centrum Wiskunde & Informatica, {samar,arjen}@cwi.nl

More information

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics

A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics A Comparison of Text-Categorization Methods applied to N-Gram Frequency Statistics Helmut Berger and Dieter Merkl 2 Faculty of Information Technology, University of Technology, Sydney, NSW, Australia hberger@it.uts.edu.au

More information

CS299 Detailed Plan. Shawn Tice. February 5, The high-level steps for classifying web pages in Yioop are as follows:

CS299 Detailed Plan. Shawn Tice. February 5, The high-level steps for classifying web pages in Yioop are as follows: CS299 Detailed Plan Shawn Tice February 5, 2013 Overview The high-level steps for classifying web pages in Yioop are as follows: 1. Create a new classifier for a unique label. 2. Train it on a labelled

More information

Modeling the Evolution of Product Entities

Modeling the Evolution of Product Entities Modeling the Evolution of Product Entities by Priya Radhakrishnan, Manish Gupta, Vasudeva Varma in The 37th Annual ACM SIGIR CONFERENCE Gold Coast, Australia. Report No: IIIT/TR/2014/-1 Centre for Search

More information

Collaborative Ranking between Supervised and Unsupervised Approaches for Keyphrase Extraction

Collaborative Ranking between Supervised and Unsupervised Approaches for Keyphrase Extraction The 2014 Conference on Computational Linguistics and Speech Processing ROCLING 2014, pp. 110-124 The Association for Computational Linguistics and Chinese Language Processing Collaborative Ranking between

More information