Supervised Models for Coreference Resolution [Rahman & Ng, EMNLP 2009]

Many machine learning models for coreference resolution have been created, using not only different feature sets but also fundamentally different designs. Rahman & Ng compare four designs and discuss their strengths and weaknesses:
- Mention pair model
- Mention ranking model
- Entity mention model
- Cluster ranking model

Running Example

[Barack Obama]_1,1 nominated [Hillary Rodham Clinton]_2,2 as [his]_1,3 [secretary of state]_3,4 on [Monday]_4,5. [He]_1,6

Each mention appears in [brackets]. A mention is annotated as [m]_cid,mid, where:
- mid is the mention id
- cid is the cluster id

This example corresponds to the following clusters:
1: { Barack Obama, his, He }
2: { Hillary Rodham Clinton }
3: { secretary of state }
4: { Monday }

Mention Pair Model

Each training instance is a pair of mentions (m_j, m_k). An instance is labeled positive if m_j and m_k are coreferent, and negative otherwise. If all possible pairs were used, the negative instances would substantially outnumber the positive, so the following approach has been adopted:
- a positive instance is created for each anaphoric mention m_k and its closest antecedent m_j
- a negative instance is created for m_k paired with each of the intervening mentions m_{j+1}, m_{j+2}, ..., m_{k-1}

Mention Pair Example

[Barack Obama]_1,1 nominated [Hillary Rodham Clinton]_2,2 as [his]_1,3 [secretary of state]_3,4 on [Monday]_4,5. [He]_1,6

The instances for the mention pair model would be:
Positive: (He, his), (his, Barack Obama)
Negative: (He, Monday), (He, secretary of state), (his, Hillary Rodham Clinton)
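The sampling scheme above can be sketched in a few lines. This is an illustrative implementation, assuming gold mentions arrive in document order with their gold cluster ids; the function name and data layout are my own, not from the paper.

```python
# Sketch of mention-pair training instance creation (Soon et al.-style
# sampling): one positive per anaphoric mention and its closest antecedent,
# plus negatives for every intervening mention.

def make_pair_instances(mentions):
    """mentions: list of (text, cluster_id) tuples in document order.
    Returns a list of ((antecedent, mention), label) instances."""
    instances = []
    for k, (m_k, c_k) in enumerate(mentions):
        # scan right-to-left for the closest antecedent in the same cluster
        j = None
        for i in range(k - 1, -1, -1):
            if mentions[i][1] == c_k:
                j = i
                break
        if j is None:
            continue  # m_k is not anaphoric: no instances are created
        instances.append(((mentions[j][0], m_k), 1))      # positive
        for i in range(j + 1, k):                         # intervening mentions
            instances.append(((mentions[i][0], m_k), 0))  # negatives
    return instances

doc = [("Barack Obama", 1), ("Hillary Rodham Clinton", 2), ("his", 1),
       ("secretary of state", 3), ("Monday", 4), ("He", 1)]
instances = make_pair_instances(doc)
```

On the running example this reproduces exactly the two positive and three negative instances listed above.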
Post-classification Clustering

The output of a mention pair model then needs to be clustered to coordinate the independent coreference decisions. Why?
- the coreference relation should be transitive, but transitivity may be violated by independent pairwise decisions
- many candidates may be classified as coreferent with a mention

Common clustering algorithms include:
- transitive closure ("single link"): groups together all pairs that are connected by a path of links
- best first: groups a mention with the antecedent that has the highest confidence value
- most recent: groups a mention with its most recent antecedent

Problems with the Mention Pair Model

Mention pair models are the traditional approach to supervised coreference resolution. They are simple, but have several drawbacks:
- Each mention pair is considered independently of the others, so the candidate antecedents cannot be compared to each other.
- Features can only be extracted from the two mentions; cluster-level information is not available.
- A post-classification clustering step is needed.
- Computationally, this approach can be expensive: for long documents, the number of mention pairs can explode.

Entity Mention Model

An entity mention model decides whether a mention m_k is coreferent with a (partial) cluster c_j preceding m_k. A cluster is viewed as representing an entity. A training instance is a mention-cluster pair (m_k, c_j). Two types of features are used:
1. features that describe m_k
2. cluster-level features that characterize the relationship between m_k and c_j
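The "best first" and "most recent" strategies differ only in how a mention picks among the antecedents classified as coreferent. A minimal sketch, assuming a pairwise score function and a 0.5 decision threshold (both stand-ins for a trained mention-pair classifier):

```python
# Post-classification clustering over pairwise coreference scores.
# score(antecedent, mention) -> confidence that the two are coreferent.

def cluster(mentions, score, strategy="best_first", threshold=0.5):
    """mentions: list of mention strings in document order.
    Returns a cluster id for each mention."""
    cluster_of = {}
    next_id = 0
    for k, m_k in enumerate(mentions):
        # candidate antecedents: earlier mentions scored above threshold
        candidates = [(score(mentions[j], m_k), j)
                      for j in range(k) if score(mentions[j], m_k) > threshold]
        if not candidates:
            cluster_of[k] = next_id              # start a new entity
            next_id += 1
        elif strategy == "best_first":
            j = max(candidates)[1]               # highest-confidence antecedent
            cluster_of[k] = cluster_of[j]
        else:  # "most_recent"
            j = max(candidates, key=lambda sj: sj[1])[1]  # rightmost antecedent
            cluster_of[k] = cluster_of[j]
    return [cluster_of[k] for k in range(len(mentions))]
```

For example, if "he" scores 0.9 against "Obama" and 0.6 against "Clinton", best first links it to "Obama" while most recent links it to "Clinton".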
Four values were used for these cluster-level features:
- NONE: the feature is false for m_k and all mentions in c_j
- MOST-FALSE: the feature is true for m_k and less than half (but at least one) of the mentions in c_j
- MOST-TRUE: the feature is true for m_k and at least half (but not all) of the mentions in c_j
- ALL: the feature is true for m_k and all mentions in c_j

Entity Mention Model (continued)

A positive instance is created for each mention m_k and the preceding cluster to which it belongs. A negative instance is created for m_k paired with each partial cluster whose last mention appears between m_k and its closest antecedent.

When applying the classifier, mentions are processed left to right. For each m_k, an instance is created between m_k and each preceding cluster, and the closest cluster classified as coreferent is chosen. Partial clusters are created incrementally based on the predictions of the classifier on the first k-1 mentions.
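Mapping a binary pairwise feature onto these four cluster-level values is straightforward to sketch. The helper below is illustrative; the pairwise feature passed in (e.g. gender agreement) is an assumed stand-in:

```python
# Lift a binary pairwise feature to one of the four cluster-level values
# (NONE / MOST-FALSE / MOST-TRUE / ALL) described above.

def cluster_feature_value(pair_feature, m_k, cluster):
    """pair_feature(m_j, m_k) -> bool; cluster: list of mentions in c_j."""
    true_count = sum(bool(pair_feature(m_j, m_k)) for m_j in cluster)
    n = len(cluster)
    if true_count == 0:
        return "NONE"
    if true_count == n:
        return "ALL"
    if true_count * 2 < n:       # true for fewer than half of the mentions
        return "MOST-FALSE"
    return "MOST-TRUE"           # true for at least half, but not all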
Mention Ranking model Reformulates the problem in terms of ranking rather than classification: which candidate antecedent is the most probable? all candidate antecedents are considered simultaneously and a ranking is imposed among them. an SVM ranker-learning algorithm is used. The features and training instances are identical to the mention pair model except for the values of the training instances: the pair with the closest antecedent gets a value of 2 all other (m j, m k ) pairs get a value of 1 When applying the model, the candidate antecedent with the largest value produced by ranker is chosen. Cluster Ranking Model Cluster Ranking combines the benefits of both the entity mention model and the mention ranking model: the set of preceding (partial) clusters are ranked. A training instance is a mention and cluster pair: (m k, c j ) An instance is created between m k and each of its preceding clusters. The values of the training instances are: if m k belongs to c j, then the pair s value is 2 otherwise the pair s value is 1 Both mention and cluster-level features are used. When applying the model, m k is paired with each of the preceding clusters and the one with the highest rank value is chosen. Features for Individual Mentions Features between Pairs of Mentions Feature values are Yes or No. Feature values are Compatible, Incompatible, or Not Applicable.
More Features between Pairs Anaphoricity Detection Two approaches were tried to explicity detect nonanaphoric 1. An independent anaphoricity classifier was trained. The classifier is applied first, and if m k is labeled as nonanaphoric then it will not be resolved. 2. The ranking models were trained to jointly learn discourse-new relations and to find resolutions. Training is done with both anaphoric and non-anaphoric For each m k, a new instance is created for it as a new cluster. Extracting Mentions To extract system mentions, a mention detector was trained with supervised learning. Results with Gold Mentions The first set of experiments uses gold mentions: Mention extraction was cast as a sequence labeling task using IOB tags and a CRF model was created. 29 features were used of the following types: Lexical (7): target word w i and window size +/-3 around it Capitalization (4): IsAllCap, IsInitCap, IsCapPeriod, IsAllLower Morphological (8): prefixes and suffixes up to length 4 Grammatical (1): POS tag of w i Semantic (1): Named Entity Tag of w i Gazetteers (8): dictionaries of pronouns, common words, person names and titles, vehicles, locations, companies, and hyponyms of PERSON from WordNet. Conclusions: The ranking models improve precision. Joint anaphoricity detection improves both ranking models. Cluster ranking outperforms mention ranking
Results with System Mentions The second set of experiments uses system-generated Precision is lower with system mentions, but the same general trends hold. Cluster ranking seems to be the best overall model.