WSI using Graphs of Collocations. Paper by: Ioannis P. Klapaftis and Suresh Manandhar Presented by: Ahmad R. Shahid

1 WSI using Graphs of Collocations Paper by: Ioannis P. Klapaftis and Suresh Manandhar Presented by: Ahmad R. Shahid

2 Word Sense Induction (WSI) Identifying the different senses (uses) of a word. Finds applications in Information Retrieval (IR) and Machine Translation (MT). Most of the work in WSI is based on the vector-space model: each context of a target word is represented as a vector of features; context vectors are clustered, and the resulting clusters are taken to represent the induced senses.

3 Word Sense Induction (WSI)

4 Graph based methods Agirre et al. (2007) used co-occurrence graphs. Vertices are words, and two vertices share an edge if they co-occur in the same context. Each edge receives a weight indicating the strength of the relationship between the two words (vertices). Co-occurrence graphs have highly dense subgraphs representing the different clusters (senses) the target word may have. Each cluster has a hub; hubs are highly connected vertices.
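
To make the construction concrete, here is a minimal Python sketch of such a co-occurrence graph, assuming contexts are already tokenized word lists; edge weights here simply count shared contexts (the weighting used in the actual systems is more elaborate).

```python
from collections import defaultdict
from itertools import combinations

def build_cooccurrence_graph(contexts):
    """Vertices are words; two words share an edge if they co-occur in a
    context; the weight counts how many contexts the pair shares."""
    weights = defaultdict(int)                 # (word_a, word_b) -> count
    for words in contexts:                     # each context: list of tokens
        for a, b in combinations(sorted(set(words)), 2):
            weights[(a, b)] += 1
    return weights

contexts = [["satellite", "system", "connection", "television", "network"],
            ["network", "installation", "connection", "software", "system"]]
print(build_cooccurrence_graph(contexts)[("connection", "system")])  # -> 2
```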

5 Graph based methods Each cluster (induced sense) consists of a set of words that are semantically related to the particular sense. Graph-based methods assume that each context word is related to one and only one sense of the target word, which is not always valid.

6 Graph based methods Consider the contexts for the target word network: (1) To install our satellite system please call our technicians and book an appointment. Connection to our television network is free of charge. (2) To connect to the BT network, proceed with the installation of the connection software and then reboot your system. Two senses are used: 1) Television Network, 2) Computer Network.

7 Graph based methods Any hard-clustering approach would assign system to only one of the two senses of network, even though it is related to both. The same is true for connection. The two words cannot be filtered out as noise, since they are semantically related to the target word.

8 WSI using Graph Clustering

9 Small Lexical Worlds

10 Small Lexical Worlds Small worlds. The characteristic path length (L): the mean length of the shortest path between two nodes of the graph. Let d_{min}(i, j) be the length of the shortest path between two nodes i and j, and let N be the total number of nodes; then L = \frac{1}{N(N-1)} \sum_{i \neq j} d_{min}(i, j). Clustering coefficient (C): defined on the next slide.
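
As a sketch, L can be computed for an unweighted, connected graph with one breadth-first search per node, averaging the distances over all ordered pairs (the slide does not prescribe an algorithm; this is one straightforward choice).

```python
from collections import deque

def characteristic_path_length(adj):
    """Mean shortest-path length over all ordered pairs, one BFS per source."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        queue = deque([src])
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in dist:
                    dist[v] = dist[u] + 1
                    queue.append(v)
        total += sum(dist.values())            # d(src, src) = 0 adds nothing
        pairs += len(dist) - 1                 # reachable nodes other than src
    return total / pairs

adj = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}   # a path graph a-b-c
print(round(characteristic_path_length(adj), 2))  # -> 1.33
```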

11 Clustering Coefficient For each node i, one can define a local clustering coefficient C_i equal to the proportion of connections E(\Gamma(i)) that actually exist between the neighbors \Gamma(i) of that node: C_i = E(\Gamma(i)) / \binom{|\Gamma(i)|}{2}. For a node i with four neighbors, the maximum number of connections is \binom{4}{2} = 6; if five of these connections actually exist, C_i = 5/6 \approx 0.83. The global coefficient C is the mean of the local coefficients: C = \frac{1}{N} \sum_{i=1}^{N} \frac{E(\Gamma(i))}{\binom{|\Gamma(i)|}{2}}. It is 0 for a totally disconnected graph and 1 for a complete graph.
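
The same definitions translate directly into code. A sketch assuming an adjacency dict, reproducing the slide's example of a node with four neighbors of which five of the six possible neighbor links exist:

```python
def clustering_coefficient(adj):
    """Global C: mean over nodes of E(Gamma(i)) / C(|Gamma(i)|, 2)."""
    local = []
    for i, neigh in adj.items():
        k = len(neigh)
        if k < 2:
            local.append(0.0)                  # no neighbor pair can connect
            continue
        links = sum(1 for a in neigh for b in neigh
                    if a < b and b in adj[a])  # neighbor pairs that exist
        local.append(links / (k * (k - 1) / 2))
    return sum(local) / len(local)

# Vertex "i" has four neighbors with 5 of 6 links present, so C_i = 5/6.
adj = {"i": ["a", "b", "c", "d"], "a": ["i", "b", "c", "d"],
       "b": ["i", "a", "c", "d"], "c": ["i", "a", "b"], "d": ["i", "a", "b"]}
print(round(clustering_coefficient(adj), 2))   # -> 0.9 (mean of the C_i)
```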

12 Small World Networks They lie somewhere between regular graphs and random graphs. For a random graph of N nodes whose mean degree is k: L_{rand} \approx \frac{\log(N)}{\log(k)} and C_{rand} \approx \frac{2k}{N}. Small world graphs are characterized by: L \approx L_{rand} and C \gg C_{rand}.
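
These criteria can be checked numerically, as in the sketch below; the factor-of-two and factor-of-ten margins are illustrative assumptions, not values from the paper.

```python
import math

def looks_small_world(L, C, n, k):
    """Small-world test: L close to the random expectation, C far above it."""
    L_rand = math.log(n) / math.log(k)   # random-graph path length
    C_rand = 2 * k / n                   # random-graph clustering (slide's formula)
    return L < 2 * L_rand and C > 10 * C_rand

print(looks_small_world(L=3.2, C=0.4, n=10_000, k=8))  # -> True
```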

13 Small World Networks At a constant mean degree, the number of nodes can increase exponentially, whereas the characteristic path length will only increase in a linear way ( six degrees of separation ). In a small world there will be bundles, or highly connected groups: friends of a given individual will be much more likely to be acquainted with each other than would be predicted if the edges of the graph were simply drawn at random.

14 Small World Networks

15 Adam Smith Every individual... generally, indeed, neither intends to promote the public interest, nor knows how much he is promoting it. By preferring the support of domestic to that of foreign industry he intends only his own security; and by directing that industry in such a manner as its produce may be of the greatest value, he intends only his own gain, and he is in this, as in many other cases, led by an invisible hand to promote an end which was no part of his intention. The Wealth of Nations, Book IV, Chapter II. Greed is good : the basic theme of capitalism.

16 Co-occurrence Graphs Co-occurrence graphs are small world graphs: the number of nodes can increase exponentially, whereas the characteristic path length will only increase in a linear way. They are also scale-free: they contain a small number of highly connected hubs and a large number of weakly connected nodes.

17 Co-occurrence Graphs Since they are small-world networks, they contain highly dense subgraphs (around hubs) which represent the different clusters (senses) the target word may have.

18 High Density Components Different uses of a target word form highly interconnected bundles in a small world of co-occurrences (high density components). Barrage (in the sense of a hydraulic dam) must co-occur frequently with eau, ouvrage, riviere, crue, irrigation, production, electricite (water, work, river, flood, irrigation, production, electricity), and those words themselves are likely to be interconnected.

19 High Density Components Detecting the different uses of a word amounts to isolating the high density components in the co-occurrence graph. Most exact graph-partitioning techniques are NP-hard; given that the graphs have several thousand nodes and edges, only approximate, heuristic-based methods can be employed.

20 Detecting Root Hubs In every high-density component, one of the nodes has a higher degree than the others, called the component's root hub. For the most frequent use of barrage (hydraulic dam), the root hub is the word eau (water). The isolated root hub is deleted along with all of its neighbors; a root hub must have at least 6 neighbors (a threshold determined empirically).

21 Minimum Spanning Tree (MST) After isolating the root hub along with all its neighbors, the next root hub is identified and the process is repeated. An MST is then computed by taking the target word as the root and making its first level consist of the previously identified root hubs.

22 Minimum Spanning Tree (MST)

23 Veronis Algorithm Iteratively finds the candidate root hub (the one with the highest degree). The root hub is deleted along with its direct neighbors from the graph, but only if it satisfies certain heuristics: a minimum number of vertices in a hub, a minimum average weight between the candidate root hub and its adjacent neighbors, and a minimum frequency of a root hub.
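
A minimal sketch of this iterative root-hub detection, assuming an adjacency dict, an edge-weight dict keyed by frozenset pairs, and per-node frequencies. The minimum of 6 neighbors comes from the slides; the other threshold values below are placeholder assumptions.

```python
def detect_root_hubs(adj, weight, freq,
                     min_neighbors=6, min_avg_weight=0.2, min_freq=10):
    """Iteratively pick the highest-degree vertex as a candidate root hub,
    keep it only if it passes the heuristics, then delete it together with
    all of its direct neighbors."""
    adj = {u: set(vs) for u, vs in adj.items()}   # work on a copy
    hubs, rejected = [], set()
    while True:
        candidates = [u for u in adj if u not in rejected and adj[u]]
        if not candidates:
            break
        hub = max(candidates, key=lambda u: len(adj[u]))
        neigh = set(adj[hub])
        if len(neigh) < min_neighbors:            # nothing dense enough remains
            break
        avg_w = sum(weight.get(frozenset((hub, v)), 0.0)
                    for v in neigh) / len(neigh)
        if avg_w < min_avg_weight or freq.get(hub, 0) < min_freq:
            rejected.add(hub)                     # fails the heuristics: skip it
            continue
        hubs.append(hub)
        for u in neigh | {hub}:                   # remove hub and its neighbors
            for v in adj.pop(u, set()):
                adj.get(v, set()).discard(u)
    return hubs
```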

24 Collocational Graphs for WSI Let bc be the base corpus, consisting of paragraphs containing the target word tw. The aim is to induce the senses of tw given bc as the only input. Let rc be a large reference corpus; the British National Corpus (BNC) has been used for this study.

25 Corpus pre-processing Initially, tw is removed from bc. Each paragraph of bc and rc is POS-tagged; only nouns are kept and lemmatized, since they are less ambiguous than verbs, adverbs or adjectives. At this stage each paragraph p_i, both in bc and rc, is a list of lemmatized nouns.

26 Corpus pre-processing Each paragraph p_i in bc contains nouns which are semantically related to tw, as well as common nouns which are noisy, in the sense that they are not semantically related to tw. In order to filter out the noise, they used a technique based on corpora comparison using the log-likelihood ratio (G^2).

27 Corpus pre-processing The aim is to check whether the distribution of a word w_i, given that it appears in bc, is similar to its distribution given that it appears in rc, i.e. p(w_i | bc) = p(w_i | rc). This is the null hypothesis. If it is true, G^2 will have a small value, and w_i should be removed from the paragraphs of bc.

28 Corpus pre-processing If the probability of the occurrence of a word in the base corpus is the same as in the reference corpus, then it loses its discriminating power and must be weeded out. In other words, if the observed frequency of a word is very close to its expected value, then it really hasn't got much to say: p(w_i | bc) = p(w_i | rc) = p(w_i).

29 Corpus pre-processing The expressions are: G^2 = 2 \sum_{i,j} n_{ij} \log \frac{n_{ij}}{m_{ij}}, where m_{ij} = \frac{(\sum_{k=1}^{2} n_{ik}) (\sum_{k=1}^{2} n_{kj})}{N}. The n_{ij} correspond to values in the observed values (OT) table; the m_{ij} correspond to values in the expected values (ET) table. The values in ET are calculated from the values in OT using the equation for m_{ij}.

30 Corpus pre-processing They created two noun frequency lists: lbc, derived from the bc corpus, and lrc, derived from the rc corpus. For each word w_i in lbc, they created two contingency tables: OT contains the observed counts taken from lbc and lrc, and ET contains the expected values under the model of independence.

31 Corpus pre-processing Then they calculated G^2, where n_{ij} is the (i, j) cell of OT, m_{ij} is the (i, j) cell of ET, and N = \sum_{i,j} n_{ij}. lbc is first filtered by removing words which have a relative frequency in lbc less than in lrc. The resulting lbc is then sorted by the G^2 values. The G^2-sorted list is used to remove from each paragraph of bc the words whose G^2 value is less than a pre-specified threshold (parameter p_1). At the end of this stage, each paragraph p_i of bc is a list of nouns which are assumed to be topically related to the target word tw.
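
A sketch of the G^2 computation for a single word, built from a 2x2 contingency table of its counts versus all other words in bc and rc; the corpus totals in the usage line are made-up numbers.

```python
import math

def g2(freq_bc, freq_rc, total_bc, total_rc):
    """Log-likelihood G^2 for one word from its 2x2 contingency table:
    the word vs. all other words, in bc vs. rc."""
    ot = [[freq_bc, freq_rc],
          [total_bc - freq_bc, total_rc - freq_rc]]   # observed table OT
    n = total_bc + total_rc
    row = [sum(r) for r in ot]
    col = [ot[0][0] + ot[1][0], ot[0][1] + ot[1][1]]
    g = 0.0
    for i in range(2):
        for j in range(2):
            m = row[i] * col[j] / n                   # expected table ET
            if ot[i][j] > 0:
                g += ot[i][j] * math.log(ot[i][j] / m)
    return 2 * g

# A word proportionally much more frequent in bc than in rc scores high:
print(round(g2(50, 100, 10_000, 1_000_000), 1))
```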

32 Corpus pre-processing

33 Collocational Graph A key problem at this stage is the determination of related nouns. They can be grouped into collocations, where each collocation is assigned a weight. In this study collocations of size 2 are considered, i.e. pairs of nouns.

34 Collocational Graph Collocations are detected by generating all the \binom{n}{2} combinations for each paragraph of length n, and then measuring their frequency. The frequency of a collocation is the number of paragraphs which contain that collocation. Consider the following paragraphs: To install our satellite system please call our technicians and book an appointment. Connection to our television network is free of charge. To connect to the BT network, proceed with the installation of the connection software and then reboot your system. All the \binom{n}{2} combinations for each paragraph of our example would provide us with 24 unique collocations, such as {system, technician}, {system, connection}, etc.
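
A sketch of this collocation-extraction step, counting each pair at most once per paragraph, on toy paragraphs that mirror the slide's example (assuming the nouns have already been extracted):

```python
from collections import Counter
from itertools import combinations

def extract_collocations(paragraphs):
    """Generate all size-2 noun combinations per paragraph and count, for
    each collocation, the number of paragraphs that contain it."""
    counts = Counter()
    for nouns in paragraphs:                       # each paragraph: noun list
        pairs = {tuple(sorted(p)) for p in combinations(set(nouns), 2)}
        counts.update(pairs)                       # one count per paragraph
    return counts

paragraphs = [["satellite", "system", "technician", "connection",
               "television", "network"],
              ["network", "installation", "connection", "software", "system"]]
freq = extract_collocations(paragraphs)
print(freq[("connection", "system")])              # -> 2 (in both paragraphs)
```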

35 Collocational Graph Although the use of G^2 aims at keeping in bc the words which are related to the target one, this does not necessarily mean that their pairwise combinations are useful for discriminating the senses of tw. For example, ambiguous collocations which are related to both senses of tw, such as the {system, connection} collocation, should not be taken into account. To circumvent this problem, each extracted collocation is assigned a weight, which measures the relative frequency of the two nouns co-occurring. Collocations are usually weighted using information-theoretic measures such as pointwise mutual information (PMI).

36 Collocational Graph Conditional probabilities produce better results than PMI, which overestimates rare events; hence they used conditional probabilities. Let freq_{ij} denote the number of paragraphs in which nouns i, j co-occur, and freq_j denote the number of paragraphs where noun j occurs. Since G^2 allows us to capture the words which are related to tw, the calculations for collocation frequency take place on the whole SemEval-2007 WSI (SWSI) corpus (27132 paragraphs), to deal with data sparsity and to determine whether a candidate collocation appears frequently enough to be included in the graphs.

37 Collocational Graph We can measure the conditional probability p(i | j) = \frac{freq_{ij}}{freq_j}, and p(j | i) in a similar manner. The final weight applied to collocation c_{ij} is the average of the two conditional probabilities: wc_{ij} = \frac{p(i | j) + p(j | i)}{2}. They only extracted collocations which had frequency (parameter p_2) and weight (parameter p_3) higher than pre-specified thresholds. The collocational graph can now be created, in which each extracted and weighted collocation is represented as a vertex, and two vertices share an edge if they co-occur in one or more paragraphs of bc.
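
A sketch of the weighting and threshold filtering just described, assuming two paragraph-frequency dicts (one for single nouns, one for sorted noun pairs); p2 and p3 are the thresholds named on the slide, with no specific values implied.

```python
def collocation_weight(i, j, pair_freq, noun_freq):
    """wc_ij = (p(i|j) + p(j|i)) / 2, with p(i|j) = freq_ij / freq_j,
    where all frequencies count paragraphs."""
    f_ij = pair_freq[tuple(sorted((i, j)))]
    return (f_ij / noun_freq[j] + f_ij / noun_freq[i]) / 2

def select_collocations(pair_freq, noun_freq, p2, p3):
    """Keep collocations whose frequency exceeds p2 and weight exceeds p3."""
    kept = {}
    for (i, j), f in pair_freq.items():
        if f < p2:
            continue
        w = collocation_weight(i, j, pair_freq, noun_freq)
        if w >= p3:
            kept[(i, j)] = w
    return kept

noun_freq = {"connection": 2, "system": 2, "television": 1}
pair_freq = {("connection", "system"): 2, ("system", "television"): 1}
print(select_collocations(pair_freq, noun_freq, p2=2, p3=0.5))
# -> {('connection', 'system'): 1.0}
```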

38 Collocational Graph The next stage is to weight the edges of the initial collocational graph G, as well as to discover new edges connecting the vertices of G. The constructed graph is sparse, since we are attempting to identify rare events, i.e. edges connecting collocations. The solution to this data-sparsity problem is smoothing.

39 Weighting and Populating For each vertex (collocation c_i), they associated a vertex vector VC_i containing the vertices (collocations) which share an edge with c_i in graph G. A table (not reproduced in this transcription) showed an example of two vertices, cnn_nbc and nbc_news, which are not connected in the graph G of the target word network.

40 Weighting and Populating In the next step, the similarity between each pair of vertex vectors VC_i and VC_j is calculated. Lee [4] showed that the Jaccard similarity coefficient (JC) has superior performance over other symmetric similarity measures such as cosine, the L1 norm, Euclidean distance, Jensen-Shannon divergence, etc.

41 Weighting and Populating Using JC for estimating the similarity between vertex vectors yields: JC(VC_i, VC_j) = \frac{|VC_i \cap VC_j|}{|VC_i \cup VC_j|}. Two collocations c_i and c_j are said to be mutually similar if c_i is the most similar collocation to c_j and the other way around.
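
A sketch of the Jaccard coefficient on vertex vectors (represented as neighbor sets), together with the mutual-similarity test; the collocation names below are hypothetical.

```python
def jaccard(vc_i, vc_j):
    """Jaccard coefficient of two vertex vectors represented as sets."""
    return len(vc_i & vc_j) / len(vc_i | vc_j) if vc_i | vc_j else 0.0

def mutually_similar(ci, cj, vectors):
    """True iff each collocation is the other's most similar one."""
    def nearest(c):
        return max((o for o in vectors if o != c),
                   key=lambda o: jaccard(vectors[c], vectors[o]))
    return nearest(ci) == cj and nearest(cj) == ci

vectors = {"cnn_nbc":  {"cnn_tv", "network_tv", "nbc_cbs"},
           "nbc_news": {"network_tv", "nbc_cbs", "news_tv"},
           "bt_net":   {"software_net", "connection_net"}}
print(mutually_similar("cnn_nbc", "nbc_news", vectors))   # -> True
```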

42 Weighting and Populating Two mutually similar collocations c_i and c_j are clustered, with the result that an occurrence of a collocation c_k with one of c_i, c_j is also counted as an occurrence with the other collocation. In the table (slide 39), if cnn_nbc and nbc_news are mutually similar, then the zero-frequency event between nbc_news and cnn_tv is set equal to the joint frequency between cnn_nbc and cnn_tv. Many collocations connected to one of the target collocations are not connected to the other, although they should be, since both of the target collocations are contextually related, i.e. both of them refer to the Television Network sense.
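
A sketch of that smoothing step: for a mutually similar pair, any joint frequency observed with one member is copied onto the corresponding zero-frequency pair of the other member. Edge keys as frozensets of collocation names are an assumed representation.

```python
def smooth(pair_freq, c_i, c_j, collocations):
    """Copy observed joint frequencies between a mutually similar pair so
    that a co-occurrence with one member also counts for the other."""
    for c_k in collocations:
        if c_k in (c_i, c_j):
            continue
        f_ik = pair_freq.get(frozenset((c_i, c_k)), 0)
        f_jk = pair_freq.get(frozenset((c_j, c_k)), 0)
        if f_ik and not f_jk:
            pair_freq[frozenset((c_j, c_k))] = f_ik   # fill zero-frequency event
        elif f_jk and not f_ik:
            pair_freq[frozenset((c_i, c_k))] = f_jk

pair_freq = {frozenset(("cnn_nbc", "cnn_tv")): 3}
smooth(pair_freq, "cnn_nbc", "nbc_news", ["cnn_nbc", "nbc_news", "cnn_tv"])
print(pair_freq[frozenset(("nbc_news", "cnn_tv"))])   # -> 3
```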

43 Weighting and Populating The weight applied to each edge connecting vertices i and j (collocations c_i and c_j) is the maximum of their conditional probabilities, where p(i | j) = \frac{freq_{ij}}{freq_j}; freq_{ij} denotes the number of paragraphs in which i and j co-occur, and freq_j denotes the number of paragraphs where j occurs.

44 Inducing Senses and Tagging The final graph G', resulting from the previous stage, is clustered in order to produce the induced senses. The two criteria for choosing a clustering algorithm were: its ability to automatically induce the number of clusters, and its execution time.

45 Inducing Senses and Tagging Markov Clustering Algorithm (MCL): fast; based on stochastic flow in graphs; the number of clusters produced depends on an inflation parameter. Chinese Whispers (CW): a randomized graph-clustering method; time-linear in the number of edges; does not require any input parameters; not guaranteed to converge; automatically infers the number and size of clusters.

46 Inducing Senses and Tagging Normalized MinCut: a graph partitioning technique. The graph is partitioned into two subgraphs by minimising the total association between them, and the method is iteratively applied to each extracted subgraph until a user-defined criterion is met (e.g. a number of clusters).

47 Inducing Senses and Tagging CW initially assigns all vertices to different classes. Each vertex i is processed for x (parameter p4) iterations and inherits the strongest class in its local neighbourhood (LN) in an update step, where LN is defined as the set of vertices that share a direct connection with vertex i. During the update step for a vertex i, each class cl receives a score equal to the sum of the weights of edges (i, j) where j has been assigned class cl. The maximum score determines the strongest class; in case of multiple strongest classes, one is chosen at random. Classes are updated immediately, which means that a node can inherit classes from its LN that were introduced there in the same iteration.
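
A compact Chinese Whispers sketch along the lines of the description above, assuming weighted edges keyed by frozenset pairs; the iteration count corresponds to parameter p4.

```python
import random
from collections import defaultdict

def chinese_whispers(adj, weight, iterations=20):
    """Start with one class per vertex; on each pass every vertex adopts the
    class with the highest total edge weight in its local neighbourhood.
    Updates take effect immediately, as described on the slide."""
    label = {v: v for v in adj}                  # every vertex its own class
    nodes = list(adj)
    for _ in range(iterations):                  # x iterations (parameter p4)
        random.shuffle(nodes)
        for v in nodes:
            scores = defaultdict(float)
            for u in adj[v]:                     # local neighbourhood LN
                scores[label[u]] += weight.get(frozenset((v, u)), 1.0)
            if scores:
                best = max(scores.values())
                label[v] = random.choice(        # ties broken at random
                    [cl for cl, s in scores.items() if s == best])
    return label

adj = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b"], "d": ["e"], "e": ["d"]}
print(chinese_whispers(adj, weight={}))          # two components -> two classes
```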

48 Evaluation The WSI approach was evaluated under the framework of the SemEval-2007 WSI task (SWSI). The corpus consists of texts from the Wall Street Journal corpus, hand-tagged with OntoNotes senses. They focused on all 35 nouns of SWSI, ignoring verbs.

49 Evaluation They induced the senses of each target noun tn, and then tagged each instance of tn with one of its induced senses. The SWSI organizers employ two evaluation schemes. Unsupervised evaluation: the results of systems are treated as clusters of target-noun contexts and gold standard (GS) senses as classes. Supervised evaluation: the training corpus is used to map the induced clusters to GS senses; the testing corpus is then used to measure performance.

50 Evaluation A perfect clustering solution is defined in terms of homogeneity and completeness. Homogeneity: each induced cluster has exactly the same contexts as one of the classes. Completeness: each class has exactly the same contexts as one of the clusters.

51 Evaluation F-Score is used to assess the overall quality of clustering; it measures both homogeneity and completeness. Other measures, entropy and purity, only measure the first.

52 Evaluation Let q be the number of classes in the gold standard (GS), k the number of clusters, n the total number of data points, n_r the size of cluster r, and n_r^i the number of data points in class i that belong to cluster r. Then: Purity = \sum_{r=1}^{k} \frac{n_r}{n} \left( \frac{1}{n_r} \max_i n_r^i \right) and Entropy = \sum_{r=1}^{k} \frac{n_r}{n} \left( -\frac{1}{\log q} \sum_{i=1}^{q} \frac{n_r^i}{n_r} \log \frac{n_r^i}{n_r} \right).
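
A sketch computing both measures from per-cluster class counts, following the formulas above:

```python
import math

def purity_entropy(clusters, q):
    """clusters: one dict per cluster mapping GS class -> n_r^i;
    q: number of gold-standard classes. Returns (purity, entropy)."""
    n = sum(sum(c.values()) for c in clusters)
    purity = sum(max(c.values()) for c in clusters) / n
    entropy = 0.0
    for c in clusters:
        n_r = sum(c.values())
        e_r = -sum((x / n_r) * math.log(x / n_r) for x in c.values() if x)
        entropy += (n_r / n) * e_r / math.log(q)
    return purity, entropy

# Two clusters, two GS classes; the first cluster is perfectly pure.
print(purity_entropy([{"tv": 10}, {"tv": 2, "computer": 8}], q=2))
```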

53 Evaluation Their WSI methodology, using the Jaccard similarity coefficient to populate the graph, is referred to as Col-JC. Col-BL induces senses as Col-JC does, but without smoothing. Baselines: 1cl1inst assigns each instance to a distinct cluster; 1c1w groups all instances of a target word into one cluster, which is equal to the most frequent sense (MFS) baseline in the supervised evaluation (the sense which appears most often in an annotated text).

54 Evaluation The evaluation results were presented in a results table (not reproduced in this transcription). UOY and UBS-AC used labeled data for parameter estimation; I2R, UPV_SI and UMND2 do not state how their parameters were estimated.

55 Analysis Evaluation of WSI methods is a difficult task. The 1cl1inst baseline achieves perfect purity and entropy, yet scores low on F-Score, because the senses of the GS are spread among the induced clusters, causing a low unsupervised recall. The supervised recall of 1cl1inst is undefined, due to the fact that each cluster tags one and only one instance in the corpus: clusters tagging instances in the test corpus do not tag any instances in the train corpus, so the mapping cannot be performed.

56 Analysis The 1c1w baseline achieves a high F-Score due to the dominance of the MFS in the testing corpus, but its purity, entropy and supervised recall are much lower than those of other systems.

57 Analysis A clustering solution which achieves a high supervised recall does not necessarily achieve a high F-Score, because F-Score penalizes systems for getting the number of GS classes wrong, as the 1cl1inst baseline does.

58 Analysis No system was able to achieve high performance in both settings, except their technique: Col-BL (Col-JC) achieved a 72.9% (78%) F-Score.

59 Analysis The target of smoothing was to reduce the number of clusters and obtain a better mapping of clusters to GS senses, but without affecting the clustering quality

60 Bibliography 1) Ioannis Klapaftis and Suresh Manandhar. Word Sense Induction Using Graphs of Collocations. In Proceedings of the 18th European Conference on Artificial Intelligence (ECAI-2008). 2) Agirre, E. and Soroa, A. UBC-AS: A graph based unsupervised system for induction and classification. In Proceedings of the Fourth International Workshop on Semantic Evaluations (SemEval-2007), 2007.

61 Bibliography 3) J. Véronis. HyperLex: lexical cartography for information retrieval. Computer Speech & Language, 18(3). 4) Lee, L. Measures of distributional similarity. In Proc. ACL 99.
