Natural Language Understanding using Knowledge Bases and Random Walks

Eneko Agirre, ixa2.si.ehu.eus/eneko
IXA NLP Group, University of the Basque Country
PROPOR, Tomar
In collaboration with: Ander Barrena, Josu Goikoetxea, Oier Lopez de Lacalle, Arantxa Otegi, Aitor Soroa, Mark Stevenson
Agirre (UBC) NLU using KBs and Random Walks July / 70

Large Graphs and Random Walks

History of search in the WWW
  In the beginning (early '90s) there was keyword search:
    Return documents which contain the query terms
    Good for small libraries, document collections, early WWW
  How do you rank documents about Tomar?
    First try: count occurrences of "Tomar" in each document
    Does not work: all hotels and restaurants would spam!
    It led to Yahoo and similar hand-edited directories
  What else could one do?

Vision: WWW is a graph!
  Prefer well-connected webpages

How do we know which webpages are well-connected?
  Each webpage is a node
  A hyperlink in a webpage is a directed edge to another node
  We prefer webpages with many incoming edges (in-degree)
  Wait! This can also be easily spammed with fake webpages!
  Edges from webpages that themselves have many incoming edges should count for more
  Mathematical formalization: Markov models and random walks
  Random walks, PageRank and Google
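A minimal sketch of the in-degree idea above, on a made-up toy web (the page names and links are hypothetical, not from the talk). It counts incoming edges per page and shows how three fake pages are enough to push a spammer to the top, which is the weakness that motivates weighting edges by the rank of their source.

```python
from collections import Counter

# Hypothetical toy web: page -> pages it links to.
links = {
    "tomar-castle": ["tomar-tourism"],
    "tomar-tourism": ["tomar-castle", "hotel"],
    "hotel": ["tomar-tourism"],
    # Fake pages created only to link to the hotel.
    "fake1": ["hotel"],
    "fake2": ["hotel"],
    "fake3": ["hotel"],
}


def in_degree(links):
    """Count incoming edges for every page in the graph."""
    counts = Counter({page: 0 for page in links})
    for targets in links.values():
        for target in targets:
            counts[target] += 1
    return counts


scores = in_degree(links)
best = max(scores, key=scores.get)
print(best, scores[best])
```

Without the three fake pages the hotel would have a single incoming edge; with them it outranks every genuine page.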

Knowledge Bases are also large graphs!
  sources: http://sixdegrees.hu/ http://www2. ... yifanhu/

Text Understanding with Knowledge Bases

Understanding of broad language: what's behind the surface strings
  From string to semantic representation (e.g. First Order Logic)...
  ... with respect to some Knowledge Base
  Understanding requires grounding text to Entities and Concepts
  Example: Barcelona coach praises Jose Mourinho

Understanding requires inference capability, e.g. textual similarity
  Word and phrase pairs: jewel / gem, jewel / dirt, Barcelona coach / Luis Enrique
  Also longer texts:
    Barcelona coach praises Mourinho
    Luis Enrique honors Mourinho
    Mourinho travels to Barcelona by coach

From string to semantic representation (First Order Logic)
  Barcelona coach praises Jose Mourinho.
  $\exists\, e_1, x_1, x_2, x_3$ such that $\text{FC Barcelona} = x_1 \land \text{coach:n:1}(x_1, x_2) \land \text{praise:v:2}(e_1, x_2, x_3) \land \text{José Mourinho} = x_3$ ...
  Disambiguation: Concepts, Entities and Semantic Roles
  Quantifiers, modality, negation, connotations, ...
  Inference and Reasoning
  ... with respect to some Knowledge Base

How far can we go with current KBs and graph-based algorithms?
  Ground words in context to KB concepts and instances
    Word Sense Disambiguation
    Named Entity Disambiguation
  Similarity between concepts, instances and words
  Improve ad-hoc information retrieval
  Results in the state of the art
  Knowledge-based methods and corpus-based methods are complementary

Outline
  1 PageRank and Personalized PageRank
  2 Random walks for Disambiguation
      Word Sense Disambiguation (WSD)
      WSD on the biomedical domain
      Named-Entity Disambiguation (NED)
      Complementary to other resources?
  3 Random walks for similarity
      Using Random walks
      Embedding random walks
      Complementary to other resources?
  4 Similarity and Information Retrieval
  5 Conclusions

PageRank and Personalized PageRank

Random Walks: PageRank
  Imagine a person on a random walk in the WWW:
    Start at a random page
    Follow one of the links at random
    In the limit ("steady state") each page has a long-term visit rate
    Use this as the score of the page
  PROBLEM: the walker gets stuck in dead-ends (webpages with no links)
  SOLUTION: teleporting
    Dead-ends: jump at random to any webpage
    Other nodes: with probability 15% jump at random to any webpage, with probability 85% follow one of the links
    Equivalent to adding links to all webpages
    All webpages get visited at some point
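The teleporting rule can be written down as a transition matrix. The sketch below is illustrative only: the function name `transition_matrix`, the three-page graph and the page names are assumptions, not part of the talk. Dead-end rows become uniform, and every other row mixes 85% link-following with 15% uniform jumps, so each row is a probability distribution with no zero entries.

```python
def transition_matrix(links, pages, teleport=0.15):
    """Build a row-stochastic matrix for a random surfer with teleporting.

    links: dict mapping a page to the list of pages it links to.
    pages: ordered list of all pages (fixes the row/column order).
    """
    n = len(pages)
    index = {page: i for i, page in enumerate(pages)}
    matrix = []
    for page in pages:
        targets = links.get(page, [])
        if not targets:
            # Dead end: jump at random to any webpage.
            row = [1.0 / n] * n
        else:
            # 15% of the time jump anywhere, 85% follow one of the links.
            row = [teleport / n] * n
            for target in targets:
                row[index[target]] += (1.0 - teleport) / len(targets)
        matrix.append(row)
    return matrix


# Hypothetical toy web: "c" is a dead end.
pages = ["a", "b", "c"]
links = {"a": ["b", "c"], "b": ["c"], "c": []}
M = transition_matrix(links, pages)
for row in M:
    print([round(x, 3) for x in row])
```

Because every entry is positive, every page can be reached from every other page in one step, which is what guarantees that all webpages get visited at some point.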

Random Walks: PageRank
  How to compute the long-term visit rate?
  Markov chains
    $N$ states (nodes)
    $N \times N$ transition probability matrix $M$
    For all $i$: $\sum_{j=1}^{N} M_{ij} = 1$
  Ergodic: there is a path from any state to any other; for any start state, after a finite time $T_0$, the probability of being in any state is non-zero
  For any ergodic Markov chain there is a unique long-term visit rate
    Steady-state probability distribution
    It does not matter where we start

Random Walks: PageRank
  How to compute the long-term visit rate?
  Probability vectors $P = [p_1 \dots p_N]$: the walk is in state $i$ with probability $p_i$
  For instance, a vector with 1 at position $i$ and 0 elsewhere means we are at state $i$ (start)
  Given $P_j$ at step $j$, what is $P_{j+1}$ if we take one step? $P_{j+1} = P_j M$
  Algorithm: iterate until convergence
  The steady state: $P_s = P_s M$
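A small worked example of the iteration $P_{j+1} = P_j M$, as a sketch under stated assumptions: the two-state matrix `M` below is invented for illustration, and the helper names `step` and `steady_state` are not from the talk. The chain is ergodic, so starting from either state leads to the same steady state.

```python
def step(p, m):
    """One step of the walk: the row vector p multiplied by the matrix m."""
    n = len(p)
    return [sum(p[i] * m[i][j] for i in range(n)) for j in range(n)]


def steady_state(m, start, tol=1e-12, max_iter=10_000):
    """Iterate P_{j+1} = P_j M until the vector stops changing."""
    p = list(start)
    for _ in range(max_iter):
        nxt = step(p, m)
        if max(abs(a - b) for a, b in zip(p, nxt)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical two-state chain; each row sums to 1.
M = [[0.5, 0.5],
     [0.25, 0.75]]

from_state_0 = steady_state(M, [1.0, 0.0])  # start in state 0
from_state_1 = steady_state(M, [0.0, 1.0])  # start in state 1
print(from_state_0, from_state_1)
```

Solving $P_s = P_s M$ by hand for this matrix gives $P_s = [1/3, 2/3]$, which is what both runs converge to.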

Random Walks: PageRank
  Let's factor out teleporting:
  $M$: $N \times N$ transition probability matrix
  $v$: $1 \times N$ teleport probability vector
  $P$: $1 \times N$ PageRank vector
  $P_s = (1 - c)\, P_s M + c\, v$
    $(1 - c)\, P_s M$: the walker follows edges
    $c\, v$: the walker jumps to any node with probability $1/N$
  $c$: teleport ratio, the weight with which these two terms are combined (e.g. 0.15)
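The factored formula $P_s = (1 - c)\, P_s M + c\, v$ can be iterated directly. This is a minimal sketch, not the authors' implementation: the function name `pagerank` and the three-node graph are assumptions, and the code assumes that $M$ has no dead ends (every row sums to 1).

```python
def pagerank(m, v, c=0.15, tol=1e-12, max_iter=10_000):
    """Iterate P = (1 - c) P M + c v until convergence.

    m: row-stochastic link-following matrix (assumed to have no dead ends).
    v: teleport probability vector (uniform for plain PageRank).
    c: teleport ratio.
    """
    n = len(v)
    p = list(v)
    for _ in range(max_iter):
        nxt = [(1.0 - c) * sum(p[i] * m[i][j] for i in range(n)) + c * v[j]
               for j in range(n)]
        if max(abs(a - b) for a, b in zip(p, nxt)) < tol:
            return nxt
        p = nxt
    return p


# Hypothetical graph: 0 -> 1, 0 -> 2, 1 -> 2, 2 -> 0.
M = [[0.0, 0.5, 0.5],
     [0.0, 0.0, 1.0],
     [1.0, 0.0, 0.0]]
v = [1.0 / 3] * 3  # uniform teleport vector: jump to any node with prob. 1/N

ranks = pagerank(M, v)
print([round(r, 4) for r in ranks])
```

Node 2 receives links from both other nodes and ends up with the highest rank; node 1, which is linked only from node 0, ends up lowest.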

Random Walks: Personalized PageRank
  PageRank gives a static view of the graph. We need to include context: the importance of nodes according to some node(s) of interest.
  Personalized PageRank: non-uniform $v$ [Haveliwala, 2002]
    Assign stronger probabilities to certain nodes in $v$
    Bias PageRank to prefer these nodes
    $P_s = (1 - c)\, P_s M + c\, v$
  For example, if we concentrate all the mass of $v$ on node $n_i$ (e.g. the Tomar website):
    All random jumps return to $n_i$
    The rank of $n_i$ will be high
    The high rank of $n_i$ will make all the nodes in its vicinity also receive a high rank
    The importance of $n_i$ given by the initial $v$ spreads along the graph (e.g. websites closely related to Tomar)
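To see the effect of a non-uniform $v$, the sketch below compares a uniform teleport vector with one that puts all its mass on a single node. The ring-shaped graph and the function names `ring` and `personalized_pagerank` are assumptions chosen so that plain PageRank is perfectly flat; they are not taken from the talk.

```python
def personalized_pagerank(m, v, c=0.15, iterations=200):
    """Iterate P = (1 - c) P M + c v for a fixed number of steps."""
    n = len(v)
    p = list(v)
    for _ in range(iterations):
        p = [(1.0 - c) * sum(p[i] * m[i][j] for i in range(n)) + c * v[j]
             for j in range(n)]
    return p


def ring(n):
    """Transition matrix of a ring: each node links to its two neighbours."""
    m = [[0.0] * n for _ in range(n)]
    for i in range(n):
        m[i][(i - 1) % n] = 0.5
        m[i][(i + 1) % n] = 0.5
    return m


M = ring(6)
uniform = personalized_pagerank(M, [1.0 / 6] * 6)                 # plain PageRank
seeded = personalized_pagerank(M, [1.0, 0.0, 0.0, 0.0, 0.0, 0.0])  # all mass on node 0

print([round(x, 3) for x in uniform])
print([round(x, 3) for x in seeded])
```

With the uniform vector every node scores 1/6. With all teleport mass on node 0, the score is highest at node 0 and decays with distance around the ring, which is the "importance spreads along the graph" behaviour described above.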

Random walks for Disambiguation: Word Sense Disambiguation (WSD)

Word Sense Disambiguation
  Goal: determine the senses of the open-class words in a text.
    "Nadal is sharing a house with his uncle and coach, Toni."
    "Our fleet comprises coaches from 35 to 58 seats."
  Knowledge Base (e.g. WordNet):
    coach#1: someone in charge of training an athlete or a team.
    ...
    coach#5: a vehicle carrying many passengers; used for public transport.

Word Sense Disambiguation (WSD)
  Many potential applications: enable natural language understanding, link text to knowledge bases, deploy the semantic web.
  Supervised corpus-based WSD performs best
    Train classifiers on hand-tagged data (typically SemCor)
    Data sparseness, e.g. coach: 20 examples (20,0,0,0,0,0)
    Results decrease when train and test data come from different sources (even Brown, BNC)
    They decrease even more when train and test data come from different domains
  Knowledge-based WSD
    Uses information in a KB (WordNet)
    Relation coverage

WordNet is the usual KB for WSD
  WordNet is the most widely used hierarchically organized lexical database for English (Fellbaum, 1998)
  Broad coverage of nouns, verbs, adjectives, adverbs
  Main unit: synset (concept)
    coach#1, manager#3, handler#2: someone in charge of training an athlete or a team.
  A word is associated with several concepts (word senses)
  A concept can be lexicalised with several words (variants)
  Relations between concepts: synonymy (built-in), hyperonymy, antonymy, meronymy, entailment, derivation, gloss

WordNet is the usual KB for WSD
  Representing WordNet as a graph [Hughes and Ramage, 2007]:
    Nodes represent concepts
    Edges represent relations (undirected)
    In addition, directed edges from words to their corresponding concepts (senses)

WordNet is the usual KB for WSD
  [Figure: fragment of the WordNet graph around the word "coach". Nodes: handle#v6, managership#n3, trainer#n1, sport#n1, coach#n1, teacher#n1, coach#n2, coach#n5, public_transport#n1, fleet#n2, tutorial#n1, seat#n1. Edge labels: derivation, domain, hyperonym, holonym.]

Using Personalized PageRank for WSD [Agirre et al., 2014]
  Our fleet comprises coaches from 35 to 58 seats.
  $P_s = (1 - c)\, P_s M + c\, v$
  For each word $W_i$, $i = 1 \dots m$, in the context (e.g. coach):
    Initialize $v$ with uniform probabilities over the words $W_j$, $j \neq i$ (e.g. fleet, comprise, seat)
    Context words act as source nodes injecting probability mass into the concept graph
    Run Personalized PageRank, yielding $P_s$
    Choose the highest-ranking sense for the target word $W_i$ in $P_s$ (e.g. coach)
  This is called word-to-word Personalized PageRank, PPR w2w
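The word-to-word procedure can be sketched on a toy graph. Everything below is an assumption made for illustration: the handful of edges is a simplified, hand-picked subset loosely based on the sense labels used in the talk (it is not real WordNet data), the helper names are invented, and dead-end nodes are handled by returning their mass to the seed words, which is one possible choice rather than the authors' documented one.

```python
def build_graph(concept_edges, word_senses):
    """Undirected edges between concepts, directed edges from words to senses."""
    out_edges = {}
    for a, b in concept_edges:
        out_edges.setdefault(a, []).append(b)
        out_edges.setdefault(b, []).append(a)
    for word, senses in word_senses.items():
        out_edges.setdefault(word, []).extend(senses)
        for sense in senses:
            out_edges.setdefault(sense, [])
    return out_edges


def personalized_pagerank(out_edges, seeds, c=0.15, iterations=100):
    """PPR with the teleport vector v uniform over the seed nodes."""
    v = {node: 0.0 for node in out_edges}
    for seed in seeds:
        v[seed] = 1.0 / len(seeds)
    p = dict(v)
    for _ in range(iterations):
        nxt = {node: c * v[node] for node in out_edges}
        for node, targets in out_edges.items():
            # Assumption: dead ends send their mass back to the seeds.
            receivers = targets if targets else seeds
            share = (1.0 - c) * p[node] / len(receivers)
            for receiver in receivers:
                nxt[receiver] += share
        p = nxt
    return p


def disambiguate(target, context, out_edges, word_senses):
    """Seed every context word except the target, then pick its best sense."""
    seeds = [word for word in context if word != target]
    ranks = personalized_pagerank(out_edges, seeds)
    return max(word_senses[target], key=ranks.get), ranks


# Hypothetical, simplified graph fragment.
concept_edges = [
    ("coach#n1", "trainer#n1"), ("coach#n1", "sport#n1"),
    ("coach#n2", "teacher#n1"), ("coach#n2", "tutorial#n1"),
    ("coach#n5", "public_transport#n1"), ("coach#n5", "fleet#n2"),
    ("coach#n5", "seat#n1"),
]
word_senses = {
    "coach": ["coach#n1", "coach#n2", "coach#n5"],
    "fleet": ["fleet#n2"],
    "comprise": ["comprise#v1"],
    "seat": ["seat#n1"],
}

graph = build_graph(concept_edges, word_senses)
sense, ranks = disambiguate("coach", ["fleet", "comprise", "coach", "seat"],
                            graph, word_senses)
print(sense)
```

In this toy graph the probability mass injected at fleet and seat flows through fleet#n2 and seat#n1 into coach#n5, so the vehicle sense wins. The other two senses sit in components the seeds cannot reach and receive no mass; a real KB is far more connected, so the competing senses would receive smaller but non-zero scores.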

Using Personalized PageRank (PPR)
  Our fleet comprises coaches from 35 to 58 seats.
  [Figure: the context words fleet, comprise, ..., seat and the target word coach linked to the WordNet graph. Nodes: handle#n8, managership#n3, trainer#n1, sport#n1, coach#n1, teacher#n1, coach#n2, tutorial#n1, coach#n5, comprise#v1, ..., fleet#n2, public_transport#n1, seat#n1.]

Results and comparison to related work
  System                                S2AW   S3AW   S07CG (N)
  [Agirre and Soroa, 2008]       KB
  [Tsatsaronis et al., 2010]     KB
  [Ponzetto and Navigli, 2010]   KB                   (79.4)
  [Moro et al., 2014]            KB                   (84.6)
  PPR w2w                        KB                   (83.6)
  PPR w2w + MFS                  KB                   (82.1)
  [Taghipour and Ng, 2015]       SUP                  (82.3)

Random walks for Disambiguation: WSD on the biomedical domain

Biomedical WSD and UMLS [Agirre et al., 2010]
  Ambiguity is believed not to occur in specific domains, yet:
    "On the Use of Cold Water as a Powerful Remedial Agent in Chronic Disease."
    "Intranasal ipratropium bromide for the common cold."
  11.7% of the phrases in abstracts added to MEDLINE in 1998 were ambiguous (Weeber et al. 2011)
  Unified Medical Language System (UMLS) Metathesaurus
    Concept Unique Identifiers (CUIs)
    C0234192: Cold (Cold Sensation) [Physiologic Function]
    C : Cold (cold temperature) [Natural Phenomenon or Process]
    C : Cold (Common Cold) [Disease or Syndrome]

Biomedical WSD and UMLS [Agirre et al., 2010]
  UMLS is a Metathesaurus (1M CUIs): Alcohol and other drugs, Medical Subject Headings, Crisp Thesaurus, SNOMED Clinical Terms, etc.
  Relations in the Metathesaurus between CUIs (5M): parent, can be qualified by, related possibly synonymous, related other
  We applied Personalized PageRank.
  Evaluated on NLM-WSD: 50 ambiguous terms (100 instances each)
  KB              #CUIs     #relations   Acc.   Terms
  AOD             15,901    58,
  MSH             278,297   1,098,
  CSP             16,703    73,
  SNOMEDCT        304,443   1,237,
  all above       572,105   2,433,
  all relations   -         5,352,
  [Jimeno and Aronson, 2011]

69 Random walks for Disambiguation: Named-Entity Disambiguation (NED)
Named Entity Disambiguation [Agirre et al., 2015, Barrena et al., 2015]
Given a Named Entity mention, ground it to an instance in the KB (aka Entity Linking, Wikification).
The KB is Wikipedia (DBpedia), represented as a graph:
  5M articles (nodes), which represent concepts and instances
  90M hyperlinks (edges), which represent relations
Example: Alan Kourie, CEO of the Lions franchise, had discussions with Fletcher in Cape Town.

71 Random walks for Disambiguation: Named-Entity Disambiguation (NED)
Named Entity Disambiguation
Main steps:
  1. Named Entity Recognition in text (NER)
  2. Dictionary for candidate generation: use titles, redirects, text in anchors
  3. Disambiguation: Personalized PageRank
Partial view of the dictionary entry for the string "gotham":

Article                   Freq.   P(e|s)
GOTHAM CITY
GOTHAM (MAGAZINE)
GOTHAM (TYPEFACE)
GOTHAM, NOTTINGHAMSHIRE
GOTHAM (ALBUM)
GOTHAM (BAND)
NEW YORK CITY
GOTHAM RECORDS
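The dictionary step can be sketched as follows. This is only an illustration, assuming that P(e|s) is estimated as the relative frequency with which the string s refers to article e among the collected titles, redirects and anchors. The function name `build_dictionary`, the lowercasing of strings and all counts below are hypothetical, not taken from the slides.

```python
from collections import Counter, defaultdict


def build_dictionary(observations):
    """Map each surface string s to {entity: P(e|s)}.

    `observations` is an iterable of (surface string, entity) pairs, e.g. one
    pair per anchor text pointing to an article. Lowercasing the string is an
    assumption of this sketch.
    """
    counts = defaultdict(Counter)
    for surface, entity in observations:
        counts[surface.lower()][entity] += 1
    dictionary = {}
    for surface, entity_counts in counts.items():
        total = sum(entity_counts.values())
        dictionary[surface] = {e: n / total for e, n in entity_counts.items()}
    return dictionary


# Hypothetical counts, for illustration only.
observations = (
    [("Gotham", "GOTHAM CITY")] * 6
    + [("Gotham", "NEW YORK CITY")] * 3
    + [("gotham", "GOTHAM (TYPEFACE)")]
)
dictionary = build_dictionary(observations)
# Candidates for "gotham", most probable first.
candidates = sorted(dictionary["gotham"].items(), key=lambda kv: -kv[1])
```

The candidates returned for a mention are then the nodes among which the disambiguation step has to choose.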

74 Random walks for Disambiguation: Named-Entity Disambiguation (NED)
Named Entity Disambiguation

              TAC2009   TAC2010   TAC2013   AIDA
PPR w2w
Best system

Evaluation: accuracy for KB mentions (we don't do NILs)
Best: best system in each competition, [Houlsby and Ciaramita, 2014] for AIDA
Key for performance: only keep hyperlinks which have a reciprocal hyperlink (e.g. Tomar and Santarem district)
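The reciprocal-hyperlink filter can be sketched in a few lines. This is a minimal illustration, assuming hyperlinks are available as (source article, target article) pairs; the function name `keep_reciprocal` and the example links are hypothetical.

```python
def keep_reciprocal(links):
    """Keep a hyperlink (a, b) only if the reverse hyperlink (b, a) also exists.

    Self-links are dropped, which is an assumption of this sketch.
    """
    link_set = set(links)
    return {(a, b) for (a, b) in link_set if a != b and (b, a) in link_set}


# Hypothetical hyperlinks, for illustration only.
links = [
    ("Tomar", "Santarem District"),
    ("Santarem District", "Tomar"),
    ("Tomar", "Portugal"),  # no link back in this toy example, so it is dropped
]
filtered = keep_reciprocal(links)
```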

77 Random walks for Disambiguation: Complementary to other resources?
Combining graphs & supervised NED [Barrena et al., 2015]
We set up a generative framework:
  entity knowledge P(e)
  name knowledge P(s | e)
  context knowledge P(c_bow | e)
  context knowledge P(c_grf | e)
Return the entity which maximizes the joint probability:
$$\arg\max_e P(s, c, e) = \arg\max_e P(e)\, P(s \mid e)\, P(c_{bow} \mid e)\, P(c_{grf} \mid e)$$
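A minimal sketch of the arg max over candidate entities follows. It assumes the four distributions are already estimated and smoothed (no zero probabilities), and it sums logarithms instead of multiplying probabilities, which selects the same entity. The function name `best_entity` and all numbers are hypothetical.

```python
import math


def best_entity(candidates, p_e, p_s_e, p_bow_e, p_grf_e):
    """Return the candidate maximizing P(e) P(s|e) P(c_bow|e) P(c_grf|e).

    Each argument after `candidates` maps an entity to the value of one factor
    for the mention at hand. Values must be strictly positive.
    """
    def log_joint(entity):
        factors = (p_e, p_s_e, p_bow_e, p_grf_e)
        return sum(math.log(table[entity]) for table in factors)

    return max(candidates, key=log_joint)


# Hypothetical probabilities for two candidates of one mention.
candidates = ["GOTHAM CITY", "NEW YORK CITY"]
p_e = {"GOTHAM CITY": 0.7, "NEW YORK CITY": 0.3}      # entity knowledge
p_s_e = {"GOTHAM CITY": 0.5, "NEW YORK CITY": 0.5}    # name knowledge
p_bow_e = {"GOTHAM CITY": 0.1, "NEW YORK CITY": 0.4}  # bag-of-words context
p_grf_e = {"GOTHAM CITY": 0.2, "NEW YORK CITY": 0.6}  # graph (random walk) context
winner = best_entity(candidates, p_e, p_s_e, p_bow_e, p_grf_e)
```

In this toy case the two context factors outweigh the prior, so the less frequent entity wins.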

80 Random walks for Disambiguation: Complementary to other resources?
Combining graphs & supervised NED
Results
Best: best system in each competition, [Houlsby and Ciaramita, 2014] for AIDA
Knowledge-Based and Supervised are competitive and complementary!

82 Random walks for similarity
Similarity (and relatedness)
Given two words or multiword expressions, estimate how similar they are.
  Example: gem, jewel (shared features, shared superclass)
Relatedness is a more general relationship, which includes other relations such as topical relatedness or meronymy.
  Example: movie, star
Similarity and disambiguation are closely related!
Gold standard: a numeric value of similarity/relatedness.

86 Random walks for similarity
Similarity datasets

RG dataset                    WordSim353 dataset
cord smile           0.02     king cabbage           0.23
rooster voyage       0.04     professor cucumber
glass jewel          1.78     investigation effort   4.59
magician oracle      1.82     movie star
cemetery graveyard   3.88     journey voyage         9.29
automobile car       3.92     midday noon            9.29
midday noon          3.94     tiger tiger
pairs, 51 subjects            353 pairs, 16 subjects
Similarity                    Relatedness

Evaluation: Spearman correlation
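The evaluation can be sketched as below: Spearman correlation is the Pearson correlation between the rank vectors of the gold scores and the system scores. The gold values are the seven RG pairs listed above; the system scores are hypothetical, and the helper names are not from the slides.

```python
def average_ranks(values):
    """1-based ranks, with tied values receiving the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        rank = (i + j) / 2 + 1  # average rank of the tie group
        for k in range(i, j + 1):
            ranks[order[k]] = rank
        i = j + 1
    return ranks


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(xs)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)


# Gold: the seven RG pairs shown on the slide, in the same order.
gold = [0.02, 0.04, 1.78, 1.82, 3.88, 3.92, 3.94]
# Hypothetical system similarities for the same pairs.
system = [0.10, 0.05, 0.30, 0.45, 0.80, 0.95, 0.90]
rho = spearman(gold, system)
```

Only the ordering of the pairs matters, so a system is not penalized for producing scores on a different scale than the gold standard.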

88 Random walks for similarity
Similarity
Many potential applications:
  Overcome brittleness (word match)
  NLP subtasks (parsing, semantic role labeling)
  Information retrieval
  Question answering
  Summarization
  Machine translation optimization and evaluation
  Inference (textual entailment)
Two main approaches:
  Knowledge-based
  Corpus-based, also known as distributional similarity (embeddings!)

91 Random walks for similarity: Using Random walks
Random walks [Hughes and Ramage, 2007] [Eneko Agirre and Soroa, 2010, Agirre et al., 2015]
Given two words, estimate how similar they are (e.g. gem, jewel).
Given a pair of words $(w_1, w_2)$:
  Initialize the teleport probability mass $\vec{v}$ with $w_1$
  Run Personalized PageRank, obtaining $\vec{w}_1 = \vec{P}_s$
  Initialize $\vec{v}$ with $w_2$ and obtain $\vec{w}_2 = \vec{P}_s$
  Measure the similarity between $\vec{w}_1$ and $\vec{w}_2$ (e.g. cosine)
$$\vec{P}_s = (1 - c)\,\vec{P}_s M + c\,\vec{v}$$
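The procedure above can be sketched with a plain power iteration. This is an illustration under stated assumptions, not the authors' implementation: M is taken to be the row-normalized adjacency matrix, c = 0.15 is the teleport probability, every node has at least one outgoing edge, and the toy graph is hypothetical (nodes stand in for KB concepts).

```python
def personalized_pagerank(graph, seeds, c=0.15, iterations=50):
    """Iterate P = (1 - c) P M + c v, where v is uniform over `seeds`.

    `graph` maps each node to the list of its neighbours. Mass reaching a node
    without outgoing edges would be lost; the toy graph below has none.
    """
    v = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    p = dict(v)
    for _ in range(iterations):
        nxt = {n: c * v[n] for n in graph}
        for node, neighbours in graph.items():
            if not neighbours:
                continue
            share = (1 - c) * p[node] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        p = nxt
    return p


def cosine(p, q):
    """Cosine between two probability vectors stored as dicts over the same nodes."""
    dot = sum(p[n] * q[n] for n in p)
    norm_p = sum(x * x for x in p.values()) ** 0.5
    norm_q = sum(x * x for x in q.values()) ** 0.5
    return dot / (norm_p * norm_q)


# Hypothetical undirected toy graph.
graph = {
    "gem": ["jewel", "stone"],
    "jewel": ["gem", "stone"],
    "stone": ["gem", "jewel", "rock"],
    "rock": ["stone", "mountain"],
    "mountain": ["rock", "hill"],
    "hill": ["mountain"],
}
w_gem = personalized_pagerank(graph, {"gem"})
w_jewel = personalized_pagerank(graph, {"jewel"})
w_hill = personalized_pagerank(graph, {"hill"})
sim_close = cosine(w_gem, w_jewel)
sim_far = cosine(w_gem, w_hill)
```

In this toy graph the vectors for gem and jewel concentrate their mass on the same neighbourhood, so their cosine is higher than the cosine between gem and hill.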

93 Random walks for similarity: Using Random walks
Using Random Walks
Probability vectors on Wikipedia for drink and alcohol.

drink                        alcohol
DRINK                .124    ALCOHOL             .145
ALCOHOLIC BEVERAGE   .036    ALCOHOLIC BEVERAGE  .026
DRINKING             .028    ETHANOL             .018
COFFEE               .020    ALKENE              .006
TEA                  .017    ALCOHOLISM          .006
CIDER                .016    ALDEHYDE            .005
MASALA CHAI          .014    KETONE              .004
WINE                 .014    ESTER               .004
SUGAR SUBSTITUTE     .014    ALKANE              .004
CAPPUCCINO           .013    ISOPROPYL ALCOHOL   .003
HOT CHOCOLATE        .013    ETHER

94 Random walks for similarity: Using Random walks
Using Random walks

Method                               Source           WS353   RG
[Hughes and Ramage, 2007]            WordNet
[Finkelstein et al., 2002]           Corpora (LSA)
[Agirre et al., 2009]                Corpora
PPR                                  WordNet
[Huang et al., 2012]                 Corpora (NN)
[Baroni et al., 2014]                Corpora (NN)
PPR                                  Wikipedia
[Gabrilovich and Markovitch, 2007]   Wikipedia
[Reisinger and Mooney, 2010]         Corpora
[Pilehvar et al., 2013]              BabelNet
PPR                                  Wiki + WNet
[Radinsky et al., 2011]              Corpora (Time)

96 Random walks for similarity: Embedding random walks
Low-dimensional word representations [Goikoetxea et al., 2015]
Vectors produced by PPR contain thousands (even millions) of dimensions.
Feed random walks (WordNet) into a neural network language model (word2vec).

99 Random walks for similarity: Embedding random walks
Low-dimensional word representations
Producing a pseudo-corpus:
  1. start a random walk at any synset,
  2. emit a lexicalization,
  3. with probability 85% follow an edge, go to step 2,
  4. else restart, go to step 1.
Examples of text generated by random walks on WordNet:
  yucatec mayan quiche kekchi speak sino-tibetan tone language west chadic
  amphora wine nabuchadnezzar bear retain long
  graphology writer write scribble scrawler heedlessly in haste jot note notebook
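The four steps can be sketched as a small generator. This is an illustration only: it assumes that each restart closes one line of the pseudo-corpus, that synsets and lexicalizations are chosen uniformly at random, and the mini graph and word lists are hypothetical rather than real WordNet data.

```python
import random


def generate_pseudo_corpus(graph, lexicalizations, n_lines, follow_prob=0.85, seed=0):
    """Emit `n_lines` pseudo-sentences by random walks over a synset graph.

    `graph` maps a synset to its neighbouring synsets; `lexicalizations` maps
    a synset to the words that express it.
    """
    rng = random.Random(seed)
    synsets = sorted(graph)
    corpus = []
    for _ in range(n_lines):
        synset = rng.choice(synsets)  # step 1: start at any synset
        line = []
        while True:
            line.append(rng.choice(lexicalizations[synset]))  # step 2: emit a word
            neighbours = graph[synset]
            if neighbours and rng.random() < follow_prob:  # step 3: follow an edge
                synset = rng.choice(neighbours)
            else:  # step 4: restart, which ends this line
                break
        corpus.append(" ".join(line))
    return corpus


# Hypothetical mini graph, for illustration only.
graph = {
    "write.v": ["writer.n", "scribble.v"],
    "writer.n": ["write.v"],
    "scribble.v": ["write.v", "note.n"],
    "note.n": ["scribble.v"],
}
lexicalizations = {
    "write.v": ["write"],
    "writer.n": ["writer", "author"],
    "scribble.v": ["scribble", "scrawl"],
    "note.n": ["note"],
}
corpus = generate_pseudo_corpus(graph, lexicalizations, n_lines=5)
```

The resulting lines can then be fed to a word embedding trainer in the same way as ordinary text.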

101 Random walks for similarity: Embedding random walks
Low-dimensional word representations
Results on Relatedness and Similarity

103 Random walks for similarity: Complementary to other resources?
Combining graph- and text-based embeddings [Goikoetxea et al., 2016]
Given two sources of embeddings:
  Large corpora
  Random walks on WordNet
Combine the embeddings into a single embedding:
  Using the centroid (CEN)
  Concatenating the embeddings (CAT)
  Dimensionality reduction of CAT (PCA)
Combine the cosines from each embedding space:
  Average of cosines (AVG)
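Three of the combinations (CEN, CAT, AVG) can be sketched directly; PCA is left out because it needs a fitted projection over the whole vocabulary. The sketch assumes both spaces have the same dimensionality for CEN and applies no normalization before combining, since the slides do not say whether any is used. All vectors and function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two dense vectors given as lists."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def centroid(text_vec, graph_vec):
    """CEN: element-wise mean of the two embeddings (same dimensionality assumed)."""
    return [(a + b) / 2 for a, b in zip(text_vec, graph_vec)]


def concatenate(text_vec, graph_vec):
    """CAT: the text-based embedding followed by the graph-based one."""
    return list(text_vec) + list(graph_vec)


def average_of_cosines(w1_text, w2_text, w1_graph, w2_graph):
    """AVG: mean of the cosines computed separately in each space."""
    return (cosine(w1_text, w2_text) + cosine(w1_graph, w2_graph)) / 2


# Hypothetical 2-dimensional embeddings for two words.
gem_text, jewel_text = [1.0, 0.0], [0.8, 0.6]
gem_graph, jewel_graph = [0.0, 1.0], [0.0, 2.0]

sim_cen = cosine(centroid(gem_text, gem_graph), centroid(jewel_text, jewel_graph))
sim_cat = cosine(concatenate(gem_text, gem_graph), concatenate(jewel_text, jewel_graph))
sim_avg = average_of_cosines(gem_text, jewel_text, gem_graph, jewel_graph)
```

CEN and CAT yield a new vector per word that can be reused elsewhere, whereas AVG only yields a similarity score per word pair.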

106 Random walks for similarity: Complementary to other resources?
Combining graph- and text-based embeddings
Improvement of each combination with respect to corpus-based embeddings:

          RG   SL   WSS   WSR   MTU   MEN   WS   all
Corpus
WordNet
CEN
AVG
CAT
PCA

Random walks on WordNet are competitive with corpus-based embeddings.
Very large improvements, showing that the two are highly complementary!

107 Random walks for similarity: Complementary to other resources?
Combining graphs with text-based embeddings
Other alternatives provide smaller improvements (e.g. retrofitting [Faruqui et al., 2015]).

109 Similarity and Information Retrieval
Similarity and Information Retrieval [Otegi et al., 2014]
In information retrieval, given a query, we need to retrieve a relevant document, but vocabulary mismatches happen (example from Yahoo! Answers):
  Query: I can't install DSL because of the antivirus program, any hints?
  Document: You should turn off virus and anti-spy software. And thats done within each of the softwares themselves. Then turn them back on later after setting up any DSL softwares.
Document expansion (aka clustering and smoothing) has been shown to be successful in IR.
Use WordNet and similarity to expand documents.
Method:
  Initialize the random walk with the document words
  Retrieve the top k synsets
  Introduce the words of those k synsets in a secondary index
  When retrieving, use both the primary and the secondary index
Results: better retrieval, particularly with domain changes and short documents
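The expansion method can be sketched as follows. This is an illustration under stated assumptions: the walk is a Personalized PageRank seeded uniformly on the synsets of the document words, with teleport probability 0.15, and the graph, the word-to-synset mappings and the function names are hypothetical rather than real WordNet data.

```python
def personalized_pagerank(graph, seeds, c=0.15, iterations=50):
    """Power iteration for Personalized PageRank with uniform mass on `seeds`."""
    v = {n: (1.0 / len(seeds) if n in seeds else 0.0) for n in graph}
    p = dict(v)
    for _ in range(iterations):
        nxt = {n: c * v[n] for n in graph}
        for node, neighbours in graph.items():
            if not neighbours:
                continue
            share = (1 - c) * p[node] / len(neighbours)
            for m in neighbours:
                nxt[m] += share
        p = nxt
    return p


def expand_document(doc_words, word_to_synsets, synset_to_words, graph, k=3):
    """Return the words to add to the secondary index for one document."""
    seeds = {s for w in doc_words for s in word_to_synsets.get(w, [])}
    if not seeds:
        return []  # no document word is in the KB, so nothing to expand
    scores = personalized_pagerank(graph, seeds)
    top_synsets = sorted(scores, key=scores.get, reverse=True)[:k]
    expansion = {w for s in top_synsets for w in synset_to_words[s]}
    return sorted(expansion - set(doc_words))


# Hypothetical mini KB, for illustration only.
graph = {
    "computer_virus": ["antivirus", "software"],
    "antivirus": ["computer_virus", "software"],
    "software": ["computer_virus", "antivirus"],
    "modem": ["software"],
}
word_to_synsets = {"virus": ["computer_virus"], "software": ["software"]}
synset_to_words = {
    "computer_virus": ["virus"],
    "software": ["software", "program"],
    "antivirus": ["antivirus"],
    "modem": ["modem"],
}
secondary_index_terms = expand_document(
    ["virus", "software"], word_to_synsets, synset_to_words, graph
)
```

In this toy case the expansion adds "antivirus" and "program", the two query terms that the original document text does not contain.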

112 Similarity and Information Retrieval
Example of document expansion
You should turn off virus and anti-spy software. And thats done within each of the softwares themselves. Then turn them back on later after setting up any DSL softwares.
