Natural Language Understanding using Knowledge Bases and Random Walks

Eneko Agirre
ixa2.si.ehu.eus/eneko
IXA NLP Group, University of the Basque Country
PROPOR, Tomar
In collaboration with: Ander Barrena, Josu Goikoetxea, Oier Lopez de Lacalle, Arantxa Otegi, Aitor Soroa, Mark Stevenson
Agirre (UBC) NLU using KBs and Random Walks July / 70

Large Graphs and Random Walks
History of search in the WWW
In the beginning (early 90s) there was keyword search:
  Return documents that contain the query terms
  Good for small libraries, document collections, early WWW
How do you rank documents about Tomar?
  First try: count occurrences of "Tomar" in the document
  Does not work: all hotels and restaurants would spam!
  It led to Yahoo and similar hand-edited directories
What else could one do?

Large Graphs and Random Walks
Vision: WWW is a graph!
Prefer well-connected webpages

Large Graphs and Random Walks
How do we know which webpages are well-connected?
  Each webpage is a node
  A hyperlink in a webpage is a directed edge to another node
  We prefer webpages with many incoming edges (in-degree)
Wait! This can also be easily spammed with fake webpages!
  Edges from webpages with many incoming edges should be more relevant
Mathematical formalization: Markov models and random walks
Random walks, PageRank and Google
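
The spam problem with raw in-degree can be seen on a toy example. The sketch below is only an illustration: the page names and the five-page link farm are invented, and `in_degree` is a hypothetical helper, not something from the talk.

```python
from collections import Counter

def in_degree(links):
    """Count incoming edges per page; links maps a page to the pages it links to."""
    counts = Counter({page: 0 for page in links})
    for source, targets in links.items():
        for target in set(targets):      # count each linking page once
            if target != source:         # ignore self-links
                counts[target] += 1
    return counts

# Hypothetical toy web: two genuine pages link to the tourism page.
honest_web = {
    "tomar-tourism": [],
    "castle": ["tomar-tourism"],
    "museum": ["tomar-tourism"],
    "spam-hotel": [],
}
# Hypothetical link farm: five fake pages that all point to the spammer.
link_farm = {f"fake{i}": ["spam-hotel"] for i in range(5)}
spammed_web = {**honest_web, **link_farm}

before = in_degree(honest_web)
after = in_degree(spammed_web)
```

In this toy web the link farm lifts the in-degree of the spam page above that of the genuine page, even though none of the fake pages has any incoming edge itself. That is the motivation for weighting each edge by the importance of its source.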

Large Graphs and Random Walks
Knowledge Bases are also large graphs!
sources: yifanhu/

Text Understanding with Knowledge Bases
Understanding of broad language: what's behind the surface strings
  From string to semantic representation (e.g. First Order Logic)
  ... with respect to some Knowledge Base
Understanding requires grounding text to Entities and Concepts
  Barcelona coach praises Jose Mourinho

Text Understanding with Knowledge Bases
Understanding requires inference capability, e.g. textual similarity:
  jewel / gem
  jewel / dirt
  Barcelona coach / Luis Enrique
Also longer texts:
  Barcelona coach praises Mourinho
  Luis Enrique honors Mourinho
  Mourinho travels to Barcelona by coach

Text Understanding with Knowledge Bases
From string to semantic representation (First Order Logic):
  Barcelona coach praises Jose Mourinho.
  Exist e1, x1, x2, x3 such that FC Barcelona=x1 and coach:n:1(x1,x2) and praise:v:2(e1,x2,x3) and José Mourinho=x3
  ...
Disambiguation: Concepts, Entities and Semantic Roles
Quantifiers, modality, negation, connotations, ...
Inference and Reasoning
... with respect to some Knowledge Base

Text Understanding with Knowledge Bases
How far can we go with current KBs and graph-based algorithms?
  Ground words in context to KB concepts and instances
    Word Sense Disambiguation
    Named Entity Disambiguation
  Similarity between concepts, instances and words
  Improve ad-hoc information retrieval
Results at the state of the art
Knowledge-based methods and corpus-based methods are complementary

Outline
1 PageRank and Personalized PageRank
2 Random walks for Disambiguation
    Word Sense Disambiguation (WSD)
    WSD on the biomedical domain
    Named-Entity Disambiguation (NED)
    Complementary to other resources?
3 Random walks for similarity
    Using Random walks
    Embedding random walks
    Complementary to other resources?
4 Similarity and Information Retrieval
5 Conclusions

PageRank and Personalized PageRank
Random Walks: PageRank
Imagine a person on a random walk in the WWW:
  Start at a random page
  Follow one of the links at random
At the limit ("steady state") each page has a long-term visit rate
  Use this as the score of the page
PROBLEM: the walker gets stuck in dead-ends (webpages with no links)
SOLUTION: Teleporting
  Dead-ends: jump at random to any webpage
  Other nodes: 15% jump at random to any webpage, 85% follow one of the links
Equivalent to adding links to all webpages
All webpages get visited at some point
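
The random surfer with teleporting can be simulated directly. This is a minimal sketch under stated assumptions: the four-page web, the function name `simulate_surfer`, the step count and the seed are all invented for illustration; only the 15% / 85% split and the dead-end rule come from the slide.

```python
import random

def simulate_surfer(links, steps=100_000, teleport=0.15, seed=0):
    """Estimate long-term visit rates by simulating a single random surfer."""
    rng = random.Random(seed)
    pages = sorted(links)
    visits = {page: 0 for page in pages}
    current = rng.choice(pages)              # start at a random page
    for _ in range(steps):
        visits[current] += 1
        out = links[current]
        if not out or rng.random() < teleport:
            current = rng.choice(pages)      # dead end or teleport: jump anywhere
        else:
            current = rng.choice(out)        # follow one of the links at random
    return {page: count / steps for page, count in visits.items()}

# Hypothetical toy web; "d" is a dead end with a single incoming link.
toy_web = {"a": ["b"], "b": ["a", "c"], "c": ["a", "d"], "d": []}
rates = simulate_surfer(toy_web)
```

The returned visit rates sum to one and every page receives a non-zero rate, which is what teleporting is meant to guarantee. Simulation is only an approximation; the exact rates are given by the steady state of the corresponding Markov chain.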

PageRank and Personalized PageRank
Random Walks: PageRank
How to compute the long-term visit rate? Markov chains
  $N$ states (nodes)
  $N \times N$ transition probability matrix $M$
  For all $i$: $\sum_{j=1}^{N} M_{ij} = 1$
Ergodic: there is a path from any state to any other; for any start state, after a finite time $T_0$, the probability of being in any state is non-zero
For any ergodic Markov chain there is a unique long-term visit rate
  Steady-state probability distribution
  It does not matter where we start
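
A transition matrix satisfying the row-sum condition can be built from a link structure as follows. This is a sketch, not the talk's implementation: `transition_matrix` and the toy web are hypothetical, and giving dead ends a uniform row is an assumption that mirrors the "jump at random to any webpage" rule.

```python
def transition_matrix(links):
    """Build a row-stochastic matrix M from a dict page -> list of linked pages."""
    pages = sorted(links)
    index = {page: i for i, page in enumerate(pages)}
    n = len(pages)
    matrix = []
    for page in pages:
        targets = links[page]
        if targets:
            row = [0.0] * n
            for target in targets:
                row[index[target]] += 1.0 / len(targets)  # equal chance per link
        else:
            row = [1.0 / n] * n  # dead end: assumed uniform jump to any page
        matrix.append(row)
    return pages, matrix

# Hypothetical toy web; "d" has no outgoing links.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": []}
pages, M = transition_matrix(toy_web)
row_sums = [sum(row) for row in M]  # each entry should be 1
```

Row-stochasticity alone does not make the chain ergodic; in this toy web no page links to "d", so without teleporting the walker would never return there.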

PageRank and Personalized PageRank
Random Walks: PageRank
How to compute the long-term visit rate?
Probability vectors $P = [p_1 \ldots p_n]$: the walk is in state $i$ with probability $p_i$
  For instance, a vector with all its mass on position $i$ means we are at state $i$ (start)
Given $P_j$ at step $j$, what is $P_{j+1}$ if we take one step?
  $P_{j+1} = P_j M$
Algorithm: iterate until convergence
The steady state: $P_s = P_s M$
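
The iteration $P_{j+1} = P_j M$ can be sketched in a few lines. The two-state matrix below is a hypothetical example chosen so that the answer is easy to verify by hand; the function names and the tolerance are likewise assumptions.

```python
def step(p, m):
    """One step of the walk: the row vector p multiplied by the matrix m."""
    n = len(m)
    return [sum(p[i] * m[i][j] for i in range(n)) for j in range(n)]

def steady_state(m, start, tol=1e-12, max_iter=10_000):
    """Iterate P_{j+1} = P_j M until the vector stops changing."""
    p = list(start)
    for _ in range(max_iter):
        nxt = step(p, m)
        if max(abs(a - b) for a, b in zip(p, nxt)) < tol:
            return nxt
        p = nxt
    return p

# Hypothetical ergodic 2-state chain (numbers chosen for illustration).
M = [[0.25, 0.75],
     [0.50, 0.50]]
from_state_0 = steady_state(M, [1.0, 0.0])  # start in state 0
from_state_1 = steady_state(M, [0.0, 1.0])  # start in state 1
```

For this matrix the steady state is $[0.4, 0.6]$, since $0.4 \cdot 0.25 + 0.6 \cdot 0.5 = 0.4$ and $0.4 \cdot 0.75 + 0.6 \cdot 0.5 = 0.6$. Both starting vectors converge to it, which illustrates that the start state does not matter for an ergodic chain.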

PageRank and Personalized PageRank
Random Walks: PageRank
Let's factor out teleporting:
  $M$: $N \times N$ transition probability matrix
  $v$: $1 \times N$ teleport probability vector
  $P$: $1 \times N$ PageRank vector
  $P_s = (1 - c)\, P_s M + c\, v$
    $(1 - c)\, P_s M$: walker follows edges
    $c\, v$: walker jumps to any node with probability $1/N$
  $c$: teleport ratio, the way in which these two terms are combined (e.g. 0.15)
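
One way to see that the factored form is still an ordinary Markov chain is the short derivation below. It is a sketch under two assumptions not spelled out on the slide: $\mathbf{1}$ denotes the $1 \times N$ row vector of ones, and $M$ is already row-stochastic (dead ends patched).

```latex
\begin{align*}
P_s &= (1 - c)\, P_s M + c\, v \\
    &= (1 - c)\, P_s M + c\, (P_s \mathbf{1}^{\top})\, v
       && \text{since } P_s \mathbf{1}^{\top} = \textstyle\sum_i p_i = 1 \\
    &= P_s \left[ (1 - c)\, M + c\, \mathbf{1}^{\top} v \right]
\end{align*}
```

The matrix $(1 - c)\, M + c\, \mathbf{1}^{\top} v$ has rows that sum to $(1 - c) + c = 1$ whenever $v$ sums to one. With uniform $v$ every entry is at least $c/N > 0$, which corresponds to adding links to all webpages.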

PageRank and Personalized PageRank
Random Walks: Personalized PageRank
PageRank gives a static view of the graph. We need to include context:
  Importance of nodes according to some node(s) of interest.
Personalized PageRank: non-uniform $v$ [Haveliwala, 2002]
  Assign stronger probabilities to certain nodes in $v$
  Bias PageRank to prefer these nodes
  $P_s = (1 - c)\, P_s M + c\, v$
For example, if we concentrate all the mass of $v$ on node $n_i$ (e.g. the Tomar website):
  All random jumps return to $n_i$
  The rank of $n_i$ will be high
  The high rank of $n_i$ will make all the nodes in its vicinity also receive a high rank
  The importance of $n_i$ given by the initial $v$ spreads along the graph (e.g. websites closely related to Tomar)
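
The effect of a non-uniform $v$ can be checked on a small graph. The sketch below assumes a hypothetical five-node path graph and a fixed number of iterations; the function name `pagerank` is illustrative and the code is not taken from the talk.

```python
def pagerank(m, v, c=0.15, iterations=200):
    """Iterate P <- (1 - c) * P * M + c * v for a row-stochastic matrix m."""
    n = len(m)
    p = list(v)
    for _ in range(iterations):
        p = [(1 - c) * sum(p[i] * m[i][j] for i in range(n)) + c * v[j]
             for j in range(n)]
    return p

# Hypothetical path graph 0 - 1 - 2 - 3 - 4 with undirected edges.
M = [
    [0.0, 1.0, 0.0, 0.0, 0.0],
    [0.5, 0.0, 0.5, 0.0, 0.0],
    [0.0, 0.5, 0.0, 0.5, 0.0],
    [0.0, 0.0, 0.5, 0.0, 0.5],
    [0.0, 0.0, 0.0, 1.0, 0.0],
]
uniform = pagerank(M, [0.2] * 5)                       # plain PageRank
personalized = pagerank(M, [1.0, 0.0, 0.0, 0.0, 0.0])  # all teleport mass on node 0
```

Compared with the uniform run, the personalized run shifts rank toward node 0 and its neighbourhood, and node 4, the farthest from it, loses rank.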

Random walks for Disambiguation
Word Sense Disambiguation (WSD)
Goal: determine the senses of the open-class words in a text.
  Nadal is sharing a house with his uncle and coach, Toni.
  Our fleet comprises coaches from 35 to 58 seats.
Knowledge Base (e.g. WordNet):
  coach#1: someone in charge of training an athlete or a team.
  ...
  coach#5: a vehicle carrying many passengers; used for public transport.

Random walks for Disambiguation
Word Sense Disambiguation (WSD)
Many potential applications: enable natural language understanding, link text to knowledge bases, deploy the semantic web.
Supervised corpus-based WSD performs best
  Train classifiers on hand-tagged data (typically SemCor)
  Data sparseness, e.g. coach: 20 examples (20,0,0,0,0,0)
  Results decrease when train and test data come from different sources (even Brown, BNC)
  They decrease even more when train and test data come from different domains
Knowledge-based WSD
  Uses information in a KB (WordNet)
  Relation coverage

Random walks for Disambiguation
Word Sense Disambiguation (WSD): WordNet is the usual KB for WSD
WordNet is the most widely used hierarchically organized lexical database for English (Fellbaum, 1998)
Broad coverage of nouns, verbs, adjectives, adverbs
Main unit: synset (concept)
  coach#1, manager#3, handler#2: someone in charge of training an athlete or a team.
A word is associated with several concepts (word senses)
A concept can be lexicalised with several words (variants)
Relations between concepts: synonymy (built-in), hyperonymy, antonymy, meronymy, entailment, derivation, gloss

Random walks for Disambiguation
Word Sense Disambiguation (WSD): WordNet is the usual KB for WSD
Representing WordNet as a graph [Hughes and Ramage, 2007]:
  Nodes represent concepts
  Edges represent relations (undirected)
  In addition, directed edges from words to corresponding concepts (senses)
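
The graph representation described above might be sketched as an adjacency structure. The helper `build_graph` and the handful of relations below are hypothetical; the sense labels follow the talk's "coach" example, but the exact edges are assumed for illustration.

```python
def build_graph(concept_relations, word_senses):
    """Return a dict node -> set of nodes reachable over one outgoing edge."""
    out_edges = {}
    for a, b in concept_relations:
        # Relations between concepts are undirected: add both directions.
        out_edges.setdefault(a, set()).add(b)
        out_edges.setdefault(b, set()).add(a)
    for word, senses in word_senses.items():
        for sense in senses:
            # Word-to-sense edges are directed: no edge back to the word.
            out_edges.setdefault(word, set()).add(sense)
            out_edges.setdefault(sense, set())
    return out_edges

# Hypothetical fragment (edges assumed for illustration).
relations = [
    ("coach#n1", "trainer#n1"),
    ("coach#n2", "teacher#n1"),
    ("coach#n5", "public_transport#n1"),
    ("coach#n5", "fleet#n2"),
]
senses = {"coach": ["coach#n1", "coach#n2", "coach#n5"], "fleet": ["fleet#n2"]}
graph = build_graph(relations, senses)
```

Because word nodes have only outgoing edges, a random walk can leave a word toward its senses but never flows back into it; words act purely as entry points into the concept graph.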

Random walks for Disambiguation
Word Sense Disambiguation (WSD): WordNet is the usual KB for WSD
[Figure: WordNet graph fragment around the word "coach". Concept nodes: handle#v6, managership#n3, trainer#n1, sport#n1, coach#n1, teacher#n1, coach#n2, tutorial#n1, coach#n5, public_transport#n1, fleet#n2, seat#n1. Edge labels: derivation, domain, hyperonym, holonym.]

Random walks for Disambiguation
Word Sense Disambiguation (WSD): Using Personalized PageRank for WSD [Agirre et al., 2014]
Our fleet comprises coaches from 35 to 58 seats.
$P_s = (1 - c)\, P_s M + c\, v$
For each word $W_i$, $i = 1 \ldots m$, in the context (e.g. coach):
  Initialize $v$ with uniform probabilities over the words $W_j$, $j \neq i$ (e.g. fleet, comprise, seat)
  Context words act as source nodes injecting probability mass into the concept graph
  Run Personalized PageRank, yielding $P_s$
  Choose the highest-ranking sense for the target word $W_i$ in $P_s$ (e.g. coach)
This is called word-to-word Personalized PageRank, PPR w2w
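
The word-to-word procedure can be illustrated on a toy graph. Everything below is a sketch under assumptions: the graph is a hand-made fragment (its edges are invented, not WordNet's), the function names are hypothetical, and dead ends are handled by returning their mass through $v$, which is one common convention rather than something the talk specifies.

```python
def personalized_pagerank(out_edges, v, c=0.15, iterations=100):
    """Iterate P <- (1 - c) * P * M + c * v on a dict-based graph."""
    p = {node: v.get(node, 0.0) for node in out_edges}
    for _ in range(iterations):
        nxt = {node: c * v.get(node, 0.0) for node in out_edges}
        for node, targets in out_edges.items():
            mass = (1 - c) * p[node]
            if targets:
                for target in targets:
                    nxt[target] += mass / len(targets)
            else:
                # Dead end (assumed convention): send its mass back through v.
                for target, weight in v.items():
                    nxt[target] += mass * weight
        p = nxt
    return p

def undirected(pairs):
    """Adjacency lists with both directions for each concept relation."""
    edges = {}
    for a, b in pairs:
        edges.setdefault(a, []).append(b)
        edges.setdefault(b, []).append(a)
    return edges

# Hypothetical concept graph (edges assumed for illustration).
graph = undirected([
    ("coach#n1", "trainer#n1"), ("coach#n1", "sport#n1"),
    ("coach#n2", "teacher#n1"), ("coach#n2", "tutorial#n1"),
    ("coach#n5", "public_transport#n1"), ("coach#n5", "fleet#n2"),
    ("coach#n5", "seat#n1"),
])
graph["comprise#v1"] = []  # isolated concept in this toy fragment
word_senses = {
    "coach": ["coach#n1", "coach#n2", "coach#n5"],
    "fleet": ["fleet#n2"], "seat": ["seat#n1"], "comprise": ["comprise#v1"],
}
for word, senses in word_senses.items():
    graph[word] = list(senses)  # directed edges from words to their senses

def disambiguate(target, context_words):
    """Pick the sense of target ranked highest when v covers the other words."""
    v = {word: 1.0 / len(context_words) for word in context_words}
    ranks = personalized_pagerank(graph, v)
    return max(word_senses[target], key=ranks.get)

best = disambiguate("coach", ["fleet", "comprise", "seat"])
```

In this fragment only the vehicle sense is connected to the senses of "fleet" and "seat", so `best` is `coach#n5`. A real WordNet graph is far denser, and all senses would typically receive some mass.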

Random walks for Disambiguation
Word Sense Disambiguation (WSD): Using Personalized PageRank (PPR)
Our fleet comprises coaches from 35 to 58 seats.
[Figure: WordNet graph fragment with the word nodes coach, fleet, comprise, ..., seat. Concept nodes: handle#n8, managership#n3, trainer#n1, sport#n1, coach#n1, teacher#n1, coach#n2, tutorial#n1, coach#n5, comprise#v1, ..., fleet#n2, public_transport#n1, seat#n1.]

Random walks for Disambiguation
Word Sense Disambiguation (WSD): Results and comparison to related work
System                         Type   S2AW   S3AW   S07CG   (N)
[Agirre and Soroa, 2008]       KB
[Tsatsaronis et al., 2010]     KB
[Ponzetto and Navigli, 2010]   KB                           (79.4)
[Moro et al., 2014]            KB                           (84.6)
PPR w2w                        KB                           (83.6)
PPR w2w + MFS                  KB                           (82.1)
[Taghipour and Ng, 2015]       SUP                          (82.3)

Random walks for Disambiguation
WSD on the biomedical domain: Biomedical WSD and UMLS [Agirre et al., 2010]
Ambiguity is believed not to occur in specific domains, but:
  On the Use of Cold Water as a Powerful Remedial Agent in Chronic Disease.
  Intranasal ipratropium bromide for the common cold.
  11.7% of the phrases in abstracts added to MEDLINE in 1998 were ambiguous (Weeber et al. 2011)
Unified Medical Language System (UMLS) Metathesaurus
  Concept Unique Identifiers (CUIs)
  C : Cold (Cold Sensation) [Physiologic Function]
  C : Cold (cold temperature) [Natural Phenomenon or Process]
  C : Cold (Common Cold) [Disease or Syndrome]

Random walks for Disambiguation
WSD on the biomedical domain: Biomedical WSD and UMLS [Agirre et al., 2010]
UMLS is a Metathesaurus (1M CUIs): Alcohol and other drugs, Medical Subject Headings, Crisp Thesaurus, SNOMED Clinical Terms, etc.
Relations in the Metathesaurus between CUIs (5M): parent, can be qualified by, related possibly synonymous, related other
We applied Personalized PageRank.
Evaluated on NLM-WSD: 50 ambiguous terms (100 instances each)
KB              #CUIs     #relations   Acc.   Terms
AOD             15,901    58,
MSH             278,297   1,098,
CSP             16,703    73,
SNOMEDCT        304,443   1,237,
all above       572,105   2,433,
all relations   -         5,352,
[Jimeno and Aronson, 2011]

69 Random walks for Disambiguation: Named-Entity Disambiguation (NED)
Named Entity Disambiguation [Agirre et al., 2015, Barrena et al., 2015]
Given a Named Entity mention, ground it to an instance in the KB (aka Entity Linking, Wikification)
KB is Wikipedia (DBpedia), represented as a graph:
  5M articles: nodes, representing concepts and instances
  90M hyperlinks: edges, representing relations
Example: Alan Kourie, CEO of the Lions franchise, had discussions with Fletcher in Cape Town.

71 Random walks for Disambiguation: Named-Entity Disambiguation (NED)
Named Entity Disambiguation
Main steps:
  Named Entity Recognition in text (NER)
  Dictionary for candidate generation: use titles, redirects, text in anchors
  Disambiguation: Personalized PageRank
Partial view of the dictionary entry for the string "gotham" (columns: Article, Freq., P(e | s)):
  GOTHAM CITY
  GOTHAM (MAGAZINE)
  GOTHAM (TYPEFACE)
  GOTHAM, NOTTINGHAMSHIRE
  GOTHAM (ALBUM)
  GOTHAM (BAND)
  NEW YORK CITY
  GOTHAM RECORDS
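The candidate-generation step can be sketched as follows. This is only an illustration, assuming the dictionary is built from (string, article) observations harvested from titles, redirects and anchor text, and that P(e | s) is a plain relative frequency. The helper name `build_dictionary` and all counts are hypothetical and do not reproduce the figures on the slide.

```python
from collections import Counter, defaultdict

def build_dictionary(pairs):
    """Map each surface string s to {article e: P(e | s)}.

    `pairs` is an iterable of (string, article) observations, for example
    one per anchor occurrence. P(e | s) is estimated as the relative
    frequency of e among all observations of s (strings are lowercased).
    """
    counts = defaultdict(Counter)
    for string, article in pairs:
        counts[string.lower()][article] += 1
    dictionary = {}
    for string, article_counts in counts.items():
        total = sum(article_counts.values())
        dictionary[string] = {a: n / total for a, n in article_counts.items()}
    return dictionary

# Hypothetical anchor observations (made-up counts, illustration only).
observations = (
    [("Gotham", "GOTHAM CITY")] * 6
    + [("gotham", "NEW YORK CITY")] * 3
    + [("Gotham", "GOTHAM (TYPEFACE)")] * 1
)
gotham = build_dictionary(observations)["gotham"]
```

The resulting entry lists every candidate article for the string together with its prior, which is what the disambiguation step then has to choose from.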

74 Random walks for Disambiguation: Named-Entity Disambiguation (NED)
Named Entity Disambiguation
Results on TAC2009, TAC2010, TAC2013 and AIDA: PPR w2w compared with the best system
Evaluation: accuracy for KB mentions (we don't do NILs)
Best: best system in each competition, [Houlsby and Ciaramita, 2014] for AIDA
Key for performance: only keep hyperlinks which have a reciprocal hyperlink (e.g. Tomar and Santarem district)
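The reciprocal-hyperlink filter mentioned above could look like the sketch below. It assumes hyperlinks are available as directed (source, target) pairs; the function name `keep_reciprocal` and the sample links are hypothetical.

```python
def keep_reciprocal(edges):
    """Return the directed edges (a, b) whose reverse (b, a) also exists.

    Self-links (a, a) are dropped, since they carry no relation
    between two different articles.
    """
    edge_set = set(edges)
    return {(a, b) for (a, b) in edge_set if a != b and (b, a) in edge_set}

# Hypothetical hyperlinks between articles (illustration only).
links = [
    ("Tomar", "Santarem District"),
    ("Santarem District", "Tomar"),
    ("Tomar", "Portugal"),  # no link back in this toy sample
]
reciprocal = keep_reciprocal(links)
```

Only the Tomar / Santarem District pair survives here, because the toy sample contains no link from Portugal back to Tomar.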

77 Random walks for Disambiguation: Complementary to other resources?
Combining graphs & supervised NED [Barrena et al., 2015]
We set up a generative framework:
  entity knowledge $P(e)$
  name knowledge $P(s \mid e)$
  context knowledge $P(c_{bow} \mid e)$
  context knowledge $P(c_{grf} \mid e)$
Return the entity which maximizes the joint probability:
  $\arg\max_e P(s, c, e) = \arg\max_e P(e)\, P(s \mid e)\, P(c_{bow} \mid e)\, P(c_{grf} \mid e)$
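As a rough illustration of the arg max above, the sketch below scores each candidate entity by the product of the four factors, computed as a sum of logs to avoid underflow. It assumes all probabilities are strictly positive (for instance after smoothing); the function name `best_entity` and every number are hypothetical.

```python
import math

def best_entity(candidates):
    """Return the entity e maximizing P(e) P(s|e) P(c_bow|e) P(c_grf|e).

    `candidates` maps each entity to a tuple with those four
    probabilities. Assumes every probability is > 0.
    """
    def log_joint(probs):
        return sum(math.log(p) for p in probs)
    return max(candidates, key=lambda e: log_joint(candidates[e]))

# Hypothetical probabilities: (P(e), P(s|e), P(c_bow|e), P(c_grf|e)).
candidates = {
    "GOTHAM CITY": (0.5, 0.6, 0.01, 0.02),
    "NEW YORK CITY": (0.4, 0.1, 0.05, 0.10),
}
winner = best_entity(candidates)
```

In this toy case the context factors outweigh the stronger name prior of the first candidate, so the second one is returned.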

80 Random walks for Disambiguation: Complementary to other resources?
Combining graphs & supervised NED
Results
Best: best system in each competition, [Houlsby and Ciaramita, 2014] for AIDA
Knowledge-Based and Supervised are competitive and complementary!

82 Random walks for similarity
Similarity (and relatedness)
Given two words or multiword expressions, estimate how similar they are.
  Example: gem / jewel (features shared, superclass shared)
Relatedness is a more general relationship, including other relations like topical relatedness or meronymy.
  Example: movie / star
Similarity and disambiguation are closely related!
Gold Standard: a numeric value of similarity/relatedness.

86 Random walks for similarity
Similarity datasets
RG dataset (similarity)            WordSim353 dataset (relatedness)
cord smile            0.02         king cabbage            0.23
rooster voyage        0.04         professor cucumber
glass jewel           1.78         investigation effort    4.59
magician oracle       1.82         movie star
cemetery graveyard    3.88         journey voyage          9.29
automobile car        3.92         midday noon             9.29
midday noon           3.94         tiger tiger
pairs, 51 subjects                 353 pairs, 16 subjects
Evaluation: Spearman correlation
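The evaluation measure can be made concrete with a small, dependency-free sketch: Spearman correlation is the Pearson correlation of the two rank vectors, with tied values sharing their average rank. The gold values below are the RG scores listed on the slide; the system scores and the helper names are hypothetical.

```python
def average_ranks(values):
    """Rank values from 1..n, giving tied values their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

gold = [0.02, 0.04, 1.78, 1.82, 3.88, 3.92, 3.94]    # RG values from the slide
system = [0.10, 0.05, 0.40, 0.55, 0.80, 0.95, 0.90]  # hypothetical system scores
rho = spearman(gold, system)
```

Because only ranks matter, a system does not need to reproduce the gold scale (0 to 4 for RG, 0 to 10 for WordSim353), only the ordering of the pairs.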

88 Random walks for similarity
Similarity
Many potential applications:
  Overcome brittleness (word match)
  NLP subtasks (parsing, semantic role labeling)
  Information retrieval
  Question answering
  Summarization
  Machine translation optimization and evaluation
  Inference (textual entailment)
Two main approaches:
  Knowledge-based
  Corpus-based, also known as distributional similarity (embeddings!)

91 Random walks for similarity: Using Random walks
Random walks [Hughes and Ramage, 2007] [Eneko Agirre and Soroa, 2010, Agirre et al., 2015]
Given two words, estimate how similar they are. Example: gem / jewel
Given a pair of words $(w_1, w_2)$:
  Initialize the teleport probability mass $\vec{v}$ with $w_1$
  Run Personalized PageRank, obtaining $\vec{w_1} = \vec{P_s}$
  Initialize $\vec{v}$ with $w_2$ and obtain $\vec{w_2} = \vec{P_s}$
  Measure the similarity between $\vec{w_1}$ and $\vec{w_2}$ (e.g. cosine)
$\vec{P_s} = (1 - c)\, \vec{P_s} M + c\, \vec{v}$
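The procedure above can be sketched with a few lines of NumPy. This is an illustration under several assumptions: the update is read in row-vector form with a row-stochastic matrix M, the teleport probability c is set to 0.15, the fixed point is approximated by plain power iteration, and each word is mapped to a single node of a toy graph (in a real KB a word would spread its mass over all its senses). The graph and helper names are hypothetical.

```python
import numpy as np

def personalized_pagerank(M, v, c=0.15, iters=100):
    """Iterate p = (1 - c) * p @ M + c * v, starting from p = v.

    M is a row-stochastic transition matrix, v the teleport
    distribution and c the teleport probability.
    """
    p = v.copy()
    for _ in range(iters):
        p = (1 - c) * (p @ M) + c * v
    return p

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy undirected graph (hypothetical): gem, jewel and stone are mutually
# linked; movie and star are linked to each other only.
nodes = ["gem", "jewel", "stone", "movie", "star"]
A = np.array([
    [0, 1, 1, 0, 0],
    [1, 0, 1, 0, 0],
    [1, 1, 0, 0, 0],
    [0, 0, 0, 0, 1],
    [0, 0, 0, 1, 0],
], dtype=float)
M = A / A.sum(axis=1, keepdims=True)  # normalise rows to probabilities

def word_vector(word):
    v = np.zeros(len(nodes))
    v[nodes.index(word)] = 1.0  # all teleport mass on the word's node
    return personalized_pagerank(M, v)

sim_close = cosine(word_vector("gem"), word_vector("jewel"))
sim_far = cosine(word_vector("gem"), word_vector("movie"))
```

Words in the same neighbourhood end up with overlapping probability vectors and a high cosine, while words in disconnected parts of the graph share no mass at all.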

93 Random walks for similarity: Using Random walks
Using Random Walks
Probability vectors on Wikipedia for drink and alcohol.
drink                        alcohol
DRINK               .124     ALCOHOL             .145
ALCOHOLIC BEVERAGE  .036     ALCOHOLIC BEVERAGE  .026
DRINKING            .028     ETHANOL             .018
COFFEE              .020     ALKENE              .006
TEA                 .017     ALCOHOLISM          .006
CIDER               .016     ALDEHYDE            .005
MASALA CHAI         .014     KETONE              .004
WINE                .014     ESTER               .004
SUGAR SUBSTITUTE    .014     ALKANE              .004
CAPPUCCINO          .013     ISOPROPYL ALCOHOL   .003
HOT CHOCOLATE       .013     ETHER

94 Random walks for similarity: Using Random walks
Using Random walks
Method                               Source           WS353   RG
[Hughes and Ramage, 2007]            WordNet
[Finkelstein et al., 2002]           Corpora (LSA)
[Agirre et al., 2009]                Corpora
PPR                                  WordNet
[Huang et al., 2012]                 Corpora (NN)
[Baroni et al., 2014]                Corpora (NN)
PPR                                  Wikipedia
[Gabrilovich and Markovitch, 2007]   Wikipedia
[Reisinger and Mooney, 2010]         Corpora
[Pilehvar et al., 2013]              BabelNet
PPR                                  Wiki + WNet
[Radinsky et al., 2011]              Corpora (Time)

96 Random walks for similarity: Embedding random walks
Low-dimensional word representations [Goikoetxea et al., 2015]
Vectors produced by PPR contain thousands (or millions) of dimensions
Feed random walks (on WordNet) into a neural network language model (word2vec)

99 Random walks for similarity: Embedding random walks
Low-dimensional word representations
Producing the pseudo-corpus:
  1 start a random walk at any synset
  2 emit a lexicalization
  3 with probability 85% follow an edge, go to step 2
  4 else restart, go to step 1
Examples of text generated by random walks on WordNet:
  yucatec mayan quiche kekchi speak sino-tibetan tone language west chadic amphora wine nabuchadnezzar bear retain long graphology writer write scribble scrawler heedlessly in haste jot note notebook
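The four steps above translate almost directly into code. The sketch below is a simplified reading, not the authors' implementation: it assumes the graph is given as adjacency lists over synsets, treats every restart as the end of a pseudo-sentence, and also restarts when a synset has no outgoing edge. The toy graph, the lexicalizations and the function name are hypothetical.

```python
import random

def generate_pseudo_corpus(graph, lexicalizations, n_sentences,
                           follow_prob=0.85, seed=0):
    """Emit pseudo-sentences by random walks over a synset graph.

    graph: synset -> list of neighbouring synsets
    lexicalizations: synset -> list of words that lexicalize it
    """
    rng = random.Random(seed)
    synsets = sorted(graph)
    corpus = []
    for _ in range(n_sentences):
        node = rng.choice(synsets)  # step 1: start at any synset
        sentence = []
        while True:
            # Step 2: emit one lexicalization of the current synset.
            sentence.append(rng.choice(lexicalizations[node]))
            # Step 3: with probability follow_prob move along an edge.
            if graph[node] and rng.random() < follow_prob:
                node = rng.choice(graph[node])
            else:
                break  # step 4: restart, which closes this sentence
        corpus.append(sentence)
    return corpus

# Hypothetical toy graph and lexicalizations (illustration only).
toy_graph = {
    "writer.n": ["write.v", "notebook.n"],
    "write.v": ["writer.n", "scribble.v"],
    "scribble.v": ["write.v"],
    "notebook.n": ["writer.n"],
}
toy_words = {
    "writer.n": ["writer"],
    "write.v": ["write"],
    "scribble.v": ["scribble", "scrawl"],
    "notebook.n": ["notebook"],
}
corpus = generate_pseudo_corpus(toy_graph, toy_words, n_sentences=5)
```

Each inner list could then be written out as one line of text, giving a pseudo-corpus in the format that word2vec-style tools typically expect.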

101 Random walks for similarity: Embedding random walks
Low-dimensional word representations
Results on Relatedness and Similarity

103 Random walks for similarity: Complementary to other resources?
Combining graph- and text-based embeddings [Goikoetxea et al., 2016]
Given two sources of embeddings:
  Large corpora
  Random walks on WordNet
Combine embeddings into single embeddings:
  Using centroid (CEN)
  Concatenating embeddings (CAT)
  Dimensionality reduction of CAT (PCA)
Combine cosines from each embedding space:
  Average of cosines (AVG)
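The four combination strategies can be sketched in NumPy as below. Several details are assumptions rather than something the slide states: vectors are length-normalised before being combined, the centroid requires both spaces to have the same dimensionality, and PCA is computed through an SVD of the centred matrix. The random matrices stand in for real embeddings and all function names are hypothetical.

```python
import numpy as np

def unit(x):
    """Length-normalise a vector, or each row of a matrix."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)

def cosine(a, b):
    return float(unit(a) @ unit(b))

def combine_cen(corpus_emb, wordnet_emb):
    """CEN: centroid (average) of the two normalised embeddings."""
    return (unit(corpus_emb) + unit(wordnet_emb)) / 2

def combine_cat(corpus_emb, wordnet_emb):
    """CAT: concatenation of the two normalised embeddings."""
    return np.concatenate([unit(corpus_emb), unit(wordnet_emb)], axis=-1)

def pca_reduce(matrix, k):
    """PCA: project the rows of `matrix` onto its top-k principal components."""
    centered = matrix - matrix.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:k].T

def avg_cosine(c1, w1, c2, w2):
    """AVG: mean of the cosines computed separately in each space."""
    return (cosine(c1, c2) + cosine(w1, w2)) / 2

# Stand-in embeddings for a 6-word vocabulary (random, illustration only).
rng = np.random.default_rng(0)
corpus_emb = rng.normal(size=(6, 4))
wordnet_emb = rng.normal(size=(6, 4))

cen = combine_cen(corpus_emb, wordnet_emb)
cat = combine_cat(corpus_emb, wordnet_emb)
pca = pca_reduce(cat, k=3)
avg = avg_cosine(corpus_emb[0], wordnet_emb[0], corpus_emb[1], wordnet_emb[1])
```

CEN, CAT and PCA each yield one vector per word that can be used like any other embedding, whereas AVG never builds a joint vector and only combines the two similarity scores.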

106 Random walks for similarity: Complementary to other resources?
Combining graph- and text-based embeddings
Improvement of the combinations with respect to corpus-based embeddings:
  Datasets (columns): RG, SL, WSS, WSR, MTU, MEN, WS, all
  Methods (rows): Corpus, WordNet, CEN, AVG, CAT, PCA
Random walks on WordNet are competitive with corpus-based embeddings
Very large improvements, showing that they are highly complementary!

107 Random walks for similarity: Complementary to other resources?
Combining graphs with text-based embeddings
Other alternatives provide smaller improvements (e.g. retrofitting [Faruqui et al., 2015])

109 Similarity and Information Retrieval
Similarity and Information Retrieval [Otegi et al., 2014]
In information retrieval, given a query, we need to retrieve a document, but mismatches happen (example from Yahoo! Answers):
  Query: I can't install DSL because of the antivirus program, any hints?
  Document: You should turn off virus and anti-spy software. And thats done within each of the softwares themselves. Then turn them back on later after setting up any DSL softwares.
Document expansion (aka clustering and smoothing) has been shown to be successful in IR
Use WordNet and similarity to expand documents
Method:
  Initialize random walk with document words
  Retrieve top k synsets
  Introduce words on those k synsets in a secondary index
  When retrieving, use both primary and secondary indexes
Results: better results, particularly with domain changes and short documents
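The expansion method listed above might be sketched as follows. The sketch assumes the same Personalized PageRank update used earlier in the talk, with teleport mass spread uniformly over the synsets that contain a document word, and it simply returns the new words that would go into the secondary index. The four-synset graph, the word lists and the function name `expand_document` are hypothetical.

```python
import numpy as np

def expand_document(doc_words, nodes, M, words_of, k=2, c=0.15, iters=100):
    """Return expansion words for a document via Personalized PageRank.

    nodes: list of synset ids; M: row-stochastic transition matrix over
    them; words_of: synset id -> list of words in that synset.
    """
    doc_words = set(doc_words)
    seeds = [i for i, n in enumerate(nodes) if doc_words & set(words_of[n])]
    if not seeds:
        return []
    v = np.zeros(len(nodes))
    v[seeds] = 1.0 / len(seeds)  # teleport mass on the document's synsets
    p = v.copy()
    for _ in range(iters):
        p = (1 - c) * (p @ M) + c * v
    expansion = []
    for i in np.argsort(-p)[:k]:  # the k highest-ranked synsets
        for word in words_of[nodes[i]]:
            if word not in doc_words and word not in expansion:
                expansion.append(word)
    return expansion

# Hypothetical mini graph: virus - antivirus - software - dsl.
nodes = ["syn_virus", "syn_software", "syn_antivirus", "syn_dsl"]
words_of = {
    "syn_virus": ["virus"],
    "syn_software": ["software", "program"],
    "syn_antivirus": ["antivirus"],
    "syn_dsl": ["dsl"],
}
A = np.array([
    [0, 0, 1, 0],
    [0, 0, 1, 1],
    [1, 1, 0, 0],
    [0, 1, 0, 0],
], dtype=float)
M = A / A.sum(axis=1, keepdims=True)

expansion = expand_document({"virus", "software"}, nodes, M, words_of, k=2)
```

In this toy setting a document mentioning only "virus" and "software" gains "antivirus", the kind of term that would let it match the query in the example.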

112 Similarity and Information Retrieval
Example of document expansion
You should turn off virus and anti-spy software. And thats done within each of the softwares themselves. Then turn them back on later after setting up any DSL softwares.


F. Aiolli - Sistemi Informativi 2007/2008. Web Search before Google Web Search Engines 1 Web Search before Google Web Search Engines (WSEs) of the first generation (up to 1998) Identified relevance with topic-relateness Based on keywords inserted by web page creators (META

More information

Web Search: Techniques, algorithms and Aplications. Basic Techniques for Web Search

Web Search: Techniques, algorithms and Aplications. Basic Techniques for Web Search Web Search: Techniques, algorithms and Aplications Basic Techniques for Web Search German Rigau [Based on slides by Eneko Agirre and Christopher Manning and Prabhakar Raghavan] 1

More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India International Journal of Scientific Research in Computer Science, Engineering and Information Technology 2018 IJSRCSEIT Volume 3 Issue 3 ISSN : 2456-3307 Some Issues in Application of NLP to Intelligent

More information

Link Structure Analysis

Link Structure Analysis Link Structure Analysis Kira Radinsky All of the following slides are courtesy of Ronny Lempel (Yahoo!) Link Analysis In the Lecture HITS: topic-specific algorithm Assigns each page two scores a hub score

More information

Two graph-based algorithms for state-of-the-art WSD

Two graph-based algorithms for state-of-the-art WSD Two graph-based algorithms for state-of-the-art WSD Eneko Agirre, David Martínez, Oier López de Lacalle and Aitor Soroa IXA NLP Group University of the Basque Country Donostia, Basque Contry a.soroa@si.ehu.es

More information

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition Ana Zelaia, Olatz Arregi and Basilio Sierra Computer Science Faculty University of the Basque Country ana.zelaia@ehu.es

More information

Is Brad Pitt Related to Backstreet Boys? Exploring Related Entities

Is Brad Pitt Related to Backstreet Boys? Exploring Related Entities Is Brad Pitt Related to Backstreet Boys? Exploring Related Entities Nitish Aggarwal, Kartik Asooja, Paul Buitelaar, and Gabriela Vulcu Unit for Natural Language Processing Insight-centre, National University

More information

Information Retrieval. Lecture 11 - Link analysis

Information Retrieval. Lecture 11 - Link analysis Information Retrieval Lecture 11 - Link analysis Seminar für Sprachwissenschaft International Studies in Computational Linguistics Wintersemester 2007 1/ 35 Introduction Link analysis: using hyperlinks

More information


