NATURAL LANGUAGE PROCESSING


LESSON 9: SEMANTIC SIMILARITY

OUTLINE
Semantic Relations
Semantic Similarity Levels
  Sense Level
  Word Level
  Text Level
WordNet-based Similarity Methods
Hybrid Methods for Similarity
PageRank-based Similarity

SEMANTIC RELATIONS

In the literature, three different terms are used for similarity: semantic relatedness, semantic similarity and semantic distance. There are many semantic relation types. The most important semantic relations are synonymy (black - dark) and antonymy (black - white). But non-synonym (or non-antonym) entities may also be semantically related through other relationships such as meronymy (finger is a meronym of hand), hyponymy (eagle is a hyponym of bird) and hypernymy (bird is a hypernym of eagle).

Relation type      Example
Synonym (synset)   Different - Unlike
Antonym            Buy - Sell
Category domain    Cell - Biology
Sub event          Search - Query
Causes             Slimming - Weight loss
Hypernymy          Jam - Rose Jam
Hyponymy           Rose Jam - Jam
Similar to         Next - Following

SEMANTIC RELATIONS

Semantic similarity relates to computing the conceptual similarity between terms according to a lexicon (dictionary). Example: «tiger» - «cat» (Turkish: «kaplan» - «ev kedisi»).

While Antarctica and penguin are not similar according to their lexical definitions, we feel a strong relation between them, because most penguins live in Antarctica. However, there is no `live in` relation among the known semantic relations. For this kind of relation we should define a more general relation such as `related`. Note that semantic relatedness is a more flexible relation than the other known ones. For some other word pairs, `related` can also be used (pencil - paper, penguin - Antarctica, rain - flood).

SEMANTIC SIMILARITY LEVELS

There are three levels of semantic similarity:
Sense level deals with the conceptual part of a word. A sense is a unique representation of a concept and has no ambiguity.
Word level deals with the word itself, which might cover multiple senses, so ambiguity is possible.
Text level covers short texts (sentences, paragraphs) and documents. At this level, a text usually contains several ambiguities.

SENSE LEVEL SIMILARITY

This is the primary step of similarity; a sense is the concept that a word aims to define. A typical sense is fox#n#1, where n (noun) is the part-of-speech tag and 1 denotes the first meaning in the dictionary.
fox#n#1: alert carnivorous mammal.
fox#n#2: a shifty deceptive person.
To understand a text at the sense level, word sense disambiguation is required first.
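As an illustration, the sense inventory of a word can be inspected with NLTK's WordNet interface. This is a minimal sketch, assuming NLTK and its WordNet corpus are installed (not part of the lecture material).

```python
# Minimal sketch: listing the senses (synsets) of a word with NLTK,
# assuming NLTK is installed and nltk.download('wordnet') has been run.
from nltk.corpus import wordnet as wn

# List all senses of the word "fox".
for synset in wn.synsets('fox'):
    # synset.name() uses the form fox.n.01, which corresponds to fox#n#1.
    print(synset.name(), '-', synset.definition())

# Pick a specific sense directly by its identifier.
fox1 = wn.synset('fox.n.01')
print(fox1.definition())  # e.g. "alert carnivorous mammal ..."
```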

SENSE LEVEL SIMILARITY

Sense-level semantic similarity measures are mostly based on lexical resources such as dictionaries or thesauri. Lexical resources are mostly used in the form of semantic networks. In order to determine the semantic similarity of two senses, structural properties of these networks, such as adjacencies, are used. The most popular lexical resource is WordNet.

In addition to WordNet, other lexical resources include:
Collaboratively constructed resources such as Wikipedia and Wiktionary
Dictionaries such as the Longman Dictionary
Integrated knowledge resources such as BabelNet

WORD LEVEL SIMILARITY

The approaches at the word level can be grouped into two categories:
Distributional approaches
Lexical resource-based approaches

Distributional approaches use co-occurrence statistics to compute vector-based representations of words. The weights in co-occurrence-based vectors are usually computed by means of TF-IDF or Pointwise Mutual Information (PMI). The dimensionality of the resulting weight matrix is often reduced, for instance using Singular Value Decomposition. The structured textual content of specific lexical resources such as Wikipedia has also been used with distributional approaches. A small sketch of such a model is shown below.
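The following is a minimal sketch of a distributional model; the toy corpus, window size, and positive-PMI weighting are illustrative assumptions, not part of the lecture.

```python
# Minimal sketch: PMI-weighted co-occurrence vectors built from a toy corpus,
# compared with cosine similarity.
import math
from collections import Counter

corpus = [
    "the fox chased the rabbit".split(),
    "the cat chased the mouse".split(),
    "the fox is a wild animal".split(),
    "the cat is a tame animal".split(),
]

window = 2
word_counts, pair_counts, total = Counter(), Counter(), 0
for sentence in corpus:
    for i, w in enumerate(sentence):
        word_counts[w] += 1
        total += 1
        for j in range(max(0, i - window), min(len(sentence), i + window + 1)):
            if i != j:
                pair_counts[(w, sentence[j])] += 1

def pmi(w, c):
    """Positive Pointwise Mutual Information of word w and context word c."""
    if pair_counts[(w, c)] == 0:
        return 0.0
    p_wc = pair_counts[(w, c)] / sum(pair_counts.values())
    p_w, p_c = word_counts[w] / total, word_counts[c] / total
    return max(0.0, math.log(p_wc / (p_w * p_c)))

def vector(w):
    return {c: pmi(w, c) for c in word_counts}

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    nu = math.sqrt(sum(x * x for x in u.values()))
    nv = math.sqrt(sum(x * x for x in v.values()))
    return dot / (nu * nv) if nu and nv else 0.0

# Words that appear in similar contexts get a high score.
print(cosine(vector("fox"), vector("cat")))
```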

WORD LEVEL SIMILARITY

Lexical resource-based approaches usually assume that the similarity of two words can be calculated in terms of the similarity of their closest senses (for example, fox and rabbit). Therefore every sense-level approach can be directly applied to word similarity.

TEXT LEVEL SIMILARITY

Text-level similarity methods can be grouped into two categories:
Viewing a text as a combination of words and calculating the similarity of two texts by aggregating the similarities of word pairs across the two texts,
Modelling a text as a whole and calculating the similarity of two texts by comparing the two models obtained.

TEXT LEVEL SIMILARITY

Approaches in the first category search for pairs of words in the two texts that maximize similarity and compute the overall similarity by aggregating the individual similarity values.
`Car goes faster than horse.` -> tokens = {car, go, fast, horse}
`Train goes in railway.` -> tokens = {train, go, railway}

The second category usually involves transforming texts into vectors and computing the similarity of texts by comparing their corresponding vectors. Vector space models are an example of this category. These models mainly focus on the representation of larger pieces of text, such as documents, where a text is modeled on the basis of the frequency statistics of the words it contains. Such models, however, suffer from sparseness and cannot capture similarities between short text pairs that use different wordings. A sketch of the word-pair aggregation strategy is given below.
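Below is a minimal sketch of the first (word-pair aggregation) strategy; the word-level similarity table is a toy stand-in for a real WordNet- or corpus-based measure, and the aggregation rule (best match per word, averaged in both directions) is one common choice among several.

```python
# Minimal sketch: text similarity by aggregating word-pair similarities.
TOY_SIM = {
    ("car", "train"): 0.7, ("fast", "railway"): 0.1,
    ("horse", "train"): 0.3, ("car", "railway"): 0.4,
}

def word_sim(a, b):
    """Illustrative word-level similarity; replace with a real measure."""
    if a == b:
        return 1.0
    return TOY_SIM.get((a, b), TOY_SIM.get((b, a), 0.0))

def text_sim(tokens1, tokens2):
    """For each word in one text take its best match in the other text,
    average the scores, and symmetrize over both directions."""
    def directed(src, dst):
        return sum(max(word_sim(w, v) for v in dst) for w in src) / len(src)
    return 0.5 * (directed(tokens1, tokens2) + directed(tokens2, tokens1))

print(text_sim({"car", "go", "fast", "horse"}, {"train", "go", "railway"}))
```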

WORDNET BASED SIMILARITY METHODS

WordNet is the most common structured lexical knowledge base and is organized hierarchically as a graph. It consists of nodes and edges; the nodes are synsets and the edges are relations. More than 20 relation types are defined in the current version of WordNet. Early WordNet-based methods use the hypernymy (is-a), meronymy (part-of) and antonymy relations; recent methods use all available relation types.

WordNet-based similarity methods use the graph structure of WordNet and measure similarity with several metrics such as path length, depth, lcs (lowest common subsumer) and the direction of the relations. The following are the most frequently used WordNet-based similarity methods:
Leacock & Chodorow Method (1998)
Wu & Palmer Method (1994)
Hirst and St-Onge Method (1998)

LEACOCK & CHODOROW METHOD (1998)

sim_LC(A, B) = -log( len(A, B) / (2 * D_max) )

len(A, B): length of the shortest path between the two concepts, using node counting
D_max: maximum depth of the taxonomy

WU & PALMER METHOD (1994)

sim_WP(a, b) = 2 * depth(lcs(a, b)) / ( depth(a) + depth(b) )

lcs(a, b): lowest common subsumer of a and b
depth(c): number of edges from the root to c
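A minimal sketch of these two measures follows. The NLTK calls (lch_similarity, wup_similarity) assume NLTK and its WordNet corpus are installed; the standalone functions restate the formulas above with illustrative input values.

```python
# Minimal sketch: Leacock & Chodorow and Wu & Palmer similarities.
import math
from nltk.corpus import wordnet as wn

dog = wn.synset('dog.n.01')
cat = wn.synset('cat.n.01')
print('Leacock & Chodorow:', dog.lch_similarity(cat))
print('Wu & Palmer:       ', dog.wup_similarity(cat))

# The same ideas written out directly from the formulas, given a path length,
# concept depths and a taxonomy depth (toy values, not taken from WordNet).
def sim_lc(path_len, d_max):
    return -math.log(path_len / (2.0 * d_max))

def sim_wp(depth_a, depth_b, depth_lcs):
    return 2.0 * depth_lcs / (depth_a + depth_b)

print(sim_lc(path_len=4, d_max=20))
print(sim_wp(depth_a=13, depth_b=13, depth_lcs=11))
```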

HIRST AND ST-ONGE METHOD (1998)

Hirst and St-Onge's approach is summarized by the following formula for two WordNet concepts c1 and c2:

rel_HS(c1, c2) = C - len(c1, c2) - k * turns(c1, c2)

where C and k are constants (in practice, they used C = 8 and k = 1) and turns(c1, c2) is the number of times the path between c1 and c2 changes direction.

rel_HS(bike, truck) = 8 - len(bike, truck) - turns(bike, truck)
rel_HS(bike, truck) = 8 - 3 - 1 = 4

Here, the maximum similarity is 8 and the minimum is 0.

HYBRID METHODS FOR SIMILARITY

Hybrid methods utilize both WordNet and a corpus; they are also known as information content-based methods. They gather information from the corpus for a specific concept and use WordNet to compute similarity. Information Content (IC) is a measure of the specificity of a concept. Higher values are associated with more specific concepts (e.g., pitchfork; Turkish: yaba), while lower values indicate more general ones (e.g., idea; Turkish: fikir). IC is computed from the frequency counts of a concept found in a text corpus.

HYBRID METHODS FOR SIMILARITY

For each concept (synset) c in WordNet, Information Content is defined as the negative log of the probability of that concept (based on the observed frequency counts):

IC(c) = -log P(c)

Information content measures of similarity can only be applied to pairs of nouns or pairs of verbs. The following are the well-known hybrid methods:
Resnik's Information-based Approach (1995)
Lin's Universal Similarity Measure (1998)
Jiang and Conrath's Combined Approach (1997)

RESNIK'S INFORMATION-BASED APPROACH (1995)

The key idea underlying Resnik's (1995) approach is the intuition that one criterion of similarity between two concepts is the extent to which they share information in common.

IC(c) = -log P(c)
sim_R(c1, c2) = IC(lcs(c1, c2))

lcs is the lowest common subsumer, i.e., the most specific concept that both c1 and c2 are hyponyms of.
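A minimal sketch of IC and the Resnik measure, assuming concept frequency counts and a lowest-common-subsumer lookup are already available; the frequencies and the tiny taxonomy below are toy data, not real corpus counts.

```python
# Minimal sketch: Information Content and Resnik similarity on toy data.
import math

# Toy corpus frequencies per concept; counts propagate up the hierarchy,
# so a hypernym's count includes the counts of its hyponyms.
freq = {"entity": 1000, "animal": 300, "dog": 40, "cat": 35}
total = freq["entity"]

def ic(concept):
    """IC(c) = -log P(c), with P(c) estimated from corpus frequency."""
    return -math.log(freq[concept] / total)

def lcs(c1, c2):
    # Assumed toy taxonomy: the lowest common subsumer of dog and cat is animal.
    return "animal"

def sim_resnik(c1, c2):
    return ic(lcs(c1, c2))

print(ic("dog"), ic("animal"))   # more specific concepts have higher IC
print(sim_resnik("dog", "cat"))  # = IC(animal)
```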

LIN'S UNIVERSAL SIMILARITY MEASURE (1998)

The similarity between A and B is measured by the ratio between the amount of information needed to state their commonality and the information needed to fully describe what they are:

sim_L(A, B) = log P(comm(A, B)) / log P(descr(A, B))

sim_L(c1, c2) = 2 * log P(lcs(c1, c2)) / ( log P(c1) + log P(c2) )

JIANG AND CONRATH'S COMBINED APPROACH (1997)

Jiang and Conrath (1997) synthesize edge-based and node-based techniques. Their measure is formulated as a distance (lower values mean more similar concepts):

dist_JC(c1, c2) = IC(c1) + IC(c2) - 2 * IC(lcs(c1, c2))
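A minimal sketch of the Lin and Jiang & Conrath measures, using the same toy frequencies and taxonomy as the Resnik sketch above (redefined here so the snippet is self-contained).

```python
# Minimal sketch: Lin similarity and Jiang & Conrath distance on toy data.
import math

freq = {"entity": 1000, "animal": 300, "dog": 40, "cat": 35}
total = freq["entity"]

def ic(c):
    return -math.log(freq[c] / total)

def lcs(c1, c2):
    return "animal"   # assumed toy taxonomy

def sim_lin(c1, c2):
    # 2 * IC(lcs) / (IC(c1) + IC(c2)); equivalent to the log P form above.
    return 2.0 * ic(lcs(c1, c2)) / (ic(c1) + ic(c2))

def dist_jc(c1, c2):
    # Distance form: smaller values indicate more similar concepts.
    return ic(c1) + ic(c2) - 2.0 * ic(lcs(c1, c2))

print(sim_lin("dog", "cat"))
print(dist_jc("dog", "cat"))
```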

PAGERANK

PageRank is a method applied to a graph structure to rank each node in the graph. It was developed by Google and is used to rank web URLs according to their importance; importance is determined by the PageRank method. As a search engine, Google indexes web URLs by keywords; if there are multiple pages for a keyword, those pages need to be ranked.

A simple way to describe the PageRank algorithm is to consider a user who surfs the web by randomly clicking on hyperlinks. The probability that the surfer will click on a specific hyperlink is given by the PageRank value of the page to which the hyperlink points. PageRank also assumes that the surfer will get bored after a finite number of clicks and will jump to some other page at random. The probability that the surfer will continue surfing by clicking on hyperlinks is given by a fixed-value parameter, usually referred to as the damping factor.

PAGERANK

PageRank generation steps (the slide shows a directed example graph with four nodes and the Markov chain transition matrix M derived from it):

δ_t = (1 - α) δ_0 + α M δ_(t-1)

δ_t is the PageRank vector after the t-th jump and α is the damping factor. Iteration is stopped when |δ_t - δ_(t-1)| < ε, where ε < 0.0001. A small sketch of this iteration is given below.
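A minimal sketch of the iteration; since the lecture's four-node example graph is not reproduced here, the edge list below is an assumed, illustrative graph.

```python
# Minimal sketch: PageRank power iteration delta_t = (1 - a) delta_0 + a * M * delta_(t-1)
# on an assumed four-node directed graph. edges[i] lists the nodes that node i links to.
edges = {0: [1, 2], 1: [2], 2: [0, 3], 3: [0]}
n = 4
alpha = 0.85      # damping factor
epsilon = 0.0001

# Column-stochastic transition matrix M: M[j][i] = 1/outdegree(i) if i links to j.
M = [[0.0] * n for _ in range(n)]
for i, targets in edges.items():
    for j in targets:
        M[j][i] = 1.0 / len(targets)

delta0 = [1.0 / n] * n   # uniform starting vector
delta = delta0[:]
while True:
    new = [(1 - alpha) * delta0[j] + alpha * sum(M[j][i] * delta[i] for i in range(n))
           for j in range(n)]
    converged = sum(abs(new[j] - delta[j]) for j in range(n)) < epsilon
    delta = new
    if converged:
        break

print(delta)  # PageRank scores of the four nodes
```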

PAGERANK BASED SIMILARITY

Since WordNet has a graph structure, PageRank can be applied to WordNet and similarity can be calculated over PageRank vectors. A PageRank vector is generated for each node in the graph, so each synset has its own PageRank vector. Nodes correspond to web pages and relations correspond to the links in those pages. If we apply the PageRank method to WordNet, the size of the PageRank vector will be about 117K entries for each synset.

The PageRank vector of a synset is also called the semantic signature of that synset. We obtain a matrix of about 117K x 117K. This process has been completed for WordNet 3.0, and the PageRank vectors of all synsets in WordNet 3.0 are available for download; in total it is around 1.2 GB of zipped data.

PAGERANK BASED SIMILARITY

Comparing the PageRank vectors is the final step in obtaining a similarity score. There are several vector comparison models, but cosine similarity is the most widely used one. It computes the cosine of the angle between two PageRank vectors. If A and B are the PageRank vectors of two synsets, cosine similarity is calculated as:

cos(A, B) = (A · B) / (|A| * |B|)

Using cosine similarity we can calculate the similarity of the given synsets from their PageRank vectors. There are other comparison methods for obtaining a similarity from PageRank vectors, for example the Jensen-Shannon divergence method. This measure is based on the Kullback-Leibler divergence, commonly referred to as KL divergence, which is computed for a pair of semantic signatures (probability distributions in general) S1 and S2 as:

D_KL(S1 || S2) = Σ_i S1(i) * log( S1(i) / S2(i) )
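A minimal sketch of comparing two semantic signatures with cosine similarity and with the divergence-based measures; the Jensen-Shannon function anticipates the definition given on the next slide, and the two short vectors are toy stand-ins for the 117K-dimensional WordNet signatures.

```python
# Minimal sketch: cosine similarity, KL divergence and Jensen-Shannon
# divergence between two toy semantic signatures (probability distributions).
import math

s1 = [0.4, 0.3, 0.2, 0.1]
s2 = [0.35, 0.35, 0.2, 0.1]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def kl(p, q):
    """Kullback-Leibler divergence D_KL(p || q); non-symmetric."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def js(p, q):
    """Jensen-Shannon divergence: symmetrized, smoothed KL divergence."""
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

print(cosine(s1, s2))
print(js(s1, s2))
```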

PAGERANK BASED SIMILARITY

KL divergence is non-symmetric, so the Jensen-Shannon (JS) divergence, a symmetrized and smoothed version of KL divergence, is used:

JS(S1, S2) = 1/2 * D_KL(S1 || M) + 1/2 * D_KL(S2 || M), where M = (S1 + S2) / 2

PROJECT

Prepare a WordNet-based similarity tool:
Download the WordNet data
Convert it into a graph database (we suggest Neo4j)
Choose a similarity measure
Prepare a user interface
A possible starting point is sketched below.
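One possible starting point for the project, assuming NLTK is installed: it extracts WordNet synsets and hypernym edges into CSV files that a graph database such as Neo4j could import, and computes a baseline Wu & Palmer similarity for later comparison. The database loading and the user interface are not shown.

```python
# Sketch of a starting point for the project, assuming NLTK and its WordNet
# corpus are installed (nltk.download('wordnet')).
import csv
from nltk.corpus import wordnet as wn

# 1. Export nodes (synsets) and edges (hypernym relations) to CSV files,
#    which can then be bulk-imported into a graph database.
with open('nodes.csv', 'w', newline='') as nf, open('edges.csv', 'w', newline='') as ef:
    nodes, edges = csv.writer(nf), csv.writer(ef)
    nodes.writerow(['id', 'definition'])
    edges.writerow(['source', 'target', 'type'])
    for synset in wn.all_synsets():
        nodes.writerow([synset.name(), synset.definition()])
        for hyper in synset.hypernyms():
            edges.writerow([synset.name(), hyper.name(), 'hypernym'])

# 2. A baseline similarity measure (Wu & Palmer) to compare the tool against.
print(wn.synset('dog.n.01').wup_similarity(wn.synset('cat.n.01')))
```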