Automatic Construction of WordNets by Using Machine Translation and Language Modeling

Martin Saveski, Igor Trajkovski
Information Society, Language Technologies, Ljubljana, 2010

Outline
- WordNet
- Motivation and Problem Statement
- Methodology
- Results
- Evaluation
- Conclusion and Future Work

WordNet
- Lexical database of the English language
- Groups words into sets of cognitive synonyms called synsets
- Each synset contains a gloss and links to other synsets
- The links define the place of the synset in the conceptual space
- Source of motivation for researchers from various fields

WordNet Example
- Hypernym: {motor vehicle, automotive vehicle}
- Synset: {car, auto, automobile, motorcar}
- Hyponym: {cab, taxi, hack, taxicab}
- Gloss: a motor vehicle with four wheels; usually propelled by an internal combustion engine

Motivation
- Plethora of WordNet applications: text classification, clustering, query expansion, etc.
- There is no publicly available WordNet for the Macedonian language
- Macedonian was not included in the EuroWordNet and BalkaNet projects
- Manual construction is an expensive and labor-intensive process
- Need to automate the process

Problem Statement
Assumptions:
- The conceptual space modeled by the Princeton WordNet (PWN) does not depend on the language in which it is expressed
- The majority of the concepts exist in both languages, English and Macedonian, but have different notations
[Diagram: English synset -> find translations which lexicalize the same concept -> Macedonian synset]
Goal: given a synset in English, find a set of words which lexicalize the same concept in Macedonian

Resources and Tools
Resources:
- Princeton implementation of WordNet (PWN): the backbone for the construction
- English-Macedonian Machine Readable Dictionary (MRD), developed in-house: 182,000 entries
Tools:
- Google Translation System (Google Translate)
- Google Search Engine

Methodology
1. Finding candidate words
2. Translating the synset gloss
3. Assigning scores to the candidate words
4. Selecting the candidate words

Finding Candidate Words
[Diagram: each word W1, W2, W3, ..., Wn of the PWN synset is looked up in the MRD]
T(W1) = CW11, CW12, ..., CW1s
T(W2) = CW21, CW22, ..., CW2k
T(W3) = CW31, CW32, ..., CW3j
...
T(Wn) = CWn1, CWn2, ..., CWnm
The translations are merged into one set of candidate words: CW1, CW2, CW3, ..., CWy
- T(W1) contains translations of all senses of the word W1
- Essentially, we have a Word Sense Disambiguation (WSD) problem

Finding Candidate Words (cont)
[Diagram: PWN synset words W1, W2, ..., Wn -> MRD -> candidate words CW1, CW2, ..., CWm]
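The candidate-gathering step can be sketched as a set union over dictionary lookups. This is only an illustration: the function name `candidate_words`, the dictionary layout, and the toy entries (including which English word maps to which Macedonian word) are assumptions, not the in-house MRD or the authors' code.

```python
from typing import Dict, Iterable, List, Set


def candidate_words(synset_words: Iterable[str],
                    mrd: Dict[str, List[str]]) -> Set[str]:
    """Union of all dictionary translations of every word in the synset.

    Every sense of every word contributes translations, so the result
    usually contains words that do not belong to the target concept.
    """
    candidates: Set[str] = set()
    for word in synset_words:
        # Words missing from the dictionary simply contribute nothing.
        candidates.update(mrd.get(word, []))
    return candidates


# Hypothetical toy dictionary; the real MRD has about 182,000 entries.
toy_mrd = {
    "name": ["име", "назив", "наслов", "углед"],
    "epithet": ["епитет", "навреда"],
}

result = candidate_words(["name", "epithet"], toy_mrd)
print(sorted(result))
```

Because translations of unrelated senses end up in the same set, the later scoring and selection steps are needed to filter it.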

Translating the synset gloss
- Statistical approach to WSD: using the word sense definitions and a large text corpus, we can determine the sense in which the word is used
- Word sense definition = synset gloss
- The gloss translation can be used to measure the correlation between the synset and the candidate words
- We use Google Translate (EN-MK) to translate the glosses of the PWN synsets

Translating the synset gloss (cont)
[Diagram: PWN synset gloss -> gloss translation (T-Gloss); PWN synset words W1, W2, ..., Wn -> MRD -> candidate words CW1, CW2, ..., CWm]

Assigning scores to the candidate words
- To apply the statistical WSD technique we lack a large, domain-independent text corpus
- Google Similarity Distance (GSD): calculates the semantic similarity between words/phrases based on the Google result counts
- We calculate the GSD between each candidate word and the gloss translation
- The GSD score is assigned to each candidate word

Assigning scores to the candidate words
[Diagram: the gloss translation (T-Gloss) and each candidate word CW1, CW2, ..., CWm are passed to the Google Similarity Distance (GSD), producing the similarity scores GSD(CW1, T-Gloss), GSD(CW2, T-Gloss), ..., GSD(CWm, T-Gloss)]

Selection of the candidate words
Selection by using two thresholds:
1. Score(CW) > T1: ensures that the candidate word has a minimum correlation with the gloss translation
2. Score(CW) > (T2 x MaxScore): discriminates between the words which capture the meaning of the synset and those that do not

Selection of the candidate words (cont)
[Diagram: the similarity scores GSD(CW1, T-Gloss), ..., GSD(CWm, T-Gloss) pass through the selection step, which keeps the candidate words CW1, CW2, ..., CWk as the resulting synset]
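The two-threshold rule can be written down in a few lines. The sketch below assumes the scores are similarity values where higher is better; the function name `select_candidates` and the placeholder words and scores are hypothetical, and the threshold values 0.2 and 0.62 are used only as sample settings.

```python
from typing import Dict, List


def select_candidates(scores: Dict[str, float],
                      t1: float, t2: float) -> List[str]:
    """Keep the words whose score passes both thresholds.

    t1 is an absolute floor; t2 is a fraction of the best score, so the
    second test adapts to how strong the top candidate is.
    """
    if not scores:
        return []
    max_score = max(scores.values())
    return [word for word, score in scores.items()
            if score > t1 and score > t2 * max_score]


# Hypothetical scores for four placeholder candidate words.
toy_scores = {"cw_a": 0.80, "cw_b": 0.55, "cw_c": 0.30, "cw_d": 0.10}

selected = select_candidates(toy_scores, t1=0.2, t2=0.62)
print(selected)  # the relative cut-off here is 0.62 * 0.80 = 0.496
```

With these toy values `cw_c` clears the absolute floor but not the relative cut-off, which is the case the second threshold is meant to catch.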

Example
PWN synset: {name, epithet}
Synset gloss: a defamatory or abusive word or phrase
Gloss translation (MK-GLOSS): со клевети или навредлив збор или фраза

Candidate words from the MRD, scored with the Google Similarity Distance (GSD):

Candidate word   English explanation            GSD score
навреда          offence, insult                0.78
епитет           epithet, in a positive sense   0.49
углед            reputation                     0.41
крсти            to name somebody               0.40
назив            name, title                    0.37
презиме          last name                      0.35
наслов           title                          0.35
глас             voice                          0.34
име              first name                     0.33

Selection: T1 = 0.2, T2 = 0.62
MWN synset: Навреда

Results of the Macedonian WordNet (MWN) construction

           Nouns    Verbs   Adjectives   Adverbs
Synsets    22838    7256    3125         57
Words      12480    2786    2203         84

Size of the MWN
NB: All words included in the MWN are lemmas

Evaluation of the MWN
- There is no manually constructed Macedonian WordNet (lack of a gold standard)
- Manual evaluation: labor-intensive and expensive
- Alternative method: evaluation by use of the MWN in practical applications
- MWN applications were our motivation and ultimate goal

MWN for Text Classification
- Easy to measure and compare the performance of the classification algorithms
- We extended the synset similarity measures to the word-to-word, i.e. text-to-text, level
  - Leacock and Chodorow (LCH) (node-based)
  - Wu and Palmer (WUP) (arc-based)
- Baseline: Cosine Similarity (classical approach)

MWN for Text Classification (cont)
- Classification algorithm: K Nearest Neighbors (KNN)
  - Allows the similarity measures to be compared unambiguously
- Corpus: A1 TV - News Archive (2005-2008)

Category     Articles   Tokens
Balkan       1,264      159,956
Economy      1,053      160,579
Macedonia    3,323      585,368
Sci/Tech     920        17,775
World        1,845      222,560
Sport        1,232      142,958
TOTAL        9,637      1,289,196

A1 Corpus, size and categories
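KNN takes the similarity measure as a plug-in, which is why swapping in different measures gives a clean comparison. The sketch below is a generic majority-vote KNN, not the authors' setup: the value of `k`, the Jaccard stand-in similarity, and the tiny token-set documents are all assumptions made for illustration; only the category names come from the corpus table.

```python
from collections import Counter
from typing import Callable, FrozenSet, List, Tuple

Doc = FrozenSet[str]


def jaccard(a: Doc, b: Doc) -> float:
    """Stand-in similarity; any text-to-text measure could be used here."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def knn_classify(query: Doc,
                 training: List[Tuple[Doc, str]],
                 similarity: Callable[[Doc, Doc], float],
                 k: int = 3) -> str:
    """Label the query by majority vote among its k most similar documents."""
    ranked = sorted(training,
                    key=lambda item: similarity(query, item[0]),
                    reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical training documents, reduced to token sets.
toy_training = [
    (frozenset({"goal", "match"}), "Sport"),
    (frozenset({"team", "match"}), "Sport"),
    (frozenset({"bank", "market"}), "Economy"),
    (frozenset({"market", "price"}), "Economy"),
]

label = knn_classify(frozenset({"match", "team", "win"}),
                     toy_training, jaccard, k=3)
print(label)
```

Only the `similarity` argument changes between runs, so any difference in the classification results can be attributed to the measure itself.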

MWN for Text Classification: Results
[Bar chart: F-measure per category (Balkan, Economy, Macedonia, Sci/Tech, World, Sport) and weighted average, for WUP Similarity, Cosine Similarity, and LCH Similarity. Values shown: 80.4%, 73.7%, 59.8%.]
Text Classification Results (F-measure, 10-fold cross-validation)

Future Work
- Investigation of the semantic relatedness between the candidate words
  - Word clustering prior to assigning to a synset
  - Assigning a group of candidate words to the synset
- Experiments using the MWN for other applications
  - Text clustering
  - Word sense disambiguation

Q&A
Thank you for your attention. Questions?

Google Similarity Distance
- Words/phrases acquire meaning from the way they are used in society and from their relative semantics to other words/phrases
Notation used in the formula:
- f(x), f(y), f(x, y): result counts of x, y, and (x, y)
- N: normalization factor
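The slide's own formula is not reproduced in this transcript. As a point of reference, the sketch below implements the Normalized Google Distance of Cilibrasi and Vitányi, which uses the same f(x), f(y), f(x, y) and N. Treating it as the measure used here is an assumption, and so is the handling of zero counts. It returns a distance (0 means the terms always co-occur), and the transcript does not say how this is mapped to the higher-is-better scores used for selection.

```python
import math


def normalized_google_distance(fx: float, fy: float,
                               fxy: float, n: float) -> float:
    """NGD = (max(log fx, log fy) - log fxy) / (log n - min(log fx, log fy)).

    fx, fy : result counts for x and for y alone
    fxy    : result count for x and y together
    n      : normalization factor (must exceed the smaller of fx and fy)
    """
    if fx <= 0 or fy <= 0 or fxy <= 0:
        # Assumption: terms that never occur (together) are infinitely far apart.
        return math.inf
    log_fx, log_fy, log_fxy = math.log(fx), math.log(fy), math.log(fxy)
    return (max(log_fx, log_fy) - log_fxy) / (math.log(n) - min(log_fx, log_fy))


# Hypothetical result counts.
always_together = normalized_google_distance(1000, 1000, 1000, 1_000_000)
rarely_together = normalized_google_distance(1000, 1000, 10, 1_000_000)
print(always_together, rarely_together)
```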

Synset similarity metrics
Leacock and Chodorow (LCH):

$$\mathrm{sim}_{LCH}(s_1, s_2) = -\log \frac{\mathrm{len}(s_1, s_2)}{2D}$$

- len: number of nodes from s1 to s2
- D: maximum depth of the hierarchy
- Measures in number of nodes
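A quick numeric check of the LCH measure, assuming the standard form with a leading minus sign, -log(len / (2D)). The path lengths and the maximum depth of 16 below are hypothetical values chosen for the arithmetic, not properties of any particular WordNet.

```python
import math


def lch_similarity(path_len_nodes: int, max_depth: int) -> float:
    """Leacock-Chodorow similarity: -log(len / (2 * D)).

    path_len_nodes : number of nodes on the shortest path from s1 to s2
                     (1 when both synsets are the same node)
    max_depth      : maximum depth D of the hierarchy
    """
    if path_len_nodes < 1 or max_depth < 1:
        raise ValueError("path length and depth must be positive")
    return -math.log(path_len_nodes / (2.0 * max_depth))


# Hypothetical hierarchy with D = 16.
same_synset = lch_similarity(1, 16)   # -log(1/32)
four_nodes = lch_similarity(4, 16)    # -log(4/32)
print(same_synset, four_nodes)
```

Shorter paths give larger values, and the score depends only on the path length once D is fixed.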

Synset similarity metrics (cont)
Wu and Palmer (WUP):

$$\mathrm{sim}_{WUP}(s_1, s_2) = \frac{2 \cdot \mathrm{depth}(\mathrm{LCS})}{\mathrm{depth}(s_1) + \mathrm{depth}(s_2)}$$

- LCS: most specific synset that is an ancestor of both synsets
- Measures in number of links
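The WUP measure can be traced by hand on a small tree. The taxonomy below is a hypothetical single-parent hierarchy loosely modeled on the car example, and depth is counted in nodes with the root at depth 1. Both choices are assumptions; other depth conventions shift the values slightly.

```python
from typing import Dict, List, Optional

# Hypothetical toy taxonomy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "vehicle": "entity",
    "motor_vehicle": "vehicle",
    "car": "motor_vehicle",
    "cab": "car",
    "truck": "motor_vehicle",
}


def path_to_root(node: str) -> List[str]:
    """Nodes from `node` up to the root, inclusive."""
    path: List[str] = []
    current: Optional[str] = node
    while current is not None:
        path.append(current)
        current = PARENT[current]
    return path


def depth(node: str) -> int:
    """Depth in nodes; the root has depth 1."""
    return len(path_to_root(node))


def lcs(a: str, b: str) -> str:
    """Most specific common ancestor (a node counts as its own ancestor)."""
    ancestors_of_a = set(path_to_root(a))
    for node in path_to_root(b):  # walks upward, so the first hit is deepest
        if node in ancestors_of_a:
            return node
    raise ValueError("nodes share no ancestor")


def wup_similarity(a: str, b: str) -> float:
    return 2.0 * depth(lcs(a, b)) / (depth(a) + depth(b))


print(lcs("cab", "truck"), wup_similarity("cab", "truck"))
```

Here `cab` and `truck` meet at `motor_vehicle` (depth 3), so the value is 2 * 3 / (5 + 4).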

Semantic Word Similarity
The similarity of W1 and W2 is defined as the maximum similarity (minimum distance) between:
- the set of synsets containing W1
- the set of synsets containing W2
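Lifting a synset-level measure to words amounts to taking a maximum over all synset pairs. In this sketch the sense inventory, the synset identifiers, and the lookup-table similarity are hypothetical placeholders, and returning 0.0 for words with no synsets is an assumption.

```python
from itertools import product
from typing import Callable, Dict, List


def word_similarity(w1: str, w2: str,
                    synsets_of: Dict[str, List[str]],
                    synset_sim: Callable[[str, str], float]) -> float:
    """Maximum synset-to-synset similarity over all sense pairs of w1 and w2."""
    pairs = product(synsets_of.get(w1, []), synsets_of.get(w2, []))
    # default=0.0: assumption for words that have no synsets at all.
    return max((synset_sim(s1, s2) for s1, s2 in pairs), default=0.0)


# Hypothetical sense inventory and synset similarity table.
toy_synsets = {
    "bank": ["bank.river", "bank.finance"],
    "money": ["money.currency"],
}
toy_table = {
    ("bank.river", "money.currency"): 0.2,
    ("bank.finance", "money.currency"): 0.8,
}


def toy_synset_sim(s1: str, s2: str) -> float:
    return toy_table.get((s1, s2), 0.0)


best = word_similarity("bank", "money", toy_synsets, toy_synset_sim)
print(best)
```

Taking the maximum means the closest pair of senses decides the score, so no explicit disambiguation of either word is needed.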

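Semantic Text Similarity
The similarity between texts T1 and T2 is:

$$\mathrm{sim}(T_1, T_2) = \frac{1}{2}\left( \frac{\sum_{w \in T_1} \mathrm{maxSim}(w, T_2)\,\mathrm{idf}(w)}{\sum_{w \in T_1} \mathrm{idf}(w)} + \frac{\sum_{w \in T_2} \mathrm{maxSim}(w, T_1)\,\mathrm{idf}(w)}{\sum_{w \in T_2} \mathrm{idf}(w)} \right)$$

- idf: inverse document frequency (measures word specificity)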
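The text-level measure averages two directed, idf-weighted scores. The sketch below reads maxSim(w, T) as the highest word-to-word similarity between w and any word of T, which is an assumption about the notation. The exact-match word similarity, the idf values, and the default idf of 1.0 for unseen words are placeholders for illustration.

```python
from typing import Callable, Dict, List


def text_similarity(t1: List[str], t2: List[str],
                    word_sim: Callable[[str, str], float],
                    idf: Dict[str, float]) -> float:
    """Average of the two directed, idf-weighted maxSim scores."""

    def directed(src: List[str], dst: List[str]) -> float:
        total_idf = sum(idf.get(w, 1.0) for w in src)
        if total_idf == 0 or not dst:
            return 0.0
        weighted = sum(max(word_sim(w, v) for v in dst) * idf.get(w, 1.0)
                       for w in src)
        return weighted / total_idf

    return 0.5 * (directed(t1, t2) + directed(t2, t1))


def exact_match(a: str, b: str) -> float:
    """Placeholder word similarity: 1.0 for identical words, else 0.0."""
    return 1.0 if a == b else 0.0


# Hypothetical idf weights; "cat" is treated as the more specific word.
toy_idf = {"cat": 2.0, "sat": 1.0, "ran": 1.0}

score = text_similarity(["cat", "sat"], ["cat", "ran"], exact_match, toy_idf)
print(score)  # each direction gives 2 / 3 with these weights
```

Replacing `exact_match` with a WordNet-based word similarity lets related but non-identical words contribute to the score as well.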