Automatically Enriching a Thesaurus with Information from Dictionaries

Similar documents
Towards the Automatic Creation of a Wordnet from a Term-based Lexical Network

Onto.PT: Towards the Automatic Construction of a Lexical Ontology for Portuguese

On the Automatic Enrichment of a Portuguese Wordnet with Dictionary Definitions

Automatic Discovery of Fuzzy Synsets from Dictionary Definitions

Software Design using Analogy and WordNet

Building and Exploring Semantic Equivalences Resources

Enriching a Portuguese WordNet using Synonyms from a Monolingual Dictionary

WordNet-based User Profiles for Semantic Personalization

Package wordnet. February 15, 2013

Improving Retrieval Experience Exploiting Semantic Representation of Documents

Package wordnet. November 26, 2017

Serbian Wordnet for biomedical sciences

Random Walks for Knowledge-Based Word Sense Disambiguation. Qiuyu Li

Using the Multilingual Central Repository for Graph-Based Word Sense Disambiguation

Esfinge (Sphinx) at CLEF 2008: Experimenting with answer retrieval patterns. Can they help?

CHAPTER 3 INFORMATION RETRIEVAL BASED ON QUERY EXPANSION AND LATENT SEMANTIC INDEXING

QUERY EXPANSION USING WORDNET WITH A LOGICAL MODEL OF INFORMATION RETRIEVAL

Putting ontologies to work in NLP

Assigning Polarity Automatically to the Synsets of a Wordnet-like Resource

A Combined Method of Text Summarization via Sentence Extraction

NATURAL LANGUAGE PROCESSING

Information Retrieval and Web Search

Lightweight Transformation of Tabular Open Data to RDF

WEIGHTING QUERY TERMS USING WORDNET ONTOLOGY

Open Source Dutch WordNet

A graph-based method to improve WordNet Domains

Text Similarity. Semantic Similarity: Synonymy and other Semantic Relations

Web Information Retrieval using WordNet

Enhancing Automatic Wordnet Construction Using Word Embeddings

Case-Based Adaptation for UML Diagram Reuse

Automatic Wordnet Mapping: from CoreNet to Princeton WordNet

International Journal of Advance Engineering and Research Development SENSE BASED INDEXING OF HIDDEN WEB USING ONTOLOGY

Introduction to Tools for IndoWordNet and Word Sense Disambiguation

Mapping WordNet to the SUMO Ontology

Linking Lexicons and Ontologies: Mapping WordNet to the Suggested Upper Merged Ontology

Semantic Enrichment of Places for the Portuguese Language

APPROACHES TO IMPLEMENT SEMANTIC SEARCH. Johannes Peter Product Owner / Architect for Search

COMP90042 LECTURE 3 LEXICAL SEMANTICS COPYRIGHT 2018, THE UNIVERSITY OF MELBOURNE

Simple Method for Ontology Automatic Extraction from Documents

NLP - Based Expert System for Database Design and Development

An Unsupervised Word Sense Disambiguation System for Under-Resourced Languages

Evaluating a Conceptual Indexing Method by Utilizing WordNet

Web People Search with Domain Ranking

NUS-I2R: Learning a Combined System for Entity Linking

Whither WordNet? Christiane Fellbaum George A. Miller Princeton University

Semantically Driven Snippet Selection for Supporting Focused Web Searches

Multilingual Vocabularies in Open Access: Semantic Network WordNet

Tag Semantics for the Retrieval of XML Documents

Service Composition based on Natural Language Requests

LexiRes: A Tool for Exploring and Restructuring EuroWordNet for Information Retrieval

WordNets and TEI-LEX. John P. McCrae, Thierry Declerck

Real-Time Open-Domain QA on the Portuguese Web

SAACO: Semantic Analysis based Ant Colony Optimization Algorithm for Efficient Text Document Clustering

Fuzzy V-Measure An Evaluation Method for Cluster Analyses of Ambiguous Data

INTERNATIONAL JOURNAL OF COMPUTER ENGINEERING & TECHNOLOGY (IJCET) CONTEXT SENSITIVE TEXT SUMMARIZATION USING HIERARCHICAL CLUSTERING ALGORITHM

Punjabi WordNet Relations and Categorization of Synsets

Let s get parsing! Each component processes the Doc object, then passes it on. doc.is_parsed attribute checks whether a Doc object has been parsed

A Semantic Role Repository Linking FrameNet and WordNet

Shrey Patel B.E. Computer Engineering, Gujarat Technological University, Ahmedabad, Gujarat, India

Enriching Ontology Concepts Based on Texts from WWW and Corpus

Integrating Spanish Linguistic Resources in a Web Site Assistant

A Linguistic Approach for Semantic Web Service Discovery

Ontology Based Search Engine

CHAPTER 5 SEARCH ENGINE USING SEMANTIC CONCEPTS

Fuzzy Synsets, and Lexicon-Based Sentiment Analysis

A Computational Study of Conflict Graphs and Aggressive Cut Separation in Integer Programming

Boolean Queries. Keywords combined with Boolean operators:

Identifying and Ranking Possible Semantic and Common Usage Categories of Search Engine Queries

A hybrid method to categorize HTML documents

A faceted lightweight ontology for Earthquake Engineering Research Projects and Experiments

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

Supervised Learning for Relationship Extraction From Textual Documents

Linking Thesauri and Glossaries Case Study 0: linking a fake resource Roberto Navigli

SPARSE COMPONENT ANALYSIS FOR BLIND SOURCE SEPARATION WITH LESS SENSORS THAN SOURCES. Yuanqing Li, Andrzej Cichocki and Shun-ichi Amari

A Multiclassifier based Approach for Word Sense Disambiguation using Singular Value Decomposition

Lecture 7: Relevance Feedback and Query Expansion

INTERCONNECTING AND MANAGING MULTILINGUAL LEXICAL LINKED DATA. Ernesto William De Luca

Eurown: an EuroWordNet module for Python

Making Sense Out of the Web

Identifying Poorly-Defined Concepts in WordNet with Graph Metrics

Sense Match Making Approach for Semantic Web Service Discovery 1 2 G.Bharath, P.Deivanai 1 2 M.Tech Student, Assistant Professor

MEASUREMENT OF SEMANTIC SIMILARITY BETWEEN WORDS: A SURVEY

A Comprehensive Analysis of using Semantic Information in Text Categorization

An ontology-based approach for semantics ranking of the web search engines results

Hidden Markov Models. Natural Language Processing: Jordan Boyd-Graber. University of Colorado Boulder LECTURE 20. Adapted from material by Ray Mooney

Evaluation of Synset Assignment to Bi-lingual Dictionary

Worth its Weight in Gold or Yet Another Resource

CS 6320 Natural Language Processing

Content-based Recommender Systems

Automatic Word Sense Disambiguation Using Wikipedia

JWOLF: Java Free French Wordnet Library

Ontology-Based Application Server to the Execution of Imperative Natural Language Requests

Beyond the Synset: Synonyms in Collaboratively Constructed Semantic Resources Michael Matuschek and Iryna Gurevych

The JUMP project: domain ontologies and linguistic work

A Visualization Tool to Improve the Performance of a Classifier Based on Hidden Markov Models

Personalization in Digital Library: An Intelligent Service based on Semantic User Profiles

ARKTiS - A Fast Tag Recommender System Based On Heuristics

SEMANTIC ANALYSIS BASED TEXT CLUSTERING BY THE FUSION OF BISECTING K-MEANS AND UPGMA ALGORITHM

CS580 - Query Processing. Build Support Word and Dispute Word Dictionary

Multi-Modal Word Synset Induction. Jesse Thomason and Raymond Mooney University of Texas at Austin

Transcription:

Automatically Enriching a Thesaurus with Information from Dictionaries Hugo Gonçalo Oliveira 1 Paulo Gomes {hroliv,pgomes}@dei.uc.pt Cognitive & Media Systems Group CISUC, Universidade de Coimbra October 11, 2011 1 supported by FCT scholarship grant SFRH/BD/44955/2008 Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 1 / 18

Index 1 Introduction 2 Proposed approach 3 Enriching TeP with synonymy in PAPEL 4 Evaluation 5 Concluding remarks Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 2 / 18

Lexical knowledge bases Introduction Thesaurus, lexical networks, lexical ontologies,... Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 3 / 18

Introduction Lexical knowledge bases Thesaurus, lexical networks, lexical ontologies,... Structured on words and their meanings Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 3 / 18

Introduction Lexical knowledge bases Thesaurus, lexical networks, lexical ontologies,... Structured on words and their meanings Try to cover the whole language Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 3 / 18

Introduction Lexical knowledge bases Thesaurus, lexical networks, lexical ontologies,... Structured on words and their meanings Try to cover the whole language No specific domain Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 3 / 18

Introduction Lexical knowledge bases Thesaurus, lexical networks, lexical ontologies,... Structured on words and their meanings Try to cover the whole language No specific domain Essential for developing NLP tools for a language Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 3 / 18

Lexical knowledge bases Introduction Thesaurus, lexical networks, lexical ontologies,... Structured on words and their meanings Try to cover the whole language No specific domain Essential for developing NLP tools for a language Useful for NLP tasks (eg. word-sense disambiguation, question-answering, determining similarities,...) Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 3 / 18

Lexical knowledge bases Introduction Thesaurus, lexical networks, lexical ontologies,... Structured on words and their meanings Try to cover the whole language No specific domain Essential for developing NLP tools for a language Useful for NLP tasks (eg. word-sense disambiguation, question-answering, determining similarities,...) See Princeton WordNet [Fellbaum, 1998] Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 3 / 18

Introduction Free lexical knowledge bases for Portuguese Public domain thesaurus: TeP [Maziero et al., 2008] OpenThesaurus.PT 2 2 http://openthesaurus.caixamagica.pt/ 3 http://pt.wiktionary.org/ Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 4 / 18

Introduction Free lexical knowledge bases for Portuguese Public domain thesaurus: TeP [Maziero et al., 2008] OpenThesaurus.PT 2 Collaborative dictionary Portuguese Wiktionary 3 2 http://openthesaurus.caixamagica.pt/ 3 http://pt.wiktionary.org/ Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 4 / 18

Introduction Free lexical knowledge bases for Portuguese Public domain thesaurus: TeP [Maziero et al., 2008] OpenThesaurus.PT 2 Collaborative dictionary Portuguese Wiktionary 3 Public domain lexical network PAPEL [Gonçalo Oliveira et al., 2010] 2 http://openthesaurus.caixamagica.pt/ 3 http://pt.wiktionary.org/ Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 4 / 18

Introduction Free lexical knowledge bases for Portuguese Public domain thesaurus: TeP [Maziero et al., 2008] OpenThesaurus.PT 2 Collaborative dictionary Portuguese Wiktionary 3 Public domain lexical network PAPEL [Gonçalo Oliveira et al., 2010] Lexical ontology [coming soon] Onto.PT [Gonçalo Oliveira and Gomes, 2010] 2 http://openthesaurus.caixamagica.pt/ 3 http://pt.wiktionary.org/ Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 4 / 18

Introduction Free lexical knowledge bases for Portuguese Public domain thesaurus: TeP [Maziero et al., 2008] OpenThesaurus.PT 2 Collaborative dictionary Portuguese Wiktionary 3 Public domain lexical network PAPEL [Gonçalo Oliveira et al., 2010] Lexical ontology [coming soon] Onto.PT [Gonçalo Oliveira and Gomes, 2010] More complementary than overlapping 2 http://openthesaurus.caixamagica.pt/ 3 http://pt.wiktionary.org/ Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 4 / 18

Introduction Free lexical knowledge bases for Portuguese Public domain thesaurus: TeP [Maziero et al., 2008] OpenThesaurus.PT 2 Collaborative dictionary Portuguese Wiktionary 3 Public domain lexical network PAPEL [Gonçalo Oliveira et al., 2010] Lexical ontology [coming soon] Onto.PT [Gonçalo Oliveira and Gomes, 2010] More complementary than overlapping Fruitful to merge some of them in a unique broader resource 2 http://openthesaurus.caixamagica.pt/ 3 http://pt.wiktionary.org/ Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 4 / 18

This work Introduction Integrate synonymy information from dictionaries in a thesaurus Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 5 / 18

This work Introduction Integrate synonymy information from dictionaries in a thesaurus 1 Extraction of synpairs from dictionaries Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 5 / 18

This work Introduction Integrate synonymy information from dictionaries in a thesaurus 1 Extraction of synpairs from dictionaries 2 Assigning synpairs to synsets Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 5 / 18

This work Introduction Integrate synonymy information from dictionaries in a thesaurus 1 Extraction of synpairs from dictionaries 2 Assigning synpairs to synsets 3 Clustering remaining pairs Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 5 / 18

Introduction This work Integrate synonymy information from dictionaries in a thesaurus 1 Extraction of synpairs from dictionaries 2 Assigning synpairs to synsets 3 Clustering remaining pairs Apply the procedure in the enrichment of TeP with PAPEL Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 5 / 18

Proposed approach Extracting synpairs from dictionaries mente, n: cérebro, cabeça, intelecto [mind, n: brain, head, intellect] máquina, n: o mesmo que computador [machine, n: the same as computer] Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 6 / 18

Proposed approach Extracting synpairs from dictionaries mente, n: cérebro, cabeça, intelecto [mind, n: brain, head, intellect] (cérebro, mente) (cabeça, mente) (intelecto, mente) [(brain, mind) (head, mind) (intellect, mind)] máquina, n: o mesmo que computador [machine, n: the same as computer] Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 6 / 18

Proposed approach Extracting synpairs from dictionaries mente, n: cérebro, cabeça, intelecto [mind, n: brain, head, intellect] (cérebro, mente) (cabeça, mente) (intelecto, mente) [(brain, mind) (head, mind) (intellect, mind)] máquina, n: o mesmo que computador [machine, n: the same as computer] (computador, máquina) [(computer, machine)] Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 6 / 18

Proposed approach Assigning synpairs to synsets p = (w x, w y ) + S a = (w 1, w 2,..., w n ) S a = (w 1, w 2,..., w n, w x, w y ) Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 7 / 18

Proposed approach Assigning synpairs to synsets p = (w x, w y ) + S a = (w 1, w 2,..., w n ) S a = (w 1, w 2,..., w n, w x, w y ) Synonymy graph G All the extracted synpairs Nodes represent words (eg. w x, w y ) p = (w x, w y ) establishes an edge between w x and w y Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 7 / 18

Proposed approach Assigning synpairs to synsets For each synpair p = (w x, w y ) 1 If S i T : w x S i w y S i, nothing is done. a Any measure for computing the similarity of two vectors can be used Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 8 / 18

Proposed approach Assigning synpairs to synsets For each synpair p = (w x, w y ) 1 If S i T : w x S i w y S i, nothing is done. 2 Select all synsets C j C : C T, C = {C 1, C 2,..., C n } (C j C) : w x C j w y C j. a Any measure for computing the similarity of two vectors can be used Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 8 / 18

Proposed approach Assigning synpairs to synsets For each synpair p = (w x, w y ) 1 If S i T : w x S i w y S i, nothing is done. 2 Select all synsets C j C : C T, C = {C 1, C 2,..., C n } (C j C) : w x C j w y C j. 3 If C = 1, p + C 1. a Any measure for computing the similarity of two vectors can be used Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 8 / 18

Proposed approach Assigning synpairs to synsets For each synpair p = (w x, w y ) 1 If S i T : w x S i w y S i, nothing is done. 2 Select all synsets C j C : C T, C = {C 1, C 2,..., C n } (C j C) : w x C j w y C j. 3 If C = 1, p + C 1. 4 Compute the adjacency vector [p] = [w x ] + [w y ]. The adjacency vector of a word is a column of the matrix M, [w j ] = [M j ]; a Any measure for computing the similarity of two vectors can be used Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 8 / 18

Proposed approach Assigning synpairs to synsets For each synpair p = (w x, w y ) 1 If S i T : w x S i w y S i, nothing is done. 2 Select all synsets C j C : C T, C = {C 1, C 2,..., C n } (C j C) : w x C j w y C j. 3 If C = 1, p + C 1. 4 Compute the adjacency vector [p] = [w x ] + [w y ]. The adjacency vector of a word is a column of the matrix M, [w j ] = [M j ]; 5 Compute the adjacency vector of each C j C [C j ] = C j k=1 [w k] : w k C j ; a Any measure for computing the similarity of two vectors can be used Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 8 / 18

Proposed approach Assigning synpairs to synsets For each synpair p = (w x, w y ) 1 If S i T : w x S i w y S i, nothing is done. 2 Select all synsets C j C : C T, C = {C 1, C 2,..., C n } (C j C) : w x C j w y C j. 3 If C = 1, p + C 1. 4 Compute the adjacency vector [p] = [w x ] + [w y ]. The adjacency vector of a word is a column of the matrix M, [w j ] = [M j ]; 5 Compute the adjacency vector of each C j C [C j ] = C j k=1 [w k] : w k C j ; 6 Select the most similar synset C best : sim(p, C best ) a = max(sim(p, C j )); a Any measure for computing the similarity of two vectors can be used Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 8 / 18

Proposed approach Assigning synpairs to synsets For each synpair p = (w x, w y ) 1 If S i T : w x S i w y S i, nothing is done. 2 Select all synsets C j C : C T, C = {C 1, C 2,..., C n } (C j C) : w x C j w y C j. 3 If C = 1, p + C 1. 4 Compute the adjacency vector [p] = [w x ] + [w y ]. The adjacency vector of a word is a column of the matrix M, [w j ] = [M j ]; 5 Compute the adjacency vector of each C j C [C j ] = C j k=1 [w k] : w k C j ; 6 Select the most similar synset C best : sim(p, C best ) a = max(sim(p, C j )); 7 p + C best. a Any measure for computing the similarity of two vectors can be used Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 8 / 18

Proposed approach Clustering remaining pairs G is established by the remaining pairs 1 Sparse matrix M ( N N ) Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 9 / 18

Proposed approach Clustering remaining pairs G is established by the remaining pairs 1 Sparse matrix M ( N N ) 2 M ij = sim([w i], [w j ]) Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 9 / 18

Proposed approach Clustering remaining pairs G is established by the remaining pairs 1 Sparse matrix M ( N N ) 2 M ij = sim([w i], [w j ]) 3 Normalise the columns of M, so that M j k=1 M jk = 1 Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 9 / 18

Proposed approach Clustering remaining pairs G is established by the remaining pairs 1 Sparse matrix M ( N N ) 2 M ij = sim([w i], [w j ]) 3 Normalise the columns of M, so that M j k=1 M jk = 1 4 Extract cluster S i from each row M i, with the words w j where M ij > θ Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 9 / 18

Proposed approach Clustering remaining pairs G is established by the remaining pairs 1 Sparse matrix M ( N N ) 2 M ij = sim([w i], [w j ]) 3 Normalise the columns of M, so that M j k=1 M jk = 1 4 Extract cluster S i from each row M i, with the words w j where M ij > θ 5 For each S i : S i S j = S j and S i S j = S i, S i is discarded. Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 9 / 18

Enriching TeP with synonymy in PAPEL Coverage of the synpairs by TeP POS Synpairs In TeP C 4 = 0 C = 1 C > 1 C Nouns 37,452 27.38% 14.98% 12.01% 45.63% 3.86 Verbs 21,465 43.01% 1.34% 4.04% 51.66% 6.64 Adjectives 19,073 37.60% 5.58% 8.22% 48.60% 4.26 4 Number of candidate synsets Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 10 / 18

Enriching TeP with synonymy in PAPEL Coverage of the synpairs by TeP POS Synpairs In TeP C 4 = 0 C = 1 C > 1 C Nouns 37,452 27.38% 14.98% 12.01% 45.63% 3.86 Verbs 21,465 43.01% 1.34% 4.04% 51.66% 6.64 Adjectives 19,073 37.60% 5.58% 8.22% 48.60% 4.26 Experimentation was performed using the cosine similarity 4 Number of candidate synsets Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 10 / 18

Results words Enriching TeP with synonymy in PAPEL Thesaurus TeP 2.0 After assignments Clusters Final thesaurus POS Words Total Ambiguous Avg(senses) Most ambig. Nouns 17,158 5,805 1.71 20 Verbs 10,827 4,905 2.08 41 Adjectives 14,586 3,735 1.46 19 Nouns 23,775 10,418 2.09 37 Verbs 12,818 7,094 2.64 42 Adjectives 17,158 6,294 1.83 22 Nouns 8,546 701 1.15 8 Verbs 502 8 1.02 3 Adjectives 1,858 39 1.03 4 Nouns 30,369 12,045 1.96 38 Verbs 13,090 7,221 2.62 42 Adjectives 18,525 6,550 1.80 23 Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 11 / 18

Results synsets Enriching TeP with synonymy in PAPEL Thesaurus TeP 2.0 After assignments Clusters Final thesaurus POS Synsets Total Avg(size) size = 2 size > 25 max(size) Nouns 8,254 3.56 3,079 0 21 Verbs 3,978 5.67 939 48 53 Adjectives 6,066 3.50 3,033 19 43 Nouns 8,254 6.01 1,930 179 150 Verbs 3,978 8.50 702 217 148 Adjectives 6,066 5.17 2,369 120 110 Nouns 3,524 2.78 2,247 0 13 Verbs 220 2.34 174 0 6 Adjectives 820 2.33 656 0 10 Nouns 11,778 5.05 4,177 179 150 Verbs 4,198 8.18 876 217 148 Adjectives 6,886 4.84 3,025 120 110 Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 12 / 18

Evaluation Assignments evaluation Manual evaluation of sample assignments Two judges for each assignment Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 13 / 18

Evaluation Assignments evaluation Manual evaluation of sample assignments Two judges for each assignment POS Sample Correct Incorrect Agreement Nouns 100 assigns. 2 153 (76.50%) 47 (23.50%) 77.00% Verbs 100 assigns. 2 142 (71.00%) 58 (29.00%) 74.00% Adjectives 100 assigns. 2 151 (75.50%) 49 (24.50%) 75.00% Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 13 / 18

Assignments evaluation Evaluation Manual evaluation of sample assignments Two judges for each assignment POS Synpair Synset Judge 1 Judge 2 Nouns Verbs Adjectives (escrutínio,votação) votação;voto;sufrágio 1 1 (decisão,desempate) resolução;objetivação;tenção;intenção 0 1 (plano,gizamento) planície;chã;chanura;plaino;plano;planura 0 0 (venerar,homenagear) venerar;cultuar;adorar;idolatrar 1 1 (atacar,combater) atacar;inciar 0 1 (obter,rapar) depilar;despelar;pelar;raspar;rapar;rascar 0 0 (grandioso,épico) admirável;fabuloso;grandioso 1 1 (delicado,requintado) difícil;complicado;delicado 0 1 (falido,queimado) queimado;incendiado 0 0 Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 13 / 18

Evaluation Clustering Manual evaluation of clusters Two judges for each cluster Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 14 / 18

Evaluation Clustering Manual evaluation of clusters Two judges for each cluster Cluster is correct if, in some context, all its words might have the same meaning Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 14 / 18

Evaluation Clustering Manual evaluation of clusters Two judges for each cluster Cluster is correct if, in some context, all its words might have the same meaning Table: Evaluation of clustering POS Sample Correct Incorrect Agreement Nouns 105 2 179 (85.24%) 31 (14.76%) 91.43% Verbs 105 2 193 (91.90%) 17 (8.10%) 87.62% Adjectives 105 2 189 (90.00%) 21 (10.00%) 85.71% Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 14 / 18

Clustering Evaluation Manual evaluation of clusters Two judges for each cluster Cluster is correct if, in some context, all its words might have the same meaning Figure: Examples of connected subgraphs and resulting clusters. Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 14 / 18

Concluding remarks Update: computing similarity Sum the adjacencies One vector per synset: [C j ] = C j k=1 [w k] : w k C j ; sim(p, C j ) = sim([p], [C j ]) Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 15 / 18

Concluding remarks Update: computing similarity Sum the adjacencies One vector per synset: [C j ] = C j k=1 [w k] : w k C j ; sim(p, C j ) = sim([p], [C j ]) Average similarity of the pair with each synset element One vector per synset element: [Cj ] = ([w 1 ],..., [w n]), n = C j sim(p, C j ) = C j cos([p],[m wk ]) k=1 C j, w k C j Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 15 / 18

Concluding remarks Update: computing similarity Sum the adjacencies One vector per synset: [C j ] = C j k=1 [w k] : w k C j ; sim(p, C j ) = sim([p], [C j ]) Average similarity of the pair with each synset element One vector per synset element: [Cj ] = ([w 1 ],..., [w n]), n = C j sim(p, C j ) = C j cos([p],[m wk ]) k=1 C j, w k C j Gold resource of 220 synpairs and possible assignments Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 15 / 18

Concluding remarks Update: computing similarity Sum the adjacencies One vector per synset: [C j ] = C j k=1 [w k] : w k C j ; sim(p, C j ) = sim([p], [C j ]) Average similarity of the pair with each synset element One vector per synset element: [Cj ] = ([w 1 ],..., [w n]), n = C j sim(p, C j ) = C j cos([p],[m wk ]) k=1 C j, w k C j Gold resource of 220 synpairs and possible assignments Variable cut point θ on similarity Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 15 / 18

Concluding remarks Update: computing similarity Sum the adjacencies One vector per synset: [C j ] = C j k=1 [w k] : w k C j ; sim(p, C j ) = sim([p], [C j ]) Average similarity of the pair with each synset element One vector per synset element: [Cj ] = ([w 1 ],..., [w n]), n = C j sim(p, C j ) = C j cos([p],[m wk ]) k=1 C j, w k C j Gold resource of 220 synpairs and possible assignments Variable cut point θ on similarity Possible to assign the same synpair to 0 n C synsets Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 15 / 18

Concluding remarks Update: computing similarity Sum the adjacencies One vector per synset: [C j ] = C j k=1 [w k] : w k C j ; sim(p, C j ) = sim([p], [C j ]) Average similarity of the pair with each synset element One vector per synset element: [Cj ] = ([w 1 ],..., [w n]), n = C j sim(p, C j ) = C j cos([p],[m wk ]) k=1 C j, w k C j Gold resource of 220 synpairs and possible assignments Variable cut point θ on similarity Possible to assign the same synpair to 0 n C synsets Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 15 / 18

Final remarks Concluding remarks Flexible method for enriching thesaurus with synonymy in dictionaries Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 16 / 18

Concluding remarks Final remarks Flexible method for enriching thesaurus with synonymy in dictionaries Applied to the enrichment of a Portuguese thesaurus Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 16 / 18

Final remarks Concluding remarks Flexible method for enriching thesaurus with synonymy in dictionaries Applied to the enrichment of a Portuguese thesaurus This work was made in the scope of Onto.PT Automatic creation of a lexical ontology for Portuguese Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 16 / 18

Final remarks Concluding remarks Flexible method for enriching thesaurus with synonymy in dictionaries Applied to the enrichment of a Portuguese thesaurus This work was made in the scope of Onto.PT Automatic creation of a lexical ontology for Portuguese Extraction + integration of lexical information from textual sources Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 16 / 18

Final remarks Concluding remarks Flexible method for enriching thesaurus with synonymy in dictionaries Applied to the enrichment of a Portuguese thesaurus This work was made in the scope of Onto.PT Automatic creation of a lexical ontology for Portuguese Extraction + integration of lexical information from textual sources Soon freely available! Check http://ontopt.dei.uc.pt Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 16 / 18

Concluding remarks Thank you! Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 17 / 18

References References I [Fellbaum, 1998] Fellbaum, C., editor (1998). WordNet: An Electronic Lexical Database (Language, Speech, and Communication). The MIT Press. [Gonçalo Oliveira and Gomes, 2010] Gonçalo Oliveira, H. and Gomes, P. (2010). Onto.PT: Automatic Construction of a Lexical Ontology for Portuguese. In Proc. 5th European Starting AI Researcher Symposium (STAIRS 2010). IOS Press. [Gonçalo Oliveira et al., 2010] Gonçalo Oliveira, H., Santos, D., and Gomes, P. (2010). Extracção de relações semânticas entre palavras a partir de um dicionário: o PAPEL e sua avaliação. Linguamática, 2(1):77 93. [Maziero et al., 2008] Maziero, E. G., Pardo, T. A. S., Felippo, A. D., and Dias-da-Silva, B. C. (2008). A Base de Dados Lexical e a Interface Web do TeP 2.0 - Thesaurus Eletrônico para o Português do Brasil. In VI Workshop em Tecnologia da Informação e da Linguagem Humana (TIL), pages 390 392. Gonçalo Oliveira & Gomes (CISUC) KDBI, EPIA 2011 October 11, 2011 18 / 18